
This is cool, but it doesn't look like it's doing even trivial optimizations...

    (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
"[\d\d\.\d\.]" is equivalent to "[\d\.]", right?

It's also got a useless non-capturing group and lookahead.
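The character-class claim is easy to check mechanically: duplicate members inside a class are redundant, so the two classes accept exactly the same single characters. A quick sanity check in Python (spot-checking printable ASCII, not a proof):

```python
import re

# Duplicates inside a character class are redundant, so these two
# classes should accept exactly the same single characters.
verbose = re.compile(r"[\d\d\.\d\.]")
simple = re.compile(r"[\d\.]")

for code in range(0x20, 0x7F):  # printable ASCII
    ch = chr(code)
    assert bool(verbose.fullmatch(ch)) == bool(simple.fullmatch(ch))
```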



You have apparently useless genes in your genetic code as well.

These regexes are generated with genetic programming techniques, so they are not guaranteed to be an optimal solution, just a good one (by the metric of working).
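A toy sketch of what that kind of search can look like, under a deliberately simplified setup (the example strings, the `PIECES` building blocks, and the `fitness`/`mutate`/`evolve` helpers are all made up for illustration; real GP systems evolve expression trees with crossover, not flat strings):

```python
import random
import re

# Hypothetical training setup, chosen only to show the shape of the search.
EXAMPLES = ["12.34", "99.9.9"]           # strings the regex should fully match
COUNTEREXAMPLES = ["ab", "1", "x2.3"]    # strings it should reject
PIECES = [r"\d", r"\.", r"[\d\.]", r"\d\d", r"[\d\.]+"]  # mutation building blocks

def fitness(pattern):
    """Score = correctly matched examples + correctly rejected counterexamples."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return -1  # syntactically invalid individuals score worst
    hits = sum(rx.fullmatch(s) is not None for s in EXAMPLES)
    rejects = sum(rx.fullmatch(s) is None for s in COUNTEREXAMPLES)
    return hits + rejects

def mutate(pattern):
    # Append or drop one building block -- a crude stand-in for real GP
    # operators, which mutate and recombine expression trees.
    if pattern and random.random() < 0.3:
        for p in PIECES:
            if pattern.endswith(p):
                return pattern[: -len(p)]
    return pattern + random.choice(PIECES)

def evolve(generations=200, pop_size=20, seed=0):
    random.seed(seed)
    population = [random.choice(PIECES) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]   # keep the fitter half
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(population, key=fitness)

best = evolve()
```

Note that nothing here rewards brevity, so even this crude version can accrete redundant trailing pieces — exactly the kind of "junk" being discussed.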


You're right, I was just pointing out that with some basic optimizations you could get a much cleaner result.

Applying some fairly simple optimizations to my example result I get the following equivalent regular expression:

    \d{2}[\d\.]{3,}
which isn't too bad.

EDIT: Here are the optimizations:

The original:

    (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
1. Remove non-capturing groups and lookahead:

    \d\d[\d\d\.\d\.][\d\.]([\d\.]+)\1
2. Simplify set:

    \d\d[\d\.][\d\.]([\d\.]+)\1
3. Remove redundant backreference:

    \d\d[\d\.][\d\.][\d\.]+
4. Combine duplicates:

    \d{2}[\d\.]{3,}
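Each of these steps preserves the matched language. One way to sanity-check that in Python is to compare the full-match behavior of the original and the final form on a handful of sample strings (the samples are mine; this is spot-checking, not a proof of equivalence):

```python
import re

original = re.compile(r"(?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))")
simplified = re.compile(r"\d{2}[\d\.]{3,}")

# Two leading digits plus at least three more digits/dots should match;
# anything shorter, or starting with non-digits, should not.
samples = ["12.34", "12345", "1234", "99.9.9.9", "ab.34", "12..", "1.2.3"]
for s in samples:
    assert bool(original.fullmatch(s)) == bool(simplified.fullmatch(s))
```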


The optimizations are great to apply to an end result, but the Genetic Programming process probably won't benefit from the expressions being optimized 'along the way' because some of the 'junk' subexpressions allow for more robustness in the face of the genetic operators.

Putting it another way, your highly optimized version (4) is likely to be much more brittle to small 'defects' than (1).


The last point can't be overstated. Optimization is highly likely to overfit the training data.

For example, given training data 000001 000002 000003 000004

The "optimized" version would be \d+[1-4]

If the goal was simply to find the smallest regex that matches all the training data, that's trivial to code, NP-hard to execute.

But the goal isn't to find the smallest regexp, it's to find a "good" regexp.
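The overfitting in that example is easy to demonstrate: the "smallest" pattern accepts every training string but rejects the obvious fifth value.

```python
import re

training = ["000001", "000002", "000003", "000004"]
overfit = re.compile(r"\d+[1-4]")

# Perfect on the training data...
assert all(overfit.fullmatch(s) for s in training)
# ...but it rejects the very next value in the sequence.
assert overfit.fullmatch("000005") is None
```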


Actually, the last point can be overstated, because the optimizations he applied created a functionally equivalent state machine to the non-optimized version. If it overfitted, the original GA-generated algorithm overfitted as well.


My "optimizations" were trivial, in that they don't change the behavior of the regex at all, they just simplify it. As khafra points out, they are equivalent.

I also wasn't trying to say those simplifications should be applied between iterations of the algorithm, only at the end.


Understood - and I was just making the observation that the expressions produced during the process were junky in a way that's characteristic of the way they are evolving: mixing together two honed expressions would produce much more extreme deviations in 'children' than the offspring of two of these hairy expressions.

Not only are the expressions tending to get more accurate with each generation, but their degree of 'evolvability' is also being selected for.


You could consider it a two-step process:

  - find "small" regex that defines the language for the example — hard problem
  - minimize the regex (the language stays the same) — can be solved efficiently
See the link from http://news.ycombinator.com/item?id=4682623
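The second step is classical automata theory: convert the regex to a DFA and minimize it. A minimal sketch of Moore's partition-refinement algorithm, run on a hypothetical hand-built DFA (the states, transitions, and accepting set below are illustrative, not derived from any regex in this thread):

```python
# Moore's partition refinement: start from the accepting / non-accepting
# split, then keep splitting blocks whose members disagree on which block
# each input symbol leads to. Equivalent states end up in the same block.
def minimize(states, alphabet, delta, accepting):
    block = {s: int(s in accepting) for s in states}
    while True:
        sig = {s: (block[s],) + tuple(block[delta[s][a]] for a in alphabet)
               for s in states}
        ids = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        new = {s: ids[sig[s]] for s in states}
        if len(set(new.values())) == len(set(block.values())):
            return new  # partition is stable: these are the minimal states
        block = new

# Hypothetical DFA over a one-symbol alphabet ('d' standing for "any digit"):
# it accepts strings of two or more d's, but states 2 and 3 behave
# identically, so they should be merged.
states = [0, 1, 2, 3]
alphabet = ["d"]
delta = {0: {"d": 1}, 1: {"d": 2}, 2: {"d": 3}, 3: {"d": 3}}
accepting = {2, 3}

blocks = minimize(states, alphabet, delta, accepting)
```

Here `blocks` maps each state to its equivalence class: the redundant pair collapses, leaving a three-state minimal DFA for the same language.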



