You have apparently useless genes in your genetic code as well.
These regexes are generated with genetic programming techniques, so they are not guaranteed to be optimal, just good (by the metric of working).
The optimizations are great to apply to an end result, but the genetic programming process probably won't benefit from the expressions being optimized 'along the way', because some of the 'junk' subexpressions allow for more robustness in the face of the genetic operators.
Putting it another way, your highly optimized version (4) is likely to be much more brittle to small 'defects' than (1).
Actually, the last point can be overstated, because the optimizations he applied produced a state machine that is functionally equivalent to the non-optimized version. If the optimized version overfits, the original GA-generated expression overfits as well.
My "optimizations" were trivial in that they don't change the behavior of the regex at all; they just simplify it. As khafra points out, the two versions are equivalent.
I also wasn't trying to say those simplifications should be applied between iterations of the algorithm, only at the end.
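For anyone who wants to sanity-check a simplification like this, here is a minimal sketch that brute-forces every string up to a small length and reports the first one on which two patterns disagree. The function name `first_disagreement` and both patterns are made up for illustration; they are not the regexes from the post. Agreement on short strings is evidence of equivalence, not a proof.

```python
import itertools
import re


def first_disagreement(pattern_a, pattern_b, alphabet="ab", max_len=6):
    """Return the first string (shortest first) that exactly one pattern
    fully matches, or None if the patterns agree on every string tried."""
    rx_a = re.compile(pattern_a)
    rx_b = re.compile(pattern_b)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            matched_a = rx_a.fullmatch(candidate) is not None
            matched_b = rx_b.fullmatch(candidate) is not None
            if matched_a != matched_b:
                return candidate
    return None


# Hypothetical example: a "junky" pattern and its hand-simplified form.
junky = r"(?:a|a)(?:b{1,1})*"
simple = r"ab*"
```

With these two patterns `first_disagreement(junky, simple)` finds no counterexample, while comparing `r"ab+"` against `simple` turns up a disagreement straight away, since only one of them accepts a bare `a`.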
Understood. I was just making the observation that the expressions produced during the process are junky in a way that's characteristic of how they evolve: mixing together two honed expressions would produce much more extreme deviations in the 'children' than mixing two of these hairy expressions would.
Not only are the expressions tending to get more accurate with each generation, but their degree of 'evolvability' is also being selected for.
- find "small" regex that defines the language for the example — hard problem
- minimize the regex (the language stays the same) — can be solved efficiently
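
One way to read the "can be solved efficiently" step is as minimizing the deterministic automaton behind the regex, which takes polynomial time. The sketch below does that with plain partition refinement (Moore's algorithm). Note that it works on the DFA, not on the regex text itself. It also assumes a complete DFA whose states are all reachable. The function name, the dict-based transition table, and the example automaton are all invented for illustration.

```python
def minimize_dfa(states, alphabet, delta, accepting):
    """Group equivalent DFA states by partition refinement.

    states: list of state ids; delta: dict mapping (state, symbol) -> state.
    Returns (block_of, block_count), where block_of maps each state to the
    id of its equivalence class.
    """
    # Start from the coarsest split: accepting vs. non-accepting states.
    block_of = {s: int(s in accepting) for s in states}
    block_count = len(set(block_of.values()))
    while True:
        ids = {}
        refined = {}
        for s in states:
            # Two states stay together only if they are in the same block
            # and every symbol sends them to the same block.
            signature = (block_of[s],) + tuple(
                block_of[delta[s, a]] for a in alphabet
            )
            refined[s] = ids.setdefault(signature, len(ids))
        if len(ids) == block_count:
            # No block was split, so the partition is stable.
            return refined, block_count
        block_of, block_count = refined, len(ids)


# Hypothetical DFA for the language ab*, with a redundant accepting state.
# 0 = start, 1 and 2 = accepting (interchangeable), 3 = dead state.
example_states = [0, 1, 2, 3]
example_alphabet = "ab"
example_delta = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 1,
    (3, "a"): 3, (3, "b"): 3,
}
example_blocks, example_count = minimize_dfa(
    example_states, example_alphabet, example_delta, accepting={1, 2}
)
```

In this example, states 1 and 2 end up in the same block and the four-state automaton collapses to three states. The language accepted does not change.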
It's also got a useless non-capturing group and lookahead.