Automatic Generation of Regular Expressions from Examples (units.it)
170 points by ColinWright on Oct 22, 2012 | hide | past | favorite | 59 comments


I am the "boss" of the research group that put the regex generator tool online (http://machinelearning.inginf.units.it/ "Data and tools"): we are all VERY SORRY for any inconvenience we may have created.

We simply did not expect so much interest in our work. Yesterday we received more than 11000 visits. We are not prepared to handle such traffic.

We are working toward adding a few more CPUs and will post status updates on our Twitter account: http://twitter.com/@MaleLabTs.

We are afraid we will not be able to fully solve the overload problem anytime soon, though. We are a small research group, with scarce resources.

We are also preparing a small FAQ addressing several of the interesting points raised in this discussion (e.g., why not execute everything in JavaScript?). We will announce its completion via Twitter. Further comments and criticisms are most welcome.

Thanks a lot for your interest and patience.


That's a useful response - thanks for taking the time to register and respond.

Apologies for the extreme load caused - when geeks find something cool they tend to share it, and not everyone is ready for that level of interest.

I, for one, look forward to your updates. What would be really great would be for you to prepare some kind of informal overview and put that up as a blog post, making it clear that it's intended to be informal, and is not in any way your formal research results or anything like that. Then post here a link to it. If you engage with this community you'll get a lot of smart people offering thoughts and suggestions from which you can pick and choose.

Great work - thank you.


Interestingly, there is much to say about the theoretical problem of finding the smallest regular expression for a set of positive and negative examples. A nice overview: http://cstheory.blogoverflow.com/2011/08/on-learning-regular...


This is cool, but it doesn't look like it's doing even trivial optimizations...

    (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
"[\d\d\.\d\.]" is equivalent to "[\d\.]", right?

It's also got a useless non-capturing group and lookahead.


You have apparently useless genes in your genetic code as well.

These regexes are generated with genetic programming techniques, so they are not guaranteed to be an optimal solution, just a good (by the metric of working) one.


You're right, I was just pointing out that with some basic optimizations you could get a much cleaner result.

Applying some fairly simple optimizations to my example result I get the following equivalent regular expression:

   \d{2}[\d\.]{3,}
which isn't too bad.

EDIT: Here are the optimizations:

The original:

    (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
1. Remove non-capturing groups and lookahead:

    \d\d[\d\d\.\d\.][\d\.]([\d\.]+)\1
2. Simplify set:

    \d\d[\d\.][\d\.]([\d\.]+)\1
3. Remove redundant backreference:

    \d\d[\d\.][\d\.][\d\.]+
4. Combine duplicates:

    \d{2}[\d\.]{3,}
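A quick sanity check (a sketch in Python; I'm assuming Python's re semantics here, which may differ slightly from whatever engine the tool targets) confirms that the original and the step-4 version agree on whether a match exists:

```python
import re

# The evolved regex and the hand-simplified version from the steps above.
original = re.compile(r"(?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))")
simplified = re.compile(r"\d{2}[\d\.]{3,}")

# Both should agree on whether each sample contains a match.
samples = ["12345", "12.4.5", "192.168.0.1", "12ab", "1.2.3", "ab12.345cd", ""]
for s in samples:
    assert bool(original.search(s)) == bool(simplified.search(s)), s
```

Note the `(?=([\d\.]+))\1` trick in the original is effectively a possessive `[\d\.]+`, which changes backtracking behavior but not which strings contain a match.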


The optimizations are great to apply to an end result, but the Genetic Programming process probably won't benefit from the expressions being optimized 'along the way' because some of the 'junk' subexpressions allow for more robustness in the face of the genetic operators.

Putting it another way, your highly optimized version (4) is likely to be much more brittle to small 'defects' than (1).


The last point can't be overstated. Optimization is highly likely to overfit the training data.

For example, given training data 000001 000002 000003 000004

The "optimized" version would be \d+[1-4]

If the goal was simply to find the smallest regex that matches all the training data, that's trivial to code, NP-hard to execute.

But the goal isn't to find the smallest regexp, it's to find a "good" regexp.
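The overfitting concern is easy to see concretely. A sketch in Python (the `$`-anchored patterns below are my own illustration, not the tool's output):

```python
import re

training = ["000001", "000002", "000003", "000004"]
overfit = re.compile(r"\d+[1-4]$")  # tightest description of the four examples
intended = re.compile(r"\d+$")      # what the user probably meant

# Both fit the training data perfectly...
assert all(overfit.match(s) for s in training)
assert all(intended.match(s) for s in training)

# ...but the overfit pattern rejects unseen, clearly in-family input.
assert not overfit.match("000005")
assert intended.match("000005")
```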


Actually, the last point can be overstated, because the optimizations he applied created a functionally equivalent state machine to the non-optimized version. If it overfitted, the original GA-generated algorithm overfitted as well.


My "optimizations" were trivial, in that they don't change the behavior of the regex at all, they just simplify it. As khafra points out, they are equivalent.

I also wasn't trying to say those simplifications should be applied between iterations of the algorithm, only at the end.


Understood - and I was just making the observation that the expressions produced during the process were junky in a way that's characteristic of the way they are evolving: mixing together two honed expressions would produce much more extreme deviations in 'children' than the offspring of two of these hairy expressions.

Not only are the expressions tending to get more accurate with each generation, but their degree of 'evolvability' is also being selected for.


You could consider it a two-step process:

  - find "small" regex that defines the language for the example — hard problem
  - minimize the regex (the language stays the same) — can be solved efficiently
See the link from http://news.ycombinator.com/item?id=4682623


"When there are less than 200 examples, the training set and validation set have the same size. Otherwise, the training set is composed of 100 examples and the other examples constitute the validation set."

Why divide the data into sets like that? Assuming the sample data is representative, wouldn't it be better to use much more for training, say 90% with 10% for validation?


Then you might have the problem of over-fitting the training set.

See here for example: http://en.wikipedia.org/wiki/Overfitting


More training data helps prevent overfitting.


You use the validation set precisely to avoid overfitting. I agree that a 90/10 split would be more reasonable.


You could also use 10-fold cross validation[1] or similar where you rotate the training/validation sets.

[1]: http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...
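For illustration, 10-fold cross-validation can be sketched in a few lines of Python (a toy partitioning scheme; real implementations such as scikit-learn's `KFold` also handle shuffling and stratification):

```python
def k_fold(examples, k=10):
    """Yield (train, validation) pairs, rotating which fold is held out."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# Every example appears in the validation set exactly once across the k rounds.
examples = list(range(50))
for train, valid in k_fold(examples):
    assert len(valid) == 5 and len(train) == 45
    assert sorted(train + valid) == examples
```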


Also natively supported in WEKA. (Not that WEKA is easy to use.)


Yep, that's another good way.


The other replies, directly discussing overfitting, are spot on. However, I did want to point out explicitly that the stated assumption of representative sample data can be a fundamental challenge, not to be taken lightly.


For the reverse of this, see the HN discussion "Given a list of regexes, generate all possible strings that match all of them" - http://news.ycombinator.com/item?id=2579783


I can see how this supports TDD. Write a test suite and it will generate a regex that passes it.


I can't find it now, but I've read about projects generating Haskell code from unit tests. Anyone got links?


I don't know about generating code from unit tests, but there's Djinn for generating code from type signatures.



QuickCheck generates random examples from a rule, not the reverse.


Slightly off-topic: I was surprised to see a ".it" domain that was actually Italian.


More off-topic: I wonder why 'interesting' top-level domains like '.it' or '.at' aren't used more frequently to create 'interesting' domain names.

Edit: you.get.it


They're hard to register and cost more. I registered http://canhekick.it through iwantmyname.com after someone snatched up canhekickit.com. It cost me $30 instead of $10.


It's an Italian university, the Università degli Studi di Trieste (in northeastern Italy): http://www.units.it/


Sounds neat, but unfortunately it isn't working with a "Server is busy!" error.


I wonder why the GP has to be running at server side, rather than in JavaScript in the browser.


It certainly does not have to be running on the server side. I'm guessing that it's just a matter of what environment the creator was comfortable with.


Sumit Gulwani of Microsoft Research has a non-genetic programming approach to this and similar problems, which is interesting to compare: http://dl.acm.org/citation.cfm?doid=2240236.2240260


I attempted writing something similar when I was bored at work a year ago. I soon discovered how difficult this problem actually is. It's definitely beyond afternoon hacking. I'd love to get a peek at the secret sauce!


There is a paper which explains how it works: http://mllab.inginf.units.it/publications/international-conf...

Note that there are other (non-genetic) approaches to learn regular expressions (some are cited in the above paper).


Thanks for the link. I've been meaning to get into machine learning and this sounds like a good, tractable problem to start with.


This makes me think: in 2012, people still believe that genetic programming, a numerical optimization technique (and not the most efficient one), magically solves arbitrary machine learning problems.


More efficient techniques, such as?

It would be interesting for me and others here to learn from what you know rather than be left with what you wrote.


Funnily, I had asked this question awhile back on StackOverflow: http://stackoverflow.com/questions/4880402/


I look forward to trying this once the HN server overload effect has worn off.

This is something I would have found useful a year or two ago, considered writing, and decided it would be impossible to get a usefully generalised regex. I'm glad people cleverer and less defeatist than me had the same idea, since it seems it is possible to get useful results after all :)


There is a related cool system: http://groups.csail.mit.edu/uid/lapis/

It is actually a text editor, and it doesn't produce regular expressions but rather similar textual constraint patterns.


Is there any chance to look at the code (application)? I'd be interested in running this locally as I have a need for generation of regular expressions.


Instead of randomized hill-climbing (genetic programming's search algo) why not try a little A*?


How would you envision that working? What is the estimated distance function?


The "distance" of a regular expression could be its specificity (think CSS) or the size of the language described by the regular expression (with some adjustments to account for repetition, as a* being infinite is probably not what is intended.)

The real problem is that the space of possible solutions is very large, so you will have to do that lazily, which would make an implementation in a lot of languages rather annoying.

I'd also expect this algorithm to be biased towards regular expressions that are more specific than the user intended them to be, especially for example sets that are small or not very diverse.
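As a toy illustration of such a "language size" heuristic (the AST encoding here is entirely hypothetical), the count can be computed recursively, with repetition treated as infinite per the caveat above:

```python
import math

# Toy regex AST: ("lit",) a single literal, ("alt", a, b) alternation,
# ("cat", a, b) concatenation, ("star", a) zero-or-more repetition.
def language_size(node):
    """Rough count of strings the pattern describes; 'star' is infinite."""
    kind = node[0]
    if kind == "lit":
        return 1
    if kind == "alt":
        return language_size(node[1]) + language_size(node[2])
    if kind == "cat":
        return language_size(node[1]) * language_size(node[2])
    if kind == "star":
        return math.inf
    raise ValueError(f"unknown node kind: {kind}")

# "a(b|c)" describes 2 strings; "a*" describes infinitely many.
assert language_size(("cat", ("lit",), ("alt", ("lit",), ("lit",)))) == 2
assert language_size(("star", ("lit",))) == math.inf
```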


Is a* more or less infinite than .*?


I intended `a*` to mean `a` repeated zero or more times, meaning that `a*` defines a language of infinite size, as there are an infinite number of strings that match that grammar.




The search space is too large - there are too many combinations to explore. Genetic programming makes it faster to get a good solution. The solution might not be the optimal one, but it's good enough :-)
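To give a feel for searching this space, here is a bare-bones mutation-only hill climber over regex strings (far cruder than the multi-objective GP in the actual paper; the example strings and mutation alphabet are made up for illustration). It only guarantees a pattern at least as fit as its seed:

```python
import random
import re

positives = ["12.34", "99.01"]   # strings the regex should fully match
negatives = ["abc", "12x34"]     # strings it should reject

def fitness(pattern):
    """Count correctly classified examples; invalid patterns score -1."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return -1
    return (sum(bool(rx.fullmatch(p)) for p in positives)
            + sum(not rx.fullmatch(n) for n in negatives))

ALPHABET = list("0123456789.\\d[]+*?")

def mutate(pattern, rng):
    """Insert or delete one character at a random position."""
    i = rng.randrange(len(pattern) + 1)
    if pattern and i < len(pattern) and rng.random() < 0.5:
        return pattern[:i] + pattern[i + 1:]
    return pattern[:i] + rng.choice(ALPHABET) + pattern[i:]

rng = random.Random(0)
best = r"\d+"
for _ in range(2000):
    candidate = mutate(best, rng)
    if fitness(candidate) >= fitness(best):
        best = candidate
```

Since candidates are only accepted when fitness does not decrease, the result can never score worse than the seed, but nothing stops it from stalling at a local optimum - which is part of why GP uses a population and crossover instead.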


Cool & handy


The sledgehammer approach.


I keep getting "server busy". Why share a link if it won't be able to handle the traffic?


So, every time you share a link, you always test it first to see if it can handle the load?

Ah, no, looking at your profile you haven't yet ever shared a link. I can only conclude that you find lots and lots of things that you are eager to share with us, but when you rigorously test them you find that they can't handle the load, so you sadly decline to submit them.

Thanks for the feedback. I'll be sure to keep it in mind.


This is a community for geeks. I believe it is a realistic expectation that one should be wary of sharing with tens of thousands of readers sites that run processor-intensive server-side scripts.

(Nice ad hominem, by the way.)


This is a community for geeks ( although there are those who claim otherwise: http://news.ycombinator.com/item?id=4681146 ). I believe it is a reasonable expectation that people will want to see things and will understand that sometimes they will need to spread the load. Cutting edge technology sometimes can't handle the load, and geeks will know that, understand, and make allowances. I would have thought that people in this community would want to see new stuff, rather than have it withheld simply because someone thought the server might not be able to handle the load.

How can one tell if the server is going to be able to handle the load? Perhaps the owners had anticipated that this would be of interest and were ready to handle it?

How could I tell?

I think it's a reasonable expectation in a community of geeks that if you are to level criticism at me, then you should offer alternatives and constructive suggestions, and I look forward to your reply as to what one should do when considering submitting a link like this.

How would you test it?


I didn't give suggestions on how to test it because I figured someone who knows how to automatically generate regular expressions from examples would also know how to test it under load (or that it should be tested before being made public). It should not be that difficult to simulate simple user behavior, especially since the majority of the functionality in this case is tied to (presumably) one method in the code. Run it 100/1,000/10,000 times simultaneously behind the scenes with a bunch of test examples and see if your server vomits. Pretty easy.

Honestly, I'm a bit dismayed that you are challenging me on this. "Load-test your code before releasing it" should be a strict requirement for anything that is submitted here. Otherwise, how are people expected to give feedback? Why is our time being wasted on stuff that does not work?


Hmm.

You do realize, do you not, that it's not my code. Or my server. I'm just sharing something I found.



