Automatic Generation of Regular Expressions from Examples (units.it)
170 points by ColinWright on Oct 22, 2012 | hide | past | favorite | 59 comments


I am the "boss" of the research group that put the regex generator tool online (http://machinelearning.inginf.units.it/ "Data and tools"): we are all VERY SORRY for any inconvenience we may have created.

We simply did not expect so much interest in our work. Yesterday we received more than 11000 visits. We are not prepared to handle such traffic.

We are working toward adding a few more CPUs and will post status updates on our Twitter account: http://twitter.com/@MaleLabTs.

We are afraid we will not be able to fully solve the overload problem anytime soon, though. We are a small research group, with scarce resources.

We are also preparing a small FAQ addressing several of the interesting points raised in this discussion (e.g., why not execute everything in JavaScript?). We will announce its completion via Twitter. Further comments and criticisms are most welcome.

Thanks a lot for your interest and patience.


That's a useful response - thanks for taking the time to register and respond.

Apologies for the extreme load caused - when geeks find something cool they tend to share it, and not everyone is ready for that level of interest.

I, for one, look forward to your updates. What would be really great would be for you to prepare some kind of informal overview and put that up as a blog post, making it clear that it's intended to be informal, and is not in any way your formal research results or anything like that. Then post here a link to it. If you engage with this community you'll get a lot of smart people offering thoughts and suggestions from which you can pick and choose.

Great work - thank you.


Interestingly, there is much to say about the theoretical problem of finding the smallest regular expression for a set of positive and negative examples. A nice overview: http://cstheory.blogoverflow.com/2011/08/on-learning-regular...


This is cool, but it doesn't look like it's doing even trivial optimizations...

    (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
"[\d\d\.\d\.]" is equivalent to "[\d\.]", right?

It's also got a useless non-capturing group and lookahead.


You have apparently useless genes in your genetic code as well.

These regexes are generated with genetic programming techniques, so they are not guaranteed to be an optimal solution, just a good (by the metric of working) one.


You're right, I was just pointing out that with some basic optimizations you could get a much cleaner result.

Applying some fairly simple optimizations to my example result I get the following equivalent regular expression:

   \d{2}[\d\.]{3,}
which isn't too bad.

EDIT: Here are the optimizations:

The original:

    (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
1. Remove non-capturing groups and lookahead:

    \d\d[\d\d\.\d\.][\d\.]([\d\.]+)\1
2. Simplify set:

    \d\d[\d\.][\d\.]([\d\.]+)\1
3. Remove redundant backreference:

    \d\d[\d\.][\d\.][\d\.]+
4. Combine duplicates:

    \d{2}[\d\.]{3,}
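A quick sanity check (a sketch in Python; I'm assuming Python's re semantics here, which may differ slightly from whatever engine the tool targets) confirms that the original and the step-4 version agree on whether a match exists:

```python
import re

# The evolved regex and the hand-simplified version from the steps above.
original = re.compile(r"(?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))")
simplified = re.compile(r"\d{2}[\d\.]{3,}")

# Both should agree on whether each sample contains a match.
samples = ["12345", "12.4.5", "192.168.0.1", "12ab", "1.2.3", "ab12.345cd", ""]
for s in samples:
    assert bool(original.search(s)) == bool(simplified.search(s)), s
```

Note the `(?=([\d\.]+))\1` trick in the original is effectively a possessive `[\d\.]+`, which changes backtracking behavior but not which strings contain a match.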


The optimizations are great to apply to an end result, but the Genetic Programming process probably won't benefit from the expressions being optimized 'along the way' because some of the 'junk' subexpressions allow for more robustness in the face of the genetic operators.

Putting it another way, your highly optimized version (4) is likely to be much more brittle to small 'defects' than (1).


The last point can't be overstated. Optimization is highly likely to overfit the training data.

For example, given training data 000001 000002 000003 000004

The "optimized" version would be \d+[1-4]

If the goal was simply to find the smallest regex that matches all the training data, that's trivial to code, NP-hard to execute.

But the goal isn't to find the smallest regexp, it's to find a "good" regexp.
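The overfitting concern is easy to see concretely. A sketch in Python (the `$`-anchored patterns below are my own illustration, not the tool's output):

```python
import re

training = ["000001", "000002", "000003", "000004"]
overfit = re.compile(r"\d+[1-4]$")  # tightest description of the four examples
intended = re.compile(r"\d+$")      # what the user probably meant

# Both fit the training data perfectly...
assert all(overfit.match(s) for s in training)
assert all(intended.match(s) for s in training)

# ...but the overfit pattern rejects unseen, clearly in-family input.
assert not overfit.match("000005")
assert intended.match("000005")
```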


Actually, the last point can be overstated, because the optimizations he applied created a functionally equivalent state machine to the non-optimized version. If it overfitted, the original GA-generated algorithm overfitted as well.


My "optimizations" were trivial, in that they don't change the behavior of the regex at all, they just simplify it. As khafra points out, they are equivalent.

I also wasn't trying to say those simplifications should be applied between iterations of the algorithm, only at the end.


Understood - and I was just making the observation that the expressions produced during the process were junky in a way that's characteristic of the way they are evolving: mixing together two honed expressions would produce much more extreme deviations in 'children' than the offspring of two of these hairy expressions.

Not only are the expressions tending to get more accurate with each generation, but their degree of 'evolvability' is also being selected for.


You could consider it a two-step process:

  - find "small" regex that defines the language for the example — hard problem
  - minimize the regex (the language stays the same) — can be solved efficiently
See the link from http://news.ycombinator.com/item?id=4682623


"When there are less than 200 examples, the training set and validation set have the same size. Otherwise, the training set is composed of 100 examples and the other examples constitute the validation set."

Why divide the data into sets like that? Assuming the sample data is representative, wouldn't it be better to use much more for training, say 90% with 10% for validation?


Then you might have the problem of over-fitting the training set.

See here for example: http://en.wikipedia.org/wiki/Overfitting


More training data helps prevent overfitting.


You use the validation set precisely to avoid overfitting. I agree that a 90/10 split would be more reasonable.


You could also use 10-fold cross validation[1] or similar where you rotate the training/validation sets.

[1]: http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...
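For illustration, 10-fold cross-validation can be sketched in a few lines of Python (a toy partitioning scheme; real implementations such as scikit-learn's `KFold` also handle shuffling and stratification):

```python
def k_fold(examples, k=10):
    """Yield (train, validation) pairs, rotating which fold is held out."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# Every example appears in the validation set exactly once across the k rounds.
examples = list(range(50))
for train, valid in k_fold(examples):
    assert len(valid) == 5 and len(train) == 45
    assert sorted(train + valid) == examples
```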


Also natively supported in WEKA. (Not that WEKA is easy to use.)


Yep, that's another good way.


The other replies, directly discussing overfitting, are spot on. However, I did want to point out explicitly that the stated assumption of representative sample data can be a fundamental challenge, not to be taken lightly.


For the reverse of this, see the HN discussion "Given a list of regexes, generate all possible strings that match all of them" - http://news.ycombinator.com/item?id=2579783


I can see how this supports TDD. Write a test suite and it will generate a regex that passes it.


I can't find it now, but I've read about projects generating Haskell code from unit tests. Anyone got links?


I don't know about generating code from unit tests, but there's Djinn for generating code from type signatures.



QuickCheck generates random examples from a rule, not the reverse.


Slightly off-topic: I was surprised to see a ".it" domain that was actually Italian.


More off-topic: I wonder why 'interesting' top-level domains like '.it' or '.at' aren't used more frequently to create 'interesting' domain names.

Edit: you.get.it


They're hard to register and cost more. I registered http://canhekick.it through iwantmyname.com after someone snatched up canhekickit.com. It cost me $30 instead of $10.


It's an Italian university, the Università degli Studi di Trieste (in northeastern Italy): http://www.units.it/


Sounds neat, but unfortunately it isn't working with a "Server is busy!" error.


I wonder why the GP has to be running at server side, rather than in JavaScript in the browser.


It certainly does not have to be running on the server side. I'm guessing that it's just a matter of what environment the creator was comfortable with.


Sumit Gulwani of Microsoft Research has a non-genetic programming approach to this and similar problems, which is interesting to compare: http://dl.acm.org/citation.cfm?doid=2240236.2240260


I attempted writing something similar when I was bored at work a year ago. I soon discovered how difficult this problem actually is. It's definitely beyond afternoon hacking. I'd love to get a peek at the secret sauce!


There is a paper which explains how it works: http://mllab.inginf.units.it/publications/international-conf...

Note that there are other (non-genetic) approaches to learn regular expressions (some are cited in the above paper).


Thanks for the link. I've been meaning to get into machine learning and this sounds like a good, tractable problem to start with.


This makes me think: in 2012, people still believe that genetic programming, a numerical optimization technique (and not the most efficient one), magically solves arbitrary machine learning problems.


More efficient techniques, such as?

It would be interesting for me and others here to learn from what you know rather than be left with what you wrote.


Funnily, I had asked this question awhile back on StackOverflow: http://stackoverflow.com/questions/4880402/


I look forward to trying this once the HN server overload effect has worn off.

This is something I would have found useful a year or two ago, considered writing, and decided it would be impossible to get a usefully generalised regex. I'm glad people cleverer and less defeatist than me had the same idea, since it seems it is possible to get useful results after all :)


There is a related cool system: http://groups.csail.mit.edu/uid/lapis/

It is actually a text editor, and it doesn't produce regular expressions but rather similar textual constraint patterns.


Is there any chance to look at the code (application)? I'd be interested in running this locally as I have a need for generation of regular expressions.


Instead of randomized hill-climbing (genetic programming's search algo) why not try a little A*?


How would you envision that working? What is the estimated distance function?


The "distance" of a regular expression could be its specificity (think CSS) or the size of the language described by the regular expression (with some adjustments to account for repetition, as a* being infinite is probably not what is intended.)

The real problem is that the space of possible solutions is very large, so you will have to do that lazily, which would make an implementation in a lot of languages rather annoying.

I'd also expect this algorithm to be biased towards regular expressions that are more specific than the user intended them to be, especially for example sets that are small or not very diverse.
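As a toy illustration of such a "language size" heuristic (the AST encoding here is entirely hypothetical), the count can be computed recursively, with repetition treated as infinite per the caveat above:

```python
import math

# Toy regex AST: ("lit",) a single literal, ("alt", a, b) alternation,
# ("cat", a, b) concatenation, ("star", a) zero-or-more repetition.
def language_size(node):
    """Rough count of strings the pattern describes; 'star' is infinite."""
    kind = node[0]
    if kind == "lit":
        return 1
    if kind == "alt":
        return language_size(node[1]) + language_size(node[2])
    if kind == "cat":
        return language_size(node[1]) * language_size(node[2])
    if kind == "star":
        return math.inf
    raise ValueError(f"unknown node kind: {kind}")

# "a(b|c)" describes 2 strings; "a*" describes infinitely many.
assert language_size(("cat", ("lit",), ("alt", ("lit",), ("lit",)))) == 2
assert language_size(("star", ("lit",))) == math.inf
```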


Is a* more or less infinite than .*?


I intended `a*` to mean `a` repeated zero or more times, meaning that `a*` defines a language of infinite size, as there are an infinite number of strings that match that grammar.




The search space is too large - there are too many combinations to explore. Genetic programming makes it faster to get a good solution. The solution might not be the optimal one, but it's good enough :-)
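To give a feel for searching this space, here is a bare-bones mutation-only hill climber over regex strings (far cruder than the multi-objective GP in the actual paper; the example strings and mutation alphabet are made up for illustration). It only guarantees a pattern at least as fit as its seed:

```python
import random
import re

positives = ["12.34", "99.01"]   # strings the regex should fully match
negatives = ["abc", "12x34"]     # strings it should reject

def fitness(pattern):
    """Count correctly classified examples; invalid patterns score -1."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return -1
    return (sum(bool(rx.fullmatch(p)) for p in positives)
            + sum(not rx.fullmatch(n) for n in negatives))

ALPHABET = list("0123456789.\\d[]+*?")

def mutate(pattern, rng):
    """Insert or delete one character at a random position."""
    i = rng.randrange(len(pattern) + 1)
    if pattern and i < len(pattern) and rng.random() < 0.5:
        return pattern[:i] + pattern[i + 1:]
    return pattern[:i] + rng.choice(ALPHABET) + pattern[i:]

rng = random.Random(0)
best = r"\d+"
for _ in range(2000):
    candidate = mutate(best, rng)
    if fitness(candidate) >= fitness(best):
        best = candidate
```

Since candidates are only accepted when fitness does not decrease, the result can never score worse than the seed, but nothing stops it from stalling at a local optimum - which is part of why GP uses a population and crossover instead.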


Cool & handy


The sledgehammer approach.


I keep getting "server busy". Why share a link if it won't be able to handle the traffic?


So, every time you share a link, you always test it first to see if it can handle the load?

Ah, no, looking at your profile you haven't yet ever shared a link. I can only conclude that you find lots and lots of things that you are eager to share with us, but when you rigorously test them you find that they can't handle the load, so you sadly decline to submit them.

Thanks for the feedback. I'll be sure to keep it in mind.


This is a community for geeks. I believe it is a realistic expectation that one should be wary of sharing with tens of thousands of readers sites that run processor-intensive server-side scripts.

(Nice ad hominem, by the way.)


This is a community for geeks ( although there are those who claim otherwise: http://news.ycombinator.com/item?id=4681146 ). I believe it is a reasonable expectation that people will want to see things and will understand that sometimes they will need to spread the load. Cutting edge technology sometimes can't handle the load, and geeks will know that, understand, and make allowances. I would have thought that people in this community would want to see new stuff, rather than have it withheld simply because someone thought the server might not be able to handle the load.

How can one tell if the server is going to be able to handle the load? Perhaps the owners had anticipated that this would be of interest and were ready to handle it?

How could I tell?

I think it's a reasonable expectation in a community of geeks that if you are to level criticism at me, then you should offer alternatives and constructive suggestions, and I look forward to your reply as to what one should do when considering submitting a link like this.

How would you test it?


I didn't give suggestions on how to test it because I figured someone who knows how to automatically generate regular expressions from examples would also know how to test it under load (or that it should be tested before being made public). It should not be that difficult to simulate simple user behavior, especially since the majority of the functionality in this case is tied to (presumably) one method in the code. Run it 100/1,000/10,000 times simultaneously behind the scenes with a bunch of test examples and see if your server vomits. Pretty easy.

Honestly, I'm a bit dismayed that you are challenging me on this. "Load-test your code before releasing it" should be a strict requirement for anything that is submitted here. Otherwise, how are people expected to give feedback? Why is our time being wasted on stuff that does not work?


Hmm.

You do realize, do you not, that it's not my code. Or my server. I'm just sharing something I found.



