I am the "boss" of the research group that put the regex generator tool online (http://machinelearning.inginf.units.it/, "Data and tools"): we are all VERY SORRY for any inconvenience we may have caused.
We simply did not expect so much interest in our work. Yesterday we received more than 11000 visits. We are not prepared to handle such traffic.
We are working toward adding a few more CPUs and will post status updates on our Twitter account: http://twitter.com/@MaleLabTs.
We are afraid we will not be able to fully solve the overload problem anytime soon, though. We are a small research group, with scarce resources.
We are also preparing a short FAQ addressing several of the interesting points raised in this discussion (e.g., why not execute everything in JavaScript?). We will announce its completion via Twitter. Further comments and criticisms are most welcome.
That's a useful response - thanks for taking the time to register and respond.
Apologies for the extreme load caused - when geeks find something cool they tend to share it, and not everyone is ready for that level of interest.
I, for one, look forward to your updates. What would be really great would be for you to prepare some kind of informal overview and put that up as a blog post, making it clear that it's intended to be informal, and is not in any way your formal research results or anything like that. Then post here a link to it. If you engage with this community you'll get a lot of smart people offering thoughts and suggestions from which you can pick and choose.
You have apparently useless genes in your genetic code as well.
These regexes are generated with genetic programming techniques, so they are not guaranteed to be an optimal solution, just a good (by the metric of working) one.
The optimizations are great to apply to an end result, but the Genetic Programming process probably won't benefit from the expressions being optimized 'along the way' because some of the 'junk' subexpressions allow for more robustness in the face of the genetic operators.
Putting it another way, your highly optimized version (4) is likely to be much more brittle to small 'defects' than (1).
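As a toy illustration of the kind of GP loop being described (a sketch, not the authors' actual system: the examples, atoms, and fitness function here are all invented for the demo):

```python
import random
import re

random.seed(0)

# Toy data: strings the regex should fully match vs. reject.
POSITIVE = ["cat", "cart"]
NEGATIVE = ["car", "dog"]
ATOMS = list("carts?d")  # single-character building blocks

def random_regex(length=5):
    return "".join(random.choice(ATOMS) for _ in range(length))

def fitness(pattern):
    """+1 per positive example matched, -1 per negative; uncompilable
    individuals (e.g. a leading '?') score worst and die out."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return -1
    return (sum(1 for s in POSITIVE if rx.fullmatch(s))
            - sum(1 for s in NEGATIVE if rx.fullmatch(s)))

def mutate(pattern):
    """Replace one random position -- a crude genetic operator."""
    i = random.randrange(len(pattern))
    return pattern[:i] + random.choice(ATOMS) + pattern[i + 1:]

def evolve(generations=200, pop_size=30):
    pop = [random_regex() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]  # truncation selection
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))  # best pattern found and its score
```

Note that nothing here prunes "junk" during the run, which is exactly the point above: redundant subexpressions survive as long as they don't hurt fitness.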
Actually, that last point may be overstated, because the optimizations he applied produce a state machine functionally equivalent to the non-optimized version. If the optimized regex overfits, the original GA-generated one overfits just as much.
My "optimizations" were trivial, in that they don't change the behavior of the regex at all, they just simplify it. As khafra points out, they are equivalent.
I also wasn't trying to say those simplifications should be applied between iterations of the algorithm, only at the end.
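That end-of-run check can be approximated by brute force over short strings (the patterns below are hypothetical stand-ins for an evolved regex and its hand-simplified form, not ones the tool produced):

```python
import itertools
import re

# Hypothetical pair: an "evolved" regex with redundant alternations
# and an empty group, and its simplified equivalent.
evolved = "(?:a|a)(?:b)*(?:|)"
simplified = "ab*"

# Exhaustively compare behaviour on all strings over {a, b} up to
# length 5: agreement everywhere here is strong (though not
# conclusive) evidence the simplification preserved the language.
for n in range(6):
    for chars in itertools.product("ab", repeat=n):
        s = "".join(chars)
        assert bool(re.fullmatch(evolved, s)) == bool(re.fullmatch(simplified, s))
print("equivalent on all strings up to length 5")
```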
Understood - and I was just making the observation that the expressions produced during the process were junky in a way that's characteristic of the way they are evolving: mixing together two honed expressions would produce much more extreme deviations in 'children' than the offspring of two of these hairy expressions.
Not only are the expressions tending to get more accurate with each generation, but their degree of 'evolvability' is also being selected for.
- find "small" regex that defines the language for the example — hard problem
- minimize the regex (the language stays the same) — can be solved efficiently
"When there are less than 200 examples, the training set and validation set have the same size. Otherwise, the training set is composed of 100 examples and the other examples constitute the validation set."
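Read literally, the quoted rule amounts to something like this (a sketch; the paper's exact procedure may differ in details such as shuffling):

```python
def split_examples(examples):
    """The split rule quoted above: below 200 examples, halve the data;
    otherwise a fixed 100-example training set, rest for validation."""
    if len(examples) < 200:
        half = len(examples) // 2
        return examples[:half], examples[half:]
    return examples[:100], examples[100:]

train, valid = split_examples(list(range(150)))
print(len(train), len(valid))  # 75 75
train, valid = split_examples(list(range(500)))
print(len(train), len(valid))  # 100 400
```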
Why divide the data into sets like that? Assuming the sample data is representative, wouldn't it be better to use much more for training, say 90% with 10% for validation?
The other replies, directly discussing overfitting, are spot on. However, I did want to point out explicitly that the stated assumption of representative sample data can be a fundamental challenge, not to be taken lightly.
They're hard to register and cost more. I registered http://canhekick.it through iwantmyname.com after someone snatched up canhekickit.com. It cost me $30 instead of $10.
It certainly does not have to be running on the server side. I'm guessing that it's just a matter of what environment the creator was comfortable with.
I attempted writing something similar when I was bored at work a year ago. I soon discovered how difficult this problem actually is. It's definitely beyond afternoon hacking. I'd love to get a peek at the secret sauce!
This makes me think. In 2012 people still think that genetic programming, a numerical optimization technique (and not the most efficient one), magically solves random machine learning problems.
I look forward to trying this once the HN server overload effect has worn off.
This is something I would have found useful a year or two ago; I considered writing it, but decided it would be impossible to get a usefully generalised regex. I'm glad people cleverer and less defeatist than me had the same idea, since it seems it is possible to get useful results after all :)
Is there any chance to look at the code (application)? I'd be interested in running this locally as I have a need for generation of regular expressions.
The "distance" of a regular expression could be its specificity (think CSS) or the size of the language described by the regular expression (with some adjustments to account for repetition, as a* describing an infinite language is probably not what is intended).
The real problem is that the space of possible solutions is very large, so you will have to do that lazily, which would make an implementation in a lot of languages rather annoying.
I'd also expect this algorithm to be biased towards regular expressions that are more specific than the user intended them to be, especially for example sets that are small or not very diverse.
I intended `a*` to mean `a` repeated zero or more times, meaning that `a*` defines a language of infinite size, as there are an infinite number of strings that match that pattern.
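The bounded "language size" idea can be made concrete by enumerating strings up to a cutoff length (a sketch; `max_len` and the two-letter alphabet are arbitrary choices for the demo):

```python
import itertools
import re

def language_size(pattern, alphabet="ab", max_len=4):
    """Count strings over `alphabet` of length <= `max_len` that the
    regex matches in full -- a bounded proxy for |L(r)|, needed
    because a pattern like a* describes an infinite language."""
    rx = re.compile(pattern)
    count = 0
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            if rx.fullmatch("".join(chars)):
                count += 1
    return count

print(language_size("a*"))     # 5: "", a, aa, aaa, aaaa
print(language_size("a|b"))    # 2
print(language_size("[ab]*"))  # 31: every string up to length 4
```

A more specific regex scores lower under this measure, so it could serve as the "distance" suggested above.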
The search space is too large - there are too many combinations to search. Genetic programming makes it faster to find a good solution. The solution might not be the optimal one, but it's good enough :-)
So, every time you share a link, you always test it first to see if it can handle the load?
Ah, no, looking at your profile, you have never shared a link. I can only conclude that you find lots and lots of things that you are eager to share with us, but when you rigorously test them you find that they can't handle the load, so you sadly decline to submit them.
Thanks for the feedback. I'll be sure to keep it in mind.
This is a community for geeks. I believe it is a realistic expectation that one should be wary of sharing with tens of thousands of readers sites that run processor-intensive server-side scripts.
This is a community for geeks ( although there are those who claim otherwise: http://news.ycombinator.com/item?id=4681146 ). I believe it is a reasonable expectation that people will want to see things and will understand that sometimes they will need to spread the load. Cutting edge technology sometimes can't handle the load, and geeks will know that, understand, and make allowances. I would have thought that people in this community would want to see new stuff, rather than have it withheld simply because someone thought the server might not be able to handle the load.
How can one tell if the server is going to be able to handle the load? Perhaps the owners had anticipated that this would be of interest and were ready to handle it?
How could I tell?
I think it's a reasonable expectation in a community of geeks that if you are to level criticism at me, then you should offer alternatives and constructive suggestions, and I look forward to your reply as to what one should do when considering submitting a link like this.
I didn't give suggestions on how to test it because I figured someone who knows how to automatically generate regular expressions from examples would also know how to test it under load (or that it should be tested before being made public). It should not be that difficult to simulate simple user behavior, especially since the majority of the functionality in this case is tied to (presumably) one method in the code. Run it 100/1,000/10,000 times simultaneously behind the scenes with a bunch of test examples and see if your server vomits. Pretty easy.
Honestly, I'm a bit dismayed that you are challenging me on this. "Load-test your code before releasing it" should be a strict requirement for anything that is submitted here. Otherwise, how are people expected to give feedback? Why is our time being wasted on stuff that does not work?
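A minimal sketch of that kind of check, with a placeholder function standing in for the real HTTP request (swap in an actual call, e.g. via urllib.request, against a staging server):

```python
import concurrent.futures
import time

def hit_endpoint(example_set):
    """Placeholder for one request to the regex-generation service."""
    time.sleep(0.01)  # stand-in for server-side work
    return "ok"

def load_test(n_clients=100):
    """Fire n_clients simulated requests at once and report how many
    complete -- the 'see if your server vomits' check from above."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_clients) as pool:
        start = time.monotonic()
        results = list(pool.map(hit_endpoint, [["foo", "bar"]] * n_clients))
        elapsed = time.monotonic() - start
    return sum(r == "ok" for r in results), elapsed

completed, elapsed = load_test()
print(f"{completed} requests completed in {elapsed:.2f}s")
```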
Thanks a lot for your interest and patience.