Hacker Newsnew | past | comments | ask | show | jobs | submit | ieviev's commentslogin

Oh i didnt even know of this. But i got a lot of help from this one

https://github.com/madler/infgen

EDIT: Didnt notice that it’s even by the same person of course it’s very similar


> lazy quantifiers for inclusion into the next release of POSIX

That is surprising!

We've found that in certain simpler scenarios it's possible to use complement to express lazy quantifiers, but in the general case it appears very fragile.

a simple example: a.*?b can be rewritten to something like (a.*b&~(a.*b_+)) in RE# syntax, which effectively means "there is a match, but must not contain a shorter match"

> 1998 TOPLAS

I will read through it properly, but i can already see from page 8 that it requires a table of (DFA size x input length) which makes me very suspicious that it is more of a thought exercise than a real world solution.

> It is true that you can’t stream the haystack in the general case, but to what precise extent you can is an interesting question with a known algorithmic answer[3].

Thank you for this, this is an interesting paper


>> lazy quantifiers for inclusion into the next release of POSIX

> That is surprising!

I mean, POSIX BREs (only!) also include (single-digit) backreferences. Surprisingly, (with such a restriction) this is actually polynomial[1,2] if impractical (h/t 'burntsushi for the reference[3]). But I still wouldn’t take POSIX as the arbiter of sanity in this case. Thus far I’m not even convinced their text actually amounts to a well-defined ordering on parse trees.

I don’t know if nongreedy quantifiers are all that interesting without match groups, though, so this isn’t a particularly burning question in my view.

[1] https://branchfree.org/2019/04/04/question-is-matching-fixed...

[2] https://github.com/travisdowns/polyregex

[3] https://news.ycombinator.com/item?id=40431198


I have not verified it but i assume matching the whole line grep-style would bypass this problem entirely

Since a regex with ^<pattern>$ can not have overlapping matches it should guarantee linearity, given a linear engine. Or in the case of preprocessing lines first and then running is_match it's also linear


It's still O(n * m). The matches are non-overlapping, even if you run a separate engine on the matches post-match it will at most traverse the full input once across all matches.


Ture, I forgot about the non-overlapping matchss and that that will cancel out any extra n parameter.

Actually rereading again, it almost seems like it should be possible to do the matching as a special case of the Thompson/Pike VMs and have it loop back around to the start, and deduplicate multiples, but I am only speculating.


I’m still open to it and thinking about it actually. I will explore if it’s possible to eliminate the large losses on common patterns and if it turns out it is then it’s a no brainer.

Going forward this and the extended operators + large pattern perf will hopefully be a strong selling point to gain more traction


Oh that is interesting! I haven't even looked Nim regex until now, is it similar to the approach in Go?


It's similar to RE2, but it lacks the on the fly DFA, ie: it's just the classic Thompson's NFA with some tweaks. It does not implement find all the same way, though.


> constraining oneself to fixed-upper-bound-length needles

wait! you haven't reached the important part of the post yet


I have not heard of this before, i will have a look!


Haha, you're right about that. I was looking for another word for "default"


Good catch! I changed this to leftmost-longest nonoverlapping matches so it's not misleading


> I changed this to leftmost-longest nonoverlapping matches so it's not misleading

Well, you changed the sentence I quoted. That doesn't really address what I was objecting to; I quoted that sentence because that's the first point in the essay where it's clear that you don't mean "all matches". That's where the reader becomes confused, but they don't become confused because that sentence is unclear - they become confused because everything else on the page is misleading, and that sentence unambiguously contradicts the misleading impression created by everything else.

Your headline says "all matches", your subheadline says "all matches", and your text both before and (still) after the sentence you changed frequently says "all matches", and in none of those cases do you actually mean "all matches". You mention that there is an existing solution, REmatch, but only to dismiss it as "solving a different problem", the problem of finding all matches. You also note that "all matches" is inherently quadratic because the size of the output is potentially quadratic, leading me to wonder why it's a surprise that asking for all matches yields quadratic performance.


I don't agree that overlapping matches is the only interpretation of "all matches". it very clearly does return all matches and it removes overlap

The post states clearly "finding all leftmost-longest non-overlapping matches without quadratic blowup"

The engines mentioned at the beginning all return nonoverlapping matches. Basic document search returns nonoverlapping matches.

Finding overlapping matches is exotic and rarely ever used in practice outside of network security. It does solve a different problem


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: