Hacker Newsnew | past | comments | ask | show | jobs | submit | omcnoe's commentslogin

Been working on a toy compiler for fun recently.

I have ignored all the stuff about parsing theory, parser generators, custom DSL's, formal grammers etc. and instead have just been using the wonderful Megaparsec parser combinator library. I can easily follow the parsing logic, it's unambiguous (only one successful parse is possible, even if it might not be what you intended), it's easy to compose and re-use parser functions (was particularly helpful for whitespace sensitive parsing/line-fold handling), and it removes the tedious lexer/parser split you get with traditional parsing approaches.


It seems to me LL and LR parser generators are overrated, and hand-written recursive descent is best in practice. I understand why academics teach them, but not why some spend so long on different parsing techniques, nor why hobbyists who just want to compile their toy language are directed to them.

I work in PL, and from my first compiler to today, have always found recursive descent easiest, most effective (less bugs, better error diagnostics, fast enough), and intuitive. Many popular language compilers use recursive descent: I know at least C# (Roslyn) and Rust, but I believe most except Haskell (GHC) and ocaml.

The LR algorithm was simple once I learned it, and yacc-like LR (and antlr-like LL) parser generators were straightforward once I learned how to resolve conflicts. But recursive descent (at least to me) is simpler and more straightforward.

LR being more expressive than LL has never mattered. A hand-written recursive descent parser is most expressive: it has unlimited lookahead, and can modify parsed AST nodes (e.g. reordering for precedence, converting if into if-else).

The only solution that comes close is tree-sitter, because it implements GLR, provides helpful conflict messages, and provides basic IDE support (e.g. syntax highlighting) almost for free. But it’s a build dependency, while recursive descent parsers can be written in most languages with zero dependencies and minimal boilerplate.


> It seems to me LL and LR parser generators are overrated, and hand-written recursive descent is best in practice

I would now agree with that. My compiler experience was on a team that happened to have a LALR expert, Jeanne Musinski PhD, a student of Jeffrey Ullman. She invented a better error recovery for the language. Recursive descent would have been perfectly suited to the task.

> LR being more expressive than LL has never mattered.

Quite agree. One might guess that a language that needs that might be hard to program in.


Parser generators are great in Python (Lark for me) so you can iterate fast and get a runtime spec of your grammar.

A hand-written recursive descent parser is something you do later when you start to industrialize your code, to get better error messages, make the parser incremental, etc.

Bison/ANTLR are code generators, they do not fit well in that model.


I'll push back and say that the lexer/parser split is well worth it.

And the best thing about the parser combinator approach is that each is just a kind of parser, something like

  type Lexer = ParsecT e ByteString m [Token]

  type Parser = ParsecT e [Token] Expr
All the usual helper functions like many or sepBy work equally well in the lexing and parsing phases.

It really beats getting to the parentheses-interacting-with-ordering-of-division-operations stage and still having to think "have I already trimmed off the whitespace here or not?"


I am writing a whitespace sensitive parser - trimming whitespace matters because whitespace consumption is used to implement the indentation rules/constraints.

For example, doing things like passing an indentation sensitive whitespace consumer to a parser inside `many` for consuming all of an indented child block. If I split lexing/parsing I think I'd have to do things like insert indentation tokens into the stream, and end up with the same indentation logic (but instead matching on those indentation tokens) in the parser regardless.

I have found that order-of-operations is somewhat trivially solved by `makeExprParser` from `Control.Monad.Combinators.Expr`.


This is a hilarious, and also terrible, reason.

Why can't we let the FIPS people play in their own weird corner, while not compromising whole internet security for their sake? OpenSSL is too important to get distracted by a weird US-specific security standard. I'm not convinced FIPS is a path to actual computer security. Ah well it's the way the world goes I suppose.


No, only if your website abuses window.history.pushState to redirect the user to spam/ad content is it considered abuse.

There are subtle gotchas around sequential UUID compared to serial depending on where you generate the UUIDs. You can kinda only get hard sequential guarantee if you are generating them at write time on DB host itself.

But, for both Serial & db-gen’d sequential UUID you can still encounter transaction commit order surprises. I think software relying on sequential records should use some mechanism other than Id/PK to determine it. I’ve personally encountered extremely subtle bugs related to transaction commit order and sequential Id assumptions multiple times.


DB perf considerations aside, a lot of software pattern around idempotency/safe retries/horiz-scaling/distributed systems are super awkward with a serial pk because you don’t have any kind of unambiguous unique record identifier until after the DB write succeeds.

DB itself is “distributed” in that it’s running outside the services own memory in 99% of cases, in complex systems the actual DB write may be buried under multiple layers of service indirection across multiple hosts. Trying to design that correctly while also dealing with pre-write/post-write split on record id is a nightmare.


DB sequence will give you a unique ID before the transaction succeeds. If the transaction fails, there's just a gap in the IDs.

If some service that doesn't interact with the DB wants to define its own IDs, sure, but even then whatever writes to the DB can always remap that to serial IDs. I know there are use cases where that still doesn't make sense and you really need UUID PKs, but it's not the norm.


It also contains a wonderfully prescient question asked right at the end of the talk: "... the processor gonna speculate, doing some loads out of the bounds of the array, how does it work in the hardware that it doesn't crash?"

Left unanswered at the time. I believe Spectre was known but not publicly disclosed at this time.


Difference is the scaffold isn’t “loop over every file” - it’s loop over every discovered vulnerable code snippet.

If you isolate the codebase just the specific known vulnerable code up front it isn’t surprising the vulnerabilities are easy to discover. Same is true for humans.

Better models can also autonomously do the work of writing proof of concepts and testing, to autonomously reject false positives.


The methodology here is completely wrong, outright dishonest.

Finding a needle in a haystack is easy if someone hands you the small handful of hay containing the needle up front, and raises their eyebrows at you saying “there might be a needle in this clump of hay”.


Lightweight branch model of git mapped so much better to the way that actual development processes of medium to large projects really work(ed).

Named branches vs bookmarks in hg just means bike shedding about branching strategy. Bookmarks ultimately work more like lightweight git style branches, but they came later, and originally couldn't even be shared (literally just local bookmarks). Named branches on the other hand permanently accumulate as part of the repository history.

Git came out with 1 cohesive branch design from day 1.


That's a fair criticism for some workflows, and I like the lightweight model, but we should keep in mind the context of the time:

When these DVCS appeared, Git's branch design departed from what "branch" meant practically everywhere else. That added to its already significant learning curve, creating more friction for people trying (or being asked) to adopt it.

Meanwhile, Mercurial's "branch" was closer to well-established norms. This was one of several factors that made it the easier of the two to learn, which was was important when already asking people to uproot from their familiar centralized systems and learn the ins and outs of distributed version control. I suspect it also made repository migrations more straightforward, avoiding the impedance mismatch presented by Git's branches.


I work on a mercurial hosted project right now. What ticks me off is all those unnamed heads you need to handle every time you pull other people's changes. Yes they're more flexible. Most of the time that just means extra operations for no good reason.

Yeah, agreed. I liked the idea of Mercurial branches better than git's — in principle I prefer more rather than less metadata in history — but they genuinely had a scaling problem. I can't recall the numbers, this being more than a decade ago, but I tested with a realistic number of branches for a team of developers using short-lived branches for a while and you could easily see Mercurial slowing down.

Back when I was testing bookmarks were available, but Bitbucket was pretty much the only forge that supported Mercurial and their tooling didn't support bookmarks, so that made them a non-starter for many users.


That is very different from my experience with git. I know that the kernel uses branches a lot, but that's probably because of git's history with the project. At every company I worked git is used exactly the same way as CVS or SVN was used many years ago: you make some local changes, you push these local changes to the central store, you forget about it. Branches make local switching between tasks easier, but apart from that nobody cares about branches and they're definitely not treated as an important part of the repo. In fact, they're usually deleted immediately after the change is merged.

I think you have it swapped around. This is exactly the kind of workflow that git provided better support for - lightweight branches, not integral part of master history, deleted after merge.

ARM macs apparently have some kind of specific handling in place for this when a process is running with x86_64 compatibility, but it’s not publicly documented anywhere that I can see.

XNU has this oddity: https://github.com/apple-oss-distributions/xnu/blob/f6217f89...

Redacted from open source XNU, but exists in the closed source version


Is it actually redacted, or just a leftover stub from a feature implemented in silicon instead of software? Isn't the x86 memory order compatibility done at hardware level?

Redacted

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: