InstaParse: Context-Free Grammars as Easy as Regular Expressions (github.com/engelberg)
126 points by brudgers on March 5, 2015 | 40 comments


Having parsers at hand in a language is an important feature.

In Opa (http://opalang.org), we decided to have parsers as first-class objects: cf. https://github.com/MLstate/opalang/wiki/The-core-language#pa...

And this is used at least 35 times in a real project such as PEPS: https://github.com/MLstate/PEPS

(according to `git grep "= parser" | wc -l`)


Are you suggesting that parsing capability should be part of a language's standard library, or just that it is useful to have good open source libraries available for the purpose?

Anecdotally, my experience with Scala's parser combinators has led me to the latter conclusion but not the former. Parsers and parser generators are very specific tools which (as useful as they are) don't see use in most applications. This leads you to a situation like the one in Scala, where open source libs such as Parboiled ( http://git.io/xpg3 ) are far better than the std lib offering because they receive more attention.


I think the language should support the creation of parser generators such that the output of the generators can be used in a straightforward way (as opposed to the Yacc way, where you have to write difficult makefiles to include the generator output in the build process).


Ruby DSLs are great: you get most of the benefit of a parser and your own language, with Ruby syntax and semantics for free. You just eval the code in the context of your DSL class.


The algorithm used is Generalized LL parsing (http://dotat.at/tmp/gll.pdf), which promises linear-time parsing of LL grammars and cubic-time parsing of arbitrary grammars. For some reason this isn't mentioned until the very end.


The biggest news here is the existence of a current ClojureScript port:

https://github.com/lbradstreet/instaparse-cljs


I'm the author of the ClojureScript port. I'd really love to hear from anyone who eventually uses it, as I'm getting it into shape so it can eventually be merged upstream.

All the tests pass and it works for my use case, but it'd be great to get some more data points.


I've used InstaParse with great success for an IMAP client. It's well documented and delivers as promised. Kudos to the author.


Parsing declaratively (by e.g. declaring EBNF) is really interesting.

It sort of mushes together the two steps ("lexing" and "parsing"), which, as far as I can tell, are traditionally distinguished from one another.

I guess that the "token" definitions really just reside within the EBNF rules.


Separating "lexing" and "parsing" can still be good for performance, since "lexing" needs no auxiliary storage (not even a stack) and cannot backtrack. The author points this out in the performance documentation: https://github.com/Engelberg/instaparse/blob/master/docs/Per...
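To make the performance point concrete: a lexer is typically a single left-to-right regex scan over the input, with no stack and no backtracking across token boundaries. A toy Python sketch (the token names and language are made up for illustration):

    import re

    # A hypothetical token set for a toy expression language.
    TOKEN_RE = re.compile(r"""
        (?P<NUMBER>\d+)
      | (?P<PLUS>\+)
      | (?P<STAR>\*)
      | (?P<WS>\s+)
    """, re.VERBOSE)

    def tokenize(text):
        """One left-to-right scan: no stack, no backtracking across tokens."""
        tokens = []
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise SyntaxError(f"bad character at {pos}: {text[pos]!r}")
            if m.lastgroup != "WS":          # drop whitespace tokens
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens

    print(tokenize("12 + 3 * 4"))
    # [('NUMBER', '12'), ('PLUS', '+'), ('NUMBER', '3'), ('STAR', '*'), ('NUMBER', '4')]

The parser proper then works over the token stream, which is where the stack (and in instaparse's case, the GLL machinery) actually gets used.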


This is the best library of its kind that I've seen. Shockingly intuitive compared to something like Treetop for Ruby.


I have to say - for parsers I really like the scala library parboiled2 ( http://git.io/xpg3 ). Its DSL for defining PEGs is very intuitive and should be easy to pick up for anyone.

It's also supposed to be pretty performant, although I have never seen a side-by-side comparison with ANTLR and friends.


I love instaparse. It's been an inspiration for some of the work I've been doing recently with lpeg.

I also contributed the viz hooks, back in the day, finding it difficult to work with parsers without an antlr-style tree printer. It's a fun time to be into parsers; there's a lot going on in the space.

Try decompiling a grammar into combinators!


This looks really cool, but I'm having a tough time using it from within a Java app.

My knowledge of Clojure is very limited, but it appears to run on the JVM and should work. It installs via Maven just fine. Running it, however, is currently beyond my ken.

A little sample Java code would open this fine piece of software to a much wider audience.


Calling the Clojure API directly from Java would look something like the following:

    import clojure.java.api.Clojure;
    import clojure.lang.IFn;
    import clojure.lang.PersistentVector;

    public class Core {

        public static void main(String[] args) {
            String grammar1 = "S = AB*\n" +
                              "AB = A B\n" +
                              "A = 'a'+\n" +
                              "B = 'b'+";

            // Load the instaparse.core namespace before looking up its vars.
            Clojure.var("clojure.core", "require").invoke(Clojure.read("instaparse.core"));
            IFn parser = Clojure.var("instaparse.core", "parser");
            IFn asandbs = (IFn) parser.invoke(grammar1);
            PersistentVector result = (PersistentVector) asandbs.invoke("aaaaabbbaaaabb");

            System.out.println(result.get(0).toString());
        }
    }
But my suggestion would be to build a Java-friendly API in Clojure, with gen-class, and then consume that.


One thing I really like about LALR and LR parsers is that they warn you about all ambiguities. IMO that's one of the most useful things when designing a syntax.

People (and Wikipedia) like to pretend that e.g. PEG grammars cannot be ambiguous, because the PEG "choice" operator is ordered. In reality, that doesn't make them unambiguous, it just hides the ambiguity. By the same logic, you could say that LR grammars cannot be ambiguous either; the parser generator will always produce a working parser. The difference is, LR parser generators warn you about the parts of the grammar where the generator had to make a choice (and their choices are well-documented and not arbitrary), while PEG parsers don't.
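To make the "hidden ambiguity" concrete, here is a hand-rolled sketch of PEG ordered choice (toy Python, not any real library's API). As a CFG, S -> ("a" | "ab") "c" accepts "abc" only via the "ab" branch; a PEG choice that lists "a" first commits to it, never reconsiders, and the parse silently fails:

    def lit(t):
        """Match a literal string at position i, returning the new position."""
        return lambda s, i: i + len(t) if s.startswith(t, i) else None

    def seq(*ps):
        """Match each parser in order."""
        def parse(s, i):
            for p in ps:
                i = p(s, i)
                if i is None:
                    return None
            return i
        return parse

    def choice(*alts):
        """PEG ordered choice: commit to the first alternative that matches."""
        def parse(s, i):
            for alt in alts:
                r = alt(s, i)
                if r is not None:
                    return r          # later alternatives are never tried
            return None
        return parse

    # Same alternatives, different order -- different languages accepted.
    s1 = seq(choice(lit("a"), lit("ab")), lit("c"))
    s2 = seq(choice(lit("ab"), lit("a")), lit("c"))
    print(s1("abc", 0))   # None -- committed to "a", then "c" fails on "bc"
    print(s2("abc", 0))   # 3   -- succeeds

An LR generator fed the equivalent CFG would surface the decision point as a diagnostic; the PEG version just quietly resolves it by order.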


Er, no, they don't. Detecting ambiguities is undecidable [1] -- LALR and LR parser generators don't tell you about ambiguities. They only tell you about shift/reduce and reduce/reduce conflicts. Those do not necessarily imply the existence of ambiguities.

[1] https://en.wikipedia.org/wiki/Ambiguous_grammar#Recognizing_...


It's undecidable for context-free grammars, but LR grammars are a deterministic subset of context-free grammars. If I understand "deterministic" correctly, it implies unambiguous.


No, conflicts in LR grammars imply nothing about the ambiguity of the grammar. Only the absence of conflicts implies the grammar is unambiguous.

Remember that ambiguity only refers to whether or not there can be multiple derivations for the same string, not whether the parser action is ambiguous. For example, consider:

  S: "a" "b" "c" | A "b" "d"
  A: "a"
This grammar has a shift/reduce conflict under LR(1), but it is unambiguous.

You could even make it worse by:

  S: "a" B "c" | A B "d"
  A: "a"
  B: "b" | B "b"
in which interpreting this as an LR(k) grammar results in conflicts for every finite k, even though the grammar is still unambiguous.
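The unambiguity of the first grammar above can be checked mechanically by counting derivations with a throwaway brute-force enumerator (a Python sketch, nothing any real tool would use):

    # S: "a" "b" "c" | A "b" "d"
    # A: "a"
    GRAMMAR = {
        "S": [["a", "b", "c"], ["A", "b", "d"]],
        "A": [["a"]],
    }

    def derivations(sym, s):
        """Count the ways `sym` can derive exactly the string `s`."""
        if sym not in GRAMMAR:                 # terminal symbol
            return 1 if s == sym else 0
        return sum(count_seq(rhs, s) for rhs in GRAMMAR[sym])

    def count_seq(syms, s):
        """Count the ways a sequence of symbols can derive `s`, over all splits."""
        if not syms:
            return 1 if s == "" else 0
        head, rest = syms[0], syms[1:]
        return sum(derivations(head, s[:k]) * count_seq(rest, s[k:])
                   for k in range(len(s) + 1))

    # Exactly one derivation each: the grammar is unambiguous,
    # even though an LR(1) generator reports a shift/reduce conflict.
    print(derivations("S", "abc"))  # 1
    print(derivations("S", "abd"))  # 1

This only verifies specific strings, of course; ambiguity over all strings is exactly the undecidable question from the sibling comment.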


This reminds me of Prolog's definite clause grammars.

http://en.wikipedia.org/wiki/Definite_clause_grammar


A couple questions:

1. Can it parse its input using an LR(k) grammar in time linear in its input when k > 1?

2. Is there any other parser known to be able to achieve the above in practice?


I am a fan of parser generators, but isn't a good DSL (think Parsec) a nicer answer than passing in a string?

EDIT: this is still way cool in itself.


> but isn't a good DSL (think Parsec) a nicer answer than passing in a string?

EBNF is a DSL.

I happen to like it more than many APIs because of its history, stability, and my experience with it. But that's an opinion.


But this isn't a DSL in the "in the language" sense, it's just a string. I like EBNF a lot too; I think writing (| A B) in Clojure is not that much worse than writing "A | B", and you get the advantages of manipulating a language-level object instead of an opaque string.

Let's not forget that string-based programming is the source of many, many bugs.


I also think this string format is the right default choice, since most of the time I'm building parsers that don't change or need to be built programmatically.

But Instaparse has exposed all the functionality as an "in the language" DSL as well using parser combinators in a map structure (near the bottom of the README).

You can also turn a string specification into a grammar map/combinator specification, which I have used when a part of my parser was generated at run time.


> Let's not forget that string-based programming is the source of many many bugs

Since the parser generator compiles the EBNF into an executable function at runtime, the solution is the same as for code written in Ruby, JavaScript, or any other dynamic language: write a unit test.
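For instance, when the grammar text only exists at runtime, the test just exercises the compiled artifact like any other dynamic code. A generic Python sketch (a regex stands in for the grammar compiler; `compile_grammar` is a made-up name, not instaparse's API):

    import re

    def compile_grammar(pattern_text):
        """Stand-in for a runtime grammar compiler: text in, callable parser out."""
        pat = re.compile(pattern_text)
        def parse(s):
            m = pat.fullmatch(s)
            if m is None:
                raise ValueError(f"no parse for {s!r}")
            return m.groups()
        return parse

    # The "grammar" arrives as data at runtime...
    ab_parser = compile_grammar(r"(a+)(b+)")

    # ...so the safety net is an ordinary unit test over the compiled function.
    assert ab_parser("aaabb") == ("aaa", "bb")
    try:
        ab_parser("abba")
    except ValueError:
        pass
    else:
        raise AssertionError("expected a parse failure")
    print("grammar tests passed")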


The name you want for "in the language" is EDSL: Embedded DSL. DSL is quite often used for library-level constructs but really, any concrete syntax (e.g. graphviz's dot format) is a DSL.


Does it really matter? I mean, I'd just fetch it from a file. I think having it as a string gives much more flexibility than being stuck with the host-language DSL.


Instaparse is great. I've used it just enough to know how powerful it is, in a weekend project to write an org-file parser.


Is there anything like this for python?


I usually don't point this out, but:

Come, let us go down and confuse their language so they will not understand each other.

Like the issue with regexes, it's probably better to use nice English words in code to describe a problem's solution, so that other people can understand it faster (before they get demotivated).


English words don't solve the real difficulty with regexes, mathematical regular expressions, or parsers. Describing state machines that do accurately and precisely what we want, though easier than writing an optimizing compiler, is usually non-trivial.

The amazing thing about regexes is that they make describing some state machines easy.
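For example, compare spelling out the state machine for a+b+ by hand against the six-character regex (a hand-rolled Python sketch):

    import re

    def matches_apbp(s):
        """Hand-written DFA for the language a+b+ (one or more a's, then b's)."""
        state = "start"
        for ch in s:
            if state == "start" and ch == "a":
                state = "saw_a"
            elif state == "saw_a" and ch == "a":
                state = "saw_a"
            elif state in ("saw_a", "saw_b") and ch == "b":
                state = "saw_b"
            else:
                return False
        return state == "saw_b"

    # The regex says the same thing in six characters.
    REGEX = re.compile(r"a+b+")
    for s in ["aab", "abbb", "", "ba", "aba"]:
        assert matches_apbp(s) == bool(REGEX.fullmatch(s))
    print("DFA and regex agree")

The notation compresses the machine; it doesn't make envisioning the accepted language any easier, which is the point above.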


English is a common base. Not everyone understands regexes or enough about math. If you use a regex, some people might not be able to understand and change it if necessary.

Well, most of the time, no one else needs to understand it ...


Is calling '*' "Kleene_Star" a marked improvement?

The hard part of regular expressions is envisioning the languages they accept. It's not primarily a difficulty with notation as with the concepts the notation encodes.


The * or Kleene star describes a sequence of values from a specific set. A nice function name would be "many", like in Haskell's Parsec. Comparing *, "Kleene_Star", and "many": when the function is named "many", its meaning is easy to infer without consulting the docs.
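A sketch of what a "many" combinator looks like as a plain function (assumed names, in the spirit of Parsec rather than its actual API):

    def many(p):
        """Apply parser p zero or more times; the name says what the star means."""
        def parse(s, i):
            results = []
            while True:
                r = p(s, i)
                if r is None:
                    return results, i
                value, i = r
                results.append(value)
        return parse

    def char(c):
        """Match a single literal character, returning (value, next position)."""
        return lambda s, i: (c, i + 1) if i < len(s) and s[i] == c else None

    many_a = many(char("a"))
    print(many_a("aaab", 0))  # (['a', 'a', 'a'], 3)
    print(many_a("bbb", 0))   # ([], 0)

Note that "many" never fails: matching zero times is a success, exactly like * in a regex.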


It's a quote from the story of the Tower of Babel, in the Bible. https://www.biblegateway.com/passage/?search=Genesis%2011&ve... (Verse 7)


That's why I usually don't point this out. Some people could get infected or something.


Oh man, I totally misunderstood your comment. Never mind!


"...as Easy as Regular Expressions"

I see you too... like to live dangerously. *tips hat*


Yes, my reaction to that was "so it's really, really hard to use for anything but trivial use cases?"



