Wednesday, 22 June 2011

Simple, functional, sound and complete parsing for all context-free grammars


Parsers for context-free grammars can be implemented directly and naturally in a functional style known as "combinator parsing", using recursion following the structure of the grammar rules. However, naive implementations fail to terminate on left-recursive grammars, and despite extensive research the only complete parsers for general context-free grammars are constructed using other techniques such as Earley parsing. Our main contribution is to show how to construct simple, sound and complete parser implementations directly from grammar specifications, for all context-free grammars, based on combinator parsing. We then construct a generic parser generator and show that generated parsers are sound and complete. The formal proofs are mechanized using the HOL4 theorem prover. Memoized parsers based on our approach are polynomial-time in the size of the input. Preliminary real-world performance testing on highly ambiguous grammars indicates our parsers are faster than those generated by the popular Happy parser generator.

Keywords: verified, mechanized, formal, parsing, sound, complete, functional, combinator

[ridge11parsing-cpp.pdf] [additional material]
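As a hedged illustration of the idea in the abstract (the actual implementation and proofs are in OCaml/HOL4; all names below are mine, and this is one way to realize the technique, not the paper's code): parsers work on an explicit substring (lo, hi) of the input, sequencing tries every split point, and a nonterminal refuses to re-enter itself on the same substring. That blocking is what makes left-recursive grammars terminate.

```python
# Hypothetical Python sketch of context-blocking combinator parsing.
# A parser answers: does the grammar symbol derive exactly s[lo:hi]?
# The context ctx is a set of (nonterminal, lo, hi) entries recording
# which nonterminals are already being parsed on which substring.

def term(t):
    def p(s, lo, hi, ctx):
        return s[lo:hi] == t            # terminal must match the span exactly
    return p

def seq(p, q):
    def r(s, lo, hi, ctx):
        # Try every split point: p covers s[lo:mid], q covers s[mid:hi].
        return any(p(s, lo, mid, ctx) and q(s, mid, hi, ctx)
                   for mid in range(lo, hi + 1))
    return r

def alt(p, q):
    def r(s, lo, hi, ctx):
        return p(s, lo, hi, ctx) or q(s, lo, hi, ctx)
    return r

def nonterm(name, body):
    def r(s, lo, hi, ctx):
        if (name, lo, hi) in ctx:       # already parsing name on this span:
            return False                # block, guaranteeing termination
        return body()(s, lo, hi, ctx | {(name, lo, hi)})
    return r

# A left-recursive grammar: E -> E "+" E | "1"
def E():
    return alt(seq(nonterm("E", E), seq(term("+"), nonterm("E", E))),
               term("1"))

def parses(s):
    return nonterm("E", E)(s, 0, len(s), frozenset())
```

Because sequencing enumerates all split points, blocking a repeated (nonterminal, span) pair loses no distinct parse: any parse tree that revisits the same nonterminal on the same span can be replaced by the inner subtree.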

9 comments:

  1. Performance on well-bracketed expressions is now fixed (i.e. linear) by dropping memoization (something in Hashtbl appears to be O(n^2)). Performance on XML is still not linear for some reason; investigations ongoing.

  2. The additional material now contains lots of grammars, test data, and performance graphs.

  3. Significant performance improvement: currently, the context contains substrings, i.e. triples (s,l,h). For a given run of the program, the string component s is always constant, yet the hashtable often computes a hash of the input, which includes computing a hash for the context, which in turn includes computing a hash for s. This happens many times, and if s is long (it is typically several MB) it gets really slow. The solution is to remove the s component from the context. For xml3.g, with memoization, this reduces the time for ab160.xml from 52 seconds to 3 seconds. Pretty good, but unfortunately still not linear.

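The fix in comment 3 can be sketched as follows (hypothetical Python, not the actual OCaml code; `memoized` and `toy` are my names): since s never changes during a run, the memo/context key only needs the indices, and hashing a key that embeds a multi-megabyte string forces that string to be re-hashed on lookups.

```python
# Hypothetical sketch: memoize a parser on a key that identifies the
# substring by indices only. The underlying string s is constant for
# the whole run, so including it in the key adds no information and
# (in the setting described above) makes hashing very expensive.

memo = {}

def memoized(nt, s, lo, hi, ctx, parse):
    # Slow variant would be: key = (nt, s, lo, hi, ctx)
    key = (nt, lo, hi, ctx)             # s dropped: constant for the run
    if key not in memo:
        memo[key] = parse(s, lo, hi, ctx)
    return memo[key]
```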
  4. Can now parse xml_no_newlines/mcs100.xml (a 1.4 MB XML file) in 0.5s, with linear growth. The remaining performance bug is in the ab_no_newlines dataset (see the performance graphs in the additional material).

  5. Remaining performance bug fixed: minor-heap garbage collection was kicking in too often, causing super-linear slowdown. With a largish minor heap, XML parsing is now linear on all the XML examples.

  6. This whole experience has made me think rather a lot about the need for verified proofs of performance, and about the problems of doing this, e.g. in the presence of garbage collection. An interesting PhD topic?

  7. Accepted to CPP 2011. http://formes.asia/cpp/

  8. I hacked up some more functionality: there is now basic support for "semantic actions". See the examples/tags.html file, subdirectory "actions", for some examples.

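For readers unfamiliar with the term in comment 8: a "semantic action" in combinator parsing is just a function applied to the value a successful parse produces. A generic sketch of the idiom (hypothetical Python, not the release's actual API; `apply_action`, `digit` and `number` are my names):

```python
# Hypothetical sketch: a semantic-action combinator. Here a parser takes
# an input string and returns a list of (value, remaining-input) results;
# the action transforms each value without touching the remaining input.

def apply_action(p, f):
    def r(inp):
        return [(f(v), rest) for (v, rest) in p(inp)]
    return r

def digit(inp):
    # Consume one leading decimal digit, returning it as the parse value.
    return [(inp[0], inp[1:])] if inp and inp[0].isdigit() else []

# Action: turn the matched character into an int.
number = apply_action(digit, int)
```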
  9. I guess I should also add that I suspect the O(n^5) time complexity could be improved with a bit more engineering (specifically, using a decent implementation of sets rather than repeatedly eliminating duplicates from a list). At best this would give O(n^4) complexity, which is still worse than the O(n^3) of Earley-style parsers.

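The engineering point in comment 9, in a hypothetical Python sketch (function names are mine): eliminating duplicates by scanning a list costs a linear scan per element, i.e. O(k^2) per pass, whereas keeping a set alongside gives an expected-O(1) membership test per element.

```python
# Hypothetical sketch: list-scan dedup vs. set-backed dedup.

def dedup_list(xs):
    out = []
    for x in xs:
        if x not in out:        # linear scan each time -> quadratic overall
            out.append(x)
    return out

def dedup_set(xs):
    seen, out = set(), []
    for x in xs:
        if x not in seen:       # expected O(1) membership test
            seen.add(x)
            out.append(x)
    return out
```

Both keep the first occurrence of each element in order; only the membership-test cost differs.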