Parser Combinators

kqr

, published 2015-03-06

Tags:

I have to deal with a bunch of textual data at work. The target format for it is HTML, but the way things turned out means that there will be more markup than actual text; there is very little actual text, but it has a lot of structure.

In fact, here's an example of what I want one entry out of many to look like.

Steering wheel: A big circular thing inside a car that allows the driver to steer it.

All vehicles do not use steering wheels. Examples of vehicles that don't include
- trains (because they can't turn)
- Apollo spacecraft (because it needs to turn along more than one axis, which requires more complicated control mechanisms. A steering wheel is limited to input along one axis.)

In essence, a little custom-made dictionary with very short entries full of cross-references. What's interesting about this is that a lot of the structure the HTML exposes is actually implicit. What I mean by that is that it is possible for the machine to figure out where links should lead, which phrase is being defined, how to format "see also" entries, and so on.

Alternative Formats

HTML is therefore not a good choice to write this text in. I briefly explored others, including my trusty old friend JSON, but none were quite terse enough to make it convenient without sacrificing important expressive power.

Naturally, I invented my own format. It looks like this:


Steering wheel {
    A big #{circular} thing inside a #{car} that
    allows the #{driver} to steer it.

    All vehicles do not use steering wheels. Examples
    of vehicles that don't include

    | #{trains} { because they can't turn }
    | Apollo spacecraft {
        because they need to turn along more than
        one axis, which requires more complicated
        control mechanisms. A steering wheel
        is limited to input along one axis.
    }
}

This is convenient enough to write in while retaining all expressive power I need. Since I have to convert this format to HTML anyway, I can also include some convenient automation mechanisms, such as sorting entries in alphabetical order before I output the final HTML list.

Parser Combinators

Well, there's one hurdle left. One important reason people are averse to using custom formats for things is that parsing is a pain. Not only do I have to generate HTML – now I also have to parse this weird format!

I was initially looking at Python for creating the script to convert the custom format to HTML, but I slowly realised what I was getting myself into. Parsing is just not fun.

Unless! I can use Haskell... Because with Haskell, I get to use Parsec. Being a parser combinator library, Parsec makes parsing easy and fun. I'm sure there are parser combinator libraries for Python too by now, but Parsec was an early player and it happens to be the one I know.

The basic idea of parser combinator libraries is that if you want to parse a comma-separated list of strings, you first write a parser for strings, and you separately write a parser for "things which are separated by commas". If you then feed the first as an argument to the second, you get a parser that parses specifically strings separated by commas. Just like higher-order functions, but with parsers instead.

Parser combinators is a very natural way to approach parsing, and you can get some decent performance out of it as well. (The attoparsec package claims that it can be "realistic to expect it to perform similar to a hand-rolled C parser")

The Result

I'm not completely finished yet, since this little project has a pretty low priority on my schedule, but given that I can parse everything except cross-references in literally 20 lines of code, I'm expecting the full parser to be within 30 lines of code.

30 lines of code! To parse all of my custom format up there. That is pretty slick.

And I'm not even sure that's the best part. The best part might very well be that the code reads so well. It is almost entirely declarative. If you have learned how to read it, you can quickly skim through it and know what every part of it is concerned with. Reading parser combinator code feels a lot like reading plain BNF syntax. Here's an example of how I parse one of those "see also" entries:

seeAlso = do
  name <- char '|' *> pureText <* char '{'
  desc <- pureText <* char '}'
  return (name, desc)

This reads quite naturally as "the name is pure text preceded by a pipe and followed by an opening curly brace, and then the description is pure text followed by a closing curly brace".