Wednesday, May 30, 2007

"The Definitive ANTLR Reference" is not definitive

I bought the PDF of this book for US$25, which is not a bad deal, but it should be mentioned that the book lacks some important information, such as:
  • an overview of changes from v2 (e.g. shouldn't we be given a few small v2 and v3 grammars side-by-side, highlighting the differences? what kinds of real-world grammars work in v3 that didn't work in v2? other than left recursion, what sorts of grammars still cause trouble?)
  • yes, LL(*) can handle lots of grammars, but at the cost of scanning a token or character stream repeatedly. How can a grammar be designed instead for good performance?
  • a complete reference of all the things you can put in a grammar file. There's a "basic structure" on p.90, but it's VERY brief. There's no complete list showing all available sections like options {}, @header {}, @namespace {}, @members {} ... How can it be a "definitive reference" without this?
  • a class hierarchy diagram and a summary of the methods in important classes like CommonToken, Lexer, Parser, etc. Again, "reference"? The lack of reference material in the book is quite annoying because as far as I know, the book is the only reference there is.
  • how to make a lexer-only or parser-only grammar. To do this you must write "lexer grammar foo;" or "parser grammar foo;" instead of "grammar foo;" but this is never pointed out specifically. On p.64 there is a sentence that sort of implies lexers should start with "lexer grammar", while "parser grammar" is mentioned for the first time in a code example on p.134, but it is treated like something the reader should have already known.
  • how to use ANTLR in a non-Java environment. Yeah, you use options{language=CSharp;} or whatever, but then what? where do I get the runtime? how are the runtimes of other languages different from the Java runtime? The book ignores non-Java issues completely.
  • how to create "unknown" tokens, i.e. tokens for groups of characters that the lexer has no specific rule for. There's a "filter" option to discard unrecognized characters, but in my compiler, I want to create tokens for them and actually give them to the parser.
  • how to support all kinds of text file encoding schemes: UTF8, UTF16, ASCII, files using specific "codepages", MBCS, ...
  • complete examples. Mostly just snippets are shown with links to complete source files online. Unfortunately, as of this writing, example grammars online (such as this one) have no line breaks in them.
  • the discussion of lexers is a bit impoverished. No discussion of the proper way to tell apart '/*' from the sequence of two tokens '/' '*', for example. Normally tokens contain the indexes of the starting and ending character; so what happens when I use the ! suffix? Can I have a catch-all rule, invoked only if the other rules fail (because I need one)?
  • an overview of the whole system - I mean there's an overview of concepts on p.22, but there isn't a concise overview of basic nuts & bolts that people need to know such as: how do the lexer and parser get connected together; what are the basic classes and their capabilities/responsibilities in the system (CommonTree, Token, Parser, Lexer, ...); how can ASTs and lists of tokens be traversed manually, inside semantic actions and outside ANTLR grammar files; what functionality is provided by ANTLR (e.g. line-number counting and error message generation) versus what must/should/can be written by the user.
Perhaps a lot of this information is in the book and it's just not organized how I might like. The information you need to write a complete lexer-parser and run it is interspersed with explanations of the basic concepts of recursive-descent parsing--explanations I don't want to read because I'm already an experienced programmer who has used v2 before. The babysitting and nitty-gritty details are mixed in such a way that it's hard to skim the text looking for the technical details I want. There may be some things in the list above that are in the book but that I just haven't been able to find yet.
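To make a few of the lexer questions above concrete (telling '/*' apart from '/' followed by '*', tokens that carry start and stop indexes, and a catch-all rule for characters no other rule claims), here is a rough sketch of the behaviour I have in mind. It is a tiny hand-written lexer in Python, not ANTLR, and nothing in it comes from the book or from the ANTLR runtime: the `Token` tuple, the rule names and `tokenize` are all made up for illustration.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical token type; not ANTLR's CommonToken."""
    kind: str
    text: str
    start: int  # index of the first character
    stop: int   # index of the last character (inclusive)


# Rules are tried in order at each position, so COMMENT is listed before
# SLASH: "/*" opens a comment instead of lexing as '/' followed by '*'.
# UNKNOWN comes last and soaks up runs of characters no other rule starts with.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),
    ("SLASH", r"/"),
    ("STAR", r"\*"),
    ("ID", r"[A-Za-z_]\w*"),
    ("WS", r"\s+"),
    ("UNKNOWN", r"[^\sA-Za-z_/*]+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # let a comment span several lines
)


def tokenize(text: str) -> Iterator[Token]:
    """Yield every token except whitespace, including UNKNOWN ones."""
    for match in MASTER.finditer(text):
        if match.lastgroup == "WS":
            continue  # dropped here; a real compiler might keep it off-channel
        yield Token(match.lastgroup, match.group(), match.start(), match.end() - 1)


if __name__ == "__main__":
    for token in tokenize("a /* c */ / * #$ b"):
        print(token)
```

Here the run "#$" reaches the parser as a single UNKNOWN token rather than being silently discarded, which is the kind of behaviour I would like the book to show how to get from an ANTLR lexer.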

I suspect the book would be much more useful if it were separated into sections according to use cases. I mean, there are lots of different reasons a person would use ANTLR:
  • Translating code from one syntax to another, with or without StringTemplate
  • Taking an AST (or a token list), doing modifications/transformations on it, and outputting the result while keeping original spacing and comments intact (apparently you can't use ANTLR to transform an AST, but keeping the spacing and comments intact is supposed to be "easy" with ANTLR v3, right? But how do you do it? Can't I do it without StringTemplate?)
  • Gathering information from a source file without necessarily doing a complete parse
  • Writing a compiler/interpreter for a small, simple domain-specific language
  • Writing a complex full-scale compiler--this may require "pulling out all the stops", using all of ANTLR's syntax and many of the classes in ANTLR's runtime. Since this is my use case, it's the one I would most like to read about.
  • Writing an extension to an existing compiler
  • Creating a new language target for ANTLR using StringTemplate
Perhaps there should have been a section for each specific use case, with a quick overview of ALL the things a user might need to do to complete his/her task. Since the use cases overlap, each one perhaps should be pretty short and consist mostly of pointers to other parts of the book where more details can be found.

(Note: certainly there are sections geared toward specific use cases, but in general if you're trying to accomplish task X, there isn't an island paradise made just for you within the book.)

Now I wouldn't put such a high standard on free documentation of course, but this is a paid-for book entitled "The Definitive ANTLR Reference". It's a title that gives the impression that
  • it contains all the information you need
  • it is a complete reference
Well, it doesn't and it isn't.

But I guess I'm being overly harsh. I probably am. There are a lot of people in the world who yell "You did it all wrong! I know how things should be done!", when in fact if they were in charge, we'd all be worse off. And then some of those people end up getting elected, but I digress... The point is, I haven't read the book that carefully and I'm just letting off steam.

Terence has put a lot of his free time into the book and ANTLR (or so I would assume) and he deserves compensation for it. I certainly don't regret paying for the book... but the fact remains, I'm disappointed. There's lots of space in the margins, maybe the extra information I want to see could go there...

2 Comments:

At 11/20/2008 8:48 PM, Blogger Luke said...

Thank you for your comments -- I'm just starting out with ANTLR and compiler-generators; knowing what is missing can be quite useful. Otherwise, I'm stuck thinking, "hmmm, why doesn't he talk about this" and wondering whether I am the one not thinking correctly.

P.S. The online examples, such as the one you pointed out, actually do have line breaks if you do "view source". They're just served as text/html, instead of text/plain, so the browser uses HTML parsing and doesn't render line breaks.

P.P.S. Are you still doing ANTLR work? I'm working with it and documenting stuff as I go.

 
At 11/21/2008 9:31 AM, Blogger Qwertie said...

Wow, I never knew there were line breaks! Even if you use "save as" in the browser, the browser deletes line breaks from the output file (you'll recall that normally line breaks are preserved if you save a .html file in the browser).

I'm not using ANTLR any more. Is the C# back-end mature enough and documented yet?

 
