Monday, July 02, 2007

ANTLR 3 - Customizing token numbers

> I'm not sure if lexers support vocabulary importing (I know
> parsers do), but if they do then you should be able to do it that
> way -- make a tokens file and import it into the lexer. Worth a
> try, anyway :)

Ahh, of course, I should have tried that.

And happily, it works! But I found the following caveats.

There is a bug that occurs when ANTLR imports and then re-exports a backslash. So if I have

parser grammar FooParser;
options {
    tokenVocab=Foo;
}
...

lexer grammar Foo;
options {
    tokenVocab=Foo2;
}

then I get:

// In Foo2.tokens
'\\'=25

// In the generated Foo.tokens
'\'=25
'\\'=31 // Added by ANTLR

This causes a syntax error when compiling the parser. And I guess there is another bug in ANTLRWorks, because after the syntax error, ANTLRWorks keeps repeating the same error every time you try to Generate Code, until you quit and restart the program.

By the way, I found that

'\\\\'=25

seems to work as a single backslash.

There is another important caveat: ANTLR cannot handle "holes" when importing tokens into the parser, i.e. unused numbers in the list of tokens. You must start numbering tokens at 4 and continue up from there with consecutive integers. The problem is that the tokenNames[] array in the generated parser is packed with no empty elements, so if your tokens are

APPLE=4
GRAPE=5
LEMON=9
PEAR=10

then your token array will be
public static readonly string[] tokenNames = new string[]
{
"[invalid]",
"[eor]",
"[down]",
"[up]",
"APPLE",
"GRAPE",
"LEMON",
"PEAR"
};

Therefore, token name lookups will not work correctly: with the hole above, LEMON (type 9) and PEAR (type 10) index past the end of the eight-element array.
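To see concretely why the lookups break, here is a minimal sketch in plain Python (not ANTLR's actual generated code; the token names and numbers just mirror the example above):

```python
# Token types as defined in the .tokens file, with 6-8 left unused:
APPLE, GRAPE, LEMON, PEAR = 4, 5, 9, 10

# ANTLR packs the names consecutively after its four built-in entries,
# ignoring the gap in the numbering:
token_names = ["[invalid]", "[eor]", "[down]", "[up]",
               "APPLE", "GRAPE", "LEMON", "PEAR"]

def name_of(token_type):
    # Look up a display name the way the generated code does: by index.
    return token_names[token_type]

print(name_of(APPLE))   # "APPLE" - still lines up
print(name_of(GRAPE))   # "GRAPE" - still lines up
# name_of(LEMON) raises IndexError: type 9 indexes past the 8-element
# array, because the three unused numbers were never padded with entries.
```

A smaller hole would be even nastier: the lookup would silently return the wrong neighboring name instead of failing outright.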

On the plus side, you do not have to define all tokens in your .tokens file; ANTLR can add any additional tokens you define in the lexer and will number them correctly.
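For example (an illustrative sketch; these token names are made up, using the author's "// In ..." annotation style from above, not .tokens syntax):

```
// In the hand-written Foo2.tokens
APPLE=4
GRAPE=5

// In the generated Foo.tokens, after the lexer also defines LEMON and PEAR
APPLE=4
GRAPE=5
LEMON=6
PEAR=7
```

The extra tokens are appended with consecutive numbers continuing from the highest imported value, so no holes are introduced.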

P.S. I'm using the C# target; perhaps YMMV for Java etc.

1 Comment:

At 5/06/2010 5:00 PM, Blogger Allen said...

The same applies to ANTLR 3.2 using Java (and I had the problem in 3.1 also). In my case both grammars are from ANTLR. I had to make a copy of the first grammar's tokens file, manually change the backslashes into double backslashes, and then make sure the second grammar found the modified tokens file.

Your post was from 2007. Now it's 2010 and they still haven't solved this problem. :-(

 
