NOTE: This page contains deprecated information.

SableCC 4 Lexer Syntax

The syntax, for lexers, was changed in SableCC 4. As with earlier versions, the syntax aims at being highly intuitive and elegant.

SableCC 4 introduces very powerful features in its lexers: semantic selectors and investigators, internal states, and parser-controlled state transitions. As usual, all this is done without cluttering lexer specifications with target-language code fragments.

Highlights

  • The addition of an explicit Lexer keyword to delimit lexer related sections in language specifications.
  • As of SableCC 4, helpers, tokens, groups of tokens, and productions are in a single name space. This means, for example, that a helper and a token may not share the same name anymore. As a consequence, the T. and P. prefixes are eliminated.
  • The declaration order of tokens is now irrelevant; SableCC does not pick the earliest matching token anymore, when two (or more) tokens match a string. Instead, SableCC reports an error, unless an explicit priority is declared.
  • The addition of a Priorities section, to allow language developers to specify explicit priorities between tokens, when a string is matched by multiple tokens.
  • By default, SableCC now raises an error when a lexer requires infinite lookahead. This happens when a lexer needs to look ahead for an unbounded number of characters past an accept state to detect that it is the longest match. Such lexers deliver quadratic-time complexity O(n^2), instead of the expected linear-time complexity O(n), where n is the number of input characters. (SableCC is quite unique in reporting this information; other DFA-based lexer generators don't even warn their users about this possibility!)
  • Of course, explicit lookahead requirements can be specified:
    • None: There must not be any transition out of an accept state. When this is specified, tokens are matched without looking ahead at the next character that follows them. (Could be useful for streams).
    • Infinite: Quadratic-time (worst case) is expected. (At least, the developer knows about it!)
    • By default, bounded lookahead is provided. As a nice side effect of all this, there is no need, anymore, to guess a push back buffer size; it is explicitly computed and implemented within the generated lexer class.
  • The addition of semantic selectors, to allow user-written code to select which token to match.
  • The addition of token investigators, to allow user-written code to investigate matched tokens. Collected information is useful to semantic selectors.
  • Tokens can now be declared in groups (such as keywords or numbers). This can be useful to briefly describe collections of tokens.
  • The addition of a separate section to declare states and state transitions.
  • The introduction of two types of states: normal states, and internal states. Normal states are used by parsers to get a lexer which matches a specific set of tokens. Internal states are used by lexer developers to compute complex tokens which can't be matched by regular expressions, such as recursive comments.
  • As of SableCC 4, normal state transitions are controlled by parsers. No more error-prone developer specified transitions.
  • Internal state transitions have an elegant syntax.
  • Priorities, lookahead, and ignored tokens are state specific.
  • There is a new elegant syntax to specify token collections (such as All Except keywords).
  • Regular expressions have been improved:
    • The empty string can now be matched and has an explicit syntax ''.
    • New operators have been added: subtraction, intersection, and shortest.
    • The awkward notation for character set addition and subtraction has been eliminated, as it is no longer useful.

Details

See LexerDetails.

What's not included

There are two misfeatures that have not been selected for inclusion in SableCC 4 lexers: lookahead regular expression and explicit EOF characters.

Lookahead Regular Expression

It has been a tradition, for DFA-based lexer generators, to allow the specification of tokens conditionally to a lookahead regular expression, e.g.

  hi = 'hi' / ';';

In this example, the hi token is matched only when followed immediately by a semicolon. Many compiler books provide an intuitive implementation strategy for this operator. Unfortunately, this operator is broken by design. Here is why:

  1. The naive implementation approach which consists of using an epsilon transition between the NFAs of the token and of the lookahead and remembering the last visited accept state for the token part is broken for cases such as:
    a = 'a'* / 'aa' 'a'*;
    
  2. To return the correct token, some books propose to look for the "longest correct match". In other words, ('a'* 'aa' 'a'*) is first matched, then, within the matched string, we look for the longest correct ('a'*) token making sure that the rest of the string matches ('aa' 'a'*). Books are quite silent on how to do this efficiently, of course. One can easily discover that using two passes, marking positions with a backward DFA then a forward DFA would get the job done. So, this means 2 additional DFAs for each such token.
  3. But actually, the main reason why this is all broken, is that matching ('a'* 'aa' 'a'*) breaks the longest token match expectation. A token should only be matched if is is the longest match. Because of the lookahead part, a token might get matched by the lexer DFA even if it is not the longest, due to the DFA using the combined (token + lookahead) definition. For example:
     token1 = 'a'* / 'aa' 'a'*;
     token2 = 'aa';
    
    In this example, given an input string "aaa", token1("a") will be matched by the lexer instead of token2("aa"), as "aaa" (seen as "a"/"aa") is longer than "aa". The author of SableCC is not aware of any elegant, simple, and efficient O(n) algorithm to implement the intuitively expected behavior of matching token2("aa").

EOF Character

Some lexer generators allow for explicitly matching the end-of-file (EOF) character. It is a design choice of SableCC 4 not to provide such non-regular character, as it would have required a specific notation with related coherency rules. Furthermore, parsers can actually do the job of classifying (e.g. reduce) tokens appropriately using lookahead information. Please note that we are talking about matching an EOF character, not about maching an EOF token! Actually, it is often possible to get rid of EOF without even using parser-based classification, for example:

  // A line comment which works on the last line of a file
  line_comment = '//' not_cr_lf* (cr | lf | cr lf | EOF);

  // An equivalent solution without EOF. Yes, this works because of the longest match rule!
  line_comment = '//' not_cr_lf* (cr | lf | cr lf)?;

It would not make sense to match an EOF character within a token, as it cannot be followed by another (non-EOF) character and it is not a normal character (think of token.getText()). So, we would need to add a specific lookahead notation for it. As we did not include lookahead regular expressions, it would have made for a relatively inelegant addition.

Also, it is not a good idea to confuse users with two types of EOFs: EOF character and EOF token.

The EOF token is a special token generated by the lexer when it reaches the end of a file. Unlike the EOF character, it requires no specific notation in the lexer, nor does it need to be matched by the lexer DFA. It is usually required by parsers.