NOTE: This page contains deprecated information.
Detailed Lexer Syntax
The goal of this page is to expose the detailed syntax of SableCC 4 lexers with simple examples.
A Simple Lexer
Here is a very simple lexer example:
Language simple_example;
Syntax 4;
Lexer
Helpers
/* regular expression matching all valid 16-bit characters */
any = #x0..#xffff;
letter = 'a'..'z';
number = '0'..'9';
cr = #13;
lf = #10;
eol = cr | lf | cr lf;
Tokens
identifier = letter (letter | number)*;
line_comment = '//' (any - (cr | lf))* eol?;
Ignored
line_comment;
This lexer includes a Helpers section which defines useful regular expressions used in token definitions. It also has a Tokens section which defines two tokens, identifier and line_comment. Finally, it contains an Ignored section which declares tokens that shouldn't be returned to the parser.
This simple lexer also exhibits two features: intervals (e.g. 'a'..'z') and subtraction of regular expressions (e.g. any - (cr | lf)).
Groups and Priorities
It is often convenient to define tokens with regular expressions that overlap. A typical example is the matching of keywords and identifiers. SableCC 4 requires explicit priorities between overlapping tokens. Here is an example:
...
Tokens
Group keywords:
if = 'if';
for = 'for';
begin = 'begin';
end = 'end';
Group others:
identifier = letter (letter | number)*;
line_comment = '//' (any - (cr | lf))* eol?;
Priorities
/* if it can be both a keyword or an identifier, make it a keyword */
keywords Over identifier;
Ignored
line_comment;
In this example, we see how to declare tokens in groups. It is as simple as writing Group groupname: before a series of token. We also see how to declare an explicit priority between tokens. Note how the keywords group was useful. Without it, we would have had to write:
Priorities
if, for, begin, end Over identifier;
The priority declaration was necessary, as a string such as "begin" is matched by the definition of two tokens: begin and identifier. Somehow, this tells us that our definition of identifier is wrong. Effectively, we do not consider the string "begin" as an identifier, yet (letter (letter | number)*) matches it. The exact definition would be:
identifier = (letter (letter | number)*) - ('if' | 'for' | 'begin' | 'end');
It easy to see how it would be painful to exactly define identifier for a complex language with many keywords. It is much more convenient to declare it with a simple general regular expression, then declare explicit priorities between tokens.
SableCC will report an error if two tokens overlap and no priority is provided. It will also report an error when a token is never matched (possibly because of an erroneous priority declaration) or when a priority declaration is never used.
Lookahead
Lookahead Infinite;
DFA-based lexers, such as the lexers generated by SableCC, are known for their high efficiency. They are expected to deliver linear-time worst-case complexity O(n), where n is the number of input characters. Yet, it is little know, but in some obscure cases, such efficiency cannot be delivered. Here is an example:
Tokens
left_parenthese = '(';
parenthesized_word = '(' letter+ ')';
This example is not meant to do anything useful other than illustrate the problem. So, here how it goes.
Let's give this lexer the following input: "(adggsjfgsgfjasdkjhgfg". As you can see, the longest match is left_parenthese("("). Yet, the lexer cannot decide whether it must match left_parenthese or parenthesized_word before inspecting the "adggsjfgsgfjasdkjhgfg" suffix and discovering that it has no closing ")". Typically, a lexer will continue consuming characters until it is certain that no valid token can be matched anymore. In this case, it has to look until the end of input.
The following input exhibits the worst case scenario: "((((((((((((((((((((((((((((((((((((((((((((((((((". In such a case, the lexer will match 50 left_parenthese tokens, but for every token, the lexer will look at all the remaining input characters. The cost is 50 + 49 + 48 + ... + 1, which is O(n^2).
SableCC 4 will automatically detect this problem and report an error. Yet, if the user really wants such a degraded lexer, SableCC allows for it as follows:
Tokens
left_parenthese = '(';
parenthesized_word = '(' letter+ ')';
Lookahead
/* tell SableCC that we expect O(n^2) instead of O(n) */
Infinite;
To discourage users from developing bad habits which would hide the problem, SableCC reports an error if an infinite lookahead is specified without being necessary.
Lookahead None;
Usually, SableCC will require a bounded lookahead buffer to match longest tokens. Yet, it is sometimes desirable for the lexer not to look ahead. SableCC allows users to specify such no-lookahead requirement, and will then raise an error if token definitions require a lookahead. Example:
Tokens
begin = 'begin';
end = 'end';
success = 'success';
error = 'error';
Lookahead
/* tell SableCC not to allow for lookahead buffer */
None;
Let's look at a sample input: "beginsuccessend". You can easily see that a lexer needs not look beyond the "n" of "begin" to know that it must match the begin token. In this case, three tokens will be matched: begin, success, and end. We think that this feature might be appreciated by stream protocol designers.
Lookahead Buffer
In earlier SableCC versions, users had to provide a PushbackReader object to generated lexers. One had to resort to some kind of black magic to guess the correct push back buffer size, in order for the lexer not to throw an IOExeption for lack of buffer space. The black magic usually consisted of error-prone trial and error, or more commonly, using an arbitrary 1024 character buffer.
As of version 4, the exact push back buffer requirement is now computed by SableCC and directly encoded into the lexer. This completely eradicates one of the common pitfalls.
(Normal) Lexer States
SableCC allows users to declare lexer states. Each lexer state is, somehow, a separate lexer in itself. Each state recognizes a specific set of tokens.
Lexer states are useful for multi-level languages. A common example of this is the HTML langage. Here's a sample HTML fragment:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>The Title</title>
</head>
<body>
Here is a <a href="link">link</a>.
</body>
</html>
Note how the syntax within < ... > is structured differently than the free text outside them. In particular, the "a" in "Here is a ..." is a simple word within the text, yet in <a href="link">, "a" should be matched as an identifier.
It would be quite painful to write a parser grammar for HTML without lexer states, as the following intuitive grammar wouldn't work:
/* This does not work! */
Lexer
Tokens
text = (any - '<')+;
identifier = letter+;
tag_start = '<';
tag_end = '>';
slash = '/';
equal = '=';
string = '"' (any - ('"' | cr | lf))* '"';
blank = (' ' | eol)+;
Ignored
blank;
Parser
mini_broken_html =
part*;
part =
{tag} open_tag part* close_tag |
{text} text;
open_tag =
tag_start identifier tag_body_part* tag_end;
close_tag =
tag_start slash identifier tag_body_part* tag_end;
tag_body_part =
identifier equal string;
What's wrong with it? Many things. For one thing, SableCC will complain about the overlapping text and identifier tokens. And, no, there is no good priority between these two tokens. But it gets worse... The following fragment will be rejected <html>"Hello!"</html>. Why? because "Hello!" will probably be matched as a string token, instead of a text token. We really need two lexers: one that will match text and tag_start, and one that will match the remaining tokens.
Here is where lexer states come to the rescue. In SableCC 4, the syntax for lexer states is as follows:
Lexer
Tokens
text = (any - '<')+;
identifier = letter+;
tag_start = '<';
tag_end = '>';
slash = '/';
equal = '=';
string = '"' (any - ('"' | cr | lf))* '"';
blank = (' ' | eol)+;
States
State free_text:
Tokens
text, tag_start;
State inside_tag:
Tokens
identifier, tag_end, slash, equal, string, blank;
Ignored
blank;
Now, though, we need to make sure that the lexer operates in the correct state, when matching tokens.
Traditionally, lexer generators provided a syntax to specify state transitions. The problem, with that approach, is that it is the responsibility of the programmer to correctly specify when to cause a state transition. This is quite error prone.
SableCC 4 does what no other parser generator has ever done before: it lets the parser automatically cause state transitions. The user simply needs to declare grammar productions within states, as follows:
Parser
State free_text:
mini_broken_html =
part*;
part =
{tag} open_tag part* close_tag |
{text} text;
open_tag =
tag_start open_tag_body;
close_tag =
tag_start close_tag_body;
State inside_tag:
open_tag_body =
identifier tag_body_part* tag_end;
close_tag_body =
slash identifier tag_body_part* tag_end;
tag_body_part =
identifier equal string;
As you see, it is all quite easy and intuitive. Yet, this approach has a huge advantage over the traditional approach: it will catch any ambiguous state transition!
In the above grammar, it is easy to see how transitions should be made on tag_start and tag_end. But, for many other languages, it is not that simple. Imagine, for example, that nested tags were allowed (e.g. <id <id>>). By leaving state transitions in the hand of the parser, SableCC automatically deals with such problems. Additionally, SableCC will also report an error for any ambiguous state transition. Here is an example:
Parser
State free_text:
...
part =
{tag} open_tag part* close_tag |
{ambiguous} tag_start text |
{text} text;
open_tag =
tag_start open_tag_body;
...
State inside_tag:
open_tag_body =
identifier tag_body_part* tag_end;
...
This grammar has no LR conflict! A traditional parser generator would simply accept it. The problem is related to state transitions. Should the lexer state change after seeing a tag_start token? There is no good answer. If it is changed, then the parser will never match the part.ambiguous alternative; if it is not changed, the parser will never match the part.tag alternative.
Concretely, this grammar is ambiguous because of lexer state transitions. SableCC 4 will correctly detect this and report an error. We hope that users will appreciate this. :-)
Internal States and Semantic Selectors and Investigators
Sometimes, regular expressions are insufficient to describe tokens. For example, recursive comments cannot be described using regular expressions. Example:
/* This is /* a comment within */ another comment */
Theoretical proofs exist (the famous pumping lemma) that no regular expression can describe them. Yet, some languages use them, such as Pascal. Actually, they can be quite useful, as they allow you to comment a program part, even when it already includes comments.
Earlier SableCC versions allowed for very hackish customized lexers doing ugly things to recognize recursive comments. If you're really curious, just go read the SableCC master thesis; it is explained somewhere within it...
SableCC 4 introduces the concept of incomplete tokens and internal lexer states, along with semantic investigators and selectors to deal elegantly with such problems.
So, what's all this? Internal lexer states are simply lexer states with a distinctive feature: they do not return matched tokens to the parser. Instead, they append the text of tokens they match to the original incomplete token that caused the transition to an internal state.
Here's a lexer specification for recursive comments:
Tokens
Group default:
/* incomplete token definition, the internal state will complete the token */
comment = '/*' ...;
Group comment_parts:
nested_comment_start = '/*';
slash = '/';
star = '*';
text = (any - ('*' | '/'))+;
nested_comment_end,
external_comment_end : comment_end_selector()
= '*/';
Investigators
nested_comment_start : comment_depth_increment();
nested_comment_end : comment_depth_decrement();
States
State normal:
Tokens
comment;
Internal
State comment_body:
Tokens
comment_parts;
Transitions
normal ->( comment )-> comment_body;
comment_body ->( external_comment_end )-> Back;
Wow! That's a lot to grasp...
Let's go at it slowly. First, there is the weird comment token definition:
comment = '/*' ...;
What it does is the following. It simply defines a token that will be matched normally using the '/*' regular expression. Yet, it also tells us that the final text of the token will (probably) be longer than "/*"; it will also contain all the text appended by internal states.
There is also an Investigators section:
Investigators
nested_comment_start : comment_depth_increment();
nested_comment_end : comment_depth_decrement();
An investigator is simply a method that will be called by the lexer every time a token is matched.
You might have missed the semantic selector. Here it is:
nested_comment_end,
external_comment_end : comment_end_selector()
= '*/';
Semantic selectors work as follows. First the lexer matches the token selection using the provided regular expression ('*/'). Then, it calls the specified method to select which token to return.
A typical implementation of the above investigators and selector, in Java, would look as follows:
public class RecursiveCommentLexer extends Lexer {
private int depth = 0;
@Override
public void commentDepthIncrement(
Token token) {
this.depth++;
}
@Override
public void commentDepthDecrement(
Token token) {
this.depth--;
}
@Override
public CommentEndSelectorResult commentEndSelector(String text) {
if(this.depth == 0) {
return CommentEndSelectorResult.EXTERNAL_COMMENT_END;
}
return CommentEndSelectorResult.NESTED_COMMENT_END;
}
We assume here that CommentEndSelectorResult is a Java enum declared by SableCC.
Now, the final part: the lexer transitions. There are two transitions:
Transitions
normal ->( comment )-> comment_body;
comment_body ->( external_comment_end )-> Back;
SableCC requires a transition like the first one for every incomplete comment, of course.
The destination state of a state transition must always be an internal state, or Back. A back transition means that we are finished matching the parts of the incomplete token; that it can now be constructed (with the full text) and sent to the parser.
So, let's walk through an example. Let's take the recursive comment above:
/* This is /* a comment within */ another comment */
- As the lexer will initially be in the default state, it will match the "/*" string as an incomplete comment token, and make a transition to the comment_body state. Note that the comment token (e.g. the TComment instance, in Java), won't be created yet.
- The next matched token is text(" This is "). As the lexer is in the comment_body internal state, the text of the token is simply appended to the comment token text ("/* This is ").
- Next, a nested_comment_start("/*") token is matched. As there is a registered investigator on this token, it is called. The commentDepthIncrement method is called setting the depth instance variable to 1. As usual, the text is appended to yield "/* This is /*".
- Again, a text(" a comment within ") is matched and the text is appended ("/* This is /* a comment within ").
- Next, "*/" is matched. It is a token selection, so the selector method is called. The commentEndSelector method is called; as the current depth is 1 (not 0), it will select the nested_comment_end("*/") token. As there is a registered investigator on this comment, the commentDepthDecrement method will be called and decrement the depth instance variable to 0. We are in an internal state, so the text will be appended ("/* This is /* a comment within */").
- Again, a text(" another comment ") is matched and the text is appended ("/* This is /* a comment within */ another comment ").
- Next, "*/" is matched. It is a token selection, so the selector method is called. The commentEndSelector method is called; as the current depth is 0, it will select the external_comment_end("*/") token. We are in an internal state, so the text will be appended ("/* This is /* a comment within */ another comment */").
- There is a registered transition on the external_comment_end comment, so it is applied. As it is a Back transition, the state of the lexer is reverted to the last non-internal state (normal in the current case), and finally, the comment("/* This is /* a comment within */ another comment */") token is created and returned to the parser.
As you can see, all this required very little target-specific code (e.g. Java code), which is added in an inherited class. Most of the tricky work is done by the generated lexer (state transitions, cumulating text, creating tokens or not). The final lexer specification is quite simple (we like to think of it as elegant, too). The type robustness of the selector return type (using enumerations, in Java), leaves very little room for user mistakes.
New Regular Expression Operators
Last, but not least... SableCC 4 introduces very convenient new operators in regular expression.
The first operator is the subtraction operator (-) that you have seen used above. It is very convenient, as it allows for expressing regular expressions that are otherwise very difficult to write. For example, here is a regular expression that matches all the strings except those that contain the "etienne" substring:
all_but_etienne = any* - (any* 'etienne' any*);
This seems simple, doesn't it? So, try to write it without using the subtraction operator. Yes, theory proves that there is a solution; actually, there is an infinite number of solutions! I leave it as to you as an exercise to find a single one. :-)
The second operator is the Shortest operator. As its name implies, this operator finds the shortest matching string, instead of the longest one. This can be useful to describe things such as C comments:
comment = Shortest('/*' any* '*/');
The shortest match is important, here, as if you select the longest match, instead, a comment will start with the first "/*" in a file and end with the last "*/" of the file! Without the shortest operator, the correct definition looks like:
comment = '/*' not_star* ('*' (not_star_slash not_star*)?)* '*/';
Most people find the first definition clearer.
Lastly, an intersection operator has been added, for completeness. If you find a natural use for it, please tell us about it (edit this wiki page, or send a message about it to the mailing list). Here's a useless example:
b_star = Intersection( ('a'|'b')*, ('b'|'c')* );
Conclusions
In conclusion, we hope that this document has provided you with a good overview of the syntax and features of lexers in SableCC 4.
SableCC 4 aims at providing a very powerful, efficient, and yet highly intuitive and robust environment for constructing lexers.
Have fun!
