Design of the org.sablecc.sablecc.alphabet Package
The main idea behind this package it to reduce the number of symbols involved in the computation of a minimal deterministic finite automaton (Minimal DFA).
The Problem
A regular expression serves to describe a possibly infinite set of strings (or language). A string is a finite sequence of symbols. A regular expression is defined over an alphabet, the set of symbols that appear in the strings of the language it describes.
Typically, lexer generators limited their users to a small alphabet: the ASCII character set. This made it possible to use the code of input characters as index in lexer decision tables. It was merely a matter of making tables of little over a hundred columns. Note that some modern lexer generators allow for a wider character set, but using an ad-hoc approach (they revert to non-standard and slower lexer code for non-ASCII symbols).
SableCC, on the other hands, allows language developers to use a huge alphabet. In SableCC 2 and 3, developers were "limited" to 65536 characters. In SableCC 4, there is no bound on the value of the biggest character. It is impractical to compute (and store) lexer decision tables of 65,000 columns. Thus, SableCC 2 and 3 used intervals of characters as symbols in the computation of lexer decision tables. Yet, experience has shown this not to be sufficient. For example, the lexer definition of the Java language (j11.sablecc) contains a huge set of intervals of valid identifier characters. This results in a slow and painful lexer table computation for this specification in SableCC 2 and 3.
The Solution
SableCC 4 defines a symbol as a collection of non-adjacent, non-overlapping intervals. In that context, an alphabet is a collection of symbols where no two symbols contain overlapping intervals.
SableCC 4 works hard at always computing the minimal alphabet for automatons it generates. So, during computations, SableCC might end up with two (or more) automatons with distinct alphabets. It becomes tricky to combine these automatons. To simplify the task, the Alphabet class provides a merge() method to merge alphabets.
Merging two alphabets A and B consists of creating a new alphabet C containing a minimal number of symbols, with the following property: For every symbol x element of (A union B), there exists a corresponding subset S of C, such that: merge(S) == x. Note that merging a set of symbols means creating a single symbol whose intervals cover exactly the characters covered by the intervals of the merged symbols.
Technical Details
SableCC's alphabet related classes are generic; they expect a Comparable type parameter T. Comparability of values is not sufficient; we need the notion of previous and next values. For this reason, the whole package is based on the AdjacencyRealm class which provides the required adjacency notion.
Intervals are created using AdjacencyRealm.createInterval(). Symbols are composed of intervals attached to a single realm. All of the Alphabet, Symbol, and Interval classes build constant (i.e. immutable) instances and refuse to create incoherent instances.
Additional details can be found in the javadoc documentation of these classes.
