This is a demo of the quasiliteral tokenizing code specified in the EcmaScript quasiliteral strawman.
It is meant to demonstrate that if we start down the slippery slope of allowing complex expressions in the quasiliteral SubstitutionBody production, that there is an efficiently tokenizable grammar at the bottom of that slope.
Enter JavaScript source code, including backquoted (`)
quasiliterals with ${…} style substitutions, hit
submit, and you should get a series of tokens at the bottom that
correctly identify quasiliteral, substitution, and expression token
boundaries. For simplicity, this tokenizer does not recognize
division operators or regular expression literals, and does not try to
fail on invalid or incomplete inputs.
| Token | IsLiteralPortion? | InLiteralPortion? | Stack | Line# |
|---|
First we define a JavaScript tokenizer for the entire grammar, minus div ops & regexp literals for simplicty.
We need to keep track of whether we're inside a literal portion.
Characters inside the literal portion are raw whereas outside they
might be part of a regular EcmaScript token. The underlined characters
in var foo = `Hello, ${world}!`; are in
literal portions.
Since quasiliterals are expressions, if a substitution
body can contain an arbitrary expression, then we need a way to figure
out whether a } token ends a substitution.
We keep a boolean stack to tell us whether the next } token
closes the innermost substitution. If the top of the stack is true, then
the next } token does close the innermost substitution.
A boolean stack need not be a heavyweight dynamically resizing data
structure.
If willing to limit bracket nesting to 58 levels, the stack and its
head index can pack into 64 bits.
The number of quasiliteral nesting levels at any time is the count
of bits in the closesInnermostSubstitution stack plus 0
if !inLiteralPortion or plus 1 if
inLiteralPortion.
We need to return a token, the untokenized source, and a boolean
telling us whether the token is a literal portion since literal
portions could look like regular tokens: return `${x}return${y}`
To pull a token off source, we scan from the left. Below we check the next character occasionally so this scanner needs to lookahead 1 SourceCharacter.
If we are in a literal portion and it is a backtick, then we have come to the end of a quasiliteral.
If we are in a literal portion and we see a backslash, then we have an escape. Skip it and the next character to avoid exiting early on \` or treating \${x} as a substitution.
If we are in a literal portion and we see ${, then we have a bracketed substitution. End the literal portion if any, transition out of the literal portion and update the stack since } ends the substitution.
Any other character in a literal portion goes onto the token.
If we are not in a literal portion, then a { expands
the stack.
While a } pops, possibly ending the substitution
and dropping us back into the literal portion.
A ` outside a literal portion enters a literal portion, possibly nested.
Otherwise, we delegate to the tokenizer that does not recognize quasis.
Finally, we return the token and other data.