Tokenize Demo

This is a demo of the quasiliteral tokenizing code specified in the EcmaScript quasiliteral strawman.

It is meant to demonstrate that if we start down the slippery slope of allowing complex expressions in the quasiliteral SubstitutionBody production, that there is an efficiently tokenizable grammar at the bottom of that slope.

Enter JavaScript source code, including backquoted (`) quasiliterals with ${…} style substitutions, hit submit, and you should get a series of tokens at the bottom that correctly identify quasiliteral, substitution, and expression token boundaries. For simplicity, this tokenizer does not recognize division operators or regular expression literals, and does not try to fail on invalid or incomplete inputs.

TokenIsLiteralPortion?InLiteralPortion?StackLine#

First we define a JavaScript tokenizer for the entire grammar, minus div ops & regexp literals for simplicty.

TOKEN = new RegExp("^(?:" + [ // Space "[\\s\ufeff]+", // A block comment "/\\*(?:[^*]|\\*+[^*/])*\\*+/", // A line comment "//.*", // A string "\'(?:[^\\\\\'\\r\\n\\u2028\\u2029]|\\\\(?:\\r\\n|[\\s\\S]))*\'", "\"(?:[^\\\\\"\\r\\n\\u2028\\u2029]|\\\\(?:\\r\\n|[\\s\\S]))*\"", // Number // (before punctuation to prevent breaking around decimal point) "0x[0-9a-f]+", "(?:\\d+(?:\\.\\d*)?|\\.\\d+)(?:e[+-]?\\d+)?", // IdentifierName (ignoring non-Latin letters) "[_$a-z][\\w$]*", // Punctuation [ ".", "[", "]", "(", ")", "++", "--", "!", "~", "+", "-", "*", "%", "+", "-", "<<", ">>", ">>>", "<", "<=", ">", ">=", "==", "!=", "===", "!==", "&", "^", "|", "&&", "||", "?", ":", "=", "+=", "-=", "*=", "%=", "<<=", ">>=", ">>>=", "&=", "^=", "|=", "&=", ",", "{", "}", ";" ] // Sort longest first so that we can join on | to get a // regular expression that matches the longest punctuation token // possible. .sort(function (a, b) { return b.length - a.length; }) // Escape for RegExp syntax. .map(function (x) { return x.replace(/./g, "\\$&"); }) .join("|")] .join("|") + ")", "i"); function tokenizeNoQuasis(source) { return source.match(TOKEN)[0]; }

We need to keep track of whether we're inside a literal portion. Characters inside the literal portion are raw whereas outside they might be part of a regular EcmaScript token. The underlined characters in var foo = `Hello, ${world}!`; are in literal portions.

inLiteralPortion = false;

Since quasiliterals are expressions, if a substitution body can contain an arbitrary expression, then we need a way to figure out whether a } token ends a substitution. We keep a boolean stack to tell us whether the next } token closes the innermost substitution. If the top of the stack is true, then the next } token does close the innermost substitution. A boolean stack need not be a heavyweight dynamically resizing data structure. If willing to limit bracket nesting to 58 levels, the stack and its head index can pack into 64 bits.

closesInnermostSubstitution = [];

The number of quasiliteral nesting levels at any time is the count of bits in the closesInnermostSubstitution stack plus 0 if !inLiteralPortion or plus 1 if inLiteralPortion.

We need to return a token, the untokenized source, and a boolean telling us whether the token is a literal portion since literal portions could look like regular tokens: return `${x}return${y}`

tokenize = function tokenize(source) { var tokenLen, tokenIsLiteralPortion = false;

To pull a token off source, we scan from the left. Below we check the next character occasionally so this scanner needs to lookahead 1 SourceCharacter.

for (var i = 0, n = source.length; i < n; ++i) { var ch = source.charAt(i);

If we are in a literal portion and it is a backtick, then we have come to the end of a quasiliteral.

if (inLiteralPortion && ch === '`') { inLiteralPortion = false; tokenIsLiteralPortion = true; tokenLen = i + 1; break;

If we are in a literal portion and we see a backslash, then we have an escape. Skip it and the next character to avoid exiting early on \` or treating \${x} as a substitution.

} else if (inLiteralPortion && ch == '\\') { if (++i === n) { throw Error(); }

If we are in a literal portion and we see ${, then we have a bracketed substitution. End the literal portion if any, transition out of the literal portion and update the stack since } ends the substitution.

} else if (inLiteralPortion && ch === '$' && source.charAt(i + 1) === '{') { if (i) { // Emit the literal portion parsed thus far. tokenLen = i; tokenIsLiteralPortion = true; break; } inLiteralPortion = false; closesInnermostSubstitution.push(1); tokenLen = i + 2; break;

Any other character in a literal portion goes onto the token.

} else if (inLiteralPortion) { // Let for-loop increment i.

If we are not in a literal portion, then a { expands the stack.

} else if (ch === '{') { closesInnermostSubstitution.push(0);

While a } pops, possibly ending the substitution and dropping us back into the literal portion.

} else if (ch === '}') { inLiteralPortion = !!closesInnermostSubstitution.pop(); tokenLen = i + 1; break;

A ` outside a literal portion enters a literal portion, possibly nested.

} else if (ch === '`') { inLiteralPortion = true;

Otherwise, we delegate to the tokenizer that does not recognize quasis.

} else { tokenLen = tokenizeNoQuasis(source).length; break; }

Finally, we return the token and other data.

} if (tokenLen === void 0) { throw Error(); } var token = source.substring(0, tokenLen); return [token, source.substring(tokenLen), tokenIsLiteralPortion]; }