lydell / js-tokens

Tiny JavaScript tokenizer.

License: MIT License

js-tokens

The tiny, regex powered, lenient, almost spec-compliant JavaScript tokenizer that never fails.

const jsTokens = require("js-tokens");

const jsString = 'JSON.stringify({k:3.14**2}, null /*replacer*/, "\\t")';

Array.from(jsTokens(jsString), (token) => token.value).join("|");
// JSON|.|stringify|(|{|k|:|3.14|**|2|}|,| |null| |/*replacer*/|,| |"\t"|)

Installation

npm install js-tokens

import jsTokens from "js-tokens";
// or:
const jsTokens = require("js-tokens");

Usage

jsTokens(string, options?)

Option	Type	Default	Description
jsx	`boolean`	`false`	Enable JSX support.

This package exports a generator function, jsTokens, that turns a string of JavaScript code into token objects.

For the empty string, the function yields nothing (which can be turned into an empty list). For any other input, the function always yields something, even for invalid JavaScript, and never throws. Concatenating the token values reproduces the input.

The package is very close to being fully spec compliant (it passes all but 3 of test262-parser-tests), but has taken a couple of shortcuts. See the following sections for limitations of some tokens.

// Loop over tokens:
for (const token of jsTokens("hello, !world")) {
  console.log(token);
}

// Get all tokens as an array:
const tokens = Array.from(jsTokens("hello, !world"));
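
The round-trip property described above (concatenating the token values reproduces the input) can be checked with a small helper. This is only a sketch: `reproducesInput` is a hypothetical name, and `sampleTokens` is a hand-written stand-in for tokenizer output so that the snippet runs without the package installed.

```javascript
// Hypothetical helper: true if the token values concatenate back to `input`.
function reproducesInput(input, tokens) {
  let joined = "";
  for (const token of tokens) {
    joined += token.value;
  }
  return joined === input;
}

// Hand-written stand-in for the tokens of "hello, !world" (an assumption,
// not captured output).
const sampleTokens = [
  { type: "IdentifierName", value: "hello" },
  { type: "Punctuator", value: "," },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "!" },
  { type: "IdentifierName", value: "world" },
];

console.log(reproducesInput("hello, !world", sampleTokens)); // true
```

With the package installed, the same helper should accept the generator directly, for example `reproducesInput(code, jsTokens(code))`.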

Tokens

Spec: ECMAScript Language: Lexical Grammar + Additional Syntax

export default function jsTokens(input: string): Iterable<Token>;

type Token =
  | { type: "StringLiteral"; value: string; closed: boolean }
  | { type: "NoSubstitutionTemplate"; value: string; closed: boolean }
  | { type: "TemplateHead"; value: string }
  | { type: "TemplateMiddle"; value: string }
  | { type: "TemplateTail"; value: string; closed: boolean }
  | { type: "RegularExpressionLiteral"; value: string; closed: boolean }
  | { type: "MultiLineComment"; value: string; closed: boolean }
  | { type: "SingleLineComment"; value: string }
  | { type: "HashbangComment"; value: string }
  | { type: "IdentifierName"; value: string }
  | { type: "PrivateIdentifier"; value: string }
  | { type: "NumericLiteral"; value: string }
  | { type: "Punctuator"; value: string }
  | { type: "WhiteSpace"; value: string }
  | { type: "LineTerminatorSequence"; value: string }
  | { type: "Invalid"; value: string };
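
Many consumers only care about tokens that carry meaning. A minimal sketch of filtering out whitespace, line terminators and comments by `type` follows; `significantTokens` and `TRIVIA_TYPES` are hypothetical names, and the token array is hand-written in the shape of the `Token` type above.

```javascript
// Token types treated as trivia in this sketch (a choice, not part of the API).
const TRIVIA_TYPES = new Set([
  "WhiteSpace",
  "LineTerminatorSequence",
  "MultiLineComment",
  "SingleLineComment",
  "HashbangComment",
]);

// Lazily yields every token whose type is not in TRIVIA_TYPES.
function* significantTokens(tokens) {
  for (const token of tokens) {
    if (!TRIVIA_TYPES.has(token.type)) {
      yield token;
    }
  }
}

// Hand-written tokens in the documented shape, standing in for `a /* c */ + 1`.
const exampleTokens = [
  { type: "IdentifierName", value: "a" },
  { type: "WhiteSpace", value: " " },
  { type: "MultiLineComment", value: "/* c */", closed: true },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "+" },
  { type: "WhiteSpace", value: " " },
  { type: "NumericLiteral", value: "1" },
];

console.log(Array.from(significantTokens(exampleTokens), (t) => t.value).join("|"));
// a|+|1
```

Because the helper accepts any iterable, the generator returned by `jsTokens` could be passed in place of the array.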

StringLiteral

Spec: StringLiteral

If the ending " or ' is missing, the token has closed: false. JavaScript strings cannot contain (unescaped) newlines, so unclosed strings simply end at the end of the line.

Escape sequences are supported, but may be invalid. For example, "\u" is matched as a StringLiteral even though it contains an invalid escape.

Examples:

"string"
'string'
""
''
"\""
'\''
"valid: \u00a0, invalid: \u"
'valid: \u00a0, invalid: \u'
"multi-\
line"
'multi-\
line'
" unclosed
' unclosed
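
Since the tokenizer never throws, the `closed` field is the way to notice unterminated strings (and the other token types that carry `closed`). The sketch below collects such tokens together with their offset in the input; `findUnclosed` is a hypothetical helper, and the tokens are hand-written to stand in for the input `x = " unclosed`.

```javascript
// Hypothetical helper: lists tokens with `closed: false` and where they start.
// Offsets are UTF-16 code unit indexes, derived from the token value lengths.
function findUnclosed(tokens) {
  const unclosed = [];
  let offset = 0;
  for (const token of tokens) {
    if (token.closed === false) {
      unclosed.push({ type: token.type, offset });
    }
    offset += token.value.length;
  }
  return unclosed;
}

// Hand-written stand-in for the tokens of: x = " unclosed
const unclosedExample = [
  { type: "IdentifierName", value: "x" },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "=" },
  { type: "WhiteSpace", value: " " },
  { type: "StringLiteral", value: '" unclosed', closed: false },
];

console.log(findUnclosed(unclosedExample));
// [ { type: 'StringLiteral', offset: 4 } ]
```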

NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail

Spec: NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail

A template without interpolations is matched as is. For example:

`abc`: NoSubstitutionTemplate
`abc: NoSubstitutionTemplate with closed: false

A template with interpolations is matched as many tokens. For example, `head${1}middle${2}tail` is matched as follows (apart from the two NumericLiterals):

`head${: TemplateHead
}middle${: TemplateMiddle
}tail`: TemplateTail

TemplateMiddle is optional, and TemplateTail can be unclosed. For example, `head${1}tail (note the missing ending `):

`head${: TemplateHead
}tail: TemplateTail with closed: false

Templates can contain unescaped newlines, so unclosed templates go on to the end of input.

Just like for StringLiteral, templates can also contain invalid escapes. `\u` is matched as a NoSubstitutionTemplate even though it contains an invalid escape. Also note that in tagged templates, invalid escapes are not syntax errors: x`\u` is syntactically valid JavaScript.
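
Because a template with interpolations arrives as several tokens, a consumer that wants to know which tokens sit inside `${...}` has to track nesting itself. A minimal sketch, assuming the token types listed above: `templateDepths` is a hypothetical helper, and the tokens are hand-written from the `head${1}middle${2}tail` example.

```javascript
// Hypothetical helper: annotates each token with the number of template
// interpolations that are open at that point.
function templateDepths(tokens) {
  const result = [];
  let depth = 0;
  for (const token of tokens) {
    // A TemplateTail closes the interpolation that the TemplateHead opened.
    if (token.type === "TemplateTail") depth--;
    result.push({ value: token.value, depth });
    if (token.type === "TemplateHead") depth++;
  }
  return result;
}

// Hand-written tokens following the breakdown of `head${1}middle${2}tail` above.
const templateExample = [
  { type: "TemplateHead", value: "`head${" },
  { type: "NumericLiteral", value: "1" },
  { type: "TemplateMiddle", value: "}middle${" },
  { type: "NumericLiteral", value: "2" },
  { type: "TemplateTail", value: "}tail`", closed: true },
];

console.log(templateDepths(templateExample).map((entry) => entry.depth).join(","));
// 0,1,1,1,0
```

Tokens with a depth above zero that are not template parts themselves are the contents of an interpolation.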

RegularExpressionLiteral

Spec: RegularExpressionLiteral

Regex literals may contain invalid regex syntax. They are still matched as regex literals.

If the ending / is missing, the token has closed: false. JavaScript regex literals cannot contain newlines (not even escaped ones), so unclosed regex literals simply end at the end of the line.

According to the specification, the flags of regular expressions are IdentifierParts (unknown and repeated regex flags become errors at a later stage).

Differentiating between regex and division in JavaScript is really tricky. js-tokens looks at the previous token to tell them apart. As long as the previous tokens are valid, it should do the right thing. For invalid code, js-tokens might be confused and start matching division as regex or vice versa.

Examples:

/a/
/a/gimsuy
/a/Inva1id
/+/
/[/]\//
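
Since flags are IdentifierParts and therefore cannot contain a `/`, the last `/` in a closed regex token separates the pattern from the flags. A small sketch of that split follows; `splitRegexLiteral` is a hypothetical helper and not part of js-tokens.

```javascript
// Hypothetical helper: splits a closed RegularExpressionLiteral token into
// pattern and flags. Returns null for other tokens and for unclosed regexes.
function splitRegexLiteral(token) {
  if (token.type !== "RegularExpressionLiteral" || !token.closed) {
    return null;
  }
  // Flags cannot contain "/", so the last "/" is the closing delimiter.
  const end = token.value.lastIndexOf("/");
  return {
    pattern: token.value.slice(1, end),
    flags: token.value.slice(end + 1),
  };
}

console.log(splitRegexLiteral({ type: "RegularExpressionLiteral", value: "/a/gimsuy", closed: true }));
// { pattern: 'a', flags: 'gimsuy' }
```

The flags are returned unvalidated, in line with the note above that unknown and repeated flags only become errors at a later stage.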

MultiLineComment

Spec: MultiLineComment

If the ending */ is missing, the token has closed: false. Unclosed multi-line comments go on to the end of the input.

Examples:

/* comment */
/* console.log(
    "commented", out + code);
    */
/**/
/* unclosed

SingleLineComment

Spec: SingleLineComment

Examples:

// comment
// console.log("commented", out + code);
//

HashbangComment

Spec: HashbangComment

Note that a HashbangComment can only occur at the very start of the string that is being tokenized. Anywhere else you will likely get an Invalid token # followed by a Punctuator token !.

Examples:

#!/usr/bin/env node
#! console.log("commented", out + code);
#!

IdentifierName

Spec: IdentifierName

Keywords, reserved words, null, true, false, variable names and property names.

Examples:

if
for
var
instanceof
package
null
true
false
Infinity
undefined
NaN
$variab1e_name
π
℮
ಠ_ಠ
\u006C\u006F\u006C\u0077\u0061\u0074

PrivateIdentifier

Spec: PrivateIdentifier

Any IdentifierName preceded by a #.

Examples:

#if
#for
#var
#instanceof
#package
#null
#true
#false
#Infinity
#undefined
#NaN
#$variab1e_name
#π
#℮
#ಠ_ಠ
#\u006C\u006F\u006C\u0077\u0061\u0074

NumericLiteral

Spec: NumericLiteral

Examples:

0
1.5
1
1_000
12e9
0.123e-32
0xDead_beef
0b110
12n
07
09.5
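
The token `value` is always the source text, so turning a NumericLiteral into a number is left to the consumer. A minimal sketch covering the forms listed above: `numericLiteralToValue` is a hypothetical helper, it assumes the literal is valid, and it interprets legacy octal literals such as `07` in base 8.

```javascript
// Hypothetical helper: converts the source text of a valid NumericLiteral
// into a number or a bigint.
function numericLiteralToValue(value) {
  // Numeric separators are not understood by Number() or BigInt().
  const digits = value.replace(/_/g, "");
  if (digits.endsWith("n")) {
    return BigInt(digits.slice(0, -1));
  }
  // Legacy octal: a leading 0 followed by octal digits only.
  if (/^0[0-7]+$/.test(digits)) {
    return parseInt(digits, 8);
  }
  // Decimal, hex (0x), binary (0b), octal (0o) and octal-like decimals (09.5).
  return Number(digits);
}

console.log(numericLiteralToValue("1_000")); // 1000
console.log(numericLiteralToValue("0b110")); // 6
console.log(numericLiteralToValue("017")); // 15
console.log(numericLiteralToValue("09.5")); // 9.5
```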

Punctuator

Spec: Punctuator + DivPunctuator + RightBracePunctuator

All possible values:

&&  ||  ??
--  ++
.   ?.
<   <=   >   >=
!=  !==  ==  ===
   +   -   %   &   |   ^   /   *   **   <<   >>   >>>
=  +=  -=  %=  &=  |=  ^=  /=  *=  **=  <<=  >>=  >>>=
(  )  [  ]  {  }
!  ?  :  ;  ,  ~  ...  =>

WhiteSpace

Spec: WhiteSpace

Unlike the specification, multiple whitespace characters in a row are matched as one token, not one token per character.

LineTerminatorSequence

Spec: LineTerminatorSequence

CR, LF and CRLF, plus \u2028 and \u2029.
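
Tokens carry no position information, but line numbers can be derived by counting line terminators in the token values, including those inside multi-line comments and templates. A sketch under that assumption follows; `countLineTerminators` and `withStartLines` are hypothetical helpers, and the tokens are hand-written.

```javascript
// The five line terminator sequences; CRLF comes first so it counts once.
const LINE_TERMINATOR = /\r\n|[\r\n\u2028\u2029]/g;

function countLineTerminators(text) {
  const matches = text.match(LINE_TERMINATOR);
  return matches === null ? 0 : matches.length;
}

// Hypothetical helper: yields each token with the 1-based line it starts on.
function* withStartLines(tokens) {
  let line = 1;
  for (const token of tokens) {
    yield { ...token, line };
    line += countLineTerminators(token.value);
  }
}

// Hand-written tokens in the documented shape.
const lineExample = [
  { type: "IdentifierName", value: "a" },
  { type: "LineTerminatorSequence", value: "\r\n" },
  { type: "MultiLineComment", value: "/*\n*/", closed: true },
  { type: "LineTerminatorSequence", value: "\u2028" },
  { type: "IdentifierName", value: "b" },
];

console.log(Array.from(withStartLines(lineExample), (t) => t.line).join(","));
// 1,1,2,3,4
```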

Invalid

Spec: n/a

Single code points not matched in another token.

Examples:

#
@
💩

JSX Tokens

Spec: JSX Specification

export default function jsTokens(
  input: string,
  options: { jsx: true },
): Iterable<Token | JSXToken>;

export declare type JSXToken =
  | { type: "JSXString"; value: string; closed: boolean }
  | { type: "JSXText"; value: string }
  | { type: "JSXIdentifier"; value: string }
  | { type: "JSXPunctuator"; value: string }
  | { type: "JSXInvalid"; value: string };

The tokenizer switches between outputting runs of Token and runs of JSXToken.
Runs of JSXToken can also contain WhiteSpace, LineTerminatorSequence, MultiLineComment and SingleLineComment.

JSXString

Spec: " JSXDoubleStringCharacters " + ' JSXSingleStringCharacters '

If the ending " or ' is missing, the token has closed: false. JSX strings can contain unescaped newlines, so unclosed JSX strings go on to the end of input.

Note that JSX doesn’t support escape sequences as part of the token grammar. A " or ' always closes the string, even with a backslash before.

Examples:

"string"
'string'
""
''
"\"
'\'
"multi-
line"
'multi-
line'
" unclosed
' unclosed

JSXText

Spec: JSXText

Anything but <, >, { and }.

JSXIdentifier

Spec: JSXIdentifier

Examples:

div
class
xml
x-element
x------
$htm1_element
ಠ_ಠ

JSXPunctuator

Spec: n/a

All possible values:

<
>
/
.
:
=
{
}

JSXInvalid

Spec: n/a

Single code points not matched in another token.

Examples in JSX tags:

1
`
+
,
#
@
💩

All possible values in JSX children:

>
}

Compatibility

ECMAScript

The intention is to always support the latest ECMAScript version whose feature set has been finalized.

Currently, ECMAScript 2023 is supported.

Annex B and C (strict mode)

Section B: Additional ECMAScript Features for Web Browsers of the spec is optional if the ECMAScript host is not a web browser, and specifies some additional syntax. Section C: The Strict Mode of ECMAScript disallows certain syntax in Strict Mode.

Numeric literals: js-tokens supports legacy octal and octal like numeric literals, regardless of Strict Mode.
String literals: js-tokens supports legacy octal escapes, since it allows any invalid escapes.
HTML-like comments: Not supported. js-tokens prefers treating 5<!--x as 5 < !(--x) rather than as 5 //x.
Regular expression patterns: js-tokens doesn’t care what’s between the starting / and ending /, so this is supported.

TypeScript

Supporting TypeScript is not an explicit goal, but js-tokens and Babel both tokenize this TypeScript fixture and this TSX fixture the same way, with one edge case:

type A = Array<Array<string>>
type B = Array<Array<Array<string>>>

Both lines above should end with a couple of > tokens, but js-tokens instead matches the >> and >>> operators.

JSX

JSX is supported: jsTokens("<p>Hello, world!</p>", { jsx: true }).

JavaScript runtimes

js-tokens should work in any JavaScript runtime that supports Unicode property escapes.
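
If it is unclear whether a target runtime qualifies, Unicode property escapes can be feature-detected before loading the tokenizer. This is a sketch: `supportsUnicodePropertyEscapes` is a hypothetical helper, and the `ID_Start` property is only a probe chosen for illustration, not necessarily one that js-tokens itself uses.

```javascript
// Hypothetical feature check. The RegExp constructor is used so that an
// unsupported runtime throws at call time instead of failing to parse the file.
function supportsUnicodePropertyEscapes() {
  try {
    return new RegExp("\\p{ID_Start}", "u").test("a");
  } catch (error) {
    return false;
  }
}

console.log(supportsUnicodePropertyEscapes());
```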

Known errors

Here are a couple of tricky cases:

// Case 1:
switch (x) {
  case x: {}/a/g;
  case x: {}<div>x</div>/g;
}

// Case 2:
label: {}/a/g;
label: {}<div>x</div>/g;

// Case 3:
(function f() {}/a/g);
(function f() {}<div>x</div>/g);

This is what they mean:

// Case 1:
switch (x) {
  case x:
    {
    }
    /a/g;
  case x:
    {
    }
    <div>x</div> / g;
}

// Case 2:
label: {
}
/a/g;
label: {
}
<div>x</div> / g;

// Case 3:
(function f() {}) / a / g;
(function f() {}) < div > x < /div>/g;

But js-tokens thinks they mean:

// Case 1:
switch (x) {
  case x:
    ({}) / a / g;
  case x:
    ({}) < div > x < /div>/g;
}

// Case 2:
label: ({}) / a / g;
label: ({}) < div > x < /div>/g;

// Case 3:
function f() {}
/a/g;
function f() {}
<div>x</div> / g;

In other words, js-tokens:

Mis-identifies regex as division and JSX as comparison in cases 1 and 2.
Mis-identifies division as regex and comparison as JSX in case 3.

This happens because js-tokens looks at the previous token when deciding between regex and division or JSX and comparison. In these cases, the previous token is }, which either means “end of block” (→ regex/JSX) or “end of object literal” (→ division/comparison). How does js-tokens determine if the } belongs to a block or an object literal? By looking at the token before the matching {.

In cases 1 and 2, that’s a :. A : usually means that we have an object literal or ternary:

let some = weird ? { value: {}/a/g } : {}/a/g;

But : is also used for case and labeled statements.

One idea is to look for case before the : as an exception to the rule, but it’s not so easy:

switch (x) {
  case weird ? true : {}/a/g: {}/a/g
}

The first {}/a/g is a division, while the second {}/a/g is an empty block followed by a regex. Both are preceded by a colon with a case on the same line, and it does not seem like you can distinguish between the two without implementing a parser.

Labeled statements are similarly difficult, since they are so similar to object literals:

{
  label: {}/a/g
}

({
  key: {}/a/g
})

Finally, case 3 ((function f() {}/a/g);) is also difficult, because a ) before a { means that the { is part of a block, and blocks are usually statements:

if (x) {
}
/a/g;

function f() {}
/a/g;

But function expressions are of course not statements. It’s difficult to tell a function expression from a function statement without parsing.

Luckily, none of these edge cases are likely to occur in real code.
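
The previous-token idea discussed in this section can be illustrated with a deliberately simplified sketch. It is not the code js-tokens uses: `slashStartsRegex` and the partial keyword list are assumptions made for illustration, and after `)`, `]` and `}` it always guesses division, which is exactly where the extra context described above is needed.

```javascript
// Partial list of keywords after which an expression (and so a regex) can start.
const KEYWORDS_BEFORE_EXPRESSION = new Set([
  "return", "typeof", "instanceof", "in", "new", "delete",
  "void", "throw", "case", "do", "else",
]);

// `previous` is the last token that is not whitespace or a comment,
// or undefined at the start of the input.
function slashStartsRegex(previous) {
  if (previous === undefined) return true;
  switch (previous.type) {
    case "IdentifierName":
      // `return /re/` is a regex, `x / 2` is a division.
      return KEYWORDS_BEFORE_EXPRESSION.has(previous.value);
    case "TemplateHead":
    case "TemplateMiddle":
      // An interpolation has just opened, so an expression starts here.
      return true;
    case "Punctuator":
      // Simplification: closing brackets and postfix operators mean division.
      return ![")", "]", "}", "++", "--"].includes(previous.value);
    default:
      // Numbers, strings, closed templates, regexes and private names
      // end an expression, so a following slash divides.
      return false;
  }
}

console.log(slashStartsRegex({ type: "Punctuator", value: "=" })); // true
console.log(slashStartsRegex({ type: "NumericLiteral", value: "1" })); // false
```

With this sketch, all three cases above come out as division because the previous token is `}`, which shows why a single token of context is not enough.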

Known failures

js-tokens advertises that it “never fails”. Tell you what, it can fail on extreme inputs. The regex engine of the runtime can eventually give up. js-tokens has worked around this to some extent by changing its regexes to be easier on the regex engine. To solve this completely, js-tokens would have to stop using regex, but then it wouldn’t be tiny anymore, which is the whole point. Luckily, only extreme inputs can fail, hopefully ones you’ll never encounter.

For example, if you try to parse the string literal "\n\n\n" but with 10 million \n instead of just 3, the regex engine gives up with RangeError: Maximum call stack size exceeded (or similar). Try it out:

Array.from(require("js-tokens")(`"${"\\n".repeat(1e7)}"`));

(Yes, that is the regex engine of the runtime giving up. js-tokens has no recursive functions.)

However, if you repeat a instead of \n 10 million times ("aaaaaa…"), it works:

Array.from(require("js-tokens")(`"${"a".repeat(1e7)}"`));

That’s good, because it’s much more common to have lots of non-escapes in a row in a big string literal, than having mostly escapes. (Obfuscated code might have only escapes though.)
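
Code that processes untrusted or very large input may want to guard against this failure mode. The sketch below wraps tokenization in a try/catch; `tokenizeSafely` is a hypothetical helper that receives the tokenizer as a parameter, and the two generator functions are stand-ins so that the snippet runs without the package and without a huge input.

```javascript
// Hypothetical wrapper: collects all tokens, or reports the error thrown
// when the regex engine of the runtime gives up.
function tokenizeSafely(tokenize, input) {
  try {
    return { ok: true, tokens: Array.from(tokenize(input)) };
  } catch (error) {
    return { ok: false, error };
  }
}

// Stand-in tokenizers used only for this demonstration.
function* workingTokenizer(input) {
  yield { type: "IdentifierName", value: input };
}
function* failingTokenizer() {
  throw new RangeError("Maximum call stack size exceeded");
}

console.log(tokenizeSafely(workingTokenizer, "hello").ok); // true
console.log(tokenizeSafely(failingTokenizer, "hello").ok); // false
```

With the package installed, the call would be `tokenizeSafely(jsTokens, code)`. Note that the error surfaces while iterating, not when the generator is created, so the iteration has to happen inside the `try` block.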

Safari warning

I’ve seen Safari give up instead of throwing an error.

In Safari, Chrome, Firefox and Node.js the following code successfully results in a match:

/(#)(?:a|b)+/.exec("#" + "a".repeat(1e5));

But for the following code (with 1e7 instead of 1e5), the runtimes differ:

/(#)(?:a|b)+/.exec("#" + "a".repeat(1e7));

Chrome, Firefox and Node.js all throw RangeError: Maximum call stack size exceeded (or similar).
Safari returns null (at the time of writing), silently giving up on matching the regex. It’s kind of lying that the regex did not match, while in reality it would have matched given enough computing resources.

This means that in Safari, js-tokens might not fail but instead give you unexpected tokens.

Performance

With @babel/parser for comparison. Node.js 21.6.1 on a MacBook Pro M1 (Sonoma).

Lines of code	Size	js-tokens	@babel/parser
~100	~4.0 KiB	~2 ms	~10 ms
~1 000	~39 KiB	~5 ms	~27 ms
~10 000	~353 KiB	~44 ms	~108 ms
~100 000	~5.1 MiB	~333 ms	~2.0 s
~2 400 000	~138 MiB	~7 s	~4 m 9 s (*)

(*) Required increasing the Node.js memory limit (I set it to 8 GiB).

See benchmark.js if you want to run benchmarks yourself.

js-tokens's Issues

`Maximum call stack size exceeded` error happens with a long string literal

When the input code contains a long string literal, js-tokens throws a Maximum call stack size exceeded error.

Reproduction: https://github.com/sapphi-red-repros/js-tokens-maximum-call-stack-size-exceeded-repro

It happens with

import jsTokens from 'js-tokens'

const code = `const foo = "${'f'.repeat(1000 * 1000 * 10)}"`
const iterable = jsTokens(code, { jsx: false })
for (const token of iterable) {
  //
}

as well.

Original issue: vitejs/vite#15703

webjars

Bit of a random one really but do you maintain / help get js-tokens into webjars?

https://www.webjars.org/

We had a recent break in our ScalaJs app where via a transitive dependency, js-tokens was requested from webjars at a version > 3.0.2. As best I can work out, webjars only has 3.0.2 in it and so we couldn't resolve a newer version and it broke the world.

Do you know how we can get more versions into webjars? I couldn't find much on https://github.com/webjars or elsewhere.

Specifically, we reference:

"org.webjars.npm" % "js-tokens" % "3.0.2"

Unescaped closing bracket ending RegularExpressionLiteral early

JS-tokens ends the RegularExpressionLiteral when it reaches a ], but Firefox and Chrome treat it as if it has been escaped.

> [...jsTokens("/xyz]def/")]
[
  { type: 'RegularExpressionLiteral', value: '/xyz', closed: false },
  { type: 'Punctuator', value: ']' },
  { type: 'IdentifierName', value: 'def' },
  { type: 'Punctuator', value: '/' }
]
> /xyz]def/
/xyz]def/
> [...jsTokens("/xyz]def*/")]
[
  { type: 'RegularExpressionLiteral', value: '/xyz', closed: false },
  { type: 'Punctuator', value: ']' },
  { type: 'IdentifierName', value: 'def' },
  { type: 'Punctuator', value: '*' },
  { type: 'RegularExpressionLiteral', value: '/', closed: false }
]
> /xyz]def*/
/xyz]def*/

Encountered in a real file in fluent.js.

I'm not sure if this is technically a bug with JS-tokens. The regex does not match the spec, so this means Chrome and Firefox go against the spec. However, they will likely not fix this due to backwards compatibility, so it seems more appropriate to update JS-tokens.

Properly identify regex and division using Lookbehind assertions in regex

Following the documented Limitations on "Division and regex literals collision", since current oldest Nodejs LTS v8 has Lookbehind assertions in regex, please update the implementation if possible. Thanks!

Escaped newlines between multi-line strings cause merging of different strings

We're writing a JS util to parse CLI commands, and whilst testing various inputs I realised that if you have a multi-line command with escaped newlines where the escaped newline falls between two multi-line strings which use the same quote character, then the tokens are mis-handled and the quotes either side of the escaped newline get treated as a single string argument.

This code generates ' --data ' as a single token, whereas I think the escaped newline should be ignored but treated as a token break rather than the middle of a string.

import jsTokens from "js-tokens";

// Snipped from a multi-line command, eg, `curl -X POST url \
// --data etc`
const exampleInput = `
--data '{
  \
}' \
--data '{
  \
}'
`;

const tokens = jsTokens(exampleInput);
console.log(Array.from(tokens));

/* Output:
[
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'Punctuator', value: '--' },
  { type: 'IdentifierName', value: 'data' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'StringLiteral', value: "'{", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "' --data '", closed: true },
  { type: 'Punctuator', value: '{' },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "'", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' }
]
*/

import jsTokens from "js-tokens";

const expectedOutputInput = `
--data '{
  \
}'
--data '{
  \
}'
`;

const tokens = jsTokens(expectedOutputInput);
console.log(Array.from(tokens));

/* Expected output:
[
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'Punctuator', value: '--' },
  { type: 'IdentifierName', value: 'data' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'StringLiteral', value: "'{", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "'", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'Punctuator', value: '--' },
  { type: 'IdentifierName', value: 'data' },
  { type: 'WhiteSpace', value: ' ' },
  { type: 'StringLiteral', value: "'{", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' },
  { type: 'WhiteSpace', value: '  ' },
  { type: 'Punctuator', value: '}' },
  { type: 'StringLiteral', value: "'", closed: false },
  { type: 'LineTerminatorSequence', value: '\n' }
]
*/

Any ideas if there's a way to force the tokeniser to handle this, or is it a bug as I suspect?

So how can you, as a JavaScript developer, ensure that your RegExps are fast? If you are not interested in hooking into RegExp internals, make sure that neither the RegExp instance, nor its prototype is modified in order to get the best performance:
var re = /./g;
re.exec('');  // Fast path.
re.new_property = 'slow';

This module exports a single regex, with a new property bolted on, just like in the above example. That new property should be exported separately instead. This will be a breaking change.

js-tokens can not match variables in template string.

like this

jsTokens.exec("`hello world ${variable + `a + ${b}`} `")

Under normal circumstances, it can match variable and b

Why is this a single regex?

Just wondering.
It works like a charm, but I had to split the regex at the |s.
