tc39 / proposal-binary-ast

Binary AST proposal for ECMAScript

javascript ast parser ecmascript ecma proposal standard

proposal-binary-ast's Issues

Introduce LiteralI32Expression

In practice, pretty much all integer literals we encounter are within [-2^31, 2^31) (I believe that I have never seen an integer literal outside of this range). Experience shows that introducing the following is a pretty good filesize win:

typedef (LiteralF64Expression or LiteralI32Expression)
  LiteralNumericExpression;

interface LiteralF64Expression {
  attribute double value;
}

interface LiteralI32Expression {
  attribute long value;
}

Any objection to this? The assumption being that the long will be a variable length encoded integer.
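
To make the "variable length encoded integer" assumption concrete, here is a minimal sketch using signed LEB128. The issue does not say which variable-length scheme is intended, so the choice of LEB128 and the function names `encodeI32`/`decodeI32` are assumptions for illustration only.

```javascript
// Signed LEB128 for 32-bit integers (an assumed encoding, not the spec's).
function encodeI32(value) {
  const bytes = [];
  let more = true;
  while (more) {
    let byte = value & 0x7f;
    value >>= 7; // arithmetic shift keeps the sign
    // Stop once the remaining bits are just the sign extension of bit 6.
    const done =
      (value === 0 && (byte & 0x40) === 0) ||
      (value === -1 && (byte & 0x40) !== 0);
    if (done) {
      more = false;
    } else {
      byte |= 0x80; // continuation bit
    }
    bytes.push(byte);
  }
  return bytes;
}

function decodeI32(bytes) {
  let result = 0;
  let shift = 0;
  let i = 0;
  let byte;
  do {
    byte = bytes[i++];
    result |= (byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  // Sign-extend when the last byte carries a set sign bit.
  if (shift < 32 && (byte & 0x40)) result |= -1 << shift;
  return result;
}
```

Under this scheme small literals such as 0, 1 or -1 take a single byte, while the extremes of the 32-bit range take five, compared with eight bytes for every double.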

What is HasMalformedDirectives?

I don't understand what this assertion is meant to accomplish. Naively it looks like it would forbid

"directive";
("directive");

which I don't think it should?

Javascript can represent invalid unicode strings

It's possible for valid JavaScript strings to be invalid Unicode strings, because JS strings are specced as arbitrary sequences of 16-bit code units. This means that ill-formed UTF-16 sequences, for example \udc11 (a lone low surrogate), can show up in our string literals.

The binast encoding needs to handle this - we cannot assume that there is always a valid translation of a JS string to a UTF-8 string. This all relates to situations where 16-bit chars fall into the surrogate pair range.

My suggestion is the following: we translate the 16-bit word sequence as if it was a UTF-16 string. This means that when we see valid surrogate pair sequences, we translate those into unicode codepoints and re-encode as a UTF-8 sequence.

When we see surrogate values that occur in invalid circumstances, we encode those directly as if they were codepoints. These 16-bit values are not valid Unicode codepoints for encoding purposes, so there is no valid UTF-8 sequence that corresponds to them. Those byte sequences are thus "free" for us to use to encode lone surrogates.
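
A minimal sketch of this suggestion, assuming the obvious generalization of UTF-8 to surrogate values. The function names `encodeLossless` and `decodeLossless` are made up for this example and the byte layout is an illustration, not something the spec defines.

```javascript
// Encode a JS string (arbitrary 16-bit code units) into UTF-8-like bytes.
// Valid surrogate pairs become one 4-byte sequence; a lone surrogate is
// encoded as if it were a codepoint, giving a 3-byte sequence that
// well-formed UTF-8 never contains.
function encodeLossless(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++; // consumed the low surrogate as well
      }
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

// Inverse of encodeLossless; assumes its input came from that function.
function decodeLossless(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; ) {
    const b = bytes[i];
    let cp, len;
    if (b < 0x80) { cp = b; len = 1; }
    else if (b < 0xe0) { cp = b & 0x1f; len = 2; }
    else if (b < 0xf0) { cp = b & 0x0f; len = 3; }
    else { cp = b & 0x07; len = 4; }
    for (let k = 1; k < len; k++) cp = (cp << 6) | (bytes[i + k] & 0x3f);
    i += len;
    out += String.fromCodePoint(cp); // lone surrogates come back as one code unit
  }
  return out;
}
```

With this sketch, "\udc11" becomes the three bytes ED B0 91, and any string round-trips unchanged because a high surrogate followed by a low surrogate is always combined before encoding.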

I'm not 100% sure this needs to be addressed in the spec, but @Yoric suggested I make the issue here because it may need to be addressed here.

Why feature post-default cases with `SwitchStatementWithDefault`?

Semantically, the labels are meaningless after parsing, so why include them as part of the AST?

In case you're wondering what precedent there is for just ignoring the extra case labels:

ESTree
Babel
UglifyJS just treats them all as one and the same, making no distinction between case and default. (The minifier still strips them as dead code, however.)

Introduce interface attribute [Skippable]

The following is a proposal to make clearer in the specifications which parts may be skipped and which may not.

Current state

In the current state of things, all lists may be skipped. This was decided because it was clear that hardcoding arbitrary interface names in the detokenizer was a bad idea, and because the only alternatives were allowing all interfaces to be skipped or allowing all lists to be skipped.

This is generally a waste of bytes, as we don't care about skipping most lists.

Proposal

Attribute

We add an extended attribute [Skippable] to interfaces. A [Skippable] interface is one in which

1. the encoder stores the byte-length of the node, so that we can easily jump ahead;
2. the semantics promise that skipping causes no bad surprises (i.e. captured names, lexically declared variables, etc. are all well-defined).

I'm not entirely clear about clause 2. at this stage, but I imagine that there are (or will be in the future) subsets of Asserted*Scope that are only useful when skipping. If so, this means that we only need them in a [Skippable] interface.

Example

Now, we redefine FunctionDeclaration as follows:

typedef (EagerFunctionDeclaration or SkippableFunctionDeclaration) FunctionDeclaration;

interface EagerFunctionDeclaration {
    // ...
    // Pretty much what we currently have in `FunctionDeclaration`, unless we realize
    // that some of the data is needed only for skipping.
}

[Skippable] interface SkippableFunctionDeclaration {
    // Here, insert any attribute that may be useful for skipping.
    // For instance, we could imagine an instruction for the parser to enqueue
    // the node for background processing, with a given priority.
    attribute EagerFunctionDeclaration content;
}

We apply the same treatment to other functions/method/getter/setter/...
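
As a rough sketch of what storing the byte-length buys the decoder, assume a [Skippable] node is written as an unsigned LEB128 length followed by the payload. That layout and the helper names below are hypothetical and only illustrate the jump-ahead idea.

```javascript
// Hypothetical layout: [payload length as unsigned LEB128][payload bytes].
function writeVarUint(out, n) {
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80; // more bytes follow
    out.push(byte);
  } while (n !== 0);
}

function readVarUint(bytes, pos) {
  let value = 0;
  let shift = 0;
  let byte;
  do {
    byte = bytes[pos++];
    value |= (byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  return { value: value >>> 0, pos };
}

function encodeSkippable(payload) {
  const out = [];
  writeVarUint(out, payload.length);
  out.push(...payload);
  return out;
}

// Returns a lazy handle: the payload is not touched until force() is called.
function readSkippable(bytes, pos) {
  const { value: length, pos: start } = readVarUint(bytes, pos);
  return {
    next: start + length, // where decoding resumes after the skip
    force: () => bytes.slice(start, start + length),
  };
}
```

A decoder that meets such a node can continue at `next` immediately and only call `force()` when the function is first invoked.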

Expected benefits

We give the encoder one more optimization lever, since the encoder may decide between EagerFunctionDeclaration and SkippableFunctionDeclaration.
We have a mechanism to extend laziness to other constructions, for instance large JSON objects.
We have a clearer manner of specifying/checking which parts of the semantics may be affected by the Binary AST.
We have a clearer manner of specifying the (de)tokenizer.
We improve file size.

Javascript Tooling

TL;DR - this is cool because it unifies the JS ecosystem AST, and can improve perf when compiling JS in the current toolchain.

We expect the format to be output by existing compilers such as Babel and TypeScript, and by bundlers such as WebPack.

If I or someone else doesn't bring it up at the meeting, I'll just note here that another reason that this proposal is a 👍 from me is that it would help to unify the AST of the JavaScript ecosystem of tools (especially regarding babel, webpack, uglify, other parsers like acorn, esprima, etc). https://github.com/estree/estree was started, as most know, to unify this effort but it's diverged in certain ways (babel 5 to babel 6 https://github.com/babel/babylon/#output although we have a compat plugin).

I believe some people wanted to do an ESTree 2 (I believe the reasoning that Sebastian changed the AST in Babel 6 was just that the old AST was difficult to use when trying to create Babel plugins). If this becomes part of JS then it could be a nice forcing function for the tools that need to output the binary AST and everyone else that opts in to this could give perf benefits to developers given we can pass the same AST between all tools (ex: babel -> webpack -> uglify) and not have to re-parse in that toolchain as well as in the browser itself.

Discussion on the actual AST is a bit different but ya just saying the idea is enticing!

Also interesting is that future proposals could propose the AST node as part of the staging process (and Babel can help with that given we can start implementation at Stage 0). So we'll already have progress in that area.

Can be another issue, but I had a question about whether comments would be encoded in the AST or if that can be an extension? Webpack/uglify use comments to do chunking/dead code elim (it doesn't have to be in the output to the browser of course, but if the AST is extensible enough to encode that info in a node property/comment that would be good).

Could the binary AST be...a bit more binary?

I'm revisiting this proposal a few months later, and I'm wondering: could this proposal be better specified in terms of raw bytes? Currently, it seems largely spec'd in terms of a JSON-like format, but IMHO that doesn't really seem like it's as small as it could be. For one, it could leverage LEB128 much like WASM does and in a similar fashion. It also doesn't need to keep type names or even operator names as strings, so I feel being a bit more binary could realize the proposal's intent a little better.

A FunctionExpression/Method/Getter/Setter may capture its own name

The following snippet demonstrates how the name of a FunctionExpression can be captured:

let a = function f() {
  return function g() {
    return f;
  }
};
let b = a()(); // This is `f`

The same technique would probably work for Method, Getter, Setter.

For the moment, we have nowhere to store this information.

AssertedVarScope needs the actual kind of each name

Currently AssertedVarScope has the following structure:

interface AssertedVarScope {
  // checked eagerly during transformation
  attribute FrozenArray<IdentifierName> lexicallyDeclaredNames;
  attribute FrozenArray<IdentifierName> varDeclaredNames;

  // checked lazily as inner functions are invoked
  attribute FrozenArray<IdentifierName> capturedNames;
  attribute boolean hasDirectEval;
};

So at least it doesn't distinguish between let and const.

This is troublesome for streaming compilation because the kind of each name is unknown until we hit the VariableDeclaration, which can appear later in the statement list (there can be many other statements before it), and we cannot create an efficient representation of the scope until we have hit all of them.

Also, it would be nice if it also distinguished between var and function, which might be too SpiderMonkey-specific though.

The binary format doesn't have to directly mirror normal JS.

I'm not convinced it's required to marry the binary format too tightly to the JS source, and there are things you could do to make it simpler. It's not like the binary format is meant to be human-readable, just machine-readable. There are also idioms in compiled-to-JS code (e.g. from Elm, CoffeeScript, and Babel) that could also stand to benefit from a binary format that's at least aware of some of their needs.

Here's a few of the ideas I have to throw at the wall. Apologies if this comes across as a bit rambly.

Adding a constant for undefined
- This amounts to the vast majority of uses of void 0 (what UglifyJS replaces strict-mode undefined with) and similar.
- UglifyJS already prefers a local permanently undefined variable over void 0 for code size reasons.
- It's already a pseudo-literal in strict mode.
Allowing synthetic, anonymous locals that aren't viewable by direct eval or with
- This would be useful for anything compiling to JS with extra sugar or whatnot, like Scala.js, TypeScript, and Babel.
- This would also be useful for stripping local names from the binary for minification purposes.
Adding built-in source map support.
- Read: nodes can have locations attached.
Adding a built-in description/metadata field.
- Read: you can still include your license and such. It's just not something that can easily be placed as like a comment.
Adding combined type-checking operators for typeof x === "number", etc.
- This amounts to 90% of typeof use cases, and engines already desugar these 100% of the time. (It's an obvious optimization.)
Adding combined coercion operators for x | 0/~~x, !!x, x + "", etc.
- Those are the only uses for any of those productions, and engines never implement these literally.
- In binary asm.js code, one of those could replace the standard x | 0 or at least act as a synonym.
- Elm and TypeScript code could use these quite frequently.
Separating the pragmas from the source/function body.
- This avoids having to parse the source via lookahead to determine what pragmas apply.
- This makes it much simpler to just skip what you don't know, and skip pragmas you do know, but have already matched.
Separating strings from the code, instead storing debug string/name references as {int offset; int length} pairs and putting their values consecutively in a string table before any real code references.
- The string table could just be slab-allocated from the start, then filled with data as fast as it receives it. (This is very low-allocation and easily optimizable, so parsing this part is I/O-bound.)
- It avoids needing to allocate nearly as much when the string/name is being parsed, needing only a reference to the data.
- The string table is specific to the script/module file, and can be reclaimed after compilation. (String names are copied during bytecode generation post-parsing.)
- There are potential size benefits here in my experience for making this separation: binary data and string data almost always compress differently, and from previous experimentation under similar circumstances (compressing data in a long URL hash), I noticed about a 2%-4% decrease alone just by doing this.
Reducing logical "and"/"or" to corresponding if/else variants: test && other ↔ (tmp = test) ? other : tmp, test || other ↔ (tmp = test) ? tmp : other.
- This could increase code size slightly (no more than a few bytes overhead if it can't reuse an existing synthetic local), but every engine I'm aware of performs this desugaring already, so why make it different.
Using breakable blocks with an expression-based AST instead of a statement-based one.
- Expression-based ASTs are usually smaller than statement-based ones.
- You could optimize certain cases like let x = foo(); try { x = foo(); } catch (e) { if (!(e instanceof SafeError)) throw e }, where it could be simplified to something roughly like let x = try { foo() } catch (e) { e instanceof SafeError ? undefined : throw e }.
- Elm's highly expression-oriented nature would love this, too.
Requiring fallthrough to be explicit in switch statements.
- It's not common to actually want the fallthrough mechanism.
- Combined with the previous version, many uses of switch would become much smaller when encoded.
Declaring locals before script/module/function body.
- This makes it easier for engines to type-check which locals are accessible where.
Declaring imports/exports before module data table.
- These would be specified in a separate array, followed by a list of imported bindings from each, for performance reasons.
- This can enable engines to efficiently request and link modules before they even start parsing the meat of them. (In normal JS, you can't do this. At all. Not without some deep wizardry no engine implements support for.)
- Loading the data table is one of the most I/O-bound parts of the compilation process, so it's possible multiple modules could have their data tables being loaded simultaneously, even in single-threaded environments.
Encoding the bytecode as a hybrid register/stack machine.
- This removes most of the need for temporaries.
- This makes the bytecode a little more dense.
- Not 100% sure how this is otherwise helpful, though...
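
The string table idea from the list above can be sketched as follows. This assumes UTF-8 storage and well-formed strings; `buildStringTable` and `readString` are hypothetical names, and the {offset, length} reference shape simply mirrors the pairs described in that item.

```javascript
// Collect every distinct string once, back to back, and hand out
// { offset, length } references into the resulting byte table.
function buildStringTable(strings) {
  const encoder = new TextEncoder();
  const refs = new Map(); // string -> { offset, length }
  const chunks = [];
  let offset = 0;
  for (const s of strings) {
    if (refs.has(s)) continue; // each string is stored only once
    const bytes = encoder.encode(s);
    refs.set(s, { offset, length: bytes.length });
    chunks.push(bytes);
    offset += bytes.length;
  }
  // One allocation for the whole table, filled sequentially.
  const table = new Uint8Array(offset);
  let pos = 0;
  for (const chunk of chunks) {
    table.set(chunk, pos);
    pos += chunk.length;
  }
  return { table, refs };
}

// Decoding a reference only needs a view into the table, not a copy of it.
function readString(table, ref) {
  const view = table.subarray(ref.offset, ref.offset + ref.length);
  return new TextDecoder().decode(view);
}
```

The code section would then carry only the small integer pairs, and the table can be dropped once compilation has copied out the names it needs.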

Subresource Integrity integration

How would a binary AST encoded resource be delivered when SRI is involved? How should the hash be calculated on both the encoder and the decoder sides?

Positions (à la source maps)

The current text of the proposal does not mention anything about mapping binary AST offsets back to source positions. If the original input is JavaScript source code, then I understand that debugging tools could show the pretty-printed version of the AST. However, if the input JS or the binary AST itself was produced by a compiler from another language, we need some way to map back to the original source file.

Directly leveraging source maps as they currently exist does not seem possible. They rely on very precise positions in the compiled .js file. If the .js file used for source mapping is the product of pretty-printing the AST, the source maps are very sensitively dependent on the pretty-printing algorithm. This would require specifying that algorithm along with the binary AST specification. IMO that is not a desirable situation.

Instead, I would suggest two possible paths to address this.

The first would be to store original positions of nodes directly in the AST. This has the advantage that it would produce extremely accurate source positions through the compilation pipeline of the VM, eventually resulting in better source mapping for the binary AST than what we enjoy with source maps at the moment. The disadvantage is that the binary AST itself is encumbered by positions, which would probably amount to a significant portion of the file size. Although useful for development "builds", it would unnecessarily bloat production files. That could be mitigated by a global flag at the beginning of the file telling whether positions are stored or not.

An alternative would be to define an equivalent to source maps, specifically designed for binary AST. Such source maps would map binary AST offsets to source positions.
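
A minimal sketch of the second path, assuming the map is simply a list of entries sorted by byte offset. The entry shape, the file names and `findPosition` are hypothetical; no such format is defined by the proposal.

```javascript
// Hypothetical map: each entry says "the node starting at this byte offset
// of the binary AST came from this position in the original source".
const exampleMap = [
  { byteOffset: 0, source: "main.scala", line: 1, column: 0 },
  { byteOffset: 12, source: "main.scala", line: 3, column: 4 },
  { byteOffset: 40, source: "util.scala", line: 10, column: 2 },
];

// Binary search for the last entry starting at or before byteOffset.
function findPosition(entries, byteOffset) {
  let lo = 0;
  let hi = entries.length - 1;
  let best = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (entries[mid].byteOffset <= byteOffset) {
      best = entries[mid];
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return best;
}
```

Because the keys are offsets into the binary file rather than line/column pairs in generated text, nothing here depends on a pretty-printing algorithm.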

Grammar quirks

It is almost certainly too early to worry about this, but a couple notes while I'm thinking of them:

The tree grammar specified does not allow for (var a = b in c);, which is a legal program (in sloppy mode, assuming Annex B) as of tc39/ecma262#614.
There's a variety of ways that well-typed trees can fail to correspond to real programs, which should all be captured in this project (except that said project hasn't been updated for async/await yet). For example, you can't have an if with an else as the body of an if without an else, even though the tree types can represent that. You also have to make sure that Identifiers are actually identifiers and that sort of thing. These aren't captured by the early error rules because they don't match the lexical grammar, and so presumably will need to be checked explicitly.
The type for TemplateExpression can be made more strict by having something like

interface Interpolation {
  attribute Expression value;
  attribute TemplateElement after;
}
interface TemplateExpression : Node {
  attribute Expression? tag;
  attribute TemplateElement start;
  attribute FrozenArray<Interpolation> elements;
};

instead of the current TemplateExpression definition which just has a list of elements which mixes Expressions and TemplateElements. Shift doesn't currently do this because it's kinda awkward to use (or, well, I think that was the justification, but have now forgotten), but this project might find it to be worth it.
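
To show what the stricter shape rules out, here is a sketch that converts a flat list into the proposed structure. It assumes the flat list alternates TemplateElement and Expression nodes and begins and ends with a TemplateElement; the node objects and the `toStrictTemplate` name are illustrative only.

```javascript
// Convert a flat, mixed element list into { start, elements: [{ value, after }] }.
// Anything that is not strictly alternating is rejected, which is exactly the
// check the stricter type would make unnecessary.
function toStrictTemplate(tag, flatElements) {
  const isText = (node) => node.type === "TemplateElement";
  if (flatElements.length % 2 === 0 || !isText(flatElements[0])) {
    throw new Error("template must start and end with a TemplateElement");
  }
  const elements = [];
  for (let i = 1; i < flatElements.length; i += 2) {
    const value = flatElements[i];
    const after = flatElements[i + 1];
    if (isText(value) || !isText(after)) {
      throw new Error("template elements must alternate");
    }
    elements.push({ value, after });
  }
  return { type: "TemplateExpression", tag, start: flatElements[0], elements };
}
```

For a template like `a${x}b`, the result has start "a" and a single interpolation whose value is x and whose trailing element is "b".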

Transfer repo to TC39 org

All stage 1+ proposals need to live in the tc39 org on github. Please follow the transfer instructions to move this repo to the appropriate place.

Clarify experiment results

The time required to create a full AST (without verifying annotations) was reduced by ~70-90%, which is a considerable reduction since parsing time in SpiderMonkey for the plain JavaScript was 500-800 ms for the benchmark.

Is "the time required to create a full AST" 500-800 ms? Or is it a subset of that? Maybe stating the actual reduction in milliseconds would be helpful.

.length property should be available for lazy function

Currently the FormalParameters params field is inside the FunctionExpressionContents interface, which is reached through a [Lazy] attribute of the LazyFunctionExpression interface.
The function object's .length property needs that information, even before executing the function.

interface LazyFunctionExpression : Node {
  attribute boolean isAsync;
  attribute boolean isGenerator;
  attribute BindingIdentifier? name;
  attribute FrozenArray<Directive> directives;
  [Lazy] attribute FunctionExpressionContents contents;
};

interface FunctionExpressionContents : Node {
  attribute boolean isFunctionNameCaptured;
  attribute boolean isThisCaptured;
  attribute AssertedParameterScope parameterScope;
  attribute FormalParameters params;
  attribute AssertedVarScope bodyScope;
  attribute FunctionBody body;
};

If we're going to entirely skip parsing the FunctionExpressionContents contents field for lazy functions until the function is executed, the length of the formal parameters should be put into the LazyFunctionExpression interface.
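
For reference, .length is not simply the number of parameters: it counts the parameters that precede the first default value or rest parameter. The sketch below computes that count from a hypothetical parameter description (`hasDefault` and `isRest` are made-up field names), which is the value an encoder would need to store eagerly.

```javascript
// Count parameters up to the first one with a default value or a rest marker.
function expectedArgumentCount(params) {
  let count = 0;
  for (const p of params) {
    if (p.hasDefault || p.isRest) break;
    count++;
  }
  return count;
}

// The same rule, observed on real functions:
const lengths = [
  (function (a, b, c) {}).length,     // 3
  (function (a, b = 1, c) {}).length, // 1
  (function (a, ...rest) {}).length,  // 1
];
```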

FunctionBody should be a Node in order to use "or"

Currently FunctionBody is FrozenArray.

typedef FrozenArray<Statement> FunctionBody;

but ArrowExpressionContents contains an attribute with (FunctionBody or Expression) type, which, I think, means FunctionBody and Expression should have common super interface.

interface ArrowExpressionContents : Node {
...
  attribute (FunctionBody or Expression) body;
};

So FunctionBody should be the following definition.

interface FunctionBody : Node {
  attribute FrozenArray<Statement> statements;
};

Why allow early errors to be serialized?

Curious as to why even allow early errors to become an AST?

Optimizing booleans, numbers, etc.

For the moment, literal numbers and booleans are represented by

interface LiteralInfinityExpression : Node { };

interface LiteralNumericExpression {
  attribute double value;
};

interface LiteralBooleanExpression : Node {
  attribute boolean value;
};

This typically means that a boolean will be stored as 2 bytes and a number other than infinity as 9 bytes. The latter is particularly odd, since the vast majority of numbers are 0 and 1. So, in AST v2, I had success decreasing the size of files by introducing special literal values for 0, true and false.

One way of doing this would be to introduce in the grammar

interface LiteralInfinityExpression : Node { };
typedef (LiteralZeroExpression or
  LiteralOneExpression or
  LiteralDoubleExpression)
 LiteralNumericExpression;

interface LiteralDoubleExpression {
  attribute double value;
};
interface LiteralZeroExpression {};
interface LiteralOneExpression {};

typedef (LiteralTrueExpression or LiteralFalseExpression) LiteralBooleanExpression;

interface LiteralTrueExpression : Node { }
interface LiteralFalseExpression : Node { }

Similarly, it seems odd to reserve one byte for UpdateExpression to determine whether it's a prefix or a postfix expression, so we could rewrite

typedef (PrefixUpdateExpression or PostfixUpdateExpression) UpdateExpression;

interface PrefixUpdateExpression : Node {
  attribute UpdateOperator operator;
  attribute SimpleAssignmentTarget operand;
};

interface PostfixUpdateExpression : Node {
  attribute UpdateOperator operator;
  attribute SimpleAssignmentTarget operand;
};

Admittedly, introducing a change in the AST solely for the sake of compression sounds a bit odd. An alternative would be to either trust compression (but this didn't seem to work that well in AST v2) or somehow make the (de)tokenizers smart enough to introduce the above changes transparently (note to self: the deanonymizer would certainly be smart enough to do this).
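
The byte counts quoted earlier in this issue (9 bytes for a number, i.e. presumably a 1-byte tag plus an 8-byte double) allow a back-of-the-envelope estimate of the saving. The sketch below assumes the specialized zero and one literals cost a single tag byte each; `estimateBytes` is a hypothetical helper, not a measurement of any real encoder.

```javascript
// Estimate the encoded size of a list of numeric literals.
// Generic case: 1-byte tag + 8-byte double. Specialized 0 and 1: tag only.
function estimateBytes(numbers, specialized) {
  let total = 0;
  for (const n of numbers) {
    // Object.is keeps -0 out of the "zero" fast path.
    const isSpecial = specialized && (Object.is(n, 0) || n === 1);
    total += isSpecial ? 1 : 9;
  }
  return total;
}
```

For the literals 0, 1, 0 and 3.5, this gives 36 bytes in the generic form and 12 bytes with the specialized interfaces, before any compression is applied.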

Starting the conversation here.

Details on size overhead comparisons

It is unclear to me from the documentation provided whether the size savings specified are the result of the format essentially requiring compression as part of its specification, vs actually being meaningfully smaller; esp. compared to the minified+gzipped code that is currently the de facto deployment.

I ask because this is a significant body of new codegen, and new parsers with new attack surface -- I think it is reasonable to know how the size of the binary AST compares to the size of a standard conforming JS in a normal tool assisted environment (which to be clear, the binary AST would require).

What about directives?

I haven't attempted to benchmark that, but in the current text source world, when parsing a function, the parser must:

1. start parsing with the current directives;
2. if the first child is a string, check the string;
3. if that string is "use strict", restart from 1 with different options.

That seems awkward. Maybe there is a better way to handle this in the binary world. Or maybe "use strict" doesn't really impact parsing all that much in the binary world, to be checked.

How are the various "enum"s encoded?

They're listed as strings in the spec, but it would seem highly inefficient to encode them that way. Are they in fact encoded as strings? (If not, you could encode them as LEB128 integers.)
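
To illustrate the alternative in parentheses, here is a sketch that encodes an enum value as the index of its variant, written as an unsigned LEB128 integer. The variant order and the helper names are assumptions; the spec would have to fix the order for this to be interoperable.

```javascript
// Variant order is assumed here; encoder and decoder must agree on it.
const UPDATE_OPERATORS = ["++", "--"];

function encodeEnum(variants, value) {
  let index = variants.indexOf(value);
  if (index < 0) throw new RangeError("unknown enum value: " + value);
  const bytes = [];
  do {
    let byte = index & 0x7f;
    index >>>= 7;
    if (index !== 0) byte |= 0x80; // more bytes follow
    bytes.push(byte);
  } while (index !== 0);
  return bytes;
}

function decodeEnum(variants, bytes) {
  let index = 0;
  let shift = 0;
  let i = 0;
  let byte;
  do {
    byte = bytes[i++];
    index |= (byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  if (index >= variants.length) throw new RangeError("enum index out of range");
  return variants[index];
}
```

Any enum with at most 128 variants fits in one byte this way, instead of the length-prefixed string it would otherwise need.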

Add verification of property names vs identifiers

(Optional) type information

Binary AST could hold additional type information for variables, parameters and return types. This can help VMs do some optimizations ahead of time. Of course types should be statically validated, but potentially this static analysis could be done by the same toolchain as used for binary AST creation. Was this idea considered?

Explicit inclusion of magic number in spec

Can we have a commitment to include a magic number in this new format? A lot of the pain that Node is experiencing around supporting ES Modules is the inability to determine whether a file is ESM or CJS based on the ambiguities that exist in the syntax and the lack of some kind of pragma. The controversy around .mjs vs .js exists almost entirely due to this ambiguity. Since Node can't use mime-types like the browser, a leading pragma or magic number in these new file formats that VMs can load would provide much more optionality than having to simply rely on file extension.
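
A sketch of what a loader could do with such a header. The four bytes used here are a placeholder invented for this example, not a value taken from the proposal, and `sniffFormat` is a hypothetical name.

```javascript
// Placeholder magic number (0x89 followed by "BJS"); purely illustrative.
const MAGIC = [0x89, 0x42, 0x4a, 0x53];

// Decide how to load a file from its leading bytes alone,
// without consulting the file extension or a MIME type.
function sniffFormat(bytes) {
  const matches = MAGIC.every((expected, i) => bytes[i] === expected);
  return matches ? "binary-ast" : "text";
}
```

A leading byte that can never start a valid source text (as in this placeholder) keeps the check unambiguous.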

Reconcile streaming interpreter and streaming compiler usecases

As @sebmarkbage and others have pointed out, the deferred-until-invocation-of-function error model is not amenable to streaming interpreters. That model is amenable to streaming compilers, but to be able to interpret a binary AST stream incrementally, errors need to be even more lazy.

It is not clear to me how to reconcile the two use cases. Maximum laziness on errors enables streaming interpreters, but hurts compiled code performance by requiring runtime checks. Per-function laziness means the entire function needs to be inspected before any code can execute.

In the browser world, streaming interpreters are not a compelling use case IMO. However, streaming interpreters are more compelling for an engine specialized only for run-once code (e.g., no JITs).

We should either come up with a technical solution that enables both use cases, or explicitly decide to not support one.

Consider specializing a subset for JSON

I believe that we should consider offering a subset of the syntax specialized for JSON-style literal expressions.

Expected benefits

combined with Skippable, we could skip over these literals and read the object lazily;
one more optimization lever for the encoder when it meets JSON-style literals;
this can be used straightforwardly to extend JSON.parse() to TypedArray and create a new JSON.binify();
it is possible (but not certain) that we could make this encoding more concise and faster to decode.

Possible spec

typedef (... // Previous stuff
         LiteralExpression)
        Expression;

typedef (LiteralObjectExpression or
         LiteralBooleanExpression or
         LiteralStringExpression or
         LiteralNullExpression or
         LiteralNumericExpression or
         LiteralArrayExpression)
        LiteralExpression;

typedef (EagerLiteralObjectExpression or SkippableLiteralObjectExpression) LiteralObjectExpression;
typedef (EagerLiteralArrayExpression or SkippableLiteralArrayExpression) LiteralArrayExpression;

interface EagerLiteralObjectExpression {
    attribute FrozenArray<LiteralObjectProperty> properties;
}
[Skippable] interface SkippableLiteralObjectExpression {
    attribute EagerLiteralObjectExpression value;
}

interface LiteralObjectProperty {
   attribute LiteralPropertyName name;
   attribute LiteralExpression value;
}
interface EagerLiteralArrayExpression {
   attribute FrozenArray<LiteralExpression> elements;
}
[Skippable] interface SkippableLiteralArrayExpression {
    attribute EagerLiteralArrayExpression value;
}

Add high-level goal to allow ASTs to be inlined into <script> tags

I really think this proposal is great, and I wanted to suggest the addition of one relatively small high-level goal that I didn't see in the proposal.

It can be very useful in some cases to inline critical JavaScript into an HTML page, especially if the browser doesn't support HTTP/2 server push. I would love to see this project have as a high-level goal the ability for ASTs to be embedded into an HTML <script> tag. This could either be as the contents of the tag, or, more likely, as a data URI for the src attribute.

What should be put into AssertedParameterScope for duplicate parameters?

from https://bugzilla.mozilla.org/show_bug.cgi?id=1497788

If there's duplicate parameters, what should be in AssertedParameterScope?

The current spec requires AssertedPositionalParameterName for them.
like,

function f(a, a) {}

will have the following data for AssertedParameterScope:

AssertedParameterScope {
  paramNames: [
    AssertedPositionalParameterName {
      index: 0,
      name: "a",
      ...
    },
    AssertedPositionalParameterName {
      index: 1,
      name: "a",
      ...
    },
  ],
  ...
}

given the purpose of Asserted*Scope is to provide the information about binding, having duplicate entry without any info about duplication won't be nice.

The situation is as follows:

duplicate parameters are allowed only if there's no destructuring/default/rest parameters
only the last one of the duplicate parameters become the actual binding

what I can think of is the following 2 solutions:

(a) Use yet another `Asserted*Name` with index for non-last duplicate parameters

for example, AssertedPositionalDuplicateParameterName

AssertedParameterScope {
  paramNames: [
    AssertedPositionalDuplicateParameterName {
      index: 0,
      name: "a",
      ...
    },
    AssertedPositionalParameterName {
      index: 1,
      name: "a",
      ...
    },
  ],
  ...
}

this way, CheckParameterNames and CheckPositionalParameterIndices can be done in almost same way as current ones

(b) Do not put non-last duplicate parameters

AssertedParameterScope {
  paramNames: [
    AssertedPositionalParameterName {
      index: 1,
      name: "a",
      ...
    },
  ],
  ...
}

this is smallest, but CheckParameterNames and CheckPositionalParameterIndices should be modified in order to check duplication
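
The two options can be compared with a small sketch that derives the paramNames list from a plain list of parameter names. `tagDuplicates` and `keepLastOnly` are hypothetical helpers, and the entries are reduced to the kind, index and name fields shown above.

```javascript
// Index of the last occurrence of each name: that occurrence is the binding.
function lastOccurrences(params) {
  const last = new Map();
  params.forEach((name, index) => last.set(name, index));
  return last;
}

// Option (a): keep every position, but tag the non-last duplicates.
function tagDuplicates(params) {
  const last = lastOccurrences(params);
  return params.map((name, index) => ({
    kind: last.get(name) === index
      ? "AssertedPositionalParameterName"
      : "AssertedPositionalDuplicateParameterName",
    index,
    name,
  }));
}

// Option (b): drop the non-last duplicates entirely.
function keepLastOnly(params) {
  const last = lastOccurrences(params);
  return params
    .map((name, index) => ({ kind: "AssertedPositionalParameterName", index, name }))
    .filter((entry) => last.get(entry.name) === entry.index);
}

// Sloppy-mode check that the last duplicate is the actual binding.
// new Function is used so the example also runs inside strict-mode code.
const lastWins = new Function("a", "a", "return a")(1, 2); // 2
```

With option (b) the indices in the list are no longer contiguous, which is why the existing index check would need to change.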

Why do some interfaces not inherit from Node?

Is this a typo or are there semantics associated with this?

Module Target

There seems to be very little mention of modules in this spec and issue queue - I assume an ECMAScript Module target will be supported? Ideally this should likely be the primary use case here.

[Discussion] Out-of-band signal for requesting Binary AST

I recently asked on the chat about a planned way to request Binary AST from the server and got the following answer:

@Yoric: We plan to have a mechanism, but we haven't attempted to design it yet. The vague consensus for the moment was to use something like <script src="..." binsrc="...">, which seems like the cheap way to keep it backwards-compatible.

While this is a relatively simple solution, I have a concern about limitations it imposes.

In particular, in an ideal world I think it would be reasonable to support a usecase where e.g. a shared CDN with lots of JavaScript libraries could simply create Binary AST variants of all the assets, and return them instead of regular JavaScript when it knows that 1) browser supports it and 2) that such change would be mostly invisible to the consumer (that is, JS was indeed requested via <script> or import(...) or other means purely for execution, and not with XMLHttpRequest or fetch).

To support usecases like that, signal for Binary AST support should come not from HTML level (as it's much harder to get HTML updated on all the websites where script is inserted), but rather on network level.

One way to do this would be adding binast or similar marker to the Accept-Encoding list for script requests in supported browsers, which would tell the server that Binary AST version can be safely returned with Content-Encoding: binast in the response.

Using encoding headers for this goal feels quite natural, as it's mostly an encoding format for JavaScript, although one might argue that because it's not lossless in terms of debugging information, it doesn't belong to Accept-Encoding/Content-Encoding headers - in that case, I'm open to any proposals.
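
On the server side, the negotiation described above could look roughly like the sketch below. The `binast` token comes from this issue; the file names, the `chooseVariant` function and the decision to ignore q-values are assumptions made to keep the example short.

```javascript
// Pick the response variant for a script request from its Accept-Encoding header.
// Simplification: q-values (e.g. "binast;q=0") are ignored in this sketch.
function chooseVariant(acceptEncoding, hasBinaryAst) {
  const tokens = (acceptEncoding || "")
    .split(",")
    .map((part) => part.split(";")[0].trim().toLowerCase());
  if (hasBinaryAst && tokens.includes("binast")) {
    return {
      file: "app.binjs", // hypothetical pre-built Binary AST asset
      headers: { "Content-Encoding": "binast", "Vary": "Accept-Encoding" },
    };
  }
  return { file: "app.js", headers: { "Vary": "Accept-Encoding" } };
}
```

The Vary header matters here for the shared-CDN case: caches must not hand the binary variant to a client that did not advertise support for it.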

We should be defining Dictionaries rather than Interfaces

I got an impromptu drive-by review from the Firefox DOM team. One of the suggestions was to turn all our Interfaces into Dictionaries.

question: resource-constrained devices

This question might not yet be answerable.

I'm curious about the impact of this proposal on resource-constrained devices.

Am I correct that a spec-compliant VM would need to implement a second, completely separate parser for the binary AST format? A resource-constrained device may find the added weight of this implementation burdensome.

However, if such a VM could be deployed with only the binary AST parser implementation, is it likely or unlikely to actually reduce the footprint of the VM compared to the way things currently stand (in terms of both memory and persistent storage usage)?

This proposal would certainly improve parsing speed on a puny device, but I'm wondering if there are other costs.

Thanks!

UPDATE: By "persistent storage usage" I mean size of the firmware blob.

Equivalent work

Are you aware of JSZap from Microsoft?
https://www.microsoft.com/en-us/research/publication/jszap-compressing-javascript-code/

Seems very close to the purpose of this proposal

Handling bindings introduced by a CatchClause then captured

Consider the following snippet:

try {
 ...
} catch (e) {
  return function() {
    throw e; // Or, really, do anything with `e`.
  }
}

In this snippet, the anonymous function captures a binding e which was introduced implicitly by a CatchClause. For the moment, we do not have a way of representing this in the AST.
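
For reference, a runnable variant of the snippet, as a sketch only: the elided try body is replaced with an assumed throw so that the catch clause actually runs, and makeRethrower is a hypothetical name.

```javascript
// Runnable variant of the snippet above; the try body is an assumption.
function makeRethrower() {
  try {
    throw new Error("boom");
  } catch (e) {
    // `e` is bound by the CatchClause and stays alive because the closure captures it.
    return function () {
      throw e;
    };
  }
}

const rethrow = makeRethrower();
let caught = null;
try {
  rethrow();
} catch (err) {
  caught = err;
}
```

The binding e outlives the catch block, which is why a scope annotation would need to record it as captured.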

Consider merging LiteralInfinityExpression into LiteralNumericExpression

As discussed on Gitter, we might want to do this to simplify handling and provide optimised representations (like binast/binjs-ref#239) more easily for both kinds of numbers.

Some points from the discussion:

* double can store infinity values just fine in any IEEE 754 compatible representation, so this shouldn't be an issue. One voiced concern is that WebIDL defines double as finite 64-bit floating-point numbers, but this likely shouldn't be a problem due to how we use it.
* Another concern is having to provide a specialised codegen for infinity values in LiteralNumericExpression that would produce something like 1e111111111... (see https://bugzilla.mozilla.org/show_bug.cgi?id=1524302#c2), but implementors would need to do the same for LiteralInfinityExpression as well, so it's only a matter of where to put the branching.
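
A rough sketch of where that branching could live if the two node types were merged. numericLiteralSource is a hypothetical helper, and the choice of "2e308" as the printed form is an assumption (any decimal literal above Number.MAX_VALUE evaluates to Infinity):

```javascript
// Hypothetical codegen helper: print a numeric literal value as JavaScript source text.
function numericLiteralSource(value) {
  if (value === Infinity) {
    // No finite spelling exists; a literal above Number.MAX_VALUE evaluates to Infinity.
    return "2e308";
  }
  if (typeof value !== "number" || Number.isNaN(value) || value < 0) {
    // Literals are never negative or NaN; a minus sign is a separate unary operator.
    throw new RangeError("numeric literals are non-negative and never NaN");
  }
  return String(value);
}
```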

cc @syg @Yoric @arai-a

Avoiding AST and inserting the needed information into the source code

In the meeting notes of the July TC39 meeting, it was said that the space savings of the AST are minimal (compared to a minified version) - about 5%, so AST-ing the source code doesn't give a lot of benefit in terms of space-savings.

OTOH, the main (and considerable!) parse-time savings are accomplished from:
* Knowing which variables are hoisted in a scope
* Knowing which variables are "closed over" in a scope
* Marking the presence of an eval or with in a scope
* Early errors can de-optimize lazy parsing
* Efficiencies when using binary instead of text characters.

Then again, defining an AST is a big endeavor, and may be fraught with politics, not to mention having to maintain it forever every time the language changes syntactically.

So given that an AST is problematic, and yet the parse-time savings from the proposal are considerable, I was wondering whether we could get the parse-time savings by using various markers in the source code to add the information that is missing. In other words, the above points that are saved can be encoded into the source code using special comments or string directives. Something like:

//@efficient-parse

var x;
function f() {
  //@efficient-parse: hoisted[x]
  function use_x() {
    use(x);
  }

  var x;
}

(I wouldn't even call the above a strawman proposal, but just an incredibly rough draft to determine where I'm going here...)

This will probably enable parsers to deal with the first four points listed above, but will not let them deal with the last point (efficiencies from the use of binary instead of text characters), given that it still keeps the source code as source code with text characters. Unfortunately, I don't know how much of the performance improvement came from each of the points above. It may be that 90% of the gains were because text was discarded, which makes my proposal of keeping the source code as text dead on arrival. :-)

So I can't really know whether my proposal is efficient, but if it is, I believe the benefits of keeping the source code as source code and not going into the endeavor of defining an AST for JS may outweigh the decrease in performance that keeping source code as text will make.

Comparison to WebASM

So I fully understand the differences, but some visitors may not. It might be nice to have a section about the differences between this and WebASM, and maybe (for the rest of us) why this is seen as more suitable or different than WebASM philosophically/practically.

Can we put more information into AssertedParameterScope ?

Another extension for https://github.com/binast/ecmascript-binary-ast/issues/50

The requirement on the implementation is the following:
https://bugzilla.mozilla.org/show_bug.cgi?id=1475458

To satisfy it without changing the implementation (this is just for reference; I'm also thinking about changing the implementation side), the following structure is necessary:

enum ParameterKind {
    "simple",
    "default",
    "destructuring",
    "destructuring default",
    "rest",
    "destructuring rest"
};

interface AssertedParameterName {
  attribute unsigned short index;
  attribute ParameterKind kind;
  attribute IdentifierName name;
  attribute boolean isCaptured;
};

interface AssertedParameterScope {
  attribute FrozenArray<AssertedParameterName> paramNames;
  attribute boolean hasDirectEval;
};

index is the position in the parameter list at which the binding is defined.

For example, function f(a, b=10, {c}, [d, e] = [], f, ...g) {} has the following scope data:

AssertedParameterScope {
  paramNames: [
    AssertedParameterName {
      index: 0, kind: "simple", name: "a", isCaptured: false
    },
    AssertedParameterName {
      index: 1, kind: "default", name: "b", isCaptured: false
    },
    AssertedParameterName {
      index: 2, kind: "destructuring", name: "c", isCaptured: false
    },
    AssertedParameterName {
      index: 3, kind: "destructuring default", name: "d", isCaptured: false
    },
    AssertedParameterName {
      index: 3, kind: "destructuring default", name: "e", isCaptured: false
    },
    AssertedParameterName {
      index: 4, kind: "simple", name: "f", isCaptured: false
    },
    AssertedParameterName {
      index: 5, kind: "rest", name: "g", isCaptured: false
    },
  ],
  hasDirectEval: false
}
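
As an illustration only, the scope data above could be derived from a parameter list roughly as follows. This sketch assumes ESTree-style pattern nodes as input (Identifier, AssignmentPattern, ObjectPattern, ArrayPattern, RestElement), which is not what the proposal itself uses; all function names are hypothetical, and isCaptured is hard-coded because capture analysis is out of scope here.

```javascript
// Collect every identifier bound by an ESTree-style binding pattern (assumed input shape).
function boundNames(pattern) {
  switch (pattern.type) {
    case "Identifier":
      return [pattern.name];
    case "AssignmentPattern":
      return boundNames(pattern.left);
    case "RestElement":
      return boundNames(pattern.argument);
    case "ArrayPattern":
      // Holes in array patterns are null and bind nothing.
      return pattern.elements.filter(Boolean).flatMap((el) => boundNames(el));
    case "ObjectPattern":
      return pattern.properties.flatMap((p) =>
        boundNames(p.type === "RestElement" ? p : p.value));
    default:
      throw new TypeError("unexpected pattern: " + pattern.type);
  }
}

// Map one top-level parameter to the ParameterKind enum proposed above.
function parameterKind(param) {
  if (param.type === "Identifier") return "simple";
  if (param.type === "RestElement") {
    return param.argument.type === "Identifier" ? "rest" : "destructuring rest";
  }
  if (param.type === "AssignmentPattern") {
    return param.left.type === "Identifier" ? "default" : "destructuring default";
  }
  return "destructuring";
}

// Produce one entry per bound name; `index` is the position in the parameter list.
function assertedParameterNames(params) {
  return params.flatMap((param, index) =>
    boundNames(param).map((name) => ({
      index,
      kind: parameterKind(param),
      name,
      isCaptured: false, // Placeholder: capture analysis is not modelled in this sketch.
    })));
}

// function f(a, b=10, {c}, [d, e] = [], f, ...g) {}
const id = (name) => ({ type: "Identifier", name });
const params = [
  id("a"),
  { type: "AssignmentPattern", left: id("b"), right: { type: "Literal", value: 10 } },
  { type: "ObjectPattern", properties: [{ type: "Property", key: id("c"), value: id("c") }] },
  { type: "AssignmentPattern",
    left: { type: "ArrayPattern", elements: [id("d"), id("e")] },
    right: { type: "ArrayExpression", elements: [] } },
  id("f"),
  { type: "RestElement", argument: id("g") },
];
const names = assertedParameterNames(params);
```

Note that d and e share index 3, since both come from the fourth parameter.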

Also, we need bodyScope before param.

Consider specifying how browsers tell servers they accept AST files

Thanks for such a cool proposal!

Continuing on from this thread, I wanted to suggest that there should be some sort of structured way for browsers to tell servers whether or not the browser understands Binary AST files, and, if supported, which version(s) are understood.

I think the most likely candidate for implementation is the HTTP Accept header, although it gets a bit complicated when combined with features like server push or inlined scripts.

If browsers don't send an Accept header or something like it, servers will have to use user-agent sniffing to figure out whether to send Binary AST or JavaScript. In my experience with sending different versions of JS to browsers, this is a pretty cumbersome and error-prone solution.

Thanks again for this!

Parameter binding lists should force parameters bindings to be enumerated in source order.

Right now the reference encoder feels free to sort parameter bindings lexicographically, which doesn't play nicely with engines.

What do deferred early errors mean for JSON?

This might be beyond the scope of the current proposal, but I expect that we'll want to apply binary parsing also for JSON data.

For the specific case of JSON, I suspect that we want the data to be checked eagerly. Does this mean that we want to guarantee that some subsets of ECMAScript will be checked eagerly? That we want to be able to specify that some files need to be parsed eagerly?

The field order doesn't always match the streaming compilation order

When experimenting with streaming compilation from a multipart .binjs file to SpiderMonkey bytecode, the order of the fields often becomes a problem [1][2][3].

The issue is that, if we don't seek or look ahead, and don't keep an in-memory AST converted from the .binjs file, we have to compile in the same order as the .binjs file. But SpiderMonkey often emits the bytecode in a different order than the original JS syntax itself (or interleaves sub-trees), and sometimes applies optimizations that depend on nodes which appear later.

The issue comes from the fact that we cannot look ahead without extra overhead in the current format (thus the current experimental implementation doesn't support seeking), because non-Skippable nodes don't have a length property at the beginning of their serialized data.

Of course we could emit different bytecode than for the original JS, in the same order as the .binjs file, but that would be a source of extra .binjs-specific bugs, and I'd like to avoid it as much as possible.

So it would be nice if we could support (maybe partial) tree traversal without reading the same field twice, at the file-format level.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1456006#c1
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1456404#c2
[3] https://github.com/binast/ecmascript-binary-ast/issues/33

Add comments to length field of functions.

from https://bugzilla.mozilla.org/show_bug.cgi?id=1472103#c20

attribute unsigned long length; needs a comment saying that it is the function object's length property.
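
For readers unfamiliar with that property, a short illustration (the function names are made up for the example): a function's length counts the parameters before the first one with a default value or a rest parameter, so it is not simply the number of formal parameters.

```javascript
// `length` stops counting at the first parameter with a default value or at a rest parameter.
function plain(a, b, c) {}
function withDefault(a, b = 10, c) {}
function withRest(a, ...rest) {}

const lengths = [plain.length, withDefault.length, withRest.length];
```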

Can we have `AssertedPositionalParameterName` interface which contains index?

At first I thought this is the same issue as https://github.com/binast/ecmascript-binary-ast/issues/30, but I guess it's a bit different.

While creating binding data for parameters, we want a list of positional formal parameters and their indices, which map directly to arguments elements, so that each binding name can be mapped to its arguments slot at the point of reading the scope data.
With the current spec, this is unknown before parsing FormalParameters.

What I propose is the following:

interface AssertedPositionalParameterName {
  attribute unsigned short index;
  attribute IdentifierName name;
  attribute boolean isCaptured;
};

interface AssertedParameterName {
  attribute IdentifierName name;
  attribute boolean isCaptured;
};

typedef (AssertedPositionalParameterName or
         AssertedParameterName)
        AssertedMaybePositionalParameterName;

interface AssertedParameterScope {
  attribute FrozenArray<AssertedMaybePositionalParameterName> paramNames;
  attribute boolean hasDirectEval;
};

AssertedPositionalParameterName contains the index, which is the index in the parameter list and also the index in arguments (to be clear, it's not the index in the paramNames array).
AssertedParameterName is basically the same thing as the current AssertedBoundName.

For example, function f(a, b=10, {c}, [d, e] = [], f, ...g) {} has the following scope data:

AssertedParameterScope {
  paramNames: [
    AssertedPositionalParameterName {
      index: 0, name: "a", isCaptured: false
    },
    AssertedParameterName {
      name: "b", isCaptured: false
    },
    AssertedParameterName {
      name: "c", isCaptured: false
    },
    AssertedParameterName {
      name: "d", isCaptured: false
    },
    AssertedParameterName {
      name: "e", isCaptured: false
    },
    AssertedPositionalParameterName {
      index: 4, name: "f", isCaptured: false
    },
    AssertedParameterName {
      name: "g", isCaptured: false
    },
  ],
  hasDirectEval: false
}

Here, a and f are positional parameters, and their indices are 0 and 4.
The isSimpleParameterList field is removed, because it's obvious from paramNames (by whether there is at least one AssertedParameterName).
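
A small sketch of how a consumer could use this shape. It assumes each entry carries a type tag naming its interface (the proposal does not specify an in-memory representation), and both helper names are hypothetical:

```javascript
// Assumption: each entry carries a `type` tag naming its interface.
function isSimpleParameterList(scope) {
  // Simple means that no non-positional AssertedParameterName entry is present.
  return scope.paramNames.every(
    (entry) => entry.type === "AssertedPositionalParameterName");
}

// Map binding name -> `arguments` slot, for the positional entries only.
function argumentSlots(scope) {
  const slots = new Map();
  for (const entry of scope.paramNames) {
    if (entry.type === "AssertedPositionalParameterName") {
      slots.set(entry.name, entry.index);
    }
  }
  return slots;
}

// Scope data for function f(a, b=10, {c}, [d, e] = [], f, ...g) {}, as listed above.
const positional = (index, name) =>
  ({ type: "AssertedPositionalParameterName", index, name, isCaptured: false });
const nonPositional = (name) =>
  ({ type: "AssertedParameterName", name, isCaptured: false });
const scope = {
  paramNames: [
    positional(0, "a"), nonPositional("b"), nonPositional("c"), nonPositional("d"),
    nonPositional("e"), positional(4, "f"), nonPositional("g"),
  ],
  hasDirectEval: false,
};
const slots = argumentSlots(scope);
```

Both answers are available as soon as the scope data has been read, before FormalParameters is parsed.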

Validating Babylon ASTs

Even if they are well-typed, Babylon ASTs permit things which are not valid JavaScript syntax (for example, module declarations in scripts). What's the idea for how to validate these ASTs? I suppose, with the "delayed early error" idea, you validate at the function granularity, when the function is actually called; this would include not just what would be JavaScript early errors, but also anything that would be malformed about the Babylon AST--is that what you had in mind?

Misleading FAQ on WebAssembly comparison

README states:

Why not use WebAssembly?

There are massive existing untyped codebases, and there is no easy way to convert an untyped, garbage collected language to WebAssembly. And even if there were, there is no guarantee that it would be any faster to transmit/parse/start than what we currently have.

whereas WebAssembly FAQ states:

The kind of binary format being considered for WebAssembly can be natively decoded much faster than JavaScript can be parsed (experiments show more than 20× faster). On mobile, large compiled codes can easily take 20–40 seconds just to parse, so native decoding (especially when combined with other techniques like streaming for better-than-gzip compression) is critical to providing a good cold-load user experience.

List annotations

For the moment, we have discussed the following annotations:

has direct eval;
let-bindings;
var-bindings;
captured names;
directives.

Consider adding IfElseStatement for if-then-else, separating from IfStatement for if-then

(moved from binast/binjs-ref#131)

Background

* We want to generate bytecode directly from the .binjs file, without generating an in-memory AST structure.
* We don't want to seek or look ahead inside the .binjs file, for simplicity and performance.
* We want to reduce the amount of modification to already-generated bytecode (including source notes), although patching jump targets is necessary anyway.
* IfStatement has an optional alternate statement, and the existence of the alternate is unknown until we start parsing it. That means it is unknown when generating the branch opcode or the bytecode for the consequent.

interface IfStatement : Node {
  attribute Expression test;
  // The first `Statement`.
  attribute Statement consequent;
  // The second `Statement`, if present.
  attribute Statement? alternate;
};

So, with the current IfStatement interface, we have to modify the branch's kind (source note) once it turns out that there is an alternate.
It would be better if that kind of information were known at the beginning.

Solution

Separate IfStatement into IfStatement without alternate, and IfElseStatement with alternate

interface IfStatement : Node {
  attribute Expression test;
  attribute Statement consequent;
};

interface IfElseStatement : Node {
  attribute Expression test;
  attribute Statement consequent;
  attribute Statement alternate;
};
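
To illustrate the intent, here is a toy single-pass emitter under the split interfaces. The opcode and note names are invented for this sketch and do not correspond to SpiderMonkey's actual bytecode; tests and other statements are opaque placeholders.

```javascript
// Hypothetical emitter: opcode and note names are invented for illustration.
function emit(node, out) {
  switch (node.type) {
    case "IfStatement":
      // No alternate exists by construction, so the note is final immediately.
      out.push("note if");
      out.push("test " + node.test);
      out.push("jump-if-false end");
      emit(node.consequent, out);
      out.push("label end");
      break;
    case "IfElseStatement":
      // The node type already promises an alternate; nothing is patched later.
      out.push("note if-else");
      out.push("test " + node.test);
      out.push("jump-if-false else");
      emit(node.consequent, out);
      out.push("jump end");
      out.push("label else");
      emit(node.alternate, out);
      out.push("label end");
      break;
    default:
      // Any other statement is treated as an opaque placeholder here.
      out.push("stmt " + node.text);
  }
  return out;
}

const withoutElse = emit(
  { type: "IfStatement", test: "x", consequent: { type: "Other", text: "a()" } }, []);
const withElse = emit(
  { type: "IfElseStatement", test: "x",
    consequent: { type: "Other", text: "a()" },
    alternate: { type: "Other", text: "b()" } }, []);
```

In both cases the first emitted entry depends only on the node type, which is known before any child is read.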

Pros

Bytecode generation becomes simpler
IfStatement without alternate becomes smaller in .binjs file

Cons

The number of interfaces increases, which might increase the header size and tree size (depending on the file format)
(With multipart encoding, adding interfaces may push the number of tuple indices past 127, which results in 2-byte data inside [TREE])

Consideration

This is a very implementation-specific requirement (at least in my case), and I'm not sure it's a good idea to modify the spec for such a purpose
Similar issues (needing to know some aspect of a subtree before the actual child node appears) would arise for other interfaces as well, and splitting all of those interfaces might explode the number of interfaces
