Giter Club home page Giter Club logo

Comments (17)

bmeck avatar bmeck commented on July 30, 2024

What would be wrong with the pretty printer also generating a new source map?

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

What data would it generate the new source map from? It needs to know somehow what the original positions are. If it sees only a Binary AST file, how can it come up with "foobar.cljs, line 15, column 3"?

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

@sjrd with chained source maps, this already occurs in the wild. Some transform like babel will generate a sourcemap, then some bundler like webpack will generate a new source map that points to the old one as well and chain lookups.

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

For any kind of chained source mapping to work, you need at least what I propose as "An alternative" in the last paragraph. Otherwise you are lacking the first link of the chain.

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

There are source map comments ( see example in https://developer.mozilla.org/en-US/docs/Tools/Debugger/How_to/Use_a_source_map ) that we could serialize. Also people can use HTTP headers that allow this ( https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/SourceMap ) if it needs to be out of band.

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

From your last comment, I think you misunderstand the issue I'm describing. Telling the browser that there exists a source map for foo.binary-ast and that it's located at foo.binary-ast.map is not the problem. The problem is what is contained in that .map file.

A source map file is a function from source position to source position. The codomain of that function is clearly defined: it is a position in a source file (e.g., Foo.scala). The domain of that function, for a .js file, is also clearly defined: it is a position in that .js file. But in the case of the Binary AST, the domain of the source mapping function is currently not defined whatsoever.

The purpose of this issue is to fine some way to address this.

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

@sjrd I don't understand. I think you want source maps to always have a well known transform from JS Source Text to the AST but I don't think that is reasonable. Converting could re-order things I'd imagine:

function f() {}
function g() {}
f(g());

Could have some transform that is done by a tool that generates an AST equivalent of:

function g() {}
function f() {}
f(g());

Are you trying to remove the ability for ASTs to map to non-JS Source Texts by mandating positions rather than leaving that up to code generators?

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

On the contrary, I am trying to add the ability for ASTs to map to non-JS source texts. I want the code generators to be able to tell the VM where any given Binary AST node comes from, whatever language that source was written in.

I'm asking for basically the same feature as source maps, but where the domain of the function is a binary AST offset instead of a position in the compiled .js file.

I think you want source maps to always have a well known transform from JS Source Text to the AST

No, I don't.

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

@sjrd wouldn't this be the same as a codegenerator that had a sourcemap that pointed to the .scala files by precomputing the chained offsets?

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

It would if there was a chain to begin with. But right now the first link of the chain is missing.

I can compile a Foo.scala file to foo.js and emit a perfect foo.js.map that maps positions in foo.js to positions in Foo.scala. If I then parse and store the AST of foo.js into foo.binary-ast, I currently have no way to map the offsets in foo.binary-ast to either positions in foo.js (the first link of the chain) or Foo.scala (the whole chain). foo.js.map has become completely useless.

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

If we serialize SourceMap comments it could be inline. If you send the HTTP header it works right now?

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

No. I have addressed this rebuttal already in #17 (comment).

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

@sjrd I disagree with your rebuttal since we have mechanisms to get these source maps. The source maps can point to either a chain or to precomputed maps. I see no gain in functionality.

But in the case of the Binary AST, the domain of the source mapping function is currently not defined whatsoever

A source map is a function of position -> position, I see no reason you cannot use a byte offset as a position currently.

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

I see no reason you cannot use a byte offset as a position currently.

Because a position, in a source map, is a triple (source file, line, column). An offset is an integer.

At this point I think we are talking past each other, and completely polluting this thread and the mailboxes of watchers. I will let others decide whether this issue needs more clarification.

from proposal-binary-ast.

bmeck avatar bmeck commented on July 30, 2024

Because a position, in a source map, is a triple (source file, line, column). An offset is an integer.

We still need the source file reference.
I would always expect the same line number, just like minified JS or WASM binary format.
Column is just a number here.

This is hinted at somewhat in WASM's design for tooling.

Stating that the line should always be 0 explicitly might be enough?

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

Stating that the line should always be 0 explicitly might be enough?

Sure, that would be enough. It has to be specified, though ;)

It might not be the most efficient encoding, but that's secondary, as long as there is something specified that works.

from proposal-binary-ast.

sjrd avatar sjrd commented on July 30, 2024

Following @Yoric's suggestion, here are some thoughts derived from my experience in designing the serialized form of the Scala.js IR, which is in a tree-based format and contains source positions, optimized for size (so lives in a constraints space relatively close to the Binary AST).

Some of the ideas are also present in source maps (specification here), and some are improvements because the format of the map is binary instead of text, and the annotated target is an AST instead of source text.

Basic considerations

When we have a text format, positions are implied by the offsets in the file, and therefore weigh 0 byte. Once we go to a serialized AST, we must explicitly store positions, but to be competitive we need to strive for a short encoding that is fast to parse.

As a first approximation, an AST will follow some source code more or less closely. This means that AST nodes in the binary format will follow each other more or less as tokens follow each other in the input file. This means that we can store position diffs instead of absolute positions, because the diffs are likely much smaller than the absolute positions. This idea is exploited by the format for source maps, actually.

Note that, even when doing a relatively direct translation from the source language to its AST, the order of the nodes does not completely follow the order of the tokens in the input file. For example, for the source x + 3, the AST node for the + comes either before x and 3 (in a pre-order serialized form) or after both (in a post-order serialized form). This means that position diffs are often negative. A serialized format for positions should allow negative diffs as well as positive ones to be useful for an AST format. This is also exploited by the source maps format.

Sometimes, positions jump around arbitrarily. In particular, when considering code generated by a possibly optimizing compiler, two adjacent AST nodes can point to source positions in different files. It must therefore be possible to have absolute jumps as well as relative diffs.

There are also things that the source map format is not efficient at, if we consider an AST instead of textual source. We can improve on those aspects for a binary AST format.

First, since the format follows whole tokens in the generated .js, it assumes that each new position map entry (called a segment) is given for the next token, compared to the previous one. Since the next token can span an arbitrary number of characters, the source map format starts each segment with a character diff in the generated .js. When giving positions to AST nodes, 1 AST node almost always matches 1 or more tokens. Therefore, whereas adjacent characters would often be part of the same token and share their positions, it is not the case for adjacent AST nodes. Optimizing for "ranges of AST nodes" is therefore harmful. It is better to always imply 1 AST node = 1 stored position (segment). We therefore avoid storing the "character diff" that source maps need, improving size.

Second, source maps were by design restricted to a textual encoding, using only characters in the ASCII set, valid as unescaped characters in a JS string. This design restriction is legitimate for source maps of .js, which is text itself, but is an unjustified burden for source maps for a binary AST, which is binary anyway. We can therefore improve on source maps by storing things in a binary-efficient way, rather than a text-efficient way. In particular, this includes all 8 bits of each byte. Since parsing time is also a concern, we also want positions not to be "null"-terminated (for some definition of "null", such as , in source maps), but rather know their lengths from the beginning.

The format we used in the Scala.js IR

Based on all these considerations, we designed the format of serialized positions in the Scala.js IR as follows:

  1. Each AST node is preceded by its encoded position (see rationale in the previous section).
  2. The position is either relative to the previous one (in the same file), or absolute (including from another file), or "unknown". If relative and the previous one is unknown, it's relative to the node before the previous one, and so on until one is not unknown.
  3. Smaller diffs are encoded with fewer bytes than larger diffs (with the expectation that smaller diffs are more common than larger diffs).
  4. The length of the serialized position is known from the first byte (for faster parsing).

Concretely, this is the format we use for each position. The first byte is decomposed in bits, the next bytes in pseudo-hex (2 chars per byte):

1st byte next bytes description
ccccccc0 Column diff (7-bit signed)
llllll01 CC Line diff (6-bit signed), column (8-bit unsigned)
____0011 LL LL CC Line diff (16-bit signed), column (8-bit unsigned)
____0111 12 bytes File index, line, column (all 32-bit signed)
11111111 Unknown position

Underscores are irrelevant and must be set to 0. The file index is an index to a constant table with URIs of all source files.

As you can see, we have the following sizes:

  • 1 byte for a position on the same line, and within 64 characters left or right of the previous position (in source maps, even this is at least 3 bytes!).
  • 2 bytes for a position on a line within 32 lines up or down the previous one
  • 4 bytes for a position further away in the same file
  • 13 bytes for a position in a different file (or very far away in the same file)

Note that there are some assumptions that we can make for .scala sources (which is very rarely a compiler target), that are probably not reasonable for .js (which is very often a compiler target). For example, we assume that line length does not exceed 255 characters (we have to use 13 bytes to represent any position further on a line), or that jumping around files is pretty rare.

To be clear, I am definitely not recommending that the exact same encoding be used for the Binary AST. Rather, I hope that some of the insights above can be useful in narrowing down the spectrum in which to look for the best encoding for the Binary AST.

Some suggestions for the Binary AST

  • As actually suggested by @Yoric on the HN thread, and I totally agree with this: store positions of all AST nodes further in the file, so that they can be ignored during parsing until they are needed. They should probably be batched per-function (so that, given a function, we can immediately jump to the serialized version of its nodes), which implies that the first AST node of each function always has an absolute position.
  • Use 16-bit unsigned integers for absolute column numbers, because lines could be quite long.
  • Use variable-length fields for the absolute position, because jumping around files might be more frequent in compiled .js than in source .scala.
  • And generally: perform benchmarks on slight variations of the format (what fits into how many bytes) on a corpus of actual codebases (both source JS and JS that is the result of some compilation) to find the best trade-off. The specific details shown above where the result on such a "benchmark" for .scala codebases.

HTH

from proposal-binary-ast.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.