getify / concrete-syntax-tree Goto Github PK

Defining a standard JavaScript CST (concrete syntax tree) to complement ASTs

concrete-syntax-tree's Introduction

Update: This effort has moved to estree, so this repo has now been shut down and deprecated.

Concrete Syntax Tree

Defining a standard JavaScript CST (concrete syntax tree) to complement ASTs.

Problem

The standard SpiderMonkey AST format holds only information which is considered necessary for producing valid execution of a JS program. Parsing to this format explicitly discards certain "unnecessary" information such as comments, whitespace, and perhaps ignored syntactic forms like extra ( ) pairs, extra ;, etc.

Unfortunately, there's a class of emerging tools for which the AST is insufficient as an intermediate representation format, because loss of this information defeats the purpose of the tools. Whitespace, comments, etc need to be representable in the tree, so that a tool which wants to produce or use them, can.

More back-reading on this topic can be found in this really long thread.

Examples:

a JSDoc interpreter needs the comments, and in particular, exactly where and how they appear in the code is strictly important information.
a code formatter may only concern itself with modifying part of a JS program's formatting, and would need to preserve other code exactly as it originally appeared.

Goal

Our goal is to define an alternate tree format, called a CST (concrete syntax tree), which holds the additional information these other tools need. Of course, any tool could produce a CST, not just parsers: transpilers from other languages, code formatters, etc.

We need one standard format that all the various tools can agree to use as an intermediate form, so that processing flows can integrate multiple tools in the handling of a snippet of source code.

Plan

PLEASE GO READ THE CHARTER/MISSION FIRST

This is an open group, an ad hoc working committee if you will, where all interested parties can register their interest to participate. Ultimately, it is the tools makers (who must participate to make this work!) who will decide if what we come up with is suitable. The success of this group effort is if tools makers agree and implement what we propose.

We will conduct all discussions on this repo, in the open, or at least post transcribed notes here from any teleconference/online meetings. It seems like a good idea to have at least one initial meeting to kick things off, but we will self-organize here and make such decisions as we go.

"Register" Interest

Please go to this issue thread and add a comment to "register" your interest in this "committee".

We'll all figure out next steps after people have a chance to join.

Design

Some stated design biases (to be validated):

a CST should be pretty similar to the standard AST, as much as possible, so as to make conversion between CST and AST (or vice versa?) straightforward.
the CST must be a format that a code parser could practically produce while parsing, either by default or by option, as the main use-case of this format is to prevent information-loss during parsing.
most importantly, a CST format must be suitable for code generators to use while producing standard JS code.

Current Proposals

There were some discussions on this topic awhile back, and two rough ideas for CST format emerged. These should serve to inform the design process, but not limit/control it.

To illustrate the two proposals' approaches, consider this sample piece of code:

/*1*/ function /*2*/ foo /*3*/ ( /*4*/ ) /*5*/ {
  // ..
}

This program is parsed into the following AST:

{
    "type": "Program",
    "body": [
        {
            "type": "FunctionDeclaration",
            "id": {
                "type": "Identifier",
                "name": "foo"
            },
            "params": [],
            "defaults": [],
            "body": {
                "type": "BlockStatement",
                "body": []
            },
            "rest": null,
            "generator": false,
            "expression": false
        }
    ]
}

As you can see, there's whitespace and comments appearing in a several different syntactic "positions" in that function declaration header. Now, if you consider the current standard AST, not all of those syntactic positions have natural AST representations where these "extras" (comments, whitespace, etc) can be attached.

For example, /*4*/ appears inside an empty parameters-list. But how could we represent the whitespace/comments there in the above AST? The params list is an array, and in JSON/object-literal format, an array cannot also have properties.

We need a CST to do that.

So, briefly, a CST could look like:

These "extras" annotations could be represented by an "extras" property attached to any node, with three standard position names as properties of the "extras" annotation: "before", "inside", "after".

Since those names aren't sufficient to describe all syntactic forms with just the standard AST, such as /*4*/ above, we can add additional synthetic/virtual nodes to the tree (parent or child nodes) that act as anchors to hang these "extras" annotations in the proper positions.

We could solve this by adding a node called paramList in the function declaration, which is an object. This node could have as a child property the params array seen above. Or it could just be a sibling to params in the function declaration. Either way, paramList as an object can have a property called extras, and in that object, the inside position descriptor could be an array to hold the whitespace and comments items that appear in between the ( ).

Another option (which isn't as obvious here but would end up helping in a lot of other places) is to define a special virtual node type like EmptyExpression, which is a node that acts as a positional placeholder, but which would strictly imply no actual code when being processed. This node could be a child of params only if params was otherwise empty. That may help some cases, but it could make some tools' jobs a little harder in that they are in the habit of checking the length of the array only, not its contents.

Regardless of how or where these virtual nodes sit, it underscores the idea that a CST would probably need to be "converted" (that is, have all the virtual nodes stripped) to a standard AST before some AST-only-aware tools could work with the tree. That would be an explicit, and fairly trivial to implement, conversion step, provided by tools like parsers for instance, like makeASTFromCST(..).

The upside is that we only have 3 position names to know about and handle ("before", "inside", "after"). That makes tooling logic easier in the generation and consumption phases. The downside is that it creates extra nodes, which means current tools that deal with ASTs either need a "conversion" which strips the CST down to an AST, or have to change to be able to handle the presence of virtual nodes it should ignore in its processing. It could make some code paths a bit harder, while making other code paths much easier. Tradeoffs abound.
Instead of adding extra virtual nodes, we can just limit which nodes can have "extras" annotations to those which are already objects. This means that we would need to have potentially lots more "position descriptors", as "before", "inside", and "after" would not be sufficient in all the various different JS syntactic forms.

"inside" when attached to a function declaration could always mean "inside the param list", and so whether there are params or not, the whitespace/comments extras can always be represented. In the above example, "inside" kinda works, but there are lots of other syntactic forms where different position names would have to be used. Instead of knowing/handling just 3 positional labels, we might have 5-10 different labels (not all of which made sense in different nodes).

The upside is that a CST is structurally the same as an AST (simplifying some existing tools handling ASTs), with just extra properties which could theoretically be either used or ignored easily. The downside is that producing and consuming these extras is a fair bit more complicated as there are quite a few different positional names to handle.

Chicken Or The Egg?

Another way to boil down the contrast between these two proposals is: Is an AST a CST, or is a CST an AST? Or, which comes first, the AST or the CST?

The first proposal is that all ASTs are CSTs, but not vice versa. Any (future) tool which could take a (future) CST could just take an AST, and work just fine (without extras, of course!). Any existing tool which expects ASTs would probably need you to "convert" (that is, strip out the virtual nodes) a CST down to a base AST, so that they didn't need to handle all those virtual nodes.

The second proposal is essentially the reverse: all CSTs are ASTs, but not vice versa. Any tool which currently takes ASTs can start taking CSTs (without conversion) and not have to change (as long as it's tolerant to ignore properties it doesn't care about, which is sort of an implicit "conversion" from CST to AST). Any tool which would take a (future) CST could take an AST (without extras, of course!).

Neither proposal is perfect. They both imply different tradeoffs. They each imply that some code paths will be simpler and some will be harder. It's not totally clear that either is the right choice, or that another option wouldn't be better. These are things that should be explored and debated.

License & Assignment

By participating in this group, you are explicitly granting/assigning any and all ideas you present to the group to be licensed under CC0 public domain, and you agree that neither you nor anyone else who participates in any of these discussions will retain any intellectual property rights (copyright, patentability, etc) over what is submitted publicly to our discussions.

Simply do not participate if you have any potential or real conflict that prevents you from agreeing to this process under the spirit of those terms.

concrete-syntax-tree's People

Contributors

Stargazers

Watchers

concrete-syntax-tree's Issues

Constituency tree structure

So I've been trying to think about what kind of structure might make sense if we were to just start from scratch (not that we necessarily have to, I just wanted to explore the solution space a bit and have a target to shoot for). So I threw together a rough and dirty structure proposal, and wanted to get some people looking at it and discussing.

I'll back up and mention that my focus here is on use-cases for code analysis and source compilation (not so much code execution). There may be some trade-offs to be made in the design of the structure depending which of those two routes are most desired -- but it seems to me that the vast majority of the uses of JS syntax trees in the wild these days are for source code analysis/transformation.

Here's roughly what I've come up with, and it's mostly inspired by a few of the other syntax tree forms we use in other languages. I took the given example code in the readme and have been targeting that:

/*1*/ function /*2*/ foo /*3*/ ( /*4*/ ) /*5*/ {
  // ..
}

Proposed CST form: https://gist.github.com/jeffmo/9690116
(This is the output of a tiny, super incomplete parser I hacked together. I'm happy to release the code if we decide we want to pursue this style of structure further...but I figured it may be a bit early to start anchoring or discussing detailed parser implementations too much here)

The high level tenet of the structure is that tree nodes are categorized into two conceptual types: branch nodes and leaf nodes (leaf nodes are just token nodes). The tree does not include whitespace tokens as I felt they would be quite noisy and redundant with the location information (which should make it possible to re-construct the whitespace). EDIT: As pointed out below in the comments section, this is an incorrect assessment about whitespace reconstructability. I need to find some time to come back and address this (likely by adding support for whitespace tokens)

Branch nodes stick to the form {type: "string", children: [array], loc: {location obj}}.
Leaf/token nodes stick to the form {type: "string", value: <token-specific value type>, loc: {location obj}}

To further simplify parsing a bit, I've also squashed some token types together -- such as keywords (i.e. {"type": "KeywordToken", "value": "function", "loc": {...}}) and punctuators (i.e. {"type": "PunctuatorToken", "value": "(", "loc": {...}}) similar to esprima's node structure for tokens.

Pros:

The tree is entirely lossless with respect to the original source. Whitespace can be re-constructed using location information. EDIT: This is incorrect as pointed out in the comments section. I need to find some time to come back and address this (likely by adding support for whitespace tokens)
Since branch nodes list their children in an ordered array, it's extremely easy to traverse the tree in source-order. This is in contrast to the Spidermonkey API where you must traverse object properties that may/may not be in source order. The truth is that often they are (because parsers often define the properties in source order and just about every JS impl preserves object key order in 99% of the scenarios), but there are a few places in the spidermonkey api where this isn't possible (for example the rest property on Function nodes is out of place with respect to source-order and the other function params)
It's very easy to represent and observe optional intermediary nodes like the comments in the example code-chunk. Because a FunctionDeclaration node is just a branch node, it has a list of children which can include an arbitrary set of (perhaps pre-defined) types of child nodes.
Because we retain some constituency structure (rather than a pure stream of tokens), it's still very easy to skip over subtrees in the syntax.
Because sibling nodes are represented via an ordered list (vs a named object -- I address the downsides of that below), we don't have to attempt to define positional-adjective slots (like { ... beforeFunctionKeyword: [<commentNode>] ...}for all possible free-floating nodes (like comments/semicolons/etc).

Cons:

If, while analyzing a branch node, you want to jump to (or test for the existence of) a particular type of child node, you must scan it's children array for that node (rather than having direct access via a named property). It's unclear how much of a problem this is since normally these lists are relatively small for each branch-node...but it's definitely worth considering.
While similar, it doesn't really augment the SpiderMonkey API very well -- meaning new parsers would be necessary.

I have far from explored all the edges of the language with this design, so I'm interested in hearing peoples thoughts on this. I'm also happy to edit in pros/cons above as they come up in any comments people have.

CST-AST flow

This is in response to @Constellation's figures on another escodegen thread, which I hope get moved here.

Things to note different about my diagram:

There is only a "Parser to CST". "Parser to AST" would be a convenience API method that simply called "Parser to CST" -> "CST to AST". It would make no sense to me to have a parser which either was duplicated (one set of logic for AST, one for CST), or just more complex at each step, having to figure out if it should or shouldn't keep the CST info. I think it always should.

If your use-case for output of parsing is that you only want the AST, and don't care about the CST data, you strip it out with "CST to AST". You can do so manually, or you can call the "Parser to AST" helper method. Either way, there's just one parser: "Parser to CST".

I think this greatly simplifies the work needed to modify existing parsers. They just make one change to start (always) producing CSTs, and provide a simple CST-to-AST stripping conversion step, and maybe a convenience method "Parse to AST" which just calls both.

Also, it means that lots of existing tools which expect ASTs only don't have to change at all. If you have a CST and want to use that tool, you use the AST-to-CST converter first. Could those tools add convenience methods? Sure. But they don't have to.

Minimizing the amount of work change to implement CSTs.
My approach assumes that all ASTs are CSTs (strictly, that ASTs are subsets of CSTs), but an AST is a not-terribly-interesting CST in that it's missing all the extras a CST would normally have. IOW, an AST is a minimal CST.

This means any place that expects a CST (like linters, code formatters, etc) can either take an AST or a CST. Any place that requires an AST needs an AST only (not a CST), but that's OK because any place that produces CSTs (like parsers) will also provide the simple CST-to-AST converter, so it's trivial to get an AST from a CST and pass it wherever ASTs are expected.
Escodegen (and any other code generator) changes to only expect a CST. If you pass in an AST, that's OK, because (as I said above), all ASTs are CSTs, so no big deal.

If any tool (like escodegen or any other) wants to add default formatting like whitespace or comments or whatever, they take a CST (or an AST!) and add the "extras" annotations directly to it. A CST then is just an AST + extras.

I think this also minimizes complexity of Escodegen, it doesn't need to be two separate components, and it definitely doesn't need to have forked logic for AST vs. CST. It assumes a tree that will provide it whatever extras it needs, and it may add its own, and it cares not if this tree passed to it was an AST or a CST (AST+extras). Either way, it uses extras if present, and continues normally if not. Simple.

Charter, group mission, observations/assumptions

I am going to try to articulate in detail what my initial plans and goals are for this group. These things are of course open to discussion, but I want to stress one thing at the beginning:

This is not design by committee -- those types of efforts usually fail miserably. This is not democratic -- those things usually just divide constituency and never lead to consensus. I am going to lead this group and effort, because it needs to happen and because no one has done it before, so I stepped up. I am going to do the best I can to balance what input people give with the overall big picture and goals.

And I am going to actively seek your feedback to make sure we do the best we can.

Mission Statement

To develop a CST (concrete syntax tree) format which preserves/represents concrete syntax elements (whitespace, comments, etc) in a data structure (tree) "alongside" the AST (abstract syntax tree) elements (VariableDeclaration, BinaryExpression, etc), and to evangelize this new format to all tools in the JS tooling ecosystem (parsers, analyzers, transformers, code generators) to gain substantial/complete adoption as a new IR (intermediate representation) standard.

Charter

This ad hoc group of volunteer members seeks to develop a single, new standard (which we're currently calling "CST" -- Concrete Syntax Tree) for the IR (intermediate representation) of JS source code as it passes through various tools in the JS tooling ecosystem, from parsing to analysis to transformation to code (re)regeneration.

This new standard will replace/augment the existing standard (AST -- Abstract Syntax Tree as standardized by the SpiderMonkey Parser API). Note: That does not mean ASTs won't exist, but it means the preferred tree format for exchange will be the CST, and that AST will become a reduction of the CST for certain use-cases that only need/care about AST information.

The goal of the new standard tree format is to provide a standard and reliable way to represent all "concrete syntax" elements which are normally discarded and not kept by the AST, such as "unncessary" whitespace, comments, grammar-redundant ( ) or { } pairs, etc. These elements are needed for a variety of use-cases which cannot tolerate "information loss" in the round-trip from source-code to IR back to source-code.

This group seeks to discuss several existing proposals for a CST format, hammer out any problems with them, and find one that can gain the most support. We will publish a detailed specification/documentation for this new format, and evangelize and work with all the JS tooling ecosystem members to gain adoption and implementation as quickly as possible.

It is a success condition of this group if many or all JS tools agree, even in principle, to eventually move to this new standard IR format, even if they only can agree to support both CST and AST rather than replace AST. Furthermore, there must be at least one parser that produces CSTs from source code, and at least one code generator which takes a CST and produces code output, as this round-trip is inherent to nearly all use-cases this group concerns itself with.

Assumptions/Observations

AST is a lossy format, in that concrete syntax information is lost when a parser takes a program and outputs an AST. This lossy format has served many common use-cases well, but it has not served at all the use-cases which need to retain (and/or use!) this information.

As such, the new CST format that will retain this information must be seen as the primary format, as you can always strip out information from a CST representation to get only an AST, but you cannot go the other direction and restore information which was lost. Note: Some use-cases do call for adding in new "concrete syntax information", such as default whitespace, etc, but that's different than preserving (while parsing) the original information.

We will pay close attention to the tradeoffs in complexity/performance that this implies, and be sensitive to that in what we propose. We will not grossly degrade the performance of existing tools by forcing them to do things like tracking concrete-syntax which they have no concern with, except as it is minimally required to support the rest of the JS tooling ecosystem and use-cases.
This group is not an open-ended, unstructured exploration. It will be guided and informed by prior work, and seek to keep to the narrowest scope and process as necessary to get to a proposed solution with widest adoption/consensus.

There have been extensive discussions about various approaches to CST tracking over the 6+ months in various places. Two main proposals surfaced in that discussion, as detailed in the main README of this repo. It is my goal that we first validate both of these proposals, or invalidate them (with proof, not opinion).

We will entertain (and indeed, seek out!) discussion about concrete deficiencies in these proposals, but we will not entertain bikeshedding on opinions of taste on any proposal. If there are unresolvable deficiencies in current proposals, we will entertain alternate proposals, but again we will not get mired in bikeshedding, but rather seek to solve whatever problems exist in any given attempted solution.
Since AST is the current standard IR format with these tools, whatever the CST settles as must provide the least amount of friction to existing tools to augmenting or replacing current AST handling.
Throwing out the entire AST format and producing a new CST format that is wholly unlike the current AST is likely to produce a lot of friction to implementation with existing tools, even if it can be demonstrated that it would be superior (for some definition of "superior").

As such, a CST that augments AST in some way is generally more preferable as it generally would lead to less friction to implementation (less changes to existing tools' code). We should prefer incrementation/evolution of the current standard rather than reinvention.
The form that the CST takes (and how it co-exists with AST elements) matters, because it directly affects how easily the IR of code can pass through multiple tools in a chain of processing. A single tree (that can be textualized as JSON) is the current norm, and it is preferred (again for ease of friction) that the process not become significantly harder, such as creating multiple different streams of data to pass around, etc.
There have already been a lot of ad hoc explorations by various tools to tracking whitespace and/or comments, but each tool has done it differently, and none of them have handled all concrete-syntax preservation. All these different previous attempts inform our current attempts, but they are explicitly considered insufficient as the mission is to preserve all concrete-syntax in a standard and agreed-upon way across most/all tools.

As such, the CST effort seeks to replace any of those previous non-standard approaches, even eventually if not immediately. We want to solve problems, not create more problems for future users by having multiple different competing ways to do things and no consensus on how to do it properly.

How this project ?

Hello guys,

This project still continue ?

Register interest

Please leave a comment here with your interest in participating. List whatever stake you may have (such as "I work on XYZ tool that needs this..."), and any relevant contact info.

Also, please list what timezone you are primarily based in, for meeting planning purposes.

List any relevant contact info (twitter, email, etc) as you see fit.