
reghex's People

Contributors

chocolateboy, kitten


reghex's Issues

RFC: Support tagged template parsing

Currently parse(node)(input) expects input to be a string.
It would be interesting to allow parse(node)(quasis, ...expressions) to be passed, where quasis is an array of strings and expressions is an array of interpolations.

This way it'd be possible to parse a tagged template literal input by introducing just one new matcher: match.interpolation('interpolation').
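For context, a tag function already receives the string parts and interpolations separately, which is exactly the (quasis, ...expressions) shape proposed above. A plain-JS illustration (collect is a made-up name, not reghex API):

```javascript
// Plain-JS illustration of the proposed (quasis, ...expressions) shape;
// `collect` is a hypothetical stand-in, not part of reghex.
function collect(quasis, ...expressions) {
  return { quasis: [...quasis], expressions };
}

const interp = { kind: 'interpolation' };
const out = collect`before ${interp} after`;
// out.quasis      -> ['before ', ' after']
// out.expressions -> [interp]
```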

Types

It'd be useful if the library provided some TS types.

RFC: Add a traverse function

The traverse function would only support the default tag output, so [ /* ... */, tag: 'node' ], or rather type Node = Array<Node | string> & { tag: string }. Maybe it can also be limited to support any object of the shape { tag: string } | { type: string }.

It would function similarly to GraphQL's visit function or Babel's traverse function.

traverse({
  [tagName]: node => {
    // ...
    return node;
  }
})(node)

The different visitor functions match by tag name and would execute as the AST is traversed. The return value should replace the previous value; this way it'd also be possible to transform an AST into the desired shape.

Somehow it should also be possible for this traverser to output strings! If each node's visitor returns a string, this should work easily by concatenating the child strings per node.
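A minimal sketch of what this could look like for the default tag shape (my own sketch, not a proposed final API):

```javascript
// Sketch of traverse over nodes shaped Array<Node | string> & { tag }.
// Visitors match by tag name; the return value replaces the node, so a
// visitor that returns strings makes the traversal emit a string.
const traverse = (visitors) => function visit(node) {
  if (typeof node === 'string') return node;
  const children = node.map(visit);
  const next = Object.assign(children, { tag: node.tag });
  const visitor = visitors[node.tag];
  return visitor ? visitor(next) : next;
};

// Example: stringify by joining children at every visited node.
const node = Object.assign(['foo', 'bar'], { tag: 'word' });
const toString = traverse({ word: (n) => n.join('') });
// toString(node) -> 'foobar'
```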

Improved debuggability

Currently it's pretty difficult to debug a parser written with reghex; it'd be great if that could be made easier somehow.

I'm thinking maybe there could be an onBeforeMatch function, called right before executing any matcher, with the sole purpose (that I can think of) of letting somebody put a `debugger` statement in there. It sounds ugly though.
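For illustration, the hook could be as small as a wrapper around a matcher function (onBeforeMatch and withBeforeMatch are hypothetical names, not existing reghex options):

```javascript
// Hypothetical shape for the suggested hook: wrap a matcher so a
// callback fires before it runs; a `debugger;` statement can go there.
const withBeforeMatch = (matcher, onBeforeMatch) => (state) => {
  onBeforeMatch(state); // e.g. `debugger;` or logging
  return matcher(state);
};
```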

Lookaheads shorthands

It'd be useful to have shorthands for lookahead expressions. They would make parsers easier to read and less error-prone to write, since there's currently no syntax highlighting to catch unbalanced parentheses in lookaheads.

Peg.js uses ! for the negative lookahead and & for the positive one. I think we should use ! and = respectively, to better align with how these expressions are written in JS regexes (I guess peg.js uses & because it already uses = for something else).

Examples:

  • (?!${/foo/}) => !${/foo/}
  • (?!${Foo}) => !${Foo}
  • (?!${Foo} | ${Bar}) => !(${Foo} | ${Bar})
  • (?=${/foo/}) => =${/foo/}
  • (?=${Foo}) => =${Foo}
  • (?=${Foo} | ${Bar}) => =(${Foo} | ${Bar})
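For reference, the proposed shorthands map onto standard JS regex lookahead behavior, which asserts without consuming input:

```javascript
// Standard JS regex lookaheads, which the proposed ! and = mirror.
// Both forms assert a condition without consuming any characters.
const positive = /^(?=foo)/.test('foobar'); // true: starts with "foo"
const negative = /^(?!foo)/.test('foobar'); // false: "foo" is present
const negOk    = /^(?!foo)/.test('barfoo'); // true: no leading "foo"
```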

Allow starting with `|` in multiline match

I'm parsing BCP 47 language tags using reghex and it works perfectly!

Can we accept a leading | character, instead of a pattern, when using a multiline match?

const irregular = match('irregular')`
-  ${/en-GB-oed/}
+ | ${/en-GB-oed/}
  | ${/i-ami/}
  | ${/i-bnn/}
  | ${/i-default/}
  | ${/i-enochian/}
  | ${/i-hak/}
  | ${/i-klingon/}
  | ${/i-lux/}
  | ${/i-mingo/}
  | ${/i-navajo/}
  | ${/i-pwn/}
  | ${/i-tao/}
  | ${/i-tay/}
  | ${/i-tsu/}
  | ${/sgn-BE-FR/}
  | ${/sgn-BE-NL/}
  | ${/sgn-CH-DE/}
`;

Support for string matchers

It'd be useful if a matcher could also be specified using a string; this would make parsers a bit cleaner and easier to read, since many characters that must be escaped in a regex don't need escaping in a string.

E.g.: /\/\// => '//'

Regexes that can match 0 characters aren't handled properly

The following two parsers should produce the same output, in this case, but they don't:

console.log ( parse ( match ( '👎' )`${/\s*/} ${/foo/}` )( 'foo' ) ); // => undefined
console.log ( parse ( match ( '👍' )`${/\s/}? ${/foo/}` )( 'foo' ) ); // => [ 'foo', tag: '👍' ]

Built-in parser compiler

It would be useful to have a built-in CLI for bundling a parser into a standalone file. From the user's perspective it would be friendlier not to have to set up Babel at all (I use TS most of the time), and it would make the readme more impressive if a single command could compile the demo parser into a 1kb file or whatever.

Arrays as native alternatives for alternations

It would make my parsers a bit tidier if I could use arrays instead of alternations in the DSL passed to matchers.

Without arrays:

const Foo = passthrough`${A} | ${B} | ${C} | ${D} | ${E}`;
const Foos = somematcher`${Foo}+`;

With arrays:

const Foo = [A, B, C, D, E];
const Foos = somematcher`${Foo}+`;

Miscellaneous feedback

I spent some time today benchmarking the library and playing with making a toy/useless Markdown parser with it, so here's some miscellaneous feedback after having interacted some more with the library. Feel free to close this and perhaps open standalone issues for the parts you think are worth addressing.


For the Markdown parser thing I was trying to write a matcher that matched headings, and I had some problems with that:

  1. I wanted to get the leading hashes, trailing hashes, and the content in between as individual strings in the tag to be transformed. That immediately implies I have to use multiple regexes, because capturing groups within a single regex are lost in the tag. Perhaps this should be changed somehow, as that would be quite powerful.
  2. Not being able to use a single regex in this scenario also means I can't use \1, \2 etc. to reference other capturing groups. In my headings scenario the trailing hashes should only be considered trailing hashes if there are exactly as many of them as leading hashes; otherwise they should be considered part of the body. This can't quite be expressed cleanly with the current system, because the first capturing group/matcher can't be referenced.
    1. Addressing the first issue would kind of address this too.
    2. Another option would be to extend the DSL adding support for \[0-9] references, which in this case would mean referencing the 1st, 2nd... 9th whole sub-matcher.
    3. Perhaps both options should be implemented, I'm not sure.
  3. Continuing with the same scenario, there's another issue once the standalone regex gets broken up:
    Standalone:
    `${/(#{1,6} )(.+?)( #{1,6})/}`
    Broken-up:
    `${/#{1,6} /} ${/.+?/} ${/ #{1,6}/}`
    
    I forgot to take note of what the issue was exactly (🤦‍♂️), but unless I'm misremembering, those two expressions don't match quite the same things, because the lazy modifier doesn't behave the same way in the broken-up version.
  4. Custom regex flags are discarded; it would be nice in some scenarios to be able to write a case-insensitive regex, or a regex where "^" and "$" match the start and end of the line respectively, for example.
  5. The DSL looks to me like a reimplementation of a subset of the regex language, so perhaps it should be extended to match it more closely; for example, quantifiers like {1,3} should perhaps be supported too.
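In plain JS regex terms, the heading scenario from items 1-2 would look like this, combining capturing groups, the \1 backreference, and the {m,n} quantifiers from item 5:

```javascript
// Single-regex version of the heading matcher: capturing groups split
// leading hashes, body, and trailing hashes, and \1 enforces that the
// trailing hashes exactly mirror the leading ones.
const heading = /^(#{1,6}) (.+?) \1$/;
const ok  = heading.exec('## title ##');  // ok[1] '##', ok[2] 'title'
const bad = heading.exec('## title ###'); // null: hash counts differ
```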

Now about performance, perhaps the more interesting part of the feedback.

From what I've seen, every ~atomic thing the library does is pretty fast, so there shouldn't be any meaningful micro-optimizations available. The root issue seems to be that the library spends too much time on the wrong alternations.

Some actual real numbers first so that the rest of the feedback sounds less crazy:

  1. I've been benchmarking the library with this.zip. Basically there's a parser that matches against a subset of JavaScript, and it is asked to match a bunch of expressions in a loop.
  2. Making this 14-character diff to a single matcher of the parser cut the benchmark's runtime by ~40%:
    -  = $`${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Slow
    +  = $`${/(?=.*\?)/} ${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Fast
  3. Basically this is what's happening:
    1. This rule matches a ternary expression.
    2. Most expressions in the benchmark aren't ternary expressions.
    3. Still those expressions could be the boolean condition of a ternary expression.
    4. So the parser parses the entire thing, until it realizes the required "?" character for the ternary expression can't be found.
    5. So it backtracks and eventually matches the entire thing again with another matcher.
    6. That "again" is where the almost-doubled performance of the changed rule comes from.
    7. The only thing the changed rule adds is a check that the "?" character is present before running a gazillion other matchers.
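The cheap pre-check from the 14-character diff, in isolation (plain JS, for illustration): a "?" scan is far cheaper than recursively matching a boolean expression, so it filters out most inputs before any expensive work happens.

```javascript
// A cheap containment check that can run before the expensive
// ternary-expression matcher is even attempted.
const couldBeTernary = (input) => /\?/.test(input);

const skip = couldBeTernary('a + b * c'); // false: bail immediately
const go   = couldBeTernary('a ? b : c'); // true: worth trying ternary
```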

That's kind of the root of the performance problems with RegHex parsers, in my opinion. If I had to guess, with enough sophistication some parsers could become 100x faster or more just by going down branches/alternations more intelligently.

At a high level, RegHex parsers look to me kind of like CPUs: individual patterns are like instructions, each alternation is a branch, etc. It should follow that the same optimizations used for CPUs could/should be used for RegHex. I know next to nothing about that really, but just to mention some potential things that crossed my mind:

  1. In my ternary expression matcher above there's a series of instructions that must all succeed for the whole thing to be considered a match, but not all instructions cost the same: checking whether the string contains a "?" is way faster than checking whether it starts with something that could be the boolean condition of a ternary. The fastest checks should be performed first.
    1. This optimization specifically could be performed at the parser-level with a lookahead, that works but that's kinda ugly.
    2. Another, more general, approach would be to analyze regexes, probably at build time so the analysis cost doesn't matter, and extract subsets of the automata that must match under every branch and are cheap to check for, like the presence of "?" and ":", in that order, that my ternary expression matcher requires in the input string.
  2. Next, perhaps a branch predictor could be added. The order in which alternations are tried matters for performance; if the 9th alternation in a set of 10 is the one that matched most often in the past, perhaps it should be tried first, so that most of the time we skip failing to match the first 8 alternations altogether.
    1. This could be pretty tricky to optimize automatically, because you need to know for which chunks of the alternations array the order doesn't matter. Maybe some API could be exposed to just move the problem to the user, like an "isParallelizable" third argument to the match function or something.
  3. This is kind of out there, but in some sense RegHex took the individual automata I wrote (~the matchers) and built a larger one by putting the smaller ones in a tree (~the final parser). Currently, if I don't add the lookahead for "?", the parser goes down the branch for the ternary expression, matches a whole lot of stuff, realizes the "?" character can't be found, and goes back to the starting point to take another branch. It might be interesting to note that multiple branches of the tree look exactly the same for a while, so they could be merged together into a prefix tree. That way, when the ternary expression ultimately failed to match, RegHex wouldn't go back to the very root of the tree but would take the nearest unexplored branch first, which in this particular case would lead to finding a match almost immediately (e.g. "so far that was an OR expression -> there are no more characters left in the input string -> done").
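Item 2's branch predictor could be sketched as a wrapper that reorders alternatives by past success (my own sketch; `predictive` is a made-up name, and it assumes reordering the alternatives is safe):

```javascript
// Sketch of a "branch predictor" for alternations: count which
// alternative matched in the past and try the hottest ones first.
// A real implementation would have to verify (or ask the user)
// that the alternatives are order-independent.
function predictive(alternatives) {
  const hits = new Array(alternatives.length).fill(0);
  return (input) => {
    const order = alternatives
      .map((_, i) => i)
      .sort((a, b) => hits[b] - hits[a]); // hottest first
    for (const i of order) {
      const result = alternatives[i](input);
      if (result !== undefined) {
        hits[i] += 1;
        return result;
      }
    }
    return undefined;
  };
}
```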

Depending on how many of these fancy optimizations you're willing to spend time on, perhaps a seriously fast JS parser could be written on top of RegHex 🤔 that'd be really cool.


Sorry for the long post, hopefully there's some useful feedback in here.

Recursive parsing

Trying to parse something like

x and y and z

with

const token = match('formula')`
  ${/\w+/}
`

const anded = match('anded')`
  ${token} (?: ${/\s+and\s+/}) ${token}
`

I can make it match when there are exactly two, but it doesn't work with one or three:

  parse(anded)('x')
// undefined

  parse(anded)('x and y')
// [ [ 'x', tag: 'formula' ], [ 'y', tag: 'formula' ], tag: 'anded' ]

  parse(anded)('x and y and z')
// [ [ 'x', tag: 'formula' ], [ 'y', tag: 'formula' ], tag: 'anded' ]

But if I try to make it recursive:

const token = match('formula')`
  ${/\w+/}
`

const anded = match('anded')`
  ${row} (?: ${/\s+and\s+/}) ${token}
`

const row = match('row')`
  ${anded} | ${token}
`

console.log([
  parse(anded)('x'),
  parse(anded)('x and y'),
  parse(anded)('x and y and z')
])
/Users/glen/src/experiments/disclosure/src/3-parse-test.js:57
var anded = function _anded(state) {
                           ^

RangeError: Maximum call stack size exceeded

I think I'm doing something wrong!
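The RangeError above is the classic left-recursion failure mode in recursive-descent parsing: `row` tries `anded`, which immediately tries `row` again without consuming any input. The standard workaround (general parsing advice, not reghex-specific API) is to rewrite the rule as one token followed by repetition; in plain JS regex terms:

```javascript
// Left recursion rewritten as repetition: one token, then zero or more
// " and <token>" pairs. Plain JS for illustration, not reghex syntax.
function parseAnded(input) {
  if (!/^\w+(\s+and\s+\w+)*$/.test(input)) return undefined;
  return input.split(/\s+and\s+/);
}
// parseAnded('x')             -> ['x']
// parseAnded('x and y and z') -> ['x', 'y', 'z']
```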
