0no-co / reghex
The magical sticky regex-based parser generator 🧙
License: MIT License
Currently parse(node)(input) works using a string, i.e. input is expected to be a string.
It would be interesting to allow parse(node)(quasis, ...expressions) to be passed, where quasis is an array of strings and expressions is an array of interpolations.
This way it'd be possible to parse a tagged template literal input by introducing just one new matcher: match.interpolation('interpolation').
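A minimal sketch of how one entry point might accept both call styles. The normalizeInput helper and the shape it returns are hypothetical and not part of reghex; only the (quasis, ...expressions) calling convention comes from the proposal above.

```javascript
// Hypothetical helper: normalizes both call styles into one input shape.
// Neither the name nor the returned shape is part of reghex.
function normalizeInput(quasis, ...expressions) {
  if (typeof quasis === 'string') {
    // Plain string input: a single quasi and no interpolations.
    return { quasis: [quasis], expressions: [] };
  }
  // Tagged template input: quasis always has one more entry than expressions.
  return { quasis: Array.from(quasis), expressions };
}

const fromString = normalizeInput('a b c');
const fromTemplate = normalizeInput`a ${1} b ${{ x: 2 }} c`;

console.log(fromString.quasis);        // [ 'a b c' ]
console.log(fromTemplate.quasis);      // [ 'a ', ' b ', ' c' ]
console.log(fromTemplate.expressions); // [ 1, { x: 2 } ]
```

With input in this shape, an interpolation matcher would only need to check whether the parser currently sits at a boundary between two quasis.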
It'd be useful if the library provided some TS types.
The traverse function would only support the default tag output, so [ /* ... */, tag: 'node' ], or rather type Node = Array<Node | string> & { tag: string }. Maybe it can also be limited to support any object of the shape { tag: string } | { type: string }.
It would function similarly to GraphQL's visit function or Babel's traverse function.
traverse({
  [tagName]: node => {
    // ...
    return node;
  }
})(node)
The different visitor functions match by tag name and would execute their functions as the AST is traversed. The return value should replace the previous value. This way it'd be possible to also transform an AST into the desired shape.
Somehow it should also be possible for this traverser to output strings! If the returned value of each node function is a string, this should easily work by concatenating the child strings per node.
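A rough sketch of what such a traverser could look like, assuming the default node shape (an array with a tag property) and a children-first visiting order. Both traverse and the node helper here are hypothetical and not an existing reghex export.

```javascript
// Builds a node in the default output shape: an array carrying a `tag` property.
function node(tag, ...children) {
  return Object.assign([...children], { tag });
}

// Hypothetical traverse(): visits children first, then lets the visitor for the
// node's tag replace the node with whatever it returns.
function traverse(visitors) {
  return function visit(input) {
    if (typeof input === 'string') return input;
    const next = Object.assign(input.map((child) => visit(child)), { tag: input.tag });
    const visitor = visitors[input.tag];
    return visitor ? visitor(next) : next;
  };
}

const ast = node('sum', node('num', '1'), '+', node('num', '2'));

// Transforming: wrap every number, keep the tree shape.
const wrapped = traverse({
  num: (n) => node('num', `(${n[0]})`),
})(ast);

// Printing: every visitor returns a string, so parents just join their children.
const printed = traverse({
  num: (n) => n.join(''),
  sum: (n) => n.join(' '),
})(ast);

console.log(wrapped[0][0]); // (1)
console.log(printed);       // 1 + 2
```

Visiting children before their parent is what makes string output fall out naturally: by the time a visitor runs, its children have already been replaced.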
Currently it's pretty difficult to debug a parser written with reghex; it'd be great if that could be made easier somehow.
I'm thinking maybe there could be an onBeforeMatch function called right before executing any matcher, with the sole purpose (that I can think of) of letting somebody put a "debugger" statement in there. It sounds ugly though.
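As a sketch of the idea, assuming matchers are plain functions of a state object: a wrapper could call the hook before delegating to the matcher. withHook, the state shape, and the toy digits matcher are all hypothetical; only the onBeforeMatch name comes from the suggestion above.

```javascript
// Hypothetical wrapper, not a reghex API: calls `onBeforeMatch` right before
// the wrapped matcher runs. A matcher here is any function of a state object.
function withHook(name, matcher, onBeforeMatch) {
  return function hooked(state) {
    onBeforeMatch(name, state);
    return matcher(state);
  };
}

// Toy matcher: consumes one or more digits at the current index.
function digits(state) {
  const regex = /\d+/y;
  regex.lastIndex = state.index;
  const match = regex.exec(state.input);
  if (!match) return undefined;
  state.index = regex.lastIndex;
  return match[0];
}

const calls = [];
const debugDigits = withHook('digits', digits, (name, state) => {
  // A `debugger;` statement could go here to pause before every matcher.
  calls.push(`${name}@${state.index}`);
});

const state = { input: '42abc', index: 0 };
const result = debugDigits(state);

console.log(result); // 42
console.log(calls);  // [ 'digits@0' ]
```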
It'd be useful to have shorthands for lookahead expressions. They would make parsers easier to read and less error-prone to write, as currently there's no syntax highlighting for unbalanced parentheses used in lookaheads.
Peg.js uses ! for the negative lookahead and & for the positive lookahead. I think we should use ! and = respectively to better align with how these expressions are written in JS regexes (I guess peg.js uses & because they already use = for something else).
Examples:
(?!${/foo/}) => !${/foo/}
(?!${Foo}) => !${Foo}
(?!${Foo} | ${Bar}) => !(${Foo} | ${Bar})
(?=${/foo/}) => =${/foo/}
(?=${Foo}) => =${Foo}
(?=${Foo} | ${Bar}) => =(${Foo} | ${Bar})
I'm parsing BCP 47 language tags using reghex and it works perfectly!
Could a multiline match be allowed to start with a | character instead of a pattern?
const irregular = match('irregular')`
- ${/en-GB-oed/}
+ | ${/en-GB-oed/}
| ${/i-ami/}
| ${/i-bnn/}
| ${/i-default/}
| ${/i-enochian/}
| ${/i-hak/}
| ${/i-klingon/}
| ${/i-lux/}
| ${/i-mingo/}
| ${/i-navajo/}
| ${/i-pwn/}
| ${/i-tao/}
| ${/i-tay/}
| ${/i-tsu/}
| ${/sgn-BE-FR/}
| ${/sgn-BE-NL/}
| ${/sgn-CH-DE/}
`;
It'd be useful if a matcher could also be specified using strings. This would make parsers a bit cleaner and easier to read, as a lot of characters wouldn't need to be escaped in strings.
E.g.: /\/\// => '//'
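One way this could work, sketched under the assumption that a string should match literally: escape the regex metacharacters and build a sticky regex from the result. toStickyRegex is a hypothetical helper, not part of reghex.

```javascript
// Hypothetical helper: turns a string into a sticky regex that matches the
// string literally, and makes an existing regex sticky. Not part of reghex.
function toStickyRegex(pattern) {
  if (pattern instanceof RegExp) {
    return new RegExp(pattern.source, pattern.flags.replace('y', '') + 'y');
  }
  // Escape every character that has a special meaning inside a regex.
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(escaped, 'y');
}

const comment = toStickyRegex('//');
const dotted = toStickyRegex('a.b');

console.log(comment.test('// note')); // true
console.log(dotted.test('axb'));      // false
```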
The following two parsers should produce the same output, in this case, but they don't:
console.log ( parse ( match ( '๐' )`${/\s*/} ${/foo/}` )( 'foo' ) ); // => undefined
console.log ( parse ( match ( '๐' )`${/\s/}? ${/foo/}` )( 'foo' ) ); // => [ 'foo', tag: '๐' ]
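One possible explanation, offered as an assumption rather than a diagnosis of reghex's internals: /\s*/ succeeds with an empty string at the start of 'foo', and an empty string is falsy in JavaScript, so a truthiness check would treat that success as a failure. The sketch below uses only plain sticky regexes; execAt is a hypothetical helper.

```javascript
// Hypothetical helper: runs `regex` as a sticky regex at `index` and returns
// the matched text, or undefined when there is no match at that position.
function execAt(regex, input, index) {
  const sticky = new RegExp(regex.source, 'y');
  sticky.lastIndex = index;
  const match = sticky.exec(input);
  return match ? match[0] : undefined;
}

const star = execAt(/\s*/, 'foo', 0);  // '' (a successful, zero-length match)
const single = execAt(/\s/, 'foo', 0); // undefined (no match at all)

// A truthiness check cannot tell the two apart...
console.log(Boolean(star), Boolean(single)); // false false
// ...while an explicit comparison can.
console.log(star !== undefined, single !== undefined); // true false
```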
It would be useful to have a built-in CLI for bundling a parser into a standalone file. From the user's perspective it would be friendlier not to have to set up Babel at all (I use TS most of the time). It would also make the readme more impressive if a single command could compile the demo parser into a 1kb file or whatever.
It would make my parsers a bit tidier if I could use arrays instead of alternations in the DSL passed to matchers.
Without arrays:
const Foo = passthrough`${A} | ${B} | ${C} | ${D} | ${E}`;
const Foos = somematcher`${Foo}+`;
With arrays:
const Foo = [A, B, C, D, E];
const Foos = somematcher`${Foo}+`;
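A sketch of the semantics an array could have, assuming matchers are plain functions of a state object: try each entry in order and return the first success. The regex, alternation, and toMatcher helpers are hypothetical stand-ins, not reghex functions.

```javascript
// Toy matcher factory: matches `source` at state.index and advances on success.
const regex = (source) => (state) => {
  const sticky = new RegExp(source.source, 'y');
  sticky.lastIndex = state.index;
  const match = sticky.exec(state.input);
  if (!match) return undefined;
  state.index = sticky.lastIndex;
  return match[0];
};

// Hypothetical: an array of matchers behaves like `${A} | ${B} | ...`,
// trying each entry in order and returning the first success.
const alternation = (matchers) => (state) => {
  for (const matcher of matchers) {
    const start = state.index;
    const result = matcher(state);
    if (result !== undefined) return result;
    state.index = start; // restore the position before trying the next entry
  }
  return undefined;
};

// Hypothetical normalization step a DSL could apply to interpolated values.
const toMatcher = (value) =>
  Array.isArray(value) ? alternation(value.map(toMatcher)) : value;

const Foo = toMatcher([regex(/a/), regex(/b/), regex(/c/)]);
const state = { input: 'cab', index: 0 };
const results = [Foo(state), Foo(state), Foo(state), Foo(state)];

console.log(results); // [ 'c', 'a', 'b', undefined ]
```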
I spent some time today benchmarking the library and playing with making a toy/useless Markdown parser with it, so here's some miscellaneous feedback after having interacted some more with the library. Feel free to close this and perhaps open standalone issues for the parts you think are worth addressing.
For the Markdown parser thing I was trying to write a matcher that matched headings, and I had some problems with that:
There's no way to use \1, \2, etc. to reference other capturing groups either. In my headings scenario the trailing hashes should really be considered trailing hashes only if there are exactly as many of them as leading hashes; otherwise they should be considered part of the body. This can't quite be expressed cleanly with the current system because the first capturing group/matcher can't be referenced.
\[0-9] references, which in this case would mean referencing the 1st, 2nd... 9th whole sub-matcher.
Standalone:
`${/(#{1,6} )(.+?)( #{1,6})/}`
Broken-up:
`${/#{1,6} /} ${/.+?/} ${/ #{1,6}/}`
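For reference, this is how a single regex can express the matching-hashes constraint with a backreference in plain JavaScript. The pattern is an illustrative variant of the standalone regex above, with the space moved outside the first group so that \1 refers to the hashes only; it does not show anything reghex supports across sub-matchers, and parseHeading is a hypothetical helper.

```javascript
// Trailing hashes count as a closing sequence only when they repeat the
// leading run exactly (\1); otherwise they stay part of the body.
const heading = /^(#{1,6}) (.+?)(?: \1)?$/;

function parseHeading(line) {
  const match = heading.exec(line);
  return match ? { level: match[1].length, body: match[2] } : undefined;
}

console.log(parseHeading('## Title ##')); // { level: 2, body: 'Title' }
console.log(parseHeading('## Title #'));  // { level: 2, body: 'Title #' }
console.log(parseHeading('Title'));       // undefined
```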
{1,3} perhaps should be supported too.
Now about performance, perhaps the more interesting part of the feedback.
From what I've seen every ~atomic thing the library does is pretty fast, so there shouldn't be any meaningful micro-optimizations available; the root issue actually seems to be that the library spends too much time on the wrong alternations.
Some actual real numbers first so that the rest of the feedback sounds less crazy:
- = $`${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Slow
+ = $`${/(?=.*\?)/} ${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Fast
That's kind of the root of the performance problems with RegHex parsers, in my opinion. If I had to guess, with enough sophistication perhaps some parsers could become 100x faster or more just by going down branches/alternations more intelligently.
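The idea behind the lookahead trick can be sketched without reghex: run a cheap check before committing to an expensive branch. The matcher shapes and the guarded helper below are hypothetical, and the call counter only illustrates skipped work; it is not a benchmark.

```javascript
let expensiveCalls = 0;

// Stand-in for a costly branch such as a ternary expression matcher.
function ternary(state) {
  expensiveCalls++;
  const regex = /(\w+)\s*\?\s*(\w+)\s*:\s*(\w+)/y;
  regex.lastIndex = state.index;
  const match = regex.exec(state.input);
  if (!match) return undefined;
  state.index = regex.lastIndex;
  return match.slice(1);
}

// Hypothetical helper: only enters `matcher` when the cheap sticky `guard`
// regex succeeds at the current position.
function guarded(guard, matcher) {
  const sticky = new RegExp(guard.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    return sticky.test(state.input) ? matcher(state) : undefined;
  };
}

// Same guard as in the diff above: is there a `?` later on this line?
const fastTernary = guarded(/(?=.*\?)/, ternary);

const miss = fastTernary({ input: 'a || b', index: 0 });
const hit = fastTernary({ input: 'a ? b : c', index: 0 });

console.log(miss, hit, expensiveCalls); // undefined [ 'a', 'b', 'c' ] 1
```

The guard pays off only when it is much cheaper than the branch it protects and rejects often enough to matter.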
At a high level, RegHex parsers look to me kind of like CPUs: individual patterns are like instructions, each alternation is a branch, etc. It should follow that the same optimizations used for CPUs could/should be used for RegHex. I know next to nothing about that really, but just to mention some potential things that crossed my mind:
Depending on how many of these fancy optimizations you are willing to spend time on, perhaps a seriously fast JS parser could be written on top of RegHex; that'd be really cool.
Sorry for the long post, hopefully there's some useful feedback in here.
Trying to parse something like
x and y and z
with
const token = match('formula')`
${/\w+/}
`
const anded = match('anded')`
${token} (?: ${/\s+and\s+/}) ${token}
`
I can make it match when there's exactly two but it doesn't work with one or three:
parse(anded)('x')
// undefined
parse(anded)('x and y')
// [ [ 'x', tag: 'formula' ], [ 'y', tag: 'formula' ], tag: 'anded' ]
parse(anded)('x and y and z')
// [ [ 'x', tag: 'formula' ], [ 'y', tag: 'formula' ], tag: 'anded' ]
But if I try to make it recursive:
const token = match('formula')`
${/\w+/}
`
const anded = match('anded')`
${row} (?: ${/\s+and\s+/}) ${token}
`
const row = match('row')`
${anded} | ${token}
`
console.log([
parse(anded)('x'),
parse(anded)('x and y'),
parse(anded)('x and y and z')
])
/Users/glen/src/experiments/disclosure/src/3-parse-test.js:57
var anded = function _anded(state) {
^
RangeError: Maximum call stack size exceeded
I think I'm doing something wrong!
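The stack overflow is what left recursion typically produces in a recursive-descent parser: anded starts by calling row, and row starts by calling anded, so the parser recurses forever without consuming any input. A common way around it is to express the list as one token followed by zero or more "and token" pairs. The sketch below shows that shape in plain JavaScript with sticky regexes; parseAnded and tagged are hypothetical helpers, and whether the equivalent repeated group can be written directly in the reghex DSL is an assumption to check against its documentation.

```javascript
// Builds a node in the output shape shown above: an array with a `tag`.
const tagged = (tag, ...children) => Object.assign(children, { tag });

// Hypothetical hand-written equivalent of: token (and token)*
function parseAnded(input) {
  const token = /\w+/y;
  const and = /\s+and\s+/y;
  let index = 0;

  // Runs a sticky regex at the current index and advances on success.
  const eat = (regex) => {
    regex.lastIndex = index;
    const match = regex.exec(input);
    if (!match) return undefined;
    index = regex.lastIndex;
    return match[0];
  };

  const first = eat(token);
  if (first === undefined) return undefined;
  const formulas = [tagged('formula', first)];

  // Iterate instead of recursing on the left-hand side.
  for (;;) {
    const start = index;
    if (eat(and) === undefined) break;
    const next = eat(token);
    if (next === undefined) {
      index = start; // a dangling "and" is not consumed
      break;
    }
    formulas.push(tagged('formula', next));
  }
  return tagged('anded', ...formulas);
}

console.log(parseAnded('x').length);             // 1
console.log(parseAnded('x and y').length);       // 2
console.log(parseAnded('x and y and z').length); // 3
```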