Comments (24)
Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to correctly format it in the future), but create a gist, that is easier to work with.
from lark.
Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to correctly format it in the future), but create a gist, that is easier to work with.
you got to it before i could even edit the thing, i don't use github unless necessary so i make mistakes lol
i cannot offer a smaller example as i think the problem resides in the size of the grammar paired with the size of the input text, and due to the nature of my recursive execution, all grammar must have their respective functions, there is only exactly what is absolutely necessary to demonstrate the issue provided
i suppose you could remove the whole execution section but i also leave this to demonstrate it is infact the tokenisation process that is the problem, and the parsing of the resulting json tree is indeed very fast, so there's little reason for it to be SO slow...
however feel free to delete the ASTexecutor and VARIABLE classes and the respective call at the bottom if you don't want to see that
then you will truly have the bare minimum
from lark.
i have edited the post and added more context
from lark.
i have removed astexecutor and variable classes, now it is minimum though it cannot be verified anymore
from lark.
you got to it before i could even edit the thing, i don't use github unless necessary so i make mistakes lol
Tip for the future: You can preview your messages before posting. Editing messages afterwards is a bit less optimal since I at least always receive an email first. Having lots of poorly formatted code in the email, hiding also the button to go the github message behind a wall of text is annoying for everyone else.
You are using dynamic_complete
. That is the by far slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need to use it for a language you want to design. It's best use is for when you have messy input or something like that.
And yes, earley has a time complexity of O(n^3)
for highly ambiguous grammars. But unless you are trying to parse a natural language, you don't need that. Try using parser="lalr"
and strict=True
and fix the problems that come up one by one.
from lark.
You are using
dynamic_complete
. That is the by far slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need to use it for a language you want to design. It's best use is for when you have messy input or something like that.
running the same thing in basic creates error:
Error: + at line 2, column 28.
print(round(strlen("reee") + 500+ (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500)) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500))) " is indeed a number!")
^
####################################
None
####################################
Expected one of: {'BOOLEAN', 'STRING', 'LSQB', 'NUMBER', '_WS', 'CNAME', 'LBRACE', 'LPAR'}.
Previous tokens: None
and editing the line to something simple
Error: := at line 2, column 5.
var := 5+5+5
^
####################################
None
####################################
Expected one of: {'LSQB', 'LBRACE', '_WS', 'NUMBER', 'STRING', 'CNAME', 'BOOLEAN', 'LPAR'}.
Previous tokens: None
relevant lines in grammar
start: statements
statements: (statement _NL)* statement
statement: _WS* _statement? _WS* LINE_COMMENT?
_statement: MULTI_LINE_COMMENT
| COMMAND
| statement_loop
| statement_if
| statement_function_definition
| statement_return
| statement_assignment
| expression
| statement_exp_inc_dec
| statement_inc_dec
statement_assignment: (variable|expression_array) _WS* OPERATOR_ASSIGN _WS* expression
variable: CNAME
expression_array: CNAME "[" array_list "]"
OPERATOR_ASSIGN: ":="
_WS: /[ \t]/
expression: _WS* expression_ternary _WS*
// continue to the expression tree
from lark.
The default is dynamic
, not basic
. It seems like the basic
lexer has some other bug that I am currently trying to figure out.
from lark.
trying dynamic it completes
testing dynamic_complete on around 700 lines of var := 5+5+5
assigning grammar
tokenizing
--- 31.171510457992554 seconds ---
testing dynamic on around 700 lines of var := 5+5+5
assigning grammar
tokenizing
--- 31.176007747650146 seconds ---
zero change
this didn't seem right so sanity check i ran the tests again and ensured i am indeed editing and running the correct file etc
dynamic_complete
tokenizing
--- 32.824604749679565 seconds ---
dynamic test 1
tokenizing
--- 25.929264545440674 seconds ---
dynamic test 2
tokenizing
--- 28.417376279830933 seconds ---
dynamic test 3
tokenizing
--- 27.651228427886963 seconds ---
so about 5% increase in speed (sometimes)
from lark.
Aha, OPERATOR_CONCAT
is the problem for lexer='basic'
. Having space be a valid operator is always going to cause problems for fast tokenization scheme.
from lark.
Aha,
OPERATOR_CONCAT
is the problem forlexer='basic'
. Having space be a valid operator is always going to cause problems for fast tokenization scheme.
that sounds about inline to what i had to deal with, one of the best features of AHK (and it's quite unconventional) is it's concat with space
print(var " is stored inside the variable") is valid, no . or + concat, and everything is string, float, string int all in strings, so it's very convenient to use, just throw the data it will handle it
however space concat is by far the hardest thing to get working which is ultimately what made me use earley over LALR and (probably) dynamic_complete over basic (though dynamic seems fine too)
i'm unsure if LALR is even possible for this syntax
to be clear nobody has tokenised ahk like this before, all the ahk source codes, _B, _L, v2, _H and ironahk etc all operate the syntax by hammering away at it with raw string funcs, (frankly this is insanity to me) so i wish to tokenise it, "as closely to original syntax" but i'm willing to make concessions for some form of standardised i/o...
i hope you can see what limitations this brings
from lark.
Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use expr expr
as whitespace concat? I might still cause a bit of ambiguity, but I believe it would be much easier on Earley.
However, I would guess most of the slowdown is actually coming from _WS: /[ \t]/
, which would be much much better as _WS: /[ \t]/+
from lark.
Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use
expr expr
as whitespace concat? I might still cause a bit of ambiguity, but I believe it would be much easier on Earley.
i think i tried this but i ran into problems where
print( var (5+5)) ; would resolve to
print( var(10) )
"function var not found"
(or vice versa based on python hash seed or something)
the vice versa being
print(var (10)) ; correctly
never been gaslit by a terminal before but lark pulled that off for a full 2 days somehow
However, I would guess most of the slowdown is actually coming from
_WS: /[ \t]/
, which would be much much better as_WS: /[ \t]/+
test with that whitespace, and dynamic:
tokenizing
--- 26.913031339645386 seconds ---
comparable
from lark.
You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.
test with that whitespace
What did you test?
Did you change all the WS* into WS? I think that is causing the slowdown.
from lark.
You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.
i'm unsure how to do this but i will read on it
test with that whitespace
What did you test? I'd appreciate it if you used full sentences.
the whitespace you provided. i didn't think it needed more clarification given there was two options, one of which was current, but there you go
_WS: /[ \t]/+
i tested with this whitespace change that you suggested
Did you change all the WS* into WS? I think that is causing the slowdown.
An unexpected error occurred: No terminal matches '
' in the current parser context, at line 1 col 1
^
Expected one of:
* _WS
now we run into new line issues which is frustrating, i've had this newline problem before aswell
An unexpected error occurred: No terminal matches 'v' in the current parser context, at line 1 col 1
var := 5+5+5
^
Expected one of:
* _WS
removing the first blank newline of the code it shows more inherent problems with changing _WS* to _WS
from lark.
That's because it should be sometimes _WS
and sometimes _WS?
. You'll have to do it yourself based on the context. But that should help the parser.
from lark.
so we establish streamlining the grammar helps, and it does, marginally, i estimate we could see a 10-20% drop in speed with full optimisation on that front
a further 10x faster speed switching to LALR but i'm not that clever
but remember my options are:
- edit my grammar so it runs more efficiently (how)
- parallelise the tokenisation process (how)
- switch to LALR (how)
- execute the tokens as they come in (would it even be faster? tarnishes my two-section ideal, breaks the function pass)
- something else i'm unaware of which is causing slowdown (i don't know what)
i have 32 cores available, meaning we could very easilly gain a 3200% increase in speed with parallelisation, this has not been discussed once yet
remember, we're not looking for elegance here, this is nothing fancy it just needs to be somewhat functional, but 2 minutes to tokenise 700 lines is frankly ridiculous
but as you know there's not a problem in the world that can't be solved by throwing more compute at it
how would this be achieved?
the code can be as slow as it wants to be but it's using only 5% of my cpu, with parallelisation, any improvements would also be multiplied by core count, i think this should be the priority currently
i read that it is "trivial" to parallelise earley parsing but i'm unsure how, short of using an earley parser to split the code and launching the split code snippets as their own comprehensive parser or something, there is little to no documentation of paralellising lark
from lark.
The easiest option to get it to run faster is to use PyPy instead of CPython.
I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?
from lark.
First of all: tokenization is only a small part of parsing and since switching from dynamic_complete
to dynamic
changed barely nothing, that isn't a relevant factor. AFAIK, parsing is way harder to parallelize.
Second, we are talking about python. Python is really hard to make functional in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exists in lark, and I am not even sure if that is possible to a useful degree.
from lark.
The easiest option to get it to run faster is to use PyPy instead of CPython.
i tried this, it won't use lark for some reason, just keeps saying "module not found"
I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?
i read it here
https://news.ycombinator.com/item?id=28259458
LL or LR is almost certainly faster then Earley, since in general more restricted = faster.
Earley's algorithm should be able to be trivially parallelized
First of all: tokenization is only a small part of parsing and since switching from
dynamic_complete
todynamic
changed barely nothing, that isn't a relevant factor. AFAIK, parsing is way harder to parallelize.
i was hoping there was some magic way to get it working but it appears it is simply just as hard as it seems, so i guess parallelisation is out the window besides tokenising seperate files at the same time
however this person was doing something with multi threading
#493
Second, we are talking about python. Python is really hard to make functional in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exists in lark, and I am not even sure if that is possible to a useful degree.
so how would you propose further improving this, can you see a way this can work in LALR? as this can gain an instant 10x speed increase (judging by online benches)
from lark.
@Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.
Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.
from lark.
@Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.
there's no need to get defensive, you also said earlier that i should "use full sentences" but edited it out... if you are getting upset because i simply stated the facts of the matter then that is not my problem but i thankyou for your input regardless... however anemic said input was
Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.
again with the sarcasm, there is no need to get upset about this...
i'm not continuing this reply it's clear you have run out of constructive things to add and are now turning me into some strawman to attack rather than accept you don't know the solution.
has anyone else some constructive things to add, without a generous dollop of attitude ontop?
i feel this guy has derailed the thread a little
from lark.
if you are getting upset because i simply stated the facts
You were concisely stated "it doesn't work", which is almost surely caused by you doing something wrong. But apparently you are unwilling to get help with that so ...
i feel this guy has derailed the thread a little
"that guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing and being confused why earley is so slow and more or less demanding magical fixes. The answer is "use PyPy" and "use lalr instead". If your friend you claim to know has a suggestion for how to massively improve the speed of earley, they are open to make those, preferably in the form of a PR.
from lark.
You were concisely stated "it doesn't work", which is almost surely caused by you doing something wrong.
i pass the python file to the exe as an argument... what else is there to do.
it was my understanding it is "plug n play"
But apparently you are unwilling to get help with that so ...
where did i say this
"that guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing and being confused why earley is so slow and more or less demanding magical fixes.
Dev or no dev, an attitude problem is an attitude problem, there's a nice way to say everything than treat everyone as if they're stupid? i have been nothing but cordial and polite.
any animosity is their own, and only their own. i have taken no part in what whatsoever.
in short; their problem they are upset.
i do not demand magical fixes, i was simply asking suggestions and places to learn, i got a couple, while being treated like i'm stupid, then this final "i've lost interest"
if i may give some advice in return; frankly i think you lost interest in computers a long time ago.
The answer is "use PyPy" and "use lalr instead". If your friend you claim to know has a suggestion for how to massively improve the speed of earley, they are open to make those, preferably in the form of a PR.
got it, neither of you know what's wrong,
closing thread.
from lark.
got it, neither of you know what's wrong
I know exactly what is wrong. You make mistake after mistake, but you're too strong-headed to admit you might be wrong.
I lost my patience because you kept blaming the tool for your own mistakes and shortcomings, you made crazy claims like "it is "trivial" to parallelise earley parsing" and then ignored me when I challenged you.
Sorry, but you're not in a position to give advice.
Best of luck with your project.
from lark.
Related Issues (20)
- How to define lark grammar for best parsing performance HOT 8
- Unable to parse Arabic text HOT 3
- Incorrect start_pos / end_pos in the tree HOT 8
- Add `outlines` in the list of projects using Lark HOT 2
- Lark.open_from_package() does not support namespace packages HOT 2
- Stand-alone program cannot be run HOT 4
- Issue of installing lark in Python HOT 1
- Pipe in terminal regex not working as expected HOT 1
- Transformer Not Applying Expected Transformations in Lark Parser HOT 3
- Deprecation Warning HOT 6
- accepts() vs choices() in InteractiveParser HOT 10
- No such file or directory: 'COMMON.lark' HOT 4
- Grammar Syntax For Unordered Groups HOT 1
- Is it possible to parse parts of the input? HOT 12
- Forgiving syntax HOT 3
- Post 1388 changes HOT 4
- Dynamic Earley: Incorrect value for SymbolNode.end
- Inconsistent parse results from simple ambiguous grammar HOT 4
- Superfluous identical ambiguities in Earley HOT 2
- Porting from pyparsing match_previous_literal HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lark.