TLDR below this is my first time making a language, i have no formal

Please create a smaller example that showcases the actual thing you think could

Please create a smaller example that showcases the actual thing yo

The default is dynamic , not <code class="notranslate"

Aha, OPERATOR_CONCAT is the problem for <code class="

earley very, very slow,about lark-parser/lark

Comments (24)

MegaIng commented on July 16, 2024

Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to correctly format it in the future), but create a gist, that is easier to work with.

from lark.

Bugz000 commented on July 16, 2024

Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to correctly format it in the future), but create a gist, that is easier to work with.

you got to it before i could even edit the thing, i don't use github unless necessary so i make mistakes lol

i cannot offer a smaller example as i think the problem resides in the size of the grammar paired with the size of the input text, and due to the nature of my recursive execution, all grammar must have their respective functions, there is only exactly what is absolutely necessary to demonstrate the issue provided

i suppose you could remove the whole execution section but i also leave this to demonstrate it is infact the tokenisation process that is the problem, and the parsing of the resulting json tree is indeed very fast, so there's little reason for it to be SO slow...

however feel free to delete the ASTexecutor and VARIABLE classes and the respective call at the bottom if you don't want to see that
then you will truly have the bare minimum

from lark.

Bugz000 commented on July 16, 2024

i have edited the post and added more context

from lark.

Bugz000 commented on July 16, 2024

i have removed astexecutor and variable classes, now it is minimum though it cannot be verified anymore

from lark.

MegaIng commented on July 16, 2024

you got to it before i could even edit the thing, i don't use github unless necessary so i make mistakes lol

Tip for the future: You can preview your messages before posting. Editing messages afterwards is a bit less optimal since I at least always receive an email first. Having lots of poorly formatted code in the email, hiding also the button to go the github message behind a wall of text is annoying for everyone else.

You are using dynamic_complete. That is the by far slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need to use it for a language you want to design. It's best use is for when you have messy input or something like that.

And yes, earley has a time complexity of O(n^3) for highly ambiguous grammars. But unless you are trying to parse a natural language, you don't need that. Try using parser="lalr" and strict=True and fix the problems that come up one by one.

from lark.

Bugz000 commented on July 16, 2024

You are using dynamic_complete. That is the by far slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need to use it for a language you want to design. It's best use is for when you have messy input or something like that.

running the same thing in basic creates error:

Error: + at line 2, column 28.


print(round(strlen("reee") + 500+ (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500)) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500)))  " is indeed a number!")
                           ^


####################################
    None
####################################


Expected one of: {'BOOLEAN', 'STRING', 'LSQB', 'NUMBER', '_WS', 'CNAME', 'LBRACE', 'LPAR'}.
Previous tokens: None

and editing the line to something simple

Error: := at line 2, column 5.


var := 5+5+5
    ^


####################################
    None
####################################


Expected one of: {'LSQB', 'LBRACE', '_WS', 'NUMBER', 'STRING', 'CNAME', 'BOOLEAN', 'LPAR'}.
Previous tokens: None

relevant lines in grammar

start: statements

statements: (statement _NL)* statement
statement: _WS* _statement? _WS* LINE_COMMENT?
_statement: MULTI_LINE_COMMENT
          | COMMAND
          | statement_loop
          | statement_if
          | statement_function_definition
          | statement_return
          | statement_assignment
          | expression
          | statement_exp_inc_dec
          | statement_inc_dec
statement_assignment: (variable|expression_array) _WS* OPERATOR_ASSIGN _WS* expression

variable: CNAME
expression_array: CNAME "[" array_list "]"
OPERATOR_ASSIGN: ":="
_WS: /[ \t]/


expression: _WS* expression_ternary _WS*
 // continue to the expression tree

from lark.

MegaIng commented on July 16, 2024

The default is dynamic, not basic. It seems like the basic lexer has some other bug that I am currently trying to figure out.

from lark.

Bugz000 commented on July 16, 2024

trying dynamic it completes
testing dynamic_complete on around 700 lines of var := 5+5+5

assigning grammar
tokenizing
--- 31.171510457992554 seconds ---

testing dynamic on around 700 lines of var := 5+5+5

assigning grammar
tokenizing
--- 31.176007747650146 seconds ---

zero change

this didn't seem right so sanity check i ran the tests again and ensured i am indeed editing and running the correct file etc
dynamic_complete

tokenizing
--- 32.824604749679565 seconds ---

dynamic test 1

tokenizing
--- 25.929264545440674 seconds ---

dynamic test 2

tokenizing
--- 28.417376279830933 seconds ---

dynamic test 3

tokenizing
--- 27.651228427886963 seconds ---

so about 5% increase in speed (sometimes)

from lark.

MegaIng commented on July 16, 2024

Aha, OPERATOR_CONCAT is the problem for lexer='basic'. Having space be a valid operator is always going to cause problems for fast tokenization scheme.

from lark.

Bugz000 commented on July 16, 2024

Aha, OPERATOR_CONCAT is the problem for lexer='basic'. Having space be a valid operator is always going to cause problems for fast tokenization scheme.

that sounds about inline to what i had to deal with, one of the best features of AHK (and it's quite unconventional) is it's concat with space

print(var " is stored inside the variable") is valid, no . or + concat, and everything is string, float, string int all in strings, so it's very convenient to use, just throw the data it will handle it

however space concat is by far the hardest thing to get working which is ultimately what made me use earley over LALR and (probably) dynamic_complete over basic (though dynamic seems fine too)

i'm unsure if LALR is even possible for this syntax

to be clear nobody has tokenised ahk like this before, all the ahk source codes, _B, _L, v2, _H and ironahk etc all operate the syntax by hammering away at it with raw string funcs, (frankly this is insanity to me) so i wish to tokenise it, "as closely to original syntax" but i'm willing to make concessions for some form of standardised i/o...

i hope you can see what limitations this brings

from lark.

erezsh commented on July 16, 2024

Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use expr expr as whitespace concat? I might still cause a bit of ambiguity, but I believe it would be much easier on Earley.

However, I would guess most of the slowdown is actually coming from _WS: /[ \t]/, which would be much much better as _WS: /[ \t]/+

from lark.

Bugz000 commented on July 16, 2024

Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use expr expr as whitespace concat? I might still cause a bit of ambiguity, but I believe it would be much easier on Earley.

i think i tried this but i ran into problems where

print( var (5+5)) ;  would resolve to
print(  var(10)  )

"function var not found"
(or vice versa based on python hash seed or something)
the vice versa being

print(var (10)) ; correctly

never been gaslit by a terminal before but lark pulled that off for a full 2 days somehow

However, I would guess most of the slowdown is actually coming from _WS: /[ \t]/, which would be much much better as _WS: /[ \t]/+

test with that whitespace, and dynamic:

tokenizing
--- 26.913031339645386 seconds ---

comparable

from lark.

erezsh commented on July 16, 2024

You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.

test with that whitespace

What did you test?

Did you change all the WS* into WS? I think that is causing the slowdown.

from lark.

Bugz000 commented on July 16, 2024

You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.

i'm unsure how to do this but i will read on it

test with that whitespace

What did you test? I'd appreciate it if you used full sentences.

the whitespace you provided. i didn't think it needed more clarification given there was two options, one of which was current, but there you go

_WS: /[ \t]/+

i tested with this whitespace change that you suggested

Did you change all the WS* into WS? I think that is causing the slowdown.

An unexpected error occurred: No terminal matches '
' in the current parser context, at line 1 col 1


^
Expected one of:
        * _WS

now we run into new line issues which is frustrating, i've had this newline problem before aswell

An unexpected error occurred: No terminal matches 'v' in the current parser context, at line 1 col 1

var := 5+5+5
^
Expected one of:
       * _WS

removing the first blank newline of the code it shows more inherent problems with changing _WS* to _WS

from lark.

erezsh commented on July 16, 2024

That's because it should be sometimes _WS and sometimes _WS?. You'll have to do it yourself based on the context. But that should help the parser.

from lark.

Bugz000 commented on July 16, 2024

so we establish streamlining the grammar helps, and it does, marginally, i estimate we could see a 10-20% drop in speed with full optimisation on that front

a further 10x faster speed switching to LALR but i'm not that clever

but remember my options are:

edit my grammar so it runs more efficiently (how)
parallelise the tokenisation process (how)
switch to LALR (how)
execute the tokens as they come in (would it even be faster? tarnishes my two-section ideal, breaks the function pass)
something else i'm unaware of which is causing slowdown (i don't know what)

i have 32 cores available, meaning we could very easilly gain a 3200% increase in speed with parallelisation, this has not been discussed once yet

remember, we're not looking for elegance here, this is nothing fancy it just needs to be somewhat functional, but 2 minutes to tokenise 700 lines is frankly ridiculous

but as you know there's not a problem in the world that can't be solved by throwing more compute at it

how would this be achieved?

the code can be as slow as it wants to be but it's using only 5% of my cpu, with parallelisation, any improvements would also be multiplied by core count, i think this should be the priority currently

i read that it is "trivial" to parallelise earley parsing but i'm unsure how, short of using an earley parser to split the code and launching the split code snippets as their own comprehensive parser or something, there is little to no documentation of paralellising lark

from lark.

erezsh commented on July 16, 2024

The easiest option to get it to run faster is to use PyPy instead of CPython.

I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?

from lark.

MegaIng commented on July 16, 2024

First of all: tokenization is only a small part of parsing and since switching from dynamic_complete to dynamic changed barely nothing, that isn't a relevant factor. AFAIK, parsing is way harder to parallelize.

Second, we are talking about python. Python is really hard to make functional in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exists in lark, and I am not even sure if that is possible to a useful degree.

from lark.

Bugz000 commented on July 16, 2024

The easiest option to get it to run faster is to use PyPy instead of CPython.

i tried this, it won't use lark for some reason, just keeps saying "module not found"

I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?

i read it here
https://news.ycombinator.com/item?id=28259458

LL or LR is almost certainly faster then Earley, since in general more restricted = faster.
Earley's algorithm should be able to be trivially parallelized

First of all: tokenization is only a small part of parsing and since switching from dynamic_complete to dynamic changed barely nothing, that isn't a relevant factor. AFAIK, parsing is way harder to parallelize.

i was hoping there was some magic way to get it working but it appears it is simply just as hard as it seems, so i guess parallelisation is out the window besides tokenising seperate files at the same time

however this person was doing something with multi threading
#493

Second, we are talking about python. Python is really hard to make functional in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exists in lark, and I am not even sure if that is possible to a useful degree.

so how would you propose further improving this, can you see a way this can work in LALR? as this can gain an instant 10x speed increase (judging by online benches)

from lark.

erezsh commented on July 16, 2024

@Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.

Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.

from lark.

Bugz000 commented on July 16, 2024

@Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.

there's no need to get defensive, you also said earlier that i should "use full sentences" but edited it out... if you are getting upset because i simply stated the facts of the matter then that is not my problem but i thankyou for your input regardless... however anemic said input was

Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.

again with the sarcasm, there is no need to get upset about this...
i'm not continuing this reply it's clear you have run out of constructive things to add and are now turning me into some strawman to attack rather than accept you don't know the solution.

has anyone else some constructive things to add, without a generous dollop of attitude ontop?
i feel this guy has derailed the thread a little

from lark.

MegaIng commented on July 16, 2024

if you are getting upset because i simply stated the facts

You were concisely stated "it doesn't work", which is almost surely caused by you doing something wrong. But apparently you are unwilling to get help with that so ...

i feel this guy has derailed the thread a little

"that guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing and being confused why earley is so slow and more or less demanding magical fixes. The answer is "use PyPy" and "use lalr instead". If your friend you claim to know has a suggestion for how to massively improve the speed of earley, they are open to make those, preferably in the form of a PR.

from lark.

Bugz000 commented on July 16, 2024

You were concisely stated "it doesn't work", which is almost surely caused by you doing something wrong.

i pass the python file to the exe as an argument... what else is there to do.
it was my understanding it is "plug n play"

But apparently you are unwilling to get help with that so ...

where did i say this

"that guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing and being confused why earley is so slow and more or less demanding magical fixes.

Dev or no dev, an attitude problem is an attitude problem, there's a nice way to say everything than treat everyone as if they're stupid? i have been nothing but cordial and polite.
any animosity is their own, and only their own. i have taken no part in what whatsoever.
in short; their problem they are upset.

i do not demand magical fixes, i was simply asking suggestions and places to learn, i got a couple, while being treated like i'm stupid, then this final "i've lost interest"
if i may give some advice in return; frankly i think you lost interest in computers a long time ago.

The answer is "use PyPy" and "use lalr instead". If your friend you claim to know has a suggestion for how to massively improve the speed of earley, they are open to make those, preferably in the form of a PR.

got it, neither of you know what's wrong,
closing thread.

from lark.

erezsh commented on July 16, 2024

got it, neither of you know what's wrong

I know exactly what is wrong. You make mistake after mistake, but you're too strong-headed to admit you might be wrong.

I lost my patience because you kept blaming the tool for your own mistakes and shortcomings, you made crazy claims like "it is "trivial" to parallelise earley parsing" and then ignored me when I challenged you.

Sorry, but you're not in a position to give advice.

Best of luck with your project.

from lark.

earley very, very slow about lark HOT 24 CLOSED

Comments (24)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent