Comments (67)
Fixed!
At the very least, the example you gave me now runs almost instantly. Please check if latest master fixes your issues.
Turns out it was a very simple bug, but it was a real flaw in my implementation. So thank you for reporting it and helping me weed it out.
See if you can find an example of name length affecting speed. I'll be very happy to fix it if it's true.
from lark.
wow
20-30x faster (similar to what it was before)
almost no diffs
I'm happy with this version
I'm curious as to what you did, though...
I'll see if I can explain it in simple words.
The slowdown was caused by similar parse-items that had different trees. Every different possible derivation was added as a separate parse-item, that would later merge as ambiguity. This created thousands of parse-items at every step, and caused Earley to choke. I made a very subtle change: I merged the similar parse-items into a single parse-item, and added their trees as ambiguity earlier in the process.
You could think of this as an optimization, but really, the previous implementation was just wrong. I still need to solve some subtleties that resulted from this change, and then I plan to merge it into master.
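The idea can be sketched in a few lines (invented names and structure for illustration; this is not Lark's actual code):

```python
class Item:
    """A simplified Earley parse-item. What makes two items 'similar' is the
    rule and positions; the derivation trees are carried alongside."""
    def __init__(self, rule, start, pos, trees):
        self.rule, self.start, self.pos = rule, start, pos
        self.trees = list(trees)  # every derivation found so far (ambiguity)

    def similar(self, other):
        # similarity ignores the trees entirely
        return (self.rule, self.start, self.pos) == \
               (other.rule, other.start, other.pos)

def add_to_column(column, new_item):
    """Add a parse-item to an Earley column. Instead of keeping one item per
    derivation (which multiplies items at every step), similar items are
    merged and their trees combined as ambiguity right away."""
    for item in column:
        if item.similar(new_item):
            item.trees.extend(new_item.trees)  # merge as ambiguity early
            return item
    column.append(new_item)
    return new_item

column = []
add_to_column(column, Item('expr', 0, 2, ['tree_a']))
add_to_column(column, Item('expr', 0, 2, ['tree_b']))  # merged, not appended
```

Here the second call merges into the first item, so the column holds one item with two trees instead of two near-duplicate items.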
Anyway, I'm glad it solved the issue for you!
Okay, all my tests are passing now. I pushed the most recent changes to the branch: earley_fix2
If you still have issues with the priority, an example would really help.
In master.
Yes, it seems to be a bug! I'm tracking it down now.
Found it! Pushed changes to master. It should now work as intended. Please close the issue if it's resolved.
P.S. Good find on this bug 👍
I'm still investigating it, but it seems the fix causes an infinite loop in some cases.
I have an infinite loop between the functions `Item.__eq__`, `Item.similar` and `Tree.__eq__`.
It's not infinite but something that was really fast before now takes an hour... I can't provide the specific example at the moment, but thought you might have an idea anyhow.
Huh, I think that makes sense. I'll see what I can do, but if you can find an example it will really help.
Try changing `Item.__eq__` into:

```python
def __eq__(self, other):
    return self.similar(other)  # removed: and (self.tree is other.tree or self.tree == other.tree)
```

See if that fixes it.
It worked fast :)
Now checking for other problems
Seems good. Closing.
Nice. I'll push it to master.
Hmm scratch that. It might fix your loop, but it also re-surfaces the previous bug in this issue, where some ambiguity "disappears". We'll need to find a better solution.
Should I open a new issue?
Since this fix, the performance is really bad.
Can we limit the depth of this comparison somehow?
(Doesn't the Earley algorithm support ambiguity out of the box? How can this increase the runtime so much?)
Hi, I didn't realize it's such a big issue.
You can open a new issue, if you like, or we can continue here.
It's possible that the performance bottleneck comes from the post-processing, and not from the parsing itself. There also may be an easy solution, at least for your specific case. I can't tell, because I don't know how to reproduce it.
So, if we're going to take this seriously, the first step is to prepare a self-contained example of a grammar+input that result in slow run-time (i.e. demonstrating that complexity is larger than O(n^3)).
Provide an example and I'll take it from there.
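One generic way to estimate the growth rate empirically (a sketch; `fake_parse` is a quadratic stand-in for whatever parser is actually under test):

```python
import math
import time

def growth_exponent(parse, make_input, n_small, n_large):
    """Estimate k in time ~ n^k from two timings, via the log-log slope."""
    times = []
    for n in (n_small, n_large):
        text = make_input(n)
        t0 = time.perf_counter()
        parse(text)
        times.append(time.perf_counter() - t0)
    return math.log(times[1] / times[0]) / math.log(n_large / n_small)

def fake_parse(text):
    # stand-in doing O(n^2) work, just to demonstrate the measurement
    n = len(text)
    return sum(i * j for i in range(n) for j in range(n) if i + j == n)

fake_parse('x' * 400)  # warm-up run, so JIT/caches don't skew the timing
k = growth_exponent(fake_parse, lambda n: 'x' * n, 400, 1600)
# k comes out near 2 for the quadratic stand-in
```

Running the same measurement with the real parser and a grammar like the ones in this thread would show whether the exponent stays near 3 or climbs above it.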
sounds good, thanks.
This takes more than a minute to run:
```python
import lark

GRAMMAR_SPECIFICATION = r"""
freddie_terry: "FT"
terry_terry: "TT"
terry_randy: freddie_terry terry_terry
freddie_danny: "F"
terry_danny: "TD"
danny_randy: freddie_danny terry_danny
eric_danny: "D"
danny_eric_low: (danny_randy | eric_danny)+
danny_eric: eric_danny+
    | danny_randy
    | danny_eric_low
terry_eric: freddie_terry
    | terry_terry
    | terry_randy
gh_low: danny_randy
    | terry_randy danny_eric
gh_very_low: terry_terry danny_eric
    | eric_danny
    | freddie_danny
    | terry_danny
    | freddie_terry
gh: danny_eric terry_eric
    | gh_low
    | gh_very_low
start: gh*
%import common.WS
%ignore WS
"""

PARSER = lark.Lark(
    GRAMMAR_SPECIFICATION, ambiguity='resolve__antiscore_sum')
PARSER.parse('D FT TT DF TD FT TT D D FT TT D FT')
```
And I discovered a ridiculous fact: changing the rule names to something short makes the parsing fast (!!)
(Note that the entire time is spent on the parsing itself and not on processing the grammar.)
This exposes another issue: if you take this grammar and replace `danny_randy` with `DR`, you get the following error: `GrammarError: Rules aren't allowed inside tokens (freddie_danny in DR)`
Great, I'll look into it very soon.
> changing the rule names to something short makes the parsing fast
How odd! It's hard for me to believe. I would be very interested in some example of this.
As a side note: Lowercase names are rules, which means they can contain other rules and tokens, while uppercase names are tokens, which can only contain tokens. This distinction affects the tree structure, but it also becomes useful when switching from a scannerless parse (the default) to using a tokenizer.
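A minimal illustration of that distinction (an invented fragment, not part of the grammars in this thread):

```
// lowercase names define rules: they may reference other rules and tokens,
// and each match appears as a branch in the parse tree
word: LETTER+

// UPPERCASE names define tokens: they may only be built from other tokens
// and strings, and each match becomes a single leaf
LETTER: "a" | "b" | "c"
```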
Woohoo!
Thanks!!
So glad
Haven't checked yet - but will let you know soon
Seems good. I think it's still somewhat slower than the pre-fix version, but it's definitely reasonable now.
If I find something pathologically slow I'll reopen with the example.
Again thank you so much.
Well, there are a few things you can do to speed it up:
- Use `lexer="standard"` (requires that you don't have collisions between tokens)
- Use PyPy
- Disable `all_derivations` (really depends on your use-case)
OK, I have a specific example, not as extreme as before, but it doesn't look like O(n^3) either. This takes about 20 seconds to run:
```python
import lark

SPECIFICATION = r"""
freddie_terry: "freddie_terry"
terry_terry: "terry_terry"
terry_range_low: terry_terry freddie_terry
terry_range: freddie_terry terry_terry
    | terry_range_low
freddie_danny: "freddie_danny"
to_danny: "to_danny"
danny_range: freddie_danny to_danny
explicit_danny: "danny"
danny_expression_low: (danny_range | explicit_danny)+
danny_expression: explicit_danny+
    | danny_range
    | danny_expression_low
terry_expression: freddie_terry
    | terry_terry
    | terry_range
    | terry_pattern
    | terry_pattern_two_range
terry_pattern: freddie_terry terry_terry terry_terry
terry_pattern_two_range: freddie_terry terry_terry freddie_terry terry_terry
multiple_danny_ranges: danny_range danny_range danny_range+
multiple_danny_ranges_expression: multiple_danny_ranges
gh_medium: freddie_terry danny_expression
    | terry_expression multiple_danny_ranges_expression
gh_low: danny_range
    | terry_range danny_expression
gh_very_low: terry_terry danny_expression
    | explicit_danny
    | freddie_danny
    | to_danny
    | freddie_terry
gh_dannys_only: explicit_danny explicit_danny
    | explicit_danny explicit_danny explicit_danny
gh: danny_expression terry_expression
    | multiple_danny_ranges_expression terry_expression
    | gh_medium
    | gh_low
    | gh_very_low
    | gh_dannys_only
start: gh*
%import common.WS
%ignore WS
"""

GRAMMAR = lark.Lark(
    SPECIFICATION, ambiguity='resolve__antiscore_sum')
s = 'freddie_terry terry_terry danny freddie_terry terry_terry danny freddie_terry terry_terry danny danny danny danny freddie_terry terry_terry danny freddie_terry terry_terry danny freddie_terry terry_terry danny freddie_terry terry_terry'
GRAMMAR.parse(s)
```

Adding `lexer="standard"` has no noticeable effect.
I'm really happy you're putting my parser to the test. I'll dig into it.
Meanwhile, I should mention that PyPy runs it in 1.5 seconds, about 15x faster.
I think I solved it!
This fix is a little trickier, so I want to run some more tests before I push it to master. Meanwhile, I pushed it to the branch `earley_fix`. So you can do:

```shell
git fetch
git checkout origin/earley_fix
```

and see if it works for you.
Does it have any effect on prioritization?
I might be seeing some effect, but not sure yet.
Actually scratch that - everything is working great.
:(
I think there is something wrong with prioritization. I have a situation where it doesn't matter how high I set the antiscore of a certain rule, it still chooses the derivation with it. Removing the rule results in a different derivation, which should have been the one chosen in the first place.
I'm not sure this was introduced by the last change, but let me know if it rings a bell.
By the way, it's a relatively minor issue. Mostly it's working well.
Yeah, it actually failed one of the priority tests. I think I should be able to get it fixed.
By the way, regarding the prioritization issues you're having: are you using `lexer="standard"` or the default lexer?
Default. But I'm not sure I should.
I did one trial run with the new branch and my PC got stuck.
Do all the previous examples run fast?
I'll be able to check again in a few days.
OK, still having the same problem: arbitrarily increasing a rule's weight has no effect, but when the rule is removed, the other parse is chosen.
Note that this is with the WCFG algorithm (antiscore).
I have a minimal example for you that a friend made:

```python
import lark

GRAMMAR = r"""
word.1000: WORD
president: "obama" | "bush" | "clinton"
token: president | word
start: token*
%import common.WORD
%import common.WS
%ignore WS
"""

for _ in range(10):
    parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
    print(parser.parse('obama'))
print('---------')

parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
for _ in range(10):
    print(parser.parse('obama'))
print('---------')

parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
for _ in range(10):
    print(parser.parse('obama'))
```

If you run this code, you'll get different results in every block, indicating that ambiguities are missing on some of the runs.
Hope this helps :)
Thanks for bringing this to my attention. I'll look into it.
FYI, the bug occurs before ambiguity resolution (it can be seen with `ambiguity='explicit'` as well).
Hi, any updates, perchance?
(Asking because this bug has now become a bottleneck when using the library.)
Hi, I didn't get a chance to look at the "missing ambiguities" example you provided, but as far as I know I don't have an example of a performance bottleneck (all the previous examples run quickly now).
Debugging such things without a specific example takes a lot of time, and since this is just a free-to-use project I have to de-prioritize it. If it's important enough that you can pay for my time, we can work something out. Otherwise I'll be happy to help, but it's not at the top of my queue, especially without an example to reproduce it.
Just to clarify, this is not about performance, I'm talking about the latest code snippet that shows a correctness issue and how to reproduce it.
Ah, the use of the word bottleneck gave me the wrong impression. Anyway, I hope I'll have time to get to it this week.
Hi again, thanks for the example, and sorry for the misunderstanding.
I pushed a fix to master that should solve the problem. Let me know if it works.
Cool.
One thing I'm noticing is that I'm getting `UnexpectedInput` instead of `ParseError` on some occasions (not such a big deal, I think).
Still checking other stuff.
That's to be expected. `UnexpectedInput` means it failed to match some part of the text, while `ParseError` can have many other causes, for example the structure is different than expected, or there isn't enough input for a complete parse.
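To make that distinction concrete, here is a toy model (stand-in classes with the same names; these are not Lark's real definitions):

```python
class ParseError(Exception):
    """The parse could not complete: unexpected structure, input ended
    early, and so on."""

class UnexpectedInput(Exception):
    """Some part of the text could not be matched at all."""

def describe(exc):
    """Turn a parse failure into a human-readable diagnosis."""
    if isinstance(exc, UnexpectedInput):
        return "the text contains something no token matches"
    if isinstance(exc, ParseError):
        return "tokens are fine, but no complete parse exists"
    raise exc
```

Calling code would catch the two separately and report `describe(exc)` to the user.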
In that case, could you please add it to `__init__.py`:

```python
from .lexer import UnexpectedInput
```

Thanks :)
By the way, are all the other fixes from the other branches on master as well?
As far as I'm aware, yes.
Including https://github.com/erezsh/lark/tree/earley_fix?
(asking because I'm still not getting all ambiguities when using master, but want to make sure it is really a new issue and not just one of the already fixed ones)
Yes, `earley_fix` has already been merged.
I'm aware that the parser currently doesn't create every possible derivation.
The parser doesn't recurse into rules more than necessary. So if there's a list or an explicit recursion that can return several results, the parser won't return all of them, just the first.
If this is what you're seeing and you want it to greedy-recurse, I'll see what I can do.
If you're referring to another behavior, then I'll need more details.
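A toy picture of what that means (invented data, not Lark's internals): if each repetition of a rule admits several derivations, returning only the first per repetition hides the rest.

```python
from itertools import product

# hypothetical: two repetitions of gh, the first one ambiguous
derivations_per_repetition = [['de te', 'gh_low'], ['gh_medium']]

# "first only" behavior: a single combined derivation
first_only = [options[0] for options in derivations_per_repetition]

# greedy recursion: the cartesian product of per-repetition choices
all_derivations = list(product(*derivations_per_repetition))
```

Here `first_only` is `['de te', 'gh_medium']`, while the full product also contains `('gh_low', 'gh_medium')`; that second combination is exactly the kind of derivation that disappears.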
I see.
So this is what got broken:
```python
import lark

GRAMMAR = r"""
ft: "ft"
tt: "tt"
str: ft tt
tira: str
fd: "fd"
td: "td"
dr: fd td
ad: "ad"
ed: "d"
de: ed+
    | dr
    | ad
te: ft
    | tt
    | tira
multiple_drs: dr dr dr+
multiple_drs_expression: multiple_drs
gh_medium.1: ft de
    | te multiple_drs_expression
    | ad
gh_low.99999: dr
gh_throwaway.10: ed
gh: de te
    | gh_medium
    | gh_throwaway
    | multiple_drs_expression te
    | tira de
    | gh_low
start: gh*
%import common.NUMBER
%import common.WS
%ignore WS
"""

s = ' '.join(['ad', 'fd', 'td', 'ft', 'tt', 'ad', 'd', 'd', 'ft', 'tt'])
parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
print(parser.parse(s))

# pre fix:
# Tree(start, [Tree(gh, [Tree(gh_medium, [Tree(ad, [])])]), Tree(gh, [Tree(de, [Tree(dr, [Tree(fd, []), Tree(td, [])])]), Tree(te, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])])])]), Tree(gh, [Tree(gh_medium, [Tree(ad, [])])]), Tree(gh, [Tree(de, [Tree(ed, []), Tree(ed, [])]), Tree(te, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])])])])])
# post fix:
# Tree(start, [Tree(gh, [Tree(gh_medium, [Tree(ad, [])])]), Tree(gh, [Tree(gh_low, [Tree(dr, [Tree(fd, []), Tree(td, [])])])]), Tree(gh, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])]), Tree(de, [Tree(ad, [])])]), Tree(gh, [Tree(de, [Tree(ed, []), Tree(ed, [])]), Tree(te, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])])])])])
```
I need all the ambiguities so that `resolve_ambig` can really pick the lightest one with respect to the weights.
What do you think?
Well, it's a little dense, but I'm guessing the `gh*` is the recursion that isn't yielding all the possible results.
I saw the same behavior in issue #21, so perhaps I can use it to debug this.
However, now I'm really curious what it's used for :)
(This is issue #21, so I guess you meant some other issue?)
Is there some kind of hack I can do to get all the ambiguities?
As long as not all ambiguities are returned, the prioritization is somewhat non-functional.
As for your curiosity, I'll try to think of something... :)
This seems to work:

```
start: gh | gh gh | gh gh gh | gh gh gh gh | gh+
```
Yeah, I meant I saw it in the "president" example you provided.
So, sounds like I guessed right. I'll see what I can do.
Well, that wasn't too difficult! Check out `master` and use:

```python
parser = lark.Lark(GRAMMAR, ambiguity='explicit', earley__predict_all=True)
```

It should work fine with `gh*`.
looking good!
doing some more tests
Sadly, now there's the slowness issue again.
I'll see if I can make an unreasonable example to indicate something above O(n^3).
Perhaps I should just implement CYK instead; would that be quicker for all ambiguities?
In theory, Earley should outperform CYK. But it's very possible that the slowness is due to my own implementation. I can't say anything smarter right now. I'll have to go back to the academic papers and review them thoroughly to see if I could do a better job.
If you think you can implement CYK, I suggest you go for it. If you get it working, I'll be happy to add it to Lark as another optional parser.
You can also try other Earley implementations (even in other languages). If they perform better for your use-case, I'll consider copying them into Lark.
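For reference, the core of a CYK recognizer fits in one short function. This is a from-scratch sketch for a grammar already in Chomsky normal form, showing where the O(n^3) table-filling comes from; it is unrelated to the implementation that later landed in Lark:

```python
def cyk_recognize(words, unary, binary, start='S'):
    """CYK recognition for a CNF grammar, O(n^3 * |grammar|).
    unary:  maps each terminal to the set of nonterminals producing it
    binary: maps each pair (B, C) to the set of A with a rule A -> B C
    """
    n = len(words)
    # table[i][j] holds the nonterminals deriving words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(unary.get(w, ()))
    for span in range(2, n + 1):            # substring length
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):           # split point
                for b in table[i][k]:
                    for c in table[k + 1][j]:
                        table[i][j] |= binary.get((b, c), set())
    return start in table[0][n - 1]

# tiny CNF grammar: S -> A B, A -> "a", B -> "b"
unary = {'a': {'A'}, 'b': {'B'}}
binary = {('A', 'B'): {'S'}}
```

Ambiguity counting or tree extraction needs back-pointers on top of this, but the cubic cost is already visible in the three nested loops over span, start position, and split point.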
Sounds good, thanks.
So, it seems to me that there are no more bugs, only slowness, so the issue is resolved.
Feel free to open a new one for optimizations (if you can provide examples), or reopen this one if you encounter a bug.
Hi, just want to let you know that Lark now has a CYK parser implementation.
Staggering coincidence ;)
I was wondering about that, but you know what they say about assuming :)