Comments (67)

erezsh avatar erezsh commented on July 30, 2024 3

Fixed!
At the very least, the example you gave me now runs almost instantly. Please check if latest master fixes your issues.

Turns out it was a very simple bug, but it was a real flaw in my implementation. So thank you for reporting it and helping me weed it out.

See if you can find an example of name length affecting speed. I'll be very happy to fix it if it's true.

uriva avatar uriva commented on July 30, 2024 1

Wow, 20-30x faster (similar to what it was before).
Almost no diffs.
I'm happy with this version.

I'm curious as to what you did, though...

erezsh avatar erezsh commented on July 30, 2024 1

I'll see if I can explain it in simple words.
The slowdown was caused by similar parse-items that had different trees. Every different possible derivation was added as a separate parse-item, that would later merge as ambiguity. This created thousands of parse-items at every step, and caused Earley to choke. I made a very subtle change: I merged the similar parse-items into a single parse-item, and added their trees as ambiguity earlier in the process.
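(Not lark's actual code; just a toy sketch of the idea, with made-up names:)

class ParseItem:
    """Toy stand-in for an Earley parse-item, keyed by (rule, origin, dot position)."""
    def __init__(self, rule, origin, dot):
        self.rule, self.origin, self.dot = rule, origin, dot
        self.trees = []   # alternative derivations, kept together as ambiguity

def add_item(column, rule, origin, dot, tree):
    """Merge similar items instead of adding one item per derivation."""
    key = (rule, origin, dot)
    item = column.get(key)
    if item is None:
        item = column[key] = ParseItem(rule, origin, dot)
    item.trees.append(tree)   # a later derivation joins the existing item as ambiguity
    return item

# Two derivations of the same item end up merged into one parse-item:
column = {}
add_item(column, "gh", 0, 1, "tree_a")
add_item(column, "gh", 0, 1, "tree_b")
assert column[("gh", 0, 1)].trees == ["tree_a", "tree_b"]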

You could think of this as an optimization, but really, the previous implementation was just wrong. I still need to solve some subtleties that resulted from this change, and then I plan to merge it into master.

Anyway, I'm glad it solved the issue for you!

erezsh avatar erezsh commented on July 30, 2024 1

Okay, all my tests are passing now. I pushed the most recent changes to the branch: earley_fix2

If you still have issues with the priority, an example would really help.

erezsh avatar erezsh commented on July 30, 2024 1

CAN DO

In master.

erezsh avatar erezsh commented on July 30, 2024

Yes, it seems to be a bug! I'm tracking it down now.

erezsh avatar erezsh commented on July 30, 2024

Found it! Pushed changes to master. It should now work as intended. Please close the issue if it's resolved.

P.S. Good find on this bug 👍

uriva avatar uriva commented on July 30, 2024

I'm still investigating it, but it seems the fix causes an infinite loop in some cases.

uriva avatar uriva commented on July 30, 2024

I have an infinite loop between the functions Item.__eq__, Item.similar and Tree.__eq__.

uriva avatar uriva commented on July 30, 2024

It's not infinite but something that was really fast before now takes an hour... I can't provide the specific example at the moment, but thought you might have an idea anyhow.

erezsh avatar erezsh commented on July 30, 2024

Huh, I think that makes sense. I'll see what I can do, but if you can find an example it will really help.

erezsh avatar erezsh commented on July 30, 2024

Try changing Item.__eq__ into:

def __eq__(self, other):
    return self.similar(other) #remove this: and (self.tree is other.tree or self.tree == other.tree)

See if that fixes it.

uriva avatar uriva commented on July 30, 2024

It worked fast :)
Now checking for other problems

uriva avatar uriva commented on July 30, 2024

Seems good. Closing.

erezsh avatar erezsh commented on July 30, 2024

Nice. I'll push it to master.

erezsh avatar erezsh commented on July 30, 2024

Hmm scratch that. It might fix your loop, but it also re-surfaces the previous bug in this issue, where some ambiguity "disappears". We'll need to find a better solution.

uriva avatar uriva commented on July 30, 2024

Should I open a new issue?
Since this fix, the performance is really bad.
Can we limit the depth of this comparison somehow?
(Doesn't the Earley algorithm support ambiguity out of the box? How can this increase the runtime so much?)

erezsh avatar erezsh commented on July 30, 2024

Hi, I didn't realize it was such a big issue.
You can open a new issue if you like, or we can continue here.
It's possible that the performance bottleneck comes from the post-processing, and not from the parsing itself. There may also be an easy solution, at least for your specific case. I can't tell, because I don't know how to reproduce it.
So, if we're going to take this seriously, the first step is to prepare a self-contained example of a grammar + input that results in slow run-time (i.e. demonstrating that the complexity is larger than O(n^3)).
Provide an example and I'll take it from there.
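(For reference, a minimal sketch of how such an example could show the growth rate, using only lark.Lark and a placeholder grammar: time the parse for increasing input sizes and compare against n^3.)

import time
import lark

grammar = r"""
  start: WORD+
  %import common.WORD
  %import common.WS
  %ignore WS
"""
parser = lark.Lark(grammar)  # placeholder; substitute the grammar under test

for n in (50, 100, 200, 400):
    text = ' '.join(['word'] * n)
    t0 = time.time()
    parser.parse(text)
    elapsed = time.time() - t0
    print(n, elapsed, elapsed / n ** 3)  # if the last column keeps growing, it's worse than O(n^3)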

uriva avatar uriva commented on July 30, 2024

sounds good, thanks.

uriva avatar uriva commented on July 30, 2024

This takes more than a minute to run:

import lark

GRAMMAR_SPECIFICATION = r"""

  freddie_terry: "FT"
  terry_terry: "TT" 

  terry_randy: freddie_terry terry_terry

  freddie_danny: "F"
  terry_danny: "TD"
  danny_randy: freddie_danny terry_danny

  eric_danny: "D"

  danny_eric_low: (danny_randy | eric_danny)+
  danny_eric: eric_danny+
                | danny_randy
                | danny_eric_low

  terry_eric: freddie_terry
                 | terry_terry
                 | terry_randy


  gh_low: danny_randy
              | terry_randy danny_eric

  gh_very_low: terry_terry danny_eric
                       | eric_danny
                       | freddie_danny
                       | terry_danny
                       | freddie_terry

  gh: danny_eric terry_eric
    | gh_low
    | gh_very_low

  start: gh*

  %import common.WS
  %ignore WS
"""

PARSER = lark.Lark(
    GRAMMAR_SPECIFICATION, ambiguity='resolve__antiscore_sum')

PARSER.parse('D FT TT DF TD FT TT D D FT TT D FT')

uriva avatar uriva commented on July 30, 2024

And I discovered a ridiculous fact: changing the rule names to something short makes the parsing fast (!!)

(Note that the entire time is spent on the parsing itself, not on processing the grammar.)

uriva avatar uriva commented on July 30, 2024

This exposes another issue: if you take this grammar and replace danny_randy with DR, you get the following error: GrammarError: Rules aren't allowed inside tokens (freddie_danny in DR)

erezsh avatar erezsh commented on July 30, 2024

Great, I'll look into it very soon.

"changing the rule names to something short makes the parsing fast"

How odd! It's hard for me to believe. I would be very interested in some example of this.

As a side note: Lowercase names are rules, which means they can contain other rules and tokens, while uppercase names are tokens, which can only contain tokens. This distinction affects the tree structure, but it also becomes useful when switching from a scannerless parse (the default) to using a tokenizer.
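(A minimal sketch of that distinction, assuming a recent lark version; the names here are made up for illustration:)

import lark

# Lowercase "pair" is a rule: it may reference rules and tokens, and it appears
# as a branch in the tree. Uppercase NAME is a token: it may only be built from
# other tokens/patterns, and it appears as a leaf.
grammar = r"""
  start: pair+
  pair: NAME "=" NAME
  NAME: /[a-z]+/

  %import common.WS
  %ignore WS
"""

print(lark.Lark(grammar).parse("foo = bar"))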

uriva avatar uriva commented on July 30, 2024

Woohoo!
Thanks!!
So glad
Haven't checked yet - but will let you know soon

uriva avatar uriva commented on July 30, 2024

Seems good. I think it's still somewhat slower than the pre-fix version, but it's definitely reasonable now.
If I find something pathologically slow I'll reopen with the example.

Again thank you so much.

erezsh avatar erezsh commented on July 30, 2024

Well, there are a few things you can do to speed it up -

  1. Use lexer="standard" (requires that you don't have collisions between tokens)
  2. Use PyPy
  3. Disable "all_derivations" (really depends on your use-case)
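(A hedged sketch of option 1 with a throwaway grammar; the option is spelled lexer="standard" in the lark version discussed in this thread, and was renamed in later releases.)

import lark

grammar = r"""
  start: WORD+
  %import common.WORD
  %import common.WS
  %ignore WS
"""

# lexer="standard" runs a separate tokenizer pass instead of the scannerless
# default, so it only works when no two token definitions collide on the same text.
parser = lark.Lark(grammar, lexer="standard")
print(parser.parse("hello world"))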

uriva avatar uriva commented on July 30, 2024

Ok, I have a specific example; it's not as extreme as before, but it doesn't look like n^3 either. This takes about 20 seconds to run:

import lark

SPECIFICATION = r"""  
  freddie_terry: "freddie_terry"
  terry_terry: "terry_terry"

  terry_range_low: terry_terry freddie_terry
  terry_range: freddie_terry terry_terry
             | terry_range_low

  freddie_danny: "freddie_danny"
  to_danny: "to_danny"
  danny_range: freddie_danny to_danny

  explicit_danny: "danny"

  danny_expression_low: (danny_range | explicit_danny)+
  
  danny_expression: explicit_danny+
                  | danny_range
                  | danny_expression_low

  terry_expression: freddie_terry
                  | terry_terry
                  | terry_range
                  | terry_pattern
                  | terry_pattern_two_range

  
  terry_pattern: freddie_terry terry_terry terry_terry

  terry_pattern_two_range: freddie_terry terry_terry freddie_terry terry_terry

  multiple_danny_ranges: danny_range danny_range danny_range+
  multiple_danny_ranges_expression: multiple_danny_ranges

  gh_medium: freddie_terry danny_expression
           | terry_expression multiple_danny_ranges_expression

  gh_low: danny_range
        | terry_range danny_expression

  gh_very_low: terry_terry danny_expression
             | explicit_danny
             | freddie_danny
             | to_danny
             | freddie_terry

  gh_dannys_only: explicit_danny explicit_danny
                | explicit_danny explicit_danny explicit_danny

  gh: danny_expression terry_expression
    | multiple_danny_ranges_expression terry_expression
    | gh_medium
    | gh_low
    | gh_very_low
    | gh_dannys_only

  start: gh*

  %import common.WS
  %ignore WS
"""

GRAMMAR = lark.Lark(
    SPECIFICATION, ambiguity='resolve__antiscore_sum')

s = 'freddie_terry terry_terry danny freddie_terry terry_terry danny freddie_terry terry_terry danny danny danny danny freddie_terry terry_terry danny freddie_terry terry_terry danny freddie_terry terry_terry danny freddie_terry terry_terry'

GRAMMAR.parse(s)

adding lexer="standard" has no noticeable effect.

erezsh avatar erezsh commented on July 30, 2024

I'm really happy you're putting my parser to the test. I'll dig into it.

Meanwhile, I should mention PyPy runs it in 1.5 seconds, about x15 faster.

erezsh avatar erezsh commented on July 30, 2024

I think I solved it!
This fix is a little trickier, so I want to run some more tests before I push it to master. Meanwhile, I pushed it to the branch earley_fix. So you can do:

git fetch
git checkout origin/earley_fix

And see if it works for you.

uriva avatar uriva commented on July 30, 2024

Does it have any effect on prioritization?
I might be seeing some effect, but not sure yet.

uriva avatar uriva commented on July 30, 2024

Actually scratch that - everything is working great.

uriva avatar uriva commented on July 30, 2024

:(
I think there is something wrong with prioritization. I have a situation where no matter how high I set the antiscore of a certain rule, the parser still chooses the derivation that uses it. Removing the rule results in a different derivation, which should have been chosen in the first place.

I'm not sure this was introduced by last change but let me know if it rings a bell.

uriva avatar uriva commented on July 30, 2024

btw it's a relatively minor issue. mostly it's working well.

erezsh avatar erezsh commented on July 30, 2024

Yeah, it actually failed one of the priority tests. I think I should be able to get it fixed.

erezsh avatar erezsh commented on July 30, 2024

Btw, the prioritization issues you're having, are you using lexer="standard", or the default lexer?

uriva avatar uriva commented on July 30, 2024

Default. But I'm not sure I should.

uriva avatar uriva commented on July 30, 2024

I did one trial run with the new branch and my PC got stuck.
Do all previous examples run fast?
I'll be able to check again in a few days.

uriva avatar uriva commented on July 30, 2024

Ok, still having the same problem: arbitrarily increasing the rule's weight has no effect, but when the rule is removed, the other parse is chosen.

Note that this is with the WCFG algorithm (antiscore).

uriva avatar uriva commented on July 30, 2024

Have a minimal example for you that a friend made:

import lark
GRAMMAR = r"""

  word.1000: WORD
  president: "obama" | "bush" | "clinton"
  token: president | word
  start: token*

  %import common.WORD
  %import common.WS
  %ignore WS
"""

for _ in xrange(10):
  parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
  print parser.parse('obama')
print '---------'
parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
for _ in xrange(10):
  print parser.parse('obama')
print '---------'
parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
for _ in xrange(10):
  print parser.parse('obama')

If you run this code you'll get different results on every block, indicating that there are missing ambiguities on some of the runs.

Hope this helps :)

erezsh avatar erezsh commented on July 30, 2024

Thanks for bringing this to my attention. I'll look into it.

uriva avatar uriva commented on July 30, 2024

FYI the bug is before ambiguity resolution (can be seen with 'explicit' as well)

uriva avatar uriva commented on July 30, 2024

Hi, any updates, by any chance?
(Asking because this bug has now become a bottleneck in our use of the library.)

erezsh avatar erezsh commented on July 30, 2024

Hi, I didn't get a chance to look at the "missing ambiguities" example you provided, but as far as I know I don't have an example of a performance bottleneck (all the previous examples run quickly now).

Debugging such things without a specific example takes a lot of time, and since this is just a free-to-use project I have to de-prioritize it. If it's important enough that you can pay for my time, we can work something out. Otherwise I'll be happy to help, but it's not at the top of my queue, especially without an example to reproduce it.

uriva avatar uriva commented on July 30, 2024

Just to clarify, this is not about performance, I'm talking about the latest code snippet that shows a correctness issue and how to reproduce it.

erezsh avatar erezsh commented on July 30, 2024

Ah, the use of the word bottleneck gave me the wrong impression. Anyway, I hope I'll have time to get to it this week.

erezsh avatar erezsh commented on July 30, 2024

Hi again, thanks for the example, and sorry for the misunderstanding.
I pushed to master a fix that should solve the problem. Let me know if it works.

uriva avatar uriva commented on July 30, 2024

Cool.

One thing I'm noticing is that I'm getting UnexpectedInput instead of ParseError on some occasions.
(not such a big deal I think).

Still checking other stuff.

erezsh avatar erezsh commented on July 30, 2024

That's to be expected. UnexpectedInput means the parser failed to match some part of the text, while ParseError can happen for many other reasons, for example when the structure is different than expected, or when there isn't enough input for a complete parse.

uriva avatar uriva commented on July 30, 2024

In that case, could you please add it to __init__.py:
from .lexer import UnexpectedInput
Thanks :)

uriva avatar uriva commented on July 30, 2024

btw - are all other fixes from the other branches on master as well?

erezsh avatar erezsh commented on July 30, 2024

As far as I'm aware, yes.

uriva avatar uriva commented on July 30, 2024

Including https://github.com/erezsh/lark/tree/earley_fix?
(asking because I'm still not getting all ambiguities when using master, but want to make sure it is really a new issue and not just one of the already fixed ones)

erezsh avatar erezsh commented on July 30, 2024

Yes, earley_fix has already been merged.

I'm aware that the parser currently doesn't create every possible derivation.
The parser doesn't recurse into rules more than necessary. So if there's a list or an explicit recursion that can return several results, the parser won't return all of them, just the first.
If this is what you're seeing and you want it to greedy-recurse, I'll see what I can do.
If you're referring to another behavior, then I'll need more details.
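(A toy illustration of that behavior, with a made-up grammar; under the starred rule the input below can be grouped in more than one way:)

import lark

grammar = r"""
  start: item*
  item: "x" | "x" "x"

  %import common.WS
  %ignore WS
"""

# With ambiguity='explicit' the result should contain an _ambig node holding both
# groupings of "x x"; the behavior described above returned only the first one.
parser = lark.Lark(grammar, ambiguity='explicit')
print(parser.parse("x x"))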

uriva avatar uriva commented on July 30, 2024

I see.
So this is what got broken:

import lark
GRAMMAR = r"""
  ft: "ft" 
  tt: "tt" 

  str: ft tt
  tira: str

  fd: "fd" 
  td: "td" 
  dr: fd td

  ad: "ad" 
  ed: "d" 

  de: ed+
    | dr
    | ad

  te: ft
    | tt
    | tira


  multiple_drs: dr dr dr+
  multiple_drs_expression: multiple_drs

  gh_medium.1: ft de
             | te multiple_drs_expression
             | ad

  gh_low.99999: dr

  gh_throwaway.10: ed

  gh: de te
    | gh_medium
    | gh_throwaway
    | multiple_drs_expression te
    | tira de
    | gh_low

  start: gh*

  %import common.NUMBER
  %import common.WS
  %ignore WS
"""

s = ' '.join(['ad', 'fd', 'td', 'ft', 'tt', 'ad', 'd', 'd', 'ft', 'tt'])

parser = lark.Lark(GRAMMAR, ambiguity='resolve__antiscore_sum')
print parser.parse(s)

# pre fix:
# Tree(start, [Tree(gh, [Tree(gh_medium, [Tree(ad, [])])]), Tree(gh, [Tree(de, [Tree(dr, [Tree(fd, []), Tree(td, [])])]), Tree(te, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])])])]), Tree(gh, [Tree(gh_medium, [Tree(ad, [])])]), Tree(gh, [Tree(de, [Tree(ed, []), Tree(ed, [])]), Tree(te, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])])])])])
# post fix:
# Tree(start, [Tree(gh, [Tree(gh_medium, [Tree(ad, [])])]), Tree(gh, [Tree(gh_low, [Tree(dr, [Tree(fd, []), Tree(td, [])])])]), Tree(gh, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])]), Tree(de, [Tree(ad, [])])]), Tree(gh, [Tree(de, [Tree(ed, []), Tree(ed, [])]), Tree(te, [Tree(tira, [Tree(str, [Tree(ft, []), Tree(tt, [])])])])])])

I need to have all ambiguities so resolve_ambig can really pick the lightest one with respect to weights.
What do you think?

erezsh avatar erezsh commented on July 30, 2024

Well, it's a little dense, but I'm guessing the gh* is the recursion that isn't yielding all the possible results.
I saw the same behavior in issue #21, so perhaps I can use it to debug it.
However, now I'm really curious what it's used for :)

uriva avatar uriva commented on July 30, 2024

(this is issue #21 so I guess you meant some other issue?)

Is there some kind of hack I can do to get all ambiguities?
As long as not all ambiguities are returned, the prioritization is somewhat non-functional.

As for your curiosity, I'll try to think of something... :)

uriva avatar uriva commented on July 30, 2024

This seems to work:
start: gh | gh gh | gh gh gh | gh gh gh gh | gh+

erezsh avatar erezsh commented on July 30, 2024

Yeah, I meant I saw it in the "president" example you provided.
So, sounds like I guessed right. I'll see what I can do.

erezsh avatar erezsh commented on July 30, 2024

Well, that wasn't too difficult! Check out master and use:

parser = lark.Lark(GRAMMAR, ambiguity='explicit', earley__predict_all=True) 

It should work fine with gh*.

uriva avatar uriva commented on July 30, 2024

looking good!
doing some more tests

uriva avatar uriva commented on July 30, 2024

Sadly, now there's the slowness issue again.
I'll see if I can make an example demonstrating unreasonable runtime (something above n^3).
Perhaps I should just implement CYK instead; would that be quicker for all ambiguities?

erezsh avatar erezsh commented on July 30, 2024

In theory, Earley should outperform CYK. But it's very possible that the slowness is due to my own implementation. I can't say anything smarter right now. I'll have to go back to the academic papers and review them thoroughly to see if I could do a better job.

If you think you can implement CYK, I suggest you go for it. If you get it working, I'll be happy to add it to Lark as another optional parser.

You can also try other Earley implementations (even in other languages). If they perform better for your use-case, I'll consider copying them into Lark.

uriva avatar uriva commented on July 30, 2024

Sounds good, thanks.

erezsh avatar erezsh commented on July 30, 2024

So, it seems to me that there are no more bugs, only slowness, so the issue is resolved.
Feel free to open a new one for optimizations (if you can provide examples), or reopen this one if you encounter a bug.

erezsh avatar erezsh commented on July 30, 2024

Hi, just want to let you know that Lark now has a CYK parser implementation.

uriva avatar uriva commented on July 30, 2024

Staggering coincidence ;)

erezsh avatar erezsh commented on July 30, 2024

I was wondering about that, but you know what they say about assuming :)
