What is your question? I'm interested in writing a

See <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id

Hm, I can't figure out how to proceed. Here is an example: <div class="snippet

From what I see <a class="commit-link" data-hovercard-type="commit" data-hovercard-url

Yeah, that does indeed look like a bug, maybe <a class="user-mention notranslate" data

This function should do what you want using the scan method: <div class="highlight

Is it possible to parse parts of the input? about lark HOT 12 OPEN

jjalowie commented on July 16, 2024

Is it possible to parse parts of the input?

from lark.

Comments (12)

MegaIng commented on July 16, 2024

See #1371

It's not exactly easy and especially for C-like syntax that is already tricky to parse with lark it might be quite annoying.

from lark.

jjalowie commented on July 16, 2024

Wow, thanks for a quick response.
I'm thinking about moving on with Earley using the start: /.*/ rule /.*/ approach (my inputs aren't very big so should be fine performance-wise).
I ran into the following problem though when experimenting around:

import lark
grammar = f"""
    start: /.*/ expr /.*/
    expr: "qwerty"
    %import common.WS
    %ignore WS
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')

ast = parser.parse("""asdf qwerty poiuy""")
print(ast.pretty())

This gives the following error:

[...]
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 215, in create_earley_parser
    return f(lexer_conf, parser_conf, resolve_ambiguity=resolve_ambiguity,
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 192, in create_earley_parser__dynamic
    earley_matcher = EarleyRegexpMatcher(lexer_conf)
  File "/home/jjalowie/.bin/mlir_parser/venv/lib/python3.10/site-packages/lark/parser_frontends.py", line 178, in __init__
    raise GrammarError("Dynamic Earley doesn't allow zero-width regexps", t)
lark.exceptions.GrammarError: ("Dynamic Earley doesn't allow zero-width regexps", TerminalDef('__ANON_0', '.*'))

I'm using lark 1.1.9.

from lark.

MegaIng commented on July 16, 2024

As it says in the error message, the terminal can't be zero width. Use /.+/? instead.

from lark.

jjalowie commented on July 16, 2024

Hm, I can't figure out how to proceed.
Here is an example:

import lark
grammar = f"""
    start: /.+/? expr+ /.+/?
    expr: "qwerty" | "asdf"
"""
parser = lark.Lark(grammar, start='start', parser='earley', lexer='dynamic_complete')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())
# output:
# start
#   expr
#    qwerty

I can't force asdf to be parsed as an expr because it gets parsed as /.+/? greedily. Any tips how to cope with that?

from lark.

MegaIng commented on July 16, 2024

Use ambiguity='explicit' and manually select the correct parse, see https://github.com/lark-parser/lark/blob/master/examples/advanced/dynamic_complete.py

Or choose the easier and quicker root and use the scan function I coded up in the other issue.

from lark.

jjalowie commented on July 16, 2024

From what I see MegaIng@4975608 doesn't expose a way to use transformers on the matched code. Could this be improved so I can also use transformers? What would be the needed steps?

For now I went with the Earley parser. I'm stuck on the below code. What should it behave like?

import lark

grammar = f"""
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

parser = lark.Lark(grammar, parser='earley', lexer='dynamic_complete', ambiguity='explicit')
ast = parser.parse("""asdf qwerty""")
print(ast.pretty())

It yields this output:

start
  expr
   qwerty

Shouldn't there be an ambiguity between /.*/? and expr visible in the AST this time because of ambiguity='explicit'? Isn't it a bug in lark?

from lark.

MegaIng commented on July 16, 2024

Yeah, that does indeed look like a bug, maybe @erezsh has an idea what is going on.

With regard to scan: You have to do a bit of extra work right now, but you get the starting and end address of the segments that match, which you can then replace in the original. This would make a good recipe to add the documentation when scan gets officially added.

from lark.

MegaIng commented on July 16, 2024

This function should do what you want using the scan method:

def scan_and_replace(parser: lark.Lark, text: str, replacement: Callable[[lark.ParseTree], str],
                     start: str = None) -> str:
    """
    Scans the `text` and replaces all matches of `parser` by the value returned by `replacement`
    given the corresponding tree.
    
    `start` is for passing in the start rule if required, not the starting position of the scanning.
    """
    last = 0
    res = ""
    for (start_pos, end_pos), tree in parser.scan(text, start=start):
        res += text[last:start_pos]
        res += replacement(tree)
        last = end_pos
    return res

from lark.

jjalowie commented on July 16, 2024

This function should do what you want using the scan method:
[...]

That looks very promising. I will pick up from there. Thanks a lot!

from lark.

erezsh commented on July 16, 2024

For now I went with the Earley parser. I'm stuck on the below code. What should it behave like?
import lark

grammar = f"""
    start: /.+/? expr+ /.+/?
    expr: "asdf" | "qwerty"
"""

It's not a bug, you have a space in your text, and only ANY can handle it.

When I change to this grammar:

    !start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"

    %ignore " "

I get -

_ambig
  start
    expr        asdf
    any  qwerty
  start
    expr        asdf
    any qwerty
  start
    expr        asdf
    expr        qwerty

(this is without dynamic_complete, which adds more derivations)

from lark.

MegaIng commented on July 16, 2024

@erezsh No, what actually fixed it is moving /.+/? into a separate rule. These two grammars should have the same behavior, but don't:

grammar = f"""
    start: any? expr+ any?
    any: /.+/
    !expr: "asdf" | "qwerty"
"""

grammar = f"""
    start: /.+/? expr+ /.+/?
    !expr: "asdf" | "qwerty"
"""

The later doesn't produce any ambiguities with dynamic_complete, the former does.

Similar for your grammar, with dynamic_complete it produces fewer derivations when any is inlined.

from lark.

erezsh commented on July 16, 2024

I tested @MegaIng 's example again on the latest master, and looks like this bug is fixed!

from lark.

Is it possible to parse parts of the input? about lark HOT 12 OPEN

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent