Comments (4)
I think it would be good to always error on immediate chaining of those operators as in e.g.
A*+
regardless of whether or not this can be interpreted in a reasonable non-infinite-loop-producing way.
These are currently blocked at the syntax level (aside: maybe we need a new error class like GrammarSyntaxError
), as they are not part of Ford's original PEG metasyntax. I'm not entirely opposed to allowing it if we allow (A+)?
, etc.
However, as soon as we deal with indirections across rules, I think all repetition operators should be allowed as long as the referenced rule can not match an empty string.
One part of pe's design is that non-terminals are not inherently different from inline forms. That is, the A1
and A2
productions below have exactly equivalent behavior:
A1 <- B C
B <- "b"
C <- "c"
A2 <- "b" "c"
This design is what enables pe to reason about when to optimize patterns. Semantic changes, such as using a ~
capture or attaching an action to a rule, introduce differences that block optimization.
Thus, if we allow
A <- B?
B <- "b"+
Then we must also allow
A <- ("b"+)?
This is also why the capture in (~"a"*)*
above causes the infinite loop. So maybe we need the ability to see across these semantic effects for detecting potential infinite loops.
However, as long as the referenced rule definitely matches at least one character/byte, applying a repetition operator to it makes sense.
Apologies if I misunderstood, but I want to be able to detect these things at compile time, not parse time. So I don't want to care if, at parse time, A*
matches more than zero characters so that we can allow a second *
to repeat over it. I want to detect at compile time that this pattern can potentially match an empty string infinitely.
Furthermore, I would like to point out that [...] is not equal to [...] nor to [...] this is due to auto-ignoring (of whitespace).
Yes, I did not bring up auto-ignore in my reply above because I wanted to keep the conversation focused, but I agree that we would need to consider its effects in the multiple repeat operator issue. It's a relatively new feature of pe and I'm still not as certain about its ramifications.
from pe.
Hmm, I think you're right, if we only relax the check for an outer ?
. The purpose of this error is to avoid infinite loops. For the three repeat operators *
, +
, and ?
, we have the following combinations:
Pattern | Effect in pe* | Allowed by re** |
---|---|---|
(X*)* |
infinite loop | a** ❌ |
(X*)+ |
infinite loop | a*+ ✔️ |
(X*)? |
redundant | a*? ✔️ |
(X+)* |
redundant | a+* ❌ |
(X+)+ |
redundant | a++ ✔️ |
(X+)? |
equivalent to X* |
a+? ✔️ |
(X?)* |
infinite loop | a?* ❌ |
(X?)+ |
infinite loop | a?+ ✔️ |
(X?)? |
redundant | a?? ✔️ |
* You might not see the infinite loop in pe because the packrat parser has a simple guard for loops that do not advance the position, and, secondly, optimized grammars might turn the pattern into a single regex which does not have the infinite loop.
** The re module gives an error for a**
, a+*
, and a?*
, but you can avoid the error with a group, like (a*)*
or (?:a*)*
, and using it to match does not result in an infinite loop.
In spite of protections against it, there are some ways to induce the infinite loop. For instance, if you use a capture (~
) to prevent the grammar error and then use the machine parser, it will get stuck:
import pe
pe.match('(~"a"*)*', "b", parser="machine") # DANGER: infinite loop
I'm prepared to allow (X+)?
, because even if the effect is the same as X*
, it seems natural to write grammars like this:
Something <- Token?
Token <- Char+
For the "redundant" cases above, it should be safe to allow them, perhaps with a GrammarWarning
instead of a GrammarError
. For now I'd rather continue raising a GrammarError
on the infinite loop cases.
(Aside: while investigating this issue, I came across another bug. See #38)
from pe.
I think it would be good to always error on immediate chaining of those operators as in e.g. A*+
regardless of whether or not this can be interpreted in a reasonable non-infinite-loop-producing way.
However, as soon as we deal with indirections across rules, I think all repetition operators should be allowed as long as the referenced rule can not match an empty string. If it can, all of those operators either don't make sense to begin with (?
) or can create an infinite loop (*
and +
).
However, as long as the referenced rule definitely matches at least one character/byte, applying a repetition operator to it makes sense.
The fact that some of this chaining is redundant and can be rewritten with only a single operator should be an implementation detail of pe
when it optimizes the grammar/parser.
However, from a user's POV it's quite convenient to write a rule matching a specific input and then reuse that rule elsewhere with different repetition operators. Just think of e.g. an ID rule such as
ID <- [a-zA-Z_0-9]+
which can then be used as e.g.
name <- ID ID? # first name and optional last name
name_list <- ID+ # Multiple names after another
Furthermore, I would like to point out that e.g.
A < B+
B <- [a-z]+
is not equal to
A < [a-z]+
nor to
A <- [a-z]+
this is due to auto-ignoring (of whitespace).
from pe.
I want to detect at compile time that this pattern can potentially match an empty string infinitely.
Indeed that is what I meant. If the formal specification of the rule allows it to potentially match an empty string, don't allow repetition operators on it.
from pe.
Related Issues (20)
- Remove packrat parser
- Captured choices not working with Cython machine parser
- Character classes in machine parser fail at odd times
- Inefficiencies in regex optimization
- More "common" optimizations HOT 1
- Separate error type for failing to parse a grammar
- Change implicit optional types to explicit HOT 1
- Update Python versions HOT 1
- Make implicit Optionals into explicit for current Mypy HOT 1
- Add common patterns in code
- Bug: Newlines make the debug output difficult to read. HOT 1
- Add python version to publishing action
- Update to Cython 3.0
- Lint with ruff instead of flake8
- Difference with machine/packrat parsers and captures HOT 1
- (Helpful) Error messages HOT 3
- Common optimization misbehaving on character classes HOT 1
- "Sidecar" objects for accumulative parsing
- Bounded repetitions
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pe.