
bnf's Introduction

bnf


A library for parsing Backus–Naur form context-free grammars.

What does a parsable BNF grammar look like?

The following grammar from the Wikipedia page on Backus–Naur form exemplifies a compatible grammar. (Note: the parser allows an optional ';' to indicate the end of a production.)

 <postal-address> ::= <name-part> <street-address> <zip-part>

      <name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
                    | <personal-part> <name-part>

  <personal-part> ::= <initial> "." | <first-name>

 <street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>

       <zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>

<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
    <opt-apt-num> ::= <apt-num> | ""

Output

Take the following grammar for DNA sequences as input to this library's parse function.

<dna> ::= <base> | <base> <dna>
<base> ::= "A" | "C" | "G" | "T"

The output is a Grammar object representing a tree that looks like this:

Grammar {
    productions: [
        Production {
            lhs: Nonterminal(
                "dna"
            ),
            rhs: [
                Expression {
                    terms: [
                        Nonterminal(
                            "base"
                        )
                    ]
                },
                Expression {
                    terms: [
                        Nonterminal(
                            "base"
                        ),
                        Nonterminal(
                            "dna"
                        )
                    ]
                }
            ]
        },
        Production {
            lhs: Nonterminal(
                "base"
            ),
            rhs: [
                Expression {
                    terms: [
                        Terminal(
                            "A"
                        )
                    ]
                },
                Expression {
                    terms: [
                        Terminal(
                            "C"
                        )
                    ]
                },
                Expression {
                    terms: [
                        Terminal(
                            "G"
                        )
                    ]
                },
                Expression {
                    terms: [
                        Terminal(
                            "T"
                        )
                    ]
                }
            ]
        }
    ]
}

Once the Grammar object is populated, you can generate a random sentence from it by calling the object's generate function, i.e. grammar.generate(). For the above grammar you could expect something like TGGC or AG.

If the generate function can't find a production for a nonterminal it tries to evaluate, it will print the identifier as a nonterminal, i.e. <identifier>.

The generate function will return an error if it detects an infinite loop caused by a production such as <PATTERN> ::= <PATTERN>.
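
For example, a minimal sketch of that error case (the exact error type and message depend on the bnf version):

use bnf::Grammar;

// <PATTERN> can never reach a terminal state, so generation must fail
let grammar: Grammar = "<PATTERN> ::= <PATTERN>".parse().unwrap();
assert!(grammar.generate().is_err());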

Parse Example

use bnf::Grammar;

let input =
    "<postal-address> ::= <name-part> <street-address> <zip-part>

        <name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
                        | <personal-part> <name-part>

    <personal-part> ::= <initial> '.' | <first-name>

    <street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>

        <zip-part> ::= <town-name> ',' <state-code> <ZIP-code> <EOL>

    <opt-suffix-part> ::= 'Sr.' | 'Jr.' | <roman-numeral> | ''
        <opt-apt-num> ::= <apt-num> | ''";

let grammar: Result<Grammar, _> = input.parse();
match grammar {
    Ok(g) => println!("{:#?}", g),
    Err(e) => println!("Failed to make grammar from String: {}", e),
}

Generate Example

use bnf::Grammar;

let input =
    "<dna> ::= <base> | <base> <dna>
    <base> ::= 'A' | 'C' | 'G' | 'T'";
let grammar: Grammar = input.parse().unwrap();
let sentence = grammar.generate();
match sentence {
    Ok(s) => println!("random sentence: {}", s),
    Err(e) => println!("something went wrong: {}!", e)
}

Parse Sentence via Grammar

use bnf::Grammar;

let input =
    "<dna> ::= <base> | <base> <dna>
    <base> ::= 'A' | 'C' | 'G' | 'T'";
let grammar: Grammar = input.parse().unwrap();

let sentence = "GATTACA";

let mut parse_trees = grammar.parse_input(sentence);
match parse_trees.next() {
    Some(parse_tree) => println!("{}", parse_tree),
    _ => println!("Grammar could not parse sentence"),
}

bnf's People

Contributors

amascolo, androm3da, axiomatic-mind, crockagile, dependabot[bot], drunkjon, jdonszelmann, lionel-faber, nvzqz, shnewto, skalt, sremedios, z2oh


bnf's Issues

How to use the bnf lib in sql parse function

I want to write a custom VQL parser using nom, but I found that it doesn't provide a way to define a grammar.

Can I use the bnf lib to define a VQL/SQL grammar, like an antlr4 .g4 file?

I can't find an example of this. I think the process would be:

  1. Define a VQL/SQL grammar file, like vql.bnf.
  2. Use the bnf lib to parse vql.bnf into a parse tree.
  3. Traverse the parse tree using Grammar::traverse or earley::Traversal and output an AST?
  4. Walk the AST to get the VQL/SQL statement definitions.
  5. The stmt module outputs the VQL/SQL logical plan.

I don't know whether this process is right or not. Could anyone show me the right way?

Thanks.

Operator Traits

core::ops allows implementing custom logic for operators. Would implementing these for Grammar/Production/Expression/Term enable any cleaner syntax or usage? If so, which of these operators would be included?

  • ops::Add (+)
    • Would this add same types (Grammar + Grammar -> Grammar)?
    • Would this add a component (Grammar + Production -> Grammar)?
    • Would this add components (Production + Production -> Grammar)?
    • Would this do all the above?
    • If this is implemented, what is the behavior of subtraction?
  • ops::AddAssign (+=)
    • Same question as Add
  • ops::BitOr ( | )
    • If Add is used only for components, could Or be used as a kind of union operator?
    • Should this only be implemented for Term | Term and Term | Expression to allow expr = Term | Term | Term?
  • ops::BitXor ( ^ )
    • If BitOr is used as a union operator, should BitXor be used for set difference?

These were the operators I identified as possible enhancements, but feel free to add more to the discussion that may be useful.

Also, as a possible alternative some of these use cases could be handled by macros to allow for special case syntax:

expression![ term1 | term2 | term3 ] -> Expression::from_parts(vec![ term1, term2, term3 ])

But that would require the user importing BNF's macros.
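
For comparison, here is a rough sketch of what the BitOr idea could look like from outside the crate. The newtype wrapper T is purely illustrative (the orphan rule prevents implementing ops::BitOr directly on bnf's Term), and it follows the macro's semantics of collecting alternated terms into one Expression:

use bnf::{Expression, Term};
use std::ops;

// hypothetical wrapper so a foreign trait can be implemented over bnf's Term
struct T(Term);

impl ops::BitOr for T {
    type Output = Expression;
    fn bitor(self, rhs: T) -> Expression {
        Expression::from_parts(vec![self.0, rhs.0])
    }
}

impl ops::BitOr<T> for Expression {
    type Output = Expression;
    fn bitor(mut self, rhs: T) -> Expression {
        self.add_term(rhs.0); // Expression::add_term is existing bnf API
        self
    }
}

// usage: expr = term1 | term2 | term3
fn demo() -> Expression {
    let (t1, t2) = (Term::Terminal("a".into()), Term::Terminal("b".into()));
    let t3 = Term::Nonterminal("c".into());
    T(t1) | T(t2) | T(t3)
}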

compare Criterion & Divan

A new benchmarking framework is on the scene! Divan looks pretty great, so the work for this issue is:

  • port existing benchmarks to Divan
  • compare the code of Criterion and Divan benchmarks
  • compare the readability and clarity of the benchmark outputs

make production ending semicolon optional

Plenty of BNF examples, including the Wikipedia example, do not require productions to terminate with a semicolon, but rather only require a newline. I believe the parser could be updated to make the ending semicolon optional, which would support both formats.

Include README.md as documentation in lib.rs

There is a lot of duplication in the documentation between README.md and lib.rs

I can see two problems with this:

  • the examples in README.md don't actually get run, so they're easy to accidentally break!
  • the two "top level docs" may drift (they already have), including some info in one but not the other

Maybe these problems don't outweigh some of the benefits? I am not sure! But usually reducing duplication is worth a discussion.

The documentation could be simplified down into one file via:

#![doc = include_str!("README.md")]

This would give both README.md and lib.rs the same content. It would also run the README.md code snippets as tests!

Current Differences

only README.md has:

  • github badges
    • do these render in rustdoc?

only lib.rs has:

  • a nod to prettybnf
    • I believe it would be okay to include this in both rustdoc and github
  • a link to this github repo "The code is available on Github"
    • I believe it would be okay to REMOVE this line, because rustdoc already includes a repo link

So considering the above, the only question I wanted to answer was how the badges look in rustdoc:

(screenshot: badges rendered in rustdoc)

They seem fine!

So what do you think @shnewto ?

  • should the prettybnf reference be included in both?
  • any other concerns?

Right-recursive productions Fail to Match

Looks related to nullable productions: #108 #112 #117

This grammar fails to match when we change a production from left-recursive to right-recursive.

The left-recursive grammar failed on 0.4.0 ≤ bnf ≤ 0.4.1; the right-recursive grammar still fails on 0.4.0 ≤ bnf ≤ 0.4.3.

So this issue must have been partially addressed by @CrockAgile in #109 (0.4.2)

Happy to help battle-testing any proposed fixes ⚔️

use bnf::Grammar;

#[test]
fn test() {
    let input = "aa a";

    let left_recursive: &str = "
    <conjunction> ::= <conjunction> <ws> <predicate> | <predicate>
    <predicate> ::= <string_null_one> | <special-string> '.'

    <string_null_one> ::= <string_null_two>
    <string_null_two> ::= <string_null_three>
    <string_null_three> ::= <string>

    <string> ::= <char_null> | <string> <char_null>
    <special-string> ::= <special-char> | <special-string> <special-char>

    <char_null> ::= <char>
    <char> ::= 'a'

    <special-char> ::= <char_null> | <whitespace>
    <whitespace> ::= ' '

    <ws> ::= ' ' | ' ' <ws>
    ";
    assert!(left_recursive.parse::<Grammar>().unwrap().parse_input(input).next().is_some());

    let right_recursive = left_recursive.replace(
        // rewrite production from left- to right- recursive
        "<string> ::= <char_null> | <string> <char_null>",
        "<string> ::= <char_null> | <char_null> <string>",
    );
    assert!(
        right_recursive.parse::<Grammar>().unwrap().parse_input(input).next().is_some(),
        "right-recursive grammar failed to parse: {input}"
    );
}

Coverage report includes test LOC

Right now we're using a tool called cargo-travis that provides a convenience wrapper to kcov for generating and publishing coverage reports to Coveralls. It does not expose the --exclude-pattern or --exclude-file options that would allow us to keep lines of test code out of the report.

It would be nice to manage building/caching/running kcov without the cargo-travis wrapper in order to get a more accurate representation of our current test coverage.

EBNF? ABNF?

Thinking about future features that could add value somewhere down the road. Initial ideas on design might be something like the following?

extern crate bnf;
// ...
let ebnf_grammar = bnf::ebnf::parse(ebnf_input);
let abnf_grammar = bnf::abnf::parse(abnf_input);

Alternatively we might be able to just make bnf::parse smart enough to intuit the variant used.

Rethink Grammar::generate logic

I'd like to be smarter about how we generate random sentences from a given grammar.

There's a danger that a valid grammar can result in an infinite loop when generating randomly, so at the moment we just take a guess at whether we're looping in a way that doesn't look like it will stop any time soon, by checking the stack and bailing before we recurse too far.

For instance the case used in the tests is <nonterm> ::= <nonterm>. We can't get to a terminal state so we're doomed to failure if we try to generate a sentence.

I'm not sure what an ideal solution is here; for one, it'd be nice to figure out something that didn't have to short-circuit the potential for "maybe infinite loops".

Maybe there's a way to check for impossible grammars for the generate logic first and error early if that's the case. Then... if we have the potential for a terminal state but aren't reaching it because of random chance we need to do something sensible... maybe allow a configurable max generated length and do the work to find some terminal nodes when we get there?

If we can manage a max length, I like the idea of implementing a minimum length too 🤔
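
For example, the "check for impossible grammars first" idea could be a fixpoint pass over the grammar. A hedged sketch, assuming bnf's productions_iter / rhs_iter / terms_iter accessors and Production's public lhs field (this is not the crate's actual generate logic):

use bnf::{Grammar, Term};
use std::collections::HashSet;

// find every nonterminal that can derive a purely terminal string;
// anything outside this set (e.g. <nonterm> ::= <nonterm>) can never
// terminate, so generate could error early instead of guessing at runtime
fn productive_nonterminals(grammar: &Grammar) -> HashSet<String> {
    let mut productive = HashSet::new();
    let mut changed = true;
    while changed {
        changed = false;
        for prod in grammar.productions_iter() {
            let lhs = match &prod.lhs {
                Term::Nonterminal(name) => name.clone(),
                Term::Terminal(_) => continue,
            };
            if productive.contains(&lhs) {
                continue;
            }
            // productive if any expression consists solely of terminals
            // or already-known-productive nonterminals
            let any_productive = prod.rhs_iter().any(|expr| {
                expr.terms_iter().all(|term| match term {
                    Term::Terminal(_) => true,
                    Term::Nonterminal(name) => productive.contains(name),
                })
            });
            if any_productive {
                productive.insert(lhs);
                changed = true;
            }
        }
    }
    productive
}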

reusable grammar parser

Is your feature request related to a problem? Please describe.

The existing Grammar::parse_input method is useful for parsing input text according to a Grammar. But the current API builds "one use" parsers. This limitation forces some redundant work to be repeated when parsing multiple input texts.

// each of these builds completely new `GrammarMatching` and other parsing tables,
// which are then freed before the next `parse_input`, wasting computation
grammar.parse_input(input_1);
grammar.parse_input(input_2);
grammar.parse_input(input_3);

Describe the solution you'd like

Instead, a single GrammarParser could be available which would reuse parser data efficiently:

let parser = grammar.input_parser();

// common immutable parser data is shared between calls
parser.parse_input(input_1);
parser.parse_input(input_2);
parser.parse_input(input_3);

Describe alternatives you've considered

Maybe this just doesn't matter! I am not sure if this new API would even make much of a performance difference. I am not sure "how much faster" would warrant a new API method. 1% faster? 5% faster? 50% faster?

Nonterminal identifiers disallow whitespace in their name

At the moment, identifying a nonterminal rule silently disallows whitespace, i.e. my name in <my name> will parse as a nonterminal identified as my. This is certainly not ideal. I do wonder whether it's reasonable to disallow whitespace by raising an error in cases like the above or if we should just handle and allow it. Probably we need to review the standard for BNF in this case and in general to ensure compliance.

faster production matching while earley parsing

What

while working on nullable rules (#113), profiling showed a potential for performance improvement.

ProductionMatching owns matching productions with terms. Currently, it is implemented as:

pub(crate) struct ProductionMatching<'gram> {
    pub prod_id: ProductionId,
    pub lhs: &'gram crate::Term,
    /// "right hand side" [`TermMatching`]s which are partitioned by the matched and unmatched.
    /// For example: [Matched, Matched, Matched, Unmatched, Unmatched]
    rhs: Vec<TermMatching<'gram>>,
    /// The progress cursor used to separate [`TermMatching`]s in the "right hand side"
    matched_count: usize,
}

The rhs vector requires a fair amount of dynamic memory alloc and free while parsing (current location where vector is cloned).

There may be a better way to model matching terms.

Brainstorm

If you pick this up, do not take this brainstorm as "the only way". This is just one of many possibilities.

Trie

Instead of cloning the matching vector, matching could be modeled as a trie/stemming tree, with each node a Term.

e.g. <word> ::= 'B' 'A' 'D' | 'B' 'A' 'T'

graph TD
    S[start matching prod] --> B
    B --> A
    A --> T
    A --> D

Then each ProductionMatching references a single node in this matching tree.
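
A minimal sketch of that shape (names are hypothetical):

// each node owns one matched Term; children are the possible next terms,
// so shared prefixes like 'B' 'A' are stored once instead of being cloned
// into every in-progress match
struct TermTrieNode<'gram> {
    term: &'gram bnf::Term,
    children: Vec<TermTrieNode<'gram>>,
}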

Results

To check results, the criterion benchmarking can be used:

cargo criterion

Tracing can be used to investigate:

CARGO_PROFILE_BENCH_DEBUG=true cargo flamegraph --bench bnf -- --bench

2 Questions about Parsing

  1. Why does Grammar::parse_input have a return type of impl Iterator<Item = ParseTree> and not of Option<ParseTree> or Result<ParseTree, Error>? Is there any circumstance in which Grammar::parse_input could parse to multiple ParseTrees?

  2. I know that I can access the data in a ParseTree by calling ParseTree::rhs_iter to get something of type ParseTreeIter, which implements Iterator<Item = &ParseTreeNode>. I can then iterate through this to get items of type &ParseTreeNode, but how can I get the data out from this? src/lib.rs only publicly exports Grammar and ParseTree from src/grammar.rs, and I cannot construct the variants of the externally private ParseTreeNode necessary to match ParseTreeNodes and access the data stored within.
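
On the second question: recent bnf releases do export ParseTreeNode with public Terminal and Nonterminal variants, so the data can be reached by matching on them. A sketch under that assumption:

use bnf::{ParseTree, ParseTreeNode};

// recursively collect the terminal text spanned by a parse tree
fn terminals<'gram>(tree: &ParseTree<'gram>, out: &mut Vec<&'gram str>) {
    for node in tree.rhs_iter() {
        match node {
            ParseTreeNode::Terminal(t) => out.push(*t),
            ParseTreeNode::Nonterminal(subtree) => terminals(subtree, out),
        }
    }
}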

Ability to parse BNF grammars at compile time

Is your feature request related to a problem? Please describe.
There's no reason to include an entire BNF parser in the compiled application if I just have a set grammar that I want to use that is constant. However, this is what happens if I use the bnf crate in my application - it must parse and validate at runtime.

Describe the solution you'd like
Allow BNF syntax to be parsed at compile-time. Make the required functions const (FromStr::from_str is not const)

Describe alternatives you've considered
N/A

Additional context
N/A

Step by step example

Hello,

First of all wanted to say thank you for your hard work on this project.

What do you think about creating a step-by-step example using a simple grammar? It seems to me this would help show the usual workflow and may present general concepts better.

FYI: There is a template to make a good README.md https://gist.github.com/PurpleBooth/109311bb0361f32d87a2

Iterator::size_hint

Iterator::size_hint helps consumers by providing some hint of the iterator's size. This benefits patterns like collecting an iterator into a vec (iter.collect::<Vec<_>>()) because the collection can reserve the correct space, without requiring re-allocations.

BNF's iterators currently hold slices internally, so the size_hint should be quite simple to implement (iterator impl search link)
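
A minimal sketch of the pattern with stand-in types (not bnf's actual iterator code): a slice-backed iterator knows its exact remaining length, so it can return tight bounds.

struct Iter<'a, T> {
    slice: &'a [T],
}

impl<'a, T> Iterator for Iter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        let (first, rest) = self.slice.split_first()?;
        self.slice = rest;
        Some(first)
    }

    // exact lower and upper bounds, so collect() can reserve once
    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.slice.len(), Some(self.slice.len()))
    }
}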

There are 3 (or 4 once #92 is merged) Iterators to update:

  • expression.rs (Iter & IterMut)
  • production.rs (Iter & IterMut)
  • grammar.rs (Iter & IterMut)

Order of Rules Affects Parsing

Describe the bug
A BNF grammar should be independent of the order of its rules as long as it is otherwise well-formed.

To Reproduce
Take the dna_left_recursive test and reverse the order of the rules. The BNF parser can no longer successfully parse the input string.

#[test]
fn dna_left_recursive() {
    let grammar: Grammar = "<base> ::= 'A' | 'C' | 'G' | 'T'
        <dna> ::= <base> | <dna> <base>"
        .parse()
        .unwrap();

    let input = "GATTACA";

    let parses: Vec<_> = grammar.parse_input(input).map(|a| a.to_string()).collect();
    assert_snapshot!(parses.join("\n"));
}

lalrpop exploration

I just read through the lalrpop documentation and it seems worth investigating the readability & performance comparison to the current nom usage. This exploration task could be done on its own branch and never merged, and it would still serve its purpose. But if it stumbles into some improvements, then they could be up for consideration.

Allow comments in grammars

We need a way to allow incorporating comments into a grammar. It seems somewhat standard to use the ; to indicate the start of a comment and a \n to close it. I think it'd be a good addition to make that work.

A couple initial questions:

  • Do we look for the ; all along the way?
  • Do we make a "first pass" that strips all comments before parsing?

self-hosting

If #33 is addressed and bnf becomes a real parser generator, it could also become self-hosting, which would be pretty fun.

generate random sequence from grammar

As was discussed in #7, generating a random sequence from a BNF grammar is a good feature. erratic.js serves as a good example of the behavior to implement. My only question is:

  • should the generated sequence be returned as a Vec<Terminals> or as a String?

My assumptions for this feature are:

  • pseudo random is sufficient
  • allowing users to tweak the probabilities per node is out of scope

Nullable productions (Still, Still) Fail to Match

Related to #108 and #112; failures reproduce on 0.4.0 ≤ bnf ≤ 0.4.3, so this is not a regression from #113.

Hit more cases where nullable productions aren't handled properly.

This example looks contrived but was adapted from a larger grammar I'm writing.

As usual, thanks for all your work @CrockAgile !

Sadly, this correctness issue continues to be a total blocker preventing me from adopting bnf

Hoping my battle-testing hasn't been in vain and there's an easy fix around the corner 🙏🏻

use bnf::Grammar;

#[test]
fn test() {
    let fails: &str = "
    <disjunction> ::= <predicate> | <disjunction> <or> <predicate>
    <predicate> ::= <char_null_one> | <special-string> '.'

    <char_null_one> ::= <char_null_two>
    <char_null_two> ::= <char_null_three>
    <char_null_three> ::= <char>

    <or> ::= <ws> 'or' <ws>
    <ws> ::= <whitespace> | ' ' <ws>
    <whitespace> ::= ' '

    <special-string> ::= <special-char> | <special-char> <special-string>
    <special-char> ::= <char> | <whitespace>
    <char> ::= 'a'
    ";

    let input = "a or a";

    let passes_1 = fails.replace(
        // skip nullable production <char_null_two>
        "<char_null_one> ::= <char_null_two>",
        "<char_null_one> ::= <char_null_three>",
    );
    assert!(passes_1.parse::<Grammar>().unwrap().parse_input(input).next().is_some());

    let passes_2 = fails.replace(
        // replace <whitespace> with its terminal ' '
        "<ws> ::= <whitespace> | ' ' <ws>",
        "<ws> ::= ' ' | ' ' <ws>",
    );
    assert!(passes_2.parse::<Grammar>().unwrap().parse_input(input).next().is_some());

    let passes_3 = fails.replace(
        // again, replace <whitespace> with its terminal ' '
        "<special-char> ::= <char> | <whitespace>",
        "<special-char> ::= <char> | ' '",
    );
    assert!(passes_3.parse::<Grammar>().unwrap().parse_input(input).next().is_some());

    assert!(
        fails.parse::<Grammar>().unwrap().parse_input(input).next().is_some(),
        "all grammars except last one parsed: {input}"
    );
}

Building docs on nightly issues warnings

Currently cargo doc on nightly Rust issues a series of warnings stemming from Rust's issue 44229. It'd be nice to pinpoint the specifics that are causing the warnings and fix them before they are warnings on stable Rust.

The errors start with the following report:

WARNING: documentation for this crate may be rendered differently using the new Pulldown renderer.                                                                                                                                                                             
    See https://github.com/rust-lang/rust/issues/44229 for details.

report the faulty character / line / column / offset when no parse trees are generated

Is your feature request related to a problem? Please describe.
I'm trying to catch syntax errors in a custom plaintext file format, and when the parse fails, I simply receive no parse trees, with no information about where it failed.

Describe the solution you'd like
I'd like to see a line/column or character index, or something along those lines, that I can use to log to the user where the parsing has failed.

Broken coverage report

Looks like it's an issue with the grcov action:
actions-rs/grcov#142

Making an issue so that if it takes very long to resolve on the grcov side, we can track progress on however we decide to address it.

change public vectors to iterators

The types Expression, Production, and Grammar all publicly expose internal vectors. I assume the intention was to allow users to traverse the nodes in the grammar. However, exposing the internal vector locks the project to supporting the vector type interface forever onward. Imagine in the future that the data structures were changed to trees, arrays, or anything besides vector. This would cause a breaking change in the public interface for users.

Instead, if these types only exposed iterators, they would still support tree traversal, but changing the internal implementation would not be a breaking change.
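
A minimal sketch of the proposed shape, with stand-in types rather than bnf's real ones:

struct Production;

struct Grammar {
    // private: the Vec is no longer part of the public API
    productions: Vec<Production>,
}

impl Grammar {
    // callers can still traverse every production, but can no longer
    // depend on the backing storage being a Vec
    fn productions_iter(&self) -> impl Iterator<Item = &Production> {
        self.productions.iter()
    }
}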

Empty String Rules Fail to Match

When I match a grammar such as

<main> ::= "foo:" <bar>
 <bar> ::= "bar"

against foo:bar, it works. When I match a grammar such as

<main> ::= "foo:" ""

against foo:, it works. However, when I match

<main> ::= "foo:" <baz>
 <baz> ::= ""

against foo:, it doesn't work.

Replicable using this code:

fn main() {
	let grammar1: bnf::Grammar = "
		<main> ::= \"foo:\" <bar>
		 <bar> ::= \"bar\"
	".parse().unwrap();

	let grammar2: bnf::Grammar = "
		<main> ::= \"foo:\" \"\"
	".parse().unwrap();

	let grammar3: bnf::Grammar = "
		<main> ::= \"foo:\" <baz>
		 <baz> ::= \"\"
	".parse().unwrap();

	assert!(grammar1.parse_input("foo:bar").next().is_some(), "grammar1 failed!");
	assert!(grammar2.parse_input("foo:"   ).next().is_some(), "grammar2 failed!");
	assert!(grammar3.parse_input("foo:"   ).next().is_some(), "grammar3 failed!");
}

Why isn't a parse tree like this being generated?

ParseTree {
	lhs: Term::Nonterminal("main"),
	rhs: [
		ParseTreeNode::Terminal("foo:"),
		ParseTreeNode::Nonterminal(ParseTree {
			lhs: Term::Nonterminal("baz"),
			rhs: [
				ParseTreeNode::Terminal(""),
			],
		}),
	],
}

Is this intentional/explicable behavior, or a bug? Thanks!

suggestion: extract bnf test-cases to their own .bnf files?

What would you think of extracting the test-cases to separate files and then re-including them using std::include_str!("path/to/file")?

Potential benefits might include

  • increased ease of reuse of test cases in this and other projects
  • no need for string-in-string escaping or extra indentation to make the strings look right

This'd come at the cost of separating the text of the test-cases from the expected result.
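
A hedged sketch of what that could look like, assuming a test file under tests/ and the fixture path tests/fixtures/bnf.bnf used elsewhere in this tracker (include_str! resolves relative to the file containing the macro call):

#[test]
fn fixture_parses() {
    // no string-in-string escaping, no extra indentation
    let input = include_str!("fixtures/bnf.bnf");
    assert!(input.parse::<bnf::Grammar>().is_ok());
}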

Parse could return Result

Parsing a grammar can fail. Currently, if parsing fails, an empty grammar is returned and errors are printed to stdout. Users would likely be interested in detecting whether parsing failed, but capturing stdout or stderr is not ideal. The code currently required to detect an empty grammar would look something like:

let grammar = bnf::parse(input);
// errors printed to stdout
if grammar.productions.is_empty() {
    println!("empty grammar");
} else {
    println!("non-empty grammar");
}

Instead of printing error strings to stdout or stderr, they could be included in the Result<Grammar,Error> type, allowing for more traditional error handling:

let grammar_result = bnf::parse(input);

match grammar_result {
    Ok(grammar) => println!("working grammar: {}", grammar),
    Err(e) => eprintln!("grammar parsing failed: {}", e),
}

If this change were to be made, would reporters still be involved? As a library it seems strange to output to stdout or stderr instead of output being handled in the code. But as a command line tool, error reporting in terminal is definitely valuable. Maybe BNF should include an optional CLI binary intended for terminal use? I forget if cargo allows for dual binary & library crates.

Empty String Rules (Still) Fail to Match

Describe the bug
Empty string rules still fail to match in some situations. This is the same as #108; after its fix it no longer happens in all situations, but it still happens.

To Reproduce

  1. Load grammar from tests/fixtures/bnf.bnf

  2. Try to parse the first line of the text:
    "<syntax> ::= <rule> | <rule> <syntax>\n"
    This match will fail.

  3. Try to parse the same line with whitespace before the newline
    "<syntax> ::= <rule> | <rule> <syntax> \n"
    This will parse properly

The first example should pass given <line-end> may begin with <opt-whitespace> which has an empty string rule

<line-end>       ::= <opt-whitespace> <EOL>
<opt-whitespace> ::= "" | " " <opt-whitespace>
<EOL>            ::= "
"

Unfortunately I haven't been able to find a smaller example. Weirdly it does parse "<syntax> ::= <rule>" correctly but not the more complicated example above.

If there were support for special characters like \n or a built-in definition for <EOL>, it would be good to have a test that a grammar made from tests/fixtures/bnf.bnf can parse the file itself.

Desktop (please complete the following information):

  • MacOS Ventura 13.0.1

Allow partial parsing from string

While using this crate, terms, expressions, productions, and grammars are frequently built from their raw type parts:

let t1: Term = Term::Terminal(String::from("terminal"));
let nt1: Term = Term::Nonterminal(String::from("nonterminal"));
let e1: Expression = Expression::from_parts(vec![nt1, t1]);

Productions and Grammars require even more work. It would be convenient and powerful if instead the types could be constructed using the BNF syntax:

let term = Term::from_str("<nonterminal>").unwrap();
let expression = Expression::from_str("<nonterminal> \"terminal\"").unwrap();
let production = Production::from_str("<nonterminal> ::= <nonterminal> \"terminal\" | \"terminal\"").unwrap();

This could be achieved by implementing the FromStr trait for each type and leveraging existing parsers. Adding this to the Grammar as well would remove the need for the bnf::parse function.
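
A sketch of that wiring for one type, reusing the parsers::expression_complete helper visible in the clippy log further down; this is crate-internal code, not something a user of bnf would write:

impl std::str::FromStr for Expression {
    type Err = Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match parsers::expression_complete(s.as_bytes()) {
            Ok((_, o)) => Ok(o),
            Err(e) => Err(Error::from(e)),
        }
    }
}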

There is however a difficult wrinkle to this problem:

  • How are terminals and nonterminals differentiated in string format?

A grammar can differentiate between the two by inspecting the full set of productions, but terms in productions and expressions are essentially ambiguous. Maybe some solution brainstorming will help start the discussion:

  • All terms are terminals until incorporated into a grammar or production that says otherwise
  • Add a third type of term which represents some "undefined" state

Hopefully @Snewt will have some perspective to resolve this block and enable this convenient functionality 🤞

Manually update dependabot's recommendations

Dependabot has a couple PRs open to update some of this project's dependencies but it looks like updating is going to take a bit of manual work, i.e. updating the symbols that changed / went away.

Add a possibility to parse the text and view how it was parsed

When I was at university there was a program for learning grammars in the programming languages theory course. It was very useful for learners to see how code was parsed into the grammar tree. Can we do the same thing? It shouldn't be too hard; basically we already have it, since we output a Grammar as a tree in the README. The same thing is needed, but with real data coming not from the random generator but from user input: this is how learners learn to write grammars correctly.

Support BNF Variants

Is your feature request related to a problem? Please describe.
Most uses of BNF out there are not formal, i.e. they are meant to be human readable. As such, they contain non-standard features such as optionals. For instance, consider this RFC, which does not use standard BNF to describe the Preferred Name Syntax for DNS.

Additionally, are you interested in supporting EBNF and validating a given input?

Describe the solution you'd like
I would like to have these features. I am not sure how fine-grained the configuration can be. I'm not entirely sure how this is done in Rust, but if it's like #ifdef in C then I think it's fair to want the code to remain readable.

Describe alternatives you've considered
I think all of these features are expressible in terms of standard BNF, so users can always make that transformation manually.

Additional context
I needed this while trying to validate (another feature I want) a given string with the BNF from above. See Wikipedia's page on BNF variants.

  1. https://tools.ietf.org/html/rfc1035#section-2.3.1
  2. https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form#Variants

Terminals with quotes to_string makes invalid string

The current implementation of fmt::Display is overly simple since Terms may be parsed via 'single-quoted' or "double-quoted".

#[test]
fn quote_term_to_and_from_str() {
    let quote = Term::Terminal(String::from("\""));
    let to_string = quote.to_string(); // -> \"\"\" , uh oh, that won't do
    let from_string = Term::from_str(&to_string);
    assert_eq!(Ok(Term::Terminal(String::from("\""))), from_string);
}

This test fails because the current behavior of to_string produces a string which cannot be parsed back into the original Terminal.

Some solution brainstorming may help:

  • fmt::Display could inspect whether the term contains " to determine if the output should use single quotes
  • Term::Terminal could be changed to a lower level enum of Single/Double quote type
  • Term parsing could be changed back to not support the single quote mode, but this would remove support for grammars using terminals with double quotes

My gut says the first option. It would add a few lines of complexity to fmt::Display for Term and make the associated to_string less performant, but it seems like a necessary cost. Maybe @Snewt will have a cleaner idea.
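
A minimal sketch of that first option, using a stand-in type for the Display logic (not the crate's actual implementation):

use std::fmt;

// stand-in for Term::Terminal's Display logic
struct QuotedTerminal(String);

impl fmt::Display for QuotedTerminal {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.0.contains('"') {
            // fall back to single quotes so the output can be re-parsed
            write!(f, "'{}'", self.0)
        } else {
            write!(f, "\"{}\"", self.0)
        }
    }
}

// note: a terminal containing both quote styles would still need escaping,
// which is part of why the lower-level enum option might appeal instead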

Undefined nonterminal is accepted in bnf parsing

Describe the bug
A nonterminal that is not defined is not rejected during parsing.

To Reproduce
For example,

let input = "<dna> ::= <base> | <base> <dna>";
let grammar: Grammar = input.parse().unwrap();

does not panic, even though <base> is not defined anywhere in the input.
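
A hedged sketch of the kind of validation pass that could reject this, assuming bnf's iterator accessors and Production's public lhs field (this is not current bnf behavior):

use bnf::{Grammar, Term};
use std::collections::HashSet;

// every nonterminal used on a right-hand side should also appear as
// some production's left-hand side; report the ones that don't
fn undefined_nonterminals(grammar: &Grammar) -> Vec<String> {
    let defined: HashSet<String> = grammar
        .productions_iter()
        .filter_map(|p| match &p.lhs {
            Term::Nonterminal(name) => Some(name.clone()),
            Term::Terminal(_) => None,
        })
        .collect();

    let mut missing = Vec::new();
    for prod in grammar.productions_iter() {
        for expr in prod.rhs_iter() {
            for term in expr.terms_iter() {
                if let Term::Nonterminal(name) = term {
                    if !defined.contains(name) && !missing.contains(name) {
                        missing.push(name.clone());
                    }
                }
            }
        }
    }
    missing
}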

Desktop (please complete the following information):

  • Windows 11

Clippy check step for PRs

Clippy has a lot of best practices, but BNF is in violation of a lot of them. Some would be breaking changes (like removing the non std::str::FromStr from_str public functions). Other changes require updating the parsers to nom 5 best practices.

Once that is done, we would just need to revert the previous attempt in order to re-enable clippy in CI.

    Checking bnf v0.2.6
warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:6:1
   |
6  | / named!(pub prod_lhs< &[u8], Term >,
7  | |     do_parse!(
8  | |             nt: delimited!(char!('<'), take_until!(">"), ws!(char!('>'))) >>
9  | |             _ret: ws!(tag!("::=")) >>
10 | |             (Term::Nonterminal(String::from_utf8_lossy(nt).into_owned()))
11 | |     )
12 | | );
   | |__^
   |
   = note: `#[warn(deprecated)]` on by default
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:6:1
   |
6  | / named!(pub prod_lhs< &[u8], Term >,
7  | |     do_parse!(
8  | |             nt: delimited!(char!('<'), take_until!(">"), ws!(char!('>'))) >>
9  | |             _ret: ws!(tag!("::=")) >>
10 | |             (Term::Nonterminal(String::from_utf8_lossy(nt).into_owned()))
11 | |     )
12 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:14:1
   |
14 | / named!(pub terminal< &[u8], Term >,
15 | |     do_parse!(
16 | |         t: alt!(
17 | |             delimited!(char!('"'), take_until!("\""), ws!(char!('"'))) |
...  |
21 | |     )
22 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:14:1
   |
14 | / named!(pub terminal< &[u8], Term >,
15 | |     do_parse!(
16 | |         t: alt!(
17 | |             delimited!(char!('"'), take_until!("\""), ws!(char!('"'))) |
...  |
21 | |     )
22 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:24:1
   |
24 | / named!(pub nonterminal< &[u8], Term >,
25 | |     do_parse!(
26 | |         nt: complete!(delimited!(char!('<'), take_until!(">"), ws!(char!('>')))) >>
27 | |         ws!(not!(complete!(tag!("::=")))) >>
28 | |         (Term::Nonterminal(String::from_utf8_lossy(nt).into_owned()))
29 | |     )
30 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:24:1
   |
24 | / named!(pub nonterminal< &[u8], Term >,
25 | |     do_parse!(
26 | |         nt: complete!(delimited!(char!('<'), take_until!(">"), ws!(char!('>')))) >>
27 | |         ws!(not!(complete!(tag!("::=")))) >>
28 | |         (Term::Nonterminal(String::from_utf8_lossy(nt).into_owned()))
29 | |     )
30 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:42:1
   |
42 | / named!(pub expression_next,
43 | |     do_parse!(
44 | |         ws!(char!('|')) >>
45 | |         ret: recognize!(peek!(complete!(expression))) >>
46 | |         (ret)
47 | |     )
48 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:50:1
   |
50 | / named!(pub expression< &[u8], Expression >,
51 | |     do_parse!(
52 | |         peek!(term) >>
53 | |         terms: many1!(complete!(term)) >>
...  |
63 | |     )
64 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:74:1
   |
74 | / named!(pub production< &[u8], Production >,
75 | |     do_parse!(
76 | |         lhs: ws!(prod_lhs) >>
77 | |         rhs: many1!(complete!(expression)) >>
...  |
86 | |     )
87 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: use of deprecated item 'ws': whitespace parsing only works with macros and will not be updated anymore
  --> src/parsers.rs:74:1
   |
74 | / named!(pub production< &[u8], Production >,
75 | |     do_parse!(
76 | |         lhs: ws!(prod_lhs) >>
77 | |         rhs: many1!(complete!(expression)) >>
...  |
86 | |     )
87 | | );
   | |__^
   |
   = note: this warning originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

warning: unneeded return statement
   --> src/grammar.rs:115:9
    |
115 |         return Ok(result);
    |         ^^^^^^^^^^^^^^^^^^ help: remove `return`: `Ok(result)`
    |
    = note: `#[warn(clippy::needless_return)]` on by default
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_return

warning: useless use of `format!`
  --> src/error.rs:55:32
   |
55 |             Needed::Unknown => format!("Data error: insufficient size, expectation unknown"),
   |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: consider using .to_string(): `"Data error: insufficient size, expectation unknown".to_string()`
   |
   = note: `#[warn(clippy::useless_format)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_format

warning: you should consider deriving a `Default` implementation for `expression::Expression`
  --> src/expression.rs:16:5
   |
16 | /     pub fn new() -> Expression {
17 | |         Expression { terms: vec![] }
18 | |     }
   | |_____^
   |
   = note: `#[warn(clippy::new_without_default)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#new_without_default
help: try this
   |
10 | #[derive(Default)]
   |

warning: defining a method called `from_str` on this type; consider implementing the `std::str::FromStr` trait or choosing a less ambiguous name
  --> src/expression.rs:26:5
   |
26 | /     pub fn from_str(s: &str) -> Result<Self, Error> {
27 | |         match parsers::expression_complete(s.as_bytes()) {
28 | |             Result::Ok((_, o)) => Ok(o),
29 | |             Result::Err(e) => Err(Error::from(e)),
30 | |         }
31 | |     }
   | |_____^
   |
   = note: `#[warn(clippy::should_implement_trait)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#should_implement_trait

warning: you should consider deriving a `Default` implementation for `grammar::Grammar`
  --> src/grammar.rs:20:5
   |
20 | /     pub fn new() -> Grammar {
21 | |         Grammar {
22 | |             productions: vec![],
23 | |         }
24 | |     }
   | |_____^
   |
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#new_without_default
help: try this
   |
14 | #[derive(Default)]
   |

warning: defining a method called `from_str` on this type; consider implementing the `std::str::FromStr` trait or choosing a less ambiguous name
  --> src/grammar.rs:32:5
   |
32 | /     pub fn from_str(s: &str) -> Result<Self, Error> {
33 | |         match parsers::grammar_complete(s.as_bytes()) {
34 | |             Result::Ok((_, o)) => Ok(o),
35 | |             Result::Err(e) => Err(Error::from(e)),
36 | |         }
37 | |     }
   | |_____^
   |
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#should_implement_trait

warning: writing `&String` instead of `&str` involves a new object where a slice will do.
  --> src/grammar.rs:74:31
   |
74 |     fn traverse(&self, ident: &String, rng: &mut StdRng) -> Result<String, Error> {
   |                               ^^^^^^^
   |
   = note: `#[warn(clippy::ptr_arg)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#ptr_arg
help: change this to
   |
74 |     fn traverse(&self, ident: &str, rng: &mut StdRng) -> Result<String, Error> {
   |                               ^^^^
help: change `ident.clone()` to
   |
86 |         let nonterm = Term::Nonterminal(ident.to_string());
   |                                         ^^^^^^^^^^^^^^^^^

error: using `clone` on a double-reference; this will copy the reference instead of cloning the inner type
  --> src/grammar.rs:99:37
   |
99 |             Some(e) => expression = e.clone(),
   |                                     ^^^^^^^^^
   |
   = note: `#[deny(clippy::clone_double_ref)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#clone_double_ref
help: try dereferencing it
   |
99 |             Some(e) => expression = &(*e).clone(),
   |                                     ^^^^^^^^^^^^^
help: or try being explicit about what type to clone
   |
99 |             Some(e) => expression = &expression::Expression::clone(e),
   |                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What to do with a grammar?

The example use case is parsing a BNF string into a grammar. But what are the use cases after that?

// let input = some grammar string
let grammar = bnf::parse(input);
// now what can I do with it?
  • Should an example be added that does something interesting by navigating the tree?
  • Should any functions be added that assist in walking the tree, like a visitor pattern?
  • Should the grammar be able to parse input strings and return the series of productions used?
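
For the record, the third bullet is essentially what Grammar::parse_input (shown in the README above) eventually provided: it parses input strings and yields parse trees, from which the productions used can be read. A tiny recap:

use bnf::Grammar;

let grammar: Grammar = "<dna> ::= <base> | <base> <dna>
<base> ::= 'A' | 'C' | 'G' | 'T'".parse().unwrap();

// each parse tree records which productions matched which input spans
if let Some(tree) = grammar.parse_input("GATTACA").next() {
    println!("{}", tree);
}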
