thelartians / pegparser Goto Github PK

💡 Build your own programming language! A C++17 PEG parser generator supporting parser combination, memoization, left-recursion and context-dependent grammars.

License: BSD 3-Clause "New" or "Revised" License

C++ 90.43% CMake 9.57%

parse cpp grammar calculator peg parser parser-generator c-plus-plus dsl parsercombinator

pegparser's Introduction

PEGParser

A linear-time C++17 PEG parser generator supporting memoization, left-recursion and context-dependent grammars.

Example

The following defines a simple calculator program. It is able to parse and evaluate the basic operations +, -, *, / while obeying operator and bracket precedence and ignoring whitespace characters between tokens.

#include <peg_parser/generator.h>
#include <iostream>

void example() {
  peg_parser::ParserGenerator<float> g;

  // Define grammar and evaluation rules
  g.setSeparator(g["Whitespace"] << "[\t ]");
  g["Sum"     ] << "Add | Subtract | Product";
  g["Product" ] << "Multiply | Divide | Atomic";
  g["Atomic"  ] << "Number | '(' Sum ')'";
  g["Add"     ] << "Sum '+' Product"    >> [](auto e){ return e[0].evaluate() + e[1].evaluate(); };
  g["Subtract"] << "Sum '-' Product"    >> [](auto e){ return e[0].evaluate() - e[1].evaluate(); };
  g["Multiply"] << "Product '*' Atomic" >> [](auto e){ return e[0].evaluate() * e[1].evaluate(); };
  g["Divide"  ] << "Product '/' Atomic" >> [](auto e){ return e[0].evaluate() / e[1].evaluate(); };
  g["Number"  ] << "'-'? [0-9]+ ('.' [0-9]+)?" >> [](auto e){ return stof(e.string()); };
  g.setStart(g["Sum"]);

  // Execute a string
  auto input = "1 + 2 * (3+4)/2 - 3";
  float result = g.run(input); // -> 5
  std::cout << input << " = " << result << std::endl;
}

Quickstart

PEGParser requires at least cmake 3.14 and the ability to compile C++17 code. The following shows how to compile and run the calculator example.

git clone https://github.com/TheLartians/PegParser
cd PegParser
cmake -Sexample -Bbuild/example
cmake --build build/example -j8
./build/example/calculator

You should familiarize yourself with the syntax of parsing expression grammars. The included examples should help you to get started.

Installation and usage

PEGParser can be easily added to your project through CPM.cmake.

CPMAddPackage(
  NAME PEGParser
  VERSION 2.1.1
  GITHUB_REPOSITORY TheLartians/PEGParser
)

target_link_libraries(myProject PEGParser::PEGParser)

Project goals

PEGParser is designed for ease-of-use and rapid prototyping of grammars with arbitrary complexity, and builds its parsers at run time. So far no work has been invested on optimizing the library, however it runs fast enough to be used in several production projects.

Time complexity

PEGParser uses memoization, resulting in linear time complexity (as a function of input string length) for grammars without left-recursion. Left-recursive grammars have squared time complexity in worst case. Memoization can also be disabled on a per-rule basis, reducing the memory footprint and allowing context-dependent rules.

pegparser's People

Contributors

Stargazers

Watchers

pegparser's Issues

license change

Hi -
Just a friendly suggestion about your good work here. If you'd like this software to be considered for use by a wider audience, especially since it is headers-only, change the license from (L)GPL to MIT or BSD!

Use separator in grammar

Hello a third question today :D

Is it possible to use the separator in the grammar too?
For example like this:

[...]
g.setSeparator(g["WS"]); // <- whitespace
g["ConcatExpr"] << "AltExpr WS AltExpr";  
[...]

Error handling

Hello again ;)

I have a question regarding error handling: Is it possible to provide custom error messages when a syntax error is found in the given input?

Right now the error messages are not really useful by its own (i.e. "syntax error at character 27...")

Example throws runtime error

Hello ;)
Building the examples works fine, however when I want to run the examples, the program throws an error in line 24 of python_identation.cpp: "string_view subscript out of range". I trace it back to State::current(). I am using MSVC (c++17 standard), tried Clang and it doesn't work too.
The version of the lib is v2.0.

Here is how it looks for me.

Any ideas?

Please change license references in source code comments from GPL to new BSD-3

Thanks for getting rid of (noxious) GPL!

GPL

Please provide LGPL or MIT like license which allows more user.

Problems with parsing rule using * or +

Hi there!
This might just be me being an idiot, but I'm trying to write a parser for PlantUML files and I'm getting stuck pretty early in the process.
I can parse no element or a single class, interface, entity, whatever just fine with my current grammar, but not more. This is what I have, simplified to just the class:

peg_parser::ParserGenerator<void, PlantUML&> g;
g.setSeparator(g["Whitespace"] << "[\t \n]");
g["Identifier"] << "[a-zA-Z] [a-zA-Z0-9_]*";

// // elements
g["Class"] << "'class' Identifier" >> [](auto e, auto& p) { p.elements.push_back(PlantUML::Element{PlantUML::Element::Type::Class, e[0].string()}); };

// diagram
g["Start"] << "'@startuml'";
g["End"] << "'@enduml'";
g["Diagram"] << "Start Class* End";

g.setStart(g["Diagram"]);

This is the test that fails already (Parser just wraps the ParserGenerator with the rules):

static constexpr auto two_classes_puml = 
"@startuml\
class           class\
class           class2\
@enduml";

TEST_CASE( "diagram with all simple elements is parsed", "[Parser]" ) {
    Parser parser;
    PlantUML r = parser.parse(two_classes_puml);

    REQUIRE(r.elements.size() == 2);
}

Do you have any idea what I'm doing wrong here?

Performance tuning?

In order to gauge the performance of the PEGParser library with a simple-but-real-world grammar I modified the included example/json_parser.cpp to

read the input from a file,
not generate a data structure, and
use the simple benchmark harness found in taoJSON.

Using the three data files from the Native JSON Benchmark as inputs I then compared the performance of

the resulting modified example/json_parser.cpp,
the benchmark utility from taoJSON that uses the example JSON grammar from the PEGTL,
the benchmark utility from taoJSON that uses the optimised-but-more-complicated grammar from that library and also builds the in-memory JSON representation.

Initial results show that program 1 is about 1000-5000x slower than 2, and 100-1000x slower than 3, which probably means that I'm doing something wrong. Towards tuning things, how can I disable memoization? The readme states that this is possible on a per-rule basis, but I couldn't find any documentation on how this might be achieved. And are there any other things I could change to increase performance?

Have you ever benchmarked your library against itself with memoization enabled and disabled? We, the authors of the PEGTL, have been asked about adding memoization on multiple occasions, which was also the main motivation for doing these benchmarks in the first place, however our feeling was that while in theory linear complexity with memoization is better than the quadratic complexity of the straight-forward recursive-descent approach, in practice the overhead of memoization combined with the fact that most real-world grammars by far don't hit the worst-case complexity let the simpler approach win.

Thank you for any insight.

Parser state in filters

Currently, filters only capture an external state and are therefore not thread-safe and is not also not reset on syntax errors. Adding an internal state variable that is unique to the current parsing process would resolve this. The difficulty lies in finding a type-safe way to implement this.

Documentation and a couple questions

So this library looks really cool and I'd love to use it. However, I'm unsure about the syntax (i.e., I know the general PEG syntax, but what extensions, if any, does this library have? How does the syntax it uses differ from normal PEG (which is really just an extension of ABNF/EBNF), etc.). Also, how does it cope with Unicode? A language I'm struggling to write a compiler for requires Unicode for identifiers, so I need some way of handling that without destroying the world in the process. :D

The examples look pretty neat, but they aren't really enough in terms of describing how the parser works or its limitations. For example, how is whitespace handled? Is it an implicit thing? (The examples would have me believe that this is the case, but it's worth asking here.)

Error symbol

Currently the empty symbol is being used to create syntax errors. The symbol should instead be used to declare empty rules and a special symbol for errors should be introduced.