yhirose / cpp-peglib
A single-file C++ header-only PEG (Parsing Expression Grammars) library
License: MIT License
Why is there an "@" symbol in that "ar" line?
$ cmake -G "MSYS Makefiles" .
-- The C compiler identification is GNU 8.2.0
-- The CXX compiler identification is GNU 8.2.0
-- Check for working C compiler: C:/MinGW/bin/gcc.exe
-- Check for working C compiler: C:/MinGW/bin/gcc.exe -- broken
CMake Error at C:/Program Files/CMake/share/cmake-3.13/Modules/CMakeTestCCompiler.cmake:52 (message):
The C compiler
"C:/MinGW/bin/gcc.exe"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: C:/Users//git/cpp-peglib/CMakeFiles/CMakeTmp
Run Build Command:"C:/MinGW/msys/1.0/bin/make.exe" "cmTC_c9e41/fast"
/usr/bin/make -f CMakeFiles/cmTC_c9e41.dir/build.make CMakeFiles/cmTC_c9e41.dir/build
make[1]: Entering directory `/c/Users//git/cpp-peglib/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_c9e41.dir/testCCompiler.c.obj
/C/MinGW/bin/gcc.exe -o CMakeFiles/cmTC_c9e41.dir/testCCompiler.c.obj -c /C/Users//git/cpp-peglib/CMakeFiles/CMakeTmp/testCCompiler.c
Linking C executable cmTC_c9e41.exe
"/C/Program Files/CMake/bin/cmake.exe" -E remove -f CMakeFiles/cmTC_c9e41.dir/objects.a
/C/MinGW/bin/ar.exe cr CMakeFiles/cmTC_c9e41.dir/objects.a @CMakeFiles/cmTC_c9e41.dir/objects1.rsp
c:\MinGW\bin\ar.exe: could not create temporary file whilst writing archive: no more archived files
make[1]: *** [cmTC_c9e41.exe] Error 1
make[1]: Leaving directory `/c/Users//git/cpp-peglib/CMakeFiles/CMakeTmp'
make: *** [cmTC_c9e41/fast] Error 2
CMake will not be able to correctly generate this project.
-- Configuring incomplete, errors occurred!
Also, compilation fails:
g++ -std=c++11 peglib.h
Hi, great work on the library. I've been writing my own C++ compiler, and I've found a very novel use for cpp-peglib. A friend of mine has been using cpp-peglib with Circle to generate code targeting an exotic architecture. I decided to port calc3.cc as a tutorial for using cpp-peglib at compile time to define and implement DSLs in C++.
https://github.com/seanbaxter/circle/blob/master/peg_dsl/peg_dsl.md
Basically what I do is #include peglib.h and make a compile-time instance of peg::parser. Circle has an integrated interpreter so any code can be executed during source translation. I beefed up the calc3 grammar by adding an IDENTIFIER rule. Then I create C++ functions that expand a Circle macro and feed it the text to be parsed in a string literal. A Circle macro invokes the parser object (at compile time!), gets back the AST, and traverses the AST using some other macros. IDENTIFIER nodes in the AST are evaluated with @expressions, which lexes, parses and injects the contained text. That is, the result object for an IDENTIFIER node is an lvalue to the object named. The other parts of the AST are processed similarly.
When all this is done, compilation ends, and you're left with a C++ program that has the code you specified in the string lowered to LLVM IR. There is no remnant of the parser in the executable, because its job was to translate the DSL into an AST at compile time.
Since parsing libraries are mostly used to build developer tools anyway, having a C++ compiler serve as a host language or a scripting language to bind the desired grammar to a C++ frontend via dynamic parsers like cpp-peglib is a real win.
My main project page is here, and there are tons of other examples.
https://www.circle-lang.org/
I think Circle could also simplify the implementation of cpp-peglib. Since the grammar is almost always known at compile time, you could move parsing of that to compile time and benefit from the type system and error handling already built into the compiler. It would allow you to achieve the performance of a statically scheduled compiler (like a hand-written RD parser) while having the expressiveness of the dynamic system you built.
Thanks,
sean
Adding target_link_libraries(peglint "Ws2_32.lib") to the peglint target stopped the linker errors.
Noted for future developers.
The current library is pretty close to working without C++ RTTI enabled. peglib is useful for environments where RTTI adds more space/time overhead than is acceptable, but it currently requires some very minor changes to the code (i.e. #ifdef __cpp_rtti guards around the uses of dynamic_cast, with replacement methods that return void* instead).
Would this be something that you'd be OK accepting a patch for?
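As a rough illustration of the kind of change being proposed, here is a minimal sketch (not the actual peglib patch; Ope, Sequence, Choice, and as_sequence are invented names) of guarding dynamic_cast behind __cpp_rtti with a hand-rolled kind tag as the fallback:

```cpp
#include <cassert>

// Base class carrying a manual kind tag for the non-RTTI build.
struct Ope {
  enum class Kind { Sequence, Choice } kind;
  explicit Ope(Kind k) : kind(k) {}
  virtual ~Ope() = default;  // polymorphic, so dynamic_cast is legal with RTTI
};

struct Sequence : Ope { Sequence() : Ope(Kind::Sequence) {} };
struct Choice   : Ope { Choice()   : Ope(Kind::Choice)   {} };

// Downcast helper: dynamic_cast when RTTI is available, tag check otherwise.
inline Sequence* as_sequence(Ope* o) {
#if defined(__cpp_rtti)
  return dynamic_cast<Sequence*>(o);
#else
  return o->kind == Ope::Kind::Sequence ? static_cast<Sequence*>(o) : nullptr;
#endif
}
```

Both branches return a null pointer on a mismatched type, so call sites stay unchanged between the two builds.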
Have a secondary stage to walk the parse tree that is currently output in AST mode. This stage could then create a much reduced AST according to an additional grammar or extra markup.
I have some example code written for the GDB grammar that demonstrates this idea, though this version relies on the functor callbacks for nodes in the parse tree.
I will explain this more clearly with an example. Just want to have a placeholder.
I wrote an interactive debug inspector for PEGs using peglib. I probably wouldn't have started that if I had found peglint before but now it's done and there's nothing we can do about it. However, what I needed most is a means of finding out why rules don't match certain parts of text although I wanted them to. So pegdebug displays the complete parsing process, not just the resulting AST.
You can check it out here:
https://github.com/mqnc/pegdebug
I had to modify peglib slightly to make it work. You can see the changes here:
https://github.com/mqnc/cpp-peglib
I also need these changes in my other projects. Maybe you find the functionality useful and can include it into your library. However, as it is, it breaks code that uses your peglib since the enter and leave functions have different signatures.
Please let me know if I did something wrong with the licensing or anything else in that direction.
I hope it's useful!
(sorry I made this an issue, I don't see another way for communication)
Hi Yuji
Any thoughts on the best way to handle single line comments, C++ or Python style?
// I am a comment
# ditto
Not sure if this can be easily done in a grammar. Wondering if there could be a %comment directive, as per %whitespace.
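One workaround that may cover this without a new directive (an untested sketch; COMMENT and EOL are rule names invented here): since anything matched by %whitespace is skipped between tokens, comment syntax can be folded into it:

```peg
# Treat comments as whitespace so they are skipped wherever whitespace is.
%whitespace <- ([ \t\r\n] / COMMENT)*
COMMENT     <- ('//' / '#') (!EOL .)* EOL
EOL         <- '\r\n' / '\n' / '\r' / !.
```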
I had forgotten how utterly cool this code is for writing parsers. 👍
Hello Yuji.
Cannot get this to work using master. The intention is to be able to parse a C/C++ style literal string with escaped quotes as in:
// i.e. file content, not a compiled string
" Hello \"Yuji\" "
Any thoughts? The rule is specified like this:
//
RULE <- '"' (LITERAL_ESC_QUOTE / LITERAL_CHAR)* '"'
// i.e. match anything that is not a single quote character
LITERAL_CHAR <- (!["] .)
// this is a string not an escaped quote
LITERAL_ESC_QUOTE <- '\"'
TAIA.
Jerry
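A possible explanation, offered as a guess: in a grammar literal, '\"' denotes just the single character ", because the backslash escapes the quote in the grammar text itself, so LITERAL_ESC_QUOTE and LITERAL_CHAR end up matching the same thing. Escaping the backslash makes the intent explicit (untested sketch):

```peg
RULE              <- '"' (LITERAL_ESC_QUOTE / LITERAL_CHAR)* '"'
# match the two-character sequence backslash + quote
LITERAL_ESC_QUOTE <- '\\' '"'
# match anything that is not a quote character
LITERAL_CHAR      <- (!["] .)
```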
Hi Yuji, would be very interested to get your thoughts on this too.
STRING_LITERAL <- < '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"' >
STRING_LITERAL <- '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"'
STRING_LITERAL <- < '"' (ESC / CHAR)* '"' >
ESC <- ('\\"' / '\\t' / '\\n')
CHAR <- (!["] .)
Questions:
Are the differences in 1 and 2 intentional?
3 is possibly unexpected, but it is consistent throughout, so not a problem. I think this is definitely worth documenting in the readme.
Adding -D_MSC_VER also doesn't work, and I got a lot of errors:
gcc4.9
In file included from D:/mingw32/i686-w64-mingw32/include/combaseapi.h:154:0,
from D:/mingw32/i686-w64-mingw32/include/objbase.h:14,
from D:/mingw32/i686-w64-mingw32/include/ole2.h:17,
from D:/mingw32/i686-w64-mingw32/include/wtypes.h:12,
from D:/mingw32/i686-w64-mingw32/include/winscard.h:10,
from D:/mingw32/i686-w64-mingw32/include/windows.h:97,
from mmap.h:6,
from peglint.cc:11:
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h: In member function 'HRESULT IUnknown::QueryInterface(Q**)':
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: error: expected primary-expression before ')' token
return QueryInterface(__uuidof(Q), (void **)pp);
^
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: error: there are no arguments to '__uuidof' that depend on a template parameter, so a declaration of '__uuidof' must be available [-fpermissive]
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
In file included from D:/mingw32/i686-w64-mingw32/include/urlmon.h:289:0,
from D:/mingw32/i686-w64-mingw32/include/objbase.h:163,
from D:/mingw32/i686-w64-mingw32/include/ole2.h:17,
from D:/mingw32/i686-w64-mingw32/include/wtypes.h:12,
from D:/mingw32/i686-w64-mingw32/include/winscard.h:10,
from D:/mingw32/i686-w64-mingw32/include/windows.h:97,
from mmap.h:6,
from peglint.cc:11:
D:/mingw32/i686-w64-mingw32/include/servprov.h: In member function 'HRESULT IServiceProvider::QueryService(const GUID&, Q**)':
D:/mingw32/i686-w64-mingw32/include/servprov.h:66:46: error: expected primary-expression before ')' token
return QueryService(guidService, __uuidof(Q), (void **)pp);
^
D:/mingw32/i686-w64-mingw32/include/servprov.h:66:46: error: there are no arguments to '__uuidof' that depend on a template parameter, so a declaration of '__uuidof' must be available [-fpermissive]
Same feature supported in go-peg.
I hate opening GitHub issues simply asking for help; however, I've been fighting with a seemingly simple grammar that I've not been able to get working correctly. Luckily, I stumbled across peglint for testing.
File a.peg contains:
Any <- Placeholder / Text
Placeholder <- '${' Int ':' Any '}'
Int <- [0-9]+
Text <- [a-z]+
Running a simple example works as expected:
> peglint.exe --ast --source "${1:hi}" a.peg
+ Any
+ Placeholder
- Int (1)
+ Any
- Text (hi)
A bit more complex example doesn't work:
> peglint.exe --ast --source "${1:hi${2:bye}}" a.peg
[commendline]:1:7: syntax error
I'm using Win7 and VS2015 if that makes a difference at all. Thanks.
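A plausible reading of the failure (a sketch, not a confirmed diagnosis): Text <- [a-z]+ matches "hi", after which Placeholder requires '}' but finds '$', and a PEG choice is never re-tried once an alternative has succeeded. Letting a placeholder body be a sequence of parts avoids that dead end:

```peg
Any         <- Placeholder / Text
Placeholder <- '${' Int ':' Any* '}'
Int         <- [0-9]+
# stop Text before a nested placeholder or a closing brace
Text        <- (!'${' !'}' .)+
```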
Here's the situation:
I think sooner or later I will probably want the ability to parse a context-dependent grammar, so it would be nice to be able to call external functions during the parsing process.
The first issue where this popped up is this:
I want to be able to parse custom operators with custom priority. So the user should be able to say
"× is infix and should be evaluated before + but after *"
But in order to make the priority customizable, the parser must know which operator is on which priority level during parsing. And since the definition of the operator also happens during parsing, it cannot know this during grammar construction.
And in the future I will probably also need to create some sort of registry during parsing where the parser can look things up.
So the best way that I see would be this:
Create a way to call external functions that determine whether there is a match or not. Something like:
Result <- Operand @Operator Operand
parser.definition["Operator"] = [](const char* s, size_t n, SemanticValues& sv, any& dt){
... // do custom stuff like look ups
return matchlen;
}
Do you think this is a good idea and useful in general? Or is there maybe a more elegant solution to my problem? (Like in the end I also didn't need left recursion although I thought I did)
In return I can implement UTF8 support ;)
Hi Yuji, more of a meta question here. I was experimenting to see if I could generate a much more minimal AST by associating specific rules with a custom reduce() function. However, this is not quite working as expected: the peg::SemanticValues& argument is always empty. Does there need to be a functor associated with each and every rule in the grammar? Or have I missed something here?
A minimal example based on the GDB/MI grammar:
auto mknode = [](const peg::SemanticValues& sv, peg::any& arg) -> peg::any
{
if (sv.size())
{
std::cout << sv[0].name << ' ' << sv[0].s << std::endl;
}
return peg::any();
};
peg::any arg;
// set up functor for rules of interest *only*
parser["STRING_LITERAL"] = mknode;
parser["IDENTIFIER"] = mknode;
parser["LBRACE"] = mknode;
parser["RBRACE"] = mknode;
parser["LBRACK"] = mknode;
parser["RBRACK"] = mknode;
if (!parser.parse_n(source.data(), source.size(), arg ))
{
ret = -1;
}
Hi!
Somehow peglint does not work for me like described in the readme:
$ cat a.peg
Additive <- Multitive '+' Additive / Multitive
Multitive <- Primary '*' Multitive / Primary
Primary <- '(' Additive ')' / Number
Number <- < [0-9]+ >
%whitespace <- [ \t]*
$ ./peglint --ast --source "1 + 2 * 3" a.peg
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
When I run in server mode, it shows the page and then crashes:
$ ./peglint --ast --server 8001 --source "1 + 2 * 3" a.peg
Server running at http://localhost:8001/
(now I open the browser and it shows a promising page)
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
I compiled and ran under Ubuntu,
The C compiler identification is GNU 7.2.0
The CXX compiler identification is GNU 7.2.0
All the best!
PS: If you can't reproduce this, I can try to provide more debug information.
Just wanted to know: are you going to make some 'release' or 'tag' on this project? I assume the code is stable enough.
Another idea for a tree-annotated grammar. Note the N: prefix. The 0'th node becomes the parent. Then its children are assigned in numerical order. So this makes it really easy to create new nodes with an arbitrary number of children. (Yuji, this is only a partially-considered scribble right now. I'm just posting it here for more brain fodder)
# annotation equivalent to the default parse tree construction
0:RESULT_LIST <- 1:RESULT (',' 2:RESULT)*
RESULT <- (NAMED_RESULT / ANON_RESULT)
# ASSIGNMENT_OP becomes parent to IDENTIFIER and ANON_RESULT
NAMED_RESULT <- 1:IDENTIFIER 0:(ASSIGNMENT_OP/HASH_OP) 2:ANON_RESULT
ANON_RESULT <- (0:STRING_LITERAL / BRACE_LIST / BRACK_LIST)
# i.e. here we end up with a new RESULT_LIST node, not a BRACE_LIST
BRACE_LIST <- '{' (0:RESULT_LIST)* '}'
BRACK_LIST <- '[' (0:RESULT_LIST)* ']'
ASSIGNMENT_OP <- < '=' >
HASH_OP <- < '#' >
Hi, I noticed that all the parsing methods of peg::parser and peg::Definition are const, which is nice. Could we get const access to the definitions too, in order to parse specific rules via a const peg::parser&?
diff --git a/peglib.h b/peglib.h
index 463cd3b..8050ec2 100644
--- a/peglib.h
+++ b/peglib.h
@@ -3116,10 +3116,14 @@ public:
Definition& operator[](const char* s) {
return (*grammar_)[s];
}
+ const Definition& operator[](const char* s) const {
+ return (*grammar_)[s];
+ }
+
std::vector<std::string> get_rule_names(){
std::vector<std::string> rules;
rules.reserve(grammar_->size());
for (auto const& r : *grammar_) {
rules.emplace_back(r.first);
OK, I plan to contribute a CSV or JSON parser to the examples. However, it is sometimes awkward that we cannot write something like the following:
auto syntax = R"(
ROOT <- _ TOKEN (',' _ TOKEN)*
TOKEN <- < [a-z0-9]+ > _
_ <- [ \t\r\n]*
)";
tree<string> sym ;
peg pg(syntax);
pg["TOKEN"] = [=](const char* s, size_t l, const vector<any>& v){ // this triggers an error because we need mutable lambdas; peglib.h:273 and 276 don't allow this
sym.insert(sym.begin(),string(s,l));
return string(s,l);
}
This would also be useful for generating an AST, or for letting the user manage the data structure. I also have an idea: the PEG grammar is powerful enough to handle some EBNF, with some restrictions on how prioritization is used.
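The underlying C++ issue can be shown without peglib at all: a by-value capture is const inside a non-mutable lambda, so mutating it fails to compile, while a by-reference capture both compiles and updates the original object. A minimal sketch (collect_tokens is an invented name):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// With [=], calling tokens.emplace_back inside a non-mutable lambda is a
// compile error (the captured copy is const). Capturing by reference with
// [&] side-steps the problem and also mutates the outer object, which is
// what the tree-building action above wants anyway.
std::vector<std::string> collect_tokens() {
  std::vector<std::string> tokens;
  auto action = [&](const char* s, std::size_t l) {  // [&] instead of [=]
    tokens.emplace_back(s, l);                       // mutates the outer vector
    return std::string(s, l);
  };
  action("abc", 3);
  action("42", 2);
  return tokens;
}
```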
After discovering the whole issue with left recursion and noticing that I can't live without it, I wonder if you are planning to implement it. I have found that some PEG parsers support it using different techniques:
https://github.com/axilmar/parserlib/blob/master/LEFT_RECURSION.txt
https://github.com/PhilippeSigaud/Pegged/wiki/Left-Recursion
I have very simple grammar to illustrate:
ARRAY <- SPACEONLY? TEST SPACEONLY? ( COMMA SPACEONLY? TEST SPACEONLY?)*
TEST <- NUM
/ VARNAME
COMMA <- ','
NUM <- [0-9]+
VARNAME <- [a-zA-Z0-9_]+
~SPACEONLY <- [ \t]+
I expect that this will match NUM first and, if that fails, go to VARNAME, by the order of the choice expression. Bryan Ford's paper says in the bottom-left paragraph of page 2: "The choice expression ‘e1 / e2’ first attempts pattern e1, then attempts e2 from the same starting point if e1 fails."
Then, the following tokens shall match:
1DA, SA1_1WS
1, SA1_1WS
sa_1,2
1,2,3
Do you consider this case to be by design, or a bug?
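For what it's worth, this is standard PEG behavior: once NUM consumes the "1" of "1DA", the choice is committed and VARNAME is never re-tried, even when the enclosing sequence later fails. A common workaround (an untested sketch) is a negative lookahead on NUM:

```peg
TEST    <- NUM / VARNAME
# NUM must not be immediately followed by an identifier character,
# so "1DA" falls through to VARNAME
NUM     <- [0-9]+ ![a-zA-Z_]
VARNAME <- [a-zA-Z0-9_]+
```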
Thank you for your wonderful project.
As the subject line indicates I have build problems with clang 5.0.0 in debug mode.
[ 18%] Linking CXX executable test-main
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84b85): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84bbf): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84bed): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c1b): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c49): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c77): more undefined references to `peg::enabler' follow
clang-5.0.0: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [test/CMakeFiles/test-main.dir/build.make:95: test/test-main] Error 1
make[1]: *** [CMakeFiles/Makefile2:86: test/CMakeFiles/test-main.dir/all] Error 2
make: *** [Makefile:95: all] Error 2
It works in release mode and gcc works in both modes.
Awesome library, btw. I especially like being able to use parser combinators. It really shoots up the performance.
Mike
I am trying to write a scripting language using this library. In compilers/interpreters of other scripting languages I've used, you can get multiple errors from a single file/class/function. I know that I can check for a lot of errors after the parser (from this lib) has finished (accessing a private var, calling a function that doesn't exist, etc.), but I can't seem to find a way to check for multiple syntax errors, because the parser stops after encountering one. Is there a way to ignore a rule if the parser finds a syntax error in my input?
So here's a simple grammar to parse protobuf:
statements <- statement*
statement <-
"syntax" '=' string ';' /
"import" string ';' /
"package" token ';' /
enum_statement /
message_statement
enum_statement <- "enum" token '{' enum_decl* '}'
enum_decl <- token '=' number ';'
message_statement <- "message" token '{' field* '}'
field <-
type_decl /
repeated_decl /
oneof_decl /
map_decl /
message_statement /
enum_statement
type_decl <- type token '=' number ';'
type <- token ('.' token)*
repeated_decl <- "repeated" type token '=' number ';'
oneof_decl <- "oneof" token '{' type_decl* '}'
map_decl <- "map" '<' type ',' type '>' token '=' number ';'
%word <- token / number
string <- < '"' (!'"' .)* '"' >
token <- < [a-zA-Z_][a-zA-Z0-9_]* >
number <- < [0-9]+ >
%whitespace <- [ \t\r\n]*
The problem is that the AST doesn't record which statement rule was matched; the "syntax" and "import" nodes look the same.
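One way this is commonly handled (an untested sketch; the *_statement rule names are invented): give each alternative its own named rule, so each AST node carries that rule's name:

```peg
statement         <- syntax_statement / import_statement / package_statement /
                     enum_statement / message_statement
syntax_statement  <- "syntax" '=' string ';'
import_statement  <- "import" string ';'
package_statement <- "package" token ';'
```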
Hi Yuji, this one is a bit odd. I'm using the peglint framework so loading grammars from a text file. If I enable AST mode I get a clean parse and a tree etc. If I use the parse() member function I get a syntax error from the same test. Am I missing something obvious?
# a grammar to parse the GDB 'machine interface' (GDB/MI)
GDB_MI <- (GDB_ELEMENT EOL)*
GDB_ELEMENT <- (AT_STRING / NEG_STRING / OP_LIST)
AT_STRING <- '@' STRING_LITERAL
NEG_STRING <- '~' STRING_LITERAL
OP_LIST <- OP_CHAR IDENTIFIER ',' (RESULT_LIST)*
OP_CHAR <- ( '~' / '*' / '=' / '+' / '^')
RESULT_LIST <- RESULT (COMMA RESULT)*
RESULT <- (IDENTIFIER (ASSIGNMENT_OP/HASH_OP))? (STRING_LITERAL / LBRACE (RESULT_LIST)* RBRACE / LBRACK (RESULT_LIST)* RBRACK)
~ASSIGNMENT_OP <- < '=' >
HASH_OP <- < '#' >
LBRACE <- < '{' >
RBRACE <- < '}' >
LBRACK <- < '[' >
RBRACK <- < ']' >
~COMMA <- < ',' >
# GDB/MI strings contain escape sequences. The <> ensures the token
# content is captured in the STRING_LITERAL AST node
STRING_LITERAL <- < '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"' >
# GDB/MI identifiers can contain '-' as in -info-breakpoint
IDENTIFIER <- < [_a-zA-Z] ([_A-Za-z0-9] / '-')* >
# recognize but ignore end of line characters
~EOL <- '\n'
# sent at the end of a sequence
~TERMINATOR <- '(gdb)'
# consume the following during parse. +1!
%whitespace <- [ \t\r]*
Here's the test string
^done,address="0x4000bb28",load-size="116244",transfer-rate="41104",write-rate="369",BreakpointTable={nr_rows="1",nr_cols="6",hdr=[{width="7",alignment="-1",col_name="number",colhdr="Num"},{width="14",alignment="-1",col_name="type",colhdr="Type"},{width="4",alignment="-1",col_name="disp",colhdr="Disp"},{width="3",alignment="-1",col_name="enabled",colhdr="Enb"},{width="10",alignment="-1",col_name="addr",colhdr="Address"},{width="40",alignment="2",col_name="what",colhdr="What"}],body=[bkpt={number="1",type="breakpoint",disp="keep",enabled="y",addr="0x40003e10",func="main",file="../cyfxbulklpauto.c",fullname="r:\\src\\cypress\\fx3\\usbbulkloopauto\\cyfxbulklpauto.c",line="702",thread-groups=["i1"],times="0",original-location="main"}]}
I was playing with this version of your parser https://github.com/yhirose/cpp-peglib/blob/57f866c6ca77f5a5afe37f72942d5526c45d7e87/peglib.h and accidentally found unlimited memory consumption. Consider this example:
#include <cpp-peglib/peglib.h>
#include <iostream>
#include <cstdlib>
using namespace peg;
using namespace std;
int main(int , char** ) try
{
do
{
function<long (const Ast&)> eval = [&](const Ast& ast) {
if (ast.name == "NUMBER") {
return stol(ast.token);
} else {
const auto& nodes = ast.nodes;
auto result = eval(*nodes[0]);
for (auto i = 1u; i < nodes.size(); i += 2) {
auto num = eval(*nodes[i + 1]);
auto ope = nodes[i]->token[0];
switch (ope) {
case '+': result += num; break;
case '-': result -= num; break;
case '*': result *= num; break;
case '/': result /= num; break;
}
}
return result;
}
};
parser parser(R"(
EXPRESSION <- TERM (TERM_OPERATOR TERM)*
TERM <- FACTOR (FACTOR_OPERATOR FACTOR)*
FACTOR <- NUMBER / '(' EXPRESSION ')'
TERM_OPERATOR <- < [-+] >
FACTOR_OPERATOR <- < [/*] >
NUMBER <- < [0-9]+ >
%whitespace <- [ \t\r\n]*
)");
parser.enable_ast();
parser.enable_packrat_parsing();
auto expr = " 2+2*2 ";
shared_ptr<Ast> ast;
if (parser.parse(expr, ast)) {
ast = AstOptimizer(true).optimize(ast);
//cout << ast_to_s(ast);
//cout << expr << " = " << eval(*ast) << endl;
}
}
while(0);
return 0;
}
catch(const std::exception &ex)
{
std::cerr << "Error: " << ex.what() << "\n";
return 1;
}
(Built with g++ (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 8.1.0, Win7, 'g++ -std=c++17 -Wall -Wextra -Wpedantic -g -O0 -fno-inline -fno-omit-frame-pointer -ggdb -isystemK:/1/0/source/cpp-peglib main.cpp -o a.exe'.)
If I make it an infinite loop with while(1), the process memory grows to 200 MB and more within a few minutes. The program with while(0) requires less than 1 MB.
There is the full output of 'drmemory -- a.exe' at https://pastebin.com/raw/5qkLW8Zj. Its important parts:
Error #1: LEAK 172 direct bytes 0x020fbd20-0x020fbdcc + 540 indirect bytes
peglib.h:3269 _ZZN3peg6parser10enable_astINS_7AstBaseINS_9EmptyTypeEEEEERS0_vENKUlRKNS_14SemanticValuesEE_clES8_
Error #3: LEAK 8 direct bytes 0x021051d0-0x021051d8 + 344 indirect bytes
peglib.h:2942 peg::AstBase<>::AstBase
peglib.h:3269 in lambda:
Line 3269 in 57f866c
peglib.h:2942 in c-tor:
Line 2942 in 57f866c
Hello
Great library.
It would be great if you had official releases of cpp-peglib. Even though it's only a header-only library, it would make packaging and versioning possible with vcpkg and/or conan, and even with CMake FetchContent, which fetches content by git tags.
Thank you for your consideration.
Hi Yuji!
I stumbled upon an undetected left recursion. It was hiding deep down in my grammar and I was able to reduce it to the following pattern:
_ <- ' '*
A <- B
B <- _ A
Peglib/Peglint does not see a problem there. However, if I substitute the _ rule, it works fine:
A <- B
B <- ' '* A
lrec.peg:1:6: 'B' is left recursive.
lrec.peg:2:11: 'A' is left recursive.
Hi! I tried to run calc.cc from the example directory. Unfortunately, I received an error message:
$ cd example/
$ g++ -I .. -std=c++11 calc.cc
$ ./a.out 2+3*4
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
This is both with g++ and with clang++. Apparently this call to std::call_once is responsible:
https://github.com/yhirose/cpp-peglib/blob/21934dd1ce/peglib.h#L1483
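For reference, a self-contained std::call_once check. Whether this matches the crash is an assumption on my part, but a classic cause of std::system_error from std::call_once on Linux is a binary built without -pthread on older glibc, so `g++ -I .. -std=c++11 calc.cc -pthread` may be worth trying:

```cpp
#include <mutex>

// Minimal std::call_once exercise. On older glibc, the first std::call_once
// throws std::system_error at runtime unless the program is linked with
// -pthread, which matches the "Unknown error -1" symptom above.
int call_once_count() {
  std::once_flag flag;
  int count = 0;
  std::call_once(flag, [&] { ++count; });  // runs the callable exactly once
  std::call_once(flag, [&] { ++count; });  // no-op: flag is already set
  return count;
}
```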
Hi! Great job on the library! I love that it's just a single h file! I have a question tho:
A raw string literal in C++ looks like this:
R"CustomDelimiter(any text you could possibly want)CustomDelimiter"
Can this somehow be parsed with cpp-peglib? It would have to be something like this:
CPPRAWSTRING <- 'R"' '(' .* ')' {token0} '"'
Is that possible? If not with a simple grammar, maybe using enter and leave?
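If the version in use supports capture/back-reference syntax ($name< ... > to capture text, $name to match the same text again; worth checking against the README for the installed version before relying on it), the delimiter constraint might be expressible like this (untested sketch, rule names invented):

```peg
# Capture the custom delimiter, then require the same delimiter at the end.
CPPRAWSTRING <- 'R"' $delim< [a-zA-Z_]* > '(' (!STR_END .)* STR_END
STR_END      <- ')' $delim '"'
```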
This comes from a larger project that uses UTF-8 internally, for which I had to read files that could be UTF-8 or UTF-16. While the project is not ready for posting, I have broken off some utilities, including UTF conversion routines (and, in the header file, a link to where I got the information). I hope that this may be of use to you in your efforts.
https://github.com/mjsurette/easyUtils
Mike
I have a grammar for a programming language. It defines %whitespace, because whitespaces are not significant.
Now, I want to parse string literals with a rule like this:
StrQuot <- '"' (StrEscape / StrChars)* '"'
StrEscape <- < '\\' any >
StrChars <- < (!'"' !'\\' any)+ >
StrEscape and StrChars both have actions that produce a std::string, which I combine together in the action of StrQuot. The problem is that the whitespace in the strings is ignored, and thus the resulting string has all the whitespace filtered out.
Is there a way to deactivate locally the %whitespace rule?
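One direction to try (a sketch based on the general rule that %whitespace is applied around tokens, not inside the token operator < >; behavior worth verifying against the version in use): wrap the entire literal in < > so its interior is treated as a single token:

```peg
# The outer < > keeps %whitespace from being applied inside the literal.
StrQuot   <- < '"' (StrEscape / StrChars)* '"' >
StrEscape <- '\\' .
StrChars  <- (!'"' !'\\' .)+
```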
Here's the annotated grammar. Note the ^ prefix. The notation means: take the name and value of the prefixed node and transfer them to its parent, in effect transforming an n-ary node into a suitably configured binary node. Such a rotated/transformed tree is much simpler to walk.
RESULT_LIST <- RESULT (',' RESULT)*
RESULT <- (NAMED_RESULT / ANON_RESULT)
NAMED_RESULT <- IDENTIFIER ^(ASSIGNMENT_OP/HASH_OP) ANON_RESULT
ANON_RESULT <- (STRING_LITERAL / BRACE_LIST / BRACK_LIST)
BRACE_LIST <- '{' (RESULT_LIST)* '}'
BRACK_LIST <- '[' (RESULT_LIST)* ']'
ASSIGNMENT_OP <- < '=' >
HASH_OP <- < '#' >
Standard output:
+ RESULT (NAMED_RESULT)
- IDENTIFIER: 'times'
- ASSIGNMENT_OP: '='
- ANON_RESULT (STRING_LITERAL): '"0"'
+ RESULT (NAMED_RESULT)
- IDENTIFIER: 'original-location'
- ASSIGNMENT_OP: '='
- ANON_RESULT (STRING_LITERAL): '"main"'
Transformed output:
+ ASSIGNMENT_OP: '='
- IDENTIFIER: 'times'
- ANON_RESULT (STRING_LITERAL): '"0"'
+ ASSIGNMENT_OP: '='
- IDENTIFIER: 'original-location'
- ANON_RESULT (STRING_LITERAL): '"main"'
peglint exits with an error when trying to reproduce tutorial example:
./peglint --ast --opt --source "1 + 2 * 3" a.peg
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Seems to be a regression of #46
My environment is: Linux Mint 19.1, gcc 7.3.0.
For the CSV grammar in the given example, it core dumps when forming an AST:
file <- (header NL)? record (NL record)* NL?
header <- name (COMMA name)*
record <- field (COMMA field)*
name <- field
field <- escaped / non_escaped
escaped <- DQUOTE (TEXTDATA / COMMA / CR / LF / D_DQUOTE)* DQUOTE
non_escaped <- TEXTDATA*
COMMA <- ','
CR <- '\r'
DQUOTE <- '"'
LF <- '\n'
NL <- CR LF / CR / LF
TEXTDATA <- !([",] / NL) .
D_DQUOTE <- '"' '"'
#0 0x000000000043267a in std::_Hashtable<std::string, std::pair<std::string const, peg::Definition>, std::allocator<std::pair<std::string const, peg::Definition> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/hashtable.h:369
#1 0x0000000000429eae in std::_Hashtable<std::string, std::pair<std::string const, peg::Definition>, std::allocator<std::pair<std::string const, peg::Definition> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/hashtable.h:455
#2 0x00000000004216f6 in std::unordered_map<std::string, peg::Definition, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, peg::Definition> > >::begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/unordered_map.h:249
#3 0x00000000004237e9 in peg::parser::enable_ast<peg::AstBase<peg::EmptyType> > (this=0x7ffe1c236ca0) at cpp-peglib/peglib.h:3254
If the (?) operator is not used, it doesn't crash. Is there any reason not to use the (?) operator?
Hey Yuji!
While I was designing the grammar for my language, I found that many rules require the same actions.
Consider this:
sentence <- subject verb object '.'
question <- verb subject something '?'
quote <- subject verb '"' (sentence/question) '."'
subject <- word
verb <- word
object <- word
something <- word
word <- [a-zA-Z]*
Now, the word things have to return their matched strings and the sentence rules (1 to 3) need to perform some sort of concatenation, so I kind of need two different default rules. Of course I can assign them like this:
parser["sentence"] = parser["question"] = parser["quote"] = [](sv){...};
parser["subject"] = parser["verb"] = parser["object"] = parser["something"] = [](sv){...};
or maybe make one of the two the default rule.
(This example is a bit stupid, in reality I need many more default rules)
But I was thinking, maybe it would be nice to be able to specify some forwarding in the grammar already:
sentence>>concat <- subject verb object '.'
question>>concat <- verb subject something '?'
quote>>concat <- subject verb '"' (sentence/question) '."'
subject>>match <- word
verb>>match <- word
object>>match <- word
something>>match <- word
word <- [a-zA-Z]*
parser["concat"] = [](sv){...};
parser["match"] = [](sv){...};
I'm not sure if the syntax is confusing tho. Maybe rather
name: forward <- pattern
Parsley does it like this:
name = pattern -> action
which I find more intuitive but it's no longer original PEG syntax so I think it's unacceptable.
You think this might be a good idea? It's nothing that I desperately need but maybe a useful feature.
Cheers!
For example: -2 + 3
Hi Yuji,
Something of a 'nice to have': Some statistics on the amount of backtracking during a parse might suggest better/optimal rule ordering (?)
Hi Yuji
In the new whitespace branch, this needs to be static I believe.
At least in the way I structured my parse actions, I think it could be useful at times to build onto semantic values from previous actions by modifying or reusing data instead of copying it. For example:
// A <- (rule for matching/creating A)
parser["A"] = [](const peg::SemanticValues& sv) -> StructA {
return { /* using data from sv string */ }
}
// ModA <- (rule for matching a modification of A)
parser["ModA"] = [](const peg::SemanticValues& sv) -> /* StructA */ {
/* modify StructA within sv[0] and let sv[0] be passed on as usual? */
}
// B <- (rule for matching/creating B from A)
parser["B"] = [](const peg::SemanticValues& sv) -> StructB {
return { /* std::move data from sv[0] for creating StructB efficiently? */ }
}
In order to do this, I'd need a T& from the const any& items returned by sv[], but the references from any::get<T>() const are const T&. If it is guaranteed that the parser itself doesn't utilize the semantic value contents, wouldn't it be safe to drop const from the reference returned by sv[].get<T>()?
If this is true, could we add this guarantee to SemanticValues as a getter for non-const references? Maybe something like:
template<typename T> T& SemanticValues::value(size_t index = 0) const {
return const_cast<T&>(operator[](index).get<T>());
}
The only other problem I could think of is if semantic values were shared between multiple parse handlers, which shouldn't happen since there is only one parse action per match (input is not shared) and AST nodes don't have multiple parents (output is not shared).
When parsing some grammar, I would like to keep parsing and only record minor semantic issues as warnings at the end, along with the line/column numbers of the warnings. It would therefore be nice to have the line/column information available during a semantic action.
However, the only way I can see to get at the line/column information is to actually throw the parse_error exception and let the log function be called.
Any suggestions?
Hey Yuji!
I have encountered a strange bug:
If I have this grammar
term <- ( ws atom ws op )* ws atom ws
op <- '+'
ws <- ' '*
atom <- [0-9]*
and parse the text "99", then the "term" action is called with 5 semantic values although there should only be 3:
for (auto& rule : parser.get_rule_names()) {
    parser[rule.c_str()] = [rule](const SemanticValues& sv, any&) {
        cout << "rule " << rule << " called\n";
        for (size_t i = 0; i < sv.size(); i++) {
            cout << "  sv[" << i << "] = " << sv[i].get<string>() << "\n";
        }
        return rule;
    };
}
produces
rule ws called
rule atom called
rule ws called
rule ws called
rule atom called
rule ws called
rule term called
sv[0] = atom
sv[1] = ws
sv[2] = ws
sv[3] = atom
sv[4] = ws
Or is there something I don't get?
Hey! I am having a problem where a simple syntax fails to parse with the MSVC compiler but seems to work with GCC. Compiling the following code with MSVC (x64) versions 19.00.24213.1 and 18.00.31101 results in a parsing error, i.e. !parser == true. C++14 support is enabled. If 'in?' is removed from the op_cmd rule, it starts working again. This works with GCC 5.3.0. I used the latest cpp-peglib commit 5e67fcb. Any ideas what could be going wrong or how I could debug it further?
#include <peglib.h>
#include <boost/log/trivial.hpp>
int main()
{
const auto testSyntax =
R"(
main <- op_cmd
# Set interpolation mode
intpol_lin <- 'G01'
intpol_cw <- 'G02'
intpol_ccw <- 'G03'
in <- intpol_lin / intpol_cw / intpol_ccw
# Operations
move_intpol <- 'D01'
move <- 'D02'
flash <- 'D03'
sel <- [XYIJ]
# If 'in?' then it doesn't crash with VS2015
op_cmd <- in? (sel coord)+ (move_intpol / move / flash) '*'
# General token identifiers
digit <- [0-9]
coord <- ('+' / '-')? digit+
)";
peg::parser parser(testSyntax);
if(!parser)
{
BOOST_LOG_TRIVIAL(error) << "Parser syntax error!";
}
}
I'll try to use as little code as possible to explain the problem I ran into.
For example the parser expressions are as below:
ASSIGN <- "Set" TYPE IDSTR '=' EXPRESSION
TYPE <- ["Interger""Decimal"]
IDSTR <- [_A-Za-z][_A-Za-z0-9]*
EXPRESSION <- ...brabrabra, just return double value...
And I want to parse "SetDecimalvariable=50.0"
The sv that I get in ASSIGN is as follows:
sv.size() == 3                                  // fine, (TYPE, IDSTR, EXPRESSION) are three elements
sv.str_c() == "SetDecimalvariable=50.0"         // fine, the original data
sv[0].get<string>() == "Decimalvariable=50.0"   // weird, it should be "Decimal" as the first element
sv[1].get<string>() == "ecimalvariable=50.0"    // more weird, it should be "variable" as the second element
sv[2].get<Ele>() == 50.0                        // fine, it's the third element
Am I misunderstanding something?
Thanks for the project, it helps me a lot.
[praise]
I have been using cpp-peglib for a few weeks now, and this library works tremendously well. The API is easy to grasp and its behavior is very predictable. I originally started writing my grammar (a subset of Python's grammar) using boost::spirit, and I stopped when things became unmanageable. I can do a lot more, and more easily, with cpp-peglib, so thank you for that.
[/praise]
One thing I was not happy with in cpp-peglib is that, when I got my grammar wrong, the only thing I saw was a crash at runtime (access to a nullptr). I finally looked more closely at the sources, and I found two useful things:
1. It is possible to construct a parser with no arguments and to call load_grammar afterwards. This returns a bool that tells whether the grammar is ok.
2. It is possible to install a logger (parser::log) to get the details of what is wrong with the grammar.
It took me a few weeks to realize that these tools were already there, waiting for me to use them. My suggestion is to use them in the example code of the readme file:
Instead of:
parser parser(syntax);
Use something like this:
parser parser;
parser.log = [](size_t line, size_t col, const string& msg) {
    cerr << line << ":" << col << ": " << msg << "\n";
};
bool ok = parser.load_grammar(grammar);
assert(ok);
It is not as pretty as the current version, but it would be a huge help for users trying to understand the mistakes in their grammars. Alternatively, there could be a default logger installed on all parser instances, which could be deactivated if needed.
Hi, I'm currently reading the pl0 language example and found something really confusing: in this file, all the switch cases are followed by an underscore. Is this a DSL from this lib or a new C++11 feature? Thanks in advance :)
I set enablePackratParsing = 1, and my PEG is:
auto syntax = R"xyz(decl<-decl_specs init_declarator_list ';'
decl_specs<-'char' / 'short' / 'int' / 'long' / 'float'
init_declarator_list<-direct_declarator
/ direct_declarator '=' initializer
direct_declarator<-id
initializer<-additive
additive<-left:multiplicative "+" right : additive
/ multiplicative
multiplicative<-left : primary "*" right : multiplicative
/ primary
primary<-primary_exp
/ "(" additive : additive ")"
primary_exp<-id
/ const)xyz";
parser parser(syntax);
When I debug, I set a breakpoint after "parser parser(syntax);", but it crashed.