yhirose / cpp-peglib
A single-file C++ header-only PEG (Parsing Expression Grammars) library
License: MIT License
Why is there an "@" symbol in that "ar" line?
$ cmake -G "MSYS Makefiles" .
-- The C compiler identification is GNU 8.2.0
-- The CXX compiler identification is GNU 8.2.0
-- Check for working C compiler: C:/MinGW/bin/gcc.exe
-- Check for working C compiler: C:/MinGW/bin/gcc.exe -- broken
CMake Error at C:/Program Files/CMake/share/cmake-3.13/Modules/CMakeTestCCompiler.cmake:52 (message):
The C compiler
"C:/MinGW/bin/gcc.exe"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: C:/Users//git/cpp-peglib/CMakeFiles/CMakeTmp
Run Build Command:"C:/MinGW/msys/1.0/bin/make.exe" "cmTC_c9e41/fast"
/usr/bin/make -f CMakeFiles/cmTC_c9e41.dir/build.make CMakeFiles/cmTC_c9e41.dir/build
make[1]: Entering directory `/c/Users//git/cpp-peglib/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_c9e41.dir/testCCompiler.c.obj
/C/MinGW/bin/gcc.exe -o CMakeFiles/cmTC_c9e41.dir/testCCompiler.c.obj -c /C/Users//git/cpp-peglib/CMakeFiles/CMakeTmp/testCCompiler.c
Linking C executable cmTC_c9e41.exe
"/C/Program Files/CMake/bin/cmake.exe" -E remove -f CMakeFiles/cmTC_c9e41.dir/objects.a
/C/MinGW/bin/ar.exe cr CMakeFiles/cmTC_c9e41.dir/objects.a @CMakeFiles/cmTC_c9e41.dir/objects1.rsp
c:\MinGW\bin\ar.exe: could not create temporary file whilst writing archive: no more archived files
make[1]: *** [cmTC_c9e41.exe] Error 1
make[1]: Leaving directory `/c/Users//git/cpp-peglib/CMakeFiles/CMakeTmp'
make: *** [cmTC_c9e41/fast] Error 2
CMake will not be able to correctly generate this project.
-- Configuring incomplete, errors occurred!
Also, compilation fails:
g++ -std=c++11 peglib.h
Hi, great work on the library. I've been writing my own C++ compiler, and I've found a very novel use for cpp-peglib. A friend of mine has been using cpp-peglib with Circle to generate code targeting an exotic architecture. I decided to port calc3.cc as a tutorial for using cpp-peglib at compile time to define and implement DSLs in C++.
https://github.com/seanbaxter/circle/blob/master/peg_dsl/peg_dsl.md
Basically what I do is #include peglib.h and make a compile-time instance of peg::parser. Circle has an integrated interpreter so any code can be executed during source translation. I beefed up the calc3 grammar by adding an IDENTIFIER rule. Then I create C++ functions that expand a Circle macro and feed it the text to be parsed in a string literal. A Circle macro invokes the parser object (at compile time!), gets back the AST, and traverses the AST using some other macros. IDENTIFIER nodes in the AST are evaluated with @expressions, which lexes, parses and injects the contained text. That is, the result object for an IDENTIFIER node is an lvalue to the object named. The other parts of the AST are processed similarly.
When all this is done, compilation ends, and you're left with a C++ program that has the code you specified in the string lowered to LLVM IR. There is no remnant of the parser in the executable, because its job was to translate the DSL into an AST at compile time.
Since parsing libraries are mostly used to build developer tools anyway, having a C++ compiler serve as a host language or a scripting language to bind the desired grammar to a C++ frontend via dynamic parsers like cpp-peglib is a real win.
My main project page is here, and there are tons of other examples.
https://www.circle-lang.org/
I think Circle could also simplify the implementation of cpp-peglib. Since the grammar is almost always known at compile time, you could move parsing of that to compile time and benefit from the type system and error handling already built into the compiler. It would allow you to achieve the performance of a statically scheduled compiler (like a hand-written RD parser) while having the expressiveness of the dynamic system you built.
Thanks,
sean
Adding target_link_libraries(peglint "Ws2_32.lib") to the peglint target stopped the linker errors.
Noted for future developers.
The current library is pretty close to working without C++ RTTI enabled. peglib is useful for environments where RTTI adds more space/time overhead than is acceptable, but it currently requires some very minor changes to the code (i.e. #ifdef __cpp_rtti guards around the uses of dynamic_cast, with replacement methods that return void* instead).
Would this be something that you'd be OK accepting a patch for?
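As a rough illustration of the kind of change being proposed, here is a minimal sketch (not the actual peglib patch; Ope, Sequence, Choice, and as_sequence are invented names) of guarding dynamic_cast behind __cpp_rtti with a hand-rolled kind tag as the fallback:

```cpp
#include <cassert>

// Base class carrying a manual kind tag for the non-RTTI build.
struct Ope {
  enum class Kind { Sequence, Choice } kind;
  explicit Ope(Kind k) : kind(k) {}
  virtual ~Ope() = default;  // polymorphic, so dynamic_cast is legal with RTTI
};

struct Sequence : Ope { Sequence() : Ope(Kind::Sequence) {} };
struct Choice   : Ope { Choice()   : Ope(Kind::Choice)   {} };

// Downcast helper: dynamic_cast when RTTI is available, tag check otherwise.
inline Sequence* as_sequence(Ope* o) {
#if defined(__cpp_rtti)
  return dynamic_cast<Sequence*>(o);
#else
  return o->kind == Ope::Kind::Sequence ? static_cast<Sequence*>(o) : nullptr;
#endif
}
```

Both branches return a null pointer on a mismatched type, so call sites stay unchanged between the two builds.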
Have a secondary stage to walk the parse tree that is currently output in AST mode. This stage could then create a much reduced AST according to an additional grammar or extra markup.
I have some example code written for the GDB grammar that demonstrates this idea, though this version relies on the functor callbacks for nodes in the parse tree.
I will explain this more clearly with an example. Just want to have a placeholder.
I wrote an interactive debug inspector for PEGs using peglib. I probably wouldn't have started that if I had found peglint before but now it's done and there's nothing we can do about it. However, what I needed most is a means of finding out why rules don't match certain parts of text although I wanted them to. So pegdebug displays the complete parsing process, not just the resulting AST.
You can check it out here:
https://github.com/mqnc/pegdebug
I had to modify peglib slightly to make it work. You can see the changes here:
https://github.com/mqnc/cpp-peglib
I also need these changes in my other projects. Maybe you find the functionality useful and can include it into your library. However, as it is, it breaks code that uses your peglib since the enter and leave functions have different signatures.
Please let me know if I did something wrong with the licensing or anything else in that direction.
I hope it's useful!
(sorry I made this an issue, I don't see another way for communication)
Hi Yuji
Any thoughts on the best way to handle single line comments, C++ or Python style?
// I am a comment
# ditto
Not sure if this can be easily done in a grammar. Wondering if there could be a %comment directive, as per %whitespace.
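One workaround that may cover this without a new directive (an untested sketch; COMMENT and EOL are rule names invented here): since anything matched by %whitespace is skipped between tokens, comment syntax can be folded into it:

```peg
# Treat comments as whitespace so they are skipped wherever whitespace is.
%whitespace <- ([ \t\r\n] / COMMENT)*
COMMENT     <- ('//' / '#') (!EOL .)* EOL
EOL         <- '\r\n' / '\n' / '\r' / !.
```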
I had forgotten how utterly cool this code is for writing parsers. 👍
Hello Yuji.
Cannot get this to work using master. The intention is to be able to parse a C/C++ style literal string with escaped quotes as in:
// i.e. file content, not a compiled string
" Hello \"Yuji\" "
Any thoughts? The rule is specified like this:
//
RULE <- '"' (LITERAL_ESC_QUOTE / LITERAL_CHAR)* '"'
// i.e. match anything that is not a single quote character
LITERAL_CHAR <- (!["] .)
// this is a string not an escaped quote
LITERAL_ESC_QUOTE <- '\"'
TAIA.
Jerry
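A possible explanation, offered as a guess: in a grammar literal, '\"' denotes just the single character ", because the backslash escapes the quote in the grammar text itself, so LITERAL_ESC_QUOTE and LITERAL_CHAR end up matching the same thing. Escaping the backslash makes the intent explicit (untested sketch):

```peg
RULE              <- '"' (LITERAL_ESC_QUOTE / LITERAL_CHAR)* '"'
# match the two-character sequence backslash + quote
LITERAL_ESC_QUOTE <- '\\' '"'
# match anything that is not a quote character
LITERAL_CHAR      <- (!["] .)
```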
Hi Yuji, would be very interested to get your thoughts on this too.
STRING_LITERAL <- < '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"' >
STRING_LITERAL <- '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"'
STRING_LITERAL <- < '"' (ESC / CHAR)* '"' >
ESC <- ('\\"' / '\\t' / '\\n')
CHAR <- (!["] .)
Questions:
Are the differences in 1 and 2 intentional?
3 is possibly unexpected, but it is consistent throughout, so not a problem. I think this is definitely worth documenting in the readme.
Adding -D_MSC_VER also doesn't work, and I got a lot of errors:
gcc4.9
In file included from D:/mingw32/i686-w64-mingw32/include/combaseapi.h:154:0,
from D:/mingw32/i686-w64-mingw32/include/objbase.h:14,
from D:/mingw32/i686-w64-mingw32/include/ole2.h:17,
from D:/mingw32/i686-w64-mingw32/include/wtypes.h:12,
from D:/mingw32/i686-w64-mingw32/include/winscard.h:10,
from D:/mingw32/i686-w64-mingw32/include/windows.h:97,
from mmap.h:6,
from peglint.cc:11:
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h: In member function 'HRESULT IUnknown::QueryInterface(Q**)':
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: error: expected primary-expression before ')' token
return QueryInterface(__uuidof(Q), (void **)pp);
^
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: error: there are no arguments to '__uuidof' that depend on a template parameter, so a declaration of '__uuidof' must be available [-fpermissive]
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
In file included from D:/mingw32/i686-w64-mingw32/include/urlmon.h:289:0,
from D:/mingw32/i686-w64-mingw32/include/objbase.h:163,
from D:/mingw32/i686-w64-mingw32/include/ole2.h:17,
from D:/mingw32/i686-w64-mingw32/include/wtypes.h:12,
from D:/mingw32/i686-w64-mingw32/include/winscard.h:10,
from D:/mingw32/i686-w64-mingw32/include/windows.h:97,
from mmap.h:6,
from peglint.cc:11:
D:/mingw32/i686-w64-mingw32/include/servprov.h: In member function 'HRESULT IServiceProvider::QueryService(const GUID&, Q**)':
D:/mingw32/i686-w64-mingw32/include/servprov.h:66:46: error: expected primary-expression before ')' token
return QueryService(guidService, __uuidof(Q), (void **)pp);
^
D:/mingw32/i686-w64-mingw32/include/servprov.h:66:46: error: there are no arguments to '__uuidof' that depend on a template parameter, so a declaration of '__uuidof' must be available [-fpermissive]
Same feature supported in go-peg.
I hate opening GitHub issues simply asking for help; however, I've been fighting with a seemingly simple grammar that I've not been able to get working correctly. Luckily, I stumbled across peglint for testing.
File a.peg contains:
Any <- Placeholder / Text
Placeholder <- '${' Int ':' Any '}'
Int <- [0-9]+
Text <- [a-z]+
Running a simple example works as expected:
> peglint.exe --ast --source "${1:hi}" a.peg
+ Any
+ Placeholder
- Int (1)
+ Any
- Text (hi)
A bit more complex example doesn't work:
> peglint.exe --ast --source "${1:hi${2:bye}}" a.peg
[commendline]:1:7: syntax error
I'm using Win7 and VS2015 if that makes a difference at all. Thanks.
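A plausible reading of the failure (a sketch, not a confirmed diagnosis): Text <- [a-z]+ matches "hi", after which Placeholder requires '}' but finds '$', and a PEG choice is never re-tried once an alternative has succeeded. Letting a placeholder body be a sequence of parts avoids that dead end:

```peg
Any         <- Placeholder / Text
Placeholder <- '${' Int ':' Any* '}'
Int         <- [0-9]+
# stop Text before a nested placeholder or a closing brace
Text        <- (!'${' !'}' .)+
```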
Here's the situation:
I think sooner or later I will probably want the ability to parse a context-dependent grammar, so it would be nice to be able to call external functions during the parsing process.
The first issue where this popped up is this:
I want to be able to parse custom operators with custom priority. So the user should be able to say
"× is infix and should be evaluated before + but after *"
But in order to make the priority customizable, the parser must know which operator is on which priority level during parsing. And since the definition of the operator also happens during parsing, it cannot know this during grammar construction.
And in the future I will probably also need to create some sort of registry during parsing where the parser can look things up.
So the best way that I see would be this:
Create a way to call external functions that determine whether there is a match or not. Something like:
Result <- Operand @Operator Operand
parser.definition["Operator"] = [](const char* s, size_t n, SemanticValues& sv, any& dt){
... // do custom stuff like look ups
return matchlen;
}
Do you think this is a good idea and useful in general? Or is there maybe a more elegant solution to my problem? (Like in the end I also didn't need left recursion although I thought I did)
In return I can implement UTF8 support ;)
Hi Yuji, more of a meta question here. I was experimenting to see if I could generate a much more minimal AST by associating specific rules with a custom reduce() function. However, this is not quite working as expected: the peg::SemanticValues& argument is always empty. Does there need to be a functor associated with each and every rule in the grammar? Or have I missed something here?
A minimal example based on the GDB/MI grammar:
auto mknode = [](const peg::SemanticValues& sv, peg::any& arg) -> peg::any
{
if (sv.size())
{
std::cout << sv[0].name << ' ' << sv[0].s << std::endl;
}
return peg::any();
};
peg::any arg;
// set up functor for rules of interest *only*
parser["STRING_LITERAL"] = mknode;
parser["IDENTIFIER"] = mknode;
parser["LBRACE"] = mknode;
parser["RBRACE"] = mknode;
parser["LBRACK"] = mknode;
parser["RBRACK"] = mknode;
if (!parser.parse_n(source.data(), source.size(), arg ))
{
ret = -1;
}
Hi!
Somehow peglint does not work for me like described in the readme:
$ cat a.peg
Additive <- Multitive '+' Additive / Multitive
Multitive <- Primary '*' Multitive / Primary
Primary <- '(' Additive ')' / Number
Number <- < [0-9]+ >
%whitespace <- [ \t]*
$ ./peglint --ast --source "1 + 2 * 3" a.peg
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
When I run in server mode, it shows the page and then crashes:
$ ./peglint --ast --server 8001 --source "1 + 2 * 3" a.peg
Server running at http://localhost:8001/
(now I open the browser and it shows a promising page)
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
I compiled and ran under Ubuntu,
The C compiler identification is GNU 7.2.0
The CXX compiler identification is GNU 7.2.0
All the best!
PS: If you can't reproduce this, I can try to provide more debug information.
Just wanted to know: are you going to make some 'release' or 'tag' on this project? I assume the code is stable enough.
Another idea for a tree-annotated grammar. Note the N: prefix. The 0'th node becomes the parent. Then its children are assigned in numerical order. So this makes it really easy to create new nodes with an arbitrary number of children. (Yuji, this is only a partially-considered scribble right now. I'm just posting it here for more brain fodder)
# annotation equivalent to the default parse tree construction
0:RESULT_LIST <- 1:RESULT (',' 2:RESULT)*
RESULT <- (NAMED_RESULT / ANON_RESULT)
# ASSIGNMENT_OP becomes parent to IDENTIFIER and ANON_RESULT
NAMED_RESULT <- 1:IDENTIFIER 0:(ASSIGNMENT_OP/HASH_OP) 2:ANON_RESULT
ANON_RESULT <- (0:STRING_LITERAL / BRACE_LIST / BRACK_LIST)
# i.e. here we end up with a new RESULT_LIST node, not a BRACE_LIST
BRACE_LIST <- '{' (0:RESULT_LIST)* '}'
BRACK_LIST <- '[' (0:RESULT_LIST)* ']'
ASSIGNMENT_OP <- < '=' >
HASH_OP <- < '#' >
Hi, I noticed that all the parsing methods of peg::parser and peg::Definition are const, which is nice. Could we get const access to the definitions too, in order to parse specific rules via a const peg::parser&?
diff --git a/peglib.h b/peglib.h
index 463cd3b..8050ec2 100644
--- a/peglib.h
+++ b/peglib.h
@@ -3116,10 +3116,14 @@ public:
Definition& operator[](const char* s) {
return (*grammar_)[s];
}
+ const Definition& operator[](const char* s) const {
+ return (*grammar_)[s];
+ }
+
std::vector<std::string> get_rule_names(){
std::vector<std::string> rules;
rules.reserve(grammar_->size());
for (auto const& r : *grammar_) {
rules.emplace_back(r.first);
OK, I plan to contribute a CSV or JSON parser to the examples. However, it is sometimes awkward that we cannot write something like the following:
auto syntax = R"(
ROOT <- _ TOKEN (',' _ TOKEN)*
TOKEN <- < [a-z0-9]+ > _
_ <- [ \t\r\n]*
)";
tree<string> sym ;
peg pg(syntax);
pg["TOKEN"] = [=](const char* s, size_t l, const vector<any>& v){ // this triggers an error because we need mutable lambdas; peglib.h:273 and 276 don't allow this
sym.insert(sym.begin(),string(s,l));
return string(s,l);
}
This would also be useful for generating an AST, or for letting the user manage the data structure. I also have an idea: the PEG grammar is powerful enough to handle some EBNF, with some restrictions on how prioritization is used.
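The underlying C++ issue can be shown without peglib at all: a by-value capture is const inside a non-mutable lambda, so mutating it fails to compile, while a by-reference capture both compiles and updates the original object. A minimal sketch (collect_tokens is an invented name):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// With [=], calling tokens.emplace_back inside a non-mutable lambda is a
// compile error (the captured copy is const). Capturing by reference with
// [&] side-steps the problem and also mutates the outer object, which is
// what the tree-building action above wants anyway.
std::vector<std::string> collect_tokens() {
  std::vector<std::string> tokens;
  auto action = [&](const char* s, std::size_t l) {  // [&] instead of [=]
    tokens.emplace_back(s, l);                       // mutates the outer vector
    return std::string(s, l);
  };
  action("abc", 3);
  action("42", 2);
  return tokens;
}
```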
After discovering the whole issue with left recursion and noticing that I can't live without it, I wonder if you are planning to implement it. I have found that some PEG parsers support it using different techniques:
https://github.com/axilmar/parserlib/blob/master/LEFT_RECURSION.txt
https://github.com/PhilippeSigaud/Pegged/wiki/Left-Recursion
I have very simple grammar to illustrate:
ARRAY <- SPACEONLY? TEST SPACEONLY? ( COMMA SPACEONLY? TEST SPACEONLY?)*
TEST <- NUM
/ VARNAME
COMMA <- ','
NUM <- [0-9]+
VARNAME <- [a-zA-Z0-9_]+
~SPACEONLY <- [ \t]+
I expect that this will match NUM first and, if that fails, go to VARNAME, by the order of the choice expression. Bryan Ford's paper says in the bottom-left paragraph of page 2: "The choice expression ‘e1 / e2’ first attempts pattern e1, then attempts e2 from the same starting point if e1 fails."
Then, the following tokens shall match:
1DA, SA1_1WS
1, SA1_1WS
sa_1,2
1,2,3
Do you consider this case to be by design, or a bug?
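For what it's worth, this is standard PEG behavior: once NUM consumes the "1" of "1DA", the choice is committed and VARNAME is never re-tried, even when the enclosing sequence later fails. A common workaround (an untested sketch) is a negative lookahead on NUM:

```peg
TEST    <- NUM / VARNAME
# NUM must not be immediately followed by an identifier character,
# so "1DA" falls through to VARNAME
NUM     <- [0-9]+ ![a-zA-Z_]
VARNAME <- [a-zA-Z0-9_]+
```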
Thank you for your wonderful project.
As the subject line indicates I have build problems with clang 5.0.0 in debug mode.
[ 18%] Linking CXX executable test-main
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84b85): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84bbf): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84bed): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c1b): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c49): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c77): more undefined references to `peg::enabler' follow
clang-5.0.0: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [test/CMakeFiles/test-main.dir/build.make:95: test/test-main] Error 1
make[1]: *** [CMakeFiles/Makefile2:86: test/CMakeFiles/test-main.dir/all] Error 2
make: *** [Makefile:95: all] Error 2
It works in release mode and gcc works in both modes.
Awesome library, btw. I especially like being able to use parser combinators. It really shoots up the performance.
Mike
I am trying to write a scripting language using this library. In compilers/interpreters of other scripting languages I've used, you can get multiple errors from a single file/class/function. I know that I can check for a lot of errors after the parser (from this lib) has finished (accessing a private var, calling a function that doesn't exist, etc.), but I can't seem to find a way to check for multiple syntax errors, because the parser stops after encountering one. Is there a way to ignore a rule if the parser finds a syntax error in my input?
So here's a simple grammar to parse protobuf:
statements <- statement*
statement <-
"syntax" '=' string ';' /
"import" string ';' /
"package" token ';' /
enum_statement /
message_statement
enum_statement <- "enum" token '{' enum_decl* '}'
enum_decl <- token '=' number ';'
message_statement <- "message" token '{' field* '}'
field <-
type_decl /
repeated_decl /
oneof_decl /
map_decl /
message_statement /
enum_statement
type_decl <- type token '=' number ';'
type <- token ('.' token)*
repeated_decl <- "repeated" type token '=' number ';'
oneof_decl <- "oneof" token '{' type_decl* '}'
map_decl <- "map" '<' type ',' type '>' token '=' number ';'
%word <- token / number
string <- < '"' (!'"' .)* '"' >
token <- < [a-zA-Z_][a-zA-Z0-9_]* >
number <- < [0-9]+ >
%whitespace <- [ \t\r\n]*
The problem is that the AST doesn't record which statement rule was matched; the "syntax" and "import" nodes look the same.
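One way this is commonly handled (an untested sketch; the *_statement rule names are invented): give each alternative its own named rule, so each AST node carries that rule's name:

```peg
statement         <- syntax_statement / import_statement / package_statement /
                     enum_statement / message_statement
syntax_statement  <- "syntax" '=' string ';'
import_statement  <- "import" string ';'
package_statement <- "package" token ';'
```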
Hi Yuji, this one is a bit odd. I'm using the peglint framework so loading grammars from a text file. If I enable AST mode I get a clean parse and a tree etc. If I use the parse() member function I get a syntax error from the same test. Am I missing something obvious?
# a grammar to parse the GDB 'machine interface' (GDB/MI)
GDB_MI <- (GDB_ELEMENT EOL)*
GDB_ELEMENT <- (AT_STRING / NEG_STRING / OP_LIST)
AT_STRING <- '@' STRING_LITERAL
NEG_STRING <- '~' STRING_LITERAL
OP_LIST <- OP_CHAR IDENTIFIER ',' (RESULT_LIST)*
OP_CHAR <- ( '~' / '*' / '=' / '+' / '^')
RESULT_LIST <- RESULT (COMMA RESULT)*
RESULT <- (IDENTIFIER (ASSIGNMENT_OP/HASH_OP))? (STRING_LITERAL / LBRACE (RESULT_LIST)* RBRACE / LBRACK (RESULT_LIST)* RBRACK)
~ASSIGNMENT_OP <- < '=' >
HASH_OP <- < '#' >
LBRACE <- < '{' >
RBRACE <- < '}' >
LBRACK <- < '[' >
RBRACK <- < ']' >
~COMMA <- < ',' >
# GDB/MI strings contain escape sequences. The <> ensures the token
# content is captured in the STRING_LITERAL AST node
STRING_LITERAL <- < '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"' >
# GDB/MI identifiers can contain '-' as in -info-breakpoint
IDENTIFIER <- < [_a-zA-Z] ([_A-Za-z0-9] / '-')* >
# recognize but ignore end of line characters
~EOL <- '\n'
# sent at the end of a sequence
~TERMINATOR <- '(gdb)'
# consume the following during parse. +1!
%whitespace <- [ \t\r]*
Here's the test string
^done,address="0x4000bb28",load-size="116244",transfer-rate="41104",write-rate="369",BreakpointTable={nr_rows="1",nr_cols="6",hdr=[{width="7",alignment="-1",col_name="number",colhdr="Num"},{width="14",alignment="-1",col_name="type",colhdr="Type"},{width="4",alignment="-1",col_name="disp",colhdr="Disp"},{width="3",alignment="-1",col_name="enabled",colhdr="Enb"},{width="10",alignment="-1",col_name="addr",colhdr="Address"},{width="40",alignment="2",col_name="what",colhdr="What"}],body=[bkpt={number="1",type="breakpoint",disp="keep",enabled="y",addr="0x40003e10",func="main",file="../cyfxbulklpauto.c",fullname="r:\\src\\cypress\\fx3\\usbbulkloopauto\\cyfxbulklpauto.c",line="702",thread-groups=["i1"],times="0",original-location="main"}]}
I was playing with this version of your parser https://github.com/yhirose/cpp-peglib/blob/57f866c6ca77f5a5afe37f72942d5526c45d7e87/peglib.h and accidentally found unlimited memory consumption. Consider this example:
#include <cpp-peglib/peglib.h>
#include <iostream>
#include <cstdlib>
using namespace peg;
using namespace std;
int main(int , char** ) try
{
do
{
function<long (const Ast&)> eval = [&](const Ast& ast) {
if (ast.name == "NUMBER") {
return stol(ast.token);
} else {
const auto& nodes = ast.nodes;
auto result = eval(*nodes[0]);
for (auto i = 1u; i < nodes.size(); i += 2) {
auto num = eval(*nodes[i + 1]);
auto ope = nodes[i]->token[0];
switch (ope) {
case '+': result += num; break;
case '-': result -= num; break;
case '*': result *= num; break;
case '/': result /= num; break;
}
}
return result;
}
};
parser parser(R"(
EXPRESSION <- TERM (TERM_OPERATOR TERM)*
TERM <- FACTOR (FACTOR_OPERATOR FACTOR)*
FACTOR <- NUMBER / '(' EXPRESSION ')'
TERM_OPERATOR <- < [-+] >
FACTOR_OPERATOR <- < [/*] >
NUMBER <- < [0-9]+ >
%whitespace <- [ \t\r\n]*
)");
parser.enable_ast();
parser.enable_packrat_parsing();
auto expr = " 2+2*2 ";
shared_ptr<Ast> ast;
if (parser.parse(expr, ast)) {
ast = AstOptimizer(true).optimize(ast);
//cout << ast_to_s(ast);
//cout << expr << " = " << eval(*ast) << endl;
}
}
while(0);
return 0;
}
catch(const std::exception &ex)
{
std::cerr << "Error: " << ex.what() << "\n";
return 1;
}
(Built with g++ (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 8.1.0, Win7, 'g++ -std=c++17 -Wall -Wextra -Wpedantic -g -O0 -fno-inline -fno-omit-frame-pointer -ggdb -isystemK:/1/0/source/cpp-peglib main.cpp -o a.exe'.)
If I make it an infinite loop with while(1), the process memory grows to 200 MB and more within a few minutes. The program with while(0) requires less than 1 MB.
There is the full output of 'drmemory -- a.exe' at https://pastebin.com/raw/5qkLW8Zj. Its important parts:
Error #1: LEAK 172 direct bytes 0x020fbd20-0x020fbdcc + 540 indirect bytes
peglib.h:3269 _ZZN3peg6parser10enable_astINS_7AstBaseINS_9EmptyTypeEEEEERS0_vENKUlRKNS_14SemanticValuesEE_clES8_
Error #3: LEAK 8 direct bytes 0x021051d0-0x021051d8 + 344 indirect bytes
peglib.h:2942 peg::AstBase<>::AstBase
peglib.h:3269 in lambda:
Line 3269 in 57f866c
peglib.h:2942 in c-tor:
Line 2942 in 57f866c
Hello
Great library.
It would be great if you had official releases of cpp-peglib. Even though it's only a header-only library, it would make packaging and versioning possible with vcpkg and/or conan, and even with CMake FetchContent, which fetches content by git tags.
Thank you for your consideration.
Hi Yuji!
I stumbled upon an undetected left recursion. It was hiding deep down in my grammar and I was able to reduce it to the following pattern:
_ <- ' '*
A <- B
B <- _ A
Peglib/Peglint does not see a problem there. However, if I substitute the _ rule, it works fine:
A <- B
B <- ' '* A
lrec.peg:1:6: 'B' is left recursive.
lrec.peg:2:11: 'A' is left recursive.
Hi! I tried to run calc.cc from the example directory. Unfortunately, I received an error message:
$ cd example/
$ g++ -I .. -std=c++11 calc.cc
$ ./a.out 2+3*4
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Aborted (core dumped)
This is both with g++ and with clang++. Apparently this call to std::call_once is responsible:
https://github.com/yhirose/cpp-peglib/blob/21934dd1ce/peglib.h#L1483
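For reference, a self-contained std::call_once check. Whether this matches the crash is an assumption on my part, but a classic cause of std::system_error from std::call_once on Linux is a binary built without -pthread on older glibc, so `g++ -I .. -std=c++11 calc.cc -pthread` may be worth trying:

```cpp
#include <mutex>

// Minimal std::call_once exercise. On older glibc, the first std::call_once
// throws std::system_error at runtime unless the program is linked with
// -pthread, which matches the "Unknown error -1" symptom above.
int call_once_count() {
  std::once_flag flag;
  int count = 0;
  std::call_once(flag, [&] { ++count; });  // runs the callable exactly once
  std::call_once(flag, [&] { ++count; });  // no-op: flag is already set
  return count;
}
```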
Hi! Great job on the library! I love that it's just a single h file! I have a question tho:
A raw string literal in C++ looks like this:
R"CustomDelimiter(any text you could possibly want)CustomDelimiter"
Can this somehow be parsed with cpp-peglib? It would have to be something like this:
CPPRAWSTRING <- 'R"' '(' .* ')' {token0} '"'
Is that possible? If not with a simple grammar, maybe using enter and leave?
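If the version in use supports capture/back-reference syntax ($name< ... > to capture text, $name to match the same text again; worth checking against the README for the installed version before relying on it), the delimiter constraint might be expressible like this (untested sketch, rule names invented):

```peg
# Capture the custom delimiter, then require the same delimiter at the end.
CPPRAWSTRING <- 'R"' $delim< [a-zA-Z_]* > '(' (!STR_END .)* STR_END
STR_END      <- ')' $delim '"'
```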
This comes from a larger project that uses UTF-8 internally, for which I had to read files that could be UTF-8 or UTF-16. While the project is not ready for posting, I have broken off some utilities, including UTF conversion routines (and, in the header file, a link to where I got the information). I hope that this may be of use to you in your efforts.
https://github.com/mjsurette/easyUtils
Mike
I have a grammar for a programming language. It defines %whitespace, because whitespaces are not significant.
Now, I want to parse string literals with a rule like this:
StrQuot <- '"' (StrEscape / StrChars)* '"'
StrEscape <- < '\\' any >
StrChars <- < (!'"' !'\\' any)+ >
StrEscape and StrChars both have actions that produce a std::string, which I combine together in the action of StrQuot. The problem is that the whitespace in the strings is ignored, and thus the resulting string has all the whitespace filtered out.
Is there a way to deactivate locally the %whitespace rule?
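One direction to try (a sketch based on the general rule that %whitespace is applied around tokens, not inside the token operator < >; behavior worth verifying against the version in use): wrap the entire literal in < > so its interior is treated as a single token:

```peg
# The outer < > keeps %whitespace from being applied inside the literal.
StrQuot   <- < '"' (StrEscape / StrChars)* '"' >
StrEscape <- '\\' .
StrChars  <- (!'"' !'\\' .)+
```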
Here's the annotated grammar. Note the ^ prefix. The notation means: take the name and value of the prefixed node and transfer them to its parent, in effect transforming an n-ary node into a suitably configured binary node. Such a rotated/transformed tree is much simpler to walk.
RESULT_LIST <- RESULT (',' RESULT)*
RESULT <- (NAMED_RESULT / ANON_RESULT)
NAMED_RESULT <- IDENTIFIER ^(ASSIGNMENT_OP/HASH_OP) ANON_RESULT
ANON_RESULT <- (STRING_LITERAL / BRACE_LIST / BRACK_LIST)
BRACE_LIST <- '{' (RESULT_LIST)* '}'
BRACK_LIST <- '[' (RESULT_LIST)* ']'
ASSIGNMENT_OP <- < '=' >
HASH_OP <- < '#' >
Standard output:
+ RESULT (NAMED_RESULT)
- IDENTIFIER: 'times'
- ASSIGNMENT_OP: '='
- ANON_RESULT (STRING_LITERAL): '"0"'
+ RESULT (NAMED_RESULT)
- IDENTIFIER: 'original-location'
- ASSIGNMENT_OP: '='
- ANON_RESULT (STRING_LITERAL): '"main"'
Transformed output:
+ ASSIGNMENT_OP: '='
- IDENTIFIER: 'times'
- ANON_RESULT (STRING_LITERAL): '"0"'
+ ASSIGNMENT_OP: '='
- IDENTIFIER: 'original-location'
- ANON_RESULT (STRING_LITERAL): '"main"'
peglint exits with an error when trying to reproduce tutorial example:
./peglint --ast --opt --source "1 + 2 * 3" a.peg
terminate called after throwing an instance of 'std::system_error'
what(): Unknown error -1
Seems to be a regression of #46
My environment is: Linux Mint 19.1, gcc 7.3.0.
For the CSV grammar in the given example, it core dumps when forming an AST:
file <- (header NL)? record (NL record)* NL?
header <- name (COMMA name)*
record <- field (COMMA field)*
name <- field
field <- escaped / non_escaped
escaped <- DQUOTE (TEXTDATA / COMMA / CR / LF / D_DQUOTE)* DQUOTE
non_escaped <- TEXTDATA*
COMMA <- ','
CR <- '\r'
DQUOTE <- '"'
LF <- '\n'
NL <- CR LF / CR / LF
TEXTDATA <- !([",] / NL) .
D_DQUOTE <- '"' '"'
#0 0x000000000043267a in std::_Hashtable<std::string, std::pair<std::string const, peg::Definition>, std::allocator<std::pair<std::string const, peg::Definition> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/hashtable.h:369
#1 0x0000000000429eae in std::_Hashtable<std::string, std::pair<std::string const, peg::Definition>, std::allocator<std::pair<std::string const, peg::Definition> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/hashtable.h:455
#2 0x00000000004216f6 in std::unordered_map<std::string, peg::Definition, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, peg::Definition> > >::begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/unordered_map.h:249
#3 0x00000000004237e9 in peg::parser::enable_ast<peg::AstBase<peg::EmptyType> > (this=0x7ffe1c236ca0) at cpp-peglib/peglib.h:3254
If the (?) operator is not used, it doesn't crash. Is there any reason not to use the (?) operator?
Hey Yuji!
While I was designing the grammar for my language, I found that many rules require the same actions.
Consider this:
sentence <- subject verb object '.'
question <- verb subject something '?'
quote <- subject verb '"' (sentence/question) '."'
subject <- word
verb <- word
object <- word
something <- word
word <- [a-zA-Z]*
Now, the word things have to return their matched strings and the sentence rules (1 to 3) need to perform some sort of concatenation, so I kind of need two different default rules. Of course I can assign them like this:
parser["sentence"] = parser["question"] = parser["quote"] = [](sv){...};
parser["subject"] = parser["verb"] = parser["object"] = parser["something"] = [](sv){...};
or maybe make one of the two the default rule.
(This example is a bit stupid, in reality I need many more default rules)
But I was thinking, maybe it would be nice to be able to specify some forwarding in the grammar already:
sentence>>concat <- subject verb object '.'
question>>concat <- verb subject something '?'
quote>>concat <- subject verb '"' (sentence/question) '."'
subject>>match <- word
verb>>match <- word
object>>match <- word
something>>match <- word
word <- [a-zA-Z]*
parser["concat"] = [](sv){...};
parser["match"] = [](sv){...};
I'm not sure if the syntax is confusing tho. Maybe rather
name: forward <- pattern
Parsley does it like this:
name = pattern -> action
which I find more intuitive but it's no longer original PEG syntax so I think it's unacceptable.
You think this might be a good idea? It's nothing that I desperately need but maybe a useful feature.
Cheers!
For example: -2 + 3
Hi Yuji,
Something of a 'nice to have': Some statistics on the amount of backtracking during a parse might suggest better/optimal rule ordering (?)
Hi Yuji
In the new whitespace branch, this needs to be static I believe.
At least in the way I structured my parse actions, I think it could be useful at times to build onto semantic values from previous actions by modifying or reusing data instead of copying it. For example:
// A <- (rule for matching/creating A)
parser["A"] = [](const peg::SemanticValues& sv) -> StructA {
return { /* using data from sv string */ }
}
// ModA <- (rule for matching a modification of A)
parser["ModA"] = [](const peg::SemanticValues& sv) -> /* StructA */ {
/* modify StructA within sv[0] and let sv[0] be passed on as usual? */
}
// B <- (rule for matching/creating B from A)
parser["B"] = [](const peg::SemanticValues& sv) -> StructB {
return { /* std::move data from sv[0] for creating StructB efficiently? */ }
}
In order to do this, I'd need a T& from the const any& items returned by sv[], but the references from any::get<T>() const are const T&. If it is guaranteed that the parser itself doesn't utilize the semantic value contents, wouldn't it be safe to drop const from the reference returned by sv[].get<T>()?
If this is true, could we add this guarantee to SemanticValues as a getter for non-const references? Maybe something like:
template<typename T> T& SemanticValues::value(size_t index = 0) const {
return const_cast<T&>(operator[](index).get<T>());
}
The only other problem I could think of is if semantic values were shared between multiple parse handlers, which shouldn't happen since there is only one parse action per match (input is not shared) and AST nodes don't have multiple parents (output is not shared).
When parsing some grammar, I would like to keep parsing and only record minor semantic issues as warnings at the end, along with the line/column numbers of the warnings. It would therefore be nice to have the line/column information available during a semantic action.
However, the only way I can see to get at the line/column information is to actually throw the parse_error exception and let the log function be called.
Any suggestions?
Hey Yuji!
I have encountered a strange bug:
If I have this grammar
term <- ( ws atom ws op )* ws atom ws
op <- '+'
ws <- ' '*
atom <- [0-9]*
and parse the text "99", then the "term" action is called with 5 semantic values although there should only be 3:
for (auto& rule : parser.get_rule_names()) {
    parser[rule.c_str()] = [rule](const SemanticValues& sv, any&) {
        cout << "rule " << rule << " called\n";
        for (size_t i = 0; i < sv.size(); i++) {
            cout << "  sv[" << i << "] = " << sv[i].get<string>() << "\n";
        }
        return rule;
    };
}
produces
rule ws called
rule atom called
rule ws called
rule ws called
rule atom called
rule ws called
rule term called
sv[0] = atom
sv[1] = ws
sv[2] = ws
sv[3] = atom
sv[4] = ws
Or is there something I don't get?
Hey! I am having a problem where a simple syntax fails to parse with the MSVC compiler but seems to work with GCC. Compiling the following code with MSVC (x64) versions 19.00.24213.1 and 18.00.31101 results in a parsing error, i.e. !parser == true. C++14 support is enabled. If 'in?' is removed from the op_cmd rule, it starts working again. This works with GCC 5.3.0. I used the latest cpp-peglib commit 5e67fcb. Any ideas what could be going wrong or how I could debug it further?
#include <peglib.h>
#include <boost/log/trivial.hpp>
int main()
{
const auto testSyntax =
R"(
main <- op_cmd
# Set interpolation mode
intpol_lin <- 'G01'
intpol_cw <- 'G02'
intpol_ccw <- 'G03'
in <- intpol_lin / intpol_cw / intpol_ccw
# Operations
move_intpol <- 'D01'
move <- 'D02'
flash <- 'D03'
sel <- [XYIJ]
# If 'in?' then it doesn't crash with VS2015
op_cmd <- in? (sel coord)+ (move_intpol / move / flash) '*'
# General token identifiers
digit <- [0-9]
coord <- ('+' / '-')? digit+
)";
peg::parser parser(testSyntax);
if(!parser)
{
BOOST_LOG_TRIVIAL(error) << "Parser syntax error!";
}
}
I'll try to use as little code as possible to explain the problem I ran into.
For example the parser expressions are as below:
ASSIGN <- "Set" TYPE IDSTR '=' EXPRESSION
TYPE <- ["Interger""Decimal"]
IDSTR <- [_A-Za-z][_A-Za-z0-9]*
EXPRESSION <- ...brabrabra, just return double value...
And I want to parse "SetDecimalvariable=50.0"
The sv that I get in ASSIGN is as follows:
sv.size() == 3                                  // fine, (TYPE, IDSTR, EXPRESSION) are three elements
sv.str_c() == "SetDecimalvariable=50.0"         // fine, the original data
sv[0].get<string>() == "Decimalvariable=50.0"   // weird, it should be "Decimal" as the first element
sv[1].get<string>() == "ecimalvariable=50.0"    // more weird, it should be "variable" as the second element
sv[2].get<Ele>() == 50.0                        // fine, it's the third element
Am I misunderstanding something?
Thanks for the project, it helps me a lot.
[praise]
I have been using cpp-peglib for a few weeks now, and this library works tremendously well. The API is easy to grasp and its behavior is very predictable. I originally started writing my grammar (a subset of Python's grammar) using boost::spirit, and I stopped when things became unmanageable. I can do a lot more, and more easily, with cpp-peglib, so thank you for that.
[/praise]
One thing I was not happy with in cpp-peglib is that, when I got my grammar wrong, the only thing I saw was a crash at runtime (access to a nullptr). I finally looked more closely at the sources, and I found two useful things:
1. It is possible to construct a parser with no arguments and to call load_grammar afterwards. This returns a bool that tells whether the grammar is ok.
2. It is possible to install a logger (parser::log) to get the details of what is wrong with the grammar.
It took me a few weeks to realize that these tools were already there, waiting for me to use them. My suggestion is to use them in the example code of the readme file:
Instead of:
parser parser(syntax);
Use something like this:
parser parser;
parser.log = [](size_t line, size_t col, const string& msg) {
    cerr << line << ":" << col << ": " << msg << "\n";
};
bool ok = parser.load_grammar(grammar);
assert(ok);
It is not as pretty as the current version, but it would be a huge help for users trying to understand the mistakes in their grammars. Alternatively, there could be a default logger installed on all parser instances, which could be deactivated if needed.
Hi, I'm currently reading the pl0 language example and found something really confusing: in this file, all the switch cases are followed by an underscore. Is this a DSL from this lib or a new C++11 feature? Thanks in advance :)
I set enablePackratParsing = 1, and my PEG is:
auto syntax = R"xyz(decl<-decl_specs init_declarator_list ';'
decl_specs<-'char' / 'short' / 'int' / 'long' / 'float'
init_declarator_list<-direct_declarator
/ direct_declarator '=' initializer
direct_declarator<-id
initializer<-additive
additive<-left:multiplicative "+" right : additive
/ multiplicative
multiplicative<-left : primary "*" right : multiplicative
/ primary
primary<-primary_exp
/ "(" additive : additive ")"
primary_exp<-id
/ const)xyz";
parser parser(syntax);
When I debug, I set a breakpoint after "parser parser(syntax);", but it crashed.