evanphx / kpeg

A simple PEG library for ruby

License: BSD 3-Clause "New" or "Revised" License

Ruby 99.36% Vim Script 0.64%

kpeg's Introduction

kpeg

home: github.com/evanphx/kpeg
bugs: github.com/evanphx/kpeg/issues

Description

KPeg is a simple PEG library for Ruby. It provides an API as well as a native grammar format for building grammars.

KPeg strives to provide a simple, powerful API without being too exotic.

KPeg supports direct left recursion of rules via the OMeta memoization trick.
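The general idea behind this kind of memoization can be sketched in plain Ruby. The class below is a hypothetical, hand-written illustration and not kpeg's generated code: it plants a failing memo entry for the rule, then re-evaluates the rule body repeatedly, keeping each longer match as the new "seed". The grammar (expr = expr "-" num | num) and all names are assumptions made for the example.

```ruby
# Hand-written sketch of seed-growing memoization for direct left recursion.
# Hypothetical grammar:  expr = expr "-" num | num     num = /[0-9]+/
class LeftRecursionSketch
  Memo = Struct.new(:ans, :pos) # ans is nil while the rule counts as failed

  attr_reader :pos

  def initialize(input)
    @input = input
    @pos = 0
    @memo = {}
  end

  def parse
    @pos = 0
    expr
  end

  # num = /[0-9]+/
  def num
    m = /\A[0-9]+/.match(@input[@pos..-1])
    return nil unless m
    @pos += m[0].size
    m[0].to_i
  end

  # The body of expr, written left-recursively on purpose.
  def expr_body
    start = @pos
    lhs = expr # left-recursive call, answered from the memo table
    if lhs && @input[@pos] == "-"
      @pos += 1
      rhs = num
      return lhs - rhs if rhs
    end
    @pos = start # first alternative failed, try the second one
    num
  end

  def expr
    start = @pos
    if (m = @memo[start])
      @pos = m.pos
      return m.ans
    end
    # Plant a failing seed so the recursive call terminates immediately.
    m = @memo[start] = Memo.new(nil, start)
    loop do
      @pos = start
      ans = expr_body
      # Stop growing once the body fails or no longer consumes more input.
      break if ans.nil? || @pos <= m.pos
      m.ans = ans
      m.pos = @pos
    end
    @pos = m.pos
    m.ans
  end
end

parser = LeftRecursionSketch.new("10-3-2")
p parser.parse # left-associative: (10 - 3) - 2
```

Because each round reuses the previous result as the left operand, the match grows from the left and the subtraction comes out left-associative.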

Writing your first grammar

Setting up your grammar

All grammars start with the class/module name that will be your parser

%% name = Example::Parser

After that a block of ruby code can be defined that will be added into the class body of your parser. Attributes that are defined in this block can be accessed within your parser as instance variables. Methods can also be defined in this block and used in action blocks as well.

%% {
  attr_accessor :something_cool

  def something_awesome
    # do something awesome
  end
}

Defining literals

Literals are static declarations of characters or regular expressions designed for reuse in the grammar. These can be constants or variables. Literals can take strings, regular expressions or character ranges.

ALPHA = /[A-Za-z]/
DIGIT = /[0-9]/
period = "."
string = "a string"
regex = /(regexs?)+/
char_range = [b-t]

Literals can also accept multiple definitions

vowel = "a" | "e" | "i" | "o" | "u"
alpha = /[A-Z]/ | /[a-z]/

Defining Rules for Values

Before you can start parsing a string you will need to define rules that you will use to accept or reject that string. There are many different types of rules available in kpeg.

The most basic of these rules is a string capture

alpha = < /[A-Za-z]/ > { text }

While this looks very much like the ALPHA literal defined above, it differs in one important way: the text captured between the < and > symbols is available as the text variable in the block that follows. You can also explicitly name the variable you would like to use, but only with existing rules or literals.

letter = alpha:a { a }

Additionally blocks can return true or false values based upon an expression within the block. To return true if a test passes do the following:

match_greater_than_10 = < num:n > &{ n > 10 }

To test and return a false value if the test passes do the following:

do_not_match_greater_than_10 = < num:n > !{ n > 10 }

Rules can also act like functions and take parameters. An example of this is lifted from the Email List Validator, where an ascii value is passed in and the character is evaluated against it returning a true if it matches

d(num) = <.> &{ text[0] == num }
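One Ruby detail is worth keeping in mind when reading this rule: on Ruby 1.9 and later, indexing a String with an integer returns a one-character String rather than a character code. The sketch below is plain Ruby, independent of kpeg, and first_byte_is? is a hypothetical helper name; it shows how a comparison against a numeric ASCII value could be written on current Rubies.

```ruby
# Hypothetical helper, not part of kpeg: compare the first byte of a
# captured string against a numeric character code.
def first_byte_is?(text, num)
  text.getbyte(0) == num
end

text = "A"
p text[0]                   # a one-character String on Ruby 1.9+
p text[0] == 65             # false: a String never equals an Integer
p first_byte_is?(text, 65)  # true: getbyte returns the byte value
```

String#ord would work as well for single characters; getbyte is used here because it stays byte-oriented regardless of the string's encoding.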

Rules support some regular expression syntax for matching

maybe ?
many +
kleene *
groupings ()

Examples:

letters = alpha+
words = alpha+ space* period?
sentence = (letters+ | space+)+

Kpeg also allows a rule to define the acceptable number of matches in the form of a range. In regular expressions this is often denoted with syntax like {0,3}. Kpeg expresses match ranges with square brackets instead: [min, max].

matches_3_to_5_times = letter[3,5]
matches_3_to_any_times = letter[3,*]
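For readers more familiar with regular expressions, the two ranges above correspond roughly to the following Ruby patterns. This is only an analogy written in plain Ruby; the constant names are made up for the example and nothing here calls into kpeg.

```ruby
# Regular-expression analogues of letter[3,5] and letter[3,*].
LETTER = /[A-Za-z]/

# letter[3,5]: at least 3 and at most 5 repetitions
THREE_TO_FIVE = /\A(?:#{LETTER.source}){3,5}\z/

# letter[3,*]: at least 3 repetitions, no upper bound
THREE_OR_MORE = /\A(?:#{LETTER.source}){3,}\z/

%w[ab abc abcde abcdef].each do |word|
  puts "#{word}: 3..5=#{THREE_TO_FIVE.match?(word)} 3+=#{THREE_OR_MORE.match?(word)}"
end
```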

Defining Actions

As illustrated in some of the examples above, kpeg allows you to perform actions when a rule matches. Actions are written in a block attached to the rule or in the rule definition itself.

num = /[1-9][0-9]*/
sum = < num:n1 "+" num:n2 > { n1 + n2 }

As of version 0.8 an alternate syntax has been added for calling defined methods as actions.

%% {
  def add(n1, n2)
    n1 + n2
  end
}
num = /[1-9][0-9]*/
sum = < num:n1 "+" num:n2 > ~add(n1, n2)

Referencing an external grammar

Kpeg allows you to run a rule that is defined in an external grammar. This is useful if there is a defined set of rules that you would like to reuse in another parser. To do this, create your grammar and generate a parser using the kpeg command line tool.

kpeg literals.kpeg

Once you have the generated parser, include that file into your new grammar

%{
  require "literals.kpeg.rb"
}

Then create a variable to hold the foreign interface and pass it the class name of your parser. In this case the parser class name is Literal

%foreign_grammar = Literal

You can then use rules defined in the foreign grammar in the local grammar file like so

sentence = (%foreign_grammar.alpha %foreign_grammar.space*)+
           %foreign_grammar.period

Comments

Kpeg allows comments to be added to the grammar file by using the # symbol

# This is a comment in my grammar

Variables

A variable looks like this:

%% name = value

Kpeg allows the following variables that control the output parser:

name: The class name of the generated parser.
custom_initialize: When built as a standalone parser a default initialize method will not be included.

Directives

A directive looks like this:

%% header {
  ...
}

Kpeg allows the following directives:

header: Placed before any generated code
pre-class: Placed before the class definition to provide a class comment
footer: Placed after the end of the class (for requiring files dependent upon the parser’s namespace)

Generating and running your parser

Before you can generate your parser you will need to define a root rule. This will be the first rule run against the string provided to the parser

root = sentence

To generate the parser run the kpeg command with the kpeg file(s) as an argument. This will generate a ruby file with the same name as your grammar file.

kpeg example.kpeg

Require the generated parser file in the application that will use it. Create a new instance of the parser and pass in the string you want to evaluate. Calling parse on the parser instance returns true if the string matches and false if it does not.

require "example.kpeg.rb"

parser = Example::Parser.new(string_to_evaluate)
parser.parse

Shortcuts and other techniques

Per vito, you can get the current line or current column in the following way

line = { current_line }
column = { current_column }
foo = line:line ... { # use line here }
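To make the notion of "current line" and "current column" concrete, here is a plain-Ruby sketch that derives both from a string and a character offset. It is not kpeg's implementation: line_and_column is a hypothetical helper, and the 1-based numbering for both line and column is an assumption chosen for the example.

```ruby
# Hypothetical helper, independent of kpeg: compute a 1-based line and
# 1-based column for the character at offset pos in string.
def line_and_column(string, pos)
  before = string[0, pos]                 # everything ahead of pos
  line = before.count("\n") + 1           # newlines seen so far, plus one
  last_newline = before.rindex("\n")      # nil when still on the first line
  column = last_newline ? pos - last_newline : pos + 1
  [line, column]
end

source = "ab\ncd"
p line_and_column(source, 0) # position of "a"
p line_and_column(source, 3) # position of "c"
p line_and_column(source, 4) # position of "d"
```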

AST Generation

As of Kpeg 0.8 a parser can now generate an AST. To define an AST node use the following syntax

%% assign = ast Assignment(name, value)

Once you have a defined AST node, it can be used in your grammar like so

assignment = identifier:i space* "=" space* value:v ~assign(i,v)

This will create a new Assignment node that you can add into your AST.
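As a rough mental model only, the node and its constructor helper can be pictured as the plain Ruby below. The AST module name, the use of Struct, and the top-level assign method are all assumptions made for illustration; the code kpeg actually generates may be shaped differently.

```ruby
# Hypothetical stand-in for a generated AST node (not kpeg output).
module AST
  Assignment = Struct.new(:name, :value)
end

# Hypothetical stand-in for the helper that ~assign(i, v) would call.
def assign(name, value)
  AST::Assignment.new(name, value)
end

node = assign("x", "42")
puts "#{node.name} = #{node.value}"
```

An application that walks the tree would then dispatch on the node class and read its fields, for example node.name and node.value here.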

For a good example of usage check out Talon

Examples

There are several examples available in the /examples directory. The upper parser has a readme with a step by step description of the grammar.

Projects

Dang

Email Address Validator

Callisto

Doodle

Kanbanpad (uses kpeg for parsing of the ‘enter something’ bar)

kpeg's Issues

allow flags for regexs in kpeg format file

Currently impossible to do unicode regexps like /abc/u.

missing 1.0.0

I found that current master and kpeg-1.0.0.gem https://rubygems.org/gems/kpeg/versions/1.0.0 is a bit different.
I guess c30d49b is not committed into the master branch. Is there a problem?

Items may be duplicated when a single item is followed by a * of items

With this grammar:

%% name = T

%% header {
require 'pp'
}

%% {
  def initialize
    super "ta"
  end

  def test
    raise_error unless parse

    pp @result
  end
}

root = Compound

Compound = thing:thing item*:after {
  [thing, after].flatten.compact
}

thing = t
item = a | t

a = <"a">
t = <"t"> { text }

When run:

$ kpeg -f t.kpeg -o t.rb && ruby -r ./t.rb -e T.new.test
Wrote T to t.rb
["t", "t"]

I expect the result to be ["t"] as there is only one "t" in the parsed input

a question (or feature request)

hi,

is there a possibility to count the number of repetitions:

space1= " "
space2= "[SPACE]"
spacese= < (space1 | space2)+ > {how to get the number spaces? text.size won't work for this}

maybe something like rule.number_of_repetitions could be included (or does it already exist?)

spacese= < (space1 | space2)+ > { rule.number_of_repetitions}

thanx

artur

allow detecting/controlling failure if input not completely parsed

Some way to detect eof would be useful to ensure an input source is parsed completely. Example:

root = (foo+):content eof { content}

Related: currently, if you have a typo in a kpeg format, it'll choke, but silently stop there and generate from what was successfully parsed.

Ruby 1.9 Regexp.new lang parameter can only be 'n' or 'N'

KPeg::LiteralRegexp at around line lib/kpeg/grammar.rb:110

    if opts
      opts.split("").each do |o|
        case o
        when "n", "N", "e", "E", "s", "S", "u", "U"
          lang = o
        when "m"
          flags |= Regexp::MULTILINE
        when "x"
          flags |= Regexp::EXTENDED
        when "i"
          flags |= Regexp::IGNORECASE
        end
      end
    end

    @regexp = Regexp.new(reg, flags, lang)

The 1.9 Pickaxe book says on pages 654-655 (see footnote 5 on page 655) that only options "n" and "N" are allowed.

As a result we get this warning in rake test:

 ...............kpeg/lib/kpeg/grammar.rb:126: warning: encoding option is ignored - u

Also kpeg/test/test_kpeg_format.rb test_regexp_options at line 62 uses the 'u' option.

 assert_rule G.reg(/foo/u), match('a=/foo/u')

Change the 'u' to 'n':
assert_rule G.reg(/foo/n), match('a=/foo/n')

and rake test works.

Documentation

Is there any documentation for kpeg?

\r is not escaped in kpeg output

Instead of "\r", a raw carriage return is placed in the output.

See http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/lib/rdoc/markdown.rb?r1=38102&r2=38101&pathrev=38102

Source grammar is here:

https://github.com/rdoc/rdoc/blob/master/lib/rdoc/markdown.kpeg

Line calculation issues with v1.2.0

With this grammar + support code:

%% name = Bug

%% {
  attr_reader :parsed
  alias parsed? parsed

  def self.test
    expander = self.new "${name}"
    expander.parse

    # ↓↓↓ this line shows the bug
    expander.raise_error unless expander.parsed?
    # ↑↑↑

    expander.result
  end

  def initialize(string)
    @string = string
    @parsed = nil

    super @string
  end

  def parse
    @parsed = super
  end
}

root = variable

variable = "$" variable_name

variable_name = < /\w+/ >

When compiled with kpeg 1.1.0 I get a ParseError as I expect:

$ kpeg _1.1.0_ -f -o bug.rb bug.kpeg && ruby -I /usr/local/lib/ruby/gems/2.6.0/gems/kpeg-1.1.0/lib -r ./bug -e Bug.test
Wrote Bug to bug.rb
/usr/local/lib/ruby/gems/2.6.0/gems/kpeg-1.1.0/lib/kpeg/compiled_parser.rb:109:in `raise_error': @1:2 failed rule 'variable_name', got '{' (KPeg::CompiledParser::ParseError)
	from /Users/erichodel/tmp/kpeg-1.2.0-bug/bug.rb:13:in `test'
	from -e:1:in `<main>'

With kpeg 1.2.0 I get a NoMethodError instead:

$ kpeg _1.2.0_ -f -o bug.rb bug.kpeg && ruby -r ./bug -e Bug.test
Wrote Bug to bug.rb
/usr/local/lib/ruby/gems/2.6.0/gems/kpeg-1.2.0/lib/kpeg/compiled_parser.rb:95:in `failure_oneline': undefined method `[]' for nil:NilClass (NoMethodError)
	from /usr/local/lib/ruby/gems/2.6.0/gems/kpeg-1.2.0/lib/kpeg/compiled_parser.rb:109:in `raise_error'
	from /Users/erichodel/tmp/kpeg-1.2.0-bug/bug.rb:13:in `test'
	from -e:1:in `<main>'

The error comes from KPeg::CompiledParser#failure_oneline. I only have one line here (lines[0]), but l is 2 for some reason (lines[1]), so this gives a NoMethodError due to some sort of off-by-one.

small typo in examples/calculator/calculator.kpeg

Hi,

I noticed a small typo in your calculator: For the subtract term, you have termLt2 instead of term:t2.
Here are the relevant lines from the git diff:

```
| term:t1 - "-" - termLt2 { t1 - t2 }
```
```
| term:t1 - "-" - term:t2 { t1 - t2 }
```

Not really worthy of a PR.. Just thought I would mention it here.

Regards, Ed

non existing files cause error during `gem build`

While executing gem build kpeg.gemspec I see this error:

ERROR:  While executing gem ... (Gem::InvalidSpecificationException)
    [".gemtest", ".travis.yml"] are not files
        /usr/lib/ruby/3.0.0/rubygems/specification_policy.rb:478:in `error'
        /usr/lib/ruby/3.0.0/rubygems/specification_policy.rb:298:in `validate_non_files'
        /usr/lib/ruby/3.0.0/rubygems/specification_policy.rb:73:in `validate_required!'
        /usr/lib/ruby/3.0.0/rubygems/specification_policy.rb:44:in `validate'
        /usr/lib/ruby/3.0.0/rubygems/specification.rb:2635:in `validate'
        /usr/lib/ruby/3.0.0/rubygems/package.rb:298:in `build'
        /usr/lib/ruby/3.0.0/rubygems/package.rb:136:in `build'
        /usr/lib/ruby/3.0.0/rubygems/commands/build_command.rb:99:in `build_package'
        /usr/lib/ruby/3.0.0/rubygems/commands/build_command.rb:89:in `build_gem'
        /usr/lib/ruby/3.0.0/rubygems/commands/build_command.rb:69:in `execute'
        /usr/lib/ruby/3.0.0/rubygems/command.rb:323:in `invoke_with_build_args'
        /usr/lib/ruby/3.0.0/rubygems/command_manager.rb:185:in `process_args'
        /usr/lib/ruby/3.0.0/rubygems/command_manager.rb:149:in `run'
        /usr/lib/ruby/3.0.0/rubygems/gem_runner.rb:51:in `run'
        /usr/bin/gem:13:in `<main>'

When I remove the two listed files from line 15 of the kpeg.gemspec file it works.

"foo":cap doesn't capture "foo"

Given this rule:

expression = (primitive:lhs - "+":op - primitive:rhs) { [lhs, op, rhs] }

And this test code:

p = MyParser.new("1+2")
p.parse
puts p.result

The result is [1, 1, 2]. Looking at the generated code, I see this:

_tmp = match_string("+")
op = @result
unless _tmp
  self.pos = _save
  break
end

But match_string doesn't set @result, so op gets the value of the previous capture.

Relatively slow performance

I have written a JavaScript parser (wycats/parsejs) using kpeg, and get between 15k and 30k bytes per second when parsing JavaScript files. This seems slow and it would be nice to figure out a way to improve performance.

There is a good sample file in the fixtures in the ParseJS at https://github.com/wycats/parsejs/blob/master/spec/fixtures/sizzle.js

Here's a simple benchmark:

require "benchmark"
require "parsejs"

parser = ParseJS::Parser.new(File.read("sizzle.js"))

Benchmark.bm do |x|
  x.report("sizzle") { parser.parse }
end

Here is the output I get:

~/Code/parsejs ‹ruby-1.8.7›  ‹master*› $ rvm 1.8.7
~/Code/parsejs ‹ruby-1.8.7›  ‹master*› $ ruby -rubygems benchmark.rb
      user     system      total        real
sizzle  3.790000   0.060000   3.850000 (  3.840110)
~/Code/parsejs ‹ruby-1.8.7›  ‹master*› $ rvm 1.9.3
~/Code/parsejs ‹ruby-1.9.3›  ‹master*› $ ruby -rubygems benchmark.rb
       user     system      total        real
sizzle  1.380000   0.020000   1.400000 (  1.408474)

version in gemspec is not up-to-date

The version in the kpeg.gemspec file still lists 1.0.0.20140103162640 even though the latest released version is 1.3.1. This makes it harder to build from source, as you have to patch the gemspec file first.

Option --debug broken because KPeg::FORMAT not defined

$ kpeg --test --debug lib/kpeg/format.kpeg 

An exception occurred running /home/rocky-rvm/.rvm/gems/rbx-head/bin/kpeg
    Missing or uninitialized constant: KPeg::FORMAT (NameError)

Backtrace:
          Module#const_missing at kernel/common/module.rb:543
              main.__script__ at /home/rocky-rvm/.rvm/gems/rbx-head/gems
                                 /kpeg-0.1/bin/kpeg:64
           Kernel(Object)#load at kernel/common/kernel.rb:727
              main.__script__ at /home/rocky-rvm/.rvm/gems/rbx-head/bin
                                 /kpeg:19
Rubinius::CodeLoader#load_script at kernel/delta/codeloader.rb:67
Rubinius::CodeLoader.load_script at kernel/delta/codeloader.rb:91
       Rubinius::Loader#script at kernel/loader.rb:572
         Rubinius::Loader#main at kernel/loader.rb:676
         Rubinius::Loader.main at kernel/loader.rb:715
             Object#__script__ at kernel/loader.rb:726
$

comments in kpeg format

There doesn't seem to be any support for comments in a kpeg format at the moment.

Make Ruby 1.9 aware?

rake test doesn't work on 1.9.2.

For example, in 1.8 get_byte seems to return ascii ints, while in 1.9 you get the equivalent char.

Quoted curleys in action blocks cause parse failure

See https://github.com/evanphx/kpeg/blob/beautifier/test/test_kpeg_format.rb#L326

Labeled capture groups aren't working

I think I found a bug in KPeg 1.1.0 (Ruby 2.3.1p112 on Ubuntu).

I have the following lines in a grammar:

DIGIT = /[0-9]/
digits = DIGIT+
integer = < negative_sign? digits:num > { puts "num: #{num}, text: #{text}"}

When the integer pattern is matched, the block is called as expected, but the output is

num: , text: 575

Breaking into the block with the debugger reveals that, although num exists as a local variable (it's in binding.local_variables, for example), it is always nil even when the digits pattern matches. Help?

tests not being executed on `rake test`

When running rake test, the command doesn't output anything and no test is executed.

anonymous tags in kpeg format

Some way to just say g.t(...) in a kpeg format would be nice; you can almost do foo:-, but that results in something like - = ... in the generated code, which is a parse error.

no git tags for released version

Hi, https://rubygems.org/gems/kpeg lists version 1.3.1 as the latest, but https://github.com/evanphx/kpeg/tags only lists 1.1.0 as the latest version.

I'm trying to package this library for Arch Linux and we use the git repository tags/releases as source for our packages, so these tags would be really appreciated.

git tag for 1.3.3 is missing

On https://rubygems.org/gems/kpeg version 1.3.3 is released but the repository doesn't have a git tag for it. This is bad if linux distributions want to package the gem from the sources.

Not so much an issue as a question

I am writing a translator which requires I be able to walk an abstract syntax tree based on a PEG. Can kpeg help create such an AST for the application to walk? If so, where in the documentation can I find that information? Thanks in advance.

Can't include a single ' in a curly block

For RDoc I was given a patch which added a "'" in the header section for the parser (a contraction in the comment block). This resulted in a syntax error.

I came up with this test:

diff --git a/test/test_kpeg_format.rb b/test/test_kpeg_format.rb
index 1be9af8..2157d09 100644
--- a/test/test_kpeg_format.rb
+++ b/test/test_kpeg_format.rb
@@ -431,6 +431,24 @@ a=b
     assert_equal expected, m.directives
   end

+  def test_parser_directive_single_quote
+    m = match <<-GRAMMAR
+%% header {
+# It's a bug I found
+}
+
+a=b
+    GRAMMAR
+
+    assert_rule G.ref("b"), m
+
+    expected = {
+      "header" => KPeg::Action.new("\n# It's a bug I found\n")
+    }
+
+    assert_equal expected, m.directives
+  end
+
   def test_parser_setup
     m = match "%% { def initialize; end }\na=b"
     assert_rule G.ref("b"), m

I'm not sure if it's possible to change curly to support unmatched ' though. Nothing came to mind after messing with it for a few minutes.

access to line/position information via the API

Some way to get the starting position of a match (from within a callback block?) would be useful, especially for languages targeting the Rubinius VM.

Grammars should be able to end with a comment

Currently the grammar parser will not allow a grammar to end with a comment and not have a subsequent newline.
%% name = Comment
word = < /\w/ > { text }
root = word+
# Test

Enabling debugging does nothing in a built compiler

After generating a parser, calling setup_parser text, true does not emit debugging information when parsing the text.

Looking at KPeg::CompiledParser#setup_parser, the debug argument is not used.