lrgrep's Introduction

Syntax error analyser

This repository provides several tools for working on the error messages of a Menhir-generated parser.

The main tool is lrgrep. It takes:

  • a compiled Menhir grammar (a .cmly file, produced by passing the --cmly flag to Menhir)
  • a list of rules (usually a .mlyl file).

If the list of rules is well-formed, it produces an OCaml module that can match the rules against the state of a parser at runtime.

By carefully crafting the rules, one can provide fine-grained messages to explain syntax errors.

The repository is structured as follows:

  • the main tool, lrgrep, can be found in src/main.ml
  • support implements the compact table representation shared by the generator and the generated analysers via the lrgrep.runtime library
  • in ocaml, we try to apply this methodology to the OCaml grammar:
    • parser_raw.mly and lexer_raw.mll define an OCaml 4.13 compatible grammar with syntax error reporting removed
    • parse_errors.mlyl defines the error rules for this grammar
    • the frontend binary is an alternative parser that can be used with ocamlc/ocamlopt 4.14 (using the -pp <path-to-frontend.exe> flag)
    • the interpreter binary is a tool that takes an incorrect input and prints detailed information on the parsing process at the point of failure, useful for devising good error patterns
  • lib implements various algorithms used by other tools

Working on OCaml grammar

For now, the main focus is on the ocaml sub-directory, and ocaml/parse_errors.mlyl specifically.

My current workflow is as follows:

  • start from an example: a piece of OCaml code with a syntax error for which the message is quite bad
  • by reading the grammar and the output of the interpreter, get an idea of what the parsing situation looks like around the error point
  • craft an error rule, and debug it by passing -pp frontend to ocamlc

Setting up the tools

All the work is done using OCaml 4.14. Make sure you are using the right switch:

$ ocamlc -version
4.14.1
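The version check can also be scripted. The following is a minimal sketch in POSIX shell; the helper name is_supported_ocaml and the accepted 4.14.* pattern are assumptions made for illustration and are not part of the repository.

```shell
# Hypothetical helper: succeed only when the given version string
# belongs to the 4.14.x series (e.g. "4.14.1").
is_supported_ocaml() {
  case "$1" in
    4.14.*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage: check the compiler found in PATH, if there is one.
if command -v ocamlc >/dev/null 2>&1; then
  if is_supported_ocaml "$(ocamlc -version)"; then
    echo "OCaml switch looks fine"
  else
    echo "expected OCaml 4.14.x, got $(ocamlc -version)" >&2
  fi
fi
```

Running it in a shell with the wrong switch active prints the mismatch on standard error instead of failing later during the build.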

Clone the repository and install dependencies:

$ git clone https://github.com/let-def/lrgrep.git
$ cd lrgrep
$ opam install menhir fix cmon

At this point, make should succeed (contact me if not) and produce the three binaries: lrgrep.exe, frontend.bc and interpreter.exe.

It is usually better to test with the bytecode frontend as it leads to shorter iteration cycles.

Quick test

Try the new frontend with some simple examples:

$ ocamlc -c -pp _build/default/ocaml/frontend.bc test_ok.ml

This first example compiled successfully.

$ ocamlc -c -pp _build/default/ocaml/frontend.bc test_ko_01.ml
ocamlc -pp _build/default/ocaml/frontend.bc test_ko_01.ml
File "test_ko_01.ml", line 4, characters 0-3:
4 | let z = 7
    ^^^
Error: Spurious semi-colon at 2:9

File "test_ko_01.ml", line 1:
Error: Error while running external preprocessor
Command line: _build/default/ocaml/frontend.bc 'test_ko_01.ml' > /tmp/ocamlppbbc3f9

In this one however, there is a syntax error. Luckily, this case is covered by a rule: while the error happens on line 4, it is likely caused by the semi-colon at the end of line 2.

Using the frontend for compiling ocaml files

By setting the OCAMLPARAM environment variable, we can instruct every invocation of the OCaml compilers in the current shell to use our frontend.

$ ./demo/setup_shell.sh
export 'OCAMLPARAM=pp=$PWD/lrgrep/_build/default/ocaml/frontend.bc,_'
# the setup_shell.sh script prints a suitable OCAMLPARAM export line
$ eval `./demo/setup_shell.sh`
$ ocamlc test_ko_01.ml
...
Error: Spurious semi-colon at 2:9
...
# In the updated environment, the new frontend is picked up automatically

Now you are ready to iterate on ocaml/parse_errors.mlyl to produce new rules.

Note: unset OCAMLPARAM to switch back to the normal frontend
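The export line shown above can also be built by hand. The sketch below is not the content of demo/setup_shell.sh: the helper name ocamlparam_for is hypothetical, and it assumes the frontend path contains no single quote. As far as I understand the OCaml manual, the trailing "_" marks where command-line flags are taken into account relative to the listed settings.

```shell
# Hypothetical helper: print an export line that points the compilers'
# preprocessor setting (pp) at the given frontend path.
ocamlparam_for() {
  printf "export 'OCAMLPARAM=pp=%s,_'\n" "$1"
}

# Show the line for a frontend built in the current directory.
ocamlparam_for "$PWD/_build/default/ocaml/frontend.bc"

# Apply it to the current shell, inspect it, then switch back.
eval "$(ocamlparam_for "$PWD/_build/default/ocaml/frontend.bc")"
echo "$OCAMLPARAM"
unset OCAMLPARAM
```

Printing the line rather than exporting it directly mirrors the eval-based usage above, since a child script cannot change the environment of the calling shell.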

Devising new rules

Once you have made sure your setup is working (make is (re-)building the frontend and ocamlc is using it), you can proceed to DEVISING-RULES.md to get started with the error DSL and the associated workflow.

Getting started with LRGrep codebase

I am trying to document the code. Each of the src, lib, ocaml, and support directories contains a README.md that briefly explains the purpose of that directory.

External dependencies that are worth knowing:

  • MenhirSdk is a part of the Menhir parser generator that allows external tools to post-process compiled grammars
  • Cmon is a pretty-printer for recursive values
  • Fix is a library for computing fixed points; it also provides a convenient representation of finite sets
  • LRijkstra is taken from Menhir and implements the algorithm described in "Faster Reachability Analysis for LR(1) Parsers", though we apply it for a slightly different purpose than the one described in the article

lrgrep's People

Contributors

jmid, let-def, squiddev

lrgrep's Issues

Reductions with empty productions

Apologies for the slew of issues here (though thank you for all the work you've done on fixing them! ❤️).

I'm continuing to update my current code to use the latest version of lrgrep, and have hit a problem where reductions for empty productions do not appear to be matched.

I've committed a small reproduction case, but just to explain what's going on here.

We have the following grammar which accepts any number of characters between two brackets (e.g. (), (a), (aaa)).

let sentence := "(" ; ~ = chars ; ")" ; EOF ; <>

let chars := ~ = list(char) ; <List.rev>
let char := ~ = C ; <>

I'd like to match sentences without a closing bracket, and so have the following rule:

rule error_message = parse error
| OPAREN ; [chars]
{ "Unclosed '('" }

While this does match sentences like (aaa, it does not match just (. Curiously this does work with LRgrep 2 (so using OPAREN ; chars ; !), so I'm assuming this is a regression rather than intentional behaviour?

Sorry I can't provide more info! Still going down the rabbit hole of trying to understand how lrgrep works.

Is Menhir's `really_top` really needed?

I was taking a look at updating my code to use the latest version of lrgrep (I'm still using a version from February, before #8/#9 were merged). However, as of 0cd455b I noticed that lrgrep requires the Menhir parser to export a val really_top : 'a env -> element definition.

While this function is pretty easy to add, there's a bit of me which wonders whether it's needed/useful in the first place. The only time we need to read the very top of the stack in the OCaml test suite is for test_0314.ml:

let lident = lident and false = UIDENT to

| pos=[_* / LET ... . IN ...]
| pos=[_* / let_bindings(ext) ... . IN ...]
| pos=[_* / let_bindings(no_ext) ... . IN ...]
{ "Expecting `in' to continue let-binding at " ^ line_and_char $startloc(pos) }

However, in this case, the capture isn't actually very useful: the start/end position of the top of the stack will be the first character in the file, when we probably want to be capturing the position of the initial LET (or let_bindings(_)). I wonder if startloc should be capturing the position of the first token in the production in this case (and similarly for endloc) instead?

Apologies if this is already on your radar (or if I'm talking nonsense)! I've not dug too much into the code yet, just felt it was worth asking first.

Trying to understand the reduction/`!` pattern

I realise everything in this issue is quite involved, and not sure how "production ready" this repository is, so happy if you'd rather just close this :).


I'm currently fiddling with using lrgrep for a Lua parser I'm working on, in an attempt to provide better error messages for some common errors. I've put my current progress on this fork.

One issue I see a lot is people forgetting to put parentheses on zero-argument function calls, and so I'd like to provide an error for this case:

print -- should be print()
print("Hello")

Here parsing will fail on line 2, with an unexpected identifier. At that point, the interpreter trace for this parse looks like the following:

- line 1:0-5 IDENT
   [var: IDENT .]
 ↱ var
   [name: var .]
 ↱ simple_expr
   [name: simple_expr . OSQUARE expr CSQUARE]
   [name: simple_expr . DOT IDENT]
   [call: simple_expr . COLON IDENT call_args]
   [call: simple_expr . call_args]
 ↱ sep_list1(COMMA,name)
   [basic_stmt: sep_list1(COMMA,name) . EQUALS sep_list1(COMMA,expr)]
 ↱ name
   [simple_expr: name .]
   [sep_list1(COMMA,name): name . COMMA sep_list1(COMMA,name)]
   [sep_list1(COMMA,name): name .]
- entrypoint program

I'm currently matching this case with the following rule:

| [basic_stmt: sep_list1(COMMA,name) . EQUALS]; !
  partial { (* ... *) }

However, what I really want to be able to do is determine if the next token is on the same line (i.e. x 42, in which case it's possible the user wanted to write x(42) or x = 42) or the next line (i.e. print\nprint(), in which case it's probably just a function call). I tried doing this by putting a capture around the LR item, so I could compare positions:

| ([basic_stmt: sep_list1(COMMA,name) . EQUALS] as names); !
  partial { (* ... *) }

However, while the pattern still matches, names is always None. This is where my understanding begins to break down a little bit - the bytecode executed doesn't contain any Stores, and so I assume nothing is read from menhir's stack?

Is this something you'd expect to work or, if not, is there an alternative way I could look at implementing such a rule?

Again, apologies for such a verbose question!
