returntocorp / ocaml-tree-sitter-semgrep

Generate parsers from tree-sitter grammars extended to support Semgrep patterns

License: GNU General Public License v3.0

ocaml-tree-sitter-semgrep's Introduction

ocaml-tree-sitter-semgrep

Generate OCaml parsers based on tree-sitter grammars, for semgrep.

Related ocaml-tree-sitter repositories:

  • ocaml-tree-sitter-core: provides the code generator that takes a tree-sitter grammar and produces an OCaml library from it.
  • ocaml-tree-sitter-languages: community repository that has scripts for building and publishing OCaml libraries for parsing a variety of programming languages.
  • ocaml-tree-sitter-semgrep: this repo; same as ocaml-tree-sitter-languages but extends each language with constructs specific to semgrep patterns.

Contributing

Development setup

  1. Make sure you have at least 6 GiB of free memory. More will be needed for some of the grammars.
  2. Install the following tools:
    • git
    • GNU make
    • pkg-config: used to locate tree-sitter's installed runtime library
    • Node.js: JavaScript interpreter used to translate a grammar to JSON
    • cargo: Rust package manager and build tool, used to build tree-sitter
    • opam: OCaml package manager
  3. Run opam init, then opam switch create 4.12.0, to install a recent version of OCaml.
  4. Install OCaml dev tools for your favorite editor: typically opam install merlin plus a plugin for your editor.
  5. Install pre-commit with pip3 install pre-commit and run pre-commit install to set up the pre-commit hook. This will re-indent code in a consistent fashion each time you call git commit.
  6. Check out the extra instructions for macOS.

See the Makefile for the available targets. Get started with:

make update
make setup

Then build and install the OCaml code generator (core):

make && make install

Testing a language

To build and test support for a language, say Kotlin, run:

$ cd lang
$ ./test-lang kotlin

For details, see How to upgrade the grammar for a language.

Adding a new language

See How to add support for a new language.

Documentation

We have limited documentation which is mostly targeted at early contributors. It's growing organically based on demand, so don't hesitate to file an issue explaining what you're trying to do.

License

ocaml-tree-sitter is free software with contributors from multiple organizations. The project is driven by r2c.

  • OCaml code developed specifically for this project is distributed under the terms of the GNU GPL v3.
  • The OCaml bindings to tree-sitter's C API were created by Bryan Phelps as part of the reason-tree-sitter project.
  • The tree-sitter grammars for major programming languages are external projects. Each comes with its own license.

ocaml-tree-sitter-semgrep's People

Contributors

akuhlens, amchiclet, amietn, apt-itude, arargon, aryx, aviks, brandonspark, brendongo, chargarlic, colleend, drewdennison, eatkins, emjin, frankeld, frodan, frostweeds, ihji, joseemds, mafrosis, michahoffmann, mjambon, mschwager, nmote, ruin0x11, semgrep-bot, sjord, sophiasr, wingyplus, zythosec

ocaml-tree-sitter-semgrep's Issues

Regenerating parser code after updating tree-sitter-X submodule is too lazy

Say you updated the tree-sitter-javascript submodule. It needs to be reinstalled as a node module (in node_modules). Then the grammar in /lang/javascript needs to be rebuilt. This should happen when running just make, but that is apparently broken.

Meanwhile, forcing an update for just javascript can be done as follows:

  • (from project root) npm install --force ./lang/semgrep-grammars/src/tree-sitter-javascript
  • (from lang/javascript/semgrep-grammars/src/semgrep-javascript) make clean && make && make test
  • (from lang/javascript) make clean && make && make test

Use npm list to check the version of the installed packages. All the tree-sitter-* packages must point to a local folder as follows:

~/ocaml-tree-sitter $ npm list
ocaml-tree-sitter@ /home/martin/ocaml-tree-sitter
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-c-sharp
├── [email protected]
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-javascript
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-kotlin
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-lua
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-r
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-ruby
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-rust
├── [email protected] -> /home/martin/ocaml-tree-sitter/lang/semgrep-grammars/src/tree-sitter-typescript
└── [email protected]

⚠️ You must specify the path to the local tree-sitter-javascript. Otherwise, npm will initially install it correctly but will also update package.json and package-lock.json to point to the public release (e.g. "^0.16.9"). We don't want that: we want to keep building from the local tree ("file:lang/semgrep-grammars/src/tree-sitter-javascript"). Make sure this doesn't happen.
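
The check described above could be automated along these lines. This is only a sketch, assuming a Node.js script that receives the parsed contents of package.json; the function name findNonLocalGrammars and the exempt parameter are hypothetical and not part of the repo, and which tree-sitter-* packages may legitimately come from the public registry is left to the caller.

```javascript
// Hypothetical check, not part of the repo: report tree-sitter-* dependencies
// whose version spec is not a local "file:" path.
function findNonLocalGrammars(pkg, exempt = []) {
  const deps = { ...(pkg.dependencies || {}), ...(pkg.devDependencies || {}) };
  return Object.entries(deps)
    .filter(([name]) => name.startsWith('tree-sitter-') && !exempt.includes(name))
    .filter(([, spec]) => !String(spec).startsWith('file:'))
    .map(([name, spec]) => `${name}: ${spec}`)
    .sort();
}

// Example data shaped like a package.json. In practice one would pass
// JSON.parse(fs.readFileSync('package.json', 'utf8')) instead.
const examplePkg = {
  dependencies: {
    'tree-sitter-javascript': '^0.16.9', // the unwanted public release
    'tree-sitter-ruby': 'file:lang/semgrep-grammars/src/tree-sitter-ruby',
  },
};

const offenders = findNonLocalGrammars(examplePkg);
console.log(offenders.length === 0 ? 'ok' : offenders.join('\n'));
```

Running something like this after npm install would flag the situation where npm rewrote a local dependency to the public release.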

Wrong handling of anonymous patterns

In the JavaScript grammar.js, there is:
string: $ => choice(
      seq(
        '"',
        repeat(choice(
          token.immediate(prec(PREC.STRING, /[^"\\\n]+|\\\r?\n/)),
          $.escape_sequence
        )),
        '"'
      ),
      seq(
        "'",
        repeat(choice(
          token.immediate(prec(PREC.STRING, /[^'\\\n]+|\\\r?\n/)),
          $.escape_sequence
        )),
        "'"
      )
    ),
but the type in CST.ml is:
type string_ = [
    `DQUOT_rep_choice_blank_DQUOT of (
        Token.t (* "\"" *)
      * anon_choice_blank list (* zero or more *)
      * Token.t (* "\"" *)
    )
  | `SQUOT_rep_choice_blank_SQUOT of (
        Token.t (* "'" *)
      * anon_choice_blank list (* zero or more *)
      * Token.t (* "'" *)
    )
]
type anon_choice_blank = [
    `Blank of unit (* blank *)
  | `Esc_seq of escape_sequence (*tok*)
]
The token.immediate pattern for regular string content has disappeared: it shows up as `Blank instead.


Various typescript parse errors

Tree-sitter-typescript errors 2020-10-01

These are errors reported when scanning a collection of public
repositories, using make stat from lang/typescript/.
This is for pure TypeScript, not TSX.

The number indicates the number of occurrences of this kind of error.

Official typescript parser implementation: https://github.com/microsoft/TypeScript/blob/master/src/compiler/parser.ts

Definite assignment assertions let x!

12

let LVIEW_COMPONENT_CACHE!: Map<string|null, Array<any>>;
                         ^
node type: ERROR
children: [
  "!"
]

PR: tree-sitter/tree-sitter-typescript#86 (merged)

Read-only array

7

export function parseMessage(
    messageParts: TemplateStringsArray, expressions?: readonly any[], location?: SourceLocation,
                                                               ^^^
    messagePartLocations?: (SourceLocation|undefined)[],
    expressionLocations: (SourceLocation|undefined)[] = []): ParsedMessage {
node type: ERROR
children: [
  "any"
]

This one is a bit tricky.
PR: tree-sitter/tree-sitter-typescript#90 (merged)

asserts

5

export function assertNumber(actual: any, msg: string): asserts actual is number {
                                                                ^^^^^^

PR: tree-sitter/tree-sitter-typescript#88 (merged)

infer

1

export type ElementOf<T> = T extends (infer E)[] ? E : T extends readonly (infer F)[] ? F : never;
                                            ^

Tracked at semgrep/semgrep#2010

[key: keytype]: valtype

2

    const match = XRegExp.exec(url, this.regex) as ReturnType<typeof XRegExp.exec> & { [captured: string]: string };
                                                                                        ^^^^^^^^^

node type: ERROR
children: [
  expression
  ":"
]

Tracked at semgrep/semgrep#2012

-readonly

2

export type Writable<T> = {
  -readonly[K in keyof T]: T[K];
  ^
};

Tracked at semgrep/semgrep#2013

Optional keys (?)

  inputs?: {[P in keyof T]?: string | [string, string]};
                          ^

node type: ERROR
children: [
  "?"
]

Tracked by semgrep/semgrep#1993

Ellipsis in types

2

export function pipeBindV(values: [any, ...any[]]): any {
                                        ^^^
node type: ERROR
children: [
  "..."
]

Tracked at semgrep/semgrep#2023

Abstract class with decorator

    it('should support forward refs in useClass when impl version is also provided', () => {
      @Injectable({providedIn: 'root', useClass: forwardRef(() => SomeProviderImpl)})
      abstract class SomeProvider {
      ^^^^^^^^
      }

node type: ERROR
children: [
  "abstract"
]

Tracked at semgrep/semgrep#2023

keyof typeof x

2

export interface DiagnosticMessage {
  message: string;
  kind: keyof typeof ts.DiagnosticCategory;
                     ^^
}

node type: ERROR
children: [
  identifier
]

Tracked by semgrep/semgrep#1994.

Optional tuple element

2

  it('should return correct built-in types', () => {
    const tests: Array<[BuiltinType, boolean, ts.TypeFlags?]> = [
                                                          ^

node type: ERROR
children: [
  "?"
]

Tracked at semgrep/semgrep#2024

Non-null assertion operator after array dereference

      lastDstToken.parts[0]! += token.parts[0];
                              ^

Tracked by semgrep/semgrep#2025

Labeled tuple elements

export type ValidOption = [key: string, values: string[]];
                              ^^^^^^^^
export type ValidOptions = ValidOption[];

node type: ERROR
children: [
  ":"
  "string"
]

Tracked at semgrep/semgrep#2034

this is

  isFromDynamicInput(this: DynamicValue<R>): this is DynamicValue<DynamicValue> {
                                             ^^^^^^^
    return this.code === DynamicValueReason.DYNAMIC_INPUT;
  }
node type: ERROR
children: [
  primary_type
  "is"
]
Source code cannot be parsed by tree-sitter.

Tracked by semgrep/semgrep#1970.

warning 42 on Parse.ml

File "ocaml-tree-sitter-lang/typescript/lib/Parse.ml", line 11275, characters 30-33:
11275 | | Alt (0, v) ->
^^^
Warning 42: this use of Alt relies on type-directed disambiguation,
it will not compile with OCaml 4.00 or earlier.

Factor out duplicate grammar nodes (needed for typescript support)

multiple definitions of finalize_parser

When using both the ruby and java parsers in ocaml-tree-sitter-lang, I get a link error because finalize_parser
is defined multiple times:
ocaml-tree-sitter-lang/ruby/lib/libtree_sitter_ruby_stubs.a(bindings.o): In function `finalize_parser': /home/pad/github/semgrep/semgrep-core/_build/default/ocaml-tree-sitter-lang/ruby/lib/bindings.c:26: multiple definition of `finalize_parser'
ocaml-tree-sitter-lang/java/lib/libtree_sitter_java_stubs.a(bindings.o):/home/pad/github/semgrep/semgrep-core/_build/default/ocaml-tree-sitter-lang/java/lib/bindings.c:26: first defined here
collect2: error: ld returned 1 exit status
File "none", line 1:
Error: Error while building custom runtime system
Makefile:3: recipe for target 'all' failed
make: *** [all] Error 1

Here is the definition:
void finalize_parser(value v) {
  parser_W *p;
  p = (parser_W *)Data_custom_val(v);
  ts_parser_delete(p->parser);
}

Adding 'static' in front of void finalize_parser does the trick, since it gives the function internal linkage in each bindings.c.
The code generator for bindings.c needs to emit this 'static'.

Why are tree-sitter OCaml bindings more complicated?

Hi!

Thanks for your work. That looks amazing.
I've been reading the developer docs, and think I understand that tree-sitter C output, even wrapped with OCaml bindings, is still not idiomatically typed and most importantly doesn't reference the original grammar.js appropriately.
Am I correct?

I was wondering why this problem occurs for OCaml bindings specifically and not for the Haskell/Python or Rust bindings?
Or is there something I misunderstood?
(I was thinking of writing such bindings myself and found your repository.)

As a side question, I know that the standard tool for parsing is ocamllex/menhir in OCaml. I was wondering whether, considering your experience, you can compare both approaches (tree sitter and menhir)?
(I know that the goal of this repo is to get a tree sitter parser usable in OCaml in order to get access to all the General Programming Language grammars that were already written for tree sitter, but I am just asking, out of curiosity in general).

I know GitHub issues are generally not the go-to place for this kind of question, so I'd totally understand if you simply ignore/close the issue.

Cheers!

Resolve type aliases

@aryx wrote:

I found some errors in the inline/deinline process, I think, for C#:

and interpolation_alignment_clause = (Token.t (* "," *) * constant_pattern)

It should not be constant_pattern; it should be expression. The grammar contains:

     constant_pattern: $ => prec.right($._expression),

and

     _pattern: $ => choice(
      $.constant_pattern,
      $.declaration_pattern,
      $.discard,
//      $.recursive_pattern,
      $.var_pattern
    ),

and for some reason, in a few places where there should be an expression, CST.ml has a constant_pattern instead.

Improve error reporting to include a few lines of context

Currently, we report only the token on which tree-sitter choked, an ERROR node. This is usually not sufficient to understand what the problem is. Instead, the error message should include a few lines before and after the problematic snippet, which itself would be highlighted (if writing to the console).

This is especially useful when triaging a large number of errors when scanning many large projects.
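
A rough sketch of the kind of output meant here, written in JavaScript for illustration only (the actual error reporting lives in the OCaml code and may be structured differently). The function formatErrorContext and its parameters are hypothetical: it takes the source text, the 0-based line and column range of the ERROR node, and the number of context lines to show on each side. Console highlighting is left out; the problematic span is marked with carets, as in the error reports in this tracker.

```javascript
// Hypothetical helper: print the offending line with a few lines of context
// and a caret marker under the span covered by the ERROR node.
function formatErrorContext(source, line, startCol, endCol, context = 2) {
  const lines = source.split('\n');
  const first = Math.max(0, line - context);
  const last = Math.min(lines.length - 1, line + context);
  const out = [];
  for (let i = first; i <= last; i++) {
    out.push(lines[i]);
    if (i === line) {
      // Always show at least one caret, even for an empty span.
      const width = Math.max(1, endCol - startCol);
      out.push(' '.repeat(startCol) + '^'.repeat(width));
    }
  }
  return out.join('\n');
}

const source = [
  'export type Writable<T> = {',
  '  -readonly[K in keyof T]: T[K];',
  '};',
].join('\n');

// ERROR node on line 1 (0-based), columns 2 to 3, with 1 line of context.
const report = formatErrorContext(source, 1, 2, 3, 1);
console.log(report);
```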

Include hash in names generated for anonymous nodes

Instead of anon_choice_type, anon_choice_type_, anon_choice_type2, generate names like anon_choice_type_3af1, anon_choice_type_49ba, anon_choice_type_2a37 so as to avoid mix-ups when a new name with the same prefix is introduced.
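
One possible way to derive such names, sketched in JavaScript for illustration (the code generator itself is written in OCaml). What exactly gets hashed is an assumption here: the sketch hashes a JSON serialization of the anonymous node's grammar body and keeps the first 4 hex digits. The function anonName is hypothetical.

```javascript
const crypto = require('crypto');

// Hypothetical naming scheme: the suffix is the first 4 hex digits of a
// SHA-256 hash of the node's serialized grammar body.
function anonName(prefix, body) {
  const digest = crypto
    .createHash('sha256')
    .update(JSON.stringify(body))
    .digest('hex');
  return `${prefix}_${digest.slice(0, 4)}`;
}

// The same body always yields the same name, regardless of which other
// anonymous nodes share the prefix.
const body = { type: 'CHOICE', members: [{ type: 'SYMBOL', name: 'type' }, { type: 'BLANK' }] };
const name1 = anonName('anon_choice_type', body);
const name2 = anonName('anon_choice_type', body);
console.log(name1, name1 === name2);
```

With only 4 hex digits, two different bodies can still collide, so a real implementation would need to detect that case (for example by lengthening the suffix). The serialization also has to be canonical, since JSON.stringify depends on key order.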

Various TSX parsing errors

TSX errors 2020-10-02

These are errors obtained on .tsx files from public repositories and
one of our private repositories. These are obtained by running
make stat from lang/tsx, which uses the tsx dialect of
tree-sitter-typescript.

The number indicates the number of errors encountered.

Some of these errors are not specific to the TSX grammar. See #115 for other typescript errors.

TSX elements with type parameters

5

<Element<Type> />;
        ^^^^^^

The official typescript implementation supports it.

The error node reported by the tree-sitter parser starts from the beginning of the
file, typically covering a large region, making it hard to figure out
what's wrong.

PR: tree-sitter/tree-sitter-typescript#92

Type of the form A extends B ? C : D - conditional types

4

type a = b extends c ? d : e

Also a problem with pure typescript. See issue semgrep/semgrep#1969

Nested empty/anonymous tag/element

3

<Elt>
            <>
             ^

This is an element with no name. It's a shorthand for <React.Fragment>.
Same problem exists in tree-sitter-javascript.

PR (merged): tree-sitter/tree-sitter-javascript#139

Type arguments on function call

const PolicyTree: React.FC<Props> = ({ policy }) => {
  const [nodes, setNodes] = useState<ITreeNode[]>();
                                                 ^^^

node type: ERROR
children: [
  formal_parameters
  ";"
]

Tracked at semgrep/semgrep#1992

New lines in string literals representing attributes

2

<path
  d='a
     ^
     b'
 />;

PR: tree-sitter/tree-sitter-typescript#91

Optional keys (?)

2

export type ChampsPokemonProps = { [P in keyof Pokemon]?: any } & {
                                                       ^
    showNickname?: boolean;
    showGender?: boolean;
node type: ERROR
children: [
  "?"
]

Also a problem observed for pure typescript.

Tracked at tree-sitter/tree-sitter-typescript#98

keyof typeof

2

    return `${baseURL}${isShiny}/${isFemaleSpecific}${formatSpeciesName(
        species,
    )}${getIconFormeSuffix(forme as keyof typeof Forme)}.png`;
                                                 ^^^^^
};

Also a problem observed for pure typescript. Tracked at tree-sitter/tree-sitter-typescript#99

Abstract class (?)

import AppConfig from './AppConfig';

export default abstract class AppBootstrapper {
               ^^^^^^^^
    constructor() {
        RX.App.initialize(__DEV__, __DEV__);

See semgrep/semgrep#1995.

compilation error about let rec in atdgen code

from semgrep: make config
....

2 warnings generated.
File "src/gen/lib/Tree_sitter_j.ml", lines 119-280, characters 44-1:
119 | ............................................(
120 | Atdgen_runtime.Oj_run.write_with_adapter Json_rule_adapter.restore (
121 | fun ob x ->
122 | match x with
123 | | SYMBOL x ->
...
277 | ) ob x;
278 | Bi_outbuf.add_char ob ']'
279 | )
280 | )
Error: This kind of expression is not allowed as right-hand side of `let rec'
make[2]: *** [build] Error 1
make[1]: *** [build-ocaml-tree-sitter] Error 2
make: *** [build] Error 2

Add easy links or references to the original grammar.js files

We're no longer copying and editing grammar.js files; instead, we extend the grammars provided by their original node modules. This is the semgrep-grammars repo.

Accessing the source grammars is important when working with generated OCaml code, so we'd like to be able to consult the original grammars in such cases. The original grammar most often comes from a single JavaScript grammar.js written in a declarative style, and our extension adds support for ... and such. However, there are more complicated cases such as typescript and tsx, which extend the javascript grammar; this makes 3 grammar.js files for typescript (original javascript, original typescript, semgrep dots extension). Ultimately, the whole grammar gets compiled into a single grammar.json file, but it's not as readable as the original grammars, it has no comments, etc.

Possible solutions here include:

  • Pretty-printing the final grammar.json in a more compact and readable format than json. Pros: contains all the rules, fully automatic and requires no maintenance. Cons: doesn't show the different sources it comes from, and some constructs from javascript such as optional are expanded into something less obvious.
  • Providing a readme, or perhaps a description embedded as a comment in generated files, with links to the original projects on GitHub and other useful resources. Pros: flexible, doesn't require much maintenance. Cons: not as accurate as referencing specific git commits for the source files.
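
To give an idea of the first option, here is a small sketch of a pretty-printer for grammar.json, assuming the usual tree-sitter node types (SYMBOL, STRING, PATTERN, BLANK, SEQ, CHOICE, REPEAT, REPEAT1, plus wrappers that carry a content field). The output notation and the function names show and prettyGrammar are made up for this example. It also folds CHOICE(x, BLANK) back into x?, which addresses the concern about optional being expanded.

```javascript
// Render one grammar.json rule body in a compact, EBNF-like notation.
function show(rule) {
  switch (rule.type) {
    case 'SYMBOL': return rule.name;
    case 'STRING': return JSON.stringify(rule.value);
    case 'PATTERN': return `/${rule.value}/`;
    case 'BLANK': return 'blank';
    case 'SEQ': return `(${rule.members.map(show).join(' ')})`;
    case 'CHOICE': {
      const [first, second] = rule.members;
      // optional(x) is stored as CHOICE(x, BLANK); fold it back into "x?".
      if (rule.members.length === 2 && second.type === 'BLANK') {
        return `${show(first)}?`;
      }
      return `(${rule.members.map(show).join(' | ')})`;
    }
    case 'REPEAT': return `${show(rule.content)}*`;
    case 'REPEAT1': return `${show(rule.content)}+`;
    default:
      // Wrappers such as PREC, TOKEN, FIELD or ALIAS: show only their content.
      return rule.content ? show(rule.content) : `<${rule.type}>`;
  }
}

// One line per rule, in the order the rules appear in grammar.json.
function prettyGrammar(grammar) {
  return Object.entries(grammar.rules)
    .map(([name, body]) => `${name} ::= ${show(body)}`)
    .join('\n');
}

// Tiny hand-written grammar standing in for a real grammar.json.
const grammar = {
  name: 'demo',
  rules: {
    pair: {
      type: 'SEQ',
      members: [
        { type: 'SYMBOL', name: 'key' },
        { type: 'STRING', value: ':' },
        { type: 'CHOICE', members: [{ type: 'SYMBOL', name: 'value' }, { type: 'BLANK' }] },
      ],
    },
    key: { type: 'PATTERN', value: '[a-z]+' },
    value: { type: 'REPEAT1', content: { type: 'PATTERN', value: '[0-9]' } },
  },
};

console.log(prettyGrammar(grammar));
```

This drops precedence, field and alias information, so it is meant as a reading aid rather than a faithful dump of the grammar.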

Automate parsing stats

Gathering the source for 10 large projects per language and parsing them takes a while. Doing this for every language that we support now takes too long to do by hand.

I'd like to set up a weekly job that produces parsing stats for each language and puts them where they can be consulted. Artifacts we want for each language:

  • Summary: success rate (per line of code), number of files and number of lines processed. Allows us to claim we have a parsing success of approximately X% for such language.
  • Stats per project, one per row in a CSV file. Allows us to see if the errors are concentrated in specific projects.
  • Path or URL of each file that failed. Allows us to triage errors and create bug reports.
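
The first two artifacts could be derived from per-project counts roughly as follows. This is a sketch only: the record fields (name, files, lines, errorLines), the CSV column names and the functions summarize and toCsv are assumptions for illustration, not the format produced by make stat.

```javascript
// Success rate per line of code; an empty input counts as 0 rather than NaN.
function lineSuccessRate(lines, errorLines) {
  return lines === 0 ? 0 : (lines - errorLines) / lines;
}

// Summary for one language: totals plus the overall per-line success rate.
function summarize(projects) {
  const totals = { files: 0, lines: 0, errorLines: 0 };
  for (const p of projects) {
    totals.files += p.files;
    totals.lines += p.lines;
    totals.errorLines += p.errorLines;
  }
  return { ...totals, successRate: lineSuccessRate(totals.lines, totals.errorLines) };
}

// Stats per project, one per row.
function toCsv(projects) {
  const header = 'project,files,lines,error_lines,success_rate';
  const rows = projects.map((p) =>
    [p.name, p.files, p.lines, p.errorLines,
      lineSuccessRate(p.lines, p.errorLines).toFixed(4)].join(','));
  return [header, ...rows].join('\n');
}

// Made-up numbers for two projects.
const projects = [
  { name: 'a', files: 10, lines: 1000, errorLines: 50 },
  { name: 'b', files: 5, lines: 1000, errorLines: 150 },
];

const summary = summarize(projects);
const csv = toCsv(projects);
console.log(summary);
console.log(csv);
```

Keeping the per-project rows separate from the summary makes it easy to see whether the errors are concentrated in one project.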
