hammer's People

Contributors

abiggerhammer, fbz, jakobr, moreati, pesco, prashantbarca, skade, sonoflilit, thequux, tomime, uucidl, zaxtax

hammer's Issues

Homebrew packaging

A suitable homebrew recipe should be provided in the hammer source tree.

Optional Runtime Parameters for Parsers

It would be useful to be able to pass runtime options to a parser that will be bound to all sub parsers and can be used to specify optional grammars for a set of parsers.

For example, suppose the application a parser is written for has run-time options that the user defines, such as whether whitespace is significant. It would help to be able to tell the parser what that option is set to and have the parser conditionally change its behavior at runtime.

So a function "Parse" could take an input and an optional map of options and then bind those options to the scope of the entry point into the "Parse" function and all subparsers that it calls. Then the parser definition could reference the value in the map and depending on the value of a specific option change at runtime which path a parser takes.

Debian packaging

Both dpkg-buildpackage and scons deb should work from the hammer root directory.

DNS compression

The current DNS example doesn't support compression as defined in RFC 1035. It should.

User-defined pretty printers

It'd be nice to enable users to tell hammer how to print their types. Maybe add a predicate type for it.

This is necessary, unless I'm missing something, for benchmarking of parsers with user-defined types.

{Free,Open,Net}BSD packaging

Should include some sort of command to build a bsd-specific binary package. Bonus points if a suitable ports/pkgsrc file is provided.

Bounds checking for HCountedSequence

During the BerlinSides Q&A, Dan Kaminsky asked whether it was possible to dereference past the end of an HCountedSequence. The current accessor pattern makes it more likely that a user will do this by accident (although the length of the sequence is known via the .used field, accessing elements of **elements directly is not runtime-checked). We promised to add this as a runtime check, since it's easy (and also dovetails with the larger goal of an accessor API for HParseResult).
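A minimal sketch of what such a runtime-checked accessor could look like. The types `Token` and `CountedSeq` and the function `seq_get` are stand-ins invented for illustration; they only mirror the `.used` and `**elements` fields mentioned above and are not Hammer's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; not Hammer's real definitions. */
typedef struct Token_ {
    int value;
} Token;

typedef struct CountedSeq_ {
    size_t used;       /* number of valid entries in elements */
    Token **elements;  /* array of pointers to tokens */
} CountedSeq;

/* Hypothetical accessor: returns NULL instead of reading past the end. */
const Token *seq_get(const CountedSeq *seq, size_t index)
{
    assert(seq != NULL);  /* passing no sequence at all is a caller bug */
    if (seq->elements == NULL || index >= seq->used)
        return NULL;
    return seq->elements[index];
}
```

A caller would then write `seq_get(seq, i)` and check for NULL rather than indexing `elements` directly.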

Gentoo packaging

Source tree should include a suitable ebuild for use in Gentoo. Bonus points if we contribute it upstream.

Add hooks to HArena that get called on h_destroy_arena

Sometimes memory needs to be allocated outside of the arena, or other actions need to be deferred until the arena is destroyed. Add a list of callbacks to be called when an arena is destroyed, in reverse order of their addition.
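One possible shape for this, sketched on a toy arena rather than on HArena itself. Every name here (`ToyArena`, `toy_arena_add_hook`, `toy_arena_destroy`, `log_tag`) is hypothetical. Hooks are pushed onto the front of a linked list, so walking the list at destroy time runs them in reverse order of addition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy arena for illustration; Hammer's HArena is not modelled here. */
typedef void (*DestroyHook)(void *user_data);

typedef struct HookNode_ {
    DestroyHook fn;
    void *user_data;
    struct HookNode_ *next;
} HookNode;

typedef struct ToyArena_ {
    HookNode *hooks;  /* most recently added hook first */
} ToyArena;

ToyArena *toy_arena_new(void)
{
    return calloc(1, sizeof(ToyArena));
}

/* Register a hook; returns 0 on success, -1 if allocation fails. */
int toy_arena_add_hook(ToyArena *arena, DestroyHook fn, void *user_data)
{
    assert(arena != NULL && fn != NULL);
    HookNode *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->fn = fn;
    node->user_data = user_data;
    node->next = arena->hooks;  /* push on the front */
    arena->hooks = node;
    return 0;
}

/* Run the hooks newest-first, then release the arena itself. */
void toy_arena_destroy(ToyArena *arena)
{
    if (arena == NULL)
        return;
    HookNode *node = arena->hooks;
    while (node != NULL) {
        HookNode *next = node->next;
        node->fn(node->user_data);
        free(node);
        node = next;
    }
    free(arena);
}

/* Example hook: appends the first character of its tag to a shared log. */
static char hook_log[8];
static size_t hook_log_len;

void log_tag(void *user_data)
{
    if (hook_log_len + 1 < sizeof hook_log)
        hook_log[hook_log_len++] = *(const char *)user_data;
}
```

Registering `log_tag` three times with the tags "1", "2" and "3" and then destroying the arena leaves "321" in `hook_log`.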

Packaging complete

Packaging for

  • dpkg
  • rpm
  • Windows (msi)
  • Gentoo (ebuild)
  • {Free,Net,Open}BSD (various ports-like systems)
  • Homebrew

Allow a tag-only mode

In some cases, such as recognising strings (see the init_domain example in DNS), the input is already in a format suitable for further processing and only needs to be validated. In this case, generating the AST and then recreating the input from it is a big overhead - instead it would be nice if the parser could just validate the input according to the grammar and return a range of input bytes as opposed to a full AST.

permutation combinator

Formats like PNG will be a lot easier if we have a combinator for Parsec-style permutation phrases. This may be difficult to translate from the lazy functional idiom, but would be very worthwhile to have.

RPM packaging

  • rpmbuild -ba hammer.spec
  • rpmbuild -ta hammer.tar
  • scons rpm
    should all generate an appropriate hammer.rpm

Make h_indirect more efficient

As far as I can see, the current mechanism for h_indirect introduces one unneeded function call. Why not copy the contents of HParser in h_bind_indirect (and therefore the vtable)? This removes the need for the proxy design pattern currently in use. Maybe isValidRegular breaks, but otherwise we could reduce overhead/complexity quite a bit by removing the indirection at parse time.

Make vararg combinators macros that add the sentinel

This is an idea for a possible enhancement. Feedback welcome!

Turn e.g. h_sequence(foo, bar, baz, NULL) into h_sequence(foo, bar, baz) which expands to h_sequence_(foo, bar, baz, NULL) or somesuch.

Motivation: The sentinel arguments are a pain to read, to type, and easy to forget.
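The macro trick can be sketched without Hammer at all. Here `count_args_` is a hypothetical stand-in for a NULL-terminated vararg combinator such as h_sequence_, and `count_args` is the wrapper that appends the sentinel. With C99 variadic macros the wrapper needs at least one argument, and the sentinel is cast to a pointer so that it has pointer width even where NULL is defined as a plain 0.

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical stand-in for a NULL-terminated vararg combinator:
   counts the arguments that precede the sentinel. */
size_t count_args_(const void *first, ...)
{
    size_t n = 0;
    va_list ap;
    va_start(ap, first);
    for (const void *p = first; p != NULL; p = va_arg(ap, const void *))
        n++;
    va_end(ap);
    return n;
}

/* The wrapper appends the sentinel so callers cannot forget it. */
#define count_args(...) count_args_(__VA_ARGS__, (const void *)NULL)
```

With this, `count_args("foo", "bar", "baz")` expands to `count_args_("foo", "bar", "baz", (const void *)NULL)`.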

LALR(8)

Support LALR(8) as a backend, where bits are tokens.

h_benchmark_result_free

Right now, h_benchmark results leak. Something named like the title would fix that. Some other approach would probably also be okay. Maybe if you passed the result function as a callback to h_benchmark?

Make parsers `foo__m` macros that expand to `foo__m(mm__, ...)`

This is an idea for a possible enhancement. Feedback welcome!

For the default case, a user would globally declare mm__ = &h_system_allocator. In combinators, the standard argument would shadow the global variable.

Benefits:

  • (Motivation:) In an allocator-specific function, shadowing the global ensures that the user cannot use it. This is important in the h_bind combinator I made for DNP3: It typically allocates continuation parsers on every call during the parse and one missed allocation makes a memory leak.
  • Allocator-specific code remains as readable as code using the default allocator.

Cons:

  • User has to add one boilerplate declaration in the typical case.
  • Does this generate trouble for bindings?
  • Danger of mistyping the standard argument and defaulting to the global.
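To make the shadowing idea concrete, here is a small sketch that does not use Hammer. The `Allocator` type, `counting_alloc`, `make_int__m` and `make_sum__m` are all invented for illustration; only the name `mm__` and the `__m` suffix convention come from the proposal.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator record for this sketch; not Hammer's HAllocator. */
typedef struct Allocator_ {
    void *(*alloc)(struct Allocator_ *self, size_t size);
    size_t calls;  /* allocations routed through this allocator */
} Allocator;

/* Allocation function that counts its calls and defers to malloc. */
void *counting_alloc(Allocator *self, size_t size)
{
    self->calls++;
    return malloc(size);
}

Allocator system_allocator = { counting_alloc, 0 };

/* The one boilerplate declaration the user adds for the default case. */
Allocator *mm__ = &system_allocator;

/* Allocator-explicit constructor, in the style of the foo__m functions. */
int *make_int__m(Allocator *mm__, int value)
{
    int *p = mm__->alloc(mm__, sizeof *p);
    if (p != NULL)
        *p = value;
    return p;
}

/* Unsuffixed form: uses whichever mm__ is in scope at the call site. */
#define make_int(value) make_int__m(mm__, (value))

/* Allocator-specific code: the parameter mm__ shadows the global, so the
   unsuffixed macro picks up the caller's allocator automatically. */
int *make_sum__m(Allocator *mm__, int a, int b)
{
    return make_int(a + b);  /* expands to make_int__m(mm__, (a + b)) */
}
```

At file or `main` scope, `make_int(5)` goes through the global allocator; inside `make_sum__m` the same spelling goes through the allocator that was passed in.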

Replace GQueue and GHashTable

GQueue and GHashTable are bulky and we don't need most of their functionality. Replace them with stripped-down versions that just do what we need, eliminating our dependence on glib except for test-harness purposes.

We're using GQueue as a stack of LR_t, and the only functionality we need from it is push, pop and peek. GHashTable is used in a couple of ways; it needs to support insertion, lookup and update.
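For the GQueue half, a stripped-down replacement could be as small as the following growable pointer stack. This is only a sketch; the names `PtrStack` and `ptr_stack_*` are made up, and it allocates with realloc rather than from an arena.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal pointer stack; a sketch, not Hammer's eventual data structure. */
typedef struct PtrStack_ {
    void **items;
    size_t used;
    size_t capacity;
} PtrStack;

/* Push an item; returns 0 on success, -1 if growing the array fails. */
int ptr_stack_push(PtrStack *s, void *item)
{
    assert(s != NULL);
    if (s->used == s->capacity) {
        size_t new_cap = s->capacity ? s->capacity * 2 : 8;
        void **grown = realloc(s->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        s->items = grown;
        s->capacity = new_cap;
    }
    s->items[s->used++] = item;
    return 0;
}

/* Return the top item without removing it, or NULL if the stack is empty. */
void *ptr_stack_peek(const PtrStack *s)
{
    assert(s != NULL);
    return s->used ? s->items[s->used - 1] : NULL;
}

/* Remove and return the top item, or NULL if the stack is empty. */
void *ptr_stack_pop(PtrStack *s)
{
    assert(s != NULL);
    return s->used ? s->items[--s->used] : NULL;
}

/* Release the backing array and reset the stack to empty. */
void ptr_stack_free(PtrStack *s)
{
    assert(s != NULL);
    free(s->items);
    s->items = NULL;
    s->used = 0;
    s->capacity = 0;
}
```

A zero-initialized `PtrStack` is a valid empty stack, so no separate constructor is needed.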

LL(k)

Support LL(k) as a backend.

raw combinator

Create a combinator that returns the raw bytes from a parse.

Results of `h_optional` and `h_end` inconsistent?

I notice that h_optional returns an HParsedToken with type TT_NONE when its argument does not match, i.e. when it does not consume any input.

On the other hand, h_end upon success returns an HParseResult where the ast field is NULL. Somehow I expected it also to yield a TT_NONE.

Is this intentional?

h_choice not backtracking properly?

I'm trying to parse a padded last chunk of Base64, i.e. 4 characters of the form xyz= or yz== where z is restricted to some subset of the Base64 alphabet. Two weird results:

  • The test program accepts aaA= (correct) but rejects aA==. It looks like the h_choice isn't considering its second case correctly?

  • The test program falsely accepts A==, parsing it as "AA=="!

    % echo 'A==' | ./base64
    inputsize=4
    input=A==
    parsed=3 bytes
    [
    u 0x41
    u 0x41
    u 0x3d
    u 0x3d
    ]

Test program:

#include <hammer.h>

const HParser* document = NULL;

void init_parser(void)
{
    // CORE
    const HParser *digit = h_ch_range(0x30, 0x39);
    const HParser *alpha = h_choice(h_ch_range(0x41, 0x5a), h_ch_range(0x61, 0x7a), NULL);

    // AUX.
    const HParser *plus = h_ch('+');
    const HParser *slash = h_ch('/');
    const HParser *equals = h_ch('=');

    const HParser *bsfdig = h_choice(alpha, digit, plus, slash, NULL);
    const HParser *bsfdig_4bit = h_choice(
        h_ch('A'), h_ch('E'), h_ch('I'), h_ch('M'), h_ch('Q'), h_ch('U'),
        h_ch('Y'), h_ch('c'), h_ch('g'), h_ch('k'), h_ch('o'), h_ch('s'),
        h_ch('w'), h_ch('0'), h_ch('4'), h_ch('8'), NULL);
    const HParser *bsfdig_2bit = h_choice(h_ch('A'), h_ch('Q'), h_ch('g'), h_ch('w'), NULL);
    const HParser *base64_2 = h_sequence(bsfdig, bsfdig, bsfdig_4bit, equals, NULL);
    const HParser *base64_1 = h_sequence(bsfdig, bsfdig_2bit, equals, equals, NULL);
    const HParser *base64 = h_choice(base64_2, base64_1, NULL);
        // why does this parse "A=="?!
        // why does this parse "aaA=" but not "aA=="?!

    document = base64;
}


#include <stdio.h>

int main(int argc, char **argv)
{
    uint8_t input[102400];
    size_t inputsize;
    const HParseResult *result;

    init_parser();

    inputsize = fread(input, 1, sizeof(input), stdin);
    fprintf(stderr, "inputsize=%zu\ninput=", inputsize);
    fwrite(input, 1, inputsize, stderr);
    result = h_parse(document, input, inputsize);

    if(result) {
        fprintf(stderr, "parsed=%lld bytes\n", (long long)(result->bit_length/8));
        h_pprint(stdout, result->ast, 0, 0);
        return 0;
    } else {
        return 1;
    }
}
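For comparison, the behaviour the grammar is meant to have can be written down as a plain C check that does not involve Hammer. This is a reference sketch only; the function names are invented, the character ranges assume ASCII, and unlike the test program it requires the chunk to be exactly 4 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if c is in the Base64 alphabet (A-Z, a-z, 0-9, '+', '/'). */
int is_b64(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/* Returns 1 if c is non-NUL and occurs in set. */
static int in_set(unsigned char c, const char *set)
{
    return c != '\0' && strchr(set, c) != NULL;
}

/* Accepts exactly "xyz=" (z from the 4-bit subset) or "xy==" (y from the
   2-bit subset), mirroring base64_2 and base64_1 in the test program. */
int is_padded_last_chunk(const unsigned char *in, size_t len)
{
    if (len != 4 || in[3] != '=')
        return 0;
    if (in[2] == '=')  /* xy== : y carries 2 significant bits */
        return is_b64(in[0]) && in_set(in[1], "AQgw");
    /* xyz= : z carries 4 significant bits */
    return is_b64(in[0]) && is_b64(in[1]) &&
           in_set(in[2], "AEIMQUYcgkosw048");
}
```

Under this check both aaA= and aA== are accepted, while A== is rejected with or without the trailing newline that echo adds.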

Handle H_ALLOC failures

In the provided examples, H_ALLOC and friends could fail. Provide a way to bail out (we might have to abandon the parse - full recognition is not possible without the memory).

TQ on IRC suggested a longjmp to the beginning of the parse, which I like; however, a nicer way to abort the parse might be handy...
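The longjmp idea might look roughly like this. It is a toy: `ParseCtx`, `ctx_alloc` and `toy_parse` are hypothetical names, allocation failure is simulated with a budget counter, and each node is freed immediately so that nothing leaks when the jump happens. In a real parse the abandoned allocations would have to be reclaimed some other way, for instance by destroying the arena they came from.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-parse context for this sketch; not a Hammer structure. */
typedef struct ParseCtx_ {
    jmp_buf on_alloc_failure;  /* where to resume if memory runs out */
    size_t budget;             /* allocations still allowed (simulated limit) */
} ParseCtx;

/* Allocate or bail out of the whole parse; never returns NULL. */
void *ctx_alloc(ParseCtx *ctx, size_t size)
{
    void *p = ctx->budget > 0 ? malloc(size) : NULL;
    if (p == NULL)
        longjmp(ctx->on_alloc_failure, 1);
    ctx->budget--;
    return p;
}

/* Returns 0 if the toy parse completes, -1 if it was abandoned because an
   allocation failed. */
int toy_parse(size_t budget, size_t nodes_needed)
{
    ParseCtx ctx;
    ctx.budget = budget;
    if (setjmp(ctx.on_alloc_failure) != 0)
        return -1;  /* reached via longjmp from ctx_alloc */
    for (size_t i = 0; i < nodes_needed; i++) {
        void *node = ctx_alloc(&ctx, 16);
        free(node);  /* the toy keeps nothing, so nothing leaks on abort */
    }
    return 0;
}
```

The code between setjmp and the failure point never has to check an allocation result, which is the attraction of the approach.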

Weirdness with mixed-endian parsers that cross byte boundaries

I ran into an utterly bizarre bug with a parser I'm writing. I've made a gist that demonstrates it. It's not quite what I had in the initial bug, but it's the same behavior -- the lengths come out wrong on successful parses if I mix BIT_BIG_ENDIAN with a natural-endianness parser.

I was able to resolve the issue cleanly in my own parser by moving the endianness up from the individual bit-fields where possible, but it's still weird, and probably warrants investigation at some point.

Add API for allocating new token type

Right now, if parsers from two different libraries are composed, it is possible (and probable) that their token types will collide. This will cause problems for Tongs.

Proposed API:

/// Allocate a new, unused (as far as this function knows) token type.
int h_allocate_token_type(void);

Implications for bindings:

  • Bindings for hammer that marshal tokens should provide a registry so that custom token types can be unmarshalled transparently.
  • Bindings for libraries that expose parsers should register unmarshallers for their custom token types.
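In its simplest form the proposed function could be a counter that hands out successive values. The sketch below uses the made-up names `toy_allocate_token_type` and `TOY_TT_USER`; the starting value 64 is an assumption for illustration, and the counter is not thread-safe.

```c
/* First value handed out to user-defined token types. The constant is an
   assumption for this sketch, not taken from Hammer's headers. */
enum { TOY_TT_USER = 64 };

static int next_token_type = TOY_TT_USER;

/* Return a token type that no earlier call has returned.
   Not thread-safe: concurrent callers would need a lock or an atomic. */
int toy_allocate_token_type(void)
{
    return next_token_type++;
}
```

Two libraries that each call this once at initialization get distinct values, provided they share the same counter.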

Permissive license

Please consider one of the permissive licenses for this (2-clause BSD, MIT, WTFPL) so projects with such licenses can use the tool.

Ruby bindings

I've started to work on Ruby bindings for hammer, and I have a few questions:

  • Should I add the files to the main repo? If yes, what folder would you want to have it in?
  • Memory management: How do I free the HParser * after usage? (Normal free gave me a segfault, though I probably just made an error there.)
  • Is there a "streaming" interface? I mean, is it possible to parse a file in chunks without completely loading it to memory first?

Bit constant fields

This can be hacked around with validations, but obviously results in grammars that don't appear context-free as far as hammer can tell when they otherwise might be. Syntax like h_bits_const(4, 0xa) would be desirable.

At the moment, this appears to be blocking compilation of grammars that make use of bit constants with any backends other than packrat.

endian combinator

Certain formats, like TIFF, can be little-endian or big-endian depending on the value of a magic byte in the header. Currently there is no way to change the endianness of an input stream mid-parse, but we can add a combinator to do this.

LuaJIT bindings

The LuaJIT FFI is really simple, and if we do those bindings, mnemnion will shim them to work with ordinary Lua, package it, and push it to luarocks.

/core/parser/llk/rightrec segfaults on 32-bit x86

Before my recent commit [1], hammer did not compile on 32-bit x86. After that commit, hammer compiles, but /core/parser/llk/rightrec segfaults. Hammer compiles and successfully runs all tests on 64-bit x86 both before and after my commit. Here's the backtrace:

/core/parser/llk/rightrec: 
Program received signal SIGSEGV, Segmentation fault.
0xb7ea03b5 in h_hashtable_get (ht=0x0, key=0x161)
    at build/debug/src/datastructures.c:151
151   HHashValue hashval = ht->hashFunc(key);
(gdb) bt
#0  0xb7ea03b5 in h_hashtable_get (ht=0x0, key=0x161)
    at build/debug/src/datastructures.c:151
#1  0xb7e9da0f in h_stringmap_get_char (m=0x80a1388, c=97 'a')
    at build/debug/src/cfgrammar.h:56
#2  0xb7e9e59e in h_stringmap_get_lookahead (m=0x80a1388, lookahead=...)
    at build/debug/src/cfgrammar.c:343
#3  0xb7e95e8f in h_llk_lookup (table=0x8147a44, x=0x8147aac, stream=0xbffff470)
    at build/debug/src/backends/llk.c:32
#4  0xb7e9683a in h_llk_parse (mm__=0x8058bc0 <system_allocator>, 
    parser=0x8147964, stream=0xbffff470) at build/debug/src/backends/llk.c:303
#5  0xb7ea1bc2 in h_parse__m (mm__=0x8058bc0 <system_allocator>, parser=0x8147964, 
    input=0x805547a "aa", length=2) at build/debug/src/hammer.c:62
#6  0xb7ea1b65 in h_parse (parser=0x8147964, input=0x805547a "aa", length=2)
    at build/debug/src/hammer.c:49
#7  0x08052a47 in test_rightrec (backend=0x2) at build/debug/src/t_parser.c:427
#8  0xb7f19418 in ?? () from /usr/lib/libglib-2.0.so.0
#9  0xb7f195e4 in ?? () from /usr/lib/libglib-2.0.so.0
#10 0xb7f195e4 in ?? () from /usr/lib/libglib-2.0.so.0
#11 0xb7f195e4 in ?? () from /usr/lib/libglib-2.0.so.0
#12 0xb7f19989 in g_test_run_suite () from /usr/lib/libglib-2.0.so.0
#13 0xb7f199cc in g_test_run () from /usr/lib/libglib-2.0.so.0
#14 0x08054c36 in main (argc=1, argv=0xbffff914) at build/debug/src/test_suite.c:40

[1] pete-@c8fc061

Add parse-specific data registry

Some bindings (e.g., Go) would find it very useful to have a binding-specific object passed along with the parse. The API for this would take a form similar to that of the token type registry (see #45, #52).

This would put the ext_* variables in registry.c to use; however, they should be renamed first.

man page

The Unix distros all need one.

Benchmarking

We want the ability to benchmark parsers against large data sets as part of the building-with-Hammer process, since some parsing backends may be better suited for certain deployment environments than others. This means we need some benchmarking code!

Some combinators could be expressed in terms of others

E.g. h_whitespace(p) could obviously be defined as h_sequence(h_many(wschar), p) wrapped in an appropriate action, saving a lot of code. I suspect there are other examples.

Is there a particular reason for eschewing such layering or should there be a refactoring pass across the API at some point?

Probable gcc bug with symptom in h_carray_append

Minimalish example: https://gist.github.com/mrdomino/5f7de7b4aa9f74d13747

That segfaults (dereferencing 0) on -O3 and exits normally on -O2. Trial elimination of optimization flags reveals that -O -fvect-cost-model(=(dynamic|unlimited)|) -ftree-loop-vectorize is sufficient to trigger the bad behavior. Adding fprintf(stderr, "%zu\n", i); to the loop at datastructures.c:30 (the h_carray_append array->used loop) also causes the code to exit normally.

This always fails at exactly 33 elements or greater (so at the capacity jump to 64), and completes successfully on 32 or fewer elements. It has nothing to do with the size of thingy.data, but seems to depend on act_thingy getting a TT_SEQUENCE.

Windows packaging

  • scons msi
  • (any other common "build package" command?)
    should all generate an appropriate hammer.msi

It is OK if this only works on Windows.

Fast regular expressions

Parsers that only use sequence(), choice(), many(), many1(), optional() and token parsers (e.g. ch(), token() and the int and uint families) are parsers for regular languages. (Note for documentation: list which ones can and can't be used!) It should be possible to compile parsers that fit the appropriate constraints using a regular-expression backend.

TQ has an idea for a fast regular expression backend a la RE2, so this one's his.

h_action swallows failing parsers

AFAICS, there is no way for an h_action parser to fail because it always calls make_result which never returns NULL. For instance, I have this:

const HParser *true = h_action(h_token("true", 4), act_true);
const HParser *false = h_action(h_token("false", 5), act_false);
const HParser *boolean = h_choice(true, false, NULL);

The actions act_true and act_false are obviously supposed to return some TT_USER token representing a Boolean if the parse succeeded and propagate failure otherwise.

What are the intended semantics of h_action? Is the action even supposed to be called if the parser failed?

segfault on lr_stack pop with indirect

In case it is useful information: I enabled the indirect parser and here is what happened:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b48e4a in g_slice_free1 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
(gdb) bt
#0  0x00007ffff7b48e4a in g_slice_free1 () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#1  0x00007ffff7b3c0ba in g_queue_pop_head () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x0000000000401daf in h_do_parse (parser=0x60f2f0, state=0x60f638) at hammer.c:193
#3  0x0000000000409b1e in parse_sequence (env=0x60f3e0, state=0x60f638) at parsers/sequence.c:12
#4  0x0000000000401cb0 in h_do_parse (parser=0x60f420, state=0x60f638) at hammer.c:170
#5  0x0000000000409e8a in parse_choice (env=0x60f440, state=0x60f638) at parsers/choice.c:15
#6  0x0000000000401cb0 in h_do_parse (parser=0x60f480, state=0x60f638) at hammer.c:170
#7  0x0000000000409b1e in parse_sequence (env=0x60f4c0, state=0x60f638) at parsers/sequence.c:12
#8  0x0000000000401cb0 in h_do_parse (parser=0x60f500, state=0x60f638) at hammer.c:170
#9  0x0000000000409e8a in parse_choice (env=0x60f540, state=0x60f638) at parsers/choice.c:15
#10 0x0000000000401cb0 in h_do_parse (parser=0x60f580, state=0x60f638) at hammer.c:170
#11 0x000000000040b107 in parse_indirect (env=0x60f580, state=0x60f638) at parsers/indirect.c:4
#12 0x0000000000401cb0 in h_do_parse (parser=0x60f520, state=0x60f638) at hammer.c:170
#13 0x000000000040a655 in parse_many (env=0x60f5c0, state=0x60f638) at parsers/many.c:22
#14 0x0000000000401cb0 in h_do_parse (parser=0x60f5a0, state=0x60f638) at hammer.c:170
#15 0x0000000000401f95 in h_parse (parser=0x60f5a0, input=0x7ffffffe57b0 "1234\n", length=5) at hammer.c:248
#16 0x00000000004014e6 in main ()
(gdb) up
#1  0x00007ffff7b3c0ba in g_queue_pop_head () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
(gdb) up
#2  0x0000000000401daf in h_do_parse (parser=0x60f2f0, state=0x60f638) at hammer.c:193
193         g_queue_pop_head(state->lr_stack);
(gdb) list
188           state->input_stream = INVALID;
189           state->input_stream.input = key->input_pos.input;
190         }
191     #endif
192         // the base variable has passed equality tests with the cache
193         g_queue_pop_head(state->lr_stack);
194         // setupLR, used below, mutates the LR to have a head if appropriate, so we check to see if we have one
195         if (NULL == base->head) {
196           HParserCacheValue *right = a_new(HParserCacheValue, 1);
197           right->value_type = PC_RIGHT; right->right = tmp_res;
(gdb) p state->lr_stack
$1 = (GQueue *) 0x611c00

To reproduce, input "1234\n" to the following program:

#include <hammer.h>

#define h_literal(s) h_token(s, sizeof(s)-1)

const HParser* init_parser()
{
    static const HParser* p_document = NULL;
    if(p_document)
        return p_document;

    // CORE
    const HParser *digit = h_ch_range(0x30, 0x39);

    // AUX.
    const HParser *point = h_ch('.');
    const HParser *minus = h_ch('-');

    // BOOLEANS & NULL
    const HParser *boolean = h_choice(h_literal("true"),
                                      h_literal("false"), NULL);

    // NUMBERS
    const HParser *nothing = h_nothing_p();
    const HParser *hexnum = h_sequence(nothing, nothing, nothing, nothing, NULL);

    const HParser *decnat = h_many1(digit);
    const HParser *decfrac = h_sequence(point, nothing, NULL);
    const HParser *decnum = h_sequence(decnat, h_optional(decfrac),
                                       h_optional(nothing), NULL);

    const HParser *number = h_sequence(h_optional(minus),
                                       h_choice(hexnum, decnum, NULL), NULL);
        // XXX why does "1234" segfault?!

    HParser *value = h_indirect();

    // VALUES
    h_bind_indirect(value, h_choice(boolean, number, NULL));
    const HParser *value_seq = h_many(value);

    p_document = value_seq;
    return p_document;
}


#include <stdio.h>

int main(int argc, char **argv)
{
    uint8_t input[102400];
    size_t inputsize;
    const HParseResult *result;

    const HParser *parser = init_parser();

    inputsize = fread(input, 1, sizeof(input), stdin);
    fprintf(stderr, "inputsize=%zu\ninput=", inputsize);
    fwrite(input, 1, inputsize, stderr);
    result = h_parse(parser, input, inputsize);

    if(result) {
        fprintf(stderr, "parsed=%lld bytes\n", (long long)(result->bit_length/8));
        h_pprint(stdout, result->ast, 0, 0);
        return 0;
    } else {
        return 1;
    }
}

Obviously, this is only a fragment of an originally larger parser. Weirdly, changing pretty much anything left here makes the segfault go away. So does changing the length of the input. Also, inputs that end on a digit seem to loop forever.
