Comments (18)

nblumhardt avatar nblumhardt commented on July 24, 2024 2

@Platzer thanks for the link - will try to check it out :-)

@SuperJMN removing whitespace definitely does make constructing the parser a bit simpler 👍

from superpower.

nblumhardt avatar nblumhardt commented on July 24, 2024 1

:-) .. any help with this, even just figuring out the right test cases, would be awesome @SuperJMN

from superpower.

Platzer avatar Platzer commented on July 24, 2024 1

@nblumhardt TokenizerBuilder is an amazing feature 👍
Yesterday I refactored the tokenizer of a custom DSL from 200 lines of code down to 30 easily readable lines of TokenizerBuilder code in less than 40 minutes, and just 2 of ~200 tests are failing. It is really easy to get started!

from superpower.

nblumhardt avatar nblumhardt commented on July 24, 2024

TokenizerBuilder is now in on dev. Leaving this open until one remaining TODO is covered: when the tokenizer fails, it needs to report the error at the most accurate point. That is, if one recognizer failed after consuming 0 chars and another after 10 chars, the latter's result should be surfaced.
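
A rough sketch of the "most accurate point" rule described above: among the failed recognizer attempts, pick the one that consumed the most input. This is not Superpower's implementation; `Attempt` and `ErrorSelection.MostAccurate` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of one recognizer's failed attempt (not a Superpower type).
public sealed record Attempt(string Recognizer, int Consumed, string Message);

public static class ErrorSelection
{
    // Returns the failure that got furthest into the input.
    // On a tie, the attempt that appears first in the sequence is kept.
    // Throws InvalidOperationException if the sequence is empty.
    public static Attempt MostAccurate(IEnumerable<Attempt> failures) =>
        failures.Aggregate((best, next) => next.Consumed > best.Consumed ? next : best);
}
```

With one attempt that failed after 0 chars and another after 10, the second one is the one returned, which matches the behaviour the TODO asks for.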

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

"Motivated". Did you call me? ;)

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

I'm taking a look tomorrow! Promise :)

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

OK, I'm trying to translate one of my tokenizers to the TokenizerBuilder DSL :)

If I understand correctly, this tokenizer will match

  • the equality operator '=='
  • the assignment operator '='.

var tokenizer = new TokenizerBuilder<LangToken>()
                .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
                .Match(Character.EqualTo('='), LangToken.Equal)
                .Build();

Does it make sense? :)
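
For what it's worth, here is a small sketch of why the order of the two Match calls matters, assuming recognizers are tried in the order they were registered and the first one that succeeds wins. The two-member `LangToken` enum and the `OperatorTokenizer` helper are made up for the example.

```csharp
using System.Linq;
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;

// Hypothetical token set, reduced to the two operators in question.
public enum LangToken { DoubleEqual, Equal }

public static class OperatorTokenizer
{
    // "==" is registered before "=" so the longer operator gets the first chance to match.
    public static Tokenizer<LangToken> Create() =>
        new TokenizerBuilder<LangToken>()
            .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
            .Match(Character.EqualTo('='), LangToken.Equal)
            .Build();

    // Convenience helper: tokenize the input and return only the token kinds.
    public static LangToken[] Kinds(string input) =>
        Create().Tokenize(input).Select(t => t.Kind).ToArray();
}
```

Under that assumption, `OperatorTokenizer.Kinds("===")` should yield DoubleEqual followed by Equal. With the two Match calls swapped, '=' would match first at every position and '==' would never be produced.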

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

Another question that arises is how to match keywords.

So far I have found I can do it with this construction:

builder
    .Match(Span.EqualTo("if"), LangToken.If)
    .Match(Span.EqualTo("while"), LangToken.While)
    .Match(Span.EqualTo("do"), LangToken.Do)
    .Match(Span.EqualTo("for"), LangToken.For)
   ....

Is this the best way to match keywords? :)

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

I have just discovered another use case:

If you want the tokenizer to convert any sequence of whitespace characters into a single token, for example LangToken.Whitespace, should it be done like this?

builder
    .Match(Character.WhiteSpace.AtLeastOnce(), LangToken.Whitespace)

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

@nblumhardt Nicholas, using the Tokenizer Builder has relieved the pains of creating one manually A LOT! For me it's a big success. I've created an equivalent tokenizer using the Builder in less than 10 minutes. Wow!

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

BTW, the tokenizer I have right now is this. I've only added the tokens that I'm using right now, for my tests. It corresponds to a subset of tokens of a typical C language parser.

return new TokenizerBuilder<LangToken>()
                .Match(Character.WhiteSpace.AtLeastOnce(), LangToken.Whitespace)
                .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
                .Match(Character.EqualTo('='), LangToken.Equal)
                .Match(Character.EqualTo('('), LangToken.LeftParenthesis)
                .Match(Character.EqualTo(')'), LangToken.RightParenthesis)
                .Match(Character.EqualTo('{'), LangToken.LeftBrace)
                .Match(Character.EqualTo('}'), LangToken.RightBrace)
                .Match(Character.EqualTo(';'), LangToken.Semicolon)
                .Match(Span.EqualTo("if"), LangToken.If, true)
                .Match(Span.Regex(@"\w[\w\d]*"), LangToken.Identifier, true)
                .Build();

from superpower.

Platzer avatar Platzer commented on July 24, 2024

Sorry for barging in... I have a remark on @SuperJMN's comment about keyword matching. How would you build a tokenizer for this:

if whileRunning == true

so it is generating these tokens:

- LangToken.If
- LangToken.Whitespace
- LangToken.Identifier
- LangToken.Whitespace
- LangToken.DoubleEqual
- LangToken.Whitespace
- LangToken.True

because (I think, didn't try)

builder
    ...
    .Match(Span.EqualTo("while"), LangToken.While)
    ...

will match the whileRunning identifier as LangToken.While?

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

@Platzer Good question! I've got it to work perfectly.

I think it's not only because the order of the Match calls does matter, but also because I set the requireDelimiters option to true in both keywords and identifiers (it's false by default).

This is the test code I used (xUnit):

    public class TokenizerSpecs
    {
        [Theory]
        [MemberData(nameof(TokenData))]
        public void TokenizationTest(string code, IEnumerable<LangToken> tokens)
        {
            var sut = CreateSut();
            var actual = sut.Tokenize(code).Select(t => t.Kind);
            var expected = tokens;

            // Equal() compares element by element, in order; BeEquivalentTo() ignores ordering by default.
            actual.Should().Equal(expected);
        }

        public static IEnumerable<object[]> TokenData()
        {
            return new List<object[]>()
            {
                new object[] {"==", new List<LangToken>() {LangToken.DoubleEqual},},
                new object[] {"=", new List<LangToken>() {LangToken.Equal},},
                new object[] {"ifSomething", new List<LangToken>() {LangToken.Identifier},},
                new object[]
                {
                    "if whileRunning == true", 
                    new List<LangToken>()
                    {
                        LangToken.If,
                        LangToken.Whitespace,
                        LangToken.Identifier,
                        LangToken.Whitespace,
                        LangToken.DoubleEqual,
                        LangToken.Whitespace,
                        LangToken.True,
                    },
                },
            };
        }

        private Tokenizer<LangToken> CreateSut()
        {
            return TokenizerFactory.Create();
        }
    }

and the Tokenizer is this:

    public static class TokenizerFactory
    {
        public static Tokenizer<LangToken> Create()
        {
            return new TokenizerBuilder<LangToken>()
                .Match(Character.WhiteSpace.AtLeastOnce(), LangToken.Whitespace)
                .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
                .Match(Character.EqualTo('='), LangToken.Equal)
                .Match(Character.EqualTo('('), LangToken.LeftParenthesis)
                .Match(Character.EqualTo(')'), LangToken.RightParenthesis)
                .Match(Character.EqualTo('{'), LangToken.LeftBrace)
                .Match(Character.EqualTo('}'), LangToken.RightBrace)
                .Match(Character.EqualTo(';'), LangToken.Semicolon)
                .Match(Span.EqualTo("if"), LangToken.If, true)
                .Match(Span.EqualTo("while"), LangToken.While, true)
                .Match(Span.EqualTo("true"), LangToken.True, true)
                .Match(Span.EqualTo("false"), LangToken.False, true)
                .Match(Span.Regex(@"\w[\w\d]*"), LangToken.Identifier, true)
                .Build();
        }
    }

I hope the xUnit part is clear :) Basically, xUnit takes each object[] returned by the static TokenData method and passes its elements as the arguments of the [Theory] method. Each entry holds the input and the expected tokens; the last entry is the example you asked about :)

from superpower.

nblumhardt avatar nblumhardt commented on July 24, 2024

@SuperJMN looking good! Curious - why does your grammar need whitespace tokens?

from superpower.

Platzer avatar Platzer commented on July 24, 2024

@nblumhardt if you want to keep the user's formatting and highlight the errors with some squiggles, you would need to know how much whitespace the user entered (see: NDC London How to parse a file - Matt Ellis).

from superpower.

SuperJMN avatar SuperJMN commented on July 24, 2024

@nblumhardt It seems I asked myself the wrong question: "how can my grammar ignore whitespace?" :) To tell the truth, I only added the whitespace tokens because leaving them out seemed unnatural at first glance, but now that you mention it, it's better to remove them because that will make my parsers simpler :)
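
A minimal sketch of dropping whitespace at the tokenizer level, using the `Ignore(Span.WhiteSpace)` call that also appears further down this thread. The reduced `LangToken` enum, the letters-only identifier rule, and the `WhitespaceFreeTokenizer` helper are assumptions made for the example.

```csharp
using System.Linq;
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;

// Hypothetical token set, trimmed down for the example.
public enum LangToken { DoubleEqual, Identifier }

public static class WhitespaceFreeTokenizer
{
    public static Tokenizer<LangToken> Create() =>
        new TokenizerBuilder<LangToken>()
            // Whitespace is consumed but produces no token.
            .Ignore(Span.WhiteSpace)
            .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
            // Simplified identifier rule: one or more letters.
            .Match(Character.Letter.AtLeastOnce(), LangToken.Identifier)
            .Build();

    // Convenience helper: tokenize the input and return only the token kinds.
    public static LangToken[] Kinds(string input) =>
        Create().Tokenize(input).Select(t => t.Kind).ToArray();
}
```

The parser built on top of this then only ever sees Identifier and DoubleEqual tokens, so none of its rules need to mention whitespace.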

from superpower.

nblumhardt avatar nblumhardt commented on July 24, 2024

That's great to hear, @Platzer!

from superpower.

straatvark avatar straatvark commented on July 24, 2024

A question, please. I have a situation similar to the following, where
.Match(Span.EqualTo("while"), LangToken.While) matches 'whileRunning'.

I have created this helper method:

    public static TextParser<TextSpan> BuildTextParserEqualTo(string equalTo)
    {
        return Span.EqualTo(equalTo);
    }

which I call from here:

    var tokenizerBuilder = new TokenizerBuilder<InputToken>();
    tokenizerBuilder.Ignore(Span.WhiteSpace);
    tokenizerBuilder.Match(BuildTextParserEqualTo("and"), InputToken.And);
    tokenizerBuilder.Match(Span.NonWhiteSpace, InputToken.None);
    return tokenizerBuilder.Build();

But the "and" would get matched in this text: "aaa bbb ccc andDDD" - unwanted in my case unfortunately.

How do I get it matched for only the "and" in: "aaa bbb ccc and DDD", i.e. only for a stand-alone word / token?

(I tried the Regex approach as well, but that also seems to ignore the $ at the end of my expression)

Any help would be much appreciated.
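
One possible approach, going by the requireDelimiters option described earlier in this thread: pass `true` for it on both the keyword and the catch-all recognizer. The sketch assumes `InputToken` has just the two members shown; `InputTokenizer` is a made-up wrapper. As far as I understand, the catch-all has to require delimiters too, because otherwise it would itself count as a valid delimiter right after "and".

```csharp
using System.Linq;
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;

// Assumed shape of the token enum from the question.
public enum InputToken { None, And }

public static class InputTokenizer
{
    public static Tokenizer<InputToken> Create() =>
        new TokenizerBuilder<InputToken>()
            // Ignored whitespace acts as the delimiter between words.
            .Ignore(Span.WhiteSpace)
            // "and" only counts when followed by a delimiter or the end of input.
            .Match(Span.EqualTo("and"), InputToken.And, requireDelimiters: true)
            // Any other run of non-whitespace characters.
            .Match(Span.NonWhiteSpace, InputToken.None, requireDelimiters: true)
            .Build();

    // Convenience helper: tokenize the input and return only the token kinds.
    public static InputToken[] Kinds(string input) =>
        Create().Tokenize(input).Select(t => t.Kind).ToArray();
}
```

With this setup, "aaa bbb ccc andDDD" should produce four None tokens, while "aaa bbb ccc and DDD" should produce an And token in fourth position. One caveat: since whitespace is the only delimiter here, "and," would also be tokenized as None.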

from superpower.
