Currently, the need to write a tokenizer makes it hard to get started with Superpower.

Simpler tokenization about superpower HOT 18 CLOSED

nblumhardt commented on July 24, 2024

Simpler tokenization

from superpower.

Comments (18)

nblumhardt commented on July 24, 2024

@Platzer thanks for the link - will try to check it out :-)

@SuperJMN removing whitespace definitely does make constructing the parser a bit simpler 👍

nblumhardt commented on July 24, 2024

:-) .. any help with this, even just figuring out the right test cases, would be awesome @SuperJMN

Platzer commented on July 24, 2024

@nblumhardt TokenizerBuilder is an amazing feature 👍
Yesterday I refactored the tokenizer of a custom DSL from 200 lines of code to 30 easily readable lines of TokenizerBuilder code in less than 40 minutes, and just 2 of ~200 tests are failing. It is really easy to get started!

nblumhardt commented on July 24, 2024

TokenizerBuilder is now on dev. Leaving this open until one remaining TODO is covered: if the tokenizer fails, it needs to report the error at the most accurate point; i.e. if one recognizer failed after consuming 0 chars, and another after 10 chars, the latter's results should be surfaced.

SuperJMN commented on July 24, 2024

"Motivated". Did you call me? ;)

SuperJMN commented on July 24, 2024

I'm taking a look tomorrow! Promise :)

SuperJMN commented on July 24, 2024

OK, I'm trying to translate one of my tokenizers to the TokenizerBuilder DSL :)

If I understand correctly, this tokenizer will match:

- the boolean operator '=='
- the assignment operator '='

var tokenizer = new TokenizerBuilder<LangToken>()
                .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
                .Match(Character.EqualTo('='), LangToken.Equal)
                .Build();

Does it make sense? :)

SuperJMN commented on July 24, 2024

Another question that arises is the matching of keywords.

So far I've found that I can do it with this construction:

builder
    .Match(Span.EqualTo("if"), LangToken.If)
    .Match(Span.EqualTo("while"), LangToken.While)
    .Match(Span.EqualTo("do"), LangToken.Do)
    .Match(Span.EqualTo("for"), LangToken.For)
   ....

Is this the best way to match keywords? :)

SuperJMN commented on July 24, 2024

I have just discovered another use case:

If you want the tokenizer to convert any sequence of whitespace characters into a single token, for example LangToken.Whitespace, should we do it like this?

builder
    .Match(Character.WhiteSpace.AtLeastOnce(), LangToken.Whitespace)

SuperJMN commented on July 24, 2024

@nblumhardt Nicholas, using the Tokenizer Builder has relieved the pains of creating one manually A LOT! For me it's a big success. I've created an equivalent tokenizer using the Builder in less than 10 minutes. Wow!

SuperJMN commented on July 24, 2024

BTW, the tokenizer I have right now is this. I've only added the tokens that I'm using right now, for my tests. It corresponds to a subset of tokens of a typical C language parser.

return new TokenizerBuilder<LangToken>()
                .Match(Character.WhiteSpace.AtLeastOnce(), LangToken.Whitespace)
                .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
                .Match(Character.EqualTo('='), LangToken.Equal)
                .Match(Character.EqualTo('('), LangToken.LeftParenthesis)
                .Match(Character.EqualTo(')'), LangToken.RightParenthesis)
                .Match(Character.EqualTo('{'), LangToken.LeftBrace)
                .Match(Character.EqualTo('}'), LangToken.RightBrace)
                .Match(Character.EqualTo(';'), LangToken.Semicolon)
                .Match(Span.EqualTo("if"), LangToken.If, true)
                .Match(Span.Regex(@"\w[\w\d]*"), LangToken.Identifier, true)
                .Build();

Platzer commented on July 24, 2024

Sorry for barging in... I have a remark on @SuperJMN's comment about keyword parsing. How would you build a parser for this:

if whileRunning == true

so it is generating these tokens:

- LangToken.If
- LangToken.Whitespace
- LangToken.Identifier
- LangToken.Whitespace
- LangToken.DoubleEqual
- LangToken.Whitespace
- LangToken.True

because (I think, didn't try)

builder
    ...
    .Match(Span.EqualTo("while"), LangToken.While)
    ...

will match the whileRunning identifier as LangToken.While?

SuperJMN commented on July 24, 2024

@Platzer Good question! I've got it to work perfectly.

I think it works not only because the order of the Match calls matters, but also because I set the requireDelimiters option to true for both keywords and identifiers (it's false by default).

This is the test code I used (xUnit):

    public class TokenizerSpecs
    {
        [Theory]
        [MemberData(nameof(TokenData))]
        public void TokenizationTest(string code, IEnumerable<LangToken> tokens)
        {
            var sut = CreateSut();
            var actual = sut.Tokenize(code).Select(t => t.Kind);
            var expected = tokens;

            // Equal asserts the same elements in the same order, which matters for a token stream.
            actual.Should().Equal(expected);
        }

        public static IEnumerable<object[]> TokenData()
        {
            return new List<object[]>()
            {
                new object[] {"==", new List<LangToken>() {LangToken.DoubleEqual},},
                new object[] {"=", new List<LangToken>() {LangToken.Equal},},
                new object[] {"ifSomething", new List<LangToken>() {LangToken.Identifier},},
                new object[]
                {
                    "if whileRunning == true", 
                    new List<LangToken>()
                    {
                        LangToken.If,
                        LangToken.Whitespace,
                        LangToken.Identifier,
                        LangToken.Whitespace,
                        LangToken.DoubleEqual,
                        LangToken.Whitespace,
                        LangToken.True,
                    },
                },
            };
        }

        private Tokenizer<LangToken> CreateSut()
        {
            return TokenizerFactory.Create();
        }
    }

and the Tokenizer is this:

    public static class TokenizerFactory
    {
        public static Tokenizer<LangToken> Create()
        {
            return new TokenizerBuilder<LangToken>()
                .Match(Character.WhiteSpace.AtLeastOnce(), LangToken.Whitespace)
                .Match(Span.EqualTo("=="), LangToken.DoubleEqual)
                .Match(Character.EqualTo('='), LangToken.Equal)
                .Match(Character.EqualTo('('), LangToken.LeftParenthesis)
                .Match(Character.EqualTo(')'), LangToken.RightParenthesis)
                .Match(Character.EqualTo('{'), LangToken.LeftBrace)
                .Match(Character.EqualTo('}'), LangToken.RightBrace)
                .Match(Character.EqualTo(';'), LangToken.Semicolon)
                .Match(Span.EqualTo("if"), LangToken.If, true)
                .Match(Span.EqualTo("while"), LangToken.While, true)
                .Match(Span.EqualTo("true"), LangToken.True, true)
                .Match(Span.EqualTo("false"), LangToken.False, true)
                .Match(Span.Regex(@"\w[\w\d]*"), LangToken.Identifier, true)
                .Build();
        }
    }

I hope you understand how xUnit works for tests :) Basically, it takes the object arrays returned by the static TokenData method and passes them as parameters to the [Theory] method. As you can see, the TokenData method returns the test data: each entry contains the input and the expected tokens. The latter are the ones you asked for :)

nblumhardt commented on July 24, 2024

@SuperJMN looking good! Curious - why does your grammar need whitespace tokens?

Platzer commented on July 24, 2024

@nblumhardt if you want to keep the user's formatting and highlight errors with squiggles, you need to know how much whitespace the user entered (see: NDC London, "How to parse a file" by Matt Ellis).

SuperJMN commented on July 24, 2024

@nblumhardt It seems I asked myself the wrong question: "how can my grammar ignore whitespace?" :) To tell the truth, I just added the whitespace tokens because ignoring them seemed unnatural at first glance, but now that you mention it, it's better to remove them because it will make my parsers simpler :)

nblumhardt commented on July 24, 2024

That's great to hear, @Platzer!

straatvark commented on July 24, 2024

A question, please. I have a situation similar to the one above, where
.Match(Span.EqualTo("while"), LangToken.While) matches 'whileRunning'.

I have created this helper method:

    public static TextParser<TextSpan> BuildTextParserEqualTo(string equalTo)
    {
        return Span.EqualTo(equalTo);
    }

which I call from here:

    var tokenizerBuilder = new TokenizerBuilder<InputToken>();
    tokenizerBuilder.Ignore(Span.WhiteSpace);
    tokenizerBuilder.Match(BuildTextParserEqualTo("and"), InputToken.And);
    tokenizerBuilder.Match(Span.NonWhiteSpace, InputToken.None);
    return tokenizerBuilder.Build();

But the "and" also gets matched in this text: "aaa bbb ccc andDDD", which is unwanted in my case.

How do I get it to match only the "and" in "aaa bbb ccc and DDD", i.e. only as a stand-alone word / token?

(I tried the Regex approach as well, but that also seems to ignore the $ at the end of my expression.)

Any help would be much appreciated.
