leac's Introduction

leac


Lexer / tokenizer.

Features

  • Lightweight. Zero dependencies. Not a lot of code.

  • Well tested - comes with tests for everything, including the examples.

  • Compact syntax - less boilerplate. Rule name is enough when it is the same as the lookup string.

  • No failures - it just stops when no rule matches and returns, in addition to the tokens array, information about whether it completed and where it stopped.

  • Composable lexers - instead of states within a lexer (see the sketch after this list).

  • Stateless lexers - all inputs are passed as arguments, all outputs are returned in a result object.

  • No streaming - accepts a string at a time (more on this below).

  • Only text tokens, no arbitrary values. It seems to be a good habit to have tokens that are trivially serializable back into a valid input string. Don't do the parser's job. There are a couple of convenience features, such as the ability to discard matches or to apply string replacements for regular expression rules, but these have to be used mindfully (more on this below).
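
To illustrate the composable-lexers point, here is a minimal sketch of one lexer pushing another for comment contents. The push and pop options are the same ones used in the Lua lexer further down this page; treat the exact rule shapes as an assumption rather than as API documentation.

import { createLexer } from 'leac';

// Sub-lexer used while inside a /* ... */ comment.
const commentLexer = createLexer([
  { name: 'commentEnd', str: '*/', pop: true },      // return to the parent lexer
  { name: 'commentText', regex: /[^*]+|\*(?!\/)/ },  // anything that is not the terminator
]);

const lex = createLexer([
  { name: 'ws', regex: /\s+/, discard: true },
  { name: 'word', regex: /\w+/ },
  { name: 'commentStart', str: '/*', push: commentLexer },  // switch to the sub-lexer until it pops
]);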

Changelog

Available here: CHANGELOG.md

Install

Node

> npm i leac
> yarn add leac
import { createLexer, Token } from 'leac';

Deno

import { createLexer, Token } from 'https://deno.land/x/leac@.../leac.ts';

Examples

const lex = createLexer([
  { name: '-', str: '-' },
  { name: '+' },                                   // the rule name doubles as the lookup string
  { name: 'ws', regex: /\s+/, discard: true },     // whitespace is matched but not emitted
  { name: 'number', regex: /[0-9]|[1-9][0-9]+/ },  // a single digit, or multiple digits without a leading zero
]);

const { tokens, offset, complete } = lex('2 + 2');
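
Roughly what to expect from this call - a sketch based on the rules above, not verified output:

// tokens   - three tokens with texts '2' (number), '+' ('+') and '2' (number);
//            the whitespace matches are discarded
// offset   - 5, the position where lexing stopped (the end of the input)
// complete - true, since the whole input was consumed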

Published packages using leac

API

A word of caution

It is often really tempting to rewrite tokens on the go. But it can be dangerous unless you are absolutely mindful of all the edge cases.

For example, who needs to carry string quotes around, right? The parser will only need the string content...

We'll have to consider the following things:

  • Regular expressions. Sometimes we want to match strings that can be zero or more characters long.

  • Tokens are not produced without changing the offset. If something is missing - there is no token.

    If we allow a token with zero length - it will cause an infinite loop, as the same rule will be matched at the same offset, again and again.

  • Discardable tokens - a convenience feature that may seem harmless at first glance.

When put together, these things plus some intuition traps can lead to a broken array of tokens.

Strings can be empty, which means the token can be absent. With no content and no quotes, the tokens array will most likely make no sense to a parser.
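
To make this concrete, here is a sketch of the trap (the rule shapes are illustrative assumptions, not taken from a real grammar): the quotes are discarded and the content rule can match zero characters.

const stringBody = createLexer([
  { name: 'content', regex: /[^"\\]*/ },                       // may match zero characters
  { name: 'closeQuote', str: '"', pop: true, discard: true },
]);

const lexStrings = createLexer([
  { name: 'openQuote', str: '"', push: stringBody, discard: true },
]);

// For the input '""' no content token is produced (a zero-length match
// does not produce a token) and both quotes are discarded, so nothing
// is left in the tokens array to tell the parser a string was there.
const { tokens } = lexStrings('""');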

How to avoid potential issues:

  • Don't discard anything that you may need to insert back if you try to immediately serialize the tokens array to string. This means whitespace is usually safe to discard, while string quotes are not (what counts as safe will heavily depend on the grammar - you may have a language with significant spaces and insignificant quotes...);

  • You can introduce a higher-priority rule to capture an empty string (an opening quote immediately followed by a closing quote) and emit a special token for that. This way an empty string between quotes can't occur down the line;

  • Match the whole string (content and quotes) with a single regular expression and let the parser deal with it (see the sketch after this list). This can actually lead to a cleaner design than trying to be clever and removing "unnecessary" parts early;

  • Match the whole string (content and quotes) with a single regular expression, use capture groups and the replace property. This can produce a non-zero-length token with empty text.
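
A sketch of the whole-string approach from the third bullet (rule names and regular expressions here are illustrative; the quote kind is kept in the rule name, as suggested in the note below):

const lexQuoted = createLexer([
  // Whole string, quotes included - trivially serializable back into the input.
  { name: 'dqString', regex: /"(?:\\.|[^"\\])*"/ },
  { name: 'sqString', regex: /'(?:\\.|[^'\\])*'/ },
  { name: 'ws', regex: /\s+/, discard: true },
]);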

Another note about quotes: if the grammar allows for different kinds of quotes and you're still willing to get rid of them early - think about how you're going to unescape the string later. Make sure you carry the information about the exact kind of string at least in the token name - you will need it later.

What about ...?

  • performance - The code is very simple, but I won't put any unverified assumptions here. I'd be grateful to anyone who can provide a good benchmark project to compare different lexers.

  • stable release - The current release is well thought out and tested. I leave a chance that some changes might be needed based on feedback. Before version 1.0.0 this will be done without a deprecation cycle.

  • streaming - I have no use case for it - the majority of practical scenarios have a reasonable input size, and there is no need to pay the complexity cost. I may think about it again once I see a good use case.

Some other lexer / tokenizer packages


leac's Issues

Lua lexer

Thank you so much for leac, its API is very elegant and easy to use.

I'm creating the new assignments for the compilers class at the College of Engineering, UBA (University of Buenos Aires). The students might use the following Lua lexer, which I want to share with you in case you find it useful. If I find the time, I'll write some more tests and add an example in a PR.

import { createLexer, Rule, Rules } from "https://deno.land/x/leac/leac.ts";

const symbols = [
  ";",
  "=",
  ",",
  "::",
  ".",
  "[",
  "]",
  "...",
  "(",
  ")",
  ":",
  "{",
  "}",
];

const keywords = [
  "break",
  "goto",
  "do",
  "end",
  "while",
  "do",
  "repeat",
  "until",
  "if",
  "then",
  "elseif",
  "else",
  "for",
  "in",
  "function",
  "local",
  "return",
  "nil",
  "false",
  "true",
];

const ops = [
  "+",
  "-",
  "*",
  "/",
  "//",
  "^",
  "%",
  "&",
  "~",
  "|",
  ">>",
  "<<",
  "..",
  "<",
  "<=",
  ">",
  ">=",
  "==",
  "~=",
  "and",
  "or",
  "#",
  "not",
];
// Sub-lexer for the body of a double-quoted string: escape sequences,
// \u hex escapes, or any character other than a quote, backslash or newline.
const doublequoteStringLexer = createLexer([
  {
    name: "stringContent",
    regex: /(?:\\["abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^"\\\n])*/,
  },
  {
    name: "LiteralStringEnd",
    str: '"',
    pop: true,
    discard: true,
  },
]);
// Same idea, but for single-quoted strings.
const singlequoteStringLexer = createLexer([
  {
    name: "stringContent",
    regex: /(?:\\['abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^'\\\n])*/,
  },
  {
    name: "LiteralStringEnd",
    str: "'",
    pop: true,
    discard: true,
  },
]);
function smallerCaseRegexpPart(level: number) {
  if (level == 0) {
    return "";
  }
  if (level == 1) {
    return "|\\]\\]";
  }
  return `|\\]={0,${level - 1}}(?=\\])`;
}
// Matches content that cannot contain the closing long bracket
// ("]" followed by `level` equals signs and "]") of the current level.
function regexNoLongBrackets(level: number) {
  const smallerCase = smallerCaseRegexpPart(level);
  const largerCase = `|\\]={${level + 1},}(?=\\])`;
  const doesntEndWithBracketCase = `|\\]=*[^\\]=]`;
  return new RegExp(
    `([^\\]]${smallerCase}${largerCase}${doesntEndWithBracketCase})+`,
    "m",
  );
}
// Builds a rule for a long literal string of the given level, e.g. [==[ ... ]==]
// for level 2: the opening bracket pushes a sub-lexer that runs until the
// matching closing bracket.
function longLiteralStringRule(level: number) {
  const equalSigns = Array(level).fill("=").join("");
  const lexer = createLexer([
    {
      name: "LiteralStringEnd",
      str: "]" + equalSigns + "]",
      discard: true,
      pop: true,
    },
    {
      name: "stringContent",
      regex: regexNoLongBrackets(level),
      discard: true,
    },
  ]);
  return {
    name: "LiteralStringBegin",
    str: "[" + equalSigns + "[",
    push: lexer,
    discard: true,
  };
}
// Name-only rules: the rule name doubles as the lookup string.
const simpleRules = [...ops, ...keywords, ...symbols].map((v) => ({ name: v }));
// Same idea as longLiteralStringRule, but for long comments: --[==[ ... ]==]
function longCommentRule(level: number) {
  const equalSigns = Array(level).fill("=").join("");
  const lexer = createLexer([
    {
      name: "longCommentEnd",
      str: "]" + equalSigns + "]",
      discard: true,
      pop: true,
    },
    {
      name: "commentContent",
      regex: regexNoLongBrackets(level),
      discard: true,
    },
  ]);
  return {
    name: "longCommentBegin",
    str: "--[" + equalSigns + "[",
    push: lexer,
    discard: true,
  };
}
export const lex = createLexer(
  [
    {
      name: "ws",
      regex: /\s+/,
      discard: true,
    },
    ...Array(100).fill(0).map((_value, index) => longCommentRule(index)),
    {
      name: "shortComment",
      regex: /--.*\n/m,
      discard: true,
    },
    ...simpleRules,
    {
      name: "Name",
      regex: /[a-zA-Z_][a-zA-Z_0-9]*/,
    },
    {
      name: "Numeral",
      regex: /[0-9]*\.?[0-9]+/,
    },
    {
      name: "LiteralStringBegin",
      str: '"',
      push: doublequoteStringLexer,
      discard: true,
    },
    {
      name: "LiteralStringBegin",
      str: "'",
      push: singlequoteStringLexer,
      discard: true,
    },
    ...Array(100).fill(0).map((_value, index) => longLiteralStringRule(index)),
  ],
);
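
For completeness, a hypothetical usage sketch (the input and the expected tokens are illustrative assumptions, not part of the original issue):

const { tokens, complete } = lex('local answer = 42 -- the answer\n');
// Whitespace and the short comment are discarded, so the tokens should
// roughly be: 'local' (keyword), 'answer' (Name), '=' (symbol) and '42' (Numeral).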
