Giter Club home page Giter Club logo

closedxml.parser's Introduction

ClosedParser

ClosedParser is a project to parse OOXML grammar to create an abstract syntax tree that can be later evaluated.

Official source for the grammar is MS-XML, chapter 2.2.2 Formulas. The provided grammar is not usable for parser generators, it's full of ambiguities and the rules don't take into account operator precedence.

How to use

  • Implement IAstFactory interface.
  • Call parsing methods
    • FormulaParser<TScalarValue, TNode>.CellFormulaA1("Sum(A1, 2)", astFactory)
    • FormulaParser<TScalarValue, TNode>.CellFormulaR1C1("Sum(R1C1, 2)", astFactory)

Visualizer

There is a visualizer to display AST in a browser at https://parser.closedxml.io

image

Goals

  • Performance - ClosedXML needs to parse formula really fast. Limit allocation and so on.
  • Evaluation oriented - Parser should concentrates on creation of abstract syntax trees, not concrete syntax tree. Goal is evaluation of formulas, not transformation.
  • Multi-use - Formulas are mostly used in cells, but there are other places with different grammar rules (e.g. sparklines, data validation)
  • Multi notation (A1 or R1C1) - Parser should be able to parse both A1 and R1C1 formulas. I.e. SUM(R5) can mean return sum of cell R5 in A1 notation, but return sum of all cells on row 5 in R1C1 notation.

Project uses ANTLR4 grammar file as the source of truth and a lexer. There is also ANTLR parser is not used, but is used as a basis of recursive descent parser (ANTLR takes up 8 seconds vs RDS 700ms for parsing of enron dataset).

ANTLR4 one of few maintained parser generators with C# target.

The project has a low priority, XLParser mostly works, but in the long term, replacement is likely.

Current performance

ENRON dataset parsed using recursive descent parser and DFA lexer in Release mode:

  • Total: 946320
  • Elapsed: 1838 ms
  • Per formula: 1.942 μs

2μs per formula should be something like 6000 instructions (under unrealistic assumption 1 instruction per 1 Hz), so basically fast enough.

Limitations

The primary goal is to parse formulas stored in file, not user supplied formulas. The formulas displayed in the GUI is not the same as formula stored in the file. Several examples:

  • The IFS function is a part of future functions. In the file, it is stored as _xlfn.IFS, but user sees IFS
  • In the structured references, user sees @ as an indication that structured references this row, but in reality it is a specifier [#This Row]

Therefore:

  • External references are accepted only in form of an index to an external file (e.g. [5])
  • There are several formula implementations out there with slighly different grammar incompatible with OOXML formulas ([1]!'Some name in external wb'). They are out of scope of the project.

Why not use XLParser

ClosedXML is currently using XLParser and transforming the concrete syntax tree to abstract syntax tree.

  • Speed:
    • Grammar extensively uses regexps extensively. Regexs are slow, especially for NET4x target, allocates extra memory. XLParser takes up 47 seconds for Enron dataset on .NET Framework. .NET teams had made massive improvements on regexs, so it takes only 16 seconds on NET7.
    • IronParser needs to determine all possible tokens after every token, that is problematic, even with the help of prefix hints.
  • AST: XLParser creates concentrates on creation of concrete syntax tree, but for ClosedXML, we need abstract syntax tree for evaluation. IronParser is not very friendly in that regard
  • XLParser uses IronParser, an unmaintained project (IronParser recently released version 1.2).
  • Doesn't have support for lambdas and R1C1 style.

ANTLR lexer takes up about 3.2 seconds for Enron dataset. With ANTLR parsing, it takes up 11 seconds. I want that 7+ seconds in performance and no allocation, so RDS that takes up 700 ms.

Debugging

Use vscode-antlr4 plugin for debugging the grammar.

Testing strategy

  • Each token that contains some data that are extracted for a node (e.g. A1_REFERENCE C5 to row 5, column 3) has a separate test class in Lexers directory with a {TokenPascalName}TokenTests.cs
  • Each parser rule has a test class in Rules directory. It should contain all possible combinatins of a rule and comparing it with the AST nodes.
  • Data set tests are in DataSetTests.cs. Each test tries to parse formula and ensures that ANTLR can parse it RDS can and can't parse a formula when ANTLR can't. There is no check of the output, just that formulas can be parsed. Data are contained in a data directory in CSV format with a one column.

Rolex

Rolex is a DFA based lexer released under MIT license (see Rolex: Unicode Enabled Lexer Generator in C# ). ANTLR is still the source of truth, but it is used to generate Rolex grammar and then DFA for a lexer.

It is rather complicated, but two times faster than ANTLR lexer (1.9 us vs 3.676 us per formula).

Generate lexer

Prepare rolex grammars

  • Run Antlr2Rolex over FormulaLexer.g4 with A1 version to ClosedXML.Parser\Rolex\LexerA1.rl
  • Add /* at the beginning of Local A1 References section. It comments out A1_REFERENCE and all its fragments
  • Remove /* at the beinning of Local R1C1 References section. It contains a different tokens for A1_REFERENCE and its fragments
  • Run Antlr2Rolex over FormulaLexer.g4 with R1C1 version to ClosedXML.Parser\Rolex\LexerR1C1.rl

Fix Rolex generator

  • Fix bug in Rolex generator that doesn't recognize property \u1234 (just add pc.Advance() to FFA.cs _ParseEscapePart and _ParseRangeEscapePart]

Generate a DFA through Rolex

  • Rolex.exe ClosedXML.Parser\Rolex\LexerA1.rl /noshared /output ClosedXML.Parser\Rolex\RolexA1Dfa.cs /namespace ClosedXML.Parser.Rolex
  • Rolex.exe ClosedXML.Parser\Rolex\LexerR1C1.rl /noshared /output ClosedXML.Parser\Rolex\RolexR1C1Dfa.cs /namespace ClosedXML.Parser.Rolex

TODO

  • Lexer generation during build
  • Proper CI pipeline.
    • Azure Function
    • Web
  • Fuzzer
  • PR to Rolex to fix unicode bug.

Resources

closedxml.parser's People

Contributors

jahav avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

closedxml.parser's Issues

Use ident based parser

The tokens currently detect a piece of a formula and also a type of a formula, e.g. it detects prefix through exclamation mark at the end of a token (e.g. SomeName!) and that way it is differentiated from a name token that doesn't have the exclamation mark at the end.

It works in most cases, but there are likely some edge cases, where name and other types are aliased and tokens are not what we expect.

The only case I have seen so far is a 3D reference, where A5:Sheet5!A1:A5 is recognized as a 3D reference instead of A1_CELL RANGE SHEET_PREFIX A1_AREA and I made a workaround for it, but there likely are others.

Instead, move to ident based parser, where token doesn't necessary mean specific type and it is handled in the parser and token is responsible only for tokens.

Instead, use IDENT token for sheet name or a name and separate exclamation mark.

Implicit intersection should be part of ref_expression

Implicit intersection is now part of a prefix_atom_expression and is thus a part of expression, but it should be a part of ref_expression, likely close to ref_spill_expression.

Example:B7:@A1:A4 is valid in Excel, but the parser returns an error

Error at char 3 of 'B7:@A1:A4': Unexpected token INTERSECT.

Name quotes

Add a logic to serialize sheet identifiers with a quotes. Sheet names sometimes require quotes and sometimes don't. There is some stuff in the grammar, but that is wildly inaccurate. As far as I can tell, there is no reasonable rule, like unicode categories. The only thing is that any character after Unicode 5.1 needs a quote.

I made AutoHotKey script to rename a sheet with all character from base material plane and checked a formula, if the renamed sheet name had a quote. That was done for first char and next char:

ident-sheet-first.txt
ident-sheet-next.txt

Create a serialization logic to respect the expected quotes, so Excel can eat it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.