sax-wasm's Introduction

SAX (Simple API for XML) for WebAssembly

When you absolutely, positively have to have the fastest parser in the room, accept no substitutes.

The first streamable, low memory XML, HTML, JSX and Angular Template parser for WebAssembly.

Sax Wasm is a SAX-style parser for XML, HTML, JSX and Angular templates, written in Rust and compiled to WebAssembly, with the sole motivation of bringing faster-than-native speeds to XML and JSX parsing for Node and the web. Inspired by sax-js and rebuilt in Rust for WebAssembly, sax-wasm adds speed optimizations and support for JSX syntax.

Suitable for LSP implementations, sax-wasm provides line numbers and character positions within the document for elements, attributes and text nodes, which provide the raw building blocks for linting, transpilation and lexing.

Benchmarks (Node v18.16.1 / 2.7 GHz Quad-Core Intel Core i7)

All parsers are tested using a large XML document (2.1 MB) containing a variety of elements, streamed when the parser supports it. This attempts to recreate the best real-world use case for parsing XML. Other libraries benchmark with a very small XML fragment such as <foo bar="baz">quux</foo>, which does not hit all code branches responsible for processing the document and heavily skews the results in their favor.

Parser with Advanced Features time/ms (lower is better) JS Runs in browser
sax-wasm 64.16
sax-js 155.77 ☑*
libxmljs 274.95
node-xml 685.00
*built for node but should run in the browser

Installation

npm i sax-wasm

Usage in Node

const fs = require('fs');
const path = require('path');
const { SaxEventType, SAXParser } = require('sax-wasm');

// Get the path to the WebAssembly binary and load it
const saxPath = require.resolve('sax-wasm/lib/sax-wasm.wasm');
const saxWasmBuffer = fs.readFileSync(saxPath);

// Instantiate
const options = {highWaterMark: 32 * 1024}; // 32k chunks
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag, options);
parser.eventHandler = (event, data) => {
  if (event === SaxEventType.Attribute) {
    // process attribute
  } else {
    // process open tag
  }
};

// Instantiate and prepare the wasm for parsing
parser.prepareWasm(saxWasmBuffer).then(ready => {
  if (ready) {
    // stream from a file in the current directory
    const readable = fs.createReadStream(path.resolve('.', 'path/to/document.xml'), options);
    readable.on('data', (chunk) => {
      parser.write(chunk);
    });
    readable.on('end', () => parser.end());
  }
});

Usage for the web

  1. Instantiate and prepare the wasm for parsing
  2. Pipe the document stream to sax-wasm using ReadableStream.getReader()

NOTE: This uses WebAssembly.instantiateStreaming under the hood to load the wasm.

import { SaxEventType, SAXParser } from 'sax-wasm';

async function loadAndPrepareWasm() {
  const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag, {highWaterMark: 64 * 1024}); // 64k chunks

  const saxWasmResponse = fetch('./path/to/wasm/sax-wasm.wasm');
  const ready = await parser.prepareWasm(saxWasmResponse);
  if (ready) {
    return parser;
  }
}

loadAndPrepareWasm().then(processDocument);

function processDocument(parser) {
  parser.eventHandler = (event, data) => {
    if (event === SaxEventType.Attribute) {
      // process attribute
    } else {
      // process open tag
    }
  };

  fetch('path/to/document.xml').then(async response => {
    if (!response.ok) {
      // fail in some meaningful way
    }
    // Get the reader to stream the document to sax-wasm
    const reader = response.body.getReader();
    while(true) {
      const chunk = await reader.read();
      if (chunk.done) {
        return parser.end();
      }
      parser.write(chunk.value); // reader.read() resolves to { done, value }
    }
  });
}

Differences from sax-js

Besides being incredibly fast, there are some notable differences between sax-wasm and sax-js that may affect some users when migrating:

  1. JSX is supported, including JSX fragments. Things like <foo bar={this.bar()}></foo> and <><foo/><bar/></> will parse as expected.
  2. Angular 2+ templates are supported. Things like <button type="submit" [disabled]=disabled *ngIf=boolean (click)="clickHandler(event)"> will parse as expected.
  3. No attempt is made to validate the document. sax-wasm reports what it sees. If you need strict mode or document validation, it may be recreated by applying rules to the events that are reported by the parser.
  4. Namespaces are reported in attributes. No special events dedicated to namespaces.
  5. Input must be streamed as UTF-8 encoded bytes in a Uint8Array.

Streaming

Streaming is supported with sax-wasm by writing UTF-8 encoded bytes (Uint8Array) to the parser instance. Writes can occur safely anywhere except within the eventHandler function or within the eventTrap (when extending the SAXParser class). Writing from either of those places risks overwriting memory still in play.

Events

Events are subscribed to using a bitmask composed from flags representing the event type. Bit positions along a 12 bit integer can be masked on to tell the parser to emit the event of that type. For example, passing in the following bitmask to the parser instructs it to emit events for text, open tags and attributes:

import { SaxEventType } from 'sax-wasm';
parser.events = SaxEventType.Text | SaxEventType.OpenTag | SaxEventType.Attribute;

Complete list of event/argument pairs:

Event                               Mask            Argument passed to handler
SaxEventType.Text                   0b000000000001  text: Text
SaxEventType.ProcessingInstruction  0b000000000010  procInst: Text
SaxEventType.SGMLDeclaration        0b000000000100  sgmlDecl: Text
SaxEventType.Doctype                0b000000001000  doctype: Text
SaxEventType.Comment                0b000000010000  comment: Text
SaxEventType.OpenTagStart           0b000000100000  tag: Tag
SaxEventType.Attribute              0b000001000000  attribute: Attribute
SaxEventType.OpenTag                0b000010000000  tag: Tag
SaxEventType.CloseTag               0b000100000000  tag: Tag
SaxEventType.CDATA                  0b001000000000  start: Position
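
As a rough illustration of how the mask composes, the sketch below copies a few values from the table into a local object and uses bitwise operators to build and test a subscription. `EventMask` and `isSubscribed` are stand-ins defined here for the example; in real code the flags would come from `SaxEventType`.

```javascript
// Local stand-in that mirrors a few rows of the table above.
// In real code these values would come from SaxEventType.
const EventMask = {
  Text:      0b000000000001,
  Attribute: 0b000001000000,
  OpenTag:   0b000010000000,
  CloseTag:  0b000100000000,
};

// Combine flags with bitwise OR to build the subscription mask.
const subscribed = EventMask.Text | EventMask.OpenTag | EventMask.Attribute;

// A flag is "on" when ANDing it with the mask leaves a non-zero value.
function isSubscribed(mask, flag) {
  return (mask & flag) !== 0;
}

console.log(subscribed.toString(2).padStart(12, '0')); // 000011000001
console.log(isSubscribed(subscribed, EventMask.OpenTag));  // true
console.log(isSubscribed(subscribed, EventMask.CloseTag)); // false
```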

Speeding things up on large documents

The sax-wasm parser is incredibly fast and can parse very large documents in the blink of an eye. Although its out-of-the-box performance is ridiculous, the JavaScript thread must still transform raw bytes into human-readable data, so slowdowns can occur if you're not careful. These are some of the items to consider when top speed and performance are an absolute must:

  1. Stream your document from its source as a Uint8Array - This is covered in the examples above. Things slow down significantly when the document is loaded in JavaScript as a string, then encoded to bytes using Buffer.from(document) or new TextEncoder().encode(document) before being passed to the parser. Encoding on the JavaScript thread adds a non-trivial amount of overhead, so it's best to keep the data as raw bytes. Streaming often means the parser will already be done once the document finishes downloading!
  2. Keep the events bitmask to a bare minimum whenever possible - the more events that are required, the more work the JavaScript thread must do once sax-wasm.wasm reports back.
  3. Limit property reads on the reported data to only what's necessary - this includes things like stringifying the data to json using JSON.stringify(). The first read of a property on a data object reported by the eventHandler will retrieve the value from raw bytes and convert it to a string, number or Position on the JavaScript thread. This conversion time becomes noticeable on very large documents with many elements and attributes. NOTE: After the initial read, the value is cached and accessing it becomes faster.
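
The decode-on-first-read behavior described in item 3 can be pictured with a small sketch. The `LazyText` class below is hypothetical and is not sax-wasm's own implementation; it only shows why the first property read costs more than later ones and why skipping reads you don't need saves work.

```javascript
// Illustration only: a hypothetical LazyText wrapper, not sax-wasm's own class.
// It decodes its bytes on the first read of `value` and caches the result.
class LazyText {
  constructor(bytes) {
    this.bytes = bytes;      // raw UTF-8 bytes
    this.cached = undefined; // decoded string, filled on first read
    this.decodeCount = 0;    // counts how often decoding actually runs
  }

  get value() {
    if (this.cached === undefined) {
      this.decodeCount += 1;
      this.cached = new TextDecoder().decode(this.bytes);
    }
    return this.cached;
  }
}

const text = new LazyText(new TextEncoder().encode('CHAPTER ONE'));
console.log(text.decodeCount); // 0, nothing has been decoded yet
console.log(text.value);       // CHAPTER ONE, decoded on this first read
console.log(text.value);       // CHAPTER ONE, served from the cache
console.log(text.decodeCount); // 1
```

Under this model, a handler that returns early for uninteresting events never pays the decoding cost for them.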

SAXParser.js

Constructor

SAXParser([events: number, [options: SaxParserOptions]])

Constructs a new SAXParser instance with the specified events bitmask and options.

Parameters

  • events - A number representing a bitmask of events that should be reported by the parser.
  • options - When specified, the highWaterMark option is used to prepare the parser for the expected size of each chunk provided by the stream. The parser will throw if chunks written to it are larger.
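
When the whole document is already in memory as one large Uint8Array, one way to stay under highWaterMark is to write it in slices. The helper below is a minimal sketch; `writeInChunks` is a name made up for this example, and it assumes the parser tolerates a chunk boundary falling in the middle of a multi-byte UTF-8 sequence, as can happen with any byte stream.

```javascript
// Splits `bytes` into views of at most `highWaterMark` bytes and writes each one.
// `parser` is anything with a write(Uint8Array) method, such as a prepared SAXParser.
function writeInChunks(parser, bytes, highWaterMark) {
  for (let offset = 0; offset < bytes.length; offset += highWaterMark) {
    // subarray() creates a view on the same memory, so nothing is copied here.
    parser.write(bytes.subarray(offset, offset + highWaterMark));
  }
}

// Stand-in parser that records chunk sizes, used only to show the behavior.
const recorded = [];
const fakeParser = { write: (chunk) => recorded.push(chunk.length) };

writeInChunks(fakeParser, new Uint8Array(70), 32);
console.log(recorded); // [ 32, 32, 6 ]
```

With a real parser, the third argument would be the same highWaterMark value that was passed to the constructor, followed by a call to parser.end().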

Methods

  • prepareWasm(wasm: Uint8Array): Promise<boolean> - Instantiates the wasm binary with reasonable defaults and stores the instance as a member of the class. Always resolves to true or throws if something went wrong.

  • write(chunk: Uint8Array, offset: number = 0): void; - writes the supplied bytes to the wasm memory buffer and kicks off processing. An optional offset can be provided if the read should occur at an index other than 0. NOTE: The line and character counters are not reset.

  • end(): void; - Ends processing for the stream. The line and character counters are reset to zero and the parser is readied for the next document.

Properties

  • events - A bitmask containing the events to subscribe to. See the examples for creating the bitmask

  • eventHandler - A function reference used for event handling. The supplied function must accept 2 arguments: the event, which is one of the SaxEventType values, and the body (listed in the table above).

sax-wasm.wasm

Methods

The methods listed here can be used to create your own implementation of the SAXParser class when extending it or composing it will not meet the needs of the program.

  • parser(events: u32) - Prepares the parser struct internally and supplies it with the specified events bitmask. Changing the events bitmask can be done at any time during processing using this method.

  • write(ptr: *mut u8, length: usize) - Supplies the parser with the location and length of the newly written bytes in the stream and kicks off processing. The parser assumes that the bytes are valid UTF-8. Writing non-UTF-8 bytes may cause unpredictable results but probably will not break.

  • end() - resets the character and line counts but does not halt processing of the current buffer.

Building from source

Prerequisites

This project requires rust v1.30+ since it contains the wasm32-unknown-unknown target out of the box.

Install rust:

curl https://sh.rustup.rs -sSf | sh

Install the stable compiler and switch to it.

rustup install stable
rustup default stable

Install the wasm32-unknown-unknown target.

rustup target add wasm32-unknown-unknown --toolchain stable

Install node with npm then run the following command from the project root.

npm install

Install the wasm-bindgen-cli tool

cargo install wasm-bindgen-cli

The project can now be built using:

npm run build

The artifacts from the build will be located in the /lib directory.

sax-wasm's People

Contributors

amygdaloideum, benjinus, dependabot[bot], efx, jaegermasw, justinwilaby, samuelcolvin


sax-wasm's Issues

Problem with a repeated fragment on a large xml file

When parsing a large XML file, at some point a certain fragment (but not the first one) starts repeating itself instead of the original content. Increasing the highWaterMark value increases the repeated fragment.

I just took the Node-example and changed this part:

const parser = new SAXParser(SaxEventType.Text, options);
parser.eventHandler = (event, data) => {
  if ( data.value.includes( 'CHAPTER' ) ) {
    console.log(data.value);
  }
};

CHAPTER is some text that is repeated periodically with a variable part (a number).

The output looks like this:

CHAPTER ONE
CHAPTER TWO
CHAPTER THREE
CHAPTER TWO
CHAPTER THREE
CHAPTER TWO
CHAPTER THREE
...

Should be like:

CHAPTER ONE
CHAPTER TWO
CHAPTER THREE
CHAPTER FOUR
CHAPTER FIVE
CHAPTER SIX
CHAPTER SEVEN
...

Node v12.18.0

Extract values of result for further processing

Is your feature request related to a problem? Please describe.
Hi @justinwilaby

Thanks for this great lib!
I just tried it out and it is very fast.

We have a use-case where we need to go through large XML files 100MB+ and process elements within.

I am looking for a solution to stream through the XML and find the <Transaction></Transaction> elements, emit their content as an event to a queue and process them later with XPATH. By emitting smaller chunks of the XML, our RAM does not get maxed out when parsing.

I was able to use SaxEventType.CloseTag to process the XML and find the start / end values, for example:

  "openStart": {
    "line": 337741,
    "character": 10
  },
  "openEnd": {
    "line": 337741,
    "character": 37
  },
  "closeStart": {
    "line": 338092,
    "character": 10
  },
  "closeEnd": {
    "line": 338092,
    "character": 26
  }

Do you think it's possible with your lib to extract the XML content on SaxEventType.CloseTag and emit them to a queue?

Thanks a lot

Describe the solution you'd like
The possibility to extract the XML values when a tag is closed.

Describe alternatives you've considered
I was thinking of getting the start/end value with your library and then stream again through the XML file and extract the values.
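
The two-pass alternative could be sketched roughly as follows. `sliceByPosition` is a hypothetical helper, not part of sax-wasm. It assumes zero-based line and character values, '\n' line endings, an exclusive end position, and that `character` lines up with JavaScript string indices, which may not hold for text outside the Basic Multilingual Plane.

```javascript
// Returns the text between two { line, character } positions.
// Assumes zero-based lines and characters, '\n' line endings, and that
// `character` counts JavaScript string indices; `end` is treated as exclusive.
function sliceByPosition(source, start, end) {
  const lines = source.split('\n');
  if (start.line === end.line) {
    return lines[start.line].slice(start.character, end.character);
  }
  const first = lines[start.line].slice(start.character);
  const middle = lines.slice(start.line + 1, end.line);
  const last = lines[end.line].slice(0, end.character);
  return [first, ...middle, last].join('\n');
}

const xml = '<Root>\n  <Transaction>\n    <Id>1</Id>\n  </Transaction>\n</Root>';
const fragment = sliceByPosition(
  xml,
  { line: 1, character: 2 },  // plays the role of openStart
  { line: 3, character: 16 }  // plays the role of closeEnd
);
console.log(fragment);
```

This version keeps the whole document in memory, so for 100MB+ files the same idea would have to be applied to a rolling window of lines rather than to the full string.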

Additional context

Cheers!

Error with JSX parsing of inline valid javascript expressions including `<`

Describe the bug
Suppose you have the following:

<foo>{bar < baz ? <div></div> : <></>}</foo>

The SAX parser will detect that < baz ? <div> is a tag.
To Reproduce
Use this example

Expected behavior
I expect that any complex expression is ignored from the parser, even if it contains <

I'd be happy to help fix if there are code pointers / strategies that would be effective in solving this!

way to identify JSX style attributes

Hi, sorry me again.

I can parse the following and it works just fine: <div class={foobar}/>.

However, there's no way to differentiate the attribute value from <div class="foobar"/>.

Attribute.value is just a Text:

sax-wasm/src/sax/tag.rs

Lines 110 to 113 in abd01fc

pub struct Attribute {
pub name: Text,
pub value: Text,
}

If there was some way to flag the difference it would be really useful for my application.

Parser failing on some XML files

Hey, awesome project. I'm trying to create my own XML parser with this since I had to use sax anyway.

I've got it mostly running but on some XML files it's acting weird. It seems to make a Tag out of an Attribute.

The full prototype that I'm using can be found here. Just npm install and npm run dev should show an error.

If you debug it and inspect the tag.attributes then you can see that the first attribute has a name of file:localhostZ:audioDarude20-2.... which is actually the value of the Location attribute.

I thought maybe some weird character is doing this but I couldn't find anything.

Any ideas?

Invalid typed array length

When I work with small XML files (<500 KB), it works fine. But with large XML files, it shows me this error:

Uncaught RangeError: Invalid typed array length: 2428901
at new Uint8Array ()
at SAXParser.write (saxWasm.js:63)
at XMLHttpRequest.xhttp.onreadystatechange (index.js:49)

How can I deal with this problem?
Thank you.

html containing `<?>` stops further parsing

Describe the bug

Any HTML that contains <?> stops the parsing. This is a "trick" to shorten <!--?--> to save bytes; it's specified here: https://html.spec.whatwg.org/#parse-error-unexpected-question-mark-instead-of-tag-name

<?> is in the output of lit ssr... here is the original issue

To Reproduce
parse the following html

<!--lit-part cI7PGs8mxHY=-->
  <p><!--lit-part-->hello<!--/lit-part--></p>
  <!--lit-part BRUAAAUVAAA=--><?><!--/lit-part-->
  <!--lit-part--><!--/lit-part-->
  <p>more</p>
<!--/lit-part-->

Expected behavior

p
p

actual behavior

p

Additional context

full code to reproduce

import saxWasm from 'sax-wasm';
import { createRequire } from 'module';
import { readFile } from 'fs/promises';

export const { SaxEventType, SAXParser } = saxWasm;

const require = createRequire(import.meta.url);

export const streamOptions = { highWaterMark: 128 * 1024 };
const saxPath = require.resolve('sax-wasm/lib/sax-wasm.wasm');
const saxWasmBuffer = await readFile(saxPath);
export const parser = new SAXParser(SaxEventType.CloseTag, streamOptions);

await parser.prepareWasm(saxWasmBuffer);

parser.eventHandler = (ev, data) => {
  if (ev === SaxEventType.CloseTag) {
    console.log(data.name);
  }
};
parser.write(Buffer.from(`
<!--lit-part cI7PGs8mxHY=-->
  <p><!--lit-part-->hello<!--/lit-part--></p>
  <!--lit-part BRUAAAUVAAA=--><?><!--/lit-part-->
  <!--lit-part--><!--/lit-part-->
  <p>more</p>
<!--/lit-part-->
`));
parser.end();

How to pipe stream into the parser?

I try to pipe a stream into the parser, but it fails:

const fs = require('fs');
const { SaxEventType, SAXParser } = require('sax-wasm');
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag, {highWaterMark: 32 * 1024});

parser.eventHandler = (event, data) => {
  console.log(JSON.stringify(data))
};

fs.createReadStream(`sample.xml`).pipe(parser);

Uncaught TypeError TypeError: dest.on is not a function
    at Readable.pipe (node:internal/streams/readable:743:8)

I also tried with prepareWasm() but no success either.
Is it supported somehow?
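
The error above suggests that the parser object is not a Node stream, since `pipe()` expects a destination with `on()`. One possible workaround is to wrap the parser in a Writable. The sketch below assumes only the `write()` and `end()` methods documented in the README; `toWritable` is a hypothetical helper, not part of sax-wasm.

```javascript
const { Writable } = require('stream');

// Wraps any object exposing write(chunk) and end() in a Writable so that
// readable.pipe(...) has a real stream to talk to.
function toWritable(parser) {
  return new Writable({
    write(chunk, _encoding, callback) {
      try {
        parser.write(chunk); // chunk is a Buffer, which is a Uint8Array
        callback();
      } catch (err) {
        callback(err);
      }
    },
    final(callback) {
      parser.end(); // called once the source has been fully consumed
      callback();
    },
  });
}
```

Usage would look like fs.createReadStream('sample.xml', { highWaterMark: 32 * 1024 }).pipe(toWritable(parser)), called only after prepareWasm() has resolved. The chunk size is governed by the read stream's highWaterMark, so it should match the value given to the SAXParser constructor.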

Locate sax-wasm.wasm automatically

You could do something like this where you locate the wasm file based on import.meta.resolve. This should work for local files also, at least if fetch supports this. Dunno if this works on Node though:

parser.prepareWasm(fetch(import.meta.resolve('./sax.wasm')))

Please export Typescript definitions

Even though you use typescript in the project, the type definitions are not included in the NPM package or referenced in the package.json, which results in completely losing type information and docs when importing the package in another project.

Please could you consider including the typescript output files in the npm package, and also referencing them in the package.json with the types property.

Many thanks

React native npm module request $$

Hi Justin
I'm wondering if you would be interested in developing a React Native npm module based on your sax-wasm project.

Happy to pay for this service.

To give some background, we currently have an iOS app written in React Native that was ported from a java Android app. A lot of the code was converted to Kotlin, and then we compile the kotlin down to javascript. The application generally works, but there's a key performance bottleneck around the parsing of XML. It currently uses a Kotlin version of XMLPullParser, but we would like to replace it with something significantly better. Hence, the desire for a fast native (Rust/C++) XML parser but staying with the event-driven parsing style.

The ability to make a web request (POST/GET) within the native code is a bonus, but not necessary. Given the built in XMLHttpRequest object in react native only sends incremental updates in text mode, SaxParser.write() will need to support string input as well as uint8array. We'll also be reading files from the filesystem.

We intend to include this project into our main application as a NPM dependency.

Please respond if you are interested and I will supply my contact details and more details on the project.

Runtime-Error "unreachable" while parsing xml-file

When following the node-example, I'm getting the following error with an arbitrary xml-file:

wasm://wasm/0001d47a:9

^

RuntimeError: unreachable
    at wasm-function[8]:1
    at wasm-function[7]:104
    at wasm-function[9]:45
    at wasm-function[12]:3
    at wasm-function[19]:70
    at wasm-function[11]:100
    at wasm-function[37]:1555
    at wasm-function[28]:3662
    at wasm-function[27]:12
    at SAXParser.write (/home/cjk/proj/daimler/iparts/db-builder/node_modules/sax-wasm/lib/saxWasm.js:209:9)

Rust crate and changes to `EventListener` signature

Is your feature request related to a problem? Please describe.

Hi, I'm using sax-wasm from rust (it's great 🙏), I'm currently using git submodules, which is working fine, but clearly a proper rust crate would be ideal. (To be clear - a rust crate would be good long term, but I don't actually need it right now)

Secondly, I'm having some trouble setting up a "safe" way to collect nodes of an xml document in rust.

The problem is

pub type EventListener = fn(event: Event, data: &dyn Encode<Vec<u8>>);

The only way I can see to mutate an object in a function which matches this signature is to use a static mut and then unsafe, which isn't ideal.

Maybe I'm missing something obvious, but it would be great to be able to process "event/data"s and mutate a struct without resorting to unsafe.

Describe the solution you'd like

I guess either:

  • a Iter interface to SAXParser
  • or a compile time flag to change EventListener
  • or maybe I'm missing some other solution

Let me know if the request is unclear and I can add specific examples.

Node-example in README is incorrect

Just stumbled upon an error trying to build on the provided node-example in the readme:

readable.on('end', parser.end)

should probably be:

readable.on('end', () => parser.end())
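
The reason is how `this` is bound: Node's EventEmitter calls a listener with `this` set to the emitter, so a bare method reference runs against the stream instead of the parser. The sketch below reproduces that with stand-in objects; `FakeParser` is made up for the example.

```javascript
const { EventEmitter } = require('events');

// Minimal stand-in with a method that relies on `this`.
class FakeParser {
  constructor() { this.ended = false; }
  end() { this.ended = true; }
}

const parser = new FakeParser();
const readable = new EventEmitter(); // stands in for the read stream

// Bare method reference: the emitter becomes `this` inside end().
readable.once('end', parser.end);
readable.emit('end');
console.log(parser.ended); // false, the flag was set on `readable` instead

// Arrow function: end() is called on the parser itself.
readable.once('end', () => parser.end());
readable.emit('end');
console.log(parser.ended); // true
```

`readable.on('end', parser.end.bind(parser))` would work as well.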

No closeTag event for `<meta>` without an ending `/>`

Describe the bug
The closeTag event does not fire for <meta ...> tags that are not "manually" closed à la <meta />.

To Reproduce

👉 index.html

<!DOCTYPE html>
<meta http-equiv="refresh" content="0;url=/en/getting-started">

👉 test.js

import saxWasm from 'sax-wasm';
import fs from 'fs';
import { createRequire } from 'module';
import { readFile } from 'fs/promises';
const { SaxEventType, SAXParser } = saxWasm;
const require = createRequire(import.meta.url);
const saxPath = require.resolve('sax-wasm/lib/sax-wasm.wasm');
const saxWasmBuffer = await readFile(saxPath);
const parser = new SAXParser(SaxEventType.CloseTag);
await parser.prepareWasm(saxWasmBuffer);

parser.eventHandler = (ev, data) => {
  console.log(data.name);
};

const readable = fs.createReadStream('./index.html');
readable.on('data', chunk => {
  parser.write(chunk);
});
readable.on('end', () => {
  parser.end();
});

Steps to reproduce the behavior:

  1. node test.js

Expected behavior
Log of meta

Actual behavior
No output at all - the eventHandler never gets triggered.

Additional info

If the HTML source is changed to this

<meta http-equiv="refresh" content="0;url=/en/getting-started" />

then it works fine.

However given that in HTML5 <meta ...> is fine I was wondering if it could also handle it?

PS: My actual use case is a "website/link validator", so I am not in control of the HTML. Requiring "well"-structured HTML makes sense, but it would be nice not to have to force a specific way of writing meta tags 🤗
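
Until the parser handles this itself, one possible workaround in user code is to treat HTML void elements as complete as soon as their open tag is reported. The sketch below is only that, a workaround under assumptions: `isVoidElement` and `makeHandler` are hypothetical helpers, the parser would need to be subscribed to both OpenTag and CloseTag, and `types` stands for the `SaxEventType` object.

```javascript
// HTML void elements never have an end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

function isVoidElement(tagName) {
  return VOID_ELEMENTS.has(tagName.toLowerCase());
}

// Builds an eventHandler that reports each element once:
// void elements on OpenTag, everything else on CloseTag.
function makeHandler(types, onElement) {
  return (event, tag) => {
    const isVoid = isVoidElement(tag.name);
    if (event === types.OpenTag && isVoid) {
      onElement(tag.name);
    } else if (event === types.CloseTag && !isVoid) {
      onElement(tag.name);
    }
  };
}
```

Because void elements are skipped on CloseTag, a manually closed <meta /> is still reported only once.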

Improve API ergonomics by adding a parse() method to SAXParser

The API can be improved by adding a simple generator function like so:

class SAXParser {
  async *parse(reader) {
    let event
    this.eventHandler = function() {event=arguments}
    while(true) {
      const chunk = await reader.read()
      if (chunk.done) return this.end()
      this.write(chunk.value)
      if (event) {yield event;event=null}
    }
  }
}

No more need to write the callback and while loop yourself, the code becomes a lot more straightforward while adding little overhead.

for await (const [event,data] of parser.parse(reader))
    console.log({event,data})

I am currently already doing something like this using a wrapper:

async function SaxParser(events, options) {
  const parser = new SAXParser(events, options),
    saxWasmResponse = await loadSaxWasm()
  await parser.prepareWasm(saxWasmResponse)
  return async function*(reader) {
    let event
    parser.eventHandler = function() {event=arguments}
    while(true) {
      const chunk = await reader.read()
      if (chunk.done) return parser.end()
      parser.write(chunk.value)
      if (event) {yield event;event=null}
    }
  }
}

Love to know what you think.
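
One caveat with the snippets above: the handler overwrites `event` on every call, so if a chunk produces several events only the last one is yielded. A variant that queues every event might look like the sketch below. `parseEvents` is a hypothetical name, and the sketch assumes that the data objects passed to the handler stay readable until the next write() call.

```javascript
// Variant of the idea above that buffers every event a chunk produces.
// `parser` needs write(), end() and an assignable eventHandler;
// `reader` is a ReadableStreamDefaultReader-like object.
async function* parseEvents(parser, reader) {
  const queue = [];
  parser.eventHandler = (event, data) => queue.push([event, data]);
  while (true) {
    const { done, value } = await reader.read();
    if (done) {
      parser.end();
    } else {
      parser.write(value);
    }
    // Drain everything the handler collected before touching the parser again.
    while (queue.length > 0) {
      yield queue.shift();
    }
    if (done) {
      return;
    }
  }
}
```

It is consumed the same way as the original proposal: for await (const [event, data] of parseEvents(parser, reader)) { ... }.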

lower case doctype is not recognised

(Thanks so much for the library, it's awesome. I'm particularly appreciative that it's very flexible about the input, except in this case...)

Describe the bug

Lower case "doctype" is not supported, although it's valid in HTML.

To Reproduce

Any input with doctype as lowercase not upper case:

const fs = require('fs')
const { SaxEventType, SAXParser } = require('sax-wasm')

const saxWasmBuffer = fs.readFileSync(require.resolve('sax-wasm/lib/sax-wasm.wasm'))

const get_type = (t) => {
  for (const [name, value] of Object.entries(SaxEventType)) {
    if (t === value) {
      return name
    }
  }
}
const str2array = (str) => new Uint8Array(Buffer.from(str, 'utf8'))

// Instantiate
const options = {highWaterMark: 32 * 1024}
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.CloseTag | SaxEventType.OpenTag | SaxEventType.Text | SaxEventType.Doctype, options)
parser.eventHandler = (event, data) => {
  console.log(`${get_type(event)}:`, data.toJSON())
}

parser.prepareWasm(saxWasmBuffer).then(() => {
  console.log('lower case:')
  parser.write(str2array('<!doctype html>'))
  parser.end()
  console.log('upper case:')
  parser.write(str2array('<!DOCTYPE html>'))
  parser.end()
})

This prints nothing for the first case but does find the DOCTYPE clause for the second case.

I think the problem might be:

"DOCTYPE" => {

which appears to require "DOCTYPE"

error for XML documents with comments

Describe the bug

The parser seems to have problems handling comments in XML documents:

the parser does not seem to correctly recognize the comment closing sequence -->.

As a result, tags after the comment are not correctly parsed anymore.

To Reproduce
Steps to reproduce the behavior:

  1. initialize parser to emit events for SaxEventType.Attribute | SaxEventType.OpenTag | SaxEventType.Comment

  2. attach an event handler, that prints the tag & attribute & comment information, e.g. something like

function eventHandler(event, data) {
  if (event === SaxEventType.OpenTag) {
    // process open tag
    console.log('sax event OpenTag (',event,') -> ', data.name)
    data.attributes.forEach(function(attrData) {
      if(process.env.verbose) console.log('    tag attribute '+attrData.name+'='+JSON.stringify(attrData.value)+' at[', attrData.valueStart, ',', attrData.valueEnd, ']')
    });
  } else if (event === SaxEventType.Attribute) {
    // process attribute
    console.log('sax event Attribute (',event,') -> '+data.name+'='+JSON.stringify(data.value)+' at[', data.valueStart, ',', data.valueEnd, ']')
  } else if (event === SaxEventType.Comment) {
    // process comment tag
    console.log('sax event Comment (',event,') -> ', data.name || data.constructor.name, data.value)
  } else {
    // process unknown tag
    console.log('sax event ',event,' -> ', data.name || data.constructor.name, Buffer.from(data.data).toString('utf8'), data)
  }
};

  3. parse the XML code

<?xml version="1.0" encoding="UTF-8"?>
<plugin name="test 1 attr">

  <name name="test 2 attr">the plugin name</name>

  <!-- name="test 3 attr" some comment -->

  <keywords name="test 4 attr">some,key,words</keywords>

</plugin>

Expected behavior

The tag <keywords> and its attribute name should be passed into the event handler.
Instead the last event that is triggered is for the comment.

When the comment is removed, the XML code is parsed as expected.

Desktop (please complete the following information):

  • OS: Windows 10
  • Node.js
  • Version 12.16.3

First character of comments is removed

Describe the bug

When parsing XML/HTML comments, the first character of the comment is skipped.

To Reproduce

Run the following in the browser with the lib directory of master copied or symlinked to the same directory.

<h1>sax-wasm demo</h1>
<pre id="output"></pre>

<script type="module">
  import { SaxEventType, SAXParser } from './lib/module/index.js';
  window.SaxEventType = SaxEventType;

  async function loadAndPrepareWasm() {
    const saxWasmResponse = await fetch('./lib/sax-wasm.wasm');
    const saxWasmbuffer = await saxWasmResponse.arrayBuffer();
    const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag | SaxEventType.CloseTag | SaxEventType.Comment);

    // Instantiate and prepare the wasm for parsing
    const ready = await parser.prepareWasm(new Uint8Array(saxWasmbuffer));
    if (ready) {
      return parser;
    }
  }

  loadAndPrepareWasm().then(main);
  const el = document.getElementById('output');

  function main(parser) {
    console.log('Wasm is ready to parse', parser);
    parser.eventHandler = (event, data) => {
      console.log('event data JSON:', data.toJSON());
      el.innerHTML += `${data.constructor.name}: ${JSON.stringify(data.toJSON())}\n`;
    }

    // const xml = '<div class="foobar" z={1}>hello<!--comment--></div><br/>'
    const xml = '<!--comment-->'
    const enc = new TextEncoder();

    parser.write(enc.encode(xml));
    parser.end();
  }
</script>

The output is omment.

Full output:

Text: {"start":{"line":0,"character":1},"end":{"line":0,"character":12},"value":"omment"}

Expected behavior

AFAIK comments should start from the first character after <!--, just as comments currently include the last character before -->.

Desktop (please complete the following information):

  • OS: macOS
  • Browser: Chrome
  • Version: 108.0.5359.98 (Official Build) (arm64)

But I also got this running the rust code with rustc 1.65.0 (897e37553 2022-11-02) directly.

Parser fails on large files due to arithmetic overflow.

I have been trying to use the library to process large XML files (~100 GB) which do not fit in memory of the machine. I copied the example code for node-based streaming files in the README.

It seems that the WASM code does not wait for written data to be processed when writing data and instead allows the buffer to grow infinitely, eating the system memory. Is there any way to make this block here?

Moreover, the process crashes before memory exhaustion is reached with this error:

node_modules/sax-wasm/lib/saxWasm.js:186
            const uint8array = new Uint8Array(this.wasmSaxParser.memory.buffer, ptr, len);
                               ^

RangeError: Start offset -2147483568 is outside the bounds of the buffer
    at new Uint8Array (<anonymous>)

I'm not exactly sure of what's going on here but it looks like the buffer may have grown too big and the index within it has exceeded s32_max, causing a failure. Is this a bug or am I doing things wrong - would be good to have documentation if so.
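
The negative offset is at least consistent with that reading. WebAssembly i32 values reach JavaScript as signed 32-bit numbers, so an address at or above 2^31 appears negative. The sketch below shows the reinterpretation with `>>> 0`. `toUnsignedPointer` is a hypothetical helper, and whether this is the actual cause inside sax-wasm is an assumption; it also does nothing about the memory growth itself.

```javascript
// WebAssembly i32 values arrive in JavaScript as signed 32-bit numbers, so an
// address at or above 2^31 shows up negative. `>>> 0` reinterprets the same
// bits as an unsigned integer.
function toUnsignedPointer(ptr) {
  return ptr >>> 0;
}

console.log(toUnsignedPointer(-2147483568)); // 2147483728
console.log(toUnsignedPointer(1024));        // 1024
```

2147483728 is 2^31 + 80, just past the 2 GiB mark.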

Playground.js not working

In playground.js the external method event_listener is not being triggered by the parser.

function event_listener(event, ptr, len) {
    console.log('here');
    const linearMemory = result.instance.exports.memory;
    const memBuff = Buffer.from(linearMemory.buffer, ptr, len);
    const rawString = memBuff.toString();
    console.log(event, rawString);
    try {
      JSON.parse(rawString);
    } catch (e) {
      debugger;
    }
  }

here is never logged.

Is it possible to use sax-wasm for dom manipulation?

Hi,

I wanted to ask whether sax-wasm is capable of manipulating an HTML file, not only parsing it.
What I need is a fast way to parse the HTML, select an element using a selector, and replace it with a new one that I define. I tried kuchiki->wasm within the Cloudflare Workers environment, but it's way too slow (at least the way I managed to make it work).

Thanks!

No changelog?

I am using sax-wasm@^1.4.5 and I would like to update to 2.x. For that I usually check the changelog, but I can't seem to find one here 🤔

Is it missing? What user-facing API changed from 1.x to 2.x?

Example is not working.

I've tried the browser example but events are not firing.

import { SaxEventType, SAXParser } from "sax-wasm";

loadAndPrepareWasm().then(processDocument);

async function loadAndPrepareWasm() {
  const saxWasmResponse = await fetch("/sax-wasm.wasm");

  const saxWasmbuffer = await saxWasmResponse.arrayBuffer();
  const parser = new SAXParser(SaxEventType.OpenTag, {
    highWaterMark: 64 * 1024,
  }); // 64k chunks

  const ready = await parser.prepareWasm(new Uint8Array(saxWasmbuffer));
  // Instantiate and prepare the wasm for parsing
  if (ready) {
    return parser;
  }
}

function processDocument(parser: SAXParser) {
  parser.eventHandler = (event, data) => {
    console.log(data);
    console.log(event);
  };
  console.log(parser);

  fetch(
    "https://cors-anywhere.herokuapp.com/https://www.w3schools.com/xml/plant_catalog.xml"
  ).then(async (response) => {
    if (!response.ok) {
      console.error(response);
    } else {
      const reader = response.body.getReader();
      while (true) {
        const chunk = await reader.read();
        if (chunk.done) {
          return parser.end();
        } else {
          // @ts-ignore
          return parser.write(chunk);
        }
      }
    }
  });
}
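One thing that stands out in the snippet above, whether or not it is the only cause: the loop returns after the first read, and it passes the whole { done, value } result to parser.write rather than the chunk bytes. A minimal sketch of a read loop that avoids both follows. The helper name pump is hypothetical, and it assumes parser.write accepts a Uint8Array and parser.end takes no arguments, as in the snippet.

```javascript
// Hypothetical helper: drain a ReadableStream reader into the parser.
async function pump(reader, parser) {
  while (true) {
    const { done, value } = await reader.read();
    if (done) {
      // No more data: signal the end of the document.
      return parser.end();
    }
    // value is the Uint8Array for this chunk; keep looping afterwards.
    parser.write(value);
  }
}
```

In processDocument this would be called as pump(response.body.getReader(), parser).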

My console:

Use WebAssembly.instantiateStreaming

Using WebAssembly.instantiateStreaming instead of WebAssembly.instantiate will improve loading time because the wasm is read directly from a stream without being copied to an ArrayBuffer first.

This method expects a Response or a Promise<Response>, so I think the most straightforward way to enable this is to check for these types in SAXParser.prepareWasm and call WebAssembly.instantiateStreaming, otherwise falling back to the old WebAssembly.instantiate.
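The check described above could look roughly like the sketch below. The names isStreamable and instantiateWasm are hypothetical and the imports parameter is an assumption. This is not how SAXParser.prepareWasm is actually implemented.

```javascript
// Hypothetical helper: true when the source can be handed to instantiateStreaming.
function isStreamable(source) {
  return typeof Response !== 'undefined' &&
    source instanceof Response &&
    typeof WebAssembly.instantiateStreaming === 'function';
}

// Hypothetical loader: accepts a Response, a Promise<Response> or raw bytes.
async function instantiateWasm(source, imports = {}) {
  const resolved = await source;
  if (isStreamable(resolved)) {
    // Compiles while the response body is still downloading.
    return WebAssembly.instantiateStreaming(resolved, imports);
  }
  // Fallback: bytes already in memory (Uint8Array or ArrayBuffer).
  return WebAssembly.instantiate(resolved, imports);
}
```

Note that instantiateStreaming rejects responses that are not served with the application/wasm content type, so a caller may still want to catch that case and retry with the bytes.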

Parsed emojis' lengths aren't calculated consistently

Describe the bug
After parsing an HTML page that uses Unicode characters such as emojis, the reported length of the parsed Unicode character is incorrect.

To Reproduce
Steps to reproduce the behavior:

  1. Go to this repro repo
  2. Clone and install the dependencies
  3. Run node test.js
  4. Observe error

The repro case takes the following html as input:

📚<div href="./123/123">hey there</div>

and aims to replace the value of href with another string: 234.

The expected output would be:

📚<div href="234">hey there</div>

instead, the output is:

📚<div href=2343">hey there</div>

I don't know specifically why this is the case, but my gut reaction is that the parser doesn't recognise that Unicode characters like 📚 may have a length greater than 1. Because this particular emoji has a length of 2, the replaceBetween function miscalculates where to replace the string.
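The mismatch can be reproduced without the parser. The sketch below assumes the reported positions count code points (one per emoji) while JavaScript string indices count UTF-16 code units (two for 📚). The offsets 12 and 21 and both replaceBetween variants are hypothetical stand-ins written for illustration, not the code from the repro repo.

```javascript
const input = '📚<div href="./123/123">hey there</div>';

console.log('📚'.length);       // 2 (UTF-16 code units)
console.log([...'📚'].length);  // 1 (code points)

// Hypothetical helper: map a code point offset to a UTF-16 string index.
function codePointOffsetToIndex(str, offset) {
  let index = 0;
  for (const ch of str) {       // string iteration yields whole code points
    if (offset-- <= 0) break;
    index += ch.length;         // 1 or 2 code units per code point
  }
  return index;
}

// Stand-in that treats the offsets as UTF-16 indices.
function replaceBetweenNaive(str, start, end, replacement) {
  return str.slice(0, start) + replacement + str.slice(end);
}

// Stand-in that converts code point offsets first.
function replaceBetweenCodePoints(str, start, end, replacement) {
  return replaceBetweenNaive(
    str,
    codePointOffsetToIndex(str, start),
    codePointOffsetToIndex(str, end),
    replacement
  );
}

// Assumed code point offsets of the attribute value ./123/123 (end exclusive).
const start = 12;
const end = 21;

const naive = replaceBetweenNaive(input, start, end, '234');
const fixed = replaceBetweenCodePoints(input, start, end, '234');
console.log(naive); // 📚<div href=2343">hey there</div>
console.log(fixed); // 📚<div href="234">hey there</div>
```

Under these assumptions the naive version yields exactly the broken output shown above, which supports the off-by-one explanation for this input.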

Additional context
This is made more clear when

end positions for attribute names include whitespaces

Describe the bug

The end position for attribute names, nameEnd, includes any trailing whitespace up to the = character.

To Reproduce
Steps to reproduce the behavior:

  1. initialize parser to emit events for SaxEventType.Attribute | SaxEventType.OpenTag

  2. attach an event handler that prints the tag and attribute information, e.g. something like

function eventHandler(event, data) {
  if (event === SaxEventType.OpenTag) {
    // process open tag
    console.log('sax event OpenTag (',event,') -> ', data.name)
    data.attributes.forEach(function(attrData) {
      if(process.env.verbose) console.log('    tag attribute '+attrData.name+'='+JSON.stringify(attrData.value)+' with name at [', attrData.nameStart, ',', attrData.nameEnd, ']')
    });
  } else if (event === SaxEventType.Attribute) {
    // process attribute
    console.log('sax event Attribute (',event,') -> '+data.name+'='+JSON.stringify(data.value)+' with name at [', data.nameStart, ',', data.nameEnd, ']')
  } else {
    // process unknown tag
    console.log('sax event ',event,' -> ', data.name || data.constructor.name, Buffer.from(data.data).toString('utf8'), data)
  }
};
  3. parse the XML code
<?xml version="1.0" encoding="UTF-8"?>
<plugin
    version       =   "1.0.0"   >
</plugin>

Expected behavior

The end position for the name of the version attribute should be at { line: 2, character: 11 }.
Instead, the end position currently includes the whitespace up to the = character, i.e. it is { line: 2, character: 18 }.

This could maybe be fixed in parser.rs::attribute_name() when handling the is_whitespace(grapheme) case, by only updating self.attribute.name_end when self.state is not already State::AttribNameSawWhite.
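Until the parser is changed, a caller could derive the expected end position from nameStart and the attribute name itself. The sketch below is a consumer-side workaround, not the Rust fix. The helper name trimmedNameEnd is hypothetical, and it assumes that name is a plain string, that positions have the { line, character } shape quoted above, and that each character of the name advances the position by one.

```javascript
// Hypothetical workaround: compute where the attribute name really ends.
function trimmedNameEnd(attribute) {
  return {
    // An attribute name cannot contain whitespace, so it stays on one line.
    line: attribute.nameStart.line,
    character: attribute.nameStart.character + attribute.name.length
  };
}

// Positions for the version attribute in the XML above
// (a nameStart of character 4 is assumed from the four-space indent).
const versionAttribute = {
  name: 'version',
  nameStart: { line: 2, character: 4 },
  nameEnd: { line: 2, character: 18 }
};

console.log(trimmedNameEnd(versionAttribute)); // { line: 2, character: 11 }
```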

Desktop:

  • OS: Windows 10
  • Node.js
  • Version 12.16.3

Unknown type TextDecoder in node.js TS project

Describe the bug
A Node.js-only TypeScript project (no "dom" in "lib") fails to build because the TextDecoder type comes from the DOM library.

To Reproduce
tsconfig.json without dom lib, like this:

{
	"compilerOptions": {
		"target": "ES2019",
		"module": "commonjs",
		"lib": [
			"es2019",
			"es2020.bigint",
			"es2020.string",
			"es2020.symbol.wellknown"
		],
		
		"strict": true,
		
		"noUnusedLocals": true,
		"noUnusedParameters": true,
		"noImplicitReturns": true,
		"noFallthroughCasesInSwitch": true,
		"forceConsistentCasingInFileNames": true,
		
		"moduleResolution": "node",
		"resolveJsonModule": true,
		"isolatedModules": true,
		"allowSyntheticDefaultImports": false,
		"esModuleInterop": false,
		
		"noEmitOnError": true,
		"newLine": "LF",
		"removeComments": true,
		"declaration": false,
		"sourceMap": false,
		"importHelpers": false,
		
		"outDir": "./build",
		"rootDir": "./src"
	},
	"typeAcquisition": {
		"include": [
			"node"
		]
	},
	"include": [
		"./src/**/*.ts",
		"./types/**/*.d.ts"
	]
}

Can't build project, TypeScript error:

node_modules/sax-wasm/lib/saxWasm.d.ts:77:25 - error TS2304: Cannot find name 'TextDecoder'.

Node.js 12.18.0
TypeScript 3.9.5
@types/node 14.0.13

Workaround: add a type declaration such as type TextDecoder = never;.
