lddubeau / saxes

An evented streaming XML parser in JavaScript

License: Other

JavaScript 1.41% TypeScript 98.59%

saxes's Issues

Performance issue

Background

I used saxes to build feedify (an RSS parsing package). Recently, when I used feedify in one of my projects, I noticed poor parsing performance.

The problem

The browser window would freeze for a second or two when parsing, for example, feeds from overreacted.io.

I tried to investigate what is wrong with my implementation but honestly, I was just trying to guess things (I could not do proper profiling). After I failed I thought saxes might have something to do with this especially after trying other RSS parsing packages like rss-parser (it depends on xml2js which depends on sax).

I tried both feedify and rss-parser with the same input and there were no issues when using rss-parser. So I decided to give sax a shot.

Shockingly, replacing saxes with sax actually fixed my issue.

That being said, I've no idea why that was the case.

Things that might help

Here is the source code of my parser before and after using sax.

I was using saxes v3.1.7 (I don't see any actual code change in v3.1.9)

Note: the result isn't affected by changing the handlers (e.g. onopentag) to promises.

Screenshots

I recorded the performance of running with both saxes and sax versions and here are screenshots of both:

With saxes

With sax

Environment

Browser: Brave (Chromium: 73.0.3683.86)

Runner

Here is the script I used to run the parser.

import http from 'ky';
import { parse } from 'feedify';

async function run(url){
	const resp = await http.get(url);
	
	if (resp.ok) {
		const xml = await resp.text();
	
		await parse(xml);
	}
}

run('https://overreacted.io/rss.xml')

Please distribute the license file with the installed module (v5.0.1)

Hey. We were also looking at some licenses, and noticed the LICENSE file was missing in our version of the installed code (version 5.0.1). This is important because the ISC license is asking us to include a copy of the ISC license in every copy we have of this code, and without this file being redistributed when we install the module, our copies are missing this file.

Disallowed character

Hi,

I'm getting the error disallowed character when an attribute value has the character Ë.
For example:

<NODE Name="TËST"/>

For now I'm just ignoring the error and parsing continues correctly.

saxes fails to correctly capture some DTDs

Versions affected

All versions up to and including 3.1.9

Steps to reproduce

Save this file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE root [
<!-- I'm a test. -->
]>
<root/>

Try to parse it with the null parser:

node ./examples/null-parser.js [path to file]

Expected results

No error.

Actual results

Error: undefined:6:0: document must contain a root element.
    at SaxesParser.fail (/home/ldd/src/git-repos/saxes/lib/saxes.js:492:18)
    at SaxesParser.end (/home/ldd/src/git-repos/saxes/lib/saxes.js:1692:12)
    at SaxesParser.write (/home/ldd/src/git-repos/saxes/lib/saxes.js:547:23)
    at SaxesParser.close (/home/ldd/src/git-repos/saxes/lib/saxes.js:557:17)
    at Object.<anonymous> (/home/ldd/src/git-repos/saxes/examples/null-parser.js:30:10)
    at Module._compile (internal/modules/cjs/loader.js:774:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
    at Module.load (internal/modules/cjs/loader.js:641:32)
    at Function.Module._load (internal/modules/cjs/loader.js:556:12)
    at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Error: undefined:6:0: unexpected end.
    at SaxesParser.fail (/home/ldd/src/git-repos/saxes/lib/saxes.js:492:18)
    at SaxesParser.end (/home/ldd/src/git-repos/saxes/lib/saxes.js:1701:12)
    at SaxesParser.write (/home/ldd/src/git-repos/saxes/lib/saxes.js:547:23)
    at SaxesParser.close (/home/ldd/src/git-repos/saxes/lib/saxes.js:557:17)
    at Object.<anonymous> (/home/ldd/src/git-repos/saxes/examples/null-parser.js:30:10)
    at Module._compile (internal/modules/cjs/loader.js:774:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
    at Module.load (internal/modules/cjs/loader.js:641:32)
    at Function.Module._load (internal/modules/cjs/loader.js:556:12)
    at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Parsing time: 8

Notes

The issue here is inherited from sax and carried over into saxes. I tried with the latest sax, it raises no errors because it does not check whether the document is actually well-formed XML. However, it does not generate events for the root element.

saxes currently aims to just capture the DTD, without well-formedness checks, but this is not easy to do:

You cannot just skip ahead to the string ]> because this string can legally appear in an entity declaration.
sax fixed that issue by keeping track of quotes in the DTD: upon encountering a quote, the state changes and sax looks for the end of the quote. Effectively this causes ]> appearing between quotes (e.g. like in an entity declaration) to be ignored.
The problem though is that sax ignores processing instructions and comments in the DTD. In the problematic file above, the single quote appearing in the comment makes the parser get into the quote state and this is a quote that never terminates. The parser needs to keep track of comments and processing instructions to avoid interpreting quotes that appear in them as quotes. (They are just unstructured comment or processing instruction content.)
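The scanning rule described in these notes can be sketched as follows. This is only an illustration of the idea, not the code used by sax or saxes: the helper name `findInternalSubsetEnd` is hypothetical, it works on a complete string rather than on streamed chunks, and it looks for the literal string `]>` as the notes do (it does not allow white space between `]` and `>`).

```javascript
// Hypothetical helper, not part of sax or saxes. Returns the index just past
// the "]>" that closes the internal subset, or -1 if no terminator is found.
function findInternalSubsetEnd(text, start) {
  let i = start;
  while (i < text.length) {
    if (text.startsWith("<!--", i)) {
      // Comment: skip to its terminator. Quotes inside are plain content.
      const end = text.indexOf("-->", i + 4);
      if (end === -1) return -1;
      i = end + 3;
    } else if (text.startsWith("<?", i)) {
      // Processing instruction: treated like a comment.
      const end = text.indexOf("?>", i + 2);
      if (end === -1) return -1;
      i = end + 2;
    } else if (text[i] === '"' || text[i] === "'") {
      // Quoted literal (for example an entity value): skip to the matching
      // quote so that a "]>" inside it is ignored.
      const end = text.indexOf(text[i], i + 1);
      if (end === -1) return -1;
      i = end + 1;
    } else if (text.startsWith("]>", i)) {
      return i + 2;
    } else {
      i += 1;
    }
  }
  return -1;
}
```

With the problematic file above, starting the scan right after the `[` skips the comment as a unit, so the single quote in "I'm" never opens a quote state.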

Edge: error on import

There seems to be a small issue on Edge. I'm getting the following error:
SCRIPT1028: SCRIPT1028: Expected identifier, string or number

Edge gives the most helpful errors.. It claims to break on this line: this.ns = { __proto__: null, ...rootNS }; but who knows if that's really true.

After removing the import statement import * as Saxes from 'saxes', the error is gone. I can't really debug further than that.

I'm using the vue-cli, so webpack.

Add attributestart event

Would it be possible to fire an attributestart event when an attribute name is parsed, similar to the opentagstart event?

This would allow the start position of the attribute (the current parser location minus the length of the attribute name) to be identified.

Stream API

First of all, thanks a lot for the cool library!

In @exceljs we are going to get rid of the old sax parser and replace it with the saxes lib. The problem is that saxes doesn't provide a native streaming API. It looks like it was removed from the original sources.

Since exceljs is tightly coupled to the streaming API, we are looking for ways to add this API to saxes.

Do you think it is possible to do this with reasonable effort? Or maybe it's possible to avoid this altogether.
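One possible direction, sketched here under assumptions, is to keep the parser as it is and wrap it in a Node.js Writable from the outside. The class name `ParserWritable` is hypothetical. The only things assumed about the parser are the `write(text)` and `close()` methods that appear in the snippets elsewhere on this page, and that errors surface as exceptions when no error handler is installed.

```javascript
const { Writable } = require("stream");
const { StringDecoder } = require("string_decoder");

// Hypothetical adapter: forwards everything written to the stream to an
// object exposing write(text) and close().
class ParserWritable extends Writable {
  constructor(parser) {
    super();
    this.parser = parser;
    // Buffers may split a multi-byte character; the decoder holds the
    // incomplete bytes until the next chunk arrives.
    this.decoder = new StringDecoder("utf8");
  }

  _write(chunk, encoding, callback) {
    try {
      const text = this.decoder.write(chunk);
      if (text) this.parser.write(text);
      callback();
    } catch (err) {
      callback(err);
    }
  }

  _final(callback) {
    try {
      const rest = this.decoder.end();
      if (rest) this.parser.write(rest);
      this.parser.close();
      callback();
    } catch (err) {
      callback(err);
    }
  }
}
```

A readable source could then be piped into it, for example `fs.createReadStream(path).pipe(new ParserWritable(parser))`, with the usual event handlers set on `parser` beforehand.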

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Repository problems

These problems occurred while renovating this repository.

WARN: Using npm packages for Renovate presets is now deprecated. Please migrate to repository-based presets instead.

Rate-Limited

These updates are currently rate-limited. Click on a checkbox below to force their creation now.

build(deps): update dependency @types/node to ^16.18.40
build(deps): update dependency eslint to ^8.46.0
build(deps): update dependency eslint-plugin-import to ^2.28.0
build(deps): update dependency eslint-plugin-react to ^7.33.1
build(deps): update typescript-eslint monorepo to ^5.62.0 (@typescript-eslint/eslint-plugin, @typescript-eslint/eslint-plugin-tslint, @typescript-eslint/parser)
build(deps): update typescript-eslint monorepo to v6 (major) (@typescript-eslint/eslint-plugin, @typescript-eslint/eslint-plugin-tslint, @typescript-eslint/parser)
🔐 Create all rate-limited PRs at once 🔐

Edited/Blocked

These updates have been manually edited so Renovate will no longer make changes. To discard all commits and start over, click on a checkbox.

build(deps): update actions/checkout action to v3

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

github-actions

.github/workflows/node.js.yml

actions/checkout v2

actions/setup-node v3

npm

package.json

xmlchars ^2.2.0

@commitlint/cli ^16.3.0

@commitlint/config-angular ^16.3.0

@types/chai ^4.3.5

@types/mocha ^9.1.1

@types/node ^16.18.34

@typescript-eslint/eslint-plugin ^5.59.9

@typescript-eslint/eslint-plugin-tslint ^5.59.9

@typescript-eslint/parser ^5.59.9

@xml-conformance-suite/js ^3.0.0

@xml-conformance-suite/mocha ^3.0.0

@xml-conformance-suite/test-data ^3.0.0

chai ^4.3.7

conventional-changelog-cli ^2.2.2

eslint ^8.42.0

eslint-config-lddubeau-base ^6.1.0

eslint-config-lddubeau-ts ^2.0.2

eslint-import-resolver-typescript ^2.7.1

eslint-plugin-import ^2.27.5

eslint-plugin-jsx-a11y ^6.7.1

eslint-plugin-prefer-arrow ^1.2.3

eslint-plugin-react ^7.32.2

eslint-plugin-simple-import-sort ^7.0.0

husky ^7.0.4

mocha ^9.2.2

renovate-config-lddubeau ^1.0.0

simple-dist-tag ^1.0.2

ts-node ^10.9.1

tsd ^0.22.0

tslint ^6.1.3

tslint-microsoft-contrib ^6.2.0

typedoc ^0.24.8

typescript ^4.9.5

node >=v12.22.12

travis

.travis.yml

node 12

node 10

Check this box to trigger a request for Renovate to run again on this repository

Docs: State Machine Diagrams

I was looking to better understand the state transitions for the Saxes parser, so I created state machine diagrams covering all the states. Not sure if it would be useful to others/worth including as documentation, but in case it is, feel free to use: https://gist.github.com/ForbesLindesay/f30b317013b9851149178c9f30993f6c

I've attempted to document all state transitions that don't result in a call to this.fail.

Can saxes be used with asynchronous callbacks?

Hi,

is it possible to use await in event callbacks? Something like saxParser.on("opentag", async (node) => { await ... ?

It looks like it's not possible, but I would like to confirm.

White space following XML declaration should be part of the prolog

The ProcessingInstruction-literal-1.xhtml test currently fails in jsdom, as document.firstChild returns a text node instead of the first element. The test does pass in browsers.

I have tracked this down to the newline that follows the XML declaration. If I've understood the relevant part of the specification correctly, then comments, processing instructions and white space directly after the XML declaration should be seen as part of the prolog and not result in an event.

attribute values are corrupted when saxes runs without a text handler.

This happens with 5.0.0. I have not checked older versions.

Steps to reproduce

Save this to a file:

"use strict";

/* eslint-disable no-console */

const fs = require("fs");
const saxes = require("../build/dist/saxes");

const xml = fs.readFileSync(process.argv[2]);
const parser = new saxes.SaxesParser({ xmlns: false });
parser.on("opentag", tag => {
  console.log(tag);
});
parser.on("error", err => {
  console.error(err);
});

parser.write(xml);
parser.close();

Update the path for saxes in the require call so that it loads a local version of saxes.

Save this to a file:

<?xml version="1.0" encoding="UTF-8"?>
<top>
<x>Fnord '&lt;' and then some.</x>
<x x="foo"></x>
</top>

Run node path/to/js/file path/to/xml/file with the two files above that you've saved.

Expected Result

The output should be:

{
  name: 'top',
  attributes: [Object: null prototype] {},
  isSelfClosing: false
}
{
  name: 'x',
  attributes: [Object: null prototype] {},
  isSelfClosing: false
}
{
  name: 'x',
  attributes: [Object: null prototype] { x: 'foo' },
  isSelfClosing: false
}

Namely the x attribute on the second x element should have the value foo.

Actual Output

{
  name: 'top',
  attributes: [Object: null prototype] {},
  isSelfClosing: false
}
{
  name: 'x',
  attributes: [Object: null prototype] {},
  isSelfClosing: false
}
{
  name: 'x',
  attributes: [Object: null prototype] { x: '<foo' },
  isSelfClosing: false
}

Note the less than symbol in the value of the x attribute.

Observations

The issue disappears if a handler is added for the text event.

Whitespace characters in attribute values are not normalized

I have an XML document containing an attribute with carriage return characters: (https://github.com/w3c/xsdtests/blob/master/msData/regex/RegexTest_63.xml)

If I read the XML spec correctly regarding attribute value normalization (https://www.w3.org/TR/xml/#AVNormalize), I think these should be normalized to a single space character. However, the following test case shows that this does not currently happen:

const p = new saxes.SaxesParser();
p.onopentag = n => console.log(JSON.stringify(n.attributes['att']));
p.write('<doc att="a\rb"/>').close();
// logs "a\rb"
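As a rough illustration of what the cited section of the spec asks for, the sketch below normalizes the literal white space of an attribute value. The function name `normalizeAttributeWhitespace` is hypothetical and this is not how saxes is implemented. It only covers characters that appear literally in the source (white space produced by character references such as `&#xD;` is not turned into a space by the spec), and it ignores the extra trimming that applies to non-CDATA attribute types declared in a DTD.

```javascript
// Hypothetical helper illustrating attribute-value normalization for
// literal white space only.
function normalizeAttributeWhitespace(raw) {
  return raw
    // End-of-line handling comes first: CRLF and lone CR both become LF.
    .replace(/\r\n?/g, "\n")
    // Every remaining tab or line feed becomes a single space.
    .replace(/[\t\n]/g, " ");
}

console.log(JSON.stringify(normalizeAttributeWhitespace("a\rb")));
```

Because the end-of-line step runs first, a literal CRLF pair ends up as one space rather than two.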

Preserve comments and whitespace - Usage for attranslate

Hi,

I am currently running experiments with saxes and attranslate: https://github.com/fkirc/attranslate
attranslate is a new tool for semi-automated app and website translations.
A goal of attranslate is to modify existing XML files with as few changes as possible.
Therefore, one thing that I am curious about is the preservation of XML-comments and whitespace (linebreaks).

It would be nice to have a tree-structure that preserves literally all information when reading an XML-file and then writing the (almost) same XML-file again.

Perhaps you could point me in a direction on how to best achieve whitespace- and comments-preservation?

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Error type: Cannot find preset's package (lddubeau:base)

Is it possible to pause / resume the parsing?

I would like to start parsing the XML, then once a specific element has been reached I would like to pause it, apply some logic to what has already been processed, then resume the parsing once I'm done.

Is it possible to achieve this using saxes?
Or should I first parse the string, split it into chunks, and feed each chunk separately to saxes?
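The chunking idea from the question could look roughly like the sketch below. Both names, `chunksOf` and `feedInChunks`, are hypothetical, and the parser is only assumed to expose `write(text)` and `close()` as in the other snippets on this page. Note the limitation: this pauses between chunks, not at a specific element, because handlers run while `write` is executing.

```javascript
// Hypothetical helper: yields pieces of about `size` UTF-16 code units,
// extending a piece by one unit rather than cutting a surrogate pair.
function* chunksOf(text, size) {
  if (size < 1) throw new RangeError("size must be at least 1");
  let i = 0;
  while (i < text.length) {
    let end = Math.min(i + size, text.length);
    const last = text.charCodeAt(end - 1);
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) end += 1;
    yield text.slice(i, end);
    i = end;
  }
}

// Hypothetical driver: writes one chunk, then awaits the caller's work
// (for example, processing the events collected so far) before continuing.
async function feedInChunks(parser, xml, size, betweenChunks) {
  for (const chunk of chunksOf(xml, size)) {
    parser.write(chunk);
    await betweenChunks();
  }
  parser.close();
}
```

Handlers would typically push what they see into an array, and `betweenChunks` would drain that array asynchronously.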

Raise an error for missing namespace prefix

Test case:

const saxes = require("saxes"),
      parser = new saxes.SaxesParser({ xmlns: true });

parser.onerror = function (e) {
  console.log(e);
};

parser.write('<span xmlns:="urn:x-test:test">12</span>').close();

I believe the code above is expected to produce an error, based on the following test in https://github.com/web-platform-tests/wpt/blob/master/domparsing/DOMParser-parseFromString-xml-parsererror.html#L24. It's currently passing in two browsers.
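In the test case, the attribute name is `xmlns:`, that is, a prefix followed by a colon and an empty local part, which is not a valid qualified name. A minimal sketch of such a check follows. The function `splitQName` is hypothetical, it is not the saxes implementation, and it does not validate the characters of each part.

```javascript
// Hypothetical helper: splits a qualified name and rejects names in which
// either side of the colon is empty or the local part has another colon.
function splitQName(name) {
  const colon = name.indexOf(":");
  if (colon === -1) return { prefix: "", local: name };
  const prefix = name.slice(0, colon);
  const local = name.slice(colon + 1);
  if (prefix === "" || local === "" || local.includes(":")) {
    throw new Error(`malformed qualified name: ${name}`);
  }
  return { prefix, local };
}
```

Under this check, `xmlns:test` splits into the prefix `xmlns` and the local name `test`, while `xmlns:` is rejected.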

onprocessinginstruction event not firing for XML Declaration

I'm using jsdom to parse XML into a DOM and then serialize it back after I make some modifications to it.

I noticed that the processing instruction was missing from the top of the serialized output. At first, I thought this was a limitation of the module. However, it seems that it tries to use the onprocessinginstruction event from saxes, but it's not fired in my case.

I can see in the sPIEnding() method that it is fired from the second if clause, but in my case piIsXMLDecl is true, so the event is not fired. Is this the intended behaviour?

My xml file starts with the standard processing instruction of:
<?xml version="1.0" encoding="UTF-8"?>

Event that the position tracking offset was updated not given for XML declaration

I am trying to track the XML substring for any given node parsed into a DOM document, using saxes position tracking. I am trying to achieve this by creating a DOM as saxes emits new nodes, and diffing the tracked positions at those events. This works well if an event is given any time the position offset is updated in the parser. The issue I run into is that the XML declaration does not give me an event because it is not a DOM node. This robs me of the offset at which the first emitted node is found.

If you (npm i saxes slimdom and) run the following:


const saxes = require('saxes');
const slimdom = require('slimdom');

const parser = new saxes.SaxesParser({
	position: true
});

function getDocumentWithStartAndEndPositionsPerNode (xmlString) {
	const document = new slimdom.Document();
	let nextNodeParentNode = document;

	// By keeping track of the last parser position you can tell what the XML source substring for the last created DOM
	// node is.
	let nextNodeStartPosition = parser.position;

	parser.oncomment = (comment) => {
		const node = document.createComment(comment);
		nextNodeParentNode.appendChild(node);

		// Set the offsets for the XML substring for this node
		node.stringStartOffset = nextNodeStartPosition;
		node.stringEndOffset = parser.position;
		nextNodeStartPosition = parser.position;
	};

	parser.onopentag = (element) => {
		const node = document.createElement(element.name);

		// Set the offset for the XML substring for this node
		node.stringStartOffset = nextNodeStartPosition;
		nextNodeStartPosition = parser.position;

		nextNodeParentNode.appendChild(node);
		nextNodeParentNode = node;
	};

	parser.onclosetag = () => {
		nextNodeParentNode = nextNodeParentNode.parentNode;

		// Set the offset for the XML substring for this node
		nextNodeParentNode.stringEndOffset = parser.position;
		nextNodeStartPosition = parser.position;
	};

	parser.write(xmlString).close();

	return document;
}

console.log('Happy flow without XML declaration:');
const dom1 = getDocumentWithStartAndEndPositionsPerNode('<!--test--><x />');
console.log(dom1.childNodes[0].stringStartOffset); // Offset for comment, is correct
console.log(dom1.childNodes[1].stringStartOffset); // Offset for element, is correct

console.log('Error with XML declaration:');
const dom2 = getDocumentWithStartAndEndPositionsPerNode('<?xml version="1.0"?><!--test--><x />');
console.log(dom2.childNodes[0].stringStartOffset); // Offset for comment, is erroneous because nextNodeStartPosition was
                                                   // not updated after encountering XML declaration
console.log(dom2.childNodes[1].stringStartOffset); // Offset for element, is correct

In the first console log, 0 and 11 are expected because the comment starts at 0 and is 11 chars long.

In the second console log 21 and 32 are expected because the XML declaration is 21 characters long. However 0 and 32 are found because the "start" for the first emitted node was not set right.

(Actual results are 1 lower because of some offset thing, I think that's unrelated)

As for a solution, either an emitted event signalling that the XML declaration PI has finished parsing (with parser.position updated) would do, or a way to find how many characters were in the XML declaration.

Let me know what I can do! Or if I can clarify anything about the above
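Until the parser reports this itself, the length of the declaration can be measured outside the parser as a stopgap. The helper below is a hypothetical sketch that assumes the whole document is available as a string and that the declaration, if present, starts at offset 0.

```javascript
// Hypothetical helper: returns the number of characters taken by a leading
// XML declaration, or 0 when the document does not start with one.
function xmlDeclarationLength(xml) {
  // "<?xml" must be followed by white space, so that processing
  // instructions such as <?xml-stylesheet ...?> are not mistaken for it.
  const match = /^<\?xml[ \t\r\n][\s\S]*?\?>/.exec(xml);
  return match ? match[0].length : 0;
}

console.log(xmlDeclarationLength('<?xml version="1.0"?><!--test--><x />'));
```

For the second document in the script above this gives 21, which could be used as the initial value of nextNodeStartPosition.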

Normalization of end-of-line characters

When I run the following in a browser:
JSON.stringify(new DOMParser().parseFromString('<p>A\r\nB</p>', 'text/xml').documentElement.textContent). It returns the following string: "A\nB"

However, in saxes, I get an ontext event with A\r\nB as the text argument.

The spec states that it is the responsibility of the 'XML processor' to normalize line endings. Since saxes is an XML parser, I'd say that saxes should normalize these line endings.

What's your opinion on this matter?
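For reference, the end-of-line rule in the spec maps every CRLF pair and every lone CR to a single LF. On a complete string that is one `replace(/\r\n?/g, "\n")`. In a streaming setting, a chunk may end in the middle of a CRLF pair, so a trailing CR has to be held back until the next chunk shows whether an LF follows. The sketch below illustrates this; `createLineEndingNormalizer` is a hypothetical name, not a saxes API.

```javascript
// Hypothetical helper: normalizes line endings across chunk boundaries.
function createLineEndingNormalizer() {
  let pendingCR = false;
  return {
    // Returns the normalized text that is safe to emit for this chunk.
    push(chunk) {
      let text = pendingCR ? "\r" + chunk : chunk;
      // A trailing CR is held back: the next chunk may start with LF.
      pendingCR = text.endsWith("\r");
      if (pendingCR) text = text.slice(0, -1);
      return text.replace(/\r\n?/g, "\n");
    },
    // Flushes a CR that was still pending when the input ended.
    end() {
      const rest = pendingCR ? "\n" : "";
      pendingCR = false;
      return rest;
    },
  };
}
```

Feeding "A\r" and then "\nB" through `push` yields "A" and "\nB", so the CRLF pair split across the two chunks still becomes a single LF.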

Stream API with cancellation

Hey, at exceljs we're thinking about moving from sax to saxes exceljs/exceljs#748 (comment)
However, one of our core features is our stream excel reader api that allows large excel files to be processed. The README says that the old streams API was removed and that a new one might be coming - what's the status on that and would you be open to accept help if you lay out a rough plan of how it should be best implemented? Thanks!

Parsing attributes containing the string "]]>" raises an error

Hi,

When I run the following code snippet in the browser: new DOMParser().parseFromString('<e attr="]]>"></e>', 'text/xml'). It outputs an element with an attribute set to ]]>. When I serialize it again, I get <e attr="]]>"></e>.

However, when I try to parse the same string with saxes, I get an error.

The spec mentions that the ]]> sequence MUST be escaped as ]]&gt; because of compatibility with SGML. However, that only seems to count for serializing XML. Also, all of the parsers I've tried accept this string.

Should saxes really throw an error?
