lddubeau / saxes
An evented streaming XML parser in JavaScript
License: Other
I used saxes to build feedify (an RSS parsing package). Recently, when I used feedify in one of my projects, I noticed poor parsing performance.
The browser window would freeze for a second or two when parsing, for example, the feed from overreacted.io.
I tried to investigate what was wrong with my implementation, but honestly I was only guessing (I could not do proper profiling). After that failed, I suspected saxes might be involved, especially after trying other RSS parsing packages such as rss-parser (it depends on xml2js, which depends on sax).
I tried both feedify and rss-parser with the same input, and there were no issues when using rss-parser. So I decided to give sax a shot.
Surprisingly, replacing saxes with sax actually fixed my issue.
That being said, I have no idea why that was the case.
Here is the source code of my parser before and after using sax.
I was using saxes v3.1.7 (I don't see any actual code change in v3.1.9).
Note: the result isn't affected by changing the handlers (e.g. onopentag) to promises.
I recorded the performance of running with both the saxes and sax versions, and here are screenshots of both:
Browser: Brave (Chromium: 73.0.3683.86)
Here is the script I used to run the parser.
import http from 'ky';
import { parse } from 'feedify';

// Fetch the feed and hand its XML text to feedify's parser.
async function run(url) {
  const resp = await http.get(url);
  if (resp.ok) {
    const xml = await resp.text();
    await parse(xml);
  }
}

run('https://overreacted.io/rss.xml');
Hey. We were reviewing licenses and noticed that the LICENSE file is missing from our installed copy of the code (version 5.0.1). This matters because the ISC license requires a copy of the license to be included in every copy of the code. Since the file is not redistributed when we install the module, our copies lack it.
Hi,
I'm getting the error disallowed character when an attribute value contains the character Ë.
For example:
<NODE Name="TËST"/>
For now I'm just ignoring the error and parsing continues correctly.
All versions up to and including 3.1.9
Save this file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE root [
<!-- I'm a test. -->
]>
<root/>
Try to parse it with the null parser:
node ./examples/null-parser.js [path to file]
No error.
Error: undefined:6:0: document must contain a root element.
at SaxesParser.fail (/home/ldd/src/git-repos/saxes/lib/saxes.js:492:18)
at SaxesParser.end (/home/ldd/src/git-repos/saxes/lib/saxes.js:1692:12)
at SaxesParser.write (/home/ldd/src/git-repos/saxes/lib/saxes.js:547:23)
at SaxesParser.close (/home/ldd/src/git-repos/saxes/lib/saxes.js:557:17)
at Object.<anonymous> (/home/ldd/src/git-repos/saxes/examples/null-parser.js:30:10)
at Module._compile (internal/modules/cjs/loader.js:774:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
at Module.load (internal/modules/cjs/loader.js:641:32)
at Function.Module._load (internal/modules/cjs/loader.js:556:12)
at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Error: undefined:6:0: unexpected end.
at SaxesParser.fail (/home/ldd/src/git-repos/saxes/lib/saxes.js:492:18)
at SaxesParser.end (/home/ldd/src/git-repos/saxes/lib/saxes.js:1701:12)
at SaxesParser.write (/home/ldd/src/git-repos/saxes/lib/saxes.js:547:23)
at SaxesParser.close (/home/ldd/src/git-repos/saxes/lib/saxes.js:557:17)
at Object.<anonymous> (/home/ldd/src/git-repos/saxes/examples/null-parser.js:30:10)
at Module._compile (internal/modules/cjs/loader.js:774:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
at Module.load (internal/modules/cjs/loader.js:641:32)
at Function.Module._load (internal/modules/cjs/loader.js:556:12)
at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Parsing time: 8
The issue here is inherited from sax and carried over into saxes. I tried the latest sax: it raises no errors because it does not check whether the document is actually well-formed XML. However, it does not generate events for the root element.
saxes currently aims to just capture the DTD, without well-formedness checks, but this is not easy to do:
You cannot just skip ahead to the string ]> because this string can legally appear in an entity declaration.
sax fixed that issue by keeping track of quotes in the DTD: upon encountering a quote, the state changes and sax looks for the end of the quote. Effectively, this causes a ]> appearing between quotes (e.g. in an entity declaration) to be ignored.
The problem, though, is that sax ignores processing instructions and comments in the DTD. In the problematic file above, the single quote appearing in the comment puts the parser into the quote state, and this quote never terminates. The parser needs to keep track of comments and processing instructions to avoid interpreting quotes that appear in them as quote delimiters. (They are just unstructured comment or processing instruction content.)
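The scanning rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code saxes or sax uses: findInternalSubsetEnd is a hypothetical name, it works on a complete string rather than on streamed chunks, and it performs no well-formedness checks.

```javascript
// Hypothetical helper, not part of saxes or sax.
// `start` is the index just after the "[" that opens the internal subset.
// Returns the index just past the ">" that closes the DOCTYPE, or -1 if
// the subset (or a quote, comment, or PI inside it) is never closed.
function findInternalSubsetEnd(text, start) {
  let i = start;
  while (i < text.length) {
    const c = text[i];
    if (c === '"' || c === "'") {
      // Quoted literal: a "]>" inside it must be ignored.
      const close = text.indexOf(c, i + 1);
      if (close === -1) return -1;
      i = close + 1;
    } else if (text.startsWith("<!--", i)) {
      // Comment: quotes inside it are plain content, not delimiters.
      const close = text.indexOf("-->", i + 4);
      if (close === -1) return -1;
      i = close + 3;
    } else if (text.startsWith("<?", i)) {
      // Processing instruction: same treatment as a comment.
      const close = text.indexOf("?>", i + 2);
      if (close === -1) return -1;
      i = close + 2;
    } else if (c === "]") {
      // The grammar allows optional white space between "]" and ">".
      let j = i + 1;
      while (j < text.length && /\s/.test(text[j])) j++;
      if (text[j] === ">") return j + 1;
      i++;
    } else {
      i++;
    }
  }
  return -1;
}
```

With the problematic file above, the quote in I'm sits inside the comment branch, so it never switches the scanner into the quoted-literal branch.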
There seems to be a small issue on Edge. I'm getting the following error:
SCRIPT1028: SCRIPT1028: Expected identifier, string or number
Edge gives the most helpful errors... It claims to break on this line: this.ns = { __proto__: null, ...rootNS }; but who knows if that's really true.
After removing the import statement import * as Saxes from 'saxes', the error is gone. I can't really debug further than that.
I'm using the vue-cli, so webpack.
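If the object spread syntax on that line is indeed what Edge rejects (the report is not certain about this), the same null-prototype object can be built without spread. A minimal sketch, where rootNS holds sample namespace bindings chosen for illustration and may not match what saxes actually stores:

```javascript
// Sample bindings for illustration only.
const rootNS = {
  xml: "http://www.w3.org/XML/1998/namespace",
  xmlns: "http://www.w3.org/2000/xmlns/",
};

// Same result as { __proto__: null, ...rootNS }, without object spread:
// create an object with no prototype, then copy the own properties onto it.
const ns = Object.assign(Object.create(null), rootNS);

console.log(Object.getPrototypeOf(ns)); // null
```

In a webpack setup, transpiling the dependency so that spread is rewritten would be another route, but that depends on the project's build configuration.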
Would it be possible to fire an attributestart event when an attribute name is parsed, similar to the opentagstart event?
This would allow the start position of the attribute (the current parser location minus the length of the attribute name) to be identified.
First of all, thanks a lot for the cool library!
At @exceljs we are going to get rid of the old sax parser and replace it with saxes. The problem is that saxes doesn't provide a native streaming API; it looks like it was removed from the original sources.
Since exceljs is tightly coupled to the streaming API, we are looking for ways to add this API to saxes.
Do you think this is possible with reasonable effort? Or maybe it's possible to avoid it altogether?
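One possible shape for such an adapter is a thin Node.js Writable that forwards decoded chunks to the parser. This is a sketch under assumptions, not an API that saxes provides: ParserWritable is a hypothetical name, and the parser is assumed to be any object exposing write(string) and close(), which is how SaxesParser is driven in the examples in this thread.

```javascript
const { Writable } = require("stream");
const { StringDecoder } = require("string_decoder");

// Hypothetical adapter: pipes stream data into a parser-like object.
class ParserWritable extends Writable {
  constructor(parser) {
    super();
    this.parser = parser;
    // StringDecoder holds back incomplete multi-byte UTF-8 sequences, so a
    // character split across two chunks is not corrupted.
    this.decoder = new StringDecoder("utf8");
  }

  _write(chunk, encoding, callback) {
    try {
      const text = typeof chunk === "string" ? chunk : this.decoder.write(chunk);
      if (text.length > 0) this.parser.write(text);
      callback();
    } catch (err) {
      callback(err);
    }
  }

  _final(callback) {
    try {
      // Flush any bytes the decoder is still holding, then finish parsing.
      const rest = this.decoder.end();
      if (rest.length > 0) this.parser.write(rest);
      this.parser.close();
      callback();
    } catch (err) {
      callback(err);
    }
  }
}
```

Usage would look like fs.createReadStream(path).pipe(new ParserWritable(parser)), with the event handlers registered on the parser beforehand. The sketch assumes UTF-8 input and does not surface parser error events through the stream.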
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These problems occurred while renovating this repository.
These updates are currently rate-limited. Click on a checkbox below to force their creation now.
(@typescript-eslint/eslint-plugin, @typescript-eslint/eslint-plugin-tslint, @typescript-eslint/parser)
(@typescript-eslint/eslint-plugin, @typescript-eslint/eslint-plugin-tslint, @typescript-eslint/parser)
These updates have been manually edited so Renovate will no longer make changes. To discard all commits and start over, click on a checkbox.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
(@commitlint/cli, @commitlint/config-angular)
(mocha, @types/mocha)
(node, @types/node)
.github/workflows/node.js.yml
actions/checkout v2
actions/setup-node v3
package.json
xmlchars ^2.2.0
@commitlint/cli ^16.3.0
@commitlint/config-angular ^16.3.0
@types/chai ^4.3.5
@types/mocha ^9.1.1
@types/node ^16.18.34
@typescript-eslint/eslint-plugin ^5.59.9
@typescript-eslint/eslint-plugin-tslint ^5.59.9
@typescript-eslint/parser ^5.59.9
@xml-conformance-suite/js ^3.0.0
@xml-conformance-suite/mocha ^3.0.0
@xml-conformance-suite/test-data ^3.0.0
chai ^4.3.7
conventional-changelog-cli ^2.2.2
eslint ^8.42.0
eslint-config-lddubeau-base ^6.1.0
eslint-config-lddubeau-ts ^2.0.2
eslint-import-resolver-typescript ^2.7.1
eslint-plugin-import ^2.27.5
eslint-plugin-jsx-a11y ^6.7.1
eslint-plugin-prefer-arrow ^1.2.3
eslint-plugin-react ^7.32.2
eslint-plugin-simple-import-sort ^7.0.0
husky ^7.0.4
mocha ^9.2.2
renovate-config-lddubeau ^1.0.0
simple-dist-tag ^1.0.2
ts-node ^10.9.1
tsd ^0.22.0
tslint ^6.1.3
tslint-microsoft-contrib ^6.2.0
typedoc ^0.24.8
typescript ^4.9.5
node >=v12.22.12
.travis.yml
node 12
node 10
I was looking to better understand the state transitions for the Saxes parser, so I created state machine diagrams covering all the states. Not sure if it would be useful to others/worth including as documentation, but in case it is, feel free to use: https://gist.github.com/ForbesLindesay/f30b317013b9851149178c9f30993f6c
I've attempted to document all state transitions that don't result in a call to this.fail.
Hi,
Is it possible to use await in event callbacks? Something like saxParser.on("opentag", async (node) => { await ... ?
It looks like it's not possible, but I would like to confirm...
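If the parser does call handlers synchronously and ignores what they return, as the question suspects, one workaround is to record events during parsing and do the asynchronous work afterwards. A minimal sketch; recordEvents and processEvents are hypothetical helpers, and the parser is assumed to be any object with the on(eventName, handler) method shown in the question:

```javascript
// Hypothetical helper: registers plain synchronous handlers that only
// record what happened, in order.
function recordEvents(parser, eventNames) {
  const events = [];
  for (const name of eventNames) {
    parser.on(name, (payload) => events.push({ name, payload }));
  }
  return events;
}

// Once parsing has finished, the recorded events can be handled one at a
// time with await, because nothing here runs inside a parser callback.
async function processEvents(events, handle) {
  for (const event of events) {
    await handle(event);
  }
}
```

The trade-off is memory: every recorded event stays in the array until it is processed, so for large documents it may be better to parse in chunks and drain the array between chunks.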
The ProcessingInstruction-literal-1.xhtml test currently fails in jsdom, as document.firstChild returns a text node instead of the first element. The test does pass in browsers.
I have tracked this down to the newline that follows the XML declaration. If I've understood the relevant part of the specification correctly, then comments, processing instructions and white space directly after the XML declaration should be seen as part of the prolog and not result in an event.
This happens with 5.0.0. I have not checked older versions.
Save this to a file:
"use strict";

/* eslint-disable no-console */

const fs = require("fs");
const saxes = require("../build/dist/saxes");

const xml = fs.readFileSync(process.argv[2]);
const parser = new saxes.SaxesParser({ xmlns: false });

parser.on("opentag", tag => {
  console.log(tag);
});

parser.on("error", err => {
  console.error(err);
});

parser.write(xml);
parser.close();
Update the path for saxes in the require call so that it loads a local version of saxes.
Save this to a file:
<?xml version="1.0" encoding="UTF-8"?>
<top>
<x>Fnord '<' and then some.</x>
<x x="foo"></x>
</top>
Run node path/to/js/file path/to/xml/file with the two files above that you've saved.
The output should be:
{
name: 'top',
attributes: [Object: null prototype] {},
isSelfClosing: false
}
{
name: 'x',
attributes: [Object: null prototype] {},
isSelfClosing: false
}
{
name: 'x',
attributes: [Object: null prototype] { x: 'foo' },
isSelfClosing: false
}
Namely, the x attribute on the second x element should have the value foo.
{
name: 'top',
attributes: [Object: null prototype] {},
isSelfClosing: false
}
{
name: 'x',
attributes: [Object: null prototype] {},
isSelfClosing: false
}
{
name: 'x',
attributes: [Object: null prototype] { x: '<foo' },
isSelfClosing: false
}
Note the less-than symbol in the value of the x attribute.
The issue disappears if a handler is added for the text event.
I have an XML document containing an attribute with carriage return characters: (https://github.com/w3c/xsdtests/blob/master/msData/regex/RegexTest_63.xml)
If I read the XML spec correctly regarding attribute value normalization (https://www.w3.org/TR/xml/#AVNormalize), I think these should be normalized to a single space character. However, the following test case shows that this does not currently happen:
const p = new saxes.SaxesParser();
p.onopentag = n => console.log(JSON.stringify(n.attributes['att']));
p.write('<doc att="a\rb"/>').close();
// logs "a\rb"
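For reference, the normalization the reporter expects can be sketched as below. This is an illustration of the rule as described in sections 2.11 and 3.3.3 of the XML specification, not the saxes implementation: normalizeAttributeValue is a hypothetical helper, it handles only literal characters of a CDATA-type attribute, and it leaves character and entity references (such as &#xD;) untouched because they follow different rules.

```javascript
// Hypothetical helper, not part of saxes.
function normalizeAttributeValue(raw) {
  return raw
    // Line-end normalization: "\r\n" and a lone "\r" both become "\n".
    .replace(/\r\n?/g, "\n")
    // Attribute-value normalization: each remaining tab or newline
    // becomes a single space.
    .replace(/[\t\n]/g, " ");
}

console.log(JSON.stringify(normalizeAttributeValue("a\rb"))); // "a b"
```

Because line endings are normalized first, a literal "\r\n" pair yields one space, not two.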
Hi,
I am currently running experiments with saxes and attranslate: https://github.com/fkirc/attranslate
attranslate is a new tool for semi-automated app and website translations.
A goal of attranslate is to modify existing XML files with as few changes as possible.
Therefore, one thing that I am curious about is the preservation of XML-comments and whitespace (linebreaks).
It would be nice to have a tree-structure that preserves literally all information when reading an XML-file and then writing the (almost) same XML-file again.
Perhaps you could point me in a direction on how to best achieve whitespace- and comments-preservation?
There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.
Error type: Cannot find preset's package (lddubeau:base)
I would like to start parsing the XML, then once a specific element has been reached I would like to pause it, apply some logic to what has already been processed, then resume the parsing once I'm done.
Is it possible to achieve this using saxes?
Or should I first parse the string, split it into chunks, and feed each chunk separately to saxes?
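One way to get chunk-level pausing is to drive the parser from a generator, so that parsing only advances when the caller asks for the next chunk. This is a sketch under assumptions: feedInChunks is a hypothetical helper, the parser is any object with write(string) and close(), and chunkSize is at least 1. The pause happens between chunks, not at the exact element, so a handler would set a flag that the caller checks after each step.

```javascript
// Hypothetical helper: writes `xml` to `parser` one chunk at a time and
// yields the current offset after each chunk.
function* feedInChunks(parser, xml, chunkSize) {
  let offset = 0;
  while (offset < xml.length) {
    let end = Math.min(offset + chunkSize, xml.length);
    // Do not cut a surrogate pair in half: if the chunk would end on a
    // high surrogate, include the low surrogate that follows it.
    const last = xml.charCodeAt(end - 1);
    if (end < xml.length && last >= 0xd800 && last <= 0xdbff) end++;
    parser.write(xml.slice(offset, end));
    offset = end;
    yield offset; // parsing is paused here until the caller calls next()
  }
  parser.close();
}
```

A caller could loop with for (const offset of feedInChunks(parser, xml, 4096)) and, inside the loop, await whatever work the handlers have queued before letting the loop continue.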
Test case:
const saxes = require("saxes");
const parser = new saxes.SaxesParser({ xmlns: true });

parser.onerror = function (e) {
  console.log(e);
};

parser.write('<span xmlns:="urn:x-test:test">12</span>').close();
I believe the code above is expected to produce an error, based on the following test in https://github.com/web-platform-tests/wpt/blob/master/domparsing/DOMParser-parseFromString-xml-parsererror.html#L24. It's currently passing in two browsers.
I'm using jsdom to parse XML into a DOM and then serialize it back after I make some modifications to it.
I noticed that the processing instruction was missing from the top of the serialized output. At first, I thought this was a limitation of the module. However, it seems that it tries to use the onprocessinginstruction event from saxes, but it's not fired in my case.
I can see in the sPIEnding() method that it is fired from the second if clause, but in my case piIsXMLDecl is true, so the event is not fired. Is this the intended behaviour?
My XML file starts with the standard processing instruction of:
<?xml version="1.0" encoding="UTF-8"?>
I am trying to track the XML substring for each node parsed into a DOM document, using saxes position tracking. I do this by creating the DOM as saxes emits new nodes and diffing the tracked positions at those events. This works well as long as an event is emitted any time the position offset is updated in the parser. The issue I run into is that the XML declaration does not give me an event because it is not a DOM node. This robs me of the offset at which the first emitted node starts.
If you (npm i saxes slimdom and) run the following:
const saxes = require('saxes');
const slimdom = require('slimdom');

const parser = new saxes.SaxesParser({
  position: true
});

function getDocumentWithStartAndEndPositionsPerNode (xmlString) {
  const document = new slimdom.Document();
  let nextNodeParentNode = document;
  // By keeping track of the last parser position you can tell what the XML source substring for the last created DOM
  // node is.
  let nextNodeStartPosition = parser.position;

  parser.oncomment = (comment) => {
    const node = document.createComment(comment);
    nextNodeParentNode.appendChild(node);
    // Set the offsets for the XML substring for this node
    node.stringStartOffset = nextNodeStartPosition;
    node.stringEndOffset = parser.position;
    nextNodeStartPosition = parser.position;
  };

  parser.onopentag = (element) => {
    const node = document.createElement(element.name);
    // Set the offset for the XML substring for this node
    node.stringStartOffset = nextNodeStartPosition;
    nextNodeStartPosition = parser.position;
    nextNodeParentNode.appendChild(node);
    nextNodeParentNode = node;
  };

  parser.onclosetag = () => {
    // The element being closed is the current parent: record its end offset before moving back up
    nextNodeParentNode.stringEndOffset = parser.position;
    nextNodeParentNode = nextNodeParentNode.parentNode;
    nextNodeStartPosition = parser.position;
  };

  parser.write(xmlString).close();
  return document;
}

console.log('Happy flow without XML declaration:');
const dom1 = getDocumentWithStartAndEndPositionsPerNode('<!--test--><x />');
console.log(dom1.childNodes[0].stringStartOffset); // Offset for comment, is correct
console.log(dom1.childNodes[1].stringStartOffset); // Offset for element, is correct

console.log('Error with XML declaration:');
const dom2 = getDocumentWithStartAndEndPositionsPerNode('<?xml version="1.0"?><!--test--><x />');
console.log(dom2.childNodes[0].stringStartOffset); // Offset for comment, is erroneous because nextNodeStartPosition was
                                                   // not updated after encountering XML declaration
console.log(dom2.childNodes[1].stringStartOffset); // Offset for element, is correct
In the first console log, 0 and 11 are expected because the comment starts at 0 and is 11 chars long.
In the second console log, 21 and 32 are expected because the XML declaration is 21 characters long. However, 0 and 32 are found because the start for the first emitted node was not set correctly.
(Actual results are 1 lower because of some offset thing; I think that's unrelated.)
As for a solution, either an emitted event that the XML declaration PI has finished parsing would do (and parser.position is updated) or a way to find how many characters were in the XML declaration.
Let me know what I can do, or if I can clarify anything about the above.
When I run the following in a browser:
JSON.stringify(new DOMParser().parseFromString('<p>A\r\nB</p>', 'text/xml').documentElement.textContent)
it returns the following string: "A\nB".
However, in saxes, I get an ontext event with A\r\nB as the text argument.
The spec states that it is the responsibility of the 'XML processor' to normalize line endings. Since saxes is an XML parser, I'd say that saxes should normalize these line endings.
What's your opinion on this matter?
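In a streaming parser the normalization has one subtlety: a "\r" at the end of one chunk and a "\n" at the start of the next must still collapse into a single "\n". A minimal sketch of a chunk-aware normalizer follows; createLineEndingNormalizer is a hypothetical helper written for illustration and says nothing about how saxes handles or would handle this internally.

```javascript
// Hypothetical helper: returns a function that normalizes line endings in
// successive chunks ("\r\n" and lone "\r" both become "\n").
function createLineEndingNormalizer() {
  // True when the previous chunk ended with "\r", which has already been
  // emitted as "\n".
  let pendingCR = false;
  return function normalize(chunk) {
    if (chunk.length === 0) return "";
    // A leading "\n" completes a "\r\n" pair split across chunks: drop it.
    if (pendingCR && chunk[0] === "\n") chunk = chunk.slice(1);
    pendingCR = chunk.endsWith("\r");
    return chunk.replace(/\r\n?/g, "\n");
  };
}

const normalize = createLineEndingNormalizer();
console.log(JSON.stringify(normalize("<p>A\r") + normalize("\nB</p>"))); // "<p>A\nB</p>"
```

Emitting the "\n" for a trailing "\r" immediately, rather than holding it back, keeps the function free of any end-of-input flush step.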
Hey, at exceljs we're thinking about moving from sax to saxes exceljs/exceljs#748 (comment)
However, one of our core features is our stream excel reader api that allows large excel files to be processed. The README says that the old streams API was removed and that a new one might be coming - what's the status on that and would you be open to accept help if you lay out a rough plan of how it should be best implemented? Thanks!
Hi,
When I run the following code snippet in the browser: new DOMParser().parseFromString('<e attr="]]>"></e>', 'text/xml'), it outputs an element with an attribute set to ]]>. When I serialize it again, I get <e attr="]]>"></e>.
However, when I try to parse the same string with saxes, I get an error.
The spec mentions that the ]]> sequence MUST be escaped as ]]&gt; for compatibility with SGML. However, that only seems to apply when serializing XML. Also, all of the parsers I've tried accept this string.
Should saxes really throw an error?