isaacs / sax-js

A sax style parser for JS

License: Other

JavaScript 100.00%

sax-js's Introduction

sax js

A sax-style parser for XML and HTML.

Designed with node in mind, but should work fine in the browser or other CommonJS implementations.

What This Is

A very simple tool to parse through an XML string.
A stepping stone to a streaming HTML parser.
A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML docs.

What This Is (probably) Not

An HTML Parser - That's a fine goal, but this isn't it. It's just XML.
A DOM Builder - You can use it to build an object model out of XML, but it doesn't do that out of the box.
XSLT - No DOM = no querying.
100% Compliant with (some other SAX implementation) - Most SAX implementations are in Java and do a lot more than this does.
An XML Validator - It does a little validation when in strict mode, but not much.
A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic masochism.
A DTD-aware Thing - Fetching DTDs is a much bigger job.
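
As a rough illustration of the "DOM Builder" point above, the sketch below builds a plain object tree from parser events. It is not part of sax-js: `attachTreeBuilder` is a hypothetical helper, and it assumes only the onopentag, ontext, and onclosetag handlers described in this README. The demo fires the handlers by hand on a plain object standing in for a real parser, so it runs without the library.

```javascript
// Hypothetical helper, not part of sax-js: builds a plain object tree
// from the opentag / text / closetag events described in this README.
function attachTreeBuilder(parser) {
  var root = { name: "#root", attributes: {}, children: [] };
  var stack = [root];

  parser.onopentag = function (node) {
    var element = { name: node.name, attributes: node.attributes, children: [] };
    stack[stack.length - 1].children.push(element);
    stack.push(element);
  };
  parser.ontext = function (text) {
    stack[stack.length - 1].children.push(text);
  };
  parser.onclosetag = function () {
    // Never pop the synthetic root.
    if (stack.length > 1) stack.pop();
  };
  return root;
}

// Stand-in for a real parser: fire the handlers by hand, in the order a
// document like <xml>Hello, <who name="world">world</who>!</xml> implies.
var fakeParser = {};
var tree = attachTreeBuilder(fakeParser);
fakeParser.onopentag({ name: "xml", attributes: {} });
fakeParser.ontext("Hello, ");
fakeParser.onopentag({ name: "who", attributes: { name: "world" } });
fakeParser.ontext("world");
fakeParser.onclosetag("who");
fakeParser.ontext("!");
fakeParser.onclosetag("xml");

console.log(JSON.stringify(tree.children[0], null, 2));
```

With a real parser, the same helper would presumably be attached before calling write(), since handlers need to be in place when the events fire.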

Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute values: &amp; &lt; &gt; &apos; &quot;. It's possible to define additional entities in XML by putting them in the DTD. This parser doesn't do anything with that. If you want to listen to the ondoctype event, and then fetch the doctypes, and read the entities and add them to parser.ENTITIES, then be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass through unmolested.
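
The strict/loose difference for unknown entities can be sketched with a small stand-alone decoder. This is illustrative only and is not how sax-js is implemented: `decodeEntities` and `BASIC_ENTITIES` are hypothetical names, and the error message is made up for the example.

```javascript
// Illustrative only: this is not sax-js's implementation.
var BASIC_ENTITIES = { amp: "&", lt: "<", gt: ">", apos: "'", quot: '"' };

function decodeEntities(text, entities, strict) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, function (match, name) {
    if (Object.prototype.hasOwnProperty.call(entities, name)) {
      return entities[name];
    }
    if (strict) {
      // Strict mode: an unknown entity is an error.
      throw new Error("Invalid character entity: " + match);
    }
    // Loose mode: pass the unknown entity through untouched.
    return match;
  });
}

console.log(decodeEntities("a &lt; b &amp;&amp; &e;", BASIC_ENTITIES, false));
// prints: a < b && &e;
```

Extending the table (for example with entities read from a DTD) would then be a matter of adding keys to the object passed as `entities`.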

Usage

var sax = require("./lib/sax"),
  strict = true, // set to false for html-mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text.  t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag.  node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute.  attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage
// takes the same options as the parser
var saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node
  // event emitter.
  console.error("error!", e)
  // clear the error
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as above
})
// pipe is supported, and it's readable/writable
// same chunks coming in also go out.
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))

Arguments

Pass the following arguments to the parser function. All are optional.

strict - Boolean. Whether or not to be a jerk. Default: false.

opt - Object bag of settings regarding string formatting. All default to false.

Settings supported:

trim - Boolean. Whether or not to trim text and comment nodes.
normalize - Boolean. If true, then turn any whitespace into a single space.
lowercase - Boolean. If true, then lowercase tag names and attribute names in loose mode, rather than uppercasing them.
xmlns - Boolean. If true, then namespaces are supported.
position - Boolean. If false, then don't track line/col/position.
strictEntities - Boolean. If true, only parse predefined XML entities (&amp;, &apos;, &gt;, &lt;, and &quot;)
unquotedAttributeValues - Boolean. If true, then unquoted attribute values are allowed. Defaults to false when strict is true, true otherwise.
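
To make the trim and normalize settings concrete, here is a minimal sketch of their likely effect on a text node. It is an assumption based on the descriptions above, not code copied from sax-js; in particular it assumes normalize collapses each run of whitespace into one space. `applyTextOptions` is a hypothetical name.

```javascript
// Sketch of the documented text options; not copied from sax-js itself.
function applyTextOptions(opt, text) {
  if (opt.trim) text = text.trim();
  // Assumption: "any whitespace" means each run of whitespace characters.
  if (opt.normalize) text = text.replace(/\s+/g, " ");
  return text;
}

var raw = "  Hello,\n\t world  ";
console.log(JSON.stringify(applyTextOptions({ trim: true }, raw)));
console.log(JSON.stringify(applyTextOptions({ normalize: true }, raw)));
console.log(JSON.stringify(applyTextOptions({ trim: true, normalize: true }, raw)));
```

Note that normalize alone keeps a single leading and trailing space, so the two options are often combined.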

Methods

write - Write bytes onto the stream. You don't have to do this all at once. You can keep writing as much as you want.

close - Close the stream. Once closed, no more data may be written until it is done processing the buffer, which is signaled by the end event.

resume - To gracefully handle errors, assign a listener to the error event. Then, when the error is taken care of, you can call resume to continue parsing. Otherwise, the parser will not continue while in an error state.
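
The error/resume cycle described above might look like the following sketch. `collectErrors` is a hypothetical helper; the stub object exposes only the `error` member and `resume` method this README documents, so the example runs without the library.

```javascript
// Hypothetical helper: records each error, clears it, and resumes parsing.
function collectErrors(parser) {
  var errors = [];
  parser.onerror = function (er) {
    errors.push(er);
    // Handlers run with the parser as `this` (see Events).
    this.error = null;
    this.resume();
  };
  return errors;
}

// Minimal stand-in exposing only the members this sketch touches.
var stub = {
  error: null,
  resumed: 0,
  resume: function () { this.error = null; this.resumed++; return this; }
};

var seen = collectErrors(stub);
stub.error = new Error("Unexpected close tag");
stub.onerror(stub.error);

console.log(seen.length, stub.error, stub.resumed); // prints: 1 null 1
```

Collecting errors rather than throwing is mostly useful in strict mode, where a broken feed may produce several recoverable errors in one document.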

Members

At all times, the parser object will have the following members:

line, column, position - Indications of the position in the XML document where the parser currently is looking.

startTagPosition - Indicates the position where the current tag starts.

closed - Boolean indicating whether or not the parser can be written to. If it's true, then wait for the ready event to write again.

strict - Boolean indicating whether or not the parser is a jerk.

opt - Any options passed into the constructor.

tag - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

Events

All events emit with a single argument. To listen to an event, assign a function to on<eventname>. Functions get executed in the this-context of the parser object. The list of supported events is also in the exported EVENTS array.

When using the stream interface, assign handlers using the EventEmitter on function in the normal fashion.

error - Indication that something bad happened. The error will be hanging out on parser.error, and must be deleted before parsing can continue. By listening to this event, you can keep an eye on that kind of stuff. Note: this happens much more in strict mode. Argument: instance of Error.

text - Text node. Argument: string of text.

doctype - The <!DOCTYPE declaration. Argument: doctype string.

processinginstruction - Stuff like <?xml foo="blerg" ?>. Argument: object with name and body members. Attributes are not parsed, as processing instructions have implementation dependent semantics.

sgmldeclaration - Random SGML declarations. Stuff like <!ENTITY p> would trigger this kind of event. This is a weird thing to support, so it might go away at some point. SAX isn't intended to be used to parse SGML, after all.

opentagstart - Emitted immediately when the tag name is available, but before any attributes are encountered. Argument: object with a name field and an empty attributes set. Note that this is the same object that will later be emitted in the opentag event.

opentag - An opening tag. Argument: object with name and attributes. In non-strict mode, tag names are uppercased, unless the lowercase option is set. If the xmlns option is set, then it will contain namespace binding information on the ns member, and will have a local, prefix, and uri member.

closetag - A closing tag. In loose mode, tags are auto-closed if their parent closes. In strict mode, well-formedness is enforced. Note that self-closing tags will have closeTag emitted immediately after openTag. Argument: tag name.

attribute - An attribute node. Argument: object with name and value. In non-strict mode, attribute names are uppercased, unless the lowercase option is set. If the xmlns option is set, it will also contain namespace information.

comment - A comment node. Argument: the string of the comment.

opencdata - The opening tag of a <![CDATA[ block.

cdata - The text of a <![CDATA[ block. Since <![CDATA[ blocks can get quite large, this event may fire multiple times for a single block, if it is broken up into multiple write()s. Argument: the string of random character data.

closecdata - The closing tag (]]>) of a <![CDATA[ block.

opennamespace - If the xmlns option is set, then this event will signal the start of a new namespace binding.

closenamespace - If the xmlns option is set, then this event will signal the end of a namespace binding.

end - Indication that the closed stream has ended.

ready - Indication that the stream has reset, and is ready to be written to.

noscript - In non-strict mode, <script> tags trigger a "script" event, and their contents are not checked for special xml characters. If you pass noscript: true, then this behavior is suppressed.
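
Because the cdata event may fire several times per block, a consumer that wants one string per block has to join the chunks between opencdata and closecdata. A minimal sketch, assuming only those three event names; `collectCdataBlocks` is hypothetical and the handlers are fired by hand on a stand-in object:

```javascript
// Hypothetical helper: joins cdata chunks into one string per CDATA block.
function collectCdataBlocks(parser, onBlock) {
  var chunks = null; // null means "not inside a CDATA block"
  parser.onopencdata = function () { chunks = []; };
  parser.oncdata = function (text) { if (chunks) chunks.push(text); };
  parser.onclosecdata = function () {
    if (!chunks) return;
    onBlock(chunks.join(""));
    chunks = null;
  };
}

var blocks = [];
var fake = {};
collectCdataBlocks(fake, function (block) { blocks.push(block); });

// One block delivered in two chunks, then a second, separate block.
fake.onopencdata(); fake.oncdata("abc"); fake.oncdata("def"); fake.onclosecdata();
fake.onopencdata(); fake.oncdata("ghi"); fake.onclosecdata();

console.log(blocks); // prints: [ 'abcdef', 'ghi' ]
```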

Reporting Problems

It's best to write a failing test if you find an issue. I will always accept pull requests with failing tests if they demonstrate intended behavior, but it is very hard to figure out what issue you're describing without a test. Writing a test is also the best way for you yourself to figure out if you really understand the issue you think you have with sax-js.

sax-js's People

Forkers

laurie71 simplegeo kixxauth thejh smh mikeal tmpvar coreyjewett henryrawas jmakeig sergeyzalyadeev dherman richmarr cohenudi dscape leesolutions fasterize pgte jammie sourishkrout dolphin278 myndzi alagopus danmactough cianomaidin abarre machadogj everettquebral peterreid no9 explodingbarrel defunctzombie c0bra donglix kaoshijuan rajeshvv ianobermiller jbuck sanyaade-mobiledev cosminpaun ckavuklu netconstructor egucciar jbeard4 michaelnisi lbdremy orangedog michelpa michalliu andrewrk qinhelin vikrambhatla83 embarkmobile dmitrymyadzelets csnw santigimeno lddubeau johnhuijbers jsdevel wafou pletcher clockworked247 richarddwalsh itkoren l2l2l corespring ashmind wanted33 casser plediii rasmusvhansen hirak dheerajaggarwal lovasoa todevelopers subzey antony74 lopno propn morgangiraud kennethjiang afrolov eatbyte yinso angelozerr streetlib cfenner kuznetsov-ilia mykook robbert bbondy thiakil mikeyjm145 pirxpilot jochenberger dvbarnes etabits ajvincent robhawkes mart-jansink

sax-js's Issues

Unwanted closing tag will stop the parser

Hello,

trying this code :

parser.write('<div win="true"></div></span><p fail="true"></p>');

See the </span> that comes from nowhere.

Will result in

{ name: 'DIV', attributes: { win: 'true' } }

While

parser.write('<div win="true"></div><p fail="true"></p>');

Results in

{ name: 'DIV', attributes: { win: 'true' } }
{ name: 'P', attributes: { fail: 'true' } }

I know this case and my previous one (#31) are html-related and that sax-js is not currently aimed at parsing html, but I thought I would let you know.

I'm really interested in bringing html parsing to sax-js but I do not know where to start.

pretty-print.js I/O problem

On my Ubuntu system, running pretty-print on a large XML file (my test file is 700k) and sending the output to a file causes node to slow way down and consume 100% of the cpu. It gets really, really slow -- I've never had the patience to wait and see how long it would take. Possibly hours.

I'm not sure I need to post my XML file, since it is large. But I think you can replicate the problem using any large XML file by doing:
node pretty-print.js file.xml > file-pretty.xml

The problem seems to be with how pretty-print.js is handling writes to stdout.

I came up with a solution that makes it run as fast as it should. It doesn't appear to be optimal -- strace shows that it's still making at least twice as many calls to the write(2) system call as it needs to. But from the user's perspective, the problem is completely gone. So it's a big improvement.

diff --git a/examples/pretty-print.js b/examples/pretty-print.js
index 0a40ef0..73a5b7e 100644
--- a/examples/pretty-print.js
+++ b/examples/pretty-print.js
@@ -6,41 +6,61 @@ function entity (str) {
   return str.replace('"', '&quot;');
 }
 
+printer.inputPaused = false;
+printer.spool = '';
+printer.send = function(str) {
+  if(printer.inputPaused) {
+    printer.spool += str;
+  }
+  else if(!process.stdout.write(str)) {
+    printer.inputPaused = true;
+    inputStream.pause();
+  }
+};
+process.stdout.on('drain', function() {
+  printer.inputPaused = false;
+  if(printer.spool !== '') {
+    process.stdout.write(printer.spool);
+    printer.spool = '';
+  }
+  inputStream.resume();
+});
+
 printer.tabstop = 2;
 printer.level = 0;
 printer.indent = function () {
-  sys.print("\n");
+  printer.send("\n");
   for (var i = this.level; i > 0; i --) {
     for (var j = this.tabstop; j > 0; j --) {
-      sys.print(" ");
+      printer.send(" ");
     }
   }
 }
 printer.onopentag = function (tag) {
   this.indent();
   this.level ++;
-  sys.print("<"+tag.name);
+  printer.send("<"+tag.name);
   for (var i in tag.attributes) {
-    sys.print(" "+i+"=\""+entity(tag.attributes[i])+"\"");
+    printer.send(" "+i+"=\""+entity(tag.attributes[i])+"\"");
   }
-  sys.print(">");
+  printer.send(">");
 }
 printer.ontext = printer.ondoctype = function (text) {
   this.indent();
-  sys.print(text);
+  printer.send(text);
 }
 printer.onclosetag = function (tag) {
   this.level --;
   this.indent();
-  sys.print("</"+tag+">");
+  printer.send("</"+tag+">");
 }
 printer.oncdata = function (data) {
   this.indent();
-  sys.print("<![CDATA["+data+"]]>");
+  printer.send("<![CDATA["+data+"]]>");
 }
 printer.oncomment = function (comment) {
   this.indent();
-  sys.print("<!--"+comment+"-->");
+  printer.send("<!--"+comment+"-->");
 }
 printer.onerror = function (error) {
   sys.debug(error);
@@ -52,21 +72,6 @@ if (!process.argv[2]) {
     "TODO: read from stdin or take a file");
 }
 var xmlfile = require("path").join(process.cwd(), process.argv[2]);
-fs.open(xmlfile, "r", 0666, function (er, fd) {
-  if (er) throw er;
-  (function R () {
-    fs.read(fd, 1024, null, "utf8", function (er, data, bytesRead) {
-      if (er) throw er;
-      if (data) {
-        printer.write(data);
-        R();
-      } else {
-        fs.close(fd);
-        printer.close();
-      }
-    });
-  })();
-});
 
+var inputStream = fs.createReadStream(xmlfile, { encoding: 'utf8' });
+inputStream.on('data', function(data) { printer.write(data); });
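
The spooling idea in this patch can be isolated into a small sketch that runs without pretty-print.js. `makeSender` is a hypothetical helper; `sink` is anything whose write() returns false when its buffer is full (as Node writable streams do), and `source` is anything with pause() and resume(). One deliberate difference from the patch, stated as an assumption about what is desirable: if flushing the spool fills the sink again, this version stays paused until the next drain.

```javascript
// Sketch of the patch's spooling logic against a generic sink and source.
function makeSender(sink, source) {
  var paused = false;
  var spool = "";
  return {
    send: function (str) {
      if (paused) { spool += str; return; }
      if (!sink.write(str)) {
        // Sink buffer is full: stop reading until it drains.
        paused = true;
        source.pause();
      }
    },
    onDrain: function () {
      if (spool !== "") {
        var pending = spool;
        spool = "";
        // Unlike the patch: stay paused if the flush fills the sink again.
        if (!sink.write(pending)) return;
      }
      paused = false;
      source.resume();
    }
  };
}

// Fake sink and source so the flow can be traced synchronously.
var sink = {
  out: [],
  full: false,
  write: function (s) { this.out.push(s); return !this.full; }
};
var source = {
  paused: false,
  pause: function () { this.paused = true; },
  resume: function () { this.paused = false; }
};

var sender = makeSender(sink, source);
sink.full = true;
sender.send("a");   // written, but the sink reports it is full
sender.send("b");   // spooled while paused
sink.full = false;
sender.onDrain();   // flushes the spool and resumes the source

console.log(sink.out, source.paused); // prints: [ 'a', 'b' ] false
```

With real streams, `sender.onDrain` would be registered as the listener for the sink's 'drain' event, as the patch does for process.stdout.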

Two definitions of SAXParser.prototype.end

I was just skimming the code and noticed that it is defined twice in the prototype.

https://github.com/isaacs/sax-js/blob/master/lib/sax.js#L133
https://github.com/isaacs/sax-js/blob/master/lib/sax.js#L137

streaming

is this lib still maintained at all? was planning on using it for streaming, but seems like it relies on state

Handle schema fragments in doctype declaration

Support stuff like this:

<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
<!ENTITY e SYSTEM "001.ent">
]>
<doc>&e;</doc>

Emit sgmldeclaration events for the ELEMENT and ENTITY declarations.

Should a <!ENTITY declaration add it to the list of sax.ENTITIES, perhaps? I'm certainly not going to fetch the 001.ent file and parse it, but it'd be nice to at least not emit an error in that case. Maybe it could be added to the ENTITIES hash without being decoded, so it'd still be &e;, and another higher level could add the full-fledged entity handling.

Is it possible to read contents of element?

With the following example, is it possible to read the contents of an element as a string?

<myxml attr="val">
   <something>
      <behave-like-cdata>
         <sample>Read as plain text without entities!</sample>
      </behave-like-cdata>
   </something>
</myxml>

When <behave-like-cdata> element is reached I would like to read its contents as a string "Read as plain text without entities!" and then skip to closing tag </behave-like-cdata>.

Is it possible to achieve this with your module?

Many thanks
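
One possible approach, sketched under assumptions: track nesting depth once the target element opens and concatenate every text event until it closes. `captureElementText` is a hypothetical helper built only on the onopentag, ontext, and onclosetag handlers. It keeps text only (nested markup is dropped), and any entities would already have been decoded by the parser, so it does not give truly raw contents.

```javascript
// Hypothetical helper: gathers all text inside the first-level match of
// targetName, including text of nested elements.
function captureElementText(parser, targetName, onCapture) {
  var depth = 0;   // 0 means "outside the target element"
  var buffer = "";
  parser.onopentag = function (node) {
    if (depth > 0) {
      depth++;
    } else if (node.name === targetName) {
      depth = 1;
      buffer = "";
    }
  };
  parser.ontext = function (text) {
    if (depth > 0) buffer += text;
  };
  parser.onclosetag = function () {
    if (depth === 0) return;
    depth--;
    if (depth === 0) onCapture(buffer);
  };
}

// Stand-in for a real parser: fire the events the sample document implies.
var captured = [];
var fakeParser = {};
captureElementText(fakeParser, "behave-like-cdata", function (text) {
  captured.push(text.trim());
});
fakeParser.onopentag({ name: "myxml", attributes: { attr: "val" } });
fakeParser.onopentag({ name: "something", attributes: {} });
fakeParser.onopentag({ name: "behave-like-cdata", attributes: {} });
fakeParser.ontext("\n   ");
fakeParser.onopentag({ name: "sample", attributes: {} });
fakeParser.ontext("Read as plain text without entities!");
fakeParser.onclosetag("sample");
fakeParser.ontext("\n");
fakeParser.onclosetag("behave-like-cdata");
fakeParser.onclosetag("something");
fakeParser.onclosetag("myxml");

console.log(captured);
```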

Speed up reading large Texts

How about reading large chunks of text using regular expressions? Character-by-character seems to be a relatively slow approach.

wrong attribute value: lost '='

The following test is red:

'# attribute href with query params': function(){
var input = '<a href="service.svc?root=foo&OrderBy=1&Asc=0&Page=1"> The link </a1>'
var parser = sax.parser(false)
parser.onopentag = function(tag){
assert.eql(tag.name, 'A');
assert.eql(tag.attributes.href, 'service.svc?root=&OrderBy=1&Asc=0&Page=1');
}
parser.write(input);
}

try/catch around require('stream') will not save you

You have a try/catch here but you use Stream.prototype.on here

Not a big deal, but I would humbly suggest either dropping the try/catch and all the checking here or adding some kind of on/emit logic to support browsers or other non-node environments.

Your documentation implies you want to support browsers and CommonJS environments so does that mean you want a pull request for a simple on/emit?

Adding namespace awareness

I'd like to make the parser namespace-aware. First, has anyone else done this already? My general approach was going to use the openTag(parser, selfClosing) function to figure out which namespaces have been declared in the existing node. I'm not sure how to calculate the universe of in-scope namespaces given XML's nesting and namespace redeclaration behavior. Any pointers would be much appreciated.
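
One way to compute the in-scope namespaces, offered as a sketch rather than as what sax-js does: give each open element its own scope object whose prototype is the parent element's scope. Redeclarations then shadow outer bindings, and closing an element simply means going back to the parent's scope. `createRootScope`, `openScope`, and `resolveName` are hypothetical names. All xmlns declarations on an element are applied before any name is resolved, so an attribute may use a prefix declared on the same element.

```javascript
var XML_NS = "http://www.w3.org/XML/1998/namespace";

// Root scope: only the xml prefix is bound by default.
function createRootScope() {
  var root = Object.create(null); // no inherited keys such as "constructor"
  root.xml = XML_NS;
  return root;
}

// New scope for an element: inherits from the parent, then applies this
// element's own xmlns / xmlns:prefix declarations.
function openScope(parentScope, attributes) {
  var scope = Object.create(parentScope);
  Object.keys(attributes).forEach(function (name) {
    if (name === "xmlns") {
      scope[""] = attributes[name];
    } else if (name.indexOf("xmlns:") === 0) {
      scope[name.slice("xmlns:".length)] = attributes[name];
    }
  });
  return scope;
}

// Resolve a qualified name. Unprefixed attributes are in no namespace;
// unprefixed elements use the default namespace, if any.
function resolveName(scope, qname, isAttribute) {
  var colon = qname.indexOf(":");
  var prefix = colon < 0 ? "" : qname.slice(0, colon);
  var local = colon < 0 ? qname : qname.slice(colon + 1);
  var uri;
  if (prefix === "") {
    uri = isAttribute ? "" : (scope[""] || "");
  } else {
    uri = scope[prefix];
    if (uri === undefined) {
      throw new Error("Unbound namespace prefix: " + JSON.stringify(prefix));
    }
  }
  return { prefix: prefix, local: local, uri: uri };
}

var root = createRootScope();
var outer = openScope(root, { "a:attr": "value", "xmlns:a": "http://ATTRIBUTE" });
var inner = openScope(outer, { "xmlns:a": "urn:inner" });

console.log(resolveName(outer, "a:attr", true).uri); // http://ATTRIBUTE
console.log(resolveName(inner, "a:x", false).uri);   // urn:inner
console.log(resolveName(outer, "a:x", false).uri);   // http://ATTRIBUTE
console.log(resolveName(inner, "xml:lang", true).uri === XML_NS); // true
```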

Parsing of ATOM feeds for pubsubhubbub throws error.

Using http://feeds.gawker.com/gawker/full as hub feed.

starting callback server on port 4443
callback: function (feed) {
        sys.log(sys.inspect(feed));
        var hubUri = feed.getLinksByRel('hub')[0];
        var callbackUri = url.parse(subscriber.createCallbackUri());
        subscriber.subscribe(topicUri, hubUri, callbackUri,
          function() {
            topicEvents.emit('subscribed', topicUri.href);
            subscriber.registerEventEmitter(topicUri, topicEvents);
          },
          function(error) {
            topicEvents.emit('error', error);
          });
      }
15 Jun 17:27:17 - { links: [ {}, {} ]
, entries: []
, title: '\r\n\t\t\r\n\t\t\t'
, updated: ''
, id: ''
}
/Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:51
    hubUri = url.parse(hubUri.href);
                             ^
TypeError: Cannot read property 'href' of undefined
    at [object Object].subscribe (/Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:51:30)
    at /Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:141:20
    at Object.onend (/Users/karl/Sites/karl-git/node-pshb/lib/atom.js:157:3)
    at emit (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:141:32)
    at end (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:172:3)
    at Object.write (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:250:30)
    at Object.close (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:67:37)
    at [object Object].parse (/Users/karl/Sites/karl-git/node-pshb/lib/atom.js:160:21)
    at Object.parse (/Users/karl/Sites/karl-git/node-pshb/lib/atom.js:81:17)
    at IncomingMessage.<anonymous> (/Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:151:12)

Why won't this parse?

I've been beating my head against the wall trying to figure out why this won't parse.
It reports that it successfully parsed the "success" element but then dies.
What am I missing?

Here is the log and error:

parsing data: "<compileClassesResponse><result><bodyCrc>653724009</bodyCrc><column>-1</column><id>01pG0000002KoSUIA0</id><line>-1</line><name>CalendarController</name><success>true</success></result></compileClassesResponse>"
Sax - Open Element: compileclassesresponse (Attributes: {} )
Sax - Open Element: result (Attributes: {} )
Sax - Open Element: bodycrc (Attributes: {} )
Sax - Text: 653724009
Sax - Close Element: bodycrc
Sax - Open Element: column (Attributes: {} )
Sax - Text: -1
Sax - Close Element: column
Sax - Open Element: id (Attributes: {} )
Sax - Text: 01pG0000002KoSUIA0
Sax - Close Element: id
Sax - Open Element: line (Attributes: {} )
Sax - Text: -1
Sax - Close Element: line
Sax - Open Element: name (Attributes: {} )
Sax - Text: CalendarController
Sax - Close Element: name
Sax - Open Element: success (Attributes: {} )
Sax - Text: true
Sax - Close Element: success
Sax - Error: {"stack":"Error: Unexpected end\nLine: 0\nColumn: 209\nChar: \n at error (/node_modules/sax/lib/sax.js:167:8)\n at end (/node_modules/sax/lib/sax.js:173:32)\n at Object.write (/node_modules/sax/lib/sax.js:255:30)\n at Object.close (/node_modules/sax/lib/sax.js:72:37)\n at parseResults (/sfdc.js:99:13)\n at IncomingMessage. (/sfdc.js:224:19)\n at IncomingMessage.emit (events.js:81:20)\n at HTTPParser.onMessageComplete (http.js:133:23)\n at CleartextStream.ondata (http.js:1213:22)\n at CleartextStream._push (tls.js:291:27)","message":"Unexpected end\nLine: 0\nColumn: 209\nChar: "}
{ stack: [Getter/Setter],
arguments: undefined,
type: undefined,
message: 'Unexpected end\nLine: 0\nColumn: 209\nChar: ' }

Unquoted attributes values will fail

Hello, do you think we should accept unquoted attribute values in loose mode?

parser.write('<span class=test hello=world></span>');

Will emit onopentag with :

{ name: 'SPAN', attributes: {} }

While

parser.write('<span class="test" hello="world"></span>');

will result in:

{ name: 'SPAN',  attributes: { class: 'test', hello: 'world' } }

A single dash in a comment causes an error

The following XML does not parse but is valid.

<xml>
<!-- 
  commment with a single dash- in it
-->
<data/>
</xml>

causes

/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:364
  if (this.error) throw this.error
                            ^
Error: Malformed comment
Line: 3
Column: 2
Char: -
    at error (/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:273:8)
    at strictFail (/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:290:22)
    at Object.write (/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:515:24)
    at /home/teknopaul/node_workspace/saxerror/test-sax.js:11:11
    at [object Object].<anonymous> (fs.js:107:5)
    at [object Object].emit (events.js:62:17)
    at afterRead (fs.js:970:12)
    at wrapper (fs.js:245:17)

this was code used to test

var fs = require('fs');
var sax = require("sax"),
  strict = true, 
  parser = sax.parser(strict);


    fs.readFile('test.xml', function (err, data) {
        if (err) {
            console.log("error: " + err);
        } else {
            parser.write(data.toString('utf8')).close();
        }
    });

This is not a biggie for me, but it might bite someone else, so I thought I'd report it.

Make it easier to pump data through

Add a "fallback" event that gets fired for any event that doesn't have a specific handler.
Pass the raw string data to event handlers, along with any other data.

This will add a bit of weirdness around the "onattribute" event. It probably shouldn't fire the fallback, or else you'll end up pumping data through inappropriately.

handle encoding="..." of xml-processing instruction

Hi,

if I understand the code correctly, the encoding of the first line of an XML document is not respected.
In case of non-utf8 encoded XML files, it could be a problem.

I tried to patch a bit, but it is a very stupid solution and actually just doesn't work: https://gist.github.com/1503453

IMO the problem is, that the parser is eating chunks and at this moment the buffer is already converted to a String.
Changing the encoding would only take effect after new chunks arrive and only in the streaming parser.

I'm afraid I'm lacking deep knowledge of streaming, piping and the architecture of the parser.
But I could do some testing with strange German XML documents ;-)

--Heinrich

sax stream behavior when piped a read stream which emits no data

I'm just wondering what you think should happen in this scenario. Currently, it seems like the sax stream emits both 'error' and 'end' events when given a read stream which emits no data. I'm piping an http response to the sax stream and occasionally the response emits no data. When this happens, the sax event listeners end up calling my callback twice. I ended up just adding in a little check to see if 'error' had been emitted before calling my callback in 'end'.
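
The workaround described here can be generalized with a small guard that lets a callback run only once, whichever of 'error' or 'end' arrives first. A minimal sketch; `once` is a hypothetical helper, and a plain EventEmitter stands in for the sax stream.

```javascript
var EventEmitter = require("events").EventEmitter;

// Wraps fn so that only the first call goes through.
function once(fn) {
  var called = false;
  return function () {
    if (called) return;
    called = true;
    return fn.apply(this, arguments);
  };
}

// Stand-in for a sax stream that emits both 'error' and 'end'.
var stream = new EventEmitter();
var calls = [];
var done = once(function (er) {
  calls.push(er ? "error" : "ok");
});

stream.on("error", done);
stream.on("end", function () { done(null); });

stream.emit("error", new Error("Unexpected end"));
stream.emit("end");

console.log(calls); // prints: [ 'error' ]
```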

A `<` in a script tag brings everything to a grinding halt.

If the parser encounters a '<' in embedded JavaScript it will come to its knees more often than not.

Of course JavaScript is not anywhere close to XML syntactically, so we should expect that. However, maybe the parser can change state when it is inside a script tag and ignore everything until it gets a </script>.

All caps tag names

When using the loose mode parser with trim set to true, the resulting tag names come back all capitalized. Using the lowercasetags: true option, the tag names correctly get turned to lowercase. Are the tags converted to uppercase by default, or is there something that I am setting that I am not aware of?

thanks

Text event called two times.

Hi,

The text event is called two times when sax parses a particular xml file and I cannot figure out why. Do you have any idea?
I published a gist that reproduced this with a piece of this particular xml file https://gist.github.com/1262059.

Thanks.

Remove the nextTick stuff

Not necessary. Just write() and close(). Assume that listeners are assigned before the writes happen, and if not, oh well.

End event is never called

When using the parser on an XML file, the "onend" callback is never called.

Plus this should really be updated to use the EventEmitter although that should be another issue.

Remove dependency on EventEmitter

Options:

Assign callback functions to onText, onOpenTag, onCloseTag, etc.
Implement a simple addListener in the class itself. (Like what's done when EventEmitter isn't available.)

Support <![CDATA[ ... ]]> to include random bits of crazy.

The XML parser should have a CDATA state to support embedding random stuff into the markup.

Character Entities

Parse character entities in attribute values and text nodes.

Support Buffers

If a buffer is written, then sax should handle it appropriately, and not by simply toString-ing everything, or piping through a StringDecoder. Ideally, the parser would always deal with buffer objects internally if a Buffer was supplied, and update the c and position values appropriately.

The biggest snags will involve doing this intelligently in areas where multibyte characters are allowed, which is pretty much everywhere except attribute names.

full tag "onclosetag"

I use your parser and it's great.
I have a problem: I need the full node info in the onclosetag event, not only the node name.

Right now I have a hack: https://github.com/AndrewSumin/Fest/blob/master/lib/sax-js.js#L513
But it is not very convenient to update your parser with this hack.

can't get cdata events to fire

Test case posted here: https://gist.github.com/1317251

I can't get opencdata, cdata, or closecdata events to fire. Am I doing something wrong?

Thanks,
Dave

Support all official html4 entities

http://www.w3.org/TR/html4/sgml/entities.html

unable to distinguish successive CDATA sections from 'chunked' CDATA

The oncdata callback doesn't receive the enclosing <![CDATA[ / ]]> tags, so when a long CDATA section is broken up into multiple calls to oncdata there is no way to tell if those calls correspond to separate CDATA sections. The following inputs could potentially result in identical calls to oncdata:

<![CDATA[......]]>

vs.

<![CDATA[...]]><![CDATA[...]]>

This could be fixed cleanly, with full backward compatibility, by introducing opencdata and endcdata events to the API.

global var 'state' declared

sax.js is setting a global var named 'state'. See https://github.com/isaacs/sax-js/blob/master/lib/sax.js#L582

jslint compliant

It would be nice if the code was tidied up so that there are no warnings flagged

Currently published npm version contains (extraneous) bundled dependencies...

When I npm install sax, npm ls looks like this:

<!DOCTYPE support

Chokes on <!DOCTYPE as an invalid comment.

Bug in usage example in the README

The last line of the usage example in the README (line 84) should use fs.createWriteStream() not fs.createReadStream().

Whitespace handling for text

It would be nice to make the parser ignore leading/trailing whitespace for text nodes.

http://en.wikipedia.org/wiki/Simple_API_for_XML

Has a good example. I think this should probably be an option when creating the parser.

Make the character classes not anglo-centric.

See: http://www.w3.org/TR/REC-xml/#CharClasses

Unclosed root tag

var sax = require("./lib/sax"),
    parser = sax.parser(true);

parser.onend = function () {
  console.log('end');
};

var xml = '<abc:template xmlns:abc="http://abc.example.com" context_name="bad">';

parser.write(xml).close(); // first, no error
parser.write('<root>' + xml + '</root>').close(); // second, unclosed tag error

So is there any chance of getting an error in the first example for unclosed <abc:template> tag without additional wrapping for my input (second example)?

npm link broken

the link in npm's registry for version 0.1.1 is https://registry.npmjs.org/sax/-/sax-0.1.1.tgz.

Attribute namespace declaration in the same parent element gives an error

Using namespace-aware parsing, <parent a:attr="value" xmlns:a="http://ATTRIBUTE" /> gives an Uncaught Error: Unbound namespace prefix: "a" error.

Expose state in the name of speed

Move the closures inside the constructor onto the prototype.

Default binding for the xml prefix

The xml prefix is supposed to be automatically bound to the URI, http://www.w3.org/XML/1998/namespace, according to the namespace spec.

saxStream.close(); gets error

Hi,
saxStream.close();

gets error:

TypeError: Object # has no method 'close'

SAXParser.prototype =
{ end: function () { end(this) }
, write: write
, resume: function () { this.error = null; return this }
, close: function () { return this.write(null) }
}

version property

It would be nice if the sax module exposed a version property so we could see what version we are running with which is useful for support

memory issue

Trying to debug. So not sure if this is a bug, or maybe ideas to fix. I have a basic app, and parser setup, loading a 300MB xml file.

Getting the following error I believe from inside the parser somewhere:

FATAL ERROR: JS Allocation failed - process out of memory

Just curious if you've tested anything large, before I dig into my code/parsing, although I'm not hitting even the first open tag before memory exhausts. I can post a working gist, but essentially this is what's going on :

var fs = require('fs');
var parser = sax.parser();

tnks

Subclass HTML parsing into separate module

Trying to use a single parser for both XML and HTML is Doing It Wrong. It'd be better to have an HTML parser that is a subclass of the SAX parser, since there are so many weird rules and special cases, it'll blow up the size of the module to support them all.

Having strict and loose modes is still good, though, because a lot of non-HTML xml is not well formed, and it would be good to parse it. But pull out stuff like the self-closing tags, etc.

Global variable declaration issue in IE7

This does not occur in IE9, I have not tested others...

Instantiating the parser in IE7 causes the following Javascript error to occur:

"TypeError: 'S' is undefined"

Moving the declaration of 'S' further up the script (above the SAXParser constructor) fixes the problem, but then chokes during a parse on the 'whitespace' global variable:

"TypeError: 'whitespace' is undefined"

Again, the workaround is to move 'whitespace' above the constructor. Finally, IE7 does not support the Object.create function:

"TypeError: Object doesn't support this property or method"

I added Crockford's Object.create implementation in my own code to support this, however I believe you have a declaration in sax.js that might work by moving up the script as well.

<p>hello<p>world

Should result in 2 siblings, not a parent and child. Support the nestable rules in the HTML doctype.