sax-js's Introduction

sax js

A sax-style parser for XML and HTML.

Designed with node in mind, but should work fine in the browser or other CommonJS implementations.

What This Is

  • A very simple tool to parse through an XML string.
  • A stepping stone to a streaming HTML parser.
  • A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML docs.

What This Is (probably) Not

  • An HTML Parser - That's a fine goal, but this isn't it. It's just XML.
  • A DOM Builder - You can use it to build an object model out of XML, but it doesn't do that out of the box.
  • XSLT - No DOM = no querying.
  • 100% Compliant with (some other SAX implementation) - Most SAX implementations are in Java and do a lot more than this does.
  • An XML Validator - It does a little validation when in strict mode, but not much.
  • A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic masochism.
  • A DTD-aware Thing - Fetching DTDs is a much bigger job.
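To illustrate the "DOM Builder" point above, here is a minimal sketch of building an object model from parser events. The helper name `buildTree` and the node shape are hypothetical, not part of sax-js; the sketch only relies on the opentag, text, and closetag handlers behaving as documented on this page.

```javascript
// Hypothetical helper: build a plain object tree from parser events.
// Works with any object that calls onopentag/ontext/onclosetag the way
// the parser is documented to; the node shape here is made up.
function buildTree (parser) {
  var root = { name: null, attributes: {}, children: [] }
  var stack = [root]

  parser.onopentag = function (tag) {
    var node = { name: tag.name, attributes: tag.attributes, children: [] }
    stack[stack.length - 1].children.push(node)
    stack.push(node)
  }
  parser.ontext = function (text) {
    stack[stack.length - 1].children.push(text)
  }
  parser.onclosetag = function () {
    // Never pop the synthetic root.
    if (stack.length > 1) stack.pop()
  }
  return root
}

// Usage, assuming the sax package is installed:
//   var parser = require("sax").parser(true)
//   var tree = buildTree(parser)
//   parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close()
```

The stack mirrors the nesting of open elements, so each new node or text chunk is attached to whatever element is currently open.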

Regarding <!DOCTYPEs and <!ENTITYs

The parser will handle the basic XML entities in text nodes and attribute values: &amp; &lt; &gt; &apos; &quot;. It's possible to define additional entities in XML by putting them in the DTD. This parser doesn't do anything with that. If you want to listen to the ondoctype event, and then fetch the doctypes, and read the entities and add them to parser.ENTITIES, then be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass through unmolested.
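As a rough sketch of the last step described above (adding entities to parser.ENTITIES), the helper below copies a name-to-replacement map onto a parser. The helper name `registerEntities` and the sample entity are hypothetical; fetching and reading the DTD itself is not shown.

```javascript
// Hypothetical helper: copy custom entity definitions onto a parser.
// `parser` is assumed to expose an ENTITIES map, as described above.
function registerEntities (parser, entities) {
  Object.keys(entities).forEach(function (name) {
    parser.ENTITIES[name] = entities[name]
  })
  return parser
}

// Usage, assuming the sax package is installed:
//   var parser = require("sax").parser(true)
//   registerEntities(parser, { company: "Example Corp" })
//   parser.write("<doc>&company;</doc>").close()
```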

Usage

var sax = require("./lib/sax"),
  strict = true, // set to false for html-mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text.  t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag.  node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute.  attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage
// takes the same options as the parser
var saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node
  // event emitter.
  console.error("error!", e)
  // clear the error
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as above
})
// pipe is supported, and it's readable/writable
// same chunks coming in also go out.
var fs = require("fs")
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))

Arguments

Pass the following arguments to the parser function. All are optional.

strict - Boolean. Whether or not to be a jerk. Default: false.

opt - Object bag of settings regarding string formatting. All default to false.

Settings supported:

  • trim - Boolean. Whether or not to trim text and comment nodes.
  • normalize - Boolean. If true, then turn any whitespace into a single space.
  • lowercase - Boolean. If true, then lowercase tag names and attribute names in loose mode, rather than uppercasing them.
  • xmlns - Boolean. If true, then namespaces are supported.
  • position - Boolean. If false, then don't track line/col/position.
  • strictEntities - Boolean. If true, only parse predefined XML entities (&amp;, &apos;, &gt;, &lt;, and &quot;)
  • unquotedAttributeValues - Boolean. If true, then unquoted attribute values are allowed. Defaults to false when strict is true, true otherwise.
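To make the trim and normalize settings concrete, the sketch below applies the documented behavior to a sample string. The function `applyTextOptions` is a hypothetical illustration of the described effect, not the library's own code.

```javascript
// Sketch of the documented effect of `trim` and `normalize` on a text
// node. This is an illustration, not the library's implementation.
function applyTextOptions (opt, text) {
  if (opt.trim) text = text.trim()
  if (opt.normalize) text = text.replace(/\s+/g, " ")
  return text
}

var raw = "  Hello,\n   world  "
console.log(JSON.stringify(applyTextOptions({ trim: true, normalize: true }, raw)))
// prints "Hello, world"

// The settings themselves are passed as the second argument, e.g.
//   require("sax").parser(false, { trim: true, normalize: true })
```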

Methods

write - Write bytes onto the stream. You don't have to do this all at once. You can keep writing as much as you want.

close - Close the stream. Once closed, no more data may be written until it is done processing the buffer, which is signaled by the end event.

resume - To gracefully handle errors, assign a listener to the error event. Then, when the error is taken care of, you can call resume to continue parsing. Otherwise, the parser will not continue while in an error state.
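A minimal sketch of that error-handling pattern follows. The helper name `continueOnError` is hypothetical; it only uses the onerror handler, the error member, and the resume method described on this page.

```javascript
// Hypothetical helper: record each error, then let parsing continue.
function continueOnError (parser, errors) {
  parser.onerror = function (er) {
    errors.push(er)
    // Clear the error state, then resume, as described above.
    parser.error = null
    parser.resume()
  }
  return parser
}

// Usage, assuming the sax package is installed:
//   var errors = []
//   var parser = continueOnError(require("sax").parser(true), errors)
//   parser.write("<a><b></a>").close()
//   console.log(errors.length + " error(s) recorded")
```

Collecting errors in an array rather than throwing lets the caller decide afterwards whether the document was good enough to use.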

Members

At all times, the parser object will have the following members:

line, column, position - Indications of the position in the XML document where the parser currently is looking.

startTagPosition - Indicates the position where the current tag starts.

closed - Boolean indicating whether or not the parser can be written to. If it's true, then wait for the ready event to write again.

strict - Boolean indicating whether or not the parser is a jerk.

opt - Any options passed into the constructor.

tag - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

Events

All events emit with a single argument. To listen to an event, assign a function to on<eventname>. Functions get executed in the this-context of the parser object. The list of supported events is also in the exported EVENTS array.

When using the stream interface, assign handlers using the EventEmitter on function in the normal fashion.
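As an illustration of the on<eventname> convention, the sketch below attaches a logging handler for every name in a list of events. The helper name `traceEvents` is hypothetical; in practice the list would be the exported EVENTS array mentioned above. Note that it also assigns an error handler, so errors are logged rather than left unhandled.

```javascript
// Hypothetical helper: log every event a parser emits.
// `eventNames` would normally be the exported EVENTS array.
function traceEvents (parser, eventNames, log) {
  eventNames.forEach(function (name) {
    parser["on" + name] = function (arg) {
      log(name, arg)
    }
  })
  return parser
}

// Usage, assuming the sax package is installed:
//   var sax = require("sax")
//   var parser = traceEvents(sax.parser(true), sax.EVENTS, console.log)
//   parser.write("<a>hi</a>").close()
```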

error - Indication that something bad happened. The error will be hanging out on parser.error, and must be deleted before parsing can continue. By listening to this event, you can keep an eye on that kind of stuff. Note: this happens much more in strict mode. Argument: instance of Error.

text - Text node. Argument: string of text.

doctype - The <!DOCTYPE declaration. Argument: doctype string.

processinginstruction - Stuff like <?xml foo="blerg" ?>. Argument: object with name and body members. Attributes are not parsed, as processing instructions have implementation dependent semantics.

sgmldeclaration - Random SGML declarations. Stuff like <!ENTITY p> would trigger this kind of event. This is a weird thing to support, so it might go away at some point. SAX isn't intended to be used to parse SGML, after all.

opentagstart - Emitted immediately when the tag name is available, but before any attributes are encountered. Argument: object with a name field and an empty attributes set. Note that this is the same object that will later be emitted in the opentag event.

opentag - An opening tag. Argument: object with name and attributes. In non-strict mode, tag names are uppercased, unless the lowercase option is set. If the xmlns option is set, then it will contain namespace binding information on the ns member, and will have a local, prefix, and uri member.

closetag - A closing tag. In loose mode, tags are auto-closed if their parent closes. In strict mode, well-formedness is enforced. Note that self-closing tags will have closetag emitted immediately after opentag. Argument: tag name.

attribute - An attribute node. Argument: object with name and value. In non-strict mode, attribute names are uppercased, unless the lowercase option is set. If the xmlns option is set, it will also contain namespace information.

comment - A comment node. Argument: the string of the comment.

opencdata - The opening tag of a <![CDATA[ block.

cdata - The text of a <![CDATA[ block. Since <![CDATA[ blocks can get quite large, this event may fire multiple times for a single block, if it is broken up into multiple write()s. Argument: the string of random character data.

closecdata - The closing tag (]]>) of a <![CDATA[ block.
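Because cdata may fire several times per block, a consumer that wants one string per block has to join the pieces itself. The sketch below does that using the three CDATA events; the helper name `collectCdata` and its callback are hypothetical.

```javascript
// Hypothetical helper: deliver each CDATA block as one joined string.
function collectCdata (parser, onSection) {
  var chunks = null // non-null only while inside a CDATA block

  parser.onopencdata = function () {
    chunks = []
  }
  parser.oncdata = function (text) {
    if (chunks) chunks.push(text)
  }
  parser.onclosecdata = function () {
    if (chunks) onSection(chunks.join(""))
    chunks = null
  }
  return parser
}

// Usage, assuming the sax package is installed:
//   var parser = collectCdata(require("sax").parser(true), console.log)
//   parser.write("<a><![CDATA[one ").write("block]]></a>").close()
```

Keying on opencdata and closecdata also keeps two adjacent CDATA blocks separate, which the cdata event alone cannot do.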

opennamespace - If the xmlns option is set, then this event will signal the start of a new namespace binding.

closenamespace - If the xmlns option is set, then this event will signal the end of a namespace binding.

end - Indication that the closed stream has ended.

ready - Indication that the stream has reset, and is ready to be written to.

noscript - In non-strict mode, <script> tags trigger a "script" event, and their contents are not checked for special xml characters. If you pass noscript: true, then this behavior is suppressed.

Reporting Problems

It's best to write a failing test if you find an issue. I will always accept pull requests with failing tests if they demonstrate intended behavior, but it is very hard to figure out what issue you're describing without a test. Writing a test is also the best way for you yourself to figure out if you really understand the issue you think you have with sax-js.

sax-js's Issues

Unwanted closing tag will stop the parser

Hello,

trying this code :

parser.write('<div win="true"></div></span><p fail="true"></p>');

Note the </span> that comes from nowhere.

This results in

{ name: 'DIV', attributes: { win: 'true' } }

While

parser.write('<div win="true"></div><p fail="true"></p>');

Results in

{ name: 'DIV', attributes: { win: 'true' } }
{ name: 'P', attributes: { fail: 'true' } }

I know this case and my previous one (#31) are html-related and that sax-js is not currently aimed at parsing html, but I thought I would let you know.

I'm really interested in bringing html parsing to sax-js but I do not know where to start.

pretty-print.js I/O problem

On my Ubuntu system, running pretty-print on a large XML file (my test file is 700k) and sending the output to a file causes node to slow way down and consume 100% of the cpu. It gets really, really slow -- I've never had the patience to wait and see how long it would take. Possibly hours.

I'm not sure I need to post my XML file, since it is large. But I think you can replicate the problem using any large XML file by doing:
node pretty-print.js file.xml > file-pretty.xml

The problem seems to be with how pretty-print.js is handling writes to stdout.

I came up with a solution that makes it run as fast as it should. It doesn't appear to be optimal -- strace shows that it's still making at least twice as many calls to the write(2) system call as it needs to. But from the user's perspective, the problem is completely gone. So it's a big improvement.

diff --git a/examples/pretty-print.js b/examples/pretty-print.js
index 0a40ef0..73a5b7e 100644
--- a/examples/pretty-print.js
+++ b/examples/pretty-print.js
@@ -6,41 +6,61 @@ function entity (str) {
   return str.replace('"', '&quot;');
 }

+printer.inputPaused = false;
+printer.spool = '';
+printer.send = function(str) {
+  if(printer.inputPaused) {
+    printer.spool += str;
+  }
+  else if(!process.stdout.write(str)) {
+    printer.inputPaused = true;
+    inputStream.pause();
+  }
+};
+process.stdout.on('drain', function() {
+  printer.inputPaused = false;
+  if(printer.spool !== '') {
+    process.stdout.write(printer.spool);
+    printer.spool = '';
+  }
+  inputStream.resume();
+});
+
 printer.tabstop = 2;
 printer.level = 0;
 printer.indent = function () {
-  sys.print("\n");
+  printer.send("\n");
   for (var i = this.level; i > 0; i --) {
     for (var j = this.tabstop; j > 0; j --) {
-      sys.print(" ");
+      printer.send(" ");
     }
   }
 }
 printer.onopentag = function (tag) {
   this.indent();
   this.level ++;
-  sys.print("<"+tag.name);
+  printer.send("<"+tag.name);
   for (var i in tag.attributes) {
-    sys.print(" "+i+"=\""+entity(tag.attributes[i])+"\"");
+    printer.send(" "+i+"=\""+entity(tag.attributes[i])+"\"");
   }
-  sys.print(">");
+  printer.send(">");
 }
 printer.ontext = printer.ondoctype = function (text) {
   this.indent();
-  sys.print(text);
+  printer.send(text);
 }
 printer.onclosetag = function (tag) {
   this.level --;
   this.indent();
-  sys.print("</"+tag+">");
+  printer.send("</"+tag+">");
 }
 printer.oncdata = function (data) {
   this.indent();
-  sys.print("<![CDATA["+data+"]]>");
+  printer.send("<![CDATA["+data+"]]>");
 }
 printer.oncomment = function (comment) {
   this.indent();
-  sys.print("<!--"+comment+"-->");
+  printer.send("<!--"+comment+"-->");
 }
 printer.onerror = function (error) {
   sys.debug(error);
@@ -52,21 +72,6 @@ if (!process.argv[2]) {
     "TODO: read from stdin or take a file");
 }
 var xmlfile = require("path").join(process.cwd(), process.argv[2]);
-fs.open(xmlfile, "r", 0666, function (er, fd) {
-  if (er) throw er;
-  (function R () {
-    fs.read(fd, 1024, null, "utf8", function (er, data, bytesRead) {
-      if (er) throw er;
-      if (data) {
-        printer.write(data);
-        R();
-      } else {
-        fs.close(fd);
-        printer.close();
-      }
-    });
-  })();
-});

+var inputStream = fs.createReadStream(xmlfile, { encoding: 'utf8' });
+inputStream.on('data', function(data) { printer.write(data); });

streaming

is this lib still maintained at all? was planning on using it for streaming, but seems like it relies on state

Handle schema fragments in doctype declaration

Support stuff like this:

<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
<!ENTITY e SYSTEM "001.ent">
]>
<doc>&e;</doc>

Emit sgmldeclaration events for the ELEMENT and ENTITY declarations.

Should a <!ENTITY declaration add it to the list of sax.ENTITIES, perhaps? I'm certainly not going to fetch the 001.ent file and parse it, but it'd be nice to at least not emit an error in that case. Maybe it could be added to the ENTITIES hash without being decoded, so it'd still be &e;, and another higher level could add the full-fledged entity handling.

Is it possible to read contents of element?

With the following example, is it possible to read the contents of an element as a string?

<myxml attr="val">
   <something>
      <behave-like-cdata>
         <sample>Read as plain text without entities!</sample>
      </behave-like-cdata>
   </something>
</myxml>

When <behave-like-cdata> element is reached I would like to read its contents as a string "Read as plain text without entities!" and then skip to closing tag </behave-like-cdata>.

Is it possible to achieve this with your module?

Many thanks

Speed up reading large Texts

How about reading large chunks of text using regular expressions? Character-by-character seems to be a relatively slow approach.

wrong attribute value: lost '='

The following test is red:

'# attribute href with query params': function(){
  var input = '<a href="service.svc?root=foo&OrderBy=1&Asc=0&Page=1"> The link </a1>'
  var parser = sax.parser(false)
  parser.onopentag = function(tag){
    assert.eql(tag.name, 'A');
    assert.eql(tag.attributes.href, 'service.svc?root=&OrderBy=1&Asc=0&Page=1');
  }
  parser.write(input);
}

try/catch around require('stream') will not save you

You have a try/catch here but you use Stream.prototype.on here

Not a big deal, but I would humbly suggest either dropping the try/catch and all the checking here or adding some kind of on/emit logic to support browsers or other non-node environments.

Your documentation implies you want to support browsers and CommonJS environments so does that mean you want a pull request for a simple on/emit?

Adding namespace awareness

I'd like to make the parser namespace-aware. First, has anyone else done this already? My general approach was going to use the openTag(parser, selfClosing) function to figure out which namespaces have been declared in the existing node. I'm not sure how to calculate the universe of in-scope namespaces given XML's nesting and namespace redeclaration behavior. Any pointers would be much appreciated.
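One possible way to approach the nesting and redeclaration question raised above, offered as a sketch rather than a description of how sax-js does or should do it: give every open element a scope object whose prototype is its parent's scope, so lookups fall through to enclosing declarations and redeclarations shadow them. The `NamespaceStack` name and its methods are hypothetical, and the sketch assumes attribute names arrive with their original case (for example in strict mode).

```javascript
// Hypothetical sketch: track in-scope namespace bindings with a stack of
// scope objects linked through their prototypes.
function NamespaceStack () {
  var root = Object.create(null)
  // The xml prefix is bound by definition in Namespaces in XML.
  root.xml = "http://www.w3.org/XML/1998/namespace"
  this.scopes = [root]
}

// Call when an element opens, passing its attributes object.
NamespaceStack.prototype.open = function (attributes) {
  var scope = Object.create(this.scopes[this.scopes.length - 1])
  Object.keys(attributes).forEach(function (name) {
    if (name === "xmlns") {
      scope[""] = attributes[name] // default namespace
    } else if (name.indexOf("xmlns:") === 0) {
      scope[name.slice(6)] = attributes[name]
    }
  })
  this.scopes.push(scope)
  return scope
}

// Call when the element closes; its bindings go out of scope.
NamespaceStack.prototype.close = function () {
  if (this.scopes.length > 1) this.scopes.pop()
}

// Look up a prefix ("" or undefined means the default namespace).
NamespaceStack.prototype.resolve = function (prefix) {
  return this.scopes[this.scopes.length - 1][prefix || ""]
}
```

Calling open from the open-tag handler and close from the close-tag handler would keep the stack in step with the document.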

Parsing of ATOM feeds for pubsubhubbub throws error.

Using http://feeds.gawker.com/gawker/full as hub feed.

starting callback server on port 4443
callback: function (feed) {
        sys.log(sys.inspect(feed));
        var hubUri = feed.getLinksByRel('hub')[0];
        var callbackUri = url.parse(subscriber.createCallbackUri());
        subscriber.subscribe(topicUri, hubUri, callbackUri,
          function() {
            topicEvents.emit('subscribed', topicUri.href);
            subscriber.registerEventEmitter(topicUri, topicEvents);
          },
          function(error) {
            topicEvents.emit('error', error);
          });
      }
callback: function (feed) {
        sys.log(sys.inspect(feed));
        var hubUri = feed.getLinksByRel('hub')[0];
        var callbackUri = url.parse(subscriber.createCallbackUri());
        subscriber.subscribe(topicUri, hubUri, callbackUri,
          function() {
            topicEvents.emit('subscribed', topicUri.href);
            subscriber.registerEventEmitter(topicUri, topicEvents);
          },
          function(error) {
            topicEvents.emit('error', error);
          });
      }
15 Jun 17:27:17 - { links: [ {}, {} ]
, entries: []
, title: '\r\n\t\t\r\n\t\t\t'
, updated: ''
, id: ''
}
/Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:51
    hubUri = url.parse(hubUri.href);
                             ^
TypeError: Cannot read property 'href' of undefined
    at [object Object].subscribe (/Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:51:30)
    at /Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:141:20
    at Object.onend (/Users/karl/Sites/karl-git/node-pshb/lib/atom.js:157:3)
    at emit (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:141:32)
    at end (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:172:3)
    at Object.write (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:250:30)
    at Object.close (/Users/karl/Sites/karl-git/node-pshb/third_party/sax-js/lib/sax.js:67:37)
    at [object Object].parse (/Users/karl/Sites/karl-git/node-pshb/lib/atom.js:160:21)
    at Object.parse (/Users/karl/Sites/karl-git/node-pshb/lib/atom.js:81:17)
    at IncomingMessage.<anonymous> (/Users/karl/Sites/karl-git/node-pshb/lib/pshb-client.js:151:12)

Why won't this parse?

I've been beating my head against the wall trying to figure out why this won't parse.
It reports that it successfully parsed the "success" element but then dies.
What am I missing?

Here is the log and error:

parsing data: "<compileClassesResponse><result><bodyCrc>653724009</bodyCrc><column>-1</column><id>01pG0000002KoSUIA0</id><line>-1</line><name>CalendarController</name><success>true</success></result></compileClassesResponse>"
Sax - Open Element: compileclassesresponse (Attributes: {} )
Sax - Open Element: result (Attributes: {} )
Sax - Open Element: bodycrc (Attributes: {} )
Sax - Text: 653724009
Sax - Close Element: bodycrc
Sax - Open Element: column (Attributes: {} )
Sax - Text: -1
Sax - Close Element: column
Sax - Open Element: id (Attributes: {} )
Sax - Text: 01pG0000002KoSUIA0
Sax - Close Element: id
Sax - Open Element: line (Attributes: {} )
Sax - Text: -1
Sax - Close Element: line
Sax - Open Element: name (Attributes: {} )
Sax - Text: CalendarController
Sax - Close Element: name
Sax - Open Element: success (Attributes: {} )
Sax - Text: true
Sax - Close Element: success
Sax - Error: {"stack":"Error: Unexpected end\nLine: 0\nColumn: 209\nChar: \n at error (/node_modules/sax/lib/sax.js:167:8)\n at end (/node_modules/sax/lib/sax.js:173:32)\n at Object.write (/node_modules/sax/lib/sax.js:255:30)\n at Object.close (/node_modules/sax/lib/sax.js:72:37)\n at parseResults (/sfdc.js:99:13)\n at IncomingMessage. (/sfdc.js:224:19)\n at IncomingMessage.emit (events.js:81:20)\n at HTTPParser.onMessageComplete (http.js:133:23)\n at CleartextStream.ondata (http.js:1213:22)\n at CleartextStream._push (tls.js:291:27)","message":"Unexpected end\nLine: 0\nColumn: 209\nChar: "}
{ stack: [Getter/Setter],
arguments: undefined,
type: undefined,
message: 'Unexpected end\nLine: 0\nColumn: 209\nChar: ' }

Unquoted attribute values will fail

Hello, do you think in loose mode we should accept unquoted attribute values?

parser.write('<span class=test hello=world></span>');

This will emit onopentag with:

{ name: 'SPAN', attributes: {} }

While

parser.write('<span class="test" hello="world"></span>');

will result in:

{ name: 'SPAN',  attributes: { class: 'test', hello: 'world' } }

A single dash in a comment causes an error

The following XML does not parse but is valid.

<xml>
<!-- 
  commment with a single dash- in it
-->
<data/>
</xml>

causes

/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:364
  if (this.error) throw this.error
                            ^
Error: Malformed comment
Line: 3
Column: 2
Char: -
    at error (/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:273:8)
    at strictFail (/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:290:22)
    at Object.write (/home/teknopaul/node_workspace/node_modules/sax/lib/sax.js:515:24)
    at /home/teknopaul/node_workspace/saxerror/test-sax.js:11:11
    at [object Object].<anonymous> (fs.js:107:5)
    at [object Object].emit (events.js:62:17)
    at afterRead (fs.js:970:12)
    at wrapper (fs.js:245:17)

This was the code used to test:

var fs = require('fs');
var sax = require("sax"),
  strict = true, 
  parser = sax.parser(strict);


    fs.readFile('test.xml', function (err, data) {
        if (err) {
            console.log("error: " + err);
        } else {
            parser.write(data.toString('utf8')).close();
        }
    });

This is not a biggie for me, but it might bite someone else, so I thought I'd report it.

Make it easier to pump data through

  1. Add a "fallback" event that gets fired for any event that doesn't have a specific handler.
  2. Pass the raw string data to event handlers, along with any other data.

This will add a bit of weirdness around the "onattribute" event. It probably shouldn't fire the fallback, or else you'll end up pumping data through inappropriately.

handle encoding="..." of xml-processing instruction

Hi,

if I understand the code correctly, the encoding of the first line of an XML document is not respected.
In case of non-utf8 encoded XML files, it could be a problem.

I tried to patch a bit, but it is a very stupid solution and actually just doesn't work: https://gist.github.com/1503453

IMO the problem is, that the parser is eating chunks and at this moment the buffer is already converted to a String.
Changing the encoding would only take effect after new chunks arrive and only in the streaming parser.

I'm afraid I'm lacking deep knowledge of streaming, piping and the architecture of the parser.
But I could do some testing with strange German XML documents ;-)

--Heinrich

sax stream behavior when piped a read stream which emits no data

I'm just wondering what you think should happen in this scenario. Currently, it seems like the sax stream emits both 'error' and 'end' events when given a read stream which emits no data. I'm piping an http response to the sax stream and occasionally the response emits no data. When this happens, the sax event listeners end up calling my callback twice. I ended up just adding in a little check to see if 'error' had been emitted before calling my callback in 'end'.
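The check described above can be sketched as a small guard that invokes the callback only for whichever of the two events arrives first. The helper name `onFinished` is hypothetical, and a plain Node EventEmitter stands in for the sax stream so the snippet runs on its own.

```javascript
var EventEmitter = require("events").EventEmitter

// Hypothetical helper: call `callback` at most once, for whichever of
// "error" or "end" fires first on the given emitter.
function onFinished (emitter, callback) {
  var done = false
  function finish (er) {
    if (done) return
    done = true
    callback(er || null)
  }
  emitter.on("error", finish)
  emitter.on("end", function () { finish() })
  return emitter
}

// Stand-in for a stream that emits both events, as described above.
var fake = new EventEmitter()
var calls = []
onFinished(fake, function (er) { calls.push(er ? er.message : "ok") })
fake.emit("error", new Error("Unexpected end"))
fake.emit("end")
console.log(calls) // [ 'Unexpected end' ]
```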

A `<` in a script tag brings everything to a grinding halt.

If the parser encounters a '<' in embedded JavaScript it will come to its knees more often than not.

Of course JavaScript is not anywhere close to XML syntactically, so we should expect that. However, maybe the parser can change state when it is inside a script tag and ignore everything until it gets a </script>.

All caps tag names

When using the loose mode parser with trim set to true, the resulting tag names come back all capitalized. Using the lowercasetags: true option, the tag names correctly get turned to lowercase. Are the tags converted to uppercase by default, or is there something that I am setting that I am not aware of?

thanks

Remove the nextTick stuff

Not necessary. Just write() and close(). Assume that listeners are assigned before the writes happen, and if not, oh well.

End event is never called

When using the parser on an XML file, the "onend" callback is never called.

Plus this should really be updated to use the EventEmitter although that should be another issue.

Remove dependency on EventEmitter

Options:

  1. Assign callback functions to onText, onOpenTag, onCloseTag, etc.
  2. Implement a simple addListener in the class itself. (Like what's done when EventEmitter isn't available.)

Support Buffers

If a buffer is written, then sax should handle it appropriately, and not by simply toString-ing everything, or piping through a StringDecoder. Ideally, the parser would always deal with buffer objects internally if a Buffer was supplied, and update the c and position values appropriately.

The biggest snags will involve doing this intelligently in areas where multibyte characters are allowed, which is pretty much everywhere except attribute names.

unable to distinguish successive CDATA sections from 'chunked' CDATA

The oncdata callback doesn't receive the enclosing <![CDATA[ / ]]> tags, so when a long CDATA section is broken up into multiple calls to oncdata there is no way to tell if those calls correspond to separate CDATA sections. The following inputs could potentially result in identical calls to oncdata:

<![CDATA[......]]>

vs.

<![CDATA[...]]><![CDATA[...]]>

This could be fixed cleanly, with full backward compatibility, by introducing opencdata and endcdata events to the API.

jslint compliant

It would be nice if the code was tidied up so that there are no warnings flagged.

Unclosed root tag

var sax = require("./lib/sax"),
    parser = sax.parser(true);

parser.onend = function () {
  console.log('end');
};

var xml = '<abc:template xmlns:abc="http://abc.example.com" context_name="bad">';

parser.write(xml).close(); // first, no error
parser.write('<root>' + xml + '</root>').close(); // second, unclosed tag error

So is there any chance of getting an error in the first example for unclosed <abc:template> tag without additional wrapping for my input (second example)?

saxStream.close(); gets error

Hi,
saxStream.close();

gets error:

TypeError: Object # has no method 'close'

SAXParser.prototype =
{ end: function () { end(this) }
, write: write
, resume: function () { this.error = null; return this }
, close: function () { return this.write(null) }
}

version property

It would be nice if the sax module exposed a version property so we could see what version we are running with which is useful for support

memory issue

Trying to debug. So not sure if this is a bug, or maybe ideas to fix. I have a basic app, and parser setup, loading a 300MB xml file.

Getting the following error I believe from inside the parser somewhere:

FATAL ERROR: JS Allocation failed - process out of memory

Just curious if you've tested anything large, before I dig into my code/parsing, although I'm not hitting even the first open tag before memory exhausts. I can post a working gist, but essentially this is what's going on :

var fs = require('fs');
var sax = require('sax');
var parser = sax.parser();

var fileStream = fs.createReadStream(process.argv[2],
  {'bufferSize': 4 * 1024}
);
var xml = '';
fileStream.addListener("data", function(chunk){
  xml += chunk;
});
fileStream.addListener("close", function(){
  console.log("file read into xml");
  parser.write(xml).close();
});

tnks
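For comparison, here is a sketch of writing each chunk to the parser as it arrives, which the write method is documented to allow, rather than first concatenating the whole file into one string. The helper name `feedParser` is hypothetical, and whether this resolves the allocation failure reported above is an assumption, not something verified here.

```javascript
// Hypothetical helper: forward each chunk to the parser as it arrives,
// instead of building one large string first.
function feedParser (source, parser) {
  source.on("data", function (chunk) {
    parser.write(chunk)
  })
  source.on("end", function () {
    parser.close()
  })
  return parser
}

// Usage, assuming the sax package is installed:
//   var fs = require("fs")
//   var parser = require("sax").parser(true)
//   var input = fs.createReadStream(process.argv[2], { encoding: "utf8" })
//   feedParser(input, parser)
```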

Subclass HTML parsing into separate module

Trying to use a single parser for both XML and HTML is Doing It Wrong. It'd be better to have an HTML parser that is a subclass of the SAX parser, since there are so many weird rules and special cases, it'll blow up the size of the module to support them all.

Having strict and loose modes is still good, though, because a lot of non-HTML xml is not well formed, and it would be good to parse it. But pull out stuff like the self-closing tags, etc.

Global variable declaration issue in IE7

This does not occur in IE9, I have not tested others...

Instantiating the parser in IE7 causes the following Javascript error to occur:

"TypeError: 'S' is undefined"

Moving the declaration of 'S' further up the script (above the SAXParser constructor) fixes the problem, but then chokes during a parse on the 'whitespace' global variable:

"TypeError: 'whitespace' is undefined"

Again, the workaround is to move 'whitespace' above the constructor. Finally, IE7 does not support the Object.create function:

"TypeError: Object doesn't support this property or method"

I added Crockford's Object.create implementation in my own code to support this, however I believe you have a declaration in sax.js that might work by moving up the script as well.

Handle non-nestable tags.

In loose mode, this markup:

<p>hello<p>world

Should result in 2 siblings, not a parent and child. Support the nestable rules in the HTML doctype.
