fb55 / htmlparser2
The fast & forgiving HTML and XML parser
Home Page: https://feedic.com/htmlparser2
License: MIT License
Bower relies on tags and points at this repository. Can you please add tags for 3.x releases?
I am currently using "htmlparser.js" created by John Resig some time ago. I was hoping to switch to something actively maintained. However it seems there are a few features that are missing.
For example the parser I mentioned understands self-closing tags, and seems to scan for other elements that haven't been closed.
So, for example, calling HTMLtoXML('<p>Hello<p>World')
will yield "<p>Hello</p><p>World</p>", while htmlparser2 (with the right handler) would give me "<p>Hello<p>World</p></p>".
Hi there,
I've been looking to pull in your changes but it seems like quite a bit has changed from 1.x to 2.x. Could you give a basic summary of what's changed? Offhand, I've noticed:
What else has changed?
Thanks!
Matt
tl;dr don't do this
Okay first off, let me just say that htmlparser2 is the bomb. I throw some pretty disgusting HTML at it and it chews through it like a boss.
Except for one problem. Sometimes I throw REALLY ugly HTML at it. Specifically, imagine someone took a reasonably valid html page and threw an opening comment ("<!--") somewhere around 90% down the source, and then never closed it, commenting out otherwise valid HTML content. Sure, the page wouldn't render properly (mainly the footer content just gets lost), but browsers still render most of the page.
htmlparser2 just breaks. Specifically, it breaks on line 39: https://github.com/FB55/node-htmlparser/blob/master/lib/DomHandler.js#L39
The parser should just call it a day and wrap up the DOM tree up to that point instead of exploding. Otherwise my whole server crashes with an exception and I don't get ANY parsed data back. (Yes, I know I could work around the crashing part.)
So, please, could we not throw an exception there and instead die gracefully, albeit prematurely? (Y'know, like browsers do?)
Thanks!
<p>foo</p>
<hr>
<p>bar</p>
correctly produces:
[ { type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'foo', type: 'text' } ] },
{ data: '\n', type: 'text' },
{ type: 'tag',
name: 'hr',
attribs: {},
children: [] },
{ data: '\n', type: 'text' },
{ type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'bar', type: 'text' } ] } ]
while
<p>foo</p>
<HR>
<p>bar</p>
incorrectly results in:
[ { type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'foo', type: 'text' } ] },
{ data: '\n', type: 'text' },
{ type: 'tag',
name: 'HR',
attribs: {},
children:
[ { data: '\n', type: 'text' },
{ type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'bar', type: 'text' } ] } ] } ]
Since HTML tag names are entirely case insensitive, I think it would be better to lower-case them so that <hr> and <HR> both result in identical parsed output?
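A minimal sketch of the proposed normalization (a toy helper, not htmlparser2's actual code): fold tag-name case at the handler boundary in HTML mode only, since XML names are case-sensitive.

```javascript
// Toy sketch of the proposed fix: lower-case tag names in HTML mode
// only, because XML tag names are case-sensitive.
function normalizeTagName(name, xmlMode) {
  return xmlMode ? name : name.toLowerCase();
}

console.log(normalizeTagName("HR", false));     // "hr"
console.log(normalizeTagName("svgPath", true)); // "svgPath"
```

For what it's worth, later htmlparser2 releases expose a lowerCaseTags parser option that covers this case.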
new Parser({ontext:console.log}).parseComplete(">a>") // -> "a>"
I realize this may not be a goal of the project, but the file size is probably too big for the browser. Maybe there's a large require somewhere that could be swapped out easily.
As described here, <p><a>a<p>b is currently handled as <p><a>a<p>b</p></a></p>.
I write a lot of filters for HTML, so htmlparser2 is great for parsing it. I'd like to do simple passthrough without being opinionated about anything I'm not specifically interested in changing. However this is difficult because I can't tell whether the original markup was using the self-closing style or not based on the arguments to onclosetag. This unnecessarily complicates the write side of my filter and requires my code to be aware of which tags are safe and traditional to self-close and which are not.
One solution would be a second argument to onclosetag indicating whether the self-closing style was used.
Thanks!
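The proposal above could look like this; the two-argument callback signature is hypothetical, sketched with a plain function rather than the real parser:

```javascript
// Hypothetical callback shape: a second argument tells a filter whether
// the source used the self-closing style, so it can round-trip markup
// verbatim instead of guessing which tags are safe to self-close.
function makeCloseHandler(out) {
  return function onclosetag(name, wasSelfClosing) {
    out.push(wasSelfClosing ? "<" + name + "/>" : "</" + name + ">");
  };
}

var out = [];
var onclosetag = makeCloseHandler(out);
onclosetag("br", true);  // source was "<br/>"
onclosetag("p", false);  // source was "</p>"
console.log(out.join("")); // "<br/></p>"
```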
Just a note for anyone with a lot of time:
<div id=">"> foo </div>
Expected behavior: return a div with an id of >, followed by the string foo.
Result: a div with an id of ", followed by the string "> foo.
The problem is that there isn't a possibility for a look-ahead. Besides, the markup is clearly broken. The current result should be acceptable in most cases.
Edit: Apparently, that bug is well-known.
The parser looks for a tag even inside a textarea, where there may be source code.
Hey,
I can run npm install [email protected], but running just npm install htmlparser2 seems to hang. Can you verify that this is working on your box?
Thanks,
Matt
Please replace the non-breaking space with a normal space:
https://github.com/fb55/htmlparser2/blob/master/lib/CollectingHandler.js#L4
It breaks htmlparser2 when it is browserified and run in Chrome.
The <script> tag name is not matched case-insensitively. A <SCRIPT> tag will see anything starting with a < as a new tag.
Example: https://gist.github.com/3899198
I found this when crawling http://www.chicagotribune.com/news/local/suburbs/orland_park_homer_glen/community/chi-ugc-article-palos-medical-groups-dr-kanesha-bryant-prov-2-2013-05-15,0,3152151.story
They have a line containing
var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"
that seems to break things.
Here is a simple testcase:
var Parser = require('htmlparser2').Parser;
var stream = new Parser({
onopentag: function (tagname, attr) {
console.log('open: ' + tagname);
},
ontext: function (text) {
console.log('text: ' + text);
},
onclosetag: function (tagname) {
console.log('close: ' + tagname);
}
});
stream.write('<body>');
stream.write('<script type="text/javascript" language="JavaScript">');
stream.write('var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"');
stream.write('document.write(str);');
stream.write('</script>');
stream.write('</head>');
stream.end();
The output is:
open: body
open: script
text: var str = "
text: <script src='about:blank' type='text/javascript'>
close: script
text: "
text: document.write(str);
close: body
but it should be:
open: body
open: script
text: var str = "<script src='about:blank' type='text/javascript'></"+"script>"
text: document.write(str);
close: script
close: body
It would be cool if DomHandler provided a method to translate the dom structure into a string (the reverse operation of parse).
var htmlparser = require('htmlparser2');
var handler = new htmlparser.DefaultHandler();
var parser = new htmlparser.Parser(handler);
var html = 'some html here';
parser.parseComplete(html);
// should be true
console.log(handler.getHtml() == html);
It can be done with the following code for now:
html = htmlparser.DomUtils.getInnerHTML({ children: handler.dom })
// or
html = handler.dom.map(htmlparser.DomUtils.getOuterHTML, htmlparser.DomUtils).join('')
But to find this solution you need to go deep into the source code.
So a translate method would be useful to turn the dom back into html.
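A minimal sketch of what such a translate method might do (toy code; unlike DomUtils.getOuterHTML, it ignores attributes, comments, and void elements):

```javascript
// Toy renderer: walk a DomHandler-style tree and emit markup.
// Attributes, comments, and void elements are omitted for brevity.
function renderNode(node) {
  if (node.type === "text") return node.data;
  var inner = (node.children || []).map(renderNode).join("");
  return "<" + node.name + ">" + inner + "</" + node.name + ">";
}

function renderDom(dom) {
  return dom.map(renderNode).join("");
}

console.log(renderDom([
  { type: "tag", name: "p", children: [{ data: "foo", type: "text" }] }
])); // "<p>foo</p>"
```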
<![CDATA[
This should be CDATA...
]]>
results in
[ { data: '[CDATA[\nThis should be CDATA...\n]]',
type: 'comment' } ]
when it should result in
[ { data: '\nThis should be CDATA...\n',
type: 'cdata' } ]
using version 3.1.5
The current documents were taken from third parties and should be replaced with files with clear licensing terms.
First, the name _contains
is misleading, as it does not check whether the tag contains the value; rather, it checks for an exact match. So if you try something like this:
var domUtils = require("htmlparser2").DomUtils;
domUtils.getElements({ tag_contains: "cookie" }, dom, true);
you won't get anything if the tag you wanted contains "cookies".
Second, when you do enter the exact text, you get the data node rather than the node that the data node belongs to. For example:
domUtils.getElements({ tag_contains: "cookies" }, dom, true);
Would return
[ { data: 'cookies', type: 'text' } ]
instead of:
[ { type: 'tag', name: 'p', children: [ { data: 'cookies', type: 'text' } ] } ]
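The behavior the reporter expected can be sketched with a plain traversal (a hypothetical helper, not the DomUtils API): match text nodes exactly, then return the owning element instead of the text node.

```javascript
// Walk a DomHandler-style tree; collect the parent element of every
// text node whose data matches the value exactly.
function findOwnersByText(dom, value) {
  var out = [];
  (function walk(nodes, parent) {
    nodes.forEach(function (node) {
      if (node.type === "text" && node.data === value && parent) {
        out.push(parent);
      }
      if (node.children) walk(node.children, node);
    });
  })(dom, null);
  return out;
}

var dom = [
  { type: "tag", name: "p", children: [{ data: "cookies", type: "text" }] }
];
console.log(findOwnersByText(dom, "cookies")[0].name); // "p"
```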
var htmlparser = require('htmlparser2');
var parser = new htmlparser.Parser({
onopentag: function(name, attribs) {
console.log('open tag: ' + name);
},
ontext: function(text) {
console.log('text: ' + text);
},
onclosetag: function(name) {
console.log('close tag: ' + name);
}
});
parser.write("Xyz <script type='text/javascript'>var foo = '</script><<bar>>';< / script>");
parser.end();
results in:
text: Xyz
open tag: script
text: var foo = '
close tag: script
open tag: <bar
text: >';
close tag: <bar
When htmlparser2 is used in applications, and variables with _ in their names are used, jslint throws errors like Unexpected dangling '_' in '_attribname'. Can the variable names be changed?
First of all, you did remarkable work on this project. We use it in production on our servers at my company, Fasterize.
Actually, we used a fork of your parser (v2.3.1) (https://github.com/fasterize/node-htmlparser/tree/position). In this fork, we added a patch to know the start index and end index of the current tag. It's very useful in our software to replace a tag.
I would like to upgrade to the latest version of htmlparser2, but for that I need to rebase this patch. Can you let me know the best way to get those indexes? Is it in the tokenizer?
Are you interested in putting this patch into master? I would like to avoid maintaining a fork just for that.
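The index-tracking idea can be sketched in isolation (a toy scanner, not htmlparser2 internals; current htmlparser2 versions do expose start/end indices on the parser, though the exact API should be checked against the docs):

```javascript
// Toy scanner: record the start and end offset of every tag so a
// caller can splice replacements into the original source string.
function tagSpans(src) {
  var spans = [];
  var re = /<[^>]*>/g;
  var m;
  while ((m = re.exec(src)) !== null) {
    spans.push({ tag: m[0], start: m.index, end: m.index + m[0].length });
  }
  return spans;
}

console.log(tagSpans("<p>hi</p>"));
// [ { tag: '<p>', start: 0, end: 3 }, { tag: '</p>', start: 5, end: 9 } ]
```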
Currently you can make <script> and <code> tags completely ignore their contents. The ability to do this with any tag would be really nice. Looking through the tokenizer, this doesn't look possible without a lot of modification to the tokenizer every time you want to add a new tag to ignore.
I managed to create a hack-ish way of doing this with a parser using parser._token._index and _sectionStart, but I just know I'll run into issues with auto-closing tags at some point. I may try to implement this myself, but it would probably be better if someone willing, with in-depth knowledge of the tokenizer, could attempt something like this.
the following code
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
ontext: function(text){
console.log(text);
},
});
parser.write('<script>var x = "<!--123-->";</script>');
parser.done();
outputs
var x = "
";
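The expected rule can be sketched in isolation: inside <script>, only a matching "</script" should end the raw-text run, so comment markers in a string literal stay in the text (a toy scanner, not the real tokenizer):

```javascript
// Toy raw-text scan: everything between <script> and the next
// "</script" is text; comment markers inside it are left alone.
function scriptText(src) {
  var open = src.indexOf("<script>") + "<script>".length;
  var close = src.toLowerCase().indexOf("</script", open);
  return src.slice(open, close);
}

console.log(scriptText('<script>var x = "<!--123-->";</script>'));
// var x = "<!--123-->";
```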
Seems like there's some extra stuff that ended up in the registry. Not a big problem at all, but it'd be nice to either ignore this directory or remove it from the registry in the next version.
Hi there,
This is an issue that came up in: cheeriojs/cheerio#13
Here's the problematic syntax:
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
Basically the parser messes up on the <!--[if lt IE7]>...
So we get:
{
raw: '[if lt IE 7]>',
data: '[if lt IE 7]>',
type: 'comment',
...
}
It doesn't actually break, which is good, but it's an issue on every page that includes HTML5 Boilerplate.
Let me know if you need some additional info with this issue.
Thanks!
Matt
Since 3.2.4, xhtml self-closing tags are doing weird things (unless they happen to be html "void" elements). Sometimes they will engulf the next tag that follows them, and sometimes they will remove the tag that follows them altogether.
With the example in the readme and htmlparser2 v3.0.5 from npm, I get this output instead:
--> Xyz
JS! Hooray!
--> var foo = '
--> <<bar>>';
--> < / script>
That's it?!
Certain tags, such as link and meta, are treated like HTML tags when the parser is set to xmlMode, and they shouldn't be. For example:
var htmlparser2 = require("./lib/index"),
DomUtils = htmlparser2.DomUtils;
var handler = new htmlparser2.DomHandler({xmlMode: true}),
parser = new htmlparser2.Parser(handler);
parser.parseComplete('<link>foo</link>');
console.log(handler.dom);
console.log(DomUtils.getOuterHTML(handler.dom[0]));
console.log(DomUtils.getInnerHTML(handler.dom[0]));
returns:
[ { type: 'tag', name: 'link' }, { data: 'foo', type: 'text' } ]
<link></link>
rather than what I expected:
[ { type: 'tag', name: 'link', children: [ { data: 'foo', type: 'text' } ] } ]
<link>foo</link>
foo
Hi, I recently began to evaluate your htmlparser as an alternative for my project, and I need to access the "parent" element from a current element.
In your DomUtils file you use "elem.parent" within your "removeElement" function, but I cannot find anything else for a parent element. Is it leftover pollution, or is it planned to be integrated?
Greets and thanks in advance for your feedback,
Chris
In an attempt to optimize one of my modules, I discovered Tokenizer.write isn't optimized by v8.
Using the --trace_deopt flag, I can see that v8 optimizes it, but due to bailouts #488, #13, #166, #396, #388, #287, #348, #154, #14, #280 it gets deoptimized until v8 gives up.
I will spend some time on this, but I just wanted to know if you have seen this before?
To debug it yourself, check out AndreasMadsen/article@fafe3b4 and run node --expose-gc --trace_deopt tools/benchmark.js
I am trying to package this library into one single distributable file that exposes the htmlparser name globally under the window object.
It seems that browserify doesn't cater to that problem but requires the main program to use the require() construct, too. I don't believe that this is an option for me right now.
The best solution I've found so far is
$ browserify -r ./lib/index > htmlparser.js
and then in the browser do
> htmlparser = require('./lib/index');
The next best thing I can think of is wrapping each file in a function (module style), "appending" its exports to an overall module variable, which in turn sits inside a closure that is then returned and assigned to the global variable htmlparser.
Am I missing the point of browserify and the like?
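The wrapper being described can be sketched as a tiny function (browserify's --standalone flag generates a full UMD header that does essentially this; flag name per browserify's docs):

```javascript
// Mimic a UMD-style header: attach the factory's exports to a global
// object under a single name.
function wrapStandalone(globalObj, name, factory) {
  globalObj[name] = factory();
  return globalObj[name];
}

var fakeWindow = {}; // stand-in for the browser's window object
wrapStandalone(fakeWindow, "htmlparser", function () {
  // In a real bundle this would return the library's exports.
  return { parse: function (s) { return s; } };
});
console.log(typeof fakeWindow.htmlparser.parse); // "function"
```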
Ran into a problematic website with mismatched open and close tags.
It appears the following code in Parser.prototype.onclosetag tries to make better sense out of mismatched close tags by popping the open tags off the stack until reaching the tag being closed:
var pos = this._stack.lastIndexOf(name);
if(pos !== -1){
if(this._cbs.onclosetag){
pos = this._stack.length - pos;
while(pos--) this._cbs.onclosetag(this._stack.pop());
}
else this._stack.length = pos;
...
Sending the re-ordered tags to the browser then caused the rendering to look awful.
The question would be: is it reasonable to just attempt to reconstruct the original flawed order by pulling out the tag?
var pos = this._stack.lastIndexOf(name);
if(pos !== -1){
this._stack.splice(pos,1);
if (this._cbs.onclosetag) {
this._cbs.onclosetag(name);
  }
}
With this modification, Firefox, Chrome, and IE all now rendered this "flawed" html properly. Being unaware of the potential impact on other code and scenarios that might expect the current tag-closing behavior, I put this up for discussion.
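The proposed change can be exercised in isolation (toy stack, hypothetical helper name):

```javascript
// Close only the named tag: splice it out of the open-tag stack and
// emit a single onclosetag, leaving outer tags open.
function closeTag(stack, name, emitted) {
  var pos = stack.lastIndexOf(name);
  if (pos !== -1) {
    stack.splice(pos, 1);
    emitted.push(name);
  }
  return stack;
}

var emitted = [];
var stack = ["div", "b", "i"];
closeTag(stack, "b", emitted);
console.log(stack);   // [ 'div', 'i' ]
console.log(emitted); // [ 'b' ]
```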
As the caller, I assumed that the tag names passed to the callbacks would be lowercased, since the work to lowercase them was already done within the parser. Otherwise, the caller is forced to lowercase tags again.
Hello,
I've got a snippet of code that process a feed and more or less looks like:
var htmlparser = require('htmlparser2');
var request = require('request');
var handler = new htmlparser.FeedHandler( function( error, feed ){
console.log( feed.items.length );
});
var parser = new htmlparser.Parser( handler, { xmlMode: true } );
request('https://news.ycombinator.com/rss', function (error, response, body) {
if (!error && response.statusCode == 200) {
parser.parseComplete( body.toString() );
}
})
The only problem is feed.items.length, which should be 30, but it returns just 2. I tried to investigate a little bit more, and I realised that if I edit the description tag of the original feed to contain non-CDATA content, it does work and I can get all 30 items (I'm loading the file locally). This is the content of the item tag:
<item>
<title>I gave away my xbox 360 today</title>
<link>https://plus.google.com/u/0/105363132599081141035/posts/W3ys5fKnz5t</link>
<comments>https://news.ycombinator.com/item?id=5506571</comments>
<description><![CDATA[<a href="https://news.ycombinator.com/item?id=5506571">Comments</a>]]></description>
</item>
Not sure about what's going on, but looks like this is related to opened/closed CDATA sections.
Thanks
Daniele
When this is parsed, doc and param are stored as siblings, and doc is no child of param:
<param><doc>doc</doc></param>
But when param is renamed to parameter, doc is a child of parameter:
<parameter><doc>doc</doc></parameter>
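A plausible explanation (my assumption, not confirmed in the thread): param is on HTML's void-element list, which the parser special-cases, while the unknown name parameter is not. Sketch:

```javascript
// param is an HTML void element (it can never have children), so an
// HTML parser closes it immediately; "parameter" is an unknown tag and
// nests normally. (Subset of the spec's void-element list.)
var voidElements = new Set(["br", "hr", "img", "input", "link", "meta", "param"]);

console.log(voidElements.has("param"));     // true
console.log(voidElements.has("parameter")); // false
```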
There are some benefits in being able to run the parser in a browser even though browsers do html parsing themselves.
Examples of this might be parsing user input, or loading and parsing html via XHR without incurring the cost of loading images and executing JavaScript, etc. I have been able to tweak the source so it runs in the browser again, but it would be nice if this were supported from the get-go.
For XML mainly.
For my use case I needed support for a hybrid parser that would allow for XML constructs to be recognized in HTML code. After looking at the source code I felt that adding more options could provide more fine-grained control over parsing without adding too much complexity to the code. Please review my Pull Request which added two new options:
recognizeSelfClosing: if true, then self-closing tags will result in the tag being closed even if xmlMode is not set to true
recognizeCDATA: if true, then CDATA text will result in the ontext event being fired even if xmlMode is not set to true
In theory, xmlMode could be used as a way to control the more fine-grained options, but I wanted to minimize code change.
When the attributes are written without whitespace between them, only the first attribute is found.
Example:
<div class="first attribute"title="second attribute"></div>
Returns:
{
class: "first attribute"
}
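The expected behavior can be illustrated with a simple scan (a regex toy, not the real tokenizer): both attributes should be recovered even with no whitespace after the closing quote.

```javascript
// Toy attribute scan: name="value" pairs, tolerating a missing space
// between a closing quote and the next attribute name.
var attrRe = /([^\s"'=\/>]+)="([^"]*)"/g;
var src = 'class="first attribute"title="second attribute"';
var attribs = {};
var m;
while ((m = attrRe.exec(src)) !== null) {
  attribs[m[1]] = m[2];
}
console.log(attribs);
// { class: 'first attribute', title: 'second attribute' }
```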
Hi, it's me again. I just wonder if you plan to get closer to the HTML specs, like using "attributes" instead of "attribs", "nodeValue" instead of "value", or "parentNode" instead of "parent", etc., as object properties. If you are interested, I have already made several working changes to my local code and would like to contribute them to your project.
Greets,
Chris
The method "onattribvalue" is referenced in "Tokenizer.js" but not implemented on "Parser.js" as expected.
This causes the following error:
TypeError: Object #<Parser> has no method 'onattribvalue'
    at Tokenizer._handleTrailingData (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Tokenizer.js:827:13)
    at Tokenizer.end (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Tokenizer.js:807:8)
    at Parser.end (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Parser.js:297:18)
Parsing <br />
results in:
{
type: 'tag',
name: 'br',
attribs: { '/': '/' },
children: []
}
It seems you are solving a problem I have. So far I went with John Resig's abandoned HTML parser, which is the only close result on Google when searching for "JavaScript htmlparser". By chance I searched for "node htmlparser" instead, found the predecessor of this project, and by chance again looked at the network graph to see that yours is the only line with constant recent commits.
Do make this project more widely known. As a suggestion, maybe add "JavaScript" to your description (maybe "JS" isn't really doing it).
Referencing #75, as some basic svg shapes were already added as inline, but the list is rather incomplete. Inline svg is probably going to be a lot more common with increasing browser support. At least polyline and polygon are missing, but there might be more. As @fb55 mentioned, there are several other issues with changing the inline list; perhaps we can discuss this issue here and come to a more complete fix.
I'm not sure what your plans are for the project, but I think going the other way (i.e. DOM object --> HTML) is really useful.
This is the renderer I've been using, I could help integrate it into the project if you'd like:
https://github.com/MatthewMueller/cheerio/blob/master/src/renderer.coffee
When parsing this file https://github.com/AndreasMadsen/article/blob/master/test/reallife/source/09198e90b6a14acfef0d4044606b8fd5801648f98763bf967f181aabaf59804d.html#L920-923 I don't get the highlighted <img ... > after 263775f.
However, as you will see here, the <img> (big Obama picture) does render in Chrome at least.
So I would ask you to support \r inside tags anyway.
Most information regarding the problem is available here: cheeriojs/cheerio#131 (comment)
XML to reproduce issue is available here: https://gist.github.com/4248909
As per this comment, you can see the parsed tree is butchered somehow pretty bad: cheeriojs/cheerio#131 (comment)
Removing the CDATA with '<root>' + page.substr(55).replace(/<!\[CDATA\[([^\]]+)]\]>/ig, "$1") + '</root>' results in a working parse tree, as shown here: cheeriojs/cheerio#131 (comment)
Not an ideal solution though.