fb55 / htmlparser2
The fast & forgiving HTML and XML parser
Home Page: https://feedic.com/htmlparser2
License: MIT License
Bower relies on tags and points at this repository. Can you please add tags for 3.x releases?
I am currently using "htmlparser.js" created by John Resig some time ago. I was hoping to switch to something actively maintained. However it seems there are a few features that are missing.
For example the parser I mentioned understands self-closing tags, and seems to scan for other elements that haven't been closed.
So, for example, calling HTMLtoXML('<p>Hello<p>World')
will yield "<p>Hello</p><p>World</p>", while htmlparser2 (with the right handler) would give me "<p>Hello<p>World</p></p>".
Hi there,
I've been looking to pull in your changes but it seems like quite a bit has changed from 1.x to 2.x. Could you give a basic summary of what's changed? Offhand, I've noticed:
What else has changed?
Thanks!
Matt
tl;dr don't do this
Okay first off, let me just say that htmlparser2 is the bomb. I throw some pretty disgusting HTML at it and it chews through it like a boss.
Except for one problem. Sometimes I throw REALLY ugly HTML at it. Specifically, imagine someone took a reasonably valid html page and threw an opening comment ("<!--") somewhere around 90% down the source, and then never closed it, commenting out otherwise valid HTML content. Sure, the page wouldn't render properly (mainly the footer content just gets lost), but browsers still render most of the page.
htmlparser2 just breaks. Specifically, it breaks on line 39: https://github.com/FB55/node-htmlparser/blob/master/lib/DomHandler.js#L39
The parser should just call it a day and wrap up the DOM tree up to that point instead of exploding. Otherwise my whole server crashes with an exception and I don't get ANY parsed data back. (Yes, I know I could work around the crashing part.)
So, please, could we not throw an exception there and instead die gracefully, albeit prematurely? (Y'know, like browsers do?)
Thanks!
<p>foo</p>
<hr>
<p>bar</p>
correctly produces:
[ { type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'foo', type: 'text' } ] },
{ data: '\n', type: 'text' },
{ type: 'tag',
name: 'hr',
attribs: {},
children: [] },
{ data: '\n', type: 'text' },
{ type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'bar', type: 'text' } ] } ]
while
<p>foo</p>
<HR>
<p>bar</p>
incorrectly results in:
[ { type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'foo', type: 'text' } ] },
{ data: '\n', type: 'text' },
{ type: 'tag',
name: 'HR',
attribs: {},
children:
[ { data: '\n', type: 'text' },
{ type: 'tag',
name: 'p',
attribs: {},
children: [ { data: 'bar', type: 'text' } ] } ] } ]
Since HTML tag names are entirely case insensitive, I think it would be better to lower-case them so that <hr> and <HR> both result in identical parsed output?
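A minimal sketch of the proposed normalization (a toy helper, not htmlparser2's actual code): fold tag-name case at the handler boundary in HTML mode only, since XML names are case-sensitive.

```javascript
// Toy sketch of the proposed fix: lower-case tag names in HTML mode
// only, because XML tag names are case-sensitive.
function normalizeTagName(name, xmlMode) {
  return xmlMode ? name : name.toLowerCase();
}

console.log(normalizeTagName("HR", false));     // "hr"
console.log(normalizeTagName("svgPath", true)); // "svgPath"
```

For what it's worth, later htmlparser2 releases expose a lowerCaseTags parser option that covers this case.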
new Parser({ontext:console.log}).parseComplete(">a>") // -> "a>"
I realize this may not be a goal of the project, but the file size is probably too big for the browser. Maybe there's a large require somewhere that could be swapped out easily.
As described here, <p><a>a<p>b is currently handled as <p><a>a<p>b</p></a></p>.
I write a lot of filters for HTML, so htmlparser2 is great for parsing it. I'd like to do simple passthrough without being opinionated about anything I'm not specifically interested in changing. However this is difficult because I can't tell whether the original markup was using the self-closing style or not based on the arguments to onclosetag. This unnecessarily complicates the write side of my filter and requires my code to be aware of which tags are safe and traditional to self-close and which are not.
One solution would be a second argument to onclosetag indicating whether the self-closing style was used.
Thanks!
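The proposal above could look like this; the two-argument callback signature is hypothetical, sketched with a plain function rather than the real parser:

```javascript
// Hypothetical callback shape: a second argument tells a filter whether
// the source used the self-closing style, so it can round-trip markup
// verbatim instead of guessing which tags are safe to self-close.
function makeCloseHandler(out) {
  return function onclosetag(name, wasSelfClosing) {
    out.push(wasSelfClosing ? "<" + name + "/>" : "</" + name + ">");
  };
}

var out = [];
var onclosetag = makeCloseHandler(out);
onclosetag("br", true);  // source was "<br/>"
onclosetag("p", false);  // source was "</p>"
console.log(out.join("")); // "<br/></p>"
```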
Just a note for anyone with a lot of time:
<div id=">"> foo </div>
Expected behavior: return a div with an id of >, followed by the string foo.
Result: a div with an id of ", followed by the string "> foo.
The problem is that there isn't a possibility for a look-ahead. Besides, the markup is clearly broken. The current result should be acceptable in most cases.
Edit: Apparently, that bug is well-known.
The parser looks for a tag even inside a textarea, where there may be source code.
Hey,
I can run npm install [email protected], but running just npm install htmlparser2 seems to hang. Can you verify that this is working on your box?
Thanks,
Matt
Please replace the non-breaking space with a normal space:
https://github.com/fb55/htmlparser2/blob/master/lib/CollectingHandler.js#L4
It breaks htmlparser2 when it is browserified and run in Chrome.
The <script> tag name is not matched case-insensitively. A <SCRIPT> tag will see anything starting with a < as a new tag.
Example: https://gist.github.com/3899198
I found this when crawling http://www.chicagotribune.com/news/local/suburbs/orland_park_homer_glen/community/chi-ugc-article-palos-medical-groups-dr-kanesha-bryant-prov-2-2013-05-15,0,3152151.story
They have a line containing
var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"
that seems to break things.
Here is a simple testcase:
var Parser = require('htmlparser2').Parser;
var stream = new Parser({
onopentag: function (tagname, attr) {
console.log('open: ' + tagname);
},
ontext: function (text) {
console.log('text: ' + text);
},
onclosetag: function (tagname) {
console.log('close: ' + tagname);
}
});
stream.write('<body>');
stream.write('<script type="text/javascript" language="JavaScript">');
stream.write('var str = "<script src=\'about:blank\' type=\'text/javascript\'></"+"script>"');
stream.write('document.write(str);');
stream.write('</script>');
stream.write('</head>');
stream.end();
The output is:
open: body
open: script
text: var str = "
text: <script src='about:blank' type='text/javascript'>
close: script
text: "
text: document.write(str);
close: body
but it should be:
open: body
open: script
text: var str = "<script src='about:blank' type='text/javascript'></"+"script>"
text: document.write(str);
close: script
close: body
It would be cool if DomHandler provided a method to translate the dom structure into a string (the reverse operation of parse).
var htmlparser = require('htmlparser2');
var handler = new htmlparser.DefaultHandler();
var parser = new htmlparser.Parser(handler);
var html = 'some html here';
parser.parseComplete(html);
// should be true
console.log(handler.getHtml() == html);
It can be done with the following code for now:
html = htmlparser.DomUtils.getInnerHTML({ children: handler.dom })
// or
html = handler.dom.map(htmlparser.DomUtils.getOuterHTML, htmlparser.DomUtils).join('')
But to find this solution you need to go deep into the source code.
So a translate method would be useful to turn the dom back into html.
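A minimal sketch of what such a translate method might do (toy code; unlike DomUtils.getOuterHTML, it ignores attributes, comments, and void elements):

```javascript
// Toy renderer: walk a DomHandler-style tree and emit markup.
// Attributes, comments, and void elements are omitted for brevity.
function renderNode(node) {
  if (node.type === "text") return node.data;
  var inner = (node.children || []).map(renderNode).join("");
  return "<" + node.name + ">" + inner + "</" + node.name + ">";
}

function renderDom(dom) {
  return dom.map(renderNode).join("");
}

console.log(renderDom([
  { type: "tag", name: "p", children: [{ data: "foo", type: "text" }] }
])); // "<p>foo</p>"
```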
<![CDATA[
This should be CDATA...
]]>
results in
[ { data: '[CDATA[\nThis should be CDATA...\n]]',
type: 'comment' } ]
when it should result in
[ { data: '\nThis should be CDATA...\n',
type: 'cdata' } ]
using version 3.1.5
The current documents were taken from third parties and should be replaced with files with clear licensing terms.
First, the name _contains
is misleading, as it does not check whether the tag contains the value; rather, it checks for an exact match. So if you try something like this:
var domUtils = require("htmlparser2").DomUtils;
domUtils.getElements({ tag_contains: "cookie" }, dom, true);
you won't get anything if the tag you wanted contains "cookies".
Second, when you do enter the exact text, you get the data node rather than the node that the data node belongs to. For example:
domUtils.getElements({ tag_contains: "cookies" }, dom, true);
Would return
[ { data: 'cookies', type: 'text' } ]
instead of:
[ { type: 'tag', name: 'p', children: [ { data: 'cookies', type: 'text' } ] } ]
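The behavior the reporter expected can be sketched with a plain traversal (a hypothetical helper, not the DomUtils API): match text nodes exactly, then return the owning element instead of the text node.

```javascript
// Walk a DomHandler-style tree; collect the parent element of every
// text node whose data matches the value exactly.
function findOwnersByText(dom, value) {
  var out = [];
  (function walk(nodes, parent) {
    nodes.forEach(function (node) {
      if (node.type === "text" && node.data === value && parent) {
        out.push(parent);
      }
      if (node.children) walk(node.children, node);
    });
  })(dom, null);
  return out;
}

var dom = [
  { type: "tag", name: "p", children: [{ data: "cookies", type: "text" }] }
];
console.log(findOwnersByText(dom, "cookies")[0].name); // "p"
```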
var htmlparser = require('htmlparser2');
var parser = new htmlparser.Parser({
onopentag: function(name, attribs) {
console.log('open tag: ' + name);
},
ontext: function(text) {
console.log('text: ' + text);
},
onclosetag: function(name) {
console.log('close tag: ' + name);
}
});
parser.write("Xyz <script type='text/javascript'>var foo = '</script><<bar>>';< / script>");
parser.end();
results in:
text: Xyz
open tag: script
text: var foo = '
close tag: script
open tag: <bar
text: >';
close tag: <bar
When htmlparser2 is used in applications, and variables with _ in their names are used, jslint throws errors like Unexpected dangling '_' in '_attribname'. Can the variable names be changed?
First of all, you did remarkable work on this project. We use it in production on our servers at my company, Fasterize.
Actually, we used a fork of your parser (v2.3.1) (https://github.com/fasterize/node-htmlparser/tree/position). In this fork, we added a patch to know the start index and end index of the current tag. It's very useful in our software to replace a tag.
I would like to upgrade to the latest version of htmlparser2, but for that I need to rebase this patch. Can you let me know the best way to get those indexes? Is it in the tokenizer?
Are you interested in putting this patch into master? I would like to avoid maintaining a fork just for that.
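The index-tracking idea can be sketched in isolation (a toy scanner, not htmlparser2 internals; current htmlparser2 versions do expose start/end indices on the parser, though the exact API should be checked against the docs):

```javascript
// Toy scanner: record the start and end offset of every tag so a
// caller can splice replacements into the original source string.
function tagSpans(src) {
  var spans = [];
  var re = /<[^>]*>/g;
  var m;
  while ((m = re.exec(src)) !== null) {
    spans.push({ tag: m[0], start: m.index, end: m.index + m[0].length });
  }
  return spans;
}

console.log(tagSpans("<p>hi</p>"));
// [ { tag: '<p>', start: 0, end: 3 }, { tag: '</p>', start: 5, end: 9 } ]
```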
Currently you can make <script> and <code> tags completely ignore their contents. The ability to do this with any tag would be really nice. Looking through the tokenizer, this doesn't look possible without a lot of modification to the tokenizer every time you want to add a new tag to ignore.
I managed to create a hack-ish way of doing this with a parser using parser._token._index and _sectionStart, but I just know I'll run into issues with auto-closing tags at some point. I may try to implement this myself, but it would probably be better if someone willing, with in-depth knowledge of the tokenizer, could attempt something like this.
the following code
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
ontext: function(text){
console.log(text);
},
});
parser.write('<script>var x = "<!--123-->";</script>');
parser.done();
outputs
var x = "
";
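The expected rule can be sketched in isolation: inside <script>, only a matching "</script" should end the raw-text run, so comment markers in a string literal stay in the text (a toy scanner, not the real tokenizer):

```javascript
// Toy raw-text scan: everything between <script> and the next
// "</script" is text; comment markers inside it are left alone.
function scriptText(src) {
  var open = src.indexOf("<script>") + "<script>".length;
  var close = src.toLowerCase().indexOf("</script", open);
  return src.slice(open, close);
}

console.log(scriptText('<script>var x = "<!--123-->";</script>'));
// var x = "<!--123-->";
```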
Seems like there's some extra stuff that ended up in the registry. Not a big problem at all, but it'd be nice to either ignore this directory or remove it from the registry in the next version.
Hi there,
This is an issue that came up in: cheeriojs/cheerio#13
Here's the problematic syntax:
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
Basically the parser messes up on the <!--[if lt IE7]>...
So we get:
{
raw: '[if lt IE 7]>',
data: '[if lt IE 7]>',
type: 'comment',
...
}
It doesn't actually break, which is good, but it's an issue on every page that includes HTML5 Boilerplate.
Let me know if you need some additional info with this issue.
Thanks!
Matt
Since 3.2.4, xhtml self-closing tags are doing weird things (unless they happen to be html "void" elements). Sometimes they will engulf the next tag that follows them, and sometimes they will remove the tag that follows them altogether.
With the example in the readme and htmlparser2 v3.0.5 from npm, I get this output instead:
--> Xyz
JS! Hooray!
--> var foo = '
--> <<bar>>';
--> < / script>
That's it?!
Certain tags, such as link and meta, are treated like HTML tags when the parser is set to xmlMode, and they shouldn't be. For example:
var htmlparser2 = require("./lib/index"),
DomUtils = htmlparser2.DomUtils;
var handler = new htmlparser2.DomHandler({xmlMode: true}),
parser = new htmlparser2.Parser(handler);
parser.parseComplete('<link>foo</link>');
console.log(handler.dom);
console.log(DomUtils.getOuterHTML(handler.dom[0]));
console.log(DomUtils.getInnerHTML(handler.dom[0]));
returns:
[ { type: 'tag', name: 'link' }, { data: 'foo', type: 'text' } ]
<link></link>
rather than what I expected:
[ { type: 'tag', name: 'link', children: [ { data: 'foo', type: 'text' } ] } ]
<link>foo</link>
foo
Hi, I recently began to evaluate your htmlparser as an alternative for my project, and I need to access the "parent" element from a current element.
In your DomUtils file you use "elem.parent" within your "removeElement" function, but I cannot find anything else for a parent element. Is it leftover pollution, or is it planned to be integrated?
Greets and thanks in advance for your feedback,
Chris
In an attempt to optimize one of my modules, I discovered Tokenizer.write isn't optimized by v8.
Using the --trace_deopt flag, I can see that v8 optimizes it, but due to bailouts #488, #13, #166, #396, #388, #287, #348, #154, #14, #280 it gets deoptimized until v8 gives up.
I will spend some time on this, but I just wanted to know if you have seen this before?
To debug it yourself, check out AndreasMadsen/article@fafe3b4 and run node --expose-gc --trace_deopt tools/benchmark.js
I am trying to package this library into one single distributable file that exposes the htmlparser name globally under the window object.
It seems that browserify doesn't cater to that problem but requires the main program to use the require() construct, too. I don't believe that this is an option for me right now.
The best solution I've found so far is
$ browserify -r ./lib/index > htmlparser.js
and then in the browser do
> htmlparser = require('./lib/index');
The next best thing I can think of is wrapping each file in a function (module style), "appending" its exports to an overall module variable, which in turn sits inside a closure that is then returned and assigned to the global variable htmlparser.
Am I missing the point of browserify and the like?
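The wrapper being described can be sketched as a tiny function (browserify's --standalone flag generates a full UMD header that does essentially this; flag name per browserify's docs):

```javascript
// Mimic a UMD-style header: attach the factory's exports to a global
// object under a single name.
function wrapStandalone(globalObj, name, factory) {
  globalObj[name] = factory();
  return globalObj[name];
}

var fakeWindow = {}; // stand-in for the browser's window object
wrapStandalone(fakeWindow, "htmlparser", function () {
  // In a real bundle this would return the library's exports.
  return { parse: function (s) { return s; } };
});
console.log(typeof fakeWindow.htmlparser.parse); // "function"
```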
Ran into a problematic website with mismatched open and close tags.
It appears the following code in Parser.prototype.onclosetag tries to make better sense out of mismatched close tags by popping the open tags off the stack until reaching the tag being closed:
var pos = this._stack.lastIndexOf(name);
if(pos !== -1){
if(this._cbs.onclosetag){
pos = this._stack.length - pos;
while(pos--) this._cbs.onclosetag(this._stack.pop());
}
else this._stack.length = pos;
...
Sending the re-ordered tags to the browser then caused the rendering to look awful.
The question would be: is it reasonable to just attempt to reconstruct the original flawed order by pulling out the tag?
var pos = this._stack.lastIndexOf(name);
if(pos !== -1){
this._stack.splice(pos,1);
if (this._cbs.onclosetag) {
this._cbs.onclosetag(name);
  }
}
With this modification, Firefox, Chrome, and IE all now rendered this "flawed" html properly. Being unaware of the potential impact on other code and scenarios that might expect the current tag-closing behavior, I put this up for discussion.
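The proposed change can be exercised in isolation (toy stack, hypothetical helper name):

```javascript
// Close only the named tag: splice it out of the open-tag stack and
// emit a single onclosetag, leaving outer tags open.
function closeTag(stack, name, emitted) {
  var pos = stack.lastIndexOf(name);
  if (pos !== -1) {
    stack.splice(pos, 1);
    emitted.push(name);
  }
  return stack;
}

var emitted = [];
var stack = ["div", "b", "i"];
closeTag(stack, "b", emitted);
console.log(stack);   // [ 'div', 'i' ]
console.log(emitted); // [ 'b' ]
```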
As the caller, I assumed that the tag names passed to the callbacks would be lowercased, since the work to lowercase them was already done within the parser. Otherwise, the caller is forced to lowercase tags again.
Hello,
I've got a snippet of code that process a feed and more or less looks like:
var htmlparser = require('htmlparser2');
var request = require('request');
var handler = new htmlparser.FeedHandler( function( error, feed ){
console.log( feed.items.length );
});
var parser = new htmlparser.Parser( handler, { xmlMode: true } );
request('https://news.ycombinator.com/rss', function (error, response, body) {
if (!error && response.statusCode == 200) {
parser.parseComplete( body.toString() );
}
})
The only problem is feed.items.length, which should be 30, but it returns just 2. I tried to investigate a little bit more, and I realised that if I edit the description tag of the original feed to contain non-CDATA content, it does work and I can get all 30 items (I'm loading the file locally). This is the content of the item tag:
<item>
<title>I gave away my xbox 360 today</title>
<link>https://plus.google.com/u/0/105363132599081141035/posts/W3ys5fKnz5t</link>
<comments>https://news.ycombinator.com/item?id=5506571</comments>
<description><![CDATA[<a href="https://news.ycombinator.com/item?id=5506571">Comments</a>]]></description>
</item>
Not sure about what's going on, but looks like this is related to opened/closed CDATA sections.
Thanks
Daniele
When this is parsed, doc and param are stored as siblings, and doc is no child of param:
<param><doc>doc</doc></param>
But when param is renamed to parameter, doc is a child of parameter:
<parameter><doc>doc</doc></parameter>
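A plausible explanation (my assumption, not confirmed in the thread): param is on HTML's void-element list, which the parser special-cases, while the unknown name parameter is not. Sketch:

```javascript
// param is an HTML void element (it can never have children), so an
// HTML parser closes it immediately; "parameter" is an unknown tag and
// nests normally. (Subset of the spec's void-element list.)
var voidElements = new Set(["br", "hr", "img", "input", "link", "meta", "param"]);

console.log(voidElements.has("param"));     // true
console.log(voidElements.has("parameter")); // false
```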
There are some benefits in being able to run the parser in a browser even though browsers do html parsing themselves.
Examples of this might be parsing user input, or loading and parsing html via XHR without incurring the cost of loading images and executing JavaScript, etc. I have been able to tweak the source so it runs in the browser again, but it would be nice if this were supported from the get-go.
For XML mainly.
For my use case I needed support for a hybrid parser that would allow for XML constructs to be recognized in HTML code. After looking at the source code I felt that adding more options could provide more fine-grained control over parsing without adding too much complexity to the code. Please review my Pull Request which added two new options:
recognizeSelfClosing: if true, then self-closing tags will result in the tag being closed even if xmlMode is not set to true
recognizeCDATA: if true, then CDATA text will result in the ontext event being fired even if xmlMode is not set to true
In theory, xmlMode could be used as a way to control the more fine-grained options, but I wanted to minimize code change.
When the attributes are written without whitespace between them, only the first attribute is found.
Example:
<div class="first attribute"title="second attribute"></div>
Returns:
{
class: "first attribute"
}
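The expected behavior can be illustrated with a simple scan (a regex toy, not the real tokenizer): both attributes should be recovered even with no whitespace after the closing quote.

```javascript
// Toy attribute scan: name="value" pairs, tolerating a missing space
// between a closing quote and the next attribute name.
var attrRe = /([^\s"'=\/>]+)="([^"]*)"/g;
var src = 'class="first attribute"title="second attribute"';
var attribs = {};
var m;
while ((m = attrRe.exec(src)) !== null) {
  attribs[m[1]] = m[2];
}
console.log(attribs);
// { class: 'first attribute', title: 'second attribute' }
```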
Hi, it's me again. I just wonder if you plan to get closer to the HTML specs, like using "attributes" instead of "attribs", "nodeValue" instead of "value", or "parentNode" instead of "parent", etc., as object properties. If you are interested, I have already made several working changes to my local code and would like to contribute them to your project.
Greets,
Chris
The method "onattribvalue" is referenced in "Tokenizer.js" but not implemented on "Parser.js" as expected.
This causes the following error:
TypeError: Object #<Parser> has no method 'onattribvalue'
    at Tokenizer._handleTrailingData (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Tokenizer.js:827:13)
    at Tokenizer.end (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Tokenizer.js:807:8)
    at Parser.end (/data/workspace/projects/cf-news/node_modules/htmlparser2/lib/Parser.js:297:18)
Parsing <br />
results in:
{
type: 'tag',
name: 'br',
attribs: { '/': '/' },
children: []
}
It seems you are solving a problem I have. So far I went with John Resig's abandoned HTML parser, which is the only close result on Google when searching for "JavaScript htmlparser". By chance I searched for "node htmlparser" instead, found the predecessor of this project, and by chance again looked at the network graph to see that yours is the only line with constant recent commits.
Do make this project more widely known. As a suggestion, maybe add "JavaScript" to your description (maybe "JS" isn't really doing it).
Referencing #75, as some basic svg shapes were already added as inline, but the list is rather incomplete. Inline svg is probably going to be a lot more common with increasing browser support. At least polyline and polygon are missing, but there might be more. As @fb55 mentioned, there are several other issues with changing the inline list; perhaps we can discuss this issue here and come to a more complete fix.
I'm not sure what your plans are for the project, but I think going the other way (i.e. DOM object --> HTML) is really useful.
This is the renderer I've been using, I could help integrate it into the project if you'd like:
https://github.com/MatthewMueller/cheerio/blob/master/src/renderer.coffee
When parsing this file https://github.com/AndreasMadsen/article/blob/master/test/reallife/source/09198e90b6a14acfef0d4044606b8fd5801648f98763bf967f181aabaf59804d.html#L920-923 I don't get the highlighted <img ... > after 263775f.
However, as you will see here, the <img> (big Obama picture) does render in Chrome at least.
So I would ask you to support \r inside tags anyway.
Most information regarding the problem is available here: cheeriojs/cheerio#131 (comment)
XML to reproduce issue is available here: https://gist.github.com/4248909
As per this comment, you can see the parsed tree is butchered somehow pretty bad: cheeriojs/cheerio#131 (comment)
Removing the CDATA with '<root>' + page.substr(55).replace(/<!\[CDATA\[([^\]]+)]\]>/ig, "$1") + '</root>' results in a working parse tree, as shown here: cheeriojs/cheerio#131 (comment)
Not an ideal solution though.