cheeriojs / cheerio
The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
Home Page: https://cheerio.js.org
License: MIT License
jQuery's .text() decodes HTML entities:
> $('<p>M&amp;M</p>').text()
"M&M"
cheerio's does not:
> cheerio.load("<p>M&amp;M</p>")("p").text()
'M&amp;M'
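A minimal sketch of the decoding step this issue asks for, assuming only a handful of named entities plus numeric references need handling. `decodeEntities` and the `NAMED` table are hypothetical names for illustration, not cheerio's API; a real implementation would lean on a complete entity table.

```javascript
// Hypothetical helper, not part of cheerio: decodes a small set of HTML entities.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      // Numeric reference: decimal (&#39;) or hexadecimal (&#x27;).
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Named reference: unknown names are left exactly as written.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}
```

Applying something like this to the concatenated text nodes would bring .text() in line with the jQuery behaviour described above.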
<div class="add-question first" data-bind="visible: questionSets()[0].questions().length > 0">
If I read that in with cheerio and get the html, the tag ends at the first greater than sign.
Chris
The vows test suite isn't scaling the way I want it to. Mocha looks promising.
See #6 for more information
Cheerio (& jQuery) traversing ignores elements that aren't tags. Node-htmlparser treats script and style tags differently (doesn't parse inner content) so it has different types - elem.type is "script" and "style" instead of "tag".
Need to include those tags in the traversing.
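One way to include those nodes, sketched over plain objects shaped like the parser output described above (a `type` field and an optional `children` array). The helper names `isElement` and `elementChildren` are hypothetical.

```javascript
// Node types that should take part in traversal. "script" and "style" are the
// extra types named in the issue; the helper names are illustrative only.
const ELEMENT_TYPES = new Set(['tag', 'script', 'style']);

function isElement(node) {
  return node != null && ELEMENT_TYPES.has(node.type);
}

// Returns only the children that traversal methods should visit.
function elementChildren(node) {
  return (node.children || []).filter(isElement);
}
```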
Under some circumstances when cheerio objects are generated from html, parent/child relationships aren't created.
I've created the following test case that fails by returning null.
$("<div></div>").append("<div><div></div></div>").children().children().parent();
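A sketch of what the failing case seems to need: when a parsed fragment is appended, every node in it gets its `parent` pointer set, not just the top-level one. This assumes nodes are plain objects with `children` and `parent` fields; `adopt` and `appendChild` are hypothetical names, not cheerio internals.

```javascript
// Recursively set parent pointers for a node and all of its descendants.
function adopt(parent, child) {
  child.parent = parent;
  (child.children || []).forEach((grandchild) => adopt(child, grandchild));
  return child;
}

// Append a (possibly nested) node to a parent, wiring up the whole subtree.
function appendChild(parent, child) {
  parent.children = parent.children || [];
  parent.children.push(adopt(parent, child));
  return parent;
}
```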
Hi Matthew,
First off, really like Cheerio. Great work! Fast and easy to use. We are using it heavily and have been able to develop quickly with it.
However, since 0.3.0 we are no longer able to get to nested divs that are two or more levels deep and render additional data in those elements. Here is an example.
I have a template that looks like the following:
https://gist.github.com/1500031
Something like the following used to work and now fails:
template = get('games.html');
$ = gamefly.cheerio.load(template);
$('.ttl').text("Call of Duty");
This appears to fail starting with 0.3.1. Any ideas?
Thanks
Christian
CoffeeScript had a good run; it's a pleasure to write in. Unfortunately, the whitespace and the two source directories are a pain to work with.
Now that we're ignoring whitespace in the parser, we should write a prettifier so our output is readable.
Seems like a little typo: the test suite points to tests instead of test.
JosProDesk:cheerio jos$ npm install
JosProDesk:cheerio jos$ npm test
> [email protected] test /Users/jos/Sites/cheerio
> coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec
node.js:201
throw e; // process.nextTick error, or 'error' event on first tick
^
Error: Cannot find module '/Users/jos/Sites/cheerio/tests/test.cheerio'
at Function._resolveFilename (module.js:334:11)
at Function._load (module.js:279:25)
at Module.require (module.js:357:17)
at require (module.js:368:17)
at /usr/local/lib/node_modules/vows/bin/vows:496:19
at Array.reduce (native)
at importSuites (/usr/local/lib/node_modules/vows/bin/vows:491:18)
at Object.<anonymous> (/usr/local/lib/node_modules/vows/bin/vows:247:15)
at Module._compile (module.js:432:26)
at Object..js (module.js:450:10)
npm ERR! [email protected] test: `coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec`
npm ERR! `sh "-c" "coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec"` failed with 1
npm ERR!
npm ERR! Failed at the [email protected] test script.
npm ERR! This is most likely a problem with the cheerio package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec
npm ERR! You can get their info via:
npm ERR! npm owner ls cheerio
npm ERR! There is likely additional logging output above.
npm ERR!
npm ERR! System Darwin 10.8.0
npm ERR! command "node" "/usr/local/bin/npm" "test"
npm ERR! cwd /Users/jos/Sites/cheerio
npm ERR! node -v v0.6.7
npm ERR! npm -v 1.1.0-beta-10
npm ERR! code ELIFECYCLE
npm ERR! message [email protected] test: `coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec`
npm ERR! message `sh "-c" "coffee -o lib/ src/ && vows tests/test.cheerio.coffee --spec"` failed with 1
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /Users/jos/Sites/cheerio/npm-debug.log
npm not ok
Lots of unnecessary Here!s, There!s, and random console.logs from debugging. Not a huge priority, but gets kind of annoying.
I am running into the following error when I call cheerio.load many times in quick succession:
Error: EMFILE, too many open files '/Users/ryanshaw/Code/noflo/node_modules/cheerio/package.json'
at Object.openSync (fs.js:238:18)
at Object.readFileSync (fs.js:128:15)
at Function.version (/Users/ryanshaw/Code/noflo/node_modules/cheerio/index.js:14:27)
at /Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:638:27
at Array.forEach (native)
at /Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:76:11
at Function.extend (/Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:636:5)
at [object Object].extend (/Users/ryanshaw/Code/noflo/node_modules/cheerio/node_modules/underscore/underscore.js:961:26)
at Function.load (/Users/ryanshaw/Code/noflo/node_modules/cheerio/lib/api/utils.js:243:16)
at ScrapeHtml.scrapeHtml (/Users/ryanshaw/Code/noflo/components/ScrapeHTML.js:96:19)
Commenting out the following lines in index.js solves the problem:
var version = function() {
var pkg = require('fs').readFileSync(__dirname + '/package.json', 'utf8');
return JSON.parse(pkg).version;
};
exports.__defineGetter__('version', version);
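Rather than removing the getter, the file read could be done once and cached, since the stack trace shows the getter being hit through underscore's extend on every load call. A minimal sketch, with a stand-in reader in place of the real package.json read; `once` and `readVersion` are hypothetical names.

```javascript
// Wrap an expensive zero-argument function so it runs at most once.
function once(compute) {
  let cached;
  let done = false;
  return function () {
    if (!done) {
      cached = compute();
      done = true;
    }
    return cached;
  };
}

// Stand-in reader for illustration; a real module would read package.json here.
let reads = 0;
const readVersion = once(() => {
  reads += 1;
  return JSON.parse('{"version":"0.0.0-example"}').version;
});
```

Passing `readVersion` to the getter would keep the `version` property while opening the file only on first access. Loading the file with `require` instead of `readFileSync` would likely have a similar effect, because Node caches required modules.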
jQuery protects from XSS when assigning attributes; cheerio doesn't currently.
function xss() {
var $ = require('cheerio').load('<a>GitHub</a>');
$('a').attr('href', 'http://github.com/"><script>alert("XSS!")</script><br');
return $.html();
}
xss() returns:
<a href = "http://github.com/"><script>alert("XSS!")</script><br">GitHub</a>
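The usual remedy is to escape attribute values when serializing. A minimal sketch, assuming values are always written inside double quotes; `escapeAttribute` is a hypothetical helper, not an existing cheerio function.

```javascript
// Escape the characters that could end a double-quoted attribute value
// or open a new tag. The ampersand must be replaced first.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
```

With this applied at render time, the quote in the injected href can no longer close the attribute, so the script tag stays inside the attribute value.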
media = "" --> media = "undefined"
First off, thanks for a brilliant project. I'm using cheerio 0.8.0:
My code looks like this:
$($(cols.get(3)).html())
The html returned from the $(cols.get(3)).html() looks like this:
Kid's Ride <br/>Foo <br/> <br/><font color="#000000">Permitted</font><br/>Category - D <br/><br/>\n
When I try to wrap that output back into the outer $() so that I can do more selects on it, this is the stack trace I see:
at parse (node_modules/cheerio/node_modules/cheerio-select/node_modules/CSSselect/node_modules/CSSwhat/index.js:109:11)\n
at parse (node_modules/cheerio/node_modules/cheerio-select/node_modules/CSSselect/index.js:646:18)\n
at Function.iterate (node_modules/cheerio/node_modules/cheerio-select/node_modules/CSSselect/index.js:687:42)\n
at node_modules/cheerio/node_modules/cheerio-select/lib/select.js:13:20
at [object Object].find (node_modules/cheerio/lib/api/traversing.js:7:14)
at [object Object].init (node_modules/cheerio/lib/cheerio.js:67:44)
at node_modules/cheerio/lib/cheerio.js:11:12
at fn (node_modules/cheerio/lib/api/utils.js:246:12)
I suspect it has something to do with the entity (') in there as other similar html without the single quote in it parses just fine.
The latest version is removing all my nodes only containing whitespace. It's introducing subtle rendering bugs where spaces are missing e.g.
<div><span class="firstname">Jos</span> <span class="lastname">Shepherd</span></div>
is getting output as:
<div><span class="firstname">Jos</span><span class="lastname">Shepherd</span></div>
Is the whitespace stripping by design or is it a bug?
http://api.jquery.com/serialize/
this would be nice to have when finding hidden inputs etc.
This:
$('div > header > a')
does not work. Cheerio seems unable to handle the '>' child selector.
(mbp) ~ $ cd ~/scratch/
(mbp) ~/scratch $ mkdir cheerio-install
(mbp) ~/scratch $ cd cheerio-install/
(mbp) ~/scratch/cheerio-install $ npm install cheerio
npm ERR! error installing [email protected] Error: No compatible version found: entities@'>=1.0.0- <2.0.0-'
npm ERR! error installing [email protected] Valid install targets:
npm ERR! error installing [email protected] ["0.1.0","0.1.1"]
npm ERR! error installing [email protected] at installTargetsError (/usr/local/lib/node_modules/npm/lib/cache.js:424:10)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/cache.js:406:17
npm ERR! error installing [email protected] at saved (/usr/local/lib/node_modules/npm/lib/utils/npm-registry-client/get.js:136:7)
npm ERR! error installing [email protected] at Object.cb [as oncomplete] (/usr/local/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:36:9)
npm ERR! Error: No compatible version found: entities@'>=1.0.0- <2.0.0-'
npm ERR! Valid install targets:
npm ERR! ["0.1.0","0.1.1"]
npm ERR! at installTargetsError (/usr/local/lib/node_modules/npm/lib/cache.js:424:10)
npm ERR! at /usr/local/lib/node_modules/npm/lib/cache.js:406:17
npm ERR! at saved (/usr/local/lib/node_modules/npm/lib/utils/npm-registry-client/get.js:136:7)
npm ERR! at Object.cb [as oncomplete] (/usr/local/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:36:9)
npm ERR! Report this *entire* log at:
npm ERR! <http://github.com/isaacs/npm/issues>
npm ERR! or email it to:
npm ERR! <[email protected]>
npm ERR!
npm ERR! System Darwin 11.4.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "cheerio"
npm ERR! cwd /Users/bat/scratch/cheerio-install
npm ERR! node -v v0.6.19
npm ERR! npm -v 1.0.106
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /Users/bat/scratch/cheerio-install/npm-debug.log
npm not ok
(mbp) ~/scratch/cheerio-install $
people are apparently using this; how about running npm publish?
Not sure if this is an issue for here, htmlparser2, or if I am just clueless, but whenever I pull in a doc with cheerio and there is code like
<br> <span> blah</span> <br>
the first br ends up with two children, the span and the second br.
I thought this wasn't supposed to happen for nodes like br, which can't have children, when the xmlMode option is false. Or am I just confused?
UG
I am trying to parse XML with cheerio. The XML contains link tags, and the following script returns an empty string in that case:
$('foo').find('link').text()
How can I parse XML with link tags and similar?
Thanks in advance.
Even though it is faster than the jsdom + jQuery implementation, there are many familiar selectors that cannot be used with Cheerio. For instance, I can't pass HTML code into the function, and I can't use eq() with the selector. The only advantage of your implementation is the performance gain.
Hello, Matthew,
I have just downloaded cheerio and found that it doesn't work! I installed it with npm install cheerio.
The problem is in the cheerio-soupselect module: it doesn't work with htmlparser2. I don't know exactly what is wrong, but when I downloaded harryf/node-soupselect, placed it into the cheerio-soupselect folder, and installed htmlparser1, the lib came alive.
Please check the current distribution; I think it has errors in the cheerio-soupselect module.
Regards, Dmitry
In jQuery you can do this:
$('.pear').prev('.apple');
This lets you write much more robust code. Currently in cheerio, if you add a plum element between apple and pear, your code will break.
Hi Matt Mueller,
How can I scrape sites that require login?
Can you post some example code to do this?
Thanks
Koti
It's much more common to have no space around the equal sign in attributes: for instance, <a href="http://github.com/">GitHub</a> is much more common than <a href = "http://github.com/">GitHub</a>, which cheerio is producing. Is there anything I'm missing here, or would it be better to remove the extra whitespace?
$ = require('cheerio').load('<a>GitHub</a>');
$('a').attr('href', 'http://github.com/');
console.log($.html()) // => '<a href = "http://github.com/">GitHub</a>'
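A sketch of the tighter output format being requested, written against a plain attribute map rather than cheerio's actual renderer. `renderAttributes` and `renderOpenTag` are hypothetical names for illustration.

```javascript
// Render an attribute map as name="value" pairs with no space around "=".
function renderAttributes(attribs) {
  return Object.keys(attribs)
    .map((name) => {
      const value = String(attribs[name])
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;');
      return `${name}="${value}"`;
    })
    .join(' ');
}

// Render an opening tag, omitting the separator when there are no attributes.
function renderOpenTag(name, attribs) {
  const rendered = renderAttributes(attribs || {});
  return rendered ? `<${name} ${rendered}>` : `<${name}>`;
}
```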
Example:
$script = $('script[type="text/javascript"]')
$('body').find($script); // fail!
val() for inputs, textareas, and selects would be handy, rather than having to use these:
$('#select option[selected="selected"]').attr('value')
$('#textarea').html()
$('#text_input').attr('value');
In trying to get val(), I also realized that you have to use option[selected="selected"], which is a bit obtuse compared to option:selected, which in jQuery knows all the different ways an option may be selected.
Any whitespace between tags is NOT being rendered with .html()
The output of that when rendered with .html() is: LinkHyperLink
It removes the space between the two tags. This only happens when two tags come one after the other. To get around it I have to do the following to the html string: .replace('> <','> <');
Already missing them from jQuery...
After loading this page: http://sh.58.com/job.shtml, $('body').length == 0. What's wrong? jsdom works.
Not really a priority, but it would be nice. This way we can drop coffee as a dep completely.
I can't believe this one has gone unnoticed. I'm not really sure what to do about it actually. I think it's more useful how it is right now, but I don't like the asymmetry. I also don't like straying away from the jQuery API.
What do you guys think?
Forgive me if I've misread the docs, but I can't understand how this kind of result would be useful / intentional:
var myHTML = '<tr><a>myLink1</a></tr><tr><a>myLink2</a></tr>';
var $ = cheerio.load(myHTML);
var rows = $('tr');
var links = $('a', rows.get(0));
console.log("There are " + links.length + " links in the first row")
gives me
There are 2 links in the first row.
I'm using the first row as context so I should only get 1 link for that first row. I also tried rows[0] instead of rows.get(0) to no avail.
Is this a bug?
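The expected behaviour is that a context restricts the search to that node's descendants. A minimal sketch of such a scoped search over plain node objects (`type`, `name`, `children`); `findByName` is a hypothetical helper and stands in for real selector matching.

```javascript
// Collect descendants of `context` with the given tag name, in document order.
// Nothing outside the context subtree is ever visited.
function findByName(context, name) {
  const matches = [];
  (context.children || []).forEach((child) => {
    if (child.type === 'tag' && child.name === name) {
      matches.push(child);
    }
    matches.push(...findByName(child, name));
  });
  return matches;
}
```

Called with the first row as context, a search like this can only return links inside that row, which is the count the report expects.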
This fix is for the crazies (like myself):
$('html').before("<h1>Zomg, I'm before the html tag</h1>");
The following should work:
var $script = $('<script>').attr('src', 'jquery.js');
$('head').append($script)
Right now we have to:
var $script = $('<script>').attr('src', 'jquery.js');
$('head').append($script.html())
For more discussion & information, go here:
I just started using Cheerio and it seems like a great tool. Trying some of the examples on the Cheerio page worked as advertised but when I pass Cheerio a full HTML page, it cannot seem to process it.
This is the HTML I am using:
<!doctype html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<!-- Browser Compatibility: -->
<meta http-equiv="X-UA-Compatible" content="IE=9,chrome=1">
<!-- SEO and social media description: -->
<meta name="description" content="">
<!-- Mobile optimization: -->
<meta name="viewport" content="width=device-width,initial-scale=1">
<!-- Disable IE6 image menu: -->
<meta http-equiv="imagetoolbar" content="false">
<title></title>
<!-- CSS: -->
<link rel="stylesheet" href="master.css">
<!-- (Some) JS: -->
<!--<script src="modernizr-2.0.6.min.js"></script>-->
</head>
<body>
<header>
<div id="logo"></div>
<nav>
</nav>
</header>
<div id="main">
</div>
<footer>
<div id="small-logo"></div>
<small>Copyright © <span class="year"></span> Moo cows. All rights reserved.</small>
</footer>
<!-- (Rest of) JS: -->
<!-- Grab Google CDN's jQuery, with a protocol relative URL; fall back to local if offline -->
<!--<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script>window.jQuery || document.write('<script src="js/libs/jquery-1.7.1.min.js"><\/script>')</script>-->
<!-- IE6 Chrome Frame install prompt -->
<!--[if lt IE 7 ]>
<script src="//ajax.googleapis.com/ajax/libs/chrome-frame/1.0.3/CFInstall.min.js"></script>
<script>window.attachEvent('onload',function(){CFInstall.check({mode:'overlay'})})</script>
<![endif]-->
</body>
</html>
If I try
I don't know if this is in any way related to Issue #12 but the error messages are similar. I tried to narrow down the problem by stripping out some of the script tags, conditional statements, etc from the HTML but that didn't work either.
I am using the latest version of Node.js (0.6.4).
Hi Matthew,
I ran the test suite on node v0.4.7 and it passes. Is there any reason why cheerio restricts usage to node >= v0.4.11?
I need cheerio with this version of node. Can you modify the node version in package.json?
They get rendered as text. Example:
<!-- comment -->
=>
comment
I was attempting to select elements based on DOM structure, e.g. "li.a div.b span.c" (because span.c exists in other structures as well, but I only want to access those that match this structure), but instead of the 40 elements that match on the page, the code returned 840 elements, with multiple duplicates, so I assume it gets all the possible combinations and just returns them all.
E.g.
$('li.a div.b span.c', html).each(function(i, el) {
// Has 840 items
});
vs.
$('span.c', html).each(function(i, el) {
if(
// Has 40 items
}
});
Is this a (typical) user error or an issue with the selector?
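If the selector engine does return the same node once per matching ancestor path, a result set can be de-duplicated by object identity until that is fixed. A minimal sketch; `uniqueNodes` is a hypothetical helper, and whether this is the actual cause of the 840 results is an assumption.

```javascript
// Remove repeated references to the same node, keeping first-seen order.
// A Set compares objects by identity, so distinct but equal-looking nodes survive.
function uniqueNodes(nodes) {
  return Array.from(new Set(nodes));
}
```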
I'm using cheerio on my node.js server, and using the method html() I can copy the functionality of innerHTML, but I haven't found a way to copy how outerHTML works. How could I do this?
Thanks :)
there seems to be a problem with setting text into an empty div like so:
var cheerio = require('cheerio');
var $ = cheerio.load('
the problem appears to be that line 118 of api/manipulation.coffee is checking that this.children exists before inserting the element.
i'm not sure if this is an issue in cheerio or an issue with htmlparser leaving the children property unassigned, but i thought you might want to know.
cheers,
simon
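A sketch of a text setter that does not depend on `children` already existing, which is the failure mode described above. It assumes text nodes look like `{ type: 'text', data }`; `setText` is a hypothetical name, not the function in api/manipulation.coffee.

```javascript
// Replace an element's contents with a single text node.
// Works whether or not the parser assigned a `children` array.
function setText(elem, text) {
  const node = { type: 'text', data: String(text), parent: elem };
  elem.children = [node];
  return elem;
}
```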
2 htmlparsers get bundled in cheerio right now. Soupselect hasn't been updated in a while, so I'm planning on forking it.
Sometimes you only have a fragment of HTML and you want to add content to the front or back of it.
Being able to select multiple types of tags at once is immensely useful.
A few ways to implement this:
htmlparser2 will return nothing if you give it something like "hello world".
Therefore, if you run cheerio.load('hello world'), it will return the cryptic error: ".find is undefined".
Full text should be treated as a single text node.
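That fallback can be sketched as a thin wrapper around whatever parse function is in use. The `parse` parameter and the `{ type: 'text', data }` node shape are assumptions made for illustration; `parseWithTextFallback` is a hypothetical name.

```javascript
// Parse input, and if the parser yields no nodes for non-empty input,
// treat the whole input as a single text node.
function parseWithTextFallback(input, parse) {
  const dom = parse(input);
  if (Array.isArray(dom) && dom.length > 0) {
    return dom;
  }
  return input.length > 0 ? [{ type: 'text', data: input }] : [];
}
```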
I don't particularly see any benefit to not having this option set. Is there any reason this wasn't set already?