evanderkoogh / node-sitemap-stream-parser
A streaming parser for sitemap files. Is able to deal with deeply nested sitemaps with 100+ million urls in them.
License: Apache License 2.0
Hi! Can you please post a sample output of the function parseSitemaps in the README?
For new users it will help a lot to understand what kind of data is parsed, for example loc, lastmod, changefreq, priority, etc.
Example of a sitemap that will not return any urls: https://www.parashop.com/1_fr_0_sitemap.xml
Extract:
<url>
<loc><![CDATA[https://www.parashop.com/meilleures-ventes]]></loc>
<priority>0.1</priority>
<changefreq>daily</changefreq>
</url>
This does not work because sax does not trigger a text event for content inside CDATA; it triggers a separate cdata event instead. I believe you can just use the same handler for the cdata and text events and it will process fine.
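A minimal sketch of the shared-handler idea, not the library's actual code. collectLocs is a hypothetical helper, and a plain EventEmitter stands in for the sax stream so the example runs without extra packages; the event names (opentag, text, cdata, closetag) are the ones sax emits.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper (not part of sitemap-stream-parser): attaches handlers
// to any emitter that fires sax-style events and reports each <loc> value.
function collectLocs(parserStream, onUrl) {
  let insideLoc = false;
  let buffer = '';

  parserStream.on('opentag', (node) => {
    if (node.name.toLowerCase() === 'loc') {
      insideLoc = true;
      buffer = '';
    }
  });

  // One handler serves both events, so CDATA-wrapped URLs are not lost.
  const onCharacters = (chunk) => {
    if (insideLoc) buffer += chunk;
  };
  parserStream.on('text', onCharacters);
  parserStream.on('cdata', onCharacters);

  parserStream.on('closetag', (name) => {
    if (name.toLowerCase() === 'loc') {
      insideLoc = false;
      onUrl(buffer.trim());
    }
  });
}

// Stand-in for a sax stream: replay the events expected for
// <loc><![CDATA[https://www.parashop.com/meilleures-ventes]]></loc>
const fakeStream = new EventEmitter();
const found = [];
collectLocs(fakeStream, (url) => found.push(url));
fakeStream.emit('opentag', { name: 'loc', attributes: {} });
fakeStream.emit('cdata', 'https://www.parashop.com/meilleures-ventes');
fakeStream.emit('closetag', 'loc');
console.log(found);
```

With a real parser, the same function could be handed the stream returned by sax.createStream in place of fakeStream.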
I found an example https://booking.com/robots.txt where sitemaps are marked as Disallowed
Sitemap: https://www.booking.com/sitembk-index-https.xml
User-agent: Baiduspider
Disallow: /sitembk-index-https.xml
I suggest adding an option respectRobotsTxt to the parser, set to true by default.
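A rough sketch of what such filtering could look like, under simplifying assumptions: allowedSitemaps is a hypothetical function, not an existing API of the parser, it matches Disallow rules by plain path prefix only (no wildcards, no Allow rules), and it treats a group as applying when its User-agent token is * or appears in the crawler name.

```javascript
// Hypothetical helper: list the Sitemap entries in a robots.txt body that
// the given crawler is not disallowed from fetching.
function allowedSitemaps(robotsTxt, userAgent) {
  const sitemaps = [];
  const disallowed = [];
  const ua = userAgent.toLowerCase();
  let agents = [];          // user agents of the group being read
  let lastWasAgent = false; // consecutive User-agent lines share one group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'sitemap') {
      sitemaps.push(value); // Sitemap lines do not belong to any group
      continue;
    }
    if (field === 'user-agent') {
      if (!lastWasAgent) agents = [];
      agents.push(value.toLowerCase());
      lastWasAgent = true;
      continue;
    }
    if (field === 'disallow' && value !== '') {
      const applies = agents.some((a) => a === '*' || ua.includes(a));
      if (applies) disallowed.push(value);
    }
    lastWasAgent = false;
  }

  return sitemaps.filter((sitemapUrl) => {
    const path = new URL(sitemapUrl).pathname;
    return !disallowed.some((prefix) => path.startsWith(prefix));
  });
}

const robots = [
  'Sitemap: https://www.booking.com/sitembk-index-https.xml',
  'User-agent: Baiduspider',
  'Disallow: /sitembk-index-https.xml',
].join('\n');

const forBaidu = allowedSitemaps(robots, 'Baiduspider');
const forOther = allowedSitemaps(robots, 'MyCrawler');
console.log(forBaidu, forOther);
```

Note that in the extract above the Disallow rule sits in the Baiduspider group, so under this sketch only a crawler identifying as Baiduspider would skip the sitemap.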
Please enable strict mode for the SAX parser. Otherwise the error handler is useless: no error is raised even with invalid XML.
This line: parserStream = sax.createStream false,
I'm experiencing a very high CPU utilization (100%) with large nested sitemaps.
The url callback is very simple: it only increments a counter.
Could this be related to the "blocking" nature of the url (and sitemap) callbacks? If you point me in the right direction, I can contribute to the project.
As an example, you could try this sitemap: https://www.walmart.com/sitemap_ip.xml
I tried the parser for booking.com with the following code:
var sitemaps = require('sitemap-stream-parser');
sitemaps.sitemapsInRobots('http://booking.com/robots.txt', function(err, urls) {
  if(err || !urls || urls.length == 0)
    return;
  sitemaps.parseSitemaps(urls, console.log, function(err, sitemaps) {
    console.log(sitemaps);
  });
});
The parser runs for a while, but then stops with the following error:
internal/streams/legacy.js:57
throw er; // Unhandled stream error in pipe.
^
Error: ESOCKETTIMEDOUT
at ClientRequest.<anonymous> (.../node_modules/request/request.js:812:19)
at Object.onceWrapper (events.js:275:13)
at ClientRequest.emit (events.js:182:13)
at TLSSocket.emitTimeout (_http_client.js:694:34)
at Object.onceWrapper (events.js:275:13)
at TLSSocket.emit (events.js:182:13)
at TLSSocket.Socket._onTimeout (net.js:447:8)
at ontimeout (timers.js:427:11)
at tryOnTimeout (timers.js:289:5)
at listOnTimeout (timers.js:252:5)
I'm working with a recursive sitemap, and it would be helpful to get the URL of the sitemap that each item came from. Is this possible?
Ideally I would like this returned: {sitemap: "...", url: "..."}
Or perhaps even better, a way to prevent the parser from going down certain paths of the sitemap tree based on a regex. My sitemap is date-based and I don't want it to traverse farther back than a few days.
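A sketch of the kind of predicate such a filter could use. isRecentSitemap is a hypothetical helper, and it assumes the child sitemap URLs embed a date as YYYY-MM-DD, which is a guess about the sitemap layout; how the parser would call it is left open.

```javascript
// Assumed URL layout: the child sitemap URL contains a date like 2024-01-08.
const DATE_IN_URL = /(\d{4})-(\d{2})-(\d{2})/;

// Hypothetical helper: true if the sitemap should be followed.
function isRecentSitemap(sitemapUrl, maxAgeDays, now = new Date()) {
  const match = DATE_IN_URL.exec(sitemapUrl);
  if (!match) return true; // no date in the URL: follow it rather than guess

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const sitemapDate = Date.UTC(year, month - 1, day); // months are 0-based
  const ageDays = (now.getTime() - sitemapDate) / (24 * 60 * 60 * 1000);
  return ageDays <= maxAgeDays;
}

// Example with a fixed "now" so the result does not depend on the clock.
const now = new Date('2024-01-10T00:00:00Z');
const recent = isRecentSitemap('https://example.com/sitemap-2024-01-08.xml', 3, now);
const old = isRecentSitemap('https://example.com/sitemap-2023-12-01.xml', 3, now);
const undated = isRecentSitemap('https://example.com/sitemap.xml', 3, now);
console.log(recent, old, undated);
```

The example.com URLs are placeholders; only the date pattern matters to the predicate.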
Just wanted to say that this JS package rocks! There are no issues at all. There ought to be an appreciation thank you tab on Github
My code isn't really set up for stream-based processing right now, and I wanted a version that returns a promise, so I wrote this:
const sitemaps = require('sitemap-stream-parser');

// Collect every URL reported by the parser into an array.
function parseSitemap(sitemapURL) {
  const urls = [];
  return new Promise((resolve, reject) => {
    sitemaps.parseSitemaps(sitemapURL, url => urls.push(url), (err, parsedSitemaps) => {
      // Return after rejecting so resolve is not also called.
      if (err) return reject(err);
      resolve(urls);
    });
  });
}
Could you expose a promise version directly? Happy to submit a PR.
There has been no new release for about 2 years.
When I parse this sitemap: https://gazeta.ua/sitemaps/sitemapindex.xml it fails with:
...node_modules/async/dist/async.js:966
if (fn === null) throw new Error("Callback was already called.");
^
Error: Callback was already called.
at ...node_modules/async/dist/async.js:966:32
at SAXStream.parserStream.on (...node_modules/sitemap-stream-parser/index.js:98:16)
at SAXStream.emit (events.js:197:13)
at SAXParser.SAXStream._parser.onend (...node_modules/sax/lib/sax.js:190:10)
at emit (...node_modules/sax/lib/sax.js:624:35)
at end (...node_modules/sax/lib/sax.js:667:5)
at SAXParser.end (...node_modules/sax/lib/sax.js:154:24)
at SAXStream.end (...node_modules/sax/lib/sax.js:248:18)
at Gzip.onend (_stream_readable.js:655:10)
at Object.onceWrapper (events.js:285:13)
at Gzip.emit (events.js:202:15)
at endReadableNT (_stream_readable.js:1129:12)
at processTicksAndRejections (internal/process/next_tick.js:76:17)
I'm not sure what the reason for this is. Perhaps on(error) and on(end) are both called for a single parsed URL? But how could that be possible?
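If both events do fire for the same stream (which the trace suggests but does not prove), one common guard is to wrap the completion callback so that only the first call goes through. A minimal sketch, with callOnce and parseOne as hypothetical names and a plain EventEmitter standing in for the parser stream:

```javascript
const { EventEmitter } = require('events');

// Wrap a callback so that only its first invocation has any effect.
function callOnce(callback) {
  let called = false;
  return function (...args) {
    if (called) return;
    called = true;
    callback(...args);
  };
}

// Hypothetical per-URL parse step: both handlers share one guarded callback.
function parseOne(stream, done) {
  const finish = callOnce(done);
  stream.on('error', (err) => finish(err));
  stream.on('end', () => finish(null));
}

// Stand-in stream that reports an error and then still ends.
const stream = new EventEmitter();
let calls = 0;
let firstError = null;
parseOne(stream, (err) => {
  calls += 1;
  firstError = err;
});
stream.emit('error', new Error('bad gzip data'));
stream.emit('end');
console.log(calls, firstError && firstError.message);
```

This hides the double call rather than explaining it, so it would be a workaround and not a diagnosis.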
Image and video sitemaps are a little bit different from the standard sitemap. Are they supported? If yes, is all the metadata retrieved from them (for example, image title, video title, video duration, etc.)?