evanderkoogh / node-sitemap-stream-parser

A streaming parser for sitemap files. Is able to deal with deeply nested sitemaps with 100+ million urls in them.

License: Apache License 2.0

CoffeeScript 68.10% JavaScript 31.90%

node-sitemap-stream-parser's Issues

A sample output

Hi! Can you please post a sample output of the parseSitemaps function in the README?
For new users it would help a lot to understand what kind of data is parsed, for example loc, lastmod, changefreq, priority, etc.

URLs in <![CDATA[...]]> not handled

Example of a sitemap that will not return any urls: https://www.parashop.com/1_fr_0_sitemap.xml

Extract:

<url>
  <loc><![CDATA[https://www.parashop.com/meilleures-ventes]]></loc>
  <priority>0.1</priority>  
  <changefreq>daily</changefreq>
</url>

This does not work because sax does not trigger a text event for content inside CDATA; it triggers a separate cdata event instead. I believe you can use the same handler for both the cdata and text events and it will process fine.
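
To make the suggestion concrete, here is a minimal sketch of sharing one handler between the two events. It is not the library's code: collectLocs and onUrl are hypothetical names, and parser is assumed to be a sax-style stream that emits opentag, text, cdata and closetag events. Text is buffered until the closing tag because a single CDATA block may arrive in several pieces.

```javascript
// Sketch only: `parser` is assumed to be a sax-style stream that emits
// 'opentag', 'text', 'cdata' and 'closetag' events. `collectLocs` and
// `onUrl` are hypothetical names, not part of sitemap-stream-parser.
function collectLocs(parser, onUrl) {
  let insideLoc = false;
  let buffer = '';

  // One handler for both events, so plain text and CDATA are treated alike.
  const appendText = (text) => {
    if (insideLoc) buffer += text;
  };

  parser.on('opentag', (node) => {
    // Lowercase the name so the check works whatever case the parser reports.
    if (String(node.name).toLowerCase() === 'loc') {
      insideLoc = true;
      buffer = '';
    }
  });
  parser.on('text', appendText);
  parser.on('cdata', appendText);
  parser.on('closetag', (name) => {
    if (String(name).toLowerCase() === 'loc') {
      insideLoc = false;
      const url = buffer.trim();
      if (url) onUrl(url);
    }
  });
}
```

With this wiring, the loc from the extract above would reach onUrl whether the parser reports it as text or as cdata.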

[IMP]: Respect robots.txt

I found an example, https://booking.com/robots.txt, where a sitemap is listed but also marked as disallowed:

Sitemap: https://www.booking.com/sitembk-index-https.xml

User-agent: Baiduspider
Disallow: /sitembk-index-https.xml

I suggest adding an option respectRobotsTxt to the parser, which would be true by default.
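
A rough sketch of the check such an option could perform is below. Everything in it is an assumption: isDisallowed is a hypothetical helper, respectRobotsTxt does not exist in the library, and the matching is deliberately simplified (plain prefix matching, no wildcards, Allow rules ignored, and rules from every matching group are combined rather than picking the most specific group).

```javascript
// Sketch only: `isDisallowed` is a hypothetical helper for the proposed
// `respectRobotsTxt` option. Simplified: prefix matching only, no `*`/`$`
// wildcards, Allow rules are not applied, matching groups are combined.
function isDisallowed(robotsTxt, userAgent, path) {
  const agent = userAgent.toLowerCase();
  let groupAgents = [];      // user-agents of the group currently being read
  let readingRules = false;  // true once the current group has seen a rule
  const rules = [];

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // A User-agent line after rules starts a new group.
      if (readingRules) {
        groupAgents = [];
        readingRules = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (field === 'allow') {
      readingRules = true;
    } else if (field === 'disallow') {
      readingRules = true;
      const applies = groupAgents.some((a) => a === '*' || agent.includes(a));
      if (applies && value !== '') rules.push(value);
    }
  }
  return rules.some((prefix) => path.startsWith(prefix));
}
```

For the booking.com excerpt above, this check would report the sitemap as disallowed only for a crawler identifying as Baiduspider, so whether the option skips it depends on which user-agent the parser declares.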

Enable strict mode for the SAX parser

Please enable strict mode for the SAX parser. Otherwise the error handler is useless: no error is raised even for invalid XML.
The relevant line: parserStream = sax.createStream false,

High CPU usage with nested sitemaps

I'm experiencing very high CPU utilization (100%) with large nested sitemaps.

The url callback is very simple: it only increments a counter.

Could this be related to the "blocking" nature of the url (and sitemap) callbacks? If you point me in the right direction, I can contribute to the project.

As an example, you could try this sitemap: https://www.walmart.com/sitemap_ip.xml

Parser does not finish for booking.com

I tried the parser for booking.com with the following code:

var sitemaps = require('sitemap-stream-parser');

sitemaps.sitemapsInRobots('http://booking.com/robots.txt', function(err, urls) {
    if(err || !urls || urls.length == 0)
        return;
    sitemaps.parseSitemaps(urls, console.log, function(err, sitemaps) {
        console.log(sitemaps);
    });
});

The parser runs for a while, but then stops with the following error:

internal/streams/legacy.js:57
throw er; // Unhandled stream error in pipe.
^

Error: ESOCKETTIMEDOUT
at ClientRequest.<anonymous> (.../node_modules/request/request.js:812:19)
at Object.onceWrapper (events.js:275:13)
at ClientRequest.emit (events.js:182:13)
at TLSSocket.emitTimeout (_http_client.js:694:34)
at Object.onceWrapper (events.js:275:13)
at TLSSocket.emit (events.js:182:13)
at TLSSocket.Socket._onTimeout (net.js:447:8)
at ontimeout (timers.js:427:11)
at tryOnTimeout (timers.js:289:5)
at listOnTimeout (timers.js:252:5)

return name of sitemap that url was returned from?

I'm working with a recursive sitemap and it would be helpful to get the URL of the sitemap that each item came from. Is this possible?

Ideally I would like this returned: {sitemap: "...", url: "..."}

Or perhaps even better, a way to prevent it from going down certain paths of the sitemap tree based on a regex. My sitemap is date-based and I don't want it to traverse further back than a few days.
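
As an illustration of the second idea, here is a small sketch of a date filter for nested sitemap URLs. It rests on assumptions: isRecentSitemap is a hypothetical helper, the date is assumed to appear in the sitemap URL as YYYY-MM-DD, and the parser would need some hook (not shown in this thread) that consults the filter before fetching a nested sitemap.

```javascript
// Sketch only: assumes each sitemap URL embeds a date as YYYY-MM-DD, which is
// a guess about the layout. `isRecentSitemap` is a hypothetical helper.
function isRecentSitemap(sitemapUrl, maxAgeDays, now = new Date()) {
  const match = /(\d{4})-(\d{2})-(\d{2})/.exec(sitemapUrl);
  if (!match) return true; // no date in the URL: keep traversing

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Compare in UTC so the result does not depend on the local time zone.
  const sitemapTime = Date.UTC(year, month - 1, day);
  const ageDays = (now.getTime() - sitemapTime) / 86400000;
  return ageDays <= maxAgeDays;
}
```

A call such as isRecentSitemap(sitemapUrl, 3) would then decide whether a nested sitemap is worth descending into.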

this rocks!

Just wanted to say that this JS package rocks! There are no issues at all. There ought to be an appreciation or thank-you tab on GitHub.

promise?

my code isn't really set up for stream-based processing right now and I wanted a version that returns a promise, so I wrote this:

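// Collects every URL into an array and resolves once parsing has finished.
async function parseSitemap(sitemapURL) {
  const urls = [];
  return new Promise((resolve, reject) => {
    sitemaps.parseSitemaps(sitemapURL, url => urls.push(url), (err, sitemaps) => {
      // Return after rejecting so resolve is not called as well.
      if (err) return reject(err);
      resolve(urls);
    });
  });
}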

could you expose a promise version directly? happy to submit a PR

create new release

There has been no new release for about two years.

Error: Callback was already called

When I parse this sitemap: https://gazeta.ua/sitemaps/sitemapindex.xml it fails with:

...node_modules/async/dist/async.js:966
        if (fn === null) throw new Error("Callback was already called.");
                         ^

Error: Callback was already called.
    at ...node_modules/async/dist/async.js:966:32
    at SAXStream.parserStream.on (...node_modules/sitemap-stream-parser/index.js:98:16)
    at SAXStream.emit (events.js:197:13)
    at SAXParser.SAXStream._parser.onend (...node_modules/sax/lib/sax.js:190:10)
    at emit (...node_modules/sax/lib/sax.js:624:35)
    at end (...node_modules/sax/lib/sax.js:667:5)
    at SAXParser.end (...node_modules/sax/lib/sax.js:154:24)
    at SAXStream.end (...node_modules/sax/lib/sax.js:248:18)
    at Gzip.onend (_stream_readable.js:655:10)
    at Object.onceWrapper (events.js:285:13)
    at Gzip.emit (events.js:202:15)
    at endReadableNT (_stream_readable.js:1129:12)
    at processTicksAndRejections (internal/process/next_tick.js:76:17)

I'm not sure what the reason is. Perhaps both on(error) and on(end) fire for a single parsed URL? But how could that be possible?
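
If that guess is right (it is not confirmed here), one common defensive pattern is to guard the completion callback so that only the first of the two events reaches it. The sketch below is not the library's code: callOnce and finishOnEndOrError are hypothetical names, and stream stands for any emitter that may fire both error and end.

```javascript
// Sketch only: `callOnce` and `finishOnEndOrError` are hypothetical helpers,
// not taken from sitemap-stream-parser or async.
function callOnce(fn) {
  let called = false;
  return (...args) => {
    if (called) return; // later invocations are ignored
    called = true;
    fn(...args);
  };
}

// Both listeners share one guarded callback, so `done` runs at most once
// even if the stream emits 'error' and then 'end'.
function finishOnEndOrError(stream, done) {
  const finish = callOnce(done);
  stream.on('error', (err) => finish(err));
  stream.on('end', () => finish(null));
}
```

The trade-off is that a second event is silently dropped, so logging inside the guard may be worthwhile while tracking down the root cause.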

Are images and video sitemaps supported?

Image and video sitemaps are a little different from the standard sitemap. Are they supported? If yes, is all of their metadata retrieved (for example, image title, video title, video duration, etc.)?