node-sitemap-stream-parser's People

Contributors

danhunsaker, davidglezz, dependabot-preview[bot], evanderkoogh, knoxcard, max-frai, yarnseemannsgarn

node-sitemap-stream-parser's Issues

A sample output

Hi! Can you please post a sample output of the parseSitemaps function in the README?
For new users it will help a lot to understand what kind of data is parsed, for example loc, lastmod, changefreq, priority, etc.

URLs in <![CDATA[...]]> not handled

Example of a sitemap that will not return any urls: https://www.parashop.com/1_fr_0_sitemap.xml

Extract:

<url>
  <loc><![CDATA[https://www.parashop.com/meilleures-ventes]]></loc>
  <priority>0.1</priority>  
  <changefreq>daily</changefreq>
</url>

This fails because sax does not emit a text event for content inside CDATA; it emits a separate cdata event instead. I believe you can use the same handler for both the cdata and text events and it will process fine.
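
The suggestion above can be sketched as follows. This is not the library's actual code: collectLocs is a hypothetical helper, and the parser is simulated with a plain EventEmitter that fires sax-style events (opentag, closetag, text, cdata), so the snippet runs without installing sax.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: collect <loc> values from any emitter that fires
// sax-style events: 'opentag' ({ name }), 'closetag' (name),
// 'text' (string) and 'cdata' (string).
function collectLocs(parser, onUrl) {
  let insideLoc = false;
  let buffer = '';

  // One handler for both event types: CDATA content is plain text here.
  // Chunks are accumulated because a parser may deliver content in pieces.
  const append = (chunk) => {
    if (insideLoc) buffer += chunk;
  };

  parser.on('opentag', (node) => {
    if (node.name.toLowerCase() === 'loc') {
      insideLoc = true;
      buffer = '';
    }
  });
  parser.on('text', append);
  parser.on('cdata', append);
  parser.on('closetag', (name) => {
    if (name.toLowerCase() === 'loc') {
      insideLoc = false;
      const url = buffer.trim();
      if (url) onUrl(url);
    }
  });
}

// Simulated events for:
// <loc><![CDATA[https://www.parashop.com/meilleures-ventes]]></loc>
const fakeParser = new EventEmitter();
const found = [];
collectLocs(fakeParser, (url) => found.push(url));
fakeParser.emit('opentag', { name: 'loc' });
fakeParser.emit('cdata', 'https://www.parashop.com/meilleures-ventes');
fakeParser.emit('closetag', 'loc');
console.log(found);
```

With a real sax stream, the same function should work if the stream is passed in place of fakeParser, assuming it emits these four events with the payloads described in the comments.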

[IMP]: Respect robots.txt

I found an example, https://booking.com/robots.txt, where sitemaps are marked as disallowed:

Sitemap: https://www.booking.com/sitembk-index-https.xml

User-agent: Baiduspider
Disallow: /sitembk-index-https.xml

I suggest adding a respectRobotsTxt option to the parser that is true by default.
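
As a rough illustration of what such an option would have to check, here is a minimal sketch. isDisallowed is a hypothetical function, not part of the library, and it is deliberately simplified: it does prefix matching only and ignores Allow rules, wildcards and $ anchors.

```javascript
// Hypothetical check: is `path` disallowed for `agent` by this robots.txt text?
// Simplified: prefix matching only; Allow rules, wildcards and $ are ignored.
function isDisallowed(robotsTxt, agent, path) {
  const groups = []; // each group: { agents: [...], disallow: [...] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === 'disallow' || field === 'allow') {
      if (field === 'disallow' && current && value) current.disallow.push(value);
      lastWasAgent = false;
    }
    // Other fields (for example Sitemap) are not rules and are skipped.
  }

  // Prefer a group naming this agent; fall back to the '*' group.
  const name = agent.toLowerCase();
  const group =
    groups.find((g) => g.agents.some((a) => a !== '*' && name.includes(a))) ||
    groups.find((g) => g.agents.includes('*'));
  return Boolean(group) && group.disallow.some((rule) => path.startsWith(rule));
}

// The excerpt quoted in this issue.
const robots = [
  'Sitemap: https://www.booking.com/sitembk-index-https.xml',
  '',
  'User-agent: Baiduspider',
  'Disallow: /sitembk-index-https.xml',
].join('\n');

console.log(isDisallowed(robots, 'Baiduspider', '/sitembk-index-https.xml'));
console.log(isDisallowed(robots, 'MyCrawler', '/sitembk-index-https.xml'));
```

Note that in the quoted excerpt the Disallow rule sits in the Baiduspider group, so under this sketch only that user agent is blocked. A respectRobotsTxt option would therefore also need to know which user agent the parser identifies as.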

Enable strict mode for the SAX parser

Please enable strict mode for the SAX parser. Otherwise the error handler is useless: no error is raised even with invalid XML.
This line: parserStream = sax.createStream false,

High CPU usage with nested sitemaps

I'm experiencing a very high CPU utilization (100%) with large nested sitemaps.

The url callback is very simple: it only increments a counter.

Could this be related to the "blocking" nature of the url (and sitemap) callbacks? If you point me in the right direction, I can contribute to the project.

As an example, you could try this sitemap: https://www.walmart.com/sitemap_ip.xml

Parser does not finish for booking.com

I tried the parser for booking.com with the following code:

var sitemaps = require('sitemap-stream-parser');

sitemaps.sitemapsInRobots('http://booking.com/robots.txt', function(err, urls) {
    if(err || !urls || urls.length == 0)
        return;
    sitemaps.parseSitemaps(urls, console.log, function(err, sitemaps) {
        console.log(sitemaps);
    });
});

The parser runs for a while, but then stops with the following error:

internal/streams/legacy.js:57
throw er; // Unhandled stream error in pipe.
^

Error: ESOCKETTIMEDOUT
at ClientRequest.<anonymous> (.../node_modules/request/request.js:812:19)
at Object.onceWrapper (events.js:275:13)
at ClientRequest.emit (events.js:182:13)
at TLSSocket.emitTimeout (_http_client.js:694:34)
at Object.onceWrapper (events.js:275:13)
at TLSSocket.emit (events.js:182:13)
at TLSSocket.Socket._onTimeout (net.js:447:8)
at ontimeout (timers.js:427:11)
at tryOnTimeout (timers.js:289:5)
at listOnTimeout (timers.js:252:5)
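
The first frames of the trace (internal/streams/legacy.js, "Unhandled stream error in pipe") suggest that a legacy stream was piped without an 'error' listener, so the timeout was rethrown instead of reaching the callback. A minimal sketch of that mechanism, using Node's built-in legacy Stream class as a stand-in for the request stream; pipeSafely is a hypothetical helper, and where such a listener would belong inside the library is not shown in this report:

```javascript
const { Stream, PassThrough } = require('stream');

// Hypothetical helper: register an 'error' listener on the source before
// piping, so a source error is delivered to onError instead of being thrown.
function pipeSafely(source, dest, onError) {
  source.on('error', onError);
  return source.pipe(dest);
}

// Stand-in for the HTTP request stream (a legacy-style Stream).
const source = new Stream();
const errors = [];
pipeSafely(source, new PassThrough(), (err) => errors.push(err.message));

// Simulate the socket timeout from the report.
source.emit('error', new Error('ESOCKETTIMEDOUT'));
console.log(errors);
```

Without the source.on('error', ...) line, the same emit would throw, which matches the crash shown above.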

return name of sitemap that url was returned from?

I'm working with a recursive sitemap, and it would be helpful to return the URL of the sitemap that each item came from. Is this possible?

Ideally I would like this returned: {sitemap: "...", url: "..."}

Or perhaps even better, a way to prevent it from going down certain paths of the sitemap tree based on a regex. My sitemap is date-based and I don't want it to traverse farther back than a few days.
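
To make the second request concrete, here is a sketch of the kind of predicate such a hook could call before following a nested sitemap. Everything in it is an assumption: isRecentSitemap is a hypothetical name, the example.com URLs are made up, and the URL is assumed to embed a date as YYYY-MM-DD. Nothing in this thread indicates that the library currently accepts a filter like this.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical filter: should a nested sitemap be followed?
// Assumes the sitemap URL embeds a date as YYYY-MM-DD (adjust the regex
// for other naming schemes). `now` is injectable to keep the check testable.
function isRecentSitemap(sitemapUrl, maxAgeDays, now = new Date()) {
  const match = /(\d{4})-(\d{2})-(\d{2})/.exec(sitemapUrl);
  if (!match) return true; // no date in the URL: follow it by default

  const sitemapDate = Date.UTC(
    Number(match[1]),
    Number(match[2]) - 1, // JavaScript months are zero-based
    Number(match[3])
  );
  const ageDays = (now.getTime() - sitemapDate) / MS_PER_DAY;
  return ageDays <= maxAgeDays;
}

// Made-up URLs, evaluated against a fixed "today" of 2019-03-10.
const today = new Date(Date.UTC(2019, 2, 10));
console.log(isRecentSitemap('https://example.com/sitemap-2019-03-08.xml', 3, today));
console.log(isRecentSitemap('https://example.com/sitemap-2019-02-01.xml', 3, today));
```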

this rocks!

Just wanted to say that this JS package rocks! There are no issues at all. There ought to be an appreciation/thank-you tab on GitHub.

promise?

My code isn't really set up for stream-based processing right now, and I wanted a version that returns a promise, so I wrote this:

async function parseSitemap(sitemapURL) {
  const urls = [];
  return new Promise((resolve, reject) => {
    // Collect every URL, then settle the promise once parsing has finished.
    sitemaps.parseSitemaps(sitemapURL, url => urls.push(url), (err) => {
      if (err) return reject(err); // stop here so resolve is not also called
      resolve(urls);
    });
  });
}

Could you expose a promise version directly? Happy to submit a PR.

Error: Callback was already called

When I parse this sitemap: https://gazeta.ua/sitemaps/sitemapindex.xml it fails with:

...node_modules/async/dist/async.js:966
        if (fn === null) throw new Error("Callback was already called.");
                         ^

Error: Callback was already called.
    at ...node_modules/async/dist/async.js:966:32
    at SAXStream.parserStream.on (...node_modules/sitemap-stream-parser/index.js:98:16)
    at SAXStream.emit (events.js:197:13)
    at SAXParser.SAXStream._parser.onend (...node_modules/sax/lib/sax.js:190:10)
    at emit (...node_modules/sax/lib/sax.js:624:35)
    at end (...node_modules/sax/lib/sax.js:667:5)
    at SAXParser.end (...node_modules/sax/lib/sax.js:154:24)
    at SAXStream.end (...node_modules/sax/lib/sax.js:248:18)
    at Gzip.onend (_stream_readable.js:655:10)
    at Object.onceWrapper (events.js:285:13)
    at Gzip.emit (events.js:202:15)
    at endReadableNT (_stream_readable.js:1129:12)
    at processTicksAndRejections (internal/process/next_tick.js:76:17)

I'm not sure what the reason for this is. Perhaps on(error) and on(end) are both called for a single parsed URL? But how could that be possible?
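
If that hypothesis is right (both 'error' and 'end' firing for the same stream), one common defensive pattern is to wrap the completion callback so that only its first invocation goes through. A minimal sketch, with a plain EventEmitter standing in for the parser stream; the once helper and the event sequence are illustrative assumptions, not the library's code:

```javascript
const { EventEmitter } = require('events');

// Wrap a callback so that only its first invocation is forwarded.
function once(fn) {
  let called = false;
  return (...args) => {
    if (called) return;
    called = true;
    fn(...args);
  };
}

// Hypothetical stand-in for a parser stream that emits 'error', then 'end'.
const stream = new EventEmitter();
const calls = [];
const done = once((err) => calls.push(err ? err.message : 'ok'));

stream.on('error', done);
stream.on('end', () => done(null));

stream.emit('error', new Error('invalid XML'));
stream.emit('end'); // ignored: done has already been called
console.log(calls);
```

This only hides the double call; it does not explain why both events fire, which would still need to be confirmed against the sitemap in question.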

Are images and video sitemaps supported?

Image and video sitemaps are a little different from standard sitemaps. Are they supported? If so, is all of their metadata retrieved (for example, image title, video title, video duration, etc.)?
