Comments (17)

IonicaBizau commented on July 18, 2024

I'd be happy to accept contributions in this direction, but at the same time, I'd like to keep this module as minimal as possible. For crawling, a separate module would probably be better. 💭

from scrape-it.

datapimp commented on July 18, 2024

I like the minimalist nature of this module: it makes it very simple to build an object-like data structure from HTML using CSS selectors. There are really nice tools, such as wget, that you could use to recursively mirror a website, and then use this library on the pages you are interested in.

Or, since this is a Promise-based API, you could chain crawling steps together with scraping steps without much effort.

@IonicaBizau that said -- can we add support for file:// URLs, or for passing in HTML content as a string ourselves?

Or I suppose I could take my own advice and start my own static file server on top of the directory :)

IonicaBizau commented on July 18, 2024

That's definitely a thing.

I'm not quite sure what the code (the public APIs) would look like.

huv1k commented on July 18, 2024

There is x-ray, which has a callback in the selector, but I don't think that is the best option: if the page being scraped is big, there could be problems with memory, etc. It's just an example :)

cloud1250x4 commented on July 18, 2024

Or have a queue system?
Like this: https://github.com/missinglink/huntsman

PSanni commented on July 18, 2024

Agreed with @cloud1250000's suggestion.

sahin commented on July 18, 2024

+1

cloud1250x4 commented on July 18, 2024

True!
I think the crawling system might be too much for this... but something similar to a queue system
(https://github.com/chuyskywalker/rolling-curl) would be nice in your project!

I know it's PHP, but the main idea is pretty clear.

IonicaBizau commented on July 18, 2024

@cloud1250000 When you say queue system, are you referring to functionality for sending multiple URLs to scrape? 💭

cloud1250x4 commented on July 18, 2024

yes! I managed to do it using async!
(http://glamour.tweakblogs.net/blog/7818/rate-limited-website-scraping-with-node-punt-js-and-async.html)
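
For readers who want the "queue" idea without an extra dependency, here is a minimal sketch of running a scraping function over many URLs with a cap on how many requests are in flight. It is not how the linked article or the async library does it; `mapWithLimit` and `worker` are hypothetical names, and `worker` stands in for whatever returns a promise per URL.

```javascript
// Run `worker` over every item, with at most `limit` calls in flight at once.
// `worker` is a hypothetical callback: any function that returns a promise.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has picked up yet

  // Each runner keeps pulling the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results; // same order as `items`
}
```

Since the module exposes a Promise-based API, something like `mapWithLimit(urls, 2, url => scrapeIt(url, opts))` should work, assuming `scrapeIt` and `opts` are set up as usual. If any single call rejects, the whole batch rejects; wrap `worker` in a try/catch if failed URLs should be skipped instead.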

IonicaBizau commented on July 18, 2024

@datapimp Would it make sense to have file protocol support in tinyreq itself? If not, then I think it's better to leave the app to read the file and pass the content to this module.

You can use scrapeHTML to scrape a piece of HTML. 💫

Is there a specific case when you'd like to scrape a file? Adding a method doing that would not be a bad idea, probably.
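
A minimal sketch of the "let the app read the file and pass the content along" approach described above. `scrapeFile` and `scrapeHtml` are hypothetical names: `scrapeHtml` is whatever function turns an HTML string into data, so nothing here is tied to a particular library.

```javascript
const { readFile } = require("node:fs/promises");

// Read a local HTML file and hand its content to a scraping function.
// `scrapeHtml` is a hypothetical callback: (html: string) => data.
async function scrapeFile(path, scrapeHtml) {
  const html = await readFile(path, "utf8");
  return scrapeHtml(html);
}
```

With this module, `scrapeHtml` could plausibly load the string with cheerio and pass the result to scrapeHTML along with the usual options, but that wiring is an assumption and is left out of the sketch.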

ludo237 commented on July 18, 2024

@IonicaBizau one simple case could be when you are scraping a list of articles but they have only the summary so you have to "click" on each of those articles in order to get the current text.
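
The case above (a list page that only has summaries, with the text on each article's own page) can be sketched by chaining two scraping steps. This is only an illustration: `scrapeArticles` and `scrapePage` are hypothetical names, and the `articles`, `url`, and `text` fields are assumed shapes, not anything the module defines.

```javascript
// Scrape a list page, then follow each article link to fetch its text.
// `scrapePage` is a hypothetical callback: (url) => Promise<data>.
// Assumed shapes: the list page yields { articles: [{ title, url }] },
// and each detail page yields { text }.
async function scrapeArticles(listUrl, scrapePage) {
  const list = await scrapePage(listUrl);
  const articles = [];
  for (const item of list.articles) {
    // One detail request at a time, to stay gentle on the target site.
    const detail = await scrapePage(item.url);
    articles.push({ ...item, text: detail.text });
  }
  return articles;
}
```

The loop is sequential on purpose; the detail requests could be run in parallel, at the cost of hitting the site harder.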

mchapman commented on July 18, 2024

Is https://github.com/srveit/mechanize-js not an option here?

cztomsik commented on July 18, 2024

I think it would be easier to just do tail -f ./detail-urls.txt | xargs -n1 node your-script.js which will invoke your-script.js for each "detail" url.

And then you just have to somehow append new urls into the file as you go.

Also, doing it this way makes it very easy to "restart" scraping from a specific point (tail -n +100 to start from the 100th line).

IonicaBizau commented on July 18, 2024

@cztomsik Interesting! Maybe your cli tool can help in that direction too (instead of running node script.js)?

cztomsik commented on July 18, 2024

Yep, it would be nice if we had some example project... or at least if we knew what people are using scrape-it for...

acao commented on July 18, 2024

https://www.npmjs.com/package/x-ray has a nice pattern for this, for reference.
