It would be nice to add support for crawling pages and following links. I suggest: …

Add support for crawling about scrape-it HOT 17 CLOSED

ionicabizau commented on July 18, 2024

Add support for crawling

from scrape-it.

Comments (17)

IonicaBizau commented on July 18, 2024

I'd be happy to accept contributions in this direction, but at the same time, I'd like to keep this module as minimal as possible. For crawling, a separate module would probably be better. 💭

datapimp commented on July 18, 2024

I like the minimalist nature of this module: it makes it very simple to build an object-like data structure from HTML using CSS selectors. There are really nice tools, such as wget, that you could use to recursively mirror a website and then use this library on the pages you are interested in.

Or, since this is a Promise-based API, you could chain crawling steps together with scraping steps without much effort.

@IonicaBizau that said -- can we add support for file:// URLs or for passing in HTML content as a string ourselves?

Or I suppose I could take my own advice and start my own static file server on top of the directory :)
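
A minimal sketch of the "chain crawling steps with scraping steps" idea above, written as a small breadth-first loop. Note that `crawl` and `scrapePage` are hypothetical names and not part of scrape-it: `scrapePage(url)` is assumed to resolve to `{ data, links }`, where `links` holds the absolute URLs to follow next.

```javascript
// Hypothetical helper, not a scrape-it API.
// `scrapePage(url)` is assumed to resolve to { data, links }.
async function crawl(startUrl, scrapePage, maxPages = 50) {
  const seen = new Set([startUrl]); // URLs already queued, to avoid loops
  const queue = [startUrl];
  const results = [];

  // Visit pages in breadth-first order until the queue is empty
  // or the page limit is reached.
  while (queue.length > 0 && results.length < maxPages) {
    const url = queue.shift();
    const { data, links } = await scrapePage(url);
    results.push({ url, data });

    for (const link of links) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return results;
}
```

In practice `scrapePage` could wrap a scraping call whose selectors also collect the `href` attributes to follow; that wiring is left out here because it depends on the page being scraped.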

IonicaBizau commented on July 18, 2024

That's definitely a thing.

I'm not quite sure what the code would look like (the public APIs).

huv1k commented on July 18, 2024

There is x-ray, and they have a callback in the selector, but I don't think it's the best option, because if the page being scraped is big there could be problems with memory, etc. It's just an example :)

cloud1250x4 commented on July 18, 2024

Or have a queue system?
Like this: https://github.com/missinglink/huntsman

PSanni commented on July 18, 2024

Agreed with @cloud1250000's suggestion.

sahin commented on July 18, 2024

cloud1250x4 commented on July 18, 2024

True!
I think a crawling system might be too much for this... but something similar to a queue system
(https://github.com/chuyskywalker/rolling-curl) would be nice in your project!

I know it's PHP, but the main idea is pretty clear.

IonicaBizau commented on July 18, 2024

@cloud1250000 When you say queue system, are you referring to the ability to send multiple URLs to scrape? 💭

cloud1250x4 commented on July 18, 2024

Yes! I managed to do it using async!
(http://glamour.tweakblogs.net/blog/7818/rate-limited-website-scraping-with-node-punt-js-and-async.html)
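
As a rough illustration of the queue idea discussed above (scraping many URLs while keeping only a few requests in flight), here is a sketch using plain Promises rather than the async library mentioned in the comment. `scrapeAll` and `scrapeOne` are hypothetical names: `scrapeOne(url)` stands in for whatever function scrapes a single page and returns a Promise.

```javascript
// Hypothetical helper, not part of scrape-it or the async library.
// Runs `scrapeOne` over all URLs with at most `concurrency` calls in flight.
async function scrapeAll(urls, scrapeOne, concurrency = 2) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL to hand out

  // Each worker keeps pulling the next URL until none are left.
  async function worker() {
    while (next < urls.length) {
      const index = next++;
      results[index] = await scrapeOne(urls[index]);
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, urls.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results; // same order as `urls`
}
```

If any `scrapeOne` call rejects, the returned Promise rejects as well; a real queue would probably want retries or per-URL error capture, which this sketch leaves out.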

IonicaBizau commented on July 18, 2024

@datapimp Would it make sense to have file protocol support in tinyreq itself? If not, then I think it's better to leave it to the app to read the file and pass the content to this module.

You can use scrapeHTML to scrape a piece of HTML. 💫

Is there a specific case when you'd like to scrape a file? Adding a method doing that would not be a bad idea, probably.

ludo237 commented on July 18, 2024

@IonicaBizau one simple case could be when you are scraping a list of articles that only shows summaries, so you have to "click" on each of those articles in order to get the full text.
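
The list-then-detail case described above could be sketched roughly as follows. All names here are assumptions for illustration: `scrapeList(url)` is assumed to resolve to an array of `{ title, url }` summaries and `scrapeArticle(url)` to an object with the article's fields; neither is a scrape-it API.

```javascript
// Hypothetical helper: scrape a list page, then "click" into each article.
// `scrapeList` and `scrapeArticle` are caller-supplied functions returning Promises.
async function scrapeArticles(listUrl, scrapeList, scrapeArticle) {
  const summaries = await scrapeList(listUrl);
  const articles = [];

  // Fetch the detail pages one at a time to stay gentle on the server.
  for (const summary of summaries) {
    const detail = await scrapeArticle(summary.url);
    articles.push({ ...summary, ...detail });
  }
  return articles;
}
```

Fetching sequentially is a deliberate choice in this sketch; the detail requests could also be run with a bounded level of concurrency if the site tolerates it.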

mchapman commented on July 18, 2024

Is https://github.com/srveit/mechanize-js not an option here?

cztomsik commented on July 18, 2024

I think it would be easier to just do tail -f ./detail-urls.txt | xargs -n1 node your-script.js which will invoke your-script.js for each "detail" URL.

And then you just have to somehow append new URLs to the file as you go.

Also, doing it this way makes it very easy to "restart" scraping from a specific point (use tail -n +100 to start from the 100th line).
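
For anyone who prefers to keep the restart logic inside Node rather than in the shell, a minimal sketch of reading such a URL file from a given line might look like this. `readUrlQueue` is a hypothetical name; unlike `tail -f`, it reads the file once and does not follow new lines as they are appended.

```javascript
const fs = require("fs");

// Hypothetical helper: read a newline-separated URL file and return the URLs
// from `startLine` (1-based) onward, skipping blank lines.
function readUrlQueue(path, startLine = 1) {
  return fs
    .readFileSync(path, "utf8")
    .split("\n")
    .slice(startLine - 1)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```

Calling `readUrlQueue("./detail-urls.txt", 100)` would resume from the 100th line of the file, mirroring the `tail -n +100` approach.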

IonicaBizau commented on July 18, 2024

@cztomsik Interesting! Maybe your cli tool can help in that direction too (instead of running node script.js)?

cztomsik commented on July 18, 2024

Yep, it would be nice if we had some example project... Or at least if we knew what people are using scrape-it for...

acao commented on July 18, 2024

https://www.npmjs.com/package/x-ray has a nice pattern for this, for reference.
