Comments (17)
I'd be happy to accept contributions in this direction, but at the same time, I'd like to keep this module as minimal as possible. For crawling, a separate module would probably be better. 💭
from scrape-it.
I like the minimalist nature of this module: it makes it very simple to build an object-like data structure from HTML using CSS selectors. There are really nice tools, such as wget, that you could use to recursively mirror a website and then use this library on the pages you are interested in.
Or, since this is a Promise-based API, you could chain crawling steps together with scraping steps without much effort.
@IonicaBizau that said -- can we add support for file:// URLs or for passing in HTML content as a string ourselves?
Or I suppose I could take my own advice and start my own static file server on top of the directory :)
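The Promise chaining idea mentioned above could look roughly like the sketch below. It is only an illustration: crawlThenScrape, scrapeList and scrapeDetail are hypothetical names, and both scrapers are assumed to be caller-supplied functions that return Promises (for example, thin wrappers around this module).

```javascript
// Hypothetical helper: both scrapers are caller-supplied and return Promises.
// scrapeList(url)   -> Promise<string[]>  (detail-page URLs found on the list page)
// scrapeDetail(url) -> Promise<object>    (data scraped from one detail page)
function crawlThenScrape(listUrl, scrapeList, scrapeDetail) {
  return scrapeList(listUrl).then((detailUrls) =>
    // Scrape every detail page; results keep the order of detailUrls.
    Promise.all(detailUrls.map((url) => scrapeDetail(url)))
  );
}
```

Note that Promise.all starts every detail request at once, so this sketch only suits short lists; anything bigger needs some form of rate limiting.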
That's definitely a thing.
I'm not quite sure what the code (the public APIs) would look like.
There is x-ray, which supports a callback in the selector, but I don't think that is the best option: if the page being scraped is big, there could be problems with memory, etc. It's just an example :)
Or have a queue system, like this: https://github.com/missinglink/huntsman
Agreed with @cloud1250000's suggestion.
+1
True!
I think a crawling system might be too much for this... but something similar to a queue system (https://github.com/chuyskywalker/rolling-curl) would be nice in your project!
I know it's PHP, but the main idea is pretty clear.
@cloud1250000 When you say queue system, are you referring to the ability to send multiple URLs to scrape? 💭
Yes! I managed to do it using async! (http://glamour.tweakblogs.net/blog/7818/rate-limited-website-scraping-with-node-punt-js-and-async.html)
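For readers wondering what such a queue could look like without an extra dependency, here is a minimal sketch using only built-in Promises. It is not the approach from the linked post (which uses the async library); runQueue and worker are hypothetical names, and worker is assumed to be any function that takes a URL and returns a Promise.

```javascript
// Run `worker` over `urls`, keeping at most `concurrency` calls in flight.
// Results are returned in the same order as `urls`.
async function runQueue(urls, worker, concurrency = 2) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL to hand out

  // Each lane pulls the next unprocessed URL until none are left.
  async function lane() {
    while (next < urls.length) {
      const index = next++;
      results[index] = await worker(urls[index]);
    }
  }

  const laneCount = Math.min(Math.max(concurrency, 1), urls.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

If any worker call rejects, the returned Promise rejects as well; a real crawler would probably want to catch errors per URL instead.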
@datapimp Would it make sense to have file protocol support in tinyreq itself? If not, then I think it's better to leave the app to read the file and pass the content to this module.
You can use scrapeHTML to scrape a piece of HTML. 💫
Is there a specific case when you'd like to scrape a file? Adding a method doing that would not be a bad idea, probably.
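The "let the app read the file and pass the content along" option could be sketched as follows. scrapeFile is a hypothetical helper, and scrapeHtml is assumed to be a caller-supplied function that accepts an HTML string (for instance a small wrapper around scrapeHTML); its exact signature is an assumption here, not something taken from the module's documentation.

```javascript
const fs = require("fs/promises");

// Hypothetical helper: read a local HTML file and hand its contents to
// `scrapeHtml`, a caller-supplied function (html: string) => scraped data.
async function scrapeFile(filePath, scrapeHtml) {
  const html = await fs.readFile(filePath, "utf8");
  return scrapeHtml(html);
}
```

This keeps file handling in the application, so the module itself would not need file:// support.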
@IonicaBizau one simple case could be when you are scraping a list of articles that shows only the summaries, so you have to "click" on each of those articles in order to get the full text.
Is https://github.com/srveit/mechanize-js not an option here?
I think it would be easier to just do tail -f ./detail-urls.txt | xargs -n1 node your-script.js, which will invoke your-script.js for each "detail" URL.
And then you just have to somehow append new URLs to the file as you go.
Also, doing it this way makes it very easy to "restart" scraping from a specific point (tail -n +100 to start from the 100th line).
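If you would rather keep the "restart from a given line" logic inside the Node script, a small sketch could look like this. urlsFromLine is a hypothetical helper; it assumes the URL file has one URL per line and that startLine is 1-based.

```javascript
// Hypothetical helper: return the URLs found from `startLine` (1-based) onward.
// Blank lines are skipped, but they still count toward line numbers so that
// `startLine` matches what you see in the file.
function urlsFromLine(fileText, startLine = 1) {
  return fileText
    .split(/\r?\n/)
    .slice(Math.max(startLine, 1) - 1)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```

The file contents would typically come from fs.readFileSync("./detail-urls.txt", "utf8"), with the last processed line number stored somewhere between runs.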
@cztomsik Interesting! Maybe your CLI tool can help in that direction too (instead of running node script.js)?
Yep, it would be nice if we had an example project... or at least if we knew what people are using scrape-it for...
https://www.npmjs.com/package/x-ray has a nice pattern for this, for reference.