fergiemcdowall/norch-fetch
Fetch pure HTML from a webserver and save it to disk.
License: MIT
The idea is to:
A: Define the boundaries of the crawl (a site, several sites, a subsite, or a set of subsites)
B: Define an HTML file of start URLs
C: Set the pattern of the URLs to follow
D: Make one or more exclude patterns for C (i.e. avoid following what is ultimately the same page several times)
E: Set the pattern of the URLs to fetch. These can overlap with B, but don't have to.
F: Make one or more exclude patterns for E
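Steps C–F boil down to a URL filter built from include and exclude regexes. A minimal sketch (all names here are hypothetical, not the actual norch-fetch API):

```javascript
// Hypothetical sketch of steps C-F as a single URL filter.
function makeUrlFilter (followPatterns, excludePatterns) {
  return function (url) {
    var followed = followPatterns.some(function (re) { return re.test(url) })
    var excluded = excludePatterns.some(function (re) { return re.test(url) })
    return followed && !excluded
  }
}

// C: follow everything under /docs/; D: but skip sort-order variants
// of what is ultimately the same page.
var shouldFollow = makeUrlFilter(
  [/^http:\/\/example\.com\/docs\//],
  [/\?sort=/]
)
```

The same filter shape works for E/F, just with the fetch patterns instead of the follow patterns.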
It would be useful to get feedback on what my regexp actually does.
Something like: http://regex101.com/#javascript
Hi,
I noticed you are missing a var colors = require('colors'); to enable the coloured reporting.
Thanks
Log HTTP status codes along with the URLs that norch-fetch tries to fetch.
http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
It could be handy to know which URLs are not working properly.
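A minimal sketch of what such a log line could look like (hypothetical helper, not the norch-fetch API), with the code taken from Node's http response object:

```javascript
// Hypothetical helper: prefix each fetched URL with its HTTP status code,
// e.g. "404 http://example.com/missing".
function formatStatusLine (statusCode, url) {
  return statusCode + ' ' + url
}

// Inside the fetch callback this could be used like:
//   http.get(url, function (res) {
//     console.log(formatStatusLine(res.statusCode, url))
//   })
```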
https://twitter.com/shindakun/status/429682295797063682
Maybe we need a .npmignore or a .gitignore in there?
Maybe forage-fetch needs a timer? How long should it wait before following a new link?
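One way to sketch that delay, assuming a fixed wait between fetches (names are hypothetical): compute when each URL should fire, then let setTimeout do the waiting.

```javascript
// Hypothetical sketch: space fetches out by a fixed delay in milliseconds.
// scheduleFetches only computes the firing times; the actual waiting is
// done with setTimeout as shown below.
function scheduleFetches (urls, delayMs) {
  return urls.map(function (url, i) {
    return { url: url, at: i * delayMs }
  })
}

// scheduleFetches(urls, 500).forEach(function (job) {
//   setTimeout(function () { fetchOne(job.url) }, job.at)
// })
```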
It would be handy if the crawler could follow http redirects (301 and 302?).
http://en.wikipedia.org/wiki/HTTP_301
http://en.wikipedia.org/wiki/HTTP_302
This npm package could help us?
https://www.npmjs.org/package/follow-redirects
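The follow-redirects package wraps Node's http/https modules so 3xx responses are followed transparently. A hand-rolled version would need roughly this decision (hypothetical helper, simplified; real code should also cap the number of hops to avoid redirect loops):

```javascript
// Hypothetical sketch: given a response's status code and headers, return
// the URL to follow, or null if this is not a 301/302 redirect.
function redirectTarget (statusCode, headers) {
  if ((statusCode === 301 || statusCode === 302) && headers.location) {
    return headers.location
  }
  return null
}
```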
Not sure what norch-fetch identifies itself as when crawling a site, but it should be possible to define the agent name (norch-fetch), the version, and who is using it (with a link to a page explaining the purpose).
-u --useragent tells the site you crawl who is crawling. Default is norch-fetch [version]
This may be a separate module, norch-receive? Then again, maybe not...
Issue #6 describes a general approach: find pages to click through, and then figure out which to actually crawl/fetch.
Every now and then it's quicker to generate a local HTML file or plain-text file with URLs to crawl/fetch. I want the possibility to feed that file directly to norch-fetch.
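For the plain-text case, a sketch assuming one URL per line (parseUrlList is a hypothetical name, not part of norch-fetch):

```javascript
// Hypothetical sketch: turn a plain-text file into a list of URLs,
// skipping blank lines and #-comments.
function parseUrlList (text) {
  return text.split(/\r?\n/)
    .map(function (line) { return line.trim() })
    .filter(function (line) { return line !== '' && line.charAt(0) !== '#' })
}

// var urls = parseUrlList(require('fs').readFileSync('urls.txt', 'utf8'))
```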
I'm thinking of using some of the norch-fetch functionality through another script. What's the best way forward? Export all functions as-is, or make a new "download" script that does just that? Then other crawlers could use the same download/get function.
There are a lot of different crawlers one could write, from very generic to very specific cases.
One that checks which pages are new since last time, if any.
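The "new since last time" crawler could diff this crawl's URL list against one saved from the previous run (hypothetical sketch, not an existing feature):

```javascript
// Hypothetical sketch: return the URLs present in the current crawl but
// not in the previous one.
function newUrls (previous, current) {
  return current.filter(function (url) {
    return previous.indexOf(url) === -1
  })
}
```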
Running the command returns immediately with undefined; debugging shows it seems to fail around the initial GET. Tried on both Windows 10 and OS X.
-f --followrobotstxt <yes/no> if you want your fetcher to play nice or not
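Playing nice would mean honouring robots.txt Disallow rules. A much-simplified sketch (hypothetical helpers; a real implementation must also respect User-agent groups and wildcard rules):

```javascript
// Hypothetical sketch of a minimal robots.txt check, ignoring User-agent
// grouping: collect every Disallow path, then test path prefixes.
function disallowedPaths (robotsTxt) {
  return robotsTxt.split(/\r?\n/)
    .map(function (line) { return line.trim() })
    .filter(function (line) { return /^Disallow:/i.test(line) })
    .map(function (line) { return line.replace(/^Disallow:\s*/i, '') })
    .filter(function (rule) { return rule !== '' })
}

function allowedByRobots (path, rules) {
  return !rules.some(function (rule) { return path.indexOf(rule) === 0 })
}
```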