linuxperia / crawler-2
This project forked from drahflow/crawler
I needed a serious web crawler for search engine applications. This is it.
License: Do What The F*ck You Want To Public License
Problem domain: You want to crawl a non-trivial amount of webpages using only your personal computer.

This crawler offers:
* integrated duplicate detection: each unique line is processed only once
* modest memory and CPU requirements:
    400,000,000+ line duplicate cache
    128 different domains in parallel
    10 MBit/s download speed (before removal of duplicates)
    => 1 GB RAM + ~10% of a single core
* stores results into a single stream file, optimal for later batch processing
* short pauses between requests to the same server
* a simplistic HTML "parser"
* asynchronous DNS resolution via libadns
* short and concise program code
* liberal licensing terms

= "Architecture" (i.e. what I wrote before the code) =

Input: List of URLs

Create Bloom filter for seen URLs
Create Bloom filter for seen Lines
For each domain:
  Open output stream
  Fetch robots.txt, parse into patterns
  Create client socket and read / write buffers
  Fetch page
  Parse into lines, check with filter
  Put surviving lines into output stream
  Parse target URLs, check with filter
  Put surviving URLs into search front
    if search front limit per domain has been reached, drop URL
  Cooldown timer
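
= Sketch: duplicate filtering and the search front =

The filtering steps in the architecture above can be illustrated with a small Python sketch. This is not the crawler's actual code. The names BloomFilter, SearchFront, process_page, add_if_new, limit_per_domain and error_rate are made up for illustration, and the hashing scheme (BLAKE2b with double hashing) is an assumption, not something the README specifies. Networking, robots.txt handling and the cooldown timer are left out.

```python
import hashlib
import math
from collections import defaultdict, deque
from urllib.parse import urlsplit


class BloomFilter:
    """Fixed-size bit array probed at k hashed positions per item."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Insert item; return True if it was (probably) not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


class SearchFront:
    """Per-domain URL queues with a hard size limit (hypothetical helper)."""

    def __init__(self, limit_per_domain: int) -> None:
        self.limit = limit_per_domain
        self.queues = defaultdict(deque)

    def offer(self, url: str) -> bool:
        """Queue url under its host; drop it if that host's queue is full."""
        queue = self.queues[urlsplit(url).netloc]
        if len(queue) >= self.limit:
            return False
        queue.append(url)
        return True


def process_page(lines, urls, seen_lines, seen_urls, front, out):
    """Append unseen lines to out and offer unseen URLs to the search front."""
    for line in lines:
        if seen_lines.add_if_new(line):
            out.append(line)
    for url in urls:
        # Assumption: a URL dropped by a full front still counts as seen.
        if seen_urls.add_if_new(url):
            front.offer(url)
```

A Bloom filter can report a new item as already seen, but never the reverse, so a crawler built this way may occasionally skip a unique line or URL in exchange for a fixed memory budget. As rough arithmetic, and assuming most of the stated 1 GB went to the line filter, 400,000,000 lines would get about 20 bits each.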