let4be / crusty-core

A small library for building fast and highly customizable web crawlers.

License: GNU General Public License v3.0
We should not mangle data returned by the server if there were no errors in getting it.
Actual transport errors are different from errors raised in filters/expanders.
We could simplify things and speed them up:
=> spawn several Crawlers in their own threads; we already handle job delegation via channels.
This way we have fewer internal/external Send/Sync restrictions, and we do not bother tokio with work-stealing shenanigans, which most likely hurt more than help in our use case.
After this is implemented, we can review let4be/crusty#35.
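The thread-per-crawler idea above can be sketched in plain std (the `Job` type and `delegate` function are hypothetical stand-ins, not the crusty-core API): each crawler owns its thread and the receiving end of its own channel, and jobs are delegated round-robin.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical minimal job; the real crusty-core Job carries far more state.
struct Job {
    url: String,
}

// Spawn `n_workers` crawler threads, each owning its own channel receiver;
// delegate jobs round-robin and return how many jobs each worker processed.
fn delegate(urls: Vec<String>, n_workers: usize) -> Vec<usize> {
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..n_workers {
        let (tx, rx) = mpsc::channel::<Job>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            let mut processed = 0;
            // The crawler loop lives entirely on this thread: fewer Send/Sync
            // restrictions, and no work-stealing scheduler in the way.
            while let Ok(_job) = rx.recv() {
                processed += 1;
            }
            processed
        }));
    }

    for (i, url) in urls.into_iter().enumerate() {
        senders[i % n_workers].send(Job { url }).unwrap();
    }
    drop(senders); // closing the channels lets the workers exit

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let urls = (0..8).map(|i| format!("https://example.com/{i}")).collect();
    let counts = delegate(urls, 4);
    println!("{counts:?}"); // 8 jobs over 4 workers -> [2, 2, 2, 2]
}
```

With per-worker channels the delegation policy stays in one place, and nothing about the worker loop needs to be `Sync`.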
We could probably use Actix here and there.
Right now we use select for HTML parsing; it's nice and everything, but there are different use cases.
For low-volume crawling it might be a very good fit, but for broad crawling I'm considering switching to something lower level. So crusty-core should support configurable HTML parsers (and properly propagate them to task_expanders via generics).
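One way the "configurable parser via generics" idea could look (all names here are hypothetical, not the actual crusty-core API): a small trait that link extraction goes through, with the crawler generic over the implementation.

```rust
// Sketch of a parser abstraction: task expanders receive links through
// whatever parser the user plugs in via a generic parameter.
trait HtmlParser {
    fn extract_links(&self, html: &str) -> Vec<String>;
}

// A deliberately naive parser standing in for `select` or a lower-level one.
struct NaiveParser;

impl HtmlParser for NaiveParser {
    fn extract_links(&self, html: &str) -> Vec<String> {
        // Pull out href="..." values with plain string scanning; a real
        // implementation would use an actual HTML parser.
        let mut links = Vec::new();
        let mut rest = html;
        while let Some(i) = rest.find("href=\"") {
            rest = &rest[i + 6..];
            if let Some(j) = rest.find('"') {
                links.push(rest[..j].to_string());
                rest = &rest[j + 1..];
            } else {
                break;
            }
        }
        links
    }
}

// The crawler is generic over the parser, so broad crawls can swap in a
// cheaper implementation without touching task-expander code.
struct Crawler<P: HtmlParser> {
    parser: P,
}

impl<P: HtmlParser> Crawler<P> {
    fn expand(&self, html: &str) -> Vec<String> {
        self.parser.extract_links(html)
    }
}

fn main() {
    let c = Crawler { parser: NaiveParser };
    let links = c.expand(r#"<a href="https://a.example/">a</a> <a href="/b">b</a>"#);
    println!("{links:?}"); // ["https://a.example/", "/b"]
}
```

The generic is resolved at compile time, so the low-level parser costs nothing extra in the broad-crawl path.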
Built on top of StaticAsyncResolver: try to resolve DNS by sending a request and awaiting the response on a channel, within a timeout.
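The channel-plus-timeout pattern can be sketched in plain std (the `await_resolution` name is hypothetical, not the StaticAsyncResolver API): run the lookup on its own thread, send the result over a channel, and bound the wait with `recv_timeout` so a stuck resolver cannot stall the caller.

```rust
use std::net::IpAddr;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `lookup` on another thread and wait at most `timeout` for its answer.
fn await_resolution<F>(lookup: F, timeout: Duration) -> Option<Vec<IpAddr>>
where
    F: FnOnce() -> Vec<IpAddr> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If we already timed out, the receiver is gone and send fails; ignore it.
        let _ = tx.send(lookup());
    });
    rx.recv_timeout(timeout).ok()
}

fn main() {
    // A fast "resolver" answers in time...
    let fast = await_resolution(
        || vec!["127.0.0.1".parse().unwrap()],
        Duration::from_millis(500),
    );
    println!("fast: {fast:?}");

    // ...while a stuck one is cut off by the timeout.
    let stuck = await_resolution(
        || {
            thread::sleep(Duration::from_secs(5));
            Vec::new()
        },
        Duration::from_millis(50),
    );
    println!("stuck: {stuck:?}"); // None
}
```

Note the stuck lookup thread is simply abandoned; in a real resolver you would want a cancellation story so such threads don't pile up.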
When the server returns 5xx or 429 (Too Many Requests) we should slow down.
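A minimal sketch of such a slowdown policy (`backoff_for_status` is a hypothetical helper, not crusty-core API): 429 and 5xx responses grow the per-host delay exponentially, everything else keeps the configured base pace.

```rust
use std::time::Duration;

// Map a status code to the delay before the next request to the same host.
fn backoff_for_status(status: u16, base: Duration, consecutive_errors: u32) -> Duration {
    match status {
        429 | 500..=599 => {
            // Exponential backoff, capped so one flaky host can't park a worker forever.
            let factor = 2u32.saturating_pow(consecutive_errors.min(6));
            base.saturating_mul(factor)
        }
        _ => base,
    }
}

fn main() {
    let base = Duration::from_millis(250);
    println!("{:?}", backoff_for_status(200, base, 0)); // 250ms
    println!("{:?}", backoff_for_status(429, base, 1)); // 500ms
    println!("{:?}", backoff_for_status(503, base, 3)); // 2s
}
```

Capping the exponent (here at 2^6) bounds the worst-case delay while still backing off quickly on repeated errors.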
A Job cannot be considered completed OK when the root task got cooked by some error.
Recently I switched from Rust's standard derive to the excellent crate https://github.com/mcarton/rust-derivative (because of a weird issue with Clone bounds on generics): https://mcarton.github.io/rust-derivative/latest/index.html
This could probably make the code a bit cleaner and get rid of some manual implementations of Default.
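The Clone-bound issue in miniature, shown with std only (`Shared` and `NotClone` are illustrative names): `#[derive(Clone)]` on a generic struct adds a `T: Clone` bound even when `T` only ever lives behind an `Arc`, which is always cloneable. The manual impl below carries the correct (empty) bound, which is roughly what derivative generates via `#[derivative(Clone(bound = ""))]`.

```rust
use std::sync::Arc;

struct NotClone; // deliberately not Clone

// #[derive(Clone)] here would require `T: Clone` and fail for T = NotClone.
struct Shared<T> {
    inner: Arc<T>,
}

// Manual impl with no `T: Clone` bound: cloning only bumps the Arc refcount.
impl<T> Clone for Shared<T> {
    fn clone(&self) -> Self {
        Shared { inner: Arc::clone(&self.inner) }
    }
}

fn main() {
    let a = Shared { inner: Arc::new(NotClone) };
    let b = a.clone(); // works even though NotClone is not Clone
    println!("refcount = {}", Arc::strong_count(&b.inner)); // 2
}
```

derivative lets you keep the convenience of a derive while controlling these bounds, instead of hand-writing impls like the one above.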
This would allow for fancier filtering and additional logic; for example, in a broad web crawler we might want to "discover" only resolvable external domains, and crusty-core could handle this resolution.
It should also be useful for IP/subnet filtering, if we want to restrict crawling of certain subnets (not necessarily just reserved IPv4/IPv6 addresses).
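An address filter along those lines can be sketched with std's `IpAddr` classification methods (`is_crawlable` is a hypothetical name): reject addresses a broad crawler should never fetch, such as loopback, private, and link-local ranges.

```rust
use std::net::IpAddr;

// Return true only for addresses that look like public, fetchable hosts.
fn is_crawlable(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_broadcast()
                || v4.is_unspecified())
        }
        // IPv6 has more reserved ranges than std exposes as stable methods;
        // a real filter would also check unique-local (fc00::/7) etc.
        IpAddr::V6(v6) => !(v6.is_loopback() || v6.is_unspecified()),
    }
}

fn main() {
    println!("{}", is_crawlable("93.184.216.34".parse().unwrap())); // true
    println!("{}", is_crawlable("10.0.0.1".parse().unwrap()));      // false
    println!("{}", is_crawlable("127.0.0.1".parse().unwrap()));     // false
}
```

Subnet-level rules (e.g. an operator-supplied deny list) would layer on top of this baseline check.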
Kind of gets dragged by #8
So that we could `cargo run --example find_duplicate_titles`.
For broad web crawling we probably do not need any concurrency within a single job... which means we can save a bunch of resources and annoy site owners less...
Additionally, I'm considering using this in a so-called "breeder" - a dedicated non-concurrent web crawler whose purposes are to:
- resolve DNS with a `StaticDnsResolver` and fill `addr_key`, see #14
- `HEAD` the index page to figure out if there are any redirects (if allowed by robots.txt)
Jobs that survive a `HEAD` of the index page are considered "breeded": all jobs extracted from JobQ will be added to a breeder first and only then to a typical web crawler (if they survive the breeding process).
(breeder and regular web crawler will have quite different rules and settings)
Figure out where we can use `&str` instead of `String` and get rid of excessive memory allocations / `.clone()` calls.

This should be happening in status_filter (when the redirect is being emitted), not in task_filter as it is now, so that we could properly record a typed error and later compare against it in Crusty (proper metrics).
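A sketch of what a typed filter error could look like (all names hypothetical, not the crusty-core API): instead of a stringly error, the filter emits an enum variant that Crusty can match on when building metrics, without any string parsing.

```rust
use std::fmt;

// Typed reasons a task can be filtered out.
#[derive(Debug, PartialEq)]
enum FilterTerm {
    RedirectedOffDomain { to: String },
    RobotsDisallowed,
    TooDeep { depth: usize },
}

impl fmt::Display for FilterTerm {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FilterTerm::RedirectedOffDomain { to } => write!(f, "redirected off-domain to {to}"),
            FilterTerm::RobotsDisallowed => write!(f, "disallowed by robots.txt"),
            FilterTerm::TooDeep { depth } => write!(f, "exceeded max depth {depth}"),
        }
    }
}

// Metrics side: a typed error buckets cleanly into a counter key.
fn metric_key(term: &FilterTerm) -> &'static str {
    match term {
        FilterTerm::RedirectedOffDomain { .. } => "filter.redirect_off_domain",
        FilterTerm::RobotsDisallowed => "filter.robots",
        FilterTerm::TooDeep { .. } => "filter.too_deep",
    }
}

fn main() {
    let term = FilterTerm::RedirectedOffDomain { to: "https://other.example/".into() };
    println!("{term} -> {}", metric_key(&term));
}
```

The exhaustive `match` also means adding a new filter reason forces the metrics code to handle it, which plain string errors never would.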