let4be / crusty-core

A small library for building fast and highly customizable web crawlers

License: GNU General Public License v3.0

Rust 99.39% Shell 0.61%

async rust scraper tokio web-crawler

crusty-core's Issues

Rethink format of exposed data

We should not mangle the data returned by the server if there were no errors in getting it.
Actual errors are different from errors in filters/expanders.

Consider lifting Send+Sync restrictions and migrating to a single thread tokio runtime

We could simplify and speed things up:
spawn several Crawlers in their own threads; we already handle job delegation via channels.
This way we have fewer internal/external Send/Sync restrictions, and we do not bother tokio with work-stealing shenanigans, which most likely hurt more than help in our use case.

After this is implemented, we can review let4be/crusty#35.
We could probably use Actix here and there.
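
The thread-per-crawler layout can be sketched with the standard library alone. This is only an analog under stated assumptions: in the design discussed here each thread would host its own single-threaded tokio runtime, whereas the sketch uses plain blocking threads. `crawl_all` and the round-robin delegation are hypothetical, not crusty-core API. The point it shows is that per-thread state may be `!Send` (`Rc`/`RefCell`) because it never crosses a thread boundary.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: runs `workers` crawler threads, each fed by its own
/// channel, and returns `(worker_id, url)` pairs sorted for determinism.
pub fn crawl_all(jobs: Vec<String>, workers: usize) -> Vec<(usize, String)> {
    let workers = workers.max(1);
    let (done_tx, done_rx) = mpsc::channel();
    let mut job_txs = Vec::new();
    let mut handles = Vec::new();

    for id in 0..workers {
        let (job_tx, job_rx) = mpsc::channel::<String>();
        let done_tx = done_tx.clone();
        job_txs.push(job_tx);
        handles.push(thread::spawn(move || {
            // Created inside the thread, so it never has to be Send or Sync.
            let seen = Rc::new(RefCell::new(Vec::new()));
            for url in job_rx {
                seen.borrow_mut().push(url.clone());
                // Placeholder for the actual crawl of `url`.
                done_tx.send((id, url)).expect("collector is alive");
            }
        }));
    }
    // Drop the original sender so `done_rx` closes once all workers exit.
    drop(done_tx);

    // Job delegation via channels, round-robin across workers.
    for (i, job) in jobs.into_iter().enumerate() {
        job_txs[i % workers].send(job).expect("worker is alive");
    }
    // Closing the job channels lets every worker loop terminate.
    drop(job_txs);

    let mut results: Vec<(usize, String)> = done_rx.iter().collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    results.sort();
    results
}
```

No work stealing happens here: a job stays on the thread it was delegated to, which is the trade-off the issue argues for.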

Support for fully customizable parsing

Right now we use select for HTML parsing. It's nice and everything, but there are different use cases.

For low-volume crawling it might be a very good fit, but for broad crawling I'm considering switching to something lower level. So crusty-core should support configurable HTML parsers (and properly propagate them to task_expanders via generics).
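
One possible shape for this, as a sketch only: the parser is a trait with an associated document type, and task expanders are generic over that type. `HtmlParser`, `TaskExpander`, `HrefScanner`, `AbsoluteLinks` and `run` are all hypothetical names, not existing crusty-core items, and the "parser" is deliberately naive.

```rust
/// Hypothetical abstraction over an HTML parser.
pub trait HtmlParser {
    /// Parsed representation handed to task expanders.
    type Document;
    fn parse(&self, html: &str) -> Self::Document;
}

/// A task expander is generic over the document type, so it works with
/// any parser that produces that type.
pub trait TaskExpander<D> {
    fn expand(&self, doc: &D) -> Vec<String>;
}

/// Naive low-level "parser": collects raw href="..." values without
/// building a DOM. Illustration only.
pub struct HrefScanner;

impl HtmlParser for HrefScanner {
    type Document = Vec<String>;

    fn parse(&self, html: &str) -> Vec<String> {
        let needle = "href=\"";
        let mut links = Vec::new();
        let mut rest = html;
        while let Some(start) = rest.find(needle) {
            rest = &rest[start + needle.len()..];
            match rest.find('"') {
                Some(end) => {
                    links.push(rest[..end].to_owned());
                    rest = &rest[end + 1..];
                }
                // Unterminated attribute: stop scanning.
                None => break,
            }
        }
        links
    }
}

/// Expander that keeps only absolute http(s) links.
pub struct AbsoluteLinks;

impl TaskExpander<Vec<String>> for AbsoluteLinks {
    fn expand(&self, doc: &Vec<String>) -> Vec<String> {
        doc.iter()
            .filter(|l| l.starts_with("http://") || l.starts_with("https://"))
            .cloned()
            .collect()
    }
}

/// Parse once, then run the expander over the parser's document type.
pub fn run<P, E>(parser: &P, expander: &E, html: &str) -> Vec<String>
where
    P: HtmlParser,
    E: TaskExpander<P::Document>,
{
    expander.expand(&parser.parse(html))
}
```

Swapping in a DOM-based parser would then mean providing another `HtmlParser` impl with its own `Document` type, plus expanders written against that type.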

Async channel based DNS resolver

Built on top of StaticAsyncResolver: resolve DNS by sending a request over a channel and awaiting the response on a channel, within a timeout.
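
The request/response-over-channels pattern with a timeout might look roughly like the following. This is a blocking, std-only analog (the resolver described here would be async), and every name in it (`DnsRequest`, `ResolveError`, `resolve`, `spawn_stub_resolver`) is made up for illustration. The stub resolver answers from a fixed table instead of doing real DNS.

```rust
use std::net::{IpAddr, Ipv4Addr};
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// A resolution request: the host to look up plus a reply channel.
pub struct DnsRequest {
    pub host: String,
    pub reply: Sender<Option<Vec<IpAddr>>>,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    /// The resolver side has shut down.
    Disconnected,
    /// No reply arrived within the timeout.
    Timeout,
    /// The resolver replied but has no addresses for the host.
    NotFound,
}

/// Client half: send a request, then wait for the reply, bounded by `timeout`.
pub fn resolve(
    tx: &Sender<DnsRequest>,
    host: &str,
    timeout: Duration,
) -> Result<Vec<IpAddr>, ResolveError> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(DnsRequest { host: host.to_owned(), reply: reply_tx })
        .map_err(|_| ResolveError::Disconnected)?;
    match reply_rx.recv_timeout(timeout) {
        Ok(Some(addrs)) => Ok(addrs),
        Ok(None) => Err(ResolveError::NotFound),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(ResolveError::Timeout),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(ResolveError::Disconnected),
    }
}

/// Stand-in resolver thread that answers from a fixed table.
pub fn spawn_stub_resolver() -> Sender<DnsRequest> {
    let (tx, rx) = mpsc::channel::<DnsRequest>();
    thread::spawn(move || {
        for req in rx {
            let answer = match req.host.as_str() {
                "example.test" => Some(vec![IpAddr::V4(Ipv4Addr::new(192, 0, 2, 1))]),
                _ => None,
            };
            // The requester may have timed out and dropped its receiver.
            let _ = req.reply.send(answer);
        }
    });
    tx
}
```

A caller would hold the `Sender<DnsRequest>` and call `resolve(&tx, host, timeout)` per lookup; the resolver thread exits once every sender is dropped.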

Add exponential backoff

When a server returns 5xx or 429 (Too Many Requests), we should slow down.
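
A minimal sketch of such a policy, assuming a doubling delay with a cap. The `Backoff` type, its fields and the choice of doubling are assumptions for illustration; nothing here is taken from crusty-core.

```rust
use std::time::Duration;

/// Hypothetical backoff policy: names and values are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct Backoff {
    /// Delay after the first bad response.
    pub base: Duration,
    /// Upper bound on any delay.
    pub max: Duration,
}

impl Backoff {
    /// True for the statuses named above: any 5xx, or 429.
    pub fn should_back_off(status: u16) -> bool {
        status == 429 || (500..=599).contains(&status)
    }

    /// Delay after `failures` consecutive bad responses:
    /// base * 2^(failures - 1), capped at `max`. Zero failures means no delay.
    pub fn delay(&self, failures: u32) -> Duration {
        if failures == 0 {
            return Duration::ZERO;
        }
        // Clamp the exponent so the shift cannot overflow a u32.
        let exp = (failures - 1).min(31);
        let factor = 1u32 << exp;
        self.base
            .checked_mul(factor)
            .map_or(self.max, |d| d.min(self.max))
    }
}
```

The failure counter would be reset to zero after a successful response, so the delay only grows while the server keeps signalling overload. Adding random jitter is a common refinement that this sketch leaves out.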

Should propagate root_task errors to job status

A job cannot be considered completed OK when its root task failed with an error.
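
As a rough illustration of the intended rule, with hypothetical types (`TaskResult`, `JobStatus`) that are not crusty-core's own: a failed root task fails the whole job, while failures in non-root tasks do not.

```rust
/// Hypothetical job outcome.
#[derive(Debug, PartialEq)]
pub enum JobStatus {
    Finished { tasks_ok: usize },
    Failed { reason: String },
}

/// Hypothetical per-task record.
pub struct TaskResult {
    pub is_root: bool,
    pub outcome: Result<(), String>,
}

/// The root task's error, if any, decides the job status.
pub fn job_status(results: &[TaskResult]) -> JobStatus {
    let root_error = results
        .iter()
        .find(|r| r.is_root)
        .and_then(|r| r.outcome.as_ref().err());
    match root_error {
        Some(reason) => JobStatus::Failed { reason: reason.clone() },
        None => JobStatus::Finished {
            tasks_ok: results.iter().filter(|r| r.outcome.is_ok()).count(),
        },
    }
}
```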

Extend usage of derivative

Recently I switched from Rust's standard derive to the excellent crate https://github.com/mcarton/rust-derivative (because of a weird issue with generic bounds on Clone)
https://mcarton.github.io/rust-derivative/latest/index.html

This could probably make the code a bit cleaner and get rid of some manual implementations of Default.

DNS resolving should be handled as a separate LinkTarget

This would allow for fancier filtering and additional logic. For example, in a broad web crawler we might want to "discover" only resolvable external domains, and crusty-core could handle this resolution.

It should also be useful for IP/subnet filtering, for example if we want to restrict crawling of certain subnets (not necessarily just reserved IPv4/IPv6 addresses).
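
The IP/subnet side of this could be sketched with `std::net` alone. The set of ranges treated as restricted below is an assumption, and `is_restricted` and `in_v4_subnet` are hypothetical helpers rather than crusty-core functions.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// True when an address falls in a range a broad crawler would typically
/// refuse to fetch from. The exact list is an assumption.
pub fn is_restricted(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()
                || v4.is_loopback()
                || v4.is_link_local()
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                // fc00::/7, unique local addresses
                || (first & 0xfe00) == 0xfc00
                // fe80::/10, link-local unicast
                || (first & 0xffc0) == 0xfe80
        }
    }
}

/// Custom IPv4 subnet check for operator-defined ranges (CIDR style).
pub fn in_v4_subnet(ip: Ipv4Addr, net: Ipv4Addr, prefix: u8) -> bool {
    let prefix = u32::from(prefix.min(32));
    // A /0 prefix matches everything; avoid shifting by 32 bits.
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    (u32::from(ip) & mask) == (u32::from(net) & mask)
}
```

With DNS resolution exposed as its own step, a filter like this could run on the resolved addresses before any HTTP request is made.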

Rework returned errors

This largely gets pulled along by #8.

Fix repo to properly support `cargo examples`

So that we can run `cargo run --example find_duplicate_titles`.

Implement non concurrent crawler

For broad web crawling we probably do not need any concurrency within a single job, which means we can save a fair amount of resources and annoy site owners less.
Additionally, I'm considering using this in a so-called "breeder": a dedicated non-concurrent web crawler whose purposes are to

download and parse robots.txt, while
resolving redirects
resolving additional DNS requests (if any), as long as they fall within the same addr_key, see #14
HEAD the index page to figure out if there are any redirects (if allowed by robots.txt)

Jobs that resolved all DNS (within our restrictions) and successfully HEAD the index page are considered "breeded".

All jobs extracted from JobQ will be added to a breeder first and only then to a typical web crawler (if they survive the breeding process) with a StaticDnsResolver. The breeder and the regular web crawler will have quite different rules and settings.

Unify calling and handling of filters

There's a big chunk of duplicated functionality between calling status_filters/load_filters and task_expanders.
As soon as #9 and #8 are ready, this functionality becomes generic enough to be consolidated.

Cleanup code

Simplify String/&str usage and get rid of excessive memory allocations
Get rid of excessive .clone() calls

Improve status_filters and their errors

In particular, we should change how we handle max_redirect.

This should happen in status_filter (when the redirect is emitted), not in task_filter as it does now,
so that we can properly record a typed error and later compare against it in Crusty (proper metrics).

The redirect status filter should always emit TERM when a redirect is detected.
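
A sketch of what a typed outcome could look like. `FilterOutcome`, `TermReason` and `redirect_status_filter` are hypothetical names, and the set of status codes treated as redirects is an assumption. The filter emits a terminating outcome for every redirect, and the max_redirect case carries its own typed reason that a caller can compare against.

```rust
/// Hypothetical typed reason attached to a terminating outcome.
#[derive(Debug, PartialEq)]
pub enum TermReason {
    /// A redirect was detected; the target is scheduled as a new task.
    Redirect,
    /// The redirect chain already reached the configured limit.
    MaxRedirect { limit: u32 },
}

/// Hypothetical filter result: `Term` stands in for the TERM outcome.
#[derive(Debug, PartialEq)]
pub enum FilterOutcome {
    Continue,
    Term(TermReason),
}

/// Status filter: always terminates on a redirect, with a typed reason.
/// `redirects_so_far` counts redirects already followed in this chain.
pub fn redirect_status_filter(
    status: u16,
    redirects_so_far: u32,
    max_redirect: u32,
) -> FilterOutcome {
    let is_redirect = matches!(status, 301 | 302 | 303 | 307 | 308);
    if !is_redirect {
        return FilterOutcome::Continue;
    }
    if redirects_so_far >= max_redirect {
        return FilterOutcome::Term(TermReason::MaxRedirect { limit: max_redirect });
    }
    FilterOutcome::Term(TermReason::Redirect)
}
```

Because the reason is an enum rather than a string, metrics code can match on `TermReason::MaxRedirect` directly.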

Evaluate arena integration

https://github.com/fitzgen/bumpalo