let4be / crusty
Broad Web Crawler
License: GNU General Public License v3.0
Right now it does some "wasteful serialization" which we just throw away.
Yet the lib is so damn fast it doesn't matter...
Ideally we would like to completely disable the HTML rewriting functionality, but I don't think that's currently possible, see cloudflare/lol-html#91
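For reference, this is roughly what the current pattern looks like (a sketch assuming a recent lol-html 0.3+ API): element handlers harvest hrefs, but the rewriter still serializes every chunk into an output sink we immediately discard - that's the wasted work.

use lol_html::{element, HtmlRewriter, Settings};

fn scan_links(html: &[u8]) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let mut links = Vec::new();
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![element!("a[href]", |el| {
                if let Some(href) = el.get_attribute("href") {
                    links.push(href);
                }
                Ok(())
            })],
            ..Settings::default()
        },
        // output sink: every rewritten chunk is thrown away
        |_: &[u8]| {},
    );
    rewriter.write(html)?;
    rewriter.end()?;
    Ok(links)
}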
This could better explain what the primary scaling points are: a config for AWS's c5.metal would be quite different from one for t2.micro.
Sometimes when testing on AWS c5.metal I see that writers "hang", which leads to a pile of unprocessed messages in buffers (particularly metrics_task, as the heaviest ClickHouse hitter).
As soon as crusty-core fully supports custom HTML processing I'd like to experiment a bit and find a faster way to extract links (and probably some metadata) from HTML.
We don't need anything complex when doing broad web crawling, so it should be possible to speed this up a lot (right now we do full DOM parsing).
Extracting links/title/meta should be easy to do with a simple tokenizer, like the one in https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html
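A minimal sketch of that idea, assuming html5ever 0.25's tokenizer API as linked above: the sink only watches <a> start tags, so no DOM is ever built.

use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

// Collects href values from <a> start tags; everything else is skipped.
#[derive(Default)]
struct LinkSink {
    links: Vec<String>,
}

impl TokenSink for LinkSink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        if let Token::TagToken(tag) = token {
            if tag.kind == TagKind::StartTag && &*tag.name == "a" {
                for attr in &tag.attrs {
                    if &*attr.name.local == "href" {
                        self.links.push(attr.value.to_string());
                    }
                }
            }
        }
        TokenSinkResult::Continue
    }
}

fn extract_links(html: &str) -> Vec<String> {
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from_slice(html));

    let mut tok = Tokenizer::new(LinkSink::default(), TokenizerOpts::default());
    let _ = tok.feed(&mut input);
    tok.end();
    tok.sink.links
}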
It's partially here, but we still need to add addr_key, shard_min .. shard_max (see the sharding sketch below).
High volume setups will most likely need a local recursive DNS server.
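A hypothetical mapping from addr_key onto a [shard_min, shard_max) slice, just to illustrate how those settings would carve up the pre-sharded queue between instances. Note that std's DefaultHasher is not guaranteed stable across Rust versions, so a real setup would pin a specific hash.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The hash only has to be stable across all crawler instances.
fn shard_for(addr_key: &str, shard_min: u32, shard_max: u32) -> u32 {
    debug_assert!(shard_max > shard_min);
    let mut h = DefaultHasher::new();
    addr_key.hash(&mut h);
    shard_min + (h.finish() % u64::from(shard_max - shard_min)) as u32
}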
Could be cool to keep the network inside a docker overlay, but I'm concerned about performance, especially on high-end setups.
It should probably be the default configuration anyway, just to ensure everyone can try Crusty no matter which ports are open on the system...
Seems like https://github.com/jedireza/warc could help
Need to convert the existing ./infra/lazy.sh and ./infra/net.sh into one script that can be called right from curl:
curl -fsSL https://raw.githubusercontent.com/let4be/crusty/master/infra/lazy.sh | bash -s
curl is readily available everywhere, and the script could pull in such stuff as git, bmon, htop, etc...
Right now we do not precisely control which address hyper will use when connecting; we just assume it's the first one (and apply concurrency restrictions accordingly, which may backfire).
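One possible way to close that gap, assuming hyper 0.14's pluggable-resolver API: hand HttpConnector a resolver that only ever returns the single address we already applied restrictions to. FixedResolver is a name invented for this sketch.

use std::future::{ready, Ready};
use std::net::SocketAddr;
use std::task::{Context, Poll};

use hyper::client::connect::dns::Name;
use hyper::client::HttpConnector;
use hyper::service::Service;

// Always yields exactly one, pre-selected address, so hyper can only
// dial the address the concurrency restriction was applied to.
#[derive(Clone)]
struct FixedResolver(SocketAddr);

impl Service<Name> for FixedResolver {
    type Response = std::vec::IntoIter<SocketAddr>;
    type Error = std::io::Error;
    type Future = Ready<Result<Self::Response, Self::Error>>;

    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }

    fn call(&mut self, _name: Name) -> Self::Future {
        ready(Ok(vec![self.0].into_iter()))
    }
}

fn connector_for(addr: SocketAddr) -> HttpConnector<FixedResolver> {
    HttpConnector::new_with_resolver(FixedResolver(addr))
}

With this, the connector physically cannot connect to a different address than the one our per-addr accounting saw.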
Hello, I hope you are well. I've been trying to run the project locally in Docker, but I always get an error from the Redis Dockerfile. I updated some things in the Dockerfile, but the problem still persists:
FROM redis
# Update and install necessary packages
RUN apt-get update && apt-get -y install git build-essential cmake
# Create a symlink for python3 to ensure it is recognized
# Create app directory and clone RedisBloom
RUN mkdir /app && cd /app && \
    git clone https://github.com/RedisBloom/RedisBloom && \
    cd RedisBloom && \
    git submodule update --init --recursive && \
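    # NOTE: the trailing backslashes below chain everything into one command,
    # so "bash -l" and "make" become arguments to ./sbin/setup - make never
    # runs, redisbloom.so is never built, which matches the load error below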
    ./sbin/setup \
    bash -l \
    make
# Copy configuration and modules
COPY redis.conf /usr/local/etc/redis/redis.conf
COPY --from=crusty_crusty:latest /usr/local/lib/libredis_queue.so /app
COPY --from=crusty_crusty:latest /usr/local/lib/libredis_calc.so /app
# Expose port
EXPOSE 6379/tcp
CMD [ "redis-server", "/usr/local/etc/redis/redis.conf", "--loadmodule /app/libredis_queue.so", "--loadmodule /app/libredis_calc.so", "--loadmodule /app/RedisBloom/redisbloom.so" ]
The error that is happening:
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 * Module 'crusty.queue' loaded from /app/libredis_queue.so
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 * Module 'crusty.calc' loaded from /app/libredis_calc.so
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 # Module /app/RedisBloom/redisbloom.so failed to load: /app/RedisBloom/redisbloom.so: cannot open shared object file: No such file or directory
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 # Can't load module from /app/RedisBloom/redisbloom.so: server aborting
While current "queue-like system" on top of clickhouse worked quite well for testing it's no near as good as required for any serious high-volume use
Recently I did some testing on a beefy AWS hardware and fixed some internal bottlenecks(not yet merged) and in some testing scenarios where I could temporary alleviate the last left bottleneck - job distribution(writing new/updating completed/selecting), Crusty
was capable of doing over 900MiB/sec - a whooping 7+gbit/sec! on 48 core(96 logical) c5.metal with a 25gbit/s port
The new job queue should be solely redis-based, using redis modules: https://redis.io/topics/modules-intro
Rust has a good enough library for writing redis module logic: https://github.com/RedisLabsModules/redismodule-rs
We will use a pre-sharded queue (based on addr_key).
Atomic operations:
Using the correct underlying data types (mostly sets, plus a bloom filter for history) together with batching and pipelining, we can get solid throughput, low CPU usage per redis node, and decent reliability and scalability (see the sketch below).
Careful expiration could help avoid memory overflow on a redis node - we always discover domains faster than we can process them.
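A client-side sketch of the batching + pipelining part, using the redis crate; key names like history:{shard} and pending:{shard} are illustrative. Note a plain pipeline can batch but not branch - enqueueing only unseen domains in one atomic step is exactly the kind of logic the custom module commands would move server-side.

use redis::RedisResult;

async fn enqueue_batch(
    conn: &mut redis::aio::MultiplexedConnection,
    shard: u16,
    domains: &[&str],
) -> RedisResult<Vec<i64>> {
    let mut pipe = redis::pipe();
    for d in domains {
        // BF.ADD (RedisBloom) replies 1 for a first sighting, 0 if seen before
        pipe.cmd("BF.ADD").arg(format!("history:{}", shard)).arg(*d);
        // the pending queue itself lives in a per-shard set
        pipe.cmd("SADD").arg(format!("pending:{}", shard)).arg(*d).ignore();
    }
    // one round-trip for the whole batch
    pipe.query_async(conn).await
}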
It uses dynamic pulling of all available buffers - it displays labels wrong and cannot calculate max (it outputs trillions).
It's either a Grafana bug or I did something wrong :\
Some were clearly not selected properly...
In some places we send vectors, which doesn't play nice with buffers (not what was implied).
It's essential we scale this part as well...
Right now we write from a single green thread, though we use buffering and write in configurable chunks.
Under a high volume of traffic the metrics writing part may stall and back up the whole system.
Right now this broad crawler is completely empty, I think it would be cool if we had something to show off ;)
A good candidate for such a task could be PageRank...
Now, calculating URL PageRank is a whole mega-task in its own right, a proper implementation of which (one that scales) could take months because of the requirements on throughput, memory, speed and scalability.
Such a system most likely needs a sophisticated URL -> ID mapping.
On the other hand, we could easily calculate domain PageRank ad hoc:
https://oss.redislabs.com/redisgraph/
RedisGraph/RedisGraph#398
Depending on the underlying hardware results may vary. However, inserting a new relationship is done in O(1). RedisGraph is able to create over 1 million nodes under half a second and form 500K relations within 0.3 of a second.
RedisGraph has PageRank built in, and it seems like it could significantly simplify the code.
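A sketch of what querying it could look like from Rust with the redis crate; the graph name ("domains"), node label ("Domain") and relation type ("LINKS_TO") are invented for the example - whatever the crawler writes while inserting edges.

use redis::RedisResult;

fn top_domains(conn: &mut redis::Connection) -> RedisResult<redis::Value> {
    redis::cmd("GRAPH.QUERY")
        .arg("domains")
        .arg(
            "CALL algo.pageRank('Domain', 'LINKS_TO') YIELD node, score \
             RETURN node.name, score ORDER BY score DESC LIMIT 10",
        )
        .query(conn)
}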
Right now it does not make any sense whatsoever...
blocked by
It seems to capture way too much trash right now
Should the dequeue handler update ddc to prevent the same domain from taking space in the resolver's queue?...
Stuff like page limit, max depth, skipping nofollow links, etc, etc, etc...
Figure out a way to auto-tune domain concurrency (there is a ~perfect N based on the CPU and network bandwidth available).
We will need some kind of graceful adaptive algo which looks at metrics (tx/rx, error rates) and determines the N.
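One candidate shape for that algo - AIMD, the same idea TCP uses for its congestion window. Everything below (thresholds, step sizes, field names) is made up for illustration and would need tuning itself.

// Additive increase / multiplicative decrease over the metrics window
// the crawler already collects.
struct ConcurrencyTuner {
    n: usize,             // current domain concurrency
    max_n: usize,         // hard ceiling from config
    last_throughput: f64, // rx bytes/sec over the previous window
}

impl ConcurrencyTuner {
    fn adjust(&mut self, throughput: f64, error_rate: f64) -> usize {
        if error_rate > 0.05 || throughput < self.last_throughput * 0.95 {
            // errors climbing or throughput regressing: back off hard
            self.n = (self.n / 2).max(1);
        } else {
            // still healthy: probe upwards gently
            self.n = (self.n + 8).min(self.max_n);
        }
        self.last_throughput = throughput;
        self.n
    }
}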
Those should be registered dynamically, right on channel creation
Consider completely removing config defaults from code where possible.
We already include_str! the default config right into our code and parse it - we can take most of the defaults from there.
Config is split between Crusty and crusty-core though, and the latter has no idea about the configuration system, nor should it assume anything about it. So we should keep crusty-core's config defaults in crusty-core.
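A minimal sketch of the include_str! approach; the path, the field and serde_yaml are assumptions - the point is that defaults live in the embedded file, not in code.

use serde::Deserialize;

// Hypothetical field, standing in for the real config struct.
#[derive(Deserialize)]
struct CrustyConfig {
    max_depth: usize,
}

// The same default config file already shipped in the repo.
static DEFAULT_CONFIG: &str = include_str!("../config.yaml");

fn default_config() -> CrustyConfig {
    serde_yaml::from_str(DEFAULT_CONFIG).expect("embedded default config must be valid")
}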
This used to work properly
We should probably always try to download robots.txt before we access the index page... and if it resolves with a 4xx or 5xx code we should act accordingly and follow Google's best practices: https://developers.google.com/search/docs/advanced/robots/robots_txt
Right now we download / and /robots.txt in parallel, and external links from / will most likely be added to the Q (internal ones will not).
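A sketch of the status handling, under one common reading of that guidance: 4xx means the file simply doesn't exist, crawl freely; 5xx means assume everything is disallowed for now and retry later.

enum RobotsPolicy {
    Rules(String), // got a body - parse and obey it
    AllowAll,      // 4xx - no robots.txt, no restrictions
    DenyAll,       // 5xx - be conservative, retry later
}

fn classify_robots(status: u16, body: String) -> RobotsPolicy {
    match status {
        200..=299 => RobotsPolicy::Rules(body),
        400..=499 => RobotsPolicy::AllowAll,
        // 5xx and anything unexpected: stay conservative
        _ => RobotsPolicy::DenyAll,
    }
}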
Hello, let4be
First of all, I want to say that what you have built is really impressive - I am amazed, so congratulations. Furthermore, I see that you wrote in the README file that one could attach a graph database to save the crawled data, but I can't quite understand how to do it or how it would fit into the dataflow, because I understand that Crusty already saves the crawled data in some database.
I am interested in broad crawling, particularly with Rust, because I've been working on a peer-to-peer search engine and thus need a low-resource broad crawler. I have an (untidy) Python prototype which I would like to convert to Rust.
I would greatly appreciate it if you could help me with this, so I can solve this problem for the search engine project.
Thank you very much in advance. Kind regards.
Tried to start crawling from https://cnn.com and nothing ...