
crusty's Issues

Implement faster HTML parsing

As soon as crusty-core fully supports custom HTML processing I'd like to experiment a bit and find a faster way to extract links (and probably some metadata) from HTML.

We don't need anything complex when doing broad web crawling, so it should be possible to speed this up a lot (right now we do full DOM parsing).

Extracting links/title/meta should be easy to do with a simple tokenizer, like the one in https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html
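
A rough sketch of the direction against html5ever 0.25's tokenizer API - extracting hrefs only here; title/meta would be handled the same way - not the final extractor:

// Tokenizer-only link extraction with html5ever 0.25 (no DOM is built).
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

#[derive(Default)]
struct LinkSink {
    links: Vec<String>,
}

impl TokenSink for LinkSink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        if let Token::TagToken(tag) = token {
            // Only opening <a> tags; grab the href attribute if present.
            if tag.kind == TagKind::StartTag && &*tag.name == "a" {
                for attr in &tag.attrs {
                    if &*attr.name.local == "href" {
                        self.links.push(attr.value.to_string());
                    }
                }
            }
        }
        TokenSinkResult::Continue
    }
}

fn extract_links(html: &str) -> Vec<String> {
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from(html));

    let mut tok = Tokenizer::new(LinkSink::default(), TokenizerOpts::default());
    let _ = tok.feed(&mut input);
    tok.end();
    tok.sink.links
}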

Queue sharding support

It's partially here, but we still need to add:

  • routing to the proper shard based on addr_key (sketched below)
  • spawning a green thread for each owned shard (shard_min .. shard_max)
  • gluing it all together and testing
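
For illustration, the routing piece could be as small as a hash of addr_key modulo the shard count, with one green thread per owned shard. A sketch under those assumptions (the function names here are made up):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

// Map addr_key onto a shard. Note: DefaultHasher isn't guaranteed stable
// across Rust releases, so production code would want a fixed hash.
fn shard_for(addr_key: &str, shard_total: u64) -> u64 {
    let mut h = DefaultHasher::new();
    addr_key.hash(&mut h);
    h.finish() % shard_total
}

// One green thread per shard this node owns.
fn spawn_shard_readers(shard_min: u64, shard_max: u64) {
    for shard in shard_min..shard_max {
        tokio::spawn(async move {
            loop {
                // poll this shard's queue here, dispatch jobs...
                let _ = shard;
                tokio::time::sleep(Duration::from_millis(100)).await;
            }
        });
    }
}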

curl script for super-fast start

Need to convert the existing ./infra/lazy.sh and ./infra/net.sh into one script that can be called right from curl:
curl -fsSL https://raw.githubusercontent.com/let4be/crusty/master/infra/lazy.sh | bash -s
curl is readily available everywhere, and the script could pull in such things as git, bmon, htop, etc.

Review how we access DNS resolved addresses

Right now we do not precisely control which address hyper will use when connecting; we just assume it's the first resolved one and apply concurrency restrictions accordingly - which may backfire if hyper actually picks a different address.
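
One way to make the assumption explicit (a sketch, not how hyper's connector is actually wired): resolve ourselves, pin the first address, and key the concurrency limit on the exact SocketAddr we hand to the connector:

use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

// Resolve once (blocking here for brevity; real code would resolve async)
// and pin the first address, so the address we throttle on is exactly
// the one we connect to.
fn pick_addr(host: &str, port: u16) -> io::Result<SocketAddr> {
    (host, port)
        .to_socket_addrs()?
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no address resolved"))
}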

Error in redis dockerfile

Hello, I hope you are well. I've been trying to run the project locally with Docker, but I always get an error from the redis Dockerfile. I updated some things in the Dockerfile, but the problem still persists:

FROM redis

# Update and install necessary packages
RUN apt-get update && apt-get -y install git build-essential cmake

# Create a symlink for python3 to ensure it is recognized

# Create app directory, clone and build RedisBloom.
# NOTE: the original recipe joined ./sbin/setup, bash -l and make with bare
# backslashes, which passed 'bash -l make' as arguments to setup - so make
# never ran and redisbloom.so was never built (hence the load error below).
# Recent RedisBloom versions also place the built module under
# bin/<platform>/ rather than the repo root, so it may need copying to the
# path the CMD loads it from.
RUN mkdir /app && cd /app && \
    git clone https://github.com/RedisBloom/RedisBloom && \
    cd RedisBloom && \
    git submodule update --init --recursive && \
    ./sbin/setup && \
    bash -l -c make

# Copy configuration and modules
COPY redis.conf /usr/local/etc/redis/redis.conf
COPY --from=crusty_crusty:latest /usr/local/lib/libredis_queue.so /app
COPY --from=crusty_crusty:latest /usr/local/lib/libredis_calc.so /app

# Expose port
EXPOSE 6379/tcp
CMD [ "redis-server", "/usr/local/etc/redis/redis.conf", "--loadmodule /app/libredis_queue.so", "--loadmodule /app/libredis_calc.so", "--loadmodule /app/RedisBloom/redisbloom.so" ]

The error that is happening:

2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 * Module 'crusty.queue' loaded from /app/libredis_queue.so
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 * Module 'crusty.calc' loaded from /app/libredis_calc.so
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 # Module /app/RedisBloom/redisbloom.so failed to load: /app/RedisBloom/redisbloom.so: cannot open shared object file: No such file or directory
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 # Can't load module from /app/RedisBloom/redisbloom.so: server aborting

Migrate job management system to Redis

While the current "queue-like system" on top of ClickHouse worked quite well for testing, it's nowhere near as good as required for any serious high-volume use.

Recently I did some testing on beefy AWS hardware and fixed some internal bottlenecks (not yet merged). In testing scenarios where I could temporarily alleviate the last remaining bottleneck - job distribution (writing new / updating completed / selecting) - Crusty was capable of doing over 900MiB/sec, a whopping 7+ gbit/sec, on a 48-core (96 logical) c5.metal with a 25gbit/s port.

The new job queue should be solely Redis-based, using Redis modules: https://redis.io/topics/modules-intro
Rust has a good enough library for writing Redis module logic: https://github.com/RedisLabsModules/redismodule-rs

We will use a pre-sharded queue (based on addr_key).

Atomic operations:

  1. Enqueue jobs
  2. Dequeue jobs
  3. Finish jobs

Using the correct underlying data types (mostly sets, plus a bloom filter for history), together with batching and pipelining, we can get solid throughput, low CPU usage per Redis node, decent reliability and scalability.
Careful expiration could help avoid memory overflow on a Redis node - we always discover domains faster than we can process them.
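
To make the shape concrete, here is a minimal sketch of one such command written with redismodule-rs. Exact types and signatures vary between releases of that crate, and the real commands would batch and touch the bloom filter too - this only shows the skeleton:

use redis_module::{redis_module, Context, RedisError, RedisResult, RedisString};

// CRUSTY.ENQUEUE <addr_key> <url>
fn enqueue(ctx: &Context, args: Vec<RedisString>) -> RedisResult {
    if args.len() != 3 {
        return Err(RedisError::WrongArity);
    }
    let addr_key = args[1].try_as_str()?;
    let url = args[2].try_as_str()?;

    // A set keeps the pending queue deduplicated per addr_key;
    // crawl history would go through a bloom filter (BF.ADD) instead.
    let key = format!("pending:{}", addr_key);
    ctx.call("SADD", &[key.as_str(), url])
}

redis_module! {
    name: "crusty_queue_sketch",
    version: 1,
    data_types: [],
    commands: [
        ["crusty.enqueue", enqueue, "write", 1, 1, 1],
    ],
}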

Glitchy buffers panel in grafana dashboard

It uses dynamic pulling of all available buffers - it displays the labels wrong and cannot calculate max (it outputs trillions).
It's either a Grafana bug or I did something wrong :\

Check channel buffer sizes

Some clearly weren't selected properly...
In some places we send whole vectors through channels, which doesn't play nice with buffer sizes: a buffer of N then holds N batches rather than N items, which is not what was implied.
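
A tiny repro of the mismatch (tokio mpsc, nothing Crusty-specific): the buffer bounds messages, not items, so a single Vec blows straight past the intended limit:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // A "buffer of 8", but the unit is messages, not items: each message
    // is a whole Vec, so the channel can hold 8 * batch_len items.
    let (tx, mut rx) = mpsc::channel::<Vec<u64>>(8);
    tx.send((0..10_000).collect()).await.unwrap();
    let batch = rx.recv().await.unwrap();
    println!("one buffered message carried {} items", batch.len());
}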

Concurrent writing to clickhouse

It's essential that we scale this part as well...
Right now we write from a single green thread, though we use buffering and write in configurable chunks.

Under a high volume of traffic the metrics-writing part may hold back the whole system.
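
A possible shape for this (a sketch only, with write_chunk standing in for the actual ClickHouse insert): N writer tasks draining one shared channel, so inserts no longer serialize behind a single green thread:

use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

struct Row; // stand-in for one metrics row

// Hypothetical: the real insert goes through the ClickHouse client.
async fn write_chunk(rows: Vec<Row>) {
    let _ = rows.len(); // INSERT INTO ... would happen here
}

fn spawn_writers(n: usize, rx: mpsc::Receiver<Vec<Row>>) {
    // tokio's mpsc is single-consumer, so share the receiver behind a
    // Mutex; an MPMC channel (e.g. async-channel) would avoid the lock.
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..n {
        let rx = Arc::clone(&rx);
        tokio::spawn(async move {
            loop {
                // The lock is held only while waiting for the next chunk;
                // the writes themselves proceed concurrently.
                let chunk = rx.lock().await.recv().await;
                match chunk {
                    Some(rows) => write_chunk(rows).await,
                    None => break, // channel closed, all writers exit
                }
            }
        });
    }
}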

Implement a first approximation of PageRank for Domains

Right now this broad crawler is completely empty; I think it would be cool if we had something to show off ;)
A good candidate for such a task could be PageRank...

Now, calculating URL-level PageRank is a whole mega-task in its own right; a proper implementation (one that scales) could take months because of the requirements on throughput, memory, speed and scalability.
Such a system most likely needs a sophisticated URL -> ID mapping.

On the other hand, we could easily calculate Domain PageRank ad hoc (see the sketch after the list):

  1. collect all outbound domains for a given Job
  2. convert the Job's Domain into a Second Level Domain (super-blog.tumblr.com -> tumblr.com)
  3. convert all outbound domains into unique Second Level Domains as well
  4. store all of this in RedisGraph (this will work because there's only a very limited number of second level domains, and RedisGraph uses sparse matrices)

https://oss.redislabs.com/redisgraph/
RedisGraph/RedisGraph#398

Depending on the underlying hardware results may vary. However, inserting a new relationship is done in O(1). RedisGraph is able to create over 1 million nodes under half a second and form 500K relations within 0.3 of a second.

RedisGraph has PageRank built-in
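
A sketch of steps 2-4. The SLD squashing here is deliberately naive - real code should consult the Public Suffix List (e.g. via the psl crate) - and record_link plus the "domains" graph name are made up for illustration:

// Naive SLD squashing; breaks on suffixes like .co.uk, so real code
// should use the Public Suffix List instead.
fn second_level(domain: &str) -> String {
    let mut labels = domain.rsplit('.');
    match (labels.next(), labels.next()) {
        (Some(tld), Some(sld)) => format!("{}.{}", sld, tld),
        _ => domain.to_string(),
    }
}

// One outbound edge in RedisGraph; MERGE keeps nodes and edges unique.
// Values are interpolated for brevity - don't feed untrusted input as-is.
fn record_link(con: &mut redis::Connection, from: &str, to: &str) -> redis::RedisResult<()> {
    let q = format!(
        "MERGE (a:Domain {{name:'{}'}}) MERGE (b:Domain {{name:'{}'}}) MERGE (a)-[:LINKS_TO]->(b)",
        second_level(from),
        second_level(to)
    );
    redis::cmd("GRAPH.QUERY").arg("domains").arg(q).query::<redis::Value>(con)?;
    Ok(())
}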

Concurrency auto-tuning

Figure out a way to auto-tune domain concurrency (there is a ~perfect N based on the CPU and network bandwidth available).
We'll need some kind of graceful adaptive algorithm that looks at metrics (tx/rx, error rates) and determines the N.
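
One possible shape for such an algorithm - AIMD on the error rate, purely illustrative; none of this exists in the code yet and the thresholds are made up:

struct ConcurrencyTuner {
    n: usize,
    min: usize,
    max: usize,
}

impl ConcurrencyTuner {
    // Called once per metrics window with the observed error rate and
    // whether tx/rx throughput improved versus the previous window.
    fn step(&mut self, error_rate: f64, throughput_improved: bool) {
        if error_rate > 0.05 {
            // Multiplicative decrease: back off fast when errors spike.
            self.n = (self.n / 2).max(self.min);
        } else if throughput_improved {
            // Additive increase: gently probe for a higher sustainable N.
            self.n = (self.n + 8).min(self.max);
        }
        // Otherwise hold steady - we're near the plateau.
    }
}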

Review config defaults

Consider completely removing config defaults from code where possible

We already include_str! the default config right into our code and parse it - we can take most of the defaults from there.

The config is split between Crusty and crusty-core though, and the latter has no idea about the configuration system - nor should it assume anything about it. So crusty-core's defaults have to stay in its code.
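
For illustration, the Crusty side could look like this - assuming a YAML config and serde; the file path and field are hypothetical:

use serde::Deserialize;

#[derive(Deserialize)]
struct CrustyConfig {
    domain_concurrency: usize, // hypothetical field, for illustration
}

// Hypothetical path; the point is that the config file shipped inside
// the binary becomes the single source of defaults.
static DEFAULT_CONFIG: &str = include_str!("../config.yaml");

fn default_config() -> Result<CrustyConfig, serde_yaml::Error> {
    serde_yaml::from_str(DEFAULT_CONFIG)
}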

Attaching a database

Hello, let4be

First of all I want to say it is really impressive what you have built; I am really amazed, so congratulations. Furthermore, I see that you wrote in the README that one could attach a graph database to save the crawled data, but I can't quite understand how to do it and how it would fit into the dataflow, because I understand that Crusty already saves the crawled data in some database.

I am interested in broad crawling, particularly with Rust, because I've been working on a peer-to-peer search engine, and thus I need a low-resource broad crawler. I have an (untidy) Python prototype which I would like to convert to Rust.

I would greatly appreciate it if you could help me with this, so I could solve this problem for the search engine project.
Thank you very much in advance. Kind regards.
