Giter Club home page Giter Club logo

top-1m-jarm's Introduction

Top-1M-jarm

This repo is used to compute the jarm values of top 1 millions website.
More info on jarm.

Output file template

alexa rank domain ip JARM hash
1 google.com 216.58.213.78 29d3fd00029d29d21c42d43d00041df48f145f65c66577d0b01ecea881c1ba
2 youtube.com 172.217.18.206 29d3fd00029d29d21c42d43d00041df48f145f65c66577d0b01ecea881c1ba

Output file from February 2023 scan: result.csv (Alexa rank column has been removed).

Architecture

flowchart LR
   csv(CSV file with domains to scan and their rank) --> Scheduler --> domainQueue[/queue of domains/]
   domainQueue--> workerDNS1[Worker resolving the domain into IP] 
   domainQueue --> workerDNS2[Worker resolving the domain into IP]
   domainQueue --> workerDNS3[Worker resolving the domain into IP]
   workerDNS1 --> ipQueue[/queue of IPs/]
   workerDNS2 --> ipQueue
   workerDNS3 --> ipQueue
   ipQueue--> workerJARM1[Worker performing the JARM scan on the IP] 
   ipQueue --> workerJARM2[Worker performing the JARM scan on the IP]
   ipQueue --> workerJARM3[Worker performing the JARM scan on the IP]
   workerJARM1 --> scanResultQueue[/queue of JARM result/]
   workerJARM2 --> scanResultQueue
   workerJARM3 --> scanResultQueue
   scanResultQueue--> workerAggregation[Aggregates results in a single CSV]
Loading

Note
The use of rq and docker compose does really make sense for tasks that are CPU-bond which is not the case here.
Nonetheless with a master and 3 workers (for a total of 5Go of RAM and 8vCPU, so a very modest cluster) it took a day and a half to process ~600k scans.

A batch of 1k domains being processed (RQ debug view)

As workers focus on priority on the ip queue, few jobs stay in this queue asciicast

Set up for development

Run poetry install to install dependencies.
This project use PyO3 to bind rust code, to use it run maturin develop --locked --release
To prepare local docker image run docker build -t top-1m-jarm:latest .

Running

This project use docker swarm (might require docker swarm init).
One node has to be marked as a coordinator with: docker node update --label-add coordinator=1 $(docker node inspect --format '{{ .ID }}' self).
It'll be responsible for input/output files.
result.csv must also be created via touch (by default touch ./data/result.csv).

docker stack deploy --compose-file docker-compose.yml top1MjarmStack
docker stack ls
docker service ls
docker service logs top1MjarmStack_scheduler -f

To monitor the queue:

docker exec -it $(docker ps -qf "name=top1MjarmStack_csv_writer" | head -n 1) poetry run rq info default domains ips jarm_result --url redis://:XXX_SET_REDIS_PASS_XXX@redis_queue:6379 -i 1

To remove the running containers:

docker stack rm top1MjarmStack
docker stack ls

Push to docker hub

docker build -t hugocker/top-1m-jarm --pull --no-cache .
docker push hugocker/top-1m-jarm

top-1m-jarm's People

Contributors

hugo-c avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.