
blockprint

Blockprint is a tool for measuring client diversity on the Ethereum beacon chain.

It's the backend behind the client diversity statistics periodically shared on Twitter.

Public API

As of Feb 11 2022, blockprint's public API is hosted on a server managed by Sigma Prime.

For API documentation please see docs/api.md.

Running blockprint

Lighthouse

Blockprint needs to run alongside a Lighthouse node v2.1.2 or newer.

It uses the /lighthouse/analysis/block_rewards endpoint.
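
For example, fetching a batch of rewards from a local node might look like this (a minimal sketch, assuming Lighthouse's default HTTP port 5052 and the endpoint's start_slot/end_slot query parameters):

import requests

def fetch_block_rewards(start_slot, end_slot, base_url="http://localhost:5052"):
    # Query Lighthouse's block rewards analysis endpoint for a slot range.
    res = requests.get(
        f"{base_url}/lighthouse/analysis/block_rewards",
        params={"start_slot": start_slot, "end_slot": end_slot},
    )
    res.raise_for_status()
    return res.json()

rewards = fetch_block_rewards(2048001, 2048032)
print(f"fetched rewards for {len(rewards)} blocks")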

VirtualEnv

All Python commands should be run from a virtualenv with the dependencies from requirements.txt installed.

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt

The Classifier

Blockprint's classifier uses one of two machine learning algorithms:

  • K-nearest neighbours
  • Multi-layer Perceptron

These can be chosen with the --classifier-type flag in classifier.py.

See ./classifier.py --help for more command line options including cross validation (CV) and manual classification.
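
For intuition, here is a sketch of how the two algorithms map onto scikit-learn; the flag values "knn" and "mlp" are assumptions, so see --help for what classifier.py actually accepts:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def make_classifier(classifier_type):
    if classifier_type == "knn":
        # K-nearest neighbours: label a block by the majority client among
        # its k closest blocks in feature space.
        return KNeighborsClassifier(n_neighbors=5)
    elif classifier_type == "mlp":
        # Multi-layer perceptron: a small feed-forward neural network.
        return MLPClassifier(max_iter=1000)
    raise ValueError(f"unknown classifier type: {classifier_type}")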

Training the Classifier

The classifier is trained from a directory of reward batches. You can fetch batches with the load_blocks.py script by providing a start slot, end slot and output directory:

./load_blocks.py 2048001 2048032 testdata

The directory testdata now contains one or more files of the form slot_X_to_Y.json downloaded from Lighthouse.

To train the classifier on this data, use the prepare_training_data.py script:

./prepare_training_data.py testdata testdata_proc

This will read files from testdata and write the graffiti-classified training data to testdata_proc, which is structured as directories of single block reward files for each client.

$ tree testdata_proc
testdata_proc
├── Lighthouse
│   ├── 0x03ae60212c73bc2d09dd3a7269f042782ab0c7a64e8202c316cbcaf62f42b942.json
│   └── 0x5e0872a64ea6165e87bc7e698795cb3928484e01ffdb49ebaa5b95e20bdb392c.json
├── Nimbus
│   └── 0x0a90585b2a2572305db37ef332cb3cbb768eba08ad1396f82b795876359fc8fb.json
├── Prysm
│   └── 0x0a16c9a66800bd65d997db19669439281764d541ca89c15a4a10fc1782d94b1c.json
└── Teku
    ├── 0x09d60a130334aa3b9b669bf588396a007e9192de002ce66f55e5a28309b9d0d3.json
    ├── 0x421a91ebdb650671e552ce3491928d8f78e04c7c9cb75e885df90e1593ca54d6.json
    └── 0x7fedb0da9699c93ce66966555c6719e1159ae7b3220c7053a08c8f50e2f3f56f.json
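
The labelling relies on clients setting a recognisable default graffiti. A rough sketch of the idea, with illustrative patterns (the authoritative regex live in prepare_training_data.py and vary by network):

import re

# Illustrative graffiti patterns only, not blockprint's actual regex.
GRAFFITI_REGEX = {
    "Lighthouse": re.compile(r"Lighthouse/v"),
    "Nimbus": re.compile(r"Nimbus/v"),
    "Prysm": re.compile(r"[Pp]rysm"),
    "Teku": re.compile(r"teku/v"),
}

def classify_by_graffiti(graffiti):
    """Return the client a block's graffiti implies, or None if inconclusive."""
    for client, regex in GRAFFITI_REGEX.items():
        if regex.search(graffiti):
            return client
    return None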

You can then use this directory as the datadir argument to ./classifier.py:

./classifier.py testdata_proc --classify testdata

If you then want to use the classifier to build a SQLite database:

./build_db.py --db-path block_db.sqlite --classify-dir testdata --data-dir testdata_proc
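
The resulting database can then be inspected with ordinary SQL. A sketch, assuming a blocks table with a best_guess_single column (consult build_db.py for the real schema):

import sqlite3

conn = sqlite3.connect("block_db.sqlite")
# Count classified blocks per client; the column name is an assumption.
for client, count in conn.execute(
    "SELECT best_guess_single, COUNT(*) FROM blocks GROUP BY best_guess_single"
):
    print(f"{client}: {count} blocks")
conn.close()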

Running the API server

gunicorn api_server:app --timeout 1800

It will take a few minutes to start up while it loads all of the training data into memory.
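
Once it's up, the API can be queried over HTTP, e.g. the /blocks_per_client endpoint described in docs/api.md (gunicorn binds to 127.0.0.1:8000 by default):

import requests

# Example slot range; adjust to the history your database covers.
res = requests.get("http://127.0.0.1:8000/blocks_per_client/113625/113850")
res.raise_for_status()
print(res.json())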

License

Copyright 2021 Sigma Prime and blockprint contributors

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

blockprint's People

Contributors

ariskk, macladson, michaelsproul, santi1234567

blockprint's Issues

Fix handling of sidechains and gaps

At the moment blockprint's DB is not handling sidechains very elegantly.

The database is designed so that a block can be inserted without knowing its parent. This is nice because it provides the ability to stay online processing new blocks, even if some old blocks have been missed due to temporary downtime. There's a background process in background_tasks.py which is meant to query the API for gaps to fill in, and patch them up.

The problem is that the logic for determining gaps returns some gaps that are impossible for the background task to heal. A gap is currently defined as a slot interval between a block with a parent missing from the DB (end_slot) and the last known block prior to that missing parent (start_slot):

blockprint/build_db.py

Lines 156 to 163 in c7f570d

def get_missing_parent_blocks(block_db):
    res = block_db.execute(
        """SELECT slot, parent_slot FROM blocks b1
        WHERE
          (SELECT slot FROM blocks WHERE slot = b1.parent_slot) IS NULL
          AND slot <> 1"""
    )
    return [(int(x[0]), int(x[1])) for x in res]

blockprint/build_db.py

Lines 181 to 191 in c7f570d

for block_slot, parent_slot in missing_parent_slots:
    prior_slot = get_greatest_prior_block_slot(block_db, parent_slot)
    if prior_slot is None:
        start_slot = 0
    else:
        start_slot = prior_slot + 1
    end_slot = block_slot - 1
    assert end_slot >= start_slot
    gaps.append({"start": start_slot, "end": end_slot})

E.g. the current output from https://api.blockprint.sigp.io/sync/gaps is:

[
  {
    "start": 8253079,
    "end": 8253086
  },
  {
    "start": 8253031,
    "end": 8253045
  },
  {
    "start": 8253127,
    "end": 8253151
  },
  {
    "start": 8254502,
    "end": 8254511
  },
  {
    "start": 8277096,
    "end": 8277097
  },
  {
    "start": 8299646,
    "end": 8299647
  },
  {
    "start": 8299650,
    "end": 8299651
  }
]

Looking at the first gap, the block with the missing parent that triggered it must be the one at slot 8253087, which was reorged out (beaconcha.in doesn't even know about it): https://beaconcha.in/slot/8253087. Our Lighthouse nodes saw it, though:

Jan 21 18:17:49.412 DEBG Cloned snapshot for late block/skipped slot, block_delay: Some(2.040814714s), parent_root: 0xd429dc371766b1d71fdad731879aafe7c2df990b402fbc5704b29144009cce8f, parent_slot: 8253079, slot: 8253087, service: beacon

Now the interesting thing here is the parent slot, https://beaconcha.in/slot/8253079. It's also empty! In order to heal the gap, we would need to load this parent block at 8253079, which we can't do because it has also been pruned.

In summary, blockprint's gap healing is broken for sidechains of length > 1. I can think of two ways to fix it:

Give the background task the ability to either delete orphaned blocks or mark them as orphaned in the database, whenever the slot of the missing parent has been finalized as a skipped slot. If we just mark them as orphaned, then we get to keep them in the DB (moar data) but won't block the gap healing process on them. On the other hand, marking them orphaned would require a new database column and a little DB migration (not too bad, given the small number of live blockprint instances).
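
A minimal sketch of the marking option, assuming the table is called blocks and using a hypothetical orphaned column added by a small migration:

def mark_orphaned(block_db, block_slot):
    # Called once the slot of the block's missing parent has been finalized
    # as a skipped slot, i.e. the parent can never be fetched. The `orphaned`
    # column is hypothetical, not part of blockprint's current schema.
    block_db.execute("UPDATE blocks SET orphaned = 1 WHERE slot = ?", (block_slot,))
    block_db.commit()

get_missing_parent_blocks would then need an AND orphaned = 0 condition in its WHERE clause, so that these blocks stay in the DB without blocking gap healing.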

Modularise feature selection

It would be cool to have features named and capable of being toggled on and off from the CLI.

It would also be nice to include feature names on the matplotlib graphs, with a 2D cross-section for each pair of features.
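
A sketch of what named, toggleable features might look like; the feature names and the reward fields they read are illustrative, not blockprint's actual feature set:

# Hypothetical named features keyed by the name used on the CLI.
FEATURES = {
    "reward_total": lambda block: block["total"],
    "num_attestations": lambda block: len(block["attestation_rewards"]),
}

def feature_vector(block, enabled):
    """Build one training row from only the features enabled on the CLI."""
    return [FEATURES[name](block) for name in enabled]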

Add Grandine support to the DB

We should support Grandine in the DB. This will be a breaking schema change unfortunately, as it introduces a new column (pr_grandine). We could provide a simple migration script to upgrade existing DBs so they don't need to be rebuilt from scratch.
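
A sketch of what such a migration script could look like, assuming the table is named blocks and pr_grandine is a probability column like the other pr_* columns (check build_db.py for the real schema):

import sqlite3

# One-off upgrade for an existing database; back up the file before running.
db = sqlite3.connect("block_db.sqlite")
db.execute("ALTER TABLE blocks ADD COLUMN pr_grandine REAL NOT NULL DEFAULT 0.0")
db.commit()
db.close()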

Make training graffiti configurable

Some graffiti regex should only be enabled on e.g. Prater. The ./prepare_training_data.py script should take a --prater or --mainnet flag to switch regex on/off, or a path to a file containing the regex.
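
A sketch of the proposed CLI, with argparse wiring only (the regex handling is a placeholder):

import argparse
import re

parser = argparse.ArgumentParser()
network = parser.add_mutually_exclusive_group()
network.add_argument("--prater", action="store_true", help="enable Prater-only graffiti regex")
network.add_argument("--mainnet", action="store_true", help="enable mainnet-only graffiti regex")
network.add_argument("--graffiti-regex", metavar="FILE", help="file of regex, one per line")

def load_regex_file(path):
    # Compile one regex per non-empty line of the given file.
    with open(path) as f:
        return [re.compile(line.strip()) for line in f if line.strip()]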

Questions about model-loading performance and configuration

Description

I just downloaded the "raw" block rewards for ~6M mainnet slots (~1TB of data), which after classification comes to ~130GB of "model" data (see the table below).

The model (size):
2.5G    ./model/Prysm
4.3G    ./model/Nimbus
31G     ./model/Teku
1.2G    ./model/Lodestar
92G     ./model/Lighthouse
130G    ./model

There are a few questions I would like to open here:

  1. Is this an excessive number of block rewards for training the model?
    I'd like an accurate enough model, but this much data seems like overkill to me. Perhaps you could guide me here based on your experience.

  2. Loading the model currently takes around 50 minutes. Is this an expected duration for the amount of data I'm handling?
    I checked whether there is a concurrency parameter I might be missing when loading the model (i.e. when constructing a Classifier object from the given model folder), but loading appears to be sequential.

  3. I also have some questions about the Classifier's possible features. I'm currently using the default ones, and I'd like to know whether you have seen better performance with a different combination of features.

Many thanks in advance, and great work with this repo!
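
On question 2: if loading really is sequential, one option worth experimenting with is parallelising the reads across the per-client model directories. A rough sketch of the I/O pattern only (this glosses over the Classifier's internals):

import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_client_dir(client_dir):
    # Load every block reward file for one client.
    blocks = [json.loads(p.read_text()) for p in sorted(client_dir.glob("*.json"))]
    return client_dir.name, blocks

def load_model_parallel(model_dir):
    client_dirs = [d for d in Path(model_dir).iterdir() if d.is_dir()]
    with ProcessPoolExecutor() as pool:
        return dict(pool.map(load_client_dir, client_dirs))

if __name__ == "__main__":
    model = load_model_parallel("./model")
    print({client: len(blocks) for client, blocks in model.items()})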

Return 404s when database contains gaps

Currently blockprint's API will return 0 values when querying a section of the database for which history is missing. This can occur if the beacon node backing blockprint has been falling in and out of sync or is re-syncing and lacks a portion of history.

At the moment these gaps can be detected via the /sync/gaps API, but this is error-prone and not obvious. It would be better if blockprint detected queries affected by the gaps and failed them explicitly so that the user doesn't have to check /sync/gaps manually. E.g. if there is a gap from slot 3623682 to 3687456 (as there is currently) then blockprint should 404 on queries like /blocks_per_client/113625/113850 rather than returning a response with 0 blocks per client for all clients.

The reason the API is the way it is currently is that the SQL query to detect the gaps takes several seconds to evaluate, because it scans the entire table. If this query can be made to run quickly, then it will be feasible to run it for every query.
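
One possible shape for the fix (a sketch, not a settled design): keep the gap list in a cache refreshed by a background task, and reject affected queries up front. Flask-style for illustration; blockprint's actual server framework may differ:

from flask import abort

cached_gaps = []  # refreshed periodically by a background task

def check_range_for_gaps(start_slot, end_slot):
    # Fail fast if the requested slot range intersects a known gap, instead of
    # silently returning zero block counts for every client.
    for gap in cached_gaps:
        if gap["start"] <= end_slot and start_slot <= gap["end"]:
            abort(404, f"history missing between slots {gap['start']} and {gap['end']}")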
