
Protein Folding Subnet 25


This repository is the official codebase for Bittensor Subnet Folding (SN25), which was registered on May 20th, 2024. To learn more about the Bittensor project and the underlying mechanics, read here.

IMPORTANT: This repo has a functional testnet (netuid 141) as of May 13th. You should test your miners there before launching on mainnet.


[Image: generative-folding-tao]

Introduction

The protein folding subnet is Bittensor's first venture into academic use cases, built and maintained by Macrocosmos AI. While the current subnet landscape consists mainly of AI and web-scraping protocols, we believe that it is important to highlight to the world how Bittensor is flexible enough to solve almost any problem.

This subnet is designed to produce valuable academic research on Bittensor. Researchers and universities can use this subnet to solve almost any protein, on demand, for free. It is our hope that this subnet will empower researchers to conduct world-class research and publish in top journals while demonstrating that decentralized systems are an economical and efficient alternative to traditional approaches.

What is Protein Folding?

Proteins are the biological molecules that "do" things; they are the molecular machines of biochemistry. Enzymes that break down food, hemoglobin that carries oxygen in blood, and actin filaments that make muscles contract are all proteins. They are made from long chains of amino acids, and the sequence of these chains is the information that is stored in DNA. However, it is a large step to go from a linear chain of amino acids to a functional 3D structure.

The process of this chain folding in on itself into a stable 3D shape is called protein folding. For the most part, this process happens naturally, and the end structure is in a much lower free-energy state than the unfolded string. Like a bag of Lego bricks, though, it is not enough to know just the building blocks being used; it is the way they are supposed to be put together that matters. "Form defines function" is a common phrase in biochemistry, and it is the quest to determine form, and thus function, of proteins that makes this process so important to understand and simulate.

Why is Folding a Good Subnet Idea?

An ideal incentive mechanism defines an asymmetric workload between the validators and miners. The necessary proof of work (PoW) for the miners must require substantial effort and should be impossible to circumvent. On the other hand, the validation and rewarding process should benefit from some kind of privileged position or vantage point so that an objective score can be assigned without excess work. Put simply, rewarding should be objective and adversarially robust.

Protein folding is also a research topic that is of incredibly high value. Research groups all over the world dedicate their time to solving particular niches within this space. Providing a solution to attack this problem at scale is what Bittensor is meant to provide to the global community.

Reward Mechanism

Protein folding is a textbook example of this kind of asymmetry; the molecular dynamics simulation involves long and arduous calculations which apply the laws of physics to the system over and over again until an optimized configuration is obtained. There are no reasonable shortcuts.

While the process of simulation is exceedingly compute-intensive, the evaluation process is actually straightforward. The reward given to the miners is based on the 'energy' of their protein configuration (or shape). The energy value compactly represents the overall quality of their result, and this value is precisely what is decreased over the course of a molecular dynamics simulation. The energy directly corresponds to the configuration of the structure and can be computed in closed form. The gif below illustrates the energy minimization over a short simulation procedure.
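
As a sketch of how such an energy-based score can translate into rewards (the function and variable names here are illustrative, not the subnet's actual API):

```python
def score_miners(final_energies: dict) -> dict:
    """Winner-takes-all scoring sketch: the miner reporting the lowest
    (most negative) final energy receives the full reward weight."""
    best = min(final_energies, key=final_energies.get)
    return {uid: (1.0 if uid == best else 0.0) for uid in final_energies}

# Miner "b" reports the lowest energy, so it takes the full weight.
weights = score_miners({"a": -1.2e4, "b": -1.9e4, "c": -0.8e4})
```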

[GIF: Folded-protein]

When the simulations finally converge (ΔE/t < threshold), they produce the form of the proteins as they are observed in real physical contexts, and this form gives rise to their biological function. Thus, the miners provide utility by preparing ready-for-study proteins on demand. An example of such a protein is shown below.
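
A ΔE/t convergence check of this kind can be sketched as follows (the window and threshold values are illustrative; the subnet's actual criterion may differ):

```python
def has_converged(energies: list, window: int = 10, threshold: float = 1.0) -> bool:
    """Return True once the mean energy change per step over the last
    `window` samples drops below `threshold` (i.e. dE/dt < threshold)."""
    if len(energies) < window + 1:
        return False
    recent = energies[-(window + 1):]
    return abs(recent[-1] - recent[0]) / window < threshold
```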

[Image: Folded-protein]

Running the Subnet

Requirements

Protein folding utilizes a standardized package called GROMACS. To run, you will need:

  1. A Linux-based machine
  2. Multiple high-performance CPU cores

Out of the box, we do not require miners to run GPU-compatible GROMACS builds. For more information regarding recommended hardware specifications, see min_compute.yml.

IMPORTANT: GROMACS is a large package and can take anywhere from 1 to 1.5 hours to download.

Installation

This repository requires Python 3.8 or higher. To install, simply clone this repository and run the install.sh script:

git clone https://github.com/macrocosm-os/folding.git
cd folding
bash install.sh

This will also create a virtual environment in which the repo can be run.

Occasionally the install can run into problems, so to ensure that GROMACS is installed correctly, please check your .bashrc. Importantly, these lines MUST be run:

echo "source /usr/local/gromacs/bin/GMXRC" >> ~/.bashrc
source ~/.bashrc

The install script installs the necessary requirements, downloads GROMACS, and adds it to your .bashrc. To confirm that installation is complete, running gmx in the terminal should print:

 :-) GROMACS - gmx, 2024.1 (-:

If not, there is a problem with your installation or with your .bashrc.

Registering on Mainnet

btcli subnet register --netuid 25 --wallet.name <YOUR_COLDKEY> --wallet.hotkey <YOUR_HOTKEY>

Registering on Testnet

Netuids that are larger than 99 must be set explicitly when registering your hotkey. Use the following command:

btcli subnet register --netuid 141 --wallet.name <YOUR_COLDKEY> --wallet.hotkey <YOUR_HOTKEY>

Launch Commands

Validator

There are many parameters that one can configure for a simulation. The base command-line args that are needed to run the validator are below.

python neurons/validator.py
    --netuid <25/141>
    --subtensor.network <finney/test>
    --wallet.name <your wallet> # Must be created using the bittensor-cli
    --wallet.hotkey <your hotkey> # Must be created using the bittensor-cli
    --axon.port <your axon port> #VERY IMPORTANT: set the port to be one of the open TCP ports on your machine

As a validator, you should change these base parameters in scripts/run_validator.py.

For additional configuration, the following params are useful:

python neurons/validator.py
    --netuid <25/141>
    --subtensor.network <finney/test>
    --wallet.name <your wallet> # Must be created using the bittensor-cli
    --wallet.hotkey <your hotkey> # Must be created using the bittensor-cli
    --neuron.queue_size <number of pdb_ids to submit>
    --neuron.sample_size <number of miners per pdb_id>
    --protein.max_steps <number of steps for the simulation>
    --logging.debug # Run in debug mode, alternatively --logging.trace for trace mode
    --axon.port <your axon port> #VERY IMPORTANT: set the port to be one of the open TCP ports on your machine

We strongly recommend that validators run the autoprocess script to ensure they are always up to date with the most recent version of folding. We have version tagging that will prevent validators from setting weights if they are not on the correct version.

bash run_autoprocess.sh

Miner

There are many parameters that one can configure for a simulation. The base command-line args that are needed to run the miner are below.

python neurons/miner.py
    --netuid <25/141>
    --subtensor.network <finney/test>
    --wallet.name <your wallet> # Must be created using the bittensor-cli
    --wallet.hotkey <your hotkey> # Must be created using the bittensor-cli
    --neuron.max_workers <number of processes to run on your machine>
    --axon.port <your axon port> #VERY IMPORTANT: set the port to be one of the open TCP ports on your machine

Optionally, pm2 can be run for both the validator and the miner using our utility scripts found in pm2_configs.

pm2 start pm2_configs/miner.config.js

or

pm2 start pm2_configs/validator.config.js

Keep in mind that you will need to change the default parameters for either the miner or the validator.

How does the Subnet Work?

In this subnet, validators create protein folding challenges for miners, who in turn run simulations based on GROMACS to obtain stable protein configurations. At a high level, each role can be broken down into parts:

Validation

  1. The validator creates a queue of neuron.queue_size proteins to fold.
  2. Each protein is distributed to neuron.sample_size miners (i.e., 1 PDB --> a sample_size batch of miners).
  3. The validator is responsible for keeping track of the sample_size * queue_size individual tasks it has distributed.
  4. The validator queries and logs results for all jobs on a timer, neuron.update_interval.
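
The bookkeeping above can be sketched as follows (the class and field names are hypothetical, not the folding codebase's actual types):

```python
import itertools
import time
from dataclasses import dataclass, field

@dataclass
class Job:
    pdb_id: str
    miner_uids: list
    submitted_at: float = field(default_factory=time.time)

def build_jobs(pdb_ids, miner_pool, sample_size):
    """Assign each protein to `sample_size` miners, giving the validator
    len(pdb_ids) * sample_size individual tasks to track."""
    uids = itertools.cycle(miner_pool)
    return [Job(pdb, [next(uids) for _ in range(sample_size)]) for pdb in pdb_ids]
```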

For more detailed information, look at validation.md

Mining

Miners are expected to run many parallel processes, each executing an energy minimization routine for a particular pdb_id. The number of protein jobs a miner can handle is determined via the config.neuron.max_workers parameter.
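
A worker pool in the spirit of neuron.max_workers might look like this (a sketch only; run_simulation stands in for the actual minimization routine):

```python
from concurrent.futures import ProcessPoolExecutor

def run_simulation(pdb_id: str) -> str:
    """Placeholder for the per-protein energy minimization routine."""
    return f"{pdb_id}: minimized"

def serve(pdb_ids, max_workers=4):
    """Run up to `max_workers` protein jobs in parallel processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_simulation, pdb_ids))
```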

For detailed information, read mining.md.

Notes

Miner simulations will output a projected time. The first two runs will be about the same length, with the third taking about an order of magnitude longer at the default of steps = 50,000. The number of steps (steps) and the maximum allowed runtime (maxh) are easily configurable and should be used by miners to prevent timing out. We also encourage miners to take advantage of early-stopping techniques so that simulations do not run past convergence.
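
An early-stopping rule could be sketched like this (the patience and epsilon values are illustrative, not the subnet's defaults):

```python
def should_stop_early(energies, patience=5, epsilon=1e-2):
    """Stop once the best energy seen in the last `patience` readings no
    longer improves on the earlier best by at least `epsilon`."""
    if len(energies) <= patience:
        return False
    best_before = min(energies[:-patience])
    best_recent = min(energies[-patience:])
    return best_before - best_recent < epsilon
```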

Furthermore, we want to support the use of ML-based mining so that recent algorithmic advances (e.g. AlphaFold) can be leveraged. At present, this subnet is effectively a specialized compute subnet (rather than an algorithmic subnet). For now, we leave this work to motivated miners.

GROMACS itself is a rather robust package and is widely used within the research community. There are specific guides and functions if you wish to parallelize your processing or run these computations off of a GPU to speed things up.

License

This repository is licensed under the MIT License.

# The MIT License (MIT)
# Copyright © 2024 Yuma Rao

# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the “Software”), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all copies or substantial portions of
# the Software.

# THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
# THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

folding's People

Contributors

bkb2135, cxmplex, loayei, mccrindlebrian, p-ferreira, rodrigolpa, sarkosos, schampoux, steffencruz


folding's Issues

Determine the best way for data persistence

This subnet is going to generate a lot of data, and we are in need of a way of storing data. We need to know:

  1. Possible providers if not Runpod volumes
  2. When to use wandb (what data / plots)
  3. How often should we log results?
  4. How much data does each simulation create? (Record not only the data size but also parameters about the simulation, so we can determine whether file size is related to things like protein shape, complexity, etc.)

How many proteins are eligible for simulation?

How can we know if a pdb is going to be a good candidate for simulation? Are there initial conditions that stop us from being able to simulate? Estimate the total number of eligible candidates.

Increasing Phase Space

There are lots of variables that can be tuned to determine what pdb configurations can be sent to the miner.

There are some issues tracking aspects of this:
#17, #18

This issue will keep an open record of how we are increasing the problem space.

Folding MVP Timeline

Deadline for Folding MVP is April 19th.

Here is the timing that we are going to impose for the following week with key deliverables highlighted:

Tuesday April 9th

@mccrindlebrian

  • Identify the 3/4 most important parameters for the validator hyperparameter space. Identify the most popular options via literature for these parameters and use this as the basis for the grid search
  • Put sampling routine into code, recording what parameter combination you have already sampled, and record this data in a local wandb proj (pdb_id --> parameters --> success/fail)
  • Simulate at least 100 validator forward passes. If certain flags are present in pdbs, then it is possible that the pdb will not run at all.

@schampoux

  • How can we make all pdb simulations look like the ones you did? Specifically, identify the important parameters that need to be set in the .mdp files (or any other file used for simulation) for us to achieve this goal, such as nstenergy. Look at your old simulations; there should be a clear signal here. We are currently altering some important parameters in the Protein class's edit_files method. Issue

  • Tied to this is, of course, the amount of (tabular) data that the miner is saving to the corresponding energy/rmsd files. Once you are confident regarding point (1), please integrate your data extracting logic into the codebase. folding/utils would be a good place for this. We will need this code for sending it downstream to validators.

Wednesday April 10th

@mccrindlebrian

  • Create the code infrastructure for the reward landscape. The deliverable here is V0: the miner with the lowest loss wins 100%. To do this, I will need your data parser from above @schampoux

@schampoux

  • The data that you parse needs to be recorded, and plots need to be made. Set up two wandb projects, one for the miners and one for the validators. Incorporate code that uses the data parsing methods you made to parse the output files, save them into dataframes, and log the data to wandb. Data should include ALL the information we need (uid, hyperparameters, etc.). Make sure that the data we care about is being plotted on wandb correctly! I can also support this wandb effort. We should prioritize data being recorded from the validator side of things. The miner can come later if time is limited.

Thursday April 11th

@mccrindlebrian

  • Now we should be well set up to run a set of simulations entirely in-loop. We want to run simulations where we have a 1-to-2 validator-to-miner setup. This means that we can properly test out the miner scaling with the reward mechanism.
  • Furthermore, we can generate a lot more data. Scale this up to 5 validators. We will need multiple machines. Spin up runpod instances that have gromacs installed so this is not a blocker.
  • Run these experiments where you have all validators on one side, and all miners on the other. This will give us better clarity on how we save and send data over the wire.
  • Work that @schampoux did regarding save frequency will be useful here. Ideally, we do NOT have to touch miner code at this stage. Should be relatively set in stone.

@schampoux

  • Keeping this as a buffer day, as the above could take some time. We will evaluate your bandwidth when we get here. I might need you for the above.

@RodrigoLPA

  • Looking at default gromacs force-field information. What fields are available, does gromacs do anything smart when it comes to default parameters/choices based on the pdb?
  • Looking into MISSING flags, and general cleaning of PDBs to increase the success rate of running a simulation on the validator side (for reference, look into folding/protein.generate_input_files())

Friday April 12th

@mccrindlebrian & @schampoux

  • At this point, we should have many simulations and a ton of logged data in wandb. Create scripts to pull the data (copy from the analysis repo) so we can do an analysis of the success rate on the validator side. The KPI here is: > 50% of pdbs sampled were able to find a set of hyperparameters that finished and sent the completed data to the miner.
  • Get wandb working with miner architecture, if not done.

RUN EXPERIMENTS OVER THE WEEKEND

Monday/Tuesday April 15/16th

This week is for refining the codebase. We should have identified some clear oversights in either design, logging, etc. at this point.

@mccrindlebrian

  • The reward mechanism can be tightened up here. Do some data science on the reward curves to see variabilities between miners, and test different scaling methods for rewards. Condition miner variability (std, kurtosis, etc.) on hyperparameter combinations. There could be some interesting relationships that pop out here, which might ultimately indicate things like protein complexity (size, number of heads, etc.). This is where @steffencruz could be helpful.

@schampoux

  • Improve wandb logging. Probably needs more fine-grained information
  • If we haven't done so already, we need to determine the best way to send data over the wire from the miner to the validator: CSVs, files directly, binary? Please look into this.

Wednesday-Friday April 17-19th

  • Run many many simulations. Get data. Fix bugs. Release!

Design of the Reward Mechanism

There are many ways to design the reward mechanism.

We need to benchmark the chosen reward mechanism(s) in many ways before we can deploy this on mainnet. We need to understand the expected results in terms of miner rewards, competitiveness, dependency on hardware, etc. Main points are:

GROMACS API vs shell commands

We can execute GROMACS commands in two ways in Python:

  1. Using the GromacsWrapper, or
  2. shell commands

We need to know which one is more suitable here (pros/cons). Things like

  1. running in background threads/processes,
  2. error handling,
  3. code clarity and extensibility.
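
For comparison, the shell-command route (option 2) can be wrapped with subprocess so that errors are at least capturable (a sketch; the actual gmx arguments are up to the caller):

```python
import subprocess

def run_cmd(cmd):
    """Run a command, capturing stdout/stderr so errors can be inspected
    and handled instead of spilling into the terminal."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)

# e.g. run_cmd(["gmx", "mdrun", "-deffnm", "md_0_1"])
```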

Quantify the size of the problem space

We need to understand how large this problem space is so that we do not exhaust all of the proteins too quickly and effectively kill the PoW component (since lookups would become the norm). Key points are:

Record edge cases

Are there any issues that arise when there are edge cases (multiple validators querying the same miner, no miner responses available, invalid miner responses)?

Remove checkpoint files

Validators seem to generate lots of backup files when running rerun commands, which is problematic. Find a graceful way to stop checkpoint generation.

Exploits

Can we prevent lookup attacks by submitting intermediate results?

Parsing Error tracking: gro_hash

The gro_hash method in folding/utils/ops.py encounters errors for certain proteins/DNA. This method is responsible for generating the hash for a specific gro file. It does this by parsing the .gro file and concatenating the residue name, atom name, and residue number from each line into a single string. The error arises because some of the atoms contain apostrophes.

The goal for this issue is to increase the robustness of the gro_hash generator so that it can accommodate cases like this.
Another goal for this issue is to track other errors.
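
One way to make the hashing robust to characters like apostrophes is to normalize each line before hashing. This is a sketch, not the actual gro_hash implementation:

```python
import hashlib
import re

def normalize_gro_line(line: str) -> str:
    """Keep only alphanumerics, dropping apostrophes (e.g. atom names
    like O5') and whitespace so equivalent lines hash identically."""
    return re.sub(r"[^A-Za-z0-9]", "", line)

def gro_hash_sketch(lines) -> str:
    """Hash the normalized concatenation of all atom lines."""
    body = "".join(normalize_gro_line(l) for l in lines)
    return hashlib.md5(body.encode()).hexdigest()
```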

Seeding miner and validator simulations

There are different parameters that you can set to ensure that simulations are deterministic. This is mainly done by applying mdrun -seed. However, there are also some hyperparameters that can be set for grompp commands. We should implement this for V0.

  • the miner forward should always use the miner uid as the seed
  • is the seed verifiable in any of the files? xvg...
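
One concrete route, since GROMACS seed parameters such as ld-seed and gen-seed live in the .mdp file consumed by grompp, is to write the miner uid into the .mdp before preprocessing (a sketch; real .mdp handling may need more care):

```python
def set_seed_in_mdp(mdp_text: str, seed: int) -> str:
    """Replace any existing ld-seed/gen-seed lines with the given seed
    (e.g. the miner uid) so the simulation is deterministic."""
    lines = [l for l in mdp_text.splitlines()
             if not l.strip().startswith(("ld-seed", "gen-seed"))]
    lines += [f"ld-seed = {seed}", f"gen-seed = {seed}"]
    return "\n".join(lines)
```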

Miners should return old data that is not md_0_1

Failed to attach files for pdb 1qcc with error: No files found for md_0_1 happens when we have old data but never finished the simulation. We should detect this behaviour and resume the simulation from the most advanced checkpoint we have if it hasn't finished.

Quantify Simulation Timing and Timeout conditions

What is the characteristic timescale for simulations? In order to answer we need to evaluate:

  1. How many steps do simulations require to converge, in general,
  2. and how much wall time is this?
  3. What should we set timeouts to? Are there any stability-based metrics we can come up with that warrant when we should shut-down a simulation?
  4. How long is a forward pass?

Analysis of Hyperparameter Efficiency and Success Rates

To analyze and identify optimal hyperparameters for protein folding simulations based on a dataset of PDB IDs. The goal is to determine which hyperparameters correlate with successful simulations and to explore why simulations are failing.

Folding MVP Final Week Sprint

Here are the details for the final week's sprint on Folding. Folding is currently in a place where it has high success rates for the validator hyperparameter search, but validators in general are not busy, and there are open questions about how miners are asked for work. Here are some key components:

  1. gro-based energy calculations (exploit resistant)
  2. Validator scheduler & DB
  3. Hyperparameter tuning (timeout, wait_time, sample_size, max_concurrent_pdbs) for optimal workload (adaptive?)
  4. Miner process pool executor for multiple jobs
  5. Some restructuring of data directories (low priority?)
  6. Logging and persistence of final results
  7. Data viz
  8. Loads of experiments

We can try and break these down into tasks for each member of the team.

@steffencruz

  • Validator scheduler and DB management.
    - Validators to ping miners to see if they are alive
    - Validators can process/wait more than 1 pdb at a time
    - Validators can query miners at some pre-defined interval to acquire .gro files

@mccrindlebrian

  • Mining infrastructure
    - Miner process pool executor (aka: being able to pull pdbs out of the queue and not always having to wait for another pdb request from a validator), and spinning up N child processes based on the capacity of the CPU. Need to check whether this design is a problem with GPU-compiled GROMACS.
    - Being able to be interrupted during simulation, return requested data, and continue simulation (with or without a new ckpt file?)
    - Therefore a part of this is storing which validators get what pdb? I guess this is handled within the child process itself so we don't need to explicitly keep track of this?

  • Logging and persistence of final results
    - This infra is already in place, I just need to add more to the miner side. This requires some refactoring of the run_commands function

@schampoux

  • Gro-based energy calculations
    - Implement a methodology on the validator side that can apply the necessary steps to compute the next step (or next N steps) of any energy file via the -rerun method.

  • Data visualization
    - We can currently plot data if given in the reward stack (we log this to wandb) but it would also be good to see if we can do this on the wandb side (if the data is available)
    - More importantly, we want to see the protein! Find a way to show the protein given the necessary files. This can be separate from wandb

@RodrigoLPA

  • We need you to run your parallelization pdb script and gather all the statistics.
    - Number of successful pdbs
    - What are the properties of these pdbs? Distribution of size/ other metrics of complexity. It looks like simple pdbs might be out of the question since they optimize too quickly. We need to know how large this space is.

Everyone

We need to run a boat load of experiments, specifically:

  • Run with multiple validators (3) on different machines
  • Run with at least 10 miners

I think this is a good opportunity for @RodrigoLPA to learn how to spin up miners and validators, use pm2, and manage across multiple machines. I am happy to support here with 1-on-1 time getting him up to speed.

Calibration of Epsilon

There is a parameter epsilon that determines the minimum threshold needed to indicate what is an "improvement" in loss. This is set to some arbitrary number. Gather data and measure what this value should be. Static, or dynamic?
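
Whatever value is ultimately chosen, the check itself is simple. A sketch with a static epsilon (the default value here is arbitrary, which is exactly what this issue aims to fix):

```python
def is_improvement(prev_best: float, new_energy: float, epsilon: float = 1e-2) -> bool:
    """Count a lower energy as an improvement only if it beats the
    incumbent best by at least `epsilon`."""
    return prev_best - new_energy >= epsilon
```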

How can we effectively find subsets of proteins that will work under similar simulation conditions?

In order to prove that this system can work as a subnet, it is essential that we have a subset of pdb files to test the system. This is not easy because not all proteins behave the same way under similar experimental conditions. The purpose of this issue is to outline how to effectively find these subsets of "similar" proteins. If the goal of the simulation is energy minimization and we wish to switch out PDB files with minimal adjustments to hyperparameters, structural similarity is the most relevant criterion.
