
Computational Optimization of DNA Activity (CODA)

Preprint on bioRxiv. For legacy reasons, this repo is currently listed as boda2.

Contents

Overview

Here, we present a platform to engineer and validate synthetic CREs capable of driving gene expression with programmed cell type specificity. This library contains the resources needed to train DNNs on MPRA data and generate synthetic sequences from these models. Additionally, we include resources to apply these models to inference tasks on common input data types (e.g., VCF, FASTA, etc.). Finally, we provide examples which deploy the Malinois model.

Tutorials

We have a number of end-to-end tested notebooks available as references for model training and sequence design in the tutorials subdirectory. Updated: Feb 16 2024.

Library Modules

The modules we develop to implement modeling and design are found in boda.

Scripts

We provide scripts for command-line integration of the modules we implement; these are found in src.

System requirements

Hardware requirements

CODA was extensively tested in Google Colab environments and GCP VMs with the following specs:

  • Type: a2-highgpu-1g
  • CPU: 12 vCPU
  • RAM: 85 GB
  • GPU: 1x Tesla A100
  • GPU-RAM: 40 GB HBM2

or

  • Type: custom-12-65536
  • CPU: 12 vCPU (Intel Broadwell)
  • RAM: 64 GB
  • GPU: 1x Tesla V100
  • GPU-RAM: 16 GB HBM2

Software Requirements

CODA was designed using the GCP deployment: NVIDIA GPU-Optimized Image for Deep Learning, ML & HPC

  • OS: Ubuntu 20.04.2 LTS
  • CUDA: 11.3
  • Python: 3.7.12
  • PyTorch: 1.13.1

The most recent tested python environment (Feb 23 2024) is provided in pip_packages.json.

Installation Guide

First install torch according to the installation guide at https://pytorch.org/; CODA can then be installed from the latest version of the GitHub repo. CODA was developed on torch==1.13.1.

git clone https://github.com/sjgosai/boda2.git
cd boda2/

pip install --upgrade pip==21.3.1
pip install --no-cache-dir -r requirements.txt
pip install -e .

Installation from the repository takes approximately 66 seconds (Feb 23 2024, tested in Colab). Properly installing torch varies considerably from system to system, so it was not feasible to include it in requirements.txt.

Colab

CODA works in Colab notebooks backed by custom VMs. We were successful using V100s, but not T4s (tested: Feb 16 2024). Deployment specs used to test these notebooks are:

  • Type: n1-highmem-8
  • GPU: 1x Tesla V100
  • Disk: SSD
  • Disk Size: 500 GB

We partially reproduce two of the tutorials as Colab examples.

Note: for reasons we have not tracked down, you need to restart the runtime before import boda will work.

Interactive Docker Environments

CODA has been installed in Docker containers that can be downloaded for interactive use. These can be quickly deployed using helper scripts in the repo:

cd /home/ubuntu
git clone https://github.com/sjgosai/boda2.git
bash boda2/src/run_docker_for_dev.sh gcr.io/sabeti-encode/boda devenv 0.2.0 8888 6006

This connects Jupyter Lab to ports 8888 and 6006.

More containers can be found at gcr.io/sabeti-encode/boda. Use devenv 0.2.0 if using A100s and devenv 0.1.2 if using V100s.

Interactive usage

Modeling

CODA is an extension of pytorch and pytorch-lightning. Classes in CODA used to construct models generally inherit from nn.Module and lightning.LightningModule but need to be combined as described in tutorials/construct_new_model.ipynb. The documentation for this is in progress.
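
For orientation, below is a minimal, hypothetical sketch of that split using plain PyTorch and Lightning. The class names and arguments are illustrative only, not the boda API; tutorials/construct_new_model.ipynb shows the supported interface.

import torch
from torch import nn
import lightning

class TinyCNN(nn.Module):                        # the "model" piece: a plain nn.Module
    def __init__(self, n_outputs=3):
        super().__init__()
        self.conv = nn.Conv1d(4, 32, kernel_size=19, padding=9)
        self.head = nn.Linear(32, n_outputs)

    def forward(self, x):                        # x: (batch, 4, seq_len) one-hot DNA
        h = torch.relu(self.conv(x)).mean(dim=-1)
        return self.head(h)

class TrainingGraph(lightning.LightningModule):  # the "graph" piece: training logic
    def __init__(self, model, lr=1e-3):
        super().__init__()
        self.model, self.lr = model, lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.l1_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)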

Inference

Example interactive deployment of Malinois for inference can be found here: tutorials/load_malinois_model.ipynb.
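
As a rough illustration of what inference looks like once a model is loaded, here is a hedged sketch. It assumes the loaded Malinois model takes (batch, 4, 600) one-hot inputs and returns (batch, 3) activities ordered K562, HepG2, SKNSH; it uses zero padding and a stand-in model as simplifications, and the notebook shows the supported loading and padding code.

import torch
import torch.nn.functional as F

def one_hot_dna(seq, alphabet='ACGT'):
    idx = torch.tensor([alphabet.index(b) for b in seq.upper()])
    return F.one_hot(idx, num_classes=4).T.float()

model = torch.nn.Sequential(                     # stand-in only; replace with the
    torch.nn.Flatten(),                          # Malinois model loaded as shown in
    torch.nn.Linear(4 * 600, 3),                 # the notebook
)
model.eval()

seq = 'ACGT' * 50                                # a 200-bp query sequence
x = one_hot_dna(seq)                             # (4, 200)
x = F.pad(x, (200, 200)).unsqueeze(0)            # (1, 4, 600); zero padding is a
                                                 # simplification of the fixed flanks
with torch.no_grad():
    k562, hepg2, sknsh = model(x).squeeze(0).tolist()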

The models trained for the paper:

Sequence generation

The tutorials also include a notebook that describes how to mix models with the sequence generation algorithms implemented in CODA: tutorials/run_generators.ipynb

Applications

We have developed python applications to train models and generate sequences using this library.

Model training

Deep learning models can be trained from the command line by invoking the DATA, MODEL, and GRAPH modules. For example (note: change --artifact_path arg at bottom):

python /home/ubuntu/boda2/src/train.py \
  --data_module=MPRA_DataModule \
    --datafile_path=gs://tewhey-public-data/CODA_resources/Table_S2__MPRA_dataset.txt \
    --sep tab --sequence_column sequence \
    --activity_columns K562_log2FC HepG2_log2FC SKNSH_log2FC \
    --stderr_columns K562_lfcSE HepG2_lfcSE SKNSH_lfcSE \
    --stderr_threshold 1.0 --batch_size=1076 \
    --duplication_cutoff=0.5 --std_multiple_cut=6.0 \
    --val_chrs 7 13 --test_chrs 9 21 X \
    --synth_val_pct=0.0 --synth_test_pct=99.98 \
    --padded_seq_len=600 --use_reverse_complements=True --num_workers=8 \
  --model_module=BassetBranched \
    --input_len 600 \
    --conv1_channels=300 --conv1_kernel_size=19 \
    --conv2_channels=200 --conv2_kernel_size=11 \
    --conv3_channels=200 --conv3_kernel_size=7 \
    --linear_activation=ReLU --linear_channels=1000 \
    --linear_dropout_p=0.11625456877954289 \
    --branched_activation=ReLU --branched_channels=140 \
    --branched_dropout_p=0.5757068086404574 \
    --n_outputs=3 --n_linear_layers=1 \
    --n_branched_layers=3 \
    --use_batch_norm=True --use_weight_norm=False \
    --loss_criterion=L1KLmixed --beta=5.0 \
    --reduction=mean \
  --graph_module=CNNTransferLearning \
    --parent_weights=gs://tewhey-public-data/CODA_resources/my-model.epoch_5-step_19885.pkl \
    --frozen_epochs=0 \
    --optimizer=Adam --amsgrad=True \
    --lr=0.0032658700881052086 --eps=1e-08 --weight_decay=0.0003438210249762151 \
    --beta1=0.8661062881299633 --beta2=0.879223105336538 \
    --scheduler=CosineAnnealingWarmRestarts --scheduler_interval=step \
    --T_0=4096 --T_mult=1 --eta_min=0.0 --last_epoch=-1 \
    --checkpoint_monitor=entropy_spearman --stopping_mode=max \
    --stopping_patience=30 --accelerator=gpu --devices=1 --min_epochs=60 --max_epochs=200 \
    --precision=16 --default_root_dir=/tmp/output/artifacts \
    --artifact_path=gs://your/bucket/mpra_model/

Update Feb. 16 2024: We found a problem with the training data in supplementary table 2 that we deposited on bioRxiv. An updated version is available here.

Sequence design

Trained models can be deployed to generate sequences using the implemented algorithms. This command will run Fast SeqProp using the Malinois model to optimize K562-specific expression:

python /home/ubuntu/boda2/src/generate.py \
    --params_module StraightThroughParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 --n_samples 10 \
        --use_norm True --use_affine False \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 1.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact gs://tewhey-public-data/CODA_resources/malinois_artifacts__20211113_021200__287348.tar.gz \
    --generator_module FastSeqProp \
         --n_steps 200 --learning_rate 0.5 \
    --energy_threshold -0.5 --max_attempts 40 \
    --n_proposals 1000 \
    --proposal_path ./test__k562__fsp

The target cell type can be changed by picking a different index for --target_feature. If you're using Malinois, the indices are {'K562': 0, 'HepG2': 1, 'SKNSH': 2}.
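
For intuition, the following is a conceptual sketch of the straight-through, gradient-based optimization that Fast SeqProp performs. It is not the boda implementation; the model argument is assumed to map (batch, 4, length) one-hot tensors to per-cell-type activities, and padding to the model's input length is omitted.

import torch

def fast_seqprop_sketch(model, target=0, length=200, n_steps=200, lr=0.5):
    # trainable nucleotide logits, one per position and base
    logits = torch.randn(1, 4, length, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        probs = torch.softmax(logits, dim=1)
        # straight-through: discrete one-hot on the forward pass, soft gradients backward
        hard = torch.nn.functional.one_hot(probs.argmax(dim=1), 4).transpose(1, 2).float()
        x = hard + probs - probs.detach()
        score = model(x)[:, target].mean()       # predicted activity in the target cell type
        (-score).backward()                      # gradient ascent on the prediction
        opt.step()
    return logits.detach().argmax(dim=1)         # final sequence as base indices (0=A ... 3=T)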

This command will run Simulated Annealing with the same objective:

python /home/ubuntu/boda2/src/generate.py \
    --params_module BasicParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 0.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact gs://tewhey-public-data/CODA_resources/malinois_artifacts__20211113_021200__287348.tar.gz \
    --generator_module SimulatedAnnealing \
         --n_steps 2000 --n_positions 5 \
         --a 1.0 --b 1.0 --gamma 0.501 \
    --energy_threshold -0.5 --max_attempts 40 \
    --n_proposals 1000 \
    --proposal_path ./test__k562__sa

This command will run AdaLead with the same objective:

python /home/ubuntu/boda2/src/generate.py \
    --params_module PassThroughParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 0.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact gs://tewhey-public-data/CODA_resources/malinois_artifacts__20211113_021200__287348.tar.gz \
    --generator_module AdaLead \
         --n_steps 20 --rho 2 --n_top_seqs_per_batch 1 \
         --mu 1 --recomb_rate 0.1 --threshold 0.05 \
    --energy_threshold -0.5 --max_attempts 4000 \
    --n_proposals 10 \
    --proposal_path ./test__k562__al

We have an example of back to back training and design using terminal commands in run_training_and_design.ipynb.

Variant effect prediction

Trained models can be deployed to infer the effect of non-coding variants in CREs:

python vcf_predict.py \
  --artifact_path gs://tewhey-public-data/CODA_resources/malinois_artifacts__20211113_021200__287348.tar.gz \
  --vcf_file hepg2.ase.calls.fdr.05.vcf \
  --fasta_file GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
  --output test.vcf \
  --relative_start 25 --relative_end 180 --step_size 25 \
  --strand_reduction mean --window_reduction gather \
  --window_gathering abs_max --gather_source skew --activity_threshold 0.5 \
  --batch_size 16
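
As a rough illustration of how these flags combine (one plausible reading, not the vcf_predict.py implementation), the sketch below scores the reference and alternate alleles in windows tiled across the CRE, computes the skew between alleles, and gathers the window with the largest absolute skew; in the real script, strands are additionally averaged per --strand_reduction mean.

import torch

def variant_skew_sketch(model, ref_windows, alt_windows, target=0):
    # ref_windows / alt_windows: (n_windows, 4, 600) one-hot tensors, one window per
    # placement of the variant within the sequence (cf. --relative_start/--step_size)
    with torch.no_grad():
        skew = model(alt_windows)[:, target] - model(ref_windows)[:, target]
    best = skew.abs().argmax()        # --window_gathering abs_max on --gather_source skew
    return skew[best]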

Saturation mutagenesis

UNDER CONSTRUCTION

Extending CODA

CODA is modular. If new modules fit the API requirements, they will work with the entire system, including the deployment applications. Further documentation on API requirements is forthcoming.

Cloud Integrations

Containerized CODA applications can be used in combination with various GCP platforms.

Training models on VertexAI

We provide containers that can be used in combination with Vertex AI's custom model APIs, including hyperparameter optimization. An example deployment using Vertex AI's Python SDK can be found here: tutorials/vertex_sdk_launch.ipynb

Deploying inference with Batch

UNDER CONSTRUCTION. Note: GCP is deprecating the Life Sciences API, which is what we used to scale inference. The goal is to migrate to Batch, which we still need to work out, so expect this to take some time.


