microsoft / evodiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models

License: MIT License

Python 98.90% Shell 0.86% Dockerfile 0.23%

discrete-diffusion generative-model multiple-sequence-alignment protein-sequences

evodiff's Introduction

EvoDiff

Description

In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.

We evaluate our sequence and MSA models – EvoDiff-Seq and EvoDiff-MSA, respectively – across a range of generation tasks to demonstrate their power for controllable protein design. Below, we provide documentation for running our models.

EvoDiff is described in this preprint; if you use the code from this repository or the results, please cite the preprint.

Evodiff
Table of contents
Installation
- Datasets
- Loading pretrained models
Available models
Unconditional generation
- Unconditional sequence generation
- Unconditional MSA generation
Conditional sequence generation
Analysis
Downloading generated sequences
Docker

Installation

To download our code, we recommend creating a clean conda environment with python v3.8.5.

conda create --name evodiff python=3.8.5

In that new environment, install EvoDiff:

pip install evodiff
pip install git+https://github.com/microsoft/evodiff.git # bleeding edge, current repo main branch

You will also need to install PyTorch (we tested our models on v2.0.1), PyTorch Geometric, and PyTorch Scatter.

We provide a notebook with installation guidance that can be found in examples/evodiff.ipynb. It also includes examples on how to generate a smaller number of sequences and MSAs using our models. We recommend following this notebook if you would like to use our models to generate proteins.

Thanks to Colby Ford, EvoDiff is also available as a Space on Hugging Face.

Our downstream analysis scripts make use of a variety of tools we do not include in our package installation. To run the scripts, please download the following packages in addition to EvoDiff:

TM score
Omegafold
ProteinMPNN
ESM-IF1; see this Jupyter notebook for setup details.
PGP
DR-BERT

We refer to the setup instructions outlined by the authors of those tools.

Datasets

We obtain sequences from the UniRef50 dataset, which contains approximately 42 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the OpenFold dataset, which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. The intrinsically disordered regions (IDR) data was obtained from the Reverse Homology GitHub.

For the scaffolding structural motifs task, we use the baselines compiled in RFDiffusion. We provide pdb and fasta files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide pdb files used for conditionally generating MSAs in the examples/scaffolding-msas folder.

To access the UniRef50 test sequences, use the following code:

from sequence_models.datasets import UniRefDataset

test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences

The filenames for the train and validation OpenFold splits are saved in data/train_msas.csv and data/valid_msas.csv.

Loading pretrained models

To load a model:

from evodiff.pretrained import OA_DM_38M

model, collater, tokenizer, scheme = OA_DM_38M()

Available evodiff models are:

D3PM_BLOSUM_640M()
D3PM_BLOSUM_38M()
D3PM_UNIFORM_640M()
D3PM_UNIFORM_38M()
OA_DM_640M()
OA_DM_38M()
MSA_D3PM_BLOSUM_RANDSUB()
MSA_D3PM_BLOSUM_MAXSUB()
MSA_D3PM_UNIFORM_RANDSUB()
MSA_D3PM_UNIFORM_MAXSUB()
MSA_OA_DM_RANDSUB()
MSA_OA_DM_MAXSUB()

It is also possible to load our LRAR baseline models:

LR_AR_640M()
LR_AR_38M()

Note: if you want to download a BLOSUM model, you will first need to download data/blosum62-special-MSA.mat.

Available models

We investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective. In order-agnostic autoregressive diffusion (OADM), one amino acid is converted to a special mask token at each step in the forward process. After $T=L$ steps, where $L$ is the length of the sequence, the entire sequence is masked. We additionally designed discrete denoising diffusion probabilistic models (D3PM) for protein sequences. In EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after $T$ steps the sequence is indistinguishable from a uniform sample over the amino acids. In the reverse process for both, a neural network model is trained to undo the previous corruption. The trained model can then generate new sequences starting from sequences of masked tokens (EvoDiff-OADM) or of uniformly sampled amino acids (EvoDiff-D3PM). We trained all EvoDiff sequence models on 42M sequences from UniRef50 using a dilated convolutional neural network architecture introduced in the CARP protein masked language model. We trained 38M-parameter and 640M-parameter versions for each forward corruption scheme and for left-to-right autoregressive (LRAR) decoding.
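The two forward processes described above can be illustrated with a toy sketch. This is not EvoDiff's implementation: the mask symbol, the 20-letter alphabet, the per-step noise level `beta`, and both function names are assumptions made for illustration, and the real models use their own tokenizer and noise schedule.

```python
import random

MASK = "#"  # placeholder mask symbol for this sketch (assumption)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def oadm_forward(seq, rng):
    """Mask one position per step in a random order; return the state after each step."""
    order = list(range(len(seq)))
    rng.shuffle(order)
    state = list(seq)
    states = []
    for pos in order:
        state[pos] = MASK
        states.append("".join(state))
    return states

def d3pm_uniform_step(seq, beta, rng):
    """One uniform-transition step: with probability beta, replace a residue
    by a uniformly drawn amino acid (which may equal the original)."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < beta else aa for aa in seq
    )

rng = random.Random(0)
seq = "MKTAYIAK"
oadm_states = oadm_forward(seq, rng)
print(oadm_states[-1])  # every position is masked after T = L steps
```

Applying `d3pm_uniform_step` repeatedly drives the sequence toward a uniform sample over the alphabet, which is the stationary state the D3PM-Uniform description refers to.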

To explicitly leverage evolutionary information, we designed and trained EvoDiff MSA models using the MSA Transformer architecture on the OpenFold dataset. To do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max"). Within each subsampling strategy, we then trained EvoDiff MSA models with the OADM and D3PM corruption schemes.
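As a rough illustration of the two subsampling strategies, the sketch below contrasts random selection with a greedy farthest-point selection on Hamming distance. It assumes that "Max" means repeatedly adding the sequence whose minimum distance to the already selected set is largest; the function names and the toy alignment are hypothetical, and the repository's own code in evodiff/data.py may differ in detail.

```python
import random

def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def subsample_random(msa, depth, rng):
    """Keep the query (row 0) and draw the remaining rows uniformly at random."""
    k = min(depth, len(msa)) - 1
    return [msa[0]] + rng.sample(msa[1:], k)

def subsample_max_hamming(msa, depth):
    """Greedy selection: add the row whose minimum Hamming distance
    to the rows selected so far is largest."""
    selected = [0]
    # min_dist[j] = distance from row j to its nearest selected row
    min_dist = [hamming(msa[0], s) for s in msa]
    while len(selected) < min(depth, len(msa)):
        candidates = [j for j in range(len(msa)) if j not in selected]
        best = max(candidates, key=lambda j: min_dist[j])
        selected.append(best)
        for j in range(len(msa)):
            min_dist[j] = min(min_dist[j], hamming(msa[best], msa[j]))
    return [msa[i] for i in selected]

toy_msa = ["AAAA", "AAAT", "TTTT", "AATT", "AAAA"]
diverse = subsample_max_hamming(toy_msa, depth=3)
print(diverse)
```

Taking the minimum distance to the selected set before the argmax is what makes the result diverse: a candidate only scores well if it is far from every sequence already chosen, not just from one of them.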

Unconditional sequence generation

Unconditional generation with EvoDiff-Seq

EvoDiff can generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids. All available models can be used to unconditionally generate new sequences, without needing to download the training datasets.

To unconditionally generate 100 sequences from EvoDiff-Seq, run the following script:

python evodiff/generate.py --model-type oa_dm_38M --num-seqs 100

The default model type is oa_dm_640M. The other available EvoDiff models are:

oa_dm_38M
d3pm_blosum_38M
d3pm_blosum_640M
d3pm_uniform_38M
d3pm_uniform_640M

Our LRAR baseline models are also available:

lr_ar_38M
lr_ar_640M

An example of unconditionally generating a sequence of a specified length can be found in this notebook.

To evaluate the generated sequences, we implement our self-consistency Omegafold ESM-IF pipeline, as shown in analysis/self_consistency_analysis.py. To use this evaluation script, you must have the dependencies listed under the Installation section installed.

Unconditional generation with EvoDiff-MSA

To explicitly leverage evolutionary information, we design and train EvoDiff-MSA models using the MSA Transformer architecture on the OpenFold dataset. To do so, we subsample MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences (“Random”) or by greedily maximizing for sequence diversity (“Max”).

It is possible to unconditionally generate an entire MSA, using the following script:

python evodiff/generate_msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming

The default model type is msa_oa_dm_maxsub, which is EvoDiff-MSA-OADM trained on Max subsampled sequences, and the other available evodiff models are:

EvoDiff-MSA OADM trained on random subsampled sequences: msa_oa_dm_randsub
EvoDiff-MSA D3PM-BLOSUM trained on Max subsampled sequences: msa_d3pm_blosum_maxsub
EvoDiff-MSA D3PM-BLOSUM trained on random subsampled sequences: msa_d3pm_blosum_randsub
EvoDiff-MSA D3PM-Uniform trained on Max subsampled sequences: msa_d3pm_uniform_maxsub
EvoDiff-MSA D3PM-Uniform trained on random subsampled sequences: msa_d3pm_uniform_randsub

You can also specify a desired number of sequences per MSA, sequence length, batch size, and more.

Conditional sequence generation

EvoDiff’s OADM diffusion framework induces a natural method for conditional sequence generation by fixing some subsequences and predicting the remainder. Because the model is trained to generate proteins with an arbitrary decoding order, this is easily accomplished by simply masking and decoding the desired portions. We apply EvoDiff’s power for controllable protein design across three scenarios: conditioning on evolutionary information encoded in MSAs, inpainting functional domains, and scaffolding structural motifs.
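The masking-and-decoding idea can be sketched in a few lines. The example below is a toy: `toy_predict` is a hypothetical stand-in that returns a random residue where the trained network would sample from its predicted distribution, and the mask symbol and function names are assumptions rather than EvoDiff's API.

```python
import random

MASK = "#"  # placeholder mask symbol for this sketch (assumption)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_predict(state, pos, rng):
    """Hypothetical stand-in for the trained model: returns a random residue."""
    return rng.choice(AMINO_ACIDS)

def inpaint(seq, start, end, predict, rng):
    """Mask seq[start..end] (inclusive) and decode the masked positions
    one at a time in a random order, keeping every other residue fixed."""
    state = list(seq)
    masked = list(range(start, end + 1))
    for pos in masked:
        state[pos] = MASK
    rng.shuffle(masked)  # order-agnostic: any decoding order is allowed
    for pos in masked:
        state[pos] = predict(state, pos, rng)
    return "".join(state)

original = "MKTAYIAKQR"
generated = inpaint(original, 3, 6, toy_predict, random.Random(0))
print(generated)
```

Only positions 3 to 6 change; the flanking residues act as the conditioning information, which is the same pattern used for inpainting and motif scaffolding.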

Evolution-guided protein generation with EvoDiff-MSA

First, we test the ability of EvoDiff-MSA (msa_oa_dm_maxsub) to generate new query sequences conditioned on the remainder of an MSA, thus generating new members of a protein family without needing to train family-specific generative models.

To generate a new query sequence given an alignment, pass the --start-msa flag. This starts conditional generation by sampling from a validation MSA. To run this script, you must have the OpenFold dataset and splits downloaded.

python evodiff/generate_msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-msa

If you want to generate on a custom MSA, it is possible to retrofit existing code.

The code can also generate an alignment given a query sequence: pass the --start-query flag, which starts with the query and generates the alignment.

python evodiff/generate_msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-query

NOTE: you can specify only one of the above flags at a time; --start-query and --start-msa cannot be combined. Please look at generate_msa.py for more information.
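For readers writing their own wrapper script, one way to enforce this kind of either-or constraint is argparse's mutually exclusive group. This is a minimal sketch, not a description of how generate_msa.py handles the flags; `build_parser` is a hypothetical helper.

```python
import argparse

def build_parser():
    """Parser in which --start-msa and --start-query cannot be combined."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--start-msa", action="store_true")
    group.add_argument("--start-query", action="store_true")
    return parser

args = build_parser().parse_args(["--start-msa"])
print(args.start_msa, args.start_query)
```

Passing both flags makes argparse print a usage error and exit, so the conflict is caught before any model is loaded.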

Generating intrinsically disordered regions

Because EvoDiff generates directly in sequence space, we hypothesized that it could natively generate intrinsically disordered regions (IDRs). IDRs are regions within a protein sequence that lack secondary or tertiary structure, and they carry out important and diverse functional roles in the cell directly facilitated by their lack of structure. Despite their prevalence and critical roles in function and disease, IDRs do not fit neatly in the structure-function paradigm and remain outside the capabilities of structure-based protein design methods.

We used inpainting with EvoDiff-Seq and EvoDiff-MSA to intentionally generate disordered regions conditioned on their surrounding structured regions, and then used DR-BERT to predict disorder scores for each residue in the generated and natural sequences. Note: to generate with our scripts here, you must have the IDR dataset downloaded. Different pre-processing steps may apply to other datasets.

To run our code and generate IDRs from EvoDiff-Seq, run the following:

python evodiff/conditional_generation.py --model-type oa_dm_640M --cond-task idr --num-seqs 1

or equivalently, from EvoDiff-MSA:

python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --query-only --max-seq-len 150 --num-seqs 1

These commands sample IDRs from the IDR dataset and generate new ones in their place.

Scaffolding functional motifs

Given that the fixed functional motif includes the residue identities for the motif, we show that a sequence-only model can be used for a motif scaffolding task. We used EvoDiff to generate scaffolds for a set of 17 motif-scaffolding problems by fixing the functional motif, supplying only the motif's amino-acid sequence as conditioning information, and then decoding the remainder of the sequence.

For the scaffolding structural motifs task, we provide pdb and fasta files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide a3m files used for conditionally generating MSAs in the examples/scaffolding-msas folder. Please view the PDB codes available and select an appropriate code. In this example, we use PDB code 1prw with domains 16-35 (FSLFDKDGDGTITTKELGTV) and 52-71 (INEVDADGNGTIDFPEFLTM). An example of generating 1 MSA scaffold of a structural motif can be found in this notebook.

To generate from EvoDiff-Seq:

python evodiff/conditional_generation.py --model-type oa_dm_640M --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 100 --scaffold-min 50 --scaffold-max 100

The --start-idxs and --end-idxs indicate the start & end indices for the motif being scaffolded. If defining multiple motifs, you can supply the start and end index motifs as new arguments, such as in the example we provide above.
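The example above pairs residues 16-35 and 52-71 with indices 15/34 and 51/70, which suggests the flags take zero-based, inclusive indices. Under that assumption (inferred from the example, not confirmed elsewhere in this document), a small hypothetical helper can convert 1-based residue ranges into the repeated flag pairs:

```python
def motif_flags(ranges_1based):
    """Convert 1-based inclusive residue ranges into 0-based
    --start-idxs/--end-idxs flag pairs, one pair per motif."""
    flags = []
    for start, end in ranges_1based:
        flags += ["--start-idxs", str(start - 1), "--end-idxs", str(end - 1)]
    return flags

flags = motif_flags([(16, 35), (52, 71)])
print(" ".join(flags))
# prints: --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70
```

Checking the motif length against the range is a cheap sanity test: both 1prw motifs are 20 residues long, matching 35 - 16 + 1 and 71 - 52 + 1.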

Equivalent code for generating a new scaffold sequence from an EvoDiff-MSA:

python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 1 --query-only

To generate a custom scaffold for a given motif, one simply needs to supply the PDB ID, and the residue indices of the motif. The code will download the PDB for you. In some cases PDB files downloaded from RCSB will be incomplete, or contain additional residues. We have implemented code to circumvent PDB-reading issues, but we recommend care when generating files for this task.

Analysis of generations

To analyze the quality of the generations, we look at:

amino acid KL divergence (aa_reconstruction_parity_plot)
secondary structure KL divergence (evodiff/analysis/calc_kl_ss.py)
model perplexity for sequences (evodiff/analysis/sequence_perp.py)
model perplexity for MSAs (evodiff/analysis/msa_perp.py)
Fréchet inception distance (evodiff/analysis/calc_fid.py)
Hamming distance (evodiff/analysis/calc_nearestseq_hamming.py)
RMSD score (analysis/rmsd_analysis.py)

We also compute the self-consistency perplexity to evaluate the foldability of generated sequences. To do so, we make use of various tools:

TM score
Omegafold
ProteinMPNN
ESM-IF1; see this Jupyter notebook for setup details.
PGP
DR-BERT

We refer to the setup instructions outlined by the authors of those tools.

Our analysis scripts for iterating over these tools are in the evodiff/analysis/downstream_bash_scripts folder. Once we run the scripts in this folder, we analyze the results in self_consistency_analysis.py.

Downloading generated sequences

We provide all generated sequences on the EvoDiff Zenodo.

To download our unconditionally generated sequences (unconditional_generations.csv):

curl -L -o unconditional_generations.csv "https://zenodo.org/record/8332830/files/unconditional_generations.csv?download=1"

To extract all unconditionally generated sequences created using the EvoDiff-seq oa_dm_640M model, run the following code:

import pandas as pd
df = pd.read_csv('unconditional_generations.csv', index_col = 0)
subset = df.loc[df['model'] == 'evodiff_oa_dm_640M']

The CSV files containing generated data are organized as follows:

Unconditional generations from sequence-based models: unconditional_generations.csv
- sequence: generated sequence
- min hamming dist: minimum Hamming distance between generated sequence and all training sequences
- seq len: length of generated sequence
- model: model type used for generations, models: evodiff_oa_dm_38M, evodiff_oa_dm_640M, evodiff_d3pm_uniform_38M,
  evodiff_d3pm_uniform_640M, evodiff_d3pm_blosum_38M, evodiff_d3pm_blosum_640M, carp_38M, carp_640M,
  lr_ar_38M, lr_ar_640M, esm_1b, or esm_2
Sequence predictions for unconditional structure generation baselines: esmif_predictions_unconditional_structure_generations.csv
- sequence: predicted protein sequence from protein structure (using ESM-IF1 model)
- seq len: length of generated sequence
- model: 'foldingdiff' or 'rfdiffusion'
Sequence generation via evolutionary alignments: msa_evolution_conditional_generations.csv
- sequence: generated query sequences
- seq len: length of generated sequence
- model: model type used for generations: evodiff_msa_oa_dm_maxsub, evodiff_msa_oa_dm_randsub, esm_msa_1b, or potts
Generated IDRs: idr_conditional_generations.csv
- sequence: subsampled sequence that contains IDR
- seq len: length of generated sequence
- gen_idrs: the generated IDR sequence
- original_idrs: the original IDR sequence
- start_idxs: indices corresponding to start of IDR in sequence
- end_idxs: indices corresponding to end of IDR in sequence (inclusive)
- model: model type used for generations: evodiff_seq_oa_dm_640M or evodiff_msa_oa_dm_maxsub
Successfully generated scaffolds: msa_scaffold.csv (EvoDiff-MSA generations) or seq_scaffold.csv (EvoDiff-Seq generations)
- pdb: pdb code corresponding to scaffold task
- seqs: generated scaffold and motif
- start_idxs: indices corresponding to start of motif
- end_idxs: indices corresponding to end of motif
- seq len: length of generated sequence
- scores: average predicted local distance difference test (pLDDT) of sequence
- rmsd: motifRMSD between predicted motif coordinates and crystal motif coordinates
- model: model type used for generations
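As an example of working with the scaffold files, the sketch below re-filters rows by pLDDT and motif RMSD using only the standard library. The column names follow the list above and are assumed to match the CSV headers exactly; the toy rows, their values, and the default thresholds are made up for illustration and are not taken from the released data.

```python
import csv
import io

# Toy rows with made-up values, standing in for seq_scaffold.csv.
toy_csv = io.StringIO(
    "pdb,seqs,scores,rmsd,model\n"
    "1prw,PLACEHOLDERA,85.2,0.7,model_a\n"
    "1prw,PLACEHOLDERB,62.0,0.9,model_a\n"
    "1bcf,PLACEHOLDERC,78.4,1.8,model_a\n"
)

def filter_scaffolds(handle, min_plddt=70.0, max_rmsd=1.0):
    """Keep rows whose average pLDDT ('scores') and motif RMSD ('rmsd')
    pass the given thresholds (illustrative defaults)."""
    return [
        row
        for row in csv.DictReader(handle)
        if float(row["scores"]) >= min_plddt and float(row["rmsd"]) <= max_rmsd
    ]

hits = filter_scaffolds(toy_csv)
print([row["seqs"] for row in hits])
```

To run this on a downloaded file, replace `toy_csv` with `open('seq_scaffold.csv', newline='')`.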

Docker

The Docker image for EvoDiff is hosted on DockerHub at https://hub.docker.com/r/cford38/evodiff.

docker pull cford38/evodiff:latest

Alternatively, you can build the Docker image locally.

## Build Docker Image
docker build -t evodiff .

Then, run the Docker image locally with the following command.

## Run Docker Image (Bash Console)
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name evodiff --rm -it evodiff /bin/bash

Note: You may need to set your default Torch device to cuda in the Docker container so that EvoDiff executes on the GPU.

import torch
torch.set_default_device('cuda:0')

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos are subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third party trademarks or logos is subject to those third-party's policies.

evodiff's Issues

Error with IDR inpainting with Evodiff-MSA

Hi,
I tried to run the script
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-query
but it could not find the file human_idr_alignments/human_protein_alignments.
Also, could you add a simple example to evodiff.ipynb for inpainting IDRs with EvoDiff-MSA?
Thanks in advance

Non-canonical amino acid sequence seen

I used conditional sequence generation via evodiff.ipynb with my own MSA file.
However, it produced an amino acid sequence containing non-canonical amino acid codes such as "Z" and "B".

Is this expected, and do these codes mean something, or is there a problem with my own MSA?

Thank you.
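For context, "B" and "Z" are IUPAC ambiguity codes (B: aspartate or asparagine, Z: glutamate or glutamine). Whether the model's alphabet includes them is implied by this report rather than confirmed here. A minimal sketch for locating such letters in a generated sequence, with a hypothetical helper name and a made-up input string:

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical residues

def noncanonical_positions(seq):
    """Return (index, letter) pairs for every residue outside the canonical 20."""
    return [(i, aa) for i, aa in enumerate(seq) if aa not in CANONICAL]

found = noncanonical_positions("MKBTZ")
print(found)
# prints: [(2, 'B'), (4, 'Z')]
```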

KeyError: '!' during conditional generation

Hello, I am experiencing a key error attempting to use the code 1 for 1 from the example notebook for conditional generation:
https://github.com/microsoft/evodiff/blob/main/examples/evodiff.ipynb

from evodiff.pretrained import MSA_OA_DM_MAXSUB
from evodiff.generate_msa import generate_query_oadm_msa_simple
import re

checkpoint = MSA_OA_DM_MAXSUB()
model, collater, tokenizer, scheme = checkpoint

path_to_msa = 'bfd_uniclust_hits.a3m'
n_sequences=64 # number of sequences in MSA to subsample
seq_length=256 # maximum sequence length to subsample
selection_type='random' # or 'MaxHamming'; MSA subsampling scheme

tokeinzed_sample, generated_sequence  = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)
print("New sequence (no gaps, pad tokens)", re.sub('[!-]', '', generated_sequence[0][0],))

The error can be traced back to:

evodiff/utils.py, line 247, in
return np.array([self.a_to_i[a] for a in seq[0]]) # for nested lists

The alphabet seems to not know how to handle ! which should be the padding token. This alphabet appears to be imported from sequence_models.constants as MSA_ALPHABET.

Also this is much less important but I noticed there's three instances of "tokeinzed_sample" as a variable name in the example notebook that almost certainly are meant to be "tokenized_sample"

How to train LRAR

Hi,

Thanks for your excellent work! I found that the code for training the LRAR model is masked. How can I modify the code to train LRAR model by myself?

Running the `evodiff/generate.py` script

I've been having trouble with getting the conda environment to work properly, so this may be exacerbating the issue below.

(evodiff3) ubuntu@209-20-159-77:~/evodiff_repo$ python evodiff/generate.py --model-type oa_dm_38M --num-seqs 100
Traceback (most recent call last):
  File "evodiff/generate.py", line 323, in <module>
    main()
  File "evodiff/generate.py", line 40, in main
    data = UniRefDataset('data/uniref50/', 'train', structure=False, max_len=2048)
  File "/home/ubuntu/miniconda3/envs/evodiff3/lib/python3.8/site-packages/sequence_models/datasets.py", line 330, in __init__
    with open(data_dir + 'splits.json', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/uniref50/splits.json'
(evodiff3) ubuntu@209-20-159-77:~/evodiff_repo$

Indexing bugs in TRRMSADataset

Great work putting together this project!
Two minor indexing errors in the TRRMSADataset class:
l256 in data.py:
sliced_msa = msa[:, slice_start: slice_start + self.max_seq_len] --> sliced_msa = msa[:, slice_start: slice_start + seq_len]

l274 in data.py:
output = sliced_msa[:64] --> output = sliced_msa[:self.n_sequences]

Allow diffusion to gap character in OADM models?

In the generation code for the OADM models, we are explicitly preventing the sampling of gap characters (e.g https://github.com/microsoft/evodiff/blob/main/evodiff/generate_msa.py#L211 )

Is this necessary, or is it an explicit choice to sample only fixed-length sequences given a sub-sampled MSA? Given that MASK and GAP are distinct tokens, and the ground truth alignment of the query sequence can contain gap tokens, it seems allowing diffusion to GAP would be desirable.

I noticed in the conditional MSA generation this check isn't present https://github.com/microsoft/evodiff/blob/main/evodiff/conditional_generation_msa.py#L605C1-L605C102 (despite the comment's claim), would like to understand better what the difference here is.

Scaffolding benchmark produces different sequences every time it is run

I'm not sure if this is expected behavior, but the scaffolding benchmark code produces different sequences every time it is run. It appears to be because in conditional_generation.py, while the numpy and pytorch seeds are set to 0, the random seed is not set and random.randint is used to generate the scaffold length. When I manually set random.seed(0), I got the same sequences on every run.
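The behavior is consistent with how the generators work: Python's `random` module keeps its own state, so seeding NumPy and PyTorch does not affect `random.randint`. A minimal stdlib-only sketch of the fix described above, where the function name and the length range are illustrative and not taken from conditional_generation.py:

```python
import random

def sample_scaffold_lengths(n, lo, hi, seed=None):
    """Draw n scaffold lengths in [lo, hi]; seeding makes the draw repeatable."""
    if seed is not None:
        random.seed(seed)  # seeds Python's generator only, not NumPy or PyTorch
    return [random.randint(lo, hi) for _ in range(n)]

first = sample_scaffold_lengths(5, 50, 100, seed=0)
second = sample_scaffold_lengths(5, 50, 100, seed=0)
print(first == second)  # prints: True
```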

Error Loading Pretrained Model with `--pretrained` Flag

Environment

Python version: 3.8
GPU model: NVIDIA GeForce RTX 3090

Issue Description

I am encountering a FileNotFoundError when attempting to load a pretrained model using the --pretrained flag during the execution of the training script.

Steps to Reproduce

Executed the following command:

python train.py /home/loris/dev/evodiff/config/config38M.json /home/loris/dev/evodiff/output --mini_run --gpus 1 --pretrained

Expected Behavior

The training script should start and the pretrained model should be loaded successfully.

Actual Behavior

Received a traceback ending with:

Using 30 as padding index
Using 28 as masking index
Loading weights from data/pretrained/checkpoint538468.tar...
FileNotFoundError: [Errno 2] No such file or directory: 'data/pretrained/checkpoint538468.tar'

Additional Context

The file checkpoint538468.tar does exist and is located in the data/pretrained directory. The script seems unable to locate the file even though the path is correctly specified.

Do you have any suggestions?
I actually made a brand new UniRef50 dataset, containing the necessary .npz, .json, and consensus.fasta files, after running the Snakefile that you used.
I am willing to retrain the 38M OADM on this dataset.
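One possible cause, offered only as an assumption since the report does not state the working directory: a relative path such as data/pretrained/... is resolved against the directory the script is launched from, not against the repository root. The sketch below demonstrates the effect with a temporary directory and a hypothetical file name.

```python
import os
import tempfile
from pathlib import Path

REL = Path("data/pretrained/checkpoint.tar")  # hypothetical relative path

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(os.path.realpath(tmp))
    (repo / REL).parent.mkdir(parents=True)
    (repo / REL).touch()

    start = os.getcwd()
    found_elsewhere = REL.exists()  # resolved against the launch directory
    os.chdir(repo)
    found_in_repo = REL.exists()    # same relative path, now resolves
    os.chdir(start)

print(found_elsewhere, found_in_repo)
```

If this is the cause, launching the script from the repository root or passing an absolute checkpoint path would avoid the error.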

Following basic code doesn't work

python evodiff/generate_msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming
Traceback (most recent call last):
  File "evodiff/generate_msa.py", line 2, in <module>
    import evodiff
  File "/home/jimendi1/miniconda3/envs/evodiff/lib/python3.8/site-packages/evodiff/__init__.py", line 2, in <module>
    from . import collaters
  File "/home/jimendi1/miniconda3/envs/evodiff/lib/python3.8/site-packages/evodiff/collaters.py", line 3, in <module>
    from evodiff.utils import Tokenizer
  File "/home/jimendi1/miniconda3/envs/evodiff/lib/python3.8/site-packages/evodiff/utils.py", line 4, in <module>
    from sequence_models.constants import MASK, MSA_PAD, MSA_ALPHABET, MSA_AAS, GAP, START, STOP
ModuleNotFoundError: No module named 'sequence_models'

msa_oa_ar_maxsub not recognized

Hi, I'm trying to run the example command for 1PRW in the README.

python evodiff/conditional_generation_msa.py --model-type msa_oa_ar_maxsub --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 1 --query-only

However, I'm getting the following error.

Traceback (most recent call last):
  File "evodiff/conditional_generation_msa.py", line 1065, in <module>
    main()
  File "evodiff/conditional_generation_msa.py", line 75, in main
    raise Exception("Please select either msa_oa_dm_randsub, msa_oa_dm_maxsub, or esm_msa_1b baseline. You selected:", args.model_type)
Exception: ('Please select either msa_oa_dm_randsub, msa_oa_dm_maxsub, or esm_msa_1b baseline. You selected:', 'msa_oa_ar_maxsub')

Should I be running a different command to get the same results as the paper?

Question on MaxHamming distance

Hi there!

Could someone please help explain the maxhamming distance in https://github.com/microsoft/evodiff/blob/main/evodiff/data.py#L60

curr_dist = cdist(random_seq, msa_subset, metric='hamming')
curr_dist = np.expand_dims(np.array(curr_dist), axis=0)  # shape is now (1,msa_num_seqs)
distance_matrix[i] = curr_dist
col_min = np.min(distance_matrix, axis=0)  # (1,num_choices)
max_ind = np.argmax(col_min)

Why do we get the minimum hamming distance for the random sequence wrt the msa instead of maximum and then do an argmax?
I may have missed the details but as far as I understand we need to get the sequences that have more mutations with respect to the anchor sequences in msa if we need diversity?

Thank you!

TypeError: forward() missing 1 required positional argument: 'y'

Hello, I was interested in learning how to use evodiff and have been encountering an error when running the example code. When I run:

tokeinzed_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)

I get this error in Jupyter notebooks:

TypeError Traceback (most recent call last)
Cell In[7], line 10
6 seq_length=256
7 selection_type='random'
---> 10 tokeinzed_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)

TypeError: forward() missing 1 required positional argument: 'y'

Since forward() isn't in the example code, I figured it may be an issue with another function or method

no generate directory

readme refers to a generate dir which does not exist anymore

Nothing works

None of the commands or scripts on the Readme can run. Bugs everywhere

padding_idx=masking_idx in ByteNetLMTime instantiation arguments

The following code in train.py assigns padding_idx=masking_idx when the model is instantiated. This conflicts with the definition of padding_idx above, which is different from masking_idx. Is this an oversight, or is there a particular reason for this assignment?

padding_idx = tokenizer.pad_id # PROTEIN_ALPHABET.index(PAD)
masking_idx = tokenizer.mask_id
print('Using {} as padding index'.format(padding_idx))
print('Using {} as masking index'.format(masking_idx))
#if args.model_type == 'ByteNet':
model = ByteNetLMTime(n_tokens, d_embed, d_model, n_layers, kernel_size, r,
causal=causal, padding_idx=masking_idx, rank=weight_rank, dropout=args.dropout,
tie_weights=args.tie_weights, final_ln=args.final_norm, slim=slim, activation=activation,
timesteps=diffusion_timesteps)

Thank you in advance

Error while running inpainting task using EvoDiff-seq

I was trying to run the IDR inpainting task and faced some issues:

Issue 1:

On running the example given on the GitHub: python evodiff/conditional_generation_msa.py --model-type msa_oa_ar_maxsub --cond-task idr --num-seqs 1

Error:
Exception: ('Please select either msa_oa_dm_randsub, msa_oa_dm_maxsub, or esm_msa_1b baseline. You selected:', 'msa_oa_ar_maxsub')

Following the error message I changed the model to msa_oa_dm_maxsub:

python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --num-seqs 1

I get the following error:
KeyError: 'OMA_ID'

Issue 2:

When trying to run this command: python evodiff/conditional_generation.py --model-type oa_dm_38M --cond-task idr --num-seqs 1 --pdb 1bcf --start-idxs 20 --end-idxs 50

Error:
FileNotFoundError: [Errno 2] No such file or directory: 'data/human_idr_alignments/human_idr_boundaries.tsv'

I am aware that the source for the IDR was linked to the Reverse Homology GitHub, but I cannot find the appropriate file there as well. I have gone through all of their Zenodo repositories as well. I'd highly appreciate if I can be directed towards, or be given instructions on how to generate the human_idr_boundaries.tsv file as I am not familiar with 'OMA_ID'.

Question: ways to improve conditional MSA generation

I played with evo-diff on msa conditional generation. Structures of query and generated are not as similar as I expected. Actually many times they are quite different.

Below is a comparison of 3D structures of query and generated sequence from your example jupyter notebook under "Evolutionary guided sequence generation with EvoDiff-MSA":

query=MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEASVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
generated=MDLRSSLVEHEGLRWKVYNNAEYVPTIGLGQIHNRPSQYWDYPVPLPEQYAEKDQISWSLETIQAVFDERYTKAKSEMVNLETIGKNFDDLPSEHTNAVTDMMFQLGTDHLSEFHKMITALKNNTYEEACREMKSSFWTRQMGNRCTRYLNDALEENYFFFNHH

Structures are from colab alphafold2 with default settings. Arrows pointing to regions where structures are not similar:

My question is how to generate sequences whose structures are more similar to that of the query. Should the input a3m include very similar sequences? Are there any generation parameters worth testing, such as the number of sequences to subsample? Thanks.

tag device='cuda' not working

Hi,

When I specify device=cuda I get an error saying cpu and cuda devices are in use?

I can see that in generate.py .to(device) is called on both the inputs and the model, so I am not sure why this is happening.

Error Loading State Dictionary in ByteNetLMTime: Unexpected Keys

I'm encountering an error when trying to load the state dictionary for the ByteNetLMTime model from the provided checkpoint. I'm following the instructions to retrain the 38M oadm model on data formatted like the uniref50 dataset, using the corresponding configuration provided in the repository. However, loading fails with an error about unexpected keys in the state dictionary.
Command :
python train.py config38M.json --dataset uniref50 -sd checkpoint_input/oaar-38M.tar
Output and Error :
nr 0 gpus 1 gpu 0 rank 0
Using 30 as padding index
Using 28 as masking index
Loading weights from /checkpoint_input/oaar-38M.tar...
..............
RuntimeError: Error(s) in loading state_dict for ByteNetLMTime: Unexpected key(s) in state_dict: "last_norm.weight", "last_norm.bias".

Expected Behavior:
The model should load the state dictionary without any issues, especially since I am using the provided configuration and checkpoint.
Could you please help in resolving this issue or suggest any necessary steps I might be missing?
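A quick way to narrow this down is to compare the parameter names the model expects with those stored in the checkpoint. The sketch below works on plain lists of names; `diff_keys` is a hypothetical helper, and apart from the two `last_norm` names copied from the error, the key names are made up for illustration. With PyTorch the two lists would come from `model.state_dict().keys()` and the loaded checkpoint's state dict.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Compare parameter names expected by a model with those in a checkpoint."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        # names the model needs but the checkpoint does not provide
        "missing": sorted(model_keys - checkpoint_keys),
        # names the checkpoint provides but the model has no slot for
        "unexpected": sorted(checkpoint_keys - model_keys),
    }


# Illustrative names only; 'decoder.weight' is a placeholder.
model_keys = ["decoder.weight"]
checkpoint_keys = ["decoder.weight", "last_norm.weight", "last_norm.bias"]
print(diff_keys(model_keys, checkpoint_keys))
```

Unexpected `last_norm.*` keys mean the checkpoint contains a layer that the freshly built model does not have. One possible cause, not confirmed in this thread, is a config that builds the model without a final normalization layer while the checkpoint was saved with one. Passing `strict=False` to `load_state_dict` would silence the error but would also discard those weights.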

Missing data files: lengths_and_offsets.npz, consensus.fasta, and splits.json

Hello,
While trying to run train.py, I noticed that the metadata and ds_train mentioned in lines 164-165 require the three files: lengths_and_offsets.npz, consensus.fasta, and splits.json. However, I couldn't find these files in the data folder or on Uniprot. Could you provide these files or guide me on how to obtain or generate them?
Thank you!

Using generate_msa()

When using MSA_OA_DM_MAXSUB() and generate_msa() I get the following error:

 in A3MMSADataset.__init__(self, selection_type, n_sequences, max_seq_len, data_dir, min_depth)
    351     _lengths = np.load(data_dir+'openfold_lengths.npz')['ells']
    352     lengths = np.array(_lengths)[keep_idx]
--> 353     all_files = np.array(all_files)[keep_idx]
    354     print("filter MSA depth > 64", len(all_files))
    356 # Re-filter based on high gap-contining rows

IndexError: index 1 is out of bounds for axis 0 with size 1

Do I need to have the OpenFold dataset (not only the weights but also the MSAs) in the evodiff/data/openfold folder?
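The traceback shows `keep_idx`, derived from `openfold_lengths.npz`, being used to index `all_files`. One plausible reading is that the lengths file describes more MSAs than are present in the data directory, so an index valid for the lengths is out of range for the file list. A minimal sketch of a check that makes this mismatch explicit, using plain lists and a hypothetical helper `select_by_index`:

```python
def select_by_index(items, keep_idx):
    """Pick items by index, failing with a descriptive message on a mismatch."""
    out_of_range = [i for i in keep_idx if not -len(items) <= i < len(items)]
    if out_of_range:
        raise ValueError(
            f"{len(out_of_range)} of {len(keep_idx)} indices do not fit a list "
            f"of {len(items)} entries; the index list and the file list "
            "probably describe different datasets"
        )
    return [items[i] for i in keep_idx]


# One file on disk, but the lengths metadata describes two entries.
try:
    select_by_index(["only_one.a3m"], [0, 1])
except ValueError as err:
    print(err)
```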

Action required: self-attest your goal for this repository

It's time to review and renew the intent of this repository

An owner or administrator of this repository has previously indicated that this repository cannot be migrated to GitHub inside Microsoft because it is going public, open source, or it is used to collaborate with external parties (customers, partners, suppliers, etc.).

Action

👀 ✍️ In order to keep Microsoft secure, we require repository owners and administrators to review this repository and regularly renew the intent whether to opt-in or opt-out of migration to GitHub inside Microsoft which is specifically intended for private or internal projects.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived. 🔒

Instructions

❌ Opt-out of migration

If this repository can not be migrated to GitHub inside Microsoft, you can opt-out of migration by replying with a comment on this issue containing one of the following optout command options below.

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

staging : My project will ship as Open Source

collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.

delete : This repository will be deleted because it is no longer needed.

other : Other reasons not specified

✅ Opt-in to migrate

If the circumstances of this repository have changed and you decide that you need to migrate, then you can specify the optin command below. For example, the repository is no longer going public, open source, or requiring external collaboration.

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

Need more help? 🖐️

Email [email protected]. ✉️
Post your questions in GitHub inside Microsoft Team in Microsoft Teams.

Inpaint VS Scaffold

Suppose I sample a sequence with a scaffold length of 1,
with, say, three motifs to scaffold, e.g.: GDHJK#############HDJKSZ#######HDDD
Would there be any difference from inpainting?
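To make the comparison concrete, the template can be split into the fixed motif segments and the `#` runs that are left for the model to fill. This is only a sketch of the bookkeeping, not an answer to whether the two tasks differ; `split_template` is a hypothetical helper, and treating `#` as the mask character follows the example above.

```python
import re


def split_template(template, mask_char="#"):
    """Split a template into fixed motif spans and masked spans.

    Returns (motifs, gaps): motifs are (start, end, residues) tuples and
    gaps are (start, end) tuples, with end exclusive in both cases.
    """
    mask = re.escape(mask_char)
    motifs = [(m.start(), m.end(), m.group())
              for m in re.finditer(rf"[^{mask}]+", template)]
    gaps = [(m.start(), m.end()) for m in re.finditer(rf"{mask}+", template)]
    return motifs, gaps


motifs, gaps = split_template("GDHJK#############HDJKSZ#######HDDD")
for start, end, residues in motifs:
    print(f"motif {residues} at [{start}, {end})")
for start, end in gaps:
    print(f"masked region of length {end - start} at [{start}, {end})")
```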

Question about fine-tuning

Hello people,

I have been struggling to train the 38M parameters model on a fasta dataset from the RCSB.
I had to alter the dataloader, but when running the script, it leads to batch size issues.
Has anybody tried to retrain these models on their own data?

Thanks in advance for any insights.

Best,

Error when generating sequences using esm1b_650M

When I use generate.py to generate protein sequences with
args.model_type=='esm1b_650M', an error is raised at lines 136-137:
string.append(i_string)
UnboundLocalError: local variable 'i_string' referenced before assignment
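This error usually means that a variable is assigned only inside some branches and then read after them. The sketch below reproduces that pattern and one way to fix it. It is not the code from generate.py: the function names, the toy alphabet, and the assumption that both model types decode the same way are all illustrative.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # toy alphabet, for illustration only


def to_strings_buggy(samples, model_type):
    strings = []
    for sample in samples:
        if model_type == "oa_dm_38M":
            i_string = "".join(ALPHABET[t] for t in sample)
        # No branch binds i_string for other model types, so the next
        # line raises UnboundLocalError for them.
        strings.append(i_string)
    return strings


def to_strings(samples, model_type):
    strings = []
    for sample in samples:
        if model_type in ("oa_dm_38M", "esm1b_650M"):
            i_string = "".join(ALPHABET[t] for t in sample)
        else:
            # Fail early with a clear message instead of an unbound name.
            raise ValueError(f"unsupported model type: {model_type}")
        strings.append(i_string)
    return strings


try:
    to_strings_buggy([[0, 1]], "esm1b_650M")
except UnboundLocalError as err:
    print("buggy version:", err)
print("fixed version:", to_strings([[0, 1]], "esm1b_650M"))
```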

Missing uniref50_aa_ref_test.csv

Hi,
I successfully generated sequences for the unconditional situation. However, at the end as main() tried to call aa_reconstruction_parity_plot() (line 148 in evodiff/generate.py) to plot the reconstruction parity, the reference file seems to be missing, giving:

FileNotFoundError: [Errno 2] No such file or directory: '/wynton/home/rotation/fhy/Desktop/DMs/ref/uniref50_aa_ref_test.csv'

I'm wondering where I am supposed to find the file, or whether I did anything wrong.
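Since generation itself succeeded, a workaround while the reference file is unavailable is to skip the plotting step when the file is absent. A minimal sketch, assuming the plot is optional for the workflow; `reference_path_or_none` is a hypothetical helper, and only the file name is taken from the error above.

```python
from pathlib import Path


def reference_path_or_none(directory, filename="uniref50_aa_ref_test.csv"):
    """Return the reference file path if it exists, otherwise None."""
    candidate = Path(directory) / filename
    return candidate if candidate.is_file() else None


ref = reference_path_or_none("ref")
if ref is None:
    print("reference file not found; skipping the reconstruction parity plot")
else:
    print(f"using reference file {ref}")
```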