
evodiff's Issues

Running the `evodiff/generate.py` script

I've been having trouble getting the conda environment to work properly, so that may be exacerbating the issue below.

(evodiff3) ubuntu@209-20-159-77:~/evodiff_repo$ python evodiff/generate.py --model-type oa_dm_38M --num-seqs 100
Traceback (most recent call last):
  File "evodiff/generate.py", line 323, in <module>
    main()
  File "evodiff/generate.py", line 40, in main
    data = UniRefDataset('data/uniref50/', 'train', structure=False, max_len=2048)
  File "/home/ubuntu/miniconda3/envs/evodiff3/lib/python3.8/site-packages/sequence_models/datasets.py", line 330, in __init__
    with open(data_dir + 'splits.json', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/uniref50/splits.json'
(evodiff3) ubuntu@209-20-159-77:~/evodiff_repo$ 
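For reference, a quick check I ran (plain Python; the expected file list is inferred from this traceback plus the "Missing data files" issue further down, not from documentation):

import os

# Files that sequence_models.datasets.UniRefDataset appears to expect under data/uniref50/
expected = ['splits.json', 'consensus.fasta', 'lengths_and_offsets.npz']
for name in expected:
    path = os.path.join('data/uniref50/', name)
    print(path, 'found' if os.path.exists(path) else 'MISSING')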

Question: ways to improve conditional MSA generation

I have been experimenting with EvoDiff for conditional MSA generation. The structures of the query and the generated sequences are not as similar as I expected; in fact, they are often quite different.

Below is a comparison of 3D structures of query and generated sequence from your example jupyter notebook under "Evolutionary guided sequence generation with EvoDiff-MSA":

query=MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEASVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
generated=MDLRSSLVEHEGLRWKVYNNAEYVPTIGLGQIHNRPSQYWDYPVPLPEQYAEKDQISWSLETIQAVFDERYTKAKSEMVNLETIGKNFDDLPSEHTNAVTDMMFQLGTDHLSEFHKMITALKNNTYEEACREMKSSFWTRQMGNRCTRYLNDALEENYFFFNHH

Structures are from Colab AlphaFold2 with default settings. Arrows point to the regions where the structures differ:
[Image: query_vs_gen, comparison of the query and generated structures]

My question is how to generate sequences whose structures are more similar to the query's. Should the input a3m contain very similar sequences? Are there generation parameters worth testing, such as the number of sequences to subsample?
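For instance, is something along these lines the right knob to turn? (Sketch based on the notebook call quoted in the next issue; the argument values are guesses on my part, not recommendations.)

from evodiff.pretrained import MSA_OA_DM_MAXSUB
from evodiff.generate_msa import generate_query_oadm_msa_simple

model, collater, tokenizer, scheme = MSA_OA_DM_MAXSUB()

tokenized_sample, generated_sequence = generate_query_oadm_msa_simple(
    'my_alignment.a3m',           # hypothetical input MSA
    model, tokenizer,
    64,                           # n_sequences: subsample more sequences?
    256,                          # seq_length
    device='cpu',
    selection_type='MaxHamming',  # favor diverse subsamples over 'random'
)

Thanks.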

KeyError: '!' during conditional generation

Hello, I am experiencing a KeyError when using the code, unchanged, from the example notebook for conditional generation:
https://github.com/microsoft/evodiff/blob/main/examples/evodiff.ipynb

from evodiff.pretrained import MSA_OA_DM_MAXSUB
from evodiff.generate_msa import generate_query_oadm_msa_simple
import re

checkpoint = MSA_OA_DM_MAXSUB()
model, collater, tokenizer, scheme = checkpoint

path_to_msa = 'bfd_uniclust_hits.a3m'
n_sequences=64 # number of sequences in MSA to subsample
seq_length=256 # maximum sequence length to subsample
selection_type='random' # or 'MaxHamming'; MSA subsampling scheme

tokeinzed_sample, generated_sequence  = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)
print("New sequence (no gaps, pad tokens)", re.sub('[!-]', '', generated_sequence[0][0],))

The error can be traced back to:

evodiff/utils.py, line 247:
return np.array([self.a_to_i[a] for a in seq[0]]) # for nested lists

The alphabet does not seem to know how to handle '!', which should be the padding token. This alphabet appears to be imported from sequence_models.constants as MSA_ALPHABET.
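The failure is easy to reproduce in isolation (illustrative alphabet below; the real one is MSA_ALPHABET from sequence_models.constants):

import numpy as np

a_to_i = {a: i for i, a in enumerate('ACDEFGHIKLMNPQRSTVWY-')}  # no '!' entry
seq = 'MK!!'  # a sequence carrying pad tokens
np.array([a_to_i[a] for a in seq])  # raises KeyError: '!'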

Also, this is much less important, but I noticed there are three instances of "tokeinzed_sample" as a variable name in the example notebook that are almost certainly meant to be "tokenized_sample".

Error Loading Pretrained Model with `--pretrained` Flag

Environment

  • Python version: 3.8
  • GPU model: NVIDIA GeForce RTX 3090

Issue Description

I am encountering a FileNotFoundError when attempting to load a pretrained model using the --pretrained flag during the execution of the training script.

Steps to Reproduce

Executed the following command:

python train.py /home/loris/dev/evodiff/config/config38M.json /home/loris/dev/evodiff/output --mini_run --gpus 1 --pretrained

Expected Behavior

The training script should start and the pretrained model should be loaded successfully.

Actual Behavior

Received a traceback ending with:

Using 30 as padding index
Using 28 as masking index
Loading weights from data/pretrained/checkpoint538468.tar...
FileNotFoundError: [Errno 2] No such file or directory: 'data/pretrained/checkpoint538468.tar'

Additional Context

The file checkpoint538468.tar does exist and is located in the data/pretrained directory. The script seems unable to locate the file even though the path is correctly specified.
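One diagnostic worth running (plain Python, nothing evodiff-specific): the path in the error is relative, so it resolves against the current working directory rather than the repository root:

import os

ckpt = 'data/pretrained/checkpoint538468.tar'
print('cwd:', os.getcwd())
print('resolves to:', os.path.abspath(ckpt))
print('exists:', os.path.exists(ckpt))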

Do you have any suggestions?
I actually built a brand-new UniRef50 dataset containing the necessary .npz, .json, and consensus.fasta files by running the Snakefile that you used.
I am willing to retrain the 38M OADM on this dataset.

Nothing works

None of the commands or scripts in the README run; there are bugs everywhere.

Error while running inpainting task using EvoDiff-seq

I was trying to run the IDR inpainting task and faced some issues:

Issue 1:

Running the example given on GitHub: python evodiff/conditional_generation_msa.py --model-type msa_oa_ar_maxsub --cond-task idr --num-seqs 1

Error:
Exception: ('Please select either msa_oa_dm_randsub, msa_oa_dm_maxsub, or esm_msa_1b baseline. You selected:', 'msa_oa_ar_maxsub')

Following the error message, I changed the model to msa_oa_dm_maxsub:

python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --num-seqs 1

I get the following error:
KeyError: 'OMA_ID'

Issue 2:

When trying to run this command: python evodiff/conditional_generation.py --model-type oa_dm_38M --cond-task idr --num-seqs 1 --pdb 1bcf --start-idxs 20 --end-idxs 50

Error:
FileNotFoundError: [Errno 2] No such file or directory: 'data/human_idr_alignments/human_idr_boundaries.tsv'

I am aware that the source for the IDR data was linked to the Reverse Homology GitHub, but I cannot find the appropriate file there either, and I have gone through all of their Zenodo repositories as well. I would highly appreciate being directed to, or given instructions on how to generate, the human_idr_boundaries.tsv file, as I am not familiar with 'OMA_ID'.

msa_oa_ar_maxsub not recognized

Hi, I'm trying to run the example command for 1PRW in the README.

python evodiff/conditional_generation_msa.py --model-type msa_oa_ar_maxsub --cond-task scaffold --pdb 1prw --start-idxs 15 --end-idxs 34 --start-idxs 51 --end-idxs 70 --num-seqs 1 --query-only

However, I'm getting the following error.

Traceback (most recent call last):
  File "evodiff/conditional_generation_msa.py", line 1065, in <module>
    main()
  File "evodiff/conditional_generation_msa.py", line 75, in main
    raise Exception("Please select either msa_oa_dm_randsub, msa_oa_dm_maxsub, or esm_msa_1b baseline. You selected:", args.model_type)
Exception: ('Please select either msa_oa_dm_randsub, msa_oa_dm_maxsub, or esm_msa_1b baseline. You selected:', 'msa_oa_ar_maxsub')

Should I be running a different command to get the same results as the paper?

Action required: self-attest your goal for this repository

It's time to review and renew the intent of this repository

An owner or administrator of this repository has previously indicated that this repository cannot be migrated to GitHub inside Microsoft because it is going public or open source, or is used to collaborate with external parties (customers, partners, suppliers, etc.).

Action

👀 ✍️ In order to keep Microsoft secure, we require repository owners and administrators to review this repository and regularly renew the intent to opt in or opt out of migration to GitHub inside Microsoft, which is specifically intended for private or internal projects.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository being automatically archived. 🔒

Instructions

❌ Opt-out of migration

If this repository cannot be migrated to GitHub inside Microsoft, you can opt out by replying with a comment on this issue containing one of the optout command options below.

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : My project will ship as Open Source
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

✅ Opt-in to migrate

If the circumstances of this repository have changed and you decide that you need to migrate (for example, the repository is no longer going public or open source, or no longer requires external collaboration), you can use the optin command below.

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023


Question on MaxHamming distance

Hi there!

Could someone please help explain the MaxHamming distance computation in https://github.com/microsoft/evodiff/blob/main/evodiff/data.py#L60

curr_dist = cdist(random_seq, msa_subset, metric='hamming')
curr_dist = np.expand_dims(np.array(curr_dist), axis=0)  # shape is now (1,msa_num_seqs)
distance_matrix[i] = curr_dist
col_min = np.min(distance_matrix, axis=0)  # (1,num_choices)
max_ind = np.argmax(col_min)

Why do we take the minimum Hamming distance between the random sequence and the MSA, rather than the maximum, before doing an argmax?
I may have missed the details, but as far as I understand, if we want diversity we should pick the sequences that have the most mutations with respect to the anchor sequences in the MSA.
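For reference, here is my attempt at a self-contained version of the pattern as I read it (toy data; not the repo's exact code). The min over rows gives each candidate's distance to its nearest already-selected sequence, and the argmax then picks the candidate whose nearest selected sequence is farthest away, i.e. max-min selection:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
msa = rng.integers(0, 20, size=(100, 64))  # toy MSA: 100 candidate rows
selected = msa[:3]                         # sequences chosen so far

distance_matrix = cdist(selected, msa, metric='hamming')  # (n_selected, n_candidates)
col_min = distance_matrix.min(axis=0)  # each candidate's distance to its NEAREST selected seq
max_ind = np.argmax(col_min)           # candidate whose nearest selected seq is FARTHEST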

Thank you!

Error when generating sequences with esm1b_650M

When I use generate.py to generate protein sequences with args.model_type == 'esm1b_650M', an error is raised at lines 136-137:

string.append(i_string)
UnboundLocalError: local variable 'i_string' referenced before assignment
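For context, this is the classic shape of the bug (illustrative Python, not the actual generate.py code): the variable is only bound inside a branch, so when the branch is skipped the append fails.

def decode_all():
    string = []
    for i in range(3):
        if i > 0:                # hypothetical condition that the first pass fails
            i_string = str(i)
        string.append(i_string)  # UnboundLocalError on the first iteration
    return string

decode_all()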

Following basic code doesn't work

python evodiff/generate_msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming
Traceback (most recent call last):
  File "evodiff/generate_msa.py", line 2, in <module>
    import evodiff
  File "/home/jimendi1/miniconda3/envs/evodiff/lib/python3.8/site-packages/evodiff/__init__.py", line 2, in <module>
    from . import collaters
  File "/home/jimendi1/miniconda3/envs/evodiff/lib/python3.8/site-packages/evodiff/collaters.py", line 3, in <module>
    from evodiff.utils import Tokenizer
  File "/home/jimendi1/miniconda3/envs/evodiff/lib/python3.8/site-packages/evodiff/utils.py", line 4, in <module>
    from sequence_models.constants import MASK, MSA_PAD, MSA_ALPHABET, MSA_AAS, GAP, START, STOP
ModuleNotFoundError: No module named 'sequence_models'
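If it helps, the environment file posted in a later issue pulls this dependency straight from GitHub, so something like the following may fix the import (my assumption being that this repository provides the sequence_models module):

pip install git+https://github.com/microsoft/protein-sequence-models.git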

Dependency issues when installing environment packages

Hi, I am trying to install the necessary packages to start working with EvoDiff, but I am running into some problems. Since I will be working on a cluster, I wanted to use Apptainer containerization. For this, I updated the environment.yml file found in the repository according to your installation instructions. The updated environment.yml file now looks like this:

name: evodiff
channels:
  - defaults
  - nvidia
  - pytorch
  - pyg
dependencies:
  - python=3.8.5 # Recommended
  - pandas
  - numpy
  - python-lmdb
  - pyg::pyg
  - conda-forge::torch-scatter
  - pytorch::pytorch=2.0.1 # Recommended
  # - nvidia::cudatoolkit=11.3
  - torchvision
  - torchaudio
  - cpuonly
  - pip
  - pip:
      - git+https://github.com/microsoft/protein-sequence-models.git
      - evodiff
      - git+https://github.com/microsoft/evodiff.git
      - mlflow
      - scikit-learn
      - seaborn
      - blosum
      - matplotlib
      - fair-esm
      - biotite
      - tqdm
      - git+https://github.com/HeliXonProtein/OmegaFold.git
      - mdanalysis
      - pdb-tools

Unfortunately, this combination of packages gives dependency issues for cpuonly, Python, PyTorch and torch-scatter.

What is the best way to update this environment file?
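One direction I am considering (an untested sketch; it assumes the evodiff package on PyPI pulls in compatible versions and that a CPU-only torch build suffices) is letting conda manage only Python and handing everything else to pip:

name: evodiff
channels:
  - defaults
dependencies:
  - python=3.8.5
  - pip
  - pip:
      - torch==2.0.1
      - evodiff
      - git+https://github.com/microsoft/protein-sequence-models.git
      - git+https://github.com/HeliXonProtein/OmegaFold.git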

If any additional information is required please don't hesitate to contact me.

Kind regards,
Gijs

Missing data files: lengths_and_offsets.npz, consensus.fasta, and splits.json

Hello,
While trying to run train.py, I noticed that the metadata and ds_train objects created at lines 164-165 require three files: lengths_and_offsets.npz, consensus.fasta, and splits.json. However, I couldn't find these files in the data folder or on UniProt. Could you provide these files or guide me on how to obtain or generate them?
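For reference, here is what I tried as a stopgap for splits.json (hypothetical layout, inferred only from how UniRefDataset indexes the file by split name, so it may well be wrong):

import json

# Guessed structure: split name -> list of sequence indices for that split
splits = {'train': [0, 1, 2], 'valid': [3], 'test': [4]}
with open('data/uniref50/splits.json', 'w') as f:
    json.dump(splits, f)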
Thank you!

Error with IDR inpainting with Evodiff-MSA

Hi,
I tried to run this script:
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming --start-query
But it seems it couldn't find the file human_idr_alignments/human_protein_alignments.
Also, could you put one simple example in evodiff.ipynb for inpainting IDRs with EvoDiff-MSA?
Thanks in advance

Using generate_msa()

When using MSA_OA_DM_MAXSUB() and generate_msa() I get the following error:

 in A3MMSADataset.__init__(self, selection_type, n_sequences, max_seq_len, data_dir, min_depth)
    351     _lengths = np.load(data_dir+'openfold_lengths.npz')['ells']
    352     lengths = np.array(_lengths)[keep_idx]
--> 353     all_files = np.array(all_files)[keep_idx]
    354     print("filter MSA depth > 64", len(all_files))
    356 # Re-filter based on high gap-contining rows

IndexError: index 1 is out of bounds for axis 0 with size 1

Do I need to have the OpenFold dataset (not only the weights but also the MSAs) in the evodiff/data/openfold folder?
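My reading of the traceback, as a toy reproduction (not the repo's code): keep_idx is computed from the full openfold_lengths.npz metadata, while all_files only lists what is actually on disk, so the indices run past the end:

import numpy as np

all_files = np.array(['local_only.a3m'])  # one MSA actually present on disk
keep_idx = np.array([0, 1, 2])            # indices derived from the full metadata file
all_files[keep_idx]                       # IndexError: index 1 is out of bounds for axis 0 with size 1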

How to train LRAR

Hi,

Thanks for your excellent work! I found that the code for training the LRAR model is masked out. How can I modify the code to train the LRAR model myself?

Flag device='cuda' not working

Hi,

When I specify device='cuda', I get an error saying both cpu and cuda devices are in use.

I can see that in generate.py, .to(device) is called on both the inputs and the model, so I'm not sure why this is happening.
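For debugging, a small helper like this (generic PyTorch, nothing evodiff-specific) can locate the tensor that was left behind on the CPU:

import torch

def assert_same_device(model, *tensors):
    # Fail early and loudly if any input lives on a different device than the model.
    dev = next(model.parameters()).device
    for t in tensors:
        if t.device != dev:
            raise RuntimeError(f'input on {t.device}, but model on {dev}')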

Indexing bugs in TRRMSADataset

Great work putting together this project!
Two minor indexing errors in the TRRMSADataset class:
line 256 in data.py:
sliced_msa = msa[:, slice_start: slice_start + self.max_seq_len] --> sliced_msa = msa[:, slice_start: slice_start + seq_len]

line 274 in data.py:
output = sliced_msa[:64] --> output = sliced_msa[:self.n_sequences]

Non-canonical amino acid sequence seen

I used conditional sequence generation via evodiff.ipynb with my own MSA file. However, it produced an amino acid sequence containing non-canonical amino acid codes such as "Z" and "B".

I would like to ask whether this is fine and the non-canonical codes mean something, or whether the problem is my own MSA.
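For now I filter such sequences out with a simple check (my own sketch; B (Asx) and Z (Glx) are standard IUPAC ambiguity codes, so perhaps they are simply part of the model's alphabet):

CANONICAL = set('ACDEFGHIKLMNPQRSTVWY')

def is_canonical(seq: str) -> bool:
    # True only if every residue is one of the 20 canonical amino acids
    return set(seq) <= CANONICAL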

Thank you.


Allow diffusion to gap character in OADM models?

In the generation code for the OADM models, we are explicitly preventing the sampling of gap characters (e.g https://github.com/microsoft/evodiff/blob/main/evodiff/generate_msa.py#L211 )

Is this necessary, or is it an explicit choice to sample only fixed-length sequences given a subsampled MSA? Given that MASK and GAP are distinct tokens, and the ground-truth alignment of the query sequence can contain gap tokens, allowing diffusion to GAP seems desirable.

I noticed that in conditional MSA generation this check isn't present, https://github.com/microsoft/evodiff/blob/main/evodiff/conditional_generation_msa.py#L605C1-L605C102 (despite the comment's claim); I would like to better understand what the difference is here.
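For concreteness, the pattern I am referring to looks like this (my paraphrase with a hypothetical gap_idx argument, not evodiff's exact implementation): the gap token's probability is zeroed before sampling, which pins generated sequences to the subsampled MSA's length.

import torch

def sample_no_gap(logits: torch.Tensor, gap_idx: int) -> torch.Tensor:
    # logits: (batch, vocab). Zero the gap probability, renormalize, sample.
    probs = torch.softmax(logits, dim=-1)
    probs[..., gap_idx] = 0.0
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)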

Missing uniref50_aa_ref_test.csv

Hi,
I successfully generated sequences for the unconditional case. However, at the end, when main() calls aa_reconstruction_parity_plot() (line 148 in evodiff/generate.py) to plot the reconstruction parity, the reference file seems to be missing:

FileNotFoundError: [Errno 2] No such file or directory: '/wynton/home/rotation/fhy/Desktop/DMs/ref/uniref50_aa_ref_test.csv'

I'm wondering where I am supposed to find this file, or whether I did something wrong.

Error Loading State Dictionary in ByteNetLMTime: Unexpected Keys

I'm encountering an error when trying to load the state dictionary for the ByteNetLMTime model using the provided checkpoint. I'm following the instructions to retrain the 38M OADM model on data formatted like the UniRef50 set, with the corresponding configuration provided in the repository. However, loading fails with an error about unexpected keys in the state dictionary.
Command:
python train.py config38M.json --dataset uniref50 -sd checkpoint_input/oaar-38M.tar
Output and Error:
nr 0 gpus 1 gpu 0 rank 0
Using 30 as padding index
Using 28 as masking index
Loading weights from /checkpoint_input/oaar-38M.tar...
..............
RuntimeError: Error(s) in loading state_dict for ByteNetLMTime: Unexpected key(s) in state_dict: "last_norm.weight", "last_norm.bias".

Expected Behavior:
The model should load the state dictionary without any issues, especially since I am using the provided configuration and checkpoint.
Could you please help in resolving this issue or suggest any necessary steps I might be missing?
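One observation: in the ByteNetLMTime instantiation quoted in the next issue, the final LayerNorm is gated by final_ln=args.final_norm, and the unexpected last_norm.* keys suggest the checkpoint was trained with that flag enabled. A quick check (sketch; the nesting of the weights inside the .tar is my guess):

import torch

ckpt = torch.load('checkpoint_input/oaar-38M.tar', map_location='cpu')
state = ckpt.get('model_state_dict', ckpt)  # assumes weights may be nested under this key
print([k for k in state if k.startswith('last_norm')])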

padding_idx=masking_idx in ByteNetLMTime instantiation arguments

The following code in train.py assigns padding_idx=masking_idx in the model instantiation. This conflicts with the definition above, where padding_idx is distinct from masking_idx. Is this an oversight, or is there a particular reason for this assignment?

padding_idx = tokenizer.pad_id # PROTEIN_ALPHABET.index(PAD)
masking_idx = tokenizer.mask_id
print('Using {} as padding index'.format(padding_idx))
print('Using {} as masking index'.format(masking_idx))
#if args.model_type == 'ByteNet':
model = ByteNetLMTime(n_tokens, d_embed, d_model, n_layers, kernel_size, r,
                      causal=causal, padding_idx=masking_idx, rank=weight_rank, dropout=args.dropout,
                      tie_weights=args.tie_weights, final_ln=args.final_norm, slim=slim, activation=activation,
                      timesteps=diffusion_timesteps)

Thank you in advance

Inpaint VS Scaffold

If I were to sample a sequence with a scaffold length of 1, considering, say, 3 motifs to scaffold, e.g. GDHJK#############HDJKSZ#######HDDD, would there be a difference from inpainting?

Scaffolding benchmark produces different sequences every time it is run

I'm not sure if this is expected behavior, but the scaffolding benchmark code produces different sequences every time it is run. This appears to be because in conditional_generation.py, while the numpy and pytorch seeds are set to 0, the seed for Python's random module is not, and random.randint is used to generate the scaffold length. When I manually set random.seed(0), I got the same sequences on every run.
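Seeding all three RNGs up front makes the runs reproducible for me:

import random
import numpy as np
import torch

random.seed(0)       # the seed that conditional_generation.py does not set
np.random.seed(0)
torch.manual_seed(0)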

TypeError: forward() missing 1 required positional argument: 'y'

Hello, I was interested in learning how to use evodiff and have been encountering an error when running the example code. When I run:

tokeinzed_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)

I get this error in Jupyter notebooks:

TypeError Traceback (most recent call last)
Cell In[7], line 10
6 seq_length=256
7 selection_type='random'
---> 10 tokeinzed_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)

TypeError: forward() missing 1 required positional argument: 'y'

Since forward() isn't called directly in the example code, I figure the issue may lie in another function or method.

Question about fine-tuning

Hello people,

I have been struggling to train the 38M-parameter model on a FASTA dataset from the RCSB.
I had to alter the dataloader, but running the script leads to batch-size issues.
Has anybody tried retraining these models on their own data?

Thanks in advance for any of your insight.

Best,

Lo
