
diffab's Introduction

DiffAb


Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures (NeurIPS 2022)

[Paper][Demo]

Install

Environment

conda env create -f env.yaml -n diffab
conda activate diffab

The default cudatoolkit version is 11.3. You may change it in env.yaml.

Datasets and Trained Weights

Protein structures in the SAbDab dataset can be downloaded here. Extract all_structures.zip into the data folder.

The data folder contains a snapshot of the dataset index (sabdab_summary_all.tsv). You may replace the index with the latest version here.

Trained model weights are available here (Hugging Face) or here (Google Drive).
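Before running any design script, it can help to confirm that the files described above are where the scripts look for them. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, the index path follows the data folder described above, and the checkpoint path is taken from the log output quoted in the issues further down (`./trained_models/codesign_single.pt`). Adjust both to your checkout.

```python
from pathlib import Path

# Paths are assumptions based on this README and on the log output quoted
# in the issues; they are not read from the repository's configs.
EXPECTED = {
    "SAbDab index": "data/sabdab_summary_all.tsv",
    "single-CDR co-design weights": "trained_models/codesign_single.pt",
}


def missing_files(repo_root, expected=EXPECTED):
    """Return {label: path} for every expected file that does not exist."""
    root = Path(repo_root)
    return {
        label: str(root / rel)
        for label, rel in expected.items()
        if not (root / rel).is_file()
    }


if __name__ == "__main__":
    # Run from the repository root; prints nothing if all files are present.
    for label, path in missing_files(".").items():
        print(f"missing {label}: {path}")
```

A missing checkpoint reported here corresponds to the `FileNotFoundError` shown in the Colab issue below.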

[Optional] HDOCK

HDOCK is required to design CDRs for antigens without bound antibody frameworks. Please download HDOCK here and put the hdock and createpl programs into the bin folder.

[Optional] PyRosetta

PyRosetta is required to relax the generated structures and compute binding energy. Please follow the instructions here to install it.

[Optional] Ray

Ray is required to relax and evaluate the generated antibodies. Please install Ray using the following command:

pip install -U ray

Design Antibodies

Five design modes are available. Each mode corresponds to a config file in the configs/test folder:

Config File               Description
codesign_single.yml       Sample both the sequence and structure of one CDR.
codesign_multicdrs.yml    Sample both the sequence and structure of all the CDRs simultaneously.
abopt_singlecdr.yml       Optimize the sequence and structure of one CDR.
fixbb.yml                 Sample only the sequence of one CDR (fix-backbone sequence design).
strpred.yml               Sample only the structure of one CDR (structure prediction).

Antibody-Antigen Complex

design_pdb.py samples CDRs for antibody-antigen complexes. Its usage is shown below. The full list of options can be found in diffab/tools/runner/design_for_pdb.py.

python design_pdb.py \
	<path-to-pdb> \
	--heavy <heavy-chain-id> \
	--light <light-chain-id> \
	--config <path-to-config-file>

The --heavy and --light options can be omitted as the script can automatically identify them with AbNumber and ANARCI.

The example below designs the six CDRs separately for the 7DK2_AB_C antibody-antigen complex.

python design_pdb.py ./data/examples/7DK2_AB_C.pdb \
	--config ./configs/test/codesign_single.yml
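To try several design modes on the same complex, the command line can be assembled programmatically. This is a small sketch under stated assumptions: the helper `build_design_command` is hypothetical, it only uses the options documented above (`--heavy`, `--light`, `--config`), and it assumes the config files live in `./configs/test` as described in the table. It prints the commands rather than running them, and whether every config runs on a given complex without changes is not verified here.

```python
import shlex

# Config file names as listed in the table above.
CONFIGS = [
    "codesign_single.yml",
    "codesign_multicdrs.yml",
    "abopt_singlecdr.yml",
    "fixbb.yml",
    "strpred.yml",
]


def build_design_command(pdb_path, config_name, heavy=None, light=None,
                         config_dir="./configs/test"):
    """Build the argument list for one design_pdb.py invocation."""
    cmd = ["python", "design_pdb.py", pdb_path]
    if heavy:
        cmd += ["--heavy", heavy]  # optional: chain IDs can be auto-detected
    if light:
        cmd += ["--light", light]
    cmd += ["--config", f"{config_dir}/{config_name}"]
    return cmd


if __name__ == "__main__":
    # Dry run: print one command per design mode.
    for name in CONFIGS:
        print(shlex.join(build_design_command("./data/examples/7DK2_AB_C.pdb", name)))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the dry-run output looks right.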

Antigen Only

HDOCK is required to design antibodies for antigens without bound antibody structures (see above for instructions on installing HDOCK). Below is the usage of design_dock.py.

python design_dock.py \
	--antigen <path-to-antigen-pdb> \
	--antibody <path-to-antibody-template-pdb> \
	--config <path-to-config-file>

The --antibody option may be omitted; the default antibody template is 3QHF_Fv.pdb. The full list of options can be found in the script.

Below is an example that designs antibodies for SARS-CoV-2 Omicron RBD.

python design_dock.py \
	--antigen ./data/examples/Omicron_RBD.pdb \
	--config ./configs/test/codesign_multicdrs.yml

Train

python train.py ./configs/train/<config-file-name>

Reference

@inproceedings{luo2022antigenspecific,
  title={Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures},
  author={Shitong Luo and Yufeng Su and Xingang Peng and Sheng Wang and Jian Peng and Jianzhu Ma},
  booktitle={Advances in Neural Information Processing Systems},
  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year={2022},
  url={https://openreview.net/forum?id=jSorGn2Tjg}
}


diffab's Issues

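Running diffab on Colab: a question

Hello,
diffab is very powerful and opens up a lot of possibilities for antibody design. Great work!
I experimented with the code on Google Colab (link below) and ran into the following problem during a test run. Could you please take a look and tell me how to fix it?
https://colab.research.google.com/drive/1O4vz0A-Y84sOE3dAjePAhL_VmYFhZe8A?usp=share_link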

python /content/diffab/design_pdb.py /content/diffab/data/examples/7DK2_AB_C.pdb --config /content/diffab/configs/test/codesign_single.yml

Python 3.8.15
[INFO] Renumbered chain A (H)
[INFO] Renumbered chain B (K)
[INFO] Chain C does not contain valid Fv: Variable chain sequence not recognized: "CPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGXXX"
[2022-12-18 05:44:18,014::sample::INFO] Data ID: 7DK2_AB_C.pdb
[2022-12-18 05:44:18,014::sample::INFO] Results will be saved to ./results/codesign_single/7DK2_AB_C.pdb_2022_12_18__05_44_18
[2022-12-18 05:44:18,185::sample::INFO] Loading model config and checkpoints: ./trained_models/codesign_single.pt
Traceback (most recent call last):
File "/content/diffab/design_pdb.py", line 4, in <module>
design_for_pdb(args_from_cmdline())
File "/content/diffab/diffab/tools/runner/design_for_pdb.py", line 141, in design_for_pdb
ckpt = torch.load(config.model.checkpoint, map_location='cpu')
File "/usr/local/envs/diffab/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "/usr/local/envs/diffab/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/usr/local/envs/diffab/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './trained_models/codesign_single.pt'

CalledProcessError Traceback (most recent call last)
in
----> 1 get_ipython().run_cell_magic('bash', '', 'source activate diffab\n#%%shell\n#eval "$(conda shell.bash hook)" # copy conda command to shell\n#conda config --show\n#conda config --get channels\n#conda list\n# python commands are ready to run within your environment\npython --version\npython /content/diffab/design_pdb.py /content/diffab/data/examples/7DK2_AB_C.pdb --config /content/diffab/configs/test/codesign_single.yml\n#python /content/diffab/diffab/tools/runner/design_for_pdb.py /content/diffab/data/examples/7DK2_AB_C.pdb --config /content/diffab/configs/test/abopt_singlecdr.yml\n')

3 frames
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2357 with self.builtin_trap:
2358 args = (magic_arg_s, cell)
-> 2359 result = fn(*args, **kwargs)
2360 return result
2361

/usr/local/lib/python3.8/dist-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
140 else:
141 line = script
--> 142 return self.shebang(line, cell)
143
144 # write a basic docstring:

in shebang(self, line, cell)

/usr/local/lib/python3.8/dist-packages/IPython/core/magic.py in (f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):

/usr/local/lib/python3.8/dist-packages/IPython/core/magics/script.py in shebang(self, line, cell)
243 sys.stderr.flush()
244 if args.raise_error and p.returncode!=0:
--> 245 raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
246
247 def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'source activate diffab\n#%%shell\n#eval "$(conda shell.bash hook)" # copy conda command to shell\n#conda config --show\n#conda config --get channels\n#conda list\n# python commands are ready to run within your environment\npython --version\npython /content/diffab/design_pdb.py /content/diffab/data/examples/7DK2_AB_C.pdb --config /content/diffab/configs/test/codesign_single.yml\n#python /content/diffab/diffab/tools/runner/design_for_pdb.py /content/diffab/data/examples/7DK2_AB_C.pdb --config /content/diffab/configs/test/abopt_singlecdr.yml\n'' returned non-zero exit status 1.

License?

Hi. First off, thank you very much for releasing the code accompanying your awesome work here!
I was curious if this source code is currently (or will eventually be) accompanied by any software licenses (e.g., MIT).

Problem with diffab.utils.protein.writers.save_pdb()

I'm trying to deploy the model on our local cluster, but got the following error while running the example script.
> python design_pdb.py ./data/examples/7DK2_AB_C.pdb --config ./configs/test/codesign_single.yml

[INFO] Renumbered chain A (H)
[INFO] Renumbered chain B (K)
[INFO] Chain C does not contain valid Fv: Variable chain sequence not recognized: "CPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGXXX"
[2022-10-21 23:45:07,716::sample::INFO] Data ID: 7DK2_AB_C.pdb
[2022-10-21 23:45:07,717::sample::INFO] Results will be saved to ./results/codesign_single/7DK2_AB_C.pdb_2022_10_21__23_45_07
[2022-10-21 23:45:07,962::sample::INFO] Loading model config and checkpoints: ./trained_models/codesign_single.pt
[2022-10-21 23:45:13,578::sample::INFO]
[2022-10-21 23:45:14,304::sample::INFO] Start sampling for: H_CDR1
Traceback (most recent call last):
File "/gshare/xielab/jianfc/DLTEST/diffab/design_pdb.py", line 4, in <module>
design_for_pdb(args_from_cmdline())
File "/gshare/xielab/jianfc/DLTEST/diffab/diffab/tools/runner/design_for_pdb.py", line 177, in design_for_pdb
save_pdb(data_native, os.path.join(log_dir, variant['tag'], 'REF1.pdb')) # w/ OpenMM minimization
File "/gshare/xielab/jianfc/DLTEST/diffab/diffab/utils/protein/writers.py", line 58, in save_pdb
unique_chain_nb = data['chain_nb'].unique().tolist()
File "/share/home/jianfc/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py", line 586, in unique
return torch.unique(self, sorted=sorted, return_inverse=return_inverse, return_counts=return_counts, dim=dim)
File "/share/home/jianfc/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_jit_internal.py", line 422, in fn
return if_false(*args, **kwargs)
File "/share/home/jianfc/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_jit_internal.py", line 422, in fn
return if_false(*args, **kwargs)
File "/share/home/jianfc/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/functional.py", line 946, in _return_output
output, _, _ = _unique_impl(input, sorted, return_inverse, return_counts, dim)
File "/share/home/jianfc/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/functional.py", line 860, in _unique_impl
output, inverse_indices, counts = torch._unique2(
RuntimeError: std::bad_alloc

I have worked around the problem for now by changing line 58 of writers.py from:
unique_chain_nb = data['chain_nb'].unique().tolist()
to:
unique_chain_nb = data['chain_nb'].to('cuda').unique().tolist()
so that unique() runs on the GPU, and everything then works. I would still like to know why the problem occurs on the CPU and how to fix it properly.

I'm working on a cluster node with 512 GB RAM and two A100-40GB GPUs, with Python 3.9, PyTorch 1.11.0 and CUDA 11.3.1.
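For machines without a GPU, one possible CPU-only workaround is to deduplicate in plain Python and bypass `torch.unique` altogether. This is only a sketch: it assumes `data['chain_nb']` is a 1-D integer tensor, the helper name `unique_sorted` is hypothetical, and it does not explain the root cause of the `std::bad_alloc`.

```python
def unique_sorted(values):
    """Return the sorted unique entries of a 1-D tensor, array, or list."""
    # torch.Tensor and numpy arrays both provide .tolist(); lists pass through.
    items = values.tolist() if hasattr(values, "tolist") else list(values)
    return sorted(set(items))


# Hypothetical replacement for line 58 of writers.py:
#   unique_chain_nb = unique_sorted(data['chain_nb'])
print(unique_sorted([0, 0, 1, 1, 2, 0]))  # [0, 1, 2]
```

The result matches what `unique()` with its default `sorted=True` would return for a 1-D input, as a Python list.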

Fix-backbone Performance on Light Chains

Hi, do you have the results of your model and the baselines on light chains? I want to list your method as a baseline, and it would be very helpful if you could provide this information!

Thanks.

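Errors when running sabdab.py

I get errors when running sabdab.py. How can I resolve them?
The first error is: ValueError: invalid literal for int() with base 10: 'V'.
The second error is shown in the screenshot below. What is the proper way to fix these two problems?
(screenshot: 20230617-131318)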

Example

Hello, thank you for sharing the great work. I am trying to reproduce some of the experiments using the example data. However, there seems not to be a configuration file for that. Could you help me to run experiments with those example data? Many thanks!

missing streamlit dependency

Hello,

I was wondering if you could add streamlit to the env.yaml file.
That way streamlit would be installed directly and the demo could be used.
Alternatively, could you provide another installation route, for example with pip?

I am having a hard time getting streamlit to run in a Docker container together with diffab. The issue is that I cannot install streamlit with conda in a separate step.

Best regards,

code is inconsistent with your paper

def add_noise(self, v_0, mask_generate, t):

Hi. In the forward diffusion process, given e_0, the initial rotations, this function generates the noised rotations as R_noisy = E_scaled @ R0_scaled, where E_scaled is sampled from a prior distribution. Why is e_normal declared but never used?

e_normal = e_scaled / (c1 + 1e-8)

In your paper, R_noisy should be the interpolation between c0 * v_0 (the scaled v_0) and c1 * e_scaled, but your code does not use c1 at all.

Issue with non-reproducibility in the energy calculation

Hi,

As a proof-of-concept, I've been trying to sample the H-CDR3 structure of a given antibody to see if your method would be able to reproduce the crystal structure conformation. For the designed structures I'm especially interested in identifying those conformations where the RMSD is <= 2 Å.

I'm not sure if my protocol is correct, but I've been running it as follows:

  • Step 1: run design_pdb.py with the config file 'strpred.yml'. I've modified it to sample only the H-CDR3.
  • Step 2: run diffab/tools/relax/run.py with the pipeline set to 'pyrosetta' to relax the sampled structures.
  • Step 3: run diffab/tools/eval/run.py to obtain RMSDs and energies.

With this protocol, I was able to identify some conformations where the RMSD was smaller than 2 Å.

However, I noticed that across different runs, the energy for the reference structure was changing, which would make my analysis not comparable and non-reproducible. For instance, the absolute difference between the dG_ref in two runs was 100, which is a lot.

So, I started debugging the code and found out that 'diffab/tools/relax/run.py' relaxes the reference and 'diffab/tools/eval/run.py' uses it as the reference to calculate the energy. That explains why across different runs I got different dG_refs. I then set PyRosetta to use the same seed in 'diffab/tools/relax/run.py' and I was able to get the same values across different calls of 'diffab/tools/eval/run.py'.
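For reference, fixing the PyRosetta seed as described above can be sketched as follows. The helper names are hypothetical and the seed value is arbitrary; `-run:constant_seed` and `-run:jran` are standard Rosetta options, passed here through the `extra_options` argument of `pyrosetta.init`. If relaxation runs in separate worker processes (for example under Ray), the assumption is that each worker needs the same initialization.

```python
def rosetta_seed_flags(seed=12345):
    """Build Rosetta command-line flags that fix the random seed."""
    return f"-run:constant_seed -run:jran {seed}"


def init_pyrosetta_with_seed(seed=12345):
    """Initialize PyRosetta with a fixed seed (requires PyRosetta)."""
    import pyrosetta  # imported lazily so the flag helper works without it

    # extra_options appends to PyRosetta's default option string.
    pyrosetta.init(extra_options=rosetta_seed_flags(seed))


if __name__ == "__main__":
    print(rosetta_seed_flags())
```

Calling `init_pyrosetta_with_seed()` once per process, before any pose is relaxed or scored, is the intended usage of this sketch.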

Despite that, I'm still getting minor differences for the reference within the same run. See an example below:

image

The difference is small, but it is strange to have different energies for the same complex. I tried to set PyRosetta's seed in 'diffab/tools/eval/run.py', but it didn't work.

Would you know what is going on here?

Thank you for this great tool and for your help.
Best,

Why generating seqmap during data preprocessing?

Hi, Luo,
Thanks for sharing the code.
I notice that you produce a key named 'seqmap' for each entity. However, 'seqmap' does not seem to be used anywhere afterwards. This is a very tiny issue; I am just curious about the purpose of this value.

Designed CDR clashed with antigen

Dear author,

5CIL_0023_CDRH3.pdb.zip

I ran single-CDR (CDR-H3) structure-sequence co-design on an Ab-Ag complex, using exactly this config file: https://github.com/luost26/diffab/blob/main/configs/test/codesign_single.yml. Interestingly, one of the designed loop structures clashes with the input antigen. In theory, if the design is conditioned on the antigen structure, we shouldn't observe such a clash. Could you please help me understand this specific case?

Thanks!

antibody target region

Can an antibody target region be defined?

I tried it and found that the results target different regions.
It would be nice to be able to specify a custom region.

thanks

Issues with Sidechain packing

Hi, I sampled 7DK2_AB_C.pdb using codesign_multicdrs.yml, and when I try to do side-chain packing and full-atom refinement I get errors. I am running diffab/tools/relax/run.py on the sampled files.
Screenshot 2022-12-22 at 7 21 10 PM

left-multiply or right-multiply ?

According to the paper attachment, after sampling a random rotation $E \sim \mathcal{IG}_{SO(3)}(I, \sigma^2)$, we need to left-multiply $E$ by $R$ to get the desired random value $RE \sim \mathcal{IG}_{SO(3)}(R, \sigma^2)$.

However, on the released code,

R_noisy = E_scaled @ R0_scaled

R_next = E @ so3vec_to_rotation(v_next)

$R$ is multiplied to $E$ from the right side.

Which operation is correct?

Further questions about left/right-multiply in issue #14

Thanks for your quick reply #14 (comment).

It is still confusing to me. I also take the column vectors as the base vectors. In this case, $RE$ is more intuitive to me (we first get $E$ around $I$ and transform it with $R$).

Moreover, $ER = (ERE^{-1})E \sim \mathcal{IG}_{SO(3)}(ERE^{-1}, \sigma^2)$, so $RE$ and $ER$ seem to do different things, because $ERE^{-1} = R$ does not hold in general.
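The algebra in this question can be checked numerically. The sketch below uses two arbitrary example rotations (a 90-degree rotation about z for $R$ and a small rotation about x for $E$, both chosen only for illustration) and confirms that $RE \neq ER$ while $ER = (ERE^{-1})E$. It verifies the identity only; it does not settle which convention the paper or the code intends.

```python
import math


def rot_z(a):
    """Rotation by angle a (radians) about the z axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(a):
    """Rotation by angle a (radians) about the x axis."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(A):
    # For a rotation matrix the transpose equals the inverse.
    return [[A[j][i] for j in range(3)] for i in range(3)]


def max_abs_diff(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(3) for j in range(3))


R = rot_z(math.pi / 2)  # example "mean" rotation
E = rot_x(0.1)          # example small perturbation
RE, ER = matmul(R, E), matmul(E, R)
conjugated = matmul(matmul(E, R), transpose(E))  # E R E^{-1}

print(max_abs_diff(RE, ER) > 1e-6)                       # True
print(max_abs_diff(ER, matmul(conjugated, E)) < 1e-12)   # True
```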

inquiry about nanobodies

Thanks for sharing this work, it's really fascinating. I was wondering: would this model work on nanobodies as well?

Training time

Hi there,

How much time does it take to train a model?

Thanks!

Why not use FAPE loss?

Hi, I am wondering why you don't use the FAPE loss to supervise the training. It seems a more natural choice, since it optimizes the rotation and translation predictions jointly.

questions regarding the reproduction of your test results

Hi. I am trying to reproduce your test results about generating antibody CDRs (sequence-structure co-design) using the DiffAb model.
Using the design_testset.py script, index 10 (pdb 7bwj_H_L_E), and codesign_single ckpt, the results on CDR-H3 are unacceptably bad.
The following table is the rmsd-ca between generated structure and native structure.

         H_CDR1     H_CDR2      H_CDR3     L_CDR1     L_CDR2     L_CDR3
mean    1.428283   1.659423   52.084544   1.605134   0.385445   3.645147
min     0.861669   0.935878   28.523428   0.730051   0.208125   1.603799
max     2.923090   3.224749  153.499802   2.118108   0.752684   6.416183

The rmsd-ca is calculated with the following code.

generate_flags = variant['data']['generate_flag']
native_atom_positions = variant['data']['pos_heavyatom'][...,BBHeavyAtom.CA,:][generate_flags]
# native_atom_positions = native_atom_positions[mask_ha[generate_flags]]
pred_atom_positions = pos_ha[...,BBHeavyAtom.CA,:][generate_flags]
# pred_atom_positions = pred_atom_positions[mask_ha[generate_flags]]
rmsd = ((native_atom_positions - pred_atom_positions)**2).sum(-1).mean()

If this case has such a high RMSD, I suspect that the test-set RMSD reported in Table 1 of your paper would also be high.
No offense intended; I am just trying to find out what is wrong with my reproduction.
Let me know if you want more details about my reproduction.
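One detail of the RMSD snippet in this issue is worth noting: its last line averages the squared Cα distances without taking a square root, so the value it produces is a mean squared deviation rather than an RMSD. A minimal pure-Python sketch of the conventional formula follows. The function name `ca_rmsd` and the toy coordinates are illustrative only, and no superposition is applied, which assumes both structures are already in the same frame. Whether this accounts for the numbers in the table above is not something this sketch establishes.

```python
import math


def ca_rmsd(native, pred):
    """Root-mean-square deviation between two equal-length lists of 3D points."""
    if len(native) != len(pred) or not native:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Squared Euclidean distance for each pair of corresponding atoms.
    sq_dists = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(native, pred)
    ]
    return math.sqrt(sum(sq_dists) / len(sq_dists))


# Toy example: every atom displaced by 2 along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(ca_rmsd(native, pred))  # 2.0
```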
