guanjq / targetdiff
The official implementation of 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction (ICLR 2023)
I moved scripts/property_prediction/inference.py into the targetdiff directory and ran:
python inference.py --ckpt_path pretrained_models/egnn_pdbbind_v2016.pt --protein_path examples/1h36_A_rec_1h36_r88_lig_tt_docked_0_pocket10.pdb --ligand_path examples/1h36_A_rec_1h36_r88_lig_tt_docked_0.sdf
I encountered:
RuntimeError: Error(s) in loading state_dict for PropPredNet:
size mismatch for ligand_atom_emb.weight: copying a param with shape torch.Size([256, 30]) from checkpoint, the shape in current model is torch.Size([256, 31]).
The embedding dimensions of the checkpoint and the current model do not match.
I used an AI ligand-protein docking tool, so I now have a ligand SDF file containing the docked coordinates.
I want to extract the pocket, but the method described on GitHub is confusing in this case.
How can I extract the pocket when I have a protein PDB file and a ligand SDF file?
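In case it helps, one generic way to cut a pocket when the docked ligand lives in a separate SDF is to keep every protein residue that has an atom within some cutoff (the repo's data use a 10 Å pocket) of any ligand atom. Below is a minimal stdlib-only sketch; the parsing is deliberately simplified (V2000 SDF, single-model PDB) and the helper names are mine, not targetdiff's:

```python
import math

def sdf_coords(sdf_text):
    """Read atom coordinates from a V2000 SDF block (counts line is line 4)."""
    lines = sdf_text.splitlines()
    n_atoms = int(lines[3][:3])
    return [tuple(float(x) for x in l.split()[:3])
            for l in lines[4:4 + n_atoms]]

def pocket_pdb_lines(pdb_text, lig_xyz, cutoff=10.0):
    """Keep ATOM records of residues with any atom within `cutoff` of the ligand."""
    atoms = []  # (residue key, xyz, original line)
    for line in pdb_text.splitlines():
        if line.startswith('ATOM'):
            key = (line[21], line[22:26])  # chain id, residue number
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((key, xyz, line))
    near = set()
    for key, pos, _ in atoms:
        if key in near:
            continue
        if any(math.dist(pos, l) <= cutoff for l in lig_xyz):
            near.add(key)
    return [line for key, _, line in atoms if key in near]
```

The surviving ATOM lines can then be written out as a new pocket PDB file and passed to sample_for_pocket.py via --pdb_path.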
Finally, your advice about the environment was very helpful.
I am sharing the YAML content:
name: targetdiff
channels:
I appreciate your work very much, but I have to say that the code is poorly organized, which makes it very inconvenient to read.
Hello Jiaqi,
Thanks for your fantastic work! I am using targetdiff and this repo for research purposes. While reading the code for ligand sampling with the diffusion process, I suspected a bug in the pocket size estimation:
n_data = batch_size if i < num_batch - 1 else num_samples - batch_size * (num_batch - 1)
batch = Batch.from_data_list([data.clone() for _ in range(n_data)], follow_batch=FOLLOW_BATCH).to(device)
with torch.no_grad():
    if sample_num_atoms == 'prior':
        pocket_size = atom_num.get_space_size(batch.protein_pos.detach().cpu().numpy())
        ligand_num_atoms = [atom_num.sample_atom_num(pocket_size).astype(int) for _ in range(n_data)]
The code is copied from scripts/sample_diffusion.py (some unimportant lines are omitted). As the paper describes, you sample 100 ligand candidates for 1 pocket, and data here corresponds to 1 pocket, which is copied n_data times to form a batch. When you estimate the pocket size, atom_num.get_space_size actually computes the pairwise distances of batch.protein_pos and returns the median of the top-10 pairwise distances. However, some of the protein atoms in batch belong to different copies of the same pocket, and the pairwise distances between them do not make sense. I think the pocket_size estimation should be modified to:
pocket_size = atom_num.get_space_size(data.protein_pos.detach().cpu().numpy())
This bug also makes scripts/sample_diffusion.py and scripts/sample_for_pocket.py very slow, since computing pairwise distances over ~10,000 atoms in a batch is resource-hungry. Besides, the estimated pocket size is always larger than the corrected one.
I am not sure if I am correct. Looking forward to your reply!
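For what it's worth, the inflation described above is easy to demonstrate with a toy stand-in for the statistic. Assuming get_space_size is the median of the 10 largest pairwise distances (as the reporter describes), duplicating the pocket n times repeats the largest distance n^2 times, so the top-10 collapses to the maximum:

```python
import math
from statistics import median

def space_size(points, k=10):
    # Toy stand-in for atom_num.get_space_size as described above:
    # median of the k largest pairwise distances.
    d = [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return median(sorted(d)[-k:])

pocket = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
          (0.0, 3.0, 0.0), (0.0, 0.0, 4.0), (5.0, 0.0, 0.0)]
single = space_size(pocket)        # spread across the 10 largest distances
batched = space_size(pocket * 50)  # 50 copies, as in a sampling batch
# `batched` equals the maximum pairwise distance and exceeds `single`,
# matching the observation that the batched estimate is always larger.
```

This also illustrates the cost argument: the batched call computes n^2 times as many pairwise distances for the same information.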
Hello, I have a question about the Jensen-Shannon divergence. When calculating the JSD over atomic distances, do we use the distances between all atom pairs, or only between atoms connected by bonds?
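For reference, the JSD itself is defined the same way whichever distance set is histogrammed; the question above is only about which pairs feed the histogram. A minimal sketch of the divergence (base 2, so disjoint distributions give 1):

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions
    given as equal-length lists of probabilities summing to 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-mass bins
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For example, jsd([0.5, 0.5], [0.5, 0.5]) is 0, while jsd([1.0, 0.0], [0.0, 1.0]) is 1.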
Hi @guanjq, I am not able to replicate the results by re-training the EGNN model. However, when using the pre-trained model I am able to replicate the reported results: RMSE: 1.316, MAE: 1.031, R^2 score: 0.633, Pearson: 0.797, Spearman: 0.782, mean/std: 6.412/1.621.
I tried to keep all the hyper-parameters and datasets (/data/pdbbind_v2016/pocket_10_refined) by referring to the config found in the shared checkpoint, and followed the README to prepare the pockets and splits.
My current results on the shared test set are:
RMSE: 3.082, MAE: 2.412, R^2 score: -1.014, Pearson: 0.513, Spearman: 0.562, mean/std: 7.769/3.195
Any idea what might be going wrong in re-training?
Originally posted by @Dornavineeth in #1 (comment)
Thanks for the brilliant work and for sharing the code!
I have a question regarding the selection of model parameters for the checkpoint ./pretrained_models/pretrained_diffusion.pt referenced in the sampling.yml file under the config directory. Could you please clarify the criteria used to choose these parameters? Is it the checkpoint with the lowest validation loss during training? I have noticed that when using that checkpoint (lowest validation loss during training), the performance does not align with the results reported in the paper. I would appreciate any insights into factors I might be overlooking in order to achieve the expected performance.
Thank you!!
Best regards.
Hi, thanks for the code.
result = {
    'data': data,
    'pred_ligand_pos': pred_pos,
    'pred_ligand_v': pred_v,
    'pred_ligand_pos_traj': pred_pos_traj,
    'pred_ligand_v_traj': pred_v_traj,
    'time': time_list
}
How would you get an RDKit mol from these results?
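Not an authoritative answer, but the repo appears to ship a reconstruction utility (utils/reconstruct) that turns predicted positions and atom types into an RDKit mol, which is likely the intended path. A crude fallback is to decode pred_ligand_v to element symbols and emit an XYZ block, which recent RDKit versions can parse via Chem.MolFromXYZBlock (bond perception then needs e.g. rdDetermineBonds). A stdlib sketch of that step; the index-to-element table below is a guess and must match the ligand_atom_mode used at training time:

```python
# Hypothetical index -> element map; the real mapping depends on the
# `ligand_atom_mode` transform (e.g. add_aromatic) used during training.
ATOM_TYPES = ['C', 'N', 'O', 'F', 'P', 'S', 'Cl']

def to_xyz_block(pred_pos, pred_v):
    """Render predicted positions/types as an XYZ block that RDKit
    (Chem.MolFromXYZBlock) or any molecular viewer can read."""
    lines = [str(len(pred_pos)), 'generated ligand']
    for (x, y, z), v in zip(pred_pos, pred_v):
        lines.append(f'{ATOM_TYPES[v]} {x:.4f} {y:.4f} {z:.4f}')
    return '\n'.join(lines) + '\n'
```

The trajectory entries (pred_ligand_pos_traj, pred_ligand_v_traj) can be rendered the same way, one frame at a time.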
After manually preparing a small-molecule binding pocket, I ran sample_for_pocket.py and obtained a sample.pt file. How should I then use the other scripts to export the generated small molecules as SDF files?
Hi guanjq,
thanks for your good work on these new diffusion models.
When I run the command line
python scripts/sample_for_pocket.py configs/sampling.yml --pdb_path examples/1h36_A_rec_1h36_r88_lig_tt_docked_0_pocket10.pdb
the following error occurred:
[2023-08-23 23:56:23,376::evaluate::INFO] {'model': {'checkpoint': './pretrained_models/pretrained_diffusion.pt'}, 'sample': {'seed': 2021, 'num_samples': 100, 'num_steps': 1000, 'pos_only': False, 'center_pos_mode': 'protein', 'sample_num_atoms': 'prior'}}
[2023-08-23 23:56:27,030::evaluate::INFO] Training Config: {'data': {'name': 'pl', 'path': './data/crossdocked_v1.1_rmsd1.0_pocket10', 'split': './data/crossdocked_pocket10_pose_split.pt', 'transform': {'ligand_atom_mode': 'add_aromatic', 'random_rot': False}}, 'model': {'denoise_type': 'diffusion', 'model_mean_type': 'C0', 'gt_noise_type': 'origin', 'beta_schedule': 'sigmoid', 'beta_start': 1e-07, 'beta_end': 0.002, 'v_beta_schedule': 'cosine', 'v_beta_s': 0.01, 'num_diffusion_timesteps': 1000, 'loss_v_weight': 100.0, 'v_mode': 'categorical', 'v_net_type': 'mlp', 'loss_pos_type': 'mse', 'sample_time_method': 'symmetric', 'time_emb_dim': 0, 'time_emb_mode': 'simple', 'center_pos_mode': 'protein', 'node_indicator': True, 'model_type': 'uni_o2', 'num_blocks': 1, 'num_layers': 9, 'hidden_dim': 128, 'n_heads': 16, 'edge_feat_dim': 4, 'num_r_gaussian': 20, 'knn': 32, 'num_node_types': 8, 'act_fn': 'relu', 'norm': True, 'cutoff_mode': 'knn', 'ew_net_type': 'global', 'r_feat_mode': 'sparse', 'energy_h_mode': 'basic', 'num_x2h': 1, 'num_h2x': 1, 'r_max': 10.0, 'x2h_out_fc': False, 'sync_twoup': False}, 'train': {'seed': 2021, 'batch_size': 4, 'num_workers': 4, 'max_iters': 10000000, 'val_freq': 2000, 'pos_noise_std': 0.1, 'max_grad_norm': 8.0, 'bond_loss_weight': 1.0, 'optimizer': {'type': 'adam', 'lr': 0.0005, 'weight_decay': 0, 'beta1': 0.95, 'beta2': 0.999}, 'scheduler': {'type': 'plateau', 'factor': 0.6, 'patience': 10, 'min_lr': 1e-06}}}
[2023-08-23 23:56:32,218::evaluate::INFO] Successfully load the model! [./pretrained_models/pretrained_diffusion.pt]
Traceback (most recent call last):
  File "scripts/sample_for_pocket.py", line 75, in <module>
    data = transform(data)
  File "c:\Users\lsy\anaconda3\envs\molgen\lib\site-packages\torch_geometric\transforms\compose.py", line 24, in __call__
    data = transform(data)
  File "d:\Cheminfo_Workshop\4_Fragment_Scaffold_Evolution\targetdiff-main\scripts\utils\transforms.py", line 128, in __call__
    amino_acid = F.one_hot(data.protein_atom_to_aa_type, num_classes=self.max_num_aa)
RuntimeError: one_hot is only applicable to index tensor.
Could you please provide suggestions on how to fix this?
many thanks,
Best,
Sh-Y
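Not the authors' fix, but the message itself says F.one_hot only accepts integer (index) tensors; if data.protein_atom_to_aa_type ends up stored as a float tensor on this setup, casting it to long before the one-hot should clear this specific exception. A sketch of the cast:

```python
import torch
import torch.nn.functional as F

# F.one_hot raises "one_hot is only applicable to index tensor" on floats;
# casting to long turns the values back into valid indices.
aa_type = torch.tensor([0.0, 3.0, 5.0])            # float tensor would raise
one_hot = F.one_hot(aa_type.long(), num_classes=20)  # cast first, then encode
```

The equivalent one-line patch would be applying the same .long() cast to data.protein_atom_to_aa_type inside the transform, or at data loading.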
Hi, it's nice of you to share the code!
I have some questions about data preprocessing, namely how to generate the file you shared on Google Drive, 'crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb'.
I've gone through the code but didn't find anything that directly implements this. I notice the data preprocessing code can generate index.pkl and crossdocked_pocket10_pose_split.pt; how can I then generate the LMDB file from these files?
I'm new to this area, and any help would be highly appreciated. Thanks a lot!
Thanks for sharing this amazing work. I would like to know whether there is a way to predict the binding affinity from a complex structure.
Firstly, I would like to express my gratitude for your impressive work and for making it available on GitHub.
I recently used your pre-trained model to generate molecules for a specific pocket with the following command: python scripts/sample_for_pocket.py configs/sampling.yml --pdb_path examples/1zcm.pdb. However, I encountered an issue at the vina docking step when evaluating with scripts/evaluate_diffusion.py. Specifically, 'ligand_filename': r['data'].ligand_filename does not appear to exist in the results generated by sample_for_pocket.py.
To resolve this, I removed that snippet from lines 110 and 143 of scripts/evaluate_diffusion.py. After this modification the problem seems resolved. I am raising this issue to bring it to your attention, and potentially to help others who encounter a similar problem.
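An alternative to deleting those lines is a fallback lookup, so evaluation still works unchanged for results that do carry ligand_filename. A sketch (the result layout comes from the snippet above; the helper name is hypothetical):

```python
def safe_ligand_filename(data):
    # Results from sample_for_pocket.py lack `ligand_filename`; return None
    # so the caller can skip the vina docking step instead of crashing.
    return getattr(data, 'ligand_filename', None)
```

evaluate_diffusion.py could then run the docking step only when the returned filename is not None.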
Dear authors,
Thanks for your great work. The training command python scripts/train_diffusion.py configs/training.yml raises a file-missing error, as described in the title, and I cannot find /data/crossdocked_v1.1_rmsd1.0_pocket10/index.pkl in the provided Google Drive.
In configs/training.yml, time_emb_dim is set to 0. Does this value correspond to the pretrained_diffusion.pt checkpoint you provided?
Hi,
Thank you for sharing this fantastic work. I am trying to reproduce some experimental results; I followed the instructions to download all the data and used the provided pretrained weights. Here is an image of the error in the evaluation.
Could you help look at this problem and give me some hints on how to fix it?
By the way, the complete mols' JS bond distances also show None, which I don't think is correct.
Many thanks!
I am reopening issue #3 since the data folder on GitHub has no folder called crossdocked_v1.1_rmsd1.0_pocket10. What is supposed to be in this folder? Should it have the same structure as the crossdocked_v1.1_rmsd1.0 folder linked on GitHub, which you get after untarring?
crossdocked_v1.1_rmsd1.0
├── 1433B_HUMAN_1_240_pep_0
│ ├── 5f74_A_rec_5f74_amp_lig_tt_docked_5.sdf
│ ├── 5f74_A_rec.pdb
│ ├── 5n10_A_rec_5f74_amp_lig_tt_min_0.sdf
│ └── 5n10_A_rec.pdb
├── 1433C_TOBAC_1_256_0
│ ├── 2o98_B_rec_2o98_fsc_lig_tt_docked_0.sdf
│ └── 2o98_B_rec.pdb
├── 1433S_HUMAN_1_233_0
│ ├── 3iqu_A_rec_3p1s_fsc_lig_tt_docked_10.sdf
│ ├── 3iqu_A_rec.pdb
│ ├── 3iqv_A_rec_3iqv_fsc_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec_3p1o_fsc_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec_3p1q_fsc_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec_3p1q_fsc_lig_tt_docked_5.sdf
│ ├── 3iqv_A_rec_3p1s_fsc_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec_3smk_cw7_lig_tt_docked_1.sdf
│ ├── 3iqv_A_rec_3smm_fja_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec_3smo_fja_lig_tt_docked_2.sdf
│ ├── 3iqv_A_rec_3smo_fja_lig_tt_min_0.sdf
│ ├── 3iqv_A_rec_3sp5_cx7_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec_4dhn_0kc_lig_tt_min_0.sdf
│ ├── 3iqv_A_rec_4fr3_0v4_lig_tt_docked_2.sdf
│ ├── 3iqv_A_rec_4jdd_fsc_lig_tt_docked_1.sdf
│ ├── 3iqv_A_rec_5mxo_fsc_lig_tt_docked_0.sdf
│ ├── 3iqv_A_rec.pdb
Hi, thank you for sharing such good work. However, I am a little confused about how to get batch.ligand_element_batch in:
def train(it):
    model.train()
    optimizer.zero_grad()
    for _ in range(config.train.n_acc_batch):
        batch = next(train_iterator).to(args.device)
        results = model.get_diffusion_loss(
            ligand_pos=batch.ligand_pos,
            ligand_v=batch.ligand_atom_feature_full,
            batch_ligand=batch.ligand_element_batch
        )
Can you tell me where I can find the processing that produces the ligand element batch?
Thank you
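Not an official answer, but in PyTorch Geometric, Batch.from_data_list(data_list, follow_batch=[...]) automatically creates a <name>_batch attribute for every listed key, mapping each entry of that attribute to the index of the graph it came from; ligand_element_batch presumably exists because 'ligand_element' is in the FOLLOW_BATCH list the repo passes when building batches (as in the sampling snippet earlier on this page). Conceptually it is just this assignment vector, sketched here without PyG:

```python
def make_batch_vector(sizes):
    """What follow_batch produces conceptually: atom j of ligand g is
    tagged with g, so losses can be aggregated per molecule."""
    out = []
    for graph_idx, n_atoms in enumerate(sizes):
        out.extend([graph_idx] * n_atoms)
    return out
```

For two ligands with 3 and 2 atoms, make_batch_vector([3, 2]) gives [0, 0, 0, 1, 1].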
Great work!
@guanjq a couple of questions:
#############
data:
...
transform:
ligand_atom_mode: add_aromatic
random_rot: False
##############