nyu-dl / dl4chem-mgm

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

dl4chem-mgm's Introduction

Masked Graph Modeling for Molecule Generation

This repository and its references contain the models, data and scripts used to carry out the experiments in the Masked graph modeling for molecule generation paper.

Installation Guide

We used a Linux OS with an Nvidia Tesla P100-SXM2 GPU with 16 GB of memory.

A conda environment file (environment.yml) is provided as part of this repository. It may contain packages beyond those needed to run the scripts here. If not using this file, please install the following dependencies.

Python

python 3.7
pytorch 1.4.0
tensorflow 1.14
rdkit 2019.09.3
guacamol 0.5.0
dgl 0.4.3post2
tensorboardx 2.0
scipy 1.4.1

GPU

CUDA (We used version 10.0)
Pytorch, Tensorflow and dgl installations should correspond to the CUDA version used.
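
If the environment is built without environment.yml, the installed versions can be compared against the pins above. The following is a minimal sketch using only the standard library. The distribution names used as keys (for example torch for PyTorch, tensorboardX) are assumptions about how the packages were installed and may differ for conda builds, and the prefix comparison is deliberately crude.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    import importlib_metadata as metadata

# Versions pinned in this README. The keys are assumed pip distribution
# names; conda packages may register under other names or not at all.
PINNED = {
    "torch": "1.4.0",
    "tensorflow": "1.14",
    "rdkit": "2019.09.3",
    "guacamol": "0.5.0",
    "dgl": "0.4.3",
    "tensorboardX": "2.0",
    "scipy": "1.4.1",
}


def check_versions(pinned):
    """Return {name: (wanted, found, matches)} for each pinned package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed, or installed under another name
        matches = found is not None and found.startswith(wanted)
        report[name] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(PINNED).items():
        status = "OK" if ok else "CHECK"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

A package reported as CHECK is not necessarily broken; it only means the installed version string does not start with the pinned one.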

Datasets

QM9

QM9 SMILES strings are included in this repository at data/QM9/QM9_smiles.txt
To process the QM9 SMILES for use in the training and generation scripts:
python -m data.gen_targets --data-path data/QM9/QM9_smiles.txt --save-path data/QM9/QM9_processed.p --dataset-type QM9

ChEMBL

To download the ChEMBL dataset:
Training set: wget -O data/ChEMBL/guacamol_v1_train.smiles https://ndownloader.figshare.com/files/13612760
Validation set: wget -O data/ChEMBL/guacamol_v1_valid.smiles https://ndownloader.figshare.com/files/13612766
Full dataset (training + validation + test): wget -O data/ChEMBL/guacamol_v1_all.smiles https://ndownloader.figshare.com/files/13612745

To process the training dataset after downloading for use in train and generation scripts:
python -m data.gen_targets --data-path data/ChEMBL/guacamol_v1_train.smiles --save-path data/ChEMBL/ChEMBL_train_processed.p --dataset-type ChEMBL
To process the validation dataset after downloading for use in train and generation scripts:
python -m data.gen_targets --data-path data/ChEMBL/guacamol_v1_valid.smiles --save-path data/ChEMBL/ChEMBL_val_processed.p --dataset-type ChEMBL
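
The three preprocessing commands above differ only in their paths and dataset type, so they can be driven from one small script. This is a sketch, not part of the repository: the helper names build_command and process_all are hypothetical, and the script assumes it is run from the repository root so that the data.gen_targets module resolves.

```python
import subprocess
import sys

# (input SMILES file, output pickle, dataset type), as listed in this README.
DATASETS = [
    ("data/QM9/QM9_smiles.txt", "data/QM9/QM9_processed.p", "QM9"),
    ("data/ChEMBL/guacamol_v1_train.smiles",
     "data/ChEMBL/ChEMBL_train_processed.p", "ChEMBL"),
    ("data/ChEMBL/guacamol_v1_valid.smiles",
     "data/ChEMBL/ChEMBL_val_processed.p", "ChEMBL"),
]


def build_command(data_path, save_path, dataset_type):
    """Assemble the argument list for one data.gen_targets invocation."""
    return [sys.executable, "-m", "data.gen_targets",
            "--data-path", data_path,
            "--save-path", save_path,
            "--dataset-type", dataset_type]


def process_all(dry_run=True):
    """Print every command; run them only when dry_run is False."""
    commands = [build_command(*entry) for entry in DATASETS]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

Calling process_all() only prints the commands; process_all(dry_run=False) executes them in order.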

Pretrained Models

Pretrained models are provided here for both datasets. To use these for generation, download the entire dumped folder to the repository root.

Training

As an alternative to using pretrained models, the following are scripts for training models from scratch.

QM9

python train.py --data_path data/QM9/QM9_processed.p --graph_type QM9 --exp_name QM9_experiment --num_node_types 5 --num_edge_types 5 --max_nodes 9 --layer_norm --spatial_msg_res_conn --batch_size 1024 --val_batch_size 2500 --val_after 105 --num_epochs 200 --shuffle --mask_independently --force_mask_predict --optimizer adam,lr=0.0001 --tensorboard

ChEMBL

python train.py --data_path data/ChEMBL/ChEMBL_train_processed.p --graph_type ChEMBL --exp_name chembl_experiment --val_data_path data/ChEMBL/ChEMBL_val_processed.p --num_node_types 12 --num_edge_types 5 --max_nodes 88 --min_charge -1 --max_charge 3 --mpnn_steps 6 --layer_norm --spatial_msg_res_conn --batch_size 32 --val_batch_size 64 --grad_accum_iters 16 --val_after 3200 --num_epochs 10 --shuffle --force_mask_predict --mask_independently --optimizer adam,lr=0.0001 --tensorboard
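
The ChEMBL command combines --batch_size 32 with --grad_accum_iters 16. Assuming --grad_accum_iters implements standard gradient accumulation (an assumption, not verified against train.py), each optimizer update would see 32 × 16 = 512 graphs. The helper names below are hypothetical and the arithmetic is only a sketch.

```python
import math


def effective_batch_size(batch_size, grad_accum_iters=1):
    """Graphs contributing to one optimizer update under accumulation."""
    return batch_size * grad_accum_iters


def optimizer_steps_per_epoch(num_graphs, batch_size, grad_accum_iters=1):
    """Optimizer updates per pass over num_graphs training graphs."""
    batches = math.ceil(num_graphs / batch_size)
    return math.ceil(batches / grad_accum_iters)


print(effective_batch_size(32, 16))  # ChEMBL settings above: 512
print(effective_batch_size(1024))    # QM9 settings above: 1024
```

This matters when lowering --batch_size to fit a smaller GPU: raising --grad_accum_iters by the same factor should keep the effective batch size unchanged under that assumption.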

Training Baseline Transformer Models

In addition to scripts for training our model, we include the scripts for training the baseline autoregressive Transformer models (download preprocessed data for the Transformer models here). The code for the model is based on the following repository: https://github.com/facebookresearch/XLM

ChEMBL (Transformer Regular)

python train_ar.py --dump_path ./ --data_path /path/to/chembl/data --data_type ChEMBL --exp_name chembl_smiles_transformer_regular --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 1000 --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --batch_size 128

ChEMBL (Transformer Small)

python train_ar.py --dump_path ./ --data_path /path/to/chembl/data --data_type ChEMBL --exp_name chembl_smiles_transformer_small --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 1000 --emb_dim 512 --n_layers 4 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --batch_size 128

For QM9, use the ChEMBL scripts above and set the --data_type flag to QM9.

Generation

The following are scripts for generating from trained/pretrained models. The --node_target_frac and --edge_target_frac options set the masking rate for node and edge features respectively.
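
To make the masking rate concrete, the sketch below picks which node and edge positions would be masked in one step for a given fraction. It is an illustration only and not the repository's implementation: the helper choose_masked is hypothetical, and both the rounding rule and the minimum of one masked item are assumptions.

```python
import random


def choose_masked(num_items, target_frac, rng):
    """Pick indices to mask; at least one (an assumption of this sketch)."""
    count = max(1, round(num_items * target_frac))
    return sorted(rng.sample(range(num_items), count))


rng = random.Random(0)

# A QM9-sized graph: 9 nodes and 9 * 8 / 2 = 36 node pairs (possible edges).
masked_nodes = choose_masked(9, 0.2, rng)
masked_edges = choose_masked(36, 0.2, rng)
print(len(masked_nodes), len(masked_edges))  # 2 7
```

With a rate of 0.01 on an 88-node ChEMBL-sized graph, the same rule masks a single node per step, which is why low masking rates change the molecule only slightly at each iteration.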

QM9

To generate using training initialisation and masking rate 0.1:
python generate.py --data_path data/QM9/QM9_processed.p --graph_type QM9 --model_path dumped/QM9_experiment/best_model --smiles_dataset_path data/QM9/QM9_smiles.txt --output_dir dumped/QM9_experiment/generation/train_init/mask10/results --num_node_types 5 --num_edge_types 5 --max_nodes 9 --layer_norm --embed_hs --spatial_msg_res_conn --num_iters 400 --num_sampling_iters 400 --cp_save_dir dumped/QM9_experiment/generation/train_init/mask10/generation_checkpoints --batch_size 2500 --checkpointing_period 400 --evaluation_period 20 --save_period 20 --evaluate_finegrained --save_finegrained --mask_independently --retrieve_train_graphs --node_target_frac 0.1 --edge_target_frac 0.1

To generate using marginal initialisation and masking rate 0.2:
python generate.py --data_path data/QM9/QM9_processed.p --graph_type QM9 --model_path dumped/QM9_experiment/best_model --smiles_dataset_path data/QM9/QM9_smiles.txt --output_dir dumped/QM9_experiment/generation/marginal_init/mask20/results --num_node_types 5 --num_edge_types 5 --max_nodes 9 --layer_norm --embed_hs --spatial_msg_res_conn --num_iters 400 --num_sampling_iters 400 --cp_save_dir dumped/QM9_experiment/generation/marginal_init/mask20/generation_checkpoints --batch_size 2500 --checkpointing_period 400 --evaluation_period 20 --save_period 20 --evaluate_finegrained --save_finegrained --mask_independently --random_init --node_target_frac 0.2 --edge_target_frac 0.2

ChEMBL

To generate using training initialisation and masking rate 0.01:
python generate.py --data_path data/ChEMBL/ChEMBL_train_processed.p --graph_type ChEMBL --model_path dumped/chembl_experiment/best_model --smiles_dataset_path data/ChEMBL/guacamol_v1_all.smiles --output_dir dumped/chembl_experiment/generation/train_init/mask1/results --num_node_types 12 --num_edge_types 5 --max_nodes 88 --min_charge -1 --max_charge 3 --layer_norm --mpnn_steps 6 --embed_hs --spatial_msg_res_conn --num_iters 300 --num_sampling_iters 300 --checkpointing_period 300 --evaluation_period 20 --save_period 20 --evaluate_finegrained --save_finegrained --cp_save_dir dumped/chembl_experiment/generation/train_init/mask1/generation_checkpoints --batch_size 32 --mask_independently --retrieve_train_graphs --node_target_frac 0.01 --edge_target_frac 0.01
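
The --output_dir and --cp_save_dir values in the training-initialisation commands above follow one pattern: dumped/<experiment>/generation/<initialisation>/mask<percent>/. When sweeping over masking rates, a small helper can keep the two paths consistent. The function generation_dirs is hypothetical and only mirrors that naming pattern.

```python
def generation_dirs(exp_name, init, mask_frac, root="dumped"):
    """Return (output_dir, cp_save_dir) following the README's path pattern."""
    mask_tag = f"mask{round(mask_frac * 100)}"  # 0.1 -> mask10, 0.01 -> mask1
    base = f"{root}/{exp_name}/generation/{init}/{mask_tag}"
    return f"{base}/results", f"{base}/generation_checkpoints"


output_dir, cp_save_dir = generation_dirs("QM9_experiment", "train_init", 0.1)
print(output_dir)   # dumped/QM9_experiment/generation/train_init/mask10/results
print(cp_save_dir)  # dumped/QM9_experiment/generation/train_init/mask10/generation_checkpoints
```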

Generation Using Transformer Baseline Models

QM9

python generate_ar_distributional.py --model_path /path/to/trained/qm9/model/best_model.pth --dist_file QM9_all.smiles

ChEMBL

python generate_ar_distributional.py --model_path /path/to/trained/chembl/model/best_model.pth --dist_file guacamol_v1_all.smiles

MGM Generation Results

SMILES strings and distributional results at each recorded generation step can be found in <output_dir> from the MGM generation script used above.

To print generation results at each step in a dataframe: python get_best_distributional_results.py <output_dir>

We also provide a list of SMILES strings of 20,000 generated molecules each for QM9 with a 10% masking rate and ChEMBL with a 1% masking rate here. Training initialisation was used in both cases.

Citation

If you have found the materials in this repository useful, please consider citing: Mahmood, O., Mansimov, E., Bonneau, R. et al. Masked graph modeling for molecule generation. Nat Commun 12, 3156 (2021). https://doi.org/10.1038/s41467-021-23415-2

dl4chem-mgm's Issues

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Using backend: pytorch
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/glard/doping/dl4chem-mgm/src/model/graph_generator.py:19: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /home/glard/doping/dl4chem-mgm/src/model/graph_generator.py:21: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2022-01-18 13:35:06.430977: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-01-18 13:35:06.451785: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2022-01-18 13:35:06.452153: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558cb0b482d0 executing computations on platform Host. Devices:
2022-01-18 13:35:06.452164: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2022-01-18 13:35:06.452267: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2022-01-18 13:35:06.462302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:35:06.462441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: NVIDIA GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.815
pciBusID: 0000:01:00.0
2022-01-18 13:35:06.462471: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-01-18 13:35:06.462490: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-01-18 13:35:06.462505: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2022-01-18 13:35:06.462519: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2022-01-18 13:35:06.683946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2022-01-18 13:35:06.684160: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2022-01-18 13:35:07.202372: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-01-18 13:35:07.202642: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:35:07.203193: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:35:07.203657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2022-01-18 13:43:36.841675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-01-18 13:43:36.841695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2022-01-18 13:43:36.841703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2022-01-18 13:43:36.841802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:43:36.841928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:43:36.842032: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:43:36.842127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6878 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
2022-01-18 13:43:36.843210: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558cd6a94070 executing computations on platform CUDA. Devices:
2022-01-18 13:43:36.843222: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
INFO - 01/18/22 13:43:37 - 0:00:00 - ============ Initialized logger ============
INFO - 01/18/22 13:43:37 - 0:00:00 - Random seed is 0
INFO - 01/18/22 13:43:37 - 0:00:00 - ar: False
batch_size: 16
binary_classification: False
bound_edges: False
check_pred_validity: False
clip_grad_norm: 10.0
cond_virtual_node: False
data_path: data/QM9/QM9_processed.p
debug_fixed: False
debug_small: False
decay_start_iter: 99999999
dim_h: 2048
dim_k: 1
do_not_corrupt: False
dump_path: dumped/
edge_mask_frac: 1.0
edge_mask_predict_frac: 1.0
edge_replace_frac: 0.0
edge_replace_predict_frac: 1.0
edge_target_frac: 0.2
edges_per_batch: -1
embed_hs: False
equalise: False
exp_id:
exp_name: QM9_experiment
first_iter: 0
force_mask_predict: True
force_replace_predict: False
fully_connected: False
gen_num_iters: 10
gen_num_samples: 0
gen_predict_deterministically: False
gen_random_init: False
global_connection: False
grad_accum_iters: 1
graph2binary_properties_path: data/proteins/pdb_golabels.p
graph_properties_path:
graph_property_names: []
graph_type: QM9
layer_norm: True
load_best: False
load_latest: False
local_cpu: False
log_train_steps: 200
loss_normalisation_type: by_component
lr_decay_amount: 0.0
lr_decay_frac: 1.0
lr_decay_interval: 9999999
mask_all_ring_properties: False
mask_independently: True
mat_N: 2
mat_d_model: 64
mat_dropout: 0.1
mat_h: 8
max_charge: 1
max_epoch: 100000
max_hs: 4
max_nodes: 9
max_steps: 10000000.0
max_target_frac: 0.8
min_charge: -1
min_lr: 0.0
model_name: GraphNN
mpnn_name: EdgesFromNodesMPNN
mpnn_steps: 4
no_edge_present_type: zeros
no_save: False
no_update: False
node_mask_frac: 1.0
node_mask_predict_frac: 1.0
node_mpnn_name: NbrEWMultMPNN
node_replace_frac: 0.0
node_replace_predict_frac: 1.0
node_target_frac: 0.2
normalise_graph_properties: False
num_batches: 4
num_binary_graph_properties: 0
num_edge_types: 5
num_epochs: 200
num_graph_properties: 0
num_mpnns: 1
num_node_types: 5
optimizer: adam,lr=0.0001
perturbation_batch_size: 32
perturbation_edges_per_batch: -1
predict_graph_properties: False
prediction_data_structs: all
pretrained_property_embeddings_path: data/proteins/preprocessed_go_embeddings.npy
property_type: None
res_conn: False
save_all: False
seed: 0
seq_output_dim: 768
share_embed: False
shuffle: True
smiles_path: None
smiles_train_split: 0.8
spatial_msg_res_conn: True
spatial_postgru_res_conn: False
suppress_params: False
suppress_train_log: False
target_data_structs: both
target_frac_inc_after: None
target_frac_inc_amount: 0
target_frac_type: random
tensorboard: True
update_edges_at_end_only: False
use_newest_edges: False
use_smiles: False
val_after: 105
val_batch_size: 2500
val_data_path: data/ChEMBL/ChEMBL_val_processed_hs.p
val_dataset_size: -1
val_edge_target_frac: 0.1
val_edges_per_batch: None
val_graph2binary_properties_path: None
val_graph_properties_path: data/ChEMBL/ChEMBL_val_graph_properties.p
val_node_target_frac: 0.1
val_seed: 0
validate_on_train: False
warm_up_iters: 1.0
weighted_loss: False
INFO - 01/18/22 13:43:37 - 0:00:00 - Running command: python train.py --data_path data/QM9/QM9_processed.p --graph_type QM9 --exp_name QM9_experiment --num_node_types 5 --num_edge_types 5 --max_nodes 9 --layer_norm --spatial_msg_res_conn --batch_size 16 --val_batch_size 2500 --val_after 105 --num_epochs 200 --shuffle --mask_independently --force_mask_predict --optimizer adam,lr=0.0001 --tensorboard
INFO - 01/18/22 13:43:37 - 0:00:00 - The experiment will be stored in dumped/QM9_experiment

INFO - 01/18/22 13:43:43 - 0:00:06 - train_loader len is 6651
INFO - 01/18/22 13:43:43 - 0:00:06 - val_loader len is 11
Starting epoch 1
0
Traceback (most recent call last):
File "train.py", line 322, in <module>
main(params)
File "train.py", line 129, in main
binary_graph_properties)
File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/glard/doping/dl4chem-mgm/src/model/gnn.py", line 186, in forward
batch_init_graph = self.mpnnsmpnn_num
File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/glard/doping/dl4chem-mgm/src/model/mpnns.py", line 56, in forward
updated_nodes, updated_edges = self.mpnn_step_forward(batch_graph, step_num)
File "/home/glard/doping/dl4chem-mgm/src/model/mpnns.py", line 77, in mpnn_step_forward_nonfc
updated_nodes = self.node_mpnn(batch_graph)
File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/glard/doping/dl4chem-mgm/src/model/node_mpnns.py", line 36, in forward
nodes = self.update_GRU(msg, g.ndata['nodes'])
File "/home/glard/doping/dl4chem-mgm/src/model/node_mpnns.py", line 23, in update_GRU
_, node_next = self.gru(msg, node)
File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 716, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
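
One plausible cause, not confirmed in this thread: the log shows an RTX 3070 (compute capability 8.6) running against CUDA 10.0 libraries (libcudart.so.10.0), while support for that architecture was only added in later CUDA 11 releases. The sketch below extracts both values from a log and flags the mismatch. The function name is hypothetical, and the table of minimum toolkit versions is written from memory of NVIDIA's documentation, so it should be checked before relying on it.

```python
import re

# Minimum CUDA toolkit release supporting each compute capability
# (assumed values; verify against NVIDIA's documentation).
MIN_CUDA = {(7, 0): (9, 0), (7, 5): (10, 0), (8, 0): (11, 0), (8, 6): (11, 1)}


def cuda_supports_gpu(log_text):
    """Return True/False if the log shows a match/mismatch, None if unknown."""
    cap = re.search(r"compute capability:?\s*(\d+)\.(\d+)", log_text, re.IGNORECASE)
    rt = re.search(r"libcudart\.so\.(\d+)\.(\d+)", log_text)
    if cap is None or rt is None:
        return None
    capability = (int(cap.group(1)), int(cap.group(2)))
    runtime = (int(rt.group(1)), int(rt.group(2)))
    needed = MIN_CUDA.get(capability)
    if needed is None:
        return None  # capability not in the table above
    return runtime >= needed


# Two lines taken from the log in this issue.
sample = (
    "Successfully opened dynamic library libcudart.so.10.0\n"
    "physical GPU (device: 0, name: NVIDIA GeForce RTX 3070, "
    "pci bus id: 0000:01:00.0, compute capability: 8.6)\n"
)
print(cuda_supports_gpu(sample))  # False
```

If this is the cause, the pinned versions in the installation guide (PyTorch 1.4.0 with CUDA 10.0) would need newer builds to run on this GPU, which may in turn require code changes.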

Cannot run generate.py

The program frequently runs into RuntimeErrors. Could you please fix them?

SMILES vs. Graphs

Even though the goal of the presented method is to generate graphs, I noticed in the code that you finally convert them to SMILES. According to your paper, this is because "the GuacaMol benchmark requires that graph representations be converted into SMILES strings before evaluation." I have two questions, and I would really appreciate your answers:

1. Do the generated graphs (before conversion to SMILES) have properties that are not available in the converted SMILES, or do the two representations convey the same information?
2. It seems that both datasets used to pre-train the generator consist of SMILES strings. How can you use SMILES strings to generate a higher-level representation such as graphs? I assume that graphs include more information than SMILES.

Code example (Generation QM9) TypeError

Hi,
I tried to run the example code under Generation QM9:
#https://github.com/nyu-dl/dl4chem-mgm/blob/master/README.md#qm9-2
I copied and pasted all of the preceding commands and got the following error:

2022-07-08 10:11:21.644672: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/shunyang/intel/oneapi/compiler/2022.0.2/linux/lib:/home/shunyang/intel/oneapi/compiler/2022.0.2/linux/lib/x64:/home/shunyang/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin
2022-07-08 10:11:21.644747: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/shunyang/intel/oneapi/compiler/2022.0.2/linux/lib:/home/shunyang/intel/oneapi/compiler/2022.0.2/linux/lib/x64:/home/shunyang/intel/oneapi/compiler/2022.0.2/linux/compiler/lib/intel64_lin
2022-07-08 10:11:21.644765: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
  0%|                                                                                                                                                                                 | 0/400 [00:00<?, ?it/s]Generator length: 8
  0%|                                                                                                                                                                                   | 0/8 [00:04<?, ?it/s]
  0%|                                                                                                                                                                                 | 0/400 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "generate.py", line 115, in <module>
    main(params)
  File "generate.py", line 110, in main
    params.evaluate_connected_only)
  File "/mnt/c/Users/Study/Documents/GitHub/dl4chem-mgm/src/model/graph_generator.py", line 158, in generate_with_evaluation
    smiles_list = self.carry_out_iteration(loader, use_argmax)
  File "/mnt/c/Users/Study/Documents/GitHub/dl4chem-mgm/src/model/graph_generator.py", line 199, in carry_out_iteration
    graph_properties, binary_graph_properties, use_argmax)
  File "/mnt/c/Users/Study/Documents/GitHub/dl4chem-mgm/src/model/graph_generator.py", line 257, in sample_simultaneously
    binary_graph_properties)
  File "/mnt/c/Users/Study/Documents/GitHub/dl4chem-mgm/src/model/graph_generator.py", line 274, in model_forward_mgm
    return self.model(batch_init_graph, graph_properties, binary_graph_properties)
  File "/home/shunyang/miniconda3/envs/ame/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/c/Users/Study/Documents/GitHub/dl4chem-mgm/src/model/gnn.py", line 183, in forward
    batch_init_graph = self.calculate_embeddings(batch_init_graph, graph_properties, binary_graph_properties)
  File "/mnt/c/Users/Study/Documents/GitHub/dl4chem-mgm/src/model/gnn.py", line 138, in calculate_embeddings
    batch_graph.batch_num_nodes())
TypeError: 'list' object is not callable
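
The message says that batch_graph.batch_num_nodes is a plain list in the installed DGL, while src/model/gnn.py line 138 calls it like a method. To my understanding, DGL changed batch_num_nodes from an attribute to a method between releases, but treat that as an assumption and check the installed version. A version-tolerant accessor could look like the sketch below. The helper name and the two stand-in classes are hypothetical and exist only to make the example runnable without DGL.

```python
def batch_num_nodes_compat(batch_graph):
    """Return per-graph node counts whether the API is an attribute or a method."""
    value = batch_graph.batch_num_nodes
    return value() if callable(value) else value


# Stand-ins for the two API styles; these are not DGL classes.
class AttributeStyleGraph:
    batch_num_nodes = [3, 4]


class MethodStyleGraph:
    def batch_num_nodes(self):
        return [3, 4]


print(batch_num_nodes_compat(AttributeStyleGraph()))  # [3, 4]
print(batch_num_nodes_compat(MethodStyleGraph()))     # [3, 4]
```

The alternative is to install the DGL version that matches the code being run, so that no shim is needed.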

Could you please help me solve this problem?
Thank you!
Shunyang

Mask Ratio during Training Process

I noticed that in your paper, you discussed the impact of different mask ratios during the sampling process, but I was curious if you have also experimented with different mask ratios during the training process. Specifically, I was wondering if you have tried varying the mask ratio during training and how that might affect the performance of your model.

Could not generate new molecules

I tried to run generation on ChEMBL using the provided pretrained model and generation script, but the outputs smiles_1_0 through smiles_300_0 are all identical. I also tried changing the mask fraction to avoid this, but when I checked the code I could not find anywhere that uses the --node_target_frac and --edge_target_frac parameters. Could you please explain how these parameters work?

environment.yml missing

Hello, it seems that the environment file is missing. Could you add it to the repo?

Thank you

Cannot run train.py

I ran into a problem when trying to run your train.py. After successfully processing the QM9 SMILES as described, I ran python train.py or python train.py --data_path data/QM9/QM9_processed.p --graph_type QM9 --exp_name QM9_experiment --num_node_types 5 --num_edge_types 5 --max_nodes 9 --layer_norm --spatial_msg_res_conn --batch_size 1024 --val_batch_size 2500 --val_after 105 --num_epochs 200 --shuffle --mask_independently --force_mask_predict --optimizer adam,lr=0.0001 --tensorboard, but something goes wrong:

Traceback (most recent call last):
File "train.py", line 324, in <module>
main(params)
File "train.py", line 96, in main
model = model_cls(params)
File "/data/yinmingyue/Sim/code/Masked_graph/src/model/gnn.py", line 173, in __init__
super().__init__(params)
File "/data/yinmingyue/Sim/code/Masked_graph/src/model/gnn.py", line 27, in __init__
'num_output_classes': params.num_node_type})])
AttributeError: 'Namespace' object has no attribute 'num_node_type'

My package versions are:
python 3.7.10
pytorch 1.5.1
tensorflow 1.14.0
rdkit 2018.09.3
guacamol 0.5.0
dgl 0.6.1
scipy 1.7.0

Is the problem caused by my pkgs version?
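
The traceback refers to params.num_node_type (singular), whereas the flag documented in the README is --num_node_types (plural). argparse derives the attribute name from the flag, so the singular attribute would not exist unless something else defines it. That points at a naming mismatch in the code being run rather than at package versions, although this is an inference from the traceback alone. A minimal reproduction follows, using a hypothetical parser rather than the one in train.py.

```python
import argparse

# Hypothetical parser with only the one flag relevant to the error.
parser = argparse.ArgumentParser()
parser.add_argument("--num_node_types", type=int, default=5)
params = parser.parse_args(["--num_node_types", "5"])

print(params.num_node_types)             # 5
print(hasattr(params, "num_node_type"))  # False
```

Comparing line 27 of the local src/model/gnn.py with the upstream repository would show whether the local copy differs.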

Validation losses keep increasing

Hi,

Thank you for sharing this great work. I tried to replicate training from scratch, but it does not seem to work properly on either the QM9 or the ChEMBL dataset. Training losses decrease as training proceeds, while validation losses keep increasing from the beginning. For example, running the following scripts:

the losses at the beginning:

INFO - 11/09/21 14:52:37 - 0:07:59 - total_iter = 105, loss = 0.21, is_aromatic = 0.00, is_in_ring = 0.01, chirality = 0.11, charge = 0.01, node_type = 0.04, hydrogens = 0.04, edge_type = 0.01
INFO - 11/09/21 14:52:37 - 0:07:59 - Validating
INFO - 11/09/21 14:53:54 - 0:09:17 - Validation_loss: 6.92
INFO - 11/09/21 14:53:54 - 0:09:17 - node_type_loss: 0.30
INFO - 11/09/21 14:53:54 - 0:09:17 - hydrogens_loss: 4.70
INFO - 11/09/21 14:53:54 - 0:09:17 - charge_loss: 0.12
INFO - 11/09/21 14:53:54 - 0:09:17 - is_in_ring_loss: 0.41
INFO - 11/09/21 14:53:54 - 0:09:17 - is_aromatic_loss: 0.04
INFO - 11/09/21 14:53:54 - 0:09:17 - chirality_loss: 0.50
INFO - 11/09/21 14:53:54 - 0:09:17 - edge_type_loss: 0.85

the losses at the middle of the training:

INFO - 11/10/21 05:17:22 - 14:32:45 - total_iter = 9975, loss = 0.00, is_aromatic = 0.00, is_in_ring = 0.00, chirality = 0.00, charge = 0.00, node_type = 0.00, hydrogens = 0.00, edge_type = 0.00
INFO - 11/10/21 05:17:22 - 14:32:45 - Validating
INFO - 11/10/21 05:18:43 - 14:34:06 - Validation_loss: 12.83
INFO - 11/10/21 05:18:43 - 14:34:06 - node_type_loss: 1.11
INFO - 11/10/21 05:18:43 - 14:34:06 - hydrogens_loss: 8.00
INFO - 11/10/21 05:18:43 - 14:34:06 - charge_loss: 0.28
INFO - 11/10/21 05:18:43 - 14:34:06 - is_in_ring_loss: 0.49
INFO - 11/10/21 05:18:43 - 14:34:06 - is_aromatic_loss: 0.02
INFO - 11/10/21 05:18:43 - 14:34:06 - chirality_loss: 1.79
INFO - 11/10/21 05:18:43 - 14:34:06 - edge_type_loss: 1.14

the losses at the end:

INFO - 11/10/21 21:42:53 - 1 day, 6:58:16 - total_iter = 20790, loss = 0.00, is_aromatic = 0.00, is_in_ring = 0.00, chirality = 0.00, charge = 0.00, node_type = 0.00, hydrogens = 0.00, edge_type = 0.00
INFO - 11/10/21 21:42:53 - 1 day, 6:58:16 - Validating
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - Validation_loss: 15.39
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - node_type_loss: 1.19
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - hydrogens_loss: 10.07
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - charge_loss: 0.32
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - is_in_ring_loss: 0.38
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - is_aromatic_loss: 0.02
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - chirality_loss: 1.91
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - edge_type_loss: 1.51

I would appreciate it if you could share your experience with this problem.
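
To track this trend over a whole run rather than at three hand-picked points, the validation losses can be pulled out of the training log. The sketch below assumes the log format shown in this issue; the helper names are hypothetical.

```python
import re

VAL_LOSS = re.compile(r"Validation_loss:\s*([0-9.]+)")


def validation_losses(log_text):
    """Extract every reported validation loss, in order of appearance."""
    return [float(m.group(1)) for m in VAL_LOSS.finditer(log_text)]


def is_strictly_increasing(values):
    """True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# The three validation lines quoted in this issue.
sample = """\
INFO - 11/09/21 14:53:54 - 0:09:17 - Validation_loss: 6.92
INFO - 11/10/21 05:18:43 - 14:34:06 - Validation_loss: 12.83
INFO - 11/10/21 21:44:17 - 1 day, 6:59:40 - Validation_loss: 15.39
"""
losses = validation_losses(sample)
print(losses, is_strictly_increasing(losses))  # [6.92, 12.83, 15.39] True
```

A training loss of 0.00 alongside a rising validation loss is the usual signature of overfitting or of a mismatch between training and validation data, and the per-component lines (hydrogens_loss in particular) can be extracted the same way to see which term dominates.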
