yuyangw / molclr
Implementation of MolCLR: "Molecular Contrastive Learning of Representations via Graph Neural Networks" in PyG.
License: MIT License
Hi, Yuyang Wu!
I'm wondering whether, for multi-target tasks, each mean value per dataset in Table 1 and Table 2 of your paper combines different hyperparameter settings (i.e., for ClinTox, picking a heterogeneous hyperparameter combination, the best-performing one for each target, and then averaging over targets within each individual run).
Best regards,
JongKook, Heo
Hi Team,
Great work, but I have a question. We know that the data is processed when the model transforms molecules into vectors. My current work is to concatenate the representation file of a downstream dataset generated by our model with the one generated by your model, in order to test the quality of the prediction results (for example, the representation file of the processed BBBP dataset). How can I get a representation file for the processed dataset when running finetune.py? @yuyangw
Hi dear author, after reading your impressive writings, we have the following questions.
Thank you for your answers and help, and good luck with your research!
Hi, thanks for your nice work!
I find that the performance you report for "Strategies for Pre-training Graph Neural Networks" differs from the numbers in the original paper. Why is that? Did you use different experimental settings?
Thanks!
Thank you for your work. However, I have found that I am unable to reproduce the results on many datasets as presented in the paper. I have the following questions:
Is the best result for each dataset achieved with the SAME pre-trained model (i.e., the pre-training checkpoint provided in the ckpt directory)? Table 6 mentions a parameter search during the fine-tuning stage, but it is not clear whether all datasets share one pre-trained model trained under a single hyperparameter configuration.
Table 6 of the supplementary material gives the parameter search range for the fine-tuning stage. However, I still had difficulty replicating the results for some datasets. Could you please provide the exact hyperparameter configuration behind the best result for each dataset, rather than the search range? Others have also raised issues about replication problems, so this would save reproducers time. Thank you!
Nice work, thanks for sharing !
But I have a question about reproducing the results in the paper.
When I tried to reproduce BACE, the fine-tuned model gave results similar to training from scratch.
To double-check that I loaded the model correctly, I froze the pre-trained model and trained only the prediction head on BACE, but training did not converge.
If the representations are meaningful, shouldn't a frozen pre-trained model perform only slightly worse than a fine-tuned one?
Could you please provide more details on reproducing the paper's results?
Here's the config I used:
batch_size: 32 # batch size
epochs: 100 # total number of epochs
eval_every_n_epochs: 1 # validation frequency
fine_tune_from: ./ckpt/ # sub directory of pre-trained model in ./ckpt
log_every_n_steps: 50 # print training log frequency
fp16_precision: False # float precision 16 (i.e. True/False)
init_lr: 0.0005 # initial learning rate for the prediction head
init_base_lr: 0.00001 # initial learning rate for the base GNN encoder
weight_decay: 1e-6 # weight decay of Adam
gpu: cuda:0 # training GPU
task_name: BACE # name of fine-tuning benchmark, including
# classifications: BBBP/BACE/ClinTox/Tox21/HIV/SIDER/MUV
# regressions: FreeSolv/ESOL/Lipo/qm7/qm8/qm9
model_type: gin # GNN backbone (i.e., gin/gcn)
model:
  num_layer: 5 # number of graph conv layers
  emb_dim: 300 # embedding dimension in graph conv layers
  feat_dim: 512 # output feature dimension
  drop_ratio: 0.5 # dropout ratio
  pool: mean # readout pooling (i.e., mean/max/add)
dataset:
  num_workers: 4 # dataloader number of workers
  valid_size: 0.1 # ratio of validation data
  test_size: 0.1 # ratio of test data
  splitting: scaffold # data splitting (i.e., random/scaffold)
Hi @yuyangw, I have a question about the scaffold split and ROC-AUC on the MUV dataset.
About three weeks ago, I received the answer, "For MUV, in rare cases where only one class is included in y_true, we calculate the accuracy instead of ROC-AUC."
Does that mean the ROC-AUC score in Table 1 of your paper is a mix of mostly ROC-AUC values and a few accuracy values for those rare cases?
If not, how can I reproduce suitable scaffold split sets for MUV with your code?
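One common way to handle the single-class case discussed here is to leave such tasks out of the average instead of substituting a different metric. The sketch below only illustrates that option and is not the repository's evaluation code; `roc_auc` and `mean_auc_over_tasks` are hypothetical helper names, labels are assumed to be 0/1, and missing labels are assumed to be encoded as `None`.

```python
from typing import List, Optional, Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> Optional[float]:
    """Pairwise ROC-AUC; returns None when only one class is present."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None  # AUC is undefined with a single class
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc_over_tasks(
    labels: List[Sequence[Optional[int]]], scores: List[Sequence[float]]
) -> Optional[float]:
    """Average AUC over tasks, dropping missing labels and undefined tasks."""
    aucs = []
    for task_labels, task_scores in zip(labels, scores):
        kept = [(t, s) for t, s in zip(task_labels, task_scores) if t is not None]
        auc = roc_auc([t for t, _ in kept], [s for _, s in kept])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else None
```

With this convention the reported number stays a pure ROC-AUC average, at the cost of some test splits contributing fewer tasks than others. The pairwise loop is quadratic and meant for illustration only.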
Hi @yuyangw
Awesome work on molecule pre-training! I see you provide apex support for mixed-precision training. Since I have not tried the package before, may I ask whether it noticeably accelerates model training, and how the performance compares with training without apex?
Many thanks.
Peizhen
Hi Team,
I have a canonical SMILES string "CCC[NH2+]C(CCc1ccccn1)c1ccc(Cl)s1" and obtain rdkit.Chem.rdchem.ChiralType.CHI_UNSPECIFIED as the chiral tag and rdkit.Chem.rdchem.BondDir.NONE as the bond direction for all atoms and bonds. Why does this happen? Is there an alternative way to include these two properties in the molecular graph?
Likewise, all SMILES strings used for pretraining iMolCLR produce rdkit.Chem.rdchem.ChiralType.CHI_UNSPECIFIED for the chiral tag and rdkit.Chem.rdchem.BondDir.NONE for the bond direction.
Dear Authors,
I have thoroughly read your work "Molecular Contrastive Learning of Representations via Graph Neural Networks", from which I learned and benefited a lot. Thanks to the authors for their hard work.
However, I would like to ask the following questions.
First, the dataset sizes do not match. The datasets I downloaded from the CodeOcean capsule linked under "Code availability" in the article do not match the dataset sizes given in the paper. The record counts of the datasets I extracted from "data.zip" are as follows:
qm9: 133885, qm8: 21786, qm7: 6834, Lipo: 4200, ESOL: 1128, FreeSolv: 642, BBBP: 2050, Tox21: 7831, ClinTox: 1483, HIV: 82254, BACE: 1513, SIDER: 1427, MUV: 186175.
MUV in particular is very different: the MolCLR paper reports 93,087 records, while the download contains 186,175. Some other datasets also do not match exactly. Could you please provide datasets of the same size as in the paper?
Second, could you provide the source code for applying the three augmentation strategies directly in the downstream molecular property prediction tasks? I have had difficulty implementing this myself.
Finally, could you please also provide the source code for Figure 3 on page 7 and Figure 4 on page 8, as well as Figure 4 on page 16 of the Supplementary Information? My own ability to reproduce them is limited.
I would like to thank the authors again for providing me with a learning opportunity. I look forward to your early reply! My email address is [email protected].
I wish you all the best.
Yunwu Liu
Recently I came across some papers on molecular contrastive learning, and it is my great pleasure to find a paper written by your team, named Molecular Contrastive Learning of Representations via Graph Neural Networks. This paper has benefited me a lot. But when I use the pre-trained model you provided for downstream tasks with the default configuration file config_finetune.yaml, the performance of the model can never reach the one shown in the paper. So I would like to ask if you can provide the hyperparameter configuration files required for downstream tasks on each data set.
Dear author,
I recently read your iMolCLR paper, and I'd like to know whether you plan to publish the code for it.
Many thanks,
Lixin
When I run either the pretraining or fine-tuning files, I get this error.
Traceback (most recent call last):
  File "molclr.py", line 199, in <module>
    main()
  File "molclr.py", line 195, in main
    molclr.train()
  File "molclr.py", line 70, in train
    train_loader, valid_loader = self.dataset.get_data_loaders()
  File "C:\Users\user\Desktop\Skin Permeation\MolCLR\dataset\dataset.py", line 162, in get_data_loaders
    train_dataset = MoleculeDataset(data_path=self.data_path)
  File "C:\Users\user\anaconda3\envs\molclr\lib\typing.py", line 819, in __new__
    obj = super().__new__(cls)
TypeError: Can't instantiate abstract class MoleculeDataset with abstract methods get, len
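The error message says the class is missing `get` and `len`. A likely cause, stated here as an assumption rather than something confirmed in this thread, is that the installed PyG version declares `len()` and `get(idx)` as abstract methods on `torch_geometric.data.Dataset`, while the dataset class defines only `__len__` and `__getitem__`. The sketch below reproduces that mechanism with a stand-in base class built on `abc`; `BaseDataset`, `OldStyleDataset`, and `FixedDataset` are hypothetical names, not classes from the repository or from PyG.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Stand-in for a base class that requires len() and get()."""

    @abstractmethod
    def len(self) -> int: ...

    @abstractmethod
    def get(self, idx: int): ...

    def __len__(self) -> int:
        return self.len()

    def __getitem__(self, idx: int):
        return self.get(idx)


class OldStyleDataset(BaseDataset):
    """Defines only the dunder methods, so it stays abstract."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


class FixedDataset(OldStyleDataset):
    """Adds the two required methods on top of the existing ones."""

    def len(self):
        return len(self.items)

    def get(self, idx):
        return self.items[idx]


try:
    OldStyleDataset(["CCO"])
    old_style_error = None
except TypeError as exc:
    old_style_error = str(exc)  # names the abstract methods get and len

fixed = FixedDataset(["CCO", "c1ccccc1"])
```

If the assumption holds, either adding `len` and `get` methods to the dataset class or installing the PyG version the repository was written against should avoid the error.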
Thanks for your inspiring work. However, we've got problems when fine-tuning MolCLR on the SIDER dataset.
It's reported in the paper that MolCLR achieves 68.0 in terms of ROC_AUC under Scaffold split on SIDER, but we fail to reproduce the results. Here's what we've tried:
We've also tried several hyper-parameter combinations by changing dropout, hidden size and activation functions of MLP, which yields similar results.
We hope the authors could kindly share the experiment settings and hyper-parameters on SIDER that fully reproduce the promising results of MolCLR.
Hi,
when trying to load the pre-trained GIN model with torch.load, we ran into an error. According to our Google search, it is due to a corrupted file.
In Google Colab we get this error:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
It would be great if you could re-upload the file.
Thanks!
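Checkpoints written by recent versions of `torch.save` are zip archives, so a "failed finding central directory" error usually points to a file that is truncated or is not the real checkpoint (for example an incomplete download). The check below is a minimal sketch using only the standard library; `describe_checkpoint` is a hypothetical helper, and a checkpoint saved in the older non-zip format would also be reported as "not a zip archive" even though it may load fine.

```python
import os
import zipfile


def describe_checkpoint(path: str) -> str:
    """Return a rough diagnosis of a checkpoint file without loading it."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    if zipfile.is_zipfile(path):
        return "zip archive"
    return "not a zip archive"
```

Comparing the file size on disk with the size shown on the hosting page is another quick way to spot a partial download before asking for a re-upload.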
When I run MolCLR, an error seems to stem from the following lines in the forward function:
>>> l_pos = torch.diag(similarity_matrix, self.batch_size)
>>> r_pos = torch.diag(similarity_matrix, -self.batch_size)
>>> positives = torch.cat([l_pos, r_pos]).view(2 * self.batch_size, 1)
RuntimeError: shape '[1024, 1]' is invalid for input of size 0
The similarity matrix appears to be calculated correctly, but l_pos and r_pos are empty tensors when printed out. Would appreciate any guidance here.
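A possible explanation, assuming the similarity matrix is square with side 2B where B is the number of samples actually in the batch: the k-th diagonal of an n x n matrix has max(n - |k|, 0) elements, so taking the diagonal at offset `self.batch_size` yields fewer elements than expected, or none, whenever a batch holds fewer samples than the configured batch size. The shape `[1024, 1]` in the error suggests a configured batch size of 512. The sketch below only does the arithmetic; `diagonal_length` and `positives_count` are hypothetical helpers.

```python
def diagonal_length(n: int, offset: int) -> int:
    """Number of elements on the offset-th diagonal of an n x n matrix."""
    return max(n - abs(offset), 0)


def positives_count(actual_batch: int, configured_batch: int) -> int:
    """Elements in l_pos plus r_pos for a (2 * actual_batch)-square matrix."""
    n = 2 * actual_batch
    return diagonal_length(n, configured_batch) + diagonal_length(n, -configured_batch)


full = positives_count(512, 512)     # 1024, matches view(2 * 512, 1)
partial = positives_count(300, 512)  # 176, cannot be viewed as [1024, 1]
tiny = positives_count(200, 512)     # 0, the empty-tensor case
```

If this is the cause, lowering the batch size below the dataset size and passing `drop_last=True` to the `DataLoader` (so the final incomplete batch is skipped) are the usual remedies.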
Hi @yuyangw :)
I've found the valid and test indexes are shared in the case of the random split (especially for QM9).
MolCLR/dataset/dataset_test.py
Line 201 in dd87b5f
I guess it should be written this way:
valid_idx, test_idx, train_idx = indices[:split], indices[split:split+split2], indices[split+split2:]
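The slicing proposed in this post can be checked with a small standalone sketch. It assumes the split sizes are computed as fractions of the dataset length and uses a seeded shuffle; `random_split_indices` is a hypothetical function, not the repository's implementation.

```python
import random
from typing import List, Tuple


def random_split_indices(
    num_items: int, valid_size: float, test_size: float, seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices once and cut three non-overlapping slices."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    split = int(valid_size * num_items)
    split2 = int(test_size * num_items)
    valid_idx = indices[:split]
    test_idx = indices[split:split + split2]
    train_idx = indices[split + split2:]
    return train_idx, valid_idx, test_idx
```

Because each slice starts where the previous one ends, the validation and test sets cannot share indices, and fixing the seed makes the split reproducible across runs.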
Hi @yuyangw, thanks for sharing your nice work! :)
I've pre-trained the mix-aug GIN model from scratch and obtained fine-tuned results on the QM7 dataset.
In the provided YAML files, the maximum number of epochs is set to 100.
However, when I checked the TensorBoard logs, the training did not seem to have run long enough.
So I fine-tuned the models for 1k epochs on QM7 and got an MAE of 63.4±0.89 over 3 runs (cf. 87.2±2.0 in the paper).
Target: u0_atom Test loss: 0.2730964422225952 Test MAE: 62.36752 Best Valid metric: 80.150856 @ epoch: 400
Target: u0_atom Test loss: 0.2805113196372986 Test MAE: 64.06087 Best Valid metric: 80.38328 @ epoch: 385
Target: u0_atom Test loss: 0.2787664532661438 Test MAE: 63.66238 Best Valid metric: 82.99878 @ epoch: 295
(the model is picked at the epoch where the validation metric is lowest)
Have you tried fine-tuning the models for more epochs?
Here is the detailed configuration for the fine-tuned results (trained on a Tesla V100-SXM2-32GB):
batch_size: 4096
dataset:
  data_path: data/qm7/qm7.csv
  num_workers: 4
  splitting: scaffold
  target: u0_atom
  task: regression
  test_size: 0.1
  valid_size: 0.1
epochs: 1000
eval_every_n_epochs: 5
fine_tune_from: ckpt/pubchem_Apr25_12-12-28/checkpoints/model.pth
fp16_precision: false
gpu: cuda:0
init_base_lr: 5.0e-05
init_lr: 0.0005
log_every_n_steps: 50
model:
  drop_ratio: 0.3
  emb_dim: 300
  feat_dim: 512
  num_layer: 5
  pool: mean
  pred_act: softplus
model_type: gin
task_name: qm7
weight_decay: 1e-6
Thanks in advance! :)
Sincerely,
YKim
Hi, could you share the hyperparameters for the regression downstream tasks FreeSolv, ESOL, and Lipo? I cannot reach the performance reported in your paper.
thanks very much
Hello, I tried to reproduce the experiment in the "Investigation of MolCLR representation" part of your paper, but the code does not provide a way to generate the molecular representation vectors. For the encoder, I used the representation vectors produced by the pre-trained GCN, and the resulting cosine similarities were clearly wrong. How do you use MolCLR to generate molecular representations in this part of the experiment?
Hi,
I have got another question: Would you mind providing the seed for splitting the benchmarking datasets (or the splits themselves)?
Also, did you run all the benchmarks of the other models in your paper yourself, or did you gather the results from the corresponding papers (Table 1 and Table 2)?
Thanks!
Hi, thanks for your nice work!
Line 124 in fe603e0
Hi @yuyangw, thanks for sharing your nice work!
I have some questions about calculating ROC-AUC on the MUV dataset.
As you know, there are very few positive (label=1) samples in MUV.
So when I try to compute ROC-AUC following your scaffold split and metric evaluation, I run into the problem that only one class is present in y_true.
May I ask how you dealt with that problem?
Further question (simple):
For multi-label tasks, is it right that each individual run result in Table 1 and Table 2 is the average over all tasks?
And is the standard deviation then computed over the number of runs, not over the number of runs times the number of tasks? Is that correct?
Thanks!!!
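The aggregation described in this question can be sketched as follows. It illustrates the interpretation the poster asks about, not a confirmed description of how the paper's tables were computed; `summarize_runs` is a hypothetical helper, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def summarize_runs(per_run_task_scores: List[Sequence[float]]) -> Tuple[float, float]:
    """Average over tasks within each run, then take mean and std across runs."""
    run_means = [mean(task_scores) for task_scores in per_run_task_scores]
    return mean(run_means), stdev(run_means)
```

Under this scheme three runs on a 12-task dataset contribute three values to the standard deviation, one per run, rather than 36.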
class MolTestDataset(Dataset):
    def __init__(self, data_path, target, task):
        super(Dataset, self).__init__()  # super(MolTestDataset, self).__init__()