wenhao-gao / SynNet
License: MIT License
I ran the code with my own data (more than 2000 SMILES), and then the message below was printed. Can you tell me why this error occurs? I can't tell which list object triggers it.
list index out of range
When running compute_embedding.py I get this error:
Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Total data: 172988
0%| | 0/172988 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/ec2-user/SynNet/scripts/compute_embedding.py", line 143, in <module>
embeddings.append(model(smi))
File "/home/ec2-user/miniconda3/envs/rdkit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'categorical_node_feats' and 'categorical_edge_feats'
When trying to run compute_embedding_mp.py I get the following error:
Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Total data: 172988
Traceback (most recent call last):
File "/home/ec2-user/SynNet/scripts/compute_embedding_mp.py", line 29, in <module>
embeddings = pool.map(gin_embedding, data)
NameError: name 'gin_embedding' is not defined
I think this can be resolved by changing gin_embedding to model but that then results in the above error.
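For what it's worth, the two tracebacks seem to share one root cause: the pretrained GIN's forward() takes a graph plus categorical node/edge features, so neither the raw model nor an undefined gin_embedding can be mapped over SMILES strings directly. Below is a minimal sketch of the missing module-level wrapper; the stand-in model and featurize() helper are hypothetical placeholders for dgllife's graph construction (a thread pool is used so the sketch stays self-contained):

```python
from multiprocessing.dummy import Pool  # thread-based pool, enough for a demo

# Stand-in for the pretrained GIN: forward takes a graph plus categorical
# node/edge features (argument names taken from the traceback), not a SMILES.
class FakeGIN:
    def __call__(self, g, categorical_node_feats, categorical_edge_feats):
        return len(categorical_node_feats)  # placeholder "embedding"

model = FakeGIN()

def featurize(smi):
    # Hypothetical featurizer standing in for dgllife's SMILES-to-graph step.
    return smi, ["atom"] * len(smi), ["bond"]

def gin_embedding(smi):
    # The missing module-level helper: featurize first, then call the model
    # with the signature its forward() actually declares.
    g, node_feats, edge_feats = featurize(smi)
    return model(g, node_feats, edge_feats)

data = ["CCO", "c1ccccc1", "CCN"]
with Pool(2) as pool:
    embeddings = pool.map(gin_embedding, data)
print(embeddings)  # [3, 8, 3]
```

Defining the wrapper at module level also matters for the real compute_embedding_mp.py: a process pool can only map over functions it can pickle by name.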
It seems like the argument _tree is missing for the nn_search function.
Missing argument here:
SynNet/syn_net/utils/predict_utils.py
Line 871 in 56917a6
And function signature here:
SynNet/syn_net/utils/predict_utils.py
Line 290 in 56917a6
As far as I can see, there is no _tree in the local scope. This will throw an error, but it is caught in scripts/_mp_predict_multireactant.py with a try ... except clause.
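The failure mode can be sketched in isolation (signatures here are illustrative, not the repo's actual code): calling a function that declares _tree without passing it raises a TypeError, which the surrounding try/except then swallows instead of surfacing.

```python
# Illustrative signatures only: nn_search requires _tree, the call site omits it.
def nn_search(emb, _tree, _k=1):
    return _tree.query(emb, k=_k)

def predict_one(emb):
    # Mirrors the try/except in _mp_predict_multireactant.py: the missing
    # argument raises TypeError, which is silently caught here.
    try:
        return nn_search(emb)  # _tree is not in the local scope
    except TypeError:
        return None

result = predict_one([0.25])
print(result)  # None: the error is swallowed rather than surfaced
```

This is why the bug is easy to miss: predictions just come back empty rather than crashing.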
I ran optimize_ga.py for my molecule optimization, but I got an error because there is no mol_fp function in _mp_decode.py.
Traceback (most recent call last):
File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 207, in <module>
[decode.mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 207, in <listcomp>
[decode.mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
AttributeError: module 'scripts._mp_decode' has no attribute 'mol_fp'
So I changed the call to use the mol_fp function from predict_utils.py:
from syn_net.utils.predict_utils import mol_fp
population = np.array(
[mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
)
Then I got the following error:
Traceback (most recent call last):
File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 210, in <module>
population = population.reshape((population.shape[0], population.shape[2]))
IndexError: tuple index out of range
Can you help me with this error?
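If it helps diagnose: the reshape in optimize_ga.py assumes a 3-D array (each fingerprint carrying an extra singleton axis), while mol_fp from predict_utils presumably returns a 1-D vector, so the stacked array is 2-D and population.shape[2] does not exist. A sketch with a hypothetical 1-D fingerprint function:

```python
import numpy as np

def mol_fp_1d(smi, radius=2, nbits=4096):
    # Hypothetical stand-in returning a 1-D fingerprint of length nbits.
    rng = np.random.default_rng(len(smi))
    return (rng.random(nbits) > 0.5).astype(np.float32)

starting_smiles = ["CCO", "c1ccccc1"]
population = np.array([mol_fp_1d(smi) for smi in starting_smiles])
print(population.shape)  # (2, 4096): 2-D, so shape[2] raises IndexError

# A shape-agnostic alternative to reshape((shape[0], shape[2])):
population = population.reshape(population.shape[0], -1)
print(population.shape)  # (2, 4096) either way
```

With reshape(shape[0], -1), the same line works whether the fingerprints come back as (n, nbits) or (n, 1, nbits).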
Hi,
I'd like to use SynNet in my work. I have followed the instructions in the README to set up my environment.
In the environment.yml file the environment name is rdkit, not synthenv. As a result, source activate synthenv as instructed in the README does not work. You may want to take a look at this.
When I ran the unit tests, I got a few errors, which I think originate from incorrect path specifications. One of the errors: FileNotFoundError: [Errno 2] No such file or directory: '/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_gin.npy'
I noticed that there are multiple hardcoded paths like this, which might make it difficult to use the code in future computations without having to change each and every one of them.
Will you be able to help me with these? Thanks!
I'm trying to test that everything is working in my setup by running
python optimize_ga.py --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 48
It seems to run forever with the following output
Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Starting with 128 fps with 4096 bits
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
...
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
Initial: 0.000 +/- 0.000
Scores: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0.]
Top-3 Smiles: [None, None, None]
How long should this run and is this output normal?
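One observation, hedged: 12288 = 3 × 4096, so the checkpoint's first linear layer appears to expect three concatenated 4096-bit fingerprints, while the script is feeding 12292 features (4 extra) into it. A NumPy sketch of the incompatibility:

```python
import numpy as np

nbits = 4096
weight = np.zeros((3 * nbits, 1200))  # mat2 from the error: 12288 x 1200

good = np.zeros((1, 3 * nbits))       # three concatenated 4096-bit fps
print((good @ weight).shape)          # (1, 1200): inner dimensions match

bad = np.zeros((1, 3 * nbits + 4))    # mat1 from the error: 1 x 12292
try:
    bad @ weight
except ValueError:
    print("1x12292 cannot multiply 12288x1200")
```

Since every offspring fails the same way, the scores all stay 0.0 and the Top-3 SMILES remain None, so this output is a symptom of the shape bug rather than normal behavior.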
hello wenhao & rocio,
the unit tests are great and give a good overview of how the different modules should be run. however, I saw that in these lines, the path to the building block embeddings is hardcoded to a path on the HPC cluster.
Lines 78 to 89 in 56917a6
so, I am unable to make pytest pass, specifically:
FAILED tests/test_Training.py::TestTraining::test_reactant1_network - UnboundLocalError: local variable 'kdtree' referenced before assignment
FAILED tests/test_Training.py::TestTraining::test_reactant2_network - UnboundLocalError: local variable 'kdtree' referenced before assignment
at least for the unittest, what should the correct path be? and would it be possible to make these paths user-passable arguments?
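In the meantime, a minimal sketch of what a user-passable path could look like (the flag name and default value are assumptions, not the repo's actual interface):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--embeddings-file",
    default="data/enamine_us_emb_gin.npy",
    help="path to the building-block embedding matrix (.npy)",
)
# Passing [] here just exercises the default; in the tests this would come
# from sys.argv or a pytest fixture instead of the hardcoded cluster path.
args = parser.parse_args([])
print(args.embeddings_file)  # data/enamine_us_emb_gin.npy
```

The same flag could then be threaded through to the KDTree construction so the tests no longer trip the UnboundLocalError.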
I've recently been interested in running SynNet with the most recent version of the US Stock Enamine BBs. I ran steps 0-2 to preprocess the data and wanted to try reward-guided molecule generation using the GA per the instructions in the README. However, I notice that even with the initial randomly generated fingerprints, 70-80 of the initial 100 are decoded to the same SMILES string:
CC(C)(C)OC(=O)N1CC2NCCN(S(=O)(=O)CC(=O)c3ccccc3)C2C1
This causes the GA population update to hang forever, as insufficient unique new molecules are found to add to the pool and increment parent_idx to num_population in each step of the algorithm.
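To make the hang concrete, here is a sketch of the update loop as I understand it (the function and names are mine, not the repo's): the pool only grows on unique SMILES, so heavy decode collapse starves it.

```python
def grow_population(decoded_smiles, num_population):
    # Only unique, non-empty decodes are admitted to the next generation.
    seen, population = set(), []
    for smi in decoded_smiles:
        if smi is not None and smi not in seen:
            seen.add(smi)
            population.append(smi)
        if len(population) == num_population:
            break
    return population

# 100 offspring but 80 decode to the same string: the pool stalls well
# short of num_population, so the real loop keeps resampling forever.
decoded = ["CC(C)(C)OC(=O)N1CC2NCCN(S(=O)(=O)CC(=O)c3ccccc3)C2C1"] * 80 \
          + [f"C{'C' * i}O" for i in range(20)]
print(len(grow_population(decoded, 128)))  # 21 unique < 128 needed
```

A timeout or a cap on resampling attempts would at least turn the silent hang into an explicit failure.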
Could this be the result of the difference in the Enamine stock between the time of publication and now? Any help is appreciated!
Thank you,
Andrei
hello wenhao & rocio,
I see that we have to provide path/to/zinc.csv to run the genetic algorithm (to replicate how it was done in the paper):
https://github.com/wenhao-gao/SynNet#synthesizable-molecular-design-1
optimize_ga.py -i path/to/zinc.csv --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 32 --objective gsk
is it possible to provide the exact zinc.csv that was used in the publication? the paper only says:
Seeds are randomly sampled from the ZINC database (Sterling & Irwin, 2015)
First off great work!
The unit tests reference files that are ignored in .gitignore:
'./data/states_0_train.npz'
'./data/st_hb_test.json.gz'
'./data/building_blocks_matched.csv.gz'
can we add these to the repo so the unit tests can be run?
hello (again),
sorry that I am raising multiple issues. just want to make it easier for everyone else to start using this awesome work.
i didn't find a note about how one could compute molecular fingerprints / GNN embeddings for a dataset. only after some CTRL+F did i find that scripts/compute_embedding.py does it:
https://github.com/wenhao-gao/SynNet/blob/master/scripts/compute_embedding.py
so, it would be a good idea to add this to the README. I believe we need to do this step before running any inference.
Is the environment.yml the same as "env/synthenv.yml"?
I have the trained models and data ready, and followed steps 0-2 from INSTRUCTIONS.md.
python /data/users/xx/projects/SynNet/src/00-extract-smiles-from-sdf.py \
--input-file="/data/users/xx/data/enamine_us/Enamine_Rush-Delivery_Building_Blocks-US_222337cmpd_20230801.sdf" \
--output-file="/data/users/xx/projects/SynNet/data/assets/building-blocks/enamine-us-smiles.csv.gz"
python /data/users/xx/projects/SynNet/src/01-filter-building-blocks.py \
--building-blocks-file "/data/users/xx/projects/SynNet/data/assets/building-blocks/enamine-us-smiles.csv.gz" \
--rxn-templates-file "/data/users/xx/projects/SynNet/data/assets/reaction-templates/hb.txt" \
--output-bblock-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--output-rxns-collection-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" --verbose
python /data/users/xx/projects/SynNet/src/02-compute-embeddings.py \
--building-blocks-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--output-file "/data/users/xx/projects/SynNet/data/pre-process/embeddings/hb-enamine-embeddings.npy" \
--featurization-fct "fp_256"
but after running the synthesis planning script
BUILDING_BLOCKS_FILE=/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz
RXN_COLLECTION_FILE=/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz
EMBEDDINGS_KNN_FILE=/data/users/xx/projects/SynNet/data/pre-process/embeddings/hb-enamine-embeddings.npy
python /data/users/xx/projects/SynNet/src/20-predict-targets.py \
--building-blocks-file $BUILDING_BLOCKS_FILE \
--rxns-collection-file $RXN_COLLECTION_FILE \
--embeddings-knn-file $EMBEDDINGS_KNN_FILE \
--data "/data/users/xx/projects/SynNet/data/assets/molecules/sample-targets.txt" \
--ckpt-dir "/data/users/xx/projects/SynNet/checkpoints/" \
--output-dir "/data/users/xx/projects/SynNet/results/demo-inference/"
I get the following result:
targets,decoded,similarity
COc1cc(Cn2c(C)c(Cc3ccccc3)c3c2CCCC3)ccc1OCC(=O)N(C)C,,0
CCC1CCCC(Nc2cc(C(F)(F)F)c(Cl)cc2SC)CC1,,0
Clc1cc(Cl)c(C2=NC(c3cccc4c(Br)cccc34)=NN2)nn1,,0
COc1ccc(S(=O)(=O)c2ccc(-c3nc(-c4cc(B(O)O)ccc4O)no3)cn2)cc1,,0
CNS(=O)(=O)c1ccc(-c2cc3c4c(ccc3[nH]2)CCCN4C(N)=O)cc1,,0
CC(NC(=O)C1Cn2c(O)nnc2CN1)c1cc(F)ccc1N1CCC(n2nnn(-c3ccc(Br)cc3)c2=S)CC1,,0
COc1cc(-c2nc(-c3ccccc3)c(-c3ccccc3)s2)ccn1,,0
CCCn1c(C)nnc1CC(C)(O)C(=C(C)C)c1nccnc1S(=O)(=O)F,,0
CN(c1ccccc1)c1ccc(-c2nc3ncccc3s2)cn1,,0
COc1cc(-c2nc(-c3ccc(F)cc3)c(-c3ccc(F)cc3)n2c2cc(Cl)ccc2Cl)ccc1Oc1ccc(S(=O)(=O)N2CCCCC2)cc1[N+](=O)[O-],,0
It seems nothing is being decoded: the decoded column is empty for every target.
Can you help me with this? Thank you!
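For anyone triaging this, a quick way to quantify the failures from the results file (the sample rows below are copied from the output above):

```python
import csv
import io

results = io.StringIO(
    "targets,decoded,similarity\n"
    "COc1cc(Cn2c(C)c(Cc3ccccc3)c3c2CCCC3)ccc1OCC(=O)N(C)C,,0\n"
    "CCC1CCCC(Nc2cc(C(F)(F)F)c(Cl)cc2SC)CC1,,0\n"
)
rows = list(csv.DictReader(results))
# An empty `decoded` field means the target could not be reconstructed.
failed = sum(1 for row in rows if not row["decoded"])
print(f"{failed}/{len(rows)} targets failed to decode")  # 2/2 targets failed
```

A 100% failure rate usually points at a mismatch between the checkpoints and the freshly computed embeddings (e.g. fp_256 embeddings paired with models trained on a different featurization) rather than at the targets themselves, though that is a guess.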
When I try to run optimize_ga.py I am getting
Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Starting with 128 fps with 4096 bits
Traceback (most recent call last):
File "/home/ec2-user/SynNet/scripts/optimize_ga.py", line 205, in <module>
scores, mols, trees = fitness(embs=population,
TypeError: fitness() got an unexpected keyword argument 'pool'
This is the command I am using
python optimize_ga.py --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 48 --objective logp