
lstm_chem's Introduction

LSTM_Chem

This is the implementation of the paper - Generative Recurrent Networks for De Novo Drug Design

Changelog

2021-08-09

  • Now supports tensorflow >= 2.5.0

2020-03-25

  • Changed the code to use tensorflow 2.1.0 (tf.keras)

2019-12-23

  • Reimplemented all code to use tensorflow 2.0.0 (tf.keras)
  • Changed data_loader to use generator to reduce memory usage
  • Removed some unused atoms and symbols
  • Changed directory layout

Requirements

This model is built with Python 3.7. See Pipfile or requirements.txt for dependencies. I strongly recommend the GPU version of TensorFlow: training this model on all the data is very slow in CPU mode (about 9 hrs/epoch). RDKit and matplotlib are used for SMILES cleanup, validation, and visualization of molecules and their properties. RDKit can now be installed with pip, so you don't have to use Anaconda. Scikit-learn is used for PCA.
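
Either of the following standard commands should install the dependencies (the exact pinned versions live in those files):

$ pip install -r requirements.txt
$ pipenv install    (alternative, using the Pipfile)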

Usage

Training

Just run the command below. Note that the default settings use all of the data, so training will take a long time. If you don't have that much time, set data_length to a smaller value in base_config.json.

$ python train.py
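
For example, to train on only the first 10,000 SMILES you might edit base_config.json like this (an illustrative excerpt; the values shown are assumptions, and the full key list is in the Configuration table below):

{
  "exp_name": "LSTM_Chem",
  "data_filename": "datasets/dataset_cleansed.smi",
  "data_length": 10000
}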

After training, experiments/{exp_name}/{YYYY-mm-dd}/config.json is generated. It is a copy of base_config.json with additional settings for internal variables. Since it is used for generation, be careful when editing it.

Generation

See example_Randomly_generate_SMILES.ipynb.
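
In short, generation loads the trained weights and samples SMILES token by token. A minimal sketch in the spirit of that notebook (the LSTMChemGenerator and process_config names are assumptions based on the directory layout; the notebook shows the actual API):

from lstm_chem.model import LSTMChem
from lstm_chem.generator import LSTMChemGenerator        # assumed module
from lstm_chem.utils.config import process_config        # assumed helper

# Load the config.json written during training (path is an example).
config = process_config('experiments/LSTM_Chem/2019-12-23/config.json')
modeler = LSTMChem(config, session='generate')
generator = LSTMChemGenerator(modeler)

sampled_smiles = generator.sample(num=100)   # 100 SMILES at sampling_temp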

Fine-tuning

See example_Fine-tuning_for_TRPM8.ipynb.

Detail

Configuration

See base_config.json. If you want to change any settings, edit this file before training.

| parameter | meaning |
| --- | --- |
| exp_name | experiment name (default: LSTM_Chem) |
| data_filename | file path of the training data (SMILES file, newline-delimited) |
| data_length | number of SMILES used for training; 0 means use all the data (default: 0) |
| units | size of the hidden state vector of the two LSTM layers (default: 256, see the paper) |
| num_epochs | number of epochs (default: 22, see the paper) |
| optimizer | optimizer (default: adam) |
| seed | random seed (default: 71) |
| batch_size | batch size (default: 256) |
| validation_split | split ratio for validation (default: 0.10) |
| verbose_training | verbosity mode during training (default: True) |
| checkpoint_monitor | quantity to monitor (default: val_loss) |
| checkpoint_mode | one of {auto, min, max} (default: min) |
| checkpoint_save_best_only | if True, only the latest best model according to the monitored quantity is kept (default: False) |
| checkpoint_save_weights_only | if True, only the model's weights are saved (default: True) |
| checkpoint_verbose | verbosity mode of ModelCheckpoint (default: 1) |
| tensorboard_write_graph | whether to visualize the graph in TensorBoard (default: True) |
| sampling_temp | sampling temperature (default: 0.75, see the paper) |
| smiles_max_length | maximum length of generated SMILES in tokens (default: 128) |
| finetune_epochs | number of epochs for fine-tuning (default: 12, see the paper) |
| finetune_batch_size | batch size for fine-tuning (default: 1) |
| finetune_filename | file path of the fine-tuning data (SMILES file, newline-delimited) |
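
The sampling_temp parameter controls how peaked the model's next-token distribution is before sampling: values below 1 make generation more conservative, values above 1 more diverse. A self-contained illustration of temperature sampling (not the repository's code):

import numpy as np

def sample_with_temperature(probs, temperature=0.75):
    # Rescale in log space: temperature < 1 sharpens the distribution,
    # temperature > 1 flattens it toward uniform.
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())  # subtract max for stability
    scaled /= scaled.sum()
    return np.random.choice(len(scaled), p=scaled)

# e.g. a next-token distribution over a 4-symbol vocabulary
next_token = sample_with_temperature([0.1, 0.6, 0.2, 0.1], temperature=0.75)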

Preparing Dataset

Get database from ChEMBL

Download the SQLite dump of ChEMBL25 (ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_25_sqlite.tar.gz), which is 3.3 GB compressed and 16 GB uncompressed.
Unpack it the usual way, cd into the directory, and open the database with the sqlite3 console.

Extract SMILES for training

$ sqlite3 chembl_25.db
SQLite version 3.30.1 2019-10-10 20:19:45
Enter ".help" for usage hints.
sqlite> .output dataset.smi

The following SQL query extracts the SMILES annotated with nM activities.

SELECT
  DISTINCT canonical_smiles
FROM
  compound_structures
WHERE
  molregno IN (
    SELECT
      DISTINCT molregno
    FROM
      activities
    WHERE
      standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")
      AND standard_units = "nM"
      AND standard_value < 1000
      AND standard_relation IN ("<", "<<", "<=", "=")
    INTERSECT
    SELECT
      molregno
    FROM
      molecule_dictionary
    WHERE
      molecule_type = "Small molecule"
  );

This gives 556,134 SMILES in dataset.smi. According to the paper, the dataset was preprocessed: duplicates, salts, and stereochemical information were removed, and only SMILES strings of 34 to 74 tokens were kept. So I wrote a SMILES cleanup script. Run the following to get the cleansed SMILES; it takes about 10 minutes or more.

$ python cleanup_smiles.py datasets/dataset.smi datasets/dataset_cleansed.smi

This yields 438,552 SMILES. This dataset is used for training.
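
For reference, the core of what cleanup_smiles.py does can be sketched with RDKit as follows (a minimal sketch, not the script verbatim; the token-length filter is approximated here by string length):

from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

remover = SaltRemover()

def cleanse(smiles, min_len=34, max_len=74):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                        # drop unparsable SMILES
    mol = remover.StripMol(mol)            # remove common salts/counterions
    if mol.GetNumAtoms() == 0:
        return None
    smi = Chem.MolToSmiles(mol, isomericSmiles=False)  # drop stereochemistry
    if not (min_len <= len(smi) <= max_len):
        return None                        # length filter from the paper
    return smi

Deduplication is then done across the whole file, e.g. with a set.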

SMILES for fine-tuning

The paper shows 5 TRPM8 antagonists for fine-tuning.

FC(F)(F)c1ccccc1-c1cc(C(F)(F)F)c2[nH]c(C3=NOC4(CCCCC4)C3)nc2c1
O=C(Nc1ccc(OC(F)(F)F)cc1)N1CCC2(CC1)CC(O)c1cccc(Cl)c1O2
O=C(O)c1ccc(S(=O)(=O)N(Cc2ccc(C(F)(F)C3CC3)c(F)c2)c2ncc3ccccc3c2C2CC2)cc1
Cc1cccc(COc2ccccc2C(=O)N(CCCN)Cc2cccs2)c1
CC(c1ccc(F)cc1F)N(Cc1cccc(C(=O)O)c1)C(=O)c1cc2ccccc2cn1

You can see this in datasets/TRPM8_inhibitors_for_fine-tune.smi.

Extract known TRPM8 inhibitors from ChEMBL25

Open the database using sqlite console.

$ sqlite3 chembl_25.db
SQLite version 3.30.1 2019-10-10 20:19:45
Enter ".help" for usage hints.
sqlite> .output known-TRPM8-inhibitors.smi

Then issue the following SQL query. I set the maximum IC50 activity to 10 uM.

SELECT
  DISTINCT canonical_smiles
FROM
  activities,
  compound_structures
WHERE
  assay_id IN (
    SELECT
      assay_id
    FROM
      assays
    WHERE
      tid IN (
        SELECT
          tid
        FROM
          target_dictionary
        WHERE
          pref_name = "Transient receptor potential cation channel subfamily M member 8"
      )
  )
  AND standard_type = "IC50"
  AND standard_units = "nM"
  AND standard_value < 10000
  AND standard_relation IN ("<", "<<", "<=", "=")
  AND activities.molregno = compound_structures.molregno;

You get 494 known TRPM8 inhibitors. As described above, clean up the TRPM8 inhibitor SMILES; use the -ft option to skip the token-length restriction.

$ python cleanup_smiles.py -ft datasets/known-TRPM8-inhibitors.smi datasets/known_TRPM8-inhibitors_cleansed.smi

You get 477 SMILES. I used these only to visualize the results of fine-tuning.

lstm_chem's People

Contributors

dependabot[bot], mjuchem, topazape


lstm_chem's Issues

RuntimeError: dictionary changed size during iteration

Hi,
Thanks for your great work.
At the end of epoch 2, before saving the model on the last iteration, I got this error:
Exception in thread Thread-2:
Traceback (most recent call last):
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\site-packages\tensorflow_core\python\keras\utils\data_utils.py", line 748, in _run
with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\site-packages\tensorflow_core\python\keras\utils\data_utils.py", line 727, in pool_fn
initargs=(seqs, None, get_worker_id_queue()))
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 176, in init
self._repopulate_pool()
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 241, in _repopulate_pool
w.start()
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
RuntimeError: dictionary changed size during iteration

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\Radical\Anaconda3\envs\myenv\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Would you explain why this is happening, or fix it? Thank you.

Issue while importing functions

I have run train.py and am trying to run the example_Randomly_generate_SMILES notebook in Jupyter, but the kernel keeps dying when I import LSTMChem from lstm_chem.model.
Is there any way to resolve this?

Is 'fragment growing' implemented in the code?

Fantastic implementation of the paper. However, the paper describes one more fine-tuning method called 'fragment growing': given one fragment as a SMILES string, it generates SMILES around that fragment. Is there any direction you can point me to?

different result about dataset query

I tried to reproduce the dataset with the same database the paper's authors said they used (ChEMBL22), referring to your query, but the results are different.

I tried below.

SELECT
  DISTINCT canonical_smiles
FROM
  compound_structures
WHERE
  molregno IN (
    SELECT
      DISTINCT molregno
    FROM
      activities
    WHERE
      standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")
      AND standard_units = "nM"
  );

The result is 802,320 rows.

The authors said they used a "dataset of 677,044 SMILES strings with annotated nanomolar activities (Kd/i/B, IC/EC50) from ChEMBL22".

So I used ChEMBL22, added [standard_units = "nM"] for "nanomolar",
and [standard_type IN ("Kd", "Ki", "Kb", "IC50", "EC50")] for "activities (Kd/i/B, IC/EC50)".

What did I miss?

Where to get the fine-tuning corpus?

In your project I can't find a file for fine-tuning. How can I get one? If you have one, could you share the file or tell me how to obtain it? Thanks.

Could you publish the saved model for transfer learning?

Hi,
Really loved your work. I went through your code and the paper; the code is well written and documented.
I tried running the model and yes, it takes a lot of time on a CPU, especially on a laptop's mobile chip. Unfortunately I don't have the luxury of a GPU, for two reasons:

  1. The labs in our department are closed due to quarantine for another 1.5 months.
  2. I don't have a pre-existing GPU to run CUDA.

If you could make your trained model public (i.e., the model created by model.save()), it would be a great help. I could then alter your code and use transfer learning for my use case.

One solution I found was to use 100K data points to train the initial network (LSTM cells + RNN); however, that is also surprisingly time-consuming on the CPU, and the results will suffer.

Awaiting your reply,
Yours sincerely

AttributeError: 'NoneType' object has no attribute 'shape'

Hi topazape,
Thank you for developing such excellent code.
I have some questions and hope you can help.
When I run the command python train.py, an AttributeError occurs.
I noticed that gwanseum had the same problem, so I borrow his screenshot here:
[screenshot of the AttributeError traceback]
I didn't see your discussion with gwanseum.

Thanks in advance for your help,

Sincerely

Batch size in fine-tuning is irrelevant

I've noticed that although a batch size is exposed for fine-tuning, it must always be 1 for it to function properly. Setting it larger than 1 triggers an error, because self.max_len = 0 and no padding takes place. I don't know how training would be affected by using max_len versus not using max_len with batch_size = 1.

GPU for Tensorflow

Hello,
I was trying to run the fine-tuning notebook with a different dataset for my model and it is taking too long. How can I use the GPU version of TensorFlow to run the code?
It would be great if you could help. Thank you!

About GPU memory usage?

Thanks to the author for the code.

I'd like to ask you some questions:
1. What are the memory capacity and GPU model of your machine?
2. What is the GPU memory footprint while training?

Looking forward to your reply. I want to know whether my GTX 1060 GPU can train this model.

Heartfelt thanks.
