Giter Club home page Giter Club logo

transformer-cnn's Introduction

Transformer-CNN

The repository contains the source code for a new Transformer-CNN method described in our paper http://arxiv.org/abs/1911.06603. First, we trained the Transformer model on SMILES canonicalization task, e.g., given an arbitrary SMILES, the model converts it to a canonical one. Second, we use the internal representation of the Transformer (the output of the encoding stack with shape (BATCH, LENGTH, EMBEDDING)) as SMILES embeddings and build upon them CharNN model (Convolution and HighWay as it is done in DeepChem). The resulting model works both in classification and regression settings.

The "standalone" folder provides the implementation of the Transformer-CNN model for prognosis without TensorFlow (depends on only NumPy and RDKit). The solubility and AMES models are available. Layerwise Relevance Propagation method is used to infer the model's reasoning behind a particular prediction.

Dependencies

The code has been tested in Ubuntu 18.04 with the following components:

  1. python v.3.4.6 or higher
  2. TensorFlow v1.12
  3. rdkit v.2018.09.2
  4. molvs for tautomer enumeration

How to use

The main program, transformer-cnn.py, uses the config.cfg file to read all the parameters of a task to do. After filling the config.cfg with the appropriate information, launch the python3 transformer-cnn.py config.cfg

How to train a model

To train a model, one needs to create a config file like this.

[Task]
   train_mode = True
   model_file = model.tar
   train_data_file = train.csv
[Details]
   canonize = True
   gpu = 0
   seed = 100666
   n_epochs = 30
   batch_size = 16

If the canonize parameter is set, then all the SMILES will be worked up with RDKit. Then 10 non-canonical SMILES for each molecule will be generated (the real number of generated strings can be smaller depending on the compound). If this parameter is set to False, then the string is passed to the model as is without any treatment. The same is also valid for the prognosis step.

Using the trained model

To use a model, the config file looks like:

[Task]
   train_mode = False
   model_file = model.tar
   apply_data_file = predict.csv
[Details]
   canonize = True
   gpu = 0
   seed = 100666
   n_epochs = 30
   batch_size = 16

Using the standalone prognosis

The "standalone" folder contains scripts and models for execution without TensorFlow. Solubility regression and AMES classification models are available. To run a prognosis for a single molecule (haloperidol here as an example) execute:

python3 ochem.py models/solubility.pickle "O=C(CCCN1CCC(c2ccc(Cl)cc2)(O)CC1)c1ccc(F)cc1"

In this case, the program produces 26 random SMILES (number of non-hydrogens atoms in the molecule) starting from each atom in the original SMILES. For each SMILES a target property is predicted as well as influences of a particular atom to the overall property. The output contains:

  1. the estimated property with a confidence interval.
  2. the file map.txt contains a gnuplot script to visualize the individual atoms' contributions.
  3. the file mol.svg contains a drawing of the molecule with atoms' contributions.

For haloperidol molecule the outpus should be:

Solubility = 0.014 ± 0.002 g/L (experimental value 14 mg/L).

And visualization:

LRP graph

The green color contributes positively to the property. The higher the bar the more the impact of the corresponding atom. The red color works in the opposite direction.

Feel free to contact us if you have any suggestions or possible applications of this code.

transformer-cnn's People

Contributors

carpovpv avatar linlinzhao avatar nikhilbhardwajdev avatar bigchem avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.