The Photoswitch Dataset

This repository provides benchmarked property prediction results on a curated dataset of 405 photoswitch molecules.

Installation

We recommend using a conda virtual environment. Create and activate it before installing the dependencies:

conda create -n photoswitch python=3.7

conda activate photoswitch

conda install -c conda-forge rdkit

conda install umap-learn seaborn xlrd matplotlib ipython pytest pytorch scikit-learn pandas

pip install git+https://github.com/GPflow/GPflow.git@develop#egg=gpflow

pip install dgl dgllife jupyter

git clone https://github.com/hyperopt/hyperopt-sklearn.git
cd hyperopt-sklearn
pip install -e .

Property Prediction

To reproduce the property prediction results, run a model prediction script and use the task flag to select the task (e_iso_pi, e_iso_n, z_iso_pi or z_iso_n). Each task corresponds to a different electronic transition wavelength, e.g.

python predict_with_GPR.py -task e_iso_pi

Metric    GPR-Tanimoto + Fragprints
RMSE      20.9 nm
MAE       13.3 nm
R^2       0.90
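
The three metrics reported above can be computed from paired measured and predicted wavelengths. The following is a minimal sketch in plain Python, not code from this repository: the function name `regression_metrics` and the example wavelengths are assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Total sum of squares around the mean of the measured values
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Hypothetical wavelengths in nm (illustrative only, not taken from the dataset)
measured = [320.0, 350.0, 400.0, 450.0]
predicted = [330.0, 340.0, 410.0, 440.0]

rmse, mae, r2 = regression_metrics(measured, predicted)
print(f"RMSE {rmse:.1f} nm, MAE {mae:.1f} nm, R^2 {r2:.2f}")
```

RMSE and MAE carry the units of the predicted property (nm here), whereas R^2 is dimensionless. R^2 falls below zero when a model performs worse than always predicting the mean of the measured values.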

Prediction Error as a Guide to Representation Selection

Prediction errors under different model/representation combinations may be analyzed in the context of the property being predicted.

The following boxplot shows the performance of each molecular representation, aggregated across the Random Forest, Gaussian Process, Multioutput Gaussian Process and Attentive Neural Process models.

TD-DFT Comparison

For the comparison with Time-Dependent Density Functional Theory (TD-DFT), run the DFT comparison script and use the theory level flag to specify the level of theory to compare against (CAM-B3LYP or PBE0), e.g.

python dft_comparison_with_GPR.py -theory_level CAM-B3LYP

Metric    GPR-Tanimoto + Fragprints    CAM-B3LYP TD-DFT
MAE       14.9 nm                      16.5 nm

python dft_comparison_with_GPR.py -theory_level PBE0

Metric    GPR-Tanimoto + Fragprints    PBE0 TD-DFT
MAE       15.2 nm                      26.0 nm

Human Performance Comparison

To reproduce the model prediction errors for the human performance comparison, run the following script:

python human_performance_comparison.py

Generalization Error

To reproduce the generalization error results, run the following scripts:

python generalization_error.py -augment_photo_dataset False

Metric    RF + Fragprints
RMSE      85.2 nm
MAE       72.5 nm
R^2       -0.66

python generalization_error.py -augment_photo_dataset True

Metric    RF + Fragprints
RMSE      36.9 nm
MAE       22.7 nm
R^2       0.67

Data Visualization

To obtain an (unannotated) visualization of the Photoswitch Dataset, run:

python visualization.py

What We Provide

The dataset includes molecular properties for 405 photoswitch molecules. All molecular structures are denoted according to the simplified molecular input line entry system (SMILES). We collate the following properties for the molecules:

Rate of Thermal Isomerization (units = s^-1): This is a measure of the thermal stability of the least stable isomer (Z isomer for non-cyclic azophotoswitches and E isomer for cyclic azophotoswitches). Measurements are carried out in solution with the compounds dissolved in the stated solvents.

Photostationary State (units = % of stated isomer): Upon continuous irradiation of an azophotoswitch, a steady-state distribution of the E and Z isomers is achieved. Measurements are carried out in solution with the compounds dissolved in the ‘irradiation solvents’.

pi-pi-star/n-pi-star wavelength (units = nanometers): The wavelength at which the pi-pi*/n-pi* electronic transition has a maximum for the stated isomer. Measurements are carried out in solution with the compounds dissolved in the ‘irradiation solvents’.

DFT-computed pi-pi-star/n-pi-star wavelengths (units = nanometers): DFT-computed wavelengths at which the pi-pi*/n-pi* electronic transition has a maximum for the stated isomer.

Extinction coefficient: The molar extinction coefficient.

Wiberg Index: A measure of the bond order of the N=N bond in an azophotoswitch. Bond order is a measure of the ‘strength’ of said chemical bond. This value is computed theoretically.

Irradiation wavelength: The specific wavelength of light used to irradiate samples from E to Z or from Z to E such that a photostationary state is obtained. Measurements are carried out in solution with the compounds dissolved in the ‘irradiation solvents’.
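
The rate of thermal isomerization listed above is often easier to interpret as a half-life. The sketch below assumes first-order kinetics, under which the half-life equals ln(2) divided by the rate constant. The function name `half_life_seconds` and the example rate are hypothetical and are not taken from the dataset or from this repository.

```python
import math


def half_life_seconds(rate_per_second):
    """Half-life in seconds for a first-order process with the given rate (s^-1)."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return math.log(2) / rate_per_second


# Hypothetical thermal isomerization rate in s^-1 (illustrative only)
rate = 1.0e-5
t_half = half_life_seconds(rate)
print(f"half-life: {t_half:.0f} s ({t_half / 3600:.1f} h)")
```

A smaller rate therefore corresponds to a thermally more stable isomer with a longer half-life.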

Gaussian Process Regression using a Tanimoto Kernel

Please see the examples folder for a tutorial on how to implement and use the GP-Tanimoto model in GPflow.
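
For intuition, the Tanimoto kernel between two binary fingerprints x and y is commonly written as variance * <x, y> / (||x||^2 + ||y||^2 - <x, y>). The following is a minimal sketch in plain Python, not the GPflow implementation from the examples folder: the names `tanimoto` and `tanimoto_gram`, the `variance` parameter, the toy fingerprints and the convention for two all-zero fingerprints are all assumptions made for illustration.

```python
def tanimoto(x, y):
    """Tanimoto similarity between two equal-length binary fingerprints."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumed convention: two all-zero fingerprints are treated as identical
    return 1.0 if denom == 0 else dot / denom


def tanimoto_gram(X, Y, variance=1.0):
    """Gram matrix K[i][j] = variance * tanimoto(X[i], Y[j])."""
    return [[variance * tanimoto(x, y) for y in Y] for x in X]


# Toy 4-bit fingerprints (illustrative only, far shorter than real fingerprints)
fps = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]

for row in tanimoto_gram(fps, fps):
    print([round(value, 3) for value in row])
```

For binary vectors the ratio is the number of shared "on" bits divided by the number of bits set in either fingerprint, so identical fingerprints score 1 and fingerprints with no bits in common score 0.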

Citing the Photoswitch Dataset

If you find the Photoswitch Dataset useful for your research, please consider citing the following article.

@article{Thawani_2020,
      title={The Photoswitch Dataset: A Molecular Machine Learning Benchmark for the Advancement of Synthetic Chemistry}, 
      author={Aditya R. Thawani and Ryan-Rhys Griffiths and Arian Jamasb and Anthony Bourached and Penelope Jones and William McCorkindale and Alexander A. Aldrick and Alpha A. Lee},
      year={2020},
      eprint={2008.03226},
      archivePrefix={arXiv},
      primaryClass={physics.chem-ph}
}
