This repository is an attempt to create a model for the Tox-21 data challenge, which resolved in 2014. The data can be obtained from the webpage:
https://tripod.nih.gov/tox21/challenge/data.jsp
In particular, the data used in this repository is the complete dataset found under the subheader 'Training Datasets' in the webpage. The winners of the challenge were announced here:
https://ncats.nih.gov/news/releases/2015/tox21-challenge-2014-winners
In this repository, a model is created to try and predict the signal and stress panels as dictated by the webpage, which is done by means of feature engineering using basic ML practices, but also by making use of known computational chemistry libraries and deep neural networks.
The challenge revolves around the idea of applying machine learning to drug discovery in order to generally shorten the time it takes for drugs to make it to pharmaceutical testing, and then to market. The general idea is that enzymes have so-called active sites, which consume subtrates in which certain desired properties are implemented. The hope is that further chemical reactions may occur in which further required properties occur.
In the context of the dataset here, there are a list of candidate molecules and their performance (active or in-active) on a set of so-called assays. Put more simply - the input dataset is a list of molecules. The output are 12 dimensional arrays, filled with 1's or 0's. The point of the exercise is to be able to predict all 1's and 0's.
The input molecules are represented by so-called SMILES. https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system which is a way to represent a molecule linearly by treating the 2D diagrammatic representation as a graph and performing a depth first search algorithm (roughly). An example would be CC[C@H](O1)CC[C@@]12CCCO2
. In order to obtain useful information about the molecule, the rdkit library is used extensively https://www.rdkit.org/docs/index.html. The author of this repository makes no claim to understanding the chemistry concepts in full, however, the rdkit library has made it relatively simple to make some progress despite this fact.
Finally, it is also worth noting that for the training portion of this study, the use of google colab was employed - this will be pointed out when certain notebooks are discussed.
Here, the basic outline of the folder structure is provided:
There are three main folders:
-
drugdiscovery
: This folder contains all of the preprocesing, DNN training and testing code and encapsulates the main functionailities used in this projet. -
data
: This folder contains all of the data that is collected or engineered throughout the project, since this data has come from the Tox-21 group, the data have not been given here. -
models
: This folder containts the models that were obtained.
There are 6 notebooks:
-
data_integrity.ipynb
is the first touch with the downloaded data and provides very rudimentary consistency checks on the data (this file is partnered with the scriptclean_data.py
to be described shortly). -
data_sourcing.ipynb
provides some details of the kind of data processing and engineering available in regards to the smile data (this is partnered withpreprocess.py
) -
EDA.ipynb
performs exploratory data analysis on the smile data as well as all the engineered data. -
Benchmark_models.ipynb
provides benchmkark models for the data. -
Deep neural net modelling.ipynb
is where the DNN model instanciation adn training occurs, this is done on google colab. -
Grid esarch analysis and seeking final model.ipynb
goes through the grid search studied in the previus notebook, identifies an optimal model subject to the area under the precision-recall curve.
There are two python scripts:
-
clean_data.py
cleans the data subject to the findings indata_integrity.ipynb
. -
preprocess.py
provides data engineering efforts subject to the findings indata_sourcing.ipynb
.
The data itself presents as smile representations, together with molecular weight, various identifiers (from PubChem, Tox-21, etc.) together with the known assay results. After removing the identifiers, the last 5 rows of the table are :
SMILES | SR-HSE | NR-AR | SR-ARE | NR-Aromatase | NR-ER-LBD | NR-AhR | SR-MMP | NR-ER | NR-PPAR-gamma | SR-p53 | SR-ATAD5 | NR-AR-LBD |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CCCc1cc(=O)[nH]c(=S)[nH]1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
S=C1NCCN1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
S=C1NCCN1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CCOP(=S)(OCC)Oc1ccc(N+[O-])cc1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
CCC(COC(=O)CCS)(COC(=O)CCS)COC(=O)CCS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
The SMILES are the input data, whilst the remaining columns are the 12 target columns.
There are two important data integrity issues that need to be addressed:
-
There are duplicate columns, with the same smile but different (or improved, by virtue of NaN->non-NaN or 0->1) targets. The assumption is that further experimental results have increased the table knowledge or modified it. However, the relevant rows have not been modified by the experiment reporter - but a new row have simply been added. This is addressed in
data_intgrity.ipynb
and resolved inclean_data.py
. -
Null are a natural consequence of the way the the data was collected. There are two many null values to simply remove them (since they occur intertwined with relevatn and wanted target results), thus they must be worked with. The descision here is bring these data points along the entire process, but to introduce bit masking into the weightings for metric and criterion loss computations in order to not let these values contribute. In the computation of metrics, this is done by enforcing a 2D weighting matrix (which is the bit mask) via, the example line 58 of
drugdiscovery/testing.py
:precision = precision_score(self.y[:,i],y_pred[:,i], sample_weight = self.sample_weights[:,i])
In the training of the DNN, it is important that that the null values aren't backpropagated - again, this can bed one by making use of the weighting functionality within PyTorch:
# step 5. Re-adjust the criterion weight to mask null values in the validation target set.
self.criterion.weight = torch.tensor(weight_test).to(self.device).float()
loss = self.criterion(preds.to(self.device), y_val.to(self.device).float()).to(self.device)
vloss = loss
which can be found in line 204 of drugdiscovery/deeplearning.py
.
In this section the process of the project is described, this also describes the order that one should read this project. There are 5 clear steps of the project lifecycle:
-
Data Cleaning and engineering
The data is checked for consistency and data engineering opportunities are invesitgated. In this particular case, duplicates were removed and a strategy for dealing with the null values was initiated. For data engineering, the rdkit library was used to generate:
- Fingerprint data
- Dice similarities between known toxicophores and fingerprints of scaffold molecules.
- Molecular descriptors
data_integrity.ipynb
anddata_sourcing.ipynb
describe cleaning and engineering opporunities within the dataset, whilstclean_data.py
andpreprocess.py
perform these steps. -
Exploratory Data Analysis
Given the engineered data, a basic exploratory study into the dataset is performed. The structure and balance of the target dataset is studied. The key questions asked here are regarding the distributions of the various input datasets when split by te target dataset.
A very key takeaway from this study is that the target data in naturally very imbalanced, thus the key metric for success for any models would be the Area under precision-recall since, both the precision and the recall focus on how relevant the results are and whether relevatn results are returned at all.
This is performed in
EDA.ipynb
, at this juncture its also worth looking atdrugdiscovery/testing.py
where a testing framework is built. -
Benchmarking
Basic benchmarking are also performed, and provides a basic sense of how the testing framework can be utilised. The model parameters are despoted in
/models
. The two models explored are:-
RandomForeset : The generic model given by sklearn is chosen which has
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
-
XGBClassifier by XGBoost: Again, the generic model given byxgboost:
XGBClassifier(base_score=None, booster=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, gamma=None, gpu_id=None, importance_type='gain', interaction_constraints=None, learning_rate=None, max_delta_step=None, max_depth=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, objective='binary:logistic', random_state=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, subsample=None, tree_method=None, validate_parameters=False, verbosity=None)
This is performed in
Benchmark_models.ipynb
, using a bag of words approach, where the results are also displayed. The result of the test set for the random forest are: -
Assay | Precision | Recall | F1 | AUPRC | Accuracy | Balanced Accuracy | ROC_AUC |
---|---|---|---|---|---|---|---|
SR-HSE | 0.5 | 0.059322 | 0.106061 | 0.175205 | 0.946069 | 0.52797 | 0.52797 |
NR-AR | 0.803922 | 0.418367 | 0.550336 | 0.530519 | 0.972709 | 0.707062 | 0.707062 |
SR-ARE | 0.68254 | 0.133127 | 0.222798 | 0.454856 | 0.846861 | 0.560451 | 0.560451 |
NR-Aromatase | 0.769231 | 0.0934579 | 0.166667 | 0.295378 | 0.948927 | 0.545919 | 0.545919 |
NR-ER-LBD | 0.710526 | 0.243243 | 0.362416 | 0.406449 | 0.959729 | 0.619175 | 0.619175 |
NR-AhR | 0.771084 | 0.243346 | 0.369942 | 0.535456 | 0.901802 | 0.616819 | 0.616819 |
SR-MMP | 0.738318 | 0.265993 | 0.391089 | 0.609741 | 0.873652 | 0.624512 | 0.624512 |
NR-ER | 0.581395 | 0.192308 | 0.289017 | 0.351963 | 0.882521 | 0.586339 | 0.586339 |
NR-PPAR-gamma | 0.5 | 0.0333333 | 0.0625 | 0.184745 | 0.97235 | 0.516193 | 0.516193 |
SR-p53 | 0.533333 | 0.0567376 | 0.102564 | 0.268889 | 0.938353 | 0.526726 | 0.526726 |
SR-ATAD5 | 0.857143 | 0.0674157 | 0.125 | 0.303089 | 0.964706 | 0.53349 | 0.53349 |
NR-AR-LBD | 0.763158 | 0.386667 | 0.513274 | 0.551939 | 0.97593 | 0.691297 | 0.691297 |
- Grid searching DNN models
Having established benchmarks, DNN modelling is performed. Particularly, by grid searching over types of data inputs, the layers of the network and whether PCA should be used as a final step to halve the data input dimension. The grid search parameters investigated are:
- Data inputs over all subsets of fingerprint data, dice similarities of toxicophores with the data and molecular descriptors.
- The layers [1024],[1024,2048],[1024,2048,4196]
- Whether a final PCA should be performed just before the data is consumed by the model
The data which is not grid searched, but remains static throughout is:
- Early stopping at a patience level of 10 over the area under precision-recall curve which is computed every 10 epochs.
- Batch size is 128.
- A residual layer in the network itself.
- Adam optimiser with learning rate 4e-5, and weight decay at 1e-5.
- A scheduler which lowers the learning rate on when a plateau on the validation loss is reached.
- Number of epochs fixd to 200, it was found that for this particular grid search - 200 epochs was enough to determine the relative success of the model from the search parameters chosen.
This is all conducted in Deep neural net modelling.ipynb
with the help of the functionality built in drugdiscovery/deeplearning
.
- Finding an optimal model subject to the grid search
The process of finding the optimal model from the grid search by virtue of the AUPRC was found in Grid search analysis and seeing final model.ipynb
.
According to the maximal area under precision-recall curve - the model with data inputs with truncated SVD fingerprint, dice similarities and molecular descriptors, with layers [1024, 2048] and PCA employed to halve the dataset was the optimal. It has an AUPRC of 0.50923616. On the testing set, the metric table is:
Assay | Precision | Recall | F1 | AUPRC | Accuracy | Balanced Accuracy | ROC_AUC |
---|---|---|---|---|---|---|---|
SR-HSE | 0.4 | 0.277778 | 0.327869 | 0.341085 | 0.937879 | 0.62687 | 0.62687 |
NR-AR | 0.666667 | 0.454545 | 0.540541 | 0.526299 | 0.977212 | 0.72382 | 0.72382 |
SR-ARE | 0.670103 | 0.601852 | 0.634146 | 0.641071 | 0.874372 | 0.768206 | 0.768206 |
NR-Aromatase | 0.62963 | 0.515152 | 0.566667 | 0.58764 | 0.956811 | 0.748788 | 0.748788 |
NR-ER-LBD | 0.454545 | 0.285714 | 0.350877 | 0.442262 | 0.948179 | 0.634021 | 0.634021 |
NR-AhR | 0.689655 | 0.526316 | 0.597015 | 0.678326 | 0.919403 | 0.748006 | 0.748006 |
SR-MMP | 0.717391 | 0.680412 | 0.698413 | 0.777485 | 0.90339 | 0.813837 | 0.813837 |
NR-ER | 0.536585 | 0.293333 | 0.37931 | 0.414401 | 0.884984 | 0.629425 | 0.629425 |
NR-PPAR-gamma | 0.272727 | 0.176471 | 0.214286 | 0.205877 | 0.965891 | 0.581866 | 0.581866 |
SR-p53 | 0.576923 | 0.3 | 0.394737 | 0.481543 | 0.932353 | 0.64127 | 0.64127 |
SR-ATAD5 | 0.466667 | 0.304348 | 0.368421 | 0.428118 | 0.966667 | 0.646435 | 0.646435 |
NR-AR-LBD | 0.5 | 0.5625 | 0.529412 | 0.592103 | 0.976744 | 0.774554 | 0.774554 |
The main conclusion to draw is that the discovered DNN model performed a little better that out of the box RandomForest. This appears to be largely due to the not having a large enough feature space to choose from, this is particularly stark given the winning solution had almost 40000 data inputs: https://arxiv.org/pdf/1503.01445.pdf.
For this project to progress - we have to avenues to pursue
- Consider polynomial features, or further data engineer processes.
- Enhance the input dataset, particularly with specific domain knoweledge.
- Consider a larger variety of nueral network architecture - the most attractive of these would probably be from NLP application since there are structure sequence data.
-
The challenge has a separate testing dataset and has now released results - thus we should apply our model to this testing dataset.
-
Add polynomial features to the input data.
-
Add RNN and Transformer models.
-
Create a requirements.yaml file.
-
Send trained model to data folder.
-
Need more data.