SigmaCCS

This is the code base for the paper Ion Mobility Collision Cross Section Prediction Using Structure Included Graph Merged with Adduct. We developed a model named SigmaCCS that predicts the collision cross section (CCS) of compounds, and we built a dataset of predicted CCS values for ~90,000,000 molecules from PubChem. For each molecule, the dataset provides "Pubchem ID", "SMILES", "InChi", "Inchikey", "Molecular Weight", "Exact Mass", "Formula", and the predicted CCS values for three adduct ion types ([M+H]+, [M-H]-, [M+Na]+). Our paper also uses GNN-RT.

  • sigma.py
  • GraphData.py
  • model.py
  • data Folder:
    • TrainData.csv
    • TestData.csv
    • TestData-pred.csv (Predicted results)
  • model Folder:
    • model.h5
  • parameter Folder:
    • parameter.pkl

Package required:

We recommend using conda and pip.

Installing from requirements/conda/requirements.txt and requirements/pip/requirements.txt will update all your packages to the correct versions.

Data pre-processing

SigmaCCS predicts CCS with a graph neural network, so SMILES strings must first be converted to graphs. The related methods are in GraphData.py.

1. Generate 3D conformations of molecules.

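from rdkit import Chem
from rdkit.Chem import AllChem

# smiles: the SMILES string of the molecule
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)  # add explicit hydrogens before embedding
ps = AllChem.ETKDGv3()
ps.randomSeed = -1  # -1: do not fix the random seed
ps.maxAttempts = 1
ps.numThreads = 0  # 0: use all available threads
ps.useRandomCoords = True  # start the embedding from random coordinates
# Embed one conformer, then optimize it with the MMFF force field
re = AllChem.EmbedMultipleConfs(mol, numConfs = 1, params = ps)
re = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads = 0)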
  • ETKDGv3 returns an EmbedParameters object for version 3 of the ETKDG method (with macrocycle support).
  • EmbedMultipleConfs uses distance geometry to obtain multiple sets of coordinates for a molecule.
  • MMFFOptimizeMoleculeConfs uses MMFF to optimize all of a molecule's conformations.

2. Save relevant parameters. For details, see parameter.py.

  • adduct set
  • atoms set
  • Minimum value in atomic coordinates
  • Maximum value in atomic coordinates

3. Generate the Graph dataset. Generate the three matrices used to construct the Graph:
(1) node feature matrix, (2) adjacency matrix, (3) edge feature matrix.

adj, features, edge_features = convertToGraph(smiles, Coordinate, All_Atoms)
DataSet = MyDataset(features, adj, edge_features, ccs)

Optional args

  • All_Atoms : The set of all elements in the dataset
  • Coordinate : Array of coordinates of all molecules
  • features : Node feature matrix
  • adj : Adjacency matrix
  • edge_features : Edge feature matrix
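To make the three matrices concrete, here is a small pure-Python sketch for a toy molecule. It is not the repository's convertToGraph: the function `toy_graph`, its inputs, and the choice of features (one-hot element type plus coordinates for nodes, bond order for edges) are assumptions for illustration only.

```python
def toy_graph(symbols, bonds, coords, all_atoms):
    """Build adjacency, node-feature, and edge-feature matrices (hypothetical helper).

    symbols: element symbol per atom; bonds: (i, j, bond_order) tuples;
    coords: (x, y, z) per atom; all_atoms: ordered list of all elements.
    """
    n = len(symbols)
    # (1) Node features: one-hot element type followed by x, y, z
    features = []
    for sym, xyz in zip(symbols, coords):
        one_hot = [1.0 if sym == a else 0.0 for a in all_atoms]
        features.append(one_hot + list(xyz))
    # (2) Adjacency matrix and (3) edge-feature matrix, both symmetric
    adj = [[0] * n for _ in range(n)]
    edge_features = [[0.0] * n for _ in range(n)]
    for i, j, order in bonds:
        adj[i][j] = adj[j][i] = 1
        edge_features[i][j] = edge_features[j][i] = float(order)
    return adj, features, edge_features


# Formaldehyde (C=O with two H) using made-up coordinates
symbols = ["C", "O", "H", "H"]
bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1)]
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (-0.5, 0.9, 0.0), (-0.5, -0.9, 0.0)]
all_atoms = ["C", "H", "O"]

adj, features, edge_features = toy_graph(symbols, bonds, coords, all_atoms)
```

Here each row of `features` has one entry per element in `all_atoms` plus three coordinate values, and the adjacency and edge-feature matrices are both N x N for N atoms.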

Model training

Train the model based on your own training dataset.

Model_train(ifile, ParameterPath, ofile, ofileDataPath, EPOCHS, BATCHS, Vis, All_Atoms=[], adduct_SET=[])

Optional args

  • ifile : Path of the file that stores the SMILES and adduct data.
  • ofile : File path where the model is stored.
  • ParameterPath : Save path of the related data parameters.
  • ofileDataPath : File path for storing model parameter data.

Predicting CCS

The CCS of a molecule is predicted by feeding its graph and adduct into the trained SigmaCCS model.

Model_prediction(ifile, ParameterPath, mfileh5, ofile, Isevaluate = 0)

Optional args

  • ifile : Path of the file that stores the SMILES and adduct data.
  • ParameterPath : File path for storing model parameter data.
  • mfileh5 : File path where the model is stored.
  • ofile : Path where the predicted CCS values are saved.
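If measured CCS values are available for the predicted molecules, the predictions can be compared against them. The sketch below computes two common regression metrics, the median relative error and R². It is an illustration under assumptions: the function names and the toy values are hypothetical, and it does not describe what the Isevaluate flag does internally.

```python
from statistics import mean, median


def median_relative_error(measured, predicted):
    """Median of |predicted - measured| / measured over all molecules."""
    return median(abs(p - m) / m for m, p in zip(measured, predicted))


def r_squared(measured, predicted):
    """Coefficient of determination of the predictions."""
    avg = mean(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - avg) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Made-up CCS values, for illustration only
measured = [150.0, 200.0, 250.0]
predicted = [153.0, 198.0, 250.0]

med_re = median_relative_error(measured, predicted)
r2 = r_squared(measured, predicted)
print(f"median relative error: {med_re:.4f}, R^2: {r2:.4f}")
```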

Usage

Example code for usage is included in test.ipynb.

Others

The following files are in the others/ folder:

  • Attribute Analysis.ipynb: analyzes attribute importance.
  • UMAP.ipynb: visualizes the learned representation with UMAP.
  • UMAPDataset.py: generates graph datasets.
  • theoretical calculation.ipynb: investigates the relationship between SigmaCCS and theoretical calculation.
  • Filtering.ipynb: filters target unknown molecules based on their CCS and m/z.
  • CFM-ID4: the code for generating MS/MS spectra with CFM-ID 4.0.
  • GNN-RT:
  • model:
    • model.h5
  • data:
    • Attribute importance data
      • Attribute importance (data.csv)
      • Coordinate data (Store the 3D coordinate data of all molecules in data.csv)
    • UMAP data
      • data.csv
      • data_molvec.npy (Molecular vectors of all molecules)
      • data-UMAP-EUC-60.npy
      • Coordinate data (Store the 3D coordinate data of all molecules in data.csv)
    • theoretical calculation data
      • data.csv
      • LJ_data.csv (Get the LJ interaction parameters of different elements according to LJ_data.csv)
      • Coordinate data (Store the 3D coordinate data of all molecules in data.csv)
    • Filtering data
      • data.csv
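The filtering idea behind Filtering.ipynb, narrowing candidates by both m/z and CCS, can be sketched as follows. This is a minimal illustration, not the notebook's code: the function `filter_candidates`, the record keys, and the tolerance values (10 ppm on m/z, 3% on CCS) are assumptions.

```python
def filter_candidates(candidates, observed_mz, observed_ccs,
                      ppm_tol=10.0, ccs_tol=0.03):
    """Keep candidates whose m/z and CCS both fall within tolerance (hypothetical helper)."""
    kept = []
    for cand in candidates:
        # m/z deviation in parts per million
        ppm = abs(cand["mz"] - observed_mz) / observed_mz * 1e6
        # CCS deviation as a fraction of the observed value
        ccs_dev = abs(cand["ccs"] - observed_ccs) / observed_ccs
        if ppm <= ppm_tol and ccs_dev <= ccs_tol:
            kept.append(cand["name"])
    return kept


# Made-up candidates: A matches both, B fails on CCS, C fails on m/z
candidates = [
    {"name": "A", "mz": 181.0707, "ccs": 140.0},
    {"name": "B", "mz": 181.0710, "ccs": 160.0},
    {"name": "C", "mz": 181.2000, "ccs": 141.0},
]
hits = filter_candidates(candidates, observed_mz=181.0706, observed_ccs=141.5)
```

Requiring both criteria is what makes predicted CCS useful here: candidates with nearly identical m/z can still be separated by their CCS.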

Package required:

Slurm script

Slurm scripts for generating the CCS values of PubChem molecules on an HPC cluster. The following files are in the slurm/ folder:

  • mp.py
  • multiple_job.sh (Batch generation of slurm script files)
  • normal_job.sh (Submit the slurm script for the mp.py file)

Information of maintainer

sigmaccs's Issues

Error in test.ipynb

The last few times I made predictions using the Jupyter notebook, I got some errors.

The first one came from NumPy:

anaconda3\envs\sigma\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)

The second one came from the Keras library:

anaconda3\envs\sigma\lib\site-packages\keras\utils\generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
category=CustomMaskWarning)

Can you help me? Thank you for your work.
