This is the code repository for the paper "Highly Automatic and Universal Approach for Extracting Features from LC-MS Data Using Deep Learning". We developed DeepPIC, a deep learning-based pure ion chromatogram (PIC) method that extracts PICs directly and automatically from raw data files. DeepPIC has been integrated into the KPIC2 framework, and the combination provides a complete pipeline from raw data to discriminant models for metabolomic datasets.
1. Install Anaconda for Python 3.8.13.
2. Install R 4.2.1.
3. Install KPIC2 in R.
For installation instructions, see https://github.com/hcji/KPIC2.
- First, install the dependencies of KPIC2.
install.packages(c("BiocManager", "devtools", "Ckmeans.1d.dp", "Rcpp", "RcppArmadillo", "mzR", "parallel", "shiny", "plotly", "data.table", "GA", "IRanges", "dbscan", "randomForest"))
BiocManager::install(c("mzR","ropls"))
- Then, download the source package of KPIC2 at url and install the package locally.
4. Create the environment and install the main packages.
- Open a command line and create the environment.
conda create --name DeepPIC python=3.8.13
conda activate DeepPIC
- Clone the repository and enter it.
git clone https://github.com/yuxuanliao/DeepPIC.git
cd DeepPIC
- Install the main packages listed in requirements.txt with the following command.
python -m pip install -r requirements.txt
- Set the environment variables that rpy2 uses to call R.
R_HOME is the installation location of R.
R_USER is the installation location of the rpy2 package.
The paths below are examples; replace them with your own installation paths.
setx "R_HOME" "C:\Program Files\R\R-4.2.1"
setx "R_USER" "C:\Users\yxliao\anaconda3\Lib\site-packages\rpy2"
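On Windows, setx stores a variable for future command windows but does not change the one that is already open. As a minimal sketch of an alternative, the same two variables can be set from Python for the current process before rpy2 is imported. The helper name configure_r_env is hypothetical and not part of this repository, and the paths are the example paths from the setx commands, so adjust them to your machine.

```python
import os


def configure_r_env(r_home, r_user):
    """Set R_HOME and R_USER for the current Python process only.

    Hypothetical helper for illustration; it is not part of DeepPIC.
    """
    os.environ["R_HOME"] = r_home
    os.environ["R_USER"] = r_user
    return {"R_HOME": os.environ["R_HOME"], "R_USER": os.environ["R_USER"]}


# Example paths taken from the setx commands; replace them with your own.
env = configure_r_env(
    r"C:\Program Files\R\R-4.2.1",
    r"C:\Users\yxliao\anaconda3\Lib\site-packages\rpy2",
)

# Import rpy2 only after the variables are set, for example:
# import rpy2.robjects as robjects
```

Values set this way last only for the running process, which makes this convenient inside a notebook; the setx commands remain the persistent option.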
The following files are in the DeepPIC folder:
- train.py. for model training
- extract.py. extract PICs from raw LC-MS files
- predict.py. define the IoU metric for PICs and evaluate the DeepPIC model
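To illustrate the idea behind an IoU (intersection over union) metric for PICs, here is a minimal sketch. It assumes a PIC can be represented as a set of point identifiers, such as (scan index, m/z bin) tuples; the function name pic_iou and the example points are hypothetical, and the definition actually used in predict.py may differ.

```python
def pic_iou(predicted, reference):
    """Intersection over union of two PICs given as iterables of point IDs."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        return 0.0  # convention chosen in this sketch for two empty PICs
    return len(pred & ref) / len(union)


# Points are (scan index, m/z bin) tuples; the values are made up.
predicted_pic = {(10, 301), (11, 301), (12, 301), (13, 302)}
reference_pic = {(11, 301), (12, 301), (13, 302), (14, 302)}

# Three shared points out of five distinct points in total.
iou = pic_iou(predicted_pic, reference_pic)
```

An IoU of 1 means the predicted and reference PICs contain exactly the same points, and 0 means they share none.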
The following files are in the KPIC2 folder:
- KPIC2.py. for integrating DeepPIC into KPIC2 to implement the whole process of metabolomics processing
- KPIC2.R. the code for the feature detection, alignment, grouping, missing value filling, and building classification models
- permutation_vip.py. define some functions for file format conversion, permutation test, and biomarker selection
- files:
  - pics (PICs extracted by DeepPIC from each LC-MS file in the metabolomics dataset by running extract.py)
  - scantime (RTs read from each LC-MS file in the metabolomics dataset using OpenMS)
  - KPIC2_result.csv (the file generated by running KPIC2.py)
  - KPIC2_result_plot.csv (the file formatted for the OPLS-DA scores plot, permutation test, and biomarker selection, generated by running the Datatransform function in permutation_vip.py)
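The permutation test mentioned above can be sketched in general terms as follows. This is not the code in permutation_vip.py: the repository applies the test to an OPLS-DA model, whereas this sketch swaps in a much simpler statistic (the absolute difference between two group means) so that it runs with the standard library alone. The function names and the data are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Absolute difference between the means of group 0 and group 1."""
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(sum(group0) / len(group0) - sum(group1) / len(group1))


def permutation_p_value(values, labels, n_permutations=999, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between values and labels
        if mean_difference(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up intensities for one feature in two sample classes.
values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = permutation_p_value(values, labels)
```

A small p-value indicates that the separation between the classes is unlikely to arise from a random assignment of labels, which is the same question the permutation test asks of a classification model.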
The following files are in the others folder:
- metabolomics.py. the code for the OPLS-DA scores plot, permutation test, biomarker selection and hierarchical cluster analysis
- quantitative.py. evaluate the quantitative ability of feature extraction methods
- XCMS.R. the code for XCMS to detect peaks
- Simulation:
  - mssimulator.py. define some functions for generating the simulated LC-MS files
  - simulated_mm48.py. generate the simulated MM48 dataset
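As a rough illustration of what simulating LC-MS signal can involve, the sketch below generates a single chromatographic trace with a Gaussian elution profile plus random noise. This is an assumption made for illustration only: the peak model, the function names, and all parameter values are hypothetical, and mssimulator.py may generate its files quite differently.

```python
import math
import random


def gaussian_peak(scan_times, apex_time, width, height):
    """Noise-free Gaussian elution profile evaluated at each scan time."""
    return [
        height * math.exp(-0.5 * ((t - apex_time) / width) ** 2)
        for t in scan_times
    ]


def simulate_trace(scan_times, apex_time, width, height, noise_sd=0.0, seed=0):
    """Gaussian peak with additive Gaussian noise, clipped at zero intensity."""
    rng = random.Random(seed)
    clean = gaussian_peak(scan_times, apex_time, width, height)
    return [max(0.0, y + rng.gauss(0.0, noise_sd)) for y in clean]


# 61 scans spaced 0.5 s apart, with a peak centred at 15 s (made-up values).
scan_times = [i * 0.5 for i in range(61)]
clean = gaussian_peak(scan_times, apex_time=15.0, width=2.0, height=1000.0)
trace = simulate_trace(scan_times, apex_time=15.0, width=2.0, height=1000.0, noise_sd=5.0)
apex_index = max(range(len(trace)), key=trace.__getitem__)
```

Because the true apex time and height are known by construction, a simulated trace like this gives a ground truth against which extracted features can be compared.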
The dataset of 200 input-label pairs used to train, validate, and test the DeepPIC model is in the dataset folder. Because the model and the data exceed the size limits, we have uploaded the optimized model and the datasets (MM48, simulated MM48, quantitative, metabolomics and different instrumental datasets) to the GitHub release page.
Example code for model training is in train.ipynb.
Example code for feature extraction is in extract.ipynb.
Example code for integrating DeepPIC into KPIC2 to implement the whole process of metabolomics processing is in Integration_into_KPIC2.ipynb.
By running extract.py, you can use DeepPIC to extract PICs from each LC-MS file in the metabolomics dataset. The whole process of metabolomics processing can then be implemented by running KPIC2.py directly. In this way, you can use DeepPIC+KPIC2 to process your own data. See extract.ipynb and Integration_into_KPIC2.ipynb for details.