Requirements:
- Python numpy
- Python pandas
- Python multiprocessing
- PyTables
- XGBoost (python API) (only required for training)
- Cython (http://cython.org/)
MS2PIPC requires machine-specific compilation of the C code:
sh compile.sh
Pre-trained HCD models for the b- and y-ions can be found in the /models folder. These C-coded decision tree models are compiled by running the compile.sh script, which writes the Python module ms2pipfeatures_pyx.so that is imported into the main Python script ms2pipC.py:
usage: ms2pipC.py [-h] [-s FILE] [-w FILE.ext] [-c INT] <peptide file>
positional arguments:
<peptide file> list of peptides
optional arguments:
-h, --help show this help message and exit
-c FILE config file
-s FILE .mgf MS2 spectrum file (optional)
-w FILE write feature vectors to FILE.pkl (optional)
-i iTRAQ models
-p phospho models
-m INT number of cpu's to use
The -i flag makes ms2pipC use the NIST iTRAQ4 models (HCD only).
The -i flag in combination with the -p flag makes ms2pipC use the NIST iTRAQ4 phospho models (HCD only).
Several ms2pipC options need to be set in the config file (the -c command line argument).
The models that should be used are set as frag_method=X, where X is either CID or HCD.
The fragment ion error tolerance is set as frag_error=X, where X is the tolerance in Da.
PTMs (see further) are set as ptm=X,Y,Z for each internal PTM, where X is a string that represents the PTM, Y is the mass difference in Da associated with the PTM, and Z is the amino acid that is modified by the PTM. N-terminal modifications are specified as nterm=X,Y, where X is again a string that represents the PTM and Y is again the mass difference in Da associated with the PTM.
Similarly, C-terminal modifications are specified as cterm=X,Y, with the same meaning for X and Y.
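To make the key=value layout concrete, here is a minimal sketch of how such a config file could be read. The parse_config helper is hypothetical and is not the parser used by ms2pipC.py; the frag_error value of 0.02 is an illustrative assumption, and the PTM lines are the ones used as an example in this README.

```python
def parse_config(text):
    """Parse key=value config lines into a dictionary (hypothetical helper)."""
    config = {"frag_method": None, "frag_error": None,
              "ptm": [], "nterm": [], "cterm": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank or unrecognised lines
        key, value = line.split("=", 1)
        fields = value.split(",")
        if key == "frag_method":
            config[key] = value  # CID or HCD
        elif key == "frag_error":
            config[key] = float(value)  # tolerance in Da
        elif key == "ptm":
            name, delta, residue = fields  # ptm=X,Y,Z
            config[key].append((name, float(delta), residue))
        elif key in ("nterm", "cterm"):
            name, delta = fields  # nterm=X,Y or cterm=X,Y
            config[key].append((name, float(delta)))
    return config


# Illustrative config content; frag_error=0.02 is an assumed value.
example = """frag_method=HCD
frag_error=0.02
ptm=Cam,57.02146,C
nterm=Ace,42.010565
cterm=Glyloss,-58.005479"""

config = parse_config(example)
```

Because ptm, nterm and cterm may each appear more than once, the sketch collects them in lists rather than overwriting a single value.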
To apply the pre-trained models you need to pass only a <peptide file> to ms2pipC.py. This file contains the peptide sequences for which you want to predict the b- and y-ion peak intensities. The file is space separated and contains four columns with the following header names:
- spec_id: an id for the peptide/spectrum
- modifications: a string indicating the modified amino acids
- peptide: the unmodified amino acid sequence
- charge: charge state to predict
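As a rough illustration of the four-column layout, the sketch below writes a small space-separated peptide file with Python's csv module. The helper name write_peptide_file, the spec_id values and the peptide sequences are made up for the example. How an unmodified peptide should be encoded in the modifications column is not stated here, so both example rows carry at least one modification.

```python
import csv
import io

HEADER = ["spec_id", "modifications", "peptide", "charge"]


def write_peptide_file(rows, handle):
    """Write (spec_id, modifications, peptide, charge) rows, space separated."""
    writer = csv.writer(handle, delimiter=" ", lineterminator="\n")
    writer.writerow(HEADER)
    for spec_id, modifications, peptide, charge in rows:
        writer.writerow([spec_id, modifications, peptide, charge])


# Hypothetical peptides; Cam and Ace must be defined in the config file.
rows = [
    ("pep1", "2|Cam", "ACDEFGHIK", 2),
    ("pep2", "0|Ace|3|Cam", "LSCPEPTIDEK", 3),
]

buffer = io.StringIO()  # stands in for open("peptides.txt", "w", newline="")
write_peptide_file(rows, buffer)
peptide_file_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example free of side effects; passing a real file handle works the same way.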
The predictions are saved in a .csv file with the name <peptide_file>_predictions.csv.
If you want the output to be in the form of an .mgf file, change the variable mgf on line 142 of ms2pipC.py.
The spec_id column is a unique identifier for each peptide that will be used in the TITLE field of the predicted MS2 .mgf file. The modifications column is a string that lists the PTMs in the peptide. Each PTM is written as A|B, where A is the location of the PTM in the peptide (the first amino acid has location 1, location 0 is used for N-terminal modifications, and -1 is used for C-terminal modifications) and B is a string that represents the PTM as defined in the config file (-c command line argument).
Multiple PTMs in the modifications column are concatenated with '|'.
As an example, suppose the config file contains the lines
ptm=Cam,57.02146,C
nterm=Ace,42.010565
cterm=Glyloss,-58.005479
then a modifications string could look like 0|Ace|2|Cam|5|Cam|-1|Glyloss, which means that the second and fifth amino acids are modified with Cam, that there is an N-terminal modification Ace, and that there is a C-terminal modification Glyloss.
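The A|B encoding can be unpacked with a few lines of Python. The sketch below is only an illustration of the format: parse_modifications and total_mass_shift are hypothetical helpers, not functions from ms2pipC.py, and the mass differences are the ones from the example config lines.

```python
def parse_modifications(mod_string):
    """Split 'A|B|A|B|...' into a list of (location, ptm_name) pairs."""
    fields = mod_string.split("|")
    if len(fields) % 2 != 0:
        raise ValueError("expected location|name pairs, got: %r" % mod_string)
    # Location 0 is the N-terminus, -1 the C-terminus, 1 the first amino acid.
    return [(int(fields[i]), fields[i + 1]) for i in range(0, len(fields), 2)]


def total_mass_shift(mods, deltas):
    """Sum the mass differences (in Da) of all PTMs in the peptide."""
    return sum(deltas[name] for _, name in mods)


# Mass differences taken from the example config lines.
deltas = {"Cam": 57.02146, "Ace": 42.010565, "Glyloss": -58.005479}

mods = parse_modifications("0|Ace|2|Cam|5|Cam|-1|Glyloss")
shift = total_mass_shift(mods, deltas)
```

Here mods holds one pair per PTM, so a name that is missing from the config file shows up as a KeyError in total_mass_shift rather than being silently ignored.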
To compile a feature vector dataset you need to supply the MS2 .mgf file (option -s) and the name of the file to write the feature vectors to (option -w) to ms2pipC.py.
The spec_id column in the <peptide file> should match the TITLE field of the corresponding MS2 spectrum in the .mgf file and is used to find the targets for the feature vectors.
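Since the spec_id values have to match the TITLE fields, it can help to check for mismatches before extracting feature vectors. The following is a minimal sketch of such a check; read_mgf_titles and missing_spec_ids are hypothetical helpers, and the spectrum content is invented for the example.

```python
def read_mgf_titles(lines):
    """Collect the value of every TITLE= line in an .mgf file."""
    titles = []
    for line in lines:
        line = line.strip()
        if line.startswith("TITLE="):
            titles.append(line[len("TITLE="):])
    return titles


def missing_spec_ids(spec_ids, titles):
    """Return the spec_ids that have no spectrum with a matching TITLE."""
    known = set(titles)
    return [spec_id for spec_id in spec_ids if spec_id not in known]


# Invented two-spectrum .mgf content, for illustration only.
mgf_text = """BEGIN IONS
TITLE=pep1
PEPMASS=500.25
CHARGE=2+
100.1 200.0
END IONS
BEGIN IONS
TITLE=pep2
PEPMASS=610.30
CHARGE=3+
150.2 80.0
END IONS
"""

titles = read_mgf_titles(mgf_text.splitlines())
missing = missing_spec_ids(["pep1", "pep2", "pep3"], titles)
```

For a real file, the lines can come from an open file handle instead of mgf_text.splitlines().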
In the folder tests, run pytest. This will run the tests in test_features.py, which verify that the feature and target extraction are working properly. (The tests must be updated when we add or remove features!)
To do this, the pytest package must be installed (pip install pytest).
The python script
$ python convert_to_mgf.py <file>.msp <title>
converts a spectral library in .msp format into a spectrum .mgf file, a <peptide file> and a <meta> file.
The script
usage: train_xgboost_c.py [-h] [-c INT] <vectors.pkl or .h5> <type>
XGBoost training
positional arguments:
<vectors.pkl or .h5> feature vector file
<type> model type
optional arguments:
-h, --help show this help message and exit
-c INT number of cpu's to use
reads the pickled feature vector file <vectors.pkl or .h5> and trains an XGBoost model. The type option should be "B" for b-ions and "Y" for y-ions.
Hyperparameters should still be optimized. You will need to dig into the script for model selection.
This script will write the XGBoost models as .c files that can be compiled and linked through Cython. Just put the models in the /models folder, change the #include directives in ms2pipfeatures_c.c, and recompile the ms2pipfeatures_pyx.so module by running the compile.sh script.