DARC Sign (DnA Repair Classification Signatures)

Note: This repository accompanies the publication "A generalizable machine learning framework for classifying DNA repair defects using ctDNA exomes". It is meant to ensure reproducibility of the publication and to facilitate the use of the "DARC Sign" package to predict DNA repair defects for research and/or clinical applications.

This tool includes the generation of signatures from standard maf and seg files and trained XGBoost models that use the features.

This repository contains the processing steps required to generate features needed to train new models, to classify a sample as BRCA2d, CDK12d or MMRd and to reproduce the publication. The directory structure includes:

  • The data used to train the models and all other supplemental data is located in ./data.
  • Figures and associated code from the publication are reproduced in ./figures
  • Code that was used to find optimal hyperparameters of each model is located in ./model_training_gridsearchcv
  • The data pipeline for predicting a sample can be run through the script ./darcsign_predict.py which calls various functions and files in ./darc_sign_pipeline.

Installation

The installation has been tested with python >3.7 and xgboost >1.0. All python dependencies can be installed via conda:

#this version is using xgboost=1.5.0 and python=3.9.7
conda create --name darcsign_env --yes numpy scipy pandas matplotlib seaborn scikit-learn pysam xgboost; 

The current version of this pipeline uses the tns and indel features generated by SigProfilerMatrixGenerator which needs to be installed into the created environment:

#activate environment
conda activate darcsign_env;
#Then install SigProfilerMatrixGenerator via pip
pip install SigProfilerMatrixGenerator;
#sigprofiler reference genome also needs to be installed. This tool uses GRCh38
python -c "from SigProfilerMatrixGenerator import install as genInstall; genInstall.install('GRCh38', rsync=False, bash=True)"

Download this repository and the xgboost models from figshare:

# clone from git or wget
git clone https://github.com/elieritch/DarcSign.git;
#download and extract model directory from figshare ~65 MB
cd DarcSign; 
wget https://figshare.com/ndownloader/files/34540430 -O models.tar.gz; 
tar -xzvf models.tar.gz;
rm models.tar.gz;

Running the classifier

Inputs: The pipeline takes as input a maf (mutation annotation file) and a sequenza segments file. Examples of these files can be found in ./darc_sign_pipeline/test_data_input.

In the maf file, the required columns and their column numbers are:

Column Number    Column
#2               Chromosome (1,2,...X,Y)
#3               Position (int)
#5               Reference sequence (str)
#6               Alternate sequence (str)

The names in this table do not matter; the column numbers do. The maf file is converted into a vcf usable by SigProfilerMatrixGenerator with an awk command called through a subprocess, i.e.:

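import subprocess
# skip the maf header, then print columns 2, 3, 5 and 6 as an 8-column tab-separated vcf body
cmd = f"tail -n +2 {mafpath} | awk \'BEGIN{{FS=OFS=\"\\t\"}} {{print $2,$3,\".\",$5,$6,\".\", \".\", \".\"}}\' >> {vcffile}"
subprocess.call(cmd, shell=True)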
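
For readers who prefer to avoid a shell call, the same conversion can be sketched in pure Python. This is only an illustration of what the awk command does, not the code used by the pipeline; the function name maf_to_minimal_vcf is hypothetical, and skipping rows with fewer than six columns is an added assumption (awk would print empty fields instead).

```python
def maf_to_minimal_vcf(maf_path, vcf_path):
    """Append MAF columns 2, 3, 5 and 6 to vcf_path as 8-column VCF-like rows.

    Hypothetical helper mirroring the awk one-liner; returns the number of rows written.
    """
    written = 0
    # "a" mirrors the ">>" append redirection used in the shell command
    with open(maf_path) as maf, open(vcf_path, "a") as vcf:
        next(maf, None)  # skip the header line, like `tail -n +2`
        for line in maf:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # assumption: ignore short or blank lines
            chrom, pos, ref, alt = fields[1], fields[2], fields[4], fields[5]
            # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
            vcf.write("\t".join([chrom, pos, ".", ref, alt, ".", ".", "."]) + "\n")
            written += 1
    return written
```

Because only column positions are used, the header names in the maf are never inspected, which matches the note above that the names do not matter.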

One of the products of the sequenza pipeline is a file with the suffix "_segments.txt". This is the other input to the DARC Sign pipeline. Segmentation data from tools other than Sequenza can also be used, provided the file has the same 4 named columns. The columns that are used from this file are:

Column Name    Column
chromosome     Chromosome (1,2,...X,Y)
start.pos      Start position of segment (int)
end.pos        End position of segment (int)
CNt            Copy number tumor (int)
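
When preparing segments from a tool other than Sequenza, it can help to check for the four named columns before running the pipeline. The following is a minimal sketch, assuming a tab-separated file with a header row; read_segments and REQUIRED_SEG_COLUMNS are hypothetical names and are not part of this repository.

```python
import csv

# The four column names listed in the table above
REQUIRED_SEG_COLUMNS = ("chromosome", "start.pos", "end.pos", "CNt")

def read_segments(seg_path):
    """Return segments as dicts, raising ValueError if a required column is missing."""
    with open(seg_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-separated
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_SEG_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"segments file is missing columns: {missing}")
        segments = []
        for row in reader:
            segments.append({
                "chromosome": row["chromosome"],
                # int() raises ValueError on non-integer values such as NA
                "start": int(row["start.pos"]),
                "end": int(row["end.pos"]),
                "copy_number": int(row["CNt"]),
            })
        return segments
```

Extra columns are ignored, so a full Sequenza segments file and a trimmed four-column file are read the same way.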

The directory also contains a centromere position defining file downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz which is used for feature calculation.
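
As an illustration of how centromere positions can be read from a cytoBand file, the sketch below collects the bands stained "acen" for each chromosome. It assumes the standard five-column UCSC layout (chrom, chromStart, chromEnd, name, gieStain); the function name centromere_spans is hypothetical, and the pipeline's own feature calculation may use the file differently.

```python
import gzip

def centromere_spans(cytoband_path):
    """Map each chromosome to the (start, end) span covered by its 'acen' bands."""
    spans = {}
    # Accept both the gzipped download and an uncompressed copy
    opener = gzip.open if str(cytoband_path).endswith(".gz") else open
    with opener(cytoband_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5 or fields[4] != "acen":
                continue  # keep only centromeric bands
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            low, high = spans.get(chrom, (start, end))
            # Merge the p-arm and q-arm centromeric bands into one span
            spans[chrom] = (min(low, start), max(high, end))
    return spans
```

Note that UCSC chromosome names carry a "chr" prefix, whereas the input tables above list chromosomes as 1,2,...X,Y, so the names may need to be harmonized before comparing positions.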

Running the classifier on a sample using ./darcsign_predict.py:

seg_file="/path/to/samplename_segments.txt"; #product of sequenza
maf_file="/path/to/samplename.maf"; #product of circuit variant caller or other variant calling tools
output_directory="/path/to/samplename_directory"; #directory will be created if it doesn't already exist
sample_name="samplename"; #will be used in figure titles and in the naming of files
script="/path/to/darcsign_predict.py"; #from this git repo
#activate appropriate environment
conda activate darcsign_env;
#run the script
python "${script}" -m "${maf_file}" -s "${seg_file}" -od "${output_directory}" -sn "${sample_name}"

Output interpretation

Outputs: The pipeline produces several figures and tables as output. Examples of these files can be found in ./darc_sign_pipeline/test_data_output. The output of the tool includes the following files:

The processed set of features extracted from the input data:

  • {samplename}_{kindoffeature}_matrix.tsv files, which contain the raw values of each feature used as input to DARC Sign.
    • ./darc_sign_pipeline/test_data_output/sample_name/sample_name_cnv45feature_matrix.tsv
    • ./darc_sign_pipeline/test_data_output/sample_name/sample_name_snv96feature_matrix.tsv
    • ./darc_sign_pipeline/test_data_output/sample_name/sample_name_ndl83feature_matrix.tsv

Graphs of the feature values as proportions of their feature sets
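
Expressing a feature as a proportion of its feature set means dividing its raw value by the sum over that set (for example, one SNV feature divided by the total across all 96 SNV features). A minimal sketch follows, assuming the raw values have already been loaded into a dict; the helper name to_proportions and the feature names in the usage lines are hypothetical, and the layout of the matrix files is not assumed here.

```python
def to_proportions(raw_values):
    """Divide each raw feature value by the total of its feature set."""
    total = sum(raw_values.values())
    if total == 0:
        # Assumption: an all-zero feature set is reported as all-zero proportions
        return {name: 0.0 for name in raw_values}
    return {name: value / total for name, value in raw_values.items()}

# Hypothetical raw values for a tiny three-feature set
example = to_proportions({"feature_a": 2, "feature_b": 6, "feature_c": 0})
```

Each feature set (cnv, snv, ndl) would be normalized separately, so the proportions within one set sum to 1.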

A table that specifies the sample name and the probability of each deficiency

sample        prob_of_BRCA2d    prob_of_CDK12d    prob_of_MMRd
sample_010    0.881962          0.33209658        0.081173204
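
The probability table can be read back for downstream use as sketched below. The three probabilities in the example row do not sum to 1, so each column is treated independently here. The function names and the 0.5 cutoff are hypothetical and chosen only for illustration; this README does not specify a decision threshold, and the file is assumed to be whitespace- or tab-delimited with the header shown above.

```python
def read_predictions(path):
    """Return {sample: {column_name: probability}} from the prediction table."""
    predictions = {}
    with open(path) as handle:
        header = handle.readline().split()
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            predictions[fields[0]] = {
                name: float(value) for name, value in zip(header[1:], fields[1:])
            }
    return predictions

def calls_above(probabilities, threshold=0.5):
    """List deficiencies whose probability reaches a hypothetical cutoff."""
    prefix = "prob_of_"
    return sorted(
        name[len(prefix):] if name.startswith(prefix) else name
        for name, prob in probabilities.items()
        if prob >= threshold
    )
```

With the example row above and the illustrative cutoff of 0.5, only BRCA2d would be listed for sample_010.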

Reproducing the publication

Each subdirectory of ./figures contains python scripts and the figures generated by each script. Each script reads data from the publication, which can be found in ./data. The scripts are all run in place using paths relative to their respective locations in the repository and can be run using the same conda environment as previously installed. For example, the script DarcSign/figures/fig4/predict_and_graph_bc.py produces these graphs and the associated legends.

[Images: Figure 5A, Figure 5B]
