Giter Club home page Giter Club logo

dryclean's Introduction

Build Status codecov.io

dryclean

dryclean

Robust PCA based method to de-noise genomic coverage data.

Installations

Install devtools from CRAN

install.packages('devtools')

Set this to allow dependencies that throw warnings to be installed.

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = TRUE)

Install dependent packages and latest Bioconductor (if you haven't already)

source('https://bioconductor.org/biocLite.R')
biocLite('GenomicRanges')

Install mskilab R dependencies (gUtils)

devtools::install_github('mskilab/gUtils')

Install dryclean

devtools::install_github('mskilab/dryclean')

(after installing R package) Add dryclean directory to PATH and test the executable

$ export PATH=${PATH}:$(Rscript -e 'cat(paste0(installed.packages()["dryclean", "LibPath"], "/dryclean/extdata/"))')
$ drcln -h ## to see the help message

Tutorial

dryclean is a robust principal component analysis (rPCA) based method. dryclean uses a panel of normal (PON) samples to learn the landscape of both biological and technical noise in read depth data. dryclean then uses this landscape significantly reduce noise and artifacts in the signal for tumor samples. The input to the algorithm is a GenomicsRanges object containing read depth. You can either use read count from your favorite tools (there are many fast tools out there, for example: megadepth). Using uncorrected read count as input for dryclean works well from our experience, but if you wish, you can use the GC amd mappability corrected read depth data from fragCounter that can be found at: https://github.com/mskilab/fragCounter . The only requirement for the input is centering the data. This can be achieved by dividing each bin by global mean signal of the sample .i.e. mean-normalization. This is an essential pre-processing step.

1. Creating Panel of Normal aka detergent

For creating PON the following factors are needed:

  1. Tumor and normal sample mean-normalized read count outputs should be stored in two different directories
  2. A data.table with two columns: a. "sample" column contains the sample name you will use to index the sample b. "normal_cov" is a column with paths to the normal samples to be used

Following is an example of such a table

normal_table_example = readRDS("~/git/dryclean/inst/extdata/normal_table.rds")
normal_table_example
samplenormal_cov
samp1 ~/git/dryclean/inst/extdata/samp1.rds
samp2 ~/git/dryclean/inst/extdata/samp2.rds
samp3 ~/git/dryclean/inst/extdata/samp3.rds

There are three ways to make the PON:

  1. Using all normal samples availabble. PON can be made with all available normal samples. In this case set use.all = TRUE
detergent = prepare_detergent(normal.table.path = "~/git/dryclean/inst/extdata/normal_table.rds", path.to.save = "~/git/dryclean/inst/extdata/", num.cores = 1, use.all = TRUE)
  1. Using random subset of normal samples availabble.
detergent = prepare_detergent(normal.table.path = "~/git/dryclean/inst/extdata/normal_table.rds", path.to.save = "~/git/dryclean/inst/extdata/", num.cores = 1, use.all = FALSE, choose.randomly = TRUE)
  1. Cluster based approach. In order to keep the size of PON as small as possible but maximize the information in the PON. This is acheived by clustering the genomic background of normal samples and selecting normal samples from each cluster. Hierarchical clustering is used on L matrix after decomposing a small genomic region of all normal samples.
detergent = prepare_detergent(normal.table.path = "~/git/dryclean/inst/extdata/normal_table.rds", path.to.save = "~/git/dryclean/inst/extdata/", num.cores = 1, use.all = FALSE, choose.by.clustering = TRUE)
names(detergent)
head(detergent$L)
  1. 'L'
  2. 'S'
  3. 'k'
  4. 'U.hat'
  5. 'V.hat'
  6. 'sigma.hat'
0.11864910.12936250.1181923
0.34289640.37385830.3415761
0.11894280.12968280.1184849
0.12309010.13420450.1226161
0.10944830.11933090.1090269
0.12330530.13443920.1228305

The detergent is a list with the following elements:

  1. L: This is the L low ranked matrix of all the PONs calculated by batch robust PCA mathod
  2. S: This is the S sparse matrix of all the PONs calculated by batch robust PCA mathod
  3. k: This is estimated rank of a matrix where coverage values from each normal sample forms a column
  4. U.hat: svd decompsed left sigular matrix of L required for online implentation of rPCA
  5. V.hat: svd decompsed right sigular matrix of L required for online implentation of rPCA
  6. sigma.hat: svd decompsed first k sigular values of L required for online implentation of rPCA

2. Identifying germline events

Since a PON is used for decomposing the tumor samples, a method is required identify and remove germline events. This is achieved by looking at all normal samples as a population and infer the markers that have a copynumber events at a given frequency, set by user.

In order to do so, normal samples are treated as tumors and all copy number changes in the normal samples are extracted using thr PON created. Here is an example:

decomp.1 = start_wash_cycle(cov = sample.1, detergent.pon.path = "~/git/dryclean/inst/extdata/", whole_genome = TRUE, chr = NA, germline.filter = FALSE)

Once all normal samples are decomposed, the data.table is updated to reflect that:

normal_table_example = readRDS("~/git/dryclean/inst/extdata/normal_table.rds")
normal_table_example
samplenormal_covdecomposed_cov
samp1 ~/git/dryclean/inst/extdata/samp1.rds ~/git/dryclean/inst/extdata/decomp1.rd
samp2 ~/git/dryclean/inst/extdata/samp2.rds ~/git/dryclean/inst/extdata/decomp2.rd
samp3 ~/git/dryclean/inst/extdata/samp3.rds ~/git/dryclean/inst/extdata/decomp3.rd

With this table we can run the following:

grm = identify_germline(normal.table.path = "~/git/dryclean/inst/extdata/normal_table.rds", path.to.save = "~/git/dryclean/inst/extdata/", signal.thresh=0.5, pct.thresh=0.98)

Now we are ready for tumor decomposition

3. Running dryclean on tumor sample within R

Following is a dummy example. The data diretory has a dummy coverage gRanges object which requires "reads.corrected" field

coverage_file = readRDS("~/git/dryclean/data/dummy_coverage.rds")
coverage_file
GRanges object with 50 ranges and 1 metadata column:                                 
        seqnames    ranges strand |    reads.corrected                                   
           <Rle> <IRanges>  <Rle> |          <numeric>                                    
    [1]       22       1-3      * |   2.86974197952077                                    
    [2]       22       3-5      * |   2.22116750082932                                   
    [3]       22       5-7      * |   3.57646101620048                                    
    [4]       22       7-9      * |   3.28955231001601                               
    [5]       22      9-11      * | 0.0134209531825036 
    ...      ...       ...    ... .                ...
   [46]       22     91-93      * |   1.89621421624906
   [47]       22     93-95      * |    4.1652654542122
   [48]       22     95-97      * |   1.22946820687503
   [49]       22     97-99      * |   2.79558349982835
   [50]       22    99-101      * |    1.8819149560295
   -------                                                                                                                                                                                                                                                   
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

In order to run dryclean, simply invoke the following function

cov_out = start_wash_cycle(cov = coverage_file, detergent.pon.path = "~/git/dryclean/inst/extdata", whole_genome = TRUE, chr = NA, germline.filter = TRUE, germline.file = "~/git/dryclean/inst/extdata/germline.markers.rds")
head(cov_out)
GRanges object with 6 ranges and 7 metadata columns:
       seqnames    ranges strand |      background.log    foreground.log       
          <Rle> <IRanges>  <Rle> |           <numeric>         <numeric>      
   [1]       22       1-3      * |  0.0319473954249811 0.667269772358936 
   [2]       22       3-5      * |   0.285896847736708                 0    
   [3]       22       5-7      * | -0.0356340182110771 0.886416246387707
   [4]       22       7-9      * |   0.306494769047133 0.780626154020659
   [5]       22      9-11      * |  -0.134142034550078 -4.39454085149961
   [6]       22     11-13      * |   0.568947906575528 0.713624255729269  
        input.read.counts      median.chr         foreground        background 
                <numeric>       <numeric>          <numeric>         <numeric> 
   [1]   2.86974197952077 2.5809692934854   1.94890908484404  1.03246319158914 
   [2]   2.22116750082932 2.5809692934854                  1  1.33095515713431  
   [3]   3.57646101620048 2.5809692934854   2.42641836548012 0.964993398874529 
   [4]   3.28955231001601 2.5809692934854   2.18283863086114  1.35865436040174 
   [5] 0.0134209531825036 2.5809692934854 0.0123445470046525 0.874465851418496  
   [6]   3.07809003512375 2.5809692934854   2.04137633774307    1.766407647601  
               log.reads                                                           
               <numeric>                                                             
   [1]  1.05422212312404                                        
   [2] 0.798032958921047                                       
   [3]  1.27437376852101                                 
   [4]  1.19075147953513                                      
   [5] -4.31093812294871                                  
   [6]  1.12430928616619               
   -------                        
   seqinfo: 1 sequence from an unspecified genome 

The output has following metadata fields:

  1. background.log: This is the L low ranked vector after decomposition and represent the background noise separated by dryclean in the log space
  2. foreground.log: The S vector with the inferred copy number signal separated by dryclean, that forms foreground, in the log space
  3. reads: Raw read counts
  4. gc: GC score
  5. map: mappability score
  6. input.read.counts: This is the mean-normalized count input in linear space
  7. median.chr: median chromosome signal
  8. blacklisted: if off target marker list is available and used
  9. foreground: Foreground signal, that forms SCNAs (S vector) in read count/ratio space
  10. background: This is the L low ranked vector after decomposition and represent the background noise separated by dryclean in read count/ratio space
  11. log.reads: log of the mean-normalized count
  12. germline.blk: germline marker based on the inferred germline function

3. Running dryclean on tumor sample from command line

./drcln -i inst/extdata/samp1.rds -p inst/extdata/detergent.50.rds 
Rprofile Loading
Rprofile Finished Loading


▓█████▄   ██▀███  ██   ██▓ ▄████▄   ██▓    ▓█████ ▄▄▄       ███▄    █
 ██▀ ██▌ ▓██   ██  ██  ██  ██▀ ▀█  ▓██▒    ▓█   ▀ ████▄     ██ ▀█   █
░██   █▌ ▓██ ░▄█    ██ ██  ▓█    ▄  ██░    ░███   ██  ▀█▄   ██  ▀█ ██▒
░▓█▄   ▌ ▒██▀▀█▄   ░ ▐██▓ ▒▓▓▄ ▄██▒ ██░    ░▓█  ▄ ██▄▄▄▄█   ██▒  ▐▌██▒
░▒████▓  ░██▓  ██  ░ ██▒    ▓███▀ ░░█████ ▒█████▒ █     █▒ ██░   ▓██░
 ▒ ▓  ▒  ░  ▓ ░▒▓░  ██    ░ ░▒ ▒  ░░ ▒░▓  ░░░ ▒░ ░▒▒   ▓▒█░░ ▒░   ▒ ▒
 ░ ▒  ▒    ░▒ ░  ░  ░░▒░   ░  ▒   ░ ░ ▒  ░ ░ ░  ░ ▒   ▒▒ ░░ ░░   ░ ▒░
 ░ ░  ░    ░░   ░   ░  ░░  ░          ░ ░  ░    ░    ░   ▒      ░   ░ ░
   ░        ░     ░ ░     ░ ░          ░  ░   ░  ░     ░  ░     ░   ░
 ░               ░ ░     ░       ░    ░     ░     ░      ░     ░ 


(Let's dryclean the genomes!)

Loading PON a.k.a detergent from path provided
Let's begin, this is whole exome/genome
Initializing
Using the detergent provided to start washing
lambdas calculated
Here begins the wash cycle
calculating A and B
calculating v and s
Updating subspace
Combining matrices with gRanges
Giddy Up!

All the options and usage is as follows

./drcln -h
Rprofile Loading
Rprofile Finished Loading
Usage: ./drcln [options]


Options:
	-i INPUT, --input=INPUT
		Path to cov.rds file, mean-normalized count for sample under consideration

	-p PON, --pon=PON
		Path to Panel Of Normal (PON) on which batch rPCA have been run. Within the file should be L, S matrices, estimated rank for burnin samples and svd decomposition matrices for the same

	-m NAME, --name=NAME
		Sample / Individual name

	-b BLACKLIST, --blacklist=BLACKLIST
		blacklisted makers

	-w WHOLEGENOME, --wholeGenome=WHOLEGENOME
		If TRUE then it will process all chromosomes and parallelize it

	-C CHROMOSOME, --chromosome=CHROMOSOME
		If wholeGenome is FALSE, specify the chromosome to process

	-g GERMLINE.FILTER, --germline.filter=GERMLINE.FILTER
		If PON based germline filter is to be used for removing some common germline events, If set to TRUE, give path to germline annotated file

	-f GERMLINE.FILE, --germline.file=GERMLINE.FILE
		Path to file annotated with germline calls, if germline.filter == TRUE

	-c CORES, --cores=CORES
		How many cores to use

	-o OUTDIR, --outdir=OUTDIR
		output directory

	-k COLLAPSE, --collapse=COLLAPSE
		collapse 200bp fragCounter to 1kb, deprecatec

	-h, --help
		Show this help message and exit

Panel of Normal for 1kb WGS (hg19)

The Panel of Normal samples (PON) of 395 TCGA WGS normal samples was created using hierarchical clustering approach described above and filtered for CNPs.

The file is 17G in size.

WGS 1 kb PON: https://mskilab.s3.amazonaws.com/hg19/WGS/detergent.rds

dryclean's People

Contributors

evanbiederstedt avatar shaiberalon avatar mskilab avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.