
dryclean's Introduction


Dockerized installation

To simplify installation, we provide a Docker container with the latest stable release of Dryclean and its dependencies. This can be found here. The latest version is 0.0.2, so make sure to select the correct tag.


title: dryclean tutorial

Robust PCA based method to de-noise genomic coverage data.

Installations

Install devtools from CRAN

install.packages('devtools')

Set this to allow dependencies that throw warnings to be installed.

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = TRUE)

Install dependent packages and latest Bioconductor (if you haven't already)

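# biocLite() is deprecated; current Bioconductor releases are installed through BiocManager
if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
BiocManager::install('GenomicRanges')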

Install mskilab R dependencies (gUtils)

devtools::install_github('mskilab/gUtils')

Install dryclean

devtools::install_github('mskilab-org/dryclean')

(after installing R package) Add dryclean directory to PATH and test the executable

$ export PATH=${PATH}:$(Rscript -e 'cat(paste0(installed.packages()["dryclean", "LibPath"], "/dryclean/extdata/"))')
$ drcln -h ## to see the help message

Tutorial

Dryclean is a robust principal component analysis (rPCA) based method. It uses a panel of normal (PON) samples to learn the landscape of both biological and technical noise in read depth data. Dryclean then uses this landscape to significantly reduce noise and artifacts in the signal for tumor samples. The input to the algorithm is a GenomicRanges (GRanges) object containing read depth. You can use read counts from your favorite tool (there are many fast tools available, for example megadepth). In our experience, uncorrected read counts work well as input for Dryclean, but if you wish, you can use the GC- and mappability-corrected read depth data from fragCounter, which can be found at: https://github.com/mskilab/fragCounter.
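To make the rPCA idea concrete, here is a minimal base R sketch of a generic low-rank plus sparse decomposition (principal component pursuit solved with an inexact augmented Lagrangian scheme). It is not dryclean's own batch or online implementation; the function names `soft_threshold` and `rpca_sketch`, the toy data, and all tuning constants are assumptions made for illustration only.

```r
# Element-wise soft-thresholding (shrinkage) operator.
soft_threshold <- function(x, tau) sign(x) * pmax(abs(x) - tau, 0)

# Generic principal component pursuit: split M into low-rank L plus sparse S.
# Hypothetical helper for illustration, NOT dryclean's implementation.
rpca_sketch <- function(M, lambda = 1 / sqrt(max(dim(M))),
                        tol = 1e-6, max_iter = 500) {
  spectral_norm <- svd(M, nu = 0, nv = 0)$d[1]
  Y <- M / max(spectral_norm, max(abs(M)) / lambda)  # dual variable
  mu <- 1.25 / spectral_norm                         # penalty parameter
  mu_max <- mu * 1e7
  S <- matrix(0, nrow(M), ncol(M))
  L <- S
  for (iter in seq_len(max_iter)) {
    # Low-rank update: shrink the singular values of (M - S + Y / mu).
    dec <- svd(M - S + Y / mu)
    d_shrunk <- pmax(dec$d - 1 / mu, 0)
    L <- dec$u %*% (d_shrunk * t(dec$v))
    # Sparse update: shrink the entries of (M - L + Y / mu).
    S <- soft_threshold(M - L + Y / mu, lambda / mu)
    residual <- M - L - S
    Y <- Y + mu * residual
    mu <- min(mu * 1.5, mu_max)
    if (norm(residual, "F") / norm(M, "F") < tol) break
  }
  list(L = L, S = S, iterations = iter)
}

# Toy data: a rank-1 "background" plus a few large sparse "events".
set.seed(1)
L0 <- outer(rnorm(40), rnorm(20))
S0 <- matrix(0, 40, 20)
S0[sample(length(S0), 40)] <- 5
M <- L0 + S0

fit <- rpca_sketch(M)
relative_residual <- norm(M - fit$L - fit$S, "F") / norm(M, "F")
```

In this toy setting the low-rank part plays the role of the background learned from normal samples, and the sparse part plays the role of the sample-specific signal.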

1. Creating Panel of Normal aka detergent

There are 2 options for instantiating the PON object:

Option 1: Load an existing PON from a path.

To load an existing PON from the path into the pon_object, run:

pon_object = pon$new(pon_path = "~/git/dryclean/inst/extdata/detergent.rds")

Option 2: Create a new PON from normal samples.

To create a new PON, the vector with paths to the normal samples is needed.

The following is an example of such a vector:

normal_vector_example = readRDS("~/git/dryclean/inst/extdata/normal_vector.rds")
normal_vector_example

[1] "/git/dryclean/extdata/samp1.rds"
[2] "/git/dryclean/extdata/samp2.rds"
[3] "/git/dryclean/extdata/samp3.rds"
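If you do not have such a vector yet, one plausible way to build it is to collect the paths in a character vector and save it with `saveRDS()`. This is a sketch using base R only; the file paths below are hypothetical placeholders, and the existence check is an optional sanity step rather than something dryclean requires you to run.

```r
# Hypothetical paths; replace them with your own normal-sample coverage files.
normal_paths <- c(
  "normals/samp1.rds",
  "normals/samp2.rds",
  "normals/samp3.rds"
)

# Save the vector as .rds so it can be reused later.
vector_file <- file.path(tempdir(), "normal_vector.rds")
saveRDS(normal_paths, vector_file)

# Reload it and list any paths that do not exist on disk.
reloaded <- readRDS(vector_file)
missing_files <- reloaded[!file.exists(reloaded)]
if (length(missing_files) > 0) {
  message("Missing coverage files: ", paste(missing_files, collapse = ", "))
}
```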

To make a new PON, you need to instantiate a PON object and set create_new_pon = TRUE.

pon_object = pon$new(
    create_new_pon = TRUE, 
    normal_vector = normal_vector_example
    )

NOTE: For best performance, we recommend using raw reads from normal samples in PON generation.

The parameters that could be used in PON generation:

Parameter | Default value | Description
--- | --- | ---
create_new_pon | FALSE | Whether to create a new PON from normal samples
pon_path | NULL | If create_new_pon==FALSE, the path to the existing PON; If create_new_pon==TRUE and save_pon == TRUE, the path to save the new PON
normal_vector | c() | Vector of paths to normal samples
save_pon | FALSE | If create_new_pon==TRUE, whether to save pon to the path given by pon_path
field | "reads.corrected" | Field name in GRanges metadata of normal samples to use for PON generation
use.all | TRUE | Whether all normal samples are to be used for creating PON
choose.randomly | FALSE | If use.all==FALSE, whether a random subset of normal samples are to be used for creating PON
choose.by.clustering | FALSE | Whether to cluster normal samples based on the genomic background and take a random sample from within the clusters
number.of.samples | 50 | If choose.by.clustering==TRUE or choose.randomly==TRUE, the number of clusters/samples to use
tolerance | 0.0001 | Tolerance for error for batch rPCA; we suggest keeping this value
num.cores | 1 | Number of cores to use for parallelization
verbose | TRUE | Whether to output progress
is.human | TRUE | Organism type
build | "hg19" | Genome build to define PAR region in chromosome X
PAR.file | NULL | GRanges with the boundaries of PAR region in X chr
balance | TRUE | Experimental variable to take into consideration 1 copy of X chr in male sample
infer.germline | FALSE | Whether to use the L matrix to infer germline events
signal.thresh | 0.5 | The threshold to be used to identify an amplification (markers with signal intensity > 0.5) or deletions (markers with signal intensity < -0.5) in log space from dryclean outputs
pct.thresh | 0.98 | Proportion of samples in which a given marker is free of germline event
wgs | TRUE | Whether whole genome is being used
target_resolution | 1,000 | Desired bin size of the PON
nochr | TRUE | Whether to remove chr prefix
all.chr | c(as.character(1:22), "X") | List of chromosomes
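To illustrate how a threshold such as signal.thresh can be read, the base R sketch below labels log-space signal values as amplification, deletion, or neutral using the rule stated in the description (greater than 0.5, or less than -0.5). The helper `classify_marker` and the input values are made up for illustration; this is not dryclean's internal germline inference code.

```r
# Hypothetical helper: label log-space signal values with the threshold rule
# described for signal.thresh (amplification > thresh, deletion < -thresh).
classify_marker <- function(signal_log, signal_thresh = 0.5) {
  ifelse(signal_log > signal_thresh, "amplification",
         ifelse(signal_log < -signal_thresh, "deletion", "neutral"))
}

# Made-up log-space signal values for four markers.
signal_log <- c(-0.8, -0.2, 0.1, 0.7)
calls <- classify_marker(signal_log)
print(data.frame(signal_log = signal_log, call = calls))
```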

The pon_object contains the following methods:

1. get_L() - returns L, the low-rank matrix of all the PONs calculated by the batch robust PCA method
2. get_S() - returns S, the sparse matrix of all the PONs calculated by the batch robust PCA method
3. get_k() - returns k, the estimated rank of the matrix in which the coverage values from each normal sample form a column
4. get_U_hat() - returns U.hat, the left singular matrix from the SVD of L, required for the online implementation of rPCA
5. get_V_hat() - returns V.hat, the right singular matrix from the SVD of L, required for the online implementation of rPCA
6. get_sigma_hat() - returns sigma.hat, the first k singular values from the SVD of L, required for the online implementation of rPCA
7. get_inf_germ() - returns inf.germ, the inferred germline obtained from the normal samples
8. get_seqlengths() - returns seqlengths of each chromosome of the PON objects
9. get_history() - returns the history of actions on the pon object with timestamps
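The relationship between L, k, U.hat, V.hat and sigma.hat can be illustrated with a rank-k truncated SVD in base R. This is a sketch on a toy matrix, assuming the usual convention L ≈ U.hat diag(sigma.hat) t(V.hat); the variable names are illustrative and the exact shapes stored in a real PON object may differ.

```r
set.seed(42)

# Toy low-rank matrix: 30 bins x 6 normal samples, rank 2 by construction.
L <- matrix(rnorm(30 * 2), 30, 2) %*% matrix(rnorm(2 * 6), 2, 6)
k <- 2

# Keep only the first k singular triplets of L.
dec <- svd(L)
U_hat <- dec$u[, seq_len(k), drop = FALSE]
V_hat <- dec$v[, seq_len(k), drop = FALSE]
sigma_hat <- dec$d[seq_len(k)]

# Because L has rank k, the truncated factors rebuild it up to rounding error.
L_rebuilt <- U_hat %*% diag(sigma_hat, nrow = k) %*% t(V_hat)
max_abs_error <- max(abs(L - L_rebuilt))
```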

2. Normalizing the coverage aka drycleaning

The following is a dummy example. The data directory contains a dummy coverage GRanges object with a "reads.corrected" field.

coverage_file = readRDS("~/git/dryclean/inst/extdata/dummy_coverage.rds")
coverage_file
GRanges object with 50 ranges and 1 metadata column:
       seqnames    ranges strand | reads.corrected
          <Rle> <IRanges>  <Rle> |       <numeric>
   [1]       22       1-3      * |        2.869742
   [2]       22       3-5      * |        2.221168
   [3]       22       5-7      * |        3.576461
   [4]       22       7-9      * |        3.289552
   [5]       22      9-11      * |        0.013421
   ...      ...       ...    ... .             ...
  [46]       22     91-93      * |         1.89621
  [47]       22     93-95      * |         4.16527
  [48]       22     95-97      * |         1.22947
  [49]       22     97-99      * |         2.79558
  [50]       22    99-101      * |         1.88191
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
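If you want to build a coverage object of the same shape yourself, the sketch below creates a GRanges with a "reads.corrected" metadata column and saves it as .rds. It assumes the GenomicRanges package is installed; the bin coordinates mimic the dummy object above, and the read depth values are random numbers standing in for real coverage.

```r
library(GenomicRanges)

set.seed(1)
# 50 bins on chromosome 22 with the same coordinates as the dummy example.
starts <- seq(1, 99, by = 2)
coverage_gr <- GRanges(
  seqnames = "22",
  ranges = IRanges(start = starts, width = 3),
  # Random placeholder values; real data would come from your read counter.
  reads.corrected = runif(length(starts), min = 0, max = 5)
)

# Save to a temporary path so the file can be passed on as a path.
coverage_path <- file.path(tempdir(), "my_coverage.rds")
saveRDS(coverage_gr, coverage_path)
```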

To run dryclean, first instantiate a dryclean class object with the previously created pon object.

dryclean_object <- dryclean$new(pon = pon_object)

After initializing the dryclean object, use the clean function to normalize the coverage with the path to the coverage data as the required cov parameter. For the sake of example, we set the parameter testing=TRUE but typically, you would leave it at its default value.

dryclean_object$clean(cov = "~/git/dryclean/inst/extdata/dummy_coverage.rds", testing = TRUE)
GRanges object with 50 ranges and 7 metadata columns:
       seqnames    ranges strand | background.log foreground.log input.read.counts median.chr foreground background log.reads
          <Rle> <IRanges>  <Rle> |      <numeric>      <numeric>         <numeric>  <numeric>  <numeric>  <numeric> <numeric>
   [1]       22       1-3      * |    -0.00161363      0.0130496        1.16515870    1.04791 1.01313509   0.998388  0.152857
   [2]       22       3-5      * |    -0.00515368      0.0000000        0.90182764    1.04791 1.00000000   0.994860 -0.103332
   [3]       22       5-7      * |    -0.00162024      0.2332078        1.45209733    1.04791 1.26264386   0.998381  0.373009
   [4]       22       7-9      * |    -0.00176592      0.1497312        1.33560805    1.04791 1.16152201   0.998236  0.289387
   [5]       22      9-11      * |    -0.00147886     -5.0694027        0.00544911    1.04791 0.00628617   0.998522 -5.212303
   ...      ...       ...    ... .            ...            ...               ...        ...        ...        ...       ...
  [46]       22     91-93      * |    -0.00161919      -0.118465          0.769892    1.04791   0.888283   0.998382 -0.261505
  [47]       22     93-95      * |    -0.00515368       0.389148          1.691161    1.04791   1.475722   0.994860  0.525415
  [48]       22     95-97      * |    -0.00515368      -0.548208          0.499183    1.04791   0.577985   0.994860 -0.694783
  [49]       22     97-99      * |    -0.00176489       0.000000          1.135049    1.04791   1.000000   0.998237  0.126676
  [50]       22    99-101      * |    -0.00176960      -0.125884          0.764086    1.04791   0.881717   0.998232 -0.269075
  -------
  seqinfo: 1 sequence from an unspecified genome

The output has the following metadata fields:

1. background.log: The low-rank L vector after decomposition; it represents the background noise separated by dryclean, in log space
2. foreground.log: The S vector with the inferred copy number signal separated by dryclean, which forms the foreground, in log space
3. input.read.counts: The mean-normalized count input in linear space
4. median.chr: The median chromosome signal
5. foreground: The foreground signal, which forms SCNAs (S vector), in read count/ratio space
6. background: The low-rank L vector after decomposition; it represents the background noise separated by dryclean, in read count/ratio space
7. log.reads: The log of the median-normalized count
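The log-space and linear-space columns can be related with a quick check in base R. The sketch below copies the first three rows of the example output above and tests whether the linear columns match the exponential of the log columns, and whether log.reads matches the natural log of input.read.counts. This is an observation on the printed example rows, not a documented guarantee of the package.

```r
# First three rows of the example clean() output shown above.
out <- data.frame(
  background.log    = c(-0.00161363, -0.00515368, -0.00162024),
  foreground.log    = c(0.0130496, 0.0000000, 0.2332078),
  input.read.counts = c(1.16515870, 0.90182764, 1.45209733),
  foreground        = c(1.01313509, 1.00000000, 1.26264386),
  background        = c(0.998388, 0.994860, 0.998381),
  log.reads         = c(0.152857, -0.103332, 0.373009)
)

# Compare up to the rounding of the printed values.
fg_ok <- isTRUE(all.equal(exp(out$foreground.log), out$foreground, tolerance = 1e-4))
bg_ok <- isTRUE(all.equal(exp(out$background.log), out$background, tolerance = 1e-4))
lr_ok <- isTRUE(all.equal(log(out$input.read.counts), out$log.reads, tolerance = 1e-4))
```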

The parameters that can be used in clean() function:

Parameter | Default value | Description
--- | --- | ---
cov | REQUIRED | Path to the GRanges coverage file to be normalized
field | "reads.corrected" | Field name in GRanges metadata of coverage to use for drycleaning
center | TRUE | Whether to center the coverage
cbs | FALSE | Whether to perform cbs on the drycleaned coverage; If TRUE, saves CBS coverage as cbs_output.rds in output directory
cnsignif | 1e-5 | The significance levels for the tests in cbs to accept change-points
mc.cores | 1 | Number of cores to use for parallelization
use.blacklist | FALSE | Whether to exclude off-target markers in case of exomes or targeted sequencing; If set to TRUE, it will use a default mask unless blacklist_path gives the path to a GRanges object marking whether each marker is to be excluded
blacklist_path | NA | If use.blacklist == TRUE, path to a GRanges object marking whether each marker is to be excluded
germline.filter | FALSE | Whether germline markers need to be removed from decomposition
verbose | TRUE | Outputs progress

Prerequisites for 'dryclean' to work correctly:

  • The number of bins on each chromosome in the coverage and PON (Panel of Normal) data must match. If you attempt to normalize the coverage with PON data that has a different number of bins, you will encounter an error. In that case, you can use the get_mismatch() method to obtain a data table of all chromosomes with mismatched lengths.
  • The coverage data has to be centered. If the coverage has not been centered, set center=TRUE. NOTE: If you used fragCounter to correct the coverage, it has already been centered (set center=FALSE).
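The bin-count prerequisite can be checked up front. The sketch below compares the number of bins per chromosome between two sets of chromosome labels using base R only. The helpers `count_bins` and `find_bin_mismatches` are hypothetical and are not the package's get_mismatch() method; with real data the labels would come from the seqnames of the coverage and of the PON.

```r
# Hypothetical helper: number of bins per chromosome as a data frame.
count_bins <- function(seqnames) {
  counts <- table(seqnames)
  data.frame(seqnames = names(counts), bins = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: chromosomes whose bin counts differ.
find_bin_mismatches <- function(cov_seqnames, pon_seqnames) {
  merged <- merge(count_bins(cov_seqnames), count_bins(pon_seqnames),
                  by = "seqnames", all = TRUE, suffixes = c(".cov", ".pon"))
  # A chromosome missing on one side counts as zero bins there.
  merged$bins.cov[is.na(merged$bins.cov)] <- 0L
  merged$bins.pon[is.na(merged$bins.pon)] <- 0L
  merged[merged$bins.cov != merged$bins.pon, , drop = FALSE]
}

# Toy labels: chromosome 2 has 4 bins in the coverage but 6 in the PON.
cov_chr <- c(rep("1", 5), rep("2", 4), rep("X", 3))
pon_chr <- c(rep("1", 5), rep("2", 6), rep("X", 3))
mismatches <- find_bin_mismatches(cov_chr, pon_chr)
print(mismatches)
```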

Additionally, you can use the get_history() method to review all actions performed on the object with timestamps.

3. Running dryclean on tumor sample from command line

Dryclean CLI offers two modes of operation: PON generation and normalization using an existing PON.

Mode 1 (Default): Coverage Normalization with an existing PON. To select this mode, set --mode 'coverage'. In this mode, Dryclean employs an existing PON specified by --pon to normalize the GRanges coverage provided with --input. The normalized coverage is saved as GRanges in the --outdir directory (default = './').

Example (Note: --testing TRUE only for example purposes; typically, you would use the default value):

./drcln --input inst/extdata/samp1.rds --pon inst/extdata/detergent.rds --testing TRUE
▓█████▄   ██▀███  ██   ██▓ ▄████▄   ██▓    ▓█████ ▄▄▄       ███▄    █
 ██▀ ██▌ ▓██   ██  ██  ██  ██▀ ▀█  ▓██▒    ▓█   ▀ ████▄     ██ ▀█   █
░██   █▌ ▓██ ░▄█    ██ ██  ▓█    ▄  ██░    ░███   ██  ▀█▄   ██  ▀█ ██▒
░▓█▄   ▌ ▒██▀▀█▄   ░ ▐██▓ ▒▓▓▄ ▄██▒ ██░    ░▓█  ▄ ██▄▄▄▄█   ██▒  ▐▌██▒
░▒████▓  ░██▓  ██  ░ ██▒    ▓███▀ ░░█████ ▒█████▒ █     █▒ ██░   ▓██░
 ▒ ▓  ▒  ░  ▓ ░▒▓░  ██    ░ ░▒ ▒  ░░ ▒░▓  ░░░ ▒░ ░▒▒   ▓▒█░░ ▒░   ▒ ▒
 ░ ▒  ▒    ░▒ ░  ░  ░░▒░   ░  ▒   ░ ░ ▒  ░ ░ ░  ░ ▒   ▒▒ ░░ ░░   ░ ▒░
 ░ ░  ░    ░░   ░   ░  ░░  ░          ░ ░  ░    ░    ░   ▒      ░   ░ ░
   ░        ░     ░ ░     ░ ░          ░  ░   ░  ░     ░  ░     ░   ░
 ░               ░ ░     ░       ░    ░     ░     ░      ░     ░ 


(Let's dryclean the genomes!)

ℹ Loading dryclean
Loading PON...
PON loaded
Loading coverage
Loading PON a.k.a detergent
Let's begin, this is whole exome/genome
Centering the sample
Initializing wash cycle
Using the detergent provided to start washing
lambdas calculated
calculating A and B
calculating v and s
Calculating b
Combining matrices with gRanges
Giddy Up!

Mode 2: PON Generation. To select this mode, set --mode 'pon'. In this mode, a new Panel of Normals (PON) is generated using a vector of normal samples saved as .rds, specified with --normal_vector flag. The newly created PON is then saved in the --outdir directory (default = './').

Example:

./drcln --mode "pon" --normal_vector inst/extdata/normal_vector.rds
▓█████▄   ██▀███  ██   ██▓ ▄████▄   ██▓    ▓█████ ▄▄▄       ███▄    █
 ██▀ ██▌ ▓██   ██  ██  ██  ██▀ ▀█  ▓██▒    ▓█   ▀ ████▄     ██ ▀█   █
░██   █▌ ▓██ ░▄█    ██ ██  ▓█    ▄  ██░    ░███   ██  ▀█▄   ██  ▀█ ██▒
░▓█▄   ▌ ▒██▀▀█▄   ░ ▐██▓ ▒▓▓▄ ▄██▒ ██░    ░▓█  ▄ ██▄▄▄▄█   ██▒  ▐▌██▒
░▒████▓  ░██▓  ██  ░ ██▒    ▓███▀ ░░█████ ▒█████▒ █     █▒ ██░   ▓██░
 ▒ ▓  ▒  ░  ▓ ░▒▓░  ██    ░ ░▒ ▒  ░░ ▒░▓  ░░░ ▒░ ░▒▒   ▓▒█░░ ▒░   ▒ ▒
 ░ ▒  ▒    ░▒ ░  ░  ░░▒░   ░  ▒   ░ ░ ▒  ░ ░ ░  ░ ▒   ▒▒ ░░ ░░   ░ ▒░
 ░ ░  ░    ░░   ░   ░  ░░  ░          ░ ░  ░    ░    ░   ▒      ░   ░ ░
   ░        ░     ░ ░     ░ ░          ░  ░   ░  ░     ░  ░     ░   ░
 ░               ░ ░     ░       ░    ░     ░     ░      ░     ░ 


(Let's dryclean the genomes!)

ℹ Loading dryclean
Loading PON...
WARNING: New PON will be generated and saved at ./pon.rds

Giving you some time to think...

Starting the preparation of Panel of Normal samples a.k.a detergent
3 samples available
Using all samples
PAR file not provided, using hg19 default. If this is not the correct build, please provide a GRanges object delineating for corresponding build
PAR read
Checking for existence of files
3 files present
Starting decomposition
This is version 2
Finished making the PON
Finished saving the PON to the provided path
PON loaded
Giddy Up!

All CLI options:

./drcln -h
Options:
	--mode=MODE
		Mode of operation: 'pon' or 'coverage'. Set to 'pon' for PON generation and 'coverage' for normalizing a sample using existing PON

	-p PON, --pon=PON
		path to the existing Panel Of Normal (PON) saved as .rds

	-i INPUT, --input=INPUT
		path to the coverage file in GRanges format saved as .rds

	-t CENTER, --center=CENTER
		whether to center the coverage before drycleaning

	-s CBS, --cbs=CBS
		whether to perform cbs on the drycleaned coverage

	-n CNSIGNIF, --cnsignif=CNSIGNIF
		the significance levels for the test to accept change-points in cbs

	-c CORES, --cores=CORES
		number of cores to use

	-b BLACKLIST, --blacklist=BLACKLIST
		whether there are blacklisted makers

	-l BLACKLIST_PATH, --blacklist_path=BLACKLIST_PATH
		if --blacklist == TRUE, path to a GRanges object marking if each marker is set to be excluded or it willuse a default mask

	-g GERMLINE.FILTER, --germline.filter=GERMLINE.FILTER
		if PON based germline filter is to be used for removing some common germline events, if set to TRUE, give path to germline annotated file

	-m HUMAN, --human=HUMAN
		whther the samples under consideration are human

	-F FIELD, --field=FIELD
		field name in GRanges metadata to use for drycleaning

	-C ALL.CHR, --all.chr=ALL.CHR
		list of chromosomes to dryclean

	-B BUILD, --build=BUILD
		hg19/hg38 build for human samples

	-T TESTING, --testing=TESTING
		DO NOT CHANGE

	--normal_vector=NORMAL_VECTOR
		if mode = 'pon', path to a vector containing normal coverages in GRanges format saved as .rds

	--field_pon=FIELD_PON
		field name in GRanges metadata of normal samples to use for pon generation

	-o OUTDIR, --outdir=OUTDIR
		output directory

	-h, --help
		show this help message and exit

Panel of Normal for 1kb WGS (hg19)

The Panel of Normals (PON) was created from 395 TCGA WGS normal samples using the hierarchical clustering approach described above and was filtered for copy number polymorphisms (CNPs).

The file is 16 GB in size.

WGS 1 kb PON: https://mskilab-pipeline.s3.amazonaws.com/dryclean/pon/hg19/fixed.detergent.rds

dryclean's People

Contributors

evanbiederstedt, jrafailov, mskilab, sc13-bioinf, sebastian-brylka, shaiberalon, shihabdider, tanubrata, zining01


dryclean's Issues

"Error in m.vec - s : non-conformable arrays" when running the tutorial

Hi,

I've tried to follow the tutorial to run dryclean on tumor sample within R but have run into this error:

Error in m.vec - s : non-conformable arrays

What I have run is:

# Install the latest version of dryclean
devtools::install_github("mskilab/dryclean", ref = "87a1a4f")

#
# Start of the tutorial
#
options(warn = 1)
library("dryclean")
library("magrittr")
library("GenomicRanges")

normal_dt <-
  data.frame(sample = c("samp1", "samp2", "samp3")) %>%
  dplyr::mutate(
    normal_cov =
      system.file(
        "extdata", paste0(.data[["sample"]], ".rds"), package = "dryclean"
      ),
  ) %>%
  data.table::setDT()

saveRDS(normal_dt, "normal_table.rds")

dir.create("detergent", showWarnings = FALSE)

# use.all: Use all samples
# save.pon: Saves the PoN (detergent) to the destinated folder
detergent <-
  prepare_detergent(
    normal.table.path = "normal_table.rds",
    path.to.save = "detergent",
    num.cores = 1,
    use.all = TRUE,
    save.pon = TRUE
  )

# Running dryclean on tumor sample within R
coverage_file <-
  readRDS(system.file("extdata", "dummy_coverage.rds", package = "dryclean"))
cov_out <-
  start_wash_cycle(
    cov = coverage_file,
    detergent.pon.path = file.path("detergent", "detergent.rds"),
    whole_genome = TRUE,
    chr = NA,
  )

This was the output:

Starting the preparation of Panel of Normal samples a.k.a detergent
3 samples available
Using all samples
Balancing pre-decomposition
PAR file not provided, using hg19 default.
If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
3 files present
  |=================================================================================================================================================================================================================================
Warning in .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
Starting decomposition
This is version 2
Finished making the PON or detergent and saving it to the path provided

Loading PON a.k.a detergent from path provided
Let's begin, this is whole exome/genome
Initializing wash cycle
Using the detergent provided to start washing
lambdas calculated
calculating A and B
calculating v and s
Error in m.vec - s : non-conformable arrays

Any idea what is wrong?
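For readers who hit the same message: in R, "non-conformable arrays" is raised when two matrices with different dimensions are subtracted. The base R sketch below reproduces the message with toy matrices. Whether a bin-count mismatch between sample and PON is the actual cause in this report is an assumption, not something confirmed in this thread.

```r
# Toy stand-ins: a sample vector with 6 bins and a sparse vector with 4 bins.
m_vec <- matrix(1:6, ncol = 1)
s <- matrix(0, nrow = 4, ncol = 1)

# Subtracting matrices of different shapes raises the error seen above.
result <- tryCatch(m_vec - s, error = function(e) conditionMessage(e))
print(result)

# Comparing dimensions first gives a clearer diagnosis.
same_shape <- identical(dim(m_vec), dim(s))
```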

Issue with pon.binsize

Hello,

I am attempting to use Dryclean with the PON provided at the bottom of the Readme, however I am receiving the below error:

`(Let's dryclean the genomes!)

Loading PON...
PON loaded
Loading coverage
Loading PON a.k.a detergent
Error in if (tumor.binsize != pon.binsize & testing == FALSE) { :
argument is of length zero
Calls:
2: (function ()
traceback(2))()
1: dryclean_object$clean(cov = opt$input, center = opt$center, cbs = opt$cbs,
cnsignif = opt$cnsignif, mc.cores = opt$cores, verbose = TRUE,
use.blacklist = opt$blacklist, blacklist_path = opt$blacklist_path,
germline.filter = opt$germline.filter, field = opt$field,
testing = opt$testing)`

When running dryclean_object$clean in R directly, it appears that my coverage file has a 1000bp bin size as expected, but the PON is returning NULL when pon.binsize is set. Do you have any insight into what may be causing this?
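For context on the message itself: comparing a number with NULL in R yields a zero-length logical, and `if` refuses zero-length conditions. The base R sketch below reproduces that behavior and shows one possible guard. The variable names mirror the error message, and `check_binsize` is a hypothetical helper, not part of dryclean.

```r
pon_binsize <- NULL     # what the PON appears to return in this report
tumor_binsize <- 1000

# The comparison does not give TRUE or FALSE; it gives logical(0).
comparison <- tumor_binsize != pon_binsize

# Using it in if() raises "argument is of length zero".
msg <- tryCatch(
  if (comparison) "mismatch" else "ok",
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical guard: fail early with a clearer message when a size is missing.
check_binsize <- function(tumor, pon) {
  if (is.null(tumor) || is.null(pon)) {
    stop("bin size is missing for the tumor coverage or the PON")
  }
  tumor == pon
}
guarded <- tryCatch(check_binsize(tumor_binsize, pon_binsize),
                    error = function(e) conditionMessage(e))
```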

Typo in tutorial

Hi. Section 2 of the tutorial uses a data.table termed normal_table_example with the last column termed decompose_cov. This has to be corrected since the R code uses decomposed_cov (dryclean.R#L299).

prepare_detergent failing when using all samples

Hello,

After collecting a test set of fragCounter coverage profiles for 4 normal samples, I attempted to run the dryclean workflow.
I encountered the following error while trying the first step of creating the PoN in prepare_detergent:

pon_detergent <- prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
                                   use.all = TRUE,
                                   num.cores = 2,
                                   build = "hg38",
                                   path.to.save = "drycleanRun/",
                                   nochr = T,
                                   save.pon = T)

### OUTPUT ###
Starting the preparation of Panel of Normal samples a.k.a detergent
4 samples available
Using all samples
PAR file not provided, using hg38 default. If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
4 files present
  |=====================================================================================================================| 100%, Elapsed 07:21
Error in setattr(ans, "names", c(keep.names, paste0("V", seq_len(length(ans) -  : 
  'names' attribute [1] must be the same length as the vector [0]

While troubleshooting, it seems like others have encountered the same error, but at a different stage of the workflow (#2).
Based on the output message, it looks like the error occurs within pbmclapply function call at line 259 although I am not exactly sure where.

I then decided to test prepare_detergent under the other possible approaches instead of using all samples.
Interestingly, using either of the two alternative options choose.randomly = TRUE or choose.by.clustering = TRUE both executed without an error.

Here using choose.randomly = TRUE and selecting 2 of the 4 samples:

pon_detergent <- prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
                                   use.all = FALSE,
                                   choose.randomly = TRUE,
                                   number.of.samples = 2,
                                   choose.by.clustering = FALSE,
                                   num.cores = 2,
                                   build = "hg38",
                                   path.to.save = "drycleanRun/",
                                   nochr = T,
                                   save.pon = T)

### OUTPUT ###
Starting the preparation of Panel of Normal samples a.k.a detergent
4 samples available
Selecting 2 normal samples randomly
PAR file not provided, using hg38 default. If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
2 files present
  |============================================================================================================| 100%, Elapsed 03:28
Starting decomposition
This is version 2
Warning: Item 1 has 3031053 rows but longest item has 15155223; recycled with remainder.Finished making the PON or detergent and saving it to the path provided

And here using choose.by.clustering = TRUE

pon_detergent <- prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
                                   use.all = FALSE,
                                   choose.randomly = FALSE,
                                   number.of.samples = 2,
                                   choose.by.clustering = TRUE,
                                   num.cores = 2,
                                   build = "hg38",
                                   path.to.save = "drycleanRun/",
                                   nochr = T,
                                   save.pon = T)

### OUTPUT ###
Starting the preparation of Panel of Normal samples a.k.a detergent
4 samples available
Starting the clustering
Starting decomposition on a small section of genome
This is version 2
Starting clustering
PAR file not provided, using hg38 default. If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
2 files present
  |============================================================================================================| 100%, Elapsed 01:52
Starting decomposition
This is version 2
Warning: Item 1 has 3031053 rows but longest item has 15155223; recycled with remainder.Finished making the PON or detergent and saving it to the path provided

The output detergent.rds is in working order as I was able to run start_wash_cycle without any problems.
I will likely use the clustering method for further analysis but wanted to point out this issue for others who encounter it.

Best,
Patrick

could not find function "identify_germline"

Hi,
I ran the following code:

grm = identify_germline(normal.table.path = "~/git/dryclean/inst/extdata/normal_table.rds", path.to.save = "~/git/dryclean/inst/extdata/", signal.thresh=0.5, pct.thresh=0.98)

and got this error:

could not find function "identify_germline"

Has identify_germline been removed? Is the step of identifying germline events still needed?

apply dryclean to other ASCN algorithms

Hi, you did a great job on this. I would like to feed the corrected rds data to other ASCN algorithms such as Sequenza. I also noticed that the paper mentions that the output of dryclean could be fed to more sophisticated segmentation algorithms. Could you give me an example? Thanks in advance!

Jay

na.omit wiping out data.table

Hi. I have an issue using the identify_germline function, where na.omit (dryclean.R#L310) destroys my entire data.table (0 lines = 0 samples). My understanding is that na.omit removes the entire line (corresponding to a sample) if it finds any NA value. But you do have NA values in there when using WGS because of low-complexity regions, telomeres, centromeres, etc. Typically, my first positions correspond to the chr1 telomere, where I don't have any mapped reads. Shouldn't it remove columns instead (corresponding to a genomic window where a single sample has an NA value)? Aren't lines 310 and 311 inverted (transpose the data.table and then remove NA regions)?
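The behavior described here can be reproduced with a toy matrix in base R: when rows are samples and every sample has at least one NA bin, na.omit() drops every row, whereas filtering on columns keeps the bins that are complete across samples. This is a sketch of the general R behavior only, with made-up values, and does not reproduce the package code at the cited lines.

```r
# Toy coverage: 3 samples (rows) x 5 genomic bins (columns).
cov_mat <- matrix(
  c(NA, 1.0, 1.1, 0.9, 1.0,
    NA, 0.8, 1.2,  NA, 1.1,
    NA, 1.1, 0.9, 1.0, 0.9),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("samp", 1:3), paste0("bin", 1:5))
)

# Row-wise removal: every sample has an NA in bin1, so nothing is left.
rows_kept <- na.omit(cov_mat)

# Column-wise removal: keep only bins with no NA in any sample.
complete_bins <- colSums(is.na(cov_mat)) == 0
cols_kept <- cov_mat[, complete_bins, drop = FALSE]
```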
