zhenxingguo0015 / tress

Toolbox for RNA Methylation Sequencing Analysis

License: GNU General Public License v3.0

tress's Introduction

Analyzing MeRIP-seq data with TRESS

TRESS is an R package designed for the analysis of RNA methylation sequencing data.

Post-transcriptional epigenetic modification of mRNA is an emerging field for studying gene regulatory mechanisms and their association with diseases. A recently developed high-throughput sequencing technology named Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq) enables one to profile mRNA epigenetic modification transcriptome-wide. Two major tasks in the analysis of MeRIP-seq data are to identify transcriptome-wide m6A regions (namely "peak calling") and differential m6A regions (differential peak calling).

Our package TRESS provides functions for peak calling and differential peak calling of MeRIP-seq data, based on empirical Bayesian hierarchical models. The method accounts for various sources of variation in the data through rigorous modeling, and achieves shrinkage estimation by borrowing information from transcriptome-wide data to stabilize the parameter estimation.

Here, we briefly describe how to install the TRESS package from GitHub. For detailed usage of TRESS, please refer to the vignette file.

Installation

From GitHub:

install.packages("devtools") # if you have not installed "devtools" package
library(devtools)
install_github("https://github.com/ZhenxingGuo0015/TRESS", build_vignettes = TRUE)

To view the package vignette in HTML format, run the following lines in R

library(TRESS)
browseVignettes("TRESS")

Quick start on peak calling

Here we provide quick examples of how TRESS performs peak calling and differential peak calling. Prior to analysis, TRESS requires paired input control and IP BAM files for each replicate of all samples: "input1.bam & ip1.bam", "input2.bam & ip2.bam", .... The BAM files contain mapped reads sequenced from the respective samples and are the output of sequence alignment tools like Bowtie2. In addition to BAM files, TRESS also needs the genome annotation of reads, saved in *.sqlite format.
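
Because the input and IP files are matched by position, it can help to check the pairing before calling TRESS. The following is a minimal sketch in base R; `check_bam_pairs()` is a hypothetical helper, not a TRESS function, and the demo uses empty placeholder files in a temporary directory rather than real BAM files.

```r
# Hypothetical helper (not part of TRESS): confirm that the input and IP
# file vectors have the same length and that every file exists in bam_dir.
check_bam_pairs <- function(input_files, ip_files, bam_dir) {
  if (length(input_files) != length(ip_files)) {
    stop("Input and IP file vectors must have the same length.")
  }
  all_files <- c(input_files, ip_files)
  missing <- all_files[!file.exists(file.path(bam_dir, all_files))]
  if (length(missing) > 0) {
    stop("Missing BAM files: ", paste(missing, collapse = ", "))
  }
  # One row per replicate, so the pairing can be inspected by eye
  data.frame(replicate = seq_along(input_files),
             input = input_files,
             ip = ip_files)
}

# Demo with empty placeholder files in a temporary directory
bam_dir <- tempfile("bam_demo")
dir.create(bam_dir)
input_files <- c("input1.bam", "input2.bam")
ip_files <- c("ip1.bam", "ip2.bam")
file.create(file.path(bam_dir, c(input_files, ip_files)))

pairs <- check_bam_pairs(input_files, ip_files, bam_dir)
print(pairs)
```

The returned table only reflects the order in which the files were listed; whether each row is a true biological pair still has to be confirmed against the experiment's sample sheet.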

For illustration purposes, we include four example BAM files and one corresponding genome annotation file in our publicly available data package datasetTRES on GitHub, which can be installed with

install_github("https://github.com/ZhenxingGuo0015/datasetTRES")

The BAM files contain sequencing reads (only on chromosome 19) from two input & IP mouse brain cerebellum samples. Given both BAM and annotation files, peak calling in TRESS is conducted by:

## Directly take BAM files in "datasetTRES" available on github
library(TRESS)
library(datasetTRES)
Input.file = c("cb_input_rep1_chr19.bam", "cb_input_rep2_chr19.bam")
IP.file = c("cb_ip_rep1_chr19.bam", "cb_ip_rep2_chr19.bam")
BamDir = file.path(system.file(package = "datasetTRES"), "extdata/")
annoDir = file.path(system.file(package = "datasetTRES"),
                    "extdata/mm9_chr19_knownGene.sqlite")
OutDir = "/directory/to/output"  
TRESS_peak(IP.file = IP.file,
           Input.file = Input.file,
           Path_To_AnnoSqlite = annoDir,
           InputDir = BamDir,
           OutputDir = OutDir, # specify a directory for output
           experiment_name = "examplebyBam", # name your output 
           filetype = "bam")

### example peaks
peaks = read.table(file.path(system.file(package = "TRESS"),
                           "extdata/examplebyBam_peaks.xls"),
                 sep = "\t", header = TRUE)
head(peaks[, -c(5, 14, 15)], 3)

To replace the example BAM files with your own BAM files, use the following code:

## or, take BAM files from your path
Input.file = c("input_rep1.bam", "input_rep2.bam")
IP.file = c("ip_rep1.bam", "ip_rep2.bam")
BamDir = "/directory/to/BAMfile"
annoDir = "/path/to/xxx.sqlite"
OutDir = "/directory/to/output"
TRESS_peak(IP.file = IP.file,
           Input.file = Input.file,
           Path_To_AnnoSqlite = annoDir,
           InputDir = BamDir,
           OutputDir = OutDir,
           experiment_name = "example",
           filetype = "bam")
peaks = read.table(paste0(OutDir, "/", 
                          "example_peaks.xls"), 
                   sep = "\t", header = TRUE)
head(peaks, 3)

Quick start on differential peak calling

If one has paired input and IP BAM files ("input1.bam & ip1.bam", "input2.bam & ip2.bam", ..., "inputN.bam & ipN.bam") for samples from different conditions, one can apply TRESS to call differential m6A methylation regions (DMRs). Note that the BAM files must be listed in an order consistent with their conditions; otherwise, samples from different conditions may be mistakenly treated as one group.

Because TRESS is designed for differential analysis under a general experimental design, sample attributes for all factors in the study must be provided, in addition to the BAM and genome annotation files, to construct a design matrix for model fitting. For this, TRESS requires a data frame (passed as the argument variable) containing, for each factor, the attribute value of every sample; the order of samples must be exactly the same as the order of the BAM files passed to TRESS.
A model (passed as the argument model) specifying which factors are included in the design matrix must also be provided.

All aforementioned input requirements are for model fitting in TRESS. For hypothesis testing, TRESS requires a contrast of coefficients. The contrast should be in line with the name and order of all coefficients in the design matrix. It can be a vector for simple linear relationship detection or a matrix for composite relationship detection.
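
To make the link between the design matrix and the contrast concrete, here is a minimal base-R sketch for a two-group comparison. It assumes the design matrix is built as `model.matrix(model, variable)`, which matches a call quoted in one of the issue reports on this page but is not otherwise documented here; the sample size of four is made up for illustration.

```r
# Two-group example: 2 control and 2 treated samples, in BAM-file order
variable <- data.frame(Trt = rep(c("Ctrl", "Trt"), each = 2))
model <- ~ 1 + Trt

# Base-R design matrix built from the same two inputs (assumption: this
# mirrors how the design matrix is constructed for model fitting)
design <- model.matrix(model, variable)
print(colnames(design))  # coefficient names, in design-matrix order

# A contrast has one entry per coefficient; this one tests the
# treatment effect, i.e. the second coefficient
contrast <- c(0, 1)
names(contrast) <- colnames(design)
print(contrast)
```

The contrast vector must have exactly as many entries as the design matrix has columns, in the same order.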

With all required information prepared, run:

InputDir = "/directory/to/BAMfile"
Input.file = c("input1.bam", "input2.bam",..., "inputN.bam")
IP.file = c("ip1.bam", "ip2.bam", ..., "ipN.bam")
OutputDir = "/directory/to/output"
Path_sqlit = "/path/to/xxx.sqlite"
variable = "YourVariable" # a dataframe containing both
# testing factor and potential covariates, 
# e.g., for two group comparison with balanced samples
# variable = data.frame(Trt = rep(c("Ctrl", "Trt"), each = N/2))
model = "YourModel"     # e.g. model = ~1 + Trt
DMR.fit = TRESS_DMRfit(IP.file = IP.file,
                       Input.file = Input.file,
                       Path_To_AnnoSqlite = Path_sqlit,
                       variable = variable,
                       model = model,
                       InputDir = InputDir,
                       OutputDir = OutputDir,
                       experimentName = "example"
                       )
CoefName(DMR.fit)# show the name of and order of coefficients 
                 # in the design matrix
Contrast = "YourContrast" # e.g., Contrast = c(0, 1)
DMR.test = TRESS_DMRtest(DMR = DMR.fit, contrast = Contrast)

As shown above, TRESS separates the model fitting (implemented by function TRESS_DMRfit()), which is the most computationally heavy part, from the hypothesis testing (implemented by function TRESS_DMRtest()). Given an experimental design with multiple factors, the parameter estimation (model fitting) only needs to be performed once, and then the hypothesis testing for DMR calling can be performed for different factors efficiently.

For detailed usage of the package, please refer to the vignette file through

browseVignettes("TRESS")

tress's Issues

Inconsistent peak merging across spliced regions

Hello,

We have been using TRESS to call m6A peaks. For most of our datasets everything works fine. This is the database we are using:
txdb=makeTxDbFromUCSC(genome="dm6", tablename="ncbiRefSeq")

However, for certain datasets the called peaks don't correspond to the location of the actual reads, and peaks are merged across regions that are spliced out of the transcript. I've attached an example where a peak is merged across genes. The coverage from the BAM files is shown below the blue bar, which is the peak region, for two datasets. As you can see, coverage is only in the 5' UTR of AGO2, but the called peak spans two genes. Again, this doesn't happen for the other six datasets we are working on; it's just these two. It is really confusing as to why this is happening.

CallDMRs.paramEsti error

Hi, I am trying to run TRESS in R/4.2.0 and I got the following error:
"Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': subscript out of bounds
Calls: TRESS_DMRfit ... CallDMRs.paramEsti -> MLE.parallel -> as.matrix -> "

If I run it step by step (DivideBins + CallCandidates + filterRegions + CallDMRs.paramEsti), everything works with a nice output except the last step, where I get the same error.

What can produce this error? I can share the filterRegions output if needed.

Thanks,

Error in Step2 for TRESS_DMRfit

Dear Zhenxing,
I'm running TRESS_DMRfit function from TRESS_1.2.0 to call differential m6A peaks at different time points after a drug treatment, but the command results in errors.
This is the command I ran:

DMR.fit <- TRESS_DMRfit(IP.file = basename(IP_bam_files),
             Input.file = basename(input_bam_files),
             InputDir = bam_dir,
             OutputDir = output_dir,
             Path_To_AnnoSqlite = annotation_sqlite,
             variable = variable_treatment,
             model = model_treatment,
             experimentName = "treatment_time_course")

with:

> model_treatment
[1] "~1 + time"

> str(model_treatment)
 chr "~1 + time"

and:

> variable_treatment
   time
1    0h
2    0h
3    0h
4    1h
5    1h
6    1h
7    2h
8    2h
9    2h
10   4h
11   4h
12   4h
13   8h
14   8h
15   8h
16  16h
17  16h
18  16h

> str(variable_treatment)
'data.frame':	18 obs. of  1 variable:
 $ time: chr  "0h" "0h" "0h" "1h" ...

I have 3 replicates for each time point in both input and IP, and bam files are provided in the same order as specified in the variable_treatment dataframe.
This is the error I got:

##### Divid the genome into bins and obtain bin counts...
Time used to obtain bin-level data is: 
31.86404
##### Step 1: Call candidate DMRs...
Merge bumps from different replicates...
The number of candidates is: 
25475
Time used in Step 1 is: 
11.58883
##### Step 2: Model fitting on candidates...
[1] "Start to estimate preliminary MLE ..."
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.matrix': error in evaluating the argument 'x' in selecting a method for function 'ncol': $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': GL000219.1
  - in 'y': KI270721.1, KI270711.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
2: In mclapply(seq_len(nrow(Ratio)), iMLE, X, Y, sx, sy, Ratio, D,  :
  all scheduled cores encountered errors in user code

The command that is failing is:

res.MLE = MLE.parallel(mat = as.matrix(counts),
                         sf = sf,
                         D = model.matrix(model, variable)
                         )

When I run it, I find:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.matrix': error in evaluating the argument 'x' in selecting a method for function 'ncol': $ operator is invalid for atomic vectors
In addition: Warning message:
In mclapply(seq_len(nrow(Ratio)), iMLE, X, Y, sx, sy, Ratio, D,  :
  all scheduled cores encountered errors in user code

In particular, it seems it doesn't like the model_treatment or variable_treatment variables:
D = model.matrix(model_treatment, variable_treatment)
I get:
Error: $ operator is invalid for atomic vectors

Could you please let me know if you spot any errors in the model_treatment and variable_treatment variables?
Thanks in advance,
Simone
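
One likely cause, judging only from the output above, is that model_treatment is a character string rather than a formula: base R's `model.matrix()` does not parse strings, so the call fails before any data are used. The sketch below reproduces this with the posted time points and shows the formula version working; it uses base R only and says nothing about other checks TRESS may perform.

```r
variable_treatment <- data.frame(
  time = rep(c("0h", "1h", "2h", "4h", "8h", "16h"), each = 3)
)

# A character string is not a formula, so model.matrix() raises an error;
# tryCatch() captures the message instead of stopping the script
model_string <- "~1 + time"
string_result <- tryCatch(
  model.matrix(model_string, variable_treatment),
  error = function(e) conditionMessage(e)
)
print(string_result)

# Converting the string to a formula (or writing ~1 + time directly) works
model_formula <- as.formula(model_string)

# Explicit factor levels keep the time points in chronological order;
# with plain character values, "16h" would sort before "1h"
variable_treatment$time <- factor(
  variable_treatment$time,
  levels = c("0h", "1h", "2h", "4h", "8h", "16h")
)
D <- model.matrix(model_formula, variable_treatment)
print(colnames(D))
```

With the formula in place, the design matrix has one intercept column plus one column per non-baseline time point, and contrasts should follow that column order.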

Error in calculating variance-covariance of coefficients

Here is the error message:

[1] "Calculate variance-covariance of coefficients..."
Error in R[i, ] <- res[[i]]$R :
number of items to replace is not a multiple of replacement length

TRESS_1.0.0 and 1.2.0 provide very different results

Dear developers,
I have been calling peaks with one single replicate, with the following command.

TRESS_peak(IP.file = basename(bam_files_IP_curr),
             Input.file = basename(bam_files_input_curr),
             Path_To_AnnoSqlite = annotation_sqlite,
             InputDir = input_dir,
             OutputDir = output_dir,
             experiment_name = peaks_curr,
             filetype = "bam",
             IncludeIntron = FALSE,
             binsize = 50,
             WhichThreshold = fdr_lfc,
             pval.cutoff0 = 0.00001,
             fdr.cutoff0 = 0.05,
             lfc.cutoff0 = 0.7,
             lowcount = 30)

However, when using TRESS_1.0.0 I obtained 7,718 peaks, while, when using TRESS_1.2.0 I got 24 peaks. I guess in one of the two versions there might be a bug, as the difference is huge. Did you experience similar behaviour in your hands?
I obtained 1.2.0 from Bioconductor with:
BiocManager::install("TRESS")
and version 1.0.0 from bioconda with:
conda install bioconductor-tress
Thanks,
Simone

findBumps in makeCGI

Hi,
I know this is not exactly related but I don't know where else I should ask this.
I recently tried to use makeCGI R package to create CpG island regions from my genome and it runs ok. But it suddenly crashes my whole session when it comes to annotateCGI function. I was going through this function and found out that it crashes on

tmp <- .Call("findBumps", as.integer(thispos), x[idx[[ichr]]], 
                 as.double(cutoff), as.double(sep), as.double(minlen), 
                 as.integer(minCount), as.double(dis.merge))

suggesting there is a C version of the function included in the package (apparently a bit different from the R version in this repo), but I have no idea how to get to its source code to look for any problems, since there is no repository for this package or anything.

I would be grateful if you had any idea what might be the problem or you could supply me with the C code version of the findBumps function since I wasn't able to stitch it together with your R function.

Thank you and here is my R session info if that would be any help:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Prague
tzcode source: system (glibc)

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] TRESS_1.5.1 BSgenome.MorexV3.Gatersleben_3.0
[3] makeCGI_1.3.4 TSRexploreR_0.1
[5] CAGEr_2.5.0 MultiAssayExperiment_1.24.0
[7] SummarizedExperiment_1.28.0 Biobase_2.58.0
[9] MatrixGenerics_1.14.0 matrixStats_1.2.0
[11] BSgenome_1.66.3 rtracklayer_1.58.0
[13] Biostrings_2.66.0 XVector_0.38.0
[15] GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
[17] IRanges_2.32.0 S4Vectors_0.36.1
[19] BiocGenerics_0.48.1

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 shape_1.4.6 rstudioapi_0.14
[4] jsonlite_1.8.4 magrittr_2.0.3 GenomicFeatures_1.50.4
[7] farver_2.1.1 GlobalOptions_0.1.2 BiocIO_1.8.0
[10] zlibbioc_1.44.0 vctrs_0.5.2 memoise_2.0.1
[13] Rsamtools_2.14.0 DelayedMatrixStats_1.20.0 RCurl_1.98-1.10
[16] ggtree_3.6.2 forcats_1.0.0 progress_1.2.2
[19] curl_5.0.0 gridGraphics_0.5-1 KernSmooth_2.23-21
[22] plyr_1.8.8 cachem_1.0.6 GenomicAlignments_1.34.0
[25] igraph_1.4.0 lifecycle_1.0.3 pkgconfig_2.0.3
[28] Matrix_1.5-3 R6_2.5.1 fastmap_1.1.0
[31] GenomeInfoDbData_1.2.9 digest_0.6.31 aplot_0.1.9
[34] enrichplot_1.18.3 colorspace_2.1-0 patchwork_1.1.2
[37] AnnotationDbi_1.60.0 RSQLite_2.3.0 vegan_2.6-4
[40] filelock_1.0.2 fansi_1.0.4 mgcv_1.8-42
[43] httr_1.4.4 polyclip_1.10-4 compiler_4.3.1
[46] bit64_4.0.5 withr_2.5.0 BiocParallel_1.32.5
[49] viridis_0.6.2 DBI_1.1.3 ggforce_0.4.1
[52] biomaRt_2.54.0 MASS_7.3-60 rappdirs_0.3.3
[55] DelayedArray_0.24.0 rjson_0.2.21 HDO.db_0.99.1
[58] permute_0.9-7 gtools_3.9.4 tools_4.3.1
[61] ape_5.7 scatterpie_0.1.8 glue_1.6.2
[64] restfulr_0.0.15 nlme_3.1-162 GOSemSim_2.24.0
[67] grid_4.3.1 stringdist_0.9.10 shadowtext_0.1.2
[70] cluster_2.1.4 reshape2_1.4.4 fgsea_1.24.0
[73] generics_0.1.3 operator.tools_1.6.3 gtable_0.3.1
[76] formula.tools_1.7.1 tidyr_1.3.0 hms_1.1.2
[79] data.table_1.14.8 xml2_1.3.3 tidygraph_1.2.3
[82] utf8_1.2.3 ggrepel_0.9.3 pillar_1.8.1
[85] stringr_1.5.0 yulab.utils_0.0.6 circlize_0.4.15
[88] splines_4.3.1 dplyr_1.1.0 tweenr_2.0.2
[91] BiocFileCache_2.6.1 treeio_1.22.0 lattice_0.21-8
[94] bit_4.0.5 tidyselect_1.2.0 GO.db_3.16.0
[97] gridExtra_2.3 graphlayouts_0.8.4 stringi_1.7.12
[100] VGAM_1.1-7 lazyeval_0.2.2 ggfun_0.0.9
[103] yaml_2.3.7 codetools_0.2-19 som_0.3-5.1
[106] ggraph_2.1.0 tibble_3.2.1 qvalue_2.30.0
[109] ggplotify_0.1.0 cli_3.6.0 munsell_0.5.0
[112] Rcpp_1.0.10 dbplyr_2.3.0 png_0.1-8
[115] XML_3.99-0.13 ellipsis_0.3.2 assertthat_0.2.1
[118] ggplot2_3.4.1 blob_1.2.3 prettyunits_1.1.1
[121] DOSE_3.24.2 plyranges_1.18.0 sparseMatrixStats_1.10.0
[124] bitops_1.0-7 viridisLite_0.4.1 tidytree_0.4.2
[127] scales_1.2.1 purrr_1.0.1 crayon_1.5.2
[130] rlang_1.1.1 cowplot_1.1.1 fastmatch_1.1-3
[133] KEGGREST_1.38.0

Error in updating phi with its posterior

Here is the error message:

[1] "Update phi with its posterior ..."
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': requires numeric/complex matrix/vector arguments

About output files

Hi,

I am looking for a description document about the TRESS output file. What are the meanings of values such as mu, mu.var, stats, shrkPhi, and shrkTheta? Where can I find the description?

Thanks,
LeeLee

How to setup TRESS_DMRfit?

My experimental setup is the following:
3 replicates of IP versus Input, for three groups (A, C and D) = 18 samples
I want to compare C vs A, C vs. D, and A vs. D.

I assume I need to use TRESS_DMRfit, but I do not understand how to set up the function. For example do all BAM files for the groups need to be added at the same time; 9 input and 9 IP?

Thank you.
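
The README describes fitting once and then testing several contrasts, so one reading is that all 9 input and 9 IP BAM files go into a single TRESS_DMRfit() call, with one row per sample in the variable data frame. The base-R sketch below shows what the design and the three contrasts could look like under that assumption; the column name `group` and the contrast names are made up for illustration, and the TRESS calls appear only as comments.

```r
# One row per sample, in the same order as the BAM files (assumed order:
# three replicates of A, then C, then D)
variable <- data.frame(
  group = factor(rep(c("A", "C", "D"), each = 3), levels = c("A", "C", "D"))
)
model <- ~ 1 + group

design <- model.matrix(model, variable)
print(colnames(design))  # intercept (group A), then groupC, groupD

# With A as baseline: mean(A) = b0, mean(C) = b0 + b1, mean(D) = b0 + b2,
# so each pairwise difference is a linear combination of (b0, b1, b2)
contrasts_list <- list(
  C_vs_A = c(0, 1, 0),   # b1
  C_vs_D = c(0, 1, -1),  # b1 - b2
  A_vs_D = c(0, 0, -1)   # -b2
)

# Hypothetical usage, following the README (not run here):
# DMR.fit <- TRESS_DMRfit(..., variable = variable, model = model)
# DMR.test <- TRESS_DMRtest(DMR = DMR.fit, contrast = contrasts_list$C_vs_A)
```

It is worth confirming the coefficient order with CoefName(DMR.fit) before testing, since the contrast entries must line up with it.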

Symbol annotation

Hi, thank you for developing TRESS! How can I get the "*.sqlite" file containing mappings between different versions of gene names?

Also, m6A-seq2 has recently become a convenient protocol for pooling different replicates and samples into one library. In this situation, the size of the pulldown library somewhat reflects the m6A levels of the samples. If RPM is still calculated with each sample's individual library size, I think that would overestimate the low-level samples and underestimate the high-level ones. Do you think TRESS can handle this case with user-defined size factors between IP and input (for calling peaks), and between sample 1 and sample 2 (for calling DMRs)?

Thanks!

Error in estimating preliminary MLE, subscript out of bounds

Time used to obtain bin-level data is:
1.526614

Step 1: Call candidate DMRs...

Merge bumps from different replicates...
The number of candidates is:
53984
Time used in Step 1 is:
38.82685

Step 2: Model fitting on candidates...

[1] "Start to estimate preliminary MLE ..."
Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': subscript out of bounds

'mc.cores' > 1 is not supported on Windows

Hi, thanks for this great tool. I got an error when I called the TRESS_DMRfit function for differential peak analysis. Here is the error information:

DMR.fit = TRESS_DMRfit(IP.file = IP.file,
                       Input.file = Input.file,
                       Path_To_AnnoSqlite = Path_sqlit,
                       variable = variable,
                       model = model,
                       InputDir = InputDir,
                       OutputDir = OutputDir,
                       experimentName = "DOXvsDMSO")

##### Divid the genome into bins and obtain bin counts...
Time used to obtain bin-level data is: 
9.763366
##### Step 1: Call candidate DMRs...
Merge bumps from different replicates...
The number of candidates is: 
47020
Time used in Step 1 is: 
1.88935
##### Step 2: Model fitting on candidates...
[1] "Start to estimate preliminary MLE ..."
Error in mclapply(seq_len(nrow(Ratio)), iMLE, X, Y, sx, sy, Ratio, D,  : 
   'mc.cores' > 1 is not supported on Windows
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': GL456210.1, GL456211.1, JH584296.1
  - in 'y': MT, JH584297.1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).

Thanks for your help!
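
For context, the error comes from `parallel::mclapply()`, which relies on forking and therefore only accepts `mc.cores = 1` on Windows. The sketch below shows the usual portable pattern in base R. It illustrates the cause only: whether TRESS exposes a setting for the number of cores is not stated on this page, so this is not a fix for TRESS_DMRfit itself.

```r
library(parallel)

# mclapply() forks the R process, which Windows cannot do, so fall back
# to a single core there; the value 2 for other systems is arbitrary
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

squares <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
print(unlist(squares))
```

With `mc.cores = 1`, the same call runs serially and works on any platform.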

TRESS with paired-end stranded reads

Dear @ZhenxingGuo0015,

I'd like to run TRESS on a dataset that is paired-end and stranded. Does your tool support this sequencing design? If not, should I manipulate the BAM file to assign the strand of the original fragment to both the corresponding reads?

Thanks in advance,

Mattia

Best strategy for time course experiment

Dear @ZhenxingGuo0015,
I have an experimental design with 3 replicates and 6 time points. Which of these two strategies seems more suitable to you?

1. Compare replicates of each sample at a specific time point vs. replicates at time point = 0h, and repeat both the fit and the test for each time point.
2. Perform a single fit, creating a dummy variable for each time point (named time_point_1h, time_point_2h, etc.) that is set to a specific value for the corresponding time point and to "other" for the remaining time points, defining a model with all these variables, and then performing pairwise comparisons between time points using a contrast vector. For example:

> str(variable_treatment_full)
'data.frame':	18 obs. of  6 variables:
 $ time_0h : chr  "0h" "0h" "0h" "other" ...
 $ time_1h : chr  "other" "other" "other" "1h" ...
 $ time_2h : chr  "other" "other" "other" "other" ...
 $ time_4h : chr  "other" "other" "other" "other" ...
 $ time_8h : chr  "other" "other" "other" "other" ...
 $ time_16h: chr  "other" "other" "other" "other" ... 

 > model_treatment_full
~1 + time_0h + time_1h + time_2h + time_4h + time_8h + time_16h

> Contrast_1h
[1] 0 1 0 0 0 0

Thanks in advance,
Simone
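
A quick base-R check of the second strategy, as posted, suggests two things to watch for: the design matrix has seven columns (intercept plus six dummies) while Contrast_1h has six entries, and the seven columns are linearly dependent. The sketch below rebuilds that design from the posted structure and compares it with a single `time` factor; how TRESS handles a rank-deficient design is not stated on this page.

```r
time_points <- c("0h", "1h", "2h", "4h", "8h", "16h")
time <- rep(time_points, each = 3)

# Strategy 2 as posted: one "<time>/other" column per time point
dummies <- as.data.frame(
  lapply(setNames(time_points, paste0("time_", time_points)),
         function(tp) ifelse(time == tp, tp, "other")),
  stringsAsFactors = FALSE
)
D_dummy <- model.matrix(
  ~ 1 + time_0h + time_1h + time_2h + time_4h + time_8h + time_16h,
  dummies
)
print(c(columns = ncol(D_dummy), rank = qr(D_dummy)$rank))

# Alternative: a single factor with 0h as the baseline level
variable <- data.frame(time = factor(time, levels = time_points))
D_factor <- model.matrix(~ 1 + time, variable)
print(c(columns = ncol(D_factor), rank = qr(D_factor)$rank))

# Pairwise contrasts on the single-factor design
contrast_1h_vs_0h <- c(0, 1, 0, 0, 0, 0)   # coefficient time1h
contrast_2h_vs_1h <- c(0, -1, 1, 0, 0, 0)  # time2h minus time1h
```

With the single factor, one fit covers all time points, and any pairwise comparison is a difference of two coefficients (or a single coefficient when the comparison is against 0h).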