stemangiola / tidybulk Goto Github PK
View Code? Open in Web Editor NEWBrings bulk and pseudobulk transcriptomics to the tidyverse
Brings bulk and pseudobulk transcriptomics to the tidyverse
Hi,
I was running all three methods on a single dataset, same formula, same contrasts and got opposite directionality for my logFC values!
tidybulk_1.1.8
Guessing there is a missing test in the sample data: logFC for given gene in test dataset is consistent?
Code:
counts_de.voom <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "limma_voom", # edgeR_quasi_likelihood", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "limma_voom."
)
counts_de.edger <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "edgeR_quasi_likelihood", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "edgeR_quasi_likelihood."
)
counts_de.DESeq2 <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "DESeq2", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "DESeq2."
)
Beside egsea for gene enrichment (using rank), add also over representation where the only thing needed is a gene vector (https://yulab-smu.github.io/clusterProfiler-book/chapter3.html) Developed for Bhupinder.
Could you specify which edgeR testing method is being used in the test_differential_abundance
help, is it the likelihood ratio test (LRT) or the quasi-likelihood F-test (QLF)?
e.g. from the edgeR user guide: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
1.4 Quick start
edgeR offers many variants on analyses. The glm approach is more popular than the classic approach as it offers great flexibilities. There are two testing methods under the glm
framework: likelihood ratio tests and quasi-likelihood F-tests. The quasi-likelihood method
is highly recommended for differential expression analyses of bulk RNA-seq data as it gives
stricter error rate control by accounting for the uncertainty in dispersion estimation. The
likelihood ratio test can be useful in some special cases such as single cell RNA-seq and
datasets with no replicates. The details of these methods are described in Chapter 2.
A typical edgeR analysis might look like the following. Here we assume there are four RNASeq libraries in two groups, and the counts are stored in a tab-delimited text file, with gene
symbols in a column called Symbol.
x <- read.delim("TableOfCounts.txt",row.names="Symbol")
group <- factor(c(1,1,2,2))
y <- DGEList(counts=x,group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
To perform quasi-likelihood F-tests:
fit <- glmQLFit(y,design)
qlf <- glmQLFTest(fit,coef=2)
topTags(qlf)
To perform likelihood ratio tests:
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt)
HEllo @mblue9, I was wondering whether when you have some time you could apply your solution for MacOS build to tidybulk as well.
Currently the method seems fixed to "TMM" in edgeR::calcNormFactors() in test_differential_abundance(). Allowing this to be changed like in scale_abundance() would be nice.
This is the reason
What we generally call “normalisation” is the action of trying to adjust for sequencing output
Since it is a linear confounder, we decided to apply a scaling operation (multiplying the transcript counts by a number)
In the past we took the wrong approach of thinking that
This is wrong because actually
CPM and TMM have the same objective but act on different things
Cpm(counts, TMM = ...) is an anomaly for historical reasons. Let’s rename the functions
Then in case of common standards (see point (6) ) we have
There is a adjustment that adjust another adjustment
Cpm is therefore redundant, the only function it has is telling us what fraction of counts a transcript has relative to the total (divided by 1M), which decreases the information content of our data, compared to raw counts, or even scaled counts
Hello @mblue9 , I ave some strange results from DESeq2 and I double checked, I was wondering if you could have a quick eye.
analise -> analyse (or analyze if American spelling is mandated)
Perhaps test_gene_enrichment would be a better name as it fits with test_differential_abundance.
add = add infor to the whole data frame
get = get just the sample or transcript related columns
exclusive = what now get is
For example, if we want to change key columns, we might do so calling tidybulk again
original_tidybulk %>%
strip_internals %>%
strip_classes %>%
tidybulk_again
Hi,
Enjoying the package! Is there a plan to include voom with sample weighting option into test_differential_abundance?
Thanks
/Alistair
Hello @mblue9,
I tag you in case you have an idea of how to proceed.
For the pull request #139 we have this error
ERROR; return code from pthread_create() is 22
Due to the loading of preprocessCore for Cibersort, this happens only in github actions.
I found a possible solution from https://support.bioconductor.org/p/122925/
git clone https://github.com/bmbolstad/preprocessCore.git
cd preprocessCore/
R CMD INSTALL --configure-args="--disable-threading" .
Do you know how to install preprocesscore with this procedure in github actions rather than the normal dependency manager?
Do you have any recommendations for visualising the output of test_gene_overrepresentation?
Can it/should it use a clusterprofiler viz method: https://yulab-smu.github.io/clusterProfiler-book/ or a ggplot2 one?
I think subset_transcripts and subset_samples would be more clear and in line with tidyverse. @mblue9 concerns?
Solving travis check issues:
synchronisation done (master -> ruining-dev)
tidybulk says: for aggregation fctors <--
Do we need to make some change to have test_gene_enrichment work with tidySummarizedExperiment input or should we just select the columns needed and input that way?
As was testing with the airway workshop input (using dev branch tidybulk version) and getting this error below.
options(connectionObserver = NULL) - # need to add this temporarily due to an [issue with rstudio](https://github.com/rstudio/rstudio/issues/9219)
library(airway)
library(stringr)
library(EGSEA)
library(tidybulk)
library(tidySummarizedExperiment)
data(airway)
counts_scaled <- airway %>%
keep_abundant(factor_of_interest = dex) %>%
scale_abundance()
# we need entrez ids for egsea
counts_annot <- counts_scaled %>%
mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
keys = feature,
keytype = "ENSEMBL",
column = "ENTREZID",
multiVals = "first"))
test_gene_enrichment(counts_annot, .entrez=entrez, .abundance=counts_scaled, .formula=~ dex + cell, species="human", gene_collections = "h")
tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Error: object 'entrez' not found
Works if select the columns to input e.g.
# select columns and remove any duplicate entrez ids
counts_format <- counts_annot %>%
select(sample, entrez, counts_scaled, dex, cell) %>%
aggregate_duplicates(.sample=sample, .transcript=entrez, .abundance=counts_scaled)
# run egsea
counts_format %>%
test_gene_enrichment(.sample=sample, .entrez=entrez, .abundance=counts_scaled, .formula=~ 0 + dex + cell, .contrasts = c("dextrt - dexuntrt"), species="human", gene_collections = "h")
On the Bioc latest page the author shows as R code rather than human-readable text. Using the person function only works with Authors@R, which is interpreted during R CMD build.
Author: person("Stefano", "Mangiola", email=[email protected], role=c("cre", aut"), comment = c(ORCID = "0000-0001-7474-836X")).
You'll also need to quote your email address
Here the conversation on mailing list.
One other point before I run
It seems to me the operators "require" that certainfields be defined in their tibble operands.
names(attributes(counts))[1] "names" "class" "row.names" "parameters"> attributes(counts)$names[1] "sample" "transcript" "Cell type" [4] "count" "time" "condition" [7] "batch" "factor_of_interest"> validObject(counts)Error in .classEnv(classDef) : trying to get slot "package" from an object of a basic class ("NULL") with no slots
Enter a frame number, or 0 to exit
1: validObject(counts)2: .classEnv(classDef)
I think you mentioned validity checking in a previous email. Thisis a feature of S4 that is not frequently invoked. Of coursevalidObject will not work on counts, but do you have something similar?(Not all working S4 objects from Bioc will pass validObject tests, butthey should....)
changed
scaled,
adjusted,
filter out low counts -> lowly_abundant
Hello @mblue9 , can you confirm that we cannot do the same trick we did for tidyseurat, here for tidybulk?
@mblue9 is this still valid as far as you remember? :)
Here is the discussion on mailing list
A common pattern might be to implement a generic
setGeneric("foo", function(x, ...) standardGeneric("foo"))
an ‘internal’ function that implements the method on base R data types
.foo <- function(x) {
stopifnot("'x' must be a matrix" = is.matrix(x))
t(x)
}
and methods that act as a facade to the implementation
setMethod("foo", "tbl_df", function(x) {
x <- as.matrix(x)
result <- .foo(x)
as_tibble(result)
})
setMethod("foo", "SummarizedExperiment", function(x) {
result <- .foo(assay(x))
assays(x)[["foo"]] <- result
x
})
Add differential composition analysis, both for glm and censored data. See ARMET test script
Was trying out test_gene_overrepresentation using the workshop data. If I do this:
counts_de_entrez <- counts_de_pretty %>%
filter(!lowly_abundant) %>%
symbol_to_entrez()
res <- test_gene_overrepresentation(counts_de_entrez, .do_test=significant, species="human")
I get Error: object '' not found
Am I doing something wrong or is it a bug?
Hello @mblue9 I was wondering if you had time to give an hand with fixing the github actions for Master.
The error is due to the use of tidySE in the vignette, that is only available in Bioc 3_13.
Create methods for SummarisedExperiment object. Each wrapper returns the same object that accepts.
When converting from SE to tidybulk "count" column name instead of "counts"
Hi,
Just noticed a consistent typo while digging around.
/Alistair
Currently differential expression results will return different column names for what is essentially the same type of output
edgeR: logFC logCPM F PValue FDR
limma_voom: logFC AveExpr t P.Value adj.P.Val B
Is it worth making these somewhat consistent in the output? (i.e. at least PValue/P.Value => a consistent PValue)?
It would reduce fiddly downstream code changes.
Firstly, very nice tool (and talk).
It would be interesting to be able to create a methods text, from the resulting tibble. I imagine a case where a tibble is published/shared, and checking what was done to create it. It would also be a way of making clear the methods under the hood, and could provide citation information at the same time.
In the process of adding the prefix argument to test_differential_abundance
for allowing easy multi method testing I was thinking that we could trim down some of the argument to make the interface more clean
I think this warning might be misleading, as there are more than 2 reps per primary factor (trt vs untrt) I think the secondary factor might be confusing the function? and I don't get any warning with edgeR on it's own (running it with a similar design outside tidybulk)
test_differential_abundance(
.formula = ~ 0 + dex + cell,
.contrasts = c("dextrt - dexuntrt"))```
tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Joining, by = "transcript"
tidybulk says: to access the raw results (glmFit) do `attr(..., "internals")$edgeR`
Joining, by = c("transcript", "lowly_abundant")
Warning message:
In get_differential_transcript_abundance_bulk(.data, .formula, .sample = !!.sample, :
tidybulk says: You have less than two replicated for each factorial condition
@mblue9, FYI at the Monash talk I received very good feedback about get_bibliography.
I was suggested to include what methods/functions are related to each bib citation, for giving the opportunity to people to right methods without guessing.
do i need sample, transcript, abundance for SE objects?
maybe eliminate the arguments
For a function that adds a column with .abundant label TRUE or FALSE do you think is better
Do keep_abundant
and test_differential_abundance
use edgeR's filterByExpr
?
I thought so from what we discussed in the past but maybe I'm wrong?
As here I've said filterByExpr is used
https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html#filtering-lowly-expressed-genes
But there's no mention of it in the help for keep_abundant
and test_differential_abundance
.
Could you let me know if we should rephrase that or am happy to try help with the function documentation if this should be added.
it doesn't show the ref for voom, and not sure if it should have the refs for the specific edgeR methods (from the edgeR user guide)?
For example to get plot=TRUE from voom
Heya, I was trying to include a small example of using tidybulk's test_gene_enrichment in a tutorial but I can't get it to run. I'm getting the error Error in enquo(.transcript) : object '.transcript' not found
.
The example in the function here https://github.com/stemangiola/tidybulk/blob/master/R/methods.R#L2807-L2822 gives the same error e.g.
# From tidybulk example
df_entrez = symbol_to_entrez(tidybulk::counts_mini, .transcript = transcript, .sample = sample)
df_entrez = aggregate_duplicates(df_entrez, aggregation_function = sum, .sample = sample, .transcript = entrez, .abundance = count)
library(EGSEA)
test_gene_enrichment(
df_entrez,
~ condition,
.sample = sample,
.entrez = entrez,
.abundance = count,
method = c("roast" , "safe", "gage" , "padog" , "globaltest", "ora" ),
species="human",
cores = 2
)
Is it something to do with .transcript not being an argument (.entrez is instead) but then there's a check for .transcript here ?
Line 2863 in 852d996
As discussed recording here as issue:
I tried this below with tidybulk dev, it ran but I got this message saying “No group or design set. Assuming all samples belong to one group” but they’re not all from the same group, so does it need to be given a group or design matrix or is the .formula meant to handle that?
readRDS("ent.rds") %>%
filter(entrez %>% is.na %>% !
) %>%
test_gene_enrichment(
.formula = ~ 0 + dex + cell,
.sample = sample,
.entrez = entrez,
.abundance = count,
.contrasts = c("dextrt - dexuntrt"),
species = "human",
cores = 4
)
tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
No group or design set. Assuming all samples belong to one group.
And do this in general when columns are created.
Hello Stefano,
Thank you for the great packages.
I've noticed that with tidybulk you can perform a survival analysis in terms of relationships of different subpopulations with survival.
I was wondering if it's possible to use your package to estimate Kaplan-Meier survival curves. What I am trying to do is understanding if the expression of one gene or even a gene module (defined as a categorical variable: low, mid, high) predict survival. Something similar to what some tools (eg gepia2) allow to do for TCGA data.
Thank you
Best,
Francesco
The keep_variable() function prints out the following phrase by default:
## Getting the 500 most variable genes
While it is not important for scripting, the message stays at the R markdown document even I designate knitr::opts_chunk$set(message = F)
Is it possible to suppress this message?
From https://twitter.com/mikelove/status/1244624727114690562?s=20
E.g. reduce_dimensions() now only allows for f(x) = x or log(x+1) but you could have an argument where user passes the transformation function
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.