stemangiola / tidybulk Goto Github PK

Brings bulk and pseudobulk transcriptomics to the tidyverse

R 100.00%

tidyverse tidy tidy-data bioconductor edger deseq2 transcriptomics pipe tibble gene-symbols

tidybulk's Issues

directionality of log2FoldChange (DESeq2) opposite to voom and edgeR

Hi,

I was running all three methods on a single dataset, same formula, same contrasts and got opposite directionality for my logFC values!

tidybulk_1.1.8

Guessing there is a missing test in the sample data: logFC for given gene in test dataset is consistent?

Code:

counts_de.voom <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "limma_voom", # edgeR_quasi_likelihood", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "limma_voom."
)

counts_de.edger <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "edgeR_quasi_likelihood", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "edgeR_quasi_likelihood."
)

counts_de.DESeq2 <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "DESeq2", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "DESeq2."
)

check for duplicates for reduce dimensions

Add over representation analysis

Beside egsea for gene enrichment (using rank), add also over representation where the only thing needed is a gene vector (https://yulab-smu.github.io/clusterProfiler-book/chapter3.html) Developed for Bhupinder.

test_differential_abundance: edgeR LRT or QLF?

Could you specify which edgeR testing method is being used in the test_differential_abundance help, is it the likelihood ratio test (LRT) or the quasi-likelihood F-test (QLF)?

e.g. from the edgeR user guide: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

1.4 Quick start
edgeR offers many variants on analyses. The glm approach is more popular than the classic approach as it offers great flexibilities. There are two testing methods under the glm
framework: likelihood ratio tests and quasi-likelihood F-tests. The quasi-likelihood method
is highly recommended for differential expression analyses of bulk RNA-seq data as it gives
stricter error rate control by accounting for the uncertainty in dispersion estimation. The
likelihood ratio test can be useful in some special cases such as single cell RNA-seq and
datasets with no replicates. The details of these methods are described in Chapter 2.

A typical edgeR analysis might look like the following. Here we assume there are four RNASeq libraries in two groups, and the counts are stored in a tab-delimited text file, with gene
symbols in a column called Symbol.

x <- read.delim("TableOfCounts.txt",row.names="Symbol")
group <- factor(c(1,1,2,2))
y <- DGEList(counts=x,group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)

To perform quasi-likelihood F-tests:

fit <- glmQLFit(y,design)
qlf <- glmQLFTest(fit,coef=2)
topTags(qlf)

To perform likelihood ratio tests:

fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt)

MacOS build issues

HEllo @mblue9, I was wondering whether when you have some time you could apply your solution for MacOS build to tidybulk as well.

Allow different scaling methods with test_differential_abundance()

Currently the method seems fixed to "TMM" in edgeR::calcNormFactors() in test_differential_abundance(). Allowing this to be changed like in scale_abundance() would be nice.

Change terminology "library size" to "sequencing depth"

This is the reason

What we generally call “normalisation” is the action of trying to adjust for sequencing output
Since it is a linear confounder, we decided to apply a scaling operation (multiplying the transcript counts by a number)
In the past we took the wrong approach of thinking that
- Sequencing output = library size = sum of the counts
This is wrong because actually
- library size = sum of the counts = Fx(Sequencing output, biology)
- The library size is an observed quantity
- The biology causes outliers (e.g., for a sample a gene extremely abundant because of cell death)
CPM and TMM have the same objective but act on different things
- CPM (equivalent to -> counts / sum(counts) ) acts on “library size” (erroneously)
- TMM, because robust to outliers, manages to act more on “Sequencing output”
- Both try to adjust for “Sequencing output”
Cpm(counts, TMM = ...) is an anomaly for historical reasons. Let’s rename the functions
- Cpm = adjust_sequencing_depth_A
- TMM = adjust_sequencing depth_B
Then in case of common standards (see point (6) ) we have
- adjust_sequencing_depth_A(counts, adjust_sequencing_depth_B)
There is a adjustment that adjust another adjustment
Cpm is therefore redundant, the only function it has is telling us what fraction of counts a transcript has relative to the total (divided by 1M), which decreases the information content of our data, compared to raw counts, or even scaled counts

double check DESeq2

Hello @mblue9 , I ave some strange results from DESeq2 and I double checked, I was wondering if you could have a quick eye.

Fix spelling on analise_gene_enrichment

analise -> analyse (or analyze if American spelling is mandated)
Perhaps test_gene_enrichment would be a better name as it fits with test_differential_abundance.

make actions: add, get, exclusive

add = add infor to the whole data frame
get = get just the sample or transcript related columns
exclusive = what now get is

allow overriding tidybulk object

For example, if we want to change key columns, we might do so calling tidybulk again

original_tidybulk %>%
strip_internals %>%
strip_classes %>%
tidybulk_again

limma_voom sample weighing in test_differential_abundance?

Hi,

Enjoying the package! Is there a plan to include voom with sample weighting option into test_differential_abundance?

Thanks
/Alistair

ERROR; return code from pthread_create() is 22

Hello @mblue9,

I tag you in case you have an idea of how to proceed.

For the pull request #139 we have this error

ERROR; return code from pthread_create() is 22

Due to the loading of preprocessCore for Cibersort, this happens only in github actions.

I found a possible solution from https://support.bioconductor.org/p/122925/

git clone https://github.com/bmbolstad/preprocessCore.git
cd preprocessCore/
R CMD INSTALL --configure-args="--disable-threading"  .

Do you know how to install preprocesscore with this procedure in github actions rather than the normal dependency manager?

test_gene_overrepresentation: visualising results

Do you have any recommendations for visualising the output of test_gene_overrepresentation?

Can it/should it use a clusterprofiler viz method: https://yulab-smu.github.io/clusterProfiler-book/ or a ggplot2 one?

`tidybulk` Replace pivot_* with select_columns_*

I think subset_transcripts and subset_samples would be more clear and in line with tidyverse. @mblue9 concerns?

ruining-dev sync

Solving travis check issues:
synchronisation done (master -> ruining-dev)

Correct typo

tidybulk says: for aggregation fctors <--

test_gene_enrichment with tidySummarizedExperiment input

Do we need to make some change to have test_gene_enrichment work with tidySummarizedExperiment input or should we just select the columns needed and input that way?

As was testing with the airway workshop input (using dev branch tidybulk version) and getting this error below.

options(connectionObserver = NULL) - # need to add this temporarily due to an [issue with rstudio](https://github.com/rstudio/rstudio/issues/9219) 
library(airway)
library(stringr)
library(EGSEA)
library(tidybulk)
library(tidySummarizedExperiment)

data(airway)

counts_scaled <- airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() 

# we need entrez ids for egsea
counts_annot <- counts_scaled %>% 
    mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    keys = feature,
    keytype = "ENSEMBL",
    column = "ENTREZID",
    multiVals = "first"))
    
test_gene_enrichment(counts_annot, .entrez=entrez, .abundance=counts_scaled, .formula=~ dex + cell, species="human", gene_collections = "h")

tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Error: object 'entrez' not found

Works if select the columns to input e.g.

# select columns and remove any duplicate entrez ids
counts_format <- counts_annot %>% 
    select(sample, entrez, counts_scaled, dex, cell) %>%
    aggregate_duplicates(.sample=sample, .transcript=entrez, .abundance=counts_scaled)

# run egsea
counts_format %>%
    test_gene_enrichment(.sample=sample, .entrez=entrez, .abundance=counts_scaled, .formula=~ 0 + dex + cell, .contrasts = c("dextrt - dexuntrt"),  species="human", gene_collections = "h")

Use Authors@R: not Author: in DESCRIPTION

On the Bioc latest page the author shows as R code rather than human-readable text. Using the person function only works with Authors@R, which is interpreted during R CMD build.

Author: person("Stefano", "Mangiola", email=[email protected], role=c("cre", aut"), comment = c(ORCID = "0000-0001-7474-836X")).

You'll also need to quote your email address

Implement data validation with validObject

Here the conversation on mailing list.

One other point before I run
It seems to me the operators "require" that certainfields be defined in their tibble operands.

names(attributes(counts))[1] "names" "class" "row.names" "parameters"> attributes(counts)$names[1] "sample" "transcript" "Cell type" [4] "count" "time" "condition" [7] "batch" "factor_of_interest"> validObject(counts)Error in .classEnv(classDef) : trying to get slot "package" from an object of a basic class ("NULL") with no slots
Enter a frame number, or 0 to exit
1: validObject(counts)2: .classEnv(classDef)
I think you mentioned validity checking in a previous email. Thisis a feature of S4 that is not frequently invoked. Of coursevalidObject will not work on counts, but do you have something similar?(Not all working S4 objects from Bioc will pass validObject tests, butthey should....)

Delete white spaces in produced column names

changed
scaled,
adjusted,
filter out low counts -> lowly_abundant

Do the README/Vignette trick for tidybulk

Hello @mblue9 , can you confirm that we cannot do the same trick we did for tidyseurat, here for tidybulk?

use edgeR function filterByExpr to filter

add symbol convesion for drosophila gene ID

@mblue9 is this still valid as far as you remember? :)

Use S4 methods style

Here is the discussion on mailing list

A common pattern might be to implement a generic

setGeneric("foo", function(x, ...) standardGeneric("foo"))

an ‘internal’ function that implements the method on base R data types

.foo <- function(x) {
stopifnot("'x' must be a matrix" = is.matrix(x))
t(x)
}

and methods that act as a facade to the implementation

setMethod("foo", "tbl_df", function(x) {
x <- as.matrix(x)
result <- .foo(x)
as_tibble(result)
})

setMethod("foo", "SummarizedExperiment", function(x) {
result <- .foo(assay(x))
assays(x)[["foo"]] <- result
x
})

add test_differential_composition

Add differential composition analysis, both for glm and censored data. See ARMET test script

test_gene_overrepresentation: bug or user error

Was trying out test_gene_overrepresentation using the workshop data. If I do this:

counts_de_entrez <- counts_de_pretty %>% 
    filter(!lowly_abundant) %>%
    symbol_to_entrez() 

res <- test_gene_overrepresentation(counts_de_entrez, .do_test=significant, species="human")

I get Error: object '' not found

Am I doing something wrong or is it a bug?

Fixing github action

Hello @mblue9 I was wondering if you had time to give an hand with fixing the github actions for Master.

The error is due to the use of tidySE in the vignette, that is only available in Bioc 3_13.

Create SummarisedExperiment stream

Create methods for SummarisedExperiment object. Each wrapper returns the same object that accepts.

SE to tidybulk use "count" instead of "counts"

When converting from SE to tidybulk "count" column name instead of "counts"

fucntion => function in method descriptions

Hi,

Just noticed a consistent typo while digging around.

/Alistair

Consistent column names for differential expression results

Currently differential expression results will return different column names for what is essentially the same type of output

edgeR: logFC logCPM F PValue FDR
limma_voom: logFC AveExpr t P.Value adj.P.Val B

Is it worth making these somewhat consistent in the output? (i.e. at least PValue/P.Value => a consistent PValue)?

It would reduce fiddly downstream code changes.

add raw object from backend calculation to internals

Auto-generated materials and methods?

Firstly, very nice tool (and talk).

It would be interesting to be able to create a methods text, from the resulting tibble. I imagine a case where a tibble is published/shared, and checking what was done to create it. It would also be a way of making clear the methods under the hood, and could provide citation information at the same time.

Dropping parameters significant_threshold and fill_missing_values

In the process of adding the prefix argument to test_differential_abundance for allowing easy multi method testing I was thinking that we could trim down some of the argument to make the interface more clean

significant_threshold and significant column. It does not add much as it is pretty easy to filter based on FDR, and if anything make the users think less about what they are filtering
fill_missing_values we have already fill_missing and impute_missing routines, I was thinking throu a error/warning if needed and let the user operate externally to this function

test_differential_abundance: misleading warning with multiple factors?

I think this warning might be misleading, as there are more than 2 reps per primary factor (trt vs untrt) I think the secondary factor might be confusing the function? and I don't get any warning with edgeR on it's own (running it with a similar design outside tidybulk)

test_differential_abundance(
.formula = ~ 0 + dex + cell,
.contrasts = c("dextrt - dexuntrt"))```

tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Joining, by = "transcript"
tidybulk says: to access the raw results (glmFit) do `attr(..., "internals")$edgeR`
Joining, by = c("transcript", "lowly_abundant")
Warning message:
In get_differential_transcript_abundance_bulk(.data, .formula, .sample = !!.sample,  :
  tidybulk says: You have less than two replicated for each factorial condition

Feedback from talk: get_bibliography

@mblue9, FYI at the Monash talk I received very good feedback about get_bibliography.

I was suggested to include what methods/functions are related to each bib citation, for giving the opportunity to people to right methods without guessing.

do i need sample, transcript, abundance for SE objects?

maybe eliminate the arguments

allow validation to skip internals check if column names are present

abundant transcript vocabulary

For a function that adds a column with .abundant label TRUE or FALSE do you think is better

find_abundant
identify_abundant
label_abundant
mark_abundant
flag_abundant
annotate_abundant
...

Should filterByExpr be added into some functions help documentation?

Do keep_abundant and test_differential_abundance use edgeR's filterByExpr?

I thought so from what we discussed in the past but maybe I'm wrong?

As here I've said filterByExpr is used
https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html#filtering-lowly-expressed-genes

But there's no mention of it in the help for keep_abundant and test_differential_abundance.

Could you let me know if we should rephrase that or am happy to try help with the function documentation if this should be added.

curate bibliography list

it doesn't show the ref for voom, and not sure if it should have the refs for the specific edgeR methods (from the edgeR user guide)?

pass ... ellipse argument to test_differential_abundance

For example to get plot=TRUE from voom

test_gene_enrichment error: object '.transcript' not found

Heya, I was trying to include a small example of using tidybulk's test_gene_enrichment in a tutorial but I can't get it to run. I'm getting the error Error in enquo(.transcript) : object '.transcript' not found.

The example in the function here https://github.com/stemangiola/tidybulk/blob/master/R/methods.R#L2807-L2822 gives the same error e.g.

# From tidybulk example
df_entrez = symbol_to_entrez(tidybulk::counts_mini, .transcript = transcript, .sample = sample)
df_entrez = aggregate_duplicates(df_entrez, aggregation_function = sum, .sample = sample, .transcript = entrez, .abundance = count)

library(EGSEA)
test_gene_enrichment(
			df_entrez,
			~ condition,
			.sample = sample,
			.entrez = entrez,
			.abundance = count,
       method = c("roast" , "safe", "gage"  ,  "padog" , "globaltest", "ora" ),
			species="human",
			cores = 2
		)

Is it something to do with .transcript not being an argument (.entrez is instead) but then there's a check for .transcript here ?

tidybulk/R/methods.R

Line 2863 in 852d996

 warning_if_data_is_not_rectangular(.data, !!.sample, !!.transcript, !!.abundance) 

test_gene_enrichment: No group or design set

As discussed recording here as issue:
I tried this below with tidybulk dev, it ran but I got this message saying “No group or design set. Assuming all samples belong to one group” but they’re not all from the same group, so does it need to be given a group or design matrix or is the .formula meant to handle that?

readRDS("ent.rds") %>%
filter(entrez %>% is.na %>% !) %>%
test_gene_enrichment(
.formula = ~ 0 + dex + cell,
.sample = sample,
.entrez = entrez,
.abundance = count,
.contrasts = c("dextrt - dexuntrt"),
species = "human",
cores = 4
)

tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
No group or design set. Assuming all samples belong to one group.

set default PCA scale = FALSE

for scaling check the presence of a multiplier column

And do this in general when columns are created.

Survival Analysis

Hello Stefano,

Thank you for the great packages.
I've noticed that with tidybulk you can perform a survival analysis in terms of relationships of different subpopulations with survival.

I was wondering if it's possible to use your package to estimate Kaplan-Meier survival curves. What I am trying to do is understanding if the expression of one gene or even a gene module (defined as a categorical variable: low, mid, high) predict survival. Something similar to what some tools (eg gepia2) allow to do for TCGA data.

Thank you

Best,
Francesco

Quiet parameter for keep_variable()

The keep_variable() function prints out the following phrase by default:
## Getting the 500 most variable genes

While it is not important for scripting, the message stays at the R markdown document even I designate knitr::opts_chunk$set(message = F)

Is it possible to suppress this message?

use functions as arguments instead of character strings

From https://twitter.com/mikelove/status/1244624727114690562?s=20

E.g. reduce_dimensions() now only allows for f(x) = x or log(x+1) but you could have an argument where user passes the transformation function

stemangiola / tidybulk Goto Github PK

tidybulk's Issues

Recommend Projects

Recommend Topics

Recommend Org