Giter Club home page Giter Club logo

tidybulk's Issues

directionality of log2FoldChange (DESeq2) opposite to voom and edgeR

Hi,

I was running all three methods on a single dataset, same formula, same contrasts and got opposite directionality for my logFC values!

tidybulk_1.1.8

Guessing there is a missing test in the sample data: logFC for given gene in test dataset is consistent?

Code:

counts_de.voom <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "limma_voom", # edgeR_quasi_likelihood", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "limma_voom."
)

counts_de.edger <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "edgeR_quasi_likelihood", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "edgeR_quasi_likelihood."
)

counts_de.DESeq2 <- tt.aggr %>% identify_abundant(factor_of_interest = genotype) %>% scale_abundance() %>% keep_abundant() %>%
test_differential_abundance(
.formula = ~ 0 + genotype, # + cell,
.contrasts = c("genotypeKI - genotypeWT"),
method = "DESeq2", # Alternative method. "limma_voom",
omit_contrast_in_colnames = TRUE,
prefix= "DESeq2."
)

test_differential_abundance: edgeR LRT or QLF?

Could you specify which edgeR testing method is being used in the test_differential_abundance help, is it the likelihood ratio test (LRT) or the quasi-likelihood F-test (QLF)?

e.g. from the edgeR user guide: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

1.4 Quick start
edgeR offers many variants on analyses. The glm approach is more popular than the classic approach as it offers great flexibilities. There are two testing methods under the glm
framework: likelihood ratio tests and quasi-likelihood F-tests. The quasi-likelihood method
is highly recommended for differential expression analyses of bulk RNA-seq data as it gives
stricter error rate control by accounting for the uncertainty in dispersion estimation. The
likelihood ratio test can be useful in some special cases such as single cell RNA-seq and
datasets with no replicates. The details of these methods are described in Chapter 2.

A typical edgeR analysis might look like the following. Here we assume there are four RNASeq libraries in two groups, and the counts are stored in a tab-delimited text file, with gene
symbols in a column called Symbol.

x <- read.delim("TableOfCounts.txt",row.names="Symbol")
group <- factor(c(1,1,2,2))
y <- DGEList(counts=x,group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)

To perform quasi-likelihood F-tests:

fit <- glmQLFit(y,design)
qlf <- glmQLFTest(fit,coef=2)
topTags(qlf)

To perform likelihood ratio tests:

fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt)

MacOS build issues

HEllo @mblue9, I was wondering whether when you have some time you could apply your solution for MacOS build to tidybulk as well.

Change terminology "library size" to "sequencing depth"

This is the reason

  • What we generally call “normalisation” is the action of trying to adjust for sequencing output

  • Since it is a linear confounder, we decided to apply a scaling operation (multiplying the transcript counts by a number)

  • In the past we took the wrong approach of thinking that

    • Sequencing output = library size = sum of the counts
  • This is wrong because actually

    • library size = sum of the counts = Fx(Sequencing output, biology)
    • The library size is an observed quantity
    • The biology causes outliers (e.g., for a sample a gene extremely abundant because of cell death)
  • CPM and TMM have the same objective but act on different things

    • CPM (equivalent to -> counts / sum(counts) ) acts on “library size” (erroneously)
    • TMM, because robust to outliers, manages to act more on “Sequencing output”
    • Both try to adjust for “Sequencing output”
  • Cpm(counts, TMM = ...) is an anomaly for historical reasons. Let’s rename the functions

    • Cpm = adjust_sequencing_depth_A
    • TMM = adjust_sequencing depth_B
  • Then in case of common standards (see point (6) ) we have

    • adjust_sequencing_depth_A(counts, adjust_sequencing_depth_B)
  • There is a adjustment that adjust another adjustment

  • Cpm is therefore redundant, the only function it has is telling us what fraction of counts a transcript has relative to the total (divided by 1M), which decreases the information content of our data, compared to raw counts, or even scaled counts

double check DESeq2

Hello @mblue9 , I ave some strange results from DESeq2 and I double checked, I was wondering if you could have a quick eye.

image

Fix spelling on analise_gene_enrichment

analise -> analyse (or analyze if American spelling is mandated)
Perhaps test_gene_enrichment would be a better name as it fits with test_differential_abundance.

allow overriding tidybulk object

For example, if we want to change key columns, we might do so calling tidybulk again

original_tidybulk %>%
strip_internals %>%
strip_classes %>%
tidybulk_again

ERROR; return code from pthread_create() is 22

Hello @mblue9,

I tag you in case you have an idea of how to proceed.

For the pull request #139 we have this error

ERROR; return code from pthread_create() is 22

Due to the loading of preprocessCore for Cibersort, this happens only in github actions.

I found a possible solution from https://support.bioconductor.org/p/122925/

git clone https://github.com/bmbolstad/preprocessCore.git
cd preprocessCore/
R CMD INSTALL --configure-args="--disable-threading"  .

Do you know how to install preprocesscore with this procedure in github actions rather than the normal dependency manager?

ruining-dev sync

Solving travis check issues:
synchronisation done (master -> ruining-dev)

test_gene_enrichment with tidySummarizedExperiment input

Do we need to make some change to have test_gene_enrichment work with tidySummarizedExperiment input or should we just select the columns needed and input that way?

As was testing with the airway workshop input (using dev branch tidybulk version) and getting this error below.

options(connectionObserver = NULL) - # need to add this temporarily due to an [issue with rstudio](https://github.com/rstudio/rstudio/issues/9219) 
library(airway)
library(stringr)
library(EGSEA)
library(tidybulk)
library(tidySummarizedExperiment)

data(airway)

counts_scaled <- airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() 

# we need entrez ids for egsea
counts_annot <- counts_scaled %>% 
    mutate(entrez = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    keys = feature,
    keytype = "ENSEMBL",
    column = "ENTREZID",
    multiVals = "first"))
    
test_gene_enrichment(counts_annot, .entrez=entrez, .abundance=counts_scaled, .formula=~ dex + cell, species="human", gene_collections = "h")
tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Error: object 'entrez' not found

Works if select the columns to input e.g.

# select columns and remove any duplicate entrez ids
counts_format <- counts_annot %>% 
    select(sample, entrez, counts_scaled, dex, cell) %>%
    aggregate_duplicates(.sample=sample, .transcript=entrez, .abundance=counts_scaled)

# run egsea
counts_format %>%
    test_gene_enrichment(.sample=sample, .entrez=entrez, .abundance=counts_scaled, .formula=~ 0 + dex + cell, .contrasts = c("dextrt - dexuntrt"),  species="human", gene_collections = "h")

Use Authors@R: not Author: in DESCRIPTION

On the Bioc latest page the author shows as R code rather than human-readable text. Using the person function only works with Authors@R, which is interpreted during R CMD build.

Author: person("Stefano", "Mangiola", email=[email protected], role=c("cre", aut"), comment = c(ORCID = "0000-0001-7474-836X")).

You'll also need to quote your email address

Implement data validation with validObject

Here the conversation on mailing list.

One other point before I run
It seems to me the operators "require" that certainfields be defined in their tibble operands.

names(attributes(counts))[1] "names"      "class"      "row.names"  "parameters"> attributes(counts)$names[1] "sample"             "transcript"         "Cell type"         [4] "count"              "time"               "condition"         [7] "batch"              "factor_of_interest"> validObject(counts)Error in .classEnv(classDef) :   trying to get slot "package" from an object of a basic class ("NULL") with no slots
Enter a frame number, or 0 to exit   
1: validObject(counts)2: .classEnv(classDef)
I think you mentioned validity checking in a previous email.  Thisis a feature of S4 that is not frequently invoked.  Of coursevalidObject will not work on counts, but do you have something similar?(Not all working S4 objects from Bioc will pass validObject tests, butthey should....)

Use S4 methods style

Here is the discussion on mailing list

A common pattern might be to implement a generic

    setGeneric("foo", function(x, ...) standardGeneric("foo"))

an ‘internal’ function that implements the method on base R data types

    .foo <- function(x) {
        stopifnot("'x' must be a matrix" = is.matrix(x))
        t(x)
    }

and methods that act as a facade to the implementation

    setMethod("foo", "tbl_df", function(x) {
        x <- as.matrix(x)
        result <- .foo(x)
        as_tibble(result)
    })

    setMethod("foo", "SummarizedExperiment", function(x) {
        result <- .foo(assay(x))
        assays(x)[["foo"]] <- result
        x
    })

test_gene_overrepresentation: bug or user error

Was trying out test_gene_overrepresentation using the workshop data. If I do this:

counts_de_entrez <- counts_de_pretty %>% 
    filter(!lowly_abundant) %>%
    symbol_to_entrez() 

res <- test_gene_overrepresentation(counts_de_entrez, .do_test=significant, species="human")

I get Error: object '' not found

Am I doing something wrong or is it a bug?

Fixing github action

Hello @mblue9 I was wondering if you had time to give an hand with fixing the github actions for Master.

The error is due to the use of tidySE in the vignette, that is only available in Bioc 3_13.

Consistent column names for differential expression results

Currently differential expression results will return different column names for what is essentially the same type of output

edgeR: logFC logCPM F PValue FDR
limma_voom: logFC AveExpr t P.Value adj.P.Val B

Is it worth making these somewhat consistent in the output? (i.e. at least PValue/P.Value => a consistent PValue)?

It would reduce fiddly downstream code changes.

Auto-generated materials and methods?

Firstly, very nice tool (and talk).

It would be interesting to be able to create a methods text, from the resulting tibble. I imagine a case where a tibble is published/shared, and checking what was done to create it. It would also be a way of making clear the methods under the hood, and could provide citation information at the same time.

Dropping parameters significant_threshold and fill_missing_values

In the process of adding the prefix argument to test_differential_abundance for allowing easy multi method testing I was thinking that we could trim down some of the argument to make the interface more clean

  • significant_threshold and significant column. It does not add much as it is pretty easy to filter based on FDR, and if anything make the users think less about what they are filtering
  • fill_missing_values we have already fill_missing and impute_missing routines, I was thinking throu a error/warning if needed and let the user operate externally to this function

test_differential_abundance: misleading warning with multiple factors?

I think this warning might be misleading, as there are more than 2 reps per primary factor (trt vs untrt) I think the secondary factor might be confusing the function? and I don't get any warning with edgeR on it's own (running it with a similar design outside tidybulk)

test_differential_abundance(
.formula = ~ 0 + dex + cell,
.contrasts = c("dextrt - dexuntrt"))```

tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
Joining, by = "transcript"
tidybulk says: to access the raw results (glmFit) do `attr(..., "internals")$edgeR`
Joining, by = c("transcript", "lowly_abundant")
Warning message:
In get_differential_transcript_abundance_bulk(.data, .formula, .sample = !!.sample,  :
  tidybulk says: You have less than two replicated for each factorial condition

Feedback from talk: get_bibliography

@mblue9, FYI at the Monash talk I received very good feedback about get_bibliography.

I was suggested to include what methods/functions are related to each bib citation, for giving the opportunity to people to right methods without guessing.

abundant transcript vocabulary

For a function that adds a column with .abundant label TRUE or FALSE do you think is better

  • find_abundant
  • identify_abundant
  • label_abundant
  • mark_abundant
  • flag_abundant
  • annotate_abundant
  • ...

Should filterByExpr be added into some functions help documentation?

Do keep_abundant and test_differential_abundance use edgeR's filterByExpr?

I thought so from what we discussed in the past but maybe I'm wrong?

As here I've said filterByExpr is used
https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html#filtering-lowly-expressed-genes

But there's no mention of it in the help for keep_abundant and test_differential_abundance.

Could you let me know if we should rephrase that or am happy to try help with the function documentation if this should be added.

curate bibliography list

it doesn't show the ref for voom, and not sure if it should have the refs for the specific edgeR methods (from the edgeR user guide)?

test_gene_enrichment error: object '.transcript' not found

Heya, I was trying to include a small example of using tidybulk's test_gene_enrichment in a tutorial but I can't get it to run. I'm getting the error Error in enquo(.transcript) : object '.transcript' not found.

The example in the function here https://github.com/stemangiola/tidybulk/blob/master/R/methods.R#L2807-L2822 gives the same error e.g.

# From tidybulk example
df_entrez = symbol_to_entrez(tidybulk::counts_mini, .transcript = transcript, .sample = sample)
df_entrez = aggregate_duplicates(df_entrez, aggregation_function = sum, .sample = sample, .transcript = entrez, .abundance = count)

library(EGSEA)
test_gene_enrichment(
			df_entrez,
			~ condition,
			.sample = sample,
			.entrez = entrez,
			.abundance = count,
       method = c("roast" , "safe", "gage"  ,  "padog" , "globaltest", "ora" ),
			species="human",
			cores = 2
		)

Is it something to do with .transcript not being an argument (.entrez is instead) but then there's a check for .transcript here ?

warning_if_data_is_not_rectangular(.data, !!.sample, !!.transcript, !!.abundance)

test_gene_enrichment: No group or design set

As discussed recording here as issue:
I tried this below with tidybulk dev, it ran but I got this message saying “No group or design set. Assuming all samples belong to one group” but they’re not all from the same group, so does it need to be given a group or design matrix or is the .formula meant to handle that?

readRDS("ent.rds") %>%
filter(entrez %>% is.na %>% !) %>%
test_gene_enrichment(
.formula = ~ 0 + dex + cell,
.sample = sample,
.entrez = entrez,
.abundance = count,
.contrasts = c("dextrt - dexuntrt"),
species = "human",
cores = 4
)

tidybulk says: The design column names are "dextrt, dexuntrt, cellN061011, cellN080611, cellN61311"
No group or design set. Assuming all samples belong to one group.

Survival Analysis

Hello Stefano,

Thank you for the great packages.
I've noticed that with tidybulk you can perform a survival analysis in terms of relationships of different subpopulations with survival.

I was wondering if it's possible to use your package to estimate Kaplan-Meier survival curves. What I am trying to do is understanding if the expression of one gene or even a gene module (defined as a categorical variable: low, mid, high) predict survival. Something similar to what some tools (eg gepia2) allow to do for TCGA data.

Thank you

Best,
Francesco

Quiet parameter for keep_variable()

The keep_variable() function prints out the following phrase by default:
## Getting the 500 most variable genes

While it is not important for scripting, the message stays at the R markdown document even I designate knitr::opts_chunk$set(message = F)

Is it possible to suppress this message?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.