Giter Club home page Giter Club logo

mitch's Introduction

mitch

mitch is an R package for multi-dimensional enrichment analysis. At it's heart, it uses a rank-MANOVA based statistical approach to detect sets of genes that exhibit enrichment in the multidimensional space as compared to the background. Mitch is useful for pathway analysis of profiling studies with two to or more contrasts, or in studies with multiple omics profiling, for example proteomic, transcriptomic, epigenomic analysis of the same samples. Mitch is perfectly suited for pathway level differential analysis of scRNA-seq data.

Installation

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mitch")

library("mitch")

Workflow overview

Importing gene sets

mitch has a function to import GMT files to R lists (adapted from Yu et al, 2012 in the clusterProfiler package). For example:

download.file("https://reactome.org/download/current/ReactomePathways.gmt.zip", destfile="ReactomePathways.gmt.zip")
unzip("ReactomePathways.gmt.zip")
genesets <- gmt_import("ReactomePathways.gmt")

Importing profiling data

mitch accepts pre-ranked data supplied by the user, but also has a function called mitch_import for importing tables generated by Limma, edgeR, DESeq2, ABSSeq, Sleuth, Seurat and Muscat. By default, only the genes that are detected in all contrasts are included, but this behaviour can be modified. The below example imports two edgeR tables called "dge1" and "dge2". Where gene identifiers are coded in the row names.

x <- list("dge1"=dge1, "dge2"=dge2)
y <- mitch_import(x, DEtype="edger")

mitch can do unidimensional analysis if you provide it a single profile as a dataframe.

y <- mitch_import(df, DEtype="edger")

If the gene identifiers are not given in the rownames, then the column can be specified with the geneIDcol parameter like this:

y <- mitch_import(df, DEtype="edger", geneIDcol="MyGeneIDs")

By default, differential gene activity is scored using the directional nominal p-value.

S = -log10(p-value) * sgn(logFC)

If this is not desired, then users can perform their own custom scoring procedure.

There are many cases where the gene IDs don't match the gene sets. To overcome this, mitch_import also accepts a two-column table that relates gene identifiers in the profiling data to those in the gene sets.

?mitch_import provides more instructions on using this feature.

Calculating enrichment

The mitch_calc function performs multivariate enrichment analysis of the supplied gene sets in the scored profiling data. At its simpest form mitch_calc function accepts the scored data as the first argument and the genesets as the second argument. Users can prioritise enrichments based on small adjusted p-values, or by the observed effect size (magnitude of the enrichment score).

res <- mitch_calc(y, genesets, priority="significance")
res <- mitch_calc(y, genesets, priority="effect")

You can peek at the top results with head like this:

head(res$enrichment_result)

By default, mitch_calc uses mclapply to speed up calculations on all but one available CPU threads. This behaviour can be modified by setting the cores to a desred number.

res <- mitch_calc(y, genesets, priority="significance", cores=4)

By default, gene sets with fewer than 10 members present in the profiling data are discarded. This threshold can be modified using the minsetsize option. There is no upper limit of gene set size.

res <- mitch_calc(y, genesets, priority="significance", minsetsize=20)

By default, in downstream visualisation steps, charts are made from the top 50 gene sets, but this can be modified using the resrows option.

res <- mitch_calc(y, genesets, priority="significance", resrows=100)

Generate a HTML report

Can be done like this:

mitch_report(res, "myreport.html")

Take a look at an example report.

Generate high resolution plots

In case you want the charts in PDF format, these can be generated as such:

mitch_plots(res, outfile="mycharts.pdf")

Take a look at an example plot set.

scRNA-seq

This type of data is notoriously sparse. Mitch works best if there are >1000 genes detected in each cell type (contrast). If there are <400 genes present it may cause mitch to run into an error in the MANOVA test due to insufficient degrees of freedom.

Infinium DNA methylation data

Mitch can be applied to Infinium Beadarray methylation data if you have limma results for probes. Please see our solution described at our other repository.

Contrbutions

If you have questions or need help with applying mitch to your work, please raise an issue.

mitch's People

Contributors

jwokaty avatar markziemann avatar nturaga avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

mitch's Issues

Create a bubble plot for unidimensional analysis

here is some working code

library(ggplot2)
top <- head(res_eff$enrichment_result,50)
top <- top[order(top$s.dist),]
top$set <- factor(top$set, levels = top$set[order(top$s.dist)])
ggplot(top, aes(s.dist, set, size = setSize)) + geom_point(aes(colour=-log10(top$p.adjustANOVA)))

Error in names(y) %in% geneIDcol

Hi there,
I'm very eager to use your tool. However, I'm experiencing an issue with the first step ('mitch_import'). I would like to do 1D enrichment on a single data frame with own test statistics.

I do the following:

mitchDF <- data.frame(geneID = c("Aacs", "Aak1", "Aamdc", "Aars1", "Aasdhppt"),
value = c(-0.1293, -0.324, -0.5898, 1.1, .97))

y <- mitch::mitch_import(mitchDF, DEtype="prescored",
geneIDcol = 'geneID')

And get the following:
'The input is a single dataframe; one contrast only. Converting
it to a list for you.
Error in names(y) %in% geneIDcol :
promise already under evaluation: recursive default argument reference or earlier problems?'

Any help would be nice.
Thank you!

gmt_import problem with empty lines

If the GMT file contains empty lines it creates empty gene sets.

[126] "SRP102542: Genes expressed differentially in old people who do resistance training: downregulated"
[127] ""
[128] "SRP101810: bulk-cell RNA-seq of anti-CD3 anti-CD28 monoclonal antibodies stimulated T cell clone versus GAD peptide-loaded tetramer stimulated T cell clone: downregulated"

The solution is to remove empty lines as they are read in

Turn off pbmclapply

It results in a lot of output in RMD reports so the default should be OFF

mitch_import error when using Seurat v4

Hi,

I recently tried running mitch using differential expression results from Seurat's FindMarkers function.

de_list <- list()
for (i in 1:length(levels(cells))) {
  name <- levels(cells)[i]
  print(paste("Computing markers for", name))
  x <- 
    de_list[[name]] <- FindMarkers(cells, ident.1 = name, min.pct = 0.25, logfc.threshold = .75)
}
z <- mitch_import(de_list, DEtype="Seurat")

When I try to import the results into mitch, I'm met with:

Error in FUN(X[[i]], ...) : 
  Error, there is no column named 'avg_logFC' in the input

I believe this is because as of Seurat v4, they have changed the fold change column name from avg_logFC to avg_log2FC.

Renaming the column seems to fix the issue

de_list[[name]] <- FindMarkers(cells, ident.1 = name, min.pct = 0.25, logfc.threshold = .75) %>%
      rename(avg_logFC = "avg_log2FC")

R manova limit

mitch will fail if 70 or more dimensions are tested in one analysis
fit <- manova(x ~ inset)

Rmarkdown report chunk headers

When using mitch_report in another rmarkdown doc, header name clashes can occur.

Simple fix is to change the name of the chunks to something more obscure.

like: heatmap --> heatmapx1281

label: mitch
Quitting from lines 713-713 (/tmp/RtmpEQLoxz/file1fa79762b2da16.md) 
Error in parse_block(g[-1], g[1], params.src, markdown_mode) : 
  Duplicate chunk label 'heatmap', which has been used for the chunk:

Enrichment of very simillar samples

Currently, I´m using two samples which are very simillar and I ask myself how to test those adequate (also after reading your current paper in PLOS CompBiol). So, there are several questions I have:

  • The samples are RNA-seqs of 2 different knock-down experiments using the same Ctrl; experiments were performed in parallel.

  • Normally using GSEA I rather prefer to compare only the knock-downs against the controls, as the effect of the knock-downs are only valid against the controls. Having very simillar results I fear that "using knock-down 1" vs "knock down 2" in a weighted gsea with gene set permutation, signal2noise metric etc leads to results which are overrated. Do you think GSEA statsitical testing is adequate for this kind of testing?

  • As I use also self-generated gene sets with GSEA: I think using self-genearted gene sets should always be done with sufficient background -meaning embedded in gene set collections with a adequate size. Do you have any experience about that?

  • Comparing the knock-down with each other I normally use mitch with several ranked diff.expressed gene list (knock-down 1 vs Ctrl tested against knock-down 2 vs Ctrl ). These contrasts are to simillar so that I do not receive a valid result by mitch using e.g. GO BP or simillar. As we are interested in a distinct biological function I ask if it is worth and appropriate to test only a rather low number of self generated gene sets to evaluate a difference or simillarity: not 1000 but rather dozens of gene sets?

Replace taucharts with echarts4r

df %>% 
  e_charts(State) %>% # initialise and set x
  e_line(Population, smooth = TRUE) %>%  # add a line
  e_area(Income, smooth = TRUE) %>%  # add area
  e_x_axis(name = "States") %>%  # add x axis name
  e_title("US States", "Population & Income") %>%  # Add title & subtitle
  e_theme("infographic") # theme

Q: should I apply a cut off when importing DEGs

Hello Mark,

Do I understand correctly, that I should import the complete list of differentially expressed genes, without applying for example a fold change cut off?

Thank you for your time.

Best,
Paul

Problem when importing a single dataframe with prescored data

Example:

data(rna)
rna$genename <- rownames(rna)
rownames(rna)<-1:nrow(rna)
y <- mitch_import(rna, DEtype="edgeR", geneIDcol = 'genename')
rna$score <- sign(rna$logFC) * -log10(rna$PValue)
rna2 <- rna[,c("genename","score")]
y <- mitch::mitch_import(rna2, DEtype="prescored", geneIDcol = 'genename')

Yields an error:

y <- mitch::mitch_import(rna2, DEtype="prescored", geneIDcol = 'genename')
The input is a single dataframe; one contrast only. Converting
        it to a list for you.
Error in names(y) %in% geneIDcol : 
  promise already under evaluation: recursive default argument reference or earlier problems?

Error when running several DE genes per cluster

Good day,

I want to try your tool. I have tried to apply it to my dataset (scRNA-10x preprocessed by Seurat). When I run the functions for one cluster it (y <- mitch_import(markers_0, DEtype="Seurat")) it seems to work but once I make a list of DE genes per cluster and try to run it, I get an error:

markers_0 <- FindMarkers(seu_obj, ident.1 = 0, min.pct = 0.25)
markers_1 <- FindMarkers(seu_obj, ident.1 = 1, min.pct = 0.25)
markers_2 <- FindMarkers(seu_obj, ident.1 = 2, min.pct = 0.25)
markers_3 <- FindMarkers(seu_obj, ident.1 = 3, min.pct = 0.25)
markers_4 <- FindMarkers(seu_obj, ident.1 = 4, min.pct = 0.25)

x <- list('clus_0'=markers_0,'clus_1'=markers_1,'clus_2'=markers_2,'clus_3'=markers_3,
         'clus_4'=markers_4)
y <- mitch_import(x, DEtype="Seurat")

Warning message in mitch_rank(input_profile):
“Warning: >60% of genes have the same score. This isn't
            optimal for rank based enrichment analysis.”
Warning message in mitch_rank(input_profile):
“Warning: >60% of genes have the same score. This isn't
            optimal for rank based enrichment analysis.”
Warning message in mitch_rank(input_profile):
“Warning: >60% of genes have the same score. This isn't
            optimal for rank based enrichment analysis.”
Warning message in mitch_rank(input_profile):
“Warning: >60% of genes have the same score. This isn't
            optimal for rank based enrichment analysis.”
Warning message in mclapply(X, FUN, ..., mc.cores = mc.cores, mc.preschedule = mc.preschedule, :
“all scheduled cores encountered errors in user code”
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors): cannot coerce class ‘"try-error"’ to a data.frame
Traceback:

1. mitch_calc(y, genesets, priority = "significance", cores = 4)
2. MANOVA(ranked_profile, genesets, minsetsize = minsetsize, cores = cores, 
 .     priority = priority)
3. ldply(res, data.frame)
4. llply(.data = .data, .fun = .fun, ..., .progress = .progress, 
 .     .inform = .inform, .parallel = .parallel, .paropts = .paropts)
5. structure(lapply(pieces, .fun, ...), dim = dim(pieces))
6. lapply(pieces, .fun, ...)
7. FUN(X[[i]], ...)
8. as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
9. as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
10. stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])), 
  .     domain = NA)

The data.frame y contains multiple NA per column since not all DE genes per cluster are the same. What would be the best way to overcome this issue?

Question on Type I Error

Oh hi Mark,
First of all, thank you so much for your help on Twitter. I'd never heard of this tool, and it turns out to work very well for what we need. For context, we have 71 samples and roughly 2000 genes with recorded copy number aberration, RNAseq, and shotgun proteomics. We are interested in detecting gene sets / pathways that are differentially expressed between tumour v normal samples concurrently across all three platforms.

In order to test this tool before applying it to our real data, we created a few gene sets of our own, and increased the expression levels for some of the samples within a couple of those pathways. We were very happy to find that mitch:: detected all five pathways for which we synthetically boosted the signal. Here are two screenshots with some results, the first where we set priority = "effect", and the second where we set priority = "significance":

ranked by effect
image

As you can see, we are very pleased with the results when we sort the pathways by "effect" rather than just p-value alone. However, we are stuck on the best way to calculate test size / Type I error if we don't know a priori how many gene sets should be significant. If we use the "significance" ranking, then the Type I error rate is over 80% in our simulation. If we use "effect", we don't know how many pathways to treat as significant. We did notice that the column s.dist was above 1 for all of the treated pathways and less than 1 for the non-treated pathways, so we could create a threshold that way (the issue there is that we don't really know what the s.dist column measures or how it's calculated--the help files are a touch sparse in that area).

Summary: we like the tool, but we need to know the best technique to measure gene set significance with a well-controlled Type I error rate.

Error in sparse matrix

With the detailed gene set reports in both the mitch_plots and mitch_report I get this error.

plot: [1,4] [=======>------------------------------------------] 16% est:28s Error in cor.test.default(x, y, method = method, use = use) :
not enough finite observations
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I think this is in the ggplot pairs plots. Not sure what to do if there are no points in a panel.

For now it is a good choice to omit the ggpairs plot if there are all NAs present in one panel

NA handling bug in unidimensional analysis

ww<-w[,9,drop=F] ; res<-mitch_calc(ww,gsets,priority="significance",cores=4)
|======================================================| 100%, Elapsed 00:03
Note: When prioritising by significance (ie: small p-values), large effect sizes might be missed.
tail(res$enrichment_result)
set setSize pANOVA s.dist p.adjustANOVA
826 MPSHyRBdPzYdquC 36 0.9971732 NA 0.9997527
191 rGjOWewGivEfCIs 43 0.9972467 NA 0.9997527
638 ITsoDQexoewokLN 38 0.9976097 NA 0.9997527
636 gmeFdvoYpgCYCJE 41 0.9978909 NA 0.9997527
423 NjfDeOpAaStHUwT 44 0.9987549 NA 0.9997527
534 HEDzVBVOCLdhnpi 38 0.9997527 NA 0.9997527

Use GSEABase to import genesets

GSEABase has a function for gmt import called getGmt
It can be converted early on with with the mitch_calc() function

library("GSEABase")
reactome<-getGmt("ReactomePathways.gmt")
gslist<-lapply(reactome, function(x) {x@geneIds } )
names(gslist)<-lapply(reactome, function(x) {x@setName } )
head(gslist)

After that's done, I will need to change a few things

  •  DESCRIPTION import getGmt from GSEABase
  •  mitch to import getGmt from GSEABase
  •  examples
  •  README.md
  •  Vignette
  •  Tests
  •  Break up mitch.R into chunks, misc, import, calc, plots

mitch_report fails if an absoulte path is given as outfile

When I attempt to run mitch_report using an absolute path for outfile, it will fail with an error message like:

> data(resExample)
> mitch_report(resExample, '/desktop-home/heyer/outres2.html')
Error in file(file, "wb") : cannot open the connection
In addition: Warning message:
In file(file, "wb") :
  cannot open file '/tmp/Rtmpv71mKa//desktop-home/heyer/outres2.RDataTmp': No such file or directory
Warning message:
In file.remove(outfile) :
  cannot remove file '/tmp/Rtmpv71mKa//desktop-home/heyer/outres2.RDataTmp', reason 'No such file or directory'

I guess this is somehow related to how the temporary file paths are created from the outfile in the mitch_report function.
I don't know if this is somehow related to #10, but after installing the package from GitHub directly i still ran into this issue.

Pre-filtering of results?

Hi there!

Thanks so much for the package. I've found it incredibly useful, intuitive, and extremely well documented -- really appreciate all the work you've put into it and making it so accessible!

I just had a quick question: should I be filtering DE results for significance before feeding the tables into mitch_import()?

Obviously this is quite common in many of enrichment/ORA analysis tools, but it seems the D score calculated by mitch may essentially account for this when combined with the p.adjustMANOVA in the final results.

Thanks again for the package and any insight you can provide!

Dakota

Output file naming mitch_report()

Dear Mark,

I think there is a minor problem with the mitch_report() function. When trying to specify the output file path of the html report it is appending the path to the working directory. i think this should be avoided when specifying absolute paths.

 HTMLNAME <- paste(getwd(), HTMLNAME, sep = "/")

Best regards,

Joschka

Import from more single cell packages in bioconductor

Bioconductor

  • SCDE ✔ colname="Z" is the Z-score. Scde only compares between cell types. This is not what I'm after, but could include it for completeness.
  • MAST ✔ colname = "Pr(>Chisq)" is the p value and "coef" is a log fold change. Note that data.table object is converted to normal dataframe before import
  • scDD ✖ looks like a useful pkg but because it classifies genes as DE, DM, DP, etc it is hard to summarise this into a rnking value
  • Monocle (v2) ✖ none of these vignettes actually work and there are several different ways to apply it such as trajectory and cluster analysis and the significance values are variable between version and in some cases there is no fold change given so hard to actually figure out the direction of genes. I'll wait to see V3 in bioconductor before adding.
  • DEsingle ✔ colname = "pvalue" for p-value and "foldChange" for linear fold change. easy.

CRAN

GitHub

  • D3E
  • SigEMD

Other

  • SINCERA

spaces in colnames causes an error in ggpairs/ggcontour plot

processing file: mitch.Rmd
 plot: [2,1] [===============>----------------------------------] 31% est: 1s Quitting from lines 883-1203 (mitch.Rmd)
Error in ss[, gsub("~", "", as.character(mapping[2]))] :
  subscript out of bounds

Was resolved by removing spaces. Consider stripping spaces from colnames and replacing with underscores

Lots of warnings in the vignette

While the vignette works perfectly, there are many warnings during the mitch_import as seen below

library("stringi")
# obtain vector of gene names
genenames<-rownames(rna)
# create fake accession numbers
accessions<-paste("Gene0",stri_rand_strings(nrow(rna)*2, 6, pattern = "[0-9]"),sep="")
accessions<-head(unique(accessions),nrow(rna))
# create a gene table file that relates gene names to accession numbers
gt<-data.frame(genenames,accessions)

# now swap gene names for accessions
rna2<-merge(rna,gt,by.x=0,by.y="genenames")
rownames(rna2)<-rna2$accessions
rna2<-rna2[,2:5]

k9a2<-merge(k9a,gt,by.x=0,by.y="genenames")
rownames(k9a2)<-k9a2$accessions
k9a2<-k9a2[,2:5]

# now have a peek at the input data before importing
head(rna2,3)

##                 logFC  logCPM      PValue adj.p.value
## Gene0962203  0.296028 6.82814 3.84512e-04 1.46941e-02
## Gene0604687 -0.375440 4.71470 1.09120e-03 3.05432e-02
## Gene0825040  0.882624 8.12078 2.11945e-11 8.01642e-09

head(k9a2,3)

##                logFC  logCPM      PValue adj.p.value
## Gene0962203 0.339535 3.67309 1.62925e-03 1.43363e-02
## Gene0604687 0.585837 3.66069 3.23724e-04 4.06552e-03
## Gene0825040 1.138700 2.78713 4.94270e-14 1.69263e-11

head(gt,3)

##   genenames  accessions
## 1     NR4A3 Gene0107342
## 2    HSPA1B Gene0651366
## 3    DNAJB1 Gene0239749

x<-list("rna2"=rna2,"k9a2"=k9a2)
y<-mitch_import(x,DEtype="edgeR",geneTable=gt)

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

Import prescored data

The option to import custom scored data is called "preranked" currently but "prescored" is a better term.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.