
cola's Introduction

cola: A General Framework for Consensus Partitioning


Citation

Zuguang Gu, et al., cola: an R/Bioconductor package for consensus partitioning through a general framework, Nucleic Acids Research, 2021. https://doi.org/10.1093/nar/gkaa1146

Zuguang Gu, et al., Improve consensus partitioning via a hierarchical procedure, Briefings in Bioinformatics, 2022. https://doi.org/10.1093/bib/bbac048

Install

cola is available on Bioconductor. You can install it with:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("cola")

The latest version can be installed directly from GitHub:

library(devtools)
install_github("jokergoo/cola")

Methods

cola supports two types of consensus partitioning.

Standard consensus partitioning

Features

  1. It modularizes the consensus clustering process so that various methods can be easily integrated at different steps of the analysis.
  2. It provides rich visualizations for interpreting the results.
  3. It allows running multiple methods at the same time and provides functionalities to compare results in a straightforward way.
  4. It provides a new method to extract features that separate subgroups more efficiently.
  5. It generates detailed HTML reports for the complete analysis.

Workflow

The steps of consensus partitioning are:

  1. Clean the input matrix. The processing includes adjusting outliers, imputing missing values and removing rows with very small variance. This step is optional.
  2. Extract the subset of rows with the highest scores. Here "scores" are calculated by a certain method, called the "top-value method". For gene expression or methylation data analysis, the $n$ rows with the highest variance are used in most cases, so the top-value method is the variance (by var() or sd()). Note that the choice of the top-value method can be general. It can be, e.g., MAD (median absolute deviation) or any user-defined method.
  3. Scale the rows in the sub-matrix (e.g. gene expression) or not (e.g. methylation data). This step is optional.
  4. Randomly sample a subset of rows from the sub-matrix with probability $p$ and partition the columns of the matrix with a certain partitioning method, trying different numbers of subgroups.
  5. Repeat step 4 several times and collect all the partitions.
  6. Perform consensus partitioning analysis and determine the best number of subgroups, i.e. the one that gives the most stable subgrouping.
  7. Apply statistical tests to find rows that show significant differences between the predicted subgroups, e.g. to extract subgroup-specific genes.
  8. If rows in the matrix can be associated with genes, downstream analysis such as functional enrichment analysis can be performed.
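Steps 2 to 6 can be illustrated with a small base-R sketch. This is not cola's implementation: the toy matrix, the helper names `consensus_matrix()` and `pac()`, the use of k-means as the only partitioning method, and the simplified PAC-style stability score (fraction of consensus values strictly between 0.1 and 0.9) are all assumptions made for illustration.

```r
set.seed(123)

# Toy matrix: 200 rows (features) x 20 columns (samples). Columns 11-20
# are shifted in the first 50 rows, so two column groups exist.
mat = matrix(rnorm(200 * 20), nrow = 200)
mat[1:50, 11:20] = mat[1:50, 11:20] + 3

# Step 2: keep the n rows with the highest score (here: standard deviation).
n = 100
score = apply(mat, 1, sd)
sub = mat[order(score, decreasing = TRUE)[1:n], ]

# Step 3: scale rows; t() is needed because scale() works on columns.
sub = t(scale(t(sub)))

# Steps 4-5: repeatedly sample rows with probability p, partition the
# columns with k-means, and summarize how often two columns co-cluster.
consensus_matrix = function(sub, k, p = 0.8, n_repeat = 50) {
    partitions = replicate(n_repeat, {
        ind = which(runif(nrow(sub)) < p)
        kmeans(t(sub[ind, , drop = FALSE]), centers = k, nstart = 5)$cluster
    })
    cm = matrix(0, ncol(sub), ncol(sub))
    for (i in seq_len(n_repeat)) {
        cm = cm + outer(partitions[, i], partitions[, i], "==")
    }
    cm / n_repeat
}

# Step 6 (simplified): a stable partitioning has consensus values close
# to 0 or 1, so a small fraction of ambiguous values is preferred.
pac = function(cm, lower = 0.1, upper = 0.9) {
    v = cm[lower.tri(cm)]
    mean(v > lower & v < upper)
}

pac_by_k = sapply(2:4, function(k) pac(consensus_matrix(sub, k)))
names(pac_by_k) = 2:4
best_k = as.integer(names(which.min(pac_by_k)))
```

In this sketch `best_k` is simply the k with the smallest ambiguous fraction. cola reports several statistics per k (the issue list further down shows cophcor, PAC, mean_silhouette and concordance), so a single score is a deliberate simplification.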

Usage

Three lines of code to perform cola analysis:

mat = adjust_matrix(mat) # optional
rl = run_all_consensus_partition_methods(
    mat, 
    top_value_method = c("SD", "MAD", ...),
    partition_method = c("hclust", "kmeans", ...),
    cores = ...)
cola_report(rl, output_dir = ...)

Plots

The following plots compare consensus heatmaps with k = 4 under all combinations of methods.

Hierarchical consensus partitioning

Features

  1. It can detect subgroups which show major differences and also moderate differences.
  2. It can detect subgroups with large sizes as well as with tiny sizes.
  3. It generates detailed HTML reports for the complete analysis.

Hierarchical Consensus Partitioning

Usage

Three lines of code to perform hierarchical consensus partitioning analysis:

mat = adjust_matrix(mat) # optional
rh = hierarchical_partition(mat, mc.cores = ...)
cola_report(rh, output_dir = ...)

Plots

The following figure shows the hierarchy of the subgroups.

The following figure shows the signature genes.

License

MIT @ Zuguang Gu


cola's Issues

check PAC and concordance

            best_k   cophcor        PAC mean_silhouette concordance
ATC:NMF          2 0.9886968 0.06366852       0.9611175   0.9841618 **
MAD:kmeans       4 0.9959875 0.02129834       0.9598491   0.9797399 **
sd:NMF           4 0.9934111 0.03979490       0.9503534   0.9779191 **
cv:skmeans       2 0.9809859 0.13412099       0.9331213   0.9726301 **  <--
MAD:NMF          4 0.9903000 0.05275786       0.9348654   0.9721387 **
sd:kmeans        4 0.9919949 0.04824628       0.9337733   0.9665029 **
ATC:skmeans      4 0.9866620 0.07423391       0.9187754   0.9612428 **
MAD:pam          3 0.9842428 0.11316332       0.9065556   0.9576879 **
cv:NMF           2 0.9689004 0.19324018       0.8955448   0.9570809 ** <--
ATC:pam          2 0.9692288 0.19000105       0.8766683   0.9548844 ** <--
sd:pam           3 0.9788987 0.15456213       0.8897471   0.9515607 **

random sample significant rows

When there are more than 2000 signature rows, should we select the top 2000 rows or randomly sample 2000 rows from all significant rows?
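The two options can be compared with a small sketch. The FDR values below are simulated and the variable names (`fdr`, `max_rows`, `top_ind`, `random_ind`) are hypothetical; the sketch only shows what each selection does, not which one cola uses.

```r
set.seed(1)

# Hypothetical FDR values for 5000 rows that passed the significance cutoff.
fdr = runif(5000, min = 0, max = 0.05)
max_rows = 2000

# Option 1: keep the 2000 most significant rows.
top_ind = order(fdr)[seq_len(max_rows)]

# Option 2: draw 2000 rows uniformly at random from all significant rows.
random_ind = sort(sample(length(fdr), max_rows))

# Largest FDR retained by each option.
c(top = max(fdr[top_ind]), random = max(fdr[random_ind]))
```

The top-row subset is restricted to the smallest FDR values by construction, while the random subset keeps roughly the same FDR distribution as the full set of significant rows.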

do not group signature genes

For some cases it is difficult to assign signatures to a specific subgroup (e.g. to say they are subgroup 1 signatures), so we just make one heatmap for all differential genes.

get_signatures: split rows

Since rows show differences between subgroups, it should be easy to find an optimal k when applying k-means to the rows.

samples with small silhouette scores

In hierarchical partitioning, in each iteration, should all the samples be used or only the samples with silhouette scores larger than the cutoff?
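The second option can be sketched in base R as follows. Everything here is an assumption for illustration: the toy data, the hand-written `silhouette_width()` helper, k-means as the partitioning method and the cutoff of 0.5. Note that samples are stored as rows in this sketch so that dist() can be applied directly, whereas cola expects samples in columns.

```r
set.seed(42)

# Toy data: 30 samples (rows) in two groups, 5 features.
x = rbind(matrix(rnorm(15 * 5), ncol = 5),
          matrix(rnorm(15 * 5, mean = 4), ncol = 5))
cl = kmeans(x, centers = 2, nstart = 5)$cluster
d = as.matrix(dist(x))

# Silhouette width per sample: (b - a) / max(a, b), where a is the mean
# distance to the other members of the sample's own cluster and b is the
# smallest mean distance to the members of any other cluster.
silhouette_width = function(d, cl) {
    sapply(seq_along(cl), function(i) {
        same = cl == cl[i]
        same[i] = FALSE
        if (!any(same)) return(0)  # singleton cluster
        a = mean(d[i, same])
        others = setdiff(unique(cl), cl[i])
        b = min(vapply(others, function(g) mean(d[i, cl == g]), numeric(1)))
        (b - a) / max(a, b)
    })
}

sil = silhouette_width(d, cl)

# Keep only well-assigned samples for the next iteration.
cutoff = 0.5  # hypothetical cutoff
keep = sil > cutoff
x_next = x[keep, , drop = FALSE]
```

Samples that are dropped here would still need a final subgroup label, which is one reason the question in this issue is not trivial.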

reduce the duplicated generation of plots

Each plot, e.g. the consensus heatmap or the signature heatmap, is generated three times. Especially for signature heatmaps, this triples the running time. Think about a way to simplify this.

Assign TEMPLATE_DIR on load rather than at installation

This

TEMPLATE_DIR = system.file("extdata", package = "cola")

assigns TEMPLATE_DIR to the path when the package is being installed, rather than when it is loaded or attached. This will cause problems in a future R where 'staged install' is implemented. A better practice is along the lines of

TEMPLATE_DIR <- NULL

.onLoad <- function(...) {
    TEMPLATE_DIR <<- system.file(package="cola", ...
}
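A runnable sketch of the load-time assignment is given below. It assumes the `"extdata"` sub-directory from the original assignment and uses the standard `.onLoad(libname, pkgname)` signature. The super-assignment operator `<<-` is used so that the value is written to the package-level variable rather than to a local variable inside the function.

```r
# Placeholder at package level; the real path is only known at load time.
TEMPLATE_DIR = NULL

.onLoad = function(libname, pkgname) {
    # system.file() is evaluated when the namespace is loaded, so the path
    # points to the final installation directory of the package.
    TEMPLATE_DIR <<- system.file("extdata", package = pkgname)
}
```

R calls `.onLoad()` automatically when the namespace is loaded. Outside a package the function can be called by hand, e.g. `.onLoad(NULL, "cola")`, in which case `TEMPLATE_DIR` becomes an empty string if the package or the directory is not found.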
