
MPLNClust

Finite Mixtures of Multivariate Poisson-Log Normal Model for Clustering Count Data


Description

MPLNClust is an R package for clustering count data using finite mixtures of the multivariate Poisson-log normal (MPLN) distribution proposed by Silva et al., 2019. It was developed for count data, with clustering of RNA sequencing data as the motivating application; however, the method may be applied to other types of count data. The package provides functions for parameter estimation via 1) an MCMC-EM framework by Silva et al., 2019 and 2) a variational Gaussian approximation with an EM algorithm by Subedi and Browne, 2020. Information criteria (AIC, BIC, AIC3, and ICL) and slope heuristics (Djump and DDSE, if more than 10 models are considered) are offered for model selection. Functions for simulating data from this model and for visualization are also included.

Installation

To install the latest version of the package:

require("devtools")
devtools::install_github("anjalisilva/MPLNClust", build_vignettes = TRUE)
library("MPLNClust")

To run the Shiny app:

MPLNClust::runMPLNClust()

Overview

To list all functions available in the package:

ls("package:MPLNClust")

MPLNClust contains 14 functions.

  1. mplnVariational for carrying out clustering of count data using mixtures of MPLN via variational expectation-maximization
  2. mplnMCMCParallel for carrying out clustering of count data using mixtures of MPLN via a Markov chain Monte Carlo expectation-maximization algorithm (MCMC-EM) with parallelization
  3. mplnMCMCNonParallel for carrying out clustering of count data using mixtures of MPLN via a Markov chain Monte Carlo expectation-maximization algorithm (MCMC-EM) with no parallelization
  4. mplnDataGenerator for generating simulated data via mixtures of MPLN
  5. mplnVisualizeAlluvial for visualizing clustering results as Alluvial plots
  6. mplnVisualizeBar for visualizing clustering results as bar plots
  7. mplnVisualizeHeatmap for visualizing clustering results as heatmaps
  8. mplnVisualizeLine for visualizing clustering results as line plots
  9. AICFunction for model selection
  10. AIC3Function for model selection
  11. BICFunction for model selection
  12. ICLFunction for model selection
  13. runMPLNClust for running a Shiny implementation of mplnVariational
  14. mplnVarClassification for classification; currently under construction

The framework of mplnVariational makes it computationally efficient and faster than mplnMCMCParallel or mplnMCMCNonParallel. Therefore, mplnVariational may perform better for large datasets. For more information, see the Details section below. An overview of the package is illustrated below:

Details

The MPLN distribution (Aitchison and Ho, 1989) is a multivariate log normal mixture of independent Poisson distributions. The hidden layer of the MPLN distribution is a multivariate Gaussian distribution, which allows for the specification of a covariance structure. Further, the MPLN distribution can account for overdispersion in count data. Additionally, the MPLN distribution supports negative and positive correlations.
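The hierarchy described above can be sketched numerically (shown here in Python with NumPy for brevity; the package itself is in R, and this is an illustration of the distribution, not package code). A draw from an MPLN distribution is a Poisson draw whose log-means come from a multivariate Gaussian, so covariance in the hidden layer carries over to correlation between the counts:

```python
import numpy as np

def rmpln(n, mu, sigma, rng=None):
    """Draw n observations from a d-dimensional MPLN distribution.

    Hidden layer: Z ~ N(mu, sigma); observed counts: Y_j ~ Poisson(exp(Z_j)).
    """
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(mu, sigma, size=n)  # latent Gaussian layer
    return rng.poisson(np.exp(z))                   # observed count layer

# A negative covariance in the hidden layer induces negative
# correlation between the observed counts, which an independent
# Poisson model cannot represent.
mu = np.array([1.0, 1.0])
sigma = np.array([[1.0, -0.8],
                  [-0.8, 1.0]])
y = rmpln(5000, mu, sigma, rng=42)
print(np.corrcoef(y.T)[0, 1])  # negative
```

The marginal variance of each count exceeds its mean, which is how the model accounts for overdispersion.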

A mixture of MPLN distributions was introduced for clustering count data by Silva et al., 2019, where applicability was illustrated using RNA sequencing data. To date, two frameworks have been proposed for parameter estimation: 1) an MCMC-EM framework by Silva et al., 2019 and 2) a variational Gaussian approximation with an EM algorithm by Subedi and Browne, 2020.
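A G-component mixture adds a latent component label drawn from the mixing proportions; conditional on its label, each observation follows that component's MPLN distribution. A minimal sketch (again in Python for illustration; all parameter values below are made up):

```python
import numpy as np

def rmpln_mixture(n, pis, mus, sigmas, rng=None):
    """Draw n observations from a finite mixture of MPLN distributions."""
    rng = np.random.default_rng(rng)
    labels = rng.choice(len(pis), size=n, p=pis)      # latent component labels
    y = np.empty((n, len(mus[0])), dtype=np.int64)
    for g in range(len(pis)):                         # sample each component's block
        idx = np.flatnonzero(labels == g)
        z = rng.multivariate_normal(mus[g], sigmas[g], size=idx.size)
        y[idx] = rng.poisson(np.exp(z))
    return y, labels

# Two well-separated components with equal mixing proportions
pis = [0.5, 0.5]
mus = [np.array([0.5, 0.5]), np.array([3.0, 3.0])]
sigmas = [np.eye(2) * 0.1, np.eye(2) * 0.1]
y, labels = rmpln_mixture(1000, pis, mus, sigmas, rng=0)
```

Clustering inverts this generative process: given only y, estimate the parameters and recover the latent labels.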

MCMC-EM Framework for Parameter Estimation

Silva et al., 2019 used an MCMC-EM framework via Stan for parameter estimation. This method is employed in functions mplnMCMCParallel and mplnMCMCNonParallel.

Coarse-grain parallelization is employed in mplnMCMCParallel: when a range of components/clusters (g = 1,…,G) is considered, each component/cluster size is run on a different processor. This is possible because each component/cluster size is independent of the others. All components/clusters in the range to be tested are parallelized to run on separate cores using the parallel R package. The number of cores used for clustering is calculated as parallel::detectCores() - 1. No internal parallelization is performed in mplnMCMCNonParallel.

To check the convergence of MCMC chains, the potential scale reduction factor and the effective number of samples are used. Heidelberger and Welch's convergence diagnostic (Heidelberger and Welch, 1983) is used to check the convergence of the MCMC-EM algorithm. Starting values (argument: initMethod) and the number of iterations for each chain (argument: nIterations) play an important role in the successful operation of this algorithm.
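The potential scale reduction factor compares between-chain and within-chain variance; values near 1 indicate the chains have mixed. A hand-rolled sketch of the classic Gelman–Rubin computation (in Python, purely illustrative; the package relies on established R implementations of these diagnostics):

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction factor for an (m_chains, n_draws) array."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)          # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * w + b / n       # pooled variance estimate
    return np.sqrt(var_plus / w)

rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))            # chains on the same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [5.0]])  # one chain off by itself
print(rhat(mixed), rhat(stuck))  # near 1 vs well above 1
```

A chain that explores a different region inflates the between-chain term b, pushing the factor above 1 and flagging non-convergence.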

Variational-EM Framework for Parameter Estimation

Subedi and Browne, 2020 proposed a variational Gaussian approximation that alleviates the challenges of the MCMC-EM algorithm. Here, the posterior distribution is approximated by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities. A variational-EM based framework is used for parameter estimation. This algorithm is implemented in the function mplnVariational. The parsimonious family of models obtained by considering the eigen-decomposition of the covariance matrix in Subedi and Browne, 2020 is not yet available in this package.
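The variational step chooses an approximating Gaussian q to minimize KL(q ‖ p); for Gaussians the KL divergence is available in closed form, which is what makes the approximation tractable. A sketch of the univariate case (Python, purely illustrative of the objective being minimized, not of the package's internals):

```python
import numpy as np

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) between two univariate Gaussians, in closed form."""
    return (np.log(sd_p / sd_q)
            + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2)
            - 0.5)

# KL is zero when q matches p and grows as q moves away from p
print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
print(kl_gauss(1.0, 1.0, 0.0, 1.0))  # 0.5
```

Minimizing this divergence over the parameters of q yields the closed-form updates that replace the MCMC sampling step, which is the source of the speedup noted above.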

Model Selection and Other Details

Four model selection criteria are offered: the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), a variation of the AIC used by Bozdogan (1994) called AIC3, and the integrated completed likelihood (ICL; Biernacki et al., 2000). Slope heuristics (Djump and DDSE; Arlot et al., 2016) can be used for model selection if more than 10 models are considered.
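Given a fitted model's maximized log-likelihood, these criteria differ only in how they penalize the number of free parameters. A sketch of the first three (Python; the log-likelihood values below are hypothetical, and ICL is omitted since it additionally needs the estimated component memberships):

```python
import numpy as np

def aic(loglik, k):
    """AIC (Akaike, 1973): penalty of 2 per free parameter."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """BIC (Schwarz, 1978): penalty of log(n) per free parameter."""
    return -2 * loglik + k * np.log(n)

def aic3(loglik, k):
    """AIC3 (Bozdogan, 1994): penalty of 3 per free parameter."""
    return -2 * loglik + 3 * k

# Hypothetical fits: g -> (maximized log-likelihood, number of parameters).
# More components always improve the likelihood; the penalty decides
# whether the improvement is worth the extra parameters.
fits = {1: (-5200.0, 5), 2: (-5050.0, 11), 3: (-5045.0, 17)}
n = 500
best = min(fits, key=lambda g: bic(fits[g][0], fits[g][1], n))
print(best)  # 2: the jump from g=2 to g=3 does not justify 6 more parameters
```

Lower values are better for all three criteria; they frequently agree, but BIC and AIC3 penalize complexity more heavily than AIC.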

Starting values (argument: initMethod) and the number of initialization iterations (argument: nInitIterations) play an important role in the successful operation of this algorithm. There may be issues with singularity, in which case altering the starting values or the initialization method may help.

Shiny App

The Shiny app employing mplnVariational can be run and the results visualized:

MPLNClust::runMPLNClust()

In brief, runMPLNClust is a web application available with MPLNClust.

Tutorials

For tutorials and plot interpretation, refer to the vignette:

browseVignettes("MPLNClust")

Citation for Package

citation("MPLNClust")

Silva, A., S. J. Rothstein, P. D. McNicholas, and S. Subedi (2019). A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinformatics. 20(1):394.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data},
    author = {A. Silva and S. J. Rothstein and P. D. McNicholas and S. Subedi},
    journal = {BMC Bioinformatics},
    year = {2019},
    volume = {20},
    number = {1},
    pages = {394},
    url = {https://pubmed.ncbi.nlm.nih.gov/31311497/},
  }

Package References

Other References

Maintainer

Contributions

MPLNClust welcomes issues, enhancement requests, and other contributions. To submit an issue, use the GitHub issues tracker.

Acknowledgments

  • Dr. Marcelo Ponce, SciNet HPC Consortium, University of Toronto, ON, Canada for all the computational support.

  • This work was funded by the Natural Sciences and Engineering Research Council of Canada, the Queen Elizabeth II Graduate Scholarship, and the Arthur Richmond Memorial Scholarship.
