mixr's Introduction

mixR: An R package for finite mixture modeling for both raw and binned data

Why mixR?

The R programming language provides a rich collection of packages for building and analyzing finite mixture models, which are widely used in unsupervised learning tasks such as model-based clustering and density estimation. For example,

  • mclust can be used to build Gaussian mixture models with different covariance structures
  • mixtools implements parametric and non-parametric mixture models as well as mixtures of Gaussian regressions
  • flexmix provides a general framework for finite mixtures of regression models
  • mixdist fits mixture models for grouped and conditional data (also called binned data).

To our knowledge, almost all R packages for finite mixture models, with the exception of mixdist, are designed to take raw data as the modeling input. However, the popular model selection methods based on information criteria or the bootstrap likelihood ratio test (McLachlan, 1987; Feng & McCulloch, 1996; Yu & Harvill, 2019) are not implemented in mixdist.

mixR is a package that aims to bridge this gap and to unify the interface for finite mixture modeling for both raw and binned data.

Installation

For the stable version (pre-compiled for Windows and OS X), install from CRAN:

install.packages('mixR')

To get the latest development version from GitHub:

# install.packages('devtools')
devtools::install_github('garybaylor/mixR')

Examples

  • Fitting a normal mixture model
library(mixR)

# generate data from a Normal mixture model
set.seed(102)
x1 = rmixnormal(1000, c(0.3, 0.7), c(-2, 3), c(2, 1))

# fit a Normal mixture model
mod1 = mixfit(x1, ncomp = 2)

# plot the fitted model
plot(mod1)

# fit a Normal mixture model (equal variance)
mod1_ev = mixfit(x1, ncomp = 2, ev = TRUE)
  • Fitting a Weibull mixture model
# generate data from a Weibull mixture model
x2 = rmixweibull(1000, c(0.4, 0.6), c(0.6, 1.3), c(0.1, 0.1))
mod2_weibull = mixfit(x2, family = 'weibull', ncomp = 2)
  • Fitting a mixture model with binned data
head(Stamp2)
##     lower  upper freq
## 1  0.0595 0.0605    1
## 5  0.0635 0.0645    2
## 6  0.0645 0.0655    1
## 7  0.0655 0.0665    1
## 9  0.0675 0.0685    1
## 10 0.0685 0.0695    7
mod_binned = mixfit(Stamp2, ncomp = 7, family = 'weibull')
plot(mod_binned)

# data binned from numeric data
x1_binned = bin(x1, seq(min(x1), max(x1), length = 30))
mod1_binned = mixfit(x1_binned, ncomp = 2)
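
Binned data does not have to come from bin(). The following is a minimal sketch of a hand-typed table, assuming that mixfit accepts any numeric matrix with the same three columns (lower, upper, freq) shown in the head(Stamp2) output above. The bin edges and counts are made-up illustration values, not real data, and the final mixfit call is left commented out so the block runs without mixR installed.

```r
# Hypothetical binned data typed in by hand (e.g. copied from a published
# histogram table). Each row is one bin: [lower, upper) and its count.
binned <- cbind(
  lower = c(0, 10, 20, 30, 40),
  upper = c(10, 20, 30, 40, 50),
  freq  = c(4, 17, 35, 21, 8)
)

# Basic sanity checks before fitting
stopifnot(all(binned[, "lower"] < binned[, "upper"]))
stopifnot(all(binned[, "freq"] >= 0))

print(binned)

# Assumption: mixfit treats a three-column matrix as binned data,
# as it does for Stamp2 and the output of bin() above.
# mod_manual <- mixR::mixfit(binned, ncomp = 2)
```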
  • Mixture model selection by BIC
# Selecting the best g for Normal mixture model
s_normal = select(x2, ncomp = 2:6)

# Selecting the best g for Weibull mixture model
s_weibull = select(x2, ncomp = 2:6, family = 'weibull')

plot(s_weibull)
plot(s_normal)
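
To make the selection criterion concrete, here is a rough sketch in base R of how a BIC value can be computed by hand for a normal mixture. It uses the conventional definition BIC = -2 * logLik + k * log(n); whether mixR uses exactly this sign and scaling is an assumption, and the helper names mix_loglik and bic_value are hypothetical. The parameters are the ones used to simulate x1 above, plugged in directly rather than estimated.

```r
# Log-likelihood of a univariate normal mixture at fixed parameters.
# w = mixing weights, mu = component means, sd = component std. deviations.
mix_loglik <- function(x, w, mu, sd) {
  # n x g matrix of weighted component densities
  dens <- sapply(seq_along(w), function(k) w[k] * dnorm(x, mu[k], sd[k]))
  sum(log(rowSums(dens)))
}

# Conventional BIC: smaller is better under this sign convention.
bic_value <- function(loglik, n_par, n) -2 * loglik + n_par * log(n)

# Simulate from a two-component normal mixture with base R only
set.seed(102)
n <- 1000
z <- sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))
x <- rnorm(n, mean = c(-2, 3)[z], sd = c(2, 1)[z])

ll <- mix_loglik(x, w = c(0.3, 0.7), mu = c(-2, 3), sd = c(2, 1))

# A g-component normal mixture with unequal variances has
# (g - 1) weights + g means + g standard deviations = 3g - 1 free parameters.
g <- 2
b <- bic_value(ll, n_par = 3 * g - 1, n = n)
print(c(loglik = ll, bic = b))
```

Under this definition the sign of the BIC depends on the scale of the data: when the mixture density is below 1 everywhere, as here, the log-likelihood is negative and the BIC is positive, whereas data concentrated on a narrow range can give a positive log-likelihood and a negative BIC.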
  • Mixture model selection by bootstrap likelihood ratio test (LRT)
b1 = bs.test(x1, ncomp = c(2, 3))
plot(b1, main = 'Bootstrap LRT for Normal Mixture Models (g = 2 vs g = 3)')
b1$pvalue

b2 = bs.test(x2, ncomp = c(2, 4))
plot(b2, main = 'Bootstrap LRT for Normal Mixture Models (g = 2 vs g = 4)')
b2$pvalue

For more examples, please see the vignette An Introduction to mixR.

Contributor Code of Conduct

Everyone is welcome to contribute to the project by reporting issues, posting feature requests, updating documentation, submitting pull requests, or contacting the project maintainer directly. To maintain a friendly atmosphere and to collaborate in a fun and productive way, we expect contributors to abide by the Contributor Code of Conduct.

Citation

Yu, Y., (2022). mixR: An R package for Finite Mixture Modeling for Both Raw and Binned Data. Journal of Open Source Software, 7(69), 4031, https://doi.org/10.21105/joss.04031

BibTeX information

@article{Yu2022,
  doi = {10.21105/joss.04031},
  url = {https://doi.org/10.21105/joss.04031},
  year = {2022},
  publisher = {The Open Journal},
  volume = {7},
  number = {69},
  pages = {4031},
  author = {Youjiao Yu},
  title = {mixR: An R package for Finite Mixture Modeling for Both Raw and Binned Data},
  journal = {Journal of Open Source Software}
}

mixr's Issues

statement of need within readme/documentation

I think you have an OK starting point for the 'statement of need' in your paper, but JOSS also evaluates whether you provide one in the documentation for the software.

I recommend adding a crisp #knowwhy to your readme to orient users and help guide them to other places if their needs are different.

has bic been consistently implemented?

when i do

mod1 = mixfit(x1, ncomp = 2)
mod1$bic
[1] 4238.221

with

s_normal = select(x2, ncomp = 2:6)
s_normal$bic
 [1] -386.7280 -387.1504 -403.0483 -443.2730 -445.2417 -440.1219 -431.4261 -426.6852 -440.6210 -408.1540

Binned data format

I think that one of the main contributions of this package is that it allows the user to work with binned data. However, the examples in the vignette that use binned data are based on simulated data that was later binned with the mixR::bin function, which may confuse an inexperienced user.

I think the vignette and the README file could be improved with an explicit example of the binned data matrix expected by the mixfit function, one that wasn't generated with mixR::bin.

model object

hey trying some of the code in the readme

# Selecting the best g for Normal mixture model
s_normal = select(x2, ncomp = 2:6)

It prints out: "The final model: normal mixture (equal variance) with 4 components"

I did

str(s_normal)
List of 5
 $ ncomp    : int [1:10] 2 2 3 3 4 4 5 5 6 6
 $ equal.var: chr [1:10] "Y" "N" "Y" "N" ...
 $ bic      : num [1:10] -387 -387 -403 -443 -445 ...
 $ best     : chr [1:10] " " " " " " " " ...
 $ family   : chr "normal"
 - attr(*, "class")= chr "selectEM"

I typed

 [1] " " " " " " " " "*" " " " " " " " " " "

Is that expected? Why?

documenting 'ev'

?bs.test
a logical value indicating whether the variance of each component should be the same or not (default FALSE). ev is ignored for other family members.

the statement 'ev is ignored for other family members' is not clear. need to provide default family (normal) somewhere?

more helpful error would be useful

b1 = bs.test(x1, ncomp = c(2, 3), B = 100, max_iter = 1)
Error in if (any(c(n < 0, pi < 0, sd < 0))) { : 
  missing value where TRUE/FALSE needed

test

one kind of test that would be cool = reproduce numbers from another package like mixtools. it may not make it to automated tests but useful to have it somewhere in the docs.

missing error message?

i tried

b1 = bs.test(x1, ncomp = c(3, 2), B = 100, max_iter = 2)

when documentation =

a vector of two positive integers specifying the number of components of the mixture model under the null and alternative hypothesis. The first integer should be smaller than the second one. The default value is c(1, 2).

should we throw a warning?
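
One possible shape for such a check, as a rough sketch only: the helper name check_ncomp is hypothetical and is not part of mixR, and the error wording is just a suggestion. It enforces the documented contract (two positive integers, the first smaller than the second) before any fitting starts.

```r
# Hypothetical validator for the `ncomp` argument of a bootstrap LRT.
# Not part of mixR; shown only to illustrate an early, readable error.
check_ncomp <- function(ncomp) {
  ok_shape <- is.numeric(ncomp) && length(ncomp) == 2 && !anyNA(ncomp)
  if (!ok_shape || any(ncomp < 1) || any(ncomp != round(ncomp))) {
    stop("`ncomp` must be a vector of two positive integers.", call. = FALSE)
  }
  if (ncomp[1] >= ncomp[2]) {
    stop("The first element of `ncomp` (null hypothesis) must be smaller ",
         "than the second (alternative hypothesis).", call. = FALSE)
  }
  invisible(as.integer(ncomp))
}

# Valid input passes through unchanged (as integers)
ok <- check_ncomp(c(2, 3))

# Reversed input fails early with an explicit message
msg <- tryCatch(check_ncomp(c(3, 2)), error = function(e) conditionMessage(e))
print(msg)
```

An error rather than a warning seems the safer choice here, since a reversed ncomp makes the test itself meaningless.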
