ramhiser / clusteval

Clustering Evaluation in R

License: Other

R 94.99% C++ 3.29% C 1.73%

clusteval's Issues

Fix erroneous documentation

Look at clusteval documentation on CRAN.
Identify erroneous documentation.
Fix documentation.
Bump version to 0.1.1.
Push changes to CRAN.

Implement Consensus Clustering

Update package documentation

The package documentation needs a lot of TLC. We need to write a package description and add it in three places:

README.md
R/help.r
DESCRIPTION

Adjusted Rand returns NaN when both vectors contain a single cluster

Repeatable example:

library(clusteval)
n <- 10
labels1 <- rep(1, n)
labels2 <- rep(2, n)
adjusted_rand(labels1, labels2)
[1] NaN

This case should return 1. I confirmed that mclust::adjustedRandIndex has this behavior. I also checked on some examples that we get the same similarity value as mclust::adjustedRandIndex.
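
One way to handle this edge case is sketched below in plain R. It is only an illustration, not the package's implementation: `adjusted_rand_safe` is a hypothetical name, and the single-cluster shortcut is an assumption modeled on the behavior described above.

```r
# Hypothetical helper: Adjusted Rand index with an explicit guard for the
# case where both label vectors contain a single cluster.
adjusted_rand_safe <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  tab <- table(labels1, labels2)

  # Both partitions are trivial and therefore identical: define the index as 1
  # instead of evaluating 0 / 0.
  if (nrow(tab) == 1 && ncol(tab) == 1) return(1)

  sum_cells <- sum(choose(tab, 2))           # pairs together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs together in labels1
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs together in labels2

  expected  <- sum_rows * sum_cols / choose(length(labels1), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

adjusted_rand_safe(rep(1, 10), rep(2, 10))
```

The guard only covers the single-cluster case from this issue; other degenerate inputs (for example, every observation in its own cluster in both vectors) would still divide by zero.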

Update package license for CRAN

From email I received from Kurt Hornik of CRAN:

Package: clusteval Version: 0.1
Check: DESCRIPTION meta-information ... NOTE
License components which are templates and need ‘+ file LICENSE’:
  MIT

It seems I need to add a LICENSE file.

Purge bloat from package

My initial exploratory research into this topic left a lot of bloat and feature creep in the package. Here, we enumerate all of the changes needed to purge it.

Remove TODO
Remove reliance on the foreach package
Remove reliance on the mclust package
Move consensus.r to a consensus branch for later research
Remove sim_gamma
Remove plot.r
Move NOTES to journal and then remove the file
Move wrapper functions used in the clustomit paper to that project folder

Write documentation for clustering similarity functions

The documentation for these functions is dismal. Let's fix that.

Update documentation for data simulation functions

Because the parameter configurations have changed significantly and because I never finished the documentation, finish the documentation for the following functions:

~~sim_unif~~
~~sim_normal~~
~~sim_student~~
sim_data

Implement Davies-Bouldin index

The Davies-Bouldin index is an internal clustering evaluation method. The formula is straightforward to implement and requires a distance metric to be given.

From Wikipedia:

Due to the way it is defined, as a function of the ratio of the within cluster scatter, to the between cluster separation, a lower value will mean that the clustering is better. It happens to be the average similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity is defined as Si above. This affirms the idea that no cluster has to be similar to another, and hence the best clustering scheme essentially minimizes the Davies Bouldin Index. This index thus defined is an average over all the i clusters, and hence a good measure of deciding how many clusters actually exists in the data is to plot it against the number of clusters it is calculated over. The number i for which this value is the lowest is a good measure of the number of clusters the data could be ideally classified into. This has applications in deciding the value of k in the kmeans algorithm, where the value of k is not known apriori.
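
A minimal sketch of the index in plain R follows. It assumes Euclidean distance, takes the within-cluster scatter to be the mean distance of a cluster's points to its centroid, and takes the separation to be the distance between centroids. The function name `davies_bouldin` and its arguments are hypothetical, not part of the package.

```r
# Hypothetical sketch of the Davies-Bouldin index (Euclidean distance).
davies_bouldin <- function(x, labels) {
  x <- as.matrix(x)
  stopifnot(nrow(x) == length(labels))
  clusters <- unique(labels)
  k <- length(clusters)
  stopifnot(k >= 2)

  # One centroid per cluster, stored as rows.
  centroids <- do.call(rbind, lapply(clusters, function(cl) {
    colMeans(x[labels == cl, , drop = FALSE])
  }))

  # Within-cluster scatter: mean distance of the points to their centroid.
  scatter <- sapply(seq_len(k), function(i) {
    pts <- x[labels == clusters[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centroids[i, ])^2)))
  })

  # Between-cluster separation: distance between centroids.
  separation <- as.matrix(dist(centroids))

  # Similarity of every pair of clusters; a cluster is never compared to itself.
  ratios <- outer(scatter, scatter, "+") / separation
  diag(ratios) <- -Inf

  # Average, over clusters, of the similarity to the most similar other cluster.
  mean(apply(ratios, 1, max))
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
davies_bouldin(x, c(1, 1, 2, 2))  # well-separated clusters, small value
davies_bouldin(x, c(1, 2, 1, 2))  # interleaved clusters, large value
```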

Email Bettina Gruen to add package to CRAN Clustering Task View

Email address: Bettina.Gruen at jku.at

Related packages are listed under Additional Functionality.

Provide summary to Bettina to add to the task view.

Rcpp core dump when strings are used instead of factors

Got this in an email from @khughitt:

passing invalid inputs (e.g. vectors of characters instead of numerics) to the `cluster_similarity` function leads to a core dump and the R session being killed, e.g.:

> cluster_similarity(c('a', 'b', 'c'), c('a', 'a', 'c'))
terminate called after throwing an instance of 'Rcpp::not_compatible'
  what():  Not compatible with requested type: [type=character; target=double].
[1]    27028 abort (core dumped)  R

A quick type check in the function(s) before calling the Rcpp functions should be enough to prevent this.
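
Such a check could look roughly like the sketch below. `check_cluster_labels` is a hypothetical helper, and converting labels to integer codes (rather than rejecting non-numeric input with an error) is an assumption about the desired behavior.

```r
# Hypothetical helper: validate two label vectors and convert them to integer
# codes before they are handed to compiled code.
check_cluster_labels <- function(labels1, labels2) {
  if (length(labels1) != length(labels2)) {
    stop("The two label vectors must have the same length.")
  }
  if (anyNA(labels1) || anyNA(labels2)) {
    stop("Cluster labels must not contain missing values.")
  }
  to_codes <- function(labels) {
    if (!is.atomic(labels)) {
      stop("Cluster labels must be an atomic vector or a factor.")
    }
    # factor() accepts characters, numerics, and factors alike.
    as.integer(factor(labels))
  }
  list(labels1 = to_codes(labels1), labels2 = to_codes(labels2))
}

check_cluster_labels(c('a', 'b', 'c'), c('a', 'a', 'c'))
```

The integer codes returned here could then be passed on to the Rcpp routines, so the type mismatch never reaches C++.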

Satisfy BDR for CRAN Submission

Along with ramhiser/itertools2#38, got Ripley'd over last night's CRAN submission. Thing to fix:

We see

checking top-level files ... NOTE
Non-standard file/directory found at top level:
‘cran-comments.md’

which should not be in the tarball. Please scrupulously follow the policies and check before submission.

Fix NOTES for CRAN Submission

There were two NOTES in my package submission. Ripley emailed me and told me to fix them:

We see

checking top-level files ... NOTE
Non-standard file/directory found at top level:
‘NEWS.md’

checking R code for possible problems ... NOTE
plot.clustomit: no visible binding for global variable ‘method’
plot.clustomit: no visible binding for global variable ‘ClustOmit’
plot.clustomit: no visible binding for global variable ‘Cluster’

Please fix

Update documentation for the `sim_unif` function to reflect changes to parameter configuration

Add reference to ClustOmit paper in clustomit function after published

Negative numbers in comembership table

For larger sample sizes combined with large numbers of cluster labels, comembership_table() can return negative numbers for the number of discordant pairs.

set.seed(1)
a <- sample(1:20, 70000, replace = TRUE)
b <- sample(1:20, 70000, replace = TRUE)
clusteval::comembership_table(a, b)

output:

$n_11
6125067
$n_10
116356347
$n_01
116372976
$n_00
-2083856686
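
The four counts should sum to choose(70000, 2) = 2449965000, which is larger than the biggest 32-bit signed integer (2147483647). The reported n_00 equals 2211110610 - 2^32, which points to integer overflow in the counting code. The sketch below sidesteps the problem by computing the counts in double precision from the contingency table. The name `comembership_counts` is hypothetical, and the meaning assigned to n_10 and n_01 is an assumption about the package's convention.

```r
# Hypothetical sketch: comembership counts computed in double precision.
comembership_counts <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  tab <- table(labels1, labels2)

  # choose() returns doubles, so none of these sums can overflow an integer.
  n_11   <- sum(choose(tab, 2))           # same cluster in both partitions
  same_1 <- sum(choose(rowSums(tab), 2))  # same cluster in labels1
  same_2 <- sum(choose(colSums(tab), 2))  # same cluster in labels2

  n_10 <- same_1 - n_11                   # together in labels1 only (assumed)
  n_01 <- same_2 - n_11                   # together in labels2 only (assumed)
  n_00 <- choose(length(labels1), 2) - n_11 - n_10 - n_01
  list(n_11 = n_11, n_10 = n_10, n_01 = n_01, n_00 = n_00)
}

set.seed(1)
a <- sample(1:20, 70000, replace = TRUE)
b <- sample(1:20, 70000, replace = TRUE)
comembership_counts(a, b)
```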

Update documentation for Rand functions

Functions:

rand
rand_glmm
rand_standard

Refactor clustering similarity functions

Currently, the clustering similarity functions implemented utilize helper functions from various packages. Some of them are much slower than my Rcpp implementation. To streamline the clustering similarity calculations, do the following:

Create comembership_summary function that uses Rcpp to compute the 2x2 similarity table
Remove wrapper functions for similarity indices
Remove Adjusted Rand implementation for now
Create jaccard_naive and rand_naive functions
Create wrapper function that computes clustering similarity with statistic and method = c('naive') arguments: See the entropy package for examples.
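
To make the intended structure concrete, here is a plain-R sketch of a naive comembership table with the Rand and Jaccard indices computed from it. The names `comembership_pairs`, `rand_from_table`, and `jaccard_from_table` are hypothetical stand-ins for the functions planned above, and the brute-force pair comparison is for illustration only (it needs O(n^2) memory).

```r
# Hypothetical sketch: 2x2 comembership counts by comparing every pair.
comembership_pairs <- function(labels1, labels2) {
  n <- length(labels1)
  stopifnot(n == length(labels2))
  upper <- upper.tri(matrix(0, n, n))  # each unordered pair exactly once
  same_1 <- outer(labels1, labels1, "==")[upper]
  same_2 <- outer(labels2, labels2, "==")[upper]
  c(n_11 = sum(same_1 & same_2),
    n_10 = sum(same_1 & !same_2),
    n_01 = sum(!same_1 & same_2),
    n_00 = sum(!same_1 & !same_2))
}

# Rand: fraction of pairs on which the two partitions agree.
rand_from_table <- function(tab) {
  unname((tab["n_11"] + tab["n_00"]) / sum(tab))
}

# Jaccard: like Rand, but pairs separated in both partitions are ignored.
jaccard_from_table <- function(tab) {
  unname(tab["n_11"] / (tab["n_11"] + tab["n_10"] + tab["n_01"]))
}

tab <- comembership_pairs(c(1, 1, 2, 2, 3), c(1, 2, 2, 2, 3))
rand_from_table(tab)
jaccard_from_table(tab)
```

Because the indices take the table rather than the labels, the same helpers could be reused when only the comembership counts are available.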

Implement Figure of Merit

This is too slow with the clValid package. Let's make it faster for future usage.

Create Github Pages site

Demonstrate clustering evaluation with examples.

Write quick-and-dirty vignette that lists clustering similarity statistics

This vignette should list all of the statistics implemented and include a brief description of clustering comembership and the calculation of the 2x2 contingency tables.

Implement Jain and Moreau's (1987) bootstrap methods

The main idea is to bootstrap B times and look at the width of confidence intervals as a measure of stability to determine the true number of clusters. The value of K that minimizes the width of the confidence interval of some criterion specified is the optimal value of K.
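
A rough sketch of the general idea, not of Jain and Moreau's exact procedure, is given below. It assumes k-means as the clustering algorithm, the total within-cluster sum of squares as a stand-in criterion, and percentile intervals; all three choices and the name `bootstrap_ci_width` are assumptions made for illustration.

```r
# Hypothetical sketch: width of a bootstrap percentile interval of a
# clustering criterion, computed for each candidate number of clusters.
bootstrap_ci_width <- function(x, k_values, B = 50, level = 0.95) {
  x <- as.matrix(x)
  alpha <- (1 - level) / 2
  sapply(k_values, function(k) {
    # Criterion on B bootstrap resamples of the rows of x.
    crit <- replicate(B, {
      idx <- sample(nrow(x), replace = TRUE)
      kmeans(x[idx, , drop = FALSE], centers = k, nstart = 5)$tot.withinss
    })
    unname(diff(quantile(crit, c(alpha, 1 - alpha))))
  })
}

set.seed(42)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
k_values <- 2:4
widths <- bootstrap_ci_width(x, k_values, B = 20)
k_values[which.min(widths)]  # candidate K with the narrowest interval
```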

Here's a link to the paper.

Add option to estimate Jaccard directly from comemberships

Currently, the Jaccard function takes only cluster labels as arguments, but for estimation purposes, it would be useful to have an option to pass the comemberships instead.

Do one of the following:

Add a flag to the current framework that says the labels are comemberships
Add helper functions that do this separately.

Implement Dunn index

The Dunn index is an internal evaluation technique. Cluster k results in an intracluster distance \Delta_k, which is computed as one of:

Max distance between all pairs
Mean distance between all pairs
Distance of all the points from the mean

An intercluster distance is then calculated as a comparison of the clusters.
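
As a minimal sketch, the version below uses the first option (the maximum distance between all pairs in a cluster) for \Delta_k and, as an assumption, takes the intercluster distance to be the smallest distance between two points in different clusters. It returns the ratio of the smallest intercluster distance to the largest \Delta_k. The name `dunn_index` is hypothetical, and Euclidean distance is assumed.

```r
# Hypothetical sketch of the Dunn index with Euclidean distance.
dunn_index <- function(x, labels) {
  d <- as.matrix(dist(x))
  clusters <- unique(labels)
  stopifnot(nrow(d) == length(labels), length(clusters) >= 2)

  # Delta_k: largest pairwise distance within cluster k.
  diameters <- sapply(clusters, function(cl) {
    max(d[labels == cl, labels == cl])
  })

  # Smallest distance between two points that belong to different clusters.
  different <- outer(labels, labels, "!=")
  min(d[different]) / max(diameters)
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
dunn_index(x, c(1, 1, 2, 2))  # compact, well-separated clusters score higher
```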

Update documentation for Jaccard functions

Functions:

jaccard
jaccard_glmm
jaccard_standard

Refactor the `clustomit` function and update its documentation

Update arguments' descriptions.
Allow for a more general, custom cluster_wrapper function

Implement Meila's (2007) Variation of Information (VI) metric

Information-theoretic criterion for comparing two partitions.

Paper: Comparing clusterings—an information based distance
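
The metric can be written as VI = H(C1) + H(C2) - 2 I(C1; C2), which equals 2 H(C1, C2) - H(C1) - H(C2). The sketch below computes it from the joint label distribution using natural logarithms. The function name `variation_of_information` is hypothetical.

```r
# Hypothetical sketch: Variation of Information between two partitions.
variation_of_information <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))

  # Joint distribution of the two labelings.
  joint <- table(labels1, labels2) / length(labels1)

  # Shannon entropy (natural log), skipping empty cells.
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }

  2 * entropy(joint) - entropy(rowSums(joint)) - entropy(colSums(joint))
}

variation_of_information(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions
variation_of_information(c(1, 1, 2, 2), c(1, 2, 1, 2))  # independent partitions
```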

Add S3 plot function for the 'clustomit' objects

The plot should look similar to the density plots in the ClustOmit paper.

Errors in simulation with simulated data sets

The simulations ran fine for the first 5 simulation configurations (i.e., the first 5 rows of simgrid) with small values of roughly B = 6 and D = 5.

Now, I have increased B to 100 and D to 100.

After that change, I get the following two errors and a warning multiple times:

Error : number of cluster centres must lie between 1 and nrow(x)
Error in kmeans(x = x, centers = num_clusters, nstart = num_starts, ...) :
  more cluster centers than distinct data points.
In addition: Warning message:
In FUN(c(2L, 1L, 3L)[[3L]], ...) : Returning NA

When mclapply exits, I receive:

Warning message:
In mclapply(seq_len(nrow(simgrid)), function(i) { :
  all scheduled cores encountered errors in user code

Need to resolve this so that we can move on with simulation.

Fuzzy Clustering Evaluations

Suleman (2017) shows that hard clustering similarities like Rand and Jaccard can be easily extended to fuzzy clusterings by replacing the 0/1 comembership indicator with a normalized Manhattan distance between cluster weights. I have implemented a prototype, but it would be nice to have it in C++ and available to other users through the existing clusteval package.
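
One possible reading of this idea is sketched below in plain R. It is not the prototype mentioned above and may differ from the paper's exact formulation. It assumes each row of a weight matrix sums to 1, so the Manhattan distance between two rows lies in [0, 2] and 1 minus half of it acts as a soft comembership. The names `fuzzy_comembership` and `fuzzy_rand` are hypothetical.

```r
# Hypothetical sketch: soft comembership from cluster membership weights.
fuzzy_comembership <- function(weights) {
  weights <- as.matrix(weights)
  stopifnot(all(abs(rowSums(weights) - 1) < 1e-8))
  # Manhattan distance between membership rows, scaled from [0, 2] to [0, 1].
  l1 <- as.matrix(dist(weights, method = "manhattan"))
  1 - l1 / 2
}

# Rand-like agreement: 1 minus the mean absolute difference between the
# soft comemberships of the two clusterings, taken over all pairs.
fuzzy_rand <- function(weights1, weights2) {
  c1 <- fuzzy_comembership(weights1)
  c2 <- fuzzy_comembership(weights2)
  stopifnot(nrow(c1) == nrow(c2))
  upper <- upper.tri(c1)
  1 - mean(abs(c1[upper] - c2[upper]))
}

# Hard labels written as one-hot weights, to compare with the ordinary Rand.
one_hot <- function(labels) outer(labels, sort(unique(labels)), "==") * 1
fuzzy_rand(one_hot(c(1, 1, 2, 2, 3)), one_hot(c(1, 2, 2, 2, 3)))
```

With one-hot weights the soft comembership is exactly 0 or 1, so this version reduces to the ordinary Rand index for hard clusterings.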

Add more cluster similarity functions

Determine a list of 5-10 similarity functions to add that utilize the comembership_summary function. Implement these.