clusteval's People

Contributors

ramhiser

clusteval's Issues

Fix erroneous documentation

  1. Look at clusteval documentation on CRAN.
  2. Identify erroneous documentation.
  3. Fix documentation.
  4. Bump version to 0.1.1.
  5. Push changes to CRAN.

Update package documentation

The package documentation needs a lot of TLC. We need to write a package description and add it in three places:

  1. README.md
  2. R/help.r
  3. DESCRIPTION

Adjusted Rand returns NaN when both vectors contain a single cluster

Repeatable example:

library(clusteval)
n <- 10
labels1 <- rep(1, n)
labels2 <- rep(2, n)
adjusted_rand(labels1, labels2)
[1] NaN

This case should return 1. I confirmed that mclust::adjustedRandIndex has this behavior. I also checked on some examples that we get the same similarity value as mclust::adjustedRandIndex.
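
A guard for this degenerate case could look like the sketch below. It follows the single-cluster behavior of mclust::adjustedRandIndex described above. The name `adjusted_rand_sketch` is hypothetical, and the pure-R contingency-table formula is a stand-in for illustration, not the package's own implementation.

```r
# Hypothetical sketch, not clusteval's implementation.
adjusted_rand_sketch <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  tab <- table(labels1, labels2)

  # Both label vectors contain a single cluster: the partitions are
  # identical, so return 1 instead of evaluating 0 / 0.
  if (all(dim(tab) == c(1, 1))) {
    return(1)
  }

  sum_cells <- sum(choose(tab, 2))           # pairs together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs together in labels1
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs together in labels2

  expected  <- sum_rows * sum_cols / choose(length(labels1), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

adjusted_rand_sketch(rep(1, 10), rep(2, 10))  # 1
```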

Update package license for CRAN

From an email I received from Kurt Hornik of CRAN:

Package: clusteval Version: 0.1
Check: DESCRIPTION meta-information ... NOTE
License components which are templates and need ‘+ file LICENSE’:
  MIT

It seems I need to add a LICENSE file.

Purge bloat from package

My initial exploratory research into this topic introduced a lot of bloat and feature creep into the package. Here we enumerate all of the changes needed to purge that bloat.

  1. Remove TODO
  2. Remove reliance on the foreach package
  3. Remove reliance on the mclust package
  4. Move consensus.r to a consensus branch for later research
  5. Remove sim_gamma
  6. Remove plot.r
  7. Move NOTES to journal and then remove the file
  8. Move wrapper functions used in the clustomit paper to that project folder

Update documentation for data simulation functions

The parameter configurations have changed significantly, and I never finished the documentation. Finish it for the following functions:

  1. sim_unif
  2. sim_normal
  3. sim_student
  4. sim_data

Implement Davies-Bouldin index

The Davies-Bouldin index is an internal clustering evaluation method. The formula is straightforward to implement and requires the caller to supply a distance metric.

From Wikipedia:

Due to the way it is defined, as a function of the ratio of the within cluster scatter, to the between cluster separation, a lower value will mean that the clustering is better. It happens to be the average similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity is defined as Si above. This affirms the idea that no cluster has to be similar to another, and hence the best clustering scheme essentially minimizes the Davies Bouldin Index. This index thus defined is an average over all the i clusters, and hence a good measure of deciding how many clusters actually exists in the data is to plot it against the number of clusters it is calculated over. The number i for which this value is the lowest is a good measure of the number of clusters the data could be ideally classified into. This has applications in deciding the value of k in the kmeans algorithm, where the value of k is not known apriori.
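
As a rough illustration of the quoted definition, here is a minimal sketch of the index. It assumes Euclidean distance and uses the mean distance to the centroid as the within-cluster scatter. `davies_bouldin_sketch` is a hypothetical name and not part of clusteval.

```r
# Hypothetical sketch with Euclidean distance; not a clusteval function.
davies_bouldin_sketch <- function(x, labels) {
  x <- as.matrix(x)
  clusters <- sort(unique(labels))
  k <- length(clusters)
  stopifnot(nrow(x) == length(labels), k >= 2)

  # One centroid per cluster, stored as the rows of a k x p matrix.
  centroids <- do.call(rbind, lapply(clusters, function(cl) {
    colMeans(x[labels == cl, , drop = FALSE])
  }))

  # Within-cluster scatter: mean distance of the points to their centroid.
  scatter <- sapply(seq_len(k), function(i) {
    pts <- x[labels == clusters[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centroids[i, ])^2)))
  })

  # Between-cluster separation: distance between centroids.
  separation <- as.matrix(dist(centroids))

  # Similarity of clusters i and j; keep the most similar j for each i.
  ratios <- outer(scatter, scatter, "+") / separation
  diag(ratios) <- -Inf
  mean(apply(ratios, 1, max))
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
davies_bouldin_sketch(x, c(1, 1, 2, 2))  # 0.1
```

Lower values indicate a better clustering, so a plot of this value against the number of clusters can be scanned for its minimum.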

Rcpp core dump when strings are used instead of factors

Got this in an email from @khughitt:

Passing invalid inputs (e.g. vectors of characters instead of numerics) to the `cluster_similarity` function leads to a core dump and the R session being killed, e.g.:

> cluster_similarity(c('a', 'b', 'c'), c('a', 'a', 'c'))
terminate called after throwing an instance of 'Rcpp::not_compatible'
  what():  Not compatible with requested type: [type=character; target=double].
[1]    27028 abort (core dumped)  R

A quick type check in the function(s) before calling the Rcpp functions should be enough to prevent this.
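
One possible shape for that check is sketched below. Instead of rejecting character or factor labels, it maps them to numeric codes before they reach the compiled code. The helper name `as_label_codes` and the coercion strategy are assumptions, not the package's actual fix.

```r
# Hypothetical validator; name and coercion strategy are assumptions.
as_label_codes <- function(labels) {
  if (is.factor(labels)) {
    labels <- as.character(labels)
  }
  if (is.null(labels) || !is.atomic(labels)) {
    stop("cluster labels must be an atomic vector or a factor")
  }
  if (anyNA(labels)) {
    stop("cluster labels must not contain missing values")
  }
  # Codes 1..k in order of first appearance, returned as doubles.
  as.numeric(match(labels, unique(labels)))
}

as_label_codes(c("a", "b", "c"))  # 1 2 3
as_label_codes(c("a", "a", "c"))  # 1 1 2
```

Calling this on both label vectors at the top of the R-level function would turn the abort into either a valid input or an ordinary R error.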

Satisfy BDR for CRAN Submission

Along with ramhiser/itertools2#38, got Ripley'd over last night's CRAN submission. The thing to fix:

We see

  • checking top-level files ... NOTE
    Non-standard file/directory found at top level:
    ‘cran-comments.md’

which should not be in the tarball. Please scrupulously follow the policies and check before submission.
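
The usual way to keep a file out of the tarball is an entry in .Rbuildignore, which holds one regular expression per line. The base-R helper below is a small sketch of adding such an entry. `add_build_ignore` is a hypothetical name, and it assumes `file` is a plain file name at the package's top level.

```r
# Hypothetical helper: append an anchored, escaped pattern to .Rbuildignore.
add_build_ignore <- function(file, pkg_dir = ".") {
  # Escape dots so that "cran-comments.md" is matched literally.
  pattern <- paste0("^", gsub("\\.", "\\\\.", file), "$")
  path <- file.path(pkg_dir, ".Rbuildignore")
  existing <- if (file.exists(path)) readLines(path) else character(0)
  if (!pattern %in% existing) {
    writeLines(c(existing, pattern), path)
  }
  invisible(pattern)
}

# Run from the package root:
# add_build_ignore("cran-comments.md")
```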

Fix NOTES for CRAN Submission

There were two NOTES in my package submission. Ripley emailed me and told me to fix them:

We see

  • checking top-level files ... NOTE
    Non-standard file/directory found at top level:
    ‘NEWS.md’
  • checking R code for possible problems ... NOTE
    plot.clustomit: no visible binding for global variable ‘method’
    plot.clustomit: no visible binding for global variable ‘ClustOmit’
    plot.clustomit: no visible binding for global variable ‘Cluster’

Please fix
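
For the second NOTE, one common remedy is to declare the names with utils::globalVariables. This assumes the three names are data frame columns referenced through non-standard evaluation inside plot.clustomit, which the check output does not confirm. A sketch:

```r
# Declare names that R CMD check cannot resolve statically.
# In a package, this call would sit at the top level of a file under R/.
declared <- utils::globalVariables(c("method", "ClustOmit", "Cluster"))
```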

Negative numbers in comembership table

For larger sample sizes combined with large numbers of cluster labels, comembership_table() can return negative numbers for the number of discordant pairs.

set.seed(1)
a <- sample(1:20, 70000, replace = TRUE)
b <- sample(1:20, 70000, replace = TRUE)
clusteval::comembership_table(a, b)

Output:

$n_11
6125067
$n_10
116356347
$n_01
116372976
$n_00
-2083856686
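
The negative value is consistent with 32-bit signed integer overflow. With 70000 observations there are choose(70000, 2) = 2449965000 pairs, which exceeds 2^31 - 1 = 2147483647, and the implied n_00 of 2211110610 wraps around to exactly -2083856686. The sketch below checks that arithmetic and shows a pure-R count that stays in doubles. `comembership_counts` is a hypothetical name, and the n_10 / n_01 convention (together in the first labeling only, or in the second only) is an assumption.

```r
# Arithmetic check against the reported output.
total_pairs <- choose(70000, 2)                           # 2449965000
n_00_true <- total_pairs - 6125067 - 116356347 - 116372976  # 2211110610
n_00_true - 2^32                                          # -2083856686

# Hypothetical pure-R alternative that counts pairs with doubles.
comembership_counts <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  tab <- table(labels1, labels2)
  total <- choose(length(labels1), 2)
  n_11  <- sum(choose(tab, 2))           # together in both labelings
  same1 <- sum(choose(rowSums(tab), 2))  # together in labels1
  same2 <- sum(choose(colSums(tab), 2))  # together in labels2
  list(
    n_11 = n_11,
    n_10 = same1 - n_11,
    n_01 = same2 - n_11,
    n_00 = total - same1 - same2 + n_11
  )
}
```

Doubles represent integers exactly up to 2^53, so the counts stay exact for any realistic sample size. On the C++ side, a 64-bit integer or double accumulator would presumably serve the same purpose.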

Refactor clustering similarity functions

The current clustering similarity functions rely on helper functions from various packages, and some of those are much slower than my Rcpp implementation. To streamline the clustering similarity calculations, do the following:

  1. Create comembership_summary function that uses Rcpp to compute the 2x2 similarity table
  2. Remove wrapper functions for similarity indices
  3. Remove Adjusted Rand implementation for now
  4. Create jaccard_naive and rand_naive functions
  5. Create a wrapper function that computes clustering similarity, with statistic and method = c('naive') arguments. See the entropy package for examples.
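
The steps above could fit together roughly as follows. This is a sketch only: the pair counting is done in plain R with outer(), which needs O(n^2) memory, where the plan calls for Rcpp. The signatures of `rand_naive` and `jaccard_naive`, and the names `comembership_2x2` and `similarity_sketch`, are assumptions. The `method` argument is carried as a placeholder with a single choice.

```r
# Pure-R stand-in for the planned Rcpp 2x2 comembership summary.
comembership_2x2 <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  n <- length(labels1)
  pairs <- upper.tri(matrix(0, n, n))  # each unordered pair once
  same1 <- outer(labels1, labels1, "==")[pairs]
  same2 <- outer(labels2, labels2, "==")[pairs]
  list(
    n_11 = sum(same1 & same2),
    n_10 = sum(same1 & !same2),
    n_01 = sum(!same1 & same2),
    n_00 = sum(!same1 & !same2)
  )
}

# Similarity statistics computed from the 2x2 table.
rand_naive <- function(tab) {
  (tab$n_11 + tab$n_00) / (tab$n_11 + tab$n_10 + tab$n_01 + tab$n_00)
}
jaccard_naive <- function(tab) {
  tab$n_11 / (tab$n_11 + tab$n_10 + tab$n_01)
}

# Wrapper that dispatches on the requested statistic.
similarity_sketch <- function(labels1, labels2,
                              statistic = c("jaccard", "rand"),
                              method = c("naive")) {
  statistic <- match.arg(statistic)
  method <- match.arg(method)
  tab <- comembership_2x2(labels1, labels2)
  switch(statistic,
         jaccard = jaccard_naive(tab),
         rand = rand_naive(tab))
}

similarity_sketch(c(1, 1, 2, 2), c(1, 1, 1, 2), statistic = "jaccard")  # 0.25
similarity_sketch(c(1, 1, 2, 2), c(1, 1, 1, 2), statistic = "rand")     # 0.5
```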

Add option to estimate Jaccard directly from comemberships

Currently, the Jaccard function accepts only cluster labels as arguments. For estimation purposes, it would be useful to have an option to pass the comemberships directly instead.

Do one of the following:

  1. Add a flag to the current framework that says the labels are comemberships
  2. Add helper functions that do this separately.
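
The second option could be as small as the sketch below. It assumes the comemberships arrive as a list with elements n_11, n_10 and n_01, like the one comembership_table() returns. `jaccard_from_comembership` is a hypothetical name.

```r
# Hypothetical helper: Jaccard coefficient from precomputed pair counts.
jaccard_from_comembership <- function(comembership) {
  needed <- c("n_11", "n_10", "n_01")
  stopifnot(is.list(comembership), all(needed %in% names(comembership)))
  with(comembership, n_11 / (n_11 + n_10 + n_01))
}

jaccard_from_comembership(list(n_11 = 1, n_10 = 1, n_01 = 2, n_00 = 2))  # 0.25
```

A separate helper keeps the label-based function's signature unchanged, whereas a flag would make one argument mean two different things.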

Implement Dunn index

The Dunn index is an internal evaluation technique. Cluster k results in an intracluster distance \Delta_k, which is computed as one of:

  1. Max distance between all pairs
  2. Mean distance between all pairs
  3. Distance of all the points from the mean

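An intercluster distance is then computed between each pair of clusters. The index is the ratio of the smallest intercluster distance to the largest intracluster distance \Delta_k, so larger values indicate compact, well-separated clusters.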
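
A minimal sketch of one variant follows. It assumes Euclidean distance, takes \Delta_k to be the maximum pairwise distance within cluster k (the first option above), and uses the smallest distance between points of different clusters as the intercluster distance. `dunn_sketch` is a hypothetical name.

```r
# Hypothetical sketch of the Dunn index with Euclidean distance.
dunn_sketch <- function(x, labels) {
  d <- as.matrix(dist(x))
  clusters <- unique(labels)
  stopifnot(nrow(d) == length(labels), length(clusters) >= 2)

  # Delta_k: largest pairwise distance within each cluster.
  diameters <- sapply(clusters, function(cl) {
    max(d[labels == cl, labels == cl])
  })

  # Smallest distance between two points in different clusters.
  same_cluster <- outer(labels, labels, "==")
  min(d[!same_cluster]) / max(diameters)
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
dunn_sketch(x, c(1, 1, 2, 2))  # 9
```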

Errors in simulation with simulated data sets

When I ran the simulations for the first 5 simulation configurations (i.e. the first 5 rows of simgrid) using small values of B = 6ish and D = 5, things worked fine.

Now I have increased B to 100 and D to 100.

Since then, I have been getting the following 2 errors and a warning multiple times:

Error : number of cluster centres must lie between 1 and nrow(x)
Error in kmeans(x = x, centers = num_clusters, nstart = num_starts, ...) :
  more cluster centers than distinct data points.
In addition: Warning message:
In FUN(c(2L, 1L, 3L)[[3L]], ...) : Returning NA

When mclapply exits, I receive:

Warning message:
In mclapply(seq_len(nrow(simgrid)), function(i) { :
  all scheduled cores encountered errors in user code

We need to resolve this so that we can move on with the simulation.
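
Both kmeans errors complain about the number of centers relative to the data, so one plausible cause (an assumption, not yet confirmed) is that some generated or resampled data sets have too few distinct rows. A defensive wrapper like the sketch below would let a worker skip such a data set instead of failing. `safe_kmeans` is a hypothetical name, and its argument names mirror the call in the error message.

```r
# Hypothetical wrapper: return NULL instead of raising an error.
safe_kmeans <- function(x, num_clusters, num_starts = 10, ...) {
  x <- as.matrix(x)
  # Conservative guard: require strictly more distinct rows than centers.
  if (nrow(unique(x)) <= num_clusters) {
    warning("too few distinct rows for ", num_clusters, " clusters")
    return(NULL)
  }
  tryCatch(
    kmeans(x = x, centers = num_clusters, nstart = num_starts, ...),
    error = function(e) {
      warning("kmeans failed: ", conditionMessage(e))
      NULL
    }
  )
}
```

Counting how often NULL comes back for each row of simgrid should show whether the problem is tied to particular configurations.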

Fuzzy Clustering Evaluations

Suleman (2017) shows that hard clustering similarities like Rand and Jaccard can be easily extended to fuzzy clusterings by replacing the 0/1 comembership indicator with a normalized Manhattan distance between cluster weights. I have implemented a prototype, but it would be nice to have it in C++ and available to other users through the existing clusteval package.
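
To make the idea concrete, here is a sketch of one way to build a soft comembership from membership weights. It is neither the prototype mentioned above nor necessarily the exact formulation in the paper. It assumes each row of a membership matrix holds one observation's cluster weights summing to 1, and it aggregates the pairwise values into a Rand-type score in a way that reduces to the ordinary Rand index for hard clusterings. Both function names are hypothetical.

```r
# Soft comembership: 1 minus the Manhattan distance between two
# observations' weight vectors, scaled by 2 so that it lies in [0, 1].
soft_comembership <- function(memberships) {
  memberships <- as.matrix(memberships)
  stopifnot(all(abs(rowSums(memberships) - 1) < 1e-8))
  1 - as.matrix(dist(memberships, method = "manhattan")) / 2
}

# Assumed aggregation: 1 minus the mean disagreement over all pairs.
fuzzy_rand_sketch <- function(memberships1, memberships2) {
  e1 <- soft_comembership(memberships1)
  e2 <- soft_comembership(memberships2)
  stopifnot(nrow(e1) == nrow(e2))
  pairs <- upper.tri(e1)
  1 - mean(abs(e1 - e2)[pairs])
}

# Hard clusterings encoded as 0/1 weights recover the usual Rand index.
u1 <- rbind(c(1, 0), c(1, 0), c(0, 1), c(0, 1))
u2 <- rbind(c(1, 0), c(1, 0), c(1, 0), c(0, 1))
fuzzy_rand_sketch(u1, u2)  # 0.5
```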
