ratiogreedyclustering's Introduction

Entropy Clustering

A C implementation of the Dominance clustering algorithm from 1.

Overview

The following algorithms are implemented for clustering probability distributions with the goal of reducing the partition entropy:

  • rd_clustering: a simple random assignment of distributions to clusters.
  • di_clustering: the Divisive Information-Theoretic Clustering (DITC) algorithm from 2.
  • rg_clustering: the Ratio-Greedy algorithm from 1.

The implementations are tested on two data sets: the 20 newsgroups data set, available in the scikit-learn Python package, and the RCV1-v2 data set, retrievable from http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.

Retrieving data

The data used for the tests presented in Cicalese, Laber, and Murtinho (2019) can be retrieved and saved via the make_ng20_data.py and make_rcv1_data.py files. The ng20 data set will be downloaded from scikit-learn, while the files used to prepare the csv files must be downloaded from the RCV1-v2 website.

These files must be downloaded and unzipped in the folder from which make_rcv1_data.py is called.

Testing

The test file main_test.c tests all three algorithms on a single data set for 29 different numbers of clusters (from 2 to 2000). Run it from the command line as ./test dname n_rows n_cols:

  • dname: filename of the csv file with the data set on which the function will be tested (without the .csv extension)
  • n_rows: the number of rows in the csv file
  • n_cols: the number of columns in the csv file
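
A minimal sketch of how a csv file with known dimensions might be read into memory. It assumes a purely numeric, comma-separated file with no header row, which this README does not specify. load_csv is a hypothetical helper, not a function from this repository.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read n_rows * n_cols numeric values from fp into a row-major array.
 * Values may be separated by commas or line breaks.
 * Returns a malloc'd array (caller frees), or NULL on any error,
 * including a file that holds fewer values than requested. */
double *load_csv(FILE *fp, size_t n_rows, size_t n_cols)
{
    size_t count = n_rows * n_cols;
    double *data = malloc(count * sizeof *data);
    if (data == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        /* The leading space skips whitespace left over from line breaks. */
        if (fscanf(fp, " %lf", &data[i]) != 1) {
            free(data);
            return NULL;
        }
        /* Consume the separator that follows the value. */
        int c = fgetc(fp);
        if (c != ',' && c != '\n' && c != '\r' && c != EOF) {
            free(data);
            return NULL;
        }
    }
    return data;
}
```

Entry (i, j) of the data set is then data[i * n_cols + j].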

If test is the name of the executable compiled from new_test.c, then ./test ng20 51840 20 will run the test on the ng20.csv file, which can be obtained by running make_ng20_data.py (see above), while ./test rcv1 170946 103 will run the test on the provided rcv1.csv file. The results will be stored in the files dname_entrs.csv, dname_times.csv, and dname_iters.csv, in the folder from which the executable is called. For the DITC algorithm, the results for the initialization are stored along with the results for the first, fifth, tenth, and final iterations.
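
The command-line convention above could be handled along the following lines. This is only a sketch: test_args, parse_test_args, and build_path are hypothetical names, and the code is not taken from the repository's test files. It validates the three arguments and derives the input and output file names from dname.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Arguments of "./test dname n_rows n_cols". */
typedef struct {
    const char *dname;
    long n_rows;
    long n_cols;
} test_args;

/* Parse a strictly positive decimal integer; 0 on success, -1 on error. */
static int parse_positive(const char *text, long *out)
{
    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' || value <= 0)
        return -1;
    *out = value;
    return 0;
}

/* Fill *out from argv; 0 on success, -1 if the arguments are invalid. */
int parse_test_args(int argc, char **argv, test_args *out)
{
    if (argc != 4)
        return -1;
    out->dname = argv[1];
    if (parse_positive(argv[2], &out->n_rows) != 0)
        return -1;
    if (parse_positive(argv[3], &out->n_cols) != 0)
        return -1;
    return 0;
}

/* Write dname followed by suffix into buf, e.g. "ng20" + "_entrs.csv".
 * Returns 0 on success, -1 if buf is too small. */
int build_path(char *buf, size_t size, const char *dname, const char *suffix)
{
    int n = snprintf(buf, size, "%s%s", dname, suffix);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For ./test ng20 51840 20, build_path with the suffixes ".csv", "_entrs.csv", "_times.csv", and "_iters.csv" yields the input file name and the three result file names described above.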

The test file single_function_test.c can be used to manually test a single function on a single data set.

Languages

  • C: Apple LLVM version 7.0.2 (clang-700.1.81)
  • Python: 3.7.0

Bibliography

1: Cicalese, Ferdinando, Laber, Eduardo, and Murtinho, Lucas. New results on information-theoretic clustering. 2019.

2: Dhillon, Inderjit S., Mallela, Subramanyam, and Kumar, Rahul. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265-1287, 2003.
