This project implements the K-Means clustering algorithm on Hadoop MapReduce. The goal is an application that can cluster datasets of different sizes and dimensions, with different configurations of centroids and points. Hadoop MapReduce is used to parallelize the algorithm and improve its scalability.
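The core of the MapReduce formulation is one K-Means iteration: the mapper assigns each point to its nearest centroid, and the reducer averages the points assigned to each centroid. The sketch below illustrates that map/reduce split in plain Python (the actual implementation runs on Hadoop; function names and the in-memory shuffle here are illustrative, not the project's API):

```python
def mapper(point, centroids):
    """Emit (index of nearest centroid, point) for one observation."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances)), point

def reducer(cluster_points):
    """Average all points assigned to one centroid."""
    n = len(cluster_points)
    d = len(cluster_points[0])
    return [sum(p[i] for p in cluster_points) / n for i in range(d)]

def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round: returns the updated centroids."""
    groups = {}
    for point in points:
        idx, p = mapper(point, centroids)
        groups.setdefault(idx, []).append(p)
    # Keep the old centroid if no point was assigned to it this round.
    return [reducer(groups[i]) if i in groups else list(centroids[i])
            for i in range(len(centroids))]
```

In the real job, the shuffle phase of Hadoop replaces the `groups` dictionary, and the driver repeats this round until the centroids stop moving (or a maximum number of iterations is reached).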
A Python script has been developed that uses the `make_blobs` function from scikit-learn to generate datasets of the desired size and dimension.
The size and structure of each dataset depend on three parameters:
- n: number of points/observations;
- k: number of clusters;
- d: number of dimensions of the points/observations.
The algorithm was tested and evaluated on seven different datasets, varying n, k, and d.
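A minimal sketch of the generation step, assuming example values for n, k, and d (the actual script reads them from its arguments):

```python
from sklearn.datasets import make_blobs

# Hypothetical parameter values; the real script takes them as arguments.
n, k, d = 1000, 4, 3

# X holds the n points (one row of d coordinates each); y holds the blob
# each point was drawn from (useful for sanity checks, not for clustering).
X, y = make_blobs(n_samples=n, centers=k, n_features=d, random_state=42)

# One point per line, comma-separated, so the Hadoop job can split the
# input file line by line.
lines = [",".join(f"{coord:.6f}" for coord in point) for point in X]
```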
The test datasets are in the scripts/data directory. Results for datasets with dimension d ≤ 3 were evaluated visually with MATLAB plots; for d > 3, the Silhouette score was used to assess clustering quality. Two tests gave unsatisfactory results with standard K-Means, so K-Means++ was used instead and performed better. The impact of the number of reducers on execution time was also examined, recording the average time per iteration for different reducer counts. See the documentation in the docs directory for further details.
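For the higher-dimensional tests, the two evaluation tools mentioned above are both available in scikit-learn. The sketch below shows how one might compare random initialization against K-Means++ with the Silhouette score (higher is better, range [-1, 1]); the dataset and parameters here are illustrative, not the project's actual test cases:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative high-dimensional dataset (d = 7, so no direct plot is possible).
X, _ = make_blobs(n_samples=500, centers=5, n_features=7, random_state=0)

# Same number of clusters, two initialization strategies.
labels_random = KMeans(n_clusters=5, init="random", n_init=1,
                       random_state=0).fit_predict(X)
labels_pp = KMeans(n_clusters=5, init="k-means++", n_init=1,
                   random_state=0).fit_predict(X)

score_random = silhouette_score(X, labels_random)
score_pp = silhouette_score(X, labels_pp)
```

K-Means++ spreads the initial centroids apart instead of picking them uniformly at random, which is why it tends to avoid the poor local optima observed in the two unsatisfactory tests.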
If you have corrections or additional features to suggest, please contact us.