This project implements the K-Means clustering algorithm on Hadoop MapReduce. The goal is an application that can cluster datasets of different sizes and dimensions, with different configurations of centroids and points. Hadoop MapReduce is used to parallelize the algorithm and improve its scalability.
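The core of the MapReduce formulation is one K-Means iteration: the mapper assigns each point to its nearest centroid, and the reducer averages the points assigned to each centroid. The sketch below illustrates that map/reduce split in plain Python (the actual implementation runs on Hadoop; function names and the in-memory shuffle here are illustrative, not the project's API):

```python
def mapper(point, centroids):
    """Emit (index of nearest centroid, point) for one observation."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances)), point

def reducer(cluster_points):
    """Average all points assigned to one centroid."""
    n = len(cluster_points)
    d = len(cluster_points[0])
    return [sum(p[i] for p in cluster_points) / n for i in range(d)]

def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round: returns the updated centroids."""
    groups = {}
    for point in points:
        idx, p = mapper(point, centroids)
        groups.setdefault(idx, []).append(p)
    # Keep the old centroid if no point was assigned to it this round.
    return [reducer(groups[i]) if i in groups else list(centroids[i])
            for i in range(len(centroids))]
```

In the real job, the shuffle phase of Hadoop replaces the `groups` dictionary, and the driver repeats this round until the centroids stop moving (or a maximum number of iterations is reached).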
A Python script has been developed that uses the `make_blobs` function from scikit-learn to generate datasets of the desired size and dimension.
The size and structure of each dataset depend on three parameters:
- n: number of points/observations;
- k: number of clusters;
- d: number of dimensions of the points/observations.
The algorithm was tested and evaluated on seven different datasets, varying n, k, and d.
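A minimal sketch of the generation step, assuming example values for n, k, and d (the actual script reads them from its arguments):

```python
from sklearn.datasets import make_blobs

# Hypothetical parameter values; the real script takes them as arguments.
n, k, d = 1000, 4, 3

# X holds the n points (one row of d coordinates each); y holds the blob
# each point was drawn from (useful for sanity checks, not for clustering).
X, y = make_blobs(n_samples=n, centers=k, n_features=d, random_state=42)

# One point per line, comma-separated, so the Hadoop job can split the
# input file line by line.
lines = [",".join(f"{coord:.6f}" for coord in point) for point in X]
```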
The test datasets are in the scripts/data directory. Results for datasets with dimension d ≤ 3 were evaluated visually with MATLAB plots; for d > 3, the Silhouette score was used to assess clustering quality. Two tests gave unsatisfactory results with standard K-Means, so K-Means++ was used instead and performed better. The impact of the number of reducers on execution time was also examined, recording the average time per iteration for different reducer counts. See the documentation in the docs directory for further details.
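For the higher-dimensional tests, the two evaluation tools mentioned above are both available in scikit-learn. The sketch below shows how one might compare random initialization against K-Means++ with the Silhouette score (higher is better, range [-1, 1]); the dataset and parameters here are illustrative, not the project's actual test cases:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative high-dimensional dataset (d = 7, so no direct plot is possible).
X, _ = make_blobs(n_samples=500, centers=5, n_features=7, random_state=0)

# Same number of clusters, two initialization strategies.
labels_random = KMeans(n_clusters=5, init="random", n_init=1,
                       random_state=0).fit_predict(X)
labels_pp = KMeans(n_clusters=5, init="k-means++", n_init=1,
                   random_state=0).fit_predict(X)

score_random = silhouette_score(X, labels_random)
score_pp = silhouette_score(X, labels_pp)
```

K-Means++ spreads the initial centroids apart instead of picking them uniformly at random, which is why it tends to avoid the poor local optima observed in the two unsatisfactory tests.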
If you have corrections or additional features to suggest, please contact us.