Giter Club home page Giter Club logo

reuters-clustering's Introduction

Lab 4 - Clustering Data

Table of Contents

  1. Overview
  2. Module Description
  3. Usage
  4. Development
  5. Change Log

Overview

The purpose of this module is to implement two clustering algorithms (k-means, DBScan) on the feature vectors generated in lab 1. This module is built on the preprocessing module that sanitized a set of SGML documents representing a Reuters article database into a dataset of feature vectors and class labels. The results of the clustering algorithms will be employed in future assignments for automated categorization, similarity search, and building document graphs.

Description

This python module contains the following files and directories:

  • lab2.py - main module for KDD process
  • preprocessing/
    • __init__.py
    • preprocessing.py - module for preprocessing the Reuters article database
    • document/
      • __init__.py
      • document.py - sub-module for text extraction & tokenization of document objects
    • lexicon/
      • __init__.py
      • lexicon.py - sub-module for generating the title/body lexicon for document set
    • feature/
      • __init__.py
      • feature.py - sub-module for generating feature vector datasets
      • weighting.py - sub-module for computing tf-idf scores
      • featureselect.py - sub-module for feature selection/reduction
  • classification/ directory included but not used in lab
    • __init__.py
    • classification.py - module for classification of the feature vector datasets
    • crossvalidator/
      • __init__.py
      • crossvalidator.py - submodule containing functionality for cross validation
    • classifiers
      • __init__.py
      • knearestneighbor.py - submodule for brute force knn classification
      • knearestneighbor_balltree.py - submodule for ball tree knn classification
      • decisiontree.py - submodule for decision tree classification
      • bayesian.py - submodule for multinomial naive bayes classification
  • clustering/
    • __init__.py
    • clustering.py - module for clustering Reuters Article database
    • algorithm/
      • __init__.py
      • kmeans.py
      • dbscan.py
  • data/
    • reut2-xxx.sgm - formatted articles (replace xxx from {000,...,021})

The preprocessing.py file will generate the following files

  • dataset1.csv - regular feature vector set
  • dataset2.csv - pared down version of feature vector in dataset1.csv

The feature vectors in the datasets were generated using the following methodologies

  • TF-IDF of title & body words to select the top 5 words of each document features
  • Feature reduction process of paring down original feature vector to 10% original size

Potential additional to future iterations of feature vector generation:

  • different normalization
  • bigram/trigram/n-gram aggregation
  • stratified sampling: starting letter, stem, etc.
  • binning: equal-width & equal-depth (grouping by topics/places, part-of-speech, etc)
  • entropy-based discretization (partitioning based on entropy calculations)

The clustering.py file will produce the following 2x2x2 experiment results:

  • k-means using euclidean distance on the standard feature vector
  • k-means using euclidean distance on the pared feature vector
  • DBScan using euclidean distance on the standard feature vector
  • DBScan using euclidean distance on the pared feature vector

For more information on how these classifiers were implemented and the offline/online costs, use the command:

> less Report2.md

Usage

This module relies on several libraries to perform preprocessing, before anything:

Ensure NLTK is installed and the corpus and tokenizers are installed:

> pip install NLTK

Next, enter a Python shell and download the necessary NLTK data:

> python
$ import nltk
$ nltk.download()

From the download window, ensure punkt, wordnet and stopwords are downloaded onto your machine.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> punkt
    Downloading package punkt to /home/3/loua/nltk_data...
      Unzipping tokenizers/punkt.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> stopwords
    Downloading package stopwords to /home/3/loua/nltk_data...
      Unzipping corpora/stopwords.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> wordnet
    Downloading package wordnet to /home/3/loua/nltk_data...
      Unzipping corpora/wordnet.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q

Next, ensure BeautifulSoup4 is installed:

> pip install beautifulsoup4

Lastly, ensure scikit-learn is installed:

> pip install scikit-learn

To run the code, first ensure the lab4.py file has execute privileges:

> chmod +x lab4.py

Next, ensure the preprocessing/ and clustering/ directories and their filetrees are correct with respect to lab4.py (based on the file tree in the overview). Also, ensure there is a data/ directory in the same folder as preprocess.py and the data/ directory containing the reut2-xxx.sgm files is present. To begin preprocessing the data, run:

> python lab4.py

or

> ./lab4.py

The preprocessing and clustering might take some time to complete.

Once preprocessing.py finishes execution, two datasets files are generated (dataset1.csv, dataset2.csv) in the /datasets directory. To view these datasets, run:

> less datasets/datasetX.csv

where X is replaced with 1 or 2 depending on the dataset.

Once clustering.py finishes execution, the results of the 2x2 experiments will be outputted to the terminal.

Development

  • This module was developed using python 2.7.10 using the NLTK and BeautifulSoup4 modules.

Adding More Feature Vectors

  • Update the __select_features method in feature.py to extract a new list of feature vector & append to self.features

Adding More Clustering

  • Add a new .py file to the clustering/algorithm directory
  • Ensure this new clustering algorithm is a class with a construct and at least one method:
    • generate_clusters(training)
  • Import the algorithm in clustering.py, add an instance of the algorithm to @clusterings, and the print strings

Contributors

Change Log

2015-10-28 - Version 1.0.2:

  • Added functionality for k-means clustering
  • Added functionality for dcscan clustering
  • Added functionality to compute cluster entropy
  • Added functionality to compute cluster variance
  • Basic code requirements fulfilled
  • TODO: clean up code + make it faster & more modular
  • TODO: write report in latex

2015-10-28 - Version 1.0.0:

  • Initial code import
  • Added functionality to generate parse tree
  • Added functionality to generate document objects
  • Added functionality to tokenize, stem, and filter words
  • Added functionality to generate lexicons for title & body words
  • Prepare documents for feature selection & dataset generation

reuters-clustering's People

Contributors

ankailou avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.