
AutoLDA

CSCE676 Data Mining Course Project (Fall 2021)


[Original github repo] [Website] [Poster]

In this project, we implemented different hyperparameter search methods to find the best hyperparameters for LDA, including Hyperband, Grid Search, and Random Search. Specifically, we implemented Hyperband with LDA using the number of iterations as the resource. Compared against Random Search, Hyperband achieves a better score and better clustering results.

Different implementation schemes of Hyperband with LDA were explored:

  1. Using data (# of documents) as the resource. This scheme allocates more training data to the most promising configurations. However, tests show that it does not work well, because the optimal LDA parameters change with the dataset size (# of documents).

  2. Using # of iterations as the resource and perplexity as the metric to quantitatively evaluate LDA quality. This scheme splits the data into training and test sets; the perplexity on the test set is used as the evaluation metric to filter good configurations. However, perplexity is a biased score that is strongly affected by the chosen number of topics.

  3. Using # of iterations as the resource and an embedding score as the metric. This scheme uses the full dataset to train the LDA for the given number of iterations. Several embedding methods were used to compute the embedding score; the locally trained W2V outperformed the pretrained GLOVE, ELMO, and BERT models. We believe that fine-tuning these pretrained models would give better results than W2V.
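The README does not show how the embedding score is computed; a common approach, sketched below under the assumption that it resembles the repo's W2V score, averages the pairwise cosine similarity of each topic's top words under a word-embedding model (function and argument names are illustrative):

```python
import itertools
import math

def topic_coherence(topics, embeddings):
    """Embedding-based topic score (sketch): for each topic, average the
    cosine similarity over all pairs of its top words, then average across
    topics. Higher = more semantically coherent topics.
    `embeddings` maps word -> vector (e.g. from a locally trained W2V model).
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    per_topic = []
    for words in topics:
        words = [w for w in words if w in embeddings]  # skip OOV words
        pairs = list(itertools.combinations(words, 2))
        if pairs:
            per_topic.append(
                sum(cos(embeddings[a], embeddings[b]) for a, b in pairs)
                / len(pairs))
    return sum(per_topic) / len(per_topic)
```

A score like this rewards configurations whose topics group semantically related words, which is why a corpus-specific (locally trained) W2V model can beat generic pretrained embeddings.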

Random Search and Grid Search were implemented as baselines to compare with Hyperband. The results on our data show that, using the W2V score, Hyperband finds better hyperparameters than Random Search and also yields good clustering results.
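For readers unfamiliar with Hyperband, its bracket/successive-halving loop can be sketched as follows. This is a generic sketch, not the repo's implementation: `sample_config` and `run_config` are placeholder callbacks, and the score is assumed to be maximized (as with the W2V score):

```python
import math
import random

def hyperband(sample_config, run_config, max_resource=81, eta=3):
    """Hyperband sketch. sample_config() draws a random hyperparameter
    configuration; run_config(config, r) trains with resource budget r
    (e.g. LDA iterations) and returns a score to maximize."""
    s_max = int(math.log(max_resource) / math.log(eta))
    best_config, best_score = None, float("-inf")
    for s in range(s_max, -1, -1):                    # one bracket per s
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))   # initial configs
        r = max_resource * eta ** (-s)                    # initial budget
        configs = [sample_config() for _ in range(n)]
        for i in range(s + 1):                        # successive halving
            r_i = int(r * eta ** i)
            scored = sorted(((run_config(c, r_i), c) for c in configs),
                            key=lambda t: t[0], reverse=True)
            if scored[0][0] > best_score:
                best_score, best_config = scored[0]
            # keep only the top 1/eta fraction for the next, larger budget
            configs = [c for _, c in scored[:max(1, len(configs) // eta)]]
    return best_config, best_score
```

With `max_resource=81` and `eta=3` (matching the 81 resources mentioned below), the first bracket starts 81 configurations at 1 iteration each and promotes survivors through budgets 3, 9, 27, and 81.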

Hyperband + LDA

To run Hyperband with LDA using W2V embeddings:

python main.py results_hb_W2V.pkl W2V

To show the best 10 configurations with their topic words at the given number of iterations:

python show_results.py results_hb_W2V.pkl 10 W2V

To run the selected top-1 configuration with the full 81 resources (iterations) to get the final LDA results:

python run_configs.py 1 W2V

To plot the score vs. time for each embedding scheme:

python plot.py

Environment Setup for Embeddings

Load pretrained embedding models:

  1. For GLOVE, download the pretrained model to the folder ./Embeddings/GLOVE_pretrained/, then run GLOVE.py to save the loaded model to a pkl file:
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip 
python GLOVE.py 
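The contents of GLOVE.py are not shown here; a minimal sketch of the load-and-pickle step it presumably performs (the output filename is an assumption) would look like:

```python
import pickle

def load_glove(path):
    """Parse a GloVe text file (one word followed by floats per line)
    into a word -> vector dict."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            emb[parts[0]] = [float(x) for x in parts[1:]]
    return emb

if __name__ == "__main__":
    # Paths follow the setup instructions above; the .pkl name is illustrative.
    emb = load_glove("./Embeddings/GLOVE_pretrained/glove.840B.300d.txt")
    with open("./Embeddings/GLOVE_pretrained/glove.pkl", "wb") as f:
        pickle.dump(emb, f)
```

Pickling the parsed dict avoids re-parsing the multi-gigabyte text file on every run.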

Contributors

tian1327
