sorayutmild / unsupervised-thai-document-clustering-with-sanook-news

An unsupervised model for clustering Thai news. It uses TF-IDF and named-entity-weighted SimCSE-WangchanBERTa embeddings as vector representations, and k-means as the clustering model.


Unsupervised-Thai-Document-Clustering-with-Sanook-news

TL;DR This work creates an unsupervised model that clusters Thai news into 10 categories. It uses TF-IDF and SimCSE-WangchanBERTa embeddings weighted by the number of named entities as vector representations, and k-means as the clustering model.

Problem statement

Create an unsupervised model that clusters Sanook news articles into 10 categories.

Dataset

Method

1. Vector representation

1.1 Vector representation using Bag-of-Words (TF-IDF)

I create a vector representation using Bag-of-Words (TF-IDF) and use it as a baseline.

Bag-of-Words

  • Text cleaning: remove links, symbols, numbers, and special characters
  • Word tokenization: newmm (dictionary-based, Maximum Matching + Thai Character Cluster)
  • TF-IDF vectorization
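
The text-cleaning step listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names `URL_RE`, `NON_TEXT_RE`, `WS_RE`, and `clean_text` are hypothetical, and the choice to keep only Thai characters (Unicode block U+0E00 to U+0E7F) and Latin letters is an assumption about what "symbols, numbers, special characters" means here.

```python
import re

# Assumption: a "link" is anything starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Assumption: keep Thai characters (U+0E00-U+0E7F), Latin letters, and whitespace;
# everything else (digits, punctuation, emoji, ...) is treated as noise.
NON_TEXT_RE = re.compile(r"[^\u0E00-\u0E7Fa-zA-Z\s]")
WS_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove links, then non-text characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


cleaned = clean_text("ข่าว 123 http://example.com !!! ดี")
```

The cleaned string would then be passed to the word tokenizer (newmm) and the TF-IDF vectorizer.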

1.2 Vector representation using Transformer model

Transformer model

  • Text cleaning: remove links, symbols, numbers, and special characters
  • Sentence tokenization: CRF
  • Sentence embedding: The best model is WangchanBERTa with SimCSE.
  • Weighting by the number of named entities: after the sentences are embedded into vectors by the Transformer model, each sentence vector is weighted by the number of named entities of particular types in that sentence. The document vector representation is then computed using these formulas.

$$ v_{d} = \frac {\sum_{s \in d} w_{s} \times v_{s}} {\sum_{s \in d} w_{s}} $$

$$ w_{s} = n_{s} + 1$$

where $v_{d}$ is the vector of document $d$, $v_{s}$ is the embedding of sentence $s$, and $n_{s}$ denotes the number of named entities of particular types in sentence $s$. This weighting scheme is adopted from https://ieeexplore.ieee.org/document/9085059
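
As a small worked example of the two formulas above, the weighted average could be computed like this. It is only a sketch under stated assumptions: `document_vector` is a hypothetical helper name, the sentence embeddings and entity counts are made-up toy values, and in practice the vectors would come from the sentence-embedding model and the counts from a named-entity tagger.

```python
import numpy as np


def document_vector(sentence_vectors, entity_counts):
    """Weighted mean of sentence vectors with weights w_s = n_s + 1."""
    vectors = np.asarray(sentence_vectors, dtype=float)      # (num_sentences, dim)
    weights = np.asarray(entity_counts, dtype=float) + 1.0   # w_s = n_s + 1
    # np.average computes sum(w_s * v_s) / sum(w_s) along the sentence axis.
    return np.average(vectors, axis=0, weights=weights)


# Toy input: two 2-dimensional sentence vectors; the second sentence
# contains two named entities, the first contains none.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0]]
entity_counts = [0, 2]
doc_vec = document_vector(sentence_vectors, entity_counts)
```

Here the weights are 1 and 3, so the document vector is (1·[1, 0] + 3·[0, 1]) / 4 = [0.25, 0.75]. The "+ 1" keeps sentences without named entities from being dropped entirely.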

2. Clustering model

After obtaining the vector representations, we use them as the clustering features. I used simple k-means clustering, as in the code below.

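from sklearn.cluster import KMeans

k = 10  # number of news categories
# Run k-means 55 times with different initializations and keep the best run
km = KMeans(n_clusters=k, max_iter=100, n_init=55)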
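
To show how such a model is fitted and used, here is a minimal sketch on synthetic data. The data, the smaller `n_clusters=3` and `n_init=10`, and the fixed `random_state` are assumptions chosen to keep the example small and reproducible; they are not the settings used in this project, where the features are the document vectors from step 1.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for document vectors: three well-separated 2-D blobs,
# 20 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

km = KMeans(n_clusters=3, max_iter=100, n_init=10, random_state=0)
# fit_predict fits the model and returns one integer cluster id per row of X.
cluster_ids = km.fit_predict(X)
```

The cluster ids are arbitrary integers, so they still have to be mapped to category names before they can be compared with labels.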

How to run code

  • For web scraping (you can skip this; the data has already been downloaded for you)

    • Install the required libraries by running pip install -r requirements.txt
    • Download chromedriver.exe and put it in the directory.
    • Then run the sanook_web_scraping.ipynb notebook in that environment.
  • Document clustering

    • Run the Document_clustering.ipynb notebook on Google Colab. It contains:
      • Text preprocessing
      • Text representation
        • Bag-of-Words
        • Transformer Embedding
      • Clustering model
      • Evaluation
      • Error analysis

Results

Each cluster is assigned the class that occurs most frequently among its members.
The resulting predictions are compared with the true labels, using accuracy as the evaluation metric.
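
The majority-vote evaluation described above could be sketched as follows. The function name `majority_vote_accuracy` and the toy cluster ids and labels are hypothetical; this is an illustration of the idea, not the notebook's code.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(cluster_ids, labels):
    """Map each cluster to its most frequent true label, then score accuracy."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)
    # The most common label inside each cluster becomes that cluster's class.
    cluster_to_label = {
        cluster: Counter(cluster_labels).most_common(1)[0][0]
        for cluster, cluster_labels in members.items()
    }
    predictions = [cluster_to_label[cluster] for cluster in cluster_ids]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels), cluster_to_label


# Toy example: cluster 0 is mostly "sport", cluster 1 is all "politics".
cluster_ids = [0, 0, 0, 1, 1]
labels = ["sport", "sport", "politics", "politics", "politics"]
accuracy, mapping = majority_vote_accuracy(cluster_ids, labels)
```

In this toy case four of five documents match their cluster's majority class, giving an accuracy of 0.8. Note that with this scheme several clusters may be assigned the same class.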

| Vector representation technique | Accuracy |
|---|---|
| TF-IDF | 0.8216 |
| SimCSE WangchanBERTa | 0.8330 |
| SimCSE WangchanBERTa weighted by number of named entities | 0.8445 |
| SimCSE WangchanBERTa fine-tuned, weighted by number of named entities | 0.7368 |

Discussion

  • I tried several Transformer models (BERT, RoBERTa, and WangchanBERTa), adding a pooling layer to obtain embeddings of shape (number_of_samples, 768), but they did not perform well on this task.
  • SimCSE improves the model's performance.
  • The SimCSE model weighted by the number of named entities performs best in my experiments (accuracy 0.8445).
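
The kind of pooling layer used is not stated above. One common choice is masked mean pooling over the token embeddings, sketched here with NumPy as an assumption about what such a layer might do. The name `mean_pool` and the random toy inputs are hypothetical; only the hidden size of 768 comes from the text.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sample, ignoring padded positions."""
    # (n, seq_len) -> (n, seq_len, 1) so the mask broadcasts over the hidden dim.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # (n, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


rng = np.random.default_rng(0)
# Toy batch: 2 samples, 3 token positions, hidden size 768.
token_embeddings = rng.normal(size=(2, 3, 768))
attention_mask = np.array([[1, 1, 0],    # last position is padding
                           [1, 1, 1]])
pooled = mean_pool(token_embeddings, attention_mask)  # shape (2, 768)
```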

Future work

  • Try other clustering models (e.g., hierarchical clustering, DBSCAN)
  • Try dimensionality reduction methods (e.g., PCA)
  • Try other weighting schemes
  • Try vector representations from the Doc2vec method
  • Try soft clustering (topic modeling) (e.g., LDA)

Acknowledgements
