rimtouny / enhancing-gutenberg-book-clustering-using-advanced-nlp-techniques

Text clustering, an unsupervised ML technique in NLP, groups similar texts based on content. Techniques like hierarchical, k-means, or density-based clustering categorize unstructured data, unveiling insights and patterns in diverse datasets. This exploration was part of the NLP course in my University of Ottawa master's program in 2023.

License: MIT License

Jupyter Notebook 100.00%
clustering gutenberg hierarchical-clustering k-means-clustering kappa-score machine-learning nlp-machine-learning silhouette-score text-clustering unsupervised-learning

Enhancing Gutenberg Book Clustering using Advanced NLP Techniques

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Unsupervised Text Clustering problem

Text clustering, a crucial unsupervised technique, groups comparable texts based on content similarity. Five different books by five different authors and genres were chosen:

selected_books = ['austen-emma.txt', 'whitman-leaves.txt', 'milton-paradise.txt', 'melville-moby_dick.txt', 'chesterton-thursday.txt']

Key Tasks Undertaken

  1. Data Preparation, Preprocessing, and Cleaning:

    • Listing all the books in Gutenberg’s library.

      {'austen-emma.txt': 'Jane Austen',
      'austen-persuasion.txt': 'Jane Austen',
      'austen-sense.txt': 'Jane Austen',
      'carroll-alice.txt': 'Lewis Carroll',
      'chesterton-ball.txt': 'G. K. Chesterton',
      'chesterton-brown.txt': 'G. K. Chesterton',
      'chesterton-thursday.txt': 'G. K. Chesterton',
      'edgeworth-parents.txt': 'Maria Edgeworth',
      'melville-moby_dick.txt': 'Herman Melville',
      'shakespeare-caesar.txt': 'William Shakespeare',
      'shakespeare-hamlet.txt': 'William Shakespeare',
      'whitman-leaves.txt': 'Walt Whitman'}
    • Choose five different books by five different authors that belong to the same category (History).

    • Data preparation:

      • Removing stop words.
      • Converting all words to lowercase.
      • Tokenizing the text.
      • Lemmatizing, which reduces each word to its base form.
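The preparation steps above can be sketched as a single function. This is a minimal illustration, not the notebook's actual code: the name `preprocess`, the regex tokenizer, and the use of scikit-learn's built-in `ENGLISH_STOP_WORDS` are all assumptions. Lemmatization is left as a pluggable callable that defaults to the identity, so a real lemmatizer (for example NLTK's `WordNetLemmatizer().lemmatize`) could be passed in.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(text, stop_words=ENGLISH_STOP_WORDS, lemmatize=lambda word: word):
    """Lowercase, tokenize, drop stop words, then lemmatize each token.

    `lemmatize` is a hypothetical hook: it defaults to the identity function,
    so no real lemmatization happens unless a callable is supplied.
    """
    # Lowercase first, then keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words before lemmatizing the remaining tokens.
    kept = [token for token in tokens if token not in stop_words]
    return [lemmatize(token) for token in kept]


sample = "Emma Woodhouse was handsome, clever, and rich."
print(preprocess(sample))
```

Lowercasing before stop-word removal keeps the stop-word lookup case-insensitive, which is why the order here differs slightly from the list above.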
    • Data Partitioning: partition each book into 200 documents, each a 100-word record.

    • Data labeling as follows:

      • austen-emma → a
      • chesterton-thursday → b
      • shakespeare-hamlet → c
      • chesterton-ball → d
      • carroll-alice → e
    • Word Cloud Generation: generates word clouds displaying the 100 most frequent words in each author's books.
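The partitioning and labeling steps can be sketched as follows. The README does not say whether the 200 documents per book are taken in order or sampled, so random sampling of non-overlapping 100-word chunks is an assumption here, as are the helper name `partition_book` and the synthetic token lists standing in for the preprocessed books.

```python
import random


def partition_book(tokens, n_docs=200, doc_len=100, seed=0):
    """Cut a token list into non-overlapping doc_len-word chunks and sample n_docs of them."""
    chunks = [
        tokens[i:i + doc_len]
        for i in range(0, len(tokens) - doc_len + 1, doc_len)
    ]
    if len(chunks) < n_docs:
        raise ValueError(f"need at least {n_docs * doc_len} tokens, got {len(tokens)}")
    # Sampling without replacement (an assumption; taking the first n_docs also fits the text).
    return random.Random(seed).sample(chunks, n_docs)


# Hypothetical stand-ins for two preprocessed books, keyed by their label.
books = {
    "a": [f"emma{i}" for i in range(25_000)],
    "b": [f"thursday{i}" for i in range(30_000)],
}

# Each record pairs a 100-word document with the label of the book it came from.
records = [
    (" ".join(doc), label)
    for label, tokens in books.items()
    for doc in partition_book(tokens)
]
print(len(records))
```

The labels are kept only for evaluation (the Kappa score); the clustering models themselves never see them.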

  2. Feature Engineering:

    • Transformation
      • Bag of Words (BoW): represents the occurrence of words within a document. It involves two things:
        • A vocabulary of known words.
        • A measure of the presence of known words.
      • Term Frequency - Inverse Document Frequency (TF-IDF): a technique to quantify words in a set of documents. We compute a score for each word to signify its importance in the document and corpus.
      • Latent Dirichlet Allocation (LDA): performs topic modeling to extract latent topics from the text data. Each document is represented as a mixture of topics.
      • Word Embedding (Word2Vec)
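The first three transformations map directly onto scikit-learn classes. The sketch below runs them on a tiny made-up corpus; the documents, the variable names, and the choice of two LDA topics are illustrative assumptions rather than the notebook's settings. Word2Vec is left out because it is typically trained with a separate library that is not in the required list above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical); the project uses 100-word book excerpts instead.
docs = [
    "whale ship sea captain harpoon",
    "sea whale captain voyage ship",
    "garden tea party marriage visit",
    "marriage visit tea garden letter",
]

# Bag of Words: a vocabulary plus raw counts of each known word per document.
bow = CountVectorizer().fit_transform(docs)

# TF-IDF: counts reweighted so that words common to the whole corpus score lower.
tfidf = TfidfVectorizer().fit_transform(docs)

# LDA: each document becomes a distribution over latent topics (fitted on the counts).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(bow)

print(bow.shape, tfidf.shape, topics.shape)
```

BoW and TF-IDF produce sparse matrices with one column per vocabulary word, while LDA yields a small dense matrix with one column per topic.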

  • Encoding
  1. Modeling: For each of the above techniques, the following models are trained and tested.

    • K-Means
    • Expectation Maximization (EM)
    • Hierarchical clustering (Agglomerative)
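As a rough sketch, the three models can be fitted on the same feature matrix with scikit-learn. The toy corpus, the TF-IDF features, and the hyperparameters are assumptions, and EM is represented here by `GaussianMixture`, which may or may not match the notebook's implementation.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

# Toy corpus with two obvious themes (hypothetical data).
docs = [
    "whale ship sea captain harpoon",
    "sea whale captain voyage ship",
    "ship harpoon voyage whale sea",
    "garden tea party marriage visit",
    "marriage visit tea garden letter",
    "letter party garden marriage tea",
]

# GaussianMixture and AgglomerativeClustering need a dense array.
X = TfidfVectorizer().fit_transform(docs).toarray()
k = 2  # the project uses 5, one cluster per book

cluster_labels = {
    "K-Means": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X),
    "EM (Gaussian mixture)": GaussianMixture(
        n_components=k, covariance_type="diag", random_state=0
    ).fit_predict(X),
    "Agglomerative": AgglomerativeClustering(n_clusters=k).fit_predict(X),
}

for name, labels in cluster_labels.items():
    print(name, labels)
```

The diagonal covariance keeps the mixture model stable when there are more features than documents, as in this toy example.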
  2. Model Evaluation

    • using Silhouette Score

    • using Kappa Score

      [!IMPORTANT] The method for calculating the Kappa Score has been uploaded in the document titled "Kappa Score.pdf".
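Both scores are available in scikit-learn. The Silhouette Score needs only the features and cluster assignments. Cohen's kappa compares cluster assignments with the true book labels, so cluster IDs first have to be matched to labels. The sketch below does that matching with the Hungarian algorithm; this is one common approach and an assumption here, not necessarily the method described in "Kappa Score.pdf". The helper name `kappa_after_alignment` and the toy data are hypothetical, and the helper assumes as many clusters as true classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import cohen_kappa_score, silhouette_score


def kappa_after_alignment(true_labels, cluster_ids):
    """Match clusters to labels one-to-one by maximum overlap, then compute Cohen's kappa."""
    classes, y = np.unique(true_labels, return_inverse=True)
    clusters, c = np.unique(cluster_ids, return_inverse=True)
    # Contingency table: rows are true classes, columns are clusters.
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    np.add.at(table, (y, c), 1)
    # Hungarian algorithm on the negated table maximizes the total overlap.
    rows, cols = linear_sum_assignment(-table)
    cluster_to_class = dict(zip(cols, rows))
    aligned = np.array([classes[cluster_to_class[j]] for j in c])
    return cohen_kappa_score(true_labels, aligned)


# Toy data: two well-separated groups of points and their true labels.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
true_labels = ["a", "a", "a", "b", "b", "b"]
cluster_ids = [1, 1, 1, 0, 0, 0]  # same grouping, different IDs

print("silhouette:", silhouette_score(X, cluster_ids))
print("kappa:", kappa_after_alignment(true_labels, cluster_ids))
```

Without the alignment step, a perfect clustering whose IDs happen to be permuted would receive a low or negative kappa.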

  3. Champion Model

    • on Silhouette Score

    • on Kappa Score

  4. Error Analysis of Champion Model:

  • By reducing the number of clusters from 5 to 3

    • on Silhouette Score

    • Champion Model

    • on Kappa Score

    • Champion Model
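The effect of reducing the number of clusters can be checked by refitting a model for each cluster count and comparing scores. The sketch below uses synthetic blobs as a stand-in for the real feature matrix and K-Means as the model, both assumptions, so its numbers say nothing about the project's actual results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document features (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

scores = {}
for k in (5, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```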
