Enhancing Gutenberg Book Clustering using Advanced NLP Techniques

Text clustering is an unsupervised machine learning technique in NLP that groups similar texts based on their content. Methods such as hierarchical, k-means, and density-based clustering organize unstructured data and reveal patterns in diverse datasets. This exploration was part of the NLP course in my master's program at the University of Ottawa in 2023.

Required libraries: scikit-learn, pandas, matplotlib.
Execute cells in a Jupyter Notebook environment.
The uploaded code has been executed and tested successfully within the Google Colab environment.

Unsupervised Text Clustering problem

Text clustering groups comparable texts based on content similarity and is a crucial unsupervised technique. (Five different books were chosen, each by a different author and from a different genre.)

selected_books=['austen-emma.txt','whitman-leaves.txt','milton-paradise.txt', 'melville-moby_dick.txt','chesterton-thursday.txt']

Key Tasks Undertaken

Data Preparation, Preprocessing, and Cleaning:
- Listing all the books in Gutenberg’s library.
```
{'austen-emma.txt': 'Jane Austen',
'austen-persuasion.txt': 'Jane Austen',
'austen-sense.txt': 'Jane Austen',
'carroll-alice.txt': 'Lewis Carroll',
'chesterton-ball.txt': 'G.K. Chesterton',
'chesterton-brown.txt': 'G. K. Chesterton',
'chesterton-thursday.txt': 'G. K. Chesterton',
'edgeworth-parents.txt': 'Maria Edgeworth',
'melville-moby_dick.txt': 'Dick  Herman Melville',
'shakespeare-caesar.txt': 'William Shakespeare',
'shakespeare-hamlet.txt': 'William Shakespeare',
'whitman-leaves.txt': 'Walt Whitman'}
```
- Choose five different books by five different authors that belong to the same category (History).
- Data preparation:
  - Removing stop words.
  - Converting all words to lowercase.
  - Tokenizing the text.
  - Lemmatizing, which reduces each word to its base form.
- Data Partitioning: partition each book into 200 documents, where each document is a 100-word record.
- Data labeling as follows:
  - austen-emma→ a
  - chesterton-thursday→ b
  - shakespeare-hamlet→ c
  - chesterton-ball→ d
  - carroll-alice→ e
- Word Cloud Generation: generate word clouds showing the 100 most frequent words in each author's books.
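
The preparation, partitioning, and labeling steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function names, the tiny `STOP_WORDS` set, and the choice of consecutive (rather than randomly sampled) 100-word chunks are all assumptions. Lemmatization is left out here because it needs an external lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def preprocess(text):
    """Lowercase the text, tokenize on alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def partition(tokens, n_docs=200, doc_len=100):
    """Split a token list into at most n_docs consecutive records of doc_len words."""
    docs = []
    for start in range(0, n_docs * doc_len, doc_len):
        chunk = tokens[start:start + doc_len]
        if len(chunk) < doc_len:
            break  # the book is too short to fill another full record
        docs.append(" ".join(chunk))
    return docs


def build_labeled_records(books, labels):
    """Return (document, label) pairs; `books` maps file name -> raw text."""
    records = []
    for name, text in books.items():
        for doc in partition(preprocess(text)):
            records.append((doc, labels[name]))
    return records
```

Calling `build_labeled_records(books, {"austen-emma.txt": "a", ...})` would then give one labeled record per 100-word document, ready for feature extraction.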
Feature Engineering:
- Transformation
  - Bag of Words (BOW): represents the occurrence of words within a document. It involves two things:
    - A vocabulary of known words.
    - A measure of the presence of known words.
  - Term Frequency-Inverse Document Frequency (TF-IDF): a technique to quantify words in a set of documents. A score is computed for each word to signify its importance in the document and the corpus.
  - Latent Dirichlet Allocation (LDA): Perform topic modeling to extract latent topics from the text data. Each document is represented as a mixture of topics.
  - Word Embedding (Word2Vec)
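
As a rough illustration of the first three transformations, the sketch below uses scikit-learn's `CountVectorizer`, `TfidfVectorizer`, and `LatentDirichletAllocation` on four toy records. The toy documents and the choice of two topics are assumptions made for the example, not values from the project. Word2Vec is left out because it needs an embedding library beyond those listed above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy records standing in for the 100-word documents (illustrative only).
docs = [
    "whale ship sea captain whale",
    "ship sea voyage captain",
    "garden marriage friend letter",
    "friend marriage visit garden letter",
]

# Bag of Words: raw count of each vocabulary word per document.
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(docs)

# TF-IDF: counts reweighted so that words common across the corpus score lower.
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(docs)

# LDA: each document becomes a mixture over n_components latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_lda = lda.fit_transform(X_bow)

print(X_bow.shape, X_tfidf.shape, X_lda.shape)
```

Each row of `X_lda` is a topic distribution for one document, so it can be fed to the clustering models in the same way as the BOW and TF-IDF matrices.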

Encoding

Modeling: For each of the above techniques, the following models are trained and tested.
- K-Means
- Expectation Maximization (EM)
- Hierarchical clustering (Agglomerative)
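
A minimal sketch of fitting the three models with scikit-learn is shown below. The synthetic two-dimensional features stand in for the real BOW, TF-IDF, or LDA vectors, and the cluster count of 3 is chosen only to match the toy data; the project would presumably use one cluster per book. Using `GaussianMixture` as the EM model is also an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture

# Synthetic dense features standing in for document vectors (illustrative only):
# three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])

n_clusters = 3  # matches the toy data, not the project's setting

# K-Means: assigns each point to the nearest of n_clusters centroids.
kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

# EM: fits a Gaussian mixture and assigns each point to its most likely component.
em_labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(X)

# Agglomerative: merges points bottom-up until n_clusters groups remain.
agg_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
```

Note that `GaussianMixture` expects dense arrays, so sparse BOW or TF-IDF matrices would need converting (for example with `.toarray()`) before fitting.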
Model Evaluation
- using Silhouette Score
- using Kappa Score
  
  [!IMPORTANT] The method for calculating the Kappa Score has been uploaded in the document titled "Kappa Score.pdf".
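
The sketch below shows one common way to compute both scores. It is not necessarily the method described in "Kappa Score.pdf": because cluster IDs are arbitrary, this version first maps each cluster to its best-matching true label with Hungarian matching and only then computes Cohen's kappa. The helper `align_clusters` and the toy data are hypothetical, and both label arrays are assumed to be integer-encoded as 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import cohen_kappa_score, confusion_matrix, silhouette_score


def align_clusters(true_labels, cluster_ids):
    """Relabel clusters so they best match the true labels (Hungarian matching).

    Assumes both arrays use the integer labels 0..k-1.
    """
    cm = confusion_matrix(true_labels, cluster_ids)
    rows, cols = linear_sum_assignment(-cm)  # negate to maximize matched counts
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[c] for c in cluster_ids])


# Toy data (illustrative only): nine points, three true classes.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6],
              [6, 5], [10, 0], [10, 1], [6, 6]], dtype=float)
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 0, 1, 1, 0])

# Silhouette needs only the features and the cluster assignment.
sil = silhouette_score(X, cluster_ids)

# Kappa compares the aligned cluster labels against the true book labels.
aligned = align_clusters(true_labels, cluster_ids)
kappa = cohen_kappa_score(true_labels, aligned)
```

The silhouette score measures cluster cohesion and separation without using the book labels, while kappa measures agreement with them, which is why the two scores can favor different champion models.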
Champion Model
- on Silhouette Score
- on Kappa Score
Error Analysis of Champion Model: