rimtouny / enhancing-gutenberg-book-classification-using-advanced-nlp-techniques

The project aimed to classify Gutenberg texts accurately. Employing advanced NLP methodologies, it covered data collection, preprocessing, feature engineering, and model evaluation for literary work classification. It was completed as part of the University of Ottawa's 2023 NLP course.

License: MIT License

Jupyter Notebook 100.00%
gutenberg lemmatization natural-language-processing nlp-machine-learning text-classification tokenization transformer bag-of-words bow error-analysis

Enhancing Gutenberg Book Classification using Advanced NLP Techniques

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Multi-class classification problem

A text classification task: assign each text sample drawn from five Gutenberg books to the literary work it came from.

selected_books = ['austen-emma.txt','carroll-alice.txt','chesterton-brown.txt','edgeworth-parents.txt','shakespeare-hamlet.txt']

System workflow

[Figure: system workflow diagram]

Key Tasks Undertaken

  1. Data Preparation, Preprocessing, and Cleaning:

    • Listing the books in Gutenberg’s library, with their authors:

      {'austen-emma.txt': 'Jane Austen',
      'austen-persuasion.txt': 'Jane Austen',
      'austen-sense.txt': 'Jane Austen',
      'carroll-alice.txt': 'Lewis Carroll',
      'chesterton-ball.txt': 'G. K. Chesterton',
      'chesterton-brown.txt': 'G. K. Chesterton',
      'chesterton-thursday.txt': 'G. K. Chesterton',
      'edgeworth-parents.txt': 'Maria Edgeworth',
      'melville-moby_dick.txt': 'Herman Melville',
      'shakespeare-caesar.txt': 'William Shakespeare',
      'shakespeare-hamlet.txt': 'William Shakespeare',
      'whitman-leaves.txt': 'Walt Whitman'}
    • Choosing five different books by five different authors that belong to the same category (History).

    • Data preparation:

      • Removing stop words.
      • Converting all words to lower case.
      • Tokenizing the text.
      • Lemmatizing, which reduces each word to its base form.
    • Data Partitioning: partition each book into 200 documents, each a 100-word record.

    • Data labeling as follows:

      • austen-emma → a
      • chesterton-thursday → b
      • shakespeare-hamlet → c
      • chesterton-ball → d
      • carroll-alice → e
    • Word Cloud Generation: generate a word cloud of the 100 most frequent words in each author's books.

    • Shuffle Dataset
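
The preparation steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the notebook's own code: the small `STOP_WORDS` set and the `LEMMAS` lookup table are hypothetical stand-ins for a full stop-word list and a dictionary-based lemmatizer, the function names are illustrative, and drawing each 100-word record as a random window (which may overlap with others) is an assumption, since the sampling scheme is not stated here.

```python
import random
import re

# Hypothetical stand-ins: a real pipeline would use a full stop-word list
# and a dictionary-based lemmatizer instead of these toy tables.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "is"}
LEMMAS = {"books": "book", "cats": "cat", "running": "run", "mice": "mouse"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then map tokens to base forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


def partition(tokens, n_docs=200, doc_len=100, seed=0):
    """Sample n_docs records of doc_len consecutive words from one book."""
    rng = random.Random(seed)
    last_start = len(tokens) - doc_len
    if last_start < 0:
        raise ValueError("book is shorter than one document")
    # Assumption: windows are drawn at random and may overlap.
    starts = [rng.randint(0, last_start) for _ in range(n_docs)]
    return [" ".join(tokens[s:s + doc_len]) for s in starts]


def build_dataset(books, labels, seed=0, **partition_kwargs):
    """Return shuffled (document, label) pairs for all books."""
    rows = []
    for name, text in books.items():
        docs = partition(preprocess(text), seed=seed, **partition_kwargs)
        rows.extend((doc, labels[name]) for doc in docs)
    random.Random(seed).shuffle(rows)
    return rows
```

Calling `build_dataset(books, labels)` with a mapping of book names to raw text and a mapping of book names to label letters would yield 200 labeled records per book with the default settings.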

  2. Feature Engineering:

    • Transformation
      • Bag of Words (BOW): represents the occurrence of words within a document. It involves two things:
        • A vocabulary of known words.
        • A measure of the presence of known words.
      • Term Frequency - Inverse Document Frequency (TF-IDF): a technique to quantify words in a set of documents. Each word receives a score that signifies its importance in the document and in the corpus.
      • N-grams
      • Word Embedding (Word2Vec)

  • Encoding
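
As a rough illustration of the transformation and encoding steps, the sketch below builds BOW, TF-IDF, and n-gram features with scikit-learn and encodes the labels as integers. The toy documents are invented for the example, and the n-gram range of unigrams plus bigrams is an assumption, since the range used in the notebook is not stated here. Word2Vec is left out because it needs a separate embedding library.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy documents and labels, invented for illustration only.
docs = [
    "emma woodhouse handsome clever rich",
    "alice rabbit hole alice fall",
    "hamlet prince denmark ghost",
    "emma harriet friend emma",
]
labels = ["a", "e", "c", "a"]

vectorizers = {
    "BOW": CountVectorizer(),                        # raw word counts
    "TF-IDF": TfidfVectorizer(),                     # counts reweighted by rarity
    "N-grams": CountVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams (assumed)
}
features = {name: vec.fit_transform(docs) for name, vec in vectorizers.items()}

# Encoding: map the label letters to integer class ids.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

for name, X in features.items():
    print(name, X.shape)
```

Each entry of `features` is a sparse matrix with one row per document, so the three representations can be passed to the same downstream models.
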
  3. Modeling: for each of the techniques above, the following models are trained and tested:

    • Random Forest
    • Gaussian Naive Bayes
    • K Nearest Neighbors
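
A minimal sketch of training and scoring the three models on one feature representation might look like the following. The toy corpus, the 75/25 split, and the hyperparameter values are assumptions made for the example, not settings taken from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Toy corpus invented for illustration: two "books" with distinct vocabularies.
docs = ["emma harriet ball dance"] * 20 + ["alice rabbit queen hatter"] * 20
labels = ["a"] * 20 + ["e"] * 20

# GaussianNB needs a dense array, so the sparse matrix is converted.
X = CountVectorizer().fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: {scores[name]:.2f}")
```

Repeating the loop once per feature representation gives a model-by-feature accuracy grid from which a champion combination can be picked.
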
  4. Model Evaluation

    • BOW

    • TF-IDF

    • N-grams

    • Word2Vec

  5. Error Analysis of Champion Model:

Best Model = Gaussian Naive Bayes
Accuracy and Champion Embedding: [0.98, 'N-Grams']
  • Reducing the number of words per document reduces the accuracy of our champion model:

        Accuracy with number of words 100 is 98.67 %
        
        Accuracy with number of words 70 is 97.33 %
        
        Accuracy with number of words 50 is 94.67 %
        
        Accuracy with number of words 40 is 94.67 %
        
        Accuracy with number of words 30 is 92.0 %
        
        Accuracy with number of words 20 is 84.0 %
  • Varying n_estimators (a Random Forest hyperparameter) does not significantly affect the model's performance on our dataset:

        Accuracy with n estimators100 is 98.67 %
        
        Accuracy with n estimators70 is 98.67 %
        
        Accuracy with n estimators50 is 98.67 %
        
        Accuracy with n estimators40 is 98.67 %
        
        Accuracy with n estimators30 is 98.67 %
        
        Accuracy with n estimators20 is 98.67 %
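
The word-count experiment above could be reproduced along the lines of the sketch below. It assumes that documents are shortened by keeping their first words and that accuracy is estimated with cross-validation. Neither detail is stated here, and `truncate` and `accuracy_by_length` are hypothetical helper names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def truncate(doc, n_words):
    """Keep only the first n_words words of a document."""
    return " ".join(doc.split()[:n_words])


def accuracy_by_length(docs, labels, lengths, cv=5):
    """Mean cross-validated GaussianNB accuracy on n-gram counts, per length."""
    results = {}
    for n in lengths:
        short_docs = [truncate(d, n) for d in docs]
        # Unigram + bigram counts (assumed range); dense array for GaussianNB.
        X = CountVectorizer(ngram_range=(1, 2)).fit_transform(short_docs).toarray()
        results[n] = cross_val_score(GaussianNB(), X, labels, cv=cv).mean()
    return results
```

For example, `accuracy_by_length(docs, labels, [100, 70, 50, 40, 30, 20])` would return one mean accuracy per document length, which can then be compared against the figures listed above.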
