
nlp_secondlab's Introduction

Part 1: Rule-Based NLP and Regex

- Objective 1: Develop Python code using regular expressions to extract product information (name, quantity, unit price) from user-provided text.
- Objective 2: Generate a formatted bill with product details (name, quantity, unit price, total price) based on the extracted information.

Part 2: Word Embeddings

- Objective 1: Pre-process the data collected during Lab 1 (text cleaning, tokenization).
- Objective 2: Apply one-hot encoding, bag-of-words, and TF-IDF techniques to convert the pre-processed text data into numerical representations.
- Objective 3: Implement a Word2Vec model (Skip-gram or CBOW) to learn word embeddings from the data.
- Objective 4: Apply GloVe and FastText models to learn word embeddings from the data.
- Objective 5: Use t-SNE dimensionality reduction to visualize the embeddings generated by each technique.
- Objective 6: Evaluate and compare the performance of each word embedding technique based on the visualizations and potential use cases.
- Objective 7: Provide a general conclusion on the suitability of the different word embedding techniques for this dataset and potential applications.

Ex1

ex1 contains a script that processes a given text describing purchased items and their prices, extracts the relevant details, and generates a bill report in tabular format using the tabulate library.

1- Import Statements: The code imports the necessary libraries: tabulate, re (regular expressions), stopwords and word_tokenize from NLTK, and spaCy for natural language processing.

2- Text Preprocessing: The script loads spaCy's English model (en_core_web_sm) for tokenization and other NLP tasks. It defines a sample text describing purchased items and their prices, tokenizes it with spaCy, filters out adjectives and common English stop words, and defines a regular expression pattern to match weight units.

3- Regular Expression Patterns: The script defines a regular expression pattern to extract quantities, products, and prices from the filtered tokens, covering various representations of numbers (digits as well as textual forms such as "three").

4- Matching and Data Extraction: The script iterates over the matches found in the text, extracts the quantity, product description, and unit price from each match, and calculates the total price.
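The extraction-and-billing flow above can be sketched as follows. The sample sentence, the `WORD_NUMBERS` map, and the exact pattern are illustrative assumptions rather than the lab's actual code, and the table is formatted by hand here instead of with tabulate:

```python
import re

# Hypothetical purchase text (not the lab's actual input).
text = ("I bought 3 apples at 2 dollars, 2 bananas at 1.5 dollars "
        "and one bread at 4 dollars")

# Map a few textual quantities to numbers, for illustration.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3}

# Pattern: a quantity (digit or word), a product word, then a unit price.
pattern = re.compile(
    r"(\d+|one|two|three)\s+([a-zA-Z]+)\s+at\s+(\d+(?:\.\d+)?)\s+dollars"
)

rows = []
for qty, product, price in pattern.findall(text):
    quantity = int(qty) if qty.isdigit() else WORD_NUMBERS[qty.lower()]
    unit_price = float(price)
    rows.append((product, quantity, unit_price, quantity * unit_price))

# Print a simple bill table (the lab script uses tabulate instead).
print(f"{'Product':<10}{'Qty':>5}{'Unit':>8}{'Total':>8}")
for product, qty, unit, total in rows:
    print(f"{product:<10}{qty:>5}{unit:>8.2f}{total:>8.2f}")
```

Spelled-out quantities beyond "three" would need additional alternatives in both the map and the pattern.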

Preprocessing Arabic Text for NLP Tasks (ex2.ipynb)

This Jupyter notebook (ex2.ipynb) demonstrates various techniques for preprocessing Arabic text for Natural Language Processing (NLP) tasks. Here's a breakdown of the key functionalities:

1. Text Cleaning:

Removes HTML tags, URLs, email addresses, and special characters. Normalizes Arabic diacritics (tashkeel) and removes redundant letter repetitions. Filters out digits. Converts punctuation to a unified format (Arabic and English).
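The cleaning steps above can be sketched with plain `re`; the exact patterns below (tag, URL, email, and repetition rules) are illustrative assumptions, not the notebook's code:

```python
import re

# Unicode range covering Arabic diacritics (tashkeel): fathatan..sukun.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def clean_arabic(text: str) -> str:
    """Minimal cleaning sketch: markup, URLs, emails, tashkeel, repeats, digits."""
    text = re.sub(r"<[^>]+>", " ", text)            # HTML tags
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"\S+@\S+", " ", text)            # email addresses
    text = TASHKEEL.sub("", text)                   # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1", text)        # collapse 3+ letter repeats
    text = re.sub(r"[0-9\u0660-\u0669]", "", text)  # Latin and Arabic digits
    text = re.sub(r"\s+", " ", text).strip()        # normalize whitespace
    return text

print(clean_arabic("مرحباً <b>بالعالم</b> 123 رائععع"))
```

Punctuation unification (mapping Arabic and English punctuation to one format) would be one more substitution table on top of this.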

2. Tokenization:

Splits the text into individual words (tokens) using NLTK's word_tokenize. Filters out non-alphanumeric tokens. Handles the specific case of words starting with the conjunction "و" (waaw).
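A dependency-free sketch of this step, using a regex tokenizer as a stand-in for NLTK's word_tokenize; the length-based rule for splitting off a leading waaw is a naive heuristic and the notebook's exact rule may differ:

```python
import re

def tokenize(text):
    """Regex tokenizer standing in for word_tokenize (illustration only)."""
    # Keep runs of word characters (covers Arabic letters); drop punctuation.
    tokens = re.findall(r"[\w\u0600-\u06FF]+", text)
    result = []
    for tok in tokens:
        # Split a leading conjunction "و" (waaw) off longer words.
        if tok.startswith("و") and len(tok) > 3:
            result.append("و")
            result.append(tok[1:])
        else:
            result.append(tok)
    return result

print(tokenize("ذهب الولد والبنت إلى المدرسة."))
```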

3. Stop Word Removal:

Removes common Arabic stop words using NLTK's stop words list. Applies the stop word removal to both individual tokens and full sentences.
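In outline, this is a set-membership filter. The tiny stop-word set below is an illustrative stand-in for NLTK's full Arabic list (`nltk.corpus.stopwords.words("arabic")`):

```python
# Tiny illustrative stop-word set; the notebook uses NLTK's full Arabic list.
ARABIC_STOPWORDS = {"في", "من", "إلى", "على", "عن", "هذا", "أن"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in ARABIC_STOPWORDS]

tokens = ["ذهب", "الولد", "إلى", "المدرسة", "في", "الصباح"]
print(remove_stopwords(tokens))
```

Applying the same filter to a full sentence just means tokenizing first and rejoining the surviving tokens.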

4. Lemmatization and Stemming:

Applies lemmatization (normalizing words to their base form) using qalsadi's lemmatizer. Applies light stemming (removing prefixes and suffixes) using the ArabicLightStemmer from tashaphyne.
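To show the idea behind light stemming, here is a toy affix-stripping stemmer; it is only an illustration of the technique, not qalsadi's lemmatizer or tashaphyne's ArabicLightStemmer, and the affix lists and minimum-stem-length rule are arbitrary choices:

```python
# Toy light stemmer: strip one common prefix and one common suffix,
# keeping at least 3 letters of stem. No real morphological analysis.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ة"]

def light_stem(word):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمدرسة"))
```

Lemmatization, by contrast, maps a word to a dictionary base form rather than just trimming affixes, which is why the notebook uses a dedicated lemmatizer (qalsadi) alongside the light stemmer.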

5. Text Vectorization:

Creates a vocabulary dictionary mapping unique words to their corresponding integer IDs. Implements One-Hot Encoding, where each word is represented by a binary vector with a 1 at its corresponding vocabulary index. Demonstrates CountVectorizer, which creates a Bag-of-Words (BOW) representation where each document (sentence) is represented by a vector containing the count of each word in the vocabulary. Implements TF-IDF vectorization, which assigns weights to words based on their importance within each document and across the entire corpus.

Overall, this notebook provides a comprehensive overview of common preprocessing steps for Arabic NLP tasks.
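A minimal, dependency-free sketch of the three vectorization schemes on a toy two-sentence corpus; the corpus and the smoothed-IDF formula are illustrative choices (the notebook presumably uses scikit-learn's CountVectorizer and TfidfVectorizer):

```python
import math

corpus = [
    "النص الأول عن المدرسة",
    "النص الثاني عن الكتاب",
]
docs = [doc.split() for doc in corpus]

# Vocabulary dictionary: unique word -> integer ID.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

def one_hot(word):
    """Binary vector with a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bow(doc):
    """Bag-of-words: per-document word counts over the vocabulary."""
    vec = [0] * len(vocab)
    for w in doc:
        vec[vocab[w]] += 1
    return vec

def tfidf(doc):
    """TF-IDF with a smoothed IDF term (one common variant)."""
    n = len(docs)
    vec = []
    for w in sorted(vocab, key=vocab.get):
        tf = doc.count(w) / len(doc)
        df = sum(1 for d in docs if w in d)
        idf = math.log((1 + n) / (1 + df)) + 1
        vec.append(tf * idf)
    return vec

print(one_hot("المدرسة"))
print(bow(docs[0]))
print([round(x, 3) for x in tfidf(docs[0])])
```

Note how the word "عن", which occurs in both documents, gets a lower TF-IDF weight than "المدرسة", which occurs in only one.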

nlp_secondlab's People

Contributors

mohammed-stalin, abdelmajidben

