-Objective 1: Develop Python code that uses regular expressions to extract product information (name, quantity, unit price) from user-provided text.
-Objective 2: Generate a formatted bill with product details (name, quantity, unit price, total price) based on the extracted information.
-Objective 1: Pre-process the data vectors collected during Lab 1 (text cleaning, tokenization).
-Objective 2: Apply one-hot encoding, bag-of-words, and TF-IDF techniques to convert the pre-processed text data into numerical representations.
-Objective 3: Implement a Word2Vec model (either Skip-gram or CBOW) to learn word embeddings from the data.
-Objective 4: Apply GloVe and FastText models to learn word embeddings from the data.
-Objective 5: Use t-SNE dimensionality reduction to visualize the embeddings generated by each technique.
-Objective 6: Evaluate and compare the performance of each word embedding technique based on the visualizations and potential use cases.
-Objective 7: Provide a general conclusion on the suitability of different word embedding techniques for the specific dataset and potential applications.
ex1 contains a script that processes a text describing purchased items and their prices, extracts the relevant details, and generates a bill report in tabular format using the tabulate library.
1- Import Statements: The code imports the necessary libraries: tabulate, re (regular expressions), stopwords and word_tokenize from NLTK, and spacy for natural language processing.
2- Text Preprocessing: The script loads an English language model from spaCy (en_core_web_sm) for tokenization and other NLP tasks. It defines a sample text containing information about purchased items and their prices. It tokenizes the text using spaCy and filters out tokens that are not adjectives or stop words. It filters out common English stop words and defines a regular expression pattern to match weight units.
3- Regular Expression Patterns: The script defines a regular expression pattern to extract quantities, products, and prices from the filtered tokens. The pattern covers several representations of numbers, both digits and their textual forms.
4- Matching and Data Extraction: The script iterates over the matches found in the text for the defined pattern. For each match, it extracts the quantity, product description, and unit price, then calculates the total price.
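The extraction and billing steps above can be sketched as follows. This is a minimal illustration and not the ex1 script itself: the sample text, the pattern, the NUMBER_WORDS table, and the function names are all assumptions made for the example, and plain string formatting stands in for tabulate so that the sketch has no third-party dependencies.

```python
import re

# Hypothetical sample input; the text used in ex1 is not shown.
TEXT = ("I bought 3 kg of apples for 2.5 dollars each, "
        "two notebooks at 4 dollars each and 1 lamp for 15 dollars.")

# Spelled-out quantities the pattern accepts (illustrative, not exhaustive).
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# quantity, optional weight unit, one-word product name, "for"/"at", unit price
ITEM_PATTERN = re.compile(
    r"\b(?P<qty>\d+(?:\.\d+)?|" + "|".join(NUMBER_WORDS) + r")\s+"
    r"(?:(?:kg|g|kilos?|grams?|lb|pounds?)\s+(?:of\s+)?)?"
    r"(?P<name>[a-z]+)\s+"
    r"(?:for|at)\s+"
    r"\$?(?P<price>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_items(text):
    """Return a list of (name, quantity, unit_price, total_price) tuples."""
    items = []
    for match in ITEM_PATTERN.finditer(text):
        raw_qty = match.group("qty").lower()
        qty = NUMBER_WORDS[raw_qty] if raw_qty in NUMBER_WORDS else float(raw_qty)
        price = float(match.group("price"))
        items.append((match.group("name").lower(), qty, price, qty * price))
    return items


def format_bill(items):
    """Render the items as a fixed-width table with a grand total row."""
    header = f"{'Product':<12}{'Qty':>6}{'Unit price':>12}{'Total':>10}"
    lines = [header, "-" * len(header)]
    for name, qty, price, total in items:
        lines.append(f"{name:<12}{qty:>6g}{price:>12.2f}{total:>10.2f}")
    grand_total = sum(item[3] for item in items)
    lines.append(f"{'TOTAL':<30}{grand_total:>10.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_bill(extract_items(TEXT)))
```

Named groups keep the extraction readable, and the optional unit group lets the same pattern handle both "3 kg of apples" and "two notebooks". Multi-word product names would need a broader name group than the single-word one assumed here.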
This Python script (ex2.ipynb) demonstrates various techniques for preprocessing Arabic text for Natural Language Processing (NLP) tasks. Here's a breakdown of the key functionalities:
Removes HTML tags, URLs, email addresses, and special characters. Normalizes Arabic diacritics (tashkeel) and removes redundant letter repetitions. Filters out digits. Converts punctuation to a unified format (Arabic and English).
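The cleaning steps listed above could look roughly like this with the standard re module. The sketch rests on several assumptions that ex2.ipynb may not share: diacritics are "normalized" by stripping them, a letter repeated three or more times is collapsed to one, and Arabic punctuation is mapped to its English counterpart. The function name clean_text and the sample string are hypothetical.

```python
import re

# Assumed mapping from Arabic punctuation to English equivalents.
PUNCT_MAP = str.maketrans({"،": ",", "؛": ";", "؟": "?"})


def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"[\u064B-\u0652\u0640]", "", text)    # tashkeel and tatweel
    text = re.sub(r"(\w)\1{2,}", r"\1", text)            # 3+ repeats of a character -> 1
    text = re.sub(r"\d+", " ", text)                     # digits (Western and Arabic-Indic)
    text = text.translate(PUNCT_MAP)                     # unify punctuation
    text = re.sub(r"[^\w\s.,;:?!]", " ", text)           # remaining special characters
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace


SAMPLE = "<p>مرحبااااا بِكُم في موقعنا http://example.com ، راسلونا: info@example.com أو اتصلوا 12345 ؟</p>"

if __name__ == "__main__":
    print(clean_text(SAMPLE))
```

The order of the substitutions matters: tags, URLs, and email addresses are removed first, while their delimiters are still intact, and the catch-all special-character filter runs last.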
Splits the text into individual words (tokens) using NLTK's word_tokenize. Filters out non-alphanumeric tokens. Handles specific cases of words starting with the conjunction "و" (waw).
Removes common Arabic stop words using NLTK's stop words list. Applies the stop word removal to both individual tokens and full sentences.
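The tokenization and stop word steps can be illustrated with a dependency-free sketch. It swaps in a regular expression for NLTK's word_tokenize and a tiny hand-written set for NLTK's Arabic stop word list. How the notebook treats words beginning with the conjunction is not shown, so the rule below (strip a leading waw unless the word is short or on a whitelist) is only one plausible reading; STOP_WORDS, WAW_ROOT_WORDS, and the function names are hypothetical.

```python
import re

# Tiny illustrative stop word set; the notebook uses NLTK's Arabic list instead.
STOP_WORDS = {"في", "من", "على", "عن", "هذا", "هذه"}

# Words whose leading waw belongs to the word itself (hypothetical whitelist).
WAW_ROOT_WORDS = {"ولد", "وقت", "وطن", "وجد", "وصل"}


def tokenize(text):
    # \w+ keeps runs of letters and digits, so punctuation-only tokens are dropped.
    return re.findall(r"\w+", text)


def split_waw(token):
    # Assumed rule: strip a leading conjunction from longer, non-whitelisted words.
    if token.startswith("و") and token not in WAW_ROOT_WORDS and len(token) > 3:
        return token[1:]
    return token


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess_sentence(sentence):
    """Apply the token-level steps to a full sentence and join the result."""
    tokens = [split_waw(t) for t in tokenize(sentence)]
    return " ".join(remove_stop_words(tokens))


SENTENCE = "ذهب الولد من البيت وكتب الدرس في وقت قصير."

if __name__ == "__main__":
    print(preprocess_sentence(SENTENCE))
```

A length check plus whitelist is a crude heuristic; a dictionary lookup on the remainder of the word would be more reliable but needs a lexicon.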
Applies lemmatization (normalizing words to their base form) using qalsadi's lemmatizer. Applies light stemming (removing prefixes and suffixes) using the ArabicLightStemmer from tashaphyne.
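To make the idea of light stemming concrete, here is a toy affix stripper. It is not the ArabicLightStemmer from tashaphyne and does not attempt the lemmatization that qalsadi performs; the prefix and suffix lists, the minimum stem length, and the function name light_stem are all assumptions chosen for illustration.

```python
# Illustrative affix lists, longest prefixes first; real stemmers use larger tables.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["المدرسة", "والمعلمون", "كتاب"]:
        print(w, "->", light_stem(w))
```

Unlike lemmatization, this approach never consults a dictionary, so the output is a truncated surface form and not necessarily a valid word.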
Creates a vocabulary dictionary mapping unique words to their corresponding integer IDs. Implements One-Hot Encoding, where each word is represented by a binary vector with a 1 at its corresponding vocabulary index. Demonstrates CountVectorizer, which creates a Bag-of-Words (BOW) representation where each document (sentence) is represented by a vector containing the count of each word in the vocabulary. Implements a TF-IDF vectorizer, which assigns weights to words based on their importance within the documents and across the entire corpus.

Overall, this script provides a comprehensive overview of common preprocessing steps for Arabic NLP tasks.
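The three vector representations described above can be worked through by hand in plain Python. This sketch uses a made-up three-sentence corpus and the textbook weighting tf × ln(N / df); library implementations such as scikit-learn's TfidfVectorizer apply a smoothed, normalized variant by default, so their numbers will differ. All names here are hypothetical.

```python
import math
from collections import Counter

# Made-up toy corpus, already cleaned and whitespace-tokenizable.
DOCS = [
    "القط يحب السمك",
    "الكلب يحب اللحم",
    "القط يطارد الكلب",
]
tokenized = [doc.split() for doc in DOCS]

# Vocabulary: every unique word gets an integer ID (sorted for reproducibility).
vocab = {w: i for i, w in enumerate(sorted({w for doc in tokenized for w in doc}))}


def one_hot(word):
    """Binary vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def bag_of_words(tokens):
    """Count of each vocabulary word in one document, in ID order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def tf_idf(tokens):
    """Textbook TF-IDF: term frequency times ln(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)
        vec.append(tf * math.log(n_docs / df))
    return vec


if __name__ == "__main__":
    print(vocab)
    print(one_hot("القط"))
    print(bag_of_words(tokenized[0]))
    print(tf_idf(tokenized[0]))
```

In the first sentence, the word that appears in only one document receives a higher weight than the words shared with other documents, which is the behaviour the TF-IDF description refers to.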