Email Spam Classifier (in-progress)
I took reference from Meriem Ferdjouni's project available on Medium. This is a supervised learning problem: the goal is to classify the text of email conversations, sourced from a large dataset, as spam or useful (non-spam) text, and to achieve good accuracy. The dataset contains 5728 rows of email text.

The data was pre-processed to remove duplicate rows and the leading "Subject:" marker from each line.

A tokenize function converts the cleaned text into tokens: the text is lowercased and punctuation is stripped, the resulting sentences are split into sub-strings with the nltk library's word_tokenize function, stopwords are filtered out using the stopword list from the nltk.corpus package, and the remaining tokens are cleaned up with nltk's WordNetLemmatizer.

The emails dataset is split 80-20 into training and testing sets. A scikit-learn Pipeline is set up with three stages applied in sequence: CountVectorizer to build sparse count representations of the cleaned tokens, TfidfTransformer to re-weight those counts as tf-idf vectors (used as a weighting scheme), and MultinomialNB (multinomial Naive Bayes) as the classifier. The pipeline is fit on the training set and then used to predict labels for the test set.

I evaluated the model with the accuracy_score function from sklearn.metrics, which compares the pipeline's predictions against the test-set labels. The accuracy is currently only 74%. Investigating this, I printed a report using the classification_report function from sklearn.metrics: at present, precision, recall and f1-score are all 0 for the spam class, which suggests the classifier is predicting the majority (non-spam) class for every test email.
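The cleaning step described above could be sketched as follows. This is illustrative only; the function name and data layout are assumptions, not taken from the original notebook.

```python
def clean_emails(raw_texts):
    """Drop duplicate emails (keeping the first occurrence) and strip
    the leading 'Subject:' marker from each one."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        # Remove the leading "Subject:" marker, if present.
        if text.startswith("Subject:"):
            text = text[len("Subject:"):].strip()
        # Keep only the first occurrence of each email body.
        if text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

emails = [
    "Subject: win a free prize now",
    "Subject: meeting notes attached",
    "Subject: win a free prize now",  # duplicate row
]
print(clean_emails(emails))
# → ['win a free prize now', 'meeting notes attached']
```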
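The tokenize function might look roughly like the sketch below. To keep it self-contained it uses only the standard library: a plain str.split stands in for nltk's word_tokenize, a tiny hand-picked stopword set stands in for nltk.corpus.stopwords, and the WordNetLemmatizer step is omitted.

```python
import string

# Small stand-in stopword set; the project uses nltk.corpus.stopwords,
# which is much larger.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for"}

def tokenize(text):
    """Lowercase, strip punctuation, split into tokens, drop stopwords."""
    text = text.lower()
    # Remove all punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # The original uses nltk.word_tokenize here, then lemmatizes each
    # token with WordNetLemmatizer; a plain split stands in for both.
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(tokenize("Win a FREE prize in the lottery!"))
# → ['win', 'free', 'prize', 'lottery']
```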
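The split, pipeline, and evaluation steps can be sketched end to end as below. The toy corpus and labels are made up for illustration; the real project feeds in the 5728-row email dataset and its spam labels.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Tiny invented corpus standing in for the cleaned email texts.
texts = [
    "win free money now", "claim your free prize", "cheap meds online",
    "meeting at noon today", "please review the attached report",
    "lunch tomorrow with the team",
] * 10
labels = [1, 1, 1, 0, 0, 0] * 10  # 1 = spam, 0 = ham

# 80-20 train/test split, as in the write-up.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("vect", CountVectorizer()),    # sparse token counts
    ("tfidf", TfidfTransformer()),  # tf-idf weighting of the counts
    ("clf", MultinomialNB()),       # multinomial Naive Bayes classifier
])

pipeline.fit(X_train, y_train)     # fit on the training split only
y_pred = pipeline.predict(X_test)  # predict on the held-out split

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

If the spam row of classification_report shows 0 precision and 0 recall, the model made no correct spam predictions on the test set, which is worth checking against the predicted label counts (e.g. with collections.Counter(y_pred)).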