ProjectSpamClassifier

Email Spam Classifier (in-progress)

This project follows Meriem Ferdjouni’s project, available on Medium. It is a supervised learning problem: classify each email in a large dataset as spam or useful text, with good accuracy.

The input is a spam dataset of 5728 rows of text. Pre-processing removes duplicate text and strips the text “Subject:” from each line. A tokenize function then converts the cleaned text to lowercase, removes punctuation using the strip function, and splits each sentence into word tokens using the nltk library’s word_tokenize function. Stopwords are removed using the stopwords list from the nltk.corpus package, and the remaining tokens are lemmatized with nltk’s WordNetLemmatizer to give the final cleaned tokens.

The emails dataset is split 80-20 into training and testing sets. A scikit-learn pipeline applies the following steps in sequence: CountVectorizer, which produces sparse representations of the cleaned tokens; TfidfTransformer, which converts them to tf-idf representations (used as a weighting scheme); and MultinomialNB (multinomial naïve Bayes) as the classifier. The pipeline is fit on the training set and then used to predict labels for the testing set.

I evaluated the model with the accuracy_score function from sklearn.metrics, which compares the pipeline’s predicted labels with the true labels of the testing set. Accuracy is only 74% at present. To investigate, I generated a report using the classification_report function from sklearn.metrics. At present, the precision, recall and f1 scores are all 0 for the text classified as spam.
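
One pattern that would produce this combination of results (accuracy around 74% with precision, recall and f1 of 0 for spam) is a classifier that never predicts the spam label, so that accuracy simply equals the share of non-spam messages in the testing set. The sketch below illustrates that arithmetic in plain Python. The label counts (74 non-spam, 26 spam) and the helper names per_class_scores and accuracy are hypothetical and chosen only for illustration; they are not taken from the project's data or code, and this is not a diagnosis of the actual cause.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, using 0.0 when undefined."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(label, label)]
    fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
    fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test labels: 1 = spam, 0 = not spam (74 of 100 are not spam).
y_true = [0] * 74 + [1] * 26
# A degenerate classifier that never predicts spam.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
spam_scores = per_class_scores(y_true, y_pred, label=1)
ham_scores = per_class_scores(y_true, y_pred, label=0)

print(f"accuracy: {acc:.2f}")
print("spam (precision, recall, f1):", spam_scores)
print("not spam (precision, recall, f1):", tuple(round(s, 2) for s in ham_scores))
```

If the real predictions show the same pattern, every predicted label on the testing set would be the non-spam label, which can be checked by counting the distinct values in the pipeline's predictions. Precision for spam is undefined in that case (no spam predictions at all); the sketch reports it as 0.0, which matches how classification_report displays it by default.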

Contributors

baloneygit
