sms-spam-classification

Classification of the SMS spam data from Kaggle (https://www.kaggle.com/uciml/sms-spam-collection-dataset) using a naive Bayes model.

Description

This algorithm performs the following steps:

  1. Load the .csv file containing the SMS Spam Collection Dataset (from Kaggle) from disk.
  2. Parse the file, taking the v1 value as the label (spam or ham) and the v2 value as the message.
  3. Split each message into tokens, reduce the tokens to lemmas, and replace every number with the constant token __NUMBER__.
  4. Shuffle all messages.
  5. Split messages into train and test sets.
  6. Fit the naive Bayes model on the train set.
  7. Predict labels for the test set.
  8. Calculate the following metrics:
  • accuracy,
  • precision,
  • recall,
  • F1-score,
  • Matthews correlation.
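
Steps 3 to 8 can be sketched in plain JavaScript as follows. This is a minimal illustration, not the project's code: every function name is hypothetical, lemmatization is left out, the tokenizer is a simple regular expression rather than the natural library, and the classifier is assumed to be a multinomial naive Bayes with add-one smoothing, which the description above does not specify.

```javascript
// Step 3 (simplified, no lemmatization): lowercase the text, replace digit
// runs with __NUMBER__, and split on anything that is not a letter or underscore.
function tokenize(message) {
  return message
    .toLowerCase()
    .replace(/\d+/g, ' __NUMBER__ ')
    .split(/[^a-zA-Z_]+/)
    .filter(Boolean);
}

// Step 4: Fisher-Yates shuffle that returns a new array.
function shuffle(items, random = Math.random) {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Step 5: the first trainSize fraction goes to the train set, the rest to test.
function split(items, trainSize) {
  const cut = Math.round(items.length * trainSize);
  return { train: items.slice(0, cut), test: items.slice(cut) };
}

// Step 6: count documents per label and token occurrences per label.
// Assumes a non-empty train set with at least one token.
function fit(train) {
  const model = { size: train.length, docs: {}, counts: {}, totals: {}, vocab: new Set() };
  for (const { label, tokens } of train) {
    model.docs[label] = (model.docs[label] || 0) + 1;
    model.counts[label] = model.counts[label] || new Map();
    for (const token of tokens) {
      model.counts[label].set(token, (model.counts[label].get(token) || 0) + 1);
      model.totals[label] = (model.totals[label] || 0) + 1;
      model.vocab.add(token);
    }
  }
  return model;
}

// Step 7: pick the label with the highest log-posterior (add-one smoothing).
function predict(model, tokens) {
  let best = null;
  let bestScore = -Infinity;
  for (const label of Object.keys(model.docs)) {
    const denominator = (model.totals[label] || 0) + model.vocab.size;
    let score = Math.log(model.docs[label] / model.size);
    for (const token of tokens) {
      score += Math.log(((model.counts[label].get(token) || 0) + 1) / denominator);
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

// Step 8: metrics for the positive class; a ratio with a zero denominator is reported as 0.
function metrics(actual, predicted, positive = 'spam') {
  let tp = 0, tn = 0, fp = 0, fn = 0;
  actual.forEach((label, i) => {
    const isPositive = label === positive;
    const saidPositive = predicted[i] === positive;
    if (isPositive && saidPositive) tp++;
    else if (!isPositive && !saidPositive) tn++;
    else if (!isPositive && saidPositive) fp++;
    else fn++;
  });
  const ratio = (a, b) => (b === 0 ? 0 : a / b);
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    accuracy: ratio(tp + tn, actual.length),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
    mcc: ratio(tp * tn - fp * fn, Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
  };
}
```

One experiment would then chain these helpers: tokenize every message, shuffle, split with the configured train size, fit on the train part, predict each test message, and pass the actual and predicted labels to metrics.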

Requirements

  1. The Node.js runtime and the npm package manager.
  2. The libraries listed in the package.json file.

Install and configure

  1. Go to the project root directory.
  2. Run npm i (or npm install) to install the required libraries.
  3. Open the .env file and set the following parameters:
  • SMS_COLLECTION_PATH: string, the path (absolute or relative) to the .csv file with the SMS Spam Collection data from Kaggle.
  • TRAIN_SIZE: float, the size of the train set as a fraction of all messages (for example, 0.8).
  • COUNT_EXPERIMENTS: integer, the number of experiments to run.
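
As an illustration of how these three parameters might be read and checked, here is a minimal sketch. The helper name readConfig and the validation rules (a train size strictly between 0 and 1, a positive experiment count) are assumptions, as is the premise that the values from .env end up in process.env; the project may load and validate them differently.

```javascript
// Hypothetical helper, not part of the project: reads the three parameters
// from an environment-like object and rejects obviously invalid values.
function readConfig(env) {
  const path = env.SMS_COLLECTION_PATH;
  const trainSize = Number.parseFloat(env.TRAIN_SIZE);
  const countExperiments = Number.parseInt(env.COUNT_EXPERIMENTS, 10);

  if (!path) {
    throw new Error('SMS_COLLECTION_PATH must be set');
  }
  // Assumed rule: the train size is a fraction strictly between 0 and 1.
  if (!(trainSize > 0 && trainSize < 1)) {
    throw new Error('TRAIN_SIZE must be a number between 0 and 1');
  }
  // Assumed rule: at least one experiment has to run.
  if (!(Number.isInteger(countExperiments) && countExperiments > 0)) {
    throw new Error('COUNT_EXPERIMENTS must be a positive integer');
  }
  return { path, trainSize, countExperiments };
}
```

A call such as readConfig(process.env) would return the parsed settings or throw on the first invalid value. The file path used in any example value (such as ./data/spam.csv) is only a placeholder.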

Running command

In the project root directory, run npm start.

Output example

RESULTS:

  • Count experiments: 100
  • Train set size: 0.8
  • Avg accuracy: 0.9768671454219029
  • Avg precision (spam): 0.8840480938416516
  • Avg recall (spam): 0.9494494826142801
  • Avg F1-score (spam): 0.9153065782697162
  • Matthews correlation: 0.9028739302029823
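
The averaged values above could be produced by a helper along these lines. This is a sketch under assumptions: the name averageMetrics is hypothetical, and it presumes each experiment yields an object of numeric metrics with the same keys, which the project may not do in exactly this form.

```javascript
// Hypothetical helper: averages each named metric over all experiment runs.
// Assumes a non-empty array of objects that share the same numeric keys.
function averageMetrics(runs) {
  const sums = {};
  for (const run of runs) {
    for (const [name, value] of Object.entries(run)) {
      sums[name] = (sums[name] || 0) + value;
    }
  }
  const averages = {};
  for (const [name, total] of Object.entries(sums)) {
    averages[name] = total / runs.length;
  }
  return averages;
}
```

Averaging per-run F1-scores generally gives a slightly different number than computing F1 from the averaged precision and recall, so the averaged F1-score need not match a value recomputed from the other two averages.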

Node.js libraries used

  • csv-parser (version 2.3.2) is used for parsing .csv files.
  • natural (version 0.6.3) is used for splitting the input texts from the corpus into word tokens.
  • lemmatizer (version 0.0.1) is used for reducing words to their lemmas.

Contributors

  • by-ilya
