SMS spam classification on the SMS Spam Collection dataset from Kaggle (https://www.kaggle.com/uciml/sms-spam-collection-dataset) using a naive Bayesian model.
This algorithm performs the following steps:
- Load and read the `.csv` file with the SMS Spam Collection Dataset (Kaggle) data from disk.
- Parse this file and extract the `v1` value as the label (spam or ham) and the `v2` value as the message.
- Transform messages to tokens and then to lemmas; replace all numbers with the constant token `__NUMBER__`.
- Shuffle all messages.
- Split messages into train and test sets.
- Fit the naive Bayesian model on the train set.
- Predict labels on test set.
- Calculate the following metrics:
  - accuracy,
  - precision,
  - recall,
  - F1-score,
  - Matthews correlation.
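The fit and predict steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes a multinomial naive Bayes model with add-one (Laplace) smoothing, and the class name `NaiveBayes`, its methods, and the sample messages are hypothetical.

```javascript
// Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
// Assumption: the project may use a different variant or smoothing scheme.
class NaiveBayes {
  constructor() {
    this.docCounts = new Map();   // label -> number of training messages
    this.tokenCounts = new Map(); // label -> Map(token -> count)
    this.totalTokens = new Map(); // label -> total number of tokens
    this.vocabulary = new Set();  // all tokens seen during training
    this.totalDocs = 0;
  }

  // samples: array of { label: "spam" | "ham", tokens: string[] }
  fit(samples) {
    for (const { label, tokens } of samples) {
      this.totalDocs += 1;
      this.docCounts.set(label, (this.docCounts.get(label) || 0) + 1);
      if (!this.tokenCounts.has(label)) this.tokenCounts.set(label, new Map());
      const counts = this.tokenCounts.get(label);
      for (const token of tokens) {
        counts.set(token, (counts.get(token) || 0) + 1);
        this.totalTokens.set(label, (this.totalTokens.get(label) || 0) + 1);
        this.vocabulary.add(token);
      }
    }
    return this;
  }

  // Returns the label with the highest log-posterior score.
  predict(tokens) {
    let bestLabel = null;
    let bestScore = -Infinity;
    for (const [label, docCount] of this.docCounts) {
      const counts = this.tokenCounts.get(label);
      const denominator = (this.totalTokens.get(label) || 0) + this.vocabulary.size;
      let score = Math.log(docCount / this.totalDocs); // log prior
      for (const token of tokens) {
        score += Math.log(((counts.get(token) || 0) + 1) / denominator);
      }
      if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
      }
    }
    return bestLabel;
  }
}

// Hypothetical, already tokenized training messages.
const model = new NaiveBayes().fit([
  { label: "spam", tokens: ["win", "free", "prize", "__NUMBER__"] },
  { label: "spam", tokens: ["free", "cash", "call", "__NUMBER__"] },
  { label: "ham", tokens: ["see", "you", "at", "lunch"] },
  { label: "ham", tokens: ["call", "me", "later"] },
]);
console.log(model.predict(["free", "prize"]));       // spam
console.log(model.predict(["see", "you", "later"])); // ham
```

Scores are summed in log space so that long messages do not underflow to zero when many small probabilities are multiplied.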
- Node JS runtime and the NPM package manager.
- Libraries installed from the `package.json` file.
- Go to the project root directory.
- Run the `npm i` or `npm install` command. This command installs the necessary libraries.
- Open the `.env` file and configure the following parameters:
  - `SMS_COLLECTION_PATH`: `string` value that specifies the `.csv` file path to the SMS Spam Collection data from Kaggle (absolute or relative path).
  - `TRAIN_SIZE`: `float` value that specifies the size of the train set.
  - `COUNT_EXPERIMENTS`: `integer` value that specifies the number of experiments.
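To illustrate how a `TRAIN_SIZE` value could drive the shuffle and split steps, here is a minimal sketch. It assumes `TRAIN_SIZE` is a fraction between 0 and 1 (as the reported train set size of 0.8 suggests); the function names `shuffle` and `trainTestSplit` are hypothetical and not taken from the project.

```javascript
// Fisher-Yates shuffle that returns a new array. The `random` argument is
// injectable so that runs can be made reproducible.
function shuffle(items, random = Math.random) {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Splits items into train and test parts; trainSize is a fraction in (0, 1).
function trainTestSplit(items, trainSize) {
  if (!(trainSize > 0 && trainSize < 1)) {
    throw new RangeError("trainSize must be strictly between 0 and 1");
  }
  const cut = Math.floor(items.length * trainSize);
  return { train: items.slice(0, cut), test: items.slice(cut) };
}

// Assumed value, e.g. TRAIN_SIZE=0.8 in the .env file.
const trainSize = 0.8;
const messages = Array.from({ length: 10 }, (_, i) => `message ${i}`);
const { train: trainSet, test: testSet } = trainTestSplit(shuffle(messages), trainSize);
console.log(trainSet.length, testSet.length); // 8 2
```

Shuffling before the split means each experiment sees a different train/test partition, which is what makes averaging over `COUNT_EXPERIMENTS` runs meaningful.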
In the project root directory, execute the `npm start` command.
RESULTS:
- Count experiments: 100
- Train set size: 0.8
- Avg accuracy: 0.9768671454219029
- Avg precision (spam): 0.8840480938416516
- Avg recall (spam): 0.9494494826142801
- Avg F1-score (spam): 0.9153065782697162
- Matthews correlation: 0.9028739302029823
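For reference, the metrics above can be derived from the confusion-matrix counts of a single experiment, with spam treated as the positive class. The sketch below uses the standard definitions; the function name `computeMetrics` and the example counts are hypothetical and are not the project's results.

```javascript
// Computes binary classification metrics from confusion-matrix counts.
// tp: spam predicted as spam, fp: ham predicted as spam,
// tn: ham predicted as ham,   fn: spam predicted as ham.
function computeMetrics({ tp, fp, tn, fn }) {
  // Returns 0 instead of NaN when a denominator is zero (a sketch-level choice).
  const safeDivide = (a, b) => (b === 0 ? 0 : a / b);
  const accuracy = safeDivide(tp + tn, tp + fp + tn + fn);
  const precision = safeDivide(tp, tp + fp);
  const recall = safeDivide(tp, tp + fn);
  const f1 = safeDivide(2 * precision * recall, precision + recall);
  const mcc = safeDivide(
    tp * tn - fp * fn,
    Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  );
  return { accuracy, precision, recall, f1, mcc };
}

// Averages one metric over several experiments.
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Hypothetical counts for one experiment.
const metrics = computeMetrics({ tp: 8, fp: 2, tn: 88, fn: 2 });
console.log(metrics);
```

Because ham messages far outnumber spam in this kind of data, accuracy alone can look high even for a weak classifier; the Matthews correlation takes all four counts into account and is less affected by that imbalance.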
- `csv-parser` (version `2.3.2`) is used for parsing `.csv` files.
- `natural` (version `0.6.3`) is used for tokenizing input texts from the corpus into words.
- `lemmatizer` (version `0.0.1`) is used for creating lemmas from words.
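To show what the tokenization and number replacement step produces, here is a dependency-free sketch. It deliberately does not call `natural` or `lemmatizer`: a simple regular expression stands in for the tokenizer, lemmatization is omitted, and lowercasing is an assumption. The function name `tokenize` is hypothetical.

```javascript
const NUMBER_TOKEN = "__NUMBER__";

// Splits a message into lowercase alphanumeric tokens and replaces
// purely numeric tokens with the constant NUMBER_TOKEN.
// Assumption: mixed tokens such as "4u" are kept unchanged.
function tokenize(message) {
  const words = message.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.map((word) => (/^\d+$/.test(word) ? NUMBER_TOKEN : word));
}

console.log(tokenize("WINNER!! Call 09061701461 to claim your 900 prize"));
// [ 'winner', 'call', '__NUMBER__', 'to', 'claim', 'your', '__NUMBER__', 'prize' ]
```

Collapsing all numbers into one token lets the model learn that "contains a number" is a signal, rather than treating every phone number or amount as a separate, rarely seen word.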