Predictive analysis of a Kaggle dataset of business and sports news articles published from 2015 to 2018. The primary objective of this project is to compare the performance of several machine learning methods on a simple bag-of-words representation for classification.
Features:
- Word Analysis: Provides an analysis of the most frequent words in the dataset for both news types (Business and Sports).
- Predictive Analysis: Compares the performance of several machine learning models, namely K Nearest Neighbors (KNN), Logistic Regression (LR), Feed-Forward Neural Networks (FNN), and Multinomial Naive Bayes (MNB), on the task of predicting the type of a news article (Business or Sports) from its content.
Technical Details:
- Programming Language: Python
- Data Cleaning: Regular expressions and pandas
- Machine Learning: scikit-learn
- Visualization: Matplotlib and Seaborn
Kaggle Dataset: https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles
This code requires Python 3.6.
- Clone the repository:
git clone https://github.com/asad1996172/Articles-Analysis
- Create a conda environment:
conda create -n articles python=3.6
- Activate the environment:
conda activate articles
- Install the requirements:
pip install -r requirements.txt
- Run the code:
python3 main_code.py
First, I did some word analysis on the two News Types in this dataset, Business and Sports. The following are the top 20 words by frequency for each type, along with their number of occurrences.
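As an illustration of how such per-type word counts could be produced, here is a minimal sketch using only the standard library. The tokenizer, the `STOP_WORDS` set, the helper names, and the toy articles are all assumptions made for this example; the actual cleaning in `main_code.py` (regular expressions and pandas) may differ.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the project may filter different words (or none).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(articles, n=20):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for article in articles:
        counts.update(t for t in tokenize(article) if t not in STOP_WORDS)
    return counts.most_common(n)


# Toy articles standing in for the Kaggle data (not taken from the dataset).
business = [
    "The bank raised interest rates.",
    "Oil prices and the bank sector rallied.",
]
sports = [
    "The team won the match.",
    "The captain praised the team after the match.",
]

top_business = top_words(business, n=5)
top_sports = top_words(sports, n=5)

# Words that appear in both top-word lists.
shared = {word for word, _ in top_business} & {word for word, _ in top_sports}

print(top_business)
print(top_sports)
print(shared)
```

Here `shared` holds the words common to both top-word lists; a small overlap suggests the two vocabularies are easy to tell apart.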
The intersection of these two sets contains very few words, so intuitively the classification accuracies are expected to be high. The following models are used for comparison:
- K Nearest Neighbors (KNN)
- Logistic Regression (LR)
- Feed-Forward Neural Networks (FNN)
- Multinomial Naive Bayes (MNB)
I used an 80-20 train-test split. The following are the accuracies of the different models on the 20% test set.
Classifier | Accuracy |
---|---|
KNN (k=3) | 94.06% |
LR | 99.44% |
FNN | 99.07% |
MNB | 99.63% |
As the table shows, Multinomial Naive Bayes performs best (99.63%), narrowly ahead of Logistic Regression (99.44%) and the FNN (99.07%), while KNN (k=3) trails at 94.06%.
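A comparison of this kind could be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions, not the code in `main_code.py`: apart from k=3 and the 80-20 split, every hyperparameter is a guess, `MLPClassifier` is assumed as a stand-in for the FNN, `compare_models` is a hypothetical helper, and the toy sentences are invented, so the printed accuracies say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier


def compare_models(texts, labels, seed=0):
    """Fit four classifiers on bag-of-words counts and return test accuracies."""
    # 80-20 split, stratified so both news types appear in the test set.
    train_texts, test_texts, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )

    # Bag-of-words: fit the vocabulary on the training split only.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    # Hyperparameters other than k=3 are assumptions for this sketch.
    models = {
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "LR": LogisticRegression(max_iter=1000),
        "FNN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                             random_state=seed),
        "MNB": MultinomialNB(),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(X_test))
    return results


# Invented toy corpus: 10 short "articles" per news type.
business_terms = ["bank", "market", "shares", "profit", "investors"]
sports_terms = ["match", "team", "captain", "wicket", "tournament"]
business_texts = [f"{a} {b} update" for a in business_terms
                  for b in business_terms if a < b]
sports_texts = [f"{a} {b} update" for a in sports_terms
                for b in sports_terms if a < b]

texts = business_texts + sports_texts
labels = ["business"] * len(business_texts) + ["sports"] * len(sports_texts)

results = compare_models(texts, labels)
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2%}")
```

Fitting the `CountVectorizer` on the training split only keeps test-set vocabulary out of the features, which is the usual way to avoid leaking information into the evaluation.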