

Fake_Real_News_ML

Fake and real news dataset | Kaggle

This project is developed for the UNICA.IT University ML exams.

Starting from the datasets provided by "Kaggle.com" we have structured an algorithm capable of distinguishing fake news from real news.

Master's degree in Computer Engineering, Cybersecurity and Artificial Intelligence - University of Cagliari

Machine Learning

Authors: Michela Lecca - Marco Mulas - Lello Molinario

Supervisor: Prof. Battista Biggio



Contents

  1. Installation
  2. Project Goal
  3. Solution Design
  4. Analysis and Conclusion


Installation

  • Download the ZIP code or clone the repository with:

    git clone https://github.com/lmolinario/MLproject.git
  • Install the requirements with:

    pip3 install -r requirements.txt
  • Run the file entrypoint.py to start the program.

Project Goal

The proposed project aims to deepen knowledge of the key concepts and potential of machine learning algorithms. Through binary classification, the elements of a dataset are divided into two groups, and the model learns to predict which group each element belongs to.

Solution Design

The developed code implements a modular text-classification pipeline built around the Strategy design pattern, which defines a family of algorithms, encapsulates each one, and makes them interchangeable without altering the client code. Here it is used to make both the data representations and the classifiers interchangeable.
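As a minimal sketch of the pattern (the class names here are illustrative, not the project's actual ones):

```python
from abc import ABC, abstractmethod

class ClassifierStrategy(ABC):
    """Common interface every classification strategy must implement."""
    @abstractmethod
    def fit(self, X, y): ...
    @abstractmethod
    def predict(self, X): ...

class MajorityClassStrategy(ClassifierStrategy):
    """Toy strategy: always predicts the most frequent training label."""
    def fit(self, X, y):
        labels = list(y)
        self.label = max(set(labels), key=labels.count)
        return self
    def predict(self, X):
        return [self.label for _ in X]

class TextPipeline:
    """Client code: it depends only on the interface, so strategies swap freely."""
    def __init__(self, strategy: ClassifierStrategy):
        self.strategy = strategy
    def run(self, X_train, y_train, X_test):
        return self.strategy.fit(X_train, y_train).predict(X_test)
```

Swapping classifiers then only requires passing a different strategy object to the pipeline, with no change to the pipeline code itself.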

It includes three methods of data representation:

  1. TokenizerRepresentation: Converts text to integer sequences using the Keras Tokenizer and pads them to a fixed length.
  2. TextVectorizationRepresentation: Uses TensorFlow's TextVectorization layer to map text into integer sequences and pads them.
  3. TFIDFRepresentation: Applies TF-IDF vectorization to convert text into numeric features, weighting each term's frequency by its inverse document frequency.

These methods transform text data into numeric formats suitable for machine learning models.
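As an illustration of the third representation, scikit-learn's TfidfVectorizer is one common way to produce TF-IDF features (the two-document corpus below is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "markets fall sharply after report",
    "miracle cure revealed by anonymous sources",
]

vectorizer = TfidfVectorizer()        # weights term frequency by inverse document frequency
X = vectorizer.fit_transform(docs)    # sparse matrix: one row per document, one column per term

print(X.shape)                        # (2, vocabulary size)
```

Terms that appear in many documents receive lower weights, so distinctive words dominate the representation.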

The script also provides three classification strategies:

  1. NaiveBayesClassifier: Implements Naive Bayes for binary classification tasks such as text classification.
  2. PerceptronNetStrategy: A basic form of neural network, suitable for binary classification tasks.
  3. RandomForestClassifier: Implements a Random Forest algorithm with multiple decision trees for robust classification.
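A hedged sketch of the three strategies using their scikit-learn equivalents (the toy corpus and hyperparameters are illustrative, not the project's actual configuration):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["official economic report released",
         "shocking secret they do not want you to know",
         "ministry publishes annual statistics",
         "one weird trick doctors hate"]
labels = [0, 1, 0, 1]  # 0 = real, 1 = fake

X = TfidfVectorizer().fit_transform(texts)

# All three expose the same fit/predict interface, so they are interchangeable.
models = [MultinomialNB(),
          Perceptron(max_iter=1000),
          RandomForestClassifier(n_estimators=10, random_state=0)]
for model in models:
    model.fit(X, labels)
```

Because every model shares the fit/predict interface, the pipeline can iterate over them without any model-specific branching.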

The performance of the classifiers is evaluated through:

  1. Average precision: Precision is the ratio of true positive predictions to the total predicted positives.
  2. Average recall: Recall is the ratio of true positive predictions to the total actual positives.
  3. Average F1 Score: The F1 score is the harmonic mean of precision and recall.
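These three metrics can be computed with scikit-learn; the labels below are a made-up example, not the project's results:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # actual labels (1 = fake)
y_pred = [1, 0, 0, 1, 1, 1]   # model predictions

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```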

The pipeline supports k-fold cross-validation for robust evaluation of model performance.
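A minimal k-fold sketch with scikit-learn's cross_val_score (the corpus and k=3 are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["official budget approved", "aliens control the weather",
         "new hospital opens downtown", "secret cure suppressed",
         "election results certified", "moon landing was staged"]
labels = [0, 1, 0, 1, 0, 1]

X = TfidfVectorizer().fit_transform(texts)

# 3-fold stratified cross-validation: each fold serves once as the held-out test set.
scores = cross_val_score(MultinomialNB(), X, labels, cv=3, scoring="f1")
```

Strictly, the vectorizer should be re-fit inside each fold (e.g. via a scikit-learn Pipeline) to avoid test-set leakage; it is fit once here only to keep the sketch short.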

Analysis and Conclusion

NaiveBayes

  • Best Performance Metric: TF-IDF
  • Why: Metrics are consistently high for TF-IDF, indicating that NaiveBayes works exceptionally well with this data representation. It shows significant improvements in precision, recall, and F1-score, highlighting its ability to leverage the importance-weighted word representation effectively.

PerceptronNet

  • Best Performance Metric: TF-IDF
  • Why: PerceptronNet exhibits the most dramatic improvement with TF-IDF, achieving near-perfect scores in all metrics. This suggests that PerceptronNet can capture and learn from the detailed features provided by TF-IDF, making it highly effective for this task.

RandomForest

  • Best Performance Metric: TF-IDF
  • Why: RandomForest maintains high performance across all data representations, with the best metrics for TF-IDF. The high F1-score indicates a good balance between precision and recall, suggesting that it effectively handles both false positives and false negatives. Its robust performance across different representations makes it a versatile choice.

Model with the Best Performance Overall

RandomForest and PerceptronNet are the top performers, especially with TF-IDF, achieving near-perfect scores in all three metrics.

Between the Two?

  • RandomForest consistently has high metrics across all representations, indicating robust performance regardless of the data representation used.
  • PerceptronNet achieves slightly higher precision and recall with TF-IDF, but RandomForest's performance is more consistent across all representations.

Why Does RandomForest Work Better?

  • Ensemble Learning: Uses multiple decision trees, reducing overfitting and improving generalization.
  • Robustness: Handles variance and bias balancing well by averaging multiple trees.
  • Feature Importance: Effectively handles high-dimensional data and identifies important features, improving performance across different data representations.

Conclusion

Based on the metrics provided, RandomForest with TF-IDF is slightly the best overall due to its consistent high precision, recall, and F1-score.

Overall Goal of Machine Learning

The goal of machine learning is to create systems that can learn from data, identify patterns, and make informed decisions or predictions, ultimately improving efficiency, accuracy, and customization in various applications. In our case, the goal of binary classification in machine learning is to create models that can effectively and accurately distinguish between two classes, providing reliable and actionable predictions while balancing performance and efficiency.


Issues

Preprocessing data

Data preprocessing:

  • create a single dataset;
  • add the relevant labels to the "News" item (0 = real news, 1 = fake news);
  • remove useless data.
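The steps above can be sketched with pandas; the inline DataFrames below stand in for pd.read_csv("Fake.csv") and pd.read_csv("True.csv"), and the column names are illustrative:

```python
import pandas as pd

# Stand-ins for the two Kaggle CSV files.
fake = pd.DataFrame({"title": ["Shocking claim"], "text": ["body a"]})
true = pd.DataFrame({"title": ["Budget passed"], "text": ["body b"]})

true["label"] = 0   # 0 = real news
fake["label"] = 1   # 1 = fake news

# Single labelled dataset, with duplicates and empty rows removed.
df = pd.concat([true, fake], ignore_index=True).drop_duplicates().dropna()
```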

Documentation

  • preparation of project presentation slides;
  • division of the presentation work among the team.

Train and test ML models

  • choose the Machine Learning models to train;
  • divide the datasets appropriately (into train and test datasets);
  • train the chosen models;
  • test the chosen models;
  • show the performances of the chosen models.
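The split/train/test steps above can be sketched as follows (the corpus and model settings are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

texts = ["official report", "fake miracle", "annual statistics", "secret trick",
         "budget approved", "aliens landed", "results certified", "cure suppressed"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split first, then fit the vectorizer on the training set only (avoids leakage).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

vec = TfidfVectorizer().fit(X_train)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(vec.transform(X_train), y_train)

preds = model.predict(vec.transform(X_test))
score = f1_score(y_test, preds)   # performance on the held-out test set
```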

Tips:
Possible use of appropriate design patterns.

Download dataset from Kaggle

Prepare the dataset for subsequent analysis and evaluation:

About the dataset:
The dataset is split across two files:

  • Fake.csv (23502 fake news articles)
  • True.csv (21417 true news articles)

Dataset columns:

  • Title: title of news article
  • Text: body text of news article
  • Subject: subject of news article
  • Date: publish date of news article

Last Review

Summary of the request

Use informative commit messages and document the code.
Remove print statements and unnecessary prompts or comments, but document your code well.
All the issues listed are feature requests.

Sub tasks

  • Use informative commit messages and document the code.
  • Remove print statements and unnecessary prompts or comments
  • Verify that all the issues listed are feature requests.

Plan for the implementation

Careful review of the whole project before the push
