This repository contains multiple approaches to analyzing toxic comments. The goal is to compare the different methods and see which one performs best.
A quick report written in French can be found in the `main` branch.
All the toxic comment models are available on different branches of the repository. You can find the following branches:

- `main`: contains the main code of the project
- `tf-idf`: contains the code for the tf-idf model
- `RNN`: contains the code for the basic RNN model
- `GRU`: contains the code for the GRU model
- `LSTM`: contains the code for the LSTM model
/!\ There are two LSTM branches: `LSTMathis` is the one described in the report, while `LouiSTM` is a notebook where we tried to go further in the implementation.
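To illustrate what the recurrent branches (`RNN`, `GRU`, `LSTM`) contain, here is a minimal sketch of a Keras LSTM classifier. The layer sizes, vocabulary size, and sequence length are illustrative assumptions, not the exact architecture used in the branches; only the 100-dimensional embeddings and the six Jigsaw labels come from this README and the dataset.

```python
# Minimal sketch of an LSTM toxic-comment classifier.
# Hyperparameters below are illustrative assumptions, not the
# exact values used in the LSTM branch.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000  # assumed vocabulary size
MAX_LEN = 200        # assumed maximum comment length (tokens)
EMBED_DIM = 100      # matches glove.6B.100d.txt

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # In the project, the embedding weights would be initialized from
    # the GloVe vectors (see the loading sketch further down).
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    # The Jigsaw task has six independent toxicity labels,
    # hence six sigmoid outputs with binary cross-entropy.
    layers.Dense(6, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
```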
Each model's metrics are displayed in its code.
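For reference, the Jigsaw competition scores submissions with the mean column-wise ROC AUC. Here is a sketch of that computation; the variable names and the use of scikit-learn are assumptions, not code from the repository:

```python
# Sketch: mean per-label ROC AUC, the Jigsaw competition metric.
# y_true and y_pred are illustrative names for (n_samples, 6) arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

LABELS = ["toxic", "severe_toxic", "obscene",
          "threat", "insult", "identity_hate"]

def mean_auc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average ROC AUC over the six toxicity labels."""
    return float(np.mean([
        roc_auc_score(y_true[:, i], y_pred[:, i])
        for i in range(len(LABELS))
    ]))
```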
- Python 3.11.5
- TensorFlow 2.16.0
We are using GloVe embeddings; you can download them from the following link: https://nlp.stanford.edu/projects/glove/
Put the file `glove.6B.100d.txt` (and the others) in the `datasets` folder.
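As a sketch of how these embeddings are typically loaded into memory (the path follows the instructions above; the parsing code itself is an assumption, not code from the repository):

```python
# Sketch: load GloVe vectors into a {word: vector} dictionary.
# The path follows the README instructions; adjust for other sizes.
import numpy as np

def load_glove(path: str = "datasets/glove.6B.100d.txt") -> dict:
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings

glove = load_glove()
print(f"{len(glove)} word vectors loaded")  # ~400k words for glove.6B
```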
We are using the Jigsaw dataset; you can download it from the following link: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
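A minimal loading sketch, assuming `train.csv` from the Kaggle download has been placed in the `datasets` folder (the column names are those of the competition files):

```python
# Sketch: read the Jigsaw training set with pandas (assumed location).
import pandas as pd

train = pd.read_csv("datasets/train.csv")
texts = train["comment_text"]
labels = train[["toxic", "severe_toxic", "obscene",
                "threat", "insult", "identity_hate"]].values
print(train.shape)  # columns: id, comment_text, and the six labels
```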
The project is structured as follows:
- `datasets`: contains the data used for the project
- `helpers`: contains functions used in the project
- `GloVe`: contains the GloVe embeddings
- `models`: contains the trained models
- `notebook.ipynb`: the notebook used for training the models
- `pipeline.py`: the pipeline to implement the model in production (see the sketch below)
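Since `pipeline.py` targets production use, a typical inference flow would look like the following; the model filename, preprocessing, and function name are hypothetical, as the actual interface lives in `pipeline.py`:

```python
# Hypothetical inference sketch; the real interface is in pipeline.py.
# The model path and preprocessing are illustrative assumptions.
import numpy as np
import tensorflow as tf

LABELS = ["toxic", "severe_toxic", "obscene",
          "threat", "insult", "identity_hate"]

# A trained model saved in the `models` folder (assumed filename).
model = tf.keras.models.load_model("models/lstm_model.keras")

def predict_toxicity(comment: str) -> dict:
    """Return one probability per Jigsaw label for a single comment.

    Assumes the saved model embeds its own text preprocessing (e.g. a
    TextVectorization layer), so raw strings can be fed directly.
    """
    probs = model.predict(np.array([comment]), verbose=0)[0]
    return dict(zip(LABELS, probs.round(3).tolist()))

print(predict_toxicity("example comment to score"))
```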