DeToxic

This is a class project for CMPS240 at UCSC that aims to detect toxic content in Quora questions using state-of-the-art NLP techniques. The project is based on an ongoing Kaggle competition, which provides all the datasets and pretrained embeddings.

Related topics: NLP, classification, Deep Learning, content & style, PyTorch, Google Cloud Platform, GPU, Kaggle

Structure

We describe only the key directories here; a diagram of the full source tree is given at the end of this README.md file:

  • ./checkpoints: stores the trained weight and optimizer files
  • ./data_proc: Python scripts for data preprocessing; also stores the processed data
  • ./model: contains classes that combine networks to build models, e.g. baseline
  • ./networks: contains classes that define networks, e.g. LSTM
  • ./options: contains the class that defines the options for our models
  • ./utils: contains utility classes and functions

How to train the model

For example,

python train_baseline.py \
--model_type baseline \
--classifier_net LSTM \
--number_layers 2 \
--is_bidirectional True \
--tag R2D2d0-L1 \
--max_epoch_C 20 \
--load_epoch_C 10 \
--threshold 0.5
  • In this example we continue training a specified LSTM model, starting from epoch 10 and running up to the maximum epoch 20;
    • the pretrained weight file and optimizer file for epoch 10 must already exist in the checkpoints directory.
  • --model_type baseline means the intermediate files will be saved in ./checkpoints/baseline/;
  • --classifier_net LSTM --number_layers 2 --is_bidirectional True specify the architecture of the LSTM model: two bidirectional LSTM layers;
  • --tag R2D2d0-L1 defines the tag field in the names of the weight and optimizer files, e.g. net_LSTM_R2D2d0-L1_1_id.pth and opt_LSTM_R2D2d0-L1_1_id.pth;
    • here R refers to the number of RNN (LSTM) layers; D refers to the number of directions; d refers to the dropout rate; and L refers to the number of fully-connected layers in the classifier.
  • --threshold 0.5 sets the threshold for binary classification, i.e. if the predicted probability is greater than 0.5, the sample is assigned to the positive class;
  • All possible options are defined in the file ./options/base_options.py.
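
To make the tag convention and the threshold rule above concrete, here is a minimal sketch. It is not code from this repository: the names TagFields, parse_tag and predict_label are hypothetical, and the tag layout is inferred only from the single example R2D2d0-L1, so other tags (for instance a fractional dropout rate) may be written differently in practice.

```python
import re
from typing import NamedTuple


class TagFields(NamedTuple):
    """Decoded tag fields (hypothetical helper, not part of the repository)."""
    rnn_layers: int   # R: number of RNN (LSTM) layers
    directions: int   # D: number of directions
    dropout: float    # d: dropout rate
    fc_layers: int    # L: number of fully-connected layers in the classifier


# Assumed layout, inferred from the example "R2D2d0-L1".
TAG_PATTERN = re.compile(r"R(\d+)D(\d+)d(\d+(?:\.\d+)?)-L(\d+)")


def parse_tag(tag: str) -> TagFields:
    """Split a tag such as 'R2D2d0-L1' into its four fields."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    rnn, directions, dropout, fc = match.groups()
    return TagFields(int(rnn), int(directions), float(dropout), int(fc))


def predict_label(probability: float, threshold: float = 0.5) -> int:
    """Return 1 (positive class) only if probability is strictly above threshold."""
    return int(probability > threshold)


fields = parse_tag("R2D2d0-L1")
print(fields.rnn_layers, fields.directions, fields.dropout, fields.fc_layers)
print(predict_label(0.7), predict_label(0.5))
```

Note that a probability exactly equal to the threshold falls in the negative class here, following the "greater than" wording above.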

How to evaluate the model (on the validation set)

  • One may use the command line directly:
    python test.py \
    --model_type baseline \
    --classifier_net LSTM \
    --number_layers 2 \
    --is_bidirectional True \
    --tag R2D2d0-L1 \
    --load_epoch_C 10 \
    --threshold 0.5
    
    • the options are very similar to those used in the training phase; they explicitly describe the model to be tested;
    • in this case, before testing the model LSTM(R2D2d0-L1), one must make sure that the file ./checkpoints/baseline/net_LSTM_R2D2d0-L1_10_id.pth exists (note that the opt file is not necessary for testing).
  • Or simply run the shell script (after modifying it accordingly):
    bash run_test.sh
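
Since both training from a saved epoch and testing depend on checkpoint files being present, a small pre-flight check can save a failed run. The sketch below is an assumption-laden illustration, not code from this repository: the function names are hypothetical, and the file name pattern net_<classifier_net>_<tag>_<epoch>_id.pth is inferred from the example paths in this README (the trailing _id is copied verbatim from those examples).

```python
from pathlib import Path


def checkpoint_paths(model_type, classifier_net, tag, epoch, root="./checkpoints"):
    """Build the (weight, optimizer) file paths for one saved epoch.

    Hypothetical helper; the naming pattern is inferred from the README examples.
    """
    directory = Path(root) / model_type
    stem = f"{classifier_net}_{tag}_{epoch}_id.pth"
    return directory / f"net_{stem}", directory / f"opt_{stem}"


def missing_checkpoints(model_type, classifier_net, tag, epoch,
                        need_optimizer, root="./checkpoints"):
    """Return the required checkpoint files that do not exist on disk.

    Resuming training needs both files; testing needs only the weight file.
    """
    net_path, opt_path = checkpoint_paths(model_type, classifier_net, tag, epoch, root)
    required = [net_path, opt_path] if need_optimizer else [net_path]
    return [path for path in required if not path.is_file()]


# Testing the model from the example above only requires the weight file.
missing = missing_checkpoints("baseline", "LSTM", "R2D2d0-L1", 10, need_optimizer=False)
for path in missing:
    print(f"missing checkpoint: {path}")
```

Calling missing_checkpoints with need_optimizer=True corresponds to the training example, where both the weight and optimizer files for epoch 10 must exist.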

Appendix

The whole structure of the source code is listed in the following tree diagram:

.
├── README.md
├── checkpoints
│   ├── balanced
│   ├── baseline
│   └── triplet
├── data_proc
│   ├── __init__.py
│   ├── embedding_preprocess.py
│   ├── preprocess.ipynb
│   ├── preprocessor_demo.ipynb
│   ├── processed_data
│   └── question_preprocess.py
├── debug
├── model
│   ├── __init__.py
│   ├── baseline.py
│   ├── basemodel.py
│   └── triplet.py
├── networks
│   ├── GRU.py
│   ├── LR.py
│   ├── LSTM.py
│   ├── __init__.py
│   ├── __pycache__
│   ├── basenetwork.py
│   ├── embedding.py
│   └── linear.py
├── options
│   ├── __init__.py
│   ├── __pycache__
│   └── base_options.py
├── results
├── run_AUC.sh
├── run_test.sh
├── test.py
├── train_baseline.py
├── train_triplet.py
└── utils
    ├── __init__.py
    ├── __pycache__
    ├── dataloader.py
    └── util.py
