Jigsaw Unintended Bias in Toxicity Classification

This repository contains my code for the Kaggle competition.

7th Place Solution for Jigsaw Unintended Bias in Toxicity Classification

Team: Abhishek Thakur, Duy, R0seNb1att, atfujita

All models (Team)
Public LB: 0.94729 (3rd)
Private LB: 0.94660 (7th)

Note: This repository contains only my own models, and only the training scripts.

My models (average of 5 models)
Public LB: 0.94719
Private LB: 0.94651

Thanks to Abhishek and Duy's wonderful models and support, I was able to get better results.

Set up

  • Particularly important libraries are listed in requirements.txt

Models

I created 5 models: one LSTM and four BERT variants.

  • LSTM

    • Based on the Quora competition model
    • Architecture: LSTM + GRU + Self Attention + Max pooling
    • Word embeddings: concatenation of GloVe and fastText.
    • Optimizer: AdamW
    • Train:
      • max_len = 220
      • n_splits = 10
      • batch_size = 512
      • train_epochs = 7
      • base_lr, max_lr = 0.0005, 0.003
      • Weight Decay = 0.0001
      • Learning schedule: CyclicLR
  • BERT

    • The model is based on Yuval Reina's great kernel.
    • My changes are to the loss function and the preprocessing.
    • I created 4 BERT models.
      • BERT-Base Uncased
      • BERT-Base Cased
      • BERT-Large Uncased (Whole Word Masking)
      • BERT-Large Cased (Whole Word Masking)
    • Train:
      • max_len = 220
      • train samples = 1.7M, val samples = 0.1M
      • batch_size = 32 (Base), 4 (Large)
      • accumulation_steps = 1 (Base), 16 (Large)
      • train_epochs = 2
      • lr = 2e-5
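
The accumulation_steps setting in the BERT list above implies gradient accumulation: with batch_size = 4 and accumulation_steps = 16, the optimizer updates once per 64 samples. The following is a minimal sketch of that pattern, not the actual training script. The linear model, the random data, and the plain Adam optimizer are stand-ins chosen so the example runs on a CPU in a moment.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model and optimizer (assumption): the real setup fine-tunes BERT.
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()

batch_size = 4           # micro-batch size listed for BERT-Large
accumulation_steps = 16  # effective batch size = 4 * 16 = 64

# Random toy data: enough rows for exactly two optimizer updates.
features = torch.randn(batch_size * accumulation_steps * 2, 8)
labels = torch.randint(0, 2, (features.shape[0], 1)).float()

optimizer_steps = 0
optimizer.zero_grad()
for step, start in enumerate(range(0, features.shape[0], batch_size), start=1):
    x = features[start:start + batch_size]
    y = labels[start:start + batch_size]
    # Scale the loss so the accumulated gradient is an average, not a sum.
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up in .grad across micro-batches
    if step % accumulation_steps == 0:
        optimizer.step()       # one update per 16 micro-batches
        optimizer.zero_grad()
        optimizer_steps += 1
```

With accumulation_steps = 1, as listed for the Base models, the same loop reduces to an ordinary update after every batch.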

Worked well

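The loss function was very important in this competition.
In fact, all winners used different loss functions.

My loss function is shown below. It applies a per-sample weighted BCE to the binary target and adds an unweighted BCE term for each of the seven auxiliary labels.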

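import numpy as np
from torch import nn

# train_df is assumed to be the competition's train.csv loaded as a pandas DataFrame
y_columns = ['target']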

y_aux_train = train_df[['target', 'severe_toxicity', 'obscene',
                        'identity_attack', 'insult',
                        'threat',
                        'sexual_explicit'
                        ]]

y_aux_train = y_aux_train.fillna(0)

identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']
# Overall: every comment starts with a weight of 0.25
weights = np.ones((len(train_df),)) / 4
# Subgroup: +0.25 if any identity column is >= 0.5
weights += (train_df[identity_columns].fillna(0).values >= 0.5).sum(
    axis=1).astype(bool).astype(int) / 4
# Background Positive, Subgroup Negative
weights += (((train_df['target'].values >= 0.5).astype(bool).astype(int) +
             (train_df[identity_columns].fillna(0).values < 0.5).sum(
                 axis=1).astype(bool).astype(int)) > 1).astype(
    bool).astype(int) / 4
# Background Negative, Subgroup Positive
weights += (((train_df['target'].values < 0.5).astype(bool).astype(int) +
             (train_df[identity_columns].fillna(0).values >= 0.5).sum(
                 axis=1).astype(bool).astype(int)) > 1).astype(
    bool).astype(int) / 4

# y_train columns: [binary target, sample weight, then the 7 auxiliary labels]
y_train = np.vstack(
    [(train_df['target'].values >= 0.5).astype(int), weights]).T

y_train = np.hstack([y_train, y_aux_train])


def custom_loss(data, targets):
    '''Weighted BCE on the binary target plus plain BCE on each auxiliary label.

    data: logits of shape (N, 8); targets: rows of y_train, shape (N, 9).
    '''
    bce_loss_1 = nn.BCEWithLogitsLoss(
        weight=targets[:, 1:2])(data[:, :1], targets[:, :1])
    bce_loss_2 = nn.BCEWithLogitsLoss()(data[:, 1:2], targets[:, 2:3])
    bce_loss_3 = nn.BCEWithLogitsLoss()(data[:, 2:3], targets[:, 3:4])
    bce_loss_4 = nn.BCEWithLogitsLoss()(data[:, 3:4], targets[:, 4:5])
    bce_loss_5 = nn.BCEWithLogitsLoss()(data[:, 4:5], targets[:, 5:6])
    bce_loss_6 = nn.BCEWithLogitsLoss()(data[:, 5:6], targets[:, 6:7])
    bce_loss_7 = nn.BCEWithLogitsLoss()(data[:, 6:7], targets[:, 7:8])
    bce_loss_8 = nn.BCEWithLogitsLoss()(data[:, 7:8], targets[:, 8:9])

    return bce_loss_1 + bce_loss_2 + bce_loss_3 + bce_loss_4 \
           + bce_loss_5 + bce_loss_6 + bce_loss_7 + bce_loss_8
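
To make the weighting scheme concrete, here is a small worked example on four toy comments. The helper sample_weights is hypothetical: it is meant to mirror the weight computation above on plain NumPy arrays, and the two identity columns and their values are made up for illustration.

```python
import numpy as np


def sample_weights(target, identities):
    """Hypothetical helper mirroring the weighting scheme above.

    target:     (n,) toxicity scores in [0, 1]
    identities: (n, k) identity scores, NaN where unlabeled
    """
    identities = np.nan_to_num(identities, nan=0.0)  # same role as fillna(0)
    toxic = target >= 0.5
    any_identity = (identities >= 0.5).any(axis=1)
    # As coded above, this is true when ANY identity column is below 0.5.
    any_non_identity = (identities < 0.5).any(axis=1)

    weights = np.full(len(target), 0.25)           # overall
    weights += any_identity * 0.25                  # subgroup
    weights += (toxic & any_non_identity) * 0.25    # background positive, subgroup negative
    weights += (~toxic & any_identity) * 0.25       # background negative, subgroup positive
    return weights


# Toy data (made up): four comments, two identity columns.
target = np.array([0.0, 0.8, 0.1, 0.9])
identities = np.array([
    [0.0, 0.0],     # non-toxic, no identity mentioned
    [0.0, np.nan],  # toxic, no identity mentioned
    [1.0, 0.0],     # non-toxic, identity mentioned
    [1.0, 0.0],     # toxic, identity mentioned
])

weights = sample_weights(target, identities)
print(weights.tolist())  # [0.25, 0.5, 0.75, 0.75]
```

Note that the third term checks whether any identity column is below 0.5, so in practice it fires for nearly every toxic comment, including those that mention an identity, as the last row shows.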
