
GoEmotions Pytorch

PyTorch implementation of GoEmotions with Hugging Face Transformers

What is GoEmotions?

A dataset of 58,000 Reddit comments labeled with 28 emotion categories (27 emotions + neutral):

  • admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise + neutral

Training Details

  • Uses bert-base-cased (same as the paper's code)

  • The paper uses three taxonomies. I've also prepared the data with relabeled targets for the hierarchical-grouping and Ekman taxonomies:

    1. Original GoEmotions (27 emotions + neutral)
    2. Hierarchical Grouping (positive, negative, ambiguous + neutral)
    3. Ekman (anger, disgust, fear, joy, sadness, surprise + neutral)

Vocabulary

  • I've replaced [unused1] and [unused2] with [NAME] and [RELIGION] in the vocab, respectively.
[PAD]
[NAME]
[RELIGION]
[unused3]
[unused4]
...
  • I've also set special_tokens_map.json as below, so the tokenizer won't split [NAME] or [RELIGION] into word pieces.
{
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "additional_special_tokens": ["[NAME]", "[RELIGION]"]
}
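
A quick sanity check (a minimal sketch; it assumes the fine-tuned tokenizer published with this repo) that the special tokens survive tokenization intact:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("monologg/bert-base-cased-goemotions-original")
# [NAME] should come through as a single token instead of being split into word pieces
print(tokenizer.tokenize("Good luck, [NAME]!"))
# e.g. ['Good', 'luck', ',', '[NAME]', '!']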

Requirements

  • torch==1.4.0
  • transformers==2.11.0
  • attrdict==2.0.1

Hyperparameters

You can change the hyperparameters via the JSON files in the config directory.

Parameter           Value
Learning rate       5e-5
Warmup proportion   0.1
Epochs              10
Max Seq Length      50
Batch size          16
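
For illustration, a taxonomy config might look like the sketch below. The key names here are hypothetical except model_name_or_path (which is mentioned in the issues further down); check the actual files in config/ for the real schema:

{
  "model_name_or_path": "bert-base-cased",
  "learning_rate": 5e-5,
  "warmup_proportion": 0.1,
  "num_train_epochs": 10,
  "max_seq_len": 50,
  "train_batch_size": 16
}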

How to Run

For --taxonomy, choose one of original, group, or ekman:

$ python3 run_goemotions.py --taxonomy {$TAXONOMY}

$ python3 run_goemotions.py --taxonomy original
$ python3 run_goemotions.py --taxonomy group
$ python3 run_goemotions.py --taxonomy ekman

Results

Best Macro F1 results:

Macro F1 (%)   Dev     Test
original       50.16   50.30
group          69.41   70.06
ekman          62.59   62.38

Pipeline

  • Inference for multi-label classification was made possible by creating a new MultiLabelPipeline class (the core idea is sketched after this list).
  • The fine-tuned models are already uploaded to Hugging Face S3.
    • Original GoEmotions Taxonomy: monologg/bert-base-cased-goemotions-original
    • Hierarchical Group Taxonomy: monologg/bert-base-cased-goemotions-group
    • Ekman Taxonomy: monologg/bert-base-cased-goemotions-ekman
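
The core of multi-label inference is a sigmoid over the logits followed by a score threshold, rather than a softmax over mutually exclusive classes. Below is a minimal sketch of that logic, not the repo's actual MultiLabelPipeline; it assumes a model whose first output is the (batch, num_labels) logits and a tokenizer version that supports batch encoding via __call__:

import torch

def multilabel_predict(model, tokenizer, texts, labels, threshold=0.3):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs)[0]   # first output: per-label logits
    probs = torch.sigmoid(logits)     # independent probability for each label
    results = []
    for row in probs:
        keep = (row >= threshold).nonzero(as_tuple=True)[0].tolist()
        results.append({
            "labels": [labels[i] for i in keep],
            "scores": [row[i].item() for i in keep],
        })
    return results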

1. Original GoEmotions Taxonomy

from transformers import BertTokenizer
from model import BertForMultiLabelClassification
from multilabel_pipeline import MultiLabelPipeline
from pprint import pprint

tokenizer = BertTokenizer.from_pretrained("monologg/bert-base-cased-goemotions-original")
model = BertForMultiLabelClassification.from_pretrained("monologg/bert-base-cased-goemotions-original")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

texts = [
    "Hey that's a thought! Maybe we need [NAME] to be the celebrity vaccine endorsement!",
    "it’s happened before?! love my hometown of beautiful new ken πŸ˜‚πŸ˜‚",
    "I love you, brother.",
    "Troll, bro. They know they're saying stupid shit. The motherfucker does nothing but stink up libertarian subs talking shit",
]

pprint(goemotions(texts))

# Output
[{'labels': ['neutral'], 'scores': [0.9750906]},
 {'labels': ['curiosity', 'love'], 'scores': [0.9694574, 0.9227462]},
 {'labels': ['love'], 'scores': [0.993483]},
 {'labels': ['anger'], 'scores': [0.99225825]}]

2. Group Taxonomy

from transformers import BertTokenizer
from model import BertForMultiLabelClassification
from multilabel_pipeline import MultiLabelPipeline
from pprint import pprint

tokenizer = BertTokenizer.from_pretrained("monologg/bert-base-cased-goemotions-group")
model = BertForMultiLabelClassification.from_pretrained("monologg/bert-base-cased-goemotions-group")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

texts = [
    "Hey that's a thought! Maybe we need [NAME] to be the celebrity vaccine endorsement!",
    "it’s happened before?! love my hometown of beautiful new ken πŸ˜‚πŸ˜‚",
    "I love you, brother.",
    "Troll, bro. They know they're saying stupid shit. The motherfucker does nothing but stink up libertarian subs talking shit",
]

pprint(goemotions(texts))

# Output
[{'labels': ['positive'], 'scores': [0.9989434]},
 {'labels': ['ambiguous', 'positive'], 'scores': [0.99801123, 0.99845874]},
 {'labels': ['positive'], 'scores': [0.99930394]},
 {'labels': ['negative'], 'scores': [0.9984231]}]

3. Ekman Taxonomy

from transformers import BertTokenizer
from model import BertForMultiLabelClassification
from multilabel_pipeline import MultiLabelPipeline
from pprint import pprint

tokenizer = BertTokenizer.from_pretrained("monologg/bert-base-cased-goemotions-ekman")
model = BertForMultiLabelClassification.from_pretrained("monologg/bert-base-cased-goemotions-ekman")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

texts = [
    "Hey that's a thought! Maybe we need [NAME] to be the celebrity vaccine endorsement!",
    "it’s happened before?! love my hometown of beautiful new ken πŸ˜‚πŸ˜‚",
    "I love you, brother.",
    "Troll, bro. They know they're saying stupid shit. The motherfucker does nothing but stink up libertarian subs talking shit",
]

pprint(goemotions(texts))

# Output
[{'labels': ['joy', 'neutral'], 'scores': [0.30459446, 0.9217335]},
 {'labels': ['joy', 'surprise'], 'scores': [0.9981395, 0.99863845]},
 {'labels': ['joy'], 'scores': [0.99910116]},
 {'labels': ['anger'], 'scores': [0.9984291]}]


goemotions-pytorch's Issues

CUDA not being found correctly

Hello, and thank you for making this great module.
I loaded the model in a CUDA environment, and when I run

pprint(goemotions(text))

I get the error below.
I tried changing device from -1 to 0 in multilabel_pipeline, and I also added the following:

import torch
device = torch.device('cuda:0')
model.to(device)

but the same error still occurs, so I'm asking here.
How can I fix it?

Thank you.

RuntimeError                              Traceback (most recent call last)
----> 1 pprint(goemotions(df['sentences'][0]))

~/ProjComment/IMDB/GoEmotions-pytorch/multilabel_pipeline.py in __call__(self, *args, **kwargs)
---> 39     outputs = super().__call__(*args, **kwargs)
     40     scores = 1 / (1 + np.exp(-outputs))  # Sigmoid

/home/ubuntu/anaconda3/envs/GoEmotions-pytorch/lib/python3.8/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
     492     inputs = self.ensure_tensor_on_device(**inputs)
---> 493     predictions = self.model(**inputs)[0].cpu()

~/ProjComment/IMDB/GoEmotions-pytorch/model.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels)
---> 27     outputs = self.bert(

/home/ubuntu/anaconda3/envs/GoEmotions-pytorch/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
---> 174     inputs_embeds = self.word_embeddings(input_ids)

/home/ubuntu/anaconda3/envs/GoEmotions-pytorch/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
-> 1484     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

(intermediate torch.nn Module.__call__ frames omitted)

RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select
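
For reference, this error usually means the model was moved to the GPU while the input tensors stayed on the CPU. A hedged sketch of a manual forward pass with both on the same device, bypassing the pipeline (it assumes a tokenizer version that supports batch encoding via __call__):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

inputs = tokenizer(["some text"], padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}  # inputs must live on the same device as the model
with torch.no_grad():
    logits = model(**inputs)[0]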

faster prediction using GoEmotions-pytorch models based on bert-mini, bert-small or bert-tiny

Hi,
Thank you for your great work on GoEmotions-pytorch!
I am trying to use your code to generate models using either bert-mini, bert-small or bert-tiny for faster predictions.
I changed original.json by setting model_name_or_path to prajjwal1/bert-mini, for example, and ran python3 run_goemotions.py --taxonomy original.
It works, and the new model is a bit faster than the one using bert-base. However, I was wondering if I also need to change tokenizer_name_or_path to a different value. The original value is "monologg/bert-base-cased-goemotions-original". Any thoughts on how to get a tokenizer based on bert-mini?
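
For illustration, the change described above might look like this in config/original.json (a sketch limited to the two keys named in this issue; whether pointing tokenizer_name_or_path at bert-mini preserves the custom [NAME]/[RELIGION] vocab is exactly the open question here):

{
  "model_name_or_path": "prajjwal1/bert-mini",
  "tokenizer_name_or_path": "prajjwal1/bert-mini"
}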

Many thanks!
Chedia

Model not running on GPU & out of memory error

Hi,

Thanks so much for this repo, and please forgive me if this is trivial; I've been trying for a little while now to run the model on Google Colab. I'm running into two separate issues, which I think may be linked. The first is that if I load the model in a GPU runtime, the model defaults to the CPU.

After running:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

There's an error when trying to run goemotions(texts):
"RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select".

Second, when trying to run goemotions over more than a few thousand rows on Colab in a high-RAM runtime environment, I run into an out of memory error. I'm wondering if this is a problem with batching in the data loader? I'll be looking for solutions & hope to close this issue myself, but in the meantime any help is much appreciated, thanks!
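
One common mitigation for the out-of-memory issue, independent of this repo's internals, is to feed the pipeline fixed-size chunks instead of all rows at once; a minimal sketch (all_texts is a hypothetical list of input strings):

def chunks(seq, size=32):
    # Yield successive fixed-size slices of the input list
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

results = []
for batch in chunks(all_texts):
    results.extend(goemotions(batch))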

Tokenizer and Model loading for a fine-tuned model

I am looking into loading the model and the tokenizer after it has been trained on a custom dataset. After training, I am able to produce pytorch_model.bin, config.json, tokenizer_config.json, special_tokens_map.json, training_args.bin, and vocab.txt for every checkpoint saved.

Is there a script showing how to load the saved checkpoints along with the tokenizer, just like the example you have provided here for your pre-trained model?

tokenizer = BertTokenizer.from_pretrained("monologg/bert-base-cased-goemotions-ekman")
model = BertForMultiLabelClassification.from_pretrained("monologg/bert-base-cased-goemotions-ekman")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

Thanks for the awesome repo!
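
For what it's worth, from_pretrained also accepts a local directory, so a saved checkpoint can be loaded the same way as the hub models; a sketch with a hypothetical checkpoint path:

from transformers import BertTokenizer
from model import BertForMultiLabelClassification
from multilabel_pipeline import MultiLabelPipeline

ckpt_dir = "./output/checkpoint-1000"  # hypothetical path to a saved checkpoint

tokenizer = BertTokenizer.from_pretrained(ckpt_dir)
model = BertForMultiLabelClassification.from_pretrained(ckpt_dir)

goemotions = MultiLabelPipeline(model=model, tokenizer=tokenizer, threshold=0.3)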

TypeError: Can't instantiate abstract class MultiLabelPipeline with abstract methods _forward, _sanitize_parameters, postprocess, preprocess

Thanks for this great work.
I am using transformers v4.
I know this is not transformers v2 as requested in the README, but I can no longer install v2.11.0 because of dependency errors in that version, and v2.4.1 raises other errors.

In transformers v4, it raises:
Traceback (most recent call last):
goemotions = MultiLabelPipeline(
TypeError: Can't instantiate abstract class MultiLabelPipeline with abstract methods _forward, _sanitize_parameters, postprocess, preprocess

Do you think you could update the code so it works with the latest Hugging Face transformers (v4)?
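
For anyone hitting this: transformers v4 split Pipeline.__call__ into four abstract methods, so a subclass now has to implement all of them. A hedged sketch of what a v4-compatible pipeline could look like (this is not the repo's code; it assumes the model's first output holds the logits and that config.id2label maps indices to emotion names):

import numpy as np
from transformers import Pipeline

class MultiLabelPipeline(Pipeline):
    def _sanitize_parameters(self, threshold=None, **kwargs):
        # Route the threshold argument to postprocess()
        postprocess_kwargs = {}
        if threshold is not None:
            postprocess_kwargs["threshold"] = threshold
        return {}, {}, postprocess_kwargs

    def preprocess(self, inputs):
        return self.tokenizer(inputs, return_tensors="pt", truncation=True)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, threshold=0.3):
        logits = model_outputs[0][0].numpy()    # assumes the first output holds the logits
        scores = 1 / (1 + np.exp(-logits))      # sigmoid, as in the original pipeline
        keep = scores >= threshold
        labels = [self.model.config.id2label[i] for i, k in enumerate(keep) if k]
        return {"labels": labels, "scores": scores[keep].tolist()}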

Fine tuning on custom number of classes

Let's say we want to fine-tune the model (any of the taxonomies, e.g. ekman or original) on another dataset with a different number of classes, e.g. only positive and negative. What's the correct procedure?

Currently, if I prepare the data in the same format (.tsv files with data and labels) and set up labels.txt with only two classes, training seems to run. But is this correct, or do other changes need to be made inside the model training?
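
For example, a two-class labels.txt in the format described above would simply list one label per line (hypothetical contents):

positive
negative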
