abhishekkrthakur / bert-sentiment
License: MIT License
Would it be possible for you to push a requirements.txt file to your repo, since Docker is not working?
The requirements.txt file is missing, and some of the functionality does not work on torch==1.11.0.
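For reference, a minimal sketch of what such a requirements.txt might contain, based only on the imports used in the code (torch, transformers, pandas, scikit-learn, numpy, flask for app.py); the exact pins are assumptions, not the versions the author actually used:

# requirements.txt -- hypothetical pins, adjust to the versions the code was written against
torch==1.5.0
transformers==2.11.0
pandas==1.0.5
scikit-learn==0.23.1
numpy==1.18.5
flask==1.1.2
tqdm==4.46.1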
In train.py, line 13:
from transformers import get_linear_schedule_with_warmup
it would be better to change it to
from transformers import WarmupLinearSchedule as get_linear_schedule_with_warmup
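Note that the two schedulers are probably not drop-in replacements: in the older transformers releases that still ship WarmupLinearSchedule, it was constructed with warmup_steps/t_total arguments rather than the keyword arguments used in train.py. A hedged sketch of a version-tolerant import (the fallback signature is an assumption about those older releases):

try:
    from transformers import get_linear_schedule_with_warmup
except ImportError:
    # Assumption: on older transformers (~2.x), only WarmupLinearSchedule exists
    # and it takes (optimizer, warmup_steps, t_total).
    from transformers import WarmupLinearSchedule

    def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
        return WarmupLinearSchedule(
            optimizer, warmup_steps=num_warmup_steps, t_total=num_training_steps
        )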
Follow-up to issue #1.
@abhishekkrthakur: Can you give any leads on how to load a DataParallel GPU model on the CPU?
As per the PyTorch docs I tried the following, but it still raises the above RuntimeError:
device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))
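A common cause here, assuming the checkpoint was saved while the model was wrapped in nn.DataParallel: the saved state_dict keys carry a "module." prefix that the bare model does not expect. A minimal sketch of stripping the prefix before loading:

import torch

device = torch.device("cpu")
state_dict = torch.load(PATH, map_location=device)

# DataParallel saves keys as "module.<param>"; the plain model expects "<param>"
state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}

model = TheModelClass(*args, **kwargs)
model.load_state_dict(state_dict)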
Hello,
I've tried your implementation of the BERT model to predict the sentiment of a sentence.
When training the model on Google TPUs, I hit this issue on this line:
review = str(self.review[item])
I've spent a lot of time on it, but I can't figure out why it throws an index out of bounds.
I started training with a small dataset of 10,000 rows.
Stack trace:
bi = 0, loss = 0.6744452714920044
bi = 10, loss = 0.6480506658554077
bi = 20, loss = 0.6070395708084106
bi = 30, loss = 0.3570273816585541
bi = 40, loss = 0.322771281003952
bi = 50, loss = 0.42349475622177124
bi = 60, loss = 0.2848508358001709
bi = 70, loss = 0.2577969431877136
bi = 80, loss = 0.4233595132827759
bi = 90, loss = 0.600457489490509
bi = 100, loss = 0.22680382430553436
bi = 110, loss = 0.09512724727392197
bi = 120, loss = 0.14158135652542114
bi = 130, loss = 0.653974175453186
Exception in thread Thread-6:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 134, in _loader_worker
_, data = next(data_iter)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "", line 12, in getitem
review = str(self.review[item])
IndexError: index 4244 is out of bounds for axis 0 with size 979
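A likely cause, assuming self.review is a pandas Series produced by train_test_split: the split keeps the original row labels, so self.review[item] does a label-based lookup and misses. Resetting the index after the split, or passing plain numpy arrays into the dataset, is the usual fix; a minimal sketch using the column names from this repo's train.py:

# Reset indices after the split so integer indexing in __getitem__ lines up.
df_train = df_train.reset_index(drop=True)
df_valid = df_valid.reset_index(drop=True)

# Or hand the dataset plain numpy arrays instead of pandas Series:
train_dataset = dataset.BERTDataset(
    review=df_train.review.values,
    target=df_train.sentiment.values,
)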
TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str
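This TypeError usually shows up with newer transformers releases, assuming the model's forward unpacks the BERT output as a tuple (e.g. _, o2 = self.bert(...)) and then applies dropout: transformers switched to returning a ModelOutput by default, so tuple-unpacking yields the string field names and the dropout layer receives a str. A hedged sketch of two possible fixes inside the model's forward (attribute names are illustrative, not necessarily the repo's):

# Option 1: ask for the old tuple return format (newer transformers only).
_, pooled = self.bert(
    ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
)

# Option 2: keep the ModelOutput and read the pooled output explicitly.
out = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)
pooled = out.pooler_output

x = self.dropout(pooled)  # hypothetical dropout layer name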
Getting the issue below while loading the model on my local system. The model was trained on Colab.
Traceback (most recent call last):
File "app.py", line 74, in <module>
MODEL.load_state_dict(torch.load(config.MODEL_PATH, map_location=torch.device('cpu'))) #New created
File "C:\Users\Vijender\Downloads\bert_sentiment\lib\site-packages\torch\nn\modules\module.py", line 830, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERTBaseUncased:
Missing key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias",
How can the model be trained on multi-class data?
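The repo is written for binary sentiment, so this needs a couple of changes; a minimal sketch of the usual multi-class setup (class and attribute names are illustrative, and the number of classes is hypothetical): widen the output layer to one logit per class and switch the loss to CrossEntropyLoss with integer targets.

import torch.nn as nn
import transformers

NUM_CLASSES = 5  # hypothetical number of classes


class BERTMultiClass(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained("bert-base-uncased")
        self.bert_drop = nn.Dropout(0.3)
        # one logit per class instead of a single sigmoid output
        self.out = nn.Linear(768, NUM_CLASSES)

    def forward(self, ids, mask, token_type_ids):
        # return_dict=False keeps the old tuple output; drop it on transformers 2.x
        _, pooled = self.bert(
            ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
        )
        return self.out(self.bert_drop(pooled))


def loss_fn(outputs, targets):
    # CrossEntropyLoss expects raw logits and integer class labels (dtype long)
    return nn.CrossEntropyLoss()(outputs, targets.long())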
Hi, I'm getting the error below when I run the train script. Can you help me?
File "C:\Users\thanisb\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py", line 1724, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
[Finished in 33.4s with exit code 1]
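"index out of range in self" inside the embedding layer usually means a token or position id that exceeds the embedding table, for example MAX_LEN set above 512 for bert-base, or a tokenizer that does not match the pretrained weights. A small diagnostic sketch (the model and tokenizer names are assumptions; use whatever config.BERT_PATH points at):

import transformers

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")

# bert-base only has 512 position embeddings, so MAX_LEN must not exceed this
print(model.config.max_position_embeddings)

# every input id must stay below the vocabulary size of the checkpoint
print(tokenizer.vocab_size)
enc = tokenizer.encode_plus("some review text", max_length=512, truncation=True)
print(max(enc["input_ids"]))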
How can this model be trained for a multi-label classification problem?
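This is also not covered by the repo as-is; a hedged sketch of the standard multi-label recipe: one independent logit per label, BCEWithLogitsLoss against a float multi-hot target, and a sigmoid threshold at inference (the number of labels and the threshold are assumptions).

import torch
import torch.nn as nn

NUM_LABELS = 6  # hypothetical number of labels

# head: one logit per label, no softmax
head = nn.Linear(768, NUM_LABELS)

# loss: BCEWithLogitsLoss with a float 0/1 vector per sample
loss_fn = nn.BCEWithLogitsLoss()
logits = torch.randn(4, NUM_LABELS)                     # stand-in for model outputs
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()  # stand-in for multi-hot labels
loss = loss_fn(logits, targets)

# inference: sigmoid + per-label threshold instead of argmax
preds = (torch.sigmoid(logits) > 0.5).int()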
I just tried to run your code and found an error about an unknown keyword argument:
"pad_to_max_len" is not recognized.
import torch
from torch.utils.data import Dataset

import config


class CustomDataset(Dataset):
    def __init__(self, review, target):
        super(CustomDataset, self).__init__()
        self.review = review
        self.target = target
        self.tokenizer = config.TOKENIZER
        self.max_len = config.MAX_LEN
        # Tokenize everything up front; padding=True pads to the longest sequence.
        self.train_encodings = self.tokenizer(review, truncation=True, padding=True)

    def __len__(self):
        return len(self.review)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.train_encodings.items()}
        item["labels"] = torch.tensor(self.target[idx])
        return item
Just have a look at this, it can be helpful.
While writing the code for multilingual toxic comment classification I am getting errors.
import config
import dataset
import engine
import torch
import pandas as pd
import torch.nn as nn
import numpy as np

from model import BERTBaseUncased
from sklearn import model_selection
from sklearn import metrics
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup


def run():
    df1 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-toxic-comment-train.csv", usecols=["comment_text", "toxic"])
    df2 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-unintended-bias-train.csv", usecols=["comment_text", "toxic"])
    df_train = pd.concat([df1, df2], axis=0).reset_index(drop=True)

    df_valid = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\validation.csv")

    train_dataset = dataset.BERTDataset(
        comment_text=df_train.comment_text.values,
        target=df_train.toxic.values
    )

    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=config.TRAIN_BATCH_SIZE,
        num_workers=4
    )

    valid_dataset = dataset.BERTDataset(
        comment_text=df_valid.comment_text.values,
        target=df_valid.toxic.values
    )

    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=config.VALID_BATCH_SIZE,
        num_workers=1
    )

    device = torch.device(config.DEVICE)
    model = BERTBaseUncased()
    model.to(device)

    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {
            "params": [
                p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.001,
        },
        {
            "params": [
                p for n, p in param_optimizer if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]

    num_train_steps = int(len(df_train) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
    optimizer = AdamW(optimizer_parameters, lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_train_steps
    )

    best_accuracy = 0
    for epoch in range(config.EPOCHS):
        engine.train_fn(train_data_loader, model, optimizer, device, scheduler)
        outputs, targets = engine.eval_fn(valid_data_loader, model, device)
        targets = np.array(targets) >= 0.5
        accuracy = metrics.roc_auc_score(targets, outputs)
        print(f"AUC Score = {accuracy}")
        if accuracy > best_accuracy:
            torch.save(model.state_dict(), config.MODEL_PATH)
            best_accuracy = accuracy


if __name__ == "__main__":
    run()
error message:
PS C:\Users\saura\Desktop\tcc\src> python train.py
Traceback (most recent call last):
File "train.py", line 86, in <module>
run()
File "train.py", line 46, in run
device = torch.device(config.DEVICE)
AttributeError: module 'config' has no attribute 'DEVICE'
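The traceback says what is missing: config.py has no DEVICE constant, while train.py reads config.DEVICE. A minimal sketch of the missing entry, assuming config.py already holds the other constants referenced (TRAIN_BATCH_SIZE, EPOCHS, MODEL_PATH, ...):

# config.py
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"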
For some reason I am unable to iterate through the PyTorch DataLoader. It could be something I am missing, or the DataLoader has a bug.
import transformers
from sklearn import model_selection
import torch
import pandas as pd

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=True)
max_len = 512
train_batch_size = 8


class BERTDataset:
    """
    This class takes reviews and targets as arguments
    - Split the reviews and tokenizes
    """

    def __init__(self, review, target):
        self.review = review
        self.target = target
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.review)

    def __getitem__(self, item):
        review = str(self.review[item])
        review = " ".join(review.split())

        tokenized_inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding=True,
            truncation=True
        )

        ids = tokenized_inputs["input_ids"]
        mask = tokenized_inputs["attention_mask"]
        token_type_ids = tokenized_inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.target[item], dtype=torch.float),
        }


dfx = pd.read_csv(training_file).fillna("none")
dfx['sentiment'] = dfx['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

df_train, df_valid = model_selection.train_test_split(
    dfx,
    test_size=0.1,
    random_state=42,
    stratify=dfx['sentiment'].values
)

# reset indices
df_train = df_train.reset_index(drop=True)

# get ids, tokens, masks and targets
train_dataset = BERTDataset(review=df_train['review'], target=df_train['sentiment'])

# load into pytorch dataset object
# DataLoader inputs tensor dataset of Inputs and targets
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=train_batch_size, num_workers=0)

# Iterating to the Data loader
train_iter = iter(train_data_loader)
print(type(train_iter))
review, labels = train_iter.next()
When iterating through the dataloader
the following error comes up.
RuntimeError Traceback (most recent call last)
<ipython-input-19-c99d0829d5d9> in <module>()
2 print(type(train_iter))
3
----> 4 images, labels = train_iter.next()
5 frames
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
53 storage = elem.storage()._new_shared(numel)
54 out = elem.new(storage)
---> 55 return torch.stack(batch, 0, out=out)
56 elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
57 and elem_type.__name__ != 'string_':
RuntimeError: stack expects each tensor to be equal size, but got [486] at entry 0 and [211] at entry 1
Appreciate your inputs.
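The error message points at the cause: default_collate tries to stack per-sample tensors of different lengths (486 vs 211 tokens). With padding=True on a single-sentence encode_plus call there is nothing to pad against, so each review keeps its own length. Padding every sample to max_len should let the DataLoader batch them; a minimal sketch of the changed call in __getitem__:

tokenized_inputs = self.tokenizer.encode_plus(
    review,
    None,
    add_special_tokens=True,
    max_length=self.max_len,
    padding="max_length",   # fixed length so default_collate can stack the batch
    truncation=True,
)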