abhishekkrthakur / bert-sentiment
License: MIT License
Would it be possible for you to push a requirements.txt file to your repo, since Docker is not working?
The requirements.txt file is missing, and some of the functionality does not work on torch==1.11.0.
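For reference, a minimal sketch of what such a requirements.txt might contain, based only on the imports used in the code (torch, transformers, pandas, scikit-learn, numpy, flask for app.py); the exact pins are assumptions, not the versions the author actually used:

# requirements.txt -- hypothetical pins, adjust to the versions the code was written against
torch==1.5.0
transformers==2.11.0
pandas==1.0.5
scikit-learn==0.23.1
numpy==1.18.5
flask==1.1.2
tqdm==4.46.1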
In train.py, line 13:
from transformers import get_linear_schedule_with_warmup
it would be better to change it to
from transformers import WarmupLinearSchedule as get_linear_schedule_with_warmup
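Note that the two schedulers are probably not drop-in replacements: in the older transformers releases that still ship WarmupLinearSchedule, it was constructed with warmup_steps/t_total arguments rather than the keyword arguments used in train.py. A hedged sketch of a version-tolerant import (the fallback signature is an assumption about those older releases):

try:
    from transformers import get_linear_schedule_with_warmup
except ImportError:
    # Assumption: on older transformers (~2.x), only WarmupLinearSchedule exists
    # and it takes (optimizer, warmup_steps, t_total).
    from transformers import WarmupLinearSchedule

    def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
        return WarmupLinearSchedule(
            optimizer, warmup_steps=num_warmup_steps, t_total=num_training_steps
        )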
Follow-up to issue #1.
@abhishekkrthakur: Can you give any leads on how to load a DataParallel GPU model on the CPU?
As per the PyTorch docs I tried the following, but it still raises the above RuntimeError:
device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))
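A common cause here, assuming the checkpoint was saved while the model was wrapped in nn.DataParallel: the saved state_dict keys carry a "module." prefix that the bare model does not expect. A minimal sketch of stripping the prefix before loading:

import torch

device = torch.device("cpu")
state_dict = torch.load(PATH, map_location=device)

# DataParallel saves keys as "module.<param>"; the plain model expects "<param>"
state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}

model = TheModelClass(*args, **kwargs)
model.load_state_dict(state_dict)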
Hello,
I've tried your implementation of the BERT model to predict the sentiment of a sentence.
When training the model on Google TPUs, I hit this issue on this line:
review = str(self.review[item])
I've spent a lot of time on it, but I can't figure out why it throws an index out of bounds.
I started training with a small dataset of 10,000 rows.
Stack trace:
bi = 0, loss = 0.6744452714920044
bi = 10, loss = 0.6480506658554077
bi = 20, loss = 0.6070395708084106
bi = 30, loss = 0.3570273816585541
bi = 40, loss = 0.322771281003952
bi = 50, loss = 0.42349475622177124
bi = 60, loss = 0.2848508358001709
bi = 70, loss = 0.2577969431877136
bi = 80, loss = 0.4233595132827759
bi = 90, loss = 0.600457489490509
bi = 100, loss = 0.22680382430553436
bi = 110, loss = 0.09512724727392197
bi = 120, loss = 0.14158135652542114
bi = 130, loss = 0.653974175453186
Exception in thread Thread-6:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 134, in _loader_worker
_, data = next(data_iter)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "", line 12, in getitem
review = str(self.review[item])
IndexError: index 4244 is out of bounds for axis 0 with size 979
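A likely cause, assuming self.review is a pandas Series produced by train_test_split: the split keeps the original row labels, so self.review[item] does a label-based lookup and misses. Resetting the index after the split, or passing plain numpy arrays into the dataset, is the usual fix; a minimal sketch using the column names from this repo's train.py:

# Reset indices after the split so integer indexing in __getitem__ lines up.
df_train = df_train.reset_index(drop=True)
df_valid = df_valid.reset_index(drop=True)

# Or hand the dataset plain numpy arrays instead of pandas Series:
train_dataset = dataset.BERTDataset(
    review=df_train.review.values,
    target=df_train.sentiment.values,
)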
TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str
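This TypeError usually shows up with newer transformers releases, assuming the model's forward unpacks the BERT output as a tuple (e.g. _, o2 = self.bert(...)) and then applies dropout: transformers switched to returning a ModelOutput by default, so tuple-unpacking yields the string field names and the dropout layer receives a str. A hedged sketch of two possible fixes inside the model's forward (attribute names are illustrative, not necessarily the repo's):

# Option 1: ask for the old tuple return format (newer transformers only).
_, pooled = self.bert(
    ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
)

# Option 2: keep the ModelOutput and read the pooled output explicitly.
out = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)
pooled = out.pooler_output

x = self.dropout(pooled)  # hypothetical dropout layer name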
Getting the issue below while loading the model on my local system. The model was trained on Colab.
Traceback (most recent call last):
File "app.py", line 74, in <module>
MODEL.load_state_dict(torch.load(config.MODEL_PATH, map_location=torch.device('cpu'))) #New created
File "C:\Users\Vijender\Downloads\bert_sentiment\lib\site-packages\torch\nn\modules\module.py", line 830, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERTBaseUncased:
Missing key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias",
How can the model be trained on multi-class data?
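The repo is written for binary sentiment, so this needs a couple of changes; a minimal sketch of the usual multi-class setup (class and attribute names are illustrative, and the number of classes is hypothetical): widen the output layer to one logit per class and switch the loss to CrossEntropyLoss with integer targets.

import torch.nn as nn
import transformers

NUM_CLASSES = 5  # hypothetical number of classes


class BERTMultiClass(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained("bert-base-uncased")
        self.bert_drop = nn.Dropout(0.3)
        # one logit per class instead of a single sigmoid output
        self.out = nn.Linear(768, NUM_CLASSES)

    def forward(self, ids, mask, token_type_ids):
        # return_dict=False keeps the old tuple output; drop it on transformers 2.x
        _, pooled = self.bert(
            ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
        )
        return self.out(self.bert_drop(pooled))


def loss_fn(outputs, targets):
    # CrossEntropyLoss expects raw logits and integer class labels (dtype long)
    return nn.CrossEntropyLoss()(outputs, targets.long())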
Hi, I'm getting the error below when I run the train script. Can you help me?
File "C:\Users\thanisb\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py", line 1724, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
[Finished in 33.4s with exit code 1]
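"index out of range in self" inside the embedding layer usually means a token or position id that exceeds the embedding table, for example MAX_LEN set above 512 for bert-base, or a tokenizer that does not match the pretrained weights. A small diagnostic sketch (the model and tokenizer names are assumptions; use whatever config.BERT_PATH points at):

import transformers

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")

# bert-base only has 512 position embeddings, so MAX_LEN must not exceed this
print(model.config.max_position_embeddings)

# every input id must stay below the vocabulary size of the checkpoint
print(tokenizer.vocab_size)
enc = tokenizer.encode_plus("some review text", max_length=512, truncation=True)
print(max(enc["input_ids"]))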
How can this model be trained for a multi-label classification problem?
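This is also not covered by the repo as-is; a hedged sketch of the standard multi-label recipe: one independent logit per label, BCEWithLogitsLoss against a float multi-hot target, and a sigmoid threshold at inference (the number of labels and the threshold are assumptions).

import torch
import torch.nn as nn

NUM_LABELS = 6  # hypothetical number of labels

# head: one logit per label, no softmax
head = nn.Linear(768, NUM_LABELS)

# loss: BCEWithLogitsLoss with a float 0/1 vector per sample
loss_fn = nn.BCEWithLogitsLoss()
logits = torch.randn(4, NUM_LABELS)                     # stand-in for model outputs
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()  # stand-in for multi-hot labels
loss = loss_fn(logits, targets)

# inference: sigmoid + per-label threshold instead of argmax
preds = (torch.sigmoid(logits) > 0.5).int()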
I just tried to run your code and found an error about an unknown keyword argument:
"pad_to_max_len" is not recognized.
import torch
from torch.utils.data import Dataset

import config


class CustomDataset(Dataset):
    def __init__(self, review, target):
        super(CustomDataset, self).__init__()
        self.review = review
        self.target = target
        self.tokenizer = config.TOKENIZER
        self.max_len = config.MAX_LEN
        # Tokenize everything up front; padding=True pads to the longest sequence.
        self.train_encodings = self.tokenizer(review, truncation=True, padding=True)

    def __len__(self):
        return len(self.review)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.train_encodings.items()}
        item["labels"] = torch.tensor(self.target[idx])
        return item
Just have a look at this, it can be helpful.
While writing the code for multilingual toxic comment classification I am getting errors.
import config
import dataset
import engine
import torch
import pandas as pd
import torch.nn as nn
import numpy as np

from model import BERTBaseUncased
from sklearn import model_selection
from sklearn import metrics
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup


def run():
    df1 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-toxic-comment-train.csv", usecols=["comment_text", "toxic"])
    df2 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-unintended-bias-train.csv", usecols=["comment_text", "toxic"])
    df_train = pd.concat([df1, df2], axis=0).reset_index(drop=True)

    df_valid = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\validation.csv")

    train_dataset = dataset.BERTDataset(
        comment_text=df_train.comment_text.values,
        target=df_train.toxic.values
    )

    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=config.TRAIN_BATCH_SIZE,
        num_workers=4
    )

    valid_dataset = dataset.BERTDataset(
        comment_text=df_valid.comment_text.values,
        target=df_valid.toxic.values
    )

    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=config.VALID_BATCH_SIZE,
        num_workers=1
    )

    device = torch.device(config.DEVICE)
    model = BERTBaseUncased()
    model.to(device)

    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {
            "params": [
                p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.001,
        },
        {
            "params": [
                p for n, p in param_optimizer if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]

    num_train_steps = int(len(df_train) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
    optimizer = AdamW(optimizer_parameters, lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_train_steps
    )

    best_accuracy = 0
    for epoch in range(config.EPOCHS):
        engine.train_fn(train_data_loader, model, optimizer, device, scheduler)
        outputs, targets = engine.eval_fn(valid_data_loader, model, device)
        targets = np.array(targets) >= 0.5
        accuracy = metrics.roc_auc_score(targets, outputs)
        print(f"AUC Score = {accuracy}")
        if accuracy > best_accuracy:
            torch.save(model.state_dict(), config.MODEL_PATH)
            best_accuracy = accuracy


if __name__ == "__main__":
    run()
error message:
PS C:\Users\saura\Desktop\tcc\src> python train.py
Traceback (most recent call last):
File "train.py", line 86, in <module>
run()
File "train.py", line 46, in run
device = torch.device(config.DEVICE)
AttributeError: module 'config' has no attribute 'DEVICE'
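The traceback says what is missing: config.py has no DEVICE constant, while train.py reads config.DEVICE. A minimal sketch of the missing entry, assuming config.py already holds the other constants referenced (TRAIN_BATCH_SIZE, EPOCHS, MODEL_PATH, ...):

# config.py
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"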
For some reason I am unable to iterate through the PyTorch DataLoader. It could be something I am missing, or the DataLoader has a bug.
import transformers
from sklearn import model_selection
import torch
import pandas as pd

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=True)
max_len = 512
train_batch_size = 8


class BERTDataset:
    """
    This class takes reviews and targets as arguments
    - Split the reviews and tokenizes
    """

    def __init__(self, review, target):
        self.review = review
        self.target = target
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.review)

    def __getitem__(self, item):
        review = str(self.review[item])
        review = " ".join(review.split())

        tokenized_inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding=True,
            truncation=True
        )

        ids = tokenized_inputs["input_ids"]
        mask = tokenized_inputs["attention_mask"]
        token_type_ids = tokenized_inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.target[item], dtype=torch.float),
        }


dfx = pd.read_csv(training_file).fillna("none")
dfx['sentiment'] = dfx['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

df_train, df_valid = model_selection.train_test_split(
    dfx,
    test_size=0.1,
    random_state=42,
    stratify=dfx['sentiment'].values
)

# reset indices
df_train = df_train.reset_index(drop=True)

# get ids, tokens, masks and targets
train_dataset = BERTDataset(review=df_train['review'], target=df_train['sentiment'])

# load into pytorch dataset object
# DataLoader inputs tensor dataset of Inputs and targets
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=train_batch_size, num_workers=0)

# Iterating to the Data loader
train_iter = iter(train_data_loader)
print(type(train_iter))
review, labels = train_iter.next()
When iterating through the dataloader
the following error comes up.
RuntimeError Traceback (most recent call last)
<ipython-input-19-c99d0829d5d9> in <module>()
2 print(type(train_iter))
3
----> 4 images, labels = train_iter.next()
5 frames
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
53 storage = elem.storage()._new_shared(numel)
54 out = elem.new(storage)
---> 55 return torch.stack(batch, 0, out=out)
56 elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
57 and elem_type.__name__ != 'string_':
RuntimeError: stack expects each tensor to be equal size, but got [486] at entry 0 and [211] at entry 1
Appreciate your inputs.
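The error message points at the cause: default_collate tries to stack per-sample tensors of different lengths (486 vs 211 tokens). With padding=True on a single-sentence encode_plus call there is nothing to pad against, so each review keeps its own length. Padding every sample to max_len should let the DataLoader batch them; a minimal sketch of the changed call in __getitem__:

tokenized_inputs = self.tokenizer.encode_plus(
    review,
    None,
    add_special_tokens=True,
    max_length=self.max_len,
    padding="max_length",   # fixed length so default_collate can stack the batch
    truncation=True,
)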