
transformers-tutorials's Introduction

PyTorch Transformers Tutorials


Introduction

The field of NLP was revolutionized in 2018 by the introduction of BERT and its Transformer friends (RoBERTa, XLM, etc.).

These novel transformer-based neural network architectures, and new ways of training neural networks on natural language data, brought transfer learning to NLP. Transfer learning had already been delivering state-of-the-art results in the Computer Vision domain for a few years, and the introduction of transformer models brought about the same paradigm change in NLP.

Companies like Google and Facebook trained their neural networks on large swathes of natural language data to grasp the intricacies of language, thereby producing a language model. These models were then fine-tuned on domain-specific datasets to achieve state-of-the-art results for specific problem statements, and the trained models were published to the open-source community. Community members were then able to fine-tune them for their own use cases.

Hugging Face made it easier for the community to access and fine-tune these models through their Python package: Transformers.

Motivation

Despite these amazing technological advancements, applying these solutions to business problems is still a challenge, given the niche knowledge required to understand and apply these methods to specific problem statements. Hence, in the following tutorials I will demonstrate how a user can leverage these technologies, along with some other Python tools, to fine-tune these language models for specific types of tasks.

Before I proceed, I would like to mention the following groups for the fantastic work they are doing and sharing, which has made these notebooks and tutorials possible:

Please review these amazing sources of information and subscribe to their channels/sources.

The problem statements that I will be working with are:

| Notebook | Github Link | Colab Link | Kaggle Kernel |
| --- | --- | --- | --- |
| Text Classification: Multi-Class | Github | Open In Colab | Kaggle |
| Text Classification: Multi-Label | Github | Open In Colab | Kaggle |
| Sentiment Classification with Experiment Tracking in WandB! | Github | Open In Colab | |
| Named Entity Recognition: with TPU processing! | Github | Open In Colab | Kaggle |
| Question Answering | | | |
| Summary Writing: with Experiment Tracking in WandB! | Github | Open In Colab | Kaggle |

Directory Structure

  1. data: This folder contains all the toy data used for fine-tuning.
  2. utils: This folder will contain any miscellaneous scripts used to prepare for fine-tuning.
  3. models: Folder to save all the artifacts post fine-tuning.

Further Watching/Reading

I will try to cover the practical and implementation aspects of fine-tuning these language models on various NLP tasks. You can deepen your knowledge of this topic by reading/watching the following resources.

transformers-tutorials's People

Contributors

aalok-sathe, abhimishra91, atherfawaz, blessontomjoseph, darthrabbit, davidalami


transformers-tutorials's Issues

Where is train.csv?

In the data folder I see only the README, but there is no train.csv file.

Running Inference

Hi, I am using your multi-label classification notebook as a reference for my learning. I was wondering what would be an efficient way to run inference on these trained models. Any reference to code, if available, would be appreciated.

Thanks
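
Not an official answer, but a minimal inference sketch under a few assumptions: the notebook's model class (called BERTClass here) is defined or imported in the script, its forward takes (ids, mask, token_type_ids), and the fine-tuned weights were saved as a state_dict. File names and max_length are illustrative only.

    import torch
    from transformers import BertTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BERTClass()                                   # class from the notebook (assumption)
    model.load_state_dict(torch.load("pytorch_model.bin", map_location=device))
    model.to(device)
    model.eval()                                          # turn off dropout for inference

    text = "Example comment to classify"
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    padding="max_length", max_length=200)
    with torch.no_grad():
        logits = model(enc["input_ids"].to(device),
                       enc["attention_mask"].to(device),
                       enc["token_type_ids"].to(device))

    # Multi-label: apply a sigmoid and threshold each label independently.
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int()
    print(preds)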

transformers_multi_label_classification.ipynb issue

https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb

The BertClass forward function is causing the following error message:

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

It looks like the notebook contains version 3 code. How can it be migrated to version 4?

https://stackoverflow.com/questions/65082243/dropout-argument-input-position-1-must-be-tensor-not-str-when-using-bert

https://huggingface.co/docs/transformers/migration
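
One hedged way to migrate (a sketch that only loosely follows the notebook's layer names and label count, so treat the names and sizes here as assumptions): ask the underlying model for tuple outputs again with return_dict=False. In transformers v4 the model returns a ModelOutput object by default, and tuple-unpacking it yields its key names (strings), which is exactly what the dropout layer then receives.

    import torch
    from transformers import BertModel

    class BertClass(torch.nn.Module):
        # Sketch of a v4-compatible model; layer names/sizes follow the notebook loosely.
        def __init__(self, num_labels=6):
            super().__init__()
            self.l1 = BertModel.from_pretrained("bert-base-uncased")
            self.l2 = torch.nn.Dropout(0.3)
            self.l3 = torch.nn.Linear(768, num_labels)

        def forward(self, ids, mask, token_type_ids):
            # return_dict=False restores the old tuple-style return values
            _, pooled_output = self.l1(
                ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
            )
            output_2 = self.l2(pooled_output)   # dropout now receives a Tensor again
            return self.l3(output_2)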

Fine-tuning MarianMT for Machine Translation tasks

Hi, I am new to the field of NLP. I want to fine-tune the MarianMT pretrained models for German-to-English translation, but I am not sure how to achieve that. I checked MarianMT; its model architecture configuration is based on BART, but there is no information on how to fine-tune the model or train it from scratch.
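
Not from this repo, but a rough fine-tuning sketch for German-to-English with MarianMT. The exact tokenizer call depends on the transformers version (text_target= is the newer API; older releases used prepare_seq2seq_batch), and the toy data below is purely illustrative.

    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = "Helsinki-NLP/opus-mt-de-en"

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    src_texts = ["Das ist ein Beispiel."]      # toy data for illustration
    tgt_texts = ["This is an example."]

    batch = tokenizer(
        src_texts, text_target=tgt_texts,
        padding=True, truncation=True, return_tensors="pt"
    ).to(device)

    model.train()
    outputs = model(**batch)    # labels are in the batch; the loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()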

Error when batch size is not a factor of the number of samples in transformers_multiclass_classification.ipynb

re: transformers_multiclass_classification.ipynb

Thank you for this helpful tutorial!

It seems to work well when the batch size (either for training or validation) is a factor of the number of examples, but otherwise I get the following error message:
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

For example, with batch_size = 4, 3172 samples work, but 3171 or 3173 samples return an error.
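
Not a root-cause fix, but a common workaround is to drop the trailing partial batch so every batch has the same size; drop_last is a standard DataLoader argument. The loader and dataset names below only loosely follow the notebook and are assumptions.

    from torch.utils.data import DataLoader

    training_loader = DataLoader(
        training_set,           # the notebook's Dataset instance (assumption)
        batch_size=4,
        shuffle=True,
        num_workers=0,
        drop_last=True,         # skip the final incomplete batch
    )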

How to reload the model in a different machine with same hardware

Hi,

Firstly, great tutorial. The multiclass model tutorial really helped me understand how to leverage BERT. I have my model saved as shown post-training. However, when I load it on another machine I see a drop in accuracy, and at the same time it does not always work.

Could you provide a proper code snippet to load the models?

I do have the ability to define the class in the new environment, so that is not an issue.
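
A hedged save/load sketch (saving the state_dict rather than the whole pickled object); the class name and file name are assumptions based on the notebook. The model.eval() call matters: with dropout still active, inference results are noisy and accuracy drops, which matches the symptom described.

    import torch

    # On the training machine:
    torch.save(model.state_dict(), "pytorch_distilbert_news.bin")

    # On the new machine (the class definition must be available there as well):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = DistillBERTClass()                              # notebook's class (assumption)
    model.load_state_dict(torch.load("pytorch_distilbert_news.bin", map_location=device))
    model.to(device)
    model.eval()   # disables dropout; forgetting this gives noisy, lower accuracy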

Outputs are different when running the notebooks

I am running this notebook transformers_multiclass_classification.ipynb and I do not get the same outputs.

For example, the training loss goes straight to 0 (very close to 0) in this notebook. Unfortunately I cannot reproduce that.

Also, it is stated there that the validation test accuracy is 99.99%.

Could you just double check if the printed outputs are correct? Thanks
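
Exact numbers are hard to reproduce across machines and GPU types, but seeding everything up front usually brings runs much closer. A minimal sketch (standard PyTorch/NumPy seeding, not specific to this repo):

    import random
    import numpy as np
    import torch

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True   # trade some speed for repeatability
    torch.backends.cudnn.benchmark = False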

NER notebook issue

Hi,

First, I want to thank you for this great job.
I tried to execute the code of the NER-with-BERT notebook, but an error is generated when calculating the F1 score in the validation step.

[Screenshot of the error attached: Screen Shot 2020-10-28 at 9:12:02 AM]

Could someone help, please?
Thank you.

Code Explanation for transformers_multiclass_classification.ipynb

Thanks for sharing the excellent tutorial! However, I have some questions regarding the classifier:

The final linear layer output of the classifier should be of shape (batch_size, num_labels); however, according to the code, the final output shape is (batch_size, seq_len), and this is what is used to calculate the cross-entropy loss. I am quite confused.
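
For reference, a classifier head that does end in (batch_size, num_labels) pools the [CLS] token before the final linear layer. This is only an illustrative sketch; the names, label count, and layer sizes are assumptions, not the notebook's exact code.

    import torch
    from transformers import DistilBertModel

    class MulticlassHead(torch.nn.Module):
        # Illustrative only; layer names and sizes are assumptions.
        def __init__(self, num_labels=4):
            super().__init__()
            self.distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")
            self.dropout = torch.nn.Dropout(0.3)
            self.classifier = torch.nn.Linear(768, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden_state = self.distilbert(input_ids=input_ids,
                                           attention_mask=attention_mask)[0]
            cls_embedding = hidden_state[:, 0]        # (batch_size, hidden_size): [CLS] token only
            logits = self.classifier(self.dropout(cls_embedding))
            return logits                             # (batch_size, num_labels)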

Colab Error :: RuntimeError: CUDA error: device-side assert triggered

Getting the following error in Colab:
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py:1944: FutureWarning: The pad_to_max_length argument is deprecated and will be removed in a future version, use padding=True or padding='longest' to pad to the longest sequence in the batch, or use padding='max_length' to pad to a max length. In this case, you can give a specific length with max_length (e.g. max_length=45) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
FutureWarning,
Training Loss per 100 steps: 1.3555612564086914
Training Accuracy per 100 steps: 25.0

RuntimeError Traceback (most recent call last)
in ()
1 for epoch in range(EPOCHS):
----> 2 train(epoch)

in train(epoch)
15 outputs = model(ids, mask)
16 loss = loss_function(outputs, targets)
---> 17 tr_loss += loss.item()
18 big_val, big_idx = torch.max(outputs.data, dim=1)
19 n_correct += calcuate_accu(big_idx, targets)

RuntimeError: CUDA error: device-side assert triggered
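
A device-side assert raised around CrossEntropyLoss is very often a target index outside [0, num_classes); running the cell on CPU, or with CUDA_LAUNCH_BLOCKING=1, gives a readable stack trace. A quick sanity check, with key and loader names that only loosely follow the notebook (an assumption):

    NUM_CLASSES = 4   # set to the number of output units of the final linear layer

    for batch in training_loader:
        targets = batch["targets"]
        if targets.min() < 0 or targets.max() >= NUM_CLASSES:
            print("Label out of range:", targets.min().item(), targets.max().item())
            break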

dataset split is invalid (because of reset_index)

In the main() function, you have the following:

train_dataset=df.sample(frac=train_size, random_state = config.SEED).reset_index(drop=True)

val_dataset=df.drop(train_dataset.index).reset_index(drop=True)

But since you reset the index, the rows are then dropped based on the new index, so your train and eval datasets overlap. You should reset the index only after doing the drop:

    train_dataset = df.sample(frac=train_size, random_state=config.SEED)   # <-- reset_index(drop=True) removed here
    val_dataset = df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

Fine tuning DistilBERT model OSError: Unable to load weights from pytorch checkpoint file.

Hello! Thanks for the excellent tutorial on the awesome DistilBERT model. I learned it and reproduced it successfully. I then tried to load and run predictions with the model "pytorch_distilbert_news.bin" and tokenizer "vocab_distilbert_news.bin" (I renamed the model to "pytorch_model.bin" and the tokenizer vocab to "vocab.txt"). I wrote and tested the script below, but got an error and couldn't find a solution.

# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

test ="He'll give us really healthy competition with our goalkeepers and we wish him the very best of luck."

tokenizer = transformers.DistilBertTokenizer.from_pretrained('model/')
model= transformers.DistilBertModel.from_pretrained('model/')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/hgface/lib/python3.7/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    658         try:
--> 659             state_dict = torch.load(resolved_archive_file, map_location="cpu")
    660         except Exception:

~/anaconda3/envs/hgface/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    579             return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 580     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

~/anaconda3/envs/hgface/lib/python3.7/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    759     unpickler.persistent_load = persistent_load
--> 760     result = unpickler.load()

AttributeError: Can't get attribute 'DistillBERTClass' on <module '__main__'>

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-24-827530780d10> in <module>
----> 1 model= transformers.DistilBertModel.from_pretrained('model/')

~/anaconda3/envs/hgface/lib/python3.7/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    660         except Exception:
--> 662             "Unable to load weights from pytorch checkpoint file. "
    663             "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "

OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Please help. Thanks.
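
The inner AttributeError is the real clue: the .bin file appears to be the whole pickled DistillBERTClass object (saved with torch.save(model)), not a Hugging Face checkpoint, so DistilBertModel.from_pretrained cannot read it. A hedged way to load it back instead (the paths and the import line are illustrative, not from the repo):

    import torch
    from transformers import DistilBertTokenizer

    # The custom class must be defined or importable before unpickling, e.g.:
    # from train_script import DistillBERTClass   # hypothetical module name

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.load("model/pytorch_model.bin", map_location=device)   # full pickled model
    model.eval()

    tokenizer = DistilBertTokenizer.from_pretrained("model/")            # reads vocab.txt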

problem on loading .bin model for inference

Hi. I tried to load the model using torch.load('model.bin'), but this kind of error occurred:

_pickle.UnpicklingError: invalid load key, '['.

And how do I run inference?
Any help?

Tokenization issue in transformer NER

In your custom data loader:

class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        sentence = str(self.sentences[index])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        label = self.labels[index]
        label.extend([4]*200)
        label=label[:200]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'tags': torch.tensor(label, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

According to my understanding:
you have a sentence, say w1 w2 w3 w4, and its BIO labels are O B-class1 I-class1 O.
Once you encode the sentence, the tokenizer uses WordPiece and splits words into sub-words, making the sequence longer, and you then pad it to length 200. Say (truncating to length 10 for illustration) it becomes w1-a w1-b w2 w3-a w3-b w4 [PAD] [PAD] [PAD] [PAD], but your labels are O B-class1 I-class1 O 4 4 4 4 4 4. So you are now passing incorrect labels to your model.
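
One common way to fix this (a sketch using a fast tokenizer's word_ids(), not the notebook's own code): tokenize the pre-split words and expand the word-level labels onto the sub-word pieces, ignoring special tokens and padding. The label ids below are illustrative.

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    words = ["w1", "w2", "w3", "w4"]          # pre-split sentence
    word_labels = [0, 1, 2, 0]                # one label id per original word

    encoding = tokenizer(
        words, is_split_into_words=True,
        padding="max_length", truncation=True, max_length=200,
    )

    aligned_labels = []
    for word_idx in encoding.word_ids():
        if word_idx is None:                  # [CLS], [SEP], [PAD]
            aligned_labels.append(-100)       # ignored by CrossEntropyLoss
        else:
            aligned_labels.append(word_labels[word_idx])   # sub-words inherit the word's label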

T5 fine-tuning for summarization decoder_input_ids and labels

Hello @abhimishra91,

I was trying to implement the fine-tuning of T5 as explained in your notebook.
In addition to implementing the same structure as you, I have made some experiments with the HuggingFace Trainer class. The decoder_input_ids and labels parameters are not very clear to me. When you train the model, you do this:

y = data['target_ids'].to(device, dtype = torch.long)
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone().detach()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100

where y_ids is the decoder_input_ids. I don't understand why we need this preprocessing. May I kindly ask why you skip the last token of the target_ids, and why you replace the pads with -100 in the labels?
When I use the HuggingFace Trainer I need to tweak the __getitem__ function of the Dataset like this:

def __getitem__(self, idx):

    ...

    item['decoder_input_ids'] = y[:-1]
    lbl = y[:-1].clone()
    lbl[y[1:] == self.tokenizer.pad_token_id] = -100
    item['labels'] = lbl

    return item

Otherwise the loss does not decrease over time.

Thank you for your help!
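
For what it's worth, here is how that preprocessing can be read (a toy illustration of the lines quoted above, not code from the notebook): the one-token shift gives the decoder its ground-truth "previous" token at every step (i.e. teacher forcing) while the labels hold the "next" token it must predict, and the pads become -100 because that is the default ignore_index of torch.nn.CrossEntropyLoss, so padding positions contribute nothing to the loss.

    import torch

    y = torch.tensor([[0, 11, 12, 13, 1, 0]])   # toy ids; 0 = pad, 1 = </s> (T5's special ids)
    y_ids = y[:, :-1].contiguous()               # decoder inputs: ground-truth "previous" tokens
    lm_labels = y[:, 1:].clone()                 # targets: the "next" token at each step
    lm_labels[y[:, 1:] == 0] = -100              # mask pad positions out of the loss
    print(y_ids)        # tensor([[ 0, 11, 12, 13,  1]])
    print(lm_labels)    # tensor([[ 11, 12, 13, 1, -100]])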

param of T5 model

In the train function, the call is outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, lm_labels=lm_labels). The lm_labels parameter should be labels in newer versions of transformers.
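
For reference, the call after the rename would look like this (only the keyword changes; the loss is still the first element of the outputs):

    outputs = model(input_ids=ids, attention_mask=mask,
                    decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]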

Training based on Teacher forcing technique

Hi,
Thank you for your code.
I have a question regarding the way the model is being trained.
In the paper it is mentioned that T5 is trained with the teacher forcing technique, in which, at each timestep of decoding, the input should come from the ground-truth data rather than the previously generated token. But in your code the model will generate the entire output by itself through the following lines:
outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
loss = outputs[0]
Is my assumption correct that you do not use the teacher forcing technique? Thanks.
