
transformers-tutorials's Introduction

PyTorch Transformers Tutorials


Introduction

The field of NLP was revolutionized in 2018 by the introduction of BERT and its Transformer friends (RoBERTa, XLM, etc.).

These novel transformer-based neural network architectures, and new ways of training neural networks on natural language data, brought transfer learning to NLP. Transfer learning had already been delivering state-of-the-art results in the Computer Vision domain for a few years, and the introduction of transformer models brought about the same paradigm change in NLP.

Companies like Google and Facebook trained their neural networks on large swathes of natural language data to grasp the intricacies of language, thereby producing a language model. These models were then fine-tuned on domain-specific datasets to achieve state-of-the-art results for specific problem statements, and the trained models were published to the open-source community. Community members were then able to fine-tune them for their own use cases.

Hugging Face made it easier for the community to access and fine-tune these models through their Python package: Transformers.

Motivation

Despite these amazing technological advancements, applying these solutions to business problems is still a challenge, given the niche knowledge required to understand and apply these methods to specific problem statements. Hence, in the following tutorials I will demonstrate how a user can leverage these technologies, along with some other Python tools, to fine-tune these language models for specific types of tasks.

Before I proceed, I would like to mention the following groups for the fantastic work they are doing and sharing, which has made these notebooks and tutorials possible:

Please review these amazing sources of information and subscribe to their channels/sources.

The problem statements that I will be working with are:

| Notebook | Github Link | Colab Link | Kaggle Kernel |
| --- | --- | --- | --- |
| Text Classification: Multi-Class | Github | Open In Colab | Kaggle |
| Text Classification: Multi-Label | Github | Open In Colab | Kaggle |
| Sentiment Classification with Experiment Tracking in WandB! | Github | Open In Colab | |
| Named Entity Recognition: with TPU processing! | Github | Open In Colab | Kaggle |
| Question Answering | | | |
| Summary Writing: with Experiment Tracking in WandB! | Github | Open In Colab | Kaggle |

Directory Structure

  1. data: This folder contains all the toy data used for fine-tuning.
  2. utils: This folder will contain any miscellaneous scripts used to prepare for fine-tuning.
  3. models: Folder to save all the artifacts post fine-tuning.

Further Watching/Reading

I will try to cover the practical and implementation aspects of fine-tuning these language models on various NLP tasks. You can deepen your knowledge of this topic by reading/watching the following resources.

transformers-tutorials's People

Contributors

aalok-sathe, abhimishra91, atherfawaz, blessontomjoseph, darthrabbit, davidalami


transformers-tutorials's Issues

Where is train.csv?

In the data folder I see only the README, but there is no train.csv file.

Running Inference

Hi, I am using your multi-label classification notebook as a reference for my learning. I was wondering what would be an efficient way to run inference on these trained models. Any reference to code, if available, would be appreciated.

Thanks
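
Not an official answer, but a minimal inference sketch under a few assumptions: the notebook's model class (called BERTClass here) is defined or imported in the script, its forward takes (ids, mask, token_type_ids), and the fine-tuned weights were saved as a state_dict. File names and max_length are illustrative only.

    import torch
    from transformers import BertTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BERTClass()                                   # class from the notebook (assumption)
    model.load_state_dict(torch.load("pytorch_model.bin", map_location=device))
    model.to(device)
    model.eval()                                          # turn off dropout for inference

    text = "Example comment to classify"
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    padding="max_length", max_length=200)
    with torch.no_grad():
        logits = model(enc["input_ids"].to(device),
                       enc["attention_mask"].to(device),
                       enc["token_type_ids"].to(device))

    # Multi-label: apply a sigmoid and threshold each label independently.
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).int()
    print(preds)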

transformers_multi_label_classification.ipynb issue

https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb

The BertClass forward function is causing the following error message:

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

It looks like the notebook contains version 3 code. How can it be migrated to version 4?

https://stackoverflow.com/questions/65082243/dropout-argument-input-position-1-must-be-tensor-not-str-when-using-bert

https://huggingface.co/docs/transformers/migration
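
One hedged way to migrate (a sketch that only loosely follows the notebook's layer names and label count, so treat the names and sizes here as assumptions): ask the underlying model for tuple outputs again with return_dict=False. In transformers v4 the model returns a ModelOutput object by default, and tuple-unpacking it yields its key names (strings), which is exactly what the dropout layer then receives.

    import torch
    from transformers import BertModel

    class BertClass(torch.nn.Module):
        # Sketch of a v4-compatible model; layer names/sizes follow the notebook loosely.
        def __init__(self, num_labels=6):
            super().__init__()
            self.l1 = BertModel.from_pretrained("bert-base-uncased")
            self.l2 = torch.nn.Dropout(0.3)
            self.l3 = torch.nn.Linear(768, num_labels)

        def forward(self, ids, mask, token_type_ids):
            # return_dict=False restores the old tuple-style return values
            _, pooled_output = self.l1(
                ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False
            )
            output_2 = self.l2(pooled_output)   # dropout now receives a Tensor again
            return self.l3(output_2)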

Fine-tuning MarianMT for Machine Translation tasks

Hi, I am new to the field of NLP. I want to fine-tune the MarianMT pretrained models for German-to-English translation, but I am not sure how to achieve that. I checked MarianMT; its model architecture configuration is based on BART, but there is no information on how to fine-tune the model or train it from scratch.
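
Not from this repo, but a rough fine-tuning sketch for German-to-English with MarianMT. The exact tokenizer call depends on the transformers version (text_target= is the newer API; older releases used prepare_seq2seq_batch), and the toy data below is purely illustrative.

    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = "Helsinki-NLP/opus-mt-de-en"

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    src_texts = ["Das ist ein Beispiel."]      # toy data for illustration
    tgt_texts = ["This is an example."]

    batch = tokenizer(
        src_texts, text_target=tgt_texts,
        padding=True, truncation=True, return_tensors="pt"
    ).to(device)

    model.train()
    outputs = model(**batch)    # labels are in the batch; the loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()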

Error when batch size is not a factor of the number of samples in transformers_multiclass_classification.ipynb

re: transformers_multiclass_classification.ipynb

Thank you for this helpful tutorial!

It seems to work well when the batch size (either for training or validation) is a factor of the number of examples, but otherwise I get the following error message:
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

For example, with batch_size = 4, 3172 samples work, but 3171 or 3173 samples return an error.
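
Not a root-cause fix, but a common workaround is to drop the trailing partial batch so every batch has the same size; drop_last is a standard DataLoader argument. The loader and dataset names below only loosely follow the notebook and are assumptions.

    from torch.utils.data import DataLoader

    training_loader = DataLoader(
        training_set,           # the notebook's Dataset instance (assumption)
        batch_size=4,
        shuffle=True,
        num_workers=0,
        drop_last=True,         # skip the final incomplete batch
    )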

How to reload the model in a different machine with same hardware

Hi,

Firstly, great tutorial. The multiclass model tutorial really helped me understand how to leverage BERT. I have my model saved as shown post-training. However, when I load it on another machine I see a drop in accuracy, and at the same time it does not always work.

Could you provide a proper code snippet to load the models?

I do have the ability to define the class in the new environment, so that is not an issue.
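
A hedged save/load sketch (saving the state_dict rather than the whole pickled object); the class name and file name are assumptions based on the notebook. The model.eval() call matters: with dropout still active, inference results are noisy and accuracy drops, which matches the symptom described.

    import torch

    # On the training machine:
    torch.save(model.state_dict(), "pytorch_distilbert_news.bin")

    # On the new machine (the class definition must be available there as well):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = DistillBERTClass()                              # notebook's class (assumption)
    model.load_state_dict(torch.load("pytorch_distilbert_news.bin", map_location=device))
    model.to(device)
    model.eval()   # disables dropout; forgetting this gives noisy, lower accuracy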

Outputs are different when running the notebooks

I am running this notebook transformers_multiclass_classification.ipynb and I do not get the same outputs.

For example, the training loss goes straight to 0 (very close to 0) in this notebook. Unfortunately I cannot reproduce that.

Also, it is stated there that the validation test accuracy is 99.99%.

Could you just double check if the printed outputs are correct? Thanks
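
Exact numbers are hard to reproduce across machines and GPU types, but seeding everything up front usually brings runs much closer. A minimal sketch (standard PyTorch/NumPy seeding, not specific to this repo):

    import random
    import numpy as np
    import torch

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True   # trade some speed for repeatability
    torch.backends.cudnn.benchmark = False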

NER notebook issue

Hi,

First, I want to thank you for this great job.
I tried to execute the code of the NER-with-BERT notebook, but an error is generated when calculating the F1 score in the validation step.

[Screenshot of the error attached: Screen Shot 2020-10-28 at 9:12:02 AM]

Could someone help, please?
Thank you.

Code Explanation for transformers_multiclass_classification.ipynb

Thanks for sharing the excellent tutorial! However, I have some questions regarding the classifier:

The final linear layer output of the classifier should be of shape (batch_size, num_labels); however, according to the code, the final output shape is (batch_size, seq_len), and this is what is used to calculate the cross-entropy loss. I am quite confused.
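
For reference, a classifier head that does end in (batch_size, num_labels) pools the [CLS] token before the final linear layer. This is only an illustrative sketch; the names, label count, and layer sizes are assumptions, not the notebook's exact code.

    import torch
    from transformers import DistilBertModel

    class MulticlassHead(torch.nn.Module):
        # Illustrative only; layer names and sizes are assumptions.
        def __init__(self, num_labels=4):
            super().__init__()
            self.distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")
            self.dropout = torch.nn.Dropout(0.3)
            self.classifier = torch.nn.Linear(768, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden_state = self.distilbert(input_ids=input_ids,
                                           attention_mask=attention_mask)[0]
            cls_embedding = hidden_state[:, 0]        # (batch_size, hidden_size): [CLS] token only
            logits = self.classifier(self.dropout(cls_embedding))
            return logits                             # (batch_size, num_labels)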

Colab Error :: RuntimeError: CUDA error: device-side assert triggered

Getting the following error in Colab:
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py:1944: FutureWarning: The pad_to_max_length argument is deprecated and will be removed in a future version, use padding=True or padding='longest' to pad to the longest sequence in the batch, or use padding='max_length' to pad to a max length. In this case, you can give a specific length with max_length (e.g. max_length=45) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
FutureWarning,
Training Loss per 100 steps: 1.3555612564086914
Training Accuracy per 100 steps: 25.0

RuntimeError Traceback (most recent call last)
in ()
1 for epoch in range(EPOCHS):
----> 2 train(epoch)

in train(epoch)
15 outputs = model(ids, mask)
16 loss = loss_function(outputs, targets)
---> 17 tr_loss += loss.item()
18 big_val, big_idx = torch.max(outputs.data, dim=1)
19 n_correct += calcuate_accu(big_idx, targets)

RuntimeError: CUDA error: device-side assert triggered
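
A device-side assert raised around CrossEntropyLoss is very often a target index outside [0, num_classes); running the cell on CPU, or with CUDA_LAUNCH_BLOCKING=1, gives a readable stack trace. A quick sanity check, with key and loader names that only loosely follow the notebook (an assumption):

    NUM_CLASSES = 4   # set to the number of output units of the final linear layer

    for batch in training_loader:
        targets = batch["targets"]
        if targets.min() < 0 or targets.max() >= NUM_CLASSES:
            print("Label out of range:", targets.min().item(), targets.max().item())
            break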

dataset split is invalid (because of reset_index)

In the main() function, you have the following:

train_dataset=df.sample(frac=train_size, random_state = config.SEED).reset_index(drop=True)

val_dataset=df.drop(train_dataset.index).reset_index(drop=True)

But since you reset the index, the rows are then dropped based on the new index, so your train and eval datasets overlap. You should reset the index only after doing the drop:

    train_dataset = df.sample(frac=train_size, random_state=config.SEED)   # <-- reset_index(drop=True) removed here
    val_dataset = df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

Fine tuning DistilBERT model OSError: Unable to load weights from pytorch checkpoint file.

Hello! Thanks for the excellent tutorial on the awesome DistilBERT model. I learned it and reproduced it successfully. I then tried to load and run predictions with the model "pytorch_distilbert_news.bin" and tokenizer "vocab_distilbert_news.bin" (I renamed the model to "pytorch_model.bin" and the tokenizer vocab to "vocab.txt"). I wrote and tested the script below, but got an error and couldn't find a solution.

# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

test ="He'll give us really healthy competition with our goalkeepers and we wish him the very best of luck."

tokenizer = transformers.DistilBertTokenizer.from_pretrained('model/')
model= transformers.DistilBertModel.from_pretrained('model/')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/hgface/lib/python3.7/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    658         try:
--> 659             state_dict = torch.load(resolved_archive_file, map_location="cpu")
    660         except Exception:

~/anaconda3/envs/hgface/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    579             return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 580     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

~/anaconda3/envs/hgface/lib/python3.7/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    759     unpickler.persistent_load = persistent_load
--> 760     result = unpickler.load()

AttributeError: Can't get attribute 'DistillBERTClass' on <module '__main__'>

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-24-827530780d10> in <module>
----> 1 model= transformers.DistilBertModel.from_pretrained('model/')

~/anaconda3/envs/hgface/lib/python3.7/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    660         except Exception:
--> 662             "Unable to load weights from pytorch checkpoint file. "
    663             "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "

OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Please help. Thanks.
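
The inner AttributeError is the real clue: the .bin file appears to be the whole pickled DistillBERTClass object (saved with torch.save(model)), not a Hugging Face checkpoint, so DistilBertModel.from_pretrained cannot read it. A hedged way to load it back instead (the paths and the import line are illustrative, not from the repo):

    import torch
    from transformers import DistilBertTokenizer

    # The custom class must be defined or importable before unpickling, e.g.:
    # from train_script import DistillBERTClass   # hypothetical module name

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.load("model/pytorch_model.bin", map_location=device)   # full pickled model
    model.eval()

    tokenizer = DistilBertTokenizer.from_pretrained("model/")            # reads vocab.txt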

problem on loading .bin model for inference

Hi. I tried to load the model using torch.load('model.bin'), but this kind of error occurred:

_pickle.UnpicklingError: invalid load key, '['.

And how do I run inference?
Any help?

Tokenization issue in transformer NER

In your custom data loader:

class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        sentence = str(self.sentences[index])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        label = self.labels[index]
        label.extend([4]*200)
        label=label[:200]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'tags': torch.tensor(label, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

According to my understanding:
you have a sentence, say w1 w2 w3 w4, and its BIO labels are O B-class1 I-class1 O.
Once you encode the sentence, the tokenizer uses WordPiece and splits words into sub-words, making the sequence longer, and you then pad it to length 200. Say (truncating to length 10 for illustration) it becomes w1-a w1-b w2 w3-a w3-b w4 [PAD] [PAD] [PAD] [PAD], but your labels are O B-class1 I-class1 O 4 4 4 4 4 4. So you are now passing incorrect labels to your model.
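
One common way to fix this (a sketch using a fast tokenizer's word_ids(), not the notebook's own code): tokenize the pre-split words and expand the word-level labels onto the sub-word pieces, ignoring special tokens and padding. The label ids below are illustrative.

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    words = ["w1", "w2", "w3", "w4"]          # pre-split sentence
    word_labels = [0, 1, 2, 0]                # one label id per original word

    encoding = tokenizer(
        words, is_split_into_words=True,
        padding="max_length", truncation=True, max_length=200,
    )

    aligned_labels = []
    for word_idx in encoding.word_ids():
        if word_idx is None:                  # [CLS], [SEP], [PAD]
            aligned_labels.append(-100)       # ignored by CrossEntropyLoss
        else:
            aligned_labels.append(word_labels[word_idx])   # sub-words inherit the word's label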

T5 fine-tuning for summarization decoder_input_ids and labels

Hello @abhimishra91,

I was trying to implement the fine-tuning of T5 as explained in your notebook.
In addition to implementing the same structure as you, I have made some experiments with the HuggingFace Trainer class. The decoder_input_ids and labels parameters are not very clear to me. When you train the model, you do this:

y = data['target_ids'].to(device, dtype = torch.long)
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone().detach()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100

where y_ids is the decoder_input_ids. I don't understand why we need this preprocessing. May I kindly ask why you skip the last token of the target_ids, and why you replace the pads with -100 in the labels?
When I use the HuggingFace Trainer I need to tweak the __getitem__ function of the Dataset like this:

def __getitem__(self, idx):

    ...

    item['decoder_input_ids'] = y[:-1]
    lbl = y[:-1].clone()
    lbl[y[1:] == self.tokenizer.pad_token_id] = -100
    item['labels'] = lbl

    return item

Otherwise the loss does not decrease over time.

Thank you for your help!
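
For what it's worth, here is how that preprocessing can be read (a toy illustration of the lines quoted above, not code from the notebook): the one-token shift gives the decoder its ground-truth "previous" token at every step (i.e. teacher forcing) while the labels hold the "next" token it must predict, and the pads become -100 because that is the default ignore_index of torch.nn.CrossEntropyLoss, so padding positions contribute nothing to the loss.

    import torch

    y = torch.tensor([[0, 11, 12, 13, 1, 0]])   # toy ids; 0 = pad, 1 = </s> (T5's special ids)
    y_ids = y[:, :-1].contiguous()               # decoder inputs: ground-truth "previous" tokens
    lm_labels = y[:, 1:].clone()                 # targets: the "next" token at each step
    lm_labels[y[:, 1:] == 0] = -100              # mask pad positions out of the loss
    print(y_ids)        # tensor([[ 0, 11, 12, 13,  1]])
    print(lm_labels)    # tensor([[ 11, 12, 13, 1, -100]])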

param of T5 model

In the train function, the call is outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, lm_labels=lm_labels). The lm_labels parameter should be labels in newer versions of transformers.
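
For reference, the call after the rename would look like this (only the keyword changes; the loss is still the first element of the outputs):

    outputs = model(input_ids=ids, attention_mask=mask,
                    decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]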

Training based on Teacher forcing technique

Hi,
Thank you for your code.
I have a question regarding the way the model is being trained.
In the paper it is mentioned that T5 is trained with the teacher forcing technique, in which, at each timestep of decoding, the input should come from the ground-truth data rather than the previously generated token. But in your code the model will generate the entire output by itself through the following lines:
outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
loss = outputs[0]
Is my assumption correct that you do not use the teacher forcing technique? Thanks.
