I want to share my findings about an error I encountered while fine-tuning DistilBERT for sentiment classification (pages 575-582) and how I solved it, and it would be great to get feedback on whether I understood and resolved the problem correctly.
First of all, I installed the required packages, including transformers 4.9.1.
After that, I followed the code from ch16-part3-bert.ipynb with one small modification: because the server with the GPU has no internet access, I downloaded the model and tokenizer (with the additional required files) manually and loaded them from disk.
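For reference, this is roughly how I load them from disk (the directory name is just my local setup, not from the notebook):

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

local_dir = './distilbert-base-uncased'  # local folder with the manually downloaded files
tokenizer = DistilBertTokenizerFast.from_pretrained(local_dir)
model = DistilBertForSequenceClassification.from_pretrained(local_dir)

After that, the tokenization followed the notebook: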
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
Running this, the tokenizer printed the following warning:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Inspecting train_encodings[0] also showed how long the resulting sequences can get:
Encoding(num_tokens=3157, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
I decided to follow the next cells anyway, and when I ran the training loop with device='cuda', I saw the following error:
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
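As an aside, the CUDA_LAUNCH_BLOCKING hint in the last line can be followed by setting the environment variable at the very top of the notebook, before anything touches CUDA, so kernel launches become synchronous and the stack trace points at the actual failing call:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # synchronous kernel launches -> accurate stack traces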
Instead, I simply switched to device='cpu' to get a more detailed description of the error, and got the following traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [40], in <cell line: 3>()
12 labels = batch['labels'].to(DEVICE)
14 ### Forward
---> 15 outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
16 loss, logits = outputs['loss'], outputs['logits']
18 ### Backward
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
1098 # If we don't have any hooks, we want to skip the rest of the logic in
1099 # this function, and just call forward.
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/transformers/models/distilbert/modeling_distilbert.py:625, in DistilBertForSequenceClassification.forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
617 r"""
618 labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
619 Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
620 config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
621 If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
622 """
623 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 625 distilbert_output = self.distilbert(
626 input_ids=input_ids,
627 attention_mask=attention_mask,
628 head_mask=head_mask,
629 inputs_embeds=inputs_embeds,
630 output_attentions=output_attentions,
631 output_hidden_states=output_hidden_states,
632 return_dict=return_dict,
633 )
634 hidden_state = distilbert_output[0] # (bs, seq_len, dim)
635 pooled_output = hidden_state[:, 0] # (bs, dim)
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
1098 # If we don't have any hooks, we want to skip the rest of the logic in
1099 # this function, and just call forward.
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/transformers/models/distilbert/modeling_distilbert.py:488, in DistilBertModel.forward(self, input_ids, attention_mask, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)
485 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
487 if inputs_embeds is None:
--> 488 inputs_embeds = self.embeddings(input_ids) # (bs, seq_length, dim)
489 return self.transformer(
490 x=inputs_embeds,
491 attn_mask=attention_mask,
(...)
495 return_dict=return_dict,
496 )
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
1098 # If we don't have any hooks, we want to skip the rest of the logic in
1099 # this function, and just call forward.
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/transformers/models/distilbert/modeling_distilbert.py:118, in Embeddings.forward(self, input_ids)
115 position_ids = position_ids.unsqueeze(0).expand_as(input_ids) # (bs, max_seq_length)
117 word_embeddings = self.word_embeddings(input_ids) # (bs, max_seq_length, dim)
--> 118 position_embeddings = self.position_embeddings(position_ids) # (bs, max_seq_length, dim)
120 embeddings = word_embeddings + position_embeddings # (bs, max_seq_length, dim)
121 embeddings = self.LayerNorm(embeddings) # (bs, max_seq_length, dim)
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
1098 # If we don't have any hooks, we want to skip the rest of the logic in
1099 # this function, and just call forward.
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/torch/nn/modules/sparse.py:158, in Embedding.forward(self, input)
157 def forward(self, input: Tensor) -> Tensor:
--> 158 return F.embedding(
159 input, self.weight, self.padding_idx, self.max_norm,
160 self.norm_type, self.scale_grad_by_freq, self.sparse)
File ~/venvs/machine_learning_book/lib/python3.8/site-packages/torch/nn/functional.py:2044, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2038 # Note [embedding_renorm set_grad_enabled]
2039 # XXX: equivalent to
2040 # with torch.no_grad():
2041 # torch.embedding_renorm_
2042 # remove once script supports set_grad_enabled
2043 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2044 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
It seemed that an embedding lookup received an out-of-range index, so I decided to check the embedding dimensions of our pre-trained model:
DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
...
The model and tokenizer vocabulary sizes are both 30522, so the word-embedding lookup was not the culprit. Instead, I supposed the problem was with the positional embeddings, because we have samples longer than the number of rows in the positional embedding matrix (512); the train_encodings[0] example above (num_tokens=3157) shows exactly that. So I returned to the tokenizer's first warning and followed its recommendation to specify the max_length argument.
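For anyone who wants to reproduce these checks, here is a minimal sketch (assuming the model, tokenizer, and train_encodings variables from the notebook):

print(model.config.vocab_size, tokenizer.vocab_size)  # both 30522, so vocabulary indices are fine
longest = max(len(ids) for ids in train_encodings['input_ids'])
print(longest)  # 3157 in my case, well above the 512 positional embeddings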
So I changed the tokenization code to pass max_length explicitly:
train_encodings = tokenizer(list(train_texts), max_length=512, truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), max_length=512, truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), max_length=512, truncation=True, padding=True)
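As a side note, instead of hard-coding 512, the maximum length can presumably be read from the model configuration, which keeps the tokenization in sync with the positional embedding table (my assumption, based on the architecture printout above):

max_len = model.config.max_position_embeddings  # 512 for distilbert-base-uncased
train_encodings = tokenizer(list(train_texts), max_length=max_len, truncation=True, padding=True)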
Either way, with truncation in place, the training loop completed successfully.
Thank you.