microsoft / deberta Goto Github PK

The implementation of DeBERTa

License: MIT License

Python 89.44% Dockerfile 0.54% Shell 10.02%

bert deeplearning representation-learning roberta language-model natural-language-understanding self-attention transformer-encoder

deberta's Introduction

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Disentangled Attention and DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

News

03/18/2023

DeBERTaV3 paper is accepted by ICLR 2023.
The code for DeBERTaV3 pre-training and continous training is added. Please check Language Model for details.

12/8/2021

DeBERTa-V3-XSmall is added. With only 22M backbone parameters which is only 1/4 of RoBERTa-Base and XLNet-Base, DeBERTa-V3-XSmall significantly outperforms the later on MNLI and SQuAD v2.0 tasks (i.e. 1.2% on MNLI-m, 1.5% EM score on SQuAD v2.0). This further demonstrates the efficiency of DeBERTaV3 models.

11/16/2021

The models of our new work DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing are publicly available at huggingface model hub now. The new models are based on DeBERTa-V2 models by replacing MLM with ELECTRA-style objective plus gradient-disentangled embedding sharing which further improves the model efficiency.
Scripts for DeBERTa V3 model fine-tuning are added
Code of RTD task head is added
Document for language model pre-training is added

3/31/2021

Masked language model task is added
SuperGLUE tasks is added
SiFT code is added

2/03/2021

DeBERTa v2 code and the 900M, 1.5B model are here now. This includes the 1.5B model used for our SuperGLUE single-model submission and achieving 89.9, versus human baseline 89.8. You can find more details about this submission in our blog

What's new in v2

Vocabulary In v2 we use a new vocabulary of size 128K built from the training data. Instead of GPT2 tokenizer, we use sentencepiece tokenizer.
nGiE(nGram Induced Input Encoding) In v2 we use an additional convolution layer aside with the first transformer layer to better learn the local dependency of input tokens. We will add more ablation studies on this feature.
Sharing position projection matrix with content projection matrix in attention layer Based on our previous experiment, we found this can save parameters without affecting performance.
Apply bucket to encode relative positions In v2 we use log bucket to encode relative positions similar to T5.
900M model & 1.5B model In v2 we scale our model size to 900M and 1.5B which significantly improves the performance of downstream tasks.

12/29/2020

With DeBERTa 1.5B model, we surpass T5 11B model and human performance on SuperGLUE leaderboard. Code and model will be released soon. Please check out our paper for more details.

06/13/2020

We released the pre-trained models, source code, and fine-tuning scripts to reproduce some of the experimental results in the paper. You can follow similar scripts to apply DeBERTa to your own experiments or applications. Pre-training scripts will be released in the next step.

Introduction to DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks.

Pre-trained Models

Our pre-trained models are packaged into zipped files. You can download them from our releases, or download an individual model via the links below:

Model	Vocabulary(K)	Backbone Parameters(M)	Hidden Size	Layers	Note
V2-XXLarge¹	128	1320	1536	48	128K new SPM vocab
V2-XLarge	128	710	1536	24	128K new SPM vocab
XLarge	50	700	1024	48	Same vocab as RoBERTa
Large	50	350	1024	24	Same vocab as RoBERTa
Base	50	100	768	12	Same vocab as RoBERTa
V2-XXLarge-MNLI	128	1320	1536	48	Fine-turned with MNLI
V2-XLarge-MNLI	128	710	1536	24	Fine-turned with MNLI
XLarge-MNLI	50	700	1024	48	Fine-turned with MNLI
Large-MNLI	50	350	1024	24	Fine-turned with MNLI
Base-MNLI	50	86	768	12	Fine-turned with MNLI
DeBERTa-V3-Large²	128	304	1024	24	128K new SPM vocab
DeBERTa-V3-Base²	128	86	768	12	128K new SPM vocab
DeBERTa-V3-Small²	128	44	768	6	128K new SPM vocab
DeBERTa-V3-XSmall²	128	22	384	12	128K new SPM vocab
mDeBERTa-V3-Base²	250	86	768	12	250K new SPM vocab, multi-lingual model with 102 languages

Note

1 This is the model(89.9) that surpassed T5 11B(89.3) and human performance(89.8) on SuperGLUE for the first time. 128K new SPM vocab.
2 These V3 DeBERTa models are deberta models pre-trained with ELECTRA-style objective plus gradient-disentangled embedding sharing which significantly improves the model efficiency.

Try the model

Read our documentation

Requirements

Linux system, e.g. Ubuntu 18.04LTS
CUDA 10.0
pytorch 1.3.0
python 3.6
bash shell 4.0
curl
docker (optional)
nvidia-docker2 (optional)

There are several ways to try our code,

Use docker

Docker is the recommended way to run the code as we already built every dependency into our docker bagai/deberta and you can follow the docker official site to install docker on your machine.

To run with docker, make sure your system fulfills the requirements in the above list. Here are the steps to try the GLUE experiments: Pull the code, run ./run_docker.sh , and then you can run the bash commands under /DeBERTa/experiments/glue/

Use pip

Pull the code and run pip3 install -r requirements.txt in the root directory of the code, then enter experiments/glue/ folder of the code and try the bash commands under that folder for glue experiments.

Install as a pip package

pip install deberta

Use DeBERTa in existing code

# To apply DeBERTa to your existing code, you need to make two changes to your code,
# 1. change your model to consume DeBERTa as the encoder
from DeBERTa import deberta
import torch
class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # do inilization as before
    # 
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    #
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. 
    #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask. 
    #   - If it's an input mask, then it will be torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. 
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. 
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
    #   - If it's an attention mask then if will be torch.LongTensor of shape [batch_size, sequence_length, sequence_length]. 
    #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence. 
    # `output_all_encoded_layers`: whether to output results of all encoder layers, default, True
    encoding = deberta.bert(input_ids)[-1]

# 2. Change your tokenizer with the tokenizer built-in DeBERta
from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence
tokens = tokens[:max_seq_len -2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1]*len(input_ids)
# padding
paddings = max_seq_len-len(input_ids)
input_ids = input_ids + [0]*paddings
input_mask = input_mask + [0]*paddings
features = {
'input_ids': torch.tensor(input_ids, dtype=torch.int),
'input_mask': torch.tensor(input_mask, dtype=torch.int)
}

Run DeBERTa experiments from command line

For glue tasks,

Get the data

cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh  $cache_dir/glue_tasks

Run task

task=STS-B 
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train  \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \  
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128

Notes

1. By default we will cache the pre-trained model and tokenizer at $HOME/.~DeBERTa, you may need to clean it if the downloading failed unexpectedly.
1. You can also try our models with HF Transformers. But when you try XXLarge model you need to specify --sharded_ddp argument. Please check our XXLarge model card for more details.

Experiments

Our fine-tuning experiments are carried on half a DGX-2 node with 8x32 V100 GPU cards, the results may vary due to different GPU models, drivers, CUDA SDK versions, using FP16 or FP32, and random seeds. We report our numbers based on multiple runs with different random seeds here. Here are the results from the Large model:

Task	Command	Results	Running Time(8x32G V100 GPUs)
MNLI xxlarge v2	`experiments/glue/mnli.sh xxlarge-v2`	91.7/91.9 +/-0.1	4h
MNLI xlarge v2	`experiments/glue/mnli.sh xlarge-v2`	91.7/91.6 +/-0.1	2.5h
MNLI xlarge	`experiments/glue/mnli.sh xlarge`	91.5/91.2 +/-0.1	2.5h
MNLI large	`experiments/glue/mnli.sh large`	91.3/91.1 +/-0.1	2.5h
QQP large	`experiments/glue/qqp.sh large`	92.3 +/-0.1	6h
QNLI large	`experiments/glue/qnli.sh large`	95.3 +/-0.2	2h
MRPC large	`experiments/glue/mrpc.sh large`	91.9 +/-0.5	0.5h
RTE large	`experiments/glue/rte.sh large`	86.6 +/-1.0	0.5h
SST-2 large	`experiments/glue/sst2.sh large`	96.7 +/-0.3	1h
STS-b large	`experiments/glue/Stsb.sh large`	92.5 +/-0.3	0.5h
CoLA large	`experiments/glue/cola.sh`	70.5 +/-1.0	0.5h

And here are the results from the Base model

Task	Command	Results	Running Time(8x32G V100 GPUs)
MNLI base	`experiments/glue/mnli.sh base`	88.8/88.5 +/-0.2	1.5h

Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

Model	SQuAD 1.1	SQuAD 2.0	MNLI-m/mm	SST-2	QNLI	CoLA	RTE	MRPC	QQP	STS-B
	F1/EM	F1/EM	Acc	Acc	Acc	MCC	Acc	Acc/F1	Acc/F1	P/S
BERT-Large	90.9/84.1	81.8/79.0	86.6/-	93.2	92.3	60.6	70.4	88.0/-	91.3/-	90.0/-
RoBERTa-Large	94.6/88.9	89.4/86.5	90.2/-	96.4	93.9	68.0	86.6	90.9/-	92.2/-	92.4/-
XLNet-Large	95.1/89.7	90.6/87.9	90.8/-	97.0	94.9	69.0	85.9	90.8/-	92.3/-	92.5/-
DeBERTa-Large¹	95.5/90.1	90.7/88.0	91.3/91.1	96.5	95.3	69.5	91.0	92.6/94.6	92.3/-	92.8/92.5
DeBERTa-XLarge¹	-/-	-/-	91.5/91.2	97.0	-	-	93.1	92.1/94.3	-	92.9/92.7
DeBERTa-V2-XLarge¹	95.8/90.8	91.4/88.9	91.7/91.6	97.5	95.8	71.1	93.9	92.0/94.2	92.3/89.8	92.9/92.9
DeBERTa-V2-XXLarge^1,2	96.1/91.4	92.2/89.7	91.7/91.9	97.2	96.0	72.0	93.5	93.1/94.9	92.7/90.3	93.2/93.1
DeBERTa-V3-Large	-/-	91.5/89.0	91.8/91.9	96.9	96.0	75.3	92.7	92.2/-	93.0/-	93.0/-
DeBERTa-V3-Base	-/-	88.4/85.4	90.6/90.7	-	-	-	-	-	-	-
DeBERTa-V3-Small	-/-	82.9/80.4	88.3/87.7	-	-	-	-	-	-	-
DeBERTa-V3-XSmall	-/-	84.8/82.0	88.1/88.3	-	-	-	-	-	-	-

Fine-tuning on XNLI

We present the dev results on XNLI with zero-shot crosslingual transfer setting, i.e. training with english data only, test on other languages.

Model	avg	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
XLM-R-base	76.2	85.8	79.7	80.7	78.7	77.5	79.6	78.1	74.2	73.8	76.5	74.6	76.7	72.4	66.5	68.3
mDeBERTa-V3-Base	79.8+/-0.2	88.2	82.6	84.4	82.7	82.3	82.4	80.8	79.5	78.5	78.1	76.4	79.5	75.9	73.9	72.4

Notes.

¹ Following RoBERTa, for RTE, MRPC, STS-B, we fine-tune the tasks based on DeBERTa-Large-MNLI, DeBERTa-XLarge-MNLI, DeBERTa-V2-XLarge-MNLI, DeBERTa-V2-XXLarge-MNLI. The results of SST-2/QQP/QNLI/SQuADv2 will also be slightly improved when starting from MNLI fine-tuned models, however, we only report the numbers fine-tuned from pretrained base models for those 4 tasks.

Pre-training with MLM and RTD objectives

To pre-train DeBERTa with MLM and RTD objectives, please check experiments/language_models

Contacts

Pengcheng He([email protected]), Xiaodong Liu([email protected]), Jianfeng Gao([email protected]), Weizhu Chen([email protected])

Citation

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}

deberta's People

Contributors

Stargazers

Watchers

Forkers

oo2316oo ankitshah009 hfxunlp addf400 replr yyht ganik rocke2020 silencio94 liuhc001 taffywrinkle claudiusgonzo namisan edwardjhu huberemanuel acul3 analogiks rosssong nakosung longjohncoder murilo phosseini beijinggao wangyan2013 basciple renyi533 sdedalus zeta1999 xrosliang isabellarossi sxjscience dbs700 ricklentz alexnewtown airacks tobiasoend shenfe cedar33 rogervaas abhishekvermasg mitrofanovdmitry stjordanis xvshiting chenshuguang johnson7788 dinuda lgstd xgk bruinxiong lysandrejik sunnyhuma171 abodacs global-localhost global19 global19-atlassian-net khubaibahmed-dev dataxujing brunotech jbdatascience bakhitov codeaudit wk-chen github2alfred rfox12 ericdoug-qi gandalf012 beesitech touchmed sailfish009 tawawhite meehawk alisafaya mijing suhao16 qi-jinli hyfeng kucukagan anshiquanshu66 marshallho msft-edward chenyujie1127 starlight-2021 hyunmu git19112019 daishu7 trianxy anukaal standardgalactic sameepyadav noanabeshima kkkzxh jcooky yarally vilhub oughtinc verochen147 oi2996814 devbox10 sarufi-io derenlei

deberta's Issues

Dose DeBERTa have any plans to publish Chinese pre-trained model?

How does it save computational cost in EMD?

As is showed in the code, EMD just reuses the last layer of the encoder twice, how to get the conclusion of "the additional cost is 0.15 * 2L" in the paper?

bpe_encoder.bin file missing

Hello,

Looks like bpe_encoder.bin is missing in https://huggingface.co/microsoft/deberta-xxlarge-v2/tree/main
Can you please upload it?

Thanks

assertion hit during forward pass

Hi,

I am trying out the DeBERTa model from the latest docker image provided. It seems like I hit an error during the forward pass:

Downloading pytorch_model.bin: 100%|█| 1518990915/1518990915 [00:44<00:00, 33979
Downloading model_config.json: 100%|██████| 475/475 [00:00<00:00, 373718.70it/s]
07/24/2021 01:19:40|INFO|logging|00| Loaded pretrained model file /root/.~DeBERTa/assets/latest/deberta-xlarge/pytorch_model.bin
Traceback (most recent call last):
  File "model_copy.py", line 52, in <module>
    model = MyModel()
  File "model_copy.py", line 10, in __init__
    self.deberta = deberta.DeBERTa(pre_trained='xlarge') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
  File "/usr/local/lib/python3.6/dist-packages/DeBERTa/deberta/deberta.py", line 57, in __init__
    self.apply_state(state)
  File "/usr/local/lib/python3.6/dist-packages/DeBERTa/deberta/deberta.py", line 147, in apply_state
    current[c] = state[key_match(c, state.keys())]
  File "/usr/local/lib/python3.6/dist-packages/DeBERTa/deberta/deberta.py", line 143, in key_match
    assert len(c)==1, c
AssertionError: []

The assertion shows there is one entry inside the dataset that doesn't have a key. I printed the keys it run into so far:

**  embeddings.word_embeddings.weight ['deberta.embeddings.word_embeddings.weight']
**  embeddings.LayerNorm.weight ['deberta.embeddings.LayerNorm.weight']
**  embeddings.LayerNorm.bias ['deberta.embeddings.LayerNorm.bias']
**  encoder.layer.0.attention.self.query_proj.weight []

The last entry didn't have a corresponding key. Is this expected? How should I work around it?

Thanks

how to train wsc in superglue

I'm wondering how to use this code to training for the WSC task in superglue. I'm getting very low accuracy running using the script in jiant scripts based on roberta-base.

Deberta Xlarge

Hello there,

Is Deberta X-large the same model as Deberta1.5B?
https://github.com/microsoft/DeBERTa/releases/tag/v0.1.8

V2 SentencePiece Tokenizer - training settings used?

Just curious about the switch of tokenizer in V2, can you share why you switched?

And what training settings for SentencePiece you used to train the v2 spm tokenizer? Was it a BPE model? Byte-level or character-level? Any other sentencepiece settings you used that you found useful?

Thanks!

model key 'encoder.layer.0.attention.self.query_proj.weight' not found in base-mnli

code:

self.deberta = deberta.DeBERTa(pre_trained="/path/to/pretrained_dir/pytorch_model.bin")
self.deberta.apply_state()

message:

 File "/home/user/DeBERTa/DeBERTa/deberta/deberta.py", line 143, in key_match
    assert len(c)==1, (c, s, key)
AssertionError: ([], dict_keys(['deberta.embeddings.word_embeddings.weight', 'deberta.embeddings.LayerNorm.weight', 'deberta.embeddings.LayerNorm.bias', 'deberta.encoder.layer.0.attention.self.q_bias', 'deberta.encoder.layer.0.attention.self.v_bias', 'deberta.encoder.layer.0.attention.self.in_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.0.attention.output.dense.weight', 'deberta.encoder.layer.0.attention.output.dense.bias', 'deberta.encoder.layer.0.attention.output.LayerNorm.weight', 'deberta.encoder.layer.0.attention.output.LayerNorm.bias', 'deberta.encoder.layer.0.intermediate.dense.weight', 'deberta.encoder.layer.0.intermediate.dense.bias', 'deberta.encoder.layer.0.output.dense.weight', 'deberta.encoder.layer.0.output.dense.bias', 'deberta.encoder.layer.0.output.LayerNorm.weight', 'deberta.encoder.layer.0.output.LayerNorm.bias', 'deberta.encoder.layer.1.attention.self.q_bias', 'deberta.encoder.layer.1.attention.self.v_bias', 'deberta.encoder.layer.1.attention.self.in_proj.weight', 'deberta.encoder.layer.1.attention.self.pos_proj.weight', 'deberta.encoder.layer.1.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.1.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.1.attention.output.dense.weight', 'deberta.encoder.layer.1.attention.output.dense.bias', 'deberta.encoder.layer.1.attention.output.LayerNorm.weight', 'deberta.encoder.layer.1.attention.output.LayerNorm.bias', 'deberta.encoder.layer.1.intermediate.dense.weight', 'deberta.encoder.layer.1.intermediate.dense.bias', 'deberta.encoder.layer.1.output.dense.weight', 'deberta.encoder.layer.1.output.dense.bias', 'deberta.encoder.layer.1.output.LayerNorm.weight', 'deberta.encoder.layer.1.output.LayerNorm.bias', 'deberta.encoder.layer.2.attention.self.q_bias', 'deberta.encoder.layer.2.attention.self.v_bias', 'deberta.encoder.layer.2.attention.self.in_proj.weight', 'deberta.encoder.layer.2.attention.self.pos_proj.weight', 'deberta.encoder.layer.2.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.2.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.2.attention.output.dense.weight', 'deberta.encoder.layer.2.attention.output.dense.bias', 'deberta.encoder.layer.2.attention.output.LayerNorm.weight', 'deberta.encoder.layer.2.attention.output.LayerNorm.bias', 'deberta.encoder.layer.2.intermediate.dense.weight', 'deberta.encoder.layer.2.intermediate.dense.bias', 'deberta.encoder.layer.2.output.dense.weight', 'deberta.encoder.layer.2.output.dense.bias', 'deberta.encoder.layer.2.output.LayerNorm.weight', 'deberta.encoder.layer.2.output.LayerNorm.bias', 'deberta.encoder.layer.3.attention.self.q_bias', 'deberta.encoder.layer.3.attention.self.v_bias', 'deberta.encoder.layer.3.attention.self.in_proj.weight', 'deberta.encoder.layer.3.attention.self.pos_proj.weight', 'deberta.encoder.layer.3.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.3.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.3.attention.output.dense.weight', 'deberta.encoder.layer.3.attention.output.dense.bias', 'deberta.encoder.layer.3.attention.output.LayerNorm.weight', 'deberta.encoder.layer.3.attention.output.LayerNorm.bias', 'deberta.encoder.layer.3.intermediate.dense.weight', 'deberta.encoder.layer.3.intermediate.dense.bias', 'deberta.encoder.layer.3.output.dense.weight', 'deberta.encoder.layer.3.output.dense.bias', 'deberta.encoder.layer.3.output.LayerNorm.weight', 'deberta.encoder.layer.3.output.LayerNorm.bias', 'deberta.encoder.layer.4.attention.self.q_bias', 'deberta.encoder.layer.4.attention.self.v_bias', 'deberta.encoder.layer.4.attention.self.in_proj.weight', 'deberta.encoder.layer.4.attention.self.pos_proj.weight', 'deberta.encoder.layer.4.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.4.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.4.attention.output.dense.weight', 'deberta.encoder.layer.4.attention.output.dense.bias', 'deberta.encoder.layer.4.attention.output.LayerNorm.weight', 'deberta.encoder.layer.4.attention.output.LayerNorm.bias', 'deberta.encoder.layer.4.intermediate.dense.weight', 'deberta.encoder.layer.4.intermediate.dense.bias', 'deberta.encoder.layer.4.output.dense.weight', 'deberta.encoder.layer.4.output.dense.bias', 'deberta.encoder.layer.4.output.LayerNorm.weight', 'deberta.encoder.layer.4.output.LayerNorm.bias', 'deberta.encoder.layer.5.attention.self.q_bias', 'deberta.encoder.layer.5.attention.self.v_bias', 'deberta.encoder.layer.5.attention.self.in_proj.weight', 'deberta.encoder.layer.5.attention.self.pos_proj.weight', 'deberta.encoder.layer.5.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.5.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.5.attention.output.dense.weight', 'deberta.encoder.layer.5.attention.output.dense.bias', 'deberta.encoder.layer.5.attention.output.LayerNorm.weight', 'deberta.encoder.layer.5.attention.output.LayerNorm.bias', 'deberta.encoder.layer.5.intermediate.dense.weight', 'deberta.encoder.layer.5.intermediate.dense.bias', 'deberta.encoder.layer.5.output.dense.weight', 'deberta.encoder.layer.5.output.dense.bias', 'deberta.encoder.layer.5.output.LayerNorm.weight', 'deberta.encoder.layer.5.output.LayerNorm.bias', 'deberta.encoder.layer.6.attention.self.q_bias', 'deberta.encoder.layer.6.attention.self.v_bias', 'deberta.encoder.layer.6.attention.self.in_proj.weight', 'deberta.encoder.layer.6.attention.self.pos_proj.weight', 'deberta.encoder.layer.6.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.6.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.6.attention.output.dense.weight', 'deberta.encoder.layer.6.attention.output.dense.bias', 'deberta.encoder.layer.6.attention.output.LayerNorm.weight', 'deberta.encoder.layer.6.attention.output.LayerNorm.bias', 'deberta.encoder.layer.6.intermediate.dense.weight', 'deberta.encoder.layer.6.intermediate.dense.bias', 'deberta.encoder.layer.6.output.dense.weight', 'deberta.encoder.layer.6.output.dense.bias', 'deberta.encoder.layer.6.output.LayerNorm.weight', 'deberta.encoder.layer.6.output.LayerNorm.bias', 'deberta.encoder.layer.7.attention.self.q_bias', 'deberta.encoder.layer.7.attention.self.v_bias', 'deberta.encoder.layer.7.attention.self.in_proj.weight', 'deberta.encoder.layer.7.attention.self.pos_proj.weight', 'deberta.encoder.layer.7.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.7.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.7.attention.output.dense.weight', 'deberta.encoder.layer.7.attention.output.dense.bias', 'deberta.encoder.layer.7.attention.output.LayerNorm.weight', 'deberta.encoder.layer.7.attention.output.LayerNorm.bias', 'deberta.encoder.layer.7.intermediate.dense.weight', 'deberta.encoder.layer.7.intermediate.dense.bias', 'deberta.encoder.layer.7.output.dense.weight', 'deberta.encoder.layer.7.output.dense.bias', 'deberta.encoder.layer.7.output.LayerNorm.weight', 'deberta.encoder.layer.7.output.LayerNorm.bias', 'deberta.encoder.layer.8.attention.self.q_bias', 'deberta.encoder.layer.8.attention.self.v_bias', 'deberta.encoder.layer.8.attention.self.in_proj.weight', 'deberta.encoder.layer.8.attention.self.pos_proj.weight', 'deberta.encoder.layer.8.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.8.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.8.attention.output.dense.weight', 'deberta.encoder.layer.8.attention.output.dense.bias', 'deberta.encoder.layer.8.attention.output.LayerNorm.weight', 'deberta.encoder.layer.8.attention.output.LayerNorm.bias', 'deberta.encoder.layer.8.intermediate.dense.weight', 'deberta.encoder.layer.8.intermediate.dense.bias', 'deberta.encoder.layer.8.output.dense.weight', 'deberta.encoder.layer.8.output.dense.bias', 'deberta.encoder.layer.8.output.LayerNorm.weight', 'deberta.encoder.layer.8.output.LayerNorm.bias', 'deberta.encoder.layer.9.attention.self.q_bias', 'deberta.encoder.layer.9.attention.self.v_bias', 'deberta.encoder.layer.9.attention.self.in_proj.weight', 'deberta.encoder.layer.9.attention.self.pos_proj.weight', 'deberta.encoder.layer.9.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.9.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.9.attention.output.dense.weight', 'deberta.encoder.layer.9.attention.output.dense.bias', 'deberta.encoder.layer.9.attention.output.LayerNorm.weight', 'deberta.encoder.layer.9.attention.output.LayerNorm.bias', 'deberta.encoder.layer.9.intermediate.dense.weight', 'deberta.encoder.layer.9.intermediate.dense.bias', 'deberta.encoder.layer.9.output.dense.weight', 'deberta.encoder.layer.9.output.dense.bias', 'deberta.encoder.layer.9.output.LayerNorm.weight', 'deberta.encoder.layer.9.output.LayerNorm.bias', 'deberta.encoder.layer.10.attention.self.q_bias', 'deberta.encoder.layer.10.attention.self.v_bias', 'deberta.encoder.layer.10.attention.self.in_proj.weight', 'deberta.encoder.layer.10.attention.self.pos_proj.weight', 'deberta.encoder.layer.10.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.10.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.10.attention.output.dense.weight', 'deberta.encoder.layer.10.attention.output.dense.bias', 'deberta.encoder.layer.10.attention.output.LayerNorm.weight', 'deberta.encoder.layer.10.attention.output.LayerNorm.bias', 'deberta.encoder.layer.10.intermediate.dense.weight', 'deberta.encoder.layer.10.intermediate.dense.bias', 'deberta.encoder.layer.10.output.dense.weight', 'deberta.encoder.layer.10.output.dense.bias', 'deberta.encoder.layer.10.output.LayerNorm.weight', 'deberta.encoder.layer.10.output.LayerNorm.bias', 'deberta.encoder.layer.11.attention.self.q_bias', 'deberta.encoder.layer.11.attention.self.v_bias', 'deberta.encoder.layer.11.attention.self.in_proj.weight', 'deberta.encoder.layer.11.attention.self.pos_proj.weight', 'deberta.encoder.layer.11.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.11.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.11.attention.output.dense.weight', 'deberta.encoder.layer.11.attention.output.dense.bias', 'deberta.encoder.layer.11.attention.output.LayerNorm.weight', 'deberta.encoder.layer.11.attention.output.LayerNorm.bias', 'deberta.encoder.layer.11.intermediate.dense.weight', 'deberta.encoder.layer.11.intermediate.dense.bias', 'deberta.encoder.layer.11.output.dense.weight', 'deberta.encoder.layer.11.output.dense.bias', 'deberta.encoder.layer.11.output.LayerNorm.weight', 'deberta.encoder.layer.11.output.LayerNorm.bias', 'deberta.encoder.rel_embeddings.weight', 'pooler.dense.weight', 'pooler.dense.bias', 'classifier.weight', 'classifier.bias', 'config']),
 'encoder.layer.0.attention.self.query_proj.weight')

when I change the path to deberta-v2-xlarge-mnli, everything is OK.

Issues loading 1.5B model in huggingface and in deberta package

Hello,

It seems like some of the weights were renamed/shaped in the V2 model releases and I couldn't quite figure out how to map them to the old structure

# it seemed like 
pos_q_proj => query_proj
v_bias => value_proj

but I couldn't match

deberta.encoder.layer.44.attention.self.key_proj.weight', 'deberta.encoder.layer.44.attention.self.key_proj.bias
=>
deberta.encoder.layer.44.attention.self.q_bias', 'deberta.encoder.layer.44.attention.self.value_proj', 'deberta.encoder.layer.44.attention.self.in_proj.weight', 'deberta.encoder.layer.44.attention.self.pos_proj.weight

That was for huggingface, but I couldn't figure it out in this repo either.

Could someone upload the v2 model file?

The exact English pretraining data and Chinese pretraining data that are exact same to the BERT paper's pretraining data.

Any one know where to get them?
Thank you and thank you.

Pre-training DeBERTa from Scratch

Hello there,

Are there any instructions on how to pretrain DeBERTa from scratch?

Thanks

A question about the implementation of position-to-content attention

Merged with #33

Pretrained Model with 'RuntimeError: Error(s) in loading state_dict for DebertaModel'

Just for comparison: This works:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

And this works too:
from DeBERTa import deberta
model= DebertaModel.from_pretrained('microsoft/deberta-base')

But this does not work:
from DeBERTa import deberta
model = DebertaModel.from_pretrained('microsoft/deberta-xlarge-v2-mnli')

**What am I doing wrong? I simply would like to apply this pre-trained model without fine-tuning
I am using Python 3.8.5 with conda, all packages are up-to-date.

Here the error (see especially the RuntimeError down below):**
Some weights of the model checkpoint at microsoft/deberta-xlarge-v2-mnli were not used when initializing DebertaModel: ['deberta.encoder.LayerNorm.weight', 'deberta.encoder.LayerNorm.bias', 'deberta.encoder.conv.conv.weight', 'deberta.encoder.conv.conv.bias', 'deberta.encoder.conv.LayerNorm.weight', 'deberta.encoder.conv.LayerNorm.bias', 'deberta.encoder.layer.0.attention.self.query_proj.weight', 'deberta.encoder.layer.0.attention.self.query_proj.bias', 'deberta.encoder.layer.0.attention.self.key_proj.weight', 'deberta.encoder.layer.0.attention.self.key_proj.bias', 'deberta.encoder.layer.0.attention.self.value_proj.weight', 'deberta.encoder.layer.0.attention.self.value_proj.bias', 'deberta.encoder.layer.1.attention.self.query_proj.weight', 'deberta.encoder.layer.1.attention.self.query_proj.bias', 'deberta.encoder.layer.1.attention.self.key_proj.weight', 'deberta.encoder.layer.1.attention.self.key_proj.bias', 'deberta.encoder.layer.1.attention.self.value_proj.weight', 'deberta.encoder.layer.1.attention.self.value_proj.bias', 'deberta.encoder.layer.2.attention.self.query_proj.weight', 'deberta.encoder.layer.2.attention.self.query_proj.bias', 'deberta.encoder.layer.2.attention.self.key_proj.weight', 'deberta.encoder.layer.2.attention.self.key_proj.bias', 'deberta.encoder.layer.2.attention.self.value_proj.weight', 'deberta.encoder.layer.2.attention.self.value_proj.bias', 'deberta.encoder.layer.3.attention.self.query_proj.weight', 'deberta.encoder.layer.3.attention.self.query_proj.bias', 'deberta.encoder.layer.3.attention.self.key_proj.weight', 'deberta.encoder.layer.3.attention.self.key_proj.bias', 'deberta.encoder.layer.3.attention.self.value_proj.weight', 'deberta.encoder.layer.3.attention.self.value_proj.bias', 'deberta.encoder.layer.4.attention.self.query_proj.weight', 'deberta.encoder.layer.4.attention.self.query_proj.bias', 'deberta.encoder.layer.4.attention.self.key_proj.weight', 'deberta.encoder.layer.4.attention.self.key_proj.bias', 'deberta.encoder.layer.4.attention.self.value_proj.weight', 'deberta.encoder.layer.4.attention.self.value_proj.bias', 'deberta.encoder.layer.5.attention.self.query_proj.weight', 'deberta.encoder.layer.5.attention.self.query_proj.bias', 'deberta.encoder.layer.5.attention.self.key_proj.weight', 'deberta.encoder.layer.5.attention.self.key_proj.bias', 'deberta.encoder.layer.5.attention.self.value_proj.weight', 'deberta.encoder.layer.5.attention.self.value_proj.bias', 'deberta.encoder.layer.6.attention.self.query_proj.weight', 'deberta.encoder.layer.6.attention.self.query_proj.bias', 'deberta.encoder.layer.6.attention.self.key_proj.weight', 'deberta.encoder.layer.6.attention.self.key_proj.bias', 'deberta.encoder.layer.6.attention.self.value_proj.weight', 'deberta.encoder.layer.6.attention.self.value_proj.bias', 'deberta.encoder.layer.7.attention.self.query_proj.weight', 'deberta.encoder.layer.7.attention.self.query_proj.bias', 'deberta.encoder.layer.7.attention.self.key_proj.weight', 'deberta.encoder.layer.7.attention.self.key_proj.bias', 'deberta.encoder.layer.7.attention.self.value_proj.weight', 'deberta.encoder.layer.7.attention.self.value_proj.bias', 'deberta.encoder.layer.8.attention.self.query_proj.weight', 'deberta.encoder.layer.8.attention.self.query_proj.bias', 'deberta.encoder.layer.8.attention.self.key_proj.weight', 'deberta.encoder.layer.8.attention.self.key_proj.bias', 'deberta.encoder.layer.8.attention.self.value_proj.weight', 'deberta.encoder.layer.8.attention.self.value_proj.bias', 'deberta.encoder.layer.9.attention.self.query_proj.weight', 'deberta.encoder.layer.9.attention.self.query_proj.bias', 'deberta.encoder.layer.9.attention.self.key_proj.weight', 'deberta.encoder.layer.9.attention.self.key_proj.bias', 'deberta.encoder.layer.9.attention.self.value_proj.weight', 'deberta.encoder.layer.9.attention.self.value_proj.bias', 'deberta.encoder.layer.10.attention.self.query_proj.weight', 'deberta.encoder.layer.10.attention.self.query_proj.bias', 'deberta.encoder.layer.10.attention.self.key_proj.weight', 'deberta.encoder.layer.10.attention.self.key_proj.bias', 'deberta.encoder.layer.10.attention.self.value_proj.weight', 'deberta.encoder.layer.10.attention.self.value_proj.bias', 'deberta.encoder.layer.11.attention.self.query_proj.weight', 'deberta.encoder.layer.11.attention.self.query_proj.bias', 'deberta.encoder.layer.11.attention.self.key_proj.weight', 'deberta.encoder.layer.11.attention.self.key_proj.bias', 'deberta.encoder.layer.11.attention.self.value_proj.weight', 'deberta.encoder.layer.11.attention.self.value_proj.bias', 'deberta.encoder.layer.12.attention.self.query_proj.weight', 'deberta.encoder.layer.12.attention.self.query_proj.bias', 'deberta.encoder.layer.12.attention.self.key_proj.weight', 'deberta.encoder.layer.12.attention.self.key_proj.bias', 'deberta.encoder.layer.12.attention.self.value_proj.weight', 'deberta.encoder.layer.12.attention.self.value_proj.bias', 'deberta.encoder.layer.13.attention.self.query_proj.weight', 'deberta.encoder.layer.13.attention.self.query_proj.bias', 'deberta.encoder.layer.13.attention.self.key_proj.weight', 'deberta.encoder.layer.13.attention.self.key_proj.bias', 'deberta.encoder.layer.13.attention.self.value_proj.weight', 'deberta.encoder.layer.13.attention.self.value_proj.bias', 'deberta.encoder.layer.14.attention.self.query_proj.weight', 'deberta.encoder.layer.14.attention.self.query_proj.bias', 'deberta.encoder.layer.14.attention.self.key_proj.weight', 'deberta.encoder.layer.14.attention.self.key_proj.bias', 'deberta.encoder.layer.14.attention.self.value_proj.weight', 'deberta.encoder.layer.14.attention.self.value_proj.bias', 'deberta.encoder.layer.15.attention.self.query_proj.weight', 'deberta.encoder.layer.15.attention.self.query_proj.bias', 'deberta.encoder.layer.15.attention.self.key_proj.weight', 'deberta.encoder.layer.15.attention.self.key_proj.bias', 'deberta.encoder.layer.15.attention.self.value_proj.weight', 'deberta.encoder.layer.15.attention.self.value_proj.bias', 'deberta.encoder.layer.16.attention.self.query_proj.weight', 'deberta.encoder.layer.16.attention.self.query_proj.bias', 'deberta.encoder.layer.16.attention.self.key_proj.weight', 'deberta.encoder.layer.16.attention.self.key_proj.bias', 'deberta.encoder.layer.16.attention.self.value_proj.weight', 'deberta.encoder.layer.16.attention.self.value_proj.bias', 'deberta.encoder.layer.17.attention.self.query_proj.weight', 'deberta.encoder.layer.17.attention.self.query_proj.bias', 'deberta.encoder.layer.17.attention.self.key_proj.weight', 'deberta.encoder.layer.17.attention.self.key_proj.bias', 'deberta.encoder.layer.17.attention.self.value_proj.weight', 'deberta.encoder.layer.17.attention.self.value_proj.bias', 'deberta.encoder.layer.18.attention.self.query_proj.weight', 'deberta.encoder.layer.18.attention.self.query_proj.bias', 'deberta.encoder.layer.18.attention.self.key_proj.weight', 'deberta.encoder.layer.18.attention.self.key_proj.bias', 'deberta.encoder.layer.18.attention.self.value_proj.weight', 'deberta.encoder.layer.18.attention.self.value_proj.bias', 'deberta.encoder.layer.19.attention.self.query_proj.weight', 'deberta.encoder.layer.19.attention.self.query_proj.bias', 'deberta.encoder.layer.19.attention.self.key_proj.weight', 'deberta.encoder.layer.19.attention.self.key_proj.bias', 'deberta.encoder.layer.19.attention.self.value_proj.weight', 'deberta.encoder.layer.19.attention.self.value_proj.bias', 'deberta.encoder.layer.20.attention.self.query_proj.weight', 'deberta.encoder.layer.20.attention.self.query_proj.bias', 'deberta.encoder.layer.20.attention.self.key_proj.weight', 'deberta.encoder.layer.20.attention.self.key_proj.bias', 'deberta.encoder.layer.20.attention.self.value_proj.weight', 'deberta.encoder.layer.20.attention.self.value_proj.bias', 'deberta.encoder.layer.21.attention.self.query_proj.weight', 'deberta.encoder.layer.21.attention.self.query_proj.bias', 'deberta.encoder.layer.21.attention.self.key_proj.weight', 'deberta.encoder.layer.21.attention.self.key_proj.bias', 'deberta.encoder.layer.21.attention.self.value_proj.weight', 'deberta.encoder.layer.21.attention.self.value_proj.bias', 'deberta.encoder.layer.22.attention.self.query_proj.weight', 'deberta.encoder.layer.22.attention.self.query_proj.bias', 'deberta.encoder.layer.22.attention.self.key_proj.weight', 'deberta.encoder.layer.22.attention.self.key_proj.bias', 'deberta.encoder.layer.22.attention.self.value_proj.weight', 'deberta.encoder.layer.22.attention.self.value_proj.bias', 'deberta.encoder.layer.23.attention.self.query_proj.weight', 'deberta.encoder.layer.23.attention.self.query_proj.bias', 'deberta.encoder.layer.23.attention.self.key_proj.weight', 'deberta.encoder.layer.23.attention.self.key_proj.bias', 'deberta.encoder.layer.23.attention.self.value_proj.weight', 'deberta.encoder.layer.23.attention.self.value_proj.bias']

This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaModel were not initialized from the model checkpoint at microsoft/deberta-xlarge-v2-mnli and are newly initialized: ['deberta.encoder.layer.0.attention.self.q_bias', 'deberta.encoder.layer.0.attention.self.v_bias', 'deberta.encoder.layer.0.attention.self.in_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.1.attention.self.q_bias', 'deberta.encoder.layer.1.attention.self.v_bias', 'deberta.encoder.layer.1.attention.self.in_proj.weight', 'deberta.encoder.layer.1.attention.self.pos_proj.weight', 'deberta.encoder.layer.1.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.1.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.2.attention.self.q_bias', 'deberta.encoder.layer.2.attention.self.v_bias', 'deberta.encoder.layer.2.attention.self.in_proj.weight', 'deberta.encoder.layer.2.attention.self.pos_proj.weight', 'deberta.encoder.layer.2.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.2.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.3.attention.self.q_bias', 'deberta.encoder.layer.3.attention.self.v_bias', 'deberta.encoder.layer.3.attention.self.in_proj.weight', 'deberta.encoder.layer.3.attention.self.pos_proj.weight', 'deberta.encoder.layer.3.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.3.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.4.attention.self.q_bias', 'deberta.encoder.layer.4.attention.self.v_bias', 'deberta.encoder.layer.4.attention.self.in_proj.weight', 'deberta.encoder.layer.4.attention.self.pos_proj.weight', 'deberta.encoder.layer.4.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.4.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.5.attention.self.q_bias', 'deberta.encoder.layer.5.attention.self.v_bias', 'deberta.encoder.layer.5.attention.self.in_proj.weight', 'deberta.encoder.layer.5.attention.self.pos_proj.weight', 'deberta.encoder.layer.5.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.5.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.6.attention.self.q_bias', 'deberta.encoder.layer.6.attention.self.v_bias', 'deberta.encoder.layer.6.attention.self.in_proj.weight', 'deberta.encoder.layer.6.attention.self.pos_proj.weight', 'deberta.encoder.layer.6.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.6.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.7.attention.self.q_bias', 'deberta.encoder.layer.7.attention.self.v_bias', 'deberta.encoder.layer.7.attention.self.in_proj.weight', 'deberta.encoder.layer.7.attention.self.pos_proj.weight', 'deberta.encoder.layer.7.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.7.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.8.attention.self.q_bias', 'deberta.encoder.layer.8.attention.self.v_bias', 'deberta.encoder.layer.8.attention.self.in_proj.weight', 'deberta.encoder.layer.8.attention.self.pos_proj.weight', 'deberta.encoder.layer.8.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.8.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.9.attention.self.q_bias', 'deberta.encoder.layer.9.attention.self.v_bias', 'deberta.encoder.layer.9.attention.self.in_proj.weight', 'deberta.encoder.layer.9.attention.self.pos_proj.weight', 'deberta.encoder.layer.9.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.9.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.10.attention.self.q_bias', 'deberta.encoder.layer.10.attention.self.v_bias', 'deberta.encoder.layer.10.attention.self.in_proj.weight', 'deberta.encoder.layer.10.attention.self.pos_proj.weight', 'deberta.encoder.layer.10.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.10.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.11.attention.self.q_bias', 'deberta.encoder.layer.11.attention.self.v_bias', 'deberta.encoder.layer.11.attention.self.in_proj.weight', 'deberta.encoder.layer.11.attention.self.pos_proj.weight', 'deberta.encoder.layer.11.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.11.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.12.attention.self.q_bias', 'deberta.encoder.layer.12.attention.self.v_bias', 'deberta.encoder.layer.12.attention.self.in_proj.weight', 'deberta.encoder.layer.12.attention.self.pos_proj.weight', 'deberta.encoder.layer.12.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.12.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.13.attention.self.q_bias', 'deberta.encoder.layer.13.attention.self.v_bias', 'deberta.encoder.layer.13.attention.self.in_proj.weight', 'deberta.encoder.layer.13.attention.self.pos_proj.weight', 'deberta.encoder.layer.13.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.13.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.14.attention.self.q_bias', 'deberta.encoder.layer.14.attention.self.v_bias', 'deberta.encoder.layer.14.attention.self.in_proj.weight', 'deberta.encoder.layer.14.attention.self.pos_proj.weight', 'deberta.encoder.layer.14.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.14.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.15.attention.self.q_bias', 'deberta.encoder.layer.15.attention.self.v_bias', 'deberta.encoder.layer.15.attention.self.in_proj.weight', 'deberta.encoder.layer.15.attention.self.pos_proj.weight', 'deberta.encoder.layer.15.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.15.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.16.attention.self.q_bias', 'deberta.encoder.layer.16.attention.self.v_bias', 'deberta.encoder.layer.16.attention.self.in_proj.weight', 'deberta.encoder.layer.16.attention.self.pos_proj.weight', 'deberta.encoder.layer.16.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.16.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.17.attention.self.q_bias', 'deberta.encoder.layer.17.attention.self.v_bias', 'deberta.encoder.layer.17.attention.self.in_proj.weight', 'deberta.encoder.layer.17.attention.self.pos_proj.weight', 'deberta.encoder.layer.17.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.17.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.18.attention.self.q_bias', 'deberta.encoder.layer.18.attention.self.v_bias', 'deberta.encoder.layer.18.attention.self.in_proj.weight', 'deberta.encoder.layer.18.attention.self.pos_proj.weight', 'deberta.encoder.layer.18.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.18.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.19.attention.self.q_bias', 'deberta.encoder.layer.19.attention.self.v_bias', 'deberta.encoder.layer.19.attention.self.in_proj.weight', 'deberta.encoder.layer.19.attention.self.pos_proj.weight', 'deberta.encoder.layer.19.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.19.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.20.attention.self.q_bias', 'deberta.encoder.layer.20.attention.self.v_bias', 'deberta.encoder.layer.20.attention.self.in_proj.weight', 'deberta.encoder.layer.20.attention.self.pos_proj.weight', 'deberta.encoder.layer.20.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.20.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.21.attention.self.q_bias', 'deberta.encoder.layer.21.attention.self.v_bias', 'deberta.encoder.layer.21.attention.self.in_proj.weight', 'deberta.encoder.layer.21.attention.self.pos_proj.weight', 'deberta.encoder.layer.21.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.21.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.22.attention.self.q_bias', 'deberta.encoder.layer.22.attention.self.v_bias', 'deberta.encoder.layer.22.attention.self.in_proj.weight', 'deberta.encoder.layer.22.attention.self.pos_proj.weight', 'deberta.encoder.layer.22.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.22.attention.self.pos_q_proj.bias', 'deberta.encoder.layer.23.attention.self.q_bias', 'deberta.encoder.layer.23.attention.self.v_bias', 'deberta.encoder.layer.23.attention.self.in_proj.weight', 'deberta.encoder.layer.23.attention.self.pos_proj.weight', 'deberta.encoder.layer.23.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.23.attention.self.pos_q_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):

File "", line 1, in
model = DebertaModel.from_pretrained('microsoft/deberta-xlarge-v2-mnli')

File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\transformers\modeling_utils.py", line 1157, in from_pretrained
raise RuntimeError(

RuntimeError: Error(s) in loading state_dict for DebertaModel:
size mismatch for deberta.encoder.rel_embeddings.weight: copying a param with shape torch.Size([512, 1536]) from checkpoint, the shape in current model is torch.Size([1024, 1536]).

Training with K80 or GPUs with version less than 5.x

In train.sh the following line limits the CUDA_VISIBLE_DEVICES environment variable with GPUs with versions greater or equal to 6.x.

export CUDA_VISIBLE_DEVICES=$(python3 -c "import torch; x=[str(x) for x in range(torch.cuda.device_count()) if torch.cuda.get_device_capability(x)[0]>=6]; print(','.join(x))" 2>/dev/null)

I don't know why there is this limitation, but this caused an issue in my case where I use a Tesla K80 that is a 3.x version. When I executed the training script, the error would say that I do not have a CUDA available device.
After removing this limitation (export CUDA_VISIBLE_DEVICES=1) I was able to run the training procedure correctly.

This version limitation is really needed? Or can we remove it?

The amount of WSC training examples

Hi,

According to the table in the appendix to the paper, 554 thousand (!) examples were used to train the DeBERTa model on the WSC task. Winograd Schemas are generally not very easy to come up with, especially such a huge amount. Can you please clarify where do these training examples come from? Or is it just a typo? See the screenshot of the table below.

Pre-trained models are not accessible

Thanks for sharing the repo. However, I could not access the per-trained base and large models from the below paths.

https://github.com/microsoft/DeBERTa/releases/download/v0.1/base.zip
https://github.com/microsoft/DeBERTa/releases/download/v0.1/large.zip

DeBERTa-MT and code for NLG experiments

Hi, in appendix 4 of the paper you describe results on NLG tasks and a DeBERTa-MT base model pretrained on wikitext-103, using unilm task formulations.

Are you going to release this model and the code for the associated experiments?

Thanks

Does any plan to release the pretrain code?

Is there any plan to open source pre-trained model code that can train deberta in different languages.

HTTP Error 403: Forbidden when downloading glue_tasks

This error occurs when running the setup_glue_data() function inside any Glue Task script, like qqp_large.sh.
The error resides on the original script download_glue_data.py, probably the token is no longer valid, but this directly affects this repo making it not possible to reproduce the results with the Glue Tasks.

Error log:

Downloading and extracting QQP...
Traceback (most recent call last):
  File "<stdin>", line 172, in <module>
  File "<stdin>", line 168, in main
  File "<stdin>", line 57, in download_and_extract
  File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

A question about the Attention map

Hi,

I have a question about the Attention Map. What is the meaning of the x-axis and y-axis? Does that mean the Attention head?

Kind regards,
Qiming

how to run the GLUE diagnostics dataset?

I don't understand what results of this task(AX.tsv) should be submitted. Look forward to any reply~

Speed of DeBERTa seems disappointing

The training or inference speed of deberta is much slower than roberta. Is there any official benchmark on model speed?
And I wonder if the extra computation is inescapable, is it more rewarding that I just increase the number of transformer layers instead of using deberta? The idea of deberta still makes sense of course, but making model deeper is very reliable and more considerable if I expect a better model performance.

[bug] incomplete code

In deberta.mlm, MaskedLayerNorm is not imported from deberta.ops, and PreLayerNorm is undefined.

And I'm not sure if deberta.mlm contains codes for pretraining?

Questions about the dataset?

As mentioned in the paper, the DeBERT model has good performance, but the data sets are all for English data sets. I wonder if the author has tried to apply DeBERT to the processing of Chinese NLP task, and then make a comparison with RoBERTa?

T5 11B mode

With DeBERTa 1.5B model, we suppass T5 11B mode

should be "surpass"

Enhanced Masked Decoding

Can you explain where in the codebase is the enhanced masked decoding happening?

DeBERTa v2: loading example code from huggingface: TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

On Colab, I did:

!pip install transformers

Then:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-xxlarge-v2")

model = AutoModel.from_pretrained("microsoft/deberta-xxlarge-v2")

But I got this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-21a010bd2ed3> in <module>()
      1 from transformers import AutoTokenizer, AutoModel
      2 
----> 3 tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-xxlarge-v2")
      4 
      5 model = AutoModel.from_pretrained("microsoft/deberta-xxlarge-v2")

4 frames
/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    386             else:
    387                 if tokenizer_class_py is not None:
--> 388                     return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    389                 else:
    390                     raise ValueError(

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1767 
   1768         return cls._from_pretrained(
-> 1769             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1770         )
   1771 

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1839         # Instantiate tokenizer.
   1840         try:
-> 1841             tokenizer = cls(*init_inputs, **init_kwargs)
   1842         except OSError:
   1843             raise OSError(

/usr/local/lib/python3.6/dist-packages/transformers/models/deberta/tokenization_deberta.py in __init__(self, vocab_file, do_lower_case, unk_token, sep_token, pad_token, cls_token, mask_token, **kwargs)
    540         )
    541 
--> 542         if not os.path.isfile(vocab_file):
    543             raise ValueError(
    544                 "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "

/usr/lib/python3.6/genericpath.py in isfile(path)
     28     """Test whether a path is a regular file"""
     29     try:
---> 30         st = os.stat(path)
     31     except OSError:
     32         return False

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

"deberta-v2-xxlarge"-Model not working!

I do:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xxlarge")
model = AutoModel.from_pretrained("microsoft/deberta-v2-xxlarge")

But always the same error occurs:
config_class = CONFIG_MAPPING[config_dict["model_type"]]
KeyError: 'deberta-v2'

Appreciate your help!

Question about the hyperparameters of SuperGlue

What are the parameters of Deberta finetune superglue for each task, such as batch, GPU cards, learning rate, etc.?
I couldn't find the detailed parameters of each task in SuperGLue in the paper.

How can I reproduce Deberta's results on SuperGlue leaderboard?

How to evaluate squad

evlate squad and suqad2.0. for both deberta and deberta_v2

is this a bug? in disentangled_attention.py pos_query_layer's dimension is 3, when use p2p attention and this code:\n pos_query = pos_query_layer[:,:,att_span:,:] \n get IndexError: too many indices for tensor of dimension 3

in disentangled_attention.py
pos_query_layer's dimension is 3,
but when select p2p attention,
this code get IndexError

pos_query = pos_query_layer[:,:,att_span:,:]

test code:

##########################################
import os
os.chdir('F:\WorkSpace\DeBERTa-master')

import numpy as np

from future import absolute_import
from future import division
from future import print_function

import torch
from torch.nn import CrossEntropyLoss
from torch import optim

import math
import pdb

from DeBERTa.deberta import *
from DeBERTa.utils import *

from DeBERTa.deberta.config import ModelConfig
from DeBERTa.apps.models.sequence_classification import SequenceClassificationModel

import os
import time

import warnings
warnings.filterwarnings('ignore')
##########################################
##from FocalLoss import FocalLoss
##from DualFocalLoss import Dual_Focal_loss
##from circle_loss import convert_label_to_similarity,CircleLoss
##from classify import Classify
import itertools

##criterion_circle = CircleLoss(m=0.25, gamma=256)
####criterion_focal_loss = FocalLoss(gamma=2.0, alpha=0.25, size_average=True)
##criterion_focal_loss = FocalLoss(gamma=2.0, alpha=0.25, size_average=False)
##criterion_dual_focal_loss = Dual_Focal_loss(ignore_lb=255, eps=1e-5, reduction='mean')
####classifier=Classify(520,28,2)
##classifier=Classify(8,16,2)
#######################################
Data_dim=32
Data_S_npy_d5=138

config_dict={
"hidden_size" : 32,
"num_hidden_layers" : 3,
"num_attention_heads" : 8,
"hidden_act" : "gelu",
"intermediate_size" : 128,
"hidden_dropout_prob" : 0.1,
"attention_probs_dropout_prob" : 0.1,
"max_position_embeddings" : 65,
"type_vocab_size" : 0,
"initializer_range" : 0.02,
"layer_norm_eps" : 1e-7,
"padding_idx" : 0,
"vocab_size" : 68,
"relative_attention" : True,
"max_relative_positions" : 11,
"position_buckets" : 8,
"position_biased_input" : True,
"pos_att_type" : "p2c|c2p|p2p" ##why p2p not work?
}
config_my=ModelConfig.from_dict(config_dict)
DeBERTa_c=SequenceClassificationModel(config_my, num_labels=2, drop_out=None, pre_trained=None)

DeBERTa_c.double()
DeBERTa_c.train()
DeBERTa_c_optim = optim.Adam(DeBERTa_c.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, amsgrad=False)
##optimizer_chain = optim.Adam(itertools.chain(DeBERTa_c.parameters(), classifier.parameters()), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, amsgrad=False)

device=torch.device('cpu')
##batch_x_mark=None,batch_y_mark=None
train_loss = []
iter_count=0
iter_count += 1
device=torch.device('cpu')
start_time = time.time()
DeBERTa_c_optim.zero_grad()
batch_x=np.random.randint(1,66,size=(64,65),dtype=int)
batch_y=np.random.rand(64,2)
batch_x=torch.from_numpy(batch_x)
batch_y=torch.from_numpy(batch_y[:,1]).ge(0.5).long()
batch_x = batch_x.double().to(device)
logits,loss=DeBERTa_c(batch_x, type_ids=None, input_mask=None, labels=batch_y, position_ids=None)

got this

F:\WorkSpace\DeBERTa-master\DeBERTa\deberta\disentangled_attention.py in disentangled_attention_bias(self, query_layer, key_layer, relative_pos, rel_embeddings, scale_factor)

169 if 'p2p' in self.pos_att_type:

170 print("pos_query_layer.shape:",pos_query_layer.shape)

--> 171 pos_query = pos_query_layer[:,:,att_span:,:]

172 p2p_att = torch.matmul(pos_query, pos_key_layer.transpose(-1, -2))

173 p2p_att = p2p_att.expand(query_layer.size()[:2] + p2p_att.size()[2:])

IndexError: too many indices for tensor of dimension 3

how to evaluate DeBERTa on SUPER_GLUE benchmark?

Hi, I see you guys updated the score of DeBERTa on SuperGlue Leaderboard, would you update the SuperGlue codes as soon?
thanks a lot :)

lunch, did you mean launch?

To run our code on multiple GPUs, you must OMP_NUM_THREADS=1 before lunch our training code

Is the Decoder like the Transformer Decoder, or just a layer?

As the title mentioned, I'm not sure that should we need to mask the future tokens just like the Transformer did in the Decoder?

I didn't find any answer in the paper or code. Is anyone who knows that? thanks.

How to train

I use the following command

python3 -m DeBERTa.apps.run --task_name $task --do_train  \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \  
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128

But I got errors

Traceback (most recent call last):
  File "/content/drive/MyDrive/DeBERTa/DeBERTa/DeBERTa/apps/run.py", line 449, in <module>
    main(args)
  File "/content/drive/MyDrive/DeBERTa/DeBERTa/DeBERTa/apps/run.py", line 268, in main
    tokenizer = tokenizers[vocab_type](vocab_path)
  File "/content/drive/MyDrive/DeBERTa/DeBERTa/DeBERTa/deberta/gpt2_tokenizer.py", line 61, in __init__
    self.gpt2_encoder = torch.load(vocab_file)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 235, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 220, in __init__
    _check_seekable(buffer)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 311, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 304, in raise_err_msg
    raise type(e)(msg)
AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

I found vocab_path is None, anyone can tell me how to set vocab_path

want some examples for pre-training from scratch

Questions about pretraining dataset preparation

For openwebtext dataset, there seem to be two sources. Which source was used?
1. Download dataset directly from : https://skylion007.github.io/OpenWebTextCorpus/
2. Download dataset from the urls given in this document https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext
Was the deduplication done on urls or by doing LSH on documents?
Did you cleanup the dataset as per section “Prepare the data for GPT-2 training” step 1 in this doc? Could you provide any extra cleanup steps that you performed on the dataset?

Is the STS-b fine-tuned model available somewhere to download?

Or are the STS-b results from the already released Large model?

Missing PreLayerNorm code

In mlm.py() , PreLayerNorm is used in module MLMPredictionHead. However, PreLayerNorm code is missing in the github. Can you take a look at this issue? Thanks.

colab

Hi can you please add a google colab for inference

Is there a plan for Chinese Model?

Hi,
Do you have a plan to release the pre-training code?

and, the Disentangled Attention seems more suitable for non-segmented languages, like Japanese, Chinese.
So, as the title says...

thanks for the nice job.

Why mlm task can't be registed

How to get sentence embedding from DeBERTa for SNLI Dataset?

Hi,
I want a single vector from hugging face DeBERTA-base model for doing semantic similarity....
Can i use the first vector of DeBERTa to represent the whole sentence OR Can i use the sum of first and last vector to represent sentence embedding???
Can anybody guide me how to use deberta to get the accuracy on snli dataset??

Thanks

Is there Chinese version？

It is a great job. Would you plan to release Chinese version models?

RuntimeError: Index tensor must have the same number of dimensions as input tensor

An error occurred while run in class DisentangledSelfAttention.forward() where query_states.size(1) > hidden_states.size(1):

https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/deberta/disentangled_attention.py
line 165: p2c_att = torch.gather(p2c_att, dim=-2, index=pos_index.expand(p2c_att.size()[:2] + (pos_index.size(-2), key_layer.size(-2))))

DeBERTa base different performance numbers

Hi, While going through the research paper I found the performance number of the DeBERTa base model on the MNLI task at 2 different locations with different values.

And while I tried to reproduce the numbers by finetuning on MNLI task, I got these numbers.
MNLI Matched - 86.8; MNLI mismatched - 86.3

My Hyperparameters:
python run_glue.py
--model_name_or_path $MODEL_NAME
--task_name $TASK_NAME
--do_train --do_eval
--train_file $GLUE_DIR/$TASK_NAME/$train_file
--validation_file $GLUE_DIR/$TASK_NAME/$validation_file
--test_file $GLUE_DIR/$TASK_NAME/$test_file
--max_seq_length 128
--per_device_train_batch_size 64
--per_device_eval_batch_size 128
--learning_rate 2e-5
--num_train_epochs 6.0
--output_dir $OUTPUT_DIR
--logging_dir $LOG_DIR
--logging_steps $logging_steps
--save_total_limit 2
--save_steps 1000
--warmup_steps 100
--gradient_accumulation_steps 1
--overwrite_output_dir
--evaluation_strategy epoch \

Can you provide some details on hyperparameters used by you while finetuning and which performance number to be considered?

Is SiFT included?

Is SiFT(SCALE INVARIANT FINE-TUNING) module included in this repository?

About Enhanced Masked Decoding

I cannot find the coding about Enhanced Masked Decoding, can you explain where is the related coding?

can't load v1 model

self.deberta = deberta.DeBERTa(pre_trained='base')

when pre_trained='base','larget','xlarge', throw
Traceback (most recent call last):
File "/home/v-weishengli/Downloads/pycharm-community-2020.2.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1448, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/v-weishengli/Downloads/pycharm-community-2020.2.2/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/v-weishengli/mydeberta/debertadeberta.py", line 2, in
model=deberta.DeBERTa(pre_trained='large')
File "/home/v-weishengli/mydeberta/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 57, in init
self.apply_state(state)
File "/home/v-weishengli/mydeberta/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 147, in apply_state
current[c] = state[key_match(c, state.keys())]
File "/home/v-weishengli/mydeberta/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 143, in key_match
assert len(c)==1, c
AssertionError: []

xlargev2 normal.

where is the absolute position embeddings?

The paper says you add the absolte position embeddings after all Transformer layers, before softmax layer for MLM, however, I could not find these parameters.

looking forward to your response.
Thank you

microsoft / deberta Goto Github PK

deberta's Introduction

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

News

03/18/2023

12/8/2021

11/16/2021

3/31/2021

2/03/2021

What's new in v2

12/29/2020

06/13/2020

Introduction to DeBERTa

Pre-trained Models

Note

Try the model

Requirements

Use docker

Use pip

Install as a pip package

Use DeBERTa in existing code

Run DeBERTa experiments from command line

Notes

Experiments

Fine-tuning on NLU tasks

Fine-tuning on XNLI

Notes.

Pre-training with MLM and RTD objectives

Contacts

Citation

deberta's People

Contributors

Stargazers

Watchers

Forkers

deberta's Issues

in disentangled_attention.py pos_query_layer's dimension is 3, but when select p2p attention, this code get IndexError

pos_query = pos_query_layer[:,:,att_span:,:]

test code:

got this

F:\WorkSpace\DeBERTa-master\DeBERTa\deberta\disentangled_attention.py in disentangled_attention_bias(self, query_layer, key_layer, relative_pos, rel_embeddings, scale_factor)

169 if 'p2p' in self.pos_att_type:

170 print("pos_query_layer.shape:",pos_query_layer.shape)

--> 171 pos_query = pos_query_layer[:,:,att_span:,:]

172 p2p_att = torch.matmul(pos_query, pos_key_layer.transpose(-1, -2))

173 p2p_att = p2p_att.expand(query_layer.size()[:2] + p2p_att.size()[2:])

IndexError: too many indices for tensor of dimension 3

Recommend Projects

Recommend Topics

Recommend Org

in disentangled_attention.py
pos_query_layer's dimension is 3,
but when select p2p attention,
this code get IndexError