Following the scripts, I trained a teacher model successfully, generated the extraction

BERT EXTRACTION: Unable to reproduce results on MNLI (closed)

google-research commented on May 17, 2024

BERT EXTRACTION: Unable to reproduce results on MNLI

from language.

Comments (10)

martiansideofthemoon commented on May 17, 2024

Hi Ahmad, thanks for your interest! An accuracy of 31.9% is worse than random guessing (about 33.3% for three-class MNLI). A few questions to help debug this:

1. What is the class distribution of the extracted data?
2. Which scheme did you use, RANDOM or WIKI?
3. What was the dev set accuracy of the teacher model?

ahmadrash commented on May 17, 2024

Thanks a lot for the prompt response.

1. The class distribution is [26.76%, 26.31%, 46.93%], respectively.
2. I used DATA_SCHEME="random_ed_k_uniform".
3. The dev set accuracy of the teacher model is 0.851.
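For readers following along, here is a minimal sketch of how a class distribution like the one above might be computed from the teacher's output probabilities. It assumes the extracted data is available as a list of per-example probability vectors; the function name `class_distribution` and the toy values in `example_probs` are hypothetical and not taken from the repository.

```python
from collections import Counter


def class_distribution(teacher_probs):
    """Return the percentage of examples whose argmax falls in each class."""
    num_classes = len(teacher_probs[0])
    # Hard label = index of the largest teacher probability for that example.
    hard_labels = [
        max(range(num_classes), key=row.__getitem__) for row in teacher_probs
    ]
    counts = Counter(hard_labels)
    total = len(hard_labels)
    return [100.0 * counts[c] / total for c in range(num_classes)]


# Hypothetical teacher outputs for four extraction queries (three MNLI classes).
example_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.1, 0.7],
    [0.3, 0.5, 0.2],
]
distribution = class_distribution(example_probs)
print(distribution)
```

A heavily skewed distribution would be one possible reason for a student to collapse onto a single class, which is presumably why the question was asked.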

martiansideofthemoon commented on May 17, 2024

Hi Ahmad,
1, 2 and 3 look good to me. A few more follow-up questions:

1. I guess you are using BERT-large?
2. Are you using this file to train the student model? https://github.com/google-research/language/blob/master/language/bert_extraction/steal_bert_classifier/models/run_classifier_distillation.py
3. Is the training loss decreasing? (Just confirming that the weight updates are happening.)
4. Does the same script work for SST2 / SQuAD?

ahmadrash commented on May 17, 2024

Thanks Kalpesh,

1. Yes, I am using BERT_large.
2. Yes, I am using that file.
3. I am adding the loss curve from TensorBoard. It shows oscillations.
4. I am still running the other experiments.

martiansideofthemoon commented on May 17, 2024

Regarding your curve: how many epochs are you training for, and what is your batch size? A loss of about 1.1 indicates that nothing is being learnt, but I do see a strong decrease during the first ~10k steps. Also, what are your learning rate, optimizer and learning rate schedule? Finally, what hardware are you using?
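As a side note on why a loss near 1.1 suggests that nothing is being learnt: a model that predicts a uniform distribution over three classes has cross-entropy ln 3 ≈ 1.0986, whatever the (soft or hard) target is. The sketch below illustrates this; the helper `soft_cross_entropy` is a hypothetical name for illustration, not a function from the script, and the soft label is just an example vector.

```python
import math


def soft_cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0
    )


uniform_prediction = [1.0 / 3.0] * 3

# Against a hard label (one-hot target).
hard_label_loss = soft_cross_entropy([0.0, 1.0, 0.0], uniform_prediction)

# Against an arbitrary soft label that sums to 1 (illustrative values).
soft_label_loss = soft_cross_entropy([0.2676, 0.2631, 0.4693], uniform_prediction)

chance_loss = math.log(3)
print(hard_label_loss, soft_label_loss, chance_loss)
```

Both losses equal ln 3 because every class receives the same log-probability and the target weights sum to 1.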

ahmadrash commented on May 17, 2024

I am training it for 3 epochs. I have a batch size of 8 on an NVIDIA V100 GPU. The learning rate, optimizer and schedule are the defaults in the script.

--learning_rate=3e-5
--warmup_proportion=0.1

The optimizer is the same as the default for BERT.

martiansideofthemoon commented on May 17, 2024

I think the batch size might be the issue. Learning is less stable for RANDOM than for the original MNLI data, and smaller batch sizes (hence noisier gradient estimates) could push the model off the optimization path. I'd recommend trying a batch size of 32. If it doesn't fit on the GPU, you could try BERT-base or gradient accumulation.
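To illustrate the gradient accumulation idea mentioned above, here is a toy sketch in plain Python. It uses a one-parameter linear model rather than BERT, and it is not how the repository's script works; the function names and data are hypothetical. It shows that averaging the gradients of four micro-batches of 8 gives the same gradient as a single batch of 32, as long as the micro-batches are equally sized.

```python
def mean_squared_error_grad(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average micro-batch gradients to emulate one large-batch gradient."""
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for micro in micro_batches:
        # In a real training loop, only one micro-batch is in memory here.
        total += mean_squared_error_grad(w, micro)
    return total / len(micro_batches)


# Toy data: 32 points on the line y = 3x + 1.
data = [(float(i), 3.0 * i + 1.0) for i in range(32)]
w = 0.5

full_batch_grad = mean_squared_error_grad(w, data)
accum_grad = accumulated_grad(w, data, micro_batch_size=8)
print(full_batch_grad, accum_grad)
```

The weight update is applied only once per group of micro-batches, so memory use stays at the micro-batch level while the gradient estimate matches the larger batch.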

Another thing you could try is learning rate decay. From your graph, it is clear that the training loss decreases during the warmup phase, but after that the learning rate is too high, and a bad gradient (from a small batch) can derail the optimization. You could also simply try a smaller learning rate, maybe 1e-5.
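For illustration, a schedule with linear warmup followed by linear decay to zero could look like the sketch below. This is a generic example under stated assumptions, not the schedule implemented in the script: the function `learning_rate_at` and the step count are hypothetical, and only the peak rate of 1e-5 and the warmup proportion of 0.1 come from this thread.

```python
def learning_rate_at(step, total_steps, peak_lr=1e-5, warmup_proportion=0.1):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: ramp linearly from the peak down to 0 at the final step.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


total_steps = 1000  # hypothetical number of training steps
schedule = [learning_rate_at(s, total_steps) for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

With a schedule like this, the high learning rate right after warmup lasts only briefly, which limits how far a single noisy gradient can move the weights later in training.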

ahmadrash commented on May 17, 2024

Thanks a lot for the suggestions. I will try these and report back.

ahmadrash commented on May 17, 2024

Thanks Kalpesh! I was able to get 78 on MNLI dev and 90 on SST-2 by reducing the learning rate to 1e-5. The loss curve is still not ideal, but it is much better than what we were seeing before.

Jimntu commented on May 17, 2024

Hi, I am a beginner in deep learning and have little experience implementing code. May I ask how you plotted the loss curve in TensorBoard? I would really appreciate it if you could help me!
