
Comments (2)

markus-eberts avatar markus-eberts commented on September 2, 2024

Hi,
your understanding of our training process is mostly correct. Some corrections:

If (true), tuning the hyperparameter (epochs in train.conf) is useless or meaningless

We only tuned some hyperparameters on the CoNLL04 development set (the learning rate and especially the relation threshold). We ended up using the same learning rate as in the original BERT paper (5e-5), which also works well in our other projects. So the only parameter that was really tuned on the development set was the relation threshold (and we tuned it only on the CoNLL04 development set, since we found the threshold to work well for the other datasets too). We experienced little to no overfitting on the development set with respect to the number of epochs (note that we also use a learning rate schedule). The model already achieves similar performance after just a few epochs (3-5), and training it for longer only improves performance slightly. We just settled on 20 epochs here, but we also achieve similar results with a higher number (e.g. 40 epochs).
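The threshold tuning described above can be sketched as a simple sweep over candidate values, picking the one with the best F1 on the development set. This is a minimal illustration, not SpERT's actual code; the function and argument names are hypothetical.

```python
# Hedged sketch: sweep candidate relation thresholds on the dev set and
# keep the one with the best F1. dev_scores are predicted relation
# confidences, dev_labels are gold 0/1 labels (hypothetical inputs).
def tune_threshold(dev_scores, dev_labels, candidates):
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in dev_scores]
        tp = sum(1 for p, g in zip(preds, dev_labels) if p and g)
        fp = sum(1 for p, g in zip(preds, dev_labels) if p and not g)
        fn = sum(1 for p, g in zip(preds, dev_labels) if not p and g)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:  # keep the first threshold reaching the best F1
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

As the comment notes, a threshold tuned this way on CoNLL04's dev set was then reused unchanged for the other datasets.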

[...] then each time (one epoch) the model is trained on the train and dev datasets (train_dev.json), the newly trained model is tested once on the test set (test.json).
Finally, across all training epochs, the model with the best performance on the test set is saved as the final model, and the highest metric values on the test dataset are reported in the paper.

Of course we do not apply early stopping on the test dataset. We just train the model on the combined train and dev set and then (after it has been trained for 20 epochs) evaluate it on the test dataset. We repeat this 5 times and report the averaged results. Note that most other papers do not state whether experiments were averaged over multiple runs (or whether just the best of x runs was reported, which can also make a large difference).
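The protocol above (train for a fixed number of epochs, evaluate once on test, repeat, and average) can be sketched as follows. `train_and_eval` is a hypothetical stand-in for the actual training/evaluation code, not part of SpERT's API.

```python
# Hedged sketch of the evaluation protocol described above: run the full
# training + single test evaluation n_runs times (e.g. with different
# seeds) and report the mean and standard deviation of the test scores.
import statistics

def averaged_test_score(train_and_eval, n_runs=5):
    scores = [train_and_eval(seed=s) for s in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting the mean (rather than the best single run) is what makes results comparable across papers, which is the point the comment makes.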

If all the baseline methods perform the same operation (adding the validation set dev.json to the training set train.json to form a new dataset train_dev.json for training the model), it may be relatively fair.

There are others who also used the combined train+dev set, for example the highly cited work by Bekoulis et al. ("Joint entity recognition and relation extraction as a multi-head selection problem"). For many other papers (also on other datasets), we do not know whether the combined set was used, since many prior papers did not report their training/dev/test split (and preprocessing) and/or did not disclose their code on GitHub. There are also no official dev sets for CoNLL04 and ADE. Moreover, training the model on the combined set only makes a larger difference for CoNLL04 and has only a small effect on SciERC. In all cases, it does not affect any state-of-the-art claims.

[...] it may be relatively fair.

By combining and re-training the model on train+dev, we essentially decided not to use early stopping on the development set (since we experienced no overfitting) and to rather use it as additional training data. I think both approaches (early stopping or combination) have their pros and cons, depending on the circumstances.
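Merging the splits as described above amounts to concatenating the two JSON files into a new training file. A minimal sketch, using the file names mentioned in this thread (train.json, dev.json, train_dev.json) and assuming each file holds a JSON list of documents (that layout is an assumption, not confirmed here):

```python
# Hedged sketch: concatenate the train and dev splits into one training
# file. Assumes each split is a JSON array of documents.
import json

def merge_splits(train_path, dev_path, out_path):
    with open(train_path) as f:
        docs = json.load(f)
    with open(dev_path) as f:
        docs += json.load(f)
    with open(out_path, "w") as f:
        json.dump(docs, f)
    return len(docs)  # number of documents in the merged file
```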

from spert.

markus-eberts avatar markus-eberts commented on September 2, 2024

Please leave a comment if you have additional questions.


