ai2's Introduction

Minimal Code Base For AI2 Commonsense Leaderboard

Dependencies

Install apex if you want to use half precision: https://github.com/NVIDIA/apex. A Conda env file is also included for reference. apex might not be compatible with Conda directly, so you can remove it from the env file before you create an environment.

pip install -r requirements.txt

Train

Modify config.yaml as needed and run python train.py to train a model. The script loads the config file and writes all logs and checkpoints to the outputs directory.

Eval

Get predictions without evaluation

python eval.py \
    --input_x cache/physicaliqa-train-dev/physicaliqa-train-dev/dev.jsonl \
    --config config.yaml \
    --checkpoint outputs/2020-02-26/20-26-22/lightning_logs/version_6341419/checkpoints/_ckpt_epoch_3_v0.ckpt \
    --output pred.lst

Get predictions with evaluation (accuracy, confidence interval)

python eval.py \
    --input_x cache/physicaliqa-train-dev/physicaliqa-train-dev/dev.jsonl \
    --config config.yaml \
    --checkpoint outputs/2020-02-26/20-26-22/lightning_logs/version_6341419/checkpoints/_ckpt_epoch_3_v0.ckpt \
    --input_y cache/physicaliqa-train-dev/physicaliqa-train-dev/dev-labels.lst \
    --output pred.lst
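
The accuracy and bootstrapped confidence interval reported by the evaluation can be illustrated with a minimal sketch. This is not the code in eval.py; it assumes that pred.lst and dev-labels.lst hold one label per line, and the function names, the number of resamples, and the 95% percentile interval are all illustrative choices.

```python
import random
from typing import List, Tuple


def read_labels(path: str) -> List[str]:
    """Read one label per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equally long")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def bootstrap_accuracy(
    preds: List[str],
    golds: List[str],
    n_resamples: int = 1000,  # hypothetical default, not taken from the repo
    alpha: float = 0.05,      # 95% interval
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of accuracy over bootstrap resamples."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equally long")
    rng = random.Random(seed)
    correct = [p == g for p, g in zip(preds, golds)]
    n = len(correct)
    scores = []
    for _ in range(n_resamples):
        # Resample the per-example outcomes with replacement.
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    scores.sort()
    mean = sum(scores) / len(scores)
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean, lower, upper
```

Under those assumptions, calling bootstrap_accuracy(read_labels("pred.lst"), read_labels("dev-labels.lst")) would give numbers comparable in kind to the "Bootstrapped Accuracy Mean" and "Bootstrapped Accuracy CI" columns in the results.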

Results

PIQA

Model	Bootstrapped Accuracy Mean	Bootstrapped Accuracy CI	Accuracy
Roberta large (V100)	77.4	75.7 - 79.4	77.3
Roberta large (K80)	74.0	72.4 - 76.2	74.2

ai2's Issues

transformers should be a pip requirement

model_cache.py requires transformers, so it should be listed in requirements.txt.

ANLI data distribution makes it hard to create internal dev - so it's temporarily ignored

If you look at the original dev data, you will see that every datapoint is distinct. The training set, however, has a lot of repetitions. This makes it infeasible to do a 90-10-10 split.

Potential solutions:

Separate out a non-overlapping dev set from the training data and set up a cross-fold validation experiment.
Use a fraction of the original dev set as the internal dev set for ANLI.
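
A non-overlapping split of this kind can be sketched by grouping repeated examples before splitting, so that no group ends up on both sides. This is only an illustration of the idea, not code from the repository: the grouping key (here the pair of fields "obs1" and "obs2") is an assumption about what repeats in the training data, and grouped_split is a hypothetical helper.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Example = Dict[str, str]


def grouped_split(
    examples: List[Example],
    key: Callable[[Example], Hashable],
    dev_fraction: float = 0.1,  # hypothetical default
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Split examples into (train, dev) so that no group key appears in both."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)

    # Sort first so the shuffle is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # The fraction applies to groups, not to individual examples.
    n_dev = max(1, round(len(keys) * dev_fraction))
    dev = [ex for k in keys[:n_dev] for ex in groups[k]]
    train = [ex for k in keys[n_dev:] for ex in groups[k]]
    return train, dev


def observation_key(ex: Example) -> Tuple[str, str]:
    # Assumed field names; adjust to whatever actually repeats in the data.
    return (ex["obs1"], ex["obs2"])
```

Calling grouped_split several times with different seeds, or rotating which groups form the dev side, would give the folds for a cross-fold validation experiment.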

space of values for MODEL_TYPE, MODEL_WEIGHT not clear to newbs

Some more handholding on the space of legitimate values for those variables would be helpful. Perhaps a guided walkthrough that uses an existing huggingface model would be appropriate. On top of that, a walkthrough with a trivially different model would help, e.g. subclassing a huggingface model under a new name, making a small tweak, and showing how to add that model to train.py/test.py.

Wondering more details about finetuning RoBERTa on PhysicalIQA

Hi @ChenghaoMou ,

Thanks for sharing your implementation details on the benchmarks from AI2. I'm trying to fine-tune RoBERTa on PhysicalIQA and would like to know some more details:

What was the dev accuracy of the model that you submitted to the leaderboard?
What hyper-parameters did you use when training the model?

Thanks again!

Best
Tao
