Hello. Thank you for the great work and for sharing the code. I was trying to reproduce the scores on the 'sst2' dataset reported in the paper, using the learned concepts from the Google Drive link (folder tune-train) and running the following script:
TRAIN_METHOD=direct
TEST_METHOD=direct
LR=1e-2
N_PREFIX=10
DATASET=glue-sst2
TRAIN_TASK=tune
SPLIT=train
MODEL=gpt2-large
TRAIN_SIZE=100
STEP=100000
K=4
DIFFICULTY=concept_calibrated
CUDA_VISIBLE_DEVICES=5 python test.py \
    --dataset $DATASET \
    --gpt $MODEL \
    --method $TEST_METHOD \
    --test_batch_size 16 \
    --out_dir out/$MODEL \
    --k $K \
    --embedding_dir embeddings/ \
    --use_demonstrations \
    --concept_temperature 50 \
    --similarity_temperature 0.1 \
    --train_size $TRAIN_SIZE \
    --difficulty $DIFFICULTY \
    --n_prefix_tokens $N_PREFIX \
    --concept_dir $DIFFICULTY-$K/gpt2-large/$TRAIN_TASK-$SPLIT-$TRAIN_SIZE/$DATASET-$TRAIN_METHOD-prefix=$N_PREFIX-lr=$LR-$STEP \
    --prefix_embed_file checkpoints/gpt2-large/$TRAIN_TASK-$SPLIT/prefix={$N_PREFIX}-{$TRAIN_METHOD}-lr={$LR}-initByVocab/soft_embeddings-$STEP.pt \
    --prior easiest \
    --reorder
    # --prior most_similar
I've verified that test_prefix.sh reproduces the scores reported as "optimal" in your paper, so I don't think the problem lies with the learned soft tokens themselves.
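One detail I'm unsure about and wanted to flag: in the --prefix_embed_file path, patterns like `{$N_PREFIX}` expand in bash with literal braces (e.g. `prefix={10}`), whereas `${N_PREFIX}` expands to the bare value. If the training script saved checkpoints without braces in the directory name, the path would not resolve. A minimal illustration of the difference (using the same variable values as above):

```shell
# Illustration only: {$VAR} keeps literal braces, ${VAR} does not.
LR=1e-2
N_PREFIX=10
echo "prefix={$N_PREFIX}-lr={$LR}"   # -> prefix={10}-lr={1e-2}
echo "prefix=${N_PREFIX}-lr=${LR}"   # -> prefix=10-lr=1e-2
```

If the checkpoint directories on disk were created by a training script using the same `{$VAR}` pattern, this is harmless; otherwise it could explain a silent path mismatch.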