Thanks for reporting, and hm, yes, this is weird. I can reproduce it:
Pretraining
litgpt pretrain \
--model_name pythia-14m \
--tokenizer_dir checkpoints/EleutherAI/pythia-14m \
--out_dir my_test_dir \
--data TextFiles \
--data.train_data_path custom_pretraining_data \
--train.max_tokens 10_000
...
Seed set to 42
Time to instantiate model: 0.13 seconds.
Total parameters: 14,067,712
Validating ...
Measured TFLOPs: 0.10
Saving checkpoint to '/teamspace/studios/this_studio/my_test_dir/final/lit_model.pth'
Training time: 24.14s
Memory used: 1.44 GB
Continued Pretraining
litgpt pretrain \
--model_name pythia-14m \
--tokenizer_dir checkpoints/EleutherAI/pythia-14m \
--out_dir my_test_dir_2 \
--data TextFiles \
--data.train_data_path custom_pretraining_data \
--train.max_tokens 10_000 \
--initial_checkpoint_dir /teamspace/studios/this_studio/my_test_dir/final/
RuntimeError: Error(s) in loading state_dict for GPT:
Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.norm_1.bias", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.attn.bias", "transformer.h.0.attn.proj.weight", "transformer.h.0.attn.proj.bias", "transformer.h.0.norm_2.weight", "transformer.h.0.norm_2.bias", "transformer.h.0.mlp.fc.weight",
...
ls /teamspace/studios/this_studio/my_test_dir/final
config.json generation_config.json hyperparameters.yaml lit_model.pth model_config.yaml tokenizer.json tokenizer_config.json
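One way to narrow this down would be to inspect the checkpoint's top-level keys directly. Below is a minimal diagnostic sketch (my own, not from the repo; the path is the one from the run above). The "Missing key(s)" error suggests the file may contain the full training state, with the weights nested under a wrapper key such as "model", rather than a bare state dict with keys like "lm_head.weight".
import torch

# Load the checkpoint produced by the pretraining run above
ckpt = torch.load(
    "/teamspace/studios/this_studio/my_test_dir/final/lit_model.pth",
    map_location="cpu",
)

# A bare state dict would show keys like "lm_head.weight" and
# "transformer.wte.weight"; a wrapper key such as "model" would mean
# the full training state was saved instead.
print(list(ckpt.keys())[:10])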
It did work a few months ago when I tested this for the tutorials, and I don't have a good explanation at the moment for why it would fail. Either I am doing something incorrectly above, or something has recently changed that's causing this. I will have to think more about this ...
Have you seen this before, @awaelchli or @carmocca?
Finetuning
Finetuning seems to work fine for me, though:
litgpt finetune full \
--checkpoint_dir /teamspace/studios/this_studio/my_test_dir/final \
--train.max_seq_length 64 \
--train.max_steps 5
...
Epoch 1 | iter 73 step 4 | loss train: 10.978, val: n/a | iter time: 15.70 ms
Epoch 1 | iter 74 step 4 | loss train: 10.972, val: n/a | iter time: 15.56 ms
Epoch 1 | iter 75 step 4 | loss train: 10.967, val: n/a | iter time: 15.70 ms
Epoch 1 | iter 76 step 4 | loss train: 10.960, val: n/a | iter time: 16.08 ms
Epoch 1 | iter 77 step 4 | loss train: 10.961, val: n/a | iter time: 16.31 ms
Epoch 1 | iter 78 step 4 | loss train: 10.957, val: n/a | iter time: 16.12 ms
Epoch 1 | iter 79 step 4 | loss train: 10.944, val: n/a | iter time: 15.83 ms
Epoch 1 | iter 80 step 5 | loss train: 10.931, val: n/a | iter time: 18.52 ms (step)
Training time: 20.99s
Memory used: 0.31 GB
So, I am thinking the generated checkpoint file itself is fine; the problem is more likely in how the checkpoint is loaded in the pretraining script.
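If the weights do turn out to be nested under a "model" key (again, an unverified assumption), a minimal workaround sketch would be to unwrap the state dict before loading it, along these lines:
import torch
from litgpt.config import Config
from litgpt.model import GPT

state_dict = torch.load(
    "/teamspace/studios/this_studio/my_test_dir/final/lit_model.pth",
    map_location="cpu",
)
# Unwrap the hypothetical "model" wrapper key if present; otherwise
# use the dict as-is.
state_dict = state_dict.get("model", state_dict)

model = GPT(Config.from_name("pythia-14m"))
model.load_state_dict(state_dict)
That said, the proper fix presumably belongs in the pretraining script's checkpoint-loading path rather than in user code.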