I am trying to fine-tune llama2-13b on a single 3090, but the run fails when it reaches `model.save_pretrained(output_merged_dir, safe_serialization=True)`.
Any thoughts or suggestions on what I can try to get this model to merge and save? (I have had some success before with 7b and 13b, so I am not sure what changed.)
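For context, the merge/save step looks roughly like this. This is a minimal sketch, not my exact script: the base model id and dtype are assumptions, while the checkpoint paths are the ones that appear in the log below.

```python
# Rough sketch of the merge/save step that fails (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # assumed base model id
    torch_dtype=torch.bfloat16,    # assumed dtype
)
model = PeftModel.from_pretrained(base, "output-13-4096a/final_checkpoints")
model = model.merge_and_unload()   # fold the LoRA deltas into the base weights

output_merged_dir = "output-13-4096a/final_merged_checkpoint"
model.save_pretrained(output_merged_dir, safe_serialization=True)  # <- fails here
```

The full run log and traceback follow: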
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/gpu/anaconda3/envs/nightly did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /home/gpu/anaconda3/pkgs/cuda-cudart-11.7.99-0/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
================================================================================
Your GPU supports bfloat16, you can accelerate training with the argument --bf16
================================================================================
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/transformers/modeling_utils.py:2193: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.89s/it]
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
0%| | 0/10000 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/torch/utils/checkpoint.py:391: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'train_runtime': 20.9561, 'train_samples_per_second': 1908.751, 'train_steps_per_second': 477.188, 'train_loss': 4.5660131872326074e-05, 'epoch': 0.66}
10001it [00:20, 477.18it/s]
Merging and pushing weights
output-13-4096a/final_checkpoints
Loading model for merging
Loading checkpoint shards: 100%|██████████| 3/3 [00:12<00:00, 4.18s/it]
Merging and unloading weights
Saving merged weights
Saving to output-13-4096a/final_merged_checkpoint
Removed shared tensor {'model.layers.32.self_attn.o_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.36.self_attn.v_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.32.mlp.up_proj.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.39.mlp.up_proj.weight', 'model.layers.38.self_attn.k_proj.weight', 'model.layers.37.self_attn.q_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.33.self_attn.q_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.35.mlp.up_proj.weight', 'model.layers.37.input_layernorm.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.39.post_attention_layernorm.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.38.post_attention_layernorm.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.39.input_layernorm.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.28.post_attention_layernorm.weight', 'model.layers.31.input_layernorm.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.29.input_layernorm.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.39.self_attn.o_proj.weight', 'model.layers.36.mlp.down_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.36.input_layernorm.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.norm.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.33.self_attn.o_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.32.self_attn.v_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.36.self_attn.q_proj.weight', 'model.layers.24.mlp.up_proj.weight', 
'model.layers.14.post_attention_layernorm.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.30.post_attention_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.28.input_layernorm.weight', 'model.layers.34.mlp.up_proj.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.33.post_attention_layernorm.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.37.post_attention_layernorm.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.35.self_attn.o_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.32.mlp.gate_proj.weight', 'model.layers.33.mlp.gate_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.33.mlp.down_proj.weight', 'model.layers.33.mlp.up_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.38.self_attn.o_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.34.self_attn.k_proj.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.37.mlp.up_proj.weight', 'model.layers.39.self_attn.q_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.35.input_layernorm.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.30.input_layernorm.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.38.input_layernorm.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.35.self_attn.k_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.6.post_attention_layernorm.weight', 
'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.36.mlp.up_proj.weight', 'model.layers.38.mlp.down_proj.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.33.input_layernorm.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.39.mlp.gate_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.37.self_attn.o_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.8.post_attention_layernorm.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.32.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.37.mlp.gate_proj.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.32.input_layernorm.weight', 'model.layers.34.mlp.gate_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.33.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.34.self_attn.q_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.35.post_attention_layernorm.weight', 'model.layers.34.self_attn.v_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.10.input_layernorm.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.35.self_attn.q_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.34.post_attention_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.39.mlp.down_proj.weight', 'model.layers.34.self_attn.o_proj.weight', 
'model.layers.38.self_attn.v_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.36.self_attn.k_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.32.self_attn.k_proj.weight', 'model.layers.39.self_attn.v_proj.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.35.mlp.down_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.33.self_attn.k_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.36.post_attention_layernorm.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.35.self_attn.v_proj.weight', 'model.layers.32.post_attention_layernorm.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.35.mlp.gate_proj.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.37.self_attn.k_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.37.mlp.down_proj.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.32.mlp.down_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.38.mlp.up_proj.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.37.self_attn.v_proj.weight', 'model.layers.38.self_attn.q_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.38.mlp.gate_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.29.post_attention_layernorm.weight', 'model.layers.39.self_attn.k_proj.weight', 'model.layers.34.mlp.down_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.36.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 
'model.layers.19.self_attn.q_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.36.self_attn.o_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.34.input_layernorm.weight', 'model.layers.30.self_attn.o_proj.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
Traceback (most recent call last):
File "/home/gpu/code/llama-recipes/f3.py", line 245, in <module>
model.save_pretrained(output_merged_dir, safe_serialization=True)
File "/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/transformers/modeling_utils.py", line 1845, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/safetensors/torch.py", line 232, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
^^^^^^^^^^^^^^^^^
File "/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/safetensors/torch.py", line 394, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'model.layers.1.input_layernorm.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
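Based on the error message, `lm_head.weight` ends up sharing memory with several layer weights after the merge. Two workarounds I am considering are sketched below; these are untested sketches, not verified fixes. Would either of them produce a correct merged checkpoint?

```python
import os
from safetensors.torch import save_model

# Option A: skip safetensors entirely and write a regular PyTorch checkpoint
# (pytorch_model.bin shards instead of .safetensors files).
model.save_pretrained(output_merged_dir, safe_serialization=False)

# Option B: use safetensors' save_model, as the error message suggests;
# it handles shared tensors by deduplicating them before serialization.
# Note: this writes only the weights file, so config.json, the tokenizer,
# etc. would still need to be saved separately.
save_model(model, os.path.join(output_merged_dir, "model.safetensors"))
```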