I am trying to fine-tune llama2-13b on a single 3090, but the run fails when it reaches `model.save_pretrained(output_merged_dir, safe_serialization=True)`.
Any thoughts or suggestions on what I can try to get this model to merge and save? (I have had some success before with 7b and 13b, so I am not sure what changed.)
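For context, the merge/save step looks roughly like this. This is a minimal sketch, not my exact script: the base model id and dtype are assumptions, while the checkpoint paths are the ones that appear in the log below.

```python
# Rough sketch of the merge/save step that fails (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # assumed base model id
    torch_dtype=torch.bfloat16,    # assumed dtype
)
model = PeftModel.from_pretrained(base, "output-13-4096a/final_checkpoints")
model = model.merge_and_unload()   # fold the LoRA deltas into the base weights

output_merged_dir = "output-13-4096a/final_merged_checkpoint"
model.save_pretrained(output_merged_dir, safe_serialization=True)  # <- fails here
```

The full run log and traceback follow: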
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/gpu/anaconda3/envs/nightly did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /home/gpu/anaconda3/pkgs/cuda-cudart-11.7.99-0/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
================================================================================
Your GPU supports bfloat16, you can accelerate training with the argument --bf16
================================================================================
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/transformers/modeling_utils.py:2193: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.89s/it]
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
0%| | 0/10000 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/torch/utils/checkpoint.py:391: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'train_runtime': 20.9561, 'train_samples_per_second': 1908.751, 'train_steps_per_second': 477.188, 'train_loss': 4.5660131872326074e-05, 'epoch': 0.66}
10001it [00:20, 477.18it/s]
Merging and pushing weights
output-13-4096a/final_checkpoints
Loading model for merging
Loading checkpoint shards: 100%|██████████| 3/3 [00:12<00:00, 4.18s/it]
Merging and unloading weights
Saving merged weights
Saving to output-13-4096a/final_merged_checkpoint
Removed shared tensor {'model.layers.32.self_attn.o_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.36.self_attn.v_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.32.mlp.up_proj.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.39.mlp.up_proj.weight', 'model.layers.38.self_attn.k_proj.weight', 'model.layers.37.self_attn.q_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.33.self_attn.q_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.35.mlp.up_proj.weight', 'model.layers.37.input_layernorm.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.39.post_attention_layernorm.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.38.post_attention_layernorm.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.39.input_layernorm.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.28.post_attention_layernorm.weight', 'model.layers.31.input_layernorm.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.29.input_layernorm.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.39.self_attn.o_proj.weight', 'model.layers.36.mlp.down_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.36.input_layernorm.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.norm.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.33.self_attn.o_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.32.self_attn.v_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.36.self_attn.q_proj.weight', 'model.layers.24.mlp.up_proj.weight', 
'model.layers.14.post_attention_layernorm.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.30.post_attention_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.28.input_layernorm.weight', 'model.layers.34.mlp.up_proj.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.33.post_attention_layernorm.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.37.post_attention_layernorm.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.35.self_attn.o_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.32.mlp.gate_proj.weight', 'model.layers.33.mlp.gate_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.33.mlp.down_proj.weight', 'model.layers.33.mlp.up_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.38.self_attn.o_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.34.self_attn.k_proj.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.37.mlp.up_proj.weight', 'model.layers.39.self_attn.q_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.35.input_layernorm.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.30.input_layernorm.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.38.input_layernorm.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.35.self_attn.k_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.6.post_attention_layernorm.weight', 
'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.36.mlp.up_proj.weight', 'model.layers.38.mlp.down_proj.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.33.input_layernorm.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.39.mlp.gate_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.37.self_attn.o_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.8.post_attention_layernorm.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.32.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.37.mlp.gate_proj.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.32.input_layernorm.weight', 'model.layers.34.mlp.gate_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.33.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.34.self_attn.q_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.35.post_attention_layernorm.weight', 'model.layers.34.self_attn.v_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.10.input_layernorm.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.35.self_attn.q_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.34.post_attention_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.39.mlp.down_proj.weight', 'model.layers.34.self_attn.o_proj.weight', 
'model.layers.38.self_attn.v_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.36.self_attn.k_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.32.self_attn.k_proj.weight', 'model.layers.39.self_attn.v_proj.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.35.mlp.down_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.33.self_attn.k_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.36.post_attention_layernorm.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.35.self_attn.v_proj.weight', 'model.layers.32.post_attention_layernorm.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.35.mlp.gate_proj.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.37.self_attn.k_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.37.mlp.down_proj.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.32.mlp.down_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.38.mlp.up_proj.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.37.self_attn.v_proj.weight', 'model.layers.38.self_attn.q_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.38.mlp.gate_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.29.post_attention_layernorm.weight', 'model.layers.39.self_attn.k_proj.weight', 'model.layers.34.mlp.down_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.36.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 
'model.layers.19.self_attn.q_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.36.self_attn.o_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.34.input_layernorm.weight', 'model.layers.30.self_attn.o_proj.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
Traceback (most recent call last):
File "/home/gpu/code/llama-recipes/f3.py", line 245, in <module>
model.save_pretrained(output_merged_dir, safe_serialization=True)
File "/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/transformers/modeling_utils.py", line 1845, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/safetensors/torch.py", line 232, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
^^^^^^^^^^^^^^^^^
File "/home/gpu/anaconda3/envs/nightly/lib/python3.11/site-packages/safetensors/torch.py", line 394, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'model.layers.1.input_layernorm.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
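Based on the error message, `lm_head.weight` ends up sharing memory with several layer weights after the merge. Two workarounds I am considering are sketched below; these are untested sketches, not verified fixes. Would either of them produce a correct merged checkpoint?

```python
import os
from safetensors.torch import save_model

# Option A: skip safetensors entirely and write a regular PyTorch checkpoint
# (pytorch_model.bin shards instead of .safetensors files).
model.save_pretrained(output_merged_dir, safe_serialization=False)

# Option B: use safetensors' save_model, as the error message suggests;
# it handles shared tensors by deduplicating them before serialization.
# Note: this writes only the weights file, so config.json, the tokenizer,
# etc. would still need to be saved separately.
save_model(model, os.path.join(output_merged_dir, "model.safetensors"))
```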