
fsdp_qlora's Introduction

fsdp_qlora

Training LLMs with Quantized LoRA + FSDP.

Read our announcement blog post.

You should treat this script as an alpha/preview release. If you’re not comfortable with testing and debugging models, we’d suggest holding off for a few months while the community more fully tests the approach.

Integrations

FSDP+QLoRA has been integrated into:

Installation

The following steps should work (tested with CUDA 11.7, 11.8, and 12.1):

  • Clone https://github.com/AnswerDotAI/fsdp_qlora
  • pip install llama-recipes fastcore "transformers!=4.38.*,!=4.39.*" --extra-index-url https://download.pytorch.org/whl/test/cu118 as an easy way to get most dependencies (replace cu118 with your desired CUDA version)
  • Install bitsandbytes: pip install "bitsandbytes>=0.43.0"
  • Run huggingface-cli login (to access Llama 2)
  • Optional Libraries:
    • HQQ quantization: follow the HQQ installation instructions. Our training script uses HQQBackend.ATEN_BACKPROP, so also make sure to build the custom kernels with cd hqq/kernels && python setup_cuda.py install.
    • Weights and Biases logging: pip install wandb
  • PyTorch >= 2.2 is recommended to make use of the native Flash Attention 2 kernel.
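
After installing, a quick sanity check along these lines can confirm the versions line up with the requirements above (illustrative only, not part of the repo):

# Illustrative environment check; not part of the fsdp_qlora repo.
import torch
import bitsandbytes as bnb
import transformers

print("PyTorch:", torch.__version__)              # >= 2.2 recommended (native Flash Attention 2)
print("CUDA available:", torch.cuda.is_available())
print("bitsandbytes:", bnb.__version__)           # >= 0.43.0 required
print("transformers:", transformers.__version__)  # 4.38.* and 4.39.* are excluded above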

Finetune Llama-2 70B on Dual 24GB GPUs

Once installed, cd into fsdp_qlora and run the following command to begin finetuning Llama-2 70B on Alpaca at a maximum sequence length of 512 tokens.

python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 512 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true

This example command currently uses just over 128GB of CPU RAM. If you only have 128GB available, we recommend making a 10-20GB swap file to accommodate the initial spike in usage.

Training Options

For quantization we support HQQ and bitsandbytes. We're currently doing benchmarking to help you decide which to use. If you do use bitsandbytes, be sure to pass --reentrant_checkpointing True to avoid triggering a bug in bitsandbytes which results in high memory usage (a fix is in progress).

--train_type full

Full params fine-tuning.

# --world_size is optional; on a single machine it is set automatically
# --master_port is optional; defaults to 12355
export CUDA_VISIBLE_DEVICES=4,5 # optionally set devices
python train.py \
--world_size 2 \
--master_port 12356 \
--model_name meta-llama/Llama-2-7b-hf \
--gradient_accumulation_steps 4 \
--batch_size 8 \
--context_length 512 \
--precision bf16 \
--train_type full \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--dataset alpaca

--train_type lora

LoRA fine-tuning using HF PEFT library.

- --train_type full \
+ --train_type lora \

--train_type custom_lora

LoRA fine-tuning using a custom LoRA module.

- --train_type full \
+ --train_type custom_lora \

--train_type qlora

4-bit quantized LoRA fine-tuning using the bitsandbytes Linear4bit layer with NF4 quantization and the HF PEFT library.

- --train_type full \
+ --train_type qlora \
+ --reentrant_checkpointing true \

--train_type custom_qlora

4-bit quantized LoRA fine-tuning using the bitsandbytes Linear4bit layer with NF4 quantization and a custom LoRA module.

- --train_type full \
+ --train_type custom_qlora \
+ --reentrant_checkpointing true \

--train_type hqq_lora

4-bit quantized LoRA fine-tuning using HQQ library and a custom LoRA module.

- --train_type full \
+ --train_type hqq_lora \

--train_type bnb_dora

4-bit quantized DoRA fine-tuning using the bitsandbytes Linear4bit layer with NF4 quantization and a custom DoRA module.

- --train_type full \
+ --train_type bnb_dora \

--train_type hqq_dora

4-bit quantized DoRA fine-tuning using HQQ library and a custom DoRA module.

- --train_type full \
+ --train_type hqq_dora \

--train_type bnb_llama_pro

4-bit quantized Llama-Pro fine-tuning using the bitsandbytes Linear4bit layer with NF4 quantization.

To create llama-pro weights, run the following command:

python scripts/block_expansion.py \
--model_name meta-llama/Llama-2-7b-hf \
--output_dir /path/to/llama_pro_weights_directory \
--expansion_rate 0.1
- --train_type full \
+ --train_type bnb_llama_pro \
+ --llama_pro_path /path/to/llama_pro_weights_directory \

--train_type hqq_llama_pro

4-bit quantized Llama-Pro fine-tuning using HQQ library.

To create llama-pro weights, run the following command:

python scripts/block_expansion.py \
--model_name meta-llama/Llama-2-7b-hf \
--output_dir /path/to/llama_pro_weights_directory \
--expansion_rate 0.1
- --train_type full \
+ --train_type hqq_llama_pro \
+ --llama_pro_path /path/to/llama_pro_weights_directory \

Low Memory Loading

During quantized LoRA training we use custom quantization and loading code to avoid loading the entire model into GPU memory before sharding it across GPUs. This is the default behavior of our training script when any of the "qlora", "custom_qlora", or "hqq_lora" training options is used. The other training options are already optimized for low-memory loading as far as possible.

We load the weights iteratively, quantize them on the GPU, and place them back on the CPU (rank 0) or the meta device (other ranks), working concurrently a few layers at a time. We do this across all GPUs to initialize the quantization parameters, such as zero and scale, while using sync_module_states=True to sync the model parameters and buffers across all GPUs during FSDP initialization.
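
Conceptually, the loop looks something like the sketch below. This is a simplified illustration of the approach, not the actual train.py implementation; the shard path and function name are assumptions.

# Simplified sketch of low-memory loading: quantize each weight on the GPU,
# then park it on the CPU (rank 0) or the meta device (other ranks).
# Not the actual train.py code.
import torch
import bitsandbytes.functional as bnb_F
from safetensors.torch import load_file

def load_and_quantize_shard(shard_path: str, rank: int):
    quantized = {}
    for name, weight in load_file(shard_path).items():  # weights arrive on CPU
        w_gpu = weight.to("cuda", torch.bfloat16, non_blocking=True)
        w_q, quant_state = bnb_F.quantize_4bit(w_gpu, quant_type="nf4")
        # Rank 0 keeps real tensors on CPU so sync_module_states=True can
        # broadcast them; other ranks only need shapes/dtypes (meta device).
        quantized[name] = w_q.to("cpu") if rank == 0 else w_q.to("meta")
        del w_gpu, w_q  # free GPU memory before the next parameter
    return quantized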

Mixed Precision Training

--precision bf16 (pure bfloat16)

This will cast all the model parameters to torch.bfloat16 before training and won't use FSDP mixed precision. As a result, sharded and unsharded params will be stored in bf16, forward and backward passes will be done in bf16, and gradient reduction and updates will be done in bf16.

--precision fp32 (pure float32)

This will cast all the model parameters to torch.float32 before training and won't use FSDP mixed precision. As a result, sharded and unsharded params will be stored in fp32, forward and backward passes will be done in fp32, and gradient reduction and updates will be done in fp32.

--precision mp_fp16_autocast (mixed float16 with autocast)

This will cast all the model parameters to torch.float32 before training and will use FSDP mixed precision with

mp_policy = MixedPrecision(param_dtype=torch.float32, reduce_dtype=torch.float32, buffer_dtype=torch.float32)

As a result, sharded and unsharded params will be stored in fp32. It will use autocast(torch.float16) for forward and backward passes, and autocast(torch.float16) for gradient reduction and updates.

--precision mp_bf16_autocast (mixed bfloat16 with autocast)

This will cast all the model parameters to torch.float32 before training and will use FSDP mixed precision with

mp_policy = MixedPrecision(param_dtype=torch.float32, reduce_dtype=torch.float32, buffer_dtype=torch.float32)

As a result, sharded and unsharded params will be stored in fp32. It will use autocast(torch.bfloat16) for forward and backward passes, and autocast(torch.bfloat16) for gradient reduction and updates.

--precision mp_bf16_buffers_autocast (bfloat16 params and float32 buffers with autocast)

This will cast all the model parameters to torch.bfloat16 before training but will keep the buffers in torch.float32 and will use FSDP mixed precision with

mp_policy = MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.float32)

As a result, sharded and unsharded params will be stored in bf16. It will use autocast(torch.bfloat16) for forward and backward passes, and autocast(torch.bfloat16) for gradient reduction and updates. Buffers stay in fp32, and only autocast-eligible operations are performed in bf16.

This option is important for the RoPE layer, which gives incorrect results when cast to a lower precision, especially with longer context lengths.
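
To summarize the mapping above in code, a rough sketch (not the exact train.py logic) could look like this:

# Rough sketch of how the --precision options described above map to a cast
# dtype, an FSDP MixedPrecision policy, and an autocast dtype. Illustrative only.
import torch
from torch.distributed.fsdp import MixedPrecision

def precision_config(precision: str):
    if precision == "bf16":   # pure bfloat16, no FSDP mixed precision
        return torch.bfloat16, None, None
    if precision == "fp32":   # pure float32, no FSDP mixed precision
        return torch.float32, None, None
    if precision in ("mp_fp16_autocast", "mp_bf16_autocast"):
        policy = MixedPrecision(param_dtype=torch.float32,
                                reduce_dtype=torch.float32,
                                buffer_dtype=torch.float32)
        autocast_dtype = torch.float16 if "fp16" in precision else torch.bfloat16
        return torch.float32, policy, autocast_dtype
    if precision == "mp_bf16_buffers_autocast":
        policy = MixedPrecision(param_dtype=torch.bfloat16,
                                reduce_dtype=torch.bfloat16,
                                buffer_dtype=torch.float32)  # keep buffers (e.g. RoPE) in fp32
        return torch.bfloat16, policy, torch.bfloat16
    raise ValueError(f"unknown precision: {precision}")

# Cast the model to the first dtype, pass the policy as FSDP(mixed_precision=...),
# and wrap forward/backward in torch.autocast("cuda", dtype=...) when set.
cast_dtype, mp_policy, autocast_dtype = precision_config("mp_bf16_buffers_autocast")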

Comparison to an existing trainer

hf_train.py uses TRL's SFTTrainer for a comparison run. To match with our script, modify the data loading code to train on everything (not just completions) and then run train.py --train_type qlora --dataset guanaco --batch_size 8 --lr_scheduler cosine --log_to wandb --save_model True --output_dir guanaco_7B --gradient_accumulation_steps 2 --lr 2e-4. The SFTTrainer version has to run with a lower batch size (4 vs 8), so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.

Converting Saved Models

If you specify --save_model True, the adapter layers will be saved as a state dict. To convert to the regular Hugging Face format and upload to the hub, see: Converting the State Dict.ipynb

If the "custom_qlora" or "hqq_lora" training options are used, then only the trainable LoRA parameters will be saved. Before inference, you need to load and quantize the base model again, and separately load the saved LoRA parameters.

You can alternatively test whether merging the base model weights and trained LoRA weights and then quantizing them performs similarly to keeping the parameters separate, as done during training. To make use of torch.compile with HQQ, see mobiusml/hqq#18.
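
For the custom QLoRA paths, that reload-then-apply workflow might look roughly like the sketch below. The base-model builder is a hypothetical placeholder for the custom loading code in train.py, and the state-dict filename is an assumption; see Converting the State Dict.ipynb for the supported route.

# Hedged sketch: re-create the quantized base model, then load only the saved
# LoRA parameters on top. build_quantized_base_model is a hypothetical helper
# standing in for the custom loading code in train.py.
import torch
from safetensors.torch import load_file

model = build_quantized_base_model("meta-llama/Llama-2-70b-hf")  # hypothetical helper

lora_sd = load_file("model_state_dict.safetensors")  # assumed filename
missing, unexpected = model.load_state_dict(lora_sd, strict=False)
assert not unexpected, f"unexpected keys: {unexpected}"  # only LoRA params should load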

Limitations

While QLoRA finetuning works with FSDP, there are some rough edges to be aware of with this alpha release and our example script.

First, the current Transformers release's AutoModel.from_pretrained cannot be used to load models into quantized weights, as it does not support the new quant_storage or quantization flag. Loading pretrained models therefore requires writing or using custom model-loading code. We provide an example of how to load and quantize a QLoRA model for finetuning in our demo script.

We are actively working with Hugging Face to resolve this incompatibility in future Transformers and PEFT releases.

Second, while FSDP's Mixed Precision works with QLoRA, practitioners need to be careful to set MixedPrecision.param_dtype to match the Linear4bit.quant_storage dtype. Otherwise, FSDP's Mixed Precision could cast the quantized weights to a different precision, essentially turning them into random weights. Our example script shows how to avoid this potential pitfall, and we will be happy to assist model training libraries in correctly exposing FSDP's Mixed Precision options to users when training with QLoRA.
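
With bitsandbytes >= 0.43, keeping the two dtypes in sync might look like the following sketch (illustrative, not the exact code from our example script):

# Sketch: keep Linear4bit's quant_storage dtype and FSDP's MixedPrecision
# param_dtype identical so FSDP never re-casts the packed 4-bit weights.
import torch
import bitsandbytes as bnb
from torch.distributed.fsdp import MixedPrecision

storage_dtype = torch.bfloat16  # must match on both sides

layer = bnb.nn.Linear4bit(
    4096, 4096, bias=False,
    compute_dtype=torch.bfloat16,
    quant_type="nf4",
    quant_storage=storage_dtype,  # packed weights stored with bf16-typed storage
)

mp_policy = MixedPrecision(
    param_dtype=storage_dtype,    # mismatching this would corrupt the quantized weights
    reduce_dtype=storage_dtype,
    buffer_dtype=torch.float32,
)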

Example: Llama 70B 4-A100 40GB Training

# BnB QLoRA
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py \
--world_size 4 \
--master_port 12356 \
--model_name meta-llama/Llama-2-70b-hf \
--gradient_accumulation_steps 4 \
--batch_size 2 \
--context_length 512 \
--precision bf16_buffers_autocast \
--train_type custom_qlora \
--use_gradient_checkpointing true \
--reentrant_checkpointing true \
--use_cpu_offload false \
--log_to stdout \
--dataset alpaca

# HQQ QLoRA
export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py \
--world_size 4 \
--master_port 12356 \
--model_name meta-llama/Llama-2-70b-hf \
--gradient_accumulation_steps 4 \
--batch_size 2 \
--context_length 512 \
--precision bf16_buffers_autocast \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--log_to stdout \
--dataset alpaca

Note: For large batch sizes or long-context training, HQQ LoRA is a bit more memory efficient than BnB LoRA with reentrant checkpointing, so if you are running into OOM issues, try using HQQ LoRA.

SLURM Training

See fsdp_multi_node.sh for an example training script using multi-node training with SLURM.

Add support for a new model

First, import the new model's transformer, attention, and MLP layers from Transformers:

from transformers.models.mistral.modeling_mistral import MistralDecoderLayer, MISTRAL_ATTENTION_CLASSES, MistralMLP

Then in the get_wrapping_policy function, add the attention, MLP, and transformer layers to the self_attn_policy_fn, mlp_policy_fn, and transformer_wrap_policy wrapping policy methods:

def get_wrapping_policy(custom_policy:bool=False):

    def self_attn_policy_fn(module):
        return isinstance(module, tuple((*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values())))

    def mlp_policy_fn(module):
        return isinstance(module, (LlamaMLP, MistralMLP))

    transformer_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer),
    )

Finally, add gradient checkpointing support by adding the transformer layer to check_fn:

if args["use_gradient_checkpointing"]:
    check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer))

fsdp_qlora's People

Contributors

austinvhuang, geronimi73, jeromeku, johnowhitaker, jph00, keremturgutlu, umerha, warner-benjamin


fsdp_qlora's Issues

nan when the input length is large

Hi

Thanks for your efforts folks!
While I was testing the code on my own dataset, I found that when the length of the input is large (~4000), the loss becomes NaN from the first step:
Epoch 0, Loss nan, LR 1.00e-05: 12%|█████

For the same dataset, when I truncate my input to something shorter, I start to see the loss.
What is the problem?

DeepSeek VL support

Hello, thank you for the awesome work! Could you please add support for the DeepSeek VL model?

Example with AMD ROCm/HIP

Can you please provide an example that works with AMD ROCm/HIP?

I would be happy to give access to my server!

Request for Scripts to Merge QDoRA Adapters with Base Model for vLLM Inference

Hello,

I've successfully finetuned Llama-3 8B with QDoRA and am now looking to perform inference using vLLM. Could you provide guidance or scripts on how to merge the QDoRA adapters with the original base model? Additionally, does this process involve quantization and dequantization of the base model?

Thank you!

BOFT support?

Hi,

Considering the PEFT library has support for the OFT/BOFT adapter, can this be supported in fsdp_qlora too? Would be an useful adapter to have due to its resistance against catastrophic forgetting.

Thanks

How does one load and do inference on fine-tuned LLama 3 using bnb_dora train script?

I used this script to fine tune LLama 3 (from AnswerAI blog post), what I'm left with is a state dict that I am unable to use to replace layers in the original model following the Converting the State Dict.ipynb notebook. Since it does not work (KeyError with mismatching key names of tensors/new_sd), how does one obtain a model from this state dict?

export CUDA_VISIBLE_DEVICES=0,1
python fsdp_qlora/train.py \
--train_type bnb_dora \
--model_name meta-llama/Meta-Llama-3-8B \
--dataset orca_math \
--dataset_samples 10000 \
--batch_size 4 \
--context_length 2048 \
--gradient_accumulation_steps 2 \
--sharding_strategy full_shard \
--use_gradient_checkpointing true \
--reentrant_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--project_name "fsdp-quantized-ft-exps" \
--save_model true \
--output_dir models/Llama-3-8b-orca-math-10k-bnb-QDoRA

Question about adding / training Mixtral

I followed your 'adding a new model' guide to add Mixtral. It appears transformers' Mixtral does not have a MixtralMLP as suggested by the guide. The other items can be imported OK. As a workaround I added MistralMLP to mlp_policy_fn instead of MixtralMLP.

The model now begins to train. Previously, without these changes, there was an OOM error just prior to training, so something has worked. What is the effect of using MistralMLP instead of MixtralMLP? Am I just training garbage, or is it likely to produce something useful?

Background info:

Cannot import MixtralMLP

>>> 
>>> from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES, MixtralMLP
Traceback (most recent call last):
ImportError: cannot import name 'MixtralMLP' from 'transformers.models.mixtral.modeling_mixtral' )
>>> 
>>> from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES
>>> 

With mixtral mod

python train.py --model_name "/home/chris/repos/Mixtral-8x7B-Instruct-v0.1/" --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false --dataset alpaca --reentrant_checkpointing true
World size: 4
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00,  5.95s/it]
Rank 0: Model created: 0.752 GiB
trainable params: 37,748,736 || all params: 46,740,541,440 || trainable%: 0.08076229935944876
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 9.803 GiB
Applying activation checkpointing 0
Total Training Steps: 6470
Epoch 0, Loss 1.045, LR 1.00e-05:   0%|▏ 

without mixtral mod

python train.py --model_name "/home/chris/repos/Mixtral-8x7B-Instruct-v0.1/" --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false --dataset alpaca --reentrant_checkpointing true
World size: 4
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [06:14<00:00, 19.69s/it]
Rank 0: Model created: 0.752 GiB
trainable params: 37,748,736 || all params: 46,740,541,440 || trainable%: 0.08076229935944876
Wrapping model w/ FSDP 0
Traceback (most recent call last):
<etc>
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 2 has a total capacity of 23.69 GiB of which 26.81 MiB is free. Including non-PyTorch memory, this process has 23.66 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 47.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The mod

diff --git a/train.py b/train.py
index 9181dc8..ca4809d 100644
--- a/train.py
+++ b/train.py
@@ -68,6 +68,7 @@ except ImportError:
 # for the wrapping policy and `check_fn` in activation checkpointing
 from transformers.models.llama.modeling_llama import LlamaDecoderLayer, LLAMA_ATTENTION_CLASSES, LlamaMLP
 from transformers.models.mistral.modeling_mistral import MistralDecoderLayer, MISTRAL_ATTENTION_CLASSES, MistralMLP
+from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES
 
 # To get rid of tokenizers warnings for now
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
@@ -429,18 +430,18 @@ def get_wrapping_policy(custom_policy:bool=False):
             )
     def self_attn_policy_fn(module):
         # Check module name is self_attn.
-        return isinstance(module, tuple(*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values()))
+        return isinstance(module, tuple(*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values(), *MIXTRAL_ATTENTION_CLASSES.values()))
 
     def mlp_policy_fn(module):
         # Check module name is self_attn.
-        return isinstance(module, (LlamaMLP, MistralMLP))
+        return isinstance(module, (LlamaMLP, MistralMLP, MistralMLP))
 
     lambda_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn)
     self_attn_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=self_attn_policy_fn)
     mlp_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=mlp_policy_fn)
     transformer_wrap_policy = functools.partial(
         transformer_auto_wrap_policy,
-        transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer),
+        transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer, MixtralDecoderLayer,),
     )
     policies=[lambda_policy, transformer_wrap_policy]
     if custom_policy:
@@ -735,7 +736,7 @@ def fsdp_main(local_rank:int, world_size:int, args:Dict):
 
         )
 
-        check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer))
+        check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer, MixtralDecoderLayer))
         if rank == 0 or args['verbose']:
             print("Applying activation checkpointing", rank)
         apply_activation_checkpointing(
@@ -1042,4 +1043,4 @@ def main(
     mp.spawn(fsdp_main,
         args=(world_size, args),
         nprocs=torch.cuda.device_count(),
-        join=True)
\ No newline at end of file
+        join=True)

Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU

Hey, I'm loving the goal of lowering the resource requirements for training!

In this paper https://arxiv.org/abs/2403.06504 they claim direct memory access between the GPU and NVMe storage is more efficient at swapping, thus keeping the GPU at its maximum compute capacity:
"Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS"

Also, if we look at memory bandwidth, servers have a bunch of channels while high-end gaming machines are limited to two:
"DDR4 3200MHz with eight channels has a theoretical bandwidth of 204.8 GB/s."

What advice could you share given your experience with offloading?

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase

Hello everyone!

First, thank you for this implementation!
Unfortunately I have an issue with running this, RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

I debugged it a bit and it seems that PEFT v.0.9 breaks it. The previous release PEFT v0.8.2 works fine. The fix is to downgrade or move all the peft imports in train.py inside the functions where they are used, like this: https://github.com/geronimi73/fsdp_qlora/tree/fix_ProcessExitedException

I'm not sure whether I am doing something wrong and how nobody else has noticed this, since PEFT 0.9 was released two weeks ago already. Any ideas what might be wrong?

command:

python train.py \
--model_name models/llama2-7b \
--gradient_accumulation_steps 4 \
--batch_size 8 \
--context_length 512 \
--precision bf16 \
--train_type full \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--dataset alpaca 

Note: models/llama2-7b is meta-llama/Llama-2-7b-hf

stacktrace:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/g/fsdp_qlora/train.py", line 939, in <module>
    def main(
  File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/g/fsdp_qlora/train.py", line 1010, in main
    mp.spawn(fsdp_main,
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
World size: 2
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/g/fsdp_qlora/train.py", line 939, in <module>
    def main(
  File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/g/fsdp_qlora/train.py", line 1010, in main
    mp.spawn(fsdp_main,
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    process.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "/home/g/fsdp_qlora/train.py", line 939, in <module>
    def main(
  File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/g/fsdp_qlora/train.py", line 1010, in main
    mp.spawn(fsdp_main,
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 148, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

pip list:

accelerate                        0.28.0
bitsandbytes                      0.43.0
fastcore                          1.5.29
flash-attn                        2.5.6
hqq                               0.1.5
peft                              0.9.0
torch                             2.2.1
transformers                      4.38.2

2x 3090, CUDA Version: 12.2

Results after running

(screenshots of training results omitted)
Why do I lose a lot of the pretrained weights after running, and how can I use the trained files for downstream tasks?

Q on comparison with SFTTrainer

The README mentions:

The SFTTrainer version has to run with a lower batch size (4 vs 8) so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.

Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use higher gradient accumulation?

Separately, I note that SFTTrainer and fsdp trainings take the same time on the graph shown. I assume SFTTrainer is using DDP, so it should be quite a bit slower, no? Perhaps even close to 2x slower because the batch size is smaller so there are more forward passes required?

NCCL issue training with two GPUs

I ran into this issue (NVIDIA/nccl#1125) when trying to replicate the instructions from the README. Since the blog posts mentions that the training was done on two GPUs is there a workaround for the NCCL issue with 1 or 2 GPUs?

Ran
$ python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --reentrant_checkpointing true

The error trace looks like -

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found

Dual GPU training instantly powers off my desktop

Running fine-tuning with these settings makes my desktop instantly power off as soon as training starts:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0,1

python train.py \
--model_name mistralai/Mixtral-8x7B-v0.1 \
--batch_size 1 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
       --log_to wandb \
       --gradient_accumulation_steps 4 \
       --lr_scheduler linear \
       --verbose false \
       --lora_rank 8 \
       --no_sync true

I have 2x 4090 GPUs, Ubuntu 22.04, PyTorch 2.2.1, CUDA 12.1, bitsandbytes 0.43.0, transformers 4.39.2. I'm pretty sure it's not a power supply or thermal issue, since I can run matrix multiplication benchmarks on both GPUs at once, with both of them at 450 watts, and that works fine. Training using naive model parallelism with text-generation-webui also works fine.

process 0 terminated with signal SIGKILL

I am interested in your project; it clearly took a lot of work.
But I ran into this bug with the project, please help me! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73

World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]

Downloading data: 0%| | 0.00/44.3M [00:00<?, ?B/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:17<04:11, 135kB/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:30<04:11, 135kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:27<02:42, 144kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:40<02:42, 144kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:37<01:27, 147kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:50<01:27, 147kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:32<00:14, 161kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:50<00:14, 161kB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in <module>
    def main(
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
    mp.spawn(fsdp_main,
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

Process finished with exit code 1

ProcessExitedException: process 0 (2x 4090)

I'm trying what looks like the "Hello World" of this repo: Running the basic training on a Runpod community cloud 2 x RTX 4090, (128 vCPU 125 GB RAM) configuration. Normally I'd play around with this for longer before posting an issue, but since Runpod was mentioned explicitly in the Answer.ai intro post, I figure this will be the simplest path for anybody trying to test this out.

On their runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 pod:

python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb

It downloads the Llama-2 model, sets everything up, and dies with the following backtrace:

Traceback (most recent call last):                                                                           
  File "/root/fsdp_qlora/train.py", line 939, in <module>                                                    
    def main(                                                                                                
  File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 125, in call_parse                 
    return _f()                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 119, in _f                         
    return tfunc(**merge(args, args_from_prog(func, xtra)))                                                  
  File "/root/fsdp_qlora/train.py", line 1010, in main                                                       
    mp.spawn(fsdp_main,                                                                                      
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn          
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")                             
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():                                                                                
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join           
    raise ProcessExitedException(                                                                            
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL  

Log:

Creating model 0
Loading model 0
Model created 0 1.119 GB
trainable params: 744,488,960 || all params: 35,495,616,512 || trainable%: 2.097410985236193
Wrapping model w/ FSDP 0
Wrapped model 0 1.444 GB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000:   0%|                                                                                                                             | 0/12940 [00:00<?, ?it/s]

Here's the W&B run.

I haven't found any indicators as to what's going on. Both system and GPU RAM seem well within bounds, so I'm not sure why it's dying (unless maybe 125 GB of system RAM is not enough and it gets blown through instantaneously, before it's visible on nvitop or the W&B log?)

Running into CUDA out of memory with hqq_lora

Setup:

I have 1x 3090 and 1x 4090 and I'm trying to follow the instructions in README.md to fine-tune using HQQ, but I'm running into a CUDA out of memory error.

python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type hqq_lora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --log_to wandb

Error

Traceback (most recent call last):
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 939, in <module>
    def main(
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 1010, in main
    mp.spawn(fsdp_main,
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 625, in fsdp_main
    parallel(load_and_quantize_parallel, weights.items(), n_workers=n_workers, threadpool=True,
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/parallel.py", line 117, in parallel
    return L(r)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/foundation.py", line 98, in __call__
    return super().__call__(x, *args, **kwargs)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/foundation.py", line 106, in __init__
    items = listify(items, *rest, use_list=use_list, match=match)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/basics.py", line 66, in listify
    elif is_iter(o): res = list(o)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/parallel.py", line 46, in _call
    return g(item)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 609, in load_and_quantize_parallel
    load_and_quantize(model, name, param, **kwargs)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 212, in load_and_quantize
    submodule.initialize()
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 280, in initialize
    self.quantize(self.linear_layer.weight.data, **self.quant_config)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 382, in quantize
    W_q , meta = Quantizer.quantize(W, device=self.device, compute_dtype=self.compute_dtype, **weight_quant_params)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 71, in quantize
    if(optimize): scale, zero = Quantizer.optimize_weights(tensor=W, scale=scale, zero=zero, min_max=min_max, axis=axis, device=device)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/optimize.py", line 166, in optimize_weights_proximal_legacy
    W_e   = shrink_op(W_f - W_r, beta)
  File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/optimize.py", line 160, in <lambda>
    shrink_op = lambda x, beta,p=lp_norm: torch.sign(x)*torch.nn.functional.relu(torch.abs(x) - (1./beta)*torch.pow(torch.abs(x), p-1))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 1 has a total capacity of 23.69 GiB of which 109.94 MiB is free. Including non-PyTorch memory, this process has 23.49 GiB memory in use. Of the allocated memory 22.66 GiB is allocated by PyTorch, and 549.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Issues with LLaMA-3-70B

Testing with this script on 4x H100s with 80GB VRAM and 2T system RAM:

python train.py \
  --model_name meta-llama/Meta-Llama-3-70B-Instruct \
  --batch_size 32 \
  --context_length 8192 \
  --precision bf16 \
  --train_type hqq_dora \
  --use_gradient_checkpointing true \
  --use_cpu_offload true \
  --dataset alpaca \
  --verbose true

I get this result after Wrapping model w/ FSDP 0:

- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/root/fsdp_qlora/train.py", line 724, in fsdp_main
    model = FSDP(
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
    _auto_wrap(
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 1 more time]
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
    _init_param_handle_from_module(
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
    _materialize_with_param_init_fn(
  File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
    param_init_fn(module)
  File "/root/fsdp_qlora/train.py", line 734, in <lambda>
    param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
  File "/root/hqq/hqq/core/quantize.py", line 480, in to_empty
    return self.cuda(device)
  File "/root/hqq/hqq/core/quantize.py", line 414, in cuda
    self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
  File "/root/hqq/hqq/core/quantize.py", line 215, in cuda
    return Quantizer.to_inplace(W_q, meta, device=device)
  File "/root/hqq/hqq/core/quantize.py", line 176, in to_inplace
    W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!

/root/miniconda3/envs/fq/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Seems to be running out of VRAM at this point and moving something to CPU and failing I guess?

Question about GPU memory usage.

Hi, I tried to finetune a Llama 7B model with HQQ LoRA using dual GPUs.
I found that during "Loading & Quantizing Model Shards", the peak GPU memory usage reached 35 GB. What's the problem?
the run command is:

export CUDA_VISIBLE_DEVICES=3,4
python train.py \
--world_size 2 \
--model_name /workspace/model/Llama-2-7b-chat-hf \
--gradient_accumulation_steps 2 \
--batch_size 1 \
--context_length 4096 \
--num_epochs 1 \
--sharding_strategy full_shard \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset dummy \
--verbose true  

Looking forward to your reply.

train.py script crashes when using HQQ

Here's the command I ran:

python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 1 \
--context_length 1024 \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
       --log_to wandb \
       --gradient_accumulation_steps 8 \
       --lr_scheduler linear \
       --verbose false \
       --lora_rank 16 \
       --no_sync true

this crashes with the stack trace:

Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.18s/it]
Model created 0 0.067 GB
LoRA layers added 0 0.067 GB
Wrapping model w/ FSDP 0
Traceback (most recent call last):
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 953, in <module>
    def main(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 1026, in main
    mp.spawn(fsdp_main,
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 703, in fsdp_main
    model = FSDP(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
    _auto_wrap(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 1 more time]
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
    _init_param_handle_from_module(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
    _materialize_with_param_init_fn(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
    param_init_fn(module)
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 713, in <lambda>
    param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 485, in to_empty
    return self.cuda(device)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 419, in cuda
    self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 220, in cuda
    return Quantizer.to_inplace(W_q, meta, device=device)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 181, in to_inplace
    W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!

llama3?

Now that we have Llama 3, what else should we pay attention to if we fine-tune it?

Fine tuning only runs on CPU

Hello,

I am running this on a few 2X 4090 cloud instances on Vast to test and benchmark. Most machines work without issues, however sometimes I have noticed on certain machines that the GPUs are never used and the fine-tuning stays running on the CPU only. Llama 2 70B can get 15-18s/it on most instances. For ones where the GPUs are not used, it is 800s/it.

nvidia-smi is showing no active processes and 0% on both GPUs. Any idea on how to troubleshoot or fix this issue?

Here is how I am running it and all the settings:

export CUDA_VISIBLE_DEVICES=1,0
python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --reentrant_checkpointing true \

Performance:
[42:45<2887:27:12, 803.50s/it]

nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:41:00.0 Off | Off |
| 30% 29C P8 20W / 450W | 10717MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 30% 30C P8 24W / 450W | 11015MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

How to load the saved model?

Hi

I need your help with loading the model. I see how you're doing that in the "converting..." file. But this is only for LORA models.
What about full_shard models? (--sharding_strategy full_shard --train_type full).

I tried to load it this way but it didn't work:

model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
model.load_state_dict(torch.load('model_state_dict.safetensors'))

Torch Compile?

Thanks for such wonderful work!
I see you comment out this line:

# model = torch.compile(model)

May I ask what is the rationale behind it? Is fsdp_qlora compatible with torch compile?

Training from scratch?

I understand the article mostly mentions fine-tuning. But theoretically is it possible to train something like a 7b model from scratch on a single 24GB GPU?
The recent GaLore paper targets this: https://huggingface.co/papers/2403.03507

Do you think something like this can be implemented in this library?

ValueError report

Hi, I met the following error when finetune llama7b model with FSDP+HQQ:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/workspace/fsdp_qlora/train.py", line 723, in fsdp_main
    model = FSDP(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 481, in __init__
    _auto_wrap(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 481, in __init__
    _auto_wrap(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 45, in _auto_wrap
    _check_nested_wrapping(root_module)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 107, in _check_nested_wrapping
    raise ValueError(
ValueError: FSDP auto wrapping requires modules to not already have FSDP applied but found q_proj.lora_AB in
LlamaSdpaAttention(
  (q_proj): LORA(
    (base_layer): HQQLinear()
    (lora_AB): FullyShardedDataParallel(
      (_fsdp_wrapped_module): Sequential(
        (0): Linear(in_features=4096, out_features=64, bias=False)
        (1): Linear(in_features=64, out_features=4096, bias=False)
      )
    )
    (lora_dropout): Dropout(p=0.1, inplace=False)
  )
  (k_proj): LORA(
    (base_layer): HQQLinear()
    (lora_AB): FullyShardedDataParallel(
      (_fsdp_wrapped_module): Sequential(
        (0): Linear(in_features=4096, out_features=64, bias=False)
        (1): Linear(in_features=64, out_features=4096, bias=False)
      )
    )
    (lora_dropout): Dropout(p=0.1, inplace=False)
  )
  (v_proj): LORA(
    (base_layer): HQQLinear()
    (lora_AB): FullyShardedDataParallel(
      (_fsdp_wrapped_module): Sequential(
        (0): Linear(in_features=4096, out_features=64, bias=False)
        (1): Linear(in_features=64, out_features=4096, bias=False)
      )
    )
    (lora_dropout): Dropout(p=0.1, inplace=False)
  )
  (o_proj): HQQLinear()
  (rotary_emb): LlamaRotaryEmbedding()
)

the command is:

export CUDA_VISIBLE_DEVICES=3,4
python train.py \
--world_size 2 \
--model_name /workspace/model/Llama-2-7b-hf \
--gradient_accumulation_steps 2 \
--batch_size 1 \
--context_length 4096 \
--num_epochs 1 \
--sharding_strategy full_shard \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset dummy \
--verbose true  

How to solve this problem?
Looking forward to your reply.

train.py

I had to vary this code here in train.py to get it to work on my system:

# LoRA and DORA modules
sys.path.append("./scripts")
from scripts.lora import LORA
from scripts.dora import BNBDORA, HQQDORA, DORALayer, MagnitudeLayer

probably because I already had modules called dora and lora installed via pip.

Can I use this script to pre-train models?

Hi there!
I am currently working on pre-training a Llama or Mistral model with clinical texts. Is there any way to use this QLoRA + FSDP script to do such training? Should I make any changes to the current code to be able to do pre-training?

License

Thank you for releasing this, please add a license
