answerdotai / fsdp_qlora
Training LLMs with QLoRA + FSDP
License: Apache License 2.0
I understand the article mostly covers fine-tuning, but theoretically, is it possible to train something like a 7B model from scratch on a single 24GB GPU?
The recent GaLore paper targets this: https://huggingface.co/papers/2403.03507
Do you think something like this can be implemented in this library?
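For context, here is a minimal sketch (not part of this repo, just the usage pattern documented by the galore-torch package; the rank/update_proj_gap/scale values and the parameter-selection heuristic are illustrative) of how a GaLore optimizer would slot into a PyTorch training loop:
from galore_torch import GaLoreAdamW  # assumes `pip install galore-torch`

def build_galore_optimizer(model, lr=1e-4):
    # Illustrative heuristic: project gradients of the large 2-D weight matrices only.
    galore_params, regular_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (galore_params if p.ndim == 2 and "embed" not in name else regular_params).append(p)
    param_groups = [
        {"params": regular_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ]
    return GaLoreAdamW(param_groups, lr=lr)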
Hi
Thanks for your efforts folks!
While I was testing the code on my own dataset, I found that when the input length is large (~4000 tokens), the loss becomes NaN from the first step:
Epoch 0, Loss nan, LR 1.00e-05: 12%|█████
For the same dataset, when I truncate my inputs to something shorter, I get a normal (non-NaN) loss.
What could be the problem?
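Not the repo's code, but a generic sketch of two things worth checking when long inputs produce NaN losses: that sequences are actually capped to the intended context length, and which batch first produces a non-finite loss (function and argument names here are illustrative):
import torch

def check_step(model, tokenizer, texts, max_length=2048):
    # Assumes tokenizer.pad_token is set; pad positions are masked out of the loss.
    batch = tokenizer(texts, truncation=True, max_length=max_length,
                      padding=True, return_tensors="pt").to(model.device)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**batch, labels=labels).loss
    if not torch.isfinite(loss):
        print(f"Non-finite loss on a batch of sequence length {batch['input_ids'].shape[1]}")
    return loss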
Hey, I'm loving the goal of lowering the resource requirements for training!
In this paper (https://arxiv.org/abs/2403.06504) they claim that direct memory access between the GPU and NVMe storage is more efficient at swapping, thus keeping the GPU at its maximum compute capacity.
"Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS"
Also, if we look at memory bandwidth, servers have many channels while high-end gaming machines are limited to two:
"DDR4 3200MHz with eight channels has a theoretical bandwidth of 204.8 GB/s."
What advice could you share, given your experience with offloading?
Hello, thank you for the awesome work! Could you please add support for the DeepSeek VL model?
I am interested in your project and appreciate all the work that has gone into it.
But I ran into the following bug; please help me! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73
World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]
Downloading data: 0%| | 0.00/44.3M [00:00<?, ?B/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:17<04:11, 135kB/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:30<04:11, 135kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:27<02:42, 144kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:40<02:42, 144kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:37<01:27, 147kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:50<01:27, 147kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:32<00:14, 161kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:50<00:14, 161kB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in
def main(
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
mp.spawn(fsdp_main,
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process finished with exit code 1
Hi there!
I am currently working on pre-training a Llama or Mistral model with clinical texts. Is there any way to use this QLoRA + FSDP script to do such training? Should I make any changes to the current code to be able to do pre-training?
Is training with 1024 or 2048 sequence length feasible using this method?
I used this script to fine-tune Llama 3 (following the Answer.AI blog post), and what I'm left with is a state dict that I am unable to use to replace layers in the original model when following the Converting the State Dict.ipynb notebook. Since that does not work (KeyError from mismatching key names between the tensors and new_sd), how does one obtain a usable model from this state dict? The command I used is below, followed by a key-inspection sketch.
export CUDA_VISIBLE_DEVICES=0,1
python fsdp_qlora/train.py \
--train_type bnb_dora \
--model_name meta-llama/Meta-Llama-3-8B \
--dataset orca_math \
--dataset_samples 10000 \
--batch_size 4 \
--context_length 2048 \
--gradient_accumulation_steps 2 \
--sharding_strategy full_shard \
--use_gradient_checkpointing true \
--reentrant_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--project_name "fsdp-quantized-ft-exps" \
--save_model true \
--output_dir models/Llama-3-8b-orca-math-10k-bnb-QDoRA
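One diagnostic that may help before re-running the conversion notebook is to dump the keys that were actually saved and compare them against the names the notebook expects (the path below is hypothetical, based on the output_dir above, and the file name follows the model_state_dict.safetensors convention mentioned elsewhere in these issues):
from safetensors.torch import load_file

# Hypothetical path: adjust to wherever the training run wrote its state dict.
sd = load_file("models/Llama-3-8b-orca-math-10k-bnb-QDoRA/model_state_dict.safetensors")
for name, tensor in list(sd.items())[:25]:
    print(name, tuple(tensor.shape), tensor.dtype)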
Hi
I need your help with loading the model. I see how you do that in the "Converting..." notebook, but it only covers LoRA models.
What about fully fine-tuned models (--sharding_strategy full_shard --train_type full)?
I tried to load it this way but it didn't work:
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
model.load_state_dict(torch.load('model_state_dict.safetensors'))
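Note that torch.load cannot read a .safetensors file; a hedged sketch of loading it via the safetensors library instead (strict=False is only there to surface any remaining key mismatches rather than crash on them):
import torch
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)  # model_name as above
state_dict = load_file("model_state_dict.safetensors")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing[:5], "unexpected:", unexpected[:5])  # inspect any key mismatches
model.to("cuda")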
Hi, I tried to fine-tune a Llama 7B model with HQQ-LoRA using dual GPUs.
I found that during "Loading & Quantizing Model Shards", the peak GPU memory usage reached 35 GB. What's the problem?
The run command is:
export CUDA_VISIBLE_DEVICES=3,4
python train.py \
--world_size 2 \
--model_name /workspace/model/Llama-2-7b-chat-hf \
--gradient_accumulation_steps 2 \
--batch_size 1 \
--context_length 4096 \
--num_epochs 1 \
--sharding_strategy full_shard \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset dummy \
--verbose true
Looking forward to your reply.
I have 1x 3090 and 1x 4090 and I'm trying to follow the instructions in README.md to fine-tune using HQQ, but I'm running into a CUDA out-of-memory error:
python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type hqq_lora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --log_to wandb
Traceback (most recent call last):
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 625, in fsdp_main
parallel(load_and_quantize_parallel, weights.items(), n_workers=n_workers, threadpool=True,
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/parallel.py", line 117, in parallel
return L(r)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/foundation.py", line 98, in __call__
return super().__call__(x, *args, **kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/foundation.py", line 106, in __init__
items = listify(items, *rest, use_list=use_list, match=match)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/basics.py", line 66, in listify
elif is_iter(o): res = list(o)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/parallel.py", line 46, in _call
return g(item)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 609, in load_and_quantize_parallel
load_and_quantize(model, name, param, **kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 212, in load_and_quantize
submodule.initialize()
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 280, in initialize
self.quantize(self.linear_layer.weight.data, **self.quant_config)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 382, in quantize
W_q , meta = Quantizer.quantize(W, device=self.device, compute_dtype=self.compute_dtype, **weight_quant_params)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 71, in quantize
if(optimize): scale, zero = Quantizer.optimize_weights(tensor=W, scale=scale, zero=zero, min_max=min_max, axis=axis, device=device)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/optimize.py", line 166, in optimize_weights_proximal_legacy
W_e = shrink_op(W_f - W_r, beta)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/optimize.py", line 160, in <lambda>
shrink_op = lambda x, beta,p=lp_norm: torch.sign(x)*torch.nn.functional.relu(torch.abs(x) - (1./beta)*torch.pow(torch.abs(x), p-1))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 1 has a total capacity of 23.69 GiB of which 109.94 MiB is free. Including non-PyTorch memory, this process has 23.49 GiB memory in use. Of the allocated memory 22.66 GiB is allocated by PyTorch, and 549.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Hello,
I am running this on a few 2x 4090 cloud instances on Vast to test and benchmark. Most machines work without issues; however, on certain machines I have noticed that the GPUs are never used and the fine-tuning runs on the CPU only. Llama 2 70B can get 15-18 s/it on most instances. On the ones where the GPUs are not used, it is 800 s/it.
nvidia-smi is showing no active processes and 0% on both GPUs. Any idea on how to troubleshoot or fix this issue?
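Not specific to this repo, but before digging into the settings below, a quick sanity check worth running on the affected instances is to confirm the processes can see CUDA devices at all:
import torch

# If this prints False / 0, the slowdown is a driver or container visibility problem,
# not anything in train.py.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))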
Here is how I am running it and all the settings:
export CUDA_VISIBLE_DEVICES=1,0
python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --reentrant_checkpointing true \
Performance:
[42:45<2887:27:12, 803.50s/it]
nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:41:00.0 Off | Off |
| 30% 29C P8 20W / 450W | 10717MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 30% 30C P8 24W / 450W | 11015MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Running fine-tuning with these settings makes my desktop instantly power off as soon as training starts:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1
python train.py \
--model_name mistralai/Mixtral-8x7B-v0.1 \
--batch_size 1 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb \
--gradient_accumulation_steps 4 \
--lr_scheduler linear \
--verbose false \
--lora_rank 8 \
--no_sync true
I have 2x 4090 GPUs, Ubuntu 22.04, PyTorch 2.2.1, CUDA 12.1, bitsandbytes 0.43.0, transformers 4.39.2. I'm pretty sure it's not a power supply or thermal issue, since I can run matrix multiplication benchmarks on both GPUs at once, with both of them at 450 watts, and that works fine. Training with naive model parallelism via text-generation-webui also works fine.
I have 1x RTX 3090 and 2x RTX 3060 16GB, so total VRAM is 24 + 16*2 = 56GB.
In this case, is it possible to fine-tune models?
The README mentions:
The SFTTrainer version has to run with a lower batch size (4 vs 8) so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.
Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use higher gradient accumulation?
Separately, I note that the SFTTrainer and FSDP trainings take the same time on the graph shown. I assume SFTTrainer is using DDP, so it should be quite a bit slower, no? Perhaps even close to 2x slower, because the batch size is smaller and therefore more forward passes are required?
When I tried to train on a Q&A-style dataset like knowrohit07/know_sql, I get this error.
Testing with this script on 4x H100s with 80GB VRAM and 2T system RAM:
python train.py \
--model_name meta-llama/Meta-Llama-3-70B-Instruct \
--batch_size 32 \
--context_length 8192 \
--precision bf16 \
--train_type hqq_dora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--verbose true
I get this result after "Wrapping model w/ FSDP 0":
- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/root/fsdp_qlora/train.py", line 724, in fsdp_main
model = FSDP(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 1 more time]
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
_materialize_with_param_init_fn(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
param_init_fn(module)
File "/root/fsdp_qlora/train.py", line 734, in <lambda>
param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
File "/root/hqq/hqq/core/quantize.py", line 480, in to_empty
return self.cuda(device)
File "/root/hqq/hqq/core/quantize.py", line 414, in cuda
self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
File "/root/hqq/hqq/core/quantize.py", line 215, in cuda
return Quantizer.to_inplace(W_q, meta, device=device)
File "/root/hqq/hqq/core/quantize.py", line 176, in to_inplace
W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
/root/miniconda3/envs/fq/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
It seems to be running out of VRAM at this point, then trying to move something to the CPU and failing, I guess?
Now that we have Llama 3, what else should we pay attention to when fine-tuning it?
I ran into this issue (NVIDIA/nccl#1125) when trying to replicate the instructions from the README. Since the blog post mentions that the training was done on two GPUs, is there a workaround for the NCCL issue with 1 or 2 GPUs?
Ran
$ python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --reentrant_checkpointing true
The error trace looks like -
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
Hello,
I've successfully finetuned Llama-3 8B with QDoRA and am now looking to perform inference using vLLM. Could you provide guidance or scripts on how to merge the QDoRA adapters with the original base model? Additionally, does this process involve quantization and dequantization of the base model?
Thank you!
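I can't speak for the maintainers, but conceptually a DoRA-style adapter merges back into a dense weight as W' = m * (W0 + B A) / ||W0 + B A||, with the norm taken per output row; for QDoRA you would first dequantize the base weight, merge, and then re-quantize for vLLM if desired. A hedged sketch under those assumptions (tensor names, shapes, and the normalization axis are assumptions, not this repo's exact DORALayer code):
import torch

def merge_dora(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    # W0: (out, in) dequantized base weight, A: (rank, in), B: (out, rank), m: (out,) magnitudes.
    W = W0 + B @ A                               # apply the low-rank update
    row_norm = W.norm(p=2, dim=1, keepdim=True)  # one norm per output row (assumed axis)
    return m.view(-1, 1) * W / row_norm          # rescale each row to the learned magnitude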
Thanks for such wonderful work!
I see you commented out this line:
Line 722 in d7818ec
May I ask what the rationale behind this is? Is fsdp_qlora compatible with torch.compile?
I followed your 'adding a new model' guide to add Mixtral. It appears transformers' Mixtral does not have a MixtralMLP class as the guide suggests; the other items can be imported fine. As a workaround I added MistralMLP to mlp_policy_fn instead of MixtralMLP.
The model now begins to train. Previously, without these changes, there was an OOM error just prior to training, so something has worked. What is the effect of using MistralMLP instead of MixtralMLP? Am I just training garbage, or is it likely to produce something useful?
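For what it's worth, transformers' modeling_mixtral has no MixtralMLP; the per-expert MLPs live inside MixtralSparseMoeBlock. One plausible, unverified adaptation of the guide is to treat that block as the MLP unit in the wrapping policy, e.g.:
from transformers.models.llama.modeling_llama import LlamaMLP
from transformers.models.mistral.modeling_mistral import MistralMLP
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

def mlp_policy_fn(module):
    # Mixtral replaces the dense MLP with a sparse mixture-of-experts block.
    return isinstance(module, (LlamaMLP, MistralMLP, MixtralSparseMoeBlock))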
Background info:
Cannot import MixtralMLP
>>>
>>> from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES, MixtralMLP
Traceback (most recent call last):
ImportError: cannot import name 'MixtralMLP' from 'transformers.models.mixtral.modeling_mixtral' )
>>>
>>> from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES
>>>
With the Mixtral mod:
python train.py --model_name "/home/chris/repos/Mixtral-8x7B-Instruct-v0.1/" --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false --dataset alpaca --reentrant_checkpointing true
World size: 4
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.95s/it]
Rank 0: Model created: 0.752 GiB
trainable params: 37,748,736 || all params: 46,740,541,440 || trainable%: 0.08076229935944876
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 9.803 GiB
Applying activation checkpointing 0
Total Training Steps: 6470
Epoch 0, Loss 1.045, LR 1.00e-05: 0%|▏
Without the Mixtral mod:
python train.py --model_name "/home/chris/repos/Mixtral-8x7B-Instruct-v0.1/" --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false --dataset alpaca --reentrant_checkpointing true
World size: 4
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [06:14<00:00, 19.69s/it]
Rank 0: Model created: 0.752 GiB
trainable params: 37,748,736 || all params: 46,740,541,440 || trainable%: 0.08076229935944876
Wrapping model w/ FSDP 0
Traceback (most recent call last):
<etc>
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 2 has a total capacity of 23.69 GiB of which 26.81 MiB is free. Including non-PyTorch memory, this process has 23.66 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 47.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The mod:
diff --git a/train.py b/train.py
index 9181dc8..ca4809d 100644
--- a/train.py
+++ b/train.py
@@ -68,6 +68,7 @@ except ImportError:
# for the wrapping policy and `check_fn` in activation checkpointing
from transformers.models.llama.modeling_llama import LlamaDecoderLayer, LLAMA_ATTENTION_CLASSES, LlamaMLP
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer, MISTRAL_ATTENTION_CLASSES, MistralMLP
+from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES
# To get rid of tokenizers warnings for now
os.environ["TOKENIZERS_PARALLELISM"] = "false"
@@ -429,18 +430,18 @@ def get_wrapping_policy(custom_policy:bool=False):
)
def self_attn_policy_fn(module):
# Check module name is self_attn.
- return isinstance(module, tuple(*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values()))
+ return isinstance(module, tuple(*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values(), *MIXTRAL_ATTENTION_CLASSES.values()))
def mlp_policy_fn(module):
# Check module name is self_attn.
- return isinstance(module, (LlamaMLP, MistralMLP))
+ return isinstance(module, (LlamaMLP, MistralMLP, MistralMLP))
lambda_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn)
self_attn_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=self_attn_policy_fn)
mlp_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=mlp_policy_fn)
transformer_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
- transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer),
+ transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer, MixtralDecoderLayer,),
)
policies=[lambda_policy, transformer_wrap_policy]
if custom_policy:
@@ -735,7 +736,7 @@ def fsdp_main(local_rank:int, world_size:int, args:Dict):
)
- check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer))
+ check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer, MixtralDecoderLayer))
if rank == 0 or args['verbose']:
print("Applying activation checkpointing", rank)
apply_activation_checkpointing(
@@ -1042,4 +1043,4 @@ def main(
mp.spawn(fsdp_main,
args=(world_size, args),
nprocs=torch.cuda.device_count(),
- join=True)
\ No newline at end of file
+ join=True)
Hi, I met the following error when fine-tuning a Llama 7B model with FSDP + HQQ:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/workspace/fsdp_qlora/train.py", line 723, in fsdp_main
model = FSDP(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 481, in __init__
_auto_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 481, in __init__
_auto_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 45, in _auto_wrap
_check_nested_wrapping(root_module)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 107, in _check_nested_wrapping
raise ValueError(
ValueError: FSDP auto wrapping requires modules to not already have FSDP applied but found q_proj.lora_AB in
LlamaSdpaAttention(
(q_proj): LORA(
(base_layer): HQQLinear()
(lora_AB): FullyShardedDataParallel(
(_fsdp_wrapped_module): Sequential(
(0): Linear(in_features=4096, out_features=64, bias=False)
(1): Linear(in_features=64, out_features=4096, bias=False)
)
)
(lora_dropout): Dropout(p=0.1, inplace=False)
)
(k_proj): LORA(
(base_layer): HQQLinear()
(lora_AB): FullyShardedDataParallel(
(_fsdp_wrapped_module): Sequential(
(0): Linear(in_features=4096, out_features=64, bias=False)
(1): Linear(in_features=64, out_features=4096, bias=False)
)
)
(lora_dropout): Dropout(p=0.1, inplace=False)
)
(v_proj): LORA(
(base_layer): HQQLinear()
(lora_AB): FullyShardedDataParallel(
(_fsdp_wrapped_module): Sequential(
(0): Linear(in_features=4096, out_features=64, bias=False)
(1): Linear(in_features=64, out_features=4096, bias=False)
)
)
(lora_dropout): Dropout(p=0.1, inplace=False)
)
(o_proj): HQQLinear()
(rotary_emb): LlamaRotaryEmbedding()
)
The command is:
export CUDA_VISIBLE_DEVICES=3,4
python train.py \
--world_size 2 \
--model_name /workspace/model/Llama-2-7b-hf \
--gradient_accumulation_steps 2 \
--batch_size 1 \
--context_length 4096 \
--num_epochs 1 \
--sharding_strategy full_shard \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset dummy \
--verbose true
How can I solve this problem?
Looking forward to your reply.
I'm trying what looks like the "Hello World" of this repo: running the basic training on a Runpod community cloud 2x RTX 4090 (128 vCPU, 125 GB RAM) configuration. Normally I'd play around with this for longer before posting an issue, but since Runpod was mentioned explicitly in the Answer.AI intro post, I figure this will be the simplest path for anybody trying to test this out.
On their runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 pod:
python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb
It downloads the Llama-2 model, sets everything up, and dies with the following backtrace:
Traceback (most recent call last):
File "/root/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/root/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Log:
Creating model 0
Loading model 0
Model created 0 1.119 GB
trainable params: 744,488,960 || all params: 35,495,616,512 || trainable%: 2.097410985236193
Wrapping model w/ FSDP 0
Wrapped model 0 1.444 GB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Here's the W&B run.
I haven't found any indicators as to what's going on. Both system and GPU RAM seem well within bounds, so I'm not sure why it's dying (unless maybe 125 GB of system RAM is not enough and is getting blown through instantaneously, before it's visible on nvitop or the W&B log?).
Hi,
Considering the PEFT library has support for the OFT/BOFT adapter, can this be supported in fsdp_qlora too? It would be a useful adapter to have due to its resistance against catastrophic forgetting.
Thanks
Thank you for releasing this, please add a license
Can you please provide an example that works with AMD ROCm/HIP?
I would be happy to give access to my server!
Here's the command I ran:
python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 1 \
--context_length 1024 \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb \
--gradient_accumulation_steps 8 \
--lr_scheduler linear \
--verbose false \
--lora_rank 16 \
--no_sync true
This crashes with the following stack trace:
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.18s/it]
Model created 0 0.067 GB
LoRA layers added 0 0.067 GB
Wrapping model w/ FSDP 0
Traceback (most recent call last):
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 953, in <module>
def main(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 1026, in main
mp.spawn(fsdp_main,
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 703, in fsdp_main
model = FSDP(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 1 more time]
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
_materialize_with_param_init_fn(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
param_init_fn(module)
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 713, in <lambda>
param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 485, in to_empty
return self.cuda(device)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 419, in cuda
self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 220, in cuda
return Quantizer.to_inplace(W_q, meta, device=device)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 181, in to_inplace
W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
I had to change this code in train.py to get it to work on my system:
sys.path.append("./scripts")
from scripts.lora import LORA
from scripts.dora import BNBDORA, HQQDORA, DORALayer, MagnitudeLayer
probably because I already had modules called dora and lora installed via pip.
Hello everyone!
First, thank you for this implementation!
Unfortunately I have an issue with running this: RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
I debugged it a bit and it seems that PEFT v0.9 breaks it; the previous release, PEFT v0.8.2, works fine. The fix is to downgrade, or to move all the peft imports in train.py
inside the functions where they are used, like this: https://github.com/geronimi73/fsdp_qlora/tree/fix_ProcessExitedException
I'm not sure whether I am doing something wrong, and how come nobody else has noticed this, since PEFT 0.9 was released two weeks ago already. Any ideas what might be wrong?
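For anyone hitting the same thing before a fix lands, the linked workaround amounts to deferring the peft import so that re-importing train.py in the spawn-started child process does not execute peft's module-level code during bootstrapping. A minimal sketch of that pattern (the function and import names are illustrative, not the exact train.py code):
# Instead of a module-level `from peft import ...` in train.py, import inside the
# function that actually needs it, so the spawned child can import the module cleanly.
def fsdp_main(local_rank: int, world_size: int, args: dict):
    from peft import LoraConfig, get_peft_model  # deferred import (illustrative names)
    lora_config = LoraConfig(r=args.get("lora_rank", 8))
    # ... the rest of the setup then calls get_peft_model(model, lora_config) as before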
command:
python train.py \
--model_name models/llama2-7b \
--gradient_accumulation_steps 4 \
--batch_size 8 \
--context_length 512 \
--precision bf16 \
--train_type full \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--dataset alpaca
Note: models/llama2-7b is meta-llama/Llama-2-7b-hf
stacktrace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
World size: 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 148, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
pip list:
accelerate 0.28.0
bitsandbytes 0.43.0
fastcore 1.5.29
flash-attn 2.5.6
hqq 0.1.5
peft 0.9.0
torch 2.2.1
transformers 4.38.2
2x 3090, CUDA Version: 12.2