answerdotai / fsdp_qlora
Training LLMs with QLoRA + FSDP
License: Apache License 2.0
I understand the article mostly covers fine-tuning, but theoretically, is it possible to train something like a 7B model from scratch on a single 24GB GPU?
The recent GaLore paper targets this: https://huggingface.co/papers/2403.03507
Do you think something like this can be implemented in this library?
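For context, here is a minimal sketch (not part of this repo, just the usage pattern documented by the galore-torch package; the rank/update_proj_gap/scale values and the parameter-selection heuristic are illustrative) of how a GaLore optimizer would slot into a PyTorch training loop:
from galore_torch import GaLoreAdamW  # assumes `pip install galore-torch`

def build_galore_optimizer(model, lr=1e-4):
    # Illustrative heuristic: project gradients of the large 2-D weight matrices only.
    galore_params, regular_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (galore_params if p.ndim == 2 and "embed" not in name else regular_params).append(p)
    param_groups = [
        {"params": regular_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ]
    return GaLoreAdamW(param_groups, lr=lr)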
Hi
Thanks for your efforts folks!
While I was testing the code on my own dataset, I found that when the input length is large (~4000 tokens), the loss becomes NaN from the first step:
Epoch 0, Loss nan, LR 1.00e-05: 12%|█████
For the same dataset, when I truncate my inputs to something shorter, I get a normal (non-NaN) loss.
What could be the problem?
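Not the repo's code, but a generic sketch of two things worth checking when long inputs produce NaN losses: that sequences are actually capped to the intended context length, and which batch first produces a non-finite loss (function and argument names here are illustrative):
import torch

def check_step(model, tokenizer, texts, max_length=2048):
    # Assumes tokenizer.pad_token is set; pad positions are masked out of the loss.
    batch = tokenizer(texts, truncation=True, max_length=max_length,
                      padding=True, return_tensors="pt").to(model.device)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**batch, labels=labels).loss
    if not torch.isfinite(loss):
        print(f"Non-finite loss on a batch of sequence length {batch['input_ids'].shape[1]}")
    return loss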
Hey, I'm loving the goal of lowering the resource requirements for training!
In this paper (https://arxiv.org/abs/2403.06504) they claim that direct memory access between the GPU and NVMe storage is more efficient at swapping, thus keeping the GPU at its maximum compute capacity.
"Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS"
Also, if we look at memory bandwidth, servers have many channels while high-end gaming machines are limited to two:
"DDR4 3200MHz with eight channels has a theoretical bandwidth of 204.8 GB/s."
What advice could you share, given your experience with offloading?
Hello, thank you for the awesome work! Could you please add support for the DeepSeek VL model?
I am interested in your project and appreciate all the work that has gone into it.
But I ran into the following bug; please help me! @jph00 @johnowhitaker @KeremTurgutlu @warner-benjamin @geronimi73
World size: 2
Downloading readme: 100%|██████████| 11.6k/11.6k [00:00<00:00, 4.21MB/s]
Downloading data: 0%| | 0.00/44.3M [00:00<?, ?B/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:17<04:11, 135kB/s]
Downloading data: 24%|██▎ | 10.5M/44.3M [01:30<04:11, 135kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:27<02:42, 144kB/s]
Downloading data: 47%|████▋ | 21.0M/44.3M [02:40<02:42, 144kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:37<01:27, 147kB/s]
Downloading data: 71%|███████ | 31.5M/44.3M [03:50<01:27, 147kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:32<00:14, 161kB/s]
Downloading data: 95%|█████████▍| 41.9M/44.3M [04:50<00:14, 161kB/s]
Downloading data: 100%|██████████| 44.3M/44.3M [05:12<00:00, 142kB/s]
Generating train split: 51760 examples [00:00, 76513.36 examples/s]
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████| 15/15 [30:58<00:00, 123.93s/it]
Rank 0: Model created: 1.479 GiB
trainable params: 744,488,960 || all params: 69,721,137,152 || trainable%: 1.0678095487411938
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 5.822 GiB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 969, in
def main(
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/sam/Doctorproject/fsdp_qlora/train.py", line 1042, in main
mp.spawn(fsdp_main,
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/sam/anaconda3/envs/fsdp/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process finished with exit code 1
Hi there!
I am currently working on pre-training a Llama or Mistral model with clinical texts. Is there any way to use this QLoRA + FSDP script to do such training? Should I make any changes to the current code to be able to do pre-training?
Is training with 1024 or 2048 sequence length feasible using this method?
I used this script to fine-tune Llama 3 (following the Answer.AI blog post), and what I'm left with is a state dict that I am unable to use to replace layers in the original model when following the Converting the State Dict.ipynb notebook. Since that does not work (KeyError from mismatching key names between the tensors and new_sd), how does one obtain a usable model from this state dict? The command I used is below, followed by a key-inspection sketch.
export CUDA_VISIBLE_DEVICES=0,1
python fsdp_qlora/train.py \
--train_type bnb_dora \
--model_name meta-llama/Meta-Llama-3-8B \
--dataset orca_math \
--dataset_samples 10000 \
--batch_size 4 \
--context_length 2048 \
--gradient_accumulation_steps 2 \
--sharding_strategy full_shard \
--use_gradient_checkpointing true \
--reentrant_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--project_name "fsdp-quantized-ft-exps" \
--save_model true \
--output_dir models/Llama-3-8b-orca-math-10k-bnb-QDoRA
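One diagnostic that may help before re-running the conversion notebook is to dump the keys that were actually saved and compare them against the names the notebook expects (the path below is hypothetical, based on the output_dir above, and the file name follows the model_state_dict.safetensors convention mentioned elsewhere in these issues):
from safetensors.torch import load_file

# Hypothetical path: adjust to wherever the training run wrote its state dict.
sd = load_file("models/Llama-3-8b-orca-math-10k-bnb-QDoRA/model_state_dict.safetensors")
for name, tensor in list(sd.items())[:25]:
    print(name, tuple(tensor.shape), tensor.dtype)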
Hi
I need your help with loading the model. I see how you do that in the "Converting..." notebook, but it only covers LoRA models.
What about fully fine-tuned models (--sharding_strategy full_shard --train_type full)?
I tried to load it this way but it didn't work:
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
model.load_state_dict(torch.load('model_state_dict.safetensors'))
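Note that torch.load cannot read a .safetensors file; a hedged sketch of loading it via the safetensors library instead (strict=False is only there to surface any remaining key mismatches rather than crash on them):
import torch
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)  # model_name as above
state_dict = load_file("model_state_dict.safetensors")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing[:5], "unexpected:", unexpected[:5])  # inspect any key mismatches
model.to("cuda")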
Hi, I tried to fine-tune a Llama 7B model with HQQ-LoRA using dual GPUs.
I found that during "Loading & Quantizing Model Shards", the peak GPU memory usage reached 35 GB. What's the problem?
The run command is:
export CUDA_VISIBLE_DEVICES=3,4
python train.py \
--world_size 2 \
--model_name /workspace/model/Llama-2-7b-chat-hf \
--gradient_accumulation_steps 2 \
--batch_size 1 \
--context_length 4096 \
--num_epochs 1 \
--sharding_strategy full_shard \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset dummy \
--verbose true
Looking forward to your reply.
I have 1x 3090 and 1x 4090 and I'm trying to follow the instructions in README.md to fine-tune using HQQ, but I'm running into a CUDA out-of-memory error:
python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type hqq_lora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --log_to wandb
Traceback (most recent call last):
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 625, in fsdp_main
parallel(load_and_quantize_parallel, weights.items(), n_workers=n_workers, threadpool=True,
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/parallel.py", line 117, in parallel
return L(r)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/foundation.py", line 98, in __call__
return super().__call__(x, *args, **kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/foundation.py", line 106, in __init__
items = listify(items, *rest, use_list=use_list, match=match)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/basics.py", line 66, in listify
elif is_iter(o): res = list(o)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/fastcore/parallel.py", line 46, in _call
return g(item)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 609, in load_and_quantize_parallel
load_and_quantize(model, name, param, **kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/train.py", line 212, in load_and_quantize
submodule.initialize()
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 280, in initialize
self.quantize(self.linear_layer.weight.data, **self.quant_config)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 382, in quantize
W_q , meta = Quantizer.quantize(W, device=self.device, compute_dtype=self.compute_dtype, **weight_quant_params)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/quantize.py", line 71, in quantize
if(optimize): scale, zero = Quantizer.optimize_weights(tensor=W, scale=scale, zero=zero, min_max=min_max, axis=axis, device=device)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/optimize.py", line 166, in optimize_weights_proximal_legacy
W_e = shrink_op(W_f - W_r, beta)
File "/home/ml-curious/Documents/Projects/Opensource/fsdp_qlora/.venv/lib/python3.10/site-packages/hqq/core/optimize.py", line 160, in <lambda>
shrink_op = lambda x, beta,p=lp_norm: torch.sign(x)*torch.nn.functional.relu(torch.abs(x) - (1./beta)*torch.pow(torch.abs(x), p-1))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 1 has a total capacity of 23.69 GiB of which 109.94 MiB is free. Including non-PyTorch memory, this process has 23.49 GiB memory in use. Of the allocated memory 22.66 GiB is allocated by PyTorch, and 549.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Hello,
I am running this on a few 2x 4090 cloud instances on Vast to test and benchmark. Most machines work without issues; however, on certain machines I have noticed that the GPUs are never used and the fine-tuning runs on the CPU only. Llama 2 70B can get 15-18 s/it on most instances. On the ones where the GPUs are not used, it is 800 s/it.
nvidia-smi is showing no active processes and 0% on both GPUs. Any idea on how to troubleshoot or fix this issue?
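Not specific to this repo, but before digging into the settings below, a quick sanity check worth running on the affected instances is to confirm the processes can see CUDA devices at all:
import torch

# If this prints False / 0, the slowdown is a driver or container visibility problem,
# not anything in train.py.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))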
Here is how I am running it and all the settings:
export CUDA_VISIBLE_DEVICES=1,0
python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --reentrant_checkpointing true \
Performance:
[42:45<2887:27:12, 803.50s/it]
nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:41:00.0 Off | Off |
| 30% 29C P8 20W / 450W | 10717MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 30% 30C P8 24W / 450W | 11015MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Running fine-tuning with these settings makes my desktop instantly power off as soon as training starts:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1
python train.py \
--model_name mistralai/Mixtral-8x7B-v0.1 \
--batch_size 1 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb \
--gradient_accumulation_steps 4 \
--lr_scheduler linear \
--verbose false \
--lora_rank 8 \
--no_sync true
I have 2x 4090 GPUs, Ubuntu 22.04, PyTorch 2.2.1, CUDA 12.1, bitsandbytes 0.43.0, transformers 4.39.2. I'm pretty sure it's not a power supply or thermal issue, since I can run matrix multiplication benchmarks on both GPUs at once, with both of them at 450 watts, and that works fine. Training with naive model parallelism via text-generation-webui also works fine.
I have 1x RTX 3090 and 2x RTX 3060 16GB, so total VRAM is 24 + 16*2 = 56GB.
In this case, is it possible to fine-tune models?
The README mentions:
The SFTTrainer version has to run with a lower batch size (4 vs 8) so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.
Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use higher gradient accumulation?
Separately, I note that the SFTTrainer and FSDP trainings take the same time on the graph shown. I assume SFTTrainer is using DDP, so it should be quite a bit slower, no? Perhaps even close to 2x slower, because the batch size is smaller and therefore more forward passes are required?
When I tried to train on a Q&A-style dataset like knowrohit07/know_sql, I get this error.
Testing with this script on 4x H100s with 80GB VRAM and 2T system RAM:
python train.py \
--model_name meta-llama/Meta-Llama-3-70B-Instruct \
--batch_size 32 \
--context_length 8192 \
--precision bf16 \
--train_type hqq_dora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--verbose true
I get this result after "Wrapping model w/ FSDP 0":
- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/root/fsdp_qlora/train.py", line 724, in fsdp_main
model = FSDP(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 1 more time]
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
_materialize_with_param_init_fn(
File "/root/miniconda3/envs/fq/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
param_init_fn(module)
File "/root/fsdp_qlora/train.py", line 734, in <lambda>
param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
File "/root/hqq/hqq/core/quantize.py", line 480, in to_empty
return self.cuda(device)
File "/root/hqq/hqq/core/quantize.py", line 414, in cuda
self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
File "/root/hqq/hqq/core/quantize.py", line 215, in cuda
return Quantizer.to_inplace(W_q, meta, device=device)
File "/root/hqq/hqq/core/quantize.py", line 176, in to_inplace
W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
/root/miniconda3/envs/fq/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
It seems to be running out of VRAM at this point, then trying to move something to the CPU and failing, I guess?
Now that we have Llama 3, what else should we pay attention to when fine-tuning it?
I ran into this issue (NVIDIA/nccl#1125) when trying to replicate the instructions from the README. Since the blog post mentions that the training was done on two GPUs, is there a workaround for the NCCL issue with 1 or 2 GPUs?
Ran
$ python train.py --model_name meta-llama/Llama-2-70b-hf --batch_size 2 --context_length 2048 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset alpaca --reentrant_checkpointing true
The error trace looks like -
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
Hello,
I've successfully finetuned Llama-3 8B with QDoRA and am now looking to perform inference using vLLM. Could you provide guidance or scripts on how to merge the QDoRA adapters with the original base model? Additionally, does this process involve quantization and dequantization of the base model?
Thank you!
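I can't speak for the maintainers, but conceptually a DoRA-style adapter merges back into a dense weight as W' = m * (W0 + B A) / ||W0 + B A||, with the norm taken per output row; for QDoRA you would first dequantize the base weight, merge, and then re-quantize for vLLM if desired. A hedged sketch under those assumptions (tensor names, shapes, and the normalization axis are assumptions, not this repo's exact DORALayer code):
import torch

def merge_dora(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    # W0: (out, in) dequantized base weight, A: (rank, in), B: (out, rank), m: (out,) magnitudes.
    W = W0 + B @ A                               # apply the low-rank update
    row_norm = W.norm(p=2, dim=1, keepdim=True)  # one norm per output row (assumed axis)
    return m.view(-1, 1) * W / row_norm          # rescale each row to the learned magnitude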
Thanks for such wonderful work!
I see you commented out this line:
Line 722 in d7818ec
May I ask what the rationale behind this is? Is fsdp_qlora compatible with torch.compile?
I followed your 'adding a new model' guide to add Mixtral. It appears transformers' Mixtral does not have a MixtralMLP class as the guide suggests; the other items can be imported fine. As a workaround I added MistralMLP to mlp_policy_fn instead of MixtralMLP.
The model now begins to train. Previously, without these changes, there was an OOM error just prior to training, so something has worked. What is the effect of using MistralMLP instead of MixtralMLP? Am I just training garbage, or is it likely to produce something useful?
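For what it's worth, transformers' modeling_mixtral has no MixtralMLP; the per-expert MLPs live inside MixtralSparseMoeBlock. One plausible, unverified adaptation of the guide is to treat that block as the MLP unit in the wrapping policy, e.g.:
from transformers.models.llama.modeling_llama import LlamaMLP
from transformers.models.mistral.modeling_mistral import MistralMLP
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

def mlp_policy_fn(module):
    # Mixtral replaces the dense MLP with a sparse mixture-of-experts block.
    return isinstance(module, (LlamaMLP, MistralMLP, MixtralSparseMoeBlock))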
Background info:
Cannot import MixtralMLP
>>>
>>> from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES, MixtralMLP
Traceback (most recent call last):
ImportError: cannot import name 'MixtralMLP' from 'transformers.models.mixtral.modeling_mixtral' )
>>>
>>> from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES
>>>
With the Mixtral mod:
python train.py --model_name "/home/chris/repos/Mixtral-8x7B-Instruct-v0.1/" --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false --dataset alpaca --reentrant_checkpointing true
World size: 4
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.95s/it]
Rank 0: Model created: 0.752 GiB
trainable params: 37,748,736 || all params: 46,740,541,440 || trainable%: 0.08076229935944876
Wrapping model w/ FSDP 0
Rank 0: Wrapped model: 9.803 GiB
Applying activation checkpointing 0
Total Training Steps: 6470
Epoch 0, Loss 1.045, LR 1.00e-05: 0%|▏
Without the Mixtral mod:
python train.py --model_name "/home/chris/repos/Mixtral-8x7B-Instruct-v0.1/" --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload false --dataset alpaca --reentrant_checkpointing true
World size: 4
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [06:14<00:00, 19.69s/it]
Rank 0: Model created: 0.752 GiB
trainable params: 37,748,736 || all params: 46,740,541,440 || trainable%: 0.08076229935944876
Wrapping model w/ FSDP 0
Traceback (most recent call last):
<etc>
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 2 has a total capacity of 23.69 GiB of which 26.81 MiB is free. Including non-PyTorch memory, this process has 23.66 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 47.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The mod:
diff --git a/train.py b/train.py
index 9181dc8..ca4809d 100644
--- a/train.py
+++ b/train.py
@@ -68,6 +68,7 @@ except ImportError:
# for the wrapping policy and `check_fn` in activation checkpointing
from transformers.models.llama.modeling_llama import LlamaDecoderLayer, LLAMA_ATTENTION_CLASSES, LlamaMLP
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer, MISTRAL_ATTENTION_CLASSES, MistralMLP
+from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer, MIXTRAL_ATTENTION_CLASSES
# To get rid of tokenizers warnings for now
os.environ["TOKENIZERS_PARALLELISM"] = "false"
@@ -429,18 +430,18 @@ def get_wrapping_policy(custom_policy:bool=False):
)
def self_attn_policy_fn(module):
# Check module name is self_attn.
- return isinstance(module, tuple(*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values()))
+ return isinstance(module, tuple(*LLAMA_ATTENTION_CLASSES.values(), *MISTRAL_ATTENTION_CLASSES.values(), *MIXTRAL_ATTENTION_CLASSES.values()))
def mlp_policy_fn(module):
# Check module name is self_attn.
- return isinstance(module, (LlamaMLP, MistralMLP))
+ return isinstance(module, (LlamaMLP, MistralMLP, MistralMLP))
lambda_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=lambda_policy_fn)
self_attn_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=self_attn_policy_fn)
mlp_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=mlp_policy_fn)
transformer_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
- transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer),
+ transformer_layer_cls=(LlamaDecoderLayer, MistralDecoderLayer, MixtralDecoderLayer,),
)
policies=[lambda_policy, transformer_wrap_policy]
if custom_policy:
@@ -735,7 +736,7 @@ def fsdp_main(local_rank:int, world_size:int, args:Dict):
)
- check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer))
+ check_fn = lambda submodule: isinstance(submodule, (LlamaDecoderLayer, MistralDecoderLayer, MixtralDecoderLayer))
if rank == 0 or args['verbose']:
print("Applying activation checkpointing", rank)
apply_activation_checkpointing(
@@ -1042,4 +1043,4 @@ def main(
mp.spawn(fsdp_main,
args=(world_size, args),
nprocs=torch.cuda.device_count(),
- join=True)
\ No newline at end of file
+ join=True)
Hi, I met the following error when fine-tuning a Llama 7B model with FSDP + HQQ:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/workspace/fsdp_qlora/train.py", line 723, in fsdp_main
model = FSDP(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 481, in __init__
_auto_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 481, in __init__
_auto_wrap(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 45, in _auto_wrap
_check_nested_wrapping(root_module)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 107, in _check_nested_wrapping
raise ValueError(
ValueError: FSDP auto wrapping requires modules to not already have FSDP applied but found q_proj.lora_AB in
LlamaSdpaAttention(
(q_proj): LORA(
(base_layer): HQQLinear()
(lora_AB): FullyShardedDataParallel(
(_fsdp_wrapped_module): Sequential(
(0): Linear(in_features=4096, out_features=64, bias=False)
(1): Linear(in_features=64, out_features=4096, bias=False)
)
)
(lora_dropout): Dropout(p=0.1, inplace=False)
)
(k_proj): LORA(
(base_layer): HQQLinear()
(lora_AB): FullyShardedDataParallel(
(_fsdp_wrapped_module): Sequential(
(0): Linear(in_features=4096, out_features=64, bias=False)
(1): Linear(in_features=64, out_features=4096, bias=False)
)
)
(lora_dropout): Dropout(p=0.1, inplace=False)
)
(v_proj): LORA(
(base_layer): HQQLinear()
(lora_AB): FullyShardedDataParallel(
(_fsdp_wrapped_module): Sequential(
(0): Linear(in_features=4096, out_features=64, bias=False)
(1): Linear(in_features=64, out_features=4096, bias=False)
)
)
(lora_dropout): Dropout(p=0.1, inplace=False)
)
(o_proj): HQQLinear()
(rotary_emb): LlamaRotaryEmbedding()
)
The command is:
export CUDA_VISIBLE_DEVICES=3,4
python train.py \
--world_size 2 \
--model_name /workspace/model/Llama-2-7b-hf \
--gradient_accumulation_steps 2 \
--batch_size 1 \
--context_length 4096 \
--num_epochs 1 \
--sharding_strategy full_shard \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset dummy \
--verbose true
How can I solve this problem?
Looking forward to your reply.
I'm trying what looks like the "Hello World" of this repo: running the basic training on a Runpod community cloud 2x RTX 4090 (128 vCPU, 125 GB RAM) configuration. Normally I'd play around with this for longer before posting an issue, but since Runpod was mentioned explicitly in the Answer.AI intro post, I figure this will be the simplest path for anybody trying to test this out.
On their runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 pod:
python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb
It downloads the Llama-2 model, sets everything up, and dies with the following backtrace:
Traceback (most recent call last):
File "/root/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/root/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Log:
Creating model 0
Loading model 0
Model created 0 1.119 GB
trainable params: 744,488,960 || all params: 35,495,616,512 || trainable%: 2.097410985236193
Wrapping model w/ FSDP 0
Wrapped model 0 1.444 GB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Here's the W&B run.
I haven't found any indicators as to what's going on. Both system and GPU RAM seem well within bounds, so I'm not sure why it's dying (unless maybe 125 GB of system RAM is not enough and is getting blown through instantaneously, before it's visible on nvitop or the W&B log?).
Hi,
Considering the PEFT library has support for the OFT/BOFT adapter, can this be supported in fsdp_qlora too? It would be a useful adapter to have due to its resistance against catastrophic forgetting.
Thanks
Thank you for releasing this, please add a license
Can you please provide an example that works with AMD ROCm/HIP?
I would be happy to give access to my server!
Here's the command I ran:
python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 1 \
--context_length 1024 \
--precision bf16 \
--train_type hqq_lora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb \
--gradient_accumulation_steps 8 \
--lr_scheduler linear \
--verbose false \
--lora_rank 16 \
--no_sync true
This crashes with the following stack trace:
Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.18s/it]
Model created 0 0.067 GB
LoRA layers added 0 0.067 GB
Wrapping model w/ FSDP 0
Traceback (most recent call last):
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 953, in <module>
def main(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 1026, in main
mp.spawn(fsdp_main,
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 703, in fsdp_main
model = FSDP(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
[Previous line repeated 1 more time]
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
_materialize_with_param_init_fn(
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
param_init_fn(module)
File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 713, in <lambda>
param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 485, in to_empty
return self.cuda(device)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 419, in cuda
self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 220, in cuda
return Quantizer.to_inplace(W_q, meta, device=device)
File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 181, in to_inplace
W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
I had to change this code in train.py to get it to work on my system:
sys.path.append("./scripts")
from scripts.lora import LORA
from scripts.dora import BNBDORA, HQQDORA, DORALayer, MagnitudeLayer
probably because I already had modules called dora and lora installed via pip.
Hello everyone!
First, thank you for this implementation!
Unfortunately I have an issue with running this: RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.
I debugged it a bit and it seems that PEFT v0.9 breaks it; the previous release, PEFT v0.8.2, works fine. The fix is to downgrade, or to move all the peft imports in train.py
inside the functions where they are used, like this: https://github.com/geronimi73/fsdp_qlora/tree/fix_ProcessExitedException
I'm not sure whether I am doing something wrong, and how come nobody else has noticed this, since PEFT 0.9 was released two weeks ago already. Any ideas what might be wrong?
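For anyone hitting the same thing before a fix lands, the linked workaround amounts to deferring the peft import so that re-importing train.py in the spawn-started child process does not execute peft's module-level code during bootstrapping. A minimal sketch of that pattern (the function and import names are illustrative, not the exact train.py code):
# Instead of a module-level `from peft import ...` in train.py, import inside the
# function that actually needs it, so the spawned child can import the module cleanly.
def fsdp_main(local_rank: int, world_size: int, args: dict):
    from peft import LoraConfig, get_peft_model  # deferred import (illustrative names)
    lora_config = LoraConfig(r=args.get("lora_rank", 8))
    # ... the rest of the setup then calls get_peft_model(model, lora_config) as before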
command:
python train.py \
--model_name models/llama2-7b \
--gradient_accumulation_steps 4 \
--batch_size 8 \
--context_length 512 \
--precision bf16 \
--train_type full \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--use_activation_cpu_offload false \
--log_to wandb \
--dataset alpaca
Note: models/llama2-7b is meta-llama/Llama-2-7b-hf
stacktrace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
World size: 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "/home/g/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/home/g/.local/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/home/g/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/g/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 148, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
pip list:
accelerate 0.28.0
bitsandbytes 0.43.0
fastcore 1.5.29
flash-attn 2.5.6
hqq 0.1.5
peft 0.9.0
torch 2.2.1
transformers 4.38.2
2x 3090, CUDA Version: 12.2