微调 Mistral、Gemma、Llama 速度提高 2-5 倍，内存减少 80%！

✨ 免费微调

所有笔记本电脑都适合初学者！添加您的数据集，单击“全部运行”，您将获得速度提高 2 倍的微调模型，该模型可以导出到 GGUF、vLLM 或上传到 Hugging Face。

不懒惰的支持	免费笔记本	表现	内存使用
骆驼-3 8b	▶️从 Colab 开始	速度提高 2 倍	减少 60%
杰玛7b	▶️从 Colab 开始	速度提高 2.4 倍	减少 71%
米斯特拉尔7b	▶️从 Colab 开始	速度提高 2.2 倍	减少 73%
小羊驼	▶️从 Colab 开始	速度提高 3.9 倍	减少 82%
代码Llama 34b A100	▶️从 Colab 开始	速度提高 1.9 倍	减少 49%
米斯特拉尔 7b 1xT4	▶️从 Kaggle 开始	速度提高 5 倍*	减少 73%
DPO-Zephyr	▶️从 Colab 开始	速度提高 1.9 倍	减少 43%

与 FA2 + Hugging Face 组合进行基准比较。
此会话笔记本对于 ShareGPT ChatML / Vicuna 模板非常有用。
此文本完成笔记本适用于原始文本。这款DPO 笔记本复制了 Zephyr。
* Kaggle 有 2 个 T4，但我们使用 1 个。由于开销，1 个 T4 速度快了 5 倍。

🦥 Unsloth.ai 新闻

📣 新！Llama-3 8b现在可以使用了！ Llama-3 70b 也可以（只需更改笔记本中的型号名称）。
📣 新！我们将内存使用量进一步减少了30% ，现在支持使用 4 倍长的上下文窗口对 LLM 进行微调！如果您使用我们的笔记本电脑，则无需进行任何更改。要启用，只需更改 1 行：

model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing = "unsloth", # <<<<<<<
)

📣 CodeGemma现在可与Gemma 7b和Gemma 2b一起使用
📣所有模型的推理速度提高了 2 倍
📣现已包含DPO 支持。有关 DPO 的更多信息
📣 我们用 🤗Hugging Face 写了一个博客，并在他们的官方文档中！查看SFT 文档和DPO 文档

🔗 链接和资源

类型	链接
📚维基百科和常见问题解答	阅读我们的维基
推特（又名 X）	在 X 上关注我们
📜文档	阅读文档
💾安装	不懒惰/README.md
🥇基准测试	性能表
🌐已发布型号	不懒惰的发布
✍️博客	阅读我们的博客

⭐ 主要特点

所有内核均采用OpenAI 的 Triton语言编写。手动反向传播引擎。
精度损失为 0% - 无近似方法 - 全部精确。
没有改变硬件。自 2018 年起支持 NVIDIA GPU。最低 CUDA 能力 7.0（V100、T4、Titan V、RTX 20、30、40x、A100、H100、L40 等）检查您的 GPU！ GTX 1070、1080 可以工作，但速度很慢。
通过 WSL 在Linux和Windows上运行。
通过bitsandbytes支持4位和16位QLoRA/LoRA微调。
开源训练速度提高了 5 倍 - 请参阅Unsloth Pro以获得高达30 倍的训练速度！
如果您使用 🦥Unsloth 训练了模型，则可以使用这个很酷的贴纸！

🥇 性能基准测试

有关可复制基准测试表的完整列表，请访问我们的网站

1 个 A100 40GB	🤗抱脸	闪光注意	🦥Unsloth 开源	🦥消除懒惰专业版
羊驼毛	1x	1.04倍	1.98倍	15.64倍
莱昂芯片2	1x	0.92倍	1.61倍	20.73倍
欧亚斯特	1x	1.19倍	2.17倍	14.83倍
苗条逆戟鲸	1x	1.18倍	2.22倍	14.82倍

下面的基准测试表是由🤗Hugging Face进行的。

免费Colab T4	数据集	🤗抱脸	火炬2.1.1	🦥不懒惰	🦥 显存减少
骆驼-2 7b	欧亚斯特	1x	1.19倍	1.95倍	-43.3%
米斯特拉尔7b	羊驼毛	1x	1.07倍	1.56倍	-13.7%
小羊驼 1.1b	羊驼毛	1x	2.06倍	3.87倍	-73.8%
DPO 与 Zephyr	超级聊天	1x	1.09倍	1.55倍	-18.6%

💾 安装说明

康达安装

选择pytorch-cuda=11.8CUDA 11.8 或pytorch-cuda=12.1CUDA 12.1。如果有mamba，请使用mamba代替来conda更快地求解。请参阅此Github 问题以获取有关调试 Conda 安装的帮助。

conda create --name unsloth_env python=3.10 conda activate unsloth_env

conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps trl peft accelerate bitsandbytes

点安装

如果您有 Anaconda，请勿使用此功能。你必须使用 Conda 安装方法，否则东西会崩溃。

通过以下方式查找您的 CUDA 版本

import torch; torch.version.cuda

对于 Pytorch 2.1.0：您可以通过 Pip 更新 Pytorch（交换cu121/ cu118）。请访问https://pytorch.org/了解更多信息。选择cu118CUDA 11.8 或cu121CUDA 12.1。如果您有 RTX 3060 或更高版本（A100、H100 等），请使用该"ampere"路径。对于 Pytorch 2.1.1：转到步骤 3。对于 Pytorch 2.2.0：转到步骤 4。

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121

pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"

对于 Pytorch 2.1.1：使用"ampere"较新的 RTX 30xx GPU 或更高版本的路径。

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
  --index-url https://download.pytorch.org/whl/cu121

pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"

对于 Pytorch 2.2.0：使用"ampere"较新的 RTX 30xx GPU 或更高版本的路径。

pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
  --index-url https://download.pytorch.org/whl/cu121

pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"

如果出现错误，请先尝试以下操作，然后返回步骤 1：

pip install --upgrade pip

对于 Pytorch 2.2.1：

# RTX 3090, 4090 Ampere GPUs: pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes

# Pre Ampere RTX 2080, T4, GTX 1080 GPUs: pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" pip install --no-deps xformers trl peft accelerate bitsandbytes

要解决安装问题，请尝试以下操作（全部必须成功）。 Xformers 应该大部分都可用。

nvcc
python -m xformers.info
python -m bitsandbytes

📜 文档

请访问我们的Wiki 页面，了解保存到 GGUF、检查点、评估等内容！
我们支持 Huggingface 的 TRL、Trainer、Seq2SeqTrainer 甚至 Pytorch 代码！
我们在🤗Hugging Face 的官方文档中！查看SFT 文档和DPO 文档！

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling interally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4bit pre quantized models we support - 4x faster downloading!
fourbit_models = [
"unsloth/mistral-7b-bnb-4bit",
"unsloth/llama-2-7b-bnb-4bit",
"unsloth/llama-2-13b-bnb-4bit",
"unsloth/codellama-34b-bnb-4bit",
"unsloth/tinyllama-bnb-4bit",
] # Go to https://huggingface.co/unsloth for more 4-bit models!
# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none",    # Supports any, but = "none" is optimized
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False,  # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
tokenizer = tokenizer,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Cutomized chat templates

DPO 支持

DPO（直接偏好优化）、PPO、奖励建模似乎都按照Llama-Factory的第 3 方独立测试工作。我们有一个初步的 Google Colab 笔记本，用于在 Tesla T4 上复制 Zephyr：笔记本。

我们在🤗Hugging Face 的官方文档中！我们正在查看SFT 文档和DPO 文档！

from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/zephyr-sft-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 64,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 64,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none",    # Supports any, but = "none" is optimized
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
dpo_trainer = DPOTrainer(
model = model,
ref_model = None,
args = TrainingArguments(
per_device_train_batch_size = 4,
gradient_accumulation_steps = 8,
warmup_ratio = 0.1,
num_train_epochs = 3,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
seed = 42,
output_dir = "outputs",
),
beta = 0.1,
train_dataset = YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length = 1024,
max_prompt_length = 512,
)
dpo_trainer.train()

<clipboard-copy aria-label="Copy" class="ClipboardButton btn btn-invisible js-clipboard-copy m-2 p-0 tooltipped-no-delay d-flex flex-justify-center flex-items-center" data-copy-feedback="Copied!" data-tooltip-direction="w" value="from unsloth import FastLanguageModel, PatchDPOTrainer PatchDPOTrainer() import torch from transformers import TrainingArguments from trl import DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained( model_name = "unsloth/zephyr-sft-bnb-4bit", max_seq_length = max_seq_length, dtype = None, load_in_4bit = True, )

Do model patching and add fast LoRA weights

model = FastLanguageModel.get_peft_model( model, r = 64, target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",], lora_alpha = 64, lora_dropout = 0, # Supports any, but = 0 is optimized bias = "none", # Supports any, but = "none" is optimized use_gradient_checkpointing = True, random_state = 3407, max_seq_length = max_seq_length, )

dpo_trainer = DPOTrainer( model = model, ref_model = None, args = TrainingArguments( per_device_train_batch_size = 4, gradient_accumulation_steps = 8, warmup_ratio = 0.1, num_train_epochs = 3, fp16 = not torch.cuda.is_bf16_supported(), bf16 = torch.cuda.is_bf16_supported(), logging_steps = 1, optim = "adamw_8bit", seed = 42, output_dir = "outputs", ), beta = 0.1, train_dataset = YOUR_DATASET_HERE, # eval_dataset = YOUR_DATASET_HERE, tokenizer = tokenizer, max_length = 1024, max_prompt_length = 512, ) dpo_trainer.train()" tabindex="0" role="button">

🥇 详细的基准测试表

单击“代码”以获得完全可重现的示例
“Unsloth Equal”是我们 PRO 版本的预览版，其中删除了代码。所有设置和损失曲线保持相同。
有关基准测试表的完整列表，请访问我们的网站

1 个 A100 40GB	🤗抱脸	闪光注意2	🦥不懒惰开放	不懒惰平等	解除懒惰专业版	不懒惰麦克斯
羊驼毛	1x	1.04倍	1.98倍	2.48倍	5.32倍	15.64倍
代码	代码	代码	代码	代码
秒	1040	1001	第525章	第419章	196	67
内存MB	18235	15365	9631	8525
保存百分比		15.74	47.18	53.25

Llama-Factory 第三方基准测试

链接到性能表。 TGS：每 GPU 每秒的令牌数。型号：LLaMA2-7B。 GPU：NVIDIA A100 * 1。批量大小：4。梯度累积：2。LoRA 等级：8。最大长度：1024。

方法	位	TGS	公克	速度
高频	16	2392	18GB	100%
高频+FA2	16	2954	17GB	123%
不懒惰+FA2	16	4007	16 GB	168%
高频	4	2415	9GB	101%
不懒惰+FA2	4	3726	7GB	160%

流行型号之间的性能比较

单击查看特定型号基准测试表（Mistral 7b、CodeLlama 34b 等）

米斯特拉尔7b

1 个 A100 40GB	抱脸	闪光注意2	不懒惰开放	不懒惰平等	解除懒惰专业版	不懒惰麦克斯
米斯特拉尔 7B 超薄逆戟鲸	1x	1.15倍	2.15倍	2.53倍	4.61倍	13.69倍
代码	代码	代码	代码	代码
秒	1813	第1571章	第842章	718	第393章	132
内存MB	32853	19385	12465	10271
保存百分比		40.99	62.06	68.74

代码骆马 34b

1 个 A100 40GB	抱脸	闪光注意2	不懒惰开放	不懒惰平等	解除懒惰专业版	不懒惰麦克斯
代码骆驼 34B	OOM❌	0.99倍	1.87倍	2.61倍	4.27倍	12.82倍
代码	▶️代码	代码	代码	代码
秒	1953年	1982年	1043	第748章	第458章	152
内存MB	40000	33217	27413	22161
保存百分比		16.96	31.47	44.60

1 特斯拉 T4

1个T4 16GB	抱脸	闪光注意	不懒惰开放	Unsloth Pro 平等	解除懒惰专业版	不懒惰麦克斯
羊驼毛	1x	1.09倍	1.69倍	1.79倍	2.93倍	8.3倍
代码	▶️代码	代码	代码	代码
秒	1599	第1468章	第942章	第894章	第545章	193
内存MB	7199	7059	6459	5443
保存百分比		1.94	10.28	24.39

2 辆特斯拉 T4（通过 DDP）

2 T4 顺铂	抱脸	闪光注意	不懒惰开放	不懒惰平等	解除懒惰专业版	不懒惰麦克斯
羊驼毛	1x	0.99倍	4.95倍	4.44倍	7.28倍	20.61倍
代码	▶️代码	代码	代码
秒	9882	9946	1996年	2227	第1357章	第480章
内存MB	9176	9128	6904	6782
保存百分比		0.52	24.76	26.09

1 Tesla T4 GPU 上的性能比较：

单击查看 1 epoch 所用时间

Google Colab 上的一辆 Tesla T4 bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

系统	图形处理器	羊驼毛 (52K)	莱昂 OIG (210K)	打开助手 (10K)	SlimOrca (518K)
抱脸	1个T4	23小时15米	56小时28米	8小时38米	391小时41米
不懒惰开放	1个T4	13小时7米（1.8倍）	31小时47米（1.8倍）	4小时27米（1.9倍）	240小时4米（1.6倍）
解除懒惰专业版	1个T4	3小时6米（7.5倍）	5小时17米（10.7倍）	1小时7米（7.7倍）	59小时53米（6.5倍）
不懒惰麦克斯	1个T4	2小时39m（8.8倍）	4小时31m（12.5倍）	0小时58m（8.9倍）	51小时30米（7.6倍）

内存使用峰值

系统	图形处理器	羊驼毛 (52K)	莱昂 OIG (210K)	打开助手 (10K)	SlimOrca (518K)
抱脸	1个T4	7.3GB	5.9GB	14.0GB	13.3GB
不懒惰开放	1个T4	6.8GB	5.7GB	7.8GB	7.7GB
解除懒惰专业版	1个T4	6.4GB	6.4GB	6.4GB	6.4GB
不懒惰麦克斯	1个T4	11.4GB	12.4GB	11.9GB	14.4GB

单击通过 DDP 在 2 个 Tesla T4 GPU 上进行性能比较：

**1 epoch 所花费的时间**

Kaggle 上的两辆特斯拉 T4 bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

系统	图形处理器	羊驼毛 (52K)	莱昂 OIG (210K)	打开助手 (10K)	SlimOrca (518K) *
抱脸	2 T4	84小时47米	163小时48米	30小时51米	1301小时24米*
解除懒惰专业版	2 T4	3小时20米（25.4倍）	5小时43米（28.7倍）	1小时12米（25.7倍）	71小时40米（18.1倍）*
不懒惰麦克斯	2 T4	3小时4米（27.6倍）	5小时14米（31.3倍）	1小时6米（28.1倍）	54小时20米（23.9倍）*

多 GPU 系统（2 个 GPU）上的峰值内存使用量

系统	图形处理器	羊驼毛 (52K)	莱昂 OIG (210K)	打开助手 (10K)	SlimOrca (518K) *
抱脸	2 T4	8.4GB\| 6GB	7.2GB\| 5.3GB	14.3GB \| 14.3GB 6.6GB	10.9GB \| 10.9GB 5.9GB*
解除懒惰专业版	2 T4	7.7GB\| 4.9GB	7.5GB\| 4.9GB	8.5GB \| 4.9GB	6.2GB\| 4.7GB *
不懒惰麦克斯	2 T4	10.5GB \| 10.5GB 5GB	10.6GB \| 10.6GB 5GB	10.6GB \| 10.6GB 5GB	10.5GB \| 5GB *

Slim Orca bsz=1 for all benchmarks since bsz=2 OOMs. We can handle bsz=2, but we benchmark it with bsz=1 for consistency.

谢谢至

HuyNguyen-渴望将RoPE 嵌入速度提高 28%
RandomInternetPreson用于确认 WSL 支持
152334H用于实验性 DPO 支持
atgctg用于语法高亮

yuanzhongqiao / unsloth Goto Github PK

unsloth's Introduction