ymcui / chinese-llama-alpaca-2 Goto Github PK

View Code? Open in Web Editor NEW

6.8K 74.0 555.0 8.35 MB

中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)

License: Apache License 2.0

Python 98.59% Shell 1.41%

alpaca llama llm llama-2 large-language-models nlp alpaca-2 flash-attention llama2 alpaca2

chinese-llama-alpaca-2's Introduction

Chinese-LLaMA-Alpaca-3项目启动！

本项目基于Meta发布的可商用大模型Llama-2开发，是中文LLaMA&Alpaca大模型的第二期项目，开源了中文LLaMA-2基座模型和Alpaca-2指令精调大模型。这些模型在原版Llama-2的基础上扩充并优化了中文词表，使用了大规模中文数据进行增量预训练，进一步提升了中文基础语义和指令理解能力，相比一代相关模型获得了显著性能提升。相关模型支持FlashAttention-2训练。标准版模型支持4K上下文长度，长上下文版模型支持16K、64k上下文长度。RLHF系列模型为标准版模型基础上进行人类偏好对齐精调，相比标准版模型在正确价值观体现方面获得了显著性能提升。

本项目主要内容

🚀 针对Llama-2模型扩充了新版中文词表，开源了中文LLaMA-2和Alpaca-2大模型
🚀 开源了预训练脚本、指令精调脚本，用户可根据需要进一步训练模型
🚀 使用个人电脑的CPU/GPU快速在本地进行大模型量化和部署体验
🚀 支持🤗transformers, llama.cpp, text-generation-webui, LangChain, privateGPT, vLLM等LLaMA生态

已开源的模型

基座模型（4K上下文）：Chinese-LLaMA-2 (1.3B, 7B, 13B)
聊天模型（4K上下文）：Chinese-Alpaca-2 (1.3B, 7B, 13B)
长上下文模型（16K/64K）：
- Chinese-LLaMA-2-16K (7B, 13B) 、Chinese-Alpaca-2-16K (7B, 13B)
- Chinese-LLaMA-2-64K (7B)、Chinese-Alpaca-2-64K (7B)
偏好对齐模型：Chinese-Alpaca-2-RLHF (1.3B, 7B)

新闻

[2024/03/27] 本项目已入驻机器之心SOTA!模型平台，欢迎关注：https://sota.jiqizhixin.com/project/chinese-llama-alpaca-2

[2024/01/23] 添加新版GGUF模型（imatrix量化）、AWQ量化模型，支持vLLM下加载YaRN长上下文模型。详情查看📚 v4.1版本发布日志

[2023/12/29] 发布长上下文模型Chinese-LLaMA-2-7B-64K和Chinese-Alpaca-2-7B-64K，同时发布经过人类偏好对齐（RLHF）的Chinese-Alpaca-2-RLHF（1.3B/7B）。详情查看📚 v4.0版本发布日志

[2023/09/01] 发布长上下文模型Chinese-Alpaca-2-7B-16K和Chinese-Alpaca-2-13B-16K，该模型可直接应用于下游任务，例如privateGPT等。详情查看📚 v3.1版本发布日志

[2023/08/25] 发布长上下文模型Chinese-LLaMA-2-7B-16K和Chinese-LLaMA-2-13B-16K，支持16K上下文，并可通过NTK方法进一步扩展至24K+。详情查看📚 v3.0版本发布日志

[2023/08/14] 发布Chinese-LLaMA-2-13B和Chinese-Alpaca-2-13B，添加text-generation-webui/LangChain/privateGPT支持，添加CFG Sampling解码方法等。详情查看📚 v2.0版本发布日志

[2023/08/02] 添加FlashAttention-2训练支持，基于vLLM的推理加速支持，提供长回复系统提示语模板等。详情查看📚 v1.1版本发布日志

[2023/07/31] 正式发布Chinese-LLaMA-2-7B（基座模型），使用120G中文语料增量训练（与一代Plus系列相同）；进一步通过5M条指令数据精调（相比一代略微增加），得到Chinese-Alpaca-2-7B（指令/chat模型）。详情查看📚 v1.0版本发布日志

[2023/07/19] 🚀启动中文LLaMA-2、Alpaca-2开源大模型项目

内容导引

章节	描述
💁🏻‍♂️模型简介	简要介绍本项目相关模型的技术特点
⏬模型下载	中文LLaMA-2、Alpaca-2大模型下载地址
💻推理与部署	介绍了如何对模型进行量化并使用个人电脑部署并体验大模型
💯系统效果	介绍了模型在部分任务上的效果
📝训练与精调	介绍了如何训练和精调中文LLaMA-2、Alpaca-2大模型
❓常见问题	一些常见问题的回复

模型简介

本项目推出了基于Llama-2的中文LLaMA-2以及Alpaca-2系列模型，相比一期项目其主要特点如下：

📖 经过优化的中文词表

在一期项目中，我们针对一代LLaMA模型的32K词表扩展了中文字词（LLaMA：49953，Alpaca：49954）
在本项目中，我们重新设计了新词表（大小：55296），进一步提升了中文字词的覆盖程度，同时统一了LLaMA/Alpaca的词表，避免了因混用词表带来的问题，以期进一步提升模型对中文文本的编解码效率

⚡ 基于FlashAttention-2的高效注意力

FlashAttention-2是高效注意力机制的一种实现，相比其一代技术具有更快的速度和更优化的显存占用
当上下文长度更长时，为了避免显存爆炸式的增长，使用此类高效注意力技术尤为重要
本项目的所有模型均使用了FlashAttention-2技术进行训练

🚄 基于PI和YaRN的超长上下文扩展技术

在一期项目中，我们实现了基于NTK的上下文扩展技术，可在不继续训练模型的情况下支持更长的上下文
基于位置插值PI和NTK等方法推出了16K长上下文版模型，支持16K上下文，并可通过NTK方法最高扩展至24K-32K
基于YaRN方法进一步推出了64K长上下文版模型，支持64K上下文
进一步设计了方便的自适应经验公式，无需针对不同的上下文长度设置NTK超参，降低了使用难度

🤖 简化的中英双语系统提示语

在一期项目中，中文Alpaca系列模型使用了Stanford Alpaca的指令模板和系统提示语
初步实验发现，Llama-2-Chat系列模型的默认系统提示语未能带来统计显著的性能提升，且其内容过于冗长
本项目中的Alpaca-2系列模型简化了系统提示语，同时遵循Llama-2-Chat指令模板，以便更好地适配相关生态

👮 人类偏好对齐

在一期项目中，中文Alpaca系列模型仅完成预训练和指令精调，获得了基本的对话能力
通过基于人类反馈的强化学习（RLHF）实验，发现可显著提升模型传递正确价值观的能力
本项目推出了Alpaca-2-RLHF系列模型，使用方式与SFT模型一致

下图展示了本项目以及一期项目推出的所有大模型之间的关系。

模型下载

模型选择指引

以下是中文LLaMA-2和Alpaca-2模型的对比以及建议使用场景。如需聊天交互，请选择Alpaca而不是LLaMA。

对比项	中文LLaMA-2	中文Alpaca-2
模型类型	基座模型	指令/Chat模型（类ChatGPT）
已开源大小	1.3B、7B、13B	1.3B、7B、13B
训练类型	Causal-LM (CLM)	指令精调
训练方式	7B、13B：LoRA + 全量emb/lm-head 1.3B：全量	7B、13B：LoRA + 全量emb/lm-head 1.3B：全量
基于什么模型训练	原版Llama-2（非chat版）	中文LLaMA-2
训练语料	无标注通用语料（120G纯文本）	有标注指令数据（500万条）
词表大小^[1]	55,296	55,296
上下文长度^[2]	标准版：4K（12K-18K）长上下文版（PI）：16K（24K-32K）长上下文版（YaRN）：64K	标准版：4K（12K-18K）长上下文版（PI）：16K（24K-32K）长上下文版（YaRN）：64K
输入模板	不需要	需要套用特定模板^[3]，类似Llama-2-Chat
适用场景	文本续写：给定上文，让模型生成下文	指令理解：问答、写作、聊天、交互等
不适用场景	指令理解、多轮聊天等	文本无限制自由生成
偏好对齐	无	RLHF版本（1.3B、7B）

Note

[1] 本项目一代模型和二代模型的词表不同，请勿混用。二代LLaMA和Alpaca的词表相同。
[2] 括号内表示基于NTK上下文扩展支持的最大长度。
[3] Alpaca-2采用了Llama-2-chat系列模板（格式相同，提示语不同），而不是一代Alpaca的模板，请勿混用。
[4] 不建议单独使用1.3B模型，而是通过投机采样搭配更大的模型（7B、13B）使用。

完整模型下载

以下是完整版模型，直接下载即可使用，无需其他合并步骤。推荐网络带宽充足的用户。

模型名称	类型	大小	下载地址	GGUF
Chinese-LLaMA-2-13B	基座模型	24.7 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-LLaMA-2-7B	基座模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-LLaMA-2-1.3B	基座模型	2.4 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-13B	指令模型	24.7 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-7B	指令模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-1.3B	指令模型	2.4 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]

长上下文版模型

以下是长上下文版模型，推荐以长文本为主的下游任务使用，否则建议使用上述标准版。

模型名称	类型	大小	下载地址	GGUF
Chinese-LLaMA-2-7B-64K 🆕	基座模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-7B-64K 🆕	指令模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-LLaMA-2-13B-16K	基座模型	24.7 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-LLaMA-2-7B-16K	基座模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-13B-16K	指令模型	24.7 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-7B-16K	指令模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]

RLHF版模型

以下是人类偏好对齐版模型，对涉及法律、道德的问题较标准版有更优的价值导向。

模型名称	类型	大小	下载地址	GGUF
Chinese-Alpaca-2-7B-RLHF 🆕	指令模型	12.9 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]
Chinese-Alpaca-2-1.3B-RLHF 🆕	指令模型	2.4 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]	[🤗HF]

AWQ版模型

AWQ（Activation-aware Weight Quantization）是一种高效的模型量化方案，目前可兼容🤗transformers、llama.cpp等主流框架。

本项目模型的AWQ预搜索结果可通过以下链接获取：https://huggingface.co/hfl/chinese-llama-alpaca-2-awq

生成AWQ量化模型（AWQ官方目录）：https://github.com/mit-han-lab/llm-awq
llama.cpp中使用AWQ：https://github.com/ggerganov/llama.cpp/tree/master/awq-py

LoRA模型下载

以下是LoRA模型（含emb/lm-head），与上述完整模型一一对应。需要注意的是LoRA模型无法直接使用，必须按照教程与重构模型进行合并。推荐网络带宽不足，手头有原版Llama-2且需要轻量下载的用户。

模型名称	类型	合并所需基模型	大小	LoRA下载地址
Chinese-LLaMA-2-LoRA-13B	基座模型	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-LLaMA-2-LoRA-7B	基座模型	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-Alpaca-2-LoRA-13B	指令模型	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-Alpaca-2-LoRA-7B	指令模型	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]

以下是长上下文版模型，推荐以长文本为主的下游任务使用，否则建议使用上述标准版。

模型名称	类型	合并所需基模型	大小	LoRA下载地址
Chinese-LLaMA-2-LoRA-7B-64K 🆕	基座模型	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-Alpaca-2-LoRA-7B-64K 🆕	指令模型	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-LLaMA-2-LoRA-13B-16K	基座模型	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-LLaMA-2-LoRA-7B-16K	基座模型	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-Alpaca-2-LoRA-13B-16K	指令模型	Llama-2-13B-hf	1.5 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]
Chinese-Alpaca-2-LoRA-7B-16K	指令模型	Llama-2-7B-hf	1.1 GB	[Baidu] [Google] [🤗HF] [🤖ModelScope]

Important

LoRA模型无法单独使用，必须与原版Llama-2进行合并才能转为完整模型。请通过以下方法对模型进行合并。

在线转换：Colab用户可利用本项目提供的notebook进行在线转换并量化模型
手动转换：离线方式转换，生成不同格式的模型，以便进行量化或进一步精调

推理与部署

本项目中的相关模型主要支持以下量化、推理和部署方式，具体内容请参考对应教程。

工具	特点	CPU	GPU	量化	GUI	API	vLLM^§	16K^‡	64K^‡	投机采样	教程
llama.cpp	丰富的量化选项和高效本地推理	✅	✅	✅	❌	✅	❌	✅	✅	✅	link
🤗Transformers	原生transformers推理接口	✅	✅	✅	✅	❌	✅	✅	✅	✅	link
Colab Demo	在Colab中启动交互界面	✅	✅	✅	✅	❌	✅	✅	✅	✅	link
仿OpenAI API调用	仿OpenAI API接口的服务器Demo	✅	✅	✅	❌	✅	✅	✅	✅	❌	link
text-generation-webui	前端Web UI界面的部署方式	✅	✅	✅	✅	✅^†	❌	✅	❌	❌	link
LangChain	适合二次开发的大模型应用开源框架	✅^†	✅	✅^†	❌	❌	❌	✅	✅	❌	link
privateGPT	基于LangChain的多文档本地问答框架	✅	✅	✅	❌	❌	❌	✅	❌	❌	link

Note

^† 工具支持该特性，但教程中未实现，详细说明请参考对应官方文档
^‡ 指是否支持长上下文版本模型（需要第三方库支持自定义RoPE）
^§ vLLM后端不支持长上下文版本模型

系统效果

为了评测相关模型的效果，本项目分别进行了生成效果评测和客观效果评测（NLU类），从不同角度对大模型进行评估。需要注意的是，综合评估大模型能力仍然是亟待解决的重要课题，单个数据集的结果并不能综合评估模型性能。推荐用户在自己关注的任务上进行测试，选择适配相关任务的模型。

生成效果评测

为了更加直观地了解模型的生成效果，本项目仿照Fastchat Chatbot Arena推出了模型在线对战平台，可浏览和评测模型回复质量。对战平台提供了胜率、Elo评分等评测指标，并且可以查看两两模型的对战胜率等结果。题库来自于一期项目人工制作的200题，以及在此基础上额外增加的题目。生成回复具有随机性，受解码超参、随机种子等因素影响，因此相关评测并非绝对严谨，结果仅供晾晒参考，欢迎自行体验。部分生成样例请查看examples目录。

⚔️ 模型竞技场：http://llm-arena.ymcui.com

系统	对战胜率（无平局） ↓	Elo评分
Chinese-Alpaca-2-13B-16K	86.84%	1580
Chinese-Alpaca-2-13B	72.01%	1579
Chinese-Alpaca-Pro-33B	64.87%	1548
Chinese-Alpaca-2-7B	64.11%	1572
Chinese-Alpaca-Pro-7B	62.05%	1500
Chinese-Alpaca-2-7B-16K	61.67%	1540
Chinese-Alpaca-Pro-13B	61.26%	1567
Chinese-Alpaca-Plus-33B	31.29%	1401
Chinese-Alpaca-Plus-13B	23.43%	1329
Chinese-Alpaca-Plus-7B	20.92%	1379

Note

以上结果截至2023年9月1日。最新结果请进入⚔️竞技场进行查看。

客观效果评测：C-Eval

C-Eval是一个全面的中文基础模型评估套件，其中验证集和测试集分别包含1.3K和12.3K个选择题，涵盖52个学科。实验结果以“zero-shot / 5-shot”进行呈现。C-Eval推理代码请参考本项目：📖GitHub Wiki

LLaMA Models	Valid	Test	Alpaca Models	Valid	Test
Chinese-LLaMA-2-13B	40.6 / 42.7	38.0 / 41.6	Chinese-Alpaca-2-13B	44.3 / 45.9	42.6 / 44.0
Chinese-LLaMA-2-7B	28.2 / 36.0	30.3 / 34.2	Chinese-Alpaca-2-7B	41.3 / 42.9	40.3 / 39.5
Chinese-LLaMA-Plus-33B	37.4 / 40.0	35.7 / 38.3	Chinese-Alpaca-Plus-33B	46.5 / 46.3	44.9 / 43.5
Chinese-LLaMA-Plus-13B	27.3 / 34.0	27.8 / 33.3	Chinese-Alpaca-Plus-13B	43.3 / 42.4	41.5 / 39.9
Chinese-LLaMA-Plus-7B	27.3 / 28.3	26.9 / 28.4	Chinese-Alpaca-Plus-7B	36.7 / 32.9	36.4 / 32.3

客观效果评测：CMMLU

CMMLU是另一个综合性中文评测数据集，专门用于评估语言模型在中文语境下的知识和推理能力，涵盖了从基础学科到高级专业水平的67个主题，共计11.5K个选择题。CMMLU推理代码请参考本项目：📖GitHub Wiki

LLaMA Models	Test (0/few-shot)	Alpaca Models	Test (0/few-shot)
Chinese-LLaMA-2-13B	38.9 / 42.5	Chinese-Alpaca-2-13B	43.2 / 45.5
Chinese-LLaMA-2-7B	27.9 / 34.1	Chinese-Alpaca-2-7B	40.0 / 41.8
Chinese-LLaMA-Plus-33B	35.2 / 38.8	Chinese-Alpaca-Plus-33B	46.6 / 45.3
Chinese-LLaMA-Plus-13B	29.6 / 34.0	Chinese-Alpaca-Plus-13B	40.6 / 39.9
Chinese-LLaMA-Plus-7B	25.4 / 26.3	Chinese-Alpaca-Plus-7B	36.8 / 32.6

长上下文版模型评测

LongBench是一个大模型长文本理解能力的评测基准，由6大类、20个不同的任务组成，多数任务的平均长度在5K-15K之间，共包含约4.75K条测试数据。以下是本项目长上下文版模型在该中文任务（含代码任务）上的评测效果。LongBench推理代码请参考本项目：📖GitHub Wiki

Models	单文档QA	多文档QA	摘要	Few-shot学习	代码补全	合成任务	Avg
Chinese-Alpaca-2-7B-64K	44.7	28.1	14.4	39.0	44.6	5.0	29.3
Chinese-LLaMA-2-7B-64K	27.2	16.4	6.5	33.0	7.8	5.0	16.0
Chinese-Alpaca-2-13B-16K	47.9	26.7	13.0	22.3	46.6	21.5	29.7
Chinese-Alpaca-2-13B	38.4	20.0	11.9	17.3	46.5	8.0	23.7
Chinese-Alpaca-2-7B-16K	46.4	23.3	14.3	29.0	49.6	9.0	28.6
Chinese-Alpaca-2-7B	34.0	17.4	11.8	21.3	50.3	4.5	23.2
Chinese-LLaMA-2-13B-16K	36.7	17.7	3.1	29.8	13.8	3.0	17.3
Chinese-LLaMA-2-13B	28.3	14.4	4.6	16.3	10.4	5.4	13.2
Chinese-LLaMA-2-7B-16K	33.2	15.9	6.5	23.5	10.3	5.3	15.8
Chinese-LLaMA-2-7B	19.0	13.9	6.4	11.0	11.0	4.7	11.0

量化效果评测

以Chinese-LLaMA-2-7B为例，对比不同精度下的模型大小、PPL（困惑度）、C-Eval效果，方便用户了解量化精度损失。PPL以4K上下文大小计算，C-Eval汇报的是valid集合上zero-shot和5-shot结果。

精度	模型大小	PPL	C-Eval
FP16	12.9 GB	9.373	28.2 / 36.0
8-bit量化	6.8 GB	9.476	26.8 / 35.4
4-bit量化	3.7 GB	10.132	25.5 / 32.8

特别地，以下是在llama.cpp下不同量化方法的评测数据，供用户参考，速度以ms/tok计，测试设备为M1 Max。具体细节见📖GitHub Wiki

llama.cpp	F16	Q2_K	Q3_K	Q4_0	Q4_1	Q4_K	Q5_0	Q5_1	Q5_K	Q6_K	Q8_0
PPL	9.128	11.107	9.576	9.476	9.576	9.240	9.156	9.213	9.168	9.133	9.129
Size	12.91G	2.41G	3.18G	3.69G	4.08G	3.92G	4.47G	4.86G	4.59G	5.30G	6.81G
CPU Speed	117	42	51	39	44	43	48	51	50	54	65
GPU Speed	53	19	21	17	18	20	x	x	25	26	x

投机采样加速效果评测

通过投机采样方法并借助Chinese-LLaMA-2-1.3B和Chinese-Alpaca-2-1.3B，可以分别加速7B、13B的LLaMA和Alpaca模型的推理速度。以下是使用投机采样脚本在1*A40-48G上解码生成效果评测中的问题测得的平均速度（速度以ms/token计，模型均为fp16精度），供用户参考。详细说明见📖GitHub Wiki。

草稿模型	草稿模型速度	目标模型	目标模型速度	投机采样速度（加速比）
Chinese-LLaMA-2-1.3B	7.6	Chinese-LLaMA-2-7B	49.3	36.0（1.37x）
Chinese-LLaMA-2-1.3B	7.6	Chinese-LLaMA-2-13B	66.0	47.1（1.40x）
Chinese-Alpaca-2-1.3B	8.1	Chinese-Alpaca-2-7B	50.2	34.9（1.44x）
Chinese-Alpaca-2-1.3B	8.2	Chinese-Alpaca-2-13B	67.0	41.6（1.61x）

人类偏好对齐（RLHF）版本评测

对齐水平

为评估中文模型与人类价值偏好对齐程度，我们自行构建了评测数据集，覆盖了道德、色情、毒品、暴力等人类价值偏好重点关注的多个方面。实验结果以价值体现正确率进行呈现（体现正确价值观题目数 / 总题数）。

Alpaca Models	Accuracy	Alpaca Models	Accuracy
Chinese-Alpaca-2-1.3B	79.3%	Chinese-Alpaca-2-7B	88.3%
Chinese-Alpaca-2-1.3B-RLHF	95.8%	Chinese-Alpaca-2-7B-RLHF	97.5%

客观效果评测：C-Eval & CMMLU

Alpaca Models	C-Eval (0/few-shot)	CMMLU (0/few-shot)
Chinese-Alpaca-2-1.3B	23.8 / 26.8	24.8 / 25.1
Chinese-Alpaca-2-7B	42.1 / 41.0	40.0 / 41.8
Chinese-Alpaca-2-1.3B-RLHF	23.6 / 27.1	24.9 / 25.0
Chinese-Alpaca-2-7B-RLHF	40.6 / 41.2	39.5 / 41.0

训练与精调

预训练

在原版Llama-2的基础上，利用大规模无标注数据进行增量训练，得到Chinese-LLaMA-2系列基座模型
训练数据采用了一期项目中Plus版本模型一致的数据，其总量约120G纯文本文件
训练代码参考了🤗transformers中的run_clm.py，使用方法见📖预训练脚本Wiki

指令精调

在Chinese-LLaMA-2的基础上，利用有标注指令数据进行进一步精调，得到Chinese-Alpaca-2系列模型
训练数据采用了一期项目中Pro版本模型使用的指令数据，其总量约500万条指令数据（相比一期略增加）
训练代码参考了Stanford Alpaca项目中数据集处理的相关部分，使用方法见📖指令精调脚本Wiki

RLHF精调

在Chinese-Alpaca-2系列模型基础上，利用偏好数据和PPO算法进行人类偏好对齐精调，得到Chinese-Alpaca-2-RLHF系列模型
训练数据基于多个开源项目中的人类偏好数据和本项目指令精调数据进行采样，奖励模型阶段、强化学习阶段分别约69.5K、25.6K条样本
训练代码基于DeepSpeed-Chat开发，具体流程见📖奖励模型Wiki和📖强化学习Wiki

常见问题

请在提Issue前务必先查看FAQ中是否已存在解决方案。具体问题和解答请参考本项目 📖GitHub Wiki

问题1：本项目和一期项目的区别？
问题2：模型能否商用？
问题3：接受第三方Pull Request吗？
问题4：为什么不对模型做全量预训练而是用LoRA？
问题5：二代模型支不支持某些支持一代LLaMA的工具？
问题6：Chinese-Alpaca-2是Llama-2-Chat训练得到的吗？
问题7：为什么24G显存微调Chinese-Alpaca-2-7B会OOM？
问题8：可以使用16K长上下文版模型替代标准版模型吗？
问题9：如何解读第三方公开榜单的结果？
问题10：会出34B或者70B级别的模型吗？
问题11：为什么长上下文版模型是16K，不是32K或者100K？
问题12：为什么Alpaca模型会回复说自己是ChatGPT？
问题13：为什么pt_lora_model或者sft_lora_model下的adapter_model.bin只有几百k？

引用

如果您使用了本项目的相关资源，请参考引用本项目的技术报告：https://arxiv.org/abs/2304.08177

@article{Chinese-LLaMA-Alpaca,
    title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca},
    author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
    journal={arXiv preprint arXiv:2304.08177},
    url={https://arxiv.org/abs/2304.08177},
    year={2023}
}

致谢

本项目主要基于以下开源项目二次开发，在此对相关项目和研究开发人员表示感谢。

同时感谢Chinese-LLaMA-Alpaca（一期项目）的contributor以及关联项目和人员。

免责声明

本项目基于由Meta发布的Llama-2模型进行开发，使用过程中请严格遵守Llama-2的开源许可协议。如果涉及使用第三方代码，请务必遵从相关的开源许可协议。模型生成的内容可能会因为计算方法、随机因素以及量化精度损失等影响其准确性，因此，本项目不对模型输出的准确性提供任何保证，也不会对任何因使用相关资源和输出结果产生的损失承担责任。如果将本项目的相关模型用于商业用途，开发者应遵守当地的法律法规，确保模型输出内容的合规性，本项目不对任何由此衍生的产品或服务承担责任。

局限性声明

虽然本项目中的模型具备一定的中文理解和生成能力，但也存在局限性，包括但不限于：

可能会产生不可预测的有害内容以及不符合人类偏好和价值观的内容
由于算力和数据问题，相关模型的训练并不充分，中文理解能力有待进一步提升
暂时没有在线可互动的demo（注：用户仍然可以自行在本地部署和体验）

问题反馈

如有疑问，请在GitHub Issue中提交。礼貌地提出问题，构建和谐的讨论社区。

在提交问题之前，请先查看FAQ能否解决问题，同时建议查阅以往的issue是否能解决你的问题。
提交问题请使用本项目设置的Issue模板，以帮助快速定位具体问题。
重复以及与本项目无关的issue会被stable-bot处理，敬请谅解。

chinese-llama-alpaca-2's People

Contributors

Stargazers

Watchers

Forkers

kirakkk jiaqianjing phlizik davidfan1224 enjoysport2022 dumpmemory cyjack jaredshuai iamkomen kevinqin panguanji20 minghsuanwu snoopycn chenteng williamfangca compromiseee hertera1 gshan4056 glayyiyi leemengtw sarahbrownplace knightcn1983 pengzhimou zpf4934 hangxue-lab egoleecode lp17863564 gogojoestar tianpf wolfworld6 seekpoint keychenn lindali113 lealaxy kobe4cn hzrlsd zhch8888168 zy1417548204 bobolau liemlin haorenkk123 expmrc techthiyanes di-dimmasik kkd4soei cliekid xnliang98 lokvke totaldom1nation desu9 jilu6 xianghaiyang806 zxbin2000 hfbin ukza chunhualiu f901107 airaria hajime-y michael-fangji pppairiki deemonych lishuji gmpdtd95 qinjun-ioc hhy5277 kyleclk zhengfuliu michaelcola ssr-xx emreblky ari2341 somezak1 zzleo songmingjiu clsung codingonion startagain2016 d1026 philisterd nicoyang-21 criticalpulsar xiangyangkan openandrus yey11 nanixu kioco wtxfrancise ozszsh xfg0913 zfbok paulhuang01 ananan skyrookieyu constantwangheng gaovicki leonllzhang imounttai bloomv castrol68

chinese-llama-alpaca-2's Issues

预训练完美运行，但是保存lora模型为1k

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

None

基础模型

None

操作系统

None

详细描述问题

drwxr-xr-x 2 root root 4096 Aug 2 15:21 ./
drwxr-xr-x 4 root root 4096 Aug 2 15:21 ../
-rw-r--r-- 1 root root 111 Aug 2 15:21 README.md
-rw-r--r-- 1 root root 592 Aug 2 15:21 adapter_config.json
-rw-r--r-- 1 root root 443 Aug 2 15:21 adapter_model.bin
-rw-r--r-- 1 root root 411 Aug 2 15:21 special_tokens_map.json
-rw-r--r-- 1 root root 844403 Aug 2 15:21 tokenizer.model
-rw-r--r-- 1 root root 747 Aug 2 15:21 tokenizer_config.json

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

请问llama2-7b的显存要求是多少

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

# 用了6张a10（24G）进行预训练，block开到512，没有加lm_head，embedding层，开启了zero2 offload依旧报错

运行日志或截图

# torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 130.00 MiB (GPU 0; 22.20 GiB total capacity; 20.60 GiB already allocated; 126.12 MiB free; 20.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                                                                  | 0/3 [00:00<?, ?it/s

合併 LoRa model 不成功, 沒有產生最終模型檔

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型转换和合并

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

我嘗試按照文件上的步驟做模型合併, 結果不如預期.
雖然沒有報錯, 但沒有看到最終合併模型檔案

LLama 基模型放在 llama/llama-2-7b/

LLama-LoRA 模型放在 chinese-llama-2-lora-7b

# python llama/Chinese-LLaMA-Alpaca-2/scripts/merge_llama2_with_chinese_lora_low_mem.py --base_model llama/llama-2-7b/ --lora_model chinese-llama-2-lora-7b --output_type huggingface --output_dir llama-2-7b-combined
================================================================================
Base model: llama/llama-2-7b/
LoRA model: chinese-llama-2-lora-7b
Loading chinese-llama-2-lora-7b
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Saving tokenizer
Done.
Check output dir: llama-2-7b-combined


# ls -al llama-2-7b-combined/     ### 沒有看到最終模型檔案
total 844
drwxr-xr-x  2 root root   4096 Aug  4 15:13 .
drwxrwxr-x 11 1000 1000   4096 Aug  4 15:06 ..
-rw-r--r--  1 root root    435 Aug  4 15:13 special_tokens_map.json
-rw-r--r--  1 root root 844403 Aug  4 15:13 tokenizer.model
-rw-r--r--  1 root root    766 Aug  4 15:13 tokenizer_config.json

# ls -al  llama/llama-2-7b/                  ### 基模型在這裡
total 13161080
drwxr-xr-x 2 root root        4096 Aug  4 14:15 .
drwxr-xr-x 6 root root        4096 Aug  4 14:23 ..
-rw-r--r-- 1 root root         100 Jul 14 07:00 checklist.chk
-rw-r--r-- 1 root root 13476925163 Jul 14 07:00 consolidated.00.pth
-rw-r--r-- 1 root root         102 Jul 14 07:00 params.json

# ls -al chinese-llama-2-lora-7b    ### LoRA模型在這裡
total 1197992
drwxr-xr-x  3 root root       4096 Aug  4 14:53 .
drwxrwxr-x 11 1000 1000       4096 Aug  4 15:06 ..
drwxr-xr-x  8 root root       4096 Aug  4 14:38 .git
-rw-r--r--  1 root root       1519 Aug  4 14:38 .gitattributes
-rw-r--r--  1 root root       1945 Aug  4 14:38 README.md
-rw-rw-r--  1 root root        471 Jul 27 12:52 adapter_config.json
-rw-rw-r--  1 root root 1225856253 Jul 27 12:41 adapter_model.bin
-rw-rw-r--  1 root root        435 Jul 27 12:41 special_tokens_map.json
-rw-rw-r--  1 root root     844403 Jul 27 12:41 tokenizer.model
-rw-rw-r--  1 root root        748 Jul 27 12:41 tokenizer_config.json

依赖情况（代码类问题务必提供）

# pip list | grep -E 'transformers|peft|torch'
ctransformers            0.2.5
peft                     0.3.0.dev0
pytorch-quantization     2.1.2
sentence-transformers    2.2.2
torch                    2.0.1
torch-tensorrt           1.5.0.dev0
torchdata                0.7.0a0
torchtext                0.16.0a0
torchvision              0.16.0a0
transformers             4.31.0

运行日志或截图

# 请在此处粘贴运行日志

赞！动作真快！能否直接上 plus 版本？

预训练完毕后，进行lora合成报错，只训练lora参数，但是用你们提供的alpaca-lora 合成是OK的

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型推理

基础模型

Alpaca-2-7B

操作系统

Windows

详细描述问题

预训练完毕后，进行lora合成，只训练了lora的参数。但是合并后，运行推理是报错的。用你们提供的alpaca-lora是正确的。
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj"
modules_to_save="embed_tokens,lm_head" 这里已经去掉
lora_dropout=0.05

pretrained_model=path/model/chinese-alpaca-2-7b-hf
chinese_tokenizer_path=path/tokenizer
dataset_dir=path/yuliao
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=3
output_dir=output_dir
block_size=512
这是PT训练参数

依赖情况（代码类问题务必提供）

 pip list | grep -E 'transformers|peft|torch'
peft                     0.3.0.dev0
torch                    2.0.1
transformers             4.31.0

运行日志或截图

Loading path_to_output_dir...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.56s/it]
Loaded the model in 7.85 seconds.
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [5,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [7,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [8,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [9,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [10,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [11,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [12,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [13,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [14,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [15,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [16,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [17,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [18,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [19,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [20,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [21,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [22,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [23,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [24,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [25,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [26,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [27,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [28,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1005,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [74,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [75,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [76,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [77,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [78,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [79,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [80,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [81,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [82,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [83,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [84,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [85,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [86,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [87,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [88,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [89,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [90,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [91,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [5,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [7,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [8,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [9,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [10,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [11,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [12,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [13,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [14,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [15,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [16,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [17,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [18,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [19,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [20,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [21,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [22,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [23,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [24,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [25,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [26,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [27,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [28,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1014,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [74,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [75,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [76,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [77,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [78,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [79,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [80,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [81,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [82,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [83,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [84,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [85,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [86,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [87,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [88,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [89,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [90,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [91,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [5,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [7,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [8,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [9,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [10,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [11,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [12,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [13,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [14,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [15,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [16,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [17,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [18,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [19,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [20,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [21,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [22,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [23,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [24,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [25,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [26,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [27,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [28,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1019,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Indexing.cu:1146: block: [1004,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
File "I:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "I:\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 290, in generate_with_callback
shared.model.generate(**kwargs)
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 536, in forward
attention_mask = self._prepare_decoder_attention_mask(
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 464, in _prepare_decoder_attention_mask
combined_attention_mask = _make_causal_mask(
File "I:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 49, in _make_causal_mask
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Output generated in 0.69 seconds (0.00 tokens/s, 0 tokens, context 38, seed 2094177833

鸣哥，是不是这里会把llama1那套沿用下来，先扩充词表中文语料先增量预训练一波？

请问增量训练数据量大概用了多少B token数呢？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

expected scalar type Half but found Float

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

SFT之后加载模型，在对话出错：RuntimeError: expected scalar type Half but found Float，基础模型不管是chinese-alpaca-2-7b还是llama-2-7b-hf 都是这个错误。

SFT代码
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=chinese-alpaca-2-7b
chinese_tokenizer_path=chinese-alpaca-2-7b
dataset_dir=
per_device_train_batch_size=64
per_device_eval_batch_size=64
gradient_accumulation_steps=8
output_dir=output_dir/chinese-alpaca-2-7b-datav3_v2-sft-lr${lr}-rank${lora_rank}-alpha${lora_alpha}-dropout${lora_dropout}
validation_file=val.json

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --flash_attn \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --save_steps 10 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length 1024 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False


CUDA_VISIBLE_DEVICES=1 python gradio_demo.py --base_model chinese-alpaca-2-7b --lora_model ../training/output_dir/chinese-alpaca-2-7b-datav3_v2-sft-l
r2e-4-rank64-alpha128-dropout0.05/checkpoint-10/sft_lora_model/

依赖情况（代码类问题务必提供）

No response

运行日志或截图

len(history): 1
history:  [['你好', None]]
Input length: 36
/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Traceback (most recent call last):
  File "/home/daliqiji/project/llm/Chinese-LLaMA-Alpaca-2/scripts/inference/gradio_demo.py", line 258, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/daliqiji/project/llm/Chinese-LLaMA-Alpaca-2/scripts/inference/gradio_demo.py", line 419, in generate_with_callback
    model.generate(**kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/daliqiji/project/llm/Chinese-LLaMA-Alpaca-2/scripts/attn_and_long_ctx_patches.py", line 44, in xformers_forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/peft/tuners/lora.py", line 358, in forward
    result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/daliqiji/miniconda3/envs/chllmalp2/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float

扩充词汇表是怎样操作的？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型转换和合并

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

想问下哪里能获得扩充词汇表相关的内容与代码。

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

新模型训练考虑加入FlashAttention等新技术吗

只对lora进行精调最后合并报了错

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

4张A100 80G
先对原始llama2-7b-hf和chinese-alpaca-2-7b-lora做了合并。
以下是训练脚本：

wandb disabled
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

dir_date=$(date +%m%d)
train_version=v0.1

NUM_NODES=1
SEQ_LEN=2048
GC_SCALE=4
SKYPILOT_NUM_GPUS_PER_NODE=4
PER_DEVICE_BATCH_SIZE=$((2048 * $GC_SCALE / $SEQ_LEN))
GRADIENT_ACCUMULATION_STEPS=$((128 * 512 / $SEQ_LEN / $PER_DEVICE_BATCH_SIZE / $NUM_NODES / $SKYPILOT_NUM_GPUS_PER_NODE))

pretrained_model=/data/output/merged-chinese-alpaca-2-7b-lora-hf
chinese_tokenizer_path=/data/chinese-alpaca-2-lora-7b
peft_model=/data/chinese-alpaca-2-lora-7b
dataset_dir=/data/train-data/
# data cache 需要每次清理一下，否则加上之前的数据
data_cache=/data/cache/${dir_date}-${train_version}
rm -rf ${data_cache}*
per_device_train_batch_size=${PER_DEVICE_BATCH_SIZE}
per_device_eval_batch_size=${PER_DEVICE_BATCH_SIZE}
training_steps=1500
gradient_accumulation_steps=${GRADIENT_ACCUMULATION_STEPS}
output_dir=/data/output/fine-tunning-chinese-alpaca-2-7b-lora-${dir_date}-${train_version}
block_size=1024
max_seq_length=1024
deepspeed_config_file=ds_zero2_no_offload.json
validation_file=/data/train-data/fine_tunning.json
run_clm_sft_with_peft=run_clm_sft_with_peft.py

torchrun --nnodes 1 --nproc_per_node ${SKYPILOT_NUM_GPUS_PER_NODE} ${run_clm_sft_with_peft} \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --peft_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 2 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 250 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --gradient_checkpointing \
    --torch_dtype float16 \
    --ddp_find_unused_parameters False \
    --peft_path ${peft_model} \
    --gradient_checkpointing \
    --validation_file ${validation_file} \
    --flash_attn

依赖情况（代码类问题务必提供）

peft                     0.3.0.dev0
torch                    2.0.1
transformers             4.31.0

运行日志或截图

Dataset json downloaded and prepared to /data/train-data/fine_tunning/json/default-a5e3f5abb73f5f82/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 925.08it/s]
08/03/2023 23:16:27 - WARNING - root - building dataset...
08/03/2023 23:16:27 - WARNING - root - building dataset...
Traceback (most recent call last):
  File "/data/python/Chinese-LLaMA-Alpaca/scripts/training/build_dataset.py", line 68, in build_instruction_dataset
    processed_dataset = datasets.load_from_disk(cache_path)
  File "/root/anaconda3/envs/alpaca/lib/python3.10/site-packages/datasets/load.py", line 1906, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /data/train-data/fine_tunning is neither a `Dataset` directory nor a `DatasetDict` directory.

During handling of the above exception, another exception occurred:

请问120G中文语料包括什么内容？增量训练添加英文语料是否更合适呢？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

其他问题

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

No response

chinese-alpace-2-7b推理时，为什么要先输出一遍问题再回答？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型推理

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

chinese-alpace-2-7b部署在Linux，在jupyter中推理时，为什么要先输出一遍问题再回答？
推理代码如下：
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import GenerationConfig
from peft import PeftModel

generation_config = GenerationConfig(
repetition_penalty=1.1,
max_new_tokens=400
)

model_path='/home/chinese-alpace-2-7b'

tokenizer = LlamaTokenizer.from_pretrained(model_path, legacy=True)
model = LlamaForCausalLM.from_pretrained(
model_path,
load_in_8bit=False,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map='auto',
)
model.eval()

input_text='**法定货币是什么？'
inputs = tokenizer(input_text,return_tensors="pt")
generation_output = model.generate(
input_ids = inputs["input_ids"].to('cuda:0'),
attention_mask = inputs['attention_mask'].to('cuda:0'),
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
generation_config = generation_config
)
s = generation_output[0]
output = tokenizer.decode(s,skip_special_tokens=True)
print(output)

输出output：**法定货币是什么？人民币。**法定货币是人民币，简称RMB或CNY（Chinese National Currency）。

依赖情况（代码类问题务必提供）

No response

运行日志或截图

期待一下

请问下 “ 基于FlashAttention-2的高效注意力” 如何实现的？我在training代码里面没有找到

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

请教一下中文预训练数据组成？

如: 百科书籍网页代码 paper等，以及比例是怎样的？

请问训练使用的显卡资源大概是多少

deepspeed是哪个版本

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

None

基础模型

None

操作系统

None

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

请问是否考虑训练一个 extended context 版本的模型？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

7月28日， together.ai 使用 FlashAttention2 训练了一个 extended context 版本的 LLaMA-2 ， context 长度达 32k。他们发表了以下博客说明: https://together.ai/blog/llama-2-7b-32k
llama 2 基准模型的最大context 为 4k tokens，感觉在商业应用上是否会有些限制？训练一个 extended context 版本的 LLaMA-2 是否会有帮助呢？

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

请问如果想做全量微调的话，和Lora微调的代码一样吗？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

后续会开源13B的中文模型吗

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

其他问题

基础模型

None

操作系统

Linux

详细描述问题

No response

依赖情况（代码类问题务必提供）

No response

运行日志或截图

No response

预训练chinese-llama-2-7b时出错

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

为什么会在loading checkpoint时出错？是因为显存不够吗？

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

pip list | grep -E 'transformers|peft|torch'
peft 0.3.0.dev0
torch 2.0.1
transformers 4.31.0

运行日志或截图

# 请在此处粘贴运行日志

[2023-08-03 08:32:36,402] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2023-08-03 08:32:40.458334: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-08-03 08:32:42,753] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-03 08:32:42,753] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-03 08:32:42,753] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
08/03/2023 08:32:43 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:712] 2023-08-03 08:32:50,607 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--ziqingyang--chinese-llama-2-7b/snapshots/557b5cbd48a4a4eb5a08e975c4b6e11ac1ed4cbc/config.json
[INFO|configuration_utils.py:768] 2023-08-03 08:32:50,607 >> Model config LlamaConfig {
"_name_or_path": "ziqingyang/chinese-llama-2-7b",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 55296
}

[INFO|tokenization_utils_base.py:1839] 2023-08-03 08:32:50,710 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--ziqingyang--chinese-llama-2-7b/snapshots/557b5cbd48a4a4eb5a08e975c4b6e11ac1ed4cbc/tokenizer.model
[INFO|tokenization_utils_base.py:1839] 2023-08-03 08:32:50,710 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1839] 2023-08-03 08:32:50,710 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--ziqingyang--chinese-llama-2-7b/snapshots/557b5cbd48a4a4eb5a08e975c4b6e11ac1ed4cbc/special_tokens_map.json
[INFO|tokenization_utils_base.py:1839] 2023-08-03 08:32:50,710 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--ziqingyang--chinese-llama-2-7b/snapshots/557b5cbd48a4a4eb5a08e975c4b6e11ac1ed4cbc/tokenizer_config.json
[WARNING|logging.py:295] 2023-08-03 08:32:50,711 >> You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
08/03/2023 08:32:50 - INFO - main - training datasets-t has been loaded from disk
Caching indices mapping at /content/Chinese-LLaMA-Alpaca-2/scripts/training/temp_data_cache_dir/t/train/cache-5ce72d5a2218009e.arrow
08/03/2023 08:32:50 - INFO - datasets.arrow_dataset - Caching indices mapping at /content/Chinese-LLaMA-Alpaca-2/scripts/training/temp_data_cache_dir/t/train/cache-5ce72d5a2218009e.arrow
Caching indices mapping at /content/Chinese-LLaMA-Alpaca-2/scripts/training/temp_data_cache_dir/t/train/cache-7bb07f6ff7076c43.arrow
08/03/2023 08:32:50 - INFO - datasets.arrow_dataset - Caching indices mapping at /content/Chinese-LLaMA-Alpaca-2/scripts/training/temp_data_cache_dir/t/train/cache-7bb07f6ff7076c43.arrow
08/03/2023 08:32:50 - INFO - main - Num train_samples 3248
08/03/2023 08:32:50 - INFO - main - Training example:
08/03/2023 08:32:50 - INFO - main - 应该流畅而准确。 Text: ""I have a dream that one day this nation will rise up and live out the true meaning of its creed: 'We hold these truths to be self-evident, that all men are created equal.'"" Assistant: 我有一个梦想，希望有一天这个国家会振作起来，实现其信条的真正含义：“我们认为这些真理都是不言而喻的，即所有人生而平等。” " " Human: 请你回答以下问题：2019年全球最受欢迎的旅游城市是哪个？ Assistant: 据《环球旅行》杂志发布的2019年全球最佳旅游城市排名，第一名是日本京都。京都以其古老的神社、庙宇和花园等著名景点吸引着大量游客前来。此外，其他入选前十名的城市还包括巴黎、罗马、布达佩斯、普拉古、华沙、纽约、马拉喀什、巴塞罗那和香港等。 " " Human: 列举三个方法来减轻焦虑和压力 Assistant: 1. 呼吸练习：深呼吸、放松呼吸等简单的呼吸练习可以让身体得到放松，减轻焦虑和压力。一般而言，只需花费几分钟就可以达到效果。 2. 运动：运动是一种非常有效的减轻焦虑和压力的方法。进行一些简单的运动，例如散步、慢跑、瑜伽等，可以帮助身体释放紧张和充满焦虑的能量，同时还能提高身体的免疫力。 3. 放松技巧：放松技巧是减轻焦虑和压力的另一种有效方法。例如，渐进性肌肉松弛法、冥想、温泉浴、按摩等，都可以帮助身体达到放松和舒适的状态，从而减轻焦虑和压力。 " " Human: 小明：李老板，你好，我是你打电话来的那个律师，我来为你解决那个让你困扰的案件。李老板：谢谢你来帮我，小明律师。我听说你是个非常有能力的年轻律师，我相信你能帮我赢得这个案子。小明：嗯，这案子确实不简单，但是没问题。我已经仔细研究了你提供的材料，对于你的情况，我有一些个人的看法。不知道你是否同意先听一听我的观点。李老板：当然，我非常愿意听听你的想法。请告诉我，小明律师。请总结以上对话的主要内容。Ass
[INFO|modeling_utils.py:2603] 2023-08-03 08:32:50,837 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--ziqingyang--chinese-llama-2-7b/snapshots/557b5cbd48a4a4eb5a08e975c4b6e11ac1ed4cbc/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1172] 2023-08-03 08:32:50,838 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:599] 2023-08-03 08:32:50,839 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.31.0"
}

Loading checkpoint shards: 0% 0/2 [00:00<?, ?it/s]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 3336) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_clm_pt_with_peft.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-08-03_08:33:39
host : 802dbb16712c
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 3336)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3336

预训练使用flashattn报错：RuntimeError: shape '[1, 1024, 64, 128]' is invalid for input of size 1048576

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

None

操作系统

Linux

详细描述问题

我训练的是llama2-70b，因为显存不够就尝试了一下git里面的flashattn，但是就报错了，请问是什么原因？

# 训练代码：跟仓库里run_pt那个基本一样
#!/usr/bin/env python
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset.

Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=text-generation
"""
# You can also adapt this script on your own causal language modeling task. Pointers for this are left as comments.

import logging
import numpy as np
import math
import os
import sys
from dataclasses import dataclass, field
from itertools import chain
from typing import Optional, List, Dict, Any, Mapping
from pathlib import Path
import datasets
import torch
from datasets import load_dataset, concatenate_datasets

import transformers
from transformers import (
    CONFIG_MAPPING,
    MODEL_FOR_CAUSAL_LM_MAPPING,
    AutoConfig,
    AutoModelForCausalLM,
    LlamaForCausalLM,
    LlamaTokenizer,
    AutoTokenizer,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    is_torch_tpu_available,
    set_seed,
)
from transformers.testing_utils import CaptureLogger
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import send_example_telemetry
from transformers.utils.versions import require_version

from sklearn.metrics import accuracy_score
from peft import LoraConfig, TaskType, get_peft_model, PeftModel, get_peft_model_state_dict
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(transformers.TrainerCallback):
    def save_model(self, args, state, kwargs):
        if state.best_model_checkpoint is not None:
            checkpoint_folder = os.path.join(state.best_model_checkpoint, "pt_lora_model")
        else:
            checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "pt_lora_model")
        kwargs["model"].save_pretrained(peft_model_path)
        kwargs["tokenizer"].save_pretrained(peft_model_path)

    def on_save(self, args, state, control, **kwargs):
        self.save_model(args, state, kwargs)
        return control

    def on_train_end(self, args, state, control, **kwargs):
        peft_model_path = os.path.join(args.output_dir, "pt_lora_model")
        kwargs["model"].save_pretrained(peft_model_path)
        kwargs["tokenizer"].save_pretrained(peft_model_path)


def accuracy(predictions, references, normalize=True, sample_weight=None):
        return {
            "accuracy": float(
                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
            )
        }


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics but we need to shift the labels
    labels = labels[:, 1:].reshape(-1)
    preds = preds[:, :-1].reshape(-1)
    return accuracy(predictions=preds, references=labels)


def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first
        logits = logits[0]
    return logits.argmax(dim=-1)


def fault_tolerance_data_collator(features: List) -> Dict[str, Any]:
    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if isinstance(first["label_ids"][0], int) else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.

    try:
        for k, v in first.items():
            if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
                if isinstance(v, torch.Tensor):
                    batch[k] = torch.stack([f[k] for f in features])
                elif isinstance(v, np.ndarray):
                    batch[k] = torch.tensor(np.stack([f[k] for f in features]))
                else:
                    batch[k] = torch.tensor([f[k] for f in features])
    except ValueError: # quick fix by simply take the first example
        for k, v in first.items():
            if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
                if isinstance(v, torch.Tensor):
                    batch[k] = torch.stack([features[0][k]] * len(features))
                elif isinstance(v, np.ndarray):
                    batch[k] = torch.tensor(np.stack([features[0][k]] * len(features)))
                else:
                    batch[k] = torch.tensor([features[0][k]] * len(features))

    return batch


MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
    """

    model_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch."
            )
        },
    )
    tokenizer_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The tokenizer for weights initialization.Don't set if you want to train a model from scratch."
            )
        },
    )
    model_type: Optional[str] = field(
        default=None,
        metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
    )
    config_overrides: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "Override some existing default config settings when a model is trained from scratch. Example: "
                "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
            )
        },
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": (
                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
                "with private models)."
            )
        },
    )
    torch_dtype: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "Override the default `torch.dtype` and load the model under this dtype. If `auto` is passed, the "
                "dtype will be automatically derived from the model's weights."
            ),
            "choices": ["auto", "bfloat16", "float16", "float32"],
        },
    )

    def __post_init__(self):
        if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None):
            raise ValueError(
                "--config_overrides can't be used in combination with --config_name or --model_name_or_path"
            )


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    dataset_dir: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
    validation_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    streaming: bool = field(default=False, metadata={"help": "Enable streaming mode"})
    block_size: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "Optional input sequence length after tokenization. "
                "The training dataset will be truncated in block of this size for training. "
                "Default to the model max input length for single sentence inputs (take into account special tokens)."
            )
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    validation_split_percentage: Optional[float] = field(
        default=0.05,
        metadata={
            "help": "The percentage of the train set used as validation set in case there's no validation split"
        },
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    keep_linebreaks: bool = field(
        default=True, metadata={"help": "Whether to keep line breaks when using TXT files or not."}
    )
    data_cache_dir: Optional[str] = field(default="./", metadata={"help": "The datasets processed stored"})

    def __post_init__(self):
        if self.streaming:
            require_version("datasets>=2.0.0", "The streaming feature requires `datasets>=2.0.0`")


@dataclass
class MyTrainingArguments(TrainingArguments):
    trainable : Optional[str] = field(default="q_proj,v_proj")
    lora_rank : Optional[int] = field(default=8)
    lora_dropout : Optional[float] = field(default=0.1)
    lora_alpha : Optional[float] = field(default=32.)
    modules_to_save : Optional[str] = field(default=None)
    debug_mode : Optional[bool] = field(default=False)
    peft_path : Optional[str] = field(default=None)
    flash_attn : Optional[bool] = field(default=False)
    train_peft : Optional[bool] = field(default=True)


logger = logging.getLogger(__name__)


def main():

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, MyTrainingArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    if training_args.flash_attn:
        from flash_attn_patch import replace_llama_attn_with_flash_attn
        replace_llama_attn_with_flash_attn()

    # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
    # information sent is the one passed as arguments along with your Python/PyTorch versions.
    send_example_telemetry("run_clm", model_args, data_args)

    # Setup logging
    logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,  # if training_args.local_rank in [-1, 0] else logging.WARN,
        handlers=[logging.StreamHandler(sys.stdout)],)

    if training_args.should_log:
        # The default of training_args.log_level is passive, so we set log level at info here to have that default.
        transformers.utils.logging.set_verbosity_info()

    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()
    # transformers.tokenization_utils.logging.set_verbosity_warning()

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )

    # Detecting last checkpoint.
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
            raise ValueError(
                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
                "Use --overwrite_output_dir to overcome."
            )
        elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
            logger.info(
                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
            )

    # Set seed before initializing model.
    set_seed(training_args.seed)

    config_kwargs = {
        "cache_dir": model_args.cache_dir,
        "revision": model_args.model_revision,
        "use_auth_token": True if model_args.use_auth_token else None,
    }
    if model_args.config_name:
        config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
    elif model_args.model_name_or_path:
        config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
    else:
        config = CONFIG_MAPPING[model_args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")
        if model_args.config_overrides is not None:
            logger.info(f"Overriding config: {model_args.config_overrides}")
            config.update_from_string(model_args.config_overrides)
            logger.info(f"New config: {config}")

    tokenizer_kwargs = {
        "cache_dir": model_args.cache_dir,
        "use_fast": model_args.use_fast_tokenizer,
        "revision": model_args.model_revision,
        "use_auth_token": True if model_args.use_auth_token else None,
    }
    if model_args.tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
    elif model_args.tokenizer_name_or_path:
        tokenizer = LlamaTokenizer.from_pretrained(model_args.tokenizer_name_or_path, **tokenizer_kwargs)
    else:
        raise ValueError(
            "You are instantiating a new tokenizer from scratch. This is not supported by this script."
            "You can do it from another script, save it, and load it from here, using --tokenizer_name."
        )

    # Preprocessing the datasets.
    # First we tokenize all the texts.
    # since this will be pickled to avoid _LazyModule error in Hasher force logger loading before tokenize_function
    tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

    def tokenize_function(examples):
        with CaptureLogger(tok_logger) as cl:
            output = tokenizer(examples["text"])
        # clm input could be much much longer than block_size
        if "Token indices sequence length is longer than the" in cl.out:
            tok_logger.warning(
                "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits"
                " before being passed to the model."
            )
        return output
    if data_args.block_size is None:
        block_size = tokenizer.model_max_length
        if block_size > 1024:
            logger.warning(
                "The chosen tokenizer supports a `model_max_length` that is longer than the default `block_size` value"
                " of 1024. If you would like to use a longer `block_size` up to `tokenizer.model_max_length` you can"
                " override this default with `--block_size xxx`."
            )
            block_size = 1024
    else:
        if data_args.block_size > tokenizer.model_max_length:
            logger.warning(
                f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model"
                f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
            )
        block_size = min(data_args.block_size, tokenizer.model_max_length)

    # Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result
    with training_args.main_process_first(desc="dataset map tokenization and grouping"):
        lm_datasets = []
        path = Path(data_args.dataset_dir)
        files = [file.name for file in path.glob("*.txt")]
        if training_args.debug_mode is True:
            files = [files[0]]
        for idx, file in enumerate(files):
            data_file = os.path.join(path, file)
            filename = ''.join(file.split(".")[:-1])
            cache_path = os.path.join(data_args.data_cache_dir, filename)
            os.makedirs(cache_path, exist_ok=True)
            try:
                processed_dataset = datasets.load_from_disk(cache_path, keep_in_memory=False)
                logger.info(f'training datasets-{filename} has been loaded from disk')
            except Exception:
                cache_dir = os.path.join(data_args.data_cache_dir, filename+"_text")
                os.makedirs(cache_dir, exist_ok=True)
                raw_dataset = load_dataset("text", data_files=data_file, cache_dir=cache_dir, keep_in_memory=False)
                logger.info(f"{file} has been loaded")
                tokenized_dataset = raw_dataset.map(
                    tokenize_function,
                    batched=True,
                    num_proc=data_args.preprocessing_num_workers,
                    remove_columns="text",
                    load_from_cache_file=True,
                    keep_in_memory=False,
                    cache_file_names = {k: os.path.join(cache_dir, 'tokenized.arrow') for k in raw_dataset},
                    desc="Running tokenizer on dataset",
                )
                grouped_datasets = tokenized_dataset.map(
                    group_texts,
                    batched=True,
                    num_proc=data_args.preprocessing_num_workers,
                    load_from_cache_file=True,
                    keep_in_memory=False,
                    cache_file_names = {k: os.path.join(cache_dir, 'grouped.arrow') for k in tokenized_dataset},
                    desc=f"Grouping texts in chunks of {block_size}",
                )
                processed_dataset = grouped_datasets
                processed_dataset.save_to_disk(cache_path)
            if idx == 0:
                lm_datasets = processed_dataset['train']
            else:
                assert lm_datasets.features.type == processed_dataset["train"].features.type
                lm_datasets = concatenate_datasets([lm_datasets, processed_dataset["train"]])

        lm_datasets = lm_datasets.train_test_split(test_size = data_args.validation_split_percentage)

    if training_args.do_train:
        train_dataset = lm_datasets['train']
        if data_args.max_train_samples is not None:
            max_train_samples = min(len(train_dataset), data_args.max_train_samples)
            train_dataset = train_dataset.select(range(max_train_samples))
        logger.info(f"Num train_samples  {len(train_dataset)}")
        logger.info("Training example:")
        logger.info(tokenizer.decode(train_dataset[0]['input_ids']))
    if training_args.do_eval:
        eval_dataset = lm_datasets["test"]
        if data_args.max_eval_samples is not None:
            max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)
            eval_dataset = eval_dataset.select(range(max_eval_samples))
        logger.info(f"Num eval_samples  {len(eval_dataset)}")
        logger.info("Evaluation example:")
        logger.info(tokenizer.decode(eval_dataset[0]['input_ids']))
    if model_args.model_name_or_path:
        torch_dtype = (
            model_args.torch_dtype
            if model_args.torch_dtype in ["auto", None]
            else getattr(torch, model_args.torch_dtype)
        )
        model = LlamaForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            from_tf=bool(".ckpt" in model_args.model_name_or_path),
            config=config,
            cache_dir=model_args.cache_dir,
            revision=model_args.model_revision,
            use_auth_token=True if model_args.use_auth_token else None,
            torch_dtype=torch_dtype,
            low_cpu_mem_usage=True,
            load_in_8bit=True,
            device_map='auto'
        )
    else:
        model = AutoModelForCausalLM.from_config(config)
        n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values())
        logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")

    model_vocab_size = model.get_output_embeddings().weight.size(0)
    tokenizer_vocab_size = len(tokenizer)
    logger.info(f"Model vocab size: {model_vocab_size}")
    logger.info(f"Tokenizer vocab size: {tokenizer_vocab_size}")
    if tokenizer_vocab_size != 55296:
        raise ValueError(f"The vocab size of tokenizer is {tokenizer_vocab_size}, not 55296. Please use Chinese-LLaMA-2 tokenizer.")
    if model_vocab_size != tokenizer_vocab_size:
        logger.info(f"Rezize model vocab size to {tokenizer_vocab_size}")
        model.resize_token_embeddings(len(tokenizer))
    
    if training_args.train_peft:
        logger.info("Train Peft Model!")
        if training_args.peft_path is not None:
            logger.info("Peft from pre-trained model")
            model = PeftModel.from_pretrained(model, training_args.peft_path)
        else:
            logger.info("Init new peft model")
            target_modules = training_args.trainable.split(',')
            modules_to_save = training_args.modules_to_save
            if modules_to_save is not None:
                modules_to_save = modules_to_save.split(',')
            lora_rank = training_args.lora_rank
            lora_dropout = training_args.lora_dropout
            lora_alpha = training_args.lora_alpha
            logger.info(f"target_modules: {target_modules}")
            logger.info(f"lora_rank: {lora_rank}")
            peft_config = LoraConfig(
                task_type=TaskType.CAUSAL_LM,
                target_modules=target_modules,
                inference_mode=False,
                r=lora_rank, lora_alpha=lora_alpha,
                lora_dropout=lora_dropout,
                modules_to_save=modules_to_save)
            model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()
        old_state_dict = model.state_dict
        model.state_dict = (
            lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
        ).__get__(model, type(model))
    else:
        logger.info("Tranin Full Model!")

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=fault_tolerance_data_collator,
        compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics
        if training_args.do_eval and not is_torch_tpu_available()
        else None,
    )
    trainer.add_callback(SavePeftModelCallback)
    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        
        with torch.autocast("cuda"): 
            train_result = trainer.train(resume_from_checkpoint=checkpoint)

        metrics = train_result.metrics

        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    # Evaluation
    if training_args.do_eval:
        logger.info("*** Evaluate ***")
        
        with torch.autocast("cuda"): 
            metrics = trainer.evaluate()

        max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
        metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)


if __name__ == "__main__":
    main()

提交作业任务：

lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/code/xx/LLM_mine/model/LLama2/llama2_chinese
chinese_tokenizer_path=/code/xx/LLM_mine/model/LLama2/llama2_chinese
dataset_dir=/code/xx/LLM_mine/data/wudao_test
data_cache=/code/xx/LLM_mine/scripts/pretraining/pretrain_output/data_cache_wudao
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
output_dir=/code/xx/LLM_mine/scripts/pretraining/pretrain_output/pretrain_llama2_final

deepspeed_config_file=/code/xx/LLM_mine/scripts/pretraining/ds_zero2_no_offload.json

torchrun --nnodes 2 --nproc_per_node 5 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29400 /code/xiongxiong/LLM_mine/scripts/pretraining/run_pt_llama2_flashattn.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed 666 \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 50 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 1024 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
    --train_peft True \
    --flash_attn False

依赖情况（代码类问题务必提供）

 RUN pip install git+https://github.com/huggingface/peft.git@13e53fc
 RUN pip install transformers==4.31.0
 RUN pip install sentencepiece==0.1.97
 RUN pip install bitsandbytes==0.39.1
 RUN pip install xformers
 RUN MAX_JOBS=2 pip install flash-attn --no-build-isolation -i https://mirrors.aliyun.com/pypi/simple

运行日志或截图

[INFO|trainer.py:1686] 2023-08-03 17:19:49,528 >> ***** Running training *****

[INFO|trainer.py:1687] 2023-08-03 17:19:49,528 >>   Num examples = 1,553,003

[INFO|trainer.py:1688] 2023-08-03 17:19:49,528 >>   Num Epochs = 1

[INFO|trainer.py:1689] 2023-08-03 17:19:49,528 >>   Instantaneous batch size per device = 1

[INFO|trainer.py:1692] 2023-08-03 17:19:49,528 >>   Total train batch size (w. parallel, distributed & accumulation) = 32

[INFO|trainer.py:1693] 2023-08-03 17:19:49,528 >>   Gradient Accumulation steps = 4

[INFO|trainer.py:1694] 2023-08-03 17:19:49,528 >>   Total optimization steps = 48,531

[INFO|trainer.py:1695] 2023-08-03 17:19:49,535 >>   Number of trainable parameters = 1,734,344,704

[INFO|integrations.py:716] 2023-08-03 17:19:49,545 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"

wandb: Currently logged in as: doublebear315. Use `wandb login --relogin` to force relogin

wandb: Tracking run with wandb version 0.15.3

wandb: Run data is saved locally in /wandb/run-20230803_171951-cglufcfi

wandb: Run `wandb offline` to turn off syncing.

wandb: Syncing run efficient-breeze-127

wandb: ⭐️ View project at https://wandb.ai/doublebear315/huggingface

wandb: 🚀 View run at https://wandb.ai/doublebear315/huggingface/runs/cglufcfi


  0%|          | 0/48531 [00:00> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

Traceback (most recent call last):

  File "/code/xiongxiong/LLM_mine/scripts/pretraining/run_pt_llama2_flashattn.py", line 646, in 

    main()

  File "/code/xiongxiong/LLM_mine/scripts/pretraining/run_pt_llama2_flashattn.py", line 613, in main

    train_result = trainer.train(resume_from_checkpoint=checkpoint)

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train

    return inner_training_loop(

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop

    tr_loss_step = self.training_step(model, inputs)

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step

    loss = self.compute_loss(model, inputs)

  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss

    outputs = model(**inputs)

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

    return forward_call(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

    ret_val = func(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward

    loss = self.module(*inputs, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

    return forward_call(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward

    return self.base_model(

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

    return forward_call(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward

    output = old_forward(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward

    outputs = self.model(

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

    return forward_call(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward

    output = old_forward(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward

    layer_outputs = torch.utils.checkpoint.checkpoint(

  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint

    return CheckpointFunction.apply(function, preserve, *args)

  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply

    return super().apply(*args, **kwargs)  # type: ignore[misc]

  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward

    outputs = run_function(*args)

  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward

    return module(*inputs, output_attentions, None)

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

    return forward_call(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward

    output = old_forward(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward

    hidden_states, self_attn_weights, present_key_value = self.self_attn(

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

    return forward_call(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward

    output = old_forward(*args, **kwargs)

  File "/code/xiongxiong/LLM_mine/scripts/pretraining/flash_attn_patch.py", line 39, in forward

    .view(bsz, q_len, self.num_heads, self.head_dim)

RuntimeError: shape '[1, 1024, 64, 128]' is invalid for input of size 1048576

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.

wandb: - 0.015 MB of 0.015 MB uploaded (0.000 MB deduped)
wandb: \ 0.015 MB of 0.040 MB uploaded (0.000 MB deduped)
wandb: | 0.040 MB of 0.040 MB uploaded (0.000 MB deduped)
wandb: / 0.040 MB of 0.040 MB uploaded (0.000 MB deduped)
wandb: 🚀 View run efficient-breeze-127 at: https://wandb.ai/doublebear315/huggingface/runs/cglufcfi

wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

wandb: Find logs at: 

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1157) of binary: /opt/conda/bin/python

Traceback (most recent call last):

  File "/opt/conda/bin/torchrun", line 8, in 

    sys.exit(main())

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper

    return f(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main

    run(args)

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run

    elastic_launch(

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__

    return launch_agent(self._config, self._entrypoint, list(args))

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent

    raise ChildFailedError(

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

============================================================

/code/xiongxiong/LLM_mine/scripts/pretraining/run_pt_llama2_flashattn.py FAILED

------------------------------------------------------------

Failures:

  

------------------------------------------------------------

Root Cause (first observed failure):

[0]:

  time      : 2023-08-03_17:20:13

  host      : pd-xiongxiong-p-tra-asbqtmoctdau-torchjob-master-0

  rank      : 0 (local_rank: 0)

  exitcode  : 1 (pid: 1157)

  error_file: 

  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

============================================================

请问如何在该项目基础上做微调时启用FlashAttention-2的高效注意力？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

其他问题

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

对比Baichuan7B

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

效果问题

基础模型

LLaMA-2-7B

操作系统

None

详细描述问题

可否对比一下Baichuan7B，看看相较于Baichuan，各方面差异如何

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

关于flashattention增加后的精度分析

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型推理

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

请问您使用flash attention替代之前的attention后模型的结果精度大概怎么评估呢？是不是fp16+flash的推理和fp32模型推理精度误差在小数点后两位左右？

依赖情况（代码类问题务必提供）

No response

运行日志或截图

No response

Can you first expose the training scripts, PT scripts and sft scripts, and synthesis scripts?

期待作者早日放出模型，也欢迎使用试用我们自己尝试的模型

中文语言模型如何发展还有很多不确定性，但基于Llama 2的中文版改造仍然会是最直接的基础方案，是较好的基线模型，所以我们放出了自己finetune的模型，希望能给社区一个参考。

在线Demo：https://huggingface.co/spaces/LinkSoul/Chinese-Llama-2-7b
模型仓库：https://huggingface.co/LinkSoul/Chinese-Llama-2-7b
Github仓库：https://github.com/LinkSoul-AI/Chinese-Llama-2-7b

欢迎切磋或共建

关于基于llama版本底座预训练模型，全量sft loss炸裂的问题

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

None

详细描述问题

CUDA_VISIBLE_DEVICES=0,1 deepspeed --master_port 61001 train_full.py \
    --data_path ./data/train_data2.json \
    --model_name_or_path checkpoints/llama2-7B \
    --per_device_train_batch_size 1 --output_dir out/fschat_7b_full \
    --fp16 --num_train_epochs 3 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 84 \
    --learning_rate 2e-5 --weight_decay 0. \
    --warmup_ratio 0.03 --lr_scheduler_type "cosine" \
    --model_max_length 1344 \
    --logging_steps 10```

其中，llama2-7B也就是chinese-llama-2-7B，微调过程中发现会loss 炸裂，

该问题其实用原本llama2也有，但是原因是llama2中文不行，但是没想到，中文llmama2也会炸裂。

使用的是V100， fp16，A100下没有问题。

请问有人知道为何吗？

具体的traceback:

Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│    98                                                                                            │
│    99                                                                                            │
│   100 if __name__ == "__main__":                                                                 │
│ ❱ 101 │   train()                                                                                │
│   102                                                                                            │
│                                                                                                  │
│                                                                                                  │
│    92 │   if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):                  │
│    93 │   │   trainer.train(resume_from_checkpoint=True)                                         │
│    94 │   else:                                                                                  │
│ ❱  95 │   │   trainer.train()                                                                    │
│    96 │   trainer.save_state()                                                                   │
│    97 │   safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)   │
│    98                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1645 in train                    │
│                                                                                                  │
│   1642 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1643 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1644 │   │   )                                                                                 │
│ ❱ 1645 │   │   return inner_training_loop(                                                       │
│   1646 │   │   │   args=args,                                                                    │
│   1647 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1648 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1938 in _inner_training_loop     │
│                                                                                                  │
│   1935 │   │   │   │   │   self.control = self.callback_handler.on_step_begin(args, self.state,  │
│   1936 │   │   │   │                                                                             │
│   1937 │   │   │   │   with self.accelerator.accumulate(model):                                  │
│ ❱ 1938 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1939 │   │   │   │                                                                             │
│   1940 │   │   │   │   if (                                                                      │
│   1941 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2770 in training_step            │
│                                                                                                  │
│   2767 │   │   │   with amp.scale_loss(loss, self.optimizer) as scaled_loss:                     │
│   2768 │   │   │   │   scaled_loss.backward()                                                    │
│   2769 │   │   else:                                                                             │
│ ❱ 2770 │   │   │   self.accelerator.backward(loss)                                               │
│   2771 │   │                                                                                     │
│   2772 │   │   return loss.detach() / self.args.gradient_accumulation_steps                      │
│   2773                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:1815 in backward               │
│                                                                                                  │
│   1812 │   │   │   # deepspeed handles loss scaling by gradient_accumulation_steps in its `back  │
│   1813 │   │   │   loss = loss / self.gradient_accumulation_steps                                │
│   1814 │   │   if self.distributed_type == DistributedType.DEEPSPEED:                            │
│ ❱ 1815 │   │   │   self.deepspeed_engine_wrapped.backward(loss, **kwargs)                        │
│   1816 │   │   elif self.distributed_type == DistributedType.MEGATRON_LM:                        │
│   1817 │   │   │   return                                                                        │
│   1818 │   │   elif self.scaler is not None:                                                     │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py:176 in backward            │
│                                                                                                  │
│   173 │   │   # - zero grad                                                                      │
│   174 │   │   # - checking overflow                                                              │
│   175 │   │   # - lr_scheduler step (only if engine.lr_scheduler is not None)                    │
│ ❱ 176 │   │   self.engine.step()                                                                 │
│   177 │   │   # and this plugin overrides the above calls with no-ops when Accelerate runs und   │
│   178 │   │   # Deepspeed, but allows normal functionality for non-Deepspeed cases thus enabli   │
│   179 │   │   # training loop that works transparently under many training regimes.              │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py:2184 in step                 │
│                                                                                                  │
│   2181 │   │   │   │   │   and self.quantizer.any_precision_switch()):                           │
│   2182 │   │   │   │   self._take_model_step(lr_kwargs, self.block_eigenvalue)                   │
│   2183 │   │   │   else:                                                                         │
│ ❱ 2184 │   │   │   │   self._take_model_step(lr_kwargs)                                          │
│   2185 │   │   │                                                                                 │
│   2186 │   │   │   report_progress = self.global_rank == 0 if self.global_rank else True         │
│   2187                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py:2086 in _take_model_step     │
│                                                                                                  │
│   2083 │   │   │   │   clip_grad_norm_(parameters=master_params,                                 │
│   2084 │   │   │   │   │   │   │   │   max_norm=self.gradient_clipping(),                        │
│   2085 │   │   │   │   │   │   │   │   mpu=self.mpu)                                             │
│ ❱ 2086 │   │   self.optimizer.step()                                                             │
│   2087 │   │                                                                                     │
│   2088 │   │   if hasattr(self.optimizer, '_global_grad_norm'):                                  │
│   2089 │   │   │   self._global_grad_norm = self.optimizer._global_grad_norm                     │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1778 in step     │
│                                                                                                  │
│   1775 │   │   timer_names = [OPTIMIZER_ALLGATHER, OPTIMIZER_GRADIENTS, OPTIMIZER_STEP]          │
│   1776 │   │                                                                                     │
│   1777 │   │   prev_scale = self.loss_scale                                                      │
│ ❱ 1778 │   │   self._update_scale(self.overflow)                                                 │
│   1779 │   │   if self.overflow:                                                                 │
│   1780 │   │   │   see_memory_usage('After overflow before clearing gradients')                  │
│   1781 │   │   │   self.zero_grad(set_to_none=True)                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:2028 in          │
│ _update_scale                                                                                    │
│                                                                                                  │
│   2025 │   │   self._check_overflow(partition_gradients)                                         │
│   2026 │                                                                                         │
│   2027 │   def _update_scale(self, has_overflow=False):                                          │
│ ❱ 2028 │   │   self.loss_scaler.update_scale(has_overflow)                                       │
│   2029 │                                                                                         │
│   2030 │   # Promote state so it can be retrieved or set via "fp16_optimizer_instance.state"     │
│   2031 │   def _get_state(self):                                                                 │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py:164 in             │
│ update_scale                                                                                     │
│                                                                                                  │
│   161 │   │   │   # self.cur_scale /= self.scale_factor                                          │
│   162 │   │   │   if self.delayed_shift == 1 or self.cur_hysteresis == 1:                        │
│   163 │   │   │   │   if (self.cur_scale == self.min_scale) and self.raise_error_at_min_scale:   │
│ ❱ 164 │   │   │   │   │   raise Exception(                                                       │
│   165 │   │   │   │   │   │   "Current loss scale already at minimum - cannot decrease scale a   │
│   166 │   │   │   │   │   )                                                                      │
│   167 │   │   │   │   else:                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

tokenizer是不是没对应好？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

None

基础模型

None

操作系统

None

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

sft的时候load数据进来，发现数据的样本全部被做了错误映射，看看是否会有影响？

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

你好，请问这里说的指令精调和RLHF事一个东西吗

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

None

基础模型

None

操作系统

None

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

新扩充的词表模型会开源么？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

LLaMA-2-7B

操作系统

Linux

详细描述问题

扩充后的词表模型（55296）会开源出来么？

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

# 请在此处粘贴运行日志

开启 gradient checkpointing 和 flash-attn2 时 lora sft 在 eval 时报错："use_cache is not supported"

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

No response

依赖情况（代码类问题务必提供）

peft                     0.4.0
torch                    2.0.1
transformers             4.31.0

运行日志或截图

开启了 gradient checkpointing，Transformers 会默认将 use_cache=False。

[WARNING|logging.py:295] 2023-08-04 11:20:55,373 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

但是开启 flash-attn2 时 lora sft 在 eval 时报错："use_cache is not supported"

[INFO|trainer.py:3081] 2023-08-04 11:44:30,657 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-08-04 11:44:30,657 >>   Num examples = 299
[INFO|trainer.py:3086] 2023-08-04 11:44:30,658 >>   Batch size = 16
Traceback (most recent call last):
  File "/mnt/bn/fulei-v6-hl-nas-mlx/mlx/workspace/llm/llm_train/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_sft_with_peft.py", line 507, in <module>
    main()
  File "/mnt/bn/fulei-v6-hl-nas-mlx/mlx/workspace/llm/llm_train/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_sft_with_peft.py", line 480, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 1901, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 2226, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 2934, in evaluate
    output = eval_loop(
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 3123, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 3337, in prediction_step
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/miniforge3/envs/flash-atten2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/bn/fulei-v6-hl-nas-mlx/mlx/workspace/llm/llm_train/Chinese-LLaMA-Alpaca-2/scripts/training/flash_attn_patch.py", line 59, in forward
    assert not use_cache, "use_cache is not supported"
AssertionError: use_cache is not supported

请问llama2扩充词表的脚本和Chinese-LLaMa-Alpaca项目里面的扩充词表方式是否一模一样呢?

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

None

基础模型

None

操作系统

None

详细描述问题

请问llama2扩充词表的脚本和Chinese-LLaMa-Alpaca项目里面的扩充词表方式是否一模一样呢?
如果不一样,能否开源一下LLama2扩充中文词表的方法呢?感谢大佬!

依赖情况（代码类问题务必提供）

No response

运行日志或截图

No response

进行预训练的时候报错。

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Windows

详细描述问题

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 28969) of binary:
提示报错了，查资料好像说是，版本问题
系统是windows wsl Ubuntu 20.04
Python 3.9.17
torch 2.0.1+cu117

依赖情况（代码类问题务必提供）

 pip list | grep -E 'transformers|peft|torch'
peft                     0.4.0
torch                    2.0.1
transformers             4.31.0

运行日志或截图

[INFO|trainer.py:1686] 2023-08-01 08:21:52,370 >> ***** Running training *****
[INFO|trainer.py:1687] 2023-08-01 08:21:52,370 >>   Num examples = 8,328
[INFO|trainer.py:1688] 2023-08-01 08:21:52,371 >>   Num Epochs = 1
[INFO|trainer.py:1689] 2023-08-01 08:21:52,371 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1692] 2023-08-01 08:21:52,371 >>   Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:1693] 2023-08-01 08:21:52,371 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1694] 2023-08-01 08:21:52,371 >>   Total optimization steps = 4,164
[INFO|trainer.py:1695] 2023-08-01 08:21:52,372 >>   Number of trainable parameters = 159,907,840
  0%|                                                                                          | 0/4164 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/maguoheng/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 636, in <module>
    main()
  File "/home/maguoheng/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_pt_with_peft.py", line 604, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 330, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.99 GiB total capacity; 22.70 GiB already allocated; 0 bytes free; 22.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                          | 0/4164 [00:01<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 28969) of binary: /home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/bin/python
Traceback (most recent call last):
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/maguoheng/anaconda3/envs/chinese-llama2-alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-01_08:21:58
  host      : MICROSO-AR91S8Q.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 28969)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

支援繁體中文嗎?

预训练完毕，合成完毕，在oobabooga运行模型会出现自问自答的情况。

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型推理

基础模型

Alpaca-2-7B

操作系统

Windows

详细描述问题

1.在预训练完成后，模型在oobabooga进行推理的时候，模型会出现 自问自答的情况。是模型训练，文本导致的还是，oobabooga设置导致的。
2.在某次合并后出现正常操作。  但是在回复的时候经常会出现说一半的情况。

依赖情况（代码类问题务必提供）

已经训练合并完成  模型训练用的  llama2-alpaca 进行训练，  合并 模型 model 指向 llama2-alpaca 模型进行合并。

运行日志或截图

# 请在此处粘贴运行日志

请问继续pretrain和sft都是按照4k长度来进行的吗？谢谢！

请教一下申请Lliama2模型下载Meta审批流程需要多久？

一周前就申请了Meat Llama2 模型下载，但是Meta侧一直没有完成审批。

请问，你们申请了多久，Meta侧就完成了审批？填写 https://ai.meta.com/resources/models-and-libraries/llama-downloads/ 中的申请信息时，有什么需要注意的吗？比如，国家不能填写**？

感觉中文答非所问，你们的会有这个问题吗？

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

其他问题

基础模型

Alpaca-2-7B

操作系统

Windows

详细描述问题

我从huggingface下载的ziqingyang/chinese-alpaca-2-7b 和ziqingyang/chinese-alpaca-2-lora-7b，通过script目录下面的gradio_demo.py 加载上面下载的模块，由于显卡内存不大，使用了--load_in_8bit选项，在窗口中简单提了几个问题，但是感觉答非所问，是我设置的问题吗？

# 请在此处粘贴运行代码（如没有可删除该代码块）

python gradio_demo.py --base_model C:\Users\aa\Downloads\ziqingyang-chinese-alpaca-2-7b --lora_model C:\Users\aa\Downloads\ziqingyang-chinese-alpaca-2-lora-7b --tokenizer_path C:\Users\aa\Downloads\ziqingyang-chinese-alpaca-2-lora-7b --load_in_8bit

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

模型推理的速度很慢

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型推理

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

# 请在此处粘贴运行代码（如没有可删除该代码块）

python inference_hf.py chinese-alpaca-2-7b --use_vllm 的推理速度可以
但是python gradio_demo.py --use_vllm --gpus 0,1 的推理速度很慢为啥？

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

无

运行日志或截图

# 请在此处粘贴运行日志

无

chinese-alpaca-2训练数据格式问题

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

请问一下如果要在自己数据上sft Chinese-alpaca-2，数据格式应该怎么准备，目前训练数据的格式都还是之前的{"instruction":"", "input":"", "output":""}，需要套上提到的新的template吗？

SYSTEM_PROMPT = """You are a helpful assistant. 你是一个乐于助人的助手。"""

PROMPT_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n\n"
    "{instruction} [/INST]"
)

full_prompt = PROMPT_TEMPLATE.format_map({"instruction": your_instruction, "system_prompt": SYSTEM_PROMPT})

还是直接用instruction+input作为prompt就可以呢？

依赖情况（代码类问题务必提供）

No response

运行日志或截图

No response

多卡微调时报错：ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 1280) of binary: /root/miniconda3/envs/llama2/bin/python

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

进行多卡训练时报错，运行环境：docker cuda11.6，4卡24G A6000，python3.10

########参数部分########
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/root/.cache/huggingface/chinese-alpaca-2-7b-hf
chinese_tokenizer_path=/root/.cache/huggingface/chinese-alpaca-2-7b-hf
dataset_dir=/root/.cache/huggingface/data/merge.json
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
output_dir=output_dir
validation_file=/root/.cache/huggingface/data/merge.json
max_seq_length=1024

deepspeed_config_file=ds_zero2_no_offload.json

########启动命令########
torchrun --nnodes 1 --nproc_per_node 4 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 2 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 250 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False

依赖情况（代码类问题务必提供）

peft                     0.3.0.dev0
torch                    2.0.1
transformers             4.31.0

运行日志或截图

(llama2) root@eb03b13bd90d:~/Chinese-LLaMA-Alpaca-2/scripts/training# bash run_sft.sh
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-08-03 08:13:01,087] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-03 08:13:01,106] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-03 08:13:01,202] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-03 08:13:01,202] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-03 08:13:01,202] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-03 08:13:01,219] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-03 08:13:01,219] [INFO] [comm.py:616:init_distributed] cdb=None
08/03/2023 08:13:01 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
[WARNING|logging.py:295] 2023-08-03 08:13:01,415 >> You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
08/03/2023 08:13:01 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:710] 2023-08-03 08:13:01,818 >> loading configuration file /root/.cache/huggingface/chinese-alpaca-2-7b-hf/config.json
[INFO|configuration_utils.py:768] 2023-08-03 08:13:01,818 >> Model config LlamaConfig {
  "_name_or_path": "/root/.cache/huggingface/chinese-alpaca-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 55296
}

[INFO|tokenization_utils_base.py:1837] 2023-08-03 08:13:01,818 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1837] 2023-08-03 08:13:01,818 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1837] 2023-08-03 08:13:01,818 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1837] 2023-08-03 08:13:01,818 >> loading file tokenizer_config.json
[WARNING|logging.py:295] 2023-08-03 08:13:01,819 >> You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
08/03/2023 08:13:01 - INFO - __main__ - Training files: 
08/03/2023 08:13:01 - WARNING - root - building dataset...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 1280) of binary: /root/miniconda3/envs/llama2/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
run_clm_sft_with_peft.py FAILED
----------------------------------------------------
Failures:
[1]:
  time      : 2023-08-03_08:13:04
  host      : eb03b13bd90d
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 1281)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1281
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-03_08:13:04
  host      : eb03b13bd90d
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 1280)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1280
====================================================

请问预计什么时候开源代码和模型权重文件？

您好，麻烦问下预计什么时候开源代码和模型权重文件？期待 chinese-llama2 的工作！

sha256值对应不上

提交前必须检查以下项目

请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

下载问题

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

config.json和generation_config.json的sha256值对应不上，但是文件内容一致

依赖情况（代码类问题务必提供）

# 请在此处粘贴依赖情况

运行日志或截图

下载后的：
2dcbbb625d02f4d12406d71b28d5cfac71fdffd332a0d91bc163655b35a185ac ./config.json
343026a3ef80bf1c69ad858cc414a87037bb9725d12cd24b7a194eabcc520e2d ./generation_config.json
文件内容：

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 55296
}
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.31.0"
}

huggingface上的：
68f516b143f9cdfe669efb3ae4edd707ff6e996559a68160d9f2d900e5bf1a3a config.json
a8094d0e4f79ad03cb77120df07d86e6fcf77b9654b5ca8323ecccdb40b3d84d generation_config.json

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 55296
}
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.31.0"
}