
eva's Introduction

EVA: A Large-Scale Chinese Open-Domain Dialogue System

English version

🌟 Updates

  • 2022.7: Released the HuggingFace version of the model and the corresponding code; see this branch
  • 2022.5: Released the EVA2.0-base and EVA2.0-large models
  • 2022.3: Released the EVA2.0-xLarge model and published the EVA2.0 paper
  • 2022.1: Released the fine-tuning code
  • 2021.8: Released the EVA1.0 model and the interaction code, and published the EVA1.0 paper

1 Introduction

EVA is currently the largest open-source Chinese pre-trained dialogue model, with 2.8 billion parameters, and is mainly designed for open-domain chitchat. There are two versions, 1.0 and 2.0. EVA1.0 was trained on WudaoCorpus-Dialog, while EVA2.0 was trained on higher-quality dialogue data cleaned from WudaoCorpus-Dialog, and its performance is clearly better than EVA1.0. Paper links: EVA1.0 paper, EVA2.0 paper.

This repository provides code for interactive evaluation, static evaluation, and fine-tuning of the model. The HuggingFace version of the model and the corresponding code can be found in this branch.

2 Model Download

The EVA2.0-base, EVA2.0-large, and EVA2.0-xLarge models can be downloaded here.

3 Running the Code

All code is located in the src/ directory.

3.1 Environment Setup

The code requires the CUDA 10.2 toolkit. Interactive evaluation takes roughly 7,000 MB of GPU memory; the memory needed for static evaluation and fine-tuning depends on the batch size and the maximum input length. With the current fine-tuning hyperparameters, the scripts have been tested to run on 4 * 32 GB V100 GPUs. We provide two ways to set up the environment.
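Optionally, you can sanity-check the CUDA toolkit version and the available GPU memory before installing anything; the two commands below are only a suggestion and are not part of the repository's scripts:

nvcc --version                                          # should report release 10.2
nvidia-smi --query-gpu=name,memory.total --format=csv   # check per-GPU memory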

Option 1: Using requirements.txt

Install the basic dependencies:

pip install -r requirements.txt

Install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Install deepspeed

We use deepspeed v0.3.9, which can be downloaded and installed from this repository, or installed with the following command:

pip install deepspeed==0.3.9

This version of deepspeed has a few bugs, so you may need to make some changes to the installed Python package. Details on the bugs can be found in this issue. In short, you need to modify a few lines in ${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed/runtime/zero/stage1.py and ${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed/runtime/engine.py. We provide the two modified files in this repository: src/ds_fix/stage1.py and src/ds_fix/engine.py. Simply replace ${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed/runtime/zero/stage1.py with src/ds_fix/stage1.py and ${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed/runtime/engine.py with src/ds_fix/engine.py.
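As a minimal sketch (assuming deepspeed is already installed and that you run this from the repository root), the replacement can be done with:

SITE_PKG=$(python3 -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")   # resolves ${PATH_TO_PYTHON_SITE_PACKAGE}/deepspeed
cp src/ds_fix/stage1.py ${SITE_PKG}/runtime/zero/stage1.py
cp src/ds_fix/engine.py ${SITE_PKG}/runtime/engine.py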

Option 2: Using Docker

docker pull gyxthu17/eva:1.5

Since the environment above is pre-installed in the docker image, you do not need to set any environment variables. To run the code, you may need to mount this repository into a directory inside the container, e.g. /mnt. To do so, run the following command:

docker run -ti -v ${PWD}:/mnt gyxthu17/eva:1.5 /bin/bash

3.2 Preparing the Data

Put the training, validation, and test data in one directory containing three files: train.txt, valid.txt, and test.txt. Each line of these files is one (flattened) dialogue sample, with turns separated by \t. The last turn is the response the model should generate, and the preceding turns are the dialogue context. For the exact format, see the preprocessed KdConv data we provide. The original KdConv data can be downloaded from this repository.
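For illustration only (the dialogue content below is taken from the sample dialogues in this README), a toy train.txt in this format could be written as follows; each printf call emits one sample whose turns are separated by tabs, with the last turn being the target response:

printf '你好\t你好,请问你是?\t我是小明\n'  > train.txt
printf '今天天气不错\t是啊,今天我买了衣服\n' >> train.txt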

3.3 Running the Scripts

All run scripts are in src/scripts.

  • Interactive evaluation scripts: eva_inference_interactive_beam.sh and eva_inference_interactive_no_beam.sh
  • Static evaluation script: eva_inference_static.sh
  • Fine-tuning script: eva_finetune.sh

Before running these scripts, change WORKING_DIR to the path of this EVA directory and CKPT_PATH to the path where the pre-trained checkpoint is stored. For static evaluation and fine-tuning, also change DATA_PATH to the data directory prepared in Section 3.2, which must contain the three files train.txt, valid.txt, and test.txt. The location where training/evaluation results are stored, SAVE_PATH, can also be changed as needed. The meaning of the other arguments can be found in the comments of eva_finetune.sh.

Note: EVA2.0 and EVA1.0 differ slightly in model architecture, so be sure to switch the model configuration file when switching models. The project provides the EVA2.0-xLarge configuration file by default: eva2.0_model_config.json; the EVA1.0 configuration file is eva1.0_model_config.json. Simply change CONFIG_PATH in the run script accordingly.
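For reference, the variables at the top of a run script might be set roughly as follows; the paths below are placeholders and need to be replaced with your own:

WORKING_DIR=/path/to/EVA                 # root of this repository
CKPT_PATH=/path/to/checkpoints/eva2.0    # downloaded pre-trained checkpoint
DATA_PATH=/path/to/data/kdconv           # directory with train.txt / valid.txt / test.txt
SAVE_PATH=/path/to/results/finetune      # where results are written
CONFIG_PATH=${WORKING_DIR}/src/configs/model/eva2.0_model_config.json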

After making these changes, run:

cd src/
bash scripts/eva_inference_interactive_beam.sh # interactive evaluation with beam search decoding
bash scripts/eva_inference_interactive_no_beam.sh # interactive evaluation without beam search decoding
bash scripts/eva_inference_static.sh # static evaluation
bash scripts/eva_finetune.sh # fine-tune the model

Note: After running the above commands, you should verify that the pre-trained model was loaded successfully. If it was, stdout will contain successfully loaded /path-to-checkpoint/eva/mp_rank_01_model_states.pt. Otherwise, it will print WARNING: could not find the metadata file /***/latest_checkpointed_iteration.txt will not load any checkpoints and will start from random. Note that even when the model loads successfully, the program will also print a long log such as The following zero checkpoints paths are missing: ['/path-to-checkpoint/eva/200000/zero_pp_rank_0_mp_rank_00_optim_states.pt',..., which means the optimizer states were not loaded. Since the code in this repository only performs evaluation and fine-tuning, whether the optimizer states are loaded does not matter, so you can ignore this log.

If the scripts run correctly: for interactive evaluation, you will see a prompt where you can type text to chat with EVA; type clear to restart the conversation, and type seed to set the random seed. For static evaluation and fine-tuning, the code reads the data and runs training and inference, and the final results are stored in SAVE_PATH.

3.4 Changing the Model Parallelism Degree

If a single GPU does not have enough memory, you can use src/change_mp.py to change the model parallelism degree; in the command below, TARGET_MP is the target parallelism degree. The downloaded checkpoints have a parallelism degree of 1, i.e. all model parameters reside on one GPU. Increasing the parallelism degree spreads the parameters of one model across multiple GPUs, reducing the memory footprint on each GPU (which, of course, may mean you need more GPUs for training). Note that after converting the downloaded checkpoint, you also need to set MP_SIZE in the training/inference scripts to the corresponding parallelism degree.

cd src/
python3 change_mp.py ${INPUT_MODEL_DIR} ${OUTPUT_MODEL_DIR} ${TARGET_MP}
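For example, to split the downloaded single-partition checkpoint into 4 partitions (the paths below are placeholders), and then set MP_SIZE=4 in the corresponding script:

python3 change_mp.py /path/to/checkpoints/eva2.0 /path/to/checkpoints/eva2.0_mp4 4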

4 Reference Results

We evaluate on the preprocessed KdConv dataset. Fine-tuning with the hyperparameters given in this repository and running static evaluation on the test set yields the following results:

                            loss    BLEU-4    Dist-4    F1
EVA1.0 before fine-tuning   3.49    0.64      85.94     13.24
EVA1.0 after fine-tuning    2.37    3.94      72.47     19.80
EVA2.0 before fine-tuning   3.14    1.02      82.67     14.36
EVA2.0 after fine-tuning    2.09    4.69      73.03     22.91

Interacting with the EVA2.0 model using beam search decoding, we obtained the following example dialogue.

Usr >>> 你好
Sys >>> 你好,请问你是?
Usr >>> 我是小明
Sys >>> 好的,小明,我是新来的,请多关照
Usr >>> 你了解人工智能吗?
Sys >>> 不了解,我只知道我是机器人
Usr >>> 那你很聪明
Sys >>> 谢谢夸奖,我会继续努力的
Usr >>> 你是学什么专业的呀
Sys >>> 计算机科学与技术,你呢?
Usr >>> 我也是学计算机的
Sys >>> 哦,那你有什么想知道的,可以问我
Usr >>> 编程语言你学过哪些?
Sys >>> java,c,c+都学过
Usr >>> 你学的真多
Sys >>> 呵呵,还好吧,我比较喜欢写代码
Usr >>> 哈哈,我也喜欢写代码,以后可以多多交流
Sys >>> 嗯,好的
Usr >>> 拜拜喽
Sys >>> 拜~

Note: Because random number generation may differ across hardware, you may not be able to reproduce this example exactly even with the same random seed we used. However, the overall quality should not differ much.

5 Disclaimer

This pre-trained dialogue model is intended for research purposes only. The dialogues in the training data were collected from various sources; although we designed a rigorous data cleaning pipeline, we do not guarantee that all inappropriate content has been filtered out. None of the content or opinions contained in this data reflect the views of the authors of this project. The model and code provided here are only one component of a complete dialogue system, and the decoding scripts we provide are for research purposes only. Any dialogue content generated with the models and scripts in this project is unrelated to the authors of this project.

6 TODO

  • Clean up and open-source the fine-tuning code
  • EVA2.0 model download links
  • EVA2.0 technical report
  • Open-source smaller models
  • HuggingFace version of the model and corresponding code
  • Open-source the pre-training data processing code

7 Citation

@article{coai2021eva,
  title={{EVA}: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training},
  author={Zhou, Hao and Ke, Pei and Zhang, Zheng and Gu, Yuxian and Zheng, Yinhe and Zheng, Chujie and Wang, Yida and Wu, Chen Henry and Sun, Hao and Yang, Xiaocong and Wen, Bosi and Zhu, Xiaoyan and Huang, Minlie and Tang, Jie},
  journal={arXiv preprint arXiv:2108.01547},
  year={2021}
}
@article{coai2022eva2,
  title={{EVA2.0}: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training},
  author={Gu, Yuxian and Wen, Jiaxin and Sun, Hao and Song, Yi and Ke, Pei and Zheng, Chujie and Zhang, Zheng and Yao, Jianzhu and Zhu, Xiaoyan and Tang, Jie and Huang, Minlie},
  journal={arXiv preprint arXiv:2203.09313},
  year={2022}
}

eva's People

Contributors

jiaxin-wen, kepei1106, t1101675


eva's Issues

how to pretrain on specific GPUs

GPUs 0 and 2 are currently available on my machine. I train on the KdConv data and changed the default NUM_GPUS_PER_WORKER from 4 to 2, but got the following error:
python -m torch.distributed.launch --master_port 1234 --nproc_per_node 1 /mnt/src/eva_finetune.py --model-config /mnt/src/configs/model/eva1.0_model_config.json --model-parallel-size 1 --batch-size 16 --epochs 3 --gradient-accumulation-steps 1 --enc-seq-length 128 --dec-seq-length 128 --train-iters -1 --save /mnt/results/finetune/ --log-file /mnt/results/finetune//log.txt --load /mnt/checkpoints/eva1.0 --no_load_strict --data-path /mnt/data/kdconv --distributed-backend nccl --lr 0.0001 --lr-decay-style noam --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --tokenizer-path /mnt/bpe_dialog_new --eval-interval 500 --log-interval 100 --save-interval 500 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /mnt/src/configs/deepspeed/eva_ds_config.json --do-train --do-valid --do-eval --train-ratio 1
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-02-14 02:57:49,629] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
File "/mnt/src/eva_finetune.py", line 506, in
main()
File "/mnt/src/eva_finetune.py", line 442, in main
initialize_distributed(args)
File "/mnt/src/utils.py", line 62, in initialize_distributed
deepspeed.init_distributed()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
torch.distributed.init_process_group(backend=dist_backend)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/mnt/src/eva_finetune.py', '--local_rank=0', '--model-config', '/mnt/src/configs/model/eva1.0_model_config.json', '--model-parallel-size', '1', '--batch-size', '16', '--epochs', '3', '--gradient-accumulation-steps', '1', '--enc-seq-length', '128', '--dec-seq-length', '128', '--train-iters', '-1', '--save', '/mnt/results/finetune/', '--log-file', '/mnt/results/finetune//log.txt', '--load', '/mnt/checkpoints/eva1.0', '--no_load_strict', '--data-path', '/mnt/data/kdconv', '--distributed-backend', 'nccl', '--lr', '0.0001', '--lr-decay-style', 'noam', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--tokenizer-path', '/mnt/bpe_dialog_new', '--eval-interval', '500', '--log-interval', '100', '--save-interval', '500', '--checkpoint-activations', '--deepspeed-activation-checkpointing', '--fp16', '--deepspeed', '--deepspeed_config', '/mnt/src/configs/deepspeed/eva_ds_config.json', '--do-train', '--do-valid', '--do-eval', '--train-ratio', '1']' returned non-zero exit status 1.

Asking about approaches for dialogue with a specific persona

Hello, on top of open-domain chitchat, how could the model hold a conversation as a specific character, e.g. chat with Lu Xun about topics related to him (the Sanwei Study, his time studying in Japan, etc.), answering automatically from Lu Xun's first-person perspective?
What approaches could be considered for this kind of persona-based dialogue?

Could the human interactive evaluation demo be open-sourced?

The human interactive evaluation demo system described for EVA is very useful: it has a friendly interface and makes it easy to evaluate and compare different models. Could it be open-sourced so that it can deliver more value?

What's the training setting for EVA2.0?

Hello, I am very interested in EVA2.0 and wonder what the settings were for EVA2.0 pre-training, because I want to train on a large dataset: for example, the batch size, the number of epochs, the number of GPUs used, and the training time. Does anyone know the answer to this question? Many thanks!

huggingface

Hello, is it possible to convert the existing model into the HuggingFace format? Is there any sample code to refer to?

How does MP_SIZE work with multiple GPUs on a single machine?

Hello, thank you for your work.

I tried to fine-tune EVA 2.0 on a single machine with 8 * 16 GB P100 GPUs. I set the model parallelism degree to 4 and converted the checkpoint with the provided script. The relevant hyperparameters are as follows:

MP_SIZE=4 # the model parallel size

NUM_GPUS_PER_WORKER=2 # number of gpus used on one node

BATCH_SIZE=8

However, after running the fine-tuning script I found that the model only occupied the first two GPUs while the remaining 6 stayed idle, and it soon threw a CUDA out of memory error. So I would like to ask how model parallelism should be configured with multiple GPUs on a single machine.

EVA2.0 model files

Will the EVA2.0 model files be released?
The results look quite good; how many parameters does the model have?

【eva】eva deploy problems

BAAI-WuDao#4

1. Running torch.cuda.device_count() inside the container returns 1.
2. SSH from the container to the host machine also works.

Please continue to help locate the problem.

Unable to proceed, no GPU resources available

I have run this many times and tried many different GPU servers, but they all report the same error.
Error message:
Traceback (most recent call last):
File "/opt/conda/bin/deepspeed", line 6, in
main()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 264, in main
raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
It says that no GPU resources are available.

I also checked the GPU inside docker:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:07.0 Off | 0 |
| N/A 28C P0 22W / 300W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

What I don't understand now is why the code reports that no GPU resources can be found. Thanks for any guidance!

AttributeError: 'NoneType' object has no attribute 'dp_process_group'

Traceback (most recent call last):
File "eva_interactive.py", line 479, in
main()
File "eva_interactive.py", line 466, in main
model = setup_model_for_inference(args, tokenizer.vocab_size)
File "eva_interactive.py", line 155, in setup_model_for_inference
dist_init_required=False
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/init.py", line 136, in initialize
config_params=config_params)
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 218, in init
self._configure_checkpointing(dist_init_required)
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 556, in _configure_checkpointing
group=self.optimizer.dp_process_group)
AttributeError: 'NoneType' object has no attribute 'dp_process_group'

How to start a new conversation?

Hello, how can I start a new conversation that does not depend on the previous one?
Alternatively, during eval, can I do just a simple single-turn dialogue? I modified the input, but it did not work and all the subsequent outputs were wrong.

Run src/scripts/infer_enc_dec_interactive.sh

Hi @t1101675 , I encounter the following errors when running infer_enc_dec_interactive.sh. Is there something wrong with the DeepSpeed version? My DeepSpeed version is 0.5.1; could you give some advice on this?

[2021-09-13 22:26:05,248] [INFO] [engine.py:197:__init__] DeepSpeed Flops Profiler Enabled: False
Traceback (most recent call last):
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 498, in <module>
    main()
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 485, in main
    model = setup_model_for_inference(args, tokenizer.vocab_size)
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 168, in setup_model_for_inference
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/__init__.py", line 131, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 228, in __init__
    self._configure_checkpointing(dist_init_required)
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 572, in _configure_checkpointing
    group=self.optimizer.dp_process_group)
AttributeError: 'NoneType' object has no attribute 'dp_process_group'
Killing subprocess 4701
Traceback (most recent call last):
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

Bug in change_mp.py

Hi, thanks for the repo and fine-tune script. However, I encountered a problem after the fine-tune phase.

The thing is, I split the checkpoint into two parts for fine-tuning at first. After fine-tuning, when I try to merge the two checkpoints back into one for inference, I find that change_mp.py runs into the following bug:

Traceback (most recent call last):
File "../change_mp.py", line 131, in
main()
File "../change_mp.py", line 104, in main
new_model = merge(model_parts)
File "../change_mp.py", line 17, in merge
for k, v in model_parts.items():
AttributeError: 'list' object has no attribute 'items'

It seems to be a bug, since the variable model_parts is actually a list of two models, which has no .items() attribute.

mp_size=4: error when saving the model

Traceback (most recent call last):
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 508, in
main()
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 491, in main
train(args, tokenizer, model, optimizer, lr_scheduler, train_dataset, train_dataloader, dev_dataset, dev_dataloader, device)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 303, in train
save_checkpoint(global_step, model, optimizer, lr_scheduler, args)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/utils.py", line 92, in save_checkpoint
save_ds_checkpoint(iteration, model, args)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/utils.py", line 110, in save_ds_checkpoint
model.save_checkpoint(args.save, str(iteration), client_state = sd, save_zero=False)
TypeError: save_checkpoint() got an unexpected keyword argument 'save_zero'

ssh: Could not resolve hostname node-0: Name or service not known

ssh: Could not resolve hostname node-0: Name or service not known
Traceback (most recent call last):
File "/usr/local/bin/deepspeed", line 6, in
main()
File "/usr/local/lib/python3.6/dist-packages/deepspeed/launcher/runner.py", line 281, in main
result = subprocess.check_output(hostname_cmd, shell=True)
File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ssh node-0 hostname -I']' returned non-zero exit status 255.

The README says: "You also need to change node-0 in ${WORKING_DIR}/src/configs/host_files/hostfile to the ssh node name (or IP) where you run distributed training", but I don't quite understand how this should be set up. Could someone please help?

About the EVA2.0 paper

The source of the queries used to start the self-chat experiments is described in the paper; we do not provide the raw files directly.

mp_size=4: lprobs index out of bounds in enforce_repetition_penalty_

File "/mnt/lustre/sjtu/home/bwy18/EVA/src/generation_utils.py", line 238, in postprocess_next_token_scores
enforce_repetition_penalty_(
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/generation_utils.py", line 215, in enforce_repetition_penalty_
if lprobs[i, previous_token] < 0:
IndexError: index 29810 is out of bounds for dimension 1 with size 7500
I am not sure whether this is because the lm_logits output by the decoder should go through mpu.gather_from_model_parallel_region.

TCPStore(master_addr, master_port, world_size, start_daemon, timeout) ==>RuntimeError: connect() timed out.

1. Running torch.cuda.device_count() inside the container returns 1.

2. SSH from the container to the host machine also works.

Error message:
[2021-09-15 08:47:29,228] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
Loading Model ...
WARNING: No training data specified
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2021-09-15 08:47:30,165] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
File "/mnt/src/eva_interactive.py", line 517, in
main()
File "/mnt/src/eva_interactive.py", line 494, in main
initialize_distributed(args)
File "/mnt/src/eva_interactive.py", line 464, in initialize_distributed
deepspeed.init_distributed()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
torch.distributed.init_process_group(backend=dist_backend)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

How should this error be resolved? Thanks for the guidance!

cuda out of memory

Hello, when I run eva_finetune.sh, no matter how small I make the batch size, everything is loaded onto a single GPU and then runs out of memory.

Some weights not loaded when I tried to run the code in the huggingface branch

Hello, I have a question about the difference in the EVAModel architecture between the code in the main and huggingface branches (the attached picture shows the warnings printed when running the huggingface-branch code with the pre-trained EVA large model checkpoint).

I investigated the code but found no significant difference, so I wonder what is going on. Could anyone kindly help me look into this? Thanks!

(screenshot attached)

Questions about the WDC-Dialogue data sources

Hello, the paper mentions that the WDC-Dialogue data comes from reposts on social media, comments and replies on forums, and Q&A exchanges. Could you describe in more detail which websites each part was collected from and by what means?
For example, what was the entry point on the Zhihu platform, or what keywords were used to search for the relevant data?
I am quite interested in this part of the work; an explanation would be much appreciated. Thanks!

Inference always gets stuck at load_checkpoint

When the checkpoint is not loaded, the inference code runs with randomly initialized parameters.
But when loading the checkpoint, it always gets stuck at the load_checkpoint step.

The full run log is as follows:

python -m torch.distributed.launch --master_port 1234 --nproc_per_node 1 /workspace/user_code/EVA-main/src/eva_interactive.py --model-config /workspace/user_code/EVA-main/src/configs/model/eva2.0_model_config.json --model-parallel-size 1 --load /workspace/user_code/EVA-main/new_data_scale_1103_change_iter/ --no_load_strict --distributed-backend nccl --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /workspace/user_code/EVA-main/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --num-beams 1 --repetition-penalty 1.6 --rule-path /workspace/user_code/EVA-main/rules --fp16 --deepspeed --deepspeed_config /workspace/user_code/EVA-main/src/configs/deepspeed/eva_ds_config.json
Loading Model ...
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-03-27 18:28:06,328] [INFO] [distributed.py:38:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3140 and data parallel seed: 422
building Enc-Dec model ...
number of parameters on model parallel rank 0: 2841044992
DeepSpeed is enabled.
[2022-03-27 18:28:34,516] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.9, git-hash=unknown, git-branch=unknown
[2022-03-27 18:28:34,536] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f956971ae90>
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] allreduce_always_fp32 ........ False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] amp_enabled .................. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] amp_params ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] disable_allgather ............ False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] dump_state ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 2000, 'delayed_shift': 4, 'min_scale': 256}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] elasticity_enabled ........... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] fp16_enabled ................. True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] global_rank .................. 0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_accumulation_steps .. 1
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_clipping ............ 1.0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_predivide_factor .... 1.0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] initial_dynamic_scale ........ 65536
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] loss_scale ................... 0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] memory_breakdown ............. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_legacy_fusion ...... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_name ............... None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_params ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pld_enabled .................. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pld_params ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] prescale_gradients ........... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] scheduler_name ............... None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] scheduler_params ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] sparse_attention ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] sparse_gradients_enabled ..... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] steps_per_print .............. 10
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_enabled .......... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_output_path ......
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] train_batch_size ............. 32
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] train_micro_batch_size_per_gpu 32
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] wall_clock_breakdown ......... True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] world_size ................... 1
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_allow_untested_optimizer True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": false,
"cpu_offload": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 1
}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_enabled ................. True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_optimization_stage ...... 1
[2022-03-27 18:28:34,538] [INFO] [config.py:715:print] json = {
"activation_checkpointing":{
"contiguous_memory_optimization":false,
"partition_activations":false
},
"fp16":{
"enabled":true,
"hysteresis":4,
"initial_scale_power":16,
"loss_scale":0,
"loss_scale_window":2000,
"min_loss_scale":256
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"steps_per_print":10,
"train_micro_batch_size_per_gpu":32,
"wall_clock_breakdown":true,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"stage":1
}
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] g++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/TH -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/THC -isystem /data/miniconda3/envs/env-3.7.7/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] g++ flatten_unflatten.o -shared -L/data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 10.33121132850647 seconds
[2022-03-27 18:28:45,497] [INFO] [engine.py:1286:_load_checkpoint] rank: 0 loading checkpoint: /workspace/user_code/EVA-main/new_data_scale_1103_change_iter/1/mp_rank_00_model_states.pt

The server socket cannot be initialized on [::]:1234

When I tried to run scripts/eva_inference_static.sh on my server, I encountered a problem (please see the attached figure). It seems something is wrong with the torch distributed setup. Does anyone have any idea about this problem? Thanks!

(screenshot attached)

Question about the model files

Hello, is the released model file complete? (There is only one mp_rank_00_model_states.pt.) Can I complete fine-tuning with just this file?

ppl?

Is there any code for computing the PPL metric?

torch multiprocessing api failed

Hi, I encounter the following error when I fine-tune EVA2.0-xLarge with 4 V100 GPUs.

building Enc-Dec model ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 16 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 14) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I was able to run the interactive mode on 1 P100 GPU without error. Any clue what is causing the error?

train

How can I train the model on my own data?

Implementing grounded dialogue

You can modify the code in src/eva_datasets and, depending on your own data format, add grounding data (knowledge, persona information, etc.) to the input.

stuck when using the docker image to run the inference script

Same issue as EVA/issues/4.
When the code reaches deepspeed.init_distributed() in eva_infer.py, it gets stuck, so I suggest:
since a single GPU can handle inference, why not release a plain script that just loads the weights and runs inference? We do not need multiple GPUs, so why do we need deepspeed?

how to build the server instead of the interactive mode

I created eva_server.sh to launch eva.server, a Flask service with a specified port, but deepspeed distributed training uses the default port 6000, so a single service ends up with two conflicting ports. How should this be handled? Can the distributed part be removed from the code?

How to reproduce the sample dialogue?

Hello, I am interacting with the EVA model. The released examples look very good, so I tried to reproduce them, but I could not generate anything comparable. My interaction went as follows:
Model Loaded!

今天天气不错
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.715 seconds.
Prefix dict has been built successfully.
是啊,今天我买了衣服
买了啥衣服呀?
是的,你那儿有什么好看的衣服吗?
我这边多云
你在哪呢?
北京
可惜,不在北京。。。
那你在哪?
你回学校啦?

I used CDial-GPT before, and after interacting with both, EVA feels much better. The sample cases look impressive, so I wanted to reproduce them, but without success...

The generation parameters follow the script; I only commented out these three options: OPTS+=" --fp16", OPTS+=" --deepspeed", OPTS+=" --deepspeed_config ${DS_CONFIG}".

Error when reproducing the interactive evaluation script; not sure whether it is a version issue or a code bug (resolved)

When reproducing the interactive evaluation script, the model loads correctly and dialogue input works, but the following error appears while waiting for EVA's reply:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 2: ordinal not in range(128)
After digging in, line 80 of /src/generation_utils.py, where the antonym file /antonym/antonym.txt is read, needs encoding="utf-8".
After the fix: with open(os.path.join(args.rule_path, './antonym/antonym.txt'), 'r', encoding="utf-8") as f:
I am not sure whether this is caused by the Python version or something else, so I did not dare to open a PR.

python version 3.9.12
torch version 1.11.0+cu113 (the GPU is a 3090, which requires a newer CUDA version)

Cannot open shared object file

Hi, I was trying to run the eva_finetune code with the provided docker 1.5 and I encountered the following issue:

Loading extension module utils...
Traceback (most recent call last):
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 506, in
main()
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 486, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args, config, ds_config, args.do_train)
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 139, in setup_model_and_optimizer
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/init.py", line 110, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in init
util_ops = UtilsBuilder().load()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
op_module = load(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
return _jit_compile(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1362, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1752, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1101, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: /home/xxx/.cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory

where /home/xxx is my user home directory. I checked the path where the error occurred and found that torch_extensions is not under it. Could you please help with this issue? Thanks in advance!
