
eva's Issues

Question about the model files

Hello, is the model file currently provided complete? (There is only one mp_rank_00_model_states.pt.) Can I finetune with just this model file?

Questions about the EVA2.0 paper

The source of the starting queries for the self-chat experiments is described in the paper; we do not provide the original files directly.

Asking about approaches for dialogue with a specific character persona

Hello, I would like to ask: on top of open-domain chit-chat, how could the model converse as a specific character, for example chatting with Lu Xun about topics related to him (the Sanwei Study, his studies in Japan, etc.) and automatically answering from Lu Xun's first-person perspective?
What approaches are worth considering for this kind of persona-grounded dialogue?

How to start a new conversation?

Hello, how do I start a new round of conversation that does not depend on the previous dialogue?
Also, during eval, how can I run just a simple single-turn dialogue? After modifying the input it did not work, and the subsequent outputs were all wrong.
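Not an official answer, but conceptually the interactive script just keeps appending turns to a context list, so starting fresh (or running single-turn eval) means building the model input from the current utterance only. A minimal sketch of the idea with hypothetical names, not the repo's actual loop:

single_turn = False      # set True to evaluate one turn at a time
history = []
while True:
    utterance = input(">>> ")
    # single-turn: condition only on the current utterance; multi-turn: include the history
    context = [utterance] if single_turn else history + [utterance]
    reply = generate_reply(context)        # hypothetical call into the model's generation code
    print(reply)
    history.extend([utterance, reply])
    # to start a new conversation, simply reset: history = []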

huggingface

Hello, can the existing model be converted to the Hugging Face format? Is there any sample code to refer to?

Some weights not loaded when I tried to run the code in the huggingface branch

Hello, I have a question about the difference in the EVAModel architecture between the code on the main and huggingface branches (the attached picture shows the warnings produced when running the huggingface-branch code with the pre-trained EVA large model checkpoint).

I investigated the code but found no significant difference, so I am wondering what is happening. Could anyone kindly help me look into this? Thanks!

(screenshot: 2022-06-28 10:42:29)

torch multiprocessing api failed

Hi, I encounter the following error when I finetune eva2.0-xLarge with 4 V100 GPUs.

building Enc-Dec model ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 16 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 14) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I was able to run the interactive mode on 1 P100 GPU without error. Any clue what is causing the error?

ssh: Could not resolve hostname node-0: Name or service not known

ssh: Could not resolve hostname node-0: Name or service not known
Traceback (most recent call last):
File "/usr/local/bin/deepspeed", line 6, in
main()
File "/usr/local/lib/python3.6/dist-packages/deepspeed/launcher/runner.py", line 281, in main
result = subprocess.check_output(hostname_cmd, shell=True)
File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ssh node-0 hostname -I']' returned non-zero exit status 255.

The README says: "You also need to change node-0 in ${WORKING_DIR}/src/configs/host_files/hostfile to the ssh node name (or IP) where you run distributed training", but I am not sure how to set this. Could someone please help?
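For reference, a DeepSpeed hostfile lists one node per line together with the number of GPU slots on it, and each hostname must be reachable via (passwordless) ssh from the machine that launches the job. A minimal sketch, assuming your node's ssh name is gpu-node-1 (a placeholder) and it has 8 GPUs:

gpu-node-1 slots=8

Replacing node-0 in ${WORKING_DIR}/src/configs/host_files/hostfile with the real ssh name (or IP) of your machine should make the ssh node-0 hostname -I call in the traceback above succeed.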

【eva】eva deploy problems

BAAI-WuDao#4

1. Running torch.cuda.device_count() inside the container returns 1.
2. SSH from inside the container to the host machine also works.

Could you please help me continue locating the problem?

train

How do I train with my own data?

TCPStore(master_addr, master_port, world_size, start_daemon, timeout) ==>RuntimeError: connect() timed out.

1. Running torch.cuda.device_count() inside the container returns 1.

2. SSH from the container to the host machine also works.

Error message:
[2021-09-15 08:47:29,228] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
Loading Model ...
WARNING: No training data specified
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2021-09-15 08:47:30,165] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
File "/mnt/src/eva_interactive.py", line 517, in
main()
File "/mnt/src/eva_interactive.py", line 494, in main
initialize_distributed(args)
File "/mnt/src/eva_interactive.py", line 464, in initialize_distributed
deepspeed.init_distributed()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
torch.distributed.init_process_group(backend=dist_backend)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

How should this error be resolved? Thanks for the guidance!

Can the human interactive evaluation demo be open-sourced?

The human interactive evaluation demo system described for EVA is very useful: the interface is friendly and it makes evaluating and comparing different models convenient. Could it be open-sourced so that it can deliver even more value?

cuda out of memory

Hello, when I run eva_finetune.sh, no matter how small I make the batch size, all of the memory is loaded onto a single card and it then runs out of memory.

EVA2.0 model files

Will the EVA2.0 model files be released?
The results look quite good. How many parameters does the model have?

How does MP_SIZE work on a single machine with multiple GPUs?

Hello, and thank you for your work.

I am trying to finetune EVA 2.0 on a single machine with 8 x 16 GB P100 cards. I set the model parallel size to 4 and converted the checkpoint with the provided script. The relevant hyperparameters are:

MP_SIZE=4 # the model parallel size

NUM_GPUS_PER_WORKER=2 # number of gpus used on one node

BATCH_SIZE=8

However, after running the finetuning script, I found that the model only occupied the first two cards while the remaining six stayed idle, and it quickly threw a CUDA out of memory error. So I would like to ask how model parallelism should be configured on a single machine with multiple GPUs.
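Not a verified answer, but with Megatron-style model parallelism the launcher has to start one process per GPU that you want to use, and the total number of processes must be a multiple of MP_SIZE; launching only 2 processes with MP_SIZE=4 is what leaves six cards idle. A configuration sketch for a single 8-GPU node (values are assumptions, not tested against this repo):

MP_SIZE=4                 # the model parallel size
NUM_GPUS_PER_WORKER=8     # one process per GPU on the node -> 8 / 4 = 2 data-parallel replicas
BATCH_SIZE=8

If memory is still tight on 16 GB cards, lowering BATCH_SIZE or using gradient accumulation are the usual next steps.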

Question about the data sources of WDC-Dialogue

Hello, the paper mentions that WDC-Dialogue comes from reposts on social media platforms, comments/replies on forums, and Q&A exchanges. Could you explain in more detail which websites each part was collected from, and how?
For example, for a platform like Zhihu, what was the entry point, or which keywords were used to search for the relevant data?
I am quite interested in this part of the work; an explanation would be much appreciated, thanks!

Run src/scripts/infer_enc_dec_interactive.sh

Hi @t1101675, I encounter the following errors when running infer_enc_dec_interactive.sh. Is there something wrong with the DeepSpeed version? My DeepSpeed version is 0.5.1; could you give some advice on this?

[2021-09-13 22:26:05,248] [INFO] [engine.py:197:__init__] DeepSpeed Flops Profiler Enabled: False
Traceback (most recent call last):
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 498, in <module>
    main()
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 485, in main
    model = setup_model_for_inference(args, tokenizer.vocab_size)
  **File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 168, in setup_model_for_inference
    model, optimizer, _, lr_scheduler = deepspeed.initialize(**
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/__init__.py", line 131, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 228, in __init__
    self._configure_checkpointing(dist_init_required)
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 572, in _configure_checkpointing
    group=self.optimizer.dp_process_group)
AttributeError: 'NoneType' object has no attribute 'dp_process_group'
Killing subprocess 4701
Traceback (most recent call last):
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

mp_size=4: lprobs index out of bounds in enforce_repetition_penalty_

File "/mnt/lustre/sjtu/home/bwy18/EVA/src/generation_utils.py", line 238, in postprocess_next_token_scores
enforce_repetition_penalty_(
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/generation_utils.py", line 215, in enforce_repetition_penalty_
if lprobs[i, previous_token] < 0:
IndexError: index 29810 is out of bounds for dimension 1 with size 7500
I am not sure whether this is because the lm_logits output by the decoder should first be passed through mpu.gather_from_model_parallel_region.
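For what it's worth, that guess matches the symptom: with mp_size=4 each rank only holds a 7500-wide shard of the ~30000-token vocabulary, so indexing it with a full-vocabulary token id (29810) goes out of bounds. A minimal sketch of the idea, assuming the Megatron-style mpu module used in this repo (names and shapes are assumptions):

# lm_logits from the decoder is vocab-parallel: [batch, seq, vocab / mp_size] on each rank
full_logits = mpu.gather_from_model_parallel_region(lm_logits)   # -> [batch, seq, vocab]
# pass full_logits, not the per-rank shard, into postprocess_next_token_scores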

How can I reproduce the released examples?

Hello, I am interacting with the EVA model. The released cases look very good, so I tried to reproduce them, but I cannot generate comparable examples. My interaction went as follows:
Model Loaded!

今天天气不错 (The weather is nice today)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.715 seconds.
Prefix dict has been built successfully.
是啊,今天我买了衣服 (Yeah, I bought some clothes today)
买了啥衣服呀? (What clothes did you buy?)
是的,你那儿有什么好看的衣服吗? (Yes, do you have any nice clothes over there?)
我这边多云 (It's cloudy here)
你在哪呢? (Where are you?)
北京 (Beijing)
可惜,不在北京。。。 (Too bad, not in Beijing...)
那你在哪? (Then where are you?)
你回学校啦? (Are you back at school?)

I used CDial-GPT before, and after interacting with EVA it feels much better; the released example cases look impressive, so I wanted to reproduce them, but without success...

The generation parameters follow the script; I only commented out these three options: OPTS+=" --fp16" OPTS+=" --deepspeed" OPTS+=" --deepspeed_config ${DS_CONFIG}"

Error when reproducing the interactive evaluation script; not sure whether it is a version issue or a code bug (solved)

When reproducing the interactive evaluation script, the model loads normally and dialogue input works, but while waiting for EVA's reply the following error appears:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 2: ordinal not in range(128)
After digging in, the fix is to add encoding="utf-8" where /src/generation_utils.py (around line 80) reads the antonym file /antonym/antonym.txt.
After the change: with open(os.path.join(args.rule_path, './antonym/antonym.txt'), 'r', encoding="utf-8") as f:
I am not sure whether this is caused by the Python version or something else, so I did not dare to open a PR.

python version 3.9.12
torch version 1.11.0+cu113 (the GPU is a 3090, which requires a newer CUDA version)

mp_size=4: error when saving the model

Traceback (most recent call last):
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 508, in
main()
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 491, in main
train(args, tokenizer, model, optimizer, lr_scheduler, train_dataset, train_dataloader, dev_dataset, dev_dataloader, device)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 303, in train
save_checkpoint(global_step, model, optimizer, lr_scheduler, args)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/utils.py", line 92, in save_checkpoint
save_ds_checkpoint(iteration, model, args)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/utils.py", line 110, in save_ds_checkpoint
model.save_checkpoint(args.save, str(iteration), client_state = sd, save_zero=False)
TypeError: save_checkpoint() got an unexpected keyword argument 'save_zero'
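Not an official fix, but the error simply means that the installed DeepSpeed's save_checkpoint() does not accept a save_zero keyword (the repo appears to expect a specific DeepSpeed version). A minimal workaround sketch for save_ds_checkpoint in src/utils.py, assuming you only need the model weights:

# drop the keyword the installed DeepSpeed no longer accepts
model.save_checkpoint(args.save, str(iteration), client_state=sd)

Alternatively, matching the DeepSpeed version used in the provided docker image avoids touching the code.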

The server socket cannot be initialized on [::]:1234

When I tried to run scripts/eva_inference_static.sh on my server, I encountered a problem (please see the attached figure). It seems something is wrong with the torch distributed setup. Does anyone have an idea about this problem? Thanks!

(screenshot: 2022-06-22 23:23:59)

Stuck when using the docker image to run the inference script

Same issue as EVA/issues/4.
When the code reaches deepspeed.init_distributed() in eva_infer.py, it gets stuck, so I have a suggestion:
since a single GPU can handle inference, why not release a simple script that just loads the weights and runs inference? If we do not need multiple GPUs, why do we need DeepSpeed?

Traceback (most recent call last):
File "eva_interactive.py", line 479, in
main()
File "eva_interactive.py", line 466, in main
model = setup_model_for_inference(args, tokenizer.vocab_size)
File "eva_interactive.py", line 155, in setup_model_for_inference
dist_init_required=False
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/init.py", line 136, in initialize
config_params=config_params)
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 218, in init
self._configure_checkpointing(dist_init_required)
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 556, in _configure_checkpointing
group=self.optimizer.dp_process_group)
AttributeError: 'NoneType' object has no attribute 'dp_process_group'

ppl?

Is there code available for computing the PPL (perplexity) metric?
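Not from the repo, but perplexity can be computed from the average per-token cross-entropy of the model on held-out responses. A minimal, self-contained sketch (the model call, the "lm_logits" key, and eval_dataloader are assumptions, not this repo's actual API):

import math
import torch
import torch.nn.functional as F

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for batch in eval_dataloader:                      # hypothetical dataloader of tokenized dialogues
        logits = model(**batch)["lm_logits"]           # [batch, seq_len, vocab]; output key is an assumption
        labels = batch["labels"]                       # gold response ids, with padding set to -100
        nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              labels.view(-1),
                              ignore_index=-100, reduction="sum")
        total_nll += nll.item()
        total_tokens += (labels != -100).sum().item()

ppl = math.exp(total_nll / total_tokens)               # corpus-level perplexity
print(f"PPL: {ppl:.2f}")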

how to pretrain on specific GPUs

Currently my GPUs 0 and 2 are available. I am pretraining on the KdConv data and changed the default NUM_GPUS_PER_WORKER from 4 to 2, but I got the following errors (a possible workaround is sketched after the traceback below):
python -m torch.distributed.launch --master_port 1234 --nproc_per_node 1 /mnt/src/eva_finetune.py --model-config /mnt/src/configs/model/eva1.0_model_config.json --model-parallel-size 1 --batch-size 16 --epochs 3 --gradient-accumulation-steps 1 --enc-seq-length 128 --dec-seq-length 128 --train-iters -1 --save /mnt/results/finetune/ --log-file /mnt/results/finetune//log.txt --load /mnt/checkpoints/eva1.0 --no_load_strict --data-path /mnt/data/kdconv --distributed-backend nccl --lr 0.0001 --lr-decay-style noam --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --tokenizer-path /mnt/bpe_dialog_new --eval-interval 500 --log-interval 100 --save-interval 500 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /mnt/src/configs/deepspeed/eva_ds_config.json --do-train --do-valid --do-eval --train-ratio 1
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-02-14 02:57:49,629] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
File "/mnt/src/eva_finetune.py", line 506, in
main()
File "/mnt/src/eva_finetune.py", line 442, in main
initialize_distributed(args)
File "/mnt/src/utils.py", line 62, in initialize_distributed
deepspeed.init_distributed()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
torch.distributed.init_process_group(backend=dist_backend)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/mnt/src/eva_finetune.py', '--local_rank=0', '--model-config', '/mnt/src/configs/model/eva1.0_model_config.json', '--model-parallel-size', '1', '--batch-size', '16', '--epochs', '3', '--gradient-accumulation-steps', '1', '--enc-seq-length', '128', '--dec-seq-length', '128', '--train-iters', '-1', '--save', '/mnt/results/finetune/', '--log-file', '/mnt/results/finetune//log.txt', '--load', '/mnt/checkpoints/eva1.0', '--no_load_strict', '--data-path', '/mnt/data/kdconv', '--distributed-backend', 'nccl', '--lr', '0.0001', '--lr-decay-style', 'noam', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--tokenizer-path', '/mnt/bpe_dialog_new', '--eval-interval', '500', '--log-interval', '100', '--save-interval', '500', '--checkpoint-activations', '--deepspeed-activation-checkpointing', '--fp16', '--deepspeed', '--deepspeed_config', '/mnt/src/configs/deepspeed/eva_ds_config.json', '--do-train', '--do-valid', '--do-eval', '--train-ratio', '1']' returned non-zero exit status 1.
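Not an official answer, but two separate things seem to be going on here: only one process is being launched (world size 1), and port 1234 is still held by an earlier run, which is what raises "Address already in use". A workaround sketch, treating the device list and port as examples rather than the repo's defaults:

CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch --master_port 1235 --nproc_per_node 2 /mnt/src/eva_finetune.py ... (remaining arguments as in the command above)

CUDA_VISIBLE_DEVICES restricts the run to GPUs 0 and 2, --nproc_per_node should match the number of visible GPUs, and a fresh --master_port avoids colliding with a process that is still bound to 1234.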

Implementing grounded dialogue

You can modify the code in src/eva_datasets and, according to your own data format, add the grounding data (knowledge, persona information, etc.) to the input.
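A minimal sketch of that idea (a hypothetical helper, not the repo's actual Dataset code): concatenate the grounding text with the dialogue context before tokenization, so the encoder conditions on both.

def build_grounded_context(grounding, context_utterances, sep_token="<sep>"):
    # grounding: persona or knowledge text, e.g. "我是鲁迅,曾在三味书屋读书,后赴日本留学。"
    #            ("I am Lu Xun; I studied at the Sanwei Study and later went to Japan.")
    # context_utterances: previous dialogue turns, oldest first
    return grounding + sep_token + sep_token.join(context_utterances)

The separator token and field order are design choices; what matters is that the grounding string is tokenized together with the context so generation can attend to it.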

how to build the server instead of the interactive mode

I created eva_server.sh to launch eva.server, a Flask service on a specified port, but DeepSpeed distributed training has a default port of 6000, so a single service ends up opening two conflicting ports. How should this be handled? Can the distributed part be removed from the code?
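Not an official answer, but the two ports serve different purposes and can simply be set to different values: the torch.distributed / DeepSpeed rendezvous uses MASTER_PORT (or the launcher's --master_port), while Flask binds whatever port you pass to app.run(). A minimal sketch with placeholder ports:

import os
from flask import Flask

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "6000")    # rendezvous port used by deepspeed.init_distributed()
# ... build the model and run the deepspeed / torch distributed initialization here, as in eva_interactive.py ...

app = Flask(__name__)

@app.route("/chat")
def chat():
    # hypothetical handler that would call the model's generation code
    return "ok"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)          # HTTP port, distinct from MASTER_PORT

Stripping the distributed part entirely is also possible for single-GPU serving, but then the DeepSpeed engine wrapper would have to be replaced with a plain PyTorch forward pass.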

Unable to proceed, no GPU resources available

I have run this many times and switched between several GPU servers, but I keep getting the same error.
Error message:
Traceback (most recent call last):
File "/opt/conda/bin/deepspeed", line 6, in
main()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 264, in main
raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
The message says there are no GPU resources available.

I also checked the GPU inside docker:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:07.0 Off | 0 |
| N/A 28C P0 22W / 300W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Now what I do not understand is why the code reports that no GPU resources can be found. What could be the reason? Thanks for the guidance!

Cannot open shared object file

Hi, I was trying to run the eva_finetune code with the provided docker image (1.5) and encountered the following issue:

Loading extension module utils...
Traceback (most recent call last):
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 506, in
main()
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 486, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args, config, ds_config, args.do_train)
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 139, in setup_model_and_optimizer
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/init.py", line 110, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in init
util_ops = UtilsBuilder().load()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
op_module = load(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
return _jit_compile(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1362, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1752, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1101, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: /home/xxx/.cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory

where /home/xxx is my user home directory. I checked the path where the error occurred and found that torch_extensions is not under that path. Could you please help with this issue? Thanks in advance!

Inference always gets stuck at load_checkpoint

When not loading the model, the inference code runs with randomly initialized parameters.
But when loading the model, it always gets stuck at the load_checkpoint step.

The full run log is as follows:

python -m torch.distributed.launch --master_port 1234 --nproc_per_node 1 /workspace/user_code/EVA-main/src/eva_interactive.py --model-config /workspace/user_code/EVA-main/src/configs/model/eva2.0_model_config.json --model-parallel-size 1 --load /workspace/user_code/EVA-main/new_data_scale_1103_change_iter/ --no_load_strict --distributed-backend nccl --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /workspace/user_code/EVA-main/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --num-beams 1 --repetition-penalty 1.6 --rule-path /workspace/user_code/EVA-main/rules --fp16 --deepspeed --deepspeed_config /workspace/user_code/EVA-main/src/configs/deepspeed/eva_ds_config.json
Loading Model ...
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-03-27 18:28:06,328] [INFO] [distributed.py:38:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3140 and data parallel seed: 422
building Enc-Dec model ...
number of parameters on model parallel rank 0: 2841044992
DeepSpeed is enabled.
[2022-03-27 18:28:34,516] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.9, git-hash=unknown, git-branch=unknown
[2022-03-27 18:28:34,536] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f956971ae90>
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] allreduce_always_fp32 ........ False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] amp_enabled .................. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] amp_params ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] disable_allgather ............ False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] dump_state ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 2000, 'delayed_shift': 4, 'min_scale': 256}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] elasticity_enabled ........... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] fp16_enabled ................. True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] global_rank .................. 0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_accumulation_steps .. 1
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_clipping ............ 1.0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_predivide_factor .... 1.0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] initial_dynamic_scale ........ 65536
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] loss_scale ................... 0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] memory_breakdown ............. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_legacy_fusion ...... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_name ............... None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_params ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pld_enabled .................. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pld_params ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] prescale_gradients ........... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] scheduler_name ............... None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] scheduler_params ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] sparse_attention ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] sparse_gradients_enabled ..... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] steps_per_print .............. 10
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_enabled .......... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_output_path ......
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] train_batch_size ............. 32
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] train_micro_batch_size_per_gpu 32
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] wall_clock_breakdown ......... True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] world_size ................... 1
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_allow_untested_optimizer True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": false,
"cpu_offload": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 1
}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_enabled ................. True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_optimization_stage ...... 1
[2022-03-27 18:28:34,538] [INFO] [config.py:715:print] json = {
"activation_checkpointing":{
"contiguous_memory_optimization":false,
"partition_activations":false
},
"fp16":{
"enabled":true,
"hysteresis":4,
"initial_scale_power":16,
"loss_scale":0,
"loss_scale_window":2000,
"min_loss_scale":256
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"steps_per_print":10,
"train_micro_batch_size_per_gpu":32,
"wall_clock_breakdown":true,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"stage":1
}
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] g++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/TH -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/THC -isystem /data/miniconda3/envs/env-3.7.7/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] g++ flatten_unflatten.o -shared -L/data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 10.33121132850647 seconds
[2022-03-27 18:28:45,497] [INFO] [engine.py:1286:_load_checkpoint] rank: 0 loading checkpoint: /workspace/user_code/EVA-main/new_data_scale_1103_change_iter/1/mp_rank_00_model_states.pt

What's the training setting for EVA2.0?

Hello, I am very interested in EVA2.0 and wonder what the settings were for the EVA2.0 pre-training, because I want to train on a large dataset myself: for example, the batch size, number of epochs, number of GPUs used, and training time. Does anyone know the answer to this? Many thanks!

Bug in change_mp.py

Hi, thanks for the repo and the fine-tune script. However, I encountered a problem after the fine-tuning phase.

The thing is, I split the checkpoint into two parts for fine-tuning. After fine-tuning, when I try to merge the two checkpoints back into a single checkpoint for inference, change_mp.py runs into the following bug:

Traceback (most recent call last):
File "../change_mp.py", line 131, in
main()
File "../change_mp.py", line 104, in main
new_model = merge(model_parts)
File "../change_mp.py", line 17, in merge
for k, v in model_parts.items():
AttributeError: 'list' object has no attribute 'items'

It seems to be a bug, since the variable model_parts is actually a list of two models, which has no ".items()" attribute.
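Agreed that this looks like a bug. A sketch of the kind of change that avoids the AttributeError (the real merge logic still has to decide, per parameter, how the partitions are recombined, mirroring how change_mp.py splits them):

import torch

def merge(model_parts):
    # model_parts is a list of per-partition state dicts, not a single dict
    new_model = {}
    for k in model_parts[0]:
        tensors = [part[k] for part in model_parts]
        # placeholder: concatenate along the partitioned dimension, or keep one copy if the
        # parameter is replicated; which case applies to which key is repo-specific
        new_model[k] = tensors[0] if len(tensors) == 1 else torch.cat(tensors, dim=0)
    return new_model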
