
eva's Issues

Question about the model files

Hello, is the model file currently provided complete? (There is only one mp_rank_00_model_states.pt.) Can I finetune with just this model file?

Questions about the EVA2.0 paper

The source of the starting queries for the self-chat experiments is described in the paper; we do not provide the original files directly.

Asking about approaches for dialogue with a specific character persona

Hello, I would like to ask: on top of open-domain chit-chat, how could the model converse as a specific character, for example chatting with Lu Xun about topics related to him (the Sanwei Study, his studies in Japan, etc.) and automatically answering from Lu Xun's first-person perspective?
What approaches are worth considering for this kind of persona-grounded dialogue?

How to start a new conversation?

Hello, how do I start a new round of conversation that does not depend on the previous dialogue?
Also, during eval, how can I run just a simple single-turn dialogue? After modifying the input it did not work, and the subsequent outputs were all wrong.
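Not an official answer, but conceptually the interactive script just keeps appending turns to a context list, so starting fresh (or running single-turn eval) means building the model input from the current utterance only. A minimal sketch of the idea with hypothetical names, not the repo's actual loop:

single_turn = False      # set True to evaluate one turn at a time
history = []
while True:
    utterance = input(">>> ")
    # single-turn: condition only on the current utterance; multi-turn: include the history
    context = [utterance] if single_turn else history + [utterance]
    reply = generate_reply(context)        # hypothetical call into the model's generation code
    print(reply)
    history.extend([utterance, reply])
    # to start a new conversation, simply reset: history = []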

huggingface

Hello, can the existing model be converted to the Hugging Face format? Is there any sample code to refer to?

Some weights not loaded when I tried to run the code in the huggingface branch

Hello, I have a question about the difference in the EVAModel architecture between the code on the main and huggingface branches (the attached picture shows the warnings produced when running the huggingface-branch code with the pre-trained EVA large model checkpoint).

I investigated the code but found no significant difference, so I am wondering what is happening. Could anyone kindly help me look into this? Thanks!

(screenshot: 2022-06-28 10:42:29)

torch multiprocessing api failed

Hi, I encounter the following error when I finetune eva2.0-xLarge with 4 V100 GPUs.

building Enc-Dec model ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 16 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 14) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I was able to run the interactive mode on 1 P100 GPU without error. Any clue what is causing the error?

ssh: Could not resolve hostname node-0: Name or service not known

ssh: Could not resolve hostname node-0: Name or service not known
Traceback (most recent call last):
File "/usr/local/bin/deepspeed", line 6, in
main()
File "/usr/local/lib/python3.6/dist-packages/deepspeed/launcher/runner.py", line 281, in main
result = subprocess.check_output(hostname_cmd, shell=True)
File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ssh node-0 hostname -I']' returned non-zero exit status 255.

The README says: "You also need to change node-0 in ${WORKING_DIR}/src/configs/host_files/hostfile to the ssh node name (or IP) where you run distributed training", but I am not sure how to set this. Could someone please help?
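For reference, a DeepSpeed hostfile lists one node per line together with the number of GPU slots on it, and each hostname must be reachable via (passwordless) ssh from the machine that launches the job. A minimal sketch, assuming your node's ssh name is gpu-node-1 (a placeholder) and it has 8 GPUs:

gpu-node-1 slots=8

Replacing node-0 in ${WORKING_DIR}/src/configs/host_files/hostfile with the real ssh name (or IP) of your machine should make the ssh node-0 hostname -I call in the traceback above succeed.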

【eva】eva deploy problems

BAAI-WuDao#4

1. Running torch.cuda.device_count() inside the container returns 1.
2. SSH from inside the container to the host machine also works.

Could you please help me continue locating the problem?

train

How do I train with my own data?

TCPStore(master_addr, master_port, world_size, start_daemon, timeout) ==>RuntimeError: connect() timed out.

1. Running torch.cuda.device_count() inside the container returns 1.

2. SSH from the container to the host machine also works.

Error message:
[2021-09-15 08:47:29,228] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
Loading Model ...
WARNING: No training data specified
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2021-09-15 08:47:30,165] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
File "/mnt/src/eva_interactive.py", line 517, in
main()
File "/mnt/src/eva_interactive.py", line 494, in main
initialize_distributed(args)
File "/mnt/src/eva_interactive.py", line 464, in initialize_distributed
deepspeed.init_distributed()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
torch.distributed.init_process_group(backend=dist_backend)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

How should this error be resolved? Thanks for the guidance!

Can the human interactive evaluation demo be open-sourced?

The human interactive evaluation demo system described for EVA is very useful: the interface is friendly and it makes evaluating and comparing different models convenient. Could it be open-sourced so that it can deliver even more value?

cuda out of memory

Hello, when I run eva_finetune.sh, no matter how small I make the batch size, all of the memory is loaded onto a single card and it then runs out of memory.

EVA2.0 model files

Will the EVA2.0 model files be released?
The results look quite good. How many parameters does the model have?

How does MP_SIZE work on a single machine with multiple GPUs?

Hello, and thank you for your work.

I am trying to finetune EVA 2.0 on a single machine with 8 x 16 GB P100 cards. I set the model parallel size to 4 and converted the checkpoint with the provided script. The relevant hyperparameters are:

MP_SIZE=4 # the model parallel size

NUM_GPUS_PER_WORKER=2 # number of gpus used on one node

BATCH_SIZE=8

However, after running the finetuning script, I found that the model only occupied the first two cards while the remaining six stayed idle, and it quickly threw a CUDA out of memory error. So I would like to ask how model parallelism should be configured on a single machine with multiple GPUs.
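Not a verified answer, but with Megatron-style model parallelism the launcher has to start one process per GPU that you want to use, and the total number of processes must be a multiple of MP_SIZE; launching only 2 processes with MP_SIZE=4 is what leaves six cards idle. A configuration sketch for a single 8-GPU node (values are assumptions, not tested against this repo):

MP_SIZE=4                 # the model parallel size
NUM_GPUS_PER_WORKER=8     # one process per GPU on the node -> 8 / 4 = 2 data-parallel replicas
BATCH_SIZE=8

If memory is still tight on 16 GB cards, lowering BATCH_SIZE or using gradient accumulation are the usual next steps.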

Question about the data sources of WDC-Dialogue

Hello, the paper mentions that WDC-Dialogue comes from reposts on social media platforms, comments/replies on forums, and Q&A exchanges. Could you explain in more detail which websites each part was collected from, and how?
For example, for a platform like Zhihu, what was the entry point, or which keywords were used to search for the relevant data?
I am quite interested in this part of the work; an explanation would be much appreciated, thanks!

Run src/scripts/infer_enc_dec_interactive.sh

Hi @t1101675, I encounter the following errors when running infer_enc_dec_interactive.sh. Is there something wrong with the DeepSpeed version? My DeepSpeed version is 0.5.1; could you give some advice on this?

[2021-09-13 22:26:05,248] [INFO] [engine.py:197:__init__] DeepSpeed Flops Profiler Enabled: False
Traceback (most recent call last):
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 498, in <module>
    main()
  File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 485, in main
    model = setup_model_for_inference(args, tokenizer.vocab_size)
  **File "/mnt/huahu/projects/EVA/src/eva_interactive.py", line 168, in setup_model_for_inference
    model, optimizer, _, lr_scheduler = deepspeed.initialize(**
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/__init__.py", line 131, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 228, in __init__
    self._configure_checkpointing(dist_init_required)
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 572, in _configure_checkpointing
    group=self.optimizer.dp_process_group)
AttributeError: 'NoneType' object has no attribute 'dp_process_group'
Killing subprocess 4701
Traceback (most recent call last):
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/huahu/anaconda3/envs/eva/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

mp_size=4: lprobs index out of bounds in enforce_repetition_penalty_

File "/mnt/lustre/sjtu/home/bwy18/EVA/src/generation_utils.py", line 238, in postprocess_next_token_scores
enforce_repetition_penalty_(
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/generation_utils.py", line 215, in enforce_repetition_penalty_
if lprobs[i, previous_token] < 0:
IndexError: index 29810 is out of bounds for dimension 1 with size 7500
I am not sure whether this is because the lm_logits output by the decoder should first be passed through mpu.gather_from_model_parallel_region.
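For what it's worth, that guess matches the symptom: with mp_size=4 each rank only holds a 7500-wide shard of the ~30000-token vocabulary, so indexing it with a full-vocabulary token id (29810) goes out of bounds. A minimal sketch of the idea, assuming the Megatron-style mpu module used in this repo (names and shapes are assumptions):

# lm_logits from the decoder is vocab-parallel: [batch, seq, vocab / mp_size] on each rank
full_logits = mpu.gather_from_model_parallel_region(lm_logits)   # -> [batch, seq, vocab]
# pass full_logits, not the per-rank shard, into postprocess_next_token_scores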

How can I reproduce the released examples?

Hello, I am interacting with the EVA model. The released cases look very good, so I tried to reproduce them, but I cannot generate comparable examples. My interaction went as follows:
Model Loaded!

今天天气不错 (The weather is nice today)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.715 seconds.
Prefix dict has been built successfully.
是啊,今天我买了衣服 (Yeah, I bought some clothes today)
买了啥衣服呀? (What clothes did you buy?)
是的,你那儿有什么好看的衣服吗? (Yes, do you have any nice clothes over there?)
我这边多云 (It's cloudy here)
你在哪呢? (Where are you?)
北京 (Beijing)
可惜,不在北京。。。 (Too bad, not in Beijing...)
那你在哪? (Then where are you?)
你回学校啦? (Are you back at school?)

I used CDial-GPT before, and after interacting with EVA it feels much better; the released example cases look impressive, so I wanted to reproduce them, but without success...

The generation parameters follow the script; I only commented out these three options: OPTS+=" --fp16" OPTS+=" --deepspeed" OPTS+=" --deepspeed_config ${DS_CONFIG}"

Error when reproducing the interactive evaluation script; not sure whether it is a version issue or a code bug (solved)

When reproducing the interactive evaluation script, the model loads normally and dialogue input works, but while waiting for EVA's reply the following error appears:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 2: ordinal not in range(128)
After digging in, the fix is to add encoding="utf-8" where /src/generation_utils.py (around line 80) reads the antonym file /antonym/antonym.txt.
After the change: with open(os.path.join(args.rule_path, './antonym/antonym.txt'), 'r', encoding="utf-8") as f:
I am not sure whether this is caused by the Python version or something else, so I did not dare to open a PR.

python version 3.9.12
torch version 1.11.0+cu113 (the GPU is a 3090, which requires a newer CUDA version)

mp_size=4: error when saving the model

Traceback (most recent call last):
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 508, in
main()
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 491, in main
train(args, tokenizer, model, optimizer, lr_scheduler, train_dataset, train_dataloader, dev_dataset, dev_dataloader, device)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/eva_finetune.py", line 303, in train
save_checkpoint(global_step, model, optimizer, lr_scheduler, args)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/utils.py", line 92, in save_checkpoint
save_ds_checkpoint(iteration, model, args)
File "/mnt/lustre/sjtu/home/bwy18/EVA/src/utils.py", line 110, in save_ds_checkpoint
model.save_checkpoint(args.save, str(iteration), client_state = sd, save_zero=False)
TypeError: save_checkpoint() got an unexpected keyword argument 'save_zero'
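Not an official fix, but the error simply means that the installed DeepSpeed's save_checkpoint() does not accept a save_zero keyword (the repo appears to expect a specific DeepSpeed version). A minimal workaround sketch for save_ds_checkpoint in src/utils.py, assuming you only need the model weights:

# drop the keyword the installed DeepSpeed no longer accepts
model.save_checkpoint(args.save, str(iteration), client_state=sd)

Alternatively, matching the DeepSpeed version used in the provided docker image avoids touching the code.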

The server socket cannot be initialized on [::]:1234

When I tried to run scripts/eva_inference_static.sh on my server, I encountered a problem (please see the attached figure). It seems something is wrong with the torch distributed setup. Does anyone have an idea about this problem? Thanks!

(screenshot: 2022-06-22 23:23:59)

Stuck when using the docker image to run the inference script

Same issue as EVA/issues/4.
When the code reaches deepspeed.init_distributed() in eva_infer.py, it gets stuck, so I have a suggestion:
since a single GPU can handle inference, why not release a simple script that just loads the weights and runs inference? If we do not need multiple GPUs, why do we need DeepSpeed?

Traceback (most recent call last):
File "eva_interactive.py", line 479, in
main()
File "eva_interactive.py", line 466, in main
model = setup_model_for_inference(args, tokenizer.vocab_size)
File "eva_interactive.py", line 155, in setup_model_for_inference
dist_init_required=False
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/init.py", line 136, in initialize
config_params=config_params)
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 218, in init
self._configure_checkpointing(dist_init_required)
File "/home/anthony/miniconda3/envs/p7/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 556, in _configure_checkpointing
group=self.optimizer.dp_process_group)
AttributeError: 'NoneType' object has no attribute 'dp_process_group'

ppl?

Is there code available for computing the PPL (perplexity) metric?
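Not from the repo, but perplexity can be computed from the average per-token cross-entropy of the model on held-out responses. A minimal, self-contained sketch (the model call, the "lm_logits" key, and eval_dataloader are assumptions, not this repo's actual API):

import math
import torch
import torch.nn.functional as F

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for batch in eval_dataloader:                      # hypothetical dataloader of tokenized dialogues
        logits = model(**batch)["lm_logits"]           # [batch, seq_len, vocab]; output key is an assumption
        labels = batch["labels"]                       # gold response ids, with padding set to -100
        nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              labels.view(-1),
                              ignore_index=-100, reduction="sum")
        total_nll += nll.item()
        total_tokens += (labels != -100).sum().item()

ppl = math.exp(total_nll / total_tokens)               # corpus-level perplexity
print(f"PPL: {ppl:.2f}")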

how to pretrain on specific GPUs

Currently my GPUs 0 and 2 are available. I am pretraining on the KdConv data and changed the default NUM_GPUS_PER_WORKER from 4 to 2, but I got the following errors (a possible workaround is sketched after the traceback below):
python -m torch.distributed.launch --master_port 1234 --nproc_per_node 1 /mnt/src/eva_finetune.py --model-config /mnt/src/configs/model/eva1.0_model_config.json --model-parallel-size 1 --batch-size 16 --epochs 3 --gradient-accumulation-steps 1 --enc-seq-length 128 --dec-seq-length 128 --train-iters -1 --save /mnt/results/finetune/ --log-file /mnt/results/finetune//log.txt --load /mnt/checkpoints/eva1.0 --no_load_strict --data-path /mnt/data/kdconv --distributed-backend nccl --lr 0.0001 --lr-decay-style noam --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --tokenizer-path /mnt/bpe_dialog_new --eval-interval 500 --log-interval 100 --save-interval 500 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /mnt/src/configs/deepspeed/eva_ds_config.json --do-train --do-valid --do-eval --train-ratio 1
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-02-14 02:57:49,629] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
File "/mnt/src/eva_finetune.py", line 506, in
main()
File "/mnt/src/eva_finetune.py", line 442, in main
initialize_distributed(args)
File "/mnt/src/utils.py", line 62, in initialize_distributed
deepspeed.init_distributed()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 41, in init_distributed
torch.distributed.init_process_group(backend=dist_backend)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/mnt/src/eva_finetune.py', '--local_rank=0', '--model-config', '/mnt/src/configs/model/eva1.0_model_config.json', '--model-parallel-size', '1', '--batch-size', '16', '--epochs', '3', '--gradient-accumulation-steps', '1', '--enc-seq-length', '128', '--dec-seq-length', '128', '--train-iters', '-1', '--save', '/mnt/results/finetune/', '--log-file', '/mnt/results/finetune//log.txt', '--load', '/mnt/checkpoints/eva1.0', '--no_load_strict', '--data-path', '/mnt/data/kdconv', '--distributed-backend', 'nccl', '--lr', '0.0001', '--lr-decay-style', 'noam', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--tokenizer-path', '/mnt/bpe_dialog_new', '--eval-interval', '500', '--log-interval', '100', '--save-interval', '500', '--checkpoint-activations', '--deepspeed-activation-checkpointing', '--fp16', '--deepspeed', '--deepspeed_config', '/mnt/src/configs/deepspeed/eva_ds_config.json', '--do-train', '--do-valid', '--do-eval', '--train-ratio', '1']' returned non-zero exit status 1.
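Not an official answer, but two separate things seem to be going on here: only one process is being launched (world size 1), and port 1234 is still held by an earlier run, which is what raises "Address already in use". A workaround sketch, treating the device list and port as examples rather than the repo's defaults:

CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch --master_port 1235 --nproc_per_node 2 /mnt/src/eva_finetune.py ... (remaining arguments as in the command above)

CUDA_VISIBLE_DEVICES restricts the run to GPUs 0 and 2, --nproc_per_node should match the number of visible GPUs, and a fresh --master_port avoids colliding with a process that is still bound to 1234.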

Implementing grounded dialogue

You can modify the code in src/eva_datasets and, according to your own data format, add the grounding data (knowledge, persona information, etc.) to the input.
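A minimal sketch of that idea (a hypothetical helper, not the repo's actual Dataset code): concatenate the grounding text with the dialogue context before tokenization, so the encoder conditions on both.

def build_grounded_context(grounding, context_utterances, sep_token="<sep>"):
    # grounding: persona or knowledge text, e.g. "我是鲁迅,曾在三味书屋读书,后赴日本留学。"
    #            ("I am Lu Xun; I studied at the Sanwei Study and later went to Japan.")
    # context_utterances: previous dialogue turns, oldest first
    return grounding + sep_token + sep_token.join(context_utterances)

The separator token and field order are design choices; what matters is that the grounding string is tokenized together with the context so generation can attend to it.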

how to build the server instead of the interactive mode

I created eva_server.sh to launch eva.server, a Flask service on a specified port, but DeepSpeed distributed training has a default port of 6000, so a single service ends up opening two conflicting ports. How should this be handled? Can the distributed part be removed from the code?
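Not an official answer, but the two ports serve different purposes and can simply be set to different values: the torch.distributed / DeepSpeed rendezvous uses MASTER_PORT (or the launcher's --master_port), while Flask binds whatever port you pass to app.run(). A minimal sketch with placeholder ports:

import os
from flask import Flask

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "6000")    # rendezvous port used by deepspeed.init_distributed()
# ... build the model and run the deepspeed / torch distributed initialization here, as in eva_interactive.py ...

app = Flask(__name__)

@app.route("/chat")
def chat():
    # hypothetical handler that would call the model's generation code
    return "ok"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)          # HTTP port, distinct from MASTER_PORT

Stripping the distributed part entirely is also possible for single-GPU serving, but then the DeepSpeed engine wrapper would have to be replaced with a plain PyTorch forward pass.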

Unable to proceed, no GPU resources available

I have run this many times and switched between several GPU servers, but I keep getting the same error.
Error message:
Traceback (most recent call last):
File "/opt/conda/bin/deepspeed", line 6, in
main()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 264, in main
raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
The message says there are no GPU resources available.

I also checked the GPU inside docker:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:07.0 Off | 0 |
| N/A 28C P0 22W / 300W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Now what I do not understand is why the code reports that no GPU resources can be found. What could be the reason? Thanks for the guidance!

Cannot open shared object file

Hi, I was trying to run the eva_finetune code with the provided docker image (1.5) and encountered the following issue:

Loading extension module utils...
Traceback (most recent call last):
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 506, in
main()
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 486, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args, config, ds_config, args.do_train)
File "/mnt/user/weiyihao/EVA-main/src/eva_finetune.py", line 139, in setup_model_and_optimizer
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/init.py", line 110, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in init
util_ops = UtilsBuilder().load()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
op_module = load(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
return _jit_compile(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1362, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1752, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1101, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: /home/xxx/.cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory

where /home/xxx is my user home directory. I checked the path where the error occurred and found that torch_extensions is not under that path. Could you please help with this issue? Thanks in advance!

Inference always gets stuck at load_checkpoint

When not loading the model, the inference code runs with randomly initialized parameters.
But when loading the model, it always gets stuck at the load_checkpoint step.

The full run log is as follows:

python -m torch.distributed.launch --master_port 1234 --nproc_per_node 1 /workspace/user_code/EVA-main/src/eva_interactive.py --model-config /workspace/user_code/EVA-main/src/configs/model/eva2.0_model_config.json --model-parallel-size 1 --load /workspace/user_code/EVA-main/new_data_scale_1103_change_iter/ --no_load_strict --distributed-backend nccl --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path /workspace/user_code/EVA-main/bpe_dialog_new --temperature 0.7 --top_k 0 --top_p 0.9 --num-beams 1 --repetition-penalty 1.6 --rule-path /workspace/user_code/EVA-main/rules --fp16 --deepspeed --deepspeed_config /workspace/user_code/EVA-main/src/configs/deepspeed/eva_ds_config.json
Loading Model ...
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
[2022-03-27 18:28:06,328] [INFO] [distributed.py:38:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3140 and data parallel seed: 422
building Enc-Dec model ...
number of parameters on model parallel rank 0: 2841044992
DeepSpeed is enabled.
[2022-03-27 18:28:34,516] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.9, git-hash=unknown, git-branch=unknown
[2022-03-27 18:28:34,536] [INFO] [config.py:705:print] DeepSpeedEngine configuration:
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f956971ae90>
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] allreduce_always_fp32 ........ False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] amp_enabled .................. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] amp_params ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] disable_allgather ............ False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] dump_state ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 2000, 'delayed_shift': 4, 'min_scale': 256}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] elasticity_enabled ........... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] fp16_enabled ................. True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] global_rank .................. 0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_accumulation_steps .. 1
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_clipping ............ 1.0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] gradient_predivide_factor .... 1.0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] initial_dynamic_scale ........ 65536
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] loss_scale ................... 0
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] memory_breakdown ............. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_legacy_fusion ...... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_name ............... None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] optimizer_params ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pld_enabled .................. False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] pld_params ................... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] prescale_gradients ........... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] scheduler_name ............... None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] scheduler_params ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] sparse_attention ............. None
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] sparse_gradients_enabled ..... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] steps_per_print .............. 10
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_enabled .......... False
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] tensorboard_output_path ......
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] train_batch_size ............. 32
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] train_micro_batch_size_per_gpu 32
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] wall_clock_breakdown ......... True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] world_size ................... 1
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_allow_untested_optimizer True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_config .................. {
"allgather_bucket_size": 500000000,
"allgather_partitions": true,
"contiguous_gradients": false,
"cpu_offload": false,
"elastic_checkpoint": true,
"load_from_fp32_weights": true,
"overlap_comm": false,
"reduce_bucket_size": 500000000,
"reduce_scatter": true,
"stage": 1
}
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_enabled ................. True
[2022-03-27 18:28:34,537] [INFO] [config.py:709:print] zero_optimization_stage ...... 1
[2022-03-27 18:28:34,538] [INFO] [config.py:715:print] json = {
"activation_checkpointing":{
"contiguous_memory_optimization":false,
"partition_activations":false
},
"fp16":{
"enabled":true,
"hysteresis":4,
"initial_scale_power":16,
"loss_scale":0,
"loss_scale_window":2000,
"min_loss_scale":256
},
"gradient_accumulation_steps":1,
"gradient_clipping":1.0,
"steps_per_print":10,
"train_micro_batch_size_per_gpu":32,
"wall_clock_breakdown":true,
"zero_allow_untested_optimizer":true,
"zero_optimization":{
"stage":1
}
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] g++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/TH -isystem /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/include/THC -isystem /data/miniconda3/envs/env-3.7.7/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] g++ flatten_unflatten.o -shared -L/data/miniconda3/envs/env-3.7.7/lib/python3.7/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 10.33121132850647 seconds
[2022-03-27 18:28:45,497] [INFO] [engine.py:1286:_load_checkpoint] rank: 0 loading checkpoint: /workspace/user_code/EVA-main/new_data_scale_1103_change_iter/1/mp_rank_00_model_states.pt

What's the training setting for EVA2.0?

Hello, I am very interested in EVA2.0 and wonder what the settings were for the EVA2.0 pre-training, because I want to train on a large dataset myself: for example, the batch size, number of epochs, number of GPUs used, and training time. Does anyone know the answer to this? Many thanks!

Bug in change_mp.py

Hi, thanks for the repo and the fine-tune script. However, I encountered a problem after the fine-tuning phase.

The thing is, I split the checkpoint into two parts for fine-tuning. After fine-tuning, when I try to merge the two checkpoints back into a single checkpoint for inference, change_mp.py runs into the following bug:

Traceback (most recent call last):
File "../change_mp.py", line 131, in
main()
File "../change_mp.py", line 104, in main
new_model = merge(model_parts)
File "../change_mp.py", line 17, in merge
for k, v in model_parts.items():
AttributeError: 'list' object has no attribute 'items'

It seems to be a bug, since the variable model_parts is actually a list of two models, which has no ".items()" attribute.
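Agreed that this looks like a bug. A sketch of the kind of change that avoids the AttributeError (the real merge logic still has to decide, per parameter, how the partitions are recombined, mirroring how change_mp.py splits them):

import torch

def merge(model_parts):
    # model_parts is a list of per-partition state dicts, not a single dict
    new_model = {}
    for k in model_parts[0]:
        tensors = [part[k] for part in model_parts]
        # placeholder: concatenate along the partitioned dimension, or keep one copy if the
        # parameter is replicated; which case applies to which key is repo-specific
        new_model[k] = tensors[0] if len(tensors) == 1 else torch.cat(tensors, dim=0)
    return new_model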
