
Comments (5)

Jiaxin-Wen commented on September 6, 2024

hi, could you please show the full log output?

Vincentwei1021 commented on September 6, 2024

@XWwwwww thanks for your reply. Here is the full log:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

using world size: 6 and model-parallel size: 2

using dynamic loss scaling
[2022-06-17 03:25:20,845] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,887] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,896] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,903] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,911] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,913] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 2
[2022-06-17 03:25:20,968] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,968] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Pretrain Enc-Dec model
arguments:
model_config ................. /mnt/user/weiyihao/EVA-main/src/configs/model/eva2.0_model_config.json
model_parallel_size .......... 2
fp16 ......................... True
do_train ..................... True
do_valid ..................... True
do_eval ...................... True
train_ratio .................. 1.0
valid_ratio .................. 1
test_ratio ................... 1
batch_size ................... 16
gradient_accumulation_steps .. 1
train_iters .................. -1
epochs ....................... 5
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
seed ......................... 422
lr_decay_style ............... noam
lr ........................... 0.0001
warmup ....................... 0.01
load ......................... /mnt/user/weiyihao/EVA-main/checkpoints/eva2.0_xLarge
load_optimizer_states ........ False
load_lr_scheduler_states ..... False
no_load_strict ............... True
save ......................... /mnt/user/weiyihao/EVA-main/results/eva2.0-xLarge/finetune2
save_interval ................ 1000
log_file ..................... /mnt/user/weiyihao/EVA-main/results/eva2.0-xLarge/finetune2/log.txt
log_interval ................. 100
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_interval ................ 1000
eval_generation .............. False
temperature .................. 0.9
top_p ........................ 0.9
top_k ........................ 0
max_generation_length ........ 128
min_generation_length ........ 2
num_beams .................... 1
no_repeat_ngram_size ......... 3
repetition_penalty ........... 1.2
early_stopping ............... False
length_penalty ............... 1.8
rule_path .................... None
data_path .................... /mnt/user/weiyihao/EVA-main/data2/
cache_path ................... None
tokenizer_path ............... /mnt/user/weiyihao/EVA-main/bpe_dialog_new
data_ext ..................... .txt
num_workers .................. 2
enc_seq_length ............... 256
dec_seq_length ............... 128
deepspeed .................... True
deepspeed_config ............. /mnt/user/weiyihao/EVA-main/src/configs/deepspeed/eva_ds_config.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 6
dynamic_loss_scale ........... True
[2022-06-17 03:25:20,970] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,972] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,977] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,982] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,989] [INFO] [checkpointing.py:248:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3140 and data parallel seed: 422
No cache, processing data
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache

Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 0%| | 0/37729 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp5h9gvay8' -> '/tmp/jieba.cache'
Loading model cost 2.320 seconds.
Prefix dict has been built successfully.
Dumping model to file cache /tmp/jieba.cache
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmpy5ida7h1' -> '/tmp/jieba.cache'
Loading model cost 2.406 seconds.
Prefix dict has been built successfully.
Dumping model to file cache /tmp/jieba.cache

Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 0%| | 1/37729 [00:02<25:13:41, 2.41s/it]Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp2osfm1gz' -> '/tmp/jieba.cache'
Loading model cost 2.570 seconds.
Prefix dict has been built successfully.
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmpp7s06t9k' -> '/tmp/jieba.cache'
Loading model cost 2.602 seconds.
Prefix dict has been built successfully.

Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 0%| | 14/37729 [00:02<17:40:42, 1.69s/it]Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp0rz_t63a' -> '/tmp/jieba.cache'
Loading model cost 2.580 seconds.
Prefix dict has been built successfully.
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp8b7xz6ug' -> '/tmp/jieba.cache'
Loading model cost 2.694 seconds.
Prefix dict has been built successfully.
Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 100%|██████████| 37729/37729 [04:54<00:00, 128.23it/s]
Cache path is None, no cache saved
Path: /mnt/user/weiyihao/EVA-main/data2/train.txt | Ratio:1.0 | Max enc len: 256 | Max dec len: 128 | Data num: 37053
No cache, processing data
Loading data from /mnt/user/weiyihao/EVA-main/data2/valid.txt: 100%|██████████| 5992/5992 [01:00<00:00, 98.87it/s]
Cache path is None, no cache saved
Path: /mnt/user/weiyihao/EVA-main/data2/valid.txt | Ratio:1 | Max enc len: 256 | Max dec len: 128 | Data num: 5761
Total train epochs 5 | Total train iters 3859 |
building Enc-Dec model ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 14 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 19 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 16) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/mnt/user/weiyihao/EVA-main/src/eva_finetune.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-06-17_03:32:24
host : iZ2zecyh456naae68lp3swZ
rank : 2 (local_rank: 2)
exitcode : -9 (pid: 16)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 16

Jiaxin-Wen commented on September 6, 2024

PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp5h9gvay8' -> '/tmp/jieba.cache'
please fix this first
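
One way to work around it, as a rough sketch (the cache directory below is just an example path, adjust it to somewhere your user can write to), is to point jieba at a writable cache location before data processing starts:

```python
import os
import jieba

# Example writable location for the jieba cache (adjust to your environment).
cache_dir = os.path.expanduser("~/jieba_cache")
os.makedirs(cache_dir, exist_ok=True)

# jieba.dt is the default tokenizer; tmp_dir controls the directory where the
# prefix-dict cache is written instead of the shared /tmp/jieba.cache.
jieba.dt.tmp_dir = cache_dir

jieba.initialize()  # build or load the prefix dict up front
```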

Vincentwei1021 commented on September 6, 2024

PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp5h9gvay8' -> '/tmp/jieba.cache'
please fix this first


I don't think that's the cause. The same permission issue also shows up when I run interactive mode on a single GPU, yet it still proceeds to build the model and so on. The error above happens only in the distributed setting, and there seems to be no clue in the log to pinpoint the exact issue.

Vincentwei1021 commented on September 6, 2024

Update: the issue is solved. It was caused by running out of RAM while building the model.
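
For reference, exitcode -9 means the worker was killed with SIGKILL, which on Linux is usually the kernel OOM killer, so no Python traceback ever reaches the log. A minimal sketch, assuming `psutil` is installed and `RANK` is set by the launcher, that logs host RAM headroom on each rank right before the model is built:

```python
import os
import psutil

def log_ram(tag: str) -> None:
    """Print host RAM headroom for this rank; a shrinking number here
    right before an exitcode -9 points at the OOM killer."""
    vm = psutil.virtual_memory()
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] {tag}: available "
          f"{vm.available / 1e9:.1f} GB of {vm.total / 1e9:.1f} GB",
          flush=True)

# e.g. call log_ram("before building Enc-Dec model") just before model construction
```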
