
Comments (5)

Jiaxin-Wen commented on September 6, 2024

hi, could you please show the full log output?

Vincentwei1021 commented on September 6, 2024

@XWwwwww thanks for your reply. Here is the full log:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

using world size: 6 and model-parallel size: 2

using dynamic loss scaling
[2022-06-17 03:25:20,845] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,887] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,896] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,903] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,911] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
[2022-06-17 03:25:20,913] [INFO] [distributed.py:37:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 2
[2022-06-17 03:25:20,968] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,968] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Pretrain Enc-Dec model
arguments:
model_config ................. /mnt/user/weiyihao/EVA-main/src/configs/model/eva2.0_model_config.json
model_parallel_size .......... 2
fp16 ......................... True
do_train ..................... True
do_valid ..................... True
do_eval ...................... True
train_ratio .................. 1.0
valid_ratio .................. 1
test_ratio ................... 1
batch_size ................... 16
gradient_accumulation_steps .. 1
train_iters .................. -1
epochs ....................... 5
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
seed ......................... 422
lr_decay_style ............... noam
lr ........................... 0.0001
warmup ....................... 0.01
load ......................... /mnt/user/weiyihao/EVA-main/checkpoints/eva2.0_xLarge
load_optimizer_states ........ False
load_lr_scheduler_states ..... False
no_load_strict ............... True
save ......................... /mnt/user/weiyihao/EVA-main/results/eva2.0-xLarge/finetune2
save_interval ................ 1000
log_file ..................... /mnt/user/weiyihao/EVA-main/results/eva2.0-xLarge/finetune2/log.txt
log_interval ................. 100
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_interval ................ 1000
eval_generation .............. False
temperature .................. 0.9
top_p ........................ 0.9
top_k ........................ 0
max_generation_length ........ 128
min_generation_length ........ 2
num_beams .................... 1
no_repeat_ngram_size ......... 3
repetition_penalty ........... 1.2
early_stopping ............... False
length_penalty ............... 1.8
rule_path .................... None
data_path .................... /mnt/user/weiyihao/EVA-main/data2/
cache_path ................... None
tokenizer_path ............... /mnt/user/weiyihao/EVA-main/bpe_dialog_new
data_ext ..................... .txt
num_workers .................. 2
enc_seq_length ............... 256
dec_seq_length ............... 128
deepspeed .................... True
deepspeed_config ............. /mnt/user/weiyihao/EVA-main/src/configs/deepspeed/eva_ds_config.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 6
dynamic_loss_scale ........... True
[2022-06-17 03:25:20,970] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,972] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,977] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,982] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2022-06-17 03:25:20,989] [INFO] [checkpointing.py:248:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3140 and data parallel seed: 422
No cache, processing data
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache

Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 0%| | 0/37729 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp5h9gvay8' -> '/tmp/jieba.cache'
Loading model cost 2.320 seconds.
Prefix dict has been built successfully.
Dumping model to file cache /tmp/jieba.cache
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmpy5ida7h1' -> '/tmp/jieba.cache'
Loading model cost 2.406 seconds.
Prefix dict has been built successfully.
Dumping model to file cache /tmp/jieba.cache

Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 0%| | 1/37729 [00:02<25:13:41, 2.41s/it]Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp2osfm1gz' -> '/tmp/jieba.cache'
Loading model cost 2.570 seconds.
Prefix dict has been built successfully.
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmpp7s06t9k' -> '/tmp/jieba.cache'
Loading model cost 2.602 seconds.
Prefix dict has been built successfully.

Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 0%| | 14/37729 [00:02<17:40:42, 1.69s/it]Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp0rz_t63a' -> '/tmp/jieba.cache'
Loading model cost 2.580 seconds.
Prefix dict has been built successfully.
Dump cache file failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/jieba/__init__.py", line 154, in initialize
_replace_file(fpath, cache_file)
PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp8b7xz6ug' -> '/tmp/jieba.cache'
Loading model cost 2.694 seconds.
Prefix dict has been built successfully.
Loading data from /mnt/user/weiyihao/EVA-main/data2/train.txt: 100%|██████████| 37729/37729 [04:54<00:00, 128.23it/s]
Cache path is None, no cache saved
Path: /mnt/user/weiyihao/EVA-main/data2/train.txt | Ratio:1.0 | Max enc len: 256 | Max dec len: 128 | Data num: 37053
No cache, processing data
Loading data from /mnt/user/weiyihao/EVA-main/data2/valid.txt: 100%|██████████| 5992/5992 [01:00<00:00, 98.87it/s]
Cache path is None, no cache saved
Path: /mnt/user/weiyihao/EVA-main/data2/valid.txt | Ratio:1 | Max enc len: 256 | Max dec len: 128 | Data num: 5761
Total train epochs 5 | Total train iters 3859 |
building Enc-Dec model ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 14 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 15 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 19 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 16) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/weiyihao/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/mnt/user/weiyihao/EVA-main/src/eva_finetune.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-06-17_03:32:24
host : iZ2zecyh456naae68lp3swZ
rank : 2 (local_rank: 2)
exitcode : -9 (pid: 16)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 16

Jiaxin-Wen commented on September 6, 2024

PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp5h9gvay8' -> '/tmp/jieba.cache'
please fix this first
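
One way to work around it, as a rough sketch (the cache directory below is just an example path, adjust it to somewhere your user can write to), is to point jieba at a writable cache location before data processing starts:

```python
import os
import jieba

# Example writable location for the jieba cache (adjust to your environment).
cache_dir = os.path.expanduser("~/jieba_cache")
os.makedirs(cache_dir, exist_ok=True)

# jieba.dt is the default tokenizer; tmp_dir controls the directory where the
# prefix-dict cache is written instead of the shared /tmp/jieba.cache.
jieba.dt.tmp_dir = cache_dir

jieba.initialize()  # build or load the prefix dict up front
```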

Vincentwei1021 commented on September 6, 2024

PermissionError: [Errno 1] Operation not permitted: '/tmp/tmp5h9gvay8' -> '/tmp/jieba.cache'
please fix this first


I don't think that's the cause. The same permission issue also shows up when I run interactive mode on a single GPU, yet it still proceeds to build the model and so on. The error above happens only in the distributed setting, and there seems to be no clue in the log to pinpoint the exact issue.

Vincentwei1021 commented on September 6, 2024

Update: the issue is solved. It was caused by running out of RAM while building the model.
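
For reference, exitcode -9 means the worker was killed with SIGKILL, which on Linux is usually the kernel OOM killer, so no Python traceback ever reaches the log. A minimal sketch, assuming `psutil` is installed and `RANK` is set by the launcher, that logs host RAM headroom on each rank right before the model is built:

```python
import os
import psutil

def log_ram(tag: str) -> None:
    """Print host RAM headroom for this rank; a shrinking number here
    right before an exitcode -9 points at the OOM killer."""
    vm = psutil.virtual_memory()
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] {tag}: available "
          f"{vm.available / 1e9:.1f} GB of {vm.total / 1e9:.1f} GB",
          flush=True)

# e.g. call log_ram("before building Enc-Dec model") just before model construction
```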
