
Comments (3)

w5688414 commented on June 9, 2024

Thanks for the feedback. I looked into it, and the problem was introduced by this PR:

93e78c2#diff-99e104eff4c095428aa1cd5d186107ae22737297e8ec3b5c12cd138e69a79cb5

See whether the implementation below resolves your issue:

masked_lm_loss = masked_lm_loss[masked_lm_labels != self.ignore_index]
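As a pure-Python illustration of the idea behind this fix (not actual Paddle code; `IGNORE_INDEX` and `masked_mean_loss` are hypothetical names), the per-token losses whose label equals `ignore_index` are dropped before averaging:

```python
# Pure-Python sketch of the fix above: filter out positions whose label equals
# ignore_index before averaging, so ignored/padded tokens do not dilute the loss.
# IGNORE_INDEX and masked_mean_loss are illustrative names, not PaddleNLP APIs.

IGNORE_INDEX = -100  # a common ignore_index convention; an assumption here


def masked_mean_loss(per_token_loss, labels, ignore_index=IGNORE_INDEX):
    # Keep only the loss values whose label is not the ignore index.
    kept = [l for l, y in zip(per_token_loss, labels) if y != ignore_index]
    if not kept:
        # In Paddle this corresponds to an empty tensor after indexing.
        return None
    return sum(kept) / len(kept)
```

For example, `masked_mean_loss([1.0, 2.0, 3.0], [5, -100, 7])` keeps only the first and third positions and averages them.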


dynamicheart commented on June 9, 2024

@w5688414 OK. It looks like this should guarantee that masked_lm_loss is not an empty tensor, provided the dataset is processed correctly. I'll try it later, but the issue does not reproduce reliably.
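To show why an empty masked_lm_loss tensor is dangerous (again a pure-Python sketch, not Paddle code; `safe_mean` is a hypothetical name): the mean over an empty selection is NaN in NumPy/Paddle, and a NaN loss then poisons logging and gradient scaling:

```python
import math

# Sketch of the empty-tensor hazard: if every label in a micro-batch is
# ignore_index, nothing survives the filter, and averaging zero elements
# yields NaN (the behavior of NumPy/Paddle mean over an empty array).


def safe_mean(values):
    if not values:  # every position was ignored -> nothing to average
        return float("nan")
    return sum(values) / len(values)
```

This is why the discussion above hinges on the dataset being processed correctly: a batch with at least one real label never hits the empty case.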


cqulilujia commented on June 9, 2024

While running LLaMA pretraining with pipeline_parallel_degree=2 and sharding stage1, I hit this pitfall again. I traced it to the current loss function returning loss=float(0), which triggers the assert in paddle/distributed/fleet/meta_parallel/pipeline_parallel.py. Log below:

After applying the fix from #8459, the type check in PP is bypassed, but the program hangs at step 81 and cannot proceed. My guess is that constructing a new tensor breaks the gradient graph, so some of the communication logic under the PP configuration no longer executes correctly.
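The AssertionError in the log below comes from an isinstance check on the value returned by loss_fn. A minimal stand-in (the `Tensor` class and `forward_step_check` here are placeholders for illustration, not the real paddle.Tensor or the actual Paddle code) shows why returning float(0) trips it:

```python
# Minimal stand-in for the type check in pipeline_parallel.py's _forward_step:
# the loss must be a framework Tensor, not a plain Python float.
# Tensor and forward_step_check are placeholders, not real Paddle APIs.


class Tensor:
    """Placeholder for paddle.Tensor in this illustration."""

    def __init__(self, value):
        self.value = value


def forward_step_check(loss):
    # Mirrors the spirit of: assert isinstance(loss, paddle.Tensor), ...
    return isinstance(loss, Tensor)
```

A loss_fn that falls back to `float(0)` fails this check, while any value still wrapped as a Tensor passes it.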

[2024-05-15 16:26:28,733] [    INFO] - loss: 7.44834805, learning_rate: 2.4e-06, global_step: 79, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.755, interval_samples_per_second: 4.3018, interval_tokens_per_second_per_device: 2202.5182, interval_steps_per_second: 0.0336, progress_or_epoch: 0.0008
[2024-05-15 16:26:58,668] [    INFO] - loss: 7.31905365, learning_rate: 2.43e-06, global_step: 80, current_memory_allocated: 42.891517996788025, current_memory_reserved: 0.0, max_memory_allocated: 82.25603437423706, max_memory_reserved: 0.0, interval_runtime: 29.935, interval_samples_per_second: 4.2759, interval_tokens_per_second_per_device: 2189.279, interval_steps_per_second: 0.0334, progress_or_epoch: 0.0008
LAUNCH INFO 2024-05-15 16:27:24,714 Pod failed
LAUNCH ERROR 2024-05-15 16:27:24,715 Container failed !!!
Container rank 4 status failed cmd ['/root/miniconda3/envs/paddle/bin/python', '-u', 'run_pretrain.py', '--model_name_or_path', 'meta-llama/Llama-2-13b', '--tokenizer_name_or_path', 'meta-llama/Llama-2-13b', '--input_dir', './data', '--output_dir', 'output/llama2-13b-4k/20240515154555', '--split', '949,50,1', '--max_seq_length', '4096', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--use_flash_attention', '1', '--use_fused_rope', '1', '--fuse_attention_ffn', '1', '--fuse_attention_qkv', '1', '--use_fused_rms_norm', '1', '--num_hidden_layers', '40', '--bf16', '--fp16_opt_level', 'O2', '--scale_loss', '1024', '--learning_rate', '0.00003', '--min_learning_rate', '0.000005', '--lr_scheduler_type', 'cosine', '--max_steps', '100000', '--save_steps', '100000', '--weight_decay', '0.01', '--warmup_ratio', '0.01', '--max_grad_norm', '1.0', '--logging_steps', '1', '--sequence_parallel', '0', '--dataloader_num_workers', '4', '--pipeline_parallel_degree', '2', '--pipeline_parallel_config', 'disable_partial_send_recv', '--tensor_parallel_degree', '1', '--tensor_parallel_config', 'enable_mp_async_allreduce,enable_mp_skip_c_identity', '--gradient_accumulation_steps', '32', '--sharding', 'stage1', '--eval_steps', '1000', '--report_to', 'visualdl', '--disable_tqdm', 'true', '--continue_training', '0', '--recompute', '0', '--do_train', '--seed', '1026', '--device', 'xpu'] code 1 log output/llama2-13b-4k/20240515154555_log/workerlog.4
env {'PYTHONPATH': '../../:', 'LSCOLORS': 'Gxfxcxdxbxegedabagacad', 'LESS': '-R', 'CONDA_EXE': '/root/miniconda3/bin/conda', '_CE_M': '', 'XPU_CDNN_CLUSTER_PARALLEL_STREAM_NUMBER': '2', 'HOSTNAME': 'localhost.localdomain', 'PWD': '/workspace/PaddleNLP/llm/llama', 'LOGNAME': 'root', 'CONDA_PREFIX': '/root/miniconda3/envs/paddle', 'XPU_PADDLE_L3_SIZE1': '1024', 'XPU_PADDLE_L3_SIZE0': '1024', 'XBLAS_FC_HBM_VERSION': '40', 'FLAGS_use_stride_kernel': '0', 'HOME': '/root', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'TERM': 'xterm', 'XPU_CDNN_CLUSTER_PARALLEL': '1', 'ZSH': '/root/.oh-my-zsh', 
'CE_CONDA': '', 'XPUAPI_DEFAULT_SIZE0': '1502653248', 'XPUAPI_DEFAULT_SIZE1': '380265324', 'CONDA_SHLVL': '2', 'SHLVL': '2', 'PAGER': 'less', 'CUDA_DEVICE_MAX_CONNECTIONS': '8', 'CONDA_PYTHON_EXE': '/root/miniconda3/bin/python', 'LD_LIBRARY_PATH': '/workspace/so-bkcl/:/workspace/so-runtime/:/workspace/so-fast_paddle/:', 'CONDA_DEFAULT_ENV': 'paddle', 'XPU_FORCE_USERMODE_LAUNCH': '1', 'PATH': '/root/miniconda3/envs/paddle/bin:/root/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'CONDA_PREFIX_1': '/root/miniconda3', 'OLDPWD': '/workspace/PaddleNLP', '': '/root/miniconda3/envs/paddle/bin/python', 'LC_CTYPE': 'C.UTF-8', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'cztpec', 'PADDLE_MASTER': '127.0.0.1:36569', 'PADDLE_GLOBAL_SIZE': '8', 'PADDLE_LOCAL_SIZE': '8', 'PADDLE_GLOBAL_RANK': '4', 'PADDLE_LOCAL_RANK': '4', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:36574', 'PADDLE_TRAINER_ID': '4', 'PADDLE_TRAINERS_NUM': '8', 'PADDLE_RANK_IN_NODE': '4', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:36570,127.0.0.1:36571,127.0.0.1:36572,127.0.0.1:36573,127.0.0.1:36574,127.0.0.1:36575,127.0.0.1:36576,127.0.0.1:36577', 'FLAGS_selected_xpus': '4', 'PADDLE_LOG_DIR': '/workspace/PaddleNLP/llm/llama/output/llama2-13b-4k/20240515154555_log'}
LAUNCH INFO 2024-05-15 16:27:24,715 ------------------------- ERROR LOG DETAIL -------------------------
dygraph_optimizer/dygraph_sharding_optimizer.py:101: UserWarning: nccl reduce_avg requires paddle compiled with cuda and nccl>=2.10.0, please check compilation setups.
warnings.warn(
[2024-05-15 15:46:56,542] [ WARNING] hybrid_parallel_optimizer.py:292 - While using ClipGradByGlobalNorm in TensorParallel, PipelineParallel or Sharding, the grad clip of original optimizer will be changed.
[2024-05-15 15:46:56,542] [    INFO] - [timelog] checkpoint loading time: 0.00s (2024-05-15 15:46:56)
[2024-05-15 15:46:56,543] [    INFO] - ***** Running training *****
[2024-05-15 15:46:56,543] [    INFO] - Num examples = 12,816,085
[2024-05-15 15:46:56,543] [    INFO] - Num Epochs = 1
[2024-05-15 15:46:56,543] [    INFO] - Instantaneous batch size per device = 1
[2024-05-15 15:46:56,543] [    INFO] - Total train batch size (w. parallel, distributed & accumulation) = 128
[2024-05-15 15:46:56,543] [    INFO] - Gradient Accumulation steps = 32
[2024-05-15 15:46:56,543] [    INFO] - Total optimization steps = 100,000
[2024-05-15 15:46:56,543] [    INFO] - Total num train samples = 12,800,000
[2024-05-15 15:46:56,545] [   DEBUG] - Number of trainable parameters = 6,507,934,720 (per device)
[2024-05-15 15:46:56,563] [   DEBUG] - Number of trainable parameters = 13,015,863,296 (all devices, roughly)
/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/amp/auto_cast.py:502: UserWarning: XPUPlace only support float16 amp.
warnings.warn('XPUPlace only support float16 amp.')
Traceback (most recent call last):
  File "/workspace/PaddleNLP/llm/llama/run_pretrain.py", line 630, in <module>
    main()
  File "/workspace/PaddleNLP/llm/llama/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 770, in train
    return self._inner_training_loop(
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 964, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2044, in training_step
    return self.training_pipeline_step(model, inputs)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 2113, in training_pipeline_step
    loss = model.forward_backward_pipeline(inputs, self.scaler if self.do_grad_scaling else None)
  File "/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 536, in forward_backward_pipeline
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/root/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 789, in _forward_step
    assert isinstance(
AssertionError: Currently, loss_fn should obtain Paddle.Tensor dtype
LAUNCH INFO 2024-05-15 16:27:29,316 Exit code -15

