Overflow in GPT examples (colossalai-examples, open)

hpcaitech commented on August 26, 2024

Comments (1)

Gy-Lu commented on August 26, 2024

🐛 Describe the bug

I ran into overflow when using the official scripts for GPT-2. Is this expected?
cd XXX/ColossalAI/examples/language/gpt
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch
[Epoch 0 / Train]: 0%| | 1/8614 [00:00<1:03:35, 2.26it/s, loss=265.25, lr=2.5e-05, throughput=4.5244]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[Epoch 0 / Train]: 0%| | 2/8614 [00:00<1:00:07, 2.39it/s, loss=nan, lr=2.5e-05, throughput=4.9813]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[Epoch 0 / Train]: 0%| | 3/8614 [00:01<56:35, 2.54it/s, loss=nan, lr=2.5e-05, throughput=5.4833]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[Epoch 0 / Train]: 0%| | 4/8614 [00:01<55:32, 2.58it/s, loss=nan, lr=2.5e-05, throughput=5.3257]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[Epoch 0 / Train]: 0%| | 5/8614 [00:01<54:26, 2.64it/s, loss=nan, lr=2.5e-05, throughput=5.473]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[Epoch 0 / Train]: 0%| | 6/8614 [00:02<53:34, 2.68it/s, loss=nan, lr=2.5e-05, throughput=5.5342]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[Epoch 0 / Train]: 0%| | 7/8614 [00:02<53:14, 2.69it/s, loss=nan, lr=2.5e-05, throughput=5.4624]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[Epoch 0 / Train]: 0%| | 8/8614 [00:03<52:47, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.5429]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[Epoch 0 / Train]: 0%| | 9/8614 [00:03<52:41, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.4693]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[Epoch 0 / Train]: 0%| | 10/8614 [00:03<52:14, 2.74it/s, loss=nan, lr=2.5e-05, throughput=5.6025]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[Epoch 0 / Train]: 0%| | 11/8614 [00:04<51:50, 2.77it/s, loss=nan, lr=2.5e-05, throughput=5.6395]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[Epoch 0 / Train]: 0%|▏ | 12/8614 [00:04<51:27, 2.79it/s, loss=nan, lr=2.5e-05, throughput=5.6746]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[Epoch 0 / Train]: 0%|▏ | 13/8614 [00:04<51:15, 2.80it/s, loss=nan, lr=2.5e-05, throughput=5.6452]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[Epoch 0 / Train]: 0%|▏ | 14/8614 [00:05<50:58, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.7043]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[Epoch 0 / Train]: 0%|▏ | 15/8614 [00:05<50:56, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.6454]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[Epoch 0 / Train]: 0%|▏ | 16/8614 [00:05<50:48, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.678]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[Epoch 0 / Train]: 0%|▏ | 17/8614 [00:06<50:38, 2.83it/s, loss=nan, lr=2.5e-05, throughput=5.7112]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[Epoch 0 / Train]: 0%|▏ | 18/8614 [00:06<50:47, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.6076]
[Epoch 0 / Train]: 0%|▏

Environment

ffmpeg          4.3      hf484d3e_0                       pytorch
pytorch         1.10.2   py3.9_cuda11.3_cudnn8.2.0_0      pytorch
pytorch-mutex   1.0      cuda                             pytorch
torchaudio      0.10.2   py39_cu113                       pytorch
torchvision     0.11.3   py39_cu113                       pytorch

Hi, overflows are expected when running GPT-2 on a single GPU. We recommend running it with more than 4 GPUs.
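For readers wondering why the log halves the loss scale on every step, here is a minimal Python sketch of dynamic loss scaling, the general technique the "OVERFLOW! ... Skipping step" messages point to. It is a toy model, not the ColossalAI implementation: the class name `DynamicLossScaler`, the helper `simulate`, the growth window of 1000 steps, and the assumed gradient magnitude of 1.0 are all hypothetical choices made for illustration. The initial scale of 2**32 and the halving factor are taken from the numbers in the log.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, grow after a run of clean steps."""

    def __init__(self, init_scale=2.0**32, scale_factor=2.0,
                 scale_window=1000, min_scale=1.0):
        self.scale = init_scale            # matches the first scale in the log
        self.scale_factor = scale_factor   # each overflow divides the scale by this
        self.scale_window = scale_window   # hypothetical: clean steps before growing
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should run, False if it is skipped."""
        overflow = any(not math.isfinite(g) for g in grads)
        if overflow:
            # Skip the step and retry the next batch with a smaller scale.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            self.scale *= self.scale_factor
        return True


def simulate(true_grad=1.0, fp16_max=65504.0, max_steps=40):
    """Count skipped steps until the scaled gradient fits in the fp16 range."""
    scaler = DynamicLossScaler()
    skipped = 0
    for _ in range(max_steps):
        scaled = true_grad * scaler.scale
        # Emulate fp16 saturation: anything above the fp16 maximum becomes inf.
        grad = scaled if abs(scaled) <= fp16_max else math.inf
        if scaler.update([grad]):
            break
        skipped += 1
    return skipped, scaler.scale


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions the scaler skips steps until the scale drops below the fp16 maximum of 65504, which is the same pattern of repeated halving seen in the log above. A run of skipped steps at startup is therefore a normal consequence of a large initial scale; it does not by itself explain the `loss=nan` values reported in the progress bar.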
