Gy-Lu commented on August 26, 2024

🐛 Describe the bug

I hit repeated loss-scale overflows when running the official GPT-2 scripts. Is this expected?

cd XXX/ColossalAI/examples/language/gpt
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

[Epoch 0 / Train]: 0%| | 1/8614 [00:00<1:03:35, 2.26it/s, loss=265.25, lr=2.5e-05, throughput=4.5244]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[Epoch 0 / Train]: 0%| | 2/8614 [00:00<1:00:07, 2.39it/s, loss=nan, lr=2.5e-05, throughput=4.9813]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[Epoch 0 / Train]: 0%| | 3/8614 [00:01<56:35, 2.54it/s, loss=nan, lr=2.5e-05, throughput=5.4833]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[Epoch 0 / Train]: 0%| | 4/8614 [00:01<55:32, 2.58it/s, loss=nan, lr=2.5e-05, throughput=5.3257]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[Epoch 0 / Train]: 0%| | 5/8614 [00:01<54:26, 2.64it/s, loss=nan, lr=2.5e-05, throughput=5.473]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[Epoch 0 / Train]: 0%| | 6/8614 [00:02<53:34, 2.68it/s, loss=nan, lr=2.5e-05, throughput=5.5342]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[Epoch 0 / Train]: 0%| | 7/8614 [00:02<53:14, 2.69it/s, loss=nan, lr=2.5e-05, throughput=5.4624]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[Epoch 0 / Train]: 0%| | 8/8614 [00:03<52:47, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.5429]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[Epoch 0 / Train]: 0%| | 9/8614 [00:03<52:41, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.4693]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[Epoch 0 / Train]: 0%| | 10/8614 [00:03<52:14, 2.74it/s, loss=nan, lr=2.5e-05, throughput=5.6025]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[Epoch 0 / Train]: 0%| | 11/8614 [00:04<51:50, 2.77it/s, loss=nan, lr=2.5e-05, throughput=5.6395]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[Epoch 0 / Train]: 0%|▏ | 12/8614 [00:04<51:27, 2.79it/s, loss=nan, lr=2.5e-05, throughput=5.6746]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[Epoch 0 / Train]: 0%|▏ | 13/8614 [00:04<51:15, 2.80it/s, loss=nan, lr=2.5e-05, throughput=5.6452]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[Epoch 0 / Train]: 0%|▏ | 14/8614 [00:05<50:58, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.7043]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[Epoch 0 / Train]: 0%|▏ | 15/8614 [00:05<50:56, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.6454]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[Epoch 0 / Train]: 0%|▏ | 16/8614 [00:05<50:48, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.678]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[Epoch 0 / Train]: 0%|▏ | 17/8614 [00:06<50:38, 2.83it/s, loss=nan, lr=2.5e-05, throughput=5.7112]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[Epoch 0 / Train]: 0%|▏ | 18/8614 [00:06<50:47, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.6076]
[Epoch 0 / Train]: 0%|▏

Environment

ffmpeg          4.3      hf484d3e_0                      pytorch
pytorch         1.10.2   py3.9_cuda11.3_cudnn8.2.0_0     pytorch
pytorch-mutex   1.0      cuda                            pytorch
torchaudio      0.10.2   py39_cu113                      pytorch
torchvision     0.11.3   py39_cu113                      pytorch

Hi, overflows are expected when running GPT-2 on a single GPU. We recommend running it with more than 4 GPUs.

from colossalai-examples.
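For readers unfamiliar with the `OVERFLOW! ... Skipping step` messages in the log: they are typical of dynamic loss scaling in mixed-precision training, where the scale is cut each time non-finite gradients are detected and the optimizer step is skipped. The following is a minimal sketch of that bookkeeping only. The class name, the function name, and every parameter other than the initial scale of 2^32 (taken from the log) are hypothetical, and this is not ColossalAI's or DeepSpeed's actual implementation.

```python
import math


class DynamicLossScaler:
    """Hypothetical minimal loss scaler (illustration only, not a library class)."""

    def __init__(self, init_scale=2.0 ** 32, backoff_factor=0.5,
                 growth_factor=2.0, growth_interval=1000, min_scale=1.0):
        # init_scale matches the first "Attempted loss scale" in the log;
        # the remaining defaults are assumptions made for this sketch.
        self.scale = init_scale
        self.backoff_factor = backoff_factor
        self.growth_factor = growth_factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, found_overflow):
        """Adjust the scale; return True if the optimizer step should run."""
        if found_overflow:
            # Shrink the scale and skip this step.
            self.scale = max(self.scale * self.backoff_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        # After enough clean steps in a row, try a larger scale again.
        if self.good_steps % self.growth_interval == 0:
            self.scale *= self.growth_factor
        return True


def has_overflow(grads):
    """Return True if any gradient value is inf or nan."""
    return any(not math.isfinite(g) for g in grads)


# Replay the 17 consecutive overflows reported in the log.
scaler = DynamicLossScaler()
for _ in range(17):
    scaler.update(found_overflow=True)
print(scaler.scale)  # 32768.0, the last "reducing to" value in the log
```

The sketch only reproduces the halving sequence seen in the log. It does not explain why the reported loss itself is `nan` from the second step onward.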
