Comments (1)
🐛 Describe the bug
I met overflow using the official scripts for GPT2. Is that a normal case?
cd XXX/ColossalAI/examples/language/gpt export DATA=/data/scratch/gpt_data/small-gpt-dataset.json torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch
[Epoch 0 / Train]: 0%| | 1/8614 [00:00<1:03:35, 2.26it/s, loss=265.25, lr=2.5e-05, throughput=4.5244][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0 [Epoch 0 / Train]: 0%| | 2/8614 [00:00<1:00:07, 2.39it/s, loss=nan, lr=2.5e-05, throughput=4.9813][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0 [Epoch 0 / Train]: 0%| | 3/8614 [00:01<56:35, 2.54it/s, loss=nan, lr=2.5e-05, throughput=5.4833][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0 [Epoch 0 / Train]: 0%| | 4/8614 [00:01<55:32, 2.58it/s, loss=nan, lr=2.5e-05, throughput=5.3257][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0 [Epoch 0 / Train]: 0%| | 5/8614 [00:01<54:26, 2.64it/s, loss=nan, lr=2.5e-05, throughput=5.473][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0 [Epoch 0 / Train]: 0%| | 6/8614 [00:02<53:34, 2.68it/s, loss=nan, lr=2.5e-05, throughput=5.5342][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0 [Epoch 0 / Train]: 0%| | 7/8614 [00:02<53:14, 2.69it/s, loss=nan, lr=2.5e-05, throughput=5.4624][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0 [Epoch 0 / Train]: 0%| | 8/8614 [00:03<52:47, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.5429][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0 [Epoch 0 / Train]: 0%| | 9/8614 [00:03<52:41, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.4693][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0 [Epoch 0 / Train]: 0%| | 10/8614 [00:03<52:14, 2.74it/s, loss=nan, lr=2.5e-05, throughput=5.6025][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0 [Epoch 0 / Train]: 0%| | 11/8614 [00:04<51:50, 2.77it/s, loss=nan, lr=2.5e-05, throughput=5.6395][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0 [Epoch 0 / Train]: 0%|▏ | 12/8614 [00:04<51:27, 2.79it/s, loss=nan, lr=2.5e-05, throughput=5.6746][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0 [Epoch 0 / Train]: 0%|▏ | 13/8614 [00:04<51:15, 2.80it/s, loss=nan, lr=2.5e-05, throughput=5.6452][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0 [Epoch 0 / Train]: 0%|▏ | 14/8614 [00:05<50:58, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.7043][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0 [Epoch 0 / Train]: 0%|▏ | 15/8614 [00:05<50:56, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.6454][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0 [Epoch 0 / Train]: 0%|▏ | 16/8614 [00:05<50:48, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.678][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0 [Epoch 0 / Train]: 0%|▏ | 17/8614 [00:06<50:38, 2.83it/s, loss=nan, lr=2.5e-05, throughput=5.7112][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0 [Epoch 0 / Train]: 0%|▏ | 18/8614 [00:06<50:47, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.6076][Epoch 0 / Train]: 0%|▏
Environment
ffmpeg 4.3 hf484d3e_0 pytorch pytorch 1.10.2 py3.9_cuda11.3_cudnn8.2.0_0 pytorch pytorch-mutex 1.0 cuda pytorch torchaudio 0.10.2 py39_cu113 pytorch torchvision 0.11.3 py39_cu113 pytorch
Hi, there would be overflows running GPT-2 with a single GPU. We recommend you run it with more than 4 GPUs.
from colossalai-examples.
Related Issues (20)
- Problem with saving model state dict HOT 2
- BERT示例运行错误,ColossalAI-Examples/language/bert/sequene_parallel/ HOT 2
- 运行GPT2案例出现RuntimeError: Could not find 'SLURM_PROCID'问题,是必须要装SLURM环境? HOT 1
- [Compatibility] Runining OPT using PyTorch 1.12 and Gemini placement_policy = 'cuda' failed HOT 3
- 当模型gradient_checkpointing时运行feature/zero/train_v2.py出错 HOT 5
- Cannot find the gradient handler example HOT 1
- BERT Data Preprocessing HOT 3
- ImportError: cannot import name 'colo_state_dict' from 'colossalai.utils.model.colo_init_context' HOT 1
- RuntimeError: CUDA out of memory with cifar10 in data_parallel example HOT 1
- The error happened when I did multi-node distributed training HOT 1
- ImportError running detr HOT 1
- question about import model_zoo.gpt.gpt as col_gpt HOT 2
- It seems the pipeline parallel document is out of date(https://www.colossalai.org/docs/features/pipeline_parallel) HOT 7
- there maybe some bug about the train_gpt.py(https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py) HOT 5
- detr-debug pipelinable.py HOT 1
- Load ColossalAI GPT model as HuggingFace/Transformers Model HOT 2
- Outdated OPT example
- grad is none when run gpt2 with pipeline parallelism only
- cannot import name 'OPTForCausalLM'
- connection failure HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from colossalai-examples.