
colossalai-examples's Introduction

ColossalAI-Examples

2023.01.05

This repo is deprecated. Please use the actively maintained examples at ColossalAI/example.

Introduction

This repository provides various examples for Colossal-AI. For each feature of Colossal-AI, you can find a simple example in the features folder and a corresponding tutorial in the features section of the documentation. More complex examples for domain-specific models can also be found in this repository; some of them are covered in the advanced tutorials of the documentation.

This repository is built upon Colossal-AI and Titans.

🚀 Quick Links

Colossal-AI | Titans Paper | Documentation | Forum | Blog

Setup

  1. Install Colossal-AI

You can download Colossal-AI here.

  2. Install dependencies
pip install -r requirements.txt
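
After installation, a quick sanity check can confirm that Colossal-AI and CUDA are visible from Python. This is a minimal sketch, not part of the official setup steps:

# Post-install sanity check (sketch; assumes colossalai and a CUDA-enabled torch are installed)
import torch
import colossalai

print('colossalai version:', colossalai.__version__)
print('CUDA available:', torch.cuda.is_available())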

Table of Contents

This repository contains examples of training models with ColossalAI. These examples fall under three categories:

  1. Computer Vision

    • ResNet
    • SimCLR
    • Vision Transformer
      • Data Parallel
      • Pipeline Parallel
      • Hybrid Parallel
    • WideNet
      • Mixture of experts
  2. Natural Language Processing

    • BERT
      • Sequence Parallel
    • GPT-2
      • Hybrid Parallel
    • GPT-3
      • Hybrid Parallel
    • Knowledge Graph Embedding
  3. Features

    • Mixed Precision Training
    • Gradient Accumulation
    • Gradient Clipping
    • Tensor Parallel
    • Pipeline Parallel
    • ZeRO

The image and language folders contain complex model applications, while the features folder demonstrates individual Colossal-AI features. Examples in the features folder are kept simple so that users can run them in minutes, and each of them corresponds to a tutorial in the Official Documentation.

If you wish to contribute to this repository, please read the Contributing section below.

Discussion

Discussion about the Colossal-AI project and examples is always welcome! We would love to exchange ideas with the community to better help this project grow. If you think anything needs to be discussed, you can jump to our discussion forum and create a topic there.

If you encounter any problem while running these examples, you may want to raise an issue in this repository.

Contributing

This project welcomes constructive ideas and implementations from the community.

Update an Example

If you find that an example is broken (not working) or not user-friendly, you may open a pull request to this repository to update the example.

Add a New Example

If you wish to add an example for a specific application, please follow the steps below.

  1. Create a folder in the image, language, or features folder. Generally we do not accept new examples for features, as one example per feature is often enough. We encourage contributions that use hybrid parallelism or cover models from other domains (e.g. GAN, self-supervised learning, detection, video understanding, text classification, text generation).
  2. Prepare the configuration files and train.py (see the sketch after this list).
  3. Prepare a detailed README in your example folder covering environment setup, dataset preparation, code execution, etc.
  4. Update the table of contents (first section above) in this README file.
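
For reference, the sketch below shows the rough shape of such a configuration file and training script, following the legacy engine API used by the existing examples. It is a hypothetical minimal template rather than a prescribed layout; the ResNet-34/CIFAR-10 choice and all hyperparameter values are placeholders.

# config.py -- hypothetical minimal configuration (all values are placeholders)
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 128
NUM_EPOCHS = 10
fp16 = dict(mode=AMP_TYPE.TORCH)  # optional: enable PyTorch AMP


# train.py -- minimal training loop using the legacy colossalai engine API (sketch)
import colossalai
import torch
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.utils import get_dataloader
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torchvision.models import resnet34


def main():
    # read config.py and join the distributed environment set up by torchrun / colossalai run
    colossalai.launch_from_torch(config='./config.py')
    logger = get_dist_logger()

    model = resnet34(num_classes=10)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    train_dataset = CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor())
    train_dataloader = get_dataloader(dataset=train_dataset, batch_size=gpc.config.BATCH_SIZE, shuffle=True)

    # wrap model/optimizer/criterion into an engine that applies the config (AMP, gradient handlers, ...)
    engine, train_dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, train_dataloader=train_dataloader)

    for epoch in range(gpc.config.NUM_EPOCHS):
        engine.train()
        for img, label in train_dataloader:
            img, label = img.cuda(), label.cuda()
            engine.zero_grad()
            output = engine(img)
            loss = engine.criterion(output, label)
            engine.backward(loss)
            engine.step()
        logger.info(f'epoch {epoch} finished', ranks=[0])


if __name__ == '__main__':
    main()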

If your PR is accepted, we may invite you to contribute a tutorial or blog post to the ColossalAI Documentation.

colossalai-examples's People

Contributors

1saa, binmakeswell, boxiangw, extremeviscent, fanjinfucool, feifeibear, frankleeeee, gy-lu, huxin711, i-e-e-e, kurisusnowdeng, lstm-kirigaya, mandoxzhang, miracledesigner, oahzxl, ofey404, ryanrussell, ver217, wang-cr, wesley-jzy, yuliangliu0306, yuxuan-lou, zhaoyi1222


colossalai-examples's Issues

There may be a bug in train_gpt.py (https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py)

🐛 Describe the bug

I tried to run a config using train_gpt.py. I added a model to gpt.py:


def gpt2_test4gpu350M(**kwargs):
    model_kwargs = dict(hidden_size=1024, depth=24, num_heads=16,max_position_embeddings=2048, **kwargs)
    return _create_gpt_model(**model_kwargs)

And I changed my webtext dataset to this:


# Imports implied by the snippet (registry import path assumed from the legacy Colossal-AI API).
import os

import torch
from torch.utils.data import Dataset

from colossalai.registry import DATASETS


@DATASETS.register_module
class WebtextDataset(Dataset):

    def __init__(self, path=None, seq_len=1024, mbs=4) -> None:
        super().__init__()
        if path is not None:
            root = os.path.dirname(path)
            encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
        else:
            encoded_data_cache_path = f'gpt_webtext_{seq_len}.pt'

        # synthetic data: random token ids and a random 0/1 attention mask
        self.data = torch.randint(0, 10000, (seq_len,), requires_grad=False, device=torch.device('cpu')).long()
        self.attention_mask = torch.rand((seq_len, seq_len), requires_grad=False, device=torch.device('cpu'))
        self.attention_mask = torch.where(self.attention_mask < 0.5, 0, 1)
        print("self.attention_mask:", self.attention_mask[:20])

        self.mbs = mbs
        print("self.mbs:", self.mbs)

        torch.save((seq_len, self.data, self.attention_mask), encoded_data_cache_path)

    def __len__(self):
        print("WebtextDataset, self.mbs:", self.mbs)  # len(train_loader) is len(dataset) / batch_size
        return self.mbs * 5

    def __getitem__(self, index):
        return {'input_ids': self.data,
                'attention_mask': self.attention_mask[0]}, self.data

When I run this model, Colossal-AI spends only about 1 s per iteration, but running the same model on Megatron-LM takes about 100 s per iteration.

Environment

No response

BERT Data Preprocessing

🐛 Describe the bug

NVIDIA DeepLearningExamples removed LDDL from DLE tools on Aug 16, 2022. Therefore, the guide on https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/bert/preprocessing fails to work in the following aspects:

  1. pip install git+https://github.com/NVIDIA/DeepLearningExamples.git#subdirectory=Tools/lddl won't work. The solution could be either using the new URL, i.e. pip install git+https://github.com/NVIDIA/lddl.git, or finding lddl in the historical version https://github.com/NVIDIA/DeepLearningExamples/tree/29f5b7ab059025e4ead512e54037eddbdf740f19.
  2. After installing lddl, running pip install boto3 leads to a version conflict, whose effect on the whole process is unknown.
  3. In the preprocessing part, neither phase 1 nor phase 2 works. Details will be provided later.
  4. Changing the lddl source from the new URL to the historical version does not solve problem 3, and skipping the boto3 installation does not help either.

Environment

python=3.8
pytorch=1.12.1
cudatoolkit=10.2.89
cuda=10.2

Too large training loss

🐛 Describe the bug

Hi

I'm training BERT with sequence parallelism in Colossal-AI according to this link, but my training loss is too large, and it seems to grow roughly linearly with the sequence parallel size.

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))
the training loss in the beginning was and after 2330 steps the training loss is 13.044

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))
after 2330 steps the training loss is 13.044

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))
after 2330 steps the training loss is 6.5549

Environment

After running colossalai check -i I got the following output (screenshot omitted).

My devices are 8x RTX 3090, and the training batch size is 128 across the three sequence parallel settings.

My training config is shown in the following screenshot (omitted).

Thanks!

'RuntimeError: CUDA error: an illegal memory access was encountered' with large batch size of GPT2-example

🐛 Describe the bug

When I ran gpt2-vanilla with a batch size of 64, I got RuntimeError: CUDA error: an illegal memory access was encountered.
I then printed the GPU memory usage. At the second iteration, the max allocated memory was 74 GB (torch.cuda.max_memory_allocated); then the error happened, while the currently allocated memory was no more than 50 GB (torch.cuda.memory_allocated).
The same error also happens with gpt2-zero3.
I think the peak memory usage exceeded the device memory even though the total allocated memory did not.
This bug may be fixed by a future PyTorch update :)
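
For context, a minimal sketch of how peak vs. currently allocated GPU memory can be compared around a training step, using standard PyTorch APIs (this is generic debugging code, not code from this repository):

# Sketch: compare current vs. peak CUDA memory (plain PyTorch)
import torch

def report_cuda_memory(tag):
    current_gb = torch.cuda.memory_allocated() / 1024 ** 3   # memory held by live tensors right now
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3  # high-water mark since the last reset
    print(f'[{tag}] allocated={current_gb:.2f} GB, peak={peak_gb:.2f} GB')

# e.g. call torch.cuda.reset_peak_memory_stats() at the start of each iteration,
# then report_cuda_memory('after step') to see whether transient peaks exceed device capacity.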

Environment

CUDA/11.3.1
NCCL/2.9.6
Python/3.8.12
PyTorch/1.10.1+cu113

The error happened when I did multi-node distributed training

🐛 Describe the bug

Excuse me. When I enter the command colossalai run --nproc_per_node 4 --host [host1 ip addr],[host2 ip addr] --master_addr [host1 ip addr] train.py, I get this message: Error: failed to run torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=[host1 ip addr]:29500 --rdzv_id=colossalai-default-job train.py on [host2 ip addr]

What configuration do I have to set in the train.py you provided?
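
For context, the example train.py scripts typically do not hard-code any rendezvous settings themselves; they call colossalai.launch_from_torch, which picks up the environment variables injected by torchrun / colossalai run. A hypothetical minimal sketch of that pattern:

# Sketch: typical launch pattern in the example train.py scripts (legacy API)
import colossalai

def main():
    # reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT set by torchrun,
    # so no per-node configuration is needed inside train.py itself
    colossalai.launch_from_torch(config='./config.py')
    # ... build model, dataloaders and engine here ...

if __name__ == '__main__':
    main()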

Environment

CUDA Version: 11.3
PyTorch Version: 1.12.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

failed to run gpt2 zero3 example

🐛 Describe the bug

Command:

OMP_NUM_THREADS=32 torchrun --standalone --nnodes=1 --nproc_per_node 2 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

Result:

Traceback (most recent call last):
  File "train_gpt.py", line 130, in <module>
    main()
  File "train_gpt.py", line 56, in main
    ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
TypeError: __init__() missing 1 required positional argument: 'convert_fp16'
Traceback (most recent call last):
  File "train_gpt.py", line 130, in <module>
    main()
  File "train_gpt.py", line 56, in main
    ctx = ZeroInitContext(target_device=torch.cuda.current_device(),
TypeError: __init__() missing 1 required positional argument: 'convert_fp16'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 38441) of binary: /home/toga/.conda/envs/ColAI/bin/python
Traceback (most recent call last):
  File "/home/toga/.conda/envs/ColAI/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/toga/.conda/envs/ColAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_gpt.py FAILED

Environment

colossalai

colossalai               0.1.1

nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Python

Python 3.8.12

PyTorch

torch                    1.10.1

ColossalAI cannot run the shufflenet_v2_x1_0 model as torch does

🐛 Describe the bug

With plain PyTorch, models.shufflenet_v2_x1_0 can be trained with BATCH_SIZE = 16384, but the same setting does not run successfully with ColossalAI.
The output is below:

(conda-general) user@user:~/research/Experiments/ColossalAI-Examples/image/resnet$ colossalai run --nproc_per_node 1 train.py
[06/16/22 13:30:42] INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Added key: store_based_barrier_key:1 to store
                             for rank: 0                                        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Rank 0: Completed store-based barrier for    
                             key:store_based_barrier_key:1 with 1 nodes.        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Added key: store_based_barrier_key:2 to store
                             for rank: 0                                        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Rank 0: Completed store-based barrier for    
                             key:store_based_barrier_key:2 with 1 nodes.        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Added key: store_based_barrier_key:3 to store
                             for rank: 0                                        
                    ...                                     
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Rank 0: Completed store-based barrier for    
                             key:store_based_barrier_key:5 with 1 nodes.        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Added key: store_based_barrier_key:6 to store
                             for rank: 0                                        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Rank 0: Completed store-based barrier for    
                             key:store_based_barrier_key:6 with 1 nodes.        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Added key: store_based_barrier_key:7 to store
                             for rank: 0                                        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Rank 0: Completed store-based barrier for    
                             key:store_based_barrier_key:7 with 1 nodes.        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Added key: store_based_barrier_key:8 to store
                             for rank: 0                                        
                    INFO     colossalai - torch.distributed.distributed_c10d -  
                             INFO: Rank 0: Completed store-based barrier for    
                             key:store_based_barrier_key:8 with 1 nodes.        
                    INFO     colossalai - colossalai - INFO: /home/user/softw
                             are/python/anaconda/anaconda3/envs/conda-general/li
                             b/python3.10/site-packages/colossalai/context/paral
                             lel_context.py:521 set_device                      
                    INFO     colossalai - colossalai - INFO: process rank 0 is  
                             bound to device 0                                  
[06/16/22 13:30:43] INFO     colossalai - colossalai - INFO: /home/user/softw
                             are/python/anaconda/anaconda3/envs/conda-general/li
                             b/python3.10/site-packages/colossalai/context/paral
                             lel_context.py:557 set_seed                        
                    INFO     colossalai - colossalai - INFO: initialized seed on
                             rank 0, numpy: 1024, python random: 1024,          
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR:      
                             1024,the default parallel seed is                  
                             ParallelMode.DATA.                                 
                    INFO     colossalai - colossalai - INFO: /home/user/softw
                             are/python/anaconda/anaconda3/envs/conda-general/li
                             b/python3.10/site-packages/colossalai/initialize.py
                             :117 launch                                        
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, data parallel size: 1, 
                             pipeline parallel size: 1, tensor parallel size: 1 
Files already downloaded and verified
[06/16/22 13:30:44] INFO     colossalai - colossalai - INFO: /home/user/softw
                             are/python/anaconda/anaconda3/envs/conda-general/li
                             b/python3.10/site-packages/colossalai/initialize.py
                             :266 initialize                                    
                    INFO     colossalai - colossalai - INFO:                    
                             ========== Your Config ========                    
                             {'BATCH_SIZE': 16384,                              
                              'CONFIG': {'fp16': {'mode': <AMP_TYPE.TORCH:      
                             'torch'>}},                                        
                              'NUM_EPOCHS': 200}                                
                             ================================                   
                                                                                
                    INFO     colossalai - colossalai - INFO: /home/user/softw
                             are/python/anaconda/anaconda3/envs/conda-general/li
                             b/python3.10/site-packages/colossalai/initialize.py
                             :278 initialize                                    
                    INFO     colossalai - colossalai - INFO: cuDNN benchmark =  
                             True, deterministic = False                        
                    WARNING  colossalai - colossalai - WARNING: /home/user/so
                             ftware/python/anaconda/anaconda3/envs/conda-general
                             /lib/python3.10/site-packages/colossalai/initialize
                             .py:304 initialize                                 
                    WARNING  colossalai - colossalai - WARNING: Initializing an 
                             non ZeRO model with optimizer class                
                    WARNING  colossalai - colossalai - WARNING: /home/user/so
                             ftware/python/anaconda/anaconda3/envs/conda-general
                             /lib/python3.10/site-packages/colossalai/initialize
                             .py:436 initialize                                 
                    WARNING  colossalai - colossalai - WARNING: No PyTorch DDP  
                             or gradient handler is set up, please make sure you
                             do not need to all-reduce the gradients after a    
                             training step.                                     
 25%|██▌       | 1/4 [00:05<00:16,  5.59s/it]
Traceback (most recent call last):
  File "/home/user/research/Experiments/ColossalAI-Examples/image/resnet/train.py", line 157, in <module>
    main()
  File "/home/user/research/Experiments/ColossalAI-Examples/image/resnet/train.py", line 103, in main
    output = engine(img)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
    return self.model(*args, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 156, in forward
    return self._forward_impl(x)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 147, in _forward_impl
    x = self.stage2(x)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 85, in forward
    out = torch.cat((x1, self.branch2(x2)), dim=1)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 10.76 GiB total capacity; 9.54 GiB already allocated; 9.00 MiB free; 9.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2549731) of binary: /home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python
Fatal Python error: Segmentation fault

Thread 0x00007ff209a3e700 (most recent call first):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 324 in wait
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 600 in wait
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 254 in _run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 946 in run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 966 in _bootstrap

Current thread 0x00007ff2e1d5a740 (most recent call first):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/torchrun", line 33 in <module>

Extension modules: torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg.lapack_lite, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 22)
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py on 127.0.0.1

Environment

CUDA: 11.4

connection failure

🐛 Describe the bug

I hit a runtime error while running the code:
The client socket has failed to connect to any network address of (hcp-bb-03, 52873). The client socket has failed to connect to hcp-bb-03:52873 (errno: 110 - Connection timed out)
Command line used: colossalai run --nproc_per_node 4 --master_port 29505 train.py

Environment

(environment screenshot omitted)

Cannot find the gradient handler example

📚 The doc issue

We provide a runnable example to demonstrate the use of gradient handler. In this example, we used DataParallelGradientHandler instead of PyTorch DistributedDataParallel for data parallel training.

The example of using a customized gradient handler cannot be found, and the URL in the doc returns a 404.

Load ColossalAI GPT model as HuggingFace/Transformers Model

Describe the feature

Hi all,

I'm trying to use a GPT model I trained with ColossalAI for inference with huggingface/transformers, but it is not possible to load it as a Hugging Face model since it is implemented in plain PyTorch. How can I load the model I trained with the huggingface/transformers library?

Thanks so much for your help.

Best,
Red

ZeRO without using shard_param

🐛 Describe the bug

When I use ZeRO without shard_param, the following problem occurs:

Traceback (most recent call last):
  File "train.py", line 175, in <module>
    main()
  File "train.py", line 39, in main
    with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
  File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
    self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
  File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
    assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
AttributeError: 'int' object has no attribute 'type'

My init code is:

# Imports implied by the snippet (paths assumed from the legacy 0.1.x API; resnet34 taken from torchvision).
import os

import torch
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy
from torchvision.models import resnet34


def main():
    parser = colossalai.get_default_parser()
    parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
    args = parser.parse_args()

    colossalai.launch_from_torch(config='./config.py')

    logger = get_dist_logger()

    rank = int(os.environ['RANK'])
    # build resnet
    use_zero3 = hasattr(gpc.config, 'zero')
    if use_zero3:
        shard_strategy = TensorShardStrategy()
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
            model = resnet34(num_classes=10)
    else:
        model = resnet34(num_classes=10)
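
For reference, the assertion in the traceback expects target_device to be a torch.device rather than the int returned by torch.cuda.current_device(). A minimal, untested sketch of that adjustment:

# Sketch: pass a torch.device (not an int) so that target_device.type == 'cuda' holds
import torch

target_device = torch.device('cuda', torch.cuda.current_device())
# with ZeroInitContext(target_device=target_device,
#                      shard_strategy=shard_strategy,
#                      shard_param=False):
#     model = resnet34(num_classes=10)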

my config is

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

zero = dict(
    model_config=dict(
        tensor_placement_policy='cuda',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=False
    ),
    optimizer_config=dict()
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    # weight_decay=1e-2,
)

BATCH_SIZE = 64
NUM_EPOCHS = 20
LOGGING_FREQUNCE = 20
OUTPUT = './'

gradient_clipping = 5.0

Environment

pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

ubuntu 18.04

Overflow in GPT examples

🐛 Describe the bug

I hit an overflow when using the official scripts for GPT-2. Is that normal?

cd XXX/ColossalAI/examples/language/gpt
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

[Epoch 0 / Train]: 0%| | 1/8614 [00:00<1:03:35, 2.26it/s, loss=265.25, lr=2.5e-05, throughput=4.5244][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[Epoch 0 / Train]: 0%| | 2/8614 [00:00<1:00:07, 2.39it/s, loss=nan, lr=2.5e-05, throughput=4.9813][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[Epoch 0 / Train]: 0%| | 3/8614 [00:01<56:35, 2.54it/s, loss=nan, lr=2.5e-05, throughput=5.4833][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[Epoch 0 / Train]: 0%| | 4/8614 [00:01<55:32, 2.58it/s, loss=nan, lr=2.5e-05, throughput=5.3257][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[Epoch 0 / Train]: 0%| | 5/8614 [00:01<54:26, 2.64it/s, loss=nan, lr=2.5e-05, throughput=5.473][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[Epoch 0 / Train]: 0%| | 6/8614 [00:02<53:34, 2.68it/s, loss=nan, lr=2.5e-05, throughput=5.5342][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[Epoch 0 / Train]: 0%| | 7/8614 [00:02<53:14, 2.69it/s, loss=nan, lr=2.5e-05, throughput=5.4624][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[Epoch 0 / Train]: 0%| | 8/8614 [00:03<52:47, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.5429][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[Epoch 0 / Train]: 0%| | 9/8614 [00:03<52:41, 2.72it/s, loss=nan, lr=2.5e-05, throughput=5.4693][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[Epoch 0 / Train]: 0%| | 10/8614 [00:03<52:14, 2.74it/s, loss=nan, lr=2.5e-05, throughput=5.6025][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[Epoch 0 / Train]: 0%| | 11/8614 [00:04<51:50, 2.77it/s, loss=nan, lr=2.5e-05, throughput=5.6395][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[Epoch 0 / Train]: 0%|▏ | 12/8614 [00:04<51:27, 2.79it/s, loss=nan, lr=2.5e-05, throughput=5.6746][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[Epoch 0 / Train]: 0%|▏ | 13/8614 [00:04<51:15, 2.80it/s, loss=nan, lr=2.5e-05, throughput=5.6452][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[Epoch 0 / Train]: 0%|▏ | 14/8614 [00:05<50:58, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.7043][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[Epoch 0 / Train]: 0%|▏ | 15/8614 [00:05<50:56, 2.81it/s, loss=nan, lr=2.5e-05, throughput=5.6454][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[Epoch 0 / Train]: 0%|▏ | 16/8614 [00:05<50:48, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.678][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[Epoch 0 / Train]: 0%|▏ | 17/8614 [00:06<50:38, 2.83it/s, loss=nan, lr=2.5e-05, throughput=5.7112][deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[Epoch 0 / Train]: 0%|▏ | 18/8614 [00:06<50:47, 2.82it/s, loss=nan, lr=2.5e-05, throughput=5.6076][Epoch 0 / Train]: 0%|▏

Environment

ffmpeg 4.3 hf484d3e_0 pytorch
pytorch 1.10.2 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 0.10.2 py39_cu113 pytorch
torchvision 0.11.3 py39_cu113 pytorch

Outdated OPT example

🐛 Describe the bug

When running the OPT example, I got the following error:

AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'

This is caused by an outdated API. Compared with the OPT example in ColossalAI, the example here has not been updated for a while.

Environment

No response

wikiextractor raises BdbQuit

🐛 Describe the bug

Hi all,
When I run the extraction step (# extract module) of the BERT example in language/bert:
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
I get a raised BdbQuit. This seems to be solved here by changing the wikiextractor version to 3.0.4.
But after that, the example code no longer works because 3.0.4 does not support --json.

Environment

No response

ImportError: cannot import name 'colo_state_dict' from 'colossalai.utils.model.colo_init_context'

🐛 Describe the bug

I am trying the colo_vit example but got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'colo_state_dict' from 'colossalai.utils.model.colo_init_context' (/home/wfh/.local/lib/python3.8/site-packages/colossalai/utils/model/colo_init_context.py)

This line seems to have an issue:

https://github.com/hpcaitech/ColossalAI-Examples/blob/main/image/vision_transformer/colo_vit/train.py#L11

Environment

>>> colossalai.__version__
'0.1.9'

ImportError running detr

🐛 Describe the bug

File "/workspace/ColossalAI-Examples/image/detr/models/transformer.py", line 10, in
from titans.layer.attention import DeTrAttention
ImportError: cannot import name 'DeTrAttention' from 'titans.layer.attention' (/opt/conda/lib/python3.8/site-packages/titans/layer/attention/__init__.py)

Environment

No response

BERT example fails to run: ColossalAI-Examples/language/bert/sequene_parallel/

🐛 Describe the bug

I used the latest Docker Hub image 0.1.8, but the BERT sequence parallel example (ColossalAI-Examples/language/bert/sequene_parallel/) still fails to run and reports a missing module:
Traceback (most recent call last):
File "/workspace/ColossalAI-Examples/language/bert/sequene_parallel/train.py", line 10, in
from model.bert import BertForPretrain
File "/workspace/ColossalAI-Examples/language/bert/sequene_parallel/model/bert.py", line 12, in
from colossalai.builder.pipeline import partition_uniform
ModuleNotFoundError: No module named 'colossalai.builder.pipeline'


Environment

Docker image: docker pull hpcaitech/colossalai:0.1.8

no kernel image

🐛 Describe the bug

Hi

I'm running sequence parallel BERT pre-training, but I hit this problem (error screenshot omitted):

What could I do to solve this problem?

Thanks!

Environment

CUDA 11.1
torch 1.12.0
python 3.8.13
colossalai 0.1.7
Device RTX 3090

Provide relatively small model for ViT

Describe the feature

Hi, I find that the provided examples use models that are too large. For instance, CIFAR-10 is resized to 224x224 and trained with ViT-Huge (at least not a tiny variant).
This resize is quite redundant, and so is ViT-Huge for CIFAR-10.

Error when running features/zero/train_v2.py with gradient_checkpointing enabled on the model

🐛 Describe the bug

Traceback (most recent call last):
File "/data1/users/jizhong1/ColossalAI-Examples/features/zero/train_v2.py", line 133, in
main()
File "/dirname/ColossalAI-Examples/features/zero/train_v2.py", line 123, in main
optimizer.backward(loss)
File "/python_path/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 154, in backward
self.module.backward(loss)
File "/python_path/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
loss.backward()
File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
return handle_torch_function(
File "/python_path/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/python_path/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 171, in torch_function
ret = func(*args, **kwargs)
File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/python_path/lib/python3.9/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/python_path/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/python_path/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 130, in backward
outputs = ctx.run_function(*detached_inputs)
File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in custom_forward
return module(*inputs, use_cache, output_attentions)
File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 400, in forward
hidden_states = self.ln_1(hidden_states)
File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/python_path/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/python_path/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40088) of binary: /python_path/bin/python
Traceback (most recent call last):
File "/python_path/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/python_path/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Environment

No response

failed to run gpt example

🐛 Describe the bug

cd ColossalAI/examples/language/gpt
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

bash: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/lib/libtinfo.so.6: no version information available (required by bash)
Colossalai should be built with cuda extension to use the FP16 optimizer
/home/lcfjr/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA A100-PCIE-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
colossalai - colossalai - 2022-02-24 15:04:02,751 INFO: process rank 0 is bound to device 0
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Build data loader
colossalai - colossalai - 2022-02-24 15:04:02,864 INFO: Build model
Traceback (most recent call last):
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 118, in
main()
File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 49, in main
model = gpc.config.model.pop('type')(**gpc.config.model)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 402, in gpt2_small
return create_gpt_model(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 368, in create_gpt_model
model = GPT(**model_kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 261, in init
self.embed = GPTEmbedding(embedding_dim=dim,
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 33, in init
self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype)
File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
f(module, *args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/layer/colossalai_layer/embedding.py", line 69, in init
weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim)
File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/init.py", line 31, in initializer
return nn.init.normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 151, in normal_
return _no_grad_normal_(tensor, mean, std)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
return tensor.normal_(mean, std)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to send a keep-alive heartbeat to the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1150747) of binary: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.00041747093200683594 seconds
Traceback (most recent call last):
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
store_util.barrier(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 52, in synchronize
store.set(f"{key_prefix}{rank}", data)
RuntimeError: Broken pipe
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to shutdown the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/lcfjr/.local/bin/torchrun", line 10, in
sys.exit(main())
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_gpt.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-02-24_15:04:10
host : HPC-AI
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1150747)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response

Memory leakage in BERT example

🐛 Describe the bug

I attempted to run the BERT example on two GPUs in a single node using the following command:
torchrun --nproc_per_node 1 --master_addr localhost --master_port 29500 train.py

However, the allocated device memory inflates as training proceeds.

After a brief check, I found that about 500 new tensors are created every 10 iterations.
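
One way to perform such a check is to count live CUDA tensors via Python's gc module (a generic debugging sketch, not necessarily the method used here):

# Sketch: count live CUDA tensors to spot a leak (generic PyTorch debugging pattern)
import gc
import torch

def count_cuda_tensors():
    count = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                count += 1
        except Exception:
            pass  # some objects raise on attribute access during gc inspection
    return count

# call this every N iterations and compare the results, e.g. print(step, count_cuda_tensors())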

Logs shown below:

colossalai - apex.transformer.tensor_parallel - 2022-03-20 13:54:59,556 WARNING: `fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,631 INFO: Added key: store_based_barrier_key:1 to store for rank: 0
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,631 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,631 INFO: Added key: store_based_barrier_key:2 to store for rank: 0
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,631 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,631 INFO: Added key: store_based_barrier_key:3 to store for rank: 0
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Added key: store_based_barrier_key:4 to store for rank: 0
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 1 nodes.
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Added key: store_based_barrier_key:5 to store for rank: 0
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5 with 1 nodes.
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Added key: store_based_barrier_key:6 to store for rank: 0
colossalai - torch.distributed.distributed_c10d - 2022-03-20 13:54:59,632 INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6 with 1 nodes.
colossalai - colossalai - 2022-03-20 13:54:59,634 INFO: process rank 0 is bound to device 0
colossalai - colossalai - 2022-03-20 13:54:59,635 INFO: initialized seed on rank 0, numpy: 1234, python random: 1234, ParallelMode.DATA: 1234, ParallelMode.TENSOR: 1234,the default parallel seed is ParallelMode.DATA.
colossalai - colossalai - 2022-03-20 13:54:59,635 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
> building BertWordPieceLowerCase tokenizer ...
 > padded vocab (size: 30524) with 68 dummy tokens (new size: 30592)
colossalai - colossalai - 2022-03-20 13:54:59,658 INFO: > building train, validation, and test datasets ...
colossalai - colossalai - 2022-03-20 13:54:59,658 INFO:  > datasets target sizes (minimum size):
colossalai - colossalai - 2022-03-20 13:54:59,658 INFO:     train:      32000000
colossalai - colossalai - 2022-03-20 13:54:59,658 INFO:     validation: 32000320
colossalai - colossalai - 2022-03-20 13:54:59,658 INFO:     test:       320
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
colossalai - colossalai - 2022-03-20 13:54:59,665 INFO: 
 > building dataset index ...
colossalai - colossalai - 2022-03-20 13:54:59,665 INFO: 
 > finished creating indexed dataset in 0.006665 seconds
colossalai - colossalai - 2022-03-20 13:54:59,665 INFO: 
 > indexed dataset stats:
    number of documents: 6409572
    number of sentences: 128198975
colossalai - colossalai - 2022-03-20 13:54:59,665 INFO: 
 > dataset split:
colossalai - colossalai - 2022-03-20 13:54:59,665 INFO: 
    train:
     document indices in [0, 6082683) total of 6082683 documents
     sentence indices in [0, 123690635) total of 123690635 sentences
colossalai - colossalai - 2022-03-20 13:54:59,665 INFO: 
    validation:
     document indices in [6082683, 6403162) total of 320479 documents
     sentence indices in [123690635, 128115537) total of 4424902 sentences
colossalai - colossalai - 2022-03-20 13:54:59,667 INFO: 
    test:
     document indices in [6403162, 6409572) total of 6410 documents
     sentence indices in [128115537, 128198975) total of 83438 sentences
colossalai - colossalai - 2022-03-20 13:55:03,351 INFO: 
 > loading indexed mapping from /work/workspace/MOE-ColossalAI/Megatron-LM/my-bert_text_sentence_train_indexmap_32000000mns_125msl_0.10ssp_1234s.npy
    loaded indexed file in 0.014 seconds
    total number of samples: 50551630
colossalai - colossalai - 2022-03-20 13:55:03,363 INFO: 
 > loading indexed mapping from /work/workspace/MOE-ColossalAI/Megatron-LM/my-bert_text_sentence_valid_indexmap_32000320mns_125msl_0.10ssp_1234s.npy
    loaded indexed file in 0.011 seconds
    total number of samples: 32197223
colossalai - colossalai - 2022-03-20 13:55:03,365 INFO: 
 > loading indexed mapping from /work/workspace/MOE-ColossalAI/Megatron-LM/my-bert_text_sentence_test_indexmap_320mns_125msl_0.10ssp_1234s.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 17447
colossalai - colossalai - 2022-03-20 13:55:03,742 INFO: Dataloaders are built
colossalai - colossalai - 2022-03-20 13:55:07,958 INFO: Model is built with softmax in fp32 = True
colossalai - colossalai - 2022-03-20 13:55:07,958 INFO: This model has 38392960 parameters
colossalai - colossalai - 2022-03-20 13:55:07,958 INFO: Criterion is built
colossalai - colossalai - 2022-03-20 13:55:07,958 INFO: without weight decay param: 22, with weight decay param: 11
colossalai - colossalai - 2022-03-20 13:55:07,960 INFO: Optimizer is built
colossalai - colossalai - 2022-03-20 13:55:07,960 INFO: LR Scheduler is built with 9900 warmup steps and 990000 decay steps
colossalai - colossalai - 2022-03-20 13:55:07,962 INFO: 
========== Your Config ========
{'ADD_BINARY_HEAD': False,
 'DATA_PATH': '/work/workspace/MOE-ColossalAI/Megatron-LM/my-bert_text_sentence',
 'DECAY_ITERS': 990000,
 'DEPTH': 2,
 'EVAL_INTERVAL': 10,
 'EVAL_ITERS': 10,
 'GLOBAL_BATCH_SIZE': 32,
 'HIDDEN_SIZE': 768,
 'LR': 0.0001,
 'MIN_LR': 1e-05,
 'NUM_ATTENTION_HEADS': 2,
 'NUM_MICRO_BATCHES': 4,
 'SEED': 1234,
 'SEQ_LENGTH': 128,
 'TRAIN_ITERS': 1000000,
 'VOCAB_FILE_PATH': '/work/workspace/MOE-ColossalAI/vocab/bert-large-uncased-vocab.txt',
 'WARMUP_FRACTION': 0.01,
 'WEIGHT_DECAY': 0.01,
 'clip_grad_norm': 1.0,
 'fp16': {'log_num_zeros_in_grad': True,
          'mode': <AMP_TYPE.NAIVE: 'naive'>,
          'verbose': True},
 'gradient_handler': [{'type': 'SequenceParallelGradientHandler'}],
 'parallel': {'pipeline': 1, 'tensor': {'mode': 'sequence', 'size': 1}}}
================================

colossalai - colossalai - 2022-03-20 13:55:07,962 INFO: cuDNN benchmark = True, deterministic = False
colossalai - colossalai - 2022-03-20 13:55:07,985 INFO: 
=========  FP16 Optimizer Config =========
Optimizer: FusedAdam
clip_grad = 1.0
log_num_zeros_in_grad = True
initial_scale = 4294967296
min_scale = 1
growth_factor = 2
backoff_factor = 0.5
growth_interval = 1000
hysteresis = 2
==========================================
colossalai - colossalai - 2022-03-20 13:55:09,886 INFO: overflow occurs, loss scale is adjusted to tensor([4.2950e+09], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:09,915 INFO: overflow occurs, loss scale is adjusted to tensor([2.1475e+09], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:09,941 INFO: overflow occurs, loss scale is adjusted to tensor([1.0737e+09], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:09,966 INFO: overflow occurs, loss scale is adjusted to tensor([5.3687e+08], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:09,991 INFO: overflow occurs, loss scale is adjusted to tensor([2.6844e+08], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:10,184 INFO: overflow occurs, loss scale is adjusted to tensor([1.3422e+08], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:10,339 INFO: overflow occurs, loss scale is adjusted to tensor([67108864.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:10,363 INFO: overflow occurs, loss scale is adjusted to tensor([33554432.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:10,476 INFO: overflow occurs, loss scale is adjusted to tensor([16777216.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:10,504 INFO: overflow occurs, loss scale is adjusted to tensor([8388608.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:11,092 INFO: Step 10 / 1000000 | Train Loss: 10.504 | Eval Loss: 10.486 | Grad Norm: None | Skipped Iterations: 10 | Loss Scale: 8388608.0| Learning rate: 0.0 | Num Zero in Grad: None | train-iterations: 251.09575
colossalai - colossalai - 2022-03-20 13:55:11,117 INFO: overflow occurs, loss scale is adjusted to tensor([4194304.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:11,141 INFO: overflow occurs, loss scale is adjusted to tensor([2097152.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:11,166 INFO: overflow occurs, loss scale is adjusted to tensor([1048576.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:11,190 INFO: overflow occurs, loss scale is adjusted to tensor([524288.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:11,434 INFO: overflow occurs, loss scale is adjusted to tensor([262144.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:12,321 INFO: Step 20 / 1000000 | Train Loss: 10.501 | Eval Loss: 10.491 | Grad Norm: 8.805602073669434 | Skipped Iterations: 5 | Loss Scale: 262144.0| Learning rate: 5.0505050505050506e-08 | Num Zero in Grad: 843 | train-iterations: 66.22717
colossalai - colossalai - 2022-03-20 13:55:12,664 INFO: overflow occurs, loss scale is adjusted to tensor([131072.], device='cuda:0')
colossalai - colossalai - 2022-03-20 13:55:13,665 INFO: Step 30 / 1000000 | Train Loss: 10.484 | Eval Loss: 10.491 | Grad Norm: 9.763409614562988 | Skipped Iterations: 1 | Loss Scale: 131072.0| Learning rate: 1.4141414141414141e-07 | Num Zero in Grad: 822 | train-iterations: 66.48097
colossalai - colossalai - 2022-03-20 13:55:15,080 INFO: Step 40 / 1000000 | Train Loss: 10.494 | Eval Loss: 10.481 | Grad Norm: 8.260552406311035 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 2.4242424242424244e-07 | Num Zero in Grad: 845 | train-iterations: 68.68193
colossalai - colossalai - 2022-03-20 13:55:16,482 INFO: Step 50 / 1000000 | Train Loss: 10.471 | Eval Loss: 10.438 | Grad Norm: 8.306716918945312 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 3.4343434343434344e-07 | Num Zero in Grad: 840 | train-iterations: 71.22791
colossalai - colossalai - 2022-03-20 13:55:17,696 INFO: Step 60 / 1000000 | Train Loss: 10.427 | Eval Loss: 10.367 | Grad Norm: 9.02975845336914 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 4.444444444444445e-07 | Num Zero in Grad: 833 | train-iterations: 66.13533
colossalai - colossalai - 2022-03-20 13:55:18,963 INFO: Step 70 / 1000000 | Train Loss: 10.387 | Eval Loss: 10.293 | Grad Norm: 9.127448081970215 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 5.454545454545455e-07 | Num Zero in Grad: 839 | train-iterations: 63.39867
colossalai - colossalai - 2022-03-20 13:55:20,199 INFO: Step 80 / 1000000 | Train Loss: 10.304 | Eval Loss: 10.242 | Grad Norm: 9.513465881347656 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 6.464646464646465e-07 | Num Zero in Grad: 824 | train-iterations: 64.48858
colossalai - colossalai - 2022-03-20 13:55:21,562 INFO: Step 90 / 1000000 | Train Loss: 10.26 | Eval Loss: 10.16 | Grad Norm: 8.393514633178711 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 7.474747474747475e-07 | Num Zero in Grad: 834 | train-iterations: 71.18421
colossalai - colossalai - 2022-03-20 13:55:22,871 INFO: Step 100 / 1000000 | Train Loss: 10.176 | Eval Loss: 10.093 | Grad Norm: 8.5743408203125 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 8.484848484848486e-07 | Num Zero in Grad: 838 | train-iterations: 64.65802
colossalai - colossalai - 2022-03-20 13:55:24,213 INFO: Step 110 / 1000000 | Train Loss: 10.094 | Eval Loss: 9.9949 | Grad Norm: 8.43807315826416 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 9.494949494949495e-07 | Num Zero in Grad: 835 | train-iterations: 64.54813
colossalai - colossalai - 2022-03-20 13:55:25,504 INFO: Step 120 / 1000000 | Train Loss: 10.004 | Eval Loss: 9.9296 | Grad Norm: 7.786318778991699 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 1.0505050505050506e-06 | Num Zero in Grad: 844 | train-iterations: 63.54368
colossalai - colossalai - 2022-03-20 13:55:26,700 INFO: Step 130 / 1000000 | Train Loss: 9.9541 | Eval Loss: 9.8156 | Grad Norm: 7.1489057540893555 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 1.1515151515151516e-06 | Num Zero in Grad: 841 | train-iterations: 62.47182
colossalai - colossalai - 2022-03-20 13:55:28,004 INFO: Step 140 / 1000000 | Train Loss: 9.868 | Eval Loss: 9.783 | Grad Norm: 6.379231929779053 | Skipped Iterations: 0 | Loss Scale: 131072.0| Learning rate: 1.2525252525252527e-06 | Num Zero in Grad: 844 | train-iterations: 64.96000
Traceback (most recent call last):
  File "/work/workspace/MOE-ColossalAI/sequene_parallel/train.py", line 267, in <module>
    main()
  File "/work/workspace/MOE-ColossalAI/sequene_parallel/train.py", line 197, in main
    lm_loss, sop_output = engine(tokens, padding_mask, types, lm_labels)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/engine/_base_engine.py", line 127, in __call__
    return self.model(*args, **kwargs)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 74, in forward
    out = self.model(*args, **kwargs)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/workspace/MOE-ColossalAI/sequene_parallel/model/bert.py", line 117, in forward
    return self.head(output, self.embedding.word_embedding_weight, lm_labels)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/workspace/MOE-ColossalAI/sequene_parallel/model/layers/head.py", line 77, in forward
    lm_loss = self.lm_head(hidden_states, word_embeddings_weight, lm_labels)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/workspace/MOE-ColossalAI/sequene_parallel/model/layers/head.py", line 39, in forward
    output = F.linear(hidden_states, word_embeddings_weight, self.bias)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 240.00 MiB (GPU 0; 39.59 GiB total capacity; 36.40 GiB already allocated; 204.19 MiB free; 37.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 84857) of binary: /work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/bin/python
Traceback (most recent call last):
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/work/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-20_13:55:33
  host      : inspur-4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 84857)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
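Following the allocator hint at the end of the CUDA OOM message, one mitigation to try before re-running (a sketch only; lowering GLOBAL_BATCH_SIZE or SEQ_LENGTH in the config may still be needed):

    # Sketch: set the allocator option suggested by the OOM message before any CUDA
    # allocation happens, e.g. at the very top of train.py, or export the variable in
    # the shell that launches torchrun. The 128 MB threshold is only a starting point.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

    import torch  # import torch only after the allocator config is in place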

Environment

Colossal-AI version: 0.0.2
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.17

Python version: 3.9.6 (default, Aug 18 2021, 19:38:01) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1062.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-PCIE-40GB
GPU 1: A100-PCIE-40GB

Nvidia driver version: 460.27.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] numpy==1.20.3
[pip3] pytorch-lightning==1.1.4
[pip3] pytorch-nlp==0.5.0
[pip3] segmentation-models-pytorch==0.2.0
[pip3] torch==1.10.1
[pip3] torchaudio==0.10.1
[pip3] torchio==0.18.50
[pip3] torchmetrics==0.5.0
[pip3] torchtext==0.11.1
[pip3] torchvision==0.11.2
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 11.3.1 h2bc3f7f_2 defaults
[conda] efficientnet-pytorch 0.6.3 pypi_0 pypi
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520 defaults
[conda] mkl-service 2.4.0 py39h7f8727e_0 defaults
[conda] mkl_fft 1.3.0 py39h42c9631_2 defaults
[conda] mkl_random 1.2.2 py39h51133e4_0 defaults
[conda] numpy 1.20.3 py39hf144106_0 defaults
[conda] numpy-base 1.20.3 py39h74d4b33_0 defaults
[conda] pytorch 1.10.1 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-lightning 1.1.4 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-nlp 0.5.0 pypi_0 pypi
[conda] segmentation-models-pytorch 0.2.0 pypi_0 pypi
[conda] torch 1.9.0 pypi_0 pypi
[conda] torchaudio 0.10.1 py39_cu113 pytorch
[conda] torchio 0.18.50 pypi_0 pypi
[conda] torchmetrics 0.5.0 pypi_0 pypi
[conda] torchtext 0.11.1 pypi_0 pypi
[conda] torchvision 0.11.2 py39_cu113 pytorch

[Compatibility] Running OPT using PyTorch 1.12 and Gemini placement_policy = 'cuda' failed

🐛 Describe the bug

Just running examples/language/opt/run_clm.py reproduces the error.
The program crashes with no error information.
After I replace placement_policy with 'cuda', it is OK:

    placement_policy = 'cuda'
    chunk_manager = ChunkManager(chunk_size, process_group=pg,
                                 enable_distributed_storage=True,
                                 init_device=GeminiManager.get_default_device(placement_policy))
    gemini_manager = GeminiManager(placement_policy, chunk_manager)
    model = ZeroDDP(model, gemini_manager)
    logger.info(f'{model.__class__.__name__} has been created', ranks=[0])

Environment

colossalai 0.1.8+torch1.12cu11.3

Failed to run gpt2_3d example

Dear developers,

I am trying to run the gpt2_3d example, but it fails. It looks like the model did not receive the correct batch size. I hope to get some advice.

Thanks.

Error

File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d

assert dim_size % world_size == 0, \

AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly.

Command

torchrun --standalone --nproc_per_node=8 train_gpt.py --config=gpt2_configs/gpt2_3d.py --from_torch

Environment

  • colossalai 0.1.2
  • nvcc 11.3.109
  • python 3.8.13
  • pytorch 1.11.0
  • GPUs: 40G A100 * 8
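A plausible cause, reading the assertion above: with tensor parallel size 8 the 3D cube dimension is 2, and by the time the input reaches split_tensor_3d the per-rank batch dimension has already shrunk to 1, so it cannot be split again by the weight-parallel group of size 2. A workaround to try (unverified) is to raise BATCH_SIZE in gpt2_configs/gpt2_3d.py so the batch stays divisible at every split, e.g.:

    # Hypothetical tweak to gpt2_configs/gpt2_3d.py: keep the batch dimension
    # divisible by the cube dimension (2) after every 3D split, e.g. a multiple of 8.
    # Other entries of the example config (gpt2_small, loss, model, optimizer) stay unchanged.
    from colossalai.amp import AMP_TYPE

    BATCH_SIZE = 8            # was 4; too small to survive the repeated batch splits
    SEQ_LEN = 1024
    NUM_EPOCHS = 60
    TENSOR_PARALLEL = 8

    fp16 = dict(mode=AMP_TYPE.NAIVE)
    parallel = dict(pipeline=1, tensor=dict(mode='3d', size=TENSOR_PARALLEL))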

Error details

$ torchrun --standalone --nproc_per_node=8 ./train_gpt.py --config=./gpt2_configs/gpt2_3d.py  --from_torch
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
                    INFO     colossalai - colossalai - INFO: process rank 2 is bound to device 2
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1026,the default parallel seed is
                             ParallelMode.DATA.
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
                    INFO     colossalai - colossalai - INFO: process rank 3 is bound to device 3
                    INFO     colossalai - colossalai - INFO: process rank 7 is bound to device 7
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:509 set_device
                    INFO     colossalai - colossalai - INFO: process rank 4 is bound to device 4
                    INFO     colossalai - colossalai - INFO: process rank 5 is bound to device 5
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: process rank 6 is bound to device 6
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1027,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 7, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1031,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 4, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1028,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 5, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1029,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 6, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1030,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:109 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1,                             pipeline parallel size: 1, tensor parallel size: 8
                    INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:45 main
                    INFO     colossalai - colossalai - INFO: Build data loader
                    INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:54 main
                    INFO     colossalai - colossalai - INFO: Build model
[05/01/22 10:54:01] INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:84 main
                    INFO     colossalai - colossalai - INFO: Build optimizer
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:240 initialize
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    INFO     colossalai - colossalai - INFO:
                             ========== Your Config ========
                             {'BATCH_SIZE': 4,
                              'NUM_EPOCHS': 60,
                              'SEQ_LEN': 1024,
                              'TENSOR_PARALLEL': 8,
                              'fp16': {'mode': <AMP_TYPE.NAIVE: 'naive'>},
                              'gpt2_small': <function gpt2_small at 0x7f32a53354c0>,
                              'loss': {'type': <class 'model_zoo.gpt.gpt.GPTLMLoss'>},
                              'model': {'checkpoint': True},
                              'optimizer': {'lr': 0.00015, 'weight_decay': 0.01},
                              'parallel': {'pipeline': 1, 'tensor': {'mode': '3d', 'size': 8}}}
                             ================================

                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:252 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    INFO     colossalai - colossalai - INFO: cuDNN benchmark = True, deterministic = False
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                    WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:02] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:02] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:409 initialize
                    WARNING  colossalai - colossalai - WARNING: No PyTorch DDP or gradient handler is set up, please make
                             sure you do not need to all-reduce the gradients after a training step.
                    INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:98 main
                    INFO     colossalai - colossalai - INFO: Init done, global batch size = 4
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LossHook for training, priority = 0
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LRSchedulerHook for training, priority = 1
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LogMetricByEpochHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using ThroughputHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LogMetricByStepHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LogMemoryByEpochHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:319 fit
                    INFO     colossalai - colossalai - INFO: Lower value means higher priority for calling hook function
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossal                             ai/utils/memory_utils/memory_monitor.py:63 report_memory_usage
                    INFO     colossalai - colossalai - INFO: Before-train: GPU: allocated 91.75 MB, max allocated 92.3 MB,
                             cached: 96.0 MB, max cached: 96.0 MB
[Epoch 0 / Train]:   0%|                                                                             | 0/5 [00:00<?, ?it/s]
(the same AssertionError was raised concurrently on all eight ranks; one clean traceback follows)
Traceback (most recent call last):
  File "./train_gpt_0.1.2.py", line 132, in <module>
    main()
  File "./train_gpt_0.1.2.py", line 120, in main
    trainer.fit(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
    self._train_epoch(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
    logits, label, loss = self.engine.execute_schedule(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
    return engine(**inputs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
    return self.model(*args, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
    out = self.model(*args, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
    x = self.embed(input_ids)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
    x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
    return self._forward_func(*args)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
    input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
    assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f0bb282b1bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f0bf06ba6ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f0bf06bccd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f0bf06bdf65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7f0c48562039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f0c6ecd8ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f0c6ea019fd in /lib64/libc.so.6)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 184844) of binary: /home/asc/.conda/envs/nlp/bin/python
Traceback (most recent call last):
  File "/home/asc/.conda/envs/nlp/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_0.1.2.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 184845)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184845
[2]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 184846)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184846
[3]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 184847)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184847
[4]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 184848)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184848
[5]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 184849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 184850)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 184851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 184844)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Personal Dataset Preprocessing

If I want to train the GPT-2 model on my own dataset, a TXT file with one sentence per line, how should I modify the data preprocessing code so that it matches the expected format and runs normally?
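
For reference, a minimal sketch (not the repository's actual preprocessing code) of how a one-sentence-per-line TXT corpus could be wrapped into a PyTorch Dataset for GPT-2 training. The file name corpus.txt, the sequence length, and the (data, label) return convention are assumptions and should be adjusted to whatever the training script actually consumes.

from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

SEQ_LEN = 1024  # placeholder; match the sequence length in your GPT-2 config

class TxtLineDataset(Dataset):
    """One training sample per non-empty line of a plain-text file."""

    def __init__(self, path, seq_len=SEQ_LEN):
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        with open(path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]
        self.seq_len = seq_len

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.lines[idx],
                             truncation=True,
                             max_length=self.seq_len,
                             padding="max_length",
                             return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        # for causal language modelling the labels are the input ids themselves;
        # adapt the return format to whatever the train script expects
        return {"input_ids": input_ids, "attention_mask": attention_mask}, input_ids

dataset = TxtLineDataset("corpus.txt")  # corpus.txt is a placeholder path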

Problem with saving model state dict

🐛 Describe the bug

model_state = model.state_dict

The code on this line should be model_state = model.state_dict(). However, even after fixing this, the saved state dict is all None.

Traceback (most recent call last):
  File "generate.py", line 238, in <module>
    main()
  File "generate.py", line 211, in main
    model = OPTForCausalLM.from_pretrained(args.model_path)
  File "/mnt/datadisk0/ouyangliqi/miniconda3/envs/colossalai/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2119, in from_pretrained
    model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
  File "/mnt/datadisk0/ouyangliqi/miniconda3/envs/colossalai/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2376, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for OPTForCausalLM:
    size mismatch for model.decoder.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50272, 4096]).
    size mismatch for model.decoder.embed_positions.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2050, 4096]).
    size mismatch for model.decoder.final_layer_norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.final_layer_norm.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
    size mismatch for model.decoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
    size mismatch for model.decoder.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
    size mismatch for model.decoder.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
    size mismatch for model.decoder.layers.0.self_attn.out_proj.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.self_attn_layer_norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.self_attn_layer_norm.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.fc1.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16384, 4096]).
    size mismatch for model.decoder.layers.0.fc1.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16384]).
    size mismatch for model.decoder.layers.0.fc2.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 16384]).
    size mismatch for model.decoder.layers.0.fc2.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for model.decoder.layers.0.final_layer_norm.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096])....

Environment

CUDA: 11.3
Pytorch: 1.12
transformers: 4.21.0.dev0
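
For reference, a minimal sketch of the intended call on a plain, non-sharded model: state_dict is a method and must be called, and only rank 0 needs to write the file. Note that if the parameters are sharded (for example under ZeRO), they have to be gathered by the framework's own checkpointing utility before saving; otherwise the checkpoint holds empty tensors, which is consistent with the torch.Size([0]) shapes in the traceback above.

import torch
import torch.distributed as dist

def save_model(model, path="model.pt"):
    # path is a placeholder; state_dict is a method, so the parentheses are required
    model_state = model.state_dict()
    # write from a single rank to avoid concurrent writes
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model_state, path)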

Running the GPT-2 example raises RuntimeError: Could not find 'SLURM_PROCID'. Is a SLURM environment required?

🐛 Describe the bug

I used the 0.1.7 image provided on Docker Hub, but running the GPT example raises RuntimeError: Could not find 'SLURM_PROCID'; the same happens with the 0.1.8 image.
[screenshots of the error output]
Here is my run script:
[screenshot of the run script]
The same problem occurs with any of the gpt2_configs configurations I switch to.

Environment

docker pull hpcaitech/colossalai:0.1.7 (also tried 0.1.8)
pip install transformers
pip install titans

8x A100 GPUs
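
The 'Could not find SLURM_PROCID' error normally means the script initializes the distributed environment through the SLURM launcher, which reads SLURM environment variables. Below is a rough sketch of the torch-style initialization from the 0.1.x API; it assumes the script exposes a --config argument and is started with torchrun or colossalai run, in which case no SLURM installation is needed. Treat the exact calls as assumptions to be checked against the installed version.

import colossalai

# parse the standard arguments (--config, --host, --port, ...)
args = colossalai.get_default_parser().parse_args()

# read RANK / LOCAL_RANK / WORLD_SIZE set by torchrun or colossalai run,
# instead of SLURM_PROCID set by a SLURM scheduler
colossalai.launch_from_torch(config=args.config)

# ... then build the model and dataloaders and call colossalai.initialize(...) as in the example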

ZeRO 2 configuration example

📚 The doc issue

The gpt2 zero2 example was removed after the new API was introduced.
However, I don't know how to avoid offloading the model.
I have tried removing model_config and setting offload_config to None, gpu, and cuda, but none of these work.
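
For what it's worth, a rough sketch of the legacy-style ZeRO configuration from the 0.1.x series is shown below; the exact keys and the import path are assumptions and should be verified against the installed version. Keeping model_config with tensor_placement_policy set to 'cuda' is intended to keep parameters on the GPU rather than offloading them to CPU.

from colossalai.zero.shard_utils import TensorShardStrategy  # import path assumed, 0.1.x era

zero = dict(
    model_config=dict(
        shard_strategy=TensorShardStrategy(),
        tensor_placement_policy='cuda',   # keep parameters on GPU, i.e. no CPU offload
    ),
    optimizer_config=dict(),
)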

Vision Transformer cifar10 bug

🐛 Describe the bug

When I run a ViT experiment with the following command

node=76
prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node"
$prefix colossalai run --nproc_per_node 4  train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py --host=10.51.2.$node

I got

(the same output is printed by both pipeline ranks)
tensor shape 128
Traceback (most recent call last):
  File "train_with_cifar10.py", line 122, in <module>
    main()
  File "train_with_cifar10.py", line 116, in main
    engine.execute_schedule(data_iter, return_output_label=False)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 303, in forward_backward_step
    input_tensor = comm.recv_forward(ft_shape,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 194, in recv_forward
    input_tensor, _ = _communicate(recv_prev=True,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 119, in _communicate
    tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(recv_prev_shape, dtype,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 49, in create_recv_buffer_with_shapes
    recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 30, in _get_tensor_shape
    tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
TypeError: reduce() arg 2 must support iteration

Environment

I installed ColossalAI via

pip install colossalai==0.1.6+torch1.10cu10.2 -f https://release.colossalai.org

Other environment information is collected as follows:

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 5.3.0
Clang version: Could not collect
CMake version: version 3.19.3
Libc version: glibc-2.17

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] colossalai==0.1.6+torch1.10cu10.2
[pip3] numpy==1.22.4
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] colossalai                0.1.6+torch1.10cu10.2          pypi_0    pypi
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi

RuntimeError: CUDA out of memory with cifar10 in data_parallel example

🐛 Describe the bug

I am trying to run train_with_cifar10.py from https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

My command:

colossalai run --nproc_per_node 2 train_with_cifar10.py --config config.py

I have 7 GPUs, each with 16 GB of memory.

The error traceback looks like this:

...
RuntimeError: CUDA out of memory. Tried to allocate 296.00 MiB (GPU 1; 15.78 GiB total capacity; 13.75 GiB already allocated; 232.19 MiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "train_with_cifar10.py", line 71, in <module>
    main()
  File "train_with_cifar10.py", line 62, in main
    trainer.fit(train_dataloader=train_dataloader,
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 321, in fit
    self._train_epoch(
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 181, in _train_epoch
    logits, label, loss = self.engine.execute_schedule(
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 201, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 78, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 109, in _call_engine
    return engine(inputs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 186, in __call__
    return self.model(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 79, in forward
    return self.model(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 465, in forward
    x = self.forward_features(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 454, in forward_features
    x = self.blocks(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 243, in forward
    x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/layers/mlp.py", line 29, in forward
    x = self.drop1(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
...

Environment

>>> import colossalai
>>> colossalai.__version__
'0.1.9'
>>> import torch
>>> torch.__version__
'1.12.1+cu113'

GPU:

$nvidia-smi
Wed Sep 28 16:54:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:67:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:69:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |   2360MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    56W / 300W |   4172MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:8F:00.0 Off |                    0 |
| N/A   32C    P0    54W / 300W |   7307MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
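
Not a fix for the example itself, but memory pressure of this kind can usually be reduced by lowering the per-GPU batch size and enabling gradient accumulation and mixed precision in config.py. The sketch below assumes the example reads BATCH_SIZE and NUM_EPOCHS from its config; gradient_accumulation and fp16 are standard Colossal-AI config keys in the 0.1.x series, but check the values against your setup.

from colossalai.amp import AMP_TYPE

BATCH_SIZE = 64            # per-GPU batch size, smaller than the default (assumed example constant)
NUM_EPOCHS = 2
gradient_accumulation = 4  # effective per-GPU batch size = BATCH_SIZE * 4
fp16 = dict(mode=AMP_TYPE.TORCH)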

[RFC] Merge train_with_engine and train_with_trainer

Describe the feature

In most examples there are two training scripts, train_with_engine and train_with_trainer. The code in these two files is highly redundant, so we should merge them into a single file.
We can add a flag that lets the user choose whether to run with the engine or the trainer. Since the engine offers better portability from user code to Colossal-AI-style code, the engine should remain the default.

Python Exception when running BERT Examples

🐛 Describe the bug

When running the BERT sequence parallel example by following its README, an exception occurred:

  File "train.py", line 240, in main
    grad_norm = grad_norm.item()
AttributeError: 'float' object has no attribute 'item'

After commenting out line 240, it works.
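
An alternative to commenting the line out: depending on the gradient clipping path, grad_norm may come back as either a Python float or a zero-dimensional tensor, so converting it conditionally keeps the logging intact in both cases. The helper name below is purely illustrative.

import torch

def grad_norm_to_float(grad_norm):
    # grad_norm may be a plain float or a 0-dim tensor depending on the code path
    return grad_norm.item() if torch.is_tensor(grad_norm) else float(grad_norm)

In the example's train.py, line 240 would then read grad_norm = grad_norm_to_float(grad_norm).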

Environment

No response
