hpcaitech / colossalai

Making large AI models cheaper, faster and more accessible

Home Page: https://www.colossalai.org

License: Apache License 2.0

Python 94.07% C++ 1.53% C 0.09% Cuda 1.26% Dockerfile 0.04% Shell 0.56% HTML 2.44%
ai big-model data-parallelism deep-learning distributed-computing foundation-models heterogeneous-training hpc inference large-scale model-parallelism pipeline-parallelism

colossalai's People

Contributors

1saa, binmakeswell, camille7777, chengeharrison, cjhha1, csric, cwher, cypher30, digger-yu, fazziekey, feifeibear, flybird11111, foolplayer, frankleeeee, fridge003, github-actions[bot], gy-lu, ht-zhou, klhhhhh, kurisusnowdeng, lstm-kirigaya, maruyamaaya, oahzxl, super-dainiu, sze-qq, tongli3701, ver217, wesley-jzy, yuliangliu0306, zengzh95


colossalai's Issues

Need a fine-tuning example

Describe the feature

Few users are able to train large models directly from scratch. We need to provide a fine-tuning example.

For example, how do you load pre-trained parameters into a Colossal-AI model and fine-tune it efficiently with Colossal-AI's other features? Performance can be optimized incrementally in subsequent updates, but this feature is practical and important.
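A minimal sketch of what such an example could look like. The stand-in model, dummy data, and the 'pretrained.pt' checkpoint path are hypothetical; the launch/initialize calls follow the other examples on this page.

import torch
import colossalai
from torch.utils.data import TensorDataset
from colossalai.utils import get_dataloader

colossalai.launch_from_torch(config='./config.py')  # hypothetical config path

# stand-in for the real architecture and fine-tuning data
model = torch.nn.Linear(1024, 10)
dataset = TensorDataset(torch.randn(64, 1024), torch.randint(10, (64,)))
train_dataloader = get_dataloader(dataset=dataset, batch_size=8, shuffle=True)

# load pre-trained weights; 'pretrained.pt' is a hypothetical checkpoint file
state_dict = torch.load('pretrained.pt', map_location='cpu')
model.load_state_dict(state_dict, strict=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

engine, train_dataloader, _, _ = colossalai.initialize(model=model,
                                                       optimizer=optimizer,
                                                       criterion=criterion,
                                                       train_dataloader=train_dataloader)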

[BUG] Zombie processes with MPI launch

Describe the bug
If the parallel training is launched via MPI, zombie processes are not killed upon keyboard interruption or exceptions.

To Reproduce
Initialize the parallel context with MPI and launch more than one process (e.g., mpirun -np 2 train.py), then interrupt the training with Ctrl + C.

Expected behavior
Ranks > 0 keep running and taking up memory.
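A possible user-side workaround until the library handles this (a sketch, not the project's fix): on interruption, kill the entire local process group so that ranks > 0 die with rank 0.

import os
import signal
import sys

def _terminate(signum, frame):
    # restore the default handler first so the killpg below does not re-enter us
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    # send SIGTERM to the whole process group so sibling ranks do not linger
    os.killpg(os.getpgid(0), signal.SIGTERM)
    sys.exit(1)

signal.signal(signal.SIGINT, _terminate)
signal.signal(signal.SIGTERM, _terminate)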

The performance of model parallelism (MP) is not good

Hello developers.

I found that the performance of the provided MP is not good. I compared it with PatrickStar and DeepSpeed. Can you check it with me? See MR #115.
BTW: I strongly recommend adding TFLOPS as a performance indicator.

Platform: one SuperPod node with 8x A100 GPUs and 1 TB of CPU memory. BS = batch size, pstar = PatrickStar, deeps = DeepSpeed.
Entries are throughput (batches/elapsed time); the Xd-Xmp columns use Colossal-AI.

Model Scale | global BS | 1d-4mp | 1d-8mp | 2d-4mp | 2d-8mp | 3d-4mp | 2.5d-4mp | pstar | deeps | deeps-mp4 | deeps-mp8
----------- | --------- | ------ | ------ | ------ | ------ | ------ | -------- | ----- | ----- | --------- | ---------
4B          | 8         | 7.61   | 7.62   | 9.89   | 8.47   | failed | 10.31    | 8.78  | 1.15  | 1.26      | 1.26
4B          | 16        | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 16.67 | 2.26  | 2.42      | 2.36
4B          | 128       | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 28.39 | 12.51 | 10.80     | OOM
10B         | 2         | OOM    | 3.62   | OOM    | failed | OOM    | OOM      | -     | -     | 0.15      | 0.15
10B         | 4         | OOM    | 4.66   | OOM    | OOM    | OOM    | OOM      | -     | -     | 0.30      | 0.30
10B         | 128       | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 13.43 | OOM   | 6.31      | 5.73
  1. As you can see, Colossal-AI's computing efficiency is the lowest among the three solutions at single-node scale. However, Colossal-AI is very competitive at the same batch size; unfortunately, the achievable batch size severely limits Colossal-AI's performance.
  2. 2.5d-4mp is superior at 4B with BS 8, but 1d-8mp generalizes better.
  3. Heterogeneous training (like PatrickStar and DeepSpeed) may be a better solution than a complex MP strategy at single-node scale.

Facing an error using a CNN on MNIST

I was using Colossal-AI to apply a CNN to the MNIST dataset, but the following error occurs and I am not able to resolve it:

[Epoch 0 train]: 0%| | 0/6000 [00:00<?, ?it/s]

TypeError Traceback (most recent call last)
in ()
6 max_epochs = num_epochs,
7 display_progress = True,
----> 8 test_interval = test_interval
9 )

4 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() takes 2 positional arguments but 11 were given

I have attached a link to the Jupyter notebook:
https://colab.research.google.com/drive/15Yiv7EBAc6eWV14aGEl0GP3-AMZLSr06?usp=sharing
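One plausible cause (an assumption from the trace, not a confirmed diagnosis): the trainer's schedule appears to unpack the fetched batch into the model's forward call, so forward must accept exactly one tensor argument besides self, and the dataset must yield (image, label) pairs. A minimal MNIST CNN with that signature:

import torch.nn as nn

class MnistCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                      nn.ReLU(),
                                      nn.MaxPool2d(2))
        self.classifier = nn.Linear(16 * 14 * 14, 10)  # 28x28 input pooled to 14x14

    def forward(self, x):  # exactly one positional tensor argument
        x = self.features(x)
        return self.classifier(x.flatten(1))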

[FEATURE] How can I quickly test a HuggingFace transformer model?

Is your feature request related to a problem? Please describe.
I'm frustrated when I try to apply this project to a HuggingFace transformer model, e.g., a BERT model.

Describe the solution you'd like
I cannot find a clear doc that directs me in porting a HuggingFace model to Colossal-AI. Apparently, all of the examples are vision models, but most large-model applications are NLP scenarios.

Describe alternatives you've considered
Provide an example showing how to simply move my training process to Colossal-AI.
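A rough sketch of what such an example might look like, assuming the plain data-parallel path. BertForSequenceClassification is the standard HuggingFace class; the dummy data is a stand-in, and the loss wiring may need a thin wrapper since HF models return output objects rather than raw logits.

import torch
import colossalai
from torch.utils.data import TensorDataset
from colossalai.utils import get_dataloader
from transformers import BertForSequenceClassification

colossalai.launch_from_torch(config='./config.py')  # hypothetical data-parallel config

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# stand-in data: (input_ids, labels); a real run would use a tokenized corpus
dataset = TensorDataset(torch.randint(0, 30522, (64, 128)), torch.randint(0, 2, (64,)))
train_dataloader = get_dataloader(dataset=dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()  # may need to read output.logits in practice

engine, train_dataloader, _, _ = colossalai.initialize(model=model,
                                                       optimizer=optimizer,
                                                       criterion=criterion,
                                                       train_dataloader=train_dataloader)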

[FEATURE] Does this project support gradient checkpointing?

Activation checkpointing (a.k.a. gradient checkpointing in PyTorch, https://pytorch.org/docs/stable/checkpoint.html) is an effective technique (from my perspective, maybe the most effective one) for improving model scale. It saves activation memory footprint at the cost of recomputation. However, I did not see the technique applied in Colossal-AI.
I believe it is a model-specific optimization and should not be part of Colossal-AI's core functionality, but you should add it to the example or benchmark scripts.

See the huggingface GPT2 implementation for more details

https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/modeling_gpt2.py#L865
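For reference, a minimal sketch of activation checkpointing with stock PyTorch, independent of Colossal-AI; each block's activations are recomputed during backward instead of being stored.

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(12))

    def forward(self, x):
        for block in self.blocks:
            # activations inside `block` are recomputed in backward, not stored
            x = checkpoint(block, x)
        return x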

How to initialize Linear2D? Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer

Describe the feature

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
import colossalai
import torch
import torchvision
from colossalai.builder import *
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import Accuracy, CrossEntropyLoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.trainer import Trainer
from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
                                      LogMetricByEpochHook,
                                      LogMetricByStepHook,
                                      LogTimingByEpochHook, LossHook,
                                      LRSchedulerHook, ThroughputHook)
from colossalai.utils import MultiTimer, get_dataloader
from model_zoo.vit import vit_lite_depth7_patch4_32
from torchvision import transforms
from colossalai.nn import Linear2D
import torch.nn as nn
from delete import trainset

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '8888'
os.environ['DATA'] = 'D:/trash_can'
os.environ['CONFIG_FILE'] = 'D:/trash_can/ColossalAI-main/benchmark/cifar/configs/vit_1d.py'
os.environ['LOCAL_RANK'] = '0'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'

DATASET_PATH = str(os.environ['DATA'])


def build_cifar(batch_size):
    train_dataset = trainset()
    test_dataset = trainset()
    train_dataloader = get_dataloader(dataset=train_dataset,
                                      shuffle=True,
                                      batch_size=batch_size,
                                      num_workers=0,
                                      pin_memory=True)
    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=0, pin_memory=True)
    return train_dataloader, test_dataloader


class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x


def train_cifar():
    args = colossalai.get_default_parser().parse_args()

    colossalai.launch_from_torch(config=args.config)

    logger = get_dist_logger()
    if hasattr(gpc.config, 'LOG_PATH'):
        if gpc.get_global_rank() == 0:
            log_path = gpc.config.LOG_PATH
            if not os.path.exists(log_path):
                os.mkdir(log_path)
            logger.log_to_file(log_path)

    model = MLP_2D()

    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)

    criterion = CrossEntropyLoss(label_smoothing=0.1)

    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)

    steps_per_epoch = len(train_dataloader)

    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                                    optimizer=optimizer,
                                                                                    criterion=criterion,
                                                                                    train_dataloader=train_dataloader,
                                                                                    test_dataloader=test_dataloader,
                                                                                    lr_scheduler=lr_scheduler)

    logger.info("Engine is built", ranks=[0])

    timer = MultiTimer()

    trainer = Trainer(engine=engine, logger=logger, timer=timer)
    logger.info("Trainer is built", ranks=[0])

    hooks = [
        LogMetricByEpochHook(logger=logger),
        LogMetricByStepHook(),
        # LogTimingByEpochHook(timer=timer, logger=logger),
        # LogMemoryByEpochHook(logger=logger),
        AccuracyHook(accuracy_func=Accuracy()),
        LossHook(),
        ThroughputHook(),
        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
    ]

    logger.info("Train start", ranks=[0])
    trainer.fit(train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                hooks=hooks,
                display_progress=True,
                test_interval=1)


if __name__ == '__main__':
    train_cifar()

Traceback (most recent call last):
  File "D:/trash_can/ColossalAI-main/benchmark/cifar/train.py", line 161, in <module>
    train_cifar()
  File "D:/trash_can/ColossalAI-main/benchmark/cifar/train.py", line 112, in train_cifar
    model = MLP_2D()
  File "D:/trash_can/ColossalAI-main/benchmark/cifar/train.py", line 81, in __init__
    self.linear_1 = Linear2D(in_features=1024, out_features=16384)
  File "D:\trash_can\ColossalAI-main\colossalai\nn\layer\parallel_2d\layers.py", line 52, in __init__
    assert_summa_initialization()
  File "D:\trash_can\ColossalAI-main\colossalai\nn\layer\parallel_2d\_utils.py", line 23, in assert_summa_initialization
    'Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer'
AssertionError: Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Unable to import get_dataloader from colossalai.utils

While making an example application of Colossal-AI with the new API, I wanted to use the get_dataloader method to create a dataloader, but after running the line
from colossalai.utils import get_dataloader
I ran into the error:
ImportError: cannot import name 'get_dataloader' from 'colossalai.utils' (/usr/local/lib/python3.7/dist-packages/colossalai/utils/__init__.py)

[BUG] Python 3.10 cannot install dependencies from requirements.txt

Describe the bug

To Reproduce
Steps or code snippet to reproduce the behavior:
[screenshot of the installation instructions]
I used the conda default Python version, which is 3.10, and followed the pictured instructions to install, but it raised an error.

Expected behavior
A clear and concise description of what you expected to happen.
The command pip install -r requirements/requirements.txt cannot be used with Python 3.10; the screenshots are as follows.

Screenshots
If applicable, add screenshots to help explain your problem.
[screenshot of the pip error]

Environment (please complete the following information):

  • CUDA version:
  • cuDNN version:
  • NCCL version:
  • Python version: 3.10
  • PyTorch version:

Additional context
Add any other context about the problem here.
I found that pip install -r requirements/requirements.txt works on the latest versions of Python 3.8 and 3.9, so maybe we can add a note telling users not to install with Python 3.10, or recommend a Python version.
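One way to encode the constraint in setup.py (a suggestion, not the project's current packaging; the exact bounds are illustrative):

from setuptools import setup

setup(
    name='colossalai',
    python_requires='>=3.6,<3.10',  # 3.10 not yet supported, per this issue
    # ... remaining arguments unchanged
)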

How to run the benchmark example on each worker?

Describe the feature

To start training, use the following command to run each worker:

$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
    --rank=RANK \
    --local_rank=LOCAL_RANK \
    --host=MASTER_IP_ADDRESS \
    --port=MASTER_PORT \
    --config=CONFIG_FILE

I read this in the README file of colossalai's benchmark. Currently I have 2 GPUs on an NSCC server and am not sure how to run the program on 'each' worker. I tried this on the command line and pressed enter:
DATA='./dataset/' python train.py --world_size=2 --rank=0 --local_rank=0 --host='172.18.126.98' --port='51066' --config='./configs/vit_1d.py'

And the program got stuck, showing only:

Colossalai should be built with cuda extension to use the FP16 optimizer
warning: variables which starts with __, is a module or class declaration are omitted

Could anyone help me? Thank you!!

API on collective operations

Describe the feature

When I attempted to implement a partially tensor-parallel model (i.e., only some layers are 2D/2.5D parallel), not a single rank could get a whole tensor. It would be best to provide functions that allow easier collective operations (all_reduce, broadcast, etc.) at the user level.
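A sketch of the kind of helper requested, assuming the gpc.get_group / gpc.get_world_size accessors visible elsewhere on this page and this version's ParallelMode import path:

import torch
import torch.distributed as dist
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

def all_gather_tensor(tensor, parallel_mode=ParallelMode.TENSOR, dim=0):
    """Gather a sharded tensor from every rank of the given parallel group."""
    group = gpc.get_group(parallel_mode)
    world_size = gpc.get_world_size(parallel_mode)
    shards = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(shards, tensor.contiguous(), group=group)
    return torch.cat(shards, dim=dim)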

The startup commands in examples and benchmark are different and confusing

๐Ÿ› Describe the bug

The startup commands in the examples and the benchmark are different and confusing. They should be unified; the current form easily confuses newbies.

Not clear:
https://github.com/hpcaitech/ColossalAI-Benchmark/tree/62904e4ff2f3261c5469c773faa3d9307b6f16f4
More detail in hpcaitech/ColossalAI-Benchmark#5

The only command given is for more than 64 GPUs using srun; how do you run with limited GPUs on a local machine?
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel

Clear command:
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

Possible error:
Since we give the command '--master_port 29500', users may hit the error 'RuntimeError: Address already in use', which requires switching to another port number.
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet

Environment

No response

Train on a multi-GPU server

In the case where I have only a multi-GPU server and no distributed system is available, how can I use all the GPUs to train a model? Thanks!

[BUG] failed to run a gpt2_xl test case

I wrote a GPT-2 test case and tried gpt_small and gpt_large; they are fine. However, it failed on gpt2_xl.
For more details see my MR #115.
BTW: What is the unit of throughput? Can you provide a TFLOPS metric? It is a task-independent indicator that reflects the utilization of the hardware's computing power:
TFLOPS = (model_numel * batch_size * sequence_length * 2 * 4) / elapsed time per iteration
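The proposed metric written out as code (a sketch that just follows the formula above; the constant factors are the issue author's approximation of forward plus backward FLOPs):

def tflops(model_numel, batch_size, sequence_length, elapse_per_iter):
    """TFLOPS per iteration, following the formula proposed above."""
    flops = model_numel * batch_size * sequence_length * 2 * 4
    return flops / elapse_per_iter / 1e12

# e.g. a 1.5e9-parameter GPT-2 XL, batch 8, seq len 1024, 2.5 s per iteration:
# tflops(1.5e9, 8, 1024, 2.5) ≈ 39.3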

On an 8-GPU node:

cd examples/gpt
bash run.sh
Traceback (most recent call last):
  File "run_gpt2_with_engine.py", line 110, in <module>
Traceback (most recent call last):
  File "run_gpt2_with_engine.py", line 110, in <module>
    train_gpt()
  File "run_gpt2_with_engine.py", line 106, in train_gpt
    trainer.fit(train_dataloader=train_dataloader, epochs=gpc.config.NUM_EPOCHS, hooks=hook_list, display_progress=True)
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 312, in fit
    train_gpt()
  File "run_gpt2_with_engine.py", line 106, in train_gpt
    self._train_epoch(
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 178, in _train_epoch
    trainer.fit(train_dataloader=train_dataloader, epochs=gpc.config.NUM_EPOCHS, hooks=hook_list, display_progress=True)
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 312, in fit
    logits, label, loss = self.schedule.forward_backward_step(
  File "/workspace/ColossalAI/colossalai/engine/schedule/_non_pipeline_schedule.py", line 52, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/workspace/ColossalAI/colossalai/engine/schedule/_base_schedule.py", line 98, in _call_engine
    self._train_epoch(
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 178, in _train_epoch
    return engine(inputs)
  File "/workspace/ColossalAI/colossalai/engine/_base_engine.py", line 112, in __call__
    logits, label, loss = self.schedule.forward_backward_step(
      File "/workspace/ColossalAI/colossalai/engine/schedule/_non_pipeline_schedule.py", line 52, in forward_backward_step
return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    output = self._call_engine(engine, data)
  File "/workspace/ColossalAI/colossalai/engine/schedule/_base_schedule.py", line 98, in _call_engine
    return engine(inputs)
  File "/workspace/ColossalAI/colossalai/engine/_base_engine.py", line 112, in __call__
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 871, in forward
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 871, in forward
    output = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    output = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 18, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/ColossalAI/colossalai/amp/torch_amp/torch_amp.py", line 63, in forward
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 18, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/ColossalAI/colossalai/amp/torch_amp/torch_amp.py", line 63, in forward
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 245, in forward
    x, attention_mask = block(x, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 245, in forward
    x, attention_mask = block(x, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/colossalai/nn/layer/utils/common.py", line 26, in forward
    return checkpoint(self._forward, *args, **kwargs)
  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 117, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 44, in forward
    outputs = run_function(*args)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 148, in _forward
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/colossalai/nn/layer/utils/common.py", line 26, in forward
    return checkpoint(self._forward, *args, **kwargs)
x = x + self.attn(self.norm1(x), attention_mask)  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 117, in checkpoint

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return CheckpointFunction.apply(function, *args)
  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 44, in forward
    outputs = run_function(*args)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 148, in _forward
    x = x + self.attn(self.norm1(x), attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 72, in forward
    qkv = qkv.view(new_qkv_shape)
RuntimeError: shape '[8, 1024, 6, 192]' is invalid for input of size 9830400
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 72, in forward
    qkv = qkv.view(new_qkv_shape)
RuntimeError: shape '[8, 1024, 6, 192]' is invalid for input of size 9830400

[BUG] Timer reset bug.

Could you please check the implementation of the timer reset? I believe it is buggy and not able to deal with exceptions. For example, what if the name is not in self._timers?

colossalai - root - 2022-01-04 14:47:52,004 INFO: [Epoch 0 / Train]: Loss = nan | LR = 0.00015 | Throughput = 0
Traceback (most recent call last):
File "/home/jiaruifang/codes/ColossalAI/examples/bert/run_bert_with_engine.py", line 106, in
train_gpt()
File "/home/jiaruifang/codes/ColossalAI/examples/bert/run_bert_with_engine.py", line 102, in train_gpt
trainer.fit(train_dataloader=train_dataloader, epochs=gpc.config.NUM_EPOCHS, hooks=hook_list, display_progress=True)
File "/home/jiaruifang/codes/ColossalAI/colossalai/trainer/_trainer.py", line 312, in fit
self._train_epoch(
File "/home/jiaruifang/codes/ColossalAI/colossalai/trainer/_trainer.py", line 196, in _train_epoch
self._call_timer(action='reset', item='Train-step')
File "/home/jiaruifang/codes/ColossalAI/colossalai/trainer/_trainer.py", line 127, in _call_timer
getattr(self._timer, action)(item, *args, **kwargs)
File "/home/jiaruifang/codes/ColossalAI/colossalai/utils/timer.py", line 121, in reset
self._timers[name].reset()
KeyError: 'Train-step'
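A defensive variant of the reset path (a sketch inferred from the traceback above, not the actual MultiTimer code):

def reset(self, name=None):
    """Reset one named timer, or all timers when no name is given."""
    if name is None:
        for timer in self._timers.values():
            timer.reset()
    elif name in self._timers:  # tolerate unknown names instead of raising KeyError
        self._timers[name].reset()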

colossalai.launch has an error, but colossalai.launch_from_torch works well

Describe the feature

I am running the code on NUS HPC. When I open two tabs and run these commands separately:

DATA='./dataset/' python train_1.py --world_size=2 --rank=0 --local_rank=0 --host='172.17.0.1' --port='51061' --config='./configs/vit_1d.py'

DATA='./dataset/' python train_1.py --world_size=2 --rank=1 --local_rank=1 --host='172.17.0.1' --port='51061' --config='./configs/vit_1d.py'

I encounter an error like this:
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai-0.0.1b0-py3.6.egg/colossalai/nn/layer/parallel_1d/_utils.py", line 211, in gather_forward_split_backward
return GatherForwardSplitBackward.apply(input, parallel_mode, dim)
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai-0.0.1b0-py3.6.egg/colossalai/nn/layer/parallel_1d/_utils.py", line 137, in forward
return gather(input, parallel_mode, dim)
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai-0.0.1b0-py3.6.egg/colossalai/nn/layer/parallel_1d/_utils.py", line 71, in gather
torch.distributed.all_gather(tensor_list, input_, group=gpc.get_group(parallel_mode))
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2006, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

But when I use the torchrun version like this:
DATA='./dataset/' torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr='172.17.0.1' --master_port='51066' train_1.py --config='./configs/vit_1d.py'

It works well!

I'm just wondering why, haha. Any suggestions would benefit me and spread knowledge, thanks!

PyTorch or TensorFlow

Can a new example be built on the TensorFlow framework?

More specifically, I have a TensorFlow-based deep neural network model. How should I proceed if I want to upload it as an example?

Missing Long Description on PyPI

📚 The doc issue

The setup.py should contain a long description to display on PyPI. We can add this in the next release.
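A typical way to wire this up (a sketch, not the repository's actual setup.py):

from setuptools import setup

with open('README.md', encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='colossalai',
    long_description=long_description,
    long_description_content_type='text/markdown',
    # ... remaining arguments unchanged
)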

[BUG] The project is not compatible with torch v1.8.1

The PyTorch APIs have changed recently. I failed to run the Embedding layer using torch 1.8.1+cu111.
The APIs of Embedding and LayerNorm are different across versions. Did you consider supporting multiple torch versions, or locking to a specific version? The latter is not a user-friendly choice, although I noticed you designate a torch version in requirements.txt.

https://pytorch.org/docs/1.8.1/generated/torch.nn.Embedding.html?highlight=embeddings
https://pytorch.org/docs/1.10.0/generated/torch.nn.Embedding.html?highlight=embeddings

[BUG] Initializing the context fails using PyTorch 1.10

Describe the bug

The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
warning: variables which starts with __, is a module or class declaration are omitted
Traceback (most recent call last):
  File "run_resnet_cifar10_with_trainer.py", line 118, in <module>
    main()
  File "run_resnet_cifar10_with_trainer.py", line 19, in main
    colossalai.launch_from_torch(config='./config.py')
  File "/opt/conda/lib/python3.8/site-packages/colossalai/initialize.py", line 209, in launch_from_torch
    launch(config=config,
  File "/opt/conda/lib/python3.8/site-packages/colossalai/initialize.py", line 101, in launch
    gpc.init_global_dist(rank, world_size, backend, host, port)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 322, in init_global_dist
    dist.init_process_group(rank=rank,
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 559, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 141, in _tcp_rendezvous_handler
    store = TCPStore(  # type: ignore[call-arg]
RuntimeError: Address already in use

To Reproduce

cd examples/resnet_cifar10_data_parallel
env DATA="./cifar10_data" python -m torch.distributed.launch --nproc_per_node=1 run_resnet_cifar10_with_engine.py

Expected behavior

Screenshots

Environment (please complete the following information):

  • CUDA version: cuda_11.4.r11.4/compiler.30188945_0
  • cuDNN version:
  • NCCL version:
  • Python version: 3.8.10
  • PyTorch version: 1.10.0a0+3fd9dcf

Additional context
I believe the bug comes from line 321 of colossalai/context/parallel_context.py.

Automatic Release on PyPI

Describe the feature

We can set up a CI to automate the release process. There are several goals to achieve:

  1. automatically test version compatibility
  2. automatically publish the develop branch to Test PyPI
  3. automatically publish the main branch to PyPI

This CI should preferably run on manual workflow dispatch. If triggered by events such as PRs, it may run in unwanted situations, such as on submodule reference updates.

Need experiment results to show superiority

Describe the feature

The current content mainly shows what Colossal-AI offers, but it lacks a convincing and engaging presentation of experimental results.

For example, the README should highlight the results of key features.
The examples, benchmarks, and tutorials should present the expected key experimental results, not just feature descriptions and run commands, which are likely to cause problems for novices trying to use and reproduce them :(

Add hooks before and after operators

Describe the feature

I noticed the project has a hook factory, which provides an easy way to add extra business logic (like a throughput metric) before and after a training iteration.
However, the name 'hook' is a little misleading, since it is not the same as a PyTorch hook.
PyTorch hooks can add operations before and after the forward and backward passes of a submodule.
Currently, the project does not provide a function in BaseHook for developers to perform operations before and after a PyTorch submodule (like Linear) executes.
For example, someone might want to profile the memory footprint during training by recording memory usage before and after an operator executes (see the sketch below).
I would consider adding register_forward_hook and register_forward_pre_hook functions to BaseHook, just like setup_zero_stage3_hooks. Does this make sense to you?
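For illustration, the same effect is achievable today with stock PyTorch module hooks; a sketch that records CUDA memory around every Linear's forward:

import torch
import torch.nn as nn

def attach_memory_hooks(model: nn.Module):
    def pre_hook(module, inputs):
        module._mem_before = torch.cuda.memory_allocated()

    def post_hook(module, inputs, output):
        delta = torch.cuda.memory_allocated() - module._mem_before
        print(f'{module.__class__.__name__}: {delta / 2**20:+.1f} MiB')

    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.register_forward_pre_hook(pre_hook)
            m.register_forward_hook(post_hook)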

How to define a model with only one layer?

Describe the feature

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
import colossalai
import torchvision
from colossalai.builder import *
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import Accuracy, CrossEntropyLoss, MSELoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.trainer import Trainer
from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
                                      LogMetricByEpochHook,
                                      LogMetricByStepHook,
                                      LogTimingByEpochHook, LossHook,
                                      LRSchedulerHook, ThroughputHook)
from colossalai.utils import MultiTimer, get_dataloader
from model_zoo.vit import vit_lite_depth7_patch4_32
from torchvision import transforms
from colossalai.nn import Linear1D_Col, Linear1D, Linear1D_Row
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch

DATASET_PATH = str(os.environ['DATA'])


class trainset(Dataset):

    def __init__(self):
        pass

    def __getitem__(self, index):
        target = torch.randint(5, (4,), dtype=torch.int64)
        data = torch.randn(1, 5)
        return data, target

    def __len__(self):
        return 512 * 1


def build_cifar(batch_size):
    train_dataset = trainset()
    test_dataset = trainset()
    train_dataloader = get_dataloader(dataset=train_dataset,
                                      shuffle=True,
                                      batch_size=batch_size,
                                      num_workers=0,
                                      pin_memory=True)
    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=0, pin_memory=True)
    return train_dataloader, test_dataloader


class MLP_1D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear1D_Col(in_features=5, out_features=4, gather_output=True)
        # self.linear_2 = Linear1D_Row(in_features=4, out_features=1)

    def forward(self, x):
        x = self.linear_1(x)
        # x = self.linear_2(x)
        x = torch.squeeze(x, 1)
        return x


def train_cifar():
    args = colossalai.get_default_parser().parse_args()
    # standard launch
    # colossalai.launch(config=args.config,
    #                   rank=args.rank,
    #                   world_size=args.world_size,
    #                   local_rank=args.local_rank,
    #                   host=args.host,
    #                   port=args.port)

    # launch from torchrun
    colossalai.launch_from_torch(config=args.config)

    logger = get_dist_logger()
    if hasattr(gpc.config, 'LOG_PATH'):
        if gpc.get_global_rank() == 0:
            log_path = gpc.config.LOG_PATH
            if not os.path.exists(log_path):
                os.mkdir(log_path)
            logger.log_to_file(log_path)

    # model = vit_lite_depth7_patch4_32()
    model = MLP_1D()

    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)

    criterion = CrossEntropyLoss(label_smoothing=0.1)

    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)

    steps_per_epoch = len(train_dataloader)

    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                                    optimizer=optimizer,
                                                                                    criterion=criterion,
                                                                                    train_dataloader=train_dataloader,
                                                                                    test_dataloader=test_dataloader,
                                                                                    lr_scheduler=lr_scheduler)

    logger.info("Engine is built", ranks=[0])

    timer = MultiTimer()

    trainer = Trainer(engine=engine, logger=logger, timer=timer)
    logger.info("Trainer is built", ranks=[0])

    hooks = [
        LogMetricByEpochHook(logger=logger),
        LogMetricByStepHook(),
        # LogTimingByEpochHook(timer=timer, logger=logger),
        # LogMemoryByEpochHook(logger=logger),
        AccuracyHook(accuracy_func=Accuracy()),
        LossHook(),
        ThroughputHook(),
        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
    ]

    logger.info("Train start", ranks=[0])
    trainer.fit(train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                hooks=hooks,
                display_progress=True,
                test_interval=1)


if __name__ == '__main__':
    train_cifar()

This is my code and it reports:
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected floating point type for target with class probabilities, got Long
I am not sure how to define a single-layer model.
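The error likely arises because the squeezed output and the (4,)-shaped target end up with matching shapes, so CrossEntropyLoss takes its class-probabilities path, which requires floating-point targets. For the usual class-index path, logits must be (N, C) float and targets (N,) long. A minimal sketch of shapes that work (plain PyTorch, independent of the parallel layers):

import torch
import torch.nn as nn

model = nn.Linear(5, 4)            # a single layer with C = 4 classes
data = torch.randn(8, 5)           # N = 8 samples
target = torch.randint(4, (8,))    # one class index per sample, dtype long

# label_smoothing with class indices requires torch >= 1.10
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(model(data), target)
print(loss)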

Inconsistent pip install and uninstall

Pip install uses

pip install colossalai

but uninstall is

pip uninstall colossal-ai

This is because the name is colossal-ai in setup.py; change it to the version without '-' for consistency.

[DOC] Documentation is not detailed enough

Some problems regarding the documentation

  1. some functions take in *args and **kwargs; there should be a link and an example explaining what these arguments are
  2. some classes and functions should come with an example, e.g., colossalai.launch

[BUG]

Hi. I met an error when I tried to import colossalai. It seems it tries to import from its own layers but cannot. Does this mean I need to change my encoding format?

Traceback (most recent call last):
  File "main.py", line 8, in <module>
    import colossalai
  File "/usr/local/lib/python3.6/site-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (initialize, launch, launch_from_openmpi,
  File "/usr/local/lib/python3.6/site-packages/colossalai/initialize.py", line 7, in <module>
    from colossalai.nn.optimizer.colossalai_optimizer import ColossalaiOptimizer
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/__init__.py", line 1, in <module>
    from .layer import *
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/layer/__init__.py", line 1, in <module>
    from .colossalai_layer import *
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/layer/colossalai_layer/__init__.py", line 2, in <module>
    from .dropout import Dropout
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/layer/colossalai_layer/dropout.py", line 1, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'

  • CUDA version: 11.0
  • Python version: 3.6.8
  • PyTorch version: 1.10.0

Stuck creating a new model with Linear1D

Describe the feature

So I am trying to run a new model. On my local PC, I run 1D parallel with TENSOR_PARALLEL_SIZE=1, because my PC has only one GPU, and the model works. But on HPC, when I try TENSOR_PARALLEL_SIZE=2 (only 2 GPUs), the model blocks and does not move! Any suggestions? Thank you!!!

My command to run this code is

DATA='./dataset/' torchrun --nproc_per_node='2' --nnodes='1' --node_rank='0' --master_addr='172.18.126.98' --master_port='51063' train_1.py --config='./configs/vit_1d.py'

and I have attached the MobaXterm screen below.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
import colossalai
import torchvision
from colossalai.builder import *
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import Accuracy, CrossEntropyLoss, MSELoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.trainer import Trainer
from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
                                      LogMetricByEpochHook,
                                      LogMetricByStepHook,
                                      LogTimingByEpochHook, LossHook,
                                      LRSchedulerHook, ThroughputHook)
from colossalai.utils import MultiTimer, get_dataloader
from model_zoo.vit import vit_lite_depth7_patch4_32
from torchvision import transforms
from colossalai.nn import Linear1D
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch

os.environ['MASTER_ADDR'] = '172.18.126.98'
os.environ['MASTER_PORT'] = '51064'
os.environ['DATA'] = './dataset/'
os.environ['CONFIG_FILE'] = './configs/vit_1d.py'
os.environ['LOCAL_RANK'] = '0'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '2'

DATASET_PATH = str(os.environ['DATA'])


class trainset(Dataset):

    def __init__(self):
        pass

    def __getitem__(self, index):
        target = torch.randn(1)
        data = torch.randn(1, 1024)
        return data, target

    def __len__(self):
        return 512 * 5


def build_cifar(batch_size):
    train_dataset = trainset()
    test_dataset = trainset()
    train_dataloader = get_dataloader(dataset=train_dataset,
                                      shuffle=True,
                                      batch_size=batch_size,
                                      num_workers=0,
                                      pin_memory=True)
    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=0, pin_memory=True)
    return train_dataloader, test_dataloader


class MLP_1D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear1D(in_features=1024, out_features=16384)
        self.linear_2 = Linear1D(in_features=16384, out_features=1)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        x = x.squeeze(-1)
        return x


def train_cifar():
    args = colossalai.get_default_parser().parse_args()

    colossalai.launch_from_torch(config=args.config)
    print('111')
    logger = get_dist_logger()
    if hasattr(gpc.config, 'LOG_PATH'):
        if gpc.get_global_rank() == 0:
            log_path = gpc.config.LOG_PATH
            if not os.path.exists(log_path):
                os.mkdir(log_path)
            logger.log_to_file(log_path)
    print('1.5 1.5 1.5')
    model = MLP_1D()  # !!!!!!!!! Here it blocks !!!!!!!!!

    print('222')
    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)

    criterion = CrossEntropyLoss(label_smoothing=0.1)

    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
    print('333')
    steps_per_epoch = len(train_dataloader)

    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                                    optimizer=optimizer,
                                                                                    criterion=criterion,
                                                                                    train_dataloader=train_dataloader,
                                                                                    test_dataloader=test_dataloader,
                                                                                    lr_scheduler=lr_scheduler)

    logger.info("Engine is built", ranks=[0])

    timer = MultiTimer()
    print('444')
    trainer = Trainer(engine=engine, logger=logger, timer=timer)
    logger.info("Trainer is built", ranks=[0])

    hooks = [
        LogMetricByEpochHook(logger=logger),
        LogMetricByStepHook(),
        AccuracyHook(accuracy_func=Accuracy()),
        LossHook(),
        ThroughputHook(),
        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
    ]
    print('555')
    logger.info("Train start", ranks=[0])
    trainer.fit(train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                hooks=hooks,
                display_progress=True,
                test_interval=1)


if __name__ == '__main__':
    train_cifar()

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

[BUG] Please update the PyPI version

Describe the bug
I used the colossalai package installed from PyPI; it failed to run the ResNet example!

python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py

Colossalai should be built with cuda extension to use the FP16 optimizer
Colossalai should be built with cuda extension to use the FP16 optimizer
DeepSpeed is required if you want to use ZeRO.
DeepSpeed is required if you want to use ZeRO.
Traceback (most recent call last):
  File "run_resnet_cifar10_with_engine.py", line 7, in <module>
    from colossalai.utils import get_dataloader
ImportError: cannot import name 'get_dataloader' from 'colossalai.utils' (/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/colossalai/utils/__init__.py)
Killing subprocess 1397423
Traceback (most recent call last):
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/jiaruifang/anaconda3/envs/deepalpha/bin/python', '-u', 'run_resnet_cifar10_with_engine.py', '--local_rank=0']' returned non-zero exit status 1.

To Reproduce

pip install colossalai
cd examples/resnet_cifar10_data_parallel
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py

Expected behavior

Screenshots

Environment (please complete the following information):

  • CUDA version:
  • cuDNN version:
  • NCCL version:
  • Python version:
  • PyTorch version:

Additional context

Reformat warmup

Currently, warmup steps and epochs are mixed in lr_scheduler, which is a little confusing.
If we add warmup_step to lr_scheduler_hook, which is more user-friendly for users who need steps spanning multiple epochs, the name 'by_epoch' may need to be made clearer, resulting in changes to other modules and examples.

Duplicated 'mkdir' with MPI backend

๐Ÿ› Describe the bug

The save-checkpoint hook does not check the local rank when calling _ensure_directory_exists(checkpoint_path), causing multiple processes to attempt to create the directory and crash. Log attached.

Traceback (most recent call last):
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 312, in run_commandline
Traceback (most recent call last):
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 312, in run_commandline
    return self.run(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 276, in run
    return self.run(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/run.py", line 238, in __call__
    run()
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/workspace/ColossalAI-Examples/image/vilt/run.py", line 137, in main
    self.result = self.main_function(*args)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/workspace/ColossalAI-Examples/image/vilt/run.py", line 137, in main
    trainer.fit(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 312, in fit
    trainer.fit(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 312, in fit
    self._train_epoch(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 195, in _train_epoch
    self._train_epoch(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 195, in _train_epoch
    self._call_hooks('after_train_epoch')
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 145, in _call_hooks
    self._call_hooks('after_train_epoch')
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 145, in _call_hooks
    getattr(hook, func)(self)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/hooks/_checkpoint_hook.py", line 61, in after_train_epoch
    getattr(hook, func)(self)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/hooks/_checkpoint_hook.py", line 61, in after_train_epoch
    save_checkpoint(save_path,
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 163, in save_checkpoint
    save_checkpoint(save_path,
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 163, in save_checkpoint
    _ensure_directory_exists(checkpoint_path)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 76, in _ensure_directory_exists
    _ensure_directory_exists(checkpoint_path)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 76, in _ensure_directory_exists
    os.makedirs(dir)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/os.py", line 225, in makedirs
    os.makedirs(dir)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './ckpt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/ColossalAI-Examples/image/vilt/run.py", line 40, in <module>
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './ckpt'
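Two simple directions for a fix (sketches, not the actual patch): guard the save with a local-rank check, or make the directory creation race-free:

import os

def _ensure_directory_exists(filename):
    dirpath = os.path.dirname(filename)
    if dirpath:
        # exist_ok avoids the crash when several ranks race to create the dir
        os.makedirs(dirpath, exist_ok=True)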

Environment

No response

A small mistake

[screenshot of the README]
The correct git clone command should be:
git clone https://github.com/hpcaitech/ColossalAI.git

[FEATURE] How to prepare WebtextDataset?

Is your feature request related to a problem? Please describe.
I tried to run the GPT-2 example. It uses WebtextDataset. Is there any instruction on data preparation?

[FEATURE] Compatibility with various batch formats

Is your feature request related to a problem? Please describe.
The implementation of Colossal-AI seems to transfer only the tensors in batch (dict) values to the device. However, batch formats can be varied and highly customized (e.g., list-type batches, or minibatch dicts containing lists of tensors). In such cases the batch size cannot be correctly determined, causing errors.

Describe the solution you'd like
The batch_size in BaseSchedule.load_batch() should consider list-type batches and use len() instead of size(). The same applies to BaseSchedule._move_to_device(), which should handle minibatches containing multiple tensors in a list.
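A sketch of the requested size logic (the method shape and names are assumptions about BaseSchedule's internals, not its actual code):

import torch

def get_batch_size(batch):
    if isinstance(batch, torch.Tensor):
        return batch.size(0)
    if isinstance(batch, (list, tuple)):
        # list-type batch: take the length of the first field
        return len(batch[0])
    if isinstance(batch, dict):
        first = next(iter(batch.values()))
        # a dict value may itself be a list of tensors
        return len(first) if isinstance(first, (list, tuple)) else first.size(0)
    raise TypeError(f'unsupported batch type: {type(batch)}')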

[BUG] RuntimeError: Address already in use

Describe the bug
Traceback (most recent call last):
File "train.py", line 132, in
train_cifar()
File "train.py", line 73, in train_cifar
colossalai.launch_from_torch(config=args.config)
File "/home/svu/e0787810/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai/initialize.py", line 217, in launch_from_torch
verbose=verbose)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai/initialize.py", line 101, in launch
gpc.init_global_dist(rank, world_size, backend, host, port)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai/context/parallel_context.py", line 325, in init_global_dist
init_method=init_method)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 183, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: Address already in use

I am running this on NUS HPC. For the hostname, I used the 'hostname -l' instruction to get four IPs and tried one of them; for the port, I randomly chose a number. I am not sure how to find an acceptable address.

[Discussion] About 3D Parallelism

I read the paper Maximizing Parallelism in Distributed Training for Huge Neural Networks. The idea is elegant and does make sense to me. However, I wonder about the compatibility of this method with gradient checkpointing (I mentioned it in #117; we call it GC afterward).

Using 3D parallelism, we have to conduct an all-gather of the activations across (N/P^2) processors (a partial collective communication), where N is the number of GPUs for the 3-D linear layer. At least three such partial collective communications have to be done: during the forward pass, the backward pass, and the recomputation of activations during backward when using GC. Therefore, it introduces more communication overhead compared with model parallelism that does not split activations. Did you consider this overhead in the experiment section of the paper?

Also, activation tensors are small. If you partition an activation tensor into N pieces and send/recv at the granularity of a single piece, won't the bandwidth utilization be extremely low? This is different from communication on parameters: we can pack several layers' parameter tensors and send/recv them in a larger volume to better utilize network bandwidth, but activations come one after another; you cannot treat them the same as parameter tensors.

PS: a small typo in the arXiv paper, page 5, 1st line: Bij = [lnp : lnp + np + 1]

LAMB is not suited for tensor parallelism

The current LAMB optimizer implementation does not support tensor parallelism, as it needs to compute the norm of the whole weight matrix. This is incompatible with tensor parallelism because the tensor is split across ranks.
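One possible direction (a sketch, not the library's implementation), assuming each rank holds a shard of the weight: all-reduce the squared local norm across the tensor-parallel group before the square root, so every rank sees the full-matrix norm for the trust ratio.

import torch
import torch.distributed as dist

def sharded_norm(local_shard, tp_group):
    """Norm of the full weight matrix, computed from one rank's shard."""
    sq = local_shard.pow(2).sum()
    dist.all_reduce(sq, op=dist.ReduceOp.SUM, group=tp_group)
    return sq.sqrt()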

Need more runtime hooks during a training step

Describe the feature

In the PyTorch fashion, we usually train a model like

for x, y in dataloader:
    ... # do something before forward
    out = model(x)
    loss = criterion(out, y)
    ... # do something between forward and backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ... # do something after backward

In the Colossal-AI trainer, hooks can only be added before and after a training step; users cannot customize behavior between fetching an input batch and the forward pass, or between the forward and backward passes.
Also, since the OpHook is applied to modules recursively, it is not appropriate for this issue either. We may need to add at least the two extra hooks mentioned above.
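A sketch of the two extra hook points (the method names are proposals, not the current API):

class BaseHook:
    # existing before/after train-iteration hook points omitted

    def after_load_batch(self, trainer, data, label):
        """Runs between fetching an input batch and the forward pass."""

    def before_backward(self, trainer, output, loss):
        """Runs between the forward pass and the backward pass."""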
