hpcaitech / colossalai

Making large AI models cheaper, faster and more accessible

Home Page: https://www.colossalai.org

License: Apache License 2.0

Python 94.07% C++ 1.53% C 0.09% Cuda 1.26% Dockerfile 0.04% Shell 0.56% HTML 2.44%
ai big-model data-parallelism deep-learning distributed-computing foundation-models heterogeneous-training hpc inference large-scale model-parallelism pipeline-parallelism

colossalai's People

Contributors

1saa, binmakeswell, camille7777, chengeharrison, cjhha1, csric, cwher, cypher30, digger-yu, fazziekey, feifeibear, flybird11111, foolplayer, frankleeeee, fridge003, github-actions[bot], gy-lu, ht-zhou, klhhhhh, kurisusnowdeng, lstm-kirigaya, maruyamaaya, oahzxl, super-dainiu, sze-qq, tongli3701, ver217, wesley-jzy, yuliangliu0306, zengzh95


colossalai's Issues

Need a fine-tuning example

Describe the feature

Few users are able to train large models directly from scratch. We need to provide a fine-tuning example.

For example, how do you load pre-trained parameters into a Colossal-AI model and fine-tune it efficiently with Colossal-AI's other features? Performance can be optimized incrementally in subsequent updates, but this feature is practical and important.
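A minimal sketch of what such an example could look like. The stand-in model, dummy data, and the 'pretrained.pt' checkpoint path are hypothetical; the launch/initialize calls follow the other examples on this page.

import torch
import colossalai
from torch.utils.data import TensorDataset
from colossalai.utils import get_dataloader

colossalai.launch_from_torch(config='./config.py')  # hypothetical config path

# stand-in for the real architecture and fine-tuning data
model = torch.nn.Linear(1024, 10)
dataset = TensorDataset(torch.randn(64, 1024), torch.randint(10, (64,)))
train_dataloader = get_dataloader(dataset=dataset, batch_size=8, shuffle=True)

# load pre-trained weights; 'pretrained.pt' is a hypothetical checkpoint file
state_dict = torch.load('pretrained.pt', map_location='cpu')
model.load_state_dict(state_dict, strict=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

engine, train_dataloader, _, _ = colossalai.initialize(model=model,
                                                       optimizer=optimizer,
                                                       criterion=criterion,
                                                       train_dataloader=train_dataloader)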

[BUG] Zombie processes with MPI launch

Describe the bug
If the parallel training is launched via MPI, zombie processes are not killed upon keyboard interruption or exceptions.

To Reproduce
Initialize the parallel context with MPI and launch more than one process (e.g., mpirun -np 2 train.py), then interrupt the training with Ctrl + C.

Expected behavior
Ranks > 0 keep running and taking up memory.
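A possible user-side workaround until the library handles this (a sketch, not the project's fix): on interruption, kill the entire local process group so that ranks > 0 die with rank 0.

import os
import signal
import sys

def _terminate(signum, frame):
    # restore the default handler first so the killpg below does not re-enter us
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    # send SIGTERM to the whole process group so sibling ranks do not linger
    os.killpg(os.getpgid(0), signal.SIGTERM)
    sys.exit(1)

signal.signal(signal.SIGINT, _terminate)
signal.signal(signal.SIGTERM, _terminate)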

The performance of model parallelism (MP) is not good

Hello developers.

I found that the performance of the provided MP is not good. I compared it with PatrickStar and DeepSpeed. Can you check it with me? See MR #115.
BTW: I strongly recommend adding TFLOPS as a performance indicator.

Platform: one SuperPod node with 8x A100 GPUs and 1 TB of CPU memory. BS = batch size, pstar = PatrickStar, deeps = DeepSpeed.
Entries are throughput (batches/elapsed time); the Xd-Xmp columns use Colossal-AI.

Model Scale | global BS | 1d-4mp | 1d-8mp | 2d-4mp | 2d-8mp | 3d-4mp | 2.5d-4mp | pstar | deeps | deeps-mp4 | deeps-mp8
----------- | --------- | ------ | ------ | ------ | ------ | ------ | -------- | ----- | ----- | --------- | ---------
4B          | 8         | 7.61   | 7.62   | 9.89   | 8.47   | failed | 10.31    | 8.78  | 1.15  | 1.26      | 1.26
4B          | 16        | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 16.67 | 2.26  | 2.42      | 2.36
4B          | 128       | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 28.39 | 12.51 | 10.80     | OOM
10B         | 2         | OOM    | 3.62   | OOM    | failed | OOM    | OOM      | -     | -     | 0.15      | 0.15
10B         | 4         | OOM    | 4.66   | OOM    | OOM    | OOM    | OOM      | -     | -     | 0.30      | 0.30
10B         | 128       | OOM    | OOM    | OOM    | OOM    | OOM    | OOM      | 13.43 | OOM   | 6.31      | 5.73
  1. As you can see, Colossal-AI's computing efficiency is the lowest among the three solutions at single-node scale. However, Colossal-AI is very competitive at the same batch size; unfortunately, the achievable batch size severely limits Colossal-AI's performance.
  2. 2.5d-4mp is superior at 4B with BS 8, but 1d-8mp generalizes better.
  3. Heterogeneous training (like PatrickStar and DeepSpeed) may be a better solution than a complex MP strategy at single-node scale.

Facing an error using a CNN on MNIST

I was using Colossal-AI to apply a CNN to the MNIST dataset, but the following error occurs and I am not able to resolve it:

[Epoch 0 train]: 0%| | 0/6000 [00:00<?, ?it/s]

TypeError Traceback (most recent call last)
in ()
6 max_epochs = num_epochs,
7 display_progress = True,
----> 8 test_interval = test_interval
9 )

4 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
1104 full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() takes 2 positional arguments but 11 were given

I have attached a link to the Jupyter notebook:
https://colab.research.google.com/drive/15Yiv7EBAc6eWV14aGEl0GP3-AMZLSr06?usp=sharing
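One plausible cause (an assumption from the trace, not a confirmed diagnosis): the trainer's schedule appears to unpack the fetched batch into the model's forward call, so forward must accept exactly one tensor argument besides self, and the dataset must yield (image, label) pairs. A minimal MNIST CNN with that signature:

import torch.nn as nn

class MnistCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                      nn.ReLU(),
                                      nn.MaxPool2d(2))
        self.classifier = nn.Linear(16 * 14 * 14, 10)  # 28x28 input pooled to 14x14

    def forward(self, x):  # exactly one positional tensor argument
        x = self.features(x)
        return self.classifier(x.flatten(1))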

[FEATURE] How can I quickly test a HuggingFace transformer model?

Is your feature request related to a problem? Please describe.
I'm frustrated when I try to apply this project to a HuggingFace transformer model, e.g., a BERT model.

Describe the solution you'd like
I cannot find a clear doc that directs me in porting a HuggingFace model to Colossal-AI. Apparently, all of the examples are vision models, but most large-model applications are NLP scenarios.

Describe alternatives you've considered
Provide an example showing how to simply move my training process to Colossal-AI.
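A rough sketch of what such an example might look like, assuming the plain data-parallel path. BertForSequenceClassification is the standard HuggingFace class; the dummy data is a stand-in, and the loss wiring may need a thin wrapper since HF models return output objects rather than raw logits.

import torch
import colossalai
from torch.utils.data import TensorDataset
from colossalai.utils import get_dataloader
from transformers import BertForSequenceClassification

colossalai.launch_from_torch(config='./config.py')  # hypothetical data-parallel config

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# stand-in data: (input_ids, labels); a real run would use a tokenized corpus
dataset = TensorDataset(torch.randint(0, 30522, (64, 128)), torch.randint(0, 2, (64,)))
train_dataloader = get_dataloader(dataset=dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()  # may need to read output.logits in practice

engine, train_dataloader, _, _ = colossalai.initialize(model=model,
                                                       optimizer=optimizer,
                                                       criterion=criterion,
                                                       train_dataloader=train_dataloader)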

[FEATURE] Does this project support gradient checkpointing?

Activation checkpointing (a.k.a. gradient checkpointing in PyTorch, https://pytorch.org/docs/stable/checkpoint.html) is an effective technique (from my perspective, maybe the most effective one) for improving model scale. It saves activation memory footprint at the cost of recomputation. However, I did not see the technique applied in Colossal-AI.
I believe it is a model-specific optimization and should not be part of Colossal-AI's core functionality, but you should add it to the example or benchmark scripts.

See the huggingface GPT2 implementation for more details

https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/modeling_gpt2.py#L865
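For reference, a minimal sketch of activation checkpointing with stock PyTorch, independent of Colossal-AI; each block's activations are recomputed during backward instead of being stored.

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(12))

    def forward(self, x):
        for block in self.blocks:
            # activations inside `block` are recomputed in backward, not stored
            x = checkpoint(block, x)
        return x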

How to initialize Linear2D? Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer

Describe the feature

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
import colossalai
import torch
import torchvision
from colossalai.builder import *
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import Accuracy, CrossEntropyLoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.trainer import Trainer
from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
                                      LogMetricByEpochHook,
                                      LogMetricByStepHook,
                                      LogTimingByEpochHook, LossHook,
                                      LRSchedulerHook, ThroughputHook)
from colossalai.utils import MultiTimer, get_dataloader
from model_zoo.vit import vit_lite_depth7_patch4_32
from torchvision import transforms
from colossalai.nn import Linear2D
import torch.nn as nn
from delete import trainset

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '8888'
os.environ['DATA'] = 'D:/trash_can'
os.environ['CONFIG_FILE'] = 'D:/trash_can/ColossalAI-main/benchmark/cifar/configs/vit_1d.py'
os.environ['LOCAL_RANK'] = '0'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'

DATASET_PATH = str(os.environ['DATA'])


def build_cifar(batch_size):
    train_dataset = trainset()
    test_dataset = trainset()
    train_dataloader = get_dataloader(dataset=train_dataset,
                                      shuffle=True,
                                      batch_size=batch_size,
                                      num_workers=0,
                                      pin_memory=True)
    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=0, pin_memory=True)
    return train_dataloader, test_dataloader


class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x


def train_cifar():
    args = colossalai.get_default_parser().parse_args()

    colossalai.launch_from_torch(config=args.config)

    logger = get_dist_logger()
    if hasattr(gpc.config, 'LOG_PATH'):
        if gpc.get_global_rank() == 0:
            log_path = gpc.config.LOG_PATH
            if not os.path.exists(log_path):
                os.mkdir(log_path)
            logger.log_to_file(log_path)

    model = MLP_2D()

    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)

    criterion = CrossEntropyLoss(label_smoothing=0.1)

    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)

    steps_per_epoch = len(train_dataloader)

    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                                    optimizer=optimizer,
                                                                                    criterion=criterion,
                                                                                    train_dataloader=train_dataloader,
                                                                                    test_dataloader=test_dataloader,
                                                                                    lr_scheduler=lr_scheduler)

    logger.info("Engine is built", ranks=[0])

    timer = MultiTimer()

    trainer = Trainer(engine=engine, logger=logger, timer=timer)
    logger.info("Trainer is built", ranks=[0])

    hooks = [
        LogMetricByEpochHook(logger=logger),
        LogMetricByStepHook(),
        # LogTimingByEpochHook(timer=timer, logger=logger),
        # LogMemoryByEpochHook(logger=logger),
        AccuracyHook(accuracy_func=Accuracy()),
        LossHook(),
        ThroughputHook(),
        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
    ]

    logger.info("Train start", ranks=[0])
    trainer.fit(train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                hooks=hooks,
                display_progress=True,
                test_interval=1)


if __name__ == '__main__':
    train_cifar()

Traceback (most recent call last):
  File "D:/trash_can/ColossalAI-main/benchmark/cifar/train.py", line 161, in <module>
    train_cifar()
  File "D:/trash_can/ColossalAI-main/benchmark/cifar/train.py", line 112, in train_cifar
    model = MLP_2D()
  File "D:/trash_can/ColossalAI-main/benchmark/cifar/train.py", line 81, in __init__
    self.linear_1 = Linear2D(in_features=1024, out_features=16384)
  File "D:\trash_can\ColossalAI-main\colossalai\nn\layer\parallel_2d\layers.py", line 52, in __init__
    assert_summa_initialization()
  File "D:\trash_can\ColossalAI-main\colossalai\nn\layer\parallel_2d\_utils.py", line 23, in assert_summa_initialization
    'Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer'
AssertionError: Both TWO_DIMENSION_COL and TWO_DIMENSION_ROW must be initialized by the process group initializer

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Unable to import get_dataloader from colossalai.utils

While making an example application of Colossal-AI with the new API, I wanted to use the get_dataloader method to create a dataloader, but after running the line
from colossalai.utils import get_dataloader
I ran into the error:
ImportError: cannot import name 'get_dataloader' from 'colossalai.utils' (/usr/local/lib/python3.7/dist-packages/colossalai/utils/__init__.py)

[BUG] Python 3.10 cannot install dependencies from requirements.txt

Describe the bug

To Reproduce
Steps or code snippet to reproduce the behavior:
[screenshot of the installation instructions]
I used the conda default Python version, which is 3.10, and followed the pictured instructions to install, but it raised an error.

Expected behavior
A clear and concise description of what you expected to happen.
The command pip install -r requirements/requirements.txt cannot be used with Python 3.10; the screenshots are as follows.

Screenshots
If applicable, add screenshots to help explain your problem.
[screenshot of the pip error]

Environment (please complete the following information):

  • CUDA version:
  • cuDNN version:
  • NCCL version:
  • Python version: 3.10
  • PyTorch version:

Additional context
Add any other context about the problem here.
I found that pip install -r requirements/requirements.txt works on the latest versions of Python 3.8 and 3.9, so maybe we can add a note telling users not to install with Python 3.10, or recommend a Python version.
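One way to encode the constraint in setup.py (a suggestion, not the project's current packaging; the exact bounds are illustrative):

from setuptools import setup

setup(
    name='colossalai',
    python_requires='>=3.6,<3.10',  # 3.10 not yet supported, per this issue
    # ... remaining arguments unchanged
)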

How to run the benchmark example on each worker?

Describe the feature

To start training, use the following command to run each worker:

$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
    --rank=RANK \
    --local_rank=LOCAL_RANK \
    --host=MASTER_IP_ADDRESS \
    --port=MASTER_PORT \
    --config=CONFIG_FILE

I read this in the README file of colossalai's benchmark. Currently I have 2 GPUs on an NSCC server and am not sure how to run the program on 'each' worker. I tried this on the command line and pressed enter:
DATA='./dataset/' python train.py --world_size=2 --rank=0 --local_rank=0 --host='172.18.126.98' --port='51066' --config='./configs/vit_1d.py'

And the program got stuck, showing only:

Colossalai should be built with cuda extension to use the FP16 optimizer
warning: variables which starts with __, is a module or class declaration are omitted

Could anyone help me? Thank you!!

API on collective operations

Describe the feature

When I attempted to implement a partially tensor-parallel model (i.e., only some layers are 2D/2.5D parallel), not a single rank could get a whole tensor. It would be best to provide functions that allow easier collective operations (all_reduce, broadcast, etc.) at the user level.
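A sketch of the kind of helper requested, assuming the gpc.get_group / gpc.get_world_size accessors visible elsewhere on this page and this version's ParallelMode import path:

import torch
import torch.distributed as dist
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

def all_gather_tensor(tensor, parallel_mode=ParallelMode.TENSOR, dim=0):
    """Gather a sharded tensor from every rank of the given parallel group."""
    group = gpc.get_group(parallel_mode)
    world_size = gpc.get_world_size(parallel_mode)
    shards = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(shards, tensor.contiguous(), group=group)
    return torch.cat(shards, dim=dim)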

The startup commands in examples and benchmark are different and confusing

๐Ÿ› Describe the bug

The startup commands in the examples and the benchmark are different and confusing. They should be unified; the current form easily confuses newbies.

Not clear:
https://github.com/hpcaitech/ColossalAI-Benchmark/tree/62904e4ff2f3261c5469c773faa3d9307b6f16f4
More detail in hpcaitech/ColossalAI-Benchmark#5

The only command given is for more than 64 GPUs using srun; how do you run with limited GPUs on a local machine?
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel

Clear command:
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

Possible error:
Since we give the command '--master_port 29500', users may hit the error 'RuntimeError: Address already in use', which requires switching to another port number.
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet

Environment

No response

Train on a multi-GPU server

In the case where I have only a multi-GPU server and no distributed system is available, how can I use all the GPUs to train a model? Thanks!

[BUG] failed to run a gpt2_xl test case

I wrote a GPT-2 test case and tried gpt_small and gpt_large; they are fine. However, it failed on gpt2_xl.
For more details see my MR #115.
BTW: What is the unit of throughput? Can you provide a TFLOPS metric? It is a task-independent indicator that reflects the utilization of the hardware's computing power:
TFLOPS = (model_numel * batch_size * sequence_length * 2 * 4) / elapsed time per iteration
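The proposed metric written out as code (a sketch that just follows the formula above; the constant factors are the issue author's approximation of forward plus backward FLOPs):

def tflops(model_numel, batch_size, sequence_length, elapse_per_iter):
    """TFLOPS per iteration, following the formula proposed above."""
    flops = model_numel * batch_size * sequence_length * 2 * 4
    return flops / elapse_per_iter / 1e12

# e.g. a 1.5e9-parameter GPT-2 XL, batch 8, seq len 1024, 2.5 s per iteration:
# tflops(1.5e9, 8, 1024, 2.5) ≈ 39.3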

On an 8-GPU node:

cd examples/gpt
bash run.sh
Traceback (most recent call last):
  File "run_gpt2_with_engine.py", line 110, in <module>
Traceback (most recent call last):
  File "run_gpt2_with_engine.py", line 110, in <module>
    train_gpt()
  File "run_gpt2_with_engine.py", line 106, in train_gpt
    trainer.fit(train_dataloader=train_dataloader, epochs=gpc.config.NUM_EPOCHS, hooks=hook_list, display_progress=True)
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 312, in fit
    train_gpt()
  File "run_gpt2_with_engine.py", line 106, in train_gpt
    self._train_epoch(
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 178, in _train_epoch
    trainer.fit(train_dataloader=train_dataloader, epochs=gpc.config.NUM_EPOCHS, hooks=hook_list, display_progress=True)
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 312, in fit
    logits, label, loss = self.schedule.forward_backward_step(
  File "/workspace/ColossalAI/colossalai/engine/schedule/_non_pipeline_schedule.py", line 52, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/workspace/ColossalAI/colossalai/engine/schedule/_base_schedule.py", line 98, in _call_engine
    self._train_epoch(
  File "/workspace/ColossalAI/colossalai/trainer/_trainer.py", line 178, in _train_epoch
    return engine(inputs)
  File "/workspace/ColossalAI/colossalai/engine/_base_engine.py", line 112, in __call__
    logits, label, loss = self.schedule.forward_backward_step(
      File "/workspace/ColossalAI/colossalai/engine/schedule/_non_pipeline_schedule.py", line 52, in forward_backward_step
return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    output = self._call_engine(engine, data)
  File "/workspace/ColossalAI/colossalai/engine/schedule/_base_schedule.py", line 98, in _call_engine
    return engine(inputs)
  File "/workspace/ColossalAI/colossalai/engine/_base_engine.py", line 112, in __call__
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 871, in forward
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 871, in forward
    output = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    output = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 18, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/ColossalAI/colossalai/amp/torch_amp/torch_amp.py", line 63, in forward
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 18, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/ColossalAI/colossalai/amp/torch_amp/torch_amp.py", line 63, in forward
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 245, in forward
    x, attention_mask = block(x, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 245, in forward
    x, attention_mask = block(x, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/colossalai/nn/layer/utils/common.py", line 26, in forward
    return checkpoint(self._forward, *args, **kwargs)
  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 117, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 44, in forward
    outputs = run_function(*args)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 148, in _forward
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/colossalai/nn/layer/utils/common.py", line 26, in forward
    return checkpoint(self._forward, *args, **kwargs)
x = x + self.attn(self.norm1(x), attention_mask)  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 117, in checkpoint

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return CheckpointFunction.apply(function, *args)
  File "/workspace/ColossalAI/colossalai/utils/activation_checkpoint.py", line 44, in forward
    outputs = run_function(*args)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 148, in _forward
    x = x + self.attn(self.norm1(x), attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 72, in forward
    qkv = qkv.view(new_qkv_shape)
RuntimeError: shape '[8, 1024, 6, 192]' is invalid for input of size 9830400
    return forward_call(*input, **kwargs)
  File "/workspace/ColossalAI/model_zoo/gpt/gpt.py", line 72, in forward
    qkv = qkv.view(new_qkv_shape)
RuntimeError: shape '[8, 1024, 6, 192]' is invalid for input of size 9830400

[BUG] Timer reset bug.

Could you please check the implementation of the timer reset? I believe it is buggy and not able to deal with exceptions. For example, what if the name is not in self._timers?

colossalai - root - 2022-01-04 14:47:52,004 INFO: [Epoch 0 / Train]: Loss = nan | LR = 0.00015 | Throughput = 0
Traceback (most recent call last):
File "/home/jiaruifang/codes/ColossalAI/examples/bert/run_bert_with_engine.py", line 106, in
train_gpt()
File "/home/jiaruifang/codes/ColossalAI/examples/bert/run_bert_with_engine.py", line 102, in train_gpt
trainer.fit(train_dataloader=train_dataloader, epochs=gpc.config.NUM_EPOCHS, hooks=hook_list, display_progress=True)
File "/home/jiaruifang/codes/ColossalAI/colossalai/trainer/_trainer.py", line 312, in fit
self._train_epoch(
File "/home/jiaruifang/codes/ColossalAI/colossalai/trainer/_trainer.py", line 196, in _train_epoch
self._call_timer(action='reset', item='Train-step')
File "/home/jiaruifang/codes/ColossalAI/colossalai/trainer/_trainer.py", line 127, in _call_timer
getattr(self._timer, action)(item, *args, **kwargs)
File "/home/jiaruifang/codes/ColossalAI/colossalai/utils/timer.py", line 121, in reset
self._timers[name].reset()
KeyError: 'Train-step'
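A defensive variant of the reset path (a sketch inferred from the traceback above, not the actual MultiTimer code):

def reset(self, name=None):
    """Reset one named timer, or all timers when no name is given."""
    if name is None:
        for timer in self._timers.values():
            timer.reset()
    elif name in self._timers:  # tolerate unknown names instead of raising KeyError
        self._timers[name].reset()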

colossalai.launch has an error, but colossalai.launch_from_torch works well

Describe the feature

I am running the code on NUS HPC. When I open two tabs and run these commands separately:

DATA='./dataset/' python train_1.py --world_size=2 --rank=0 --local_rank=0 --host='172.17.0.1' --port='51061' --config='./configs/vit_1d.py'

DATA='./dataset/' python train_1.py --world_size=2 --rank=1 --local_rank=1 --host='172.17.0.1' --port='51061' --config='./configs/vit_1d.py'

I encounter an error like this:
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai-0.0.1b0-py3.6.egg/colossalai/nn/layer/parallel_1d/_utils.py", line 211, in gather_forward_split_backward
return GatherForwardSplitBackward.apply(input, parallel_mode, dim)
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai-0.0.1b0-py3.6.egg/colossalai/nn/layer/parallel_1d/_utils.py", line 137, in forward
return gather(input, parallel_mode, dim)
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai-0.0.1b0-py3.6.egg/colossalai/nn/layer/parallel_1d/_utils.py", line 71, in gather
torch.distributed.all_gather(tensor_list, input_, group=gpc.get_group(parallel_mode))
File "/home/svu/e0XXXXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2006, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

But when I use the torchrun version like this:
DATA='./dataset/' torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr='172.17.0.1' --master_port='51066' train_1.py --config='./configs/vit_1d.py'

It works well!

I'm just wondering why, haha. Any suggestions would benefit me and spread knowledge, thanks!

PyTorch or TensorFlow

Can a new example be built on the TensorFlow framework?

More specifically, I have a TensorFlow-based deep neural network model. How should I proceed if I want to upload it as an example?

Missing Long Description on PyPI

📚 The doc issue

The setup.py should contain a long description to display on PyPI. We can add this in the next release.
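A typical way to wire this up (a sketch, not the repository's actual setup.py):

from setuptools import setup

with open('README.md', encoding='utf-8') as f:
    long_description = f.read()

setup(
    name='colossalai',
    long_description=long_description,
    long_description_content_type='text/markdown',
    # ... remaining arguments unchanged
)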

[BUG] The project is not compatible with torch v1.8.1

The PyTorch APIs have changed recently. I failed to run the Embedding layer using torch 1.8.1+cu111.
The APIs of Embedding and LayerNorm are different across versions. Did you consider supporting multiple torch versions, or locking to a specific version? The latter is not a user-friendly choice, although I noticed you designate a torch version in requirements.txt.

https://pytorch.org/docs/1.8.1/generated/torch.nn.Embedding.html?highlight=embeddings
https://pytorch.org/docs/1.10.0/generated/torch.nn.Embedding.html?highlight=embeddings

[BUG] Initializing the context fails using PyTorch 1.10

Describe the bug

The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
warning: variables which starts with __, is a module or class declaration are omitted
Traceback (most recent call last):
  File "run_resnet_cifar10_with_trainer.py", line 118, in <module>
    main()
  File "run_resnet_cifar10_with_trainer.py", line 19, in main
    colossalai.launch_from_torch(config='./config.py')
  File "/opt/conda/lib/python3.8/site-packages/colossalai/initialize.py", line 209, in launch_from_torch
    launch(config=config,
  File "/opt/conda/lib/python3.8/site-packages/colossalai/initialize.py", line 101, in launch
    gpc.init_global_dist(rank, world_size, backend, host, port)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/context/parallel_context.py", line 322, in init_global_dist
    dist.init_process_group(rank=rank,
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 559, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 141, in _tcp_rendezvous_handler
    store = TCPStore(  # type: ignore[call-arg]
RuntimeError: Address already in use

To Reproduce

cd examples/resnet_cifar10_data_parallel
env DATA="./cifar10_data" python -m torch.distributed.launch --nproc_per_node=1 run_resnet_cifar10_with_engine.py

Expected behavior

Screenshots

Environment (please complete the following information):

  • CUDA version: cuda_11.4.r11.4/compiler.30188945_0
  • cuDNN version:
  • NCCL version:
  • Python version: 3.8.10
  • PyTorch version: 1.10.0a0+3fd9dcf

Additional context
I believe the bug comes from line 321 of colossalai/context/parallel_context.py.

Automatic Release on PyPI

Describe the feature

We can set up a CI to automate the release process. There are several goals to achieve:

  1. automatically test version compatibility
  2. automatically publish the develop branch to Test PyPI
  3. automatically publish the main branch to PyPI

This CI should preferably run on manual workflow dispatch. If triggered by events such as PRs, it may run in unwanted situations, such as on submodule reference updates.

Need experiment results to show superiority

Describe the feature

The current content mainly shows what Colossal-AI offers, but it lacks a convincing and engaging presentation of experimental results.

For example, the README should highlight the results of key features.
The examples, benchmarks, and tutorials should present the expected key experimental results, not just feature descriptions and run commands, which are likely to cause problems for novices trying to use and reproduce them :(

Add hooks before and after operators

Describe the feature

I noticed the project has a hook factory, which provides an easy way to add extra business logic (like a throughput metric) before and after a training iteration.
However, the name 'hook' is a little misleading, since it is not the same as a PyTorch hook.
PyTorch hooks can add operations before and after the forward and backward passes of a submodule.
Currently, the project does not provide a function in BaseHook for developers to perform operations before and after a PyTorch submodule (like Linear) executes.
For example, someone might want to profile the memory footprint during training by recording memory usage before and after an operator executes (see the sketch below).
I would consider adding register_forward_hook and register_forward_pre_hook functions to BaseHook, just like setup_zero_stage3_hooks. Does this make sense to you?
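For illustration, the same effect is achievable today with stock PyTorch module hooks; a sketch that records CUDA memory around every Linear's forward:

import torch
import torch.nn as nn

def attach_memory_hooks(model: nn.Module):
    def pre_hook(module, inputs):
        module._mem_before = torch.cuda.memory_allocated()

    def post_hook(module, inputs, output):
        delta = torch.cuda.memory_allocated() - module._mem_before
        print(f'{module.__class__.__name__}: {delta / 2**20:+.1f} MiB')

    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.register_forward_pre_hook(pre_hook)
            m.register_forward_hook(post_hook)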

How to define a model with only one layer?

Describe the feature

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
import colossalai
import torchvision
from colossalai.builder import *
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import Accuracy, CrossEntropyLoss, MSELoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.trainer import Trainer
from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
                                      LogMetricByEpochHook,
                                      LogMetricByStepHook,
                                      LogTimingByEpochHook, LossHook,
                                      LRSchedulerHook, ThroughputHook)
from colossalai.utils import MultiTimer, get_dataloader
from model_zoo.vit import vit_lite_depth7_patch4_32
from torchvision import transforms
from colossalai.nn import Linear1D_Col, Linear1D, Linear1D_Row
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch

DATASET_PATH = str(os.environ['DATA'])


class trainset(Dataset):

    def __init__(self):
        pass

    def __getitem__(self, index):
        target = torch.randint(5, (4,), dtype=torch.int64)
        data = torch.randn(1, 5)
        return data, target

    def __len__(self):
        return 512 * 1


def build_cifar(batch_size):
    train_dataset = trainset()
    test_dataset = trainset()
    train_dataloader = get_dataloader(dataset=train_dataset,
                                      shuffle=True,
                                      batch_size=batch_size,
                                      num_workers=0,
                                      pin_memory=True)
    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=0, pin_memory=True)
    return train_dataloader, test_dataloader


class MLP_1D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear1D_Col(in_features=5, out_features=4, gather_output=True)
        # self.linear_2 = Linear1D_Row(in_features=4, out_features=1)

    def forward(self, x):
        x = self.linear_1(x)
        # x = self.linear_2(x)
        x = torch.squeeze(x, 1)
        return x


def train_cifar():
    args = colossalai.get_default_parser().parse_args()
    # standard launch
    # colossalai.launch(config=args.config,
    #                   rank=args.rank,
    #                   world_size=args.world_size,
    #                   local_rank=args.local_rank,
    #                   host=args.host,
    #                   port=args.port)

    # launch from torchrun
    colossalai.launch_from_torch(config=args.config)

    logger = get_dist_logger()
    if hasattr(gpc.config, 'LOG_PATH'):
        if gpc.get_global_rank() == 0:
            log_path = gpc.config.LOG_PATH
            if not os.path.exists(log_path):
                os.mkdir(log_path)
            logger.log_to_file(log_path)

    # model = vit_lite_depth7_patch4_32()
    model = MLP_1D()

    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)

    criterion = CrossEntropyLoss(label_smoothing=0.1)

    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)

    steps_per_epoch = len(train_dataloader)

    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                                    optimizer=optimizer,
                                                                                    criterion=criterion,
                                                                                    train_dataloader=train_dataloader,
                                                                                    test_dataloader=test_dataloader,
                                                                                    lr_scheduler=lr_scheduler)

    logger.info("Engine is built", ranks=[0])

    timer = MultiTimer()

    trainer = Trainer(engine=engine, logger=logger, timer=timer)
    logger.info("Trainer is built", ranks=[0])

    hooks = [
        LogMetricByEpochHook(logger=logger),
        LogMetricByStepHook(),
        # LogTimingByEpochHook(timer=timer, logger=logger),
        # LogMemoryByEpochHook(logger=logger),
        AccuracyHook(accuracy_func=Accuracy()),
        LossHook(),
        ThroughputHook(),
        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
    ]

    logger.info("Train start", ranks=[0])
    trainer.fit(train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                hooks=hooks,
                display_progress=True,
                test_interval=1)


if __name__ == '__main__':
    train_cifar()

This is my code and it reports:
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected floating point type for target with class probabilities, got Long
I am not sure how to define a single-layer model.
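The error likely arises because the squeezed output and the (4,)-shaped target end up with matching shapes, so CrossEntropyLoss takes its class-probabilities path, which requires floating-point targets. For the usual class-index path, logits must be (N, C) float and targets (N,) long. A minimal sketch of shapes that work (plain PyTorch, independent of the parallel layers):

import torch
import torch.nn as nn

model = nn.Linear(5, 4)            # a single layer with C = 4 classes
data = torch.randn(8, 5)           # N = 8 samples
target = torch.randint(4, (8,))    # one class index per sample, dtype long

# label_smoothing with class indices requires torch >= 1.10
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(model(data), target)
print(loss)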

Inconsistent pip install and uninstall

Pip install uses

pip install colossalai

but uninstall is

pip uninstall colossal-ai

This is because the name is colossal-ai in setup.py; change it to the version without '-' for consistency.

[DOC] Documentation is not detailed enough

Some problems regarding the documentation

  1. some functions take in *args and **kwargs; there should be a link and an example explaining what these arguments are
  2. some classes and functions should come with an example, e.g., colossalai.launch

[BUG]

Hi. I met an error when I tried to import colossalai. It seems it tries to import from its own layers but cannot. Does this mean I need to change my encoding format?

Traceback (most recent call last):
  File "main.py", line 8, in <module>
    import colossalai
  File "/usr/local/lib/python3.6/site-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (initialize, launch, launch_from_openmpi,
  File "/usr/local/lib/python3.6/site-packages/colossalai/initialize.py", line 7, in <module>
    from colossalai.nn.optimizer.colossalai_optimizer import ColossalaiOptimizer
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/__init__.py", line 1, in <module>
    from .layer import *
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/layer/__init__.py", line 1, in <module>
    from .colossalai_layer import *
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/layer/colossalai_layer/__init__.py", line 2, in <module>
    from .dropout import Dropout
  File "/usr/local/lib/python3.6/site-packages/colossalai/nn/layer/colossalai_layer/dropout.py", line 1, in <module>
    from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'

  • CUDA version: 11.0
  • Python version: 3.6.8
  • PyTorch version: 1.10.0

Stuck creating a new model with Linear1D

Describe the feature

So I am trying to run a new model. On my local PC, I run 1D parallel with TENSOR_PARALLEL_SIZE=1, because my PC has only one GPU, and the model works. But on HPC, when I try TENSOR_PARALLEL_SIZE=2 (only 2 GPUs), the model blocks and does not move! Any suggestions? Thank you!!!

My command to run this code is

DATA='./dataset/' torchrun --nproc_per_node='2' --nnodes='1' --node_rank='0' --master_addr='172.18.126.98' --master_port='51063' train_1.py --config='./configs/vit_1d.py'

and I have attached the MobaXterm screen below.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
import colossalai
import torchvision
from colossalai.builder import *
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import Accuracy, CrossEntropyLoss, MSELoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.trainer import Trainer
from colossalai.trainer.hooks import (AccuracyHook, LogMemoryByEpochHook,
                                      LogMetricByEpochHook,
                                      LogMetricByStepHook,
                                      LogTimingByEpochHook, LossHook,
                                      LRSchedulerHook, ThroughputHook)
from colossalai.utils import MultiTimer, get_dataloader
from model_zoo.vit import vit_lite_depth7_patch4_32
from torchvision import transforms
from colossalai.nn import Linear1D
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch

os.environ['MASTER_ADDR'] = '172.18.126.98'
os.environ['MASTER_PORT'] = '51064'
os.environ['DATA'] = './dataset/'
os.environ['CONFIG_FILE'] = './configs/vit_1d.py'
os.environ['LOCAL_RANK'] = '0'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '2'

DATASET_PATH = str(os.environ['DATA'])


class trainset(Dataset):

    def __init__(self):
        pass

    def __getitem__(self, index):
        target = torch.randn(1)
        data = torch.randn(1, 1024)
        return data, target

    def __len__(self):
        return 512 * 5


def build_cifar(batch_size):
    train_dataset = trainset()
    test_dataset = trainset()
    train_dataloader = get_dataloader(dataset=train_dataset,
                                      shuffle=True,
                                      batch_size=batch_size,
                                      num_workers=0,
                                      pin_memory=True)
    test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, num_workers=0, pin_memory=True)
    return train_dataloader, test_dataloader


class MLP_1D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear1D(in_features=1024, out_features=16384)
        self.linear_2 = Linear1D(in_features=16384, out_features=1)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        x = x.squeeze(-1)
        return x


def train_cifar():
    args = colossalai.get_default_parser().parse_args()

    colossalai.launch_from_torch(config=args.config)
    print('111')
    logger = get_dist_logger()
    if hasattr(gpc.config, 'LOG_PATH'):
        if gpc.get_global_rank() == 0:
            log_path = gpc.config.LOG_PATH
            if not os.path.exists(log_path):
                os.mkdir(log_path)
            logger.log_to_file(log_path)
    print('1.5 1.5 1.5')
    model = MLP_1D()  # !!!!!!!!! Here it blocks !!!!!!!!!

    print('222')
    train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE // gpc.data_parallel_size)

    criterion = CrossEntropyLoss(label_smoothing=0.1)

    optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
    print('333')
    steps_per_epoch = len(train_dataloader)

    lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
                                           total_steps=gpc.config.NUM_EPOCHS * steps_per_epoch,
                                           warmup_steps=gpc.config.WARMUP_EPOCHS * steps_per_epoch)

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model=model,
                                                                                    optimizer=optimizer,
                                                                                    criterion=criterion,
                                                                                    train_dataloader=train_dataloader,
                                                                                    test_dataloader=test_dataloader,
                                                                                    lr_scheduler=lr_scheduler)

    logger.info("Engine is built", ranks=[0])

    timer = MultiTimer()
    print('444')
    trainer = Trainer(engine=engine, logger=logger, timer=timer)
    logger.info("Trainer is built", ranks=[0])

    hooks = [
        LogMetricByEpochHook(logger=logger),
        LogMetricByStepHook(),
        AccuracyHook(accuracy_func=Accuracy()),
        LossHook(),
        ThroughputHook(),
        LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False)
    ]
    print('555')
    logger.info("Train start", ranks=[0])
    trainer.fit(train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                epochs=gpc.config.NUM_EPOCHS,
                hooks=hooks,
                display_progress=True,
                test_interval=1)


if __name__ == '__main__':
    train_cifar()

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

[BUG] Please update the PyPI version

Describe the bug
I used the colossalai package installed from PyPI; it failed to run the ResNet example!

python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py

Colossalai should be built with cuda extension to use the FP16 optimizer
Colossalai should be built with cuda extension to use the FP16 optimizer
DeepSpeed is required if you want to use ZeRO.
DeepSpeed is required if you want to use ZeRO.
Traceback (most recent call last):
  File "run_resnet_cifar10_with_engine.py", line 7, in <module>
    from colossalai.utils import get_dataloader
ImportError: cannot import name 'get_dataloader' from 'colossalai.utils' (/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/colossalai/utils/__init__.py)
Killing subprocess 1397423
Traceback (most recent call last):
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/jiaruifang/anaconda3/envs/deepalpha/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/jiaruifang/anaconda3/envs/deepalpha/bin/python', '-u', 'run_resnet_cifar10_with_engine.py', '--local_rank=0']' returned non-zero exit status 1.

To Reproduce

pip install colossalai
cd examples/resnet_cifar10_data_parallel
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py

Expected behavior

Screenshots

Environment (please complete the following information):

  • CUDA version:
  • cuDNN version:
  • NCCL version:
  • Python version:
  • PyTorch version:

Additional context

Reformat warmup

Currently, warmup steps and epochs are mixed in lr_scheduler, which is a little confusing.
If we add warmup_step to lr_scheduler_hook, which is more user-friendly for users who need steps spanning multiple epochs, the name 'by_epoch' may need to be made clearer, resulting in changes to other modules and examples.

Duplicated 'mkdir' with MPI backend

๐Ÿ› Describe the bug

The save-checkpoint hook does not check the local rank when calling _ensure_directory_exists(checkpoint_path), causing multiple processes to attempt to create the directory and crash. Log attached.

Traceback (most recent call last):
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 312, in run_commandline
Traceback (most recent call last):
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 312, in run_commandline
    return self.run(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 276, in run
    return self.run(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/run.py", line 238, in __call__
    run()
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/workspace/ColossalAI-Examples/image/vilt/run.py", line 137, in main
    self.result = self.main_function(*args)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/workspace/ColossalAI-Examples/image/vilt/run.py", line 137, in main
    trainer.fit(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 312, in fit
    trainer.fit(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 312, in fit
    self._train_epoch(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 195, in _train_epoch
    self._train_epoch(
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 195, in _train_epoch
    self._call_hooks('after_train_epoch')
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 145, in _call_hooks
    self._call_hooks('after_train_epoch')
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/_trainer.py", line 145, in _call_hooks
    getattr(hook, func)(self)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/hooks/_checkpoint_hook.py", line 61, in after_train_epoch
    getattr(hook, func)(self)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/trainer/hooks/_checkpoint_hook.py", line 61, in after_train_epoch
    save_checkpoint(save_path,
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 163, in save_checkpoint
    save_checkpoint(save_path,
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 163, in save_checkpoint
    _ensure_directory_exists(checkpoint_path)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 76, in _ensure_directory_exists
    _ensure_directory_exists(checkpoint_path)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 76, in _ensure_directory_exists
    os.makedirs(dir)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/os.py", line 225, in makedirs
    os.makedirs(dir)
  File "/workspace/intel/oneapi/intelpython/latest/envs/autoaug/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './ckpt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/ColossalAI-Examples/image/vilt/run.py", line 40, in <module>
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './ckpt'
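Two simple directions for a fix (sketches, not the actual patch): guard the save with a local-rank check, or make the directory creation race-free:

import os

def _ensure_directory_exists(filename):
    dirpath = os.path.dirname(filename)
    if dirpath:
        # exist_ok avoids the crash when several ranks race to create the dir
        os.makedirs(dirpath, exist_ok=True)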

Environment

No response

A small mistake

[screenshot of the README]
The correct git clone command should be:
git clone https://github.com/hpcaitech/ColossalAI.git

[FEATURE] How to prepare WebtextDataset?

Is your feature request related to a problem? Please describe.
I tried to run the GPT-2 example. It uses WebtextDataset. Is there any instruction on data preparation?

[FEATURE] Compatibility with various batch formats

Is your feature request related to a problem? Please describe.
The implementation of Colossal-AI seems to transfer only the tensors in batch (dict) values to the device. However, batch formats can be varied and highly customized (e.g., list-type batches, or minibatch dicts containing lists of tensors). In such cases the batch size cannot be correctly determined, causing errors.

Describe the solution you'd like
The batch_size in BaseSchedule.load_batch() should consider list-type batches and use len() instead of size(). The same applies to BaseSchedule._move_to_device(), which should handle minibatches containing multiple tensors in a list.
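A sketch of the requested size logic (the method shape and names are assumptions about BaseSchedule's internals, not its actual code):

import torch

def get_batch_size(batch):
    if isinstance(batch, torch.Tensor):
        return batch.size(0)
    if isinstance(batch, (list, tuple)):
        # list-type batch: take the length of the first field
        return len(batch[0])
    if isinstance(batch, dict):
        first = next(iter(batch.values()))
        # a dict value may itself be a list of tensors
        return len(first) if isinstance(first, (list, tuple)) else first.size(0)
    raise TypeError(f'unsupported batch type: {type(batch)}')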

[BUG] RuntimeError: Address already in use

Describe the bug
Traceback (most recent call last):
File "train.py", line 132, in
train_cifar()
File "train.py", line 73, in train_cifar
colossalai.launch_from_torch(config=args.config)
File "/home/svu/e0787810/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai/initialize.py", line 217, in launch_from_torch
verbose=verbose)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai/initialize.py", line 101, in launch
gpc.init_global_dist(rank, world_size, backend, host, port)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/colossalai/context/parallel_context.py", line 325, in init_global_dist
init_method=init_method)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 183, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/svu/e078XXXX/.conda/miniconda/envs/huang_test/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: Address already in use

I am running this on NUS HPC. For the hostname, I used the 'hostname -l' instruction to get four IPs and tried one of them; for the port, I randomly chose a number. I am not sure how to find an acceptable address.

[Discussion] About 3D Parallelism

I read the paper Maximizing Parallelism in Distributed Training for Huge Neural Networks. The idea is elegant and does make sense to me. However, I wonder about the compatibility of this method with gradient checkpointing (I mentioned it in #117; we call it GC afterward).

Using 3D parallelism, we have to conduct an all-gather of the activations across (N/P^2) processors (a partial collective communication), where N is the number of GPUs for the 3-D linear layer. At least three such partial collective communications have to be done: during the forward pass, the backward pass, and the recomputation of activations during backward when using GC. Therefore, it introduces more communication overhead compared with model parallelism that does not split activations. Did you consider this overhead in the experiment section of the paper?

Also, activation tensors are small. If you partition an activation tensor into N pieces and send/recv at the granularity of a single piece, won't the bandwidth utilization be extremely low? This is different from communication on parameters: we can pack several layers' parameter tensors and send/recv them in a larger volume to better utilize network bandwidth, but activations come one after another; you cannot treat them the same as parameter tensors.

PS: a small typo in the arXiv paper, page 5, 1st line: Bij = [lnp : lnp + np + 1]

LAMB is not suited for tensor parallelism

The current LAMB optimizer implementation does not support tensor parallelism, as it needs to compute the norm of the whole weight matrix. This is incompatible with tensor parallelism because the tensor is split across ranks.
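One possible direction (a sketch, not the library's implementation), assuming each rank holds a shard of the weight: all-reduce the squared local norm across the tensor-parallel group before the square root, so every rank sees the full-matrix norm for the trust ratio.

import torch
import torch.distributed as dist

def sharded_norm(local_shard, tp_group):
    """Norm of the full weight matrix, computed from one rank's shard."""
    sq = local_shard.pow(2).sum()
    dist.all_reduce(sq, op=dist.ReduceOp.SUM, group=tp_group)
    return sq.sqrt()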

Need more runtime hooks during a training step

Describe the feature

In the PyTorch fashion, we usually train a model like

for x, y in dataloader:
    ... # do something before forward
    out = model(x)
    loss = criterion(out, y)
    ... # do something between forward and backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ... # do something after backward

In the Colossal-AI trainer, hooks can only be added before and after a training step; users cannot customize behavior between fetching an input batch and the forward pass, or between the forward and backward passes.
Also, since the OpHook is applied to modules recursively, it is not appropriate for this issue either. We may need to add at least the two extra hooks mentioned above.
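A sketch of the two extra hook points (the method names are proposals, not the current API):

class BaseHook:
    # existing before/after train-iteration hook points omitted

    def after_load_batch(self, trainer, data, label):
        """Runs between fetching an input batch and the forward pass."""

    def before_backward(self, trainer, output, loss):
        """Runs between the forward pass and the backward pass."""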
