eleutherai / oslo

OSLO: Open Source for Large-scale Optimization

Python 50.40% C++ 24.77% Cuda 24.01% C 0.24% CMake 0.58%

oslo's Introduction

What is OSLO about?

OSLO is a framework that provides various GPU-based optimization technologies for large-scale modeling. Its key features, such as 3D parallelism and kernel fusion, are useful when training large models. OSLO makes these technologies easy to use through seamless compatibility with Hugging Face Transformers, which is considered the de facto standard in the NLP field. By significantly lowering the difficulty of using these technologies, we hope OSLO will help democratize large-scale modeling.

Installation

OSLO can be easily installed using the pip package manager. Note that the PyPI project name includes 'core'.

pip install oslo-core

Administrative Notes

Citing OSLO

If you find our work useful, please consider citing:

@misc{oslo,
  author       = {},
  title        = {OSLO: Open Source for Large-scale Optimization},
  howpublished = {\url{https://github.com/EleutherAI/oslo}},
  year         = {2022},
}

Licensing

The OSLO code is licensed under the terms of the Apache License 2.0.

oslo's Issues

Milestone: Make `oslo-examples` repository

We'll create an oslo-examples repository for users.

TODO: Test TP + PP

Describe a TODO feature

Test TP + PP on 4 GPUs

Assignees

@hyunwoongko @ohwi

Add DistributedSampler to DDP

Describe a TODO feature

The current test code uses a general DataLoader, which feeds duplicated data to every process. We need to switch to DistributedSampler for the DDP case.
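A minimal sketch of that change, assuming a two-process DDP run. The toy dataset, the `make_loader` helper, and the batch size are illustrative and are not OSLO's test code. Passing `num_replicas` and `rank` explicitly lets the sampler be built without an initialized process group, which keeps the example runnable in a single process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, rank, world_size, batch_size=2):
    """Build a DataLoader whose sampler gives each rank a disjoint shard."""
    # shuffle=False keeps the example deterministic; training code would
    # normally shuffle and call sampler.set_epoch(epoch) every epoch.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


dataset = TensorDataset(torch.arange(8))
shards = []
for rank in range(2):
    loader = make_loader(dataset, rank=rank, world_size=2)
    # Collect the sample values that this rank would see in one epoch.
    shards.append([int(x) for (batch,) in loader for x in batch])
```

With a plain DataLoader every rank would iterate over all eight samples; here each rank sees only its own half.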

Assignees

TODO: Redesign DDP module

Describe a TODO feature

Redesign the DDP module as functional

Assignees

@hyunwoongko

Syntax error in mapping_utils.py when installing OSLO

How to reproduce

python setup.py install

Environment

OS : CentOS 7.9
Python version : 3.9
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

Extracting oslo_core-3.0.0-py3.7.egg to /opt/conda/lib/python3.7/site-packages
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/transformers/mapping_utils.py", line 141
    OPT=[
       ^
SyntaxError: invalid syntax

I changed it as follows to continue my tests:

    "OPT": [
        Column("q_proj", "k_proj", "v_proj", "fc1"),
        Row("out_proj", "fc2"),
        Update("embed_dim", "num_heads"),
        Head("lm_head", "score"),
    ]

Implement vocab parallel cross-entropy loss

Describe a TODO feature

Add vocab parallel cross-entropy loss

Assignees

SP parameter device type error

How to reproduce

Environment

OS : CentOS 7.9
Python version : 3.7
Transformers version : 4.21.3
Whether to use Docker:
Misc.:

Description

model_no_sp = GPT2LMHeadModel(GPT2Config.from_pretrained(configs["model_name"])).cuda()
model_sp = GPT2LMHeadModel(GPT2Config.from_pretrained(configs["model_name"]))

model_sp = SequenceDataParallel(
    model_sp,
    parallel_context=parallel_context,
)

The error comes from the __init__ of _DistributedDataParallel, because the parameters are on the CPU rather than the GPU.

The device_type check on parameters in _DistributedDataParallel needs to be removed.

'TrainingArguments' object has no attribute 'parallel_mode' when running mBart test

How to reproduce

python ./tests/transformers/models/mbart/test_training.py

Environment

OS : CentOS 7.9
Python version : 3.9
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

python ./tests/transformers/models/mbart/test_training.py Reusing dataset glue (/root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|███████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 682.67it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 68/68 [00:01<00:00, 52.15ba/s]
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 43.15ba/s]
100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 42.94ba/s]
You are using a model of type bart to instantiate a model of type mbart. This is not supported for all configurations of models and can yield errors.
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['encoder.layer_norm.bias', 'decoder.layer_norm.weight', 'encoder.layer_norm.weight', 'decoder.layer_norm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are using a model of type bart to instantiate a model of type mbart. This is not supported for all configurations of models and can yield errors.
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['encoder.layer_norm.bias', 'decoder.layer_norm.weight', 'encoder.layer_norm.weight', 'decoder.layer_norm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
PyTorch: setting up devices
Traceback (most recent call last):
  File "./tests/transformers/models/mbart/test_training.py", line 94, in <module>
    fp16=False,
  File "./tests/transformers/models/mbart/test_training.py", line 44, in train
    eval_dataset=dataset["validation"],
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/transformers/trainer.py", line 186, in __init__
    if len(args.parallel_mode) > 0:
AttributeError: 'TrainingArguments' object has no attribute 'parallel_mode'

The problem seems to be that the parallel_mode property in training_args.py (line 989) is commented out:

# @property
# def parallel_mode(self):
#     """
#     The current mode used for parallelism if multiple GPUs/TPU cores are available. One of:
#
#     - ParallelMode.NOT_PARALLEL: no parallelism (CPU or one GPU).
#     - ParallelMode.NOT_DISTRIBUTED: several GPUs in one single process (uses torch.nn.DataParallel).
#     - ParallelMode.DISTRIBUTED: several GPUs, each having its own process (uses
#       torch.nn.DistributedDataParallel).
#     - ParallelMode.TPU: several TPU cores.
#     """
#     # if is_torch_tpu_available():
#     #     return ParallelMode.TPU
#     # elif is_sagemaker_mp_enabled():
#     #     return ParallelMode.SAGEMAKER_MODEL_PARALLEL
#     # elif is_sagemaker_dp_enabled():
#     #     return ParallelMode.SAGEMAKER_DATA_PARALLEL
#     if self.local_rank != -1:
#         return ParallelMode.DISTRIBUTED
#     elif self.n_gpu > 1:
#         return ParallelMode.NOT_DISTRIBUTED
#     else:
#         return ParallelMode.NOT_PARALLEL

TODO: Redesign ZeRO modules

Describe a TODO feature

Currently we use a copy of FairScale's code to perform ZeRO. We need to analyze this code further and refactor it into a functional design.

Assignees

@minqukanq

TODO: Switch the expert parallel code base from ColossalAI to DeepSpeed

Describe a TODO feature

The existing code is based on ColossalAI, but it is not suitable for combining multiple kinds of parallelism.
Switch the expert parallel code base to DeepSpeed.

Assignees

@scsc0511

Error when installing OSLO

How to reproduce

python setup.py install

Environment

OS : CentOS 7.9
Python version : 3.9
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

# python setup.py install
Traceback (most recent call last):
  File "setup.py", line 18, in <module>
    long_description=open("README.md").read(),
FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

Workaround applied locally: add a README.md.
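Another way to avoid the crash is to make setup.py tolerate a missing README. The sketch below is an assumption about how that could look, not the fix adopted by the project; `read_long_description` is a hypothetical helper name.

```python
from pathlib import Path


def read_long_description(path="README.md"):
    """Return the README text, or an empty string if the file is missing."""
    readme = Path(path)
    if readme.is_file():
        return readme.read_text(encoding="utf-8")
    # Installing from a source tree without README.md should not fail.
    return ""
```

In setup.py the call `long_description=open("README.md").read()` would then become `long_description=read_long_description()`.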

Add mapping for OSLO models

Describe a TODO feature

Add a mapping for OSLO models to test the vocab parallel cross-entropy loss

Assignees

fused_bias_gelu is missing when calling BertModel

How to reproduce

python test_modeling_bert.py

Environment

OS : Ubuntu
Python version : 3.7.14
Transformers version : 4.22.1
Whether to use Docker: No
Misc.:
This bug was caused by #30, which removed all fused kernels; bert and roberta still use onn.fused_bias_gelu.

Error message

AttributeError: module 'oslo.torch.nn' has no attribute 'fused_bias_gelu'

TODO: clean data communication functions for PP

Describe a TODO feature

Currently, PP only supports modules whose return type is a tuple.
Other return types need to be supported.

Assignees

@ohwi

Implement vocab parallel cross entropy loss

Describe a requested feature

Implement cross entropy for vocab-parallel logits in 1D, 2D, 2.5D, and 3D tensor parallelism.
Implement test code.

Expected behavior

>>> criterion = VocabParallelCrossEntropyLoss(parallel_context)
>>> loss = criterion(vocab_parallel_logits, targets)
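To illustrate the idea behind the request, here is a single-process simulation of the 1D case in plain Python. It is a sketch under stated assumptions, not OSLO's implementation: each inner list stands for the vocabulary slice held by one rank, and the comments mark where a real implementation would use collective communication. Both function names are hypothetical.

```python
import math


def vocab_parallel_cross_entropy(logit_shards, target):
    """Cross entropy for one token whose logits are split along the vocab axis.

    logit_shards: list of lists; shard r holds a contiguous vocabulary slice.
    target: global vocabulary index of the correct token.
    """
    # Step 1: global max for numerical stability (an all-reduce MAX in practice).
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each rank sums exp(logit - max) locally, then the partial sums
    # are added together (an all-reduce SUM in practice).
    sum_exp = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in logit_shards
    )

    # Step 3: only the rank that owns the target index contributes its logit;
    # every other rank contributes 0 (again an all-reduce SUM in practice).
    target_logit = 0.0
    offset = 0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)

    # loss = log(sum_j exp(z_j)) - z_target
    return math.log(sum_exp) + global_max - target_logit


def reference_cross_entropy(logits, target):
    """Unsharded cross entropy, used to check the sharded version."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits)) + m - logits[target]
```

The sharded result should match the unsharded one, which is the property the requested test code would need to check for each tensor-parallel layout.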

Rename SequenceDataParallel to SequenceParallel

Describe a TODO feature

SequenceParallel looks better than SequenceDataParallel

Assignees

@hyunwoongko

Refactoring transformers wrap

Describe a TODO feature

Refactoring transformers wrap

TODO: Fix pipeline parallelism bugs

Describe a TODO feature

Currently, when pipeline parallelism is run on a large model, the gradient values come out different. This issue should be addressed.

Assignees

@ohwi

Milestone: OSLO 2.1

We'll release OSLO 2.1 here before pipeline parallelism is complete. And we will integrate the features available in version 2.1 into Hugging Face Transformers. cc @stas00

Apply vocab parallel cross entropy for oslo models

Describe a requested feature

Apply vocab parallel cross entropy for oslo models.

fix _FullyShardedDataParallelMapping when running test_fsdp.py

How to reproduce

python -m torch.distributed.run --nproc_per_node=2 --master_port=2333 ./tests/torch/nn/parallel/data_parallel/test_fsdp.py

Environment

OS :
Python version :
Transformers version :
Whether to use Docker:
Misc.:

The problem is in _fsdp, where _FullyShardedDataParallelMappingForHuggingFace is used instead of _FullyShardedDataParallelMapping:

from oslo.transformers.mapping_utils import (
    _FullyShardedDataParallelMappingForHuggingFace,
)

FSDP returns different loss values with ZeRO stages 2 and 3

How to reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nnodes=1 --nproc_per_node=2  ./tests/torch/nn/parallel/data_parallel/test_fsdp.py --zero-stage 2

Environment

OS : ubuntu18.04
Python version : python3.7
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

Add FullyShardedDataParallelMapping

TODO : Deparallelize expert parallel

Describe a TODO feature

Deparallelize the expert parallelized model

Assignees

scsc0511

TODO: Save Deparallelized Expert Parallel Model

Describe a TODO feature

Save Deparallelized Expert Parallel Model

Assignees

@scsc0511

TODO: Deparallelize Pipeline Parallel

Describe a TODO feature

Deparallelization of Pipeline Parallel

Assignees

@ohwi

TODO: refactor tasks module

PatrickStar for ZeRO

Describe a TODO feature

Add the PatrickStar chunk manager for the ZeRO case

Assignees

No module named oslo.transformers.data when running mBart test.

How to reproduce

python ./tests/transformers/models/mbart/test_training.py

Environment

OS : CentOS 7.9
Python version : 3.9
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

# python ./tests/transformers/models/mbart/test_training.py
 
Traceback (most recent call last):
  File "./tests/transformers/models/mbart/test_training.py", line 2, in <module>
    from oslo.transformers.trainer import Trainer as OTrainer
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/transformers/trainer.py", line 37, in <module>
    from .data.data_collator import (
ModuleNotFoundError: No module named 'oslo.transformers.data'

I added an __init__.py to the oslo/transformers/data folder to continue with the tests.

Make sequence parallel splitting automatic

Describe a TODO feature

Make sequence parallel splitting automatic

Assignees

@hyeinhyun

Use oslo activation checkpointing rather than torch activation checkpointing

fused_scale_mask_softmax on GPT2 model

Describe a TODO feature

The current implementation does not use the scale part of fused_scale_mask_softmax.
Change it so that the fused kernel is used only in the path that is not reorder_and_upcast.

Assignees

@loopinf

Integrate ZeroDDP and ShardedModelv2 from ColossalAI

Describe a TODO feature

There are two versions of ZeRO support: ZeroDDP and ShardedModelv2.

Check whether the two can be merged into one.
Otherwise, provide a function that chooses one of them based on a flag (not exposed to users directly).

Assignees

Dongsung and Hyen

Modify many structures for 2.1

Add a description of how to use fused_scale_mask_softmax

Describe a TODO feature

It is hard to know how to use the fused scale mask softmax:
- It is unclear what the scale value is and how it is used in the attention layer.
- There is no test case that checks the result for a scale value other than 1.0.
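As a starting point for such a description, the sketch below shows an unfused reference for what a scale-mask-softmax is generally expected to compute. It is not OSLO's kernel: the function name, the mask convention (True means "ignore this key"), and the use of negative infinity for masked positions are assumptions. Real kernels often fill masked positions with a large negative value instead.

```python
import math


def scale_mask_softmax(scores, mask, scale=1.0):
    """Unfused reference: softmax(scale * scores) with masked positions removed.

    scores: raw attention scores for one query (list of floats).
    mask:   list of bools, True where the key must be ignored.
    scale:  multiplier applied before the softmax, e.g. 1 / sqrt(head_dim).
    Assumes at least one position is left unmasked.
    """
    scaled = [
        -math.inf if masked else scale * s for s, masked in zip(scores, mask)
    ]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


head_dim = 64
probs = scale_mask_softmax(
    scores=[8.0, 16.0, 24.0],
    mask=[False, False, True],
    scale=1.0 / math.sqrt(head_dim),  # the usual attention scaling factor
)
```

Comparing a fused kernel against a reference like this with a scale other than 1.0 would cover the missing test case mentioned above.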

Assignees

TODO: Make DP + EP available

Describe a TODO feature

The Expert Parallelism (MoE) feature we currently have cannot be used with data parallelism. We'll make it work with data parallelism and adopt a new design that can further reduce the communication volume by a factor of 1.5.

Assignees

@scsc0511

Fix slurm local world size

local world size = SLURM_GPUS_ON_NODE
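A small sketch of reading that value, assuming the job runs under SLURM and exports `SLURM_GPUS_ON_NODE`. The helper name and the fallback of 1 are assumptions, not OSLO code.

```python
import os


def local_world_size(env=None, default=1):
    """Number of processes to run on this node, taken from SLURM if present."""
    env = os.environ if env is None else env
    value = env.get("SLURM_GPUS_ON_NODE")
    # Fall back to `default` when not running under SLURM or when the
    # variable is unset (for example, a job that requested no GPUs).
    return int(value) if value else default
```

Accepting the environment as a parameter keeps the function easy to test without a SLURM allocation.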

coloDDP integration

Describe a TODO feature

coloDDP needs to be integrated for PatrickStar and zeroDDP.

- Port the coloDDP class
- Add test code for coloDDP

Assignees

Dongsung Kim

TODO: Modify wrapper design to functional

Describe a TODO feature

When multiple parallelizations are overlapped, the wrapper-style design leads to several undesirable results.

Design notes

1. The old design

class TensorParallel:
    def __init__(self, model, ...):
        self.module = model
        self.xxx_for_tp = xxx

class PipelineParallel:
    def __init__(self, model, ...):
        self.module = model
        self.yyy_for_pp = yyy

model = XXXModel.from_pretrained(...)
model = TensorParallel(model)
model = PipelineParallel(model)

2. Problems

2.1. Accessibility

model.module.module.module.xxx_for_tp <--- it's too bad.
model.generate <--- unavailable
model.save_pretrained <--- unavailable

2.2. Checkpoint

"transformer.0.attn.q_proj.weight" => "module.module.module.transformer.0.attn.q_proj.weight"

3. New design: class-like functions

def TensorParallel(model, parallel_context, ...):
    # do something
    return model
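The difference between the two designs can be shown with a toy example. Everything here is hypothetical and for illustration only: `ToyModel` stands in for a Hugging Face model, `WrapperTP` mimics the old wrapper style, and `tensor_parallel` with its `oslo_parallel` attribute is one possible shape for the functional style.

```python
class ToyModel:
    """Stand-in for a Hugging Face model (hypothetical, for illustration)."""

    def __init__(self):
        self.weights = {"transformer.0.attn.q_proj.weight": [0.0]}

    def state_dict(self):
        return dict(self.weights)

    def generate(self):
        return "generated"


class WrapperTP:
    """Old style: the wrapper hides the model behind `.module`."""

    def __init__(self, model):
        self.module = model

    def state_dict(self):
        # Every wrapper layer adds another "module." prefix to checkpoint keys.
        return {f"module.{k}": v for k, v in self.module.state_dict().items()}


def tensor_parallel(model, parallel_context=None):
    """New style: annotate the model in place and return the same object."""
    model.oslo_parallel = {"tensor": parallel_context}  # hypothetical attribute
    return model


wrapped = WrapperTP(WrapperTP(ToyModel()))
functional = tensor_parallel(ToyModel(), parallel_context="ctx")
```

The doubly wrapped model loses `generate` and prefixes its checkpoint keys with `module.module.`, while the functional version keeps the original methods and key names.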

Assignees

@jason9693

wrap transformers layer

Describe a TODO feature

Wrapping transformers layer for FSDP

Fix some code errors

rename ddp -> _ddp, fsdp -> _fsdp.
wrapper loading from save_pretrained.
remove tracing inputs in PP functional wrapper.

Change bert model to use `_fused_scale_mask_softmax` functions

Describe a TODO feature

The current implementation in oslo/transformers/models/bert/modeling_bert.py does not use the kernel fusion function.
It should be changed to use kernel fusion where possible.

Assignees

loopinf

No module named 'oslo.torch.experimental' when running mbart test

How to reproduce

python ./tests/transformers/models/mbart/test_training.py

Environment

OS : CentOS 7.9
Python version : 3.9
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

File "./tests/transformers/models/mbart/test_training.py", line 2, in <module>
    from oslo.transformers.trainer import Trainer as OTrainer
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/transformers/trainer.py", line 27, in <module>
    from oslo.torch.nn.parallel.data_parallel import (
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/torch/nn/parallel/__init__.py", line 1, in <module>
    from oslo.torch.nn.parallel.data_parallel import *
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/torch/nn/parallel/data_parallel/__init__.py", line 4, in <module>
    from oslo.torch.nn.parallel.data_parallel.fully_sharded_data_parallel import (
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/torch/nn/parallel/data_parallel/fully_sharded_data_parallel.py", line 56, in <module>
    from oslo.torch.nn.parallel.data_parallel._flatten_params_wrapper import (
  File "/opt/conda/lib/python3.7/site-packages/oslo_core-3.0.0-py3.7.egg/oslo/torch/nn/parallel/data_parallel/_flatten_params_wrapper.py", line 34, in <module>
    from oslo.torch.experimental.nn.ssd_offload import SsdFlatParameter
ModuleNotFoundError: No module named 'oslo.torch.experimental'

I added an __init__.py to the experimental folder to continue with the tests.

Error on test_modeling_bert.py

How to reproduce

python tests/transformers/models/bert/test_modeling_bert.py

Environment

OS : Amazon Linux 2
Python version : 3.7.10
Transformers version : 4.21.3
Whether to use Docker: No
Misc.: slurm interactive

    return forward_call(*input, **kwargs)
  File "/fsx/loopinf/oslo-1/oslo/torch/nn/modules/linear.py", line 32, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_addmm)

Fix sorting error at allocate_param function

When sorting the dictionary, use sorted(dict.items(), key=lambda item: str(item[0])) rather than sorted(dict, key=lambda x: x[0]), because the keys are enums, which are not comparable and therefore need to be converted to str.
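The behavior can be reproduced with a small standalone example. The `Mode` enum and the `sizes` dictionary are hypothetical stand-ins, not the actual keys used in allocate_param.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical members, only for illustration.
    TENSOR = "tensor"
    DATA = "data"
    PIPELINE = "pipeline"


sizes = {Mode.TENSOR: 2, Mode.DATA: 4, Mode.PIPELINE: 1}

# Plain Enum members do not define "<", so sorting the keys directly fails.
try:
    sorted(sizes)
    plain_sort_failed = False
except TypeError:
    plain_sort_failed = True

# Converting each key to str gives a well-defined, stable order.
ordered = sorted(sizes.items(), key=lambda item: str(item[0]))
```

Sorting by `str(item[0])` orders the entries by the member's qualified name, such as "Mode.DATA".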

wandb module not found when running test_mlm.py

How to reproduce

python ./tests/transformers/models/electra/test_mlm.py

Environment

OS : CentOS 7.9
Python version : 3.9
Transformers version : 4.21.2
Whether to use Docker:
Misc.:

python ./tests/transformers/models/electra/test_mlm.py 
Traceback (most recent call last):
  File "./tests/transformers/models/electra/test_mlm.py", line 8, in <module>
    import wandb
ModuleNotFoundError: No module named 'wandb'

I'm using a clean machine for these tests, and wandb is not installed on it.
If wandb is a required library, should it be included in setup.py?
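If wandb is meant to stay optional, one common pattern is to guard the import so the test still loads without it. This is a sketch of that pattern, not what the test file currently does; `is_available` is a hypothetical helper.

```python
import importlib.util


def is_available(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_available("wandb"):
    import wandb  # only imported when the optional dependency is installed
else:
    wandb = None  # callers must check for None before logging
```

The alternative is to declare wandb as a dependency, or as an optional extra, in setup.py so that it is installed together with the package.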

pass optimizer parameters

Describe a TODO feature

Passing multiple parameters to ZeroRedundancyOptimizer

Assignees

@minqukanq

No _TensorParallelMappingForHuggingFace

How to reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nnodes=1 --nproc_per_node=2  ./tests/torch/nn/parallel/data_parallel/test_ddp.py

The bug comes from the latest change, which renamed _TensorParallelMappingForHuggingFace to _ParallelMapping. It occurs when parallel_context is called (a tensor_parallel import issue).

Environment

OS : 18.04
Python version : 3.7
Transformers version : 4.21.2
Whether to use Docker:
Misc.:
