
ort's Introduction

A library for developing and deploying PyTorch models using ONNX Runtime.




Introduction

A library for accelerating PyTorch models using ONNX Runtime:

  • torch-ort to train PyTorch models faster with ONNX Runtime
  • moe (Mixture of Experts) to scale large models and improve their quality
  • torch-ort-infer to perform inference on PyTorch models with ONNX Runtime and Intel® OpenVINO™

🚀 Installation

Install for training

Prerequisites

You need a machine with at least one NVIDIA or AMD GPU to run ONNX Runtime for PyTorch.

You can install and run torch-ort in your local environment, or with Docker.

Install in a local Python environment

  1. Install CUDA

  2. Install CuDNN

  3. Install torch-ort

    • pip install torch-ort
  4. Run post-installation script for ORTModule

    • python -m torch_ort.configure

Get install instructions for other combinations in the Get Started Easily section at https://www.onnxruntime.ai/ under the Optimize Training tab.
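
Before running the full verification script below, a quick sanity check (a minimal sketch, assuming the install and configure steps completed) is to confirm that the packages import cleanly:

# Minimal sanity check: these imports should succeed after installation and
# configuration; the versions are useful to include in bug reports.
import torch
import onnxruntime
from torch_ort import ORTModule

print("torch:", torch.__version__)
print("onnxruntime:", onnxruntime.__version__)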

Verify your installation

  1. Clone this repo

  2. Install extra dependencies

    • pip install wget pandas scikit-learn transformers
  3. Run a test training script

    • python ./ort/tests/bert_for_sequence_classification.py

Install Mixture of Experts

The Mixture of Experts layer implementation is available in the ort_moe folder.

Clone this repo

git clone https://github.com/pytorch/ort.git

Build MoE

cd ort_moe
python setup.py install

Install for Inference

Prerequisites

  • Ubuntu 18.04, 20.04
  • Python 3.7, 3.8, or 3.9

Install in a local Python environment

  • pip install torch-ort-infer[openvino]
  • Run the post-installation configuration script: python -m torch_ort.configure

Verify your installation

  1. Clone this repo

  2. Install extra dependencies

    • pip install wget pandas scikit-learn transformers
  3. Run a test script

    • python ./torch_ort_inference/tests/bert_for_sequence_classification.py

📈 Training

The torch-ort library accelerates training of large transformer PyTorch models, reducing training time and GPU cost with only a few lines of code changed. It is built on the proven technologies of ONNX Runtime and the ONNX format, and it includes the ONNX Runtime Optimizer and Data Sampler.

Add ONNX Runtime for PyTorch to your PyTorch training script

from torch_ort import ORTModule
model = ORTModule(model)
# PyTorch training script follows
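
For context, here is a minimal end-to-end sketch; the model, data, and hyperparameters are placeholders for illustration, not part of the library:

import torch
from torch_ort import ORTModule

device = "cuda"  # torch-ort training targets NVIDIA/AMD GPUs (see prerequisites)
# A placeholder model; any torch.nn.Module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
).to(device)
model = ORTModule(model)  # the only torch-ort-specific line
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(32, 784, device=device)              # dummy inputs
target = torch.randint(0, 10, (32,), device=device)  # dummy labels
loss = torch.nn.functional.cross_entropy(model(x), target)
loss.backward()
optimizer.step()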

Usage of FusedAdam and FP16 Optimizer (Optional)

import torch
from torch_ort.optim import FusedAdam
class NeuralNet(torch.nn.Module):
    ...
# FusedAdam currently supports GPU only.
device = "cuda"
model = NeuralNet(...).to(device)
ort_fused_adam_optimizer = FusedAdam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.01, eps=1e-8
)

# To use FP16_Optimizer, add these lines:
from torch_ort.optim import FP16_Optimizer
ort_fused_adam_optimizer = FP16_Optimizer(ort_fused_adam_optimizer)


loss = model(...).sum()
loss.backward()
ort_fused_adam_optimizer.step()
ort_fused_adam_optimizer.zero_grad()

For detailed documentation see FusedAdam

For a full working example see FusedAdam Test Example

FP16_Optimizer is a simple wrapper that replaces the less efficient FP16 optimizer implementations provided by libraries such as Apex, DeepSpeed, and Megatron-LM.

For detailed documentation see FP16 Optimizer

Usage of LoadBalancingDistributedSampler

import torch
from torch.utils.data import DataLoader
from torch_ort.utils.data import LoadBalancingDistributedSampler

class MyDataset(torch.utils.data.Dataset):
    ...

def collate_fn(data):
    ...
    return samples, label_list

def complexity_fn(sample):
    # Example complexity measure (for illustration): rank samples by length.
    return len(sample)

samples = [...]
labels = [...]
dataset = MyDataset(samples, labels)
data_sampler = LoadBalancingDistributedSampler(
    dataset, complexity_fn=complexity_fn, world_size=2, rank=0, shuffle=False
)
train_dataloader = DataLoader(dataset, batch_size=2, sampler=data_sampler, collate_fn=collate_fn)
for batched_data, batched_labels in train_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(batched_data), batched_labels)
    loss.backward()
    optimizer.step()
    

For detailed documentation see LoadBalancingDistributedSampler

For a full working example see LoadBalancingDistributedSampler Test Example

Samples

To see torch-ort in action, see https://github.com/microsoft/onnxruntime-training-examples, which shows you how to train the most popular HuggingFace models.

🤓 Mixture of Experts

To run MoE, add the layer to your model as described in the tutorial: ort_moe/docs/tutorials/moe_tutorial.py

For more details, see ort_moe/docs/moe.md

Note: ONNX Runtime is not required to run the MoE layer; it works with standalone PyTorch.

🎯 Inference

ONNX Runtime for PyTorch supports PyTorch model inference using ONNX Runtime and Intel® OpenVINO™.

It is available via the torch-ort-infer python package. This package enables the OpenVINO™ Execution Provider for ONNX Runtime by default, accelerating inference on various Intel® CPUs, Intel® integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).

Supported Execution Providers

  • OpenVINO™

Provider Options

Users can configure different options for a given Execution Provider to run inference. As an example, OpenVINO™ Execution Provider options can be configured as shown below:

from torch_ort import ORTInferenceModule, OpenVINOProviderOptions
provider_options = OpenVINOProviderOptions(backend = "GPU", precision = "FP16")
model = ORTInferenceModule(model, provider_options = provider_options)

# PyTorch inference script follows

List of Provider Options

Supported backend-precision combinations:

Backend | Precision
--------|----------
CPU     | FP32
GPU     | FP32
GPU     | FP16
MYRIAD  | FP16

If no provider options are specified by the user, the OpenVINO™ Execution Provider is enabled with the following options by default:

backend = "CPU"
precision = "FP32"

For more details on APIs, see usage.md.

Code Sample

Below is an example of how you can leverage OpenVINO™ integration with Torch-ORT in a simple NLP use case.

A pretrained BERT model fine-tuned on the CoLA dataset from HuggingFace model hub is used to predict grammar correctness on a given input text.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
from torch_ort import ORTInferenceModule
tokenizer = AutoTokenizer.from_pretrained(
            "textattack/bert-base-uncased-CoLA")
model = AutoModelForSequenceClassification.from_pretrained(
        "textattack/bert-base-uncased-CoLA")
# Wrap model in ORTInferenceModule to prepare the model for inference using OpenVINO Execution Provider on CPU
model = ORTInferenceModule(model)
text = "Replace me any text by you'd like ."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Post processing
logits = output.logits
logits = logits.detach().cpu().numpy()
# predictions
pred = np.argmax(logits, axis=1).flatten()
print("Grammar correctness label (0=unacceptable, 1=acceptable)")
print(pred)

Samples

To see OpenVINO™ integration with Torch-ORT in action, see demos, which shows you how to run inference on some of the most popular Deep Learning models.

🤝 Contribute

Please refer to our contributing guide for more information on how to contribute!

License

This project has an MIT license, as found in the LICENSE file.

ort's People

Contributors

abock, adamlouly, askhade, asolms, baijumeswani, chandru-r, dependabot[bot], devang-ml, er3x3, esaliya, jingyanwangms, lganesan-intel, liqunfu, microsoftopensource, natke, prasanthpul, rui-ren, saipj, satyajandhyala, seemethere, sheetalarkadam, snnn, souptc, spandantiwari, suffiank, vyedin, wezuo, yihonglyu, ynimmaga, yunqiuguo


ort's Issues

Fallback not kicking in with non-contiguous tensors

I tried running the following training code: https://github.com/natke/onnxruntime-training-examples/blob/034a5b73ce804d55c120308804fda6b08b016a8d/orttrainer/getting-started/train_ort.py (added ORTModule to the previous getting started example)

It fails when the input tensor is non-contiguous but fallback is not getting initiated: https://github.com/natke/onnxruntime-training-examples/blob/034a5b73ce804d55c120308804fda6b08b016a8d/orttrainer/getting-started/train_ort.py#L112, even when the policy is explicitly set to FALLBACK_FORCE_TORCH_FORWARD.

Error message

File "train_ort.py", line 168, in <module>
train(model)
File "train_ort.py", line 116, in train
output = model(data, src_mask)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/ortmodule.py", line 81, in _forward
return self._torch_module.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_torch_module_ort.py", line 32, in _forward
return self._execution_manager(self.is_training()).forward(*inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 265, in forward
override_policy=_FallbackPolicy.FALLBACK_FORCE_TORCH_FORWARD)
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 194, in handle_exception
raise exception
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 256, in forward
self._device)))
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 149, in forward
*inputs)
File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 42, in execution_session_run_forward
forward_inputs.push_back(to_dlpack(input), input.dtype == torch.bool)
RuntimeError: /onnxruntime_src/onnxruntime/core/dlpack/dlpack_converter.cc:223 OrtValue onnxruntime::dlpack::DlpackToOrtValue(DLManagedTensor*, bool) IsContiguousTensor(dlpack->dl_tensor) was false. ORT only supports contiguous tensor for now.
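
A possible workaround (a sketch, not confirmed in the thread) is to make the offending input contiguous before the forward call:

# Hypothetical workaround: ORT requires contiguous tensors, so copy the
# input into contiguous memory before calling the ORTModule-wrapped model.
data = data.contiguous()
output = model(data, src_mask)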

MaxPool op resolved as Aten OP

I created a small model training script to test out the maxpool gradient op for the OneDNN EP, using the model definition below. But for some reason, maxpool was resolved to an ATen op in the ONNX graph. Is there a way to force torch-ort to use MaxPool instead of the ATen op (maybe by disabling the use of ATen ops)?

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.maxpool1 = nn.MaxPool2d(2)
        self.maxpool2 = nn.MaxPool2d(2)
        #self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(self.maxpool1(self.conv1(x)))
        x = F.relu(self.maxpool2(self.conv2(x)))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x)

torch-ort cannot be installed on windows: onnxruntime-training not found

When running pip install torch-ort in a conda environment on Windows, I get the following error:

ERROR: Could not find a version that satisfies the requirement onnxruntime-training (from versions: none)
ERROR: No matching distribution found for onnxruntime-training

However, if I run the same command in a conda environment in WSL, it works just fine. Other people could repro. Seems to be a Windows issue.

Will there be new nightly builds with version 1.13.0.dev?

Currently, the last nightly package on https://download.onnxruntime.ai/torch_ort_nightly.html is torch_ort-1.12.0.dev20220719-py3-none-any.whl.

So, installing the nightly version of torch-ort with python -m pip install --pre torch-ort -f https://download.onnxruntime.ai/torch_ort_nightly.html just installs torch-ort 1.12.0 from PyPI.

Do you still plan to release nightly builds with updated versions?

Seg fault while training model with maxpool op

I wanted to test maxpool op using a training script. From the inputs given in a previous issue, I had disabled maxpool to AtenOp bindings so that the resulting graph would result in a maxpool and maxpoolgrad op instead of Aten. But this resulted in segfault while running maxpoolgrad op. Here is my model:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x)

My question is: Why is the ATen op conversion needed? How was it decided that a particular set of ops needed to be bound to ATen ops?

If you need, I can provide the entire training script.

Does it support TensorRT backend?

Hello, great job! From the README it seems only the CUDA and OpenVINO backends are supported, but what about the TensorRT backend, which ONNX Runtime uses by default on NVIDIA GPUs? Do you have any benchmarks of the speedup? Do you aim to provide a solution for deploying transformers? What is the roadmap?

Pytorch lightning

Do you know if ort works with Pytorch lightning?

I am trying it but I am getting:

raise new_exception(raised_exception) from raised_exception

onnxruntime.training.ortmodule._fallback.ORTModuleTorchModelException: ORTModule does not support adding modules to it.

Also is there a way to configure ort automatically when you install the package with conda?

currently I have to call this from my code:

from onnxruntime.training.ortmodule.torch_cpp_extensions import install as ortmodule_install
ortmodule_install.build_torch_cpp_extensions()

Does it work with PyTorch 1.10 and CUDA 11?

Thanks!

PyTorch Lightning Integration

Hey guys!

Really epic work in this repo! I'm currently working on integrating this into Lightning (any assistance would be appreciated). From what I can see, ORTModule just wraps the forward function, converting it into ONNX format. As a result, I've internally wrapped the model in Lightning to ensure that the user-defined functions (training_step, validation_step, test_step) are placed in a wrapped module's forward function.

Currently I'm running into an error:

/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_io.py:473: UserWarning: This model cannot be deep copied (or pickled), which is a required step for stateful models to be properly exported to ONNX. Compute will continue, but unexpected results may occur!
  warnings.warn("This model cannot be deep copied (or pickled), "
2021-07-19 13:20:57.590381944 [E:onnxruntime:, inference_session.cc:1341 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/session_state_utils.cc:143 onnxruntime::common::Status onnxruntime::session_state_utils::SaveInitializedTensors(const onnxruntime::Env&, const std::basic_string<char>&, const onnxruntime::GraphViewer&, const AllocatorPtr&, const onnxruntime::OrtValueNameIdxMap&, const std::vector<int>&, onnxruntime::ITensorAllocator&, const std::function<onnxruntime::common::Status(int, const OrtValue&, const onnxruntime::OrtCallback&, bool)>&, const onnxruntime::logging::Logger&, const onnxruntime::DataTransferManager&, const onnxruntime::ExecutionPlanBase&, const onnxruntime::SessionOptions&) ort_value_name_idx_map.MaxIdx() > -1 was false. OrtValue indexes should have been populated.

Traceback (most recent call last):
  File "reproduce_test.py", line 99, in <module>
    run()
  File "reproduce_test.py", line 94, in run
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
  File "/data/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 515, in fit
    self._run(model)
  File "/data/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 896, in _run
    self._dispatch()
  File "/data/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 963, in _dispatch
    self.accelerator.start_training(self)
  File "/data/pytorch-lightning/pytorch_lightning/accelerators/accelerator.py", line 97, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/data/pytorch-lightning/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/data/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 973, in run_stage
    return self._run_train()
  File "/data/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1008, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/data/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1084, in _run_sanity_check
    self._evaluation_loop.run()
  File "/data/pytorch-lightning/pytorch_lightning/loops/base.py", line 112, in run
    self.advance(*args, **kwargs)
  File "/data/pytorch-lightning/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 122, in advance
    self.num_dataloaders,
  File "/data/pytorch-lightning/pytorch_lightning/loops/base.py", line 112, in run
    self.advance(*args, **kwargs)
  File "/data/pytorch-lightning/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/data/pytorch-lightning/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 162, in evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/data/pytorch-lightning/pytorch_lightning/accelerators/accelerator.py", line 220, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "reproduce_test.py", line 74, in validation_step
    return self.model(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/ortmodule.py", line 41, in _forward
    return self._execution_manager(self._is_training()).forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_inference_manager.py", line 86, in forward
    self._create_execution_agent()
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_inference_manager.py", line 115, in _create_execution_agent
    session_options, providers, provider_options)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_execution_agent.py", line 52, in __init__
    self.create_inference_agent(path_or_bytes, session_options, providers, provider_options)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_execution_agent.py", line 56, in create_inference_agent
    providers, provider_options)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 321, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/session_state_utils.cc:143 onnxruntime::common::Status onnxruntime::session_state_utils::SaveInitializedTensors(const onnxruntime::Env&, const std::basic_string<char>&, const onnxruntime::GraphViewer&, const AllocatorPtr&, const onnxruntime::OrtValueNameIdxMap&, const std::vector<int>&, onnxruntime::ITensorAllocator&, const std::function<onnxruntime::common::Status(int, const OrtValue&, const onnxruntime::OrtCallback&, bool)>&, const onnxruntime::logging::Logger&, const onnxruntime::DataTransferManager&, const onnxruntime::ExecutionPlanBase&, const onnxruntime::SessionOptions&) ort_value_name_idx_map.MaxIdx() > -1 was false. OrtValue indexes should have been populated.

With the script (requires you to install pytorch lightning, pip install pytorch-lightning):

import os
import pickle

import torch
from torch.utils.data import DataLoader, Dataset
from torch_ort import ORTModule

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.overrides import LightningDistributedModule
from pytorch_lightning.plugins import SingleDevicePlugin


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def unwrap_lightning_module(wrapped_model) -> 'pl.LightningModule':
    model = wrapped_model
    if isinstance(model, LightningDistributedModule):
        model = unwrap_lightning_module(model.module)
    if isinstance(model, ORTModule):
        model = unwrap_lightning_module(model._module_metadata.original_module)
    return model


class ORTPlugin(SingleDevicePlugin):
    def setup(self, model: torch.nn.Module) -> torch.nn.Module:
        pickle.dumps(model)
        import pdb;pdb.set_trace()

        self.model = ORTModule(LightningDistributedModule(self.model))
        self.model_to_device()
        return self.model

    @property
    def lightning_module(self):
        return unwrap_lightning_module(self._model)

    def training_step(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    def validation_step(self, *args, **kwargs):
        return self.model(*args, **kwargs)

    def test_step(self, *args, **kwargs):
        return self.model(*args, **kwargs)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        plugins=ORTPlugin(device=torch.device('cuda:0')),
        gpus=1,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == '__main__':
    run()

I'll continue to debug in the meantime :)

Running ORTModule with other EPs from ORT

I am building a new wheel with the OneDNN EP using ONNX Runtime training. After that is installed, I install torch_ort and then run the configure step, but it does not seem to work (I get the same error asking me to run configure again). From the instructions, I see that there is no recipe for this combination. Is this possible, or is there any other way for me to build a custom wheel and use it to train a BERT model with OneDNN and ORT?

[torch-ort-infer] Aten fallback doesn't work

The ATen op doesn't fall back to the native PyTorch runtime as expected.

Versions:
Torch - 1.12.0
OnnxRuntime - 1.12.0
Torch-ort-infer - 1.12.0

Reproduction steps:

import torch
from torch_ort import ORTInferenceModule

def test_numpy_T(input_shape):
    class NeuralNet(torch.nn.Module):
        def __init__(self):
            super(NeuralNet, self).__init__()
        def forward(self, input):
            return input.T

    device = "cpu"
    ort_model = ORTInferenceModule(NeuralNet().to(device))

    def run_step(model, input):
        prediction = model(input)
        return prediction

    ort_input = torch.rand(input_shape, dtype=torch.float, device=device)
    ort_prediction = run_step(ort_model, ort_input)

if __name__ == "__main__":
    test_numpy_T([3, 2, 5])

Error log

Traceback (most recent call last):
  File "unit_test_atenop.py", line 23, in <module>
    test_numpy_T([3, 2, 5])
  File "unit_test_atenop.py", line 20, in test_numpy_T
    ort_prediction = run_step(ort_model, ort_input)
  File "unit_test_atenop.py", line 16, in run_step
    prediction = model(input)
  File "/ort_aten_fb/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ort_aten_fb/lib/python3.8/site-packages/torch_ort/ortinferencemodule/_utils_infer.py", line 98, in _forward
    return ortinferencemodule._forward_call(*inputs, **kwargs)
  File "/ort_aten_fb/lib/python3.8/site-packages/torch_ort/ortinferencemodule/ortinferencemodule.py", line 107, in _forward_call
    self._inference_session = onnxruntime.InferenceSession(
  File "/ort_aten_fb/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/ort_aten_fb/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 386, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_bytes, False, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (ATen_0) output arg (data) type inference failed.

Tested with the symbolic shape inference call from ORTModule (ref: symbolic_shape). Fails with Exception("Incomplete symbolic shape inference").
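
A possible workaround (an untested sketch, not from the issue thread) is to express the transpose explicitly, which should export as a standard ONNX Transpose rather than an ATen op:

import torch

class NeuralNet(torch.nn.Module):
    def forward(self, input):
        # Equivalent to input.T: reverse all dimensions. The permutation is a
        # constant at export time, so it maps to a plain ONNX Transpose.
        return input.permute(*reversed(range(input.dim())))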

Where operator export error when performing fp16 quantization

Error is encountered when enabling deepspeed fp16 quantization for gpt2 huggingface optimum model training.

  • See below for the full stack trace
  • See link to download exported graph dump
  • See issue for more information on how to reproduce

Traceback (most recent call last):
  File "trainer/run_clm_optimum.py", line 480, in <module>
    main()
  File "trainer/run_clm_optimum.py", line 422, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/prathikrao/transformers-ort-failures/optimum/optimum/onnxruntime/trainer.py", line 482, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 2011, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 2043, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 288, in _forward
    return torch_module_ort._execution_manager(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 295, in forward
    self._fallback_manager.handle_exception(exception=e,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 234, in forward
    self._initialize_graph_builder(training=True)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 450, in _initialize_graph_builder
    self._graph_builder.initialize(
RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:752 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (Where) bound to different types (tensor(float) and tensor(float16) in node (Where_199).

python -m torch_ort.configure fail

It seems the length of the filename results in this error. Here is the log.
Any ideas about solving the problem? Thanks!

running build
running build_ext
building 'aten_op_executor' extension
Emitting ninja build file D:\python\Anaconda\envs\dl\lib\site-packages\onnxruntime\training\ortmodule\torch_cpp_extensions\build\temp.win-amd64-cpython-310\Release\build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: error: Stat(D:/python/Anaconda/envs/dl/lib/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.win-amd64-cpython-310/Release/python/Anaconda/envs/dl/lib/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.obj): Filename longer than 260 characters
Traceback (most recent call last):
  File "D:\python\Anaconda\envs\dl\lib\site-packages\torch\utils\cpp_extension.py", line 1894, in _run_ninja_build
    subprocess.run(
  File "D:\python\Anaconda\envs\dl\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

RecursionError: maximum recursion depth exceeded in comparison

I use ort like this:

...
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = ORTModule(model)
model = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=[device])
...

But found error:

Traceback (most recent call last):
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/ddp_trainer.py", line 156, in _main_func
    main_func(local_rank, *args)
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/tools/train.py", line 163, in train_entrance
    trainer.fit()
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/tools/trainer_wrapper.py", line 225, in fit
    self._trainer.fit()
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/trainer.py", line 298, in fit
    profiler=self.profiler,
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/processors/processor.py", line 265, in __call__
    model_outs = model(*_as_list(batch_i))
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 41, in _forward
    return self._execution_manager(self._is_training()).forward(*inputs, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 67, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 206, in _export_model
    schema = _io._extract_schema({'args': copy.copy(inputs), 'kwargs': copy.copy(kwargs)})
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 300, in _extract_schema
    data[key] = _extract_schema(data[key])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
    data[idx] = _extract_schema(data[idx])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
    data[idx] = _extract_schema(data[idx])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
    data[idx] = _extract_schema(data[idx])
  [Previous line repeated 949 more times]
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 287, in _extract_schema
    if isinstance(data, abc.Sequence):
  File "/home/users/min.du/venvs/pytorch1.8/lib64/python3.6/abc.py", line 184, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/home/users/min.du/venvs/pytorch1.8/lib64/python3.6/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison

Any suggestions?

AttributeError: 'ORTModule' object has no attribute 'resize_token_embeddings'

Hi,
I am using ort to run transformers/examples/pytorch/language-modeling/run_clm.py (fine-tuning GPT-2 on WikiText-2; using the raw WikiText-2, no tokens were replaced before the tokenization). I am running it on the ROCm platform.
I edited the script like this:

from torch_ort import ORTModule

    if model_args.model_name_or_path:
        model = AutoModelForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            from_tf=bool(".ckpt" in model_args.model_name_or_path),
            config=config,
            cache_dir=model_args.cache_dir,
            revision=model_args.model_revision,
            use_auth_token=True if model_args.use_auth_token else None,
        )
        model = ORTModule(model)
    else:
        model = AutoModelForCausalLM.from_config(config)
        model = ORTModule(model)
        n_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
        logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")

I am getting this error

Traceback (most recent call last):
  File "./examples/pytorch/language-modeling/run_clm.py", line 519, in <module>
    main()
  File "./examples/pytorch/language-modeling/run_clm.py", line 353, in main
    model.resize_token_embeddings(len(tokenizer))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 948, in __getattr__
    type(self).__name__, name))
AttributeError: 'ORTModule' object has no attribute 'resize_token_embeddings'

Could you kindly help me resolve it?
Thank you
Bhavya
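
A possible workaround (a sketch, not from the thread): since ORTModule does not forward arbitrary attributes of the wrapped model, resize the token embeddings on the plain Hugging Face module before wrapping it:

from transformers import AutoModelForCausalLM, AutoTokenizer
from torch_ort import ORTModule

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # call this before wrapping
model = ORTModule(model)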

Lack of speed improvement when using custom GPT model with ORT

hey guys! In my investigation to try to figure out why there is a speed regression for #56, I created a simple minimal script to benchmark ORT vs no ORT.

With the script I'm seeing basically the same time between ORT and no ORT. Any ideas on what is causing the performance issue? I'm also seeing a few warnings, which I've included below!

No ORT Time taken: 85.2842013835907 seconds
ORT Time taken 85.33545899391174 seconds

Warnings:

/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_logger.py:52: UserWarning: There were one or more warnings or errors raised while exporting the PyTorch model. Please enable INFO level logging to view all warnings and errors.
  "model. Please enable INFO level logging to view all warnings and errors.", UserWarning)
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.

script:

import math
import os
import time

import numpy as np
import torch
import torch.nn as nn
from torch.cuda.amp import autocast
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from torch_ort import ORTModule


class GPTConfig:
    """ base GPT config, params common to all GPT versions """
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1

    def __init__(self, vocab_size, block_size, **kwargs):
        self.vocab_size = vocab_size
        self.block_size = block_size
        for k, v in kwargs.items():
            setattr(self, k, v)


class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    I believe I could have just used torch.nn.MultiheadAttention but their documentation
    is all but absent and code ugly so I don't trust it, rolling my own here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                             .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x, layer_past=None):
        B, T, C = x.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y


class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(torch.nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, embd_pdrop, n_layer, config):
        # input embedding stem
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
        self.drop = nn.Dropout(embd_pdrop)
        self.config = config

        # decoder head
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

        self.block_size = block_size

        blocks = []
        for x in range(n_layer):
            layer = Block(self.config)
            blocks.append(layer)
        self.blocks = nn.Sequential(*blocks)

    def forward(self, idx):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."

        # forward the GPT model
        token_embeddings = self.tok_emb(idx)  # each index maps to a (learnable) vector
        position_embeddings = self.pos_emb[:, :t, :]  # each position maps to a (learnable) vector
        x = self.drop(token_embeddings + position_embeddings)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits


class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = list(set(data))
        data_size, vocab_size = len(data), len(chars)

        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y


if __name__ == '__main__':
    n_embd = 2048
    block_size = 128
    n_layer = 6
    batch_size = 8
    num_workers = 0
    n_head = 16
    n_warmup = 20
    enable_ort = True

    device = torch.device("cuda:0")

    if not os.path.exists("input.txt"):
        os.system("wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt")

    file = 'input.txt'
    text = open(file, 'r').read()
    train_dataset = CharDataset(text, block_size)  # one line of poem is roughly 50 characters
    train_loader = DataLoader(train_dataset, batch_size=batch_size, num_workers=num_workers)
    vocab_size = train_dataset.vocab_size

    model = GPT(
        vocab_size=vocab_size,
        n_embd=n_embd,
        embd_pdrop=0.1,
        block_size=block_size,
        n_layer=n_layer,
        config=GPTConfig(
            vocab_size=vocab_size,
            block_size=block_size,
            n_layer=n_layer,
            n_head=n_head,
            n_embd=n_embd,
        )
    )
    if enable_ort:
        model = ORTModule(model)

    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    torch.cuda.synchronize()
    # warmup before measuring
    for x, (idx, targets) in tqdm(enumerate(train_loader), total=len(train_loader)):
        if x == n_warmup:
            break
        idx = idx.to(device)
        targets = targets.to(device)
        with autocast():
            logits = model(idx)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    torch.cuda.synchronize()
    start_time = time.time()

    for idx, targets in tqdm(train_loader, total=len(train_loader)):
        idx = idx.to(device)
        targets = targets.to(device)
        with autocast():
            logits = model(idx)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        optimizer.step()

    torch.cuda.synchronize()
    print("Time taken", time.time() - start_time)

cc @natke @ashbhandare

Torch unable to use ORT because of opset version issue

torch-ort: 1.9.0
onnxruntime-training: 1.11
OS: Ubuntu 20.04

To reproduce, run below script with above libraries installed:

from datasets import load_dataset
raw_datasets = load_dataset("imdb")  
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
from torch_ort import ORTModule
model = ORTModule(model)
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
training_args.per_device_train_batch_size = 2
training_args.num_train_epochs = 1
training_args.max_steps = 1
from transformers import Trainer
trainer = Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset)                                                                                             

I built the latest onnxruntime-training (1.11) wheel for the default CPU target and installed torch-ort with it. ORTModule tries to export the model with opset version 14, which is reported as unsupported, so my PyTorch script ends up using the torch backend and ignoring ORT. I previously had the script working with an older version of onnxruntime-training (1.10).

I modified ORTModule to use the OneDNN EP, but I am unable to use this flow because of the above issue. Is there a way to prevent the module from exporting to opset 14 by default?

Here is the error log:

Traceback (most recent call last):
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 223, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 321, in _export_model
    self._onnx_models.exported_model = self._get_exported_model(
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 391, in _get_exported_model
    raise wrap_exception(ORTModuleONNXModelException,
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback_exceptions.py", line 72, in wrap_exception
    raise new_exception(raised_exception) from raised_exception
onnxruntime.training.ortmodule._fallback_exceptions.ORTModuleONNXModelException: There was an error while exporting the PyTorch model to ONNX:

Traceback (most recent call last):
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 385, in _get_exported_model
    torch.onnx.export(self._flattened_module,
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/torch/onnx/__init__.py", line 275, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/torch/onnx/utils.py", line 88, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/torch/onnx/utils.py", line 672, in _export
    _set_opset_version(opset_version)
  File "/home/mtc/code/chethan/vitualenv/lib/python3.8/site-packages/torch/onnx/symbolic_helper.py", line 783, in _set_opset_version
    raise ValueError("Unsupported ONNX opset version: " + str(opset_version))
ValueError: Unsupported ONNX opset version: 14
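
A hedged sketch of a possible workaround: newer onnxruntime-training builds appear to read an ORTMODULE_ONNX_OPSET_VERSION environment variable; assuming your build honors it, the export opset can be pinned before ORTModule is created:

import os

# Assumption: onnxruntime-training reads this variable; it must be set
# before ORTModule first exports the model.
os.environ["ORTMODULE_ONNX_OPSET_VERSION"] = "12"

from torch_ort import ORTModule
model = ORTModule(model)  # model is your existing torch.nn.Module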

CUDA error cudaErrorInvalidConfiguration:invalid configuration argument

RuntimeError: Error in backward pass execution: Non-zero status code returned while running BiasGeluGrad_dX node. Name:'BiasGelu_token_112_Grad/BiasGeluGrad_dX_0' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument

When I use a big tensor like (128, 3, 224, 224), it causes this error.
If I change the tensor dimensions to (16, 3, 224, 224), it works. Why?

Question about supported device

Hi, I find that if I want to install torch-ort, I must install the CUDA dependency first. But the bert example also supports running on CPU, so I wonder whether the CUDA dependency is really necessary for installation?

Compatibility between ORTModule and DeepSpeed

Hi folks,

I am recently working on validating distributed training features while using ORTModule, here are some incompatibilities that I found during some tests:

[With DeepSpeed]

  • ZeRO Stage 1 and 2 work well
  • ZeRO Stage 3 ❌

Warnings:

/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_io.py:558: UserWarning: This model cannot be deep copied (or pickled), which is a required step for stateful models to be properly exported to ONNX. Compute will continue, but unexpected results may occur!
  warnings.warn("This model cannot be deep copied (or pickled), "
  • BF16 ❌

Error Message:

RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:752
onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, 
const pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)> 
[ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Type Error: Type 'tensor(bfloat16)' of input parameter
(_original_module.distilbert.embeddings.word_embeddings.weight) of operator (ATen) in node (ATen_17) is invalid

[With Fairscale]

  • Can only shard optimizer state

Environment

  • OS: Ubuntu 20.04
  • CUDA/cuDNN version: 11.3/8
  • onnxruntime-training: 1.11.1+cu113
  • torch: 1.11.0+cu113
  • torch-ort: 1.11.1
  • Python version:3.8
  • GPU: A100

I would like to confirm with you folks whether these behaviors are intended. And concerning compatibility with DeepSpeed Stage 3 and BF16, would it be possible to share some insight on whether they will be supported in the future?

Thanks a lot!

topKgate loss issues

We have calculated the loss of the gate, but does this have any effect on training? Where is this loss used?

logits = self.wg(input)  # dim: [bxs, num_experts]
if self.k == 1:
    self.loss, self.gate_log, gates1_s, dispatch_mask, retval = top1gating(
        logits,
        self.capacity_factor if self.training else self.eval_capacity_factor,
        is_expert_slicing=self.is_expert_slicing,
        fp16_mode=self.fp16_mode,
        nonpadding=nonpadding,
        logits_gumbel=self.logits_gumbel if self.training else 0,
        token_drop_type=self.token_drop_type,
        straight_through=self.straight_through,
        straight_through_temperature=self.straight_through_temperature,
        balance_ratio=self.balance_ratio,
        gate_log_req=self.gate_log_req,
        lid=lid,
        tutel_cumsum_sub_one=self.tutel_cumsum_sub_one,
    )
    return gates1_s, dispatch_mask, retval

Add `ninja` to requirements.txt

Here is an error I get from a fresh install of torch-ort:

  File "main.py", line 59, in <module>
    main()
  File "main.py", line 35, in main
    model = ORTModule(model)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/ortmodule.py", line 77, in __init__
    self._execution_manager = GraphExecutionManagerFactory(self._flattened_module)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/_ortmodule_graph_execution_manager_factory.py", line 12, in __init__
    self._training_manager = TrainingManager(module)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/_ortmodule_training_manager.py", line 21, in __init__
    super().__init__(model)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/_ortmodule_graph_execution_manager.py", line 121, in __init__
    self.is_rocm_pytorch)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/training/_ortmodule_utils.py", line 44, in _load_torch_gpu_allocator_cpp_extension
    verbose=verbosity, with_cuda=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1240, in load_inline
    keep_intermediates=keep_intermediates)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1302, in _jit_compile
    is_standalone=is_standalone)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

Should / How can we deal with unsupported operator warning?

How can I find where these warnings come from and how to solve them? Or can I just ignore them?

/home/yisiang/miniconda3/envs/dl/lib/python3.9/site-packages/onnxruntime/training/ortmodule/_logger.py:51: UserWarning: There were one or more warnings or errors raised while exporting the PyTorch model. Please enable INFO level logging to view all warnings and errors.
  warnings.warn("There were one or more warnings or errors raised while exporting the PyTorch "
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.
...
Warning: Checker does not support models with experimental ops: Scale
Warning: Checker does not support models with experimental ops: Scale
....
2021-07-20 22:00:01.572480474 [W:onnxruntime:, constant_folding.cc:134 ApplyImpl] Could not find a CPU kernel and hence can't constant fold Sub node 'Pow_10_Grad/Sub_1'
2021-07-20 22:00:01.580978404 [W:onnxruntime:, constant_folding.cc:134 ApplyImpl] Could not find a CPU kernel and hence can't constant fold Sub node 'Pow_10_Grad/Sub_1'
....
  • Training a transformer encoder
  • torch 1.9.0
  • torch-ort 1.8.1
  • onnx 1.9.0
  • onnxruntime-training 1.8.1+torch190.cu111

-- 2021.07.24 ---
I found that if the embedding forward pass is commented out, the Warning: Unsupported operator ATenOp... messages disappear, but the UserWarning: There were one or more warnings... still shows up. I wonder how to turn on INFO level logging.
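
For reference, a hedged sketch: newer onnxruntime-training releases expose a DebugOptions argument on ORTModule; assuming your version has it, INFO-level logging can be enabled like this:

from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

# Assumption: DebugOptions/LogLevel are available in your onnxruntime-training.
model = ORTModule(model, DebugOptions(log_level=LogLevel.INFO))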

RuntimeError: Error in execution: At least one output should be requested.

Getting this error with a pretty simple model.
This is a direct error from ONNX Runtime, but I couldn't find any method to register outputs in ORTInferenceModule.

Versions:
torch Version: 1.12.1
onnx Version: 1.12.0
torch-ort-infer Version: 1.12.0

Reproduction steps:

import torch
from torch import nn
from torch_ort import ORTInferenceModule, OpenVINOProviderOptions

class Block(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.size = size
        self.ff1 = nn.Linear(size, size)

    def forward(self, x):
        second = self.ff1(x)
        return second

model = Block(1024)
model.eval()

model = ORTInferenceModule(model, provider_options=OpenVINOProviderOptions(backend="CPU", precision="FP32"))

with torch.inference_mode():
    print("start")

    x = torch.randn(1, 1024, dtype=torch.float32)
    x = model(x)
    print(x.mean())

Turn off fallback to torch by default

The current behavior of ORTModule is to fall back to torch if there are errors running the converted model. Since most users make a conscious choice to run onnxruntime when they use ORTModule, it may be better to fail fast (and raise) by default when onnxruntime fails, instead of falling back to torch.
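In the meantime, the fallback behavior appears to be configurable through an environment variable; a minimal sketch (the variable name is an assumption inferred from the _FallbackPolicy enum visible in tracebacks elsewhere on this page; verify it against your installed onnxruntime-training version):

import os

# Assumed knob: make ORTModule raise instead of silently re-running
# the model with torch when ONNX Runtime fails.
os.environ["ORTMODULE_FALLBACK_POLICY"] = "FALLBACK_DISABLE"

from torch_ort import ORTModule  # import after setting the policy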

ORT support for CUDA 11.0

Hello all,

I am trying to use ORT on Kaggle, which currently has CUDA 11.0, but the current version available on pip is compatible with CUDA 10.2.

Is there support for CUDA 11.0 available from the master branch, and if so, how do I get it up and running?

If not, can the ORT team consider extending the compatibility?
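For reference, CUDA-specific builds of onnxruntime-training are published on a separate package feed; a hedged sketch of installing a cu111 build (the feed URL is an assumption based on ONNX Runtime's published install matrix; double-check it against the Get Started instructions on onnxruntime.ai):

pip install onnxruntime-training -f https://download.onnxruntime.ai/onnxruntime_stable_cu111.html
pip install torch-ort
python -m torch_ort.configure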

MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library

I encountered the error below while running

python -m torch_ort.configure

"Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it."

I found a solution for this issue and wanted to leave it here:

export MKL_SERVICE_FORCE_INTEL=1

and

export MKL_THREADING_LAYER=GNU

resolved the error for the above command.
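The error message itself also points at an in-process alternative; a minimal sketch, assuming the variables are set before anything loads MKL:

import os

# Mirror of the shell exports above; must run before any MKL-backed import.
os.environ["MKL_THREADING_LAYER"] = "GNU"
# os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"  # the other variable from the fix

import numpy  # importing numpy first is the other workaround the message suggests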

Warning: Checker does not support models with experimental ops: ATen

Even though I only see the warnings below (no errors), the profiler trace does not get created.

Repro script

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
import torchvision
import torchvision.transforms as T
import torchvision.models as models

import torch.profiler

model = models.resnet50(pretrained=True)
model.cuda()
cudnn.benchmark = True

transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32,
                                          shuffle=True, num_workers=4)

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
device = torch.device("cuda:0")

from torch_ort import ORTModule
model = ORTModule(model)

model.train()

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./result', worker_name='worker0'),
    record_shapes=True,
    profile_memory=True,  # This will take 1 to 2 minutes. Setting it to False could greatly speedup.
    with_stack=True
) as p:
    for step, data in enumerate(trainloader, 0):
        print("step:{}".format(step))
        inputs, labels = data[0].to(device=device), data[1].to(device=device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step + 1 >= 4:
            break
        p.step()

Package versions

(serve) ubuntu@ip-172-31-17-70:~$ pip list
Package                Version
---------------------- -------------------
brotlipy               0.7.0
Cerberus               1.3.4
certifi                2020.12.5
cffi                   1.14.5
chardet                4.0.0
conda                  4.10.1
conda-package-handling 1.7.3
cryptography           3.4.7
flatbuffers            22.9.24
h5py                   3.7.0
idna                   2.10
mamba                  0.13.0
mpmath                 1.2.1
numpy                  1.23.3
onnx                   1.12.0
onnxruntime-training   1.12.0
packaging              21.3
Pillow                 9.2.0
pip                    21.1.2
protobuf               3.20.1
pycosat                0.6.3
pycparser              2.20
pyOpenSSL              20.0.1
pyparsing              3.0.9
PySocks                1.7.1
requests               2.25.1
ruamel-yaml-conda      0.15.80
setuptools             49.6.0.post20210108
six                    1.16.0
sympy                  1.11.1
torch                  1.12.1+cu116
torch-ort              1.12.0
torchaudio             0.12.1+cu116
torchvision            0.13.1+cu116
tqdm                   4.61.0
typing-extensions      4.4.0
urllib3                1.26.4
wheel                  0.36.2

Logs

(serve) ubuntu@ip-172-31-17-70:~$ python resnet.py 
/opt/conda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Files already downloaded and verified
/opt/conda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training
  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)
/opt/conda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.12.0
  warnings.warn("onnxruntime training package info: __version__: %s" % version)
/opt/conda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 10.2
  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)
/opt/conda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 10020
  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)
/opt/conda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")
/opt/conda/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [11060]
  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)
step:0
/opt/conda/lib/python3.9/site-packages/onnxruntime/training/ortmodule/_training_manager.py:190: UserWarning: Fast path enabled - skipping checks. Rebuild graph: True, Execution agent: True, Device check: True
  warnings.warn(
Warning: ONNX Preprocess - Removing mutation from node aten::add_ on block input: '_original_module.bn1.num_batches_tracked'. This changes graph semantics.
Warning: ONNX Preprocess - Removing mutation from node aten::add_ on block input: '_original_module.layer1.0.bn1.num_batches_tracked'. This changes graph semantics.
Warning: ONNX Preprocess - Removing mutation from node aten::add_ on block input: '_original_module.layer1.0.bn2.num_batches_tracked'. This changes graph semantics.
... (similar warnings for every remaining num_batches_tracked buffer in layer1 through layer4)
Warning: ONNX Preprocess - Removing mutation from node aten::add_ on block input: '_original_module.layer4.2.bn3.num_batches_tracked'. This changes graph semantics.
WARNING: The shape inference of org.pytorch.aten::ATen type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
...
Warning: Checker does not support models with experimental ops: ATen
...
Inconsistency detected by ld.so: dl-version.c: 205: _dl_check_map_versions: Assertion `needed != NULL' failed!

no speedup using ort

I have tried using ort to train a transformer, but it seems there is no speedup.
I wonder whether I have missed something in the configuration.
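One thing to rule out (a sketch, not a diagnosis): ORTModule exports and optimizes the ONNX graph on the first forward pass, as the step-0 logs elsewhere on this page show, so benchmarks should exclude the first few steps:

import time

# Hypothetical loop: `model`, `optimizer`, and `trainloader` are assumed
# to be defined as in the profiler repro script above.
warmup_steps, timed_steps, start = 3, 0, None
for step, (inputs, labels) in enumerate(trainloader):
    loss = model(inputs).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step == warmup_steps:
        start = time.time()  # start timing only after the one-time export
    elif step > warmup_steps:
        timed_steps += 1
if start is not None and timed_steps:
    print(f"steady-state time/step: {(time.time() - start) / timed_steps:.4f}s")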

ONNXRuntimeError after enabled fp16 mixed precision training

Hi folks,

I tested fp16 mixed precision training with an ORTModule-wrapped GPT2 model on a fine-tuning task. However, after enabling fp16, I encountered the following error:

Error Message

Traceback (most recent call last):
  File "test_onnxruntime_train.py", line 115, in test_ort_trainer
    train_result = trainer.train()
  File "/workspace/optimum/onnxruntime/trainer.py", line 498, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 2016, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/ortmodule.py", line 81, in _forward
    return self._torch_module.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_torch_module_ort.py", line 32, in _forward
    return self._execution_manager(self.is_training()).forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 265, in forward
    override_policy=_FallbackPolicy.FALLBACK_FORCE_TORCH_FORWARD)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 194, in handle_exception
    raise exception
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 85, in forward
    self._initialize_graph_builder(training=True)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 420, in _initialize_graph_builder
    self._onnx_models.exported_model.SerializeToString(), grad_builder_config)
RuntimeError: /onnxruntime_src/orttraining/orttraining/python/orttraining_pybind_state.cc:707 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const pybind11::bytes&, const onnxruntime::training::OrtModuleGraphBuilderConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (Where) bound to different types (tensor(float) and tensor(float16) in node (Where_183).

It seems that the exported ONNX graph is broken due to incompatible input types. I am wondering where the problem comes from. Does anyone have insight on that?


System information

Docker image built with the Dockerfile-cu11 in onnxruntime-training-examples.

  • OS: Ubuntu 18.04
  • CUDA/cuDNN version: 11/8
  • onnxruntime-training: 1.9.0+cu111
  • torch: 1.9.0+cu111
  • torch-ort: 1.9.0
  • Python version:3.6
  • GPU: A100

Additional Information

  • I actually have a working version under the environment torch 1.8.1 + torch-ort 1.9.0 + onnxruntime-training 1.11.0.dev20220113001+cu102, so I wonder if the error comes from the contents of the Dockerfile being outdated. However, I can't find how to install onnxruntime-training 1.11.0.dev20220113001+cu102 anymore.
  • Here is the onnx graph exported with DebugOptions, in case it helps:
    [image: exported ONNX graph]
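In case it helps others reproduce the graph dump: a minimal sketch using the DebugOptions re-exported by torch_ort (save_onnx and onnx_prefix are the documented parameters; the prefix value here is arbitrary):

from torch_ort import ORTModule, DebugOptions

# Saves the exported and optimized ONNX graphs to disk so type
# mismatches like Where(float vs float16) can be inspected in a viewer.
model = ORTModule(model, DebugOptions(save_onnx=True, onnx_prefix="gpt2_fp16"))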

tests/bert_for_sequence_classification.py reports "This is an invalid model"

Hi, I installed torch-ort following the instructions. python -m torch_ort.configure does not report any error. However, when I run the verification script, it reports the following errors:

======== Epoch 1 / 4 with batch size 32 ========
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator ATenOp. No schema registered for this operator.
Warning: Unsupported operator SoftmaxCrossEntropyLossInternal. No schema registered for this operator.
2021-11-16 23:56:39.446639117 [W:onnxruntime:Default, graph.cc:2538 InitFunctionBodyForNode] Function body initialization failed for node 'Softmax_131_Grad/SoftmaxGrad_0' optype SoftmaxGrad. Error message /onnxruntime_src/onnxruntime/core/graph/function.cc:749 onnxruntime::FunctionImpl::FunctionImpl(onnxruntime::Graph&, const NodeIndex&, const onnx::FunctionProto&, const std::unordered_map<std::basic_string<char>, const onnx::FunctionProto*>&, std::vector<std::unique_ptr<onnxruntime::Function> >&, const onnxruntime::logging::Logger&, bool) status.IsOK() was false. Resolve subgraph failed:This is an invalid model. Error in Node:0x557a5b74e130 : Node (0x557a5b74e130) has input size 2 not in range [min=1, max=1].
. Execution will fail if ORT does not have a specialized kernel for this op
2021-11-16 23:56:39.464632288 [W:onnxruntime:, graph.cc:2538 InitFunctionBodyForNode] Function body initialization failed for node 'Softmax_131_Grad/SoftmaxGrad_0' optype SoftmaxGrad. Error message /onnxruntime_src/onnxruntime/core/graph/function.cc:749 onnxruntime::FunctionImpl::FunctionImpl(onnxruntime::Graph&, const NodeIndex&, const onnx::FunctionProto&, const std::unordered_map<std::basic_string<char>, const onnx::FunctionProto*>&, std::vector<std::unique_ptr<onnxruntime::Function> >&, const onnxruntime::logging::Logger&, bool) status.IsOK() was false. Resolve subgraph failed:This is an invalid model. Error in Node:0x557a60d5e230 : Node (0x557a60d5e230) has input size 2 not in range [min=1, max=1].
. Execution will fail if ORT does not have a specialized kernel for this op
Inconsistency detected by ld.so: dl-version.c: 205: _dl_check_map_versions: Assertion `needed != NULL' failed!

Testbed:
V100-32GB
CUDA-10.2
torch==1.9.0
torch-ort==1.9.0
onnxruntime-training==1.9.0

`python -m torch_ort.configure` fails with protobuf errors

With the latest pytorch-nightly and a fresh install of torch_ort, when running python -m torch_ort.configure I get a protobuf error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/justinchu/.local/lib/python3.8/site-packages/torch_ort/__init__.py", line 6, in <module>
    from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnxruntime/__init__.py", line 32, in <module>
    from onnxruntime.capi import onnxruntime_validation
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 138, in <module>
    has_ortmodule, package_name, version, cuda_version = validate_build_package_info()
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 133, in validate_build_package_info
    raise import_ortmodule_exception
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py", line 66, in validate_build_package_info
    from onnxruntime.training.ortmodule import ORTModule # noqa
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnxruntime/training/__init__.py", line 11, in <module>
    from .orttrainer import ORTTrainer, TrainStepInfo
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnxruntime/training/orttrainer.py", line 4, in <module>
    import onnx
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnx/__init__.py", line 11, in <module>
    from onnx.external_data_helper import load_external_data_for_model, write_external_data_tensors, convert_model_to_external_data
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnx/external_data_helper.py", line 14, in <module>
    from .onnx_pb import TensorProto, ModelProto
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnx/onnx_pb.py", line 8, in <module>
    from .onnx_ml_pb2 import *  # noqa
  File "/home/justinchu/.local/lib/python3.8/site-packages/onnx/onnx_ml_pb2.py", line 33, in <module>
    _descriptor.EnumValueDescriptor(
  File "/home/justinchu/.local/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Why should I be forced to have a CUDA or ROCm machine when I want to run OpenVINO on Intel?

This link tells me torch-ort-infer supports OpenVINO:
https://github.com/pytorch/ort#-inference
"ONNX Runtime for PyTorch supports PyTorch model inference using ONNX Runtime and Intel® OpenVINO™.

It is available via the torch-ort-infer python package. This package enables OpenVINO™ Execution Provider for ONNX Runtime by default for accelerating inference on various Intel® CPUs, Intel® integrated GPUs, and Intel® Movidius™ Vision Processing Units - referred to as VPU."

However, when I try to use it, the dependencies point to the install of torch_ort, which needs CUDA as a prerequisite. I don't have an AMD or NVIDIA GPU in this Intel PC, and I want to use the Intel GPU.
What can I do to omit the CUDA dependencies completely?

torch_ort configure fails

Does anyone have a solution for the following error? I built ORT from source and got this error when installing torch_ort.

root@6fc16d770e85:/nfs# python -m torch_ort.configure
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/conda/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/opt/conda/lib/python3.8/site-packages/torch_ort/__init__.py", line 6, in <module>
    from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel
  File "/opt/conda/lib/python3.8/site-packages/onnxruntime/__init__.py", line 34, in <module>
    raise import_capi_exception
  File "/opt/conda/lib/python3.8/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed,
ImportError: cannot import name 'enable_telemetry_events' from 'onnxruntime.capi._pybind_state' (/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py)

My build command is

./build.sh --config Debug  --enable_pybind --enable_language_interop_ops --use_cuda --enable_training --build_wheel --parallel --skip_tests --cuda_home /usr/local/cuda-11.1 --cudnn_home /usr/lib/x86_64-linux-gnu/ --cuda_version 11.1 --enable_training_torch_interop

torch_ort version: torch_ort in conda/lib/python3.8/site-packages (1.9.0)
ORT commit: 88d5023885ecfde70a5947a3247ab430f9270fb8 (Thu Oct 7 15:27:12 2021 -0700)
