ONNXRuntimeError after enabling fp16 mixed precision training (ort issue, closed)

pytorch commented on September 18, 2024

Comments (8)

baijumeswani commented on September 18, 2024

@JingyaHuang I think this error is exactly the same as the one reported in this pull request: microsoft/onnxruntime#10674

I thought this PR made its way to ort 1.11.0 release. I will check and get back to you on that. In the meantime, you could use onnx==1.10.2 with onnxruntime-training==1.11.0+cu113 to work around the problem.

baijumeswani commented on September 18, 2024

@JingyaHuang I can confirm that the PR didn't make it into the ort 1.11.0 release. Please use the workaround for now until we have a more permanent solution in place.
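To confirm that an environment really matches the workaround, a small check along these lines may help. This is only a sketch: the pins are the versions quoted above, while `find_mismatches` and `installed_version` are hypothetical helper names that are not part of onnx or onnxruntime.

```python
from importlib import metadata

# Pins copied from the workaround above; everything else here is illustrative.
WORKAROUND_PINS = {
    "onnx": "1.10.2",
    "onnxruntime-training": "1.11.0+cu113",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) pair. `found` is None when not installed."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(WORKAROUND_PINS).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without the real packages installed.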

JingyaHuang commented on September 18, 2024

Hi @ytaous and @baijumeswani,

Thanks a lot for the reply, super glad to know the root of the error!!

I adopted the suggested workaround with onnx==1.10.2 and onnxruntime-training==1.11.0+cu113, and it works well for the models that we had tested previously! Since the mixed-precision training issue is still there, I think we will temporarily skip the gpt2 benchmark until the next release that includes the fix.

Thanks again for the help!

ytaous commented on September 18, 2024

Hi, can you please try our latest version? Also try a later version of torch if possible.
https://download.onnxruntime.ai/
https://download.onnxruntime.ai/onnxruntime_stable_cu111.html

The Dockerfile for the example has not been updated for a while, so it might have some undiscovered issues with older versions of torch and ort. Once you confirm the fix, please feel free to create another ticket here - https://github.com/microsoft/onnxruntime-training-examples/issues and close this one.
Thanks.

JingyaHuang commented on September 18, 2024

Hello @ytaous,

Thank you for the reply, and bravo for the newly released onnxruntime 1.11.0!

I have tried onnxruntime-training 1.11.0, but I ran into some unexpected errors while running a simple text classification task. It seems that even the examples that previously worked are broken now. Here are the error messages that I got:

Traceback (most recent call last):
  File "run_glue.py", line 572, in <module>
    main()
  File "run_glue.py", line 491, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/optimum/onnxruntime/trainer.py", line 476, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2016, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 288, in _forward
    return torch_module_ort._execution_manager(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 295, in forward
    self._fallback_manager.handle_exception(exception=e,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 231, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 329, in _export_model
    self._onnx_models.exported_model = SymbolicShapeInference.infer_shapes(self._onnx_models.exported_model,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 2061, in infer_shapes
    all_shapes_inferred = symbolic_shape_inference._infer_impl()
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 1928, in _infer_impl
    self._check_merged_dims(in_dims, allow_broadcast=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 620, in _check_merged_dims
    self._add_suggested_merge(dims, apply=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 218, in _add_suggested_merge
    assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
AssertionError
  0%|                                                                                   | 0/12630 [00:03<?, ?it/s]
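The assertion at the bottom of the traceback requires every entry in `symbols` to be either a known symbolic dimension name or a literal. As a rough debugging aid, the same condition can be applied outside the tool to list the entries that would fail. This is a sketch under assumptions: `is_literal` below is a simplified stand-in that only accepts plain integers (the helper in onnxruntime may accept more types), and `offending_symbols` is a hypothetical name.

```python
def is_literal(dim):
    # Simplified stand-in (assumption): only plain integers count as literals.
    # bool is excluded because it is a subclass of int in Python.
    return isinstance(dim, int) and not isinstance(dim, bool)


def offending_symbols(symbols, symbolic_dims):
    """Return the entries that would make the assertion fail: anything that is
    neither a known symbolic dimension name nor a literal."""
    return [
        s
        for s in symbols
        if not ((isinstance(s, str) and s in symbolic_dims) or is_literal(s))
    ]


if __name__ == "__main__":
    known = {"batch": None}  # stands in for self.symbolic_dims_
    # None and the unknown name "seq" fail the check; "batch" and 128 pass.
    print(offending_symbols(["batch", 128, None, "seq"], known))
```

An empty result means the assertion would pass for that input. A non-empty one shows which dimension entries are of an unexpected type or refer to an unregistered symbol.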

Besides, I received a lot of warnings before that. It seems that the exported ONNX graph is broken:

WARNING: The shape inference of org.pytorch.aten::ATen type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
WARNING: The shape inference of com.microsoft::SoftmaxCrossEntropyLossInternal type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Warning: Checker does not support models with experimental ops: ATen

Environment information

The tests were run with this Dockerfile, with:

OS: Ubuntu 20.04
CUDA/cuDNN version: 11.1/8
onnxruntime-training: 1.11.0+cu111
torch: 1.10.0+cu111
(actually I tried every stable torch version >= 1.9.0, like 1.9.0, 1.9.0+cu111, 1.10.0, 1.10.0+cu111, 1.11.0+cu113, and none of them work)
torch-ort: 1.11.0
Python version: 3.8.10
GPU: A100 / T4
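
A list like the one above can be gathered automatically when filing a report. The following is a minimal sketch, assuming the pip distribution names `torch`, `onnxruntime-training` and `torch-ort`. `collect_versions` and `format_report` are hypothetical helpers, and CUDA/cuDNN and GPU details are not covered.

```python
import platform
from importlib import metadata

# Distribution names taken from the list above (assumed to match pip names).
PACKAGES = ["onnxruntime-training", "torch", "torch-ort"]


def collect_versions(packages):
    """Collect Python, OS and package versions into an ordered dict."""
    versions = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_report(versions):
    """Render one 'name: value' line per entry, ready to paste into an issue."""
    return "\n".join(f"{name}: {value}" for name, value in versions.items())


if __name__ == "__main__":
    print(format_report(collect_versions(PACKAGES)))
```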

To Reproduce
(updated 2022/4/8 with the suggestions)

Thanks for helping!! 🙏

ytaous commented on September 18, 2024

Hi, @JingyaHuang - you can ignore those warning messages, they are expected for now. Eventually they will get cleaned up.

@liqunfu, @thiagocrepaldi - this error message looks similar to what @edgchen1 reported a while ago, and the fix has been merged - #65
Any idea what the issue might be here, looking at the user's env?
I'll see if I can repro it locally, thx.

JingyaHuang commented on September 18, 2024

Hi folks,

Thanks for the reply!

I just tested onnxruntime-training with CUDA 10.2 and it works as expected. Here is the Dockerfile that I used.

But it would definitely be great to figure out a working env for 1.11.0 under CUDA 11, since the A100 doesn't support CUDA 10.2 and some benchmarks have already been done with it...

@ytaous @liqunfu @thiagocrepaldi

jambayk commented on September 18, 2024

@baijumeswani I came across the same mixed precision training issue with huggingface transformers (version 4.18.0) gpt2 reported here. The error doesn't arise when using transformers version 4.16.0.

My setup has:

torch 1.11.0
transformers 4.18.0
torch-ort 1.11.0
onnx 1.10.2
onnxruntime-training 1.11.0+cu113

Following the chain of messages here, it appears to me that the mixed precision training issue hasn't actually been addressed yet. The PR you have linked addresses a different issue related to symbolic_shape_infer and unrelated to the mixed precision error. Will the next release of ort also fix the mixed precision issue?
