Comments (8)
@JingyaHuang I think this error is exactly the one reported in this pull request: microsoft/onnxruntime#10674
I thought this PR had made its way into the ort 1.11.0 release. I will check and get back to you on that. In the meantime, you could use onnx==1.10.2 with onnxruntime-training==1.11.0+cu113 to work around the problem.
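If it helps, the pin can also be sanity-checked at runtime. A minimal sketch (the helper `needs_onnx_pin` is illustrative, not part of any package, and assumes, per this thread, that the fix landed after the 1.10.x line):

```python
# Illustrative sketch: decide whether an installed onnx version is past the
# 1.10.2 pin suggested in this thread (assumption: 1.10.x is the safe line).
def needs_onnx_pin(onnx_version: str) -> bool:
    """Return True if the given onnx version is newer than the 1.10.x pin."""
    # Strip any local build tag like "+cu113" before comparing.
    public = onnx_version.split("+")[0]
    major, minor, *_ = (int(p) for p in public.split("."))
    return (major, minor) > (1, 10)

print(needs_onnx_pin("1.11.0"))  # True: downgrade to onnx==1.10.2
print(needs_onnx_pin("1.10.2"))  # False: already on the pinned line
```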
from ort.
@JingyaHuang I can confirm that that PR didn't make it into ort release 1.11.0. Please use the workaround for now until we have a more permanent solution in place.
Hi @ytaous and @baijumeswani,
Thanks a lot for the reply, glad to know the root cause of the error!
I adopted the suggested workaround with onnx==1.10.2 and onnxruntime-training==1.11.0+cu113, and it works well for the models we had tested previously! The mixed-precision training issue remains, though, so we will temporarily skip the gpt2 benchmark until the next release with the fix integrated.
Thanks again for the help!
Hi, can you please try our latest version? Also try a later version of torch if possible.
https://download.onnxruntime.ai/
https://download.onnxruntime.ai/onnxruntime_stable_cu111.html
The Dockerfile for the example hasn't been updated for a while, so it might have some uncovered issues with older versions of torch and ort. Once you confirm the fix, please feel free to create another ticket here - https://github.com/microsoft/onnxruntime-training-examples/issues - and close this one.
Thanks.
Hello @ytaous,
Thank you for the reply, and bravo for the newly released onnxruntime 1.11.0!
I have tried onnxruntime-training 1.11.0, but I ran into some unexpected errors while running a simple text classification task. It seems that even the examples we had no problem with before are broken now. Here are the error messages I got:
Traceback (most recent call last):
  File "run_glue.py", line 572, in <module>
    main()
  File "run_glue.py", line 491, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/optimum/onnxruntime/trainer.py", line 476, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2016, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 288, in _forward
    return torch_module_ort._execution_manager(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 295, in forward
    self._fallback_manager.handle_exception(exception=e,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 231, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 329, in _export_model
    self._onnx_models.exported_model = SymbolicShapeInference.infer_shapes(self._onnx_models.exported_model,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 2061, in infer_shapes
    all_shapes_inferred = symbolic_shape_inference._infer_impl()
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 1928, in _infer_impl
    self._check_merged_dims(in_dims, allow_broadcast=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 620, in _check_merged_dims
    self._add_suggested_merge(dims, apply=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 218, in _add_suggested_merge
    assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
AssertionError
0%| | 0/12630 [00:03<?, ?it/s]
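For context, the assertion that fires is the invariant in `_add_suggested_merge`: every dimension being merged must be either an integer literal or a symbolic name the inference pass has already registered. A simplified, self-contained sketch of that check (names and structure are paraphrased from the traceback, not copied from onnxruntime):

```python
# Simplified sketch of the invariant behind the AssertionError in
# symbolic_shape_infer.py: every dim being merged must be either an
# integer literal or a symbolic name that is already registered.
def is_literal(dim):
    return isinstance(dim, int)

def check_suggested_merge(dims, known_symbolic_dims):
    # Mirrors: assert all((type(s) == str and s in self.symbolic_dims_)
    #                     or is_literal(s) for s in symbols)
    assert all(
        (isinstance(d, str) and d in known_symbolic_dims) or is_literal(d)
        for d in dims
    ), f"unknown symbolic dim among {dims!r}"

known = {"batch", "seq_len"}
check_suggested_merge(["batch", 128], known)       # passes silently
try:
    check_suggested_merge(["unknown_dim"], known)  # raises AssertionError
except AssertionError as e:
    print("merge rejected:", e)
```

When the exporter emits a dim name that was never registered (e.g. because an op's shape inference is missing, as the warnings below suggest), this assertion is what trips.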
Besides, I received a lot of warnings before that. It seems that the exported ONNX graph is broken:
WARNING: The shape inference of org.pytorch.aten::ATen type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
WARNING: The shape inference of com.microsoft::SoftmaxCrossEntropyLossInternal type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Warning: Checker does not support models with experimental ops: ATen
Environment information
The tests were run with this Dockerfile, with:
- OS: Ubuntu 20.04
- CUDA/cuDNN version: 11.1/8
- onnxruntime-training: 1.11.0+cu111
- torch: 1.10.0+cu111 (I actually tried every stable torch version > 1.9.0, i.e. 1.9.0, 1.9.0+cu111, 1.10.0, 1.10.0+cu111, 1.11.0+cu113; none of them work)
- torch-ort: 1.11.0
- Python version:3.8.10
- GPU: A100 / T4
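One thing worth ruling out when mixing wheels like these is a mismatch between the `+cuXXX` local build tags of torch and onnxruntime-training. A small sketch of such a sanity check (the helper `cuda_tag` and the second version string are illustrative, not from any package):

```python
# Sanity-check sketch: compare the "+cuXXX" local build tags of two
# version strings, e.g. torch vs onnxruntime-training wheels.
def cuda_tag(version):
    """Return the local build tag after '+', e.g. 'cu111', or None if absent."""
    _, sep, local = version.partition("+")
    return local if sep else None

torch_ver = "1.10.0+cu111"  # from the environment above
ort_ver = "1.11.0+cu113"    # illustrative mismatch, per the earlier workaround

if cuda_tag(torch_ver) != cuda_tag(ort_ver):
    print(f"CUDA tag mismatch: torch={cuda_tag(torch_ver)}, ort={cuda_tag(ort_ver)}")
```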
To Reproduce
(updated 2022/4/8 with the suggestions)
Thanks for helping!! 🙏
Hi, @JingyaHuang - you can ignore those warning messages; they are expected for now. Eventually they will get cleaned up.
@liqunfu , @thiagocrepaldi - this error message looks similar to what @edgchen1 reported a while ago, and the fix has been merged - #65
Any idea what might be the issue here, looking at the user's env?
I'll see if I can repro it locally, thx.
Hi folks,
Thanks for the reply!
I just tested onnxruntime-training with CUDA 10.2 and it works as expected. Here is the Dockerfile that I used.
But it would definitely be great to figure out a working env for 1.11.0 under CUDA 11, since A100 doesn't support CUDA 10.2 and some benchmarks have been done with it...
@ytaous @liqunfu @thiagocrepaldi
@baijumeswani I came across the same mixed precision training issue with huggingface transformers (version 4.18.0) gpt2 reported here. The error doesn't arise when using transformers version 4.16.0.
My setup has:
- torch 1.11.0
- transformers 4.18.0
- torch-ort 1.11.0
- onnx 1.10.2
- onnxruntime-training 1.11.0+cu113
Following the chain of messages here, it appears to me that the mixed precision training issue hasn't actually been addressed yet. The PR you linked addresses a different issue related to symbolic_shape_infer, unrelated to the mixed precision error. Will the next release of ort also fix the mixed precision issue?
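For reference, the part of mixed precision training that typically breaks is dynamic loss scaling. A conceptual, self-contained sketch of that mechanism (plain floats stand in for gradient tensors; this mimics the behavior of torch.cuda.amp.GradScaler, not onnxruntime's internal implementation, and real scalers grow the scale only after a run of overflow-free steps):

```python
import math

class TinyLossScaler:
    """Toy dynamic loss scaler, conceptually like torch.cuda.amp.GradScaler."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0):
        self.scale = init_scale    # loss is multiplied by this before backward
        self.backoff = backoff     # shrink factor on fp16 overflow
        self.growth = growth       # growth factor after a healthy step

    def unscale_and_check(self, scaled_grads):
        """Return unscaled grads, or None (skip the step) on inf/nan."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale *= self.backoff   # overflow: back off and skip update
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self.scale *= self.growth        # no overflow: try a larger scale
        return unscaled

scaler = TinyLossScaler(init_scale=4.0)
print(scaler.unscale_and_check([8.0, 2.0]))      # [2.0, 0.5]; scale grows to 8.0
print(scaler.unscale_and_check([float("inf")]))  # None; scale backs off to 4.0
```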