
Comments (8)

baijumeswani commented on September 18, 2024

@JingyaHuang I think this error is exactly the one reported in this pull request: microsoft/onnxruntime#10674

I thought this PR had made its way into the ort 1.11.0 release. I will check and get back to you on that. In the meantime, you can use onnx==1.10.2 with onnxruntime-training==1.11.0+cu113 to work around the problem.
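
A minimal sketch of checking that an environment matches the workaround pins above. The helper names `installed_version` and `find_pin_mismatches` are hypothetical; the code only compares the version strings reported by `importlib.metadata` and does not install anything. It assumes the distribution names match the package names used in the pins.

```python
from importlib import metadata

# Pins taken from the workaround above.
WORKAROUND_PINS = {
    "onnx": "1.10.2",
    "onnxruntime-training": "1.11.0+cu113",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that is off-pin."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_pin_mismatches(WORKAROUND_PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

The `lookup` parameter exists so the comparison can be exercised without touching the real environment, for example by passing the `get` method of a plain dict.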

from ort.

baijumeswani commented on September 18, 2024

@JingyaHuang I can confirm that the PR didn't make it into the ort 1.11.0 release. Please use the workaround for now until we have a more permanent solution in place.


JingyaHuang commented on September 18, 2024

Hi @ytaous and @baijumeswani,

Thanks a lot for the reply, super glad to know the root of the error!!

I adopted the suggested workaround with onnx==1.10.2 and onnxruntime-training==1.11.0+cu113, and it works well for the models that we had tested previously! Although the mixed-precision training issue remains, I think that we will temporarily skip the benchmark for gpt2 until the next release with the fix integrated.

Thanks again for the help!


ytaous commented on September 18, 2024

Hi, can you please try our latest version? Also try a later version of torch if possible.
https://download.onnxruntime.ai/
https://download.onnxruntime.ai/onnxruntime_stable_cu111.html

The Dockerfile for the example has not been updated for a while, so it might have undiscovered issues with older versions of torch and ort. Once you confirm the fix, please feel free to create another ticket at https://github.com/microsoft/onnxruntime-training-examples/issues and close this one.
Thanks.


JingyaHuang commented on September 18, 2024

Hello @ytaous,

Thank you for the reply, bravo for the newly released onnxruntime 1.11.0!

I have tried onnxruntime-training 1.11.0, but I ran into some unexpected errors while trying to run a simple text classification task. It seems that even the examples that we had no problem with before are broken now. Here are the error messages that I got:

Traceback (most recent call last):
  File "run_glue.py", line 572, in <module>
    main()
  File "run_glue.py", line 491, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/optimum/onnxruntime/trainer.py", line 476, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2016, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 309, in _forward
    return ortmodule._torch_module.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_utils.py", line 288, in _forward
    return torch_module_ort._execution_manager(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 295, in forward
    self._fallback_manager.handle_exception(exception=e,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_training_manager.py", line 231, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 329, in _export_model
    self._onnx_models.exported_model = SymbolicShapeInference.infer_shapes(self._onnx_models.exported_model,
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 2061, in infer_shapes
    all_shapes_inferred = symbolic_shape_inference._infer_impl()
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 1928, in _infer_impl
    self._check_merged_dims(in_dims, allow_broadcast=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 620, in _check_merged_dims
    self._add_suggested_merge(dims, apply=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/tools/symbolic_shape_infer.py", line 218, in _add_suggested_merge
    assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
AssertionError
  0%|                                                                                   | 0/12630 [00:03<?, ?it/s]
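
A small sketch of the condition asserted on the last frame of the traceback, to show which kinds of values make it fail. This is not the onnxruntime implementation: `is_literal` here is a simplified stand-in that only accepts plain Python ints, and `offending_symbols`, `known_dims`, and the sample dimension names are hypothetical.

```python
def is_literal(dim):
    """Simplified stand-in: treat plain Python ints as literal dimensions."""
    return isinstance(dim, int) and not isinstance(dim, bool)


def offending_symbols(symbols, symbolic_dims):
    """Return the entries that would make the asserted condition false."""
    return [
        s for s in symbols
        if not ((type(s) == str and s in symbolic_dims) or is_literal(s))
    ]


known_dims = {"batch", "sequence"}  # hypothetical registered symbolic dims

# Every entry is either a known symbolic dim or a literal: the assert would pass.
print(offending_symbols(["batch", 128], known_dims))

# An unregistered name or a non-string, non-literal value: the assert would fail.
print(offending_symbols(["batch", "unk__7", None], known_dims))
```

Under these assumptions, the assertion fires as soon as a dimension being merged is neither a literal nor a name already registered as a symbolic dimension.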

Besides, I received a lot of warnings before that. It seems that the exported ONNX graph is broken:

WARNING: The shape inference of org.pytorch.aten::ATen type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
WARNING: The shape inference of com.microsoft::SoftmaxCrossEntropyLossInternal type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Warning: Checker does not support models with experimental ops: ATen

Environment information

The tests were run with this Dockerfile, with:

  • OS: Ubuntu 20.04
  • CUDA/cuDNN version: 11.1/8
  • onnxruntime-training: 1.11.0+cu111
  • torch: 1.10.0+cu111
    (I actually tried every stable torch version >= 1.9.0, such as 1.9.0, 1.9.0+cu111, 1.10.0, 1.10.0+cu111, and 1.11.0+cu113; none of them work)
  • torch-ort: 1.11.0
  • Python version: 3.8.10
  • GPU: A100 / T4
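
A minimal sketch for comparing the CUDA build tags in a version list like the one above. The helpers `split_build_tag` and `cuda_tags` are hypothetical names. The thread does not say that a tag mismatch causes the error; this only makes reported environments easier to compare side by side.

```python
def split_build_tag(version):
    """Split 'X.Y.Z+tag' into ('X.Y.Z', 'tag'); tag is None when absent."""
    base, sep, tag = version.partition("+")
    return base, (tag if sep else None)


def cuda_tags(versions):
    """Map each package to its '+cuXXX' style build tag (None if untagged)."""
    return {pkg: split_build_tag(ver)[1] for pkg, ver in versions.items()}


# Versions copied from the environment list above.
reported = {
    "onnxruntime-training": "1.11.0+cu111",
    "torch": "1.10.0+cu111",
    "torch-ort": "1.11.0",
}

tags = cuda_tags(reported)
# Only packages that carry a tag can be compared with each other.
distinct = {tag for tag in tags.values() if tag is not None}
print(tags)
print("consistent CUDA tags" if len(distinct) <= 1 else f"mixed CUDA tags: {sorted(distinct)}")
```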

To Reproduce
(updated 2022/4/8 with the suggestions)

Thanks for helping!! 🙏


ytaous commented on September 18, 2024

Hi, @JingyaHuang - you can ignore those warning messages, they are expected for now. Eventually they will get cleaned up.

@liqunfu , @thiagocrepaldi - this error message looks similar to what @edgchen1 reported a while ago, and the fix has been merged - #65
Any idea what might be the issue here, looking at the user's env?
I'll see if I can repro it locally, thx.


JingyaHuang commented on September 18, 2024

Hi folks,

Thanks for the reply!

I just tested onnxruntime-training with cuda 10.2 and it works as expected. Here is the Dockerfile that I used.

But it would definitely be great to figure out a working environment for 1.11.0 under CUDA 11, since the A100 doesn't support CUDA 10.2 and some benchmarks have already been done on it...

@ytaous @liqunfu @thiagocrepaldi


jambayk commented on September 18, 2024

@baijumeswani I came across the same mixed precision training issue reported here, with huggingface transformers (version 4.18.0) gpt2. The error doesn't arise when using transformers version 4.16.0.

My setup has:

  • torch 1.11.0
  • transformers 4.18.0
  • torch-ort 1.11.0
  • onnx 1.10.2
  • onnxruntime-training 1.11.0+cu113

Following the chain of messages here, it appears to me that the mixed precision training issue hasn't actually been addressed yet. The PR you have linked addresses a different issue related to symbolic_shape_infer and unrelated to the mixed precision error. Will the next release of ort also fix the mixed precision issue?

