
Comments (21)

doloresgarcia commented on June 25, 2024

Hi @justinchuby, I am wondering if this could be related to the introduction of the new transformations in 124160. Do you think that could be the case? (Sorry to bother you again.)

from onnxruntime.

doloresgarcia commented on June 25, 2024

I have optimized the model and now I can start the inference session and run it. Thank you @yuslepukhin and @justinchuby :)

Awesome! Curious what was done?

The graph had many constants that were created by the model inside functions; I initialized those with the model instead. There were also conversion errors, for example:
x[..., index_list] is not converted well and has to be rewritten with torch.index_select.
However, operations like einsum do not seem to handle dynamic input shapes (this is a GNN-like architecture), so that is problematic.
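For reference, the indexing rewrite can be sketched like this (tensor shapes and names are illustrative, not from the model):

```python
import torch

x = torch.randn(2, 3, 5)
index_list = torch.tensor([4, 0, 2])

# Advanced indexing on the last axis -- the form that exported poorly:
y_fancy = x[..., index_list]

# Equivalent torch.index_select form, which maps to a plain ONNX Gather:
y_select = torch.index_select(x, dim=-1, index=index_list)

assert torch.equal(y_fancy, y_select)
```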


doloresgarcia commented on June 25, 2024

Hello @doloresgarcia, I am trying to convert, save, load, and run a custom PyTorch model via ONNX Runtime. However, as in your case, the run gets stuck and I get no clear error messages besides "UnsqueezeElimination cannot remove node _inlfunc_aten_mean_dim_n1" and "UnsqueezeElimination cannot remove node _inlfunc_aten_mean_dim_token_14647_n1". If I turn off the optimization, I get no error message and the process gets killed after a while. Can you give some guidance on what exactly you did, besides the torch.index_select change, to get the model working with onnxruntime? That would be of great help! Thank you.

I am using torch.onnx.dynamo_export, which seems to support more complex models than torch.onnx.export. I also disabled the graph optimization as you say:
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
but my error messages appeared even after disabling it, so I guess it must be something different.


phierhager commented on June 25, 2024

Hello @doloresgarcia, I am trying to convert, save, load, and run a custom PyTorch model via ONNX Runtime. However, as in your case, the run gets stuck and I get no clear error messages besides "UnsqueezeElimination cannot remove node _inlfunc_aten_mean_dim_n1" and "UnsqueezeElimination cannot remove node _inlfunc_aten_mean_dim_token_14647_n1". If I turn off the optimization, I get no error message and the process gets killed after a while. Can you give some guidance on what exactly you did, besides the torch.index_select change, to get the model working with onnxruntime? That would be of great help! Thank you.

I am using torch.onnx.dynamo_export, which seems to support more complex models than torch.onnx.export. I also disabled the graph optimization as you say (so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL), but my error messages appeared even after disabling it, so I guess it must be something different.

Okay, thank you for the quick reply.


justinchuby commented on June 25, 2024

Hi @justinchuby, I am wondering if this could be related to the introduction of the new transformations in 124160. Do you think that could be the case? (Sorry to bother you again.)

I suspect there may be another cause. Could you test with the latest ONNX Runtime release to see if it is still an issue?


doloresgarcia commented on June 25, 2024

Thanks for checking, @justinchuby! I have now tested with 1.17.3 and it is still the case :/


justinchuby commented on June 25, 2024

Is the model open source? Could you share its source code?


justinchuby commented on June 25, 2024

Please try the following:

Set the env var TORCHLIB_EXPERIMENTAL_PREFER_TRACING=1 before running the PyTorch export script to get the model, then inline the model with:

import onnx
import onnx.inliner

model_proto = onnx.load("model.onnx")
inlined = onnx.inliner.inline_local_functions(model_proto)
onnx.save(inlined, "model_inlined.onnx")

Not guaranteed to succeed, but I am curious whether that would help.
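The environment variable can also be set from the export script itself, as long as it happens early enough (before the export runs, and ideally before the exporter machinery is imported) for torchlib to pick it up:

```python
import os

# Set before torch.onnx.dynamo_export is invoked so that torchlib
# prefers traced function bodies over nested local functions.
os.environ["TORCHLIB_EXPERIMENTAL_PREFER_TRACING"] = "1"
```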


yuslepukhin commented on June 25, 2024

Try different optimization levels and see if this affects the outcome.


justinchuby commented on June 25, 2024

Some observations: the model has ~350k nodes.


doloresgarcia commented on June 25, 2024

Is the model open source? Could you share its source code?

The model is an adaptation of GATr (just removing the torch._VF einsums so that it is ONNX-exportable):
https://github.com/Qualcomm-AI-research/geometric-algebra-transformer/blob/main/gatr/nets/gatr.py
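As a sketch of that substitution (shapes illustrative): the internal torch._VF.einsum entry point, which the exporter does not understand, can usually be replaced one-for-one by the public torch.einsum.

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)

# Public torch.einsum instead of the internal torch._VF.einsum call;
# the equation string and operands stay exactly the same.
out = torch.einsum("ij,jk->ik", a, b)

assert torch.allclose(out, a @ b)
```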


doloresgarcia commented on June 25, 2024

Please try the following:

Set the env var TORCHLIB_EXPERIMENTAL_PREFER_TRACING=1 before running the PyTorch export script to get the model, then inline the model with:

import onnx
import onnx.inliner

model_proto = onnx.load("model.onnx")
inlined = onnx.inliner.inline_local_functions(model_proto)
onnx.save(inlined, "model_inlined.onnx")

Not guaranteed to succeed, but I am curious whether that would help.

This code runs and returns the inlined model. The InferenceSession log now shows an error:

2024-04-18 23:22:23.255135644 [W:onnxruntime:, constant_folding.cc:212 ApplyImpl] Could not find a CPU kernel and hence can't constant fold CastLike node 'n1__11634_2008'
2024-04-18 23:22:23.255240965 [W:onnxruntime:, constant_folding.cc:212 ApplyImpl] Could not find a CPU kernel and hence can't constant fold CastLike node 'n1__11602_1985'
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (n0__11868) Op (Mul) [ShapeInferenceError] Incompatible dimensions


doloresgarcia commented on June 25, 2024

Some observations: the model has ~350k nodes.

Would this alone make the inference session take too long to start, or not start at all?


doloresgarcia commented on June 25, 2024

Try different optimization levels and see if this affects the outcome.

Thanks for the reply, @yuslepukhin.
With ort.GraphOptimizationLevel.ORT_DISABLE_ALL it initializes the session (after 3 hours).
Then there is also a bug on shapes:
Status Message: updates tensor should have shape equal to indices.shape[:-1] + data.shape[indices.shape[-1]:]. updates shape: {}, indices shape: {3,1}, data shape: {4,4}

What is the correct way to debug this? I have no information about where to look for this operation in the original code. I am assuming this is a conversion error.
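That message is ONNX's ScatterND shape rule failing: with indices of shape {3,1} and data of shape {4,4}, the updates tensor must have shape (3,) + (4,) = (3,4), but the exported graph is feeding a scalar, so some slicing-assignment is being converted with the wrong updates rank. The rule and the reference semantics can be checked numerically (toy data, not from the model):

```python
import numpy as np

data = np.zeros((4, 4))
indices = np.array([[0], [2], [3]])  # shape (3, 1): three row indices
updates = np.ones((3, 4))            # required shape, not the scalar in the error

# ScatterND requires: updates.shape == indices.shape[:-1] + data.shape[k:]
k = indices.shape[-1]
assert updates.shape == indices.shape[:-1] + data.shape[k:]

# Reference semantics: write each updates row at its index.
out = data.copy()
for idx, upd in zip(indices, updates):
    out[tuple(idx)] = upd
```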


yuslepukhin commented on June 25, 2024

Some observations: the model has ~350k nodes.

Would this alone make the inference session take too long to start, or not start at all?

The model inlining takes a lot of time. Stand by.
How exactly was the conversion performed?


doloresgarcia commented on June 25, 2024

I have optimized the model and now I can start the inference session and run it. Thank you @yuslepukhin and @justinchuby :)


justinchuby commented on June 25, 2024

I have optimized the model and now I can start the inference session and run it. Thank you @yuslepukhin and @justinchuby :)

Awesome! Curious what was done?


yuslepukhin commented on June 25, 2024

The initial model fails the ONNX check. This is from the ORT-optimized model (inlining only):

Graph must be in single static assignment (SSA) form, however '_inlfunc_IsScalar_tmp' has been used as output names multiple times.

==> Context: Bad node spec for node. Name: _inlfunc_aten_mean_dim_n1 OpType: If


justinchuby commented on June 25, 2024

initialized those with the model instead

Do you mean turning Constant operators into graph initializers?

einsum do not seem to be dynamic with input shape

Could you share a concrete example?


phierhager commented on June 25, 2024

Hello @doloresgarcia,
I am trying to convert, save, load, and run a custom PyTorch model via ONNX Runtime. However, as in your case, the run gets stuck and I get no clear error messages besides "UnsqueezeElimination cannot remove node _inlfunc_aten_mean_dim_n1" and "UnsqueezeElimination cannot remove node _inlfunc_aten_mean_dim_token_14647_n1". If I turn off the optimization, I get no error message and the process gets killed after a while.
Can you give some guidance on what exactly you did, besides the torch.index_select change, to get the model working with onnxruntime? That would be of great help!
Thank you.


doloresgarcia commented on June 25, 2024

initialized those with the model instead

Do you mean turning Constant operators into graph initializers?

einsum do not seem to be dynamic with input shape

Could you share a concrete example?

I mean just matrices that were created inside functions and used in many layers. The solution was to add those as arguments of the main model class and pass them to the layers. This reduced the time to start inference, and now it works quickly.
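In PyTorch terms, one way to express that restructuring (a hedged sketch, not the actual GATr code): build the shared matrices once, store them on the top-level module, and pass them down, instead of recreating them inside each layer's forward.

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    def forward(self, x, basis):
        # `basis` is supplied by the caller instead of being rebuilt here,
        # so the exporter sees one shared tensor rather than a Constant
        # recreated in every layer.
        return x @ basis

class Model(nn.Module):
    def __init__(self, dim=4, depth=3):
        super().__init__()
        # Built once and registered on the module; exported as an initializer.
        self.register_buffer("basis", torch.eye(dim))
        self.layers = nn.ModuleList(Layer() for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x, self.basis)
        return x

m = Model()
y = m(torch.ones(2, 4))
```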

