Comments (16)
I reproduced the issue with https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v1-12.onnx on an A100. The average latency (ms) output:
ORT 1.13.1: 2.98
ORT 1.14.0: 3.20
ORT 1.17.1: 3.20
So there is some regression from 1.13.1 to 1.14.0. I will take a look at the cause.
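For reference, the averages above work out to roughly a 7% slowdown; a quick stdlib check:

```python
# Relative latency regression between the two ORT releases measured above.
latency_1_13_1 = 2.98  # ms, ORT 1.13.1
latency_1_14_0 = 3.20  # ms, ORT 1.14.0 (and 1.17.1)

regression_pct = (latency_1_14_0 - latency_1_13_1) / latency_1_13_1 * 100
print(f"~{regression_pct:.1f}% slower")  # ~7.4% slower
```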
from onnxruntime.
@krishung5 I would recommend trying to use a CUDA graph, that might help reducing the execution time for such small networks.
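For anyone trying this suggestion, a minimal sketch of the session configuration is below. It uses the CUDA execution provider option `enable_cuda_graph`; note that graph capture also requires binding inputs and outputs to fixed GPU buffers via IOBinding, which is not shown here, and the model path in the comment is hypothetical.

```python
# Sketch: request CUDA graph capture through the CUDA EP provider options.
# Graph capture additionally requires IOBinding to fixed device buffers;
# this only shows the provider configuration itself.
cuda_provider_options = {"enable_cuda_graph": "1"}
providers = [("CUDAExecutionProvider", cuda_provider_options)]

# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=providers)
```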
Which cuDNN version are you using?
@gedoensmax I am using cuDNN 8.7.0.84. I tried to use cuDNN 9 with onnxruntime-gpu 1.17.1, but it's still looking for cuDNN 8:
2024-05-17 19:57:13.917623151 [E:onnxruntime:Default, provider_bridge_ort.cc:1548 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1209 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.8: cannot open shared object file: No such file or directory
Hi team, I was wondering if we have any update on this issue?
Hello, do you have any idea about the performance degradation? I have tested the performance of onnxruntime 1.17; its performance is even worse than torch 2.0.1.
@tianleiwu can you help out with this. My initial guess was that there might be regressions due to cuDNN shipping less kernels. But it looks like cuDNN version was the same across the different versions.
@gedoensmax Sir, one thing I am confused about: if I install onnxruntime via pip install onnxruntime-gpu==1.17, will the package be the optimal one (i.e., will it match the CUDA 11.8 install on my machine and the corresponding cuBLAS/cuDNN libraries)? Can you explain that? Thanks a lot!
The default 1.17 package ships with CUDA 11. To install onnxruntime with CUDA 12 there is a separate package: https://onnxruntime.ai/docs/install/#install-onnx-runtime-gpu-cuda-11x
OK, thank you very much. Can you please take a look at this issue about dynamic quantization? There are some problems with dynamically quantizing the vicuna-7b model from fp16 to int8.
Hi @pranavsharma, just wanted to follow up and see if we have any update on this, thank you!
The root cause seems to be the change of the default value of cudnn_conv_use_max_workspace from 0 to 1 in #13981.
The solution is to set the value to 0 for ResNet:
session = ort.InferenceSession(model_path, providers=[("CUDAExecutionProvider", {"cudnn_conv_use_max_workspace": '0'})])
For debugging, setting an environment variable to limit the cuDNN workspace (in MiB) can help:
CUDNN_CONV_WSCAP_DBG=128 python test.py
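If it is easier to reproduce from a Python script, the same cap can be applied via os.environ; a sketch, assuming the variable is read when the CUDA EP is loaded:

```python
import os

# Cap the cuDNN convolution workspace at 128 MiB for debugging.
# The variable must be set before onnxruntime loads the CUDA EP.
os.environ["CUDNN_CONV_WSCAP_DBG"] = "128"

# import onnxruntime as ort  # import only after setting the variable
```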
@gedoensmax, do you know why a larger workspace causes a performance drop in some convolution networks (we've enabled conv algo tuning by default)?
@tianleiwu I just saw that conv algo tuning is now set to exhaustive search. This should guarantee the best possible perf, but usually the heuristics are sufficient.
Could you capture an Nsight Systems trace with and without the limited workspace size? I would like to confirm which kernels are used; it might no longer do a transformation from NCHW to NHWC to leverage Tensor Cores. It still surprises me that the exhaustive search did not pick that strategy.
The Nsight trace files:
resnet_nsys.zip
@gedoensmax I think using a CUDA graph indeed helps with the performance. I wasn't able to run the model used by the RIVA team due to the issue
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session cannot use the graph capture feature as requested by the user as the model has control flow nodes which can't be supported by CUDAExecutionProvider
but with the resnet model, I'm seeing an approximate improvement of 19.18% in average latency.
ORT 1.18 with CUDA Graph:
Latencies (ms):
2.595525799375592
2.116176817152235
2.7692823699026397
2.5585733278833254
2.085702587859799
ORT 1.18 without CUDA Graph:
Latencies (ms):
3.0858926098756116
2.4176077409224077
2.685696187645498
3.6532445387406782
3.1608499661840574
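The 19.18% figure can be reproduced from the latency lists above:

```python
from statistics import mean

# Per-run average latencies (ms) reported above.
with_graph = [2.595525799375592, 2.116176817152235, 2.7692823699026397,
              2.5585733278833254, 2.085702587859799]
without_graph = [3.0858926098756116, 2.4176077409224077, 2.685696187645498,
                 3.6532445387406782, 3.1608499661840574]

improvement_pct = (mean(without_graph) - mean(with_graph)) / mean(without_graph) * 100
print(f"{improvement_pct:.2f}%")  # prints 19.18%
```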
We were able to resolve the performance regression by setting cudnn_conv_use_max_workspace to 0, after this PR added the flexibility to do so in the Triton onnxruntime backend: triton-inference-server/onnxruntime_backend#256
Closing this issue. Thanks so much for everyone's help!