Comments (16)
I reproduced the issue with https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v1-12.onnx on an A100. The average latency (ms) output:
ORT 1.13.1: 2.98
ORT 1.14.0: 3.20
ORT 1.17.1: 3.20
So there is some regression from 1.13.1 to 1.14.0. I will take a look at the cause.
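For reference, the averages above work out to roughly a 7% slowdown; a quick stdlib check:

```python
# Relative latency regression between the two ORT releases measured above.
latency_1_13_1 = 2.98  # ms, ORT 1.13.1
latency_1_14_0 = 3.20  # ms, ORT 1.14.0 (and 1.17.1)

regression_pct = (latency_1_14_0 - latency_1_13_1) / latency_1_13_1 * 100
print(f"~{regression_pct:.1f}% slower")  # ~7.4% slower
```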
from onnxruntime.
@krishung5 I would recommend trying to use a CUDA graph, that might help reducing the execution time for such small networks.
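For anyone trying this suggestion, a minimal sketch of the session configuration is below. It uses the CUDA execution provider option `enable_cuda_graph`; note that graph capture also requires binding inputs and outputs to fixed GPU buffers via IOBinding, which is not shown here, and the model path in the comment is hypothetical.

```python
# Sketch: request CUDA graph capture through the CUDA EP provider options.
# Graph capture additionally requires IOBinding to fixed device buffers;
# this only shows the provider configuration itself.
cuda_provider_options = {"enable_cuda_graph": "1"}
providers = [("CUDAExecutionProvider", cuda_provider_options)]

# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=providers)
```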
Which cuDNN version are you using?
@gedoensmax I am using cuDNN 8.7.0.84. I tried to use cuDNN 9 with onnxruntime-gpu 1.17.1, but it's still looking for cuDNN 8:
2024-05-17 19:57:13.917623151 [E:onnxruntime:Default, provider_bridge_ort.cc:1548 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1209 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.8: cannot open shared object file: No such file or directory
Hi team, I was wondering if we have any update on this issue?
Hello, do you have any idea about the performance degradation? I have tested the performance of onnxruntime 1.17; its performance is even worse than torch 2.0.1.
@tianleiwu can you help out with this. My initial guess was that there might be regressions due to cuDNN shipping less kernels. But it looks like cuDNN version was the same across the different versions.
@gedoensmax Sir, one thing I am confused about: if I install onnxruntime via pip install onnxruntime-gpu==1.17, will the package be the optimal one (i.e., will it match the CUDA 11.8 install on my machine and the corresponding cuBLAS/cuDNN libraries)? Can you explain that? Thanks a lot!
The default 1.17 package ships with CUDA 11. To install onnxruntime with CUDA 12 there is a separate package: https://onnxruntime.ai/docs/install/#install-onnx-runtime-gpu-cuda-11x
OK, thank you very much. Can you please take a look at this issue about dynamic quantization? There are some problems with dynamically quantizing the vicuna-7b model from fp16 to int8.
Hi @pranavsharma, just wanted to follow up and see if we have any update on this, thank you!
The root cause seems to be the change of the default value of cudnn_conv_use_max_workspace from 0 to 1 in #13981.
The solution is to set the value to 0 for ResNet:
session = ort.InferenceSession(model_path, providers=[("CUDAExecutionProvider", {"cudnn_conv_use_max_workspace": '0'})])
For debugging, setting an environment variable to limit the cuDNN workspace (in MiB) can help:
CUDNN_CONV_WSCAP_DBG=128 python test.py
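If it is easier to reproduce from a Python script, the same cap can be applied via os.environ; a sketch, assuming the variable is read when the CUDA EP is loaded:

```python
import os

# Cap the cuDNN convolution workspace at 128 MiB for debugging.
# The variable must be set before onnxruntime loads the CUDA EP.
os.environ["CUDNN_CONV_WSCAP_DBG"] = "128"

# import onnxruntime as ort  # import only after setting the variable
```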
@gedoensmax, do you know why a larger workspace causes a performance drop in some convolution networks (we've enabled conv algo tuning by default)?
@tianleiwu I just saw that conv algo tuning is now set to exhaustive search. This should guarantee the best possible perf, but usually the heuristics are sufficient.
Could you capture an Nsight Systems trace with and without the limited workspace size? I would like to confirm which kernels are used; it might no longer do a transformation from NCHW to NHWC to leverage Tensor Cores. It still surprises me that the exhaustive search did not pick that strategy.
The Nsight trace files:
resnet_nsys.zip
@gedoensmax I think using a CUDA graph indeed helps with the performance. I wasn't able to run the model used by the RIVA team due to the issue
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session cannot use the graph capture feature as requested by the user as the model has control flow nodes which can't be supported by CUDAExecutionProvider
but with the resnet model, I'm seeing an approximate improvement of 19.18% in average latency.
ORT 1.18 with CUDA Graph:
Latencies (ms):
2.595525799375592
2.116176817152235
2.7692823699026397
2.5585733278833254
2.085702587859799
ORT 1.18 without CUDA Graph:
Latencies (ms):
3.0858926098756116
2.4176077409224077
2.685696187645498
3.6532445387406782
3.1608499661840574
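The 19.18% figure can be reproduced from the latency lists above:

```python
from statistics import mean

# Per-run average latencies (ms) reported above.
with_graph = [2.595525799375592, 2.116176817152235, 2.7692823699026397,
              2.5585733278833254, 2.085702587859799]
without_graph = [3.0858926098756116, 2.4176077409224077, 2.685696187645498,
                 3.6532445387406782, 3.1608499661840574]

improvement_pct = (mean(without_graph) - mean(with_graph)) / mean(without_graph) * 100
print(f"{improvement_pct:.2f}%")  # prints 19.18%
```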
We were able to resolve the performance regression by setting cudnn_conv_use_max_workspace to 0, after this PR added the flexibility to do so in the Triton onnxruntime backend: triton-inference-server/onnxruntime_backend#256
Closing this issue. Thanks so much for everyone's help!