Comments (9)
I found the problem: when onnxruntime performs the basic optimizations on the lm_head subgraph, it transposes the weight matrix (called model.shared.weight) and saves the result as another initializer, eliminating the Transpose node. (If I deactivate the basic optimizations, the transpose is still performed on model.shared.weight at execution time by duplicating it, so in addition to consuming 1 GB more it also slows down execution.) So I thought of applying the transpose identity of matrix multiplication to the lm_head: execute the Transpose on the other MatMul input (pre_logits) and on the final result, and invert the order of the two MatMul operands. This yields an equivalent MatMul without having to transpose model.shared.weight (which is much larger than the two matrices we now transpose). This way I managed to reduce RAM consumption by 1 GB, but the problem is that the execution time increases.
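For reference, the rewrite relies on the matrix identity A·Bᵀ = (B·Aᵀ)ᵀ. A minimal NumPy sketch with toy shapes (the real matrices are far larger) shows the two formulations of the lm_head are numerically equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_logits = rng.standard_normal((1, 1024))   # toy stand-in for pre_logits
shared_w = rng.standard_normal((256, 1024))   # toy stand-in for model.shared.weight

# Original lm_head: transpose the large shared weight, then multiply
logits = pre_logits @ shared_w.T

# Rewritten lm_head: swap the operands and transpose only the small
# activation and the small result; shared_w is used exactly as stored
logits_rw = (shared_w @ pre_logits.T).T

assert np.allclose(logits, logits_rw)
```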
This is the updated model: https://github.com/niedev/testModel/releases/download/testModel_2.0/nllb_embed_and_lm_head_if3.onnx
I ran the onnx profiler on the new model (without optimizations) to understand which node causes the performance decrease. The two added Transposes execute practically instantly; the node that takes longer than before is the lm_head's MatMul (it goes from 36 ms to 50 ms), but I can't understand why, given that the multiplication is practically equivalent.
This is the profiling result of the old model: https://github.com/niedev/testModel/blob/main/embed_and_lmhead_log_2024-05-19_old.json
This is the profiling result of the new model: https://github.com/niedev/testModel/blob/main/embed_and_lmhead_log_2024-05-19_new.json
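For anyone reproducing the comparison, per-node totals can be pulled out of those profiling files with a short script. This is a sketch assuming the chrome-trace event layout that onnxruntime's profiler emits, where kernel events have `"cat": "Node"` and durations are in microseconds:

```python
import json
from collections import defaultdict

def op_durations(profile_path):
    """Sum the "dur" field (microseconds) per node name from an
    onnxruntime profiling JSON (a list of chrome-trace events)."""
    with open(profile_path) as f:
        events = json.load(f)
    totals = defaultdict(int)
    for ev in events:
        if ev.get("cat") == "Node":
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return dict(totals)
```

Running it on the old and new logs and diffing the two dicts should isolate the MatMul regression directly.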
The model I'm working on is the result of extracting the components that perform the embed and lm_head of NLLB. So if you could solve the MatMul problem (if it is solvable) and integrate this modification into onnxruntime's basic optimization process, it would lead to a significant reduction in RAM consumption for NLLB (and also for other Transformers that share the embed and lm_head matrix).
from onnxruntime.
Try using the XNNPACK execution provider. The MLAS kernels on arm64 focus on quantized data, so for 32-bit floats the XNNPACK kernels might have optimizations that address the performance drop.
Hi, the final model I will implement will be quantized, so I also ran tests with these quantized components (u8/u8, with both weights and activations asymmetric) on arm64, and the problem is the same (less RAM consumption, but about 35% more execution time).
The `activation_size` value is the sum of the input tensor sizes. In the new model that's 1024x larger.
Old:
New:
Ok, but I think this is a problem just with the log: in the old model the second MatMul input (which has dimensions 1024x256000) is not shown in the log, yet the MatMul must have 2 inputs (also based on the old model's graph):
Ah ok. Definitely unexpected that a MatMul of {1, 1K} x {1K, 256K} is significantly better than {256K, 1K} x {1K, 1}. @yufenglee any ideas as to why that would be?
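For what it's worth, the two operand orderings can also be compared outside onnxruntime. Below is a NumPy micro-benchmark sketch with the 256K vocabulary dimension scaled down to 8192; NumPy delegates to its BLAS backend rather than MLAS, so this is only indicative of how the two shapes can behave differently:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((1, 1024), dtype=np.float32)
B = rng.standard_normal((1024, 8192), dtype=np.float32)  # scaled-down vocab dim

def bench(f, reps=10):
    f()  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        f()
    return (time.perf_counter() - t0) / reps

t_orig = bench(lambda: a @ B)      # {1, 1K} x {1K, N}
t_swap = bench(lambda: B.T @ a.T)  # {N, 1K} x {1K, 1}
print(f"a @ B: {t_orig*1e3:.3f} ms, B.T @ a.T: {t_swap*1e3:.3f} ms")

# The two orderings are numerically equivalent up to float32 rounding
assert np.allclose(a @ B, (B.T @ a.T).T, atol=1e-3)
```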
Did you try the XNNPACK EP just for another data point?
More of a workaround, but could you change the initializer to be in the transposed ordering that the original MatMul used, and instead update the usage of model.shared.weight in the other subgraph to adjust for that? That may avoid the constant folding that duplicates the initializer, which would address the original problem.
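Here is a toy NumPy sketch of that workaround: store the initializer already transposed, let the lm_head MatMul consume it directly, and make the Gather in the embed subgraph index the other axis (the shapes are illustrative, and the `axis=1` gather is the hypothetical adjustment, not something taken from the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64))   # toy stand-in for model.shared.weight
ids = np.array([3, 7])               # toy token ids

# Current layout: Gather takes rows of W; the lm_head needs W transposed
emb = W[ids]
logits = emb @ W.T

# Workaround layout: store Wt = W.T once; the MatMul uses it as-is,
# and the Gather in the embed subgraph indexes axis 1 instead of axis 0
Wt = W.T.copy()
emb2 = Wt[:, ids].T
logits2 = emb2 @ Wt

assert np.allclose(logits, logits2)
```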
I haven't tested with XNNPACK because it only supports MatMul and 2D Gather, so I would have to modify the model and redo all the tests.
I tried the workaround you proposed, but that way the dynamic quantization does not quantize the Gather but only the MatMul, so the size of the model is not reduced. I tried applying the workaround directly to the quantized model, but in that case the duplication of the weights in memory still happens and the quality of the output produced by the model drops drastically.
Plus, I re-tested inference with the (old) new model (the one with the "transposed" MatMul) and noticed that even that model has reduced output quality, perhaps because with the single shared weight matrix the dynamic quantization applied is the same for both the MatMul and the Gather, and the two operators require different scale and zero-point values.
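That scale/zero-point intuition can be illustrated with a small sketch: asymmetric u8 parameters derived from one value range reconstruct a tensor with a different range much more poorly, which is consistent with a single shared parameter pair serving two consumers badly (toy distributions, not NLLB's actual tensors):

```python
import numpy as np

def quant_params_u8(x):
    # Asymmetric u8: scale and zero point from the tensor's own min/max
    lo, hi = min(x.min(), 0.0), max(x.max(), 0.0)
    scale = (hi - lo) / 255.0
    zp = round(-lo / scale)
    return scale, zp

def quantize(x, scale, zp):
    return np.clip(np.round(x / scale) + zp, 0, 255).astype(np.uint8)

def dequantize(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)              # weight-like range
act = (10.0 * rng.standard_normal(1024)).astype(np.float32)   # much wider range

s_w, z_w = quant_params_u8(w)
s_a, z_a = quant_params_u8(act)

# Each tensor quantized with its own parameters: small round-trip error.
# Forcing w to use the wider range's parameters: the error grows ~10x.
err_own = np.abs(dequantize(quantize(w, s_w, z_w), s_w, z_w) - w).max()
err_shared = np.abs(dequantize(quantize(w, s_a, z_a), s_a, z_a) - w).max()
assert err_shared > err_own
```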
So at this point I think I'll keep the added RAM consumption and finish the project I'm working on. Thank you anyway for the help.
FWIW XNNPACK kernels just fill gaps with the CPU EP, so it doesn't matter how many/few operators it implements. That's because it runs on CPU so there's no device transfer cost to go between a node executing using XNNPACK and one using the CPU EP.
So the XNNPACK EP will take whatever nodes it can, and the others will be taken by the CPU EP.
Ok, but according to the documentation it shouldn't support any of the operators in the lm_head subgraph, so without modifying the model it shouldn't help, right?