Comments (11)
There is the howvec plugin, which breaks instructions into categories; maybe it can show more details?
I checked the plugin's source code and it doesn't support the breakdown into categories yet for RISC-V. I tried running it anyway and it isn't listing the vector instructions. I'm guessing it just lists the top N instructions executed, as the objdump shows that it should be executing vector instructions.
How many threads do you have? Maybe this matters too. You disable parallel processing in OpenCV using the
cv::setNumThreads(1)
function call. Could it affect the instruction count? For example, idle threads in the thread pool might be counted.
I'm running it on a 4-core CPU, same thing happened in an 8-core. Setting numThreads to 1 at the start of the program didn't make a difference in this case. The plugin is still showing 4 CPU threads, but one thread has an order of magnitude more instructions than the rest and it increases when I increase the VLEN, so the behavior still persists.
Interesting, I can't compile OpenCV 4.x and 4.9.0 using gcc 13.2.0 (gc891d8dc23e) due to this compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111074 It has been fixed on the 14.x branch and somewhere on the releases/gcc-13 branch (I checked (g128d9cc0599) 13.2.1 20240220). Are you sure you have this GCC version? Or maybe our compile flags are different. I have the following cmake command (some options are duplicated; the compiler bin directory is in the PATH):
I didn't know this. I compiled OpenCV with the Clang toolchain using the command listed above. When I mentioned GCC, I meant that I compiled the test program with both GCC and Clang, but now that I think of it, that wouldn't make much difference anyway. I'll see if I can get a later GCC version, compile OpenCV with it, and check whether the problem persists.
from opencv.
QEMU has the -d in_asm option to show "input" instructions. However, this output would be very large; redirect the result to a logfile and/or a gzip pipe.
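A sketch of how that invocation might look (the qemu-riscv64 binary name, the CPU property names, and the ./test program name are assumptions here; check qemu-riscv64 -cpu help for the properties your build supports):

```shell
# Assumed invocation: qemu-riscv64 user mode with the RVV vlen property.
# -d in_asm logs translated guest instructions; -D sends the log to a file.
qemu-riscv64 -cpu rv64,v=true,vlen=256 -d in_asm -D in_asm.log ./test

# Since the log can be huge, compress it on the fly instead:
qemu-riscv64 -cpu rv64,v=true,vlen=256 -d in_asm ./test 2>&1 | gzip > in_asm.log.gz
```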
Just to be sure, I compiled GCC 13.2, cherry-picking the commit that fixed the bug mentioned there, then compiled OpenCV with this GCC and ran the experiments. The behavior isn't happening anymore, so it seems to be a Clang issue. I'm closing the issue; thanks everyone for your help.
Here are the results of some runs I made if anyone is curious:
results.csv
Perhaps Clang is more aggressive in auto-vectorization than GCC in this case. AFAIK OpenCV doesn't use manually optimized code for RVV in dnn as extensively as in core or imgproc.
Here's a much shorter and faster test where the behavior is visible:
#include <opencv2/opencv.hpp>

using namespace std;

static const string MODEL_PATH = "test.onnx";

int main() {
    cv::dnn::Net model = cv::dnn::readNetFromONNX(MODEL_PATH);
    model.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    model.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

    cv::Mat input = cv::Mat::ones(32, 32, CV_32F);
    cv::dnn::blobFromImage(input, input);
    model.setInput(input);

    vector<cv::Mat> output_vec(1);
    model.forward(output_vec, model.getUnconnectedOutLayersNames());
    cv::Mat output = output_vec[0];
    cout << output << endl;
    return 0;
}
There isn't an ~80% or higher slowdown like in the previous test, but there is a ~5% slowdown each time you double the VLEN.
test.zip
How do you count the cycles? QEMU isn't performance-accurate, and in my testing it's slower at emulating larger VLEN. IIRC QEMU just passes through the system timer and cycle count.
Edit: I missed the libinsn part; I haven't used that before. That might still be the issue though.
Yes, sorry, in my last comment when I said "slowdown" I meant to say "higher instruction count". There is indeed no correlation between QEMU execution time and actual execution time. Most programs are (wall clock) faster in QEMU when I compile them without vector instructions than when I do. That's why I've been using the guest instruction count as a (poor) proxy for performance.
Here is the cmake command from the comment above (some options are duplicated; the compiler bin directory is in the PATH):
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=/work/opencv/platforms/linux/riscv64-gcc.toolchain.cmake \
-DCPU_BASELINE=RVV \
-DCPU_BASELINE_REQUIRE=RVV \
-DBUILD_SHARED_LIBS=OFF \
-DWITH_OPENCL=OFF \
'-DOPENCV_EXTRA_CXX_FLAGS=-fmax-errors=1 -Wfatal-errors' \
-DRISCV_RVV_SCALABLE=ON \
-DCPU_BASELINE=RVV \
-DCPU_BASELINE_REQUIRE=RVV \
-DCMAKE_BUILD_TYPE=Release \
../opencv
QEMU has the -d in_asm option to show "input" instructions. However, this output would be very large; redirect the result to a logfile and/or a gzip pipe.
I gave it a look, but I can't get much insight from these outputs. This is the output from running the first example from the issue with VLEN=128 and VLEN=256:
in_asm_output.zip
Question: could it be a performance issue with a BLAS implementation I compiled for RISC-V, or does OpenCV use its own linear algebra implementations for running ONNX models? I ask because I compiled OpenBLAS optimized for RVV with VLEN=256. However, that wouldn't explain why 128 bits uses fewer instructions than 256 anyway. Also, in another program I tested, OpenBLAS segfaulted with VLEN=128 due to these optimizations, so I don't see why it wouldn't segfault here.
I can't say anything about OpenBLAS. AFAIK OpenCV doesn't use BLAS in the DNN module. My suggestion is that the increased instruction count comes from tail processing. Consider the following example:
int i = 0;
for (; i <= len - VLEN; i += VLEN)
{
    // vector processing
}
while (i < len)
{
    // scalar processing
    ++i;
}
Let's say len = 1023, for example; then:
- for VLEN=128 we have 7 vector iterations and 127 scalar iterations
- for VLEN=256 we have 3 vector iterations and 255 scalar iterations
- for VLEN=512 we have 1 vector iteration and 511 scalar iterations
Depending on the number of instructions in each loop body, the total instruction count can increase with increasing VLEN. It doesn't mean that real-world performance will be slower though; that would also depend on other factors like cache sizes and microarchitecture details.
I assume here that RVV is used in a fixed-length manner as shown above, and OpenCV uses it this way currently. Perhaps it would be more efficient to use a scalable approach where vl can change on each iteration, processing the main part and the tail in the same loop.
I see, that would make a lot of sense. If that is so, I'd recommend against it even if it's done on purpose. I'm guessing it's implemented that way to be optimal in instances where, for example, AVL=513 and VLMAX=512. Naturally, one would expect the hardware to pick vl=512, do one round of vector instructions, and then one scalar iteration to process the remaining element. But that is actually implementation-defined: the specification says vl can be any value such that ceil(AVL / 2) ≤ vl ≤ VLMAX if AVL < (2 * VLMAX). So it could pick vl=257, do a round of vector instructions, and then do an awkward 256 scalar operations. The specification seems to encourage looping until the AVL is 0.
Thank you for your help!