Comments (11)
There is the howvec plugin, which breaks instructions into categories; maybe it can show more details?
I checked the plugin's source code and it doesn't support the breakdown into categories yet for RISC-V. I tried running it anyway and it isn't listing the vector instructions. I'm guessing it just lists the top N instructions executed, as the objdump shows that it should be executing vector instructions.
How many threads do you have? Maybe this matters too. You disable parallel processing in OpenCV using the
cv::setNumThreads(1)
function call. Could it affect the instruction count? For example, idle threads in the thread pool might be counted.
I'm running it on a 4-core CPU, same thing happened in an 8-core. Setting numThreads to 1 at the start of the program didn't make a difference in this case. The plugin is still showing 4 CPU threads, but one thread has an order of magnitude more instructions than the rest and it increases when I increase the VLEN, so the behavior still persists.
Interesting, I can't compile OpenCV 4.x and 4.9.0 using gcc 13.2.0 (gc891d8dc23e) due to this compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111074 It has been fixed on the 14.x branch and somewhere on the releases/gcc-13 branch (I checked (g128d9cc0599) 13.2.1 20240220). Are you sure you have this GCC version? Or maybe our compile flags are different. I have the following cmake command (some options are duplicated; the compiler bin directory is in the PATH):
I didn't know this. I compiled OpenCV with the Clang toolchain using the command listed above. When I mentioned GCC, I meant that I compiled the test program with both GCC and Clang, but now that I think of it, that wouldn't make much difference anyway. I'll see if I can get a later GCC version, compile OpenCV with it, and check whether the problem persists.
from opencv.
QEMU has the -d in_asm option to show "input" instructions. However, this output would be very large; redirect the result to a logfile and/or a gzip pipe.
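A sketch of how that invocation might look (the qemu-riscv64 binary name, the CPU property names, and the ./test program name are assumptions here; check qemu-riscv64 -cpu help for the properties your build supports):

```shell
# Assumed invocation: qemu-riscv64 user mode with the RVV vlen property.
# -d in_asm logs translated guest instructions; -D sends the log to a file.
qemu-riscv64 -cpu rv64,v=true,vlen=256 -d in_asm -D in_asm.log ./test

# Since the log can be huge, compress it on the fly instead:
qemu-riscv64 -cpu rv64,v=true,vlen=256 -d in_asm ./test 2>&1 | gzip > in_asm.log.gz
```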
Just to be sure, I compiled GCC 13.2, cherry-picking the commit that fixed the bug mentioned there, then compiled OpenCV with this GCC and ran the experiments. The behavior isn't happening anymore, so it seems to be a Clang issue. I'm closing the issue; thanks everyone for your help.
Here are the results of some runs I made if anyone is curious:
results.csv
Perhaps Clang is more aggressive in auto-vectorization than GCC in this case. AFAIK OpenCV doesn't use manually optimized code for RVV in dnn as extensively as in core or imgproc.
Here's a much shorter and faster test where the behavior is visible:
#include <opencv2/opencv.hpp>

using namespace std;

static const string MODEL_PATH = "test.onnx";

int main() {
    cv::dnn::Net model = cv::dnn::readNetFromONNX(MODEL_PATH);
    model.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    model.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

    cv::Mat input = cv::Mat::ones(32, 32, CV_32F);
    cv::dnn::blobFromImage(input, input);
    model.setInput(input);

    vector<cv::Mat> output_vec(1);
    model.forward(output_vec, model.getUnconnectedOutLayersNames());
    cv::Mat output = output_vec[0];
    cout << output << endl;
    return 0;
}
There isn't an ~80% or higher slowdown like in the previous test, but there is a ~5% slowdown each time you double the VLEN.
test.zip
How do you count the cycles? QEMU isn't performance-accurate, and in my testing it's slower at emulating larger VLEN. IIRC QEMU just passes through the system timer and cycle count.
Edit: I missed the libinsn part; I haven't used that before. That might still be the issue though.
Yes, sorry, in my last comment when I said "slowdown" I meant to say "higher instruction count". There is indeed no correlation between QEMU execution time and actual execution time. Most programs are (wall clock) faster in QEMU when I compile them without vector instructions than when I do. That's why I've been using the guest instruction count as a (poor) proxy for performance.
Here is the cmake command from the comment above (some options are duplicated; the compiler bin directory is in the PATH):
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=/work/opencv/platforms/linux/riscv64-gcc.toolchain.cmake \
-DCPU_BASELINE=RVV \
-DCPU_BASELINE_REQUIRE=RVV \
-DBUILD_SHARED_LIBS=OFF \
-DWITH_OPENCL=OFF \
'-DOPENCV_EXTRA_CXX_FLAGS=-fmax-errors=1 -Wfatal-errors' \
-DRISCV_RVV_SCALABLE=ON \
-DCPU_BASELINE=RVV \
-DCPU_BASELINE_REQUIRE=RVV \
-DCMAKE_BUILD_TYPE=Release \
../opencv
QEMU has the -d in_asm option to show "input" instructions. However, this output would be very large; redirect the result to a logfile and/or a gzip pipe.
I gave it a look, but I can't get much insight from these outputs. This is the output from running the first example from the issue with VLEN=128 and VLEN=256:
in_asm_output.zip
Question: could it be a performance issue with a BLAS implementation I compiled for RISC-V, or does OpenCV use its own linear algebra implementations for running ONNX models? I ask because I compiled OpenBLAS optimized for RVV with VLEN=256. However, that wouldn't explain why 128 bits uses fewer instructions than 256 anyway. Also, in another program I tested, OpenBLAS segfaulted with VLEN=128 due to these optimizations, so I don't see why it wouldn't segfault here.
I can't say anything about OpenBLAS. AFAIK OpenCV doesn't use BLAS in the DNN module. My suggestion is that the increased instruction count comes from tail processing. Consider the following example:
int i = 0;
for (; i <= len - VLEN; i += VLEN)
{
    // vector processing
}
while (i < len)
{
    // scalar processing
    ++i;
}
Let's say len = 1023, for example; then:
- for VLEN=128 we have 7 vector iterations and 127 scalar iterations
- for VLEN=256 we have 3 vector iterations and 255 scalar iterations
- for VLEN=512 we have 1 vector iteration and 511 scalar iterations
Depending on the number of instructions in each loop body, the total instruction count can increase with increasing VLEN. It doesn't mean that real-world performance will be slower though; that would also depend on other factors like cache sizes and microarchitecture details.
I assume here that RVV is used in a fixed-length manner as shown above, and OpenCV uses it this way currently. Perhaps it would be more efficient to use a scalable approach where vl can change on each iteration, processing the main part and the tail in the same loop.
I see, that would make a lot of sense. If that is so, I'd recommend against it even if it's done on purpose. I'm guessing it's implemented that way to be optimal in instances where, for example, AVL=513 and VLMAX=512. Naturally, one would expect the hardware to pick vl=512, do one round of vector instructions, and then one scalar iteration to process the remaining element. But that is actually implementation-defined: the specification says vl can be any value such that ceil(AVL / 2) ≤ vl ≤ VLMAX if AVL < (2 * VLMAX). So it could pick vl=257, do a round of vector instructions, and then do an awkward 256 scalar operations. The specification seems to encourage looping until the AVL is 0.
Thank you for your help!