Comments (11)

salmundani commented on June 12, 2024

There is a howvec plugin which breaks instructions into categories; maybe it can show more details?

I checked the plugin's source code and it doesn't support the breakdown into categories for RISC-V yet. I tried running it anyway and it isn't listing the vector instructions; I'm guessing it just lists the top N instructions executed, since the objdump output shows that the program should be executing vector instructions.

How many threads do you have? Maybe this matters too. You can disable parallel processing in OpenCV using the cv::setNumThreads(1) function call. Could it affect the instruction count? For example, idle threads in the thread pool might be counted.

I'm running it on a 4-core CPU; the same thing happened on an 8-core one. Setting the number of threads to 1 at the start of the program didn't make a difference in this case. The plugin is still showing 4 CPU threads, but one thread has an order of magnitude more instructions than the rest, and its count increases when I increase the VLEN, so the behavior still persists.

Interesting, I can't compile OpenCV 4.x and 4.9.0 using gcc 13.2.0 (gc891d8dc23e) due to this compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111074 It has been fixed on the 14.x branch and somewhere on the releases/gcc-13 branch (I checked (g128d9cc0599) 13.2.1 20240220). Are you sure you have this GCC version? Or maybe our compile flags are different. I have the following cmake command (some options are duplicated, the compiler bin directory is in the PATH):

I didn't know this. I compiled OpenCV with the Clang toolchain using the command listed above. When I mentioned GCC I meant that I compiled the test program with both GCC and Clang, but now that I think of it, that wouldn't make much difference anyway. I'll see if I can get a later GCC version, compile OpenCV with it, and check whether the problem persists.

opencv-alalek commented on June 12, 2024

QEMU has a -d in_asm option to show "input" instructions. However, this output would be very large; redirect the result to a log file and/or a gzip pipe.

salmundani commented on June 12, 2024

Interesting, I can't compile OpenCV 4.x and 4.9.0 using gcc 13.2.0 (gc891d8dc23e) due to this compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111074 It has been fixed on the 14.x branch and somewhere on the releases/gcc-13 branch (I checked (g128d9cc0599) 13.2.1 20240220). Are you sure you have this GCC version? Or maybe our compile flags are different. I have the following cmake command (some options are duplicated, the compiler bin directory is in the PATH):

Just to be sure, I compiled GCC 13.2 cherry-picking the commit that fixed the bug mentioned there, then compiled OpenCV with this GCC, ran the experiments, and the behavior isn't happening anymore, so it seems to be a Clang issue. I'm closing the issue; thanks everyone for your help.

Here are the results of some runs I made if anyone is curious:
results.csv

mshabunin commented on June 12, 2024

Perhaps Clang is more aggressive in auto-vectorization than GCC in this case. AFAIK OpenCV doesn't use manually optimized code for RVV in dnn as extensively as in core or imgproc.

salmundani commented on June 12, 2024

Here's a much shorter and faster test where the behavior is visible:

#include <opencv2/opencv.hpp>
using namespace std;

static const string MODEL_PATH = "test.onnx";

int main() {
    // Load the ONNX model and force the plain OpenCV CPU backend.
    cv::dnn::Net model = cv::dnn::readNetFromONNX(MODEL_PATH);
    model.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    model.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

    // Build a trivial 32x32 all-ones input blob.
    cv::Mat input = cv::Mat::ones(32, 32, CV_32F);
    cv::dnn::blobFromImage(input, input);
    model.setInput(input);

    // Run inference and print the output of the unconnected output layers.
    vector<cv::Mat> output_vec(1);
    model.forward(output_vec, model.getUnconnectedOutLayersNames());
    cv::Mat output = output_vec[0];
    cout << output << endl;
    return 0;
}

There isn't an ~80% or higher slowdown like in the previous test, but there is a ~5% slowdown each time you double the VLEN.
test.zip

camel-cdr commented on June 12, 2024

How do you count the cycles? QEMU isn't performance-accurate, and in my testing it's slower at emulating larger VLEN. IIRC QEMU just passes through the system timer and cycle count.
Edit: I missed the libinsn part; I haven't used that before. That might still be the issue, though.

salmundani commented on June 12, 2024

Yes, sorry, in my last comment when I said "slowdown" I meant to say "higher instruction count". There is indeed no correlation between QEMU execution time and actual execution time. Most programs are (wall clock) faster in QEMU when I compile them without vector instructions than when I do. That's why I've been using guest instruction count as a (poor) proxy for performance.

mshabunin commented on June 12, 2024

There is a howvec plugin which breaks instructions into categories; maybe it can show more details?

How many threads do you have? Maybe this matters too. You can disable parallel processing in OpenCV using the cv::setNumThreads(1) function call. Could it affect the instruction count? For example, idle threads in the thread pool might be counted.
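
For reference, a minimal sketch of forcing single-threaded execution at the start of such a test (this only assumes the public cv::setNumThreads API; it is not the exact code from the report):

#include <opencv2/opencv.hpp>

int main() {
    // Shrink OpenCV's internal thread pool to a single worker so that
    // parallel_for_ regions don't spawn extra threads whose instructions
    // would also be counted by the QEMU plugin.
    cv::setNumThreads(1);

    // ... load the model and run the forward pass as in the test program ...
    return 0;
}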

Interesting, I can't compile OpenCV 4.x and 4.9.0 using gcc 13.2.0 (gc891d8dc23e) due to this compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111074 It has been fixed on the 14.x branch and somewhere on the releases/gcc-13 branch (I checked (g128d9cc0599) 13.2.1 20240220). Are you sure you have this GCC version? Or maybe our compile flags are different. I have the following cmake command (some options are duplicated, the compiler bin directory is in the PATH):

cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=/work/opencv/platforms/linux/riscv64-gcc.toolchain.cmake \
  -DCPU_BASELINE=RVV \
  -DCPU_BASELINE_REQUIRE=RVV \
  -DBUILD_SHARED_LIBS=OFF \
  -DWITH_OPENCL=OFF \
  '-DOPENCV_EXTRA_CXX_FLAGS=-fmax-errors=1 -Wfatal-errors' \
  -DRISCV_RVV_SCALABLE=ON \
  -DCPU_BASELINE=RVV \
  -DCPU_BASELINE_REQUIRE=RVV \
  -DCMAKE_BUILD_TYPE=Release \
../opencv

salmundani commented on June 12, 2024

QEMU has a -d in_asm option to show "input" instructions. However, this output would be very large; redirect the result to a log file and/or a gzip pipe.

I gave it a look, but I can't get much insight from these outputs. This is the output from running the first example from the issue with VLEN=128 and VLEN=256:
in_asm_output.zip

Question: could it be a performance issue with a BLAS implementation I compiled for RISC-V, or does OpenCV use its own linear algebra implementations for running ONNX models? I ask because I compiled OpenBLAS optimized for RVV with VLEN=256. However, that wouldn't explain why 128 bits uses fewer instructions than 256 anyway. Also, in another program I tested, OpenBLAS segfaulted with VLEN=128 due to these optimizations, so I don't see why it wouldn't segfault here.

mshabunin commented on June 12, 2024

I cannot say anything about OpenBLAS. AFAIK OpenCV doesn't use BLAS in the DNN module. My suggestion is that the increased instruction count comes from tail processing. Consider the following example:

int i = 0;
for (; i <= len - VLEN; i += VLEN)
{
    // vector processing
}
while (i < len)
{
    // scalar processing
    ++i;
}

Let's say len = 1023, for example; then:

  • for VLEN=128 we have 7 vector iterations and 127 scalar iterations
  • for VLEN=256 we have 3 vector iterations and 255 scalar iterations
  • for VLEN=512 we have 1 vector iteration and 511 scalar iterations

Depending on the number of instructions in each loop iteration, the total number of instructions can increase with increasing VLEN. That doesn't mean real-world performance will be slower, though; it would also depend on other factors like cache sizes and microarchitecture details.
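
As a quick sanity check, this small standalone sketch (hypothetical code, just mirroring the loop structure above) reproduces those iteration counts:

#include <cstdio>

int main() {
    const int len = 1023;
    const int lane_counts[] = {128, 256, 512};  // elements per vector iteration
    for (int lanes : lane_counts) {
        int i = 0, vec_iters = 0;
        for (; i <= len - lanes; i += lanes)
            ++vec_iters;
        int scalar_iters = len - i;  // tail processed one element at a time
        std::printf("%d lanes: %d vector iterations, %d scalar iterations\n",
                    lanes, vec_iters, scalar_iters);
    }
    return 0;
}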

I assume here that RVV is used in a fixed-length manner as shown above, and OpenCV uses it this way currently. Perhaps it would be more efficient to use a scalable approach, where the active vector length can change on each iteration to process the main part and the tail in the same loop.
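
For illustration, here is a minimal sketch of such a scalable loop using the standard RVV C intrinsics (the __riscv_* names come from the v1.0 intrinsics spec; this is an assumed illustration, not the code OpenCV currently uses):

#include <riscv_vector.h>
#include <stddef.h>

// a[i] += b[i] for n elements. vsetvl chooses vl on every iteration,
// so the tail is handled by the same loop with a smaller vl instead of
// falling back to a scalar loop.
void add_inplace(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);            // elements this pass
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a + i, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b + i, vl);
        va = __riscv_vfadd_vv_f32m1(va, vb, vl);
        __riscv_vse32_v_f32m1(a + i, va, vl);
        i += vl;
    }
}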

salmundani commented on June 12, 2024

I see, that would make a lot of sense. If that is so, I'd recommend against doing it, if it's being done on purpose. I'm guessing it's implemented so that it's optimal in instances where, for example, AVL=513 and VLMAX=512. Naturally, one would expect that it would pick vl=512 and do just one vector instruction plus one scalar operation to process the remaining element. But that is actually hardware-implementation-defined: the specification says vl can be any value such that ceil(AVL / 2) ≤ vl ≤ VLMAX when AVL < (2 * VLMAX). So it could pick vl=257, do a vector instruction, and then do an awkward 256 scalar operations. It seems the specification encourages looping until the AVL is 0.

Thank you for your help!
