
Comments (4)

roshanr95 commented on May 23, 2024

Any performance gain would mostly come from the fact that global memory is faster when memory locations are accessed contiguously. Will this outweigh the extra reads/writes done on local memory?

from compute.

kylelutz commented on May 23, 2024

I think there may be a gain in pulling tiles of global memory into local memory, reversing each tile within a work-group, and writing the reversed tile back to global memory (so that the write is contiguous as well). But, like any performance improvement, it should be benchmarked. I think a good first step would be to write a perf_reverse test (similar to the other perf_* algorithm tests in the perf/ directory) that measures the current performance and can then be used to evaluate any future improvements.
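The tiling idea could be sketched roughly as follows. This is a CPU-side C++ model, not an OpenCL kernel and not code from compute: the name `reverse_tiled`, the out-of-place output, and the `local` scratch vector standing in for work-group local memory are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of the tiled approach: each "work-group" reads one
// contiguous tile, reverses it in a scratch buffer, and writes it out as one
// contiguous tile at the mirrored position.
std::vector<int> reverse_tiled(const std::vector<int>& in, std::size_t tile_size) {
    assert(tile_size > 0);
    const std::size_t n = in.size();
    std::vector<int> out(n);
    std::vector<int> local(tile_size);  // stand-in for __local memory

    for (std::size_t start = 0; start < n; start += tile_size) {
        // The last tile may be shorter than tile_size.
        const std::size_t len = std::min(tile_size, n - start);

        // Contiguous read of the tile from "global" memory.
        for (std::size_t i = 0; i < len; ++i) {
            local[i] = in[start + i];
        }

        // The tile [start, start + len) lands at [n - start - len, n - start).
        const std::size_t dst = n - start - len;

        // Contiguous write; the reversal happens when indexing the scratch buffer.
        for (std::size_t i = 0; i < len; ++i) {
            out[dst + i] = local[len - 1 - i];
        }
    }
    return out;
}
```

Both loops over global data walk ascending addresses, which is the property the tiling is meant to buy. Whether that beats the extra local-memory traffic on a real device is exactly what a benchmark would have to show.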


jszuppe commented on May 23, 2024

I've implemented the local memory solution mentioned above: jszuppe@7479fa4. It is just a first version and of course needs refactoring etc.

More importantly, the performance is virtually the same (sometimes lower, sometimes higher) as the current solution (global memory only) on my device (AMD Radeon HD7770). The global memory solution is also more stable in terms of performance. When the buffer is small (<10000), the difference is noticeable, to the disadvantage of the local memory solution.

I tested it with CodeXL GPU performance counters (running perf_reverse), 10 tests for each version of the kernel.

I think that's because, with the current solution, reads from and writes to global memory are already coalesced/contiguous/adjacent.

   decl<cl_uint>("i") << " = get_global_id(0);\n" <<
   decl<cl_uint>("j") << " = size - get_global_id(0) - 1;\n" <<
   decl<value_type>("tmp") << "=" << first[var<cl_uint>("i")] << ";\n" <<
   first[var<cl_uint>("i")] << "=" << first[var<cl_uint>("j")] << ";\n" <<
   first[var<cl_uint>("j")] << "= tmp;\n";

I mean, this part, decl<value_type>("tmp") << "=" << first[var<cl_uint>("i")] << ";\n", is obviously coalesced, a textbook example. And this part, first[var<cl_uint>("i")] << "=" << first[var<cl_uint>("j")] << ";\n", reads and writes in reversed order. The device's OpenCL compiler and/or the device drivers (I don't know the details) apparently know how to handle this, so it ends up coalesced as well.
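To make the access pattern concrete, here is a small CPU-side C++ model of the quoted kernel body. The function names are hypothetical, and it assumes one work-item per swapped pair (that is, size / 2 work-items), which the snippet itself does not show. The second function computes which `j` addresses a run of consecutive work-items touches.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Model of the kernel body: work-item `gid` swaps element i = gid with
// element j = size - gid - 1. Assumes size / 2 work-items are launched.
void reverse_like_kernel(std::vector<int>& data) {
    const std::size_t size = data.size();
    for (std::size_t gid = 0; gid < size / 2; ++gid) {
        const std::size_t i = gid;
        const std::size_t j = size - gid - 1;
        std::swap(data[i], data[j]);
    }
}

// Lowest and highest `j` index touched by `group_size` consecutive
// work-items starting at `first_gid`.
std::pair<std::size_t, std::size_t> mirrored_span(std::size_t size,
                                                  std::size_t first_gid,
                                                  std::size_t group_size) {
    assert(group_size > 0 && first_gid + group_size <= size);
    const std::size_t highest = size - first_gid - 1;
    const std::size_t lowest = size - (first_gid + group_size - 1) - 1;
    return {lowest, highest};
}
```

For any run of `group_size` consecutive work-items, `highest - lowest + 1` equals `group_size`, so the `j` accesses cover one contiguous block of memory, just in descending order. That would be consistent with the hardware coalescing them, although the sketch says nothing about what a particular device actually does.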

Using local memory would probably make some difference on older devices (NVIDIA GPUs with CC 1.0/1.1). I will test it later.
Update: I cannot test it, because my laptop's NVIDIA GPU has CC 1.2.
Update 2: New commit with a code refactor and a fix; the tests now run on the local kernel.
Update 3:

I've just discovered the perf.py script... 😐 It's certainly a better way to test performance than what I've been doing, but the results are the same: devices know how to handle the global memory access pattern present in the current solution.

Current master performance (20 trials):

=== reverse with compute ===
size,time (ms)
2,0.089776
4,0.089753
8,0.086176
16,0.092829
32,0.090977
64,0.090615
128,0.090403
256,0.090353
512,0.089974
1024,0.089263
2048,0.090476
4096,0.091755
8192,0.096906
16384,0.104873
32768,0.122258
65536,0.157439
131072,0.217474
262144,0.396446
524288,0.787806
1048576,0.274227
2097152,0.453932
4194304,0.912999
8388608,1.530450
16777216,2.845040
33554432,5.477610

jszuppe@7479fa4 performance (20 trials):

=== reverse with compute ===
size,time (ms)
2,0.110501
4,0.111839
8,0.112056
16,0.111535
32,0.112302
64,0.112528
128,0.113221
256,0.112125
512,0.116040
1024,0.115111
2048,0.114388
4096,0.113263
8192,0.117490
16384,0.128465
32768,0.142770
65536,0.178416
131072,0.250299
262144,0.415767
524288,0.757610
1048576,0.294105
2097152,0.541047
4194304,0.913123
8388608,1.556500
16777216,2.902590
33554432,5.510650
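As a rough way to read the two tables side by side, the relative difference per size can be computed as below. This is only an illustrative helper (the name `percent_change` is made up here), applied to rows copied from the tables above.

```cpp
#include <cassert>
#include <cmath>

// Percent change of the local-memory timing relative to master.
// Positive means the local-memory version was slower.
double percent_change(double master_ms, double local_ms) {
    assert(master_ms > 0.0);
    return (local_ms - master_ms) / master_ms * 100.0;
}

// Rows taken from the tables above (size: master ms, local-memory ms):
//   2:        0.089776, 0.110501
//   524288:   0.787806, 0.757610
//   33554432: 5.477610, 5.510650
```

For these rows the local-memory version comes out roughly 23% slower at size 2, about 4% faster at size 524288, and under 1% slower at size 33554432, which matches the observation that the difference only matters for small buffers.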


kylelutz commented on May 23, 2024

I guess we should stick with the current naive implementation as it performs well. Thanks for trying this out and getting the performance numbers!
