Comments (4)
Any performance gain would mostly come from the fact that global memory is faster when memory locations are accessed contiguously. Will this outweigh the extra reads/writes done on local memory?
from compute.
I think there is may be a gain in pulling tiles of global memory to local memory, reversing it in a work-group, and writing the reversed tile back to global memory (which would then be contiguous). But, like any performance improvements, it should be benchmarked. I think a good first step would be to write a perf_reverse
test (similar to the other algorithm perf_*
tests in the perf/
directory) which measures the current performance and then can be used to evaluate any future improvements.
from compute.
I've implemented local memory solution (mentioned above) - jszuppe@7479fa4. It is just first version and ofc needs refactor etc.
More important is that the performance is virtually the same (sometimes lower, sometimes higher) compared to current solution (only global memory) on my device (AMD Radeon HD7770). Also global memory solution is more stable in terms of performance. When buffer is small (<10000) difference is noticeable to the disadvantage of local memory solution.
I tested it with CodeXL GPU Performance counters (running perf_reverse), 10 tests for each version of kernel.
I think it's because with current solution reading and writing from global memory is already coalesced/contiguous/adjacent.
decl<cl_uint>("i") << " = get_global_id(0);\n" <<
decl<cl_uint>("j") << " = size - get_global_id(0) - 1;\n" <<
decl<value_type>("tmp") << "=" << first[var<cl_uint>("i")] << ";\n" <<
first[var<cl_uint>("i")] << "=" << first[var<cl_uint>("j")] << ";\n" <<
first[var<cl_uint>("j")] << "= tmp;\n";
I mean this part decl<value_type>("tmp") << "=" << first[var<cl_uint>("i")] << ";\n"
is obviously coalesced, textbook example. And this part first[var<cl_uint>("i")] << "=" << first[var<cl_uint>("j")] << ";\n"
is read/write in reversed manner. Device OpenCL compiler and/or device drivers (I dont know details) know what to do so it would be coalesced.
Probably with older devices (NVIDIA GPU with CC 1.0/1.1) using local memory would make some difference. I will test it later.
Update: I cannot test it - my laptop's NVIDIA GPU has CC 1.2.
Update 2: New commit - code refactor, fix, tests now run on local kernel.
Update 3:
I've just discovered pref.py script... 😐 It's certainly better way to test performance compared to what I've been doing, but results are the same - devices know how to handle global memory access pattern present in current solution.
current master performance (20 trials)
=== reverse with compute ===
size,time (ms)
2,0.089776
4,0.089753
8,0.086176
16,0.092829
32,0.090977
64,0.090615
128,0.090403
256,0.090353
512,0.089974
1024,0.089263
2048,0.090476
4096,0.091755
8192,0.096906
16384,0.104873
32768,0.122258
65536,0.157439
131072,0.217474
262144,0.396446
524288,0.787806
1048576,0.274227
2097152,0.453932
4194304,0.912999
8388608,1.530450
16777216,2.845040
33554432,5.477610
jszuppe@7479fa4 performance (20 trials):
=== reverse with compute ===
size,time (ms)
2,0.110501
4,0.111839
8,0.112056
16,0.111535
32,0.112302
64,0.112528
128,0.113221
256,0.112125
512,0.116040
1024,0.115111
2048,0.114388
4096,0.113263
8192,0.117490
16384,0.128465
32768,0.142770
65536,0.178416
131072,0.250299
262144,0.415767
524288,0.757610
1048576,0.294105
2097152,0.541047
4194304,0.913123
8388608,1.556500
16777216,2.902590
33554432,5.510650
from compute.
I guess we should stick with the current naive implementation as it performs well. Thanks for trying this out and getting the performance numbers!
from compute.
Related Issues (20)
- Build program failure when use default_random_engine. HOT 1
- Can we use iterator in valarray?
- how about android devices HOT 2
- how to fill array of custom user structures ?
- how to reduce "array of custom user structures" with custom function ?
- vectorized lower_bound HOT 2
- Adding nullptr to std::string HOT 1
- Cannot run example on README.md: what(): Out of Host Memory HOT 1
- How can I return a custom struct from a BOOST_COMPUTE_FUNCTION
- BOOST_COMPUTE_CLOSURE triggers run-time compiling every time the closure value changes
- Unwanted function call for std::map::operator[] HOT 1
- g++ appears to resolve OpenCL/cl.h header but clang does not.
- How to convert a cv::UMat to cv::Mat?
- Minimum Boost Version
- Cmake error on android
- transform_reduce on M1 Pro
- Please do not set CMAKE_MODULE_PATH.
- Why does the transform(...) method work only on vectors with values of type <float> ?
- Get rid of `BOOST_COMPUTE_ADAPT_STRUCT` integrating boost::pfr HOT 1
- Modular Boost C++ Libraries Request
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from compute.