Comments (2)
Hi @Maetveis ,
(Not an AMD employee here, but going to try to help.)
I guess you probably moved on to something else, but if anything, I would recommend to open a bug ticket in LLVM instead, as maybe it is something the compiler can be taught to improve potentially.
Your code seems wrong due to all threads writing into c[0]
, but I assume that it was a typo when writing your message, or a big part of the code is missing due to using c-reduce.
From basic experience, reaching max performance on RDNA can be quite challenging compared to GCN.
While hard to tell and profiling being quite hard on RDNA, I suspect that you might be hitting one these issues:
- bad write coalescing when writing
h
, that somehow gets better when putting the load. I have observed that putting synchronizations (withsyncthreads()
) can sometimes improve performance by improving write coalescing - if you are really have all your threads writing into
c[0]
, depending on your blocksize, the load might help with avoiding all threads are hitting the LDS at the same time and the LDS serializes less all the writes
Anyway, that's pure guessing.
Best,
Epliz
from hipamd.
Thanks, @Epliz for the suggestions, I replied to them below
I guess you probably moved on to something else
I am still interested in this as solving it would unblock upstreaming an optimization in vkFFT for AMDGPU, as well as potentially improve many kernels on RDNA2 if the root cause can be fixed.
I would recommend to open a bug ticket in LLVM instead, as maybe it is something the compiler can be taught to improve potentially.
For this specific issue I don't think the compiler is involved because it can be reproduced by changing the generated assembly (removing the load instruction for blockDim
).
--- test.s 2023-01-31 08:43:20.798273297 +0000
+++ test.noasm.s 2023-01-31 08:44:16.353553582 +0000
@@ -7,7 +7,7 @@
VkFFT_main: ; @VkFFT_main
; %bb.0:
s_load_dwordx4 s[0:3], s[6:7], 0x0
- s_load_dword s4, s[4:5], 0x4
+ ;s_load_dword s4, s[4:5], 0x4
v_add_nc_u32_e32 v3, v1, v0
v_mul_u32_u24_e32 v1, 7, v1
s_mulk_i32 s9, 0x18f0
Your code seems wrong due to all threads writing into c[0], but I assume that it was a typo when writing your message, or a big part of the code is missing due to using c-reduce.
Yes its the effect of the reduction, using different locations for each thread still reproduces the problem, using the same location (while technically UB due to the race condition) results in a simpler assembly.
My guess was something to do with the command processor (CP)1 not caching the dispatch packet, but accessing it from the kernel (to read blockDim
from it) leaves it in cache for subsequent blocks. This is supported by the fact that the following change (loading the kernarg pointer by hand; which is also done by the CP) also result in the faster performing kernel.
--- test.s 2023-01-31 09:35:20.406608882 +0000
+++ test.noasm.s 2023-01-31 09:37:13.877167577 +0000
@@ -6,8 +6,9 @@
.type VkFFT_main,@function
VkFFT_main: ; @VkFFT_main
; %bb.0:
+ s_load_dwordx2 s[6:7], s[4:5], 0x28
s_load_dwordx4 s[0:3], s[6:7], 0x0
- s_load_dword s4, s[4:5], 0x4
+ ;s_load_dword s4, s[4:5], 0x4
v_add_nc_u32_e32 v3, v1, v0
v_mul_u32_u24_e32 v1, 7, v1
s_mulk_i32 s9, 0x18f0
from hipamd.
Related Issues (20)
- Missing definitions for hipExternalSemaphore and hipExternalMemory when compiling for cuda HOT 1
- hipDeviceGetUuid is not compatible with VkPhysicalDeviceIDProperties::deviceUUID
- hipDeviceGetUuid yields link error when compiling with nvcc
- ROCm 5.3 gfx1030 hang with hipStreamCreate and hipStreamDestroy HOT 3
- hipGetSymbol* functions always return hipErrorInvalidSymbol. HOT 1
- CMAKE_INSTALL_<dir> paths assumed to be relative HOT 11
- Failed to embed PCH HOT 2
- Possible bug in occupancy calculation HOT 1
- wavefrontsize64 not supported on GFX10 while it is on GFX11. Intended or not? HOT 7
- RedHat & SLES - missing Clang RT Directory HOT 1
- question about boardName HOT 2
- Implement float/double atomicMin/Max in terms of integer atomics HOT 12
- FAIL to build hipamd with pal lib. How to do that?
- Remove HIP_COMPILER STREQUAL "clang"
- `#define __local` clashes with libcxx templates
- Support for setting `hipLimitMallocHeapSize` on ROCm 5.4? HOT 1
- Compile error in hip_cooperative_groups.h with ROCm 5.5 HOT 1
- compiling with CUDA 12 fails: error: identifier "cudaBindTexture" is undefined
- Polaris 10 support HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hipamd.