Comments (10)
My plan for further investigation is as follows.
There are 6 permutations of TSRRejectShadingCS.
Permutations 1 and 4 show a big increase in compile time.
Permutations 0 and 3 don't show any compile-time problem.
When slangc generates HLSL for permutations 0 and 1, the two outputs are identical except for one difference:
LANE_COUNT is 64 for permutation 0 and 32 for permutation 1.
I am going to disable parts of the source code in permutations 0 and 1 until their compile times are the same.
From there I will bisect down to the function that causes the compile-time increase.
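The bisect step in this plan could be sketched roughly as below. This is only an illustration: `is_slow` is a hypothetical predicate that is stubbed out here so the sketch runs without DXC, and the names `f1`..`f5` and `CULPRIT` are placeholders. A real version would compile the shader with only the given functions enabled and compare the elapsed time against a threshold. The sketch also assumes a single culprit.

```bash
#!/usr/bin/env bash
# Hypothetical predicate: returns 0 ("still slow") if the given list of
# functions contains the one named in $CULPRIT. A real version would
# time dxc with only these functions enabled.
is_slow() {
    local f
    for f in "$@"; do
        [ "$f" = "$CULPRIT" ] && return 0
    done
    return 1
}

# Halve the candidate list until a single function remains.
# Assumes exactly one candidate is responsible for the slowdown.
bisect_culprit() {
    local items=("$@")
    while [ "${#items[@]}" -gt 1 ]; do
        local half=$(( ${#items[@]} / 2 ))
        local left=("${items[@]:0:$half}")
        local right=("${items[@]:$half}")
        if is_slow "${left[@]}"; then
            items=("${left[@]}")
        else
            items=("${right[@]}")
        fi
    done
    echo "${items[0]}"
}

CULPRIT=f4
bisect_culprit f1 f2 f3 f4 f5   # prints f4
```

If several functions contribute to the slowdown, a plain bisect like this only finds one of them, so the result would need to be cross-checked.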
from slang.
I did some experiments, and the compile time appears to be related to the for-loops.
The increase in compile time is strongly correlated with a macro, SIMD_SIZE.
There are 34 for-loops that iterate "n" times, where "n" is equal to SIMD_SIZE.
I initially thought that "WaveSize" might be related to the issue,
because the issue was observed only when DIM_WAVE_SIZE is 32.
But that doesn't seem to be the case.
It looks to be more related to SIMD_SIZE, which changes the iteration count of the for-loops.
The following table shows how long the compilation takes for each value of SIMD_SIZE from 4 to 16.
The unit is seconds.
| | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|---|---|
| DIM_WAVE_SIZE=32 | | | 36.59 | 120.25 | 239.75 | 428.68 | 633.18 |
| DIM_WAVE_SIZE=64 | 7.69 | 15.29 | 34.96 | 89.31 | 201.99 | 418.88 | 687.65 |
By default, SIMD_SIZE is 8 when DIM_WAVE_SIZE is 32, and SIMD_SIZE is 4 when DIM_WAVE_SIZE is 64.
When I changed SIMD_SIZE while keeping DIM_WAVE_SIZE the same, the compile time changed.
Note that the compile time is almost the same for DIM_WAVE_SIZE=32 and DIM_WAVE_SIZE=64 when SIMD_SIZE is the same.
That makes me believe WaveSize is unrelated to the compile-time increase.
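To see how quickly the time grows, one option is to divide each timing by the previous one. The helper below is a small sketch (the name `growth_factors` is made up for this example) that is fed the DIM_WAVE_SIZE=64 row from the table.

```bash
#!/usr/bin/env bash
# Print the ratio between each timing and the one before it.
growth_factors() {
    printf '%s\n' "$@" \
        | awk 'NR > 1 { printf "%.2f\n", $1 / prev } { prev = $1 }'
}

# DIM_WAVE_SIZE=64 row, SIMD_SIZE = 4, 6, 8, 10, 12, 14, 16
growth_factors 7.69 15.29 34.96 89.31 201.99 418.88 687.65
```

For these numbers most ratios come out around 2, which suggests that each step of 2 in SIMD_SIZE roughly doubles the compile time rather than adding a fixed amount.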
I am attaching a source file that reproduces the issue.
TSRRejectShading.txt
Compiling the attached shader with the following command reproduces the compile time for permutation 0.
dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=64 -DSIMD_SIZE=4 TSRRejectShading.txt
Compiling the attached shader with the following command reproduces the compile time for permutation 1.
dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=32 -DSIMD_SIZE=8 TSRRejectShading.txt
The compile time for all permutations is around 7~8 seconds when Slang is not involved.
That means the compile time is fine when SIMD_SIZE is 4, but when SIMD_SIZE is above 4, DXC takes longer than it should.
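To repeat the measurement for other combinations, the two commands above can be generated from a small helper. The sketch below only prints the command lines (a dry run); `dxc_cmd` is a name made up for this example, and the flags are copied from the commands above.

```bash
#!/usr/bin/env bash
# Build the dxc command line for one (DIM_WAVE_SIZE, SIMD_SIZE) pair,
# using the same flags as the two reproduction commands.
dxc_cmd() {
    echo "dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=$1 -DSIMD_SIZE=$2 TSRRejectShading.txt"
}

# Dry run: print every combination instead of executing it.
for wave in 32 64; do
    for simd in 4 6 8 10 12 14 16; do
        dxc_cmd "$wave" "$simd"
    done
done
```

Each printed line can then be run under whatever timer is available on the machine.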
from slang.
After collecting some data, I found that each for-loop increases the compile time by a different amount.
I narrowed it down to three functions that increase the compile time the most:
- AccessNeighborTexel_2
- min_0
- max_0
I am going to focus on "AccessNeighborTexel_2" and run more experiments.
The following graph shows how much the compile time increases as each for-loop iterates more.
from slang.
*Edit: I take this back. I made a mistake in my observation.
I am gonna need to experiment a little more, but I may have found something.
It looks like DXC generates a bigger binary depending on how an if-else-if-else chain is written.
When the code is written as follows, DXC generates a smaller binary.
if (cond1) bla;
else if (cond2) bla;
else if (cond3) bla;
...
When the code is written as follows, DXC generates a bigger binary and takes longer to compile.
if (cond1) bla;
else
{
    if (cond2) bla;
    else
    {
        if (cond3) bla;
        else
        {
            if (cond4) bla;
            ...
I am going to test this observation more tomorrow.
from slang.
I am gonna focus more on the following functions tomorrow:
- Median3x3_0
- Max3x3_0
- Min3x3_0
- Convolve3x3HV_1
Yesterday, I left a comment saying that three functions looked to be related to the compile-time increase:
"AccessNeighborTexel_2", "min_0" and "max_0".
But it turned out that they don't increase the compile time by themselves.
They are just called by other functions more often.
I was comparing the compile-time increase per additional for-loop iteration, but the comparison wasn't fair: when a function is called from 10 places, adding 2 iterations to its for-loop amounts to 20 extra iterations overall.
from slang.
Unfortunately, I haven't been able to make much progress, although I have spent a good amount of time investigating the issue.
Six functions kept showing up as suspects for the compile-time increase:
- ClampFireFliersWithGuide
- Convole3x1
- Convolve3x3HV
- Max3x3
- Median3x3
- Min3x3
They are all in the same chain of function calls.
- "ClampFireFliersWithGuide" calls both Max3x3 and Min3x3.
- "Max3x3", "Median3x3", and "Min3x3" call "Convolve3x3HV".
- "Convolve3x3HV" calls "Convole3x1".
So it looks like the source of the issue is in "Convole3x1".
But strangely, the trail ends there.
"Convole3x1_2" calls a few simple functions, and only two of them have a for-loop and some complexity: "AccessNeighborTexel_1" and "AccessNeighborTexel_2".
When I swap "AccessNeighborTexel_X" with the original HLSL, the compile time is still high.
When I swap "Convole3x1" with the original HLSL, the compile time goes down a lot.
I am focusing on this area.
I am attaching the shader file that I am using for debugging, for reference.
TSRRejectShading.txt
I am measuring the compile time with the following command:
/usr/bin/time -f "%E" dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=32 TSRRejectShading.txt -Fo a.out -Fe error.txt
To measure the timing for each function, I am using the following script:
# Repeat the whole measurement 5 times.
for i in 1 2 3 4 5
do
    echo "== $i try"
    # Collect the unique M_* guard macros, with any trailing _<number> stripped.
    for m in $(grep 'ifndef M_' TSRRejectShading.usf.hlsl | dos2unix | sed 's|#ifndef \(.*\)_[0-9]*$|\1|' | sed 's|#ifndef ||' | sort -u)
    do
        echo "Measuring time: $m ..."
        # Define the guard macro and its _0.._4 variants for this one function.
        /usr/bin/time -f "%E" dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=32 TSRRejectShading.txt -Fo a.out -Fe error.txt -D${m} -D${m}_0 -D${m}_1 -D${m}_2 -D${m}_3 -D${m}_4
    done
done
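The macro-name extraction in the inner `for` line can be tried on its own. The sketch below wraps the same grep/sed/sort pipeline in a function that reads from stdin. Two things are assumptions of this example: `tr -d '\r'` stands in for dos2unix so there is no extra dependency, and the sample guard names are made up to show the shape of the input.

```bash
#!/usr/bin/env bash
# Same extraction as in the measurement loop, reading HLSL text from stdin.
# tr -d '\r' replaces dos2unix here (an assumption of this sketch).
list_macros() {
    grep 'ifndef M_' | tr -d '\r' \
        | sed 's|#ifndef \(.*\)_[0-9]*$|\1|' \
        | sed 's|#ifndef ||' \
        | sort -u
}

# Hypothetical input with CRLF line endings; the real guards are in the
# generated HLSL file.
printf '%s\r\n' \
    '#ifndef M_Convole3x1_2' \
    '#ifndef M_Convole3x1_0' \
    '#ifndef M_CallMemberFromGlobal' \
    'float4 x;' | list_macros
```

The first sed handles guards that end in `_<number>`, and the second handles guards without a numeric suffix, so each function contributes one base name to the loop.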
from slang.
I am sharing data that shows how much the compile time is reduced when a single function is replaced with the original HLSL function.
The values are the median of 5 tries.
As I described in my previous comment, three functions stand out the most:
- ClampFireFliersWithGuide
- Convole3x1
- Convolve3x3HV
They all eventually call down to Convole3x1,
and my focus is currently on "Convole3x1".
Note that when all of the functions use the original HLSL versions at the same time, the compile time is around 15 seconds.
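For the median of 5 tries mentioned above, the elapsed times printed by `/usr/bin/time -f "%E"` (in `[hours:]minutes:seconds` form) first have to be turned into seconds. A minimal sketch of both steps, with made-up helper names `to_seconds` and `median` and made-up sample timings:

```bash
#!/usr/bin/env bash
# Convert "[h:]m:ss.ss" (the %E format of GNU time) to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Print the median of the numeric arguments.
median() {
    printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Hypothetical timings from 5 tries of one configuration.
tries=()
for t in 0:38.21 0:36.59 0:37.45 0:40.12 0:36.93; do
    tries+=("$(to_seconds "$t")")
done
median "${tries[@]}"
```

Using the median rather than the mean keeps one unusually slow or fast run from skewing the comparison between functions.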
from slang.
I think I found a cause of the compile-time increase.
In short, DXC takes longer to look up a function in the global namespace than to look up a member function in a struct.
Here is a quick example from the UE shader:
#ifndef M_CallMemberFromGlobal
TLaneVector_SetElement_0(R_9, SimdIndex_20, max(TLaneVector_GetElement_0(A_6, SimdIndex_20), TLaneVector_GetElement_0(B_3, SimdIndex_20)));
#else
R_9.SetElement(SimdIndex_20, max(A_6.GetElement(SimdIndex_20), B_3.GetElement(SimdIndex_20)));
#endif
The two forms do exactly the same thing; in fact, DXC generates binary-identical DXIL whichever side is used.
But the first form (global function calls) takes DXC longer to compile than the second (member function calls).
There are a few member functions in the UE shader.
As an experiment, I kept the global functions and added member functions with the same functionality. Those functions were originally member functions, but Slang turned them into global functions.
The compile time went down significantly when I applied this approach to only two functions, "GetElement()" and "SetElement()". Note that there are more such functions, but I modified only those two.
See the attached graph for the compile-time comparison.
The bar labeled "CallMemberFromGlobal" is the case that uses the member-function versions of "GetElement()" and "SetElement()" (21 seconds).
The bar labeled "Worst" is the case where DXC compiles what Slang generated without any modifications (38 seconds).
from slang.
While investigating this issue, I discovered another bug in DXC.
When the "this" keyword is used as a function argument, the compile time goes up a lot and the generated DXIL binary becomes 10 times bigger.
I filed a bug on the DXC GitHub for this problem:
DirectXShaderCompiler/-/6512
However, this doesn't appear to have any impact on Slang, because Slang doesn't use the "this" keyword as a function argument.
That is mainly because Slang turns all member functions into global functions, and there is no way to use "this" when all functions are non-member functions.
from slang.
The investigation is done.
The increased compile time comes mainly from the fact that DXC takes longer to compile calls to global functions than calls to member functions.
It appears to be a bug in DXC, but we may be able to work around it.
A new issue has been created for the workaround as a follow-up task: #3921
from slang.