
Comments (10)

jkwak-work commented on June 27, 2024

My plan for further investigation is as follows.

There are 6 permutations of TSRRejectShadingCS.
Permutations 1 and 4 show a big increase in compile time.
Permutations 0 and 3 don't show any compile-time problem.

When Slangc generates HLSL for permutations 0 and 1, the HLSL outputs are identical except for one difference:
LANE_COUNT is 64 for permutation 0 and 32 for permutation 1.

I am going to disable certain parts of the source code in permutations 0 and 1 until their compile times are the same.
From there I will bisect down to the function that causes the increase in compile time.

from slang.

jkwak-work commented on June 27, 2024

I ran some experiments, and the compile time appears to be related to the for-loops.
The increase in compile time is strongly correlated with a macro variable, SIMD_SIZE.
There are 34 for-loops that iterate "n" times, where "n" equals SIMD_SIZE.

I initially thought that "WaveSize" might be related to the issue,
because the issue was observed only when DIM_WAVE_SIZE is 32.
But that doesn't seem to be the case.
It appears to be related to SIMD_SIZE, which changes the iteration counts of the for-loops.

The following table shows how long compilation takes for values of SIMD_SIZE from 4 to 16.
The unit is seconds.
compile_time_for_tsrrejectingshadercs_01

SIMD_SIZE             4      6      8      10      12      14      16
DIM_WAVE_SIZE=32   7.58  15.49  36.59  120.25  239.75  428.68  633.18
DIM_WAVE_SIZE=64   7.69  15.29  34.96   89.31  201.99  418.88  687.65

By default, SIMD_SIZE is 8 when DIM_WAVE_SIZE is 32, and SIMD_SIZE is 4 when DIM_WAVE_SIZE is 64.
When I changed SIMD_SIZE while keeping DIM_WAVE_SIZE the same, the compile time changed.
Note that the compile time is almost the same for DIM_WAVE_SIZE=32 and DIM_WAVE_SIZE=64 when SIMD_SIZE is the same.
That makes me believe WaveSize is unrelated to the increase in compile time.
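To put the table in perspective, here is a minimal sketch that computes how many times slower the SIMD_SIZE=16 build is than the SIMD_SIZE=4 build for each wave size. The timings are copied from the table above; the variable and function names (`wave32`, `wave64`, `slowdown`) are made up for illustration.

```shell
# Timings in seconds copied from the table above (SIMD_SIZE = 4 .. 16).
wave32="7.58 15.49 36.59 120.25 239.75 428.68 633.18"
wave64="7.69 15.29 34.96 89.31 201.99 418.88 687.65"

# Print how many times slower the last entry is than the first one.
# (Inside the single quotes, $1 and $NF are awk fields, not shell arguments.)
slowdown() {
  echo "$1" | awk '{ printf "%.1f\n", $NF / $1 }'
}

slowdown "$wave32"   # DIM_WAVE_SIZE=32
slowdown "$wave64"   # DIM_WAVE_SIZE=64
```

SIMD_SIZE only grows by a factor of 4 across the table, while both ratios come out well above 80, so the cost grows much faster than linearly in the iteration count for either wave size.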

I am attaching a source file that reproduces the issue.
TSRRejectShading.txt

When the attached shader is compiled with the following command, it reproduces the compile time for permutation 0.

dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=64 -DSIMD_SIZE=4 TSRRejectShading.txt

When the attached shader is compiled with the following command, it reproduces the compile time for permutation 1.

dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=32 -DSIMD_SIZE=8 TSRRejectShading.txt

The compile time for all permutations is around 7~8 seconds when Slang is not involved.
That means the compile time is fine when SIMD_SIZE is 4, but when SIMD_SIZE is above 4, DXC takes longer than it should.
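The table cells can be reproduced by sweeping both macros. The sketch below only prints the command lines, using the same flags as the two commands quoted above; the helper names `dxc_cmd` and `sweep_cmds` are hypothetical.

```shell
# Print the dxc command line for one (DIM_WAVE_SIZE, SIMD_SIZE) pair.
# The flags mirror the commands quoted above; nothing is executed here.
dxc_cmd() {
  printf 'dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=%s -DSIMD_SIZE=%s TSRRejectShading.txt\n' "$1" "$2"
}

# Emit one command per table cell: 2 wave sizes x 7 SIMD sizes.
sweep_cmds() {
  for wave in 32 64; do
    for simd in 4 6 8 10 12 14 16; do
      dxc_cmd "$wave" "$simd"
    done
  done
}
```

To actually time each build, assuming GNU time is available at /usr/bin/time, the output can be piped into a loop such as `sweep_cmds | while read -r cmd; do /usr/bin/time -f "%E" $cmd; done`.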


jkwak-work commented on June 27, 2024

After collecting some data, I found that each for-loop increases the compile time by a different amount.
I narrowed it down to the three functions that increase the compile time the most:

  1. AccessNeighborTexel_2
  2. min_0
  3. max_0

I am going to focus on "AccessNeighborTexel_2" and try more experiments.

The following graph shows how much the compile time increases as each for-loop iterates more times.
compile_time_for_tsrrejectingshadercs_03


jkwak-work commented on June 27, 2024

*Edit: I take this back. I made a mistake in my observation.

I need to experiment a little more, but I may have found something.
It looks like DXC generates a bigger binary depending on how an if-else-if-else chain is written.
When the code is written as follows, DXC generates a smaller binary.

if (cond1) bla;
else if (cond2) bla;
else if (cond3) bla;
...

When the code is written as follows, DXC generates a bigger binary and takes longer to compile.

if (cond1) bla;
else
{
  if (cond2) bla;
  else
  {
    if (cond3) bla;
    else
    {
      if (cond4) bla;
...

I am going to test this observation more tomorrow.


jkwak-work commented on June 27, 2024

I am going to focus more on the following functions tomorrow:

  1. Median3x3_0
  2. Max3x3_0
  3. Min3x3_0
  4. Convolve3x3HV_1

Yesterday, I left a comment saying that three functions appeared to be related to the compile-time increase:
"AccessNeighborTexel_2", "min_0" and "max_0".
But it turned out that they themselves don't increase the compile time.
They are just called more often by other functions.
I was comparing the compile-time increase per additional for-loop iteration, but the comparison wasn't fair: when a function is called 10 times by other functions, 2 additional for-loop iterations become 20 additional iterations.


jkwak-work commented on June 27, 2024

Unfortunately, I haven't been able to make much progress, although I have spent a good amount of time investigating the issue.
Six functions kept showing up as suspects that increase the compile time:

  1. ClampFireFliersWithGuide
  2. Convole3x1
  3. Convolve3x3HV
  4. Max3x3
  5. Median3x3
  6. Min3x3

They are all in the same chain of function calls.

  • "ClampFireFliersWithGuide" calls both Max3x3 and Min3x3.
  • "Max3x3", "Median3x3", and "Min3x3" call "Convolve3x3HV".
  • "Convolve3x3HV" calls "Convole3x1".

So it looks like the source of the issue is in "Convole3x1".

But, strangely, the trail ends there.
"Convole3x1_2" calls a few simple functions, and only two of them have a for-loop and some complexity: "AccessNeighborTexel_1" and "AccessNeighborTexel_2".

When I swap "AccessNeighborTexel_X" with the original HLSL, the compile time is still high.
When I swap "Convole3x1" with the original HLSL, the compile time goes down by a lot.
I am focusing on this area.

For reference, I am attaching the shader file that I am using for debugging.
TSRRejectShading.txt

I am measuring the compile time with the following command:

/usr/bin/time -f "%E" dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=32 TSRRejectShading.txt -Fo a.out -Fe error.txt

To measure the timing for each function, I am using the following script:

for i in 1 2 3 4 5
do
  echo "== $i try"
  for m in $(grep 'ifndef M_' TSRRejectShading.usf.hlsl | dos2unix | sed 's|#ifndef \(.*\)_[0-9]*$|\1|' | sed 's|#ifndef ||' | sort -u)
  do
    echo "Measuring time: $m ..."
    /usr/bin/time -f "%E" dxc.exe -E MainCS -T cs_6_6 -DDIM_WAVE_SIZE=32 TSRRejectShading.txt -Fo a.out -Fe error.txt -D${m} -D${m}_0 -D${m}_1 -D${m}_2 -D${m}_3 -D${m}_4
  done
done
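To make the `grep | sed | sort -u` step easier to follow, here is a small sketch that runs the same two `sed` expressions on a few sample guard lines. `M_CallMemberFromGlobal` appears later in this thread; the `M_Max3x3_0` and `M_Max3x3_1` guards and the function name `macro_names` are assumptions made up for illustration.

```shell
# Reduce "#ifndef M_Name_<digits>" guard lines to unique base macro names,
# using the same sed expressions as the measuring script above.
macro_names() {
  sed 's|#ifndef \(.*\)_[0-9]*$|\1|' | sed 's|#ifndef ||' | sort -u
}

# Sample input: two numbered guards for one function plus one unnumbered guard.
printf '%s\n' \
  '#ifndef M_Max3x3_0' \
  '#ifndef M_Max3x3_1' \
  '#ifndef M_CallMemberFromGlobal' | macro_names
```

The first `sed` strips the numeric suffix together with the `#ifndef ` prefix, the second removes the prefix from guards that have no numeric suffix, and `sort -u` collapses the duplicates. The script then defines the base name and its `_0` to `_4` variants in a single dxc run.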


jkwak-work commented on June 27, 2024

I am sharing data that shows how much the compile time is reduced when a single function is replaced with the original HLSL function.
The values are the median of 5 tries.
compile_time_when_using_original_function_01
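One way to get a median of 5 tries from the `%E` output of GNU time is sketched below. It assumes the elapsed times come in the `[hours:]minutes:seconds` form that `%E` prints; the helper names `to_seconds` and `median` are hypothetical, and the sample values in the usage note are invented, not measured.

```shell
# Convert a GNU time "%E" value ([hours:]minutes:seconds) to plain seconds.
to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Median of an odd number of timings (in seconds) passed as arguments.
median() {
  printf '%s\n' "$@" | sort -n | awk '{ v[NR] = $1 } END { print v[(NR + 1) / 2] }'
}
```

For example, `median 38.2 21.0 37.9 40.1 36.5` picks the middle value after sorting. Using the median rather than the mean keeps a single slow outlier run from skewing the comparison.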

As I described in my previous comment, three functions stand out the most:

  1. ClampFireFliersWithGuide
  2. Convole3x1
  3. Convolve3x3HV

They all eventually call down to Convole3x1,
and my focus is currently on "Convole3x1".

Note that when all of the functions use the original HLSL functions at the same time, the compile time is around 15 seconds.


jkwak-work commented on June 27, 2024

I think I found a cause of the compile-time increase.
In short, it takes DXC longer to search for functions in the global namespace than to search for a member function in a struct.
Here is a quick example from the UE shader:

#ifndef M_CallMemberFromGlobal
        TLaneVector_SetElement_0(R_9, SimdIndex_20, max(TLaneVector_GetElement_0(A_6, SimdIndex_20), TLaneVector_GetElement_0(B_3, SimdIndex_20)));
#else
        R_9.SetElement(SimdIndex_20, max(A_6.GetElement(SimdIndex_20), B_3.GetElement(SimdIndex_20)));
#endif

The two versions do exactly the same thing; in fact, DXC generates binary-identical DXIL output whichever version is used.
But the first version (global functions) takes DXC longer to compile than the second version (member functions).
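The "binary-identical" claim can be checked with a byte-for-byte comparison of the two outputs. The sketch below is one possible way to do it; the function name `same_output` and the output file names in the usage note are hypothetical.

```shell
# Report whether two compiled outputs are byte-for-byte identical.
same_output() {
  if cmp -s "$1" "$2"; then
    echo identical
  else
    echo different
  fi
}
```

Assuming the shader is compiled once as-is with `-Fo global.bin` and once with `-DM_CallMemberFromGlobal -Fo member.bin`, running `same_output global.bin member.bin` would print `identical` if both paths produce the same DXIL.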

There are a few member functions in the UE shader.
As an experiment, I made copies of the global functions and added member functions with the same functionality. Those functions were originally member functions, but Slang turned them into global functions.
The compile time went down significantly when I applied this approach to only two functions, "GetElement()" and "SetElement()". Note that there are more such functions, but I modified only those two.
See the attached graph for the compile-time comparison.

compile_time_when_using_original_function_02

The bar labeled "CallMemberFromGlobal" is the case that uses the member-function versions of "GetElement()" and "SetElement()" (21 seconds).
The bar labeled "Worst" is the case where DXC compiles what Slang generated without any modifications (38 seconds).


jkwak-work commented on June 27, 2024

While investigating this issue, I discovered another bug in DXC.
When the "this" keyword is used as a function argument, the compile time goes up by a lot and the generated DXIL binary becomes 10 times bigger.
I filed a bug on the DXC GitHub for this problem:
DirectXShaderCompiler/-/6512

However, this doesn't appear to have any impact on Slang, because Slang doesn't use the "this" keyword as a function argument.
That is mainly because Slang turns all member functions into global functions, and there is no way to use the "this" keyword when all functions are non-member functions.


jkwak-work commented on June 27, 2024

The investigation is done.
The increased compile time comes mainly from the fact that DXC takes longer to compile calls to global functions than calls to member functions.
It appears to be a bug in DXC, but we may be able to work around it.

A new issue has been created for the workaround as a follow-up task: #3921

