Comments (33)

raphlinus commented on June 27, 2024

Rebasing to master (which brings in 22507de among other things) yields continued slow performance (seemingly worse than the branch I was working on). Similarly, commenting out the begin and end clip commands gives a dramatic improvement in performance (but still not as good as the snapshot). This is interesting and somewhat unexpected, as I would have expected the clip_stack[] itself to be the significant factor. It's not entirely surprising, though: one of the experiments was to increase the clip stack size, with the intent of forcing the compiler to spill it rather than consume registers, and that did not significantly change performance. At least we have more variables to play with.

The "% Shader ALU Capacity Utilized" metric reported by AGI is around 6% for CHUNK=4 and with the clip stack in place, using #82 as the measurement baseline. It's about double that with CHUNK=1 and no clip stack.

raphlinus commented on June 27, 2024

Ok, I got envytools to run and have substantially more insight. Attached are the disassembler output from kernel 4 in #82, and a similar version with the bodies of Cmd_BeginClip and Cmd_EndClip commented out (noclip). These were obtained by running pgmdump2 from freedreno/envytools, patched slightly to set gpu_id to 640 and MAX_REG to 512. These were all run using the kitchen APK as above, i.e., using glGetProgramBinary. Incidentally, this sometimes outputs a compressed format that looks very similar to the k4_pipeline attached above (the first two dwords are 000051c0 5bed9c78, while the dword at offset 0x30 in k4_pipeline is 59ed9c78, different by only one bit). If I interact with the graphical app before using the service, it seems more likely to produce uncompressed output.

In any case, looking at the asm, there is indeed a difference. In the k4_noclip version (happy path), all reading from the input buffers is with the isam instruction. Reading the tag in particular is:

:5:0166:0205[a0005100x_02000001x] isam (s32)(x)r0.x, r0.x, s#0, t#1

But in k4_clip (sad path), while some of the reads are still isam, the tag and the TileSeg reads are the ldib instruction instead.

:6:0339:0402[c0260201x_01618001x] ldib.untyped.1d.u32.1.imm r0.y, r0.y, 1

Doing a little reading in the freedreno code, I see ldib is a cat6 instruction new to the Adreno 6xx series. There's a comment on emit_intrinsic_load_ssbo with some info, and there's also a doc string in the xml for the instruction that reads, cryptically, "LoaD IBo". (IBO = "index buffer object" in GL-speak.)

I didn't do quite as much study of the isam instruction. There's another comment which discusses changes in ssbo handling that refers to it. It's a category 5 instruction.

So, what we know with reasonably high confidence: there's some heuristic that the compiler is using to select between isam and ldib for reads from these buffers. For some reason, ldib seems quite a bit slower than isam. Why, I'm not completely sure. It might have something to do with caching.

I've skimmed over the rest of the shader code and don't see anything unexpected or very different between the two. Register usage seems comparable. There are of course lots of opportunities for experimental error here, but I'm reasonably confident I'm looking at the actual source of the slowdown.

k4_clip.gz
k4_noclip.gz

robclark commented on June 27, 2024

fwiw, IBO is actually "Image Buffer Object" (I kinda made that name up.. it is what is used for both Images and SSBOs)

isam is texelFetch, so it goes thru the TPL1 texture cache.. whereas ldib does not (image writes are not coherent with reads that go thru the texture cache).. I believe the closed driver will try to promote image loads to isam when it realizes there is no potential coherency issue with writes to the same image (which could include the same image attached at a different binding point)
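
To make the two paths concrete, here is a minimal compute shader sketch (bindings and names are illustrative, not piet-gpu's actual interface); the first read should compile to isam, while the second may lower to ldib because the shader also writes the buffer:

#version 450
layout(local_size_x = 64) in;

// Read through a texel buffer: texelFetch, i.e. isam on Adreno 6xx,
// served by the TPL1 texture cache.
layout(set = 0, binding = 0) uniform usamplerBuffer memory_tex;

// Read through an SSBO that is also written: the compiler can't promote
// this load to isam, so it may use ldib and bypass the texture cache.
layout(set = 0, binding = 1) buffer Memory {
      uint[] memory;
};

void main() {
    uint ix = gl_GlobalInvocationID.x;
    uint a = texelFetch(memory_tex, int(ix)).x; // isam path
    uint b = memory[ix];                        // potential ldib path
    memory[ix] = a + b;                         // the write that blocks promotion
}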

ishitatsuyuki commented on June 27, 2024

I think you should at least bring 22507de in; the old clip stack has really high register pressure.

Also, what's the currently used ALU capacity %? On desktop GPUs this should be close to 80-90%, so that would be the goal for efficiency.

For profiling, having access to the compiler output (ISA) would be quite helpful, although I suspect qcom's stack is too proprietary for them to provide that.

eliasnaur commented on June 27, 2024

If the clip stack is such a bottleneck, I'll mention #77 as a potential optimization of both the global memory traffic, and for avoiding the sRGB conversions. It doesn't reduce register pressure, though.

raphlinus commented on June 27, 2024

I haven't exactly experimented with #77, but did tests similar to it, without encouraging results.

I'm finding this issue quite challenging, but I want to approach it as systematically as possible. One approach is to see if I can get a look at the asm. Another is to continue treating the GPU as a black box, and build up a synthetic workload that gradually starts to resemble the real kernel4, watching at each step to see where the performance cliff is.

raphlinus commented on June 27, 2024

I've spent a considerable amount of time investigating, and have some insight but not complete understanding. Perhaps more important, there is a way forward.

Something is causing shader compilation to produce extremely deoptimized code for current master kernel4. By isolation, I find a single two-line fragment of code that can be commented out: the write_mem calls in Cmd_BeginClip. When CHUNK=1, the difference is dramatic, about 6ms vs 38ms in k4 (CHUNK=2 and 4 don't change the happy case much but make the sad case better; I've been going for rough measurements but could do them more carefully if needed). I've spent some quality time experimenting and using Android GPU Inspector, and still don't have a good accounting of where this time is going. I have been able to rule out some hypotheses that seemed plausible, and will spend a little time going over that evidence.

It's not low numbers of workgroups concurrently scheduled (occupancy) due to register pressure, though there is a difference in occupancy. With CHUNK=1, a maximum of 18 workgroups get scheduled in the happy case, 14 in the sad. (I instrumented the code with atomics to count the number of workgroups in flight.) With CHUNK=2, these numbers are 30/24, and with CHUNK=4, 32/32. These differences are nowhere near enough to explain the performance gap.
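
For reference, a sketch of that instrumentation (the Counters buffer and its binding are mine, not the actual code): thread 0 of each workgroup bumps an in-flight counter on entry, records the high-water mark, and decrements on exit.

#version 450
layout(local_size_x = 64) in;

layout(set = 0, binding = 0) buffer Counters {
      uint in_flight;     // workgroups currently executing
      uint max_in_flight; // high-water mark, read back after the dispatch
};

void main() {
    if (gl_LocalInvocationID.x == 0u) {
        atomicMax(max_in_flight, atomicAdd(in_flight, 1u) + 1u);
    }
    barrier();
    // ... actual kernel body goes here ...
    barrier();
    if (gl_LocalInvocationID.x == 0u) {
        atomicAdd(in_flight, uint(-1)); // decrement on exit
    }
}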

It also appears not to be register spilling, based on the Read Total and Write Total counters from AGI; the memory traffic per pixel rendered doesn't vary substantially between runs. The fact that the performance gap is worse with CHUNK=1 than with larger CHUNK values is also not consistent with spilling. We can estimate register usage with the Radeon shader analyzer (accessed through the shader playground). Using the gfx900 asic, with CHUNK=1, vgpr used is 21, allocated is 24. At CHUNK=2, 30/38, and at CHUNK=4, 46/62. The 6xx wiki linked above suggests that <= 32 poses no problems whatsoever for occupancy. One possibility, not completely ruled out, is that memory traffic from spilling is not properly reported by AGI. (We could test this with an experiment that forces spilling.)

I'd still be very curious to look at the asm to see what's so deoptimal. It is possible to build Mesa for Android, and it is possible to get that to cough up disassembly, but I haven't done this (it seems to involve some tedious mucking about with devices and building). In addition, this might not help very much, as the freedreno driver may well have different performance characteristics.

Now for the good news. Given the focus on clip memory traffic, I tried the approach of #77, and it seems to perform reasonably well at all CHUNK values. I'm now strongly leaning toward adopting that, though I think there is more discussion to be had about the best way to handle the clip/blend stack. Obviously one shortcoming is that the size of the clip stack is hardcoded, but in my experiments even a large value (256) doesn't seem to affect performance - clearly the clip stack itself is spilled and doesn't consume vgpr's, which is good overall.

We could in theory escalate the deoptimized shader compilation results to the GPU driver vendor, but at this point I'm not very eager to do that, as I think we have a good approach going forward. I am somewhat worried about some other future development triggering a similar performance cliff, but I figure we can cross that bridge when we get to it - another reason to push forward with the complete imaging model.

eliasnaur commented on June 27, 2024

Obviously one shortcoming is that the size of the clip stack is hardcoded, but in my experiments even a large value (256) doesn't seem to affect performance - clearly the clip stack itself is spilled and doesn't consume vgpr's, which is good overall.

Why not a hybrid, as described in #77 (comment): Create two sets of clip commands, one using global memory and one using thread local memory? Is it because the mere existence of the two write_mem calls causes the performance cliff, not their execution?

raphlinus commented on June 27, 2024

Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands.

Btw, another clue (and another set of experiments I did but didn't report) is that whatever is causing the performance cliff, the main effect seems to be making memory loads very slow; when adding a synthetic ALU-intensive load, the ALU operations don't seem significantly slower in the sad case. This is also consistent with the variation in response to CHUNK - each subgroup within a workgroup reads its input once, so the total amount of memory read should scale as 4/CHUNK. Meanwhile, the total amount of ALU is roughly the same regardless of CHUNK. In the happy case, the total time is fairly consistent wrt CHUNK, suggesting a mostly-ALU workload, while in the sad case it seems almost directly proportional to the expected amount of memory read.

eliasnaur commented on June 27, 2024

Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands.

How unfortunate. Did you get to try Snapdragon Profiler? It's the platform profiler for Adreno chipsets, and may give more detailed information than the generic Android one.

Btw, another clue (and another set of experiments I did but didn't report) is that whatever is causing the performance cliff, the main effect seems to be making memory loads very slow; when adding a synthetic ALU-intensive load, the ALU operations don't seem significantly slower in the sad case. This is also consistent with the variation in response to CHUNK - each subgroup within a workgroup reads its input once, so the total amount of memory read should scale as 4/CHUNK. Meanwhile, the total amount of ALU is roughly the same regardless of CHUNK. In the happy case, the total time is fairly consistent wrt CHUNK, suggesting a mostly-ALU workload, while in the sad case it seems almost directly proportional to the expected amount of memory read.

An interesting experiment would be to keep the write_mems, but replace their target buffer. That is, add

layout(set = 0, binding = 5) buffer FakeMemory {
      uint fake_mem_error;
      uint[] fake_memory;
};

and write to that instead.
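
For instance, the existing write_mem calls could be routed through a hypothetical helper targeting the new binding:

// Illustrative only: redirects the Cmd_BeginClip writes to the dummy buffer.
void write_fake_mem(uint offset, uint val) {
    fake_memory[offset] = val;
}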

raphlinus commented on June 27, 2024

How unfortunate. Did you get to try Snapdragon Profiler? It's the platform profiler for Adreno chipsets, and may give more detailed information than the generic Android one.

I did try to use it, but ran into the same layer problem as AGI 1.0 (google/agi#760). My understanding is that the performance counters it exposes are basically the same as AGI. It's possible I could track down what's going wrong, and would do so if I had evidence I would be able to capture higher quality information from the tool.

eliasnaur commented on June 27, 2024

Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands.

FWIW, I don't get any noticeable speedup by removing the write_mems (or read_mems) on my Pixel 1. However, I do gain a lot of performance by removing the sRGB conversion in fillImage. Again, the mere existence of the line is enough; I remove all FillImage commands from the scene.

raphlinus commented on June 27, 2024

This will be something to watch, and I suspect we're going to be running into variants of it. My best guess is that we're running into a heuristic in the compiler, perhaps triggered by register pressure, perhaps code size, perhaps something else we don't understand yet. It may be that the ubershader approach doesn't scale on this class of hardware/drivers. This is a reason to build the entire imaging model, so we know what we're dealing with.

I also had another thought about how to gain more insight here: it may be worthwhile trying to run the shader in OpenGLES rather than Vulkan, and use glGetProgramBinary.

eliasnaur commented on June 27, 2024

I also had another thought about how to gain more insight here: it may be worthwhile trying to run the shader in OpenGLES rather than Vulkan, and use glGetProgramBinary.

Doesn't vkGetPipelineCacheData include the shader program binary?

Anyway, I hacked a glGetProgramBinary-as-a-service into a Gio program. If you adb install kitchen.apk to a phone, you should be able to fetch binaries with curl. For example:

$ adb forward tcp:8080 tcp:8080
$ cd piet-gpu/piet-gpu/shader
$ spirv-cross --es --version 310 kernel4.spv > kernel4.glsl
$ curl -v -F "[email protected]" http://localhost:8080 > shader.bin

It works on my Pixel 1 and on my Fedora machine with an Intel GPU.

raphlinus commented on June 27, 2024

Fascinating, and thanks! I didn't realize the pipeline cache data could potentially be used to get the program binary. I tried it and was able to recover a file (k4_pipeline), but I'm not sure how useful it is. It seems pretty obfuscated or at least compressed.

I was also able to use the kitchen apk (thanks!), and the result of that looks like it might be more tractable. It clearly has some symbol names at the end, and the bulk of it is clearly low entropy, indicating likely executable binary. I'll see if I can get envytools to produce something intelligible.

k4_pipeline.gz
kernel4.bin.gz

raphlinus commented on June 27, 2024

Ahh, interesting. That would certainly explain why writes to memory make the difference between the two cases. It doesn't explain the sRGB conversion effect Elias observed above, but it might be a similar issue.

This is making me wonder whether we should be more rigorous in separating read-only memory access patterns from read-write, and annotating the buffers with readonly when appropriate. I had done a little experimentation on desktop and not observed any performance difference, but this looks pretty compelling.

In any case, thanks much for the insight, it's very helpful and really tells us where to look.

robclark commented on June 27, 2024

Yes, readonly is a good idea.. and I think also restrict so the compiler knows that the readonly image isn't also bound to a 2nd image binding point

eliasnaur commented on June 27, 2024

Adding restrict to every buffer and image doesn't seem to make any difference on Pixel 1. However, adding precision mediump float to k4 eliminates the performance hit from sRGB. The rendering is somewhat corrupted as a result, but I suppose that's a question of carefully using full-precision floats where needed. lowp doesn't make a noticeable difference in either performance or corruption.
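
For concreteness, the shape of that change on the GL ES path might be (which variables actually need highp is workload-specific; this is a sketch, not the real kernel4):

#version 310 es
precision mediump float; // default float precision for the whole shader
layout(local_size_x = 16, local_size_y = 16) in;

layout(binding = 0, rgba8) uniform writeonly mediump image2D image;

void main() {
    // Anything feeding addressing should stay at full precision:
    highp ivec2 xy = ivec2(gl_GlobalInvocationID.xy);
    // Color math, including the sRGB conversion, can usually run at mediump:
    mediump vec3 srgb = pow(vec3(0.5), vec3(1.0 / 2.2));
    imageStore(image, xy, vec4(srgb, 1.0));
}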

robclark commented on June 27, 2024

At a hw level, lowp and mediump are the same thing, fwiw. Using mediump where possible is recommended.

eliasnaur commented on June 27, 2024

I've added a mediump change to my work branch. It avoids corrupting the rendering output while still reclaiming the performance lost to sRGB conversions on my Pixel 1.

raphlinus commented on June 27, 2024

I did a little experimentation with adding mediump and didn't see any difference in Vulkan. I suspect that it has an effect for OpenGL shader compilation. I also believe it's possible to do in Vulkan, but requires the VK_KHR_shader_float16_int8 extension. I definitely think it's worth exploring; my guess is that it has a dual effect of increasing ALU bandwidth and reducing register pressure, but I also think it's a somewhat disjoint issue from the memory stuff we're seeing here.

eliasnaur commented on June 27, 2024

I did a little experimentation with adding mediump and didn't see any difference in Vulkan. I suspect that it has an effect for OpenGL shader compilation. I also believe it's possible to do in Vulkan, but requires the VK_KHR_shader_float16_int8 extension.

I'm not so sure. Gio's shader compilation goes through SPIR-V: piet-gpu shaders => SPIR-V => GL ES 3.1, and precision qualifiers survive the conversions without mentioning VK_KHR_shader_float16_int8 anywhere. My guess is that VK_KHR_shader_float16_int8 is for explicit float16 and uint8 etc. types, whereas precision qualifiers are merely hints.

My guess is that the Pixel 1 GPU is just register/ALU limited, whereas your Pixel 4 is less so.

I definitely think it's worth exploring, my guess is that it has a dual effect of increasing ALU bandwidth and reducing register pressure, but I also think it's a somewhat disjoint issue from the memory stuff we're seeing here.

I agree that the issue you're seeing is caused by something else. I brought up the mediump fix to eliminate the sRGB code as a culprit.

eliasnaur commented on June 27, 2024

This is making me wonder whether we should be more rigorous in separating read-only memory access patterns from read-write, and annotating the buffers with readonly when appropriate. I had done a little experimentation on desktop and not observed any performance difference, but this looks pretty compelling.

In any case, thanks much for the insight, it's very helpful and really tells us where to look.

Did you have any luck adding restrict to the buffers k4 uses? If not, do you know if it is legal to bind a single buffer more than once? If so, we can avoid multiple buffers by binding the memory buffer twice, once for read-only access and once for read-write. Something like:

layout(set = 0, binding = 0) restrict buffer Memory {
      uint mem_offset;
      uint mem_error;
      uint[] memory;
};

layout(set = 0, binding = 1) restrict readonly buffer MemoryRO {
      uint _mem_offset;
      uint _mem_error;
      uint[] memory_ro;
};

and then assigning both bind points 0 and 1 to the same memory buffer.
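
Usage would then split by access pattern, along these lines (a sketch; these signatures are illustrative, not piet-gpu's actual helpers):

uint read_mem_ro(uint offset) {
    return memory_ro[offset]; // commands, path segments: read-only binding
}

void write_mem(uint offset, uint val) {
    memory[offset] = val;     // clip state: read/write binding
}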

I experimented with the above, but my Pixel 1 does not show a performance cliff similar to your Pixel 4's, presumably because the isam instruction is not available or not used on it.

robclark commented on June 27, 2024

note that binding the same buffer twice with restrict keyword might result in undefined behavior.. ie. reads could be a TPL1 cache hit with stale data prior to the write

eliasnaur commented on June 27, 2024

The intention is to read and write dependent values from the same buffer. That is, clip state reads and writes will use the read/write memory buffer binding, while commands and path segments will be read from the read-only binding.

robclark commented on June 27, 2024

I suppose that could work, provided the read/write and read-only sections don't share a cache line.. but not sure I'd call that portable.

Is it possible to separate the r/o data into a separate buffer? This wouldn't depend so much on details of hw implementation.

raphlinus commented on June 27, 2024

I think the solution here (which I have working locally, not yet a PR) is to port a version of #77, so there is no explicit writing to buffers at all in kernel4, only the image write to the output buffer.

Note that similar issues almost certainly exist for other pipeline stages, where the code currently reads and writes a single unified buffer even though the actual input regions follow a read-only pattern. The performance lossage doesn't seem to be as serious, though I haven't done careful measurement yet. I'm not sure what the solution is. One approach is to revert the approach of 4de67d9 and have separate memory buffers. That would have the disadvantage of the stages not being able to share a free memory pool. Another approach is to reliably signal to the shader compiler that relaxed memory semantics are desired, but I'm not sure there is any way to do that other than fast-forwarding to a world in which the Vulkan memory model is widely adopted and there are performance tests in place to ensure that the optimizations it enables are actually exploited.

I haven't read up on "restrict" yet but based on my knowledge of the analogous keyword in C, I am skeptical that it is a reliable, portable solution. The description in the GLSL wiki is so vague and non-normative that it seems close to useless to my eyes. The spec language is slightly better but I don't yet fully understand it.

eliasnaur commented on June 27, 2024

restrict reads to me as exactly made for this case: to optimize memory access to separate buffers or regions within the same buffer.

raphlinus commented on June 27, 2024

My read: the optimization is sensible, but the restrict spec is a childish and inadequate attempt to express what is properly represented as relaxed memory semantics in a real, grown-up memory model. As Rob points out above, restrict on two different buffers aliased to the same memory is a violation of the letter of the spec, and, while it may give good results in some cases, it feels likely to cause problems in others.

By contrast, going for a segmented, 80286-style approach to data is a pattern that is quite well represented by existing GPU workloads and is likely to be optimized reliably, even though a big linear 386-style memory is very tempting, both for programmer ergonomics and because a unified free memory pool makes it less likely for an allocation to fail when there is plenty of free space in another buffer.

ishitatsuyuki commented on June 27, 2024

Some additional insights on this: having well-segmented buffers also helps desktop GPUs. Raph, you almost predicted this in your blog post :)

I should also note that these differences in the way compilers handle memory coherence are not in any way specific to mobile vs desktop; it would not have been surprising to see this issue pop up on a desktop GPU with very aggressive caching but not on a simpler mobile architecture. As an example of aggressive caching, RDNA has a cache attached to each SIMD execution unit, which is not necessarily coherent with the caches of other SIMD units even in the same workgroup.

The thing is that AMD GPUs have two kinds of memory caches: the scalar cache and the vector cache. The former is used for uniform loads and has a smaller cache line size, while the latter is used for everything else and typically has a 128-byte cache line. The caveat is that these two kinds of cache are not coherent. So if the compiler cannot prove that your accesses don't alias, you end up using vector memory instructions everywhere instead.

The problem with spilling writes therefore applies just as much to AMD GPUs: we basically waste some capacity by not utilizing the scalar cache and the scalar memory controller. The effect is not as catastrophic as on Adreno, but it probably makes a sizable difference (preliminary testing shows 950us -> 850us for k4). So that's another strong motivation toward a segmented buffer model.

Thanks to the Mesa developers for helping me pinpoint this; working with Mesa has always been a joy :)

eliasnaur commented on June 27, 2024

Can we get away with a single backing buffer bound to, say, separate readonly and writeonly bindings in the shader? I haven't dug into the specs, but it seems to me that should free the compiler to use non-coherent memory accesses.
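
For concreteness, a sketch of that split (bindings hypothetical; both would be backed by the same buffer, with all reads going through one and all writes through the other):

layout(set = 0, binding = 0) restrict readonly buffer MemoryIn {
      uint[] memory_in;
};

layout(set = 0, binding = 1) restrict writeonly buffer MemoryOut {
      uint[] memory_out;
};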

raphlinus commented on June 27, 2024

Wow, I was looking at AMD asm and wondering why there were vector instructions even for scalar memory access. I assumed it was because vector was at least as efficient so there was no benefit to scalar, but this explanation makes more sense.

Yes, a separate buffer for just the clip stack would also work fine - for this stage. There are other stages where there are lots of reads from input buffers and a write stage on the output. Those may be seeing slowdowns due to using a single allocation pool.

My sense of the best long term solution is to rely on memory model annotations to tell the compiler explicitly how much coherence we need. But that also feels more like a future thing; it doesn't help us with compatibility.

My inclination is to fix this by having the compiler spill by using an array variable, as it's simplest from our pov. The downside is that there is a fixed maximum stack size, but I think that can be big enough (256 is what I have in my head) that it's extremely unlikely to be a problem in practice.
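
A minimal sketch of that approach (the element type and helpers are assumptions; the real stack holds blend/clip state):

#define CLIP_STACK_SIZE 256

// Per-thread stack; at this size the compiler should spill it to scratch
// memory rather than hold it in registers.
uint clip_stack[CLIP_STACK_SIZE];
uint clip_sp = 0u;

void push_clip(uint packed_rgba) {
    clip_stack[clip_sp++] = packed_rgba;
}

uint pop_clip() {
    return clip_stack[--clip_sp];
}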

raphlinus commented on June 27, 2024

I haven't closed this issue because I'm not sure #77 is the right solution. I spent a little more time poking into it, and believe that scratch is pretty badly broken on the AMD 5700 XT: it has artifacts on Vulkan, and blue-screens on DX12. Talking with a number of other GPU experts, it seems like relying on a large array being placed in scratch memory is just not something we can expect to be implemented reliably and efficiently, though there are aspects of it that are appealing.

My gut feeling is that we'll need to implement a separate buffer for the clip spills (to avoid the performance problems from overly generous coherence) and manually spill into that. I'm wrestling with how to allocate it, as targeting the worst case might require a really large buffer. You only need an allocation that can support the maximum number of workgroups in flight. That is of course pretty GPU dependent, but I think we can make a conservative estimate (or perhaps even do a fingerprint probe on startup). I'm leaning toward that now, but also considering other possibilities. We could go back to the pre-#77 state, where the amount of scratch is computed in coarse rasterization. That's less dependent on the GPU execution model, but to my mind it just shifts the risks around: it's still very much possible to hit an out-of-memory condition, it's just that you're not paying for a maximum stack depth.

I'm leaning toward an atomic approach - maybe 32 * workgroup size bits, then on first push the workgroup relaxed-reads those, finds a zero bit, does atomicOr to set it, and loops if that doesn't succeed. On exit, atomicAnd to reset the bit. I don't think this will be super complicated or expensive.
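
A sketch of how that slot claiming could look (the buffer layout and binding are my assumptions, and handling a completely full mask is omitted):

layout(set = 0, binding = 4) buffer ClipSlots {
      uint[] slot_bits; // one bit per spill slot; 0 = free
};

shared uint wg_slot; // the slot this workgroup claimed

void claim_slot() {
    if (gl_LocalInvocationID.x == 0u) {
        uint slot = 0u;
        while (true) {
            uint word = slot / 32u;
            uint bit = 1u << (slot & 31u);
            // Relaxed read first; only attempt the atomic on a free-looking bit.
            if ((slot_bits[word] & bit) == 0u &&
                (atomicOr(slot_bits[word], bit) & bit) == 0u) {
                break; // the bit was clear when we set it: the slot is ours
            }
            slot++; // taken (or we raced); try the next one
        }
        wg_slot = slot;
    }
    barrier();
}

void release_slot() {
    barrier();
    if (gl_LocalInvocationID.x == 0u) {
        atomicAnd(slot_bits[wg_slot / 32u], ~(1u << (wg_slot & 31u)));
    }
}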

There can be some CPU-side accounting of maximum stack depth and allocation of buffers to support that. Basically, this issue needs a little more thought to figure out the best approach.
