vello's Issues

Minor glitches with reschart

This is a continuation of #68, now that the auto-closing of paths has been implemented.

After the fix, a few glitches can still be seen in the rendering. They are most visible around the crossing-like structures.

(screenshot: rendering)

Add SPDX license headers to shaders

Would it be appropriate to add SPDX headers to the shaders? They're dual-licensed unlike the rest of the project, and I use the headers in Gio where I'd like to copy the shaders verbatim.
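For reference, an SPDX header is a single comment at the top of each shader file; a hypothetical example, assuming the dual license is expressed as Apache-2.0 OR MIT (substitute whatever expression actually applies):

// SPDX-License-Identifier: Apache-2.0 OR MIT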

Winit example crashes on Linux with AMD GPU

The with_winit example crashes on my system when I try to run it. The with_bevy example works fine though.

OS: Arch Linux
Desktop Environment: Cinnamon
GPU: AMD RX 570

backtrace:

[examples/with_winit/src/hot_reload.rs:19] &Path::new(env!("CARGO_MANIFEST_DIR")).join("../../shader").canonicalize().unwrap() = "/home/billydm/Dev/vello/shader"
thread 'main' panicked at 'failed to get surface texture: Timeout', examples/with_winit/src/main.rs:189:18
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:65:14
   2: core::result::unwrap_failed
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/result.rs:1791:5
   3: core::result::Result<T,E>::expect
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/result.rs:1070:23
   4: with_winit::run::{{closure}}::{{closure}}
             at ./examples/with_winit/src/main.rs:186:35
   5: winit::platform_impl::platform::sticky_exit_callback
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.27.5/src/platform_impl/linux/mod.rs:849:9
   6: winit::platform_impl::platform::x11::EventLoop<T>::run_return::single_iteration
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.27.5/src/platform_impl/linux/x11/mod.rs:380:21
   7: winit::platform_impl::platform::x11::EventLoop<T>::run_return
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.27.5/src/platform_impl/linux/x11/mod.rs:488:27
   8: winit::platform_impl::platform::x11::EventLoop<T>::run
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.27.5/src/platform_impl/linux/x11/mod.rs:503:25
   9: winit::platform_impl::platform::EventLoop<T>::run
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.27.5/src/platform_impl/linux/mod.rs:755:56
  10: winit::event_loop::EventLoop<T>::run
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.27.5/src/event_loop.rs:278:9
  11: with_winit::run::{{closure}}
             at ./examples/with_winit/src/main.rs:102:5
  12: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/future/mod.rs:91:19
  13: pollster::block_on
             at /home/billydm/.cargo/registry/src/github.com-1ecc6299db9ec823/pollster-0.2.5/src/lib.rs:125:15
  14: with_winit::main
             at ./examples/with_winit/src/main.rs:244:9
  15: core::ops::function::FnOnce::call_once
             at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/ops/function.rs:251:5

Also here is the output of the with_bevy example if it's any help.

2023-01-25T22:26:42.486751Z  INFO winit::platform_impl::platform::x11::window: Guessed window scale factor: 1    
2023-01-25T22:26:42.501557Z  INFO bevy_render::renderer: AdapterInfo { name: "AMD Radeon RX 570 Series", vendor: 4098, device: 26591, device_type: DiscreteGpu, driver: "AMD open-source driver", driver_info: "2022.Q4.4 (LLPC)", backend: Vulkan }

Missing elements when drawing >256 elements

When encoding more than 256 elements in a scene, elements start to disappear from the screen.
This seems to be caused by the binning step, which allocates a bin_header buffer too small to fit all the data.

The buffer size assumes a packed bin layout, while the access pattern in the shader differs. One workgroup of 256 threads executes binning for 256 elements, and each thread writes the output for one bin. Therefore the buffer should be sized to hold drawobj_wgs * 256 elements, or the shader adjusted.
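A minimal sketch of the sizing fix, assuming a binning workgroup size of 256 and a two-word BinHeader (both assumptions for illustration):

const WG_SIZE: usize = 256;
const BIN_HEADER_SIZE: usize = 8; // bytes per BinHeader (count + offset), assumed

fn bin_header_buffer_size(n_drawobj: usize) -> usize {
    // Round draw objects up to whole binning workgroups, then reserve a slot
    // for every (workgroup, bin) pair the shader may write.
    let drawobj_wgs = (n_drawobj + WG_SIZE - 1) / WG_SIZE;
    drawobj_wgs * WG_SIZE * BIN_HEADER_SIZE
}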

A bit of clip optimization

I was curious about whether the clip stack really needs to store both rgba and another alpha channel, so had another look at that logic. I think it is possible to get rid of the extra channel, with just a bit of cleverness.

The tricky bit is that the coarse rasterizer needs the path at BeginClip, so it can do the optimizations, but the fine rasterizer only needs it at EndClip. The basic idea is to make it available to both.

Encoding: the EndClip element is annotated with the number of paths inside the clip, ie the path count at BeginClip plus this quantity equals the path count at EndClip. This will require just a bit of accounting, but hopefully not too bad. Also note that by storing the difference (as opposed to an absolute path count) the encoding is no less "relocatable" than before.

Coarse rasterization, input stage: for EndClip, the path_ix is not just the element_ix (coarse.comp line 236), but the delta encoded above is subtracted. Thus, we actually read the BeginClip path, and the path associated with EndClip is ignored.
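A sketch of that index arithmetic in the input stage (names here are illustrative, not the actual shader variables):

#[derive(Clone, Copy)]
enum ElementTag { BeginClip, EndClip, Other }

fn resolve_path_ix(element_ix: u32, tag: ElementTag, paths_in_clip: u32) -> u32 {
    match tag {
        // EndClip subtracts the encoded delta so it reads the BeginClip path;
        // the path nominally associated with EndClip is ignored.
        ElementTag::EndClip => element_ix - paths_in_clip,
        _ => element_ix,
    }
}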

Coarse rasterization, output stage: BeginClip has the same optimization logic, but does not write the fill. EndClip writes the fill right before writing the EndClip command.

Fine rasterization: BeginClip does not push an alpha value. EndClip does not pop the alpha value, but is otherwise basically unchanged; it already composites using area[k], which at present is 1.0 because coarse rasterization always sends a Solid command before the EndClip.

A couple other notes. The do-while in kernel4 Cmd_Fill should be a while loop, as there are cases when coarse rasterization can send a tile with no path segments (when the nesting depth exceeds 32; this is unusual but possible).

Lastly, I think using the same path (and thus the same bbox) for begin and end of clips might address the problem I was talking about in this comment, and allow the use of relative bounding boxes again, which would increase the compositionality of encoded scene subtrees.

I'm not 100% sure this will work, but I think so. I also reviewed the current separation between path alpha and color source, and that's looking good for gradients; thanks Elias!

Metal validation errors when running piet-gpu in the Xcode debugger

When trying to run piet-gpu in the Xcode debugger, I got the following error:

validateComputeFunctionArguments:745: failed assertion `Compute Function(main0): missing buffer binding at index 25 for spvBufferSizeConstants[0].'

And indeed if you look at for example binning.msl, you have a parameter constant uint* spvBufferSizeConstants [[buffer(25)]] even though there doesn't seem to be any buffer assigned to 25 in metal.rs's fn dispatch().
Having this fixed would make investigating problems on macOS easier.

As I asked on zulip, it seems to be related to the use of the --msl-decoration-binding flag passed to SPIRV-Cross.

By the way here's a step-by-step way to run the winit app in the Xcode debugger:

  • Build the app with cargo build.
  • Open Xcode.
  • In Xcode's menu, select Debug > Debug Executable..., and select the winit executable in the file selection dialog. It should be in the target/debug subdirectory of your copy of the repo.
  • The scheme editor shows up. In the Options tab in "Working directory" turn on "Use custom working directory" and enter the directory you want to run the app from, for example the piet-gpu subdirectory of your copy of the repo. Then in the "Arguments" tab, in "Arguments Passed On Launch" click on the "+" and enter the argument you want to pass, for example "Ghostscript_Tiger.svg".
  • Press "Close". To edit the settings afterwards, in Xcode's menu select Product > Scheme > Edit Scheme... (Cmd+<)
  • To start the app in the debugger, click on the play button on the top left, or Cmd+R.

Get tiger to run on Chrome Canary

The simple path rendering demo runs on Web (#201) without too much trouble, and we should get that merged. An intermediate version that renders the tiger runs on wgpu native (current dev/main), but there are some problems running on Web. This issue lists those problems and outlines what needs to be done. There are two blockers: workgroup shared memory exceeded and bindings exceeded, plus a number of paper cuts. Some of these can be fixed in the piet-gpu codebase; perhaps others are better addressed elsewhere.

Workgroup shared memory

The binning stage has 17408 bytes of workgroup shared memory, while the standard only mandates 16k. Most modern hardware can do 32k, so this is partly a case of the limit being more restrictive than it need be (this is why it runs on native), but in this case the best mitigation is to optimize the shader, as trimming workgroup shared memory usage can improve occupancy. The best candidate is sh_count, which is currently u32[N_TILE = 256][N_SLICE = 8]. These counts never exceed 256, so they can be packed as u16x2, or, with a bit of careful logic, u8x4 (we only use the exclusive sum from shared memory, so the max value is 224).
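A small sketch of the u8x4 packing (the exact shared-memory layout is an assumption):

fn pack4(counts: [u32; 4]) -> u32 {
    // Four 8-bit counts per u32; the exclusive sums never exceed 224, so 8 bits suffice.
    counts[0] | (counts[1] << 8) | (counts[2] << 16) | (counts[3] << 24)
}

fn unpack_lane(packed: u32, lane: u32) -> u32 {
    (packed >> (lane * 8)) & 0xff
}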

Number of buffer bindings

The coarse stage takes 10 bindings, and Chrome complains that the maximum is 8. The WebGPU spec only requires 8, but I consider this an unreasonable limitation, suitable only for low-spec hardware. I recommend the implementation raise it (which might require work in Chrome, but it's also possible this is being limited in the wgpu layer).

The reason for this is that the piet-wgpu reimplementation now uses a separate binding for each major data type. In some cases, that's mandated by the spec (an unsized array of atomics cannot share a binding with an unsized array of nonatomics, a limitation imposed by WGSL rather than the underlying APIs). In other cases, it seems like a good idea for performance reasons (input buffers can be bound readonly, which may have better cache behavior) and also for code clarity.

One of the buffers (Config) can be moved from storage to uniform. That just requires a bit of additional plumbing in the engine implementation (I'm pretty sure uniforms don't count against the limit).

That leaves one extra binding. Basically, we need to do suballocation by hand. The most direct way to do this is to blit, which helps keep things modular but at some cost (I consider this acceptable if we think the limits will be raised). To combine A and B, allocate the A buffer with size A+B, then before running coarse, blit B to after A. Good candidates include DrawMonoid + BinHeader, but Path and info are also plausible, as the sizes of these are all known at dispatch time.
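A sketch of the combined layout, under the assumption that both sizes are known at dispatch time:

struct Suballoc {
    a_offset: u64,
    b_offset: u64,
    total_size: u64,
}

fn combine(a_size: u64, b_size: u64, align: u64) -> Suballoc {
    // Place B immediately after A in one binding, aligning B's start so the
    // shader can still index it naturally.
    let b_offset = (a_size + align - 1) / align * align;
    Suballoc { a_offset: 0, b_offset, total_size: b_offset + b_size }
}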

Paper cuts

Tint warns about the use of let when const is more appropriate, but naga does not yet support const. We should patch naga to support at least the const syntax, though there is much more work to be done to bring constants fully up to spec.

Uniformity analysis failure (in the coarse shader) is a warning. We're tracking that separately in a number of issues, starting with gpuweb/gpuweb#3479 (comment).

Rendering glitches on path filling

With great interest, I read your blog post on the new code in piet-gpu and gave it a try on my Radeon RX 470 GPU:

08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev cf)

I got the example images from:
[1] http://w3.impa.br/~diego/projects/GanEtAl14/

It would seem there are a few rendering glitches, most notably:

  1. Some images are mirrored
  2. A few font glitches

Please note, I used the default radv driver on Arch Linux. There is an experimental new shader backend, which is faster, but has the same output (or at least I didn't notice any difference). It can be activated with export RADV_PERFTEST=aco.

Also, there is another Vulkan driver for AMD, namely amdvlk. That seems to be a little slower for most operations, but the output seems to be the same as well.

How were the benchmarks from the blog post performed? Merely running the CLI tool without any additional options? Given the tiger SVG from this repo, my old RX 470 seems to be faster than a 1060, which seems odd.

paper-1

cargo run --release --bin=cli --  inputs/svg/paper-1.svg
    Finished release [optimized] target(s) in 0.06s
     Running `target/release/cli inputs/svg/paper-1.svg`
parsing time: 44.684545ms
flattening and encoding time: 5.586778ms
scene: 104322 elements
Element kernel time: 0.258ms
Tile allocation kernel time: 0.016ms
Coarse path kernel time: 0.199ms
Backdrop kernel time: 0.074ms
Binning kernel time: 0.106ms
Coarse raster kernel time: 0.070ms
Render kernel time: 0.220ms

(screenshot: paper-1)

At scale=2

cargo run --release --bin=cli -- --scale 2  inputs/svg/paper-1.svg
    Finished release [optimized] target(s) in 0.06s
     Running `target/release/cli --scale 2 inputs/svg/paper-1.svg`
parsing time: 43.184138ms
flattening and encoding time: 3.919508ms
scene: 104322 elements
Element kernel time: 0.259ms
Tile allocation kernel time: 0.014ms
Coarse path kernel time: 0.157ms
Backdrop kernel time: 0.055ms
Binning kernel time: 0.024ms
Coarse raster kernel time: 0.111ms
Render kernel time: 0.693ms

(screenshot: paper-1-scale2)

tiger

cargo run --release --bin=cli
    Finished release [optimized] target(s) in 0.07s
     Running `target/release/cli`
parsing time: 1.110947ms
flattening and encoding time: 100.067µs
scene: 2655 elements
Element kernel time: 0.032ms
Tile allocation kernel time: 0.091ms
Coarse path kernel time: 0.162ms
Backdrop kernel time: 0.242ms
Binning kernel time: 0.021ms
Coarse raster kernel time: 0.126ms
Render kernel time: 0.290ms

(screenshot: tiger)

The tiger from [1], however, is upside-down as well.

 cargo run --release --bin=cli -- --scale 2  inputs/svg/tiger.svg 
    Finished release [optimized] target(s) in 0.05s
     Running `target/release/cli --scale 2 inputs/svg/tiger.svg`
parsing time: 1.601394ms
flattening and encoding time: 53.149µs
scene: 2835 elements
Element kernel time: 0.031ms
Tile allocation kernel time: 0.035ms
Coarse path kernel time: 0.118ms
Backdrop kernel time: 0.091ms
Binning kernel time: 0.012ms
Coarse raster kernel time: 0.115ms
Render kernel time: 0.251ms

(screenshot: tiger-2)

paris-30k

 cargo run --release --bin=cli --  inputs/svg/paris-30k.svg 
    Finished release [optimized] target(s) in 0.05s
     Running `target/release/cli inputs/svg/paris-30k.svg`
parsing time: 234.354676ms
flattening and encoding time: 40.583878ms
scene: 732418 elements
Element kernel time: 1.533ms
Tile allocation kernel time: 0.142ms
Coarse path kernel time: 0.423ms
Backdrop kernel time: 0.282ms
Binning kernel time: 0.161ms
Coarse raster kernel time: 0.355ms
Render kernel time: 0.637ms

(screenshot: paris-30k)

At scale=2

    Finished release [optimized] target(s) in 0.07s
     Running `target/release/cli --scale 2 inputs/svg/paris-30k.svg`
parsing time: 234.821258ms
flattening and encoding time: 39.913438ms
scene: 732418 elements
Element kernel time: 1.524ms
Tile allocation kernel time: 0.204ms
Coarse path kernel time: 0.501ms
Backdrop kernel time: 0.329ms
Binning kernel time: 0.147ms
Coarse raster kernel time: 0.897ms
Render kernel time: 2.888ms

(screenshot: paris-30k-scale2)

Resolve uniformity analysis warnings

The coarse shader gives warnings in WebGPU about failing uniformity analysis. The workgroupUniformLoad intrinsic is the principled way to resolve this, and is now in the spec. An implementation is expected soon in tint, and we'd like to (a) verify that this does solve the problem as expected, and (b) be robust to those warnings being turned back into errors, as barriers in non-uniform control flow are potentially a security problem.

Because we expect this feature won't land in naga for a while, we'll make the use of the intrinsic gated by an #ifdef, and only turn it on in web targets for now.

Possible Nvidia compiler bugs

I'm running into a number of cases where either the Nvidia shader compiler crashes (which seems pretty obviously bad) or I'm getting unexplained bad results on Nvidia with no comparable problem on Intel. This issue is to investigate whether these are actually shader compiler bugs or something I'm doing wrong, and whether there are workarounds. There are 3 separate subcases, but they may well be related.

These are all reported against a laptop GTX 1060, in a Gigabyte Aero 14, running Windows 10 build 19041.264, and Nvidia driver 445.87.0.0.

Case 1: shader compiler crash

This can be repro'ed from scratch by the following steps:

git clone https://github.com/linebender/piet-gpu
cd piet-gpu
git checkout nv_crash_1
cd piet-gpu
cargo run

On two machines, this results in a STATUS_ACCESS_VIOLATION deep inside the shader compiler, in the VkCreateComputePipelines call (I can provide stack traces etc if needed).

There is a workaround, which can be applied by checking out nv_crash1~ (ie the head commit in that branch is a revert of the workaround). This is a very small change and appears to be completely semantics-invariant. With the workaround, the example runs fine (producing a beautiful abstract artwork in image.png).

Case 2: ERROR_DEVICE_LOST

git clone https://github.com/linebender/piet-gpu
cd piet-gpu
git checkout nv_crash_2
cd piet-gpu
cargo run --bin cli

Again there is a workaround, but it's extremely fragile; almost any perturbation will provoke failure. And again the workaround looks to be completely semantics preserving (there's no chance that bin_ix can have this value; I came up with this while trying to figure out which bin was responsible).

This is the one I have the least confidence is a driver bug, but I've done a considerable amount of work trying to track down possible errors on my side, and of course it works fine on Intel.

Case 3: shared memory corruption

git clone https://github.com/linebender/piet-gpu
cd piet-gpu
git checkout nv_bug_3
cd piet-gpu
cargo run --bin cli

In this case, the element.comp kernel runs but produces incorrect output. Again the workaround looks to be entirely semantics preserving. Inspecting the output carefully, it seems the sh_state array is being corrupted; it looks like elements are being written to a wrong offset (the mat field can never be anything other than [1, 0, 0, 1] in this example, but looking at intermediate values it sometimes takes values from other fields such as linewidth).

Again, it's possible I'm doing something wrong (some undefined behavior somewhere) but I've looked pretty hard, and it runs fine on Intel.

Obviously, any help in further tracking these down would be appreciated.

Plan for glyph rendering

This issue is branched from a question raised in linebender/xilem#1 about our plan for glyph rendering, as that is the responsibility of piet-gpu rather than other layers in the stack.

For the first cut, all glyph rendering will happen as dynamic vector graphics rendering, getting the outlines from the font (currently through swash but possibly migrating to Oxidize) and assembling them into scene fragments. At first we will also not support hinting at all, nor stem darkening, nor RGB subpixel rendering. Results will be pretty good on high-dpi screens, but definitely suboptimal on low-dpi. Performance may be a problem on very low-spec GPUs as well.

The roadmap has a number of improvements. One is to support a glyph cache, which by itself doesn't change rendering quality, but will improve performance greatly (especially on low-spec GPUs, on which pulling textures from the glyph atlas will outperform rendering vectors). The infrastructure for glyph caching (which involves tracking the transform at which a glyph will be displayed) also enables hinting (which is fairly well implemented by swash already).

When hinting is enabled, it will be a configurable option. It is desirable for UI text (and you'd want it for displaying the text in a text editor), but it is not desirable when the scene is more dynamic. Cases where hinting would cause "shimmering" effects include pinch-to-zoom (or any transform other than scrolling) and animation of variable font parameters.

An important quality improvement (especially on macOS) is stem darkening. The best way to implement this is to make coarse path rasterization capable of offset as well as flattening (the same logic will support both stroke rendering and stem darkening). I've figured out the math for this and have a prototype (in JS) but haven't yet implemented it as a compute shader; there are some tricky aspects, especially dealing with joins at corners. This improvement is orthogonal to glyph caching.

Lastly, to support low-dpi screens (especially Windows) we would want to implement RGB subpixel rendering, which is mostly a change in fine rasterization.

With a glyph cache in place, we have a choice. Either this cache is filled by doing path rendering in piet-gpu, or by calling into platform glyph rendering (particularly DirectWrite on Windows). The latter would allow pixel-perfect matching to platform text, but at a performance cost and with other tradeoffs (having two pipelines rather than one always causes problems). My current inclination is to render all the glyphs ourselves, but this is something we can revisit. Note that one option for platform glyph rendering is to take a dependency on font-kit.

New element processing pipeline

It's time to redo the element processing pipeline. The proposal in this issue fixes the following problems (some, but not all, of the clip-related issues are also addressed in #88):

  • Accounting of clip bbox needs to be done by CPU, and is in global coordinates. This seriously impacts composability of encoded scene subgraphs.
  • Clip bbox is accounted as the union of enclosed drawing commands, rather than intersection of clips in the stack.
  • Clip mask is computed at BeginClip but is not needed for alpha composition until EndClip, requiring more memory bandwidth in fine rasterization.
  • Bounding boxes are not tight in the presence of non axis aligned transforms.
  • Element processing is slower than it might be.
  • Adding blends makes many of the above issues worse.

The main theme of this proposal is to divide the scene buffer into streams, rather than having a single flattened encoding of the scene tree. Later streams can reference earlier streams, and use a prefix sum of the count to track the reference. This idea is a continuation of the transform work done by Elias (#62, #71), but takes it farther.

In this proposal, there are 3 streams encoded by the CPU: transform, path, and draw object. From the draw object stream is derived a clip stream, so there is a pipeline of 4 stream processing shaders, each implemented as a single compute dispatch performing a monoid scan. It should be emphasized that the size of all these streams is easily determined during encoding and is used to allocate buffer space and determine the number of workgroups in each such dispatch. We're already doing a lot of that.

Thus, an encoded scene (or subgraph of a scene) is a tuple of the 3 streams, as well as some counts of number of transform and clip elements. The idea is that it's still straightforward and efficient to stitch together encoded subgraphs, which is done by concatenating each of the streams and adding the counts.

Transform stream

The transform stream is very simple, each element is just an affine transform, and the monoid is affine matrix multiplication.

As a potential later refinement, we might encode push and pop elements and use a stack monoid, rather than the current approach of representing a pop by pushing the inverse transform. That would allow degenerate transforms and reduce roundoff error (because of floating point, we might not end up with exactly the identical transform), but it's not clear it's worth the effort at this point.
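A minimal sketch of the transform monoid, with the affine layout [a, b, c, d, tx, ty] as an assumption; combine is matrix multiplication, and the identity transform is the monoid unit:

#[derive(Clone, Copy)]
struct Affine([f32; 6]);

impl Affine {
    const IDENTITY: Affine = Affine([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]);

    // self.combine(other) applies other first, then self.
    fn combine(self, other: Affine) -> Affine {
        let a = self.0;
        let b = other.0;
        Affine([
            a[0] * b[0] + a[2] * b[1],
            a[1] * b[0] + a[3] * b[1],
            a[0] * b[2] + a[2] * b[3],
            a[1] * b[2] + a[3] * b[3],
            a[0] * b[4] + a[2] * b[5] + a[4],
            a[1] * b[4] + a[3] * b[5] + a[5],
        ])
    }
}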

Path stream

The path stream has the following elements:

  • Transform()
  • SetLineWidth(line width)
  • LineSegment(p0, p1)
  • QuadSegment(p0, p1, p2)
  • CubicSegment(p0, p1, p2, p3)
  • EndPath()

(Alternatively, we might keep SetLineWidth as is done now. I don't have a strong feeling about this, but encoding the line width in the End element seems like it might be slightly more efficient.)

There is one transform element in this stream for each transform in the transform stream. Thus, a prefix sum of the count is a reference to the corresponding transform. In this stream the transform element has no payload.

The shader for this stage does two things: apply the transform to the points, and accumulate a bounding box for each path. The output goes into two vectors, one for path segments and one for paths. Overall this shader is very similar to the current element processing stage; the main difference is that it does apply the transform (as used to be the case before that moved to coarse path rendering in #71) but relies on the multiplication of transforms having been done in the previous stage.

Draw object stream

The draw object stream has the following elements:

  • BeginClip()
  • EndClip()
  • FillColor(color)
  • FillGradient(gradient params)
  • FillImage(image params)

The draw object stream processing shader has two outputs. One is a vector with one element per draw object, containing a clip count and a path count (these counts are straightforward prefix sums). The other is a compacted vector containing an element only for BeginClip and EndClip draw objects, the output of which is a tag indicating begin/end and a path index.

(An alternative to counting paths is to assume that each draw object is one path, and have empty elements in the path stream when that's not the case - currently just EndClip - but that's less great if and when we allow rectangles or other geometries in addition to Bézier paths)
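A sketch of the per-draw-object monoid implied by the counts above (field names and which tags contribute are assumptions):

#[derive(Clone, Copy)]
enum DrawTag { BeginClip, EndClip, FillColor, FillGradient, FillImage }

#[derive(Clone, Copy, Default)]
struct DrawMonoid {
    clip_ix: u32,
    path_ix: u32,
}

impl DrawMonoid {
    fn new(tag: DrawTag) -> DrawMonoid {
        DrawMonoid {
            // BeginClip and EndClip each consume one slot in the clip stream.
            clip_ix: matches!(tag, DrawTag::BeginClip | DrawTag::EndClip) as u32,
            // Assume every draw object except EndClip references a path of its own.
            path_ix: !matches!(tag, DrawTag::EndClip) as u32,
        }
    }

    fn combine(self, other: DrawMonoid) -> DrawMonoid {
        DrawMonoid {
            clip_ix: self.clip_ix + other.clip_ix,
            path_ix: self.path_ix + other.path_ix,
        }
    }
}

An exclusive prefix scan of DrawMonoid::new over the stream then gives each draw object its clip index and path index.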

Clip stream

The clip stream (the second vector produced by the draw object shader above) has the following elements:

  • BeginClip(path_ix)
  • EndClip()

The shader implements the stack monoid computing the intersection of the bboxes. The output contains both a bbox and a reference (index) to the clip path (this is the same as the one on input for BeginClip, and is derived for EndClip). Note that begin and end both have the same bbox and reference the same path.

Changes to rasterization

A major benefit of this proposal is that it makes more and better information about clips available to subsequent stages. Here's how those stages are affected:

Binning

Binning uses the bounding box of an element to decide whether to place it into a bin for coarse rasterization. Currently for clip operations it uses the bounding box encoded by the CPU. In the new design, it instead uses the bounding box computed by the clip stream processing shader.

Coarse rasterization

For every element in the draw object stream, look up the clip index (output of the draw object stream shader). With that, look up the clip bbox (output of the clip shader). That bbox is intersected with the bbox of the draw object (which, in the case of paths, is now resolved from the output of the path stream processing shader).

To process BeginClip, resolve the clip path, then look up the tile. That is either all-0, all-1, or mixed. If all-0, increment clip_zero_count (similar to clip_zero_depth in the current logic; all drawing ops are quick-rejected when this is nonzero). If all-1, basically a no-op. If mixed, encode a BeginClip to the per-tile command list.

To process EndClip, resolve the clip path, then look up the tile. Importantly, this will always match the corresponding BeginClip. If all-0, decrement clip_zero_count. If mixed, encode the clip path tile, then encode EndClip. Note that the clip_one_mask and clip_depth variables and their associated logic can go away; their function is subsumed by the stack monoid in the clip stream processing.

Fine rasterization

Changes to fine rasterization are the same as described in #88. Basically, the encoding of the clip path is moved from BeginClip to EndClip, which is where the derived alpha mask values are actually used for composition. This eliminates the need to store the alpha values in the interim. Another small change is that it eliminates the possibility that a fill will be encoded with zero path segments (this could happen before on overflow of the 32 bit clip_one_mask stack window).

Blending

Blending is similar in many ways to clipping, but with some important differences.

A major design decision regarding blending is whether to require CPU encoding of a blend bounding box. Higher level vector representations such as SVG and the COLRv1 proposal do not include them, so they must either be inferred from the contents of the blend group, or assumed to be the drawing area. For font rendering, the latter may be reasonable (the drawing area being considered the glyph bounds, even if many glyphs are being drawn to an atlas).

Many drawing APIs do require bounding boxes for blend groups so the size of any needed temporary buffer can be known before drawing into that buffer. For example, in Direct2D there is a contentBounds field in the parameters of the PushLayer method.

If blends are relatively rare, then it makes sense to require some bounding box on encoding (not necessarily accurate), and optimize them on a per-tile basis in coarse rasterization. If, on pop of a blend group, no drawing operations have been encoded since the corresponding push, delete the corresponding push. This requires a bit of state in the coarse rasterization process, but not dramatically more than we already have.

Another possibility is to infer blend bounding boxes from the encoded scene. This is a sketch of how that might be done.

Run a monoid scan over the draw object stream that implements this pseudocode:

raw_blends = []
bbox = empty
for obj in draw_objects:
    match obj:
        case BeginBlend:
            raw_blends.push(('begin', bbox))
            bbox = empty
        case EndBlend:
            raw_blends.push(('end', bbox))
            bbox = empty
        other:
            look up clip bbox for object
            look up bbox for object (for example, path bbox if it's a path)
            bbox = union(bbox, intersect(clip_bbox, obj_bbox))

Run a second stack monoid scan over the result of this first one:

accum_bbox = empty
stack = []
blends = [undefined] * len(raw_blends)
for i in 0..len(raw_blends):
    match raw_blends[i]:
        case ('begin', bbox):
            stack.push((i, union(accum_bbox, bbox)))
            accum_bbox = empty
        case ('end', bbox):
            bbox = union(accum_bbox, bbox)
            (begin_ix, accum_bbox) = stack.pop()
            blends[begin_ix] = bbox
            blends[i] = bbox
            accum_bbox = union(accum_bbox, bbox)

The result is to compute the union of (clipped) bounding boxes of all draw objects that occur between each begin/end pair, and to store that bounding box at both the beginning and end.

Note, there are opportunities to fuse these scans, but it's unclear whether they are actually more efficient. Also note that these scans can be skipped altogether when there are no blends in the scene.

SOA vs AOS

Right now, piet-gpu uses an "array of structures" approach for most of its processing. I think in some cases a "structure of arrays" might be more efficient, as some operations might only use some fields.

Shader compilation plan

It's super annoying and error-prone to compile shaders and check the generated code into the repo. Yet there are advantages to having the generated code checked in, particularly not requiring extra tools to build and run, and also pinning those tools to a specific version, for repeatability.

The proposed solution is for the source of truth to be GitHub Actions that compile the shaders. Here is a somewhat more detailed plan.

All commits (and PRs) go to a newly created dev branch. This branch has no shader/gen directories. Thus, the PR contains changes to the shader source but not any compiled artifacts. The main branch tracks dev closely but has the gen subdirectories. A GitHub action runs on pushes to the dev branch and creates a commit to main updating the gen subdirectories.

This action runs on Windows, which gives us access to both the official DXIL signing DLL and also the Windows port of Metal tools (subject to figuring out how to download the latter). In future, we are likely to add to the sophistication of shader compilation in various ways, including getting metadata programmatically (binding renumbering on MSL/HLSL, workgroup size, etc) and spinning out more permutations. Instead of shell/ninja, we may move to a Rust tool, possibly using hassle-rs and related tools.

Large blend stack causes 10x slowdown in fine rasterization

I was trying to familiarize myself with the structure and code and noticed one of the timings seemed off on my machine.

Running cargo run --release --bin cli Ghostscript_Tiger.svg had the Render kernel time at around 3 ms, while all other kernels were in the range of 0.02 to 0.15 ms. Setting MAX_BLEND_STACK to 1 in kernel4.comp, the render kernel time drops to around 0.3 ms (and 1 ms for a MAX_BLEND_STACK of 16). I tested with some other SVGs (paper-1 and hawaii), which had similar results.

I had noticed the "large array per thread causing slowdown" in my own project previously, but have no idea if it is an issue specific to my system, since I've only tested with my own hardware. I'd be happy to help if any machine-specific debugging is needed, or to close this issue if it only applies to my setup. I'm using a GTX 1070 and had the slowdown on both Windows and Linux.

wgpu Adapter and Surface might not be compatible

Since the RequestAdapterOptions field compatible_surface is not specified (see here), wgpu may not give an adapter that can render to the display. But the current API means the Instance and Adapter are constructed at the same time and the Surface is constructed later, so without changing the API (which admittedly seems pretty natural as it is), it's not possible to fix this. Maybe it's worth separating Instance and Adapter creation? Or making the user provide the Instance and Surface themselves?

FWIW I ran into this incompatibility on my Intel+Nvidia laptop, but the default PowerPreference should select Intel, so I might just not have any compatible combinations. Still, it might be worth allowing for this on other machines.
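A sketch of what passing the surface at adapter-request time could look like (wgpu-style API; exact signatures vary by version, so treat this as illustrative):

async fn pick_adapter(window: &winit::window::Window) -> (wgpu::Surface, wgpu::Adapter) {
    let instance = wgpu::Instance::new(wgpu::Backends::PRIMARY);
    // Create the surface before choosing the adapter so it can be passed below.
    let surface = unsafe { instance.create_surface(window) };
    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions {
            power_preference: wgpu::PowerPreference::default(),
            // Guarantees the returned adapter can actually present to this surface.
            compatible_surface: Some(&surface),
            force_fallback_adapter: false,
        })
        .await
        .expect("no compatible adapter found");
    (surface, adapter)
}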

Rectangles

Currently every draw object is associated with a full Bézier path. That'll be particularly wasteful when we have more draw objects that are images or glyphs drawn from a texture atlas. I'm implementing axis-aligned rectangles as a special case of geometry, and finding it useful to write the design down for reference.

As usual, a change like this touches many stages of the pipeline. In addition to the changes listed below, there are places that make the assumption that the draw object index and path index are the same, and that will need to change.

Encoding

The design is for axis-aligned rectangles. Therefore the "append fragment with transform" function needs to handle one special case: the transform is not axis-aligned, and the appended fragment contains rectangles. In that case, the rectangles need to be expanded to full paths.

(in the piet render context, this logic is simpler, as the transform is known at the time the draw command is issued, but we're not planning on retaining this)

Alternative considered: use parallelogram as the primitive rather than axis-aligned rectangle, so it's closed under general affine transform.

The encoding is as follows: the draw object tag gets an extra bit indicating rect or path. If it's a rect, the drawdata stream gets 4 f32's representing the coordinates.

(to be determined: many choices for bit encoding; it's already got a bunch of bit magic for making size calculations faster)

Alternative considered: separate stream for rectangle coordinate data. That would require another scalar in DrawMonoid.

Draw scan

No major change here except to incorporate the size into scene_offset (the easiest way to do that is to have just one bit for rect, and mask and shift to add it to the size). Also, obviously, don't increment path_ix when the rect bit is set.

Currently draw_leaf fetches the bounding box, but it looks like that's unnecessary and can be cleaned up.

Note that rectangles are not represented in the pathtag or path data stream, so the pathtag scan and coarse path rendering are not affected.

Binning

Binning is the main place the bbox is processed, as it's the place in the pipeline where the clip bbox is available.

The output rect is stored into draw_bbox_alloc. This will serve as the dimensions of the rectangle as well for downstream processing.

Discussion question: I'm tempted to have draw leaf compute the bbox (otherwise it would be output in path_coarse). If it's the same format, that format will need to change, as it's currently quantized to integers (so it can be operated on by atomics). If the format is changed (say, 8 bits of fraction), then binning can be completely unaffected.

Coarse raster

When the rect bit is set, don't reference the path data structure, rather do some special-case processing.

The interior of the rect is a solid, but tiles on the edge need to issue a Cmd_Rect; see below for more details on that.

Fine raster

One new command: Cmd_Rect. Rendering is by multiplication of half-planes antialiased by clamp (this should be about as fast as non-AA rendering).

Also note: the most natural encoding for rect coords is 4 f32's, but those could be more compact. Using f16's (tile-relative) would also be natural, but even u8 (with 1/16 pixel quantization) is doable.
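A rough sketch of the per-pixel coverage computation for Cmd_Rect, following the clamped half-plane idea (the pixel is treated as a unit square at (px, py)):

fn rect_coverage(px: f32, py: f32, x0: f32, y0: f32, x1: f32, y1: f32) -> f32 {
    // Coverage against each of the four half-planes, clamped to [0, 1], then multiplied.
    let left = (px + 1.0 - x0).clamp(0.0, 1.0);
    let right = (x1 - px).clamp(0.0, 1.0);
    let top = (py + 1.0 - y0).clamp(0.0, 1.0);
    let bottom = (y1 - py).clamp(0.0, 1.0);
    left * right * top * bottom
}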

Add command for setting fill mode

The split between stroking and filling is awkward because there are two versions of many scene commands (StrokeLine/FillLine, StrokeQuad/FillQuad, StrokeCubic/FillCubic, Fill/Stroke), and some don't support strokes even though they easily could (BeginClip, EndClip, FillImage).

There's now support for command flags, which can be used to make stroke vs. fill a flag for every supported command.

However, I still find it somewhat awkward to select the correct command (or flag) when constructing scenes. One issue is that you don't specify a stroke/fill flag when constructing paths in Gio; rather the mode is given from the use of the path, and there is no reason you can't re-use a path for both a stroke and a fill. Another is that Gio separates clipping from filling, so that even strokes are merely a form of clipping. That model is easier to implement when all commands automatically support both stroking and filling.

So I'd like to propose a SetMode command that sets the filling mode for subsequent commands similar to how SetLineWidth sets the line width for subsequent strokes. In particular, SetMode has a single argument specifying the mode to be either stroking or nonzero-fill. A future even-odd mode naturally fits into SetMode as a third option.

The result is that there will no longer be duplicate commands for stroking and filling, and the user won't have to set a mode flag in every command that needs it. Instead, the elements.comp monoid will automatically fill the mode flags, and subsequent pipeline stages can use the flags as they see fit.
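A minimal sketch of the proposed state (names are assumptions): SetMode updates a running mode that the elements.comp monoid then stamps onto each subsequent path command, the same way SetLineWidth works for strokes.

#[derive(Clone, Copy)]
enum FillMode {
    Stroke,
    NonZeroFill,
    // A future even-odd mode would slot in here as a third option.
}

struct EncoderState {
    fill_mode: FillMode,
    line_width: f32,
}

impl EncoderState {
    fn set_mode(&mut self, mode: FillMode) {
        self.fill_mode = mode;
    }

    fn set_line_width(&mut self, width: f32) {
        self.line_width = width;
    }
}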

Image resources

This is a followup to #38 capturing our current thinking of how image resources should be managed.

CPU-side image resource type and creation API

The CPU-side image resource is basically just an Arc<[u8]>. It also contains a globally unique (atomic counter) id. Creation of an image resource requires no context or factory, and the object is easily Send etc. The image bytes are provided by impl Into<Arc<[u8]>>, which I believe is sufficiently ergonomic; it's satisfied by &[u8] and Vec<u8>, as well as Cow.

Possibly we wrap the whole thing in another Arc to make the reference smaller, but that probably doesn't matter, it's cheap to clone either way.
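A sketch of that resource type (field names and dimensions are assumptions added for illustration):

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

static NEXT_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Clone)]
pub struct Image {
    id: u64,
    width: u32,
    height: u32,
    data: Arc<[u8]>,
}

impl Image {
    pub fn new(width: u32, height: u32, data: impl Into<Arc<[u8]>>) -> Image {
        Image {
            // Globally unique id from an atomic counter; no context or factory needed.
            id: NEXT_ID.fetch_add(1, Ordering::Relaxed),
            width,
            height,
            data: data.into(),
        }
    }
}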

Encoding into scene fragment

Adding an image draw object to a scene fragment is similarly simple; it's basically appending a clone of the reference and an affine transformation (possibly other parameters like extend mode, not part of the current imaging model but worth considering).

Staging

Resource creation and encoding are simple. Staging to GPU is where it gets hard. The algorithm in this section is run every time an encoded scene is to be rendered.

Atlas vs descriptor array vs bindless

All three strategies are viable. Descriptor arrays would reduce copying of pixels but possibly incur overhead for managing descriptor sets, and come at the cost of poorer compatibility (as a data point, VK_EXT_descriptor_indexing is available on Pixel 6 Pro but not Pixel 4). A bindless approach goes even farther, requiring a recent GPU but reducing the cost of managing descriptor sets.

For the first implementation, we're going atlas only, for maximum compatibility and because it has some other desirable properties. The atlas contains scaled (more generally, affine transformed) images. Further, to run the pipeline, the atlas must contain all images inside the viewport. This can potentially fail if the maximum dimensions of an atlas are exceeded, so see #175 for a discussion of spatial subdivision. In constrained cases, the atlas contains the scaled image clipped to the current viewport (so subdivision is basically guaranteed to reduce atlas requirements), but in relaxed cases it may be desirable not to clip, so that, for example, scrolling can happen without additional re-scaling.

Detailed staging algorithm

The renderer state consists of:

  • A staging GPU buffer (vkBuffer).
  • An LRU cache mapping image id to GPU image (vkImage on Vulkan).
  • An atlas GPU image, which may be large. The atlas must contain all image resources needed to render the viewport (after spatial subdivision as that applies)
  • A mapping of (image id, transform matrix) pairs to texture rectangles in the atlas image. (note: some additional complexity is needed to account for clipping to viewport)
  • A set of image ids represented in draw calls in the pending command buffer

The first step of staging is to allocate all (id, transform) pairs that appear in the encoded scene, inside the viewport, to rectangles in the atlas. Note that this requires affine transforms (motivating moving that from GPU to CPU, doing a bit of work on the "append fragment with transform" method on scene fragments).

Iterate all (id, transform) pairs in the encoded scene and resolve each to a rectangle. On miss, attempt to allocate the rectangle in the atlas (perhaps using etagere or a custom rectangle allocator). If that fails, blow away the entire atlas mapping and start again. If the atlas contained any mappings that weren't present in the current scene, then it's possible that a retry with the same size will succeed; otherwise not, so the atlas must be resized. And if that fails, fall back to spatial subdivision.

At this point there is a list of new (id, transform) to rectangle mappings, and also every (id, transform) pair represented in the encoded scene has a mapping in the atlas. The current task is to fill those rectangles with scaled images. Generally this involves blit and draw calls added to a command buffer.

For each new mapping, first materialize the GPU image. Look up the id in the cache. On miss, try to allocate space in the staging buffer. If allocation fails, flush the command buffer, fence (waiting for the staging buffer to become writable again). If the staging buffer is smaller than the source image bytes, reallocate the staging buffer. At this point, it is possible to write image bytes into the staging buffer, so map that, copy from the Arc<[u8]>, allocate a GPU image, and record a blit command to copy from the staging buffer to the GPU image.

Further logic in cache for eviction: if the GPU image being evicted is represented in any pending draw call in the command buffer, then flush the command buffer. This state may also be used to prioritize evicting images not in the pending set.

At this point (cache hit or creation of new GPU image) we have a GPU image for the id, and we have a rectangle in the atlas. Record a draw call (adding the id to the pending set). Note that this draw call requires building out enough rasterization capability in the HAL to do textured quads.

For each (id, transform) image instance in the CPU-side scene fragment, record the corresponding atlas rectangle in the encoding to be uploaded to GPU.

Double buffer staging buffer?

It's likely we'll want two GPU buffers rather than one, so the CPU can be copying bytes and recording draw calls while the GPU is executing blits and draws. But this is a slightly unclear tradeoff, as it might mean more frequent flushes.

Extension modes

Mirror, repeat, etc., in the general case require handling in the draw calls, storing the result of that in the atlas. In special cases (axis aligned where the bounds of the image align to integers) it might be possible to store only one instance and move the extension logic into fine rasterization.

Of course in the future when descriptor arrays or bindless are available, then at least in some cases fine rasterization will sample from the image rather than a rectangle in the atlas.

Fine rasterization

For the most part, we can use the same Image command as now, which does imageLoad from the atlas. One potential refinement is to only load the image texture when the alpha mask is nonzero. Currently we issue a texture load for all pixels in the tile. It's possible there is overhead from predication, but I suspect that reducing memory bandwidth for texture fetches of unused pixels will be worth it.

Glyphs

We'll have a separate issue for cached glyphs when we get to those, but much of the logic is similar. The glyph atlas must contain all glyphs needed to render the viewport (post spatial subdivision), and the staging process has a similar flavor.

Robust dynamic memory

This issue outlines our plans for robust dynamic memory in piet-gpu. Right now, we essentially hardcode "sufficiently large" buffers for the intermediate results such as path segments, per-tile command lists, and so on. The actual size needed is highly dependent on the scene, and difficult to predict without doing significant processing. This is also an area where support in modern GPUs is sadly lacking and might be expected to evolve. Thus, a satisfactory design involves some tradeoffs.

The goals are to (a) reliably render correct results, except in extreme resource-constrained environments relative to the scene (ie when the scene is adversarial in its resource requirements), (b) use modest amounts of memory when more is not required, and (c) not impact performance too much. These goals are in tension.

The general strategy is similar to what's already partially supported in the code, but currently lacking a full implementation CPU-side. A small number of atomic counters (currently 1, but this will increase, as described below) provide bump allocation, then all memory writes are conditional on the offset being less than the allocated buffer size. Currently, we have logic to early-out when memory is exceeded, but we may replace that with logic to proceed when allocation was successful at the input stage, so that the atomic counter accurately reflects how much memory would be needed for the successive stage to succeed. (Fine implementation detail: if the number of stages is no more than 32, then atomicOr a bit corresponding to the failed stage, and for early-out do a relaxed atomic load testing bits for input dependencies. If greater than 32, atomicMax the stage number, which is assigned in decreasing dependency order (so if A depends on B, A < B), and again do relaxed atomic load confirming this value is less than the minimum of the input dependencies)
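A sketch of the bump-allocation pattern described above, written as Rust standing in for the shader logic (the function name is illustrative): the counter always advances, so a failed run still records how much memory the stage actually wanted, and the caller simply skips the write.

use std::sync::atomic::{AtomicU32, Ordering};

fn bump_alloc(bump: &AtomicU32, size: u32, capacity: u32) -> Option<u32> {
    let offset = bump.fetch_add(size, Ordering::Relaxed);
    if offset + size <= capacity {
        Some(offset)
    } else {
        // Out of budget: skip the write, but the counter now reflects the need.
        None
    }
}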

Currently there is one command buffer submission all the way from the input scene to the presentation of the rendered frame buffer. In this proposal, we split that in half and fence back to the CPU after the first submission. The first submission is everything up to fine rasterization. On fence back, the CPU checks the value of the atomic counter, and if it's less than or equal to the buffer size, submits a command buffer for fine rasterization.

If it's greater than buffer size, it reallocates the buffer, rebinds the descriptor sets, and tries again. One reasonable choice for the new size of the buffer is the value of the atomic counter, perhaps rounded up a bit. A heuristic could refine this, for example if the error is at an earlier pipeline stage, a multiplier could be applied on the reasonable assumption that later stages will also require additional memory.
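A sketch of the CPU-side loop around the coarse submission; the Gpu trait here is a stand-in for whatever engine abstraction actually submits work, and all names are assumptions.

trait Gpu {
    fn realloc_memory_buffer(&mut self, size: u64);
    fn submit_coarse(&mut self);
    fn wait_fence(&mut self);
    fn read_bump_counter(&self) -> u64;
    fn submit_fine_and_present(&mut self);
}

fn render_frame(gpu: &mut dyn Gpu, mut buf_size: u64) {
    loop {
        gpu.submit_coarse();
        gpu.wait_fence();
        let needed = gpu.read_bump_counter();
        if needed <= buf_size {
            break;
        }
        // Grow to the observed requirement, rounded up a bit, rebind, and retry.
        buf_size = needed + needed / 4;
        gpu.realloc_memory_buffer(buf_size);
    }
    gpu.submit_fine_and_present();
}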

Related to this work, the blend stack memory is moved to a separate allocation and binding, for a definitive solution to #83. In particular, this allows the main memory buffer to be bound readonly in fine rasterization, which is confirmed to make a significant performance difference on Pixel 4. Note that the amount of blend memory required is completely known at the end of coarse rasterization, there are no dynamic choices in fine rasterization.

For a potential WebGPU port, the number of allocation buffers should increase. In particular, a separate buffer is required for atomics (linked list building and coarse winding number in coarse path rasterization), due to divergence of WGSL from standard shader practice in the types of atomic methods.

The blend stack memory is special, in that if it overflows, the coarse pipeline need not be rerun; it suffices to make sure the buffer is large enough. Note that this is an additional motivation to split that into an additional buffer, as interleaving blend stack and per-tile command list allocations would require rerun. Also, it is possible to make binning infallible, the worst case is the number of draw objects times the number of bins (worst case 256 in a 4k x 4k output buffer).

Spatial subdivision

In memory-constrained environments the reallocation may fail, either because the GPU is out of memory or because we wish to constrain memory consumed by piet-gpu to allow for other GPU tasks. Our general strategy to succeed in these cases is spatial subdivision. The theory is that a smaller viewport will require less memory for intermediate results, as well as bound resources such as images. That assumption may not hold for adversarial input, but in general should be fairly sound.

To support this case (a lower priority than above), there is an additional outer loop for spatial subdivision. On first run, the viewport is the entire frame buffer. On failure of the coarse pipeline, bumping into the memory limit, the viewport is split in half (recursively), and the pipeline is run on each half. Once the last coarse pipeline succeeds, the submission of the last fine rasterization command buffer can signal the present semaphore.

Also note that staging of resources can fail (filling the image or glyph atlas beyond memory or texture size limits) even before running the coarse pipeline, and that would also trigger spatial subdivision.

Alternatives considered

  • Fully analyze memory requirements CPU-side before submission. This requires duplicating a substantial amount of the coarse pipeline, which would be time-intensive. In addition, conservative memory estimates (based on bounding boxes rather than actual tile coverage, as well as nesting depth) may be wildly larger than actual, a particular problem for blend memory. Further, if the analysis is not conservative, the entire pipeline may fail.

  • Run an analysis pass GPU-side to estimate memory and fence back before running the "real" coarse then fine pipelines. This has most of the disadvantages of other approaches, but potentially fewer retries as scene complexity steps up. Fundamentally it is wasted work on the happy path.

  • Double memory buffer on failure rather than reading back the atomic counter. This may result in lg(n) retries (for large steps in scene complexity), while the proposed approach is proportional to the number of failed pipeline stages and is expected to be small in practice. On the other hand, failure reporting is simpler.

Simplification of Alloc struct

Currently the code has a MEM_DEBUG define which optionally makes Alloc effectively a slice rather than an offset. This is potentially useful for debugging, but is not enforced by any actual mechanism, and, more importantly, there is no support in the Rust runtime (it was developed for the Gio port). Newer work sometimes bypasses this mechanism, so the code is not consistent. I propose simplifying this so we just use offsets, which will also get rid of write_mem and read_mem. We have to be rigorous in always checking allocation failure, but that's a different place in the code. Medium term I think the best strategy is to have tests that rerun compute stages with varying allocation sizes, so we exercise the allocation failure cases as fully as possible. One other idea that might be worthwhile is allocating (for tests) a little extra "guard" memory past the actual allocation, then checking that none of those values have been overwritten.

offset or index

Currently byte offsets are used throughout, resulting in a lot of >> 2 to convert it into an index into uint[]. The offset is convenient for Rust and when we were contemplating writing HLSL directly (ByteAddressBuffer indices), but in GLSL it would be more natural, and possibly skip an ALU op, to use indices to u32 directly.

I don't think I want to change that at this point, but it's worth considering.

upgrade piet dependency (0.5?)

It's nice to be able to use newer APIs like Color::from_hex_str; and 0.2 is rather old. OTOH I see (and don't really understand) what goes wrong when trying to upgrade it:

error[E0277]: the trait bound `PietGpuImage: piet::Image` is not satisfied
   --> piet-gpu/src/render_ctx.rs:157:5
    |
157 |     type Image = PietGpuImage;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `piet::Image` is not implemented for `PietGpuImage`
    | 
   ::: /home/rutledge/.cargo/registry/src/github.com-1ecc6299db9ec823/piet-0.5.0/src/render_context.rs:84:17
    |
84  |     type Image: Image;
    |                 ----- required by this bound in `piet::RenderContext::Image`

error[E0049]: method `clear` has 0 type parameters but its trait declaration has 1 type parameter
   --> piet-gpu/src/render_ctx.rs:191:13
    |
191 |     fn clear(&mut self, _color: Color) {}
    |             ^ found 0 type parameters, expected 1

error[E0046]: not all trait items implemented, missing: `capture_image_area`
   --> piet-gpu/src/render_ctx.rs:155:1
    |
155 | impl RenderContext for PietGpuRenderContext {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ missing `capture_image_area` in implementation
    |
    = help: implement the missing item: `fn capture_image_area<impl Into<Rect>>(&mut self, _: impl Into<Rect>) -> Result<<Self as RenderContext>::Image, piet::Error> where impl Into<Rect>: Into { todo!() }`

error[E0046]: not all trait items implemented, missing: `trailing_whitespace_width`
  --> piet-gpu/src/text.rs:81:1
   |
81 | impl TextLayout for PietGpuTextLayout {
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ missing `trailing_whitespace_width` in implementation
   |
   = help: implement the missing item: `fn trailing_whitespace_width(&self) -> f64 { todo!() }`

error: aborting due to 4 previous errors

Radial gradients

Right now there's (somewhat rough) linear gradient support, but radial gradients are part of both the piet imaging model and COLRv1. Actually COLRv1 has a "sweep" gradient as well, which is not part of piet.

To avoid encoding bloat, this should happen after variable size scene encoding (#152).

There are a bunch of edge cases, sadly, and we'll need to figure out how to handle them. It wouldn't be a bad idea to have or refer to a spec that nails down behavior.

Below are some notes.

WebRender used to have a shader for radial gradients, but got rid of it not long ago in favor of a cached texture. I can see the value in that for web content, but when the gradient itself might be dynamic, it may be better to do it in the shader. In any case, we currently don't have an infrastructure for cached resources (it is coming, though).

Another WebRender patch changed the way radial gradients are rendered. It's not obvious to me whether the "before" or "after" state is a better model for us; in any case, linking the patch here lets us look at both.

Apple has a good page with info about radial gradients in Quartz. It's not normative, but might be interesting.

And here's the relevant section from the SVG spec. It probably should be considered normative. I'm not sure if the expectation is that it's exactly the same as COLRv1, but I would be surprised if there were major differences.
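
For reference, the common two-circle formulation (as in Canvas and COLRv1, with SVG's focal-point form as a special case) boils down to solving a quadratic per pixel. Here's a rough CPU-side sketch of that math - illustrative only, not piet-gpu code, and it deliberately ignores the edge cases mentioned above:

// Find t such that point p lies on the circle interpolated between (c0, r0) and (c1, r1).
// Degenerate cases (a near zero, negative interpolated radius) are ignored here.
fn radial_gradient_t(p: (f64, f64), c0: (f64, f64), r0: f64, c1: (f64, f64), r1: f64) -> Option<f64> {
    let cd = (c1.0 - c0.0, c1.1 - c0.1);
    let pd = (p.0 - c0.0, p.1 - c0.1);
    let rd = r1 - r0;
    let a = cd.0 * cd.0 + cd.1 * cd.1 - rd * rd;
    let b = pd.0 * cd.0 + pd.1 * cd.1 + r0 * rd;
    let c = pd.0 * pd.0 + pd.1 * pd.1 - r0 * r0;
    let disc = b * b - a * c;
    if disc < 0.0 {
        return None; // no circle in the family passes through p
    }
    // One of the two roots; which one to prefer is exactly the kind of edge case a spec pins down.
    Some((b + disc.sqrt()) / a)
}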

Probably adding more references to "sweep" gradients should be followup, can be posted to this issue.

Dynamic GPU memory management

piet-gpu allocates large GPU memory for its working memory. I'd like for memory use to be proportional to scene complexity, or even a constant low amount with several passes over complex scenes. I'm particularly interested in a multi-pass scheme because I want to run piet-gpu shaders on the CPU at some point, where caching unchanged scenery is vital.

Again, if you have thoughts in this area, I'm all ears; otherwise I'll see if I can work something out.

Android port

I have a local branch which displays tiger on a Pixel 4, but it needs a fair amount of cleanup before it can be merged. This issue is to track that happening and also document some of the things I ran into.

I was not able to get winit to work on my device - for reasons I still do not understand, it never seems to get a WindowCreated event which is properly plumbed through to the user. I used ndk-glue instead and was able to get it to work. I have some concerns about that code as well, which I should raise as issues on that repo, but none of it is blocking the work at hand, which is to get it running well enough to do performance measurement. One concern is that at least one callback is intended to be synchronous, but the ndk-glue code just asynchronously sends an event (over a Unix fd pipe, for a reason I do not understand) to the user thread.

The Pixel 4 does not have the debug validation layers, and the current code depends on those existing in debug builds. It should detect their presence at runtime.

The Pixel 4 also does not have descriptor indexing. Again, this should be queried at runtime. This issue is considerably more complex, as it requires multiple compilations of k4, and also CPU-side code to deal with the non-descriptor-indexing case. The local branch just disables images.

For some reason, I'm getting ERROR_SURFACE_LOST_KHR when acquiring the swapchain after two presentations. This is probably something simple, I just haven't figured it out yet.

As part of the cleanup, I want to reduce code duplication between the existing two binaries (cli and winit) which are now three.

If anyone is really eager to try this at home, I can upload my local branch as a draft PR. But I plan on getting the real PR uploaded before long also.

Ghostscript_Tiger.svg is not rendered properly on macOS

On an iMac with a Radeon GPU, the image rendered by the cli is blocky with sketchy outlines. Line and circle drawing is fine. It seems the Bézier path rendering is not functioning with MoltenVK (latest 1.2.154 Vulkan SDK).

Examples crash WindowServer on macOS

Running this:

cargo run -p with_winit

An empty window is shown, and then after a second or two the WindowServer process seems to freeze, requiring a hard reboot. Unfortunately, no crash report is available in the system console, but if I leave it frozen for a while and reboot, a watchdog event seems to have been reported, noting that the WindowServer process is unresponsive and that the with_winit example is still running. No stack trace, no nothing.

Tried the following:

  • Release mode did not make a difference.
  • The with_bevy example is the same.
  • Disconnecting my external monitor did not make a difference.

I tried inserting println!()s and commenting things out to try to understand what's going on.

Commenting out surface.present() seems to make it not freeze the system outright, but does still seem to cause some mysterious jankiness even after the process is terminated (like scrolling in VSCode becomes jittery).

Then I looked at block_on_wgpu(), which has a note about deadlocking if it is "awaiting anything other than GPU progress". As I understand wgpu::Device::poll(), it defines "GPU progress" as work having been submitted to a queue, and it returns true if the queue is empty. I'm not sure, but I think that means that when wgpu::Device::poll() returns true, that is exactly the situation where there would be a deadlock. So I tried panicking when poll() returns true, and indeed it happens.

(Oddly enough, after crashing the process in a separate run with that panic, the jankiness in VSCode stopped...)

The render_to_texture_async() function does .await a buffer mapping after some stuff has been submitted to the queue (through run_recording()), and I'm honestly not sure what to make of wgpu's documentation here, because device.poll() is also meant to be called when awaiting buffer mapping events, but what exactly happens when the queue is fully emptied by the driver, but there are still outstanding buffer mappings?
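
For reference, here is the kind of pattern I mean (wgpu 0.14-style API; a sketch, not the actual vello code):

// Request a buffer mapping, then drive the device until the mapping callback has fired.
// Whether poll(Maintain::Wait) is guaranteed to run that callback once the queue is
// empty is exactly the question raised above. The buffer needs MAP_READ usage.
fn read_buffer_blocking(device: &wgpu::Device, buffer: &wgpu::Buffer) -> Vec<u8> {
    let slice = buffer.slice(..);
    let (tx, rx) = std::sync::mpsc::channel();
    slice.map_async(wgpu::MapMode::Read, move |result| {
        tx.send(result).unwrap();
    });
    device.poll(wgpu::Maintain::Wait);
    rx.recv().unwrap().expect("buffer mapping failed");
    slice.get_mapped_range().to_vec()
}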

To clarify: I'm not certain that poll() returning true while the future passed to block_on_wgpu() is still Pending is actually a reliable indication of a deadlock, because I'm not sure it captures pending buffer mappings. But if it is, it seems plausible that the device is being spammed by poll()s (with the Maintain::Wait argument no less), which somehow ends up crashing either WindowServer or the GPU driver (which is concerning in itself).

I'm sorry if I'm missing something and this is a wild goose chase - I'm really struggling (and I know I'm not alone here) with understanding how exactly wgpu::Device::poll() should be used, and I know that the wgpu team has iterated a bit on its behavior.

Blending

Now that the changes to clip are landing it's a good time to consider blending. This will very much build on top of the clip infrastructure, and, at least in the first iteration, will actually be a specialization of clip.

Blending is not supported in the piet API. For a first cut, we will extend piet-gpu directly. Whether we diverge from piet or keep them in sync is a deeper question; I'm leaning towards diverging.

The blend modes are defined in the Compositing and Blending Level 1 spec from the W3C. The immediate goal is to support COLRv1 - if there's anything that's complex or tricky, the deciding factor is whether it's needed in COLRv1. I haven't carefully gone through the spec, but from my current understanding, the main things are:

  • blend enum
  • compositing enum
  • flag for isolated (and maybe some other stuff)
  • alpha value that applies to the group

In the initial implementation, blends are additional parameters added to clip. Each clip still has an associated path, which can just be a rectangle. (We can consider relaxing this and allowing blends without a path, as discussed some in #119, but this will require some architectural changes). BeginClip is annotated with a flag indicating whether it's a pure Porter-Duff over with unity alpha; if not, then the "all-1" optimization is disabled, as that still needs to push a layer for further compositing. It also has a flag for isolated, indicating the initial value of the newly pushed layer.

EndClip is annotated with the remaining info: the blend enum, compositing enum, and alpha value.
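
To make that concrete, a rough sketch of the extra per-element data (names and types are illustrative, not the actual encoding):

// Sketch of the BeginClip/EndClip annotations described above.
struct BeginClipInfo {
    // True when the blend is a plain Porter-Duff "over" with unity alpha;
    // when false, the "all-1" optimization is disabled so a layer is pushed.
    simple_over: bool,
    // Whether the pushed layer starts from transparent (isolated) or from
    // a copy of the backdrop (non-isolated).
    isolated: bool,
}

struct EndClipInfo {
    blend_mode: u32,     // blend enum from the W3C compositing spec
    composite_mode: u32, // compositing (Porter-Duff) enum
    alpha: f32,          // group alpha applied when the layer is popped
}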

I reread the COLRv1 spec and believe it falls into the above framework: a PaintComposite node is translated into drawing the backdrop node, doing a BeginClip, drawing the src node, then doing an EndClip. The glyph viewport may be used as the clip path. It's possible I'm missing something, as this stuff can be a bit subtle.

Also, it seems to me compositing should happen in sRGB color space. Right now, alpha compositing happens in linear sRGB space, which can give superior antialiasing but may not be compatible. We probably need a mode to set this, either at compile or run time.

Proper nested clipping

This issue contains a detailed design of how to properly add clipping to piet-gpu. It is a followup to #30, which represents a more limited form of clipping which might be useful in some UI contexts, but doesn't efficiently capture the general clipping problem. The PostScript-derived imaging model entails clips being nested to arbitrary depth, with the nesting defined by save and restore operations. The arbitrary depth is challenging for a GPU, which is generally optimized for fixed-sized grains of work. Thus, this design contains profound changes across the entire pipeline.

A central concern is bounding box tracking. The invariant associated with a bounding box of a drawing operation is that pixels outside the bounding box are not affected. In the case of a clip operation, it is tempting to take the bounding box of the clip mask, but this is inadequate. Pixels outside the bounding box of the clip mask are affected, in particular all drawing operations inside the clip become no-ops.

The invariant is restored when the bounding box of a clip operation is taken as the union of the bounding boxes of the drawing operations inside.

Encoding

A long-term goal of piet-gpu is to make the CPU-side encoding process as lightweight as possible. To this end, there is a nontrivial bbox tracking mechanism in the encoding stage of the pipeline, eliminating the burden to track bboxes for stroke and fill elements during CPU-side encoding. An additional potential benefit of delegating bbox tracking to the GPU is that display lists can be assembled by binary string concatenation of encoded display list fragments.

However, maintaining those properties in the face of nested clips is nontrivial, so we propose to regress the light weight of CPU encoding, with the hope of restoring it in a later cycle. In this design, we will track the bbox of each drawing element. When encoding a clip, we include the clip's bbox (which is the union of the bboxes of the drawing elements contained in the clip) in a "push clip" element, and also in the corresponding "pop" element, which is encoded at the time the restore method is called on the render context.

Note that bbox tracking also entails tracking of transforms, another task that was entirely subsumed by element processing in the previous design.

Element processing

Element processing is barely affected by this design; the main requirement is passing through the encoded bbox.

Element processing also potentially computes the bbox of the clip path. I believe it makes sense to simply discard that, and use the encoded bbox.

Binning

Binning is barely affected; the main requirement is to use the clip's bbox for binning purposes (as opposed to the clip path's bbox, if both bboxes are in fact preserved by the element processing stage).

Coarse rasterization

An unoptimized version of coarse rasterization would be fairly simple, and likely this should be done as part of a staged implementation strategy. Basically, the "push clip" element is recorded in the ptcl as a "push clip", and likewise for a "pop". Keep in mind that "push clip" has a fair amount in common with a "fill" element, in particular the fact that it references a clip path composed of a number of path segments preceding the "push clip" element.

However, considerable optimization is possible, as in the case of fill, when the set of path segments intersecting the tile is empty. The handling will vary depending on the backdrop. When the backdrop is 0, the clip need not be encoded at all, and the drawing operations until the corresponding "pop" can be suppressed. This can be tracked as a single "culled nesting level" integer, with a default of 0. All drawing operations check it and skip when it is non-zero. A "push clip" with empty path segments and a backdrop of 0 increments it, as does any "push clip" when it is already non-zero. A "pop clip" decrements it when it is non-zero.
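
Here's a CPU-side sketch of that counter logic; the command shape and names are made up for illustration, not the actual ptcl encoding:

#[derive(Clone, Copy)]
enum TileCmd {
    PushClip { has_segments: bool, backdrop: i32 },
    PopClip,
    Draw,
}

// Sketch of the "culled nesting level" tracking described above.
fn coarse_for_tile(cmds: &[TileCmd], ptcl: &mut Vec<&'static str>) {
    let mut culled_depth: u32 = 0;
    for cmd in cmds.iter().copied() {
        match cmd {
            TileCmd::PushClip { has_segments, backdrop } => {
                if culled_depth > 0 || (!has_segments && backdrop == 0) {
                    // Nothing inside this clip can affect the tile.
                    culled_depth += 1;
                } else {
                    ptcl.push("push_clip");
                }
            }
            TileCmd::PopClip => {
                if culled_depth > 0 {
                    culled_depth -= 1;
                } else {
                    ptcl.push("pop_clip");
                }
            }
            TileCmd::Draw => {
                if culled_depth == 0 {
                    ptcl.push("draw");
                }
            }
        }
    }
}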

A related but different optimization is to avoid writing push and pop operations to the ptcl when the backdrop is non-zero (and there are also no path segments in the tile). Eliding the push is easy, but to elide the corresponding pop requires a stack to track what happened to the push. Since nesting depth is arbitrary, this would potentially require dynamic memory allocations for the stack, but I propose this efficient solution: a fixed size (32 element) stack of bools, represented as a single 4-byte word, and another scalar to track stack depth. When depth exceeds 32, the optimization is disabled (ie there is always a push operation).
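
A sketch of that fixed-size bool stack, just to illustrate the idea (again, not the actual shader code):

// 32-entry stack of bools packed into a single word, plus a depth counter.
struct PushedStack {
    bits: u32,
    depth: u32,
}

impl PushedStack {
    // Record whether a push was actually written to the ptcl.
    fn push(&mut self, emitted: bool) {
        if self.depth < 32 {
            self.bits |= (emitted as u32) << self.depth;
        }
        self.depth += 1;
    }

    // Returns whether the matching push was written (so the pop must be written too).
    // Past 32 levels the optimization is disabled, so a push was always written.
    fn pop(&mut self) -> bool {
        self.depth -= 1;
        if self.depth < 32 {
            (self.bits >> self.depth) & 1 != 0
        } else {
            true
        }
    }
}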

A further optimization would erase a "push clip" operation written to the ptcl tape if it is immediately followed by a "pop" with no intervening drawing operations.

Fine rasterization

The fact that clips can be nested arbitrarily creates significant challenges at fine rasterization time, as dynamic memory allocations are needed. Each pixel requires a clip/blend stack, consisting of an RGB(A) pixel and an alpha mask value.

Let's review what "push clip" and "pop" actually do. The "push clip" operation takes the current RGB(A) pixel value and pushes it on the stack, along with a computed alpha value from the clip mask. (The computation of the alpha value is essentially the same as the "fill" operation). If the blend is isolated, it resets the current RGBA pixel value to all-clear. In any case, the "pop" operation does a Porter-Duff blend of the current pixel value over the pixel value popped from the stack, modulated with the alpha value popped from the stack. (Note: it would also be useful to provide an additional opacity value in the "pop" element, which is simply multiplied by the mask alpha; this unlocks "blend group" functionality, which is not in the current Piet graphics model, but is extremely useful and should be added).

I propose using a windowed stack approach. A small stack window is provided in thread registers, and when it spills, it allocates (global) memory and copies the bottom of the stack to that memory. Similarly, when popping the stack and the local window is empty, it reads from memory. To allow truly unbounded stack sizes, use a linked list approach so each stack frame in memory contains a pointer (bytebuf offset) to the next frame.

Dynamic allocation of stack frames

All dynamic allocation so far in piet-gpu is done with atomic bump allocation. That's simple and effective, and in the first implementation stage we should just do that. However, with deeply nested clips over large areas, the amount of memory required may be significant; potentially multiples of the render surface. The fine rasterization clip stacks are temporary, so ideally the memory can be freed and then reused.

A general purpose malloc/free allocator running GPU-side would be quite complex. I propose an approach specialized to the expected allocation lifetime patterns for the clip stack. In particular, the allocator state is represented by a bitmap, with one bit for each frame (the frames are all the same size, which is a huge simplification compared with the general malloc case). Then there is a round robin atomic representing the current allocation slot. To allocate, the client does an atomicAdd on the slot atomic, followed by an atomicOr on the bitmap to set the bit. If the bit was already set, retry. Otherwise the allocation has succeeded. Freeing is simple, it's just an atomicAnd to reset the bit.
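
A CPU-side sketch of that allocator, using Rust atomics to stand in for the shader atomics (illustrative only):

use std::sync::atomic::{AtomicU32, Ordering};

// Fixed-size-frame allocator: one bit per frame, plus a round-robin cursor.
struct FrameAllocator {
    cursor: AtomicU32,
    bitmap: Vec<AtomicU32>,
}

impl FrameAllocator {
    fn alloc(&self) -> u32 {
        let n_frames = (self.bitmap.len() * 32) as u32;
        loop {
            // The cursor increment can be relaxed.
            let slot = self.cursor.fetch_add(1, Ordering::Relaxed) % n_frames;
            let word = (slot / 32) as usize;
            let bit = 1 << (slot % 32);
            // Acquire so a reused frame's prior contents are not observed.
            if (self.bitmap[word].fetch_or(bit, Ordering::Acquire) & bit) == 0 {
                return slot; // bit was clear: allocation succeeded
            }
            // Bit was already set: frame in use, retry with the next slot.
        }
    }

    fn free(&self, slot: u32) {
        let word = (slot / 32) as usize;
        let bit = 1 << (slot % 32);
        // Release so writes into the frame happen-before any later reuse.
        self.bitmap[word].fetch_and(!bit, Ordering::Release);
    }
}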

Do note that correct operation of this allocator requires respecting ordering constraints of the memory model; acquire on allocation and release on free (the increment of the slot atomic can be relaxed). There are similar concerns for the scan (prefix sum) used in element processing, and the code currently uses volatile as a sort of fake memory model to handle this. See Prefix sum on Vulkan for a more detailed discussion. A longer term, proper approach to this is to runtime-query the GPU driver for memory model support and switch to a shader that supports it if available.

Also note that this allocation strategy can deadlock (or perhaps "livelock" would be a better word) if memory is exhausted. There are many similar issues throughout the engine, and a proper approach would involve some combination of accurately estimating resource needs before launching compute kernels, or failing (taking care to free everything that's been allocated) and retrying, perhaps using some kind of scoreboard mechanism to track which tiles have succeeded and which need a retry.

Isolated blending

Above I mentioned the possibility of blending either being isolated or not. In the presence of complex blend modes, there is a (potentially significant) visual difference, and design tools such as Adobe Illustrator expose this choice to the user. It is also provided as an option in the W3C compositing spec.

If only the Porter-Duff "over" blend mode is supported, then in theory isolated and non-isolated blending are identical. One subtle difference is that if only "over" is used, and all blending is non-isolated, and the render target has an opaque background, then it is not actually required to store an alpha channel, and all blending can be done on RGB pixel values. It's not clear the savings are worth it, as memory traffic with chunked pixels is likely to be aligned to 4 byte boundaries, and the ALU savings of not having to blend the alpha channel are minimal.

A more compelling reason to favor non-isolated blending is that it can preserve RGB subpixel text rendering even in the presence of soft clipping and masking. See linebender/piet#289 for more discussion on this point, including significant limitations in Direct2D around subpixel rendering and clipping.

Since we don't currently support complex blend modes, and they're not in the Piet API, a reasonable strategy is to implement only non-isolated blending. When those blend modes are added, that would be a good time to revisit the strategy regarding RGB subpixel text rendering. (It's likely that on Windows and probably Linux as well, we'll want to use the system font renderer for text at normal UI sizes, to match the appearance; this assumption does depend on the intended usage, though).

Conflation artifacts

Unfortunately, soft (antialiased) clipping provides more opportunities to introduce conflation artifacts. In particular, applying the same clip mask twice in nested clips, while in theory idempotent, produces an effect not unlike applying gamma to the mask - in this simple example, it's basically squaring the alpha values.

I'm basically making the assumption that all alpha values should be handled according to Porter-Duff rules, meaning that in the case of nested clips all the alpha values are multiplied. Other renderers (Pathfinder in particular) might use a different heuristic, but I think these other approaches are fragile; for example, they're not appropriate when opacity is set on a blend group.

Most of the academic renderers (MPVG, Li et al, etc) avoid conflation artifacts by computing the equivalent of a higher resolution image with hard path boundaries (including clip masks), then doing reconstruction filtering to achieve antialiasing. Ultimately, I think piet-gpu should have these kinds of rendering modes as well. One consideration is that the memory bandwidth required for the fine rasterization clip/blend stack might increase. Certainly, supporting these modes would be another issue, but I mention them here because there might be an argument for doing something other than Porter-Duff for mask compositing.

Future: GPU-side bbox accounting

Lastly, I've given some thought to the problem of computing bboxes for nested clips. While my blog post on the stack monoid focuses on parsing use cases, I think the central idea (in particular, the use of a windowed stack) would also apply to the stack of bounding boxes needed for nested clips.

I think the main challenge to implementing such a thing efficiently is finding a sparser approach. Currently, element processing computes the scan of an almost-monoid for each element in the input. That's already a bit wasteful because only a tiny fraction of the input elements are expected to be transforms. The situation is no doubt considerably worse for handling nested clips, because the stack monoid operation is expected to be even more expensive than affine transform multiplication (which is just a bunch of multiply-add operations, a sweet spot for the GPU).

Here's the sketch of an idea: the element processing stage doesn't do the clip stack processing, but rather does a filtering stage with the output being a compacted sequence of partially-processed elements. All clip push and pop operations are preserved in this output, and sequences of drawing operations (with no intervening clip stack operations) may be aggregated into a single element with the union of the bboxes. Then, a scan is run on this compacted sequence; a significant fraction of the input will indeed be stack ops, so even if the stack monoid is expensive to compute, its amortized cost will be low.

Obviously, such a thing is tricky to get right, so I think it should be done in a future cycle. But I mention it now because the original vision of piet-gpu doing essentially all the work GPU-side is still valid, and I think it's possible to get there, but realistically only after getting an implementation of the full imaging model (including nested clips) in place using simpler techniques.

Conflation artifacts

The current piet-gpu rendering strategy uses exact-area pixel coverage calculations to derive an antialiased alpha mask, followed by Porter-Duff style compositing. This strategy has some definite advantages, notably very smooth rendering of shapes (especially compared to supersampling approaches with a small sampling factor), but is also susceptible to so-called "conflation artifacts." An excellent resource on this topic is GPU-accelerated Path Rendering, from which this terminology is taken. Quoting from section 4.1.2 of that paper, "Conflation is particularly noticeable when two opaque paths exactly seam at a shared boundary."

Conflation artifacts are especially a problem for Flash rendering; see ruffle-rs/ruffle#26 for one such discussion. While the most common source of such artifacts is compositing multiple shapes, they can also occur within a shape. For example, the SVG path string "M0 0L0 4 4 4 0 0 4 0 4 4z" can render with a gap along the diagonal, as the winding number is of opposite signs on the lower and upper triangles. I believe (but do not have a link handy) that Skia goes to great lengths to avoid conflation artifacts within a path, but does not avoid them during compositing, as among other things the HTML5 Canvas drawing model seems to require that the final render is consistent with alpha compositing individual antialiased elements.

There are other applications for which avoiding conflation artifacts would be an improvement, including Flutter, SVG rendering, Export from InDesign, clipping in WPF and no doubt many others.

Many academic renderers, including MPVG, Li et al 2016, and to some extent RAVG deal with conflation artifacts by doing some form of supersampling, as does the NV_path_rendering work cited above (NV_path_rendering uses GPU hardware MSAA, and so is limited by the sampling factor provided by the hardware, especially a problem on mobile and low-tier GPUs).

I believe it is possible to render without conflation artifacts in the basic piet-gpu architecture, with some changes to the fine rasterization stage. Following is a proposal.

Fine rasterization proceeds in two phases. The first phase is a straightforward non-AA render, but which also accumulates a bit per pixel indicating whether any path edge (fill, stroke, or clip) crosses the pixel in a way that requires antialiasing. A good way to get that bit is conservative rasterization of path segments, similar to what's already done in the coarse rasterization stage to determine which tiles intersect the given path segment.

After the first phase, the bits are compacted (using prefix sum of the bit count within the workgroup, possibly also subgroup accelerated). A pixel is assigned to a group of threads, with the number of threads equal to the multisampling factor divided by the number of samples per thread (similar to CHUNK in k4.comp today). That loop then runs a variable number of times, depending on how many pixels need re-rendering vs the number of threads in the workgroup.
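
As a CPU-side illustration of the compaction step (on the GPU this would be a workgroup prefix sum rather than a serial loop):

// Collect the indices of pixels whose "needs AA" bit is set, preserving order.
fn compact_needs_aa(bits: &[bool]) -> Vec<usize> {
    bits.iter()
        .enumerate()
        .filter_map(|(i, &needs_aa)| if needs_aa { Some(i) } else { None })
        .collect()
}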

For each such loop, the workgroup iterates over the per-tile command list again, each thread computing samples for at most a single pixel. If there are multiple threads per pixel (ie the multisampling factor exceeds CHUNK), then texture reads can be coalesced, either by relying on the hardware to do so, or manually using either threadgroup shared storage or subgroups to distribute the texture read across the threads. At the end of the loop over ptcl commands, the samples are summed and the average written to the output image (again using shared memory or subgroups if there is more than one thread per pixel).

Note that to avoid artifacts during clipping (and compositing in general), the blend stack (see #36) needs to have a value per sample, which is potentially a pretty high bandwidth.

Some more potential refinements. I think font rendering can still best be done with the exact-area approach; in the longer term evolution of piet-gpu, rendered glyphs would be cached in an atlas, so would be texture reads anyway. There may be a tiny quality improvement by avoiding conflation artifacts in glyph rendering, but only when the outlines have winding numbers other than 0 or 1, which basically means overlapping subpaths. Also, the multisampling factor needs to be quite high for good glyph rendering, otherwise it's a quality regression from the exact-area case.

Also, the calculation of the "needs AA" bits can be done cleverly, so that edges exactly aligned to pixel boundaries need not count. This covers the case of clipping and filling of rectangles aligned to the pixel grid, which can accelerate common cases in UI.

Another appealing aspect of this approach: the supersampling can be done in a linear-intensity space, improving antialiasing quality, even if alpha compositing is done in sRGB for compatibility reasons (see the RAVG paper for more discussion of this issue).

Further, doing the equivalent to stem darkening on filled paths becomes straightforward, by doing a distance-field stroke of each path segment, and OR-ing that to the contribution from the summed signed winding numbers from the segments.

There are other ways to address the details (see the papers cited above), but overall I'm hopeful that the specific approach I've outlined would work well as a compute shader within the piet-gpu architecture.

Bevy and winit examples won't run on macOS Monterey 12.6.2 on a MacBook Pro (Retina, 15-inch, Mid 2015)

This was the output on running cargo run -p with_winit

warning: `vello` (lib) generated 10 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.26s
     Running `target/debug/with_winit`
thread 'main' panicked at 'wgpu error: Validation Error

Caused by:
    In Device::create_compute_pipeline
    Internal error: new_compute_pipeline_state: "Compiler encountered an internal error"

', /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/backend/direct.rs:2403:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: wgpu::backend::direct::default_error_handler
             at /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/backend/direct.rs:2403:5
   3: core::ops::function::Fn::call
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:78:5
   4: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/alloc/src/boxed.rs:2001:9
   5: wgpu::backend::direct::ErrorSinkRaw::handle_error
             at /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/backend/direct.rs:2389:17
   6: wgpu::backend::direct::Context::handle_error
             at /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/backend/direct.rs:254:9
   7: <wgpu::backend::direct::Context as wgpu::Context>::device_create_compute_pipeline
             at /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/backend/direct.rs:1462:13
   8: wgpu::Device::create_compute_pipeline
             at /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/lib.rs:2202:17
   9: vello::engine::Engine::add_shader
             at ./src/engine.rs:215:24
  10: vello::shaders::full_shaders
             at ./src/shaders.rs:138:26
  11: vello::Renderer::new
             at ./src/lib.rs:57:23
  12: with_winit::run::{{closure}}
             at ./examples/with_winit/src/main.rs:57:24
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/future/mod.rs:91:19
  14: pollster::block_on
             at /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/pollster-0.2.5/src/lib.rs:125:15
  15: with_winit::main
             at ./examples/with_winit/src/main.rs:213:9
  16: core::ops::function::FnOnce::call_once
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

with -p with_bevy it was

warning: `vello` (lib) generated 10 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.62s
     Running `target/debug/with_bevy`
2023-01-16T10:37:25.132773Z  INFO bevy_render::renderer: AdapterInfo { name: "Intel Iris Pro Graphics", vendor: 0, device: 0, device_type: IntegratedGpu, driver: "", driver_info: "", backend: Metal }
2023-01-16T10:37:25.477810Z ERROR wgpu::backend::direct: Handling wgpu errors as fatal by default    
thread 'main' panicked at 'wgpu error: Validation Error

Caused by:
    In Device::create_compute_pipeline
    Internal error: new_compute_pipeline_state: "Compiler encountered an internal error"

', /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.14.2/src/backend/direct.rs:2403:5

...and then it continued the same as above.

bitmap and gradients support

The pipeline supports uniform colors. I'd like to work on expanding that to accept bitmaps, and later on gradients. If you have any thoughts on the design, let me know. If not, I'll come up with something to start the discussion.

DX12 portability polish

In working on #123, I'm running into some of the unfinished work on the DX12 port. This issue captures what's to be done to make it work.

The biggest issue is that DX12 makes a distinction in the descriptor set between readonly (SRV) and read-write (UAV) buffers, and the create_simple_compute_pipeline signature is not powerful enough to reflect that. The wgpu syntax feels too verbose, but just num_buffers is too simplistic. As a compromise, I propose a slice of BindType which at first is just an enum of Buffer, BufReadOnly, Image to cover today's cases. I should probably add Uniform and Texture (or Sampler or SampledTexture or some combination) to round out the basic set. The enum can expand in the future to handle descriptor arrays and other fancy stuff, as needed. I realize this is almost certainly inadequate for general graphics use; I am deliberately trying to simplify the story for compute shaders.
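
A sketch of what that enum might look like, using the variant names mentioned above (illustrative, not a final API):

// Sketch of the proposed BindType; the enum can grow as needed.
pub enum BindType {
    Buffer,      // read/write storage buffer (UAV in DX12 terms)
    BufReadOnly, // read-only storage buffer (SRV)
    Image,       // storage image
    Uniform,     // uniform buffer, one of the likely additions mentioned above
}

The pipeline creation signature would then take a &[BindType] slice in place of num_buffers.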

The clear_buffer method is NYI. I think the best way to do this is ClearUnorderedAccessViewUint but this requires a descriptor. That's the tip of a general descriptor rework iceberg - right now, we're creating the descriptors late, at descriptor set build time (and with other inefficiencies, like a new heap per descriptor set), but I think DX12 practice is generally to create descriptors along with the resources, and copy them as needed. I think a relatively simple change can fix this, and will no doubt improve performance generally. (Longer term work is to integrate gpu-descriptor or something like it so that there's a small number of descriptor heaps and everything is suballocated from those)

Related to this is a better approach to copying between buffers and images; in general a compute shader is needed, I think. However, this is not critical path for the current testing focus, so will probably be done later.

Shaders are currently compiled with D3DCompile, which is FXC. That's failing to compile the current version of the prefix sum code (see #121 for discussion). I think with our ahead-of-time strategy for shader compilation, the best strategy is to switch to DXC at build time (ie in the shader build.ninja). There are some tradeoffs and it may impact ergonomics, but my gut feeling is that it's better to jump into that pool rather than continue to fight FXC limitations (note: DXC is also critical-path for things like subgroups).

iOS: fails to create compute pipeline

I created a simple iOS Objective-C test app using the Xcode default template and brought in the vello/winit_demo as-is, but as a static lib. The winit_demo lib exports a public start_app "C" function, which calls winit_demo/main, and the dummy iOS app has its main.m call the exported start_app.

When deploying the vello test app to iPad (M1), two issues are encountered:

  1. Surface creation panic with default device limits
    a) Limits::default() fails with one buffer size check.
    b) Limits::downlevel_defaults() passes the size check but panics too, as the shader group limit is 4 while vello needs 6.
    addressed with:
    let mut limits = adapter.limits();

(NOTE: same issue with base winit/wgpu triangle sample code. deployed successfully to iPad after using adapter limits.)

  2. Compute pipeline creation issue:
   In Device::create_compute_pipeline
    Internal error: Metal: program_source:460:29: warning: unused variable '_e379'
                        int _e379 = uint(tile_ix) < 1 + (_buffer_sizes.size6 - 0 - 8) / 8 ? metal::atomic_fetch_add_explicit(&tiles[tile_ix].backdrop, backdrop, metal::memory_order_relaxed) : DefaultConstructible();
                            ^
program_source:513:70: error: cannot take the address of an rvalue of type 'metal::uint' (aka 'unsigned int')
                        uint _e450 = metal::atomic_exchange_explicit(&uint(tile_ix_1) < 1 + (_buffer_sizes.size6 - 0 - 8) / 8 ? tiles[tile_ix_1].segments : DefaultConstructible(), _e447, metal::memory_order_relaxed);
                                                                     ^~~~~~~~~~~~~~~~

Needs advice on how to approach this.

(NOTE: the winit_demo sample code works on M1 Macbook pro)

Regards,
Dan

Path intersection (aka SVG clipPath)

I'd like to generalize path+fill to stack-of-paths+fills such that fills only cover the intersection of the paths in a stack. Would that be feasible with the current architecture? Would you be interested in having that work in piet-gpu?

An efficient algorithm that converts intersecting paths into a single path would be a nice alternative, in which case clipPath would be a pre-process step.

DX12 validation fixes

This is something of a followup to #95. Also, #138 runs reasonably portably but is showing some problems on DX12. Since the logic is good enough to run on Vulkan and Metal, I'm making a separate issue for the DX12-specific issues, most of which are at the HAL level. Overall even on DX12 it runs in release (at least on my hardware), but prints validation warnings and hangs in debug mode.

One problem is validation warnings. Actually a sub-issue is that detailed validation warnings are available when run under Visual Studio, but less so when just run on the command line. It should be possible to capture and print those too, probably just missing the proper intercept calls. These validation warnings show that resources are in the incorrect state. A lot of that is due to binding buffers either as UAV or SRV depending on whether the shader source marks them as "readonly." The simplest workaround is to always bind as UAV, for which there is a spirv-cross flag (--hlsl-force-storage-buffer-as-uav). In addition, some of the transfer commands seem to put the buffer in particular states. Again, the simplest thing is to add transition barriers to take them back to common. As future work, we might do more fine-grained tracking and auto-generate the more precise barriers. I think asking users of the HAL to write precise barriers themselves is probably asking too much.

A deeper and more serious problem is coarse.comp hanging. I've started to investigate this but don't have a full answer. That shader has several loops to consume the full binned input. Crudely breaking three of those loops makes the shader terminate (with incomplete output, but the rest of the pipeline runs). It's not obvious to me where the problem is; it's possible the input is wrong, making this stage fail to terminate. It's possible the input is right but there is a logic error, but one that only manifests as hanging under shader validation. Or it's possible it's correct but shader validation is buggy. In any case, this is the issue to track the failure.

Investigate tight ptcl allocation

The fine raster stage reads commands (ptcl) sequentially, so there should be no penalty from sizing the Cmd enum dynamically. If we add a piet-gpu-types feature where each Cmd only takes as much space as its variant occupies, then it can probably slightly reduce memory traffic.

That said, in the current state most Cmd variants are similarly sized so this probably isn't going to make a big difference.

elements.comp scene traversal may leave gaps in paths

The fill algorithm implemented by path_coarse.comp assumes that closed paths contain no gaps. Unfortunately, the non-deterministic scene traversal in elements.comp may introduce gaps in otherwise gap-less paths.

Consider two fill path segments through endpoints A, B, C in a scene fragment with three preceding transforms. The segments are initialized with the identity transform, I.

T₁·T₂·T₃·S(I, A, B)·S(I, B, C)

Since scene element operations are associative, elements.comp evaluates them in a GPU-friendly but non-deterministic order. In particular, it may use the ordering

S((T₁·T₂)·T₃, A, B)·S(T₁·(T₂·T₃), B, C)

Because of numerical imprecision, the path segments may end up with slightly different transformations,

S(M₁, A, B)·S(M₂, B, C)

Finally, M₁·B is not in general equal to M₂·B, leading to a gap in the path that includes the two segments.

To verify my claim, I made the following change,

diff --git gpu/shaders/elements.comp gpu/shaders/elements.comp
index a43c270..b5f2342 100644
--- gpu/shaders/elements.comp
+++ gpu/shaders/elements.comp
@@ -259,34 +259,8 @@ void main() {
                     State their_prefix = State_read(state_prefix_ref(look_back_ix));
                     exclusive = combine_state(their_prefix, exclusive);
                     break;
-                } else if (flag == FLAG_AGGREGATE_READY) {
-                    their_agg = State_read(state_aggregate_ref(look_back_ix));
-                    exclusive = combine_state(their_agg, exclusive);
-                    look_back_ix--;
-                    their_ix = 0;
-                    continue;
-                }
+                } 
                 // else spin
-
-                // Unfortunately there's no guarantee of forward progress of other
-                // workgroups, so compute a bit of the aggregate before trying again.
-                // In the worst case, spinning stops when the aggregate is complete.
-                ElementRef ref = ElementRef((look_back_ix * PARTITION_SIZE + their_ix) * Element_size);
-                State s = map_element(ref);
-                if (their_ix == 0) {
-                    their_agg = s;
-                } else {
-                    their_agg = combine_state(their_agg, s);
-                }
-                their_ix++;
-                if (their_ix == PARTITION_SIZE) {
-                    exclusive = combine_state(their_agg, exclusive);
-                    if (look_back_ix == 0) {
-                        break;
-                    }
-                    look_back_ix--;
-                    their_ix = 0;
-                }
             }
 
             // step 5 of paper: compute inclusive prefix

which removes the non-deterministic evaluation order. As expected, the path gaps disappeared.

I don't have a good fix yet. It's possible that the endpoints can be propagated in the scene traversal, at the cost of state memory. Perhaps a PathSeparator scene element should be added to delimit closed paths. It may also be a good idea to make path scene segments have implicit start points, saving a bit of memory.

High memory usage

Hi, all

cargo run --bin winit

winit.exe uses more than 200 MB of memory. Is that expected? Is there any way to reduce memory usage? Thanks.

Memory for blend state

In this issue is the design that will hopefully resolve issues with the blend state, mostly #83 but it's also come up in #155 and possibly other issues.

There are several parts. I'll work backward from fine rasterization, as that may be a clearer presentation.

In fine rasterization, the blend stack is split into two segments. The first segment is a small array in vector registers (ie function scope). The exact size of that array is a tunable parameter, small enough it doesn't cause a spill to scratch, large enough to accommodate "most" blending. Also note, this parameter interacts with CHUNK (the number of pixels rendered by a single thread). My gut feeling is that reducing CHUNK to 4 and an array size of 4 will be close to the sweet spot.

The second segment is a region of memory in a separate read/write blend stack binding. This way, the memory (ptcl) binding can be read-only, which appears to be important on at least some hardware (Pixel 4, see #83). The offset of that region is written into the ptcl by coarse rasterization.

The logic to select between the two is a simple clip_depth < BLEND_STACK_SIZE predicate. I've considered more complex logic, something like a stack window, but I think keeping it simple is a win.
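
A sketch of that selection (illustrative names; the real logic lives in the fine rasterization shader):

const BLEND_STACK_SIZE: usize = 4; // the tunable small-array size discussed above

// Store a blend-stack entry either in the register-resident array or in the
// spill region whose offset coarse rasterization wrote into the ptcl.
fn store_blend(
    local: &mut [u32; BLEND_STACK_SIZE],
    spill: &mut [u32],
    clip_depth: usize,
    value: u32,
) {
    if clip_depth < BLEND_STACK_SIZE {
        local[clip_depth] = value;
    } else {
        spill[clip_depth - BLEND_STACK_SIZE] = value;
    }
}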

In coarse rasterization, we reinstate the logic by Elias to compute maximum blend stack depth. However, we subtract off the small array size and only allocate if it's above that. Note that this lifts a hard limit on blend depth, but at the same time introduces a new requirement for memory allocation. Obviously the blend stack buffer needs to be large enough to hold all allocations. We'll have another architecture issue coming up soon with a more detailed plan for that. For the time being, we'll come up with a number for the size of that buffer, cross our fingers, and hope it's big enough.

One more point, addressing approaches considered and rejected. It's disappointing that memory used by the blend stack can't be recovered after a workgroup does its rendering. I thought about a malloc/free approach, but there are a number of problems. One is that free requires release semantics and malloc requires acquire, if memory is to be reused, and this barrier is not available on current Metal. Another is that while it's very likely that the number of concurrently resident workgroups is much smaller than the total, it is very difficult to quantify that in a reliable, portable way. Thus, a conservative or worst-case estimate for the required allocation may not be significantly smaller than adding up the total allocation, and the difference in complexity is significant.

Dragging vello windows is laggy

When attempting to drag a window that is rendering using vello by grabbing its titlebar with the mouse, there is a noticeable lag (I would guess at least 500 ms) before the window starts to move. This does not happen with any other apps on my system, so I think vello is the common factor here.

Tested on:

  • macOS 13.0.1 (M1)

This applies to all uses of Vello I have found including:

  • The examples in this repo
  • The Xilem demo
  • Examples in the Blitz repo

Variable size scene encoding

Right now, the main scene encoding is a sequence of fixed size draw objects. This has been the case for a while, but more recently path segments and a couple other things moved out of the scene encoding. In addition, there's a stream of Annotated objects, also fixed size, with a 1:1 correspondence to the Element objects in the scene encoding.

It would be better for it to be fully variable size, I think. That would allow larger objects (gradients come to mind) without concern for bloating the encoding of other objects that may be more compact. Here's roughly what's involved.

The main step to unlock variable size encoding is to add two more prefix sums to the draw monoid: these count byte (or possibly u32) offsets within the element stream, and also a new variable size object to take the place of "annotated." Most of the functionality of Annotated moves to other places. The bounding box gets its own stream (this is generally a move towards SoA). Anything currently just copied from Element is replaced by a read from the scene buffer (there's no value in copying it; this was originally seen as a simplification, so the memory read patterns were more uniform). The main thing that remains is the line equation for linear gradients (and a similar computation for radial gradients, when that's done).
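
For illustration, the extended monoid might look roughly like this (field names are mine, not a spec):

// Sketch of a draw monoid carrying the two extra prefix sums described above.
#[derive(Clone, Copy, Default)]
struct DrawMonoid {
    // Existing counters (e.g. the draw object index) elided for brevity.
    scene_offset: u32, // running byte (or u32) offset into the element stream
    info_offset: u32,  // running offset into the stream replacing "annotated"
}

impl DrawMonoid {
    fn combine(self, other: Self) -> Self {
        DrawMonoid {
            scene_offset: self.scene_offset + other.scene_offset,
            info_offset: self.info_offset + other.info_offset,
        }
    }
}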

There's nothing particularly deep about these changes. They follow the same general pattern as the element pipeline changes. There should also be cleanup involved: the Rust-side encoding will, I think, no longer use piet-gpu-derive at all, and in general that could be removed.

SLOW Render kernel time

I originally thought #161 was the cause of the slowness, but apparently not.
Even after the #162 PR, the render kernel stage is very slow; I'm unsure why.

GhostScript Tiger, GeForce 1060 3GB:

parsing time: 19.066854ms
flattening and encoding time: 1.534247ms
elapsed = 162.079181ms
Element kernel time: 4.987ms
Tile allocation kernel time: 1.812ms
Coarse path kernel time: 2.253ms
Backdrop kernel time: 1.775ms
Binning kernel time: 0.080ms
Coarse raster kernel time: 10.996ms
Render kernel time: 134.673ms

Instance without surface + small buffer errors

use piet::RenderContext;

fn ctx_to_file(
    ctx: &mut piet_gpu::PietGpuRenderContext,
    width: usize,
    height: usize,
    path: &std::path::Path,
) {
    let mut img_data: Vec<u8> = Default::default();
    unsafe {
        // this part is copied from bin/cli.rs
        let (instance, _) = piet_gpu_hal::Instance::new(None, Default::default()).unwrap();
        let device = instance.device(None).unwrap();
        let session = piet_gpu_hal::Session::new(device);
        let mut renderer = piet_gpu::Renderer::new(&session, width, height, 1).unwrap();
        renderer.upload_render_ctx(ctx, 0).unwrap();

        let mut cmd_buf = session.cmd_buf().unwrap();
        let query_pool = session.create_query_pool(8).unwrap();
        let image_usage = piet_gpu_hal::BufferUsage::MAP_READ | piet_gpu_hal::BufferUsage::COPY_DST;
        let image_buf = session
            .create_buffer((width * height * 4) as u64, image_usage)
            .unwrap();
        cmd_buf.begin();
        renderer.record(&mut cmd_buf, &query_pool, 0);
        cmd_buf.copy_image_to_buffer(&renderer.image_dev, &image_buf);
        cmd_buf.finish_timestamps(&query_pool);
        cmd_buf.host_barrier();
        cmd_buf.finish();
        let submitted = session.run_cmd_buf(cmd_buf, &[], &[]).unwrap();
        submitted.wait().unwrap();
        session.fetch_query_pool(&query_pool).unwrap();
        image_buf.read(&mut img_data).unwrap();
    }

    // Write image as PNG file.
    let file = std::fs::File::create(path).unwrap();
    let ref mut w = std::io::BufWriter::new(file);
    let mut encoder = png::Encoder::new(w, width as u32, height as u32);
    encoder.set_color(png::ColorType::Rgba);
    encoder.set_depth(png::BitDepth::Eight);
    let mut writer = encoder.write_header().unwrap();
    writer.write_image_data(&img_data).unwrap();
}

fn main() {
    let mut ctx = piet_gpu::PietGpuRenderContext::new();
    let size = (200, 200);
    ctx.fill(
        piet::kurbo::Rect::new(0., 0., size.0 as f64, size.1 as f64),
        &piet::Color::WHITE,
    );
    ctx.stroke(
        piet::kurbo::Line::new((0., 50.0), (100.0, 50.0)),
        &piet::Color::rgb8(0, 255, 0),
        2.0,
    );
    ctx.finish().unwrap();
    ctx_to_file(&mut ctx, size.0, size.1, std::path::Path::new("gpu.png"));
}

gpu

Now increase (width, height) to (2048, 1600) and it produces the correct result (image is truncated):
image

This can be also reproduced in piet-gpu/bin/cli.rs. Reducing the size

const WIDTH: usize = 2048;
const HEIGHT: usize = 1536;

to a smaller size like (200,200) results in ERROR_DEVICE_LOST. I am on linux, nvidia 510.60.

Garbled output with reschart

For some reason, piet-gpu seems to give garbled rendering for the reschart test. The test svg is attached below, and it consists of only typical paths and strokes, which piet-gpu should have no problem handling. (It does use a transform to flip the image upside down, but piet-gpu simply ignores it, so that's unrelated to the issue.)

(File taken from MPVG test suite, renamed to txt for uploading to GitHub)

reschart.svg.txt

piet-gpu rendering (cargo run --release --bin cli -- reschart.svg --scale 2):

image

Looks like backdrop calculation is all over the place, but I'm not sure what is causing that.

SVG/Fill render black squares on macOS

I'm trying to render the tiger SVG on macOS/Metal and I get some strange results;
black squares also appear inside rect fills.

Is this due to the Metal API?
Is there a way to use Vulkan/MoltenVK to render on macOS?

Thanks
IMG_3524
IMG_8823
IMG_0038

Cross-platform runtime

This is a tracking issue for work in progress. The goal is to get piet-gpu running on DX12 and Metal as well as Vulkan, as well as to make it a suitable target for the upcoming "compute shader 101" class (an alternative to wgpu, which will probably also be supported - having both in good shape would be a nice problem to have). I also want to improve some of the basic infrastructure, which is quite hacky in places.

Work is required across the entire piet-gpu-hal subcrate, with minor changes expected in the clients. The current architecture of piet-gpu-hal roughly follows gfx-hal and wgpu, but I will adapt it a bit.

There are PR's in progress for both DX12 (#80) and Metal (#93). However, I'm shortly going to need to make changes that are much less localized to a particular backend. I'll probably land #94 then do the bulk of the work in #80 - the goal is to reduce friction, rather than have things in clean commits.

There are some design decisions coming up. One is how extensive the polyfill should be in the lower (piet_gpu::Device) layer. If you look at buffer<->image copies in gfx-hal's dx12 backend, they do a fairly elaborate dance to split the copy into individual rows when the stride doesn't agree with DX12's 256 byte alignment. I'm much more inclined to run a compute shader instead. I'm also inclined to just make the lower level operation fail and have the hub (roughly equivalent to wgpu) schedule the work.

Another major area of work involves staging buffers, especially creation of buffers with initial contents. WebGPU has a clever approach to that problem: you can create a buffer as mapped on creation, even if it's not host-visible. When running on discrete GPU, what happens under the hood is the creation of a staging buffer and a copy to the real buffer at some point after buffer creation but before first use. I'm inclined to do something similar, but it will bring in a fair amount more resource tracking at the hub layer. On Metal, this should probably map to newBufferWithBytes: though there is a debate whether it's better to let Metal handle this or do our own layer and treat Metal as just a nonstandard syntax for Vulkan.

(there's more possible complexity, including a Vec<u8> like interface to the uninitialized buffer slice, as well as choosing whether to use a staging buffer on readback, but I think I will defer that)

I will also need to beef up the shader compilation story, as it will bring in spirv-cross, and dxc if I want to use SM6. I plan to keep using ninja for now, though the build files will become considerably more tedious. A better tool (possibly using hassle-rs, and almost definitely using naga when it's more mature) is possible, but I think I'll accept a bit of tedium for now.

One thing I could do as part of this cycle is real memory allocation - currently I do an API alloc for each resource, the equivalent of CreateCommittedResource. But that's a performance concern only and shouldn't affect the external interface, so I'm inclined to defer it. I'm looking at both gpu-alloc and gpu-allocator and haven't decided yet.

Slow fine rasterization (k4) on Adreno 640

I'm getting piet-gpu running on Android (see #82 for snapshot), and running into fine rasterization being considerably slower than expected. The other stages in the pipeline seem fine. I've done some investigation but the fundamental cause remains a mystery.

Info so far: the Adreno 640 (Pixel 4) has a default subgroup size of 64 (though it can also be set to 128 using the vk subgroup size control). That should be fine, as it means that the memory read operations from the per-tile command list are amortized over a significant number of pixels even if CHUNK (the number of pixels written per invocation of kernel4.comp) is 1 or 2. If CHUNK is 4, then the workgroup and subgroup sizes are the same; any larger value results in a partially filled subgroup. There's more detail about the Adreno 6xx on the freedreno wiki.

There are some experiments that move the needle. Perhaps the most interesting is that commenting out the body of Cmd_BeginClip and Cmd_EndClip at least doubles the speed. This is true even when the clip commands are completely absent from the workload. My working hypothesis is that register pressure from accommodating the clip stack and other state is reducing the occupancy.

Another interesting set of experiments involves adding a per-pixel ALU load. The amount of time taken increases significantly with CHUNK. I'm not sure how to interpret that yet. Tweaking synthetic workloads like this may well be the best way to move forward, though I'd love to be able to see the asm from the shader compilation. I'm looking into the possibility of running this workload on the same (or similar) hardware but with a free driver such as freedreno, so that I might be able to gain more insight that way.

I've been using Android GPU Inspector (make sure to use at least 1.1) but so far it only gives me fairly crude metrics - things like % ALU capacity and write bandwidth scale with how much work gets done, and other metrics like % shaders busy and GPU % Utilization are high in all cases.

There are other things I've been able to rule out: a failure of loop unrolling by the shader compiler, and a failure to account ptcl reads as dynamically uniform (I manually had one invocation read and broadcast the results, yielding no change).

I do have some ideas how to make things faster (including moving as much of the math as possible to f16), but the first order of business is understanding why it's slow, especially when we don't seem to be seeing similar problems on desktop GPUs.

text layout and rendering support

piet-gpu's master, with PietGpuTextLayout, only has placeholder layout support and limited/incomprehensible rendering support in its draw_text method

PR #190 currently removes PietGpuTextLayout and adds the even more limited and incomprehensible SimpleText

@dfrg, @raphlinus: I'd like to know what your plans are; I want to test piet-gpu but there's not much to see if I can't even do

text.max_width = width;
let layout = text.layout(my_string);
self.height = layout.height;
renderer.add_layout(layout, top_left_corner);

wgpu has wgpu-glyph (powered by glyph-brush (powered by ab-glyph)) and glyphon (my favorite, powered by fontdue)

Hangs on coarse path on Nvidia 1060

This is branched from #106 , which observed the issue and attempted to fix it, but the patch didn't contain a full analysis.

The problem is that on some hardware (including the Nvidia 1060, at least with a recent driver), the n_out loop never terminates. I've got instrumentation on the shader now, and one of the things that's most obviously wrong is a NaN val coming out of estimate_subdiv. That this can happen when cross is zero (or nearly so) is not surprising, looking at the code.

I've tried a simple patch to that function that early-outs when cross is small, but that's not fixing all the looping. I'll do a deeper analysis and then propose a patch, but I did want to update based on my results so far.
