Comments (4)
Here's a much more detailed trace:
So it looks more like "death by a thousand papercuts" rather than one operation being slow:
- methodinstance taking 1us: partly a profiler artifact, as it takes only 300ns outside of NSight. This has been improved on 1.11, only taking 50ns there.
- cudaconvert being surprisingly slow (1us), but also executing twice (once as part of @cuda in order to determine the argument types to compile for, and once during the actual launch). This used to be optimized away; I'm not sure why it isn't anymore, but we can probably work around this by only doing cudaconvert once and having a way to pass converted arguments to the compiled kernel.
- cache accessors (compiler_config, compiler_cache, cached_compilation): these all have to take locks for the purpose of safe multi-threaded execution. Not sure how to optimize these.
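One way the cudaconvert duplication can be worked around from user code (a hedged sketch, not how CUDA.jl handles it internally) is to compile the kernel once with launch=false and reuse the returned kernel object across launches, so repeated launches skip the compilation path entirely:

```julia
using CUDA

# Hypothetical toy kernel, purely for illustration.
function double!(a)
    i = threadIdx().x
    a[i] *= 2
    return
end

a_d = CuArray(ones(Float32, 256))

# Compile once without launching; `k` holds the compiled kernel.
k = @cuda launch=false double!(a_d)

# Later launches reuse the compiled code, paying only
# argument conversion and the launch itself.
k(a_d; threads=256)
```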
Bottom line, the latency is much better on my system, but I agree things can be improved.
I also want to emphasize once more that NSight makes these look much worse. Where each iteration takes around 50us in the profile trace, in a non-profiled session an iteration only takes 30us.
from cuda.jl.
FWIW, I'm seeing more reasonable timings: 30us between copy and launch, of which around 10us is taken by cached_compilation
Still a lot of latency, and the timings surprise me, because the functionality in question has been thoroughly micro-optimized:
julia> cache = CUDA.compiler_cache(context());
julia> src = GPUCompiler.methodinstance(typeof(identity), Tuple{Nothing});
julia> cfg = CUDA.compiler_config(device());
julia> @benchmark GPUCompiler.cached_compilation(cache, src, cfg, CUDA.compile, CUDA.link)
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
Range (min … max): 159.770 ns … 322.471 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 183.145 ns ┊ GC (median): 0.00%
Time (mean ± σ): 182.211 ns ± 10.717 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂█▇▆▄
▂▅▆▃▂▂▂▂▂▂▁▂▂▂▁▂▂▃█████▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▂▂▂▂▂▁▁▁▁▂▂▂▂ ▃
160 ns Histogram: frequency by time 229 ns <
Memory estimate: 64 bytes, allocs estimate: 2.
Running the benchmark a couple of times (the profiler can make short measurements like this look weird) lowers the time of cached_compilation further, down to hundreds of nanoseconds.
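That warm-up effect can be checked directly; a sketch using the same accessors as the benchmark above, where the first call pays the compilation cost and subsequent calls should only pay the lock-protected cache lookup:

```julia
using CUDA, GPUCompiler

cache = CUDA.compiler_cache(context())
cfg = CUDA.compiler_config(device())
src = GPUCompiler.methodinstance(typeof(identity), Tuple{Nothing})

# First call compiles and populates the cache.
GPUCompiler.cached_compilation(cache, src, cfg, CUDA.compile, CUDA.link)

# Subsequent calls only hit the cache (hundreds of ns on this machine).
@time GPUCompiler.cached_compilation(cache, src, cfg, CUDA.compile, CUDA.link)
```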
Are you sure the cached_compilation latency isn't a profiling artifact on your end? I'm using an isolated copy+launch+copy sequence as follows:
using CUDA, NVTX

function kernel!(a)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    a[i] = CUDA.cos(a[i]) + i^0.6 + CUDA.tan(a[i])
    return
end

NVTX.@annotate function benchmark(a)
    threads = 256
    blocks = cld(length(a), threads)
    a_d = CuArray(a)
    @cuda threads=threads blocks=blocks kernel!(a_d)
    Array(a_d)
end

function main()
    n = 2^12
    a = rand(n)
    benchmark(a)
    benchmark(a)
    benchmark(a)
    return
end
I ran these benchmarks again, and you're right that the latency does go down when the profiler is run a few times.
I get delays of around 17 us before and after the kernel launch at the moment, so perhaps it is not too bad.
I also noticed you were focusing on the CUDA API row in Nsight Systems, whereas I usually look at the CUDA HW row in the timeline; maybe that was throwing me off too.
Thanks for looking into this!