Comments (8)
There is also cudaStreamBeginCapture, which turns the work submitted to a stream into a graph.
"Capture may not be initiated if stream is cudaStreamLegacy", which IIUC includes the default stream. (In any case, we might want to switch to the per-thread default stream.)
from cuda.jl.
BLAS has cublasSetStream, so this might require some work across the packages though.
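A rough sketch of what that might look like, assuming the CuArrays CUBLAS wrapper exposes its handle and the cublasSetStream_v2 binding (exact names are a guess and may differ between versions):

using CuArrays.CUBLAS
# point CUBLAS at the capturing stream, so that subsequent BLAS calls are
# recorded into the graph rather than issued on the default stream
CUBLAS.cublasSetStream_v2(CUBLAS.handle(), stream)
# ... BLAS work to capture ...
# restore the default stream afterwards
CUBLAS.cublasSetStream_v2(CUBLAS.handle(), CuDefaultStream())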
from cuda.jl.
The legacy default stream seems different from the regular default one?
https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html#stream-sync-behavior
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1ga53e8210837f039dd6434a3a4c3324aa
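For reference, these are sentinel handles in the driver API rather than real streams; a sketch of their values from cuda.h:

# cuda.h defines the special default-stream handles as sentinel values:
#   CU_STREAM_LEGACY     = ((CUstream)0x1)
#   CU_STREAM_PER_THREAD = ((CUstream)0x2)
const CU_STREAM_LEGACY     = CuStream_t(1)  # capture refuses this one
const CU_STREAM_PER_THREAD = CuStream_t(2)  # per-thread default stream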
from cuda.jl.
This will require the (inevitable) work of putting streams everywhere:
using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t
stream = CuStream()
@apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
A = cu(rand(2,2)) # implicitly uploads on the default stream
B = cu(rand(2,2))
ERROR: LoadError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream (code #906, ERROR_STREAM_CAPTURE_IMPLICIT)
Stacktrace:
[1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float32}, ::Int64, ::CuStream) at /home/tbesard/Julia/CUDAdrv/src/memory.jl:235
... which I wasn't planning on attempting in the near future.
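A minimal sketch of the workaround, using the same CUDAdrv Mem API as in the exploration below to upload explicitly on the capturing stream instead of relying on cu()'s implicit legacy-stream copy:

a = rand(Float32, 2, 2)
buf = Mem.alloc(a)                       # device buffer for a
Mem.upload!(buf, a, stream; async=true)  # async copy on the capturing stream
A = CuArray{Float32,2}(buf, size(a))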
from cuda.jl.
... which I wasn't planning on attempting in the near future.
The reason being that I haven't put enough thought into what the API should look like, and how it would be compatible with CUDA:
- like contexts: a global default one, plus do blocks to switch streams (a sketch follows after this list)
- or rather: put the stream in the array and thread it through everywhere
The question is also where this functionality should go, and how it should interact with foreign libraries.
It also requires figuring out which operations to make asynchronous, because AFAIK you can only do a synchronous cuMemcpyHtoD on the default stream. Maybe we should just make everything asynchronous.
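To make the first option concrete, a hypothetical sketch of the do-block shape, by analogy with contexts (none of these names exist yet):

# a global default stream, plus a scoped override
const _default_stream = Ref{CuStream}(CuDefaultStream())
default_stream() = _default_stream[]
function stream!(f, s::CuStream)
    old = _default_stream[]
    _default_stream[] = s
    try
        return f()
    finally
        _default_stream[] = old
    end
end

# usage: every operation in the block would target s
# stream!(s) do
#     # uploads, kernels, ...
# end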
from cuda.jl.
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/simpleCudaGraphs/simpleCudaGraphs.cu
from cuda.jl.
Some more exploration:
using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t, isvalid

# graph
const CuGraph_t = Ptr{Cvoid}
mutable struct CuGraph
    handle::CuGraph_t
    ctx::CuContext

    # capture everything f() submits to `stream` into a new graph
    function CuGraph(f::Function, stream::CuStream)
        handle_ref = Ref{CuGraph_t}()
        @apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
        f()
        @apicall(:cuStreamEndCapture, (CuStream_t, Ptr{CuGraph_t}), stream, handle_ref)

        ctx = CuCurrentContext()
        obj = new(handle_ref[], ctx)
        finalizer(unsafe_destroy!, obj)
        return obj
    end
end

function unsafe_destroy!(x::CuGraph)
    if isvalid(x.ctx)
        @apicall(:cuGraphDestroy, (CuGraph_t,), x)
    end
end

Base.unsafe_convert(::Type{CuGraph_t}, x::CuGraph) = x.handle

# graph node
const CuGraphNode_t = Ptr{Cvoid}

# graph execution
const CuGraphExec_t = Ptr{Cvoid}
function instantiate(graph::CuGraph)
    exec_ref = Ref{CuGraphExec_t}()
    error_node = Ref{CuGraphNode_t}()
    buflen = 256
    buf = Vector{Cchar}(undef, buflen)
    @apicall(:cuGraphInstantiate,
             (Ptr{CuGraphExec_t}, CuGraph_t, Ptr{CuGraphNode_t}, Ptr{Cchar}, Csize_t),
             exec_ref, graph, error_node, buf, buflen)
    return exec_ref[]
end

function launch(exec::CuGraphExec_t, stream::CuStream=CuDefaultStream())
    @apicall(:cuGraphLaunch, (CuGraphExec_t, CuStream_t), exec, stream)
end
launch(graph::CuGraph, stream::CuStream=CuDefaultStream()) =
    launch(instantiate(graph), stream)

# demo
stream = CuStream()
graph = CuGraph(stream) do
    dims = (3,4)

    a = rand(Float32, dims)
    #d_a = cu(a)    # would upload on the legacy stream and break capture
    buf_a = Mem.alloc(a)
    Mem.upload!(buf_a, a, stream; async=true)
    d_a = CuArray{Float32,2}(buf_a, dims)

    b = rand(Float32, dims)
    #d_b = cu(b)
    buf_b = Mem.alloc(b)
    Mem.upload!(buf_b, b, stream; async=true)
    d_b = CuArray{Float32,2}(buf_b, dims)

    c = rand(Float32, dims)
    #d_c = cu(c)
    buf_c = Mem.alloc(c)
    Mem.upload!(buf_c, c, stream; async=true)
    d_c = CuArray{Float32,2}(buf_c, dims)

    #d_out = similar(d_a)
    buf_out = Mem.alloc(b)
    d_out = CuArray{Float32,2}(buf_out, dims)

    # element-wise addition, writing into the third argument
    function vadd(a, b, c)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        c[i] = a[i] + b[i]
        return
    end

    # d_out .= d_a .+ d_b
    @cuda threads=prod(dims) stream=stream vadd(d_a, d_b, d_out)
    # d_out .= d_out .+ d_c
    @cuda threads=prod(dims) stream=stream vadd(d_out, d_c, d_out)
end
launch(graph)
==8594== NVPROF is profiling process 8594, command: julia wip.jl
==8594== Profiling application: julia wip.jl
==8594== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 72.22% 5.8240us 2 2.9120us 1.8880us 3.9360us ptxcall_vadd_1
27.78% 2.2400us 3 746ns 608ns 960ns [CUDA memcpy HtoD]
API calls: 70.19% 101.91ms 1 101.91ms 101.91ms 101.91ms cuCtxCreate
28.57% 41.479ms 1 41.479ms 41.479ms 41.479ms cuCtxDestroy
0.56% 807.98us 1 807.98us 807.98us 807.98us cuModuleLoadDataEx
0.37% 534.67us 1 534.67us 534.67us 534.67us cuGraphInstantiate
0.14% 208.24us 4 52.059us 4.7680us 180.50us cuMemAlloc
0.07% 95.018us 1 95.018us 95.018us 95.018us cuModuleUnload
0.03% 41.206us 1 41.206us 41.206us 41.206us cuStreamCreate
0.02% 30.955us 1 30.955us 30.955us 30.955us cuGraphLaunch
0.01% 17.924us 1 17.924us 17.924us 17.924us cuStreamDestroy
0.01% 14.153us 3 4.7170us 1.6700us 10.220us cuMemcpyHtoDAsync
0.01% 9.8060us 11 891ns 337ns 2.4880us cuCtxGetCurrent
0.00% 6.3230us 1 6.3230us 6.3230us 6.3230us cuDeviceGetPCIBusId
0.00% 6.2120us 1 6.2120us 6.2120us 6.2120us cuGraphDestroy
0.00% 6.0810us 5 1.2160us 420ns 2.2970us cuDeviceGetAttribute
0.00% 4.2560us 2 2.1280us 1.5410us 2.7150us cuDeviceGet
0.00% 4.2250us 1 4.2250us 4.2250us 4.2250us cuStreamBeginCapture
0.00% 3.4370us 2 1.7180us 826ns 2.6110us cuDeviceGetCount
0.00% 2.0160us 1 2.0160us 2.0160us 2.0160us cuDriverGetVersion
0.00% 1.3710us 1 1.3710us 1.3710us 1.3710us cuStreamEndCapture
0.00% 840ns 1 840ns 840ns 840ns cuModuleGetFunction
0.00% 657ns 1 657ns 657ns 657ns cuCtxGetDevice
I had thought it would merge kernels, but it doesn't. It just avoids multiple launches.
For this to be efficient we'd have to cache graphs, which seems hard to do in an automatic fashion. I could imagine the CuGraph constructor doing something dispatch-y on the graph-construction body and its arguments, but that seems iffy, since graph construction might depend on information that isn't in the type, such as array sizes, or worse, actual values.
Maybe I'm overthinking this and it should just be explicit, but then it might not be worth it. It would be useful to have some workloads that would benefit from this, in order to estimate the gains (cc @MikeInnes).
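For what it's worth, a hypothetical sketch of the explicit variant, memoizing the instantiated graph on argument types using the CuGraph wrapper above; note how it breaks exactly when graph structure depends on runtime values:

const _graph_cache = Dict{Any,CuGraphExec_t}()
function cached_launch(f, stream::CuStream, args...)
    # type-based key: array sizes and values are deliberately ignored,
    # which is only sound if the captured graph doesn't depend on them
    key = (f, map(typeof, args)...)
    exec = get!(_graph_cache, key) do
        instantiate(CuGraph(() -> f(args...), stream))
    end
    launch(exec)
end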
from cuda.jl.
Doing this automatically seems like one of the big wins we can get in Julia. I'm imagining seeing this as a compiler feature rather than an API as such; as part of our optimisation passes we'll look for multiple kernel launches in sequence and fuse them. There are obviously a lot of details to be worked out there, but as long as we can build a graph using only type information we should be fine; it's not so far off from fusing a broadcast tree.
AFAIK this feature is pretty squarely aimed at DL, as it's increasingly difficult to stress a V100 with only matmuls and broadcasts. But I agree that it's easy to check the numbers here and we should be looking for a use case first.
from cuda.jl.