Comments (8)
There is also cudaStreamBeginCapture, which turns the work submitted to a stream into a graph.
"Capture may not be initiated if stream is cudaStreamLegacy", which IIUC includes the default stream. (In any case, we might want to switch to the per-thread default stream.)
from cuda.jl.
BLAS has cublasSetStream, so this might require some work across the packages though.
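A rough sketch of what that might look like, assuming the CuArrays CUBLAS wrapper exposes its handle and the cublasSetStream_v2 binding (exact names are a guess and may differ between versions):

using CuArrays.CUBLAS
# point CUBLAS at the capturing stream, so that subsequent BLAS calls are
# recorded into the graph rather than issued on the default stream
CUBLAS.cublasSetStream_v2(CUBLAS.handle(), stream)
# ... BLAS work to capture ...
# restore the default stream afterwards
CUBLAS.cublasSetStream_v2(CUBLAS.handle(), CuDefaultStream())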
from cuda.jl.
The legacy default stream seems different from the regular default one?
https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html#stream-sync-behavior
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1ga53e8210837f039dd6434a3a4c3324aa
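For reference, these are sentinel handles in the driver API rather than real streams; a sketch of their values from cuda.h:

# cuda.h defines the special default-stream handles as sentinel values:
#   CU_STREAM_LEGACY     = ((CUstream)0x1)
#   CU_STREAM_PER_THREAD = ((CUstream)0x2)
const CU_STREAM_LEGACY     = CuStream_t(1)  # capture refuses this one
const CU_STREAM_PER_THREAD = CuStream_t(2)  # per-thread default stream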
from cuda.jl.
This will require the (inevitable) work of putting streams everywhere:
using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t
stream = CuStream()
@apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
A = cu(rand(2,2)) # implicitly uploads on the default stream
B = cu(rand(2,2))
ERROR: LoadError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream (code #906, ERROR_STREAM_CAPTURE_IMPLICIT)
Stacktrace:
[1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float32}, ::Int64, ::CuStream) at /home/tbesard/Julia/CUDAdrv/src/memory.jl:235
... which I wasn't planning on attempting in the near future.
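A minimal sketch of the workaround, using the same CUDAdrv Mem API as in the exploration below to upload explicitly on the capturing stream instead of relying on cu()'s implicit legacy-stream copy:

a = rand(Float32, 2, 2)
buf = Mem.alloc(a)                       # device buffer for a
Mem.upload!(buf, a, stream; async=true)  # async copy on the capturing stream
A = CuArray{Float32,2}(buf, size(a))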
from cuda.jl.
... which I wasn't planning on attempting in the near future.
The reason being that I haven't put enough thought into what the API should look like, and how it would be compatible with CUDA:
- like contexts: a global default one, plus do blocks to switch streams (a sketch follows after this list)
- or rather: put the stream in the array and thread it through everywhere
The question is also where this functionality should go, and how it should interact with foreign libraries.
It also requires figuring out which operations to make asynchronous, because AFAIK you can only do a synchronous cuMemcpyHtoD on the default stream. Maybe we should just make everything asynchronous.
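To make the first option concrete, a hypothetical sketch of the do-block shape, by analogy with contexts (none of these names exist yet):

# a global default stream, plus a scoped override
const _default_stream = Ref{CuStream}(CuDefaultStream())
default_stream() = _default_stream[]
function stream!(f, s::CuStream)
    old = _default_stream[]
    _default_stream[] = s
    try
        return f()
    finally
        _default_stream[] = old
    end
end

# usage: every operation in the block would target s
# stream!(s) do
#     # uploads, kernels, ...
# end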
from cuda.jl.
https://github.com/NVIDIA/cuda-samples/blob/master/Samples/simpleCudaGraphs/simpleCudaGraphs.cu
from cuda.jl.
Some more exploration:
using CUDAnative, CUDAdrv, CuArrays
import CUDAdrv: @apicall, CuStream_t, isvalid

# graph
const CuGraph_t = Ptr{Cvoid}
mutable struct CuGraph
    handle::CuGraph_t
    ctx::CuContext

    # capture everything f() submits to `stream` into a new graph
    function CuGraph(f::Function, stream::CuStream)
        handle_ref = Ref{CuGraph_t}()
        @apicall(:cuStreamBeginCapture, (CuStream_t,), stream)
        f()
        @apicall(:cuStreamEndCapture, (CuStream_t, Ptr{CuGraph_t}), stream, handle_ref)

        ctx = CuCurrentContext()
        obj = new(handle_ref[], ctx)
        finalizer(unsafe_destroy!, obj)
        return obj
    end
end

function unsafe_destroy!(x::CuGraph)
    if isvalid(x.ctx)
        @apicall(:cuGraphDestroy, (CuGraph_t,), x)
    end
end

Base.unsafe_convert(::Type{CuGraph_t}, x::CuGraph) = x.handle

# graph node
const CuGraphNode_t = Ptr{Cvoid}

# graph execution
const CuGraphExec_t = Ptr{Cvoid}
function instantiate(graph::CuGraph)
    exec_ref = Ref{CuGraphExec_t}()
    error_node = Ref{CuGraphNode_t}()
    buflen = 256
    buf = Vector{Cchar}(undef, buflen)
    @apicall(:cuGraphInstantiate,
             (Ptr{CuGraphExec_t}, CuGraph_t, Ptr{CuGraphNode_t}, Ptr{Cchar}, Csize_t),
             exec_ref, graph, error_node, buf, buflen)
    return exec_ref[]
end

function launch(exec::CuGraphExec_t, stream::CuStream=CuDefaultStream())
    @apicall(:cuGraphLaunch, (CuGraphExec_t, CuStream_t), exec, stream)
end
launch(graph::CuGraph, stream::CuStream=CuDefaultStream()) =
    launch(instantiate(graph), stream)

# demo
stream = CuStream()
graph = CuGraph(stream) do
    dims = (3,4)

    a = rand(Float32, dims)
    #d_a = cu(a)    # would upload on the legacy stream and break capture
    buf_a = Mem.alloc(a)
    Mem.upload!(buf_a, a, stream; async=true)
    d_a = CuArray{Float32,2}(buf_a, dims)

    b = rand(Float32, dims)
    #d_b = cu(b)
    buf_b = Mem.alloc(b)
    Mem.upload!(buf_b, b, stream; async=true)
    d_b = CuArray{Float32,2}(buf_b, dims)

    c = rand(Float32, dims)
    #d_c = cu(c)
    buf_c = Mem.alloc(c)
    Mem.upload!(buf_c, c, stream; async=true)
    d_c = CuArray{Float32,2}(buf_c, dims)

    #d_out = similar(d_a)
    buf_out = Mem.alloc(b)
    d_out = CuArray{Float32,2}(buf_out, dims)

    # element-wise addition, writing into the third argument
    function vadd(a, b, c)
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x
        c[i] = a[i] + b[i]
        return
    end

    # d_out .= d_a .+ d_b
    @cuda threads=prod(dims) stream=stream vadd(d_a, d_b, d_out)
    # d_out .= d_out .+ d_c
    @cuda threads=prod(dims) stream=stream vadd(d_out, d_c, d_out)
end
launch(graph)
==8594== NVPROF is profiling process 8594, command: julia wip.jl
==8594== Profiling application: julia wip.jl
==8594== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 72.22% 5.8240us 2 2.9120us 1.8880us 3.9360us ptxcall_vadd_1
27.78% 2.2400us 3 746ns 608ns 960ns [CUDA memcpy HtoD]
API calls: 70.19% 101.91ms 1 101.91ms 101.91ms 101.91ms cuCtxCreate
28.57% 41.479ms 1 41.479ms 41.479ms 41.479ms cuCtxDestroy
0.56% 807.98us 1 807.98us 807.98us 807.98us cuModuleLoadDataEx
0.37% 534.67us 1 534.67us 534.67us 534.67us cuGraphInstantiate
0.14% 208.24us 4 52.059us 4.7680us 180.50us cuMemAlloc
0.07% 95.018us 1 95.018us 95.018us 95.018us cuModuleUnload
0.03% 41.206us 1 41.206us 41.206us 41.206us cuStreamCreate
0.02% 30.955us 1 30.955us 30.955us 30.955us cuGraphLaunch
0.01% 17.924us 1 17.924us 17.924us 17.924us cuStreamDestroy
0.01% 14.153us 3 4.7170us 1.6700us 10.220us cuMemcpyHtoDAsync
0.01% 9.8060us 11 891ns 337ns 2.4880us cuCtxGetCurrent
0.00% 6.3230us 1 6.3230us 6.3230us 6.3230us cuDeviceGetPCIBusId
0.00% 6.2120us 1 6.2120us 6.2120us 6.2120us cuGraphDestroy
0.00% 6.0810us 5 1.2160us 420ns 2.2970us cuDeviceGetAttribute
0.00% 4.2560us 2 2.1280us 1.5410us 2.7150us cuDeviceGet
0.00% 4.2250us 1 4.2250us 4.2250us 4.2250us cuStreamBeginCapture
0.00% 3.4370us 2 1.7180us 826ns 2.6110us cuDeviceGetCount
0.00% 2.0160us 1 2.0160us 2.0160us 2.0160us cuDriverGetVersion
0.00% 1.3710us 1 1.3710us 1.3710us 1.3710us cuStreamEndCapture
0.00% 840ns 1 840ns 840ns 840ns cuModuleGetFunction
0.00% 657ns 1 657ns 657ns 657ns cuCtxGetDevice
I had thought it would merge kernels, but it doesn't. It just avoids multiple launches.
For this to be efficient we'd have to cache graphs, which seems hard to do in an automatic fashion. I could imagine the CuGraph constructor doing something dispatch-y on the graph-construction body and its arguments, but that seems iffy, since graph construction might depend on information that isn't in the type, such as array sizes, or worse, actual values.
Maybe I'm overthinking this and it should just be explicit, but then it might not be worth it. It would be useful to have some workloads that would benefit from this, in order to estimate the gains (cc @MikeInnes).
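For what it's worth, a hypothetical sketch of the explicit variant, memoizing the instantiated graph on argument types using the CuGraph wrapper above; note how it breaks exactly when graph structure depends on runtime values:

const _graph_cache = Dict{Any,CuGraphExec_t}()
function cached_launch(f, stream::CuStream, args...)
    # type-based key: array sizes and values are deliberately ignored,
    # which is only sound if the captured graph doesn't depend on them
    key = (f, map(typeof, args)...)
    exec = get!(_graph_cache, key) do
        instantiate(CuGraph(() -> f(args...), stream))
    end
    launch(exec)
end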
from cuda.jl.
Doing this automatically seems like one of the big wins we can get in Julia. I'm imagining seeing this as a compiler feature rather than an API as such; as part of our optimisation passes we'll look for multiple kernel launches in sequence and fuse them. There are obviously a lot of details to be worked out there, but as long as we can build a graph using only type information we should be fine; it's not so far off from fusing a broadcast tree.
AFAIK this feature is pretty squarely aimed at DL, as it's increasingly difficult to stress a V100 with only matmuls and broadcasts. But I agree that it's easy to check the numbers here and we should be looking for a use case first.
from cuda.jl.