juliagpu / CUDA.jl
CUDA programming in Julia.
Home Page: https://juliagpu.org/cuda/
License: Other
CUDA_VISIBLE_DEVICES=3 julia --project=..
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.3 (2018-12-18)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using CUDAnative
julia> f(x) = CUDAnative.tanh(x)
f (generic function with 1 method)
julia> @code_typed f(1f0)
CodeInfo(
1 1 ─ %1 = Base.llvmcall::IntrinsicFunction
  │   %2 = (%1)(("declare float @__nv_tanhf(float)", "%2 = call float @__nv_tanhf(float %0)\nret float %2"), Float32, Tuple{Float32}, x)::Float32
  └── return %2
) => Float32
julia> f(1f0)
signal (11): Segmentation fault
in expression starting at no file:0
_ITM_registerTMCloneTable at /opt/julia/julia-1.0.3/bin/../lib/julia/libspqr.so (unknown line)
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:82
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
#28 at ./task.jl:259
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:268
unknown function (ip: 0xffffffffffffffff)
Allocations: 4184659 (Pool: 4183753; Big: 906); GC: 8
Segmentation fault (core dumped)
My suggestion would be to have these device-only intrinsics raise an error when called on the host, instead of letting the llvmcall segfault.
cc @maleadt
Ref JuliaGPU/CuArrays.jl#135 (comment)
Test summary                                | Pass  Fail  Total
BLAS                                        |   43     8     51
  matmul with element type Float32          |   10     2     12
  matmul with element type Float64          |   10     2     12
  matmul with element type Complex{Float32} |   10     2     12
  matmul with element type Complex{Float64} |   10     2     12
One interesting use case is passing Julia functions as callbacks to cuFFT:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/
As that post explains, to register the callbacks with the cuFFT plan, the first step is to get the device function pointers from the device onto the host.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH
https://devblogs.nvidia.com/cuda-10-features-revealed/
I also came across http://www.cudahandbook.com/2018/09/cuda-graphs-roi-and-api-adoption/ today, but haven't given it a proper read yet.
See JuliaGPU/CUDAnative.jl#272. The overhead seems to scale with the number of parameters, so it is probably related to argument conversion. Tracking allocations didn't reveal anything interesting.
We should have a test that checks that @allocated @cuda ... == 0.
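A minimal sketch of such a test (the trivial kernel and the warm-up launch are assumptions; the first launch triggers compilation, subsequent ones should be allocation-free):

```julia
using CuArrays, CUDAnative, Test

# Trivial kernel: any single-launch kernel would do for this check.
dummy(a) = (@inbounds a[1] = 0f0; nothing)

a = CuArray{Float32}(undef, 1)
@cuda dummy(a)                          # warm-up: compiles the kernel
@test (@allocated @cuda dummy(a)) == 0  # subsequent launches should not allocate
```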
The forward pass contains reshape operations like the following:
P = reshape(exp.(logsoftmax(Aout[1:end])), (1,:))
u_k_combined = reshape(cat(u_k_lin, u_k_nonLin, dims=1), (1,:))
input_i = reshape(input[i,:], (1, edim))
These operations work fine on the CPU during both forward and back propagation. On the GPU they cause the following error:
ERROR: LoadError: MethodError: Base._reshape(::CuArray{Float32,1}, ::Tuple{Int64}) is ambiguous. Candidates:
_reshape(A::GPUArrays.GPUArray{T,N} where N, dims::Tuple{Vararg{Int64,N}} where N) where T in GPUArrays at /home/dpk1729/.julia/packages/GPUArrays/NHNmH/src/abstractarray.jl:236
_reshape(A::GPUArrays.GPUArray{T,1}, dims::Tuple{Integer}) where T in GPUArrays at /home/dpk1729/.julia/packages/GPUArrays/NHNmH/src/abstractarray.jl:242
_reshape(parent::CuArray, dims::Tuple{Vararg{Int64,N}} where N) in CuArrays at /home/dpk1729/.julia/packages/CuArrays/clDeS/src/array.jl:68
_reshape(v::AbstractArray{T,1} where T, dims::Tuple{Int64}) in Base at reshapedarray.jl:148
Possible fix, define
_reshape(::CuArray{T,1}, ::Tuple{Int64})
Stacktrace:
[1] reshape(::CuArray{Float32,1}, ::Tuple{Int64}) at ./reshapedarray.jl:96
[2] (::getfield(Flux.Tracker, Symbol("##285#290")){CuArray{Float32,1},Int64})(::TrackedArray{…,CuArray{Float32,1}}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/array.jl:177
[3] iterate at ./generator.jl:47 [inlined]
[4] collect(::Base.Generator{Tuple{TrackedArray{…,CuArray{Float32,1}},TrackedArray{…,CuArray{Float32,1}}},getfield(Flux.Tracker, Symbol("##285#290")){CuArray{Float32,1},Int64}}) at ./array.jl:619
[5] #283 at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/array.jl:173 [inlined]
[6] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##283#288")){Int64,Tuple{TrackedArray{…,CuArray{Float32,1}},TrackedArray{…,CuArray{Float32,1}}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}},Flux.Tracker.Tracked{CuArray{Float32,1}}}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:23
[7] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
[8] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##293#294")){TrackedArray{…,CuArray{Float32,1}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}},Nothing}}, ::CuArray{Float32,2}) at ./abstractarray.jl:1844
[9] back(::Flux.Tracker.Tracked{CuArray{Float32,2}}, ::CuArray{Float32,2}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
[10] foreach at ./abstractarray.jl:1844 [inlined]
[11] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##326#327")){TrackedArray{…,CuArray{Float32,2}},TrackedArray{…,CuArray{Float32,2}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,2}},Flux.Tracker.Tracked{CuArray{Float32,2}}}}, ::CuArray{Float32,2}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
... (the last 5 lines are repeated 1 more time)
[17] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
[18] foreach(::Function, ::Tuple{Nothing,Flux.Tracker.Tracked{CuArray{Float32,1}},Nothing}, ::Tuple{CuArray{Float32,1},CuArray{Float32,1},Float32}) at ./abstractarray.jl:1844
[19] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("#back#353")){3,getfield(Base.Broadcast, Symbol("##26#28")){getfield(Base.Broadcast, Symbol("##27#29")){typeof(*),getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}},getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}},getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}}},typeof(*)},Tuple{CuArray{Float32,1},TrackedArray{…,CuArray{Float32,1}},Int64}},Tuple{Nothing,Flux.Tracker.Tracked{CuArray{Float32,1}},Nothing}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
[20] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
[21] foreach at ./abstractarray.jl:1844 [inlined]
[22] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##299#300")){TrackedArray{…,CuArray{Float32,1}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}}}}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
[23] back(::Flux.Tracker.Tracked{Float32}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:43
... (the last 3 lines are repeated 1 more time)
[27] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##194#195")){Flux.Tracker.TrackedReal{Float32},Int64},Tuple{Flux.Tracker.Tracked{Float32},Nothing}}, ::Float32) at ./abstractarray.jl:1844
[28] back(::Flux.Tracker.Tracked{Float32}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:43
... (the last 2 lines are repeated 1 more time)
[31] foreach at ./abstractarray.jl:1844 [inlined]
[32] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##202#203")),Tuple{Flux.Tracker.Tracked{Float32},Flux.Tracker.Tracked{Float32}}}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
[33] back(::Flux.Tracker.Tracked{Float32}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:43
... (the last 3 lines are repeated 126 more times)
[412] back!(::Flux.Tracker.TrackedReal{Float32}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:62
[413] train(::Data, ::Memory, ::Int64, ::Int64, ::Int64, ::Float64) at /home/dpk1729/code/MemN2N_model_gpu.jl:212
[414] main() at /home/dpk1729/code/main.jl:68
[415] top-level scope at none:0
[416] include at ./boot.jl:317 [inlined]
[417] include_relative(::Module, ::String) at ./loading.jl:1038
[418] include(::Module, ::String) at ./sysimg.jl:29
[419] exec_options(::Base.JLOptions) at ./client.jl:239
in expression starting at /home/dpk1729/code/main.jl:72
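A disambiguating method along the lines the error message suggests might look like this sketch, which mirrors Base's behavior for reshaping a vector to its own length (an untested assumption, not the actual fix shipped in CuArrays):

```julia
using CuArrays

# Resolve the ambiguity for a 1-D CuArray reshaped to a 1-D size,
# mirroring Base._reshape(v::AbstractVector, dims::Tuple{Int64}).
function Base._reshape(a::CuArray{T,1}, dims::Tuple{Int64}) where T
    length(a) == dims[1] ||
        throw(DimensionMismatch("new dimensions $dims must be consistent with array length $(length(a))"))
    return a  # a vector reshaped to its own length is itself
end
```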
julia> for dev in devices()
@show dev
end
dev = CUDAdrv.CuDevice(0, 0)
dev = CUDAdrv.CuDevice(1, 1)
dev = CUDAdrv.CuDevice(2, 2)
I was profiling why a resnet model (https://github.com/KristofferC/resnet) was running extremely slowly on Flux.
Sprinkling in some sections using https://github.com/KristofferC/TimerOutputs.jl and training the model a little, I got:
(edit: the timings below are stale due to changes in CuArrays; see https://github.com/JuliaGPU/CuArrays.jl/issues/273#issuecomment-461943376 for an update)
───────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 82.6s / 62.9% 9.09GiB / 3.32%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────
gc true 468 47.8s 92.1% 102ms 4.88MiB 1.58% 10.7KiB
crossentropy 32 1.91s 3.69% 59.8ms 159MiB 51.5% 4.98MiB
conv 1.70k 1.61s 3.10% 948μs 85.6MiB 27.7% 51.7KiB
dense 32 488ms 0.94% 15.2ms 59.1MiB 19.1% 1.85MiB
gc false 32 81.0ms 0.16% 2.53ms 645KiB 0.20% 20.2KiB
───────────────────────────────────────────────────────────────────────
The gc true section covers only a single line. That line seems to be called far too often, given how expensive a gc(true) call is.
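One way to cheapen this hot path would be to throttle the full collections, e.g. along these lines (purely a sketch; the helper name and interval are made up):

```julia
const last_full_gc = Ref(0.0)

# Only pay for an expensive full collection if enough time has passed
# since the previous one; otherwise skip it entirely.
function maybe_full_gc(interval = 1.0)
    t = time()
    if t - last_full_gc[] > interval
        Base.GC.gc(true)
        last_full_gc[] = t
    end
end
```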
julia> curand(Int8, 2)
ERROR: MethodError: no method matching curand(::Type{Int8}, ::Int64)
Closest candidates are:
curand(::Type{Float32}, ::Int64) at /home/patomson/.julia/packages/CuArrays/F96Gk/src/rand/highlevel.jl:4
curand(::Type{Float64}, ::Int64) at /home/patomson/.julia/packages/CuArrays/F96Gk/src/rand/highlevel.jl:7
It looks like you can only create CuArrays of type Float32 and Float64 this way, but current CUDA (9.2) and NVIDIA GPUs support integer types such as Int8, as well as Float16.
Related to JuliaGPU/CuArrays.jl#200, but without the adapt issue.
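Until curand grows integer support, a host-side fallback works at the cost of a transfer (a workaround sketch, not a fix):

```julia
using CuArrays

# Generate on the CPU and upload; avoids the missing curand(Int8, n) method.
x = CuArray(rand(Int8, 2))
```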
julia> convert(CuArray, 1:3)
ERROR: MethodError: no method matching similar(::Type{CuArray}, ::Type{Int64}, ::Tuple{Int64})
Closest candidates are:
similar(::Array, ::Type, ::Tuple{Vararg{Int64,N}}) where N at array.jl:332
similar(::SubArray, ::Type, ::Tuple{Vararg{Int64,N}} where N) at subarray.jl:59
similar(::Base.ReshapedArray, ::Type, ::Tuple{Vararg{Int64,N}} where N) at reshapedarray.jl:187
...
Stacktrace:
[1] convert(::Type{CuArray}, ::UnitRange{Int64}) at /home/mikeinnes/.julia/packages/GPUArrays/HmVTY/src/construction.jl:67
[2] top-level scope at none:0
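As a workaround until `similar` handles the unparameterized `CuArray` type, collecting the range on the host first works (sketch):

```julia
using CuArrays

# Materialize the range on the host, then upload.
x = CuArray(collect(1:3))
```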
CuArray{Float32}(3, N)  ->  CuArray{Float32}(undef, 3, N)
A[1,:] = 1              ->  A[1,:] .= 1
curand(Float32, 10, 10)
julia> using CUDAdrv, CUDAnative
julia> a = round.(rand(Float32, (3, 4)) * 100)
3×4 Array{Float32,2}:
57.0 13.0 55.0 14.0
78.0 55.0 90.0 73.0
61.0 85.0 36.0 85.0
julia> d_a = CuArray(a)
3×4 Array{Float32,2}:
57.0 13.0 55.0 14.0
78.0 55.0 90.0 73.0
61.0 85.0 36.0 85.0
julia> d_b = similar(d_a)
3×4 Array{Float32,2}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
julia> function kernel_vpow(a, b)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
b[i] = a[i]^1.5
return nothing
end
kernel_vpow (generic function with 1 method)
julia> @cuda (3,4) kernel_vpow(d_a, d_b)
ERROR: LLVM error: Cannot select: 0x6092650: f64 = fpow 0x6092570, ConstantFP:f64<1.500000e+00>
0x6092570: f64 = fp_extend 0x6092180
0x6092180: f32,ch = load<LD4[null(addrspace=101)]> 0x447c450, TargetExternalSymbol:i64'julia_^_61355_param_0', undef:i64
0x60920a0: i64 = TargetExternalSymbol'julia_^_61355_param_0'
0x6092110: i64 = undef
0x60925e0: f64 = ConstantFP<1.500000e+00>
In function: julia_^_61355
Stacktrace:
[1] handle_error(::Cstring) at /home/alha02/.julia/v0.6/LLVM/src/core/context.jl:96
[2] macro expansion at /home/alha02/.julia/v0.6/LLVM/src/util/logging.jl:102 [inlined]
[3] macro expansion at /home/alha02/.julia/v0.6/LLVM/src/base.jl:18 [inlined]
[4] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/alha02/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
[5] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/alha02/.julia/v0.6/LLVM/src/targetmachine.jl:39
[6] #mcgen#45(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:303
[7] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
[8] #compile_function#46(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:328
[9] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:369
[10] macro expansion at /home/alha02/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[11] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::#kernel_vpow, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}) at /home/alha02/.julia/v0.6/CUDAnative/src/execution.jl:80
Integer powers are no problem:
julia> function kernel_vpow(a, b)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
b[i] = a[i]^2
return nothing
end
kernel_vpow (generic function with 1 method)
julia> @cuda (3,4) kernel_vpow(d_a, d_b)
julia> d_b
3×4 Array{Float32,2}:
3249.0 169.0 3025.0 196.0
6084.0 3025.0 8100.0 5329.0
3721.0 7225.0 1296.0 7225.0
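Until Base's `^` lowers to something the NVPTX backend can select, a possible workaround is calling the libdevice wrapper directly (this assumes CUDAnative exposes `pow` like it does other math intrinsics):

```julia
using CUDAnative

function kernel_vpow(a, b)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    # __nv_powf via CUDAnative instead of Base's ^, which hits LLVM's fpow
    b[i] = CUDAnative.pow(a[i], 1.5f0)
    return nothing
end
```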
[brain-dump of potentially interesting approach for exception traces]
The exception backtrace POC in JuliaGPU/CUDAnative.jl#303 embeds debug information in global variables and reports it at run time, since we don't have the ability to unwind the stack on-device. At the same time, unwinding the stack should be possible from the host using the CUDA debugger API (this requires https://github.com/JuliaGPU/CUDAnative.jl/issues/31, of course). The only remaining problem there is how to transfer control from the device to the host once an exception occurs, since running under the debugger all the time is probably expensive. The CUDA_DEVICE_WAITS_ON_EXCEPTION environment variable might be a possibility: https://docs.nvidia.com/cuda/cuda-gdb/index.html#attaching
I'm trying to solve a system of linear equations on the GPU, but it crashes Julia with no error message. My understanding of a discussion from a year or two ago was that backslash would be the way to access the cusolver routines. The simplest backslash attempt crashes Julia with no error message. Since it appears that cusolver has [a routine that solves a system with a triangular structure](https://software.intel.com/en-us/mkl-developer-reference-c-trtrs), I then tried to get to that by performing a QR decomposition on the CuArrays (which works fine) and solving the triangular version of the problem, with the same result; I couldn't find that routine wrapped in CuArrays.jl anyway, though. Backslash seemed the most natural way to do this, but is there an accepted way that I'm missing?
using CuArrays
using LinearAlgebra
A, y = rand(3,4), rand(3)
cuA, cuY = cu(A), cu(y)
getQR(A) = begin
QR = qr(A)
return (QR.Q, QR.R)
end
Q, R = getQR(A)
cuQ, cuR = getQR(cuA)
actualSolution = A \ y
solutionFromQR = R \ (Q' * y)
@assert(all(abs.(actualSolution - solutionFromQR) .< 1e-4))
println("all is well thus far")
solutionFromCu = cuA \ cuY # crashes
solutionFromCuQR = cuR \ (cuQ' * cuY) # also crashes
If relevant:
julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
julia> CUDAdrv.device()
CuDevice(0): GeForce GTX 1050
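As a stopgap while the triangular solve isn't wrapped, one can round-trip through the host (a workaround sketch that obviously forfeits the GPU speedup):

```julia
# Download, solve with LAPACK on the CPU, and upload the result again.
solutionHost = Array(cuR) \ Array(cuQ' * cuY)
solutionFromCuQR = cu(solutionHost)
```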
I got the following error while executing sum(vecnorm, params(model)), where model is a Flux Chain.
ERROR: LLVM error: Cannot select: 0x559dd22e8a80: f32 = fpow 0x559dd22e91f0, 0x559dd22e8930
0x559dd22e91f0: f32 = bitcast 0x559dd22e9180
0x559dd22e9180: i32 = and 0x559dd1bea130, Constant:i32<2147483647>
0x559dd1bea130: i32,ch = load<LD4[%4]> 0x559dd206c280, 0x559dd1be9870, undef:i64
0x559dd1be9870: i64,ch = load<LD8[null(addrspace=101)]> 0x559dd206c280, TargetExternalSymbol:i64'julia__26_64205_param_1', undef:i64
0x559dd1be9800: i64 = TargetExternalSymbol'julia__26_64205_param_1'
0x559dd1be9720: i64 = undef
0x559dd1be9720: i64 = undef
0x559dd22e8460: i32 = Constant<2147483647>
0x559dd22e8930: f32 = sint_to_fp 0x559dd1be9950
0x559dd1be9950: i64,ch = load<LD8[null(addrspace=101)]> 0x559dd206c280, TargetExternalSymbol:i64'julia__26_64205_param_2', undef:i64
0x559dd1be98e0: i64 = TargetExternalSymbol'julia__26_64205_param_2'
0x559dd1be9720: i64 = undef
In function: julia__26_64205
Stacktrace:
[1] handle_error(::Cstring) at /home/tejank10/.julia/v0.6/LLVM/src/core/context.jl:100
[2] macro expansion at /home/tejank10/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
[3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/tejank10/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
[4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/tejank10/.julia/v0.6/LLVM/src/targetmachine.jl:42
[5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/tejank10/.julia/v0.6/CUDAnative/src/jit.jl:393
[6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
[7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/tejank10/.julia/v0.6/CUDAnative/src/jit.jl:440
[8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/tejank10/.julia/v0.6/CUDAnative/src/jit.jl:488
[9] macro expansion at /home/tejank10/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::Flux.Tracker.##26#27, ::CUDAnative.CuDeviceArray{ForwardDiff.Dual{Void,Float32,3},2,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool,Bool},Tuple{},Tuple{}}, ::Tuple{Tuple{Int64,Int64},Tuple{},Tuple{}}, ::CUDAnative.CuDeviceArray{ForwardDiff.Dual{Void,Float32,3},2,CUDAnative.AS.Global}, ::Tuple{Int64,Float32}) at /home/tejank10/.julia/v0.6/CUDAnative/src/execution.jl:80
[11] _broadcast! at /home/tejank10/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
[12] broadcast_t at /home/tejank10/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
[13] broadcast_c at /home/tejank10/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
[14] broadcast at ./broadcast.jl:455 [inlined]
[15] tracked_broadcast(::Function, ::TrackedArray{…,CuArray{Float32,2}}, ::Int64, ::Float32) at /home/tejank10/.julia/v0.6/Flux/src/tracker/array.jl:364
[16] vecnorm(::TrackedArray{…,CuArray{Float32,2}}, ::Int64) at /home/tejank10/.julia/v0.6/Flux/src/tracker/array.jl:224 (repeats 2 times)
[17] _mapreduce(::Base.LinAlg.#vecnorm, ::Base.#+, ::IndexLinear, ::Array{Any,1}) at ./reduce.jl:273
[18] sum(::Function, ::Array{Any,1}) at ./reduce.jl:347
The statement works fine when running on the CPU.
I'm trying to diagonalize a sparse CuArray with eigs, which currently fails because it dispatches to the generic matrix-vector multiplication. What would be needed to get it to work? Is it only a question of defining the correct mul! method?
I am currently trying to perform an eigendecomposition on the GPU. Is there any plan for CuArrays.jl to support it via CUSOLVER sometime soon?
The following code triggers this error:
ERROR: CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
Stacktrace:
[1] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
[2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:161
[3] alloc at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:157 [inlined] (repeats 2 times)
[4] CuArray{Float32,2}(::Tuple{Int64,Int64}) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:33
[5] similar at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:83 [inlined]
[6] similar at .\abstractarray.jl:571 [inlined]
[7] h_bench(::Int64, ::Int64) at .\REPL[4]:7
[8] macro expansion at .\show.jl:555 [inlined]
[9] top-level scope at .\REPL[6]:2 [inlined]
[10] top-level scope at .\none:0
using CUDAdrv, CUDAnative, BenchmarkTools
function kernel_vadd(a, b, c)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
c[i] = a[i] + b[i]
return nothing
end
function h(m, n)
# CUDAdrv functionality: generate and upload data
a = round.(rand(Float32, (m, n)) * 100)
b = round.(rand(Float32, (m, n)) * 100)
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a) # output array
@cuda threads=12 kernel_vadd(d_a, d_b, d_c)
end
function h_bench(m, n)
# CUDAdrv functionality: generate and upload data
a = round.(rand(Float32, (m, n)) * 100)
b = round.(rand(Float32, (m, n)) * 100)
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a) # output array
@benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)
end
# Works
for i in 1:5
@show i
h(10_000, 10_000)
end
# Errors after i = 3
for i in 1:5
@show i
h_bench(10_000, 10_000)
end
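A possible mitigation sketch: force a host garbage collection between iterations so that unreferenced device buffers actually get freed before the next 400 MB batch of allocations (this assumes the OOM comes from buffers whose finalizers simply haven't run yet):

```julia
for i in 1:5
    @show i
    h_bench(10_000, 10_000)
    GC.gc()  # run finalizers so the previous iteration's CuArrays are released
end
```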
Now that we support some exceptions, we should treat these kernels as valid and just have them throw at run time. This should make it much easier to debug issues like that, assuming we can report something useful for e.g. MethodErrors.
A flatten broadcast is no longer amenable to cudaconvert since there is a closure involved (see SciML/RecursiveArrayTools.jl#40):
julia> x.body
getfield(Base.Broadcast, Symbol("##1#2")){bc,makeargs} where makeargs
I don't know if it is feasible for us to define a version of cudaconvert that could inspect a closure and apply cudaconvert to its fields.
Further to the discussion about porting CUFFT.jl to use CUDAdrv (JuliaAttic/CUFFT.jl#12), it would be useful to provide support for CUDA pitched pointers, in addition to standard memory types. How might such an implementation best be approached, given the longer-term plans for this package (e.g., is there still a plan to move to the CuArray implementation)?
Can be based on PyTorch's.
The API should be equivalent to:
function softmax(x::AbstractArray; dims=1)
    out = exp.(x .- maximum(x; dims=dims))  # subtract the max for numerical stability
    out ./= sum(out; dims=dims)
    return out
end
A very simple x2 .^ 4 completely fails, and has since the dawn of time. Any chance this may be supported at some point? It feels like CuArrays is in real shortage of TLC; just my sad perspective from trying to use it, apologies.
julia> x = rand(Float32, 100);
julia> using Flux
julia> using CuArrays
julia> x2 = x |> gpu;
julia> x2 .^ 4
ERROR: LLVM error: Cannot select: 0x23701520: f32 = fpow 0x237010c0, ConstantFP:f32<4.000000e+00>
0x237010c0: f32,ch = load<LD4[null(addrspace=101)]> 0x237065b0, TargetExternalSymbol:i64'julia__1_63502_param_0', undef:i64
0x23700fe0: i64 = TargetExternalSymbol'julia__1_63502_param_0'
0x23701050: i64 = undef
0x237014b0: f32 = ConstantFP<4.000000e+00>
In function: julia__1_63502
Stacktrace:
[1] handle_error(::Cstring) at /home/ubuntu/.julia/v0.6/LLVM/src/core/context.jl:100
[2] macro expansion at /home/ubuntu/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
[3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/ubuntu/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
[4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/ubuntu/.julia/v0.6/LLVM/src/targetmachine.jl:42
[5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:393
[6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
[7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:440
[8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:488
[9] macro expansion at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##1#2, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{}) at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:80
[11] _broadcast! at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
[12] broadcast_t at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
[13] broadcast_c at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
[14] broadcast(::Function, ::CuArray{Float32,1}) at ./broadcast.jl:455
julia>
julia> x2 .^ 4.0
ERROR: LLVM error: Cannot select: 0x252443c0: f64 = fpow 0x252442e0, ConstantFP:f64<4.000000e+00>
0x252442e0: f64 = fp_extend 0x25243ef0
0x25243ef0: f32,ch = load<LD4[null(addrspace=101)]> 0x23494000, TargetExternalSymbol:i64'julia_^_63768_param_0', undef:i64
0x25243e10: i64 = TargetExternalSymbol'julia_^_63768_param_0'
0x25243e80: i64 = undef
0x25244350: f64 = ConstantFP<4.000000e+00>
In function: julia_^_63768
Stacktrace:
[1] handle_error(::Cstring) at /home/ubuntu/.julia/v0.6/LLVM/src/core/context.jl:100
[2] macro expansion at /home/ubuntu/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
[3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/ubuntu/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
[4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/ubuntu/.julia/v0.6/LLVM/src/targetmachine.jl:42
[5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:393
[6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
[7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:440
[8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:488
[9] macro expansion at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##3#4, ::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{}) at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:80
[11] _broadcast! at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
[12] broadcast_t at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
[13] broadcast_c at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
[14] broadcast(::Function, ::CuArray{Float32,1}) at ./broadcast.jl:455
julia> x2 .^ 4.0f0
ERROR: LLVM error: Cannot select: 0x240c3670: f32 = fpow 0x240c3210, ConstantFP:f32<4.000000e+00>
0x240c3210: f32,ch = load<LD4[null(addrspace=101)]> 0x2497acc0, TargetExternalSymbol:i64'julia__5_63777_param_0', undef:i64
0x240c3130: i64 = TargetExternalSymbol'julia__5_63777_param_0'
0x240c31a0: i64 = undef
0x240c3600: f32 = ConstantFP<4.000000e+00>
In function: julia__5_63777
Stacktrace:
[1] handle_error(::Cstring) at /home/ubuntu/.julia/v0.6/LLVM/src/core/context.jl:100
[2] macro expansion at /home/ubuntu/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
[3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/ubuntu/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
[4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/ubuntu/.julia/v0.6/LLVM/src/targetmachine.jl:42
[5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:393
[6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
[7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:440
[8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:488
[9] macro expansion at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##5#6, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{}) at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:80
[11] _broadcast! at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
[12] broadcast_t at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
[13] broadcast_c at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
[14] broadcast(::Function, ::CuArray{Float32,1}) at ./broadcast.jl:455
julia>
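As with the kernel case earlier, broadcasting the libdevice wrapper directly sidesteps the unselectable fpow (this assumes CUDAnative exposes `pow` for Float32, as it does for other math intrinsics):

```julia
using CuArrays, CUDAnative

y = CUDAnative.pow.(x2, 4f0)  # __nv_powf element-wise instead of Base's ^
```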
While:
arr = ones(10)
arr2 = ones(1)
CuArray(arr) .- CuArray(arr2)
works fine, CuArray(arr) .- arr2 fails. Is that intended?
Idea from Elliot: a mode for asserting or warning about memory transfers, which are often unwanted. E.g. CuArrays.allowtransfer(::Bool), CuArrays.@allowtransfer, CuArrays.@asserttransfer.
We should also think about which packages export these functions. They could live in GPUArrays, like allowscalar does right now, but I feel users shouldn't have to import any package other than CuArrays (cf. not having to import CUDAdrv in JuliaGPU/CuArrays.jl#179).
@vchuravy Care to summarize your findings? It'd be good to have all relevant documentation/links in an issue.
Many of the solver routines provide more detailed error information in the info variable, but it is never reported because @check has already thrown a very broad error message.
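A sketch of what surfacing `info` could look like: inspect it before (or instead of) the blanket `@check` and turn it into a targeted exception. The helper name and message wording here are made up:

```julia
# Hypothetical helper: translate a cusolver `info` value into a useful error.
function check_info(info::Integer, routine::Symbol)
    info == 0 && return
    if info < 0
        throw(ArgumentError("argument $(-info) to $routine was invalid"))
    else
        error("$routine failed with info = $info (e.g. a singular factor)")
    end
end
```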
using CuArrays
n = 4
A = randn(n, n) |> cu
exp(A)
ERROR: ReadOnlyMemoryError()
Stacktrace:
[1] exp!(::CuArray{Float32,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/LinearAlgebra/src/lapack.jl:209
[2] exp(::CuArray{Float32,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/LinearAlgebra/src/dense.jl:508
[3] top-level scope at none:0
I get the same error, without a stack trace, for A \ cu(randn(n)).
(This is using CuArrays#vc/0.7 (a55b12a) depending on GPUArrays#master, since CuArrays#master doesn't seem to support the A[i,j] indexing that exp(m::AbstractMatrix) uses.)
How much work would it take to support LinearAlgebra functions like these? If reasonable for a beginner like myself, any pointers to where to start?
Thanks!
If you mix CuArrays with Base Arrays, you get the slow version of things. I think this would best be disabled by default, with a nice error; users should be able to opt in to enable it.
The code
using CUDAdrv, CUDAnative
k(a,b) = (@inbounds a[1] = b[1]; nothing)
t = CuArray([(0x0,0x0)])
@device_code_sass @cuda k(t, t)
as well as
struct AAA; x::UInt8; y::UInt8; end
s = CuArray([AAA(0x0,0x0)])
@device_code_sass @cuda k(s,s)
generates two load and two store instructions (independent of the data type used: int, float, etc.):
...
/*0028*/ LDG.E.U8 R7, [R2+0x1]; /* 0xeed0200000170207 */
/*0030*/ LDG.E.U8 R6, [R2]; /* 0xeed0200000070206 */
...
/*0058*/ STG.E.U8 [R4], R7; /* 0xeed8200000070407 */
/* 0x001ffc00ffe081f1 */
/*0068*/ STG.E.U8 [R4+-0x1], R6; /* 0xeed82ffffff70406 */
...
It would be amazing if the compiler could optimize that to use vectorized memory access in both cases (https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/; update: this link is not exactly about what I'm referring to, see the comments below).
Libdl requires the lib prefix; CUDAapi doesn't.
Have you implemented Base.repeat for CuArrays? repeat seems to use the slow fallback. Is it hard to do? Thanks!
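Until a dedicated method exists, simple outer repetition can be expressed with GPU-friendly primitives. A hedged sketch for the vector case; gpurepeat is a hypothetical helper name:

```julia
# Hedged workaround: repeat a vector n times (outer repetition) using
# broadcast + reshape instead of the generic scalar-indexing fallback.
function gpurepeat(x::AbstractVector, n::Integer)
    cols = x .* fill!(similar(x, 1, n), one(eltype(x)))  # m×n, every column == x
    return vec(cols)                                     # same result as repeat(x, n)
end
```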
Hi and thanks for your work!
I have a problem that effectively prohibits me from training my Flux model with multiple embedding matrices.
It's hard to avoid allocations when indexing into (Flux-tracked) CuArrays, and my training time mainly becomes GC time.
Views won't work in these examples that I think illustrate the problem:
using CuArrays
a = UInt16.(collect(1:65535));
W = cu(randn(64, maximum(a))); # These are parameters
f(t, a) = (A = W[:, rand(a, 5000)]; sum([A for _ in 1:t]))
function g(n, a)
for _ in 1:n
CuArrays.@time f(1000, a)
end
end
g(10, a)
0.486 sec (280.39 k CPU allocs: 14.736 MiB, 11.66% gc time) (1.00 k GPU allocs: 1.192 GiB, 54.35% gc time of which 100.00% spent allocating)
0.341 sec ( 50.05 k CPU allocs: 1.768 MiB) (1.00 k GPU allocs: 1.192 GiB, 95.19% gc time of which 100.00% spent allocating)
0.398 sec ( 50.07 k CPU allocs: 1.909 MiB) (1.00 k GPU allocs: 1.192 GiB, 95.64% gc time of which 100.00% spent allocating)
0.198 sec ( 79.98 k CPU allocs: 2.314 MiB, 34.47% gc time) (1.00 k GPU allocs: 1.192 GiB, 58.37% gc time of which 100.00% spent allocating)
0.016 sec ( 45.05 k CPU allocs: 1.677 MiB) (1.00 k GPU allocs: 1.192 GiB)
0.019 sec ( 45.05 k CPU allocs: 1.677 MiB) (1.00 k GPU allocs: 1.192 GiB)
0.089 sec ( 78.19 k CPU allocs: 2.282 MiB, 73.77% gc time) (1.00 k GPU allocs: 1.192 GiB)
0.012 sec ( 44.92 k CPU allocs: 1.675 MiB) (1.00 k GPU allocs: 1.192 GiB)
0.018 sec ( 45.05 k CPU allocs: 1.677 MiB) (1.00 k GPU allocs: 1.192 GiB)
0.086 sec ( 78.75 k CPU allocs: 2.291 MiB, 75.18% gc time) (1.00 k GPU allocs: 1.192 GiB)
I think getindex has some issues with Flux backprop as well, as the following script allocates a lot of memory in the Flux.back! calls:
using CuArrays
using Flux
Ea = gpu(param(randn(64, 1_000_000)));
Eb = gpu(param(randn(64, 65_535)));
i = UInt16.(collect(1:5_000));
loss(i,n) = sum(sum(Eb[:, i] .+ Ea[:, rand(1:size(Ea,2), 1)]) for _ in 1:n)
function g(n, t, i)
for _ in 1:t
print("loss ")
CuArrays.@time l = loss(i, n)
print("back ")
CuArrays.@time Flux.back!(l)
end
end
g(100, 10, i)
loss | 0.029 sec | (33.99 k CPU allocs: 2.097 MiB) | (600 GPU allocs: 245.150 MiB, 55.07% gc)
back | 0.837 sec | (63.93 k CPU allocs: 2.997 MiB, 31.91% gc time) | (900 GPU allocs: 25.882 GiB, 28.31% gc)
loss | 0.083 sec | (35.31 k CPU allocs: 2.111 MiB) | (600 GPU allocs: 245.150 MiB, 78.41% gc)
back | 0.828 sec | (63.90 k CPU allocs: 3.004 MiB, 31.35% gc time) | (900 GPU allocs: 25.882 GiB, 28.60% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.103 MiB) | (600 GPU allocs: 245.150 MiB, 73.20% gc)
back | 0.840 sec | (64.20 k CPU allocs: 3.006 MiB, 31.64% gc time) | (900 GPU allocs: 25.882 GiB, 28.13% gc)
loss | 0.076 sec | (35.31 k CPU allocs: 2.107 MiB) | (600 GPU allocs: 245.150 MiB, 77.21% gc)
back | 0.830 sec | (63.91 k CPU allocs: 3.005 MiB, 31.28% gc time) | (900 GPU allocs: 25.882 GiB, 28.84% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.106 MiB) | (600 GPU allocs: 245.150 MiB, 73.26% gc)
back | 0.835 sec | (64.20 k CPU allocs: 3.004 MiB, 31.16% gc time) | (900 GPU allocs: 25.882 GiB, 28.69% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB) | (600 GPU allocs: 245.150 MiB, 77.39% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.004 MiB, 31.52% gc time) | (900 GPU allocs: 25.882 GiB, 28.82% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.107 MiB) | (600 GPU allocs: 245.150 MiB, 72.99% gc)
back | 0.846 sec | (64.20 k CPU allocs: 3.004 MiB, 31.89% gc time) | (900 GPU allocs: 25.882 GiB, 28.35% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB) | (600 GPU allocs: 245.150 MiB, 77.28% gc)
back | 0.834 sec | (63.91 k CPU allocs: 3.004 MiB, 31.21% gc time) | (900 GPU allocs: 25.882 GiB, 28.64% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.107 MiB) | (600 GPU allocs: 245.150 MiB, 73.00% gc)
back | 0.844 sec | (64.20 k CPU allocs: 3.004 MiB, 30.82% gc time) | (900 GPU allocs: 25.882 GiB, 28.65% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB) | (600 GPU allocs: 245.150 MiB, 77.24% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.003 MiB, 30.95% gc time) | (900 GPU allocs: 25.882 GiB, 28.52% gc)
Are there any workarounds I can use?
Please let me know if you need more info.
Thank you so much for looking into this!
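One workaround, until getindex and its adjoint allocate less, is a hand-written gather into a preallocated buffer so the forward pass reuses memory. A hedged sketch with one thread per output column; gather! and gather_kernel are hypothetical names, and the @cuda keyword syntax assumes a recent CUDAnative:

```julia
using CUDAnative, CuArrays

# Hedged sketch: write W[:, idx] into a preallocated `out`, avoiding a
# fresh CuArray allocation on every call.
function gather_kernel(out, W, idx)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # output column
    if j <= length(idx)
        @inbounds for k in 1:size(W, 1)                    # copy one column
            out[k, j] = W[k, idx[j]]
        end
    end
    return nothing
end

function gather!(out::CuMatrix, W::CuMatrix, idx::CuVector)
    threads = 256
    blocks = cld(length(idx), threads)
    @cuda blocks=blocks threads=threads gather_kernel(out, W, idx)
    return out
end
```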
Many constants in CUDA world are 32-bit, eg. the warp-size, thread or block IDs and dimensions, etc. We don't promote these to Int64 in order to avoid conversions when doing math on them, however it might be equally expensive not to do so because of conversions when doing math with literals.
For example, take the following idiomatic code:
function reduce_warp{F<:Function,T}(op::F, val::T)::T
offset = CUDAnative.warpsize() ÷ 2
while offset > 0
val = op(val, shfl_down(val, offset))
offset ÷= 2
end
return val
end
warpsize yields an Int32, but it gets converted and promoted to Int64 because of the ÷ 2. This in turn causes shfl_down, which takes an Int32, to convert it back, including an exactness check + exception (trap):
julia> CUDAnative.code_llvm(reduce_warp, (typeof(+), Int32))
define i32 @julia_reduce_warp_62748(i32) local_unnamed_addr #0 {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.warpsize()
%2 = icmp slt i32 %1, 2
br i1 %2, label %L23, label %if.preheader
if.preheader: ; preds = %top
%3 = lshr i32 %1, 1
%4 = zext i32 %3 to i64
br label %if
if: ; preds = %if.preheader, %pass2
%val.03 = phi i32 [ %9, %pass2 ], [ %0, %if.preheader ]
%offset.02 = phi i64 [ %10, %pass2 ], [ %4, %if.preheader ]
%sext = shl i64 %offset.02, 32
%5 = ashr exact i64 %sext, 32
%6 = icmp eq i64 %5, %offset.02
br i1 %6, label %pass2, label %fail1
L23.loopexit: ; preds = %pass2
br label %L23
L23: ; preds = %L23.loopexit, %top
%val.0.lcssa = phi i32 [ %0, %top ], [ %9, %L23.loopexit ]
ret i32 %val.0.lcssa
fail1: ; preds = %if
tail call void @llvm.trap()
unreachable
pass2: ; preds = %if
%7 = trunc i64 %offset.02 to i32
%8 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %val.03, i32 %7, i32 31)
%9 = add i32 %8, %val.03
%10 = lshr i64 %offset.02, 1
%11 = icmp eq i64 %10, 0
br i1 %11, label %L23.loopexit, label %if
}
An improved but less readable version of the same code looks like this:
function reduce_warp{F<:Function,T}(op::F, val::T)::T
offset = CUDAnative.warpsize() ÷ Int32(2)
while offset > Int32(0)
val = op(val, shfl_down(val, offset))
offset ÷= Int32(2)
end
return val
end
This yields the following, much cleaner IR:
define i32 @julia_reduce_warp_62749(i32) local_unnamed_addr #0 {
top:
%1 = tail call i32 @llvm.nvvm.read.ptx.sreg.warpsize()
%2 = icmp slt i32 %1, 2
br i1 %2, label %L25, label %if.preheader
if.preheader: ; preds = %top
br label %if
if: ; preds = %if.preheader, %if
%offset.03.in = phi i32 [ %offset.03, %if ], [ %1, %if.preheader ]
%val.02 = phi i32 [ %4, %if ], [ %0, %if.preheader ]
%offset.03 = sdiv i32 %offset.03.in, 2
%3 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %val.02, i32 %offset.03, i32 31)
%4 = add i32 %3, %val.02
%5 = icmp slt i32 %offset.03.in, 4
br i1 %5, label %L25.loopexit, label %if
L25.loopexit: ; preds = %if
br label %L25
L25: ; preds = %L25.loopexit, %top
%val.0.lcssa = phi i32 [ %0, %top ], [ %4, %L25.loopexit ]
ret i32 %val.0.lcssa
}
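If sprinkling Int32(...) at every use site is too noisy, the literals can be hoisted into named constants once, keeping the loop readable while still avoiding the promotion. A hedged sketch of that middle ground (written with the modern where syntax):

```julia
using CUDAnative

# Hedged middle ground: named Int32 constants keep the arithmetic in
# 32 bits without Int32(...) at every use site.
const i32_0 = Int32(0)
const i32_2 = Int32(2)

function reduce_warp(op::F, val::T)::T where {F<:Function, T}
    offset = CUDAnative.warpsize() ÷ i32_2
    while offset > i32_0
        val = op(val, shfl_down(val, offset))
        offset ÷= i32_2
    end
    return val
end
```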
continued from JuliaGPU/CUDAnative.jl#141
I ran the build script with TRACE=true and I'm getting this:
julia> Pkg.build("CUDAnative")
INFO: Building LLVM
DEBUG: Performing package build for LLVM.jl from /home/cody/.julia/reg_and_seg/v0.6/LLVM/deps
DEBUG: Discovering LLVM libraries in /home/cody/src/julia_06/usr/bin/../lib/julia, and configs in /home/cody/src/julia_06/usr/bin, /home/cody/src/julia_06/usr/bin/../tools
TRACE: Looking for libllvm in /home/cody/src/julia_06/usr/bin/../lib/julia
TRACE: Looking for llvm-config in /home/cody/src/julia_06/usr/bin
TRACE: Looking for llvm-config in /home/cody/src/julia_06/usr/bin/../tools
TRACE: - 3.9.1 at /home/cody/src/julia_06/usr/bin/../tools/llvm-config
TRACE: Looking for libllvm in /home/cody/src/julia_06/usr/lib
TRACE: - v3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so
TRACE: - v3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.so
DEBUG: Discovered LLVM toolchains: 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so
DEBUG: Discovering LLVM libraries in , and configs in /home/cody/julia/bin, /home/cody/xmrig/build, /home/cody/src/julia/deps/build, /home/cody/src/antsbin/bin, , /home/cody/bin, /home/cody/.local/bin, /home/cody/bin, /home/cody/.local/bin, /usr/local/sbin, /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin, /usr/games, /usr/local/games, /snap/bin/usr/local/cuda-7.5/bin
TRACE: Looking for llvm-config in /home/cody/julia/bin
TRACE: Looking for llvm-config in /home/cody/xmrig/build
TRACE: Looking for llvm-config in /home/cody/src/antsbin/bin
TRACE: Looking for llvm-config in /usr/local/sbin
TRACE: Looking for llvm-config in /usr/local/bin
TRACE: Looking for llvm-config in /usr/sbin
TRACE: Looking for llvm-config in /usr/bin
TRACE: - 4.0.1 at /usr/bin/llvm-config-4.0
TRACE: Looking for libllvm in /usr/lib/llvm-4.0/lib
TRACE: - v4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.1.so
TRACE: - v4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.so
TRACE: Looking for llvm-config in /sbin
TRACE: Looking for llvm-config in /bin
TRACE: Looking for llvm-config in /usr/games
TRACE: Looking for llvm-config in /usr/local/games
DEBUG: Discovered LLVM toolchains: 4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.1.so
DEBUG: Selecting LLVM from libraries 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true), 4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.1.so (bundled: false) and wrappers 3.9.0, 4.0.0
DEBUG: Selected LLVM 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true)
DEBUG: Selecting wrapper for 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true) out of wrappers 3.9.0, 4.0.0
DEBUG: Selected wrapper 3.9 for LLVM 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true)
DEBUG: Checking validity of existing ext.jl...
INFO: LLVM.jl has already been built for this toolchain, no need to rebuild
INFO: Building CUDAdrv
INFO: Building CUDAnative
TRACE: LLVM.jl is running in trace mode, this will generate a lot of additional output
DEBUG: Checking validity of bundled library at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so
config[:llvm_version] = LLVM.version() = v"3.9.1"
version = v"3.9.1"
target_support = CUDAapi.devices_for_llvm(version) = Set(VersionNumber[v"3.7.0", v"6.2.0", v"6.0.0", v"5.2.0", v"3.5.0", v"5.0.0", v"3.0.0", v"5.3.0", v"6.1.0", v"2.0.0", v"2.1.0", v"3.2.0"])
DEBUG: Dropping down to post-finalizer I/O
Note that I added a few @show statements at the end to show the version and target support info retrieved by the script.
In case it's useful, here's my ext.jl:
# autogenerated file, do not edit
const ptx_support = VersionNumber[v"3.2.0", v"4.0.0", v"4.1.0", v"4.2.0", v"4.3.0"]
const llvm_version = v"3.9.1"
const cuda_driver_version = v"9.0.0"
const julia_llvm_version = v"3.9.1"
const cuda_toolkit_version = v"7.5.17"
const cuobjdump = "/usr/local/cuda-7.5/bin/cuobjdump"
const julia_version = v"0.6.2"
const target_support = VersionNumber[v"3.0.0", v"3.2.0", v"3.5.0", v"3.7.0", v"5.0.0", v"5.2.0", v"5.3.0"]
const ptxas = "/usr/local/cuda-7.5/bin/ptxas"
const configured = true
const libdevice = Dict(v"3.0.0"=>"/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_30.10.bc",v"3.5.0"=>"/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_35.10.bc",v"5.0.0"=>"/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_50.10.bc")
The following are equivalent, but the second one is much slower:
function logsumexp(X)
m = maximum(X)
log(sum(x -> exp(x - m), X)) + m
end
function logsumexp(X)
m = maximum(X)
log(sum(exp(x - m) for x in X)) + m
end
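The generator version goes through the generic iterate-based sum, which falls off the GPU fast path, while a fused broadcast stays in a single kernel. A hedged sketch of a third, broadcast-based formulation that should match the first version's speed:

```julia
# Hedged sketch: same computation, expressed as a broadcast so that
# exp.(X .- m) fuses and the reduction runs in one device pass.
function logsumexp_bc(X)
    m = maximum(X)
    return log(sum(exp.(X .- m))) + m
end
```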
I've been trying to make more of the high-level linear algebra routines work similarly to the Base versions. For this to really work well, we'll need support for strided views where stride(A,1) == 1 but stride(A,2) > size(A,1). Currently, I see two approaches to supporting view for CuArrays: 1) make Base's SubArrays work for CuArray, or 2) include stride information in the CuArray itself.
The CuArray struct already includes a field for the offset, and it wraps a memory buffer, so it might make sense to make CuArrays more general and add a strides field. The cost is that people working explicitly with the buffer would have to be more careful about the stride information. These views would also be less general than Base's SubArrays, but I think they'll cover the more important cases.
Last night I tried the other approach, i.e. just wrapping CuArrays in SubArray, but it doesn't work with ccall, and it might require some controversial changes to how CuArrays are converted to Ptr if we want to go in this direction. Basically, the problem is that unsafe_convert(Ptr{T}, CuArray{T}) isn't defined, and therefore you can't pass a SubArray{CuArray} to ccall.
What do people think? Which one is the better solution, or have I missed a third, better option? Personally, I lean towards including stride information in CuArrays: since we already wrap a memory buffer and attach shape information, also doing this as part of a SubArray is kind of double wrapping.
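For option (1), the missing piece is roughly a pair of unsafe_convert methods. A hedged sketch of what they could look like; the exact buffer/pointer API differs between CuArrays versions, so treat pointer(A) and the offset arithmetic as illustrative:

```julia
using CuArrays

# Hedged sketch for option (1): define the conversions ccall needs.
# `pointer(A)` is assumed to yield the CuArray's device pointer.
Base.unsafe_convert(::Type{Ptr{T}}, A::CuArray{T}) where {T} =
    convert(Ptr{T}, pointer(A))

# A column-contiguous view then just offsets the parent's pointer by
# the linear index of its first element.
function Base.unsafe_convert(::Type{Ptr{T}},
                             V::SubArray{T,2,<:CuArray{T}}) where {T}
    offset = LinearIndices(parent(V))[first.(parentindices(V))...] - 1
    return Base.unsafe_convert(Ptr{T}, parent(V)) + offset * sizeof(T)
end
```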
It would be nice to have an easy way to pick one among multiple GPUs.
Hello,
In every multi-GPU config I use, I see that one GPU (the default) has significantly higher memory usage. I don't know if this is me missing some configuration setting or a bug/missing feature; I would really appreciate hints on how to solve it.
The trouble is that (for a particular algorithm) the memory overhead, all piled up on a single GPU, makes it impossible to actually use the default GPU for calculations: no memory left.
Here is a typical setup
Win 10
Julia 1.0.3
CUDAdrv v0.8.6
CUDAnative v0.9.1
CuArrays v0.8.1
(The above combination of packages is the latest I could make work together.)
CPU: 9900K, with all Windows graphics running off the built-in GPU, so the GPUs have 0 load / 0 memory usage before we start Julia.
4x 2070 8GB GPUs (but I have the same problem on 2-GPU configs with different models).
Here is the simplest example (it prepares 16 Julia workers, 4 per GPU).
"master" code
using Distributed
addprocs(15)
@everywhere include("worker_code.jl")
and content of the worker_code.jl is
using CUDAdrv: CuDevice, CuContext, DeviceSet
dev = CuDevice((myid() % length(DeviceSet())))
ctx = CuContext(dev)
println("Running on ", dev)
using CuArrays
As you can see, the above code doesn't create anything custom yet; it just prepares the basics. Yet the memory usage on the default GPU is 2.9 GB, versus only 0.4 GB on each of the 3 others.
I would have expected CuArrays to distribute the memory it needs evenly across GPUs (taking into account the device or context we set in CUDAdrv). Is that possible with some flags?
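One way to keep per-device state off GPU 0 is to pin each worker group to a single visible device before any CUDA library initializes. A sketch, assuming 4 GPUs with 4 workers each; child processes inherit the environment, so wrapping addprocs in withenv is enough:

```julia
using Distributed

# Hedged sketch: each group of workers sees exactly one GPU, so all of
# its CUDA state (contexts, memory pool) lands on that GPU only.
for gpu in 0:3                                  # assuming 4 GPUs
    withenv("CUDA_VISIBLE_DEVICES" => string(gpu)) do
        addprocs(4)                             # 4 workers per GPU
    end
end

# Inside worker code, the pinned GPU is then simply device 0.
@everywhere include("worker_code.jl")
```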
Given the following MWE:
#define BLOCK_SIZE 5
__global__ void kernel() {
__shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
int idx = threadIdx.x;
for (int i = 0; i < BLOCK_SIZE; i++) {
for (int j = 0; j < i; j++)
peri_col[idx][i] -= dia[j][i];
peri_col[idx][i] /= dia[i][i];
}
}
nvcc generates PTX with the loads clustered together:
.version 6.2
.target sm_35
.address_size 64
// .globl kernel
// kernel_dia has been demoted
// kernel_peri_col has been demoted
.visible .entry kernel(
)
{
.reg .f32 %f<36>;
.reg .b32 %r<4>;
// demoted variable
.shared .align 4 .b8 kernel_dia[100];
// demoted variable
.shared .align 4 .b8 kernel_peri_col[100];
mov.u32 %r1, %tid.x;
mov.u32 %r2, kernel_peri_col;
mad.lo.s32 %r3, %r1, 20, %r2;
ld.shared.f32 %f1, [%r3];
ld.shared.f32 %f2, [kernel_dia];
div.rn.f32 %f3, %f1, %f2;
ld.shared.f32 %f4, [%r3+4];
ld.shared.f32 %f5, [kernel_dia+4];
ld.shared.f32 %f6, [kernel_dia+24];
ld.shared.f32 %f7, [%r3+8];
ld.shared.f32 %f8, [kernel_dia+8];
ld.shared.f32 %f9, [kernel_dia+28];
ld.shared.f32 %f10, [kernel_dia+48];
ld.shared.f32 %f11, [%r3+12];
ld.shared.f32 %f12, [kernel_dia+12];
ld.shared.f32 %f13, [kernel_dia+32];
ld.shared.f32 %f14, [kernel_dia+52];
ld.shared.f32 %f15, [kernel_dia+72];
ld.shared.f32 %f16, [%r3+16];
ld.shared.f32 %f17, [kernel_dia+16];
ld.shared.f32 %f18, [kernel_dia+36];
ld.shared.f32 %f19, [kernel_dia+56];
ld.shared.f32 %f20, [kernel_dia+76];
ld.shared.f32 %f21, [kernel_dia+96];
st.shared.f32 [%r3], %f3;
sub.f32 %f22, %f4, %f5;
div.rn.f32 %f23, %f22, %f6;
st.shared.f32 [%r3+4], %f23;
sub.f32 %f24, %f7, %f8;
sub.f32 %f25, %f24, %f9;
div.rn.f32 %f26, %f25, %f10;
st.shared.f32 [%r3+8], %f26;
sub.f32 %f27, %f11, %f12;
sub.f32 %f28, %f27, %f13;
sub.f32 %f29, %f28, %f14;
div.rn.f32 %f30, %f29, %f15;
st.shared.f32 [%r3+12], %f30;
sub.f32 %f31, %f16, %f17;
sub.f32 %f32, %f31, %f18;
sub.f32 %f33, %f32, %f19;
sub.f32 %f34, %f33, %f20;
div.rn.f32 %f35, %f34, %f21;
st.shared.f32 [%r3+16], %f35;
ret;
}
Note the pessimistic alignment of 4 for the shared memory arrays. I guess this follows the C spec, but it rules out vectorization. However, when ptxas assembles this code, it knows the memory layout and "physical" alignment of the shared memory arrays, and as a result nicely vectorizes this code:
MOV R1, c[0x0][0x44];
S2R R0, SR_TID.X;
MOV32I R3, 0x64;
IMAD R0, R0, 0x14, R3;
LDS R10, [RZ];
LDS R5, [R0];
CAL 0x170;
LDS.128 R4, [RZ];
LDS.128 R8, [0x10];
LDS R20, [R0+0x4];
LDS.128 R12, [0x30];
LDS R18, [0x60];
LDS.64 R16, [0x20];
LDS.64 R2, [0x48];
LDS R19, [R0+0x8];
FADD R5, R20, -R5;
LDS R9, [R0+0xc];
LDS R4, [R0+0x10];
STS [R0], R25;
CAL 0x170;
STS [R0+0x4], R25;
We tackle this differently, using LLVM's Load Store Vectorizer by specifying a much more optimistic alignment that enables vectorization (JuliaGPU/CUDAnative.jl#204):
const BLOCK_SIZE = 5
function kernel()
dia = @cuStaticSharedMem(Float32, (BLOCK_SIZE,BLOCK_SIZE))
peri_col = @cuStaticSharedMem(Float32, (BLOCK_SIZE,BLOCK_SIZE))
index = threadIdx().x
for i = 1:BLOCK_SIZE
for j = 1:i-1
peri_col[i, index] -= dia[i, j]
end
peri_col[i, index] /= dia[i, i]
end
end
.version 6.0
.target sm_35
.address_size 64
// .globl ptxcall_kernel_2
// shmem1 has been demoted
// shmem2 has been demoted
.visible .entry ptxcall_kernel_2(
.param .align 8 .b8 ptxcall_kernel_2_param_0[24],
.param .u64 ptxcall_kernel_2_param_1
)
{
.reg .f32 %f<36>;
.reg .b32 %r<3>;
.reg .b64 %rd<4>;
// demoted variable
.shared .align 16 .b8 shmem1[100];
// demoted variable
.shared .align 16 .b8 shmem2[100];
mov.u32 %r1, %tid.x;
mul.lo.s32 %r2, %r1, 5;
mul.wide.u32 %rd1, %r2, 4;
mov.u64 %rd2, shmem2;
add.s64 %rd3, %rd2, %rd1;
ld.shared.f32 %f1, [%rd3];
ld.shared.v4.f32 {%f2, %f3, %f4, %f5}, [shmem1];
div.rn.f32 %f6, %f1, %f2;
st.shared.f32 [%rd3], %f6;
ld.shared.f32 %f7, [%rd3+4];
sub.f32 %f8, %f7, %f3;
ld.shared.f32 %f9, [shmem1+24];
div.rn.f32 %f10, %f8, %f9;
st.shared.f32 [%rd3+4], %f10;
ld.shared.f32 %f11, [%rd3+8];
sub.f32 %f12, %f11, %f4;
ld.shared.f32 %f13, [shmem1+28];
sub.f32 %f14, %f12, %f13;
ld.shared.f32 %f15, [shmem1+48];
div.rn.f32 %f16, %f14, %f15;
st.shared.f32 [%rd3+8], %f16;
ld.shared.f32 %f17, [%rd3+12];
sub.f32 %f18, %f17, %f5;
ld.shared.f32 %f19, [shmem1+32];
sub.f32 %f20, %f18, %f19;
ld.shared.f32 %f21, [shmem1+52];
sub.f32 %f22, %f20, %f21;
ld.shared.v2.f32 {%f23, %f24}, [shmem1+72];
div.rn.f32 %f25, %f22, %f23;
st.shared.f32 [%rd3+12], %f25;
ld.shared.f32 %f26, [%rd3+16];
ld.shared.f32 %f27, [shmem1+16];
sub.f32 %f28, %f26, %f27;
ld.shared.f32 %f29, [shmem1+36];
sub.f32 %f30, %f28, %f29;
ld.shared.f32 %f31, [shmem1+56];
sub.f32 %f32, %f30, %f31;
sub.f32 %f33, %f32, %f24;
ld.shared.f32 %f34, [shmem1+96];
div.rn.f32 %f35, %f33, %f34;
st.shared.f32 [%rd3+16], %f35;
ret;
}
MOV R1, c[0x0][0x44];
S2R R0, SR_TID.X;
LDS.128 R4, [RZ];
IMUL32I R0, R0, 0x5;
ISCADD R0, R0, 0x70, 0x2;
LDS R2, [R0];
CAL 0x178;
LDS R2, [R0+0x4];
STS [R0], R13;
LDS R4, [0x18];
FADD R2, -R5, R2;
CAL 0x178;
LDS R2, [R0+0x8];
STS [R0+0x4], R13;
LDS R3, [0x1c];
FADD R2, -R6, R2;
LDS R4, [0x30];
FADD R2, R2, -R3;
CAL 0x178;
LDS R2, [R0+0xc];
STS [R0+0x8], R13;
LDS R3, [0x20];
FADD R2, -R7, R2;
LDS R6, [0x34];
LDS.64 R4, [0x48];
FADD R2, R2, -R3;
FADD R2, R2, -R6;
CAL 0x178;
STS [R0+0xc], R13;
LDS R2, [R0+0x10];
LDS R3, [0x10];
LDS R4, [0x24];
LDS R6, [0x38];
FADD R2, R2, -R3;
FADD R2, R2, -R4;
LDS R4, [0x60];
FADD R2, R2, -R6;
FADD R2, -R5, R2;
CAL 0x178;
STS [R0+0x10], R13;
EXIT;
FCHK.DIVIDE P0, R2, R4;
While our PTX now contains vectorized loads (16, vs 20 with nvcc), the resulting SASS still performs more loads (remaining at 16, as we don't cluster our loads for ptxas to vectorize, vs only 12). This is due to ptxas vectorizing load chains with gaps, e.g. LDS.128 R8, [0x10] corresponds to loading elements 4 to 7 from kernel_dia, while kernel_dia+20 isn't actually used by the code (and neither is R9 in the SASS code). LLVM refuses to vectorize these accesses, emitting 3 loads instead (it should still have been able to vectorize the accesses to elements 6 and 7; looking into that right now).
So this probably needs a fix to the Load Store Vectorizer to vectorize chains with gaps.
Alternatively, we could cluster loads like nvcc does, but the existing load/store clustering DAG mutation isn't quite as aggressive, and as such only enables ptxas to merge a couple of loads (maybe there are some missing bits, since NVPTX hasn't been using the machine scheduler before):
@@ NVPTXTargetMachine.cpp
class NVPTXPassConfig : public TargetPassConfig {
+ ScheduleDAGInstrs *
+ createMachineScheduler(MachineSchedContext *C) const override {
+ ScheduleDAGMILive *DAG = createGenericSchedLive(C);
+ DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
+ DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
+ return DAG;
+ }
};
@@ NVPTXSubtarget.h
class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
+ bool enableMachineScheduler() const override { return true; }
};
cc @jlebar
reduced from rodinia/lud
ref https://lists.llvm.org/pipermail/llvm-dev/2018-June/124209.htm
Now that we have a GPU runtime library that can allocate and box, I tried to get rid of this hack: https://github.com/JuliaGPU/CUDAnative.jl/blob/53368d48b6405ee962e54d4c3b9f90e3eb623310/src/compiler/irgen.jl#L229-L236
Turns out we still need it, as the argument to throw is often the value returned by the BoundsError constructor, which has a @nospecialize, resulting in a jl_invoke. LLVM obviously can't remove this function by itself, so we end up with GPU-incompatible code. Just try cu([1]) .+ cu([2]) with --check-bounds=yes (and the above hack disabled, of course).
We could try to redefine BoundsError or throw_boundserror as it occurs for GPU code, e.g.:
using LinearAlgebra
for (W, _) in (:AT => (A,mut)->mut(A), Adapt.wrappers...)
@eval begin
# BoundsError has a @nospecialize constructor resulting in `invoke`
Base.throw_boundserror(A::$W where {AT <: CuDeviceArray}, I) = throw(nothing)
end
end
But then we still miss cases thrown from the broadcasting code (and I couldn't find a way to dispatch on throw_boundserror(Broadcasted{...CuDeviceArray...}, I)). It would be best to just get rid of @nospecialize for GPU code altogether.
Not sure whether that would need to happen at the inference/optimizer/codegen level, though. Monkey-patching MethodInstances from within CUDAnative's emit_function hook didn't seem to work.
@Keno did you do something similar for XLA, since you worked on more exhaustive inference there?
Are those interfaces public already?
Following up on https://github.com/JuliaGPU/CuArrays.jl/issues/107#issuecomment-414659174. The base behavior is now
julia> lu(zeros(2,2), check = true)
ERROR: SingularException(1)
Stacktrace:
[1] checknonsingular at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/factorization.jl:12 [inlined]
[2] #lu!#97(::Bool, ::Function, ::Array{Float64,2}, ::Val{true}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/lu.jl:41
[3] #lu! at ./none:0 [inlined]
[4] #lu#101 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/lu.jl:142 [inlined]
[5] #lu at ./none:0 [inlined] (repeats 2 times)
[6] top-level scope at none:0
julia> lu(zeros(2,2), check = false)
Failed factorization of type LU{Float64,Array{Float64,2}}
mapreduce over CartesianIndices now falls back to the one in Base (as it should).
I find this is also a useful interface for CUDA programming; is there an existing approach to realizing similar functionality?
If not, where can I start in order to support this kind of interface?
Here is a minimal example of calculating a mean distance:
julia> using CuArrays
julia> using CUDAnative
julia> function expect(f, xs::CuVector, ys::CuVector)
M = length(xs)
N = length(ys)
ci = CartesianIndices((M, N))
mapreduce(inds->f(xs[inds[1]], ys[inds[2]]), +, ci)/M/N
end
julia> @benchmark expect((x,y)->abs2(x-y), randn(100)|>cu, randn(100)|>cu)
BenchmarkTools.Trial:
memory estimate: 3.97 MiB
allocs estimate: 100033
--------------
minimum time: 182.878 ms (0.00% GC)
median time: 185.717 ms (0.00% GC)
mean time: 190.299 ms (1.61% GC)
maximum time: 246.243 ms (24.11% GC)
--------------
samples: 27
evals/sample: 1
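Until a specialized mapreduce lands, the same expectation can be phrased as a broadcast between a column vector and a row-reshaped vector, which CuArrays fuses into one kernel. A hedged sketch; expect_bc is a hypothetical name:

```julia
using CuArrays

# Hedged sketch: materialize f on the M×N grid via broadcasting instead
# of mapreduce over CartesianIndices, then reduce on the device.
function expect_bc(f, xs::CuVector, ys::CuVector)
    M, N = length(xs), length(ys)
    return sum(f.(xs, reshape(ys, 1, :))) / M / N
end

# expect_bc((x, y) -> abs2(x - y), cu(randn(100)), cu(randn(100)))
```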
Base has decided not to provide functions that take a container type, but has instead built abstractions around array constructors (currently only doing so for undef). It would be good to get rid of cuzeros, curand, etc.
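Under the constructor-based convention, the removed helpers map onto standard idioms. A sketch of the intended replacements, assuming CuArray follows Base's undef constructor and that rand! is wired up to CURAND:

```julia
using CuArrays, Random

# cuzeros(Float32, 2, 3) becomes:
A = fill!(CuArray{Float32}(undef, 2, 3), 0)

# curand(Float32, 2, 3) becomes:
B = rand!(CuArray{Float32}(undef, 2, 3))
```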
My setup:
GTX 1080 Ti
CUDA v9, driver v396.26
CUDAnative v0.9.1
Julia v1.0.2
I have been experimenting with @maetshju's implementation of CTC loss from this Flux pull request. I noticed that when using multiple threads to execute one of the CUDA kernels, the results differ from the single-threaded run. After extensive debugging, I discovered that adding @cuprintf calls immediately after two of the sync_threads() calls causes the multi-threaded code to produce the same results as the single-threaded code. It appears that the sync_threads() calls are not actually syncing the threads in this particular instance (and @cuprintf seems to have the side effect of forcing thread synchronization).
Unfortunately, I haven't been able to reproduce this issue in a simple code example, so I have attached my version of the entire CTC implementation with a test case that reproduces the problem:
ctc.jl.txt
To reproduce:
julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241 accum[7]: -inf
thread 1 grad[8]: -0.829811 accum[8]: -3.782628
thread 1 grad[9]: 0.164570 accum[9]: -6.222818
multi-thread intermediate printouts:
thread 2 grad[8]: 0.090031 accum[8]: -inf
thread 3 grad[9]: 0.244728 accum[9]: -inf
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241 accum[7]: -inf
max grad diff: 9.20e-01
single-thread grad: -0.82981
multi-thread grad: 0.09003
The lines beginning with "thread" are printouts of intermediate results from the computeBetasAndGrads kernel. In the multi-thread case, threads 2 and 3 should not be printing out values for grad and accum (line 168 in ctc.jl) before thread 1 has finished initializing accum (line 147). The maximum difference between the single- and multi-thread gradients is about 0.9.
Then modify ctc.jl by uncommenting the two @cuprintf calls on lines 152 and 240 ("sync 1" and "sync 2" printouts) and re-run:
julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
thread 1 grad[7]: 0.665241 accum[7]: -inf
thread 1 grad[8]: -0.829811 accum[8]: -3.782628
thread 1 grad[9]: 0.164570 accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 1
multi-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
sync 1 thread: 2
sync 1 thread: 3
thread 1 grad[7]: 0.665241 accum[7]: -inf
thread 2 grad[8]: -0.829811 accum[8]: -3.782628
thread 3 grad[9]: 0.164570 accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3
max grad diff: 0.00e+00
single-thread grad: -0.31767
multi-thread grad: -0.31767
Now thread 1 finishes initializing accum before the other threads begin calculating values for grad. The single- and multi-thread gradients are now identical.
So, although I suspect the actual problem is something else, this example does seem to indicate that the sync_threads() calls on lines 151 and 239 are simply not working.
Is a ccall within a CUDA kernel guaranteed to be synchronous with respect to the kernel thread?
Do you have any suggestions for how to debug this further?
Thanks!
I would like to ask if someone would be willing to write function(s) for segmented reductions. These functions are handy for multiple-instance learning problems, where a sample is represented by an unordered set of vectors (each sample can have a different number of vectors). It would be nice to have these functions available for use with Flux.
My implementation of the normal Julia code for the forward pass looks like
function segmented_max(x::Matrix{T},bags::Vector) where{T}
assert(checkbounds(x,bags))
o = zeros(eltype(x),size(x,1),length(bags))
fill!(o,typemin(T));
@inbounds for i in 1:length(bags) #iterate over bags
for j in bags[i] #iterate over items (vectors) in bags
for k in 1:size(x,1)
if x[k,j]>o[k,i]
o[k,i]=x[k,j];
end
end
end
end
o
end
while for the reverse part it is (the snippet appears to be missing its opening line, presumably something like function segmented_max_back(Δ, x::Matrix{T}, bags::Vector) where {T}):
assert(checkbounds(x,bags))
maxI=zeros(Int,size(x,1),length(bags))
gx = zeros(x)
for i in 1:length(bags) #iterate over bags
bagsize=length(bags[i])
for j in bags[i] #iterate over subbags
for k in 1:size(x,1)
if x[k,j]>gx[k,i]
gx[k,i]=x[k,j];
maxI[k,i]=j;
end
end
end
end
fill!(gx,0)
@inbounds for I in CartesianRange(size(Δ))
if maxI[I]>0
gx[I[1],maxI[I]]=Δ[I];
end
end
gx
end
I have to confess that I have never programmed for Cuda, therefore I am writing here.
Thanks for any help.
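For reference, a hedged sketch of how the forward pass could look as a CUDAnative kernel: one block per bag, one thread per feature row. It assumes bags arrive as start/stop index vectors (a Vector of ranges can't live on the device directly) and that size(x, 1) fits in one thread block; all names are hypothetical:

```julia
using CUDAnative, CuArrays

# Hedged sketch of segmented max: block i handles bag i, thread k
# handles feature row k. `starts`/`stops` delimit each bag's columns.
function segmented_max_kernel(o, x, starts, stops)
    i = blockIdx().x                   # bag index
    k = threadIdx().x                  # feature (row) index
    if k <= size(x, 1)
        m = typemin(eltype(o))
        for j in starts[i]:stops[i]    # iterate items in bag i
            @inbounds m = max(m, x[k, j])
        end
        @inbounds o[k, i] = m
    end
    return nothing
end

function segmented_max(x::CuMatrix{T}, starts, stops) where {T}
    nbags = length(starts)
    o = CuArray{T}(undef, size(x, 1), nbags)
    # assumes size(x, 1) ≤ the maximum threads per block
    @cuda blocks=nbags threads=size(x, 1) segmented_max_kernel(o, x, starts, stops)
    return o
end
```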
Add a versioninfo method for printing stuff, like Base and LinearAlgebra have.
Repro:
@target ptx function foo(n)
return 1.0^n
end
code_native(foo, Tuple{Int})
Generated IR:
define double @julia_foo_64487(i64) #0 !dbg !6 {
top:
%sext = shl i64 %0, 32
%1 = ashr exact i64 %sext, 32
%2 = icmp eq i64 %1, %0
br i1 %2, label %pass, label %fail
fail: ; preds = %top
call void @llvm.trap() #2
unreachable
pass: ; preds = %top
%3 = trunc i64 %0 to i32
%4 = call double @llvm.powi.f64(double 1.000000e+00, i32 %3)
ret double %4
}
versioninfo():
Julia Version 0.6.0-dev.884
Commit 34d11b0* (2016-09-14 21:55 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.0 (ORCJIT, ivybridge)
I guess this could be faster on the GPU? Any idea on how to implement this?
A = cu(rand(2048, 2048))
B = CuArray(rand(1:2048, 1024))
A[B, 2]
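One way to express this without the generic scalar-indexing fallback is to map a device-side lookup over the index array. A hedged sketch; whether a captured CuArray is converted to a device array inside map is version-dependent, so treat this as illustrative:

```julia
using CuArrays

A = cu(rand(2048, 2048))
B = CuArray(rand(1:2048, 1024))

# Hedged sketch: gather column 2 at the rows listed in B in a single
# fused kernel, producing the same result as A[B, 2].
col = A[:, 2]                     # contiguous column on the device
gathered = map(i -> col[i], B)    # device-side gather; version-dependent
```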
Pkg.test("CuArrays") fails, and I was able to trace the source of the error to the test below. (All other tests pass if I remove this test.)
@testset "gemm_strided_batched with element type $elty" for elty in [Float32, Float64, ComplexF32, ComplexF64]
The function CuArrays.CUBLAS.gemm_strided_batched! fails with repeated calls.
The function CuArrays.CUBLAS.gemm_batched! works with repeated calls.
I think this is because there is an issue with using A::CuArray{$elty, 3} in the call to
@check ccall(($(string(fname)), libcublas), cublasStatus_t,
(cublasHandle_t, cublasOperation_t,
cublasOperation_t, Cint, Cint, Cint, Ptr{$elty},
Ptr{$elty}, Cint, Cint, Ptr{$elty}, Cint, Cint, Ptr{$elty},
Ptr{$elty}, Cint, Cint, Cint),
handle(), cutransA,
cutransB, m, n, k, [alpha], A, lda, strideA, B, ldb, strideB, [beta],
C, ldc, strideC, batchCount)
C
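If the pointer conversion is indeed at fault, one direction to try is declaring the device arguments with CUDAdrv's CuPtr type instead of Ptr, so a CuArray is never converted through a host pointer. A sketch under that assumption (note also that the cuBLAS strided-batched strides are long long int, so Clonglong may be more accurate than Cint):

```julia
# Sketch only: assumes a CUDAdrv version that provides CuPtr.
@check ccall(($(string(fname)), libcublas), cublasStatus_t,
             (cublasHandle_t, cublasOperation_t, cublasOperation_t,
              Cint, Cint, Cint,
              Ptr{$elty},                             # alpha (host pointer)
              CuPtr{$elty}, Cint, Clonglong,          # A, lda, strideA
              CuPtr{$elty}, Cint, Clonglong,          # B, ldb, strideB
              Ptr{$elty},                             # beta (host pointer)
              CuPtr{$elty}, Cint, Clonglong, Cint),   # C, ldc, strideC, batchCount
             handle(), cutransA, cutransB, m, n, k, [alpha],
             A, lda, strideA, B, ldb, strideB, [beta],
             C, ldc, strideC, batchCount)
```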
I can recreate the error above by running the following in Jupyter:
using CUDAnative, CuArrays
using CuArrays.CUBLAS
m = 20
n = 35
k = 13
elty= Float32
nbatch = 10
# generate matrices
alpha = one(elty)
beta = one(elty)
A = rand(elty, m, k, nbatch)
B = rand(elty, k, n, nbatch)
C = rand(elty, m, n, nbatch)
# move to device
d_A = CuArray{elty, 3}(A)
d_B = CuArray{elty, 3}(B)
d_C = CuArray{elty, 3}(C)
CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', alpha, d_A, d_B, beta, d_C)
CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', alpha, d_A, d_B, beta, d_C)
Here is the error from running the test:
gemm_strided_batched with element type Float64: Error During Test at "juliahome"\.julia\dev\CuArrays\test\blas.jl:659
Got exception outside of a @test
CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
[1] macro expansion at "juliahome"\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
[2] #download!#11(::Bool, ::Function, ::Ptr{Float64}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at "juliahome"\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:254
[3] download! at "juliahome"\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:248 [inlined] (repeats 2 times)
[4] unsafe_copyto! at "juliahome"\.julia\dev\CuArrays\src\array.jl:166 [inlined]
[5] copyto!(::Array{Float64,3}, ::CuArray{Float64,3}) at "juliahome"\.julia\packages\GPUArrays\HmVTY\src\abstractarray.jl:110
[6] Array(::CuArray{Float64,3}) at .\array.jl:497
[7] macro expansion at "juliahome"\.julia\dev\CuArrays\test\blas.jl:690 [inlined]
[8] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1156 [inlined]
[9] macro expansion at "juliahome"\.julia\dev\CuArrays\test\blas.jl:659 [inlined]
[10] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1083 [inlined]
[11] macro expansion at "juliahome"\.julia\dev\CuArrays\test\blas.jl:546 [inlined]
[12] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1083 [inlined]
[13] top-level scope at "juliahome"\.julia\dev\CuArrays\test\blas.jl:3
[14] include at .\boot.jl:317 [inlined]
[15] include_relative(::Module, ::String) at .\loading.jl:1044
[16] include(::Module, ::String) at .\sysimg.jl:29
[17] include(::String) at .\client.jl:392
[18] macro expansion at "juliahome"\.julia\dev\CuArrays\test\runtests.jl:21 [inlined]
[19] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1083 [inlined]
[20] top-level scope at "juliahome"\.julia\dev\CuArrays\test\runtests.jl:19
[21] include at .\boot.jl:317 [inlined]
[22] include_relative(::Module, ::String) at .\loading.jl:1044
[23] include(::Module, ::String) at .\sysimg.jl:29
[24] include(::String) at .\client.jl:392
[25] top-level scope at none:0
[26] eval(::Module, ::Any) at .\boot.jl:319
[27] exec_options(::Base.JLOptions) at .\client.jl:243
[28] _start() at .\client.jl:425