cuda.jl's Issues

Accidentally calling GPU intrinsics on the host causes segfaults

CUDA_VISIBLE_DEVICES=3 julia --project=..
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.3 (2018-12-18)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using CUDAnative

julia> f(x) = CUDAnative.tanh(x)
f (generic function with 1 method)

julia> @code_typed f(1f0)
CodeInfo(
1 1 ─ %1 = Base.llvmcall::IntrinsicFunction                                                                                              │╻╷ tanh
  │   %2 = (%1)(("declare float @__nv_tanhf(float)", "%2 =  call float @__nv_tanhf(float %0)\nret float %2"), Float32, Tuple{Float32}, x)::Float32
  └──      return %2                                                                                                                     │
) => Float32

julia> f(1f0)

signal (11): Segmentation fault
in expression starting at no file:0
_ITM_registerTMCloneTable at /opt/julia/julia-1.0.3/bin/../lib/julia/libspqr.so (unknown line)
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:82
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
#28 at ./task.jl:259
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:268
unknown function (ip: 0xffffffffffffffff)
Allocations: 4184659 (Pool: 4183753; Big: 906); GC: 8
Segmentation fault (core dumped)

My suggestion would be to

  1. Switch this to ccall with llvmcall calling convention
  2. Add an llvmcall_gpu calling convention that emits an error when somebody tries to compile it for a host backend.

cc @maleadt
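A hedged sketch of what erroring on the host could look like; `on_device` and `gpu_tanh` are hypothetical stand-in names, and a real implementation would hook into the compiler rather than a runtime predicate:

```julia
# Hypothetical sketch: make a device intrinsic fail loudly on the host
# instead of segfaulting. `on_device()` stands in for a compiler-provided
# check; it is a placeholder, not a CUDAnative API.
on_device() = false  # would be `true` only during GPU codegen

function gpu_tanh(x::Float32)
    if on_device()
        # the llvmcall to @__nv_tanhf would live here, device-only
        return x  # unreachable on the host
    end
    error("GPU intrinsic gpu_tanh called on the host; use Base.tanh instead")
end
```

Calling gpu_tanh(1f0) on the host then throws an ordinary ErrorException instead of crashing the process.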

Reshaping CuArray throws error during backpropagation

System information

  • OS Platform and Distribution : Ubuntu 16.04.3 LTS
  • CuArray version : 0.6.2
  • Julia version: 0.6.4
  • CUDA/cuDNN version: 8.0/6
  • GPU model and memory: Tesla K40

The forward propagation contains reshape operations like the ones below:

P = reshape(exp.(logsoftmax(Aout[1:end])), (1,:))
u_k_combined = reshape(cat(u_k_lin, u_k_nonLin, dims=1), (1,:))
input_i = reshape(input[i,:], (1, edim))

These operations work fine on the CPU during forward and back propagation.
On the GPU they cause the following error:

ERROR: LoadError: MethodError: Base._reshape(::CuArray{Float32,1}, ::Tuple{Int64}) is ambiguous. Candidates:
  _reshape(A::GPUArrays.GPUArray{T,N} where N, dims::Tuple{Vararg{Int64,N}} where N) where T in GPUArrays at /home/dpk1729/.julia/packages/GPUArrays/NHNmH/src/abstractarray.jl:236
  _reshape(A::GPUArrays.GPUArray{T,1}, dims::Tuple{Integer}) where T in GPUArrays at /home/dpk1729/.julia/packages/GPUArrays/NHNmH/src/abstractarray.jl:242
  _reshape(parent::CuArray, dims::Tuple{Vararg{Int64,N}} where N) in CuArrays at /home/dpk1729/.julia/packages/CuArrays/clDeS/src/array.jl:68
  _reshape(v::AbstractArray{T,1} where T, dims::Tuple{Int64}) in Base at reshapedarray.jl:148
Possible fix, define
  _reshape(::CuArray{T,1}, ::Tuple{Int64})
Stacktrace:
 [1] reshape(::CuArray{Float32,1}, ::Tuple{Int64}) at ./reshapedarray.jl:96
 [2] (::getfield(Flux.Tracker, Symbol("##285#290")){CuArray{Float32,1},Int64})(::TrackedArray{…,CuArray{Float32,1}}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/array.jl:177
 [3] iterate at ./generator.jl:47 [inlined]
 [4] collect(::Base.Generator{Tuple{TrackedArray{…,CuArray{Float32,1}},TrackedArray{…,CuArray{Float32,1}}},getfield(Flux.Tracker, Symbol("##285#290")){CuArray{Float32,1},Int64}}) at ./array.jl:619
 [5] #283 at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/array.jl:173 [inlined]
 [6] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##283#288")){Int64,Tuple{TrackedArray{…,CuArray{Float32,1}},TrackedArray{…,CuArray{Float32,1}}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}},Flux.Tracker.Tracked{CuArray{Float32,1}}}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:23
 [7] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
 [8] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##293#294")){TrackedArray{…,CuArray{Float32,1}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}},Nothing}}, ::CuArray{Float32,2}) at ./abstractarray.jl:1844
 [9] back(::Flux.Tracker.Tracked{CuArray{Float32,2}}, ::CuArray{Float32,2}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
 [10] foreach at ./abstractarray.jl:1844 [inlined]
 [11] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##326#327")){TrackedArray{…,CuArray{Float32,2}},TrackedArray{…,CuArray{Float32,2}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,2}},Flux.Tracker.Tracked{CuArray{Float32,2}}}}, ::CuArray{Float32,2}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
 ... (the last 5 lines are repeated 1 more time)
 [17] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
 [18] foreach(::Function, ::Tuple{Nothing,Flux.Tracker.Tracked{CuArray{Float32,1}},Nothing}, ::Tuple{CuArray{Float32,1},CuArray{Float32,1},Float32}) at ./abstractarray.jl:1844
 [19] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("#back#353")){3,getfield(Base.Broadcast, Symbol("##26#28")){getfield(Base.Broadcast, Symbol("##27#29")){typeof(*),getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}},getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}},getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}}},typeof(*)},Tuple{CuArray{Float32,1},TrackedArray{…,CuArray{Float32,1}},Int64}},Tuple{Nothing,Flux.Tracker.Tracked{CuArray{Float32,1}},Nothing}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
 [20] back(::Flux.Tracker.Tracked{CuArray{Float32,1}}, ::CuArray{Float32,1}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:45
 [21] foreach at ./abstractarray.jl:1844 [inlined]
 [22] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##299#300")){TrackedArray{…,CuArray{Float32,1}}},Tuple{Flux.Tracker.Tracked{CuArray{Float32,1}}}}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
 [23] back(::Flux.Tracker.Tracked{Float32}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:43
 ... (the last 3 lines are repeated 1 more time)
 [27] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##194#195")){Flux.Tracker.TrackedReal{Float32},Int64},Tuple{Flux.Tracker.Tracked{Float32},Nothing}}, ::Float32) at ./abstractarray.jl:1844
 [28] back(::Flux.Tracker.Tracked{Float32}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:43
 ... (the last 2 lines are repeated 1 more time)
 [31] foreach at ./abstractarray.jl:1844 [inlined]
 [32] back_(::Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##202#203")),Tuple{Flux.Tracker.Tracked{Float32},Flux.Tracker.Tracked{Float32}}}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:26
 [33] back(::Flux.Tracker.Tracked{Float32}, ::Float32) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:43
 ... (the last 3 lines are repeated 126 more times)
 [412] back!(::Flux.Tracker.TrackedReal{Float32}) at /home/dpk1729/.julia/packages/Flux/UHjNa/src/tracker/back.jl:62
 [413] train(::Data, ::Memory, ::Int64, ::Int64, ::Int64, ::Float64) at /home/dpk1729/code/MemN2N_model_gpu.jl:212
 [414] main() at /home/dpk1729/code/main.jl:68
 [415] top-level scope at none:0
 [416] include at ./boot.jl:317 [inlined]
 [417] include_relative(::Module, ::String) at ./loading.jl:1038
 [418] include(::Module, ::String) at ./sysimg.jl:29
 [419] exec_options(::Base.JLOptions) at ./client.jl:239
in expression starting at /home/dpk1729/code/main.jl:72
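The "Possible fix" hint in the error is the right direction: defining the single most specific method breaks the tie. A self-contained toy reproduction of the same ambiguity pattern (with hypothetical stand-in types, to avoid a GPU dependency):

```julia
# Two abstract/concrete layers mirroring GPUArray/CuArray:
abstract type MyGPUArray{T,N} end
struct MyCuArray{T,N} <: MyGPUArray{T,N} end

# The three clashing methods from the MethodError:
_reshape(A::MyGPUArray{T}, dims::Tuple{Vararg{Int}}) where T = :gpuarray_generic
_reshape(A::MyGPUArray{T,1}, dims::Tuple{Integer}) where T = :gpuarray_vector
_reshape(A::MyCuArray, dims::Tuple{Vararg{Int}}) = :cuarray

# The disambiguating method the error message asks for:
_reshape(A::MyCuArray{T,1}, dims::Tuple{Int}) where T = :fixed

_reshape(MyCuArray{Float32,1}(), (3,))  # dispatches to :fixed, no ambiguity
```

Without the last definition the call is ambiguous for exactly the reason shown in the error: the vector method is more specific in its first argument, the concrete-type method in neither, and Julia refuses to pick.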

Tests for showing

julia> for dev in devices()
           @show dev
       end
dev = CUDAdrv.CuDevice(0, 0)
dev = CUDAdrv.CuDevice(1, 1)
dev = CUDAdrv.CuDevice(2, 2)

ResNet spending much time in CuArrays GC

I was profiling why a resnet model (https://github.com/KristofferC/resnet) was running extremely slow on Flux.

Sprinkling some sections using https://github.com/KristofferC/TimerOutputs.jl and training the model a little bit I got:

(edit: the timings below are stale due to changes in CuArrays, see https://github.com/JuliaGPU/CuArrays.jl/issues/273#issuecomment-461943376 for an update)

 ───────────────────────────────────────────────────────────────────────
                                Time                   Allocations
                        ──────────────────────   ───────────────────────
    Tot / % measured:        82.6s / 62.9%           9.09GiB / 3.32%

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 gc true           468    47.8s  92.1%   102ms   4.88MiB  1.58%  10.7KiB
 crossentropy       32    1.91s  3.69%  59.8ms    159MiB  51.5%  4.98MiB
 conv            1.70k    1.61s  3.10%   948μs   85.6MiB  27.7%  51.7KiB
 dense              32    488ms  0.94%  15.2ms   59.1MiB  19.1%  1.85MiB
 gc false           32   81.0ms  0.16%  2.53ms    645KiB  0.20%  20.2KiB
 ───────────────────────────────────────────────────────────────────────

The gc true section refers only to this line:

https://github.com/JuliaGPU/CuArrays.jl/blob/61e25a2d239da77a5e8f3dc9746f9f62cd9e1380/src/memory.jl#L256

This line seems to be called far too often, given how expensive a gc(true) call is.
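One way to address this (a sketch of the throttling idea, not the CuArrays implementation, and the interval value is a pure guess):

```julia
# Rate-limit expensive full collections: skip the call unless `interval`
# seconds have passed since the last one.
const _last_full_gc = Ref(0.0)

function maybe_full_gc(; interval::Float64 = 1.0)
    t = time()
    t - _last_full_gc[] < interval && return false  # too soon, skip
    _last_full_gc[] = t
    GC.gc(true)  # the expensive call from memory.jl
    return true
end
```

The allocator would call maybe_full_gc() on its slow path; under memory pressure one could still force a collection unconditionally.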

Support for Ints and Float16?

julia> curand(Int8, 2)
ERROR: MethodError: no method matching curand(::Type{Int8}, ::Int64)
Closest candidates are:
  curand(::Type{Float32}, ::Int64) at /home/patomson/.julia/packages/CuArrays/F96Gk/src/rand/highlevel.jl:4
  curand(::Type{Float64}, ::Int64) at /home/patomson/.julia/packages/CuArrays/F96Gk/src/rand/highlevel.jl:7

It looks like you can only create CuArrays of type Float32 and Float64, but current CUDA (9.2) and NVIDIA GPUs support integer types such as Int8, as well as Float16.

Conversion issue

Related to JuliaGPU/CuArrays.jl#200, but without the adapt issue.

julia> convert(CuArray, 1:3)
ERROR: MethodError: no method matching similar(::Type{CuArray}, ::Type{Int64}, ::Tuple{Int64})
Closest candidates are:
  similar(::Array, ::Type, ::Tuple{Vararg{Int64,N}}) where N at array.jl:332
  similar(::SubArray, ::Type, ::Tuple{Vararg{Int64,N}} where N) at subarray.jl:59
  similar(::Base.ReshapedArray, ::Type, ::Tuple{Vararg{Int64,N}} where N) at reshapedarray.jl:187
  ...
Stacktrace:
 [1] convert(::Type{CuArray}, ::UnitRange{Int64}) at /home/mikeinnes/.julia/packages/GPUArrays/HmVTY/src/construction.jl:67
 [2] top-level scope at none:0
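Until the missing similar(::Type{CuArray}, ...) method exists, materializing the range first should sidestep the failing generic convert path. The CPU analogue below is runnable; on the GPU the same pattern would presumably be CuArray(collect(1:3)):

```julia
# Ranges are lazy; collect turns them into a dense Vector that an
# array constructor can copy directly.
xs = collect(1:3)   # Vector{Int64}
# GPU equivalent (assumed workaround): CuArray(collect(1:3))
```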

Cannot select `fpow` while lowering `pow.f32`

julia> using CUDAdrv, CUDAnative

julia> a = round.(rand(Float32, (3, 4)) * 100)
3×4 Array{Float32,2}:
 57.0  13.0  55.0  14.0
 78.0  55.0  90.0  73.0
 61.0  85.0  36.0  85.0

julia> d_a = CuArray(a)
3×4 Array{Float32,2}:
 57.0  13.0  55.0  14.0
 78.0  55.0  90.0  73.0
 61.0  85.0  36.0  85.0

julia> d_b = similar(d_a)
3×4 Array{Float32,2}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

julia> function kernel_vpow(a, b)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           b[i] = a[i]^1.5
           return nothing
       end
kernel_vpow (generic function with 1 method)

julia> @cuda (3,4) kernel_vpow(d_a, d_b)
ERROR: LLVM error: Cannot select: 0x6092650: f64 = fpow 0x6092570, ConstantFP:f64<1.500000e+00>
  0x6092570: f64 = fp_extend 0x6092180
    0x6092180: f32,ch = load<LD4[null(addrspace=101)]> 0x447c450, TargetExternalSymbol:i64'julia_^_61355_param_0', undef:i64
      0x60920a0: i64 = TargetExternalSymbol'julia_^_61355_param_0'
      0x6092110: i64 = undef
  0x60925e0: f64 = ConstantFP<1.500000e+00>
In function: julia_^_61355
Stacktrace:
 [1] handle_error(::Cstring) at /home/alha02/.julia/v0.6/LLVM/src/core/context.jl:96
 [2] macro expansion at /home/alha02/.julia/v0.6/LLVM/src/util/logging.jl:102 [inlined]
 [3] macro expansion at /home/alha02/.julia/v0.6/LLVM/src/base.jl:18 [inlined]
 [4] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/alha02/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
 [5] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/alha02/.julia/v0.6/LLVM/src/targetmachine.jl:39
 [6] #mcgen#45(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:303
 [7] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
 [8] #compile_function#46(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:328
 [9] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:369
 [10] macro expansion at /home/alha02/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
 [11] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::#kernel_vpow, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}) at /home/alha02/.julia/v0.6/CUDAnative/src/execution.jl:80

Integer powers are no problem:

julia> function kernel_vpow(a, b)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           b[i] = a[i]^2
           return nothing
       end
kernel_vpow (generic function with 1 method)

julia> @cuda (3,4) kernel_vpow(d_a, d_b)

julia> d_b
3×4 Array{Float32,2}:
 3249.0   169.0  3025.0   196.0
 6084.0  3025.0  8100.0  5329.0
 3721.0  7225.0  1296.0  7225.0
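The root cause is type promotion, not the kernel itself: the literal 1.5 is a Float64, so a[i]^1.5 promotes the Float32 element to Float64 (the fp_extend in the error) and lowers to an fpow node the NVPTX backend cannot select. A quick host-side check of the promotion; the kernel-side workaround is presumably to keep the exponent a Float32 (a[i]^1.5f0) or use a CUDAnative math intrinsic:

```julia
x = 1.0f0
typeof(x ^ 1.5)    # Float64: the Float64 literal promotes the result
typeof(x ^ 1.5f0)  # Float32: no promotion, no fp_extend in the kernel
```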

Host-side stack unwinding

[Brain-dump of a potentially interesting approach for exception traces.]

The exception backtrace POC in JuliaGPU/CUDAnative.jl#303 embeds debug information in global variables and reports it at run time, since we don't have the ability to unwind the stack on-device. At the same time, unwinding the stack should be possible from the host, using the CUDA debugger API (this requires https://github.com/JuliaGPU/CUDAnative.jl/issues/31, of course). The only remaining problem is how to transfer control from the device to the host once an exception occurs, since running under the debugger all the time is probably expensive. The CUDA_DEVICE_WAITS_ON_EXCEPTION environment variable might be a possibility: https://docs.nvidia.com/cuda/cuda-gdb/index.html#attaching

backslash with gpu matrices crashes julia

I'm trying to solve a system of linear equations on the GPU, but it crashes Julia with no error message. My understanding of a discussion from a year or two ago was that backslash would be the way to access the cusolver routines. The simplest backslash attempt crashes Julia with no error message. Since it appears that cusolver has [a routine that solves a system with a triangular structure](https://software.intel.com/en-us/mkl-developer-reference-c-trtrs), I tried to get to that by performing a QR decomposition on the CuArrays (which works fine), then solving the triangular version of the problem, with the same result. I couldn't find that routine wrapped in CuArrays.jl anyway, though. Backslash seemed the most natural way to do this, but is there an accepted way to do this that I'm missing?

using CuArrays
using LinearAlgebra

A, y = rand(3,4), rand(3)
cuA, cuY = cu(A), cu(y)
getQR(A) = begin
  QR = qr(A)
  return (QR.Q, QR.R)
end
Q, R = getQR(A)
cuQ, cuR = getQR(cuA)

actualSolution = A \ y
solutionFromQR = R \ (Q' * y)
@assert(all(abs.(actualSolution - solutionFromQR) .< 1e-4))
println("all is well thus far")
solutionFromCu = cuA \ cuY # crashes
solutionFromCuQR = cuR \ (cuQ' * cuY) # also crashes

if relevant:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

julia> CUDAdrv.device()
CuDevice(0): GeForce GTX 1050

support for vecnorm

I got following error while executing sum(vecnorm, params(model)), where model is a Flux Chain.

ERROR: LLVM error: Cannot select: 0x559dd22e8a80: f32 = fpow 0x559dd22e91f0, 0x559dd22e8930
  0x559dd22e91f0: f32 = bitcast 0x559dd22e9180
    0x559dd22e9180: i32 = and 0x559dd1bea130, Constant:i32<2147483647>
      0x559dd1bea130: i32,ch = load<LD4[%4]> 0x559dd206c280, 0x559dd1be9870, undef:i64
        0x559dd1be9870: i64,ch = load<LD8[null(addrspace=101)]> 0x559dd206c280, TargetExternalSymbol:i64'julia__26_64205_param_1', undef:i64
          0x559dd1be9800: i64 = TargetExternalSymbol'julia__26_64205_param_1'
          0x559dd1be9720: i64 = undef
        0x559dd1be9720: i64 = undef
      0x559dd22e8460: i32 = Constant<2147483647>
  0x559dd22e8930: f32 = sint_to_fp 0x559dd1be9950
    0x559dd1be9950: i64,ch = load<LD8[null(addrspace=101)]> 0x559dd206c280, TargetExternalSymbol:i64'julia__26_64205_param_2', undef:i64
      0x559dd1be98e0: i64 = TargetExternalSymbol'julia__26_64205_param_2'
      0x559dd1be9720: i64 = undef
In function: julia__26_64205
Stacktrace:
 [1] handle_error(::Cstring) at /home/tejank10/.julia/v0.6/LLVM/src/core/context.jl:100
 [2] macro expansion at /home/tejank10/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
 [3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/tejank10/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
 [4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/tejank10/.julia/v0.6/LLVM/src/targetmachine.jl:42
 [5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/tejank10/.julia/v0.6/CUDAnative/src/jit.jl:393
 [6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
 [7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/tejank10/.julia/v0.6/CUDAnative/src/jit.jl:440
 [8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/tejank10/.julia/v0.6/CUDAnative/src/jit.jl:488
 [9] macro expansion at /home/tejank10/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
 [10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::Flux.Tracker.##26#27, ::CUDAnative.CuDeviceArray{ForwardDiff.Dual{Void,Float32,3},2,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool,Bool},Tuple{},Tuple{}}, ::Tuple{Tuple{Int64,Int64},Tuple{},Tuple{}}, ::CUDAnative.CuDeviceArray{ForwardDiff.Dual{Void,Float32,3},2,CUDAnative.AS.Global}, ::Tuple{Int64,Float32}) at /home/tejank10/.julia/v0.6/CUDAnative/src/execution.jl:80
 [11] _broadcast! at /home/tejank10/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
 [12] broadcast_t at /home/tejank10/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
 [13] broadcast_c at /home/tejank10/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
 [14] broadcast at ./broadcast.jl:455 [inlined]
 [15] tracked_broadcast(::Function, ::TrackedArray{…,CuArray{Float32,2}}, ::Int64, ::Float32) at /home/tejank10/.julia/v0.6/Flux/src/tracker/array.jl:364
 [16] vecnorm(::TrackedArray{…,CuArray{Float32,2}}, ::Int64) at /home/tejank10/.julia/v0.6/Flux/src/tracker/array.jl:224 (repeats 2 times)
 [17] _mapreduce(::Base.LinAlg.#vecnorm, ::Base.#+, ::IndexLinear, ::Array{Any,1}) at ./reduce.jl:273
 [18] sum(::Function, ::Array{Any,1}) at ./reduce.jl:347

The statement works fine on the CPU.

Getting Sparse CuArrays to work with Arpack

I'm trying to diagonalize a sparse CuArray with eigs, which currently fails because it dispatches to the generic matrix-vector multiplication. What would be needed to get this to work?

Is it only a question of defining the correct mul! method?
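Likely yes: eigs only needs the operator to support mul! (plus size and eltype). A toy CPU sketch of the pattern, with a diagonal operator standing in for the sparse GPU matrix (all names here are made up for illustration):

```julia
using LinearAlgebra

# A linear operator defined purely by its action; for the GPU case the
# body of mul! would call the CUSPARSE matrix-vector product instead.
struct DiagOp
    d::Vector{Float64}
end
Base.size(A::DiagOp) = (length(A.d), length(A.d))
Base.size(A::DiagOp, i::Int) = length(A.d)
Base.eltype(::Type{DiagOp}) = Float64

function LinearAlgebra.mul!(y::AbstractVector, A::DiagOp, x::AbstractVector)
    y .= A.d .* x   # the operator's action, element-wise scaling here
    return y
end

y = zeros(3)
mul!(y, DiagOp([1.0, 2.0, 3.0]), ones(3))  # y becomes [1.0, 2.0, 3.0]
```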

@benchmark captures GPU arrays

The code below triggers the following error:

ERROR: CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] macro expansion at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
 [2] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:161
 [3] alloc at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:157 [inlined] (repeats 2 times)
 [4] CuArray{Float32,2}(::Tuple{Int64,Int64}) at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:33
 [5] similar at C:\Users\user\.julia\packages\CUDAdrv\LC5XS\src\array.jl:83 [inlined]
 [6] similar at .\abstractarray.jl:571 [inlined]
 [7] h_bench(::Int64, ::Int64) at .\REPL[4]:7
 [8] macro expansion at .\show.jl:555 [inlined]
 [9] top-level scope at .\REPL[6]:2 [inlined]
 [10] top-level scope at .\none:0

using CUDAdrv, CUDAnative, BenchmarkTools

function kernel_vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

function h(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@cuda threads=12 kernel_vadd(d_a, d_b, d_c)
end

function h_bench(m, n)
	# CUDAdrv functionality: generate and upload data
	a = round.(rand(Float32, (m, n)) * 100)
	b = round.(rand(Float32, (m, n)) * 100)
	d_a = CuArray(a)
	d_b = CuArray(b)
	d_c = similar(d_a)  # output array

	@benchmark @cuda threads=12 kernel_vadd($d_a, $d_b, $d_c)
end

# Works
for i in 1:5
	@show i
	h(10_000, 10_000)
end

# Errors after i = 3
for i in 1:5
	@show i
	h_bench(10_000, 10_000)
end

Support kernels returning Union{}

Now that we support some exceptions, we should treat these kernels as valid and just have them throw at run time. This should make it much easier to debug issues like that -- assuming we can report something useful for eg. MethodErrors.

Pitched pointers

Further to discussion about porting CUFFT.jl to use CUDAdrv (JuliaAttic/CUFFT.jl#12), it would be useful to provide support for CUDA pitched pointers, in addition to standard memory types.

How might such an implementation best be approached, given longer-term plans for this package (e.g., is there still a plan to move to the CuArray implementation)?

Native Softmax

Can be based on PyTorch's.

The API should be equivalent to:

function softmax(x::AbstractArray, axis=1)
    out = exp.(x .- maximum(x, axis))
    out ./= sum(out, axis)
    return out
end
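The sketch above uses the Julia 0.6 reduction API (maximum(x, axis)); on Julia 0.7 and later the dims keyword is required. A runnable, numerically stable version of the same API:

```julia
# Subtracting the per-slice maximum keeps exp from overflowing; the
# result is unchanged because softmax is shift-invariant.
function softmax(x::AbstractArray; dims = 1)
    out = exp.(x .- maximum(x; dims = dims))
    out ./= sum(out; dims = dims)
    return out
end

softmax([1.0 2.0; 3.0 4.0])  # each column sums to 1
```

A native GPU version would additionally want to fuse the max, exp, and sum passes into fewer kernels, which is presumably what basing it on PyTorch's implementation is about.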

Any chance x .^ a, for a > 2.0, will be supported at some point?

The very simple x2 .^ 4 completely fails, and has for as long as I can remember. Any chance this may be supported at some point? It feels like CuArrays is in real shortage of TLC. Just my sad perspective from trying to use it; apologies.

julia> x = rand(Float32, 100);

julia> using Flux

julia> using CuArrays

julia> x2 = x |> gpu;

julia> x2 .^ 4

ERROR: LLVM error: Cannot select: 0x23701520: f32 = fpow 0x237010c0, ConstantFP:f32<4.000000e+00>
  0x237010c0: f32,ch = load<LD4[null(addrspace=101)]> 0x237065b0, TargetExternalSymbol:i64'julia__1_63502_param_0', undef:i64
    0x23700fe0: i64 = TargetExternalSymbol'julia__1_63502_param_0'
    0x23701050: i64 = undef
  0x237014b0: f32 = ConstantFP<4.000000e+00>
In function: julia__1_63502
Stacktrace:
 [1] handle_error(::Cstring) at /home/ubuntu/.julia/v0.6/LLVM/src/core/context.jl:100
 [2] macro expansion at /home/ubuntu/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
 [3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/ubuntu/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
 [4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/ubuntu/.julia/v0.6/LLVM/src/targetmachine.jl:42
 [5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:393
 [6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
 [7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:440
 [8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:488
 [9] macro expansion at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
 [10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##1#2, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{}) at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:80
 [11] _broadcast! at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
 [12] broadcast_t at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
 [13] broadcast_c at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
 [14] broadcast(::Function, ::CuArray{Float32,1}) at ./broadcast.jl:455


julia> x2 .^ 4.0
ERROR: LLVM error: Cannot select: 0x252443c0: f64 = fpow 0x252442e0, ConstantFP:f64<4.000000e+00>
  0x252442e0: f64 = fp_extend 0x25243ef0
    0x25243ef0: f32,ch = load<LD4[null(addrspace=101)]> 0x23494000, TargetExternalSymbol:i64'julia_^_63768_param_0', undef:i64
      0x25243e10: i64 = TargetExternalSymbol'julia_^_63768_param_0'
      0x25243e80: i64 = undef
  0x25244350: f64 = ConstantFP<4.000000e+00>
In function: julia_^_63768
Stacktrace:
 [1] handle_error(::Cstring) at /home/ubuntu/.julia/v0.6/LLVM/src/core/context.jl:100
 [2] macro expansion at /home/ubuntu/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
 [3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/ubuntu/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
 [4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/ubuntu/.julia/v0.6/LLVM/src/targetmachine.jl:42
 [5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:393
 [6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
 [7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:440
 [8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:488
 [9] macro expansion at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
 [10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##3#4, ::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{}) at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:80
 [11] _broadcast! at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
 [12] broadcast_t at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
 [13] broadcast_c at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
 [14] broadcast(::Function, ::CuArray{Float32,1}) at ./broadcast.jl:455

julia> x2 .^ 4.0f0
ERROR: LLVM error: Cannot select: 0x240c3670: f32 = fpow 0x240c3210, ConstantFP:f32<4.000000e+00>
  0x240c3210: f32,ch = load<LD4[null(addrspace=101)]> 0x2497acc0, TargetExternalSymbol:i64'julia__5_63777_param_0', undef:i64
    0x240c3130: i64 = TargetExternalSymbol'julia__5_63777_param_0'
    0x240c31a0: i64 = undef
  0x240c3600: f32 = ConstantFP<4.000000e+00>
In function: julia__5_63777
Stacktrace:
 [1] handle_error(::Cstring) at /home/ubuntu/.julia/v0.6/LLVM/src/core/context.jl:100
 [2] macro expansion at /home/ubuntu/.julia/v0.6/LLVM/src/base.jl:22 [inlined]
 [3] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/ubuntu/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
 [4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/ubuntu/.julia/v0.6/LLVM/src/targetmachine.jl:42
 [5] #mcgen#57(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:393
 [6] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
 [7] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:440
 [8] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/ubuntu/.julia/v0.6/CUDAnative/src/jit.jl:488
 [9] macro expansion at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
 [10] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##5#6, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{}) at /home/ubuntu/.julia/v0.6/CUDAnative/src/execution.jl:80
 [11] _broadcast! at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
 [12] broadcast_t at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
 [13] broadcast_c at /home/ubuntu/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
 [14] broadcast(::Function, ::CuArray{Float32,1}) at ./broadcast.jl:455

julia>

Extend and generalize allowscalar to memory transfers

Idea from Elliot: a mode for asserting or warning about memory transfers, which are often unwanted. E.g., CuArrays.allowtransfer(::Bool), CuArrays.@allowtransfer, CuArrays.@asserttransfer.

We should also think about which packages export these functions. They can live in GPUArrays, like allowscalar does right now, but I feel like users shouldn't have to import any other package except CuArrays (cf. not having to import CUDAdrv in JuliaGPU/CuArrays.jl#179).
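
A minimal sketch of what such a switch could look like, modeled on how allowscalar works today. All names here (allowtransfer, check_transfer, the macro) are hypothetical, not existing CuArrays/GPUArrays API:

```julia
# Hypothetical sketch only; mirrors the allowscalar mechanism.
const _transfer_allowed = Ref(true)

allowtransfer(flag::Bool) = (_transfer_allowed[] = flag)

# Would be called from copyto!(::Array, ::CuArray) and friends:
function check_transfer()
    _transfer_allowed[] || error("memory transfer to/from the GPU is disallowed")
    return nothing
end

# Temporarily re-enable transfers for a single expression:
macro allowtransfer(ex)
    quote
        old = _transfer_allowed[]
        _transfer_allowed[] = true
        try
            $(esc(ex))
        finally
            _transfer_allowed[] = old
        end
    end
end
```

The @asserttransfer variant would just swap the error for an assertion or a warning.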

support for matrix exp or ldiv \ ?

using CuArrays
n = 4
A = randn(n, n) |> cu
exp(A)
ERROR: ReadOnlyMemoryError()                         
Stacktrace:                                          
 [1] exp!(::CuArray{Float32,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/LinearAlgebra/src/lapack.jl:209                                                                            
 [2] exp(::CuArray{Float32,2}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/LinearAlgebra/src/dense.jl:508                                                                              
 [3] top-level scope at none:0                       

I get the same error without a stack trace for A \ cu(randn(n))

(This is using CuArrays#vc/0.7 (a55b12a) depending on GPUArrays#master, since CuArrays#master doesn't seem to support indexing A[i,j] that exp(m::AbstractMatrix) uses)

How much work would it take to support LinearAlgebra functions like these? If it's reasonable for a beginner like myself, any pointers on where to start?

Thanks!
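
Until there's a native method, one workaround that stays on the GPU is to evaluate exp using only operations CuArrays already accelerates (matrix multiplication, addition, scalar broadcasting), e.g. scaling-and-squaring of a truncated Taylor series. This is a generic sketch, not CuArrays API; the term count, norm estimate, and identity construction are illustrative:

```julia
using LinearAlgebra

# exp(A) via scaling and squaring of a truncated Taylor series.
# Only *, +, and scalar broadcasting are used (CUBLAS-backed for
# CuArray), so no scalar indexing is hit. Naive accuracy choices.
function expm_taylor(A::AbstractMatrix{T}; nterms = 20) where {T}
    nrm = opnorm(A, 1)                       # == maximum(sum(abs.(A), dims=1))
    s = nrm > 1 ? ceil(Int, log2(nrm)) : 0   # scale A down to norm <= 1
    B = A ./ T(2)^s
    E = Matrix{T}(I, size(A)...) .+ B        # on the GPU, build the identity there
    term = copy(B)
    for k in 2:nterms
        term = term * B ./ T(k)              # B^k / k!
        E = E .+ term
    end
    for _ in 1:s                             # undo scaling: exp(A) = exp(A/2^s)^(2^s)
        E = E * E
    end
    return E
end
```

For A \ b, the analogous route would be to go through the factorization wrappers (QR/LU via CUSOLVER) rather than the generic LinearAlgebra fallback.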

Vectorization of field access

The code

using CUDAdrv, CUDAnative

k(a,b) = (@inbounds a[1] = b[1]; nothing)

t = CuArray([(0x0,0x0)])
@device_code_sass @cuda k(t, t)

, as well as

struct AAA; x::UInt8; y::UInt8; end
s = CuArray([AAA(0x0,0x0)])
@device_code_sass @cuda k(s,s)

, generate two load and two store instructions, regardless of the data type used (int, float, etc.):

...
        /*0028*/                   LDG.E.U8 R7, [R2+0x1];         /* 0xeed0200000170207 */
        /*0030*/                   LDG.E.U8 R6, [R2];             /* 0xeed0200000070206 */
...
        /*0058*/                   STG.E.U8 [R4], R7;             /* 0xeed8200000070407 */
                                                                  /* 0x001ffc00ffe081f1 */
        /*0068*/                   STG.E.U8 [R4+-0x1], R6;        /* 0xeed82ffffff70406 */
        ...

It would be amazing if the compiler could optimize that to use vectorized memory access in both cases (https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/ update: this link is not totally about what I'm referring to, see comments below).

Implement Base.repeat

Have you implemented Base.repeat for CuArrays? repeat seems to use the slow fallback. Is it hard to do? Thanks!
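
It doesn't appear to be implemented; the call hits the generic AbstractArray method, which indexes element by element. As a stopgap, repetition along one dimension can be expressed as a single getindex with a precomputed index vector, which uses the fast indexing kernels instead. A sketch only — repeat_outer is not a CuArrays function:

```julia
# Workaround sketch: tile the columns of A n times with one indexing
# operation instead of the elementwise fallback.
function repeat_outer(A::AbstractMatrix, n::Integer)
    idx = reduce(vcat, [collect(1:size(A, 2)) for _ in 1:n])  # built on the CPU
    return A[:, idx]
end
```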

High allocations and getindex

Hi, and thanks for your work!
I have a problem that effectively prohibits me from training my Flux model with multiple embedding matrices.
It's hard to avoid allocations when indexing into (Flux-tracked) CuArrays, and my training time mainly becomes GC time.
Views won't work in these examples, which I think illustrate the problem:

using CuArrays

a = UInt16.(collect(1:65535));
W = cu(randn(64, maximum(a))); # These are parameters
f(t, a) = (A = W[:, rand(a, 5000)]; sum([A for _ in 1:t]))

function g(n, a)
    for _ in 1:n
        CuArrays.@time f(1000, a)
    end
end

g(10, a)
  0.486 sec (280.39 k CPU allocs: 14.736 MiB, 11.66% gc time) (1.00 k GPU allocs: 1.192 GiB, 54.35% gc time of which 100.00% spent allocating)
  0.341 sec ( 50.05 k CPU allocs: 1.768 MiB)                  (1.00 k GPU allocs: 1.192 GiB, 95.19% gc time of which 100.00% spent allocating)
  0.398 sec ( 50.07 k CPU allocs: 1.909 MiB)                  (1.00 k GPU allocs: 1.192 GiB, 95.64% gc time of which 100.00% spent allocating)
  0.198 sec ( 79.98 k CPU allocs: 2.314 MiB, 34.47% gc time)  (1.00 k GPU allocs: 1.192 GiB, 58.37% gc time of which 100.00% spent allocating)
  0.016 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.019 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.089 sec ( 78.19 k CPU allocs: 2.282 MiB, 73.77% gc time)  (1.00 k GPU allocs: 1.192 GiB)
  0.012 sec ( 44.92 k CPU allocs: 1.675 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.018 sec ( 45.05 k CPU allocs: 1.677 MiB)                  (1.00 k GPU allocs: 1.192 GiB)
  0.086 sec ( 78.75 k CPU allocs: 2.291 MiB, 75.18% gc time)  (1.00 k GPU allocs: 1.192 GiB)

I think getindex has some issues with Flux backprop as well, as the following script allocates a lot of memory in the Flux.back! calls:

using CuArrays
using Flux

Ea = gpu(param(randn(64, 1_000_000)));
Eb = gpu(param(randn(64, 65_535)));
i = UInt16.(collect(1:5_000));
loss(i,n) = sum(sum(Eb[:, i] .+ Ea[:, rand(1:size(Ea,2), 1)]) for _ in 1:n)

function g(n, t, i)
   for _ in 1:t
      print("loss ")
      CuArrays.@time l = loss(i, n)
      print("back ")
      CuArrays.@time Flux.back!(l)
   end
end

g(100, 10, i)
loss | 0.029 sec | (33.99 k CPU allocs: 2.097 MiB)                 | (600 GPU allocs: 245.150 MiB, 55.07% gc)
back | 0.837 sec | (63.93 k CPU allocs: 2.997 MiB, 31.91% gc time) | (900 GPU allocs:  25.882 GiB, 28.31% gc)
loss | 0.083 sec | (35.31 k CPU allocs: 2.111 MiB)                 | (600 GPU allocs: 245.150 MiB, 78.41% gc)
back | 0.828 sec | (63.90 k CPU allocs: 3.004 MiB, 31.35% gc time) | (900 GPU allocs:  25.882 GiB, 28.60% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.103 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.20% gc)
back | 0.840 sec | (64.20 k CPU allocs: 3.006 MiB, 31.64% gc time) | (900 GPU allocs:  25.882 GiB, 28.13% gc)
loss | 0.076 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.21% gc)
back | 0.830 sec | (63.91 k CPU allocs: 3.005 MiB, 31.28% gc time) | (900 GPU allocs:  25.882 GiB, 28.84% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.106 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.26% gc)
back | 0.835 sec | (64.20 k CPU allocs: 3.004 MiB, 31.16% gc time) | (900 GPU allocs:  25.882 GiB, 28.69% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.39% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.004 MiB, 31.52% gc time) | (900 GPU allocs:  25.882 GiB, 28.82% gc)
loss | 0.052 sec | (34.62 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 72.99% gc)
back | 0.846 sec | (64.20 k CPU allocs: 3.004 MiB, 31.89% gc time) | (900 GPU allocs:  25.882 GiB, 28.35% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.28% gc)
back | 0.834 sec | (63.91 k CPU allocs: 3.004 MiB, 31.21% gc time) | (900 GPU allocs:  25.882 GiB, 28.64% gc)
loss | 0.053 sec | (34.62 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 73.00% gc)
back | 0.844 sec | (64.20 k CPU allocs: 3.004 MiB, 30.82% gc time) | (900 GPU allocs:  25.882 GiB, 28.65% gc)
loss | 0.075 sec | (35.31 k CPU allocs: 2.107 MiB)                 | (600 GPU allocs: 245.150 MiB, 77.24% gc)
back | 0.832 sec | (63.91 k CPU allocs: 3.003 MiB, 30.95% gc time) | (900 GPU allocs:  25.882 GiB, 28.52% gc)

Are there any workarounds I can use?
Please let me know if you need more info.

Thank you so much for looking into this!

Int64 literals vs Int32 constants: avoid conversions & checks

Many constants in the CUDA world are 32-bit, e.g. the warp size, thread or block IDs and dimensions, etc. We don't promote these to Int64 in order to avoid conversions when doing math on them; however, not promoting can be equally expensive because of the conversions introduced when doing math with Int64 literals.

For example, take the following idiomatic code:

function reduce_warp{F<:Function,T}(op::F, val::T)::T
    offset = CUDAnative.warpsize() ÷ 2
    while offset > 0
        val = op(val, shfl_down(val, offset))
        offset ÷= 2
    end
    return val
end

warpsize yields an Int32, but it gets converted and promoted to Int64 because of the ÷ 2. This in turn causes shfl_down, which takes an Int32, to convert it back, including an exactness check + exception (trap):

julia> CUDAnative.code_llvm(reduce_warp, (typeof(+), Int32))

define i32 @julia_reduce_warp_62748(i32) local_unnamed_addr #0 {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.warpsize()
  %2 = icmp slt i32 %1, 2
  br i1 %2, label %L23, label %if.preheader

if.preheader:                                     ; preds = %top
  %3 = lshr i32 %1, 1
  %4 = zext i32 %3 to i64
  br label %if

if:                                               ; preds = %if.preheader, %pass2
  %val.03 = phi i32 [ %9, %pass2 ], [ %0, %if.preheader ]
  %offset.02 = phi i64 [ %10, %pass2 ], [ %4, %if.preheader ]
  %sext = shl i64 %offset.02, 32
  %5 = ashr exact i64 %sext, 32
  %6 = icmp eq i64 %5, %offset.02
  br i1 %6, label %pass2, label %fail1

L23.loopexit:                                     ; preds = %pass2
  br label %L23

L23:                                              ; preds = %L23.loopexit, %top
  %val.0.lcssa = phi i32 [ %0, %top ], [ %9, %L23.loopexit ]
  ret i32 %val.0.lcssa

fail1:                                            ; preds = %if
  tail call void @llvm.trap()
  unreachable

pass2:                                            ; preds = %if
  %7 = trunc i64 %offset.02 to i32
  %8 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %val.03, i32 %7, i32 31)
  %9 = add i32 %8, %val.03
  %10 = lshr i64 %offset.02, 1
  %11 = icmp eq i64 %10, 0
  br i1 %11, label %L23.loopexit, label %if
}

An improved, but less readable version of the same code goes like:

function reduce_warp{F<:Function,T}(op::F, val::T)::T
    offset = CUDAnative.warpsize() ÷ Int32(2)
    while offset > Int32(0)
        val = op(val, shfl_down(val, offset))
        offset ÷= Int32(2)
    end
    return val
end

This yields the following, much cleaner IR:

define i32 @julia_reduce_warp_62749(i32) local_unnamed_addr #0 {
top:
  %1 = tail call i32 @llvm.nvvm.read.ptx.sreg.warpsize()
  %2 = icmp slt i32 %1, 2
  br i1 %2, label %L25, label %if.preheader

if.preheader:                                     ; preds = %top
  br label %if

if:                                               ; preds = %if.preheader, %if
  %offset.03.in = phi i32 [ %offset.03, %if ], [ %1, %if.preheader ]
  %val.02 = phi i32 [ %4, %if ], [ %0, %if.preheader ]
  %offset.03 = sdiv i32 %offset.03.in, 2
  %3 = tail call i32 @llvm.nvvm.shfl.down.i32(i32 %val.02, i32 %offset.03, i32 31)
  %4 = add i32 %3, %val.02
  %5 = icmp slt i32 %offset.03.in, 4
  br i1 %5, label %L25.loopexit, label %if

L25.loopexit:                                     ; preds = %if
  br label %L25

L25:                                              ; preds = %L25.loopexit, %top
  %val.0.lcssa = phi i32 [ %0, %top ], [ %4, %L25.loopexit ]
  ret i32 %val.0.lcssa
}
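
The promotion at fault is plain Julia semantics and can be checked on the host:

```julia
w = Int32(32)                          # stand-in for CUDAnative.warpsize()
@assert typeof(w ÷ 2) == Int64         # the Int64 literal promotes the result
@assert typeof(w ÷ Int32(2)) == Int32  # Int32 literals keep everything 32-bit
```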

CUDA driver device support does not match toolkit

continued from JuliaGPU/CUDAnative.jl#141

I ran the build script with TRACE=true and I'm getting this:

julia> Pkg.build("CUDAnative")
INFO: Building LLVM
DEBUG: Performing package build for LLVM.jl from /home/cody/.julia/reg_and_seg/v0.6/LLVM/deps
DEBUG: Discovering LLVM libraries in /home/cody/src/julia_06/usr/bin/../lib/julia, and configs in /home/cody/src/julia_06/usr/bin, /home/cody/src/julia_06/usr/bin/../tools
TRACE: Looking for libllvm in /home/cody/src/julia_06/usr/bin/../lib/julia
TRACE: Looking for llvm-config in /home/cody/src/julia_06/usr/bin
TRACE: Looking for llvm-config in /home/cody/src/julia_06/usr/bin/../tools
TRACE: - 3.9.1 at /home/cody/src/julia_06/usr/bin/../tools/llvm-config
TRACE: Looking for libllvm in /home/cody/src/julia_06/usr/lib
TRACE: - v3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so
TRACE: - v3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.so
DEBUG: Discovered LLVM toolchains: 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so
DEBUG: Discovering LLVM libraries in , and configs in /home/cody/julia/bin, /home/cody/xmrig/build, /home/cody/src/julia/deps/build, /home/cody/src/antsbin/bin, , /home/cody/bin, /home/cody/.local/bin, /home/cody/bin, /home/cody/.local/bin, /usr/local/sbin, /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin, /usr/games, /usr/local/games, /snap/bin/usr/local/cuda-7.5/bin
TRACE: Looking for llvm-config in /home/cody/julia/bin
TRACE: Looking for llvm-config in /home/cody/xmrig/build
TRACE: Looking for llvm-config in /home/cody/src/antsbin/bin
TRACE: Looking for llvm-config in /usr/local/sbin
TRACE: Looking for llvm-config in /usr/local/bin
TRACE: Looking for llvm-config in /usr/sbin
TRACE: Looking for llvm-config in /usr/bin
TRACE: - 4.0.1 at /usr/bin/llvm-config-4.0
TRACE: Looking for libllvm in /usr/lib/llvm-4.0/lib
TRACE: - v4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.1.so
TRACE: - v4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.so
TRACE: Looking for llvm-config in /sbin
TRACE: Looking for llvm-config in /bin
TRACE: Looking for llvm-config in /usr/games
TRACE: Looking for llvm-config in /usr/local/games
DEBUG: Discovered LLVM toolchains: 4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.1.so
DEBUG: Selecting LLVM from libraries 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true), 4.0.1 at /usr/lib/llvm-4.0/lib/libLLVM-4.0.1.so (bundled: false) and wrappers 3.9.0, 4.0.0
DEBUG: Selected LLVM 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true)
DEBUG: Selecting wrapper for 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true) out of wrappers 3.9.0, 4.0.0
DEBUG: Selected wrapper 3.9 for LLVM 3.9.1 at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so (bundled: true)
DEBUG: Checking validity of existing ext.jl...
INFO: LLVM.jl has already been built for this toolchain, no need to rebuild
INFO: Building CUDAdrv
INFO: Building CUDAnative
TRACE: LLVM.jl is running in trace mode, this will generate a lot of additional output
DEBUG: Checking validity of bundled library at /home/cody/src/julia_06/usr/lib/libLLVM-3.9.1.so
config[:llvm_version] = LLVM.version() = v"3.9.1"
version = v"3.9.1"
target_support = CUDAapi.devices_for_llvm(version) = Set(VersionNumber[v"3.7.0", v"6.2.0", v"6.0.0", v"5.2.0", v"3.5.0", v"5.0.0", v"3.0.0", v"5.3.0", v"6.1.0", v"2.0.0", v"2.1.0", v"3.2.0"])
DEBUG: Dropping down to post-finalizer I/O

Note that I added a few @show statements at the end to show the version and target support info retrieved by the script.

In case it's useful here's my ext.jl

# autogenerated file, do not edit
const ptx_support = VersionNumber[v"3.2.0", v"4.0.0", v"4.1.0", v"4.2.0", v"4.3.0"]
const llvm_version = v"3.9.1"
const cuda_driver_version = v"9.0.0"
const julia_llvm_version = v"3.9.1"
const cuda_toolkit_version = v"7.5.17"
const cuobjdump = "/usr/local/cuda-7.5/bin/cuobjdump"
const julia_version = v"0.6.2"
const target_support = VersionNumber[v"3.0.0", v"3.2.0", v"3.5.0", v"3.7.0", v"5.0.0", v"5.2.0", v"5.3.0"]
const ptxas = "/usr/local/cuda-7.5/bin/ptxas"
const configured = true
const libdevice = Dict(v"3.0.0"=>"/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_30.10.bc",v"3.5.0"=>"/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_35.10.bc",v"5.0.0"=>"/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_50.10.bc")

generators are slow

The following are equivalent, but the second one is much slower:

function logsumexp(X)
    m = maximum(X)
    log(sum(x -> exp(x - m), X)) + m
end

function logsumexp(X)
    m = maximum(X)
    log(sum(exp(x - m) for x in X)) + m
end
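
The two really are numerically equivalent; only the lowering differs, since the generator version goes through Base.Generator and presumably a slower mapreduce path than sum(f, X):

```julia
logsumexp_closure(X) = (m = maximum(X); log(sum(x -> exp(x - m), X)) + m)
logsumexp_gen(X)     = (m = maximum(X); log(sum(exp(x - m) for x in X)) + m)

X = randn(1000)
@assert logsumexp_closure(X) ≈ logsumexp_gen(X) ≈ log(sum(exp.(X)))
```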

Subarrays/views support

I've been trying to make more of the high-level linear algebra routines work similarly to the Base versions. For this to really work well, we'll need support for strided views where stride(A,1) == 1 but stride(A,2) > size(A,1). Currently, I see two approaches to supporting view for CuArrays: 1) make Base's SubArrays work for CuArray, or 2) include stride information in the CuArray to support this.

The CuArray struct already includes a field for offset and it wraps a memory buffer so it might make sense to make CuArrays more general and add a strides field. The cost is that people working explicitly with the buffer would have to be more careful about the strides information. These views would also be less general than Base's SubArrays but I think they'll cover the more important cases.

Last night I tried the other approach, i.e. to just wrap CuArrays in SubArray, but it doesn't work with ccall, and it might require some controversial changes to how CuArrays are converted to Ptr if we want to go in this direction. Basically, the problem is that unsafe_convert(Ptr{T}, CuArray{T}) isn't defined, and therefore you can't pass a SubArray{CuArray} to ccall.

What do people think? Which one is the better solution, or have I missed a third, better option? Personally, I lean towards including stride information in CuArrays: since we already wrap a memory buffer and attach shape information, also doing this as part of a SubArray is kind of double wrapping.
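
For reference, option 2 essentially boils down to carrying a leading dimension next to the buffer; the index arithmetic is cheap. A plain-Julia mock-up (all names hypothetical, with a Vector standing in for the device buffer):

```julia
# Column-major matrix view with stride(A,1) == 1 and an explicit
# leading dimension `ld` (= stride(A,2)), mirroring what a CuArray
# with a strides field would store.
struct StridedView{T} <: AbstractMatrix{T}
    buf::Vector{T}          # stand-in for the CuArray memory buffer
    offset::Int             # elements to skip before the first entry
    dims::Tuple{Int,Int}
    ld::Int                 # leading dimension, >= dims[1]
end
Base.size(A::StridedView) = A.dims
Base.strides(A::StridedView) = (1, A.ld)
Base.getindex(A::StridedView, i::Int, j::Int) =
    A.buf[A.offset + (j - 1) * A.ld + i]

# A 3x2 view of rows 1:3, columns 2:3 of a 4x3 column-major parent:
parent = collect(1.0:12.0)
V = StridedView(parent, 4, (3, 2), 4)   # V == reshape(parent, 4, 3)[1:3, 2:3]
```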

CuArrays allocates a lot of memory on the default GPU

Hello,

In every multi-GPU config I use, I see that one GPU (the default) has significantly higher memory usage. I don't know if this is me missing some configuration setting or a bug/missing feature. I would really appreciate hints on how to solve this.

The trouble is that (for a particular algorithm) the memory overhead all piled up on a single GPU makes it impossible to actually use the default GPU for calculations: there is no memory left.

Here is a typical setup:
Win 10
Julia 1.0.3
CUDAdrv v0.8.6
CUDAnative v0.9.1
CuArrays v0.8.1
(the above combination of packages is the latest I could get working together)
CPU: 9900k, with all Windows graphics running off the built-in GPU, so the discrete GPUs have 0 load / 0 memory usage before we start Julia

4x2070 8GB GPUs (but have same problem on a 2 gpu configs, different models)

Here is the simplest example (it prepares 16 Julia workers, 4 per GPU):

"master" code

using Distributed
addprocs(15)
@everywhere include("worker_code.jl")

and content of the worker_code.jl is

using CUDAdrv: CuDevice, CuContext, DeviceSet
dev = CuDevice((myid() % length(DeviceSet())))
ctx = CuContext(dev)
println("Running on ", dev)

using CuArrays

As you can see, the code above doesn't create anything custom yet; it just prepares the basics. Yet the memory usage on the default GPU is 2.9 GB, versus only 0.4 GB on the three others.

I would have expected CuArrays to distribute the memory it needs evenly across GPUs (taking into account the device or context we set in CUDAdrv). Is this possible with some flags?

Improve vectorization of (shared) memory accesses

Given the following MWE:

#define BLOCK_SIZE 5

__global__ void kernel() {
    __shared__ float    dia[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float    peri_col[BLOCK_SIZE][BLOCK_SIZE];

    int idx = threadIdx.x;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        for (int j = 0; j < i; j++)
            peri_col[idx][i] -= dia[j][i];
        peri_col[idx][i] /= dia[i][i];
    }
}

nvcc generates PTX with loads clustered together:

.version 6.2
.target sm_35
.address_size 64

	// .globl	kernel
// kernel_dia has been demoted
// kernel_peri_col has been demoted

.visible .entry kernel(

)
{
	.reg .f32 	%f<36>;
	.reg .b32 	%r<4>;
	// demoted variable
	.shared .align 4 .b8 kernel_dia[100];
	// demoted variable
	.shared .align 4 .b8 kernel_peri_col[100];

	mov.u32 	%r1, %tid.x;
	mov.u32 	%r2, kernel_peri_col;
	mad.lo.s32 	%r3, %r1, 20, %r2;
	ld.shared.f32 	%f1, [%r3];
	ld.shared.f32 	%f2, [kernel_dia];
	div.rn.f32 	%f3, %f1, %f2;
	ld.shared.f32 	%f4, [%r3+4];
	ld.shared.f32 	%f5, [kernel_dia+4];
	ld.shared.f32 	%f6, [kernel_dia+24];
	ld.shared.f32 	%f7, [%r3+8];
	ld.shared.f32 	%f8, [kernel_dia+8];
	ld.shared.f32 	%f9, [kernel_dia+28];
	ld.shared.f32 	%f10, [kernel_dia+48];
	ld.shared.f32 	%f11, [%r3+12];
	ld.shared.f32 	%f12, [kernel_dia+12];
	ld.shared.f32 	%f13, [kernel_dia+32];
	ld.shared.f32 	%f14, [kernel_dia+52];
	ld.shared.f32 	%f15, [kernel_dia+72];
	ld.shared.f32 	%f16, [%r3+16];
	ld.shared.f32 	%f17, [kernel_dia+16];
	ld.shared.f32 	%f18, [kernel_dia+36];
	ld.shared.f32 	%f19, [kernel_dia+56];
	ld.shared.f32 	%f20, [kernel_dia+76];
	ld.shared.f32 	%f21, [kernel_dia+96];
	st.shared.f32 	[%r3], %f3;
	sub.f32 	%f22, %f4, %f5;
	div.rn.f32 	%f23, %f22, %f6;
	st.shared.f32 	[%r3+4], %f23;
	sub.f32 	%f24, %f7, %f8;
	sub.f32 	%f25, %f24, %f9;
	div.rn.f32 	%f26, %f25, %f10;
	st.shared.f32 	[%r3+8], %f26;
	sub.f32 	%f27, %f11, %f12;
	sub.f32 	%f28, %f27, %f13;
	sub.f32 	%f29, %f28, %f14;
	div.rn.f32 	%f30, %f29, %f15;
	st.shared.f32 	[%r3+12], %f30;
	sub.f32 	%f31, %f16, %f17;
	sub.f32 	%f32, %f31, %f18;
	sub.f32 	%f33, %f32, %f19;
	sub.f32 	%f34, %f33, %f20;
	div.rn.f32 	%f35, %f34, %f21;
	st.shared.f32 	[%r3+16], %f35;
	ret;
}

Note the pessimistic alignment of 4 for the shared memory arrays. I guess this follows the C spec, but it rules out vectorization. However, when ptxas assembles this code, it knows the memory layout and "physical" alignment of the shared memory arrays, and as a result nicely vectorizes this code:

MOV R1, c[0x0][0x44];                               
S2R R0, SR_TID.X;                                   
MOV32I R3, 0x64;                                    
IMAD R0, R0, 0x14, R3;                              
LDS R10, [RZ];                                      
LDS R5, [R0];                                       
CAL 0x170;                                          
                                                           
LDS.128 R4, [RZ];                                   
LDS.128 R8, [0x10];                                 
LDS R20, [R0+0x4];                                  
LDS.128 R12, [0x30];                                
LDS R18, [0x60];                                    
LDS.64 R16, [0x20];                                 
LDS.64 R2, [0x48];                                  
                                                           
LDS R19, [R0+0x8];                                  
FADD R5, R20, -R5;                                  
LDS R9, [R0+0xc];                                   
LDS R4, [R0+0x10];                                  
STS [R0], R25;                                      
CAL 0x170;                                          
STS [R0+0x4], R25;

We tackle this differently, using LLVM's Load Store Vectorizer by specifying a much more optimistic alignment that enables vectorization (JuliaGPU/CUDAnative.jl#204):

const BLOCK_SIZE = 5

function kernel()
    dia = @cuStaticSharedMem(Float32, (BLOCK_SIZE,BLOCK_SIZE))
    peri_col = @cuStaticSharedMem(Float32, (BLOCK_SIZE,BLOCK_SIZE))

    index = threadIdx().x
    for i = 1:BLOCK_SIZE
        for j = 1:i-1
            peri_col[i, index] -= dia[i, j]
        end
        peri_col[i, index] /= dia[i, i]
    end
end
.version 6.0
.target sm_35
.address_size 64

	// .globl	ptxcall_kernel_2
// shmem1 has been demoted
// shmem2 has been demoted

.visible .entry ptxcall_kernel_2(
	.param .align 8 .b8 ptxcall_kernel_2_param_0[24],
	.param .u64 ptxcall_kernel_2_param_1
)
{
	.reg .f32 	%f<36>;
	.reg .b32 	%r<3>;
	.reg .b64 	%rd<4>;
	// demoted variable
	.shared .align 16 .b8 shmem1[100];
	// demoted variable
	.shared .align 16 .b8 shmem2[100];
	mov.u32 	%r1, %tid.x;
	mul.lo.s32 	%r2, %r1, 5;
	mul.wide.u32 	%rd1, %r2, 4;
	mov.u64 	%rd2, shmem2;
	add.s64 	%rd3, %rd2, %rd1;
	ld.shared.f32 	%f1, [%rd3];
	ld.shared.v4.f32 	{%f2, %f3, %f4, %f5}, [shmem1];
	div.rn.f32 	%f6, %f1, %f2;
	st.shared.f32 	[%rd3], %f6;
	ld.shared.f32 	%f7, [%rd3+4];
	sub.f32 	%f8, %f7, %f3;
	ld.shared.f32 	%f9, [shmem1+24];
	div.rn.f32 	%f10, %f8, %f9;
	st.shared.f32 	[%rd3+4], %f10;
	ld.shared.f32 	%f11, [%rd3+8];
	sub.f32 	%f12, %f11, %f4;
	ld.shared.f32 	%f13, [shmem1+28];
	sub.f32 	%f14, %f12, %f13;
	ld.shared.f32 	%f15, [shmem1+48];
	div.rn.f32 	%f16, %f14, %f15;
	st.shared.f32 	[%rd3+8], %f16;
	ld.shared.f32 	%f17, [%rd3+12];
	sub.f32 	%f18, %f17, %f5;
	ld.shared.f32 	%f19, [shmem1+32];
	sub.f32 	%f20, %f18, %f19;
	ld.shared.f32 	%f21, [shmem1+52];
	sub.f32 	%f22, %f20, %f21;
	ld.shared.v2.f32 	{%f23, %f24}, [shmem1+72];
	div.rn.f32 	%f25, %f22, %f23;
	st.shared.f32 	[%rd3+12], %f25;
	ld.shared.f32 	%f26, [%rd3+16];
	ld.shared.f32 	%f27, [shmem1+16];
	sub.f32 	%f28, %f26, %f27;
	ld.shared.f32 	%f29, [shmem1+36];
	sub.f32 	%f30, %f28, %f29;
	ld.shared.f32 	%f31, [shmem1+56];
	sub.f32 	%f32, %f30, %f31;
	sub.f32 	%f33, %f32, %f24;
	ld.shared.f32 	%f34, [shmem1+96];
	div.rn.f32 	%f35, %f33, %f34;
	st.shared.f32 	[%rd3+16], %f35;
	ret;
}
MOV R1, c[0x0][0x44];                             
S2R R0, SR_TID.X;                                 
LDS.128 R4, [RZ];                                 
IMUL32I R0, R0, 0x5;                              
ISCADD R0, R0, 0x70, 0x2;                         
LDS R2, [R0];                                     
CAL 0x178;                                        

LDS R2, [R0+0x4];                                 
STS [R0], R13;                                    
LDS R4, [0x18];                                   
FADD R2, -R5, R2;                                 
CAL 0x178;                                        
LDS R2, [R0+0x8];                                 
STS [R0+0x4], R13;                                

LDS R3, [0x1c];                                   
FADD R2, -R6, R2;                                 
LDS R4, [0x30];                                   
FADD R2, R2, -R3;                                 
CAL 0x178;                                        
LDS R2, [R0+0xc];                                 
STS [R0+0x8], R13;                                

LDS R3, [0x20];                                   
FADD R2, -R7, R2;                                 
LDS R6, [0x34];                                   
LDS.64 R4, [0x48];                                
FADD R2, R2, -R3;                                 
FADD R2, R2, -R6;                                 
CAL 0x178;                                        

STS [R0+0xc], R13;                                
LDS R2, [R0+0x10];                                
LDS R3, [0x10];                                   
LDS R4, [0x24];                                   
LDS R6, [0x38];                                   
FADD R2, R2, -R3;                                 
FADD R2, R2, -R4;                                 

LDS R4, [0x60];                                   
FADD R2, R2, -R6;                                 
FADD R2, -R5, R2;                                 
CAL 0x178;                                        
STS [R0+0x10], R13;                               
EXIT;                                             
FCHK.DIVIDE P0, R2, R4; 

While our PTX contains vectorized loads now (16, vs 20 with nvcc), the resulting SASS still performs more loads (remaining at 16, as we don't cluster our loads for ptxas to vectorize, vs only 12). This is due to ptxas vectorizing load chains with gaps, eg. LDS.128 R8, [0x10] corresponding with loading elements 4 to 7 from kernel_dia, while kernel_dia+20 isn't actually used by the code (and R9 isn't either in the SASS code). LLVM refuses to vectorize these accesses, emitting 3 loads instead (it should have still been able to vectorize accesses to elements 6 and 7, looking into that right now).

So this probably needs a fix to the Load Store Vectorizer to vectorize chains with gaps.
Alternatively, we could cluster loads like nvcc does, but the existing load/store clustering DAG mutation isn't quite as aggressive, and as such only enables ptxas to merge a couple of loads (maybe there are some missing bits, since NVPTX hasn't been using the machine scheduler before):

@@ NVPTXTargetMachine.cpp

class NVPTXPassConfig : public TargetPassConfig {
+  ScheduleDAGInstrs *
+  createMachineScheduler(MachineSchedContext *C) const override {
+    ScheduleDAGMILive *DAG = createGenericSchedLive(C);
+    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
+    DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
+    return DAG;
+  }
};

@@ NVPTXSubtarget.h

class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
+  bool enableMachineScheduler() const override { return true; }
};

cc @jlebar
reduced from rodinia/lud
ref https://lists.llvm.org/pipermail/llvm-dev/2018-June/124209.htm

Disable at-nospecialize for GPU codegen

Now that we have a GPU runtime library that can allocate and box, I tried to get rid of this hack: https://github.com/JuliaGPU/CUDAnative.jl/blob/53368d48b6405ee962e54d4c3b9f90e3eb623310/src/compiler/irgen.jl#L229-L236

Turns out we still need it, as the argument to throw often is the value returned by the BoundsError constructor, which has a @nospecialize resulting in a jl_invoke. LLVM obviously can't remove this function by itself, so we end up with GPU-incompatible code. Just try cu([1]) .+ cu([2]) with --check-bounds=yes (and the above hack disabled, of course).

We could try and redefine BoundsError or throw_boundserror as it occurs for GPU code, e.g.:

using LinearAlgebra, Adapt, CUDAnative

for (W, _) in (:AT => (A,mut)->mut(A), Adapt.wrappers...)
    @eval begin
        # BoundsError has a @nospecialize constructor resulting in `invoke`
        Base.throw_boundserror(A::$W where {AT <: CuDeviceArray}, I) = throw(nothing)
    end
end

But then we still miss cases thrown from the broadcasting code (and I couldn't find a way to dispatch on throw_boundserror(Broadcasted{...CuDeviceArray...})). It would be best to just get rid of @nospecialize for GPU code altogether.

Not sure whether that would need to happen at the inference/optimizer/codegen level though. Monkey-patching MethodInstances from within CUDAnative's emit_function hook didn't seem to work.

@Keno did you do something similar for XLA, since you worked on more exhaustive inference there?
Are those interfaces public already?

Don't throw unconditionally when matrix is singular

Following up on https://github.com/JuliaGPU/CuArrays.jl/issues/107#issuecomment-414659174. The base behavior is now

julia> lu(zeros(2,2), check = true)
ERROR: SingularException(1)
Stacktrace:
 [1] checknonsingular at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/factorization.jl:12 [inlined]
 [2] #lu!#97(::Bool, ::Function, ::Array{Float64,2}, ::Val{true}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/lu.jl:41
 [3] #lu! at ./none:0 [inlined]
 [4] #lu#101 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/lu.jl:142 [inlined]
 [5] #lu at ./none:0 [inlined] (repeats 2 times)
 [6] top-level scope at none:0

julia> lu(zeros(2,2), check = false)
Failed factorization of type LU{Float64,Array{Float64,2}}
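
The CUSOLVER-backed methods here could mirror that: forward check and, when it is false, record the info result instead of throwing, so that caller-side code like the following works the same for CuArray as for Array (shown with a plain Array):

```julia
using LinearAlgebra

F = lu(zeros(2, 2), check = false)   # no throw on a singular matrix
if issuccess(F)
    # safe to use F \ b, det(F), ...
else
    # handle the singular case explicitly
end
```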

Feature Request: `mapreduce` over cartesian indices

mapreduce over CartesianIndices now falls back to the one in Base (as it should).
I find this is also a useful interface for CUDA programming; is there an existing approach to get similar functionality?
If not, where can I start in order to support this kind of interface?

Here is a minimal example, calculating the mean pairwise distance:

julia> using CuArrays
julia> using CUDAnative
julia> function expect(f, xs::CuVector, ys::CuVector)
           M = length(xs)
           N = length(ys)
           ci = CartesianIndices((M, N))
           mapreduce(inds->f(xs[inds[1]], ys[inds[2]]), +, ci)/M/N
       end

julia> @benchmark expect((x,y)->abs2(x-y), randn(100)|>cu, randn(100)|>cu)
BenchmarkTools.Trial: 
  memory estimate:  3.97 MiB
  allocs estimate:  100033
  --------------
  minimum time:     182.878 ms (0.00% GC)
  median time:      185.717 ms (0.00% GC)
  mean time:        190.299 ms (1.61% GC)
  maximum time:     246.243 ms (24.11% GC)
  --------------
  samples:          27
  evals/sample:     1
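
Until mapreduce over CartesianIndices gets a GPU method, the same pairwise reduction can be phrased as a broadcast over an implicit M×N grid followed by sum, both of which CuArrays runs on-device, avoiding the scalar xs[inds[1]] indexing of the fallback:

```julia
# Workaround sketch: broadcast f over the grid formed by a column
# vector and a row vector, then reduce. No scalar indexing involved.
expect_bc(f, xs::AbstractVector, ys::AbstractVector) =
    sum(f.(xs, permutedims(ys))) / (length(xs) * length(ys))
```

E.g. expect_bc((x, y) -> abs2(x - y), xs, ys); note the intermediate M×N array costs memory, unlike a fused mapreduce kernel would.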

sync_threads() appears to not be sync'ing threads

My setup:
GTX 1080 Ti
CUDA v9, driver v396.26
CUDAnative v0.9.1
Julia v1.0.2

I have been experimenting with @maetshju's implementation of CTC loss from this Flux Pull Request. I noticed that when using multiple threads to execute one of the CUDA kernels, the results differ from single-thread. After extensive debugging, I discovered that adding @cuprintf calls immediately after two of the sync_threads() calls causes the multi-threaded code to produce the same results as the single-threaded code. It appears that the sync_threads() calls are not actually sync'ing the threads in this particular instance (and @cuprintf seems to have the side-effect of forcing thread synchronization).

Unfortunately, I haven't been able to reproduce this issue in a simple code example, so I have attached my version of the entire CTC implementation with a test case that reproduces the problem:
ctc.jl.txt

To reproduce:

julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 1 grad[8]: -0.829811   accum[8]: -3.782628
thread 1 grad[9]: 0.164570   accum[9]: -6.222818

multi-thread intermediate printouts:
thread 2 grad[8]: 0.090031   accum[8]: -inf
thread 3 grad[9]: 0.244728   accum[9]: -inf
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
thread 1 grad[7]: 0.665241   accum[7]: -inf

max grad diff: 9.20e-01
        single-thread grad: -0.82981
        multi-thread grad: 0.09003

The lines beginning with "thread" are printouts of intermediate results from the computeBetasAndGrads kernel. In the multi-thread case, threads 2 and 3 should not be printing out values for grad and accum (line 168 in ctc.jl) before thread 1 has finished initializing accum (line 147). The maximum difference between the single- and multi-thread gradients is about 0.9.

Then modify ctc.jl by uncommenting the two @cuprintf calls on lines 152 and 240 ("sync 1" and "sync 2" printouts) and re-run:

julia> include("ctc.jl")
single-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 1 grad[8]: -0.829811   accum[8]: -3.782628
thread 1 grad[9]: 0.164570   accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 1

multi-thread intermediate printouts:
thread 1 accum[9] = -inf
thread 1 accum[7] = -inf
thread 1 accum[9] = -inf
thread 1 accum[8] = -3.782628
thread 1 accum[9] = -6.222818
sync 1 thread: 1
sync 1 thread: 2
sync 1 thread: 3
thread 1 grad[7]: 0.665241   accum[7]: -inf
thread 2 grad[8]: -0.829811   accum[8]: -3.782628
thread 3 grad[9]: 0.164570   accum[9]: -6.222818
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3
sync 2 thread: 1
sync 2 thread: 2
sync 2 thread: 3

max grad diff: 0.00e+00
        single-thread grad: -0.31767
        multi-thread grad: -0.31767

Now thread 1 finishes initializing accum before the other threads begin calculating values for grad. The single- and multi-thread gradients are now identical.

So, although I suspect the actual problem is something else, this example does seem to indicate that the sync_threads() calls on lines 151 and 239 are simply not working.

Is a ccall within a CUDA kernel guaranteed to be synchronous with respect to the kernel thread?
Do you have any suggestions for how to debug this further?

Thanks!
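
For reference, here is a minimal sketch of the barrier semantics the kernel relies on (one thread initializes shared state, everyone reads after the barrier); it is an illustration of the expected behavior, not a repro of the bug:

```julia
using CUDAnative

# Thread 1 writes shared memory; sync_threads() is a block-wide barrier,
# after which every thread may safely read the initialized value.
function barrier_kernel(out)
    s = @cuStaticSharedMem(Float32, 1)
    if threadIdx().x == 1
        s[1] = 42f0
    end
    sync_threads()                 # every thread in the block must reach this
    out[threadIdx().x] = s[1]      # safe only after the barrier
    return nothing
end
```

One failure mode worth ruling out in computeBetasAndGrads: sync_threads() has undefined behavior if any thread of the block does not reach it, e.g. when it sits inside a divergent branch or some threads return early. That produces symptoms that look exactly like a barrier that "does not work".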

segmented reduction

I would like to ask if someone would be willing to write function(s) for segmented reductions. These functions are handy for multiple-instance learning problems, where each sample is represented by an unordered set of vectors (each sample can have a different number of vectors). It would be nice to have these functions available for use with Flux.

My implementation of the normal Julia code for the forward pass looks like

function segmented_max(x::Matrix{T},bags::Vector) where{T}
  @assert all(b -> checkbounds(Bool, x, :, b), bags)  # every bag must index valid columns of x
  o = zeros(eltype(x),size(x,1),length(bags))
  fill!(o,typemin(T));
  @inbounds for i in 1:length(bags) #iterate over bags
    for j in bags[i]  #iterate over items (vectors) in bags
      for k in 1:size(x,1)
        if x[k,j]>o[k,i]
          o[k,i]=x[k,j];
        end
      end
    end
  end
  o
end

while for the reverse part it is

function segmented_max_back(x::Matrix{T}, bags::Vector, Δ) where {T}  # hypothetical signature; omitted in the original snippet
    @assert all(b -> checkbounds(Bool, x, :, b), bags)
    maxI=zeros(Int,size(x,1),length(bags))
    gx = zeros(x)
    for i in 1:length(bags) #iterate over bags
        bagsize=length(bags[i])
        for j in bags[i]  #iterate over subbags
            for k in 1:size(x,1)
                if x[k,j]>gx[k,i]
                    gx[k,i]=x[k,j];
                    maxI[k,i]=j;
                end
            end
        end
    end
    
    fill!(gx,0)
    @inbounds for I in CartesianRange(size(Δ))
        if maxI[I]>0 
            gx[I[1],maxI[I]]=Δ[I];    
        end
    end
    gx
end

I have to confess that I have never programmed for Cuda, therefore I am writing here.
Thanks for any help.
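
As a starting point, here is an untested kernel sketch. It assumes the ragged bags are flattened into starts/stops index vectors on the device (a hypothetical layout) and launches one thread per output element:

```julia
using CUDAnative

# One thread per (feature k, bag i) output element; each thread scans
# its bag's column range of x and keeps the maximum.
function segmented_max_kernel(o, x, starts, stops)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    idx > length(o) && return nothing
    k = (idx - 1) % size(o, 1) + 1   # feature index (column-major decode)
    i = (idx - 1) ÷ size(o, 1) + 1   # bag index
    m = typemin(eltype(o))
    for j in starts[i]:stops[i]
        @inbounds m = max(m, x[k, j])
    end
    @inbounds o[idx] = m
    return nothing
end
```

This would be launched with roughly cld(length(o), threads) blocks via @cuda; the backward pass could record the argmax column per output element in the same sweep, mirroring the maxI bookkeeping of the CPU code above.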

versioninfo()

Add a versioninfo method that prints version and device information, like the ones in Base and LinearAlgebra.
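
A rough sketch of what such a method could print; the accessors used (CUDAdrv.version, CUDAdrv.devices, CUDAdrv.name) are what CUDAdrv exposes, but the output format here is just a suggestion:

```julia
using CUDAdrv

# Suggested shape only: report the driver version plus visible devices.
function versioninfo(io::IO = stdout)
    println(io, "CUDA driver version: ", CUDAdrv.version())
    for dev in CUDAdrv.devices()
        println(io, "  device: ", CUDAdrv.name(dev))
    end
end
```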

Cannot select `__powidf2` while lowering `powi.f64`

Repro:

@target ptx function foo(n)
    return 1.0^n
end
code_native(foo, Tuple{Int})

Generated IR:

define double @julia_foo_64487(i64) #0 !dbg !6 {
top:
  %sext = shl i64 %0, 32
  %1 = ashr exact i64 %sext, 32
  %2 = icmp eq i64 %1, %0
  br i1 %2, label %pass, label %fail

fail:                                             ; preds = %top
  call void @llvm.trap() #2
  unreachable

pass:                                             ; preds = %top
  %3 = trunc i64 %0 to i32
  %4 = call double @llvm.powi.f64(double 1.000000e+00, i32 %3)
  ret double %4
}

versioninfo():

Julia Version 0.6.0-dev.884
Commit 34d11b0* (2016-09-14 21:55 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.0 (ORCJIT, ivybridge)
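
A possible workaround until powi lowering is handled: if the libdevice wrappers are available, calling them directly side-steps the llvm.powi intrinsic (this assumes CUDAnative.pow with an Int32 exponent, which maps onto __nv_powi):

```julia
using CUDAnative

# Hypothetical workaround: route the integer power through libdevice
# instead of relying on LLVM to lower llvm.powi for NVPTX.
@target ptx function foo(n)
    return CUDAnative.pow(1.0, Int32(n))
end
```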

Batched strided GEMM tests fail

Pkg.test("CuArrays") fails, and I was able to trace the source of the error to the test below. (All other tests pass if I remove this test.)

@testset "gemm_strided_batched with element type $elty" for elty in [Float32, Float64, ComplexF32, ComplexF64]

The function CuArrays.CUBLAS.gemm_strided_batched! fails with repeated calls.
The function CuArrays.CUBLAS.gemm_batched! works with repeated calls.

I think this is because there is an issue with using A::CuArray{$elty, 3} in the call to

@check ccall(($(string(fname)), libcublas), cublasStatus_t,
             (cublasHandle_t, cublasOperation_t, cublasOperation_t,
              Cint, Cint, Cint, Ptr{$elty}, Ptr{$elty}, Cint, Cint,
              Ptr{$elty}, Cint, Cint, Ptr{$elty}, Ptr{$elty}, Cint, Cint, Cint),
             handle(), cutransA, cutransB, m, n, k,
             [alpha], A, lda, strideA, B, ldb, strideB, [beta],
             C, ldc, strideC, batchCount)
C

I can recreate the error above by running the following in Jupyter:

using CUDAnative, CuArrays
using CuArrays.CUBLAS

m = 20
n = 35
k = 13
elty= Float32
nbatch = 10

# generate matrices
alpha = one(elty)
beta = one(elty)
A = rand(elty, m, k, nbatch)
B = rand(elty, k, n, nbatch)
C = rand(elty, m, n, nbatch)

# move to device
d_A = CuArray{elty, 3}(A)
d_B = CuArray{elty, 3}(B)
d_C = CuArray{elty, 3}(C)

CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', alpha, d_A, d_B, beta, d_C)

CuArrays.CUBLAS.gemm_strided_batched!('N', 'N', alpha, d_A, d_B, beta, d_C)

Here is the error from running the test

gemm_strided_batched with element type Float64: Error During Test at "juliahome"\.julia\dev\CuArrays\test\blas.jl:659
  Got exception outside of a @test
  CUDA error: an illegal memory access was encountered (code #85, ERROR_ILLEGAL_ADDRESS)
  Stacktrace:
   [1] macro expansion at "juliahome"\.julia\packages\CUDAdrv\LC5XS\src\base.jl:147 [inlined]
   [2] #download!#11(::Bool, ::Function, ::Ptr{Float64}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at "juliahome"\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:254
   [3] download! at "juliahome"\.julia\packages\CUDAdrv\LC5XS\src\memory.jl:248 [inlined] (repeats 2 times)
   [4] unsafe_copyto! at "juliahome"\.julia\dev\CuArrays\src\array.jl:166 [inlined]
   [5] copyto!(::Array{Float64,3}, ::CuArray{Float64,3}) at "juliahome"\.julia\packages\GPUArrays\HmVTY\src\abstractarray.jl:110
   [6] Array(::CuArray{Float64,3}) at .\array.jl:497
   [7] macro expansion at "juliahome"\.julia\dev\CuArrays\test\blas.jl:690 [inlined]
   [8] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1156 [inlined]
   [9] macro expansion at "juliahome"\.julia\dev\CuArrays\test\blas.jl:659 [inlined]
   [10] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1083 [inlined]
   [11] macro expansion at "juliahome"\.julia\dev\CuArrays\test\blas.jl:546 [inlined]
   [12] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1083 [inlined]
   [13] top-level scope at "juliahome"\.julia\dev\CuArrays\test\blas.jl:3
   [14] include at .\boot.jl:317 [inlined]
   [15] include_relative(::Module, ::String) at .\loading.jl:1044
   [16] include(::Module, ::String) at .\sysimg.jl:29
   [17] include(::String) at .\client.jl:392
   [18] macro expansion at "juliahome"\.julia\dev\CuArrays\test\runtests.jl:21 [inlined]
   [19] macro expansion at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Test\src\Test.jl:1083 [inlined]
   [20] top-level scope at "juliahome"\.julia\dev\CuArrays\test\runtests.jl:19
   [21] include at .\boot.jl:317 [inlined]
   [22] include_relative(::Module, ::String) at .\loading.jl:1044
   [23] include(::Module, ::String) at .\sysimg.jl:29
   [24] include(::String) at .\client.jl:392
   [25] top-level scope at none:0
   [26] eval(::Module, ::Any) at .\boot.jl:319
   [27] exec_options(::Base.JLOptions) at .\client.jl:243
   [28] _start() at .\client.jl:425
