
CUBLAS.jl's People

Contributors

adambrewster, denizyuret, femtocleaner[bot], jakebolewski, keno, kristofferc, kshyatt, maleadt, mikeinnes, mikhail-j, simondanisch, tkelman, vchuravy


CUBLAS.jl's Issues

Problem with WaitForEvents

I think this is an issue in CLBLAS, although the error is being raised in OpenCL's api.jl.

I was looking at the clblasSaxpy.jl example for a talk I'm giving on OpenCL and got:

julia> cl.api.clWaitForEvents(cl.cl_uint(1), ptrEvent)
WARNING: ccall Ptr argument types must now match exactly, or be Ptr{Void}.
in clWaitForEvents at C:\Users\Malcolm.julia\v0.3\OpenCL\src\api.jl:22

The code still seems to execute; I have not looked at the other examples yet.
I'm running on a Lenovo i7 laptop with a GeForce 830M on Windows.
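A hedged sketch of the kind of change the warning asks for (this is an assumption about the fix, not the actual OpenCL.jl patch; `event_handle` is a hypothetical event obtained from the clblasSaxpy call):

```julia
# Hedged sketch: Julia 0.3 tightened ccall pointer checking, so the
# event argument must be passed as a pointer whose element type matches
# the declared Ptr type in the ccall signature exactly (or the signature
# must declare Ptr{Void}).
evts = [event_handle]                        # Vector with the matching element type
cl.api.clWaitForEvents(cl.cl_uint(1), pointer(evts))
```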

no more BlasChar

Just an FYI: the latest Julia 0.4 has removed BlasChar. I can make the code work by providing a typealias as a workaround. I am not sure what the proper way is to support multiple Julia versions in a package...
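A minimal sketch of such a version-conditional workaround (the exact module path of the old definition is an assumption here):

```julia
# Hedged sketch: keep a BlasChar name available across Julia versions.
# On 0.3 reuse the Base definition if present; on 0.4, where it was
# removed, alias it to Char, which is what BlasChar was.
if isdefined(Base.LinAlg, :BlasChar)          # module path is an assumption
    import Base.LinAlg: BlasChar
else
    typealias BlasChar Char
end
```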

Info about upcoming removal of packages in the General registry

As described in https://discourse.julialang.org/t/ann-plans-for-removing-packages-that-do-not-yet-support-1-0-from-the-general-registry/ we are planning on removing packages that do not support 1.0 from the General registry. This package has been detected as not supporting 1.0 and is thus slated to be removed. The removal of packages from the registry will happen approximately a month after this issue was opened.

To transition to the new Pkg system using Project.toml, see https://github.com/JuliaRegistries/Registrator.jl#transitioning-from-require-to-projecttoml.
To then tag a new version of the package, see https://github.com/JuliaRegistries/Registrator.jl#via-the-github-app.

If you believe this package has erroneously been detected as not supporting 1.0 or have any other questions, don't hesitate to discuss it here or in the thread linked at the top of this post.

Support for multi-GPU Parallelism

It would be great to have support in this package for multi-GPU parallelism. I devised a rather hacky way to accomplish this and wrote it up here. That particular write-up targets the CUSPARSE package, but the implementation would be nearly identical for CUBLAS. I'm not certain my approach is the best one, and I'd be happy to work on getting this added to the package. Since I've never contributed to a package before, though, it would be helpful to correspond a bit before just putting up a pull request.

How does the linked implementation look? Any comments, thoughts, or suggestions? Obviously, what's posted there is just the rudiments of an implementation: a single function, and without even all of the functionality for that one.

See also this discussion of much the same issue on the Julia CUSPARSE GitHub page here.
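For concreteness, here is one hedged sketch of what such multi-GPU splitting could look like with the CUDArt devices/device pattern used elsewhere in these issues (the column-blocking scheme and all details are assumptions, not the linked implementation):

```julia
# Hedged sketch: split a gemm across all available GPUs by giving each
# device a contiguous block of B's columns, then gathering on the host.
using CUDArt, CUBLAS

function multi_gpu_gemm(A::Matrix{Float32}, B::Matrix{Float32})
    C = Array(Float32, size(A, 1), size(B, 2))
    devices(dev -> true) do devlist
        # column boundaries: one block per device
        cols = round(Int, linspace(0, size(B, 2), length(devlist) + 1))
        for (i, dev) in enumerate(devlist)
            device(dev)                        # make this GPU current
            r = cols[i]+1:cols[i+1]            # this device's column block
            d_A = CudaArray(A)
            d_B = CudaArray(B[:, r])
            C[:, r] = to_host(CUBLAS.gemm('N', 'N', d_A, d_B))
        end
    end
    return C
end
```

This replicates A on every device, which wastes memory; a real implementation would likely want to block A as well, or overlap transfers with compute.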

CUBLAS.gemm with CudaArrays is gone

I have some code that uses CUBLAS.gemm with CudaArrays. After updating today I am getting this error:

ERROR: LoadError: error in running finalizer: CUDAdrv.CuError(code=201, meta=nothing)
MethodError: no method matching gemm(::Char, ::Char, ::Float32, ::CUDArt.CudaArray{Float32,2}, ::CUDArt.CudaArray{Float32,2})
Closest candidates are:
  gemm(::Char, ::Char, ::Float32, !Matched::CUDAdrv.CuArray{Float32,2}, !Matched::CUDAdrv.CuArray{Float32,2}) at /home/julieta/.julia/v0.6/CUBLAS/src/blas.jl:929
  gemm(::Char, ::Char, ::Float32, !Matched::Union{Base.ReshapedArray{Float32,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:DenseArray, DenseArray{Float32,2}, SubArray{Float32,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:DenseArray where N where T, DenseArray}, !Matched::Union{Base.ReshapedArray{Float32,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:DenseArray, DenseArray{Float32,2}, SubArray{Float32,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:DenseArray where N where T, DenseArray}) at linalg/blas.jl:1039
  gemm(::Char, ::Char, !Matched::Float64, !Matched::CUDAdrv.CuArray{Float64,2}, !Matched::CUDAdrv.CuArray{Float64,2}) at /home/julieta/.julia/v0.6/CUBLAS/src/blas.jl:929

Sorry, what happened to the old gemm? And what are these new CuArrays (not CudaArrays)?
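Judging only from the method candidates printed in the error, the package now dispatches on CUDAdrv.CuArray and takes an explicit alpha; a hedged conversion sketch:

```julia
# Hedged sketch, based on the candidate list in the MethodError:
# upload with CUDAdrv.CuArray instead of CUDArt.CudaArray, and pass
# the scaling factor explicitly.
using CUDAdrv, CUBLAS

A = rand(Float32, 4, 4); B = rand(Float32, 4, 4)
d_A = CuArray(A)
d_B = CuArray(B)
d_C = CUBLAS.gemm('N', 'N', 1f0, d_A, d_B)
```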

iamin segmentation fault

using CUDArt, CUBLAS

a = rand( 10 );
d_a = CudaArray( a );

# All this works fine
CUBLAS.iamin( d_a );
CUBLAS.iamin( d_a );
CUBLAS.iamin( d_a );
...
CUBLAS.iamin( d_a );

# But after doing
devices(dev->true) do devlist
  ...
end

CUBLAS.iamin( d_a ); # Gives a segmentation fault

This does not happen with other functions such as axpy!, but it does happen with iamax and asum.

Is this a bug or am I doing something wrong?
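One guess (an assumption, not confirmed): the devices() do-block tears down the context in which d_a was allocated, leaving it pointing at freed device memory, which the host-returning reductions (iamin, iamax, asum) then trip over. If so, re-uploading after the block would be a workaround:

```julia
# Hedged workaround, assuming the devices() do-block invalidated the
# context d_a lived in: re-create the device array before calling the
# reduction routines again.
d_a = CudaArray(a)
CUBLAS.iamin(d_a)
```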

CUBLAS dot far slower than BLAS dot

I wrote simple functions that perform dot products on Arrays and CudaArrays. I'm finding that the CUDA version is about 4x slower. Is this expected?

using CUDArt
using CUBLAS

function blasdots(x :: Vector{Float64}, y :: Vector{Float64}; kmax :: Int=100)
  for k = 1:kmax
    BLAS.dot(x, y)
  end
end

function cublasdots(d_x :: CudaArray{Float64}, d_y :: CudaArray{Float64}; kmax :: Int=100)
  for k = 1:kmax
    CUBLAS.dot(d_x, d_y)
  end
end

n = 10000
x = rand(n); y = rand(n)
d_x = CudaArray(x); d_y = CudaArray(y)

blasdots(x, y, kmax=1)  # compile
@time blasdots(x, y)

cublasdots(d_x, d_y, kmax=1)  # compile
@time cublasdots(d_x, d_y)

Running this script gives:

$ julia time_cublas.jl 
  0.001865 seconds (431 allocations: 27.450 KB)
  0.007459 seconds (583 allocations: 28.250 KB)
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF
jl_uv_writecb() ERROR: bad file descriptor EBADF

(Bonus question: what's up with the EBADF???)

This is on OSX 10.9, Julia 0.4.1 installed from Homebrew, built against OpenBLAS, CUDA 7.5.
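One hedged hypothesis: each CUBLAS.dot call pays a kernel launch plus a device-to-host copy of the scalar result, and for n = 10000 that fixed overhead can dominate the actual arithmetic. A quick check under that assumption:

```julia
# Hedged check: rerun the same benchmark with a much larger n. If
# launch/transfer overhead is the cause, the gap should narrow or
# reverse as the per-element work grows relative to the fixed cost.
n = 10_000_000
x = rand(n); y = rand(n)
d_x = CudaArray(x); d_y = CudaArray(y)

blasdots(x, y, kmax=1)        # compile
@time blasdots(x, y)
cublasdots(d_x, d_y, kmax=1)  # compile
@time cublasdots(d_x, d_y)
```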

importall BLAS for generic code

Hi Nick,

Currently CUBLAS does not export or modify any existing functions, but provides separate CUBLAS versions like CUBLAS.axpy! etc. (if I understand things correctly). If you add the line:

importall Base.LinAlg.BLAS

to the beginning of CUBLAS.jl (see module usage), then it will be possible for users to write generic code that works whether the inputs are Arrays or CudaArrays.

best,
deniz
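A hedged sketch of the payoff (the function name is hypothetical; it assumes CUBLAS's methods extend the Base BLAS generics after the proposed importall):

```julia
# Hedged sketch: once CUBLAS extends the Base BLAS generic functions,
# user code dispatches on the array type without mentioning CUBLAS.
function scaled_update!(alpha, x, y)
    # Array inputs hit Base BLAS; CudaArray inputs hit CUBLAS.
    Base.LinAlg.BLAS.axpy!(alpha, x, y)
    return y
end
```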

Memory management

I was wondering if there is a way to modify the behaviour of CUBLAS.jl functions so that when a pointer is overwritten in Julia, the appropriate CUDArt.free(ptr) is called.

using CUBLAS
using CUDArt

N = 50
T = 193
V = 60 * 10^3
K = 100

hA = map(Float32, randn(N*T, K));
hS = map(Float32, randn(K, V));

dA = CudaArray(hA)
dS = CudaArray(hS)

for i in 1:10
    dY = CUBLAS.gemm('N', 'N', dA, dS)
end

Results in an out of memory error:

julia> for i in 1:10
           dY = CUBLAS.gemm('N', 'N', dA, dS)
       end
WARNING: CUDA error triggered from:

 [inlined code] from error.jl:26
 in checkerror at /home/mcp50/.julia/v0.5/CUDArt/src/libcudart-6.5.jl:15
 [inlined code] from essentials.jl:111
 in cudaMalloc at /home/mcp50/.julia/v0.5/CUDArt/src/../gen-6.5/gen_libcudart.jl:260
 in malloc at /home/mcp50/.julia/v0.5/CUDArt/src/pointer.jl:36
 in call at /home/mcp50/.julia/v0.5/CUDArt/src/arrays.jl:99
 [inlined code] from /home/mcp50/.julia/v0.5/CUDArt/src/arrays.jl:110
 in gemm at /home/mcp50/.julia/v0.5/CUBLAS/src/blas.jl:928
 [inlined code] from float.jl:24
 in gemm at /home/mcp50/.julia/v0.5/CUBLAS/src/blas.jl:936
 [inlined code] from none:2
 in anonymous at no file:0ERROR: "out of memory"
 [inlined code] from essentials.jl:111
 in checkerror at /home/mcp50/.julia/v0.5/CUDArt/src/libcudart-6.5.jl:16
 [inlined code] from essentials.jl:111
 in cudaMalloc at /home/mcp50/.julia/v0.5/CUDArt/src/../gen-6.5/gen_libcudart.jl:260
 in malloc at /home/mcp50/.julia/v0.5/CUDArt/src/pointer.jl:36
 in call at /home/mcp50/.julia/v0.5/CUDArt/src/arrays.jl:99
 [inlined code] from /home/mcp50/.julia/v0.5/CUDArt/src/arrays.jl:110
 in gemm at /home/mcp50/.julia/v0.5/CUBLAS/src/blas.jl:928
 [inlined code] from float.jl:24
 in gemm at /home/mcp50/.julia/v0.5/CUBLAS/src/blas.jl:936
 [inlined code] from none:2
 in anonymous at no file:0

It is straightforward to call CUDArt.free in the loop, so this is just a 'nice to have'.
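An alternative that sidesteps the per-iteration allocations entirely, assuming CUBLAS.gemm! follows the standard in-place BLAS convention (alpha, A, B, beta, C):

```julia
# Hedged sketch: preallocate the output once and write into it with
# the in-place gemm!, so the loop allocates no new device memory.
dY = CudaArray(Float32, (size(dA, 1), size(dS, 2)))
for i in 1:10
    CUBLAS.gemm!('N', 'N', 1f0, dA, dS, 0f0, dY)
end
CUDArt.free(dY)
```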

'None' should be 'Void' for Julia v0.5 in the source file "src/libcublas.jl"

None was replaced by Union{} in v0.5, and the None type is no longer defined in this version. Simply replacing the Nones with Union{}s does not work, however. According to the v0.5 documentation (i.e., ccall), the occurrences of None in the following functions defined in src/libcublas.jl should be Void.

function cublasSetVector(n, elemSize, x, incx, devicePtr, incy)
  statuscheck(ccall( (:cublasSetVector, libcublas), cublasStatus_t, (Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint), n, elemSize, x, incx, devicePtr, incy))
end
function cublasGetVector(n, elemSize, x, incx, y, incy)
  statuscheck(ccall( (:cublasGetVector, libcublas), cublasStatus_t, (Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint), n, elemSize, x, incx, y, incy))
end
function cublasSetMatrix(rows, cols, elemSize, A, lda, B, ldb)
  statuscheck(ccall( (:cublasSetMatrix, libcublas), cublasStatus_t, (Cint, Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint), rows, cols, elemSize, A, lda, B, ldb))
end
function cublasGetMatrix(rows, cols, elemSize, A, lda, B, ldb)
  statuscheck(ccall( (:cublasGetMatrix, libcublas), cublasStatus_t, (Cint, Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint), rows, cols, elemSize, A, lda, B, ldb))
end
function cublasSetVectorAsync(n, elemSize, hostPtr, incx, devicePtr, incy, stream)
  statuscheck(ccall( (:cublasSetVectorAsync, libcublas), cublasStatus_t, (Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint, cudaStream_t), n, elemSize, hostPtr, incx, devicePtr, incy, stream))
end
function cublasGetVectorAsync(n, elemSize, devicePtr, incx, hostPtr, incy, stream)
  statuscheck(ccall( (:cublasGetVectorAsync, libcublas), cublasStatus_t, (Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint, cudaStream_t), n, elemSize, devicePtr, incx, hostPtr, incy, stream))
end
function cublasSetMatrixAsync(rows, cols, elemSize, A, lda, B, ldb, stream)
  statuscheck(ccall( (:cublasSetMatrixAsync, libcublas), cublasStatus_t, (Cint, Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint, cudaStream_t), rows, cols, elemSize, A, lda, B, ldb, stream))
end
function cublasGetMatrixAsync(rows, cols, elemSize, A, lda, B, ldb, stream)
  statuscheck(ccall( (:cublasGetMatrixAsync, libcublas), cublasStatus_t, (Cint, Cint, Cint, Ptr{None}, Cint, Ptr{None}, Cint, cudaStream_t), rows, cols, elemSize, A, lda, B, ldb, stream))
end
function cublasXerbla(srName, info)
  ccall( (:cublasXerbla, libcublas), None, (Ptr{UInt8}, Cint), srName, info)
end

Add elementwise operation on CudaArray?

Since I don't know how to write low-level CUDA kernels, it would be great to have some common high-level elementwise operations on CudaArray. For example: d_A = CudaArray(A); exp(d_A).
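Until device-side kernels exist for this, one hedged workaround is to round-trip through the host:

```julia
# Hedged workaround: copy to the host, apply the elementwise operation
# there, and upload the result again. Correct but pays two transfers,
# so it only helps as a convenience, not for performance.
d_A = CudaArray(A)
d_B = CudaArray(exp(to_host(d_A)))
```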
