
jacc.jl's People

Contributors

geekdude, michel2323, pedrovalerolara, philipfackler, williamfgc, ygtangg


jacc.jl's Issues

[Enhancement] Simplification of JACC interface

I created a small example in the PR of how it should be possible to create a JACCArray wrapper and then use multiple dispatch to choose among the different backends instead of relying on a global flag.
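For illustration, a minimal sketch of the idea, assuming a wrapper type named JACCArray (the names and method signatures here are placeholders, not the PR's actual code):

struct JACCArray{T,N,A<:AbstractArray{T,N}}
    data::A
end

# Backend selection by dispatch on the wrapped array type instead of a global flag.
# Fallback method for host arrays (serial/threads path):
parallel_for(N::Integer, f::Function, x::JACCArray{<:Any,<:Any,<:Base.Array}) =
    foreach(i -> f(i, x.data), 1:N)

# A CUDA extension would then add a method for a JACCArray wrapping a CuArray.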

Should use `__precompile__(false)` (at least for extensions)

When using the CUDA backend, I get the following message:

WARNING: Method definition parallel_for(I, F, Any...) where {I<:Integer, F<:Function} in module JACC at /home/4pf/.julia/packages/JACC/Crxi0/src/JACC.jl:11 overwritten in module JACCCUDA at /home/4pf/.julia/packages/JACC/Crxi0/ext/JACCCUDA/JACCCUDA.jl:5.
ERROR: Method overwriting is not permitted during Module precompilation. Use `__precompile__(false)` to opt-out of precompilation.

This becomes an error when I don't add `__precompile__(false)` to my own module definition.
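For reference, the opt-out the error message suggests looks like this in the downstream module (a sketch; MyApp is a placeholder name):

module MyApp

__precompile__(false)   # opt this module out of precompilation, as the error suggests

using JACC
using CUDA   # loading CUDA triggers the JACCCUDA extension, whose method
             # overwriting is what precompilation rejects

end # module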

Add GPU tests

CUDA and AMDGPU tests are missing the CG and LatticeBoltzmann tests present in the CPU threads backend. The goal is to keep all backends synced as much as possible.

Default backend should be "serial"

The "threads" backend should be another extension. By default JACC should run serial for loops and avoid the cost of launching a single thread.

AMDGPU test for JACC.BLAS fails

JACC.BLAS works well when used from a Julia terminal, but it fails when running the AMDGPU JACC.BLAS test (see output below).
More work is needed. The JACC.BLAS module is now part of JACC, but the JACC.BLAS test code for the AMDGPU backend is commented out.

JACC.BLAS: Error During Test at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:100
Got exception outside of a @test
GPU Kernel Exception
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] throw_if_exception(dev::AMDGPU.HIP.HIPDevice)
@ AMDGPU ~/.julia/packages/AMDGPU/BhNdC/src/exception_handler.jl:123
[3] synchronize(stm::AMDGPU.HIP.HIPStream; blocking::Bool, stop_hostcalls::Bool)
@ AMDGPU ~/.julia/packages/AMDGPU/BhNdC/src/highlevel.jl:53
[4] synchronize (repeats 2 times)
@ ~/.julia/packages/AMDGPU/BhNdC/src/highlevel.jl:49 [inlined]
[5] parallel_for(::Int64, ::typeof(JACC.BLAS._axpy), ::Float64, ::Vararg{Any})
@ JACCAMDGPU ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/ext/JACCAMDGPU/JACCAMDGPU.jl:12
[6] axpy(n::Int64, alpha::Float64, x::AMDGPU.ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, y::AMDGPU.ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
@ JACC.BLAS ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/src/JACCBLAS.jl:14
[7] macro expansion
@ ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:125 [inlined]
[8] macro expansion
@ /auto/software/swtree/ubuntu22.04/x86_64/julia/1.9.1/share/julia/stdlib/v1.9/Test/src/Test.jl:1498 [inlined]
[9] top-level scope
@ ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:102
[10] include(fname::String)
@ Base.MainInclude ./client.jl:478
[11] top-level scope
@ ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/runtests.jl:15
[12] include(fname::String)
@ Base.MainInclude ./client.jl:478
[13] top-level scope
@ none:6
[14] eval
@ ./boot.jl:370 [inlined]
[15] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:280
[16] _start()
@ Base ./client.jl:522
Test Summary: | Error  Total  Time
JACC.BLAS     |     1      1  1.9s

Random Numbers

My code is using the version CUDA.randn!(A::AnyCuArray) (here) from within a kernel. This corresponds to here. Is this something we could add to JACC, or should I just try to wrap it myself? (The random number interface is pretty large, and right now I only need this one function.)
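In the meantime, a host-side wrapper is one way to handle this locally (a sketch; randn_fill! is a made-up name, not part of JACC):

using CUDA, Random

# Dispatch on the concrete array type to pick the matching RNG fill.
randn_fill!(A::CUDA.AnyCuArray) = CUDA.randn!(A)    # GPU arrays
randn_fill!(A::AbstractArray)   = Random.randn!(A)  # host arrays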

Expand CI platforms

CI coverage:

  • CPU: GitHub Actions runners
  • GPU: CUDA RTXA1000
  • GPU: AMD MI100

Struct of arrays

(or any user-defined type). If the type of each member of a struct can be passed to a parallel_for, I should also be able to pass a struct of such types. This came up because I wanted to use something like this:

mutable struct Q
    h::JACC.Array{Float64,1}
    k::JACC.Array{Float64,1}
    l::JACC.Array{Float64,1}
    w::JACC.Array{Float64,1}
end

and pass an instance of this to a parallel_for. This is fine with the "threads" backend, but CUDA complains that this struct is not isbits.
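One common workaround on the CUDA side (a sketch, not a JACC feature) is to make the struct immutable and let Adapt.jl convert each field to its device form at kernel launch:

using Adapt

struct Qd{A}   # immutable and parametric on the array type
    h::A
    k::A
    l::A
    w::A
end

# Generates adapt_structure methods so each field is converted
# (e.g. CuArray -> CuDeviceArray) when the struct is passed to a GPU kernel.
Adapt.@adapt_structure Qd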

Add support for ones and zeros

ones and zeros are convenience functions across all backends.
Different signatures of JACC.ones and JACC.zeros need to be supported for target applications.
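For example, mirroring Base.ones and Base.zeros, the kind of signatures meant would look like this (a sketch of the proposal, not existing API):

x = JACC.ones(10)              # dims only, default element type
y = JACC.zeros(Float32, 4, 4)  # explicit element type plus dims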

Adding operation (*, /, +, -) as parameter for JACC.parallel_reduce

Currently we assume that the reduction operation for parallel_reduce is "+".
The idea is to allow any kind of operation by passing the operator as a parameter.
This has to be implemented in all the backends: JACC.jl/src/JACC.jl (Threads.jl), JACC.jl/ext/JACCCUDA/JACCCUDA.jl (CUDA.jl), ...
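A minimal sketch of the idea for the Threads backend, with the operator passed as an extra argument (the exact signature is an assumption):

# Sequential reference version: any associative binary op, not just +.
function parallel_reduce(N::I, op::O, f::F, x...) where {I<:Integer,O<:Function,F<:Function}
    acc = f(1, x...)
    for i in 2:N
        acc = op(acc, f(i, x...))
    end
    return acc
end

# e.g. parallel_reduce(length(a), max, (i, a) -> a[i], a)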

Need more accurate threads per block

I hit a CUDA error about "too many resources" and discovered it was because my kernel requires a lot of registers. I found the following answer helpful, but it uses the deprecated CUDAnative package. CUDA.maxthreads on the kernel returned by cufunction takes the number of registers needed by the kernel into account. Based on that example, here's what I came up with for JACC.parallel_for in the single-dimension case:

function JACC.parallel_for(N::I, f::F, x...) where {I<:Integer,F<:Function}
  parallel_args = (f, x...)
  parallel_kargs = cudaconvert.(parallel_args)
  parallel_tt = Tuple{Core.Typeof.(parallel_kargs)...}
  # Compile the kernel first so its register usage is known...
  parallel_kernel = cufunction(_parallel_for_cuda, parallel_tt)
  # ...then ask CUDA for the largest block size this kernel can actually launch with.
  maxPossibleThreads = CUDA.maxthreads(parallel_kernel)
  threads = min(N, maxPossibleThreads)
  blocks = ceil(Int, N / threads)
  parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
end

This works, although it probably needs more exploration.

`JACCPreferences.backend` is not enforced

The backend preference does not constrain the code to use only the selected backend. For example:

julia> using JACC: JACC
julia> JACC.JACCPreferences.backend
"cuda"
julia> JACC.Array
Array
julia> begin
  function f(x, a)
    @inbounds a[x] += 5.0
  end

  dims = (10)
  a = round.(rand(Float32, dims) * 100)
  a_expected = a .+ 5.0

  a = JACC.Array(a)
  JACC.parallel_for(10, f, a)
end
julia> isapprox(a, a_expected; rtol=1e-5)
true
julia> typeof(a)
Vector{Float32} (alias for Array{Float32, 1})

Another issue arises when loading backends

julia> using JACC: JACC
julia> using CUDA
julia> JACC.JACCPreferences.backend
"threads"
julia> JACC.Array
CuArray

This makes it impossible to load a GPU backend package while still using the threads backend.
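One possible direction (a sketch, not current JACC behaviour) is to check the preference when a backend extension initialises, e.g. in a hypothetical JACCCUDA __init__:

function __init__()
    # Warn if the loaded backend package disagrees with the stored preference.
    if JACC.JACCPreferences.backend != "cuda"
        @warn "CUDA.jl is loaded, but the JACC backend preference is \"$(JACC.JACCPreferences.backend)\""
    end
end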

False sharing in multithreaded `parallel_reduce`

I just stumbled across your multithreaded parallel_reduce implementations here. Because you're writing to shared cache lines from different threads (tmp[threadid()]) in a hot loop (1:N, where N may be large), these implementations will very likely suffer (a lot) from false sharing.

I recommend performing a thread-local reduction first, i.e. something like the following (untested):

using ChunkSplitters

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    tmp = zeros(nt)
    # Each task reduces its own contiguous chunk into a local result first, so the
    # hot loop no longer writes to neighbouring slots of tmp from different threads.
    Threads.@threads :static for (idcs, t) in chunks(1:N, nt)
        tmp[t] = sum(i -> f(i, x...), idcs)
    end
    return [sum(tmp)]
end

(One could avoid the ChunkSplitters dependency by using e.g. Iterators.partition.)
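A sketch of that variant (untested, same thread-local idea, using Iterators.partition instead of ChunkSplitters):

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    ranges = collect(Iterators.partition(1:N, cld(N, nt)))  # contiguous chunks
    tmp = zeros(length(ranges))
    Threads.@threads :static for t in eachindex(ranges)
        tmp[t] = sum(i -> f(i, x...), ranges[t])  # thread-local reduction
    end
    return [sum(tmp)]
end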

AMDGPU tweaks

Looking into the scripts, I see you are using AMDGPU v0.8, which now follows the same "convention" as CUDA with respect to using threads and blocks as kernel launch params; gridsize = blocks, and thus

threads = min(N, numThreads)
blocks = ceil(Int, N / threads)
@roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)

should be:

@roc groupsize = threads gridsize = blocks _parallel_for_amdgpu(f, x...) 

Regarding JuliaGPU/AMDGPU.jl#614, using weakdeps (adding a Project.toml to /test) could solve the issue: since JACC is using extensions, one could make sure the tests rely on the true extension mechanism rather than on conditional loading.

Also, it looks like you are running the AMDGPU CI on Julia 1.9. There used to be issues because of LLVM on Julia 1.9, so Julia 1.10 could generally be preferred (although, depending on the GPU, 1.9 may work fine).

Add MPI tests

Currently we only have node-level tests, no MPI tests.
Adding MPI tests will help us understand the interoperability of JACC.Array with MPI.jl.
Because MPI.Init() and MPI.Finalize() can only be called once per process, MPI tests need to run as separate processes, similarly to MPI.jl's own tests.
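A sketch of what that could look like in runtests.jl, following the pattern MPI.jl itself uses (the test file name, process count, and the MPI.jl >= 0.20 mpiexec() launcher are assumptions):

using Test
using MPI: mpiexec   # returns the launcher command in recent MPI.jl versions

@testset "JACC MPI" begin
    testfile = joinpath(@__DIR__, "tests_mpi.jl")  # hypothetical test file
    # Each MPI test runs in fresh Julia processes, so MPI.Init()/MPI.Finalize()
    # are called exactly once per process; run() throws if the job fails.
    run(`$(mpiexec()) -n 2 $(Base.julia_cmd()) --project $testfile`)
    @test true
end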
