
jacc.jl's People

Contributors

geekdude, michel2323, pedrovalerolara, philipfackler, williamfgc, ygtangg


jacc.jl's Issues

[Enhancement] Simplification of JACC interface

I created a small example in the PR of how it should be possible to create a JACCArray wrapper and then use multiple dispatch to choose among the different backends instead of relying on a global flag.
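For illustration, a minimal sketch of the idea, assuming a wrapper type named JACCArray (the names and method signatures here are placeholders, not the PR's actual code):

struct JACCArray{T,N,A<:AbstractArray{T,N}}
    data::A
end

# Backend selection by dispatch on the wrapped array type instead of a global flag.
# Fallback method for host arrays (serial/threads path):
parallel_for(N::Integer, f::Function, x::JACCArray{<:Any,<:Any,<:Base.Array}) =
    foreach(i -> f(i, x.data), 1:N)

# A CUDA extension would then add a method for a JACCArray wrapping a CuArray.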

Should use `__precompile__(false)` (at least for extensions)

When using the CUDA backend, I get the following message:

WARNING: Method definition parallel_for(I, F, Any...) where {I<:Integer, F<:Function} in module JACC at /home/4pf/.julia/packages/JACC/Crxi0/src/JACC.jl:11 overwritten in module JACCCUDA at /home/4pf/.julia/packages/JACC/Crxi0/ext/JACCCUDA/JACCCUDA.jl:5.
ERROR: Method overwriting is not permitted during Module precompilation. Use `__precompile__(false)` to opt-out of precompilation.

This becomes an error when I don't add `__precompile__(false)` to my own module definition.
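For reference, the opt-out the error message suggests looks like this in the downstream module (a sketch; MyApp is a placeholder name):

module MyApp

__precompile__(false)   # opt this module out of precompilation, as the error suggests

using JACC
using CUDA   # loading CUDA triggers the JACCCUDA extension, whose method
             # overwriting is what precompilation rejects

end # module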

Add GPU tests

CUDA and AMDGPU tests are missing the CG and LatticeBoltzmann tests present in the CPU threads backend. The goal is to keep all backends synced as much as possible.

Default backend should be "serial"

The "threads" backend should be another extension. By default JACC should run serial for loops and avoid the cost of launching a single thread.

AMDGPU test for JACC.BLAS fails

JACC.BLAS works well when used from a Julia terminal, but it fails when running the AMDGPU JACC.BLAS test (see output below).
More work is needed. The JACC.BLAS module is now part of JACC, but the JACC.BLAS test code for the AMDGPU backend is commented out.

JACC.BLAS: Error During Test at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:100
Got exception outside of a @test
GPU Kernel Exception
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] throw_if_exception(dev::AMDGPU.HIP.HIPDevice)
@ AMDGPU ~/.julia/packages/AMDGPU/BhNdC/src/exception_handler.jl:123
[3] synchronize(stm::AMDGPU.HIP.HIPStream; blocking::Bool, stop_hostcalls::Bool)
@ AMDGPU ~/.julia/packages/AMDGPU/BhNdC/src/highlevel.jl:53
[4] synchronize (repeats 2 times)
@ ~/.julia/packages/AMDGPU/BhNdC/src/highlevel.jl:49 [inlined]
[5] parallel_for(::Int64, ::typeof(JACC.BLAS._axpy), ::Float64, ::Vararg{Any})
@ JACCAMDGPU ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/ext/JACCAMDGPU/JACCAMDGPU.jl:12
[6] axpy(n::Int64, alpha::Float64, x::AMDGPU.ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}, y::AMDGPU.ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
@ JACC.BLAS ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/src/JACCBLAS.jl:14
[7] macro expansion
@ ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:125 [inlined]
[8] macro expansion
@ /auto/software/swtree/ubuntu22.04/x86_64/julia/1.9.1/share/julia/stdlib/v1.9/Test/src/Test.jl:1498 [inlined]
[9] top-level scope
@ ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:102
[10] include(fname::String)
@ Base.MainInclude ./client.jl:478
[11] top-level scope
@ ~/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/runtests.jl:15
[12] include(fname::String)
@ Base.MainInclude ./client.jl:478
[13] top-level scope
@ none:6
[14] eval
@ ./boot.jl:370 [inlined]
[15] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:280
[16] _start()
@ Base ./client.jl:522
Test Summary: | Error  Total  Time
JACC.BLAS     |     1      1  1.9s

Random Numbers

My code is using the version CUDA.randn!(A::AnyCuArray) (here) from within a kernel. This corresponds to here. Is this something we could add to JACC, or should I just try to wrap it myself? (The random number interface is pretty large, and right now I only need this one function.)
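In the meantime, a host-side wrapper is one way to handle this locally (a sketch; randn_fill! is a made-up name, not part of JACC):

using CUDA, Random

# Dispatch on the concrete array type to pick the matching RNG fill.
randn_fill!(A::CUDA.AnyCuArray) = CUDA.randn!(A)    # GPU arrays
randn_fill!(A::AbstractArray)   = Random.randn!(A)  # host arrays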

Expand CI platforms

CI coverage:

  • CPU: GitHub Actions runners
  • GPU: CUDA RTXA1000
  • GPU: AMD MI100

Struct of arrays

(or any user-defined type). If the type of each member of a struct can be passed to a parallel_for, I should also be able to pass a struct of such types. This came up because I wanted to use something like this:

mutable struct Q
    h::JACC.Array{Float64,1}
    k::JACC.Array{Float64,1}
    l::JACC.Array{Float64,1}
    w::JACC.Array{Float64,1}
end

and pass an instance of this to a parallel_for. This is fine with the "threads" backend, but CUDA complains that this struct is not isbits.
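One common workaround on the CUDA side (a sketch, not a JACC feature) is to make the struct immutable and let Adapt.jl convert each field to its device form at kernel launch:

using Adapt

struct Qd{A}   # immutable and parametric on the array type
    h::A
    k::A
    l::A
    w::A
end

# Generates adapt_structure methods so each field is converted
# (e.g. CuArray -> CuDeviceArray) when the struct is passed to a GPU kernel.
Adapt.@adapt_structure Qd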

Add support for ones and zeros

ones and zeros are convenience functions across all backends.
Different signatures of JACC.ones and JACC.zeros need to be supported for target applications.
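For example, mirroring Base.ones and Base.zeros, the kind of signatures meant would look like this (a sketch of the proposal, not existing API):

x = JACC.ones(10)              # dims only, default element type
y = JACC.zeros(Float32, 4, 4)  # explicit element type plus dims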

Adding operation (*, /, +, -) as parameter for JACC.parallel_reduce

Currently we assume that the reduction operation for parallel_reduce is "+".
The idea is to allow any kind of operation by passing the operator as a parameter.
This has to be implemented in all the backends: JACC.jl/src/JACC.jl (Threads.jl), JACC.jl/ext/JACCCUDA/JACCCUDA.jl (CUDA.jl), ...
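A minimal sketch of the idea for the Threads backend, with the operator passed as an extra argument (the exact signature is an assumption):

# Sequential reference version: any associative binary op, not just +.
function parallel_reduce(N::I, op::O, f::F, x...) where {I<:Integer,O<:Function,F<:Function}
    acc = f(1, x...)
    for i in 2:N
        acc = op(acc, f(i, x...))
    end
    return acc
end

# e.g. parallel_reduce(length(a), max, (i, a) -> a[i], a)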

Need more accurate threads per block

I hit a CUDA error about "too many resources" and discovered it was because my kernel requires a lot of registers. I found the following answer helpful, but it uses the deprecated CUDAnative package. CUDA.maxthreads on the kernel returned by cufunction takes the number of registers needed by the kernel into account. Based on that example, here's what I came up with for JACC.parallel_for in the single-dimension case:

function JACC.parallel_for(N::I, f::F, x...) where {I<:Integer,F<:Function}
  parallel_args = (f, x...)
  parallel_kargs = cudaconvert.(parallel_args)
  parallel_tt = Tuple{Core.Typeof.(parallel_kargs)...}
  # Compile the kernel first so its register usage is known...
  parallel_kernel = cufunction(_parallel_for_cuda, parallel_tt)
  # ...then ask CUDA for the largest block size this kernel can actually launch with.
  maxPossibleThreads = CUDA.maxthreads(parallel_kernel)
  threads = min(N, maxPossibleThreads)
  blocks = ceil(Int, N / threads)
  parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
end

This works, although it probably needs more exploration.

`JACCPreferences.backend` is not enforced

The backend preference does not constrain the code to use only the selected backend. For example:

julia> using JACC: JACC
julia> JACC.JACCPreferences.backend
"cuda"
julia> JACC.Array
Array
julia> begin
  function f(x, a)
    @inbounds a[x] += 5.0
  end

  dims = (10)
  a = round.(rand(Float32, dims) * 100)
  a_expected = a .+ 5.0

  a = JACC.Array(a)
  JACC.parallel_for(10, f, a)
end
julia> isapprox(a, a_expected; rtol=1e-5)
true
julia> typeof(a)
Vector{Float32} (alias for Array{Float32, 1})

Another issue arises when loading backends

julia> using JACC: JACC
julia> using CUDA
julia> JACC.JACCPreferences.backend
"threads"
julia> JACC.Array
CuArray

This makes it impossible to load a GPU backend package while still using the threads backend.
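One possible direction (a sketch, not current JACC behaviour) is to check the preference when a backend extension initialises, e.g. in a hypothetical JACCCUDA __init__:

function __init__()
    # Warn if the loaded backend package disagrees with the stored preference.
    if JACC.JACCPreferences.backend != "cuda"
        @warn "CUDA.jl is loaded, but the JACC backend preference is \"$(JACC.JACCPreferences.backend)\""
    end
end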

False sharing in multithreaded `parallel_reduce`

I just stumbled across your multithreaded parallel_reduce implementations here. Because you're writing to shared cache lines from different threads (tmp[threadid()]) in a hot loop (1:N, where N may be large), these implementations will very likely suffer (a lot) from false sharing.

I recommend performing a thread-local reduction first, i.e. something like the following (untested):

using ChunkSplitters

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    tmp = zeros(nt)
    # Each task reduces its own contiguous chunk into a local result first, so the
    # hot loop no longer writes to neighbouring slots of tmp from different threads.
    Threads.@threads :static for (idcs, t) in chunks(1:N, nt)
        tmp[t] = sum(i -> f(i, x...), idcs)
    end
    return [sum(tmp)]
end

(One could avoid the ChunkSplitters dependency by using e.g. Iterators.partition.)
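A sketch of that variant (untested, same thread-local idea, using Iterators.partition instead of ChunkSplitters):

function parallel_reduce(N::I, f::F, x...) where {I<:Integer,F<:Function}
    nt = Threads.nthreads()
    ranges = collect(Iterators.partition(1:N, cld(N, nt)))  # contiguous chunks
    tmp = zeros(length(ranges))
    Threads.@threads :static for t in eachindex(ranges)
        tmp[t] = sum(i -> f(i, x...), ranges[t])  # thread-local reduction
    end
    return [sum(tmp)]
end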

AMDGPU tweaks

Looking into the scripts, I see you are using AMDGPU v0.8, which now follows the same "convention" as CUDA with respect to using threads and blocks as kernel launch params; gridsize = blocks, and thus

threads = min(N, numThreads)
blocks = ceil(Int, N / threads)
@roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)

should be:

@roc groupsize = threads gridsize = blocks _parallel_for_amdgpu(f, x...) 

Regarding JuliaGPU/AMDGPU.jl#614, using weakdeps (adding a Project.toml to /test) could solve the issue: since JACC is using extensions, one could make sure the tests rely on the true extension mechanism rather than on conditional loading.

Also, it looks like you are running the AMDGPU CI on Julia 1.9. There used to be issues because of LLVM on Julia 1.9, so Julia 1.10 could generally be preferred (although, depending on the GPU, 1.9 may work fine).

Add MPI tests

Currently we only have node-level tests, no MPI tests.
Adding MPI tests will help us understand the interoperability of JACC.Array with MPI.jl.
Because MPI.Init() and MPI.Finalize() can only be called once per process, MPI tests need to run as separate processes, similarly to MPI.jl's own tests.
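A sketch of what that could look like in runtests.jl, following the pattern MPI.jl itself uses (the test file name, process count, and the MPI.jl >= 0.20 mpiexec() launcher are assumptions):

using Test
using MPI: mpiexec   # returns the launcher command in recent MPI.jl versions

@testset "JACC MPI" begin
    testfile = joinpath(@__DIR__, "tests_mpi.jl")  # hypothetical test file
    # Each MPI test runs in fresh Julia processes, so MPI.Init()/MPI.Finalize()
    # are called exactly once per process; run() throws if the job fails.
    run(`$(mpiexec()) -n 2 $(Base.julia_cmd()) --project $testfile`)
    @test true
end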
