
Comments (23)

jpsamaroo commented on June 1, 2024

I'm happy to help you figure this out, but I need you to post what you tried to do, where it failed, and the errors and stacktrace you got. Otherwise I don't know what doesn't work.

jpsamaroo commented on June 1, 2024

Please retry this with AMDGPU#master; we've merged support for ROCT and ROCR artifacts, which might help alleviate this error.
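
For reference, a minimal sketch of switching to the development branch from the Julia package manager (assuming a standard Pkg setup; the REPL shorthand pkg> add AMDGPU#master works too):

using Pkg
Pkg.add(name="AMDGPU", rev="master")  # track the master branch
Pkg.build("AMDGPU")                   # rebuild so newly provided artifacts are picked up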

Cvikli commented on June 1, 2024

Hey, I am really surprised, since I don't know how I ended up installing 4.0; I didn't find any description of installing specific versions. But now everything seems to be working after another fresh install.

Weeeellll, I guess installing the master branch and the 4.0 rocm-dkms solved it, then. :)

Well done! ;)

Ok let's check speeds!! :D

Cvikli commented on June 1, 2024

I am just writing down my ideas for improvement as I hit problems while writing some basic speed-testing code. Sorry to use this thread:

  • The basic example uses an array of length 32. It would be nice to eventually have a somewhat bigger test as well, e.g. N = 1<<26.

  • I didn't find @rocprintf easy enough to use, because I couldn't find a "%d"/"%s"-style specifier for ANY type... so I couldn't print, for example, types. Sadly I couldn't explore the AMD environment, because I don't know what types and fields I have, and I had a really hard time figuring anything out. But I could write some test code to print, like:

@rocprint "%d" workitemIdx().x
@rocprint "\n" 

but of course this isn't as nice as it could be, and it could be a little more convenient if possible, so it would be easier for anyone to discover. Also, this is a function that gets used a lot in the beginning, so it could improve the start for every single developer. :)

  • There is a typo at https://juliagpu.github.io/AMDGPU.jl/stable/printing/#Printing where the brackets don't match: the closing "end" is written as "ed", so the first copy-paste gives an error.
  • Any time I run a kernel I get this long output:
    AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}(AMDGPU.HSAStatusSignal(HSASignal(Base.RefValue{AMDGPU.HSA.Signal}(AMDGPU.HSA.Signal(0x00007f7c04007a00))), HSAExecutable{AMDGPU.Mem.Buffer}(GPU: Vega 20 [Radeon VII] (gfx906), Base.RefValue{AMDGPU.HSA.Executable}(AMDGPU.HSA.Executable(0x000000000ae30c00)), UInt8[0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x40, 0x01, 0x00 … 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], Dict{Symbol, AMDGPU.Mem.Buffer}()))) — but I am on #master, so I guess this might be some debug output?
  • It was hard to figure out what my environment provides; I would definitely add some code that shows all the things I can use:
function kernel(y)
	@rocprintf "\n"
	wgdim = workgroupDim().x
	for i in 1:wgdim
		gr = gridDimWG().x
		grd = gridDim().x
		gidx = workgroupIdx().x
		idx = workitemIdx().x
		@rocprintf "workgroup: %3d/%-3d  idx: %-3d  groupidx: %-3d  grid: %-3d  grididk: %-3d\n" i wgdim idx gidx gr grd
	end
	return nothing
end
  • Somehow we could create a @RocMax [device] fn() that just runs the function with the maximum possible capacity, because later on nobody wants to waste time finding the appropriate config; it should be handled by the @roc command. Or maybe create some device-config object that points to the device with the maximum capacity and handles @roc {device_withtheMAXPOWER} fn(), so we can bridge this assembly-level tuning of the whole workgroup. I know this is crazy, but... if someone wants to bother with the low level, they can at any time; still, 98% of developers would just go with an instant command that ROCKS. :D
  • An example of how to use gridsize is missing.
  • Why does a kernel function have to have a parameter? I didn't see that in the documentation.

Sadly I couldn't make c .= a .+ b work in parallel with the kernel function. But I made this... It works, but it is slow because I couldn't make it run in parallel:

function vadd!(c, a, b)
	wgdim = workgroupDim().x
	i = workitemIdx().x
	batch = Int(size(a,1)/wgdim)
	for j in (i-1)*batch+1:i*batch
		c[j] = a[j] + b[j]
	end
	return
end
@time wait(@roc groupsize=1024 vadd!(c_d, a_d, b_d))

I guess it is doing something wrong; I don't yet understand how this works. :)

But really great work all in all; I just added some notes so you can see how a beginner fails based on the documentation.

jpsamaroo commented on June 1, 2024

I didn't find @rocprintf easy enough to use, because I couldn't find a "%d"/"%s"-style specifier for ANY type... so I couldn't print, for example, types. Sadly I couldn't explore the AMD environment, because I don't know what types and fields I have, and I had a really hard time figuring anything out.

So that's my fault: I should have documented that @rocprintf just calls Julia's Printf.@printf, so for any type, you could just do @rocprintf("Value: %s\n", myvalue) and it will probably interpret it correctly.

@rocprint "%d" workitemIdx().x
@rocprint "\n"

I don't recommend using @rocprint anymore, since you can't interpolate values like you're trying to here, and @rocprintf is just generally more flexible. They were really just implemented as a proof-of-concept. I'll probably remove them soon. What you want is:

@rocprintf("%d\n", workitemIdx().x)

There is a typo at https://juliagpu.github.io/AMDGPU.jl/stable/printing/#Printing where the brackets don't match: the closing "end" is written as "ed", so the first copy-paste gives an error.

Fixed, thanks!

Any time I run a kernel I get this long output:
AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}(AMDGPU.HSAStatusSignal(HSASignal(Base.RefValue{AMDGPU.HSA.Signal}(AMDGPU.HSA.Signal(0x00007f7c04007a00))), HSAExecutable{AMDGPU.Mem.Buffer}(GPU: Vega 20 [Radeon VII] (gfx906), Base.RefValue{AMDGPU.HSA.Executable}(AMDGPU.HSA.Executable(0x000000000ae30c00)), UInt8[0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x40, 0x01, 0x00 … 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], Dict{Symbol, AMDGPU.Mem.Buffer}()))) — but I am on #master, so I guess this might be some debug output?

As pointed out in the docs (near the end of https://juliagpu.github.io/AMDGPU.jl/stable/quickstart/#Running-a-simple-kernel), you need to wait on the result of @roc. The object you're seeing here is not an error, but the event returned by @roc, which should probably print more nicely. I'll file an issue for that.
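
In other words, the intended pattern is roughly (a sketch using the vadd! kernel from earlier in this thread):

ev = @roc groupsize=32 vadd!(c_d, a_d, b_d)  # launches asynchronously and returns an event
wait(ev)                                     # block until the kernel has finished
c = Array(c_d)                               # now it is safe to copy the result back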

It was hard to figure out what my environment provides; I would definitely add some code that shows all the things I can use.

Can you elaborate on what you mean by this? Are you thinking that we should have a function that lets you see all the useful information about a thread's location in a kernel, which can then be printed? If so, I agree, and I'd be happy to accept a PR that implements this 🙂

Somehow we could create a @RocMax [device] fn() that just runs the function with the maximum possible capacity, because later on nobody wants to waste time finding the appropriate config; it should be handled by the @roc command. Or maybe create some device-config object that points to the device with the maximum capacity and handles @roc {device_withtheMAXPOWER} fn(), so we can bridge this assembly-level tuning of the whole workgroup. I know this is crazy, but... if someone wants to bother with the low level, they can at any time; still, 98% of developers would just go with an instant command that ROCKS. :D

I appreciate the enthusiasm! I think what would be best is the ability to pass groupsize=auto to @roc, and to implement a simple occupancy estimator that picks some valid value automatically. I've filed an issue about this.

An example of how to use gridsize is missing.

Issue filed; feel free to help add these docs if you know how grids and groups work (they're the same as OpenCL workgroups and grids).

Why does a kernel function have to have a parameter? I didn't see that in the documentation.

That's no longer the case as of v0.2.3; maybe you need to update AMDGPU.jl?

Sadly I couldn't make c .= a .+ b work in parallel with the kernel function. But I made this... It works, but it is slow because I couldn't make it run in parallel.

Try this:

function vadd!(c, a, b)
  # global linear index across all workgroups
  idx = (workgroupDim().x * (workgroupIdx().x-1)) + workitemIdx().x
  c[idx] = a[idx] + b[idx]
  return nothing
end
@time wait(@roc groupsize=min(1024,length(c_d)) gridsize=length(c_d) vadd!(c_d, a_d, b_d))

I get 0.011611 seconds (120 allocations: 4.328 KiB), which is pretty OK for creating and launching a kernel (although the GPU is far faster than this; I'm aware of the fact that we spend too long waiting for a kernel to complete).
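
Note that the first launch also compiles the kernel, so (as a general measurement sketch, not specific to this run) it is worth warming up before timing:

# first call: includes kernel compilation
wait(@roc groupsize=min(1024, length(c_d)) gridsize=length(c_d) vadd!(c_d, a_d, b_d))
# later calls measure only the launch and wait overhead
@time wait(@roc groupsize=min(1024, length(c_d)) gridsize=length(c_d) vadd!(c_d, a_d, b_d))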

But really great work all in all; I just added some notes so you can see how a beginner fails based on the documentation.

Thanks for reporting all of these! If you get the chance to help fix some of these issues, I would greatly appreciate it 😄

Cvikli commented on June 1, 2024

I went through this documentation again:
https://juliagpu.gitlab.io/AMDGPU.jl/

The code I tried to run:

using AMDGPU

@show AMDGPU.agents()

N=32
@time a = rand(Float64, N)

@time a_d = AMDGPU.ROCArray(a)

The results:

┌ Warning: HSA runtime has not been built, runtime functionality will be unavailable.
│ Please run Pkg.build("AMDGPU") and reload AMDGPU.
└ @ AMDGPU ~/.julia/packages/AMDGPU/lrlUy/src/AMDGPU.jl:152
┌ Warning: ROCm-Device-Libs have not been downloaded, device intrinsics will be unavailable.
│ Please run Pkg.build("AMDGPU") and reload AMDGPU.
└ @ AMDGPU ~/.julia/packages/AMDGPU/lrlUy/src/AMDGPU.jl:160
AMDGPU.agents() = HSAAgent[]
  0.059273 seconds (84.48 k allocations: 4.414 MiB)
ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
 [1] getproperty at ./Base.jl:33 [inlined]
 [2] getindex at ./refvalue.jl:32 [inlined]
 [3] get_default_agent at /home/hm/.julia/packages/AMDGPU/lrlUy/src/agent.jl:109 [inlined]
 [4] #alloc#2 at /home/hm/.julia/packages/AMDGPU/lrlUy/src/memory.jl:223 [inlined]
 [5] alloc at /home/hm/.julia/packages/AMDGPU/lrlUy/src/memory.jl:223 [inlined]
 [6] ROCArray at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:93 [inlined]
 [7] ROCArray at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:107 [inlined]
 [8] ROCArray at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:134 [inlined]
 [9] ROCArray(::Array{Float64,1}) at /home/hm/.julia/packages/AMDGPU/lrlUy/src/array.jl:140
 [10] top-level scope at ./timing.jl:174 [inlined]
 [11] top-level scope at /home/hm/repo/amd/tests/test_AMD.jl:0
 [12] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088
 [13] include_string(::Module, ::String, ::String) at ./loading.jl:1096
 [14] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./essentials.jl:710
 [15] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N) at ./essentials.jl:709
 [16] inlineeval(::Module, ::String, ::Int64, ::Int64, ::String; softscope::Bool) at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:132
 [17] (::VSCodeServer.var"#50#53"{String,Int64,Int64,String,Module,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:93
 [18] withpath(::VSCodeServer.var"#50#53"{String,Int64,Int64,String,Module,Bool,VSCodeServer.ReplRunCodeRequestParams}, ::String) at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/repl.jl:119
 [19] (::VSCodeServer.var"#49#52"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:91
 [20] hideprompt(::VSCodeServer.var"#49#52"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams}) at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/repl.jl:36
 [21] (::VSCodeServer.var"#48#51"{VSCodeServer.ReplRunCodeRequestParams})() at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:71
 [22] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [23] invokelatest(::Any) at ./essentials.jl:709
 [24] macro expansion at /home/hm/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:27 [inlined]
 [25] (::VSCodeServer.var"#46#47")() at ./task.jl:356
in expression starting at /home/hm/repo/amd/tests/test_AMD.jl:8

That is why I tried:
(@v1.5) pkg> build AMDGPU

 Building AMDGPU → `~/.julia/packages/AMDGPU/lrlUy/deps/build.log`
┌ Error: Error building `AMDGPU`: 
│ Inconsistency detected by ld.so: dl-close.c: 223: _dl_close_worker: Assertion `(*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded' failed!
└ @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949

Which I guess is somehow related to LLVM? I tried changing the ld.lld package, but I just don't understand what is going on, so I don't know if that was necessary.
I updated these packages:
apt-get install clang-format clang-tidy clang-tools clang clangd libc++-dev libc++1 libc++abi-dev libc++abi1 libclang-dev libclang1 liblldb-dev libllvm-ocaml-dev libomp-dev libomp5 lld lldb llvm-dev llvm-runtime llvm python-clang

I changed the permissions of /dev/kfd and tried to set LD_LIBRARY_PATH, without success.

jpsamaroo commented on June 1, 2024

Ahh yes I recall you showing me this error. This is almost definitely not an issue with Julia (as far as I can tell), but an issue with one of your ROCm libraries. You can try adding some @info statements into your deps/build.jl file to see where it happens. It's probably occurring somewhere in this region:

AMDGPU.jl/deps/build.jl, lines 138 to 180 (at f681252):

config[:libhsaruntime_path] = find_hsa_library("libhsa-runtime64", roc_dirs)
if config[:libhsaruntime_path] == nothing
    build_error("Could not find HSA runtime library.")
end

# initializing the library isn't necessary, but flushes out errors that otherwise would
# happen during `version` or, worse, at package load time.
status = init_hsa(config[:libhsaruntime_path])
if status != 0
    build_error("Initializing HSA runtime failed with code $status.")
end

config[:libhsaruntime_version] = version_hsa(config[:libhsaruntime_path])

# also shutdown just in case
status = shutdown_hsa(config[:libhsaruntime_path])
if status != 0
    build_error("Shutdown of HSA runtime failed with code $status.")
end

# find the ld.lld program for linking kernels
ld_path = find_ld_lld()
if ld_path == ""
    build_error("Couldn't find ld.lld, please install it with your package manager")
end
config[:ld_lld_path] = ld_path

config[:hsa_configured] = true

for name in ("rocblas", "rocsparse", "rocalution", "rocfft", "rocrand", "MIOpen")
    lib = Symbol("lib$(lowercase(name))")
    config[lib] = find_roc_library("lib$name")
    if config[lib] == nothing
        build_warning("Could not find library '$name'")
    end
end

lib_hip = Symbol("libhip")
_paths = String[]
config[lib_hip] = Libdl.find_library(["libhip_hcc","libamdhip64"], _paths)
config[lib_hip] == nothing && build_warning("Could not find library HIP")

Cvikli commented on June 1, 2024

Hey! I will try it tomorrow! Thank you for your efforts! :)

(I'm watching the package and following its updates, by the way; I just didn't know whether I should try again yet, so this is a great notification!) :)

Cvikli commented on June 1, 2024

I reinstalled everything by following these steps:

So it definitely improved; I could allocate arrays!

For the code:

using AMDGPU

@show AMDGPU.agents()

N=2^20
a = rand(Float64, N)
b = rand(Float64, N)
c_cpu = a + b
a_d = ROCArray(a)
b_d = ROCArray(b)
c_d = similar(a_d)

function vadd!(c, a, b)
	i = workitemIdx().x
	c[i] = a[i] + b[i]
	return
end

@roc groupsize=32 vadd!(c_d, a_d, b_d)

I get the following error message:

ERROR: LoadError: MethodError: no method matching cached_compilation(::Dict{UInt64, Any}, ::GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.ROCCompilerParams}, ::typeof(AMDGPU.rocfunction_compile), ::typeof(AMDGPU.rocfunction_link))
Closest candidates are:
  cached_compilation(::Dict, ::Function, ::Function, ::GPUCompiler.FunctionSpec{f, tt}; kwargs...) where {f, tt} at /home/hm/.julia/packages/GPUCompiler/AdCnd/src/cache.jl:65
Stacktrace:
 [1] rocfunction(f::Function, tt::Type; name::Nothing, device::AMDGPU.RuntimeDevice{HSAAgent}, global_hooks::NamedTuple{(), Tuple{}}, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:291
 [2] rocfunction(f::Function, tt::Type)
   @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:286
 [3] top-level scope
   @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:165
in expression starting at /home/username/repo/TestScripts/tests/test_AMD.jl:20

Is this error message somewhat helpful for you?
I have GPUCompiler v0.10.0, which might be helpful information for you, I guess.

jpsamaroo commented on June 1, 2024

Can you pull AMDGPU again? I don't know how you ended up in this situation (that version of AMDGPU shouldn't allow you to use GPUCompiler 0.10), but the latest master of AMDGPU has explicit support for GPUCompiler 0.10.

Cvikli commented on June 1, 2024

I pulled AMDGPU again, and now it actually doesn't crash; we are getting closer!

julia> @roc groupsize=N vadd!(c_d, a_d, b_d)
AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}(AMDGPU.HSAStatusSignal(HSASignal(Base.RefValue{AMDGPU.HSA.Signal}(AMDGPU.HSA.Signal(0x00007fdccae6e780))), HSAExecutable{AMDGPU.Mem.Buffer}(GPU: Vega 20 [Radeon VII] (gfx906), Base.RefValue{AMDGPU.HSA.Executable}(AMDGPU.HSA.Executable(0x00000000091a3950)), UInt8[0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x40, 0x01, 0x00  …  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], Dict{Symbol, AMDGPU.Mem.Buffer}(:__global_malloc_hostcall => AMDGPU.Mem.Buffer(Ptr{Nothing} @0x00007fdccae27000, 40, GPU: Vega 20 [Radeon VII] (gfx906), true), :__global_exception_ring => AMDGPU.Mem.Buffer(Ptr{Nothing} @0x00007fdccae25000, 8, GPU: Vega 20 [Radeon VII] (gfx906), true), :__global_exception_flag => AMDGPU.Mem.Buffer(Ptr{Nothing} @0x00007fdccae23000, 16, GPU: Vega 20 [Radeon VII] (gfx906), true)))))

After running the command, the arithmetic doesn't actually seem to be performed, so:

c = Array(c_d)

isapprox(c, c_cpu) # == false

Running julia> wait(@roc groupsize=N vadd!(c_d, a_d, b_d)) hangs forever, I guess?

jpsamaroo commented on June 1, 2024

So you should always do wait(@roc ...), because @roc doesn't wait for the kernel to complete. Does wait actually hang (give Julia a minute or two to compile the first time)? If it does hang, can you tell me which GPU you're using?

Cvikli commented on June 1, 2024

I didn't select any GPU, and yeah, it hangs forever.
Maybe I should select a GPU? How do I do that? :o

jpsamaroo commented on June 1, 2024

You don't necessarily need to set a GPU; AMDGPU selects the first available GPU for you, which you can see with @show AMDGPU.get_default_agent(). There isn't really a great way to set the current GPU right now, but you could do something like:

agents = AMDGPU.agents()
AMDGPU.DEFAULT_AGENT[] = agents[2] # Make the 2nd GPU the default

The reason I asked what GPU you're using (or really, what GPU AMDGPU selects for you) is that you could be using an unsupported GPU, which can possibly hang when trying to use it. I've had that happen with my Raven Ridge integrated GPU.

Cvikli commented on June 1, 2024

Hey,

julia> AMDGPU.get_default_agent()
GPU: Vega 20 [Radeon VII] (gfx906)

I selected GPU 3 to test another one, as you described, and I think it worked.

Running this: wait(@roc groupsize=N vadd!(c_d, a_d, b_d))
The error I get now:

julia> wait(@roc groupsize=N vadd!(c_d, a_d, b_d))
ERROR: InexactError: trunc(UInt16, 1048576)
Stacktrace:
  [1] throw_inexacterror(f::Symbol, #unused#::Type{UInt16}, val::UInt32)
    @ Core ./boot.jl:602
  [2] checked_trunc_uint
    @ ./boot.jl:632 [inlined]
  [3] toUInt16
    @ ./boot.jl:709 [inlined]
  [4] UInt16
    @ ./boot.jl:755 [inlined]
  [5] convert
    @ ./number.jl:7 [inlined]
  [6] KernelDispatchPacket
    @ ~/.julia/packages/AMDGPU/XxCm7/src/hsa/libhsa_types.jl:112 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/ConstructionBase/Lt33X/src/ConstructionBase.jl:0 [inlined]
  [8] _setproperties
    @ ~/.julia/packages/ConstructionBase/Lt33X/src/ConstructionBase.jl:60 [inlined]
  [9] setproperties
    @ ~/.julia/packages/ConstructionBase/Lt33X/src/ConstructionBase.jl:57 [inlined]
 [10] set
    @ ~/.julia/packages/Setfield/XM37G/src/lens.jl:110 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/Setfield/XM37G/src/sugar.jl:182 [inlined]
 [12] (::AMDGPU.var"#22#23"{AMDGPU.ROCDim3, AMDGPU.ROCDim3, HSAKernelInstance{Tuple{ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}}}, HSASignal})(_packet::AMDGPU.HSA.KernelDispatchPacket)
    @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/kernel.jl:141
 [13] _launch!(f::AMDGPU.var"#22#23"{AMDGPU.ROCDim3, AMDGPU.ROCDim3, HSAKernelInstance{Tuple{ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}}}, HSASignal}, T::Type, queue::HSAQueue, signal::HSASignal)
    @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/kernel.jl:178
 [14] #launch!#21
    @ ~/.julia/packages/AMDGPU/XxCm7/src/kernel.jl:139 [inlined]
 [15] #launch_kernel#31
    @ ~/.julia/packages/AMDGPU/XxCm7/src/runtime.jl:114 [inlined]
 [16] #launch_kernel#30
    @ ~/.julia/packages/AMDGPU/XxCm7/src/runtime.jl:109 [inlined]
 [17] macro expansion
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:203 [inlined]
 [18] _launch
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:180 [inlined]
 [19] launch
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:160 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:131 [inlined]
 [21] roccall(::ROCFunction, ::Type{Tuple{ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}, ROCDeviceVector{Float64, 1}}}, ::ROCDeviceVector{Float64, 1}, ::ROCDeviceVector{Float64, 1}, ::ROCDeviceVector{Float64, 1}; queue::AMDGPU.RuntimeQueue{HSAQueue}, signal::AMDGPU.RuntimeEvent{AMDGPU.HSAStatusSignal}, groupsize::Int64, gridsize::Int64)
    @ AMDGPU ~/.julia/packages/AMDGPU/XxCm7/src/execution_utils.jl:109
 [22] #roccall#208
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:265 [inlined]
 [23] macro expansion
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:246 [inlined]
 [24] #call#196
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:222 [inlined]
 [25] #_#236
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:412 [inlined]
 [26] top-level scope
    @ ~/.julia/packages/AMDGPU/XxCm7/src/execution.jl:167

julia> 

Interesting error; now I don't know what the problem could be.

jpsamaroo commented on June 1, 2024

Hmm, I've never personally tested any of the gfx906 chips, although they should probably work. You might consider updating your Linux kernel and making sure you're on the latest ROCm packages (currently we distribute ROCm 3.8; we should probably get those updated to 4.0).

What value of N did you specify? The groupsize is generally limited to 1024 on Vega, as far as I can recall, and it is definitely limited to what fits in a UInt16. I'll file an issue to add a check for invalid workgroup sizes.
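
For a large array, that means keeping the workgroup size at the hardware limit and letting gridsize cover the rest; a sketch using the grid-indexed vadd! from earlier in this thread:

N = 2^20
# groupsize stays within the 1024 work-item limit (and within UInt16);
# gridsize covers the whole array, which is split into workgroups automatically.
wait(@roc groupsize=1024 gridsize=N vadd!(c_d, a_d, b_d))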

Cvikli commented on June 1, 2024

I used N=32, but yeah I got a strange error at bigger values. :D

I can hardly believe that different chips make such a big difference. It would be a hell of a lot of work if there were no common interface for them. :o

Do you think updating to 4.0 would solve it, then?

jpsamaroo commented on June 1, 2024

Yeah, that one's on me, it really should have been an error.

Just remember, ROCm is basically beta software right now (even though they're on version >4.0). Bugs and broken configurations are easy to stumble upon.

I never asked you, what Linux Kernel version are you using?

Cvikli commented on June 1, 2024

I think:

➜  ~ uname -a
Linux user 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I will try to install 4.0.1, but I see this is not a one-click process for now, and I don't want to disrupt my whole workspace because of a lot of other work. I will do it at the end of the week, maybe! :)

jpsamaroo commented on June 1, 2024

That kernel is probably recent enough for most cards, but the VII might be too new to work on that kernel. I'd consider upgrading to something newer, if you can.

Cvikli commented on June 1, 2024

Oh I see, so the grid size is the size of the arithmetic operation. WOW! Very cool! That sounds really effective!

I realised I actually used @rocprintf, so sorry for the typo.

The groupsize=auto idea is great; maybe it would be nice to consider making it the default. Also, if you have ever used @everywhere [worker list] fn(), it would be nice to be able to specify the device in a similar way to a "worker", but I know this is harder than I make it sound.

I am working at a company, and I think it is more beneficial if I try to build a team and adapt our open packages to AMDGPU. Also, I think we have the best machine learning library on the way, and adding AMDGPU support to it would be amazing. I know there are Flux, Zygote, and many more out there, but they all have a seriously hard time and limitations because of the core.

On the topic of figuring out the environment: what you described is really nice. I would be satisfied with a ten-line example that shows all the information I have during a kernel run; a complete example that shows everything about my runtime environment, so that by running the code and reading the docs I could learn the whole of kernel programming in 15 seconds and understand the details. :) I know this is just an idea about what could be the best way to make the whole of AMDGPU simple and easy to learn.

Btw, I would be glad if you could tell me whether you think it is possible to do 10 or 100 kernel operations in a row, with a syntax like this:
function fn(x, y)
    x .= x .+ x .* x .* 2 .+ y
    y .= y .- 1 .+ y .^ 2
    return x .+ y
end
fn(x, y) # where x, y are ROCArrays
So like NVIDIA does by defining CuArray arithmetic? Or is this already on the way?

(Sent from mobile)

Cvikli commented on June 1, 2024

Hey,

It is interesting to see that the timing of the example code you wrote is similar in my case too. If everything is measured correctly, then this is a 10x speedup at the moment.

I think it would be a good idea to update the basic example to the one you wrote here. It also explains a lot and shows how this whole system works.

Also, what do you think about the "CuArray" approach of defining arithmetic operations between ROCArrays? Is that possible, like CUDA did it? I feel like broadcasting could let the gridsize be handled automatically when the arithmetic is redefined. I am just asking whether this is all possible in the future, because it would allow ROCArray to be a one-to-one replacement for Array.
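
For context, a minimal sketch of the array-level usage this question refers to, assuming ROCArray broadcasting is available in the installed AMDGPU.jl version:

using AMDGPU

a_d = ROCArray(rand(Float64, 2^20))
b_d = ROCArray(rand(Float64, 2^20))
c_d = similar(a_d)

c_d .= a_d .+ b_d  # fused broadcast; kernel launch and sizing are handled automatically
isapprox(Array(c_d), Array(a_d) .+ Array(b_d))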

jpsamaroo commented on June 1, 2024

Closing since this issue meandered over too many unrelated things; further discussion can continue on Discourse, or specific issues should be filed separately.
