
AMDGPU.jl

AMD GPU (ROCm) programming in Julia

Requirements

  • Julia 1.9+ (Navi 3 GPUs require Julia 1.10 or higher).
  • ROCm 5.3 through 6.0; only GPUs supported by these versions will work.
  • 64-bit Linux and Windows are supported; see the Requirements section in the docs.

Quick start

AMDGPU.jl can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add AMDGPU

Or, equivalently, via the Pkg API:

julia> import Pkg; Pkg.add("AMDGPU")

Testing

To ensure that everything is working, you can run the package tests with

pkg> test AMDGPU

Or specify a subset of tests to run:

julia> using Pkg

julia> Pkg.test("AMDGPU"; test_args=["core", "kernelabstractions"])

The full list of available tests can be obtained with the --list argument:

julia> Pkg.test("AMDGPU"; test_args=["--list"])

Questions and Contributions

Usage questions can be posted on the Julia Discourse forum under the GPU domain and/or in the #gpu channel of the Julia Slack.

Contributions are very welcome, as are feature requests and suggestions. Please open an issue if you encounter any problems.

Acknowledgment

AMDGPU.jl would not have been possible without the work of Tim Besard and the contributors to CUDA.jl and LLVM.jl.

License

AMDGPU.jl is licensed under the MIT License.



AMDGPU.jl's Issues

ROCArrays matrix multiplication not working

Running on glibc Void Linux with a Sapphire AMD RX 570 8 GB.
Using my fork for using Yggdrasil HSA artifacts: https://github.com/0x0f0f0f/AMDGPU.jl/tree/artifacts

gen(x) = rand(x,x)
ROCArray(gen(10)) * ROCArray(gen(10))
Memory access fault by GPU node-1 (Agent handle: 0x556f0ed38c80) on address 0xa0000. Reason: Page not present or supervisor privilege.
signal (6): Aborted
in expression starting at REPL[9]:1
Allocations: 40373585 (Pool: 40359774; Big: 13811); GC: 45
Aborted

Throw error in wait() call on queue error

When a kernel traps or otherwise misbehaves, it will (usually) inactivate its associated queue. However, the soft-wait wait() call will just keep spinning, waiting for a dead queue to signal it. We should explicitly check the queue status while spinning and, if it becomes inactivated, throw an appropriate error.
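The check described above can be sketched with a mock queue type. `MockQueue`, `soft_wait`, and `QueueError` are illustrative names, not AMDGPU.jl's actual API; this only shows the shape of the proposed fix.

```julia
# Illustrative sketch: while soft-waiting on a signal, periodically check the
# queue status and throw instead of spinning forever on a dead queue.

struct QueueError <: Exception
    msg::String
end

mutable struct MockQueue
    active::Bool    # whether the queue is still alive
    signaled::Bool  # whether the awaited signal has fired
end

function soft_wait(q::MockQueue; timeout_iters=1_000)
    for _ in 1:timeout_iters
        q.signaled && return true
        # Key addition: bail out with an informative error if the queue died.
        q.active || throw(QueueError("queue was inactivated while waiting (kernel likely trapped)"))
        yield()
    end
    return false  # gave up without the signal firing
end
```

With this, `soft_wait` on an inactivated queue raises `QueueError` immediately rather than spinning until timeout.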

test failures and crashes on 580

My understanding is that the 580 is going out of support, but for what it's worth, here is a test run and a console session with failures.

Is there any expectation that these tests will ever pass on the 580?

Let me know how I can help fix these issues (if possible). I have zero knowledge of the low-level implementation of the GPU support.

A failed attempt at matrix-vector multiplication

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-beta1 (2021-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU; using LinearAlgebra

julia> N = 100;

julia> m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N); 

julia> m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);

julia> versioninfo()
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver1)

julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
Memory access fault by GPU node-1 (Agent handle: 0x19b4290) on address 0x640000. Reason: Page not present or supervisor privilege.

signal (6): Aborted
in expression starting at REPL[4]:1
Allocations: 34952292 (Pool: 34939863; Big: 12429); GC: 39
fish: “~/localcompiles/julia-1.6.0-bet…” terminated by signal SIGABRT (Abort)

The test summary


Test Summary:                                 | Pass  Error  Broken  Total
AMDGPU                                        |  932     15      81   1028
  Core                                        |                   1      1
  HSA                                         |   16      6             22
    HSA Status Error                          |    1                     1
    Agent                                     |    5                     5
    Memory                                    |   10      6             16
      Pointer-based                           |    3                     3
      Array-based                             |    2                     2
      Type-based                              |    1                     1
      Pointer information                     |           1              1
      Page-locked memory (OS allocations)     |           5              5
      Exceptions                              |    3                     3
      Mutable structs                         |    1                     1
  Codegen                                     |    3                     3
  Device Functions                            |  175             77    252
  ROCArray                                    |  737      9       3    749
    GPUArrays test suite                      |  737      9            746
      math                                    |    8                     8
      indexing scalar                         |  249                   249
      input output                            |    5                     5
      value constructors                      |   36                    36
      indexing multidimensional               |   25      9             34
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        empty array                           |    8      7             15
          1D                                  |    1      1              2
          2D with other index Colon()         |    2      2              4
          2D with other index 1:5             |    2      2              4
          2D with other index 5               |    2      2              4
        GPU source                            |    2      1              3
        CPU source                            |    2      1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                     1
      interface                               |    7                     7
      conversions                             |   72                    72
      constructors                            |  335                   335
    ROCm External Libraries                   |                   3      3
ERROR: LoadError: Some tests did not pass: 932 passed, 0 failed, 15 errored, 81 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/UpYiP/test/runtests.jl:29
ERROR: Package AMDGPU errored during testing


rocminfo

~> /opt/rocm/bin/rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 1700 Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 1700 Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32878744(0x1f5b098) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32878744(0x1f5b098) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26591(0x67df)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1366                               
  BDFID:                   2304                               
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8388608(0x800000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***    

clinfo

~> /opt/rocm/opencl/bin/clinfo
Number of platforms:				 2
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 1.1 Mesa 20.3.4 - kisak-mesa PPA
  Platform Name:				 Clover
  Platform Vendor:				 Mesa
  Platform Extensions:				 cl_khr_icd
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.0 AMD-APP (3212.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback 


  Platform Name:				 Clover
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Max compute units:				 36
  Max work items dimensions:			 3
    Max work items[0]:				 256
    Max work items[1]:				 256
    Max work items[2]:				 256
  Max work group size:				 256
  Preferred vector width char:			 16
  Preferred vector width short:			 8
  Preferred vector width int:			 4
  Preferred vector width long:			 2
  Preferred vector width float:			 4
  Preferred vector width double:		 2
  Native vector width char:			 16
  Native vector width short:			 8
  Native vector width int:			 4
  Native vector width long:			 2
  Native vector width float:			 4
  Native vector width double:			 2
  Max clock frequency:				 1366Mhz
  Address bits:					 64
  Max memory allocation:			 6871947673
  Image support:				 No
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 32768
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 No
    Round to +ve and infinity:			 No
    IEEE754-2008 fused multiply-add:		 No
  Cache type:					 None
  Cache line size:				 0
  Cache size:					 0
  Global memory size:				 27487790692
  Constant buffer size:				 67108864
  Max number of constant args:			 16
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 0
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Platform ID:					 0x7f589bdbab60
  Name:						 Radeon RX 580 Series (POLARIS10, DRM 3.40.0, 5.4.0-65-generic, LLVM 11.0.1)
  Vendor:					 AMD
  Device OpenCL C version:			 OpenCL C 1.1 
  Driver version:				 20.3.4 - kisak-mesa PPA
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.1 Mesa 20.3.4 - kisak-mesa PPA
  Extensions:					 cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64


  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
  Device Topology:				 PCI[ B#9, D#0, F#0 ]
  Max compute units:				 36
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 1366Mhz
  Address bits:					 64
  Max memory allocation:			 7301444400
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 16384
  Max image 3D height:				 16384
  Max image 3D depth:				 8192
  Max samplers within kernel:			 26591
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 8589934592
  Constant buffer size:				 7301444400
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 3006477104
  Max global variable size:			 7301444400
  Max global variable preferred total size:	 8589934592
  Max read/write image args:			 64
  Max on device events:				 1024
  Queue on device max size:			 8388608
  Max on device queues:				 1
  Queue on device preferred size:		 262144
  SVM capabilities:				 
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 Yes
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Platform ID:					 0x7f589388acf0
  Name:						 gfx803
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0 
  Driver version:				 3212.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

Support OpenCL.jl as device runtime

Currently, we only support using HSARuntime.jl as our device runtime, which currently only allows use of this package by Linux users with recent kernels. However, @vchuravy has made the suggestion that we could use OpenCL.jl as an alternative backend, since it supports loading arbitrarily-generated GPU object code. Because AMD OpenCL drivers are available on Windows and Mac systems (and potentially others, like the BSDs), we should be able to use this as a workaround for users on such systems until ROCm is made available.

Build fails on OSX

I hope you complete this package. As I'm sure you know, it fails to compile on OSX.

Error: Error building AMDGPUnative:
│ ERROR: LoadError: Your platform ("x86_64-apple-darwin18.6.0", parsed as "x86_64-apple-darwin14-gcc8-cxx03") is not supported by this package!
│ Stacktrace:
│ [1] error(::String) at ./error.jl:33
│ [2] top-level scope at /Users/cbrown/.julia/packages/AMDGPUnative/gE2NQ/deps/build.jl:26
│ [3] include(::String) at ./client.jl:439
│ [4] top-level scope at none:5
│ in expression starting at /Users/cbrown/.julia/packages/AMDGPUnative/gE2NQ/deps/build.jl:22
└ @ Pkg.Operations /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/Operations.jl:892

Add/test broadcasting support to ROCArray

This is slightly tricky/undefined since, when two or more HSAArrays are broadcast together, one could allocate the new HSAArray on any of their devices. Maybe we should just fall back to Array if not operating in-place?
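One way to make that fallback policy concrete is sketched below, using a hypothetical `DevArray` stand-in rather than the package's actual types: allocate the result on the common device when all GPU arguments agree, and otherwise signal a fallback to a plain host `Array`.

```julia
# Hypothetical device-array stand-in; not AMDGPU.jl's type.
struct DevArray{T,N}
    data::Array{T,N}
    device::Int
end

# Returns the device id to allocate the broadcast result on, or `nothing` to
# indicate that the broadcast should fall back to a host Array.
function broadcast_result_device(args...)
    devs = unique(a.device for a in args if a isa DevArray)
    isempty(devs) && return nothing          # no GPU arguments at all
    length(devs) == 1 ? first(devs) : nothing # mixed devices -> fall back
end
```

An in-place broadcast would skip this choice entirely, since the destination array already fixes the device.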

Runtime Locking

To be thread-safe, we need at least one lock around all runtime operations which mutate global state (such as DEFAULT_AGENT/DEFAULT_QUEUE).
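A minimal sketch of such a lock in plain Julia follows; `RUNTIME_LOCK`, `DEFAULT_STATE`, and the accessor names are illustrative, not the package's internals.

```julia
# One lock guarding every mutation of global runtime state, such as the
# default agent/queue. ReentrantLock allows the same task to re-acquire it.
const RUNTIME_LOCK = ReentrantLock()

const DEFAULT_STATE = Dict{Symbol,Any}()

function set_default!(key::Symbol, value)
    lock(RUNTIME_LOCK) do
        DEFAULT_STATE[key] = value
    end
end

function get_default(key::Symbol)
    lock(RUNTIME_LOCK) do
        get(DEFAULT_STATE, key, nothing)
    end
end
```

Every code path that touches the shared state goes through these accessors, so concurrent tasks cannot observe it mid-update.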

Embed/wrap lld linker

Currently we invoke ld.lld manually to link our kernel's .o file into the final executable. We should figure out how to embed lld (or call the appropriate C++ functions in LLVM) so that we don't rely on the lld binary existing on the user's system.
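For reference, the current manual approach could look roughly like the following; the flags are illustrative, and the proposal above is precisely to replace this external invocation with an embedded lld or LLVM's C++ linking entry points.

```julia
# Shell out to the ld.lld binary to link a kernel object file into a shared
# object loadable by the runtime. This fails whenever ld.lld is not installed
# on the user's system, which is what embedding lld would avoid.
function link_kernel(obj_path::AbstractString, out_path::AbstractString;
                     lld = Sys.which("ld.lld"))
    lld === nothing && error("ld.lld not found on PATH; install LLVM's lld")
    run(`$lld -shared -o $out_path $obj_path`)
    return out_path
end
```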

'disable-symbolication' build error

I get the following build error on Manjaro:

I added the rocm libraries from arch4edu and replaced libstdc++.so provided by Julia with the one provided by the system by copying it. I get the same error with the Julia 1.5.2 binaries and current Julia master.

I set up the environment as follows:

LD_LIBRARY_PATH="/opt/rocm/hsa/lib/"
LD_LIBRARY_PATH="/opt/rocm/lib/:$LD_LIBRARY_PATH"
PATH="/opt/rocm/llvm/bin/:$PATH"
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-DEV.1046 (2020-09-25)
 _/ |\__'_|_|_|\__'_|  |  Commit 55aeb2ff01 (6 days old master)
|__/                   |

(@v1.6) pkg> add AMDGPU#master
...

(@v1.6) pkg> build AMDGPU
   Building AMDGPU → `~/.julia/packages/AMDGPU/ztzIl/deps/build.log`
ERROR: Error building `AMDGPU`: 
: CommandLine Error: Option 'disable-symbolication' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Unreachable reached in SIISelLowering.cpp due to unhandled AS

julia> versioninfo()
Julia Version 1.5.4-pre.0
Commit 599ecd8210* (2020-11-10 10:50 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 3700X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver2)
Environment:
  JULIA_PKG_DEVDIR = /home/vchuravy/src
Status `/tmp/jl_Mx1zez/Manifest.toml`
  [21141c5a] AMDGPU v0.2.1
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.3.0
  [56f22d72] Artifacts v1.3.0
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [34da2185] Compat v3.25.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.4+0
  [187b0558] ConstructionBase v1.0.0
  [864edb3b] DataStructures v0.18.8
  [7a1cc6ca] FFTW v1.2.4
  [f5851436] FFTW_jll v3.3.9+6
  [1a297f60] FillArrays v0.10.1
  [0c68f7d7] GPUArrays v5.1.0 ⚲
  [61eb1bfa] GPUCompiler v0.8.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [692b3bcd] JLLWrappers v1.1.3
  [929cbde3] LLVM v3.5.1
  [856f044c] MKL_jll v2020.2.254+0
  [1914dd2f] MacroTools v0.5.6
  [ca575930] NetworkOptions v1.2.0
  [efe28fd5] OpenSpecFun_jll v0.5.3+4
  [bac558e1] OrderedCollections v1.3.2
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.1.1
  [6c6a2e73] Scratch v1.0.3
  [efcf1570] Setfield v0.7.0
  [276daf66] SpecialFunctions v1.1.0
  [a759f4b9] TimerOutputs v0.5.7
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
unhandled address space
UNREACHABLE executed at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Target/AMDGPU/SIISelLowering.cpp:1201!

signal (6): Aborted
in expression starting at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:29
gsignal at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
llvm_unreachable_internal at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Support/ErrorHandling.cpp:209
isLegalAddressingMode at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Target/AMDGPU/SIISelLowering.cpp:1201 [inlined]
isLegalAddressingMode at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Target/AMDGPU/SIISelLowering.cpp:1124
isLegalAddressingMode at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/include/llvm/CodeGen/BasicTTIImpl.h:234 [inlined]
getGEPCost at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/include/llvm/Analysis/TargetTransformInfoImpl.h:765
getUserCost at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Analysis/TargetTransformInfo.cpp:209
getUserCost at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/include/llvm/Analysis/TargetTransformInfo.h:330 [inlined]
isFreeInLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:1284 [inlined]
isNotUsedOrFreeInLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:1312 [inlined]
sinkRegion at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:535
runOnLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:382
runOnLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:219 [inlined]
runOnLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:202
runOnFunction at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Analysis/LoopPass.cpp:225
runOnFunction at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1648
runOnModule at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1685
runOnModule at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1750 [inlined]
run at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1863
LLVMRunPassManager at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/Core.cpp:4022
macro expansion at /home/vchuravy/.julia/packages/LLVM/MZvb3/src/util.jl:114 [inlined]
LLVMRunPassManager at /home/vchuravy/.julia/packages/LLVM/MZvb3/lib/libLLVM_h.jl:2881 [inlined]
run! at /home/vchuravy/.julia/packages/LLVM/MZvb3/src/passmanager.jl:39
#62 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/optim.jl:24 [inlined]
ModulePassManager at /home/vchuravy/.julia/packages/LLVM/MZvb3/src/passmanager.jl:33
unknown function (ip: 0x7f55b990b811)
optimize! at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/optim.jl:19
macro expansion at /home/vchuravy/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
macro expansion at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:114 [inlined]
macro expansion at /home/vchuravy/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
#codegen#87 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:106
codegen##kw at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:63 [inlined]
#compile#85 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
compile at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
#rocfunction_compile#226 at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:327 [inlined]
rocfunction_compile##kw at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:324
unknown function (ip: 0x7f5484ed104c)
#check_cache#106 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
unknown function (ip: 0x7f5484ed0d63)
check_cache##kw at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:15
unknown function (ip: 0x7f5484ecee53)
broadcast_kernel at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/host/broadcast.jl:60 [inlined]
#cached_compilation#107 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:0
unknown function (ip: 0x7f5484ecec23)
cached_compilation##kw at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65
unknown function (ip: 0x7f5484ecea3c)
#rocfunction#223 at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:316
rocfunction at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:314 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:184 [inlined]
#gpu_call#263 at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/array.jl:15
unknown function (ip: 0x7f5484ece6e8)
gpu_call##kw at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/array.jl:14 [inlined]
#gpu_call#1 at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/device/execution.jl:67 [inlined]
gpu_call##kw at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/device/execution.jl:46 [inlined]
copyto! at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/host/broadcast.jl:68 [inlined]
copyto! at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/host/broadcast.jl:76 [inlined]
materialize! at ./broadcast.jl:848 [inlined]
materialize! at ./broadcast.jl:845
unknown function (ip: 0x7f5484ece305)
#18 at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/test/testsuite/gpuinterface.jl:4
#176#test_interface at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/test/testsuite.jl:43
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:77 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:77 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:65 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:63 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
top-level scope at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:31
jl_toplevel_eval_flex at /home/vchuravy/src/julia/src/toplevel.c:834
jl_parse_eval_all at /home/vchuravy/src/julia/src/ast.c:913
jl_load_rewrite at /home/vchuravy/src/julia/src/toplevel.c:914
include at ./client.jl:457
jl_apply at /home/vchuravy/src/julia/src/julia.h:1690 [inlined]
do_call at /home/vchuravy/src/julia/src/interpreter.c:117
eval_value at /home/vchuravy/src/julia/src/interpreter.c:206
eval_stmt_value at /home/vchuravy/src/julia/src/interpreter.c:157 [inlined]
eval_body at /home/vchuravy/src/julia/src/interpreter.c:552
jl_interpret_toplevel_thunk at /home/vchuravy/src/julia/src/interpreter.c:660
jl_toplevel_eval_flex at /home/vchuravy/src/julia/src/toplevel.c:840
jl_toplevel_eval_flex at /home/vchuravy/src/julia/src/toplevel.c:790
jl_toplevel_eval_in at /home/vchuravy/src/julia/src/toplevel.c:883
eval at ./boot.jl:331
exec_options at ./client.jl:272
_start at ./client.jl:506
jfptr__start_31419 at /home/vchuravy/builds/julia-1.5/usr/lib/julia/sys.so (unknown line)
jl_apply at /home/vchuravy/src/julia/ui/../src/julia.h:1690 [inlined]
true_main at /home/vchuravy/src/julia/ui/repl.c:106
main at /home/vchuravy/src/julia/ui/repl.c:227
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
_start at /home/vchuravy/builds/julia-1.5/usr/bin/julia (unknown line)
Allocations: 227026595 (Pool: 226953636; Big: 72959); GC: 246
ERROR: Package AMDGPU errored during testing (received signal: 6)

HSASignal and HSAKernelInstance references can be accidentally GC'd

The short of it: our GC doesn't know when a kernel holds a reference to HSA-allocated objects, so long-running kernels could end up seeing their resources finalized out from under them. Fixing this should be reasonably simple, if slightly more expensive: we add references to them (either directly in the HSAQueue object or even globally) within @roc, and then all wait calls will remove those references once the kernel holding them completes. This would imply that direct uses of rocfunction+roccall will have to manually preserve such objects, but I can't think of any easy way around that.
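A minimal host-side sketch of the rooting scheme described above. All names here (`KERNEL_ROOTS`, `root!`, `unroot!`) are hypothetical illustrations, not AMDGPU.jl API:

```julia
# Hypothetical sketch: keep kernel-held HSA objects alive until completion.
# A global IdDict holds strong references so the GC cannot finalize them
# while the kernel is still in flight.
const KERNEL_ROOTS = IdDict{Any,Vector{Any}}()

function root!(kernel, resources...)
    KERNEL_ROOTS[kernel] = collect(Any, resources)  # strong references
    return kernel
end

function unroot!(kernel)
    delete!(KERNEL_ROOTS, kernel)  # resources become collectable again
    return nothing
end
```

In this scheme, @roc would call `root!` at launch and `wait` would call `unroot!` once the kernel completes; direct rocfunction+roccall users would still need to preserve the objects manually (e.g. via `GC.@preserve`).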

Add API to allocate read-only (texture) memory

Read-only memory can be allocated via the HSA runtime, and can be potentially much faster for reads than regular global memory. We should support working with this memory via the Mem API.

Test code doesn't work

Hey!

I wanted to try AMDGPU.jl, but I fail at the first step at running a simple test code.
The problem:
I installed ROCm following the Ubuntu install instructions.
Then: ]add AMDGPU

Then I tried to run this code: the Running-a-simple-kernel example.
The run result:

ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
 [1] getproperty at ./Base.jl:33 [inlined]
 [2] getindex at ./refvalue.jl:32 [inlined]
 [3] HSAArray(::Array{Float64,1}) at /home/user/.julia/packages/AMDGPU/JBXsp/src/hsaarray.jl:34
 [4] top-level scope at /home/user/repo/amd_test/tests/test_AMD.jl:9
 [5] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088
 [6] include_string(::Module, ::String, ::String) at ./loading.jl:1096

on the first AMDGPU line:
a_d = AMDGPU.HSAArray(a)

I have Ubuntu 20.04 and am running this test on multiple Radeon VII cards.

I think the problem should be pretty trivial since ]test AMDGPU throws the same error.

ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
 [1] getproperty at ./Base.jl:33 [inlined]
 [2] getindex at ./refvalue.jl:32 [inlined]
 [3] get_default_agent() at /home/user/.julia/packages/AMDGPU/JBXsp/src/agent.jl:109
 [4] top-level scope at /home/user/.julia/packages/AMDGPU/JBXsp/test/runtests.jl:28
 [5] include(::String) at ./client.jl:457
 [6] top-level scope at none:6
in expression starting at /home/user/.julia/packages/AMDGPU/JBXsp/test/runtests.jl:28
ERROR: Package AMDGPU errored during testing

Does anyone know what I am doing wrong?

What is needed to build?

Initially I was prompted no HSA lib, so I installed https://aur.archlinux.org/packages/hsa-rocr/, then

(@v1.4) pkg> build AMDGPU
   Building AMDGPU → `~/.julia/packages/AMDGPU/6zgIY/deps/build.log`
┌ Error: Error building `AMDGPU`:
│ [ Info: libhsa-runtime64.so: true
│ ERROR: LoadError: LoadError: could not load library "/opt/rocm/hsa/lib/libhsa-runtime64.so"
│ /home/akako/Downloads/julia-1.4.1/bin/../lib/julia/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /opt/rocm/hsa/lib/libhsa-runtime64.so)
│ Stacktrace:

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Some errors during test. Are they cause for concern?

The errors in question: (minus all the file calls)


   Got exception outside of a @test
    Conversion of boxed type Array{Float32(Also Int32,Int64,Complex{Float32}),1} is not allowed
Test threw exception
  Expression: compare((a->begin
            a[view(i, 1, :), :]
        end), AT, a)
  GPU compilation of kernel index_kernel(AMDGPU.ROCKernelContext, ROCDeviceArray{Float64,2,AMDGPU.AS.Global}, ROCDeviceArray{Float64,2,AMDGPU.AS.Global}, Tuple{Int64,Int64}, Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true},Base.Slice{Base.OneTo{Int64}}}) failed
  KernelError: passing and using non-bitstype argument
  
  Argument 6 to your kernel function is of type Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true},Base.Slice{Base.OneTo{Int64}}}, which is not isbits:
    .1 is of type SubArray{Int64,1,Array{Int64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true} which is not isbits.
      .parent is of type Array{Int64,2} which is not isbits.



  Expression: typeof(A[other, []]) == typeof(AT(Ac[other, []]))
  HSA error (code #4097, HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.)

Test threw exception
  Expression: x[:, :, 2] == y
  Not implemented

Test result:

Test Summary:                           | Pass  Error  Broken  Total
AMDGPU                                  |  889     22      71    982
  Core                                  |   20                    20
  HSA                                   |   32              1     33
  Codegen                               |    3                     3
  Device Functions                      |  110             68    178
  ROCArray                              |  723     22       2    747
    GPUArrays test suite                |  723     22            745
      math                              |    8                     8
      indexing scalar                   |  243      6            249
        errors and warnings             |   12                    12
        getindex with Float32           |   34                    34
        getindex with Float64           |   34                    34
        getindex with Int32             |   34                    34
        getindex with Int64             |   34                    34
        getindex with Complex{Float32}  |   34                    34
        getindex with Complex{Float64}  |   34                    34
        setindex! with Float32          |    1      1              2
        setindex! with Float64          |    1      1              2
        setindex! with Int32            |    1      1              2
        setindex! with Int64            |    1      1              2
        setindex! with Complex{Float32} |    1      1              2
        setindex! with Complex{Float64} |    1      1              2
        issue #42 with Float32          |    3                     3
        issue #42 with Float64          |    3                     3
        issue #42 with Int32            |    3                     3
        issue #42 with Int64            |    3                     3
        issue #42 with Complex{Float32} |    3                     3
        issue #42 with Complex{Float64} |    3                     3
        get/setindex!                   |    3                     3
      input output                      |    5                     5
      value constructors                |   36                    36
      indexing multidimensional         |   17     16             33
        sliced setindex                 |    1                     1
        sliced setindex                 |    1                     1
        sliced setindex                 |    1                     1
        sliced setindex                 |    1                     1
        sliced setindex                 |    1                     1
        sliced setindex                 |    1                     1
        sliced setindex, CPU source     |           1              1
        sliced setindex, CPU source     |           1              1
        sliced setindex, CPU source     |           1              1
        sliced setindex, CPU source     |           1              1
        sliced setindex, CPU source     |           1              1
        sliced setindex, CPU source     |           1              1
        empty array                     |    8      7             15
          1D                            |    1      1              2
          2D with other index Colon()   |    2      2              4
          2D with other index 1:5       |    2      2              4
          2D with other index 5         |    2      2              4
        GPU source                      |    3                     3
        CPU source                      |           3              3
      interface                         |    7                     7
      conversions                       |   72                    72
      constructors                      |  335                   335
    ROCm External Libraries             |                   2      2

GPU:RX480
ROCm:3.9.x Built from source (might've messed up)

rocminfo output:


*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26591(0x67df)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1266                               
  BDFID:                   256                                
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8388608(0x800000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Implement dynamic kernel launch

LLVM and the ROCm device libs expose the necessary functions to access the owning queue for a kernel and place packets on it. We should implement the equivalent of CUDAnative's dynamic_cufunction to make use of this.

Tests allocate memory indefinitely

For some reason the tests allocate memory indefinitely. I suspect that it's mostly coming from cached executables, which should probably be cached with an LRU eviction policy.
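A sketch of the LRU idea for the executable cache; the key/value types and the cache itself are placeholders, not AMDGPU.jl internals:

```julia
# Hypothetical LRU cache for compiled executables: bounded capacity,
# least-recently-used entries evicted first.
mutable struct LRUCache{K,V}
    capacity::Int
    order::Vector{K}   # least-recently-used key first
    data::Dict{K,V}
end
LRUCache{K,V}(cap::Int) where {K,V} = LRUCache{K,V}(cap, K[], Dict{K,V}())

function Base.get!(f, c::LRUCache{K,V}, key::K) where {K,V}
    if haskey(c.data, key)
        # cache hit: move the key to the most-recently-used position
        deleteat!(c.order, findfirst(==(key), c.order))
    elseif length(c.data) >= c.capacity
        # cache full: evict the stalest executable
        delete!(c.data, popfirst!(c.order))
    end
    push!(c.order, key)
    return get!(f, c.data, key)  # compile (call f) only on a miss
end
```

Compiled kernels would then be fetched with `get!(() -> compile(job), cache, key)`, so memory use stays bounded by `capacity`.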

Broken build script?

If I try to build AMDGPU:

(@v1.5) pkg> build AMDGPU
   Building AMDGPU  `~/.julia/packages/AMDGPU/nnddY/deps/build.log`
┌ Error: Error building `AMDGPU`:
│ WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
│ WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
│ Inconsistency detected by ld.so: dl-close.c: 223: _dl_close_worker: Assertion `(*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded' failed!
└ @ Pkg.Operations ~/Documents/languages/julia/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949

Now

julia> using BinaryProvider

shell> cat .julia/packages/AMDGPU/nnddY/deps/rocm-external/ext.jl
# autogenerated file, do not edit
const ext_libs_configured = false

julia> include("/home/chriselrod/.julia/packages/AMDGPU/nnddY/deps/build.jl")
paths = ["/opt/rocm/hsa/lib"]
[ Info: Found useable ld.lld at /opt/rocm/llvm/bin/ld.lld
WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
WARNING: replacing module Previous.

shell> cat .julia/packages/AMDGPU/nnddY/deps/rocm-external/ext.jl
# autogenerated file, do not edit
const librocfft = "/opt/rocm/lib/librocfft.so"
const librocalution = "/opt/rocm/lib/librocalution.so"
const libmiopen = "/opt/rocm/lib/libMIOpen.so"
const ext_libs_configured = true
const libhip = "libamdhip64"
const librocrand = "/opt/rocm/rocrand/lib/librocrand.so"
const librocsparse = "/opt/rocm/lib/librocsparse.so"
const librocblas = "/opt/rocm/lib/librocblas.so"

If I don't first run `using BinaryProvider`, the script exits on the `include` without generating a more complete ext.jl.

Support optional return values

Typically one cannot statically determine the size and shape (or even type) of the returned value of a function, thus making it hard to trivially support returning values from kernels. Additionally, kernels (in the hardware sense) do not have return values; they just have arguments, which might be mutated. However, with the use of hostcall and some LLVM magic, we should be able to allow kernels to optionally return a value into a buffer which is dynamically allocated at runtime, and returned to the host once the kernel finishes executing.
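The host-side half of that protocol can be sketched as follows; `launch_with_return` and the plain-function stand-in for a kernel launch are hypothetical, not AMDGPU.jl API:

```julia
# Hypothetical sketch of the return-value protocol: the launcher allocates
# a typed buffer, the kernel writes its result into it, and the host reads
# it back after the kernel finishes.
function launch_with_return(kernel!, ::Type{T}, args...) where {T}
    ret = Ref{T}()         # stands in for a dynamically allocated device buffer
    kernel!(ret, args...)  # stands in for an @roc launch followed by wait
    return ret[]           # value copied back to the host on completion
end

# A "kernel" that stores a sum into the return buffer:
sum_kernel!(ret, a, b) = (ret[] = a + b; nothing)
```

In the real implementation the buffer would be allocated via hostcall at runtime (since its size and type aren't statically known) rather than as a host-side `Ref`.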

FATAL ERROR: Symbol "ccalllib_libhsa-runtime64445"not found on AMDGPU

Void Linux, glibc. Built libhsakmt.so and libhsa-runtime64.so manually.

(@v1.5) pkg> build AMDGPU
   Building AMDGPU → `~/.julia/packages/AMDGPU/lrlUy/deps/build.log`

(@v1.5) pkg> ^C

julia> using AMDGPU
FATAL ERROR: Symbol "ccalllib_libhsa-runtime64445"not found
signal (6): Aborted
in expression starting at REPL[2]:1
raise at /builddir/glibc-2.30/signal/../sysdeps/unix/sysv/linux/raise.c:51
abort at /builddir/glibc-2.30/stdlib/abort.c:79
unknown function (ip: 0x7f0b696b417a)
unknown function (ip: 0x7f0b697b9a31)
unknown function (ip: 0x7f0b697b9712)
unknown function (ip: 0x7f0b697ba5f5)
unknown function (ip: 0x7f0b697bb458)
unknown function (ip: 0x7f0b6973c28c)
unknown function (ip: 0x7f0b696ee04f)
unknown function (ip: 0x7f0b69704a05)
unknown function (ip: 0x7f0b69705cc8)
unknown function (ip: 0x7f0b69706051)
unknown function (ip: 0x7f0b69709669)
unknown function (ip: 0x7f0b6970fa2f)
unknown function (ip: 0x7f0b6970fb7d)
unknown function (ip: 0x7f0b697103a9)
unknown function (ip: 0x7f0b697ba02d)
unknown function (ip: 0x7f0b697bb458)
unknown function (ip: 0x7f0b6973c28c)
jl_apply_generic at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b697709a9)
jl_init_restored_modules at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b5a3d3fb3)
unknown function (ip: 0x7f0b5a4d83cf)
unknown function (ip: 0x7f0b5a60d63d)
unknown function (ip: 0x7f0b5a558c1e)
unknown function (ip: 0x7f0b5a5597bc)
unknown function (ip: 0x7f0b697705fd)
unknown function (ip: 0x7f0b69771b32)
unknown function (ip: 0x7f0b69771c98)
jl_toplevel_eval_in at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b5a2d9c81)
unknown function (ip: 0x7f0b5a03162e)
unknown function (ip: 0x7f0b5a032270)
unknown function (ip: 0x7f0b5a03239a)
unknown function (ip: 0x7f0b5a048ae7)
run_repl at /builddir/julia-1.5.2/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
unknown function (ip: 0x7f0b5a1dbfdf)
unknown function (ip: 0x7f0b5a1dc0b8)
unknown function (ip: 0x7f0b697498c0)
jl_f__apply_latest at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b5a2054a7)
unknown function (ip: 0x7f0b5a210b07)
unknown function (ip: 0x7f0b5a2123ae)
unknown function (ip: 0x7f0b5a212505)
unknown function (ip: 0x5600ebef4755)
unknown function (ip: 0x5600ebef4332)
__libc_start_main at /builddir/glibc-2.30/csu/../csu/libc-start.c:308
unknown function (ip: 0x5600ebef43d9)
Allocations: 4838761 (Pool: 4836877; Big: 1884); GC: 6
Aborted

Document LLVM target features and allow setting them

The LLVM AMDGPU target has features like XNACK that we might want to enable in certain cases, like wavefront debugging. We should document each known feature and provide a way to set them, globally and/or per-kernel.
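A small sketch of how per-kernel feature toggles could be composed into the feature string LLVM expects. The feature names (xnack, sramecc) are real LLVM AMDGPU features, but the `feature_string` helper is purely illustrative:

```julia
# Hypothetical helper: turn a Dict of feature toggles into LLVM's
# "+feat1,-feat2" target-feature string (sorted for a stable result).
feature_string(features) =
    join(((on ? "+" : "-") * name
          for (name, on) in sort(collect(features); by = first)), ",")
```

Something like this could back both a global default and an `@roc features=...` override.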

Create a Docker image for AMDGPU.jl

Since at the moment many of the dependencies required by AMDGPU.jl are not distributed as artifacts, and system configuration can be further complicated by version/architecture mismatches, it would be useful to have a Docker image that includes Julia with AMDGPU installed (alternatively, the user can use their own Julia environment with a volume exposing ~/.julia to the container).

Error in build step: Inconsistency detected by ld.so

Dear @jpsamaroo,
I want to know what the best source is to set up this library. I tried to follow the basic setup mentioned in the docs, which is a little bit confusing for me, and I don't know whether I missed something, but I have been trying hard for an hour already without success. 😞

Is it possible to ask for a cleaner list of install instructions?

I would be glad if I could use it, it looks damn promising!

Implement occupancy estimator

We should be able to estimate how well a given kernel can occupy a given piece of hardware. We should then be able to allow @roc groupsize=auto ... to automatically select a groupsize when the exact value doesn't matter for the given kernel.
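A back-of-the-envelope version of such an estimator, using only the limits rocminfo reports (e.g. 64-wide wavefronts and 40 waves per CU on gfx803); a real estimator would also account for register and LDS pressure, which this hypothetical sketch ignores:

```julia
# Hypothetical occupancy estimate: what fraction of a CU's wave slots a
# given groupsize can fill, ignoring register/LDS limits.
function occupancy(groupsize::Integer; wavefront = 64, max_waves_per_cu = 40)
    waves  = cld(groupsize, wavefront)        # wavefronts per workgroup
    groups = max_waves_per_cu ÷ waves         # workgroups resident on one CU
    return groups * waves / max_waves_per_cu  # fraction of wave slots filled
end
```

`groupsize=auto` could then pick the candidate groupsize maximizing this value.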

Functions to map to/from HIP agent IDs

For interactions with HIP, which uses an implicit, incrementing device ID similar to CUDA, we should provide functions that can map from HSA's agents to HIP's integer device IDs, and back.
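Since HIP simply numbers GPUs 0, 1, … in enumeration order, keeping the GPU agents in a stable vector gives both directions of the mapping. A hypothetical sketch (`AGENTS` stands in for AMDGPU's list of HSA GPU agents):

```julia
# Hypothetical mapping between HSA agents and HIP's 0-based device IDs.
const AGENTS = ["agent-A", "agent-B"]  # placeholder for real agent handles

# HIP IDs are 0-based; findfirst errors via the subtraction if absent.
hip_device_id(agent) = findfirst(==(agent), AGENTS) - 1
hsa_agent(hip_id::Integer) = AGENTS[hip_id + 1]
```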

Check for invalid workgroup sizes

So that users don't accidentally specify a too-large groupsize, we should throw an error if it's greater than what fits in a UInt16, and possibly even query the agent to find out the real maximum.
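The launch-time validation could look something like this. The 1024 default matches the "Workgroup Max Size" rocminfo reports for gfx803, but a real check would query the agent rather than hard-code it; the helper name is hypothetical:

```julia
# Hypothetical launch-time check for workgroup sizes.
function check_groupsize(groupsize::Integer; agent_max = 1024)
    groupsize > typemax(UInt16) &&
        throw(ArgumentError("groupsize $groupsize does not fit in UInt16"))
    groupsize > agent_max &&
        throw(ArgumentError("groupsize $groupsize exceeds agent maximum $agent_max"))
    return groupsize
end
```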

Add support for trap handlers

When GPU kernels crash or don't do what we expect them to do, it can be very frustrating to figure out what went wrong (especially given that there isn't currently a non-deprecated debugger for ROCm). However, when a kernel crashes, it's possible to still execute code in the kernel context via trap handlers. We should implement support for loading code into the trap handler, to allow the user to get extra information on what went wrong. This would handle cases that software exception handling can't catch, like how gdb can debug a program that just generated a segmentation fault.

Support non-bitstype arguments and allocations

It should be possible to support non-bitstype arguments and possibly on-device allocations with a bit of elbow grease, as long as we allocate all non-bitstype structures entirely on HSA finegrained memory blocks (even when they reference other device memory blocks). We'll probably need to provide:

  • A way to allocate non-bitstype objects on the host on only HSA memory blocks (Cassette?)
  • Convert any device-side allocations from Julia addrspaces and conventions to AMDGPU equivalents
  • malloc/free on the device
  • An optional verifier pass to ensure all non-bitstype arguments only point to valid (device-reachable) memory locations
  • All sorts of tests to ensure this actually works
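The verifier-pass bullet could walk argument types the same way GPUCompiler's KernelError message (quoted earlier in this page) does, reporting each non-isbits component. A hypothetical host-side sketch:

```julia
# Hypothetical recursive check: list every non-isbits component of a type,
# mirroring the ".parent is of type Array… which is not isbits" diagnostics.
function nonbits_components(::Type{T}, path = "arg") where {T}
    isbitstype(T) && return String[]
    bad = [path * " is of type " * string(T) * ", which is not isbits"]
    if isstructtype(T) && !(T <: Array)  # don't descend into Array internals
        for (name, FT) in zip(fieldnames(T), fieldtypes(T))
            append!(bad, nonbits_components(FT, path * "." * string(name)))
        end
    end
    return bad
end
```

A real pass would additionally verify that each flagged component points at device-reachable (finegrained) memory rather than rejecting it outright.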
