jpsamaroo commented on June 12, 2024

I've also noticed that HSAKMT environment variables don't work with AMDGPU.jl. We don't do any stderr capture to my knowledge. Do note that those variables apply to HCC, HIP, and MIOpen, none of which we use in any significant capacity (except for HIP, for device sync, which is not done automatically).

Krastanov commented on June 12, 2024

Running on master still has failing tests, but way fewer:

Test Summary:                                 | Pass  Error  Broken  Total
AMDGPU                                        | 1040      2      79   1121
  Core                                        |                   1      1
  HSA                                         |   16                    16
  Codegen                                     |    3                     3
  Device Functions                            |  179             75    254
  ROCArray                                    |  744      2       3    749
    GPUArrays test suite                      |  744      2            746
      math                                    |    8                     8
      indexing scalar                         |  249                   249
      input output                            |    5                     5
      value constructors                      |   36                    36
      indexing multidimensional               |   32      2             34
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        empty array                           |   15                    15
        GPU source                            |    2      1              3
        CPU source                            |    2      1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                     1
      interface                               |    7                     7
      conversions                             |   72                    72
      constructors                            |  335                   335
    ROCm External Libraries                   |                   3      3
  External Packages                           |   97                    97
ERROR: LoadError: Some tests did not pass: 1040 passed, 0 failed, 2 errored, 79 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/TAdgr/test/runtests.jl:27
ERROR: Package AMDGPU errored during testing

The matrix multiplication still crashes:

julia> using AMDGPU; using LinearAlgebra

julia> N = 100; m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N); m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);

julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
Memory access fault by GPU node-1 (Agent handle: 0x2177160) on address 0x640000. Reason: Page not present or supervisor privilege.

signal (6): Aborted
in expression starting at REPL[3]:1
Allocations: 31842465 (Pool: 31831316; Big: 11149); GC: 37
fish: “~/localcompiles/julia-1.6.0-bet…” terminated by signal SIGABRT (Abort)

Here is the manifest:

pkg> st --manifest
Status `~/Documents/ScratchSpace/julia_gpu/Manifest.toml`
  [21141c5a] AMDGPU v0.2.2 ``
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v3.1.1
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [34da2185] Compat v3.25.0
  [187b0558] ConstructionBase v1.0.0
  [864edb3b] DataStructures v0.18.9
  [0c68f7d7] GPUArrays v6.2.0
  [61eb1bfa] GPUCompiler v0.9.2
  [929cbde3] LLVM v3.6.0
  [1914dd2f] MacroTools v0.5.6
  [bac558e1] OrderedCollections v1.3.3
  [ae029012] Requires v1.1.2
  [6c6a2e73] Scratch v1.0.3
  [efcf1570] Setfield v0.7.0
  [a759f4b9] TimerOutputs v0.5.7
  [0dad84c5] ArgTools
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [f43a241f] Downloads
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [b27032c2] LibCURL
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML
  [a4e569a6] Tar
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [deac9b47] LibCURL_jll
  [29816b5a] LibSSH2_jll
  [c8ffd9c3] MbedTLS_jll
  [14a3606d] MozillaCACerts_jll
  [83775a58] Zlib_jll
  [8e850ede] nghttp2_jll

jpsamaroo commented on June 12, 2024

Seems like it might be a crash in rocBLAS, but I'm not sure since I don't regularly run AMDGPU with it enabled (because it sucks to build). Do you have rocBLAS installed?

Krastanov commented on June 12, 2024

I do not think so. I checked with apt-get and rocblas was not installed. Then, just to check, I also ran sudo apt-get install rocblas, which reported a successful (and brand new) install. However, the problem persists even after installing rocblas, so I think it is independent of it.

Krastanov commented on June 12, 2024

I checked a couple of times with and without rocblas (by running sudo apt-get install/purge rocblas and then running ] build AMDGPU), but the crash in the matrix multiplication persists.

Krastanov commented on June 12, 2024

I attempted various debug and serialization flags, as suggested in ROCm/tensorflow-upstream#302 and in , but I did not get any debug info out to stderr!? Is AMDGPU.jl capturing and redirecting stderr? Any other suggestions to try to track what exactly causes the memory fault?

Here is my attempt with the entirety of its console output:

$> ~/localcompiles/julia-1.6.0-beta1/bin/julia --project=.
   _       _ _(_)_     |  Documentation:
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-beta1 (2021-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official release
|__/                   |

julia> using AMDGPU; using LinearAlgebra

julia> N = 100; m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N); m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);

julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
Memory access fault by GPU node-1 (Agent handle: 0x13812e0) on address 0x640000. Reason: Page not present or supervisor privilege.

Krastanov commented on June 12, 2024

All of this was on rocm 4. I also tried installing tensorflow-rocm, which additionally required running apt install rocm-libs rccl. Tensorflow seemed to work fine, but after adding these extra libraries AMDGPU.jl stopped building!? ] build AMDGPU started reporting the error described in ROCm/ROCm#1269

I ended up downgrading to rocm 3.5.1. Now AMDGPU.jl seems to work. Tensorflow 2.4 does not work anymore, but I can downgrade tensorflow too.

There are test failures for the current release of AMDGPU:

Test Summary:                                 | Pass  Fail  Error  Broken  Total
AMDGPU                                        | 1198    12     15      90   1315
  Core                                        |                         1      1
  HSA                                         |   16            6             22
    HSA Status Error                          |    1                           1
    Agent                                     |    5                           5
    Memory                                    |   10            6             16
      Pointer-based                           |    3                           3
      Array-based                             |    2                           2
      Type-based                              |    1                           1
      Pointer information                     |                 1              1
      Page-locked memory (OS allocations)     |                 5              5
      Exceptions                              |    3                           3
      Mutable structs                         |    1                           1
  Codegen                                     |    3                           3
  Device Functions                            |  175                   77    252
  ROCArray                                    | 1003    12      9      12   1036
    GPUArrays test suite                      |  737            9            746
      math                                    |    8                           8
      indexing scalar                         |  249                         249
      input output                            |    5                           5
      value constructors                      |   36                          36
      indexing multidimensional               |   25            9             34
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        empty array                           |    8            7             15
          1D                                  |    1            1              2
          2D with other index Colon()         |    2            2              4
          2D with other index 1:5             |    2            2              4
          2D with other index 5               |    2            2              4
        GPU source                            |    2            1              3
        CPU source                            |    2            1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                           1
      interface                               |    7                           7
      conversions                             |   72                          72
      constructors                            |  335                         335
    ROCm External Libraries                   |  266    12             12    290
      BLAS                                    |   17                          17
      FFT                                     |  106    12             12    130
        T = ComplexF64                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = ComplexF32                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = Float32                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
        T = Float64                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
      rand                                    |  143                         143
ERROR: LoadError: Some tests did not pass: 1198 passed, 12 failed, 15 errored, 90 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/UpYiP/test/runtests.jl:29
ERROR: Package AMDGPU errored during testing

And here are the tests on the current master branch, doing a bit better, but still having errors:

Test Summary:                                 | Pass  Fail  Error  Broken  Total
AMDGPU                                        | 1306    12      2      88   1408
  Core                                        |                         1      1
  HSA                                         |   16                          16
  Codegen                                     |    3                           3
  Device Functions                            |  179                   75    254
  ROCArray                                    | 1010    12      2      12   1036
    GPUArrays test suite                      |  744            2            746
      math                                    |    8                           8
      indexing scalar                         |  249                         249
      input output                            |    5                           5
      value constructors                      |   36                          36
      indexing multidimensional               |   32            2             34
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        empty array                           |   15                          15
        GPU source                            |    2            1              3
        CPU source                            |    2            1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                           1
      interface                               |    7                           7
      conversions                             |   72                          72
      constructors                            |  335                         335
    ROCm External Libraries                   |  266    12             12    290
      BLAS                                    |   17                          17
      FFT                                     |  106    12             12    130
        T = ComplexF64                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = ComplexF32                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = Float32                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
        T = Float64                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
      rand                                    |  143                         143
  External Packages                           |   97                          97
ERROR: LoadError: Some tests did not pass: 1306 passed, 12 failed, 2 errored, 88 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/AKLQk/test/runtests.jl:27
ERROR: Package AMDGPU errored during testing

Krastanov commented on June 12, 2024

Am I correct in assuming that if I want to use my 580 with AMDGPU.jl, I have to freeze rocm at version 3.5.1 and just hope for "best effort", without any guarantees, given that the device seems to be going out of support in rocm?

Should I freeze the AMDGPU.jl version too? Should I expect future versions of AMDGPU.jl to lower the level of support for 580?

Is there a more "official" table of support, giving hardware versions, rocm versions, and AMDGPU.jl versions that are tested/supported?

Krastanov commented on June 12, 2024

Sigh... now there is a separate problem (on rocm 3.5.1 and AMDGPU#master) that simply gives wrong answers (no crash, just incorrect answers) when I do matrix multiplication:

julia> N = 10; T = Float64; a,b,c = cpus = [rand(T, N, N) for i in 1:3]; ag,bg,cg = [ROCArray(i) for i in cpus];

julia> mul!(ag,bg,cg)
10×10 ROCMatrix{Float64}:
 0.169517  0.666133   0.787853   0.952216  0.52438    0.226889  0.895567  0.563802   0.603744  0.0810141
 0.774994  0.0350809  0.705357   0.544661  0.775764   0.966118  0.965179  0.351198   0.25837   0.0632102
 0.947915  0.0939128  0.711592   0.964582  0.484883   0.503159  0.618847  0.199      0.598743  0.913767
 0.166383  0.24303    0.0343327  0.954652  0.952374   0.911542  0.216517  0.144033   0.601291  0.205171
 0.349153  0.223039   0.129581   0.442686  0.766986   0.551424  0.292206  0.0795419  0.43372   0.655484
 0.173297  0.241994   0.915943   0.191715  0.202254   0.305148  0.221799  0.78068    0.75416   0.900042
 0.137884  0.25165    0.342389   0.159862  0.355102   0.836764  0.989629  0.935794   0.526686  0.762097
 0.116692  0.244034   0.724202   0.794337  0.168172   0.497086  0.937436  0.592061   0.813417  0.351207
 0.33148   0.346618   0.96186    0.436207  0.430171   0.623167  0.823441  0.63495    0.477421  0.497221
 0.411855  0.231901   0.578217   0.623853  0.0970518  0.633137  0.945868  0.616912   0.731479  0.731409

julia> mul!(a,b,c)
10×10 Matrix{Float64}:
 2.87753  2.07307  3.1106   3.34475  2.74262  3.61348  3.07164  2.82941  2.95761  1.97778
 3.03134  1.83036  3.24821  3.72434  3.07734  3.92818  3.27126  3.92044  3.94197  2.58042
 2.89656  2.19109  2.64611  3.02358  2.89144  3.69149  2.87703  3.7068   3.77624  2.60822
 1.46729  1.08032  1.38129  1.41364  1.54596  1.69974  1.36592  1.88397  1.82102  0.655919
 1.90435  1.22412  1.73246  1.82557  1.94339  2.39507  1.9207   1.88028  2.36406  1.84471
 1.96599  1.63776  2.00905  2.08006  1.7166   2.25551  1.6797   1.89926  1.67244  1.063
 2.09604  1.65736  1.91219  2.21415  1.86685  2.60482  2.10144  2.70042  2.52462  1.43959
 2.56943  1.31602  1.94323  2.61937  3.13482  2.81117  2.08695  2.95018  2.91306  1.97668
 2.74569  2.02152  3.04165  3.21203  2.86864  3.48828  2.51794  2.95315  2.98953  2.64708
 3.01357  2.14793  2.52376  2.93145  3.03869  3.55187  3.10702  3.39474  3.41577  2.14719

If you guys have any suggestions where to look for the source of these issues (or whether I should downgrade/upgrade to other versions), let me know. Either way, thanks for your effort in putting this library together!

Some community-sourced table of "this hardware ran successfully for me" would be really useful.
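
As a rough sketch of how the GPU/CPU mismatch in the matrix multiplication above could be checked automatically instead of by eye: the helper below is hypothetical (check_mul and its to_device argument are illustrative names, not part of AMDGPU.jl). It runs mul! on arrays built by to_device and compares the result with a plain CPU product. With the default to_device = identity it only exercises the CPU path; passing ROCArray assumes AMDGPU.jl is loaded and the ROCm setup works.

```julia
using LinearAlgebra

# Hypothetical helper (not part of AMDGPU.jl): run mul! on arrays created by
# `to_device` and compare against a CPU reference product.
function check_mul(to_device = identity; N = 10, T = Float64)
    b, c = rand(T, N, N), rand(T, N, N)
    expected = b * c                       # CPU reference result
    out = to_device(zeros(T, N, N))        # destination array on the device
    mul!(out, to_device(b), to_device(c))  # out = b * c, computed on the device
    return Array(out) ≈ expected           # copy back and compare approximately
end

check_mul()            # CPU-only sanity check of the helper itself
# check_mul(ROCArray)  # assumes `using AMDGPU` and a working ROCm install
```

A false return value from check_mul(ROCArray) would reproduce the "wrong answers" case without having to compare printed matrices.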

jpsamaroo commented on June 12, 2024

I tested this on my Vega system, and I also get a memory access fault. I'll run this under my newly-working debugger in the next day or two.

Btw, our CI was running on an RX480 for the longest time, but I had to remove the card because HIP started killing the build process due to not being able to find code for the GPU (stupid problem, I should reproduce it and patch it upstream). I'll probably put the RX480 in another machine and add it to the CI queue so that we ensure that we still have working support.

Krastanov commented on June 12, 2024

Is there a way to donate to the CI effort? (Money or compute time, especially if I can get my 580 to do CI for you; I am a competent enough sysadmin to run a Docker container on this computer that is accessible to your CI jobs.) It is in my selfish interest to have a 580 with a configuration similar to mine (ubuntu with the same drivers and rocm version) in CI ;)

By the way, as a new user I was definitely very confused about which rocm version I should be using. What version of rocm is used by the CI?

jpsamaroo commented on June 12, 2024

We currently use Buildkite to host CI, which runs under docker-compose, so it's pretty nicely isolated. I'll talk to the JuliaGPU devs and see what they think.

Also, the ROCm config is not fixed to a particular version, which is something I would like to fix by providing ROCm libraries as JLLs, but that's complicated by such a config not working on my musl system 😄 It's on the roadmap, though.

jpsamaroo commented on June 12, 2024

While I wait for a response on the CI question, I found that the issue does not turn into a regular device error when running with -g2 --check-bounds=yes on AMDGPU master (-g2 is for outputting a full device stacktrace on error), which indicates to me that this is either a miscompile, or a bug somewhere where unsafe_load/unsafe_store! is being called manually (since array accesses are bounds-checked).
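
To illustrate the reasoning, here is a small CPU-only sketch (it does not touch the GPU and is not taken from AMDGPU.jl) of why bounds checking would catch a bad array index but not a bad raw-pointer access: ordinary indexing throws a BoundsError, while unsafe_load reads through a pointer with no check at all. The variable names are just for illustration.

```julia
a = collect(1:4)

# Ordinary indexing is bounds-checked: an out-of-range index throws.
caught = try
    a[5]
    false
catch err
    err isa BoundsError
end

# unsafe_load goes through a raw pointer and is never bounds-checked,
# so only an in-range offset is used here to keep the example safe.
second = GC.@preserve a unsafe_load(pointer(a), 2)
```

An out-of-range offset passed to unsafe_load would not raise an error; it would read whatever memory is there (or fault), which is the kind of behavior a memory access fault points to.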

jpsamaroo commented on June 12, 2024

Regarding CI: because adding buildkite agents requires sharing our global secret key with the agent's owner, we can't reasonably accept outside CI. However, I plan to set up an RX480 runner and run it for all PRs, so that older cards keep working as much as possible. We'll also potentially be getting access to a lot of newer (but still Vega arch) AMD GPUs soon, so hopefully we can use some of them for CI.

jpsamaroo commented on June 12, 2024

In terms of donations from the community, I would appreciate any bug reports, code contributions, or ideas for improvements you and others might have. That's more valuable to me than CI by a long shot 🙂

Krastanov commented on June 12, 2024

Sounds great! If this starts working I would certainly be active in giving feedback. I do have a bunch of projects that would use bitwise operations on integer types, so hopefully I will be able to stress-test that side of the project.

jpsamaroo commented on June 12, 2024

I'm closing this in favor of #103, since the failing tests you reported are known to fail (see #91), or just generally unreliable (in my experience).

