
hipblas's Introduction

hipBLAS

hipBLAS is a Basic Linear Algebra Subprograms (BLAS) marshalling library with multiple supported backends. It sits between your application and a 'worker' BLAS library, where it marshals inputs to the backend library and marshals results to your application. hipBLAS exports an interface that doesn't require the client to change, regardless of the chosen backend. Currently, hipBLAS supports rocBLAS and cuBLAS backends.

To use hipBLAS, you must first install rocBLAS, rocSPARSE, and rocSOLVER or cuBLAS.

Documentation

Documentation for hipBLAS is available at https://rocm.docs.amd.com/projects/hipBLAS/en/latest/.

To build our documentation locally, use the following commands:

cd docs

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Alternatively, build with CMake:

cmake -DBUILD_DOCS=ON ...

Build and install

  1. Download the hipBLAS source code (clone this repository):

        git clone https://github.com/ROCmSoftwarePlatform/hipBLAS.git

     hipBLAS requires specific versions of rocBLAS and rocSOLVER. Refer to
     [CMakeLists.txt](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/library/CMakeLists.txt)
     for details.
    
  2. Build hipBLAS and install it into /opt/rocm/hipblas:

        cd hipblas
        ./install.sh -i

Interface examples

The hipBLAS interface is compatible with rocBLAS and cuBLAS-v2 APIs. Porting a CUDA application that originally calls the cuBLAS API to an application that calls the hipBLAS API is relatively straightforward. For example, the hipBLAS SGEMV interface is:

GEMV API

hipblasStatus_t
hipblasSgemv( hipblasHandle_t handle,
              hipblasOperation_t trans,
              int m, int n, const float *alpha,
              const float *A, int lda,
              const float *x, int incx, const float *beta,
              float *y, int incy );

Batched and strided GEMM API

hipBLAS GEMM can process matrices in batches with regular strides by using the strided-batched version of the API:

hipblasStatus_t
hipblasSgemmStridedBatched( hipblasHandle_t handle,
              hipblasOperation_t transa, hipblasOperation_t transb,
              int m, int n, int k, const float *alpha,
              const float *A, int lda, long long bsa,
              const float *B, int ldb, long long bsb, const float *beta,
              float *C, int ldc, long long bsc,
              int batchCount);

hipBLAS assumes matrix A and vectors x, y are allocated in GPU memory space filled with data. You are responsible for copying data to and from the host and device memory.
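For illustration, the strided-batched layout can be mimicked with a CPU reference in plain Python (an illustrative sketch, not hipBLAS code; column-major storage and no transposes assumed):

```python
# CPU reference for the strided-batched SGEMM layout:
# batch b of A starts at element b * bsa of one flat column-major buffer,
# and element (i, j) of that batch lives at offset i + j * lda.
def sgemm_strided_batched(m, n, k, alpha, A, lda, bsa,
                          B, ldb, bsb, beta, C, ldc, bsc, batch_count):
    for b in range(batch_count):
        oa, ob, oc = b * bsa, b * bsb, b * bsc
        for j in range(n):
            for i in range(m):
                acc = sum(A[oa + i + p * lda] * B[ob + p + j * ldb]
                          for p in range(k))
                C[oc + i + j * ldc] = alpha * acc + beta * C[oc + i + j * ldc]

# Two batches of 2x2 matrices; A is the identity, so C ends up equal to B.
A = [1.0, 0.0, 0.0, 1.0] * 2
B = [1.0, 2.0, 3.0, 4.0] * 2
C = [0.0] * 8
sgemm_strided_batched(2, 2, 2, 1.0, A, 2, 4, B, 2, 4, 0.0, C, 2, 4, 2)
```

The stride arguments (bsa, bsb, bsc) are simply element offsets between consecutive batches, which is why one contiguous allocation can hold the whole batch.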

hipblas's People

Contributors

aaronenyeshi, amcamd, amd-jmacaran, amdkila, arvindcheru, cgmb, daineamd, dependabot[bot], eidenyoshida, estewart08, hgaspar, jichangjichang, lawruble13, leekillough, lisadelaney, mhbliao, naveenelumalaiamd, ntrost57, pavahora, peterjunpark, pruthvistony, randyh62, rkamd, saadrahim, samjwu, tfalders, torrezuk, urmbista, yvanmokwinski, zaliu


hipblas's Issues

hipblasVersion* is not set correctly by CMake

What is the expected behavior

  • hipblasVersion* (hipblasVersionMajor, hipblasVersionMinor, ..) should be set by CMake.

What actually happens

  • They are set to the literal string @hipblasVersion* rather than to the version number.

How to reproduce

  • hipblasVersion* cannot be used in #if preprocessor conditionals.

There is an extra space between the last @ and the string in hipblas-version.h.in.
I think this prevents CMake from replacing the version correctly.
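For context, CMake's configure_file() replaces @VAR@ markers only when the marker is contiguous; a rough Python simulation of that behavior (illustration only, using the macro names from the report):

```python
import re

# Rough simulation of CMake configure_file(): only a contiguous @VAR@
# marker is replaced; whitespace inside the markers leaves the text
# untouched, matching the broken header described above.
def configure(template, variables):
    return re.sub(r"@([A-Za-z_][A-Za-z0-9_]*)@",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  template)

values = {"hipblasVersionMajor": "0", "hipblasVersionMinor": "49"}
good = "#define hipblasVersionMajor @hipblasVersionMajor@"
bad  = "#define hipblasVersionMinor @hipblasVersionMinor @"  # stray space
```

Running configure() over the good line yields a numeric #define, while the bad line comes back unchanged, which is exactly why the macros cannot be used in #if checks.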

Type of AP in hipblasStrsm functions

For example, float *AP may be changed to const float *AP.

HIPBLAS_EXPORT hipblasStatus_t hipblasStrsm(hipblasHandle_t    handle,
                                            hipblasSideMode_t  side,
                                            hipblasFillMode_t  uplo,
                                            hipblasOperation_t transA,
                                            hipblasDiagType_t  diag,
                                            int                m,
                                            int                n,
                                            const float*       alpha,
                                            float*             AP,
                                            int                lda,
                                            float*             BP,
                                            int                ldb);

HIPBLAS_EXPORT hipblasStatus_t hipblasDtrsm(hipblasHandle_t    handle,
                                            hipblasSideMode_t  side,
                                            hipblasFillMode_t  uplo,
                                            hipblasOperation_t transA,
                                            hipblasDiagType_t  diag,
                                            int                m,
                                            int                n,
                                            const double*      alpha,
                                            double*            AP,
                                            int                lda,
                                            double*            BP,
                                            int                ldb);


cublasStatus_t cublasStrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           float           *B, int ldb)
cublasStatus_t cublasDtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           double          *B, int ldb)
cublasStatus_t cublasCtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           cuComplex       *B, int ldb)
cublasStatus_t cublasZtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           cuDoubleComplex *B, int ldb)
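In support of the const proposal: TRSM only reads A and overwrites B with the solution X of op(A)·X = alpha·B. A minimal CPU sketch of the left/lower/non-unit case (column-major, plain Python, for illustration only):

```python
def strsm_left_lower(m, n, alpha, A, lda, B, ldb):
    # Solve A * X = alpha * B in place in B by forward substitution;
    # A (m x m, lower triangular, column-major) is never written,
    # which is why `const float* AP` would be the more accurate type.
    for j in range(n):
        for i in range(m):
            acc = alpha * B[i + j * ldb] - sum(
                A[i + p * lda] * B[p + j * ldb] for p in range(i))
            B[i + j * ldb] = acc / A[i + i * lda]

A = [2.0, 1.0, 0.0, 4.0]   # [[2, 0], [1, 4]] column-major (upper part unused)
B = [2.0, 5.0]             # right-hand side; the solution is [1, 1]
strsm_left_lower(2, 1, 1.0, A, 2, B, 2)
```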

Error with `DgetriBatched` function on AMD devices

What is the expected behavior

The function should work without error. The same function call with the same inputs works on an Nvidia system using cuBLAS.

What actually happens

rocblas_status_invalid_pointer error is returned.
This code fails on AMD platforms, but runs fine when using any Nvidia card (the HIP version and original CUDA/cuBLAS versions both work).

How to reproduce

  • See small example code below.
#include <iostream>
#include <vector>
#include <assert.h>

#include <hip/hip_runtime.h>
#include <hipblas.h>

void getri(hipblasHandle_t *handle, int *n, double *A, int *lda, int *ipiv, double *lwork, int *info);

int main(int argc, char **argv)
{

  std::vector<double> const test{0.767135868133925, -0.641484652834663,
                                 0.641484652834663, 0.767135868133926};
  std::vector<int> ipiv(2);
  std::vector<int> info(10);
  std::vector<double> work(4);
  int n = 2;
  int lda = 4;

  double *test_d, *work_d;
  int *ipiv_d, *info_d;

  // Create device copies of input matrix, ipiv, info, and work buffers
  auto success = hipMalloc((void **)&ipiv_d, ipiv.size() * sizeof(int));
  assert(success == hipSuccess);
  success = hipMemcpy(ipiv_d, ipiv.data(), ipiv.size() * sizeof(int), hipMemcpyHostToDevice);
  assert(success == 0);

  success = hipMalloc((void **)&info_d, info.size() * sizeof(int));
  assert(success == hipSuccess);
  success = hipMemcpy(info_d, info.data(), info.size() * sizeof(int), hipMemcpyHostToDevice);
  assert(success == 0);

  success = hipMalloc((void **)&test_d, test.size() * sizeof(double));
  assert(success == hipSuccess);
  success = hipMemcpy(test_d, test.data(), test.size() * sizeof(double), hipMemcpyHostToDevice);
  assert(success == 0);

  success = hipMalloc((void **)&work_d, work.size() * sizeof(double));
  assert(success == hipSuccess);
  success = hipMemcpy(work_d, work.data(), work.size() * sizeof(double), hipMemcpyHostToDevice);
  assert(success == 0);

  // Make a hipblas handle
  hipblasHandle_t handle;
  auto hipblas_success = hipblasCreate(&handle);
  assert(hipblas_success == HIPBLAS_STATUS_SUCCESS);

  // Call hipblasDgetriBatched
  std::cout << "Running getri..\n";
  getri(&handle, &n, test_d, &lda, ipiv_d, work_d, info_d);

  success = hipDeviceSynchronize();
  assert(success == 0);
  std::cout << " -- after getri\n";

  hipblas_success = hipblasDestroy(handle);
  assert(hipblas_success == HIPBLAS_STATUS_SUCCESS);

  std::cout << "Done\n";

  return 0;
}

void getri(hipblasHandle_t *handle, int *n, double *A, int *lda, int *ipiv, double *lwork, int *info)
{
  double **A_d;
  double **work_d;

  auto stat = hipMalloc((void **)&A_d, sizeof(double *));
  assert(stat == 0);
  stat = hipMemcpy(A_d, &A, sizeof(double *), hipMemcpyHostToDevice);
  assert(stat == 0);

  stat = hipMalloc((void **)&work_d, sizeof(double *));
  assert(stat == 0);
  stat = hipMemcpy(work_d, &lwork, sizeof(double *), hipMemcpyHostToDevice);
  assert(stat == 0);

  auto const success = hipblasDgetriBatched(*handle, *n, A_d, *lda, nullptr, work_d, *n, info, 1);
  assert(success == 0);
}

Environment

Hardware description
GPU gfx906 (Vega 20)
CPU AMD EPYC 7452

System is using ROCM 4.2. See attached rocminfo.txt.
rocm_smi --showdriverversion gives Driver version: 5.9.25

hipBLAS Ubuntu packaging is broken

Reported here since I can't see a better place to file this issue.

What is the expected behavior

The hipblas package should contain the library, headers, and CMake config files.

example from rocm-4.3.0

# dpkg-query -L hipblas
/opt
/opt/rocm-4.3.0
/opt/rocm-4.3.0/hipblas
/opt/rocm-4.3.0/hipblas/include
/opt/rocm-4.3.0/hipblas/include/exceptions.hpp
/opt/rocm-4.3.0/hipblas/include/hipblas-export.h
/opt/rocm-4.3.0/hipblas/include/hipblas-version.h
/opt/rocm-4.3.0/hipblas/include/hipblas.h
/opt/rocm-4.3.0/hipblas/include/hipblas_module.f90
/opt/rocm-4.3.0/hipblas/lib
/opt/rocm-4.3.0/hipblas/lib/cmake
/opt/rocm-4.3.0/hipblas/lib/cmake/hipblas
/opt/rocm-4.3.0/hipblas/lib/cmake/hipblas/hipblas-config-version.cmake
/opt/rocm-4.3.0/hipblas/lib/cmake/hipblas/hipblas-config.cmake
/opt/rocm-4.3.0/hipblas/lib/cmake/hipblas/hipblas-targets-release.cmake
/opt/rocm-4.3.0/hipblas/lib/cmake/hipblas/hipblas-targets.cmake
/opt/rocm-4.3.0/hipblas/lib/libhipblas.so.0.1.40300
/opt/rocm-4.3.0/include
/opt/rocm-4.3.0/lib
/opt/rocm-4.3.0/lib/cmake
/opt/rocm-4.3.0/lib/cmake/hipblas
/opt/rocm-4.3.0/hipblas/lib/libhipblas.so
/opt/rocm-4.3.0/hipblas/lib/libhipblas.so.0
/opt/rocm-4.3.0/include/exceptions.hpp
/opt/rocm-4.3.0/include/hipblas-export.h
/opt/rocm-4.3.0/include/hipblas-version.h
/opt/rocm-4.3.0/include/hipblas.h
/opt/rocm-4.3.0/include/hipblas_module.f90
/opt/rocm-4.3.0/lib/cmake/hipblas/hipblas-config-version.cmake
/opt/rocm-4.3.0/lib/cmake/hipblas/hipblas-config.cmake
/opt/rocm-4.3.0/lib/cmake/hipblas/hipblas-targets-release.cmake
/opt/rocm-4.3.0/lib/cmake/hipblas/hipblas-targets.cmake
/opt/rocm-4.3.0/lib/libhipblas.so
/opt/rocm-4.3.0/lib/libhipblas.so.0
/opt/rocm-4.3.0/lib/libhipblas.so.0.1.40300

What actually happens

The hipblas package contains only the library itself, without headers or CMake config files.

# dpkg-query -L hipblas
/opt
/opt/rocm-4.5.0
/opt/rocm-4.5.0/hipblas
/opt/rocm-4.5.0/hipblas/lib
/opt/rocm-4.5.0/hipblas/lib/libhipblas.so.0.1.40500
/opt/rocm-4.5.0/lib
/opt/rocm-4.5.0/hipblas/lib/libhipblas.so.0
/opt/rocm-4.5.0/lib/libhipblas.so.0
/opt/rocm-4.5.0/lib/libhipblas.so.0.1.40500

How to reproduce

Install the latest hipblas package from http://repo.radeon.com/rocm/apt/debian/ ubuntu main

[5.3.X] TRMM functions do not have correct correspondence in hipBLAS

The hipBLAS TRMM functions hipblasStrmm, hipblasDtrmm, hipblasCtrmm, and hipblasZtrmm match neither the legacy cuBLAS TRMM functions nor the cuBLAS TRMM _v2 functions.

For instance:

cublasStrmm:

void CUBLASWINAPI cublasStrmm(char side,
                              char uplo,
                              char transa,
                              char diag,
                              int m,
                              int n,
                              float alpha,
                              const float* A,
                              int lda,
                              float* B,
                              int ldb);

cublasStrmm_v2:

CUBLASAPI cublasStatus_t CUBLASWINAPI cublasStrmm_v2(cublasHandle_t handle,
                                                     cublasSideMode_t side,
                                                     cublasFillMode_t uplo,
                                                     cublasOperation_t trans,
                                                     cublasDiagType_t diag,
                                                     int m,
                                                     int n,
                                                     const float* alpha, /* host or device pointer */
                                                     const float* A,
                                                     int lda,
                                                     const float* B,
                                                     int ldb,
                                                     float* C,
                                                     int ldc);

hipblasStrmm:

HIPBLAS_EXPORT hipblasStatus_t hipblasStrmm(hipblasHandle_t    handle,
                                            hipblasSideMode_t  side,
                                            hipblasFillMode_t  uplo,
                                            hipblasOperation_t transA,
                                            hipblasDiagType_t  diag,
                                            int                m,
                                            int                n,
                                            const float*       alpha,
                                            const float*       AP,
                                            int                lda,
                                            float*             BP,
                                            int                ldb);

The same goes for rocBLAS analogues rocblas_strmm, rocblas_dtrmm, rocblas_ctrmm, rocblas_ztrmm (ROCm/rocBLAS#1265).

So, the above 4 hipBLAS and 4 rocBLAS functions are marked as HIP UNSUPPORTED.

[Solution]
Since hipBLAS doesn't support v1 BLAS functions, extend the hipBLAS TRMM functions with the two missing arguments, float* C and int ldc, and revise the functions' logic accordingly.
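The _v2-style TRMM is out of place: it writes C = alpha · op(A) · B without modifying B. A small CPU sketch of the left/lower case (column-major, plain Python, for illustration only):

```python
def strmm_left_lower(m, n, alpha, A, lda, B, ldb, C, ldc):
    # C = alpha * A * B with A m-by-m lower triangular; B is read-only,
    # and the result goes to the separate C/ldc pair that the issue
    # proposes adding to the hipBLAS TRMM signatures.
    for j in range(n):
        for i in range(m):
            acc = sum(A[i + p * lda] * B[p + j * ldb] for p in range(i + 1))
            C[i + j * ldc] = alpha * acc

A = [2.0, 1.0, 0.0, 3.0]   # [[2, 0], [1, 3]] column-major (upper part unused)
B = [1.0, 1.0, 2.0, 2.0]   # [[1, 2], [1, 2]]
C = [0.0] * 4
strmm_left_lower(2, 2, 1.0, A, 2, B, 2, C, 2)
```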

No hipBLAS equivalents for integer datatypes

The hipBLAS datatype enum has no equivalents for the integer datatypes supported by both rocBLAS and cuBLAS.
rocBLAS datatypes not supported in hipBLAS:
rocblas_datatype_i8_c
rocblas_datatype_u8_r
rocblas_datatype_u8_c
rocblas_datatype_i32_r
rocblas_datatype_i32_c
rocblas_datatype_u32_r
rocblas_datatype_u32_c

list of cublas Datatypes:

    CUDA_R_8I = 3,   
    CUDA_C_8I = 7,   
    CUDA_R_8U = 8,   
    CUDA_C_8U = 9,   
    CUDA_R_32I= 10,  
    CUDA_C_32I= 11,  
    CUDA_R_32U= 12,  
    CUDA_C_32U= 13   

whereas hipBLAS only supports floating-point datatypes:

enum hipblasDatatype_t
{
    HIPBLAS_R_16F = 150,
    HIPBLAS_R_32F = 151,
    HIPBLAS_R_64F = 152,
    HIPBLAS_C_16F = 153,
    HIPBLAS_C_32F = 154,
    HIPBLAS_C_64F = 155,
};
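For context, these integer datatypes enable mixed-precision GEMMs such as int8 inputs with int32 accumulation. A CPU sketch of what such a GemmEx would compute (illustration only, not the hipBLAS API):

```python
def igemm(m, n, k, A, lda, B, ldb, C, ldc):
    # int8 inputs, int32 accumulation/output (column-major); this is the
    # combination that CUDA_R_8I / CUDA_R_32I (rocblas_datatype_i8_r /
    # rocblas_datatype_i32_r) cover, and that hipBLAS cannot express.
    for j in range(n):
        for i in range(m):
            C[i + j * ldc] = sum(A[i + p * lda] * B[p + j * ldb]
                                 for p in range(k))

A = [1, -2, 3, 4]          # int8 values, [[1, 3], [-2, 4]] column-major
B = [5, 6, -7, 8]          # [[5, -7], [6, 8]]
C = [0] * 4
igemm(2, 2, 2, A, 2, B, 2, C, 2)
```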

GemmEx of float16 runs so fast but with wrong result

What is the expected behavior

  • The run times of GEMM in float32 and float16 should differ by a reasonable factor, and the result matrices should be close.

What actually happens

  • The run time of GemmEx in fp16 or Hgemm is too short, around 0.02 ms.
  • The result of GemmEx in fp16 or Hgemm is far off (the code doing the check is inside the file too).

How to reproduce

  • Code is below

code.zip

  • compile command: hipcc fp16gemm.hip -o fp16_gemm -lhipblas -L/opt/rocm-4.3.0/hipblas/lib

  • execute: ./fp16_gemm

  • Output is like this:

The error analysis may have some problems (I am still working on it), but the run time of GEMM in fp16 or Hgemm is too short to be real.

  • Matrix data: generated with gen_data.py under Python 3.8.
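When validating an fp16 GEMM against an fp32 reference, the tolerance must account for half precision (unit roundoff around 2^-11) and the reduction length k. A hedged sketch of such a check (plain Python, illustration only; the safety factor is an assumption):

```python
def max_rel_err(ref, got):
    return max(abs(r - g) / max(abs(r), 1e-30) for r, g in zip(ref, got))

def fp16_gemm_ok(ref, got, k, safety=10.0):
    # Rounding error in an fp16 GEMM grows roughly with k * eps_half; a
    # result far outside this bound (or a kernel that "finishes" in ~0 time)
    # suggests the GEMM did not actually run, or ran on garbage data.
    eps_half = 2.0 ** -11
    return max_rel_err(ref, got) <= safety * k * eps_half

ref = [1.000, 2.000, 3.000]
got = [1.001, 1.999, 3.002]   # hypothetical fp16 results
```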

Environment

Hardware description
GPU MI100
CPU AMD EPYC 7302 16-Core Processor
Software version
ROCm v4.3.0
HipCC v4.3
RocBlas v4.3

Make error on nvcc platform

Hi, first I built HIP from source and changed HIP_PATH; then when I ran cmake and make for hipBLAS, it showed the following error:
CMake Error: INSTALL(EXPORT) given unknown export "hipblas-targets"

I also tried to correct CMakeLists.txt, but that didn't work either.
Is there a simple description of how to build hipBLAS on the NVCC platform without installing ROCm?

Error : There is no device can be used to do the computation for HIP_PLATFORM NVCC

Background
"Error: There is no device can be used to do the computation" while executing MXNet tests on the HIP/CUDA (NVCC) path after integrating hipBLAS.

Issue
Facing runtime issues after integrating hipBLAS into the MXNet library: "error: There is no device can be used to do the computation". (Test cases terminate after throwing this error.)
A sample application that reproduces the error is attached:
hipblas_test.cpp.zip

Steps to reproduce

  • $git clone --recursive https://github.com/ROCmSoftwarePlatform/mxnet
  • $cd mxnet
  • $export HIP_PLATFORM=nvcc
  • $make -j$(nproc)
  • $unzip hipblas_test.cpp.zip
  • $cp hipblas_test.cpp mxnet/
  • $/opt/rocm/bin/hipcc -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 hipblas_test.cpp -I../ -I/usr/local/cuda-8.0/include -I/opt/rocm/hipblas/include -L/opt/rocm/hipblas/lib -lhipblas -lcublas -o hipblas_test.cpp.out
  • $./hipblas_test.cpp.out

Analysis
Though the error indicates the absence of a device, the GPU is being detected. An image of the nvidia-smi log is attached.
nvdia-smi log

Install directory inconsistency

This is not really a bug, but it seems inconsistent: the hipblas .deb package installs into /usr/include and /usr/lib, whereas most other ROCm packages install into /opt/rocm.

Upgrade hipblas 0.5.2.0 to 0.6.0.8 via deb repository fails.

What is the expected behavior

  • apt-get install hipblas should work

What actually happens

The following packages will be upgraded:
  hipblas
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
4 not fully installed or removed.
Need to get 0 B/62.1 kB of archives.
After this operation, 270 kB of additional disk space will be used.
Do you want to continue? [Y/n] 
(Reading database ... 1092339 files and directories currently installed.)
Preparing to unpack .../hipblas_0.6.0.8_amd64.deb ...
Unpacking hipblas (0.6.0.8) over (0.5.2.0) ...
dpkg: error processing archive /var/cache/apt/archives/hipblas_0.6.0.8_amd64.deb (--unpack):
 unable to open '/opt/rocm/hipblas/lib/cmake/hipblas/hipblas-config-version.cmake.dpkg-new': No such file or directory
abort-upgrade
Errors were encountered while processing:
 /var/cache/apt/archives/hipblas_0.6.0.8_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

How to reproduce

  • Standard upgrade from previous ROCm version via the repository.

Environment

Linux mint 18.2

Results of hipblasHgemm seem incorrect

I tried to run an example that calls the hipblasHgemm function on a gfx908 GPU. The results seemed incorrect when comparing the results of the half- and single-precision versions. Thanks.

https://github.com/zjin-lcf/HeCBench/tree/master/src/mkl-sgemm-hip

HIP version: 6.0.32831-204d35d16
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)

hipcc  -std=c++17 -Wall -O3 gemm.o -o main -lhipblas
./main 79 91 83 1000
        Running with half precision real data type:
Average GEMM execution time: 542.991028 (us)

                Outputting 2x2 block of A,B,C matrices:

                        A = [ 0, 1, ...
                            [ 1, 1, ...
                            [ ...


                        B = [ 1, 0, ...
                            [ 1, 1, ...
                            [ ...


                        C = [ 0, 0, ...
                            [ 0, 0, ...
                            [ ...

        Running with single precision real data type:
Average GEMM execution time: 288.084442 (us)

                Outputting 2x2 block of A,B,C matrices:

                        A = [ 0, 1, ...
                            [ 1, 1, ...
                            [ ...


                        B = [ 1, 0, ...
                            [ 1, 1, ...
                            [ ...


                        C = [ 92, 96, ...
                            [ 116, 112, ...
                            [ ...

hipblasSgemm extremely slow when compared to cublasSgemm

What is the expected behavior

Would expect the performance of GEMM operations to be comparable to cublas GEMM operations.

What actually happens

The GEMM operations on floats is extremely slow.

How to reproduce

See the attached test case. Compile it with -D CUDA and run on an NVIDIA GPU, then recompile with -D HIP and run on an AMD GPU (using the hipified source), and compare the results.

Environment

Hardware description
A Linux desktop with a Radeon PRO W6800 and a Linux desktop with a GeForce RTX 2060 SUPER.
GPU: 19:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 GL-XL [Radeon PRO W6800]

Software versions
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
Runtime Version: 1.1
HIP version: 5.2.21153-02187ecf

I ran it on my two desktops -- one with Radeon W6800 and the other with GeForce RTX 2060. See attached GemmTest.cpp and here are the results:

GeForce RTX 2060:

size 512 average 8.85824e-05 s
size 1024 average 0.000308755 s
size 2048 average 0.00214913 s
size 4096 average 0.0187068 s
size 8192 average 0.153399 s
size 16384 average 1.2847 s

Radeon W6800:

size 512 average 0.262805 s
size 1024 average 0.0126241 s
size 2048 average 0.118601 s
size 4096 average 0.794954 s
size 8192 average 3.916 s
size 16384 average 26.0589 s

#include <unistd.h>
#include <iostream>
#include <stdlib.h>
#include <assert.h>

#if defined(CUDA)
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define GPUBLAS_OP_N CUBLAS_OP_N

typedef cudaError_t gpuError_t;
typedef cudaEvent_t gpuEvent_t;
typedef cudaStream_t gpuStream_t;

typedef cublasHandle_t gpuBlasHandle_t;
typedef cublasStatus_t gpuBlasStatus_t;

const gpuBlasStatus_t gpuBlasSuccess = CUBLAS_STATUS_SUCCESS;

const gpuError_t gpuSuccess = cudaSuccess;

#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice

gpuError_t gpuMemcpy(void* dest, void* src, size_t size, cudaMemcpyKind flags) {
  return cudaMemcpy(dest, src, size, flags);
}

gpuError_t gpuMallocManaged(void** ret, size_t size) {
  return cudaMallocManaged(ret, size);
}

gpuError_t gpuEventCreate(gpuEvent_t* pevent) {
  return cudaEventCreate(pevent);
}

gpuError_t gpuEventRecord(gpuEvent_t event, gpuStream_t stream) {
  return cudaEventRecord(event, stream);
}

gpuError_t gpuEventSynchronize(gpuEvent_t event) {
  return cudaEventSynchronize(event);
}

gpuError_t gpuGetLastError() {
  return cudaGetLastError();
}

gpuError_t gpuEventElapsedTime(float* t, gpuEvent_t start, gpuEvent_t stop) {
  return cudaEventElapsedTime(t, start, stop);
}

gpuError_t gpuFree(void* p) {
  return cudaFree(p);
}

gpuBlasStatus_t gpuBlasCreate(gpuBlasHandle_t* phandle) {
  return cublasCreate(phandle);
}

gpuBlasStatus_t gpuBlasSgemm(gpuBlasHandle_t handle, cublasOperation_t ta, cublasOperation_t tb, int m, int n, int k, const float* alpha, const float* a, int lda, const float* b, int ldb, const float* beta, float* c, int ldc) {
  return cublasSgemm(handle, ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}

#elif defined(HIP)
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <hipblas/hipblas.h>
#include <hiprand/hiprand.h>

#define GPUBLAS_OP_N HIPBLAS_OP_N

typedef hipError_t gpuError_t;
typedef hipEvent_t gpuEvent_t;
typedef hipStream_t gpuStream_t;

typedef hipblasHandle_t gpuBlasHandle_t;
typedef hipblasStatus_t gpuBlasStatus_t;

const gpuBlasStatus_t gpuBlasSuccess = HIPBLAS_STATUS_SUCCESS;

const gpuError_t gpuSuccess = hipSuccess;

#define gpuMemcpyHostToDevice hipMemcpyHostToDevice

gpuError_t gpuMemcpy(void* dest, void* src, size_t size, hipMemcpyKind flags) {
  return hipMemcpy(dest, src, size, flags);
}

gpuError_t gpuMallocManaged(void** ret, size_t size) {
  return hipMallocManaged(ret, size);
}

gpuError_t gpuEventCreate(gpuEvent_t* pevent) {
  return hipEventCreate(pevent);
}

gpuError_t gpuEventRecord(gpuEvent_t event, hipStream_t stream) {
  return hipEventRecord(event, stream);
}

gpuError_t gpuEventSynchronize(gpuEvent_t event) {
  return hipEventSynchronize(event);
}

gpuError_t gpuGetLastError() {
  return hipGetLastError();
}

gpuError_t gpuEventElapsedTime(float* t, gpuEvent_t start, gpuEvent_t stop) {
  return hipEventElapsedTime(t, start, stop);
}

gpuError_t gpuFree(void* p) {
  return hipFree(p);
}

gpuBlasStatus_t gpuBlasCreate(gpuBlasHandle_t* phandle) {
  return hipblasCreate(phandle);
}

gpuBlasStatus_t gpuBlasSgemm(gpuBlasHandle_t handle, hipblasOperation_t ta, hipblasOperation_t tb, int m, int n, int k, const float* alpha, const float* a, int lda, const float* b, int ldb, const float* beta, float* c, int ldc) {
  return hipblasSgemm(handle, ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}

#else
#error "Specify GPU type"
#endif


void initRand(float* p, int rows, int cols) {
  int a = 1;

  for (int i = 0; i < rows * cols; i++) {
    p[i] = (float) rand() / (float) (RAND_MAX / a);
  }
}

int main(int argc, char* argv[]) {
  int status, lower, upper, num, reps, verbose;

  lower = 256;
  upper = 8192;
  num = 25000;
  reps = 5;
  verbose = 0;

  while ((status = getopt(argc, argv, "l:u:n:r:v")) != -1) {
    switch (status) {
    case 'l':
      lower = strtoul(optarg, 0, 0);
      break;
    case 'u':
      upper = strtoul(optarg, 0, 0);
      break;
    case 'n':
      num = strtoul(optarg, 0, 0);
      break;
    case 'r':
      reps = strtoul(optarg, 0, 0);
      break;
    case 'v':
      verbose = 1;  // 'v' takes no argument in the getopt string, so optarg is unset
      break;
    default:
      std::cerr << "invalid argument: " << status << std::endl;
      exit(1);
    }
  }

  if (verbose) {
    std::cout << "Running with" << " lower: " << lower << " upper: " << upper << " num: " << num << " reps: " << reps << std::endl;
  }

  if (verbose) {
    std::cout << "initializing inputs" << std::endl;
  }

  gpuBlasHandle_t handle;
  gpuBlasStatus_t blasStatus;

  blasStatus = gpuBlasCreate(&handle);
  if (blasStatus != gpuBlasSuccess) {
    std::cerr << "Could not create blas handle " << blasStatus << std::endl;
    exit(1);
  }

  float* A = (float*) calloc(1, upper * upper * sizeof(float));
  float* B = (float*) calloc(1, upper * upper * sizeof(float));
  float* C = (float*) calloc(1, upper * upper * sizeof(float));

  initRand(A, upper, upper);
  initRand(B, upper, upper);
  initRand(C, upper, upper);

  float* dA, *dB, *dC;
  gpuError_t err;

  int lda, ldb, ldc, m, n, k;
  float alpha = 1.0f, beta = 0.0f;

  err = gpuMallocManaged((void**) &dA, upper * upper * sizeof(float));
  if (err != gpuSuccess) {
    std::cerr << "Could not allocate GPU memory; size: " << upper * upper * sizeof(float) << std::endl;
  }

  err = gpuMallocManaged((void**) &dB, upper * upper * sizeof(float));
  if (err != gpuSuccess) {
    std::cerr << "Could not allocate GPU memory; size: " <<  upper * upper * sizeof(float) << std::endl;
  }

  err = gpuMallocManaged((void**) &dC, upper * upper * sizeof(float));
  if (err != gpuSuccess) {
    std::cerr << "Could not allocate GPU memory; size: " << upper * upper * sizeof(float) << std::endl;
  }

  err = gpuMemcpy(dA, A, upper * upper * sizeof(float), gpuMemcpyHostToDevice);
  if (err != gpuSuccess) {
    std::cerr << "Could not copy to GPU memory; size: " << upper * upper * sizeof(float) << std::endl;
  }

  err = gpuMemcpy(dB, B, upper * upper * sizeof(float), gpuMemcpyHostToDevice);
  if (err != gpuSuccess) {
    std::cerr << "Could not copy to GPU memory; size: " << upper * upper * sizeof(float) << std::endl;
  }

  err = gpuMemcpy(dC, C, upper * upper * sizeof(float), gpuMemcpyHostToDevice);
  if (err != gpuSuccess) {
    std::cerr << "Could not copy to GPU memory; size: " << upper * upper * sizeof(float) << std::endl;
  }
  gpuEvent_t start, stop;
  gpuEventCreate(&start);
  gpuEventCreate(&stop);

  for (int s = lower; s <= upper; s = s * 2) {
    double sum = 0.0;
    for (int r = 0; r < reps; r++) {
      gpuEventRecord(start, 0);
      m = n = k = s;
      lda = m; ldb = k; ldc = m;

      blasStatus = gpuBlasSgemm(handle, GPUBLAS_OP_N, GPUBLAS_OP_N, m, n, k, &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
      gpuEventRecord(stop, 0);
      gpuEventSynchronize(stop);
      if (blasStatus != gpuBlasSuccess) {
        std::cerr << "gpuBlasSgemm failed: " << blasStatus << std::endl;
        exit(1);
      }
      err = gpuGetLastError();
      if (err != gpuSuccess) {
        std::cerr << "gpu error: " << err << std::endl;
        exit(1);
      }

      float elapsed;
      gpuEventElapsedTime(&elapsed, start, stop);
      elapsed /= 1000.0f;
      sum += elapsed;
    }
    std::cout << "size " << s << " average " << sum / reps << " s " << std::endl;
  }

  gpuFree(dA); gpuFree(dB); gpuFree(dC);
  free(A); free(B); free(C);
}

Building hipBLAS for CUDA backend fails

I'm building using CMake directly[1] for CUDA backend:

# On branch release/rocm-rel-5.5
mkdir build && cd build
cmake -DUSE_CUDA=ON -DHIP_ROOT_DIR=/opt/rocm ..
make

Scanning dependencies of target hipblas_fortran
[ 20%] Building Fortran object library/src/CMakeFiles/hipblas_fortran.dir/hipblas_module.f90.o
[ 40%] Linking Fortran shared library libhipblas_fortran.so
[ 40%] Built target hipblas_fortran
[ 60%] Building CXX object library/src/CMakeFiles/hipblas.dir/nvidia_detail/hipblas.cpp.o
In file included from /home/torrance/hipBLAS/library/src/nvidia_detail/hipblas.cpp:27:
/usr/local/cuda/include/cublas_v2.h:59:2: error: #error "It is an error to include both cublas.h and cublas_v2.h"
   59 | #error "It is an error to include both cublas.h and cublas_v2.h"
      |  ^~~~~

I can compile with the following changes, though I don't know if this is properly functional or not:

diff --git a/library/src/nvidia_detail/hipblas.cpp b/library/src/nvidia_detail/hipblas.cpp
index b60a31d..08a6257 100644
--- a/library/src/nvidia_detail/hipblas.cpp
+++ b/library/src/nvidia_detail/hipblas.cpp
@@ -23,7 +23,7 @@

 #include "hipblas.h"
 #include "exceptions.hpp"
-#include <cublas.h>
+//#include <cublas.h>
 #include <cublas_v2.h>
 #include <cuda_runtime_api.h>
 #include <hip/hip_runtime.h>
@@ -350,7 +350,7 @@ hipblasStatus_t hipblasGetPointerMode(hipblasHandle_t handle, hipblasPointerMode
 try
 {
     cublasPointerMode_t cublasMode;
-    cublasStatus        status = cublasGetPointerMode((cublasHandle_t)handle, &cublasMode);
+    cublasStatus_t        status = cublasGetPointerMode((cublasHandle_t)handle, &cublasMode);
     *mode                      = CudaPointerModeToHIPPointerMode(cublasMode);
     return hipCUBLASStatusToHIPStatus(status);
 }

[1] i.e. not using the install.sh script, though the error is present in both, obviously. Honestly, though: why this install script, when the rest of the HIP ecosystem uses CMake? And why default to making a .deb package?

Environment

Hardware description
GPU Tesla T4
CPU Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Software version
HIP v5.5
CUDA 12.1

hipConfig.cmake issue on NVIDIA

While installing on an NVIDIA platform, hipBLAS searches for hipConfig.cmake, but hipConfig.cmake is generated only on the HCC platform when HIP is installed.

MXNet: Query regarding HIP equivalents of cuBLAS APIs

Issue: While downstreaming MXNet code, we have come across certain additional cuBLAS APIs. What are the equivalent HIP APIs for these?

The list of cuBLAS APIs:
cublasStrmm
cublasDtrmm
cublasSsyrk
cublasDsyrk
cublasGetMathMode
cublasSetMathMode
cublasSgemmEx
cublasGemmEx

The list of cuBLAS variables:
cudaDataType:
CUDA_R_32F
CUDA_R_64F
CUDA_R_16F
CUDA_R_8U
CUDA_R_8I
CUDA_R_32I
cublasMath_t:
CUBLAS_DEFAULT_MATH
CUBLAS_TENSOR_OP_MATH
cublasGemmAlgo_t:
CUBLAS_GEMM_DFALT

hipEventCreate


Disrespecting CMake option `BUILD_WITH_SOLVER`

What is the expected behavior

  • rocSOLVER-specific code paths are guarded by the CMake option

What actually happens

  • Unconditional #include "rocsolver.h" in hipblas.cpp, resulting in a failing build.

How to reproduce

  • Configure with -D BUILD_WITH_SOLVER=OFF
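A conventional fix is to gate the rocSOLVER dependency behind the option and a compile definition. This is a sketch only; the option name comes from the issue, but the target, package, and macro names here are my assumptions, not the project's actual code:

```cmake
option(BUILD_WITH_SOLVER "Build hipBLAS with rocSOLVER support" ON)

if(BUILD_WITH_SOLVER)
  find_package(rocsolver REQUIRED)
  target_link_libraries(hipblas PRIVATE roc::rocsolver)
  target_compile_definitions(hipblas PRIVATE HIPBLAS_WITH_SOLVER)
endif()
```

In hipblas.cpp, the include and the solver code paths would then be wrapped in #ifdef HIPBLAS_WITH_SOLVER so that -D BUILD_WITH_SOLVER=OFF builds cleanly.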

Environment

Hardware description
GPU device string
CPU device string
Software version
ROCK v3.3-19
ROCR v3.3-19
HCC v3.1.20114-6776c83f-1
Library master @ 29b3fb0

Error while building hipBLAS

What is the expected behavior

  • build successful

What actually happens

running develop
running egg_info
writing hipmat.egg-info/PKG-INFO
writing top-level names to hipmat.egg-info/top_level.txt
writing dependency_links to hipmat.egg-info/dependency_links.txt
reading manifest file 'hipmat.egg-info/SOURCES.txt'
writing manifest file 'hipmat.egg-info/SOURCES.txt'
running build_ext
building 'hipmat.libhipmat' extension
/opt/rocm/hip/bin/hipcc -I/opt/rocm/hipblas/include -I/opt/rocm/hip/include/hip/ -I/usr/include/python2.7 -c hipmat/hipmat.cpp -o build/temp.linux-x86_64-2.7/hipmat/hipmat.o -O -fPIC
In file included from hipmat/hipmat.cpp:5:
In file included from /opt/rocm/hipblas/include/hipblas.h:17:
In file included from /opt/rocm/hip/include/hip/hip_complex.h:29:
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:107:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, signed short)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:108:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, unsigned int)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:109:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, signed int)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:110:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, double)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:111:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, unsigned long)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:112:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, signed long)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:113:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, unsigned long long)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:114:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, signed long long)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:106:5: error: C++ requires a type specifier for all declarations
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, unsigned short)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:114:80: error: expected ';' at end of declaration list
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipFloatComplex, signed long long)
                                                                               ^
                                                                               ;
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:103:45: error: member initializer 'x' does not name a non-static data member or base class
    __device__ __host__ hipFloatComplex() : x(0.0f), y(0.0f) {}
                                            ^~~~~~~
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:103:54: error: member initializer 'y' does not name a non-static data member or base class
    __device__ __host__ hipFloatComplex() : x(0.0f), y(0.0f) {}
                                                     ^~~~~~~
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:104:52: error: member initializer 'x' does not name a non-static data member or base class
    __device__ __host__ hipFloatComplex(float x) : x(x), y(0.0f) {}
                                                   ^~~~
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:104:58: error: member initializer 'y' does not name a non-static data member or base class
    __device__ __host__ hipFloatComplex(float x) : x(x), y(0.0f) {}
                                                         ^~~~~~~
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:105:61: error: member initializer 'x' does not name a non-static data member or base class
    __device__ __host__ hipFloatComplex(float x, float y) : x(x), y(y) {}
                                                            ^~~~
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:105:67: error: member initializer 'y' does not name a non-static data member or base class
    __device__ __host__ hipFloatComplex(float x, float y) : x(x), y(y) {}
                                                                  ^~~~
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:126:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipDoubleComplex, signed short)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:127:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipDoubleComplex, unsigned int)
    ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:128:5: error: expected 'restrict' specifier
    MAKE_COMPONENT_CONSTRUCTOR_TWO_COMPONENT(hipDoubleComplex, signed int)
    ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
Died at /opt/rocm/hip/bin/hipcc line 565.
error: command '/opt/rocm/hip/bin/hipcc' failed with exit status 1

How to reproduce

  • cd hipBLAS && mkdir build && cd build
  • ccmake ..
  • make

Environment

Hardware description
GPU Radeon R9 Fury Nano
CPU Intel i5-4440

Software stack

  • I am currently using the latest versions of HIP and hipBLAS

compute type of hipblasGemmEx

Running a CUDA program shows that cublasGemmEx supports the compute type CUBLAS_COMPUTE_32F_FAST_TF32 and the algorithm CUBLAS_GEMM_DEFAULT_TENSOR_OP. This compute type is not available in hipBLAS. Thank you for your discussion.

status = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, B_cols, A_rows, A_cols, &alpha, gpu_B, CUDA_R_32F, A_rows, gpu_A, CUDA_R_32F,
                      A_cols, &beta, gpu_C, CUDA_R_32F, A_rows, CUBLAS_COMPUTE_32F_FAST_TF32, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

hipblasGemmEx does not match the CPU or rocBLAS results for int8 x int8 to int32 matrix multiplication

A minimal test case is attached as igemm_all_in_one.cc.gz.
It can be compiled with g++ -I/usr/include/eigen3 -I/opt/rocm/include igemm_all_in_one.cc -Wl,-rpath,/opt/rocm/lib/ -L/opt/rocm/lib/ -lrocblas -lhipblas -lamdhip64 -o igemm_aiw

What is the expected behavior

  • "Result from hipblasGemmEx with i8 input" matches the output of "Result from Eigen with i32 data type" and "Result from rocblas_gemm_ex with i8 input"

What actually happens

  • "Result from hipblasGemmEx with i8 input" shows
    • -15 65 110 -65
    • -49 -79 -90 117
  • The reference outputs "Result from Eigen with i32 data type" and "Result from rocblas_gemm_ex with i8 input" both show
    • -55 16 89 -44
    • 122 -102 68 -39
  • Note the reference output matches the unit tests from onnxruntime.

How to reproduce

  • Run the unit tests

Environment

Hardware description
GPU gfx90a
CPU AMD EPYC 7542 32-Core Processor
Software version
ROCK modinfo amdgpu|grep version shows version: 5.13.20.22.10
ROCR v5.1.3
HCC HIP version: 5.1.20532-f592a741
Library v5.1.3

hipblasXgelsBatched() failed with error

Running the HIP program produces an error. I migrated the program from CUDA to HIP. Thanks.

To reproduce:

go to https://github.com/zjin-lcf/HeCBench/tree/master/src/gels-hip
type 'make'
type './main 1'

ERROR: hipblasXgelsBatched() failed with error HIPBLAS_STATUS_INVALID_VALUE.

hipBLAS's complex number definition causes compilation failures

What is the expected behavior

  • hipBLAS doesn't redefine the HIP complex number implementation, and simply uses hip_complex.h to provide hipFloatComplex & hipDoubleComplex

What actually happens

  • hipBLAS redefines the HIP complex number types in a contradictory way, which causes compilation failures

How to reproduce

  • Simply include hipblas.h and hip/hip_complex.h in any project:
/// test.cpp
#include <hipblas.h>
#include <hip/hip_complex.h>

hipcc test.cpp then fails with errors

Environment

Software version
HCC 3.0.19493-75ea952e-40756364719e

Why does the hipBLAS library define complex numbers in a way which is incompatible with the rest of HIP?

There are far too many errors for even an example program. The first error is:

/opt/rocm/include/hipblas.h:53:36: error: typedef redefinition with different types ('hip_complex_number<float>' vs 'hipFloatComplex' (aka 'HIP_vector_type<float, 2>'))
typedef hip_complex_number<float>  hipComplex;
                                   ^
/opt/rocm/hip/include/hip/hcc_detail/hip_complex.h:269:25: note: previous definition is here
typedef hipFloatComplex hipComplex;
                        ^

It seems the lines are: https://github.com/ROCmSoftwarePlatform/hipBLAS/tree/develop/library/include#L53

Is there a reason why hipBLAS could not use the HIP standard complex number definition? Or is there some workaround to be able to use both at the same time?

[5.5.X] `hipblasDatatype_t` should be replaced with HIP's `hipDataType`

What is the expected behavior

  1. hipDataType from HIP API should be used instead of hipblasDatatype_t.
  2. hipDataType should be populated with the corresponding hipblasDatatype_t elements, like:
    HIPBLAS_R_8I -> HIP_R_8I
  3. hipblasDatatype_t should be deleted from hipBLAS API or might be left as a typedef for hipDataType, like it is done for cublasDataType_t in cuBLAS API:
typedef cudaDataType cublasDataType_t;

[Reasons]

  1. hipblasDatatype_t, being actually an analogue of cudaDataType, is a type common to all HIP libraries, not only hipBLAS, and it might be used in any HIP application.
  2. Currently, we can't just hipify cudaDataType to hipblasDatatype_t because it would require adding a corresponding #include to hipblas.h, which is definitely incorrect.

What actually happens

ROCm/HIPIFY#383

How to reproduce

ROCm/HIPIFY#383

Software
hipcc --version HIP version: 4.3.0

[5.3.X][CUDA >= 11.0] `hipblasGemmEx` doesn't fully match `cublasGemmEx`

The problem is with the penultimate argument, hipblasDatatype_t computeType, which doesn't match cublasComputeType_t computeType. cublasComputeType_t appeared with CUDA 11.0; cublasGemmEx used cudaDataType instead of cublasComputeType_t for its penultimate argument from CUDA 8.0 until CUDA 11.0.

HIPBLAS_EXPORT hipblasStatus_t hipblasGemmEx(hipblasHandle_t    handle,
                                             hipblasOperation_t transA,
                                             hipblasOperation_t transB,
                                             int                m,
                                             int                n,
                                             int                k,
                                             const void*        alpha,
                                             const void*        A,
                                             hipblasDatatype_t  aType,
                                             int                lda,
                                             const void*        B,
                                             hipblasDatatype_t  bType,
                                             int                ldb,
                                             const void*        beta,
                                             void*              C,
                                             hipblasDatatype_t  cType,
                                             int                ldc,
                                             hipblasDatatype_t  computeType,
                                             hipblasGemmAlgo_t  algo);
CUBLASAPI cublasStatus_t CUBLASWINAPI cublasGemmEx(cublasHandle_t handle,
                                                   cublasOperation_t transa,
                                                   cublasOperation_t transb,
                                                   int m,
                                                   int n,
                                                   int k,
                                                   const void* alpha, /* host or device pointer */
                                                   const void* A,
                                                   cudaDataType Atype,
                                                   int lda,
                                                   const void* B,
                                                   cudaDataType Btype,
                                                   int ldb,
                                                   const void* beta, /* host or device pointer */
                                                   void* C,
                                                   cudaDataType Ctype,
                                                   int ldc,
                                                   cublasComputeType_t computeType,
                                                   cublasGemmAlgo_t algo);
typedef enum
{
    HIPBLAS_R_16F = 150, /**< 16 bit floating point, real */
    HIPBLAS_R_32F = 151, /**< 32 bit floating point, real */
    HIPBLAS_R_64F = 152, /**< 64 bit floating point, real */
    HIPBLAS_C_16F = 153, /**< 16 bit floating point, complex */
    HIPBLAS_C_32F = 154, /**< 32 bit floating point, complex */
    HIPBLAS_C_64F = 155, /**< 64 bit floating point, complex */
    HIPBLAS_R_8I  = 160, /**<  8 bit signed integer, real */
    HIPBLAS_R_8U  = 161, /**<  8 bit unsigned integer, real */
    HIPBLAS_R_32I = 162, /**< 32 bit signed integer, real */
    HIPBLAS_R_32U = 163, /**< 32 bit unsigned integer, real */
    HIPBLAS_C_8I  = 164, /**<  8 bit signed integer, complex */
    HIPBLAS_C_8U  = 165, /**<  8 bit unsigned integer, complex */
    HIPBLAS_C_32I = 166, /**< 32 bit signed integer, complex */
    HIPBLAS_C_32U = 167, /**< 32 bit unsigned integer, complex */
    HIPBLAS_R_16B = 168, /**< 16 bit bfloat, real */
    HIPBLAS_C_16B = 169, /**< 16 bit bfloat, complex */
} hipblasDatatype_t;
typedef enum {
  CUBLAS_COMPUTE_16F = 64,           /* half - default */
  CUBLAS_COMPUTE_16F_PEDANTIC = 65,  /* half - pedantic */
  CUBLAS_COMPUTE_32F = 68,           /* float - default */
  CUBLAS_COMPUTE_32F_PEDANTIC = 69,  /* float - pedantic */
  CUBLAS_COMPUTE_32F_FAST_16F = 74,  /* float - fast, allows down-converting inputs to half or TF32 */
  CUBLAS_COMPUTE_32F_FAST_16BF = 75, /* float - fast, allows down-converting inputs to bfloat16 or TF32 */
  CUBLAS_COMPUTE_32F_FAST_TF32 = 77, /* float - fast, allows down-converting inputs to TF32 */
  CUBLAS_COMPUTE_64F = 70,           /* double - default */
  CUBLAS_COMPUTE_64F_PEDANTIC = 71,  /* double - pedantic */
  CUBLAS_COMPUTE_32I = 72,           /* signed 32-bit int - default */
  CUBLAS_COMPUTE_32I_PEDANTIC = 73,  /* signed 32-bit int - pedantic */
} cublasComputeType_t;

CMake always wants to build for CUDA if it is at all present

What is the expected behavior

The user should be able to control which platform the build targets, since you can't build for both at once

What actually happens

There is no way to specify that the build is meant for HCC. If CUDA is installed at all, for whatever reason, the build tries to enable it, and if you were trying to build with HCC, the build fails

How to reproduce

Try to build using HCC for ROCm while also having CUDA installed on your system

Environment

Observed in Arch Linux whilst converting the scripts from Experimental ROC
I have CUDA installed and bumped into this.

Proposed fix

I'm making a PR with what seems to be the simplest way to do this: I provided an additional TRY_CUDA option. If it is OFF, the find_package call is skipped. By default it is ON, so the legacy behaviour is maintained.
But really, I feel that making this choice on the back of a find_package call (and a find module that has already been deprecated anyway) is the wrong way to do this. Instead, the correct fix would be a somewhat larger change and would include

set(HIPBLAS_BACKEND "hcc" CACHE STRING "Backend the hipBLAS build should use")
set_property(CACHE HIPBLAS_BACKEND PROPERTY STRINGS "hcc;cuda")

And then derive all the choices in inner branches of the build tree from querying this variable.
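For example, downstream of that cache variable, the platform checks could branch explicitly instead of reacting to find_package side effects (a sketch of the idea, untested against this tree):

```cmake
if(HIPBLAS_BACKEND STREQUAL "cuda")
  find_package(CUDAToolkit REQUIRED)
else()
  find_package(hip REQUIRED)
endif()
```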

hipBLAS compiled for CUDA cannot be found in CMake

What is the expected behavior

  • In CMake, find_package(hipblas) should work.

What actually happens

  • After updating hip and hipblas to the 4.5 release, hipblas can no longer be found by CMake on an NVIDIA platform:
CMake Error at /opt/cmake-3.21.3-linux-x86_64/share/cmake-3.21/Modules/CMakeFindDependencyMacro.cmake:47 (find_package):
  By not providing "Findhip.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "hip", but
  CMake did not find one.

  Could not find a package configuration file provided by "hip" with any of
  the following names:

    hipConfig.cmake
    hip-config.cmake

  Add the installation prefix of "hip" to CMAKE_PREFIX_PATH or set "hip_DIR"
  to a directory containing one of the above files.  If "hip" provides a
  separate development package or SDK, be sure it has been installed.
Call Stack (most recent call first):
  /opt/rocm/hipblas/lib/cmake/hipblas/hipblas-config.cmake:90 (find_dependency)
  CMakeLists.txt:239 (find_package)
  • This used to work fine on all previous releases.
  • A current workaround is to change this line from DEPENDS PACKAGE hip to DEPENDS PACKAGE HIP. Then after recompiling and installing hipBLAS, it works as expected.

How to reproduce

  • On an NVIDIA machine, install hip 4.5.
  • Download/clone hipBLAS at the 4.5 release. Compile using ./install.sh -i --cuda
  • In a CMake project, using the following will output the above error.
find_package(HIP REQUIRED)
find_package(hipblas REQUIRED)
  • Looking at /opt/rocm/hipblas/lib/cmake/hipblas/hipblas-config.cmake:90 shows find_dependency(hip) whereas find_dependency(HIP) works fine (see workaround above).

Environment

Ubuntu 20.04 with CUDA 11.5, CMake 3.21.3.
HIP 4.5 installed using debian packages. hipBLAS 4.5 compiled with --cuda.

HIPBLAS depends on ROCSOLVER, which isn't an Apt dependency

What is the expected behavior

  • Link programs that call DGEMM against HIPBLAS without error.
  • Apt installs all dependencies of HIPBLAS when HIPBLAS is installed.

What actually happens

  • HIPBLAS depends on the LAPACK-like ROCSOLVER library, which is not installed by Apt.
  • HIPBLAS program cannot be linked due to missing symbols from ROCSOLVER.

Apt Log

$ sudo apt update && sudo apt install hipblas
Hit:2 http://repo.radeon.com/rocm/apt/debian xenial InRelease                                      
Reading package lists... Done                                
Building dependency tree       
Reading state information... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
hipblas is already the newest version (0.36.0.0-rocm-rel-3.9-17-e4d9e7b).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
$ apt search rocsolver
Sorting... Done
Full Text Search... Done
rocsolver/Ubuntu 16.04 3.9.0.0-c2cd214 amd64
  Radeon Open Compute SOLVER library

rocsolver3.9.0/Ubuntu 16.04 3.9.0.0-c2cd214 amd64
  Radeon Open Compute SOLVER library

Linker errors

/usr/bin/ld: warning: librocsolver.so.0, needed by /opt/rocm/lib/libhipblas.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrf_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrf_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrf_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrf_batched'
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)

How to reproduce

  • Try to link the following program.

Code

#include <hipblas.h>

int main()
{
    hipblasHandle_t h;
    hipblasCreate(&h);  // initialize the handle before use
    int order{1};
    double alpha{0};
    double A{1}, B{1}, C{1};
    hipblasDgemm(h, HIPBLAS_OP_N, HIPBLAS_OP_N, order, order, order, &alpha, &A, order, &B, order, &alpha, &C, order);
    hipblasDestroy(h);
    return 0;
}

Build

$ hipcc bug.cc -L/opt/rocm/lib -lrocblas -L/opt/rocm/lib -lhipblas
/usr/bin/ld: warning: librocsolver.so.0, needed by /opt/rocm/lib/libhipblas.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrf_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrs'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrf_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgeqrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrf_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrs_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgeqrf_ptr_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_cgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrf'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgeqrf_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_dgetri_outofplace_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_sgetrs_strided_batched'
/usr/bin/ld: /opt/rocm/lib/libhipblas.so: undefined reference to `rocsolver_zgetrf_batched'
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)
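
All of the unresolved symbols are `rocsolver_*` functions, and the linker also warns that `librocsolver.so.0` cannot be found, so `libhipblas.so` appears to depend on rocSOLVER without that dependency being satisfied on the link line. A possible workaround, assuming rocSOLVER is installed under /opt/rocm/lib, is to link it explicitly:

$ hipcc bug.cc -L/opt/rocm/lib -lhipblas -lrocblas -lrocsolver

(If rocSOLVER is not installed at all, installing it first would be required, since the README lists it as a prerequisite for the rocBLAS backend.)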

Environment

Hardware

jrhammon@jrhammon-nuc:~/PRK/Cxx11$ rocminfo 
ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   8300                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            8                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32803976(0x1f48c88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32803976(0x1f48c88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Polaris 22 XT [Radeon RX Vega M GH]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26956(0x694c)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1190                               
  BDFID:                   256                                
  Internal Node ID:        1                                  
  Compute Unit:            24                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Software

$ hipcc --version
HIP version: 3.9.20412-6d111f85
clang version 12.0.0 (/src/external/llvm-project/clang 60f39e2924d51c1e8606f2135f95e9047fb1da5d)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-3.9.0/llvm/bin
