
kblas-gpu

What is KBLAS

KAUST BLAS (KBLAS) is a high-performance CUDA library implementing a subset of BLAS as well as Linear Algebra PACKage (LAPACK) routines on NVIDIA GPUs. Using recursive and batch algorithms, KBLAS maximizes GPU bandwidth, reuses locally cached data, and increases device occupancy. KBLAS supports operations on both regular dense and hierarchical low-rank matrix workloads. It therefore provides the critical building blocks not only for traditional high-performance dense linear algebra libraries, but also for emerging numerical libraries supporting hierarchical low-rank matrix computations. This is a major step toward leveraging GPU hardware accelerators for high-performance low-rank matrix approximations and computations, which currently drive the research agenda of the linear algebra community.

KBLAS is written in CUDA C and requires the CUDA Toolkit to build.

Current Features of KBLAS

KBLAS provides highly optimized routines from various levels of BLAS and LAPACK, including:

  1. Legacy Level-2 BLAS: (⇟⎐ ⚭ ⚬) SYMV, GEMV, HEMV.
  2. Legacy Level-3 BLAS: (⇟⎐ ⚭ ⚬) TRSM, TRMM, GEMM (⚭ only).
  3. Batch Level-3 BLAS: (⇟⎏ ⚭ ⚬= ✼) TRSM, TRMM, SYRK.
  4. Batch Triangular: (⎏⇞ ⚭ ⚬= ✼) TRTRI, LAUUM.
  5. Batch Symmetric: (⎏⇞ ⚭ ⚬= ✼) POTRF, POTRS, POSV, POTRI, POTI.
  6. Batch General: (⎐⇟ ⚭ ⚬= ✼) GESVJ, GERSVD, GEQRF.
  7. Batch Tile low-rank GEMM (⎏ ⎐ ⇞ ⚬ =).
  8. GPU-Resident POTRF kernel (⎐ ⇞ ⚬).
  9. Batch Tall-and-Skinny QR (⇞ ⎐ ⚬ = | ✼) TSQR.
  10. Batch Adaptive Randomized Approximation (⇞ ⎐ ⚬ |) ARA.
  11. Batch column pivoted QR (⇞ ⎏ ⚬ = ✼) GEQP2.
  12. Batch small pivoted Cholesky (⇞ ⎏ ⚬ = ✼) PSTRF.

⇟ Standard precisions: s/d/c/z. ⇞ Real precisions: s/d. ⎏ Very small matrix sizes. ⎐ Arbitrary sizes. ⚬ Single-GPU support. ⚭ Multi-GPU support. = Uniform batch sizes. | Non-uniform batch sizes. ✼ Non-strided and strided variants.

Installation

Building KBLAS requires a recent version of GNU Make. To build KBLAS, please follow these instructions:

  1. Get KBLAS from git repository

git clone git@github.com:ecrc/kblas-gpu
    

    or

    git clone https://github.com/ecrc/kblas-gpu
    
  2. Go into KBLAS folder

    cd kblas-gpu
    
  3. Edit file make.inc to:

    • Enable / disable KBLAS sub-modules (SUPPORT_BLAS2, SUPPORT_BLAS3, SUPPORT_BATCH_TR, SUPPORT_SVD, SUPPORT_TLR).
    • Enable / disable usage of third party libraries (USE_MKL, USE_MAGMA) for performance comparisons.
    • Provide path for third party libraries if required (CUB_DIR, MAGMA_ROOT).
    • Specify CUDA architecture to compile for (CUDA_ARCH).

    or

    • Provide equivalent environment variables.
  4. Build KBLAS

    make
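
For reference, a make.inc sketch (the variable names come from step 3; the TRUE/FALSE syntax and example values here are assumptions, so consult the make.inc shipped with the source for the exact format):

```make
# Illustrative make.inc fragment; values are examples only.
SUPPORT_BLAS2    = TRUE
SUPPORT_BLAS3    = TRUE
SUPPORT_BATCH_TR = TRUE
SUPPORT_SVD      = FALSE
SUPPORT_TLR      = FALSE

# Third-party libraries, used only for performance comparisons.
USE_MKL   = FALSE
USE_MAGMA = FALSE
# CUB_DIR    = /path/to/cub      # if required
# MAGMA_ROOT = /path/to/magma    # if USE_MAGMA = TRUE

# Target compute capability, e.g. 70 for Volta.
CUDA_ARCH = 70
```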
    

Testing

The folder 'testing' includes a set of sample programs to illustrate the usage of each KBLAS routine, as well as to test the performance and accuracy of such routines against other vendor libraries.
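
As a sketch, the drivers can be built and run from the testing folder (the driver name follows the s/d/c/z precision naming above, and the -N size-range flag appears in user reports on this page; exact options may differ):

```shell
cd testing && make          # build the test drivers
cd bin
./test_dtrmm -N 200:512     # assumed flag: sweep problem sizes 200..512
```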

Related Publications

  1. A. Charara, D. Keyes, and H. Ltaief, Tile Low-Rank GEMM Using Batched Operations on GPUs, 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27 - 31, 2018, Proceedings, http://hdl.handle.net/10754/627402, 2018.

  2. A. Charara, D. Keyes, and H. Ltaief, Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs, ACM Trans. Math. Software (accepted), http://hdl.handle.net/10754/622975, 2018.

  3. W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. Keyes, Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression, Parallel Computing, 74:19-33, 2018.

  4. A. Abdelfattah, D. Keyes, and H. Ltaief, KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators, ACM Trans. Math. Software 42(3), DOI: http://dx.doi.org/10.1145/2818311, 2016.

  5. A. Charara, D. Keyes, and H. Ltaief, A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures, Concurr. Comput.: Prac. Experience, http://hdl.handle.net/10754/622077, 2016.

  6. A. Charara, H. Ltaief, and D. Keyes, Redesigning Triangular Dense Matrix Computations on GPUs, 22nd International Euro-Par Conference on Parallel and Distributed Computing, Best papers, DOI: http://dx.doi.org/10.1007/978-3-319-43659-3_35, 2016.

  7. A. Abdelfattah, H. Ltaief, and D. Keyes, High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications, 21st International Euro-Par Conference on Parallel and Distributed Computing, 2015.

  8. A. Abdelfattah, D. Keyes, and H. Ltaief, Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU, 18th International Euro-Par Conference on Parallel and Distributed Computing, 2013.

  9. A. Abdelfattah, J. Dongarra, D. Keyes, and H. Ltaief, Optimizing Memory-Bound SyMV Kernel on GPU Hardware Accelerators, 10th International Conference High Performance Computing for Computational Science - VECPAR, DOI: http://dx.doi.org/10.1007/978-3-642-38718-0_10, 2012.



kblas-gpu's People

Contributors

acharara, egonzalf, ltaiefhatem, pghysels, stefanozampini, wajihboukaram


kblas-gpu's Issues

Build broken with CUDA 12.5.1

I'm trying to build KBLAS using CUDA 12.5.82 and GCC 13.2 and getting the following error:

    /mnt/slowstore/pub/kblas-gpu/include/kblas_operators.h(130): error: cannot overload functions distinguished by return type alone
    __attribute__((device)) static __inline__ void atomicAdd(cuFloatComplex* address, cuFloatComplex val)
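
cuFloatComplex is a typedef of float2, and recent CUDA toolkits declare a built-in atomicAdd(float2*, float2) that returns the old value, which collides with a void-returning custom overload. One possible workaround, sketched here under the assumption that the built-in overload is what triggers the conflict (this is not an official KBLAS patch), is to compile the custom overload out on newer toolkits:

```cuda
// Sketch of a workaround (assumption: the conflicting built-in
// atomicAdd(float2*, float2) overload ships with CUDA 12.x headers).
#if CUDART_VERSION < 12000
__device__ static __inline__ void atomicAdd(cuFloatComplex* address,
                                            cuFloatComplex val)
{
    // Complex atomic add as two independent float atomics.
    atomicAdd(&(address->x), val.x);
    atomicAdd(&(address->y), val.y);
}
#endif
```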

__shfl() deprecated

Hi,
When building KBLAS for arch 60 with CUDA 9, thousands of warnings appear:

    ptxas /tmp/tmpxft_00003b92_00000000-5_XXXXXXXXXX.ptx, line 3957; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version

    ../include/operators.h(93): warning: function "__shfl(int, int, int)"
    /usr/local/cuda/include/sm_30_intrinsics.hpp(151): here was declared deprecated ("__shfl() is deprecated in favor of __shfl_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
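
The warning points at the pre-Volta warp shuffle intrinsics. Since CUDA 9, __shfl and friends have _sync variants that take an explicit participation mask; a minimal sketch of the migration (assuming all 32 lanes participate, the common full-warp case):

```cuda
// Old (deprecated):  v = __shfl(v, srcLane);
// New (CUDA 9+), with a full-warp participation mask:
__device__ int broadcast_from_lane(int v, int srcLane)
{
    return __shfl_sync(0xffffffff, v, srcLane);
}
```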

CUDA 8 compatibility

Compilation errors with CUDA 8: naming conflicts with new functions introduced in CUDA 8. To be resolved.

GTX Titan runtime error

A runtime error occurs when running TRMM or TRSM on a GTX Titan device.
The error reports a cuBLAS error or an invalid memory access.
No definite reproduction sequence has been identified; the error mostly occurs with single and complex precisions.

installation issue about v3.0.0

Hello acharara,
The following errors occurred when I built KBLAS. They seem to be cast errors:

    batch_triangular/Xtrmm_batch.cu(83): error: argument of type "float **" is incompatible with parameter of type "magma_int_t"

    batch_triangular/Xtrmm_batch.cu(84): error: argument of type "int" is incompatible with parameter of type "float **"

Do you have any ideas?

Error while testing library

Hi,

After installing KBLAS for arch 62 with CUDA 10.2 and running make in testing, I tried running "./test_dtrmm -N 200:512" in testing/bin, which gave the following error:

    side L, uplo L, trans N, diag N, db 512
        M     N     kblasTRMM_REC GF/s (ms)  kblasTRMM_CU GF/s (ms)  cublasTRMM GF/s (ms)  SP_REC   SP_CU   Error
    ====================================================================
      200   512   CUDA runtime error: no kernel image is available for execution on the device (209) in Xtrmm at blas_l3/Xtrmm.cu:479
    CUBLAS error: execution failed (13) in test_trmm at blas_l3/test_trmm.ch:202

Am I doing something wrong? I wish to use KBLAS for batched SVD. How do I use this library? I did get some warnings while building KBLAS; could that be the reason for this error?
