
tiled-mm's Introduction

Overview

Tiled-MM is a very fast and easy-to-use library for multiplying matrices on the GPU. Unlike NVIDIA's cuBLAS, this library takes pointers on the host side (CPU), splits the matrices into tiles, pipelines them efficiently to the GPU, and copies the result back to the CPU. It can serve as an almost drop-in replacement for cublasXt, and is ported to both NVIDIA and AMD GPUs.

It offers more features than the standard cublas API: for example, the user can specify the number of GPU streams to be used, as well as the tile size for each dimension separately, neither of which is possible with the standard cublas API.

Tiled-MM is used in production as a backend of the COSMA algorithm and is thus well-tested.

Performance

The benchmarks were performed on a single node of the Piz Daint supercomputer (Cray XC50), equipped with a P100 NVIDIA GPU. We compared the performance of our library Tiled-MM with the vanilla version of cublasXt and with a manually tuned version of cublasXt, in which we set the tile size to 4000 and enabled the pinned memory mode. Tiled-MM was substantially faster than the vanilla version of cublasXt, and achieved performance similar to the manually tuned version, as can be seen from the results below.

In the benchmark, we used double precision, square matrices given in column-major ordering, and alpha = beta = 1.0.

Features:

  • The user can specify the tile size of each dimension separately.
  • The user can specify the number of streams to be used.
  • The user can reuse the same context (and thus the same device memory) for many multiplications which can lead to significant performance improvements.
  • Fully templatized, supporting arbitrary data types.
  • Ported to both NVIDIA and AMD GPUs.

Building and Installing

Assuming that you want to use the gcc 8 compiler, you can build the project as follows:

# clone the repo
git clone https://github.com/eth-cscs/Tiled-MM
cd Tiled-MM
mkdir build
cd build

# build
CC=gcc-8 CXX=g++-8 cmake -DTILEDMM_GPU_BACKEND=CUDA ..

# compile
make -j 4

When building the examples, cxxopts is required. It is available in most package managers: apt-get install libcxxopts-dev (Ubuntu) or brew install cxxopts (macOS).

The option -DTILEDMM_GPU_BACKEND can have the following values:

  • CUDA: for NVIDIA GPUs
  • ROCM: for AMD GPUs
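For an AMD GPU, the configure step mirrors the CUDA command shown above (compiler choice aside, which depends on your ROCm toolchain):

```shell
# configure for AMD GPUs with the ROCm backend, then compile
cmake -DTILEDMM_GPU_BACKEND=ROCM ..
make -j 4
```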

Minimal Working Example

Using the library is very simple: just #include <tiled_mm.hpp> and use it as follows:

// A dimensions: m x k
auto a_host = gpu::malloc_pinned<double>(m * k, 1);
// B dimensions: k x n
auto b_host = gpu::malloc_pinned<double>(k * n, 1);
// C dimensions: m x n
auto c_host = gpu::malloc_pinned<double>(m * n, 0);

double alpha = 1.0;
double beta = 0.0;

// preallocates device buffers and other CUDA stuff
// the context does not have to be created explicitly
// so the user can omit this part
auto ctx = gpu::make_context();

// compute c = alpha * a * b + beta * c
// There is also a version without ctx, in case the user
// does not want to create the context explicitly
gpu::gemm(*ctx,
          trans_a, trans_b,
          m, n, k,
          alpha,
          a_host, ld_a,
          b_host, ld_b,
          beta,
          c_host, ld_c);

// optionally, we can set the following two boolean flags
bool pin_buffers = false; // since a_host, b_host and c_host are already pinned, gpu::gemm should not pin them
bool copy_c_back = true;  // if we want to copy the result back to the host or leave it on the gpu
gpu::gemm(*ctx,
          trans_a, trans_b,
          m, n, k,
          alpha,
          a_host, ld_a,
          b_host, ld_b,
          beta,
          c_host, ld_c,
          pin_buffers, copy_c_back);

// if copy_c_back == false, the result is stored on the device with the following pointer:
double* c_device = ctx->get_full_device_buffer_c().data();

When creating the context, the user can specify tile dimensions and the number of streams to be used as:

int tile_size_m = 5000;
int tile_size_n = 5000;
int tile_size_k = 5000;
int n_streams = 2;

auto ctx = gpu::make_context(n_streams, tile_size_m, tile_size_n, tile_size_k);

Running the Benchmarks

For detailed benchmarking, there is a miniapp that takes the host pointers for A, B and C and computes C = beta * C + alpha * A * B, outputting the time-to-solution as well as the throughput.

The miniapp consists of the executable ./build/examples/multiply which can be run with the following command line (assuming we are in the root folder of the project):

./build/examples/multiply -m 10000 -n 10000 -k 10000 -r 1

The overview of all supported options is given below:

OPTION             POSSIBLE VALUES                               DESCRIPTION
-m (--m_dim)       positive integer                              number of rows of C
-n (--n_dim)       positive integer                              number of columns of C
-k (--k_dim)       positive integer                              size of the shared dimension of A and B
--tile_m           positive integer                              tile size for dimension m
--tile_n           positive integer                              tile size for dimension n
--tile_k           positive integer                              tile size for dimension k
--ld_a             positive integer                              leading dimension of matrix A
--ld_b             positive integer                              leading dimension of matrix B
--ld_c             positive integer                              leading dimension of matrix C
-t (--transpose)   a string XY, where X and Y are in {N, T, C}   transpose flags for matrices A and B
-r                 positive integer                              number of repetitions
--alpha            real value (double)                           the alpha in C = beta * C + alpha * A * B
--beta             real value (double)                           the beta in C = beta * C + alpha * A * B

For example, running with the following flags:

./build/examples/multiply -m 1000 -n 1000 -k 1000 --transpose=TN -r 1

should produce the following output:

==================================================
                Benchmarking Tiled-MM
==================================================
         MATRIX SIZES
=============================
 A = (1000, 1000)
 B = (1000, 1000)
 C = (1000, 1000)
=============================
         LEADING DIMS
=============================
 LD_A = 1000
 LD_B = 1000
 LD_C = 1000
=============================
      SCALING CONSTANTS
=============================
 alpha = 1
 beta  = 1
=============================
      TRANSPOSE FLAGS
=============================
 trans_a = T
 trans_b = N
=============================
         TILE SIZES
=============================
 tile_m = 5000
 tile_n = 5000
 tile_k = 5000
=============================
      ADDITIONAL OPTIONS
=============================
 num. of gpu streams = 2
 num. of repetitions = 1
=============================

==================================================
         Results of benchmarking Tiled-MM
==================================================
 1) The version with copying C to back to host:
    -> Avg Time [ms] = 11
    -> Throughput [Gflops] = 181.818
==================================================
 2) The version without copying C to back to host:
    -> Avg Time [ms] = 10
    -> Throughput [Gflops] = 200
==================================================

Testing

For testing purposes, there is a testing miniapp that generates random matrices A, B and C, computes C = beta * C + alpha * A * B both with Tiled-MM and with blas, and outputs whether the results agree.

The miniapp consists of the executable ./build/tests/test-multiply, which supports the same parameters as the benchmarking miniapp (see above). It can be run e.g. with the following command line (assuming we are in the root folder of the project):

./build/tests/test-multiply -m 1000 -n 1000 -k 1000 --transpose=TN

which should produce the following output:

==================================================
                Benchmarking Tiled-MM
==================================================
         MATRIX SIZES
=============================
 A = (1000, 1000)
 B = (1000, 1000)
 C = (1000, 1000)
=============================
         LEADING DIMS
=============================
 LD_A = 1000
 LD_B = 1000
 LD_C = 1000
=============================
      SCALING CONSTANTS
=============================
 alpha = 1
 beta  = 1
=============================
      TRANSPOSE FLAGS
=============================
 trans_a = T
 trans_b = N
=============================
         TILE SIZES
=============================
 tile_m = 5000
 tile_n = 5000
 tile_k = 5000
=============================
      ADDITIONAL OPTIONS
=============================
 num. of gpu streams = 2
 num. of repetitions = 1
=============================
Time [ms] with copying C back: 11
Time [ms] without copying C back: 10
The result is CORRECT

Running make test will run a few default tests.

Author

Marko Kabic ([email protected])

tiled-mm's People

Contributors

adhocman, elbriggs, haampie, kabicm, mtaillefumier, simonpintarelli, teonnik

tiled-mm's Issues

Problem when upgrading from 2.0 to 2.3

We have been using v2.0 of Tiled-MM successfully for some time, but when I recently tried updating to v2.3 we started experiencing incorrect results in our application (https://github.com/RMGDFT/rmgdft). I determined that the problematic calls were occurring with transa and transb both 'N', while calls with transa='T' and transb='N' were producing the correct results. The matrices in question were tall and skinny, with (m, n, k) = (46656, 152, 480). I tried running the testing app with these values but no error was reported, so I'm not sure how to proceed. I looked over the commit history, and 85331eb seems to be the only thing that could have changed the behavior.

Fixing CI/CD issues

It seems the latest commit struggles with finding CUDA. This was not a problem before. Do you have any idea why this happens now?

mkdir build 
cd build 
cmake .. 
CMAKE_VERSION="v$(cat CMakeCache.txt | grep '^CMAKE_PROJECT_VERSION\b' | cut -d "=" -f2)"
GIT_VERSION=$(git describe --tags)
if [ "$CMAKE_VERSION" != "$GIT_VERSION" ]; then
    echo ::set-output name=CMAKE_ISSUE::yes
    echo ::set-output name=CMAKE_VERSION::$CMAKE_VERSION
    echo ::set-output name=GIT_VERSION::$GIT_VERSION
fi
shell: /usr/bin/bash -e {0}
-- The CXX compiler identification is GNU 9.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Selected TILEDMM_GPU_BACKEND: CUDA
CMake Error at /usr/local/share/cmake-3.23/Modules/FindCUDA.cmake:859 (message):
  Specify CUDA_TOOLKIT_ROOT_DIR
Call Stack (most recent call first):
  CMakeLists.txt:32 (find_package)
-- Configuring incomplete, errors occurred!
See also "/home/runner/work/Tiled-MM/Tiled-MM/build/CMakeFiles/CMakeOutput.log".
Error: Process completed with exit code 1.

@teonnik @haampie @simonpintarelli

Transposes?

I've been looking for a suitable cublasXt equivalent for AMD GPUs, but I need the ability to multiply the transpose of a matrix by another, and the API does not seem to support that. Or am I missing something?
