# memcpy-gemm

memcpy-gemm provides libraries and binaries for testing the communication
and compute performance of GPUs. The primary binary generated for this purpose
is the memcpy_gemm program. memcpy_gemm supports testing communication speed
between any GPU and its host or peers, and can test the compute
performance in half, single, or double precision for matrix multiplication.

## Requirements:
memcpy-gemm requires the following tools for installation
* bazel
* autoconf

Bazel relies on the following libraries for execution. When building, bazel
will pull the versions specified in the WORKSPACE file from internet sources.
* abseil
* glog
* googletest
* libnuma
* half
* re2

Currently, libnuma depends on autoconf for configuration, which we wrap in bazel
at build time.

memcpy-gemm also requires a CUDA installation. The default location is
/usr/local/cuda. To specify a different directory, export a "CUDA_PATH"
environment variable. Note that bazel will create a symlink to this path
named "cuda", which is how CUDA headers are referenced in the code. memcpy-gemm
is compatible with CUDA >= 9.0, and makes no guarantees about long-term support
for older CUDA versions going forward.

A few CUDA sources in memcpy-gemm need to be compiled with NVCC. If CUDA_PATH is
correctly set (or the default location is correct), NVCC will be found and
invoked automatically. NVCC only works with certain host compilers, so you may
need to change the host compiler that bazel uses. The quickest way to do this
is via the CC and CXX environment variables.
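
Before building, it can help to confirm which CUDA directory would be picked up. The following is a minimal sketch, not part of memcpy-gemm: `resolve_cuda_root` and `nvcc_path` are hypothetical helper names, and the sketch assumes NVCC sits at `bin/nvcc` under the CUDA root, as in a standard toolkit install.

```python
import os

DEFAULT_CUDA_ROOT = "/usr/local/cuda"  # default location named in this README


def resolve_cuda_root(env=None):
    """Return CUDA_PATH if it is set and non-empty, else the default location."""
    env = os.environ if env is None else env
    return env.get("CUDA_PATH") or DEFAULT_CUDA_ROOT


def nvcc_path(env=None):
    """Assumed NVCC location: <CUDA root>/bin/nvcc (standard toolkit layout)."""
    return os.path.join(resolve_cuda_root(env), "bin", "nvcc")


if __name__ == "__main__":
    path = nvcc_path()
    status = "found" if os.path.isfile(path) else "missing"
    print(f"nvcc expected at {path}: {status}")
```

If the script reports a missing compiler, setting CUDA_PATH to the correct directory before invoking bazel is the first thing to try.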

## Building:
With the proper dependencies installed, building the memcpy_gemm binary
should be as simple as:

`bazel build :memcpy_gemm`

and the memcpy_gemm binary is written to the bazel-bin directory.

If the CUDA location and/or the host compiler needs to be configured, a build
invocation may look something like:
```
# In this example, we have a custom CUDA location, and we need to specify
# GCC 9, because the default gcc (10 in this example) doesn't work with CUDA 11
# NVCC
CUDA_PATH=/usr/local/cuda-11.0 CC=gcc-9 CXX=g++-9 bazel build :memcpy_gemm
```

To run tests, use `bazel test :memcpy_gemm_test` or `bazel test :all`. Note that
memcpy_gemm_test assumes that one GPU is available, but makes no assumptions
about its performance characteristics.

## Creating a static build:
By default, when built on most Linux systems, memcpy-gemm dynamically links to
libgcc, libstdc++, and libc. While some dynamic dependencies on libc are unavoidable,
memcpy-gemm will link statically to libstdc++ and libgcc if compiled in the static configuration:

`bazel build --config=static :memcpy_gemm`

This mode is useful for sidestepping library dependencies when cross-compiling.

### Configuring static build
memcpy-gemm uses its own description of the C++ toolchain for static builds. On some Linux systems,
the toolchain include path to the standard C++ library may be incorrectly configured. When building in
static mode, this can lead to errors such as:

> this rule is missing dependency declarations for the following files included by (some library)
> '/usr/lib/....some header'
> '/usr/lib/....other header'

In the short term, this issue can be fixed by modifying the C++ toolchain descriptor file.
Open /(path/to/memcpy-gemm/source)/toolchain/cc_toolchain_config.bzl. Search for the
cxx_builtin_include_directories variable, and add the path to your headers to the list.

## Running:
To see the list of options of memcpy-gemm, type "/path/to/memcpy_gemm --help".

### Supported compute types

memcpy-gemm supports a variety of input/output/compute precisions. Which
combinations are supported depends on the GPU architecture and CUDA version. With CUDA
9 or later, the following data type combinations are allowed (cc = compute capability
of the device).

| `input_precision` | `output_precision` | `compute_precision` | Restrictions | Notes |
| ----------------- | ------------------ | ------------------- | ------------ | ----- |
| single            | single             | single              | none         | same as fp_precision = 'single' |
| double            | double             | double              | none         | same as fp_precision = 'double' |
| half              | half               | half                | cc >= 6.0    | same as fp_precision = 'half' |
| half              | single             | single              | cc >= 6.0    | commonly called 'mixed precision' |
| int8              | int32              | int32               | cc >= 7.0    | |
| int8              | single             | single              | cc >= 7.0    | On cc >= 7.5, can use tensor cores. |
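
The table can also be read as a lookup from a precision triple to a minimum compute capability. The sketch below encodes only the rows listed above; `SUPPORTED` and `is_supported` are hypothetical names for illustration and do not reflect how memcpy-gemm validates its flags internally.

```python
# (input, output, compute) -> minimum compute capability as (major, minor),
# or None where the table lists no restriction.
SUPPORTED = {
    ("single", "single", "single"): None,
    ("double", "double", "double"): None,
    ("half", "half", "half"): (6, 0),
    ("half", "single", "single"): (6, 0),
    ("int8", "int32", "int32"): (7, 0),
    ("int8", "single", "single"): (7, 0),
}


def is_supported(input_precision, output_precision, compute_precision, cc):
    """Return True if the combination appears in the table and cc is high enough.

    cc is the device compute capability as a (major, minor) tuple, e.g. (7, 5).
    """
    key = (input_precision, output_precision, compute_precision)
    if key not in SUPPORTED:
        return False
    min_cc = SUPPORTED[key]
    return min_cc is None or tuple(cc) >= min_cc


if __name__ == "__main__":
    print(is_supported("half", "single", "single", (7, 0)))
    print(is_supported("int8", "int32", "int32", (6, 1)))
```

Representing the compute capability as a tuple avoids floating-point comparisons and orders versions correctly by major, then minor.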

### Memory Usage

memcpy-gemm allocates CPU and GPU memory for both memcpy and GEMM operations. For
memcpy tests, memcpy-gemm allocates a buffer on both the source and destination
devices (be it a CPU or GPU), with the size equal to the number of kibibytes
specified by the buffer_size_KiB flag. Buffers are not reused between flows, so
(for example) with flows 'c0-g0-a0 g0-g1-a1', GPU0 will allocate 2 buffers, one
for each flow.
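
The buffer accounting above can be sketched in a few lines. This is an illustration only: it assumes each flow is written as `source-destination-...` and that only the first two fields name devices (the meaning of the trailing field is not interpreted here). `buffers_per_device` and `buffer_bytes_per_device` are hypothetical helpers, not memcpy-gemm functions.

```python
from collections import Counter


def buffers_per_device(flows):
    """Count memcpy buffers per device for a space-separated flow string.

    ASSUMPTION: each flow looks like 'src-dst-...'; only the first two
    fields are treated as devices (e.g. 'c0', 'g1'). The rest is ignored.
    """
    counts = Counter()
    for flow in flows.split():
        fields = flow.split("-")
        if len(fields) < 2:
            raise ValueError(f"cannot parse flow: {flow!r}")
        src, dst = fields[0], fields[1]
        counts[src] += 1  # one buffer on the source device
        counts[dst] += 1  # one buffer on the destination device
    return counts


def buffer_bytes_per_device(flows, buffer_size_kib):
    """Bytes allocated per device: one buffer of buffer_size_kib KiB per flow endpoint."""
    return {
        device: count * buffer_size_kib * 1024
        for device, count in buffers_per_device(flows).items()
    }


if __name__ == "__main__":
    print(dict(buffers_per_device("c0-g0-a0 g0-g1-a1")))
```

For the example flows in the text, the sketch counts two buffers on `g0` and one each on `c0` and `g1`, matching the description.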

If the --gemm option is set to true, GEMM calls allocate two input arrays and one
output array on each device specified in the --gpus flag. The total number of bytes
allocated on each device is then:
bytes = (dim1_matrix1 * dim2_matrix1 * sizeof(input precision)) + \
        (dim1_matrix2 * dim2_matrix2 * sizeof(input precision)) + \
        (dim1_output_matrix * dim2_output_matrix * sizeof(output precision))
If M=N=K, using N as the dimension, this simplifies to:
bytes = 2 * (N^2 * sizeof(input precision)) + N^2 * sizeof(output precision)

The sizes of the supported precisions are:
* int8 - 1 byte
* half - 2 bytes
* bf16 - 2 bytes
* single - 4 bytes
* int32 - 4 bytes
* double - 8 bytes

In addition, expect a small GPU memory overhead from cuBLAS initialization.
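
As a worked example of the formula above, the following sketch estimates the per-device GEMM allocation. It assumes the usual GEMM convention (first input is M x K, second is K x N, output is M x N); `gemm_bytes` is a hypothetical helper, and the result excludes the cuBLAS overhead.

```python
# Element sizes in bytes, as listed in this README.
SIZEOF = {
    "int8": 1,
    "half": 2,
    "bf16": 2,
    "single": 4,
    "int32": 4,
    "double": 8,
}


def gemm_bytes(m, n, k, input_precision, output_precision):
    """Estimate bytes for two input matrices and one output matrix.

    ASSUMPTION: inputs are M x K and K x N, the output is M x N.
    """
    in_size = SIZEOF[input_precision]
    out_size = SIZEOF[output_precision]
    return (m * k + k * n) * in_size + m * n * out_size


if __name__ == "__main__":
    # Mixed precision (half in, single out) with M = N = K = 16384:
    # 2 * N^2 * 2 bytes + N^2 * 4 bytes = 8 * N^2 bytes = 2 GiB.
    total = gemm_bytes(16384, 16384, 16384, "half", "single")
    print(f"{total} bytes = {total / 2**30:.1f} GiB")
```

Comparing this estimate with the free memory on the device gives a quick check of whether a chosen matrix size will fit.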