
hipTT - Fast GPU Tensor Transpose for NVIDIA and AMD GPUs

hipTT is a high-performance tensor transpose library for NVIDIA and AMD GPUs. hipTT is a HIP port of the cuTT library.

Copyright (c) 2016-2020 Antti-Pekka Hynninen, Dmitry Lyakh, Luke Roskop

Copyright (c) 2016-2020 Oak Ridge National Laboratory (UT-Battelle)

Version 1.1

Installation

Software requirements:

  • C++ compiler with C++11 support
  • CUDA or ROCm compiler

Hardware requirements:

  • NVIDIA Kepler GPU (compute capability 3.0) or above
  • AMD GPU supported by ROCm

To compile the hipTT library as well as the test cases and benchmarks, simply run

make

This will create the library itself:

  • include/cutt.h
  • lib/libcutt.a

as well as the tests and benchmarks:

  • bin/cutt_test
  • bin/cutt_bench

In order to use hipTT, you only need the include (include/cutt.h) and the library (lib/libcutt.a) files.
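
To link against the library, point the compiler at those two files. A minimal sketch, assuming a ROCm system with hipcc on the PATH and a placeholder source file my_transpose.cpp (the exact include and library paths depend on where hipTT was built):

hipcc -I/path/to/hipTT/include my_transpose.cpp -L/path/to/hipTT/lib -lcutt -o my_transpose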

Running tests and benchmarks

The test and benchmark executables are in the bin/ directory and can be run without any options. The options to the test executable let you choose the device ID on which to run:

cutt_test [options]
Options:
-device gpuid : use GPU with ID gpuid

The benchmark executable has an additional option that runs the benchmarks using plans chosen by measuring the performance of every possible implementation and picking the best one.

cutt_bench [options]
Options:
-device gpuid : use GPU with ID gpuid
-measure : use cuttPlanMeasure (default is cuttPlan)
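
For example, the following runs the tests on GPU 0 and the benchmarks with measurement-based plans on the same device (device 0 is just an illustrative choice):

./bin/cutt_test -device 0
./bin/cutt_bench -device 0 -measure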

Usage

hipTT uses a "plan structure" similar to the FFTW and cuFFT libraries, where the user first creates a plan for the transpose and then executes that plan. Here is an example:

#include <cstdio>   // fprintf
#include <cstdlib>  // exit
#include <cutt.h>

//
// Error checking wrapper for cutt
//
#define cuttCheck(stmt) do {                                 \
  cuttResult err = stmt;                            \
  if (err != CUTT_SUCCESS) {                          \
    fprintf(stderr, "%s in file %s, function %s\n", #stmt,__FILE__,__FUNCTION__); \
    exit(1); \
  }                                                  \
} while(0)

int main() {

  // Four dimensional tensor
  // Transpose (31, 549, 2, 3) -> (3, 31, 2, 549)
  int dim[4] = {31, 549, 2, 3};
  int permutation[4] = {3, 0, 2, 1};

  .... input and output data are set up here ...
  // double* idata : size product(dim)
  // double* odata : size product(dim)

  // Option 1: Create plan on NULL stream and choose implementation based on heuristics
  cuttHandle plan;
  cuttCheck(cuttPlan(&plan, 4, dim, permutation, sizeof(double), 0));

  // Option 2: Create plan on NULL stream and choose implementation based on performance measurements
  // cuttCheck(cuttPlanMeasure(&plan, 4, dim, permutation, sizeof(double), 0, idata, odata));

  // Execute plan
  cuttCheck(cuttExecute(plan, idata, odata));

  ... do stuff with your output and deallocate data ...

  // Destroy plan
  cuttCheck(cuttDestroy(plan));

  return 0;
}

Input (idata) and output (odata) data are both in GPU memory and must point to different memory areas for correct operation; that is, hipTT currently supports only out-of-place transposes. Note that using Option 2 to create the plan can take some time, especially for high-rank tensors.
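
The example above elides the data setup. Below is a minimal sketch of one way to do it on a CUDA build; the cudaMalloc, cudaMemcpy, and stream calls are ordinary CUDA runtime calls used here only for illustration (a HIP build would use the corresponding hipMalloc, hipMemcpy, and hipStream_t calls), and error checking is omitted for brevity.

#include <vector>
#include <cuda_runtime.h>
#include <cutt.h>

int main() {
  // Same tensor as above: transpose (31, 549, 2, 3) -> (3, 31, 2, 549)
  int dim[4] = {31, 549, 2, 3};
  int permutation[4] = {3, 0, 2, 1};
  size_t numElem = (size_t)31 * 549 * 2 * 3;

  // Host-side input, filled with arbitrary values for illustration
  std::vector<double> h_idata(numElem);
  for (size_t i = 0; i < numElem; i++) h_idata[i] = (double)i;

  // Device buffers; idata and odata must not overlap (out-of-place only)
  double *idata = nullptr, *odata = nullptr;
  cudaMalloc((void **)&idata, numElem * sizeof(double));
  cudaMalloc((void **)&odata, numElem * sizeof(double));
  cudaMemcpy(idata, h_idata.data(), numElem * sizeof(double), cudaMemcpyHostToDevice);

  // Create the plan on an explicit stream; the transpose runs on that stream
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cuttHandle plan;
  cuttPlan(&plan, 4, dim, permutation, sizeof(double), stream);
  cuttExecute(plan, idata, odata);
  cudaStreamSynchronize(stream);

  // Copy the transposed tensor back and clean up
  std::vector<double> h_odata(numElem);
  cudaMemcpy(h_odata.data(), odata, numElem * sizeof(double), cudaMemcpyDeviceToHost);
  cuttDestroy(plan);
  cudaStreamDestroy(stream);
  cudaFree(idata);
  cudaFree(odata);
  return 0;
}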

hipTT API

//
// Create plan
//
// Parameters
// handle            = Returned handle to cuTT plan
// rank              = Rank of the tensor
// dim[rank]         = Dimensions of the tensor
// permutation[rank] = Transpose permutation
// sizeofType        = Size of the elements of the tensor in bytes (=4 or 8)
// stream            = CUDA stream (0 if no stream is used)
//
// Returns
// CUTT_SUCCESS on success, an error code otherwise
//
cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType,
 cudaStream_t stream);

//
// Create plan and choose implementation by measuring performance
//
// Parameters
// handle            = Returned handle to cuTT plan
// rank              = Rank of the tensor
// dim[rank]         = Dimensions of the tensor
// permutation[rank] = Transpose permutation
// sizeofType        = Size of the elements of the tensor in bytes (=4 or 8)
// stream            = CUDA stream (0 if no stream is used)
// idata             = Input data size product(dim)
// odata             = Output data size product(dim)
//
// Returns
// CUTT_SUCCESS on success, an error code otherwise
//
cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType,
 cudaStream_t stream, void* idata, void* odata);

//
// Destroy plan
//
// Parameters
// handle            = Handle to the cuTT plan
//
// Returns
// CUTT_SUCCESS on success, an error code otherwise
//
cuttResult cuttDestroy(cuttHandle handle);

//
// Execute plan out-of-place
//
// Parameters
// handle            = Handle to the cuTT plan
// idata             = Input data size product(dim)
// odata             = Output data size product(dim)
//
// Returns
// CUTT_SUCCESS on success, an error code otherwise
//
cuttResult cuttExecute(cuttHandle handle, void* idata, void* odata);
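
Because plan creation, and cuttPlanMeasure in particular, can be expensive, a plan is typically created once and then executed repeatedly on different device buffers that share the same rank, dimensions, permutation, and element size. A minimal sketch, where nBatches, idataBatch, and odataBatch are hypothetical names and cuttCheck is the error-checking macro from the usage example above:

// One plan, many executes on equally shaped device buffers
// (nBatches, idataBatch, odataBatch are hypothetical)
cuttHandle plan;
cuttCheck(cuttPlan(&plan, 4, dim, permutation, sizeof(double), 0));
for (int b = 0; b < nBatches; b++) {
  cuttCheck(cuttExecute(plan, idataBatch[b], odataBatch[b]));
}
cuttCheck(cuttDestroy(plan));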

License

MIT License

Copyright (c) 2016-2020 Antti-Pekka Hynninen, Dmitry Lyakh, Luke Roskop

Copyright (c) 2016-2020 Oak Ridge National Laboratory (UT-Battelle)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


hiptt's Issues

compile and test

I was just curious to build and run your library. I built the library with the following changes:

  1. changed 3.9.0 to 3.9.1
  2. replaced g++ with hipcc, because there were errors when using g++:
g++ -o bin/cutt_test -L/opt/rocm-3.9.1/lib/ -lamdhip64 build/cutt_test.o build/TensorTester.o build/CudaMem.o build/CudaUtils.o build/cuttTimer.o -Llib -lcutt -fPIC
/usr/bin/ld: build/TensorTester.o: relocation R_X86_64_32 against symbol `_Z42__device_stub__setTensorCheckPatternKernelPjj' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: build/CudaUtils.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: lib/libcutt.a(cuttkernel.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: lib/libcutt.a(cuttGpuModelKernel.o): relocation R_X86_64_32 against symbol `_Z32__device_stub__runCountersKernelPKiiiiPiS1_S1_' can not be used when making a PIE object; recompile with -fPIE

Running the test shows the following message

TensorTester::checkTranspose FAIL at 98 ref 130 data 40
Test 1 failed

For the benchmark, please suggest the options for meaningful results. Thanks.

./bin/cutt_bench
Using Vega 20 [Radeon VII] SM version 9.0
L2 0.00MB
CPU using vector type AVX2 of length 8
0.00 GB/s
0.00 GB/s
0.00 GB/s
0.00 GB/s
scalarCopy 0.000000 GB/s
0.00 GB/s
0.00 GB/s
0.00 GB/s
0.00 GB/s
vectorCopy 0.000000 GB/s
0.00 GB/s
0.00 GB/s
0.00 GB/s
0.00 GB/s
memcpyFloat 0.000000 GB/s
bench OK
seed 1606747073
