The PACXX (Programming Accelerators with C++) project started in 2013 as a PhD thesis and is now open source.
PACXX is a simple, lightweight, yet powerful programming model for accelerators in C++. PACXX was primarily planned as a replacement for CUDA at a time when CUDA had no C++11/14 support. Over the past years PACXX has not only advanced GPU programming to C++14 and beyond, but has also become portable across different hardware architectures.
Currently, PACXX supports Nvidia GPUs with Compute Capability 2.0 and above and CPUs from different vendors (Intel, AMD, ARM). Within a few weeks PACXX will rock on ROCm-enabled GPUs from AMD as well.
First of all, clone the source:
git clone --recursive https://github.com/pacxx/pacxx-llvm llvm
The main repo pulls in all the submodules you need to get going, including the PACXX runtime and the modified Clang frontend.
cd llvm
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_ENABLE_CXX1Y=ON -DCMAKE_CXX_FLAGS_RELEASE="-O3"
make -j<number of cores>
Get some coffee, this can take some time.
#include <PACXX.h>
#include <vector>
#include <algorithm>

using namespace pacxx::v2;

int main(int argc, char *argv[]) {
  Executor::Create<CUDARuntime>(0); // create an executor
  auto &exec = Executor::get(0);    // retrieve the default executor

  size_t size = 128;
  std::vector<int> a(size, 1); // allocate some memory on the host
  std::vector<int> b(size, 2);
  std::vector<int> c(size, 0);
  std::vector<int> gold(size, 0);

  auto &da = exec.allocate<int>(a.size()); // allocate some memory on the device
  auto &db = exec.allocate<int>(b.size());
  auto &dc = exec.allocate<int>(c.size());

  da.upload(a.data(), a.size()); // upload data to the device
  db.upload(b.data(), b.size());
  dc.upload(c.data(), c.size());

  auto pa = da.get(); // grab the raw pointer from the device address space
  auto pb = db.get();
  auto pc = dc.get();

  auto vadd = [=](range &config) {   // define the vector addition kernel
    auto i = config.get_global(0);   // get the global id (in x-dimension) for the thread
    if (i < size)
      pc[i] = pa[i] + pb[i] + 2;
  };

  exec.launch(vadd, {{1}, {128}}); // launch the kernel with 128 threads in 1 block

  dc.download(c.data(), c.size()); // download the results from the device

  // compute the expected results on the host
  std::transform(a.begin(), a.end(), b.begin(), gold.begin(),
                 [](auto a, auto b) { return a + b + 2; });

  if (std::equal(c.begin(), c.end(), gold.begin())) // check the results
    return 0; // passed
  else
    return 1; // failed
}
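To make the indexing in the kernel above easier to follow, here is a host-only sketch that emulates a launch sequentially, without any PACXX dependency. `FakeRange`, `emulate_launch` and `vadd_on_host` are hypothetical names introduced for illustration only and are not part of the PACXX API. The sketch assumes that the global id in x is the block index times the block size plus the thread index, as is usual for CUDA-style models.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel's range argument; NOT a PACXX type.
struct FakeRange {
  std::size_t block, thread, threads_per_block;
  // Assumed meaning of the global id in x:
  // block index * block size + thread index within the block.
  std::size_t get_global(int /*dim*/) const {
    return block * threads_per_block + thread;
  }
};

// Runs `kernel` once per emulated thread, sequentially, on the host.
template <typename Kernel>
void emulate_launch(Kernel kernel, std::size_t blocks, std::size_t threads) {
  for (std::size_t b = 0; b < blocks; ++b)
    for (std::size_t t = 0; t < threads; ++t) {
      FakeRange r{b, t, threads};
      kernel(r);
    }
}

// Element-wise a + b + 2, mirroring the vadd kernel above.
std::vector<int> vadd_on_host(const std::vector<int> &a,
                              const std::vector<int> &b,
                              std::size_t threads) {
  assert(a.size() == b.size());
  const std::size_t size = a.size();
  std::vector<int> c(size, 0);
  const int *pa = a.data();
  const int *pb = b.data();
  int *pc = c.data();
  // Round up so every element is covered even if size % threads != 0.
  const std::size_t blocks = (size + threads - 1) / threads;
  auto vadd = [=](FakeRange &config) {
    auto i = config.get_global(0);
    if (i < size) // skips the surplus threads in the last block
      pc[i] = pa[i] + pb[i] + 2;
  };
  emulate_launch(vadd, blocks, threads);
  return c;
}
```

With 130 elements and 128 threads per block, for example, this sketch runs 2 blocks and the `i < size` guard skips the 126 surplus threads. That is why the kernel keeps the bounds check even though the example above launches exactly one thread per element.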
To compile your code, the easiest way is to use CMake. The following script shows how you can integrate PACXX into your CMake build system:
cmake_minimum_required(VERSION 3.5)
project(vadd)
set(CMAKE_MODULE_PATH ${PACXX_DIR}/lib/cmake/pacxx)
find_package(PACXX REQUIRED)
include_directories(${PACXX_INCLUDE_DIRECTORY} ${PACXX_INCLUDE_DIRECTORY}/pacxx)
set(CMAKE_CXX_STANDARD 14)
set(SOURCE_FILES ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}.cpp)
add_executable(${PROJECT_NAME} ${SOURCE_FILES})
add_pacxx_to_target(${PROJECT_NAME} ${CMAKE_CURRENT_BINARY_DIR} ${SOURCE_FILES})
Configure your CMake project using:
mkdir build && cd build
CC=<pacxx_install_prefix>/bin/clang CXX=<pacxx_install_prefix>/bin/clang++ ccmake .. -DPACXX_DIR=<pacxx_install_prefix>
make
If everything was set up correctly, you should now have an executable that is linked against the PACXX runtime and ready to run.
Running the executable with the PACXX_LOG_LEVEL=2 environment variable set will give you the verbose output of the runtime:
CUDARuntime.cpp:186: note: VERBOSE: CUDARuntime has found 1 CUDA devices
CUDARuntime.cpp:34: note: VERBOSE: Creating cudaCtx for device: 0 0 0x2055550
CUDARuntime.cpp:43: note: VERBOSE: Initializing PTXBackend for Tesla K20c (dev: 0) with compute capability 3.5
PTXBackend.cpp:52: note: VERBOSE: Intializing LLVM components for PTX generation!
CoreInitializer.cpp:32: note: VERBOSE: Core components initialized!
Executor.cpp:88: note: VERBOSE: Created new Executor with id: 0
MSPEngine.cpp:49: note: DEBUG: MSP Engine disabled!
CUDARuntime.cpp:93: note: VERBOSE: //
// Generated by LLVM NVPTX Back-End
//
// ptx stripped for shortness
Timing.h:45: note: VERBOSE: CUDARuntime.cpp:71 compileAndLink timed: 4497us
Executor.h:191: note: VERBOSE: allocating memory: 512
Executor.h:191: note: VERBOSE: allocating memory: 512
Executor.h:191: note: VERBOSE: allocating memory: 512
CUDAKernel.cpp:43: note: VERBOSE: setting kernel arguments
CUDAKernel.cpp:51: note: DEBUG: Launching kernel: _ZN5pacxx2v213genericKernelIZL19test_vadd_low_leveliPPcE3$_0EEvT_
CUDAKernel.cpp:55: note: VERBOSE: Kernel configuration:
blocks(1,1,1)
threads(128,1,1)
shared_mem=0
Executor.h:99: note: VERBOSE: destroying executor 0
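The numbers in the log above can be cross-checked against the example program. The three `allocating memory: 512` lines appear consistent with three buffers of 128 four-byte integers each, and the kernel configuration matches the launch of 1 block with 128 threads. A minimal sketch of that arithmetic follows. `buffer_bytes` and `total_threads` are hypothetical helpers, not PACXX functions, and `std::int32_t` stands in for `int` on the assumption that `int` is 4 bytes on the target platform.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed for a buffer holding `count` elements of type T.
template <typename T>
constexpr std::size_t buffer_bytes(std::size_t count) {
  return count * sizeof(T);
}

// Total number of threads for `blocks` blocks of `threads` threads each
// (x-dimension only).
constexpr std::size_t total_threads(std::size_t blocks, std::size_t threads) {
  return blocks * threads;
}

// 128 elements of a 4-byte type occupy 512 bytes, as reported in the log.
static_assert(buffer_bytes<std::int32_t>(128) == 512, "expected 512 bytes");
// blocks(1,1,1) x threads(128,1,1) gives one thread per element.
static_assert(total_threads(1, 128) == 128, "expected 128 threads");
```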
- Support for AMD GPUs through the HIP stack on the ROCm infrastructure.
- Nvidia's libdevice must be linked manually to get all math functions in device code. This will be fixed in a future update.
- Atomic Operations are more or less a bad hack.
- Missing support for constant memory regions on GPUs.
- SLEEF fails to compile on AVX2 architectures due to missing intrinsics in LLVM 6.0 (currently under investigation).
- Documentation. Well, yes, the only available documentation on the PACXX runtime and the programming model itself is the source code.
Contributions are always welcome. If you want to contribute to PACXX, just open a pull request.
Haidl M, Gorlatch S. 2014. ‘PACXX: Towards a Unified Programming Model for Programming Accelerators using C++14.’ Contributed to the LLVM Compiler Infrastructure in HPC Workshop at Supercomputing '14, New Orleans. doi: 10.1109/LLVM-HPC.2014.9.
Haidl M, Hagedorn B, Gorlatch S. 2016. ‘Programming GPUs with C++14 and Just-In-Time Compilation.’ Contributed to Advances in Parallel Computing: On the Road to Exascale, ParCo2015, Edinburgh, Scotland. doi: 10.3233/978-1-61499-621-7-247.
Haidl M, Steuwer M, Humernbrum T, Gorlatch S. 2016. ‘Multi-Stage Programming for GPUs in Modern C++ using PACXX.’ Contributed to the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit, GPGPU '16, Barcelona, Spain. doi: 10.1145/2884045.2884049.
Haidl M, Gorlatch S. 2017. ‘High-Level Programming for Many-Cores using C++14 and the STL.’ International Journal of Parallel Programming 2017. doi: 10.1007/s10766-017-0497-y.
Haidl M, Steuwer M, Dirks H, Humernbrum T, Gorlatch S. 2017. ‘Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views.’ In Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, edited by Chen Q, Huang Z, 58-67. New York, NY: ACM. doi: 10.1145/3026937.3026942.
Haidl M, Moll S, Klein L, Sun H, Hack S, Gorlatch S. 2017. ‘PACXXv2 + RV: An LLVM-based Portable High-Performance Programming Model.’ In Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC at Supercomputing '17, Denver. ACM. doi: 10.1145/3148173.3148185.