uob-hpc / babelstream

STREAM, for lots of devices written in many programming models

License: Other

C++ 52.43% Cuda 1.59% CMake 9.73% Shell 5.45% Julia 4.27% Scala 2.81% Batchfile 0.63% Java 7.30% Rust 5.07% Fortran 10.05% Makefile 0.45% Futhark 0.21%

memory-bandwidth benchmark parallel-processing gpgpu opencl cuda kokkos raja sycl openmp

babelstream's Introduction

BabelStream

Measure memory transfer rates to/from global device memory on GPUs. This benchmark is similar in spirit to, and based on, the STREAM benchmark [1] for CPUs.

Unlike other GPU memory bandwidth benchmarks, this does not include the PCIe transfer time.

There are multiple implementations of this benchmark in a variety of programming models.

This code was previously called GPU-STREAM.

Programming Models
How is this different to STREAM?
Building
Results
Contributing
Citing
- Other BabelStream publications

Programming Models

BabelStream is currently implemented in the following parallel programming models, listed in no particular order:

OpenCL
CUDA
HIP
OpenACC
OpenMP 3 and 4.5
C++ Parallel STL
Kokkos
RAJA
SYCL and SYCL2020 (USM and accessors)
TBB
Thrust (via CUDA or HIP)
Futhark

This project also contains implementations in alternative languages with different build systems:

Julia - JuliaStream.jl
Java - java-stream
Scala - scala-stream
Rust - rust-stream

How is this different to STREAM?

BabelStream implements the four main kernels of the STREAM benchmark (along with a dot product), but by utilising different programming models it expands the platforms on which the code can run beyond CPUs.
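As a rough illustration of the kernels mentioned above, here is a minimal serial C++ sketch. The function names and the choice of which array each kernel reads and writes follow the usual STREAM convention and are assumptions of this sketch; the real implementations live in each model's source directory and run in parallel.

```cpp
#include <cstddef>
#include <vector>

// Serial reference sketch of the four STREAM kernels plus dot.
// Array roles (a, b, c) are assumed, not taken from the text above.
template <typename T>
void copy_kernel(const std::vector<T>& a, std::vector<T>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i];
}

template <typename T>
void mul_kernel(std::vector<T>& b, const std::vector<T>& c, T scalar) {
    for (std::size_t i = 0; i < c.size(); ++i) b[i] = scalar * c[i];
}

template <typename T>
void add_kernel(const std::vector<T>& a, const std::vector<T>& b,
                std::vector<T>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
}

template <typename T>
void triad_kernel(std::vector<T>& a, const std::vector<T>& b,
                  const std::vector<T>& c, T scalar) {
    for (std::size_t i = 0; i < b.size(); ++i) a[i] = b[i] + scalar * c[i];
}

template <typename T>
T dot_kernel(const std::vector<T>& a, const std::vector<T>& b) {
    T sum = T(0);
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}
```

Each kernel does almost no arithmetic per element, so its run time is dominated by how fast the arrays can be streamed through memory.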

The key differences from STREAM are that:

the arrays are allocated on the heap
the problem size is unknown at compile time
wider platform and programming model support

With stack arrays of known size at compile time, the compiler is able to align data, issue optimal instructions (such as non-temporal stores) and remove peel/remainder vectorisation loops. But this information is not typically available in real HPC codes today, where the problem size is read from the user at runtime.

BabelStream therefore provides a measure of what memory bandwidth performance can be attained (by a particular programming model) if you follow today's best practice for parallel programming.
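The contrast between the two allocation styles can be sketched as follows. This is an illustrative example only: `kStaticSize`, `parse_array_size`, `make_array` and the two `scale_*` functions are hypothetical names, and the sizes are arbitrary.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Compile-time size: the trip count and alignment are visible to the compiler.
constexpr std::size_t kStaticSize = 1024;  // illustrative value only
alignas(64) static double static_array[kStaticSize];

void scale_static(double s) {
    for (std::size_t i = 0; i < kStaticSize; ++i) static_array[i] *= s;
}

// Runtime size: read from the user, so the compiler cannot assume it.
// parse_array_size is a hypothetical helper for this sketch.
std::size_t parse_array_size(const char* arg) {
    return static_cast<std::size_t>(std::strtoull(arg, nullptr, 10));
}

// Heap allocation of n elements, each set to the given value.
std::unique_ptr<double[]> make_array(std::size_t n, double value) {
    std::unique_ptr<double[]> p(new double[n]);
    for (std::size_t i = 0; i < n; ++i) p[i] = value;
    return p;
}

// The compiler knows neither n nor the alignment of data here.
void scale_heap(double* data, std::size_t n, double s) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= s;
}
```

In `scale_static` the compiler can plan the whole loop ahead of time, whereas `scale_heap` must handle any length and alignment, which is the situation the text describes for real HPC codes.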

BabelStream also includes the nstream kernel from the Parallel Research Kernels (PRK) project, available on GitHub. Details about PRK can be found in the following references:

Van der Wijngaart, Rob F., and Timothy G. Mattson. The parallel research kernels. IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2014.
R. F. Van der Wijngaart, A. Kayi, J. R. Hammond, G. Jost, T. St. John, S. Sridharan, T. G. Mattson, J. Abercrombie, and J. Nelson. Comparing runtime systems with exascale ambitions using the Parallel Research Kernels. ISC 2016, DOI: 10.1007/978-3-319-41321-1_17.
Jeff R. Hammond and Timothy G. Mattson. Evaluating data parallelism in C++ using the Parallel Research Kernels. IWOCL 2019, DOI: 10.1145/3318170.3318192.
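For orientation, the nstream kernel can be sketched in serial C++ as below. The update rule shown is our understanding of the PRK kernel and should be checked against the PRK sources; the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sketch of an nstream-style update: a is both read and written,
// so each element involves three loads and one store.
template <typename T>
void nstream_kernel(std::vector<T>& a, const std::vector<T>& b,
                    const std::vector<T>& c, T scalar) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] += b[i] + scalar * c[i];
    }
}
```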

Building

Drivers, compilers and software applicable to whichever implementation you would like to build against are required.

CMake

The project supports building with CMake >= 3.13.0, which can be installed without root via the official script.

Each BabelStream implementation (programming model) is built as follows:

$ cd babelstream

# configure the build, build type defaults to Release
# The -DMODEL flag is required
$ cmake -Bbuild -H. -DMODEL=<model> <model specific flags prefixed with -D...>

# compile
$ cmake --build build

# run executables in ./build
$ ./build/<model>-stream

The MODEL option selects one implementation of BabelStream to build. The source for each model's implementation is located in ./src/<model>.

Currently available models are:

omp;ocl;std-data;std-indices;std-ranges;hip;cuda;kokkos;sycl;sycl2020-acc;sycl2020-usm;acc;raja;tbb;thrust;futhark

Overriding default flags

By default, we have defined a set of optimal flags for known HPC compilers. These are assigned to RELEASE_FLAGS, and you can override them if required.

To find out what flag each model supports or requires, simply configure while only specifying the model. For example:

> cd babelstream
> cmake -Bbuild -H. -DMODEL=ocl 
...
- Common Release flags are `-O3`, set RELEASE_FLAGS to override
-- CXX_EXTRA_FLAGS: 
        Appends to common compile flags. These will be used at link phase at well.
        To use separate flags at link time, set `CXX_EXTRA_LINKER_FLAGS`
-- CXX_EXTRA_LINK_FLAGS: 
        Appends to link flags which appear *before* the objects.
        Do not use this for linking libraries, as the link line is order-dependent
-- CXX_EXTRA_LIBRARIES: 
        Append to link flags which appears *after* the objects.
        Use this for linking extra libraries (e.g `-lmylib`, or simply `mylib`) 
-- CXX_EXTRA_LINKER_FLAGS: 
        Append to linker flags (i.e GCC's `-Wl` or equivalent)
-- Available models:  omp;ocl;std;std20;hip;cuda;kokkos;sycl;acc;raja;tbb
-- Selected model  :  ocl
-- Supported flags:

   CMAKE_CXX_COMPILER (optional, default=c++): Any CXX compiler that is supported by CMake detection
   OpenCL_LIBRARY (optional, default=): Path to OpenCL library, usually called libOpenCL.so
...

Alternatively, refer to the CI script, which test-compiles most of the models, and see which flags are used there.

It is recommended that you delete the build directory when you change any of the build flags.

Spack

The project supports building with Spack >= 0.19.0, which can be installed without root via the official GitHub repo. The source code of the BabelStream Spack package can be accessed from the link here. Each BabelStream implementation (programming model) is built as follows:

# Spack package installation starts with `spack install babelstream` for all programming models
# The programming model to be built needs to be specified with the `+` option
# The model-specific flags need to be specified after the model
$ spack install babelstream@<version>%<compiler> +<model> <model specific flags>


# The executables will be generated in:
# SPACK_INSTALL_DIRECTORY/opt/spack/system-name/compiler-name/babelstream-version-identifier/bin/
# this path is printed at the end of the installation and can easily be copied
$ cd SPACK_INSTALL_DIRECTORY/opt/spack/system-name/compiler-name/babelstream-version-identifier/bin/
$ ./<model>-stream

More detailed examples are provided in the Spack README file. The MODEL variant selects one implementation of BabelStream to build.

Currently available models are:

omp;ocl;std-data;std-indices;std-ranges;hip;cuda;kokkos;sycl;sycl2020-acc;sycl2020-usm;acc;raja;tbb;thrust

GNU Make

Support for Make has been removed from 4.0 onwards. However, as the build process only involves a few source files, the required compile commands can be extracted from the CI output.

Results

Sample results can be found in the results subdirectory. Newer results are found in our Performance Portability repository.

Contributing

As of v4.0, the main branch of this repository will hold the latest released version.

The develop branch will contain unreleased features due for the next (major and/or minor) release of BabelStream. Pull Requests should be made against the develop branch.

Citing

Please cite BabelStream via this reference:

Deakin T, Price J, Martineau M, McIntosh-Smith S. Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering. Special issue. Vol. 17, No. 3, pp. 247–262. 2018. DOI: 10.1504/IJCSE.2018.095847

Other BabelStream publications

Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. 2016. Paper presented at P^3MA Workshop at ISC High Performance, Frankfurt, Germany. DOI: 10.1007/978-3-319-46079-6_34
Deakin T, McIntosh-Smith S. GPU-STREAM: Benchmarking the achievable memory bandwidth of Graphics Processing Units. 2015. Poster session presented at IEEE/ACM SuperComputing, Austin, United States. You can view the Poster and Extended Abstract.
Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM: Now in 2D!. 2016. Poster session presented at IEEE/ACM SuperComputing, Salt Lake City, United States. You can view the Poster and Extended Abstract.
Raman K, Deakin T, Price J, McIntosh-Smith S. Improving achieved memory bandwidth from C++ codes on Intel Xeon Phi Processor (Knights Landing). IXPUG Spring Meeting, Cambridge, UK, 2017.
Deakin T, Price J, McIntosh-Smith S. Portable methods for measuring cache hierarchy performance. 2017. Poster session presented at IEEE/ACM SuperComputing, Denver, United States. You can view the Poster and Extended Abstract.

[1]: McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.


babelstream's Issues

Add option for base 2 MiB/s output

BabelStream uses base 10 output where MB = 10^6 bytes.

It would be useful to add a --base2 or --mibibytes option that reports bandwidth in base 2 units, where MiB = 2^20 bytes.
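The difference between the two units comes down to the divisor applied to the byte count. A minimal sketch, with hypothetical function names that are not part of BabelStream:

```cpp
#include <cstddef>

// Bandwidth in MB/s, where 1 MB = 10^6 bytes (base 10).
double bandwidth_mbps(std::size_t bytes, double seconds) {
    return static_cast<double>(bytes) / 1.0e6 / seconds;
}

// Bandwidth in MiB/s, where 1 MiB = 2^20 bytes (base 2).
double bandwidth_mibps(std::size_t bytes, double seconds) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0) / seconds;
}
```

For the same transfer, the MiB/s figure is about 4.6% smaller than the MB/s figure, since 10^6 / 2^20 is roughly 0.954.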

Build system needs improving

Binaries don't tend to build with the correct flags. This needs fixing. In the meantime things can be compiled manually, but the following should be run first:

echo '#define VERSION_STRING "2.0"' > common.h

Makefile variables should use conditional variable assignment operator

The Makefiles should define CXXFLAGS etc. with ?= instead of = so that it doesn't matter whether you set the variable on the left- or right-hand side of the make invocation, i.e.:

VAR=val make
make VAR=val

No need to specify layout in Kokkos View

The Kokkos View mechanism does not require that a layout (seen as a DEVICE in this code) is passed to the View type. That way, the build of the Kokkos library determines the best layout for us.

Add TBB version

Build error of SYCL version with hipSYCL

Due to some ambiguities in the SYCL spec, BabelStream currently fails to build with hipSYCL. Workarounds, until a solution in the SYCL spec and hipSYCL is found, are detailed in AdaptiveCpp/AdaptiveCpp#185

how to contribute?

it was mentioned in #21 that GPU-STREAM is switching to a Makefile based build (I am not sure of the reasons for this as pointed out there). I wanted to create a PR for contributing a HCC compiled memory benchmark based on rocm. How should I go about it? Use a Makefile or integrate it into the cmake setup of this repo?

HIP dot product fails to build

Hi, I just wanted to start adding a bare HC stream and started off with the HIPified version in cc90cef and got this:

$ make -f HIP.make
hipcc  -std=c++11 -DHIP main.cpp HIPStream.cu  -o hip-stream
HIPStream.cu:185:10: error: cannot combine with previous 'extern' declaration specifier
  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
         ^
/opt/rocm/include/hip/hcc_detail/host_defines.h:49:24: note: expanded from macro '__shared__'
#define __shared__     tile_static
                       ^
/opt/rocm/hcc/include/hc_defines.h:52:21: note: expanded from macro 'tile_static'
#define tile_static static __attribute__((section("clamp_opencl_local")))
                    ^
HIPStream.cu:185:31: error: expected parameter declarator
  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
                              ^
HIPStream.cu:185:31: error: expected ')'
HIPStream.cu:185:30: note: to match this '('
  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
                             ^
HIPStream.cu:185:21: error: C++ requires a type specifier for all declarations
  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
  ~~~~~~~~~~~~~~~~~ ^
HIPStream.cu:185:21: error:  tile_static can only be applied to a variable declaration
HIPStream.cu:185:41: error: expected ';' at end of declaration
  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
                                        ^
                                        ;
HIPStream.cu:185:56: error: definition of variable with array type needs an explicit size or an initializer
  extern __shared__ __align__(sizeof(T)) unsigned char smem[];
                                                       ^
HIPStream.cu:188:11: error: use of undeclared identifier 'blockDim'
  int i = blockDim.x * blockIdx.x + threadIdx.x;
          ^
HIPStream.cu:188:24: error: use of undeclared identifier 'blockIdx'
  int i = blockDim.x * blockIdx.x + threadIdx.x;
                       ^
HIPStream.cu:188:37: error: use of undeclared identifier 'threadIdx'
  int i = blockDim.x * blockIdx.x + threadIdx.x;
                                    ^
HIPStream.cu:189:26: error: use of undeclared identifier 'threadIdx'
  const size_t local_i = threadIdx.x;
                         ^
HIPStream.cu:192:31: error: use of undeclared identifier 'blockDim'
  for (; i < array_size; i += blockDim.x*gridDim.x)
                              ^
HIPStream.cu:192:42: error: use of undeclared identifier 'gridDim'
  for (; i < array_size; i += blockDim.x*gridDim.x)
                                         ^
HIPStream.cu:195:21: error: use of undeclared identifier 'blockDim'
  for (int offset = blockDim.x / 2; offset > 0; offset /= 2)
                    ^
HIPStream.cu:205:9: error: use of undeclared identifier 'blockIdx'
    sum[blockIdx.x] = tb_sum[local_i];
        ^
15 errors generated.
Died at /opt/rocm//bin/hipcc line 378.
make: *** [hip-stream] Error 1

which comes from this line in HIPStream.cu (btw, I think this file needs a suffix rename as it technically doesn't contain any CUDA code):

extern __shared__ __align__(sizeof(T)) unsigned char smem[];
T *tb_sum = reinterpret_cast<T*>(smem);

OpenACC implementation requires GCC 6 and patch

The OpenACC implementation requires GCC 6+ to compile, but then only with a patch.

Please rename the patch to remove the .txt suffix, which Github forced me to add.
0001-use-restrict-instead-of-restrict-which-is-not-a-.patch.txt

Unfortunately, my version of CUDA only supports GCC 5, so I'm kludging them together like this (the important piece is -ccbin):

cmake \
-DCMAKE_CXX_COMPILER=g++-6.2 \
-DCMAKE_C_COMPILER=gcc-6.2 \
-DCMAKE_CXX_FLAGS="-g -O3 -std=gnu++11 -fopenmp -fopenacc -Wall -Wextra" \
-DCMAKE_C_FLAGS="-g -O3 -std=gnu11 -fopenmp -fopenacc -Wall -Wextra" \
-DCUDA_NVCC_FLAGS="-gencode arch=compute_20,code=sm_20 \
                   -gencode arch=compute_30,code=sm_30 \
                   -gencode arch=compute_35,code=sm_35 \
                   -gencode arch=compute_37,code=sm_37 \
                   -gencode arch=compute_50,code=sm_50 \
                   -gencode arch=compute_52,code=sm_52 \
                   -ccbin gcc-5.4 "  ..

GCC 5.4

jrhammon@klondike:~/Work/GPU/GPU-STREAM/build$ /opt/gcc/5.4.0/bin/g++-5.4   -DACC -I/home/jrhammon/Work/GPU/GPU-STREAM/build  -g -O3 -std=gnu++11 -fopenmp -fopenacc -Wall -Wextra -O3 -DNDEBUG   -std=gnu++11 -o CMakeFiles/gpu-stream-acc.dir/main.cpp.o -c /home/jrhammon/Work/GPU/GPU-STREAM/main.cpp
In file included from /home/jrhammon/Work/GPU/GPU-STREAM/ACCStream.h:15:0,
                 from /home/jrhammon/Work/GPU/GPU-STREAM/main.cpp:32:
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:69:40: error: expected primary-expression before ‘)’ token
 int acc_get_num_devices (acc_device_t) __GOACC_NOTHROW;
                                        ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:70:41: error: expected primary-expression before ‘)’ token
 void acc_set_device_type (acc_device_t) __GOACC_NOTHROW;
                                         ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:71:41: error: expected primary-expression before ‘)’ token
 acc_device_t acc_get_device_type (void) __GOACC_NOTHROW;
                                         ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:72:45: error: expected primary-expression before ‘)’ token
 void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW;
                                             ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:73:39: error: expected primary-expression before ‘)’ token
 int acc_get_device_num (acc_device_t) __GOACC_NOTHROW;
                                       ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:74:26: error: expected primary-expression before ‘)’ token
 int acc_async_test (int) __GOACC_NOTHROW;
                          ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:75:31: error: expected primary-expression before ‘)’ token
 int acc_async_test_all (void) __GOACC_NOTHROW;
                               ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:76:21: error: expected primary-expression before ‘)’ token
 void acc_wait (int) __GOACC_NOTHROW;
                     ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:77:32: error: expected primary-expression before ‘)’ token
 void acc_wait_async (int, int) __GOACC_NOTHROW;
                                ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:78:26: error: expected primary-expression before ‘)’ token
 void acc_wait_all (void) __GOACC_NOTHROW;
                          ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:79:31: error: expected primary-expression before ‘)’ token
 void acc_wait_all_async (int) __GOACC_NOTHROW;
                               ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:80:30: error: expected primary-expression before ‘)’ token
 void acc_init (acc_device_t) __GOACC_NOTHROW;
                              ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:81:34: error: expected primary-expression before ‘)’ token
 void acc_shutdown (acc_device_t) __GOACC_NOTHROW;
                                  ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:82:34: error: expected primary-expression before ‘)’ token
 int acc_on_device (acc_device_t) __GOACC_NOTHROW;
                                  ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:83:27: error: expected primary-expression before ‘)’ token
 void *acc_malloc (size_t) __GOACC_NOTHROW;
                           ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:84:24: error: expected primary-expression before ‘)’ token
 void acc_free (void *) __GOACC_NOTHROW;
                        ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:87:35: error: expected primary-expression before ‘)’ token
 void *acc_copyin (void *, size_t) __GOACC_NOTHROW;
                                   ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:88:46: error: expected primary-expression before ‘)’ token
 void *acc_present_or_copyin (void *, size_t) __GOACC_NOTHROW;
                                              ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:89:35: error: expected primary-expression before ‘)’ token
 void *acc_create (void *, size_t) __GOACC_NOTHROW;
                                   ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:90:46: error: expected primary-expression before ‘)’ token
 void *acc_present_or_create (void *, size_t) __GOACC_NOTHROW;
                                              ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:91:35: error: expected primary-expression before ‘)’ token
 void acc_copyout (void *, size_t) __GOACC_NOTHROW;
                                   ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:92:34: error: expected primary-expression before ‘)’ token
 void acc_delete (void *, size_t) __GOACC_NOTHROW;
                                  ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:93:41: error: expected primary-expression before ‘)’ token
 void acc_update_device (void *, size_t) __GOACC_NOTHROW;
                                         ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:94:39: error: expected primary-expression before ‘)’ token
 void acc_update_self (void *, size_t) __GOACC_NOTHROW;
                                       ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:95:44: error: expected primary-expression before ‘)’ token
 void acc_map_data (void *, void *, size_t) __GOACC_NOTHROW;
                                            ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:96:30: error: expected primary-expression before ‘)’ token
 void acc_unmap_data (void *) __GOACC_NOTHROW;
                              ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:97:30: error: expected primary-expression before ‘)’ token
 void *acc_deviceptr (void *) __GOACC_NOTHROW;
                              ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:98:28: error: expected primary-expression before ‘)’ token
 void *acc_hostptr (void *) __GOACC_NOTHROW;
                            ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:99:37: error: expected primary-expression before ‘)’ token
 int acc_is_present (void *, size_t) __GOACC_NOTHROW;
                                     ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:100:52: error: expected primary-expression before ‘)’ token
 void acc_memcpy_to_device (void *, void *, size_t) __GOACC_NOTHROW;
                                                    ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:101:54: error: expected primary-expression before ‘)’ token
 void acc_memcpy_from_device (void *, void *, size_t) __GOACC_NOTHROW;
                                                      ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:109:42: error: expected primary-expression before ‘)’ token
 void *acc_get_current_cuda_device (void) __GOACC_NOTHROW;
                                          ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:110:43: error: expected primary-expression before ‘)’ token
 void *acc_get_current_cuda_context (void) __GOACC_NOTHROW;
                                           ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:111:33: error: expected primary-expression before ‘)’ token
 void *acc_get_cuda_stream (int) __GOACC_NOTHROW;
                                 ^
/opt/gcc/5.4.0/lib/gcc/x86_64-unknown-linux-gnu/5.4.0/include/openacc.h:112:39: error: expected primary-expression before ‘)’ token
 int acc_set_cuda_stream (int, void *) __GOACC_NOTHROW;
                                       ^

GCC 6.2

jrhammon@klondike:~/Work/GPU/GPU-STREAM/build$ g++-6.2   -DACC -I/home/jrhammon/Work/GPU/GPU-STREAM/build  -g -O3 -std=gnu++11 -fopenmp -fopenacc -Wall -Wextra -O3 -DNDEBUG   -std=gnu++11 -o CMakeFiles/gpu-stream-acc.dir/main.cpp.o -c /home/jrhammon/Work/GPU/GPU-STREAM/main.cpp

OpenMP 4.5 should use alloc instead of to

This line:

https://github.com/UoB-HPC/GPU-STREAM/blob/master/OMP45Stream.cpp#L21
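A sketch of the suggested change follows. It assumes, since the linked line is not reproduced here, that the arrays are initialised on the device, so copying host contents with `map(to: ...)` at allocation time is unnecessary. The function name is hypothetical. If compiled without OpenMP support, the pragmas are ignored and the loop simply runs on the host.

```cpp
#include <cstddef>

// Allocate device storage, initialise it on the device, copy the result back.
void init_on_device(double* a, std::size_t n, double value) {
    // map(alloc:) only reserves device memory; map(to:) would also copy
    // the not-yet-initialised host data across, which is wasted work here.
    #pragma omp target enter data map(alloc: a[0:n])

    // Initialise on the device, so the host contents were never needed.
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = value;
    }

    // Bring the results back to the host, then free the device copy.
    #pragma omp target update from(a[0:n])
    #pragma omp target exit data map(release: a[0:n])
}
```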

Fujitsu A64FX flags for OpenMP

Istvan Reguly Today at 1:08 PM
And here it is with @shinji Sumimoto’s flags:
FCC main.cpp OMPStream.cpp -DOMP -DSTREAM_ARRAY_SIZE=60000000 -O3 -Kfast,openmp -KA64FX -KSVE -KARMV8_3_A -Kzfill=100 -Kprefetch_sequential=soft -Kprefetch_line=8 -Kprefetch_line_L2=16 -o stream_fujitsu -std=c++11

Kokkos float option

Compile errors when building with both double and single precision instantiations of the class. This only seems to happen when the dot kernel is included.

BabelStream is not working with HIP programming model on the latest ROCm 1.7.1 (1.7-137)

BabelStream is not working with the HIP programming model on the latest ROCm 1.7.1 (1.7-137).
The application should be modified as per the latest changes in HIP, correct?

Can you please take a look at the log below and suggest a resolution?

taccuser@ROCM-DTB-VG10:~/Desktop/BabelStream$ make -f HIP.make
/opt/rocm/hip/bin/hipcc -std=c++11 -DHIP main.cpp HIPStream.cpp -o hip-stream
HIPStream.cpp:129:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(copy_kernel), dim3(array_size/TBSIZE), ...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::copy' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float )] not viable: no overload
of 'copy_kernel' matching 'void ()(hip_impl::Empty_launch_parm, float *,
float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:162:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(add_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::add' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float *, float )] not viable: no
overload of 'add_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
float *, float *, float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:146:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(mul_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::mul' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float )] not viable: no overload
of 'mul_kernel' matching 'void ()(hip_impl::Empty_launch_parm, float *,
float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:179:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(triad_kernel), dim3(array_size/TBSIZE)...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::triad' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float *, float )] not viable: no
overload of 'triad_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
float *, float *, float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:213:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(dot_kernel), dim3(DOT_NUM_BLOCKS), dim3...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::dot' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float *, float , unsigned int>,
F = void ()(hip_impl::Empty_launch_parm, float *, float *, float ,
unsigned int)] not viable: no overload of 'dot_kernel' matching 'void
()(hip_impl::Empty_launch_parm, float *, float *, float *, unsigned int)'
for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:129:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(copy_kernel), dim3(array_size/TBSIZE), ...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::copy' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double )] not viable: no
overload of 'copy_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
double *, double *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:162:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(add_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::add' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double *, double )] not viable:
no overload of 'add_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
double *, double *, double *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:146:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(mul_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::mul' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double )] not viable: no
overload of 'mul_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
double *, double *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:179:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(triad_kernel), dim3(array_size/TBSIZE)...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::triad' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double *, double )] not viable:
no overload of 'triad_kernel' matching 'void
()(hip_impl::Empty_launch_parm, double *, double *, double *)' for 1st
argument
void hipLaunchKernel(
^
HIPStream.cpp:213:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(dot_kernel), dim3(DOT_NUM_BLOCKS), dim3...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::dot' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double *, double , unsigned
int>, F = void ()(hip_impl::Empty_launch_parm, double *, double *, double
, unsigned int)] not viable: no overload of 'dot_kernel' matching 'void
()(hip_impl::Empty_launch_parm, double *, double *, double *, unsigned
int)' for 1st argument
void hipLaunchKernel(
^
10 errors generated.
HIPStream.cpp:129:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(copy_kernel), dim3(array_size/TBSIZE), ...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::copy' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float )] not viable: no overload
of 'copy_kernel' matching 'void ()(hip_impl::Empty_launch_parm, float *,
float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:162:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(add_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::add' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float *, float )] not viable: no
overload of 'add_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
float *, float *, float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:146:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(mul_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::mul' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float )] not viable: no overload
of 'mul_kernel' matching 'void ()(hip_impl::Empty_launch_parm, float *,
float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:179:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(triad_kernel), dim3(array_size/TBSIZE)...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::triad' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float *, float >, F = void
()(hip_impl::Empty_launch_parm, float *, float *, float )] not viable: no
overload of 'triad_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
float *, float *, float *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:213:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(dot_kernel), dim3(DOT_NUM_BLOCKS), dim3...
^~~~~~~~~~~~~~~
HIPStream.cpp:270:16: note: in instantiation of member function
'HIPStream::dot' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <float *, float *, float , unsigned int>,
F = void ()(hip_impl::Empty_launch_parm, float *, float *, float ,
unsigned int)] not viable: no overload of 'dot_kernel' matching 'void
()(hip_impl::Empty_launch_parm, float *, float *, float *, unsigned int)'
for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:129:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(copy_kernel), dim3(array_size/TBSIZE), ...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::copy' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double )] not viable: no
overload of 'copy_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
double *, double *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:162:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(add_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::add' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double *, double )] not viable:
no overload of 'add_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
double *, double *, double *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:146:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(mul_kernel), dim3(array_size/TBSIZE), d...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::mul' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double )] not viable: no
overload of 'mul_kernel' matching 'void ()(hip_impl::Empty_launch_parm,
double *, double *)' for 1st argument
void hipLaunchKernel(
^
HIPStream.cpp:179:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(triad_kernel), dim3(array_size/TBSIZE)...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::triad' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double *, double >, F = void
()(hip_impl::Empty_launch_parm, double *, double *, double )] not viable:
no overload of 'triad_kernel' matching 'void
()(hip_impl::Empty_launch_parm, double *, double *, double *)' for 1st
argument
void hipLaunchKernel(
^
HIPStream.cpp:213:3: error: no matching function for call to 'hipLaunchKernel'
hipLaunchKernel(HIP_KERNEL_NAME(dot_kernel), dim3(DOT_NUM_BLOCKS), dim3...
^~~~~~~~~~~~~~~
HIPStream.cpp:271:16: note: in instantiation of member function
'HIPStream::dot' requested here
template class HIPStream;
^
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:142:6: note:
candidate function [with Args = <double *, double *, double , unsigned
int>, F = void ()(hip_impl::Empty_launch_parm, double *, double *, double
, unsigned int)] not viable: no overload of 'dot_kernel' matching 'void
()(hip_impl::Empty_launch_parm, double *, double *, double *, unsigned
int)' for 1st argument
void hipLaunchKernel(
^
10 errors generated.
Died at /opt/rocm/hip/bin/hipcc line 498.
HIP.make:7: recipe for target 'hip-stream' failed
make: *** [hip-stream] Error 1

Add Julia and Rust versions

These would be self-contained ports, as it is probably not worth trying to plug these languages into the C++ main routine here.

Make fails for RAJA

CUDA and RAJA interfere in CMakeLists.txt

Dot verification fails with single precision

We probably just need to increase the tolerance. The error will also be proportional to the size of the arrays (unlike with the other kernels), so we need to make sure whatever error checking tolerance we use is robust enough to avoid these sorts of false positives for any sort of input.

Validation failed on sum. Error 0.000209808
Sum was 39.7910385131836 but should be 39.7912483215332
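
A minimal sketch of a size-aware check for the dot result, in case it helps the discussion. The helper name `dot_within_tolerance`, the `safety_factor` parameter, and the error model (relative error growing roughly like sqrt(n) times machine epsilon) are all assumptions for illustration, not BabelStream's actual verification code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical helper, not BabelStream's actual check: accept the dot result
// when its error stays within a bound that grows with the array length.
template <typename T>
bool dot_within_tolerance(T computed, T expected, std::size_t array_size,
                          T safety_factor = T(1))
{
  assert(array_size > 0);
  const T eps = std::numeric_limits<T>::epsilon();
  // Assumed error model: rounding error accumulates roughly like sqrt(n) * eps.
  const T rel_bound = safety_factor * eps * std::sqrt(static_cast<T>(array_size));
  // Compare absolute error against the scaled bound to avoid dividing by zero.
  return std::fabs(computed - expected) <= rel_bound * std::fabs(expected);
}
```

The relative error in the report above is about 5e-6, which a fixed tolerance near single-precision epsilon would reject. A bound that scales with the array length would accept it for large arrays while still rejecting results that are clearly wrong.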

Add OpenMP target build flags for GCC

AMD

-foffload=amdgcn-amdhsa="-march=gfx906"

NVIDIA

-foffload=nvptx-none="-march=sm_75"

Update for latest OpenMP version (5.2 and beyond)

Examples for updates are:

nontemporal clause
metadirectives for selecting target

Can there be one OpenMP version?

Update Cray OpenMP build for CCE 9

Cray C/C++ 9.x now needs Clang-style flags for the C++ standard and -fopenmp.

A new name for GPU-STREAM

GPU-STREAM is pretty ill-named now that we have v3.0 running across CPUs, GPUs, Xeon Phi, etc. The different programming models also add another dimension to the supported hardware. As such, we should probably rename the project to something more accurate. Potential names are:

X-STREAM
Portable STREAM
Parallel STREAM
ManyCore STREAM

Thoughts/suggestions?

common.h.in missing

Hi,
I just downloaded v3.0 and during cmake execution, I get:

$ cmake ..
-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.8.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
...
CMake Error: File /home/steinbac/development/gpu-stream/GPU-STREAM-3.0/common.h.in does not exist.
CMake Error at CMakeLists.txt:25 (configure_file):
  configure_file Problem configuring file
No CMAKE_BUILD_TYPE specified, defaulting to 'Release'
...
-- Configuring incomplete, errors occurred!
See also "/home/steinbac/development/gpu-stream/GPU-STREAM-3.0/build/CMakeFiles/CMakeOutput.log".
See also "/home/steinbac/development/gpu-stream/GPU-STREAM-3.0/build/CMakeFiles/CMakeError.log".

This error comes from [CMakeLists.txt#L25](https://github.com/UoB-HPC/GPU-STREAM/blob/master/CMakeLists.txt#L25).

For info: TARGET flags for clang-based CCE for OpenMP target

To target NVIDIA with OpenMP with the clang-based CCE, we need to do something like this:

make COMPILER=CRAY TARGET=NVIDIA EXTRA_FLAGS="-fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target -march=sm_60"

Non-CSV output when using `--csv`-option with `--triad-only`

I guess it's not a big issue since the output is a table with only one row, but still. Tested with the CUDA-version only.

$ cuda-stream --csv
Using CUDA device Tesla P100-SXM2-16GB
Driver: 9020
function,num_times,n_elements,sizeof,max_mbytes_per_sec,min_runtime,max_runtime,avg_runtime
Copy,100,..,
Mul,100,...,
Add,100,...,
Triad,100,...,
Dot,100,...,

$ cuda-stream --csv --triad-only
Running triad 100 times
Number of elements: 33554432
Precision: double
Array size: 268435.5 KB (=268.4 MB)
Total size: 805306.4 KB (=805.3 MB)
Using CUDA device Tesla P100-SXM2-16GB
Driver: 9020
--------------------------------
Runtime (seconds): ...
Bandwidth (GB/s):  ...
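
For illustration, the triad-only path could emit the same header and a single row when `--csv` is given. The sketch below is hypothetical: the `KernelResult` struct and the `csv_header`/`csv_row` helpers are made-up names, and only the column list is taken from the output above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical result record; the fields mirror the CSV header shown above.
struct KernelResult {
  std::string function;
  int num_times;
  std::size_t n_elements;
  std::size_t size_of;
  double max_mbytes_per_sec;
  double min_runtime;
  double max_runtime;
  double avg_runtime;
};

// Column names copied from the regular --csv output.
inline std::string csv_header()
{
  return "function,num_times,n_elements,sizeof,"
         "max_mbytes_per_sec,min_runtime,max_runtime,avg_runtime";
}

// Formats one result as a comma-separated row in header order.
inline std::string csv_row(const KernelResult& r)
{
  assert(r.num_times > 0);
  std::ostringstream out;
  out << r.function << ',' << r.num_times << ',' << r.n_elements << ','
      << r.size_of << ',' << r.max_mbytes_per_sec << ',' << r.min_runtime << ','
      << r.max_runtime << ',' << r.avg_runtime;
  return out.str();
}
```

With helpers like these, the triad-only branch would print `csv_header()` followed by one `csv_row(...)` for Triad, so scripts parsing the output would not need a special case.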

SYCL device discovery should be simplified

We should use the get_devices() function of the cl::sycl::device class to get all the devices in the system. This means we don't have to search through the platforms first.

ComputeCpp does not compile with signed array sizes

This was caught by #91 after trying to enable ComputeCpp (still not possible yet due to authentication requirements).
I've added better filtering in #91 as well so all the compiler warnings show up in the CI log.

> /opt/ComputeCpp-CE-2.3.0-x86_64-linux-gnu/bin/compute++ -sycl -O2 -mllvm -inline-threshold=1000 -intelspirmetadata -sycl-target spir64 -std=c++1z -I"/opt/computecpp_archive/ComputeCpp-CE-2.3.0-x86_64-linux-gnu/include" -I"/home/tom/babelstream-upstream/CL" -DSYCL       -DCL_TARGET_OPENCL_VERSION=220  -D_GLIBCXX_USE_CXX11_ABI=0 SYCLStream.cpp

SYCLStream.cpp:94:44: error: non-constant-expression cannot be narrowed from type 'int' to 'size_t' (aka 'unsigned long') in initializer list [-Wc++11-narrowing]
    cgh.parallel_for<copy_kernel>(range<1>{array_size}, [=](id<1> idx)
                                           ^~~~~~~~~~
SYCLStream.cpp:313:16: note: in instantiation of member function 'SYCLStream<float>::copy' requested here
template class SYCLStream<float>;
               ^
SYCLStream.cpp:94:44: note: insert an explicit cast to silence this issue
    cgh.parallel_for<copy_kernel>(range<1>{array_size}, [=](id<1> idx)
                                           ^~~~~~~~~~

... reports the same thing for all range<1>{array_size} calls ...

This is without any extra warning flags. We also got the same diagnostics, but as warnings, in hipSYCL:

/opt/hipsycl/cff515c/lib/cmake/hipSYCL/syclcc-launcher --launcher-cxx-compiler=/usr/lib64/ccache/c++ --launcher-syclcc=/opt/hipsycl/cff515c/bin/syclcc-clang  --hipsycl-platform=omp /usr/lib64/ccache/c++  -DNDEBUG CMakeFiles/babelstream.dir/SYCLStream.cpp.o CMakeFiles/babelstream.dir/main.cpp.o -o babelstream  -Wl,-rpath,/opt/hipsycl/cff515c/lib /opt/hipsycl/cff515c/lib/libhipSYCL-rt.so
SYCLStream.cpp:94:44: warning: narrowing conversion of ‘(int)((SYCLStream<float>*)this)->SYCLStream<float>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:126:43: warning: narrowing conversion of ‘(int)((SYCLStream<float>*)this)->SYCLStream<float>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:110:43: warning: narrowing conversion of ‘(int)((SYCLStream<float>*)this)->SYCLStream<float>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:143:45: warning: narrowing conversion of ‘(int)((SYCLStream<float>*)this)->SYCLStream<float>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:160:47: warning: narrowing conversion of ‘(int)((SYCLStream<float>*)this)->SYCLStream<float>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:221:44: warning: narrowing conversion of ‘(int)((SYCLStream<float>*)this)->SYCLStream<float>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:94:44: warning: narrowing conversion of ‘(int)((SYCLStream<double>*)this)->SYCLStream<double>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:126:43: warning: narrowing conversion of ‘(int)((SYCLStream<double>*)this)->SYCLStream<double>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:110:43: warning: narrowing conversion of ‘(int)((SYCLStream<double>*)this)->SYCLStream<double>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:143:45: warning: narrowing conversion of ‘(int)((SYCLStream<double>*)this)->SYCLStream<double>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:160:47: warning: narrowing conversion of ‘(int)((SYCLStream<double>*)this)->SYCLStream<double>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]
SYCLStream.cpp:221:44: warning: narrowing conversion of ‘(int)((SYCLStream<double>*)this)->SYCLStream<double>::array_size’ from ‘int’ to ‘size_t’ {aka ‘long unsigned int’} [-Wnarrowing]

DPC++ compiled fine and reported nothing (!)

We have several options here:

Append -Wno-narrowing; ComputeCpp compiles after this, but suppressing the diagnostic is probably not a good idea
static_cast to the correct type at each warning/error site, or don't use initialiser lists
Use size_t for array_size

The only places array_size is used in the SYCL version are:

new buffer<T>(array_size);
range<1>{array_size} for all the parallel_for calls
for (int i = 0; i < array_size; i++) in SYCLStream<T>::read_arrays(...)
size_t N = array_size; ... for (; i < N; i += global_size) in SYCLStream<T>::dot()

Not part of this issue but the N in the dot kernel might need to be int as per 9a69d3d.

Other than that, we aren't using it in any benchmark kernels directly; I vote option 3.
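
To make options 2 and 3 concrete, here is a small sketch in plain C++. `Range1` is a hypothetical stand-in for a one-dimensional SYCL range (it only assumes the real type takes a size_t), and the two helper functions are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 1-D SYCL range: it holds a size_t, so
// brace-initialising it from a non-constant int is a narrowing conversion.
struct Range1 {
  std::size_t size;
};

// Option 2: keep an int and cast explicitly at each use site.
inline Range1 make_range_with_cast(int array_size)
{
  assert(array_size >= 0); // a negative int would wrap to a huge size_t
  return Range1{static_cast<std::size_t>(array_size)};
}

// Option 3: carry the size as size_t so no conversion is needed at all.
inline Range1 make_range(std::size_t array_size)
{
  return Range1{array_size};
}

// Note: Range1{n} with a non-constant int n is ill-formed since C++11;
// compilers differ on whether they report it as an error or a warning.
```

Option 3 removes the cast from every call site, which is why it is the less noisy change when the size is never negative.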

OpenACC version should use parallel loop instead of kernels

It seems like parallel loop is becoming best practice for OpenACC.

Add OpenCL driver version to output

Cannot find source file OMP3Stream.cpp

Hi,
just checked out caf367f and did:

$ cd repo
$ mkdir build && cd build
$ cmake ..
#...
- Configuring done
CMake Error at CMakeLists.txt:111 (add_executable):
  Cannot find source file:

    OMP3Stream.cpp

  Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
  .hxx .in .txx


CMake Error: CMake can not determine linker language for target: gpu-stream-omp3
CMake Error: Cannot determine link language for target "gpu-stream-omp3".
-- Generating done
-- Build files have been written to: /projects/hpcsupport/steinbac/development/gpu-stream/build

I see OMPStream.cpp in the source directory, but OMP3Stream.cpp as well as OMP45Stream.cpp are missing.

Specify memory size

Add options to do something like:

use 90% of available device memory
use 200MB in total
use 100MB per array

instead of just a number of elements
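
A rough sketch of how such options could be turned into an element count. The function names are hypothetical, and the factor of three is an assumption based on the reported total size being three times the array size; querying the available device memory is left out because it depends on the programming model.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the benchmark allocates three equally sized arrays.
constexpr std::size_t kNumArrays = 3;

// "use 100MB per array": elements that fit in one array of the given size.
inline std::size_t elements_from_array_bytes(std::size_t bytes_per_array,
                                             std::size_t elem_size)
{
  assert(elem_size > 0);
  return bytes_per_array / elem_size;
}

// "use 200MB in total": split the budget evenly across the arrays.
inline std::size_t elements_from_total_bytes(std::size_t total_bytes,
                                             std::size_t elem_size)
{
  return elements_from_array_bytes(total_bytes / kNumArrays, elem_size);
}

// "use 90% of available device memory": device_bytes comes from the caller.
inline std::size_t elements_from_fraction(std::size_t device_bytes,
                                          double fraction,
                                          std::size_t elem_size)
{
  assert(fraction > 0.0 && fraction <= 1.0);
  const auto budget =
      static_cast<std::size_t>(static_cast<double>(device_bytes) * fraction);
  return elements_from_total_bytes(budget, elem_size);
}
```

All three round down, so the resulting allocation never exceeds the requested budget.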

enable OpenMP 4.5 for compilers besides Cray

I'm not a CMake guru but this project should test for OpenMP 4.5 compiler support generically rather than just associate this implementation with the Cray toolchain.

http://www.openmp.org/resources/openmp-compilers/ has the full list, but at least GCC 6.1+ and Intel 17+ support OpenMP 4.5 and I just confirmed that GCC 6.2.0 can correctly run gpu-stream-omp45.

$ git diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index efee733..574fcc3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -118,12 +118,12 @@ endif ()
 #-------------------------------------------------------------------------------
 # OpenMP 4.5
 #-------------------------------------------------------------------------------
-if ("${CMAKE_CXX_COMPILER_ID}" MATCHES "Cray")
-  if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 4.5)
+#if ("${CMAKE_CXX_COMPILER_ID}" MATCHES "Cray")
+#  if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 4.5)
     add_executable(gpu-stream-omp45 main.cpp OMP45Stream.cpp)
     target_compile_definitions(gpu-stream-omp45 PUBLIC OMP45)
-  endif ()
-endif ()
+#  endif ()
+#endif ()
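
Instead of commenting out the compiler check, a generic test could key off the `_OPENMP` macro, which conforming compilers set to the release date of the OpenMP specification they implement (201511 for 4.5). The sketch below is one possible probe; the function names are made up, and it only reports the specification level, not whether offload to a device actually works.

```cpp
// Returns the OpenMP specification date reported by the compiler,
// or 0 when the translation unit is built without OpenMP enabled.
constexpr long openmp_spec_date()
{
#ifdef _OPENMP
  return _OPENMP;
#else
  return 0;
#endif
}

// OpenMP 4.5 was published in November 2015, hence the 201511 threshold.
constexpr bool supports_openmp45(long spec_date)
{
  return spec_date >= 201511L;
}

// Sanity checks on the threshold itself (4.0 reports 201307).
static_assert(supports_openmp45(201511L), "4.5 must pass the check");
static_assert(!supports_openmp45(201307L), "4.0 must fail the check");
```

A configure step could compile a small file containing `static_assert(supports_openmp45(openmp_spec_date()), "...")` with the OpenMP flag and enable the omp45 target only if that succeeds.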

memcpy declaration issue

I downloaded and configured with cmake ... This error results:

$ /usr/bin/nvcc /home/jrhammon/Work/GPU/GPU-STREAM/CUDAStream.cu -c 
-o /home/jrhammon/Work/GPU/GPU-STREAM/build/CMakeFiles/gpu-stream-cuda.dir//./gpu-stream-cuda_generated_CUDAStream.cu.o 
-ccbin /usr/bin/cc -m64 -Xcompiler ,\"-O3\",\"-DNDEBUG\" --std=c++11 -DNVCC 
-I/usr/include -I$HOME/Work/GPU/GPU-STREAM/build -I/usr/include
/usr/include/string.h: In function ‘void* __mempcpy_inline(void*, const void*, size_t)’:
/usr/include/string.h:652:42: error: ‘memcpy’ was not declared in this scope
   return (char *) memcpy (__dest, __src, __n) + __n;

cc points to a reasonable toolchain (see below), so I do not think it is reasonable to blame it for this error.

$ cc -v
Using built-in specs.
COLLECT_GCC=cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' 
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs 
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ 
--prefix=/usr --program-suffix=-5 
--enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext 
--enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new 
--enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin 
--with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo 
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home 
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar 
--enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic 
--enable-checking=release 
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

Add C++20 Parallel Standard Library Range based kernels

Requirement of map clauses on OpenMP target regions

Is it required to have the map clause on all target regions?
The pointers are defined using the unstructured data directives.
It works OK without them with the Clang compiler.

Add build rules for hipSYCL and DPC++ to SYCL version

Missing compiler flag for LLVM OpenMP target

The Makefile should specify -Xopenmp-target -march=sm_?? when building with LLVM for NVPTX targets.

Compile errors with RAJA 0.3.x

RAJA 0.3.x introduces breaking changes, so BabelStream no longer builds with the latest versions of RAJA.

SYCL implementation crashes

I tried to run SYCL stream but it crashes as shown below. I have my own implementation in the PRK project (here) that works, but with some caveats. The NVIDIA GPU only works with -sycl-target ptx64 (and not -sycl-target spir64). The Intel CPU (using Intel OpenCL) only works when I do the opposite.

It seems that I am hitting this issue here, but I don't know how to change your build system to do ptx64. My attempts were unsuccessful.

Any suggestions? The reason I am interested is that my implementation performs poorly on the GPU and I am trying to rule out user error.

$ make -f SYCL.make clean ; make -f SYCL.make && ./sycl-stream
rm -f sycl-stream SYCLStream.sycl SYCLStream.bc
/opt/sycl/latest/bin/compute++ SYCLStream.cpp -O2 -mllvm -inline-threshold=1000 -sycl -emit-llvm -intelspirmetadata -c -I/opt/sycl/latest/include -o SYCLStream.sycl
remark: [Computecpp:CC0027]: Some memcpy/memset intrinsics added by the llvm optimizer were replaced by serial functions. This
      is a workaround for OpenCL drivers that do not support those intrinsics. This may impact performance, consider using
      -no-serial-memop. [-Rsycl-serial-memop]
g++ -O3 -std=c++11 -DSYCL main.cpp SYCLStream.cpp -I/opt/sycl/latest/include -include SYCLStream.sycl  -L/opt/sycl/latest/lib -lComputeCpp -lOpenCL -Wl,--rpath=/opt/sycl/latest/lib/ -o sycl-stream
BabelStream
Version: 3.3
Implementation: SYCL
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device GeForce GTX 960
Driver: 390.30
Reduction kernel config: 32 groups of size 1024
terminate called after throwing an instance of 'cl::sycl::exception'
Aborted (core dumped)

ComputeCpp info

$ /opt/sycl/latest/bin/computecpp_info 
********************************************************************************

ComputeCpp Info (CE 0.5.1)

********************************************************************************

Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.

********************************************************************************


Device Info:

Discovered 3 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : NO - Device does not support SPIR
  CL_DEVICE_NAME                          : GeForce GTX 960
  CL_DEVICE_VENDOR                        : NVIDIA Corporation
  CL_DRIVER_VERSION                       : 390.30
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 
--------------------------------------------------------------------------------
Device 1:

  Device is supported                     : YES - Tested internally by Codeplay Software Ltd.
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 1.2.0.10
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU 
--------------------------------------------------------------------------------
Device 2:

  Device is supported                     : YES - Tested internally by Codeplay Software Ltd.
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 1.2.0.25
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU 

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.5.1/platform-support-notes

********************************************************************************

OpenCL info

$ clinfo
Number of platforms                               3
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 9.1.84
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   Experimental OpenCL 2.1 CPU Only Platform
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 2.1 LINUX
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             INTEL

  Platform Name                                   Intel(R) OpenCL
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 1.2 LINUX
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 
  Platform Extensions function suffix             INTEL

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 960
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  390.30
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Topology (NV)                            PCI-E, 03:00.0
  Max compute units                               8
  Max clock frequency                             1228MHz
  Compute Capability (NV)                         5.2
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              2096168960 (1.952GiB)
  Error Correction support                        No
  Max memory allocation                           524042240 (499.8MiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        131072
  Global Memory cache line                        128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             4096x4096x4096 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max constant buffer size                        65536 (64KiB)
  Max number of constant args                     9
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

  Platform Name                                   Experimental OpenCL 2.1 CPU Only Platform
Number of devices                                 1
  Device Name                                     Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 2.1 (Build 10)
  Driver Version                                  1.2.0.10
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               16
  Max clock frequency                             3000MHz
  Device Partition                                (core)
    Max number of sub-devices                     16
    Supported partition types                     by counts, equally, by names (Intel)
  Max work item dimensions                        3
  Max work item sizes                             8192x8192x8192
  Max work group size                             8192
  Preferred work group size multiple              128
  Max sub-groups per work group                   1
  Preferred / native vector sizes                 
    char                                                 1 / 32      
    short                                                1 / 16      
    int                                                  1 / 8       
    long                                                 1 / 4       
    half                                                 0 / 0        (n/a)
    float                                                1 / 8       
    double                                               1 / 4        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              16645128192 (15.5GiB)
  Error Correction support                        No
  Max memory allocation                           4161282048 (3.875GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   Yes
    Atomics                                       Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             65536 (64KiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        262144
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             480
    Max size for 1D images from buffer            260080128 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   64 bytes
    Pitch alignment for 2D image buffers          64 bytes
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 480
    Max number of write image args                480
    Max number of read/write image args           480
  Max number of pipe args                         16
  Max active pipe reservations                    16383
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max constant buffer size                        131072 (128KiB)
  Max number of constant args                     480
  Max size of kernel argument                     3840 (3.75KiB)
  Queue properties (on host)                      
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Local thread execution (Intel)                Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                4294967295 (4GiB)
    Max size                                      4294967295 (4GiB)
  Max queues on device                            4294967295
  Max events on device                            4294967295
  Prefer user sync for interop                    No
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    Sub-group independent forward progress        No
    IL version                                    SPIR-V_1.0
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer 

  Platform Name                                   Intel(R) OpenCL
Number of devices                                 1
  Device Name                                     Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 (Build 25)
  Driver Version                                  1.2.0.25
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               16
  Max clock frequency                             3000MHz
  Device Partition                                (core)
    Max number of sub-devices                     16
    Supported partition types                     by counts, equally, by names (Intel)
  Max work item dimensions                        3
  Max work item sizes                             8192x8192x8192
  Max work group size                             8192
Stack dump:
0.	Running pass 'ChannelPipeTransformation' on module 'main'.
Segmentation fault (core dumped)

For info: Windows timing resolution

If you compile the code with VS2013 or lower, the timer resolution is not fine enough to collect accurate results [1]. You would expect to see minimum timings of zero, and very large memory bandwidths in this case.

This should be fixed in VS2015.

[1] https://connect.microsoft.com/VisualStudio/feedback/details/719443/c-chrono-headers-high-resolution-clock-does-not-have-high-resolution
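
To check whether a given toolchain is affected, a minimal sketch along the following lines can be used. It assumes the benchmark timings come from a std::chrono clock; the helper names tick_period_ns and observed_step_ns are hypothetical and are not part of BabelStream.

```cpp
#include <chrono>

// Nominal tick period that the clock type advertises, in nanoseconds.
template <class Clock>
double tick_period_ns()
{
  using period = typename Clock::period;
  return 1e9 * static_cast<double>(period::num) / static_cast<double>(period::den);
}

// Smallest non-zero step actually observed between consecutive readings,
// in nanoseconds. Spins until the clock value changes.
template <class Clock>
double observed_step_ns()
{
  const auto t0 = Clock::now();
  auto t1 = Clock::now();
  while (t1 == t0)
    t1 = Clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count();
}
```

Calling observed_step_ns<std::chrono::high_resolution_clock>() and comparing the result against the expected kernel run time gives a rough idea of whether zero-length timings are likely. The advertised period and the observed step can differ, which is why both are reported separately.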

Compile errors with Kokkos 2.5.00

The --arch flag is defined in two places in the Kokkos build system, so the single-command compile-and-link step used in BabelStream now fails.
The solution seems to be to use separate compile and link steps in the Makefile.

Ref: kokkos/kokkos#1394

OpenMP 4.5 reduction fails

The reduction fails unless map(tofrom: sum) appears before the reduction clause.
This is due to the order of clause evaluation in the OpenMP 4.5 specification.
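
As an illustration of the clause ordering described in this report, here is a minimal sketch of an offloaded dot product. The function name dot and its signature are hypothetical and not taken from the BabelStream source. In OpenMP 4.5 a scalar referenced in a target region is firstprivate by default, so the explicit map(tofrom: sum) is needed for the result to reach the host at all.

```cpp
// Dot product offloaded with OpenMP 4.5 target directives.
// map(tofrom: sum) is written before reduction(+:sum), following the
// ordering suggested in the report above.
double dot(const double* a, const double* b, int n)
{
  double sum = 0.0;
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n], b[0:n]) map(tofrom: sum) reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```

When compiled without OpenMP support the pragma is ignored and the loop runs serially on the host, which is a convenient way to check the expected result.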

error: identifier "__float128" is undefined

I am posting this because others may encounter it and the solution should be logged for posterity.

Problem

/opt/gcc/5.4.0/include/c++/5.4.0/type_traits(311): error: identifier "__float128" is undefined

Solution 1

GCC 5.4.0 include/c++/5.4.0/type_traits needs to be modified to add && !defined(__CUDACC__) in the following, which is at line 311 in my installation.

#if !defined(__STRICT_ANSI__) && defined(_GLIBCXX_USE_FLOAT128) && !defined(__CUDACC__)
  template<>
    struct __is_floating_point_helper<__float128>
    : public true_type { };
#endif

Solution 2

Compile with strict ISO/ANSI flags (e.g. -std=c++11 instead of the default -std=gnu++11) such that the GCC quadmath extensions will be disabled.

Problem with Solution 2

Short error

nvcc fatal   : redefinition of argument 'std'

Long error

/usr/bin/nvcc -M -D__CUDACC__ ~/Work/GPU/GPU-STREAM/CUDAStream.cu -o ~/Work/GPU/GPU-STREAM/build/CMakeFiles/gpu-stream-cuda.dir//gpu-stream-cuda_generated_CUDAStream.cu.o.NVCC-depend -ccbin /opt/gcc/5.4.0/bin/gcc-5.4 -m64 --std c++11 -Xcompiler ,\"-g\",\"-O3\",\"-fopenmp\",\"-fopenacc\",\"-Wall\",\"-Wextra\",\"-O3\",\"-DNDEBUG\" -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 --std=c++11 -DNVCC -I/usr/include -I~/Work/GPU/GPU-STREAM/build -I/usr/include
nvcc fatal   : redefinition of argument 'std'

Related Work

Both Caffe and FFTW have identified the same issue with nvcc.

More detail can be found on StackOverflow.

For info: aligned_alloc undefined

On systems with GLIBC < 2.17, C11 is not available, and so aligned_alloc is not defined. It can be defined using the POSIX equivalent by adding the following to the code:

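#include <stdlib.h>

void* aligned_alloc(size_t alignment, size_t size)
{
    void* mem = NULL;
    // posix_memalign returns non-zero on failure; report that as NULL
    if (posix_memalign(&mem, alignment, size) != 0)
        return NULL;
    return mem;
}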

BabelStream is not working with HIP programming model on the latest ROCm 1.7.1 (1.7-137)

BabelStream is not working with the HIP programming model on the latest ROCm 1.7.1 (1.7-137).
Does the application need to be modified to match the latest changes in HIP?

Can you please take a look at the log below and suggest a resolution?

Add flags for Intel OpenMP offload

The Intel oneAPI HPC Toolkit Beta links to an example for OpenMP target. Judging by the CMakeLists.txt there, the icpc flags that would need to be added to this Makefile might be similar to:

-qnextgen -fiopenmp -std=c++11 -fopenmp-targets=spir64 -O3

Update for SYCL 2020 Provisional

There will probably be a number of changes, but the main one is to update the accessors to use the simpler form:

auto ka = d_a->template get_access<access::mode::read>(cgh); // SYCL 1.2.1
auto ka = d_a->template get_access(cgh, sycl::read_only);    // SYCL 2020 Provisional

We should probably use accessor constructors instead of get_access, with read and write versions.

Implement Dot using the reduction support in SYCL 2020 Provisional.

Remove OpenACC build option for Cray due to deprecation

See https://pubs.cray.com/content/S-5212/9.0/cray-compiling-environment-cce-release-overview/cce-900-compatibilities-and-differences

Add USM/SVM implementations for OpenCL, SYCL and OpenMP

Add option for OpenCL 2.0 shared virtual memory buffers. This will allow measurement of the overhead of the implementations over the device interconnect.

GTX1080 CUDA issues

I wanted to benchmark a GTX 1080 with CUDA 8.0.27 under CentOS 7.2.1511. The gpu-stream-cuda app behaves normally with the default parameters.
Strangely enough, though, when I provide more than the default number of elements in the array:

$ gpu-stream-cuda --arraysize 67108864

the copy kernel dispatch throws CUDA API error 0xb, which is Invalid Argument. I tracked the problem down to this line of code (https://github.com/UoB-HPC/GPU-STREAM/blob/master/CUDAStream.cu#L112):

template <class T>
void CUDAStream<T>::copy()
{
  copy_kernel<<<array_size/TBSIZE, TBSIZE>>>(d_a, d_c);
  check_error();
  cudaDeviceSynchronize();
  check_error();
}

Strangely enough, the value of array_size/TBSIZE looks plausible: array_size/TBSIZE = 65536.

Does anyone have an idea where this is coming from? (As this is a release-candidate version of CUDA, I see no problem forwarding this issue to NVIDIA.)
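
One possible explanation, which is not confirmed anywhere in this report, is the limit on the x-dimension of the launch grid. Devices of compute capability 2.x accept at most 65535 blocks in x, while compute capability 3.0 and later accept 2^31 - 1. If the binary was built only for a 2.x target (an assumption), a launch with 65536 blocks would be one over the limit. The sketch below only reproduces that arithmetic on the host; all names in it are hypothetical, and TBSIZE = 1024 is inferred from 67108864 / 65536.

```cpp
#include <cstdint>

// Maximum gridDim.x for compute capability 2.x targets.
constexpr std::uint64_t kMaxGridX_cc2x = 65535;
// Maximum gridDim.x for compute capability 3.0 and later (2^31 - 1).
constexpr std::uint64_t kMaxGridX_cc3x = 2147483647;

// Number of thread blocks launched by copy_kernel<<<array_size/TBSIZE, TBSIZE>>>.
constexpr std::uint64_t blocks_needed(std::uint64_t array_size, std::uint64_t tbsize)
{
  return array_size / tbsize;
}

// True if a 1D launch with this many blocks fits within the given limit.
constexpr bool launch_fits(std::uint64_t blocks, std::uint64_t max_grid_x)
{
  return blocks <= max_grid_x;
}
```

With --arraysize 67108864 and a block size of 1024, blocks_needed gives 65536, which fails launch_fits against the 2.x limit but passes against the 3.0+ limit. If that is the cause, building for the architecture of the card, or querying the device's maximum grid size at run time, would be the things to try.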

Error building on Xcode 6.1 on MacOS 10.10.4

Hi,
using that configuration I get:
In file included from ocl-stream.cpp:38:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/iostream:38:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ios:216:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:15:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/string:439:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:626:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/utility:253:9: error:
field has incomplete type 'cl::Device'
_T1 first;

I will update Xcode to see if it fixes the issue.

Allocate checking vectors after the main computation

For processors with strong NUMA effects, we need to make sure run() doesn't allocate too much memory before each model allocates its own memory. The memory allocated in run() is only used for checking the results, so it shouldn't affect the performance of the computation.