cupla's Issues

threads vs elements when using the OpenMP 4.0 backend

It looks like cupla does not swap the number of threads and elements when using the OpenMP 4.0 backend.

Using alpaka directly, with the swap explicitly in place:

Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 532.66 us

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 283.06 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 211.79 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 632.7 us

Using cupla:

Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us

The much larger time observed with the OpenMP 4.0 backend is consistent with what I was seeing with alpaka before introducing the swap between threads and elements.
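
To make the swap concrete, here is a minimal sketch of the work division meant above, assuming an alpaka-style configuration (names simplified; the exact work-division API depends on the alpaka version). On GPU backends the parallelism goes into threads; on CPU backends such as OpenMP 4.0 it goes into elements, because spawning hundreds of OS-level threads per block is very expensive.

auto const blocksPerGrid = 71u;
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
auto const threadsPerBlock = 512u;   // GPU: many threads, one element each
auto const elementsPerThread = 1u;
#else
auto const threadsPerBlock = 1u;     // CPU: swap threads and elements
auto const elementsPerThread = 512u; // one thread loops over 512 elements
#endif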

build cupla as a standalone library

When cupla is not being used as a header-only library, would it make sense to build it as a standalone library?
I'm thinking mostly of a shared library, but a static library could also make sense.

It could build something like lib/libcupla.so that user code could link to.

Unfortunately I don't know how to write a CMake file to do it, but I have written a simple Makefile:

.PHONY: all clean

# external tools and dependencies

# CUDA installation, leave empty to disable CUDA support
CUDA_BASE       := /usr/local/cuda

# boost installation, leave empty to use the system installation
BOOST_BASE      :=

# TBB installation, leave empty to use the system installation
TBB_BASE        :=

# Alpaka installation, leave empty to use the version bundled with Cupla
ALPAKA_BASE     :=

# host compiler
CXX             := g++
CXXFLAGS        := -std=c++14 -O2 -g
HOST_CXXFLAGS   := -pthread -fPIC -Wall -Wextra

# OpenMP flags
OMP_FLAGS       := -fopenmp -foffload=disable

# CUDA compiler
ifdef CUDA_BASE
NVCC            := $(CUDA_BASE)/bin/nvcc
NVCC_FLAGS      := --generate-line-info --source-in-ptx --expt-extended-lambda --expt-relaxed-constexpr --generate-code arch=compute_35,code=sm_35 --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_70,code=compute_70 --cudart shared -ccbin $(CXX) -Xcudafe --display_error_number -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored
CUDA_CXXFLAGS   := -I$(CUDA_BASE)/include
CUDA_LDFLAGS    := -L$(CUDA_BASE)/lib64 -lcudart
endif

# boost library
ifdef BOOST_BASE
BOOST_CXXFLAGS  := -I$(BOOST_BASE)/include
else
BOOST_CXXFLAGS  :=
endif

# TBB library
ifdef TBB_BASE
TBB_CXXFLAGS    := -I$(TBB_BASE)/include
TBB_LDFLAGS     := -L$(TBB_BASE)/lib -ltbb -lrt
else
TBB_CXXFLAGS    :=
TBB_LDFLAGS     := -ltbb -lrt
endif

# Alpaka library
ifdef ALPAKA_BASE
ALPAKA_CXXFLAGS := -I$(ALPAKA_BASE)/include -DALPAKA_DEBUG=0
else
ALPAKA_CXXFLAGS := -Ialpaka/include -DALPAKA_DEBUG=0
endif

# source files
SRC=$(wildcard src/*.cpp src/manager/*.cpp)


all: lib/libcupla.so

clean:
	rm -rf build lib

# compile the CUDA GPU backend only if CUDA support is available
ifdef CUDA_BASE

# CUDA GPU backend with synchronous queues
CUDA_SYNC_OBJ = $(SRC:src/%.cpp=build/cuda-sync/%.o)
OBJ += $(CUDA_SYNC_OBJ)

$(CUDA_SYNC_OBJ): build/cuda-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(NVCC) -x cu $(CXXFLAGS) $(NVCC_FLAGS) -Xcompiler '$(HOST_CXXFLAGS)' $(CUDA_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_GPU_CUDA_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# CUDA GPU backend with asynchronous queues
CUDA_ASYNC_OBJ = $(SRC:src/%.cpp=build/cuda-async/%.o)
OBJ += $(CUDA_ASYNC_OBJ)

$(CUDA_ASYNC_OBJ): build/cuda-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(NVCC) -x cu $(CXXFLAGS) $(NVCC_FLAGS) -Xcompiler '$(HOST_CXXFLAGS)' $(CUDA_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_GPU_CUDA_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

endif

# serial CPU backend with synchronous queues
SEQ_SEQ_SYNC_OBJ = $(SRC:src/%.cpp=build/seq-seq-sync/%.o)
OBJ += $(SEQ_SEQ_SYNC_OBJ)

$(SEQ_SEQ_SYNC_OBJ): build/seq-seq-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# serial CPU backend with asynchronous queues
SEQ_SEQ_ASYNC_OBJ = $(SRC:src/%.cpp=build/seq-seq-async/%.o)
OBJ += $(SEQ_SEQ_ASYNC_OBJ)

$(SEQ_SEQ_ASYNC_OBJ): build/seq-seq-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

# std::thread CPU backend with synchronous queues
SEQ_THREADS_SYNC_OBJ = $(SRC:src/%.cpp=build/seq-threads-sync/%.o)
OBJ += $(SEQ_THREADS_SYNC_OBJ)

$(SEQ_THREADS_SYNC_OBJ): build/seq-threads-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# std::thread CPU backend with asynchronous queues
SEQ_THREADS_ASYNC_OBJ = $(SRC:src/%.cpp=build/seq-threads-async/%.o)
OBJ += $(SEQ_THREADS_ASYNC_OBJ)

$(SEQ_THREADS_ASYNC_OBJ): build/seq-threads-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

# OpenMP 2.0 parallel threads CPU backend with synchronous queues
SEQ_OMP2_SYNC_OBJ = $(SRC:src/%.cpp=build/seq-omp2-sync/%.o)
OBJ += $(SEQ_OMP2_SYNC_OBJ)

$(SEQ_OMP2_SYNC_OBJ): build/seq-omp2-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# OpenMP 2.0 parallel threads CPU backend with asynchronous queues
SEQ_OMP2_ASYNC_OBJ = $(SRC:src/%.cpp=build/seq-omp2-async/%.o)
OBJ += $(SEQ_OMP2_ASYNC_OBJ)

$(SEQ_OMP2_ASYNC_OBJ): build/seq-omp2-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

# OpenMP 2.0 parallel blocks CPU backend with synchronous queues
OMP2_SEQ_SYNC_OBJ = $(SRC:src/%.cpp=build/omp2-seq-sync/%.o)
OBJ += $(OMP2_SEQ_SYNC_OBJ)

$(OMP2_SEQ_SYNC_OBJ): build/omp2-seq-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# OpenMP 2.0 parallel blocks CPU backend with asynchronous queues
OMP2_SEQ_ASYNC_OBJ = $(SRC:src/%.cpp=build/omp2-seq-async/%.o)
OBJ += $(OMP2_SEQ_ASYNC_OBJ)

$(OMP2_SEQ_ASYNC_OBJ): build/omp2-seq-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

# OpenMP 4.0 parallel CPU backend with synchronous queues
OMP4_SYNC_OBJ = $(SRC:src/%.cpp=build/omp4-sync/%.o)
OBJ += $(OMP4_SYNC_OBJ)

$(OMP4_SYNC_OBJ): build/omp4-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_BT_OMP4_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# OpenMP 4.0 parallel CPU backend with asynchronous queues
OMP4_ASYNC_OBJ = $(SRC:src/%.cpp=build/omp4-async/%.o)
OBJ += $(OMP4_ASYNC_OBJ)

$(OMP4_ASYNC_OBJ): build/omp4-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_BT_OMP4_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

# TBB parallel blocks CPU backend with synchronous queues
TBB_SEQ_SYNC_OBJ = $(SRC:src/%.cpp=build/tbb-seq-sync/%.o)
OBJ += $(TBB_SEQ_SYNC_OBJ)

$(TBB_SEQ_SYNC_OBJ): build/tbb-seq-sync/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(TBB_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=0 -c $< -o $@

# TBB parallel blocks CPU backend with asynchronous queues
TBB_SEQ_ASYNC_OBJ = $(SRC:src/%.cpp=build/tbb-seq-async/%.o)
OBJ += $(TBB_SEQ_ASYNC_OBJ)

$(TBB_SEQ_ASYNC_OBJ): build/tbb-seq-async/%.o: src/%.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(TBB_CXXFLAGS) $(BOOST_CXXFLAGS) $(ALPAKA_CXXFLAGS) -Iinclude -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -c $< -o $@

# cupla shared library
lib/libcupla.so: $(OBJ)
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(HOST_CXXFLAGS) $(OMP_FLAGS) $^ $(CUDA_LDFLAGS) $(TBB_LDFLAGS) -shared -o $@

The list of backends being compiled is hardcoded, but it could also be adjusted depending on some top-level variables.

What do you think?
Is it a feasible approach, or are there caveats and pitfalls I didn't consider?

Add cudaStreamQuery, cudaMemcpyToSymbol

cudaStreamQuery

The use case is a function that shows the progress of a long-running kernel.
This involves querying the stream status: as long as the stream is busy with the kernel, the function copies the counter value to the host to display the progress (and waits some time before querying again).
(In our case we know the final value (100%), so we can work around the missing cudaStreamQuery by polling until 100% is reached, although an infinite loop might occur in buggy cases.)
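
For illustration, a plain-CUDA sketch of the polling loop described above (progressHost is a hypothetical pinned host counter that the kernel updates; cudaStreamQuery returns cudaErrorNotReady while work is still pending):

#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

void showProgress(cudaStream_t stream, volatile int const* progressHost)
{
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
    {
        std::printf("\rprogress: %d %%", *progressHost); // counter written by the kernel
        std::fflush(stdout);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    std::printf("\rprogress: 100 %%\n");
}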

cudaMemcpyToSymbol

This is not implemented either; a CUDA-only solution would be:

cudaMemcpyToSymbol(/*...*/);
checkCudaErrors( cudaPeekAtLastError() );

Edit: I can add it to cupla; this issue just documents the request.

Problems with cuda_fp16 and Eigen Library

Hello,
I was porting code from CUDA to cupla, and I noticed that some errors are generated when <cupla/config/GpuCudaRt.hpp> is included before the #includes of the Eigen library. In detail, errors are generated when I have a structure like this:

#include <cupla/config/GpuCudaRt.hpp>
#include <Eigen/Core>
#include <Eigen/Eigenvalues>

With the following order, instead, I get no errors:

#include <Eigen/Core>
#include <Eigen/Eigenvalues>
#include <cupla/config/GpuCudaRt.hpp>

In particular the errors are the following:

/usr/local/cuda/include/cuda_fp16.h(3002): error: incomplete type is not allowed
/usr/local/cuda/include/cuda_fp16.h(3002): error: expected an identifier
/usr/local/cuda/include/cuda_fp16.h(3002): error: inline specifier allowed on function declarations only
/usr/local/cuda/include/cuda_fp16.h(3002): error: expected a ";"
/usr/local/cuda/include/cuda_fp16.h(3008): error: incomplete type is not allowed
/usr/local/cuda/include/cuda_fp16.h(3008): error: expected an identifier
/usr/local/cuda/include/cuda_fp16.h(3008): error: inline specifier allowed on function declarations only
/usr/local/cuda/include/cuda_fp16.h(3008): error: expected a ";"
/usr/local/cuda/include/cuda_fp16.hpp(2018): error: name followed by "::" must be a class or namespace name
/usr/local/cuda/include/cuda_fp16.hpp(2018): error: expected an identifier
/usr/local/cuda/include/cuda_fp16.hpp(2018): error: inline specifier allowed on function declarations only
/usr/local/cuda/include/cuda_fp16.hpp(2018): error: expected a ";"

Is there a way to solve this problem, in case for some reason I need to include GpuCudaRt.hpp before the Eigen library?

I'm on CentOS 7 with CUDA 10.1 and GCC 8.3.1.
These are the compilation flags:

export CXXFLAGS="-m64 -std=c++14 -g -O2 -DALPAKA_DEBUG=0 -DALPAKA_CUDA_ARCH=60:70:75 -I$CUDA_ROOT/include -L$CUDA_ROOT/lib64 -lcudart -lcuda -I$ALPAKA_ROOT/include -I$CUPLA_ROOT/include -I$SLC7_BASE/include -I$EIGEN_BASE -I$TBB_BASE/include -L$TBB_BASE/lib -ltbb" 
HOST_FLAGS="-fopenmp -pthread -fPIC -ftemplate-depth-512 -Wall -Wextra -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-local-typedefs -Wno-attributes -Wno-reorder -Wno-sign-compare"

NVCC="$CUDA_ROOT/bin/nvcc"
NVCC_FLAGS="-ccbin $CXX -w -lineinfo --expt-extended-lambda --expt-relaxed-constexpr --use_fast_math --ftz=false --cudart shared"```

Compile error (outdated alpaka, misplaced LLAMA_NO_HOST_ACC_WARNING)

The warnings below also appear with CUDA 10, but the error occurs on both CUDA 9.2.88 and CUDA 10.
It is the same bug as in llama.
I will try to fix it by updating to the latest alpaka version, where it should be fixed.
Maybe it will take some time, but our students need cupla soon.

/opt/cuda/include/crt/math_functions.hpp:54:2: warning: #warning "crt/math_functions.hpp is an internal header file and must not be used directly.  Please use cuda_runtime_api.h or cuda_runtime.h instead." [-Wcpp]
   #warning "crt/math_functions.hpp is an internal header file and must not be used directly.  Please use cuda_runtime_api.h or cuda_runtime.h instead."
    ^~~~~~~
  In file included from /home/gcoe04/cuda-workspace/cupla/alpaka/include/alpaka/math/trunc/TruncCudaBuiltIn.hpp:38:0,
                   from /home/gcoe04/cuda-workspace/cupla/alpaka/include/alpaka/math/MathCudaBuiltIn.hpp:54,
                   from /home/gcoe04/cuda-workspace/cupla/alpaka/include/alpaka/acc/AccGpuCudaRt.hpp:38,
                   from /home/gcoe04/cuda-workspace/cupla/alpaka/include/alpaka/alpaka.hpp:40,
                   from /home/gcoe04/cuda-workspace/cupla/include/cupla_runtime.hpp:24,
                   from /home/gcoe04/cuda-workspace/cupla/src/stream.cpp:22:
  /opt/cuda/include/crt/math_functions.hpp:54:2: warning: #warning "crt/math_functions.hpp is an internal header file and must not be used directly.  Please use cuda_runtime_api.h or cuda_runtime.h instead." [-Wcpp]
   #warning "crt/math_functions.hpp is an internal header file and must not be used directly.  Please use cuda_runtime_api.h or cuda_runtime.h instead."
    ^~~~~~~
  /home/gcoe04/cuda-workspace/cupla/alpaka/include/alpaka/core/Assert.hpp(102): error: this pragma must immediately precede a declaration
  
  1 error detected in the compilation of "/tmp/tmpxft_0000275b_00000000-6_stream.cpp1.ii".
  CMake Error at matrixMul_generated_stream.cpp.o.cmake:279 (message):
    Error generating file
    /home/gcoe04/cuda-workspace/cupla/build/CMakeFiles/matrixMul.dir/__/__/__/src/./matrixMul_generated_stream.cpp.o
  
  
  make[2]: *** [CMakeFiles/matrixMul.dir/build.make:107: CMakeFiles/matrixMul.dir/__/__/__/src/matrixMul_generated_stream.cpp.o] Error 1
  make[1]: *** [CMakeFiles/Makefile2:72: CMakeFiles/matrixMul.dir/all] Error 2
  make: *** [Makefile:84: all] Error 2

add CUDA intrinsics

I have code which makes use of warpSize and __shfl_down. The latter may be impossible to implement with alpaka, but warpSize could be mapped to something meaningful, e.g. elemDim.

thread support

Add thread support for manager classes.
Currently it is not possible to call cuplaSetDevice(), cuplaMalloc(), ... from different threads.
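
For context, a sketch of the kind of multithreaded usage this would enable (cuplaGetDeviceCount and cuplaFree are assumed here to mirror the CUDA runtime names; a cupla config header such as <cupla/config/GpuCudaRt.hpp> is assumed to be included):

#include <thread>
#include <vector>

int main()
{
    int ngpus = 0;
    cuplaGetDeviceCount(&ngpus);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < ngpus; ++dev)
        workers.emplace_back([dev] {
            cuplaSetDevice(dev);     // currently racy when called from several threads
            void* buf = nullptr;
            cuplaMalloc(&buf, 1024); // the manager classes behind this are not thread safe
            cuplaFree(buf);
        });
    for (auto& t : workers)
        t.join();
    return 0;
}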

Lots of warnings with 0.4.0.

I get hundreds of warnings like this one:

../thirdparty/alpaka-develop/include/alpaka/vec/Vec.hpp(248): warning: __device__ annotation is ignored on a function("operator=") that is explicitly defaulted on its first declaration

Applies to __device__ and __host__ annotations.

The CUDA compiler flags used:

['-x=cu', '-ccbin', '/scratch/FSDM/rhel_7_6/fs_dependencies/gcc/bin/g++', '-m64', '-Xcompiler', '"-g","-fopenmp","-ggdb","-fPIC","-std=c++14"', '--expt-extended-lambda', '--expt-relaxed-constexpr', '--use_fast_math', '--ftz=false', '--generate-code', 'arch=compute_52,code=sm_52', '-std=c++14', '-DALPAKA_ACC_GPU_CUDA_ENABLED']

error when building the matmul example with CUDA

I tried to build the matmul example with the instructions in INSTALL.md (and my own project with a simple hello world main.cpp) and got the following error:

/home/simeon/libraries/cupla/alpaka/include/alpaka/vec/Vec.hpp(461): error: no instance of function template "alpaka::vec::Vec<TDim, TSize>::foldrAll [with TDim=cupla::AlpakaDim<1U>, TSize=cupla::MemSizeType]" matches the argument list
            argument types are: (lambda [](cupla::MemSizeType, cupla::MemSizeType)->cupla::MemSizeType)
          detected during:
            instantiation of "auto alpaka::vec::Vec<TDim, TSize>::prod() const->TSize [with TDim=cupla::AlpakaDim<1U>, TSize=cupla::MemSizeType]" 
/home/simeon/libraries/cupla/alpaka/include/alpaka/mem/buf/BufCpu.hpp(524): here
            instantiation of "auto alpaka::mem::buf::traits::Pin<alpaka::mem::buf::BufCpu<TElem, TDim, TSize>, void>::pin(alpaka::mem::buf::BufCpu<TElem, TDim, TSize> &)->void [with TElem=uint8_t, TDim=cupla::AlpakaDim<1U>, TSize=cupla::MemSizeType]" 
/home/simeon/libraries/cupla/alpaka/include/alpaka/mem/buf/Traits.hpp(223): here
            instantiation of "auto alpaka::mem::buf::pin(TBuf &)->void [with TBuf=alpaka::mem::buf::Buf<cupla::AccHost, uint8_t, cupla::AlpakaDim<1U>, cupla::MemSizeType>]" 
/home/simeon/libraries/cupla/src/memory.cpp(150): here

1 error detected in the compilation of "/tmp/tmpxft_00002cb3_00000000-7_memory.cpp1.ii".
CMake Error at matrixMul_generated_memory.cpp.o.cmake:266 (message):
  Error generating file
  /home/simeon/libraries/cupla/build_matmul_cuda/CMakeFiles/matrixMul.dir/__/__/__/src/./matrixMul_generated_memory.cpp.o

I used export CUPLA_ROOT=$HOME/libraries/cupla and cmake $CUPLA_ROOT/example/CUDASamples/matrixMul -DALPAKA_ACC_GPU_CUDA_ENABLE=ON && make -j4.

If I used -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=ON instead of -DALPAKA_ACC_GPU_CUDA_ENABLE=ON, the build was fine.

System: Ubuntu 16.04, GCC 5.4, CUDA 8

Problem with multi-GPU single-CPU kernel with CUPLA

Hello,

I am trying to use CUPLA on a multi-GPU, single-CPU architecture. Below is the code.
It works with a single GPU but not with more than one.

#################################################

struct testBla
{
    template<typename T_Acc>
    ALPAKA_FN_ACC void operator()(T_Acc const& acc, int* bla) const
    {
        *bla = 11;
        printf("Value %d\n", *bla); // note: the original printed the pointer 'bla' with %d
    }
};

int main()
{
    int ngpus;
    cudaGetDeviceCount(&ngpus);

    std::vector<int*> bla;
    bla.resize(ngpus);

    // launch one kernel per device
    for (int idGpu = 0; idGpu < ngpus; ++idGpu)
    {
        cuplaSetDevice(idGpu);
        cuplaMalloc((void**)&bla[idGpu], sizeof(int));
        CUPLA_KERNEL_OPTI(testBla)(1, 1)(bla[idGpu]);
        cuplaDeviceSynchronize();
    }

    // synchronize all devices
    for (int idGpu = 0; idGpu < ngpus; ++idGpu)
    {
        cuplaSetDevice(idGpu);
        cuplaDeviceSynchronize();
    }

    for (int i = 0; i < ngpus; ++i)
    {
        cuplaSetDevice(i);
        cudaDeviceReset();
    }

    return 0;
}

############################

This is the runtime error:

terminate called after throwing an instance of 'std::runtime_error'
what(): /home/rodrig17/gpu-collision-simulation/cupla/alpaka/include/alpaka/queue/cuda_hip/QueueUniformCudaHipRtBase.hpp(75) 'cudaStreamCreateWithFlags( &m_UniformCudaHipQueue, 0x01)' returned error : 'cudaErrorMemoryAllocation': 'out of memory'!
srun: error: kepler021: task 0: Aborted

#######################

This is the ccmake config:

ALPAKA_ACC_CPU_BT_OMP4_ENABLE OFF
ALPAKA_ACC_CPU_B_OMP2_T_SEQ_EN OFF
ALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ OFF
ALPAKA_ACC_CPU_B_SEQ_T_OMP2_EN OFF
ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENA OFF
ALPAKA_ACC_CPU_B_SEQ_T_THREADS OFF
ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENA OFF
ALPAKA_ACC_GPU_CUDA_ENABLE ON
ALPAKA_ACC_GPU_CUDA_ONLY_MODE ON
ALPAKA_ACC_GPU_HIP_ENABLE OFF
ALPAKA_ACC_GPU_HIP_ONLY_MODE OFF
ALPAKA_CUDA_ARCH 30
ALPAKA_CUDA_COMPILER nvcc
ALPAKA_CUDA_FAST_MATH ON
ALPAKA_CUDA_FTZ OFF
ALPAKA_CUDA_KEEP_FILES OFF
ALPAKA_CUDA_NVCC_EXPT_EXTENDED ON
ALPAKA_CUDA_NVCC_SEPARABLE_COM OFF
ALPAKA_CUDA_SHOW_CODELINES OFF
ALPAKA_CUDA_SHOW_REGISTER OFF
ALPAKA_CXX_STANDARD 14
ALPAKA_DEBUG 0
ALPAKA_EMU_MEMCPY3D OFF
ALPAKA_HIP_PLATFORM nvcc
BUILD_APPS ON
BUILD_GMOCK ON
BUILD_GTEST OFF
BUILD_SHARED_LIBS OFF
CMAKE_BUILD_TYPE
CMAKE_CUDA_HOST_COMPILER
CMAKE_INSTALL_PREFIX /usr/local
COMPILER_WARNINGS -W;-Wall;-Wno-unused;-Wextra;-Wno-variadic-macros
CUDA_HOST_COMPILER
CUDA_SDK_ROOT_DIR CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_TOOLKIT_ROOT_DIR /trinity/shared/pkg/devel/cuda/10.0
CUDA_USE_STATIC_CUDA_RUNTIME ON
CUDA_rt_LIBRARY /usr/lib64/librt.so
CUPLA_STREAM_ASYNC_ENABLE ON
NO_DECREMENT_DEPRECATED_WARNIN OFF
OPENMESH_BENCHMARK_DIR
OPENMESH_DOCS ON
OPENMESH_LIBRARY_DIR /home/rodrig17/gpu-collision-simulation/build/Build/lib
RT_LIBRARY /usr/lib64/librt.so
STL_VECTOR_CHECKS OFF
cupla_DIR /home/rodrig17/gpu-collision-simulation/cupla
gmock_build_tests OFF
gtest_build_samples OFF
gtest_build_tests OFF
gtest_disable_pthreads OFF

Thanks for any help!

warning: unused parameter ‘sharedMemSize’

    cupla/include/cupla/kernel.hpp:72:17: warning: unused parameter ‘sharedMemSize’ [-Wunused-parameter]
             size_t sharedMemSize = 0,
                     ^
    cupla/include/cupla/kernel.hpp:82:10: warning: unused parameter ‘stream’ [-Wunused-parameter]
             cuplaStream_t stream = 0
              ^

This can easily be solved by changing the corresponding lines to:

    static cuplaStream_t
    getStream(
        size_t /*sharedMemSize*/ = 0,
        cuplaStream_t stream = 0
    )
    {
        return stream;
    }

    static size_t
    getSharedMemSize(
        size_t sharedMemSize = 0,
        cuplaStream_t /*stream*/ = 0
    )
    {
        return sharedMemSize;
    }

The problem I have with these warnings is that they get repeated for every single object file being compiled, because cupla is header-only. This makes it hard to find real errors and useful warnings in user code.

manager::Memory not thread safe

I just noticed there is already an issue for this: #12

I ran into a memory error with the PMacc-GoL in a multithreaded case with resource manager.

After an investigation I found that in cuplaMallocHost() the variable buf is a nullptr, which comes from manager::Memory::alloc(), which makes a reference out of a pointer (read from a map).
The variable manager::Memory::m_mapVector is not protected by a mutex, so there is indeed a race condition when calling cuplaMallocHost() concurrently, causing one insert to be overwritten.
This could be solved with a simple mutex.
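
A minimal sketch of that fix, using a simplified stand-in for the real manager::Memory (Buffer and the map layout are placeholders, not cupla's actual members):

#include <map>
#include <memory>
#include <mutex>

struct Buffer; // placeholder for the alpaka buffer type

struct Memory
{
    std::map<void*, std::shared_ptr<Buffer>> m_mapVector;
    std::mutex m_mutex;

    void insert(void* ptr, std::shared_ptr<Buffer> buf)
    {
        // serializes concurrent cuplaMallocHost() calls so no insert is lost
        std::lock_guard<std::mutex> lock(m_mutex);
        m_mapVector.emplace(ptr, std::move(buf));
    }
};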

In cuplaMallocHost():

(gdb) select-frame 13
(gdb) info locals
extent = {static s_uiDim = <optimized out>, m_data = {8}}
buf = <error reading variable>
(gdb) p &buf
$2 = (alpaka::mem::buf::BufCpu<unsigned char, std::integral_constant<unsigned long, 1>, unsigned long> *) 0x0

So here is the relevant backtrace from my case:

#10 0x00005555556746da in
std::__shared_ptr_access<
    alpaka::mem::buf::cpu::detail::BufCpuImpl<
        unsigned char,
        std::integral_constant<unsigned long, 1ul>,
        unsigned long
    >,
    (__gnu_cxx::_Lock_policy)2,
    false,
    false
>::operator-> (this=0x0)
at /usr/include/c++/9.1.0/bits/shared_ptr_base.h:1015

#11 0x0000555555672c5e in
alpaka::mem::view::traits::GetPtrNative<
    alpaka::mem::buf::BufCpu<
        unsigned char,
        std::integral_constant<unsigned long, 1ul>,
    unsigned long
    >,
    void
>::getPtrNative
at .../alpaka/include/alpaka/mem/buf/BufCpu.hpp:291

#12 0x0000555555671279 in
alpaka::mem::view::getPtrNative<
    alpaka::mem::buf::BufCpu<
        unsigned char,
    std::integral_constant<unsigned long, 1ul>,
    unsigned long
    >
>
at .../alpaka/include/alpaka/mem/view/Traits.hpp:202

#13 0x000055555566f0ba in cupla_omp2_seq_async::cuplaMallocHost (ptrptr=0x5555558c0ba0, size=8)

add atomic functions

Atomic add and exchange are currently implemented with a macro function.
It would be better to implement inline functions for that.
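
A sketch of what that could look like (the alpaka call is illustrative; the exact atomic namespace and operation tags differ between alpaka versions, and cuplaAtomicAdd is a hypothetical name):

template<typename T_Acc, typename T>
ALPAKA_FN_ACC inline T cuplaAtomicAdd(T_Acc const& acc, T* ptr, T value)
{
    // assumed modern alpaka spelling; older releases use alpaka::atomic::atomicOp<...>
    return ::alpaka::atomicAdd(acc, ptr, value);
}

Unlike a macro, such a function participates in overload resolution and type checking.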

Example: Image Processing

We should add an additional example in example/ that is non-CUDA-SDK and supports the full range of alpaka execution strategies (add element layers, allow an arbitrary number of threads on both the block and thread level) so it can be run in Travis with a few (say 2-4) physical threads in less than a minute.

One example would be:

Some atomic operations not defined

I am writing code with CUPLA and I need to use the atomicAnd, atomicOr and atomicXor bit operations. Is it possible to implement them in CUPLA or to somehow use the ALPAKA implementations?

Thanks!

support for math_functions.h

I got the following when porting by search and replace from CUDA to cupla:

error: there are no arguments to ‘max’ that depend on a template parameter, so a declaration of ‘max’ must be available

cupla should provide such a function (simple min was also missing). CUDA provides this in math_functions.h, which is included by common_functions.h, which is included by cuda_runtime.h. I didn't include this manually, but I guess nvcc included it for me, just like it is unnecessary to write #include <cuda.h> in a .cu file.

I guess min and max are not the only missing functions, hence the title.

The problem is that std::max and so on can't be called from a device function.

Current workaround:

#ifndef max
#   define max( a, b ) ( ((a) > (b)) ? (a) : (b) )
#endif
#ifndef min
#   define min( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif
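
A function-style alternative to the macro workaround, as a sketch (ALPAKA_FN_HOST_ACC is alpaka's host/device annotation; cuplaMax and cuplaMin are hypothetical names, not existing cupla API):

template<typename T>
ALPAKA_FN_HOST_ACC inline T cuplaMax(T const& a, T const& b)
{
    return (a > b) ? a : b; // avoids the double-evaluation pitfall of the macro
}

template<typename T>
ALPAKA_FN_HOST_ACC inline T cuplaMin(T const& a, T const& b)
{
    return (a < b) ? a : b;
}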

examples missing

cupla has no examples of how atomic functions or other device functions should be used.

Please add more documentation.

cupla does not build with GCC 10 and CUDA 11 in C++17 mode

For example, the matrixMul example builds fine with GCC 9 and CUDA 11 in C++17 mode:

/usr/local/cuda/bin/nvcc \
  -ccbin /usr/bin/g++-9 \
  -std=c++17 \
  -m64 \
  -x=cu /home/fwyzard/src/alpaka-group/cupla/src/stream.cpp \
  -c \
  -o /home/fwyzard/src/alpaka-group/cupla/example/CUDASamples/matrixMul/build/CMakeFiles/cupla.dir/__/__/__/src/cupla_generated_stream.cpp.o \
  -DCUPLA_STREAM_ASYNC_ENABLED=1 \
  -DALPAKA_ACC_GPU_CUDA_ENABLED \
  -DALPAKA_DEBUG=0 \
  -DBOOST_ALL_NO_LIB \
  --expt-extended-lambda \
  --expt-relaxed-constexpr \
  --generate-code=arch=compute_50,code=compute_50 \
  --use_fast_math \
  --ftz=false \
  -Xcudafe=--display_error_number \
  -Xcudafe=--diag_suppress=esa_on_defaulted_function_ignored \
  -DNVCC \
  -I/home/fwyzard/src/alpaka-group/cupla/include \
  -I/usr/local/cuda/include \
  -I/home/fwyzard/src/alpaka-group/cupla/alpaka/include \
  -I/usr/include \
  -Wno-deprecated-gpu-targets

The same command using g++-10 instead of g++-9 fails with:

/usr/include/c++/10/tuple(566): error #1921: pack "_UElements" does not have the same number of elements as "_Elements"
          detected during:
            instantiation of "__nv_bool std::tuple<_Elements...>::__nothrow_constructible<_UElements...>() [with _Elements=<const int &>, _UElements=<>]" 
/usr/include/c++/10/bits/stl_map.h(502): here
            instantiation of "std::map<_Key, _Tp, _Compare, _Alloc>::mapped_type &std::map<_Key, _Tp, _Compare, _Alloc>::operator[](const std::map<_Key, _Tp, _Compare, _Alloc>::key_type &) [with _Key=int, _Tp=std::unique_ptr<cupla::cupla_cuda_async::AccDev, std::default_delete<cupla::cupla_cuda_async::AccDev>>, _Compare=std::less<int>, _Alloc=std::allocator<std::pair<const int, std::unique_ptr<cupla::cupla_cuda_async::AccDev, std::default_delete<cupla::cupla_cuda_async::AccDev>>>>]" 
/home/fwyzard/src/alpaka-group/cupla/include/cupla/manager/Device.hpp(117): here
            instantiation of "auto cupla::cupla_cuda_async::manager::Device<T_DeviceType>::device(int)->cupla::cupla_cuda_async::manager::Device<T_DeviceType>::DeviceType & [with T_DeviceType=cupla::cupla_cuda_async::AccDev]" 
/home/fwyzard/src/alpaka-group/cupla/include/cupla/manager/Device.hpp(158): here
            instantiation of "auto cupla::cupla_cuda_async::manager::Device<T_DeviceType>::current()->cupla::cupla_cuda_async::manager::Device<T_DeviceType>::DeviceType & [with T_DeviceType=cupla::cupla_cuda_async::AccDev]" 
/home/fwyzard/src/alpaka-group/cupla/include/cupla/manager/Stream.hpp(157): here
            instantiation of "auto cupla::cupla_cuda_async::manager::Stream<T_DeviceType, T_QueueType>::createNewStream(cuplaStream_t)->cuplaStream_t [with T_DeviceType=cupla::cupla_cuda_async::AccDev, T_QueueType=cupla::cupla_cuda_async::AccStream]" 
/home/fwyzard/src/alpaka-group/cupla/include/cupla/manager/Stream.hpp(72): here
            instantiation of "auto cupla::cupla_cuda_async::manager::Stream<T_DeviceType, T_QueueType>::create()->cuplaStream_t [with T_DeviceType=cupla::cupla_cuda_async::AccDev, T_QueueType=cupla::cupla_cuda_async::AccStream]" 
/home/fwyzard/src/alpaka-group/cupla/src/stream.cpp(43): here

1 error detected in the compilation of "/home/fwyzard/src/alpaka-group/cupla/src/stream.cpp".

Some more tests:
g++-9 with -std=c++14 --> ok
g++-9 with -std=c++17 --> ok
g++-10 with -std=c++14 --> ok
g++-10 with -std=c++17 --> error (see above)

multiple warnings about missing initializer

cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::srcPos’ [-Wmissing-field-initializers]
                         cudaMemcpy3DPeerParms cudaMemCpy3DPeerParms = {0};
                                                   ^
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::srcPtr’ [-Wmissing-field-initializers]
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::srcDevice’ [-Wmissing-field-initializers]
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::dstArray’ [-Wmissing-field-initializers]
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::dstPos’ [-Wmissing-field-initializers]
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::dstPtr’ [-Wmissing-field-initializers]
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::dstDevice’ [-Wmissing-field-initializers]
cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp:638:51: warning: missing initializer for member ‘cudaMemcpy3DPeerParms::extent’ [-Wmissing-field-initializers]

The problem here is the ={0}, which explicitly initializes only the first member (the remaining members are value-initialized, but GCC still warns about the missing explicit initializers). I am not sure what is wanted here though. Maybe no initializer list is necessary at all, or a whole memset( &cudaMemCpy3DPeerParms, 0, sizeof(cudaMemCpy3DPeerParms) ) is what is wanted here.

Lines 576 and 700 are also affected by this. These 3 lines produce roughly 200 lines of warnings, often more than ten times the 80-character line length.
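
For reference, two ways to zero the struct without triggering the warning (a sketch):

#include <cstring>
#include <cuda_runtime.h>

void zeroedParms()
{
    cudaMemcpy3DPeerParms a = {}; // empty braces value-initialize every member, no warning
    // or the memset variant mentioned above:
    cudaMemcpy3DPeerParms b;
    std::memset(&b, 0, sizeof(b));
    (void)a;
    (void)b;
}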

cudaGetLastError() results in infinite loop

Calling cudaGetLastError() while using cupla in header-only mode results in an infinite loop.

test.cc

#include <iostream>

#include <cupla/config/GpuCudaRt.hpp>

int main(void) {
  auto status = cudaGetLastError();

  std::cout << "status: " << status << std::endl;

  return 0;
}

Build and run with

nvcc -x cu -std=c++11 -O2 -g -I/usr/local/alpaka/alpaka/include -I/usr/local/alpaka/cupla/include -w test.cc -o test
./test

The program gets stuck on line 6, and it never reaches the std::cout statement.

Removing the #include <cupla/config/GpuCudaRt.hpp> and running with CUDA natively works, as expected.

types.hpp

cupla/types.hpp
Maybe we want to change the naming of that file. Nevertheless, be aware that if it is installed in a common prefix directory (such as /usr/include/cupla/), naming collisions with other third-party headers do not occur.

I would generally install our libraries only in prefixed dirs; see libSplash, which is always inside splash/... for users to include.

@psychocoderHPC

Similar to PMacc: ComputationalRadiationPhysics/picongpu#1355

Issues with building CUPLA examples with ROCm

Hi all,

I'm attempting to debug some issues with using CUPLA and alpaka with ROCm 4.2, and I'm wondering if I'm missing a step somewhere. I'm testing compilation of the examples; ROCm seems to be found correctly, but the build breaks when it goes to call specific alpaka CMake macros like "HIP_INCLUDE_DIRECTORIES" and "HIP_ADD_LIBRARY" that are used in cupla/alpaka/alpakaConfig.cmake.

Is it possible that I am missing a define from my initial command or something else?

cmake $CUPLA_ROOT/example/CUDASamples/matrixMul -DALPAKA_ACC_GPU_HIP_ENABLE=ON -DALPAKA_HIP_VERSION=4.2

....
CMake Warning at /ccs/home/jyoung/caar_spock/build/cupla/alpaka/cmake/alpakaCommon.cmake:30 (message):
  The HIP back-end is currently experimental. The alpaka HIP backend compiled with
  clang does not support callback functions.
Call Stack (most recent call first):
  /ccs/home/jyoung/caar_spock/build/cupla/alpaka/CMakeLists.txt:85 (include)


-- Found Boost: /sw/spock/spack-envs/base/opt/linux-sles15-x86_64/gcc-7.5.0/boost-1.73.0-nnaww46oatokeudktwfaxu6dvl7zsyhw/lib/cmake/Boost-1.73.0/BoostConfig.cmake (found suitable version "1.73.0", minimum required is "1.65.1")  missing components: fiber
-- ROCclr at /opt/rocm-4.2.0/lib/cmake/rocclr
-- hip::amdhip64 is SHARED_LIBRARY
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
-- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed
-- ALPAKA_ACC_GPU_HIP_ENABLED
CMake Error at /ccs/home/jyoung/caar_spock/build/cupla/alpaka/cmake/alpakaCommon.cmake:792 (HIP_INCLUDE_DIRECTORIES):
  Unknown CMake command "HIP_INCLUDE_DIRECTORIES".
Call Stack (most recent call first):
  /ccs/home/jyoung/caar_spock/build/cupla/alpaka/CMakeLists.txt:85 (include)

Line 792 in /cupla/alpaka/cmake/alpakaCommon.cmake:

   list(APPEND HIP_HIPCC_FLAGS
       ${_ALPAKA_COMPILE_DEFINITIONS_HIP}
       )
   HIP_INCLUDE_DIRECTORIES(
       #${_ALPAKA_INCLUDE_DIRECTORY}
       #${_ALPAKA_INCLUDE_DIRECTORIES_PUBLIC}
       #${HIP_INCLUDE_DIRS}
       ${Boost_INCLUDE_DIRS}
       ${_ALPAKA_ROOT_DIR}/test/common/include
       )

Using cupla alongside mallocMC

I am trying to create a new project with both of the above as submodules. In https://github.com/frobnitzem/FastParticleToolkit I basically replicated the structure of ComputationalRadiationPhysics/picongpu, but left out PMacc. Something got lost in translation with the include/fpt/CMakeLists.txt though, since I always get some variant of:

CMake Error at $HOME/FastParticleToolkit/thirdParty/mallocMC/alpaka/CMakeLists.txt:108 (add_custom_target):
  add_custom_target cannot create target "alpakaIde" because another target
  with the same name already exists.  The existing target is a custom target
  created in source directory
  "$HOME/FastParticleToolkit/include/fpt".  See
  documentation for policy CMP0002 for more details.

-- Found mallocMC: $HOME/FastParticleToolkit/thirdParty/mallocMC/src (found suitable version "2.6.0", minimum required is "2.3.0")  

It seems the issue is that cupla links to alpaka and defines alpakaIde first. Then mallocMC tries the same trick and fails. I can't set mallocMC to "use external alpaka" since the two alpaka versions are different. So I have too many alpakas.

code:

git clone --recursive https://github.com/frobnitzem/FastParticleToolkit
FastParticleToolkit/bin/FPT-create.sh experiment
cd experiment
../FastParticleToolkit/bin/FPT-build.sh

Support for __float_as_int

and also for __int_as_float. I make use of them, e.g., here:

template<class T_FUNC>
__device__ inline void atomicFunc
(
    double * const rdpTarget,
    double const rValue,
    T_FUNC f
)
{
    using ull = unsigned long long int;
    ull assumed;
    ull old = * (ull*) rdpTarget;
    do
    {
        assumed = old;
        old = atomicCAS( (ull*) rdpTarget, assumed,
            __double_as_longlong( f( __longlong_as_double(assumed), rValue ) ) );
    }
    while ( assumed != old );
}

A host implementation could look like this: http://stackoverflow.com/a/13802260/2191065
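
Along those lines, a host-side analogue of the CAS loop above, sketched with std::atomic (this mirrors the approach of the linked answer):

#include <atomic>

template<class T_FUNC>
inline void atomicFuncHost(std::atomic<double>& target, double value, T_FUNC f)
{
    double expected = target.load();
    // compare_exchange_weak updates 'expected' on failure, so we simply retry
    while (!target.compare_exchange_weak(expected, f(expected, value)))
        ;
}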

problem with asynchronous peer-to-peer copy

Hi, I would like to convert my CUDA code to CUPLA and I have some issues with multi-GPU/single-CPU P2P async copies. This is the error:

error: argument of type "cuplaStream_t" is incompatible with parameter of type "cudaStream_t"

This is a typical P2P copy in my code:

    const int numStreams = 10;
    cudaStream_t streams[numStreams];

    cuplaSetDevice(idGpuI);
    for (int i = 0; i < numStreams; ++i)
        cudaStreamCreate(&streams[i]);

    for (size_t loc1 = 0; loc1 < grid1Size * grid1Size * grid1Size; ++loc1)
    {
        cudaMemcpyPeerAsync(&(grid0GpuJ[loc1].grid0Size), idGpuJ, &(grid0GpuI[loc1].grid0Size),
                            idGpuI, sizeof(int), streams[loc1 % numStreams]);
    }
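
For the stream part, the type error above comes from mixing cudaStream_t with cupla's stream type. A sketch of the setup using cupla names (cuplaStreamCreate is assumed to mirror cudaStreamCreate; whether cupla offers a peer-copy wrapper like cudaMemcpyPeerAsync is exactly the open question of this issue):

    const int numStreams = 10;
    cuplaStream_t streams[numStreams];

    cuplaSetDevice(idGpuI);
    for (int i = 0; i < numStreams; ++i)
        cuplaStreamCreate(&streams[i]); // assumed cupla call mirroring the CUDA runtime name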

So how do I write this in CUPLA?

In order to enable P2P access, this is what I am doing in CUDA:

inline void enableP2P(int ngpus, std::string info)
{
    for (int i = 0; i < ngpus; i++)
    {
        cuplaSetDevice(i);
        for (int j = 0; j < ngpus; j++)
        {
            if (i == j)
                continue;
            int peer_access_available = 0;
            cudaDeviceCanAccessPeer(&peer_access_available, i, j);
            if (peer_access_available)
            {
                cudaDeviceEnablePeerAccess(j, 0);
                if (info == "info")
                    printf("> GPU%d enabled direct access to GPU%d\n", i, j);
            }
            else
            {
                if (info == "info")
                    printf("(%d, %d)\n", i, j);
            }
        }
    }
}

It seems that in CUPLA cudaDeviceEnablePeerAccess is done automatically and
cudaDeviceCanAccessPeer disappears, so I think the enableP2P function is not necessary anymore, right?

Thanks for any help!

[edited by psychocoderHPC: fixed formatting]

update alpaka to the latest dev

After https://github.com/ComputationalRadiationPhysics/alpaka/pull/416 is merged, cupla should be updated to the latest alpaka dev.
Currently the TBB backend is not known to cupla, but alpaka enables it by default if possible; if CUDA is used, this triggers the warning:

If ALPAKA_ACC_GPU_CUDA_ONLY_MODE is enabled, only back-ends using CUDA can be enabled! This allows to mix alpaka code with native CUDA code. However, this prevents any non-CUDA back-ends from being enabled

FAIL in matrixMul example

The matrixMul example fails for several CPU backends, for example -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLE=ON:

...
Error! Matrix[51197]=0.00000000, ref=1.59999990 error term is > 1.000000E-06
Error! Matrix[51198]=0.00000000, ref=1.59999990 error term is > 1.000000E-06
Error! Matrix[51199]=0.00000000, ref=1.59999990 error term is > 1.000000E-06
Result = FAIL

Note: For peak performance, please refer to the matrixMulCUBLAS example.

Looking at the CI setup, it seems like this is expected, is that correct? If that is the case, can I update the install instructions to use the vectorAdd example instead?

sharedMemExtern not working

original:

extern __shared__ __align__( sizeof(T_PREC) ) unsigned char dynamicSharedMemory[];

converted version:

sharedMemExtern( dynamicSharedMemory, unsigned char * );

gets me:

cudaGaussian.tpp:808:9: error: invalid conversion from ‘unsigned char**’ to ‘unsigned char*’ [-fpermissive]

and, just to try it, the variant without the pointer:

sharedMemExtern( dynamicSharedMemory, unsigned char );

gets me:

cudaGaussian.tpp:808:9: error: invalid conversion from ‘unsigned char*’ to ‘unsigned char’ [-fpermissive]

I think adding a pointer to this:

    #define sharedMemExtern(ppName, ...)   \
    __VA_ARGS__ /* fix -> */ *ppName =    \
        ::alpaka::block::shared::dyn::getMem<__VA_ARGS__>(acc)

should work, but I'm not sure if this is correct. I can say that it compiles and runs; my tests fail with wrong results, but that could be some other porting error.

If this fix is correct, then PortingGuide.md also needs to be corrected, because:

extern __shared__ int * smInt;

becomes

sharedMemExtern( smInt, int );

instead of

sharedMemExtern( smInt, int * );

See commit https://github.com/mxmlnkn/cupla/commit/cd30ff81f0701325f793c956923429b18c7dc449

cupla examples can not be built without installing cupla

When I try to build the cupla examples directly from the repository tree, the build process fails because cupla can not be found.

cupla/build$ cmake ../example/CUDASamples/vectorAdd/
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:38 (find_package):
  By not providing "Findcupla.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "cupla", but
  CMake did not find one.

  Could not find a package configuration file provided by "cupla" with any of
  the following names:

    cuplaConfig.cmake
    cupla-config.cmake

  Add the installation prefix of "cupla" to CMAKE_PREFIX_PATH or set
  "cupla_DIR" to a directory containing one of the above files.  If "cupla"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!

Support for cudaGetDeviceProperties

I have some code, which queries cudaGetDeviceProperties for multiProcessorCount and maxThreadsPerMultiProcessor in order to determine how many blocks, threads and streams to start.

It would be cool if cupla could emulate the device properties to some extent for this kind of use case by mapping them onto corresponding features, e.g. maxThreadsPerMultiProcessor = OMP_NUM_CORES, multiProcessorCount = 1.
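
For reference, the plain-CUDA pattern this issue wants cupla to emulate (pickBlockCount is an illustrative helper; the problem size and block size are example values):

#include <algorithm>
#include <cuda_runtime.h>

int pickBlockCount(int n, int threadsPerBlock)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int const maxResidentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    // cap the grid so it roughly matches what the device can keep resident
    return std::min(maxResidentThreads / threadsPerBlock,
                    (n + threadsPerBlock - 1) / threadsPerBlock);
}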

LOGO

alpaka has started to create a logo.
We also need a logo for cupla.

#draft 1

[logo draft image: cupla_final]

Thanks @steindev for your help!

Math function wrappers are not sufficiently generic

The current implementation does not call the generic alpaka math functions due to arising host-device problems. Instead, it calls the alpaka implementation traits directly. However, in doing so it assumes that the first argument of operator() has the same type as the first template parameter. This does not have to be the case in alpaka, and some complex math functions are different (and have to be, as they rely on other math functions). So it is not alpaka's problem, but a bug in cupla that mistakenly relies on this form of trait implementation. The bug has probably been here for a long time; it just didn't surface because alpaka previously didn't have other kinds of math traits.

This bug causes behavior like ComputationalRadiationPhysics/picongpu#4138
