rocm / aomp Goto Github PK

AOMP is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples.

Home Page: https://github.com/ROCm/aomp

License: Apache License 2.0

Shell 12.50% Makefile 7.64% C++ 16.08% C 23.84% Fortran 31.29% Python 0.16% Cuda 0.18% GDB 0.01% AMPL 0.02% Awk 0.01% SWIG 6.71% Assembly 1.55% CMake 0.03%

amd llvm clang openmp rocm

aomp's Introduction

AMD ROCm Software

ROCm is an open-source stack, composed primarily of open-source software, designed for graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

With ROCm, you can customize your GPU software to meet your specific needs. You can develop, collaborate, test, and deploy your applications in a free, open source, integrated, and secure software ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC), artificial intelligence (AI), scientific computing, and computer aided design (CAD).

ROCm is powered by AMD’s Heterogeneous-computing Interface for Portability (HIP), an open-source software C++ GPU programming environment and its corresponding runtime. HIP allows ROCm developers to create portable applications on different platforms by deploying code on a range of platforms, from dedicated gaming GPUs to exascale HPC clusters.

ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary open source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning (ML) frameworks, such as PyTorch and TensorFlow.

Getting the ROCm Source Code

AMD ROCm is built from open source software. It is, therefore, possible to modify the various components of ROCm by downloading the source code and rebuilding the components. The source code for ROCm components can be cloned from each of the GitHub repositories using git. For easy access to download the correct versions of each of these tools, the ROCm repository contains a repo manifest file called default.xml. You can use this manifest file to download the source code for ROCm software.

Installing the repo tool

The repo tool from Google allows you to manage multiple git repositories simultaneously. Run the following commands to install the repo tool:

mkdir -p ~/bin/
curl https://storage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
chmod a+x ~/bin/repo

Note: The ~/bin/ folder is used as an example. You can specify a different folder to install the repo tool into if you desire.

Installing git-lfs

Some ROCm projects use the Git Large File Storage (LFS) format that may require you to install git-lfs. Refer to Git Large File Storage for more information. For example, to install git-lfs for Ubuntu, use the following command:

sudo apt-get install git-lfs

Downloading the ROCm source code

The following example shows how to use the repo tool to download the ROCm source code. If you choose a directory other than ~/bin/ to install the repo tool, you must use that chosen directory in the code as shown below:

mkdir -p ~/ROCm/
cd ~/ROCm/
~/bin/repo init -u http://github.com/ROCm/ROCm.git -b roc-6.0.x
~/bin/repo sync

Note: Using this sample code will cause the repo tool to download the open source code associated with the specified ROCm release. Ensure that you have ssh-keys configured on your machine for your GitHub ID prior to the download as explained at Connecting to GitHub with SSH.

Building the ROCm source code

Each ROCm component repository contains directions for building that component, such as the rocSPARSE documentation Installation and Building for Linux. Refer to the specific component documentation for instructions on building the repository.

Each release of the ROCm software supports specific hardware and software configurations. Refer to System requirements (Linux) for the current supported hardware and OS.

ROCm documentation

This repository contains the manifest file for ROCm releases, changelogs, and release information.

The default.xml file contains information for all repositories and the associated commit used to build the current ROCm release; default.xml uses the Manifest Format repository.

Source code for our documentation is located in the /docs folder of most ROCm repositories. The develop branch of our repositories contains content for the next ROCm release.

The ROCm documentation homepage is rocm.docs.amd.com.

Building the documentation

For a quick-start build, use the following code. For more options and detail, refer to Building documentation.

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Alternatively, CMake build is supported.

cmake -B build
cmake --build build --target=doc

Older ROCm releases

For release information for older ROCm releases, refer to the CHANGELOG.

aomp's People

Contributors

Stargazers

Watchers

aomp's Issues

Data race when multiple host threads offload

The issue occurs in 0.6.5
Use the reproducer in #22
make -j15 check_spo make sure ./bin/check_spo is created.
OMP_NUM_THREADS=1 ./bin/check_spo always pass
OMP_NUM_THREADS=4 ./bin/check_spo
almost always fail

Copying data from device failed.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Segmentation fault (core dumped)

linking error multi definition of offload code

Use reproducer linking.zip

$ sh build.sh 
error: Linking globals named '__omp_offloading_10304_1e81eac__ZN3fooIfEC1Ev_l7_exec_mode': symbol multiply defined!
clang-8: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)

When the offload compiler flags are removed, the code links well.

Lack of math function support

Support for basic math functions in offload region is missing.
mainline clang currently has a temporal fix.
https://bugs.llvm.org/show_bug.cgi?id=41090
Reproducer

#include <cmath>
using namespace std;

int main(int argc, char **argv)
{
  #pragma omp target
  {
    float intpart, res;
    res = modff(1.1f, &intpart);
  }

  #pragma omp target
  {
    double intpart, res;
    res = modf(1.1, &intpart);
  }
  return 0;
}

Remove support for BRIG from openmp plugin

I think this is obsolete. Any objections?

libatmi_runtime can't find libhsa-runtime64

$ ldd ~/rocm/aomp/lib/libatmi_runtime.so.0 | grep found
	libhsa-runtime64.so.1 => not found

$ readelf -d ~/rocm/aomp/lib/libatmi_runtime.so.0 | grep RUNPATH
 0x000000000000001d (RUNPATH)            Library runpath: [/home/amd/rocm/aomp_0.7-6/lib:$ORIGIN/../../hsa/lib:$ORIGIN/../../lib]

I think that's one level too far back up the tree for hsa. I have thus far been unable to work out which part of the build scripts to modify to fix this.

Remove DeepCode bot

E.g. ROCm-Developer-Tools/llvm-project#93

It's line noise. I don't have permissions to delete it.

Is aomp a replacement for ROC llvm/clang/lld?

Is the "amop" compiler a replacement for the https://github.com/RadeonOpenCompute/clang/tree/roc-2.7.0 compiler? Or shall it be used in future?

Problem with target exit data not cleaning up the device memory (?)

I am continuously exchanging data between the host and device over longer periods of time. I noticed that the speed of all threads is slowly and permanently decreasing each time target data is called until I recompile the code.

I have the following example in a file called test.cpp which will compile to Vanilla C/C++ as it reads (sorry for using the Rcpp wrapper here but I want to provide the full context regarding the compiler settings below ):

#include <Rcpp.h>
#include <omp.h>
using namespace Rcpp;
// [[Rcpp::export]]

RcppExport SEXP test(SEXP data) {
Rcpp::NumericVector Data(data);
int N = Data.length();
int I = 100;
double *pointer_Data = Data.begin();
#pragma omp target enter data map(to:pointer_Data[0:N]) device(0)
//do some stuff here
double output = 0;
for (int i=0; i<I;++i) {
#pragma omp target teams distribute parallel for device(0)
for (int n = 0; n < N; ++n) {
output += 1;}}
//do more stuff here
#pragma omp target exit data map(delete:pointer_Data[0:N]) device(0)
return Rcpp::wrap(output);}
`

The problem here sits in iterative calls to target data that should however nevertheless clean up after exit if I understand correctly and not pile something up. I use the following compiler settings to work with R (to be put in the MAKEVARS file in the .R folder in the home directory):

CFLAGS += -O3 -m64 -mavx -mfpmath=sse
CXXFLAGS += -O3 -std=c++14 -m64 -mavx -mfpmath=sse
CXX11FLAGS += -O3 -std=c++14 -m64 -mavx -mfpmath=sse
CXX1XFLAGS += -O3 -std=c++14 -m64 -mavx -mfpmath=sse

SHLIB_OPENMP_CFLAGS = -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906
SHLIB_OPENMP_CXXFLAGS = -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906
PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)

CC = /usr/lib/aomp/bin/clang-9
CXX = /usr/lib/aomp/bin/clang-9
CXX11 = /usr/lib/aomp/bin/clang-9
CCX1X = /usr/lib/aomp/bin/clang-9`

I can compile, call and run the code in the R console as follows:

install.packages("Rcpp")
library(Rcpp)
sourceCpp('test.cpp',rebuild = TRUE)
N<-10000000
I<-500
Time<-vector(mode="numeric",length=I)
for (i in 1:I) {
Data<-c(sample(x=c(1:i),size=N+sample(c(0,2),size=1),replace=TRUE))
Time[i]<-system.time(test(Data),gcFirst=TRUE)[3]}

Which shows on my machine that the speed for the same calculation drops to less than 50% over just 200 function calls and stays there until recompilation. I am happy to try any suggestions that could narrow the problem further down unless it is obvious and report the results back, so nobody has to use this code. But I am not sure what else to try next other than avoiding continuous data exchange. What worries me is that the problem piles up over time and is persistent which seems wrong. Any suggestions are very welcome! Thank you very much!

Best, Dirk

linker error

Using 0.7-7.
The AOMP linker works on more complicated miniQMC but failed in linking the following test case.
https://github.com/ye-luo/openmp-target/tree/master/hands-on/tests/link_static_fat_bin

/usr/lib/aomp/bin/clang++ -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -c classA.cpp
rm -f mylib.a
ar qc mylib.a classA.o
ranlib mylib.a
/usr/lib/aomp/bin/clang++ -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -c main.cpp
/usr/lib/aomp/bin/clang++ -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa main.o ./mylib.a

error

: error loading '/tmp/mylib-gfx906-fd6e48.a': The file was not recognized as a valid object file!: The file was not recognized as a valid object file.
clang-9: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)

upstream spack package

I notice in the docs and in the repo there are references to spack, and in fact a partial spack package. Could this be upstreamed so we can start actually using it for downstream packages?

I'm working on fixing up the llvm package in spack to build a working offload compiler with libomptarget. It's working for cuda, but the lack of a working aomp package with appropriate dependencies means no AMD support for now.

`hipcc` fails when installing package into custom location

I followed the instructions to install the aomp .deb package into a custom location that does not require privileged permissions. My aomp installation is not in /usr/lib/....

When I try to compile something I get the following output:

$ hipcc threaded_hip.cpp
Can't exec "/usr/lib/aomp_0.7-7/bin/clang": No such file or directory at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 190.
Use of uninitialized value $HIP_CLANG_VERSION in pattern match (m//) at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 191.
Use of uninitialized value $HIP_CLANG_VERSION in concatenation (.) or string at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 195.
Use of uninitialized value $HIP_CLANG_INCLUDE_PATH in concatenation (.) or string at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 219.
Can't exec "/usr/lib/aomp_0.7-7/bin/rocm_agent_enumerator": No such file or directory at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 724.
Use of uninitialized value $myAgents in split at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 725.
Died at /data/common/aomp/7.7/usr/lib/aomp/bin/hipcc line 771.
No valid AMD GPU target was either specified or found. Please specify a valid target using --amdgpu-target=

I looked at line 190 of the hipcc script and I see:

$HIP_CLANG_VERSION = `$HIPCC --version`;

Then I backtracked the the $HIPCC variable definition to line182 to find:

$HIPCC="$HIP_CLANG_PATH/clang++";

Then I backtracked $HIP_CLANG_PATH variable definition to line 82 to find:

$HIP_CLANG_PATH="/usr/lib/aomp_0.7-7/bin";

Therein lies the problem. Hardcoding the path to hipcc makes the shipped .deb package non-relocatable.

Wrong reduction result on AMD GPU

https://github.com/ye-luo/openmp-target/tree/master/hands-on/gemv/4-gemv-omp-target-reduction

AMD GPU:

/opt/rocm/aomp/bin/clang++ -std=c++11 -Drestrict=__restrict__ -g -O3 -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -o gemv-omp-target-reduction.x -I ../../common gemv-omp-target-reduction.cpp

prints

Vout[0] != 8192, wrong value is 0

Nvidia GPU runs well.

/opt/rocm/aomp/bin/clang++ -std=c++11 -Drestrict=__restrict__ -g -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_61 -nocudalib -o gemv-omp-target-reduction.x -I ../../common gemv-omp-target-reduction.cpp

Is it because reduction has not been implemented?

Change to __builtin_readcyclecounter

Removes an #ifdef on amdgcn version. Need to check what it expands to (hopefully memrealtime) and still need to work out what scaling factor to use to map the result onto seconds.

edit: Seems to be expanding to s_memtime, not s_memrealtime.

aomp include search order finds /usr/local/include /usr/include before its localized include

our aomp product searches /usr/local/include:/usr/include:
this leads to issues if the compiler package has newer definitions of omp related functionality inside omp.h. and there exists a file omp.h in the preceding include paths.

This can be worked around with an -isystem option
-isystem/home/rlieberm/rocm/aomp/include

Possible gpu arch mismatch: gfx900, please check compiler: -march=<gpu> flag

Hi guys,

I am trying to use AOMP on our UD system. I was able to run . 0.6-(2,3,4) without issues, but I get this error on 0.7-(0,1,2,3,4,5)

Possible gpu arch mismatch: gfx900, please check compiler: -march=<gpu> flag
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

LIBOMPTARGET_DEBUG=1

> LD_LIBRARY_PATH=/software/compilers/aomp/aomp0.7-5/usr/lib/aomp_0.7-5/lib-debug/:$LD_LIBRARY_PATH LIBOMPTARGET_DEBUG=1 ./offloading_success.c.o
Libomptarget --> Loading RTLs...
Libomptarget --> Unable to find file '/sys/firmware/devicetree/base/ibm,firmware-versions/open-power', skipping dlopen for 'libomptarget.rtl.ppc64.so'
Libomptarget --> Loading library '/software/compilers/aomp/aomp0.7-5/usr/lib/aomp_0.7-5/lib-debug/libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library '/software/compilers/aomp/aomp0.7-5/usr/lib/aomp_0.7-5/lib-debug/libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Unable to find file '/dev/nvidia0', skipping dlopen for 'libomptarget.rtl.cuda.so'
Libomptarget --> Loading library '/software/compilers/aomp/aomp0.7-5/usr/lib/aomp_0.7-5/lib-debug/libomptarget.rtl.hsa.so'...
Target HSA RTL --> Start initializing HSA-ATMI
Target HSA RTL --> There are 2 devices supporting HSA.
Target HSA RTL --> Device 0: Initial groupsPerDevice 128 & threadsPerGroup 256
Target HSA RTL --> "Get HSA agents" succeeded
Target HSA RTL --> Device 1: Initial groupsPerDevice 128 & threadsPerGroup 256
Target HSA RTL --> "Get HSA agents" succeeded
Libomptarget --> Successfully loaded library '/software/compilers/aomp/aomp0.7-5/usr/lib/aomp_0.7-5/lib-debug/libomptarget.rtl.hsa.so'!
Libomptarget --> Registering RTL libomptarget.rtl.hsa.so supporting 2 devices!
Libomptarget --> Unable to find file '/sys/module/mdio_thunder/initstate', skipping dlopen for 'libomptarget.rtl.aarch64.so'
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000601050 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000601050 is compatible with RTL libomptarget.rtl.hsa.so!
Libomptarget --> RTL 0x00000000006da540 has index 0!
Libomptarget --> Registering image 0x0000000000601050 with RTL libomptarget.rtl.hsa.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 2
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Entering target region with entry point 0x0000000000400884 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target HSA RTL --> Init requires flags to 1
Target HSA RTL --> Initialize the device id: 0
Target HSA RTL --> Using 64 compute unis per grid
Target HSA RTL --> Using 1024 ROCm blocks per grid
Target HSA RTL --> Capped thread limit: 1024
Target HSA RTL --> Queried wavefront size: 64
Target HSA RTL --> Name of gpu:gfx900
Target HSA RTL --> Default number of teams set according to library's default 128
Target HSA RTL --> Default number of threads set according to library's default 256
Target HSA RTL --> Device 0: default limit for groupsPerDevice 1024 & threadsPerGroup 1024
Target HSA RTL --> Device 0: wavefront size 64, total threads 1024 x 1024 = 1048576
Libomptarget --> Device 0 is ready to use.
Target HSA RTL --> Machine ID found: 224
Target HSA RTL --> "Module registering" failed
Possible gpu arch mismatch: gfx900, please check compiler: -march=<gpu> flag
Libomptarget --> Unable to generate entries table for device id 0.
Libomptarget --> Failed to init globals on device 0
Libomptarget --> Failed to get device 0 ready
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000601050 is compatible with RTL 0x00000000006da540!
Libomptarget --> Unregistered image 0x0000000000601050 from RTL 0x00000000006da540!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000604aa0
Libomptarget --> Done unregistering library!
Target HSA RTL --> Finalizing the HSA-ATMI DeviceInfo.

System information:

ROCMINFO

[monsalve@obiwan] [~/SOLLVE/sollve_vv] [Thu Nov 14] [10:44 PM] (master)
> rocminfo
ROCk module is loaded
monsalve is member of video group
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 7 1800X Eight-Core Processor
  Marketing Name:          AMD Ryzen 7 1800X Eight-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3600
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32932700(0x1f6835c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Acessible by all:        TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32932700(0x1f6835c) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Acessible by all:        TRUE
  ISA Info:
    N/A
*******
Agent 2
*******
  Name:                    gfx900
  Marketing Name:          Vega 10 XTX [Radeon Vega Frontier Edition]
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          4096(0x1000)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
  Chip ID:                 26723(0x6863)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1600
  BDFID:                   2560
  Internal Node ID:        1
  Compute Unit:            64
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      FALSE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Acessible by all:        FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Acessible by all:        FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx900
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx900
  Marketing Name:          Vega 10 XTX [Radeon Vega Frontier Edition]
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          4096(0x1000)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
  Chip ID:                 26723(0x6863)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1600
  BDFID:                   3328
  Internal Node ID:        2
  Compute Unit:            64
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      FALSE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Acessible by all:        FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Acessible by all:        FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx900
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

ROCM-SMI

> rocm-smi -a


====================    ROCm System Management Interface    ====================
================================================================================
GPU[0] 		: GPU ID: 0x6863
GPU[1] 		: GPU ID: 0x6863
================================================================================
================================================================================
GPU[0] 		: Temperature: 35c
GPU[1] 		: Temperature: 31c
================================================================================
================================================================================
GPU[0] 		: GPU Clock Level: 0 (852Mhz)
GPU[0] 		: GPU Memory Clock Level: 0 (167Mhz)
GPU[1] 		: GPU Clock Level: 0 (852Mhz)
GPU[1] 		: GPU Memory Clock Level: 0 (167Mhz)
================================================================================
================================================================================
GPU[0] 		: Fan Level: 35 (13.73)%
GPU[1] 		: Fan Level: 35 (13.73)%
================================================================================
================================================================================
GPU[0] 		: Current PowerPlay Level: auto
GPU[1] 		: Current PowerPlay Level: auto
================================================================================
================================================================================
GPU[0] 		: Current GPU OverDrive value: 0%
GPU[1] 		: Current GPU OverDrive value: 0%
================================================================================
================================================================================
GPU[0] 		:
NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
  0 3D_FULL_SCREEN :             70  60          1              3
  1   POWER_SAVING :             90  60          0              0
  2          VIDEO*:             70  60          0              0
  3             VR :             70  90          0              0
  4        COMPUTE :             30  60          0              6
  5         CUSTOM :              0   0          0              0
GPU[1] 		:
NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
  0 3D_FULL_SCREEN :             70  60          1              3
  1   POWER_SAVING :             90  60          0              0
  2          VIDEO*:             70  60          0              0
  3             VR :             70  90          0              0
  4        COMPUTE :             30  60          0              6
  5         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0] 		: Average GPU Power: 3.0 W
GPU[1] 		: Average GPU Power: 5.0 W
================================================================================
================================================================================
GPU[0] 		: Supported GPU clock frequencies on GPU0
GPU[0] 		: 0: 852Mhz *
GPU[0] 		: 1: 991Mhz
GPU[0] 		: 2: 1138Mhz
GPU[0] 		: 3: 1269Mhz
GPU[0] 		: 4: 1348Mhz
GPU[0] 		: 5: 1440Mhz
GPU[0] 		: 6: 1528Mhz
GPU[0] 		: 7: 1600Mhz
GPU[0] 		:
GPU[0] 		: Supported GPU Memory clock frequencies on GPU0
GPU[0] 		: 0: 167Mhz *
GPU[0] 		: 1: 500Mhz
GPU[0] 		: 2: 800Mhz
GPU[0] 		: 3: 945Mhz
GPU[0] 		:
GPU[1] 		: Supported GPU clock frequencies on GPU1
GPU[1] 		: 0: 852Mhz *
GPU[1] 		: 1: 991Mhz
GPU[1] 		: 2: 1138Mhz
GPU[1] 		: 3: 1269Mhz
GPU[1] 		: 4: 1348Mhz
GPU[1] 		: 5: 1440Mhz
GPU[1] 		: 6: 1528Mhz
GPU[1] 		: 7: 1600Mhz
GPU[1] 		:
GPU[1] 		: Supported GPU Memory clock frequencies on GPU1
GPU[1] 		: 0: 167Mhz *
GPU[1] 		: 1: 500Mhz
GPU[1] 		: 2: 800Mhz
GPU[1] 		: 3: 945Mhz
GPU[1] 		:
================================================================================
====================           End of ROCm SMI Log          ====================

Another reduction bug

The second check in the following program failed.
/usr/lib/aomp/bin/clang++ -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 test_parallel_for.cpp
I got wrong counts2 = 0 should be 4!. The test can pass with -O2 or -O3 added.

#include <iostream>

int main()
{
  int counts1 = 0;
  int counts2 = 0;
  #pragma omp target teams map(from:counts1)
  {
    int counts_team = 0;
    #pragma omp parallel
    {
      #pragma omp for
      for (int i=0; i<4; i++)
        #pragma omp atomic
        counts_team += 1;
    }
    counts1 = counts_team;
  }

  #pragma omp target teams map(from:counts2)
  {
    int counts_team = 0;
    #pragma omp parallel
    {
      #pragma omp for reduction(+:counts_team)
      for (int i=0; i<4; i++)
        counts_team += 1;
    }
    counts2 = counts_team;
  }

  if (counts1 != 4)
    std::cout << " wrong counts1 = " << counts1 << " should be 4!" << std::endl;
  if (counts2 != 4)
    std::cout << " wrong counts2 = " << counts2 << " should be 4!" << std::endl;
}

typeid error when enabling GPU offloading in the compiler

I posted this question in the general forum at first with no answer so far *, but it is probably better placed here. I can't successfully compile code that includes offloading from OpenMP to Vega20, since aomp seems not to support typeid requests anymore when gpu offloading is enabled in the compiler. Does anybody know how to solve this problem?

I am using rocm 2.4 on Ubuntu 18.04 with aomp 0.6-2 and OpenMP. The code works fine as long as I offload to the CPUs. However, once I offload to my GPU it only compiles and works when I remove all typeid requests. But I need these requests to pass the results and data from and to Rcpp.

I provide a minimal example by simply adding a single typeid request and the required header to one of the working examples from aomp, namely reduction.c:

aomp example

`#include <stdio.h>
#define N 1000000ll
#define SUM (N * (N-1)/2)

int main (void)
{
long long a, i;

//std::cout << typeid(a).name() << std::endl;

#pragma omp target parallel shared(a) private(i)
{
#pragma omp master
a = 0;

#pragma omp barrier

#pragma omp for reduction(+:a)
for (i = 0; i < N; i++) {
    a += i;
}

// The Sum shall be sum:[0:N]
#pragma omp single
{
  if (a != SUM)
    printf ("Incorrect result = %lld, expected = %lld!\n", a, SUM);
  else
    printf ("The result is correct = %lld!\n", a);
}

}

return 0;
}`

This works fine when compiled unchanged with the following options:

$ /opt/rocm/aomp/bin/clang++ -frtti -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 reduction.c

However, when I uncomment the single typeid request in the code above and add the required headers at the very top:

#include <typeinfo> #include <iostream>

the compilation fails with the error:

reduction.c:12:14: error: use of typeid requires -frtti

which is however already used. Once I compile with CPU options only, it works again:

$ /opt/rocm/aomp/bin/clang++ -target x86_64-pc-linux-gnu -fopenmp reduction.c

Does anybody know the reason for this problem and how it can be solved? Any hint would be greatly appreciated!

AMDGPU: test/hip-openmp: some tests fail to compile

Two tests for hip+OpenMP hybrid mode fail to compile (undefined reference to void printf())

matrixmul_omp_for (https://github.com/ROCm-Developer-Tools/aomp/tree/master/test/hip-openmp/matrixmul_omp_for)
/tmp/matrixmul-8da20e.o: In function .omp_outlined.': matrixmul.cpp:(.text+0x9e0): undefined reference to void printf(char const*, int)'
matrixmul.cpp:(.text+0xa0c): undefined reference to void printf<int, int>(char const*, int, int)' matrixmul.cpp:(.text+0xcf8): undefined reference to void printf(char const*, int)'
matrixmul.cpp:(.text+0x119b): undefined reference to `void printf(char const*, int)'
matrixmul_omp_task (https://github.com/ROCm-Developer-Tools/aomp/tree/master/test/hip-openmp/matrixmul_omp_task)
/tmp/matrixmul-ea1f2d.o: In function .omp_task_entry.': matrixmul.cpp:(.text+0x73e): undefined reference to void printf(char const*, int)'
matrixmul.cpp:(.text+0xb9e): undefined reference to void printf<int>(char const*, int)' matrixmul.cpp:(.text+0xea5): undefined reference to void printf(char const*, int)'
matrixmul.cpp:(.text+0xed4): undefined reference to `void printf<int, int>(char const*, int, int)'

__HIP_DEVICE_COMPILE__ not working correctly

With at least aomp tag rel_0.7-0 and rel_0.7-1-temp, the following HIP code fails to compile:

#include <hip/hip_runtime.h>

__device__
void g() {}

__host__ __device__
void f()
{
#ifdef __HIP_DEVICE_COMPILE__
  g();
#endif
}

__global__
void kernel()
{
  f();
}

int main()
{
  kernel<<<1,1>>>();
  hipDeviceSynchronize();
}

with the following error during the host pass:

test.cpp:10:3: error: reference to __device__ function 'g' in __host__ __device__ function
  g();
  ^
test.cpp:4:6: note: 'g' declared here
void g() {}

It seems that the __HIP_DEVICE_COMPILE__ macro is also defined during the host pass. The same code compiles fine for CUDA. I have also not seen this problem with other clang/LLVM distributions, only with aomp so far.

Template instantiations missing from hip device code

Piping through llvm-dis because the binary form of the bitcode gets written to the terminal despite passing -S. O1 cleans up the IR, function still missing at O0.

// ~/rocm/aomp/bin/clang++ --cuda-device-only -nogpulib --cuda-gpu-arch=gfx906
// atomic.hip  -o - | ~/rocm/aomp/bin/llvm-dis -o -

#define DEV __attribute__((device))

// Function emitted as expected
DEV int atomic_inc_int(int *x) {
  return __atomic_fetch_add(x, 1, __ATOMIC_SEQ_CST);
}

// Template definition
template <typename T> DEV T atomic_inc(T *x) {
  return __atomic_fetch_add(x, 1, __ATOMIC_SEQ_CST);
}

// Explict instantiation, not emitted
template DEV int atomic_inc<int>(int *);

// Use of implicit instantiation, not emitted
auto *leak_addr = &atomic_inc<int>;

The instantiation works in C++, but not with the above invocation.

; ModuleID = '<stdin>'                                                                                                                                                                                             
source_filename = "atomic.hip"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
target triple = "amdgcn-amd-amdhsa"

; Function Attrs: nofree norecurse nounwind                                                                                                                                                                        
define hidden i32 @_Z14atomic_inc_intPi(i32* nocapture %x) local_unnamed_addr #0 {
entry:
  %0 = atomicrmw add i32* %x, i32 1 seq_cst
  ret i32 %0
}

attributes #0 = { nofree norecurse nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" \
"no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "targ\
et-cpu"="gfx906" "target-features"="+16-bit-insts,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,+dpp,+fp32-denormals,+fp64-fp16-denormals,+gfx8-insts,+gfx9-insts,+s-memrealtime" "unsafe-fp-math"="false" "use-soft\
-float"="false" }

!llvm.module.flags = !{!0, !1}
!opencl.ocl.version = !{!2}
!llvm.ident = !{!3}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 1}
!2 = !{i32 2, i32 0}
!3 = !{!"clang version 9.0.1 ([email protected]:ROCm-Developer-Tools/llvm-project 199fd9ded9cfb9dfebfde3ca38a6f56441b947aa)"}

The atomic intrinsics are incidental - this was encountered while looking at dropping a dependency on hc from openmp. I'll work around by writing functions instead of templates.

Reproduces on trunk (https://godbolt.org/z/WAhRBA) so raised as https://bugs.llvm.org/show_bug.cgi?id=44416

Kernel error and computer freeze

After installing 0.7-5 and rebuild the app, if I ran test_openmp, the process either stuck or the whole screen freezes. If I reverted to 0.7-4 and test_openmp runs fine. I'm using rocm2.9.
Note that my monitor is connected to a Nvidia GPU. So the failure is a catastrophe to the whole node.

Here is my /var/log/kern.log

Oct 28 20:23:12 ryzen-box kernel: [  903.868096] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test_openmp:6648]
Oct 28 20:23:12 ryzen-box kernel: [  903.868101] Modules linked in: rfcomm xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo cmac bnep binfmt_misc nvidia_uvm(OE) nvidia_drm(POE) nls_iso8859_1 nvidia_modeset(POE) arc4 edac_mce_amd nvidia(POE) amdgpu(OE) amdttm(OE) iwlmvm wmi_bmof mac80211 snd_hda_codec_realtek uvcvideo snd_hda_codec_generic snd_usb_audio kvm iwlwifi videobuf2_vmalloc snd_usbmidi_lib snd_hda_codec_hdmi videobuf2_memops irqbypass amdkcl(OE) videobuf2_v4l2 crct10dif_pclmul snd_seq_midi snd_hda_intel amd_sched(OE) crc32_pclmul videobuf2_core ghash_clmulni_intel snd_seq_midi_event amd_iommu_v2 videodev snd_hda_codec btusb pcbc cfg80211 btrtl drm_kms_helper snd_rawmidi btbcm snd_hda_core media btintel drm input_leds snd_hwdep bluetooth snd_seq ipmi_devintf snd_pcm aesni_intel i2c_algo_bit ipmi_msghandler
Oct 28 20:23:12 ryzen-box kernel: [  903.868138]  fb_sys_fops aes_x86_64 syscopyarea crypto_simd sysfillrect snd_seq_device glue_helper sysimgblt ecdh_generic snd_timer cryptd snd ccp soundcore k10temp shpchp wmi mac_hid sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid i2c_piix4 r8169 ahci nvme mii libahci nvme_core gpio_amdpt gpio_generic
Oct 28 20:23:12 ryzen-box kernel: [  903.868160] CPU: 0 PID: 6648 Comm: test_openmp Tainted: P           OE    4.15.0-66-generic #75-Ubuntu
Oct 28 20:23:12 ryzen-box kernel: [  903.868161] Hardware name: Micro-Star International Co., Ltd MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.40 04/19/2017
Oct 28 20:23:12 ryzen-box kernel: [  903.868166] RIP: 0010:iommu_unmap_page+0x1f/0x100
Oct 28 20:23:12 ryzen-box kernel: [  903.868168] RSP: 0018:ffffc1a5602efae0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff11
Oct 28 20:23:12 ryzen-box kernel: [  903.868170] RAX: bfea5b6cd39b2100 RBX: 000016c3ff619000 RCX: 0000000000000027
Oct 28 20:23:12 ryzen-box kernel: [  903.868170] RDX: 0000000000001000 RSI: 000016c3ff618000 RDI: ffffa0ea85eff000
Oct 28 20:23:12 ryzen-box kernel: [  903.868171] RBP: ffffc1a5602efb18 R08: 0000000000000000 R09: 0000000000000000
Oct 28 20:23:12 ryzen-box kernel: [  903.868172] R10: ffffc1a5602efae0 R11: 000000000000001b R12: ffffa0ea85eff000
Oct 28 20:23:12 ryzen-box kernel: [  903.868173] R13: 00001ffe8e970000 R14: 00000000fff8e970 R15: 00000fff00000000
Oct 28 20:23:12 ryzen-box kernel: [  903.868174] FS:  00007fc60f13bec0(0000) GS:ffffa0ea8e600000(0000) knlGS:0000000000000000
Oct 28 20:23:12 ryzen-box kernel: [  903.868175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 20:23:12 ryzen-box kernel: [  903.868176] CR2: 00007fc60da7b420 CR3: 0000000b5496e000 CR4: 00000000003406f0
Oct 28 20:23:12 ryzen-box kernel: [  903.868177] Call Trace:
Oct 28 20:23:12 ryzen-box kernel: [  903.868181]  __unmap_single.isra.27+0x62/0x100
Oct 28 20:23:12 ryzen-box kernel: [  903.868184]  unmap_sg+0x5f/0x70
Oct 28 20:23:12 ryzen-box kernel: [  903.868228]  amdgpu_ttm_backend_unbind+0x8e/0xe0 [amdgpu]
Oct 28 20:23:12 ryzen-box kernel: [  903.868232]  ttm_tt_unbind+0x21/0x40 [amdttm]
Oct 28 20:23:12 ryzen-box kernel: [  903.868235]  ttm_tt_destroy.part.11+0x12/0x60 [amdttm]
Oct 28 20:23:12 ryzen-box kernel: [  903.868237]  ttm_tt_destroy+0x13/0x20 [amdttm]
Oct 28 20:23:12 ryzen-box kernel: [  903.868240]  ttm_bo_cleanup_memtype_use+0x32/0x70 [amdttm]
Oct 28 20:23:12 ryzen-box kernel: [  903.868243]  ttm_bo_release+0x1eb/0x2a0 [amdttm]
Oct 28 20:23:12 ryzen-box kernel: [  903.868246]  ? kmem_cache_free+0x1b3/0x1e0
Oct 28 20:23:12 ryzen-box kernel: [  903.868248]  ? kmem_cache_free+0x1b3/0x1e0
Oct 28 20:23:12 ryzen-box kernel: [  903.868251]  amdttm_bo_put+0x1e/0x20 [amdttm]
Oct 28 20:23:12 ryzen-box kernel: [  903.868286]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
Oct 28 20:23:12 ryzen-box kernel: [  903.868342]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x17f/0x240 [amdgpu]
Oct 28 20:23:12 ryzen-box kernel: [  903.868393]  kfd_ioctl_free_memory_of_gpu+0x85/0xd0 [amdgpu]
Oct 28 20:23:12 ryzen-box kernel: [  903.868442]  kfd_ioctl+0x271/0x450 [amdgpu]
Oct 28 20:23:12 ryzen-box kernel: [  903.868487]  ? kfd_ioctl_set_memory_policy+0xc0/0xc0 [amdgpu]
Oct 28 20:23:12 ryzen-box kernel: [  903.868490]  ? kmem_cache_free+0x1b3/0x1e0
Oct 28 20:23:12 ryzen-box kernel: [  903.868491]  ? kmem_cache_free+0x1b3/0x1e0
Oct 28 20:23:12 ryzen-box kernel: [  903.868493]  do_vfs_ioctl+0xa8/0x630
Oct 28 20:23:12 ryzen-box kernel: [  903.868495]  SyS_ioctl+0x79/0x90
Oct 28 20:23:12 ryzen-box kernel: [  903.868498]  do_syscall_64+0x73/0x130
Oct 28 20:23:12 ryzen-box kernel: [  903.868501]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Oct 28 20:23:12 ryzen-box kernel: [  903.868503] RIP: 0033:0x7fc60cd5f5d7
Oct 28 20:23:12 ryzen-box kernel: [  903.868504] RSP: 002b:00007ffce0f62f68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 28 20:23:12 ryzen-box kernel: [  903.868505] RAX: ffffffffffffffda RBX: 000000000097ac60 RCX: 00007fc60cd5f5d7
Oct 28 20:23:12 ryzen-box kernel: [  903.868506] RDX: 00007ffce0f62fa0 RSI: 0000000040084b17 RDI: 0000000000000009
Oct 28 20:23:12 ryzen-box kernel: [  903.868506] RBP: 00007ffce0f62fa0 R08: 00007fc5e30ba438 R09: 000000000097ac90
Oct 28 20:23:12 ryzen-box kernel: [  903.868507] R10: 000000008e970000 R11: 0000000000000246 R12: 0000000040084b17
Oct 28 20:23:12 ryzen-box kernel: [  903.868508] R13: 0000000000000009 R14: 000000008e970000 R15: 0000000000000000
Oct 28 20:23:12 ryzen-box kernel: [  903.868509] Code: 0b 5d c3 e8 a4 ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 10 65 48 8b 04 25 28 00 00 00 <48> 89 45 d0 31 c0 48 85 d2 0f 84 b8 00 00 00 48 8d 42 ff 48 89

LIBOMPTARGET_DEBUG doesn't work with AOMP releases

LIBOMPTARGET_DEBUG currently doesn't work with AOMP binary releases.
Can we have -DLIBOMPTARGET_ENABLE_DEBUG=ON enabled on future builds?

Latent bug in device_atomic_functions, atomicInc, for reductions

The cuda analogue of atomicInc is documented to do:

reads the 32-bit word old located at the address address in global or shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This is used by reduction.cu with a comment showing that the author knew this was the behaviour

// atomicInc increments 'timestamp' and has a range [0, NumTeams-1].
// It resets 'timestamp' back to 0 once the last team increments
// this counter.
unsigned val = __kmpc_atomic_inc(timestamp, NumTeams - 1);
IsLastTeam = val == NumTeams - 1;

Our implementation of this function is

unsigned atomicInc(unsigned *address, unsigned max) {
if (*address >= max)
return *address;
else
return atomic_inc_unsigned(address);
}

This is not atomic and does not write a zero to *address in the first part of the branch. I don't have a test case that fails due to this.

std::complex support needed

https://github.com/ye-luo/openmp-target/blob/master/hands-on/tests/complex/complex.cpp

$ clang++ -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -march=native -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 complex.cpp 
lld: error: undefined symbol: __mulsc3
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_mulIfSt7complexIfES1_EvT0_T1__l59)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_mulIfSt7complexIfES1_EvT0_T1__l59)

lld: error: undefined symbol: __divsc3
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIfSt7complexIfES1_EvT0_T1__l76)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIfSt7complexIfES1_EvT0_T1__l76)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIffSt7complexIfEEvT0_T1__l76)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIffSt7complexIfEEvT0_T1__l76)

lld: error: undefined symbol: __muldc3
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_mulIdSt7complexIdES1_EvT0_T1__l59)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_mulIdSt7complexIdES1_EvT0_T1__l59)

lld: error: undefined symbol: __divdc3
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIdSt7complexIdES1_EvT0_T1__l76)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIdSt7complexIdES1_EvT0_T1__l76)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIddSt7complexIdEEvT0_T1__l76)
>>> referenced by /tmp/complex-gfx906-72c03e-gfx906-c2b83e.o:(__omp_offloading_10304_2920ae4__Z8test_divIddSt7complexIdEEvT0_T1__l76)
clang-11: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)

Remove references to ptx intrinsics from amdgcn flang

Said intrinsics don't exist on amdgcn. I intend to fix this by patching flang but won't necessarily get to it immediately.

aomp 0.7-3 interop issue with ROCm 2.9 hip libraries

if the ROCM 2.9 module is loaded, subsequent compilation of test/hip-OpenMP/matrixmul_omp_for fails to link, missing printf symbols

to reproduce

module load rocm/2.9.0
cd ../matrixmul_omp_for/
AOMP=/rocm/aomp make clean
AOMP=/rocm/aomp make run

/ccs/home/ronlieb/rocm/aomp/bin/clang++ -O3 -target x86_64-pc-linux-gnu -fopenmp -x hip --offload-arch=gfx906 matrixmul.cpp -o matrixmul
In file included from matrixmul.cpp:25:
In file included from /opt/rocm/include/hip/hip_runtime.h:56:
In file included from /opt/rocm/include/hip/hcc_detail/hip_runtime.h:101:
In file included from /opt/rocm/include/hip/hcc_detail/hip_atomic.h:3:
In file included from /opt/rocm/include/hip/hcc_detail/device_functions.h:34:
/opt/rocm/include/hip/hcc_detail/device_library_decls.h:80:72: warning: cast to 'attribute((address_space(3))) void ' from smaller integer type 'unsigned int' [-Wint-to-void-pointer-cast]
device inline static __local void __to_local(unsigned x) { return (__local void*)x; }
^
1 warning generated when compiling for host.
/tmp/matrixmul-06b648.o: In function .omp_outlined.': matrixmul.cpp:(.text+0x748): undefined reference to void printf(char const*, int)'
matrixmul.cpp:(.text+0xbeb): undefined reference to void printf<int>(char const*, int)' matrixmul.cpp:(.text+0xf52): undefined reference to void printf(char const*, int)'
matrixmul.cpp:(.text+0xf88): undefined reference to `void printf<int, int>(char const*, int, int)'
clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
Makefile:81: recipe for target 'matrixmul' failed
make: *** [matrixmul] Error 1

Kokkos: Getting ICE on ItaniumMangleContextImpl::mangleCXXName

Hi,

I am getting the following error when trying to compile the Kokkos OpenMPTarget backend with aomp. This compiles with clang 9.0 using gcc 6.x as the gcc-toolchain and targeting NVIDIA Volta GPUs.

clang-9: /root/git/aomp/llvm-project/clang/lib/AST/ItaniumMangle.cpp:4859: virtual void {anonymous}::ItaniumMangleContextImpl::mangleCXXName(const clang::NamedDecl*, llvm::raw_ostream&): Assertion `(isa<FunctionDecl>(D) || isa<VarDecl>(D)) && "Invalid mangleName() call, argument is not a variable or function!"' failed.
Stack dump:  

0.      Program arguments: /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9 -cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-unknown-linux-gnu -std=c++11 -emit-llvm-bc -emit-llvm-uselists -disable-free -main-file-name TestOpenMPTarget_Reducers_b.cpp -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -mconstructor-aliases -fuse-init-array -target-cpu gfx900 -fcuda-is-device -fcuda-allow-variadic-functions -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/hip.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/opencl.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/ocml.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/ockl.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/oclc_finite_only_off.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/oclc_daz_opt_off.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/oclc_correctly_rounded_sqrt_on.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/oclc_unsafe_math_off.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/oclc_isa_version_900.amdgcn.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/libdevice/libaompextras-amdgcn-gfx900.bc -mlink-builtin-bitcode /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../lib/oclc_wavefrontsize64_on.amdgcn.bc -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/core/unit_test/CMakeFiles/KokkosCore_UnitTest_OpenMPTarget.dir/openmptarget/TestOpenMPTarget_Reducers_b.cpp.gcno -resource-dir /home/projects/x86-64-naples/rocm/aomp_0.7-6/lib/clang/9.0.1 -D GTEST_HAS_PTHREAD=0 -D KOKKOS_WORKAROUND_OPENMPTARGET_CLANG -I /ascldap/users/crtrott/Kokkos/kokkos/tpls/gtest -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/core/unit_test -I /ascldap/users/crtrott/Kokkos/kokkos/core/unit_test -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/core/src -I /ascldap/users/crtrott/Kokkos/kokkos/core/src -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/containers/src -I /ascldap/users/crtrott/Kokkos/kokkos/containers/src -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/algorithms/src -I /ascldap/users/crtrott/Kokkos/kokkos/algorithms/src -D GTEST_HAS_PTHREAD=0 -D KOKKOS_WORKAROUND_OPENMPTARGET_CLANG -I /ascldap/users/crtrott/Kokkos/kokkos/tpls/gtest -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/core/unit_test -I /ascldap/users/crtrott/Kokkos/kokkos/core/unit_test -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/core/src -I /ascldap/users/crtrott/Kokkos/kokkos/core/src -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/containers/src -I /ascldap/users/crtrott/Kokkos/kokkos/containers/src -I /ascldap/users/crtrott/Kokkos/KokkosBuild/Caraway/OpenMPTarget/algorithms/src -I /ascldap/users/crtrott/Kokkos/kokkos/algorithms/src -I/home/projects/x86-64/binutils/2.30.0/include -I/home/projects/x86-64-naples/rocm/aomp_0.7-6/include -internal-isystem /home/projects/x86-64/gcc/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../include/c++/6.4.0 -internal-isystem /home/projects/x86-64/gcc/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../include/c++/6.4.0/x86_64-pc-linux-gnu -internal-isystem /home/projects/x86-64/gcc/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../include/c++/6.4.0/backward -internal-isystem /home/projects/x86-64/gcc/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../include/c++/6.4.0 -internal-isystem /home/projects/x86-64/gcc/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../include/c++/6.4.0/x86_64-pc-linux-gnu -internal-isystem /home/projects/x86-64/gcc/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/../../../../include/c++/6.4.0/backward -internal-isystem /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../include -internal-isystem /home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/../include -internal-isystem /usr/local/include -internal-isystem /home/projects/x86-64-naples/rocm/aomp_0.7-6/lib/clang/9.0.1/include -internal-externc-isystem /include -internal-ex
1.      <eof> parser at end of file
2.      /ascldap/users/crtrott/Kokkos/kokkos/core/unit_test/TestReducers.hpp:947:15: instantiating function definition 'Test::TestReducers<unsigned long, Kokkos::Experimental::OpenMPTarget>::execute_integer'
3.      /ascldap/users/crtrott/Kokkos/kokkos/core/unit_test/TestReducers.hpp:292:15: instantiating function definition 'Test::TestReducers<unsigned long, Kokkos::Experimental::OpenMPTarget>::test_sum'
4.      /ascldap/users/crtrott/Kokkos/kokkos/core/src/Kokkos_Parallel_Reduce.hpp:986:13: instantiating function definition 'Kokkos::parallel_reduce<Kokkos::RangePolicy<Kokkos::Experimental::OpenMPTarget>, Test::TestReducers<unsigned long, Kokkos::Experimental::OpenMPTarget>::SumFunctor, Kokkos::Sum<unsigned long, Kokkos::HostSpace> >'
5.      /ascldap/users/crtrott/Kokkos/kokkos/core/src/Kokkos_Parallel_Reduce.hpp:856:22: instantiating function definition 'Kokkos::Impl::ParallelReduceAdaptor<Kokkos::RangePolicy<Kokkos::Experimental::OpenMPTarget>, Test::TestReducers<unsigned long, Kokkos::Experimental::OpenMPTarget>::SumFunctor, Kokkos::Sum<unsigned long, Kokkos::HostSpace> >::execute'
6.      /ascldap/users/crtrott/Kokkos/kokkos/core/src/OpenMPTarget/Kokkos_OpenMPTarget_Parallel.hpp:682:15: instantiating function definition 'Kokkos::Impl::ParallelReduce<Test::TestReducers<unsigned long, Kokkos::Experimental::OpenMPTarget>::SumFunctor, Kokkos::RangePolicy<Kokkos::Experimental::OpenMPTarget>, Kokkos::Sum<unsigned long, Kokkos::HostSpace>, Kokkos::Experimental::OpenMPTarget>::execute'
7.      /ascldap/users/crtrott/Kokkos/kokkos/core/src/OpenMPTarget/Kokkos_OpenMPTarget_Parallel.hpp:575:8: instantiating class definition 'Kokkos::Impl::ParallelReduceSpecialize<Test::TestReducers<unsigned long, Kokkos::Experimental::OpenMPTarget>::SumFunctor, Kokkos::RangePolicy<Kokkos::Experimental::OpenMPTarget>, Kokkos::Sum<unsigned long, Kokkos::HostSpace>, unsigned long *, unsigned long, 0, 1>'
8.      /ascldap/users/crtrott/Kokkos/kokkos/core/src/OpenMPTarget/Kokkos_OpenMPTarget_Parallel.hpp:575:8: LLVM IR generation of declaration 'Kokkos::Impl::ParallelReduceSpecialize'
 #0 0x0000000001cb8e4a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x1cb8e4a)
 #1 0x0000000001cb6d54 llvm::sys::RunSignalHandlers() (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x1cb6d54)
 #2 0x0000000001cb6e82 SignalHandler(int) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x1cb6e82)
 #3 0x00007ffa74d475d0 __restore_rt (/lib64/libpthread.so.0+0xf5d0)
 #4 0x00007ffa73ae7207 raise (/lib64/libc.so.6+0x36207)
 #5 0x00007ffa73ae88f8 abort (/lib64/libc.so.6+0x378f8)
 #6 0x00007ffa73ae0026 __assert_fail_base (/lib64/libc.so.6+0x2f026)
 #7 0x00007ffa73ae00d2 (/lib64/libc.so.6+0x2f0d2)
 #8 0x0000000003a747eb (anonymous namespace)::ItaniumMangleContextImpl::mangleCXXName(clang::NamedDecl const*, llvm::raw_ostream&) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3a747eb)
 #9 0x0000000001f57d38 getMangledNameImpl(clang::CodeGen::CodeGenModule const&, clang::GlobalDecl, clang::NamedDecl const*, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x1f57d38)
#10 0x0000000001f65a2f clang::CodeGen::CodeGenModule::getMangledName(clang::GlobalDecl) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x1f65a2f)
#11 0x000000000218ffa0 clang::CodeGen::CGOpenMPRuntime::emitTargetFunctions(clang::GlobalDecl) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x218ffa0)
#12 0x0000000001f7b3c9 clang::CodeGen::CodeGenModule::EmitGlobal(clang::GlobalDecl) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x1f7b3c9)
#13 0x000000000288e2c3 (anonymous namespace)::CodeGeneratorImpl::HandleTagDeclDefinition(clang::TagDecl*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x288e2c3)
#14 0x0000000002884135 clang::BackendConsumer::HandleTagDeclDefinition(clang::TagDecl*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x2884135)
#15 0x0000000003739298 clang::Sema::InstantiateClass(clang::SourceLocation, clang::CXXRecordDecl*, clang::CXXRecordDecl*, clang::MultiLevelTemplateArgumentList const&, clang::TemplateSpecializationKind, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3739298)
#16 0x000000000374bd2b clang::Sema::InstantiateClassTemplateSpecialization(clang::SourceLocation, clang::ClassTemplateSpecializationDecl*, clang::TemplateSpecializationKind, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x374bd2b)
#17 0x0000000003787bff clang::Sema::RequireCompleteTypeImpl(clang::SourceLocation, clang::QualType, clang::Sema::TypeDiagnoser*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3787bff)
#18 0x0000000003787fa5 clang::Sema::RequireCompleteType(clang::SourceLocation, clang::QualType, clang::Sema::TypeDiagnoser&) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3787fa5)
#19 0x0000000003249429 clang::Sema::RequireCompleteDeclContext(clang::CXXScopeSpec&, clang::DeclContext*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3249429)
#20 0x00000000034e198b LookupMemberExprInRecord(clang::Sema&, clang::LookupResult&, clang::Expr*, clang::RecordType const*, clang::SourceLocation, bool, clang::CXXScopeSpec&, bool, clang::SourceLocation, clang::TypoExpr*&) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x34e198b)
#21 0x00000000034e82b2 clang::Sema::BuildMemberReferenceExpr(clang::Expr*, clang::QualType, clang::SourceLocation, bool, clang::CXXScopeSpec&, clang::SourceLocation, clang::NamedDecl*, clang::DeclarationNameInfo const&, clang::TemplateArgumentListInfo const*, clang::Scope const*, clang::Sema::ActOnMemberAccessExtraArgs*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x34e82b2)
#22 0x000000000374230d clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformCXXDependentScopeMemberExpr(clang::CXXDependentScopeMemberExpr*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x374230d)
#23 0x0000000003722bd1 clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformExpr(clang::Expr*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3722bd1)
#24 0x00000000037289a8 clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformCallExpr(clang::CallExpr*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x37289a8)
#24 0x00000000037289a8 clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformCallExpr(clang::CallExpr*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x37289a8)
#25 0x0000000003722ce5 clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformExpr(clang::Expr*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3722ce5)
#26 0x0000000003748444 clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformStmt(clang::Stmt*, clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::StmtDiscardKind) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3748444)
#27 0x00000000037498d3 clang::TreeTransform<(anonymous namespace)::TemplateInstantiator>::TransformCompoundStmt(clang::CompoundStmt*, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x37498d3)
#28 0x000000000374bc33 clang::Sema::SubstStmt(clang::Stmt*, clang::MultiLevelTemplateArgumentList const&) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x374bc33)
#29 0x0000000003763d8c clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3763d8c)
#30 0x0000000003762766 clang::Sema::PerformPendingInstantiations(bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3762766)
#31 0x0000000003763b17 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3763b17)
#32 0x0000000003762766 clang::Sema::PerformPendingInstantiations(bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3762766)
#33 0x0000000003763b17 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3763b17)
#34 0x0000000003762766 clang::Sema::PerformPendingInstantiations(bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3762766)
#35 0x0000000003763b17 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3763b17)
#36 0x0000000003762766 clang::Sema::PerformPendingInstantiations(bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3762766)
#37 0x0000000003763b17 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3763b17)
#38 0x0000000003762766 clang::Sema::PerformPendingInstantiations(bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3762766)
#39 0x000000000323115d clang::Sema::ActOnEndOfTranslationUnitFragment(clang::Sema::TUFragmentKind) (.part.1299) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x323115d)
#40 0x00000000032312a7 clang::Sema::ActOnEndOfTranslationUnit() (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x32312a7)
#41 0x0000000003111a8e clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3111a8e)
#42 0x0000000003107d89 clang::ParseAST(clang::Sema&, bool, bool) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x3107d89)
#43 0x000000000288b1a8 clang::CodeGenAction::ExecuteAction() (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x288b1a8)
#44 0x00000000023406b9 clang::FrontendAction::Execute() (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x23406b9)
#45 0x0000000002309019 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x2309019)
#46 0x00000000023e0405 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0x23e0405)
#47 0x0000000000b14ba7 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0xb14ba7)
#48 0x0000000000ab24f8 main (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0xab24f8)
#49 0x00007ffa73ad33d5 __libc_start_main (/lib64/libc.so.6+0x223d5)
#50 0x0000000000b10475 _start (/home/projects/x86-64-naples/rocm/aomp_0.7-6/bin/clang-9+0xb10475)
clang-9: error: unable to execute command: Aborted
clang-9: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 9.0.1 (/root/git/aomp/llvm-project/clang 2d8529565d71c80db00079ea80d37c5814f9ff25)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /ascldap/users/projects/x86-64-naples/rocm/aomp_0.7-6/bin
clang-9: note: diagnostic msg: PLEASE open a git issue at https://github.com/ROCm-Developer-Tools/aomp with a detailed description of this problem.
clang-9: note: diagnostic msg: Error generating preprocessed source(s).
make[2]: *** [core/unit_test/CMakeFiles/KokkosCore_UnitTest_OpenMPTarget.dir/openmptarget/TestOpenMPTarget_Reducers_b.cpp.o] Error 254

Reproduction:

git clone https://github.com/crtrott/kokkos
cd kokkos
git checkout 92713d7a589
mkdir build
cd build
cmake -DKokkos_ENABLE_OPENMPTARGET=ON -DKokkos_ENABLE_TESTS=ON -DKokkos_ARCH_VEGA=ON -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_FLAGS=--gcc-toolchain=/home/projects/x86-64/gcc/6.4.0 ..
cd core/unit_test/
make -j 24

AMDGPU: snap-mp4 fails execution

The snap-mp4 app (https://github.com/AMDComputeLibraries/OpenMPApps/tree/AOMP-0.5/snap-mp4) whether soft hangs or gives "Memory access fault by GPU node" for any AMDGPU (gfx803, gfx900, gfx906).

Memory access fault by GPU node-2 (Agent handle: 0xd13780) on address 0x100000000. Reason: Page not present or supervisor privilege.

atmi doesn't build with default aomp_clone options

Fails on:
runtime/core/CMakeFiles/atmi_runtime.dir/__/interop/hsa/atmi_interop_hsa.cpp.o
/home/amd/aomp/atmi/src/runtime/core/system.cpp:20:10: fatal error: amd_hostcall.h: No such file or directory
#include "amd_hostcall.h"
^~~~~~~~~~~~~~~~

A couple of other source files also fail to find the header. It's in aomp-extras/hostcall/lib/include/, which is now built after atmi. Assigning to Ashwin in the hope you remember the dependency tree here. I think atmi needs aomp-extras and vice versa.

Related to this, I think including header files from the install dir while building is a serious hazard. It means we have to remember to delete the install directory in order to get an actually clean build.

Crash when compiling miniqmc

aomp-0.7 has this issue. 0.6.5 was fine.

clang-9: /home/yeluo/git/aomp/llvm-project/llvm/lib/IR/Instructions.cpp:1349: void llvm::StoreInst::AssertOK(): Assertion `getOperand(0)->getType() == cast<PointerType>(getOperand(1)->getType())->getElementType() && "Ptr must be a pointer to Val type!"' failed.

reproducer

git clone https://github.com/ye-luo/miniqmc
cd miniqmc/build
cmake -DCMAKE_CXX_COMPILER=/home/yeluo/rocm/aomp_0.7-0/bin/clang++ \
-DENABLE_OFFLOAD=1 -DOFFLOAD_TARGET=amdgcn-amd-amdhsa \
-DCMAKE_CXX_FLAGS="-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906" \
..
make -j15

Assertion `!LocalDeclMap.count(VD) && "Decl already exists in LocalDeclMap!"' failed.

I encountered an error when trying compile a C++ code for OmpenMP 4 offloading. The error is reported in the same place (at a compound statement), like in #29 , but this is related to the host compiler, as it occurs independently of the chosen offload target.

AOMP version 0.7
Call to clang++ (host) generated by cmake:
/home/kelling/rocm/aomp/bin/clang++ -DALPAKA_ACC_CPU_BT_OMP4_ENABLED -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -DALPAKA_DEBUG=0 -DBOOST_ALL_NO_LIB -I/home/kelling/checkout/alpaka/include -isystem /home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/boost-1.70.0-acaxl7cie57ccno6sjxmfgn2usv3243b/include --target=x86_64-pc-linux-gnu -fopenmp-targets=x86_64-pc-linux-gnu -Xopenmp-target=x86_64-pc-linux-gnu -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-disabled-macro-expansion -Wno-global-constructors -Wno-padded -fopenmp=libomp -fopenmp-version=40 -o main.cpp.o -c /home/kelling/checkout/alpaka/example/vectorAdd/src/main.cpp --save-temps
- same error with HSA offloading (error occurs before #29 can in this changed code)
compiler output

clang-8: /home/kelling/git/aomp/clang/lib/CodeGen/CodeGenFunction.h:4150: void clang::CodeGen::CodeGenFunction::setAddrOfLocalVar(const clang::VarDecl*, clang::CodeGen::Address): Assertion `!LocalDeclMap.count(VD) && "Decl already exists in LocalDeclMap!"' failed.
Stack dump:
0.	Program arguments: /home/kelling/rocm/aomp_0.6-6/bin/clang-8 -cc1 -triple x86_64-pc-linux-gnu -emit-llvm-bc -emit-llvm-uselists -save-temps=cwd -disable-free -main-file-name main.cpp -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file /home/kelling/checkout/alpaka/buildCpu/bugHost/main.cpp.gcno -resource-dir /home/kelling/rocm/aomp_0.6-6/lib/clang/8.0.1 -Wno-c++98-compat-pedantic -Wno-disabled-macro-expansion -Wno-global-constructors -Wno-padded -fdeprecated-macro -fdebug-compilation-dir /home/kelling/checkout/alpaka/buildCpu/bugHost -ferror-limit 19 -fmessage-length 0 -fopenmp -fopenmp-version=40 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -disable-llvm-passes -o main-host-x86_64-pc-linux-gnu.bc -x c++-cpp-output main-host-x86_64-pc-linux-gnu.ii -fopenmp-targets=x86_64-pc-linux-gnu -faddrsig 
1.	<eof> parser at end of file
2.	Per-file LLVM IR generation
3.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:86:33: Generating code for declaration 'alpaka::kernel::TaskKernelCpuOmp4<std::integral_constant<unsigned long, 1>, unsigned long, VectorAddKernel, unsigned int *, unsigned int *, unsigned int *, unsigned long>::operator()'
4.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:140:2: LLVM IR generation of compound statement ('{}')
 #0 0x00000000019aa3fa llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x19aa3fa)
 #1 0x00000000019a8274 llvm::sys::RunSignalHandlers() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x19a8274)
 #2 0x00000000019a83d5 SignalHandler(int) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x19a83d5)
 #3 0x00007fbdef0d0890 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12890)
 #4 0x00007fbdedb79e97 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x3ee97)
 #5 0x00007fbdedb7b801 abort (/lib/x86_64-linux-gnu/libc.so.6+0x40801)
 #6 0x00007fbdedb6b39a (/lib/x86_64-linux-gnu/libc.so.6+0x3039a)
 #7 0x00007fbdedb6b412 (/lib/x86_64-linux-gnu/libc.so.6+0x30412)
 #8 0x0000000001c2d0ab clang::CodeGen::CodeGenFunction::GenerateOpenMPCapturedStmtFunction(clang::CapturedStmt const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c2d0ab)
 #9 0x0000000001ee6b22 emitParallelOrTeamsOutlinedFunction(clang::CodeGen::CodeGenModule&, clang::OMPExecutableDirective const&, clang::CapturedStmt const*, clang::VarDecl const*, clang::OpenMPDirectiveKind, llvm::StringRef, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ee6b22)
#10 0x0000000001ee6c96 clang::CodeGen::CGOpenMPRuntime::emitTeamsOutlinedFunction(clang::OMPExecutableDirective const&, clang::VarDecl const*, clang::OpenMPDirectiveKind, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ee6c96)
#11 0x0000000001c1f624 emitCommonOMPTeamsDirective(clang::CodeGen::CodeGenFunction&, clang::OMPExecutableDirective const&, clang::OpenMPDirectiveKind, clang::CodeGen::RegionCodeGenTy const&) (.constprop.1495) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c1f624)
#12 0x0000000001c1fed4 clang::CodeGen::CodeGenFunction::EmitOMPTeamsDirective(clang::OMPTeamsDirective const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c1fed4)
#13 0x0000000001bff020 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff020)
#14 0x0000000001bff343 clang::CodeGen::CodeGenFunction::EmitCompoundStmtWithoutScope(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff343)
#15 0x0000000001bff658 clang::CodeGen::CodeGenFunction::EmitCompoundStmt(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff658)
#16 0x0000000001c025e3 clang::CodeGen::CodeGenFunction::EmitSimpleStmt(clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c025e3)
#17 0x0000000001bfec40 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bfec40)
#18 0x0000000001c230a9 emitTargetRegion(clang::CodeGen::CodeGenFunction&, clang::OMPTargetDirective const&, clang::CodeGen::PrePostActionTy&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c230a9)
#19 0x0000000001eb7b3b clang::CodeGen::RegionCodeGenTy::operator()(clang::CodeGen::CodeGenFunction&) const (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1eb7b3b)
#20 0x0000000001eb7bbf (anonymous namespace)::CGOpenMPRegionInfo::EmitBody(clang::CodeGen::CodeGenFunction&, clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1eb7bbf)
#21 0x0000000001c2e07b clang::CodeGen::CodeGenFunction::GenerateOpenMPCapturedStmtFunction(clang::CapturedStmt const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c2e07b)
#22 0x0000000001ee71c0 clang::CodeGen::CGOpenMPRuntime::emitTargetOutlinedFunctionHelper(clang::OMPExecutableDirective const&, llvm::StringRef, llvm::Function*&, llvm::Constant*&, bool, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ee71c0)
#23 0x0000000001c176d0 emitCommonOMPTargetDirective(clang::CodeGen::CodeGenFunction&, clang::OMPExecutableDirective const&, clang::CodeGen::RegionCodeGenTy const&) (.constprop.1506) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c176d0)
#24 0x0000000001c17ac2 clang::CodeGen::CodeGenFunction::EmitOMPTargetDirective(clang::OMPTargetDirective const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c17ac2)
#25 0x0000000001bff0d0 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff0d0)
#26 0x0000000001bff343 clang::CodeGen::CodeGenFunction::EmitCompoundStmtWithoutScope(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff343)
#27 0x0000000001c38dd1 clang::CodeGen::CodeGenFunction::EmitFunctionBody(clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c38dd1)
#28 0x0000000001c48329 clang::CodeGen::CodeGenFunction::GenerateCode(clang::GlobalDecl, llvm::Function*, clang::CodeGen::CGFunctionInfo const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c48329)
#29 0x0000000001c9aac6 clang::CodeGen::CodeGenModule::EmitGlobalFunctionDefinition(clang::GlobalDecl, llvm::GlobalValue*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c9aac6)
#30 0x0000000001c985a5 clang::CodeGen::CodeGenModule::EmitGlobalDefinition(clang::GlobalDecl, llvm::GlobalValue*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c985a5)
#31 0x0000000001c9efee clang::CodeGen::CodeGenModule::EmitDeferred() (.localalias.7282) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c9efee)
#32 0x0000000001c9ef70 clang::CodeGen::CodeGenModule::EmitDeferred() (.localalias.7282) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c9ef70)
#33 0x0000000001c9ef70 clang::CodeGen::CodeGenModule::EmitDeferred() (.localalias.7282) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c9ef70)
#34 0x0000000001c9f141 clang::CodeGen::CodeGenModule::Release() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c9f141)
#35 0x0000000002565892 (anonymous namespace)::CodeGeneratorImpl::HandleTranslationUnit(clang::ASTContext&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2565892)
#36 0x00000000025640e5 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x25640e5)
#37 0x0000000002f5b911 clang::ParseAST(clang::Sema&, bool, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2f5b911)
#38 0x0000000002563330 clang::CodeGenAction::ExecuteAction() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2563330)
#39 0x0000000002082d3e clang::FrontendAction::Execute() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2082d3e)
#40 0x00000000020466c6 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x20466c6)
#41 0x0000000002139ce3 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2139ce3)
#42 0x0000000000893d90 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x893d90)
#43 0x000000000082bcd6 main (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x82bcd6)
#44 0x00007fbdedb5cb97 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b97)
#45 0x00000000008900ba _start (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x8900ba)
clang-8: error: unable to execute command: Aborted (core dumped)
error: The provided host compiler IR file 'main-host-x86_64-pc-linux-gnu.bc' is required to generate code for OpenMP target regions but cannot be found.
clang-8: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 8.0.1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /home/kelling/rocm/aomp/bin
clang-8: note: diagnostic msg: PLEASE open a git issue at https://github.com/ROCm-Developer-Tools/aomp with a detailed description of this problem.
clang-8: note: diagnostic msg: Error generating preprocessed source(s).

Below you the diff between this code and the code not triggering this error in #29 .
- /include/alpaka/kernel/TaskKernelCpuOmp4.hpp(middle) is where the error is triggered
- The declaration acc::AccCpuOmp4<TDim, TIdx> acc(...) is withing a target region, the class is defined in /include/alpaka/acc/AccCpuOmp4.hpp(top) and has been changed
- the class is not derived from a new class having one private member, defined in alpaka/workdiv/WorkDivOmp4BuiltIn.hpp(bottom)

diff --git a/include/alpaka/acc/AccCpuOmp4.hpp b/include/alpaka/acc/AccCpuOmp4.hpp
index 4d773c37..d6381fa6 100644
--- a/include/alpaka/acc/AccCpuOmp4.hpp
+++ b/include/alpaka/acc/AccCpuOmp4.hpp
@@ -16,7 +16,7 @@
 #endif
 
 // Base classes.
-#include <alpaka/workdiv/WorkDivMembers.hpp>
+#include <alpaka/workdiv/WorkDivOmp4BuiltIn.hpp>
 #include <alpaka/idx/gb/IdxGbOmp4BuiltIn.hpp>
 #include <alpaka/idx/bt/IdxBtOmp4BuiltIn.hpp>
 #include <alpaka/atomic/AtomicStdLibLock.hpp>
@@ -66,7 +66,7 @@ namespace alpaka
             typename TDim,
             typename TIdx>
         class AccCpuOmp4 final :
-            public workdiv::WorkDivMembers<TDim, TIdx>,
+            public workdiv::WorkDivOmp4BuiltIn<TDim, TIdx>,
             public idx::gb::IdxGbOmp4BuiltIn<TDim, TIdx>,
             public idx::bt::IdxBtOmp4BuiltIn<TDim, TIdx>,
             public atomic::AtomicHierarchy<
@@ -92,12 +92,10 @@ namespace alpaka
 
         private:
             //-----------------------------------------------------------------------------
-            template<
-                typename TWorkDiv>
             ALPAKA_FN_HOST AccCpuOmp4(
-                TWorkDiv const & workDiv,
+                TIdx const & threadElemExtent,
                 TIdx const & blockSharedMemDynSizeBytes) :
-                    workdiv::WorkDivMembers<TDim, TIdx>(workDiv),
+                    workdiv::WorkDivOmp4BuiltIn<TDim, TIdx>(threadElemExtent),
                     idx::gb::IdxGbOmp4BuiltIn<TDim, TIdx>(),
                     idx::bt::IdxBtOmp4BuiltIn<TDim, TIdx>(),
                     atomic::AtomicHierarchy<
diff --git a/include/alpaka/kernel/TaskKernelCpuOmp4.hpp b/include/alpaka/kernel/TaskKernelCpuOmp4.hpp
index c72f68b9..7c2b02aa 100644
--- a/include/alpaka/kernel/TaskKernelCpuOmp4.hpp
+++ b/include/alpaka/kernel/TaskKernelCpuOmp4.hpp
@@ -118,6 +118,8 @@ namespace alpaka
                 TIdx const gridBlockCount(gridBlockExtent.prod());
                 // The number of threads in a block.
                 TIdx const blockThreadCount(blockThreadExtent.prod());
+                // The number of elements in a thread. (to avoid mapping vec to target)
+                TIdx const threadElemCount(threadElemExtent[0u]);
 
                 // We have to make sure, that the OpenMP runtime keeps enough threads for executing a block in parallel.
                 auto const maxOmpThreadCount(::omp_get_max_threads());
@@ -147,7 +149,7 @@ namespace alpaka
                         }
 #endif
                         acc::AccCpuOmp4<TDim, TIdx> acc(
-                            *static_cast<workdiv::WorkDivMembers<TDim, TIdx> const *>(this),
+                            threadElemCount,
                             blockSharedMemDynSizeBytes);
 
                         #pragma omp distribute
diff --git a/include/alpaka/workdiv/WorkDivOmp4BuiltIn.hpp b/include/alpaka/workdiv/WorkDivOmp4BuiltIn.hpp
new file mode 100644
index 00000000..3634acac
--- /dev/null
+++ b/include/alpaka/workdiv/WorkDivOmp4BuiltIn.hpp
@@ -0,0 +1,163 @@
+/* Copyright 2019 Axel Huebl, Benjamin Worpitz
+ *
+ * This file is part of Alpaka.
+ *
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, You can obtain one at http://mozilla.org/MPL/2.0/.
+ */
+
+#pragma once
+
+#ifdef ALPAKA_ACC_CPU_BT_OMP4_ENABLED
+
+#if _OPENMP < 201307
+    #error If ALPAKA_ACC_CPU_BT_OMP4_ENABLED is set, the compiler has to support OpenMP 4.0 or higher!
+#endif
+
+#include <alpaka/workdiv/Traits.hpp>
+#include <alpaka/idx/Traits.hpp>
+
+#include <alpaka/core/Omp4.hpp>
+#include <alpaka/core/Unused.hpp>
+#include <alpaka/vec/Vec.hpp>
+
+namespace alpaka
+{
+    namespace workdiv
+    {
+        //#############################################################################
+        //! The GPU CUDA accelerator work division.
+        template<
+            typename TDim,
+            typename TIdx>
+        class WorkDivOmp4BuiltIn
+        {
+        public:
+            using WorkDivBase = WorkDivOmp4BuiltIn;
+
+            //-----------------------------------------------------------------------------
+            WorkDivOmp4BuiltIn(
+                vec::Vec<TDim, TIdx> const & threadElemExtent) :
+                    m_threadElemExtent(threadElemExtent)
+            {}
+            //-----------------------------------------------------------------------------
+            WorkDivOmp4BuiltIn(WorkDivOmp4BuiltIn const &) = delete;
+            //-----------------------------------------------------------------------------
+            WorkDivOmp4BuiltIn(WorkDivOmp4BuiltIn &&) = delete;
+            //-----------------------------------------------------------------------------
+            auto operator=(WorkDivOmp4BuiltIn const &) -> WorkDivOmp4BuiltIn & = delete;
+            //-----------------------------------------------------------------------------
+            auto operator=(WorkDivOmp4BuiltIn &&) -> WorkDivOmp4BuiltIn & = delete;
+            //-----------------------------------------------------------------------------
+            /*virtual*/ ~WorkDivOmp4BuiltIn() = default;
+
+        public:
+            // \TODO: Optimize! Add WorkDivCudaBuiltInNoElems that has no member m_threadElemExtent as well as AccGpuCudaRtNoElems.
+            // Use it instead of AccGpuCudaRt if the thread element extent is one to reduce the register usage.
+            vec::Vec<TDim, TIdx> const & m_threadElemExtent;
+        };
+    }
+
+    namespace dim
+    {
+        namespace traits
+        {
+            //#############################################################################
+            //! The GPU CUDA accelerator work division dimension get trait specialization.
+            template<
+                typename TDim,
+                typename TIdx>
+            struct DimType<
+                workdiv::WorkDivOmp4BuiltIn<TDim, TIdx>>
+            {
+                using type = TDim;
+            };
+        }
+    }
+    namespace idx
+    {
+        namespace traits
+        {
+            //#############################################################################
+            //! The GPU CUDA accelerator work division idx type trait specialization.
+            template<
+                typename TDim,
+                typename TIdx>
+            struct IdxType<
+                workdiv::WorkDivOmp4BuiltIn<TDim, TIdx>>
+            {
+                using type = TIdx;
+            };
+        }
+    }
+    namespace workdiv
+    {
+        namespace traits
+        {
+            //#############################################################################
+            //! The GPU CUDA accelerator work division grid block extent trait specialization.
+            template<
+                typename TDim,
+                typename TIdx>
+            struct GetWorkDiv<
+                WorkDivOmp4BuiltIn<TDim, TIdx>,
+                origin::Grid,
+                unit::Blocks>
+            {
+                //-----------------------------------------------------------------------------
+                //! \return The number of blocks in each dimension of the grid.
+                static auto getWorkDiv(
+                    WorkDivOmp4BuiltIn<TDim, TIdx> const & workDiv)
+                -> vec::Vec<TDim, TIdx>
+                {
+                    alpaka::ignore_unused(workDiv);
+                    return vec::Vec<TDim, TIdx>(static_cast<TIdx>(omp_get_num_teams()));
+                }
+            };
+
+            //#############################################################################
+            //! The GPU CUDA accelerator work division block thread extent trait specialization.
+            template<
+                typename TDim,
+                typename TIdx>
+            struct GetWorkDiv<
+                WorkDivOmp4BuiltIn<TDim, TIdx>,
+                origin::Block,
+                unit::Threads>
+            {
+                //-----------------------------------------------------------------------------
+                //! \return The number of threads in each dimension of a block.
+                static auto getWorkDiv(
+                    WorkDivOmp4BuiltIn<TDim, TIdx> const & workDiv)
+                -> vec::Vec<TDim, TIdx>
+                {
+                    alpaka::ignore_unused(workDiv);
+                    return vec::Vec<TDim, TIdx>(static_cast<TIdx>(omp_get_num_threads()));
+                }
+            };
+
+            //#############################################################################
+            //! The GPU CUDA accelerator work division thread element extent trait specialization.
+            template<
+                typename TDim,
+                typename TIdx>
+            struct GetWorkDiv<
+                WorkDivOmp4BuiltIn<TDim, TIdx>,
+                origin::Thread,
+                unit::Elems>
+            {
+                //-----------------------------------------------------------------------------
+                //! \return The number of blocks in each dimension of the grid.
+                static auto getWorkDiv(
+                    WorkDivOmp4BuiltIn<TDim, TIdx> const & workDiv)
+                -> vec::Vec<TDim, TIdx>
+                {
+                    return workDiv.m_threadElemExtent;
+                }
+            };
+        }
+    }
+}
+
+#endif

Github does not allow me to upload the temporary files, so I will send by by mail.

Problems with reductions on GPU (not CPU).

Dear Community,

the following reduction from OpenMP 4.5 works well on Haswell but crashes on Vega20 during compilation. I attached a minimal reproducible example below (no meaningful data required for replication). Is this type of reduction supposed to work already in aomp 0.63? If not, which alternative would currently work for this example without slowing the computation down too much? Unfortunately, I can neither move the reduction clause nor the team clause to lower levels of the loops without creating either a data race, or slowing the computation down considerably. Any help would be greatly appreciated!

#include <stdio.h>
#include <iostream>

int main () {
int no = 25;
int nc = 1000000;
int nv = 30;
int nos = (no-1)*(no-1);
double *A_reduction = new double[nos]();
double *PA = new double[nv*nos]();
double *CA = new double[nc*nv]();

#pragma omp target data map(to:CA[0:(nc*nv)],PA[0:(nv*nos)]) map(A_reduction[0:nos])
#pragma omp target teams distribute parallel for reduction(+:A_reduction[:nos]) collapse(2)
//#pragma omp parallel for reduction(+:A_reduction[:nos]) collapse(2) //works!
for (int op = 0; op < no-1; ++op) {
for (int of = 0; of < no-1; ++of) {
for (int c=0; c < nc; ++c) {
double pc = 1;
for (int v = 0; v < nv; ++v) {
pc *= (CA[v+c*nv]<0)+CA[v+c*nv]*PA[v+of*nv+op*(no-1)*nv];}
A_reduction[of+op*(no-1)] += pc;}}}}

I used the following compilation flags for the GPU and the CPU versions:

/opt/rocm/aomp/bin/clang++ -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 reductionproblem.cpp

Thank you very much for your attention!

Generic/SPMD enums are inverted between host runtime and deviceRTL

From Ashwin:
I had noticed a “bug” in the Generic/SPMD modes a while ago, but how can this work fine for NVPTX? That is the biggest mystery. This observation fell off my radar until today’s meeting. Hope this helps in resolving your issue, but the mystery has always been that how can it continue to work for NVIDIA?

In plugins/cuda/src/rtl.cpp and in plugins/hsa/src/rtl.cpp
65 enum ExecutionModeType {
66 SPMD, // constructors, destructors,
67 // combined constructs (teams distribute parallel for [simd])
68 GENERIC, // everything else
69 NONE
70 };

So, in the host runtime, SPMD is 0 and GENERIC is 1.
However, in the deviceRTL, it is switched. Generic is 0 and Spmd is 1.

14 ////////////////////////////////////////////////////////////////////////////////
15 // Execution Parameters
16 ////////////////////////////////////////////////////////////////////////////////
17 enum ExecutionMode {
18 Generic = 0x00u,
19 Spmd = 0x01u,
20 ModeMask = 0x01u,
21 };

Implementation of smid in aomp-extras is off by one

Discovered by comparison with hip/hcc_detail which has a different implementation of smid.

See commit in hip repo:

commit ff74babedfd74b671166d3db3420bc091b1938e8 (HEAD)
Author: Aaron Enye Shi <[email protected]>
Date:   Wed May 22 19:20:09 2019 +0000

    Fix bug in __smid not setting correct size
    
    The SZ field should minus by 1 since SIZE range is 1..32. Also add comments that results may vary.

Plan to resolve by matching the hip implementation.

AMDGPU: gfx803: comd-mp4 fails execution

The comd-mp4 app (https://github.com/AMDComputeLibraries/OpenMPApps/tree/AOMP-0.5/comd-mp4) soft hangs while executed on gfx803 (gfx900 and gfx906 are fine).

More register used when multiple target regions are compiled together

The source code I'm using has multiple offload regions in different member functions of a class.
If I enable individual target region and comment the other target pragma
Kernel 1 only

      NumSGPRs:        90
      NumVGPRs:        256
      NumSpilledVGPRs: 158

kernel 2 only

      NumSGPRs:        86
      NumVGPRs:        164

If I enabled both offload regions.
kernel 1

      NumSGPRs:        90
      NumVGPRs:        256
      NumSpilledVGPRs: 160

kernal 2

      NumSGPRs:        86
      NumVGPRs:        256
      NumSpilledVGPRs: 160

The amount of needed vector register + spill is more than individually ones.
Both kernels are compiled from independent target regions. This behaviour seems very strange.

host services race condition

cd smoke/firstprivate2
make clean
AOMP=/opt/rocm/aomp make run
gdb ./firstprivate2

run it repeatedly till it fails

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff446e700 (LWP 4773)]
[New Thread 0x7ffedebfe780 (LWP 4774)]
[New Thread 0x7ffede7fd800 (LWP 4775)]
Thread num: 0 P_VAL: 0
Thread num: 1 P_VAL: 1
Thread num: 2 P_VAL: 2
[New Thread 0x7ffedcbff700 (LWP 4776)]
[New Thread 0x7ffed75ff700 (LWP 4777)]
terminate called without an active exception

Thread 1 "firstprivate2" received signal SIGABRT, Aborted.
0x00007ffff6fad428 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

bt
#0 0x00007ffff6fad428 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff6faf02a in __GI_abort () at abort.c:89
#2 0x00007ffff75ee84d in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff75ec6b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff75ec701 in std::terminate() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff57120af in amd_hostcall_launch_consumer ()
from /opt/rocm/aomp_0.6-3/lib/libamd_hostcall.so
#6 0x00007ffff5713190 in atmi_hostcall_assign_buffer ()
from /opt/rocm/aomp_0.6-3/lib/libamd_hostcall.so
#7 0x00007ffff5be9c10 in dispatch_task(atl_task_s*) ()
from /opt/rocm/aomp/lib/libatmi_runtime.so
#8 0x00007ffff5bea206 in try_dispatch(atl_task_s*, void**, char) ()
from /opt/rocm/aomp/lib/libatmi_runtime.so
#9 0x00007ffff5beb78e in atl_trylaunch_kernel(atmi_lparm_s const*, atl_task_s*, unsigned int, void**) () from /opt/rocm/aomp/lib/libatmi_runtime.so
#10 0x00007ffff5e11b96 in __tgt_rtl_run_target_team_region ()

Elide varargs in IR

Allocate a buffer in caller, write arguments to it as if it was the stack. Pass a pointer to that buffer.

In callee, read from that buffer instead of the stack.

Might need to be in clang. Hoping for a target-agnostic rewrite of use to any arch that wants a functional varargs without writing the backend lowering.

AMDGPU: gfx906: usage of omp_get_device_num() gives warning

Compilation of a code with omp_get_device_num() gives warning about implicit declaration of the function, while compiled on gfx906 only (e.g. https://github.com/ROCm-Developer-Tools/aomp/tree/master/test/smoke/devices).
Although this is only a warning, the test can be compiled and executed.

devices.c:19:25: warning: implicit declaration of function 'omp_get_device_num' is invalid in C99 [-Wimplicit-function-declaration]

Libomptarget fatal error 1: default offloading policy must be switched to mandatory or disabled

The first two cases pass, but the last two fail... Thanks.

rm -f reduction obin *.i *.ii *.bc .lk a.out- *.ll *.s *.o *.cubin
/nfs/home/test/lib/aomp_0.7-7//bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targ ets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 reduction.c -o reducti on
./reduction
The result is correct on target = 499999500000!
Success!
rm -f veccopy obin *.i *.ii *.bc .lk a.out- *.ll *.s *.o *.cubin
/nfs/home/test/lib/aomp_0.7-7//bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targ ets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 veccopy.c -o veccopy
./veccopy
Success
rm -f vmulsum obin *.i *.ii *.bc .lk a.out- *.ll *.s *.o *.cubin
/nfs/home/test/lib/aomp_0.7-7//bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targ ets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 main.c vsum.c vmul.c -o vmulsum
./vmulsum
Libomptarget fatal error 1: default offloading policy must be switched to mandatory or disabled
make: *** [run] Error 1
rm -f vmul_template obin *.i *.ii *.bc .lk a.out- *.ll *.s *.o *.cubin
/nfs/home/test/lib/aomp_0.7-7//bin/clang++ -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-ta rgets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 vmul_template.cpp -o vmul_template
./vmul_template
Libomptarget fatal error 1: default offloading policy must be switched to mandatory or disabled
make: *** [run] Error 1

Replace (most of?) clone_aomp.sh with git submodules

I think we can achieve a reasonable fraction of the clone_aomp.sh capability by add submodules to the aomp git repo, along with a config file. This would be an alternative to repo.

[submodule "aomp-extras"]
        path = aomp-extras
        url = [email protected]:ROCm-Developer-Tools/aomp-extras.git
        branch = 1.0-0
        update = rebase

[submodule "rocminfo"]
        path = rocminfo
        url = https://github.com/RadeonOpenCompute/rocminfo.git
        branch = master
        update = checkout

This seems to result in subdirectories that are themselves self contained git repos (e.g. I was able to check the log, branch, raise a pull request etc), albeit with a tendency to start out as a detached head commit from the top of the named branch. The above update = rebase means you only have to change to the dev branch once, subsequent updates seem to do the right thing.

The submodule functionality appears to be designed for third party components which change infrequently. This means that reverting to an earlier aomp revision would then yield exactly the right versions of the dependencies, which I think is an increase in capability from the clone script. It's not very well set up for post checkout patching.

We could add the submodules configuration file and also keep the clone script. That way git clone --recursive will use the former and git clone will yield the same thing it does today.

Submodules may prove unsuitable for the repos we actively develop, in which case we may be best served by using git to manage the third party components and bash to handle the others.

AMDGPU: gfx900/gfx906: test/smoke/teams fails execution

The teams test (https://github.com/ROCm-Developer-Tools/aomp/tree/master/test/smoke/teams) gives "Memory access fault by GPU node" during execution. This happens for gfx900 and gfx906, interestingly enough gfx803 is fine.

Memory access fault by GPU node-2 (Agent handle: 0xd13780) on address 0x100000000. Reason: Page not present or supervisor privilege

isa<X>(Val) && "cast<Ty>() argument of incompatible type! failed for target=amdgcn-amd-amdhsa -march=gfx900

I encountered an error when trying compile a C++ code for offloading to amdgcn-amd-amdhsa.

AOMP version 0.7
Call to clang++ (host) generated by cmake:
/home/kelling/rocm/aomp/bin/clang++ -DALPAKA_ACC_CPU_BT_OMP4_ENABLED -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -DALPAKA_DEBUG=1 -DBOOST_ALL_NO_LIB -I/home/kelling/checkout/alpaka/include -isystem /home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/boost-1.70.0-acaxl7cie57ccno6sjxmfgn2usv3243b/include --target=x86_64-pc-linux-gnu -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx900 -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-disabled-macro-expansion -Wno-global-constructors -Wno-padded -fopenmp=libomp -fopenmp-version=40 -o CMakeFiles/vectorAdd.dir/src/main.cpp.o -c /home/kelling/checkout/alpaka/example/vectorAdd/src/main.cpp
Code compiles for target x86_64-pc-linux-gnu
compiler/make output

[  0%] Building CXX object example/vectorAdd/CMakeFiles/vectorAdd.dir/src/main.cpp.o
cd /home/kelling/checkout/alpaka/build/example/vectorAdd && /home/kelling/rocm/aomp/bin/clang++  -DALPAKA_ACC_CPU_BT_OMP4_ENABLED -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -DALPAKA_DEBUG=1 -DBOOST_ALL_NO_LIB -I/home/kelling/checkout/alpaka/include -isystem /home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/boost-1.70.0-acaxl7cie57ccno6sjxmfgn2usv3243b/include  --target=x86_64-pc-linux-gnu -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx900   -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-disabled-macro-expansion -Wno-global-constructors -Wno-padded -fopenmp=libomp -fopenmp-version=40 -o CMakeFiles/vectorAdd.dir/src/main.cpp.o -c /home/kelling/checkout/alpaka/example/vectorAdd/src/main.cpp
clang-8: /home/kelling/rocm/aomp_0.6-6/include/llvm/Support/Casting.h:255: typename llvm::cast_retty<X, Y*>::ret_type llvm::cast(Y*) [with X = clang::StringLiteral; Y = const clang::Expr; typename llvm::cast_retty<X, Y*>::ret_type = const clang::StringLiteral*]: Assertion `isa<X>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
Stack dump:
0.	Program arguments: /home/kelling/rocm/aomp_0.6-6/bin/clang-8 -cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -std=c++11 -emit-llvm-bc -emit-llvm-uselists -disable-free -main-file-name main.cpp -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -mconstructor-aliases -fuse-init-array -target-cpu gfx900 -fcuda-is-device -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file /home/kelling/checkout/alpaka/build/example/vectorAdd/CMakeFiles/vectorAdd.dir/src/main.cpp.gcno -resource-dir /home/kelling/rocm/aomp_0.6-6/lib/clang/8.0.1 -isystem /home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/boost-1.70.0-acaxl7cie57ccno6sjxmfgn2usv3243b/include -isystem /home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/boost-1.70.0-acaxl7cie57ccno6sjxmfgn2usv3243b/include -D ALPAKA_ACC_CPU_BT_OMP4_ENABLED -D ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -D ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -D ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -D ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -D ALPAKA_DEBUG=1 -D BOOST_ALL_NO_LIB -I /home/kelling/checkout/alpaka/include -D ALPAKA_ACC_CPU_BT_OMP4_ENABLED -D ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED -D ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED -D ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED -D ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED -D ALPAKA_DEBUG=1 -D BOOST_ALL_NO_LIB -I /home/kelling/checkout/alpaka/include -I/home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/boost-1.70.0-acaxl7cie57ccno6sjxmfgn2usv3243b/include -I/opt/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/pkgconf-1.6.0-espbmjgms57ao5jfwcywhqvxfcectpqi/include -I/home/kelling/checkout/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-8.3.0/openmpi-3.1.4-wz5hptqns7t4mvlooh5jx2cchjqm7w2h/include -I/opt/spack/opt/spack/linux-ubuntu18.04-x86_64/gcc-7.3.0/gcc-8.3.0-pttzhrjx2abqtbwkzrmrowgcgkargro6/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/backward -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/backward -internal-isystem /home/kelling/rocm/aomp_0.6-6/bin/../include -internal-isystem /home/kelling/rocm/aomp_0.6-6/bin/../include -internal-isystem /usr/local/include -internal-isystem /home/kelling/rocm/aomp_0.6-6/lib/clang/8.0.1/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /home/kelling/rocm/aomp_0.6-6/lib/clang/8.0.1/include/cuda_wrappers -internal-isystem /home/kelling/rocm/aomp_0.6-6/bin/../include -internal-isystem /usr/local/include -internal-isystem /home/kelling/rocm/aomp_0.6-6/lib/clang/8.0.1/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-disabled-macro-expansion -Wno-global-constructors -Wno-padded -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-disabled-macro-expansion -Wno-global-constructors -Wno-padded -fdeprecated-macro -fdebug-compilation-dir /home/kelling/checkout/alpaka/build/example/vectorAdd -ferror-limit 19 -fmessage-length 0 -fopenmp -fopenmp-version=40 -fopenmp-version=40 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -o /tmp/main-gfx900-b6b103.bc -x c++ /home/kelling/checkout/alpaka/example/vectorAdd/src/main.cpp -fopenmp-is-device -fopenmp-host-ir-file-path /tmp/main-b8f46f.bc -faddrsig 
1.	<eof> parser at end of file
2.	/home/kelling/checkout/alpaka/include/alpaka/queue/Traits.hpp:54:29: instantiating function definition 'alpaka::queue::enqueue<alpaka::queue::QueueOmp4Blocking, const alpaka::kernel::TaskKernelCpuOmp4<std::integral_constant<unsigned long, 1>, unsigned long, VectorAddKernel, unsigned int *, unsigned int *, unsigned int *, unsigned long> &>'
3.	/home/kelling/checkout/alpaka/include/alpaka/queue/QueueOmp4Blocking.hpp:186:44: instantiating function definition 'alpaka::queue::traits::Enqueue<alpaka::queue::QueueOmp4Blocking, alpaka::kernel::TaskKernelCpuOmp4<std::integral_constant<unsigned long, 1>, unsigned long, VectorAddKernel, unsigned int *, unsigned int *, unsigned int *, unsigned long>, void>::enqueue'
4.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:109:33: instantiating function definition 'alpaka::kernel::TaskKernelCpuOmp4<std::integral_constant<unsigned long, 1>, unsigned long, VectorAddKernel, unsigned int *, unsigned int *, unsigned int *, unsigned long>::operator()'
5.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:109:33: LLVM IR generation of declaration 'alpaka::kernel::TaskKernelCpuOmp4<std::integral_constant<unsigned long, 1>, unsigned long, VectorAddKernel, unsigned int *, unsigned int *, unsigned int *, unsigned long>::operator()'
6.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:163:17: LLVM IR generation of compound statement ('{}')
7.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:165:21: LLVM IR generation of compound statement ('{}')
8.	/home/kelling/checkout/alpaka/include/alpaka/kernel/TaskKernelCpuOmp4.hpp:169:25: LLVM IR generation of compound statement ('{}')
 #0 0x00000000019aa3fa llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x19aa3fa)
 #1 0x00000000019a8274 llvm::sys::RunSignalHandlers() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x19a8274)
 #2 0x00000000019a83d5 SignalHandler(int) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x19a83d5)
 #3 0x00007fe20ca4b890 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12890)
 #4 0x00007fe20b4f4e97 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x3ee97)
 #5 0x00007fe20b4f6801 abort (/lib/x86_64-linux-gnu/libc.so.6+0x40801)
 #6 0x00007fe20b4e639a (/lib/x86_64-linux-gnu/libc.so.6+0x3039a)
 #7 0x00007fe20b4e6412 (/lib/x86_64-linux-gnu/libc.so.6+0x30412)
 #8 0x0000000001e4a677 getSL(clang::Expr const*, clang::Type const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e4a677)
 #9 0x0000000001e4c5c3 clang::CodeGen::CodeGenFunction::EmitAMDGPUDevicePrintfCallExpr(clang::CallExpr const*, clang::CodeGen::ReturnValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e4c5c3)
#10 0x0000000001d8c21b clang::CodeGen::CodeGenFunction::EmitBuiltinExpr(clang::GlobalDecl, unsigned int, clang::CallExpr const*, clang::CodeGen::ReturnValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1d8c21b)
#11 0x0000000001e15308 clang::CodeGen::CodeGenFunction::EmitCallExpr(clang::CallExpr const*, clang::CodeGen::ReturnValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e15308)
#12 0x0000000001e42b4a (anonymous namespace)::ScalarExprEmitter::VisitCallExpr(clang::CallExpr const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e42b4a)
#13 0x0000000001e40fd7 clang::StmtVisitorBase<std::add_pointer, (anonymous namespace)::ScalarExprEmitter, llvm::Value*>::Visit(clang::Stmt*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e40fd7)
#14 0x0000000001e4213a (anonymous namespace)::ScalarExprEmitter::Visit(clang::Expr*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e4213a)
#15 0x0000000001e421b3 clang::CodeGen::CodeGenFunction::EmitScalarExpr(clang::Expr const*, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e421b3)
#16 0x0000000001df9437 clang::CodeGen::CodeGenFunction::EmitAnyExpr(clang::Expr const*, clang::CodeGen::AggValueSlot, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1df9437)
#17 0x0000000001e0be46 clang::CodeGen::CodeGenFunction::EmitIgnoredExpr(clang::Expr const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1e0be46)
#18 0x0000000001bfecaa clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bfecaa)
#19 0x0000000001bff343 clang::CodeGen::CodeGenFunction::EmitCompoundStmtWithoutScope(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff343)
#20 0x0000000001bff658 clang::CodeGen::CodeGenFunction::EmitCompoundStmt(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff658)
#21 0x0000000001c025e3 clang::CodeGen::CodeGenFunction::EmitSimpleStmt(clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c025e3)
#22 0x0000000001bfec40 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bfec40)
#23 0x0000000001bffb47 clang::CodeGen::CodeGenFunction::EmitIfStmt(clang::IfStmt const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bffb47)
#24 0x0000000001bfeefa clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bfeefa)
#25 0x0000000001bff343 clang::CodeGen::CodeGenFunction::EmitCompoundStmtWithoutScope(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff343)
#26 0x0000000001bff658 clang::CodeGen::CodeGenFunction::EmitCompoundStmt(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff658)
#27 0x0000000001c025e3 clang::CodeGen::CodeGenFunction::EmitSimpleStmt(clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c025e3)
#28 0x0000000001bfec40 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bfec40)
#29 0x0000000001c232c5 void clang::CodeGen::RegionCodeGenTy::CallbackFn<clang::CodeGen::CodeGenFunction::EmitOMPTeamsDirective(clang::OMPTeamsDirective const&)::'lambda'(clang::CodeGen::CodeGenFunction&, clang::CodeGen::PrePostActionTy&)>(long, clang::CodeGen::CodeGenFunction&, clang::CodeGen::PrePostActionTy&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c232c5)
#30 0x0000000001eb7b06 clang::CodeGen::RegionCodeGenTy::operator()(clang::CodeGen::CodeGenFunction&) const (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1eb7b06)
#31 0x0000000001eb7bbf (anonymous namespace)::CGOpenMPRegionInfo::EmitBody(clang::CodeGen::CodeGenFunction&, clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1eb7bbf)
#32 0x0000000001c2e07b clang::CodeGen::CodeGenFunction::GenerateOpenMPCapturedStmtFunction(clang::CapturedStmt const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c2e07b)
#33 0x0000000001ee6b22 emitParallelOrTeamsOutlinedFunction(clang::CodeGen::CodeGenModule&, clang::OMPExecutableDirective const&, clang::CapturedStmt const*, clang::VarDecl const*, clang::OpenMPDirectiveKind, llvm::StringRef, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ee6b22)
#34 0x0000000001ee6c96 clang::CodeGen::CGOpenMPRuntime::emitTeamsOutlinedFunction(clang::OMPExecutableDirective const&, clang::VarDecl const*, clang::OpenMPDirectiveKind, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ee6c96)
#35 0x0000000001f07450 clang::CodeGen::CGOpenMPRuntimeNVPTX::emitTeamsOutlinedFunction(clang::OMPExecutableDirective const&, clang::VarDecl const*, clang::OpenMPDirectiveKind, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1f07450)
#36 0x0000000001c1f624 emitCommonOMPTeamsDirective(clang::CodeGen::CodeGenFunction&, clang::OMPExecutableDirective const&, clang::OpenMPDirectiveKind, clang::CodeGen::RegionCodeGenTy const&) (.constprop.1495) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c1f624)
#37 0x0000000001c1fed4 clang::CodeGen::CodeGenFunction::EmitOMPTeamsDirective(clang::OMPTeamsDirective const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c1fed4)
#38 0x0000000001bff020 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff020)
#39 0x0000000001bff343 clang::CodeGen::CodeGenFunction::EmitCompoundStmtWithoutScope(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff343)
#40 0x0000000001bff658 clang::CodeGen::CodeGenFunction::EmitCompoundStmt(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bff658)
#41 0x0000000001c025e3 clang::CodeGen::CodeGenFunction::EmitSimpleStmt(clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c025e3)
#42 0x0000000001bfec40 clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1bfec40)
#43 0x0000000001c230a9 emitTargetRegion(clang::CodeGen::CodeGenFunction&, clang::OMPTargetDirective const&, clang::CodeGen::PrePostActionTy&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c230a9)
#44 0x0000000001eb7b06 clang::CodeGen::RegionCodeGenTy::operator()(clang::CodeGen::CodeGenFunction&) const (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1eb7b06)
#45 0x0000000001eb7bbf (anonymous namespace)::CGOpenMPRegionInfo::EmitBody(clang::CodeGen::CodeGenFunction&, clang::Stmt const*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1eb7bbf)
#46 0x0000000001c2e07b clang::CodeGen::CodeGenFunction::GenerateOpenMPCapturedStmtFunction(clang::CapturedStmt const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c2e07b)
#47 0x0000000001ee71c0 clang::CodeGen::CGOpenMPRuntime::emitTargetOutlinedFunctionHelper(clang::OMPExecutableDirective const&, llvm::StringRef, llvm::Function*&, llvm::Constant*&, bool, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ee71c0)
#48 0x0000000001f009d5 clang::CodeGen::CGOpenMPRuntimeNVPTX::emitNonSPMDKernel(clang::OMPExecutableDirective const&, llvm::StringRef, llvm::Function*&, llvm::Constant*&, bool, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1f009d5)
#49 0x0000000001f00ddf clang::CodeGen::CGOpenMPRuntimeNVPTX::emitTargetOutlinedFunction(clang::OMPExecutableDirective const&, llvm::StringRef, llvm::Function*&, llvm::Constant*&, bool, clang::CodeGen::RegionCodeGenTy const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1f00ddf)
#50 0x0000000001c061f6 clang::CodeGen::CodeGenFunction::EmitOMPTargetDeviceFunction(clang::CodeGen::CodeGenModule&, llvm::StringRef, clang::OMPTargetDirective const&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c061f6)
#51 0x0000000001ec6814 clang::CodeGen::CGOpenMPRuntime::scanForTargetRegionsFunctions(clang::Stmt const*, llvm::StringRef) (.localalias.1891) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ec6814)
#52 0x0000000001ec6537 clang::CodeGen::CGOpenMPRuntime::scanForTargetRegionsFunctions(clang::Stmt const*, llvm::StringRef) (.localalias.1891) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ec6537)
#53 0x0000000001ec8daf clang::CodeGen::CGOpenMPRuntime::emitTargetFunctions(clang::GlobalDecl) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1ec8daf)
#54 0x0000000001c986cd clang::CodeGen::CodeGenModule::EmitGlobal(clang::GlobalDecl) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c986cd)
#55 0x0000000001c9e538 clang::CodeGen::CodeGenModule::EmitTopLevelDecl(clang::Decl*) (.part.7280) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x1c9e538)
#56 0x0000000002566095 (anonymous namespace)::CodeGeneratorImpl::HandleTopLevelDecl(clang::DeclGroupRef) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2566095)
#57 0x000000000255a90a clang::BackendConsumer::HandleTopLevelDecl(clang::DeclGroupRef) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x255a90a)
#58 0x0000000003607398 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x3607398)
#59 0x000000000360673e clang::Sema::PerformPendingInstantiations(bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x360673e)
#60 0x0000000003607814 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x3607814)
#61 0x000000000360673e clang::Sema::PerformPendingInstantiations(bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x360673e)
#62 0x0000000003607814 clang::Sema::InstantiateFunctionDefinition(clang::SourceLocation, clang::FunctionDecl*, bool, bool, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x3607814)
#63 0x000000000360673e clang::Sema::PerformPendingInstantiations(bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x360673e)
#64 0x00000000030802b9 clang::Sema::ActOnEndOfTranslationUnit() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x30802b9)
#65 0x0000000002f66145 clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2f66145)
#66 0x0000000002f5b73f clang::ParseAST(clang::Sema&, bool, bool) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2f5b73f)
#67 0x0000000002563330 clang::CodeGenAction::ExecuteAction() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2563330)
#68 0x0000000002082d3e clang::FrontendAction::Execute() (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2082d3e)
#69 0x00000000020466c6 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x20466c6)
#70 0x0000000002139ce3 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x2139ce3)
#71 0x0000000000893d90 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x893d90)
#72 0x000000000082bcd6 main (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x82bcd6)
#73 0x00007fe20b4d7b97 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b97)
#74 0x00000000008900ba _start (/home/kelling/rocm/aomp_0.6-6/bin/clang-8+0x8900ba)
clang-8: error: unable to execute command: Aborted (core dumped)
clang-8: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 8.0.1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /home/kelling/rocm/aomp/bin
clang-8: note: diagnostic msg: PLEASE open a git issue at https://github.com/ROCm-Developer-Tools/aomp with a detailed description of this problem.
clang-8: note: diagnostic msg: Error generating preprocessed source(s).

I tried to generate prprocessed code, by running the listed clang++ command line with an additional -E, but the resulting code does not compile with the same commandline, comnplaining about redefinitions, e.g.

/usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/type_traits:84:48: error: redefinition of 'value'
    constexpr _Tp integral_constant<_Tp, __v>::value;

What options should I pass to get valid preprocessed code?

incorrect omp_get_num_teams()

#include <stdio.h>
#include <omp.h>

int main (void)
{
  #pragma omp target teams distribute num_teams(2)
  for(int j = 0; j < omp_get_num_teams(); j++)
  {
    printf ("The number of teams = %d!\n", omp_get_num_teams());
  }
}

With llvm-9 and nvptx64 backend.

The number of teams = 2!
The number of teams = 2!

With aomp 0.7-2

The number of teams = 1!

This result seems strange to me. It can be caused by either wrong omp_get_num_teams() return value or the actually number of teams spawn by the runtime is 1.

Reduce duplication in makefiles

Particularly in tests, there's a block titled "Standard Makefile check for AOMP installation" which seems to occur a lot.

This should be included instead. The help target, and possibly others, may also be identical across a lot of tests.

omp critical implementation looks wrong

It calls through to omp_set_lock, implemented as

#define UNSET 0
#define SET 1
EXTERN void omp_set_lock(omp_lock_t *lock) {
  // TODO: not sure spinning is a good idea here..
  while (atomicCAS(lock, UNSET, SET) != UNSET) {
    clock_t start = clock();
    clock_t now;
    for (;;) {
      now = clock();
      clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
      if (cycles >= __OMP_SPIN * GetBlockIdInKernel()) {
        break;
      }
    }
  } // wait for 0 to be the read value
}

I think this should deadlock. One thread wins the CAS, all the others in the same wavefront wait on it forever. The CAS should to be wrapped in code that disables the other lanes.

The call to clock() looks like a performance aid. S_SLEEP looks more suitable for amdgcn.

I think the semantics of critical (as opposed to single) require an iteration across each lane in the wavefront in an outer loop with this CAS in the inner loop.

revert ATMI path when rocr runtime fixes arrive in ROCm release

Background: Rome systems present themselves as NUMA cpu nodes. If all the physical memory slots are NOT filled , our ATMI runtime attempts to use a memory pool that has no memory and causes an insufficient resource failures.

This means nobody can use AOMP on a Rome system which is not fully populated with memory. ROCM 3.1 AOMP (and previous) cant build executables that run on Rome, neither can our AOMP releases past and present, and no one but us with the specific patch can build an AOMP that is usable on Rome.

Again we believe this applies only to Rome systems with some empty memory slots on the motherboard.

I don’t know when the appropriate rocr-runtime patches will make it into a ROCM release.

We need to get ahead of this one with the ROCm teams to get whatever the rocr-runtime patch is, into a ROCm release.

From Ashwin{

the ideal solution is to wait/push ROCr/ROCt teams to prioritize the below two JIRA tickets ASAP.

http://ontrack-internal.amd.com/browse/SWDEV-202188
http://ontrack-internal.amd.com/browse/SWDEV-201817

ATMI just uses the memory pool provided by ROCr, and the above tickets indicate that ROCr is returning more memory pool objects than it should, some of which point to empty memory slots.

As a workaround, I have provided Ethan with the patch in case we want to jump ahead, but we should revert the patch once the ROCr/ROCt teams fix the above tickets. @stewart, Ethan/@lieberman, Ron: can one of you then track this TODO (to revert the workaround) at a later time? Perhaps as a GitHub issue?

Two implementations for lane id, choose one

Either:

__builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
// =>
v_mbcnt_lo_u32_b32_e64 v0, -1, 0
v_mbcnt_hi_u32_b32_e32 v0, -1, v0

or:

__builtin_amdgcn_workitem_id_x() & 64
// =>
v_and_b32_e32 v0, 63, v0

The former is suggested in the vega ISA manual. The latter depends on some special handling of VGPRs - it appears the workitem intrinsic triggers a special case in ISelLowering that populates a VGPR with the appropriate value.

I suspect we want to use the second sequence - thoughts?

Flang hardcodes a warp size of 32

See #define NVVM_WARPSIZE 32 in flang/tools/flang2/flang2exe/ompaccel.h

Only has one use site:

  sptrShmem = mk_ompaccel_addsymbol(
      name, mk_ompaccel_array_dtype(DT_INT8, NVVM_WARPSIZE),

I'm not totally sure what this API is doing, but would guess it's allocating an array char[WARPSIZE] which suggests there's a risk of out of bounds for amdgcn when warpsize is actually 64.

ICE on static shared variable

Reduced from a patch to the deviceRTL runtime which intended to make a boolean local to a translation unit.

// fail.hip
// /home/amd/rocm/aomp/bin/clang++ -std=c++11 --cuda-device-only --cuda-gpu-arch=gfx701 fail.hip

#define FAIL 1
#if FAIL
static
#endif
__attribute__((device, shared)) bool state;
__attribute__((device)) void set() { state = false; }

clang-9: /home/amd/aomp/llvm-project/llvm/include/llvm/Support/Casting.h:264: typename llvm::cast_retty<X, Y*>::ret_type llvm::cast(Y*) [with X = llvm::GlobalVariable; Y = llvm::Constant; typename llvm::cast_re\
tty<X, Y*>::ret_type = llvm::GlobalVariable*]: Assertion `isa<X>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
Stack dump:
0.      Program arguments: /home/amd/rocm/aomp_0.7-6/bin/clang-9 -cc1 -triple amdgcn-amd-amdhsa -std=c++11 -aux-triple x86_64-unknown-linux-gnu -emit-llvm-bc -emit-llvm-uselists -disable-free -main-file-name fa\
il.hip -mrelocation-model pic -pic-level 1 -mthread-model posix -mdisable-fp-elim -mconstructor-aliases -fuse-init-array -target-cpu gfx701 -fcuda-is-device -fgpu-rdc -fcuda-allow-variadic-functions -mlink-buil\
tin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/hip.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/opencl.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/ocml.amdgc\
n.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/ockl.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/oclc_finite_only_off.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/a\
omp_0.7-6/bin/../lib/oclc_daz_opt_off.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/oclc_correctly_rounded_sqrt_on.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/o\
clc_unsafe_math_off.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/oclc_isa_version_701.amdgcn.bc -mlink-builtin-bitcode /home/amd/rocm/aomp_0.7-6/bin/../lib/oclc_wavefrontsize64_on.amdgc\
n.bc -dwarf-column-info -debugger-tuning=gdb -resource-dir /home/amd/rocm/aomp_0.7-6/lib/clang/9.0.1 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0 -internal-isystem /usr/li\
b/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x8\
6_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/backward -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../\
include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/x86_64-linux-gnu/c++/7.4.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/\
c++/7.4.0/backward -internal-isystem /home/amd/rocm/aomp_0.7-6/bin/../include -internal-isystem /home/amd/rocm/aomp_0.7-6/bin/../include -internal-isystem /usr/local/include -internal-isystem /home/amd/rocm/aom\
p_0.7-6/lib/clang/9.0.1/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /home/amd/rocm/aomp_0.7-6/lib/\
clang/9.0.1/include/cuda_wrappers -internal-isystem /home/amd/rocm/aomp_0.7-6/bin/../include -internal-isystem /usr/local/include -internal-isystem /home/amd/rocm/aomp_0.7-6/lib/clang/9.0.1/include -internal-ex\
ternc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -std=c++11 -fdeprecated-macro -fdebug-compilation-dir /home/amd/aomp/build/openmp/libomptarg\
et/deviceRTLs/amdgcn -ferror-limit 19 -fmessage-length 104 -fvisibility default -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -fcuda-allow-variadic-functions -faddrsig -o /dev/null \
-x hip /home/amd/aomp/llvm-project/openmp/libomptarget/deviceRTLs/amdgcn/src/fail.hip
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      /home/amd/aomp/llvm-project/openmp/libomptarget/deviceRTLs/amdgcn/src/fail.hip:6:38: Generating code for declaration 'state'
 #0 0x000055ccd9b38745 llvm::sys::PrintStackTrace(llvm::raw_ostream&) /home/amd/aomp/llvm-project/llvm/lib/Support/Unix/Signals.inc:533:0
 #1 0x000055ccd9b387d8 PrintStackTraceSignalHandler(void*) /home/amd/aomp/llvm-project/llvm/lib/Support/Unix/Signals.inc:594:0
 #2 0x000055ccd9b366b2 llvm::sys::RunSignalHandlers() /home/amd/aomp/llvm-project/llvm/lib/Support/Signals.cpp:68:0
 #3 0x000055ccd9b380fc SignalHandler(int) /home/amd/aomp/llvm-project/llvm/lib/Support/Unix/Signals.inc:385:0
 #4 0x00007f951a842890 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12890)
 #5 0x00007f95194f3e97 raise /build/glibc-OTsEL5/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #6 0x00007f95194f5801 abort /build/glibc-OTsEL5/glibc-2.27/stdlib/abort.c:81:0
 #7 0x00007f95194e539a __assert_fail_base /build/glibc-OTsEL5/glibc-2.27/assert/assert.c:89:0
 #8 0x00007f95194e5412 (/lib/x86_64-linux-gnu/libc.so.6+0x30412)
 #9 0x000055ccd81f20c6 llvm::cast_retty<llvm::GlobalVariable, llvm::Constant*>::ret_type llvm::cast<llvm::GlobalVariable, llvm::Constant>(llvm::Constant*) /home/amd/aomp/llvm-project/llvm/include/llvm/Support/C\
asting.h:264:0
#10 0x000055ccd9fb386e clang::CodeGen::CodeGenModule::EmitGlobalVarDefinition(clang::VarDecl const*, bool) /home/amd/aomp/llvm-project/clang/lib/CodeGen/CodeGenModule.cpp:3876:0
#11 0x000055ccd9fae439 clang::CodeGen::CodeGenModule::EmitGlobalDefinition(clang::GlobalDecl, llvm::GlobalValue*) /home/amd/aomp/llvm-project/clang/lib/CodeGen/CodeGenModule.cpp:2770:0
#12 0x000055ccd9fab467 clang::CodeGen::CodeGenModule::EmitDeferred() /home/amd/aomp/llvm-project/clang/lib/CodeGen/CodeGenModule.cpp:2130:0
#13 0x000055ccd9fa264f clang::CodeGen::CodeGenModule::Release() /home/amd/aomp/llvm-project/clang/lib/CodeGen/CodeGenModule.cpp:395:0
#14 0x000055ccdae245e3 (anonymous namespace)::CodeGeneratorImpl::HandleTranslationUnit(clang::ASTContext&) /home/amd/aomp/llvm-project/clang/lib/CodeGen/ModuleBuilder.cpp:260:0
#15 0x000055ccdae1cdc8 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /home/amd/aomp/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:241:0
#16 0x000055ccdc368ebc clang::ParseAST(clang::Sema&, bool, bool) /home/amd/aomp/llvm-project/clang/lib/Parse/ParseAST.cpp:178:0
#17 0x000055ccda59e27d clang::ASTFrontendAction::ExecuteAction() /home/amd/aomp/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1041:0
#18 0x000055ccdae1a83f clang::CodeGenAction::ExecuteAction() /home/amd/aomp/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:1116:0
#19 0x000055ccda59dbde clang::FrontendAction::Execute() /home/amd/aomp/llvm-project/clang/lib/Frontend/FrontendAction.cpp:938:0
#20 0x000055ccda53917e clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /home/amd/aomp/llvm-project/clang/lib/Frontend/CompilerInstance.cpp:944:0
#21 0x000055ccda6fbfff clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /home/amd/aomp/llvm-project/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:291:0
#22 0x000055ccd813b2c0 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /home/amd/aomp/llvm-project/clang/tools/driver/cc1_main.cpp:249:0
#23 0x000055ccd813074b ExecuteCC1Tool(llvm::ArrayRef<char const*>, llvm::StringRef) /home/amd/aomp/llvm-project/clang/tools/driver/driver.cpp:309:0
#24 0x000055ccd8130e3c main /home/amd/aomp/llvm-project/clang/tools/driver/driver.cpp:381:0
#25 0x00007f95194d6b97 __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:344:0
#26 0x000055ccd812eefa _start (/home/amd/rocm/aomp_0.7-6/bin/clang-9+0x241eefa)
clang-9: error: unable to execute command: Aborted (core dumped)
clang-9: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 9.0.1 ([email protected]:ROCm-Developer-Tools/llvm-project 12fcfe2e566a5e4c8dc1b362ecc03744486a170b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/amd/rocm/aomp/bin
clang-9: note: diagnostic msg: PLEASE open a git issue at https://github.com/ROCm-Developer-Tools/aomp with a detailed description of this problem.