rocm / rocfft



Next generation FFT implementation for ROCm

Home Page: https://rocm.docs.amd.com/projects/rocFFT/en/latest/

License: Other

C++ 84.84% C 1.66% CMake 2.85% Shell 0.57% Python 8.33% Groovy 1.09% Asymptote 0.66%

fft rocm hip amd fast fourier gpu transform

rocfft's Introduction

rocFFT

rocFFT is a software library for computing fast Fourier transforms (FFTs) written in the HIP programming language. It's part of AMD's software ecosystem based on ROCm. The rocFFT library can be used with AMD and NVIDIA GPUs.

Documentation

Documentation for rocFFT is available at rocm.docs.amd.com.

To build our documentation locally, run the following commands:

cd docs

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Build and install

You can install rocFFT using pre-built packages or by building from source.

Installing pre-built packages:
1. Download the pre-built packages from the ROCm package servers or use the GitHub releases tab to download the source (this may give you a more recent version than the pre-built packages).
2. Run: sudo apt update && sudo apt install rocfft

Building from source:

rocFFT is compiled with AMD's clang++ and uses CMake. You can specify several options to customize your build. The following commands build a shared library for supported AMD GPUs:

mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_C_COMPILER=amdclang ..
make -j

You can compile a static library using the -DBUILD_SHARED_LIBS=off option.

rocFFT uses indirect function calls by default, which requires ROCm 4.3 or higher. To build with older ROCm compilers, pass -DROCFFT_CALLBACKS_ENABLED=off to CMake to disable these calls. Note that callbacks won't work correctly with this configuration.

rocFFT includes the following clients:

rocfft-bench: Runs general transforms and is useful for performance analysis
rocfft-test: Runs various regression tests
samples: Various small samples

Client	CMake option	Dependencies
`rocfft-bench`	`-DBUILD_CLIENTS_BENCH=on`	None
`rocfft-test`	`-DBUILD_CLIENTS_TESTS=on`	Fastest Fourier Transform in the West (FFTW), GoogleTest
samples	`-DBUILD_CLIENTS_SAMPLES=on`	FFTW

Clients are not built by default. To build them, use -DBUILD_CLIENTS=on. The build process downloads and builds GoogleTest and FFTW if they are not already installed.

Clients can be built separately from the main library. For example, you can build all the clients with an existing rocFFT library by invoking CMake from within the rocFFT-src/clients folder:

mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_C_COMPILER=amdclang -DCMAKE_PREFIX_PATH=/path/to/rocFFT-lib ..
make -j

To install client dependencies on Ubuntu, run:

sudo apt install libgtest-dev libfftw3-dev

We use version 1.11 of GoogleTest.

Examples

A summary of the latest functionality and workflow to compute an FFT with rocFFT is available on the rocFFT documentation portal.

You can find additional examples in the clients/samples subdirectory.

Support

You can report bugs and feature requests through the GitHub issue tracker.

Contribute

If you want to contribute to rocFFT, you must follow our contribution guidelines.

rocfft's Issues

Problems with double 1D complex interleaved for moderate to large sizes

What should happen

rocfft-rider should run correctly for doubles as well as floats.

What really happens

It doesn't, for doubles: output[0].real() is typically -nan instead of N.
As far as I can tell, all sizes greater than 2048 are broken except powers of 2, with one exception: 4096 is also broken, but gives 3.88098e-304 rather than -nan.
8192, 16384, etc. are OK.
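To tell a wrong result from a NaN at sizes such as 2160, you can compare the GPU output against a slow CPU reference in double precision. The sketch below is plain C++ and is not part of rocFFT or rocfft-rider. `naive_dft` and `all_finite` are hypothetical helper names. It assumes an all-ones input, which is one input for which output[0] equals N, as the report expects.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) forward DFT in double precision, intended only as a CPU reference.
std::vector<std::complex<double>> naive_dft(const std::vector<std::complex<double>>& in)
{
    const std::size_t n_len = in.size();
    assert(n_len > 0);
    const double pi = std::acos(-1.0);

    // Precompute the N twiddle factors exp(-2*pi*i*m/N).
    std::vector<std::complex<double>> twiddle(n_len);
    for(std::size_t m = 0; m < n_len; ++m)
        twiddle[m] = std::polar(1.0, -2.0 * pi * static_cast<double>(m) / static_cast<double>(n_len));

    std::vector<std::complex<double>> out(n_len);
    for(std::size_t k = 0; k < n_len; ++k)
    {
        std::complex<double> acc = 0.0;
        // Indexing by (k * n) % N keeps every phase in [0, 2*pi).
        for(std::size_t n = 0; n < n_len; ++n)
            acc += in[n] * twiddle[(k * n) % n_len];
        out[k] = acc;
    }
    return out;
}

// True if no element is NaN or infinite (the report sees output[0] = -nan).
bool all_finite(const std::vector<std::complex<double>>& v)
{
    for(const auto& z : v)
        if(!std::isfinite(z.real()) || !std::isfinite(z.imag()))
            return false;
    return true;
}
```

For 2160 ones, `naive_dft` returns 2160 in element 0 and values near zero elsewhere. The twiddle table avoids evaluating exp() at large angles, where accuracy degrades.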

How to reproduce

./rocfft-rider --double -o -p 10 -x 2160

Environment

Hardware	description
GPU	VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] (rev e7)
CPU
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1800X Eight-Core Processor
stepping : 1
microcode : 0x8001129

Software	version
ROCK	v0.0
ROCR	v0.0
HCC	v0.0
4.11.0-kfd-compute-rocm-rel-1.6-180 #1 SMP Tue Oct 10 08:15:38 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
HCC clang version 6.0.0 (based on HCC 1.0.17412-f590a25-821e6d8-64e7fc7 )
Library	v0.0

Not-in-place real-to-complex 2D forward FFT modifies the input

What is the expected behavior

Not-in-place real-to-complex 2D forward FFT should not modify the input.

What actually happens

The input is modified after rocfft_execute completes. The modified data follows a pattern that may be a clue to what has gone wrong.

How to reproduce

I can reproduce this in two ways:

In the standard rocFFT sample docs/samples/real2complex_2d.cpp, dump the input after the forward rocfft_execute call completes and compare it with the original input.
Use the sample code I provided (failure_repro.cpp.txt).
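The check in the first reproduction method can be packaged as a small host-side helper: dump the input, execute, and compare. This is a sketch under assumptions. `input_preserved` and `Transform` are hypothetical names. The `transform` callable stands in for whatever copies the input to the device, runs the out-of-place forward transform, and copies the input buffer back to the host for inspection.

```cpp
#include <cassert>
#include <cstring>
#include <functional>
#include <vector>

// Stand-in for "run the out-of-place forward R2C transform and bring the
// input buffer back to the host" (hypothetical; not a rocFFT type).
using Transform = std::function<void(std::vector<float>& in, std::vector<float>& out)>;

// Snapshot `in`, run `transform`, then report whether `in` is bit-for-bit unchanged.
bool input_preserved(std::vector<float>& in, std::vector<float>& out, const Transform& transform)
{
    assert(transform && "transform must be callable");
    const std::vector<float> snapshot = in; // copy taken before execution
    transform(in, out);
    if(snapshot.size() != in.size())
        return false;
    // Bitwise comparison: an untouched buffer must match exactly, not approximately.
    return in.empty()
           || std::memcmp(snapshot.data(), in.data(), in.size() * sizeof(float)) == 0;
}
```

The comparison is bitwise rather than tolerance-based on purpose. A not-in-place transform should not touch its input at all, so any difference, however small, indicates the problem.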

Environment

CPU => Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
GPU => gfx900

Software	version
Ubuntu 18.04
HIP version: 3.3.20126-2dbba46b
clang version 11.0.0
rocm 3.3.0 => standard RPM
rocfft => latest built from source (revision 4192227 May 28)
rocFFT compiled using HIP_PLATFORM=clang

rocFFT build error : error: unknown argument: '-fno-gpu-rdc'

I use this command to build rocFFT:
CXX=/opt/rocm/bin/hcc cmake ..
but the build fails with the error below. What should I do? Thank you!

What actually happens

-- The CXX compiler identification is Clang 7.0.0
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HCC compiler set; ROCm backend selected [ CXX=/opt/rocm/bin/hcc cmake ... ]
-- Building with ROCm tools
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1")
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /work/home/wangzh/rocFFT/build

~make
Scanning dependencies of target rocfft-kernel-generator
[  2%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
[  5%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[  7%] Linking CXX executable rocfft-kernel-generator
[  7%] Built target rocfft-kernel-generator
[ 10%] Generator producing device kernels for rocfft-device
Scanning dependencies of target rocfft-device
[ 13%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
clang-7: error: unknown argument: '-fno-gpu-rdc'
make[2]: *** [library/src/device/CMakeFiles/rocfft-device.dir/build.make:610: library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:168: library/src/device/CMakeFiles/rocfft-device.dir/all] Error 2
make: *** [Makefile:152: all] Error 2

How to reproduce

Environment

Hardware	description
GPU	gfx906

Software	version
HCC clang	version 7.0.0
hip	1.5.18353

Incorrect result with 2D C2R FFT

See attachment.

fft_test.zip

Input:
[[0, 1, 2]
[3, 4, 5]
[6, 7, 8]
[9, 0, 1]]

Both numpy and fftw3 produce the output
[[58, 2, 10, 2]
[-24, -8, 0, 8]
[6, -10, -10, -10]
[-24, 8, 0, -8]]

rocfft produces the output
[[58, 2, 10, 2]
[-14, -10, 10, 6]
[6, -10, -10, -10]
[-34, 10, -10, -6]]
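A naive CPU reference reproduces the numpy/FFTW result and can be used to check GPU output. This sketch is plain C++ with hypothetical function names. It assumes the input entries are real-valued complex numbers forming a 4 x 3 half spectrum of a 4 x 4 real output.

It follows the convention of numpy's irfft2:
- a complex inverse DFT along the first axis;
- then a C2R pass along the last axis that uses only the real parts of the DC and Nyquist columns.

This input is not Hermitian-symmetric along the first axis: for example, rows 1 and 3 of the first column hold 3 and 9. C2R results for such input can depend on how a library treats the redundant terms. The report does not settle whether that accounts for the rocFFT output.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Unnormalized inverse DFT: y[m] = sum_k x[k] * exp(+2*pi*i*k*m/n).
std::vector<cplx> inverse_dft(const std::vector<cplx>& x)
{
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> y(n);
    for(std::size_t m = 0; m < n; ++m)
        for(std::size_t k = 0; k < n; ++k)
            y[m] += x[k] * std::polar(1.0, 2.0 * pi * double((k * m) % n) / double(n));
    return y;
}

// Unnormalized 1D C2R of even length n from n/2+1 coefficients. Like numpy's
// irfft, only the real parts of the DC and Nyquist terms are used.
std::vector<double> c2r_1d(const std::vector<cplx>& half, std::size_t n)
{
    assert(n >= 2 && n % 2 == 0 && half.size() == n / 2 + 1);
    const double pi = std::acos(-1.0);
    std::vector<double> out(n);
    for(std::size_t m = 0; m < n; ++m)
    {
        double acc = half[0].real() + half[n / 2].real() * ((m % 2 == 0) ? 1.0 : -1.0);
        // X[k] and its mirror conj(X[k]) at n-k contribute 2*Re(X[k] * e^{i*phase}).
        for(std::size_t k = 1; k < n / 2; ++k)
            acc += 2.0 * (half[k] * std::polar(1.0, 2.0 * pi * double((k * m) % n) / double(n))).real();
        out[m] = acc;
    }
    return out;
}

// Unnormalized 2D C2R: rows x (cols/2+1) half spectrum -> rows x cols real output.
std::vector<std::vector<double>> c2r_2d(const std::vector<std::vector<cplx>>& half, std::size_t cols)
{
    const std::size_t rows = half.size();
    const std::size_t hcols = cols / 2 + 1;
    std::vector<std::vector<cplx>> tmp(rows, std::vector<cplx>(hcols));
    // Step 1: complex inverse DFT down each column.
    for(std::size_t c = 0; c < hcols; ++c)
    {
        std::vector<cplx> column(rows);
        for(std::size_t r = 0; r < rows; ++r)
        {
            assert(half[r].size() == hcols);
            column[r] = half[r][c];
        }
        const std::vector<cplx> col_out = inverse_dft(column);
        for(std::size_t r = 0; r < rows; ++r)
            tmp[r][c] = col_out[r];
    }
    // Step 2: C2R along each row.
    std::vector<std::vector<double>> out(rows);
    for(std::size_t r = 0; r < rows; ++r)
        out[r] = c2r_1d(tmp[r], cols);
    return out;
}
```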

hipfftPlanMany() returns HIPFFT_INVALID_VALUE

What is the expected behavior

hipfftPlanMany returns HIPFFT_SUCCESS

What actually happens

hipfftPlanMany returns HIPFFT_INVALID_VALUE when the input parameters inembed and onembed are not NULL; with NULL inembed and onembed it returns HIPFFT_SUCCESS.

How to reproduce

Compile the following sample code, 2dfft.hip.cpp:

hipcc 2dfft.hip.cpp -lrocfft

Run the binary

./a.out

Result(HIPFFT_INVALID_VALUE is 4)
hipfftPlanMany result is 4

==============================

#include <stdio.h>
#include <stdlib.h>
#include <hip/hip_runtime.h>
#include <hipfft.h>

#define NX 256
#define NY 128
#define BATCH 1
#define NRANK 2

int main()
{
    // Host data: NX rows of NY complex values
    hipfftComplex (*CompData)[NY] = (hipfftComplex (*)[NY])malloc(sizeof(hipfftComplex) * NX * NY);
    int i, j;
    for (i = 0; i < NX; i++)
    {
        for (j = 0; j < NY; j++)
        {
            CompData[i][j].x = rand();
            CompData[i][j].y = 0;
        }
    }

    hipfftReal *d_fftData;
    hipfftHandle plan; // handle
    int n[NRANK] = {NX, NY};

    // Allocate room for NX * NY complex values so that both the in-place R2C output
    // (NX * (NY / 2 + 1) complex values) and the final copy back to the host fit
    hipMalloc((void**)&d_fftData, sizeof(hipfftComplex) * NX * NY);
    hipMemcpy(d_fftData, CompData, sizeof(hipfftReal) * NX * NY, hipMemcpyHostToDevice);

    int inembed[2] = {NX, NY};
    int onembed[2] = {NX, (int)(NY / 2) + 1};

    hipfftResult result;
    // result = hipfftPlanMany(&plan, NRANK, n, NULL, 1, 0, NULL, 1, 0, HIPFFT_R2C, BATCH); // This call returns HIPFFT_SUCCESS
    result = hipfftPlanMany(&plan, NRANK, n, inembed, 1, NX, onembed, 1, NX, HIPFFT_R2C, BATCH); // This call returns HIPFFT_INVALID_VALUE
    // handlehipfftError(result);
    printf("hipfftPlanMany result is %d\n", result);
    result = hipfftExecR2C(plan, d_fftData, (hipfftComplex*)d_fftData); // execute in place
    printf("hipfftExecR2C result is %d\n", result);

    hipDeviceSynchronize(); // wait for the transform to finish
    hipMemcpy(CompData, d_fftData, sizeof(hipfftComplex) * NX * NY, hipMemcpyDeviceToHost); // copy the result from device to host

    hipfftDestroy(plan);

    free(CompData);
    hipFree(d_fftData);
    return 0;
}

==============================
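For reference, cuFFT documents the advanced data layout for a 2D transform as input[b * idist + (x * inembed[1] + y) * istride]. The sketch below assumes hipFFT interprets inembed and onembed the same way. `advanced_offset` and `max_offset` are hypothetical helpers.

This layout makes one check easy for an in-place R2C plan, where real and complex data share memory:
- Real row x starts at x * inembed[1] reals.
- Complex row x starts at x * onembed[1] complex values, that is, 2 * x * onembed[1] reals.
- So the rows only line up when inembed[1] = 2 * onembed[1].

The report uses inembed[1] = 128 and onembed[1] = 65. The report does not establish whether this is why HIPFFT_INVALID_VALUE is returned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of logical element `idx` (slowest index first) in batch `b`, using the
// cuFFT-style advanced layout formula, e.g. for 2D:
//   b * dist + (idx[0] * embed[1] + idx[1]) * stride
// embed[0] does not appear in the formula.
std::size_t advanced_offset(const std::vector<int>& embed, int stride, int dist, int b,
                            const std::vector<int>& idx)
{
    assert(!embed.empty() && embed.size() == idx.size());
    std::size_t linear = static_cast<std::size_t>(idx[0]);
    for(std::size_t d = 1; d < idx.size(); ++d)
        linear = linear * static_cast<std::size_t>(embed[d]) + static_cast<std::size_t>(idx[d]);
    return static_cast<std::size_t>(b) * static_cast<std::size_t>(dist)
           + linear * static_cast<std::size_t>(stride);
}

// Largest offset touched by `batch` transforms of logical size `n`.
std::size_t max_offset(const std::vector<int>& embed, int stride, int dist, int batch,
                       std::vector<int> n)
{
    assert(batch >= 1);
    for(int& v : n)
        v -= 1; // last valid index in each dimension
    return advanced_offset(embed, stride, dist, batch - 1, n);
}
```

With the report's values, the real input reaches offset 32767 (= 255 * 128 + 127) and the complex output reaches offset 16639 (= 255 * 65 + 64).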

Environment

Hardware	description
GPU	MI25 (not GPU specific)
CPU	Intel Xeon(R) CPU E5-2690

Software	version
ROCR	2.1.96
HCC	1.3.19045

Comparison with clFFT (performance)

This is a documentation request:

Could you please publish a performance comparison with clFFT?

Status of hipFFT

Which cuFFT features are actually implemented?
It's hard to tell from the header and source files.

There is a lot of commented-out code and there are empty switch cases, for example: https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/library/src/hipfft.cpp#L224
hipfftExecC2C calls the same code for both directions (unlike hipfftExecZ2Z):
https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/library/src/hipfft.cpp#L858
Functions always return HIPFFT_SUCCESS and ignore the results of the rocfft_* functions.
Will hipFFT be a wrapper around cuFFT, or is it supposed to be used with rocFFT only (on both ROCm and CUDA)?

Build fails on Ubuntu 18.04

What actually happens

/opt/rocm/hip/include/hip/hcc_detail/hip_runtime.h:76:10: fatal error: 'grid_launch.h' file not found

How to reproduce

build on Ubuntu 18.04

Environment

HCC	1.2.18313-39ff360d-be6eeeeffb-1e422702eba

/ROCm/rocFFT/release$ make
Scanning dependencies of target rocfft-kernel-generator
[ 4%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
[ 8%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[ 12%] Linking CXX executable rocfft-kernel-generator
[ 12%] Built target rocfft-kernel-generator
[ 16%] Generator producing device kernels for rocfft-device
Scanning dependencies of target rocfft-device
[ 20%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
In file included from /home/runtron/ROCm/rocFFT/library/src/device/transpose.cpp:6:
In file included from /home/runtron/ROCm/rocFFT/library/src/device/../include/kernel_launch.h:12:
In file included from /home/runtron/ROCm/rocFFT/library/src/device/../include/rocfft_hip.h:9:
In file included from /opt/rocm/hip/include/hip/hip_runtime.h:56:
/opt/rocm/hip/include/hip/hcc_detail/hip_runtime.h:76:10: fatal error: 'grid_launch.h' file not found
#include <grid_launch.h>
^~~~~~~~~~~~~~~
1 error generated.
library/src/device/CMakeFiles/rocfft-device.dir/build.make:504: recipe for target 'library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o' failed
make[2]: *** [library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o] Error 1
CMakeFiles/Makefile2:162: recipe for target 'library/src/device/CMakeFiles/rocfft-device.dir/all' failed
make[1]: *** [library/src/device/CMakeFiles/rocfft-device.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2

using rocFFT in an OpenCL application

What is the expected behavior

I would expect the ROCm platform to provide an OpenCL API for the rocFFT library.

What actually happens

Only HIP bindings seem to be available.

How to reproduce

Try to use rocFFT with OpenCL

Environment

Hardware	description
Any

Software	version
ROCK	1.7.137
ROCR	1.1.7-12-gf0de514
HCC	1.2.18063
Library	git master

Note that this is a critical dependency for our work on bringing feature parity with CUDA to the next GROMACS release.

Build fails with compiler error on rocm 3.0

When compiling with hcc, for example via

cmake -DCMAKE_CXX_COMPILER=hcc ..

the compilation fails, and '/opt/rocm/hcc/bin/clang-10' prints errors with stack traces.

Wrong results for inverse transformation for 2D batches

What is the expected behavior

The inverse transform should return the original data multiplied by a constant.

What actually happens

If I transform a 4x4x4 array, applying the transform to the 2nd and 3rd dimensions, the forward transform is correct but the inverse is not.
I use the following parameters for rocfft_plan_description:
strides = {4, 16} and distance = 1. Are these correct, or am I missing something?

However, applying the transforms to the 1st and 2nd dimensions (with strides = {1, 4} and distance = 16) works.
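One way to sanity-check a strides/distance choice is to enumerate the offsets it generates. The sketch below is plain C++ with a hypothetical helper name. It assumes the 4x4x4 array is stored with the first index fastest (offset i + 4j + 16k), which is consistent with the working case in the report.

Under that assumption, both layouts address each of the 64 elements exactly once, so the parameters look self-consistent. That does not rule out a bug in the inverse transform, and it does not check the output layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if `batches` 2D transforms with the given lengths, strides
// (listed in the same order as the lengths) and batch distance touch every
// element of a `total`-element buffer exactly once.
bool covers_each_element_once(const std::array<std::size_t, 2>& lengths,
                              const std::array<std::size_t, 2>& strides,
                              std::size_t distance, std::size_t batches, std::size_t total)
{
    assert(total > 0);
    std::vector<int> hits(total, 0);
    for(std::size_t b = 0; b < batches; ++b)
        for(std::size_t j = 0; j < lengths[1]; ++j)
            for(std::size_t i = 0; i < lengths[0]; ++i)
            {
                const std::size_t off = b * distance + i * strides[0] + j * strides[1];
                if(off >= total)
                    return false; // layout runs past the end of the buffer
                ++hits[off];
            }
    for(int h : hits)
        if(h != 1)
            return false; // some element is skipped or visited twice
    return true;
}
```

For example, strides {4, 16} with distance 4 (instead of 1) fails the check, because batch 1 would start at element 4, which batch 0 already uses.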

How to reproduce

This gist reports "Maximum error: 9.13869", but the error should be close to 0.

Environment

Hardware	description
GPU	gfx803
CPU	Threadripper 1950X

Software	version
ROCK	v2.10
ROCR	v1.1
HCC	v2.10
Library	v0.9.8

Used the rocm docker image with kernel v5.3.12.

How to port an FFT library (cuFFT) with HIP so that it works on CUDA and ROCm platforms

In the context of the HIP port of ML frameworks, how can the FFT library (cuFFT) be ported so that the HIP port works on both CUDA and ROCm platforms?

For other libraries such as cuBLAS, hipBLAS is a marshalling library: it marshals inputs and outputs to either cuBLAS or rocBLAS on each respective platform.

Is there a marshalling library for FFT, or will rocFFT work on both CUDA and ROCm platforms?

rocfft examples fail on Radeon VII - rocfft 3.5.0

description

I was trying to convert my CUDA code to ROCm. After getting segmentation faults in the FFTs, I tried the examples provided with rocFFT, which also failed. I mostly tested with hipfft_3d_z2z.cpp, since my original code also needs 3D FFTs.

compiling

hipcc -g -c hipfft_3d_z2z.cpp -o hipfft_3d_z2z.o -I/opt/rocm/include
hipcc -g --amdgpu-target=gfx906 -o hipfft_3d_z2z hipfft_3d_z2z.o -L/opt/rocm/lib -lrocfft

This works without issues or messages.

execution

./hipfft_3d_z2z
hipfft 3D double-precision complex-to-complex transform
input:
(0,0) (10,0) (20,0) (30,0) 
(0,1) (10,1) (20,1) (30,1) 
(0,2) (10,2) (20,2) (30,2) 
(0,3) (10,3) (20,3) (30,3) 

(1,0) (11,0) (21,0) (31,0) 
(1,1) (11,1) (21,1) (31,1) 
(1,2) (11,2) (21,2) (31,2) 
(1,3) (11,3) (21,3) (31,3) 

(2,0) (12,0) (22,0) (32,0) 
(2,1) (12,1) (22,1) (32,1) 
(2,2) (12,2) (22,2) (32,2) 
(2,3) (12,3) (22,3) (32,3) 

(3,0) (13,0) (23,0) (33,0) 
(3,1) (13,1) (23,1) (33,1) 
(3,2) (13,2) (23,2) (33,2) 
(3,3) (13,3) (23,3) (33,3) 


[1]    2457 segmentation fault (core dumped)  ./hipfft_3d_z2z

It's failing at hipfftExecZ2Z().

debug

When I try to debug, I'm getting this:

/opt/rocm/bin/rocgdb ./hipfft_3d_z2z
GNU gdb (GDB) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./hipfft_3d_z2z...
(gdb) run
Starting program: /home/tinux/dev/cpp/ROCm/fft/hipfft_3d_z2z 
warning: The version of the kernel driver does not match the version required by the ROCm debugger library
warning: Could not attach to process 2612
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
hipfft 3D double-precision complex-to-complex transform
[New Thread 0x7fffef53d700 (LWP 2616)]
[New Thread 0x7fffed1ff700 (LWP 2617)]
[Thread 0x7fffed1ff700 (LWP 2617) exited]
[New Thread 0x7fffed77f700 (LWP 2618)]
input:
(0,0) (10,0) (20,0) (30,0) 
(0,1) (10,1) (20,1) (30,1) 
(0,2) (10,2) (20,2) (30,2) 
(0,3) (10,3) (20,3) (30,3) 

(1,0) (11,0) (21,0) (31,0) 
(1,1) (11,1) (21,1) (31,1) 
(1,2) (11,2) (21,2) (31,2) 
(1,3) (11,3) (21,3) (31,3) 

(2,0) (12,0) (22,0) (32,0) 
(2,1) (12,1) (22,1) (32,1) 
(2,2) (12,2) (22,2) (32,2) 
(2,3) (12,3) (22,3) (32,3) 

(3,0) (13,0) (23,0) (33,0) 
(3,1) (13,1) (23,1) (33,1) 
(3,2) (13,2) (23,2) (33,2) 
(3,3) (13,3) (23,3) (33,3) 



Thread 1 "hipfft_3d_z2z" received signal SIGSEGV, Segmentation fault.
0x00007ffff7c6ee21 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.3
(gdb) backtrace 
#0  0x00007ffff7c6ee21 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.3
#1  0x00007ffff7c76985 in hipLaunchKernel () from /opt/rocm/hip/lib/libamdhip64.so.3
#2  0x00007fffef8e33fd in rocfft_internal_dfn_dp_ci_ci_stoc_4 () from /opt/rocm/lib/librocfft-device.so.0
#3  0x00007ffff7e06fbf in ?? () from /opt/rocm/lib/librocfft.so.0
#4  0x00007ffff7e00b22 in rocfft_execute () from /opt/rocm/lib/librocfft.so.0
#5  0x00007ffff7e03393 in hipfftExecZ2Z () from /opt/rocm/lib/librocfft.so.0
#6  0x0000000000401878 in main () at hipfft_3d_z2z.cpp:106

system

arch linux
kernel 5.7.2-arch1-1

hardware
GPU	AMD Radeon VII (Vega20)
CPU	AMD Ryzen 5 3600X

Software	version
ROCK	v3.5.0
ROCR	v3.5.0
HIPHCC-ROCCLR	v3.5.0
ROCFFT	v3.5.0

other things I have tried

I also came across #224 and therefore tried the docs/samples/complex_1d.cpp sample mentioned there, but it also failed with a segmentation fault.

I have no idea what else to try at the moment. Any ideas?

Incorrect result with even length C2R inverse transform

1D complex-to-real inverse FFT of length 8
Input: [0.+0.j 0.+1.j 0.+2.j 0.+3.j 0.+4.j]
Expected output: [ 0. -1.20710678 0.5 -0.20710678 0. 0.20710678
-0.5 1.20710678]
Observed output: [ 0.5 -0.7071068 1. 0.29289317 0.5 0.7071068 0. 1.7071068 ]

This happens both with the 3.0 release and with the current develop branch.
It only happens with even lengths, and only when the last element of the input has a nonzero imaginary part.
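A CPU reference for this case takes only a few lines. The sketch below is plain C++ with a hypothetical function name. It scales by 1/n to match the expected output quoted above, since rocFFT transforms are unnormalized. Like numpy's irfft, it ignores the imaginary parts of the DC and Nyquist terms.

In the report, every observed value is exactly 0.5 (= 4/8) larger than the expected one. That is consistent with, but does not prove, the imaginary part of the last (Nyquist) element being added to the DC term.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Reference 1D complex-to-real inverse DFT of even length n, scaled by 1/n.
// `half` holds the n/2+1 non-redundant coefficients X[0..n/2].
// The imaginary parts of X[0] (DC) and X[n/2] (Nyquist) are ignored.
std::vector<double> irfft_reference(const std::vector<std::complex<double>>& half, std::size_t n)
{
    assert(n >= 2 && n % 2 == 0 && half.size() == n / 2 + 1);
    const double pi = std::acos(-1.0);
    std::vector<double> out(n);
    for(std::size_t m = 0; m < n; ++m)
    {
        const double sign = (m % 2 == 0) ? 1.0 : -1.0; // exp(i*pi*m)
        double acc = half[0].real() + sign * half[n / 2].real();
        for(std::size_t k = 1; k < n / 2; ++k)
        {
            const double angle = 2.0 * pi * double((k * m) % n) / double(n);
            // X[k] and its mirror X[n-k] = conj(X[k]) contribute 2*Re(X[k] * e^{i*angle}).
            acc += 2.0 * (half[k] * std::polar(1.0, angle)).real();
        }
        out[m] = acc / double(n);
    }
    return out;
}
```

Applied to the report's input [0, 1i, 2i, 3i, 4i] with n = 8, this returns the expected output listed above.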

Large Sizes Segfault

What is the expected behavior

No crash

What actually happens

Segfault with:

#0  0x00007f636fcdbc6f in TransformPowX(ExecPlan const&, void**, void**, rocfft_execution_info_t*) () from /opt/rocm/lib/librocfft.so.0
#1  0x00007f636fcd42cd in rocfft_execute () from /opt/rocm/lib/librocfft.so.0

How to reproduce

Take the FFT example from https://rocfft.readthedocs.io/en/latest/library.html#example, change N to 8192, and compile and run it.

Environment

Hardware	description
GPU	gfx906 (Marketing Name: Vega 20, Vendor Name: AMD)
CPU	Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Software	version
ROCK	2.7.22
ROCR	1.1.9-99-g835b876
HCC	2.7.19315
Library	0.9.5.697-rocm-rel-2.7-22-ed7760e

Cannot build rocFFT with latest HCC and HIP

What is the expected behavior

Should build correctly.

What actually happens

The build fails inside the HIP headers, likely because of this nullptr redefinition: https://github.com/ROCmSoftwarePlatform/rocFFT/blob/222db7625e5a344b7fe3bb0d0cbb0ab0b07681fe/library/src/include/plan.h#L15
This is the build log:

[  2%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
[  4%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[  6%] Linking CXX executable rocfft-kernel-generator
[  6%] Built target rocfft-kernel-generator
[  9%] Generator producing device kernels for rocfft-device
Scanning dependencies of target rocfft-device
[ 11%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double_large.cpp.o
[ 16%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
[ 18%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single.cpp.o
[ 18%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/function_pool.cpp.o
[ 20%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o
[ 23%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double.cpp.o
[ 25%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single_large.cpp.o
[ 27%] Linking CXX shared library librocfft-device.so
[ 27%] Built target rocfft-device
Scanning dependencies of target rocfft
[ 32%] Building CXX object library/src/CMakeFiles/rocfft.dir/auxiliary.cpp.o
[ 32%] Building CXX object library/src/CMakeFiles/rocfft.dir/transform.cpp.o
[ 34%] Building CXX object library/src/CMakeFiles/rocfft.dir/powX.cpp.o
[ 37%] Building CXX object library/src/CMakeFiles/rocfft.dir/plan.cpp.o
[ 39%] Building CXX object library/src/CMakeFiles/rocfft.dir/hipfft.cpp.o
[ 41%] Building CXX object library/src/CMakeFiles/rocfft.dir/kargs.cpp.o
[ 44%] Building CXX object library/src/CMakeFiles/rocfft.dir/get_radix.cpp.o
[ 46%] Building CXX object library/src/CMakeFiles/rocfft.dir/repo.cpp.o
[ 48%] Building CXX object library/src/CMakeFiles/rocfft.dir/twiddles.cpp.o
In file included from /root/rocFFT/library/src/powX.cpp:17:
In file included from /root/rocFFT/library/src/include/kernel_launch.h:12:
In file included from /root/rocFFT/library/src/include/rocfft_hip.h:9:
In file included from /opt/rocm/hip/include/hip/hip_runtime.h:55:
In file included from /opt/rocm/hip/include/hip/hcc_detail/hip_runtime.h:84:
In file included from /opt/rocm/hip/include/hip/hcc_detail/grid_launch_GGL.hpp:26:
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:76:13: error: no matching function for call
      to 'round_up_to_next_multiple_nonnegative'
            round_up_to_next_multiple_nonnegative(kernarg.size(), alignof(T)) +
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:53:7: note: candidate template ignored:
      substitution failure [with T = unsigned long, $1 = nullptr]: null non-type template argument must be
      cast to template parameter type 'typename std::enable_if<std::is_integral<unsigned long>{}>::type *'
      (aka 'void *')
    T round_up_to_next_multiple_nonnegative(T x, T y)
      ^
1 error generated.
In file included from /root/rocFFT/library/src/powX.cpp:17:
In file included from /root/rocFFT/library/src/include/kernel_launch.h:12:
In file included from /root/rocFFT/library/src/include/rocfft_hip.h:9:
In file included from /opt/rocm/hip/include/hip/hip_runtime.h:55:
In file included from /opt/rocm/hip/include/hip/hcc_detail/hip_runtime.h:84:
In file included from /opt/rocm/hip/include/hip/hcc_detail/grid_launch_GGL.hpp:26:
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:76:13: error: no matching function for call
      to 'round_up_to_next_multiple_nonnegative'
            round_up_to_next_multiple_nonnegative(kernarg.size(), alignof(T)) +
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:53:7: note: candidate template ignored:
      substitution failure [with T = unsigned long, $1 = nullptr]: null non-type template argument must be
      cast to template parameter type 'typename std::enable_if<std::is_integral<unsigned long>{}>::type *'
      (aka 'void *')
    T round_up_to_next_multiple_nonnegative(T x, T y)
      ^
1 error generated.
library/src/CMakeFiles/rocfft.dir/build.make:182: recipe for target 'library/src/CMakeFiles/rocfft.dir/powX.cpp.o' failed
make[2]: *** [library/src/CMakeFiles/rocfft.dir/powX.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:105: recipe for target 'library/src/CMakeFiles/rocfft.dir/all' failed
make[1]: *** [library/src/CMakeFiles/rocfft.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

How to reproduce

Pull latest HCC clang_tot_upgrade branch from github and build/install it
Pull latest HIP master branch from github and build/install it
Pull latest rocFFT develop branch and build it for error.

Environment

Hardware	description
GPU	Fiji [Radeon R9 FURY / NANO Series] (gfx803)
CPU	Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz

Software	version
ROCK	v0.0
ROCR	v0.0
HCC	top of clang_tot_upgrade branch
HIP	top of master
Library	v0.0
Everything else is from ROCm in http://repo.radeon.com/rocm/apt/debian/

out of place transform not working

What is the expected behavior

Single-precision complex-to-complex transforms using hipFFT should produce the same results out of place as in place.

What actually happens

I modified the sample script hipfft_3d_z2z.cpp to do the FFT out-of-place.

I get

output:
(-23.8431,-23.8431) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 

(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 

(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 

(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0)

with the out-of-place transform, but I expect the same result as for the in-place transform, which is

output:
(1056,96) (-320,320) (-320,0) (-320,-320) 
(-32,-32) (0,0) (0,0) (0,0) 
(0,-32) (0,0) (0,0) (0,0) 
(32,-32) (0,0) (0,0) (0,0) 

(-32,32) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 

(-32,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 

(-32,-32) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0) 
(0,0) (0,0) (0,0) (0,0)
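The expected in-place values can be cross-checked with a naive CPU DFT. This plain C++ sketch uses hypothetical helper names and is not hipFFT code. It mirrors three things from the sample:
- the indexing pos = (x * Ny + y) * Nz + z;
- the initialization, with real part x + 10z and imaginary part y;
- the forward sign convention, an exponent of -2*pi*i matching direction = HIPFFT_FORWARD.

For the 4 x 4 x 4 case it reproduces the leading entries of the expected output, such as (1056,96), (-320,320), (-320,0) and (-320,-320).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive forward 3D DFT (exponent -2*pi*i, unnormalized) of an Nx x Ny x Nz array
// stored as in the sample: pos = (x * Ny + y) * Nz + z.
std::vector<std::complex<float>> naive_dft3d(const std::vector<std::complex<float>>& in,
                                             int Nx, int Ny, int Nz)
{
    assert(static_cast<std::size_t>(Nx) * Ny * Nz == in.size());
    const double pi = std::acos(-1.0);
    std::vector<std::complex<float>> out(in.size());
    for(int kx = 0; kx < Nx; ++kx)
        for(int ky = 0; ky < Ny; ++ky)
            for(int kz = 0; kz < Nz; ++kz)
            {
                std::complex<double> acc = 0.0; // accumulate in double, store as float
                for(int x = 0; x < Nx; ++x)
                    for(int y = 0; y < Ny; ++y)
                        for(int z = 0; z < Nz; ++z)
                        {
                            const double phase = -2.0 * pi
                                * (double(kx * x) / Nx + double(ky * y) / Ny + double(kz * z) / Nz);
                            acc += std::complex<double>(in[(x * Ny + y) * Nz + z]) * std::polar(1.0, phase);
                        }
                out[(kx * Ny + ky) * Nz + kz] = std::complex<float>(acc);
            }
    return out;
}

// Fills the input the same way as the sample's initdata kernel.
std::vector<std::complex<float>> sample_input(int Nx, int Ny, int Nz)
{
    std::vector<std::complex<float>> v(static_cast<std::size_t>(Nx) * Ny * Nz);
    for(int x = 0; x < Nx; ++x)
        for(int y = 0; y < Ny; ++y)
            for(int z = 0; z < Nz; ++z)
                v[(x * Ny + y) * Nz + z] = std::complex<float>(float(x + 10 * z), float(y));
    return v;
}
```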

How to reproduce

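#include <cassert>
#include <complex>
#include <iostream>
#include <vector>

#include <hip/hip_runtime.h>
#include <hipfft.h>

// Kernel for initializing the complex-valued input data on the GPU.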
__global__ void initdata(hipfftComplex* x, const int Nx, const int Ny, const int Nz)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int idy = blockIdx.y * blockDim.y + threadIdx.y;
    const int idz = blockIdx.z * blockDim.z + threadIdx.z;
    if(idx < Nx && idy < Ny && idz < Nz)
    {
        const int pos = (idx * Ny + idy) * Nz + idz;
        x[pos].x      = idx + 10 * idz;
        x[pos].y      = idy;
    }
}

// Helper function for determining grid dimensions
template <typename Tint1, typename Tint2>
Tint1 ceildiv(const Tint1 nominator, const Tint2 denominator)
{
    return (nominator + denominator - 1) / denominator;
}

int main()
{
    std::cout << "hipfft 3D single-precision complex-to-complex transform (out of place)\n";

    const int Nx        = 4;
    const int Ny        = 4;
    const int Nz        = 4;
    int       direction = HIPFFT_FORWARD; // forward=-1, backward=1

    std::vector<std::complex<float>> cdata(Nx * Ny * Nz);
    size_t complex_bytes = sizeof(decltype(cdata)::value_type) * cdata.size();

    // Create HIP device object and copy data to device:
    // hipfftComplex for single-precision
    hipfftComplex* x;
    hipMalloc(&x, complex_bytes);

    hipfftComplex* y;
    hipMalloc(&y, complex_bytes);

    // Initialize the data on the device
    hipError_t rt;
    const dim3 blockdim(8, 8, 8);
    const dim3 griddim(ceildiv(Nx, blockdim.x), ceildiv(Ny, blockdim.y), ceildiv(Nz, blockdim.z));
    // hipLaunchKernelGGL takes the grid dimensions before the block dimensions
    hipLaunchKernelGGL(initdata, griddim, blockdim, 0, 0, x, Nx, Ny, Nz);
    hipDeviceSynchronize();
    rt = hipGetLastError();
    assert(rt == hipSuccess);

    std::cout << "input:\n";
    hipMemcpy(cdata.data(), x, complex_bytes, hipMemcpyDefault);
    for(int i = 0; i < Nx; i++)
    {
        for(int j = 0; j < Ny; j++)
        {
            for(int k = 0; k < Nz; k++)
            {
                int pos = (i * Ny + j) * Nz + k;
                std::cout << cdata[pos] << " ";
            }
            std::cout << "\n";
        }
        std::cout << "\n";
    }
    std::cout << std::endl;

    // Create plan
    hipfftResult rc   = HIPFFT_SUCCESS;
    hipfftHandle plan = NULL;
    rc                = hipfftCreate(&plan);
    assert(rc == HIPFFT_SUCCESS);
    rc = hipfftPlan3d(&plan, // plan handle
                      Nx, // transform length
                      Ny, // transform length
                      Nz, // transform length
                      HIPFFT_C2C); // transform type (HIPFFT_C2C for single-precision)
    assert(rc == HIPFFT_SUCCESS);

    // Execute plan
    // hipfftExecZ2Z: double precision, hipfftExecC2C: single precision
    rc = hipfftExecC2C(plan, x, y, direction);
    assert(rc == HIPFFT_SUCCESS);

    std::cout << "output:\n";
    hipMemcpy(cdata.data(), y, complex_bytes, hipMemcpyDeviceToHost);
    for(int i = 0; i < Nx; i++)
    {
        for(int j = 0; j < Ny; j++)
        {
            for(int k = 0; k < Nz; k++)
            {
                int pos = (i * Ny + j) * Nz + k;
                std::cout << cdata[pos] << " ";
            }
            std::cout << "\n";
        }
        std::cout << "\n";
    }
    std::cout << std::endl;

    hipfftDestroy(plan);
    hipFree(x);
    hipFree(y);
    return 0;
}

Environment

Hardware	description
GPU	Vega 10 [Radeon Instinct MI25]
CPU	AMD EPYC 7601 32-Core

Software	version
HCC	10.0
Library	0.9.7.722-rocm-rel-2.9-6-e3055e1

rocFFT fails to build without device mounted

What is the expected behavior

Build successfully

What actually happens

When the GPU device is not mounted in the docker container, the build fails at the step where the generator produces the device kernels. If I mount the GPU, the build succeeds.
Why does it need a GPU to build?

Scanning dependencies of target rocfft-kernel-generator
[  2%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[  5%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
[  7%] Linking CXX executable rocfft-kernel-generator
[  7%] Built target rocfft-kernel-generator
[ 10%] Generator producing device kernels for rocfft-device
terminate called after throwing an instance of 'ihipException'
  what():  std::exception
Aborted (core dumped)
library/src/device/CMakeFiles/rocfft-device.dir/build.make:61: recipe for target 'library/src/device/kernel_launch_generator.h' failed
make[2]: *** [library/src/device/kernel_launch_generator.h] Error 134
CMakeFiles/Makefile2:162: recipe for target 'library/src/device/CMakeFiles/rocfft-device.dir/all' failed
make[1]: *** [library/src/device/CMakeFiles/rocfft-device.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

How to reproduce

Grab docker image from http://compute-artifactory.amd.com/artifactory/rocm-JIRA-temp/ROCMOPS-415/rocFFThipclang.tar
docker load -i rocFFThipclang.tar
docker run -it bash
./install.sh -icd --hip-clang

However, if you run docker with the options --device=/dev/kfd --device=/dev/dri --group-add video, the build succeeds.

Environment

Hardware	description
GPU	gfx900, N/A for container
CPU	AMD Ryzen 7 1800X

Software	version
ROCK
ROCR
HCC
Library

Samples fail ROCm 2.8, GFX7, FX CPU.

What is the expected behavior

Unknown

What actually happens

Exception raised, GDB session:

r
Starting program: /tmp/staging/fixed-16-double 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffea98e700 (LWP 31174)]
terminate called after throwing an instance of 'std::runtime_error'
  what():  Undefined __global__ function.

Thread 1 "fixed-16-double" received signal SIGABRT, Aborted.
0x00007ffff7189a01 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7189a01 in raise () from /lib64/libc.so.6
#1  0x00007ffff7173535 in abort () from /lib64/libc.so.6
#2  0x00007ffff73d7703 in __gnu_cxx::__verbose_terminate_handler () at /var/tmp/portage/sys-devel/gcc-9.2.0-r1/work/gcc-9.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff7407796 in __cxxabiv1::__terminate (handler=<optimized out>) at /var/tmp/portage/sys-devel/gcc-9.2.0-r1/work/gcc-9.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007ffff74077e1 in std::terminate () at /var/tmp/portage/sys-devel/gcc-9.2.0-r1/work/gcc-9.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007ffff7407a13 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff759c0f8 <typeinfo for std::runtime_error>, dest=0x7ffff741d420 <std::runtime_error::~runtime_error()>)
    at /var/tmp/portage/sys-devel/gcc-9.2.0-r1/work/gcc-9.2.0/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00007ffff75e55cd in hip_impl::hip_throw (ex=...) at /var/tmp/portage/sys-devel/hip-2.8.0/work/HIP-roc-2.8.0/src/hip_hcc.cpp:2507
#7  0x00007ffff75c2ff7 in hip_impl::program_state_impl::kernargs_size_align (this=0x45ca50, kernel=140737293554832) at /var/tmp/portage/sys-devel/hip-2.8.0/work/HIP-roc-2.8.0/src/program_state.inl:816
#8  0x00007ffff75c2c79 in hip_impl::program_state::get_kernargs_size_align (this=<optimized out>, kernel=140737488342368) at /var/tmp/portage/sys-devel/hip-2.8.0/work/HIP-roc-2.8.0/src/program_state.cpp:82
#9  0x00007ffff4641431 in hip_impl::make_kernarg<HIP_vector_type<double, 2u> const*, unsigned long, unsigned long const*, unsigned long const*, unsigned long, HIP_vector_type<double, 2u>*, HIP_vector_type<double, 2u> const*, unsigned long, unsigned long const*, unsigned long const*, unsigned long, HIP_vector_type<double, 2u>*> (kernel=
    0x7ffff4639490 <fft_fwd_ip_len16<HIP_vector_type<double, 2u>, (StrideBin)0>(HIP_vector_type<double, 2u> const*, unsigned long, unsigned long const*, unsigned long const*, unsigned long, HIP_vector_type<double, 2u>*)>, actuals=...)
    at /usr/lib/hip/include/hip/hcc_detail/functional_grid_launch.hpp:102
#10 0x00007ffff46393df in hipLaunchKernelGGL<HIP_vector_type<double, 2u> const*, unsigned long, unsigned long const*, unsigned long const*, unsigned long, HIP_vector_type<double, 2u>*, void (*)(HIP_vector_type<double, 2u> const*, unsigned long, unsigned long const*, unsigned long const*, unsigned long, HIP_vector_type<double, 2u>*)> (kernel=0x7fffffffcd60, numBlocks=..., dimBlocks=..., sharedMemBytes=0, stream=<optimized out>, args=<optimized out>, args=<optimized out>, 
    args=<optimized out>, args=<optimized out>, args=<optimized out>, args=<optimized out>) at /usr/lib/hip/include/hip/hcc_detail/functional_grid_launch.hpp:165
#11 rocfft_internal_dfn_dp_ci_ci_stoc_16 (data_p=<optimized out>, back_p=<optimized out>) at library/src/device/kernel_launch_double_0.cpp.h:28
#12 0x00007ffff7f2efc5 in TransformPowX (execPlan=..., in_buffer=0x7fffffffd3b8, out_buffer=0x7fffffffd3b8, info=0x0) at /usr/src/debug/sci-libs/rocFFT-2.8.0/rocFFT-rocm-2.8/library/src/powX.cpp:319
#13 0x00007ffff7f2896d in rocfft_execute (plan=0x0, in_buffer=0x7fffffffcd60, out_buffer=<optimized out>, info=0x7ffff7189a01 <raise+321>) at /usr/src/debug/sci-libs/rocFFT-2.8.0/rocFFT-rocm-2.8/library/src/transform.cpp:94
#14 0x0000000000401428 in main () at /var/tmp/portage/sci-libs/rocFFT-2.8.0/work/rocFFT-rocm-2.8/clients/samples/fixed-16/fixed-16-double.cpp:82

How to reproduce

Just run on older hardware

Environment

Hardware	description
GPU	Hawaii PRO [Radeon R9 290/390]
CPU	AMD FX(tm)-8350 Eight-Core Processor

Software	version
ROCK	v2.8
ROCR	v2.8
HCC	v2.8
Library	v2.8

I don't know if this is meant to work at all, but each of the samples fails with the same message.

I'm just experimenting with this to see if my machine can get it all to work.

Multiple plan create-execute-destroy cycles lead to a segfault

What is the expected behavior

The program terminates normally.

What actually happens

segfault

How to reproduce

Run this example main function (https://github.com/RadeonOpenCompute/rocFFT/blob/77845e8f3960d9548f1bc61f71b912bfbd62787d/README.md#example) several times. See the gist for the code and gdb output: https://gist.github.com/sklam/a689af4b0d73ae65fdb4c2921f797901.
Using this commit: 77845e8

Environment

Hardware	description
GPU	gfx803
CPU	Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

Software	version
ROCm	v1.3
HCC	clang version 3.5.0 (based on HCC 0.10.16464-0779319-06c8b76 LLVM 3.5.0svn)

Linking librocfft.so fails

What is the expected behavior

Build correctly.

What actually happens

I am getting this error message:

[100%] Linking CXX shared library librocfft.so
ar: /root/rocFFT/build/release/library/src/device/librocfft-device.so.0.3.0.0-54be23a: File format not recognized
ls: cannot access '/tmp/tmp.ccMFwNMSbU/librocfft-device.so.0.3.0.0-54be23a/*.o': No such file or directory

How to reproduce

Install the latest HCC and HIP, then build rocFFT using these steps:
git clone rocFFT and check out the latest develop branch
cd rocFFT; mkdir build; cd build; mkdir release; cd release
CXX=/opt/rocm/bin/hcc cmake ../..
make -j16 <-- errors here at linking librocfft.so

Environment

Hardware	description
GPU	Fiji [Radeon R9 FURY / NANO Series]
CPU	Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz

Software	version
ROCK	v0.0
ROCR	v0.0
HCC	clang_tot_upgrade branch
Library	v0.0
HIP	master branch

Unable to locate package rocfft

What is the expected behavior

rocfft is installed.

What actually happens

E: Unable to locate package rocfft

How to reproduce

max@supernova:~$ cat /etc/apt/sources.list.d/rocm.list
deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main
max@supernova:~$ sudo apt install rocfft
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package rocfft

On the other hand, I was able to install rocm-dkms and rocm-opencl-dev from the ROCm repo successfully:

max@supernova:~$ sudo apt-get install rocm-dkms rocm-opencl-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
rocm-dkms is already the newest version (1.7.60).
rocm-opencl-dev is already the newest version (1.2.0-2017121952).

Environment

max@supernova:~$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

Cannot Compile for NVIDIA

What is the expected behavior

Following the build instructions produces a successful build.

What actually happens

nvcc can't handle the -std=c++1y flag.

How to reproduce

build:
mkdir build && cd build && cmake ..
CXX=hipcc make

Environment

RHEL 7.7
cuda 10.2
HIP_PLATFORM=nvcc

Hardware	description
GPU	Quadro M4000
CPU	Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

Software	version
ROCK	N/A
ROCR	N/A
HCC	2.8.19361-cbe6b65e
Library	rocfft.x86_64 0:0.9.7.722_rocm_rel_2.9_6_e3055e1-1

C API feedback

Let's use this issue to comment on the C API design.

Incorrect results for batched 2D transforms with certain sizes

What is the expected behavior

Result for a batched 2D transform of size (1, 11) should be:

(0.0000, 55.0000) (-18.7313, -5.5000) (-8.5582, -5.5000) (-4.7658, -5.5000) (-2.5118, -5.5000) (-0.7908, -5.5000) (0.7908, -5.5000) (2.5118, -5.5000) (4.7658, -5.5000) (8.5582, -5.5000) (18.7313, -5.5000)
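Because a length-1 DFT is the identity, the (1, 11) transform of the sample's input (element j of the single row is (0, j)) reduces to a 1D length-11 forward DFT, so the expected row can be cross-checked on the CPU. The sketch below is a naive O(N^2) DFT using the exp(-2*pi*i*j*k/N) forward sign convention; naive_forward_dft and sample_row are hypothetical helper names, not part of rocFFT or hipFFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) forward DFT: X[k] = sum over j of x[j] * exp(-2*pi*i*j*k/N).
// Meant only as a CPU cross-check, not as a fast implementation.
std::vector<std::complex<double>> naive_forward_dft(const std::vector<std::complex<double>>& x)
{
    const std::size_t n  = x.size();
    const double      pi = std::acos(-1.0);
    std::vector<std::complex<double>> out(n);
    for(std::size_t k = 0; k < n; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for(std::size_t j = 0; j < n; ++j)
        {
            // Reduce j*k modulo n so the twiddle angle stays within one period.
            const double angle
                = -2.0 * pi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Input used by the sample for n = {1, 11}: element j of the single row is (0, j).
std::vector<std::complex<double>> sample_row(std::size_t len)
{
    assert(len > 0);
    std::vector<std::complex<double>> x(len);
    for(std::size_t j = 0; j < len; ++j)
        x[j] = std::complex<double>(0.0, static_cast<double>(j));
    return x;
}
```

For this input the sum has a closed form: X[0] = (0, 55) and X[k] = (-5.5 * cot(pi*k/11), -5.5) for k != 0, which is the expected row above (for example, -5.5 * cot(pi/11) is about -18.7313).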

What actually happens

The results computed with hipfft_planmany_2d_z2z.cpp from the provided samples with dimensions adjusted:

(1,6.66134e-16) (2.95949,-0.281733) (3.72514,-2.42113) (0.133655,-4.03178) (-1.72296,1.55329) (4.28922,-0.375205) (-2.28922,0.375205) (3.72296,-1.55329) (1.86635,4.03178) (-1.72514,2.42113) (-0.959493,0.281733)

Environment

Hardware	description
GPU	Radeon PRO WX 9100
CPU	AMD EPYC 7501

Software	version
ROCm	v2.6
HCC	v1.3.19242
HIP	v1.5.19255
rocFFT	Latest git version `4628e54`

How to reproduce

I've only encountered the issue with sizes 11 and 13 so far. Other sizes such as 10 work correctly.
The modified sample file for size 11:

/******************************************************************************
* Copyright (c) 2019 - present Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*******************************************************************************/

#include <algorithm>
#include <complex>
#include <hip/hip_runtime.h>
#include <hipfft.h>
#include <iostream>
#include <vector>

int main()
{
    std::cout << "2D double-precision complex-to-complex transform using advanced interface\n";

    int rank    = 2;
    int n[2]    = {1, 11};
    int howmany = 1;

    // array is contiguous in memory
    int istride = 1;
    // in-place transforms require istride=ostride
    int ostride = istride;

    // we choose to have no padding around our data:
    int inembed[2] = {istride * n[0], istride * n[1]};
    // in-place transforms require inembed=onembed:
    int onembed[2] = {inembed[0], inembed[1]};

    int idist = inembed[0] * inembed[1];
    int odist = onembed[0] * onembed[1];

    std::cout << "n: " << n[0] << " " << n[1] << "\n"
              << "howmany: " << howmany << "\n"
              << "istride: " << istride << "\tostride: " << ostride << "\n"
              << "inembed: " << inembed[0] << " " << inembed[1] << "\n"
              << "onembed: " << onembed[0] << " " << onembed[1] << "\n"
              << "idist: " << idist << "\todist: " << odist << "\n"
              << std::endl;

    std::vector<std::complex<double>> data(howmany * idist);
    const auto total_bytes = data.size() * sizeof(decltype(data)::value_type);

    std::cout << "input:\n";
    std::fill(data.begin(), data.end(), 0.0);
    for(int ibatch = 0; ibatch < howmany; ++ibatch)
    {
        for(int i = 0; i < n[0]; i++)
        {
            for(int j = 0; j < n[1]; j++)
            {
                const auto pos = ibatch * idist + istride * (i * inembed[1] + j);
                data[pos]      = std::complex<double>(i + ibatch, j);
            }
        }
    }
    for(int ibatch = 0; ibatch < howmany; ++ibatch)
    {
        std::cout << "batch: " << ibatch << "\n";
        for(int i = 0; i < inembed[0]; i++)
        {
            for(int j = 0; j < inembed[1]; j++)
            {
                const auto pos = ibatch * idist + i * inembed[1] + j;
                std::cout << data[pos] << " ";
            }
            std::cout << "\n";
        }
        std::cout << "\n";
    }
    std::cout << std::endl;

    hipfftHandle hipPlan;
    hipfftResult result;
    result = hipfftPlanMany(
        &hipPlan, rank, n, inembed, istride, idist, onembed, ostride, odist, HIPFFT_Z2Z, howmany);
    if(result != HIPFFT_SUCCESS)
    {
        std::cerr << "hipfftPlanMany failed\n";
        return 1;
    }

    hipfftDoubleComplex* d_in_out;
    hipMalloc((void**)&d_in_out, total_bytes);
    hipMemcpy(d_in_out, (void*)data.data(), total_bytes, hipMemcpyHostToDevice);

    result = hipfftExecZ2Z(hipPlan, d_in_out, d_in_out, HIPFFT_FORWARD);
    if(result != HIPFFT_SUCCESS)
    {
        std::cerr << "hipfftExecZ2Z failed\n";
        return 1;
    }

    hipMemcpy((void*)data.data(), d_in_out, total_bytes, hipMemcpyDeviceToHost);

    std::cout << "output:\n";
    for(int ibatch = 0; ibatch < howmany; ++ibatch)
    {
        std::cout << "batch: " << ibatch << "\n";
        for(int i = 0; i < onembed[0]; i++)
        {
            for(int j = 0; j < onembed[1]; j++)
            {
                const auto pos = ibatch * odist + i * onembed[1] + j;
                std::cout << data[pos] << " ";
            }
            std::cout << "\n";
        }
        std::cout << "\n";
    }
    std::cout << std::endl;

    hipfftDestroy(hipPlan);
    hipFree(d_in_out);
}

R2C notinplace modifies input

What is the expected behavior

A plan with rocfft_placement_notinplace should not change the input buffer.

What actually happens

For real-to-complex transforms (single and double precision), this does not seem to be the case. For example, calling

rocfft_plan_create(&plan, rocfft_placement_notinplace, rocfft_transform_type_real_forward, ...);

rocfft_execute(plan, (void**) &x, (void **) &y, NULL);

changes x.

How to reproduce

Using this example gives:

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

but they should be the same.
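For the 16-element all-ones input in this report, a correct forward R2C transform produces 16 in bin 0 and zeros in the remaining 8 Hermitian-packed bins, and a not-in-place plan should leave the input as all ones. The sketch below computes that reference on the CPU and checks whether a transform callable preserved its input. It makes no rocFFT calls; naive_r2c, R2CTransform, and preserves_input are hypothetical names, and in a real test the callable would wrap the hipMemcpy to the device, rocfft_execute, and the hipMemcpy back to the host.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

// Naive real-to-complex forward DFT that returns only the n/2 + 1 non-redundant
// outputs, i.e. the Hermitian-packed layout an R2C transform writes.
std::vector<std::complex<double>> naive_r2c(const std::vector<double>& x)
{
    assert(!x.empty());
    const std::size_t n  = x.size();
    const double      pi = std::acos(-1.0);
    std::vector<std::complex<double>> out(n / 2 + 1); // zero-initialized
    for(std::size_t k = 0; k < out.size(); ++k)
        for(std::size_t j = 0; j < n; ++j)
            out[k] += x[j]
                      * std::polar(1.0,
                                   -2.0 * pi * static_cast<double>((j * k) % n)
                                       / static_cast<double>(n));
    return out;
}

// Stand-in for the device round trip (copy in, execute, copy back).
using R2CTransform
    = std::function<void(std::vector<double>&, std::vector<std::complex<double>>&)>;

// Runs `transform` and reports whether it left its input buffer unchanged,
// which is what a rocfft_placement_notinplace plan is expected to do.
bool preserves_input(std::vector<double>&               input,
                     std::vector<std::complex<double>>& output,
                     const R2CTransform&                transform)
{
    const std::vector<double> snapshot = input;
    transform(input, output);
    return input == snapshot; // exact comparison: any changed element fails
}
```

Taking the snapshot on the host keeps the check independent of the library under test; the same comparison works for the single-precision case with float in place of double.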

Environment

Hardware	description
GPU	gfx803
CPU	Threadripper 1950X

Software	version
ROCK	v2.9
ROCR	v1.1
HCC	v2.9
Library	v0.9.7

Used the rocm docker image with kernel v5.3.7.

install.sh fails on ROCm 1.7

Running "./install.sh -i -d -c" fails with a compilation error.

What is the expected behavior

I would expect the script to run to completion without errors.

What actually happens

Here's the script output:

[ 54%] Building CXX object clients/samples/fixed-large/CMakeFiles/fixed-large-double.dir/fixed-large-double.cpp.o
Scanning dependencies of target fixed-16-float
Scanning dependencies of target rocfft-rider
[ 56%] Building CXX object clients/samples/fixed-large/CMakeFiles/fixed-large-float.dir/fixed-large-float.cpp.o
[ 59%] Building CXX object clients/samples/fixed-16/CMakeFiles/fixed-16-double.dir/fixed-16-double.cpp.o
[ 61%] Building CXX object clients/rider/CMakeFiles/rocfft-rider.dir/rider.cpp.o
[ 65%] Building CXX object clients/rider/CMakeFiles/rocfft-rider.dir/misc.cpp.o
[ 65%] Building CXX object clients/samples/fixed-16/CMakeFiles/fixed-16-float.dir/fixed-16-float.cpp.o
Scanning dependencies of target rocfft-selftest
[ 68%] Building CXX object clients/selftest/CMakeFiles/rocfft-selftest.dir/test_complex.cpp.o
Scanning dependencies of target rocfft-test
[ 70%] Building CXX object clients/tests/CMakeFiles/rocfft-test.dir/gtest_main.cpp.o
In file included from /home/rvanoo/repos/rocFFT/clients/samples/fixed-large/fixed-large-double.cpp:12:
/opt/rocm/hip/include/hip/hip_runtime_api.h:276:2: error: ("Must define exactly one of HIP_PLATFORM_HCC or HIP_PLATFORM_NVCC");
#error("Must define exactly one of HIP_PLATFORM_HCC or HIP_PLATFORM_NVCC");

Looking at the script, it seems that hcc is used to compile HIP sources, without defining HIP_PLATFORM_HCC (which is what hipcc would otherwise take care of). I tried setting the missing #define as an environment variable, but that didn't seem to get passed on to hcc.

How to reproduce

Run "./install.sh -i -d -c".

Environment

Intel Core i7 6700K, AMD Radeon R9 Nano (fiji), ROCm 1.7 stack installed from the AMD ROCm repositories (repo.radeon.com)

rocm-dkms/Ubuntu 16.04,now 1.7.60 amd64 [installed]

Linux snell 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

3D in-place complex transforms fail for large enough dimensions

For example, an 8192x4x4 transform produces an unexpectedly large error.
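The report does not say how the error is measured. A common metric for FFT accuracy, sketched below under that assumption (rocFFT's own test suite may use a different norm or threshold), is the relative L2 error between the computed transform and a reference. For a correct double-precision transform this value is normally tiny, at most on the order of machine epsilon times log2(N); an 8192x4x4 transform has 131072 points. relative_l2_error is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Relative L2 error ||result - reference|| / ||reference||.
// Falls back to the absolute L2 error if the reference is all zeros.
double relative_l2_error(const std::vector<std::complex<double>>& result,
                         const std::vector<std::complex<double>>& reference)
{
    assert(result.size() == reference.size());
    double diff_sq = 0.0;
    double ref_sq  = 0.0;
    for(std::size_t i = 0; i < reference.size(); ++i)
    {
        diff_sq += std::norm(result[i] - reference[i]); // std::norm returns |z|^2
        ref_sq += std::norm(reference[i]);
    }
    return ref_sq > 0.0 ? std::sqrt(diff_sq / ref_sq) : std::sqrt(diff_sq);
}
```

Normalizing by the reference norm makes the number comparable across sizes and input scales, which matters when checking whether only large dimensions misbehave.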

Constructor issues in real2complex on NVCC platform

What is the expected behavior

I would like to compile rocFFT/hipFFT for an NVCC platform machine, but am unable due to several issues.

What actually happens

Compilation fails when building CXX object library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o. In particular, the templates in real2complex appear to invoke constructors that do not exist in CUDA 10 to convert scalar floating-point types to CUDA vector types, specifically float to float2 and double to double2:

(N.B.: A number of incidental errors re: parsing parentheses that would not be raised if the missing constructors and operators were available have been removed from the following error log for brevity.)

[ 17%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o
cd /gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/library/src/device && /sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/bin/hipcc  -Drocfft_device_EXPORTS -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/../include -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/kernels -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/generator -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/library/src/device -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/include -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/include  --verbose -std=c++11 "-Xcompiler=''"   "-gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_30,code=sm_30" -o CMakeFiles/rocfft-device.dir/real2complex.cpp.o -c /gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp
#$ _SPACE_= 
#$ _CUDART_=cudart
#$ _HERE_=/sw/summit/cuda/10.1.168/bin
#$ _THERE_=/sw/summit/cuda/10.1.168/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/ppc64le-linux
#$ TOP=/sw/summit/cuda/10.1.168/bin/..
#$ NVVMIR_LIBRARY_DIR=/sw/summit/cuda/10.1.168/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/sw/summit/cuda/10.1.168/bin/../lib:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/lib64:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/lib:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/lib64:/sw/summit/cuda/10.1.168/lib64:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/lib:/opt/ibm/spectrum_mpi/jsm_pmix/lib
#$ PATH=/sw/summit/cuda/10.1.168/bin/../nvvm/bin:/sw/summit/cuda/10.1.168/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/bin:/sw/sources/lsf-tools/2.0/summit/bin:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/git-2.20.1-6zhngdgjqjq4qhp5lxfz6czu3qc2b5lh/bin:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/cmake-3.15.2-xit2o3iepxvqbyku77lwcugufilztu7t/bin:/sw/summit/cuda/10.1.168/bin:/opt/ibm/csm/bin:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibutils/bin:/opt/ibm/spectrum_mpi/jsm_pmix/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin:/ccs/home/belhorn/bin:/ccs/home/belhorn/.local/bin
#$ INCLUDES="-I/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/include"  
#$ LIBRARIES=  "-L/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/lib/stubs" "-L/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -std=c++11 -D__CUDA_ARCH__=300 -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/../include" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/kernels" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/generator" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/library/src/device" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/include" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/include" "-I/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/include"   -isystem "/sw/summit/cuda/10.1.168/include" -isystem "/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/include"  -D "HIP_VERSION_MAJOR=2" -D "HIP_VERSION_MINOR=10" -D "HIP_VERSION_PATCH=19455" -D "rocfft_device_EXPORTS" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=1 -D__CUDACC_VER_BUILD__=168 -include "cuda_runtime.h" "/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp" > "/tmp/tmpxft_0000894c_00000000-8_real2complex.compute_30.cpp1.ii" 
#$ cicc --c++11 --gnu_version=50500 --allow_managed --unsigned_chars   -arch compute_30 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0000894c_00000000-2_real2complex.fatbin.c" -tused -nvvmir-library "/sw/summit/cuda/10.1.168/bin/../nvvm/libdevice/libdevice.10.bc" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0000894c_00000000-3_real2complex.module_id" --orig_src_file_name "/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp" --gen_c_file_name "/tmp/tmpxft_0000894c_00000000-5_real2complex.compute_30.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000894c_00000000-5_real2complex.compute_30.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000894c_00000000-5_real2complex.compute_30.cudafe1.gpu"  "/tmp/tmpxft_0000894c_00000000-8_real2complex.compute_30.cpp1.ii" -o "/tmp/tmpxft_0000894c_00000000-5_real2complex.compute_30.ptx"
/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(389): error: no suitable constructor exists to convert from "float" to "float2"
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=float2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(390): error: no suitable constructor exists to convert from "float" to "float2"
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=float2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(395): error: no operator "*" matches these operands
            operand types are: float2 * double
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=float2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(396): error: no operator "*" matches these operands
            operand types are: float2 * double
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=float2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(439): error: no suitable constructor exists to convert from "float" to "float2"
          detected during:
            instantiation of "void real_pre_process_kernel(size_t, size_t, size_t, const Tcomplex *, size_t, Tcomplex *, size_t, const Tcomplex *) [with Tcomplex=float2]" 
(503): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(440): error: no suitable constructor exists to convert from "float" to "float2"
          detected during:
            instantiation of "void real_pre_process_kernel(size_t, size_t, size_t, const Tcomplex *, size_t, Tcomplex *, size_t, const Tcomplex *) [with Tcomplex=float2]" 
(503): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(442): error: no suitable constructor exists to convert from "float" to "float2"
          detected during:
            instantiation of "void real_pre_process_kernel(size_t, size_t, size_t, const Tcomplex *, size_t, Tcomplex *, size_t, const Tcomplex *) [with Tcomplex=float2]" 
(503): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=float2, R2C=true]" 
(560): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(389): error: no suitable constructor exists to convert from "double" to "double2"
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=double2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(390): error: no suitable constructor exists to convert from "double" to "double2"
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=double2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(395): error: no operator "*" matches these operands
            operand types are: double2 * double
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=double2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(396): error: no operator "*" matches these operands
            operand types are: double2 * double
          detected during:
            instantiation of "void real_post_process_kernel<T,IN_PLACE>(size_t, size_t, size_t, T *, size_t, T *, size_t, const T *) [with T=double2, IN_PLACE=true]" 
(483): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(439): error: no suitable constructor exists to convert from "double" to "double2"
          detected during:
            instantiation of "void real_pre_process_kernel(size_t, size_t, size_t, const Tcomplex *, size_t, Tcomplex *, size_t, const Tcomplex *) [with Tcomplex=double2]" 
(503): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(440): error: no suitable constructor exists to convert from "double" to "double2"
          detected during:
            instantiation of "void real_pre_process_kernel(size_t, size_t, size_t, const Tcomplex *, size_t, Tcomplex *, size_t, const Tcomplex *) [with Tcomplex=double2]" 
(503): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/real2complex.cpp(442): error: no suitable constructor exists to convert from "double" to "double2"
          detected during:
            instantiation of "void real_pre_process_kernel(size_t, size_t, size_t, const Tcomplex *, size_t, Tcomplex *, size_t, const Tcomplex *) [with Tcomplex=double2]" 
(503): here
            instantiation of "void real_1d_pre_post_process<Tcomplex,R2C>(size_t, size_t, Tcomplex *, Tcomplex *, Tcomplex *, size_t, size_t, size_t, size_t, size_t, hipStream_t) [with Tcomplex=double2, R2C=true]" 
(574): here
            instantiation of "void real_1d_pre_post<R2C>(const void *, void *) [with R2C=true]" 
(580): here

36 errors detected in the compilation of "/tmp/tmpxft_0000894c_00000000-8_real2complex.compute_30.cpp1.ii".
# --error 0x1 --
make[2]: *** [library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o] Error 1
make[2]: Leaving directory `/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT'
make[1]: *** [library/src/device/CMakeFiles/rocfft-device.dir/all] Error 2
make[1]: Leaving directory `/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT'
make: *** [all] Error 2

The CMakeLists on NVCC platforms also have a few issues.

Hard-coded Clang flags that GCC does not recognize are used even when Clang is not the compiler. The complete ROCm 2.10.0 stack includes Clang 10, and hcc is also based on Clang 10, but Clang 10 is incompatible with nvcc up to CUDA v10.2. Although ROCm Clang isn't very useful on our platform, it appears to be necessary for a from-scratch source build of ROCm because some incidental libraries in other components need it. So we're already building a CUDA-compatible GCC to bootstrap ROCm Clang, and it would be undesirable to install an additional Clang toolchain (3.2 <= version < 9) just for this library.

[ 12%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
cd /gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/library/src/device && /sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/bin/hipcc  -Drocfft_device_EXPORTS -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/../include -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/kernels -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/generator -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/library/src/device -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/include -I/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/include  --verbose -std=c++11 "-Xcompiler "   -fno-gpu-rdc "-gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_30,code=sm_30" -o CMakeFiles/rocfft-device.dir/transpose.cpp.o -c /gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/transpose.cpp
#$ _SPACE_= 
#$ _CUDART_=cudart
#$ _HERE_=/sw/summit/cuda/10.1.168/bin
#$ _THERE_=/sw/summit/cuda/10.1.168/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/ppc64le-linux
#$ TOP=/sw/summit/cuda/10.1.168/bin/..
#$ NVVMIR_LIBRARY_DIR=/sw/summit/cuda/10.1.168/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/sw/summit/cuda/10.1.168/bin/../lib:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/lib64:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/lib:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/lib64:/sw/summit/cuda/10.1.168/lib64:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/lib:/opt/ibm/spectrum_mpi/jsm_pmix/lib
#$ PATH=/sw/summit/cuda/10.1.168/bin/../nvvm/bin:/sw/summit/cuda/10.1.168/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/bin:/sw/sources/lsf-tools/2.0/summit/bin:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/git-2.20.1-6zhngdgjqjq4qhp5lxfz6czu3qc2b5lh/bin:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/cmake-3.15.2-xit2o3iepxvqbyku77lwcugufilztu7t/bin:/sw/summit/cuda/10.1.168/bin:/opt/ibm/csm/bin:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibutils/bin:/opt/ibm/spectrum_mpi/jsm_pmix/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin:/ccs/home/belhorn/bin:/ccs/home/belhorn/.local/bin
#$ INCLUDES="-I/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/include"  
#$ LIBRARIES=  "-L/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/lib/stubs" "-L/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -std=c++11 -D__CUDA_ARCH__=300 -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  -fno-gpu-rdc -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/../include" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/kernels" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/generator" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/library/src/device" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/include" -I"/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT/include" "-I/sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/include"   -isystem "/sw/summit/cuda/10.1.168/include" -isystem "/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/include"  -D "HIP_VERSION_MAJOR=2" -D "HIP_VERSION_MINOR=10" -D "HIP_VERSION_PATCH=19455" -D "rocfft_device_EXPORTS" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=1 -D__CUDACC_VER_BUILD__=168 -include "cuda_runtime.h" "/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/src/rocFFT/library/src/device/transpose.cpp" > "/tmp/tmpxft_000085f5_00000000-8_transpose.compute_30.cpp1.ii"
gcc: error: unrecognized command line option ‘-fno-gpu-rdc’
# --error 0x1 --
make[2]: *** [library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o] Error 1
make[2]: Leaving directory `/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT'
make[1]: *** [library/src/device/CMakeFiles/rocfft-device.dir/all] Error 2
make[1]: Leaving directory `/gpfs/alpine/stf007/proj-shared/belhorn/rocm/2.10.0/build/rocFFT'
make: *** [all] Error 2

The fix for this seems to be a simple patch:

--- a/library/src/device/CMakeLists.txt
+++ b/library/src/device/CMakeLists.txt
@@ -84,7 +84,9 @@ add_library( rocfft-device
   )   
 add_library( roc::rocfft-device ALIAS rocfft-device )
 target_compile_features( rocfft-device PRIVATE cxx_static_assert cxx_nullptr cxx_auto_type )
-target_compile_options (rocfft-device PRIVATE -fno-gpu-rdc)
+if( CMAKE_CXX_COMPILER MATCHES ".*/hcc$" OR HIP_PLATFORM STREQUAL "hip-clang")
+  target_compile_options (rocfft-device PRIVATE -fno-gpu-rdc)
+endif()
 
 # Remove this check when we no longer build with older rocm stack(ie < 1.8.2)
 if(TARGET hip::device)

Additionally, some of the host compiler options passed to nvcc appear to need quoting in case they are empty. I've patched only the options my build attempts required:

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -78,8 +78,8 @@ elseif( CMAKE_CXX_COMPILER MATCHES ".*/hipcc$" )
   message( STATUS "HIPCC compiler detected; CUDA backend selected" )
 
   set( HIP_PLATFORM "nvcc" )
-  set( CMAKE_C_COMPILE_OPTIONS_PIC "-Xcompiler ${CMAKE_C_COMPILE_OPTIONS_PIC}" )
-  set( CMAKE_CXX_COMPILE_OPTIONS_PIC "-Xcompiler ${CMAKE_CXX_COMPILE_OPTIONS_PIC}" )
+  set( CMAKE_C_COMPILE_OPTIONS_PIC "-Xcompiler='${CMAKE_C_COMPILE_OPTIONS_PIC}'" )
+  set( CMAKE_CXX_COMPILE_OPTIONS_PIC "-Xcompiler='${CMAKE_CXX_COMPILE_OPTIONS_PIC}'" )
   set( CMAKE_SHARED_LIBRARY_C_FLAGS "-Xlinker ${CMAKE_SHARED_LIBRARY_C_FLAGS}" )
   set( CMAKE_SHARED_LIBRARY_CXX_FLAGS "-Xlinker ${CMAKE_SHARED_LIBRARY_CXX_FLAGS}" )
   set( CMAKE_SHARED_LIBRARY_SONAME_C_FLAG "-Xlinker -soname," )
@@ -88,10 +88,10 @@ elseif( CMAKE_CXX_COMPILER MATCHES ".*/hipcc$" )
   set( CMAKE_SHARED_LIBRARY_RUNTIME_CXX_FLAG "-Xlinker -rpath," )
   set( CMAKE_EXECUTABLE_RUNTIME_C_FLAG "-Xlinker -rpath," )
   set( CMAKE_EXECUTABLE_RUNTIME_CXX_FLAG "-Xlinker -rpath," )
-  set( CMAKE_C_COMPILE_OPTIONS_VISIBILITY "-Xcompiler ${CMAKE_C_COMPILE_OPTIONS_VISIBILITY}" )
-  set( CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY "-Xcompiler ${CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY}" )
-  set( CMAKE_C_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN "-Xcompiler ${CMAKE_C_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN}" )
-  set( CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN "-Xcompiler ${CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN}" )
+  set( CMAKE_C_COMPILE_OPTIONS_VISIBILITY "-Xcompiler='${CMAKE_C_COMPILE_OPTIONS_VISIBILITY}'" )
+  set( CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY "-Xcompiler='${CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY}'" )
+  set( CMAKE_C_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN "-Xcompiler='${CMAKE_C_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN}'" )
+  set( CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN "-Xcompiler='${CMAKE_CXX_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN}'" )
 elseif( CMAKE_CXX_COMPILER MATCHES ".*/hcc$" )
   message( STATUS "HCC compiler set; ROCm backend selected [ CXX=/opt/rocm/bin/hcc cmake ... ]" )
 endif( )

Lastly, the CMakeLists do not pass a C++11 standard option, which is needed with toolchains where C++11 is not the default. In addition, CMake 3.15.2 cannot proceed with hipcc unless the compiler ID is declared explicitly:

  cmake                                                  \
    -DCMAKE_INSTALL_PREFIX="${ROCM_PREFIX}"              \
    -DCMAKE_BUILD_TYPE=Release                           \
    -DCMAKE_CXX_COMPILER_ID=NVIDIA                       \
    -DCMAKE_CXX_FLAGS="--verbose -std=c++11"             \
    -DCMAKE_CXX_COMPILER="${ROCM_PREFIX}/hip/bin/hipcc"  \
    -DNVCUDASAMPLES_ROOT="${CUDA_HOME}/samples"          \
    "${BUILD_PREFIX}/src/rocFFT"
  make -j 1 VERBOSE=1
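GCC releases before 6 default to the gnu++98 dialect, so the GCC 5.5.0 host compiler used here would indeed need an explicit -std=c++11. A minimal probe for checking which standard a host compiler uses by default is sketched below; `cxx_standard` and `has_cxx11` are hypothetical helper names, not part of rocFFT or CMake.

```cpp
#include <cassert>

// Reports the language standard this translation unit is compiled under.
// __cplusplus is 199711L under C++98/03 (GCC 5's default dialect is gnu++98)
// and 201103L or greater under C++11 and later.
long cxx_standard() { return __cplusplus; }

// True when the compiler is in C++11 mode or later, which this build needs.
bool has_cxx11() { return __cplusplus >= 201103L; }
```

Printing `cxx_standard()` from a small `main()` compiled with no -std flag shows the compiler's default; with GCC 5 it should report 199711.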

How to reproduce

Building with hipcc (ROCm v2.10.0) on an NVCC platform with CUDA 10.1.168. The entire ROCm stack is source-built in user space under a non-standard prefix (i.e., not /opt/rocm) because of restrictions on the machine in question (OLCF Summit). We've had requests from users who are porting their CUDA codes to HIP and want to use hipFFT.

$ hipconfig --full
HIP version  : 2.10.19455-48a7ae6a

== hipconfig
HIP_PATH     : /sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip
HIP_PLATFORM : nvcc
CPP_CONFIG   :  -D__HIP_PLATFORM_NVCC__=  -I/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/include -I/sw/summit/cuda/10.1.168/include

== nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:12:21_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

=== Environment Variables
PATH=/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/bin:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/bin:/sw/sources/lsf-tools/2.0/summit/bin:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/git-2.20.1-6zhngdgjqjq4qhp5lxfz6czu3qc2b5lh/bin:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/cmake-3.15.2-xit2o3iepxvqbyku77lwcugufilztu7t/bin:/sw/summit/cuda/10.1.168/bin:/opt/ibm/csm/bin:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibutils/bin:/opt/ibm/spectrum_mpi/jsm_pmix/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin:/ccs/home/belhorn/bin:/ccs/home/belhorn/.local/bin
CUDA_PATH=/sw/summit/cuda/10.1.168
HIP_PLATFORM=nvcc
HCC_PATH=/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hcc
HIP_CLANG_PATH=/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/bin
HSA_PATH=/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hsa
CUDA_TOOLKIT_ROOT_DIR=/sw/summit/cuda/10.1.168
LD_LIBRARY_PATH=/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/lib64:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/lib:/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/lib64:/sw/summit/cuda/10.1.168/lib64:/opt/ibm/spectrumcomputing/lsf/10.1/linux3.10-glibc2.17-ppc64le-csm/lib:/opt/ibm/spectrum_mpi/jsm_pmix/lib
CUDA_DIR=/sw/summit/cuda/10.1.168
HIP_PATH=/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip
CUDAPATH=/sw/summit/cuda/10.1.168

== Linux Kernel
Hostname     : build1
Linux build1 4.14.0-115.8.1.el7a.ppc64le #1 SMP Thu May 9 14:45:13 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
LSB Version:    :core-4.1-noarch:core-4.1-ppc64le
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.6 (Maipo)
Release:        7.6
Codename:       Maipo

$ /sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/llvm/bin/gcc --version   
gcc (GCC) 5.5.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

mkdir -p ./src ./build/rocFFT
cd ./src
# Google's `repo` tool:
repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-2.10.0 \
    && repo sync
# having already successfully installed ROCt, ROCr, ROCm-Clang, HCC, HIP, hipBLAS, rocRAND, hipSPARSE, hipCUB in the same nonstandard prefix:
cd ../build/rocFFT
  cmake                                                  \
    -DCMAKE_INSTALL_PREFIX="/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm"              \
    -DCMAKE_BUILD_TYPE=Release                           \
    -DCMAKE_CXX_COMPILER_ID=NVIDIA                       \
    -DCMAKE_CXX_FLAGS="--verbose -std=c++11"             \
    -DCMAKE_CXX_COMPILER="/sw/.testing/belhorn/summit/rocm/2.10.0/opt/rocm/hip/bin/hipcc"  \
    -DNVCUDASAMPLES_ROOT="/sw/summit/cuda/10.1.168/samples"          \
    "../src/rocFFT"
  make -j 1 VERBOSE=1

Environment

Hardware	description
GPU	Nvidia Volta V100-SXM2-16GB
CPU	IBM Power9 ppc64le

Software	version
ROCK	N/A
ROCR	v1.1.9
HCC	HCC Clang v10.0.0 based on HCC 2.10.19446-7ffdb8b9-ff0b0f5c4c-6c4a0468561
Library	ROCm 2.10.0, rocFFT v0.9.8 (commit `c03da03`)

2D C2R transform does not work if second dimension is equal to 1

What is the expected behavior

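The result of an R2C transform followed by a C2R transform should equal the input multiplied by a scaling factor (the product of the dimensions, since the transforms are unnormalized).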

What actually happens

The R2C result looks correct, but the C2R result is all zeros when the second dimension equals 1. For example, size (4, 1) fails, while (4, 2) works as expected.
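To check small cases like this independently of the GPU, a naive CPU reference of the same round trip can help. The sketch below assumes the repro's packed row-major layout (real input n0 x n1, complex output n0 x (n1/2 + 1), unit stride) and unnormalized transforms, so R2C followed by C2R should return the input multiplied by n0 * n1. The function names are hypothetical, and the O(N^2) loops are only meant for tiny sizes.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive forward R2C: keeps only k1 in [0, n1/2], like the packed hipfft output.
std::vector<cplx> naive_r2c_2d(const std::vector<double>& in, int n0, int n1) {
    assert(in.size() == static_cast<std::size_t>(n0) * n1);
    const int h = n1 / 2 + 1;
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(static_cast<std::size_t>(n0) * h);
    for (int k0 = 0; k0 < n0; ++k0)
        for (int k1 = 0; k1 < h; ++k1) {
            cplx acc = 0.0;
            for (int i = 0; i < n0; ++i)
                for (int j = 0; j < n1; ++j) {
                    const double phase = -two_pi * (double(k0) * i / n0 + double(k1) * j / n1);
                    acc += in[i * n1 + j] * std::polar(1.0, phase);
                }
            out[k0 * h + k1] = acc;
        }
    return out;
}

// Naive unnormalized C2R: rebuilds the missing half of the spectrum from
// Hermitian symmetry, X[k0][k1] = conj(X[(n0 - k0) % n0][n1 - k1]) for k1 > n1/2.
std::vector<double> naive_c2r_2d(const std::vector<cplx>& in, int n0, int n1) {
    const int h = n1 / 2 + 1;
    assert(in.size() == static_cast<std::size_t>(n0) * h);
    const double two_pi = 2.0 * std::acos(-1.0);
    auto spectrum = [&](int k0, int k1) -> cplx {
        if (k1 < h) return in[k0 * h + k1];
        return std::conj(in[((n0 - k0) % n0) * h + (n1 - k1)]);
    };
    std::vector<double> out(static_cast<std::size_t>(n0) * n1);
    for (int i = 0; i < n0; ++i)
        for (int j = 0; j < n1; ++j) {
            cplx acc = 0.0;
            for (int k0 = 0; k0 < n0; ++k0)
                for (int k1 = 0; k1 < n1; ++k1) {
                    const double phase = two_pi * (double(k0) * i / n0 + double(k1) * j / n1);
                    acc += spectrum(k0, k1) * std::polar(1.0, phase);
                }
            out[i * n1 + j] = acc.real();
        }
    return out;
}
```

With the repro's input for size (4, 1), i.e. 1, 2, 3, 4, this reference returns 4, 8, 12, 16, which is what the hipfft C2R output can be compared against.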

Environment

Hardware	description
GPU	Radeon PRO WX 9100
CPU	AMD EPYC 7501

Software	version
ROCm	v2.6
HCC	v1.3.19242
HIP	v1.5.19255
rocFFT	git version `923339e`

How to reproduce

I've modified one of the provided sample files to reproduce the failing test case from our application:

/******************************************************************************
 * Copyright (c) 2019 - present Advanced Micro Devices, Inc. All rights reserved.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 *******************************************************************************/

#include <hip/hip_runtime_api.h>
#include <hipfft.h>
#include <algorithm>
#include <complex>
#include <cstdlib>   // std::exit
#include <iostream>
#include <vector>

int main() {
  std::cout << "hipfft 2D single-precision real-to-complex transform using advanced interface\n";

  int rank = 2;
  int n[2]    = {4, 1}; // backward transform does not work
  // int n[2] = {4, 2};  // works
  int howmany = 1;    // batch size

  int n1_complex_elements = n[1] / 2 + 1;

  int istride = 1;
  int ostride = istride;
  int inembed[2] = {istride * n[0], istride * n[1]};
  int onembed[2] = {ostride * n[0], ostride * n1_complex_elements};
  int idist = inembed[0] * inembed[1];
  int odist = onembed[0] * onembed[1];

  std::cout << "n: " << n[0] << " " << n[1] << "\n"
            << "howmany: " << howmany << "\n"
            << "istride: " << istride << "\tostride: " << ostride << "\n"
            << "inembed: " << inembed[0] << " " << inembed[1] << "\n"
            << "onembed: " << onembed[0] << " " << onembed[1] << "\n"
            << "idist: " << idist << "\todist: " << odist << "\n"
            << std::endl;

  std::vector<float> data(howmany * idist);
  std::vector<hipfftComplex> data_freq(howmany * odist);
  const auto total_bytes_real = data.size() * sizeof(decltype(data)::value_type);
  const auto total_bytes_freq = data_freq.size() * sizeof(hipfftComplex);

  /*
   * Initialize data
   */
  std::fill(data.begin(), data.end(), 0.0);
  for (int ibatch = 0; ibatch < howmany; ++ibatch) {
    for (int i = 0; i < n[0]; i++) {
      for (int j = 0; j < n[1]; j++) {
        const auto pos = ibatch * idist + istride * (i * inembed[1] + j);
        data[pos] = pos + 1;
      }
    }
  }

  std::cout << "input:\n";
  for (int ibatch = 0; ibatch < howmany; ++ibatch) {
    std::cout << "batch: " << ibatch << "\n";
    for (int i = 0; i < inembed[0]; i++) {
      for (int j = 0; j < inembed[1]; j++) {
        const auto pos = ibatch * idist + i * inembed[1] + j;
        std::cout << data[pos] << " ";
      }
      std::cout << "\n";
    }
    std::cout << "\n";
  }
  std::cout << std::endl;

  hipfftHandle hipForwardPlan;
  hipfftHandle hipBackwardPlan;
  hipfftResult result;
  result = hipfftPlanMany(&hipForwardPlan, rank, n, inembed, istride, idist, onembed, ostride,
                          odist, HIPFFT_R2C, howmany);
  if(result != HIPFFT_SUCCESS) std::exit(result);

  result = hipfftPlanMany(&hipBackwardPlan, rank, n, onembed, ostride, odist, inembed, istride,
                          idist, HIPFFT_C2R, howmany);
  if(result != HIPFFT_SUCCESS) std::exit(result);

  hipfftReal* gpu_data;
  hipfftReal* gpu_data_out_1;
  hipfftReal* gpu_data_out_2;
  hipMalloc((void**)&gpu_data, total_bytes_real);
  hipMalloc((void**)&gpu_data_out_1, total_bytes_freq);
  hipMalloc((void**)&gpu_data_out_2, total_bytes_real);

  /*
   * Forward
   */
  hipMemcpy(gpu_data, (void*)data.data(), total_bytes_real, hipMemcpyHostToDevice);
  result = hipfftExecR2C(hipForwardPlan, gpu_data, (hipfftComplex*)gpu_data_out_1);
  if(result != HIPFFT_SUCCESS) std::exit(result);
  hipMemcpy((void*)data_freq.data(), gpu_data_out_1, total_bytes_freq, hipMemcpyDeviceToHost);

  std::cout << "forward transform:\n";
  const std::complex<float>* output = (std::complex<float>*)data_freq.data();
  for (int ibatch = 0; ibatch < howmany; ++ibatch) {
    std::cout << "batch: " << ibatch << "\n";
    for (int i = 0; i < onembed[0]; i++) {
      for (int j = 0; j < onembed[1]; j++) {
        const auto pos = ibatch * odist + i * onembed[1] + j;
        std::cout << output[pos] << " ";
      }
      std::cout << "\n";
    }
    std::cout << "\n";
  }
  std::cout << std::endl;

  /*
   * Backward
   */
  result = hipfftExecC2R(hipBackwardPlan, (hipfftComplex*)gpu_data_out_1, gpu_data_out_2);
  if(result != HIPFFT_SUCCESS) std::exit(result);
  hipMemcpy((void*)data.data(), gpu_data_out_2, total_bytes_real, hipMemcpyDeviceToHost);

  std::cout << "backward transform:\n";
  for (int ibatch = 0; ibatch < howmany; ++ibatch) {
    std::cout << "batch: " << ibatch << "\n";
    for (int i = 0; i < inembed[0]; i++) {
      for (int j = 0; j < inembed[1]; j++) {
        const auto pos = ibatch * idist + i * inembed[1] + j;
        std::cout << data[pos] << " ";
      }
      std::cout << "\n";
    }
    std::cout << "\n";
  }
  std::cout << std::endl;

  hipfftDestroy(hipForwardPlan);
  hipfftDestroy(hipBackwardPlan);
  hipFree(gpu_data);
  hipFree(gpu_data_out_1);
  hipFree(gpu_data_out_2);
  return 0;
}

Need hipfft interfaces to properly use hipcomplex types in HIP

What is the expected behavior

The hipfft interfaces should be implemented with the hipComplex types.

What actually happens

https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/library/include/rocfft/hipfft.h#L49

The complex types there are defined as float[2] or double[2] instead of using the data types HIP already defines.

Relevant commit in TensorFlow ROCm port:
https://github.com/ROCmSoftwarePlatform/tensorflow/commit/ed6bf17441a1cfd808cac127fff6da830f4e8ec8
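One practical reason a struct type composes better than a raw float[2] is that a C++ array cannot be returned, assigned, or passed by value, while a two-member struct can, and both normally occupy the same eight bytes. The sketch below uses hypothetical stand-in types (`array_complex`, `struct_complex`), not the actual hipfft or HIP definitions; the `static_assert` on equal size holds on mainstream ABIs but is not strictly guaranteed by the C++ standard.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for illustration only.
typedef float array_complex[2];         // the float[2] style described above
struct struct_complex { float x, y; };  // mirrors the x/y members of a float2-style type

// Both are two contiguous floats, so the storage layout matches in practice.
static_assert(sizeof(array_complex) == sizeof(struct_complex), "layouts differ");

// A struct can be returned and passed by value; a raw array cannot.
struct_complex make_complex(float re, float im) { return struct_complex{re, im}; }

struct_complex add(struct_complex a, struct_complex b) {
    return struct_complex{a.x + b.x, a.y + b.y};
}

// Converts existing float[2] data with memcpy, which avoids the
// strict-aliasing problems of a reinterpret_cast between the two types.
struct_complex from_array(const array_complex a) {
    struct_complex c;
    std::memcpy(&c, a, sizeof c);
    return c;
}
```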

Build failure on Ubuntu 17.10

From a fresh source checkout, the build fails with: objdump: /usr/lib/x86_64-linux-gnu/libm.a: File format not recognised

However, the libraries are generated in the end, so perhaps this is not a hard error but just a misleading message?

~/rocFFT/build$ find . -name "*so*"
./release/library/src/librocfft.so
./release/library/src/device/librocfft-device.so.0.8.1.0
./release/library/src/device/librocfft-device.so
./release/library/src/device/librocfft-device.so.0
./release/library/src/librocfft.so.0
./release/library/src/librocfft.so.0.8.1.0
./release/compile_commands.json

~/rocFFT$ ./install.sh -d
Creating project build directory in: ./build
Hit:1 http://archive.ubuntu.com/ubuntu artful InRelease
Hit:2 http://archive.canonical.com/ubuntu artful InRelease
Hit:3 http://repo.radeon.com/rocm/apt/debian xenial InRelease
Get:4 http://archive.ubuntu.com/ubuntu artful-updates InRelease [78.6 kB]
Get:5 http://archive.ubuntu.com/ubuntu artful-backports InRelease [72.2 kB]
Get:6 http://archive.ubuntu.com/ubuntu artful-security InRelease [78.6 kB]
Get:7 http://archive.ubuntu.com/ubuntu artful-updates/main Sources [78.8 kB]
Get:8 http://archive.ubuntu.com/ubuntu artful-updates/universe Sources [22.8 kB]
Get:9 http://archive.ubuntu.com/ubuntu artful-updates/main amd64 Packages [191 kB]
Get:10 http://archive.ubuntu.com/ubuntu artful-updates/main i386 Packages [188 kB]
Get:11 http://archive.ubuntu.com/ubuntu artful-updates/main Translation-en [87.6 kB]
Get:12 http://archive.ubuntu.com/ubuntu artful-updates/main amd64 DEP-11 Metadata [73.3 kB]
Get:13 http://archive.ubuntu.com/ubuntu artful-updates/main DEP-11 64x64 Icons [44.9 kB]
Get:14 http://archive.ubuntu.com/ubuntu artful-updates/universe i386 Packages [76.7 kB]
Get:15 http://archive.ubuntu.com/ubuntu artful-updates/universe amd64 Packages [77.2 kB]
Get:16 http://archive.ubuntu.com/ubuntu artful-updates/universe Translation-en [44.4 kB]
Get:17 http://archive.ubuntu.com/ubuntu artful-updates/universe amd64 DEP-11 Metadata [49.5 kB]
Get:18 http://archive.ubuntu.com/ubuntu artful-updates/universe DEP-11 64x64 Icons [55.3 kB]
Get:19 http://archive.ubuntu.com/ubuntu artful-backports/universe amd64 DEP-11 Metadata [4,700 B]
Get:20 http://archive.ubuntu.com/ubuntu artful-security/main Sources [34.6 kB]
Get:21 http://archive.ubuntu.com/ubuntu artful-security/universe Sources [10.2 kB]
Get:22 http://archive.ubuntu.com/ubuntu artful-security/main i386 Packages [88.8 kB]
Get:23 http://archive.ubuntu.com/ubuntu artful-security/main amd64 Packages [90.5 kB]
Get:24 http://archive.ubuntu.com/ubuntu artful-security/main Translation-en [43.6 kB]
Get:25 http://archive.ubuntu.com/ubuntu artful-security/main amd64 DEP-11 Metadata [2,924 B]
Get:26 http://archive.ubuntu.com/ubuntu artful-security/universe i386 Packages [34.7 kB]
Get:27 http://archive.ubuntu.com/ubuntu artful-security/universe amd64 Packages [34.8 kB]
Get:28 http://archive.ubuntu.com/ubuntu artful-security/universe Translation-en [22.2 kB]
Get:29 http://archive.ubuntu.com/ubuntu artful-security/universe amd64 DEP-11 Metadata [10.4 kB]
Get:30 http://archive.ubuntu.com/ubuntu artful-security/universe DEP-11 64x64 Icons [10.2 kB]
Fetched 1,607 kB in 5s (305 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
12 packages can be upgraded. Run 'apt list --upgradable' to see them.
~/rocFFT ~/rocFFT
-- The CXX compiler identification is Clang 6.0.0
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HCC compiler set; ROCm backend selected [ CXX=/opt/rocm/bin/hcc cmake ... ]
-- Building with ROCm tools
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1")
-- Checking for module 'libunwind'
-- Found libunwind, version 1.1
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /home/preda/rocFFT/build/release
Scanning dependencies of target rocfft-kernel-generator
[ 8%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[ 8%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
[ 13%] Linking CXX executable rocfft-kernel-generator
[ 13%] Built target rocfft-kernel-generator
[ 17%] Generator producing device kernels for rocfft-device
Scanning dependencies of target rocfft-device
[ 21%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
[ 26%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/complex2real.cpp.o
[ 30%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double.cpp.o
[ 39%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o
[ 43%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double_large.cpp.o
[ 34%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/function_pool.cpp.o
[ 47%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single_large.cpp.o
[ 52%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single.cpp.o
[ 56%] Linking CXX shared library librocfft-device.so
objdump: /usr/lib/x86_64-linux-gnu/libm.a: File format not recognised
[ 56%] Built target rocfft-device
Scanning dependencies of target rocfft
[ 60%] Building CXX object library/src/CMakeFiles/rocfft.dir/hipfft.cpp.o
[ 73%] Building CXX object library/src/CMakeFiles/rocfft.dir/get_radix.cpp.o
[ 73%] Building CXX object library/src/CMakeFiles/rocfft.dir/repo.cpp.o
[ 78%] Building CXX object library/src/CMakeFiles/rocfft.dir/twiddles.cpp.o
[ 65%] Building CXX object library/src/CMakeFiles/rocfft.dir/transform.cpp.o
[ 82%] Building CXX object library/src/CMakeFiles/rocfft.dir/powX.cpp.o
[ 95%] Building CXX object library/src/CMakeFiles/rocfft.dir/kargs.cpp.o
[ 86%] Building CXX object library/src/CMakeFiles/rocfft.dir/plan.cpp.o
[ 91%] Building CXX object library/src/CMakeFiles/rocfft.dir/auxiliary.cpp.o
[100%] Linking CXX shared library librocfft.so
objdump: /usr/lib/x86_64-linux-gnu/libm.a: File format not recognised
[100%] Built target rocfft

The wrong result using hipfft samples

When I use hipfft in my program, the result is always 0.

So I tried the sample clients/samples/hipfft/hipfft_planmany_2d_z2z.cpp:
https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/clients/samples/hipfft/hipfft_planmany_2d_z2z.cpp
The result is below: the hipfft output differs from the FFTW output and is in fact identical to the input.

What actually happens

fftw inputs:
[+0, +0] : ( +7.00, +7.00) [+0, +1] : ( +8.00, +8.00) [+0, +2] : ( +9.00, +9.00) [+0, +3] : (+10.00, +10.00) [+0, +4] : (+11.00, +11.00) [+0, +5] : ( +0.00, +0.00)
[+1, +0] : ( +1.00, +1.00) [+1, +1] : ( +2.00, +2.00) [+1, +2] : ( +3.00, +3.00) [+1, +3] : ( +4.00, +4.00) [+1, +4] : ( +5.00, +5.00) [+1, +5] : ( +6.00, +6.00)
[+2, +0] : ( +7.00, +7.00) [+2, +1] : ( +8.00, +8.00) [+2, +2] : ( +9.00, +9.00) [+2, +3] : (+10.00, +10.00) [+2, +4] : (+11.00, +11.00) [+2, +5] : ( +0.00, +0.00)
[+3, +0] : ( +1.00, +1.00) [+3, +1] : ( +2.00, +2.00) [+3, +2] : ( +3.00, +3.00) [+3, +3] : ( +4.00, +4.00) [+3, +4] : ( +5.00, +5.00) [+3, +5] : ( +6.00, +6.00)
[+4, +0] : ( +7.00, +7.00) [+4, +1] : ( +8.00, +8.00) [+4, +2] : ( +9.00, +9.00) [+4, +3] : (+10.00, +10.00) [+4, +4] : (+11.00, +11.00) [+4, +5] : ( +0.00, +0.00)
fftw outputs:
[+0, +0] : (+177.00, +177.00) [+0, +1] : (-27.80, -38.20) [+0, +2] : (+25.52, -19.52) [+0, +3] : (+21.00, +21.00) [+0, +4] : (-19.52, +25.52) [+0, +5] : (-38.20, -27.80)
[+1, +0] : ( +3.28, +20.72) [+1, +1] : ( +8.15, -6.60) [+1, +2] : ( +9.79, +3.76) [+1, +3] : ( +1.64, +10.36) [+1, +4] : ( -8.15, +6.60) [+1, +5] : ( -9.79, -3.76)
[+2, +0] : (-24.93, +48.93) [+2, +1] : (+27.42, -1.44) [+2, +2] : (+14.96, +23.03) [+2, +3] : (-12.47, +24.47) [+2, +4] : (-27.42, +1.44) [+2, +5] : (-14.96, -23.03)
[+3, +0] : (+48.93, -24.93) [+3, +1] : (-23.03, -14.96) [+3, +2] : ( +1.44, -27.42) [+3, +3] : (+24.47, -12.47) [+3, +4] : (+23.03, +14.96) [+3, +5] : ( -1.44, +27.42)
[+4, +0] : (+20.72, +3.28) [+4, +1] : ( -3.76, -9.79) [+4, +2] : ( +6.60, -8.15) [+4, +3] : (+10.36, +1.64) [+4, +4] : ( +3.76, +9.79) [+4, +5] : ( -6.60, +8.15)
hipfft outputs:
[+0, +0] : ( +7.00, +7.00) [+0, +1] : ( +8.00, +8.00) [+0, +2] : ( +9.00, +9.00) [+0, +3] : (+10.00, +10.00) [+0, +4] : (+11.00, +11.00) [+0, +5] : ( +0.00, +0.00)
[+1, +0] : ( +1.00, +1.00) [+1, +1] : ( +2.00, +2.00) [+1, +2] : ( +3.00, +3.00) [+1, +3] : ( +4.00, +4.00) [+1, +4] : ( +5.00, +5.00) [+1, +5] : ( +6.00, +6.00)
[+2, +0] : ( +7.00, +7.00) [+2, +1] : ( +8.00, +8.00) [+2, +2] : ( +9.00, +9.00) [+2, +3] : (+10.00, +10.00) [+2, +4] : (+11.00, +11.00) [+2, +5] : ( +0.00, +0.00)
[+3, +0] : ( +1.00, +1.00) [+3, +1] : ( +2.00, +2.00) [+3, +2] : ( +3.00, +3.00) [+3, +3] : ( +4.00, +4.00) [+3, +4] : ( +5.00, +5.00) [+3, +5] : ( +6.00, +6.00)
[+4, +0] : ( +7.00, +7.00) [+4, +1] : ( +8.00, +8.00) [+4, +2] : ( +9.00, +9.00) [+4, +3] : (+10.00, +10.00) [+4, +4] : (+11.00, +11.00) [+4, +5] : ( +0.00, +0.00)
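Because the hipfft output matches the input exactly, an independent check of the expected values is useful. A naive 2D DFT on the CPU can reproduce individual FFTW values; the sketch below is a reference under stated assumptions rather than the sample's code: it uses FFTW's forward sign convention (exponent -2*pi*i), no normalization, and a hypothetical `sample_input` helper that rebuilds the 5 x 6 input printed above.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive forward 2D DFT, row-major n0 x n1, unnormalized.
// O((n0*n1)^2), so only suitable for spot checks on small inputs.
std::vector<cplx> naive_dft_2d(const std::vector<cplx>& in, int n0, int n1) {
    assert(in.size() == static_cast<std::size_t>(n0) * n1);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(in.size());
    for (int k0 = 0; k0 < n0; ++k0)
        for (int k1 = 0; k1 < n1; ++k1) {
            cplx acc = 0.0;
            for (int i = 0; i < n0; ++i)
                for (int j = 0; j < n1; ++j) {
                    const double phase = -two_pi * (double(k0) * i / n0 + double(k1) * j / n1);
                    acc += in[i * n1 + j] * std::polar(1.0, phase);
                }
            out[k0 * n1 + k1] = acc;
        }
    return out;
}

// Rebuilds the 5 x 6 input shown above: even rows hold 7..11 then 0,
// odd rows hold 1..6, with equal real and imaginary parts.
std::vector<cplx> sample_input() {
    std::vector<cplx> v;
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 6; ++j) {
            const double val = (i % 2 == 0) ? (j < 5 ? 7.0 + j : 0.0) : 1.0 + j;
            v.push_back(cplx(val, val));
        }
    return v;
}
```

It reproduces FFTW's [0, 0] = (177, 177), [0, 1] = (-27.80, -38.20) to two decimals, and [0, 3] = (21, 21).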

How to reproduce

hipcc -I/opt/rocm/include -I/work/home/wangzh/relionrocm/external/fftw/include -L/work/home/wangzh/relionrocm/external/fftw/lib -lfftw3 -lrocfft hip2fft.hip.cpp

Environment

Software	version
HIP version	: 1.5.18353

Segmentation fault when computing 1D FFTs for some array sizes

Code

The code executed is taken from the README.md of ROCmSoftwarePlatform/rocFFT at https://github.com/ROCmSoftwarePlatform/rocFFT.

What is the expected behavior

The code should work for any array size (any value of N in the code).

What actually happens

The code works only for some values of N and produces a segmentation fault for others. Below is the backtrace from gdb:

Program received signal SIGSEGV, Segmentation fault.
(gdb) where
#0  0x00002aaaac890dd4 in TransformPowX(ExecPlan const&, void**, void**, rocfft_execution_info_t*) () from /opt/rocm/lib/librocfft.so.0
#1  0x00002aaaac889d30 in rocfft_execute () from /opt/rocm/lib/librocfft.so.0
#2  0x0000000000400f74 in main () at hipsample.cpp:38

How to reproduce

Use the same code from the README.md at https://github.com/ROCmSoftwarePlatform/rocFFT and change the value of N. The code works as expected for N = 16, 20, 25, and many others, but produces a segmentation fault for N = 21, 22, 23, and many others.
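The sizes listed show a pattern: the working lengths (16 = 2^4, 20 = 2^2 * 5, 25 = 5^2) have no prime factor above 5, while each failing length (21 = 3 * 7, 22 = 2 * 11, 23) has one. That is only a hypothesis drawn from six values, not a documented rocFFT limit; a helper like the hypothetical `largest_prime_factor` below makes it easy to test more sizes against it.

```cpp
#include <cassert>
#include <cstddef>

// Returns the largest prime factor of n (1 for n == 1).
// Trial division is plenty fast for FFT-sized lengths.
std::size_t largest_prime_factor(std::size_t n) {
    assert(n > 0);
    std::size_t largest = 1;
    for (std::size_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {
            largest = p;
            n /= p;
        }
    }
    if (n > 1) largest = n;  // whatever remains is itself prime
    return largest;
}
```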

Hardware and library versions

CPU Device: AMD_EPYC_7551_32-Core_Processor
GPU Device: Radeon_MI25
Hip version: HIP_VERSION_MAJOR=1; HIP_VERSION_MINOR=5; HIP_VERSION_PATCH=19211
rocFFT Version: 0.9.3.0

Compilation and execution commands

export HIP_PLATFORM=hcc
hipcc -std=c++11 -O3 -g -c hipsample.cpp -lrocfft
hipcc -std=c++11 -O3 -g -o gpuCuda hipsample.o -lrocfft
./gpuCuda

Comments

I am not sure whether there is an issue with my environment, or what else could be causing the error.

Result of large 1D FFT with stride not equal to 1 is incorrect

What is the expected behavior

The input dataset has N rows and 2 columns, and I want to perform a 1D FFT of length N on each column.
With the dataset I construct, the result is correct when N = 2048, as shown below:
batch = 0
[0 0] [2048.000000 0.000000 ]
Only one element is nonzero (its real part equals N); all the others are zero.
This is the correct behavior.

What actually happens

When N = 4096 or larger, the result is incorrect: almost every element is nonzero, unlike the correct behavior.

How to reproduce

The test case is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>
#include "hip/hip_runtime_api.h"
#include "rocfft.h"

#define pi 3.1415926535897932385
#define EPISION (1.0e-12)

int main(int argc, char *argv[])
{
rocfft_setup();
size_t N_row = 4096;
size_t N_col = 2 ;

if (argc>1) N_row = atoi(argv[1]);
if (argc>2) N_col = atoi(argv[2]);

size_t Nbytes = (size_t) N_row * sizeof(double) * 2 * N_col;
double * inx;
double * inx_trans;
double * outx;
double * outx_trans;
double * outx_trans2;
double * dcux;
double * dcux_trans;
double tmp;
hipError_t herr;

herr = hipHostMalloc(&inx, Nbytes);
herr = hipHostMalloc(&inx_trans, Nbytes);
if(herr != 0) 
	printf("something wrong in hipHostMalloc...\n");

herr = hipHostMalloc(&outx, Nbytes);
herr = hipHostMalloc(&outx_trans, Nbytes);
herr = hipHostMalloc(&outx_trans2, Nbytes);
if(herr != 0) 
	printf("something wrong in hipHostMalloc...\n");

herr = hipMalloc(&dcux, Nbytes);
herr = hipMalloc(&dcux_trans, Nbytes);
if(herr != 0) 
	printf("something wrong in hipMalloc...\n");

for (size_t j = 0; j < N_col; j++)
	for (size_t i = 0; i < N_row; i++)
	{
		tmp = 2.*j*i*pi/N_row;
		inx[2*i   + 2*j*N_row]  = cos(tmp);
		inx[2*i+1 + 2*j*N_row]  = sin(tmp);

	}
for (size_t j = 0; j < N_col; j++)
	for (size_t i = 0; i < N_row; i++)
	{
		inx_trans[2*j   + 2*i*N_col]  = inx[2*i+2*j*N_row];
		inx_trans[2*j+1 + 2*i*N_col]  = inx[2*i+1+2*j*N_row];

	}

hipMemcpy(dcux, inx, Nbytes, hipMemcpyHostToDevice);
hipMemcpy(dcux_trans, inx_trans, Nbytes, hipMemcpyHostToDevice);

rocfft_plan planf = NULL;
rocfft_plan planf_trans = NULL;
rocfft_plan_description plan_descriptionf;
rocfft_plan_description plan_descriptionf_trans;
rocfft_status err_status = rocfft_status_success;

size_t length = N_row;

// size_t *ioffsets, *ooffsets, value=0, istride=1, ostride=1, dimens=1, idist=1, odist=1;
// In the row-major N_row x N_col layout, consecutive elements of one column are N_col complex values apart
size_t *ioffsets, *ooffsets, value = 0, istride = N_col, ostride = N_col, dimens = 1, idist = length, odist = length;
ioffsets = &value;
ooffsets = &value;
size_t num_of_transforms = 1;
// perform a 1D FFT of length N_row along a column, e.g. 4096 rows * 2 cols; with num_of_transforms = 1 only column 0 is transformed
err_status = rocfft_plan_description_create(&plan_descriptionf_trans);
assert(err_status == rocfft_status_success);
err_status = rocfft_plan_description_set_data_layout(plan_descriptionf_trans,rocfft_array_type_complex_interleaved,rocfft_array_type_complex_interleaved,ioffsets,ooffsets,dimens,&istride,idist,dimens,&ostride,odist);
assert(err_status == rocfft_status_success);

err_status = rocfft_plan_create(&planf_trans, rocfft_placement_inplace, rocfft_transform_type_complex_forward, rocfft_precision_double,1, &length, num_of_transforms, plan_descriptionf_trans);
assert(err_status == rocfft_status_success);
err_status = rocfft_plan_get_print(planf_trans);
assert(err_status == rocfft_status_success);

rocfft_execution_info info_trans = NULL;		
err_status = rocfft_execution_info_create(&info_trans);
assert(err_status == rocfft_status_success);
size_t fbuffersize_trans = 0;
err_status =  rocfft_plan_get_work_buffer_size(planf_trans, &fbuffersize_trans);
assert(err_status == rocfft_status_success);
void* fbuffer_trans = NULL;
hipMalloc(&fbuffer_trans, fbuffersize_trans);
err_status = rocfft_execution_info_set_work_buffer(info_trans, fbuffer_trans, fbuffersize_trans);
assert(err_status == rocfft_status_success);

rocfft_execute(planf_trans, (void **) &dcux_trans, NULL, info_trans);
hipMemcpy(outx_trans,dcux_trans, Nbytes, hipMemcpyDeviceToHost);
hipDeviceSynchronize();

// result test
// for (size_t j = 0; j < N_col; j++) {
{ size_t j = 0;
      printf("batch = %zu\n", j);
      for (size_t i = 0; i < N_row; i++) {
              // tmp = abs(outx_trans[2*i+j*2*N])+abs(outx_trans[2*i+1+j*2*N]);
              // fabs is unambiguous for doubles; plain abs() is the integer version in C
              tmp = fabs(outx_trans[2*j + i*2*N_col]) + fabs(outx_trans[2*j+1 + i*2*N_col]);
              if (tmp>EPISION){
                  // printf("[%ld %ld] (%.15f,%.15f)\n",j,i,outx_trans[2*i+j*2*N],outx_trans[2*i+1+j*2*N]);
                  printf("[%zu %zu] (%.15f %.15f)\n", j, i, outx_trans[2*j + i*2*N_col], outx_trans[2*j+1+i*2*N_col]);
              }
      }
      printf("\n");
}
hipDeviceSynchronize();


rocfft_execution_info_destroy(info_trans);
rocfft_plan_description_destroy(plan_descriptionf_trans);
rocfft_plan_destroy(planf_trans);
rocfft_cleanup();

hipFree(dcux);
hipFree(dcux_trans);
hipFree(fbuffer_trans);
hipHostFree(inx);
hipHostFree(inx_trans);
hipHostFree(outx);
hipHostFree(outx_trans);
hipHostFree(outx_trans2);

return 0;

}
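To check the GPU output independently of rocFFT, a naive O(N^2) DFT on the CPU over the same row-major layout can serve as a reference. Element i of column j in the N_row x N_col array sits at complex index i * N_col + j, so one column is read with stride N_col, and adjacent columns start 1 element apart (so the batch distance would be 1 if both columns were transformed by one batched plan). The sketch below uses a hypothetical helper `column_dft` and assumes the negative-exponent convention exp(-2*pi*i*k*n/N) for a forward transform; with the input from the test case, column 0 should give N at bin 0 and column 1 should give N at bin 1.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: forward DFT (negative exponent, unnormalized) of
// column `col` of an N_row x N_col row-major array of complex doubles.
// Element i of that column is at index i * N_col + col, i.e. stride N_col.
std::vector<std::complex<double>> column_dft(const std::vector<std::complex<double>>& data,
                                             std::size_t N_row,
                                             std::size_t N_col,
                                             std::size_t col)
{
    assert(col < N_col && data.size() >= N_row * N_col);
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> out(N_row);
    for(std::size_t k = 0; k < N_row; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for(std::size_t i = 0; i < N_row; ++i)
        {
            // Reduce k*i modulo N_row so the angle stays small and accurate.
            const double angle = -2.0 * pi * static_cast<double>((k * i) % N_row)
                                 / static_cast<double>(N_row);
            sum += data[i * N_col + col] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}
```

For N_row = 4096 this is about 16.8 million complex multiply-adds per column, which is fast enough for a one-off comparison against the rocFFT result.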

Environment

Hardware	description
GPU	AMD Radeon VII (Vega20)
CPU	AMD Ryzen 5 3600X

rocfft build stuck during compilation

What is the expected behavior

rocfft finishes build step

What actually happens

hcc gets stuck building rocfft

How to reproduce

These are the steps to reproduce.

$ git checkout v0.8.4
$ CXX=`which hcc` cmake -DCMAKE_INSTALL_PREFIX=/tmp/rocfft/master -DCMAKE_BUILD_TYPE=Release -DBUILD_CLIENTS_SAMPLES=ON  ..
$ make -j8

Environment

Hardware	description
GPU	gfx803
CPU	Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz

fresh centos 7.4.1708 installation (kernel 3.10.0-693.17.1.el7.x86_64):

Software	version
ROCK	1.8
ROCR	1.1.8_15_ge851b7a
HCC	HCC clang version 7.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 86791fc4961dc8ffde77bde20d7dfa5e5cbeff5e)
Library	v0.8.4

After 30 minutes of building, the output of the commands above looks like this:

$ CXX=hcc cmake -DCMAKE_INSTALL_PREFIX=/home/steinbac/software/rocfft/master -DCMAKE_BUILD_TYPE=Release -DBUILD_CLIENTS_SAMPLES=ON  ..
-- The CXX compiler identification is Clang 7.0.0
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HCC compiler set; ROCm backend selected [ CXX=/opt/rocm/bin/hcc cmake ... ]
-- Building with ROCm tools
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1") 
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Found FFTW: /sw/apps/fftw/3.3.6-pl1/include (Required is at least version "3.0") 
-- FindFFTW configured variables:
-- FFTW_INCLUDE_DIRS: /sw/apps/fftw/3.3.6-pl1/include
-- FFTW_LIBRARIES: /sw/apps/fftw/3.3.6-pl1/lib/libfftw3f.so;/sw/apps/fftw/3.3.6-pl1/lib/libfftw3.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/steinbac/software/rocfft/repo/build
$ make -j8
Scanning dependencies of target rocfft-kernel-generator
[  3%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[  6%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
[  9%] Linking CXX executable rocfft-kernel-generator
[  9%] Built target rocfft-kernel-generator
[ 12%] Generator producing device kernels for rocfft-device
Scanning dependencies of target rocfft-device
[ 15%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o
[ 21%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
[ 21%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/function_pool.cpp.o
[ 25%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/bluestein.cpp.o
[ 28%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/complex2real.cpp.o
[ 31%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single.cpp.o
[ 37%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single_large.cpp.o
[ 37%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double.cpp.o
[ 40%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double_large.cpp.o
[ 43%] Linking CXX shared library librocfft-device.so

"Undefined global function." error on rocFFT complex 2D FFT example

What is the expected behavior

To compute the complex FFT.

What actually happens

terminate called after throwing an instance of 'std::runtime_error'
what(): Undefined global function.

How to reproduce

Environment

Hardware	description
GPU	Radeon 580 (Name: gfx803; Marketing Name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590])
CPU	Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz

Software	version
Kernel	5.5.13-arch2-1
ROCK	3.3.0
ROCR	3.3.0
HCC	3.3.0
Library	3.3.0

Library implementation suggestions

Using this to record all ideas for a good library implementation

result of complex 3d sample is incorrect

expected outcome

Performing a forward transform followed by a backward transform should reproduce the original data once the result is divided by the size of the data.
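The expected round trip can be stated concretely: an unnormalized forward DFT followed by an unnormalized backward DFT multiplies every element by the transform size, which for a 3D transform is Nx * Ny * Nz (64 for ./complex3d 4 4 4). The minimal 1D C++ sketch below (hypothetical helpers `dft` and `round_trip`; a naive O(n^2) DFT, not rocFFT's algorithm) shows the check for a single dimension.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive unnormalized 1D DFT; sign = -1 for forward, +1 for backward.
cvec dft(const cvec& x, int sign)
{
    assert(sign == 1 || sign == -1);
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    cvec y(n);
    for(std::size_t k = 0; k < n; ++k)
    {
        std::complex<double> sum(0.0, 0.0);
        for(std::size_t j = 0; j < n; ++j)
        {
            const double angle = sign * 2.0 * pi * static_cast<double>((k * j) % n)
                                 / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        y[k] = sum;
    }
    return y;
}

// Forward, then backward, then divide by the size: should return the input.
cvec round_trip(const cvec& x)
{
    cvec y = dft(dft(x, -1), +1);
    for(auto& v : y)
        v /= static_cast<double>(x.size());
    return y;
}
```

For a multi-dimensional transform the same idea holds, except the divisor is the product of all lengths rather than the length of one axis.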

actual outcome

Here is the output of the provided example in repo/docs/samples/ when running ./complex3d 4 4 4:
rocFFT complex 3d FFT example
Nx: 4 Ny: 4 Nz: 4 in-place: 0
Input:
(0,0) (1,0) (2,0) (3,0)
(1,0) (2,0) (3,0) (4,0)
(2,0) (3,0) (4,0) (5,0)
(3,0) (4,0) (5,0) (6,0)

(1,0) (2,0) (3,0) (4,0)
(2,0) (3,0) (4,0) (5,0)
(3,0) (4,0) (5,0) (6,0)
(4,0) (5,0) (6,0) (7,0)

(2,0) (3,0) (4,0) (5,0)
(3,0) (4,0) (5,0) (6,0)
(4,0) (5,0) (6,0) (7,0)
(5,0) (6,0) (7,0) (8,0)

(3,0) (4,0) (5,0) (6,0)
(4,0) (5,0) (6,0) (7,0)
(5,0) (6,0) (7,0) (8,0)
(6,0) (7,0) (8,0) (9,0)

Transformed:
(-23.8431,-23.8431) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)

(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)

Transformed back:
(288,0) (-32,-32) (-32,0) (-32,32)
(-32,-32) (0,0) (0,0) (0,0)
(-32,0) (0,0) (0,0) (0,0)
(-32,32) (0,0) (0,0) (0,0)

(-32,-32) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)

(-32,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)

(-32,32) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)
(0,0) (0,0) (0,0) (0,0)

Maximum error: 9

Here is the result for complex2d (./complex2d 4 4), which works as expected:

rocFFT complex 2d FFT example
Nx: 4 Ny: 4 in-place: 0
Input:
(0,0) (1,0) (2,0) (3,0)
(1,0) (2,0) (3,0) (4,0)
(2,0) (3,0) (4,0) (5,0)
(3,0) (4,0) (5,0) (6,0)

Transformed:
(48,0) (-8,8) (-8,0) (-8,-8)
(-8,8) (0,0) (0,0) (0,0)
(-8,0) (0,0) (0,0) (0,0)
(-8,-8) (0,0) (0,0) (0,0)

Transformed back:
(0,0) (16,9.79717e-16) (32,0) (48,-9.79717e-16)
(16,9.79717e-16) (32,1.95943e-15) (48,9.79717e-16) (64,0)
(32,0) (48,9.79717e-16) (64,0) (80,-9.79717e-16)
(48,-9.79717e-16) (64,0) (80,-9.79717e-16) (96,-1.95943e-15)

Maximum error: 1.22465e-16

Not able to build from source on NVIDIA platform

What is the expected behavior

I am trying to build rocFFT from source on the NVIDIA platform.

What actually happens

But during make, the build hangs (after reaching 52%).

How to reproduce

Environment

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

-- ./hipconfig
HIP version : 1.5.18151

== hipconfig
HIP_PATH : /opt/rocm/hip
HIP_PLATFORM : nvcc
CPP_CONFIG : -D__HIP_PLATFORM_NVCC__= -I/opt/rocm/hip/include -I/usr/local/cuda/include

== nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88

=== Environment Variables
PATH=/home/usr/bin:/home/usr/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/opt/rocm/hip/bin:/opt/rocm/bin
HIP_PATH=/opt/rocm/hip
CUDA_PATH=/usr/local/cuda

== Linux Kernel
Hostname : usr
Linux usr 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial

Error log:
[ 4%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.file.cpp.o
[ 8%] Building CXX object library/src/device/generator/CMakeFiles/rocfft-kernel-generator.dir/generator.main.cpp.o
/home/usr/rocFFT/library/src/device/generator/generator.file.cpp(43): warning: variable "real_transform" was declared but never referenced
/rocFFT/library/src/device/generator/generator.file.cpp(51): warning: variable "batchsize" was declared but never referenced
/rocFFT/library/src/device/generator/generator.file.cpp(84): warning: variable "realSpecial_Nr" was declared but never referenced
/rocFFT/library/src/device/generator/generator.kernel.hpp(219): warning: variable "iStride" was declared but never referenced
detected during instantiation of "StockhamGenerator::Kernel::Kernel(const FFTKernelGenKeyParams &) [with PR=(rocfft_precision)0]"
/rocFFT/library/src/device/generator/generator.file.cpp(466): here
/rocFFT/library/src/device/generator/generator.kernel.hpp(219): warning: variable "oStride" was declared but never referenced
detected during instantiation of "StockhamGenerator::Kernel::Kernel(const FFTKernelGenKeyParams &) [with PR=(rocfft_precision)0]"
/rocFFT/library/src/device/generator/generator.file.cpp(466): here
/rocFFT/library/src/device/generator/generator.kernel.hpp(707): warning: integer conversion resulted in a change of sign
/rocFFT/library/src/device/generator/generator.kernel.hpp(709): warning: integer conversion resulted in a change of sign
/home/usr/rocFFT/library/src/device/generator/generator.kernel.hpp(660): warning: variable "cReg" was declared but never referenced
/rocFFT/library/src/device/generator/generator.kernel.hpp(707): warning: integer conversion resulted in a change of sign
/rocFFT/library/src/device/generator/generator.kernel.hpp(709): warning: integer conversion resulted in a change of sign
/rocFFT/library/src/device/generator/generator.kernel.hpp(707): warning: integer conversion resulted in a change of sign
/rocFFT/library/src/device/generator/generator.kernel.hpp(709): warning: integer conversion resulted in a change of sign

[ 13%] Linking CXX executable rocfft-kernel-generator
[ 13%] Built target rocfft-kernel-generator
[ 17%] Generator producing device kernels for rocfft-device
Scanning dependencies of target rocfft-device
[ 21%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/function_pool.cpp.o
[ 26%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/transpose.cpp.o
[ 30%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/complex2real.cpp.o
[ 34%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/real2complex.cpp.o
[ 39%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_double.cpp.o
[ 43%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single.cpp.o
[ 47%] Building CXX object library/src/device/CMakeFiles/rocfft-device.dir/kernel_launch_single_large.cpp.o
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h: In function ‘hipError_t hipCtxDetach(hipCtx_t)’:
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h:1019:30: warning: ‘CUresult cuCtxDetach(CUcontext)’ is deprecated [-Wdeprecated-declarations]
return hipCUResultTohipError(cuCtxDetach(ctx));
^
^
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h: In function ‘hipError_t hipDeviceComputeCapability(int*, int*, hipDevice_t)’:
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h:1027:30: warning: ‘CUresult cuDeviceComputeCapability(int*, int*, CUdevice)’ is deprecated [-Wdeprecated-declarations]
return hipCUResultTohipError(cuDeviceComputeCapability(major, minor, device));
^
/usr/local/cuda/include/cuda.h:2528:36: note: declared here
__CUDA_DEPRECATED CUresult CUDAAPI cuDeviceComputeCapability(int major, int minor, CUdevice dev);
^
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h:1027:30: warning: ‘CUresult cuDeviceComputeCapability(int, int, CUdevice)’ is deprecated [-Wdeprecated-declarations]
return hipCUResultTohipError(cuDeviceComputeCapability(major, minor, device));
^
/usr/local/cuda/include/cuda.h:2528:36: note: declared here
__CUDA_DEPRECATED CUresult CUDAAPI cuDeviceComputeCapability(int major, int minor, CUdevice dev);
^
/opt/rocm/hip/include/hip/nvcc_detail/hip_runtime_api.h:1027:76: warning: ‘CUresult cuDeviceComputeCapability(int, int, CUdevice)’ is deprecated [-Wdeprecated-declarations]
return hipCUResultTohipError(cuDeviceComputeCapability(major, minor, device));
^
/usr/local/cuda/include/cuda.h:2528:36: note: declared here
__CUDA_DEPRECATED CUresult CUDAAPI cuDeviceComputeCapability(int *major, int *minor, CUdevice dev);

The build hangs from this point on.

CMake failure

I am attempting to build tensorflow with ROCm support on Arch Linux. (Yeah, I should just give up and dual-boot Ubuntu, I know.) I've installed
https://aur.archlinux.org/packages/rocr-runtime/
https://aur.archlinux.org/packages/roct-thunk-interface/
https://aur.archlinux.org/packages/hcc/
https://aur.archlinux.org/packages/hip/
and I have some sort of ROCm install in /opt/rocm, including /opt/rocm/hcc, /opt/rocm/hip, /opt/rocm/lib, etc. But tensorflow won't build because it wants rocFFT, which is not installed.

Following the wiki https://github.com/ROCmSoftwarePlatform/rocFFT/wiki/Build-on-ROCM, I create the build directory, type

CXX=/opt/rocm/hcc/bin/hcc cmake source_dir

and get the following:

CMake Error at /opt/rocm/hcc/lib/cmake/hcc/hcc-config.cmake:41 (add_library):
add_library cannot create imported target "hsa-runtime64" because another
target with the same name already exists.
Call Stack (most recent call first):
/usr/share/cmake-3.15/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
/opt/rocm/hip/lib/cmake/hip/hip-config.cmake:77 (find_dependency)
CMakeLists.txt:143 (find_package)

What seems to be happening is that hcc-config.cmake gets pulled in twice, first by CMakeLists.txt line 139 "find_package(hcc REQUIRED CONFIG PATHS /opt/rocm )" and then again by CMakeLists.txt line 143 "find_package( hip REQUIRED CONFIG PATHS /opt/rocm )", because hip tries to pull in hcc internally.

Commenting out line 139 makes the error go away and CMake successfully generates everything.

Not able to install on SLES15.1 using install.sh

What is the expected behavior

Retrieving: libcxxtools9-2.2.1-11.1.x86_64.rpm ... [not found]
File './x86_64/libcxxtools9-2.2.1-11.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/server:/http/SLE_15/'
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.

What actually happens

The package is not found

How to reproduce

Run the install.sh script on SLES system

Environment

Hardware	description
GPU	device string
CPU	device string

Software	version
ROCK	v0.0
ROCR	v0.0
HCC	v0.0
Library	v0.0

complex-to-real inverse 2D FFT: incorrect results

I am unable to get correct results from an inverse 2D Hermitian-to-real FFT. Generally I am inclined to suspect an error on my part, but after much investigation, I can't find any error, and suspect it may be in the library. If anyone is able to point out an error, I would much appreciate it.

See the attached source code, which is as brief an example as I could manage. Data is all in raw pointers to maintain transparency.

This code:

1. Generates some simple synthetic data in K-space in Hermitian form (with a single non-zero unique value). By Hermitian I mean taking advantage of K(i,j) = K(M-i,N-j)* to reduce storage in the usual way (N/2+1 values along the innermost axis). Call this result A.
2. Applies a Hermitian-to-real inverse FFT (result B).
3. Generates the same K-space data, but in full complex form with no reduced storage, and applies a complex-to-complex inverse FFT to this data (result C). Results B and C differ significantly.
4. Applies a real-to-Hermitian forward FFT to result B. This should reproduce result A, but is significantly different.
5. Extracts the real part of result C and applies a real-to-Hermitian forward FFT to it. This does recover result A, indicating that at least step 3 is correct.
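Steps 1 and 3 rely on the relation between reduced and full storage. The C++ sketch below (hypothetical helpers `dft2_real`, `is_hermitian`, `reduce_hermitian` and `expand_hermitian`; row-major M x N data with N as the innermost axis, as in the report) computes a naive 2D DFT of real data, checks K(i,j) = conj(K((M-i) mod M, (N-j) mod N)), keeps the N/2+1 unique columns, and rebuilds the full spectrum from them. One point worth checking when synthesizing data directly in reduced form: column j = 0 (and column j = N/2 when N is even) holds both members of each conjugate pair, so values placed there must satisfy the symmetry themselves. Whether that affects the attached example cannot be told from the description.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive forward 2D DFT of a real M x N row-major array (N is the innermost axis).
std::vector<cplx> dft2_real(const std::vector<double>& x, std::size_t M, std::size_t N)
{
    assert(x.size() == M * N);
    const double pi = std::acos(-1.0);
    std::vector<cplx> K(M * N);
    for(std::size_t i = 0; i < M; ++i)
        for(std::size_t j = 0; j < N; ++j)
        {
            cplx sum(0.0, 0.0);
            for(std::size_t a = 0; a < M; ++a)
                for(std::size_t b = 0; b < N; ++b)
                {
                    const double angle = -2.0 * pi
                                         * (static_cast<double>((i * a) % M) / M
                                            + static_cast<double>((j * b) % N) / N);
                    sum += x[a * N + b] * std::polar(1.0, angle);
                }
            K[i * N + j] = sum;
        }
    return K;
}

// True if K(i,j) == conj(K((M-i)%M, (N-j)%N)) for every (i,j), within tol.
bool is_hermitian(const std::vector<cplx>& K, std::size_t M, std::size_t N, double tol)
{
    for(std::size_t i = 0; i < M; ++i)
        for(std::size_t j = 0; j < N; ++j)
        {
            const cplx& a = K[i * N + j];
            const cplx& b = K[((M - i) % M) * N + (N - j) % N];
            if(std::abs(a - std::conj(b)) > tol)
                return false;
        }
    return true;
}

// Keep the N/2+1 unique columns of each row (reduced Hermitian storage).
std::vector<cplx> reduce_hermitian(const std::vector<cplx>& K, std::size_t M, std::size_t N)
{
    const std::size_t Nh = N / 2 + 1;
    std::vector<cplx> H(M * Nh);
    for(std::size_t i = 0; i < M; ++i)
        for(std::size_t j = 0; j < Nh; ++j)
            H[i * Nh + j] = K[i * N + j];
    return H;
}

// Rebuild the full M x N spectrum from reduced storage using the symmetry.
std::vector<cplx> expand_hermitian(const std::vector<cplx>& H, std::size_t M, std::size_t N)
{
    const std::size_t Nh = N / 2 + 1;
    std::vector<cplx> K(M * N);
    for(std::size_t i = 0; i < M; ++i)
        for(std::size_t j = 0; j < N; ++j)
            K[i * N + j] = (j < Nh) ? H[i * Nh + j]
                                    : std::conj(H[((M - i) % M) * Nh + (N - j)]);
    return K;
}
```

Running `is_hermitian` on the full complex data built in step 3 is a quick way to confirm that it really is the expansion of the reduced data from step 1.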

Library versions:
hipcc version: 1.5.18151
rocfft version 0.8.1

rocfft_real_inverse.cpp.txt

hipfft* functions don't call associated cufft* functions

What is the expected behavior

Using hipfft* functions with the NVIDIA compiler should result in calls to the corresponding cufft* functions.

What actually happens

I get the following errors on the linking step

undefined reference to `hipblasCreate'
undefined reference to `hipblasDestroy'
undefined reference to `hipblasSscal'
undefined reference to `hipfftDestroy'
undefined reference to `hipfftExecC2C'
undefined reference to `hipfftPlan1d'

How to reproduce

Compile a program containing one of the above functions with HIP_PLATFORM=nvcc

Environment

Hardware	description
GPU	Quadro M4000
CPU	Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

Software	version
ROCK	N/A
ROCR	N/A
HCC	2.8.19361-cbe6b65e
Library	rocfft.x86_64 0:0.9.7.722_rocm_rel_2.9_6_e3055e1-1

Guesses

Is there supposed to be a translation entry in the HIP repo for these items? I see that they are supported in the HIP repo, but I don't see them in the code.

Out-of-place operation failure in 3D C2C case

I am investigating a mismatch in the fft operation in tensorflow-rocm. Tracing the failure brings me to rocFFT. The observed behavior is as follows.

A plan is created via a call to hipfftMakePlanMany.
hipfftExecC2C is invoked with the understanding that the operation is going to be out-of-place (arguments idata and odata are different).
It produces garbled data.

I got rocFFT to dump the plan, and it looks as follows:

Work buffer size: 64
Work buffer ratio: 1

scheme: CS_3D_RTRT
dimension: 3
batch: 1
length: 4 4 4
iStrides: 1 4 16
oStrides: 1 4 16
iOffset: 0
oOffset: 0
iDist: 64
oDist: 64
direction: -1
not inplace
array type: complex interleaved -> complex interleaved
TTD: 0
large1D: 0
lengthBlue: 0
OB_USER_OUT -> OB_USER_OUT
B -> B

scheme: CS_2D_RTRT
dimension: 2
batch: 1
length: 4 4 4 
iStrides: 1 4 16 
oStrides: 1 4 16 
iOffset: 0
oOffset: 0
iDist: 64
oDist: 64
direction: -1
inplace
array type: complex interleaved -> complex interleaved
TTD: 0
large1D: 0
lengthBlue: 0
OB_USER_OUT -> OB_USER_OUT
B -> B

    scheme: CS_KERNEL_STOCKHAM
    dimension: 1
    batch: 1
    length: 4 4 4 
    iStrides: 1 4 16 
    oStrides: 1 4 16 
    iOffset: 0
    oOffset: 0
    iDist: 64
    oDist: 64
    direction: -1
    inplace
    array type: complex interleaved -> complex interleaved
    TTD: 0
    large1D: 0
    lengthBlue: 0
    OB_USER_OUT -> OB_USER_OUT
    B -> B

    scheme: CS_KERNEL_TRANSPOSE
    dimension: 2
    batch: 1
    length: 4 4 4 
    iStrides: 1 4 16 
    oStrides: 1 4 16 
    iOffset: 0
    oOffset: 0
    iDist: 64
    oDist: 64
    direction: -1
    not inplace
    array type: complex interleaved -> complex interleaved
    TTD: 0
    large1D: 0
    lengthBlue: 0
    OB_USER_OUT -> OB_TEMP
    B -> T

    scheme: CS_KERNEL_STOCKHAM
    dimension: 1
    batch: 1
    length: 4 4 4 
    iStrides: 1 4 16 
    oStrides: 1 4 16 
    iOffset: 0
    oOffset: 0
    iDist: 64
    oDist: 64
    direction: -1
    inplace
    array type: complex interleaved -> complex interleaved
    TTD: 0
    large1D: 0
    lengthBlue: 0
    OB_TEMP -> OB_TEMP
    T -> T

    scheme: CS_KERNEL_TRANSPOSE
    dimension: 2
    batch: 1
    length: 4 4 4 
    iStrides: 1 4 16 
    oStrides: 1 4 16 
    iOffset: 0
    oOffset: 0
    iDist: 64
    oDist: 64
    direction: -1
    not inplace
    array type: complex interleaved -> complex interleaved
    TTD: 0
    large1D: 0
    lengthBlue: 0
    OB_TEMP -> OB_USER_OUT
    T -> B

scheme: CS_KERNEL_TRANSPOSE_XY_Z
dimension: 2
batch: 1
length: 4 4 4 
iStrides: 1 4 16 
oStrides: 1 4 16 
iOffset: 0
oOffset: 0
iDist: 64
oDist: 64
direction: -1
not inplace
array type: complex interleaved -> complex interleaved
TTD: 0
large1D: 0
lengthBlue: 0
OB_USER_OUT -> OB_TEMP
B -> T

scheme: CS_KERNEL_STOCKHAM
dimension: 1
batch: 1
length: 4 4 4 
iStrides: 1 4 16 
oStrides: 1 4 16 
iOffset: 0
oOffset: 0
iDist: 64
oDist: 64
direction: -1
inplace
array type: complex interleaved -> complex interleaved
TTD: 0
large1D: 0
lengthBlue: 0
OB_TEMP -> OB_TEMP
T -> T

scheme: CS_KERNEL_TRANSPOSE_Z_XY
dimension: 2
batch: 1
length: 4 4 4 
iStrides: 1 4 16 
oStrides: 1 4 16 
iOffset: 0
oOffset: 0
iDist: 64
oDist: 64
direction: -1
not inplace
array type: complex interleaved -> complex interleaved
TTD: 0
large1D: 0
lengthBlue: 0
OB_TEMP -> OB_USER_OUT
T -> B

There is a notable absence of any mention of OB_USER_IN. A test confirmed that rocFFT ignores the input buffer entirely and transforms the output buffer in place: if I copy the input data into the output buffer by hand before calling hipfftExecC2C, the transform works and produces the correct results.
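The reporter's reasoning can be made mechanical: in an out-of-place plan at least one leaf kernel has to read the user input buffer, and the last one has to write the user output buffer. The sketch below encodes leaf kernels as (input, output) buffer pairs using a hypothetical `KernelStep` struct; rocFFT does not expose such a type, so the pairs would have to be transcribed by hand from the `OB_* -> OB_*` lines of a plan printout like the one above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of one leaf kernel from a plan dump,
// e.g. "OB_USER_OUT -> OB_TEMP". Not a rocFFT type.
struct KernelStep
{
    std::string in;  // buffer the kernel reads
    std::string out; // buffer the kernel writes
};

// For an out-of-place plan: some kernel must read the user input buffer,
// and the last kernel must write the user output buffer.
bool out_of_place_buffers_ok(const std::vector<KernelStep>& steps)
{
    if(steps.empty())
        return false;
    const bool reads_input = std::any_of(steps.begin(), steps.end(), [](const KernelStep& s) {
        return s.in == "OB_USER_IN";
    });
    return reads_input && steps.back().out == "OB_USER_OUT";
}
```

Fed the seven leaf kernels from the dump above (all reading OB_USER_OUT or OB_TEMP), this check returns false, which is the same conclusion the reporter reached by reading the printout.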

rocFFT build fails with hcc/HIP built from source on ARM64 Ubuntu 16.04

~/ROCm/rocFFT/release# CXX=/opt/rocm/bin/hcc cmake ..
-- HCC compiler set; ROCm backend selected [ CXX=/opt/rocm/bin/hcc cmake ... ]
-- Building with ROCm tools
CMake Error at CMakeLists.txt:126 (find_package):
Could not find a package configuration file provided by "hip" with any of
the following names:

hipConfig.cmake
hip-config.cmake

Add the installation prefix of "hip" to CMAKE_PREFIX_PATH or set "hip_DIR"
to a directory containing one of the above files. If "hip" provides a
separate development package or SDK, be sure it has been installed.

-- Configuring incomplete, errors occurred!
See also "/root/ROCm/rocFFT/release/CMakeFiles/CMakeOutput.log".

I found that building HIP from source and running make install did not install hip-config.cmake.

Environment

Hardware	description
GPU	device string
CPU	ARM64

Software	version
OS	Ubuntu 16.04
ROCK	v0.0
ROCR	v0.0
HCC	HCC clang version 7.0.0
Library	HIP version: 1.5.0

Can't build for HIP

What is the expected behavior

That it builds

All output is reported in justxi/rocm#132.

What actually happens

I get errors

How to reproduce

Try to build

Environment

Hardware	description
GPU	Hawaii PRO [Radeon R9 290/390]
CPU	AMD FX(tm)-8350 Eight-Core Processor

Software	version
ROCK	No idea, kernel 5.6.0-rc1
ROCR	v3.0.0
HCC	v3.0.0
Library	v3.0.0

`__float128` is not supported on this target

What is the expected behavior

hcc/clang should compile a very simple example submitted in #146.

What actually happens

The following error is thrown:

[ 50%] Building CXX object CMakeFiles/complex_forward_1d.dir/complex_forward_1d.cpp.o
/opt/rocm/hcc/bin/hcc   -I/opt/rocm/rocfft/include -isystem /opt/rocm/include -isystem /opt/rocm/hip/include -isystem /opt/rocm/hsa/include -isystem /opt/rocm/hcc/include  -Wno-unused-command-line-argument -hc -fPIC -std=gnu++11 -o CMakeFiles/complex_forward_1d.dir/complex_forward_1d.cpp.o -c /home/steinbac/software/rocfft/repo/docs/samples/complex_forward_1d.cpp
In file included from /home/steinbac/software/rocfft/repo/docs/samples/complex_forward_1d.cpp:2:
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/vector:60:
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/stl_algobase.h:64:
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/stl_pair.h:59:
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/move.h:57:
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/type_traits:269:39: error: __float128 is not supported on this target
    struct __is_floating_point_helper<__float128>
                                      ^
1 error generated.
make[2]: *** [CMakeFiles/complex_forward_1d.dir/complex_forward_1d.cpp.o] Error 1
make[2]: *** Deleting file `CMakeFiles/complex_forward_1d.dir/complex_forward_1d.cpp.o'
make[2]: Leaving directory `/data/steinbac/software/rocfft/repo/docs/samples/build'
make[1]: *** [CMakeFiles/complex_forward_1d.dir/all] Error 2
make[1]: Leaving directory `/data/steinbac/software/rocfft/repo/docs/samples/build'
make: *** [all] Error 2

How to reproduce

using centos 7.4.1708
using rocm-dev 1.9.307-1
clone+checkout #146
cd docs/samples
mkdir build
cd build
CXX=`which hcc` cmake ..

Environment

Hardware	description
GPU	gfx803
CPU	Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz

Software	version
rocm-dev	1.9.307-1
rocm-dkms	1.9.307-1
HCC	1.2.18451-1
hsakmt	1.0.0-7.el7

Investigate a CMake component-based installer to generate a separate package for clients

Please provide a compiled application that exercises the library (similar to clFFT-client), in the release package (rocfft.deb) or as a separate package.

There should be an easy way to test and benchmark the library.

Memory access fault by GPU node-4 after using hipfftExecR2C

I have ported a program to HIP. It is an iterative framework, and each iteration calls hipfftExecR2C many times. However, there is a strange bug: after three or more iterations, I randomly get a memory access fault like this:

Memory access fault by GPU node-4 (Agent handle: 0x2a9e950) on address 0x2b118b909000. Reason: Page not present or supervisor privilege.

Although the failure occurs at random, it always appears after calling hipfftExecR2C and then hipMemcpy to copy the hipFloatComplex data. The backtrace is as follows:

#25 0x00002aaab320f7e5 in copy_ext () at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:4131
#26 0x00002aaaabbddc48 in ihipStream_t::locked_copySync(void*, void const*, unsigned long, unsigned int, bool) ()
   from /opt/rocm/hip/lib/libhip_hcc.so
#27 0x00002aaaabc9879e in hipMemcpy () from /opt/rocm/hip/lib/libhip_hcc.so
#28 0x00000000005271c9 in getFourierTransformsAndCtfs<MlOptimiserHip> (my_ori_particle=0, op=..., sp=..., baseMLO=0x7fffffffb038,
    accMLO=0x14a9ed0, ptrFactory=..., ibody=0) at src/acc/hip/../acc_ml_optimiser_impl.h:492

Sorry to disturb you; I would appreciate any tips for solving this problem. Thank you very much!
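A memory access fault surfacing in a later hipMemcpy has many possible causes. One thing worth ruling out (an assumption, not a diagnosis of this report) is a mismatch between the size of the complex buffer and what the R2C transform writes or what hipMemcpy copies: for an R2C transform the innermost real length n produces n/2 + 1 complex values, following the cuFFT convention that hipFFT mirrors. The hypothetical helpers below compute the expected element and byte counts for single precision, where one hipFloatComplex is two floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: complex elements written by an R2C transform with real
// lengths `dims` (row-major, last entry is the innermost axis) and `batch`
// transforms. The innermost length n becomes n / 2 + 1.
std::size_t r2c_output_elements(const std::vector<std::size_t>& dims, std::size_t batch)
{
    assert(!dims.empty());
    std::size_t count = dims.back() / 2 + 1;
    for(std::size_t d = 0; d + 1 < dims.size(); ++d)
        count *= dims[d];
    return count * batch;
}

// Bytes for that output in single precision (one complex value = two floats).
std::size_t r2c_output_bytes_single(const std::vector<std::size_t>& dims, std::size_t batch)
{
    return r2c_output_elements(dims, batch) * 2 * sizeof(float);
}
```

For example, real lengths {256, 256} give 256 * 129 = 33024 complex values rather than 65536, so both the allocation and every hipMemcpy of the R2C output should use that count.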

Software	version
rocFFT	v0.9.4
rocm	v2.1

Here is the full backtrace:

#0  0x00002aaab08b2107 in ioctl () from /lib64/libc.so.6
#1  0x00002aaab0b91ad8 in ?? () from /opt/rocm/lib64/libhsakmt.so.1
#2  0x00002aaab0b8aa3e in hsaKmtWaitOnMultipleEvents () from /opt/rocm/lib64/libhsakmt.so.1
#3  0x00002aaab0b8b059 in hsaKmtWaitOnEvent () from /opt/rocm/lib64/libhsakmt.so.1
#4  0x00002aaaac5005b2 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00002aaaac50040a in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00002aaaac4f2ad9 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00002aaab323a2b4 in waitComplete () at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:4884
#8  0x00002aaab324039d in operator() () at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:5015
#9  __invoke_impl<void, (lambda at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:5014:77)> ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/bits/invoke.h:60
#10 __invoke<(lambda at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:5014:77)> ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/bits/invoke.h:95
#11 _M_invoke<0> () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/thread:234
#12 operator() () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/thread:243
#13 operator() () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/future:1362
#14 0x00002aaab3240332 in _M_invoke ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/bits/std_function.h:301
#15 0x00002aaab3231607 in operator() ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/bits/std_function.h:706
#16 _M_do_set () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/future:561
#17 0x00002aaaab4e9e20 in pthread_once () from /lib64/libpthread.so.0
#18 0x00002aaab3240763 in __gthread_once ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/x86_64-redhat-linux/bits/gthr-default.h:699
#19 call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *> ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/mutex:684
#20 _M_set_result () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/future:401
#21 _M_complete_async () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/future:1615
#22 0x00002aaab320df74 in wait ()
    at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/future:334
#23 wait () at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/future:692
#24 wait () at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:1617
#25 0x00002aaab320f7e5 in copy_ext () at /data/my-tests/others/hcc-roc-2.1.x/lib/hsa/mcwamp_hsa.cpp:4131
#26 0x00002aaaabbddc48 in ihipStream_t::locked_copySync(void*, void const*, unsigned long, unsigned int, bool) ()
   from /opt/rocm/hip/lib/libhip_hcc.so
#27 0x00002aaaabc9879e in hipMemcpy () from /opt/rocm/hip/lib/libhip_hcc.so
#28 0x00000000005271c9 in getFourierTransformsAndCtfs<MlOptimiserHip> (my_ori_particle=0, op=..., sp=..., baseMLO=0x7fffffffb038,
    accMLO=0x14a9ed0, ptrFactory=..., ibody=0) at src/acc/hip/../acc_ml_optimiser_impl.h:492
#29 0x00000000004ef0dc in accDoExpectationOneParticle<MlOptimiserHip> (myInstance=0x14a9ed0, my_ori_particle=0, thread_id=0,
    ptrFactory=...) at src/acc/hip/../acc_ml_optimiser_impl.h:3185
#30 0x00000000004eeb70 in MlOptimiserHip::doThreadExpectationSomeParticles (this=0x14a9ed0, thread_id=0) at src/acc/hip/hip_ml_optimiser.hip.cpp:284
#31 0x00000000008cf123 in globalThreadExpectationSomeParticles (thArg=...) at ./src/ml_optimiser.cpp:76
#32 0x00000000009e3a78 in _threadMain (data=0x14618a0) at ./src/parallel.cpp:129
#33 0x00002aaaab4e4e25 in start_thread () from /lib64/libpthread.so.0
#34 0x00002aaab08bb34d in clone () from /lib64/libc.so.6
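One common cause of faults like this (not confirmed for this report) is an undersized output buffer. A real-to-complex transform of a real array whose last dimension has length N produces only N/2 + 1 complex values along that dimension. Each complex value is twice the size of a real one, so the output buffer is slightly larger than the input. The sketch below computes both sizes for a batched, out-of-place transform with contiguous row-major data. The helper names are hypothetical, and it assumes `hipFloatComplex` is stored as two `float`s.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: real (float) elements in one input array,
// dims in row-major order with the last dimension contiguous.
std::size_t real_elems(const std::vector<std::size_t>& dims) {
    std::size_t count = 1;
    for (std::size_t d : dims) count *= d;
    return count;
}

// Hypothetical helper: complex elements produced by one out-of-place
// real-to-complex transform. Only the last (contiguous) dimension is
// reduced to N/2 + 1; the remaining outputs are redundant because of
// Hermitian symmetry.
std::size_t r2c_complex_elems(const std::vector<std::size_t>& dims) {
    assert(!dims.empty());
    std::size_t count = dims.back() / 2 + 1;
    for (std::size_t i = 0; i + 1 < dims.size(); ++i) count *= dims[i];
    return count;
}

// Input bytes for a batch of single-precision real arrays.
std::size_t r2c_input_bytes(const std::vector<std::size_t>& dims, std::size_t batch) {
    return batch * real_elems(dims) * sizeof(float);
}

// Output bytes for the batch, assuming each complex value is two floats.
std::size_t r2c_output_bytes(const std::vector<std::size_t>& dims, std::size_t batch) {
    return batch * r2c_complex_elems(dims) * 2 * sizeof(float);
}
```

For example, a batch of 10 transforms of size 64 × 64 needs 10 × 64 × 64 × 4 = 163840 bytes of input but 10 × 64 × 33 × 8 = 168960 bytes of output. An output buffer sized from the input would therefore be 5120 bytes too small. In-place transforms additionally require the real input's last dimension to be padded to 2 × (N/2 + 1) elements.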
