Giter Club home page Giter Club logo

vkfft's Introduction

VkFFT - Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library

VkFFT is an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal projects. VkFFT aims to provide the community with an open-source alternative to Nvidia's cuFFT library while achieving better performance. VkFFT is written in C language and supports Vulkan, CUDA, HIP, OpenCL, Level Zero and Metal as backends.

The white paper of VkFFT is out - if you use VkFFT, you can cite it: https://ieeexplore.ieee.org/document/10036080

Currently supported features:

  • 1D/2D/3D/ND systems - specify VKFFT_MAX_FFT_DIMENSIONS for arbitrary number of dimensions.
  • Forward and inverse directions of FFT.
  • Support for big FFT dimension sizes. Current limits: approximately 2^32 in all dimensions for all types of transforms. Depends on the amount of shared memory available on the device.
  • Radix-2/3/4/5/7/8/11/13 FFT. Sequences using radix 3, 5, 7, 11 and 13 have comparable performance to that of powers of 2.
  • Rader's FFT algorithm for primes from 17 up to max shared memory length (~10000). Inlined and done without additional memory transfers.
  • Bluestein's FFT algorithm for all other sequences. Optimized to have as few memory transfers as possible by using zero padding and merged convolution support of VkFFT.
  • Single, double, half and quad (double-double) precision support. Double and quad precision uses CPU-generated LUT tables. Half precision still does all computations in single and only uses half precision to store data.
  • All transformations are performed in-place with no performance loss. Out-of-place transforms are supported by selecting different input/output buffers.
  • No additional transposition uploads. Note: Data can be reshuffled after the Four Step FFT algorithm with an additional buffer (for big sequences). Doesn't matter for convolutions - they return to the input ordering (saves memory).
  • Complex to complex (C2C), real to complex (R2C), complex to real (C2R) transformations and real to real (R2R) Discrete Cosine Transformations of types I, II, III and IV. R2R, R2C and C2R are optimized to run up to 2x times faster than C2C and take 2x less memory.
  • 1x1, 2x2, 3x3 convolutions with symmetric or nonsymmetric kernel (no register overutilization).
  • Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros). Can specify the range of sequences filled with zeros and the direction where zero padding is applied (read or write stage).
  • WHD+CN layout - data is stored in the following order (sorted by increase in strides): the width, the height, the depth, other dimensions, the coordinate (the number of feature maps), the batch number.
  • Multiple feature/batch convolutions - one input, multiple kernels.
  • Multiple input/output/temporary buffer split. Allows using data split between different memory allocations and mitigates 4GB single allocation limit.
  • Works on Nvidia, AMD, Intel and Apple GPUs. And Raspberry Pi 4 GPU.
  • Works on Windows, Linux and macOS.
  • VkFFT supports Vulkan, CUDA, HIP, OpenCL, Level Zero and Metal as backend to cover wide range of APIs.
  • Header-only library, which allows appending VkFFT directly to user's command buffer. Kernels are compiled at run-time.

Future release plan

  • Ambitious
    • Multiple GPU job splitting

Installation

Vulkan version: Include the vkFFT.h file and glslang compiler. Provide the library with correctly chosen VKFFT_BACKEND definition (VKFFT_BACKEND=0 for Vulkan). Sample CMakeLists.txt file configures project based on Vulkan_FFT.cpp file, which contains examples on how to use VkFFT to perform FFT, iFFT and convolution calculations, use zero padding, multiple feature/batch convolutions, C2C FFTs of big systems, R2C/C2R transforms, R2R DCT-I, II, III and IV, double precision FFTs, half precision FFTs.
For single and double precision, Vulkan 1.0 is required. For half precision, Vulkan 1.1 is required.

CUDA/HIP: Include the vkFFT.h file and make sure your system has NVRTC/HIPRTC built. Provide the library with correctly chosen VKFFT_BACKEND definition.
To build CUDA/HIP version of the benchmark, replace VKFFT_BACKEND in CMakeLists (line 5) with the correct one and optionally enable FFTW. VKFFT_BACKEND=1 for CUDA, VKFFT_BACKEND=2 for HIP.

OpenCL: Include the vkFFT.h file. Provide the library with correctly chosen VKFFT_BACKEND definition.
To build OpenCL version of the benchmark, replace VKFFT_BACKEND in CMakeLists (line 5) with the value 3 and optionally enable FFTW.

Level Zero: Include the vkFFT.h file. Provide the library with correctly chosen VKFFT_BACKEND definition. Clang and llvm-spirv must be valid system calls.
To build Level Zero version of the benchmark, replace VKFFT_BACKEND in CMakeLists (line 5) with the value 4 and optionally enable FFTW.

Metal: Include the vkFFT.h file. Provide the library with correctly chosen VKFFT_BACKEND definition. VkFFT uses metal-cpp as a C++ bindings to Apple's libraries - Foundation.hpp, QuartzCore.hpp and Metal.hpp.
To build Metal version of the benchmark, replace VKFFT_BACKEND in CMakeLists (line 5) with the value 5 and optionally enable FFTW.

Command-line interface

VkFFT has a command-line interface with the following set of commands:
-h: print help
-devices: print the list of available GPU devices
-d X: select GPU device (default 0)
-o NAME: specify output file path
-vkfft X: launch VkFFT sample X (0-17, 100, 101, 200, 201, 1000-1003) (if FFTW is enabled in CMakeLists.txt)
-cufft X: launch cuFFT sample X (0-4, 1000-1003) (if enabled in CMakeLists.txt)
-rocfft X: launch rocFFT sample X (0-4, 1000-1003) (if enabled in CMakeLists.txt)
-test: (or no other keys) launch all VkFFT and cuFFT benchmarks
So, the command to launch single precision benchmark of VkFFT and cuFFT and save log to output.txt file on device 0 will look like this on Windows:
.\VkFFT_TestSuite.exe -d 0 -o output.txt -vkfft 0 -cufft 0
For double precision benchmark, replace -vkfft 0 -cufft 0 with -vkfft 1 -cufft 1. For half precision benchmark, replace -vkfft 0 -cufft 0 with -vkfft 2 -cufft 2.

How to use VkFFT

VkFFT.h is a library that can append FFT, iFFT or convolution calculation to the user-defined command buffer. It operates on storage buffers allocated by the user and doesn't require any additional memory by itself (except for LUT, if they are enabled). All computations are fully based on Vulkan compute shaders with no CPU usage except for FFT planning. VkFFT creates and optimizes memory layout by itself and performs FFT with the best-chosen parameters. For an example application, see VkFFT_TestSuite.cpp file, which has comments explaining the VkFFT configuration process.
VkFFT achieves striding by grouping nearby FFTs instead of transpositions.
Explicit VkFFT documentation can be found in the documentation folder.

Benchmark results in comparison to cuFFT

The test configuration below takes multiple 1D FFTs of all lengths from the range of 2 to 4096, batch them together so the full system takes from 500MB to 1GB of data and perform multiple consecutive FFTs/iFFTs (-vkfft 1001 key). After that time per a single FFT is obtained by averaging the result. Total system size will be divided by the time taken by a single transform upload+download, resulting in the estimation of an achieved global bandwidth. The GPUs used in this comparison are Nvidia A100 and AMD MI250. The performance was compared against Nvidia cuFFT (CUDA 11.7 version) and AMD rocFFT (ROCm 5.2 version) libraries in double precision: alt text alt text

Precision comparison of cuFFT/VkFFT/FFTW

alt text alt text

Above, VkFFT precision is verified by comparing its results with FP128 version of FFTW. We test all FFT lengths from the [2, 100000] range. We perform tests in single and double precision on random input data from [-1;1] range.

For both precisions, all tested libraries exhibit logarithmic error scaling. The main source of error is imprecise twiddle factor computation – sines and cosines used by FFT algorithms. For FP64 they are calculated on the CPU either in FP128 or in FP64 and stored in the lookup tables. With FP128 precomputation (left) VkFFT is more precise than cuFFT and rocFFT.

For FP32, twiddle factors can be calculated on-the-fly in FP32 or precomputed in FP64/FP32. With FP32 twiddle factors (right) VkFFT is slightly less precise in Bluestein’s and Rader’s algorithms. If needed, this can be solved with FP64 precomputation.

VkFFT - a story of Vulkan Compute GPU HPC library development: https://youtu.be/FQuJJ0m-my0

VkFFT and beyond – a platform for runtime GPU code generation: https://youtu.be/lHlFPqlOezo

Check out my panel at Nvidia's GTC 2021 in Higher Education and Research category: https://gtc21.event.nvidia.com/

Python interface to VkFFT can be found here: https://github.com/vincefn/pyvkfft

Rust bindings to VkFFT can be found here: https://github.com/semio-ai/vkfft-rs

Benchmark results of VkFFT can be found here: https://openbenchmarking.org/test/pts/vkfft

Contact information

The initial version of VkFFT is developed by Tolmachev Dmitrii
E-mail 1: [email protected]

vkfft's People

Contributors

al42and avatar anarkiwi avatar dragonjoker avatar dtolm avatar ex-rzr avatar expenses avatar leengit avatar mabraham avatar maetveis avatar nicoboss avatar pborsutzki avatar sbalint98 avatar thewtex avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

vkfft's Issues

cuFFT benchmark synchronizing too often

Hello,

I noticed there is sync after each cufftExec call in the timing loop:

cudaDeviceSynchronize();

cudaDeviceSynchronize();

cudaDeviceSynchronize();

cudaDeviceSynchronize();

are those synchronizations necessary? It would be sufficient to do single synchronization after the loop. This would also make performance comparison apples-to-apples as VkFFT issues commands and then waits to complete all of them.

Best regards,
Lukasz

Any plan / desire to have OpenCL as backend?

For OpenCL, there is the buggy clFFT from AMD which has been abandoned for years.
Is there any plan to have an OpenCL backend for vkFFT? Or maybe there is a reference to the algorithms used so that others can do the effort.

"non-CUDA" in binary release

Please rename it to something more fitting.

Like Vulkan. Or OpenGL.

It is like saying "Fiat_500p_not-a-rocket.zip"

Shader compilation error when planning multi-upload half precision R2C/C2R FFTs

Hello Dmitrii
When I try to plan a 1D multi-upload half precision R2C/C2R FFT using the Vulkan backend the following error occurs. This error does not occur when using single precision or C2C with the same configurations. It also does not occur with a size like 2^12 where no multi-upload is required. There is nothing in the documentation that would indicate that this combination isn’t supported. The official error code returned by VkFFT is 4014: VKFFT_ERROR_FAILED_SHADER_PARSE.
Thank you for having a look at it.
Cheers,
Nico

Here an example of the given error with a 1D multi-upload half precision R2C/C2R FFT with a size of 2^13:

#version 450

#extension GL_EXT_shader_16bit_storage : require

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
const float loc_PI = 3.1415926535897932384626433832795f;
const float loc_SQRT1_2 = 0.70710678118654752440084436210485f;
layout(push_constant) uniform PushConsts
{
        uint coordinate;
        uint batchID;
        uint workGroupShiftX;
        uint workGroupShiftY;
        uint workGroupShiftZ;
} consts;

layout(std430, binding = 0) buffer DataIn{
        f16vec2 inputs[8194];
};

layout(std430, binding = 1) buffer DataOut{
        f16vec2 outputs[8194];
};

void main() {
uint id_x = gl_GlobalInvocationID.x % 2048;
uint id_y = (gl_GlobalInvocationID.x / 2048) % 1;
uint id_z = (gl_GlobalInvocationID.x / 2048) / 1;
if (gl_GlobalInvocationID.x < 2048){
uint inoutID = id_x + id_y*4097 +id_z*4097;
uint inoutID2;
uint inoutID3;
                vec2 t0 = inputs[inoutID];
                vec2 tf;
if (id_x == 0)  {
        inoutID2 = 4096 + id_y*4097 +id_z*4097;
        inoutID3 = 2048 + id_y*4097 +id_z*4097;
                tf = inputs[inoutID3];
} else {
        inoutID2 = (4096-id_x) + id_y*4097 +id_z*4097;
}               vec2 t1 = inputs[inoutID2];
                vec2 t2;
                vec2 t3;
if (id_x == 0) {
        t2.x = (t0.x+t1.x);
        t2.y = (t0.x-t1.x);
        tf.y = -tf.y;
        tf.x = tf.x * 2;
        tf.y = tf.y * 2;
                outputs[inoutID] = t2;
                outputs[inoutID3] = tf;
} else {
        t2.x = t0.x + t1.x;
        t2.y = t0.y + t1.y;
        t3.x = t0.x - t1.x;
        t3.y = t0.y - t1.y;
                float angle = (loc_PI*id_x)/4096;
                tf.x = cos(angle);
                tf.y = sin(angle);
        t0.x = tf.x*t2.y+tf.y*t3.x;
        t0.y = -tf.y*t2.y+tf.x*t3.x;
        t1.x = t2.x+t0.x;
        t1.y = -t3.y+t0.y;
        t0.x = t2.x-t0.x;
        t0.y = t3.y+t0.y;
                outputs[inoutID] = t0;
                outputs[inoutID2] = t1;
}
}
}

ERROR: 0:33: '=' :  cannot convert from 'layout( column_major std430 offset=0) temp 2-component vector of float16_t' to ' temp highp 2-component vector of float'
ERROR: 0:33: '' : compilation terminated
ERROR: 2 compilation errors.  No code generated.


VkFFT shader type: 6

Incorrect C2R inplace transform when using Bluestein algorithm

Hi @DTolm, I think I've found a corner case which gives incorrect results with the latest code.

This is specifically when using the C->R inplace transform with the Bluestein transform (e.g. an original 512x510+2 array, with 510=2x3x5x17). The resulting array does not correspond at all to the original array after doing a R2C+C2R transform.

The forward calculation is OK, as is the out-of-place R2C+C2R transform for a 512x510 array.

In my exhaustives tests in pyvkfft, all others pass.

[Announcement] Rust bindings for VkFFT

Hi,

I've put together some Rust bindings for VkFFT. You can find it here. It's certainly incomplete (particularly in documentation, examples, and safety guarantees), but I think it's close enough for other people to get some use from. Hopefully I'll be able to keep giving it some love for a while.

It assumes the Vulkan backend is used. Unfortunately the other backends would probably be best expressed through completely separate bindings. It further assumes vulkano (idiomatic Rust bindings for Vulkan) is also used.

I've appended a snippet of the first kernel FFT from sample_9 to give you a taste of the API.

Thanks for creating VkFFT!

// Configure kernel FFT
let config = Config::builder()
  .physical_device(context.physical)
  .device(context.device.clone())
  .fence(&context.fence)
  .queue(context.queue.clone())
  .buffer(kernel.clone())
  .command_pool(context.pool.clone())
  .kernel_convolution()
  .normalize()
  .coordinate_features(2)
  .batch_count(2)
  .r2c()
  .disable_reorder_four_step()
  .dim(&[32, 32])
  .build()?;

// Allocate a command buffer
let cmd_buffer = context.alloc_primary_cmd_buffer()?;

// Create command buffer handle
let builder =
  unsafe { UnsafeCommandBufferBuilder::new(&cmd_buffer, Kind::primary(), Flags::None)? };

// Configure FFT launch parameters
let mut params = LaunchParams::builder()
  .command_buffer(&builder)
  .build()?;

// Construct FFT "Application"
let mut app = App::new(config)?;

// Run forward FFT
app.forward(&mut params)?;

// Dispatch command buffer and wait for completion
let cmd_buffer = builder.build()?;
context.submit(cmd_buffer)?;

Numerical issue with some Bluestein FFT dimensions

Dear @DTolm, I think I found an issue when performing 2D FFT with some non-radix dimensions.

Here is the comparison between the VkFFT 2D complex64 FFT and the same from numpy, for a 1208x1208 array (1208=2**3 * 151):

image

You can see (zoom in the image) that there is a strong vertical central streak which is not present in numpy (much less intense). And after doing the FFT+iFFT there are horizontal streaks compared to the original array.

DCT type 2 and 3 results

Hi @DTolm - as discussed a few weeks ago I checked the result of the DCT against scipy, and found an inversion of results with respect to vkfft for types 2 and 3. Even if DCT is not the main focus of the library, I've done some further checks using a direct calculation from the formula, and still the discrepancy persists.

Here is the graph of the calculation of the DCT with calculations from pyvkfft, scipy.fft.dct and a direct calculation (labelled dct). As you can see the direct calculations match the results from scipy, but exchange the results for types 2 and 3 for pyvkfft:

image

The code used for the calculations was:

dct_type = 3  
app = VkFFTApp(x.shape,x.dtype,dct=dct_type)
x_cu = cua.to_gpu(x)
app.fft(x_cu)
# scipy
x_scipy = dctn(x, type=dct_type)

# Direct calculation
N = len(x)
y = np.empty_like(x)
n = np.arange(N)
if dct_type == 2:
    for k in range(N):
        y[k] = 2 * (x * np.cos(np.pi/N*(n+0.5)*k)).sum()
elif dct_type == 3:
    for k in range(N):
        y[k] = x[0] + 2 * (x * np.cos(np.pi/N*n*(k+0.5)))[1:].sum()
elif dct_type == 4:
    for k in range(N):
        y[k] = 2 * (x * np.cos(np.pi/N*(n+0.5)*(k+0.5))).sum()

You can also test that in a working colab notebook with all the libraries installed.

PS: in the documentation's equation for DCT3 (2.3.3 page 8) I think the sum should begin at n=1.

By the way thanks a lot for the documentation !

compilation issue on Power9

Hello,
When compiling on an IBM power9, nvcc fails with the following message:

/gpfs/apps/POWER9/GCC/8.3.0/lib/gcc/ppc64le-redhat-linux/8.3.0/include/c++/type_traits(335): error: identifier "__ieee128" is undefined

passing option -std=c++11 to nvcc solves the issue. Maybe this should be taken care of in the CMakeLists.txt file.

c2r/r2c memory layout

Hi, could you explain or point out the location of the in-place c2r/r2c memory layout of the computation result? There is a comment on performR2C saying the memory layout on github, but I couldn't find it. thanks!

Find a way to set OpenCL events

In the recent OpenCL backend, the kernel is enqueued with:

result = clEnqueueNDRangeKernel(app->configuration.commandQueue[0], axis->kernel, 3, 0, global_work_size, local_work_size, 0, 0, 0);

i.e. the last three arguments are set to 0. It would be nice to set/have access to these event arguments, which are helpful for synchronization and profiling.

Benchmark results

Hi, thanks for sharing this, saw it on reddit/r/programming. I ran the benchmarks on a Quadro RTX6000 using defaults in release mode on Windows 10. CUDA 10.2, MSVC 19.2, recent NVidia drivers. Not a clean system, e.g. chrome/firefox/etc running, which might be using some GPU time.

CUDA benchmark first, then VkFFT

benchmark_cuFFT.exe
First 5 runs are a warmup
System: 1024x1024x1, run: 0, Buffer: 8 MB, time per step: 0.068 ms, batch: 1000
System: 1024x1024x1, run: 1, Buffer: 8 MB, time per step: 0.068 ms, batch: 1000
System: 1024x1024x1, run: 2, Buffer: 8 MB, time per step: 0.059 ms, batch: 1000
System: 1024x1024x1, run: 3, Buffer: 8 MB, time per step: 0.052 ms, batch: 1000
System: 1024x1024x1, run: 4, Buffer: 8 MB, time per step: 0.052 ms, batch: 1000
System: 64x64x1, run: 0, Buffer: 0 MB, time per step: 0.012 ms, batch: 1000
System: 64x64x1, run: 1, Buffer: 0 MB, time per step: 0.013 ms, batch: 1000
System: 64x64x1, run: 2, Buffer: 0 MB, time per step: 0.013 ms, batch: 1000
System: 64x64x1, run: 3, Buffer: 0 MB, time per step: 0.013 ms, batch: 1000
System: 64x64x1, run: 4, Buffer: 0 MB, time per step: 0.013 ms, batch: 1000
System: 256x256x1, run: 0, Buffer: 0 MB, time per step: 0.018 ms, batch: 1000
System: 256x256x1, run: 1, Buffer: 0 MB, time per step: 0.018 ms, batch: 1000
System: 256x256x1, run: 2, Buffer: 0 MB, time per step: 0.019 ms, batch: 1000
System: 256x256x1, run: 3, Buffer: 0 MB, time per step: 0.019 ms, batch: 1000
System: 256x256x1, run: 4, Buffer: 0 MB, time per step: 0.021 ms, batch: 1000
System: 1024x256x1, run: 0, Buffer: 2 MB, time per step: 0.022 ms, batch: 1000
System: 1024x256x1, run: 1, Buffer: 2 MB, time per step: 0.022 ms, batch: 1000
System: 1024x256x1, run: 2, Buffer: 2 MB, time per step: 0.022 ms, batch: 1000
System: 1024x256x1, run: 3, Buffer: 2 MB, time per step: 0.022 ms, batch: 1000
System: 1024x256x1, run: 4, Buffer: 2 MB, time per step: 0.022 ms, batch: 1000
System: 512x512x1, run: 0, Buffer: 2 MB, time per step: 0.023 ms, batch: 1000
System: 512x512x1, run: 1, Buffer: 2 MB, time per step: 0.024 ms, batch: 1000
System: 512x512x1, run: 2, Buffer: 2 MB, time per step: 0.024 ms, batch: 1000
System: 512x512x1, run: 3, Buffer: 2 MB, time per step: 0.024 ms, batch: 1000
System: 512x512x1, run: 4, Buffer: 2 MB, time per step: 0.024 ms, batch: 1000
System: 1024x1024x1, run: 0, Buffer: 8 MB, time per step: 0.051 ms, batch: 1000
System: 1024x1024x1, run: 1, Buffer: 8 MB, time per step: 0.052 ms, batch: 1000
System: 1024x1024x1, run: 2, Buffer: 8 MB, time per step: 0.052 ms, batch: 1000
System: 1024x1024x1, run: 3, Buffer: 8 MB, time per step: 0.052 ms, batch: 1000
System: 1024x1024x1, run: 4, Buffer: 8 MB, time per step: 0.051 ms, batch: 1000
System: 4096x256x1, run: 0, Buffer: 8 MB, time per step: 0.046 ms, batch: 1000
System: 4096x256x1, run: 1, Buffer: 8 MB, time per step: 0.047 ms, batch: 1000
System: 4096x256x1, run: 2, Buffer: 8 MB, time per step: 0.047 ms, batch: 1000
System: 4096x256x1, run: 3, Buffer: 8 MB, time per step: 0.046 ms, batch: 1000
System: 4096x256x1, run: 4, Buffer: 8 MB, time per step: 0.046 ms, batch: 1000
System: 2048x1024x1, run: 0, Buffer: 16 MB, time per step: 0.134 ms, batch: 511
System: 2048x1024x1, run: 1, Buffer: 16 MB, time per step: 0.133 ms, batch: 511
System: 2048x1024x1, run: 2, Buffer: 16 MB, time per step: 0.138 ms, batch: 511
System: 2048x1024x1, run: 3, Buffer: 16 MB, time per step: 0.135 ms, batch: 511
System: 2048x1024x1, run: 4, Buffer: 16 MB, time per step: 0.135 ms, batch: 511
System: 4096x2048x1, run: 0, Buffer: 64 MB, time per step: 0.819 ms, batch: 127
System: 4096x2048x1, run: 1, Buffer: 64 MB, time per step: 0.794 ms, batch: 127
System: 4096x2048x1, run: 2, Buffer: 64 MB, time per step: 0.796 ms, batch: 127
System: 4096x2048x1, run: 3, Buffer: 64 MB, time per step: 0.795 ms, batch: 127
System: 4096x2048x1, run: 4, Buffer: 64 MB, time per step: 0.831 ms, batch: 127
System: 4096x4096x1, run: 0, Buffer: 128 MB, time per step: 1.691 ms, batch: 63
System: 4096x4096x1, run: 1, Buffer: 128 MB, time per step: 1.560 ms, batch: 63
System: 4096x4096x1, run: 2, Buffer: 128 MB, time per step: 1.557 ms, batch: 63
System: 4096x4096x1, run: 3, Buffer: 128 MB, time per step: 1.560 ms, batch: 63
System: 4096x4096x1, run: 4, Buffer: 128 MB, time per step: 1.556 ms, batch: 63
System: 32x32x32, run: 0, Buffer: 0 MB, time per step: 0.019 ms, batch: 1000
System: 32x32x32, run: 1, Buffer: 0 MB, time per step: 0.019 ms, batch: 1000
System: 32x32x32, run: 2, Buffer: 0 MB, time per step: 0.019 ms, batch: 1000
System: 32x32x32, run: 3, Buffer: 0 MB, time per step: 0.019 ms, batch: 1000
System: 32x32x32, run: 4, Buffer: 0 MB, time per step: 0.020 ms, batch: 1000
System: 64x64x64, run: 0, Buffer: 2 MB, time per step: 0.029 ms, batch: 1000
System: 64x64x64, run: 1, Buffer: 2 MB, time per step: 0.028 ms, batch: 1000
System: 64x64x64, run: 2, Buffer: 2 MB, time per step: 0.028 ms, batch: 1000
System: 64x64x64, run: 3, Buffer: 2 MB, time per step: 0.027 ms, batch: 1000
System: 64x64x64, run: 4, Buffer: 2 MB, time per step: 0.027 ms, batch: 1000
System: 256x256x32, run: 0, Buffer: 16 MB, time per step: 0.206 ms, batch: 508
System: 256x256x32, run: 1, Buffer: 16 MB, time per step: 0.205 ms, batch: 508
System: 256x256x32, run: 2, Buffer: 16 MB, time per step: 0.202 ms, batch: 508
System: 256x256x32, run: 3, Buffer: 16 MB, time per step: 0.202 ms, batch: 508
System: 256x256x32, run: 4, Buffer: 16 MB, time per step: 0.203 ms, batch: 508
System: 1024x256x32, run: 0, Buffer: 64 MB, time per step: 0.852 ms, batch: 127
System: 1024x256x32, run: 1, Buffer: 64 MB, time per step: 0.850 ms, batch: 127
System: 1024x256x32, run: 2, Buffer: 64 MB, time per step: 0.851 ms, batch: 127
System: 1024x256x32, run: 3, Buffer: 64 MB, time per step: 0.852 ms, batch: 127
System: 1024x256x32, run: 4, Buffer: 64 MB, time per step: 0.854 ms, batch: 127
System: 256x256x256, run: 0, Buffer: 128 MB, time per step: 1.819 ms, batch: 63
System: 256x256x256, run: 1, Buffer: 128 MB, time per step: 1.823 ms, batch: 63
System: 256x256x256, run: 2, Buffer: 128 MB, time per step: 1.934 ms, batch: 63
System: 256x256x256, run: 3, Buffer: 128 MB, time per step: 1.866 ms, batch: 63
System: 256x256x256, run: 4, Buffer: 128 MB, time per step: 1.898 ms, batch: 63
System: 2048x1024x8, run: 0, Buffer: 128 MB, time per step: 1.889 ms, batch: 63
System: 2048x1024x8, run: 1, Buffer: 128 MB, time per step: 1.868 ms, batch: 63
System: 2048x1024x8, run: 2, Buffer: 128 MB, time per step: 1.801 ms, batch: 63
System: 2048x1024x8, run: 3, Buffer: 128 MB, time per step: 1.797 ms, batch: 63
System: 2048x1024x8, run: 4, Buffer: 128 MB, time per step: 1.807 ms, batch: 63
System: 512x512x128, run: 0, Buffer: 256 MB, time per step: 3.196 ms, batch: 31
System: 512x512x128, run: 1, Buffer: 256 MB, time per step: 3.198 ms, batch: 31
System: 512x512x128, run: 2, Buffer: 256 MB, time per step: 3.201 ms, batch: 31
System: 512x512x128, run: 3, Buffer: 256 MB, time per step: 3.413 ms, batch: 31
System: 512x512x128, run: 4, Buffer: 256 MB, time per step: 3.199 ms, batch: 31
System: 2048x256x256, run: 0, Buffer: 1024 MB, time per step: 15.750 ms, batch: 7
System: 2048x256x256, run: 1, Buffer: 1024 MB, time per step: 14.674 ms, batch: 7
System: 2048x256x256, run: 2, Buffer: 1024 MB, time per step: 15.164 ms, batch: 7
System: 2048x256x256, run: 3, Buffer: 1024 MB, time per step: 14.731 ms, batch: 7
System: 2048x256x256, run: 4, Buffer: 1024 MB, time per step: 14.665 ms, batch: 7
System: 4096x4096x8, run: 0, Buffer: 1024 MB, time per step: 16.517 ms, batch: 7
System: 4096x4096x8, run: 1, Buffer: 1024 MB, time per step: 16.589 ms, batch: 7
System: 4096x4096x8, run: 2, Buffer: 1024 MB, time per step: 16.451 ms, batch: 7
System: 4096x4096x8, run: 3, Buffer: 1024 MB, time per step: 16.613 ms, batch: 7
System: 4096x4096x8, run: 4, Buffer: 1024 MB, time per step: 16.449 ms, batch: 7
Benchmark score: 39921

VkFFT
First 5 runs are a warmup
System: 1024x1024x1, run: 0, Buffer: 4 MB, time per step: 0.097 ms, batch: 1000
System: 1024x1024x1, run: 1, Buffer: 4 MB, time per step: 0.097 ms, batch: 1000
System: 1024x1024x1, run: 2, Buffer: 4 MB, time per step: 0.073 ms, batch: 1000
System: 1024x1024x1, run: 3, Buffer: 4 MB, time per step: 0.070 ms, batch: 1000
System: 1024x1024x1, run: 4, Buffer: 4 MB, time per step: 0.072 ms, batch: 1000
System: 64x64x1, run: 0, Buffer: 0 MB, time per step: 0.024 ms, batch: 1000
System: 64x64x1, run: 1, Buffer: 0 MB, time per step: 0.024 ms, batch: 1000
System: 64x64x1, run: 2, Buffer: 0 MB, time per step: 0.024 ms, batch: 1000
System: 64x64x1, run: 3, Buffer: 0 MB, time per step: 0.024 ms, batch: 1000
System: 64x64x1, run: 4, Buffer: 0 MB, time per step: 0.024 ms, batch: 1000
System: 256x256x1, run: 0, Buffer: 0 MB, time per step: 0.029 ms, batch: 1000
System: 256x256x1, run: 1, Buffer: 0 MB, time per step: 0.029 ms, batch: 1000
System: 256x256x1, run: 2, Buffer: 0 MB, time per step: 0.029 ms, batch: 1000
System: 256x256x1, run: 3, Buffer: 0 MB, time per step: 0.029 ms, batch: 1000
System: 256x256x1, run: 4, Buffer: 0 MB, time per step: 0.029 ms, batch: 1000
System: 1024x256x1, run: 0, Buffer: 1 MB, time per step: 0.033 ms, batch: 1000
System: 1024x256x1, run: 1, Buffer: 1 MB, time per step: 0.035 ms, batch: 1000
System: 1024x256x1, run: 2, Buffer: 1 MB, time per step: 0.033 ms, batch: 1000
System: 1024x256x1, run: 3, Buffer: 1 MB, time per step: 0.033 ms, batch: 1000
System: 1024x256x1, run: 4, Buffer: 1 MB, time per step: 0.033 ms, batch: 1000
System: 512x512x1, run: 0, Buffer: 1 MB, time per step: 0.032 ms, batch: 1000
System: 512x512x1, run: 1, Buffer: 1 MB, time per step: 0.035 ms, batch: 1000
System: 512x512x1, run: 2, Buffer: 1 MB, time per step: 0.036 ms, batch: 1000
System: 512x512x1, run: 3, Buffer: 1 MB, time per step: 0.034 ms, batch: 1000
System: 512x512x1, run: 4, Buffer: 1 MB, time per step: 0.034 ms, batch: 1000
System: 1024x1024x1, run: 0, Buffer: 4 MB, time per step: 0.072 ms, batch: 1000
System: 1024x1024x1, run: 1, Buffer: 4 MB, time per step: 0.072 ms, batch: 1000
System: 1024x1024x1, run: 2, Buffer: 4 MB, time per step: 0.072 ms, batch: 1000
System: 1024x1024x1, run: 3, Buffer: 4 MB, time per step: 0.071 ms, batch: 1000
System: 1024x1024x1, run: 4, Buffer: 4 MB, time per step: 0.071 ms, batch: 1000
System: 4096x256x1, run: 0, Buffer: 4 MB, time per step: 0.062 ms, batch: 1000
System: 4096x256x1, run: 1, Buffer: 4 MB, time per step: 0.063 ms, batch: 1000
System: 4096x256x1, run: 2, Buffer: 4 MB, time per step: 0.063 ms, batch: 1000
System: 4096x256x1, run: 3, Buffer: 4 MB, time per step: 0.065 ms, batch: 1000
System: 4096x256x1, run: 4, Buffer: 4 MB, time per step: 0.063 ms, batch: 1000
System: 2048x1024x1, run: 0, Buffer: 8 MB, time per step: 0.166 ms, batch: 511
System: 2048x1024x1, run: 1, Buffer: 8 MB, time per step: 0.169 ms, batch: 511
System: 2048x1024x1, run: 2, Buffer: 8 MB, time per step: 0.168 ms, batch: 511
System: 2048x1024x1, run: 3, Buffer: 8 MB, time per step: 0.171 ms, batch: 511
System: 2048x1024x1, run: 4, Buffer: 8 MB, time per step: 0.168 ms, batch: 511
System: 4096x2048x1, run: 0, Buffer: 32 MB, time per step: 0.740 ms, batch: 127
System: 4096x2048x1, run: 1, Buffer: 32 MB, time per step: 0.738 ms, batch: 127
System: 4096x2048x1, run: 2, Buffer: 32 MB, time per step: 0.754 ms, batch: 127
System: 4096x2048x1, run: 3, Buffer: 32 MB, time per step: 0.736 ms, batch: 127
System: 4096x2048x1, run: 4, Buffer: 32 MB, time per step: 0.746 ms, batch: 127
System: 4096x4096x1, run: 0, Buffer: 64 MB, time per step: 1.463 ms, batch: 63
System: 4096x4096x1, run: 1, Buffer: 64 MB, time per step: 1.462 ms, batch: 63
System: 4096x4096x1, run: 2, Buffer: 64 MB, time per step: 1.463 ms, batch: 63
System: 4096x4096x1, run: 3, Buffer: 64 MB, time per step: 1.463 ms, batch: 63
System: 4096x4096x1, run: 4, Buffer: 64 MB, time per step: 1.434 ms, batch: 63
System: 32x32x32, run: 0, Buffer: 0 MB, time per step: 0.040 ms, batch: 1000
System: 32x32x32, run: 1, Buffer: 0 MB, time per step: 0.040 ms, batch: 1000
System: 32x32x32, run: 2, Buffer: 0 MB, time per step: 0.040 ms, batch: 1000
System: 32x32x32, run: 3, Buffer: 0 MB, time per step: 0.041 ms, batch: 1000
System: 32x32x32, run: 4, Buffer: 0 MB, time per step: 0.040 ms, batch: 1000
System: 64x64x64, run: 0, Buffer: 1 MB, time per step: 0.048 ms, batch: 1000
System: 64x64x64, run: 1, Buffer: 1 MB, time per step: 0.047 ms, batch: 1000
System: 64x64x64, run: 2, Buffer: 1 MB, time per step: 0.046 ms, batch: 1000
System: 64x64x64, run: 3, Buffer: 1 MB, time per step: 0.046 ms, batch: 1000
System: 64x64x64, run: 4, Buffer: 1 MB, time per step: 0.047 ms, batch: 1000
System: 256x256x32, run: 0, Buffer: 8 MB, time per step: 0.218 ms, batch: 508
System: 256x256x32, run: 1, Buffer: 8 MB, time per step: 0.215 ms, batch: 508
System: 256x256x32, run: 2, Buffer: 8 MB, time per step: 0.218 ms, batch: 508
System: 256x256x32, run: 3, Buffer: 8 MB, time per step: 0.219 ms, batch: 508
System: 256x256x32, run: 4, Buffer: 8 MB, time per step: 0.218 ms, batch: 508
System: 1024x256x32, run: 0, Buffer: 32 MB, time per step: 0.757 ms, batch: 127
System: 1024x256x32, run: 1, Buffer: 32 MB, time per step: 0.760 ms, batch: 127
System: 1024x256x32, run: 2, Buffer: 32 MB, time per step: 0.766 ms, batch: 127
System: 1024x256x32, run: 3, Buffer: 32 MB, time per step: 0.773 ms, batch: 127
System: 1024x256x32, run: 4, Buffer: 32 MB, time per step: 0.756 ms, batch: 127
System: 256x256x256, run: 0, Buffer: 64 MB, time per step: 1.536 ms, batch: 63
System: 256x256x256, run: 1, Buffer: 64 MB, time per step: 1.546 ms, batch: 63
System: 256x256x256, run: 2, Buffer: 64 MB, time per step: 1.566 ms, batch: 63
System: 256x256x256, run: 3, Buffer: 64 MB, time per step: 1.537 ms, batch: 63
System: 256x256x256, run: 4, Buffer: 64 MB, time per step: 1.548 ms, batch: 63
System: 2048x1024x8, run: 0, Buffer: 64 MB, time per step: 1.527 ms, batch: 63
System: 2048x1024x8, run: 1, Buffer: 64 MB, time per step: 1.551 ms, batch: 63
System: 2048x1024x8, run: 2, Buffer: 64 MB, time per step: 1.525 ms, batch: 63
System: 2048x1024x8, run: 3, Buffer: 64 MB, time per step: 1.549 ms, batch: 63
System: 2048x1024x8, run: 4, Buffer: 64 MB, time per step: 1.548 ms, batch: 63
System: 512x512x128, run: 0, Buffer: 128 MB, time per step: 2.939 ms, batch: 31
System: 512x512x128, run: 1, Buffer: 128 MB, time per step: 2.964 ms, batch: 31
System: 512x512x128, run: 2, Buffer: 128 MB, time per step: 2.967 ms, batch: 31
System: 512x512x128, run: 3, Buffer: 128 MB, time per step: 2.945 ms, batch: 31
System: 512x512x128, run: 4, Buffer: 128 MB, time per step: 2.924 ms, batch: 31
System: 2048x256x256, run: 0, Buffer: 512 MB, time per step: 12.321 ms, batch: 7
System: 2048x256x256, run: 1, Buffer: 512 MB, time per step: 12.409 ms, batch: 7
System: 2048x256x256, run: 2, Buffer: 512 MB, time per step: 12.370 ms, batch: 7
System: 2048x256x256, run: 3, Buffer: 512 MB, time per step: 12.288 ms, batch: 7
System: 2048x256x256, run: 4, Buffer: 512 MB, time per step: 12.324 ms, batch: 7
System: 4096x4096x8, run: 0, Buffer: 512 MB, time per step: 15.102 ms, batch: 7
System: 4096x4096x8, run: 1, Buffer: 512 MB, time per step: 14.992 ms, batch: 7
System: 4096x4096x8, run: 2, Buffer: 512 MB, time per step: 14.884 ms, batch: 7
System: 4096x4096x8, run: 3, Buffer: 512 MB, time per step: 14.894 ms, batch: 7
System: 4096x4096x8, run: 4, Buffer: 512 MB, time per step: 14.996 ms, batch: 7
Benchmark score: 35830

2D and 3D R2C/C2R wrong results with hip backend

Hello, I'm using vkfft with the hip backend. I extracted a simple test case from the source file Vulkan_FFT.cpp。On AMD MI50 GPU and ROCm 3.9/4.0,the 1D R2C/C2R results are right, but those of 2D and 3D are partially wrong. I'm not sure if it is caused by vkfft itself or I used it incorrectly.
Here is the code:

#include<cstdio>
#include<cstdlib>

#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <hip/hip_complex.h>
#include "vkFFT.h"

#include <chrono>

#define DIM 3
#define SIZE 52

template < typename T>
void initialData(T *ip, const int size, const int max_value)
{
    int i;

    for(i = 0; i < size; i++)
    {
        ip[i] = (T)(rand() % max_value);
    }
}

void checkResult(float *backRef, float *originalRef, const int N)
{
    double epsilon = 1.0E-2;
    bool match = 1;
    const float overN = 1.0 / pow(SIZE, DIM);

    for (int i = 0; i < N; i++)
    {
        if (fabs(backRef[i] * overN - originalRef[i]) > epsilon)
        {
            match = 0;
            printf("Arrays do not match!\n");
            printf("back result %5.2f original result %5.2f at %d\n", backRef[i]* overN, originalRef[i], i);
            break;
        }
    }

    if (match) printf("Arrays match.\n\n");
}

typedef struct {
    hipDevice_t device;
    hipCtx_t context;
    uint32_t device_id;
}VkGPU;

int main()
{
    hipError_t res = hipSuccess;
    VkGPU vkGPU = {};
    hipInit(0);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_INITIALIZE;
    hipSetDevice(vkGPU.device_id);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_SET_DEVICE_ID;
    hipDeviceGet(&vkGPU.device, vkGPU.device_id);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_GET_DEVICE;
    hipCtxCreate(&vkGPU.context, 0, vkGPU.device);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_CREATE_CONTEXT;
    VkFFTConfiguration configuration = {};
    VkFFTApplication app = {};
    configuration.FFTdim = 3;
    configuration.size[0] = SIZE;
    configuration.size[1] = SIZE;
    configuration.size[2] = SIZE;
    configuration.device = &(vkGPU.device);
    configuration.performR2C = true;
    uint64_t buffersize = sizeof(hipFloatComplex) * pow(SIZE, DIM);

    float *h_realGrid = (float *)malloc(sizeof(float) * pow(SIZE, DIM));
    float *h_realGrid_original = (float *)malloc(sizeof(float) * pow(SIZE, DIM));
    hipFloatComplex *h_complexGrid = (hipFloatComplex *)malloc(sizeof(hipFloatComplex) * pow(SIZE, DIM));

    hipFloatComplex *d_complexGrid;
    //hipMalloc((void **)&d_realGrid, sizeof(float) * pow(SIZE, DIM));
    hipMalloc((void **)&d_complexGrid, sizeof(hipFloatComplex) * pow(SIZE, DIM));
 
    configuration.buffer = (void**)&d_complexGrid;
    configuration.bufferSize = &buffersize;

    initialData(h_realGrid_original, pow(SIZE, DIM), 10);
    //for(int i = 0; i < pow(SIZE, DIM); ++i)
    //{
    //    h_complexGrid[i].x = h_realGrid_original[i];
    //    h_complexGrid[i].y = 0.0f;
    //}
    memset(h_complexGrid, 0, sizeof(hipFloatComplex) * pow(SIZE, DIM));
    memcpy(h_complexGrid, h_realGrid_original, sizeof(float) * pow(SIZE, DIM));

    hipMemcpy(d_complexGrid, h_complexGrid, sizeof(hipFloatComplex) * pow(SIZE, DIM), hipMemcpyHostToDevice);
 
    std::chrono::steady_clock::time_point timeSubmit = std::chrono::steady_clock::now(); 
    VkFFTResult resFFT;
    resFFT = initializeVkFFT(&app, configuration);
    if (resFFT != VKFFT_SUCCESS) 
    {
        printf("VkFFT error code %d\n", resFFT);
        return resFFT;
    }
    VkFFTLaunchParams launchParams = {};
  
 
    resFFT = VkFFTAppend(&app, -1, &launchParams); 
    if (resFFT != VKFFT_SUCCESS) 
    {
        printf("vkfft error code %d\n", resFFT);
        return resFFT;
    }
    resFFT = VkFFTAppend(&app, 1, &launchParams);
    if (resFFT != VKFFT_SUCCESS) 
    {
        printf("vkfft error code %d\n", resFFT);
        return resFFT;
    }
    res = hipDeviceSynchronize();
    if (res != hipSuccess) 
    {
        printf("hip error code %d\n", res);
        return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
    }
    
    std::chrono::steady_clock::time_point timeEnd = std::chrono::steady_clock::now();
    float totTime = std::chrono::duration_cast<std::chrono::microseconds>(timeEnd - timeSubmit).count() * 0.001; 
    printf("time: %.3f ms\n", totTime);
    
    hipMemcpy(h_complexGrid, d_complexGrid, sizeof(hipFloatComplex) * pow(SIZE, DIM), hipMemcpyDeviceToHost);
    //for(int i = 0; i < pow(SIZE, DIM); ++i)
    //{
    //    h_realGrid[i] = h_complexGrid[i].x;
    //}
    memcpy(h_realGrid, h_complexGrid, sizeof(float) * pow(SIZE, DIM));
 
    printf("h_realGrid: ");
    for(int i = 0; i < 30; ++i)
    {
         printf("%.6f  ", h_realGrid[i] / pow(SIZE, DIM));
    }
    printf("\nh_realGrid_original: ");
    for(int i = 0; i < 30; ++i)
    {
         printf("%.6f  ", h_realGrid_original[i]);
    }

    printf("\n");
    checkResult(h_realGrid, h_realGrid_original, pow(SIZE, DIM));

    deleteVkFFT(&app);
    free(h_realGrid);
    free(h_realGrid_original);
    free(h_complexGrid);

    //hipFree(d_realGrid);
    hipFree(d_complexGrid);
    
    return 0;
}

I compile it using the command:

hipcc -o planMany-vkfft -DVKFFT_BACKEND=2  planMany-vkfft.cpp

and my results:

time: 11023.094 ms
h_realGrid: 2.999996  5.999995  6.999996  4.999996  2.999997  4.999995  5.999996  1.999996  8.999998  0.999997  1.999998  6.999994  -0.000003  8.999997  2.999996  5.999996  -0.000004  5.999995  1.999997  5.999997  0.999996  7.999999  6.999996  8.999996  1.999997  -0.000003  1.999996  2.999999  6.999996  4.999996
h_realGrid_original: 3.000000  6.000000  7.000000  5.000000  3.000000  5.000000  6.000000  2.000000  9.000000  1.000000  2.000000  7.000000  0.000000  9.000000  3.000000  6.000000  0.000000  6.000000  2.000000  6.000000  1.000000  8.000000  7.000000  9.000000  2.000000  0.000000  2.000000  3.000000  7.000000  5.000000
back result -0.46 original result  1.00 at 52
back result  0.08 original result  7.00 at 106
back result  0.00 original result  2.00 at 107
back result -0.23 original result  9.00 at 160
back result -0.00 original result  8.00 at 161
back result  0.35 original result  7.00 at 214
back result  0.00 original result  6.00 at 215
back result -0.31 original result  2.00 at 268
back result  0.00 original result  4.00 at 269
back result  0.56 original result  3.00 at 322
back result -0.00 original result  2.00 at 323
back result -0.02 original result  0.00 at 376
back result -0.00 original result  5.00 at 377
......

Could you please help me with this problem? Thank you!

DCT

Hi Dmitry,

I've updated pyvkfft for vkfft 1.2.12 notably with the updated DCT.
Some issues:

  1. normalisation (using the default norm option in VkFFT):
  • the forward DCT reports the same values as scipy, but only types 2 and 3 inverse transform give the same result.
  • the forward+backward DCT only gives back the original array for type 2 and 3.
  • For type 1 the forward+backward DCT seems to give back the original array if I scale the final result by (N/(N-1))**2. This for a NxN array, tested with N=512,400,330
  • For type 2 the forward+backward DCT seems to give back the original array if I scale the final result by 4. This also for a NxN array with N=512,400,330
    In both cases not only is the original array recovered as expected, but it does correspond to scipy's result. So it seems that with a small normalisation correction this could give consistent results, even if DCT is less 'standardised' than FFT when it comes to normalisation.
  1. in OpenCL, the DCT type 1 (for a 512x512 array) gives a ptxas error on my GTX 1080 Ti (mac with cuda 10.1) - it requests 64k shared mem ? (type 2,3,4 run OK):
Testing DCT type 1
ptxas error   : Entry function 'VkFFT_main' uses too much shared data (0x10000 bytes, 0xc000 max)
__constant float loc_PI = 3.1415926535897932384626433832795f;
__constant float loc_SQRT1_2 = 0.70710678118654752440084436210485f;
	typedef struct {
	unsigned int workGroupShiftX;
	unsigned int workGroupShiftY;
	unsigned int workGroupShiftZ;
	}PushConsts;
__kernel __attribute__((reqd_work_group_size(4, 128, 1))) void VkFFT_main (__global float* inputs, __global float* outputs, __global float2* BluesteinConvolutionKernel, __global float2* BluesteinMultiplication, PushConsts consts) {
unsigned int sharedStride = 4;
__local float2 sdata[8192];

	float2 temp_0;
	temp_0.x=0;
	temp_0.y=0;
	float2 temp_1;
	temp_1.x=0;
	temp_1.y=0;
	float2 temp_2;
	temp_2.x=0;
	temp_2.y=0;
	float2 temp_3;
	temp_3.x=0;
	temp_3.y=0;
	float2 temp_4;
	temp_4.x=0;
	temp_4.y=0;
	float2 temp_5;
	temp_5.x=0;
	temp_5.y=0;
	float2 temp_6;
	temp_6.x=0;
	temp_6.y=0;
	float2 temp_7;
	temp_7.x=0;
	temp_7.y=0;
	float2 temp_8;
	temp_8.x=0;
	temp_8.y=0;
	float2 temp_9;
	temp_9.x=0;
	temp_9.y=0;
	float2 temp_10;
	temp_10.x=0;
	temp_10.y=0;
	float2 temp_11;
	temp_11.x=0;
	temp_11.y=0;
	float2 temp_12;
	temp_12.x=0;
	temp_12.y=0;
	float2 temp_13;
	temp_13.x=0;
	temp_13.y=0;
	float2 temp_14;
	temp_14.x=0;
	temp_14.y=0;
	float2 temp_15;
	temp_15.x=0;
	temp_15.y=0;
	float2 w;
	w.x=0;
	w.y=0;
	float2 loc_0;
	loc_0.x=0;
	loc_0.y=0;
	float2 iw;
	iw.x=0;
	iw.y=0;
	unsigned int stageInvocationID=0;
	unsigned int blockInvocationID=0;
	unsigned int sdataID=0;
	unsigned int combinedID=0;
	unsigned int inoutID=0;
	float angle=0;
	float2 mult;
	mult.x = 0;
	mult.y = 0;
		sdata[sharedStride*(get_local_id(1)+0)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+0)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+128)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+128)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+256)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+256)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+384)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+384)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+512)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+512)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+640)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+640)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+768)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+768)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+896)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+896)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1024)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1024)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1152)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1152)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1280)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1280)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1408)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1408)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1536)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1536)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1664)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1664)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1792)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1792)+get_local_id(0)].y=0;
		sdata[sharedStride*(get_local_id(1)+1920)+get_local_id(0)].x=0;
		sdata[sharedStride*(get_local_id(1)+1920)+get_local_id(0)].y=0;
	barrier(CLK_LOCAL_MEM_FENCE);

		combinedID = (get_local_id(1) + 0);
		if((combinedID % 2048) < 1022){
		sdataID = (combinedID % 512) * sharedStride + get_local_id(0);
			inoutID = ((get_global_id(0)) % (512)) + ((get_local_id(1) + 0)) * 512;
		temp_0.x = inputs[inoutID];
		sdata[sdataID].x = temp_0.x;
		sdata[sdataID].y = 0;
		if (((combinedID % 512)>0)&&((combinedID % 512) < 511)){
		inoutID = (1022 - combinedID % 512) * sharedStride + get_local_id(0);
		sdata[inoutID] = sdata[sdataID];
		}
		}
		combinedID = (get_local_id(1) + 128);
		if((combinedID % 2048) < 1022){
		sdataID = (combinedID % 512) * sharedStride + get_local_id(0);
			inoutID = ((get_global_id(0)) % (512)) + ((get_local_id(1) + 128)) * 512;
		temp_0.x = inputs[inoutID];
		sdata[sdataID].x = temp_0.x;
		sdata[sdataID].y = 0;
		if (((combinedID % 512)>0)&&((combinedID % 512) < 511)){
		inoutID = (1022 - combinedID % 512) * sharedStride + get_local_id(0);
		sdata[inoutID] = sdata[sdataID];
		}
		}
		combinedID = (get_local_id(1) + 256);
		if((combinedID % 2048) < 1022){
		sdataID = (combinedID % 512) * sharedStride + get_local_id(0);
			inoutID = ((get_global_id(0)) % (512)) + ((get_local_id(1) + 256)) * 512;
		temp_0.x = inputs[inoutID];
		sdata[sdataID].x = temp_0.x;
		sdata[sdataID].y = 0;
		if (((combinedID % 512)>0)&&((combinedID % 512) < 511)){
		inoutID = (1022 - combinedID % 512) * sharedStride + get_local_id(0);
		sdata[inoutID] = sdata[sdataID];
		}
		}
		combinedID = (get_local_id(1) + 384);
		if((combinedID % 2048) < 1022){
		if(combinedID < 512){
		sdataID = (combinedID % 512) * sharedStride + get_local_id(0);
			inoutID = ((get_global_id(0)) % (512)) + ((get_local_id(1) + 384)) * 512;
		temp_0.x = inputs[inoutID];
		sdata[sdataID].x = temp_0.x;
		sdata[sdataID].y = 0;
		if (((combinedID % 512)>0)&&((combinedID % 512) < 511)){
		inoutID = (1022 - combinedID % 512) * sharedStride + get_local_id(0);
		sdata[inoutID] = sdata[sdataID];
		}
		}
		}
	barrier(CLK_LOCAL_MEM_FENCE);

		inoutID = get_local_id(1) + 0;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_0 = sdata[get_local_id(0) + (get_local_id(1) + 0 * get_local_size(1))*sharedStride];
	loc_0.x = temp_0.x * w.x - temp_0.y * w.y;
	loc_0.y = temp_0.y * w.x + temp_0.x * w.y;
	temp_0 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 0 * get_local_size(1))*sharedStride] = temp_0;
		}
		inoutID = get_local_id(1) + 128;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_1 = sdata[get_local_id(0) + (get_local_id(1) + 1 * get_local_size(1))*sharedStride];
	loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
	loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
	temp_1 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 1 * get_local_size(1))*sharedStride] = temp_1;
		}
		inoutID = get_local_id(1) + 256;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_2 = sdata[get_local_id(0) + (get_local_id(1) + 2 * get_local_size(1))*sharedStride];
	loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
	loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
	temp_2 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 2 * get_local_size(1))*sharedStride] = temp_2;
		}
		inoutID = get_local_id(1) + 384;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_3 = sdata[get_local_id(0) + (get_local_id(1) + 3 * get_local_size(1))*sharedStride];
	loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
	loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
	temp_3 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 3 * get_local_size(1))*sharedStride] = temp_3;
		}
		inoutID = get_local_id(1) + 512;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_4 = sdata[get_local_id(0) + (get_local_id(1) + 4 * get_local_size(1))*sharedStride];
	loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
	loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
	temp_4 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 4 * get_local_size(1))*sharedStride] = temp_4;
		}
		inoutID = get_local_id(1) + 640;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_5 = sdata[get_local_id(0) + (get_local_id(1) + 5 * get_local_size(1))*sharedStride];
	loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
	loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
	temp_5 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 5 * get_local_size(1))*sharedStride] = temp_5;
		}
		inoutID = get_local_id(1) + 768;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_6 = sdata[get_local_id(0) + (get_local_id(1) + 6 * get_local_size(1))*sharedStride];
	loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
	loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
	temp_6 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 6 * get_local_size(1))*sharedStride] = temp_6;
		}
		inoutID = get_local_id(1) + 896;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_7 = sdata[get_local_id(0) + (get_local_id(1) + 7 * get_local_size(1))*sharedStride];
	loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
	loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
	temp_7 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 7 * get_local_size(1))*sharedStride] = temp_7;
		}
		inoutID = get_local_id(1) + 1024;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_8 = sdata[get_local_id(0) + (get_local_id(1) + 8 * get_local_size(1))*sharedStride];
	loc_0.x = temp_8.x * w.x - temp_8.y * w.y;
	loc_0.y = temp_8.y * w.x + temp_8.x * w.y;
	temp_8 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 8 * get_local_size(1))*sharedStride] = temp_8;
		}
		inoutID = get_local_id(1) + 1152;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_9 = sdata[get_local_id(0) + (get_local_id(1) + 9 * get_local_size(1))*sharedStride];
	loc_0.x = temp_9.x * w.x - temp_9.y * w.y;
	loc_0.y = temp_9.y * w.x + temp_9.x * w.y;
	temp_9 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 9 * get_local_size(1))*sharedStride] = temp_9;
		}
		inoutID = get_local_id(1) + 1280;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_10 = sdata[get_local_id(0) + (get_local_id(1) + 10 * get_local_size(1))*sharedStride];
	loc_0.x = temp_10.x * w.x - temp_10.y * w.y;
	loc_0.y = temp_10.y * w.x + temp_10.x * w.y;
	temp_10 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 10 * get_local_size(1))*sharedStride] = temp_10;
		}
		inoutID = get_local_id(1) + 1408;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_11 = sdata[get_local_id(0) + (get_local_id(1) + 11 * get_local_size(1))*sharedStride];
	loc_0.x = temp_11.x * w.x - temp_11.y * w.y;
	loc_0.y = temp_11.y * w.x + temp_11.x * w.y;
	temp_11 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 11 * get_local_size(1))*sharedStride] = temp_11;
		}
		inoutID = get_local_id(1) + 1536;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_12 = sdata[get_local_id(0) + (get_local_id(1) + 12 * get_local_size(1))*sharedStride];
	loc_0.x = temp_12.x * w.x - temp_12.y * w.y;
	loc_0.y = temp_12.y * w.x + temp_12.x * w.y;
	temp_12 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 12 * get_local_size(1))*sharedStride] = temp_12;
		}
		inoutID = get_local_id(1) + 1664;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_13 = sdata[get_local_id(0) + (get_local_id(1) + 13 * get_local_size(1))*sharedStride];
	loc_0.x = temp_13.x * w.x - temp_13.y * w.y;
	loc_0.y = temp_13.y * w.x + temp_13.x * w.y;
	temp_13 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 13 * get_local_size(1))*sharedStride] = temp_13;
		}
		inoutID = get_local_id(1) + 1792;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_14 = sdata[get_local_id(0) + (get_local_id(1) + 14 * get_local_size(1))*sharedStride];
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 14 * get_local_size(1))*sharedStride] = temp_14;
		}
		inoutID = get_local_id(1) + 1920;
		if((inoutID % 2048) < 1022){
		w = BluesteinMultiplication[inoutID];
		temp_15 = sdata[get_local_id(0) + (get_local_id(1) + 15 * get_local_size(1))*sharedStride];
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15 = loc_0;
		sdata[get_local_id(0) + (get_local_id(1) + 15 * get_local_size(1))*sharedStride] = temp_15;
		}
	barrier(CLK_LOCAL_MEM_FENCE);

		stageInvocationID = (get_local_id(1)+ 0) % (1);
		angle = stageInvocationID * -3.14159265358979312f;
		temp_0 = sdata[sharedStride*(get_local_id(1)+0)+get_local_id(0)];
		temp_2 = sdata[sharedStride*(get_local_id(1)+256)+get_local_id(0)];
		temp_4 = sdata[sharedStride*(get_local_id(1)+512)+get_local_id(0)];
		temp_6 = sdata[sharedStride*(get_local_id(1)+768)+get_local_id(0)];
		temp_8 = sdata[sharedStride*(get_local_id(1)+1024)+get_local_id(0)];
		temp_10 = sdata[sharedStride*(get_local_id(1)+1280)+get_local_id(0)];
		temp_12 = sdata[sharedStride*(get_local_id(1)+1536)+get_local_id(0)];
		temp_14 = sdata[sharedStride*(get_local_id(1)+1792)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_8.x * w.x - temp_8.y * w.y;
	loc_0.y = temp_8.y * w.x + temp_8.x * w.y;
	temp_8.x = temp_0.x - loc_0.x;
	temp_8.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_10.x * w.x - temp_10.y * w.y;
	loc_0.y = temp_10.y * w.x + temp_10.x * w.y;
	temp_10.x = temp_2.x - loc_0.x;
	temp_10.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	loc_0.x = temp_12.x * w.x - temp_12.y * w.y;
	loc_0.y = temp_12.y * w.x + temp_12.x * w.y;
	temp_12.x = temp_4.x - loc_0.x;
	temp_12.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14.x = temp_6.x - loc_0.x;
	temp_14.y = temp_6.y - loc_0.y;
	temp_6.x = temp_6.x + loc_0.x;
	temp_6.y = temp_6.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
	loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
	temp_4.x = temp_0.x - loc_0.x;
	temp_4.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
	loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
	temp_6.x = temp_2.x - loc_0.x;
	temp_6.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_12.x * iw.x - temp_12.y * iw.y;
	loc_0.y = temp_12.y * iw.x + temp_12.x * iw.y;
	temp_12.x = temp_8.x - loc_0.x;
	temp_12.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	loc_0.x = temp_14.x * iw.x - temp_14.y * iw.y;
	loc_0.y = temp_14.y * iw.x + temp_14.x * iw.y;
	temp_14.x = temp_10.x - loc_0.x;
	temp_14.y = temp_10.y - loc_0.y;
	temp_10.x = temp_10.x + loc_0.x;
	temp_10.y = temp_10.y + loc_0.y;
	w.x = native_cos(0.25f*angle);
	w.y = native_sin(0.25f*angle);
	loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
	loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
	temp_2.x = temp_0.x - loc_0.x;
	temp_2.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_6.x * iw.x - temp_6.y * iw.y;
	loc_0.y = temp_6.y * iw.x + temp_6.x * iw.y;
	temp_6.x = temp_4.x - loc_0.x;
	temp_6.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	iw.x = w.x * loc_SQRT1_2 + w.y * loc_SQRT1_2;
	iw.y = w.y * loc_SQRT1_2 - w.x * loc_SQRT1_2;

	loc_0.x = temp_10.x * iw.x - temp_10.y * iw.y;
	loc_0.y = temp_10.y * iw.x + temp_10.x * iw.y;
	temp_10.x = temp_8.x - loc_0.x;
	temp_10.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	w.x = iw.y;
	w.y = -iw.x;
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14.x = temp_12.x - loc_0.x;
	temp_14.y = temp_12.y - loc_0.y;
	temp_12.x = temp_12.x + loc_0.x;
	temp_12.y = temp_12.y + loc_0.y;
	loc_0 = temp_2;
	temp_2 = temp_8;
	temp_8 = loc_0;
	loc_0 = temp_6;
	temp_6 = temp_12;
	temp_12 = loc_0;
		stageInvocationID = (get_local_id(1)+ 128) % (1);
		angle = stageInvocationID * -3.14159265358979312f;
		temp_1 = sdata[sharedStride*(get_local_id(1)+128)+get_local_id(0)];
		temp_3 = sdata[sharedStride*(get_local_id(1)+384)+get_local_id(0)];
		temp_5 = sdata[sharedStride*(get_local_id(1)+640)+get_local_id(0)];
		temp_7 = sdata[sharedStride*(get_local_id(1)+896)+get_local_id(0)];
		temp_9 = sdata[sharedStride*(get_local_id(1)+1152)+get_local_id(0)];
		temp_11 = sdata[sharedStride*(get_local_id(1)+1408)+get_local_id(0)];
		temp_13 = sdata[sharedStride*(get_local_id(1)+1664)+get_local_id(0)];
		temp_15 = sdata[sharedStride*(get_local_id(1)+1920)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_9.x * w.x - temp_9.y * w.y;
	loc_0.y = temp_9.y * w.x + temp_9.x * w.y;
	temp_9.x = temp_1.x - loc_0.x;
	temp_9.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	loc_0.x = temp_11.x * w.x - temp_11.y * w.y;
	loc_0.y = temp_11.y * w.x + temp_11.x * w.y;
	temp_11.x = temp_3.x - loc_0.x;
	temp_11.y = temp_3.y - loc_0.y;
	temp_3.x = temp_3.x + loc_0.x;
	temp_3.y = temp_3.y + loc_0.y;
	loc_0.x = temp_13.x * w.x - temp_13.y * w.y;
	loc_0.y = temp_13.y * w.x + temp_13.x * w.y;
	temp_13.x = temp_5.x - loc_0.x;
	temp_13.y = temp_5.y - loc_0.y;
	temp_5.x = temp_5.x + loc_0.x;
	temp_5.y = temp_5.y + loc_0.y;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_7.x - loc_0.x;
	temp_15.y = temp_7.y - loc_0.y;
	temp_7.x = temp_7.x + loc_0.x;
	temp_7.y = temp_7.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
	loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
	temp_5.x = temp_1.x - loc_0.x;
	temp_5.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
	loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
	temp_7.x = temp_3.x - loc_0.x;
	temp_7.y = temp_3.y - loc_0.y;
	temp_3.x = temp_3.x + loc_0.x;
	temp_3.y = temp_3.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_13.x * iw.x - temp_13.y * iw.y;
	loc_0.y = temp_13.y * iw.x + temp_13.x * iw.y;
	temp_13.x = temp_9.x - loc_0.x;
	temp_13.y = temp_9.y - loc_0.y;
	temp_9.x = temp_9.x + loc_0.x;
	temp_9.y = temp_9.y + loc_0.y;
	loc_0.x = temp_15.x * iw.x - temp_15.y * iw.y;
	loc_0.y = temp_15.y * iw.x + temp_15.x * iw.y;
	temp_15.x = temp_11.x - loc_0.x;
	temp_15.y = temp_11.y - loc_0.y;
	temp_11.x = temp_11.x + loc_0.x;
	temp_11.y = temp_11.y + loc_0.y;
	w.x = native_cos(0.25f*angle);
	w.y = native_sin(0.25f*angle);
	loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
	loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
	temp_3.x = temp_1.x - loc_0.x;
	temp_3.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
	loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
	temp_7.x = temp_5.x - loc_0.x;
	temp_7.y = temp_5.y - loc_0.y;
	temp_5.x = temp_5.x + loc_0.x;
	temp_5.y = temp_5.y + loc_0.y;
	iw.x = w.x * loc_SQRT1_2 + w.y * loc_SQRT1_2;
	iw.y = w.y * loc_SQRT1_2 - w.x * loc_SQRT1_2;

	loc_0.x = temp_11.x * iw.x - temp_11.y * iw.y;
	loc_0.y = temp_11.y * iw.x + temp_11.x * iw.y;
	temp_11.x = temp_9.x - loc_0.x;
	temp_11.y = temp_9.y - loc_0.y;
	temp_9.x = temp_9.x + loc_0.x;
	temp_9.y = temp_9.y + loc_0.y;
	w.x = iw.y;
	w.y = -iw.x;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_13.x - loc_0.x;
	temp_15.y = temp_13.y - loc_0.y;
	temp_13.x = temp_13.x + loc_0.x;
	temp_13.y = temp_13.y + loc_0.y;
	loc_0 = temp_3;
	temp_3 = temp_9;
	temp_9 = loc_0;
	loc_0 = temp_7;
	temp_7 = temp_13;
	temp_13 = loc_0;
		sharedStride = 4;
	barrier(CLK_LOCAL_MEM_FENCE);

	stageInvocationID = get_local_id(1) + 0;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 1;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 8;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_0;
	sdataID = inoutID + 1;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_2;
	sdataID = inoutID + 2;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_4;
	sdataID = inoutID + 3;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_6;
	sdataID = inoutID + 4;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_8;
	sdataID = inoutID + 5;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_10;
	sdataID = inoutID + 6;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_12;
	sdataID = inoutID + 7;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_14;
	stageInvocationID = get_local_id(1) + 128;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 1;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 8;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_1;
	sdataID = inoutID + 1;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_3;
	sdataID = inoutID + 2;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_5;
	sdataID = inoutID + 3;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_7;
	sdataID = inoutID + 4;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_9;
	sdataID = inoutID + 5;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_11;
	sdataID = inoutID + 6;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_13;
	sdataID = inoutID + 7;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_15;
	barrier(CLK_LOCAL_MEM_FENCE);

		stageInvocationID = (get_local_id(1)+ 0) % (8);
		angle = stageInvocationID * -0.39269908169872414f;
		temp_0 = sdata[sharedStride*(get_local_id(1)+0)+get_local_id(0)];
		temp_4 = sdata[sharedStride*(get_local_id(1)+256)+get_local_id(0)];
		temp_8 = sdata[sharedStride*(get_local_id(1)+512)+get_local_id(0)];
		temp_12 = sdata[sharedStride*(get_local_id(1)+768)+get_local_id(0)];
		temp_1 = sdata[sharedStride*(get_local_id(1)+1024)+get_local_id(0)];
		temp_5 = sdata[sharedStride*(get_local_id(1)+1280)+get_local_id(0)];
		temp_9 = sdata[sharedStride*(get_local_id(1)+1536)+get_local_id(0)];
		temp_13 = sdata[sharedStride*(get_local_id(1)+1792)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
	loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
	temp_1.x = temp_0.x - loc_0.x;
	temp_1.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
	loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
	temp_5.x = temp_4.x - loc_0.x;
	temp_5.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	loc_0.x = temp_9.x * w.x - temp_9.y * w.y;
	loc_0.y = temp_9.y * w.x + temp_9.x * w.y;
	temp_9.x = temp_8.x - loc_0.x;
	temp_9.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	loc_0.x = temp_13.x * w.x - temp_13.y * w.y;
	loc_0.y = temp_13.y * w.x + temp_13.x * w.y;
	temp_13.x = temp_12.x - loc_0.x;
	temp_13.y = temp_12.y - loc_0.y;
	temp_12.x = temp_12.x + loc_0.x;
	temp_12.y = temp_12.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_8.x * w.x - temp_8.y * w.y;
	loc_0.y = temp_8.y * w.x + temp_8.x * w.y;
	temp_8.x = temp_0.x - loc_0.x;
	temp_8.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_12.x * w.x - temp_12.y * w.y;
	loc_0.y = temp_12.y * w.x + temp_12.x * w.y;
	temp_12.x = temp_4.x - loc_0.x;
	temp_12.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_9.x * iw.x - temp_9.y * iw.y;
	loc_0.y = temp_9.y * iw.x + temp_9.x * iw.y;
	temp_9.x = temp_1.x - loc_0.x;
	temp_9.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	loc_0.x = temp_13.x * iw.x - temp_13.y * iw.y;
	loc_0.y = temp_13.y * iw.x + temp_13.x * iw.y;
	temp_13.x = temp_5.x - loc_0.x;
	temp_13.y = temp_5.y - loc_0.y;
	temp_5.x = temp_5.x + loc_0.x;
	temp_5.y = temp_5.y + loc_0.y;
	w.x = native_cos(0.25f*angle);
	w.y = native_sin(0.25f*angle);
	loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
	loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
	temp_4.x = temp_0.x - loc_0.x;
	temp_4.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_12.x * iw.x - temp_12.y * iw.y;
	loc_0.y = temp_12.y * iw.x + temp_12.x * iw.y;
	temp_12.x = temp_8.x - loc_0.x;
	temp_12.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	iw.x = w.x * loc_SQRT1_2 + w.y * loc_SQRT1_2;
	iw.y = w.y * loc_SQRT1_2 - w.x * loc_SQRT1_2;

	loc_0.x = temp_5.x * iw.x - temp_5.y * iw.y;
	loc_0.y = temp_5.y * iw.x + temp_5.x * iw.y;
	temp_5.x = temp_1.x - loc_0.x;
	temp_5.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	w.x = iw.y;
	w.y = -iw.x;
	loc_0.x = temp_13.x * w.x - temp_13.y * w.y;
	loc_0.y = temp_13.y * w.x + temp_13.x * w.y;
	temp_13.x = temp_9.x - loc_0.x;
	temp_13.y = temp_9.y - loc_0.y;
	temp_9.x = temp_9.x + loc_0.x;
	temp_9.y = temp_9.y + loc_0.y;
	loc_0 = temp_4;
	temp_4 = temp_1;
	temp_1 = loc_0;
	loc_0 = temp_12;
	temp_12 = temp_9;
	temp_9 = loc_0;
		stageInvocationID = (get_local_id(1)+ 128) % (8);
		angle = stageInvocationID * -0.39269908169872414f;
		temp_2 = sdata[sharedStride*(get_local_id(1)+128)+get_local_id(0)];
		temp_6 = sdata[sharedStride*(get_local_id(1)+384)+get_local_id(0)];
		temp_10 = sdata[sharedStride*(get_local_id(1)+640)+get_local_id(0)];
		temp_14 = sdata[sharedStride*(get_local_id(1)+896)+get_local_id(0)];
		temp_3 = sdata[sharedStride*(get_local_id(1)+1152)+get_local_id(0)];
		temp_7 = sdata[sharedStride*(get_local_id(1)+1408)+get_local_id(0)];
		temp_11 = sdata[sharedStride*(get_local_id(1)+1664)+get_local_id(0)];
		temp_15 = sdata[sharedStride*(get_local_id(1)+1920)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
	loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
	temp_3.x = temp_2.x - loc_0.x;
	temp_3.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
	loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
	temp_7.x = temp_6.x - loc_0.x;
	temp_7.y = temp_6.y - loc_0.y;
	temp_6.x = temp_6.x + loc_0.x;
	temp_6.y = temp_6.y + loc_0.y;
	loc_0.x = temp_11.x * w.x - temp_11.y * w.y;
	loc_0.y = temp_11.y * w.x + temp_11.x * w.y;
	temp_11.x = temp_10.x - loc_0.x;
	temp_11.y = temp_10.y - loc_0.y;
	temp_10.x = temp_10.x + loc_0.x;
	temp_10.y = temp_10.y + loc_0.y;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_14.x - loc_0.x;
	temp_15.y = temp_14.y - loc_0.y;
	temp_14.x = temp_14.x + loc_0.x;
	temp_14.y = temp_14.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_10.x * w.x - temp_10.y * w.y;
	loc_0.y = temp_10.y * w.x + temp_10.x * w.y;
	temp_10.x = temp_2.x - loc_0.x;
	temp_10.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14.x = temp_6.x - loc_0.x;
	temp_14.y = temp_6.y - loc_0.y;
	temp_6.x = temp_6.x + loc_0.x;
	temp_6.y = temp_6.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_11.x * iw.x - temp_11.y * iw.y;
	loc_0.y = temp_11.y * iw.x + temp_11.x * iw.y;
	temp_11.x = temp_3.x - loc_0.x;
	temp_11.y = temp_3.y - loc_0.y;
	temp_3.x = temp_3.x + loc_0.x;
	temp_3.y = temp_3.y + loc_0.y;
	loc_0.x = temp_15.x * iw.x - temp_15.y * iw.y;
	loc_0.y = temp_15.y * iw.x + temp_15.x * iw.y;
	temp_15.x = temp_7.x - loc_0.x;
	temp_15.y = temp_7.y - loc_0.y;
	temp_7.x = temp_7.x + loc_0.x;
	temp_7.y = temp_7.y + loc_0.y;
	w.x = native_cos(0.25f*angle);
	w.y = native_sin(0.25f*angle);
	loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
	loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
	temp_6.x = temp_2.x - loc_0.x;
	temp_6.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_14.x * iw.x - temp_14.y * iw.y;
	loc_0.y = temp_14.y * iw.x + temp_14.x * iw.y;
	temp_14.x = temp_10.x - loc_0.x;
	temp_14.y = temp_10.y - loc_0.y;
	temp_10.x = temp_10.x + loc_0.x;
	temp_10.y = temp_10.y + loc_0.y;
	iw.x = w.x * loc_SQRT1_2 + w.y * loc_SQRT1_2;
	iw.y = w.y * loc_SQRT1_2 - w.x * loc_SQRT1_2;

	loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
	loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
	temp_7.x = temp_3.x - loc_0.x;
	temp_7.y = temp_3.y - loc_0.y;
	temp_3.x = temp_3.x + loc_0.x;
	temp_3.y = temp_3.y + loc_0.y;
	w.x = iw.y;
	w.y = -iw.x;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_11.x - loc_0.x;
	temp_15.y = temp_11.y - loc_0.y;
	temp_11.x = temp_11.x + loc_0.x;
	temp_11.y = temp_11.y + loc_0.y;
	loc_0 = temp_6;
	temp_6 = temp_3;
	temp_3 = loc_0;
	loc_0 = temp_14;
	temp_14 = temp_11;
	temp_11 = loc_0;
	barrier(CLK_LOCAL_MEM_FENCE);

	stageInvocationID = get_local_id(1) + 0;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 8;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 8;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_0;
	sdataID = inoutID + 8;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_4;
	sdataID = inoutID + 16;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_8;
	sdataID = inoutID + 24;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_12;
	sdataID = inoutID + 32;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_1;
	sdataID = inoutID + 40;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_5;
	sdataID = inoutID + 48;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_9;
	sdataID = inoutID + 56;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_13;
	stageInvocationID = get_local_id(1) + 128;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 8;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 8;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_2;
	sdataID = inoutID + 8;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_6;
	sdataID = inoutID + 16;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_10;
	sdataID = inoutID + 24;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_14;
	sdataID = inoutID + 32;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_3;
	sdataID = inoutID + 40;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_7;
	sdataID = inoutID + 48;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_11;
	sdataID = inoutID + 56;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_15;
	barrier(CLK_LOCAL_MEM_FENCE);

		stageInvocationID = (get_local_id(1)+ 0) % (64);
		angle = stageInvocationID * -0.04908738521234052f;
		temp_0 = sdata[sharedStride*(get_local_id(1)+0)+get_local_id(0)];
		temp_8 = sdata[sharedStride*(get_local_id(1)+256)+get_local_id(0)];
		temp_1 = sdata[sharedStride*(get_local_id(1)+512)+get_local_id(0)];
		temp_9 = sdata[sharedStride*(get_local_id(1)+768)+get_local_id(0)];
		temp_2 = sdata[sharedStride*(get_local_id(1)+1024)+get_local_id(0)];
		temp_10 = sdata[sharedStride*(get_local_id(1)+1280)+get_local_id(0)];
		temp_3 = sdata[sharedStride*(get_local_id(1)+1536)+get_local_id(0)];
		temp_11 = sdata[sharedStride*(get_local_id(1)+1792)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
	loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
	temp_2.x = temp_0.x - loc_0.x;
	temp_2.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_10.x * w.x - temp_10.y * w.y;
	loc_0.y = temp_10.y * w.x + temp_10.x * w.y;
	temp_10.x = temp_8.x - loc_0.x;
	temp_10.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
	loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
	temp_3.x = temp_1.x - loc_0.x;
	temp_3.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	loc_0.x = temp_11.x * w.x - temp_11.y * w.y;
	loc_0.y = temp_11.y * w.x + temp_11.x * w.y;
	temp_11.x = temp_9.x - loc_0.x;
	temp_11.y = temp_9.y - loc_0.y;
	temp_9.x = temp_9.x + loc_0.x;
	temp_9.y = temp_9.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_1.x * w.x - temp_1.y * w.y;
	loc_0.y = temp_1.y * w.x + temp_1.x * w.y;
	temp_1.x = temp_0.x - loc_0.x;
	temp_1.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_9.x * w.x - temp_9.y * w.y;
	loc_0.y = temp_9.y * w.x + temp_9.x * w.y;
	temp_9.x = temp_8.x - loc_0.x;
	temp_9.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_3.x * iw.x - temp_3.y * iw.y;
	loc_0.y = temp_3.y * iw.x + temp_3.x * iw.y;
	temp_3.x = temp_2.x - loc_0.x;
	temp_3.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	loc_0.x = temp_11.x * iw.x - temp_11.y * iw.y;
	loc_0.y = temp_11.y * iw.x + temp_11.x * iw.y;
	temp_11.x = temp_10.x - loc_0.x;
	temp_11.y = temp_10.y - loc_0.y;
	temp_10.x = temp_10.x + loc_0.x;
	temp_10.y = temp_10.y + loc_0.y;
	w.x = native_cos(0.25f*angle);
	w.y = native_sin(0.25f*angle);
	loc_0.x = temp_8.x * w.x - temp_8.y * w.y;
	loc_0.y = temp_8.y * w.x + temp_8.x * w.y;
	temp_8.x = temp_0.x - loc_0.x;
	temp_8.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_9.x * iw.x - temp_9.y * iw.y;
	loc_0.y = temp_9.y * iw.x + temp_9.x * iw.y;
	temp_9.x = temp_1.x - loc_0.x;
	temp_9.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	iw.x = w.x * loc_SQRT1_2 + w.y * loc_SQRT1_2;
	iw.y = w.y * loc_SQRT1_2 - w.x * loc_SQRT1_2;

	loc_0.x = temp_10.x * iw.x - temp_10.y * iw.y;
	loc_0.y = temp_10.y * iw.x + temp_10.x * iw.y;
	temp_10.x = temp_2.x - loc_0.x;
	temp_10.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	w.x = iw.y;
	w.y = -iw.x;
	loc_0.x = temp_11.x * w.x - temp_11.y * w.y;
	loc_0.y = temp_11.y * w.x + temp_11.x * w.y;
	temp_11.x = temp_3.x - loc_0.x;
	temp_11.y = temp_3.y - loc_0.y;
	temp_3.x = temp_3.x + loc_0.x;
	temp_3.y = temp_3.y + loc_0.y;
	loc_0 = temp_8;
	temp_8 = temp_2;
	temp_2 = loc_0;
	loc_0 = temp_9;
	temp_9 = temp_3;
	temp_3 = loc_0;
		stageInvocationID = (get_local_id(1)+ 128) % (64);
		angle = stageInvocationID * -0.04908738521234052f;
		temp_4 = sdata[sharedStride*(get_local_id(1)+128)+get_local_id(0)];
		temp_12 = sdata[sharedStride*(get_local_id(1)+384)+get_local_id(0)];
		temp_5 = sdata[sharedStride*(get_local_id(1)+640)+get_local_id(0)];
		temp_13 = sdata[sharedStride*(get_local_id(1)+896)+get_local_id(0)];
		temp_6 = sdata[sharedStride*(get_local_id(1)+1152)+get_local_id(0)];
		temp_14 = sdata[sharedStride*(get_local_id(1)+1408)+get_local_id(0)];
		temp_7 = sdata[sharedStride*(get_local_id(1)+1664)+get_local_id(0)];
		temp_15 = sdata[sharedStride*(get_local_id(1)+1920)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
	loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
	temp_6.x = temp_4.x - loc_0.x;
	temp_6.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14.x = temp_12.x - loc_0.x;
	temp_14.y = temp_12.y - loc_0.y;
	temp_12.x = temp_12.x + loc_0.x;
	temp_12.y = temp_12.y + loc_0.y;
	loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
	loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
	temp_7.x = temp_5.x - loc_0.x;
	temp_7.y = temp_5.y - loc_0.y;
	temp_5.x = temp_5.x + loc_0.x;
	temp_5.y = temp_5.y + loc_0.y;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_13.x - loc_0.x;
	temp_15.y = temp_13.y - loc_0.y;
	temp_13.x = temp_13.x + loc_0.x;
	temp_13.y = temp_13.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
	loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
	temp_5.x = temp_4.x - loc_0.x;
	temp_5.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	loc_0.x = temp_13.x * w.x - temp_13.y * w.y;
	loc_0.y = temp_13.y * w.x + temp_13.x * w.y;
	temp_13.x = temp_12.x - loc_0.x;
	temp_13.y = temp_12.y - loc_0.y;
	temp_12.x = temp_12.x + loc_0.x;
	temp_12.y = temp_12.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_7.x * iw.x - temp_7.y * iw.y;
	loc_0.y = temp_7.y * iw.x + temp_7.x * iw.y;
	temp_7.x = temp_6.x - loc_0.x;
	temp_7.y = temp_6.y - loc_0.y;
	temp_6.x = temp_6.x + loc_0.x;
	temp_6.y = temp_6.y + loc_0.y;
	loc_0.x = temp_15.x * iw.x - temp_15.y * iw.y;
	loc_0.y = temp_15.y * iw.x + temp_15.x * iw.y;
	temp_15.x = temp_14.x - loc_0.x;
	temp_15.y = temp_14.y - loc_0.y;
	temp_14.x = temp_14.x + loc_0.x;
	temp_14.y = temp_14.y + loc_0.y;
	w.x = native_cos(0.25f*angle);
	w.y = native_sin(0.25f*angle);
	loc_0.x = temp_12.x * w.x - temp_12.y * w.y;
	loc_0.y = temp_12.y * w.x + temp_12.x * w.y;
	temp_12.x = temp_4.x - loc_0.x;
	temp_12.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	iw.x = w.y;
	iw.y = -w.x;
	loc_0.x = temp_13.x * iw.x - temp_13.y * iw.y;
	loc_0.y = temp_13.y * iw.x + temp_13.x * iw.y;
	temp_13.x = temp_5.x - loc_0.x;
	temp_13.y = temp_5.y - loc_0.y;
	temp_5.x = temp_5.x + loc_0.x;
	temp_5.y = temp_5.y + loc_0.y;
	iw.x = w.x * loc_SQRT1_2 + w.y * loc_SQRT1_2;
	iw.y = w.y * loc_SQRT1_2 - w.x * loc_SQRT1_2;

	loc_0.x = temp_14.x * iw.x - temp_14.y * iw.y;
	loc_0.y = temp_14.y * iw.x + temp_14.x * iw.y;
	temp_14.x = temp_6.x - loc_0.x;
	temp_14.y = temp_6.y - loc_0.y;
	temp_6.x = temp_6.x + loc_0.x;
	temp_6.y = temp_6.y + loc_0.y;
	w.x = iw.y;
	w.y = -iw.x;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_7.x - loc_0.x;
	temp_15.y = temp_7.y - loc_0.y;
	temp_7.x = temp_7.x + loc_0.x;
	temp_7.y = temp_7.y + loc_0.y;
	loc_0 = temp_12;
	temp_12 = temp_6;
	temp_6 = loc_0;
	loc_0 = temp_13;
	temp_13 = temp_7;
	temp_7 = loc_0;
	barrier(CLK_LOCAL_MEM_FENCE);

	stageInvocationID = get_local_id(1) + 0;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 64;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 8;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_0;
	sdataID = inoutID + 64;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_8;
	sdataID = inoutID + 128;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_1;
	sdataID = inoutID + 192;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_9;
	sdataID = inoutID + 256;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_2;
	sdataID = inoutID + 320;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_10;
	sdataID = inoutID + 384;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_3;
	sdataID = inoutID + 448;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_11;
	stageInvocationID = get_local_id(1) + 128;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 64;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 8;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_4;
	sdataID = inoutID + 64;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_12;
	sdataID = inoutID + 128;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_5;
	sdataID = inoutID + 192;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_13;
	sdataID = inoutID + 256;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_6;
	sdataID = inoutID + 320;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_14;
	sdataID = inoutID + 384;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_7;
	sdataID = inoutID + 448;
	sdataID = sharedStride * sdataID;
	sdataID = sdataID + get_local_id(0);
	sdata[sdataID] = temp_15;
	barrier(CLK_LOCAL_MEM_FENCE);

		stageInvocationID = (get_local_id(1)+ 0) % (512);
		angle = stageInvocationID * -0.00613592315154256f;
		temp_0 = sdata[sharedStride*(get_local_id(1)+0)+get_local_id(0)];
		temp_2 = sdata[sharedStride*(get_local_id(1)+512)+get_local_id(0)];
		temp_4 = sdata[sharedStride*(get_local_id(1)+1024)+get_local_id(0)];VkFFTApp: (512, 512) None 2 False -> [512, 512, 1, 1]
		temp_6 = sdata[sharedStride*(get_local_id(1)+1536)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_4.x * w.x - temp_4.y * w.y;
	loc_0.y = temp_4.y * w.x + temp_4.x * w.y;
	temp_4.x = temp_0.x - loc_0.x;
	temp_4.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
	loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
	temp_6.x = temp_2.x - loc_0.x;
	temp_6.y = temp_2.y - loc_0.y;
	temp_2.x = temp_2.x + loc_0.x;
	temp_2.y = temp_2.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_2.x * w.x - temp_2.y * w.y;
	loc_0.y = temp_2.y * w.x + temp_2.x * w.y;
	temp_2.x = temp_0.x - loc_0.x;
	temp_2.y = temp_0.y - loc_0.y;
	temp_0.x = temp_0.x + loc_0.x;
	temp_0.y = temp_0.y + loc_0.y;
	loc_0.x = w.x;	w.x = w.y;
	w.y = -loc_0.x;
	loc_0.x = temp_6.x * w.x - temp_6.y * w.y;
	loc_0.y = temp_6.y * w.x + temp_6.x * w.y;
	temp_6.x = temp_4.x - loc_0.x;
	t emp_6.y = temp_4.y - loc_0.y;
	temp_4.x = temp_4.x + loc_0.x;
	temp_4.y = temp_4.y + loc_0.y;
	loc_0 = temp_2;
	temp_2 = temp_4;
	temp_4 = loc_0;
		stageInvocationID = (get_local_id(1)+ 128) % (512);
		angle = stageInvocationID * -0.00613592315154256f;
		temp_8 = sdata[sharedStride*(get_local_id(1)+128)+get_local_id(0)];
		temp_10 = sdata[sharedStride*(get_local_id(1)+640)+get_local_id(0)];
		temp_12 = sdata[sharedStride*(get_local_id(1)+1152)+get_local_id(0)];
		temp_14 = sdata[sharedStride*(get_local_id(1)+1664)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_12.x * w.x - temp_12.y * w.y;
	loc_0.y = temp_12.y * w.x + temp_12.x * w.y;
	temp_12.x = temp_8.x - loc_0.x;
	temp_12.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14.x = temp_10.x - loc_0.x;
	temp_14.y = temp_10.y - loc_0.y;
	temp_10.x = temp_10.x + loc[False, False, False] 2
_0.x;
	temp_10.y = temp_10.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_10.x * w.x - temp_10.y * w.y;
	loc_0.y = temp_10.y * w.x + temp_10.x * w.y;
	temp_10.x = temp_8.x - loc_0.x;
	temp_10.y = temp_8.y - loc_0.y;
	temp_8.x = temp_8.x + loc_0.x;
	temp_8.y = temp_8.y + loc_0.y;
	loc_0.x = w.x;	w.x = w.y;
	w.y = -loc_0.x;
	loc_0.x = temp_14.x * w.x - temp_14.y * w.y;
	loc_0.y = temp_14.y * w.x + temp_14.x * w.y;
	temp_14.x = temp_12.x - loc_0.x;
	temp_14.y = temp_12.y - loc_0.y;
	temp_12.x = temp_12.x + loc_0.x;
	temp_12.y = temp_12.y + loc_0.y;
	loc_0 = temp_10;
	temp_10 = temp_12;
	temp_12 = loc_0;
		stageInvocationID = (get_local_id(1)+ 256) % (512);
		angle = stageInvocationID * -0.00613592315154256f;
		temp_1 = sdata[sharedStride*(get_local_id(1)+256)+get_local_id(0)];
		temp_3 = sdata[sharedStride*(get_local_id(1)+768)+get_local_id(0)];
		temp_5 = sdata[sharedStride*(get_local_id(1)+1280)+get_local_id(0)];
		temp_7 = sdata[sharedStride*(get_local_id(1)+1792)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_5.x * w.x - temp_5.y * w.y;
	loc_0.y = temp_5.y * w.x + temp_5.x * w.y;
	temp_5.x = temp_1.x - loc_0.x;
	temp_5.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
	loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
	temp_7.x = temp_3.x - loc_0.x;
	temp_7.y = temp_3.y - loc_0.y;
	temp_3.x = temp_3.x + loc_0.x;
	temp_3.y = temp_3.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_3.x * w.x - temp_3.y * w.y;
	loc_0.y = temp_3.y * w.x + temp_3.x * w.y;
	temp_3.x = temp_1.x - loc_0.x;
	temp_3.y = temp_1.y - loc_0.y;
	temp_1.x = temp_1.x + loc_0.x;
	temp_1.y = temp_1.y + loc_0.y;
	loc_0.x = w.x;	w.x = w.y;
	w.y = -loc_0.x;
	loc_0.x = temp_7.x * w.x - temp_7.y * w.y;
	loc_0.y = temp_7.y * w.x + temp_7.x * w.y;
	temp_7.x = temp_5.x - loc_0.x;
	temp_7.y = temp_5.y - loc_0.y;
	temp_5.x = temp_5.x + loc_0.x;
	temp_5.y = temp_5.y + loc_0.y;
	loc_0 = temp_3;
	temp_3 = temp_5;
	temp_5 = loc_0;
		stageInvocationID = (get_local_id(1)+ 384) % (512);
		angle = stageInvocationID * -0.00613592315154256f;
		temp_9 = sdata[sharedStride*(get_local_id(1)+384)+get_local_id(0)];
		temp_11 = sdata[sharedStride*(get_local_id(1)+896)+get_local_id(0)];
		temp_13 = sdata[sharedStride*(get_local_id(1)+1408)+get_local_id(0)];
		temp_15 = sdata[sharedStride*(get_local_id(1)+1920)+get_local_id(0)];
	w.x = native_cos(angle);
	w.y = native_sin(angle);
	loc_0.x = temp_13.x * w.x - temp_13.y * w.y;
	loc_0.y = temp_13.y * w.x + temp_13.x * w.y;
	temp_13.x = temp_9.x - loc_0.x;
	temp_13.y = temp_9.y - loc_0.y;
	temp_9.x = temp_9.x + loc_0.x;
	temp_9.y = temp_9.y + loc_0.y;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_11.x - loc_0.x;
	temp_15.y = temp_11.y - loc_0.y;
	temp_11.x = temp_11.x + loc_0.x;
	temp_11.y = temp_11.y + loc_0.y;
	w.x = native_cos(0.5f*angle);
	w.y = native_sin(0.5f*angle);
	loc_0.x = temp_11.x * w.x - temp_11.y * w.y;
	loc_0.y = temp_11.y * w.x + temp_11.x * w.y;
	temp_11.x = temp_9.x - loc_0.x;
	temp_11.y = temp_9.y - loc_0.y;
	temp_9.x = temp_9.x + loc_0.x;
	temp_9.y = temp_9.y + loc_0.y;
	loc_0.x = w.x;	w.x = w.y;
	w.y = -loc_0.x;
	loc_0.x = temp_15.x * w.x - temp_15.y * w.y;
	loc_0.y = temp_15.y * w.x + temp_15.x * w.y;
	temp_15.x = temp_13.x - loc_0.x;
	temp_15.y = temp_13.y - loc_0.y;
	temp_13.x = temp_13.x + loc_0.x;
	temp_13.y = temp_13.y + loc_0.y;
	loc_0 = temp_11;
	temp_11 = temp_13;
	temp_13 = loc_0;
	barrier(CLK_LOCAL_MEM_FENCE);

		sharedStride = 4;
	stageInvocationID = get_local_id(1) + 0;
	blockInvocationID = stageInvocationID;
	stageInvocationID = stageInvocationID % 512;
	blockInvocationID = blockInvocationID - stageInvocationID;
	inoutID = blockInvocationID * 4;
	inoutID = inoutID + stageInvocationID;
	sdataID = inoutID + 0;
	sdataID = sharedStride * sdataID;
	sdata

Could provide the api the same as cuFFT

Thanks for such great work for improving FFT. My question is I'm trying to optimize the Quantum ESPRESSO single card version, an application that calls cuFFT frequently. Could this project provide a drop-in replacement for cuFFT APIs so that more users could gain benefits from it?

Suggestion for 2 minor improvements..

Hi,
just 2 minor suggestions:

*Can you support passing arguments to Vulkan_FFT.cpp main() function? so in command line we can switch between FP16,FP32,FP64 benchmarking by passing the case 0,6,7 currently and not having to compile different executables for different precision..

*Also can you print current device name so on systems with more than one GPU we know clearly which GPU is the benchmark being runned on.. better yet allow some argument like "-d 0" "-d 1" to select among multiple GPU devices..

thanks..

Benchmark sources?

Hi, interesting project. I see on the main page you show off some vkFFT benchmarks (in relation to cuFFT). Just wondering if you have plans to post your benchmark code used for running the vkFFT tests? From quickly looking at Vulkan_FFT.cpp, it didn't seem to be part of the file. Just wondering for evaluating possibly using vkFFT in GPU/driver combination testing.

Thanks,
Michael

COMP: Numerous warnings from Microsoft Windows compiler

From vkFFT/vkFFT.h, I get the warnings listed at a CDash page for a brand new project that has VkFFT as a dependency. We are trying to compile without warnings, so this is slowing us down.

In case these vkFFT.h warnings are a priority for you as well ... submitting this issue. Thank you.

_deps\vulkan_lib-src\vkFFT\vkFFT.h(3717): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3718): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3725): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3726): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3738): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3739): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3746): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3747): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3925): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3926): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3933): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3934): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3946): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3947): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3954): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(3955): warning C4146: unary minus operator applied to unsigned type, result still unsigned
_deps\vulkan_lib-src\vkFFT\vkFFT.h(7472): warning C4244: 'argument': conversion from 'double' to 'uint64_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(9325): warning C4244: '=': conversion from 'double' to 'uint64_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(9326): warning C4244: '=': conversion from 'double' to 'uint64_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(9332): warning C4244: '=': conversion from 'double' to 'uint64_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(11441): warning C4244: '=': conversion from 'double' to 'uint64_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(11445): warning C4244: '=': conversion from 'double' to 'uint64_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13746): warning C4244: 'argument': conversion from 'uint64_t' to 'cl_uint', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13754): warning C4244: 'argument': conversion from 'uint64_t' to 'cl_uint', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13762): warning C4244: 'argument': conversion from 'uint64_t' to 'cl_uint', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13765): warning C4244: '=': conversion from 'uint64_t' to 'uint32_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13766): warning C4244: '=': conversion from 'uint64_t' to 'uint32_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13767): warning C4244: '=': conversion from 'uint64_t' to 'uint32_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13768): warning C4244: '=': conversion from 'uint64_t' to 'uint32_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13769): warning C4244: '=': conversion from 'uint64_t' to 'uint32_t', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13770): warning C4244: 'argument': conversion from 'uint64_t' to 'cl_uint', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13847): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13935): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(13978): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14016): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14059): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14298): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14299): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14329): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14339): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14340): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14368): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14410): warning C4244: 'initializing': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14411): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data
_deps\vulkan_lib-src\vkFFT\vkFFT.h(14457): warning C4244: '=': conversion from 'uint64_t' to 'int', possible loss of data              ```

2 Minor requests for OpenCL backend..

Hi,
great to hear VkFFT gains a OpenCL backend! nice work..
unfortunately just wanted to do a quick test on new OpenCL backend and found two issues:

  1. seems you don't provide prebuilt OpenCL Windows releases yet (similar to CUDA, VK)..
  2. also vulkan_fft command doesn't have a -clfft argument similar so -cufft or -rocfft to perform some basic easy perf testing..
    this 2 would be desirable..
    thanks..

How to do 2D channel-wise convolution/cross-correlation?

Hi, I'm using VkFFT to accelerate channel-wise cross-correlation between two feature maps (real number) of size C x H1 x W1 and C x H2 x W2. I've tested the cross-correlation between two single channel data, it worked quite well.

Here's my configuration for single channel cross-correlation:

    // Calculate the padded size
    size_t m = cv::getOptimalDFTSize(mA + mB - 1);
    size_t n = cv::getOptimalDFTSize(nA + nB - 1);

    VkFFTConfiguration fftConfigCrossCorrelation = {};
    VkFFTConfiguration fftConfigKernel = {};
    VkFFTApplication fftAppKernel = {};
    VkFFTApplication fftAppCrossCorrelation = {};

    /* Make kernel from feature map B */
    fftConfigKernel.FFTdim = 2;
    fftConfigKernel.coordinateFeatures = 1;
    fftConfigKernel.size[0] = n;
    fftConfigKernel.size[1] = m;
    fftConfigKernel.size[2] = 1;
    fftConfigKernel.performZeropadding[0] = true;
    fftConfigKernel.performZeropadding[1] = true;
    fftConfigKernel.fft_zeropad_left[0] = nB + 1;
    fftConfigKernel.fft_zeropad_left[1] = mB + 1;
    fftConfigKernel.fft_zeropad_right[0] = n;
    fftConfigKernel.fft_zeropad_right[1] = m;
    fftConfigKernel.kernelConvolution = true;
    fftConfigKernel.performR2C = true;
    fftConfigKernel.disableReorderFourStep = true;
    fftConfigKernel.normalize = true;
    fftConfigKernel.numberBatches = 1;
    fftConfigKernel.buffer = &bufferKernelPtr;
    fftConfigKernel.bufferSize = &bufferKernelSize;

    /* Do FFT of B */
    err = initializeVkFFT(&fftAppKernel, fftConfigKernel);
    err = VkFFTAppend(&fftAppKernel, false, &fftLaunchParams);

     /* Do corss-correlation on feature map A using the FFT of B */
    fftConfigCrossCorrelation.FFTdim = 2;
    fftConfigCrossCorrelation.coordinateFeatures = 1;
    fftConfigCrossCorrelation.size[0] = n;
    fftConfigCrossCorrelation.size[1] = m;
    fftConfigCrossCorrelation.size[2] = 1;
    fftConfigCrossCorrelation.performZeropadding[0] = true;
    fftConfigCrossCorrelation.performZeropadding[1] = true;
    fftConfigCrossCorrelation.fft_zeropad_left[0] = nA + 1;
    fftConfigCrossCorrelation.fft_zeropad_left[1] = mA + 1;
    fftConfigCrossCorrelation.fft_zeropad_right[0] = n;
    fftConfigCrossCorrelation.fft_zeropad_right[1] = m;
    fftConfigCrossCorrelation.performR2C = true;
    fftConfigCrossCorrelation.disableReorderFourStep = true;
    fftConfigCrossCorrelation.normalize = true;
    fftConfigCrossCorrelation.numberBatches = 1;
    fftConfigCrossCorrelation.kernelConvolution = false;
    fftConfigCrossCorrelation.performConvolution = true;
    fftConfigCrossCorrelation.symmetricKernel = false;
    fftConfigCrossCorrelation.conjugateConvolution = 2;
    fftConfigCrossCorrelation.kernel = &bufferKernelPtr;
    fftConfigCrossCorrelation.kernelSize = &bufferKernelSize;
    fftConfigCrossCorrelation.buffer = &bufferCrossCorrelationPtr;
    fftConfigCrossCorrelation.bufferSize = &bufferCrossCorrelationSize;

I noticed that there's an example demonstrated how to do convolution with multiple kernels: in_C[0] * kernel_C[i] -> out_C[i], but what I need is in_C[i] * kernel_C[i] -> out_C[i], could you show me how to change the configuration to make it work this way?

syntax error, unexpected SEMICOLON, expecting RIGHT_PAREN or COMMA

Problems occur when I ran sample 9 with a relatively large configuration.size[1], such as 1980, 2048.

Delete one left parenthesis between loc_PI and %s%s in L4078 fixed the problem.

sprintf(output + strlen(output), " angle = 2 * loc_PI * (((((%s%s) / %d) %% (%d)) * (%s + %d)) / %f%s;\n", sc->gl_GlobalInvocationID_x, shiftX, sc->fft_dim_x, sc->stageStartSize, sc->gl_LocalInvocationID_y, i * sc->localSize[1], (double)(sc->stageStartSize * sc->fftDim), LFending);

Suggestions on printing

Vulkan_FFT.cpp has many cases of code that looks like this:

if (file_output)
	fprintf(output, "Some string %d", some_value);
printf( "Some string %d", some_value);

This seems inefficient. Instead, simply print to fprintf with stdout as the default if no file is specified.
Although I understand that you may want to log to file as well as stdout, I believe this should be done in the commandline with tee or pipes.

// output defaults to stdout if it is not defined
fprintf(output,  "Some string %d", some_value);

vkFFT on Android

Hi, I'm currently developing an Android App which needs to calculate FFTs.

Till now I used OpenCV for this.
In an attempt to speed the FFT implementation on Android devices I successfully was able to implement your library so that it works on most devices and the results are also quite close to the implementation provided by OpenCV.

But I noticed differences which were a little too big in my opinion.
I got an avg. difference of 0.06 on a 128x128 system, which is quiet bigger then the results in your precision tests.
Also I noticed differences on different mobile GPUs. I tested a Snapdragon 855+ (Adreno 640 GPU) which got a avg. difference of 0.01 and a Exynos 990 (Mali G77) which achieved the 0.06 avg. difference.

Is this difference because of the devices or do you think I misconfigured something?

Setup Notes:
Values are uniform random numbers calculated by OpenCV and then passed to an modified version of the sample 0.
Calculation in C2C but imaginary part is 0 -> didn't used R2C for now cause I didn't really understand how to align the data properly

iFFT bug for some 2D mixed Bluestein transforms

Hi @DTolm,

I found an issue with the inverse FT for some C2C 2D transforms which use both radix and Bluestein algorithms (e.g. 2D NxN arrays with N=6*101):

Here is an example comparison between numpy and the opencl and cuda transforms using (py)vkfft with N=6x101:

image

Note that in 1.2.12, the cuda backend would also fail with N=6x101.

When trying this for N=15x101, both the cuda and opencl backend fail:

image

Here is the python code to reproduce this through pyvkfft (I am using the latest VkFFT version) - assuming you have both pyopencl and pycuda:

%matplotlib inline
import os
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as cua
import pyopencl.array as cla
import pyopencl as cl
from pyvkfft.cuda import VkFFTApp as cuVkFFTApp
from pyvkfft.opencl import VkFFTApp as clVkFFTApp
from pyvkfft.version import vkfft_version, __version__

if 'PYOPENCL_CTX' in os.environ:
    ctx = cl.create_some_context()
else:
    ctx = None
    # Find the first OpenCL GPU available and use it, unless
    for p in cl.get_platforms():
        for d in p.get_devices():
            if d.type & cl.device_type.GPU == 0:
                continue
            cl_device_name = d.name
            print("Selected OpenCL device: ", d.name)
            ctx = cl.Context(devices=(d,))
            break
        if ctx is not None:
            break
cq = cl.CommandQueue(ctx)

####################################################################
n = 15*101 # try 4*101, 6*101, 8*101, 15*101
####################################################################
dn = 20
d= np.zeros((n,n),dtype=np.complex64)
d[n//2-dn:n//2+dn,n//2-dn:n//2+dn] = 1

cud = cua.to_gpu(d)
cld = cla.to_device(cq,d)

cuapp = cuVkFFTApp(d.shape, d.dtype, ndim=2, norm=0)
clapp = clVkFFTApp(d.shape, d.dtype, queue=cq, ndim=2, norm=0)

plt.figure(figsize=(9.5,4), dpi=120)
plt.subplot(131)
plt.imshow(np.fft.fftshift(abs(np.fft.ifftn(d))), norm=LogNorm())
plt.title("numpy")

plt.subplot(132)
cuapp.ifft(cud,cud)
plt.imshow(np.fft.fftshift(abs(cud.get())), norm=LogNorm())
plt.title("pyvkfft[CUDA]")

plt.subplot(133)
clapp.ifft(cld,cld)
plt.imshow(np.fft.fftshift(abs(cld.get())), norm=LogNorm())
plt.title("pyvkfft[OpenCL]")
plt.tight_layout()

plt.suptitle("pyvkfft %s, VkFFT %s, 2D C2C iFFT %dx%d" % (__version__, vkfft_version(), n, n))

I have not tried other transforms (r2c, dct), but I will probably add an exhaustive pyvkfft unit test to evaluate all possible transforms - I guess this'll take a while to run, but will be good for releases.

Reading Complex data into performVulkanFFT

Hi,

I am assigning an array of floats formatted like [real number, imaginary number, real number, imaginary number, ...] to an input buffer, I am reading in 8192 complex samples (16384 floats in total). Each pair of real and imaginary in the array represent a complex type. I then perform a transform on the data performVulkanFFT and the values returned by the output buffer aren't what I expect. How does performVulkanFFT expect complex data to be assigned to the input_buffer? and how is complex data written to the output_buffer?

uint32_t res = 0; 
std::ifstream in;
in.open("fftData.iq", std::ios::in | std::ios::binary);

if(in.is_open()) {
  // memory allocated on the CPU once, makes benchmark completion faster + avoids performance issues connected to frequent allocation/deallocation.
  float* buffer_input = (float*)malloc(2*(sizeof(float)*8192));
  
  in.seekg(0, std::ios::end);
  const size_t num_elements = in.tellg() / sizeof(float);
  in.seekg(0, std::ios::beg);
  
  std::vector<float> data(num_elements);
  in.read(reinterpret_cast<char*>(&data[0]), num_elements*sizeof(float));
  
  printf("number of floats %i \n", data.size());
  
  for(size_t i = 0; i < 30; ++i)
  {
	  printf("%f \n",data[i]);
  }
  
  for(size_t i = 0; i < data.size(); ++i)
  {
	  buffer_input[i] = data[i];
  }
  
  std::cout << "ln:664" << std::endl;
  
  VkFFTConfiguration configuration = {};
  VkFFTApplication app = {};
  configuration.FFTdim = 1; //FFT dimension, 1D, 2D or 3D (default 1).
  
  //Multidimensional FFT dimensions sizes (default 1). For best performance (and stability), order dimensions in descendant size order as: x>y>z.
  configuration.size[0] = 8192;
  configuration.size[1] = 1;
  configuration.size[2] = 1;
  
  std::cout << "ln:674" << std::endl;
  
  // After this, configuration file contains pointers to Vulkan objects needed to work with the GPU: 
  // VkDevice* device - created device, 
  // [uint64_t *bufferSize, VkBuffer *buffer, VkDeviceMemory* bufferDeviceMemory] - allocated GPU memory FFT is performed on. 
  // [uint64_t *kernelSize, VkBuffer *kernel, VkDeviceMemory* kernelDeviceMemory] - allocated GPU memory, where kernel for convolution is stored.
  configuration.device = &vkGPU->device;
  
  // to allocate memory for LUT, we have to pass a queue, vkGPU->fence, commandPool and physicalDevice pointers 
  configuration.queue = &vkGPU->queue; 
  configuration.fence = &vkGPU->fence;
  configuration.commandPool = &vkGPU->commandPool;
  configuration.physicalDevice = &vkGPU->physicalDevice;
  
  // compiler can be initialized before VkFFT plan creation. if not, VkFFT will create and destroy one after initialization
  configuration.isCompilerInitialized = isCompilerInitialized;
  
  std::cout << "ln:691" << std::endl;
  
  // Allocate buffer for the input data.
  uint64_t bufferSize = (uint64_t)(2*(sizeof(float)*8192));
  VkBuffer buffer = {};
  VkDeviceMemory bufferDeviceMemory = {};
  allocateFFTBuffer(vkGPU, &buffer, &bufferDeviceMemory, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | 
  VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT, 
  VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, bufferSize);
  configuration.buffer = &buffer;
  configuration.bufferSize = &bufferSize;
  
  // Sample buffer transfer tool. 
  // Uses staging buffer of the same size as destination buffer, which can be reduced if transfer is done sequentially in small buffers.
  transferDataFromCPU(vkGPU, buffer_input, &buffer, bufferSize);
  
  // Initialize applications. 
  // This function loads shaders, creates pipeline and configures FFT based on configuration file. 
  // No buffer allocations inside VkFFT library.  
  res = initializeVkFFT(&app, configuration);
  if (res != 0) return res;
  
  uint32_t num_iter = 1;
  performVulkanFFT(vkGPU, &app, -1, num_iter);
  
  float* buffer_output = (float*)malloc(2*(sizeof(float)*8192));
  transferDataToCPU(vkGPU, buffer_output, &buffer, bufferSize);
  
  for (int x=0; x < 30; x++) {
	  printf("value: %f \n", buffer_output[x]);
  }
  
  std::ofstream myfile;
  myfile.open ("output.csv");
  for (int x=0; x < 8192; x++) {
	  myfile << std::to_string(buffer_output[x])+"\n";
  }
  myfile.close();
  
  //Clean up 
  vkDestroyBuffer(vkGPU->device, buffer, NULL);
  vkFreeMemory(vkGPU->device, bufferDeviceMemory, NULL);
  deleteVkFFT(&app);
  free(buffer_input);
  free(buffer_output);
}

Intel HD graphics issues (C2C launch failure and R2C calculation issue)

Hi @DTolm, as promised here is a report on issues with Intel HD graphics. This can be due to issues on the Intel GPUs (which are used for display as well), rather than with an issue with VkFFT, but I guess it's good to document that as an issue.

The first test was done with a systematic accuracy test with 1D, single precision, C2C radix-2&3 transforms up to N=2**18:

  • on a mac mini with Intel HD graphics 5000, all sizes are OK (accuracy as expected with or without LUT) except for N=55296 and 248832 which return empty (zero-valued) arrays.
  • on a macbook air with "Intel Iris Graphics 6100", all sizes are OK (accuracy as expected with or without LUT) except for sizes 12288, 16384, 41472, 49152, 55296, 62208, 124416, 248832 which return empty (zero-valued) arrays.

Checking the launch result code I get a VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL (4039), with or without LUT.

The second test used a 1D R2C transform on a 3D real array of size (32, 32, 32+2), using a LUT. On the Iris graphics 6100 machine, the following data is obtained after an R2C+C2R, looking either at the average or max difference along the z dimension, or some specific layers:
image

When the calculation is repeated, the errors change (the original array max is around 11 so the above differences are really very high).

On other GPU (CUDA) I have no accuracy issues, but with these Intel graphics pyvkfft's R2C unit tests fail.

For the R2C case I do not get a launch error. The code used for the figure is:

import pyopencl as cl
import pyopencl.array as cla
import pyvkfft.opencl
from pyvkfft.opencl import VkFFTApp
import matplotlib.pyplot as plt
import numpy as np

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Create some context on the first available GPU
# Find the first OpenCL GPU available and use it, unless
for p in cl.get_platforms():
    for d in p.get_devices():
        if d.type & cl.device_type.GPU == 0:
            continue
        print("Selected device: ", d.name)
        ctx = cl.Context(devices=(d,))
        break
    if ctx is not None:
        break
cq = cl.CommandQueue(ctx)

dims = 3
ndim = 1
norm = 1
n = 32
sh = [n] * dims
for i in range(ndim, dims):
    sh[-i-1] = n
sh[-1] += 2
d0 = np.random.uniform(0, 1, sh)
# A pure random array may not be a very good test (too random),
# so add a Gaussian
xx = [np.fft.fftshift(np.fft.fftfreq(nx)) for nx in sh]
v = np.zeros_like(d0)
for x in np.meshgrid(*xx, indexing='ij'):
    v += x ** 2
d0 += 10 * np.exp(-v * 2)


d = cla.to_device(cq, d0.astype(np.float32))
app = VkFFTApp(d.shape, d.dtype, queue=cq, ndim=ndim, norm=norm, r2c=True, useLUT=True)

d = app.fft(d) * app.get_fft_scale()

d = app.ifft(d) * app.get_ifft_scale()

plt.figure(figsize=(13,3))
plt.subplot(141)
plt.imshow(abs(d.get() - d0)[...,:-2].mean(axis=0))
plt.title("3D R2C+C2R diff 2D mean")
plt.colorbar()
plt.subplot(142)
plt.imshow(abs(d.get() - d0)[...,:-2].max(axis=0))
plt.title("3D R2C+C2R diff 2D max")
plt.colorbar()
plt.subplot(143)
plt.imshow(abs(d.get() - d0)[0,:,:-2])
plt.title("3D R2C+C2R diff z=0")
plt.colorbar()
plt.subplot(144)
plt.imshow(abs(d.get() - d0)[-1,:,:-2])
plt.title("3D R2C+C2R diff z=1")
plt.colorbar()

print((abs(d0[...,:-2])**2).sum(), (abs(d.get()[...,:-2])**2).sum())
plt.tight_layout()

please provide the sources of the documentation

Hello, I would like to package vkFFT for Debian in order to provide the Python binding later.

But it is not possible to accept a pdf as documentation without the source of this pdf.

Would you be so kind and considere to add the source of the Documentation in the repository.

Cheers

Frederic

Issue running on macOS via MoltenVK..

Hi,
I tried to run your library (single precision mode only) under macOS..
just wanted to see FP32 perf of your library on Metal on a RX Vega..
with a pair of trivial fixes code builds but fails to run due to some issue raised by Apple Metal shader compiler upon variable length arrays:

program_source:353:29: error: variable length arrays are not supported in Metal
    threadgroup float2 sdata[_891];

for more details I opened a MoltenVK issue:
KhronosGroup/MoltenVK#1106
altough seems an issue of MoltenVK/SPIRV-Cross/Metal I share here the problem as maybe you can find a simple workaround in your shaders without degrading current peformance..
feel free to close the issue if not interested on it..
thanks..

R2C/C2R 1D 2^12 limitation

I'm currently switching from cuFFT to VkFFT. I'm using 2^27 large single precision 1D R2C/C2R transformations in cuFFT. In vkFFT single precision 1D R2C/C2R is unfortunately limited to 2^12. Are there any plans to support larger one-dimensional R2C/C2R transformations? If so, I would highly appreciate it. The limitation seems to have to do with the amount of available shared memory but I don’t understand why it can’t be dealt with using multiple axis uploads like done on C2C.

Until this is supported should I just use C2C and fill the complex part with zeros or are there any more efficient ways? It seems like a huge waste of memory access time having to read and write useless zeros from global memory. Could it be possible to only read the real part from memory and just hardcode the complex part to always be zero on FFT and discard the complex part on IFFT? I'm working on a science project about privacy amplification in quantum cryptography where performance is very important. Thanks a lot for making VkFFT. Without it I would be stuck on NVidia GPUs. I’m looking forward to GTC 2021.

cmake installation does not copy header files

When compiling with cmake, eg cmake -DCMAKE_INSTALL_PREFIX=${HOME}/vkfft .. && make && make install, the file vkFFT.h file isn't copied to the installation directory; they should be in ~/vkfft/include. Other header files may be missing as well.

Forward - Backward FFT

Hey,
first of all thank you for your work. I've come across something which I do not understand.

in Vulkan_FFT.cpp:2837
p = fftw_plan_dft_1d(benchmark_dimensions[n][0], inputC_double, output_FFTW, 1, FFTW_ESTIMATE);
you are creating a backwards FFT according to http://www.fftw.org/fftw3_doc/Complex-One_002dDimensional-DFTs.html

The fourth argument, sign, can be either FFTW_FORWARD (-1) or FFTW_BACKWARD (+1), and indicates the direction of the transform you are interested in; technically, it is the sign of the exponent in the transform.

but in Vulkan_FFT.cpp:2914
forward_configuration.inverse = false; //Direction of FFT. false - forward, true - inverse.
a forward fft is created.

Interestingly the error seems to be quite low even though we're comparing the results of and ifft with the results of an fft which is really confusing to me.

Radeon VII benchmarks

Hello, thanks for your work on VkFFT! Here are some initial benchmark results for VkFFT on a Radeon VII on commit 6403518:

VkFFT 0

VkFFT System: 32x32x1 Buffer: 0 MB avg_time_per_step: 0.026 ms std_error: 0.000 batch: 1000 benchmark: 313
VkFFT System: 64x64x1 Buffer: 0 MB avg_time_per_step: 0.024 ms std_error: 0.000 batch: 1000 benchmark: 1355
VkFFT System: 256x256x1 Buffer: 0 MB avg_time_per_step: 0.039 ms std_error: 0.000 batch: 1000 benchmark: 13275
VkFFT System: 1024x256x1 Buffer: 2 MB avg_time_per_step: 0.067 ms std_error: 0.000 batch: 1000 benchmark: 30454
VkFFT System: 512x512x1 Buffer: 2 MB avg_time_per_step: 0.058 ms std_error: 0.000 batch: 1000 benchmark: 35036
VkFFT System: 1024x1024x1 Buffer: 8 MB avg_time_per_step: 0.202 ms std_error: 0.005 batch: 512 benchmark: 40647
VkFFT System: 4096x1024x1 Buffer: 32 MB avg_time_per_step: 0.718 ms std_error: 0.001 batch: 128 benchmark: 45669
VkFFT System: 2048x2048x1 Buffer: 32 MB avg_time_per_step: 0.765 ms std_error: 0.002 batch: 128 benchmark: 42852
VkFFT System: 4096x4096x1 Buffer: 128 MB avg_time_per_step: 3.448 ms std_error: 0.003 batch: 32 benchmark: 38011
VkFFT System: 64x64x64 Buffer: 2 MB avg_time_per_step: 0.083 ms std_error: 0.000 batch: 1000 benchmark: 24803
VkFFT System: 128x128x128 Buffer: 16 MB avg_time_per_step: 0.530 ms std_error: 0.001 batch: 256 benchmark: 30912
VkFFT System: 256x256x256 Buffer: 128 MB avg_time_per_step: 5.825 ms std_error: 0.007 batch: 32 benchmark: 22501
VkFFT System: 512x256x64 Buffer: 64 MB avg_time_per_step: 1.598 ms std_error: 0.003 batch: 64 benchmark: 41013
VkFFT System: 1024x1024x64 Buffer: 512 MB avg_time_per_step: 15.492 ms std_error: 0.030 batch: 8 benchmark: 33841
VkFFT System: 4096x256x32 Buffer: 256 MB avg_time_per_step: 6.225 ms std_error: 0.014 batch: 16 benchmark: 42112
VkFFT System: 2048x256x256 Buffer: 1024 MB avg_time_per_step: 58.240 ms std_error: 0.121 batch: 4 benchmark: 18004
VkFFT System: 4096x4096x8 Buffer: 1024 MB avg_time_per_step: 33.426 ms std_error: 0.012 batch: 4 benchmark: 31370
VkFFT System: 32768x64x1 Buffer: 16 MB avg_time_per_step: 0.397 ms std_error: 0.001 batch: 256 benchmark: 41307
VkFFT System: 65536x64x1 Buffer: 32 MB avg_time_per_step: 0.816 ms std_error: 0.002 batch: 128 benchmark: 40134
VkFFT System: 131072x64x1 Buffer: 64 MB avg_time_per_step: 1.637 ms std_error: 0.004 batch: 64 benchmark: 40045
VkFFT System: 262144x64x1 Buffer: 128 MB avg_time_per_step: 3.407 ms std_error: 0.005 batch: 32 benchmark: 38470
VkFFT System: 1048576x64x1 Buffer: 512 MB avg_time_per_step: 19.710 ms std_error: 0.101 batch: 8 benchmark: 26599
VkFFT System: 4194304x64x1 Buffer: 2048 MB avg_time_per_step: 96.410 ms std_error: 0.073 batch: 2 benchmark: 21752
VkFFT System: 8192x8192x1 Buffer: 512 MB avg_time_per_step: 20.152 ms std_error: 0.017 batch: 8 benchmark: 26017
VkFFT System: 16384x16384x1 Buffer: 2048 MB avg_time_per_step: 135.186 ms std_error: 0.242 batch: 2 benchmark: 15513
Benchmark score VkFFT: 29680
Device name: AMD RADV VEGA20 (ACO) API:1.2.131

VkFFT 1

VkFFT System: 32x32x1 Buffer: 0 MB avg_time_per_step: 0.043 ms std_error: 0.009 batch: 1000 benchmark: 184
VkFFT System: 64x64x1 Buffer: 0 MB avg_time_per_step: 0.044 ms std_error: 0.009 batch: 1000 benchmark: 723
VkFFT System: 256x256x1 Buffer: 1 MB avg_time_per_step: 0.112 ms std_error: 0.017 batch: 819 benchmark: 4578
VkFFT System: 1024x256x1 Buffer: 4 MB avg_time_per_step: 0.268 ms std_error: 0.021 batch: 204 benchmark: 7652
VkFFT System: 512x512x1 Buffer: 4 MB avg_time_per_step: 0.255 ms std_error: 0.041 batch: 204 benchmark: 8046
VkFFT System: 1024x1024x1 Buffer: 16 MB avg_time_per_step: 0.806 ms std_error: 0.080 batch: 51 benchmark: 10160
VkFFT System: 4096x1024x1 Buffer: 64 MB avg_time_per_step: 4.040 ms std_error: 0.726 batch: 12 benchmark: 8111
VkFFT System: 2048x2048x1 Buffer: 64 MB avg_time_per_step: 3.546 ms std_error: 0.561 batch: 12 benchmark: 9240
VkFFT System: 4096x4096x1 Buffer: 256 MB avg_time_per_step: 17.434 ms std_error: 2.677 batch: 3 benchmark: 7518
VkFFT System: 64x64x64 Buffer: 4 MB avg_time_per_step: 0.336 ms std_error: 0.015 batch: 204 benchmark: 6087
VkFFT System: 128x128x128 Buffer: 32 MB avg_time_per_step: 2.495 ms std_error: 0.174 batch: 25 benchmark: 6566
VkFFT System: 256x256x256 Buffer: 256 MB avg_time_per_step: 26.039 ms std_error: 0.947 batch: 3 benchmark: 5033
VkFFT System: 512x256x64 Buffer: 128 MB avg_time_per_step: 7.879 ms std_error: 0.601 batch: 6 benchmark: 8317
VkFFT System: 1024x1024x64 Buffer: 1024 MB avg_time_per_step: 59.508 ms std_error: 3.754 batch: 1 benchmark: 8810
VkFFT System: 4096x256x32 Buffer: 512 MB avg_time_per_step: 40.336 ms std_error: 4.076 batch: 1 benchmark: 6499
VkFFT System: 2048x256x256 Buffer: 2048 MB avg_time_per_step: 152.475 ms std_error: 5.434 batch: 1 benchmark: 6877
VkFFT System: 4096x4096x8 Buffer: 2048 MB avg_time_per_step: 113.239 ms std_error: 6.673 batch: 1 benchmark: 9259
VkFFT System: 32768x64x1 Buffer: 32 MB avg_time_per_step: 1.801 ms std_error: 0.211 batch: 25 benchmark: 9098
VkFFT System: 65536x64x1 Buffer: 64 MB avg_time_per_step: 4.478 ms std_error: 0.348 batch: 12 benchmark: 7317
VkFFT System: 131072x64x1 Buffer: 128 MB avg_time_per_step: 9.203 ms std_error: 0.282 batch: 6 benchmark: 7121
VkFFT System: 262144x64x1 Buffer: 256 MB avg_time_per_step: 21.956 ms std_error: 1.218 batch: 3 benchmark: 5969
VkFFT System: 1048576x64x1 Buffer: 1024 MB avg_time_per_step: 96.336 ms std_error: 4.537 batch: 1 benchmark: 5442
Benchmark score VkFFT: 6755
Device name: AMD RADV VEGA20 (ACO) API:1.2.131

VkFFT 2

VkFFT System: 32x32x1 Buffer: 0 MB avg_time_per_step: 0.026 ms std_error: 0.000 batch: 1000 benchmark: 304
VkFFT System: 64x64x1 Buffer: 0 MB avg_time_per_step: 0.026 ms std_error: 0.004 batch: 1000 benchmark: 1210
VkFFT System: 256x256x1 Buffer: 0 MB avg_time_per_step: 0.062 ms std_error: 0.010 batch: 1000 benchmark: 8313
VkFFT System: 1024x256x1 Buffer: 1 MB avg_time_per_step: 0.113 ms std_error: 0.011 batch: 1000 benchmark: 18128
VkFFT System: 512x512x1 Buffer: 1 MB avg_time_per_step: 0.079 ms std_error: 0.004 batch: 1000 benchmark: 26037
VkFFT System: 1024x1024x1 Buffer: 4 MB avg_time_per_step: 0.171 ms std_error: 0.010 batch: 1000 benchmark: 47824
VkFFT System: 4096x1024x1 Buffer: 16 MB avg_time_per_step: 0.615 ms std_error: 0.024 batch: 256 benchmark: 53283
VkFFT System: 2048x2048x1 Buffer: 16 MB avg_time_per_step: 0.621 ms std_error: 0.012 batch: 256 benchmark: 52749
VkFFT System: 4096x4096x1 Buffer: 64 MB avg_time_per_step: 2.623 ms std_error: 0.133 batch: 64 benchmark: 49961
VkFFT System: 64x64x64 Buffer: 1 MB avg_time_per_step: 0.091 ms std_error: 0.012 batch: 1000 benchmark: 22458
VkFFT System: 128x128x128 Buffer: 8 MB avg_time_per_step: 0.436 ms std_error: 0.018 batch: 512 benchmark: 37619
VkFFT System: 256x256x256 Buffer: 64 MB avg_time_per_step: 4.135 ms std_error: 0.106 batch: 64 benchmark: 31694
VkFFT System: 512x256x64 Buffer: 32 MB avg_time_per_step: 1.145 ms std_error: 0.081 batch: 128 benchmark: 57245
VkFFT System: 1024x1024x64 Buffer: 256 MB avg_time_per_step: 11.447 ms std_error: 0.441 batch: 16 benchmark: 45800
VkFFT System: 4096x256x32 Buffer: 128 MB avg_time_per_step: 4.722 ms std_error: 0.214 batch: 32 benchmark: 55519
VkFFT System: 2048x256x256 Buffer: 512 MB avg_time_per_step: 48.096 ms std_error: 0.616 batch: 8 benchmark: 21801
VkFFT System: 4096x4096x8 Buffer: 512 MB avg_time_per_step: 23.822 ms std_error: 1.083 batch: 8 benchmark: 44016
VkFFT System: 32768x64x1 Buffer: 8 MB avg_time_per_step: 0.322 ms std_error: 0.009 batch: 512 benchmark: 50837
VkFFT System: 65536x64x1 Buffer: 16 MB avg_time_per_step: 0.686 ms std_error: 0.016 batch: 256 benchmark: 47768
VkFFT System: 131072x64x1 Buffer: 32 MB avg_time_per_step: 1.383 ms std_error: 0.065 batch: 128 benchmark: 47375
VkFFT System: 262144x64x1 Buffer: 64 MB avg_time_per_step: 2.680 ms std_error: 0.094 batch: 64 benchmark: 48913
VkFFT System: 1048576x64x1 Buffer: 256 MB avg_time_per_step: 11.622 ms std_error: 0.593 batch: 16 benchmark: 45111
VkFFT System: 4194304x64x1 Buffer: 1024 MB avg_time_per_step: 56.663 ms std_error: 2.702 batch: 4 benchmark: 37011
VkFFT System: 8192x8192x1 Buffer: 256 MB avg_time_per_step: 17.445 ms std_error: 0.379 batch: 16 benchmark: 30053
VkFFT System: 16384x16384x1 Buffer: 1024 MB avg_time_per_step: 107.746 ms std_error: 2.006 batch: 4 benchmark: 19463
Benchmark score VkFFT: 36020
Device name: AMD RADV VEGA20 (ACO) API:1.2.131

VkFFT 6

VkFFT System: 64 64x4194304 Buffer: 2048 MB avg_time_per_step: 25.119 ms std_error: 0.216 batch: 10 benchmark: 83490
VkFFT System: 128 128x2097152 Buffer: 2048 MB avg_time_per_step: 21.802 ms std_error: 0.910 batch: 10 benchmark: 96191
VkFFT System: 256 256x1048576 Buffer: 2048 MB avg_time_per_step: 12.443 ms std_error: 0.601 batch: 10 benchmark: 168545
VkFFT System: 512 512x524288 Buffer: 2048 MB avg_time_per_step: 10.051 ms std_error: 0.037 batch: 10 benchmark: 208658
VkFFT System: 1024 1024x262144 Buffer: 2048 MB avg_time_per_step: 11.829 ms std_error: 0.589 batch: 10 benchmark: 177288
VkFFT System: 2048 2048x131072 Buffer: 2048 MB avg_time_per_step: 12.967 ms std_error: 0.406 batch: 10 benchmark: 161730
VkFFT System: 4096 4096x65536 Buffer: 2048 MB avg_time_per_step: 11.893 ms std_error: 0.418 batch: 10 benchmark: 176336
VkFFT System: 8192 8192x32768 Buffer: 2048 MB avg_time_per_step: 23.767 ms std_error: 0.272 batch: 10 benchmark: 88238
VkFFT System: 16384 16384x16384 Buffer: 2048 MB avg_time_per_step: 22.890 ms std_error: 1.024 batch: 10 benchmark: 91618
VkFFT System: 32768 32768x8192 Buffer: 2048 MB avg_time_per_step: 24.336 ms std_error: 0.271 batch: 10 benchmark: 86173
VkFFT System: 65536 65536x4096 Buffer: 2048 MB avg_time_per_step: 27.001 ms std_error: 1.616 batch: 10 benchmark: 77670
VkFFT System: 131072 131072x2048 Buffer: 2048 MB avg_time_per_step: 27.755 ms std_error: 0.616 batch: 10 benchmark: 75559
VkFFT System: 262144 262144x1024 Buffer: 2048 MB avg_time_per_step: 30.953 ms std_error: 0.737 batch: 10 benchmark: 67752
VkFFT System: 524288 524288x512 Buffer: 2048 MB avg_time_per_step: 38.938 ms std_error: 1.414 batch: 10 benchmark: 53858
VkFFT System: 1048576 1048576x256 Buffer: 2048 MB avg_time_per_step: 54.666 ms std_error: 0.697 batch: 10 benchmark: 38363
VkFFT System: 2097152 2097152x128 Buffer: 2048 MB avg_time_per_step: 93.011 ms std_error: 1.052 batch: 10 benchmark: 22547
VkFFT System: 4194304 4194304x64 Buffer: 2048 MB avg_time_per_step: 73.632 ms std_error: 0.601 batch: 10 benchmark: 28481
VkFFT System: 8388608 8388608x32 Buffer: 2048 MB avg_time_per_step: 94.380 ms std_error: 0.517 batch: 10 benchmark: 22220
VkFFT System: 16777216 16777216x16 Buffer: 2048 MB avg_time_per_step: 166.482 ms std_error: 0.690 batch: 10 benchmark: 12596
VkFFT System: 33554432 33554432x8 Buffer: 2048 MB avg_time_per_step: 288.631 ms std_error: 0.151 batch: 10 benchmark: 7265
VkFFT System: 67108864 67108864x4 Buffer: 2048 MB avg_time_per_step: 300.939 ms std_error: 0.537 batch: 10 benchmark: 6968
VkFFT System: 134217728 134217728x2 Buffer: 2048 MB avg_time_per_step: 324.874 ms std_error: 3.085 batch: 10 benchmark: 6455
VkFFT System: 268435456 268435456x1 Buffer: 2048 MB avg_time_per_step: 156.643 ms std_error: 1.005 batch: 10 benchmark: 13388
Benchmark score VkFFT: 77017
Device name: AMD RADV VEGA20 (ACO) API:1.2.131

For me, tests 3, 4, and 5 produced large numerical outputs without timing information, and test 7 segfaulted with no output. Happy to make any tweaks and run again.

Necessary to bundle glslang?

I was wanting to submodule VkFFT in a project. It's probably better practice than manually copying the header all the time - would make it easier to update or pin to specific commits, etc. (Especially after the refactoring into multiple files that you mentioned.)

I balked at the size of the repo - the source files are 45MB, 90% of which is the bundled glslang-master directory. My understanding is that it's only there for benchmarks, etc., and only for the Vulkan backend at that. Would you consider changing the workflow to clone the glslang repo when needed? (I have no experience with cmake, but it seems like it even has a built-in way to do so.)

offset parameters not respected?

It appears to me that the buffer offset configuration parameters aren't being utilized: buffers always seem to be read from and/or written to at the head of the buffer. I'm testing on the OpenCL backend.

For example, say I make a buffer with double the length of the transform size and set only the second half to some input while setting the first half to zero. Passing the correct offset to the middle of the buffer yields an output with all zeros, while setting both halves of the array to the same input data yields the (correct) transform. Vice versa, outputs always seem to land at the head of the buffer regardless of passed offsets. I also peeked at the shader code and didn't see any of the offsets there (i.e., the literal number in items, not bytes).

Apologies that I don't have a simple reproducer to share; I'm working in Python and pyvkfft doesn't support offsets at the moment, so it's taken a bit of patching to attempt this. Let me know if I can provide any further information. Apologies in advanced if I'm missing something silly.

As an aside, is there something about the codegen that requires the offsets to be configuration parameters rather than launch parameters (i.e., arguments to the kernels themselves)? I imagine it'd be useful (or at least nice) to need only one VkFFTApplication for transforms that only differ by the offset of passed buffers.

Titan V FP32 & FP64 results..

Hi,
just tested your awesome library with a Titan V (Volta) which has only 2x slower FP64 vs FP32..
using NVIDIA 455.26.01 and CUDA 10.1 (can update CUDA SDK to 11.1 if requested: faster CUFFT?)
Titan V:
VKFFT single:
titanv.txt
CUFFT single:
titanvcufft.txt
VKFFT double:
titanvfp64.txt
CUFFT double:
titanvcufftdouble.txt

roughly speaking your library is well optimized for double precision as it slowdowns 2x and similar score to CUFFT..

some perf issues in vkFFT in double precision vs single compared to CUFFT like for example in 4kx4kx8 case:

VkFFT System: 4096x4096x8 Buffer: 2048 MB avg_time_per_step: 75.931 ms
cuFFT System: 4096x4096x8 Buffer: 2048 MB avg_time_per_step: 58.844 ms

note similar "big" double precision FFTs are not affected:

VkFFT System: 2048x256x256 Buffer: 2048 MB avg_time_per_step: 48.205 ms std_error: 0.329 batch: 1
cuFFT System: 2048x256x256 Buffer: 2048 MB avg_time_per_step: 45.925 ms std_error: 0.215 batch: 1

as you can see from fp32 case:

cuFFT System: 4096x4096x8 Buffer: 1024 MB avg_time_per_step: 29.235 ms std_error: 0.179 batch: 4
VkFFT System: 4096x4096x8 Buffer: 1024 MB avg_time_per_step: 29.923 ms std_error: 0.613 batch: 4

4kx4kx8 was performing similar to CUFFT in fp32 case : 29ms .. and CUFFT scales well from 29ms to 58ms..

as said 2kx256x256 scales well to fp64: goes from 23-24ms in fp32 to 46-48 ms in fp64 case

VkFFT System: 2048x256x256 Buffer: 1024 MB avg_time_per_step: 24.031 ms std_error: 0.662 batch: 4 benchmark: 43633
cuFFT System: 2048x256x256 Buffer: 1024 MB avg_time_per_step: 23.424 ms std_error: 0.085 batch: 4 benchmark: 44765

Plots:
[EDIT: in next post]

Raspberry Pi 4

The raspberry pi 4 got a vulkan driver some weeks ago and I wanted to try if I can get VkFFT running.
The precision results of running
./Vulkan_FFT -vkfft 11`
look as follows:

11 - VkFFT/FFTW C2C precision test in single precision
VkFFT System: 32x1x1 avg_difference: 4.210443 max_difference: 11.260108 avg_eps: 1.013775 max_eps: 1.678629
VkFFT System: 64x1x1 avg_difference: 6.011346 max_difference: 16.996910 avg_eps: 1.015163 max_eps: 1.728844
VkFFT System: 128x1x1 avg_difference: 7.739901 max_difference: 20.079729 avg_eps: 1.002452 max_eps: 1.615784
VkFFT System: 256x1x1 avg_difference: 11.449183 max_difference: 36.175850 avg_eps: 1.020900 max_eps: 2.139938
VkFFT System: 512x1x1 avg_difference: 16.608913 max_difference: 48.377995 avg_eps: 1.005776 max_eps: 2.111690
VkFFT System: 1024x1x1 avg_difference: 23.274626 max_difference: 67.035599 avg_eps: 1.001270 max_eps: 1.506914
VkFFT System: 2048x1x1 avg_difference: 32.783424 max_difference: 101.897461 avg_eps: 1.001845 max_eps: 2.103968
VkFFT System: 4096x1x1 avg_difference: 46.165894 max_difference: 154.442322 avg_eps: 1.000973 max_eps: 1.426838
VkFFT System: 8192x1x1 avg_difference: 65.744987 max_difference: 238.900620 avg_eps: 0.999933 max_eps: 1.604411
VkFFT System: 16384x1x1 avg_difference: 92.832741 max_difference: 335.043823 avg_eps: 1.000153 max_eps: 1.980254
VkFFT System: 32768x1x1 avg_difference: 130.892410 max_difference: 503.015656 avg_eps: 1.000137 max_eps: 1.570149
VkFFT System: 65536x1x1 avg_difference: 185.043289 max_difference: 726.917480 avg_eps: 1.000051 max_eps: 2.197726
VkFFT System: 131072x1x1 avg_difference: 262.086639 max_difference: 1031.330566 avg_eps: 1.000044 max_eps: 3.638197
VkFFT System: 262144x1x1 avg_difference: 370.687378 max_difference: 1480.791260 avg_eps: 1.000040 max_eps: 6.577468
VkFFT System: 524288x1x1 avg_difference: 523.781616 max_difference: 2129.027832 avg_eps: 0.999997 max_eps: 1.553625
VkFFT System: 1048576x1x1 avg_difference: 741.405579 max_difference: 3288.805908 avg_eps: 1.000000 max_eps: 1.452406
VkFFT System: 2097152x1x1 avg_difference: 1048.030029 max_difference: 4571.413574 avg_eps: 1.000000 max_eps: 1.693890
VkFFT System: 4194304x1x1 avg_difference: 1481.965088 max_difference: 6517.591797 avg_eps: 1.000000 max_eps: 1.335009
VkFFT System: 8388608x1x1 avg_difference: 2095.434570 max_difference: 9311.947266 avg_eps: 1.000000 max_eps: 1.520402

On my other machines it is running fine but there seems to be a problem here. This might has to do with the 32-Bit architecture?
Maybe I should add here that I checked already the results for correctness and the ones from fftw are correct.
Best regards

Vulkan_FFT crashes the GPU

Dear Dmitrij,

Thank you developing an interesting application. It looks very
promising. I built it from sources. When I ran Vulkan_FFT, it
showed rather small computation time ... till
it crashed the card around test 17. The monitor became black and I had
to reboot the host. Here are messages in the kernel log:

May 9 17:43:45 xxxxxxxx kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered
May 9 17:43:48 xxxxxxxx kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=1265969, emitted seq=1265970
May 9 17:43:48 xxxxxxxx kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Vulkan_FFT pid 1867631 thread Vulkan_FFT pid 1867631
May 9 17:43:48 xxxxxxxx kernel: amdgpu 0000:21:00.0: amdgpu: GPU reset begin!
May 9 17:43:48 xxxxxxxx kernel: amdgpu 0000:21:00.0: amdgpu: BACO reset
May 9 17:43:48 xxxxxxxx kernel: amdgpu 0000:21:00.0: amdgpu: GPU reset succeeded, trying to resume
May 9 17:43:48 xxxxxxxx kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
May 9 17:43:48 xxxxxxxx kernel: [drm] VRAM is lost due to GPU reset!

Then the OS tries to restart the driver, but fails.

/dist/VkFFT-master/build/Vulkan_FFT -h
VkFFT v1.2.1 (26-04-2021). Author: Tolmachev Dmitrii
Vulkan backend
-h: print help

Here is some information from clinfo:

Device Name gfx804
Device Board Name (AMD) Radeon RX550/550 Series
Global memory size 4080807936 (3.801GiB)

I ran gputest_gui.py giu. It passed. I ran fft using AMD rocfft via
hipfort -- it worked, though probably not as fast as your code
(single precision 2D FFT 8192x8192 consumed 0.56s including coping
data to and from the GPU).

Does this ring any bell for you? Any other information that might
help to debug?

Sincerely,
Leonid
2021.05.09_19:55:30

2D and 3D R2C/C2R get wrong results with HIP backend

Hello, I'm using vkfft with the hip backend. I extracted a simple test case from the source file Vulkan_FFT.cpp。On AMD MI50 GPU and ROCm 3.9/4.0,the 1D R2C/C2R results are right, but those of 2D and 3D are partially wrong. I'm not sure if it is caused by vkfft itself or I used it incorrectly.
Here is the code:

#include<cstdio>
#include<cstdlib>

#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <hip/hip_complex.h>
#include "vkFFT.h"

#include <chrono>

#define DIM 3
#define SIZE 52

template < typename T>
void initialData(T *ip, const int size, const int max_value)
{
    int i;

    for(i = 0; i < size; i++)
    {
        ip[i] = (T)(rand() % max_value);
    }
}

void checkResult(float *backRef, float *originalRef, const int N)
{
    double epsilon = 1.0E-2;
    bool match = 1;
    const float overN = 1.0 / pow(SIZE, DIM);

    for (int i = 0; i < N; i++)
    {
        if (fabs(backRef[i] * overN - originalRef[i]) > epsilon)
        {
            match = 0;
            printf("Arrays do not match!\n");
            printf("back result %5.2f original result %5.2f at %d\n", backRef[i]* overN, originalRef[i], i);
            break;
        }
    }

    if (match) printf("Arrays match.\n\n");
}

typedef struct {
    hipDevice_t device;
    hipCtx_t context;
    uint32_t device_id;
}VkGPU;

int main()
{
    hipError_t res = hipSuccess;
    VkGPU vkGPU = {};
    hipInit(0);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_INITIALIZE;
    hipSetDevice(vkGPU.device_id);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_SET_DEVICE_ID;
    hipDeviceGet(&vkGPU.device, vkGPU.device_id);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_GET_DEVICE;
    hipCtxCreate(&vkGPU.context, 0, vkGPU.device);
    if (res != hipSuccess) return VKFFT_ERROR_FAILED_TO_CREATE_CONTEXT;
    VkFFTConfiguration configuration = {};
    VkFFTApplication app = {};
    configuration.FFTdim = 3;
    configuration.size[0] = SIZE;
    configuration.size[1] = SIZE;
    configuration.size[2] = SIZE;
    configuration.device = &(vkGPU.device);
    configuration.performR2C = true;
    uint64_t buffersize = sizeof(hipFloatComplex) * pow(SIZE, DIM);

    float *h_realGrid = (float *)malloc(sizeof(float) * pow(SIZE, DIM));
    float *h_realGrid_original = (float *)malloc(sizeof(float) * pow(SIZE, DIM));
    hipFloatComplex *h_complexGrid = (hipFloatComplex *)malloc(sizeof(hipFloatComplex) * pow(SIZE, DIM));

    hipFloatComplex *d_complexGrid;
    //hipMalloc((void **)&d_realGrid, sizeof(float) * pow(SIZE, DIM));
    hipMalloc((void **)&d_complexGrid, sizeof(hipFloatComplex) * pow(SIZE, DIM));
 
    configuration.buffer = (void**)&d_complexGrid;
    configuration.bufferSize = &buffersize;

    initialData(h_realGrid_original, pow(SIZE, DIM), 10);
    //for(int i = 0; i < pow(SIZE, DIM); ++i)
    //{
    //    h_complexGrid[i].x = h_realGrid_original[i];
    //    h_complexGrid[i].y = 0.0f;
    //}
    memset(h_complexGrid, 0, sizeof(hipFloatComplex) * pow(SIZE, DIM));
    memcpy(h_complexGrid, h_realGrid_original, sizeof(float) * pow(SIZE, DIM));

    hipMemcpy(d_complexGrid, h_complexGrid, sizeof(hipFloatComplex) * pow(SIZE, DIM), hipMemcpyHostToDevice);
 
    std::chrono::steady_clock::time_point timeSubmit = std::chrono::steady_clock::now(); 
    VkFFTResult resFFT;
    resFFT = initializeVkFFT(&app, configuration);
    if (resFFT != VKFFT_SUCCESS) 
    {
        printf("VkFFT error code %d\n", resFFT);
        return resFFT;
    }
    VkFFTLaunchParams launchParams = {};
  
 
    resFFT = VkFFTAppend(&app, -1, &launchParams); 
    if (resFFT != VKFFT_SUCCESS) 
    {
        printf("vkfft error code %d\n", resFFT);
        return resFFT;
    }
    resFFT = VkFFTAppend(&app, 1, &launchParams);
    if (resFFT != VKFFT_SUCCESS) 
    {
        printf("vkfft error code %d\n", resFFT);
        return resFFT;
    }
    res = hipDeviceSynchronize();
    if (res != hipSuccess) 
    {
        printf("hip error code %d\n", res);
        return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
    }
    
    std::chrono::steady_clock::time_point timeEnd = std::chrono::steady_clock::now();
    float totTime = std::chrono::duration_cast<std::chrono::microseconds>(timeEnd - timeSubmit).count() * 0.001; 
    printf("time: %.3f ms\n", totTime);
    
    hipMemcpy(h_complexGrid, d_complexGrid, sizeof(hipFloatComplex) * pow(SIZE, DIM), hipMemcpyDeviceToHost);
    //for(int i = 0; i < pow(SIZE, DIM); ++i)
    //{
    //    h_realGrid[i] = h_complexGrid[i].x;
    //}
    memcpy(h_realGrid, h_complexGrid, sizeof(float) * pow(SIZE, DIM));
 
    printf("h_realGrid: ");
    for(int i = 0; i < 30; ++i)
    {
         printf("%.6f  ", h_realGrid[i] / pow(SIZE, DIM));
    }
    printf("\nh_realGrid_original: ");
    for(int i = 0; i < 30; ++i)
    {
         printf("%.6f  ", h_realGrid_original[i]);
    }

    printf("\n");
    checkResult(h_realGrid, h_realGrid_original, pow(SIZE, DIM));

    deleteVkFFT(&app);
    free(h_realGrid);
    free(h_realGrid_original);
    free(h_complexGrid);

    //hipFree(d_realGrid);
    hipFree(d_complexGrid);
    
    return 0;
}

I compile it using the command:

hipcc -o planMany-vkfft -DVKFFT_BACKEND=2  planMany-vkfft.cpp

and my results:

time: 11023.094 ms
h_realGrid: 2.999996  5.999995  6.999996  4.999996  2.999997  4.999995  5.999996  1.999996  8.999998  0.999997  1.999998  6.999994  -0.000003  8.999997  2.999996  5.999996  -0.000004  5.999995  1.999997  5.999997  0.999996  7.999999  6.999996  8.999996  1.999997  -0.000003  1.999996  2.999999  6.999996  4.999996
h_realGrid_original: 3.000000  6.000000  7.000000  5.000000  3.000000  5.000000  6.000000  2.000000  9.000000  1.000000  2.000000  7.000000  0.000000  9.000000  3.000000  6.000000  0.000000  6.000000  2.000000  6.000000  1.000000  8.000000  7.000000  9.000000  2.000000  0.000000  2.000000  3.000000  7.000000  5.000000
back result -0.46 original result  1.00 at 52
back result  0.08 original result  7.00 at 106
back result  0.00 original result  2.00 at 107
back result -0.23 original result  9.00 at 160
back result -0.00 original result  8.00 at 161
back result  0.35 original result  7.00 at 214
back result  0.00 original result  6.00 at 215
back result -0.31 original result  2.00 at 268
back result  0.00 original result  4.00 at 269
back result  0.56 original result  3.00 at 322
back result -0.00 original result  2.00 at 323
back result -0.02 original result  0.00 at 376
back result -0.00 original result  5.00 at 377
......

Could you please help me with this problem? Thank you!

glslang is included in CMake even if targets already exist

A vulkan application using vkfft may already be using glslang, or depend on some other library that also uses glslang. If multiple places try to add_subdirectory() glslang, then it tries to set up multiple targets with the same name in Cmake's global namespace. So when you add_subdirectory(vkfft), you get errors.

If you wrap the inclusion with something like

if (NOT TARGET OSDependent)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/glslang-master)
endif()

it will be a bit easier to use in some applications.

[benchmarks] 2009 Mac Pro with NVIDIA 980 Ti

2009 Mac Pro 4,1->5,1 12-core 3.06GHz Xeon 96GB RAM macOS 10.13.6 with NVIDIA 980 Ti
VkFFT commit 6947846
Setup:

sudo port install MoltenVK vulkan-headers vulkan-loader vulkan-tools vulkan-validationlayers
git clone https://github.com/DTolm/VkFFT; cd VkFFT; mkdir build; cd build; cmake ..; make -24
N=0; time ./Vulkan_FFT -vkfft $N >$N.log 2>$N.err
N=1 ...

Results

0 0h:01m:51s Benchmark score VkFFT: 9928303
1 crashes
2 crashes
3 0h:00m:49s Benchmark score VkFFT: 42851
4 0h:01m:14s Benchmark score VkFFT: 54713
5 0h:01m:48s Benchmark score VkFFT: 9320665
6 0h:00m:44s Benchmark score VkFFT: 15106

0.log
0.err.log
1.log
1.err.log
2.log [empty]
2.err.log
3.log
3.err.log
4.log
4.err.log
5.log
5.err.log
6.log
6.err.log

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.