amd / amd-fftw

FFTW code optimized for AMD based processors

License: GNU General Public License v2.0

amd-fftw's Introduction

AOCL-FFTW

AOCL-FFTW is an AMD-optimized version of FFTW targeted at AMD EPYC CPUs. It is developed on top of FFTW (version fftw-3.3.10). All known FFTW features and functionality are retained and supported unchanged in this AMD-optimized library.

AOCL-FFTW achieves high performance through several optimizations: improved SIMD kernel functions, improved copy functions (cpy2d and cpy2d_pair, used in the rank-0 transform and buffering plan), improved selection of 256-bit kernels by the planner, and an optional in-place transpose for large problem sizes. It improves the performance of in-place MPI FFTs by employing a faster in-place MPI transpose function. It provides a new fast planner mode, an extension to the original planner, that improves planning time across planning modes in general and PATIENT mode in particular. Another new planning mode, the Top N planner, minimizes single-threaded run-to-run variations. AMD's application optimization layer speeds up HPC and scientific applications. The dynamic dispatcher feature builds a single portable optimized library that can run on a wide range of x86 CPU architectures.

FFTW is a free collection of fast C routines for computing the Discrete Fourier Transform and various special cases thereof in one or more dimensions. It includes complex, real, symmetric, and parallel transforms, and can handle arbitrary array sizes efficiently.

The doc/ directory contains the manual in texinfo, PDF, info, and HTML formats. Frequently asked questions and answers can be found in the doc/FAQ/ directory in ASCII and HTML.

For a quick introduction to calling FFTW, see the "Tutorial" section of the manual.

INSTALLATION

INSTALLATION FROM AOCL-FFTW GIT REPOSITORY:

After downloading the latest stable release from the git repository, https://github.com/amd/amd-fftw, follow the steps below to configure and build it for AMD EPYC processors based on the Naples, Rome, Milan, and future-generation architectures.

 ./configure --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 \
             --enable-mpi --enable-openmp --enable-shared \
             --enable-amd-opt --enable-amd-mpifft \
             --enable-dynamic-dispatcher \
             --prefix=<your-install-dir>
 make
 make install

The configure option "--enable-amd-opt" enables all the improvements and optimizations targeted at AMD EPYC CPUs. It is the master optimization switch: it must be enabled whenever any of the other optional AMD EPYC configure options are used.

When "--enable-amd-opt" is enabled, do not use the configure options "--enable-generic-simd128" or "--enable-generic-simd256".
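The two rules above can be checked before running configure. The following is a minimal sketch, not part of AOCL-FFTW: the function name check_amd_flags is hypothetical, and it assumes that every option starting with "--enable-amd-" other than the master switch counts as one of the optional AMD options.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a list of configure flags.
# Assumption: any --enable-amd-* flag other than --enable-amd-opt
# is one of the optional AMD switches that needs the master switch.
check_amd_flags() {
    has_amd_opt=no
    has_generic=no
    has_amd_extra=no
    for flag in "$@"; do
        case "$flag" in
            --enable-amd-opt) has_amd_opt=yes ;;
            --enable-generic-simd128|--enable-generic-simd256) has_generic=yes ;;
            --enable-amd-*) has_amd_extra=yes ;;
        esac
    done
    if [ "$has_amd_opt" = yes ] && [ "$has_generic" = yes ]; then
        echo "conflict: --enable-amd-opt cannot be combined with --enable-generic-simd128/256"
        return 1
    fi
    if [ "$has_amd_extra" = yes ] && [ "$has_amd_opt" = no ]; then
        echo "missing: optional AMD switches need --enable-amd-opt"
        return 1
    fi
    echo "ok"
}

# Example: a flag set that satisfies both rules.
check_amd_flags --enable-sse2 --enable-avx2 --enable-amd-opt --enable-amd-fast-planner
```

The function only inspects the flag names; it does not run configure or verify that the flags exist in a given release.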

The optional configure option "--enable-amd-mpifft" enables the MPI FFT related optimizations.

An optional configure option "--enable-amd-mpi-vader-limit" controls whether AMD's new MPI transpose algorithms are enabled. When using this option, set --mca btl_vader_eager_limit appropriately (the current recommendation is 65536) in the mpirun command.
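As an illustration of where the eager limit goes on the launch line, here is a small sketch that assembles and prints an mpirun command. The helper name build_mpirun_cmd, the rank count, and the application path ./my_mpi_fft_app are placeholders and are not part of AOCL-FFTW.

```shell
#!/bin/sh
# Print an mpirun command line that sets btl_vader_eager_limit.
# Arguments: rank count, application path, optional limit (defaults to 65536).
build_mpirun_cmd() {
    np="$1"
    app="$2"
    limit="${3:-65536}"
    printf 'mpirun -np %s --mca btl_vader_eager_limit %s %s\n' "$np" "$limit" "$app"
}

# Placeholder application and rank count; copy the printed line to run it.
build_mpirun_cmd 4 ./my_mpi_fft_app
```

Printing the command instead of running it keeps the sketch usable on machines without an MPI installation.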

The new fast planner can be enabled using the optional configure option "--enable-amd-fast-planner". It is supported in single and double precision.

Top N planner mode can be enabled using the optional configure option "--enable-amd-top-n-planner" to minimize run-to-run variations in performance. It is supported for single-threaded execution in single and double precision.

An optional configure option "AMD_ARCH" can be set to a CPU architecture value such as "auto", "znver1", "znver2", "znver3", or "znver4" for AMD EPYC processors.

The optional configure option "--enable-amd-app-opt" turns on AMD's application optimization layer to improve the performance of HPC and scientific applications. It currently covers complex and real (r2c and c2r) DFT problem types in double and single precision. It is not supported for MPI FFTs, r2r real DFT problem types, quad or long-double precision, or the split array format.

The dynamic dispatcher implements function multi-versioning using compiler attributes. Use the "--enable-dynamic-dispatcher" configure option to enable this feature. It is currently supported only on Linux-based systems. The set of x86 CPUs on which the single portable library can run depends on the highest CPU SIMD instruction set level with which it is configured.

An optional configure option "--enable-amd-trans" may improve the performance of transpose operations for very large FFT problem sizes. It is an experimental switch and is not enabled by default.

By default, the configure script enables double-precision mode. Pass the appropriate configure options to enable single-precision, quad-precision, or long-double mode.
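For reference, the precision switches are the standard FFTW ones, and each precision installs a differently named library. The sketch below maps a precision name to its configure flag and library base name. The helper names precision_flag and precision_lib are hypothetical, and the mapping assumes AOCL-FFTW keeps upstream FFTW's flag and library naming.

```shell
#!/bin/sh
# Map a precision name to the upstream FFTW configure flag (empty for the default).
precision_flag() {
    case "$1" in
        double)      printf '\n' ;;
        single)      printf '%s\n' '--enable-float' ;;
        long-double) printf '%s\n' '--enable-long-double' ;;
        quad)        printf '%s\n' '--enable-quad-precision' ;;
        *)           echo "unknown precision: $1" >&2; return 1 ;;
    esac
}

# Map a precision name to the base name of the library that build installs.
precision_lib() {
    case "$1" in
        double)      echo libfftw3 ;;
        single)      echo libfftw3f ;;
        long-double) echo libfftw3l ;;
        quad)        echo libfftw3q ;;
        *)           echo "unknown precision: $1" >&2; return 1 ;;
    esac
}

# Print one summary line per precision.
for p in double single long-double quad; do
    printf '%-12s flag=%-26s lib=%s\n' "$p" "$(precision_flag "$p")" "$(precision_lib "$p")"
done
```

Each precision needs its own configure and build pass, so installing several precisions means repeating the build with a different flag each time.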

CONTACTS

AOCL-FFTW is developed and maintained by AMD. For support of these libraries and the other tools of AMD Zen Software Studio, see https://www.amd.com/en/developer/aocc/compiler-technical-support.html

ACKNOWLEDGEMENTS

FFTW was developed by Matteo Frigo and Steven G. Johnson. We thank Matteo Frigo for the support he provided to us.

amd-fftw's Issues

Could not find fftwf_plan_many_[r2c|c2r] libfftw3.so

Building GROMACS 2020 against this optimized FFTW, following this AMD whitepaper, results in the following error when cmake checks the library:

(...)
-- Checking for module 'fftw3f'
--   Found fftw3f, version 3.3.8
-- Looking for fftwf_plan_many_dft in /usr/local/lib/libfftw3.so
-- Looking for fftwf_plan_many_dft in /usr/local/lib/libfftw3.so - not found
-- Looking for fftwf_plan_many_dft_r2c in /usr/local/lib/libfftw3.so
-- Looking for fftwf_plan_many_dft_r2c in /usr/local/lib/libfftw3.so - not found
-- Looking for fftwf_plan_many_dft_c2r in /usr/local/lib/libfftw3.so
-- Looking for fftwf_plan_many_dft_c2r in /usr/local/lib/libfftw3.so - not found
CMake Error at cmake/FindFFTW.cmake:105 (message):
  Could not find fftwf_plan_many_[r2c|c2r] in /usr/local/lib/libfftw3.so,
(...)

I expected these functions to be implemented.

Reproduce with:

Building amd-fftw on Ubuntu 20.04

~/amd-fftw$ ./configure --enable-sse2 --enable-avx --enable-avx2 --enable-shared  --enable-amd-opt --enable-amd-trans --enable-float
~/amd-fftw$ make -j48
~/amd-fftw$ sudo make install

Building GROMACS 2020 on Ubuntu 20.04

Run in ~/gromacs-2020/build$

#!/bin/bash
set -euo pipefail

rm -f CMakeCache.txt  
make clean
  
CMAKE_PREFIX_PATH=/usr/local/ cmake .. \
     -DGMX_BUILD_OWN_FFTW=OFF \
     -DGMX_FFT_LIBRARY=fftw3 \
     -DFFTWF_LIBRARY=/usr/local/lib/libfftw3.so \
     -DFFTWF_INCLUDE_DIR=/usr/local/include \
     -DREGRESSIONTEST_DOWNLOAD=ON \
     -DGMX_GPU=ON \
     -DCUDA_TOOLKIT_ROOT_DIR=/usr/lib/cuda/ \
     -DCMAKE_C_COMPILER=gcc-8 \
     -DCMAKE_CXX_COMPILER=g++-8 \
     -DCMAKE_C_FLAGS="-march=znver1 -O3" \
     -DCMAKE_CXX_FLAGS="-march=znver1 -O3"

make -j48
make check -j24
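One possible cause, offered as a guess and not a confirmed diagnosis: with upstream FFTW naming, an --enable-float build installs the single-precision library as libfftw3f.so, and only that library exports the fftwf_ symbols, while libfftw3.so is the double-precision library. The sketch below checks which library exports a given symbol. The helper names exports_symbol and check_lib are hypothetical, and the paths assume the /usr/local prefix used above.

```shell
#!/bin/sh
# Succeeds if the `nm -D --defined-only` listing on stdin exports symbol $1.
exports_symbol() {
    grep -Eq "[[:space:]]$1\$"
}

# Report whether a shared library exports a symbol.
# The check is skipped if the file is unreadable or nm is not installed.
check_lib() {
    lib="$1"
    sym="$2"
    if [ -r "$lib" ] && command -v nm >/dev/null 2>&1; then
        if nm -D --defined-only "$lib" | exports_symbol "$sym"; then
            echo "$lib exports $sym"
        else
            echo "$lib does NOT export $sym"
        fi
    else
        echo "skipped $lib (file or nm not available)"
    fi
}

check_lib /usr/local/lib/libfftw3.so  fftwf_plan_many_dft_r2c
check_lib /usr/local/lib/libfftw3f.so fftwf_plan_many_dft_r2c
```

If only libfftw3f.so exports the symbol, pointing -DFFTWF_LIBRARY at that file may be enough for the cmake check to pass.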

hybrid OpenMP+MPI tests fail with --enable-amd-trans

perl -w ./check.pl  -r -c=30 -v `pwd`/bench
         FFTW transforms passed basic tests!
perl -w ./check.pl  -r -c=30 -v --nthreads=2 `pwd`/bench
         FFTW threaded transforms passed basic tests!
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 1 `pwd`/mpi-bench"
     MPI FFTW transforms passed 10 tests, 1 CPU
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 2 `pwd`/mpi-bench"
      MPI FFTW transforms passed 10 tests, 2 CPUs
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 3 `pwd`/mpi-bench"
      MPI FFTW transforms passed 10 tests, 3 CPUs
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 4 `pwd`/mpi-bench"
      MPI FFTW transforms passed 10 tests, 4 CPUs
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi --nthreads=2 "mpirun -np 3 `pwd`/mpi-bench"
Found relative error 1.000000e+00 (impulse 1)

This is for the float datatype, using both --enable-amd-opt and --enable-amd-trans (there is no problem without --enable-amd-trans).

Missing fftw3f.dll with AOCL release v3

I am attempting to use this library on Windows 10 with Visual Studio 2019. I downloaded the prebuilt Windows version from the AMD Optimizing CPU Libraries page. My code #defines FFTW_DLL prior to #including "fftw3.h". My build configuration is x64, and I copy all of the DLLs in the package into the working directory.

When I run my executable I get an error message: "The code execution cannot proceed because fftw3f.dll was not found."

There is no such DLL in the aforementioned package, and I'm surprised because that DLL seems to be from the standard (non-AMD optimized) FFTW distro. The DLLs in the "Zen2" directory are:

$ ls -l *.dll
-rwxr-xr-x 1 shehr 197609 6041240 Mar 12 09:23 libfftw-dll.dll*
-rwxr-xr-x 1 shehr 197609 7379104 Mar 12 09:23 libfftwf-dll.dll*
-rwxr-xr-x 1 shehr 197609 7456936 Mar 12 09:23 libfftwf-mpi-dll.dll*
-rwxr-xr-x 1 shehr 197609 7390888 Mar 12 09:23 libfftwf-omp-dll.dll*
-rwxr-xr-x 1 shehr 197609  939168 Mar 12 09:23 libfftwl-dll.dll*
-rwxr-xr-x 1 shehr 197609 1018024 Mar 12 09:23 libfftwl-mpi-dll.dll*
-rwxr-xr-x 1 shehr 197609  950952 Mar 12 09:23 libfftwl-omp-dll.dll*
-rwxr-xr-x 1 shehr 197609 6119072 Mar 12 09:23 libfftw-mpi-dll.dll*
-rwxr-xr-x 1 shehr 197609 6052512 Mar 12 09:23 libfftw-omp-dll.dll*
-rwxr-xr-x 1 shehr 197609  660112 Mar 12 09:23 libomp.dll*

Any plan for pre-optimized wisdom files for AMD cpus?

It would be a huge convenience if there were some pre-optimized wisdom files (targeting a range of FFT configurations and sizes) for each AMD CPU. I do scientific computing that requires very fast FFTs, and in my experience FFT planning is extremely important (I have sometimes seen a 2x speedup going from plan level 1 to 3) but often very costly in wall time to do properly.
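In the meantime, wisdom can be generated once per machine and reused, which moves the planning cost out of the application run. The sketch below prints a command line for the fftwf-wisdom utility that upstream FFTW ships for single precision. It assumes the AOCL-FFTW build installs that tool as well, and the helper name wisdom_cmd, the output file name, and the problem sizes are only examples.

```shell
#!/bin/sh
# Print an fftwf-wisdom command line: verbose, writing to the given file,
# for the listed problem sizes (benchfft-style names such as cof1024).
wisdom_cmd() {
    out="$1"
    shift
    printf 'fftwf-wisdom -v -o %s' "$out"
    for size in "$@"; do
        printf ' %s' "$size"
    done
    printf '\n'
}

# Example sizes: complex out-of-place forward/backward 1024, real out-of-place forward 4096.
wisdom_cmd wisdom_f.dat cof1024 cob1024 rof4096
```

An application can then load the file with fftwf_import_wisdom_from_filename before creating its plans. Wisdom is specific to the machine and library build that produced it, so a file generated on one CPU model may not be optimal on another.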

Single precision build failing 'make check'

Hi,

I've been attempting to compile release 3.1 on an AMD EPYC 7763, with gcc 9.4.0. The configure line I've been using is:

$AMD_FFTW_SRC/configure --prefix=$AMD_PREFIX --enable-shared --enable-float --enable-sse2 --enable-avx --enable-avx2 --enable-openmp --enable-amd-opt --enable-amd-fast-planner --enable-amd-app-opt

which seems to run fine, as does make in the build directory.

But when I run make check, I get errors like the following:

make[1]: Entering directory `/tmp/build-fftw/amd/fast_planner/single/tests'
make  check-local
make[2]: Entering directory `/tmp/build-fftw/amd/fast_planner/single/tests'
perl -w /tmp/amd/fftw/3.1/amd-fftw-3.1/tests/check.pl  -r -c=30 -v `pwd`/bench
Executing "/tmp/build-fftw/amd/fast_planner/single/tests/bench --verbose=1   --verify 'okd60e10v86' --verify 'ikd60e10v86' --verify 'obr
d12v174' --verify 'ibrd12v174' --verify 'ofrd12v174' --verify 'ifrd12v174' --verify '//obcd12v174' --verify '//ibcd12v174' --verify '//o
fcd12v174' --verify '//ifcd12v174' --verify 'obcd12v174' --verify 'ibcd12v174' --verify 'ofcd12v174' --verify 'ifcd12v174' --verify 'ok1
08o11*92' --verify 'ik108o11*92' --verify '//obrd6x13x6x4' --verify '//ofrd6x13x6x4' --verify 'obrd6x13x6x4' --verify 'ibrd6x13x6x4' --v
erify 'ofrd6x13x6x4' --verify 'ifrd6x13x6x4' --verify '//obcd6x13x6x4' --verify '//ibcd6x13x6x4' --verify '//ofcd6x13x6x4' --verify '//i
fcd6x13x6x4' --verify 'obcd6x13x6x4' --verify 'ibcd6x13x6x4' --verify 'ofcd6x13x6x4' --verify 'ifcd6x13x6x4' --verify 'okd11o00x2o00x5o1
0*2' --verify 'ikd11o00x2o00x5o10*2' --verify 'obr8x7x11x3*7' --verify 'ibr8x7x11x3*7' --verify 'ofr8x7x11x3*7' --verify 'ifr8x7x11x3*7'
 --verify '//obc8x7x11x3*7' --verify '//ibc8x7x11x3*7' --verify '//ofc8x7x11x3*7' --verify '//ifc8x7x11x3*7' --verify 'obc8x7x11x3*7' --
verify 'ibc8x7x11x3*7' --verify 'ofc8x7x11x3*7' --verify 'ifc8x7x11x3*7'"
apiplan: UNSUPPORTED problem type/kind [2]
No can_do for okd60e10v86
bench: /tmp/amd/fftw/3.1/amd-fftw-3.1/libbench2/verify.c:51: assertion failed: 0
FAILED /tmp/build-fftw/amd/fast_planner/single/tests/bench:  --verify 'okd60e10v86' --verify 'ikd60e10v86' --verify 'obrd12v174' --verif
y 'ibrd12v174' --verify 'ofrd12v174' --verify 'ifrd12v174' --verify '//obcd12v174' --verify '//ibcd12v174' --verify '//ofcd12v174' --ver
ify '//ifcd12v174' --verify 'obcd12v174' --verify 'ibcd12v174' --verify 'ofcd12v174' --verify 'ifcd12v174' --verify 'ok108o11*92' --veri
fy 'ik108o11*92' --verify '//obrd6x13x6x4' --verify '//ofrd6x13x6x4' --verify 'obrd6x13x6x4' --verify 'ibrd6x13x6x4' --verify 'ofrd6x13x
6x4' --verify 'ifrd6x13x6x4' --verify '//obcd6x13x6x4' --verify '//ibcd6x13x6x4' --verify '//ofcd6x13x6x4' --verify '//ifcd6x13x6x4' --v
erify 'obcd6x13x6x4' --verify 'ibcd6x13x6x4' --verify 'ofcd6x13x6x4' --verify 'ifcd6x13x6x4' --verify 'okd11o00x2o00x5o10*2' --verify 'i
kd11o00x2o00x5o10*2' --verify 'obr8x7x11x3*7' --verify 'ibr8x7x11x3*7' --verify 'ofr8x7x11x3*7' --verify 'ifr8x7x11x3*7' --verify '//obc
8x7x11x3*7' --verify '//ibc8x7x11x3*7' --verify '//ofc8x7x11x3*7' --verify '//ifc8x7x11x3*7' --verify 'obc8x7x11x3*7' --verify 'ibc8x7x1
1x3*7' --verify 'ofc8x7x11x3*7' --verify 'ifc8x7x11x3*7'
make[2]: *** [check-local] Error 1
make[2]: Leaving directory `/tmp/build-fftw/amd/fast_planner/single/tests'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory `/tmp/build-fftw/amd/fast_planner/single/tests'
make: *** [check-recursive] Error 1

Here it's failing on an r2r problem, which I would not have expected make check to have tried, given that --enable-amd-app-opt expressly does not support that. To work around this, I manually commented out this line in tests/check.pl (I don't think I should have had to do that!).

But now make check fails in a different way:

Making check in tests
make[1]: Entering directory `/tmp/build-fftw/amd/fast_planner/single/tests'
make  check-local
make[2]: Entering directory `/tmp/build-fftw/amd/fast_planner/single/tests'
perl -w /tmp/amd/fftw/3.1/amd-fftw-3.1/tests/check.pl  -r -c=30 -v `pwd`/bench
Executing "/tmp/build-fftw/amd/fast_planner/single/tests/bench --verbose=1   --verify 'ofcd3x12' --verify 'ifcd3x12' --verify '//obr9x8x
7' --verify '//ofr9x8x7' --verify 'obr9x8x7' --verify 'ibr9x8x7' --verify 'ofr9x8x7' --verify 'ifr9x8x7' --verify '//obc9x8x7' --verify 
'//ibc9x8x7' --verify '//ofc9x8x7' --verify '//ifc9x8x7' --verify 'obc9x8x7' --verify 'ibc9x8x7' --verify 'ofc9x8x7' --verify 'ifc9x8x7'
 --verify '//obr1680' --verify '//ibr1680' --verify '//ofr1680' --verify '//ifr1680' --verify 'obr1680' --verify 'ibr1680' --verify 'ofr
1680' --verify 'ifr1680' --verify '//obc1680' --verify '//ibc1680' --verify '//ofc1680' --verify '//ifc1680' --verify 'obc1680' --verify
 'ibc1680' --verify 'ofc1680' --verify 'ifc1680' --verify 'obr10x21*2' --verify 'ibr10x21*2' --verify 'ofr10x21*2' --verify 'ifr10x21*2'
 --verify '//obc10x21*2' --verify '//ibc10x21*2' --verify '//ofc10x21*2' --verify '//ifc10x21*2' --verify 'obc10x21*2' --verify 'ibc10x2
1*2' --verify 'ofc10x21*2' --verify 'ifc10x21*2'"
ofcd3x12 2.29443e-07 2.38419e-07 1.99951e-07
ifcd3x12 1.32676e-07 2.38419e-07 1.92132e-07
//obr9x8x7 1.97225e-07 3.3984e-07 2.25098e-07
//ofr9x8x7 2.03782e-07 3.3984e-07 1.89523e-07
obr9x8x7 1.93463e-07 4.248e-07 2.22474e-07
ibr9x8x7 1.73335e-07 3.3984e-07 2.25008e-07
ofr9x8x7 1.79529e-07 2.5488e-07 2.31433e-07
ifr9x8x7 1.8554e-07 2.5488e-07 1.96309e-07
//obc9x8x7 1.95973e-07 3.3984e-07 2.07795e-07
//ibc9x8x7 2.1938e-07 4.248e-07 1.9454e-07
//ofc9x8x7 2.18904e-07 4.248e-07 2.19664e-07
//ifc9x8x7 1.89425e-07 3.3984e-07 1.92071e-07
obc9x8x7 2.02438e-07 3.3984e-07 2.23317e-07
ibc9x8x7 1.81485e-07 3.3984e-07 2.04248e-07
ofc9x8x7 2.11242e-07 3.3984e-07 2.03609e-07
ifc9x8x7 2.54701e-07 4.248e-07 2.37408e-07
Found relative error 6.366191e-01 (impulse 1)
       0  20.493900299072   0.000000000000    20.493902206421   0.000000000000
       1 -13.046808242798   0.000000000000     0.000000000000   0.000000000000
       2  -0.000000057171   0.000000000000     0.000000000000   0.000000000000
<truncated>

This is quite concerning; a relative error of 64%. Do you know what might be happening here?

fftw mpi does not compile

amd-fftw: 5a64feb
MPI: OpenMPI v5.0.3

$ ./configure --enable-mpi
$ make -C mpi
make: Entering directory './mpi'
make  all-am
make[1]: Entering directory './mpi'
/bin/sh ../libtool  --tag=CC   --mode=compile mpicc -DHAVE_CONFIG_H -I. -I..  -I .. -I ../api   -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns -MT transpose-pairwise-omc.lo -MD -MP -MF .deps/transpose-pairwise-omc.Tpo -c -o transpose-pairwise-omc.lo transpose-pairwise-omc.c
libtool: compile:  mpicc -DHAVE_CONFIG_H -I. -I.. -I .. -I ../api -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns -MT transpose-pairwise-omc.lo -MD -MP -MF .deps/transpose-pairwise-omc.Tpo -c transpose-pairwise-omc.c  -fPIC -DPIC -o .libs/transpose-pairwise-omc.o
transpose-pairwise-omc.c: In function 'transpose_chunks':
transpose-pairwise-omc.c:108:115: error: passing argument 7 of 'MPI_Isend' from incompatible pointer type [-Wincompatible-pointer-types]
  108 |                    MPI_Isend(buf[j&0x1], (int) (sbs[pe]), FFTW_MPI_TYPE, pe, (my_pe * n_pes + pe) & 0xffff, comm, &send_status);
      |                                                                                                                   ^~~~~~~~~~~~
      |                                                                                                                   |
      |                                                                                                                   MPI_Status * {aka struct ompi_status_public_t *}
In file included from ifftw-mpi.h:28,
                 from mpi-transpose.h:22,
                 from transpose-pairwise-omc.c:32:
/usr/include/mpi.h:1783:67: note: expected 'struct ompi_request_t **' but argument is of type 'MPI_Status *' {aka 'struct ompi_status_public_t *'}
 1783 |                              int tag, MPI_Comm comm, MPI_Request *request);
      |                                                      ~~~~~~~~~~~~~^~~~~~~
transpose-pairwise-omc.c:109:116: error: passing argument 7 of 'MPI_Irecv' from incompatible pointer type [-Wincompatible-pointer-types]
  109 |                    MPI_Irecv(O + rbo[pe], (int) (rbs[pe]), FFTW_MPI_TYPE, pe, (pe * n_pes + my_pe) & 0xffff, comm, &recv_status);
      |                                                                                                                    ^~~~~~~~~~~~
      |                                                                                                                    |
      |                                                                                                                    MPI_Status * {aka struct ompi_status_public_t *}
/usr/include/mpi.h:1779:67: note: expected 'struct ompi_request_t **' but argument is of type 'MPI_Status *' {aka 'struct ompi_status_public_t *'}
 1779 |                              int tag, MPI_Comm comm, MPI_Request *request);
      |                                                      ~~~~~~~~~~~~~^~~~~~~
transpose-pairwise-omc.c:113:29: error: passing argument 1 of 'MPI_Wait' from incompatible pointer type [-Wincompatible-pointer-types]
  113 |                    MPI_Wait(&send_status, MPI_STATUS_IGNORE);
      |                             ^~~~~~~~~~~~
      |                             |
      |                             MPI_Status * {aka struct ompi_status_public_t *}
/usr/include/mpi.h:2099:42: note: expected 'struct ompi_request_t **' but argument is of type 'MPI_Status *' {aka 'struct ompi_status_public_t *'}
 2099 | OMPI_DECLSPEC  int MPI_Wait(MPI_Request *request, MPI_Status *status);
      |                             ~~~~~~~~~~~~~^~~~~~~
transpose-pairwise-omc.c:114:29: error: passing argument 1 of 'MPI_Wait' from incompatible pointer type [-Wincompatible-pointer-types]
  114 |                    MPI_Wait(&recv_status, MPI_STATUS_IGNORE);
      |                             ^~~~~~~~~~~~
      |                             |
      |                             MPI_Status * {aka struct ompi_status_public_t *}
/usr/include/mpi.h:2099:42: note: expected 'struct ompi_request_t **' but argument is of type 'MPI_Status *' {aka 'struct ompi_status_public_t *'}
 2099 | OMPI_DECLSPEC  int MPI_Wait(MPI_Request *request, MPI_Status *status);
      |                             ~~~~~~~~~~~~~^~~~~~~
transpose-pairwise-omc.c: In function 'apply':
transpose-pairwise-omc.c:177:36: warning: passing argument 11 of 'transpose_chunks' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
  177 |                ego->comm, O, I, ego->send_block_bufs);
      |                                 ~~~^~~~~~~~~~~~~~~~~
transpose-pairwise-omc.c:56:34: note: expected 'R **' {aka 'double **'} but argument is of type 'R * const*' {aka 'double * const*'}
   56 |                  R *I, R *O, R **bufs)
      |                              ~~~~^~~~
transpose-pairwise-omc.c:184:36: warning: passing argument 11 of 'transpose_chunks' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
  184 |                ego->comm, I, O, ego->send_block_bufs);
      |                                 ~~~^~~~~~~~~~~~~~~~~
transpose-pairwise-omc.c:56:34: note: expected 'R **' {aka 'double **'} but argument is of type 'R * const*' {aka 'double * const*'}
   56 |                  R *I, R *O, R **bufs)
      |                              ~~~~^~~~
transpose-pairwise-omc.c:193:36: warning: passing argument 11 of 'transpose_chunks' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
  193 |                ego->comm, I, I, ego->send_block_bufs);
      |                                 ~~~^~~~~~~~~~~~~~~~~
transpose-pairwise-omc.c:56:34: note: expected 'R **' {aka 'double **'} but argument is of type 'R * const*' {aka 'double * const*'}
   56 |                  R *I, R *O, R **bufs)
      |                              ~~~~^~~~
make[1]: *** [Makefile:576: transpose-pairwise-omc.lo] Error 1
make[1]: Leaving directory './mpi'
make: *** [Makefile:432: all] Error 2
make: Leaving directory './mpi'

As the errors show, MPI_Isend, MPI_Irecv, and MPI_Wait require an MPI_Request, but send_status and recv_status are passed as MPI_Status.
See MPI 4.1 Standard §3.7.2 Communication Initiation.

How to get best performance installation on AMD EPYC 7773X 64-Core Processor?

I am using spack, following the instructions at https://developer.amd.com/spack/amd-optimized-cpu-libraries/.
However, I don't know what "threads" refers to.

Trying to mkdir -p /include

The build goes fine through the "Making install in api" section, and then this happens:
/usr/bin/mkdir -p '/include'
/usr/bin/mkdir: cannot create directory ‘/include’: Permission denied.
Is there an unwanted leading "/"?

My options are:
./configure --prefix=${InstallRoot} --enable-sse2 --enable-avx --enable-avx2 --enable-mpi --enable-openmp --enable-shared --enable-amd-opt --enable-amd-mpifft
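A possible explanation, assuming the InstallRoot shell variable was unset or empty when configure ran: --prefix then expands to an empty string, and the install step's "prefix/include" becomes "/include". The sketch below reproduces that expansion and shows a guard that stops early. The helper name include_dir_for_prefix and the /opt/aocl-fftw path are hypothetical.

```shell
#!/bin/sh
# Mimic how the install step derives the header directory from --prefix.
include_dir_for_prefix() {
    printf '%s/include\n' "$1"
}

# With the variable unset, the prefix is empty and the result is /include.
unset InstallRoot
include_dir_for_prefix "${InstallRoot:-}"

# Hypothetical install location.
InstallRoot="/opt/aocl-fftw"
# Abort with a clear message if the variable is unset or empty.
: "${InstallRoot:?InstallRoot must name a non-empty install directory}"
include_dir_for_prefix "${InstallRoot}"
```

Placing the guard line before the configure call makes the shell fail immediately when the variable is missing, so the error surfaces before the install step tries to write to the filesystem root.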

long double and quad precision tests with --enable-amd-opt are failing

The tests pass without --enable-amd-opt for all precisions, and with --enable-amd-opt for float and double, but they fail with --enable-amd-opt for long double and quad precision. Are these precisions not supported?

(This is on a server with AMD EPYC 7601 CPUs and using Easybuild's FFTW easyblock, https://github.com/easybuilders/easybuild-easyblocks/blob/master/easybuild/easyblocks/f/fftw.py, on the sources from https://github.com/amd/amd-fftw/archive/2.0.tar.gz)

fork of fftw/fftw3

Why isn't this a fork of https://github.com/FFTW/fftw3?

It seems to me that this is simply built on top of fftw3, so why not make it a fork so that upstream changes are easy to track?

Could you clarify the intent of this code?
That is, will you keep this code separate from the original FFTW, or will you try to contribute your developments back to FFTW?