
Comments (10)

tskisner commented on September 14, 2024

I thought this problem looked suspiciously familiar... I actually mostly fixed this in an earlier commit:

30c24f9

But that commit was accidentally made directly to master as part of other changes and then reverted. The fix / cleanup was never migrated to a branch for proper testing and merging. I'll cherry-pick this into a branch now for testing.

from toast.

tskisner commented on September 14, 2024

With the latest master branch (after merging the reworking of MKL FFTs in ee43a97), all C++ and python unit tests pass on a workstation using Intel compilers and MKL version 2018-beta. On edison.nersc.gov, using gcc and MKL, the following behavior is seen:

  1. C++ "single" (one FFT) roundtrip passes.
  2. C++ "multi" (batch of FFTs) roundtrip fails.
  3. Python unit test of wrapper around C++ FFT roundtrip fails. This uses a single FFT, but uses a persistent cache of FFT plans.
  4. Noise simulation unit test "fails", but this may or may not be related, since the pass/fail threshold is somewhat arbitrary.

Leaving this issue open to track why this combination (gcc + MKL on edison) fails while a pure Intel build does not.


tskisner commented on September 14, 2024

Testing this on cori.nersc.gov (KNL) with Intel compilers and MKL also passes all tests. So this seems to be an interaction between MKL and gcc-6.3 on edison. I will test that same combination on a workstation to determine if it is NERSC-specific.


tskisner commented on September 14, 2024

Ok, confirmed that I see the same failures on a linux workstation with latest MKL and gcc-6.3. That should make it much easier to debug.


tskisner commented on September 14, 2024

On deeper inspection, this is actually a problem with the persistent cache of FFT plans. In older versions of toast, independent OpenMP threads processed pieces of the data, so we needed a different FFT plan for each thread. In the current version of toast, we made a design decision to keep threading at a lower level, so that mid-level operations could be done in Python. Rather than individual threads doing separate FFTs, we now rely on the FFT libraries being smart about threading (which for 1D FFTs is not great, but...). The per-process cache of FFT plans should therefore hold only one copy of each plan rather than one per thread. I now have unit tests that use the plan cache and fail as they should. Working on this fix.


tskisner commented on September 14, 2024

OK, I found the real source of the problem. We always link to the single-library version of MKL (libmkl_rt). However, as documented here:

https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/657060

This defaults to using the Intel OpenMP library rather than the GNU one. This means that when compiling with gcc and linking to MKL, operations with OMP_NUM_THREADS > 1 will give the wrong answer.

The solution is to explicitly link to the full set of individual MKL libraries, and to use the correct set depending on whether one is compiling with the Intel compilers or gcc.
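As an illustration of the two link lines (library names follow the usual MKL link-line-advisor conventions for 64-bit Linux with dynamic linking; the exact set can vary by MKL version):

```shell
# gcc + GNU OpenMP: use the GNU threading layer
g++ -fopenmp ... -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

# Intel compilers + Intel OpenMP: use the Intel threading layer
icpc -qopenmp ... -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
```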


tskisner commented on September 14, 2024

I explored this in more detail, and implemented a test during configure that links to MKL differently depending on the compiler used. This works when linking C++ code (like the toast_test executable used to run the compiled unit tests).

However, when we dlopen (through ctypes) the libtoast.so library, it tries to dlopen its chain of dependency libraries and reports that libmkl_gnu_thread has undefined symbols (it does; they are defined in libmkl_core). This seems to leave three choices:

  1. Figure out why loading libtoast does not load dependencies (I tried extra linking options like -Wl,--no-undefined when building libtoast) or does not load them in the right order.

  2. Always link to the static MKL libraries, which will then be built in to libtoast.so. This "probably" won't cause problems, since libtoast is the top-level object (we are not linking libtoast to other code which might already have MKL symbols in it).

  3. Go back to linking to libmkl_rt, and then include a snippet of code (generated at configure time) which is called at library initialization and calls the right MKL functions that set the threading layer to whatever compiler we used.

Static linking in option 2 might introduce problems with cross compilation. Also, trying to get the linking of libtoast "just right" so that option 1 works could be very machine / compiler / MKL-version specific. Instead, I think option 3 is the way to go. There would be a toast function, toast::mkl_init(), which would:

  1. Be a no-op if we don't have MKL or if the function has already been called.

  2. Call the mkl functions to set the interface and threading layers based on whether we are using the Intel or GNU compilers. We don't currently use MKL on OS X with Clang, so we don't need to consider that. On OS X we should probably use the accelerate framework instead anyway.

  3. Be called internally in toast before making MKL calls (this currently happens in the FFT and special function source files).


tskisner commented on September 14, 2024

The discussion so far in this issue holds true and makes sense for C++ code linking to MKL. If you use gnu compilers and -fopenmp, but then want to link to threaded MKL, you must ensure that the gnu threading interface is used (by calling mkl_set_threading_layer() or setting the proper environment variable).
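For the environment-variable route, libmkl_rt honors MKL_THREADING_LAYER (read at the first MKL call), so a gcc-built binary can be run like this without any code changes:

```shell
# Tell libmkl_rt to use the GNU OpenMP runtime before any MKL call is made.
export MKL_THREADING_LAYER=GNU
OMP_NUM_THREADS=4 ./toast_test
```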

Things become more difficult when throwing in Anaconda Python. Anaconda (and the Intel Distribution for Python) ship their own libmkl_rt with the Intel thread interface (no libmkl_gnu_thread.so). Since Python / NumPy does not use OpenMP, this is fine: the Intel thread library is always used.

In toast, we can have the situation where a threaded C++ library built with the GNU compilers links to threaded MKL. We can make these play together using the techniques above. However, we are dlopen'ing that library from Python (using ctypes). When this happens, calls from libtoast.so to MKL somehow end up resolving to symbols in the libmkl_rt shipped with Anaconda.

For compatibility, I believe we will either have to statically link to MKL when building toast, OR simply not allow use of MKL with non-Intel compilers. In the latter case, it is troubling that our runtime calls would be using the MKL shipped with python packages rather than the (potentially newer) MKL directly from Intel at compile time. For example, this means that the C++ unit tests run from the toast_test executable and the same tests run from within the python toast.test() function would have different code paths (through different MKL libraries). This point is one that pushes us towards static linking of MKL at compile time.


tskisner commented on September 14, 2024

I have looked into the static linking option, and even after jumping through hoops with libtool it does not really work. I think the only sane path here is to use MKL only when we are also using the Intel compiler. Again, the solution for pure C++ code is straightforward (use libmkl_rt with the threading layer set to GNU or Intel, or link explicitly to the list of individual MKL libraries).

When mixing threaded C++ code that uses MKL with a Python process that already has a different MKL library loaded, the C++ code must be compatible with the MKL used by Python. Which means it must use Intel threads. Which means it must be built with the Intel compiler.


tskisner commented on September 14, 2024

Closed by #114

