libxsmm / libxsmm Goto Github PK

Library for specialized dense and sparse matrix operations, and deep learning primitives.

Home Page: https://libxsmm.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Shell 1.45% C++ 1.73% C 92.64% Python 1.00% Makefile 1.93% Batchfile 0.04% Fortran 1.14% HTML 0.01% JavaScript 0.01% CSS 0.01% Starlark 0.01% CMake 0.03%

jit simd avx512 machine-learning sparse blas matrix-multiplication transpose bfloat16 avx2

libxsmm's Introduction

LIBXSMM

LIBXSMM is a library for specialized dense and sparse matrix operations as well as for deep learning primitives such as small convolutions. The library is targeting Intel Architecture with Intel SSE, Intel AVX, Intel AVX2, Intel AVX‑512 (with VNNI and Bfloat16), and Intel AMX (Advanced Matrix Extensions) supported by future Intel processor code-named Sapphire Rapids. Code generation is mainly based on Just‑In‑Time (JIT) code specialization for compiler-independent performance (matrix multiplications, matrix transpose/copy, sparse functionality, and deep learning). LIBXSMM is suitable for "build once and deploy everywhere", i.e., no special target flags are needed to exploit the available performance. Supported GEMM datatypes are: FP64, FP32, bfloat16, int16, and int8.

For a list questions and answers, please also have a look at https://github.com/libxsmm/libxsmm/wiki/Q&A.

Where to go for documentation?

ReadtheDocs: main and sample documentation with full text search.
PDF: main documentation file, and separate sample documentation.
Articles: magazine article incl. sample code (full list of Articles).

Getting Started: The following C++ code is focused on a specific functionality but may be considered as Hello LIBXSMM. Build the example with cd /path/to/libxsmm; make STATIC=0 (shared library), save the code under hello.cpp (below) and compile with g++ -I/path/to/libxsmm/include hello.cpp -L/path/to/libxsmm/lib -lxsmm -lblas -o hello (GNU CCC), and finally execute with LD_LIBRARY_PATH=/path/to/libxsmm/lib LIBXSMM_VERBOSE=2 ./hello.

#include <libxsmm.h>
#include <vector>
int main(int argc, char* argv[]) {
  typedef double T;
  int batchsize = 1000, m = 13, n = 5, k = 7;
  std::vector<T> a(batchsize * m * k), b(batchsize * k * n), c(m * n, 0);
  /* C/C++ and Fortran interfaces are available */
  typedef libxsmm_mmfunction<T> kernel_type;
  /* generates and dispatches a matrix multiplication kernel (C++ functor) */
  kernel_type kernel(LIBXSMM_GEMM_FLAG_NONE, m, n, k, 1.0 /*alpha*/, 1.0 /*beta*/);
  assert(kernel);
  for (int i = 0; i < batchsize; ++i) { /* initialize input */
    for (int ki = 0; ki < k; ++ki) {
      for (int j = 0; j < m; ++j) a[i * j * ki] = static_cast<T>(1) / ((i + j + ki) % 25);
      for (int j = 0; j < n; ++j) b[i * j * ki] = static_cast<T>(7) / ((i + j + ki) % 75);
    }
  }
  /* kernel multiplies and accumulates matrices: C += Ai * Bi */
  for (int i = 0; i < batchsize; ++i) kernel(&a[i * m * k], &b[i * k * n], &c[0]);
}

Plain C code as well as Fortran code resemble the same example.

What is a small matrix multiplication? When characterizing the problem-size by using the M, N, and K parameters, a problem-size suitable for LIBXSMM falls approximately within (M N K)^1/3 <= 64 (which illustrates that non-square matrices or even "tall and skinny" shapes are covered as well). The library is typically used to generate code up to the specified threshold. Raising the threshold may not only generate excessive amounts of code (due to unrolling in M or K dimension), but also miss to implement a tiling scheme to effectively utilize the cache hierarchy. For auto-dispatched problem-sizes above the configurable threshold (explicitly JIT'ted code is not subject to the threshold), LIBXSMM is falling back to BLAS. In terms of GEMM, the supported kernels are limited to Alpha := 1, Beta := { 1, 0 }, and TransA := 'N'.

What is a small convolution? In the last years, new workloads such as deep learning and more specifically convolutional neural networks (CNN) emerged and are pushing the limits of today's hardware. One of the expensive kernels is a small convolution with certain kernel sizes such that calculations in the frequency space is not the most efficient method when compared with direct convolutions. LIBXSMM's current support for convolutions aims for an easy-to-use invocation of small (direct) convolutions, which are intended for CNN training and classification.

Interfaces and Domains

Overview

Please have a look at https://github.com/libxsmm/libxsmm/tree/main/include for all published functions. Get started with the following list of available domains and documented functionality:

MM: Matrix Multiplication
TPP: Tensor Processing Primitives
DNN: Deep Neural Networks
AUX: Service Functions
PERF: Performance
BE: Backend

To initialize library internal resources, an explicit initialization routine helps to avoid lazy initialization overhead when calling LIBXSMM for the first time. The library deallocates internal resources at program exit, but also provides a companion of the afore mentioned initialization (finalize).

/** Initialize the library; pay for setup cost at a specific point. */
void libxsmm_init(void);
/** De-initialize the library and free internal memory (optional). */
void libxsmm_finalize(void);

Matrix Multiplication

This domain (MM) supports Small Matrix Multiplications (SMM), batches of multiple multiplications as well as the industry-standard interface for GEneral Matrix Matrix multiplication (GEMM).

The Matrix Multiplication domain (MM) contains routines for:

Deep Learning

The Deep Learning domain is detailed by the following sample codes. Here we demonstrate how common operators in deep learning applications (GEMM with activation function fusion, Convolutions with activation function fusion, various norming operators, and pooling operators, etc.) can be implemented using the Tensor Processing Primitive provided by LIBXSMM. Example drivers for performance evaluation are provided as part of LIBXSMM_DNN.

Service Functions

For convenient operation of the library and to ease integration, some service routines are available. These routines may not belong to the core functionality of LIBXSMM (SMM or DNN domain), but users are encouraged to use this domain (AUX). There are two categories: (1) routines which are available for C and FORTRAN, and (2) routines that are only available per C interface.

The service function domain (AUX) contains routines for:

Backend

More information about the JIT-backend and the code generator can be found in a separate document. The encoder sample collection can help to get started writing a kernel using LIBXSMM. Please note, LIBXSMM's stand-alone generator-driver is considered legacy (deprecated).

Build Instructions

Overview

The main interface file is generated, and it is therefore not stored in the code repository. To inspect the interface for C/C++ and FORTRAN, one can take a look at the template files used to generate the actual interface. There are two general ways to build and use LIBXSMM:

Classic Library (ABI) and Link Instructions (C/C++ and FORTRAN)
Header-Only (C and C++)

Note: LIBXSMM is available as prebuilt package for Fedora/RedHat/CentOS, Debian/Ubuntu, FreeBSD, and others. Further, LIBXSMM can be installed with the Spack Package Manager or per EasyBuild+EasyConfig.

Classic Library (ABI)

There are two ways to rely on prebuilt code for a given project: (1) using LIBXSMM's Makefile based build system, (2) or using another build system and writing own rules for building LIBXSMM. The Makefile based build system relies on GNU Make (typically associated with the make command, but e.g. FreeBSD is calling it gmake). The build can be customized by using key‑value pairs. Key‑value pairs can be supplied in two ways: (1) after the "make" command, or (2) prior to the "make" command (env) which is effectively the same as exporting the key‑value pair as an environment variable (export, or setenv). Both methods can be mixed (the second method may require make's -e flag).

In contrast to header-only which does not require configuration by default, 3rd-party build systems can compile and link LIBXSMM's sources but still avoid configuring the library (per libxsmm_config.py). The prerequisite to omit configuration is to opt-in by defining LIBXSMM_DEFAULT_CONFIG (-D). The zero-config feature is not available for LIBXSMM's Fortran interface.

Note: By default, C/C++ and FORTRAN compilers are needed (some sample code is written in C++). Beside of specifying the compilers (make CXX=g++ CC=gcc FC=gfortran and maybe AR=ar), the need for a FORTRAN compiler can be relaxed (make FC= or make FORTRAN=0). The latter affects the availability of the MODule file and the corresponding libxsmm.f library (the interface libxsmm.f is still generated).

The build system considers a set of given key-value pairs as a single unique build and triggers a rebuild for a distinct set of flags. For more advanced builds or additional background, please consult the section about Customization. To generate the interface of the library inside of the include directory and to build the static library (by default, STATIC=1 is activated). Run any (or both) of the following command(s):

make STATIC=0
make

On CRAY systems, the CRAY Compiling Environment (CCE) should be used regardless of utilizing the CRAY compiler, the Intel Compiler, or the GNU Compiler Collection (GCC). The CCE is eventually suppressing to build shared libraries (STATIC=0). In any case, (1) switch to the desired compiler (module load/switch), and (2) rely on:

make CXX=CC CC=cc FC=ftn

A variety of build environments is out-of-the-box compatible, see https://github.com/libxsmm/libxsmm/wiki/Compatibility. If the build process is not successful, it may help to avoid advanced GCC flags. This is useful with a tool chain, which pretends to be GCC-compatible (and is treated as such) but fails to consume the afore mentioned flags:

make COMPATIBLE=1

In case of outdated Binutils, compilation can fail to assemble code when building the library (this has nothing to do with JIT-generated code and it does not affect how JIT-code is targeting the system). LIBXSMM implements some functionality using compiler-intrinsics and multiple code-paths which are scheduled according to CPUID. In contrast to INTRINSICS=2 (default), INTRINSICS=1 enables a fully static code path according to the desired target. If no target is given (e.g., AVX=3, or AVX=2), instruction set extensions cannot be leveraged for such code-paths. Try to fix failing compilation by building the latest GNU Binutils (and export PATH=/path/to/binutils/bin:${PATH}). Binutils are versioned independently of GNU GCC and other compilers. If one cannot update Binutils, work around with a CPUID-value as tabulated in libxsmm_cpuid.h: start at the upper end (less than 1999) and decrement until compilation passes (make INTRINSICS=CPUID, e.g., make INTRINSICS=1021). As a last resort, rely on a fully static code path:

make INTRINSICS=1

To test and validate a build, please consult https://github.com/libxsmm/libxsmm/wiki/Validation. To run some basic sanity checks, remember that each set of given key-value pairs represents a different build (and test):

make STATIC=0 tests

To remove intermediate files, or to remove all generated files and folders (including the interface and the library archives), run one of the make-targets below. An additional distclean-target recursively cleans the entire tree (after version 1.9).

make clean
make realclean

FORTRAN code can make use of LIBXSMM:

By using the module and linking with libxsmmf, libxsmm, and libxsmmext,
By including libxsmm.f and linking with libxsmm, and libxsmmext, or
By (implicitly) calling a SUBROUTINE and linking with libxsmm, and libxsmmext.

Note: libxsmmf requires libxsmmext (starting with LIBXSMM 2.0), and thereby requires to link with the OpenMP runtime as well.

Using the Fortran module (or including the interface), requires at least a Fortran 2003 compiler (F2K3). FORTRAN 77 compatibility is only implicitly available (no interface), and the available subset of routines is documented in libxsmm.f and marked with comments (part of the implementation).

Header-Only

Version 1.4.4 introduced support for "header-only" usage in C and C++. By only including libxsmm_source.h allows to get around building the library. However, this gives up on a clearly defined application binary interface (ABI). An ABI may allow for hot-fixes after deploying an application (when relying on the shared library form), and it may also ensure to only rely on the public interface of LIBXSMM. In contrast, the header-only form not only exposes the internal implementation of LIBXSMM but can also increase the turnaround time during development of an application (due to longer compilation times). The header file is intentionally named "libxsmm_source.h" since this header file relies on the src directory (with the implications as noted earlier).

The header-only form depends on libxsmm_source.h which is generated according to the content of the source folder (src). LIBXSMM 1.16 (and later) provides header-only support without invoking a make-target (zero configuration) for any given checkout of LIBXSMM. To use configured header-only (non-default), LIBXSMM_CONFIGURED must be defined (-D). Previously, it was necessary to invoke make header-only (v1.6.2 or later), make cheader (prior to v1.6.2), or any target building the library (make). The zero-config feature allows 3rd-party build systems an easier integration of LIBXSMM, which also holds true if the system builds LIBXSMM from source (see classic ABI). Fortran code may include libxsmm.f but still requires that interface to be generated.

Note: building an application applies the same build settings to LIBXSMM! For instance, to omit debug code inside of LIBXSMM NDEBUG must be defined (-DNDEBUG).

Rules for building LIBXSMM

LIBXSMM can be used as header-only library, i.e., no source code must be (pre-)built. However, it can be desirable to build LIBXSMM as an intermediate library using a custom setup or build system. The latter can still implement custom build rules to configure LIBXSMM's interface before building the code. More likely, building LIBXSMM from source in a custom fashion can still be omitting to configure the interface and rely on "(zero-config)[#zero-config-abi]", i.e., defining LIBXSMM_DEFAULT_CONFIG (-DLIBXSMM_DEFAULT_CONFIG). For example, a CMake module for LIBXSMM can look like:

include(FetchContent)
FetchContent_Declare(
  xsmm
  URL https://github.com/chelini/libxsmm/archive/<your-preferred-revision>.tar.gz
  URL_HASH SHA256=<sha256sum-corresponding-to-above-revision>
)
FetchContent_GetProperties(xsmm)
if(NOT xsmm_POPULATED)
  FetchContent_Populate(xsmm)
endif()

set(LIBXSMMROOT ${xsmm_SOURCE_DIR})
file(GLOB _GLOB_XSMM_SRCS LIST_DIRECTORIES false CONFIGURE_DEPENDS ${LIBXSMMROOT}/src/*.c)
list(REMOVE_ITEM _GLOB_XSMM_SRCS ${LIBXSMMROOT}/src/libxsmm_generator_gemm_driver.c)
set(XSMM_INCLUDE_DIRS ${LIBXSMMROOT}/include)

add_library(xsmm STATIC ${_GLOB_XSMM_SRCS})
target_include_directories(xsmm PUBLIC ${XSMM_INCLUDE_DIRS})
target_compile_definitions(xsmm PUBLIC
  LIBXSMM_DEFAULT_CONFIG
)
target_compile_definitions(xsmm PRIVATE
  __BLAS=0
)

Above, LIBXSMM_DEFAULT_CONFIG is propagated to dependent code (PUBLIC) and further, LIBXSMM is configured to not require a LAPACK/BLAS library/fallback (-D__BLAS=0).

Link Instructions

Using the classic ABI (including Fortran code), requires linking LIBXSMM against the application. The library is agnostic with respect to the threading-runtime, and therefore an application is free to use any threading runtime (e.g., OpenMP). The library is also thread-safe, and multiple application threads can call LIBXSMM's routines concurrently. Enabling OpenMP for LIBXSMM's main library is supported as well (OMP=1), and mostly affects the synchronization primitives used inside of the library. All the "omp" functionality (function postfix) is served by the libxsmmext library, which is automatically built with OpenMP enabled. When using this "omp" functionality, libxsmmext needs to be present at the link line.

Library	Purpose
libxsmm	Thread-safe core functions (same routine can be called concurrently). Contains routines that can take a thread-ID and the number of library-external threads.
libxsmmf	Necessary when using the Fortran MODule but not when including `libxsmm.f` or relying on implicit interfaces (Fortran 77).
libxsmmext	Provides library-internal OpenMP-threaded functions carrying the `omp` postfix when compared to function name names of the core library.
libxsmmnoblas	Supplies faked symbols for `dgemm` (and others) and thereby removes the need to link against a LAPACK/BLAS library.

To ease linking with LIBXSMM, pkg-config can be used. For example:

export PKG_CONFIG_PATH=/path/to/libxsmm/lib
pkg-config libxsmm --libs

Similarly, an application is free to choose any BLAS or LAPACK library (if the link model available on the OS supports this), and therefore linking GEMM routines when linking LIBXSMM itself (by supplying BLAS=1|2) may prevent a user from making this decision at the time of linking the actual application. To use LIBXSMM without GEMM-related functionality, any BLAS-dependency can be removed in two ways: (1) building a special library with make BLAS=0, or (2) linking the application against the libxsmmnoblas library. If an application however uses BLAS already, the Call Wrapper can be used to intercept existing BLAS calls (and to rely on LIBXSMM instead).

Note: LIBXSMM does not support to dynamically link libxsmm or libxsmmext ("so") when BLAS is linked statically ("a"). If BLAS is linked statically, the static version of LIBXSMM must be used!

Installation

There are two main mechanisms to install LIBXSMM (both mechanisms can be combined): (1) building the library in an out‑of‑tree fashion, and (2) installing into a certain location. Building in an out‑of‑tree fashion looks like:

cd libxsmm-install
make -f /path/to/libxsmm/Makefile

Installation into a specific location looks like (PREFIX or DESTDIR):

make MNK="1 2 3 4 5" PREFIX=/path/to/libxsmm-install install

Both PREFIX and DESTDIR are equivalent and can be relative or absolute paths. An installation can be repeated for different locations without triggering a rebuild. The prefix directory inside of each of the package configuration files is set to where LIBXSMM is built (staging folder) unless PREFIX or DESTDIR is specified. The effect of PREFIX (or DESTDIR) with respect to the pkg-config files is independent of whether the install-target is invoked or not (make).

Further, performing make install-minimal omits the documentation (default: PREFIX/share/libxsmm). Moreover, PINCDIR, POUTDIR, PBINDIR, and PDOCDIR allow to customize the locations underneath of the PREFIX location. To build a general package for an unpredictable audience (Linux distribution, or similar), it is advised to not over-specify or customize the build step, i.e., JIT, SSE, AVX, OMP, BLAS, etc. should not be used. The following is building and installing a complete set of libraries where the generated interface matches both the static and the shared libraries:

make PREFIX=/path/to/libxsmm-install STATIC=0 install
make PREFIX=/path/to/libxsmm-install install

Runtime Control

Handling Errors

The library handles errors with mechanisms available to the C programming language (no exceptions). The backend uses result codes passed by an argument rather than an actual return value. Such an argument is often a descriptor (struct) guiding and covering the state of the code generation. The frontend however may not hand-out any error state, which can be a big relief on the call-side. Instead, the frontend implements a verbose mode to inform about unexpected input or an error captured from the backend. Guiding principles of LIBXSMM are muted operation by default (non-verbose) and no unexpected exit from execution.

Verbose Mode

The verbose mode (level of verbosity) allows for an insight into the code dispatch mechanism by receiving a small, tabulated statistic as soon as the library terminates. The design point for this functionality is to not impact the performance of any critical code path, i.e., verbose mode is always enabled and does not require symbols (SYM=1) or debug code (DBG=1). The statistics appears (stderr) when the environment variable LIBXSMM_VERBOSE is set to a non-zero value. For example:

LIBXSMM_VERBOSE=1 ./myapplication
[... application output]

HSW/SP      TRY    JIT    STA    COL
   0..13      0      0      0      0
  14..23      0      0      0      0
 24..128      3      3      0      0

The tables are distinct between single-precision and double-precision, but either table is pruned if all counters are zero. If both tables are pruned, the library shows the code path which would have been used for JIT'ting the code: LIBXSMM_TARGET=hsw (otherwise the code path is shown in the table's header). The actual counters are collected for three buckets: small kernels (MNK^1/3 <= 13), medium-sized kernels (13 < MNK^1/3 <= 23), and larger kernels (23 < MNK^1/3 <= 64; the actual upper bound depends on LIBXSMM_MAX_MNK as selected at compile-time). Keep in mind, that "larger" is supposedly still small in terms of arithmetic intensity (which grows linearly with the kernel size). Unfortunately, the arithmetic intensity depends on the way a kernel is used (which operands are loaded/stored into main memory), and it is not performance-neutral to collect this information.

The TRY counter represents all attempts to register statically generated kernels, and all attempts to dynamically generate and register kernels. The TRY counter includes rejected JIT requests due to unsupported GEMM arguments. The JIT and STA counters distinct the successful cases of the afore mentioned event (TRY) into dynamically (JIT) and statically (STA) generated code. In case the capacity (O(n) = 10⁵) of the code registry is exhausted, no more kernels can be registered although further attempts are not prevented. Registering many kernels (O(n) = 10³) may ramp the number of hash key collisions (COL), which can degrade performance. The latter is prevented if the small thread-local cache is utilized effectively.

Since explicitly JIT-generated code (libxsmm_?mmdispatch) does not fall under the THRESHOLD criterion, the above table is extended by one line if large kernels have been requested. This indicates a missing threshold-criterion (customized dispatch) or asks for cache-blocking the matrix multiplication. Setting a verbosity level of at least two summarizes the number of registered JIT-generated kernels, which includes the total size and counters for GEMM, MCOPY (matrix copy), and TCOPY (matrix transpose) kernels.

Registry: 20 MB (gemm=0 mcopy=14 tcopy=0)

If the call-wrapper is used, an additional runtime statistic becomes available (see Call Wrapper).

Note: Setting LIBXSMM_VERBOSE to a negative value dumps each generated JIT kernel to a file (binary) with each file being named like the function name shown in Intel VTune. Disassembly of the raw binary files can be accomplished by:

objdump -D -b binary -m i386 -M x86-64 [JIT-dump-file]

Call Trace

During the initial steps of employing the LIBXSMM API, one may rely on a debug version of the library (make DBG=1). The latter also implies console output (stderr) in case of an error/warning condition inside of the library. It is also possible to print the execution flow (call trace) inside of LIBXSMM (can be combined with DBG=1 or OPT=0):

make TRACE=1

Building an application which traces calls (inside of the library) requires the shared library of LIBXSMM, alternatively the application is required to link the static library of LIBXSMM in a dynamic fashion (GNU tool chain: -rdynamic). Tracing calls (without debugger) can be then accomplished by an environment variable called LIBXSMM_TRACE.

LIBXSMM_TRACE=1 ./myapplication

Syntactically up to three arguments separated by commas (which allows to omit arguments) are taken (tid,i,n): tid signifies the ID of the thread to be traced with 1...NTHREADS being valid and where LIBXSMM_TRACE=1 is filtering for the "main thread" (in fact the first thread running into the trace facility); grabbing all threads (no filter) can be achieved by supplying a negative id (which is also the default when omitted). The second argument is pruning higher levels of the call-tree with i=1 being the default (level zero is the highest at the same level as the main function). The last argument is taking the number of inclusive call levels with n=-1 being the default (signifying no filter).

Although the ltrace (Linux utility) provides similar insight, the trace facility might be useful due to the afore mentioned filtering expressions. Please note that the trace facility is severely impacting the performance (even with LIBXSMM_TRACE=0), and this is not just because of console output but rather since inlining (internal) functions might be prevented along with additional call overhead on each function entry and exit. Therefore, debug symbols can be also enabled separately (make SYM=1; implied by TRACE=1 or DBG=1) which might be useful when profiling an application.

Verification

This section refers to testing correctness of an application using LIBXSMM utilities, i.e., using libxsmm_matdiff or libxsmm_matdiff_epsilon in particular. The former function (libxsmm_matdiff) compares two matrices (which can degenerate to vector shape), and yields a structure with information about the difference of both matrices (gold vs. test). The latter function (libxsmm_matdiff_epsilon) combines absolute and relative norms (given by afore mentioned structure) and calculates a scalar "epsilon" which can be used to check against a margin.

Using libxsmm_matdiff_epsilon in an application exposes an environment variable LIBXSMM_MATDIFF which can specify a file or directory path (LIBXSMM_MATDIFF=1 simply uses some filename as default). In any case, the application appends one line to the respective file for each call of libxsmm_matdiff_epsilon. A data record consists of the epsilon and the command line used to launch the application. A generated file can be further evaluated, e.g., sort -gk1 libxsmm_matdiff.log | tail -n 10 which yields the largest ten epsilon values discovered along with the application's command line.

The environment variable LIBXSMM_MATDIFF can carry optional space-separated arguments to amend each file entry like export LIBXSMM_MATDIFF="libxsmm_matdiff.log hello world". In sophisticated cases this can be used to amend a value only known at runtime, e.g., the actual margin which is used to judge the epsilon (putenv).

Performance

Profiling an application, which uses LIBXSMM's JIT-code is well-supported. The library supports Intel VTune Amplifier and Linux perf. Details are given on how to include profiler support, and how to run the application.

At build time, a variety of options exist to customize LIBXSMM. The library is setup for a broad range of use cases, which include sophisticated defaults for typical use.

To find performance results of applications or performance reproducers, the repository provides an orphaned branch called "results" which collects collateral material such as measured performance results along with explanatory figures. The results can be found at https://github.com/libxsmm/libxsmm/tree/results#libxsmm-results, or the results can be cloned as shown below.

git clone --branch results \
  https://github.com/libxsmm/libxsmm.git \
  libxsmm-results

Please note that comparing performance results depends on whether the operands of the matrix multiplication are streamed or not. For example, multiplying with all matrices covered by the L1 cache may have an emphasis towards an implementation which perhaps performs worse for the real workload (if this real workload needs to stream some or all matrices from the main memory). Most of the code samples are aimed to reproduce performance results, and it is encouraged to model the exact case or to look at real applications.

Applications

High Performance Computing (HPC)

[1] https://cp2k.org/: Open Source Molecular Dynamics and the DBCSR library, which processes batches of small matrix multiplications. The batches originate from a distributed block-sparse matrix with problem-specific small matrices. Starting with CP2K 3.0, LIBXSMM can substitute CP2K's libsmm library.

[2] https://github.com/SeisSol/SeisSol/: SeisSol is one of the leading codes for earthquake scenarios, for simulating dynamic rupture processes. LIBXSMM provides highly optimized assembly kernels which form the computational back-bone of SeisSol (see https://github.com/TUM-I5/seissol_kernels/.

[3] https://github.com/NekBox/NekBox: NekBox is a highly scalable and portable spectral element code, which is inspired by the Nek5000 code. NekBox is specialized for box geometries and intended to prototype new methods as well as to leverage FORTRAN beyond the FORTRAN 77 standard. LIBXSMM can be used to substitute the MXM_STD code. Please also note LIBXSMM's NekBox reproducer.

[4] https://github.com/Nek5000/Nek5000: Nek5000 is the open-source, highly-scalable, always-portable spectral element code from https://nek5000.mcs.anl.gov/. The development branch of the Nek5000 code incorporates LIBXSMM.

[5] http://pyfr.org/: PyFR is an open-source Python based framework for solving advection-diffusion type problems on streaming architectures by using the flux reconstruction approach. PyFR 1.6.0 optionally incorporates LIBXSMM as a matrix multiplication provider for the OpenMP backend. Please also note LIBXSMM's PyFR-related code sample.

[6] http://dial3343.org/about/: The Extreme-scale Discontinuous Galerkin Environment (EDGE) is a solver for hyperbolic partial differential equations with emphasis on seismic simulations. The EDGE source code optionally relies on LIBXSMM, but for high performance LIBXSMM's kernels are highly recommended.

[7] https://sxs-collaboration.github.io/spectre/: SpECTRE is an open-source code for multi-scale, multi-physics problems in astrophysics and gravitational physics which runs at Petascale and is designed for Exascale computers. In the future, SpECTRE may be applied to problems across discipline boundaries in fluid dynamics, geoscience, plasma physics, nuclear physics, and engineering.

[8] https://ceed.exascaleproject.org/ceed-code/: The Center for Efficient Exascale Discretizations (CEED) is building on the efforts of the Nek5000, MFEM, MAGMA, OCCA and PETSc projects to develop application program interfaces (APIs), both at high-level and at low-level to enable applications to take advantage of high-order methods. The CEED low-level API, libCEED uses LIBXSMM as a backend for high performance on CPUs.

[9] https://github.com/romeric/Fastor: Fastor is a lightweight high performance tensor algebra framework for modern C++ and can optionally use LIBXSMM as JIT-backend.

Machine Learning (ML)

[10] https://github.com/plaidml/plaidml: PlaidML is an open source tensor compiler aiming for performance portability across a wide range of CPUs, GPUs and other accelerators. Combined with Intel’s nGraph compiler, PlaidML is targeting popular deep learning frameworks such as PyTorch, Keras (TensorFlow), and OpenVino. PlaidML/v1 (development branch) adopted MLIR, an extensible compiler infrastructure gaining industry-wide adoption. PlaidML/v1 started using LIBXSMM as backend for targeting CPUs.

[11] https://github.com/intel/intel-extension-for-pytorch: Intel Extension for PyTorch aims for a smooth user experience of PyTorch on CPUs by the means of good performance. The extension pack started to rely on LIBXSMM for achieving high performance on CPUs.

[12] https://github.com/libxsmm/tpp-pytorch-extension: Intel(R) Tensor Processing Primitive Extension for pytorch is an open source software library the integrates Tensor Processing Primitives (TPP) into pytorch. It is aiming for a smooth user experience of PyTorch on CPUs by the means of good performance. Intel's MLPerf Training submission codes leverage this project.

[13] https://github.com/libxsmm/libxsmm-dnn: LIBXSMM-DNN is an open source software library that demonstrates how Tensor Processing Primitives (TPP) can be used to implement various deep learning primitives such as convolutions, linear layers or even pooling and norming. Due to the use of TPP not a single line of platform-specific code is needed.

Automated Driving (AD)

[15] https://software.seek.intel.com/accelerating-eigen-math-library: Accelerating The Eigen Math Library for Automated Driving Workloads: The Need for Speed in Kalman Filtering. An article in Issue 31 of The Parallel Universe magazine (pdf).

References

[1] https://sc19.supercomputing.org/proceedings/tech_poster/tech_poster_pages/rpost244.html: High-Performance Deep Learning via a Single Building Block (poster and abstract), SC’19: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver (Colorado).

[2] https://dl.acm.org/doi/10.1109/SC.2018.00069: Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures (paper). SC'18: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas (Texas).

[3] https://pasc17.pasc-conference.org/fileadmin/user_upload/pasc17/program/post116s2.pdf: DBCSR: A Sparse Matrix Multiplication Library for Electronic Structure Codes (poster), PASC’17: The PASC17 Conference, Lugano (Switzerland).

[4] https://sc17.supercomputing.org/SC17%20Archive/tech_poster/tech_poster_pages/post190.html: Understanding the Performance of Small Convolution Operations for CNN on Intel Architecture (poster and abstract), SC’17: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver (Colorado).

[5] https://www.computer.org/csdl/proceedings-article/sc/2016/8815a981/12OmNCeaQ1D: LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation. SC'16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City (Utah).

[6] http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/tech_poster_pages/post137.html: LIBXSMM: A High Performance Library for Small Matrix Multiplications (poster and abstract). SC'15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Austin (Texas).

[7] Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads SC'21: The International Conference for High Performance Computing, Networking, Storage and Analysis, St Louis.

Articles

[1] https://www.nextplatform.com/2019/10/09/cloudy-supercomputers-join-the-hpc-petascale-club/: Cloudy Supercomputers Join the HPC Petascale Club. An article written by Rob Farber, 2019. The article covers LIBXSMM in a separate section.

[2] https://www.nextplatform.com/2019/06/26/counting-the-cost-of-scaling-hpc-applications/: Counting The Cost Of Scaling HPC Applications. An article written by Timothy Prickett Morgan, 2019. This article is about CP2K Open Source Molecular Dynamics and not about LIBXSMM. However, LIBXSMM was key for application performance.

[3] https://www.nextplatform.com/2019/06/26/counting-the-cost-of-scaling-hpc-applications/: Azure Benchmarks HC-series Across Twenty-thousand Cores for HPC. An article written by John Russell, 2019. This article is about CP2K Open Source Molecular Dynamics and not about LIBXSMM. However, LIBXSMM was key for application performance.

[4] https://software.intel.com/sites/default/files/parallel-universe-issue-34.pdf: LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel (pdf). An article written by Hans Pabst, Greg Henry, and Alexander Heinecke, 2018.

[5] https://medium.com/@rmfarber/libxsmm-brings-deep-learning-lessons-learned-to-many-hpc-applications-9143c6c93125: LIBXSMM Brings Deep-learning "Lessons Learned" to Many HPC Applications. An article written by Rob Farber, 2018.

[6] https://www.rdworldonline.com/largest-supercomputer-simulation-of-sumatra-andaman-earthquake/: Largest Supercomputer Simulation of Sumatra-Andaman Earthquake. An article written by Linda Barney, 2018.

libxsmm's People

Contributors

Stargazers

Watchers

Forkers

maxhutch efeguney xianyi rscohn2 molguin-qc yyzreal liyancas danielpeter oanaoana mdebski cfandy nrsatish algoskynet zhangyangang dislexic yuede jspark1105 benoitsteiner liuguoyou kunalbanerjee jewillco 6676401088 loliod wolf1981 qingsong99 bmcdanie templeblock egeor dmudiger rajbarik alvarovm scottsallinen dnbaker aizatrosli yujunfeng xiaoxuefeng geoffreyqiu hiprince liangfu breuera taihulight xjtuwj narayanan2004 alheinecke xiaocenxiaocen jjykh mtaillefumier zhcui qq332982511 mdave schoenemeyer luke-evans-liu chenzheng1030 mypopydev stoni gregmbi dev-zero shaun95 agostini01 sharkhack awesomemachinelearning gpuworld zdqf nom8393 neveroldmilk sprinterzzj ddkalamk hanzz2007 ugiwgh taozhang8 yang123vc ceseo magastzheng isuruf kobeliu85 legrosbuffle mahudu97 ashokei zoq abhisekkundu-intel paulhjkelly xing-liu crystalbobby fossabot bhaskarnallani zwbjtu123 thebluesmoke firecracker15 fbaru-dev ciyongch vkarihal dpfhty dmitry-gorokhov nazpyro xiaming9880 ranalytica xiangchunyang wuyouqian96169 yushansu liutongxuan

libxsmm's Issues

mmfunction dispatch not working

The libxsmm_mmfunction interface invariably returns 0.

Having built libxsmm like so:

$ make JIT=1 AVX=2 ROW_MAJOR=1

and the attached code like so:

$ make -f Makefile.big xmm-dispatch-bug

Run the example:

$ ./xmm-dispatch-bug 64 240 64 64 240 240 1

Note the assert that fails. The other call to libxsmm seems to succeed.

xmm-dispatch-bug.zip

Make flow is not compatible with python 3.4

Python 3.4 throws some errors when using the current make-flow:
[aheineck@aheineck-linux libxsmm_github]$ make realclean
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
[aheineck@aheineck-linux libxsmm_github]$ make generator
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax
File "/nfs_home/aheineck/Projects/LIBXSMM_workspace/libxsmm_github/scripts/libxsmm_utilities.py", line 133
print " ".join(map(lambda mnk: "".join(map(str, mnk)), dims))
^
SyntaxError: invalid syntax

fortran module breaks under `-r8` or `-fdefault-real-8`

ifort's -r8 and gfortran's -fdefault-real-8 cause LIBXSMM_SINGLE_PRECISION and LIBXSMM_DOUBLE_PRECISION to be the same, causing double implementations of all the calls that differ only in precision. I can think of a few solutions:

Don't change anything; codes shouldn't be using -r8 anyways.
Define LIBXSMM_SINGLE_PRECISION using selected_real_kind
Define LIBXSMM_SINGLE_PRECISION as 4
If BLAS had a true interface, then I'd go with (1), but seeing as LIBXSMM_SINGLE_PRECISION being defined as anything other than 4 would break the underlying SGEMM call, I think (2) and (3) are more flexible for right now. The difference there should be mostly aesthetic. Thoughts?

Handle hash key collisions in the code cache.

This issue is known both in terms of the problem and the solution. It is planned to evict code from the cache in case of a collision in order to avoid the performance overhead of a full collision handling. The latter requires an exact comparison of two descriptors on top of CRC32 based hash key. Moreover, the infrastructure to receive the target/populated descriptor needs to be implemented. Actually evicting the code also requires to properly release/reuse the memory associated with an entry of the cache.

Omit registering SSE code if JIT code can reach higher an ISA level

Omit registering SSE code if JIT code can reach higher an ISA level. This feature allows to statically generate and include SSE3 code into the library but still getting the best ISA level (if the JIT backend is enabled). Please note that the JIT backend does not support non-AVX (SSE3).

link Error (build requirements?)

I'm trying to run libxsmm on a CPU-only (IvyBridge-E) system with a somewhat dated compiler:

maxhutch@edoras:~/src/clean-tests/RTI-LST$ ifort --version
ifort (IFORT) 14.0.1 20131008
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

At link, it gives some warnings about some MIC things and then dies with an opaque Error 100, maybe related to missing x86_64-k1om-linux-ld:

/opt/openmpi-intel/bin/mpif90 -g -check all -debug all -traceback  -o nek5000 -ffpe-trap=invalid,zero,overflow -fsignaling-nans -I/opt/fftw3/include/ -I/home/maxhutch/src/libxsmm/include obj/test.o obj/kinds_mod.o obj/mpif.o obj/fftw3.o obj/size_mod.o obj/speclib.o obj/mesh_mod.o obj/input_mod.o obj/parallel_mod.o obj/fft_fftw_mod.o obj/ctimer_mod.o obj/dealias_mod.o obj/domain_mod.o obj/dxyz_mod.o obj/eigen_mod.o obj/esolv_mod.o obj/fdmh1_mod.o obj/geom_mod.o obj/hsmg_mod.o obj/interp_mod.o obj/ixyz_mod.o obj/mvgeom_mod.o obj/nekuse_mod.o obj/opctr_mod.o obj/restart_mod.o obj/scratch_mod.o obj/semhat_mod.o obj/soln_mod.o obj/steady_mod.o obj/string_mod.o obj/topol_mod.o obj/tstep_mod.o obj/turbo_mod.o obj/wz_mod.o obj/wzf_mod.o obj/zper_mod.o obj/io_mod.o obj/poisson_mod.o obj/navier4.o obj/drive.o obj/drive1.o obj/drive2.o obj/plan4.o obj/bdry.o obj/coef.o obj/conduct.o obj/connect1.o obj/connect2.o obj/dssum.o obj/eigsolv.o obj/genxyz.o obj/hsmg.o obj/gmres.o obj/convect.o obj/induct.o obj/navier0.o obj/navier1.o obj/navier5.o obj/navier6.o obj/navier8.o obj/map2.o obj/ic.o obj/ssolv.o obj/math.o obj/mxm_wrapper.o obj/hmholtz.o obj/subs1.o obj/fast3d.o obj/fasts.o obj/byte.o obj/chelpers.o obj/byte_mpi.o obj/prepost.o obj/nek_comm.o obj/setprop.o obj/papi.o obj/gauss.o obj/makeq.o obj/makeq_aux.o obj/mxm_std.o obj/comm_mpi.o obj/singlmesh.o obj/jl_gs.o obj/jl_sort.o obj/jl_sarray_transfer.o obj/jl_sarray_sort.o obj/jl_gs_local.o obj/jl_crystal.o obj/jl_comm.o obj/jl_tensor.o obj/jl_fail.o obj/jl_fcrystal.o obj/jl_findpts.o obj/jl_findpts_local.o obj/jl_obbox.o obj/jl_poly.o obj/jl_lob_bnd.o obj/jl_findpts_el_3.o obj/jl_findpts_el_2.o obj/jl_sparse_cholesky.o obj/jl_xxt.o obj/jl_fcrs.o -lblas -llapack -L/opt/fftw3/lib/ -lfftw3 -L/home/maxhutch/src/libxsmm/lib/intel64 -lxsmm
ifort: command line warning #10006: ignoring unknown option '-ffpe-trap=invalid,zero,overflow'
ifort: command line warning #10006: ignoring unknown option '-fsignaling-nans'
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: command line warning #10006: ignoring unknown option '-ffpe-trap=invalid,zero,overflow'
ifort: command line warning #10006: ignoring unknown option '-fsignaling-nans'
ifort: warning #10362: Environment configuration problem encountered.  Please check for proper MPSS installation and environment setup.
ifort: warning #10182: disabling optimization; runtime debug checks enabled
x86_64-k1om-linux-ld: No such file or directory
makefile:165: recipe for target 'nek5000' failed
make: *** [nek5000] Error 100

Dynamically dispatch CRC32 according to CPUID flags

Dynamically dispatch the code path making use of CRC32 instructions. This will allow running on pre-Nehalem/Westmere CPUs (no SSE4.2/CRC32 instructions). The intention is to support Linux distributions (package managers) aiming for a wider range of processors.

support for arbitrary values of alpha and beta

Implementing support for arbitrary values of alpha and beta is not impossible (~5% performance hit for very small sizes). Therefore we should considerate adding this to the generator backend.

Remove any calls performing non-private file I/O (incl. console output)

A library is not supposed to perform I/O operations which is not invisible (console, and leave-behind files). However, our non-NDEBUG code path may perform such kind of I/O to improve application testing and debugging. This requirement belongs to the code quality category which is about allowing our code to be adopted where highest standards apply.

list option when pre-building the library

Currently, if a user wants to pre-build a specific set of specializations the MNK="" or M="", N="", K="" interface has to be used. For larger sets there have been reports that bash/make are failing and the entire build fails. It might be useful to have a python script, json, or xml input file which specifies the requested kernels and the LIBXSMM make system builds these kernels afterwards step-by-step.

LIBXSMM_GEMM_DESCRIPTOR macro is broken for sparse

the LIBXSMM_GEMM_DESCRIPTOR in libxsmm_generator.h doesn't allow for values 0 for LDA,LDB and LDC. However, these exception cases are used by the sparse matrix code generator for determining which matrix is sparse. A workaround was added generator_driver.c (simply overwriting the generated descriptor).

However, automatically promoting LDA - LDC to m or k seems to be a pretty dangerous thing. If the users requests such a code -> no DGEMM as this is an invalid specification. An error should be issued during generation of such a code instead.

LIBXSMM interface/frontend refinement

Promote Alpha and Beta arguments to the simplified interface. Support JIT-building kernels with general xGEMM arguments using the frontend, and adjust the dispatch functions accordingly. This change will break with our currently deployed simplified interface (frontend) which is only accepting M, N, and K parameters. The intention of this issue is to settle our frontend interface.

Finalize the library and free internal resources (libxsmm_finalize)

Finalize the library (as the opposite of "libxsmm_init"), and free internal resources such as memory allocated to hold generated code (hash table).

integration tests

Could some of the samples, smm seems particularly suitable, be massaged into integration tests? It would boost my confidence in making changes and opening PRs.

I use travis for other projects. If someone else sets up a script that returns 0 on pass and non-zero on fail, I'm willing to set up the rest.

Provide libxsmmf library accompanying the MODule file

Providing a libxsmmf.[a|so|dll] library which is accompanying the MODule file (already generated) allows for using LIBXSMM without including the header file and the related implications. Including libxsmm.f and linking against the regular libxsmm.[a|so|dll] is just an additional option for users who prefer working with a compiler-dependent module file.

Add a version stamp to LIBXSMM's interface

Add a version stamp (compile-time), and perhaps a runtime API to query the version of the library (C/C++ and Fortran interfaces).

support for transb in the generator backend

Currently the generator can only generate code for non-trans operations. Support for transB is straightforward and should be therefore added.

remove ALIGNED_STORES and ALIGNED_LOADS options

currently the LIBXSMM has two build options which control implicit changes of LDA and LDC parameters.
As we are moving to a more general interface which includes support for LDx ALIGNED_STORES and ALIGNED_LOADS are redundant and should be removed.
LIBXSMM should still provide macros or functions which allow for easily deriving "padded" LDx values matching the smallest required value.

Timers in sample/smm rely on OpenMP

remove OpenMP timers and use gettimeofday (at least under linux). This allows us to run in serial and to debug the Fortran interface performance.

extracting common parts of sample makefiles

currently makefiles in the sample directories (smm, cp2k, nek) do not share a common configuration. Common parts should be carved out to make their maintenance easier.

Dispatch for unsupported code generation requests

Detect unsupported JIT code generation requests when building the LIBXSMM_GEMM_DESCRIPTOR. An unsupported code version needs to be dispatched to the fallback code path. Currently an unsupported code version would fail in the code generator. This error condition is likely generated too slow to be used for code dispatch.

Implement streaming stores

I want to call libxsmm functions for chunks of large matrices in a memory-bound kernel (similar to the "batched" mode in the examples).
Therefore, it would be great to have the possibility to employ streaming stores for the result matrix.

Is it possible/sensible/realistic that you implement this?

Rework Makefile's mkdir mechanism to avoid issues in parallel builds

There are still spurious issues when building in parallel (make -j). The problems appear also with newer versions of GNU make (and independent of what is worked around already; make v3.82). Adopt a solution which is implicitly creating the necessary directories for any target placed in a particular folder by introducing a "dummy" file representing the directory in questions.

As a general cleanup, remove the rule(s) in the NEK sample which are installing into DEPDIR. Really this is an awful solution where the sample code installs into the library's directory structure (and this cannot be preserved). Any NEK-related code can still do it the other way around and simply rely on the sample folder. In another cleanup stage, one could also remove the sample related rules in LIBXSMM's Makefile doing various stuff (testing, script generation, etc.). Really this was never intended to be a solution for dealing with Travis (and there are better ways to do this).

Support specifying static code versions in full detail (LDx, etc.)

There is no support for specifying LDx with static code generation. This is a minor feature which can be supported via our long-planned spec-file. The latter would allow to specify static code versions beyond (M,N,K) triplets.

MPSS required, even with MIC=0

On commit 50ed3d0, my system with ICC 16.0.1 and without MPSS cannot build libxsmm, even with MIC=0:

$ make MIC=0

[jsewall libxsmm (master)]$ make OFFLOAD=0
icc -Wall -Wno-unused-function -Wno-attributes -fPIC -O2 -ftree-vectorize -ffast-math -funroll-loops -D__extern_always_inli
ne=inline -DNDEBUG -D__STATIC -D__MKL -Iinclude -Ibuild -I/nfs_home/jsewall/src/libxsmm/src -I/swtools/intel/compilers_and_lib
raries_2016.1.150/linux/mkl/include -mavx2 -c /nfs_home/jsewall/src/libxsmm/src/libxsmm.c -o build/intel64/libxsmm.o
icc: command line warning #10006: ignoring unknown option '-ffast-math'
icc: command line warning #10353: option '-mavx2' ignored, suggest using '-march=core-avx2'
icc: warning #10193: -vec is default; use -x and -ax to configure vectorization
icc: command line warning #10006: ignoring unknown option '-ffast-math'
icc: warning #10362: Environment configuration problem encountered. Please check for proper MPSS installation and environment
setup.
icc: warning #10193: -vec is default; use -x and -ax to configure vectorization
In file included from include/libxsmm_frontend.h(35),
from include/libxsmm.h(65),
from /nfs_home/jsewall/src/libxsmm/src/libxsmm.c(31):
include/libxsmm_macros.h(279): catastrophic error: MIC cannot open source file "pthread.h"

include <pthread.h>

I also get warnings about the flag -mavx2, which ICC ignores (-march=core-avx2 is the preferred flag).

-cp2k flag in make.sh

hey all,
just a few quick questions about the -cp2k flag in make.sh. Is it supposed to deliver a cp2k suitable library?

I see that the MNK options are

MNK="
23,
6,
14 16 29,
14 32 29,
5 32 13 24 26,
9 32 22,
64,
78,
16 29 55,
32 29 55,
12,
4 5 7 9 13 25 26 28 32 45"

But in the cp2k's toolchain installer the MNK options are

MNK="1 4 5 6 8 9 13 16 17 22 23 24 26 32"
which are many less, but also different combinations.

Next it sets SSE=3, which according to the makefile.inc and documentation doesn't exist. Only SSE=1 and AVX=1|2|3 exists.

And last, I read Intel optimized and the cp2k.pdf inside documentation says I should have icc. Is it also supposed to work with gcc compiler? I did manage to compile it with gcc. Are there any issues with gcc?

Thank you for any answers.

Johannes

Support the prefetch interface for Fortran.

Right now when requesting PREFETCH=1, LIBXSMM does not generate the extended function signatures for taking prefetch locations using the Fortran interface.

Move libxsmm_generator_dense_add_isa_check_header and libxsmm_generator_dense_add_isa_check_footer into generator_common

The functions:
libxsmm_generator_dense_add_isa_check_header
libxsmm_generator_dense_add_isa_check_footer
are not specific to dense matrix multiplication and therefore they should be renamed and moved to generator_common.c/h

support for alpha -1

a common and special case for alpha is -1. Currently, the generator is not generally supporting this case (alpha -1).

AVX512 instruction size reduction

AVX512 instructions allow for various memory reference encodings which impact the encoded instructions length. In general shorter instructions should achieve better performance. The generator should be rewritten to use short instructions as often as possible.

Remove compiler generated fallback code

Currently, LIBXSMM offers two fallback options: a) compiler generated and unrolled code b) call into a BLAS library. As it's planned to evolve LIBXSMM generator to support additional cases such as arbitrary alpha and beta and transpose options and LIBXSMM's JIT feature will become a stable release feature, alternative a) is redundant and will most likely never been called. Therefore it should be considered as deprecated and removed in a future release of LIBXSMM.

loop elimination in generated code

Independent of the matrix kernel size, the generator backend generates loop bodies. For very small sizes (M<16) these loops have only one on trip, therefore they can be eliminated.

clBLAS-master\src\library\blas\xgemm.cc(394): error C2065: 'gemmSelectKernel': nichtdeklarierter Bezeichner (not declared)

Hi there,
I struggle to create a cblas.lib using Visual Studio (desktop) 2012 since days and I just can't get it to compile correctly.
The long list of errors starts with

clBLAS-master\src\library\blas\xgemm.cc(394): error C2065: 'gemmSelectKernel': nichtdeklarierter Bezeichner

after that there's a lot of errors aorund which I presume are just follow ups. Can anyone help?

Many thanks,
René

determine highest available vector instruction set extension based on CPUID when using JIT option

Currently, at compile time the target vector instruction set extensions are fixed. This is an unnecessary limitation as the CPU running LIBXSMM later on can be different (newer or older). In order to guarantee best out-of-the box performance we should read the targets CPUID and determine the highest supported vector instruction set extension on-the-fly.

assumed-size F90 interface

Currently LIBXSMM's F90 interface requires 2D Fortran arrays as inputs. We have seen applications which need to call LIBXSMM routines for contiguous slices of higher dimensional arrays. A quick test unveiled that the needed reshape is not replaced by a no-op. Therefore, the only solution is to changes LIBXSMM's F90 interface to an assumed-size interface. This will disable row-major support for Fortran.

KNC code generation

KNC is generated although not requested:

make M="4 8 10 12 16 64 100 144" N="4 8 10 12 16 64 100 144" K="4 8 10 12" BETA=0 OFFLOAD=0 MIC=0 SSE=3

Furthermore, when calling with OFFLOAD=0, the application shouldn't be required to use -no-offload.

This is true for compiling the f90 module or including the libxsmm.h header in C/C++ applications.

Provide PREFIX-based installation mechanism and related cleanup

Implement a PREFIX-based installation, and perhaps renamed the generator executable to libxsmm_generator (to avoid any name clashes). This task will be able to leverage/complement the existing out-of-tree build mechanism.

Remove exit calls and instead propagate errors to the call side

A library is not supposed to exit an application. Instead, an unrecoverable error is propagated to the call side (where exit may be called or not). This gives an application the chance to perform own cleanup and tear-down (independent of "magic" exit handler code). This requirement belongs to the code safety category which is about allowing our code to be adopted where highest standards apply.

add support for vendor-specific (e.g. CRAY) wrappers to at least LIBXSMM samples

running on Cray machines is easiest when using the cray wrappers for gnu/intel compilers. They are CC=cc CXX=CC FC=ftn. Currently makefiles can be hacked (incl. STATIC=1) to build on cray.

Often on cray machines, the login node has a different arch then the compute node, but the wrappers have the best arch flags -> LIBXSMM's cray support shouldn't specify -xHost.

add travis-ci tests for F90 examples

Currently the F90-binding examples are not tested for correctness when running travis-ci tests. this should be fixed.

FORTRAN interface

Generate and implement a FORTRAN interface along with some sample code (driver).

samples/smm doesn't build

When I try building this sample, I'm blasted with these errors:

/usr/lib/gcc/x86_64-linux-gnu/4.9/include/xopintrin.h(438): error: identifier "__builtin_ia32_vpcomltud" is undefined
    return (__m128i) __builtin_ia32_vpcomltud ((__v4si)__A, (__v4si)__B);
                     ^

In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/x86intrin.h(52),
                 from /usr/include/x86_64-linux-gnu/c++/4.9/bits/opt_random.h(33),
                 from /usr/include/c++/4.9/random(50),
                 from /usr/include/c++/4.9/bits/stl_algo.h(66),
                 from /usr/include/c++/4.9/algorithm(62),
                 from /home/maxhutch/src/libxsmm/samples/smm/blas.cpp(37):
/usr/lib/gcc/x86_64-linux-gnu/4.9/include/xopintrin.h(444): error: identifier "__builtin_ia32_vpcomleud" is undefined
    return (__m128i) __builtin_ia32_vpcomleud ((__v4si)__A, (__v4si)__B);
                     ^

compilation aborted for /home/maxhutch/src/libxsmm/samples/smm/blas.cpp (code 4)
Makefile:380: recipe for target 'build/blas-cpp.o' failed
make: *** [build/blas-cpp.o] Error 4

These look like compiler issues, but I'm running a vanilla debian system, so I thought they'd be worth pointing out.

Incorrect results when ldA > K, ldB > N, or ldC > N.

As of commit baad5c1, libxsmm_sgemm returns incorrect results when ldA != K, ldB != N, and ldC != N for a row-major configuration.

Specifically, libsxmm was compiled with

$ make AVX=2 JIT=1 ROW_MAJOR=1

I've attached a reproducer:
xmm-bug.zip

Edit the makefile to find your local libsxmm, then run

$ ./xmm-bug 64 240 64 64 240 240 1

You will see that reference C code, MKL, and libxsmm roughly agree on the answers.

Run

$ ./xmm-bug 64 239 64 64 240 240 1

and you will see that MKL and C agree, but libxsmm does not.

OFFLOAD mode issue

I am trying to run this on a Phi and am compiling with
> make install OFFLOAD=1 MNK="2,4,6,8,10,12,14,16,18,20,23" AVX=3
but it errors out with

../../include/libxsmm.f90(143): error #6643: This statement is incorrectly positioned.
!DIR$ ATTRIBUTES OFFLOAD:MIC :: libxsmm_smm_2_2_2
----------^
../../include/libxsmm.f90(150): error #6643: This statement is incorrectly positioned.
!DIR$ ATTRIBUTES OFFLOAD:MIC :: libxsmm_dmm_2_2_2
----------^

Is there another flag that needs to be set to compile for the Phi?

Implement dynamic code dispatch for ISA extensions

Currently a single code path (instruction set extension) is supported and determined at build time of the library. This is true for the statically requested kernels but also for the JITted code, both of which can support selecting the architecture at runtime (initialization time of the library). To actually implement this feature, we can check feature bits our self, or rely on certain attributes available for both the Intel and the GNU based tool chain. The solution based on attributes might preferred with respect to maintenance. However, the level of ISA-dispatch will be ultimately driven by an anticipated performance impact (we do not want a performance impact due to supporting this feature). Fixing this issue is to at least enable JITted code matching the platform at runtime.

TODO

Code optimizations: (1) prefetching memory references, (2) introducing a leading matrix dimension such that aligned Load and/or Store instructions can be used, and (3) AVX-512 testing and tuning.
Incorporate separate routines for matrix transposes, and check performance of a specialized MM kernel which is multiplying with a pretransposed B matrix.
Improved build system retiring the current mechanism (INDICES_M, INDICES_N, and INDICES_K). It is also accepting empty list(s) i.e., not generating a specialized function.
Publish performance results along with the benchmark driver.

routines in libxsmm don't obey BETA=0

The inlined routines libxsmm_{s,d}imm and libxsmm_{s,d}blasmm generated in libxsmm.f90 hardcode beta = 1 even when built with BETA=0.

remove generation for aligned stores and loads from the generator backend

All recent IAs (Sandy Bridge or later) do not suffer from performance penalties when executing a unaligned vector load (vmovups/vmovupd) on aligned data (so in theory we could use vmovaps/vmovapd). Therefore, we can take this complexity out of the generator backend.

Side-note: this would also mean that the Intel Knights Corner backend needs to be removed, or at least limited to aligned LDx. This is due to the fact that the previous statement is not true on Intel Knight Corner as this architecture does not offer simple unaligned vector data move instructions.

the internal jit_generator tester is currently broken

After the latest refactoring, the jit_generator doesn't compile anymore. This seems to be simple include issue. However, we moved many function definition between several headers, so we need to check which header needs to be included in jit_validation.c

Full xGEMM interface and LD_PRELOADable library

Add procedures with the exact LAPACK/xGEMM signature including the appropriate code dispatch. Implement a libxsmm_proxy library (so, dll) which is able to intercept existing xGEMM calls. Document the way to achieve a similar effect using static linkage (no code changes but adjusting the link-line).

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.