
x86-simd-sort's Introduction

x86-simd-sort

C++ template library for high-performance SIMD-based sorting routines for built-in integers and floats (16-bit, 32-bit and 64-bit data types) and custom-defined C++ objects. The sorting routines are accelerated using AVX-512/AVX2 when available, and the library automatically picks the best version for the processor it runs on. If you are looking for the AVX-512- or AVX2-specific implementations, please see the README file under the src/ directory. The following routines are currently supported:

Sort an array of custom defined class objects (uses O(N) space)

template <typename T, typename Func>
void x86simdsort::object_qsort(T *arr, uint32_t arrsize, Func key_func)

T is any user-defined struct or class, and arr is a pointer to the first element in the array of objects of type T. Func is a lambda function that computes the key value for each object, which is the metric used to sort the objects. Func needs to have the following signature:

[] (T obj) -> key_t { key_t key; /* compute key for obj */ return key; }

Note that the return type of the key, key_t, needs to be one of the following: [float, uint32_t, int32_t, double, uint64_t, int64_t]. object_qsort has a space complexity of O(N). Specifically, it requires arrsize * sizeof(key_t) bytes to store a vector with all the keys and an additional arrsize * sizeof(uint32_t) bytes to store the indexes of the object array. For performance reasons, we support object_qsort only when the array size is less than or equal to UINT32_MAX. An example usage of object_qsort is provided in the examples section. Refer to the performance comparison section below to get a sense of how fast this is relative to std::sort.
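
The space complexity above comes from a decorate-sort-undecorate strategy: compute one key per object, sort the key/index pairs, then permute the objects. A scalar sketch of that idea, with std::sort standing in for the library's SIMD key-value sort (illustrative only; object_qsort_sketch is a hypothetical name, not the library API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Scalar sketch of the object_qsort strategy: extract keys, argsort the
// keys, then permute the objects. The real library sorts the key/index
// pairs with SIMD kernels instead of std::sort.
template <typename T, typename Func>
void object_qsort_sketch(T *arr, uint32_t arrsize, Func key_func) {
    using key_t = decltype(key_func(arr[0]));
    std::vector<key_t> keys(arrsize);    // arrsize * sizeof(key_t) bytes
    std::vector<uint32_t> idx(arrsize);  // arrsize * sizeof(uint32_t) bytes
    for (uint32_t i = 0; i < arrsize; ++i) keys[i] = key_func(arr[i]);
    std::iota(idx.begin(), idx.end(), 0u);
    std::sort(idx.begin(), idx.end(),
              [&](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });
    std::vector<T> sorted(arrsize);      // scratch for the final permute
    for (uint32_t i = 0; i < arrsize; ++i) sorted[i] = arr[idx[i]];
    std::copy(sorted.begin(), sorted.end(), arr);
}
```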

Sort an array of built-in integers and floats

void x86simdsort::qsort(T* arr, size_t size, bool hasnan);
void x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan);
void x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan);

Supported datatypes: T $\in$ [_Float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t]

Key-value sort routines on pairs of arrays

void x86simdsort::keyvalue_qsort(T1* key, T2* val, size_t size, bool hasnan);

Supported datatypes: T1, T2 $\in$ [float, uint32_t, int32_t, double, uint64_t, int64_t] Note that keyvalue sort is not yet supported for 16-bit data types.
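
keyvalue_qsort sorts the key array and applies the same permutation to the val array. A scalar reference of that behaviour (keyvalue_sort_ref is a hypothetical name; the library computes the same result with SIMD kernels, not an explicit index array like this):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar reference for keyvalue_qsort semantics: sort `key` ascending and
// reorder `val` with the same permutation.
template <typename T1, typename T2>
void keyvalue_sort_ref(T1 *key, T2 *val, size_t size) {
    std::vector<size_t> idx(size);
    std::iota(idx.begin(), idx.end(), size_t{0});
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return key[a] < key[b]; });
    std::vector<T1> k2(size);
    std::vector<T2> v2(size);
    for (size_t i = 0; i < size; ++i) {
        k2[i] = key[idx[i]];
        v2[i] = val[idx[i]];
    }
    std::copy(k2.begin(), k2.end(), key);
    std::copy(v2.begin(), v2.end(), val);
}
```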

Arg sort routines on arrays

std::vector<size_t> arg = x86simdsort::argsort(T* arr, size_t size, bool hasnan);
std::vector<size_t> arg = x86simdsort::argselect(T* arr, size_t k, size_t size, bool hasnan);

Supported datatypes: T $\in$ [_Float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t]
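
argsort leaves the input untouched and returns the permutation that would sort it (the np.argsort analogue). A scalar reference implementation showing what the returned vector means (argsort_ref is a hypothetical name; whether the library breaks ties the same way as this stable version is not guaranteed):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar reference for argsort semantics: arg[i] is the index of the
// element that belongs at position i of the sorted array.
template <typename T>
std::vector<size_t> argsort_ref(const T *arr, size_t size) {
    std::vector<size_t> arg(size);
    std::iota(arg.begin(), arg.end(), size_t{0});
    std::stable_sort(arg.begin(), arg.end(),
                     [&](size_t a, size_t b) { return arr[a] < arr[b]; });
    return arg;
}
```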

Build/Install

meson is the build system used. To build and install the library:

meson setup --buildtype release builddir && cd builddir
meson compile
sudo meson install

Once installed, you can use pkg-config --cflags --libs x86simdsortcpp to populate the right cflags and ldflags to compile and link your C++ program. This repository also contains a test suite and a benchmarking suite, written using the googletest and Google Benchmark frameworks respectively. You can configure meson to build both by passing -Dbuild_tests=true and -Dbuild_benchmarks=true.

Example usage

Sort an array of floats

#include "x86simdsort.h"
#include <vector>

int main() {
    // Note: parentheses, not braces; std::vector<float> arr{1000} would
    // create a single-element vector and the qsort call would read past it.
    std::vector<float> arr(1000);
    x86simdsort::qsort(arr.data(), arr.size(), true);
    return 0;
}

Sort an array of Points using object_qsort

#include "x86simdsort.h"
#include <cmath>
#include <vector>

struct Point {
    double x, y, z;
};

int main() {
    std::vector<Point> arr(1000);
    // Sort the Points by their x value:
    x86simdsort::object_qsort(arr.data(), 1000, [](Point p) { return p.x; });
    // Sort the Points by their distance from the origin:
    x86simdsort::object_qsort(arr.data(), 1000, [](Point p) {
        return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
    });
    return 0;
}

Details

  • x86simdsort::qsort is equivalent to qsort in C or std::sort in C++.
  • x86simdsort::qselect is equivalent to std::nth_element in C++ or np.partition in NumPy.
  • x86simdsort::partial_qsort is equivalent to std::partial_sort in C++.
  • x86simdsort::argsort is equivalent to np.argsort in NumPy.
  • x86simdsort::argselect is equivalent to np.argpartition in NumPy.
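
The std:: counterparts above pin down the exact semantics. For example, std::nth_element and std::partial_sort (the qselect and partial_qsort equivalents) behave as follows; the helper names here are illustrative only:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// qselect(arr, k, size, hasnan) ~ std::nth_element: after the call,
// element k is in its final sorted position.
int kth_smallest(std::vector<int> v, size_t k) {
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];
}

// partial_qsort(arr, k, size, hasnan) ~ std::partial_sort: the first k
// elements end up being the k smallest, in sorted order.
std::vector<int> first_k_sorted(std::vector<int> v, size_t k) {
    std::partial_sort(v.begin(), v.begin() + k, v.end());
    v.resize(k);
    return v;
}
```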

Supported datatypes: uint16_t, int16_t, _Float16, uint32_t, int32_t, float, uint64_t, int64_t, double. Note that _Float16 requires building this library with g++ >= 12.x. All the functions have an optional argument bool hasnan, set to false by default (relevant to floating-point data types only). If your array has NaNs and hasnan is false, the behaviour of the sorting routine is undefined. If hasnan is set to true, NaNs are always sorted to the end of the array. In addition, qsort will replace all NaNs with std::numeric_limits<T>::quiet_NaN; the original bit-exact NaNs in the input are not preserved. Also note that the arg methods (argsort and argselect) will not use the SIMD-based algorithms if they detect NaNs in the array. Details of all the implementations are documented in the README under the src/ directory.

Performance comparison on AVX-512: object_qsort vs. std::sort

Performance of object_qsort can vary significantly depending on the definition of the custom class, and we highly recommend benchmarking before using it. For the sake of illustration, we provide a few examples in ./benchmarks/bench-objsort.hpp which measure the performance of object_qsort relative to std::sort when sorting an array of 3D points represented by struct Point {double x, y, z;} and struct Point {float x, y, z;}. We sort these points based on several different metrics:

  • sort by coordinate x
  • sort by Manhattan distance (relative to origin): abs(x) + abs(y) + abs(z)
  • sort by Euclidean distance (relative to origin): sqrt(x*x + y*y + z*z)
  • sort by Chebyshev distance (relative to origin): max(abs(x), abs(y), abs(z))

The performance data (shown in the plot below) can be collected by building the benchmark suite and running ./builddir/benchexe --benchmark_filter=*obj*. The data shown below was collected on a processor with AVX-512 because object_qsort is currently accelerated only on AVX-512 (we plan to add the AVX2 version soon). For the simplest of cases, where we want to sort an array of structs by one of their members, object_qsort can be up to 5x faster for 32-bit data types and about 4x faster for 64-bit data types. It tends to do even better when the metric to sort by gets more complicated: sorting by Euclidean distance can be up to 10x faster.

[Plot: performance of object_qsort relative to std::sort]

Downstream projects using x86-simd-sort

  • NumPy uses this as a submodule to accelerate np.sort, np.argsort, np.partition and np.argpartition.
  • A slightly modified version of this library has been integrated into OpenJDK.
  • GRAPE: C++ library for parallel graph processing.
  • An AVX-512 version of the key-value sort has been submitted to OceanBase.

x86-simd-sort's People

Contributors

bkmgit, icfaust, mosullivan93, r-devulap, rdower, rouault, ruclz, sterrettm2, zbjornson


x86-simd-sort's Issues

Can I use this lib to output the original index of these sorted data

This lib looks awesome!
But I have a requirement to output the original index of these sorted data, for example the input array is {300 200 400 99 150 50 230 250}, then the sorted result should be {50 99 150 200 230 250 300 400}, and I want to get the original index as output {5, 3, 4, 1, 6, 7, 0, 2}.
Currently, I use std::sort with a lambda to implement this requirement, but the performance is not good.

Compilation failure on vpcompressw

Hi,
A build failure on some asm instructions:

The Meson build system
Version: 0.55.1
Source dir: repos/x86-simd-sort
Build dir: repos/x86-simd-sort/builddir
Build type: native build
Project name: x86-simd-sort
Project version: 4.0.0
C++ compiler for the host machine: ccache c++ (gcc 11.3.0 "c++ (GCC) 11.3.0")
C++ linker for the host machine: c++ ld.bfd 2.27-44
Host machine cpu family: x86_64
Host machine cpu: x86_64
Compiler for C++ supports arguments -march=haswell: YES 
Compiler for C++ supports arguments -march=skylake-avx512: YES 
Compiler for C++ supports arguments -march=icelake-client: YES 
Build targets in project: 4

x86-simd-sort 4.0.0

  Configuration
    Can compile AVX-512 FP16 ISA: NO
              Build test content: NO
                Build benchmarks: NO

Found runner: /usr/bin/ninja
ninja: Entering directory `.'
[2/8] Compiling C++ object lib/liblibicl.a.p/x86simdsort-icl.cpp.o
FAILED: lib/liblibicl.a.p/x86simdsort-icl.cpp.o 
ccache c++ -Ilib/liblibicl.a.p -Ilib -I../lib -I../src -fvisibility=hidden -fvisibility-inlines-hidden -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=c++17 -O3 -fPIC -march=icelake-client -MD -MQ lib/liblibicl.a.p/x86simdsort-icl.cpp.o -MF lib/liblibicl.a.p/x86simdsort-icl.cpp.o.d -o lib/liblibicl.a.p/x86simdsort-icl.cpp.o -c ../lib/x86simdsort-icl.cpp
{standard input}: Assembler messages:
{standard input}:184: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:185: Error: no such instruction: `vpcompressw %zmm4,64(%rbx,%rax,2){%k1}'
{standard input}:197: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:198: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%rax,2){%k1}'
...

Tks

Merge sort and AVX512

Hi!

Many thanks for your contribution to speeding up sorting in numpy. Wanted to ask if there are any plans to speed up merge/tim sort with AVX2/AVX512?

argsort and AVX2

Hi,
Wanted to ask if there are any plans for argsort implementation with AVX2?

Argsort Adaptation

Amazing work! I was wondering if it would be possible to adapt the qsort kernels for an argsort. So also returning the indices that would make the original array sorted. (I believe you can do this with std::sort by sorting pair<value, size_t> where size_t is the original index) Would love to help out if it's feasible!
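
The pair<value, size_t> workaround mentioned above looks like this (a scalar baseline for comparison, not the library's approach; argsort_with_pairs is a hypothetical name):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Baseline argsort via std::sort on (value, original-index) pairs. The
// library's argsort produces the same kind of permutation but sorts an
// index vector with SIMD kernels instead of materializing pairs.
std::vector<size_t> argsort_with_pairs(const std::vector<float> &arr) {
    std::vector<std::pair<float, size_t>> tagged(arr.size());
    for (size_t i = 0; i < arr.size(); ++i) tagged[i] = {arr[i], i};
    std::sort(tagged.begin(), tagged.end()); // sorts by value, then index
    std::vector<size_t> idx(arr.size());
    for (size_t i = 0; i < arr.size(); ++i) idx[i] = tagged[i].second;
    return idx;
}
```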

Need Argsort For 32Bit Index For 32Bit Types

We only have argsort whose indexes are (unsigned) int64_t. For 32-bit types like int/float, an argsort with int32_t as the index type could also be provided, which may perform better with AVX-512's i32gather_ps/epi32 instructions. Also, this function can be easily implemented on AVX2 platforms.

Older GCC versions don't support optimization pragmas

We attempted to merge avx512-sort into oceanbase; compared with std::sort, avx512-sort gives an obvious perf improvement. But there is a problem when merging the new x86_simd_sort into oceanbase. We found that some GCC optimization pragmas, such as "#pragma GCC unroll 8", were added in the latest version of x86-simd-sort. The default GCC version for oceanbase is gcc 5.2.0, and it seems that this gcc doesn't support these pragmas. The error is like this:
[screenshot of compiler error omitted]

Please help to fix.

Improve argsort for 32-bit

32-bit argsort uses ymm registers: we can switch to zmm registers (use 2x i64gather instructions) and add new bitonic networks.

Benchmarks can't be compiled on older platforms

When I try to compile the benchmark executable I get the following error from running make meson:

...
FAILED: benchexe
c++  -o benchexe 'benchexe@exe/benchmarks_main.cpp.o' -Wl,--as-needed -Wl,--no-undefined -Wl,--whole-archive -Wl,--start-group benchmarks/libbench_qsort.a utils/libcpuinfo.a -Wl,--no-whole-archive /usr/local/lib/libbenchmark.a -Wl,--end-group
/usr/bin/ld: /usr/local/lib/libbenchmark.a(benchmark_runner.cc.o): in function `benchmark::internal::BenchmarkRunner::DoNIterations()':
benchmark_runner.cc:(.text+0x1418): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
...

This is with v1.8.0 of Google Benchmarks, v1.13.0 of Google's Test Framework. My system is running Ubuntu 20.04 and I have tried both g++-10 and g++-9. Further, the Makefile spits out many errors because my version of GCC can't compile the fp16 files.

I eventually pinned the meson issue down to the fact that my system is running glibc 2.31, where the linker still needs to be explicitly given -pthread or -lpthread to use threading. This flag is present by default for gtest and gtest_main, but Google Benchmark's pkg-config will only expose it when you indicate that it's being linked statically.

The default installation method for both Google's benchmark and testing suite will give a static library, so we should probably be explicitly configuring it as such in the build scripts. As for the Makefile, we could adapt this solution from Stack Overflow to detect when GCC is not new enough for the -march=sapphirerapids flag and exclude the fp16 files.

Improve perf of qsort

two ideas we could try:

  • Use larger bitonic sorting networks (256 and 512 elements)
  • Improve pivot selection by picking a set of random indices

CI: Add basic CI

Add CI tests to run for every PR.

  • Build testexe and run to check pass/fail
  • Build x86-simd-sort with main branch of NumPy and ensure it builds correctly.

Argsort on int32_t severely underperforms

Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limit of INT32_MAX, and all of its "Index" variants also only used 32-bit types. To be clear, a 32-bit index variant would still be beneficial, and I see there are plans to eventually implement that; it's just that I need a conditional code path for when there are too many values (I have a generated dataset with length in the billions). I noticed that when the code makes use of the AVX-512 instructions for argsort, it severely underperforms against the radix sort (which, as far as I can tell from the assembly, is completely scalar except for an iota-like scan to generate the sequential index).

The benchmarks distributed with this project do show it winning by some margin against the scalar variants here, but not by the typical orders of magnitude that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this code heavily uses compressed store and gather, and those instructions were severely limited by the mitigation.

Here's my benchmark on my hardware:

[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 4096 KiB (x6)
  L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t               959 ns          956 ns       731536
simdargsort/random_256/int32_t              2045 ns         2040 ns       343034
simdargsort/random_512/int32_t              3866 ns         3857 ns       181511
simdargsort/random_1k/int32_t               9289 ns         9267 ns        75493
simdargsort/random_5k/int32_t              50900 ns        50768 ns        13789
simdargsort/random_100k/int32_t          1764351 ns      1757047 ns          400
simdargsort/random_1m/int32_t           29256670 ns     29111150 ns           24
simdargsort/random_10m/int32_t         959237861 ns    954095660 ns            1
simdargsort/smallrange_128/int32_t           722 ns          720 ns       975799
simdargsort/smallrange_256/int32_t          2049 ns         2044 ns       342384
simdargsort/smallrange_512/int32_t          3655 ns         3646 ns       191985
simdargsort/smallrange_1k/int32_t           9169 ns         9147 ns        76497
simdargsort/smallrange_5k/int32_t          21375 ns        21320 ns        32747
simdargsort/smallrange_100k/int32_t       467703 ns       465853 ns         1502
simdargsort/smallrange_1m/int32_t        9097255 ns      9047326 ns           75
simdargsort/smallrange_10m/int32_t     183596592 ns    182645622 ns            4
simdargsort/sorted_10k/int32_t            104574 ns       104314 ns         6668
simdargsort/constant_10k/int32_t            8648 ns         8628 ns        81134
simdargsort/reverse_10k/int32_t           105977 ns       105722 ns         6558
scalarargsort/random_128/int32_t             797 ns          795 ns       882026
scalarargsort/random_256/int32_t            1736 ns         1732 ns       403901
scalarargsort/random_512/int32_t            4000 ns         3990 ns       176210
scalarargsort/random_1k/int32_t            21879 ns        21822 ns        31937
scalarargsort/random_5k/int32_t           262249 ns       261537 ns         2676
scalarargsort/random_100k/int32_t        7551632 ns      7525002 ns           93
scalarargsort/random_1m/int32_t        103352847 ns    102895746 ns            7
scalarargsort/random_10m/int32_t      1974523438 ns   1964054017 ns            1
scalarargsort/smallrange_128/int32_t         667 ns          665 ns      1051997
scalarargsort/smallrange_256/int32_t        1498 ns         1494 ns       469065
scalarargsort/smallrange_512/int32_t        3152 ns         3144 ns       223444
scalarargsort/smallrange_1k/int32_t        13730 ns        13697 ns        50943
scalarargsort/smallrange_5k/int32_t       127606 ns       127285 ns         5499
scalarargsort/smallrange_100k/int32_t    3031465 ns      3019854 ns          232
scalarargsort/smallrange_1m/int32_t     38177852 ns     37995219 ns           18
scalarargsort/smallrange_10m/int32_t   695700305 ns    692244521 ns            1
scalarargsort/sorted_10k/int32_t           92423 ns        92202 ns         7578
scalarargsort/constant_10k/int32_t         83139 ns        82931 ns         8439
scalarargsort/reverse_10k/int32_t          74214 ns        74031 ns         9359

Recent changes don't compile with MSVC, cause warnings in GCC

The additions in #61 fail to compile with MSVC. Is MSVC supported?

# using VS2022 (17.7.34031.279) ...
deps\simdsort\src\xss-network-qsort.hpp(80,11): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,0,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,1,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,2,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,4,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,8,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(134,5): message : see reference to function template instantiation 'void sort_n_vec<vtype,16,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
xxx.h(199,5): message : see reference to function template instantiation 'void sort_n<vtype,256>(zmm_vector<float>::type_t *,int)' being compiled
simdsort\src\xss-network-qsort.hpp(80,11): error C2133: 'vecs': unknown size
simdsort\src\xss-network-qsort.hpp(83,30): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(83,30): error C2133: 'ioMasks': unknown size

The changes also cause a few GCC -Wunused-but-set-parameter warnings for the regs parameter in the same file.

A question about hardware implementation of quick-sort in future Intel CPUs

Hi,

If I understand well, this is a software C++ implementation of a quick sort using hardware AVX vector instructions.

A lot of CPUs now have instructions for cryptography, hashing etc. And Intel does.

Is there a roadmap at Intel, even for the next 10 years, to have a basic quick sort performed in hardware the same way a hash is computed? I know the hardware implementation is complex.

We need the hardware to move closer to the service level that standard software libraries provide, and a hardware sort would be a big game changer (I think of data processing, machine learning, binning, quantization, etc.).

performance on amd 7950x ...

Hello,
I tried the benchmark on a 7950x CPU, and performance is up to 2.3x faster in some tests but much slower (like 0.3x) in others compared to classical sorting. Is AMD's implementation of AVX-512 not so powerful (and your code not suited for Zen 4), or is it something else?

Thanks,
Jan

Multiple definition issue due to explicit function template specialization

On AVX512 builds there are multiple definitions of some functions, e.g. avx512_qsort, see numpy/numpy#25274

Reason is that the explicit specialization of the function template is not declared inline. See cppreference:

Whether an explicit specialization of a function [...] template is inline/constexpr(since C++11) [...] is determined by the explicit specialization itself, regardless of whether the primary template is declared with that specifier

And e.g. template <> void avx512_qsort(_Float16 *arr, arrsize_t arrsize, bool hasnan) has no inline and hence could be defined by multiple translation units, see the source:

void avx512_qsort(_Float16 *arr, arrsize_t arrsize, bool hasnan)

The issue seems to also apply to pretty much every explicit function specialization, so it looks like a major fixup is required searching for "template <>" and marking all those functions defined in headers inline.
