
SIMD Everywhere


The SIMDe header-only library provides fast, portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM. There is no performance penalty if the hardware supports the native implementation (e.g., SSE/AVX runs at full speed on x86, NEON on ARM, etc.).

This makes porting code to other architectures much easier in a few key ways:

First, instead of forcing you to rewrite everything for each architecture, SIMDe lets you get a port up and running almost effortlessly. You can then start working on switching the most performance-critical sections to native intrinsics, improving performance gradually. SIMDe lets (for example) SSE/AVX and NEON code exist side-by-side, in the same implementation.

Second, SIMDe makes it easier to write code targeting ISA extensions you don't have convenient access to. You can run NEON code on your x86 machine without an emulator. Obviously you'll eventually want to test on the actual hardware you're targeting, but for most development, SIMDe can provide a much easier path.

SIMDe takes a very different approach from most other SIMD abstraction layers in that it aims to expose the entire functionality of the underlying instruction set. Instead of limiting functionality to the lowest common denominator, SIMDe tries to minimize the amount of effort required to port while still allowing you the space to optimize as needed.

The current focus is on writing complete portable implementations, though a large number of functions already have accelerated implementations using one (or more) of the following:

You can try SIMDe online using Compiler Explorer and an amalgamated SIMDe header.

If you have any questions, please feel free to use the issue tracker or the mailing list.

Current Status

There are currently complete implementations of the following instruction set extensions:

There is also partial support for many other extensions, including AES-NI, CLMUL, SSE4.2, SVE, and MSA, as well as several AVX-512 extensions. See the instruction-set-support label in the issue tracker for details on progress. If you'd like to be notified when an instruction set is available, you may subscribe to the relevant issue.

If you have a project you're interested in using with SIMDe but we don't yet support all the functions you need, please file an issue with a list of what's missing so we know what to prioritize.

The default branch is protected so commits never reach it unless they have passed extensive CI checks. Status badges don't really make sense since they will always be green, but here are the links:

If you're adding a new build I suggest Cirrus CI, which is where we currently have the most room given the number of builds currently on the platform and the quotas for free/open-source usage. Alternatively, feel free to set up another provider (such as Codefresh, Shippable, Bitrise, Wercker, etc.).

Notice: we plan on changing the name of the default branch from "master" to something else soon; we are just trying to wait to see what name git settles on so we can be consistent.

Contributing

First off, if you're reading this: thank you! Even considering contributing to SIMDe is very much appreciated!

SIMDe is a fairly large undertaking; there are a lot of functions to get through and a lot of opportunities for optimization on different platforms, so we're very happy for any help you can provide.

Programmers of all skill levels are welcome; there are lots of tasks which are pretty straightforward and don't require any special expertise.

If you're not sure how you'd like to contribute, please consider taking a look at the issue tracker. There is a good first issue tag if you want to ease into your first contribution, but if you're interested in something else please get in touch via the issue tracker; we're happy to help you get a handle on whatever you're interested in.

If you're interested in implementing currently unimplemented functions, there is a guide explaining how to add new functions and how to quickly and easily get a test case in place. It's a bit rough right now, but if anything is unclear please feel free to ask via the issue tracker.

Usage

First, it is important to note that you do not need two separate versions (one using SIMDe, the other native). If the native functions are available, SIMDe will use them, and compilers easily optimize away any overhead from SIMDe; all they have to do is some basic inlining. -O2 should be enough, but we strongly recommend -O3 (or whatever flag instructs your compiler to optimize aggressively), since many of the portable fallbacks are substantially faster with the aggressive auto-vectorization that isn't enabled at lower optimization levels.

Each instruction set has a separate file; x86/mmx.h for MMX, x86/sse.h for SSE, x86/sse2.h for SSE2, and so on. Just include the header for whichever instruction set(s) you want instead of the native version (if you include the native version after SIMDe it will result in compile-time errors if native aliases are enabled). SIMDe will provide the fastest implementation it can given which extensions you've enabled in your compiler (i.e., if you want to use NEON to implement SSE, you may need to pass something like -mfpu=neon or -march=armv8-a+simd. See GCC ARM-Options for more information).

If you define SIMDE_ENABLE_NATIVE_ALIASES before including SIMDe you can use the same names as the native functions. Unfortunately, this is somewhat error-prone due to portability issues in the APIs, so it's recommended to only do this for testing. When SIMDE_ENABLE_NATIVE_ALIASES is undefined only the versions prefixed with simde_ will be available; for example, the MMX _mm_add_pi8 intrinsic becomes simde_mm_add_pi8, and __m64 becomes simde__m64.

Since SIMDe is meant to be portable, many functions which assume types are of a specific size have been altered to use fixed-width types instead. For example, Intel's APIs use char for signed 8-bit integers, but char on ARM is generally unsigned. SIMDe uses int8_t to make the API portable, but that means your code may require some minor changes (such as using int8_t instead of char) to work on other platforms.

That said, the changes are usually quite minor. It's often enough to just use search and replace; manual changes are required pretty infrequently.

OpenMP 4 SIMD

SIMDe makes extensive use of annotations to help the compiler vectorize code. By far the best annotations use the SIMD support built in to OpenMP 4, so if your compiler supports these annotations we strongly recommend you enable them.

If you are already using OpenMP, SIMDe will automatically detect it using the _OPENMP macro and no further action is required.

Some compilers allow you to enable OpenMP SIMD without enabling the full OpenMP. In such cases there is no runtime dependency on OpenMP and no runtime overhead; SIMDe will just be faster. Unfortunately, SIMDe has no way to detect such situations (the _OPENMP macro is not defined), so after enabling it in your compiler you'll need to define SIMDE_ENABLE_OPENMP (e.g., by passing -DSIMDE_ENABLE_OPENMP) to get SIMDe to output the relevant pragmas.

Enabling OpenMP SIMD support varies by compiler:

  • GCC 4.9+ and clang 6+ support a -fopenmp-simd command line flag.
  • ICC supports a -qopenmp-simd command line flag.
  • MCST's LCC enables OpenMP SIMD by default, so no flags are needed (technically you don't even need to pass -DSIMDE_ENABLE_OPENMP).

We are not currently aware of any other compilers which allow you to enable OpenMP SIMD support without enabling full OpenMP (if you are please file an issue to let us know). You should determine whether you wish to enable full OpenMP support on a case-by-case basis, but it is likely that the overhead of linking to (but not using) the OpenMP runtime library will be dwarfed by the performance improvements from using the OpenMP SIMD annotations in SIMDe.

If you choose not to use OpenMP SIMD, SIMDe also supports using Cilk Plus, GCC loop-specific pragmas, or clang pragma loop hint directives, though these are not nearly as effective as OpenMP SIMD and depending on them will likely result in less efficient code. All of these are detected automatically by SIMDe, so if they are enabled in your compiler nothing more is required.

If for some reason you do not wish to enable OpenMP 4 SIMD support even though SIMDe detects it, you should define SIMDE_DISABLE_OPENMP prior to including SIMDe.

Portability

Compilers

SIMDe does depend on a few C99 features, though only those in the subset supported by MSVC. While we do our best to provide optimized implementations where they are supported, SIMDe also contains portable fallbacks which are designed to work on any C99 compiler.

Every commit is tested in CI on multiple compilers, platforms, and configurations, and our test coverage is extremely extensive. Currently tested compilers include:

  • GCC versions back to 4.8
  • Clang versions back to 3.8
  • Microsoft Visual Studio back to 12 (2013)
  • IBM XL C/C++
  • Intel C/C++ Compiler (ICC)

I'm generally willing to accept patches to add support for other compilers as long as they're not too disruptive, especially if we can get CI support going. If using one of our existing CI providers isn't an option, other CI platforms can be added.

Hardware

The following architectures are tested in CI for every commit:

  • x86_64/amd64
  • x86
  • AArch64
  • ARMv8
  • ARMv7 with VFPv3-D16 floating point
  • ARMv5 EABI
  • PPC64
  • z/Architecture (with "-mzvector")
  • MIPS Loongson 64
  • RISC-V 64
  • emscripten 32- & 64-bit; regular and relaxed

We would love to add more, so patches are extremely welcome!

Related Projects

  • The "builtins" module in portable-snippets does much the same thing, but for compiler-specific intrinsics (think __builtin_clz and _BitScanForward), not SIMD intrinsics.
  • Intel offers an emulator, the Intel® Software Development Emulator which can be used to develop software which uses Intel intrinsics without having to own hardware which supports them, though it doesn't help for deployment.
  • Iris is the only other project I'm aware of which is attempting to create portable implementations like SIMDe. SIMDe is much further along on the Intel side, but Iris looks to be in better shape on ARM. C++-only, Apache 2.0 license. AFAICT there are no accelerated fallbacks, nor is there a good way to add them since it relies extensively on templates.
  • There are a few projects trying to implement one set with another:
    • ARM_NEON_2_x86_SSE — implementing NEON using SSE. Quite extensive, Apache 2.0 license.
    • sse2neon — implementing SSE using NEON. This code has already been merged into SIMDe.
    • veclib — implementing SSE2 using AltiVec/VMX, using a non-free IBM library called powerveclib.
    • SSE-to-NEON — implementing SSE with NEON. Non-free, C++.
    • AvxToNeon — popular AVX+ intrinsics implemented in NEON. C, Apache 2.0 license.
    • neon2rvv — a C/C++ header file that converts Arm/AArch64 NEON intrinsics to RISC-V Vector (RVV) Extension intrinsics. MIT license.
    • sse2rvv — a C/C++ header file that converts Intel SSE intrinsics to RISC-V Vector Extension intrinsics. MIT license.
  • arm-neon-tests contains tests to verify NEON implementations.

If you know of any other related projects, please let us know!

Caveats

Sometimes features can't be emulated. If SIMDe is operating in native mode the functions will work as expected, but if there is no native support some caveats apply:

  • Many functions require <math.h> and/or <fenv.h>. SIMDe will still work without those headers, but the results of those functions are undefined.
  • x86 / x86_64
    • SSE
      • SIMDE_MM_SET_ROUNDING_MODE() will use fesetround(), altering the global rounding mode.
      • simde_mm_getcsr and simde_mm_setcsr only implement bits 13 and 14 (rounding mode).
    • AVX
    • simde_mm256_test* do not set the CF/ZF flags as there is no portable way to implement that functionality.
      • simde_mm256_zeroall and simde_mm256_zeroupper are not implemented as there is no portable way to implement that functionality.

Additionally, there are some known limitations which apply when using native aliases (SIMDE_ENABLE_NATIVE_ALIASES):

  • On Windows x86 (but not x86_64), some MMX functions and SSE/SSE2 functions which use MMX types (__m64) other than for pointers may return incorrect results.

Also, as mentioned earlier, while some APIs make assumptions about basic types (e.g., int is 32 bits), SIMDe does not, so many types have been altered to use portable fixed-width versions such as int32_t.

If you find any other differences, please file an issue so we can either fix it or add it to the list above.

Benefactors

SIMDe uses resources provided for free by a number of organizations. While this shouldn't be taken to imply endorsement of SIMDe, we're tremendously grateful for their support:

  • IntegriCloud — provides access to a very fast POWER9 server for developing AltiVec/VMX support.
  • GCC Compile Farm — provides access to a wide range of machines with different architectures for developing support for various ISA extensions.
  • CodeCov.io — provides code coverage analysis for our test cases.
  • Google — financing Summer of Code, substantial amounts of code (Sean Maher's contributions), and an Open Source Peer Bonus.

Without such organizations donating resources, SIMDe wouldn't be nearly as useful or usable as it is today.

We would also like to thank anyone who has helped develop the myriad of software on which SIMDe relies, including compilers and analysis tools.

Finally, a special thank you to anyone who has contributed to SIMDe, filed bugs, provided suggestions, or helped with SIMDe development in any way.

License

SIMDe is distributed under an MIT-style license; see COPYING for details.

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Evan Nemerson

💻 🖋 📖 💡 🤔 💬 👀 ⚠️ 📢 🐛 🚇 🚧 📆

Michael R. Crusoe

🐛 💻 📋 🔍 🤔 🚇 📦 ⚠️ 🚧 📆 👀

HIMANSHI MATHUR

💻 ⚠️

Hidayat Khan

💻 ⚠️

rosbif

💻 ⚠️ 🐛 🤔 📖

Jun Aruga

💻 🤔 📦 🚇 🚧 ⚠️ 🐛

Élie ROUDNINSKI

💻 ⚠️

Jesper Storm Bache

💻

Jeff Daily

💻 🚇

Pavel

💻

Sabarish Bollapragada

💻

Gavin Li

💻

Yining Karl Li

💻

Anirban Dey

📖

Darren Ng

📖

FaresSalem

📖

Pradnyesh Gore

💻

Sean Maher

💻

Mingye Wang

📖

Ng Zhi An

💻 📖

Atharva Nimbalkar

💻 ⚠️

simba611

💻 ⚠️

Ashleigh Newman-Jones

💻 ⚠️

Willy R. Vasquez

💻 🚧 ⚠️

Keith Winstein

💻 🚧 ⚠️

David Seifert

🚧

Milot Mirdita

💻 🚧 ⚠️

aqrit

💻 🚧

Décio Luiz Gazzoni Filho

💻 🚧 ⚠️

Igor Molchanov

💻 🚧 📦

Andrew Rodriguez

💻 🚧 ⚠️

Changqing Jing

🚧

JP Cimalando

💻 🚇

Jiaxun Yang

💻 📦

Masahiro Kitagawa

💻 ⚠️

Pavel Iatchenii

💻 ⚠️

Tommy Vercetti

🚧

Robert Cohn

🚧

Adam Novak

📖

boris-kuz

🚧

Dimo Markov

🚧

dblue

🚧

zekehul

💻 🚧

Laurent Thomas

💻

Max Bachmann

📖

psaab

🚧

Sam Clegg

🚧

Thomas Lively

🐛 🤔 🚧

coderzh

💻 ⚠️

Dominik Kutra

💻 ⚠️

Lithrein

🚧

Nick

🚧

thomasdwu

🚧

Stephen

🐛

John Platts

🐛

Steven Noonan

🐛

p0nce

🐛

Paul Wise

🐛

easyaspi314 (Devin)

🐛 💻

JonLiu1993

📦

Cheney Wang

📦

myd7349

📦

chausner

📦

Yi-Yen Chung

💻 ⚠️

Chi-Wei Chu

💻 ⚠️

M-HT

💻

Simon Gene Gottlieb

💻

Chris Bielow

💻

gu xiwei

📦 ⚠️

George Vinokhodov

💻

Cœur

💻

Florian @Proudsalsa

💻

Thomas Schlichter

🐛 💻

This project follows the all-contributors specification. Contributions of any kind are welcome!


Selected Issues

Support __builtin_shufflevector and convertvector where appropriate

They're clang-specific, but clang uses them to implement a lot of the shuffle/convert intrinsics instead of calling an intrinsic for the specific underlying instruction like GCC does, which means they should be a very fast abstraction we could use for non-native implementations instead of relying on the portable versions.

This should be pretty straightforward, and a big performance win, so it will probably be my priority after SSE2 is finished.

Need instructions for android integration

I am integrating this code in Android. I need to call SSE2 instructions in my C++ code on an ARM CPU platform. Which other files do I need apart from those listed below:
a) hedley.h
b) mmx.h
c) simde-arch.h
d) simde-common.h
e) sse.h
f) sse2.h

SH4: simde/x86/sse.h:2906:10: error: 'FE_UPWARD' was not declared in this scope

Full log at https://buildd.debian.org/status/fetch.php?pkg=libssw&arch=sh4&ver=1.1-6%7E0expsimde0&stamp=1576178932&raw=0

g++ -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -DSIMDE_ENABLE_OPENMP -fopenmp-simd -O3 -DSIMDE_ENABLE_OPENMP -fopenmp-simd -O3 -fPIC -shared -rdynamic -Wl,-soname,libssw.so.0 -o libssw.so.0 ssw.c ssw.h ssw_cpp.h ssw_cpp.cpp -Wl,-z,relro -Wl,-z,now
In file included from ../debian/include/simde/x86/sse2.h:34,
                 from ssw.c:38:
../debian/include/simde/x86/sse.h: In function 'uint32_t simde_mm_getcsr()':
../debian/include/simde/x86/sse.h:2906:10: error: 'FE_UPWARD' was not declared in this scope
 2906 |     case FE_UPWARD:
      |          ^~~~~~~~~
../debian/include/simde/x86/sse.h:2909:10: error: 'FE_DOWNWARD' was not declared in this scope; did you mean 'FP_INT_DOWNWARD'?
 2909 |     case FE_DOWNWARD:
      |          ^~~~~~~~~~~
      |          FP_INT_DOWNWARD
../debian/include/simde/x86/sse.h: In function 'void simde_mm_setcsr(uint32_t)':
../debian/include/simde/x86/sse.h:2935:18: error: 'FE_DOWNWARD' was not declared in this scope; did you mean 'FP_INT_DOWNWARD'?
 2935 |       fesetround(FE_DOWNWARD);
      |                  ^~~~~~~~~~~
      |                  FP_INT_DOWNWARD
../debian/include/simde/x86/sse.h:2938:18: error: 'FE_UPWARD' was not declared in this scope
 2938 |       fesetround(FE_UPWARD);
      |                  ^~~~~~~~~

PGI C Compiler bugs

There are a decent number of issues in PGI that we are currently working around in SIMDe; this is a meta-bug for tracking them. I've reported them to PGI and am just waiting on fixes.

I'll try to keep this list updated as we add new functions.

General:

  • ICE when compiling in C++ mode (comment)

MMX:

  • _mm_cvtm64_si64 — function missing
  • _mm_cvtsi64_m64 — function missing
  • _mm_slli_pi16 — incorrect result
  • _mm_slli_pi32 — incorrect result
  • _mm_srli_pi16 — incorrect result
  • _mm_srli_pi32 — incorrect result
  • _mm_srli_pi64 — incorrect result
  • _mm_srai_pi16 — incorrect result
  • _mm_srai_pi32 — incorrect result

SSE:

  • _mm_cmpge_ss — incorrect result
  • _mm_cmpgt_ss — incorrect result
  • _mm_cmpnge_ss — incorrect result
  • _mm_cmpngt_ss — incorrect result
  • _mm_cmpunord_ss — incorrect result
  • _mm_cvtsi64_ss — function missing (but the _mm_cvtsi64x_ss alias exists)
  • _mm_cvtss_si64 — function missing (but the _mm_cvtss_si64x alias exists)
  • _mm_cvttss_si64 — function missing (but the _mm_cvttss_si64x alias exists)
  • _mm_insert_pi16 — triggers infinite loop during compilation
  • _mm_shuffle_pi16 — triggers infinite loop during compilation
  • _mm_shuffle_ps — triggers infinite loop during compilation
  • _mm_undefined_ps — function missing

SSE2:

  • _mm_bslli_si128 — triggers infinite loop during compilation
  • _mm_bsrli_si128 — triggers infinite loop during compilation
  • _mm_cmpgt_sd — incorrect result
  • _mm_cmpge_sd — incorrect result
  • _mm_cmpnge_sd — incorrect result
  • _mm_cvtsd_f64 — incorrect result
  • _mm_cvtsd_si64 — function missing (but the _mm_cvtsd_si64x alias exists)
  • _mm_cvtsi128_si64 — function missing (but the _mm_cvtsi128_si64x alias exists)
  • _mm_cvtsi64_sd — function missing (but the _mm_cvtsi64_sdx alias exists)
  • _mm_cvtsi64_si128 — function missing (but the _mm_cvtsi64_si128x alias exists)
  • _mm_cvttsd_si64 — function missing (but the _mm_cvttsd_si64x alias exists)
  • _mm_insert_epi16 — triggers infinite loop during compilation
  • _mm_mul_su32 — incorrect result, fixable
  • _mm_mulhi_epu16 — incorrect result, fixable
  • _mm_shuffle_pd — triggers infinite loop during compilation
  • _mm_undefined_pd — function missing
  • _mm_undefined_si128 — function missing

AVX2:

  • _mm256_shufflelo_epi16 — ICE

Some of these functions (annotated above with "fixable") can be fixed with trivial changes to the headers in question.

--- emmintrin-orig.h   2017-05-07 20:32:21.726746806 -0700
+++ emmintrin.h   2017-05-08 19:05:51.769286452 -0700
@@ -758,7 +758,7 @@
 
   __u.__v = __A;
 
-  return __u.__a[1];
+  return __u.__a[0];
 }
 
 /* Create the vector [Y Z].  */
@@ -2360,7 +2360,7 @@
 ATTRIBUTE __m128i
 _mm_mulhi_epu16(__m128i __A, __m128i __B)
 {
-  __asm__("PMULHW %1, %0" : "=x"(__A) : "x"(__B), "0"(__A));
+  __asm__("PMULHUW %1, %0" : "=x"(__A) : "x"(__B), "0"(__A));
   return __A;
 }

Some other functions could probably be fixed like this, too, but I haven't checked them all.

Improve compile-time checks for shifting by scalar

There are a ton of functions that shift a vector by a scalar. We should do what we can to validate the range at compile time. For most of these functions the scalar is an immediate (a compile-time constant is required).

For our functions (i.e., the portable fallbacks), we should use the HEDLEY_REQUIRE_MSG macro, which compiles to a diagnose_if attribute in clang. There are a bunch of examples in avx.h; we should just do something similar in sse, sse2, etc. Here is an example patch:

diff --git a/simde/x86/sse2.h b/simde/x86/sse2.h
index b2a7174..925b109 100644
--- a/simde/x86/sse2.h
+++ b/simde/x86/sse2.h
@@ -3925,7 +3925,8 @@ simde_mm_sra_epi32 (simde__m128i a, simde__m128i count) {
 
 SIMDE__FUNCTION_ATTRIBUTES
 simde__m128i
-simde_mm_slli_epi16 (simde__m128i a, const int imm8) {
+simde_mm_slli_epi16 (simde__m128i a, const int imm8)
+    HEDLEY_REQUIRE_MSG((imm8 & 15) == imm8, "imm8 must be in range [0, 15]") {
   simde__m128i r;
 
   const int s = (imm8 > HEDLEY_STATIC_CAST(int, sizeof(r.i16[0]) * CHAR_BIT) - 1) ? 0 : imm8;

The Intel documentation makes it clear for a lot of these functions that values outside of the expected range are acceptable (i.e., you can shift an int16_t by 100 bits) as they just use imm8 & 0xff, but clang (and sometimes gcc) will often emit an error, which I think is the right choice; clearing should be done with one of the setzero functions.

This would be an excellent first issue: a good way to get introduced to the codebase while doing something useful for a lot of people.

extra header to map the simde names to the original names

hi nemequ,

Just one other idea I ran across while using SIMDe: it would be really nice if there were an optional extra header available as part of SIMDe which maps all the simde_ names back to their regular counterparts. That way one could simply replace the occurrences of the original *mintrin.h includes with the SIMDe headers plus those new headers, and run code unmodified if everything works fine. Right now it looks like I have to change all occurrences of the original SSE _m... names to their simde counterparts by hand or with some script, which is error-prone.

Maybe such a header could be generated automatically from the SIMDe sources.

best wishes - hexdump

Use simde_mm_move_sd to simplify simde_mm_*_sd functions

For SSE2 and above, there are a lot of simde_mm_*_sd functions that could easily be implemented using simde_mm_move_sd. For example,

SIMDE__FUNCTION_ATTRIBUTES
simde__m128d
simde_mm_sub_sd (simde__m128d a, simde__m128d b) {
  simde__m128d r;

#if defined(SIMDE_SSE2_NATIVE)
  r.n = _mm_sub_sd(a.n, b.n);
#elif defined(SIMDE_ASSUME_VECTORIZATION)
  r = simde_mm_move_sd(a, simde_mm_sub_pd(a, b));
#else
  r.f64[0] = a.f64[0] - b.f64[0];
  r.f64[1] = a.f64[1];
#endif

  return r;
}

This lets us take advantage of the vectorization that often already exists in the simde_mm_*_pd functions without constantly duplicating the logic for simde_mm_move_sd in each function. It also greatly simplifies adding new implementations for other ISA extensions.

Basically, it's less code, should be just as fast, and is much less error-prone.

I already did this for SSE with the _mm_move_ss function; see d01896e. We just need to do the same thing for _mm_move_sd.

SSSE3 functions

Per https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSSE3

  • _mm_abs_epi16
  • _mm_abs_epi32
  • _mm_abs_epi8
  • _mm_abs_pi16
  • _mm_abs_pi32
  • _mm_abs_pi8
  • _mm_alignr_epi8
  • _mm_alignr_pi8
  • _mm_hadd_epi16
  • _mm_hadd_epi32
  • _mm_hadd_pi16
  • _mm_hadd_pi32
  • _mm_hadds_epi16
  • _mm_hadds_pi16
  • _mm_hsub_epi16
  • _mm_hsub_epi32
  • _mm_hsub_pi16
  • _mm_hsub_pi32
  • _mm_hsubs_epi16
  • _mm_hsubs_pi16
  • _mm_maddubs_epi16
  • _mm_maddubs_pi16
  • _mm_mulhrs_epi16
  • _mm_mulhrs_pi16
  • _mm_shuffle_epi8
  • _mm_shuffle_pi8
  • _mm_sign_epi16
  • _mm_sign_epi32
  • _mm_sign_epi8
  • _mm_sign_pi16
  • _mm_sign_pi32
  • _mm_sign_pi8

SVML functions

I've been hesitating on this one because I expect many of them to be a bit painful to implement, but SVML support would be a pretty nice addition to SIMDe:

  • _mm256_acos_pd
  • _mm256_acos_ps
  • _mm256_acosh_pd
  • _mm256_acosh_ps
  • _mm256_asin_pd
  • _mm256_asin_ps
  • _mm256_asinh_pd
  • _mm256_asinh_ps
  • _mm256_atan_pd
  • _mm256_atan_ps
  • _mm256_atan2_pd
  • _mm256_atan2_ps
  • _mm256_atanh_pd
  • _mm256_atanh_ps
  • _mm256_cbrt_pd
  • _mm256_cbrt_ps
  • _mm256_cdfnorm_pd
  • _mm256_cdfnorm_ps
  • _mm256_cdfnorminv_pd
  • _mm256_cdfnorminv_ps
  • _mm256_cexp_ps
  • _mm256_clog_ps
  • _mm256_cos_pd
  • _mm256_cos_ps
  • _mm256_cosd_pd
  • _mm256_cosd_ps
  • _mm256_cosh_pd
  • _mm256_cosh_ps
  • _mm256_csqrt_ps
  • _mm256_div_epi8
  • _mm256_div_epi16
  • _mm256_div_epi32
  • _mm256_div_epi64
  • _mm256_div_epu8
  • _mm256_div_epu16
  • _mm256_div_epu32
  • _mm256_div_epu64
  • _mm256_erf_pd
  • _mm256_erf_ps
  • _mm256_erfc_pd
  • _mm256_erfc_ps
  • _mm256_erfcinv_pd
  • _mm256_erfcinv_ps
  • _mm256_erfinv_pd
  • _mm256_erfinv_ps
  • _mm256_exp_pd
  • _mm256_exp_ps
  • _mm256_exp10_pd
  • _mm256_exp10_ps
  • _mm256_exp2_pd
  • _mm256_exp2_ps
  • _mm256_expm1_pd
  • _mm256_expm1_ps
  • _mm256_hypot_pd
  • _mm256_hypot_ps
  • _mm256_idiv_epi32
  • _mm256_idivrem_epi32
  • _mm256_invcbrt_pd
  • _mm256_invcbrt_ps
  • _mm256_invsqrt_pd
  • _mm256_invsqrt_ps
  • _mm256_irem_epi32
  • _mm256_log_pd
  • _mm256_log_ps
  • _mm256_log10_pd
  • _mm256_log10_ps
  • _mm256_log1p_pd
  • _mm256_log1p_ps
  • _mm256_log2_pd
  • _mm256_log2_ps
  • _mm256_logb_pd
  • _mm256_logb_ps
  • _mm256_pow_pd
  • _mm256_pow_ps
  • _mm256_rem_epi8
  • _mm256_rem_epi16
  • _mm256_rem_epi32
  • _mm256_rem_epi64
  • _mm256_rem_epu8
  • _mm256_rem_epu16
  • _mm256_rem_epu32
  • _mm256_rem_epu64
  • _mm256_sin_pd
  • _mm256_sin_ps
  • _mm256_sincos_pd
  • _mm256_sincos_ps
  • _mm256_sind_pd
  • _mm256_sind_ps
  • _mm256_sinh_pd
  • _mm256_sinh_ps
  • _mm256_svml_ceil_pd
  • _mm256_svml_ceil_ps
  • _mm256_svml_floor_pd
  • _mm256_svml_floor_ps
  • _mm256_svml_round_pd
  • _mm256_svml_round_ps
  • _mm256_svml_sqrt_pd
  • _mm256_svml_sqrt_ps
  • _mm256_tan_pd
  • _mm256_tan_ps
  • _mm256_tand_pd
  • _mm256_tand_ps
  • _mm256_tanh_pd
  • _mm256_tanh_ps
  • _mm256_trunc_pd
  • _mm256_trunc_ps
  • _mm256_udiv_epi32
  • _mm256_udivrem_epi32
  • _mm256_urem_epi32
  • _mm512_acos_pd
  • _mm512_mask_acos_pd
  • _mm512_acos_ps
  • _mm512_mask_acos_ps
  • _mm512_acosh_pd
  • _mm512_mask_acosh_pd
  • _mm512_acosh_ps
  • _mm512_mask_acosh_ps
  • _mm512_asin_pd
  • _mm512_mask_asin_pd
  • _mm512_asin_ps
  • _mm512_mask_asin_ps
  • _mm512_asinh_pd
  • _mm512_mask_asinh_pd
  • _mm512_asinh_ps
  • _mm512_mask_asinh_ps
  • _mm512_atan2_pd
  • _mm512_mask_atan2_pd
  • _mm512_atan2_ps
  • _mm512_mask_atan2_ps
  • _mm512_atan_pd
  • _mm512_mask_atan_pd
  • _mm512_atan_ps
  • _mm512_mask_atan_ps
  • _mm512_atanh_pd
  • _mm512_mask_atanh_pd
  • _mm512_atanh_ps
  • _mm512_mask_atanh_ps
  • _mm512_cbrt_pd
  • _mm512_mask_cbrt_pd
  • _mm512_cbrt_ps
  • _mm512_mask_cbrt_ps
  • _mm512_cdfnorm_pd
  • _mm512_mask_cdfnorm_pd
  • _mm512_cdfnorm_ps
  • _mm512_mask_cdfnorm_ps
  • _mm512_cdfnorminv_pd
  • _mm512_mask_cdfnorminv_pd
  • _mm512_cdfnorminv_ps
  • _mm512_mask_cdfnorminv_ps
  • _mm512_ceil_pd
  • _mm512_mask_ceil_pd
  • _mm512_ceil_ps
  • _mm512_mask_ceil_ps
  • _mm512_cos_pd
  • _mm512_mask_cos_pd
  • _mm512_cos_ps
  • _mm512_mask_cos_ps
  • _mm512_cosd_pd
  • _mm512_mask_cosd_pd
  • _mm512_cosd_ps
  • _mm512_mask_cosd_ps
  • _mm512_cosh_pd
  • _mm512_mask_cosh_pd
  • _mm512_cosh_ps
  • _mm512_mask_cosh_ps
  • _mm512_erf_pd
  • _mm512_mask_erf_pd
  • _mm512_erfc_pd
  • _mm512_mask_erfc_pd
  • _mm512_erf_ps
  • _mm512_mask_erf_ps
  • _mm512_erfc_ps
  • _mm512_mask_erfc_ps
  • _mm512_erfinv_pd
  • _mm512_mask_erfinv_pd
  • _mm512_erfinv_ps
  • _mm512_mask_erfinv_ps
  • _mm512_erfcinv_pd
  • _mm512_mask_erfcinv_pd
  • _mm512_erfcinv_ps
  • _mm512_mask_erfcinv_ps
  • _mm512_exp10_pd
  • _mm512_mask_exp10_pd
  • _mm512_exp10_ps
  • _mm512_mask_exp10_ps
  • _mm512_exp2_pd
  • _mm512_mask_exp2_pd
  • _mm512_exp2_ps
  • _mm512_mask_exp2_ps
  • _mm512_exp_pd
  • _mm512_mask_exp_pd
  • _mm512_exp_ps
  • _mm512_mask_exp_ps
  • _mm512_expm1_pd
  • _mm512_mask_expm1_pd
  • _mm512_expm1_ps
  • _mm512_mask_expm1_ps
  • _mm512_floor_pd
  • _mm512_mask_floor_pd
  • _mm512_floor_ps
  • _mm512_mask_floor_ps
  • _mm512_hypot_pd
  • _mm512_mask_hypot_pd
  • _mm512_hypot_ps
  • _mm512_mask_hypot_ps
  • _mm512_div_epi32
  • _mm512_mask_div_epi32
  • _mm512_div_epi8
  • _mm512_div_epi16
  • _mm512_div_epi64
  • _mm512_invsqrt_pd
  • _mm512_mask_invsqrt_pd
  • _mm512_invsqrt_ps
  • _mm512_mask_invsqrt_ps
  • _mm512_rem_epi32
  • _mm512_mask_rem_epi32
  • _mm512_rem_epi8
  • _mm512_rem_epi16
  • _mm512_rem_epi64
  • _mm512_log10_pd
  • _mm512_mask_log10_pd
  • _mm512_log10_ps
  • _mm512_mask_log10_ps
  • _mm512_log1p_pd
  • _mm512_mask_log1p_pd
  • _mm512_log1p_ps
  • _mm512_mask_log1p_ps
  • _mm512_log2_pd
  • _mm512_mask_log2_pd
  • _mm512_log_pd
  • _mm512_mask_log_pd
  • _mm512_log_ps
  • _mm512_mask_log_ps
  • _mm512_logb_pd
  • _mm512_mask_logb_pd
  • _mm512_logb_ps
  • _mm512_mask_logb_ps
  • _mm512_nearbyint_pd
  • _mm512_mask_nearbyint_pd
  • _mm512_nearbyint_ps
  • _mm512_mask_nearbyint_ps
  • _mm512_pow_pd
  • _mm512_mask_pow_pd
  • _mm512_pow_ps
  • _mm512_mask_pow_ps
  • _mm512_recip_pd
  • _mm512_mask_recip_pd
  • _mm512_recip_ps
  • _mm512_mask_recip_ps
  • _mm512_rint_pd
  • _mm512_mask_rint_pd
  • _mm512_rint_ps
  • _mm512_mask_rint_ps
  • _mm512_svml_round_pd
  • _mm512_mask_svml_round_pd
  • _mm512_sin_pd
  • _mm512_mask_sin_pd
  • _mm512_sin_ps
  • _mm512_mask_sin_ps
  • _mm512_sinh_pd
  • _mm512_mask_sinh_pd
  • _mm512_sinh_ps
  • _mm512_mask_sinh_ps
  • _mm512_sind_pd
  • _mm512_mask_sind_pd
  • _mm512_sind_ps
  • _mm512_mask_sind_ps
  • _mm512_tan_pd
  • _mm512_mask_tan_pd
  • _mm512_tan_ps
  • _mm512_mask_tan_ps
  • _mm512_tand_pd
  • _mm512_mask_tand_pd
  • _mm512_tand_ps
  • _mm512_mask_tand_ps
  • _mm512_tanh_pd
  • _mm512_mask_tanh_pd
  • _mm512_tanh_ps
  • _mm512_mask_tanh_ps
  • _mm512_trunc_pd
  • _mm512_mask_trunc_pd
  • _mm512_trunc_ps
  • _mm512_mask_trunc_ps
  • _mm512_div_epu32
  • _mm512_mask_div_epu32
  • _mm512_div_epu8
  • _mm512_div_epu16
  • _mm512_div_epu64
  • _mm512_rem_epu32
  • _mm512_mask_rem_epu32
  • _mm512_rem_epu8
  • _mm512_rem_epu16
  • _mm512_rem_epu64
  • _mm512_sincos_pd
  • _mm512_mask_sincos_pd
  • _mm512_sincos_ps
  • _mm512_mask_sincos_ps
  • _mm_acos_pd
  • _mm_acos_ps
  • _mm_acosh_pd
  • _mm_acosh_ps
  • _mm_asin_pd
  • _mm_asin_ps
  • _mm_asinh_pd
  • _mm_asinh_ps
  • _mm_atan_pd
  • _mm_atan_ps
  • _mm_atan2_pd
  • _mm_atan2_ps
  • _mm_atanh_pd
  • _mm_atanh_ps
  • _mm_cbrt_pd
  • _mm_cbrt_ps
  • _mm_cdfnorm_pd
  • _mm_cdfnorm_ps
  • _mm_cdfnorminv_pd
  • _mm_cdfnorminv_ps
  • _mm_cexp_ps
  • _mm_clog_ps
  • _mm_cos_pd
  • _mm_cos_ps
  • _mm_cosd_pd
  • _mm_cosd_ps
  • _mm_cosh_pd
  • _mm_cosh_ps
  • _mm_csqrt_ps
  • _mm_div_epi8
  • _mm_div_epi16
  • _mm_div_epi32
  • _mm_div_epi64
  • _mm_div_epu8
  • _mm_div_epu16
  • _mm_div_epu32
  • _mm_div_epu64
  • _mm_erf_pd
  • _mm_erf_ps
  • _mm_erfc_pd
  • _mm_erfc_ps
  • _mm_erfcinv_pd
  • _mm_erfcinv_ps
  • _mm_erfinv_pd
  • _mm_erfinv_ps
  • _mm_exp_pd
  • _mm_exp_ps
  • _mm_exp10_pd
  • _mm_exp10_ps
  • _mm_exp2_pd
  • _mm_exp2_ps
  • _mm_expm1_pd
  • _mm_expm1_ps
  • _mm_hypot_pd
  • _mm_hypot_ps
  • _mm_idiv_epi32
  • _mm_idivrem_epi32
  • _mm_invcbrt_pd
  • _mm_invcbrt_ps
  • _mm_invsqrt_pd
  • _mm_invsqrt_ps
  • _mm_irem_epi32
  • _mm_log_pd
  • _mm_log_ps
  • _mm_log10_pd
  • _mm_log10_ps
  • _mm_log1p_pd
  • _mm_log1p_ps
  • _mm_log2_pd
  • _mm_log2_ps
  • _mm_logb_pd
  • _mm_logb_ps
  • _mm_pow_pd
  • _mm_pow_ps
  • _mm_rem_epi8
  • _mm_rem_epi16
  • _mm_rem_epi32
  • _mm_rem_epi64
  • _mm_rem_epu8
  • _mm_rem_epu16
  • _mm_rem_epu32
  • _mm_rem_epu64
  • _mm_sin_pd
  • _mm_sin_ps
  • _mm_sincos_pd
  • _mm_sincos_ps
  • _mm_sind_pd
  • _mm_sind_ps
  • _mm_sinh_pd
  • _mm_sinh_ps
  • _mm_svml_ceil_pd
  • _mm_svml_ceil_ps
  • _mm_svml_floor_pd
  • _mm_svml_floor_ps
  • _mm_svml_round_pd
  • _mm_svml_round_ps
  • _mm_svml_sqrt_pd
  • _mm_svml_sqrt_ps
  • _mm_tan_pd
  • _mm_tan_ps
  • _mm_tand_pd
  • _mm_tand_ps
  • _mm_tanh_pd
  • _mm_tanh_ps
  • _mm_trunc_pd
  • _mm_trunc_ps
  • _mm_udiv_epi32
  • _mm_udivrem_epi32
  • _mm_urem_epi32

x86-64 AES-NI support

It would be nice if SIMDe implemented support for AES, especially the AES round functions, since that part of AES is also used in many hash algorithms and similar constructions.

Many x86-based CPUs support this via AES-NI, and many ARMv8 cores implement it via the Crypto Extensions. For ARM CPUs that lack the Crypto Extensions, it should be possible to build an implementation from other NEON intrinsics.

I've submitted a PR to sse2neon that implements _mm_aesenc_si128, the most important of these instructions; it might be a useful starting point: DLTcollab/sse2neon#6

  • _mm_aesenc_si128
  • _mm_aesdec_si128
  • _mm_aesdeclast_si128
  • _mm_aesenclast_si128
  • _mm_aesimc_si128
  • _mm_aeskeygenassist_si128

Reference: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=AES

Missing macro SIMDE_MM_TRANSPOSE4_PS

I've added this in my own code that includes <x86/sse.h>, but AFAIK it really belongs inside that header.

#ifndef SIMDE_MM_TRANSPOSE4_PS /* this should be in <x86/sse.h> */
/* Macro: Transpose the 4x4 matrix formed by the four rows of single-precision
   (32-bit) floating-point elements in row0, row1, row2, and row3, and store the
   transposed matrix in these vectors (row0 now contains column 0, etc.). */
#define SIMDE_MM_TRANSPOSE4_PS(row0, row1, row2, row3)  \
  do {                                                  \
    simde__m128 tmp3, tmp2, tmp1, tmp0;                 \
    tmp0 = simde_mm_unpacklo_ps(row0, row1);            \
    tmp2 = simde_mm_unpacklo_ps(row2, row3);            \
    tmp1 = simde_mm_unpackhi_ps(row0, row1);            \
    tmp3 = simde_mm_unpackhi_ps(row2, row3);            \
    row0 = simde_mm_movelh_ps(tmp0, tmp2);              \
    row1 = simde_mm_movehl_ps(tmp2, tmp0);              \
    row2 = simde_mm_movelh_ps(tmp1, tmp3);              \
    row3 = simde_mm_movehl_ps(tmp3, tmp1);              \
  } while (0)
#endif /* SIMDE_MM_TRANSPOSE4_PS */

Hope it works.

P.S. Very useful project. Thank you for sharing.

Versioned release?

Hello, I'm considering packaging simde for Debian. Could you make a tagged release?

Thanks!

C++ testing

A tracking issue to remind us to develop (autogenerated) C++ tests and execute them in Travis CI / AppVeyor.

MSA functions

For progress information, see https://github.com/simd-everywhere/implementation-status/blob/main/msa.md

Due to QEMU bugs we can't just use the versions shipped with Debian for testing. The latest git works, though, and the Docker image can be configured to download and compile it. Use QEMU_GIT=y ./docker/simde-dev.sh (or, if you already have a Docker image built, QEMU_GIT=y BUILD_IMAGE=y ./docker/simde-dev.sh to force a rebuild) to get it.

Once you have qemu from git, you can use the mips64el+msa-gcc-10 build to target MSA and generate test vectors.

SSE test failures on x86 32-bit

I was working on running wip/meson on Travis, and when testing locally I added -m32 to the cflags. After a couple of tweaks (52bc6bd) it's running, but a bunch of tests fail:

  • mm_cvt_pi2ps
  • mm_cvt_ps2pi
  • mm_cvtpi16_ps
  • mm_cvtpi8_ps
  • mm_cvtps_pi16
  • mm_cvtps_pi32
  • mm_cvtps_pi8
  • mm_cvtpu16_ps
  • mm_cvtpu8_ps
  • mm_cvtt_ps2pi

I haven't looked at all of them, but it seems like the bugs are really in the tests, not SIMDe.

SIMD.js / Emscripten

https://github.com/kripken/emscripten/blob/master/system/include/emscripten/vector.h

  • emscripten_float64x2_set
  • emscripten_float64x2_splat
  • emscripten_float64x2_add
  • emscripten_float64x2_sub
  • emscripten_float64x2_mul
  • emscripten_float64x2_div
  • emscripten_float64x2_max
  • emscripten_float64x2_min
  • emscripten_float64x2_maxNum
  • emscripten_float64x2_minNum
  • emscripten_float64x2_neg
  • emscripten_float64x2_sqrt
  • emscripten_float64x2_reciprocalApproximation
  • emscripten_float64x2_reciprocalSqrtApproximation
  • emscripten_float64x2_abs
  • emscripten_float64x2_and
  • emscripten_float64x2_fromInt32x4Bits
  • emscripten_int32x4_and
  • emscripten_int32x4_fromFloat64x2Bits
  • emscripten_float64x2_xor
  • emscripten_float64x2_fromInt32x4Bits
  • emscripten_int32x4_xor
  • emscripten_int32x4_fromFloat64x2Bits
  • emscripten_float64x2_or
  • emscripten_float64x2_fromInt32x4Bits
  • emscripten_int32x4_or
  • emscripten_int32x4_fromFloat64x2Bits
  • emscripten_float64x2_not
  • emscripten_float64x2_fromInt32x4Bits
  • emscripten_int32x4_not
  • emscripten_int32x4_fromFloat64x2Bits
  • emscripten_float64x2_lessThan
  • emscripten_float64x2_lessThanOrEqual
  • emscripten_float64x2_greaterThan
  • emscripten_float64x2_greaterThanOrEqual
  • emscripten_float64x2_equal
  • emscripten_float64x2_notEqual
  • emscripten_float64x2_select
  • emscripten_float64x2_extractLane
  • emscripten_float64x2_replaceLane
  • emscripten_float64x2_store
  • emscripten_float64x2_store1
  • emscripten_float64x2_load
  • emscripten_float64x2_load1
  • emscripten_float64x2_fromFloat32x4Bits
  • emscripten_float64x2_fromInt32x4Bits
  • emscripten_float64x2_fromUint32x4Bits
  • emscripten_float64x2_fromInt16x8Bits
  • emscripten_float64x2_fromUint16x8Bits
  • emscripten_float64x2_fromInt8x16Bits
  • emscripten_float64x2_fromUint8x16Bits
  • emscripten_float64x2_swizzle
  • emscripten_float64x2_shuffle
  • emscripten_float32x4_set
  • emscripten_float32x4_splat
  • emscripten_float32x4_add
  • emscripten_float32x4_sub
  • emscripten_float32x4_mul
  • emscripten_float32x4_div
  • emscripten_float32x4_max
  • emscripten_float32x4_min
  • emscripten_float32x4_maxNum
  • emscripten_float32x4_minNum
  • emscripten_float32x4_neg
  • emscripten_float32x4_sqrt
  • emscripten_float32x4_reciprocalApproximation
  • emscripten_float32x4_reciprocalSqrtApproximation
  • emscripten_float32x4_abs
  • emscripten_float32x4_and
  • emscripten_float32x4_fromInt32x4Bits
  • emscripten_int32x4_and
  • emscripten_int32x4_fromFloat32x4Bits
  • emscripten_float32x4_xor
  • emscripten_float32x4_fromInt32x4Bits
  • emscripten_int32x4_xor
  • emscripten_int32x4_fromFloat32x4Bits
  • emscripten_float32x4_or
  • emscripten_float32x4_fromInt32x4Bits
  • emscripten_int32x4_or
  • emscripten_int32x4_fromFloat32x4Bits
  • emscripten_float32x4_not
  • emscripten_float32x4_fromInt32x4Bits
  • emscripten_int32x4_not
  • emscripten_int32x4_fromFloat32x4Bits
  • emscripten_float32x4_lessThan
  • emscripten_float32x4_lessThanOrEqual
  • emscripten_float32x4_greaterThan
  • emscripten_float32x4_greaterThanOrEqual
  • emscripten_float32x4_equal
  • emscripten_float32x4_notEqual
  • emscripten_float32x4_select
  • emscripten_float32x4_extractLane
  • emscripten_float32x4_replaceLane
  • emscripten_float32x4_store
  • emscripten_float32x4_store1
  • emscripten_float32x4_store2
  • emscripten_float32x4_load
  • emscripten_float32x4_load1
  • emscripten_float32x4_load2
  • emscripten_float32x4_fromFloat64x2Bits
  • emscripten_float32x4_fromInt32x4Bits
  • emscripten_float32x4_fromUint32x4Bits
  • emscripten_float32x4_fromInt16x8Bits
  • emscripten_float32x4_fromUint16x8Bits
  • emscripten_float32x4_fromInt8x16Bits
  • emscripten_float32x4_fromUint8x16Bits
  • emscripten_float32x4_fromInt32x4
  • emscripten_float32x4_fromUint32x4
  • emscripten_float32x4_swizzle
  • emscripten_float32x4_shuffle
  • emscripten_int32x4_set
  • emscripten_int32x4_splat
  • emscripten_int32x4_add
  • emscripten_int32x4_sub
  • emscripten_int32x4_mul
  • emscripten_int32x4_neg
  • emscripten_int32x4_and
  • emscripten_int32x4_xor
  • emscripten_int32x4_or
  • emscripten_int32x4_not
  • emscripten_int32x4_lessThan
  • emscripten_int32x4_lessThanOrEqual
  • emscripten_int32x4_greaterThan
  • emscripten_int32x4_greaterThanOrEqual
  • emscripten_int32x4_equal
  • emscripten_int32x4_notEqual
  • emscripten_int32x4_anyTrue
  • emscripten_bool32x4_anyTrue
  • emscripten_int32x4_allTrue
  • emscripten_bool32x4_allTrue
  • emscripten_int32x4_select
  • emscripten_int32x4_shiftLeftByScalar
  • emscripten_int32x4_shiftRightByScalar
  • emscripten_int32x4_extractLane
  • emscripten_int32x4_replaceLane
  • emscripten_int32x4_store
  • emscripten_int32x4_store1
  • emscripten_int32x4_store2
  • emscripten_int32x4_load
  • emscripten_int32x4_load1
  • emscripten_int32x4_load2
  • emscripten_int32x4_fromFloat64x2Bits
  • emscripten_int32x4_fromFloat32x4Bits
  • emscripten_int32x4_fromUint32x4Bits
  • emscripten_int32x4_fromInt16x8Bits
  • emscripten_int32x4_fromUint16x8Bits
  • emscripten_int32x4_fromInt8x16Bits
  • emscripten_int32x4_fromUint8x16Bits
  • emscripten_int32x4_fromFloat32x4
  • emscripten_int32x4_fromUint32x4
  • emscripten_int32x4_fromFloat64x2
  • emscripten_int32x4_swizzle
  • emscripten_int32x4_shuffle
  • emscripten_uint32x4_set
  • emscripten_uint32x4_splat
  • emscripten_uint32x4_add
  • emscripten_uint32x4_sub
  • emscripten_uint32x4_mul
  • emscripten_uint32x4_neg
  • emscripten_uint32x4_and
  • emscripten_uint32x4_xor
  • emscripten_uint32x4_or
  • emscripten_uint32x4_not
  • emscripten_uint32x4_lessThan
  • emscripten_uint32x4_lessThanOrEqual
  • emscripten_uint32x4_greaterThan
  • emscripten_uint32x4_greaterThanOrEqual
  • emscripten_uint32x4_equal
  • emscripten_uint32x4_notEqual
  • emscripten_uint32x4_anyTrue
  • emscripten_bool32x4_anyTrue
  • emscripten_uint32x4_allTrue
  • emscripten_bool32x4_allTrue
  • emscripten_uint32x4_select
  • emscripten_uint32x4_shiftLeftByScalar
  • emscripten_uint32x4_shiftRightByScalar
  • emscripten_uint32x4_extractLane
  • emscripten_uint32x4_replaceLane
  • emscripten_uint32x4_store
  • emscripten_uint32x4_store1
  • emscripten_uint32x4_store2
  • emscripten_uint32x4_load
  • emscripten_uint32x4_load1
  • emscripten_uint32x4_load2
  • emscripten_uint32x4_fromFloat64x2Bits
  • emscripten_uint32x4_fromFloat32x4Bits
  • emscripten_uint32x4_fromInt32x4Bits
  • emscripten_uint32x4_fromInt16x8Bits
  • emscripten_uint32x4_fromUint16x8Bits
  • emscripten_uint32x4_fromInt8x16Bits
  • emscripten_uint32x4_fromUint8x16Bits
  • emscripten_uint32x4_fromFloat32x4
  • emscripten_uint32x4_fromInt32x4
  • emscripten_uint32x4_fromFloat64x2
  • emscripten_uint32x4_swizzle
  • emscripten_uint32x4_shuffle
  • emscripten_int16x8_set
  • emscripten_int16x8_splat
  • emscripten_int16x8_add
  • emscripten_int16x8_sub
  • emscripten_int16x8_mul
  • emscripten_int16x8_neg
  • emscripten_int16x8_and
  • emscripten_int16x8_xor
  • emscripten_int16x8_or
  • emscripten_int16x8_not
  • emscripten_int16x8_lessThan
  • emscripten_int16x8_lessThanOrEqual
  • emscripten_int16x8_greaterThan
  • emscripten_int16x8_greaterThanOrEqual
  • emscripten_int16x8_equal
  • emscripten_int16x8_notEqual
  • emscripten_int16x8_anyTrue
  • emscripten_bool16x8_anyTrue
  • emscripten_int16x8_allTrue
  • emscripten_bool16x8_allTrue
  • emscripten_int16x8_select
  • emscripten_int16x8_addSaturate
  • emscripten_int16x8_subSaturate
  • emscripten_int16x8_shiftLeftByScalar
  • emscripten_int16x8_shiftRightByScalar
  • emscripten_int16x8_extractLane
  • emscripten_int16x8_replaceLane
  • emscripten_int16x8_store
  • emscripten_int16x8_load
  • emscripten_int16x8_fromFloat64x2Bits
  • emscripten_int16x8_fromFloat32x4Bits
  • emscripten_int16x8_fromInt32x4Bits
  • emscripten_int16x8_fromUint32x4Bits
  • emscripten_int16x8_fromUint16x8Bits
  • emscripten_int16x8_fromInt8x16Bits
  • emscripten_int16x8_fromUint8x16Bits
  • emscripten_int16x8_fromUint16x8
  • emscripten_int16x8_swizzle
  • emscripten_int16x8_shuffle
  • emscripten_uint16x8_set
  • emscripten_uint16x8_splat
  • emscripten_uint16x8_add
  • emscripten_uint16x8_sub
  • emscripten_uint16x8_mul
  • emscripten_uint16x8_neg
  • emscripten_uint16x8_and
  • emscripten_uint16x8_xor
  • emscripten_uint16x8_or
  • emscripten_uint16x8_not
  • emscripten_uint16x8_lessThan
  • emscripten_uint16x8_lessThanOrEqual
  • emscripten_uint16x8_greaterThan
  • emscripten_uint16x8_greaterThanOrEqual
  • emscripten_uint16x8_equal
  • emscripten_uint16x8_notEqual
  • emscripten_uint16x8_anyTrue
  • emscripten_bool16x8_anyTrue
  • emscripten_uint16x8_allTrue
  • emscripten_bool16x8_allTrue
  • emscripten_uint16x8_select
  • emscripten_uint16x8_addSaturate
  • emscripten_uint16x8_subSaturate
  • emscripten_uint16x8_shiftLeftByScalar
  • emscripten_uint16x8_shiftRightByScalar
  • emscripten_uint16x8_extractLane
  • emscripten_uint16x8_replaceLane
  • emscripten_uint16x8_store
  • emscripten_uint16x8_load
  • emscripten_uint16x8_fromFloat64x2Bits
  • emscripten_uint16x8_fromFloat32x4Bits
  • emscripten_uint16x8_fromInt32x4Bits
  • emscripten_uint16x8_fromUint32x4Bits
  • emscripten_uint16x8_fromInt16x8Bits
  • emscripten_uint16x8_fromInt8x16Bits
  • emscripten_uint16x8_fromUint8x16Bits
  • emscripten_uint16x8_fromInt16x8
  • emscripten_uint16x8_swizzle
  • emscripten_uint16x8_shuffle
  • emscripten_int8x16_set
  • emscripten_int8x16_splat
  • emscripten_int8x16_add
  • emscripten_int8x16_sub
  • emscripten_int8x16_mul
  • emscripten_int8x16_neg
  • emscripten_int8x16_and
  • emscripten_int8x16_xor
  • emscripten_int8x16_or
  • emscripten_int8x16_not
  • emscripten_int8x16_lessThan
  • emscripten_int8x16_lessThanOrEqual
  • emscripten_int8x16_greaterThan
  • emscripten_int8x16_greaterThanOrEqual
  • emscripten_int8x16_equal
  • emscripten_int8x16_notEqual
  • emscripten_int8x16_anyTrue
  • emscripten_bool8x16_anyTrue
  • emscripten_int8x16_allTrue
  • emscripten_bool8x16_allTrue
  • emscripten_int8x16_select
  • emscripten_int8x16_addSaturate
  • emscripten_int8x16_subSaturate
  • emscripten_int8x16_shiftLeftByScalar
  • emscripten_int8x16_shiftRightByScalar
  • emscripten_int8x16_extractLane
  • emscripten_int8x16_replaceLane
  • emscripten_int8x16_store
  • emscripten_int8x16_load
  • emscripten_int8x16_fromFloat64x2Bits
  • emscripten_int8x16_fromFloat32x4Bits
  • emscripten_int8x16_fromInt32x4Bits
  • emscripten_int8x16_fromUint32x4Bits
  • emscripten_int8x16_fromInt16x8Bits
  • emscripten_int8x16_fromUint16x8Bits
  • emscripten_int8x16_fromUint8x16Bits
  • emscripten_int8x16_fromUint8x16
  • emscripten_int8x16_swizzle
  • emscripten_int8x16_shuffle
  • emscripten_uint8x16_set
  • emscripten_uint8x16_splat
  • emscripten_uint8x16_add
  • emscripten_uint8x16_sub
  • emscripten_uint8x16_mul
  • emscripten_uint8x16_neg
  • emscripten_uint8x16_and
  • emscripten_uint8x16_xor
  • emscripten_uint8x16_or
  • emscripten_uint8x16_not
  • emscripten_uint8x16_lessThan
  • emscripten_uint8x16_lessThanOrEqual
  • emscripten_uint8x16_greaterThan
  • emscripten_uint8x16_greaterThanOrEqual
  • emscripten_uint8x16_equal
  • emscripten_uint8x16_notEqual
  • emscripten_uint8x16_anyTrue
  • emscripten_bool8x16_anyTrue
  • emscripten_uint8x16_allTrue
  • emscripten_bool8x16_allTrue
  • emscripten_uint8x16_select
  • emscripten_uint8x16_addSaturate
  • emscripten_uint8x16_subSaturate
  • emscripten_uint8x16_shiftLeftByScalar
  • emscripten_uint8x16_shiftRightByScalar
  • emscripten_uint8x16_extractLane
  • emscripten_uint8x16_replaceLane
  • emscripten_uint8x16_store
  • emscripten_uint8x16_load
  • emscripten_uint8x16_fromFloat64x2Bits
  • emscripten_uint8x16_fromFloat32x4Bits
  • emscripten_uint8x16_fromInt32x4Bits
  • emscripten_uint8x16_fromUint32x4Bits
  • emscripten_uint8x16_fromInt16x8Bits
  • emscripten_uint8x16_fromUint16x8Bits
  • emscripten_uint8x16_fromInt8x16Bits
  • emscripten_uint8x16_fromInt8x16
  • emscripten_uint8x16_swizzle
  • emscripten_uint8x16_shuffle
  • emscripten_bool64x2_anyTrue
  • emscripten_bool64x2_allTrue
  • emscripten_bool32x4_anyTrue
  • emscripten_bool32x4_allTrue
  • emscripten_bool16x8_anyTrue
  • emscripten_bool16x8_allTrue
  • emscripten_bool8x16_anyTrue
  • emscripten_bool8x16_allTrue

Test errors with Debian's gcc-9

Hello again.

Here are some test errors from a preliminary packaging using the development version of Debian:

/x86/sse2/mm_cvtsd_ss                [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/sse2/sse2.c:3291: assertion failed: ((simde_float64*) &(r))[0] == ((simde_float64*) &(test_vec[i].r))[0] (0.0 == 317099649763060482048.0)
Error: child killed by signal 6 (Aborted)
/x86/sse2/mm_stream_pd/emul          [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/sse2/sse2.c:7279: assertion failed: ((int64_t*) &(r))[0] == ((int64_t*) &(test_vec[i].r))[0] (0xc07e3f851eb851ec == 0x0000000000000002)
Error: child killed by signal 6 (Aborted)
/x86/sse2/mm_cvtsd_ss/emul           [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/sse2/sse2.c:3291: assertion failed: ((simde_float64*) &(r))[0] == ((simde_float64*) &(test_vec[i].r))[0] (0.0 == 317099649763060482048.0)
Error: child killed by signal 6 (Aborted)
/x86/avx/mm256_movemask_ps           [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c:8138: assertion failed: r == test_vec[i].r (0 == 157)
Error: child killed by signal 6 (Aborted)
/x86/avx/mm256_movemask_pd           [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c:8181: assertion failed: r == test_vec[i].r (0 == 8)
Error: child killed by signal 6 (Aborted)
/x86/avx/mm_cmp_sd/emul              [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c:2939: assertion failed: ((uint64_t*) &(r))[0] == ((uint64_t*) &(e))[0] (0x000055622ea520dd == 0x0000000000000000)
Error: child killed by signal 6 (Aborted)
/x86/avx/mm_cmp_ss/emul              [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c:3141: assertion failed: ((uint64_t*) &(r))[0] == ((uint64_t*) &(e))[0] (0x4252cccd2ea520dd == 0x4252cccd00000000)
Error: child killed by signal 6 (Aborted)
/x86/avx/mm256_movemask_ps/emul      [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c:8138: assertion failed: r == test_vec[i].r (0 == 157)
Error: child killed by signal 6 (Aborted)
/x86/avx/mm256_movemask_pd/emul      [ ERROR ]
Error: /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c:8181: assertion failed: r == test_vec[i].r (0 == 8)
Error: child killed by signal 6 (Aborted)

In Debian we add some extra build flags, so here are example compilation lines for reference:

[ 48%] Building C object CMakeFiles/simde-test-native.dir/x86/avx/avx.c.o
/usr/bin/x86_64-linux-gnu-gcc-9  -I/build/simde-0.0.0.git.20191205.c2e740c/test/..  -g -O2 -fdebug-prefix-map=/build/simde-0.0.0.git.20191205.c2e740c=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -march=native -fopenmp-simd -DSIMDE_ENABLE_OPENMP   -std=gnu99 -o CMakeFiles/simde-test-native.dir/x86/avx/avx.c.o   -c /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c

[ 53%] Building C object CMakeFiles/simde-test-no-native.dir/x86/avx/avx.c.o
/usr/bin/x86_64-linux-gnu-gcc-9 -DSIMDE_NO_NATIVE -I/build/simde-0.0.0.git.20191205.c2e740c/test/..  -g -O2 -fdebug-prefix-map=/build/simde-0.0.0.git.20191205.c2e740c=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -march=native -fopenmp-simd -DSIMDE_ENABLE_OPENMP   -std=gnu99 -o CMakeFiles/simde-test-no-native.dir/x86/avx/avx.c.o   -c /build/simde-0.0.0.git.20191205.c2e740c/test/x86/avx/avx.c

$ /usr/bin/x86_64-linux-gnu-gcc-9 --version
/usr/bin/x86_64-linux-gnu-gcc-9 (Debian 9.2.1-21) 9.2.1 20191130
$ cat /proc/cpuinfo | grep model\ name
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
model name	: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

Complete logs at https://gist.github.com/mr-c/ae7cc75349faae4d8224661526068dbe
There is also a Dockerfile at https://salsa.debian.org/med-team/simde/blob/master/debian/docker/Dockerfile. To use it, download the Dockerfile to an empty directory and run docker build .

Native types (or GCC-style vectors) for args/retval instead of unions

I'm thinking that it might be a good idea to pass native types (e.g., __m128, int8x8_t, etc.) or GCC-style vectors when possible instead of SIMDe's union. This would add a bit of cruft to most SIMDe functions since we'd need to convert to our union, but I think it would make things a bit more reliable.

  • Compilers are less likely to move stuff in and out of vector registers. Inlining takes care of most of this issue, but doesn't solve it completely for all code.
  • Some compilers (cough MSVC) don't like passing around aligned unions, but seem to do okay with the native vectors.
  • I'm hoping this also gets rid of some warnings from GCC that I don't see another way to silence (the ones about the ABI changing in some GCC version), but I'd need to verify that.

If I decide to go ahead, this is going to be a pretty noisy change, but the sooner it happens the better.

Add more CPU architectures to CI

I haven't used GitHub Actions for CI before, but it seems pretty interesting and would let us test on more platforms. For example, uraimo/run-on-arch-action would let us build and test on armv6, armv7, aarch64, s390x, and ppc64le (and hopefully more in the future), which would be fantastic.

My current plan is to keep using Travis and AppVeyor for testing different versions of a few compilers, and use GitHub Actions for testing the latest GCC (and maybe clang) on other architectures.

I was playing around in a wip/gh-ci branch last night, but I think I'll play around in a different repo for a bit to get a feel for it instead of generating a lot of noise on SIMDe. That said, if anyone is interested in helping out PRs are certainly welcome ;)

testing: add -march=native if no -march is present in CFLAGS

For some reason, without -march=native or -mavx2, my development laptop running Ubuntu Eoan gets the non-AVX2 code paths when doing a plain mkdir test/build && cd test/build && cmake .. && make && ./run-tests in test/build.

michael@mrc-tux:~/src/simde$ gcc -dM -E - < /dev/null | grep AVX
michael@mrc-tux:~/src/simde$ gcc -march=native -dM -E - < /dev/null | grep AVX
#define __AVX__ 1
#define __AVX2__ 1
michael@mrc-tux:~/src/simde$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.2.1-9ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

SSE functions

Per https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE

  • _mm_add_ps
  • _mm_add_ss
  • _mm_and_ps
  • _mm_andnot_ps
  • _mm_avg_pu16
  • _mm_avg_pu8
  • _mm_cmpeq_ps
  • _mm_cmpeq_ss
  • _mm_cmpge_ps
  • _mm_cmpge_ss
  • _mm_cmpgt_ps
  • _mm_cmpgt_ss
  • _mm_cmple_ps
  • _mm_cmple_ss
  • _mm_cmplt_ps
  • _mm_cmplt_ss
  • _mm_cmpneq_ps
  • _mm_cmpneq_ss
  • _mm_cmpnge_ps
  • _mm_cmpnge_ss
  • _mm_cmpngt_ps
  • _mm_cmpngt_ss
  • _mm_cmpnle_ps
  • _mm_cmpnle_ss
  • _mm_cmpnlt_ps
  • _mm_cmpnlt_ss
  • _mm_cmpord_ps
  • _mm_cmpord_ss
  • _mm_cmpunord_ps
  • _mm_cmpunord_ss
  • _mm_comieq_ss
  • _mm_comige_ss
  • _mm_comigt_ss
  • _mm_comile_ss
  • _mm_comilt_ss
  • _mm_comineq_ss
  • _mm_cvt_pi2ps
  • _mm_cvt_ps2pi
  • _mm_cvt_si2ss
  • _mm_cvt_ss2si
  • _mm_cvtpi16_ps
  • _mm_cvtpi32_ps
  • _mm_cvtpi32x2_ps
  • _mm_cvtpi8_ps
  • _mm_cvtps_pi16
  • _mm_cvtps_pi32
  • _mm_cvtps_pi8
  • _mm_cvtpu16_ps
  • _mm_cvtpu8_ps
  • _mm_cvtsi32_ss
  • _mm_cvtsi64_ss
  • _mm_cvtss_f32
  • _mm_cvtss_si32
  • _mm_cvtss_si64
  • _mm_cvtt_ps2pi
  • _mm_cvtt_ss2si
  • _mm_cvttps_pi32
  • _mm_cvttss_si32
  • _mm_cvttss_si64
  • _mm_div_ps
  • _mm_div_ss
  • _mm_extract_pi16
  • _mm_getcsr
  • _mm_insert_pi16
  • _mm_load_ps
  • _mm_load_ps1
  • _mm_load_ss
  • _mm_load1_ps
  • _mm_loadh_pi
  • _mm_loadl_pi
  • _mm_loadr_ps
  • _mm_loadu_ps
  • _mm_maskmove_si64
  • _m_maskmovq
  • _mm_max_pi16
  • _mm_max_ps
  • _mm_max_pu8
  • _mm_max_ss
  • _mm_min_pi16
  • _mm_min_ps
  • _mm_min_pu8
  • _mm_min_ss
  • _mm_move_ss
  • _mm_movehl_ps
  • _mm_movelh_ps
  • _mm_movemask_pi8
  • _mm_movemask_ps
  • _mm_mul_ps
  • _mm_mul_ss
  • _mm_mulhi_pu16
  • _mm_or_ps
  • _m_pavgb
  • _m_pavgw
  • _m_pextrw
  • _m_pinsrw
  • _m_pmaxsw
  • _m_pmaxub
  • _m_pminsw
  • _m_pminub
  • _m_pmovmskb
  • _m_pmulhuw
  • _mm_prefetch
  • _m_psadbw
  • _m_pshufw
  • _mm_rcp_ps
  • _mm_rcp_ss
  • _mm_rsqrt_ps
  • _mm_rsqrt_ss
  • _mm_sad_pu8
  • _mm_set_ps
  • _mm_set_ps1
  • _mm_set_ss
  • _mm_set1_ps
  • _mm_setcsr
  • _mm_setr_ps
  • _mm_setzero_ps
  • _mm_sfence
  • _mm_shuffle_pi16
  • _mm_shuffle_ps
  • _mm_sqrt_ps
  • _mm_sqrt_ss
  • _mm_store_ps
  • _mm_store_ps1
  • _mm_store_ss
  • _mm_store1_ps
  • _mm_storeh_pi
  • _mm_storel_pi
  • _mm_storer_ps
  • _mm_storeu_ps
  • _mm_stream_pi
  • _mm_stream_ps
  • _mm_sub_ps
  • _mm_sub_ss
  • _mm_ucomieq_ss
  • _mm_ucomige_ss
  • _mm_ucomigt_ss
  • _mm_ucomile_ss
  • _mm_ucomilt_ss
  • _mm_ucomineq_ss
  • _mm_undefined_ps
  • _mm_unpackhi_ps
  • _mm_unpacklo_ps
  • _mm_xor_ps

AVX functions

It would be good to support the AVX instruction set.

This might be a good first issue; you don't need to implement all of the functions at once (one at a time is fine), and if your machine supports AVX natively it's pretty easy to test. Furthermore, most of the functions are just 256-bit versions of 128-bit functions which are already implemented as part of other Intel instruction sets, so you can probably look at those functions for guidance.

Here is a list of functions in AVX. You can find details about each one at https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX

  • _mm256_add_pd
  • _mm256_add_ps
  • _mm256_addsub_pd
  • _mm256_addsub_ps
  • _mm256_and_pd
  • _mm256_and_ps
  • _mm256_andnot_pd
  • _mm256_andnot_ps
  • _mm256_blend_pd
  • _mm256_blend_ps
  • _mm256_blendv_pd
  • _mm256_blendv_ps
  • _mm256_broadcast_pd
  • _mm256_broadcast_ps
  • _mm256_broadcast_sd
  • _mm_broadcast_ss
  • _mm256_broadcast_ss
  • _mm256_castpd_ps
  • _mm256_castpd_si256
  • _mm256_castpd128_pd256
  • _mm256_castpd256_pd128
  • _mm256_castps_pd
  • _mm256_castps_si256
  • _mm256_castps128_ps256
  • _mm256_castps256_ps128
  • _mm256_castsi128_si256
  • _mm256_castsi256_pd
  • _mm256_castsi256_ps
  • _mm256_castsi256_si128
  • _mm256_ceil_pd
  • _mm256_ceil_ps
  • _mm_cmp_pd
  • _mm256_cmp_pd
  • _mm_cmp_ps
  • _mm256_cmp_ps
  • _mm_cmp_sd
  • _mm_cmp_ss
  • _mm256_cvtepi32_pd
  • _mm256_cvtepi32_ps
  • _mm256_cvtpd_epi32
  • _mm256_cvtpd_ps
  • _mm256_cvtps_epi32
  • _mm256_cvtps_pd
  • _mm256_cvttpd_epi32
  • _mm256_cvttps_epi32
  • _mm256_div_pd
  • _mm256_div_ps
  • _mm256_dp_ps
  • _mm256_extract_epi32
  • _mm256_extract_epi64
  • _mm256_extractf128_pd
  • _mm256_extractf128_ps
  • _mm256_extractf128_si256
  • _mm256_floor_pd
  • _mm256_floor_ps
  • _mm256_hadd_pd
  • _mm256_hadd_ps
  • _mm256_hsub_pd
  • _mm256_hsub_ps
  • _mm256_insert_epi16
  • _mm256_insert_epi32
  • _mm256_insert_epi64
  • _mm256_insert_epi8
  • _mm256_insertf128_pd
  • _mm256_insertf128_ps
  • _mm256_insertf128_si256
  • _mm256_lddqu_si256
  • _mm256_load_pd
  • _mm256_load_ps
  • _mm256_load_si256
  • _mm256_loadu_pd
  • _mm256_loadu_ps
  • _mm256_loadu_si256
  • _mm256_loadu2_m128
  • _mm256_loadu2_m128d
  • _mm256_loadu2_m128i
  • _mm_maskload_pd
  • _mm256_maskload_pd
  • _mm_maskload_ps
  • _mm256_maskload_ps
  • _mm_maskstore_pd
  • _mm256_maskstore_pd
  • _mm_maskstore_ps
  • _mm256_maskstore_ps
  • _mm256_max_pd
  • _mm256_max_ps
  • _mm256_min_pd
  • _mm256_min_ps
  • _mm256_movedup_pd
  • _mm256_movehdup_ps
  • _mm256_moveldup_ps
  • _mm256_movemask_pd
  • _mm256_movemask_ps
  • _mm256_mul_pd
  • _mm256_mul_ps
  • _mm256_or_pd
  • _mm256_or_ps
  • _mm_permute_pd
  • _mm256_permute_pd
  • _mm_permute_ps
  • _mm256_permute_ps
  • _mm256_permute2f128_pd
  • _mm256_permute2f128_ps
  • _mm256_permute2f128_si256
  • _mm_permutevar_pd
  • _mm256_permutevar_pd
  • _mm_permutevar_ps
  • _mm256_permutevar_ps
  • _mm256_rcp_ps
  • _mm256_round_pd
  • _mm256_round_ps
  • _mm256_rsqrt_ps
  • _mm256_set_epi16
  • _mm256_set_epi32
  • _mm256_set_epi64x
  • _mm256_set_epi8
  • _mm256_set_m128
  • _mm256_set_m128d
  • _mm256_set_m128i
  • _mm256_set_pd
  • _mm256_set_ps
  • _mm256_set1_epi16
  • _mm256_set1_epi32
  • _mm256_set1_epi64x
  • _mm256_set1_epi8
  • _mm256_set1_pd
  • _mm256_set1_ps
  • _mm256_setr_epi16
  • _mm256_setr_epi32
  • _mm256_setr_epi64x
  • _mm256_setr_epi8
  • _mm256_setr_m128
  • _mm256_setr_m128d
  • _mm256_setr_m128i
  • _mm256_setr_pd
  • _mm256_setr_ps
  • _mm256_setzero_pd
  • _mm256_setzero_ps
  • _mm256_setzero_si256
  • _mm256_shuffle_pd
  • _mm256_shuffle_ps
  • _mm256_sqrt_pd
  • _mm256_sqrt_ps
  • _mm256_store_pd
  • _mm256_store_ps
  • _mm256_store_si256
  • _mm256_storeu_pd
  • _mm256_storeu_ps
  • _mm256_storeu_si256
  • _mm256_storeu2_m128
  • _mm256_storeu2_m128d
  • _mm256_storeu2_m128i
  • _mm256_stream_pd
  • _mm256_stream_ps
  • _mm256_stream_si256
  • _mm256_sub_pd
  • _mm256_sub_ps
  • _mm_testc_pd
  • _mm256_testc_pd
  • _mm_testc_ps
  • _mm256_testc_ps
  • _mm256_testc_si256
  • _mm_testnzc_pd
  • _mm256_testnzc_pd
  • _mm_testnzc_ps
  • _mm256_testnzc_ps
  • _mm256_testnzc_si256
  • _mm_testz_pd
  • _mm256_testz_pd
  • _mm_testz_ps
  • _mm256_testz_ps
  • _mm256_testz_si256
  • _mm256_undefined_pd
  • _mm256_undefined_ps
  • _mm256_undefined_si256
  • _mm256_unpackhi_pd
  • _mm256_unpackhi_ps
  • _mm256_unpacklo_pd
  • _mm256_unpacklo_ps
  • _mm256_xor_pd
  • _mm256_xor_ps
  • _mm256_zeroall
  • _mm256_zeroupper
  • _mm256_zextpd128_pd256
  • _mm256_zextps128_ps256
  • _mm256_zextsi128_si256

FMA support

For RAxML:

  • _mm256_fmadd_pd

Others:

  • _mm_fmadd_pd
  • _mm_fmadd_ps
  • _mm256_fmadd_ps
  • _mm_fmadd_sd
  • _mm_fmadd_ss
  • _mm_fmaddsub_pd
  • _mm256_fmaddsub_pd
  • _mm_fmaddsub_ps
  • _mm256_fmaddsub_ps
  • _mm_fmsub_pd
  • _mm256_fmsub_pd
  • _mm_fmsub_ps
  • _mm256_fmsub_ps
  • _mm_fmsub_sd
  • _mm_fmsub_ss
  • _mm_fmsubadd_pd
  • _mm256_fmsubadd_pd
  • _mm_fmsubadd_ps
  • _mm256_fmsubadd_ps
  • _mm_fnmadd_pd
  • _mm256_fnmadd_pd
  • _mm_fnmadd_ps
  • _mm256_fnmadd_ps
  • _mm_fnmadd_sd
  • _mm_fnmadd_ss
  • _mm_fnmsub_pd
  • _mm256_fnmsub_pd
  • _mm_fnmsub_ps
  • _mm256_fnmsub_ps
  • _mm_fnmsub_sd
  • _mm_fnmsub_ss
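
As a reference for the semantics, here is a scalar model of what the fmadd family computes per lane (illustrative only; the real simde_mm256_fmadd_pd operates on simde__m256d values, and a hardware FMA performs a single rounding where this model rounds twice):

```c
/* Scalar model of _mm256_fmadd_pd semantics: r[i] = a[i] * b[i] + c[i]
 * for each of the four double lanes. */
static void fmadd_pd4(const double a[4], const double b[4],
                      const double c[4], double r[4]) {
  for (int i = 0; i < 4; i++)
    r[i] = a[i] * b[i] + c[i];
}
```

The fmsub, fnmadd, and fnmsub variants only change the signs applied to the product and the addend.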

MMX functions

Per https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=MMX

  • _mm_add_pi16
  • _mm_add_pi32
  • _mm_add_pi8
  • _mm_adds_pi16
  • _mm_adds_pi8
  • _mm_adds_pu16
  • _mm_adds_pu8
  • _mm_and_si64
  • _mm_andnot_si64
  • _mm_cmpeq_pi16
  • _mm_cmpeq_pi32
  • _mm_cmpeq_pi8
  • _mm_cmpgt_pi16
  • _mm_cmpgt_pi32
  • _mm_cmpgt_pi8
  • _mm_cvtm64_si64
  • _mm_cvtsi32_si64
  • _mm_cvtsi64_m64
  • _mm_cvtsi64_si32
  • _m_empty
  • _mm_empty
  • _m_from_int
  • _m_from_int64
  • _mm_madd_pi16
  • _mm_mulhi_pi16
  • _mm_mullo_pi16
  • _mm_or_si64
  • _mm_packs_pi16
  • _mm_packs_pi32
  • _mm_packs_pu16
  • _m_packssdw
  • _m_packsswb
  • _m_packuswb
  • _m_paddb
  • _m_paddd
  • _m_paddsb
  • _m_paddsw
  • _m_paddusb
  • _m_paddusw
  • _m_paddw
  • _m_pand
  • _m_pandn
  • _m_pcmpeqb
  • _m_pcmpeqd
  • _m_pcmpeqw
  • _m_pcmpgtb
  • _m_pcmpgtd
  • _m_pcmpgtw
  • _m_pmaddwd
  • _m_pmulhw
  • _m_pmullw
  • _m_por
  • _m_pslld
  • _m_pslldi
  • _m_psllq
  • _m_psllqi
  • _m_psllw
  • _m_psllwi
  • _m_psrad
  • _m_psradi
  • _m_psraw
  • _m_psrawi
  • _m_psrld
  • _m_psrldi
  • _m_psrlq
  • _m_psrlqi
  • _m_psrlw
  • _m_psrlwi
  • _m_psubb
  • _m_psubd
  • _m_psubsb
  • _m_psubsw
  • _m_psubusb
  • _m_psubusw
  • _m_psubw
  • _m_punpckhbw
  • _m_punpckhdq
  • _m_punpckhwd
  • _m_punpcklbw
  • _m_punpckldq
  • _m_punpcklwd
  • _m_pxor
  • _mm_set_pi16
  • _mm_set_pi32
  • _mm_set_pi8
  • _mm_set1_pi16
  • _mm_set1_pi32
  • _mm_set1_pi8
  • _mm_setr_pi16
  • _mm_setr_pi32
  • _mm_setr_pi8
  • _mm_setzero_si64
  • _mm_sll_pi16
  • _mm_sll_pi32
  • _mm_sll_si64
  • _mm_slli_pi16
  • _mm_slli_pi32
  • _mm_slli_si64
  • _mm_sra_pi16
  • _mm_sra_pi32
  • _mm_srai_pi16
  • _mm_srai_pi32
  • _mm_srl_pi16
  • _mm_srl_pi32
  • _mm_srl_si64
  • _mm_srli_pi16
  • _mm_srli_pi32
  • _mm_srli_si64
  • _mm_sub_pi16
  • _mm_sub_pi32
  • _mm_sub_pi8
  • _mm_subs_pi16
  • _mm_subs_pi8
  • _mm_subs_pu16
  • _mm_subs_pu8
  • _m_to_int
  • _m_to_int64
  • _mm_unpackhi_pi16
  • _mm_unpackhi_pi32
  • _mm_unpackhi_pi8
  • _mm_unpacklo_pi16
  • _mm_unpacklo_pi32
  • _mm_unpacklo_pi8
  • _mm_xor_si64

ssh.h typo

On line 2111 there should be a NEON define instead of a native one.
Sorry, the file is actually sse.h, not ssh.h.

SSE4.1 functions

Per https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_1

  • _mm_blend_epi16
  • _mm_blend_pd
  • _mm_blend_ps
  • _mm_blendv_epi8
  • _mm_blendv_pd
  • _mm_blendv_ps
  • _mm_ceil_pd
  • _mm_ceil_ps
  • _mm_ceil_sd
  • _mm_ceil_ss
  • _mm_cmpeq_epi64
  • _mm_cvtepi16_epi32
  • _mm_cvtepi16_epi64
  • _mm_cvtepi32_epi64
  • _mm_cvtepi8_epi16
  • _mm_cvtepi8_epi32
  • _mm_cvtepi8_epi64
  • _mm_cvtepu16_epi32
  • _mm_cvtepu16_epi64
  • _mm_cvtepu32_epi64
  • _mm_cvtepu8_epi16
  • _mm_cvtepu8_epi32
  • _mm_cvtepu8_epi64
  • _mm_dp_pd
  • _mm_dp_ps
  • _mm_extract_epi32
  • _mm_extract_epi64
  • _mm_extract_epi8
  • _mm_extract_ps
  • _mm_floor_pd
  • _mm_floor_ps
  • _mm_floor_sd
  • _mm_floor_ss
  • _mm_insert_epi32
  • _mm_insert_epi64
  • _mm_insert_epi8
  • _mm_insert_ps
  • _mm_max_epi32
  • _mm_max_epi8
  • _mm_max_epu16
  • _mm_max_epu32
  • _mm_min_epi32
  • _mm_min_epi8
  • _mm_min_epu16
  • _mm_min_epu32
  • _mm_minpos_epu16
  • _mm_mpsadbw_epu8
  • _mm_mul_epi32
  • _mm_mullo_epi32
  • _mm_packus_epi32
  • _mm_round_pd
  • _mm_round_ps
  • _mm_round_sd
  • _mm_round_ss
  • _mm_stream_load_si128
  • _mm_test_all_ones
  • _mm_test_all_zeros
  • _mm_test_mix_ones_zeros
  • _mm_testc_si128
  • _mm_testnzc_si128
  • _mm_testz_si128

Add operator overloading for C++

Hello,

First, a big thanks for SIMDe! I'm trying to use it to compile a plugin for VCV Rack (an open-source software modular synthesizer -> vcvrack.com) on 32-bit and 64-bit ARM, and it works very well in most cases, but on this line of code:

__c32 = simde_mm_and_si128(__c32, ~simde_mm_and_si128(simde_mm_and_si128(~__a32, ~__b32), simde_mm_castps_si128(__chance32))); // Make 0 if a == 0 AND b == 0

I get the following error:

error: no match for 'operator~' (operand type is 'simde__m128i')

I have no real idea how to work around the problem. Is there something wrong with that code or with some part of the SIMDe implementation? Is there maybe a fix or a workaround for it?

The code is from here, so you have the full context:
https://github.com/ValleyAudio/ValleyRackFree/blob/master/src/Amalgam/VecAmalgam.cpp#L238

All the other SSE code translated via SIMDe compiles perfectly fine, and the original code compiles without error on Intel with regular SSE. The system is plain Ubuntu 18.04, compiled natively on armhf and arm64 (both have the same problem).

Many thanks in advance and best wishes - hexdump
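
For anyone hitting the same error before operator overloads land: in C++ SIMDe's vector types are plain structs, so `operator~` is not defined for simde__m128i. A possible workaround is to replace `~x` with an XOR against an all-ones vector, e.g. `simde_mm_xor_si128(x, simde_mm_set1_epi32(-1))` (both functions are in the SSE2 list below). A scalar sketch of why the two forms agree per 32-bit lane:

```c
#include <stdint.h>

/* XOR with all-ones produces the same result as bitwise NOT; this is
 * the lane-wise identity the vector workaround relies on. */
static uint32_t not_via_xor(uint32_t x) {
  return x ^ (uint32_t)-1;  /* same bits as ~x */
}
```

simde_mm_andnot_si128, which computes (~a) & b, can also absorb one of the complements in expressions like the one above.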

SSE2 functions

Per https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE2

  • _mm_add_epi16
  • _mm_add_epi32
  • _mm_add_epi64
  • _mm_add_epi8
  • _mm_add_pd
  • _mm_add_sd
  • _mm_add_si64
  • _mm_adds_epi16
  • _mm_adds_epi8
  • _mm_adds_epu16
  • _mm_adds_epu8
  • _mm_and_pd
  • _mm_and_si128
  • _mm_andnot_pd
  • _mm_andnot_si128
  • _mm_avg_epu16
  • _mm_avg_epu8
  • _mm_bslli_si128
  • _mm_bsrli_si128
  • _mm_castpd_ps
  • _mm_castpd_si128
  • _mm_castps_pd
  • _mm_castps_si128
  • _mm_castsi128_pd
  • _mm_castsi128_ps
  • _mm_clflush
  • _mm_cmpeq_epi16
  • _mm_cmpeq_epi32
  • _mm_cmpeq_epi8
  • _mm_cmpeq_pd
  • _mm_cmpeq_sd
  • _mm_cmpge_pd
  • _mm_cmpge_sd
  • _mm_cmpgt_epi16
  • _mm_cmpgt_epi32
  • _mm_cmpgt_epi8
  • _mm_cmpgt_pd
  • _mm_cmpgt_sd
  • _mm_cmple_pd
  • _mm_cmple_sd
  • _mm_cmplt_epi16
  • _mm_cmplt_epi32
  • _mm_cmplt_epi8
  • _mm_cmplt_pd
  • _mm_cmplt_sd
  • _mm_cmpneq_pd
  • _mm_cmpneq_sd
  • _mm_cmpnge_pd
  • _mm_cmpnge_sd
  • _mm_cmpngt_pd
  • _mm_cmpngt_sd
  • _mm_cmpnle_pd
  • _mm_cmpnle_sd
  • _mm_cmpnlt_pd
  • _mm_cmpnlt_sd
  • _mm_cmpord_pd
  • _mm_cmpord_sd
  • _mm_cmpunord_pd
  • _mm_cmpunord_sd
  • _mm_comieq_sd
  • _mm_comige_sd
  • _mm_comigt_sd
  • _mm_comile_sd
  • _mm_comilt_sd
  • _mm_comineq_sd
  • _mm_cvtepi32_pd
  • _mm_cvtepi32_ps
  • _mm_cvtpd_epi32
  • _mm_cvtpd_pi32
  • _mm_cvtpd_ps
  • _mm_cvtpi32_pd
  • _mm_cvtps_epi32
  • _mm_cvtps_pd
  • _mm_cvtsd_f64
  • _mm_cvtsd_si32
  • _mm_cvtsd_si64
  • _mm_cvtsd_si64x
  • _mm_cvtsd_ss
  • _mm_cvtsi128_si32
  • _mm_cvtsi128_si64
  • _mm_cvtsi128_si64x
  • _mm_cvtsi32_sd
  • _mm_cvtsi32_si128
  • _mm_cvtsi64_sd
  • _mm_cvtsi64_si128
  • _mm_cvtsi64x_sd
  • _mm_cvtsi64x_si128
  • _mm_cvtss_sd
  • _mm_cvttpd_epi32
  • _mm_cvttpd_pi32
  • _mm_cvttps_epi32
  • _mm_cvttsd_si32
  • _mm_cvttsd_si64
  • _mm_cvttsd_si64x
  • _mm_div_pd
  • _mm_div_sd
  • _mm_extract_epi16
  • _mm_insert_epi16
  • _mm_lfence
  • _mm_load_pd
  • _mm_load_pd1
  • _mm_load_sd
  • _mm_load_si128
  • _mm_load1_pd
  • _mm_loadh_pd
  • _mm_loadl_epi64
  • _mm_loadl_pd
  • _mm_loadr_pd
  • _mm_loadu_pd
  • _mm_loadu_si128
  • _mm_madd_epi16
  • _mm_maskmoveu_si128
  • _mm_max_epi16
  • _mm_max_epu8
  • _mm_max_pd
  • _mm_max_sd
  • _mm_mfence
  • _mm_min_epi16
  • _mm_min_epu8
  • _mm_min_pd
  • _mm_min_sd
  • _mm_move_epi64
  • _mm_move_sd
  • _mm_movemask_epi8
  • _mm_movemask_pd
  • _mm_movepi64_pi64
  • _mm_movpi64_epi64
  • _mm_mul_epu32
  • _mm_mul_pd
  • _mm_mul_sd
  • _mm_mul_su32
  • _mm_mulhi_epi16
  • _mm_mulhi_epu16
  • _mm_mullo_epi16
  • _mm_or_pd
  • _mm_or_si128
  • _mm_packs_epi16
  • _mm_packs_epi32
  • _mm_packus_epi16
  • _mm_pause
  • _mm_sad_epu8
  • _mm_set_epi16
  • _mm_set_epi32
  • _mm_set_epi64
  • _mm_set_epi64x
  • _mm_set_epi8
  • _mm_set_pd
  • _mm_set_pd1
  • _mm_set_sd
  • _mm_set1_epi16
  • _mm_set1_epi32
  • _mm_set1_epi64
  • _mm_set1_epi64x
  • _mm_set1_epi8
  • _mm_set1_pd
  • _mm_setr_epi16
  • _mm_setr_epi32
  • _mm_setr_epi64
  • _mm_setr_epi8
  • _mm_setr_pd
  • _mm_setzero_pd
  • _mm_setzero_si128
  • _mm_shuffle_epi32
  • _mm_shuffle_pd
  • _mm_shufflehi_epi16
  • _mm_shufflelo_epi16
  • _mm_sll_epi16
  • _mm_sll_epi32
  • _mm_sll_epi64
  • _mm_slli_epi16
  • _mm_slli_epi32
  • _mm_slli_epi64
  • _mm_slli_si128
  • _mm_sqrt_pd
  • _mm_sqrt_sd
  • _mm_sra_epi16
  • _mm_sra_epi32
  • _mm_srai_epi16
  • _mm_srai_epi32
  • _mm_srl_epi16
  • _mm_srl_epi32
  • _mm_srl_epi64
  • _mm_srli_epi16
  • _mm_srli_epi32
  • _mm_srli_epi64
  • _mm_srli_si128
  • _mm_store_pd
  • _mm_store_pd1
  • _mm_store_sd
  • _mm_store_si128
  • _mm_store1_pd
  • _mm_storeh_pd
  • _mm_storel_epi64
  • _mm_storel_pd
  • _mm_storer_pd
  • _mm_storeu_pd
  • _mm_storeu_si128
  • _mm_stream_pd
  • _mm_stream_si128
  • _mm_stream_si32
  • _mm_stream_si64
  • _mm_sub_epi16
  • _mm_sub_epi32
  • _mm_sub_epi64
  • _mm_sub_epi8
  • _mm_sub_pd
  • _mm_sub_sd
  • _mm_sub_si64
  • _mm_subs_epi16
  • _mm_subs_epi8
  • _mm_subs_epu16
  • _mm_subs_epu8
  • _mm_ucomieq_sd
  • _mm_ucomige_sd
  • _mm_ucomigt_sd
  • _mm_ucomile_sd
  • _mm_ucomilt_sd
  • _mm_ucomineq_sd
  • _mm_undefined_pd
  • _mm_undefined_si128
  • _mm_unpackhi_epi16
  • _mm_unpackhi_epi32
  • _mm_unpackhi_epi64
  • _mm_unpackhi_epi8
  • _mm_unpackhi_pd
  • _mm_unpacklo_epi16
  • _mm_unpacklo_epi32
  • _mm_unpacklo_epi64
  • _mm_unpacklo_epi8
  • _mm_unpacklo_pd
  • _mm_xor_pd
  • _mm_xor_si128

WebAssembly functions

Here is the current API: https://github.com/emscripten-core/emscripten/blob/master/system/include/wasm_simd128.h

Since this is pretty new and not a lot of code is targeting it, for now it's probably more important to get other ISA extensions accelerated using these than to create a portable implementation of them. That said, it would be helpful for people targeting WASM to be able to compile and test code using native tools.

It could also be interesting as a common layer we could call from all ISA extensions and avoid some duplication.

Add clang on arm32 to CI

Clang tends to do a much better job of warning (well, erroring) about out-of-range values being passed. This can be a bit annoying because the checks are done before dead code elimination (so you can't do something like (imm8 & ~15) ? vdup_n_s16(0) : vshl_s16(imm8)), but it also catches a lot of bugs.

For now clang-9 is working on arm32, but I'd definitely like to get a CI build going before I start working on more ARM implementations.

@junaruga, maybe you happen to know of an example we could use?

Implement functions for Leopard

@catid mentioned Leopard on Twitter as a potential target for SIMDe. AFAICT (grep -hoP '_mm(256)?_[a-z0-9_]+' *.{cpp,h} | sort | uniq) only 7 functions are currently missing:

  • _mm256_and_si256
  • _mm256_broadcastsi128_si256
  • _mm256_loadu_si256
  • _mm256_shuffle_epi8
  • _mm256_srli_epi64
  • _mm256_storeu_si256
  • _mm256_xor_si256

I'm not sure how competitive a SIMDe implementation would be, but it's certainly worth a try.

error: invalid cast from type 'simde__m128' to type 'simde__m128i'

Hi hi. While making a first pass at patching https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library to use simde, I got the following error while compiling for i386:

/usr/bin/i686-linux-gnu-g++-9 -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fdebug-prefix-map=/build/libssw-1.1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -shared -rdynamic -Wl,-soname,libssw.so.0 -o libssw.so.0 ssw.c ssw.h ssw_cpp.h ssw_cpp.cpp -Wl,-z,relro -Wl,-z,now
In file included from ../debian/include/simde/x86/../simde-common.h:27,
                 from ../debian/include/simde/x86/mmx.h:28,
                 from ../debian/include/simde/x86/sse.h:32,
                 from ../debian/include/simde/x86/sse2.h:34,
                 from ssw.c:38:
../debian/include/simde/x86/sse2.h: In function 'simde__m128i simde_mm_castps_si128(simde__m128)':
../debian/include/simde/x86/../hedley.h:1358:69: error: invalid cast from type 'simde__m128' to type 'simde__m128i'
 1358 | #  define HEDLEY_REINTERPRET_CAST(T, expr) (reinterpret_cast<T>(expr))
      |                                                                     ^
../debian/include/simde/x86/sse2.h:821:7: note: in expansion of macro 'HEDLEY_REINTERPRET_CAST'
  821 |   r = HEDLEY_REINTERPRET_CAST(simde__m128i, a);
      |       ^~~~~~~~~~~~~~~~~~~~~~~
../debian/include/simde/x86/sse2.h: In function 'simde__m128d simde_mm_castsi128_pd(simde__m128i)':
../debian/include/simde/x86/../hedley.h:1358:69: error: invalid cast from type 'simde__m128i' to type 'simde__m128d'
 1358 | #  define HEDLEY_REINTERPRET_CAST(T, expr) (reinterpret_cast<T>(expr))
      |                                                                     ^
../debian/include/simde/x86/sse2.h:838:7: note: in expansion of macro 'HEDLEY_REINTERPRET_CAST'
  838 |   r = HEDLEY_REINTERPRET_CAST(simde__m128d, a);
      |       ^~~~~~~~~~~~~~~~~~~~~~~
../debian/include/simde/x86/sse2.h: In function 'simde__m128 simde_mm_castsi128_ps(simde__m128i)':
../debian/include/simde/x86/../hedley.h:1358:69: error: invalid cast from type 'simde__m128i' to type 'simde__m128'
 1358 | #  define HEDLEY_REINTERPRET_CAST(T, expr) (reinterpret_cast<T>(expr))
      |                                                                     ^
../debian/include/simde/x86/sse2.h:857:7: note: in expansion of macro 'HEDLEY_REINTERPRET_CAST'
  857 |   r = HEDLEY_REINTERPRET_CAST(simde__m128, a);
      |       ^~~~~~~~~~~~~~~~~~~~~~~

Here's the patch https://salsa.debian.org/med-team/libssw/blob/simde/debian/patches/simde.patch

Adding -march=i686 (the baseline mode for Debian's 32-bit x86 release) gives the same error. If I instead specify -march=pentium-m or a more advanced processor with SSE2 then the error goes away.
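
For context on why the cast fails: when no native __m128 type is available, SIMDe's types can end up as distinct struct types, and reinterpret_cast between unrelated struct types is ill-formed in C++. One well-defined way to reinterpret the bits of same-size types is to copy them, sketched here with hypothetical stand-in types (not SIMDe's actual definitions):

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical stand-ins for simde__m128 / simde__m128i when no native
 * vector type exists. Copying the bytes is well defined where a
 * reinterpret_cast between these struct types is not. */
typedef struct { float   f32[4]; } my_m128;
typedef struct { int32_t i32[4]; } my_m128i;

static my_m128i castps_si128(my_m128 a) {
  my_m128i r;
  memcpy(&r, &a, sizeof(r));  /* bit-for-bit reinterpretation */
  return r;
}
```
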

AltiVec implementations for SSE

I'm trying to get access to a VPS for developing AltiVec implementations, and SSE is first on the list. Even if I don't get access to the VPS it should be possible (though somewhat painful) to do it using QEMU since we can now verify the implementations on Travis CI.

I'm hoping it will also give a better feel for the AltiVec API and whether it is reasonable to implement a portable variant. Unfortunately, AltiVec intrinsics require some language-level support (i.e., the vector keyword), but if people are willing to tweak their code a bit it may be possible to provide something.

Some NEON implementations rely on AArch64 intrinsics even on arm32

Hello there. I'm trying to compile https://github.com/surge-synthesizer/surge for the Raspberry Pi 4. Surge makes heavy use of SSE2, and I'm trying to use your library. However, although I consider myself pretty good at building and packaging software, this is a bit outside my headspace.

I've copied over simde-arch.h, simde-common.h, hedley.h, check.h, and x86/{mmx.h, sse.h, sse2.h}, and included them in the .h in place of xmmintrin.h & immintrin.h:

#include "simde-arch.h"
#include "simde-common.h"
#include "x86/sse.h"
#include "x86/sse2.h"
#include "x86/mmx.h"

when I try and build I get a whole bunch of

src/common/x86/mmx.h:839:60: error: ‘vcgezq_s16’ was not declared in this scope
src/common/x86/mmx.h:1496:37: error: ‘vneg_s64’ was not declared in this scope
src/common/x86/mmx.h:1854:15: error: ‘vzip2_s8’ was not declared in this scope
src/common/x86/mmx.h:1886:16: error: ‘vzip2_s16’ was not declared in this scope
src/common/x86/mmx.h:1914:16: error: ‘vzip2_s32’ was not declared in this scope
src/common/x86/mmx.h:1940:15: error: ‘vzip1_s8’ was not declared in this scope
src/common/x86/mmx.h:1972:16: error: ‘vzip1_s16’ was not declared in this scope
src/common/x86/mmx.h:2000:16: error: ‘vzip1_s32’ was not declared in this scope

Which I thought were supposed to be provided by arm_neon.h,
and then a heap of

src/common/vt_dsp/shared.h:11:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:12:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:13:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:14:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:15:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:16:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:17:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:18:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/shared.h:19:7: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/basic_dsp.h:50:39: error: ‘_mm_load_ss’ was not declared in this scope
src/common/vt_dsp/basic_dsp.h:50:28: error: ‘_mm_max_ss’ was not declared in this scope
src/common/vt_dsp/basic_dsp.h:50:17: error: ‘_mm_min_ss’ was not declared in this scope
src/common/vt_dsp/basic_dsp.h:49:4: error: ‘_mm_store_ss’ was not declared in this scope
src/common/vt_dsp/basic_dsp.h:54:8: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/basic_dsp.h:64:8: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/basic_dsp.h:72:8: error: ‘__m128’ does not name a type; did you mean ‘__y1l’?
src/common/vt_dsp/basic_dsp.h:81:32: error: ‘_mm_load_ss’ was not declared in this scope
src/common/vt_dsp/basic_dsp.h:81:21: error: ‘_mm_rcp_ss’ was not declared in this scope
src/common/vt_dsp/basic_dsp.h:81:4: error: ‘_mm_store_ss’ was not declared in this scope

And there's more where that came from! I'm hoping I've just done the includes incompetently.

MSVC compatibility in C++ mode

In C++ mode, MSVC doesn't like code like

return (simde__m128) { .i64 = { ... } };

Which, unfortunately, is something used pretty heavily in SIMDe. I've started removing the problematic syntax in the wip/cpp branch, but there is a lot to do.
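
For reference, the usual mechanical rewrite (shown with a hypothetical stand-in type, not SIMDe's actual definition) replaces the designated-initializer compound literal with assignment through a named local:

```c
#include <stdint.h>

/* Stand-in for a simde__m128-style struct. MSVC's C++ front end rejects
 *   return (my_m128) { .i64 = { lo, hi } };
 * so the value is built through a named local instead. */
typedef struct { int64_t i64[2]; } my_m128;

static my_m128 make_m128(int64_t lo, int64_t hi) {
  my_m128 r;
  r.i64[0] = lo;
  r.i64[1] = hi;
  return r;
}
```
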

Pull in updates from sse2neon project

@hasindu2008 recently contributed a bunch of NEON implementations to sse2neon.

SIMDe has many NEON implementations that sse2neon lacks, but it's quite possible the recent additions to sse2neon include functions for which we only have portable implementations. It's also possible their implementations are faster.

We just need to go through the changes and see what we can use. Both projects are MIT-licensed.

How about for ARMv8?

I'm testing on ARMv8 and have set -march="armv8-a+crc", but the "CFLAG__mfpu_neon" test fails.
How can I use NEON correctly?

AVX2 functions

It would be good to support the AVX2 instruction set.

This might be a good first issue; you don't need to implement all of the functions at once (one at a time is fine), and if your machine supports AVX2 natively it's pretty easy to test. Furthermore, most of the functions are just 256-bit versions of 128-bit functions which are already implemented as part of other Intel instruction sets, so you can probably look at those implementations for help.
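
The "256-bit version of a 128-bit function" pattern can be sketched with scalar stand-in types (hypothetical names, not SIMDe's actual representation; a real implementation would go through the library's own simde__m256i type):

```c
#include <stdint.h>

/* Illustrative stand-ins: a 256-bit integer vector as two 128-bit halves. */
typedef struct { int32_t i32[4]; } my_m128i;
typedef struct { my_m128i m128[2]; } my_m256i;

/* Existing 128-bit op (here, lane-wise 32-bit addition). */
static my_m128i add_epi32_128(my_m128i a, my_m128i b) {
  my_m128i r;
  for (int i = 0; i < 4; i++) r.i32[i] = a.i32[i] + b.i32[i];
  return r;
}

/* The 256-bit AVX2-style op built by applying the 128-bit op to each
 * half -- the pattern most AVX2 functions can follow. */
static my_m256i add_epi32_256(my_m256i a, my_m256i b) {
  my_m256i r;
  r.m128[0] = add_epi32_128(a.m128[0], b.m128[0]);
  r.m128[1] = add_epi32_128(a.m128[1], b.m128[1]);
  return r;
}
```
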

Here is a list of functions in AVX2. You can find details for each one at https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2

  • _mm256_extract_epi8
  • _mm256_extract_epi16
  • _mm256_abs_epi8
  • _mm256_abs_epi16
  • _mm256_abs_epi32
  • _mm256_add_epi8
  • _mm256_add_epi16
  • _mm256_add_epi32
  • _mm256_add_epi64
  • _mm256_adds_epi8
  • _mm256_adds_epi16
  • _mm256_adds_epu8
  • _mm256_adds_epu16
  • _mm256_alignr_epi8
  • _mm256_and_si256
  • _mm256_andnot_si256
  • _mm256_avg_epu8
  • _mm256_avg_epu16
  • _mm256_blend_epi16
  • _mm_blend_epi32
  • _mm256_blend_epi32
  • _mm256_blendv_epi8
  • _mm_broadcastb_epi8
  • _mm256_broadcastb_epi8
  • _mm_broadcastd_epi32
  • _mm256_broadcastd_epi32
  • _mm_broadcastq_epi64
  • _mm256_broadcastq_epi64
  • _mm_broadcastsd_pd
  • _mm256_broadcastsd_pd
  • _mm_broadcastsi128_si256
  • _mm256_broadcastsi128_si256
  • _mm_broadcastss_ps
  • _mm256_broadcastss_ps
  • _mm_broadcastw_epi16
  • _mm256_broadcastw_epi16
  • _mm256_cmpeq_epi8
  • _mm256_cmpeq_epi16
  • _mm256_cmpeq_epi32
  • _mm256_cmpeq_epi64
  • _mm256_cmpgt_epi8
  • _mm256_cmpgt_epi16
  • _mm256_cmpgt_epi32
  • _mm256_cmpgt_epi64
  • _mm256_cvtepi16_epi32
  • _mm256_cvtepi16_epi64
  • _mm256_cvtepi32_epi64
  • _mm256_cvtepi8_epi16
  • _mm256_cvtepi8_epi32
  • _mm256_cvtepi8_epi64
  • _mm256_cvtepu16_epi32
  • _mm256_cvtepu16_epi64
  • _mm256_cvtepu32_epi64
  • _mm256_cvtepu8_epi16
  • _mm256_cvtepu8_epi32
  • _mm256_cvtepu8_epi64
  • _mm256_extracti128_si256
  • _mm256_hadd_epi16
  • _mm256_hadd_epi32
  • _mm256_hadds_epi16
  • _mm256_hsub_epi16
  • _mm256_hsub_epi32
  • _mm256_hsubs_epi16
  • _mm_i32gather_pd
  • _mm256_i32gather_pd
  • _mm_i32gather_ps
  • _mm256_i32gather_ps
  • _mm_i32gather_epi32
  • _mm256_i32gather_epi32
  • _mm_i32gather_epi64
  • _mm256_i32gather_epi64
  • _mm_i64gather_pd
  • _mm256_i64gather_pd
  • _mm_i64gather_ps
  • _mm256_i64gather_ps
  • _mm_i64gather_epi32
  • _mm256_i64gather_epi32
  • _mm_i64gather_epi64
  • _mm256_i64gather_epi64
  • _mm256_inserti128_si256
  • _mm256_madd_epi16
  • _mm256_maddubs_epi16
  • _mm_mask_i32gather_pd
  • _mm256_mask_i32gather_pd
  • _mm_mask_i32gather_ps
  • _mm256_mask_i32gather_ps
  • _mm_mask_i32gather_epi32
  • _mm256_mask_i32gather_epi32
  • _mm_mask_i32gather_epi64
  • _mm256_mask_i32gather_epi64
  • _mm_mask_i64gather_pd
  • _mm256_mask_i64gather_pd
  • _mm_mask_i64gather_ps
  • _mm256_mask_i64gather_ps
  • _mm_mask_i64gather_epi32
  • _mm256_mask_i64gather_epi32
  • _mm_mask_i64gather_epi64
  • _mm256_mask_i64gather_epi64
  • _mm_maskload_epi32
  • _mm256_maskload_epi32
  • _mm_maskload_epi64
  • _mm256_maskload_epi64
  • _mm_maskstore_epi32
  • _mm256_maskstore_epi32
  • _mm_maskstore_epi64
  • _mm256_maskstore_epi64
  • _mm256_max_epi8
  • _mm256_max_epi16
  • _mm256_max_epi32
  • _mm256_max_epu8
  • _mm256_max_epu16
  • _mm256_max_epu32
  • _mm256_min_epi8
  • _mm256_min_epi16
  • _mm256_min_epi32
  • _mm256_min_epu8
  • _mm256_min_epu16
  • _mm256_min_epu32
  • _mm256_movemask_epi8
  • _mm256_mpsadbw_epu8
  • _mm256_mul_epi32
  • _mm256_mul_epu32
  • _mm256_mulhi_epi16
  • _mm256_mulhi_epu16
  • _mm256_mulhrs_epi16
  • _mm256_mullo_epi16
  • _mm256_mullo_epi32
  • _mm256_or_si256
  • _mm256_packs_epi16
  • _mm256_packs_epi32
  • _mm256_packus_epi16
  • _mm256_packus_epi32
  • _mm256_permute2x128_si256
  • _mm256_permute4x64_epi64
  • _mm256_permute4x64_pd
  • _mm256_permutevar8x32_epi32
  • _mm256_permutevar8x32_ps
  • _mm256_sad_epu8
  • _mm256_shuffle_epi32
  • _mm256_shuffle_epi8
  • _mm256_shufflehi_epi16
  • _mm256_shufflelo_epi16
  • _mm256_sign_epi8
  • _mm256_sign_epi16
  • _mm256_sign_epi32
  • _mm256_slli_si256
  • _mm256_bslli_epi128
  • _mm256_sll_epi16
  • _mm256_slli_epi16
  • _mm256_sll_epi32
  • _mm256_slli_epi32
  • _mm256_sll_epi64
  • _mm256_slli_epi64
  • _mm_sllv_epi32
  • _mm256_sllv_epi32
  • _mm_sllv_epi64
  • _mm256_sllv_epi64
  • _mm256_sra_epi16
  • _mm256_srai_epi16
  • _mm256_sra_epi32
  • _mm256_srai_epi32
  • _mm_srav_epi32
  • _mm256_srav_epi32
  • _mm256_srli_si256
  • _mm256_bsrli_epi128
  • _mm256_srl_epi16
  • _mm256_srli_epi16
  • _mm256_srl_epi32
  • _mm256_srli_epi32
  • _mm256_srl_epi64
  • _mm256_srli_epi64
  • _mm_srlv_epi32
  • _mm256_srlv_epi32
  • _mm_srlv_epi64
  • _mm256_srlv_epi64
  • _mm256_stream_load_si256
  • _mm256_sub_epi8
  • _mm256_sub_epi16
  • _mm256_sub_epi32
  • _mm256_sub_epi64
  • _mm256_subs_epi8
  • _mm256_subs_epi16
  • _mm256_subs_epu8
  • _mm256_subs_epu16
  • _mm256_xor_si256
  • _mm256_unpackhi_epi8
  • _mm256_unpackhi_epi16
  • _mm256_unpackhi_epi32
  • _mm256_unpackhi_epi64
  • _mm256_unpacklo_epi8
  • _mm256_unpacklo_epi16
  • _mm256_unpacklo_epi32
  • _mm256_unpacklo_epi64

error: invalid static_cast from type 'simde_float64*' {aka 'double*'} to type 'simde__m256d*'

While using https://travis-ci.org/nemequ/simde/builds/625813486 to improve https://salsa.debian.org/med-team/last-align/

../debian/include/simde/x86/../hedley.h:1367:59: error: invalid static_cast from type 'simde_float32*' {aka 'float*'} to type 'simde__m256*'
 1367 | #  define HEDLEY_STATIC_CAST(T, expr) (static_cast<T>(expr))
      |                                                           ^
../debian/include/simde/x86/../simde-common.h:68:45: note: in expansion of macro 'HEDLEY_STATIC_CAST'
   68 | #  define SIMDE_CAST_ALIGN(alignment, T, v) HEDLEY_STATIC_CAST(T, v)
      |                                             ^~~~~~~~~~~~~~~~~~
../debian/include/simde/x86/avx.h:4323:4: note: in expansion of macro 'SIMDE_CAST_ALIGN'
 4323 |   *SIMDE_CAST_ALIGN(32, simde__m256*, mem_addr) = a;
      |    ^~~~~~~~~~~~~~~~
../debian/include/simde/x86/avx.h: In function 'void simde_mm256_stream_pd(simde_float64*, simde__m256d)':
../debian/include/simde/x86/../hedley.h:1367:59: error: invalid static_cast from type 'simde_float64*' {aka 'double*'} to type 'simde__m256d*'
 1367 | #  define HEDLEY_STATIC_CAST(T, expr) (static_cast<T>(expr))
      |                                                           ^
../debian/include/simde/x86/../simde-common.h:68:45: note: in expansion of macro 'HEDLEY_STATIC_CAST'
   68 | #  define SIMDE_CAST_ALIGN(alignment, T, v) HEDLEY_STATIC_CAST(T, v)
      |                                             ^~~~~~~~~~~~~~~~~~
../debian/include/simde/x86/avx.h:4338:4: note: in expansion of macro 'SIMDE_CAST_ALIGN'
 4338 |   *SIMDE_CAST_ALIGN(32, simde__m256d*, mem_addr) =  a;
      |    ^~~~~~~~~~~~~~~~

Compile flags:

g++ -DLAST_INT_TYPE=unsigned -DALPHABET_CAPACITY=64 -Wdate-time -D_FORTIFY_SOURCE=2 -DHAS_CXX_THREADS -DSIMDE_ENABLE_OPENMP -fopenmp-simd -O3 -g -O2 -fdebug-prefix-map=/build/last-align-1021=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wextra -Wcast-qual -Wswitch-enum -Wundef -Wcast-align -Wno-long-long -ansi -pedantic -std=c++11 -DSIMDE_ENABLE_OPENMP -fopenmp-simd -O3 -std=c++2a -I.

Consider marking functions `constexpr`

Someone was asking about something like this on StackOverflow recently. I'm skeptical that it's worth the effort, but I thought I'd at least open up the issue for discussion.

The basic idea is that with C++2a (or C++11 with something supported by HEDLEY_IS_CONSTEXPR_, or maybe c2x if they decide to add constexpr to the language) it should be possible to have pretty much everything in SIMDe be constexpr by using the portable implementation at compile-time.

Unfortunately, the implementation could get pretty messy. We would need to do something like

#if defined(HAVE_NATIVE_VERSION)
#  if defined(SUPPORTS_CONSTEXPR)
if (!std::is_constant_evaluated())
#  endif
return native_version();
#endif

#if defined(SUPPORTS_CONSTEXPR) || !defined(HAVE_NATIVE_VERSION)
return portable_version();
#endif

in every function. Also, HAVE_NATIVE_VERSION should include using other ISA extensions (e.g., NEON for an x86 function)… really, any time we have a faster alternative to the portable version.

Obviously this complicates the implementations somewhat, but not prohibitively so. It would take a fair amount of time to go back and add to all the existing functions, but it's pretty straightforward work (good first issue, maybe?).

My main concern is just that I'm not sure it's worth the effort; SIMD instructions tend to be used mostly when there is a lot of data that can't be computed at compile-time, so this feels like a solution in search of a problem. That said, if people really find themselves in situations where this would be useful on a somewhat regular basis, I could be convinced to (accept a patch to) add it to SIMDe.

If anyone cares about this one way or the other please comment. If people don't respond I'll assume nobody cares and it won't happen. If you don't want to comment but do want to vote, thumbs up/down for yes/no.

error: no matching function for call to 'simde__m256::simde__m256(simde__m256i&)'

https://salsa.debian.org/misterc-guest/gkl/blob/experimental/simde/debian/patches/simde

against

https://github.com/Intel-HLS/GKL

In file included from /build/gkl-0.8.6+dfsg/src/main/native/smithwaterman/avx2-smithwaterman.h:6,
                 from /build/gkl-0.8.6+dfsg/src/main/native/smithwaterman/avx2_impl.cc:3:
/build/gkl-0.8.6+dfsg/src/main/native/smithwaterman/PairWiseSW.h: In function 'void smithWatermanBackTrack(SeqPair*, int32_t, int32_t, int32_t, int32_t, int32_t*, int32_t)':
/build/gkl-0.8.6+dfsg/src/main/native/smithwaterman/PairWiseSW.h:26:44: error: no matching function for call to 'simde__m256::simde__m256(simde__m256i&)'
   26 |             VEC_INT_TYPE sbt11 = VEC_BLEND(w_mismatch_vec, w_match_vec, cmp11); \
      |                                            ^~~~~~~~~~~~~~
/build/gkl-0.8.6+dfsg/src/main/native/smithwaterman/avx2-functions.h:77:54: note: in definition of macro 'VEC_BLEND'
   77 |     (simde__m256i)simde_mm256_blendv_ps((simde__m256)__v1, (simde__m256)__v2, (simde__m256)__mask)
      |                                                      ^~~~
/build/gkl-0.8.6+dfsg/src/main/native/smithwaterman/PairWiseSW.h:134:13: note: in expansion of macro 'MAIN_CODE'
  134 |             MAIN_CODE(bt_vec_0)
      |             ^~~~~~~~~
