agenium-scale / nsimd
Agenium Scale vectorization library for CPUs and GPUs
License: MIT License
Hey! I am working on a research paper that has benchmarks that use NSIMD for vectorization. I use the open-source version. I could not find this repository on JOSS or Zenodo.
Is there a set way to cite your repository? How do you guys prefer to be cited in a paper?
-Thanks
There is no copysign intrinsic; it should be straightforward to add.
The test case generator generates special code for rsqrt11. It should probably do the same for rsqrt8.
I find the non-standard flipsign function (https://docs.julialang.org/en/v1/base/math/#Base.flipsign) convenient, e.g. to implement upwind finite differencing stencils. It can be defined as
flipsign(x, y) = copysign(1, y) * x
but can be implemented more efficiently as
flipsign(x, y) = x ^ (y & SIGNMASK)
It seems that I need to add -DNSIMD_AVX2 -DNSIMD_FMA when I compile.
Instead, nsimd should be able to detect this from CPU flags at compile time, or remember how it was configured with cmake.
cbrt is an ANSI C math function. It would be convenient if it were supported in nsimd, instead of having to fall back to scalar code.
The variable NSIMD_CXX is misused with the MSVC compiler. E.g. NSIMD_CXX == 2014 for Visual Studio 2019, so one cannot check values like NSIMD_CXX < 201103L to switch on allocator flavor.
The mechanism used in the get_impl functions, which stores C code for intrinsics in the impls dictionary, regenerates code for all (!) the intrinsics every time code for one of them is generated. This slows down code generation.
To see this effect, you need to disable clang-format, which is otherwise the slowest part of code generation.
I suggest storing lambdas in the impls dictionary, so that code generation is delayed until a dictionary entry is actually used.
Hello,
We are facing a compilation issue using nsimd on a 32-bit target (VS2015 toolset v140_xp):
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/nsimd.h(419): error C3861: '__popcnt64': identifier not found
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse2/storeu.h(47): error C2719: 'a1': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse2/storeu.h(75): error C2719: 'a1': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse42/storeu.h(47): error C2719: 'a1': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse42/storeu.h(75): error C2719: 'a1': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse2/eq.h(47): error C2719: 'a0': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse2/eq.h(47): error C2719: 'a1': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse2/eq.h(56): error C2719: 'a0': formal parameter with requested alignment of 16 will not be aligned
1>D:\Projets\DevSources\platform_sw_nsimd\sw_pico..\common\misc_lib\nsimd\include\nsimd/x86/sse2/eq.h(56): error C2719: 'a1': formal parameter with requested alignment of 16 will not be aligned
The first line (error C3861: '__popcnt64': identifier not found) makes me think that this has never been tested on such a target.
Thanks in advance
It seems that nsimd defines types i64 etc. in the global namespace (see file nsimd.h, lines 793 ff.). These should probably be prefixed with nsimd_.
// C
int is_aligned(const void* const ptr) {
#if NSIMD_WORD_SIZE == 32
  const u32 val = (u32)ptr;
#else
  const u64 val = (u64)ptr;
#endif
  return val % (NSIMD_MAX_ALIGNMENT / CHAR_BIT) == 0;
}

// C++
template <typename T>
bool is_aligned(const T* const ptr) {
#if NSIMD_WORD_SIZE == 32
  u32 val = (u32)ptr;
#else
  u64 val = (u64)ptr;
#endif
  return val % (NSIMD_MAX_ALIGNMENT / CHAR_BIT) == 0;
}
Certain math functions such as copysign can be implemented efficiently with bitwise operations. It would be convenient to have these available:
copysign
isfinite
isinf
isnan
isnormal
signbit
I find the non-standard flipsign function (https://docs.julialang.org/en/v1/base/math/#Base.flipsign) convenient, e.g. to implement upwind finite differencing stencils. It can be defined as
flipsign(x, y) = copysign(1, y) * x
but can be implemented more efficiently as
flipsign(x, y) = x ^ (y & SIGNMASK)
While trying the following code:
#include <nsimd/nsimd-all.hpp>
#include <vector>
int main() {
  std::vector<nsimd::pack<float> > vect;
  return 0;
}
I run into a compilation error:
gupta2@juawei-a27:~/codes/test_codes$ armclang++ -DNSIMD_SVE -march=armv8-a+sve -ftree-vectorize -I$HOME/install/arm/nsimd/include -L/$HOME/install/arm/nsimd/lib sve.cpp
In file included from sve.cpp:2:
In file included from /opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/vector:64:
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_vector.h:286:35: error: arithmetic on a pointer to an incomplete type
'nsimd::pack<float, 1, nsimd::sve>'
_M_impl._M_end_of_storage - _M_impl._M_start);
~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_vector.h:391:7: note: in instantiation of member function 'std::_Vector_base<nsimd::pack<float,
1, nsimd::sve>, std::allocator<nsimd::pack<float, 1, nsimd::sve> > >::~_Vector_base' requested here
vector()
^
sve.cpp:6:38: note: in instantiation of member function 'std::vector<nsimd::pack<float, 1, nsimd::sve>, std::allocator<nsimd::pack<float, 1, nsimd::sve> > >::vector' requested here
std::vector<nsimd::pack<float> > vect;
^
In file included from sve.cpp:2:
In file included from /opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/vector:62:
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_construct.h:136:25: error: incomplete type '_Value_type'
(aka 'nsimd::pack<float, 1, nsimd::sve>') used in type trait expression
std::_Destroy_aux<__has_trivial_destructor(_Value_type)>::
^
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_construct.h:206:7: note: in instantiation of function template specialization
'std::_Destroy<nsimd::pack<float, 1, nsimd::sve> *>' requested here
_Destroy(__first, __last);
^
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_vector.h:567:7: note: in instantiation of function template specialization
'std::_Destroy<nsimd::pack<float, 1, nsimd::sve> *, nsimd::pack<float, 1, nsimd::sve> >' requested here
std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish,
^
sve.cpp:6:38: note: in instantiation of member function 'std::vector<nsimd::pack<float, 1, nsimd::sve>, std::allocator<nsimd::pack<float, 1, nsimd::sve> > >::~vector' requested here
std::vector<nsimd::pack<float> > vect;
^
2 errors generated.
How do I work with SVE vector packs?
Hey! I'm using this repository to optimize my code, but I don't know how to use 128-bit registers in my program. I'm compiling and running on Linux, and my CPU is an "Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz". I'm sure it supports SSE2. Here are my operations.
First, I use this command to generate the files:
python3 egg/hatch.py -Af
My CMakeLists.txt looks like this:
set(NSIMD_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/nsimd/include)
ExternalProject_Add(nsimd
SOURCE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/nsimd"
BINARY_DIR "${CMAKE_BINARY_DIR}/third_party/nsimd"
CMAKE_CACHE_ARGS "-DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=true"
CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/External/ -DSIMD=SSE2 -DSIMD_OPTIONALS=FMA"
)
My project code looks like this:
#include <nsimd/nsimd-all.hpp>
...some codes...
using BaseType = int8_t;
using PackType = nsimd::pack<BaseType>;
uint64_t packLen = nsimd::len(PackType());
std::cout << packLen << "\n";
The output is "8". Does this mean the pack only holds 64 bits of data? How can I get a pack that holds 128 bits or more? Thank you.
I could not find a way to use the size() function of the C++ pack structure in a constexpr manner. For example, this fails:
constexpr int vsize = pack<double>().size();
because the constructor pack<double>() is not constexpr.
I believe one way to obtain the size of a fixed-size container is via tuple_size. This works e.g. for std::array as well. One could then write
constexpr size_t vsize = std::tuple_size_v<pack<double>>;
While using the right-shift and left-shift bitwise operators, I'm getting an error:
/home/nk/opt/nsimd/include/nsimd/cxx_adv_api_functions.hpp:998:16: error: no matching function for call to ‘shr(const simd_vector&, int&, float, nsimd::cpu)’
998 | ret.car = shr(a0.car, a1, T(), SimdExt());
Here is a minimal program to reproduce the issue:
#include <iostream>
#include <nsimd/nsimd-all.hpp>

int main() {
  nsimd::pack<float> f(42.0f);
  nsimd::pack<float> f2 = nsimd::shr(f, 1);
  std::cout << f2 << std::endl;
  return 0;
}
Is there something I'm doing wrong?
Essentially I wish to change a SIMD vector, say [1, 2, 3, 4], to look like [0, 1, 2, 3] using a right-shift operation.
Does nsimd provide masked store functions (e.g. vstoreu_masked or similar)? I could not find any.
I am maintaining the SIMD library of the Einstein Toolkit (see https://bitbucket.org/cactuscode/cactusutils/src/master/Vectors/), and am interested in exploring a community supported approach. If there are no masked store intrinsics in nsimd, would you be interested in accepting a contribution? Could you provide a few rough guidelines for implementing this?
Hi
So, following the instructions, I ran cmake
cmake .. -DSIMD=AVX2 -DDEV=1 -DBOOST_ROOT=/**/boost_1_72_0 -GNinja
It says:
CMake Warning:
Manually-specified variables were not used by the project:
BOOST_ROOT
DEV
Then ninja -j1 update fails with an "unknown target" error.
However, running:
ninja -j 4 tests
ctest
has worked, apparently successfully: 100% tests passed, 0 tests failed out of 2691.
I was running on macOS. I know it is not a supported target, but I hope it will work since the tests passed.
Hi, I am migrating from boost::simd to nsimd.
It seems that nsimd lacks saturated operators for addition/subtraction (or I missed something).
They would be useful for image processing algorithms.
I will try to add them on my side and make a pull request.
When I check out a new copy of nsimd, configure with
python3 egg/hatch.py --all --force
mkdir build && cd build
cmake -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_FLAGS='-D_DARWIN_C_SOURCE' -DCMAKE_CXX_FLAGS='-D_DARWIN_C_SOURCE' -DSIMD=AVX2 -DSIMD_OPTIONALS=FMA -DCMAKE_INSTALL_PREFIX=$HOME/nsimd ..
cmake --build .
cmake --build . --target tests
ctest . -V
then three tests are failing:
The following tests FAILED:
505 - tests.c_base.rec11.f16.c (Failed)
1211 - tests.cxx_adv.rec11.f16.cpp (Failed)
1913 - tests.cxx_base.rec11.f16.cpp (Failed)
The files nsimd.cpp and nsimd-all.cpp output results via printf, e.g. with the lines
for (int i = 0; i < n; i++) {
  fprintf(stdout, "%f vs %f\n", double(buf[i]), double(-i * i));
}
In the file nsimd-all.cpp, this output also differs from the condition that decides whether the test case succeeds or fails (a +1 is missing).
Unless I am missing the correct naming, it looks like nsimd misses some functions like shuffle, split, tofloat, toint.
Are those planned?
Thanks!
When I download the 2.0 release and follow the instructions, I receive this error:
$ bash scripts/build.sh for sse2 sse42 avx avx2 with gcc
+ set -e
+ SETUP_SH=/tmp/nsimd-2.0/scripts/setup.sh
+ NSCONFIG=/tmp/nsimd-2.0/scripts/../nstools/bin/nsconfig
+ HATCH_PY=/tmp/nsimd-2.0/scripts/../egg/hatch.py
+ BUILD_ROOT=/tmp/nsimd-2.0/scripts/..
+ sh /tmp/nsimd-2.0/scripts/setup.sh
+ set -e
+ NSTOOLS_DIR=/tmp/nsimd-2.0/scripts/../nstools
+ NSTOOLS_URL1=git@github.com:agenium-scale/nstools.git
+ NSTOOLS_URL2=https://github.com/agenium-scale/nstools.git
+ '[' -e /tmp/nsimd-2.0/scripts/../nstools/README.md ']'
+ cd /tmp/nsimd-2.0/scripts/..
++ git remote get-url origin
++ sed s/nsimd/nstools/g
fatal: not a git repository (or any of the parent directories): .git
+ git clone
fatal: You must specify a repository to clone.
The pack class in the C++ API does not provide += etc. operators. One has to write x = x + y instead of the shortcut x += y.
Hi, and thanks a lot for this library, and congrats on v2.
I'm facing an issue building nstools on Windows (MSVC 2019 14.28).
For fetching nstools, the URL
git@github.com:agenium-scale/nstools.git
is used, which seems to require a user-specific key.
Changing it to
https://github.com/agenium-scale/nstools
solves the problem.
While building nsimd with -DSIMD=AARCH64 and arm-hpc-compiler, I am getting errors as described here.
I do not face any issues when building with gcc. The errors come up only with clang-based compilers (e.g. arm-hpc-compiler, clang). I see the error with Clang 9.0.1 as well, which is fairly new.
Here is the output of lscpu:
gupta2@juawei-a19:~/2d_stencil/benchmark/builds/arm_trace(master)$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 4
Vendor ID: ARM
Model: 2
Model name: Cortex-A72
Stepping: r0p2
BogoMIPS: 100.00
L1d cache: 2 MiB
L1i cache: 3 MiB
L2 cache: 16 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Here is the output of uname -a:
gupta2@juawei-a19:~/2d_stencil/benchmark/builds/arm_trace(master)$ uname -a
Linux juawei-a19 4.18.0-80.7.2.el7.aarch64 #1 SMP Thu Sep 12 16:13:20 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
The builds were done with the current master.
First, size-agnostic shuffles, such as reverse, unpack, zip, unzip, ..., will be added. Then it seems (to be confirmed) that all architectures supported, or yet to be supported, by NSIMD have SIMD vector lengths that are multiples of 128 bits. Therefore, custom shuffles whose pattern is 128 bits wide can be repeated on all 128-bit lanes of a SIMD vector, which makes those shuffles length-agnostic as well. In any case, one can follow the same reasoning with 64-bit patterns, as SIMD vector lengths are, a priori, multiples of 64 bits, which is sizeof(float64_t) * 8.
When regenerating files, only write those files that changed. This would make it much faster to rebuild everything, especially the tests.
The operator name round_to_even sounds strange. I assume that this is the usual "round to the nearest integer, breaking ties towards even numbers". However, the name reads as if it always rounded towards the nearest even number, i.e. as if it never returned an odd number.
I just installed nsimd (master branch) on OS X, and the include files ended up in $INSTALLDIR/include/include/nsimd. Note the double include/include that probably shouldn't be there.
I notice that nsimd rounds towards zero (instead of towards the nearest representable number) when converting f32 to f16 on AVX512 architectures. See the constant _MM_FROUND_TO_ZERO in the function store in platform_x86.py. All things being equal, nsimd should use the constant _MM_FROUND_TO_NEAREST_INT instead.
It can easily be done using if_else, but we should offer a simpler way of doing it.
I have downloaded and built the master branch of nsimd. This is macOS with an Intel CPU. I configured with
../nstools/bin/nsconfig .. -Dsimd=avx2 -Dmpfr='-I/opt/local/include -L/opt/local/lib -lmpfr'
and then ran the self-tests with
../nstools/bin/nstest -j$(nproc)
This failed with the errors
-- SUMMARY: 8 fails out of 3059 tests
-- FAILED: ./tests.cxx_adv.notl.u8.cpp98
-- FAILED: ./tests.cxx_base.gather.f16.cpp98
-- FAILED: ./tests.cxx_base.gather.f32.cpp98
-- FAILED: ./tests.cxx_base.maskz_loadu1.i8.cpp11
-- FAILED: ./tests.cxx_base.upcvt.u32_to_f64.cpp98
-- FAILED: ./tests.modules.fixed_point.abs.fp_8_7.cpp11
-- FAILED: ./tests.modules.fixed_point.andl.fp_4_1.cpp11
-- FAILED: ./tests.modules.fixed_point.ne.fp_8_4.cpp98
I believe nsimd chose my MacPorts-installed Clang 11.0.0 as the compiler.
Most NSIMD SVE intrinsics generate a movprfx instruction. This is caused by the use of the *_z intrinsics, which put zeros in inactive lanes; the compiler must emit this instruction to generate correct code. But since all SVE intrinsics in NSIMD are used with svptrue_*(), we can simply use the *_x intrinsics, which put undefined values in inactive lanes and do not generate this instruction.
We do not know whether movprfx slows down execution; this is to be tested, but less code to execute seems better (at first glance at least).
For details on movprfx see https://developer.arm.com/documentation/ddi0596/2021-03/SVE-Instructions/MOVPRFX--unpredicated---Move-prefix--unpredicated--
I'm trying to modify nsimd, and I find it difficult to get started since it's not obvious which files are autogenerated and which are not. It would be nice if all autogenerated code was safely "stashed away" into its own subdirectory.
Hi there,
Are you intending on adding support for WebAssembly SIMD?
BFloat16 is a truncated standard float32, so conversion between the two is cheap. This is OK for all supported architectures.
Reference: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format.
I think that mask_for_loop_tail calculates the mask via scalar code, which is then assembled into a mask vector (I checked with AVX2). Using iota() and a vector comparison instead of computing the mask element by element would prevent this.
Currently, only a small part of the basic NSIMD operators are implemented in the fixed_point module. However, most of the other operators, such as multiple loads/stores, zip/unzip, or casts, can easily be wrapped too.
The C++ standard provides fabs, fmax, and fmin, with the same meanings as abs, max, and min. It would be convenient to have these available for nsimd::pack as well.
File gen_tests.py, line 405, is
code += ['nsimd::store{}u(&vout_nsimd[i], vc);'.format(logical, typ)]
which has one {} in the format string but two arguments.
Is there a FindNsimd.cmake file or similar structure to find nsimd through CMake, or a method to ease searching nsimd when using a CMake project?
The C++ standard does not allow using unions to reinterpret data as a different type, e.g. to access the bit pattern of a float (see e.g. https://en.wikipedia.org/wiki/Type_punning#Use_of_union). One has to use memcpy instead.
While GCC allows this as an extension to the C++ standard, other compilers do not. I don't recall exactly which compilers these are, but I have had trouble in the past while using the IBM XL or PGI compilers on non-Intel architectures.
Fix missing documentation for:
to_pack
to_pack_interleave
get_pack
scoped_aligned_mem
NSIMD currently lacks "big" math functions such as cos, sin, exp, and many others. We plan to use the excellent Sleef instead of providing our own, for several reasons:
Hi, I am migrating from boost::simd to nsimd.
It seems that nsimd doesn't provide high-level STL-like algorithms (transform, reduce, etc.).
I could try to implement them on my side.
Questions:
Where to put them?
Where to put the tests?