
base64simd's Introduction

base64 using SIMD instructions

Overview

This repository contains code for encoding and decoding base64 using SIMD instructions. Depending on the CPU architecture, vectorized encoding is faster than the scalar versions by a factor of 2 to 4; decoding is 2 to 2.7 times faster.
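
For reference, base64 maps every 3-byte group of input onto four characters of a 64-character alphabet. A plain scalar encoder for whole 3-byte groups (an illustration only, not code taken from this repository) looks roughly like this:

// Illustration only: a straightforward scalar encoder for whole 3-byte groups,
// i.e. the work that the vectorized procedures in this repository accelerate.
#include <cstddef>
#include <cstdint>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

size_t encode_scalar(const uint8_t* src, size_t n, char* dst) {
    size_t out = 0;
    for (size_t i = 0; i + 3 <= n; i += 3) {
        const uint32_t word = (uint32_t(src[i]) << 16)
                            | (uint32_t(src[i + 1]) << 8)
                            |  uint32_t(src[i + 2]);
        dst[out++] = b64_alphabet[(word >> 18) & 0x3F];
        dst[out++] = b64_alphabet[(word >> 12) & 0x3F];
        dst[out++] = b64_alphabet[(word >>  6) & 0x3F];
        dst[out++] = b64_alphabet[ word        & 0x3F];
    }
    return out;   // any leftover bytes (n % 3) are ignored here
}

The vectorized procedures perform the same 6-bit splitting and alphabet lookup on whole SIMD registers at a time.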

There are several versions of the procedures, utilizing the following instruction sets:

  • SSE,
  • AVX2,
  • AVX512F,
  • AVX512BW,
  • AVX512VBMI,
  • AVX512VL,
  • BMI2, and
  • ARM Neon.

Vectorization approaches were described in a series of articles.

Daniel Lemire and I also wrote the paper Faster Base64 Encoding and Decoding Using AVX2 Instructions, which was published in ACM Transactions on the Web.

Performance results from various machines are located in the results subdirectories.

Project organization

There are separate subdirectories for the two algorithms (encoding and decoding); however, both have the same structure. Each contains four programs:

  • verify --- performs simple validation of particular parts of the algorithms,
  • check --- validates the whole procedures,
  • speed --- compares the speed of different variants of the procedures,
  • benchmark --- similar to speed, but works on small buffers and reports the CPU cycle rate (available only on Intel architectures); a rough sketch of such a measurement follows this list.
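
How a cycles-per-input-byte figure can be obtained with the rdtsc counter (an illustration, not the repository's actual benchmark harness; the function name is mine):

// Illustration only: time a procedure with rdtsc and report the best run
// divided by the number of input bytes.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc()

template <typename Procedure>
double cycles_per_byte(Procedure&& proc, std::size_t input_bytes, int iterations = 10) {
    std::uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        const std::uint64_t t0 = __rdtsc();
        proc();                                   // run the encoder/decoder once
        const std::uint64_t t1 = __rdtsc();
        best = std::min(best, t1 - t0);
    }
    return double(best) / double(input_bytes);    // best run, cycles per input byte
}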

Building

Change to either the encode or decode directory and then use the following make commands.

command          | tools                                                                | instruction sets
make             | verify, check, speed, benchmark                                      | scalar, SSE, BMI2
make avx2        | verify_avx2, check_avx2, speed_avx2, benchmark_avx2                  | scalar, SSE, BMI2, AVX2
make avx512      | verify_avx512, check_avx512, speed_avx512, benchmark_avx512          | scalar, SSE, BMI2, AVX2, AVX512F
make avx512bw    | verify_avx512bw, check_avx512bw, speed_avx512bw, benchmark_avx512bw  | scalar, SSE, BMI2, AVX2, AVX512F, AVX512BW
make avx512vbmi  | verify_avx512vbmi, check_avx512vbmi, benchmark_avx512vbmi            | scalar, SSE, BMI2, AVX2, AVX512F, AVX512BW, AVX512VBMI
make xop         | verify_xop, check_xop, speed_xop, benchmark_xop                      | scalar, SSE, AMD XOP
make arm         | verify_arm, check_arm, speed_arm                                     | scalar, ARM Neon

Type make run (for SSE) or make run_ARCH to run all programs for a given instruction set; ARCH can be "sse", "avx2", "avx512", "avx512bw", "avx512vbmi" or "avx512vl".

BMI2 presence is detected based on /proc/cpuinfo or its counterpart. When the AVX2 or AVX512 targets are used, BMI2 is enabled by default.
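
A minimal runtime-check sketch (an illustration, not the repository's cpuinfo parsing), assuming a GCC or Clang recent enough to recognize these feature names:

// Illustration only: query the CPU directly instead of parsing /proc/cpuinfo.
#include <cstdio>

int main() {
    std::printf("BMI2:    %s\n", __builtin_cpu_supports("bmi2")    ? "yes" : "no");
    std::printf("AVX2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    std::printf("AVX512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}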

AVX512

To compile the AVX512 versions of the programs, at least GCC 5.3 is required; GCC 4.9.2 does not have AVX512 support.

Please download the Intel Software Development Emulator in order to run the AVX512 variants via make run_avx512, run_avx512bw or run_avx512vbmi. The emulator's location must be added to the PATH.

Known problems

Neither encoding nor decoding matches the base64 specification: there is no processing of the data tail, i.e. the encoder never produces '=' characters at the end, and the decoder does not handle them at all.

None of these shortcomings are present in the brilliant library by Alfred Klomp: https://github.com/aklomp/base64.
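
For callers who need full RFC 4648 output, the missing tail handling can be supplied by a small scalar routine run after the vectorized bulk; a minimal, self-contained sketch (not code from this repository):

// Illustration only: encode the final 1 or 2 leftover input bytes and emit
// the '=' padding that the SIMD encoder never produces.
#include <cstddef>
#include <cstdint>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

size_t encode_tail(const uint8_t* src, size_t n, char* dst) {   // n is 0, 1 or 2
    if (n == 0) return 0;
    const uint32_t b0 = src[0];
    const uint32_t b1 = (n > 1) ? src[1] : 0;
    dst[0] = b64_alphabet[b0 >> 2];
    dst[1] = b64_alphabet[((b0 & 0x03) << 4) | (b1 >> 4)];
    dst[2] = (n > 1) ? b64_alphabet[(b1 & 0x0F) << 2] : '=';
    dst[3] = '=';
    return 4;
}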

See also

Who uses our algorithms?

base64simd's People

Contributors

lemire, wojciechmula

base64simd's Issues

vpermb belongs to AVX512BW?

Hi

I'm running the avx512bw test on my SKL, which has avx512bw support, but I'm getting illegal instruction traps. After some investigation, it seems that vpermb/vpermi2b belong to avx512vbmi instead, and CPUs supporting avx512vbmi do not appear to be officially released yet.

So does the code need a little tweak to use only avx512bw instructions for this test?

]# gdb /tmp/check_avx512bw ./core.103927
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /tmp/check_avx512bw...done.
[New LWP 103927]
Core was generated by `/tmp/check_avx512bw'.
Program terminated with signal 4, Illegal instruction.
#0 0x00000000004082a5 in _mm512_permutex2var_epi8 (__B=..., __I=..., __A=...) at /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h:107
107 /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vbmiintrin.h: No such file or directory.

[1] https://software.intel.com/en-us/node/534480

[2] https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
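
A hedged sketch of how such a path can be guarded at compile time (not code from this repository; the function names are mine). _mm512_permutex2var_epi8 maps to vpermi2b and requires AVX512VBMI, while AVX512BW alone only provides the 16-bit-element variant:

#include <immintrin.h>

#if defined(__AVX512VBMI__)
// vpermi2b: a two-source, byte-granularity table lookup -- AVX512VBMI only.
static inline __m512i lookup_bytes(__m512i idx, __m512i lo, __m512i hi) {
    return _mm512_permutex2var_epi8(lo, idx, hi);
}
#elif defined(__AVX512BW__)
// AVX512BW offers only word-granularity permutes (vpermi2w), so a byte-level
// lookup has to be emulated differently on such CPUs.
static inline __m512i lookup_words(__m512i idx, __m512i lo, __m512i hi) {
    return _mm512_permutex2var_epi16(lo, idx, hi);
}
#endif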

license question :)

Hey,

I am currently designing a little base64 decoding/encoding API, and the implementation of the SIMD-related code is heavily based on your code here, but with a nice reusable API around it, so that others can use it as a header-only lib in their projects.

Now, I usually set all my projects to Apache License version 2, and I think your code is under the BSD license? I am by no means a license expert and most likely never will be. Is it okay to reference your work to the full extent, along with the original license, but keep my base64 library under "Apache License version 2"?

Many thanks in advance,
Christian Parpart.

Benchmarking RVV implementation

I'm not sure if you have access to RVV hardware, so I thought I'd offer my help in that regard:

C908 encode:

input size: 12582912, output size: 16777216
number of iterations: 10
We report the time in cycles per input byte.
For reference, we present the time needed to copy 12582912 bytes.
rdtsc_overhead set to 7
memcpy                          :     0.515 cycle/op (best)    0.612 cycle/op (avg)
scalar (32 bit)                         ... 0.08162
scalar (64 bit)                         ... 0.07382 (speedup 1.11)
SWAR (64 bit)                           ... 0.07765 (speedup 1.05)
RISC-V RVV (LMUL=1)                     ... 0.07042 (speedup 1.16)
RISC-V RVV (LMUL=8)                     ... 0.05483 (speedup 1.49)
RISC-V Vector (LMUL=4, segmented load)  ... 0.02932 (speedup 2.78)

C908 decode:

scalar                                           ... 0.01204
improved scalar                                  ... 0.01204 (speed-up: 1.00)
RISC-V Vector (pack: gather)                     ... 0.01580 (speed-up: 0.76)
RISC-V Vector (pack: compress)                   ... 0.01399 (speed-up: 0.86)
RISC-V Vector (omit ws; pack: gather)            ... 0.01676 (speed-up: 0.72)
RISC-V Vector (omit ws, pack: compress)          ... 0.01733 (speed-up: 0.69)

C920 encode (without the segmented-load variant, because I can only compile intrinsics code to xtheadvector; I doubt it would perform well anyway, compare this and this. I am hopeful that future hardware will perform more similarly to the C908):

input size: 100663296, output size: 134217728
number of iterations: 10
We report the time in cycles per input byte.
For reference, we present the time needed to copy 100663296 bytes.
rdtsc_overhead set to 16
memcpy                          :     0.760 cycle/op (best)    0.905 cycle/op (avg)
scalar (32 bit)                         ... 0.56006
scalar (64 bit)                         ... 0.42620 (speedup 1.31)
SWAR (64 bit)                           ... 0.49197 (speedup 1.14)
RISC-V RVV (LMUL=1)                     ... 0.23546 (speedup 2.38)
RISC-V RVV (LMUL=8)                     ... 0.17674 (speedup 3.17)

C920 decode:

scalar                                           ... 0.05389
improved scalar                                  ... 0.05391 (speed-up: 1.00)
RISC-V Vector (pack: gather)                     ... 0.02384 (speed-up: 2.26)
RISC-V Vector (pack: compress)                   ... 0.02149 (speed-up: 2.51)
RISC-V Vector (omit ws; pack: gather)            ... 0.08527 (speed-up: 0.63)
RISC-V Vector (omit ws, pack: compress)          ... 0.08224 (speed-up: 0.66)

Btw, you should add the tail policies to the inline-assembly vsetvli instructions, otherwise it won't compile with clang.
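
For example, a minimal sketch (not the repository's actual inline assembly) of a vsetvli with the tail/mask policy spelled out, which clang's integrated assembler requires:

// Illustration only: "ta, ma" (tail-agnostic, mask-agnostic) is the part
// clang insists on; GNU as accepts the shorter form and applies defaults.
#include <cstddef>

static inline std::size_t set_vl_bytes(std::size_t n) {
    std::size_t vl = 0;
#if defined(__riscv_vector)
    asm volatile("vsetvli %0, %1, e8, m8, ta, ma" : "=r"(vl) : "r"(n));
#endif
    return vl;
}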

I'll be working on my own implementations and share them here, so we can figure out what works best.

URL Safe

Are you open to the idea of an option to switch to the URL-safe alphabet?

sse3 lookup

This is the obvious extension of the saturation method to pshufb LUTs.
It is a few instructions shorter than the current SSE4 method and, AFAIK, should be faster if unrolled.

;;; sse3 lookup ;;;
; xmm0  = in/out
; xmm12 = 0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,[...]  // mask_2F
; xmm13 = 0x01,0x50,0x54,0x46,0x25,0x25,0x05,0x05,[...]  // lut_roll_up
; xmm14 = 0xFF,0x7F,0x7F,0x76,0x66,0x66,0x66,0x66,[...]  // lut_roll_down
; xmm15 = 0x00,0x3F,0x3E,0x34,0x00,0x00,0x1A,0x1A,[junk] // lut_shift
;
;;; perfect hash for lut = ((src>>4)&0x2F)+((src==0x2F)?0xFF:0x00)
;;; 0000 = garbage
;;; 0001 = slash
;;; 0010 = plus
;;; 0011 = 0-9
;;; 0100 = A-Z
;;; 0101 = A-Z
;;; 0110 = a-z
;;; 0111 = a-z
;;; 1000 >= garbage

movdqa   xmm2, xmm12
movdqa   xmm3, xmm13
movdqa   xmm4, xmm14
movdqa   xmm5, xmm15
movdqa   xmm1, xmm0
psrld    xmm1, 4
pcmpeqb  xmm2, xmm0
pand     xmm1, xmm12
paddb    xmm1, xmm2
pshufb   xmm3, xmm1
pshufb   xmm4, xmm1
pshufb   xmm5, xmm1
paddb    xmm0, xmm3
psubsb   xmm0, xmm4
paddusb  xmm0, xmm5
pmovmskb  eax, xmm0
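
For readers more comfortable with intrinsics, a rough transcription of the sequence above (hypothetical: the function name is mine, and the LUTs, whose full 16-byte contents are only partially listed above, are passed in as parameters):

#include <immintrin.h>
#include <cstdint>

// Transcription of the assembly sketch: perfect-hash the input, do three
// parallel pshufb lookups, apply the saturating adds/subtracts, and extract
// the byte MSBs with pmovmskb (presumably an error mask).
static inline __m128i sse_lookup(__m128i in,
                                 __m128i lut_roll_up,
                                 __m128i lut_roll_down,
                                 __m128i lut_shift,
                                 uint32_t& error_mask)
{
    const __m128i mask_2F = _mm_set1_epi8(0x2F);

    // hash = ((src >> 4) & 0x2F) + ((src == 0x2F) ? 0xFF : 0x00)
    const __m128i is_slash = _mm_cmpeq_epi8(in, mask_2F);                      // pcmpeqb
    const __m128i hi       = _mm_and_si128(_mm_srli_epi32(in, 4), mask_2F);    // psrld + pand
    const __m128i hash     = _mm_add_epi8(hi, is_slash);                       // paddb

    const __m128i roll_up   = _mm_shuffle_epi8(lut_roll_up,   hash);           // pshufb
    const __m128i roll_down = _mm_shuffle_epi8(lut_roll_down, hash);           // pshufb
    const __m128i shift     = _mm_shuffle_epi8(lut_shift,     hash);           // pshufb

    __m128i out = _mm_add_epi8(in, roll_up);                                   // paddb
    out = _mm_subs_epi8(out, roll_down);                                       // psubsb (signed saturation)
    out = _mm_adds_epu8(out, shift);                                           // paddusb (unsigned saturation)

    error_mask = static_cast<uint32_t>(_mm_movemask_epi8(out));                // pmovmskb
    return out;
}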
