aklomp / base64
Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
License: BSD 2-Clause "Simplified" License
With gcc 8.2 and an i686 target, the following error is generated:
`__x86.get_pc_thunk.bx' referenced in section `.text' of lib/libbase64.o: defined in discarded section `.text.__x86.get_pc_thunk.bx[__x86.get_pc_thunk.bx]' of lib/libbase64.o
I found 2 ways to fix this:
1. Add -fno-pie to CFLAGS in the Makefile.
2. Add __x86.get_pc_thunk.bx to exports.txt.
With fix 1 we get:
root@edison:~/base64# OMP_THREAD_LIMIT=1 ./benchmark
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 10 * 1
plain encode 82.34 MB/sec
plain decode 134.75 MB/sec
but lose address space layout randomization (ASLR). And with fix 2:
plain encode 77.88 MB/sec
plain decode 81.43 MB/sec
My vote goes to 1.
During the review of the vcpkg integration of this library (microsoft/vcpkg#25091) it was noted that the name of this library is very generic and will likely clash with other base64 libraries. This is especially dangerous for names resolved by the linker like the so/archive name and function names. The public header name is also prone to a file name conflict.
Therefore it would be nice if either the library name was changed to something more unique like a proper name or the entities above were prefixed with something unique. Obviously this would be a breaking change and I would understand if this was rejected. OTOH it has been mentioned that 0.6.0 would be the target for breaking changes anyway.
I want to use your great library in our pretty big system. We already have detection of CPU/platform features, and it won't interfere with the library's settings. There is a little problem: when I want to use BASE64_FORCE_XXX (based on our settings) and "XXX" is not actually compiled into the library, we get the dummy implementation, which does nothing.
How can I ask the library which BASE64_FORCE_XXX flags are meaningful? Of course I can alter the sources, but that's not the proper solution.
I use the stable 0.3.0 version.
make
export j=""
for i in b a s e 6 4 t e s t; do
j="$j$i"
echo -n $j | /usr/bin/base64 | ./bin/base64 -d
echo
done
Output:
b
ba
decoding error
base
base6
decoding error
base64t
base64te
decoding error
base64test
Add an .editorconfig
dotfile in the root of the project to help enforce a consistent code style. This idea was raised in the discussion for #79, as a way to avoid preventable mistakes such as trailing whitespace or using an inconsistent spacing style.
The .editorconfig
format is documented at editorconfig.org.
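For illustration, a minimal .editorconfig along these lines could be a starting point (the concrete values are assumptions and should be matched to the project's existing style, e.g. tabs for indentation):

```ini
# Hypothetical starting point; adjust to the project's actual conventions.
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

[*.{c,h}]
indent_style = tab
```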
I ran into an interesting issue where linking a project with libmysql.a as well as this library caused some extremely weird behavior. Eventually I found out there was a namespace clash: mysql also declares two functions base64_encode() and base64_decode(). Please consider prefixing your library's C functions with a unique namespace. I've gotten around this by forking and renaming them for now, but I want to say thanks for the great library; I've enjoyed using it. I'd also make a PR, but I don't know what you'd want to prefix with, and it is a breaking change to the API.
This is a new issue to discuss whitespace character filtering as mentioned in #15 and #27
The implementation will probably depend on the density of whitespace characters.
while (srclen >= 32U) {
// Characters to be checked (all should be valid, repeat the first ones):
const __m128i whitespaces = _mm_setr_epi8(
'\n','\r','\t', ' ',
'\n','\r','\t', ' ',
'\n','\r','\t', ' ',
'\n','\r','\t', ' '
);
static const uint8_t lut[][16] __attribute__((aligned (16))) = {
{ 0U, 1U, 2U, 3U, 4U, 5U, 6U, 7U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U },
{ 1U, 2U, 3U, 4U, 5U, 6U, 7U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U },
...,
{ 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U }
};
static const uint8_t ipopcnt_lut[] = {
8U, 7U, 7U, 6U, 7U, 6U, 6U, 5U, 7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U, 3U, 2U, 2U, 1U, 2U, 1U, 1U, 0U
};
__m128i c0 = _mm_loadu_si128((const __m128i*)(src + 0));
__m128i c2 = _mm_loadu_si128((const __m128i*)(src + 16));
src += 32;
srclen -= 32U;
__m128i wm0 = _mm_cmpistrm(whitespaces, c0, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
__m128i wm2 = _mm_cmpistrm(whitespaces, c2, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
unsigned int mi0 = (unsigned int)_mm_cvtsi128_si32(wm0);
unsigned int mi2 = (unsigned int)_mm_cvtsi128_si32(wm2);
if (mi0 | mi2) {
unsigned int i0 = ipopcnt_lut[mi0 & 255U];
unsigned int i1 = ipopcnt_lut[mi0 >> 8];
unsigned int i2 = ipopcnt_lut[mi2 & 255U];
unsigned int i3 = ipopcnt_lut[mi2 >> 8];
__m128i c1 = _mm_srli_si128(c0, 8);
__m128i c3 = _mm_srli_si128(c2, 8);
c0 = _mm_shuffle_epi8(c0, _mm_load_si128(lut[mi0 & 255U]));
c1 = _mm_shuffle_epi8(c1, _mm_load_si128(lut[mi0 >> 8]));
c2 = _mm_shuffle_epi8(c2, _mm_load_si128(lut[mi2 & 255U]));
c3 = _mm_shuffle_epi8(c3, _mm_load_si128(lut[mi2 >> 8]));
_mm_storel_epi64((__m128i*)dst, c0);
dst += i0;
_mm_storel_epi64((__m128i*)dst, c1);
dst += i1;
_mm_storel_epi64((__m128i*)dst, c2);
dst += i2;
_mm_storel_epi64((__m128i*)dst, c3);
dst += i3;
}
else {
_mm_storeu_si128((__m128i*)(dst + 0), c0);
_mm_storeu_si128((__m128i*)(dst + 16), c2);
dst += 32;
}
}
Here are the timings. All throughputs are given for the output buffer, which is the same for all variants.
decode
: valid base64 input
decode-s
: one whitespace every 80 characters (we can see that the choice of 80 has an impact on the AVX2 decoder for Method 1, because it is a multiple of 16 but not of 32)
decode-s8
: 8 whitespaces every 80 characters
decode-d
: 1 whitespace before each valid character.
Method 1 seems to be the best for sparse whitespace characters, except for the AVX2 decoder (which could/should be fixed to handle 8/16 bytes of valid input). For handling sparse groups of whitespace characters or very dense whitespace, Method 3 is better.
Real-world data analysis would help determine which case we want to optimize for.
For Method 1:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2 decode 5279.50 MB/sec
AVX2 decode-s 1587.19 MB/sec
AVX2 decode-s8 1005.11 MB/sec
AVX2 decode-d 226.64 MB/sec
plain decode 1234.00 MB/sec
plain decode-s 1180.52 MB/sec
plain decode-s8 899.77 MB/sec
plain decode-d 245.90 MB/sec
SSSE3 decode 2941.00 MB/sec
SSSE3 decode-s 2414.34 MB/sec
SSSE3 decode-s8 1198.43 MB/sec
SSSE3 decode-d 210.47 MB/sec
SSE41 decode 2938.98 MB/sec
SSE41 decode-s 2369.73 MB/sec
SSE41 decode-s8 1167.77 MB/sec
SSE41 decode-d 209.58 MB/sec
SSE42 decode 3820.97 MB/sec
SSE42 decode-s 3469.99 MB/sec
SSE42 decode-s8 2009.10 MB/sec
SSE42 decode-d 273.74 MB/sec
AVX decode 3959.44 MB/sec
AVX decode-s 3573.91 MB/sec
AVX decode-s8 2051.51 MB/sec
AVX decode-d 233.33 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2 decode 5302.67 MB/sec
AVX2 decode-s 1590.81 MB/sec
AVX2 decode-s8 1004.56 MB/sec
AVX2 decode-d 226.81 MB/sec
plain decode 1238.86 MB/sec
plain decode-s 1184.43 MB/sec
plain decode-s8 903.14 MB/sec
plain decode-d 241.19 MB/sec
SSSE3 decode 2952.08 MB/sec
SSSE3 decode-s 2472.69 MB/sec
SSSE3 decode-s8 1157.22 MB/sec
SSSE3 decode-d 211.32 MB/sec
SSE41 decode 2952.81 MB/sec
SSE41 decode-s 2477.51 MB/sec
SSE41 decode-s8 1201.25 MB/sec
SSE41 decode-d 210.74 MB/sec
SSE42 decode 3831.55 MB/sec
SSE42 decode-s 3487.39 MB/sec
SSE42 decode-s8 2101.31 MB/sec
SSE42 decode-d 275.14 MB/sec
AVX decode 3967.55 MB/sec
AVX decode-s 3507.18 MB/sec
AVX decode-s8 2055.18 MB/sec
AVX decode-d 225.99 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2 decode 5272.91 MB/sec
AVX2 decode-s 1592.12 MB/sec
AVX2 decode-s8 998.94 MB/sec
AVX2 decode-d 226.41 MB/sec
plain decode 1237.12 MB/sec
plain decode-s 1183.57 MB/sec
plain decode-s8 902.16 MB/sec
plain decode-d 241.08 MB/sec
SSSE3 decode 2920.62 MB/sec
SSSE3 decode-s 2401.63 MB/sec
SSSE3 decode-s8 1165.97 MB/sec
SSSE3 decode-d 213.04 MB/sec
SSE41 decode 2952.13 MB/sec
SSE41 decode-s 2473.57 MB/sec
SSE41 decode-s8 1170.04 MB/sec
SSE41 decode-d 210.97 MB/sec
SSE42 decode 3826.16 MB/sec
SSE42 decode-s 3484.39 MB/sec
SSE42 decode-s8 1877.99 MB/sec
SSE42 decode-d 274.92 MB/sec
AVX decode 3860.69 MB/sec
AVX decode-s 3491.69 MB/sec
AVX decode-s8 2030.65 MB/sec
AVX decode-d 222.85 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2 decode 5213.04 MB/sec
AVX2 decode-s 1608.52 MB/sec
AVX2 decode-s8 1012.05 MB/sec
AVX2 decode-d 231.93 MB/sec
plain decode 1237.74 MB/sec
plain decode-s 1184.74 MB/sec
plain decode-s8 903.57 MB/sec
plain decode-d 250.86 MB/sec
SSSE3 decode 2942.19 MB/sec
SSSE3 decode-s 2464.90 MB/sec
SSSE3 decode-s8 1197.61 MB/sec
SSSE3 decode-d 219.08 MB/sec
SSE41 decode 2946.40 MB/sec
SSE41 decode-s 2468.99 MB/sec
SSE41 decode-s8 1165.37 MB/sec
SSE41 decode-d 214.67 MB/sec
SSE42 decode 3703.31 MB/sec
SSE42 decode-s 3372.53 MB/sec
SSE42 decode-s8 1814.50 MB/sec
SSE42 decode-d 267.22 MB/sec
AVX decode 3843.75 MB/sec
AVX decode-s 3460.58 MB/sec
AVX decode-s8 2030.02 MB/sec
AVX decode-d 227.47 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2 decode 4314.66 MB/sec
AVX2 decode-s 1562.26 MB/sec
AVX2 decode-s8 1024.11 MB/sec
AVX2 decode-d 233.38 MB/sec
plain decode 1207.16 MB/sec
plain decode-s 1129.71 MB/sec
plain decode-s8 904.50 MB/sec
plain decode-d 249.83 MB/sec
SSSE3 decode 2714.32 MB/sec
SSSE3 decode-s 2369.40 MB/sec
SSSE3 decode-s8 1203.05 MB/sec
SSSE3 decode-d 219.10 MB/sec
SSE41 decode 2716.16 MB/sec
SSE41 decode-s 2392.42 MB/sec
SSE41 decode-s8 1203.59 MB/sec
SSE41 decode-d 215.04 MB/sec
SSE42 decode 3418.99 MB/sec
SSE42 decode-s 3181.47 MB/sec
SSE42 decode-s8 1904.47 MB/sec
SSE42 decode-d 283.45 MB/sec
AVX decode 3556.95 MB/sec
AVX decode-s 3238.04 MB/sec
AVX decode-s8 1986.58 MB/sec
AVX decode-d 233.68 MB/sec
For Method 3:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2 decode 3230.06 MB/sec
AVX2 decode-s 2741.82 MB/sec
AVX2 decode-s8 2874.97 MB/sec
AVX2 decode-d 1558.49 MB/sec
plain decode 619.40 MB/sec
plain decode-s 548.73 MB/sec
plain decode-s8 548.71 MB/sec
plain decode-d 420.43 MB/sec
SSSE3 decode 2057.59 MB/sec
SSSE3 decode-s 1807.58 MB/sec
SSSE3 decode-s8 1920.25 MB/sec
SSSE3 decode-d 1257.10 MB/sec
SSE41 decode 2123.90 MB/sec
SSE41 decode-s 1890.92 MB/sec
SSE41 decode-s8 1883.30 MB/sec
SSE41 decode-d 1209.47 MB/sec
SSE42 decode 2622.27 MB/sec
SSE42 decode-s 2366.88 MB/sec
SSE42 decode-s8 2421.09 MB/sec
SSE42 decode-d 1435.30 MB/sec
AVX decode 2777.55 MB/sec
AVX decode-s 2419.70 MB/sec
AVX decode-s8 2480.22 MB/sec
AVX decode-d 1450.81 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2 decode 3378.80 MB/sec
AVX2 decode-s 2775.84 MB/sec
AVX2 decode-s8 2892.33 MB/sec
AVX2 decode-d 1590.29 MB/sec
plain decode 626.00 MB/sec
plain decode-s 580.68 MB/sec
plain decode-s8 562.96 MB/sec
plain decode-d 408.17 MB/sec
SSSE3 decode 2191.44 MB/sec
SSSE3 decode-s 1892.64 MB/sec
SSSE3 decode-s8 1861.03 MB/sec
SSSE3 decode-d 1227.95 MB/sec
SSE41 decode 2140.16 MB/sec
SSE41 decode-s 1893.56 MB/sec
SSE41 decode-s8 1911.62 MB/sec
SSE41 decode-d 1226.44 MB/sec
SSE42 decode 2724.52 MB/sec
SSE42 decode-s 2257.47 MB/sec
SSE42 decode-s8 2370.60 MB/sec
SSE42 decode-d 1424.35 MB/sec
AVX decode 2743.86 MB/sec
AVX decode-s 2433.86 MB/sec
AVX decode-s8 2428.08 MB/sec
AVX decode-d 1408.13 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2 decode 3284.48 MB/sec
AVX2 decode-s 2735.35 MB/sec
AVX2 decode-s8 2813.35 MB/sec
AVX2 decode-d 1574.50 MB/sec
plain decode 623.97 MB/sec
plain decode-s 570.76 MB/sec
plain decode-s8 566.80 MB/sec
plain decode-d 406.15 MB/sec
SSSE3 decode 2191.73 MB/sec
SSSE3 decode-s 1905.69 MB/sec
SSSE3 decode-s8 1843.39 MB/sec
SSSE3 decode-d 1261.96 MB/sec
SSE41 decode 2190.32 MB/sec
SSE41 decode-s 1938.25 MB/sec
SSE41 decode-s8 1841.31 MB/sec
SSE41 decode-d 1259.89 MB/sec
SSE42 decode 2738.85 MB/sec
SSE42 decode-s 2408.05 MB/sec
SSE42 decode-s8 2430.57 MB/sec
SSE42 decode-d 1437.67 MB/sec
AVX decode 2818.79 MB/sec
AVX decode-s 2388.06 MB/sec
AVX decode-s8 2494.78 MB/sec
AVX decode-d 1455.20 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2 decode 3345.58 MB/sec
AVX2 decode-s 2842.44 MB/sec
AVX2 decode-s8 2887.61 MB/sec
AVX2 decode-d 1561.51 MB/sec
plain decode 634.51 MB/sec
plain decode-s 591.72 MB/sec
plain decode-s8 573.85 MB/sec
plain decode-d 464.49 MB/sec
SSSE3 decode 2182.45 MB/sec
SSSE3 decode-s 1927.53 MB/sec
SSSE3 decode-s8 1933.26 MB/sec
SSSE3 decode-d 1248.62 MB/sec
SSE41 decode 2183.39 MB/sec
SSE41 decode-s 1954.86 MB/sec
SSE41 decode-s8 1935.52 MB/sec
SSE41 decode-d 1247.71 MB/sec
SSE42 decode 2726.89 MB/sec
SSE42 decode-s 2421.89 MB/sec
SSE42 decode-s8 2432.85 MB/sec
SSE42 decode-d 1421.62 MB/sec
AVX decode 2809.99 MB/sec
AVX decode-s 2470.29 MB/sec
AVX decode-s8 2496.32 MB/sec
AVX decode-d 1432.62 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2 decode 2964.24 MB/sec
AVX2 decode-s 2619.61 MB/sec
AVX2 decode-s8 2530.74 MB/sec
AVX2 decode-d 1536.01 MB/sec
plain decode 621.43 MB/sec
plain decode-s 588.34 MB/sec
plain decode-s8 567.04 MB/sec
plain decode-d 459.67 MB/sec
SSSE3 decode 2011.50 MB/sec
SSSE3 decode-s 1848.21 MB/sec
SSSE3 decode-s8 1764.75 MB/sec
SSSE3 decode-d 1231.03 MB/sec
SSE41 decode 2001.69 MB/sec
SSE41 decode-s 1846.22 MB/sec
SSE41 decode-s8 1765.15 MB/sec
SSE41 decode-d 1229.00 MB/sec
SSE42 decode 2507.66 MB/sec
SSE42 decode-s 2254.18 MB/sec
SSE42 decode-s8 2162.37 MB/sec
SSE42 decode-d 1405.58 MB/sec
AVX decode 2582.07 MB/sec
AVX decode-s 2305.40 MB/sec
AVX decode-s8 2281.94 MB/sec
AVX decode-d 1414.54 MB/sec
Switch from a 6-bit lookup table to a 12-bit lookup table in the Generic32 and Generic64 encoders. Some quick tests show that halving the memory accesses greatly increases performance, at the cost of a larger lookup table (4096 bytes). Not a hard tradeoff.
Tables can be generated with a small Python script. Tables differ for little-endian and big-endian architectures.
Hi,
Proposal: Replace all the custom preprocessor checks for architecture, platform, compiler, SIMD, etc. with the Boost.Predef library which is independent and header only, so that one can simply import the necessary header files into the repository and be done with it (please note that this is an internal change and doesn't affect library users in any way). I think that this will improve code readability, because the Boost.Predef macros are far more descriptive and might prevent fallacies like the one in #12, e.g. compare
#ifdef BOOST_COMP_MSVC_AVAILABLE
#ifdef BOOST_OS_WINDOWS_AVAILABLE
with
#ifdef _MSC_VER
#if defined(_WIN32) || defined(__WIN32__)
If you think that this is a good idea, but don't have the time to make the change I'll happily submit a pull request.
Cheers,
Henrik
I did some automated benchmarking on my i7-10700 and Edison (Merrifield dual core Silvermont Atom without cache memory, similar to Baytrail) that I want to share here. Strictly, this issue is for reference only. It might be useful to find those commits causing substantial performance increases or decreases. All data have been taken without OpenMP (1 thread only) and in x86_64 mode. On i7 you will see some deviation probably caused by frequency scaling / turbo boost. Don't let that disturb you.
Data can be found here if you want to play with it yourself: benchmarks.ods
Below I filter out the most interesting commits.
Note that on Edison SSE3 encoding took a hit with 9a0d1b2.
# | Hash | Commit message |
---|---|---|
24 | 3f3f31c | Fix build under Xcode |
30 | 67ee3fd | SSSE3->AVX2 encoding optimization |
76 | a5b6739 | SSSE3: enc: factor encoding loop into inline function |
79 | 99977db | Generic64: enc: factor encoding loop into inline function |
92 | e2c6687 | AVX2: enc: unroll inner loop |
93 | 9a0d1b2 | SSSE3: enc: unroll inner loop |
96 | bf7341f | Generic64: enc: unroll inner loop |
114 | b8b3c58 | Generic64: enc: use 12-bit lookup table |
Especially for Edison it has been a bumpy ride, with great improvements (3f3f31c) and regressions (0a69845) on SSE3, but also for plain (cfa8bf7 and f538baa).
# | Hash | Commit message |
---|---|---|
24 | 3f3f31c | Fix build under Xcode |
29 | cfa8bf7 | Plain decoding optimization |
35 | 0a69845 | SSSE3->AVX2, NEON32 decoding optimization |
85 | 6310c1f | SSSE3: dec: factor decoding loop into inline function |
88 | f538baa | Generic32: dec: factor decoding loop into inline function |
100 | 495414b | AVX2: dec: unroll inner loop |
101 | 5874921 | SSSE3: dec: unroll inner loop |
When providing a library, it is good practice to only globally export symbols that the user is expected to link against. All other symbols should be hidden from view so that they do not pollute the global symbol namespace.
This is currently done by using the linker to roll all object files into a single object file, and instructing the linker to remove all symbols from that file except for those whitelisted in exports.txt.
This works nicely on GNU-based platforms, but is not really portable to OSes like Windows, the BSDs, and macOS. It also will not be compatible with cmake when we change to that build system (#7), because cmake scripts are written at a level of abstraction where such platform-dependent hacks should not exist. Also, weird linking problems such as #50 may arise. A better solution is needed.
These days it is common to declare symbol visibility at the source code level by using a compiler option to hide all symbols by default. Builtin functions are used to tag specific symbols as having global visibility. In GCC, this can be achieved by compiling with the -fvisibility=hidden
flag, and whitelisting user-visible symbols with __attribute__((visibility("default")))
.
This method is supported by GCC and Clang, and works along the same lines in MSVC with different builtins (__declspec(dllexport)
). It will also be forward compatible with the way cmake handles symbol visibility.
Older compilers may not have such builtins, and our library may end up exporting all global symbols. While not great, the pain can be alleviated by ensuring that all symbols with global scope at least have the base64_
prefix. (This is currently true for all global symbols except codec_choose
.)
Proposed changes:
- Ship a static library (libbase64.a) instead of a single object file (libbase64.o). This will break backwards compatibility for users, but will be forward compatible with what cmake will do.
- Add a macro, BASE64_EXPORT, to make certain symbols globally visible.
I have made some small modifications to the code to take advantage of multi-threading using OpenMP and achieve a further 3x improvement with codec plain and 2x with codec SSE3. With 8 threads on a quad-core with hyperthreading, both codecs now run at approximately the same throughput, which seems to be memory bandwidth limited (see attached under the heading ferry-quad).
I uploaded the code to my repository https://github.com/htot/base64, branch openmp, but I'm not really familiar with working with GitHub, so I don't know how to create a pull request. Also, as of right now, the code fails to build without the option -fopenmp.
Are you interested in pulling this? How should we proceed?
My odroid c1+ arrived and I used my break to gather some benchmark results using your tool:
Processor | Plain enc | Plain dec | NEON32 enc | NEON32 dec |
---|---|---|---|---|
odroid c1+ clang 3.7.1 | 183,522 | 127,522 | 204,958 | 173,856 |
odroid c1+ gcc 5.3.0 | 165,600 | 137,060 | 212,044 | 165,356 |
I compiled the library for the Windows x64 platform with the SSE3 instruction set specified, then used it to encode to base64, but the length of the resulting base64 string is zero. I tried x86, and it works well.
The generic codecs currently read and write potentially unaligned memory ranges using raw pointer casts. This will break on platforms that require strict alignment for memory loads/stores. Examples:
// lib/arch/generic/32/enc_loop.c
uint32_t str = *(uint32_t *)c;
// lib/arch/generic/32/dec_loop.c
*(uint32_t*)o = x.asint;
Replace such casts with calls to memcpy(3)
:
#include <string.h>
uint32_t str;
memcpy(&str, c, sizeof (str));
memcpy(3)
is defined by the C99 standard, so is portable. The arguments to memcpy
are all known at compile time, so on platforms that allow nonaligned memory access, a good compiler will optimize the memcpy
to a single mov
instruction. On platforms without nonaligned memory loads/stores, the function generates bytewise access, which is the best we can do in those circumstances.
This will also allow us to remove the HAVE_FAST_UNALIGNED_ACCESS
macro, which currently guards an unaligned memory store.
Hi aklomp, I was searching for a nice base64 library and ended up here, but my low-level C is a bit rusty, so I would like to ask whether this library could be used to handle the base64url variant, or at least for some tips on how to add it. (I would post patches back if I made it.)
Thank you very much for your time.
Hi, I need to compile this library on the Windows platform. What changes do I need to make to build it with Visual Studio? Do you have a manual I can use? Thanks!
I see a lot of commits since the last release, so I was interested to know when you are going to publish a new release.
It looks like at least gcc fails now with BASE64_NEON64_USE_ASM
set and -O0
. Example output:
18:45:00 In file included from ../deps/base64/base64/lib/arch/neon64/codec.c:63:
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:11:36: error: invalid character ' ' in raw string delimiter
18:45:00 "ld3 {"P".16b, "Q".16b, "R".16b}, [%[src]], #48 \n\t"
18:45:00 ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:18:28: error: invalid character ' ' in raw string delimiter
18:45:00 "ushr %[t2].16b, "R".16b, #6 \n\t" \
18:45:00 ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:23:28: error: invalid character ' ' in raw string delimiter
18:45:00 "and %[t3].16b, "R".16b, %[n63].16b \n\t"
18:45:00 ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:31:16: error: invalid character ' ' in raw string delimiter
18:45:00 "tbl "R".16b, {v8.16b-v11.16b}, %[t2].16b \n\t" \
18:45:00 ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:37:35: error: invalid character ' ' in raw string delimiter
18:45:00 "st4 {"P".16b, "Q".16b, "R".16b, "S".16b}, [%[dst]], #64 \n\t"
18:45:00 ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c: In function 'enc_loop_neon64':
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:11:27: error: stray 'R' in program
18:45:00 "ld3 {"P".16b, "Q".16b, "R".16b}, [%[src]], #48 \n\t"
18:45:00 ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:65:2: note: in expansion of macro 'LOAD'
18:45:00 LOAD("v2", "v3", "v4") \
18:45:00 ^~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:114:10: note: in expansion of macro 'ROUND_A_FIRST'
18:45:00 " " ROUND_A_FIRST()
18:45:00 ^~~~~~~~~~~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:11:28: error: expected ':' or ')' before string constant
18:45:00 "ld3 {"P".16b, "Q".16b, "R".16b}, [%[src]], #48 \n\t"
18:45:00 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:65:2: note: in expansion of macro 'LOAD'
18:45:00 LOAD("v2", "v3", "v4") \
18:45:00 ^~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:114:10: note: in expansion of macro 'ROUND_A_FIRST'
18:45:00 " " ROUND_A_FIRST()
18:45:00 ^~~~~~~~~~~~~
and a bunch of similar "stray 'R'" errors after that.
clang, however, compiles without errors.
Convert the full encoding loop to an inline assembly implementation for compilers that can use inline assembly.
The motivation for this change is issue #96: when optimization is turned off on recent versions of clang, the encoding table is sometimes not loaded into sequential registers. This happens despite taking pains to ensure that the compiler uses an explicit set of registers for the load (v8
-v11
).
This leaves us with few options besides rewriting the full encoding loop in inline assembly. Only that way can we be absolutely certain that the correct registers are used. Thankfully, aarch64 assembly is not very difficult to write by hand.
clang must not be allocating l3
in a contiguous register? While building 3eab8e6, the compiler errors are:
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: registers must be sequential
"and %[t3].16b, v14.16b, %[n63].16b \n\t"
^
<inline asm>:10:40: note: instantiated into assembly here
tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: unknown token in expression
"and %[t3].16b, v14.16b, %[n63].16b \n\t"
^
<inline asm>:10:48: note: instantiated into assembly here
tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: invalid operand
"and %[t3].16b, v14.16b, %[n63].16b \n\t"
^
<inline asm>:10:48: note: instantiated into assembly here
tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: registers must be sequential
"tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
^
<inline asm>:11:40: note: instantiated into assembly here
tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: unknown token in expression
"tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
^
<inline asm>:11:48: note: instantiated into assembly here
tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: invalid operand
"tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
^
<inline asm>:11:48: note: instantiated into assembly here
tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: registers must be sequential
"tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
^
<inline asm>:12:40: note: instantiated into assembly here
tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: unknown token in expression
"tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
^
<inline asm>:12:48: note: instantiated into assembly here
tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: invalid operand
"tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
^
<inline asm>:12:48: note: instantiated into assembly here
tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: registers must be sequential
"tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
^
<inline asm>:13:40: note: instantiated into assembly here
tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: unknown token in expression
"tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
^
<inline asm>:13:48: note: instantiated into assembly here
tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: invalid operand
"tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
^
<inline asm>:13:48: note: instantiated into assembly here
tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
^
I'm not sure how "invalid" input data should be processed. Could this be clarified?
Here are a few samples using base64_decode; what's the expected output for those?
This could be extended to streaming API, but I thought I'd start with the simple block decode.
See the description for #58 for more background. This change allows the programmer to express their intent more clearly to the compiler, which will make for cleaner and faster code.
Like was done in #91 for NEON32
, we can implement the inner encoding loop for the NEON64
encoder in inline assembly. This should guarantee that we get the assembly code that we want/expect. The inner encoding loop is quite simple, so there is no large cost to adding a second parallel implementation.
On NEON, it is possible to do slightly better than the current 3-to-4 byte encoding shuffle by using SLI instructions and some cleverness. We can also match or beat the compiler if we use inline assembly to manually pipeline these instructions.
Hello, I call the base64_decode interface from 40 threads. While processing, a segmentation fault appears in this interface, but if I use fewer threads, e.g. 6 or 8, there are no problems. Here is my call:
memset(raw_image_buffer_, 0, 5242880);
base64_decode(base64_data.data(), base64_data.size(), raw_image_buffer_, &base2image_size, 0);
I use the same picture every time, and I can ensure that the input pointer and length are valid. The input size is 682901. raw_image_buffer_ is a fixed buffer of size 5242880, memset before each call.
Like was done for NEON32
in #91 and for NEON64
in #92, we can get a pretty large speedup for the AVX2 encoder by implementing the inner loop in inline assembly for compilers that support it. Testing on my machine (i5-4590S) with a proof-of-concept branch shows around a 33% speed improvement (!).
This is achieved in the same way that we handle the NEON
encoders. Split the encoder into assembly "recipes" for translation and shuffling, interleave them with loads and stores, and keep three sets of data in flight in parallel inside large unrolled loops. It's basically the code that we'd hope the compiler would generate for us if it was clever enough.
The drawbacks I see to adopting this approach are an increase in complexity and a decrease in transparency in this library, because generating inline assembly code with C macros is a bit gnarly. But on the other hand, those speed gains don't lie. And this would be purely additive: the codepath would only be taken on compilers that support it, and the normal implementation would remain available.
The advantage is the large speedup, of course, and also the fact that the implementation is not too crazy when it's laid side to side with the intrinsics version. It's basically the same algorithm in a different expression.
I would like to release a new library version (v0.4.0) which puts a tag on the code that everyone has been using for the last few years. This will make room to introduce some breaking changes from here on forward.
However, before we can release, the code is in need of a cleanup round. The codebase has accumulated a lot of patches from various contributors, in various different styles. This has led to some inconsistencies with how things are done, as well as diminished the quality of comments, and accumulated dead code.
Objective: push a set of commits which fix many of these issues, without introducing any functional changes. The library code, as seen by the compiler, must remain unchanged from what it is now, to avoid regressions for anyone switching to this new version.
Examples: if statements, etc.; the CMPGT macros and friends.
Currently, encoder kernels are simple while loops that are inserted into a switch statement by an #include directive. Turn these loops into full static inline functions.
This change should have no adverse effects, because the compiler will inline the function at its only callsite. On the contrary, this change will be beneficial for maintenance and performance.
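A minimal sketch of what such a static inline kernel could look like for the plain codec. The function name, signature, and loop body here are illustrative, not the library's actual code; they only show the shape of the transformation from an #include'd loop to a self-contained inline function:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain-codec encoding kernel as a static inline function.
 * The compiler will inline it at its single callsite, so the generated
 * code is unchanged, but the kernel is now a normal, testable unit. */
static inline void
enc_loop_plain (const uint8_t **s, size_t *slen, uint8_t **o, size_t *olen)
{
	static const char tbl[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

	while (*slen >= 3) {
		const uint8_t *p = *s;

		/* Split 3 input bytes into four 6-bit indices: */
		(*o)[0] = tbl[  p[0] >> 2];
		(*o)[1] = tbl[((p[0] & 0x03) << 4) | (p[1] >> 4)];
		(*o)[2] = tbl[((p[1] & 0x0F) << 2) | (p[2] >> 6)];
		(*o)[3] = tbl[  p[2] & 0x3F];

		*s += 3; *slen -= 3;
		*o += 4; *olen += 4;
	}
}
```

The double-pointer arguments mirror how the kernels consume the stream state in place.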
$ uname -a
FreeBSD t420nnd.cls.to 10.3-RELEASE-p24 FreeBSD 10.3-RELEASE-p24 #0: Wed Nov 15 04:57:40 UTC 2017 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
$ gmake -C test
gmake: Entering directory '/home/nanard/code/git/base64/test'
rm -f benchmark test_base64 *.o
cc -std=c99 -O3 -Wall -Wextra -pedantic -o codec_supported.o -c codec_supported.c
cc -std=c99 -O3 -Wall -Wextra -pedantic -o test_base64 test_base64.c codec_supported.o ../lib/libbase64.o
cc -std=c99 -O3 -Wall -Wextra -pedantic -o benchmark benchmark.c codec_supported.o ../lib/libbase64.o -lrt
benchmark.c:101:16: error: use of undeclared identifier 'CLOCK_REALTIME'
clock_gettime(CLOCK_REALTIME, o_time);
^
benchmark.c:99:35: warning: unused parameter 'o_time' [-Wunused-parameter]
base64_gettime (base64_timespec * o_time)
^
1 warning and 1 error generated.
gmake: *** [Makefile:24: benchmark] Error 1
Adding -D_XOPEN_SOURCE=600 does the trick
diff --git a/test/Makefile b/test/Makefile
index d104582..11c57a3 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -2,6 +2,7 @@ CFLAGS += -std=c99 -O3 -Wall -Wextra -pedantic
ifdef OPENMP
CFLAGS += -fopenmp
endif
+CFLAGS += -D_XOPEN_SOURCE=600
TARGET := $(shell $(CC) -dumpmachine)
ifneq (, $(findstring darwin, $(TARGET)))
I want to build the library as a shared library (perhaps called libbase64.so). Any suggestions?
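One possible approach, sketched under the assumption that the repository's default Makefile produces the combined object lib/libbase64.o (this is not an official build target):

```shell
# Rebuild the combined object with position-independent code,
# then link it into a shared library. Codec flags (e.g. AVX2_CFLAGS)
# are passed the same way as for the normal build.
make clean
CFLAGS="-O3 -fPIC" make lib/libbase64.o
cc -shared -o libbase64.so lib/libbase64.o
```

Whether -fPIC is already implied depends on your toolchain defaults, so passing it explicitly is the safe choice.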
Hello, I tested the fast base64 decode and it is indeed faster than other code, but I ran into a problem when using the function int base64_decode(const char *src, size_t srclen, char *out, size_t *outlen, int flags): the required size of out is unknown, so decoding easily fails when the buffer is too small. How can I solve this?
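The output size can be bounded from the input size: every 4 base64 characters decode to at most 3 bytes. A small sketch of the bound (the helper name is mine, not the library's):

```c
#include <stddef.h>

/* Worst-case output size for base64_decode: every 4 input characters
 * decode to 3 bytes, so (srclen / 4) * 3 bytes always suffice. The
 * division is rounded up to stay safe with unpadded input. */
static size_t decoded_bound (size_t srclen)
{
	return ((srclen + 3) / 4) * 3;
}
```

Allocating decoded_bound(srclen) bytes for out guarantees the decoder never runs short; the actual decoded length is returned through outlen.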
The transformation of decoder kernels to inline functions (#59) allows us to move the inner decoding loop into separate inline functions.
Because the number of remaining loop iterations is known, we can split calls to the inner loop into long unrolled stretches. Tests show that this can result in a significant speedup.
Benchmark results are currently "stored" in a big table in the README which is getting ungainly and hard to modify. The results are presented in text form, which makes it harder to compare across machines and codecs.
Objectives: generate and store the results in a structured form as part of the tests/benchmarks.
Without #include <intrin.h>, _byteswap_uint64 is undefined at its point of use and, because of C's implicit declaration rules, is assumed to return an int, which is only 32 bits. This garbles the result (for future searchers: the value gets sign-extended from 32 bits into src).
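A sketch of how to make the byte swap explicit so no implicit declaration can truncate the result. The bswap64 wrapper name is mine; the intrinsics behind it are real:

```c
#include <stdint.h>

/* Pick the proper 64-bit byte swap per toolchain; the portable
 * fallback does the swap with shifts and masks. */
#if defined(_MSC_VER)
#include <intrin.h>                    /* declares _byteswap_uint64 */
#define bswap64(x) _byteswap_uint64(x)
#elif defined(__GNUC__) || defined(__clang__)
#define bswap64(x) __builtin_bswap64(x)
#else
static inline uint64_t bswap64 (uint64_t x)
{
	return ((x & 0x00000000000000FFULL) << 56)
	     | ((x & 0x000000000000FF00ULL) << 40)
	     | ((x & 0x0000000000FF0000ULL) << 24)
	     | ((x & 0x00000000FF000000ULL) <<  8)
	     | ((x & 0x000000FF00000000ULL) >>  8)
	     | ((x & 0x0000FF0000000000ULL) >> 24)
	     | ((x & 0x00FF000000000000ULL) >> 40)
	     | ((x & 0xFF00000000000000ULL) >> 56);
}
#endif
```

With the declaration in scope, the return type is a full uint64_t and the sign-extension problem disappears.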
I want to use it in an iOS project. How do I build it in Xcode? Help!
When I build with all x86 codecs and run the benchmarks on an Atom (x86_64), it crashes due to an illegal instruction. I always thought I just needed to disable the AVX and AVX2 codecs (and that works around the problem). But looking into the code, even though the AVX* codecs are forced, support is apparently intended to be detected, and the crash is therefore unintended.
I think it would be better to benchmark only the codecs whose instructions are actually supported, instead of crashing. But how do we do that? Not force the codec, or detect the supported codecs and force from that list?
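One way to build such a list on GCC/Clang is the compiler's CPU feature builtins: check support before forcing a codec flag. A sketch (the function name is mine; which BASE64_FORCE_* flag to pass is up to the caller):

```c
#include <stdbool.h>

/* Returns true only when this CPU actually executes AVX2, so the
 * benchmark can safely force that codec. Falls back to false on
 * non-x86 targets or non-GNU compilers. */
static bool safe_to_force_avx2 (void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2") != 0;
#else
	return false;
#endif
}
```

The same pattern works for "ssse3", "sse4.1", "avx" and so on, giving a "force from the supported list" benchmark loop.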
The old Travis CI config compiled and tested the code on various architectures: arm32, arm64, ppc64be, ppc64le and s390. We can partly replicate this with GitHub Actions by using the uraimo/run-on-arch-action. Update the CI config to run and test the library on various architectures.
Unfortunately, uraimo/run-on-arch-action does not support a big-endian architecture (yet?). It would be nice if we could find a way to include such an architecture in the test matrix.
An example of a project that uses this CI setup is aklomp/vec.
First off, thank you for providing this open source library (and under the BSD license).
My question is: would it be possible to see versions of the decoding functions that support uint16_t[] strings in addition to const char[] strings? I tried looking at just the SSSE3 implementation (since that is what is used on my dev machine) to see how that might be implemented, but the numbers in the comments have me confused.
My particular use case is that I get a uint16_t[] from the V8 JS engine, which is how it internally stores its strings. Being able to use that value type directly would be more efficient than having to make a copy to an 8-bit array first.
I'd like to get uint16_t[] supported for each codec, since we run on a variety of platforms. Any help you can give would be much appreciated.
I'm using this library on an ARM embedded system.
I have a static function which is called from multiple threads at the same time. I'm using Boost's mutex to make the important parts atomic:
char Base64Buffer[5500];
size_t Base64Length;
// buffer_size is max 4096
int writeOutput(uint8_t *buffer, int buffer_size) {
mutex.lock();
base64_encode((const char*)buffer, buffer_size, Base64Buffer, &Base64Length, 0);
...
mutex.unlock();
return 0;
}
It is working, however sometimes when writeOutput is called from different threads at the same time, base64_encode creates an invalid stream.
Here is a sample invalid base64 (unfortunately I don't have the raw unencoded data):
idqI4bA2ArMAfIkw0a5wmcSAxEpoMJTwOX/aR1BVTTZVRO+Hx4NT4OiOtvrhS8+W2LvbDO0IVh220MGMlZOG2x6MVJeIjeB4fkkNwnwiJ8tOwacCeJhHRCBZFKXvhU33fUw1lWH2dxw2sr9oihvNH3ot3PWrrl/KTC7wCatBcIHtMT6/cuH7dC5EgdwSYdKW4ibqMcjH9hGwT9nPZ92JUbIjmBrzmJI8aoMi8KEpEU61sbuugfxFFwfEtmVO7S5hwYjkormiBzTaDOSTpPi+O5TuTwKUXDb0bebTBVvL8JSicHsX7GVHPQdnAJgWSZ9uT1YU/MsrJYs/srGbvMw30sqGpbJ4/sNAcoVkpb8TFXJcRJZAl2jejm5qbY1TYzP6nz8y7LMMUnUdOwEKIdYu8Kub/Gb6jkxMQULsxc8UjNpS4EkDwYW6/Sm+LE6FIXwSrcIHwM93fBitXDJrbVEN0RYhGl1EB53QXlrUwabERn91tTq3J5i/90AF/KcJwpHCq7Lyrp8JeXuYnON4A84ygEO0p94JNgu1h/CO4iULOuw8GBCpZxFmbHkSIW7ihfPFRfHEV6+8BEoxhjDt7lXaFmjij07RfDJlGri+AhxgFLqCGGFNPKsZYUNfDYm0RaCMdjtDZOAgn+GeIfUpE9zR0y1lVrQVvYMi8X3GTchagsgCW73r/Zmff7qNY2AQsSTKKN+D6auzHvFhlAV0Q2pVd9DayVg9IUAJuslO0+XnrlPpAgIyOjio9ShPjk9mLS9uLyOSN4UsIAeqxWl7SoaXop79+t0iorBvbfQA94hO4Ly/tOkQqNY4Z71YAG2AOFs+VCorRcLvY+AchsGa2XPWJmtQv/9K0EyrJMZHAqwF04sRKY9GuFXpPV4ccLq2NUO0Ct/5chTUO+r0nDc059sZjvsW+4eLR2nDKq9+E9niTeSODPqb+lU3o992dExJPI5uqmzo5ntfUXPM8EvwCarbk/XdN1xtb30XqN7F8ENNFFT2HwQsshO47JgFETlrImdpfjXcpocCGyEoe+fqvIPIHdKpr9BvfAA1FczsDNIc7lIkk5vWsz8TpfRczTVzuXTRLACaOzRZOAAaeZ45ZMNtrP8Rm6gEsAcs+mxxHOidR9F4HiGAXh3be14xMhiUkMS9sYqn404H5SeLxg8HmQbyfTcORQWnB4AimvVWfi2MTbRGAkSiULdUGZKy65dp1axesRABHf46jF+YmCyysIg303IRFHI+oWRPwFoIhHTuQmBkAuuYypJyQwbzeb80XX/KdDyfIsfaNLVzgQoXFWozwqmlk/W4ept/7V6eebGkChc1P2sUZCfczk3uiiKSBx6L6eVvjtUsKpNq0vQAAAa+QeGI/ymmOd5c0K419SeXSexFWbJJi4etmdyBS6FsnE2ssGX5J+gw4Wc0YIlSkoXqGIAViAuz6cGx/nRp/ZuT9kqT88+Fc9xe0oabx32LFCwbaBVqHrmPqTDXE2BXkaCb1y+gX7XfvqRAIB7/5uuBM9sASdk2mNPa0r7h9pNyEzSsfvSda2fJEROOG+cGQ14nHRx+5XXANnuqnbbuY9g7k7BRqYrpapTVafKL9+DP4S86TVlZMUugCsHwHWM2VtBZgx2YvoOgrCtFnnUrdKUJoHMUuB/wa6SFV5BuvAnnlQ9l6PClDPN3NqRYECt6w7/wW0UJRhPWmkC07mhTvwyM4AiG/dR5DhUktBPZONHvwVKcWkRvM+i+Srr8UBQtvCQ5lUFDj/+OevmoYmkr9wwchEqWEk4EuO3mZZe3KoKW5dmIUNbx0WbwxXBSu5RW7YJPR9G2jxac9ZeHb9rW40NbPcKdZ9qBTmxKKwJZfJof997jCZRVFuKRpB/OBHbvTlAGfflJY0335As3eFROyJ3SQtzSNiT7h8offMnoH1d0EE01EyMwzvZFI/3Wk0JYP6T7ZEV+wGialipfRNkPoU0nmxewc9AN1LLFrEK2YtNr+qvZNhHWDlanBQFj6frK
EvGvBTOSXeWRKyDnc37hNCUCpkda76GigNGeAAyLroj0xjzileWwjoHkWS//TiWQAS8lxW43tj+40eVx5M8mGQ0kvrZ3wHl7tToXha5Xw2he7EXfBVvo73DmXteH+0xJtEhNAgSClSKk8x70mfXMFYWZU0mXpU/CZZboYAgfAYdUq54lopUT4+YppDBZhGe9P+/1LeEujQ5xnO1rVsK23UUleJFePszCvQjvbuh6uawCouT3MTtvJFfQTt5WIkvhtr4Vv6tbQMMGBjH8J9RKjNJXsD1xnZ/6bSLD4Sqa0NSDzOcvsG3vmnL9F7a8U/pjG0barDAl7rYciN3jZWtTS09usUvMJKAZw/BF/jZN23QyF3oZ/v7/sZXzuvFYltwsMgqxrjZfVzwlAMwQ9d6ZI4Itg+Gf+sMxGHa4qmzn3/esNCcybwEWUAFzzRb5/RxJwGHKoUYz8QLApexP2AmKHng3hotDnyN0QbVab1oK2lXj3aRjsNP0QF/7lRhwcanrxNo14dPY0Lgqm+mEicezUO5RjpWUMWNtaqPbN7dMMewT+l9bTW6fpMx7UHh+H5cm+HRRxjBEn51XGSnyrm0AOpu5Qdh8Fe+ZCWx6dnJGsYPRXd7av789Mu2zOqYHqILU7Vl9vgagyPHkJCcWG+Z+FIuoTdLKrsaHReoMRileqqlt3PH9ko3Kz+r4ztDJDD6kZAbqlBKO+96I1AGbA+/QZ+ILZB2K0d7uFlnvbIUuCjTjTNdE9CQIYWBT7DthshNIoJdHp2mrOLzV2Hb7fJhdlHtq/HfxB1mESwQWnSAJluCqa0qsqjyADLz3ln6SavFYIfN6pSqzWzlt78CURh1p2EnRf+jQ9ZgM4fByIIEO2dgYo01wo60cExp9R+Fp8Z6CHXuq5jnkPeI+RSRsjrDJRvCC1LcLdrgmwfG1PARqWYlzmeKmWwztX74RqJVV+OQ0mIaWqti360cIBc5SJKyiP0YFVX5tYzI9HG1wquO5vgfH6kP9KYXgocQiD0LQR4tQp98v++I0Fq/sgj2mW5gHnXTMq1ijXr3rUyLqYbMvOTC4yvh08UQDHwDrEhJGCVlL1MKi48NIHtfF8+EhVqib7++6ZCttyc6uH5Hqkd3FVmJ25x6Gh0EFENsT51bJ8NK/xXtUSy5i8tXZgBEI/jpQcb7s2mPr6B3fRXhsRi7rs5bPCs3kMQynTMrRHT0jV0Uvm6XlrVA4TtxS43MLMNT8ItYIJya9gZLbfZE77qazpfiaVTujUOWMITsSi5sV80kZ0LELGGOY76b/M82v5zDCE5ss7rcFUqeUca8LWvzB8dn2VsOV9k2iMc2XIb2/MFffTsc9UEDo66+ZdBLEwLPXJ+OBt6yCnEE1xwWV8jCfbx2TLaT+9urSBb38lYfB0eyV5Ex/Y3jF58TqprSM8xbRL0XlwRp8F92KMMaoNGHCoTs+oDdDZ4PRcTdIRAZXaBCLOd2VT1gBIxiVc/gUPcU0uLaa743vI1u4OdJMp17hq00DrTpaWr+iFZjwkfBCBCZfB5C6K73Lo5WVUGNurZbvC4PagULv+FXFOKDbAOba/COlw1OR1ZK31c0TChWepTi9EuvD6KEgid+vTYxD2INfF08RLbcB7V5eDBdNsTXrVS7KFL33Ik6kKMacrPQy/SjJdg==Xgf5ecvEiJ8duJAD+GUKzDw3DbwfuHL7zLVQzKkLDA46/NhORitq1nzjndLtx0EEkTKMbOXEDi27t/DZzsEyeAOWt36bJ46KwDDFOpn3dT2KThXw130NiQ3TbBeMUMwR0obhUycp1A5qzFn1RknQm7f/e4kPyTuhVgDB8czuDb/Lok9y0vo9AE4TiA/4jgJ6S+OmjRcB/h/529iaAFf+lpf50QBTyqg1Pyy6inZ6jbdCn+aX2tU7hI6hp0G3IrCa4C5swRS72QiIDNne3zrzpIyHHW6e9yqT5GZAxEPeSaGs0lSb2j/1e6Z4dJwOW9rBkgO3IzvZC8tK
4iYHC2alp+no76vFu3RCT1ktH3stq31EDR+7ulkVrjej9EobFuPu9vh+90WnkqI2Xtk0O/roBhSk9vROeGWFJmDP6ZVpPg3RpUWx2P4vvPQ8xyrCFubf4cGP76QmJ8TVQ/oWDBlngBsI1SwWdYuGtP+P7hVYUA/Rh3vdWkFJND4Wur7TUNQ1buhKaWa1K80pKmSEvBRDPvz7MctP6w+TP0ZZpmCA2m7kOiRvJYvPpjL8h121Sf7s33pHIE3c0936xXb0t72wJVLWAWAMzlkBxS0IcRQVEV0m4Z1HCX3tXVv+bVZ/XqRBEz78seIGW7yNkZmc3GbrZnuVqmzma5SYXybHkGlyxYo5aSP+zDpHIHkNsPxYiuQE1SeBMaiuzKU4FoggvEoHlC0mCwgvKR+2OKU/k7zf3GwoDwii4kz/9fDSEsiEuputtFijy3/Na0Qx0R0mWgwi5q1tApKVLpCvLn1WR55FeKZnVtVShtuued9WRebV6J8PR+YD4BJ2hmKN1EY9iJOuNJ14QG/IdiwtGcvP89r279OiTC3Stb4gz75MzZpbbO0/hQdg5YW8cF/w7YyCJ4N5p53a55dAnHv0SANXVTD+DQfBrpfK6rOSEhdOjE7y+cz7DvKCrfJkTkevA8GkIjoeffWVqIi2tXB+gJ3uhDMga9fKrv/M6M8qLFB7OGeimtVseZKAU/DZIB3fX0Ji20TWWG9+pZe9IAZllegjVo/mK10fwhnTfugg+GvcYVc+WjrYLQSYe6fkgPmlLLrkvhECUfkx7SWtZo1runArW6uM1vQDRFyI4LpeOCpOPSDcj2CVLUXLpJU44wmP/PUPIvOw2zCfkD+qKexnvs92k4bibBzn5N5pumiuV5TAbDCed4u9Oh5urO57eacaOvUp50oxYWoF2JoC3/IL0uci1wqN9S3X4B/a8Z6jCjCgX98REihYMVaemApD7KV6eHT7OIVCBQaZzk4DEjRcy1g/nVJiSHYlbLESnIZyojW8q7ZnCH5stnsaYp1L1V4R1FYJJPqfGEgu+GQq9AWP9NQyDUlesbriOl/kXII5kaeRYj7mzo6MkIdh9m/eBjL0yCUodo/C6Sk9XsTN97Mjkhme+nQutEH00plVLXecBRP28257ybo+sjvZXtl+kG6Ew2/hskGhjpfWC0xg9YBXQE1u2Dr2uswm0uBBfOjSEimmrtIHTjWKmDxCBEQIJo+5i07MqAAABKtB4YD/KYke+rrUl0r9qozDpbbRNTI5mnydC/Nr3aj9y9QPNqLrU5B4GrXmA52IQVQQ0jC2B4dMcmKDZ5qlvDMQCg1WCC3cjX0u8tgqLdB0sKknj8Muo+SIPGBXKm8h7NBq6PHTk/Ir/sANPZxfyx83OaB+IZCsXnGOwg7iO7CeuOQ7T0h3CtBROJDq+J2jkqlqrS19R5SnUEJr6sP6oWf2xo3AldD2fyHwBTQQ6/bZyw==
Could you help me with hunting this?
How can multithread affect base64 generation if it's in an atomic mutex block?
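A hedged guess at the cause: the encoder keeps no per-call mutable global state (apart from the one-time codec choice), so the corruption most likely comes from the shared Base64Buffer/Base64Length being read after the unlock, or written through another path. Per-thread buffers remove the shared state entirely; the size check below confirms 5500 bytes is still enough. Names here are illustrative:

```c
#include <stddef.h>

/* One output buffer per thread instead of one shared global
 * (C11 _Thread_local); no lock is then needed around the encode. */
static _Thread_local char thread_buf[5500];

/* Worst-case encoded size: every 3 input bytes become 4 output
 * chars, rounded up. For buffer_size = 4096 this is 5464 bytes,
 * so the 5500-byte buffer is not overflowing. */
static size_t encoded_bound (size_t srclen)
{
	return ((srclen + 2) / 3) * 4;
}
```

If the buffer must stay shared, everything that reads Base64Buffer or Base64Length has to happen inside the same locked region as the encode.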
Compiling under GCC7 on OSX w/ AVX2:
fastbase64 git:(master) CC=gcc-7 CXX=g++-7 AVX2_CFLAGS=-mavx2 make
gcc-7 -fPIC -std=c99 -Wall -Wextra -Wshadow -Wpsabi -Wfatal-errors -O3 -march=native -march=native -mavx2 -c src/chromiumbase64.c -Iinclude
gcc-7 -fPIC -std=c99 -Wall -Wextra -Wshadow -Wpsabi -Wfatal-errors -O3 -march=native -march=native -mavx2 -c src/fastavxbase64.c -Iinclude
gcc-7 -fPIC -std=c99 -Wall -Wextra -Wshadow -Wpsabi -Wfatal-errors -O3 -march=native -march=native -mavx2 -c src/fastavx512bwbase64.c -Iinclude
src/fastavx512bwbase64.c: In function 'enc_reshuffle':
src/fastavx512bwbase64.c:6:1: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
static inline __m512i enc_reshuffle(const __m512i input) {
^~~~~~
src/fastavx512bwbase64.c:6:23: note: The ABI for passing parameters with 64-byte alignment has changed in GCC 4.6
static inline __m512i enc_reshuffle(const __m512i input) {
^~~~~~~~~~~~~
In file included from /usr/local/Cellar/gcc/7.3.0_1/lib/gcc/7/gcc/x86_64-apple-darwin17.3.0/7.3.0/include/immintrin.h:55:0,
from /usr/local/Cellar/gcc/7.3.0_1/lib/gcc/7/gcc/x86_64-apple-darwin17.3.0/7.3.0/include/x86intrin.h:48,
from src/fastavx512bwbase64.c:3:
/usr/local/Cellar/gcc/7.3.0_1/lib/gcc/7/gcc/x86_64-apple-darwin17.3.0/7.3.0/include/avx512bwintrin.h:1831:1: error: inlining failed in call to always_inline '_mm512_shuffle_epi8': target specific option mismatch
_mm512_shuffle_epi8 (__m512i __A, __m512i __B)
^~~~~~~~~~~~~~~~~~~
compilation terminated due to -Wfatal-errors.
make: *** [fastavx512bwbase64.o] Error 1
When I'm not using cmake-gui (i.e. generating using Yocto) only the plain codec is built in some cases. I don't understand why, but the below patch seems to resolve it. As if the string was never evaluated. Any ideas? @BurningEnlightenment @aklomp
add_feature_info("OpenMP codec" BASE64_WITH_OpenMP "spreads codec work accross multiple threads")
cmake_dependent_option(BASE64_REGENERATE_TABLES "regenerate the codec tables" OFF "NOT CMAKE_CROSSCOMPILING" OFF)
-set(_IS_X86 "_TARGET_ARCH STREQUAL \"x86\" OR _TARGET_ARCH STREQUAL \"x64\"")
+if((_TARGET_ARCH STREQUAL "x86") OR (_TARGET_ARCH STREQUAL "x64"))
+ set(_IS_X86 1)
+else()
+ set(_IS_X86 0)
+endif()
cmake_dependent_option(BASE64_WITH_SSSE3 "add SSSE 3 codepath" ON ${_IS_X86} OFF)
add_feature_info(SSSE3 BASE64_WITH_SSSE3 "add SSSE 3 codepath")
cmake_dependent_option(BASE64_WITH_SSE41 "add SSE 4.1 codepath" ON ${_IS_X86} OFF)
The transformation of encoder kernels to inline functions (#58) allows us to move the inner encoding loop into separate inline functions.
Because the number of remaining loop iterations is known, we can split calls to the inner loop into long unrolled stretches. Tests show that this can result in a significant speedup.
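The unrolling idea can be sketched like this, with a stand-in enc_round() in place of the real inner encoding step (all names here are illustrative, not the library's):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one round of the real inner encoding step. */
static inline void enc_round (uint64_t *acc)
{
	*acc = *acc * 3 + 1;
}

/* When the remaining round count is known, peel off long fixed-size
 * stretches so the compiler emits straight-line code, then mop up
 * the tail one round at a time. */
static void enc_loop_unrolled (uint64_t *acc, size_t rounds)
{
	while (rounds >= 8) {           /* unrolled stretch */
		enc_round(acc); enc_round(acc); enc_round(acc); enc_round(acc);
		enc_round(acc); enc_round(acc); enc_round(acc); enc_round(acc);
		rounds -= 8;
	}
	while (rounds-- > 0)            /* tail */
		enc_round(acc);
}
```

The result is identical to the plain loop; the gain comes from fewer branches and better scheduling inside the unrolled stretch.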
After pulling in recent changes I encountered a couple of issues so far:
if (max_level >= 7) {
#ifdef HAVE_AVX2
__cpuid_count(7, 0, eax, ebx, ecx, edx);
#endif
#ifdef HAVE_AVX
if (!(ecx & bit_XSAVE_XRSTORE)) {
__cpuid(1, eax, ebx, ecx, edx);
}
#endif
// ...
Subject... Or maybe there are alternatives?
Some testing on my Raspberry Pi 2B 1.1 shows that GCC and Clang both generate pretty terrible code from neon intrinsics.
For the NEON32 encoder, which is simpler than the x86 encoders, the speed can be substantially improved by hand-coding the relatively simple inner loop in inline assembly. A quick proof-of-concept shows that inline assembly gets around 382 MB/s on GCC, against 209 MB/s for the status quo. Clang does worse and better at the same time, getting 304 MB/s for the inline assembly and 294 MB/s for the status quo. Both are an improvement, so I think this should be added.
Suppose we are calling base64_encode or base64_decode in a loop (for different inputs) and doing it from multiple threads (for different data). If we pass non-zero flags to these routines, every call will write to a single global variable in the codec_choose_forced function, which will lead to "false sharing" and poor scalability.
There is no method to pre-initialize the choice of codec. (Actually, there is: we can simply call one of the encode/decode routines in advance with empty input, but that looks silly.) If we don't do that and run our code under ThreadSanitizer, it will complain about a data race on the codec function pointers. In fact it is safe, because it is a single pointer, a single machine word that is (supposedly) placed in an aligned memory location. But we would have to annotate it as _Atomic and store/load it with memory_order_relaxed. See the similar issue here: simdjson/simdjson#256
Suppose we use these routines in a loop for short inputs. They have a branch to check if encoders/decoders were initialized. We want to move these branches out of the loop: check for CPU and call specialized implementation directly. But architecture specific methods are not exported and we cannot do that. We also have to pay for two non-inlined function calls.
All these issues were found while integrating this library into ClickHouse: ClickHouse/ClickHouse#8397
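The _Atomic fix suggested above can be sketched as follows. The codec type, stand-in codec, and function names are illustrative, not the library's internals:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Keep the chosen codec behind an _Atomic function pointer and access
 * it with memory_order_relaxed: the lazy one-time initialization is
 * then race-free under TSan, and the hot-path load costs nothing extra
 * on common architectures. */
typedef int (*codec_fn) (int);

static int codec_plain (int x) { return x + 1; }   /* stand-in codec */

static _Atomic(codec_fn) chosen = NULL;

static int call_codec (int x)
{
	codec_fn fn = atomic_load_explicit(&chosen, memory_order_relaxed);
	if (fn == NULL) {
		fn = codec_plain;   /* real CPU detection would happen here */
		atomic_store_explicit(&chosen, fn, memory_order_relaxed);
	}
	return fn(x);
}
```

Two threads may race through the NULL branch, but both store the same pointer, so the result is identical either way.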
Currently, the Travis CI only builds one library on the standard OS and architecture. In order to increase testing coverage, update the Travis config file to include a build matrix with many more OS and architecture combinations.
Note that support for the native Windows MSVC toolchain would require a CMake-based build configuration, which we don't have (yet), so that platform will initially not be supported.
I tried to do base64 encoding on image data:
char* out;
size_t outlen;
base64_encode((const char*)color.get_data(), color.get_data_size(), out, &outlen, 0);
color.get_data() returns pointer to the frame data.
color.get_data_size() returns the number of bytes in the frame handle.
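The snippet above passes an uninitialized pointer: base64_encode writes into caller-provided storage, so out must point at an allocated buffer of sufficient size before the call. A sketch of the allocation step (alloc_encode_buf is a hypothetical helper name):

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate the worst-case encoded size before calling base64_encode:
 * every 3 input bytes become 4 output chars, rounded up. The actual
 * encoded length comes back through outlen. */
static char *alloc_encode_buf (size_t srclen, size_t *cap)
{
	*cap = ((srclen + 2) / 3) * 4;
	return malloc(*cap);
}
```

With the buffer allocated (and freed after use), the call in the snippet becomes well-defined instead of writing through a garbage pointer.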