base64's People

Contributors

aklomp, burningenlightenment, denravonska, hardfalcon, htot, jkrems, lucshi, madebr, mayeut, mscdex, noj, vglavnyy, zenden2k

base64's Issues

Invalid base64 generation on ARM: extra padding in the middle of the stream!

I'm using this library on an ARM embedded system.

I have a static function which is called from multiple threads at the same time. I'm using Boost's mutex to make the critical section atomic:

char Base64Buffer[5500];
size_t Base64Length;

// buffer_size is max 4096
int writeOutput(uint8_t *buffer, int buffer_size) {
  mutex.lock();
  base64_encode((const char*)buffer, buffer_size, Base64Buffer, &Base64Length, 0);
  ...
  mutex.unlock();
  return 0;
}

It works; however, sometimes when writeOutput is called from different threads at the same time, base64_encode creates an invalid stream.

Here is a sample invalid base64 (unfortunately I don't have the raw unencoded data):
idqI4bA2ArMAfIkw0a5wmcSAxEpoMJTwOX/aR1BVTTZVRO+Hx4NT4OiOtvrhS8+W2LvbDO0IVh220MGMlZOG2x6MVJeIjeB4fkkNwnwiJ8tOwacCeJhHRCBZFKXvhU33fUw1lWH2dxw2sr9oihvNH3ot3PWrrl/KTC7wCatBcIHtMT6/cuH7dC5EgdwSYdKW4ibqMcjH9hGwT9nPZ92JUbIjmBrzmJI8aoMi8KEpEU61sbuugfxFFwfEtmVO7S5hwYjkormiBzTaDOSTpPi+O5TuTwKUXDb0bebTBVvL8JSicHsX7GVHPQdnAJgWSZ9uT1YU/MsrJYs/srGbvMw30sqGpbJ4/sNAcoVkpb8TFXJcRJZAl2jejm5qbY1TYzP6nz8y7LMMUnUdOwEKIdYu8Kub/Gb6jkxMQULsxc8UjNpS4EkDwYW6/Sm+LE6FIXwSrcIHwM93fBitXDJrbVEN0RYhGl1EB53QXlrUwabERn91tTq3J5i/90AF/KcJwpHCq7Lyrp8JeXuYnON4A84ygEO0p94JNgu1h/CO4iULOuw8GBCpZxFmbHkSIW7ihfPFRfHEV6+8BEoxhjDt7lXaFmjij07RfDJlGri+AhxgFLqCGGFNPKsZYUNfDYm0RaCMdjtDZOAgn+GeIfUpE9zR0y1lVrQVvYMi8X3GTchagsgCW73r/Zmff7qNY2AQsSTKKN+D6auzHvFhlAV0Q2pVd9DayVg9IUAJuslO0+XnrlPpAgIyOjio9ShPjk9mLS9uLyOSN4UsIAeqxWl7SoaXop79+t0iorBvbfQA94hO4Ly/tOkQqNY4Z71YAG2AOFs+VCorRcLvY+AchsGa2XPWJmtQv/9K0EyrJMZHAqwF04sRKY9GuFXpPV4ccLq2NUO0Ct/5chTUO+r0nDc059sZjvsW+4eLR2nDKq9+E9niTeSODPqb+lU3o992dExJPI5uqmzo5ntfUXPM8EvwCarbk/XdN1xtb30XqN7F8ENNFFT2HwQsshO47JgFETlrImdpfjXcpocCGyEoe+fqvIPIHdKpr9BvfAA1FczsDNIc7lIkk5vWsz8TpfRczTVzuXTRLACaOzRZOAAaeZ45ZMNtrP8Rm6gEsAcs+mxxHOidR9F4HiGAXh3be14xMhiUkMS9sYqn404H5SeLxg8HmQbyfTcORQWnB4AimvVWfi2MTbRGAkSiULdUGZKy65dp1axesRABHf46jF+YmCyysIg303IRFHI+oWRPwFoIhHTuQmBkAuuYypJyQwbzeb80XX/KdDyfIsfaNLVzgQoXFWozwqmlk/W4ept/7V6eebGkChc1P2sUZCfczk3uiiKSBx6L6eVvjtUsKpNq0vQAAAa+QeGI/ymmOd5c0K419SeXSexFWbJJi4etmdyBS6FsnE2ssGX5J+gw4Wc0YIlSkoXqGIAViAuz6cGx/nRp/ZuT9kqT88+Fc9xe0oabx32LFCwbaBVqHrmPqTDXE2BXkaCb1y+gX7XfvqRAIB7/5uuBM9sASdk2mNPa0r7h9pNyEzSsfvSda2fJEROOG+cGQ14nHRx+5XXANnuqnbbuY9g7k7BRqYrpapTVafKL9+DP4S86TVlZMUugCsHwHWM2VtBZgx2YvoOgrCtFnnUrdKUJoHMUuB/wa6SFV5BuvAnnlQ9l6PClDPN3NqRYECt6w7/wW0UJRhPWmkC07mhTvwyM4AiG/dR5DhUktBPZONHvwVKcWkRvM+i+Srr8UBQtvCQ5lUFDj/+OevmoYmkr9wwchEqWEk4EuO3mZZe3KoKW5dmIUNbx0WbwxXBSu5RW7YJPR9G2jxac9ZeHb9rW40NbPcKdZ9qBTmxKKwJZfJof997jCZRVFuKRpB/OBHbvTlAGfflJY0335As3eFROyJ3SQtzSNiT7h8offMnoH1d0EE01EyMwzvZFI/3Wk0JYP6T7ZEV+wGialipfRNkPoU0nmxewc9AN1LLFrEK2YtNr+qvZNhHWDlanBQFj6frKEvGvBTOSXeWRKyDnc37hNCUCpkda76GigNGeAAyLroj0xjzileWwjoHkWS//TiWQAS8lxW43tj+40eVx5M8mGQ0kvrZ3wHl7tToXha5Xw2he7EXfBVvo73DmXteH+0xJtEhNAgSClSKk8x70mfXMFYWZU0mXpU/CZZboYAgfAYdUq54lopUT4+YppDBZhGe9P+/1LeEujQ5xnO1rVsK23UUleJFePszCvQjvbuh6uawCouT3MTtvJFfQTt5WIkvhtr4Vv6tbQMMGBjH8J9RKjNJXsD1xnZ/6bSLD4Sqa0NSDzOcvsG3vmnL9F7a8U/pjG0barDAl7rYciN3jZWtTS09usUvMJKAZw/BF/jZN23QyF3oZ/v7/sZXzuvFYltwsMgqxrjZfVzwlAMwQ9d6ZI4Itg+Gf+sMxGHa4qmzn3/esNCcybwEWUAFzzRb5/RxJwGHKoUYz8QLApexP2AmKHng3hotDnyN0QbVab1oK2lXj3aRjsNP0QF/7lRhwcanrxNo14dPY0Lgqm+mEicezUO5RjpWUMWNtaqPbN7dMMewT+l9bTW6fpMx7UHh+H5cm+HRRxjBEn51XGSnyrm0AOpu5Qdh8Fe+ZCWx6dnJGsYPRXd7av789Mu2zOqYHqILU7Vl9vgagyPHkJCcWG+Z+FIuoTdLKrsaHReoMRileqqlt3PH9ko3Kz+r4ztDJDD6kZAbqlBKO+96I1AGbA+/QZ+ILZB2K0d7uFlnvbIUuCjTjTNdE9CQIYWBT7DthshNIoJdHp2mrOLzV2Hb7fJhdlHtq/HfxB1mESwQWnSAJluCqa0qsqjyADLz3ln6SavFYIfN6pSqzWzlt78CURh1p2EnRf+jQ9ZgM4fByIIEO2dgYo01wo60cExp9R+Fp8Z6CHXuq5jnkPeI+RSRsjrDJRvCC1LcLdrgmwfG1PARqWYlzmeKmWwztX74RqJVV+OQ0mIaWqti360cIBc5SJKyiP0YFVX5tYzI9HG1wquO5vgfH6kP9KYXgocQiD0LQR4tQp98v++I0Fq/sgj2mW5gHnXTMq1ijXr3rUyLqYbMvOTC4yvh08UQDHwDrEhJGCVlL1MKi48NIHtfF8+EhVqib7++6ZCttyc6uH5Hqkd3FVmJ25x6Gh0EFENsT51bJ8NK/xXtUSy5i8tXZgBEI/jpQcb7s2mPr6B3fRXhsRi7rs5bPCs3kMQynTMrRHT0jV0Uvm6XlrVA4TtxS43MLMNT8ItYIJya9gZLbfZE77qazpfiaVTujUOWMITsSi5sV80kZ0LELGGOY76b/M82v5zDCE5ss7rcFUqeUca8LWvzB8dn2VsOV9k2iMc2XIb2/MFffTsc9UEDo66+ZdBLEwLPXJ+OBt6yCnEE1xwWV8jCfbx2TLaT+9urSBb38lYfB0eyV5Ex/Y3jF58TqprSM8xbRL0XlwRp8F92KMMaoNGHCoTs+oDdDZ4PRcTdIRAZXaBCLOd2VT1gBIxiVc/gUPcU0uLaa743vI1u4OdJMp17hq00DrTpaWr+iFZjwkfBCBCZfB5C6K73Lo5WVUGNurZb
vC4PagULv+FXFOKDbAOba/COlw1OR1ZK31c0TChWepTi9EuvD6KEgid+vTYxD2INfF08RLbcB7V5eDBdNsTXrVS7KFL33Ik6kKMacrPQy/SjJdg==Xgf5ecvEiJ8duJAD+GUKzDw3DbwfuHL7zLVQzKkLDA46/NhORitq1nzjndLtx0EEkTKMbOXEDi27t/DZzsEyeAOWt36bJ46KwDDFOpn3dT2KThXw130NiQ3TbBeMUMwR0obhUycp1A5qzFn1RknQm7f/e4kPyTuhVgDB8czuDb/Lok9y0vo9AE4TiA/4jgJ6S+OmjRcB/h/529iaAFf+lpf50QBTyqg1Pyy6inZ6jbdCn+aX2tU7hI6hp0G3IrCa4C5swRS72QiIDNne3zrzpIyHHW6e9yqT5GZAxEPeSaGs0lSb2j/1e6Z4dJwOW9rBkgO3IzvZC8tK4iYHC2alp+no76vFu3RCT1ktH3stq31EDR+7ulkVrjej9EobFuPu9vh+90WnkqI2Xtk0O/roBhSk9vROeGWFJmDP6ZVpPg3RpUWx2P4vvPQ8xyrCFubf4cGP76QmJ8TVQ/oWDBlngBsI1SwWdYuGtP+P7hVYUA/Rh3vdWkFJND4Wur7TUNQ1buhKaWa1K80pKmSEvBRDPvz7MctP6w+TP0ZZpmCA2m7kOiRvJYvPpjL8h121Sf7s33pHIE3c0936xXb0t72wJVLWAWAMzlkBxS0IcRQVEV0m4Z1HCX3tXVv+bVZ/XqRBEz78seIGW7yNkZmc3GbrZnuVqmzma5SYXybHkGlyxYo5aSP+zDpHIHkNsPxYiuQE1SeBMaiuzKU4FoggvEoHlC0mCwgvKR+2OKU/k7zf3GwoDwii4kz/9fDSEsiEuputtFijy3/Na0Qx0R0mWgwi5q1tApKVLpCvLn1WR55FeKZnVtVShtuued9WRebV6J8PR+YD4BJ2hmKN1EY9iJOuNJ14QG/IdiwtGcvP89r279OiTC3Stb4gz75MzZpbbO0/hQdg5YW8cF/w7YyCJ4N5p53a55dAnHv0SANXVTD+DQfBrpfK6rOSEhdOjE7y+cz7DvKCrfJkTkevA8GkIjoeffWVqIi2tXB+gJ3uhDMga9fKrv/M6M8qLFB7OGeimtVseZKAU/DZIB3fX0Ji20TWWG9+pZe9IAZllegjVo/mK10fwhnTfugg+GvcYVc+WjrYLQSYe6fkgPmlLLrkvhECUfkx7SWtZo1runArW6uM1vQDRFyI4LpeOCpOPSDcj2CVLUXLpJU44wmP/PUPIvOw2zCfkD+qKexnvs92k4bibBzn5N5pumiuV5TAbDCed4u9Oh5urO57eacaOvUp50oxYWoF2JoC3/IL0uci1wqN9S3X4B/a8Z6jCjCgX98REihYMVaemApD7KV6eHT7OIVCBQaZzk4DEjRcy1g/nVJiSHYlbLESnIZyojW8q7ZnCH5stnsaYp1L1V4R1FYJJPqfGEgu+GQq9AWP9NQyDUlesbriOl/kXII5kaeRYj7mzo6MkIdh9m/eBjL0yCUodo/C6Sk9XsTN97Mjkhme+nQutEH00plVLXecBRP28257ybo+sjvZXtl+kG6Ew2/hskGhjpfWC0xg9YBXQE1u2Dr2uswm0uBBfOjSEimmrtIHTjWKmDxCBEQIJo+5i07MqAAABKtB4YD/KYke+rrUl0r9qozDpbbRNTI5mnydC/Nr3aj9y9QPNqLrU5B4GrXmA52IQVQQ0jC2B4dMcmKDZ5qlvDMQCg1WCC3cjX0u8tgqLdB0sKknj8Muo+SIPGBXKm8h7NBq6PHTk/Ir/sANPZxfyx83OaB+IZCsXnGOwg7iO7CeuOQ7T0h3CtBROJDq+J2jkqlqrS19R5SnUEJr6sP6oWf2xo3AldD2fyHwBTQQ6/bZyw==

Could you help me hunt this down?

How can multithreading affect base64 generation if the call is inside a mutex-protected atomic block?

NEON64: enc: add inline asm codepath

As was done in #91 for NEON32, we can implement the inner encoding loop of the NEON64 encoder in inline assembly. This should guarantee that we get the assembly code we want and expect. The inner encoding loop is quite simple, so there is no large cost to adding a second, parallel implementation.

Library naming

During the review of the vcpkg integration of this library (microsoft/vcpkg#25091) it was noted that the name of this library is very generic and will likely clash with other base64 libraries. This is especially dangerous for names resolved by the linker like the so/archive name and function names. The public header name is also prone to a file name conflict.

Therefore it would be nice if either the library name were changed to something more unique, like a proper name, or the entities above were prefixed with something unique. Obviously this would be a breaking change, and I would understand if it were rejected. OTOH, it has been mentioned that 0.6.0 would be the target for breaking changes anyway.

Benchmarks

I did some automated benchmarking on my i7-10700 and on an Edison (Merrifield dual-core Silvermont Atom without cache memory, similar to Baytrail) that I want to share here. Strictly speaking, this issue is for reference only. It might be useful for finding the commits that caused substantial performance increases or decreases. All data were taken without OpenMP (1 thread only) and in x86_64 mode. On the i7 you will see some deviation, probably caused by frequency scaling / turbo boost. Don't let that disturb you.
The data can be found in benchmarks.ods if you want to play with it yourself.

Below I filter out the most interesting commits.

Encoding

Note that on Edison SSE3 encoding took a hit with 9a0d1b2.
[encode benchmark chart]

#     Hash     Commit message
24    3f3f31c  Fix build under Xcode
30    67ee3fd  SSSE3->AVX2 encoding optimization
76    a5b6739  SSSE3: enc: factor encoding loop into inline function
79    99977db  Generic64: enc: factor encoding loop into inline function
92    e2c6687  AVX2: enc: unroll inner loop
93    9a0d1b2  SSSE3: enc: unroll inner loop
96    bf7341f  Generic64: enc: unroll inner loop
114   b8b3c58  Generic64: enc: use 12-bit lookup table

Decoding

Especially for Edison it has been a bumpy ride, with great improvements (3f3f31c) and regressions (0a69845) on SSE3, but also for PLAIN (cfa8bf7 and f538baa).
[decode benchmark chart]

#     Hash     Commit message
24    3f3f31c  Fix build under Xcode
29    cfa8bf7  Plain decoding optimization
35    0a69845  SSSE3->AVX2, NEON32 decoding optimization
85    6310c1f  SSSE3: dec: factor decoding loop into inline function
88    f538baa  Generic32: dec: factor decoding loop into inline function
100   495414b  AVX2: dec: unroll inner loop
101   5874921  SSSE3: dec: unroll inner loop

Encoders: unroll inner loops

The transformation of encoder kernels to inline functions (#58) allows us to move the inner encoding loop into separate inline functions.

Because the number of remaining loop iterations is known, we can split calls to the inner loop into long unrolled stretches. Tests show that this can result in a significant speedup.
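
To illustrate, here is a minimal sketch of the unrolling pattern, using a plain scalar round (the function names are hypothetical, not the library's actual internals):

#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar inner round: encode 3 input bytes as 4 output chars.
static inline void enc_round (const uint8_t **s, uint8_t **o)
{
	static const char tbl[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	const uint8_t *p = *s;
	uint8_t *q = *o;
	q[0] = tbl[p[0] >> 2];
	q[1] = tbl[((p[0] & 0x03) << 4) | (p[1] >> 4)];
	q[2] = tbl[((p[1] & 0x0F) << 2) | (p[2] >> 6)];
	q[3] = tbl[p[2] & 0x3F];
	*s += 3;
	*o += 4;
}

// Because the number of rounds is known up front, the loop splits into
// long unrolled stretches plus a remainder loop:
static inline void enc_loop (const uint8_t **s, size_t *slen, uint8_t **o)
{
	size_t rounds = *slen / 3;
	*slen -= rounds * 3;
	while (rounds >= 8) {	// unrolled stretch: 8 rounds per iteration
		enc_round(s, o); enc_round(s, o);
		enc_round(s, o); enc_round(s, o);
		enc_round(s, o); enc_round(s, o);
		enc_round(s, o); enc_round(s, o);
		rounds -= 8;
	}
	while (rounds--)	// remainder, one round at a time
		enc_round(s, o);
}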

.editorconfig: add dotfile

Add an .editorconfig dotfile in the root of the project to help enforce a consistent code style. This idea was raised in the discussion for #79, as a way to avoid preventable mistakes such as trailing whitespace or using an inconsistent spacing style.

The .editorconfig format is documented at editorconfig.org.
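
For reference, a minimal example of what such a file might contain (illustrative settings only, not the project's actual choices):

# top-most EditorConfig file
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

[*.{c,h}]
indent_style = tab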

Using OpenMP to speed up further

I have made some small modifications to the code to take advantage of multi-threading using OpenMP, achieving a further 3x improvement with the plain codec and 2x with the SSE3 codec. With 8 threads on a quad-core with hyperthreading, both codecs now run at approximately the same throughput, which seems to be memory bandwidth limited (see the attachment, under the heading ferry-quad).

TRP-Base64-00.pdf

I uploaded the code to my repository https://github.com/htot/base64, branch openmp, but I'm not really familiar with working with GitHub, so I don't know how to create a pull request. Also, as it stands right now, the code fails to build without the option -fopenmp.

Are you interested in pulling this? How should we proceed?
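
For the record, a minimal sketch of the chunking idea (not the actual patch; the header name and the splitting scheme are assumptions). Each thread encodes an independent chunk whose length is a multiple of 3, so the per-chunk outputs concatenate into one valid stream:

#include <omp.h>
#include <stddef.h>
#include "libbase64.h"	// assumed header name

void encode_omp (const char *src, size_t srclen, char *out, size_t *outlen)
{
	const int n = omp_get_max_threads();
	const size_t chunk = ((srclen / n) / 3) * 3;	// 3-byte aligned
	size_t total = 0;

	// Sketch only: assumes srclen >= 3 * n so every chunk is nonempty.
	#pragma omp parallel for reduction(+:total)
	for (int i = 0; i < n; i++) {
		const size_t off = (size_t) i * chunk;
		const size_t len = (i == n - 1) ? srclen - off : chunk;
		size_t olen;
		// Each 3 input bytes map to exactly 4 output bytes, so the
		// output offset is known in advance:
		base64_encode(src + off, len, out + (off / 3) * 4, &olen, 0);
		total += olen;
	}
	*outlen = total;
}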

Whitespace character filtering

This is a new issue to discuss whitespace character filtering, as mentioned in #15 and #27.

The implementation will probably depend on the density of whitespace characters.

  1. Few occurrences, not grouped together (e.g. one newline every 80 characters): update error handling to filter those characters (easy, fast, tested on my end).
  2. Few occurrences of groups of whitespace (e.g. one newline + n spaces every 80 characters): update error handling to filter those characters by skipping multiple characters at a time (not tested at all on my end).
  3. Many occurrences: a preprocessing step might be more suited. This is still quite slow (the 1st implementation was based on the code snippet from #15; I changed that to use a LUT for the shuffle mask):
    SSE4.2 implementation
while (srclen >= 32U) {
	// Characters to be checked (all should be valid, repeat the first ones):
	const __m128i whitespaces = _mm_setr_epi8(
		'\n','\r','\t', ' ',
		'\n','\r','\t', ' ',
		'\n','\r','\t', ' ',
		'\n','\r','\t', ' '
		);

	static const uint8_t lut[][16] __attribute__((aligned (16))) = {
		{   0U,   1U,   2U,   3U,   4U,   5U,   6U,   7U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U },
		{   1U,   2U,   3U,   4U,   5U,   6U,   7U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U },
		...,
		{ 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U, 255U }
	};

	static const uint8_t ipopcnt_lut[] = {
		8U, 7U, 7U, 6U, 7U, 6U, 6U, 5U, 7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U,
		7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
		7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
		6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
		7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
		6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
		6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
		5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
		7U, 6U, 6U, 5U, 6U, 5U, 5U, 4U, 6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U,
		6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
		6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
		5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
		6U, 5U, 5U, 4U, 5U, 4U, 4U, 3U, 5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U,
		5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
		5U, 4U, 4U, 3U, 4U, 3U, 3U, 2U, 4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U,
		4U, 3U, 3U, 2U, 3U, 2U, 2U, 1U, 3U, 2U, 2U, 1U, 2U, 1U, 1U, 0U
	};

	__m128i c0 = _mm_loadu_si128((const __m128i*)(src +  0));
	__m128i c2 = _mm_loadu_si128((const __m128i*)(src + 16));
	src += 32;
	srclen -= 32U;

	__m128i wm0 = _mm_cmpistrm(whitespaces, c0, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
	__m128i wm2 = _mm_cmpistrm(whitespaces, c2, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);

	unsigned int mi0 = (unsigned int)_mm_cvtsi128_si32(wm0);
	unsigned int mi2 = (unsigned int)_mm_cvtsi128_si32(wm2);

	if (mi0 | mi2) {
		unsigned int i0 = ipopcnt_lut[mi0 & 255U];
		unsigned int i1 = ipopcnt_lut[mi0 >> 8];
		unsigned int i2 = ipopcnt_lut[mi2 & 255U];
		unsigned int i3 = ipopcnt_lut[mi2 >> 8];

		__m128i c1 = _mm_srli_si128(c0, 8);
		__m128i c3 = _mm_srli_si128(c2, 8);

		c0 = _mm_shuffle_epi8(c0, _mm_load_si128(lut[mi0 & 255U]));
		c1 = _mm_shuffle_epi8(c1, _mm_load_si128(lut[mi0 >> 8]));
		c2 = _mm_shuffle_epi8(c2, _mm_load_si128(lut[mi2 & 255U]));
		c3 = _mm_shuffle_epi8(c3, _mm_load_si128(lut[mi2 >> 8]));
		_mm_storel_epi64((__m128i*)dst, c0);
		dst += i0;
		_mm_storel_epi64((__m128i*)dst, c1);
		dst += i1;
		_mm_storel_epi64((__m128i*)dst, c2);
		dst += i2;
		_mm_storel_epi64((__m128i*)dst, c3);
		dst += i3;
	}
	else {
		_mm_storeu_si128((__m128i*)(dst +  0), c0);
		_mm_storeu_si128((__m128i*)(dst + 16), c2);
		dst += 32;
	}
}

Here are the timings. All throughputs are given for the output buffer, which is the same for all variants.
decode: valid base64 input
decode-s: one whitespace every 80 characters (we can see the choice of 80 has an impact on the AVX2 decoder for Method 1, because it's a multiple of 16 but not of 32)
decode-s8: 8 whitespaces every 80 characters
decode-d: 1 whitespace before each valid character

Method 1 seems to be the best for sparse whitespace characters, except for the AVX2 decoder (which could/should be fixed to handle 8/16 bytes of valid input). For handling sparse groups of characters or very dense whitespace, Method 3 is better.
Real-world data analysis would be good, to know which case we want to optimize.

For Method 1:

Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2   decode     5279.50 MB/sec
AVX2   decode-s   1587.19 MB/sec
AVX2   decode-s8  1005.11 MB/sec
AVX2   decode-d    226.64 MB/sec
plain  decode     1234.00 MB/sec
plain  decode-s   1180.52 MB/sec
plain  decode-s8   899.77 MB/sec
plain  decode-d    245.90 MB/sec
SSSE3  decode     2941.00 MB/sec
SSSE3  decode-s   2414.34 MB/sec
SSSE3  decode-s8  1198.43 MB/sec
SSSE3  decode-d    210.47 MB/sec
SSE41  decode     2938.98 MB/sec
SSE41  decode-s   2369.73 MB/sec
SSE41  decode-s8  1167.77 MB/sec
SSE41  decode-d    209.58 MB/sec
SSE42  decode     3820.97 MB/sec
SSE42  decode-s   3469.99 MB/sec
SSE42  decode-s8  2009.10 MB/sec
SSE42  decode-d    273.74 MB/sec
AVX    decode     3959.44 MB/sec
AVX    decode-s   3573.91 MB/sec
AVX    decode-s8  2051.51 MB/sec
AVX    decode-d    233.33 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2   decode     5302.67 MB/sec
AVX2   decode-s   1590.81 MB/sec
AVX2   decode-s8  1004.56 MB/sec
AVX2   decode-d    226.81 MB/sec
plain  decode     1238.86 MB/sec
plain  decode-s   1184.43 MB/sec
plain  decode-s8   903.14 MB/sec
plain  decode-d    241.19 MB/sec
SSSE3  decode     2952.08 MB/sec
SSSE3  decode-s   2472.69 MB/sec
SSSE3  decode-s8  1157.22 MB/sec
SSSE3  decode-d    211.32 MB/sec
SSE41  decode     2952.81 MB/sec
SSE41  decode-s   2477.51 MB/sec
SSE41  decode-s8  1201.25 MB/sec
SSE41  decode-d    210.74 MB/sec
SSE42  decode     3831.55 MB/sec
SSE42  decode-s   3487.39 MB/sec
SSE42  decode-s8  2101.31 MB/sec
SSE42  decode-d    275.14 MB/sec
AVX    decode     3967.55 MB/sec
AVX    decode-s   3507.18 MB/sec
AVX    decode-s8  2055.18 MB/sec
AVX    decode-d    225.99 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2   decode     5272.91 MB/sec
AVX2   decode-s   1592.12 MB/sec
AVX2   decode-s8   998.94 MB/sec
AVX2   decode-d    226.41 MB/sec
plain  decode     1237.12 MB/sec
plain  decode-s   1183.57 MB/sec
plain  decode-s8   902.16 MB/sec
plain  decode-d    241.08 MB/sec
SSSE3  decode     2920.62 MB/sec
SSSE3  decode-s   2401.63 MB/sec
SSSE3  decode-s8  1165.97 MB/sec
SSSE3  decode-d    213.04 MB/sec
SSE41  decode     2952.13 MB/sec
SSE41  decode-s   2473.57 MB/sec
SSE41  decode-s8  1170.04 MB/sec
SSE41  decode-d    210.97 MB/sec
SSE42  decode     3826.16 MB/sec
SSE42  decode-s   3484.39 MB/sec
SSE42  decode-s8  1877.99 MB/sec
SSE42  decode-d    274.92 MB/sec
AVX    decode     3860.69 MB/sec
AVX    decode-s   3491.69 MB/sec
AVX    decode-s8  2030.65 MB/sec
AVX    decode-d    222.85 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2   decode     5213.04 MB/sec
AVX2   decode-s   1608.52 MB/sec
AVX2   decode-s8  1012.05 MB/sec
AVX2   decode-d    231.93 MB/sec
plain  decode     1237.74 MB/sec
plain  decode-s   1184.74 MB/sec
plain  decode-s8   903.57 MB/sec
plain  decode-d    250.86 MB/sec
SSSE3  decode     2942.19 MB/sec
SSSE3  decode-s   2464.90 MB/sec
SSSE3  decode-s8  1197.61 MB/sec
SSSE3  decode-d    219.08 MB/sec
SSE41  decode     2946.40 MB/sec
SSE41  decode-s   2468.99 MB/sec
SSE41  decode-s8  1165.37 MB/sec
SSE41  decode-d    214.67 MB/sec
SSE42  decode     3703.31 MB/sec
SSE42  decode-s   3372.53 MB/sec
SSE42  decode-s8  1814.50 MB/sec
SSE42  decode-d    267.22 MB/sec
AVX    decode     3843.75 MB/sec
AVX    decode-s   3460.58 MB/sec
AVX    decode-s8  2030.02 MB/sec
AVX    decode-d    227.47 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2   decode     4314.66 MB/sec
AVX2   decode-s   1562.26 MB/sec
AVX2   decode-s8  1024.11 MB/sec
AVX2   decode-d    233.38 MB/sec
plain  decode     1207.16 MB/sec
plain  decode-s   1129.71 MB/sec
plain  decode-s8   904.50 MB/sec
plain  decode-d    249.83 MB/sec
SSSE3  decode     2714.32 MB/sec
SSSE3  decode-s   2369.40 MB/sec
SSSE3  decode-s8  1203.05 MB/sec
SSSE3  decode-d    219.10 MB/sec
SSE41  decode     2716.16 MB/sec
SSE41  decode-s   2392.42 MB/sec
SSE41  decode-s8  1203.59 MB/sec
SSE41  decode-d    215.04 MB/sec
SSE42  decode     3418.99 MB/sec
SSE42  decode-s   3181.47 MB/sec
SSE42  decode-s8  1904.47 MB/sec
SSE42  decode-d    283.45 MB/sec
AVX    decode     3556.95 MB/sec
AVX    decode-s   3238.04 MB/sec
AVX    decode-s8  1986.58 MB/sec
AVX    decode-d    233.68 MB/sec

For Method 3:

Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2   decode     3230.06 MB/sec
AVX2   decode-s   2741.82 MB/sec
AVX2   decode-s8  2874.97 MB/sec
AVX2   decode-d   1558.49 MB/sec
plain  decode      619.40 MB/sec
plain  decode-s    548.73 MB/sec
plain  decode-s8   548.71 MB/sec
plain  decode-d    420.43 MB/sec
SSSE3  decode     2057.59 MB/sec
SSSE3  decode-s   1807.58 MB/sec
SSSE3  decode-s8  1920.25 MB/sec
SSSE3  decode-d   1257.10 MB/sec
SSE41  decode     2123.90 MB/sec
SSE41  decode-s   1890.92 MB/sec
SSE41  decode-s8  1883.30 MB/sec
SSE41  decode-d   1209.47 MB/sec
SSE42  decode     2622.27 MB/sec
SSE42  decode-s   2366.88 MB/sec
SSE42  decode-s8  2421.09 MB/sec
SSE42  decode-d   1435.30 MB/sec
AVX    decode     2777.55 MB/sec
AVX    decode-s   2419.70 MB/sec
AVX    decode-s8  2480.22 MB/sec
AVX    decode-d   1450.81 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2   decode     3378.80 MB/sec
AVX2   decode-s   2775.84 MB/sec
AVX2   decode-s8  2892.33 MB/sec
AVX2   decode-d   1590.29 MB/sec
plain  decode      626.00 MB/sec
plain  decode-s    580.68 MB/sec
plain  decode-s8   562.96 MB/sec
plain  decode-d    408.17 MB/sec
SSSE3  decode     2191.44 MB/sec
SSSE3  decode-s   1892.64 MB/sec
SSSE3  decode-s8  1861.03 MB/sec
SSSE3  decode-d   1227.95 MB/sec
SSE41  decode     2140.16 MB/sec
SSE41  decode-s   1893.56 MB/sec
SSE41  decode-s8  1911.62 MB/sec
SSE41  decode-d   1226.44 MB/sec
SSE42  decode     2724.52 MB/sec
SSE42  decode-s   2257.47 MB/sec
SSE42  decode-s8  2370.60 MB/sec
SSE42  decode-d   1424.35 MB/sec
AVX    decode     2743.86 MB/sec
AVX    decode-s   2433.86 MB/sec
AVX    decode-s8  2428.08 MB/sec
AVX    decode-d   1408.13 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2   decode     3284.48 MB/sec
AVX2   decode-s   2735.35 MB/sec
AVX2   decode-s8  2813.35 MB/sec
AVX2   decode-d   1574.50 MB/sec
plain  decode      623.97 MB/sec
plain  decode-s    570.76 MB/sec
plain  decode-s8   566.80 MB/sec
plain  decode-d    406.15 MB/sec
SSSE3  decode     2191.73 MB/sec
SSSE3  decode-s   1905.69 MB/sec
SSSE3  decode-s8  1843.39 MB/sec
SSSE3  decode-d   1261.96 MB/sec
SSE41  decode     2190.32 MB/sec
SSE41  decode-s   1938.25 MB/sec
SSE41  decode-s8  1841.31 MB/sec
SSE41  decode-d   1259.89 MB/sec
SSE42  decode     2738.85 MB/sec
SSE42  decode-s   2408.05 MB/sec
SSE42  decode-s8  2430.57 MB/sec
SSE42  decode-d   1437.67 MB/sec
AVX    decode     2818.79 MB/sec
AVX    decode-s   2388.06 MB/sec
AVX    decode-s8  2494.78 MB/sec
AVX    decode-d   1455.20 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2   decode     3345.58 MB/sec
AVX2   decode-s   2842.44 MB/sec
AVX2   decode-s8  2887.61 MB/sec
AVX2   decode-d   1561.51 MB/sec
plain  decode      634.51 MB/sec
plain  decode-s    591.72 MB/sec
plain  decode-s8   573.85 MB/sec
plain  decode-d    464.49 MB/sec
SSSE3  decode     2182.45 MB/sec
SSSE3  decode-s   1927.53 MB/sec
SSSE3  decode-s8  1933.26 MB/sec
SSSE3  decode-d   1248.62 MB/sec
SSE41  decode     2183.39 MB/sec
SSE41  decode-s   1954.86 MB/sec
SSE41  decode-s8  1935.52 MB/sec
SSE41  decode-d   1247.71 MB/sec
SSE42  decode     2726.89 MB/sec
SSE42  decode-s   2421.89 MB/sec
SSE42  decode-s8  2432.85 MB/sec
SSE42  decode-d   1421.62 MB/sec
AVX    decode     2809.99 MB/sec
AVX    decode-s   2470.29 MB/sec
AVX    decode-s8  2496.32 MB/sec
AVX    decode-d   1432.62 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2   decode     2964.24 MB/sec
AVX2   decode-s   2619.61 MB/sec
AVX2   decode-s8  2530.74 MB/sec
AVX2   decode-d   1536.01 MB/sec
plain  decode      621.43 MB/sec
plain  decode-s    588.34 MB/sec
plain  decode-s8   567.04 MB/sec
plain  decode-d    459.67 MB/sec
SSSE3  decode     2011.50 MB/sec
SSSE3  decode-s   1848.21 MB/sec
SSSE3  decode-s8  1764.75 MB/sec
SSSE3  decode-d   1231.03 MB/sec
SSE41  decode     2001.69 MB/sec
SSE41  decode-s   1846.22 MB/sec
SSE41  decode-s8  1765.15 MB/sec
SSE41  decode-d   1229.00 MB/sec
SSE42  decode     2507.66 MB/sec
SSE42  decode-s   2254.18 MB/sec
SSE42  decode-s8  2162.37 MB/sec
SSE42  decode-d   1405.58 MB/sec
AVX    decode     2582.07 MB/sec
AVX    decode-s   2305.40 MB/sec
AVX    decode-s8  2281.94 MB/sec
AVX    decode-d   1414.54 MB/sec

clang build fails with inline ASM on NEON64 (Apple M1)

clang must not be allocating the l0-l3 lookup vectors in sequential registers? While building 3eab8e6, the compiler errors are:

In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: registers must be sequential
                "and  %[t3].16b, v14.16b,   %[n63].16b \n\t"
                                                         ^
<inline asm>:10:40: note: instantiated into assembly here
        tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: unknown token in expression
                "and  %[t3].16b, v14.16b,   %[n63].16b \n\t"
                                                         ^
<inline asm>:10:48: note: instantiated into assembly here
        tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:32:44: error: invalid operand
                "and  %[t3].16b, v14.16b,   %[n63].16b \n\t"
                                                         ^
<inline asm>:10:48: note: instantiated into assembly here
        tbl v12.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v3.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: registers must be sequential
                "tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
                                                                                        ^
<inline asm>:11:40: note: instantiated into assembly here
        tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: unknown token in expression
                "tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
                                                                                        ^
<inline asm>:11:48: note: instantiated into assembly here
        tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:35:75: error: invalid operand
                "tbl v12.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t0].16b \n\t"
                                                                                        ^
<inline asm>:11:48: note: instantiated into assembly here
        tbl v13.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v2.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: registers must be sequential
                "tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
                                                                                        ^
<inline asm>:12:40: note: instantiated into assembly here
        tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: unknown token in expression
                "tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
                                                                                        ^
<inline asm>:12:48: note: instantiated into assembly here
        tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:36:75: error: invalid operand
                "tbl v13.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t1].16b \n\t"
                                                                                        ^
<inline asm>:12:48: note: instantiated into assembly here
        tbl v14.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v1.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: registers must be sequential
                "tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
                                                                                        ^
<inline asm>:13:40: note: instantiated into assembly here
        tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
                                              ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: unknown token in expression
                "tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
                                                                                        ^
<inline asm>:13:48: note: instantiated into assembly here
        tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
                                                      ^
In file included from ../../deps/base64/base64/lib/arch/neon64/codec.c:62:
../../deps/base64/base64/lib/arch/neon64/enc_loop.c:37:75: error: invalid operand
                "tbl v14.16b, {%[l0].16b, %[l1].16b, %[l2].16b, %[l3].16b}, %[t2].16b \n\t"
                                                                                        ^
<inline asm>:13:48: note: instantiated into assembly here
        tbl v15.16b, {v5.16b, v6.16b, v7.16b, v16.16b}, v0.16b
                                                      ^

Consider prefixing public functions with unique namespace

I ran into an interesting issue where linking a project with libmysql.a as well as this library caused some extremely weird behavior, and eventually I found out there was a namespace clash: mysql also declares two functions base64_encode() and base64_decode(). Please consider prefixing your library's C functions with a unique namespace. I've gotten around this by forking and just naming them differently for now, but I want to say thanks for the great library; I've enjoyed using it. I'd also make a PR, but I don't know what you'd want to prefix with, and it is a breaking change to the API.

Inlining Failed under GCC7

Compiling under GCC7 on OSX w/ AVX2:

fastbase64 git:(master) CC=gcc-7 CXX=g++-7 AVX2_CFLAGS=-mavx2 make
gcc-7 -fPIC -std=c99 -Wall -Wextra -Wshadow -Wpsabi -Wfatal-errors -O3 -march=native -march=native -mavx2 -c src/chromiumbase64.c -Iinclude
gcc-7 -fPIC -std=c99 -Wall -Wextra -Wshadow -Wpsabi -Wfatal-errors -O3 -march=native -march=native -mavx2 -c src/fastavxbase64.c -Iinclude
gcc-7 -fPIC -std=c99 -Wall -Wextra -Wshadow -Wpsabi -Wfatal-errors -O3 -march=native -march=native -mavx2 -c src/fastavx512bwbase64.c -Iinclude
src/fastavx512bwbase64.c: In function 'enc_reshuffle':
src/fastavx512bwbase64.c:6:1: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
 static inline __m512i enc_reshuffle(const __m512i input) {

 ^~~~~~
src/fastavx512bwbase64.c:6:23: note: The ABI for passing parameters with 64-byte alignment has changed in GCC 4.6
 static inline __m512i enc_reshuffle(const __m512i input) {
                       ^~~~~~~~~~~~~
In file included from /usr/local/Cellar/gcc/7.3.0_1/lib/gcc/7/gcc/x86_64-apple-darwin17.3.0/7.3.0/include/immintrin.h:55:0,
                 from /usr/local/Cellar/gcc/7.3.0_1/lib/gcc/7/gcc/x86_64-apple-darwin17.3.0/7.3.0/include/x86intrin.h:48,
                 from src/fastavx512bwbase64.c:3:
/usr/local/Cellar/gcc/7.3.0_1/lib/gcc/7/gcc/x86_64-apple-darwin17.3.0/7.3.0/include/avx512bwintrin.h:1831:1: error: inlining failed in call to always_inline '_mm512_shuffle_epi8': target specific option mismatch
 _mm512_shuffle_epi8 (__m512i __A, __m512i __B)
 ^~~~~~~~~~~~~~~~~~~
compilation terminated due to -Wfatal-errors.
make: *** [fastavx512bwbase64.o] Error 1

CI: GitHub Actions: build and test on different architectures

The old Travis CI config compiled and tested the code on various architectures: arm32, arm64, ppc64be, ppc64le and s390. We can partly replicate this with GitHub Actions by using the uraimo/run-on-arch-action. Update the CI config to run and test the library on various architectures.

Unfortunately, uraimo/run-on-arch-action does not support a big-endian architecture (yet?). It would be nice if we could find a way to include such an architecture in the test matrix.

An example of a project that uses this CI setup is aklomp/vec.

Runtime detection of available flags

I want to use your great library in our pretty big system. We already have detection of CPU/platform features, and it won't interfere with the library's settings. There is one little problem: when I want to use BASE64_FORCE_XXX (based on our settings) and the "XXX" codec is not actually compiled into the library, we get the dummy implementation, which does nothing.

How can I ask the library which BASE64_FORCE_XXX flags are meaningful? Of course I can alter the sources, but that's not a proper solution.

I use the stable 0.3.0 version.

decode: output buffer size unknown, decode fails

Hello, I tested the base64 decode; it is indeed faster than other code. But I ran into a problem when using the function int base64_decode(const char *src, size_t srclen, char *out, size_t *outlen, int flags): the required size of out is unknown, so decoding easily fails when the buffer is too small. How can I solve this?
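
A minimal sketch of one way to size the output buffer (the header name is assumed; 4 base64 characters decode to at most 3 bytes, so this bound is always sufficient):

#include <stdlib.h>
#include "libbase64.h"	// assumed header name

// Allocate a worst-case output buffer, then decode into it.
char *decode_alloc (const char *src, size_t srclen, size_t *outlen)
{
	char *out = malloc(3 * ((srclen + 3) / 4));
	if (out == NULL)
		return NULL;
	// base64_decode signals invalid input through its return value:
	if (!base64_decode(src, srclen, out, outlen, 0)) {
		free(out);
		return NULL;
	}
	return out;
}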

Encoders: turn into proper inline functions

Currently, encoder kernels are simple while loops that are inserted into a switch statement by an #include directive. Turn these loops into full static inline functions.

  • Makes early returning possible.
  • Makes it possible to do pre-loop and post-loop computations, such as calculating the number of rounds in advance. This could be done now, but not in a straightforward way.
  • Will later make it possible to unroll loops by inlining multiple copies of the inner loop.
  • Communicates additional constraints to the compiler, potentially allowing it to optimize better.

This change should have no adverse effects, because the compiler will inline the function at its only call site. Quite the opposite: the change will be beneficial for maintenance and performance.

Replace exports.txt with compiler visibility flags

When providing a library, it is good practice to only globally export symbols that the user is expected to link against. All other symbols should be hidden from view so that they do not pollute the global symbol namespace.

This is currently done by using the linker to roll all object files into a single object file, and instructing the linker to remove all symbols from that file except for those whitelisted in exports.txt.

This works nicely on GNU-based platforms, but is not really portable to OSes like Windows, the BSDs, and macOS. It also will not be compatible with CMake when we change to that build system (#7), because CMake scripts are written at a level of abstraction where such platform-dependent hacks should not exist. Also, weird linking problems such as #50 may arise. A better solution is needed.

These days it is common to declare symbol visibility at the source code level by using a compiler option to hide all symbols by default. Builtin functions are used to tag specific symbols as having global visibility. In GCC, this can be achieved by compiling with the -fvisibility=hidden flag, and whitelisting user-visible symbols with __attribute__((visibility("default"))).

This method is supported by GCC and Clang, and works along the same lines in MSVC with different builtins (__declspec(dllexport)). It will also be forward compatible with the way cmake handles symbol visibility.

Older compilers may not have such builtins, and our library may end up exporting all global symbols. While not great, the pain can be alleviated by ensuring that all symbols with global scope at least have the base64_ prefix. (This is currently true for all global symbols except codec_choose.)

  • Remove the build step where the linker rolls all object files into a single object file and strips all symbols except the whitelisted.
  • Output an archive (libbase64.a) instead of a single object file (libbase64.o). This will break backwards compatibility for users, but will be forward compatible with what cmake will do.
  • Add a new header file to define macros, such as BASE64_EXPORT, to make certain symbols globally visible (a sketch follows below).
  • Add compiler flags to set the default symbol visibility to hidden.
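
A sketch of what such an export-macro header might look like (the macro and header-guard names are hypothetical):

#ifndef BASE64_EXPORT_H
#define BASE64_EXPORT_H

// Compile the library with -fvisibility=hidden (GCC/Clang) and mark only
// the public entry points with this macro:
#if defined(_MSC_VER)
#  define BASE64_EXPORT __declspec(dllexport)
#elif defined(__GNUC__) || defined(__clang__)
#  define BASE64_EXPORT __attribute__((visibility("default")))
#else
#  define BASE64_EXPORT	// older compilers: all symbols stay global
#endif

#endif // BASE64_EXPORT_H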

Windows x64 platform: base64_encode outlen is ZERO

I compiled the library on the Windows x64 platform and specified the SSE3 instruction set, then used it to encode to base64, but the length of the resulting base64 string is zero. I tried x86, and it works well.

Benchmark results: store in a CSV file, graph as SVG

Benchmark results are currently "stored" in a big table in the README, which is getting ungainly and hard to modify. The results are presented in text form, which makes them harder to compare across machines and codecs.

Objectives:

  • Move the benchmark results to a CSV file (or a comparable textual data format) in tests/benchmarks.
  • Write a script (or use Gnuplot, etc.) to autogenerate an SVG chart from the results. Automate the steps in a Makefile. The chart should probably be a horizontal bar graph with a bar cluster for each machine.
  • Link the SVG chart in the README as an inline image.

Replace custom platform detection logic with Boost.Predef

Hi,
Proposal: Replace all the custom preprocessor checks for architecture, platform, compiler, SIMD, etc. with the Boost.Predef library, which is independent and header-only, so one can simply import the necessary header files into the repository and be done with it (please note that this is an internal change and doesn't affect library users in any way). I think this will improve code readability, because the Boost.Predef macros are far more descriptive, and it might prevent fallacies like the one in #12; e.g. compare

#ifdef BOOST_COMP_MSVC_AVAILABLE
#ifdef BOOST_OS_WINDOWS_AVAILABLE

with

#ifdef _MSC_VER
#if defined(_WIN32) || defined(__WIN32__)

If you think that this is a good idea, but don't have the time to make the change I'll happily submit a pull request.

Cheers,
Henrik

benchmark on Atom crashes on AVX2

When I build with all x86 codecs and run the benchmark on Atom (x86_64), it crashes due to an illegal instruction. I always thought I just needed to disable the AVX and AVX2 codecs (and that works around the problem). But looking into the code, even though the AVX* codecs are forced, it seems support is intended to be detected, and the crash is therefore unintended.

I think it would be better to benchmark only the codecs whose instructions are actually supported instead of crashing. But how do we do that? Not force the codec, or detect the supported codecs and force from that list?

cmake: generate only plain codecs

When I'm not using cmake-gui (i.e. when generating using Yocto), only the plain codec is built in some cases. I don't understand why, but the patch below seems to resolve it. It is as if the string was never evaluated. Any ideas? @BurningEnlightenment @aklomp

 add_feature_info("OpenMP codec" BASE64_WITH_OpenMP "spreads codec work accross multiple threads")
 cmake_dependent_option(BASE64_REGENERATE_TABLES "regenerate the codec tables" OFF "NOT CMAKE_CROSSCOMPILING" OFF)

-set(_IS_X86 "_TARGET_ARCH STREQUAL \"x86\" OR _TARGET_ARCH STREQUAL \"x64\"")
+if((_TARGET_ARCH STREQUAL "x86") OR (_TARGET_ARCH STREQUAL "x64"))
+    set(_IS_X86 1)
+else()
+    set(_IS_X86 0)
+endif()
 cmake_dependent_option(BASE64_WITH_SSSE3 "add SSSE 3 codepath" ON ${_IS_X86} OFF)
 add_feature_info(SSSE3 BASE64_WITH_SSSE3 "add SSSE 3 codepath")
 cmake_dependent_option(BASE64_WITH_SSE41 "add SSE 4.1 codepath" ON ${_IS_X86} OFF)

Base64url support?

Hi aklomp, I was searching for a nice base64 library and ended up here, but my low-level C is a bit rusty, so I would like to ask whether this library could be used to handle the base64url variant, or at least ask for some tips on how to add it (I would post patches back if I managed it).

Thank you very much for your time.
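
For what it's worth, base64url (RFC 4648) differs from standard base64 only in its alphabet ('-' and '_' instead of '+' and '/') and in padding being optional, so a simple post-processing wrapper is one way to get it. A sketch (the header name is assumed):

#include <stddef.h>
#include "libbase64.h"	// assumed header name

void base64url_encode (const char *src, size_t srclen, char *out, size_t *outlen)
{
	base64_encode(src, srclen, out, outlen, 0);
	// Swap the two alphabet characters that differ:
	for (size_t i = 0; i < *outlen; i++) {
		if (out[i] == '+')
			out[i] = '-';
		else if (out[i] == '/')
			out[i] = '_';
	}
	// Padding is optional in base64url; strip it:
	while (*outlen > 0 && out[*outlen - 1] == '=')
		(*outlen)--;
}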

NEON64: enc: ASM build fails on gcc with dd7a2b5f31

It looks like at least gcc fails now with BASE64_NEON64_USE_ASM set and -O0. Example output:

18:45:00 In file included from ../deps/base64/base64/lib/arch/neon64/codec.c:63:
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:11:36: error: invalid character ' ' in raw string delimiter
18:45:00   "ld3 {"P".16b, "Q".16b, "R".16b}, [%[src]], #48 \n\t"
18:45:00                            ^         
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:18:28: error: invalid character ' ' in raw string delimiter
18:45:00   "ushr %[t2].16b, "R".16b,   #6         \n\t" \
18:45:00                     ^        
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:23:28: error: invalid character ' ' in raw string delimiter
18:45:00   "and  %[t3].16b, "R".16b,   %[n63].16b \n\t"
18:45:00                     ^        
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:31:16: error: invalid character ' ' in raw string delimiter
18:45:00   "tbl "R".16b, {v8.16b-v11.16b}, %[t2].16b \n\t" \
18:45:00         ^        
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:37:35: error: invalid character ' ' in raw string delimiter
18:45:00   "st4 {"P".16b, "Q".16b, "R".16b, "S".16b}, [%[dst]], #64 \n\t"
18:45:00                            ^        
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c: In function 'enc_loop_neon64':
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:11:27: error: stray 'R' in program
18:45:00   "ld3 {"P".16b, "Q".16b, "R".16b}, [%[src]], #48 \n\t"
18:45:00                            ^
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:65:2: note: in expansion of macro 'LOAD'
18:45:00   LOAD("v2", "v3", "v4") \
18:45:00   ^~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:114:10: note: in expansion of macro 'ROUND_A_FIRST'
18:45:00    "    " ROUND_A_FIRST()
18:45:00           ^~~~~~~~~~~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:11:28: error: expected ':' or ')' before string constant
18:45:00   "ld3 {"P".16b, "Q".16b, "R".16b}, [%[src]], #48 \n\t"
18:45:00                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:65:2: note: in expansion of macro 'LOAD'
18:45:00   LOAD("v2", "v3", "v4") \
18:45:00   ^~~~
18:45:00 ../deps/base64/base64/lib/arch/neon64/enc_loop_asm.c:114:10: note: in expansion of macro 'ROUND_A_FIRST'
18:45:00    "    " ROUND_A_FIRST()
18:45:00           ^~~~~~~~~~~~~

and a bunch of similar "stray 'R'" errors after that.

clang compiles without errors however.

Concurrent calls to the "base64_decode" interface cause a segmentation fault

Hello, I try to call the "base64_decode" interface from 40 threads; while processing in one thread, a segmentation fault appears inside this interface, but if I use fewer threads, e.g. 6 or 8, there are no problems. Here is my call:

memset(raw_image_buffer_, 0, 5242880);
base64_decode(base64_data.data(), base64_data.size(), raw_image_buffer_, &base2image_size, 0);

I use the same picture every time, and I can ensure that the input pointer and length are valid; the size is 682901. raw_image_buffer_ is a fixed buffer of size 5242880, memset before each call.

Generic encoders: use 12-bit lookup table

Switch from a 6-bit lookup table to a 12-bit lookup table in the Generic32 and Generic64 encoders. Some quick tests show that halving the memory accesses greatly increases performance, at the cost of a larger lookup table (4096 bytes). Not a hard tradeoff.

Tables can be generated with a small Python script. Tables differ for little-endian and big-endian architectures.
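
A sketch of the idea (a hypothetical generator in C rather than the actual Python script; little-endian layout assumed): each table entry packs the two output characters for a 12-bit input index, so each 3-byte group needs two lookups instead of four.

#include <stdint.h>
#include <string.h>

static const char b64[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static uint16_t enc_table[4096];

static void gen_table (void)
{
	for (int i = 0; i < 4096; i++) {
		// Low byte: char for the high 6 bits; high byte: char for
		// the low 6 bits (little-endian store order):
		enc_table[i] = (uint16_t) b64[i >> 6]
		             | (uint16_t) b64[i & 0x3F] << 8;
	}
}

static void enc_group (const uint8_t *p, uint8_t *q)
{
	// One 3-byte group = 24 bits = two 12-bit lookups = 4 output chars:
	uint32_t v = (uint32_t) p[0] << 16 | (uint32_t) p[1] << 8 | p[2];
	memcpy(q + 0, &enc_table[v >> 12],   2);
	memcpy(q + 2, &enc_table[v & 0xFFF], 2);
}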

Cleanup round (no functional impact) before releasing v0.4.0

I would like to release a new library version (v0.4.0) which puts a tag on the code that everyone has been using for the last few years. This will make room to introduce some breaking changes from here on forward.

However, before we can release, the code is in need of a cleanup round. The codebase has accumulated a lot of patches from various contributors, in various different styles. This has led to some inconsistencies in how things are done, diminished the quality of the comments, and left some dead code behind.

Objective: push a set of commits which fix many of these issues, without introducing any functional changes. The library code, as seen by the compiler, must remain unchanged from what it is now, to avoid regressions for anyone switching to this new version.

  • Fix some trivial whitespace issues (wrong indent, trailing whitespace, etc).
  • Reword some comments for clarity.
  • Split some monolithic code files into smaller parts, to conform to the general style.
  • Uniform style for function declarations, bodies of if statements, etc.
  • Remove dead code.
  • Remove the now unnecessary CMPGT macros and friends.

NEON32: enc: add inline asm codepath

Some testing on my Raspberry Pi 2B 1.1 shows that GCC and Clang both generate pretty terrible code from NEON intrinsics.

For the NEON32 encoder, which is simpler than the x86 encoders, the speed can be substantially improved by hand-coding the relatively simple inner loop in inline assembly. A quick proof-of-concept shows that inline assembly gets around 382 MB/s on GCC, against 209 MB/s for the status quo. Clang does worse and better at the same time, getting 304 MB/s for the inline assembly and 294 MB/s for the status quo. Both are an improvement, so I think this should be added.

Decoding from uint16_t[]

First off, thank you for providing this open source library (and under the BSD license).

My question is: would it be possible to see versions of the decoding functions that support uint16_t[] strings in addition to const char[] strings? I tried looking at just the ssse3 implementation (since that is what is used on my dev machine) to see how that might be implemented, but the numbers in the comments have me confused.

My particular use case is that I get a uint16_t[] from the V8 JS engine, which is how it internally stores its strings. Being able to use that value type directly would be more efficient than having to make a copy to an 8-bit array first.

I'd like to get uint16_t[] supported for each codec, since we run on a variety of platforms. Any help you can give would be much appreciated.

NEON64: enc: convert full encoding loop to inline assembly

Convert the full encoding loop to an inline assembly implementation for compilers that can use inline assembly.

The motivation for this change is issue #96: when optimization is turned off on recent versions of clang, the encoding table is sometimes not loaded into sequential registers. This happens despite taking pains to ensure that the compiler uses an explicit set of registers for the load (v8-v11).

This leaves us with few options besides rewriting the full encoding loop in inline assembly. Only that way can we be absolutely certain that the correct registers are used. Thankfully, aarch64 assembly is not very difficult to write by hand.

Fixes #96.
Closes #97.

AVX2: enc: add inline asm codepath

As was done for NEON32 in #91 and for NEON64 in #92, we can get a pretty large speedup for the AVX2 encoder by implementing the inner loop in inline assembly for compilers that support it. Testing on my machine (i5-4590S) with a proof-of-concept branch shows around a 33% speed improvement (!).

This is achieved in the same way that we handle the NEON encoders. Split the encoder into assembly "recipes" for translation and shuffling, interleave them with loads and stores, and keep three sets of data in flight in parallel inside large unrolled loops. It's basically the code that we'd hope the compiler would generate for us if it was clever enough.

The drawbacks I see to adopting this approach are an increase in complexity and a loss of transparency in this library, because generating inline assembly code with C macros is a bit gnarly. But on the other hand, those speed gains don't lie. And this would be purely additive: the codepath would only be taken on compilers that support it, and the normal implementation would remain available.

The advantage is the large speedup, of course, and also the fact that the implementation is not too crazy when it's laid side to side with the intrinsics version. It's basically the same algorithm in a different expression.

fail to build test on FreeBSD

$ uname -a
FreeBSD t420nnd.cls.to 10.3-RELEASE-p24 FreeBSD 10.3-RELEASE-p24 #0: Wed Nov 15 04:57:40 UTC 2017     [email protected]:/usr/obj/usr/src/sys/GENERIC  amd64
$ gmake -C test
gmake: Entering directory '/home/nanard/code/git/base64/test'
rm -f benchmark test_base64 *.o
cc -std=c99 -O3 -Wall -Wextra -pedantic -o codec_supported.o -c codec_supported.c
cc -std=c99 -O3 -Wall -Wextra -pedantic -o test_base64 test_base64.c codec_supported.o ../lib/libbase64.o
cc -std=c99 -O3 -Wall -Wextra -pedantic -o benchmark benchmark.c codec_supported.o ../lib/libbase64.o -lrt
benchmark.c:101:16: error: use of undeclared identifier 'CLOCK_REALTIME'
        clock_gettime(CLOCK_REALTIME, o_time);
                      ^
benchmark.c:99:35: warning: unused parameter 'o_time' [-Wunused-parameter]
base64_gettime (base64_timespec * o_time)
                                  ^
1 warning and 1 error generated.
gmake: *** [Makefile:24: benchmark] Error 1

Adding -D_XOPEN_SOURCE=600 does the trick

diff --git a/test/Makefile b/test/Makefile
index d104582..11c57a3 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -2,6 +2,7 @@ CFLAGS += -std=c99 -O3 -Wall -Wextra -pedantic
 ifdef OPENMP
   CFLAGS += -fopenmp
 endif
+CFLAGS += -D_XOPEN_SOURCE=600
 
 TARGET := $(shell $(CC) -dumpmachine)
 ifneq (, $(findstring darwin, $(TARGET)))

NEON: enc: better reshuffling algorithm

On NEON, it is possible to do slightly better than the current 3-to-4 byte encoding shuffle, by using SLI instructions and some cleverness. We can also do better or equal than the compiler if we use inline assembly to manually pipeline these instructions.

Issues with multithreaded code and CPU dispatching.

Suppose we are calling base64_encode or base64_decode in a loop (for different inputs) and doing it from multiple threads (for different data).

  1. If we pass non-zero flags to these routines, they will write to a single global variable repeatedly in the codec_choose_forced function, which will lead to "false sharing" and poor scalability.

  2. There is no method to pre-initialize the choice of codec. (Actually, there is: we can simply call one of the encode/decode routines in advance with empty input, but that looks silly.) If we don't do that and we run our code with ThreadSanitizer, it will complain about a data race on the codec function pointers. In fact, it is safe, because it is a single pointer, a single machine word that is (supposedly) placed in an aligned memory location. But we would have to annotate it as _Atomic and store/load it with memory_order_relaxed. See the similar issue here: simdjson/simdjson#256

  3. Suppose we use these routines in a loop for short inputs. They have a branch that checks whether the encoders/decoders were initialized. We want to move these branches out of the loop: check the CPU once and call the specialized implementation directly. But the architecture-specific methods are not exported, so we cannot do that. We also have to pay for two non-inlined function calls.

All these issues were found while integrating this library into ClickHouse: ClickHouse/ClickHouse#8397
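
A minimal sketch of the relaxed-atomic function-pointer pattern described in point 2 (C11; all names here are hypothetical, not the library's internals):

#include <stdatomic.h>
#include <stddef.h>

typedef void (*enc_fn) (const char *, size_t, char *, size_t *);

static _Atomic(enc_fn) codec_enc;

// Called once, e.g. from an explicit init function:
void codec_init (enc_fn chosen)
{
	atomic_store_explicit(&codec_enc, chosen, memory_order_relaxed);
}

// Hot path: a relaxed load is enough, since the pointer is a single word
// and every value ever stored is a valid function:
void encode (const char *src, size_t srclen, char *out, size_t *outlen)
{
	enc_fn f = atomic_load_explicit(&codec_enc, memory_order_relaxed);
	f(src, srclen, out, outlen);
}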

Generic codecs: use memcpy to load/store to unaligned addresses

The generic codecs currently read and write potentially unaligned memory ranges using raw pointer casts. This will break on platforms that require strict alignment for memory loads/stores. Examples:

// lib/arch/generic/32/enc_loop.c
uint32_t str = *(uint32_t *)c;

// lib/arch/generic/32/dec_loop.c
*(uint32_t*)o = x.asint;

Replace such casts with calls to memcpy(3):

#include <string.h>

uint32_t str;

memcpy(&str, c, sizeof (str));

memcpy(3) is defined by the C99 standard, so it is portable. The arguments to memcpy are all known at compile time, so on platforms that allow unaligned memory access, a good compiler will optimize the memcpy to a single mov instruction. On platforms without unaligned memory loads/stores, the function generates bytewise accesses, which is the best we can do in those circumstances.

This will also allow us to remove the HAVE_FAST_UNALIGNED_ACCESS macro, which currently guards an unaligned memory store.

Issues with recent changes

After pulling in recent changes I encountered a couple of issues so far:

  • With the new AVX target, I found that the XRSTORE check only works for AVX2. Here's what I ended up using in codec_choose.c to support AVX (not sure how correct this is):
if (max_level >= 7) {
  #ifdef HAVE_AVX2
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
  #endif
  #ifdef HAVE_AVX
  if (!(ecx & bit_XSAVE_XRSTORE)) {
    __cpuid(1, eax, ebx, ecx, edx);
  }
  #endif

  // ...
  • My build setup uses libtool, which apparently has issues on macOS when multiple files have the same base name; in this case it's the codec.c files in each target/arch directory. Supposedly Windows may have a similar issue in some cases?

benchmark odroid c1+ results

My odroid c1+ arrived and I used my break to gather some benchmark results using your tool:

Processor                 Plain enc  Plain dec  NEON32 enc  NEON32 dec
odroid c1+ clang 3.7.1      183,522    127,522     204,958     173,856
odroid c1+ gcc 5.3.0        165,600    137,060     212,044     165,356

CI config: create build matrix, build on multiple platforms

Currently, Travis CI builds the library only on the standard OS and architecture. In order to increase testing coverage, update the Travis config file to include a build matrix with many more OS and architecture combinations.

Note that support for the native Windows MSVC toolchain would require a CMake-based build configuration, which we don't have (yet), so that platform will initially not be supported.

[Question] Invalid input data handling?

I'm not sure how "invalid" input data should be processed. Could this be clarified?

A few samples using base64_decode; what's the expected output for these?

  • "Zm9vYg=" decodes successfully though it's not properly terminated
  • "Zm9vYg" decodes successfully though it's not properly terminated
  • "Zm9vY" decodes successfully though it misses data
  • "Zm9vYmF=Zm9v" decodes successfully though it contains too much data or corruption occurred.

This could be extended to the streaming API, but I thought I'd start with the simple block decode.
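
For the record, a tiny harness to probe the current behavior on the samples above (a sketch; the header name is assumed, and the return/outlen conventions are whatever the library actually implements):

#include <stdio.h>
#include <string.h>
#include "libbase64.h"	// assumed header name

int main (void)
{
	const char *samples[] = { "Zm9vYg=", "Zm9vYg", "Zm9vY", "Zm9vYmF=Zm9v" };
	char out[64];
	for (int i = 0; i < 4; i++) {
		size_t outlen = 0;
		int ret = base64_decode(samples[i], strlen(samples[i]), out, &outlen, 0);
		printf("%-14s -> ret=%d outlen=%zu\n", samples[i], ret, outlen);
	}
	return 0;
}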

With gcc 8.2 the build fails due to __x86.get_pc_thunk.bx being discarded

With gcc 8.2 and an i686 target, the following error is generated:

`__x86.get_pc_thunk.bx' referenced in section `.text' of lib/libbase64.o: defined in discarded section `.text.__x86.get_pc_thunk.bx[__x86.get_pc_thunk.bx]' of lib/libbase64.o

I found 2 ways to fix this:

  1. add -fno-pie to CFLAGS in Makefile
  2. add __x86.get_pc_thunk.bx to exports.txt

With 1 we get:

root@edison:~/base64# OMP_THREAD_LIMIT=1 ./benchmark 
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 10 * 1
plain   encode  82.34 MB/sec
plain   decode  134.75 MB/sec

but no address space layout randomization (ASLR)
and with 2:

plain   encode  77.88 MB/sec
plain   decode  81.43 MB/sec

My vote goes to 1.

Unable to do base64_encode on image data

I tried to do base64 encoding on image data:

    char* out;
    size_t outlen;
    
    base64_encode((const char*)color.get_data(), color.get_data_size(), out, &outlen, 0);

color.get_data() returns pointer to the frame data.
color.get_data_size() returns the number of bytes in the frame handle.
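
Note that the out pointer in the snippet above is never initialized before base64_encode writes through it. A minimal sketch of allocating the output buffer first (4 output chars per 3 input bytes, rounded up; the header name is assumed):

#include <stdlib.h>
#include "libbase64.h"	// assumed header name

char *encode_frame (const char *data, size_t size, size_t *outlen)
{
	// Worst-case encoded size: every 3 input bytes become 4 chars.
	char *out = malloc(4 * ((size + 2) / 3));
	if (out != NULL)
		base64_encode(data, size, out, outlen, 0);
	return out;
}

Called, for example, as encode_frame((const char *) color.get_data(), color.get_data_size(), &outlen).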

Decoders: unroll inner loops

The transformation of decoder kernels to inline functions (#59) allows us to move the inner decoding loop into separate inline functions.

Because the number of remaining loop iterations is known, we can split calls to the inner loop into long unrolled stretches. Tests show that this can result in a significant speedup.
