I have made some small modifications to the code to take advantage of multi-threading

Using openmp to speed up further about base64 HOT 27 CLOSED

aklomp commented on July 3, 2024

Using openmp to speed up further

from base64.

Comments (27)

aklomp commented on July 3, 2024

Hi Ferry, noticed you guys are from my old neighbourhood in Delft. Greetings from The Hague :)

I must admit that I don't have any experience with OpenMP, but I get from this that it's a library that can split the work over multiple cores? Looks good, and is in line with the aim of this project, which is to be fast most of all. ("Correctness" is also important, but sort of meaningless when a base64 string is either correct or not.)

I'd agree with you that the results look bounded by memory latency rather than raw throughput, but to me that's a good thing, it means the code is getting the theoretical maximal performance on the target hardware. If you wanted to get more accurate benchmarks, you could try to work on data only in the L1 or L2 caches.

I'd certainly be interested to pull, but there is one showstopper with the current code and that's that the library would require OpenMP. That's not worth it, I'd like to remain cross-platform and portable. You could fix this by using an environment variable to Make such as ENABLE_OPENMP to conditionally include omp.h and link the library. I also noticed some spacing issues which I will comment on separately. If you fix these (or I can fix them, but I'd become the author of your commit), I'll pull it.

Not being familiar with pull requests is no problem, as long as I have your commits and can merge them manually.

from base64.

htot commented on July 3, 2024

For me this use of openmp is new as well, so I needed to learn a few thing to get it right. It works mostly by compiler pragmas, so is ignored when the compiler doesn't support it. Yesterday I already put a conditional around the only openmp library function that is called from the code, so now it builds also without -fopenmp. But you are right, it should then not include omp.h. I'll fix that up, along with the spaces formatting.

from base64.

htot commented on July 3, 2024

I've split up the code, fixed formatting and permissions and added some benchmarks.

from base64.

htot commented on July 3, 2024

BTW I checked my memory configuration and found it to be DDR1333, while I have DDR1600, fixed that and got some more performance.

from base64.

aklomp commented on July 3, 2024

Thanks for the contribution! I'm amazed, on my Core i5 the SSSE3 and AVX2 decoders are now equally fast and bounded only by memory latency. This is a very cool technology to have in my toolbox, thanks.

I just pushed an 'openmp' branch which contains your work, plus a merge commit where I make some small cleanups. I fixed some typos, I tried to keep the line length at 80 characters, and I made some minor whitespace changes. If you agree on the changes, I'll push it into master.

from base64.

aklomp commented on July 3, 2024

Also, I think you deserve to be in the Acknowledgements section of the readme, so I'll add a mention if you agree.

from base64.

aklomp commented on July 3, 2024

Added a shout-out to the Readme.

from base64.

htot commented on July 3, 2024

Thanks for this.

Today I modified the benchmark program, to benchmark with different buffer sizes. I found that when the buffer is 100kB - 1MB that seems to be the best case, so my previous benchmarks @10Mb were not really comparable. with yours. Fixes that up, added a graph at the bottom of Readme.md .

You can see > 10M the cache is not very effective so you become memory constrained, while below 10k its not really worth to create threads. In the sweat spot you can achieve 27Gb/sec.

from base64.

htot commented on July 3, 2024

Ehr, that would be previously 100MB. Now I repeat 10MB 10x, to measure the timing accurately, and repeat proportionally more times when the block is smaller.

from base64.

htot commented on July 3, 2024

With this out of the way, I can get to doing something useful with the code: use it to encode binary data over a 2Mb/s serial connection. Using base64 I can add control/synchronization characters (/00 error, /FF preamble, STX, ETX) with almost no performance loss. With the Edison doing 150MB/s even without threading encoding will increase communication time ~1% (+33% due to base64 of course).

Thanks!

from base64.

aklomp commented on July 3, 2024

Ah, I saw that you merged your branch on top of my master. Could you maybe try to rebase it onto master instead? Rebasing will replay your commits on top of my master branch and "fake history" so that it looks like you were working off the current HEAD all the time. That gives a nice linear history without merge commits.

With rebase you can also squash commits so that it looks like they never happened. You could for instance suppress the plaintext ASCII table that appeared and disappeared in the README. The "recorded history" becomes a clean set of additive patches. Faking history in git is an awesome perfectionistic superpower :)

It's OK if you don't want to rebase, because I'm happy to merge your changes. Let me know if you're OK with the mention I put in the README and with the small tweaks I made in my merge commit.

from base64.

aklomp commented on July 3, 2024

The question of which benchmark size to choose is tricky. If you choose 100MB, the tests take very long to run on a Raspberry Pi or other slow hardware. If you choose 10MB, you'll be constrained by the speed of memory on desktop systems. (Which I think is an awesome achievement btw :) If you choose a lower size, you get fairer benchmarks for very fast hardware, but the timing variation is larger. I guess I'm not the first to note that meaningful benchmarking is both very difficult and quite subjective.

from base64.

aklomp commented on July 3, 2024

I think it's cool that you're using this library for serial comms. It's nice that you stretch out 3x8 bits over 4x6 so that you can fit in control characters in the top two bits. But if that's the approach you're taking, I have to note that base64-encoded data is not 6-bit clean, because after reshuffling, it gets translated to the base64 alphabet. (0x00 becomes 'A', and so on.) So the top two bits are not free for use. If you want to patch that out, get rid of the "translate" sections during encoding/decoding.

from base64.

htot commented on July 3, 2024

:-) I think this was my first C code in 10 years, and now I'm attempting openmp, git, github and markdown at the same time. As soon as I figure out how to rebase I'll attempt to make the history lineair.
What I modified in the benchmark is to use the same buffer multiple times in such a way that the total time is constant regardless the buffer size. This way we should get the same accuracy for small and large buffers. It take a bit of time on edison and probably a bit more on raspberry.
The correct way would be of course to select a constant time and do as many buffer conversions as possible within that time. But I think for now this is good enough.
I measured on Edison today that the cache has no effect (or doesn't exist), so no peak for small buffers. And thread creation costs 9us for one and 12us for 2 threads. So in my application (2k buffer at 2Mb/sec = 0.2MB/sec with SSE3 codec running at 150MB/sec) not using openmp for encoding/decoding will probably be optimal. Irony.
Actually base64 should be just fine I think, in ascii control characters are put from 0x00 to 0x37 and 0xFF is also reserved. I think I saw in your code 0xFF is used for error signalling so that would be just fine for me. And all I actually need is 0x00 (the serial port should be able to generate that in case of parity errors), 0xFF for preambles and STX (0x02) and ETX (0x03) to enclose the base64 text.

from base64.

htot commented on July 3, 2024

With respect to rebasing and rewriting history: please tell me how to do that and I'll get it done. I'm a little afraid to mess things up.

Otherwise, just go ahead and merge as you see fit. Thanks for the mention in the Readme!

from base64.

BurningEnlightenment commented on July 3, 2024

Take a look at the Pro Git Book, especially at the sections about Rebasing and Rewriting History.

Cheers

from base64.

htot commented on July 3, 2024

So, how would I continue then? Create a new branch starting from my initial branch point (beaf4e3) and then reorder and squash commits into 1) add openmp stuff + 2) benchmarks?

from base64.

htot commented on July 3, 2024

Alright, I think I sorted things out locally. Now how to fix on github? Delete the openmp branch and push up again? Or create a new branch?

from base64.

BurningEnlightenment commented on July 3, 2024

You can overwrite a remote branch's content doing a force push, i.e. git push -f. You shouldn't delete a branch with an open pull request, because the pull request will irreversebly be closed and you'd have to open a new one.

EDIT: whoops, just noticed that this is just an issue and not a pull request 😧, in this case it doesn't really matter whether you force push, recreate or branch again.

from base64.

htot commented on July 3, 2024

I forced pushed. Hope that didn't screw up anything.

from base64.

aklomp commented on July 3, 2024

Hey, I just merged your work into my openmp branch. I'll probably merge that into master tomorrow after I sleep on it.

I added two things: a merge commit and a new commit on top. The merge commit basically changes nothing except some spacing/style/syntax issues. The new commit shuffles things a little bit so that the plain functions are back in lib/lib.c, and lib/lib_plain.c is no longer needed. I think that makes the code cleaner, because lib/lib_openmp.c was basically copying the bodies of the plain codecs for the case where the buffer size was below the OMP threshold.

from base64.

htot commented on July 3, 2024

That's great news. Thanks for this great project and for letting me contribute something back!

from base64.

aklomp commented on July 3, 2024

Merged into master branch, thanks a lot for a great contribution!

from base64.

aklomp commented on July 3, 2024

Actually, I think I have one last question about the OpenMP code, in regards to this line. base64_stream_decode returns -1 if the requested codec is not implemented, 0 if decoding failed because of invalid input, and 1 if everything is OK. What happens if we spawn four OpenMP threads, and one thread encounters a decode error? All of the return values are added, so the accumulated result will be 3, which is truncated to 1, which means "success". Shouldn't a value for result of anything other than num_threads indicate an error?

from base64.

htot commented on July 3, 2024

You're right, I didn't realize there is no test case for this.
We could replace '+=' by '&=' (assuming -1 can not occur) and initialize 'result' to 1?

from base64.

aklomp commented on July 3, 2024

Well, if there are no errors, the sum must be equal to num_threads. The -1 case will come out as -num_threads since all threads encounter the same missing codec. Any result other than those two indicates that one or more threads hit an error.

from base64.

aklomp commented on July 3, 2024

Pushed fix to master.

from base64.

Using openmp to speed up further about base64 HOT 27 CLOSED

Comments (27)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent