Just for clarification regarding the title change: I focused mostly on the AVX512 implementation, but most of the suggestions apply to the AVX2 code as well.
from md5-simd.
Changed the instructions to the integer variant. See #22
Benchmarks show no impact, but it appears to be memory limited anyway.
all bitwise logic should be handled by VPTERNLOG [...]
I don't understand what you are saying. The VXORPD should be handled without any problem while the memory operations are handled. Feel free to submit a PR.
avoid using gathers
Feel free to send a PR, but for this please add benchmarks. I don't want additional complexity without a clear upside.
consider interleaving two instruction stream
Saturating 16 lanes is pretty much impossible in everything but benchmarks, so adding 16 more will likely be pretty useless.
When 16 lanes are saturated it is memory bound, at least in my tests, so adding more lanes will just add complexity and maybe even slow things down.
consider using EVEX embedded broadcast [...]
A PR is welcome if it has a measurable impact.
We don't use this package for anything significant at MinIO, since we don't have many scenarios where we can fill enough lanes for this package to make sense. So this is not something we will invest huge amounts of time in.
from md5-simd.
Changed the instructions to the integer variant. See #22
Thanks for testing it out, but you're still using floating-point operations. PS = packed single float, PD = packed double float. VPXORD would be the correct instruction.
Benchmarks show no impact
On Skylake, I wouldn't actually expect any, as I believe there are few bypass delays. Older Intel CPUs have them, as does AMD Zen, so, at least at the moment, it's more relevant to the AVX2 implementation than AVX512 (though that may change on future uArchs, so it's always best to use the correct instruction).
I don't understand what you are saying. The VXORPD should be handled without any problem while the memory operations are handled.
For example, in the first round, you do:
VMOVAPD d, tmp
[...]
VXORPD c, tmp, tmp
[...]
VPTERNLOGD $0x6C, b, d, tmp
Instead, it should be:
VMOVDQA32 d, tmp
VPTERNLOGD $0xD8, b, c, tmp
In other words, all bitwise logic other than VPTERNLOGD is unnecessary.
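The $0xD8 immediate above can be checked by hand: VPTERNLOGD's immediate is an 8-entry truth table over its three operands. Assuming the Go assembler's reversed operand order (so with tmp holding d, the truth-table index is d as bit 2, c as bit 1, b as bit 0), MD5's round-1 function F(b,c,d) = (b AND c) OR (NOT b AND d) produces exactly 0xD8. This sketch uses a hypothetical helper, `ternImm`, which is not part of md5-simd:

```go
package main

import "fmt"

// ternImm builds the 8-bit truth table VPTERNLOGD expects:
// bit i of the immediate is f evaluated at d=i>>2, c=(i>>1)&1, b=i&1
// (the operand order assumed from the snippet above).
func ternImm(f func(d, c, b uint8) uint8) uint8 {
	var imm uint8
	for i := uint8(0); i < 8; i++ {
		if f(i>>2, (i>>1)&1, i&1)&1 == 1 {
			imm |= 1 << i
		}
	}
	return imm
}

func main() {
	// F selects c when b is set, d otherwise.
	f := func(d, c, b uint8) uint8 { return (b & c) | (^b & d) }
	fmt.Printf("F -> 0x%02X\n", ternImm(f)) // prints "F -> 0xD8"
}
```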
When 16 lanes are saturated it is memory bound, at least in my tests
If you're never going to get more than 16 lanes, then supporting more is moot, but are you sure it's memory bound? Your benchmark result shows a maximum speed of 17252.88 MB/s across 4 cores (~4.2GB/s per core) - that doesn't sound memory bound to me (unless the hardware has really sucky memory). If you want to do a quick test to see what's possible, you can grab ISA-L and run their MD5 benchmark.
So this is not something we will invest huge amounts of time in.
That's entirely fine :) To be quite honest, I'm not using your package myself, so I have no personal interest in it.
So I'm not actually asking you to do anything - I just thought I'd drop suggestions in case you happened to be interested in them.
from md5-simd.
@animetosho thanks for your feedback. I tried the suggestion to avoid the XOR but found no difference in performance, as per:
benchmark old MB/s new MB/s speedup
BenchmarkAvx512/32KB-8 2299.46 2322.53 1.01x
BenchmarkAvx512/64KB-8 2766.58 2759.20 1.00x
BenchmarkAvx512/128KB-8 2853.97 2878.71 1.01x
BenchmarkAvx512/256KB-8 3171.99 3146.22 0.99x
BenchmarkAvx512/512KB-8 3391.35 3379.37 1.00x
BenchmarkAvx512/1MB-8 3397.27 3434.94 1.01x
BenchmarkAvx512/2MB-8 3438.92 3450.22 1.00x
BenchmarkAvx512/4MB-8 3474.41 3468.73 1.00x
BenchmarkAvx512/8MB-8 3495.29 3496.26 1.00x
BenchmarkAvx512Parallel/32KB-8 12645.19 12725.28 1.01x
BenchmarkAvx512Parallel/64KB-8 14413.94 14484.95 1.00x
BenchmarkAvx512Parallel/128KB-8 14964.45 15030.62 1.00x
BenchmarkAvx512Parallel/256KB-8 15570.16 15539.08 1.00x
BenchmarkAvx512Parallel/512KB-8 16067.74 16088.41 1.00x
BenchmarkAvx512Parallel/1MB-8 16151.83 16287.65 1.01x
BenchmarkAvx512Parallel/2MB-8 16258.97 16247.06 1.00x
BenchmarkAvx512Parallel/4MB-8 16906.32 16827.70 1.00x
BenchmarkAvx512Parallel/8MB-8 16720.65 16351.47 0.98x
(this is on a Xeon Platinum 8275CL CPU @ 3.00GHz in AWS)
from md5-simd.
Are you able to show your code?
from md5-simd.
This is the diff:
frank:~/go/src/github.com/minio/md5-simd (master) $ git diff
diff --git a/block16_amd64.s b/block16_amd64.s
index be0a43a..cf37c3c 100644
--- a/block16_amd64.s
+++ b/block16_amd64.s
@@ -11,10 +11,10 @@
VPGATHERDD index*4(base)(ptrs*1), ktmp, mem
#define ROUND1(a, b, c, d, index, const, shift) \
- VPXORQ c, tmp, tmp \
+ VMOVDQA32 d, tmp \
VPADDD 64*const(consts), a, a \
VPADDD mem, a, a \
- VPTERNLOGD $0x6C, b, d, tmp \
+ VPTERNLOGD $0xD8, b, c, tmp \
prep(index) \
VPADDD tmp, a, a \
VPROLD $shift, a, a \
@@ -22,10 +22,10 @@
VPADDD b, a, a
#define ROUND1noload(a, b, c, d, const, shift) \
- VPXORQ c, tmp, tmp \
+ VMOVDQA32 d, tmp \
VPADDD 64*const(consts), a, a \
VPADDD mem, a, a \
- VPTERNLOGD $0x6C, b, d, tmp \
+ VPTERNLOGD $0xD8, b, c, tmp \
VPADDD tmp, a, a \
VPROLD $shift, a, a \
VMOVAPD c, tmp \
from md5-simd.
Ah, you should apply it to every round, not just round 1. Also, instances of VMOVAPD should be removed, as they're now redundant.
Because you're only processing one vector at a time, and not trying to maximise ILP, it's possible that the CPU is just executing the extra instructions with spare capacity. Though I would've thought there'd at least be some difference with your change, if you're saturating both threads in a core...
Nonetheless, 2 instructions is better than 3, even if it doesn't show up on a benchmark.
from md5-simd.
I was able to remove the VANDNPD from ROUND2 (ROUND3 and 4 have no logic instructions), but essentially got the same benchmark numbers as above (at least on this server config).
Perhaps Cascade Lake might be different; I'll see if I can do a quick check on that.
from md5-simd.
ROUND3 and 4 have no logic instructions
Round 3 looks fine (it's actually possible to eliminate the move, if you're interested).
Round 4 does have one, which can be merged into VPTERNLOG.
Though, as you've found, I doubt this change, by itself, will give an improvement in benchmarks.
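Under the same assumed operand order as the round-1 snippet (d as truth-table bit 2, c as bit 1, b as bit 0 - the actual operand order used for rounds 2-4 isn't shown here), the immediates for the remaining MD5 round functions can be derived the same way. `ternImm` is the same hypothetical helper as before:

```go
package main

import "fmt"

// ternImm builds VPTERNLOGD's 8-bit truth table: bit i of the
// immediate is f evaluated at d=i>>2, c=(i>>1)&1, b=i&1 (assumed order).
func ternImm(f func(d, c, b uint8) uint8) uint8 {
	var imm uint8
	for i := uint8(0); i < 8; i++ {
		if f(i>>2, (i>>1)&1, i&1)&1 == 1 {
			imm |= 1 << i
		}
	}
	return imm
}

func main() {
	g := func(d, c, b uint8) uint8 { return (d & b) | (^d & c) } // round 2: G
	h := func(d, c, b uint8) uint8 { return b ^ c ^ d }          // round 3: H
	ifn := func(d, c, b uint8) uint8 { return c ^ (b | ^d) }     // round 4: I
	fmt.Printf("G -> 0x%02X\n", ternImm(g))   // prints "G -> 0xAC"
	fmt.Printf("H -> 0x%02X\n", ternImm(h))   // prints "H -> 0x96"
	fmt.Printf("I -> 0x%02X\n", ternImm(ifn)) // prints "I -> 0x63"
}
```

If the operand order differs in the actual rounds, the truth table (and hence the immediate) has to be permuted accordingly.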
Perhaps Cascade Lake might be different
It won't be.
from md5-simd.
Cool, will take a look at it.
from md5-simd.