Comments (10)

animetosho commented on May 20, 2024

Just for clarification regarding the title change: I focused mostly on the AVX512 implementation, but most of the suggestions apply to the AVX2 code as well.

klauspost commented on May 20, 2024

Changed the instructions to the integer variant. See #22

Benchmarks show no impact, but it appears to be memory limited anyway.

all bitwise logic should be handled by VPTERNLOG [...]

I don't understand what you are saying. The VXORPD should be handled without any problem while the memory operations are in progress. Feel free to submit a PR.

avoid using gathers

Feel free to send a PR, but for this please add benchmarks. I don't want additional complexity without a clear upside.

consider interleaving two instruction streams

Saturating 16 lanes is pretty much impossible in anything but benchmarks, so adding 16 more will likely be pretty useless.

When 16 lanes are saturated it is memory bound, at least in my tests, so adding more lanes will just add complexity and maybe even slow things down.

consider using EVEX embedded broadcast [...]

A PR is welcome if it has a measurable impact.

We don't use this package for anything significant at MinIO since we don't have many scenarios where we can fill enough lanes for this package to make sense. So this is not something we will invest huge amounts of time in.

animetosho commented on May 20, 2024

Changed the instructions to the integer variant. See #22

Thanks for testing it out, but you're still using floating point operations. PS = packed single float, PD = packed double float. VPXORD would be the correct instruction.

Benchmarks show no impact

On Skylake, I wouldn't actually expect any, as I believe there are few bypass delays. Older Intel CPUs have them, as does AMD Zen, so, at least at the moment, it's more relevant to the AVX2 implementation than to AVX512 (but that may change on future uArchs, so it's always best to use the correct instruction).

I don't understand what you are saying. The VXORPD should be handled without any problem while the memory operations are in progress.

For example, in the first round, you do:

VMOVAPD d, tmp
[...]
VXORPD  c, tmp, tmp
[...]
VPTERNLOGD $0x6C, b, d, tmp

Instead, it should be:

VMOVDQA32 d, tmp
VPTERNLOGD $0xD8, b, c, tmp

In other words, all bitwise logic other than VPTERNLOGD is unnecessary.
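
As an aside on where the $0xD8 above comes from: the imm8 of VPTERNLOGD is an 8-entry truth table, indexed per bit position by (dest, src1, src2) with the destination bit as the most significant selector, and Go assembly lists the operands in reverse (so VPTERNLOGD $0xD8, b, c, tmp means dest=tmp, src1=c, src2=b). A minimal Go sketch of the derivation, assuming tmp holds a copy of d as in the sequence above (the helper name ternImm is just for illustration):

package main

import "fmt"

// ternImm builds a VPTERNLOG imm8 for a 3-input boolean function.
// Each result bit is looked up at index (dest<<2)|(src1<<1)|src2.
func ternImm(f func(dest, src1, src2 uint32) uint32) uint8 {
	var imm uint8
	for idx := 0; idx < 8; idx++ {
		dest, src1, src2 := uint32(idx>>2)&1, uint32(idx>>1)&1, uint32(idx)&1
		if f(dest, src1, src2)&1 == 1 {
			imm |= 1 << idx
		}
	}
	return imm
}

func main() {
	// dest = tmp (a copy of d), src1 = c, src2 = b, so we evaluate
	// MD5's round-1 function F(b,c,d) = (b AND c) OR (NOT b AND d).
	f := func(d, c, b uint32) uint32 { return (b & c) | (^b & d) }
	fmt.Printf("0x%02X\n", ternImm(f)) // prints 0xD8
}

Feeding the same helper the old arrangement, where tmp held c^d, reproduces the $0x6C constant of the original code.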

When 16 lanes are saturated it is memory bound, at least in my tests

If you're never going to get more than 16 lanes, then supporting more is moot, but are you sure it's memory bound? Your benchmark result shows a maximum speed of 17252.88 MB/s across 4 cores (~4.2GB/s per core) - that doesn't sound memory bound to me (unless the hardware has really sucky memory). If you want to do a quick test to see what's possible, you can grab ISA-L and run their MD5 benchmark.
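
As a rough sanity check of that per-core figure (the DRAM bandwidth number in the comment below is a ballpark assumption, not a measurement from this machine):

package main

import "fmt"

func main() {
	aggregate := 17252.88 // MB/s, the maximum figure quoted above
	cores := 4.0          // physical cores it was measured across
	perCore := aggregate / cores / 1024 // roughly GB/s per core
	// A single modern Xeon core can typically stream well over 10 GB/s
	// from DRAM (assumption; varies by platform), so ~4.2 GB/s per core
	// suggests a compute/latency limit rather than memory bandwidth.
	fmt.Printf("~%.1f GB/s per core\n", perCore)
}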

So this is not something we will invest huge amounts of time in.

That's entirely fine :) To be quite honest, I'm not using your package myself, so I have no personal interest in it.
So I'm not actually asking you to do anything - I just thought I'd drop suggestions in case you happened to be interested in them.

fwessels commented on May 20, 2024

@animetosho thanks for your feedback. I tried the suggestion to avoid the XOR but found no difference in performance, as per:

benchmark                           old MB/s     new MB/s     speedup
BenchmarkAvx512/32KB-8              2299.46      2322.53      1.01x
BenchmarkAvx512/64KB-8              2766.58      2759.20      1.00x
BenchmarkAvx512/128KB-8             2853.97      2878.71      1.01x
BenchmarkAvx512/256KB-8             3171.99      3146.22      0.99x
BenchmarkAvx512/512KB-8             3391.35      3379.37      1.00x
BenchmarkAvx512/1MB-8               3397.27      3434.94      1.01x
BenchmarkAvx512/2MB-8               3438.92      3450.22      1.00x
BenchmarkAvx512/4MB-8               3474.41      3468.73      1.00x
BenchmarkAvx512/8MB-8               3495.29      3496.26      1.00x
BenchmarkAvx512Parallel/32KB-8      12645.19     12725.28     1.01x
BenchmarkAvx512Parallel/64KB-8      14413.94     14484.95     1.00x
BenchmarkAvx512Parallel/128KB-8     14964.45     15030.62     1.00x
BenchmarkAvx512Parallel/256KB-8     15570.16     15539.08     1.00x
BenchmarkAvx512Parallel/512KB-8     16067.74     16088.41     1.00x
BenchmarkAvx512Parallel/1MB-8       16151.83     16287.65     1.01x
BenchmarkAvx512Parallel/2MB-8       16258.97     16247.06     1.00x
BenchmarkAvx512Parallel/4MB-8       16906.32     16827.70     1.00x
BenchmarkAvx512Parallel/8MB-8       16720.65     16351.47     0.98x

(this is on a Xeon Platinum 8275CL CPU @ 3.00GHz in AWS)

animetosho commented on May 20, 2024

Are you able to show your code?

fwessels commented on May 20, 2024

This is the diff:

frank:~/go/src/github.com/minio/md5-simd (master) $ git diff
diff --git a/block16_amd64.s b/block16_amd64.s
index be0a43a..cf37c3c 100644
--- a/block16_amd64.s
+++ b/block16_amd64.s
@@ -11,10 +11,10 @@
        VPGATHERDD index*4(base)(ptrs*1), ktmp, mem
 
 #define ROUND1(a, b, c, d, index, const, shift) \
-       VPXORQ     c, tmp, tmp            \
+       VMOVDQA32  d, tmp                 \
        VPADDD     64*const(consts), a, a \
        VPADDD     mem, a, a              \
-       VPTERNLOGD $0x6C, b, d, tmp       \
+       VPTERNLOGD $0xD8, b, c, tmp       \
        prep(index)                       \
        VPADDD     tmp, a, a              \
        VPROLD     $shift, a, a           \
@@ -22,10 +22,10 @@
        VPADDD     b, a, a
 
 #define ROUND1noload(a, b, c, d, const, shift) \
-       VPXORQ     c, tmp, tmp            \
+       VMOVDQA32  d, tmp                 \
        VPADDD     64*const(consts), a, a \
        VPADDD     mem, a, a              \
-       VPTERNLOGD $0x6C, b, d, tmp       \
+       VPTERNLOGD $0xD8, b, c, tmp       \
        VPADDD     tmp, a, a              \
        VPROLD     $shift, a, a           \
        VMOVAPD    c, tmp                 \

animetosho commented on May 20, 2024

Ah, you should apply it to every round, not just round 1. Also, instances of VMOVAPD should be removed as they're now redundant.

Because you're only processing one vector at a time, and not trying to maximise ILP, it's possible that the CPU is just executing the extra instructions with spare capacity. Though I would've thought there'd at least be some difference with your change, if you're saturating both threads in a core...
Nonetheless, 2 instructions is better than 3, even if it doesn't show up on a benchmark.

fwessels commented on May 20, 2024

I was able to remove the VANDNPD from ROUND2 (ROUND3 and 4 have no logic instructions) but essentially got the same benchmark numbers as above (at least on this server config).
Perhaps Cascade Lake might be different; I will see if I can do a quick check on that.

animetosho commented on May 20, 2024

ROUND3 and 4 have no logic instructions

Round 3 looks fine (it's actually possible to eliminate the move, if you're interested).
Round 4 does have one which can be merged into VPTERNLOG.

Though, as you've found, I doubt this change, by itself, will give an improvement in benchmarks.
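
For the round-4 case the same truth-table approach works: MD5's I(b,c,d) = c XOR (b OR NOT d) also fits in a single VPTERNLOGD. A minimal sketch, assuming for illustration the same operand roles as in round 1 (tmp holding a copy of d, c as src1, b as src2); the real macro may arrange them differently:

package main

import "fmt"

func main() {
	// Assumed roles: dest = tmp (copy of d), src1 = c, src2 = b,
	// so the imm8 index is (d<<2)|(c<<1)|b, as in the round-1 example.
	var imm uint8
	for idx := 0; idx < 8; idx++ {
		d, c, b := uint32(idx>>2)&1, uint32(idx>>1)&1, uint32(idx)&1
		notD := d ^ 1
		if (c^(b|notD))&1 == 1 {
			imm |= 1 << idx
		}
	}
	fmt.Printf("0x%02X\n", imm) // prints 0x63 for this assumed operand order
}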

Perhaps Cascade Lake might be different

It won't be.

fwessels commented on May 20, 2024

Cool, will take a look at it.
