Comments (10)

animetosho commented on May 20, 2024

Just for clarification regarding the title change: I focused mostly on the AVX512 implementation, but most of the suggestions apply to the AVX2 code as well.

klauspost commented on May 20, 2024

Changed the instructions to the integer variant. See #22

Benchmarks show no impact, but it appears to be memory limited anyway.

all bitwise logic should be handled by VPTERNLOG [...]

I don't understand what you are saying. The VXORPD should be handled without any problem while the memory operations are in progress. Feel free to submit a PR.

avoid using gathers

Feel free to send a PR, but for this please add benchmarks. I don't want additional complexity without a clear upside.

consider interleaving two instruction streams

Saturating 16 lanes is pretty much impossible in anything but benchmarks, so adding 16 more will likely be pretty useless.

When 16 lanes are saturated it is memory bound, at least in my tests, so adding more lanes will just add complexity and maybe even slow things down.

consider using EVEX embedded broadcast [...]

A PR is welcome if it has a measurable impact.

We don't use this package for anything significant at MinIO since we don't have many scenarios where we can fill enough lanes for this package to make sense. So this is not something we will invest huge amounts of time in.

animetosho commented on May 20, 2024

Changed the instructions to the integer variant. See #22

Thanks for testing it out, but you're still using floating point operations. PS = packed single float, PD = packed double float. VPXORD would be the correct instruction.

Benchmarks show no impact

On Skylake, I wouldn't actually expect any, as I believe there are few bypass delays. Older Intel CPUs have them, as does AMD Zen, so, at least at the moment, it's more relevant to the AVX2 implementation than to AVX512 (but that may change on future uArchs, so it's always best to use the correct instruction).

I don't understand what you are saying. The VXORPD should be handled without any problem while the memory operations are in progress.

For example, in the first round, you do:

VMOVAPD d, tmp
[...]
VXORPD  c, tmp, tmp
[...]
VPTERNLOGD $0x6C, b, d, tmp

Instead, it should be:

VMOVDQA32 d, tmp
VPTERNLOGD $0xD8, b, c, tmp

In other words, all bitwise logic other than VPTERNLOGD is unnecessary.
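
As an aside on where the $0xD8 above comes from: the imm8 of VPTERNLOGD is an 8-entry truth table, indexed per bit position by (dest, src1, src2) with the destination bit as the most significant selector, and Go assembly lists the operands in reverse (so VPTERNLOGD $0xD8, b, c, tmp means dest=tmp, src1=c, src2=b). A minimal Go sketch of the derivation, assuming tmp holds a copy of d as in the sequence above (the helper name ternImm is just for illustration):

package main

import "fmt"

// ternImm builds a VPTERNLOG imm8 for a 3-input boolean function.
// Each result bit is looked up at index (dest<<2)|(src1<<1)|src2.
func ternImm(f func(dest, src1, src2 uint32) uint32) uint8 {
	var imm uint8
	for idx := 0; idx < 8; idx++ {
		dest, src1, src2 := uint32(idx>>2)&1, uint32(idx>>1)&1, uint32(idx)&1
		if f(dest, src1, src2)&1 == 1 {
			imm |= 1 << idx
		}
	}
	return imm
}

func main() {
	// dest = tmp (a copy of d), src1 = c, src2 = b, so we evaluate
	// MD5's round-1 function F(b,c,d) = (b AND c) OR (NOT b AND d).
	f := func(d, c, b uint32) uint32 { return (b & c) | (^b & d) }
	fmt.Printf("0x%02X\n", ternImm(f)) // prints 0xD8
}

Feeding the same helper the old arrangement, where tmp held c^d, reproduces the $0x6C constant of the original code.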

When 16 lanes are saturated it is memory bound, at least in my tests

If you're never going to get more than 16 lanes, then supporting more is moot, but are you sure it's memory bound? Your benchmark result shows a maximum speed of 17252.88 MB/s across 4 cores (~4.2GB/s per core) - that doesn't sound memory bound to me (unless the hardware has really sucky memory). If you want to do a quick test to see what's possible, you can grab ISA-L and run their MD5 benchmark.
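
As a rough sanity check of that per-core figure (the DRAM bandwidth number in the comment below is a ballpark assumption, not a measurement from this machine):

package main

import "fmt"

func main() {
	aggregate := 17252.88 // MB/s, the maximum figure quoted above
	cores := 4.0          // physical cores it was measured across
	perCore := aggregate / cores / 1024 // roughly GB/s per core
	// A single modern Xeon core can typically stream well over 10 GB/s
	// from DRAM (assumption; varies by platform), so ~4.2 GB/s per core
	// suggests a compute/latency limit rather than memory bandwidth.
	fmt.Printf("~%.1f GB/s per core\n", perCore)
}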

So this is not something we will invest huge amounts of time in.

That's entirely fine :) To be quite honest, I'm not using your package myself, so I have no personal interest in it.
So I'm not actually asking you to do anything - I just thought I'd drop suggestions in case you happened to be interested in them.

fwessels commented on May 20, 2024

@animetosho thanks for your feedback. I tried the suggestion to avoid the XOR but found no difference in performance, as per:

benchmark                           old MB/s     new MB/s     speedup
BenchmarkAvx512/32KB-8              2299.46      2322.53      1.01x
BenchmarkAvx512/64KB-8              2766.58      2759.20      1.00x
BenchmarkAvx512/128KB-8             2853.97      2878.71      1.01x
BenchmarkAvx512/256KB-8             3171.99      3146.22      0.99x
BenchmarkAvx512/512KB-8             3391.35      3379.37      1.00x
BenchmarkAvx512/1MB-8               3397.27      3434.94      1.01x
BenchmarkAvx512/2MB-8               3438.92      3450.22      1.00x
BenchmarkAvx512/4MB-8               3474.41      3468.73      1.00x
BenchmarkAvx512/8MB-8               3495.29      3496.26      1.00x
BenchmarkAvx512Parallel/32KB-8      12645.19     12725.28     1.01x
BenchmarkAvx512Parallel/64KB-8      14413.94     14484.95     1.00x
BenchmarkAvx512Parallel/128KB-8     14964.45     15030.62     1.00x
BenchmarkAvx512Parallel/256KB-8     15570.16     15539.08     1.00x
BenchmarkAvx512Parallel/512KB-8     16067.74     16088.41     1.00x
BenchmarkAvx512Parallel/1MB-8       16151.83     16287.65     1.01x
BenchmarkAvx512Parallel/2MB-8       16258.97     16247.06     1.00x
BenchmarkAvx512Parallel/4MB-8       16906.32     16827.70     1.00x
BenchmarkAvx512Parallel/8MB-8       16720.65     16351.47     0.98x

(this is on a Xeon Platinum 8275CL CPU @ 3.00GHz in AWS)

animetosho commented on May 20, 2024

Are you able to show your code?

fwessels commented on May 20, 2024

This is the diff:

frank:~/go/src/github.com/minio/md5-simd (master) $ git diff
diff --git a/block16_amd64.s b/block16_amd64.s
index be0a43a..cf37c3c 100644
--- a/block16_amd64.s
+++ b/block16_amd64.s
@@ -11,10 +11,10 @@
        VPGATHERDD index*4(base)(ptrs*1), ktmp, mem
 
 #define ROUND1(a, b, c, d, index, const, shift) \
-       VPXORQ     c, tmp, tmp            \
+       VMOVDQA32  d, tmp                 \
        VPADDD     64*const(consts), a, a \
        VPADDD     mem, a, a              \
-       VPTERNLOGD $0x6C, b, d, tmp       \
+       VPTERNLOGD $0xD8, b, c, tmp       \
        prep(index)                       \
        VPADDD     tmp, a, a              \
        VPROLD     $shift, a, a           \
@@ -22,10 +22,10 @@
        VPADDD     b, a, a
 
 #define ROUND1noload(a, b, c, d, const, shift) \
-       VPXORQ     c, tmp, tmp            \
+       VMOVDQA32  d, tmp                 \
        VPADDD     64*const(consts), a, a \
        VPADDD     mem, a, a              \
-       VPTERNLOGD $0x6C, b, d, tmp       \
+       VPTERNLOGD $0xD8, b, c, tmp       \
        VPADDD     tmp, a, a              \
        VPROLD     $shift, a, a           \
        VMOVAPD    c, tmp                 \

animetosho commented on May 20, 2024

Ah, you should apply it to every round, not just round 1. Also, instances of VMOVAPD should be removed as they're now redundant.

Because you're only processing one vector at a time, and not trying to maximise ILP, it's possible that the CPU is just executing the extra instructions with spare capacity. Though I would've thought there'd at least be some difference with your change, if you're saturating both threads in a core...
Nonetheless, 2 instructions is better than 3, even if it doesn't show up on a benchmark.

fwessels commented on May 20, 2024

I was able to remove the VANDNPD from ROUND2 (ROUND3 and 4 have no logic instructions) but essentially got the same benchmark numbers as above (at least on this server config).
Perhaps Cascade Lake might be different; I will see if I can do a quick check on that.

animetosho commented on May 20, 2024

ROUND3 and 4 have no logic instructions

Round 3 looks fine (it's actually possible to eliminate the move, if you're interested).
Round 4 does have one which can be merged into VPTERNLOG.

Though, as you've found, I doubt this change, by itself, will give an improvement in benchmarks.
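
For the round-4 case the same truth-table approach works: MD5's I(b,c,d) = c XOR (b OR NOT d) also fits in a single VPTERNLOGD. A minimal sketch, assuming for illustration the same operand roles as in round 1 (tmp holding a copy of d, c as src1, b as src2); the real macro may arrange them differently:

package main

import "fmt"

func main() {
	// Assumed roles: dest = tmp (copy of d), src1 = c, src2 = b,
	// so the imm8 index is (d<<2)|(c<<1)|b, as in the round-1 example.
	var imm uint8
	for idx := 0; idx < 8; idx++ {
		d, c, b := uint32(idx>>2)&1, uint32(idx>>1)&1, uint32(idx)&1
		notD := d ^ 1
		if (c^(b|notD))&1 == 1 {
			imm |= 1 << idx
		}
	}
	fmt.Printf("0x%02X\n", imm) // prints 0x63 for this assumed operand order
}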

Perhaps Cascade Lake might be different

It won't be.

fwessels commented on May 20, 2024

Cool, will take a look at it.
