It seems unacceptable that a primitive that can be accelerated (leading to ~1.5x speed

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Implement ASM for field ops/Arithmetic optimisations about algebra HOT 10 OPEN

arkworks-rs commented on September 27, 2024

Implement ASM for field ops/Arithmetic optimisations

from algebra.

Comments (10)

jon-chuang commented on September 27, 2024 1

As some updates, I managed to integrate Fp256 assembly leading to about 25% speedup in Fp256 mults. But it is proving harder to write generic assembly code than expected, and the inline assembly module doesn't really play nice with catching errors and doing what it's supposed to.

I am going to try to simplify my code by writing smaller chunks of inline assembly that are easier to verify correct in a piecemeal fashion and also hopefully still allows the rust compiler to reason about memory fetches better than I can. Hopefully, I will be able to tweak both the rust and assembly sides of the code with careful benchmarking.

Also, I should use the following tool for benchmarking clock cycles: https://git.zx2c4.com/kbench9000/about/. But it may only work with C? In any case, it seems better practice to benchmark clock cycles than time because there really is a lot of clock speed volatility across time, which is undesirable. I will look into what Rust tooling exists.

from algebra.

Pratyush commented on September 27, 2024

If you'd like to implement this, please go ahead, but I'd prefer to merge this behind a feature if we're going to be using unsafe code.

from algebra.

jon-chuang commented on September 27, 2024

Sure, I can abide. By the way, there seems to be a discrepancy between the full pairing timings for BLS12-381 (~2ms) and BLS12-377 (~2.5ms). I have replicated it 3 times. This does not quite stand to reason, as the underlying fields are the same size. Do you know of possible reasons?

from algebra.

Pratyush commented on September 27, 2024

The curves are of different types (M-type vs D-type), and so have different final exponentiation algorithms.

from algebra.

Pratyush commented on September 27, 2024

btw, you don't necessarily need to drop down to assembly to utilize SIMD instructions; these are available as intrinsics here, for example here: https://doc.rust-lang.org/1.29.1/core/arch/x86_64/index.html

from algebra.

jon-chuang commented on September 27, 2024

I see. I will try to study this. I noticed the miller loop for 377 is slower too. Is there a reason why Zexe uses the 377 rather than 381 curve? Is this to have a power of 2 subgroup in both p and r? Perhaps I should read the Zexe paper more carefully.

I will test out using the SIMD instructions provided first. From what I gather, it's still unsafe rust.

from algebra.

jon-chuang commented on September 27, 2024

I've looked into things - SIMD isn't efficient for small numbers unless one uses an unsaturated representation that does not work for general moduli. Nonetheless, one could possibly try to rewrite point addition formulas to take advantage of SIMD, rearranging ops to avoid data dependencies that can cause clock speeds to throttle, by parallelising at the data level, rather than the representation (add 4 field elements at the same time etc), but this is a lot of manual work. There is some work on 32-bit SIMD which seems performant that I will try to review more carefully. In particular, I will try to take a look if strategies used to target NEON can be used for WASM SIMD.

The Rust std_arch intrinsics do not support Intel's ADX multiprecision add and carry (adc) instructions for now, so we will have to resort to writing assembly. Since it is difficult to maintain this assembly for all the given field sizes, I will try to do code gen with macros and focus on schoolbook Montgomery with goff's optimisation. I will try to get similar or better performance to goff.

Karatsuba might be reserved for curves over larger base field sizes like the SW ans BW curves and will be prioritised for later.

from algebra.

Pratyush commented on September 27, 2024

Thanks, this sounds like good plan!

from algebra.

Pratyush commented on September 27, 2024

Re: benchmarking in terms of cycles, one way could be to look into how perf.rust-lang.org counts number of instructions instead of wall-clock time.

from algebra.

Pratyush commented on September 27, 2024

@jon-chuang what's the status of this? Should this be closed?

from algebra.

Implement ASM for field ops/Arithmetic optimisations about algebra HOT 10 OPEN

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent