Giter Club home page Giter Club logo

Comments (10)

jon-chuang avatar jon-chuang commented on September 27, 2024 1

As some updates, I managed to integrate Fp256 assembly leading to about 25% speedup in Fp256 mults. But it is proving harder to write generic assembly code than expected, and the inline assembly module doesn't really play nice with catching errors and doing what it's supposed to.

I am going to try to simplify my code by writing smaller chunks of inline assembly that are easier to verify correct in a piecemeal fashion and also hopefully still allows the rust compiler to reason about memory fetches better than I can. Hopefully, I will be able to tweak both the rust and assembly sides of the code with careful benchmarking.

Also, I should use the following tool for benchmarking clock cycles: https://git.zx2c4.com/kbench9000/about/. But it may only work with C? In any case, it seems better practice to benchmark clock cycles than time because there really is a lot of clock speed volatility across time, which is undesirable. I will look into what Rust tooling exists.

from algebra.

Pratyush avatar Pratyush commented on September 27, 2024

If you'd like to implement this, please go ahead, but I'd prefer to merge this behind a feature if we're going to be using unsafe code.

from algebra.

jon-chuang avatar jon-chuang commented on September 27, 2024

Sure, I can abide. By the way, there seems to be a discrepancy between the full pairing timings for BLS12-381 (~2ms) and BLS12-377 (~2.5ms). I have replicated it 3 times. This does not quite stand to reason, as the underlying fields are the same size. Do you know of possible reasons?

from algebra.

Pratyush avatar Pratyush commented on September 27, 2024

The curves are of different types (M-type vs D-type), and so have different final exponentiation algorithms.

from algebra.

Pratyush avatar Pratyush commented on September 27, 2024

btw, you don't necessarily need to drop down to assembly to utilize SIMD instructions; these are available as intrinsics here, for example here: https://doc.rust-lang.org/1.29.1/core/arch/x86_64/index.html

from algebra.

jon-chuang avatar jon-chuang commented on September 27, 2024

I see. I will try to study this. I noticed the miller loop for 377 is slower too. Is there a reason why Zexe uses the 377 rather than 381 curve? Is this to have a power of 2 subgroup in both p and r? Perhaps I should read the Zexe paper more carefully.

I will test out using the SIMD instructions provided first. From what I gather, it's still unsafe rust.

from algebra.

jon-chuang avatar jon-chuang commented on September 27, 2024

I've looked into things - SIMD isn't efficient for small numbers unless one uses an unsaturated representation that does not work for general moduli. Nonetheless, one could possibly try to rewrite point addition formulas to take advantage of SIMD, rearranging ops to avoid data dependencies that can cause clock speeds to throttle, by parallelising at the data level, rather than the representation (add 4 field elements at the same time etc), but this is a lot of manual work. There is some work on 32-bit SIMD which seems performant that I will try to review more carefully. In particular, I will try to take a look if strategies used to target NEON can be used for WASM SIMD.

The Rust std_arch intrinsics do not support Intel's ADX multiprecision add and carry (adc) instructions for now, so we will have to resort to writing assembly. Since it is difficult to maintain this assembly for all the given field sizes, I will try to do code gen with macros and focus on schoolbook Montgomery with goff's optimisation. I will try to get similar or better performance to goff.

Karatsuba might be reserved for curves over larger base field sizes like the SW ans BW curves and will be prioritised for later.

from algebra.

Pratyush avatar Pratyush commented on September 27, 2024

Thanks, this sounds like good plan!

from algebra.

Pratyush avatar Pratyush commented on September 27, 2024

Re: benchmarking in terms of cycles, one way could be to look into how perf.rust-lang.org counts number of instructions instead of wall-clock time.

from algebra.

Pratyush avatar Pratyush commented on September 27, 2024

@jon-chuang what's the status of this? Should this be closed?

from algebra.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.