Giter Club home page Giter Club logo

Comments (13)

burdges avatar burdges commented on July 20, 2024

If I understand, the benchmarks suffer if you pass these structures by value, but maybe one can tweak that with optimizations modes, etc. In particular, passing by value permits the recipient to mutate the value, so you stack space optimizations might inhibit doing a pass by reference behind the scenes. I'd also think making the operators #[inline(always)] wrappers around call by reference functions should fix this, but maybe not.

from curve25519-dalek.

hdevalence avatar hdevalence commented on July 20, 2024

When we implemented the arithmetic ops for T and not for &T, we had a lot of nested redundant copies: first, copying the LHS and RHS points into the function (each point is 160 bytes), plus copying the LHS and RHS of every field element operation inside (each field element is 40 bytes). It did make a difference in cutting out needless copies. However it's obviously worse ergonomically, as I'm sure you've noticed.

I didn't think that it would be possible to work around this, but it seems like @burdges' suggestion might work, at least on that toy example.

So, to answer your original question, @UnlawfulMonad, if there were implementations of the arithmetic ops that didn't require borrows but were just as fast, we would be happy to take that PR.

from curve25519-dalek.

isislovecruft avatar isislovecruft commented on July 20, 2024

Hi @UnlawfulMonad!

As @hdevalence said, we'd be stoked to take a patch for this, provided it doesn't slow anything down by introducing needless copies everywhere. If you want to implement this, my suggestion for the steps would be:

  1. Implement the operators for non-borrow types, using @burdges suggestion to use #[inline(always)]. (Bonus points for adding #[inline(always)] to all the operator implementations!)
  2. Find anywhere in our code (or in ed25519-dalek or dalek-rangeproofs that is doing what I call an "Eye of Sauron" (i.e. &(&(&(&( )))) like &(&(&s1 * &P1) + &(s2 * &P2)) + &(s3 * &P3) or something similarly evil looking), change it to your non-borrow syntax, and compare—with various rustc optimisation levels—the assembly generated from the original code to the assembly for the new syntax to ensure there's no extra copies. https://rust.godbolt.org is really handy for this. (If you're not comfortable with assembly, feel free to paste links to the the two versions of the code in godbolt, and we'll review it.)
  3. (optional) If all goes well, feel free to kill as many Eye of Saurons as you find. :)

@UnlawfulMonad Would you like me to assign you to this issue?

from curve25519-dalek.

UnlawfulMonad avatar UnlawfulMonad commented on July 20, 2024

Hi @isislovecruft and @hdevalence!

Thanks for the suggestions. I'll get on this next week and update this issue with my progress. I assumed that inlining would already be occurring if the call by value variant were simply wrapping the call by reference versions.

Sidenote: Eye of Sauron is a great name for those things. :)

from curve25519-dalek.

isislovecruft avatar isislovecruft commented on July 20, 2024

@UnlawfulMonad Awesome, thanks! It appears like if we add #[inline(always)] also to the borrow versions, then the by-value versions inline directly to the borrow version code (i.e. so that rustc is basically adding the Eye of Sauron for you).

(Also I used to work at the FreeGeek in Portland! It is an excellent project!)

from curve25519-dalek.

UnlawfulMonad avatar UnlawfulMonad commented on July 20, 2024

@isislovecruft Is code bloat not a concern at that point? #[inline(always)] forces rustc/LLVM to inline. Another option would be to leave the Eyes of Sauron in the library code but gives users the option to use either impl (overgeneralization warning) since in practice the overhead in user code should be negligible. I'm going to test simply adding #[inline] attributes and compare the overhead vs #[inline(always)]ing all the operators.

Sorry for the lack of progress. It's been a hectic couple of weeks.

(And that's awesome! Agreed, it's an awesome project!)

from curve25519-dalek.

UnlawfulMonad avatar UnlawfulMonad commented on July 20, 2024

Alright. Here are my results:

All readings with rustc version 1.19.0-nightly (148e91714 2017-06-08).

Baseline from commit 97d6974:

test curve::bench::add_extended_and_affine_niels_output_completed     ... bench:         212 ns/iter (+/- 87)
test curve::bench::add_extended_and_affine_niels_output_extended      ... bench:         480 ns/iter (+/- 136)
test curve::bench::add_extended_and_projective_niels_output_completed ... bench:         275 ns/iter (+/- 6)
test curve::bench::add_extended_and_projective_niels_output_extended  ... bench:         525 ns/iter (+/- 23)
test curve::bench::basepoint_mult                                     ... bench:      35,181 ns/iter (+/- 12,430)
test curve::bench::bench_select_precomputed_point                     ... bench:       1,032 ns/iter (+/- 50)
test curve::bench::edwards_compress                                   ... bench:      12,023 ns/iter (+/- 1,532)
test curve::bench::edwards_decompress                                 ... bench:      12,837 ns/iter (+/- 2,573)
test curve::bench::extended_double_output_extended                    ... bench:         486 ns/iter (+/- 100)
test curve::bench::mult_by_cofactor                                   ... bench:       1,316 ns/iter (+/- 251)
test curve::bench::projective_double_output_completed                 ... bench:         226 ns/iter (+/- 118)
test curve::bench::scalar_mult                                        ... bench:     152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint        ... bench:     159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult                      ... bench:     399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv                                 ... bench:      12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a                               ... bench:          67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq                                  ... bench:          48 ns/iter (+/- 1)
test scalar::bench::invert                                            ... bench:      67,871 ns/iter (+/- 803)
test scalar::bench::scalar_multiply_add                               ... bench:         270 ns/iter (+/- 68)
test scalar::bench::scalar_random                                     ... bench:       1,115 ns/iter (+/- 101)
test scalar::bench::scalar_unpacked_multiply_add                      ... bench:         199 ns/iter (+/- 12)

With #[inline] on all operators (commit 6cdae94 on the callbyvalue branch of my fork):

test curve::bench::add_extended_and_affine_niels_output_completed              ... bench:         220 ns/iter (+/- 9)
test curve::bench::add_extended_and_affine_niels_output_extended               ... bench:         493 ns/iter (+/- 28)
test curve::bench::add_extended_and_projective_niels_output_completed          ... bench:         287 ns/iter (+/- 23)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench:         299 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_extended           ... bench:         558 ns/iter (+/- 72)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value  ... bench:         593 ns/iter (+/- 33)
test curve::bench::basepoint_mult                                              ... bench:      36,890 ns/iter (+/- 6,329)
test curve::bench::bench_select_precomputed_point                              ... bench:         891 ns/iter (+/- 60)
test curve::bench::edwards_compress                                            ... bench:      12,197 ns/iter (+/- 2,828)
test curve::bench::edwards_decompress                                          ... bench:      12,710 ns/iter (+/- 1,029)
test curve::bench::extended_double_output_extended                             ... bench:         493 ns/iter (+/- 14)
test curve::bench::mult_by_cofactor                                            ... bench:       1,318 ns/iter (+/- 96)
test curve::bench::projective_double_output_completed                          ... bench:         220 ns/iter (+/- 10)
test curve::bench::scalar_mult                                                 ... bench:     156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value                                        ... bench:     163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv                                          ... bench:      11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a                                        ... bench:          65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 3)
test scalar::bench::invert                                                     ... bench:      65,925 ns/iter (+/- 3,157)
test scalar::bench::scalar_multiply_add                                        ... bench:         257 ns/iter (+/- 12)
test scalar::bench::scalar_random                                              ... bench:       1,135 ns/iter (+/- 61)
test scalar::bench::scalar_unpacked_multiply_add                               ... bench:         195 ns/iter (+/- 23)

With #[inline(always)] on all operators (commit 156c92d on the call by value branch of my fork):

test curve::bench::add_extended_and_affine_niels_output_completed              ... bench:         208 ns/iter (+/- 6)
test curve::bench::add_extended_and_affine_niels_output_extended               ... bench:         488 ns/iter (+/- 19)
test curve::bench::add_extended_and_projective_niels_output_completed          ... bench:         278 ns/iter (+/- 9)
test curve::bench::add_extended_and_projective_niels_output_completed_by_value ... bench:         295 ns/iter (+/- 11)
test curve::bench::add_extended_and_projective_niels_output_extended           ... bench:         550 ns/iter (+/- 20)
test curve::bench::add_extended_and_projective_niels_output_extended_by_value  ... bench:         565 ns/iter (+/- 16)
test curve::bench::basepoint_mult                                              ... bench:      36,467 ns/iter (+/- 1,810)
test curve::bench::bench_select_precomputed_point                              ... bench:         882 ns/iter (+/- 36)
test curve::bench::edwards_compress                                            ... bench:      12,003 ns/iter (+/- 1,698)
test curve::bench::edwards_decompress                                          ... bench:      12,630 ns/iter (+/- 1,812)
test curve::bench::extended_double_output_extended                             ... bench:         496 ns/iter (+/- 32)
test curve::bench::mult_by_cofactor                                            ... bench:       1,331 ns/iter (+/- 32)
test curve::bench::projective_double_output_completed                          ... bench:         217 ns/iter (+/- 8)
test curve::bench::scalar_mult                                                 ... bench:     155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value                                        ... bench:     156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv                                          ... bench:      11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a                                        ... bench:          66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 4)
test scalar::bench::invert                                                     ... bench:      72,665 ns/iter (+/- 1,737)
test scalar::bench::scalar_multiply_add                                        ... bench:         267 ns/iter (+/- 53)
test scalar::bench::scalar_random                                              ... bench:       1,130 ns/iter (+/- 195)
test scalar::bench::scalar_unpacked_multiply_add                               ... bench:         216 ns/iter (+/- 40)

Isolating a couple of the tests we get:

Baseline:
test curve::bench::scalar_mult                                        ... bench:     152,515 ns/iter (+/- 2,688)
test curve::bench::vartime::bench_double_scalar_mult_basepoint        ... bench:     159,619 ns/iter (+/- 2,291)
test curve::bench::vartime::ten_fold_scalar_mult                      ... bench:     399,000 ns/iter (+/- 7,540)
test field::bench::fieldelement_a_inv                                 ... bench:      12,200 ns/iter (+/- 1,889)
test field::bench::fieldelement_a_mul_a                               ... bench:          67 ns/iter (+/- 1)
test field::bench::fieldelement_a_sq                                  ... bench:          48 ns/iter (+/- 1)

#[inline]
test curve::bench::scalar_mult                                                 ... bench:     156,671 ns/iter (+/- 10,735)
test curve::bench::scalar_mult_by_value                                        ... bench:     163,941 ns/iter (+/- 13,653)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     160,486 ns/iter (+/- 10,722)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     402,349 ns/iter (+/- 16,962)
test field::bench::fieldelement_a_inv                                          ... bench:      11,891 ns/iter (+/- 2,288)
test field::bench::fieldelement_a_mul_a                                        ... bench:          65 ns/iter (+/- 2)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 6)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 3)

#[inline(always)]
test curve::bench::scalar_mult                                                 ... bench:     155,447 ns/iter (+/- 5,160)
test curve::bench::scalar_mult_by_value                                        ... bench:     156,946 ns/iter (+/- 6,148)
test curve::bench::vartime::bench_double_scalar_mult_basepoint                 ... bench:     157,789 ns/iter (+/- 9,171)
test curve::bench::vartime::ten_fold_scalar_mult                               ... bench:     405,442 ns/iter (+/- 19,138)
test field::bench::fieldelement_a_inv                                          ... bench:      11,806 ns/iter (+/- 1,706)
test field::bench::fieldelement_a_mul_a                                        ... bench:          66 ns/iter (+/- 3)
test field::bench::fieldelement_a_mul_a_by_value                               ... bench:          67 ns/iter (+/- 3)
test field::bench::fieldelement_a_sq                                           ... bench:          47 ns/iter (+/- 4)

Note: the by_value variants of tests modify the test to use the call by value variants. They don't change anything else. I didn't include any by_value variants to any tests that broke just by removing ampersands.

My assembly-fu isn't at the point where I can tell exactly what changed between runs but a cursory look suggests that it's very similar if not identical. Are these acceptable results?

from curve25519-dalek.

hdevalence avatar hdevalence commented on July 20, 2024

I started looking at this using cargo benchcmp on a machine with Turbo Boost disabled, so that the timings are actually consistent. I'm not entirely clear on what's going on, but hopefully I can get a clearer picture on the weekend or next week (I just wanted to note that I didn't forget about it).

from curve25519-dalek.

hdevalence avatar hdevalence commented on July 20, 2024

I compared the existing implementation with two variants of this trick: first, just having #[inline(always)] on the move/copy operators without #[inline] on all the other methods; second, having #[inline(always)] on the move/copy operators with the additional #[inline]s.

Roughly speaking, the results I got were a slight across-the-board decline in performance for the first option, which ends up inserting a bunch of extra copies. For the second option, there's a slight increase in performance for some functions, and some microbenchmarks, but there's a decrease in performance for the inversion used in decompression. (I believe this is because it's inlined all ~260 field operations).

What I suspect is happening here is that the extra inline annotations are interacting with the cost model in unexpected and not always positive ways. So while, on a single example, using #[inline(always)] on the move/copy operator works out, when it's done across the board there are unexpected interactions. It almost works -- but I think a cleaner solution would be to make arithmetic operators in Rust work the way the . operator does.

However, for point operations, which are what people who use -dalek want to work with, the cost of a copy is tiny compared to the cost of the operation. For those operations, it doesn't matter whether someone uses the move operator or the borrow operator. So I think that it makes sense to keep the operator-generating macros for those, at least until we can fix Rust.

from curve25519-dalek.

UnlawfulMonad avatar UnlawfulMonad commented on July 20, 2024

Hmm. I'm interested in the idea of arithmetic operators behaving like operator .. If there hasn't already been discussion on this topic can I steal the idea and see if I can propose this as a (Rust) RFC?

from curve25519-dalek.

burdges avatar burdges commented on July 20, 2024

There is more serious discussion on that topic now in rust-lang/rfcs#2147

from curve25519-dalek.

hdevalence avatar hdevalence commented on July 20, 2024

Just to circle back on this, the refactoring in #86 more cleanly separates the internal API from the external API. I think the fix for this issue is to just implement operators for both T and &T for all public types T, using something like @UnlawfulMonad's macro idea.

from curve25519-dalek.

hdevalence avatar hdevalence commented on July 20, 2024

Closing this issue since it's fixed in #101

from curve25519-dalek.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.