arkworks-rs / algebra
Libraries for finite field, elliptic curve, and polynomial arithmetic
Home Page: https://arkworks.rs
License: Apache License 2.0
We should make it easier to input parameters to curves etc., rather than relying on decomposing numbers into hex limb values by hand. This is tolerable for hex values, disastrous for decimals, and mistake-prone overall.
It might be helpful to add some READMEs in algebra (and, in the future, other directories) that list out the basic math of the various curves — security, special characteristics like twists and GLV endomorphisms — and some relevant references, since we seem to be adding more curves. Sort of like a pokedex. My inspiration is OpenAI's Spinning Up, which has nice documentation that explains some of the differences between RL algorithms and throws in some useful formulas.
I am happy to do this as it will help motivate my study of elliptic curves.
This design is primarily based on the original GLV paper and Aztec's approach of pre-dividing by n followed by a second round of approximate division by right shift. We include an analysis showing that the error bound, in terms of the short lattice basis lengths, does not increase significantly.
Given a modulus n and a cheap endomorphism ϕ with associated scalar λ such that ϕ(P) = λP for all P on the curve, the objective is to compute two vectors v1 = (a1, b1), v2 = (a2, b2) in Z × Z such that ||vi|| < sqrt(n) and ai + λ·bi = n·si ≡ 0 (mod n). By the construction from the extended Euclidean algorithm found in the paper, we see that wlog a1·|b2| + a2·|b1| = n. We also know that v1 is linearly independent of v2.
Then, given any scalar k ∈ Z/nZ, one has (k, 0) = β1·v1 + β2·v2, where β1, β2 ∈ Q. Solving the linear system (Cramer's rule, with determinant a1·b2 − a2·b1 = ±n), we find that

β1 = k·b2 / (a1·b2 − a2·b1)
β2 = −k·b1 / (a1·b2 − a2·b1)

If b1 and b2 have different signs, |β1| = |b2|·k/n and |β2| = |b1|·k/n, since then k = (a1·|b2| + a2·|b1|)·k/n and b1·|b2| + b2·|b1| = 0.
Otherwise the same computation applies up to signs; as a matter of fact, we can assume that at most one of b1, b2 is negative.
Choose R > n (in practice a power of two, so that division by R is a right shift). Then one precomputes g1 = round(R·|b2|/n) and g2 = round(R·|b1|/n). Write k·b + 1/2 = q·n + t with 0 ≤ t < n. Note that if R·b + (n−1)/2 = g′·n + t′, then g′ = R·b/n + 1/2 − t′/n + ε. Then, if we take c = g′·k/R = (k/R)·(R·b/n + 1/2 − t′/n + ε) = k·b/n + k/(2R) − t′·k/(n·R) + ε′, rounding c gives a result at distance at most |t/n − 1/2 + (k/R)·(1/2 − t′/n) + ε′| from k·b/n. Notice that this is at most 1.
Let c1 = round(g1·k/R) and c2 = round(g2·k/R). So our total error u = (k, 0) − (c1·v1 + c2·v2) satisfies ||u|| ≤ ||v1|| + ||v2|| ≤ 2·max(||v1||, ||v2||) < 2·sqrt(n) by the triangle inequality, and f(u) ≡ k (mod n), where f((a, b)) = a + λ·b. So we set k2 = −(c1·b1 + c2·b2) and k1 = k − λ·k2 mod n.
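The decomposition above can be sketched in Rust with toy parameters. The constants below (n = 13, λ = 3 with λ² + λ + 1 ≡ 0 mod n, basis v1 = (3, −1), v2 = (1, 4), so det = a1·b2 − a2·b1 = n) are illustrative assumptions, not parameters of any real curve; real code would use precomputed multi-limb constants and the round-by-shift trick with R = 2^s.

```rust
// Toy GLV decomposition following the derivation above.
// ASSUMED toy parameters (not from any real curve): n = 13, lambda = 3,
// short basis v1 = (3, -1), v2 = (1, 4), with a_i + lambda*b_i = 0 (mod n).

const N: i128 = 13;
const LAMBDA: i128 = 3;
const V1: (i128, i128) = (3, -1);
const V2: (i128, i128) = (1, 4);

/// Round num/den to the nearest integer (den > 0, ties round up).
fn round_div(num: i128, den: i128) -> i128 {
    let (q, r) = (num.div_euclid(den), num.rem_euclid(den));
    if 2 * r >= den { q + 1 } else { q }
}

/// Decompose k into (k1, k2) with k = k1 + lambda*k2 (mod n), both small.
fn glv_decompose(k: i128) -> (i128, i128) {
    let ((a1, b1), (a2, b2)) = (V1, V2);
    // Cramer's rule for (k, 0) = beta1*v1 + beta2*v2; the determinant is n here.
    let c1 = round_div(k * b2, N);  // round(beta1)
    let c2 = round_div(-k * b1, N); // round(beta2)
    // u = (k, 0) - c1*v1 - c2*v2; return its two coordinates.
    (k - c1 * a1 - c2 * a2, -(c1 * b1 + c2 * b2))
}
```

For every scalar k this returns (k1, k2) with k ≡ k1 + λ·k2 (mod n) and both components bounded by the short basis lengths.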
Notes:
Z × Z and also (Z/nZ) × (Z/nZ)
Marlin IVC for a predicate of max degree 10^6 would demand a domain of roughly 3,000,000 for MNT6-753.
However, MNT6-753 currently has a smooth subgroup of size 2^15 * 5^2, which is insufficient to support this.
Changing 5^2 to 71^1 is still insufficient.
Thus, this calls for support of a truly multi-radix FFT, handling more than two primes.
With a three-radix FFT over a domain of size (2^15)*(5^2)*7, Marlin IVC would more likely work.
It is now left as a future todo.
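As a quick sanity check of the sizes quoted above (the ~3,000,000 target is taken from the issue), one can simply multiply out the smooth factorisations:

```rust
// Multiply out a smooth domain factorisation p1^e1 * p2^e2 * ...
fn domain_size(factors: &[(u64, u32)]) -> u64 {
    factors.iter().map(|&(p, e)| p.pow(e)).product()
}
```

This confirms that 2^15 * 5^2 and 2^15 * 71 both fall short of 3,000,000, while 2^15 * 5^2 * 7 comfortably exceeds it.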
cc'ed @npwardberkeley
Constant time is not a priority for zexe, so it could be nice to have constant-time ops that are feature-gated for when they are needed.
In order to match gnark timings for G1, we should bring mul_assign timings down to 220us from 270us. add_mixed should be brought down from 750us to 705us.
gnark uses the following algorithm credited to Jonathan Bootle: https://jbootle.github.io/Misc/pippenger.pdf.
Actually, I think the derive approach might not be good, especially because of the discrepancy in the way that G1Affine etc. are instantiated. So I'm leaning towards a plain macro approach.
Field should return the extension degree + characteristic (already implemented) so that one can easily determine its order.
Yes, the crate dev is not responsive, but are there really necessary changes?
It appears that ProjectiveCurve: Hash, which sounds slow. I'd expect only AffineCurve: Hash, so one must explicitly batch-normalize any time one wants hashes of points.
There is some complexity here because one should minimize pairings in a pairing equation by collecting points in buckets and running roughly Gaussian elimination. It's doable optimally for BLS signatures provided you give yourself some pre-hashed fixed size message type like a [u8; 32]
, so you can collect those points before even creating any projective points with hash-to-curve.
In general, you need some excess redundancy in some batch_normalization calls if you start with only a bunch of projective points on both sides, but any given proof system could have its batch verifier written to do this optimally, I think; you just need to look at the specific equation. In the end, all these worries would land under some TIPP/MIPP prover, I think.
Hi all,
My name is Dimitri Koshelev. I am a researcher from Moscow and Paris. My field of science is pairing-based cryptography.
Earlier I posted my new point compression method for elliptic curves of j-invariant 0 on the site ethresear.ch. Maybe you are interested.
Also, I know how to generalize my compression method to simultaneously compress three elliptic F_q-points (resp. one F_{q^3}-point) such that at the decompression stage one needs to carry out only 1 exponentiation in the field.
In your opinion, is it a useful result or not?
Sincerely yours, Dimitri.
Currently, the MNT4/6 pairing is implemented in projective coordinates with a binary Miller loop.
Note that the Hamming weight of ATE_LOOP_COUNT for both MNT298 and MNT753 is lower when decomposed in 2-NAF. I suggest implementing a 2-NAF Miller loop for MNT4 and MNT6.
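For reference, here is a sketch of the 2-NAF (non-adjacent form) recoding that would drive such a Miller loop; the function below is illustrative, not the zexe API. Each digit is in {−1, 0, 1}, no two consecutive digits are nonzero, and the Hamming weight is never worse than that of the binary expansion.

```rust
/// Compute the 2-NAF digits of k, least significant first.
/// Digits are in {-1, 0, 1} with no two adjacent nonzero digits.
fn two_naf(mut k: u128) -> Vec<i8> {
    let mut digits = Vec::new();
    while k > 0 {
        if k & 1 == 1 {
            // Pick the odd digit d = ±1 so that k - d is divisible by 4.
            let d: i8 = if k & 3 == 3 { -1 } else { 1 };
            digits.push(d);
            if d == 1 { k -= 1 } else { k += 1 }
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits
}

/// Number of nonzero digits (i.e. doubling steps that also need an add).
fn hamming(digits: &[i8]) -> usize {
    digits.iter().filter(|&&d| d != 0).count()
}
```

For example, 119 = 0b1110111 has binary weight 6 but 2-NAF weight 3, so a 2-NAF Miller loop would perform half the line additions on such a loop count.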
It is algebra after all, and will reduce the clutter.
I assume that Zexe's variable-base multiexp is inherited from bellman. In the section "New advances in multi-scalar-multiplication over elliptic curves" of this document by the Aztec team, it is shown that there is a 2x improvement available.
I was wondering if lazy reductions are possible to speed up the EC operations.
This is as far as I can tell relevant for affine, Jacobian and projective formulas, where intermediate terms are composed of summands that are multiples of two or more field elements.
I'm pretty sure pairing formulas can benefit as well.
In Montgomery reduction mod N, we only require aR·bR to be in the range [0, R·N − 1] (where we require R > N), rather than in the range [0, (R−1)^2].
We can probably use some appropriate compile time checks to make sure that no errors are introduced when utilising lazy reductions in writing formulas.
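A single-limb sketch of the idea, assuming a toy 61-bit modulus and R = 2^64 (all names are illustrative; real field code is multi-limb): since Montgomery reduction only needs its input in [0, R·N), two full products fit (2·N² < R·N here) and can be accumulated lazily before a single reduction.

```rust
// Toy modulus: 2^61 - 1 (odd, and 2*N^2 < R*N for R = 2^64).
const N: u64 = (1 << 61) - 1;

/// Inverse of an odd n modulo 2^64 via Newton iteration.
fn inv_mod_r(n: u64) -> u64 {
    let mut x: u64 = 1;
    for _ in 0..6 {
        x = x.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(x)));
    }
    x
}

/// Montgomery reduction: returns t / R mod N for any t in [0, R*N).
/// The wide input range is exactly the slack that lazy reduction exploits.
fn mont_reduce(t: u128) -> u64 {
    let n_inv_neg = inv_mod_r(N).wrapping_neg();
    let m = (t as u64).wrapping_mul(n_inv_neg);
    let folded = ((t + (m as u128) * (N as u128)) >> 64) as u64;
    if folded >= N { folded - N } else { folded }
}

/// Map x into Montgomery form: x * R mod N.
fn to_mont(x: u64) -> u64 {
    (((x as u128) << 64) % (N as u128)) as u64
}
```

The payoff: a sum of products like a·b + c·d costs one reduction instead of two, which is where the intermediate terms in the curve formulas come in.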
Implement all algorithms defined in Figure 1 of https://eprint.iacr.org/2012/685.pdf
We can obtain up to 30% speedups for scalar multiplication by implementing a GLV-type endomorphism. See Section 3.5 (pg. 123) of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.394.3037&rep=rep1&type=pdf
Currently the Miller loop for SW6 uses affine coordinates, which makes the loop much slower than necessary. Moving to projective coordinates should speed up the loop significantly.
One can use the MNT6 Miller loop code as a reference point to get this working. Indeed, a prior version of the SW6 ML code was basically a port of the MNT6 code, but it had incorrect constants, and so the pairing result was incorrect. This (incorrect) version was ~4x faster than the current (correct) version.
You can preprocess the twiddle factors on the domain for FFTs. This halves the number of multiplications required for an FFT. You just have to align the twiddle factors with how they get used.
Here is the relevant code in libiop that does these:
Creating the cache: https://github.com/scipr-lab/libiop/blob/master/libiop/algebra/field_subset/subgroup.tcc#L118
Using the cache: https://github.com/scipr-lab/libiop/blob/master/libiop/algebra/fft.tcc#L309
Currently the From traits aren't implemented on Field, only on PrimeField. This is because they are not implemented on prime field extensions. I think it makes sense to have a From impl on a field extension Fp[x] that just sets the 'constant' term of the field element, which is consistent with what the base field does.
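A sketch of the suggested behaviour with toy stand-in types (Fp, Fp2, and the modulus here are illustrative, not the actual zexe definitions):

```rust
// Toy prime field, modulus 101.
const P: u64 = 101;

#[derive(Debug, PartialEq, Clone, Copy)]
struct Fp(u64);

impl From<u64> for Fp {
    fn from(x: u64) -> Self {
        Fp(x % P)
    }
}

// Toy quadratic extension Fp[x]/(x^2 - nonresidue).
#[derive(Debug, PartialEq, Clone, Copy)]
struct Fp2 {
    c0: Fp,
    c1: Fp,
}

impl From<u64> for Fp2 {
    fn from(x: u64) -> Self {
        // Lift into the 'constant' coefficient only, matching the base field.
        Fp2 { c0: Fp::from(x), c1: Fp::from(0) }
    }
}
```

With such impls in place for every extension tower level, the From bounds could be hoisted from PrimeField up to Field.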
I hope to either use webgpu or emu to parallelise MSM/FFTs. I am not sure what the current implementation of MSM is, but we need to load-balance the different buckets. My target is a 2-10x speedup on a powerful GPU like GTX1060-6GB.
An idea I have which may or may not make sense is to perform the FFT in parallel on the CPU first, then when the butterflies are sufficiently small, move them into small kernels on the GPU. (can't remember which way around it is). Due to lack of data dependencies, we should be able to get very high streaming throughput.
Other problems to think about include how to package the library so that GPU support can be compiled in if desired/detected. Shouldn't be too hard utilising some WebGPU/#[cfg] functionality.
Add methods on the Field and Curve traits if necessary.
Hi all, I would like to share with you some benchmarks I ran on zexe, gnark/goff and zkcrypto's BLS12-381. I believe this is a good place to start to identify what the key avenues for optimisation are.
From these results, it seems that major areas of optimisation are in
More benchmarking should be done once Fq has been sufficiently optimised.
Yet to benchmark:
zk-swap-libff/libff
barretenberg
I believe the deserialization logic here should be improved: https://github.com/scipr-lab/zexe/blob/e1db7c5fede8ba5a0f5b3958692de44da65c1cb6/algebra-core/src/fields/macros.rs#L175
When Zexe tries to read a byte stream and gets a value that isn't in the field, it returns 0. This is confusing for users trying to check for errors, and makes a returned 0 indistinguishable from a valid 0 value.
A current approach to mitigate it is https://github.com/scipr-lab/zexe/pull/181/files#diff-0d3539aa74111fa4e3ea4d5b08b7278cR289, but should be improved.
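One possible shape for the fix, sketched with hypothetical toy types (the error enum, the modulus, and the 8-byte little-endian layout are all assumptions): out-of-range encodings become an explicit error instead of a silent 0.

```rust
// Toy stand-ins; a real field is multi-limb and the modulus a curve constant.
const MODULUS: u64 = (1 << 61) - 1;

#[derive(Debug, PartialEq)]
struct Fp(u64);

#[derive(Debug, PartialEq)]
enum SerializationError {
    InvalidFieldElement,
}

/// Reject encodings >= the modulus rather than mapping them to zero.
fn read_fp(bytes: [u8; 8]) -> Result<Fp, SerializationError> {
    let v = u64::from_le_bytes(bytes);
    if v < MODULUS {
        Ok(Fp(v))
    } else {
        Err(SerializationError::InvalidFieldElement)
    }
}
```

This way a genuine zero round-trips, and an invalid byte stream is surfaced to the caller.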
Currently the polynomial library does not support an API to multiply more than two polynomials together efficiently.
For n
polynomials the API would ideally:
This can trip up users. Should we change this?
currently it is fixed
The current integration of mixed-radix FFTs is kind of unsatisfactory. There are two issues:
Required size: 10;
GeneralEvaluationDomain: domain_size = 2^4 = 16
DynamicEvaluationDomain: domain_size = 2 * 5 = 10
To resolve this, we might want to implement a "dynamic" domain that automatically chooses a subgroup whose size matches the required input size as closely as possible. This is different from the behaviour of something like GeneralEvaluationDomain, which tries to fit things into radix-2 if possible, and only goes for larger domain sizes if radix-2 is insufficient. This would also unify the existing Radix2Domain, MixedRadixDomain, and GeneralEvaluationDomain impls.
FFTs in the library are all n·log(n). However, they could be n·log(d), which is an efficiency improvement when doing polynomial multiplication.
This can be taken advantage of in two possible ways:
This probably doesn't matter for the current pairing based SNARKs as they are dominated by MSM time.
By taking a bigint instead of a field element, we already lose the invariant on the size of the argument. Further, it's clear this has little bearing on code correctness. Hence, I suggest that either scalar_mul be parameterizable by a bigint param, or that we simply accept slices of &[u64] to allow for a more flexible interface.
Depends on arkworks-rs/snark#104
It'd be nice to have a helper method, as an extension trait over CanonicalSerialize, which calls a cryptographically secure hash function (maybe make it take a Digest?) on the serialized input and returns the fixed-size hash of the provided data.
It'd be even better if it would also be available on structs composing such elements.
Example:
use blake2::Blake2b;
use digest::Digest;
use generic_array::GenericArray;
use zexe::algebra::*;

trait Hasher: CanonicalSerialize {
    fn hash<H: Digest>(&self) -> GenericArray<u8, H::OutputSize> {
        // Serialize into an in-memory buffer, then hash the buffer.
        let mut buffer = Vec::new();
        self.serialize(&mut buffer).unwrap();
        let mut hasher = H::new();
        hasher.input(&buffer);
        hasher.result()
    }
}

// Blanket impl so anything serializable gets `hash` for free.
impl<T: CanonicalSerialize> Hasher for T {}

struct X<E: PairingEngine> {
    a: E::G1Affine,
    b: E::G2Affine,
    c: Vec<E::G1Affine>,
}

let x = X::<E> {
    a: E::G1Affine::zero(),
    b: E::G2Affine::zero(),
    c: vec![],
};
let hash = x.hash::<Blake2b>();
This would serialize all elements (compressed or uncompressed? i don't think it matters) to a buffer, and then take the hash of the buffer.
It becomes a bit trickier I believe if the struct ends up having non-CanonicalSerialize elements.
e.g.
struct X<E: PairingEngine> {
    a: E::G1Affine,
    b: E::G2Affine,
    c: Vec<E::G1Affine>,
    transcript_hash: [u8; 64],
}
Currently prime_field.root_of_unity() returns a group element of order 2^{F::PARAMS::TWO_ADICITY}.
It would be convenient to have a method on PrimeField to get the root of unity for a given power of 2. This operation comes up fairly often, including the evaluation domain API of zexe.
It could also be extended in the future to get non-power of two roots.
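A sketch of the proposed method over a toy field (F_17, TWO_ADICITY = 4, maximal 2-adic root 3; all names are illustrative, not the zexe API): since squaring a root of unity halves its order, the root of order 2^k is the maximal root raised to the power 2^(TWO_ADICITY − k).

```rust
// Toy field F_17: 17 - 1 = 2^4, and 3 generates the 2^4-torsion.
const P: u64 = 17;
const TWO_ADICITY: u32 = 4;
const ROOT_OF_UNITY: u64 = 3; // order 2^4 in F_17

/// Square-and-multiply modular exponentiation.
fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % m;
        }
        b = b * b % m;
        e >>= 1;
    }
    acc
}

/// Root of unity of order 2^k, or None if k exceeds the field's two-adicity.
fn root_of_unity_of_order(k: u32) -> Option<u64> {
    if k > TWO_ADICITY {
        return None;
    }
    // Each squaring halves the order, so square (TWO_ADICITY - k) times.
    Some(pow_mod(ROOT_OF_UNITY, 1u64 << (TWO_ADICITY - k), P))
}
```

The None case gives the evaluation-domain code a clean signal that the requested domain size is unsupported.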
It seems unacceptable that a primitive that can be accelerated (leading to a ~1.5x speedup) is not.
Hence I suggest looking into how one should add assembly for field ops to make use of vectorised instructions etc. I also suggest looking into modes of multiplication beyond textbook, such as Karatsuba multiplication. This may not lead to huge speedups on x86_64, but might on other platforms. Further, one should look into optimisations for faster extension field arithmetic.
Finally, I propose looking into how much of this code can be generated generically for fields of arbitrary size, using either macros or function calls to wrapped assembly, with target-dependent compilation.
The current pairing timings are about 2-2.5ms for the BLS12 curves on a 3+GHz machine. I believe this can be improved closer to the theoretical limit of roughly 1.1-1.3ms. As a target, I would set 1.5ms.
Could the struct for an evaluation domain contain a field element for the affine shift, which would then enable its use as any coset? This not existing required me to re-implement a subset of this logic. (The current API only supports cosets with shift g, and only for a subset of methods.)
My suggested API:
affine_offset^|H|
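One possible shape for such a coset domain, sketched over a toy field F_17 with hypothetical names: the offset c is stored alongside the cached c^|H|, since the vanishing polynomial of the coset cH is x^|H| − c^|H|.

```rust
// Toy field for illustration.
const P: u64 = 17;

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// A multiplicative subgroup H shifted by an arbitrary affine offset c.
struct CosetDomain {
    size: u64,            // |H|
    group_gen: u64,       // generator of H
    offset: u64,          // affine shift c; c = 1 recovers plain H
    offset_pow_size: u64, // cached c^|H| for vanishing-poly evaluation
}

impl CosetDomain {
    fn new(size: u64, group_gen: u64, offset: u64) -> Self {
        CosetDomain { size, group_gen, offset, offset_pow_size: pow_mod(offset, size) }
    }

    /// Z_{cH}(x) = x^|H| - c^|H|.
    fn eval_vanishing_poly(&self, x: u64) -> u64 {
        (pow_mod(x, self.size) + P - self.offset_pow_size) % P
    }

    /// The coset elements c * g^i.
    fn elements(&self) -> Vec<u64> {
        (0..self.size).map(|i| self.offset * pow_mod(self.group_gen, i) % P).collect()
    }
}
```

Setting offset = 1 recovers the plain subgroup, so the existing methods could delegate to this without special-casing.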
We currently use Extended Twisted Edwards Coordinates for computation on Edwards curves. That paper provides a unified addition formula, and a tailored doubling formula that is faster. Currently the doubling implementation uses the unified addition formula (in a way that isn't constant time).
If speed on these curves matters (e.g. for signatures), it's probably worth experimenting with the tailored doubling formula to see whether the computation time is reduced in comprehensive benchmarks. This is not immediate, due to instruction caching effects.
Further, it is probably the case that using this custom doubling formula for scalar multiplication (as suggested in sections 4.3 and 4.4 of the paper) will yield higher speeds, as there we have much less to worry about regarding code being out of cache.
Why? A 319-bit prime modulus reduces G1 cost by approx 1.44x vis-à-vis BLS12-381/377. Constructing curves via the Cocks–Pinch method (the only known method to generate pairing-friendly curves of prescribed order) doubles the size of the base field, so we want it to be small to begin with.
Since we assume the leaves of an aggregating proof system have lower proving power, we believe this is an adequate strategy for 1 layer or even 2 layer proof composition.
Future work
Future work includes finding suitable Cocks–Pinch curves for proof composition with BLS24-319. Hopefully, we can find good curves with log_2 p <= 639 and, for a 2nd layer, log_2 p <= 1279, with highly efficient pairings. Note that the size of the pairing as an arithmetic circuit is about 2x that of BLS12, so the outermost curve has to be at least 2x more efficient (in terms of machine ops/gas costs etc.).
The BW6-761 curve had a pairing time of about 10ms (albeit for a relatively unoptimised implementation), compared with ~2.5 ms for a curve over a prime with half the number of bits, which is a 4x difference. That being said, a slightly over par 1.5x to 2x improvement (on the basis of a smaller base field) seems within reach.
Nonetheless, as one approaches larger field sizes, the costs saved from using a smaller base field may not make up for larger circuits if optimised algorithms like Karatsuba multiplication is used. So one would be trading proof capacity for G1 e.g. proving performance.
BLS48-286
Further, the BLS48-286 curve should also be considered. According to my estimates, it should be 1.5-2x as slow for verification compared to BLS12, but proving could be much faster (9 words on a 32-bit machine). The real benefits can be seen in the outer-layer curve, which could have a modulus < 572 bits.
Note that this has a much larger arithmetic circuit compared with BLS12 - approximately 2.5-5x. So to reap the benefits, the cost of verifying the outer curve has to be < 2.5-5x that of the BW6-761 curve. Based on the discrepancy in base field size, a speedup of 1.5-2x is more likely.
Overall, due to the possibly high complexity of the pairing for BLS48, BLS24 seems like a more promising route for now.
Things to investigate
The elliptic curve point APIs (AffineCurve, ProjectiveCurve) currently claim that the interface provides quite strong guarantees: all points should be on the curve, and moreover in the prime-order subgroup.
However this guarantee isn't achieved at the moment.
I've only looked at short Weierstrass Jacobian curves so far (which encompass the BLS12 curves), and they don't guarantee that points are on the curve, or that they are in the prime-order subgroup for AffineCurve.
The new() method for AffineGroup (which implements AffineCurve) does not ensure that points are on the curve. Also, the SW curve's AffineCurve implementation of from_random_bytes() uses get_point_from_x(), which, as stated in its comments, provides no guarantees about membership in the prime-order subgroup.
I believe that the trait comments should drop the description that they are in the prime order subgroup, and we should have new traits that enforce that this is the case. (Or alternatively, make the two above methods enforce the claimed properties)
Currently we have a slew of different macros; we should systematize and unify these where it makes sense.
Since debugging with macros can be hard, and Zexe uses a significant amount of macros, we should have a script that generates an entire new directory, e.g. zexe_expand, in which all of the macros have been expanded. We can use cargo-expand to achieve this, expanding each file and writing the result to a file. Then this library can be compiled and run, and will be easier to debug.
We can have the expand script located in the crate root, so just run . expand_all.sh.
Very unreadable code results from two traits having an associated type with the same name. One should prefix with the trait name, e.g. AffineScalarField, AffineBaseField.
Currently polynomial evaluation requires linear memory: https://github.com/scipr-lab/zexe/blob/6bfe574f7adea14b97ff554bbb594988635b1908/ff-fft/src/polynomial/dense.rs#L97
Instead, it should only require a constant amount of memory using Horner's method. An example of this is here: https://github.com/scipr-lab/libiop/blob/master/libiop/algebra/polynomials/polynomial.tcc#L99
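A sketch of the Horner version over a toy u64 field (the modulus and function name are illustrative): the accumulator is the only state, so extra memory is O(1) regardless of degree.

```rust
// Toy prime modulus for illustration.
const P: u64 = 1_000_000_007;

/// Evaluate sum_i coeffs[i] * x^i via Horner's rule:
/// fold from the top coefficient down, keeping a single accumulator.
fn horner_eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0u64, |acc, &c| {
        ((acc as u128 * x as u128 + c as u128) % P as u128) as u64
    })
}
```

Each step is one multiply-add, so the operation count matches the current linear-memory version while the powers-of-x buffer disappears.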
https://gitlab.com/elrnv/unroll/-/issues/4
Temporary fix in arkworks-rs/snark#219
Due to improvements in special (towered) number field sieves, a recent paper by A. Guillevic shows that many pairing-friendly curves are below their targeted level of security. For instance, the paper states that the base field size of BLS12 curves needs to be increased to 446 bits (instead of 377 as for BLS12-377, or 381 as for Zcash's BLS12-381) to provide 128-bit security.
Furthermore, by personal communication with the author of the above paper, I have also received similar estimates on MNT curves, which I uploaded to our repository; see HorizenOfficial/ginger-lib#47. The results are as for the BLS12 curves.
Guillevic's note also includes a link to a parameter file of new MNT4/6 cycles (with base field sizes of 996 bits and beyond) which serve quite acceptable domain sizes for mixed-radix FFTs.
As we at ginger do not aim at implementing Guillevic's cycle (Coda's is already very slow in performance), I can help by providing parameters (such as cofactors and non-residues for extension fields) if this is of interest for Zexe.
The GLV patents end 25 Sept., in 2 weeks and 2 days. We should collate resources/scripts for finding lambda and omega for all the SW curves Zexe uses, so that the current GLV script can produce the precomputes necessary for GLV scalar mul. This is not constant time either, but combined with wNAF it produces a speedup of over 2x for BW6 projective; batch affine alone only adds about 1.3-1.4x on top of this.
GLV precomputation script should also be generic and work for all curves based on features. We can use macros to simplify.
It should be fairly straightforward (with some care to details) to use what we have in the big integer library to define signed bigintegers.
The basic design would be to have the top-level limb be an i64 and the rest of the limbs be u64s.
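A minimal two-limb sketch of that layout (the type and method names are illustrative): the lower limbs are plain two's-complement words, and the signed top limb carries the sign of the whole integer and absorbs carries.

```rust
/// Two-limb signed big integer, little-endian: low limb unsigned,
/// top limb signed (it determines the sign of the whole value).
#[derive(Debug, Clone, Copy, PartialEq)]
struct SignedBigInt2 {
    lo: u64,
    hi: i64,
}

impl SignedBigInt2 {
    fn from_i128(v: i128) -> Self {
        SignedBigInt2 { lo: v as u64, hi: (v >> 64) as i64 }
    }

    fn to_i128(self) -> i128 {
        ((self.hi as i128) << 64) | self.lo as i128
    }

    /// Limb-wise two's-complement addition; the signed top limb
    /// absorbs the carry out of the low limb.
    fn add(self, other: Self) -> Self {
        let (lo, carry) = self.lo.overflowing_add(other.lo);
        let hi = self.hi.wrapping_add(other.hi).wrapping_add(carry as i64);
        SignedBigInt2 { lo, hi }
    }
}
```

Because the low limbs are ordinary wrapping words, the existing unsigned carry-chain code generalises directly; only the topmost limb needs signed handling.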