Constantine

License: Apache-2.0 or MIT · Stability: experimental · CI: GitHub Actions

Constantine: High performance cryptography for proof systems and blockchain protocols

“A cryptographic system should be secure even if everything about the system, except the key, is public knowledge.”
— Auguste Kerckhoffs

This library provides constant-time implementations of cryptographic primitives, with a particular focus on cryptography used in blockchains and zero-knowledge proof systems.

The library aims to be a fast, compact and hardened library for elliptic curve cryptography, in particular for blockchain protocols and zero-knowledge proof systems.

The library focuses on following properties:

  • constant-time (not leaking secret data via side-channels)
  • performance
  • generated code size, datatype size and stack usage

in this order.

Public API: Curves & Protocols

Protocols are a set of routines, designed for specific goals or a combination thereof:

  • confidentiality: only the intended receiver of a message can read it
  • authentication: the other party in the communication is the expected party
  • integrity: the received message has not been tampered with
  • non-repudiation: the sender of a message cannot deny having sent it

Legend

  • ✅: Full support
  • 🏗️: Partial support:
    • in C, some APIs not provided.
    • in Rust, only low-level constantine-sys API available but no high-level wrapper.
  • 🙈: Missing support

Protocols

Constantine supports the following protocols in its public API.

Nim C Rust Go
Ethereum BLS signatures 🏗️ 🏗️ 🙈
Ethereum KZG commitments for EIP-4844
Ethereum Virtual Machine Precompiles (ECADD, ECMUL, ECPAIRING, MODEXP) 🙈 🙈 🙈
Zk Accel layer for Halo2 proof system (experimental) not applicable not applicable not applicable

Elliptic Curves

Constantine supports the following curves in its public API.

Nim C Rust Go
BN254-Snarks 🙈
BLS12-381 🙈
Pasta curves (Pallas & Vesta) 🙈

For all elliptic curves, the following arithmetic is supported

  • field arithmetic
    • on Fr (i.e. modulo the 255-bit curve order)
    • on Fp (i.e. modulo the 381-bit prime modulus)
  • elliptic curve arithmetic:
    • on elliptic curve over Fp (EC G1) with affine, jacobian and homogenous projective coordinates
    • on elliptic curve over Fp2 (EC G2) with affine, jacobian and homogenous projective coordinates
    • including scalar multiplication, multi-scalar-multiplication (MSM) and parallel MSM

All operations are constant-time unless explicitly mentioned vartime.

For pairing-friendly curves Fp2 arithmetic is also exposed.
🏗️ Pairings and multi-pairings are implemented but not exposed yet.

General cryptography

Constantine supports the following hash functions and CSPRNGs in its public API.

Nim C Rust Go
SHA256 🏗️
Cryptographically-secure RNG from Operating System (sysrand)

Threadpool

Constantine also exposes a high-performance threadpool for Nim that inherits performance and API from:

  • Task parallelism API RFC: nim-lang/RFCs#347

    • Weave data parallelism API:
      • spawn and sync
      • parallelFor and syncScope
        • parallelFor supports arbitrarily complex reduction. Constantine uses it extensively for parallel elliptic curve sum reductions.
      • isSpawned and isReady
  • CPU Topology - Query the number of threads available at the OS/VM-level to run computations:

    • ctt_cpu_get_num_threads_os in C
    • getNumThreadsOS in Nim
    • constantine_core::hardware::get_num_threads_os in Rust
  • https://github.com/mratsim/weave

  • https://github.com/status-im/nim-taskpools

The threadpool supports nested parallelism to exploit high core counts and does not suffer from OpenMP's limitations on nested parallel loops. For batched KZG verification, Constantine issues 3 multi-scalar multiplications in parallel, each using 3 nested parallel loops.

See the following documents on the threadpool performance details, design and research:

Installation

❗ Constantine can be compiled with Nim v1.6.x or v2.0.2, but not Nim v2.0.0, due to a compile-time integer regression

From Rust

  1. Install clang compiler, for example:

    • Debian/Ubuntu sudo apt update && sudo apt install build-essential clang
    • Archlinux pacman -S base-devel clang
    📝 We require Clang as it is significantly more performant than GCC for cryptographic code, especially on ARM where Constantine has no assembly optimizations. Also, Rust and Clang both rely on LLVM.
    This can be changed to any C compiler by deleting this line.
  2. Install Nim. It is available in most Linux distros' package managers and via Homebrew on macOS; Windows binaries are on the official website: https://nim-lang.org/install_unix.html

    • Debian/Ubuntu sudo apt install nim
    • Archlinux pacman -S nim
  3. Test both:

    • the experimental ZK Accel API (ZAL) for Halo2-KZG
    • Ethereum EIP4844 KZG polynomial commitments
    git clone https://github.com/mratsim/constantine
    cd constantine
    cargo test
    cargo bench
    
  4. Add Constantine as a dependency in Cargo.toml

    • for Halo2-KZG Zk Accel Layer
      [dependencies]
      constantine-halo2-zal = { git = 'https://github.com/mratsim/constantine' }
    • for Ethereum EIP-4844 KZG polynomial commitments
      [dependencies]
      constantine-ethereum-kzg = { git = 'https://github.com/mratsim/constantine' }

Optionally, cross-language LTO between Nim and Rust can be used, see https://doc.rust-lang.org/rustc/linker-plugin-lto.html:

Add a .cargo/config.toml to your project with the following:

# .cargo/config.toml

[build]
rustflags="-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld"

and modify Constantine's build.rs to pass CTT_LTO=1

    Command::new("nimble")
        .env("CC", "clang")
        .env("CTT_LTO", "1") // <--
        .arg("make_lib_rust")
        .current_dir(root_dir)
        .stdout(Stdio::inherit())
        .stderr(Stdio::inherit())
        .status()
        .expect("failed to execute process");

From Go

  1. Install any C compiler, clang is recommended, for example:

    • Debian/Ubuntu sudo apt update && sudo apt install build-essential clang
    • Archlinux pacman -S base-devel clang
  2. Install Nim. It is available in most Linux distros' package managers and via Homebrew on macOS; Windows binaries are on the official website: https://nim-lang.org/install_unix.html

    • Debian/Ubuntu sudo apt install nim
    • Archlinux pacman -S nim
  3. Compile Constantine as a static (and shared) library in ./include

    cd constantine
    CC=clang nimble make_lib
    
  4. Test the go API.

    cd constantine-go
    go test -modfile=../go_test.mod
    
    📝 Constantine uses a separate modfile for tests.
    It has no dependencies (key to avoid supply chain attacks) except for testing.

From C

  1. Install a C compiler, clang is recommended, for example:

    • Debian/Ubuntu sudo apt update && sudo apt install build-essential clang
    • Archlinux pacman -S base-devel clang
  2. Install Nim. It is available in most Linux distros' package managers and via Homebrew on macOS; Windows binaries are on the official website: https://nim-lang.org/install_unix.html

    • Debian/Ubuntu sudo apt install nim
    • Archlinux pacman -S nim
  3. Compile the dynamic and static library.

    • Recommended:
      CC=clang nimble make_lib
    • or CTT_ASM=0 nimble make_lib
      to compile without assembly (otherwise it autodetects support)
    • or with default compiler
      nimble make_lib
  4. Ensure the libraries work

    • nimble test_lib
  5. Libraries location

  6. Read the examples in examples-c:

From Nim

You can install the development version of the library through nimble with the following command:

nimble install https://github.com/mratsim/constantine@#master

Dependencies & Requirements

For speed it is recommended to use Clang (see Compiler-caveats). In particular GCC generates inefficient add-with-carry code.

Constantine requires at least:

  • GCC 7
    Previous versions generated incorrect add-with-carry code.
  • Clang 14
    On x86-64, inline assembly is used to work around compilers having issues optimizing large integer arithmetic, and also to ensure constant-time code.
    Constantine uses the Intel assembly syntax to address issues with constant propagation in Clang under the default AT&T syntax.
    Clang 14 added support for -masm=intel.

    On macOS, Apple Clang does not support Intel assembly syntax; use Homebrew Clang instead or compile without assembly.
    Note that Apple is discontinuing Intel CPUs throughout their product line, so this will impact only older models and the Mac Pro.

On Windows, Constantine is tested with MinGW. The Microsoft Visual C++ Compiler is not configured.

Constantine has no C, Nim, Rust, or Go dependencies besides compilers, not even on the Nim standard library, except:

  • for testing and benchmarking
    • the tested language's JSON and YAML parsers for test vectors
    • the tested language's standard library for tests, timing and message formatting
    • GMP for testing against GMP
  • for Nvidia GPU backend:
    • the LLVM runtime ("dev" version with headers is not needed)
    • the CUDA runtime ("dev" version with headers is not needed)
  • at compile-time
    • we need the std/macros library to generate Nim code.

Performance

This section got way too long and has its own file.
See ./README-PERFORMANCE.md

Assembly & Hardware acceleration

  • Assembly is used on x86 and x86-64, unless CTT_ASM=0 is passed.
  • Assembly is planned for ARM.
  • GPU acceleration is planned.

Assembly addresses both security (constant-time guarantees) and performance.

Security

Hardening an implementation against all existing and upcoming attack vectors is an extremely complex task. The library is provided as-is, without any guarantees, at least until:

  • it gets audited
  • formal proofs of correctness are produced
  • formal verification of constant-time implementation is possible

Defenses against common attack vectors are provided on a best-effort basis. Do note that Constantine has no external package dependencies, hence it is not vulnerable to supply chain attacks (unless they affect a compiler or the OS).

Attackers may go to great lengths to retrieve secret data including:

  • Timing the time taken to multiply on an elliptic curve
  • Analysing the power usage of embedded devices
  • Detecting cache misses when using lookup tables
  • Memory attacks like page-faults, allocators, memory retention attacks

This list would be incomplete without mentioning that the hardware, OS and compiler actively hinder you by:

  • Hardware: sometimes not implementing multiplication in constant-time.
  • OS: not providing a way to prevent memory paging to disk, core dumps, a debugger attaching to your process or a context switch (coroutines) leaking register data.
  • Compiler: optimizing away your carefully crafted branchless code and leaking server secrets or optimizing away your secure erasure routine which is deemed "useless" because at the end of the function the data is not used anymore.

A growing number of attack vectors is being collected for your viewing pleasure at https://github.com/mratsim/constantine/wiki/Constant-time-arithmetics

Disclaimer

Constantine's authors do their utmost to implement a secure cryptographic library, in particular against remote attack vectors like timing attacks.

Please note that Constantine is provided as-is without guarantees. Use at your own risk.

Thorough evaluation of your threat model, the security of any cryptographic library you are considering, and the secrets you put in jeopardy is strongly advised before putting data at risk. The author would like to remind users that the best code can only mitigate but not protect against human failures which are the weakest link and largest backdoors to secrets exploited today.

Security disclosure

You can privately report a security vulnerability through the Security tab.

Security > Report a vulnerability

Why Nim

The Nim language offers the following benefits for cryptography:

  • Compilation to machine code via C or C++ or alternatively compilation to Javascript. Easy FFI to those languages.
    • Obscure embedded devices with proprietary C compilers can be targeted.
    • WASM can be targeted.
  • Performance reachable in C is reachable in Nim, easily.
  • Rich type system: generics, dependent types, mutability-tracking and side-effect analysis, borrow-checking, compiler enforced distinct types (Miles != Meters, SecretBool != bool and SecretWord != uint64).
  • Compile-time evaluation, including parsing hex string, converting them to BigInt or Finite Field elements and doing bigint operations.
  • Assembly support either inline or a simple {.compile: "myasm.S".} away
  • No GC if no GC-ed types are used (automatic memory management is set at the type level and optimized for latency/soft-realtime by default and can be totally deactivated).
  • Procedural macros working directly on AST to
    • create generic curve configuration,
    • derive constants
    • write a size-independent inline assembly code generator
  • Upcoming proof system for formal verification via Z3 (DrNim, Correct-by-Construction RFC)

License

Licensed and distributed under either of

  • Apache License, Version 2.0
  • MIT License

at your option. This file may not be copied, modified, or distributed except according to those terms.

This library has no external dependencies. In particular GMP is used only for testing and differential fuzzing and is not linked in the library.

Contributors

advaita-saha, agnxsh, arnetheduck, benbierens, cskiraly, lynxcs, markspanbroek, mratsim, petarvujovic98, swader


constantine's Issues

Randomized test failure - ARM - G1 Infinity


Seed 1592597496, 32-bit mode. Note the error happens in the BN254 code path, which runs after the BLS code path.

run_EC_mul_sanity_tests(
  ec = ECP_SWei_Proj[Fp[BLS12_381]],
  ItersMul = ItersMul,
  moduleName = "test_ec_weierstrass_projective_g1_mul_sanity_" & $BLS12_381
)

test "EC mul [Order]P == Inf":
  var rng: RngState
  let seed = uint32(getTime().toUnix() and (1'i64 shl 32 - 1)) # unixTime mod 2^32
  rng.seed(seed)
  echo "test_ec_weierstrass_projective_g1_mul_sanity_extra_curve_order_mul_sanity xoshiro512** seed: ", seed

  proc test(EC: typedesc, bits: static int, randZ: static bool) =
    for _ in 0 ..< ItersMul:
      when randZ:
        let a = rng.random_unsafe_with_randZ(EC)
      else:
        let a = rng.random_unsafe(EC)

      let exponent = EC.F.C.getCurveOrder()
      var exponentCanonical{.noInit.}: array[(bits+7) div 8, byte]
      exponentCanonical.exportRawUint(exponent, bigEndian)

      var
        impl = a
        reference = a
        scratchSpace{.noInit.}: array[1 shl 4, EC]

      impl.scalarMulGeneric(exponentCanonical, scratchSpace)
      reference.unsafe_ECmul_double_add(exponentCanonical)

      check:
        bool(impl.isInf())
        bool(reference.isInf())

  test(ECP_SWei_Proj[Fp[BN254_Snarks]], bits = BN254_Snarks.getCurveOrderBitwidth(), randZ = false)
  test(ECP_SWei_Proj[Fp[BN254_Snarks]], bits = BN254_Snarks.getCurveOrderBitwidth(), randZ = true)
  # TODO: BLS12 is using a subgroup of order "r" such that r*h = CurveOrder,
  # with h the curve cofactor,
  # instead of the full group
  # test(Fp[BLS12_381], bits = BLS12_381.getCurveOrderBitwidth(), randZ = false)
  # test(Fp[BLS12_381], bits = BLS12_381.getCurveOrderBitwidth(), randZ = true)

Randomized test failure negate in sqrt Mersenne 127

32-bit, commit 89c78ef, seed 1592668178, https://github.com/mratsim/constantine/runs/790997366#step:13:856


proc randomSqrtCheck_p3mod4(C: static Curve) =
  template testImpl(a: untyped): untyped {.dirty.} =
    var na{.noInit.}: Fp[C]
    na.neg(a)

    var a2 = a
    var na2 = na
    a2.square()
    na2.square()
    check:
      bool a2 == na2
      bool a2.isSquare()

    var r, s = a2
    r.sqrt()
    let ok = s.sqrt_if_square()
    check:
      bool ok
      bool(r == s)
      bool(r == a or r == na)

Implement lazy carries and reductions

Context

Currently, after each addition or subtraction step, a reduction is done if the result is over the field modulus.

Due to constant-time constraints there is no shortcut when the reduction is unnecessary: the memory accesses are always done.

Instead, at the cost of a couple bits, we could use lazy carries/reductions.

Instead of using 31 bits of each 32-bit word or 63 bits of each 64-bit word, we use fewer bits in each word. For example, assuming we use 26-bit and 52-bit limbs in 32-bit and 64-bit words respectively, we only need to reduce every 6 or 12 additions respectively.
This is desirable in particular for addition chains

Choosing a default: Maximizing the usage of the word overhead

256-bit curves are quite common:

  • secp256k1 for blockchain ECDSA / transaction signing
  • P256/secp256r1 or Curve25519 for Diffie-Hellman

Or, closely, the 254-bit Barreto-Naehrig curve for zkSNARKs.

Representing a 256 or 254 bits curve in the most compact way on 64-bit arch requires 4 words. Assuming

  • we want at most a 1 word overhead per 256-bit
  • we want to maximize the laziness of reduction

52-bit logical words and 12 lazy bits or 51-bit logical words and 13 lazy bits would be the best logical word bitsizes. They both can represent 2^255 integers in 5 words (but a radix 2^52 representation can also represent the numbers 2^255 and 2^256 in 5 words)

Side-note on SIMD

This may also enable opportunities for SIMD vectorization using either integer or floating-point math.

  • 2^24+1 is the first integer that cannot be represented in float32 though leaving 8-bit on the table might arguably be too much.
  • 2^53+1 is the first integer that cannot be represented in float64
  • The new AVX512_IFMA instructions (Integer Fused-Multiply-Add) supports multiply-add of 52-bit integers: VPMADD52LUQ and VPMADD52HUQ

Using floating point for pairings is covered in this paper:

Side-note on "final subtraction"-less Montgomery Multiplication/Exponentiation

With well-chosen word sizes that allow redundant representations, we can avoid the final subtraction in Montgomery multiplication and exponentiation:

Implementation strategy

The implementation steps would be:

  1. Change the WordBitSize at
    type Word* = Ct[uint32]
      ## Logical BigInt word
      ## A logical BigInt word is of size physical MachineWord-1
    type DoubleWord* = Ct[uint64]

    type BaseType* = uint32
      ## Physical BigInt for conversion in "normal integers"

    const
      WordPhysBitSize* = sizeof(Word) * 8
      WordBitSize* = WordPhysBitSize - 1
    This can be made a {.intdefine.} for compile-time configuration
  2. Ensure the encoding/decoding routines in io_bigints.nim properly deal with more than 1 unused bit
  3. Field elements should now have a countdown that tracks how many potential carries are left given the free bits. If it reaches 0, they should call a normalization/reduction proc.
  4. To be researched: add and sub may have to return a carry Word instead of a CtBool, as the carry is not 0 or 1 anymore (but we never add the carry; it is used as an input for an optional reduction, so it might be unneeded)

References

Implement an ASM code generator / compiler

Unfortunately for both performance and security reasons, it is important for generic cryptographic libraries to implement a code generator.

The most widely used code generators are:

Instead of complicating the build system, we can directly implement the code-generator using Nim metaprogramming features.

This ensures that unused assembly is not compiled in.
This significantly simplifies the build system.

A proof of concept code generator for multiprecision addition is available here

macro addCarryGen_u64(a, b: untyped, bits: static int): untyped =
  var asmStmt = (block:
    "      movq %[b], %[tmp]\n" &
    "      addq %[tmp], %[a]\n"
  )

  let maxByteOffset = bits div 8
  const wsize = sizeof(uint64)

  when defined(gcc):
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        "      movq " & $byteOffset & "+%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        "      adcq %[tmp], " & $byteOffset & "+%[a]\n"
      )
  elif defined(clang):
    # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        "      movq " & $byteOffset & "%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        "      adcq %[tmp], " & $byteOffset & "%[a]\n"
      )

  let tmp = ident("tmp")
  asmStmt.add (block:
    ": [tmp] \"+r\" (`" & $tmp & "`), [a] \"+m\" (`" & $a & "->limbs[0]`)\n" &
    ": [b] \"m\"(`" & $b & "->limbs[0]`)\n" &
    ": \"cc\""
  )

  result = newStmtList()
  result.add quote do:
    var `tmp`{.noinit.}: uint64

  result.add nnkAsmStmt.newTree(
    newEmptyNode(),
    newLit asmStmt
  )

Security

General-purpose compilers can and do rewrite code as long as any observable effect is maintained. Unfortunately, timing is not considered an observable effect, and as compilers and processor branch prediction get smarter, compilers recognize and rewrite increasingly more initially-branchless code into code with branches, potentially exposing secret data.

A typical example is conditional mov which is required to be constant-time any time secrets are involved (https://tools.ietf.org/html/draft-irtf-cfrg-hash-to-curve-08#section-4)
The paper What you get is what you C: Controlling side effects in mainstream C compilers (https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf) exposes how compiler "improvements" are detrimental to cryptography

Another example is secure erasing secret data, which is often elided as an optimization.

Those are not theoretical exploits, as explained in the When constant-time doesn't save you article (https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/), which describes an attack against Curve25519, a curve designed to be easily implemented in a constant-time manner.
This attack is due to an "optimization" in the MSVC compiler:

every code compiled in 32-bit with MSVC on 64-bit architectures will call llmul every time a 64-bit multiplication is executed.

Verification of Assembly

The assembly code generated needs special tooling for formal verification that is different from the C code in #6.
Recently Microsoft Research introduced Vale:

  • Vale: Verifying High-Performance Cryptographic Assembly Code
    Barry Bond and Chris Hawblitzel, Microsoft Research; Manos Kapritsos, University of Michigan; K. Rustan M. Leino and Jacob R. Lorch, Microsoft Research; Bryan Parno, Carnegie Mellon University; Ashay Rane, The University of Texas at Austin; Srinath Setty, Microsoft Research; Laure Thompson, Cornell University
    https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-bond.pdf
    https://github.com/project-everest/vale
    Vale can be used to verify assembly crypto code against the architecture and also detect timing attacks.

Performance

Beyond security, compilers do not expose several primitives that are necessary for multiprecision arithmetic.

Add with carry, sub with borrow

The most egregious example is add-with-carry, the lack of which led the GMP team to implement everything in assembly, even though this is a most basic need and almost all processors have an ADC instruction; some, like the 6502 from 30 years ago, only have ADC and no ADD.
See:


Some specific platforms might expose add with carry, for example x86 but even then the code generation might be extremely poor: https://gcc.godbolt.org/z/2h768y

#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

(Reported fixed but it is not? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67317)

And there is no way to use ADC on ARM architectures with GCC.
Clang does offer __builtin_addcll, which may or may not generate good code now, as fixing add-with-carry for x86 took years. Alternatively, Clang recently introduced arbitrary-width integers, called ExtInt (http://blog.llvm.org/2020/04/the-new-clang-extint-feature-provides.html); it is unknown, however, whether the generated code is guaranteed to be constant-time.

See also: https://stackoverflow.com/questions/29029572/multi-word-addition-using-the-carry-flag/29212615

Conditional move

Unlike add-with-carry, which can be expressed but may lead to inefficient code, conditional move basically requires assembly (see the security section) as there is no builtin at all.
A constant-time conditional move based on xor-masking requires 4-5 instructions instead of just 1.

MULX

On x86-64, the multiplication instruction is bottlenecked by the fact that the result is always in the RAX and RDX registers, which means that multiplications cannot be interleaved to exploit instruction-level parallelism.

ADCX/ADOX

On x86-64, the add-with-carry instruction is bottlenecked by the fact that there is a single carry flag, which means that additions with carry cannot be interleaved to exploit instruction-level parallelism.

Furthermore, compilers like GCC and Clang were not designed to track carry chains (GCC) or can only track a single chain (Clang).

ADCX and ADOX enable two independent carry chains, using the Carry flag and the Overflow flag respectively.

https://gcc.gnu.org/legacy-ml/gcc-help/2017-08/msg00100.html

The compiler is not able to distingusih between OF and CF chains,
since both are represented as a different mode of a single flags
register. This is the limitation of the compiler.

https://bugs.llvm.org/show_bug.cgi?id=34249#c6

Resolving - the crash was fixed by rL310784 and we avoid ADOX/ADCX code generation now.

In MCL and Goff, combining MULX/ADCX and ADOX improves speed by about 20% on field multiplication.

See Intel papers:

Non-Adjacent-Form-accelerated Scalar Multiplication

Currently scalar multiplication uses the raw canonical big-endian representation of the scalar (a.k.a. octet string).
Depending on whether a bit is 0 or 1, after the mandatory doubling we either do nothing or add the original point.
On average, for scalars indistinguishable from a random oracle (which is the case for all private keys), we do an addition half the time.

Each integer has a unique Non-Adjacent Form, which uses the ternary digit set {-1, 0, 1} (https://en.wikipedia.org/wiki/Non-adjacent_form), with the interesting property that non-zero digits cannot be adjacent.

This form can be used to double and add-or-subtract, with additions/subtractions needed for only about 1/3 of the digits.

The most interesting part for a constant-time implementation is that it allows using bigger window sizes for the same storage requirement as the binary representation since we have redundancy:

  • "-1 0 1 0" and "0 0 0 0" and "1 0 -1 0" ... can lookup in the same table slot

A benchmark screenshot (not reproduced here) shows current speed and the effect of window sizes.

Faster conversion to Montgomery domain

Context

The current implementation requires several passes over the data (muxing) and costly divisions.
Instead, we can recognize that conversion to the Montgomery residue representation is multiplication by R, a Montgomery factor chosen to be convenient, i.e. a power of two such as 2^63 per word.
Fast Montgomery multiplication computes, in natural terms:

MM(a, b) -> a b R^-1 (mod p)

because in natural representation

MM'(a', b'): aR * bR -> aR bR R^-1 (mod p) -> abR (mod p)

which is the Montgomery representation of the product

Thus Montgomery conversion is equivalent to
MM(a, R²)
and R² (mod p) can be precomputed at compile time, which allows us to avoid the division.

Current implementation

montgomeryResidue

func montgomeryResidue*(a: BigIntViewMut, N: BigIntViewConst) =
  ## Transform a bigint ``a`` from its natural representation (mod N)
  ## to the Montgomery n-residue representation
  ## i.e. does "a * (2^LimbSize)^W (mod N)", where W is the number
  ## of words needed to represent N in base 2^LimbSize
  ##
  ## `a`: The source BigInt in the natural representation. `a` in [0, N) range
  ## `N`: The field modulus. N must be odd.
  ##
  ## Important: `a` is overwritten
  # Reference: https://eprint.iacr.org/2017/1057.pdf
  checkValidModulus(N)
  checkOddModulus(N)
  checkMatchingBitlengths(a, N)

  let nLen = N.numLimbs()
  for i in countdown(nLen, 1):
    a.shlAddMod(Zero, N)

shlAddMod

func shlAddMod(a: BigIntViewMut, c: Word, M: BigIntViewConst) =
  ## Fused modular left-shift + add
  ## Shift input `a` by a word and add `c` modulo `M`
  ##
  ## With a word W = 2^WordBitSize and a modulus M
  ## Does a <- a * W + c (mod M)
  ##
  ## The most-significant bit of the modulus `M` MUST be set.
  checkValidModulus(M)

  let aLen = a.numLimbs()
  let mBits = bitSizeof(M)

  if mBits <= WordBitSize:
    # M fits in a single limb
    var q: Word
    # (hi, lo) = a * 2^63 + c
    let hi = a[0] shr 1                  # 64 - 63 = 1
    let lo = (a[0] shl WordBitSize) or c # Assumes most-significant bit in c is not set
    unsafeDiv2n1n(q, a[0], hi, lo, M[0]) # (hi, lo) mod M
    return
  else:
    ## Multiple limbs
    let hi = a[^1]                # Save the high word to detect carries
    let R = mBits and WordBitSize # R = mBits mod 64

    var a0, a1, m0: Word
    if R == 0:                    # If mBits is a multiple of 64
      a0 = a[^1]
      moveMem(a[1].addr, a[0].addr, (aLen-1) * Word.sizeof) # we can just shift words
      a[0] = c                    # and replace the first one by c
      a1 = a[^1]
      m0 = M[^1]
    else:                         # Else: deal with partial word shifts at the edge
      a0 = ((a[^1] shl (WordBitSize-R)) or (a[^2] shr R)) and MaxWord
      moveMem(a[1].addr, a[0].addr, (aLen-1) * Word.sizeof)
      a[0] = c
      a1 = ((a[^1] shl (WordBitSize-R)) or (a[^2] shr R)) and MaxWord
      m0 = ((M[^1] shl (WordBitSize-R)) or (M[^2] shr R)) and MaxWord

    # m0 has its high bit set. (a0, a1)/m0 fits in a limb.
    # Get a quotient q, at most we will be 2 iterations off
    # from the true quotient
    let
      a_hi = a0 shr 1                   # 64 - 63 = 1
      a_lo = (a0 shl WordBitSize) or a1
    var q, r: Word
    unsafeDiv2n1n(q, r, a_hi, a_lo, m0) # Estimate the quotient
    q = mux(                            # If n_hi == divisor
          a0 == m0, MaxWord,            #   quotient = MaxWord (0b0111...1111)
          mux(
            q.isZero, Zero,             # elif q == 0, true quotient = 0
            q - One                     # else return q-1: instead of being off by 0, 1 or 2
          )                             #   we are off by -1, 0 or 1
        )

    # Now subtract a*2^63 - q*p
    var carry = Zero
    var over_p = CtTrue # Track if the quotient is greater than the modulus

    for i in 0 ..< M.numLimbs():
      var qp_lo: Word
      block: # q*p
        # q * p + carry (doubleword) carry from previous limb
        let qp = unsafeExtPrecMul(q, M[i]) + carry.DoubleWord
        carry = Word(qp shr WordBitSize) # New carry: high digit besides LSB
        qp_lo = qp.Word and MaxWord      # Normalize to u63
      block: # a*2^63 - q*p
        a[i] -= qp_lo
        carry += Word(a[i].isMsbSet)     # Adjust if borrow
        a[i] = a[i] and MaxWord          # Normalize to u63
      over_p = mux(
                 a[i] == M[i], over_p,
                 a[i] > M[i]
               )

    # Fix the quotient, the true quotient is either q-1, q or q+1
    #
    # if carry < hi or (carry == hi and over_p) we must do "a -= p"
    # if carry > hi (negative result) we must do "a += p"
    let neg = carry > hi
    let tooBig = not neg and (over_p or (carry < hi))

    discard a.add(M, ctl = neg)
    discard a.sub(M, ctl = tooBig)
    return
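The quotient estimation above is the classic Knuth division step: divide the top two words of the dividend by the normalized top word of the modulus, then return q-1 so the error falls in {-1, 0, +1} and a single conditional add/sub fixes it. A hedged Python sketch of just that estimate (big ints instead of limb views, illustrative names):

```python
import random

W = 63  # word bit-size, as in the library's 63-bit digits

def quotient_estimate(A: int, M: int, n_words: int) -> int:
    """Knuth-style quotient estimate as used in shlAddMod:
    divide the top two words of A by the normalized top word of M,
    then return q-1 so the true quotient is within {-1, 0, +1}."""
    shift = (n_words - 1) * W
    a0 = A >> (shift + W)                 # top word of A
    a1 = (A >> shift) & ((1 << W) - 1)    # second word of A
    m0 = M >> shift                       # top word of M
    assert m0 >> (W - 1) == 1, "modulus must be normalized (MSB set)"
    if a0 == m0:
        return (1 << W) - 1               # saturate: quotient == MaxWord
    q = ((a0 << W) | a1) // m0
    return 0 if q == 0 else q - 1         # off by -1/0/+1 instead of 0/1/2

# Empirical check: the true quotient differs from the estimate by at most 1
random.seed(0)
for _ in range(1000):
    M = random.randrange(1 << (2*W - 1), 1 << (2*W))  # 2-word normalized modulus
    A = random.randrange(0, M << W)                   # a*2^W + c with a < M
    assert abs(A // M - quotient_estimate(A, M, 2)) <= 1
```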

Montgomery Multiplication

func montyMul*(
       r: BigIntViewMut, a, b: distinct BigIntViewAny,
       M: BigIntViewConst, montyMagic: Word) =
  ## Compute r <- a*b (mod M) in the Montgomery domain
  ## `montyMagic` = -1/M (mod Word). Our words are 2^31 or 2^63
  ##
  ## This resets r to zero before processing. Use {.noInit.}
  ## to avoid duplicating work with Nim's zero-init policy
  # i.e. c'R <- a'R * b'R * R^-1 (mod M) in the natural domain
  # as in the Montgomery domain all numbers are scaled by R
  checkValidModulus(M)
  checkOddModulus(M)
  checkMatchingBitlengths(r, M)
  checkMatchingBitlengths(a, M)
  checkMatchingBitlengths(b, M)

  let nLen = M.numLimbs()
  setZero(r)

  var r_hi = Zero # the high word used in intermediate computations before reduction mod M
  for i in 0 ..< nLen:
    let zi = (r[0] + wordMul(a[i], b[0])).wordMul(montyMagic)
    var carry = Zero

    for j in 0 ..< nLen:
      let z = DoubleWord(r[j]) + unsafeExtPrecMul(a[i], b[j]) +
              unsafeExtPrecMul(zi, M[j]) + DoubleWord(carry)
      carry = Word(z shr WordBitSize)
      if j != 0:
        r[j-1] = Word(z) and MaxWord

    r_hi += carry
    r[^1] = r_hi and MaxWord
    r_hi = r_hi shr WordBitSize

  # If the extra word is not zero or if r-M does not borrow (i.e. r > M)
  # then subtract M
  discard r.sub(M, r_hi.isNonZero() or not r.sub(M, CtFalse))
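As a cross-check of the algorithm, here is a hedged Python sketch of the same word-by-word (CIOS-style) Montgomery multiplication, with big ints standing in for limb views and illustrative names, verified against plain modular arithmetic:

```python
W = 63                 # word bit-size
MASK = (1 << W) - 1

def monty_mul(a: int, b: int, M: int, n: int, monty_magic: int) -> int:
    """Word-by-word Montgomery multiplication (CIOS):
    returns a * b * R^-1 mod M with R = 2^(W*n)."""
    r = 0
    for i in range(n):
        ai = (a >> (W * i)) & MASK
        # zi is chosen so the running sum becomes divisible by 2^W
        zi = ((r + ai * (b & MASK)) * monty_magic) & MASK
        r = (r + ai * b + zi * M) >> W
    return r - M if r >= M else r   # final conditional subtraction

# Example on a 2-limb odd modulus
M = (1 << 70) - 33
n = 2
R = 1 << (W * n)
monty_magic = (-pow(M, -1, 1 << W)) & MASK   # -1/M mod 2^W (Python 3.8+)

a, b = 0x1234567890ABCDEF, 0xFEDCBA098765
aR, bR = a * R % M, b * R % M                # to the Montgomery domain
abR = monty_mul(aR, bR, M, n, monty_magic)   # = a*b*R mod M
assert abR == a * b * R % M
```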

Implementation strategy

Computing R² (mod p) at compile-time

Current situation

We cannot use shlAddMod as-is at compile-time at the moment, for the following reasons:

  1. Inputs are ptr object and pointers don't work at compile-time
  2. It uses moveMem and moveMem doesn't work at compile-time
  3. It uses unsafeDiv2n1n and unsafeDiv2n1n is assembly
  • Reason 2 is easy to work around
  • Reason 3 is already a temporary workaround: hardware division is usually not constant-time, so
    a constant-time division algorithm is needed. As it would work with integers only, it would naturally work at compile-time as well
  • Reason 1 would require implementing shlAddMod on BigInt instead of the type-erased BigIntView;
    a switch with `when nimvm` to fall back on the type-erased version at runtime is possible.

Analysis

Given all these hurdles, it probably makes sense to have a specialized compute-R² (mod p) procedure.

Paper:
Bos and Montgomery, Montgomery Arithmetic from a Software Perspective
https://eprint.iacr.org/2017/1057.pdf

image

Note on conversion from Montgomery domain to the natural domain

Similarly, conversion from the Montgomery domain to the natural domain (redc) could be implemented as a Montgomery multiplication by 1; the current redc is already a specialized version of that.
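A quick Python illustration of that remark, assuming a CIOS-style montyMul as sketched earlier (illustrative names, not the library's API): multiplying a Montgomery-form value by 1 strips one factor of R, which is exactly what redc does.

```python
W = 63
MASK = (1 << W) - 1

def monty_mul(a: int, b: int, M: int, n: int, magic: int) -> int:
    """Montgomery multiplication: a*b*R^-1 mod M, R = 2^(W*n) (CIOS)."""
    r = 0
    for i in range(n):
        ai = (a >> (W * i)) & MASK
        zi = ((r + ai * (b & MASK)) * magic) & MASK
        r = (r + ai * b + zi * M) >> W
    return r - M if r >= M else r

M = (1 << 70) - 33
n, R = 2, 1 << (W * 2)
magic = (-pow(M, -1, 1 << W)) & MASK

x = 0xCAFEF00D
xR = x * R % M                 # Montgomery form of x
# redc(xR) == montyMul(xR, 1): multiplying by 1 strips one factor of R
assert monty_mul(xR, 1, M, n, magic) == x
```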

Subgroup check

When verifying signatures we need to discard invalid signatures. In particular, for curves with a cofactor != 1 we need to prevent "subgroup attacks" for which a signature is forged to be on the curve but not in the desired subgroup (see cofactor #46).

Preventing subgroup attacks requires a subgroup check, which is a costly scalar multiplication when implemented naively.

This paper from Zcash details a faster alternative:

Pseudo code for BLS12-381: pairingwg/bls_standard#21

Sean Bowe shows how to check subgroup membership more quickly than exponentiation by the group order.

This post quickly summarizes the results as pseudocode.

TODO: should subgroup testing be a required part of deserialization? Sean points out (in personal communication) that G2 subgroup checks are necessary, because the pairing operation is undefined otherwise. So probably it makes sense to just require subgroup checks for both G1 and G2.

For G1, define the endomorphism sigma as

sigma(x, y)

Input: a point P = (x, y)
Output: a point Q = (x', y')

Constants:
1. beta = 0x1a0111ea397fe699ec02408663d4de85aa0d857d89759ad4897d29650fb85f9b409427eb4f49fffd8bfd00000000aaac

Steps:
1. x' = x * beta
2. y' = y
3. return (x', y')

Then, to check subgroup membership, test the following:

subgroup_test_g1(x, y)

Input: a point P
Output: True if P is in the order-q subgroup of E1, else False

Constants:
- c1 = (z^2 - 1) / 3 (in the integers), where z = -0xd201000000010000
- point_at_infinity_E1 is the point at infinity on E1

Steps:
1. sP = sigma(P)
2. Q = 2 * sP
3. sP = sigma(sP)
4. Q = Q - P
5. Q = Q - sP
6. Q = c1 * Q
7. Q = Q - sP
8. return Q == point_at_infinity_E1
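As a sanity check on the constants above (not the full curve test), one can verify in Python that beta is a non-trivial cube root of unity modulo the BLS12-381 prime — which is what makes sigma an endomorphism — and that c1 is an exact integer:

```python
# BLS12-381 base field modulus
p = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab
# sigma constant from the pseudocode above
beta = 0x1a0111ea397fe699ec02408663d4de85aa0d857d89759ad4897d29650fb85f9b409427eb4f49fffd8bfd00000000aaac
# curve parameter z from the pseudocode above
z = -0xd201000000010000

# sigma(x, y) = (beta*x, y) maps the curve onto itself because
# beta is a non-trivial cube root of unity in Fp
assert beta != 1 and pow(beta, 3, p) == 1

# c1 = (z^2 - 1) / 3 is exact: z^2 - 1 is divisible by 3
assert (z * z - 1) % 3 == 0
c1 = (z * z - 1) // 3
```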

For G2, let psi(P) be the "untwist-Frobenius-twist" endomorphism given by Galbraith and Scott in Section 5 of GS08. Then to test subgroup membership, check the following:

subgroup_test_g2(P)

Input: a point P
Output: True if P is in the order-q subgroup of E2, else False

Constants:
- z = -0xd201000000010000
- point_at_infinity_E2 is the point at infinity on E2

Steps:
1. pP = psi(P)
2. pP = psi(pP)
3. Q = P - pP
4. pP = psi(pP)
5. pP = z * pP
6. Q = Q + pP
7. return Q == point_at_infinity_E2

Code size woes

The towered extensions lead to (lots of) duplicated code due to a codegen bug in Nim upstream nim-lang/Nim#13982.

This is quite bad for resource-restricted devices.
image

Finite field computation for moduli of special form

The library currently implements generic routines for odd field moduli.

This is motivated by the initial focus on pairing-friendly curves like BN (Barreto-Naehrig) and BLS (Barreto-Lynn-Scott), as they are the main curves used in blockchains and for zero-knowledge proofs.
The pairing-curve modulus is not of special form (there are different tradeoffs in choosing a pairing-friendly modulus: the curve order must also be prime and all parameters must have low Hamming weight).

Routines for special field modulus form:

  • Mersenne Prime (2^k - 1 for example NIST prime P521 = 2^521 - 1),
  • Generalized Mersenne Prime (NIST Prime P256: 2^256 - 2^224 + 2^192 + 2^96 - 1)
  • Pseudo-Mersenne Prime (2^m - k for example Curve25519: 2^255 - 19 or secp256k1)
  • Golden Primes (φ^2 - φ - 1 with φ = 2^k for example Ed448-Goldilocks: 2^448 - 2^224 - 1)
    exist and can be implemented with compile-time specialization.

In particular, the field modulus for secp256k1 used in Bitcoin and Ethereum 1 is
2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 or 2^256 - 0x1000003D1 which is a pseudo-Mersenne prime.
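The special form enables reduction by folding instead of generic division: since 2^256 ≡ 0x1000003D1 (mod p), the limbs above bit 256 can be multiplied by that small constant and added back. A hedged Python sketch of the idea (not the library's implementation):

```python
# secp256k1 field: p = 2^256 - c with c = 0x1000003D1 (pseudo-Mersenne)
C = 0x1000003D1
P = (1 << 256) - C

def reduce_secp256k1(x: int) -> int:
    """Fold the high part back in: x = hi*2^256 + lo ≡ hi*C + lo (mod p)."""
    while x >> 256:
        hi, lo = x >> 256, x & ((1 << 256) - 1)
        x = lo + hi * C          # each fold shrinks x by ~224 bits
    return x - P if x >= P else x

# e.g. reduce a ~430-bit product without any generic division
a = 0xDEADBEEF << 200
b = 0x12345 << 180
assert reduce_secp256k1(a * b) == (a * b) % P
```

After the folding loop x < 2^256 < 2p, so a single conditional subtraction finishes the reduction.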

Internal API: in-place vs result

Currently Constantine is using the following internal API for add, sub, mul:

func `+=`*(a: var Fp, b: Fp)
func sum*(r: var Fp, a, b: Fp)
func `-=`*(a: var Fp, b: Fp)
func diff*(r: var Fp, a, b: Fp)
func prod*(r: var Fp, a, b: Fp)

However, should we instead use the following?

func `+=`*(a: var Fp, b: Fp)
func `+`*(a, b: Fp): Fp
func `-=`*(a: var Fp, b: Fp)
func `-`*(a, b: Fp): Fp
func `*`*(a, b: Fp): Fp

Tradeoffs

Ergonomics

While those APIs are internal, they are used in some quite long formulas and it's easier to implement them without mistakes if we closely follow the mathematical notation, for example for Algorithm 7 below
image

Performance

There are a couple of potential performance issues.
Note that Fp is often Fp12 in pairing cases, which on the BLS12-381 curve (6 64-bit limbs) is of size 12 * 6 * 8 bytes = 576 bytes, and so is a huge parameter to pass with by-value semantics.

Parameter-passing of arrays

To be investigated: if we have a static array as the result value, is it passed on the stack, or preallocated in the caller's context and passed via a register (EAX/RAX)?

Register vs memory

Key operations like add-with-carry behave differently if the destination is a memory location or a register. The latency can be up to 6 cycles instead of 1 cycle, see Agner Fog tables: https://www.agner.org/optimize/instruction_tables.pdf

Nehalem (2008)
image
Haswell (2013)
image
Skylake (2015)
image
Skylake-X (2015 - latest for consumer because Intel cannot produce new architectures)
image

Left highlighted column is latency: how many cycles do you have to wait if you depend on this result for further operation (like chained in-place additions or depending on a carry).

Right highlighted column is reciprocal throughput, how many cycles do you need to issue independent instructions. A value of 0.25 means 4 independent additions per cycle while a value of 2 means 1 every 2 cycles.

The last case, ADC/SBB m, r/i — add-with-carry / sub-with-borrow into a memory destination — must absolutely be avoided due to low throughput (one every 2 cycles) and high latency (5 cycles to wait), while the other direction, register <- memory, has throughput and latency of 1 cycle.

(Note: we can avoid the carry case altogether with lazy carries, but even though throughput is then 4 per cycle instead of 1 per cycle, it is very likely to be slower overall due to more memory used (multiplication cost grows in n² with n the number of limbs) and costly reduction operations in the general case, see #15 (comment))

If the API uses a var parameter, we need a temporary variable in a register anyway to store the result and then copy it to the destination.

In the second case, the compiler knows that the result is already a temporary variable on the stack.

The cost is potentially more stack usage due to those temporaries as the compiler does not directly construct the result in the destination buffer (i.e. we don't have copy-elision)

{.noInit.}

Nim zero-initializes every variable before use. The C compiler can usually detect and discard redundant initializations, for example the following:

let a = 5

gets compiled into

NI a;
a = (NI)0;
a = (NI)5;

While this avoids a whole class of bugs, when working with value objects of size over 100 bytes (or even 576 bytes for Fp12 elements in pairings) we have the following problems:

  • Does the compiler recognize and elide the zero-initialization?
  • If not, can we elide it manually?

For manual elision this can be done with {.noInit.}:
With in-place sum

var r{.noInit.}: Fp12[BLS12_381]
r.sum(a, b)

However with a result value

proc `+`(a, b: Fp12): Fp12 {.noInit.} =
  discard "implementation"

let r = a + b

The {.noInit.} applies to the function's implicit result variable, but does the final assignment to r zero-initialize r under the hood? And so, do we need

proc `+`(a, b: Fp12): Fp12 {.noInit.} =
  discard "implementation"

let r {.noInit.} = a + b

Introduce paranoid vs fast_muldiv mode

As sometimes we want speed and sometimes we want full privacy, and it depends on the user application, it might be worthwhile to have a paranoid vs fast_div mode.

fast_div would assume that the platform division is constant-time while paranoid would reimplement them, for example using binary shift.

The same issue exists for multiplication on 32-bit platforms: https://bearssl.org/ctmul.html

Showstopper upstream bug - static object

There are 2 kinds of implementation possible

  1. Like in any language, the prime number is a global stored in the binary. Main trade-offs are:
  • We don't use the type system at all to check that A + B are compatible (i.e. use the same modulus)
  • {.noSideEffect.} tracking will be cumbersome.
  • No constant folding
  • We need a way to deal with "different modulus" errors. Can we use exceptions given that we want to run even on embedded? Using a result + error code, or just an error code, makes library usage clunky as well.
  2. Embed the prime number in the type system.

Upstream: nim-lang/Nim#11142

Fused modular square root on 32-bit - wrong "isSquare" result

image

test "EC add is associative":
  proc test(F: typedesc, randZ: static bool) =
    for _ in 0 ..< Iters:
      when randZ:
        let a = rng.random_unsafe_with_randZ(ECP_SWei_Proj[F])
        let b = rng.random_unsafe_with_randZ(ECP_SWei_Proj[F])
        let c = rng.random_unsafe_with_randZ(ECP_SWei_Proj[F])
      else:
        let a = rng.random_unsafe(ECP_SWei_Proj[F])
        let b = rng.random_unsafe(ECP_SWei_Proj[F])
        let c = rng.random_unsafe(ECP_SWei_Proj[F])

      var tmp1{.noInit.}, tmp2{.noInit.}: ECP_SWei_Proj[F]

      # r0 = (a + b) + c
      tmp1.sum(a, b)
      tmp2.sum(tmp1, c)
      let r0 = tmp2

      # r1 = a + (b + c)
      tmp1.sum(b, c)
      tmp2.sum(a, tmp1)
      let r1 = tmp2

      # r2 = (a + c) + b
      tmp1.sum(a, c)
      tmp2.sum(tmp1, b)
      let r2 = tmp2

      # r3 = a + (c + b)
      tmp1.sum(c, b)
      tmp2.sum(a, tmp1)
      let r3 = tmp2

      # r4 = (c + a) + b
      tmp1.sum(c, a)
      tmp2.sum(tmp1, b)
      let r4 = tmp2

      # ...
      check:
        bool(r0 == r1)
        bool(r0 == r2)
        bool(r0 == r3)
        bool(r0 == r4)

  test(Fp[BN254_Snarks], randZ = false)
  test(Fp[BN254_Snarks], randZ = true)
  test(Fp[BLS12_381], randZ = false)
  test(Fp[BLS12_381], randZ = true)

https://travis-ci.com/github/mratsim/constantine/jobs/345566834#L1059

seed: 1591548864

Cleanup Readme

  • Curve configuration link is wrong
  • The license section should explain that the library has no dependencies and that GNU libraries like GMP are only used for testing

Implement fast inversion for public data

When verifying using public keys we don't need to be constant-time, meaning we have a much wider variety of tooling that we can use.
This is especially important in scenarios where we need to defend against denial-of-service attacks where an attacker can craft seemingly valid data except for the cryptographic signature.

In particular, inversion is one of the top 2 slowest base operations in constant-time, and it can be significantly accelerated in variable-time.
Benchmarking GMP's variable-time vs constant-time implementations, it can be made about 30x faster.
This might enable using affine coordinates for GT, and maybe G2 or G1, for verification; affine coordinates trade inversions for multiplications and several additions.

Window method for GLV acceleration

#44 introduced a basic endomorphism accelerated scalar multiplication.

This can be further accelerated by introducing windows to save the following amount of operations:

  • On G1, with a 2 dimensional decomposition, the lookup table is small (2 curve points), we can use a window of 2 or 3 (especially with affine coordinates) with the following estimated speedups:
    • GLV scalarmul on 254-bit scalar --> 127 doubling + 127 additions (from table lookup)
    • With window of size 2 --> 127 doublings + 64 additions (-25% operations)
    • With window of size 3 --> 127 doublings + 43 additions (-33% operations)
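The operation-count estimates above can be reproduced with a toy model. Using integers under addition as a stand-in group (so [k]P = k·P is trivially checkable), a fixed-window ladder performs one table addition per w-bit window — 127 additions for w = 1 on a 254-bit scalar, 127 for... rather, ceil(254/w) in general. Illustrative sketch, not the library's implementation:

```python
# Additive stand-in group: "points" are integers, point addition is +,
# doubling is *2, so [k]P == k*P and results are easy to check.

def windowed_scalar_mul(k: int, P: int, bits: int, w: int):
    """Fixed-window left-to-right scalar multiplication.
    Returns [k]P and the number of table additions performed."""
    table = [d * P for d in range(1 << w)]    # 2^w precomputed multiples of P
    acc, adds = 0, 0
    n_windows = (bits + w - 1) // w
    for i in reversed(range(n_windows)):
        for _ in range(w):
            acc *= 2                          # w doublings per window
        digit = (k >> (i * w)) & ((1 << w) - 1)
        acc += table[digit]                   # 1 table addition per window
        adds += 1
    return acc, adds

P, bits = 7, 254
k = (1 << 253) | 0x123456789
for w in (1, 2, 3):
    r, adds = windowed_scalar_mul(k, P, bits, w)
    assert r == k * P
    assert adds == (bits + w - 1) // w   # w=2 halves, w=3 roughly thirds the additions
```

A constant-time variant would look up `table[digit]` with a masked scan instead of indexing, and perform the zero-digit addition unconditionally, as counted here.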

On G2, scalar multiplication for pairing curves can already be decomposed further by combining 2 endomorphism accelerations (GLV + GLS methods), and adding window methods on top would blow up the stack for few savings.

The paper below has an in-depth explanation of the window method applied to the custom representation.

  • Efficient and Secure Algorithms for GLV-Based Scalar
    Multiplication and their Implementation on GLV-GLS
    Curves (Extended Version)
    Armando Faz-Hernández, Patrick Longa, Ana H. Sánchez, 2013
    https://eprint.iacr.org/2013/158.pdf

Additionally, Snowshoe (https://github.com/catid/snowshoe) has such an implementation, and seems to be the only project with one in the wild: https://github.com/catid/snowshoe/blob/8ba3f575/src/ecmul.inc#L134-L160

multi-scalar multiplication / multi-exponentiations (a.k.a. Pippenger algorithm)

Glossary:

  • We talk about scalar multiplication for additive groups G1 (over Fp) and G2 (over Fp2 thanks to a sextic twist)
  • We talk about exponentiation for multiplicative group GT (defined over Fp12 for common pairing curve like BN254_Snarks and BLS12_381)

For zk-SNARKs, we need to compute a linear combination of scalar multiplications/exponentiations:

Q <- [k1] P1 + [k2] P2 + ... + [kn] Pn

As a generalization of the Strauss-Shamir trick for [a]P + [b]Q, we can save a significant number of iterations.
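A toy illustration of the Strauss-Shamir trick, with integers under addition standing in for curve points (so results are trivially checkable): both scalars share a single doubling chain instead of each getting their own.

```python
# Additive stand-in group: integers under +, so [k]P == k*P.

def shamir_trick(a: int, P: int, b: int, Q: int, bits: int):
    """Strauss-Shamir: compute [a]P + [b]Q with one shared doubling
    chain instead of two independent double-and-add runs."""
    # Precompute the 4 possible per-bit contributions
    table = {(0, 0): 0, (1, 0): P, (0, 1): Q, (1, 1): P + Q}
    acc, doublings = 0, 0
    for i in reversed(range(bits)):
        acc *= 2                              # one doubling serves both scalars
        doublings += 1
        acc += table[((a >> i) & 1, (b >> i) & 1)]
    return acc, doublings

P, Q = 5, 11
a, b = 0b101101, 0b110011
r, doublings = shamir_trick(a, P, b, Q, 6)
assert r == a * P + b * Q
assert doublings == 6                         # vs 12 for two separate ladders
```

Pippenger's algorithm generalizes this beyond 2 terms by bucketing scalar digits across all n points.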

image
image
image

Research

Implementations

Side-note

For batched signature verification (see https://ethresear.ch/t/fast-verification-of-multiple-bls-signatures/5407) we may use this as well. To be studied compared to the PAIR_BLS381_another in Milagro to accumulate line functions for multi-pairing and incur the cost of final exponentiation only once as well.

Conjugate addition (Computing both P+Q and P-Q)

From Patrick Longa's PhD Thesis Appendix:
image


Note:
Jacobian coordinates do not have a complete exception-free addition formula: they require special-casing additions involving the point at infinity, the same point, or its opposite. So they are not suitable for constant-time operations unless we introduce masked copies.
Twisted Edwards curve have a complete addition law.

Implement constant-time division

Currently div2n1n (i.e. div 128 -> 64 and div 64 -> 32)
is implemented using assembly instructions.

Division in hardware is often implemented via microcode with timing that often depends on the operand values.

We need a guaranteed constant-time version of dividing 2 words by a word.

Implement 128-bit constant-time extended precision multiplication 64x64 -> 128

GCC/Clang offer the __int128 primitive.
MSVC offers the _umul128 intrinsic: https://github.com/MicrosoftDocs/cpp-docs/blob/master/docs/intrinsics/umul128.md

So we can implement DoubleWord multiplication on 64-bit hardware.
This would allow using 64-bit limbs everywhere.

Alternatively, a custom uint128 type is probably enough: besides being the result of _umul128, it only needs to support addition and right shift:

block: # q*p
  # q * p + carry (doubleword) carry from previous limb
  let qp = unsafeExtPrecMul(q, M[i]) + carry.DoubleWord
  carry = Word(qp shr WordBitSize) # New carry: high digit besides LSB
  qp_lo = qp.Word and MaxWord      # Normalize to u63

var carry = DoubleWord(0)
for j in 0 ..< nLen:
  let z = DoubleWord(a[i]) + unsafeExtPrecMul(z0, N[i]) + carry
  carry = z shr WordBitSize

let z = DoubleWord(r[j]) + unsafeExtPrecMul(a[i], b[j]) +
        unsafeExtPrecMul(zi, M[j]) + DoubleWord(carry)
carry = Word(z shr WordBitSize)
if j != 0:
  r[j-1] = Word(z) and MaxWord

Simultaneous inversion

When we need to invert multiple field elements we can significantly reduce the cost by replacing all inversions but 1 by 3 multiplications each (i.e. we have 3(n-1) Mul + 1 Inv instead of n Inv).
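This is Montgomery's batch-inversion trick: accumulate prefix products, invert the running total once, then peel the product apart backwards. A Python sketch over Fp (illustrative, using Fermat's little theorem for the single inversion):

```python
# Montgomery's batch inversion: replace n inversions by 1 inversion
# and about 3(n-1) multiplications.

def batch_inverse(elems, p):
    """Invert all (nonzero) elements of `elems` modulo the prime p
    with a single field inversion."""
    n = len(elems)
    prefix = [1] * (n + 1)
    for i, x in enumerate(elems):          # prefix[i+1] = e0 * ... * ei
        prefix[i + 1] = prefix[i] * x % p
    inv = pow(prefix[n], p - 2, p)         # the only inversion (Fermat)
    out = [0] * n
    for i in reversed(range(n)):
        out[i] = inv * prefix[i] % p       # 1/ei = (e0..e(i-1)) * 1/(e0..ei)
        inv = inv * elems[i] % p           # strip ei from the running inverse
    return out

p = 2**127 - 1                             # a Mersenne prime
xs = [3, 7, 12345, 2**100 + 7]
for x, xinv in zip(xs, batch_inverse(xs, p)):
    assert x * xinv % p == 1
```

In a constant-time context the single remaining inversion would still use the library's constant-time algorithm; the trick only amortizes its cost across the batch.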

Use case:

  • Converting multiple points to affine in lookup tables to enable mixed addition formulas for scalar multiplication

Constant-time inversions are about 300x slower than multiplications
image

Several techniques exist:

From

  • Efficient Simultaneous Inversion in Parallel and Application to Point Multiplication in ECC
    Pradeep Kumar Mishra, 2005
  • Fast Algorithms for Arithmetic on Elliptic Curves over Prime Fields
    Nicholas T. Sullivan, 2007
    https://eprint.iacr.org/2008/060.pdf

Montgomery simultaneous inversion

image
image

Parallelizable alternatives

Parallel alternatives will be useful for batch verification where no secrets are involved, or for zero-knowledge proofs / SNARKs.

Endomorphism-accelerated scalar multiplication

PR #28 introduced an efficient constant-time scalar multiplication scheme; it is over 10% faster than naive double-and-add (with a scratchspace of 16 curve points) even though it is constant-time.
image

The BN and BLS curve families (among many others) allow using GLV (Gallant-Lambert-Vanstone) and GLS (Galbraith-Lin-Scott) decompositions to speed up scalar multiplication even more, on G1 (via GLV with 2-dimensional decomposition) and G2 (via combined GLV+GLS).

This works by decomposing the scalar on a lattice basis and doing multi-scalar multiplication.

The main cost of scalar multiplication comes from iterating on each bit of the scalar:

  • for BN254_Snarks 254 iterations
  • for BLS12_381, 255 iterations (as we work on a curve with an order of bit-width 255)
    with each bit requiring a doubling and an addition (in a constant-time implementation)

A 2-dimensional GLV can halve the number of doublings by decomposing the scalar into smaller sub-scalars.

References

  • Joppe W. Bos, Craig Costello, and Michael Naehrig.
    Scalar Multiplication and Exponentiation in Pairing Groups
    Chapter 6 of Guide to Pairing-Based Cryptography (El Mrabet, Joye et al)

  • Software implementation of an Attribute-Based Encryption scheme
    Eric Zavattoni, Luis J. Dominguez Perez, Shigeo Mitsunari, Ana H. Sanchez-Ramirez, Tadanori Teruya, and Francisco Rodriguez-Henriquez, 2015
    https://eprint.iacr.org/2014/401.pdf

  • Joppe W. Bos, Craig Costello, and Michael Naehrig. Exponentiating in pairing groups.
    In T. Lange, K. Lauter, and P. Lisonek, editors, Selected Areas in Cryptography
    – SAC 2013, volume 8282 of Lecture Notes in Computer Science, pp. 438–455.
    Springer, Heidelberg, 2014.
    https://eprint.iacr.org/2013/458

Fused modular square root on 32-bit - wrong "isSquare" result

Reproduced locally in 32-bit mode (might not be related to 32-bit as the RNG will initialize differently)

test_ec_weierstrass_projective_g1 xoshiro512** seed: 1591646170

[Suite] Elliptic curve in Short Weierstrass form y² = x³ + a x + b with projective coordinates (X, Y, Z): Y²Z = X³ + aXZ² + bZ³ i.e. X = xZ, Y = yZ
  [OK] The infinity point is the neutral element w.r.t. to EC addition
  [OK] Adding opposites gives an infinity point
  [OK] EC add is commutative
  [OK] EC add is associative
    /home/beta/Programming/Nim/constantine/tests/test_ec_weierstrass_projective_g1.nim(170, 19): Check failed: bool(r0 == r1)
  [FAILED] EC double and EC add are consistent
  [OK] EC mul [0]P == Inf
  [OK] EC mul [Order]P == Inf
  [OK] EC mul [1]P == P
  [OK] EC mul [2]P == P.double()
  [OK] EC mul is distributive over EC add
  [OK] EC mul constant-time is equivalent to a simple double-and-add algorithm
Error: execution of an external program failed: '/home/beta/Programming/Nim/constantine/build/test_ec_weierstrass_projective_g1 '

test "EC double and EC add are consistent":
  proc test(F: typedesc, randZ: static bool) =
    for _ in 0 ..< Iters:
      when randZ:
        let a = rng.random_unsafe_with_randZ(ECP_SWei_Proj[F])
      else:
        let a = rng.random_unsafe(ECP_SWei_Proj[F])

      var r0{.noInit.}, r1{.noInit.}: ECP_SWei_Proj[F]
      r0.double(a)
      r1.sum(a, a)
      check: bool(r0 == r1)

  test(Fp[BN254_Snarks], randZ = false)
  test(Fp[BN254_Snarks], randZ = true)
  test(Fp[BLS12_381], randZ = false)
  test(Fp[BLS12_381], randZ = true)

Test generators should generate to a serialized format (JSON, ...)

Currently the test vector generators at:

must be run manually and their output copy-pasted into a test file:

which is very tedious and error-prone already for Fp2.

And pairings occur in Fp12, and there are many curves with "interesting" specificities, like BLS12-377 which doesn't use a complex extension for Fp2.

That said, testing 10 points on a curve gives high confidence that the implemented elliptic curve is correct: if 2 polynomials of degree 3 (the case of elliptic curves) agree on more points than their degree, they are the same polynomial (not sure how that changes when the polynomial is in both x and y).

Workaround for compile-time modulus and negInvModWord

Due to upstream bug nim-lang/Nim#9679, and another hard-to-reproduce one for static Word, using compile-time properties of the modulus requires lots of workarounds.

Bug 2 gives the following errors:

========================================================================================
Running tests/test_io_fields
========================================================================================
/home/beta/Programming/Nim/constantine/tests/test_io_fields.nim(19, 9) template/generic instantiation of `suite` from here
/home/beta/Programming/Nim/constantine/tests/test_io_fields.nim(20, 10) template/generic instantiation of `test` from here
/home/beta/Programming/Nim/constantine/tests/test_io_fields.nim(26, 10) template/generic instantiation of `fromUint` from here
/home/beta/Programming/Nim/constantine/constantine/io/io_fields.nim(29, 6) template/generic instantiation of `fromBig` from here
/home/beta/Programming/Nim/constantine/constantine/arithmetic/finite_fields.nim(64, 11) Error: type mismatch: got <BigInt[61], BigInt[61], BigInt[61], BigInt[61], static[Word](2305843009213693953)>
but expected one of: 
func montyResidue(mres: var BigInt; a, N, r2modN: BigInt; negInvModWord: static Word)
  first type mismatch at position: 5
  required type for negInvModWord: static[Word]
  but expression '2305843009213693953' is of type: static[Word](2305843009213693953)

Trying to reduce it to a minimal example makes it disappear ...

import macros

type Word = distinct uint64
const WordBitSize = 63

func wordsRequired*(bits: int): int {.compileTime.} =
  (bits + WordBitSize - 1) div WordBitSize

type
  BigInt*[bits: static int] = object
    bitLength: uint32
    limbs: array[bits.wordsRequired, Word]

  Curve = enum
    BN254
    BLS12_381

const CurveBitSize = [
  BN254: 254,
  BLS12_381: 381
]

template matchingBigInt(C: static Curve): untyped =
  # Need template indirection to avoid a sigmatch bug
  BigInt[CurveBitSize[C]]

type
  FiniteField*[C: static enum] = object
    big: matchingBigInt(C)

const BN254_Modulus = BigInt[254]()
const BN254_NegInvModWord = Word(2305843009213693953)

{.experimental: "dynamicBindSym".}

macro getModulus(C: static Curve): untyped =
  result = bindSym($C & "_Modulus")

macro getNegInvModWord(C: static Curve): untyped =
  result = bindSym($C & "_NegInvModWord")

func foo(mres: var BigInt, a, N: BigInt, negInvModWord: static Word) =
  discard

func bar(r: var FiniteField, a: FiniteField) =
  foo(r.big, a.big, a.C.getModulus(), a.C.getNegInvModWord())

var r: FiniteField[BN254]
let a = FiniteField[BN254]()

bar(r, a)

Quadratic and Cubic non-residues for tower of extension fields

To construct a tower of extension fields we need to find an irreducible polynomial, i.e. a non-residue β:
For quadratic extension fields:

  • x² ≡ β (mod p) must have no solution, i.e. β is not a square (mod p)

For cubic extension fields:

  • x³ ≡ β (mod p) must have no solution, i.e. β is not a cube (mod p)

Unfortunately, while for 𝔽p2 most (all?) pairing-friendly primes admit the imaginary number 𝑖 as the extension element (they are chosen so that p ≡ 3 (mod 4), making -1 a quadratic non-residue),
the irreducible polynomial for further extensions likely depends on the prime chosen.

For example the IETF draft on pairing curves suggests in section 4.3 the following towers:

  • BN462

    • F_p^2 = F_p[u] / (u^2 + 1)
    • F_p^6 = F_p^2[v] / (v^3 - u - 2)
    • F_p^12 = F_p^6[w] / (w^2 - v).
  • BLS12-381

    • F_p^2 = F_p[u] / (u^2 + 1)
    • F_p^6 = F_p^2[v] / (v^3 - u - 1)
    • F_p^12 = F_p^6[w] / (w^2 - v).
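For the first level of both towers, the irreducibility of u² + 1 reduces to -1 being a quadratic non-residue, which holds exactly when p ≡ 3 (mod 4). A quick Python check via Euler's criterion for BLS12-381:

```python
# Euler's criterion: a is a square mod an odd prime p
# iff a^((p-1)/2) ≡ 1 (mod p).

def is_square(a: int, p: int) -> bool:
    return pow(a, (p - 1) // 2, p) == 1

# BLS12-381 base field modulus
p = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab

assert p % 4 == 3                  # p ≡ 3 (mod 4)
assert not is_square(p - 1, p)     # -1 is a non-residue, so u^2 + 1 is irreducible
```

The non-residues for the higher levels (v³ - u - 1, v³ - u - 2, ...) live in 𝔽p2 and need the analogous check there, which is what the Sage script mentioned below attempts to automate.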

Given the wide range of curves we need to:

  1. Automatically check at least or find non-residues
  2. Associate the multiplication by non-residue to the curve so that fields like 𝔽p6 can be generically implemented.

Checking or finding non-residues

This was started in Sage at https://github.com/mratsim/constantine/blob/c40bc1977d00dbd87f328a3e410d63b9d85ae458/sage/non_residues.sage

Unfortunately Sage is completely outside my expertise (help wanted): I cannot seem to use the imaginary number 𝑖 to define an extension based on a complex polynomial.

See also

image

image

image
image

And further refinements

in particular for BN curves
image
image
(However, the last result directly constructs a sextic extension, while it's preferable to only compose quadratic and cubic extensions to benefit from simpler implementations (and also optimizations like Chung-Hasan squarings).)

Abstracting over the non-residues

This was somewhat started with this Xi abstract object:

Xi = object
  ## ξ (Xi) the cubic non-residue

func `*`(_: typedesc[Xi], a: Fp2): Fp2 {.inline.}=
  ## Multiply an element of 𝔽p2 by the 𝔽p6 cubic non-residue 1 + 𝑖
  ## (c0 + c1 𝑖) (1 + 𝑖) => c0 + (c0 + c1)𝑖 + c1 𝑖²
  ##                    => c0 - c1 + (c0 + c1) 𝑖
  result.c0.diff(a.c0, a.c1)
  result.c1.sum(a.c0, a.c1)

template `*`(a: Fp2, _: typedesc[Xi]): Fp2 =
  Xi * a

func `*=`(a: var Fp2, _: typedesc[Xi]) {.inline.}=
  ## In-place multiply an element of 𝔽p2 by the 𝔽p6 cubic non-residue 1 + 𝑖
  let t = a.c0
  a.c0 -= a.c1
  a.c1 += t

Used like this:

v1.square(a.c2)
v1 *= Xi
v2.prod(a.c0, a.c1)
v1 -= v2
# C in v2
# C <- a1² - a0 a2
v2.square(a.c1)
v3.prod(a.c0, a.c2)
v2 -= v3
# F in v3
# F <- ξ a1 C + a0 A + ξ a2 B
r.c1.prod(v1, Xi * a.c2)
r.c2.prod(v2, Xi * a.c1)
v3.prod(r.c0, a.c0)

The way forward is probably to add a config_tower.nim file that would do something like this,
Using Xi as the generic name for non-residues

func `*`(_: typedesc[Xi], a: Fp6[BN254]): Fp6[BN254] {.inline.}=
  discard

template `*`(a: Fp2, _: typedesc[Xi]): Fp2 =
  Xi * a

func `*=`(a: var Fp6[BN254], _: typedesc[Xi]) {.inline.}=
  discard

func `*`(_: typedesc[Xi], a: Fp12[BN254]): Fp12[BN254] {.inline.}=
  when a is CubicExtensionField:
    discard "Implementation for 𝔽p12 = 𝔽p4[w]"
  else:
    discard "Implementation for 𝔽p12 = 𝔽p6[w]"

template `*`(a: Fp12, _: typedesc[Xi]): Fp12 =
  Xi * a

func `*=`(a: var Fp12[BN254], _: typedesc[Xi]) {.inline.}=
  when a is CubicExtensionField:
    discard "Implementation for 𝔽p12 = 𝔽p4[w]"
  else:
    discard "Implementation for 𝔽p12 = 𝔽p6[w]"

# --------------------------------------------------------------------

func `*`(_: typedesc[Xi], a: Fp6[BLS12_381]): Fp6[BLS12_381] {.inline.}=
  discard

template `*`(a: Fp2, _: typedesc[Xi]): Fp2 =
  Xi * a

func `*=`(a: var Fp6[BLS12_381], _: typedesc[Xi]) {.inline.}=
  discard

func `*`(_: typedesc[Xi], a: Fp12[BLS12_381]): Fp12[BLS12_381] {.inline.}=
  when a is CubicExtensionField:
    discard "Implementation for 𝔽p12 = 𝔽p4[w]"
  else:
    discard "Implementation for 𝔽p12 = 𝔽p6[w]"

template `*`(a: Fp12, _: typedesc[Xi]): Fp12 =
  Xi * a

func `*=`(a: var Fp12[BLS12_381], _: typedesc[Xi]) {.inline.}=
  when a is CubicExtensionField:
    discard "Implementation for 𝔽p12 = 𝔽p4[w]"
  else:
    discard "Implementation for 𝔽p12 = 𝔽p6[w]"

This would reuse the Cubic/Quadratic concepts defined in abelian_groups.nim

type
  CubicExtAddGroup* = concept x
    ## Cubic extension fields - Abelian Additive Group concept
    type BaseField = auto
    x.c0 is BaseField
    x.c1 is BaseField
    x.c2 is BaseField

type
  QuadExtAddGroup* = concept x
    ## Quadratic extension fields - Abelian Additive Group concept
    not(x is CubicExtAddGroup)
    type BaseField = auto
    x.c0 is BaseField
    x.c1 is BaseField

Compile-time precomputations: cleanup + generalize

Currently we have a precompute file which reimplements runtime algorithms to work around either VM limitations (`when nimvm` is a bit restricted: it cannot use `elif`, though that can be worked around with an `else: when foo:`), VM bugs (with static or distinct types, or static distinct values), or recursive imports (Montgomery conversion).

We need to keep only the precomputations there:

  • Move the primitives fallback to the primitives folder:
    const
      HalfWidth = WordBitWidth shr 1
      HalfBase = (BaseType(1) shl HalfWidth)
      HalfMask = HalfBase - 1

    func hi(n: BaseType): BaseType =
      result = n shr HalfWidth

    func lo(n: BaseType): BaseType =
      result = n and HalfMask

    func split(n: BaseType): tuple[hi, lo: BaseType] =
      result.hi = n.hi
      result.lo = n.lo

    func merge(hi, lo: BaseType): BaseType =
      (hi shl HalfWidth) or lo

    func addC(cOut, sum: var BaseType, a, b, cIn: BaseType) =
      # Add with carry, fallback for the Compile-Time VM
      # (CarryOut, Sum) <- a + b + CarryIn
      let (aHi, aLo) = split(a)
      let (bHi, bLo) = split(b)
      let tLo = aLo + bLo + cIn
      let (cLo, rLo) = split(tLo)
      let tHi = aHi + bHi + cLo
      let (cHi, rHi) = split(tHi)
      cOut = cHi
      sum = merge(rHi, rLo)

    func subB(bOut, diff: var BaseType, a, b, bIn: BaseType) =
      # Subtract with borrow, fallback for the Compile-Time VM
      # (BorrowOut, Diff) <- a - b - BorrowIn
      let (aHi, aLo) = split(a)
      let (bHi, bLo) = split(b)
      let tLo = HalfBase + aLo - bLo - bIn
      let (noBorrowLo, rLo) = split(tLo)
      let tHi = HalfBase + aHi - bHi - BaseType(noBorrowLo == 0)
      let (noBorrowHi, rHi) = split(tHi)
      bOut = BaseType(noBorrowHi == 0)
      diff = merge(rHi, rLo)

    func add(a: var BigInt, w: BaseType): bool =
      ## Limbs addition, add a number that fits in a word
      ## Returns the carry
      var carry, sum: BaseType
      addC(carry, sum, BaseType(a.limbs[0]), w, carry)
      a.limbs[0] = SecretWord(sum)
      for i in 1 ..< a.limbs.len:
        let ai = BaseType(a.limbs[i])
        addC(carry, sum, ai, 0, carry)
        a.limbs[i] = SecretWord(sum)
      result = bool(carry)

    func dbl(a: var BigInt): bool =
      ## In-place multiprecision double
      ## a -> 2a
      var carry, sum: BaseType
      for i in 0 ..< a.limbs.len:
        let ai = BaseType(a.limbs[i])
        addC(carry, sum, ai, ai, carry)
        a.limbs[i] = SecretWord(sum)
      result = bool(carry)

    func sub(a: var BigInt, w: BaseType): bool =
      ## Limbs subtraction, sub a number that fits in a word
      ## Returns the borrow
      var borrow, diff: BaseType
      subB(borrow, diff, BaseType(a.limbs[0]), w, borrow)
      a.limbs[0] = SecretWord(diff)
      for i in 1 ..< a.limbs.len:
        let ai = BaseType(a.limbs[i])
        subB(borrow, diff, ai, 0, borrow)
        a.limbs[i] = SecretWord(diff)
      result = bool(borrow)

    func mul(hi, lo: var BaseType, u, v: BaseType) =
      ## Extended precision multiplication
      ## (hi, lo) <- u * v
      var x0, x1, x2, x3: BaseType
      let
        (uh, ul) = u.split()
        (vh, vl) = v.split()
      x0 = ul * vl
      x1 = ul * vh
      x2 = uh * vl
      x3 = uh * vh
      x1 += hi(x0) # This can't carry
      x1 += x2     # but this can
      if x1 < x2:  # if carry, add it to x3
        x3 += HalfBase
      hi = x3 + hi(x1)
      lo = merge(x1, lo(x0))

    func muladd1(hi, lo: var BaseType, a, b, c: BaseType) {.inline.} =
      ## Extended precision multiplication + addition
      ## (hi, lo) <- a*b + c
      ##
      ## Note: 0xFFFFFFFF_FFFFFFFF² -> (hi: 0xFFFFFFFFFFFFFFFE, lo: 0x0000000000000001)
      ## so adding any c cannot overflow
      var carry: BaseType
      mul(hi, lo, a, b)
      addC(carry, lo, lo, c, 0)
      addC(carry, hi, hi, 0, carry)

    func muladd2(hi, lo: var BaseType, a, b, c1, c2: BaseType) {.inline.} =
      ## Extended precision multiplication + addition + addition
      ## (hi, lo) <- a*b + c1 + c2
      ##
      ## Note: 0xFFFFFFFF_FFFFFFFF² -> (hi: 0xFFFFFFFFFFFFFFFE, lo: 0x0000000000000001)
      ## so adding 0xFFFFFFFFFFFFFFFF leads to (hi: 0xFFFFFFFFFFFFFFFF, lo: 0x0000000000000000)
      ## and we have enough space to add again 0xFFFFFFFFFFFFFFFF without overflowing
      var carry1, carry2: BaseType
      mul(hi, lo, a, b)
      # Carry chain 1
      addC(carry1, lo, lo, c1, 0)
      addC(carry1, hi, hi, 0, carry1)
      # Carry chain 2
      addC(carry2, lo, lo, c2, 0)
      addC(carry2, hi, hi, 0, carry2)
  • Delete redundant implementations, may be blocked by distinct type bugs to be raised upstream:
    func cadd(a: var BigInt, b: BigInt, ctl: bool): bool =
      ## In-place optional addition
      ##
      ## It is NOT constant-time and is intended
      ## only for compile-time precomputation
      ## of non-secret data.
      var carry, sum: BaseType
      for i in 0 ..< a.limbs.len:
        let ai = BaseType(a.limbs[i])
        let bi = BaseType(b.limbs[i])
        addC(carry, sum, ai, bi, carry)
        if ctl:
          a.limbs[i] = SecretWord(sum)
      result = bool(carry)

    func csub(a: var BigInt, b: BigInt, ctl: bool): bool =
      ## In-place optional subtraction
      ##
      ## It is NOT constant-time and is intended
      ## only for compile-time precomputation
      ## of non-secret data.
      var borrow, diff: BaseType
      for i in 0 ..< a.limbs.len:
        let ai = BaseType(a.limbs[i])
        let bi = BaseType(b.limbs[i])
        subB(borrow, diff, ai, bi, borrow)
        if ctl:
          a.limbs[i] = SecretWord(diff)
      result = bool(borrow)

    func doubleMod(a: var BigInt, M: BigInt) =
      ## In-place modular double
      ## a -> 2a (mod M)
      ##
      ## It is NOT constant-time and is intended
      ## only for compile-time precomputation
      ## of non-secret data.
      var ctl = dbl(a)
      ctl = ctl or not a.csub(M, false)
      discard csub(a, M, ctl)

    func `<`(a, b: BigInt): bool =
      var diff, borrow: BaseType
      for i in 0 ..< a.limbs.len:
        subB(borrow, diff, BaseType(a.limbs[i]), BaseType(b.limbs[i]), borrow)
      result = bool borrow

    func shiftRight*(a: var BigInt, k: int) =
      ## Shift right by k.
      ##
      ## k MUST be less than the base word size (2^32 or 2^64)
      for i in 0 ..< a.limbs.len-1:
        a.limbs[i] = (a.limbs[i] shr k) or (a.limbs[i+1] shl (WordBitWidth - k))
      a.limbs[a.limbs.len-1] = a.limbs[a.limbs.len-1] shr k
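The half-word fallbacks above are straightforward to cross-check in a language with native big integers. A Python model of `addC`, `subB` and `mul` (same half-word decomposition; names mirror the Nim code, the edge-case values are illustrative):

```python
# Python model of the compile-time VM fallbacks (addC / subB / mul),
# built from half-words exactly like the Nim code, then cross-checked
# against Python's native big integers.
W = 64
HALF = W // 2
HALF_BASE = 1 << HALF
HALF_MASK = HALF_BASE - 1
WORD_MASK = (1 << W) - 1

def split(n):                  # (hi, lo) half-words of n
    return n >> HALF, n & HALF_MASK

def merge(hi, lo):             # reassemble a full word (wrapping)
    return ((hi << HALF) | lo) & WORD_MASK

def addC(a, b, cIn):           # (carryOut, sum) <- a + b + cIn
    cLo, rLo = split((a & HALF_MASK) + (b & HALF_MASK) + cIn)
    cHi, rHi = split((a >> HALF) + (b >> HALF) + cLo)
    return cHi, merge(rHi, rLo)

def subB(a, b, bIn):           # (borrowOut, diff) <- a - b - bIn
    noBorrowLo, rLo = split(HALF_BASE + (a & HALF_MASK) - (b & HALF_MASK) - bIn)
    noBorrowHi, rHi = split(HALF_BASE + (a >> HALF) - (b >> HALF) + noBorrowLo - 1)
    return 1 - noBorrowHi, merge(rHi, rLo)

def mul(u, v):                 # (hi, lo) <- u * v, schoolbook on half-words
    uh, ul = split(u); vh, vl = split(v)
    x0, x1, x2, x3 = ul*vl, ul*vh, uh*vl, uh*vh
    x1 += x0 >> HALF           # this can't carry
    x1 += x2                   # but this can
    if x1 > WORD_MASK:         # if carry, add it to x3
        x1 &= WORD_MASK
        x3 += HALF_BASE
    return (x3 + (x1 >> HALF)) & WORD_MASK, merge(x1, x0 & HALF_MASK)

# Cross-check against native big integers on word-sized edge cases
M = WORD_MASK
for a in (0, 1, M, M - 1, 0xDEADBEEFCAFEBABE):
    for b in (0, 1, M, 0x0123456789ABCDEF):
        c, s = addC(a, b, 1)
        assert (c << W) + s == a + b + 1
        brw, d = subB(a, b, 1)
        assert d == (a - b - 1) % (1 << W)
        assert brw == (1 if a - b - 1 < 0 else 0)
        hi, lo = mul(a, b)
        assert (hi << W) + lo == a * b
```

The all-ones inputs are the ones that exercise every carry path, which is exactly where VM fallbacks tend to diverge from the hardware primitives.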

As a second step, it would be interesting to have the full BigInt and at least Fp2 operations available at compile-time:

  • Modular inverse, to precompute division by a sextic non-residue
  • Modular exponentiation, to precompute the cubic root of unity
  • BigInt multiplication, to precompute p^m (with p the field modulus and m the extension degree) for square roots
  • BigInt addition, subtraction and multiplication for lattice decomposition/endomorphism acceleration #48
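The first two items reduce to a single modular exponentiation once BigInt arithmetic works at compile-time. A Python sketch, assuming the standard BLS12-381 base-field modulus and Fermat inversion (valid since p is prime, and p ≡ 1 (mod 3) so cube roots of unity exist):

```python
# Precomputing a modular inverse and a primitive cube root of unity in Fp.
# p is the BLS12-381 base field modulus; p ≡ 1 (mod 3), which is what makes
# the GLV endomorphism (x, y) -> (beta*x, y) possible on G1.
p = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab

def inv_mod(x, p):
    # Fermat's little theorem: x^(p-2) ≡ x^-1 (mod p) for prime p
    return pow(x, p - 2, p)

def cube_root_of_unity(p):
    # g^((p-1)/3) is a primitive cube root of unity whenever it isn't 1
    assert (p - 1) % 3 == 0
    for g in range(2, 100):
        beta = pow(g, (p - 1) // 3, p)
        if beta != 1:
            return beta
    raise ValueError("no suitable base found in range")

x = 0xDEADBEEF
assert x * inv_mod(x, p) % p == 1

beta = cube_root_of_unity(p)
assert beta != 1 and pow(beta, 3, p) == 1
assert (beta * beta + beta + 1) % p == 0  # minimal polynomial x² + x + 1
```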

[Alt Impl] LLVM builtin arbitrary fixed precision int

http://llvm.org/doxygen/APInt_8h_source.html

//===----------------------------------------------------------------------===//
 //                              APInt Class
 //===----------------------------------------------------------------------===//
 
 /// Class for arbitrary precision integers.
 ///
 /// APInt is a functional replacement for common case unsigned integer type like
 /// "unsigned", "unsigned long" or "uint64_t", but also allows non-byte-width
 /// integer sizes and large integer value types such as 3-bits, 15-bits, or more
 /// than 64-bits of precision. APInt provides a variety of arithmetic operators
 /// and methods to manipulate integer values of any bit-width. It supports both
 /// the typical integer arithmetic and comparison operations as well as bitwise
 /// manipulation.
 ///
 /// The class has several invariants worth noting:
 ///   * All bit, byte, and word positions are zero-based.
 ///   * Once the bit width is set, it doesn't change except by the Truncate,
 ///     SignExtend, or ZeroExtend operations.
 ///   * All binary operators must be on APInt instances of the same bit width.
 ///     Attempting to use these operators on instances with different bit
 ///     widths will yield an assertion.
 ///   * The value is stored canonically as an unsigned value. For operations
 ///     where it makes a difference, there are both signed and unsigned variants
 ///     of the operation. For example, sdiv and udiv. However, because the bit
 ///     widths must be the same, operations such as Mul and Add produce the same
 ///     results regardless of whether the values are interpreted as signed or
 ///     not.
 ///   * In general, the class tries to follow the style of computation that LLVM
 ///     uses in its IR. This simplifies its use for LLVM.

And https://llvm.org/docs/LangRef.html#integer-type

Composites Double-Add 2P+Q, tripling, quadrupling, quintupling, octupling

Computing 2P+Q may be useful for alternative scalar multiplications based on double-and-always-add.

For example Joye's ladder:

image
image

Computation cost according to Joye, 2007
image
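The ladder structure itself can be sanity-checked over plain integers, where "point addition" is integer addition. A Python sketch of Joye's double-add ladder (register update as in the 2007 paper; the result accumulates in R0, scanning bits LSB-first):

```python
# Joye's double-add ladder, modeled in the additive group of integers:
# scanning the scalar LSB-first, every step computes R[1-b] <- 2*R[1-b] + R[b],
# i.e. exactly one "2P+Q" composite per scalar bit, with no dummy operations.
def joye_ladder(k, P):
    R = [0, P]          # R0 accumulates the result, R1 starts at P
    for i in range(k.bit_length()):
        b = (k >> i) & 1
        R[1 - b] = 2 * R[1 - b] + R[b]
    return R[0]

for k in range(1, 1000):
    assert joye_ladder(k, 7) == 7 * k
```

Every iteration performs the same composite operation regardless of the bit value, which is why a fused 2P+Q formula would directly benefit this ladder.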

Papers with algorithm:

Fixed-base scalar mul via LSB set encoding

The GLV-SAC paper introduces an alternative scalar multiplication for a fixed base.
For example, signing is always done using the generator point, a fixed base.

  • Efficient and Secure Algorithms for GLV-Based Scalar
    Multiplication and their Implementation on GLV-GLS
    Curves (Extended Version)
    Armando Faz-Hernández, Patrick Longa, Ana H. Sánchez, 2013
    https://eprint.iacr.org/2013/158.pdf

This representation seems to be about twice as fast as a simple 4-way endomorphism decomposition (without additional windowing optimization).
image

Note: while it uses 8 times more memory, that memory is stored directly in the binary, not allocated in RAM.

Constant-time verification

Taken from the wiki page: https://github.com/mratsim/constantine/wiki/Constant-time-arithmetics

Randomized testing failure: EC Add G2

For BLS12-381, seed 1592673796, commit 89c78ef, similar to #62 (but with a different seed; for the life of me I can't find the original CI trigger)

image

Repro with -d:Constantine32

import
  # Standard library
  std/[unittest, times],
  # Internals
  ../constantine/config/[common, curves],
  ../constantine/[arithmetic, towers],
  ../constantine/io/[io_bigints, io_fields, io_towers],
  ../constantine/elliptic/[ec_weierstrass_affine, ec_weierstrass_projective]

proc trySetFromCoordsXandZ_debug*[F](P: var ECP_SWei_Proj[F], x, z: F): SecretBool =
  ## Try to create a point on the elliptic curve
  ## Y²Z = X³ + aXZ² + bZ³ (projective coordinates)
  ## y² = x³ + a x + b     (affine coordinate)
  ## return true and update `P` if `x` leads to a valid point
  ## return false otherwise, in that case `P` is undefined.
  ##
  ## Note: Dedicated robust procedures for hashing-to-curve
  ##       will be provided, this is intended for testing purposes.
  P.y.curve_eq_rhs(x)

  echo "P.y: ", P.y.toHex()
  echo "P.y.isSquare: ", bool P.y.isSquare
  result = sqrt_if_square(P.y)
  echo "P.y.wasSquare: ", bool result

  P.x.prod(x, z)
  P.y *= z
  P.z = z

var a, b, c: ECP_SWei_Proj[Fp2[BLS12_381]]

var ax, az, bx, bz, cx, cz: Fp2[BLS12_381]
ax.fromHex(
  c0 = "0x13d97382a3e097623d191172ec2972f3a4b436e24ae18f8394c9103a37c43b2747d5f7c597eff7bda406000000017ffd",
  c1 = "0x11eca90d537eabf01ead08dce5d4f63822941ce7255cc7bfc62483dceb5d148f23f7bfcaeb7f5ffccd767ff5ffffdffe"
)

az.fromHex(
  c0 = "0x15f65ec3fa7ce4935c071a97a256ec6d77ce385370513744df48944613b748b2a8e3bfdb035bfb7a7608ffc00002ff7c",
  c1 = "0x15f646c3fa80e4835bd70a57a196ac6d57ce1653705247455f48983753c758bae9f3800ba3ebeff024c8cbd78002fdfc"
)

bx.fromHex(
  c0 = "0x146e5ab3ea40d392d3868086a256ec2d524ce85345c237434ec0904f52d753b1ebf4000bc40c00026607fc000002fffc",
  c1 = "0x15f65ebfb267a4935007168f6256ec6d75c11633705252c55f489857437e08a2ebf3b7a7c40c000275e7fff9f0025ffa"
)

bz.fromHex(
  c0 = "0x0da4dec3fa76cb905c071a13a1d2c39906ce502d70085744df48985140be37fa6bd1ffdac407fff27608dfffde60fedc",
  c1 = "0x0df55883b636e29344071a7aa255dc6d25a258126bbe0a455b48985753c4377aeaf3a3f6c40c00027307ffb7ffbdefdc"
)

cx.fromHex(
  c0 = "0x11fcc7014aee3c2f1ead04bd25d8996fd29a1d71002e97bdca6d881d13ad1d937ff6ee83c8025feed202fffffbdcfffe",
  c1 = "0x09ee82982d80b1c7bf3e69b228ee461c30bce73d574478841da0bd7941294503292b7809222bfe7d4606f976400244d2"
)

cz.fromHex(
  c0 = "0x09ee82982d80b1c7bf3e69b228ee461c30bce73d574478841da0bd7941294503292b7809222bfe7d4606f976400244d2",
  c1 = "0x15f35eab6e70e2922b85d257a256ec6d43794851f05257452de3965753474ca66bf3f923c10bfe022d07d7f60000fffb"
)


doAssert bool a.trySetFromCoordsXandZ_debug(ax, az)
doAssert bool b.trySetFromCoordsXandZ_debug(bx, bz)
doAssert bool c.trySetFromCoordsXandZ_debug(cx, cz)

echo "a.x: ", a.x.toHex()
echo "a.y: ", a.y.toHex()
echo "a.z: ", a.z.toHex()
echo ""
echo "b.x: ", b.x.toHex()
echo "b.y: ", b.y.toHex()
echo "b.z: ", b.z.toHex()
echo ""
echo "c.x: ", c.x.toHex()
echo "c.y: ", c.y.toHex()
echo "c.z: ", c.z.toHex()

var tmp1{.noInit.}, tmp2{.noInit.}: ECP_SWei_Proj[Fp2[BLS12_381]]

# r0 = (a + b) + c
tmp1.sum(a, b)
tmp2.sum(tmp1, c)
let r0 = tmp2

# r1 = a + (b + c)
tmp1.sum(b, c)
tmp2.sum(a, tmp1)
let r1 = tmp2

# r2 = (a + c) + b
tmp1.sum(a, c)
tmp2.sum(tmp1, b)
let r2 = tmp2

# r3 = a + (c + b)
tmp1.sum(c, b)
tmp2.sum(a, tmp1)
let r3 = tmp2

# r4 = (c + a) + b
tmp1.sum(c, a)
tmp2.sum(tmp1, b)
let r4 = tmp2

# ...

doAssert bool(r0 == r1)
doAssert bool(r0 == r2)
doAssert bool(r0 == r3)
doAssert bool(r0 == r4)
P.y: Fp2(c0: 0x120dbe5ca6ec0769dda90bac03a42c27ffe46cfccaf61a66791de2e650d3f43f22d562853371e9e868dadba553e3aed8, c1: 0x0b377fc92bdd2d97ca38fb9f428fe092007c6c886672d48c8a833e15ab2b9ea6a8511db3d2be566585ea62005a40fe31)
P.y.isSquare: true
P.y.wasSquare: true
P.y: Fp2(c0: 0x05af52f65461f4255c7f977f3c46ae9aed4a4118d00db01ef65bce692bf4c77d5d359769af9259d9e1b0bb7c60fd174d, c1: 0x0b3ba68ed30db92744416d5e871d530ae918145a21fe73ce0bb9076ba784fe3e5f94369f91fb6fa961e4d37cf60905b4)
P.y.isSquare: true
P.y.wasSquare: true
P.y: Fp2(c0: 0x006b7a8f365b13ee0976b8af0af88f94afb80cd7b3e25e830d11da706a39dee9bcd8be64673b01c57df3655831c61cec, c1: 0x0b08dc0a88e9e38503004149dd4f2d823466b6934b2fef39ca2a7812935e74d28dd1824e2e4059d2e24ce5c9c76ab3bb)
P.y.isSquare: true
P.y.wasSquare: true
a.x: Fp2(c0: 0x1172a5b2a2adf5595988e1494780cf339b76b659bc79369a4198c3cb4198c322552d25452f163acab533ad886313d521, c1: 0x1339ccf93e7918bfaf29ec009b160580beb282f298b73c1975c8a502cd8fd50c562af936c97d288c873c93898a8b0381)
a.y: Fp2(c0: 0x0ca09b061afb3125a8767adcface64cd726f0badb6b0b7828ea6d4f68dc82e08d130df3721a721c3f0ad578a4366f551, c1: 0x0ab6755f2b1b31f52119c9112911eef23ed6c20229d19a9661c2d43ae1f31f228ac1140dc3b2e9c78b6d568f256fe910)
a.z: Fp2(c0: 0x15f65ec3fa7ce4935c071a97a256ec6d77ce385370513744df48944613b748b2a8e3bfdb035bfb7a7608ffc00002ff7c, c1: 0x15f646c3fa80e4835bd70a57a196ac6d57ce1653705247455f48983753c758bae9f3800ba3ebeff024c8cbd78002fdfc)

b.x: Fp2(c0: 0x05dd75183a66e6c8c72e5591ad2a26a994cf9d5f21a26bcd070530d41dfd3d5555ad8e691c5870e6caa53d819d347ac5, c1: 0x063d2e47437f947b1ae0c515eaace25a754a442a605cade51ae6d5bc85d15906100aa3e0d06389ec39d83db4242653e7)
b.y: Fp2(c0: 0x02bc5b99bd736e349eec338c34321b52568393918ccba4e24efd7477992756d88066a4e29f694e1fe3c62d265c278428, c1: 0x09cf0ea6f6f731f05169ea11e822b2b853b8b8c327fb11d457da3e7df33ef41f1dba958f928659778038fcb40b28b685)
b.z: Fp2(c0: 0x0da4dec3fa76cb905c071a13a1d2c39906ce502d70085744df48985140be37fa6bd1ffdac407fff27608dfffde60fedc, c1: 0x0df55883b636e29344071a7aa255dc6d25a258126bbe0a455b48985753c4377aeaf3a3f6c40c00027307ffb7ffbdefdc)

c.x: Fp2(c0: 0x01d96899d32dcd452de0c645a777add2d486731fa010931a8027fad5da6ea9f7bb1319b6fde0630806a3815a25f5af13, c1: 0x0d8366a9d06950de8407bcdca0c267fa2d699095de988603c1e79e374d80e77fb62176b3701ea61c8947d210a8a135d9)
c.y: Fp2(c0: 0x1238d8e83002c281b4468abf13da0d7186ae814e9d9af2c91642e6481a94c9400543bc8d56480bc7b74e6b3b080ba764, c1: 0x06cbef37390be486d23651b4e5e4eb07ce5f09c822aa8b1d93fec62eb0156ffae21d51870b763194c30b39c1d2079342)
c.z: Fp2(c0: 0x09ee82982d80b1c7bf3e69b228ee461c30bce73d574478841da0bd7941294503292b7809222bfe7d4606f976400244d2, c1: 0x15f35eab6e70e2922b85d257a256ec6d43794851f05257452de3965753474ca66bf3f923c10bfe022d07d7f60000fffb)
..../Programming/Nim/constantine/build/debug_g2.nim(109) debug_g2
..../.choosenim/toolchains/nim-#devel/lib/system/assertions.nim(29) failedAssertImpl
..../.choosenim/toolchains/nim-#devel/lib/system/assertions.nim(22) raiseAssert
..../.choosenim/toolchains/nim-#devel/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: ..../Programming/Nim/constantine/build/debug_g2.nim(109, 10) `bool(r0 == r1)`  [AssertionDefect]

Lattice decomposition cleanup

Endomorphism acceleration #44 requires decomposing a scalar.

This is done using lattice decomposition using Babai's rounding techniques.

  • Integrate in the property-based tests (requires clear cofactor #46)
  • The lattices should be moved in a per-curve-family configuration file:
    • const Lattice_BN254_Snarks_G1: array[2, array[2, tuple[b: BigInt[127], isNeg: bool]]] = [
        # Curve of order 254 -> mini scalars of size 127
        # u = 0x44E992B44A6909F1
        [(BigInt[127].fromHex"0x89d3256894d213e3", false),                  # 2u + 1
         (BigInt[127].fromHex"0x6f4d8248eeb859fd0be4e1541221250b", false)], # 6u² + 4u + 1
        [(BigInt[127].fromHex"0x6f4d8248eeb859fc8211bbeb7d4f1128", false),  # 6u² + 2u
         (BigInt[127].fromHex"0x89d3256894d213e3", true)]                   # -2u - 1
      ]

      const Babai_BN254_Snarks_G1 = [
        # Vector for Babai rounding
        BigInt[127].fromHex"0x89d3256894d213e3",                # 2u + 1
        BigInt[127].fromHex"0x6f4d8248eeb859fd0be4e1541221250b" # 6u² + 4u + 1
      ]
    • const Lattice_BLS12_381_G1: array[2, array[2, tuple[b: BigInt[128], isNeg: bool]]] = [
        # Curve of order 255 -> mini scalars of size 128
        # u = -0xd201000000010000
        [(BigInt[128].fromHex"0xac45a4010001a40200000000ffffffff", false), # u² - 1
         (BigInt[128].fromHex"0x1", true)],                                # -1
        [(BigInt[128].fromHex"0x1", false),                                # 1
         (BigInt[128].fromHex"0xac45a4010001a4020000000100000000", false)] # u²
      ]

      const Babai_BLS12_381_G1 = [
        # Vector for Babai rounding
        BigInt[128].fromHex"0xac45a4010001a4020000000100000000",
        BigInt[128].fromHex"0x1"
      ]
  • Instead of a separate "true"/"false" field, the BigInt can use an extra bit as a sign flag (still using two's complement), this is detailed in the paper
    image
    image
  • The encoding can use 1 bit per digit instead of 2 for {-1, 0, 1}, as the column sign information is already encoded in the very first row; this will also speed up recoding
    image
    image
  • The lattices can be stored in tuples instead of arrays so that the BigInts have the minimal size possible, to reduce the cost of BigInt multiplication
  • BLS has multiplication factors of 1 and/or 2 which can be optimized
  • The lattice and Babai roundings coefficient can be precomputed from the curve parameter instead of being hardcoded
  • The sage script should be fixed:
    def scalarMulGLV(scalar, P0):
        m = 2
        L = ((int(r).bit_length() + m-1) // m) + 1 # l = ⌈log2 r/m⌉ + 1
        print('L: ' + str(L))
        print('scalar: ' + Integer(scalar).hex())

        k0, k1 = getGLV2_decomp(scalar)
        print('k0: ' + k0.hex())
        print('k1: ' + k1.hex())

        P1 = (lambda1_r % r) * P0
        (Px, Py, Pz) = P0
        P1_endo = G1([Px*phi1 % p, Py, Pz])
        assert P1 == P1_endo

        expected = scalar * P0
        decomp = k0*P0 + k1*P1
        assert expected == decomp

        print('------ recode scalar -----------')
        even = k0 & 1 == 1
        if even:
            k0 -= 1

        b = recodeScalars([k0, k1])
        print('b0: ' + str(list(reversed(b[0]))))
        print('b1: ' + str(list(reversed(b[1]))))

        print('------------ lut ---------------')
        lut = buildLut(P0, P1)

        print('------------ mul ---------------')
        print('b0 L-1: ' + str(b[0][L-1]))
        Q = b[0][L-1] * lut[b[1][L-1] & 1]
        for i in range(L-2, -1, -1):
            Q *= 2
            Q += b[0][i] * lut[b[1][i] & 1]
        if even:
            Q += P0

        print('final Q: ' + pointToString(Q))
        print('expected: ' + pointToString(expected))
        assert Q == expected # TODO debug
  • We should not use signed integers, as they introduce overflow checks which might leak secret bits
  • Some buffers don't need to be zero-initialized
  • Use window method to further accelerate 2-dimensional decompositions (#45)
  • Use mixed coordinates in the multiplication loop via an intermediate simultaneous inversion (#75)
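The decomposition itself is plain linear algebra. A toy Python sketch of Babai rounding on the GLV lattice {(a, b) : a + b·λ ≡ 0 (mod n)}, with an illustrative small modulus n = 1009 and Lagrange-Gauss reduction standing in for the precomputed short basis:

```python
# Babai rounding for 2-dimensional GLV decomposition, on a toy modulus.
# Decomposes k into (k0, k1) with k0 + k1*λ ≡ k (mod n) and k0, k1 small.
from fractions import Fraction

n = 1009                                  # toy prime, n ≡ 1 (mod 3)
lam = next(l for l in range(2, n) if (l*l + l + 1) % n == 0)  # cube root of unity

def gauss_reduce(u, v):
    # Lagrange-Gauss reduction: returns a shortest basis of the rank-2 lattice
    while True:
        if u[0]*u[0] + u[1]*u[1] > v[0]*v[0] + v[1]*v[1]:
            u, v = v, u
        m = round(Fraction(u[0]*v[0] + u[1]*v[1], u[0]*u[0] + u[1]*u[1]))
        if m == 0:
            return u, v
        v = (v[0] - m*u[0], v[1] - m*u[1])

# Lattice of vectors (a, b) with a + b*lam ≡ 0 (mod n)
u, v = gauss_reduce((n, 0), (-lam, 1))

def decompose(k):
    # Babai rounding: solve (k, 0) = β1·u + β2·v over ℚ, round to integers,
    # and keep the short residual as the mini-scalars.
    det = u[0]*v[1] - u[1]*v[0]
    b1 = round(Fraction(k * v[1], det))
    b2 = round(Fraction(-k * u[1], det))
    return k - b1*u[0] - b2*v[0], -b1*u[1] - b2*v[1]

for k in range(n):
    k0, k1 = decompose(k)
    assert (k0 + k1*lam - k) % n == 0       # correctness
    assert abs(k0) < 100 and abs(k1) < 100  # mini-scalars are short (~√n)
```

In Constantine the short basis and the Babai vector are the precomputed per-curve constants above, so only the roundings and two small multiplications remain at runtime.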

Formal verification

Given that Constantine aims to be used for elliptic curve cryptography, it should be proven bug-free.

Traditional model checkers like TLA+ or Spin are better suited to formally verifying distributed consensus protocols or concurrent data structures.

However, the Galois company offers SAW, a formal verifier that supports C and has been used for AES, SHA and ECDSA formal verification: https://saw.galois.com/. It is based on Z3: https://github.com/GaloisInc/saw-script

Reactivate fast squaring algorithms

Commit 2971965
deactivates the fast path for generic squaring as there is an off-by-one on 32-bit with inputs from the test suite (from #61 #62):

suite "Modular squaring - bugs highlighted by property-based testing":
  test "a² == (-a)² on for Fp[2^127 - 1] - #61":
    var a{.noInit.}: Fp[Mersenne127]
    a.fromHex"0x75bfffefbfffffff7fd9dfd800000000"
    var na{.noInit.}: Fp[Mersenne127]
    na.neg(a)

    a.square()
    na.square()
    check:
      bool(a == na)

    var a2{.noInit.}, na2{.noInit.}: Fp[Mersenne127]
    a2.fromHex"0x75bfffefbfffffff7fd9dfd800000000"
    na2.neg(a2)
    a2 *= a2
    na2 *= na2
    check:
      bool(a2 == na2)
      bool(a2 == a)
      bool(a2 == na)

  test "a² == (-a)² on for Fp[2^127 - 1] - #62":
    var a{.noInit.}: Fp[Mersenne127]
    a.fromHex"0x7ff7ffffffffffff1dfb7fafc0000000"
    var na{.noInit.}: Fp[Mersenne127]
    na.neg(a)

    a.square()
    na.square()
    check:
      bool(a == na)

    var a2{.noInit.}, na2{.noInit.}: Fp[Mersenne127]
    a2.fromHex"0x7ff7ffffffffffff1dfb7fafc0000000"
    na2.neg(a2)
    a2 *= a2
    na2 *= na2
    check:
      bool(a2 == na2)
      bool(a2 == a)
      bool(a2 == na)
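The invariant exercised by these tests is easy to replay against native big integers; a quick Python check of the two failing inputs:

```python
# Replaying the #61/#62 property with Python's big ints:
# squaring a and -a mod p = 2^127 - 1 must give the same result.
p = 2**127 - 1
for a_hex in ("0x75bfffefbfffffff7fd9dfd800000000",
              "0x7ff7ffffffffffff1dfb7fafc0000000"):
    a = int(a_hex, 16)
    assert pow(a, 2, p) == pow(p - a, 2, p)
```

Since the mathematical property holds, the off-by-one has to live in the optimized squaring path itself, not in the test.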

To be reactivated:

func montySquare_CIOS(r: var Limbs, a, M: Limbs, m0ninv: BaseType) =
  ## Montgomery Multiplication using Coarse Grained Operand Scanning (CIOS)
  ##
  ## Architectural Support for Long Integer Modulo Arithmetic on Risc-Based Smart Cards
  ## Johann Großschädl, 2003
  ## https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=95950BAC26A728114431C0C7B425E022?doi=10.1.1.115.3276&rep=rep1&type=pdf
  ##
  ## Analyzing and Comparing Montgomery Multiplication Algorithms
  ## Koc, Acar, Kaliski, 1996
  ## https://www.semanticscholar.org/paper/Analyzing-and-comparing-Montgomery-multiplication-Ko%C3%A7-Acar/5e3941ff482ec3ee41dc53c3298f0be085c69483

  # TODO: Deactivated
  # Off-by-one on 32-bit for Fp[2^127 - 1] with inputs
  # - -0x75bfffefbfffffff7fd9dfd800000000
  # - -0x7ff7ffffffffffff1dfb7fafc0000000
  # Squaring the number and its opposite
  # should give the same result, but those are off-by-one

  # We want all the computation to be kept in registers
  # hence we use a temporary `t`, hoping that the compiler does it.
  var z: typeof(r) # zero-init
  const L = z.len
  # Extra words to handle up to 2 carries t[N] and t[N+1]
  var zLp1: SecretWord
  var zL: SecretWord

  staticFor i, 0, L:
    # Squaring
    var t: Carry
    var u, v: SecretWord
    # (u, v) <- a[i] * a[i] + z[i]
    muladd1(u, v, a[i], a[i], z[i])
    z[i] = v
    staticFor j, i+1, L:
      # (t, u, v) <- 2*a[j]*a[i] + z[j] + (t, u)
      # 2*a[j]*a[i] can spill 1-bit on a 3rd word
      mulDoubleAdd2(t, u, v, a[j], a[i], z[j], t, u)
      z[j] = v
    block:
      # (u, v) <- zL + (t, u)
      # zL   <- v
      # zL+1 <- u
      var C: Carry
      addC(C, v, zL, u, Carry(0))
      addC(C, u, zLp1, SecretWord(t), C)
      zL = v
      zLp1 = u

    # Reduction
    # m <- (z[0] * m0ninv) mod 2^w
    # (u, v) <- m * M[0] + z[0]
    let m = z[0] * SecretWord(m0ninv)
    muladd1(u, v, m, M[0], z[0])
    staticFor j, 1, L:
      # (u, v)  <- m*M[j] + z[j] + u
      # z[j-1]  <- v
      muladd2(u, v, m, M[j], z[j], u)
      z[j-1] = v
    block:
      # (u, v)  <- zL + u
      # z[L-1]  <- v
      # z[L]    <- zL+1 + u
      var C: Carry
      addC(C, v, zL, u, Carry(0))
      z[L-1] = v
      addC(C, zL, zLp1, Zero, C)

  discard z.csub(M, zL.isNonZero() or not(z < M)) # TODO: (z >= M) is unnecessary for prime in the form (2^64)^w - 1
  r = z

Trigger edge-cases with randomized testing

Constantine currently uses randomized testing against a reference implementation (GMP) and property-based testing, which uncovered edge cases (tagged heisenbugs):

We need to trigger those edge cases more reliably.
5 years ago, on the release of libsecp256k1, the Bitcoin developers described a technique they use to test the library, which uncovered a carry bug in OpenSSL (CVE-2014-3570). The same kind of carry bug was deemed only discoverable by auditing in TweetNaCl (https://gist.github.com/CodesInChaos/8374632), as it happens with probability 2^-60.

The way it works is having the RNG produce long runs of 1s or 0s to trigger carries which, with uniformly random 64-bit words, would only happen with probability around 2^-64 * 2^-64 (?).

The incorrectly squared numbers would be expected to be found randomly with probability around one in 2^128, and so when one of the reference implementations of ed25519 had a very similar mistake some described it as "a bug that can only be found by auditing, not by randomized tests". But when we found this we weren't auditing OpenSSL (the issue was buried deep in optimized code). Our tests used specially formed random numbers that were intended to explore a class of rare corner cases, a technique I'd previously used in the development of the Opus audio codec. Since our 'random' testing in libsecp256k1 was good enough to find one-in-a-{number too big to name} chance bugs which could "only be found by auditing", I'm at least a little bit proud of the work we've been doing there. (Obviously, we also use many other approaches than random testing on our own code.)

Part of our testing involves an automatic test harness that verifies agreement of our code with other implementations with random test data, including OpenSSL; though to reduce the chance of an uncontrolled disclosure we backed out some of the low level testing after discovering this bug.

This randomized testing revealed the flaw in OpenSSL. I suppose it's also a nice example of how statistical reasoning (p=2^-128 that a uniform input triggers the misbehaviour) doesn't itself express risk, since while we've put enormous amounts of CPU cycles into tests we've certainly not done 2^128 work. In this case the reason our testing revealed the issue was because we used non-uniform numbers specifically constructed with low transition probability to better trigger improbable branches like carry bugs (https://github.com/bitcoin/secp256k1/blob/master/src/testrand_impl.h#L45). I used the same technique in the development of the Opus audio codec to good effect.

(Whitebox fuzzing tools like Klee or AFL, or branch coverage analysis, while good tools, seem to not be as effective for errors where the code completely omits a condition rather than having a wrong condition which was not exercised by tests.)
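A minimal Python sketch of such a biased RNG (the 1/16 transition probability is an illustrative choice, not the exact distribution secp256k1 uses):

```python
# Non-uniform test RNG: generate words as runs of identical bits with a low
# transition probability, so carry/borrow chains that need long runs of 1s
# (or 0s) are hit often instead of with ~2^-64 probability.
import random

def biased_bits(nbits, rng, p_flip=1/16):
    # Each bit copies the previous one with probability 1 - p_flip
    bit = rng.randrange(2)
    val = 0
    for _ in range(nbits):
        if rng.random() < p_flip:
            bit ^= 1
        val = (val << 1) | bit
    return val

def transitions(x, nbits=64):
    # Number of adjacent bit-pairs that differ
    return bin((x ^ (x >> 1)) & ((1 << (nbits - 1)) - 1)).count("1")

rng = random.Random(2014)  # fixed seed for reproducibility
samples = [biased_bits(64, rng) for _ in range(500)]
assert all(0 <= s < 2**64 for s in samples)

# Uniform 64-bit words average ~31.5 transitions; these average ~63/16 ≈ 4
avg = sum(transitions(s) for s in samples) / len(samples)
assert avg < 12
```

Feeding such values into both Constantine and GMP, then comparing results, is exactly the kind of differential test that caught CVE-2014-3570.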

Looking into the Fiat-Crypto paper (formally verified crypto primitives), a significant fraction of bugs in cryptographic libraries are related to carries: http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf

image
image

scalar mul: consider using incomplete addition/doubling

In scalar multiplication we use the complete formulas from

to handle the infinity point, or adding P or its opposite to itself, in a constant-time fashion.

However, infinity can be checked and conditionally copied once at the end instead of paying that cost on each doubling/addition,
and my intuition is that adding P/-P to itself is not possible in a scalar multiplication context (proof?).

From the paper, the complete formulas carry a 40% overhead, hence we might be able to significantly increase signing speed if those assumptions hold.
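The "check once at the end" pattern relies on a branchless conditional copy. A word-level Python sketch of the usual mask trick (illustrative, not Constantine's API):

```python
# Branchless conditional select: mask = -(condition) is all-ones or all-zeros,
# so the result is `a` when condition is 1 and `b` when it is 0. No branch,
# hence no secret-dependent timing at the word level.
W = (1 << 64) - 1   # 64-bit word mask

def ct_select(condition, a, b):
    mask = (-condition) & W
    return b ^ (mask & (a ^ b))

assert ct_select(1, 0x1234, 0x5678) == 0x1234
assert ct_select(0, 0x1234, 0x5678) == 0x5678
```

Applied per limb, this lets the fast incomplete formulas run unconditionally while a single conditional copy fixes up the infinity case after the loop.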

Fast clear cofactor

Unless the cofactor is 1 (as for BN254 curves), we are unfortunately working in a subgroup of an elliptic curve.

This means that when generating a random point for testing we may be generating a point out of our subgroup of interest.

In particular for scalar multiplication accelerated by endomorphism, the point MUST be on the subgroup or the result is incorrect.

A simple way to generate a point in the proper subgroup is to scalar multiply a random point by the cofactor of the curve:

  • Q: can we use GLV multiplication for that?

More efficient ways exist and are detailed in the IETF hash-to-curve draft https://tools.ietf.org/html/draft-irtf-cfrg-hash-to-curve-08#section-7 and in (Wahby, Boneh, 2019, Fast and simple constant-time hashing to the BLS12-381 elliptic curve, https://eprint.iacr.org/2019/403.pdf). For BLS G1 in particular, we can simply multiply by 1-u, with u the BLS parameter.

Note for compatibility: when a fast cofactor clearing method exists, it is usually incompatible with the "normal" scalar multiplication by the actual cofactor. As cofactor clearing is only used in 2 cases, random testing and hash-to-curve, we should implement the hash-to-curve version.

For G2 see https://github.com/status-im/nim-blscurve/blob/1a18d0db/blscurve/hash_to_curve.nim#L454-L512

Sage implementation: https://github.com/cfrg/draft-irtf-cfrg-hash-to-curve/blob/ead9c911/poc/clear_h_bls12381g2.sage
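The baseline method can be illustrated end-to-end on a toy curve. A Python sketch using the textbook curve y² = x³ + x + 1 over 𝔽23 (illustrative parameters; the group order and its factorization are found by brute force):

```python
# Cofactor clearing on a toy curve y^2 = x^3 + x + 1 over F_23:
# multiplying any point by the cofactor h = N // r forces it into the
# subgroup of largest prime order r, since r * (h * P) = N * P = O.
p, a, b = 23, 1, 1
O = None  # point at infinity

def add(P, Q):
    # Affine short-Weierstrass addition
    if P is O: return Q
    if Q is O: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O
    if P == Q:
        lam = (3*x1*x1 + a) * pow(2*y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam*lam - x1 - x2) % p
    return (x3, (lam*(x1 - x3) - y1) % p)

def mul(k, P):
    # Double-and-add (not constant-time: fine for a toy demo)
    R = O
    while k:
        if k & 1: R = add(R, P)
        P = add(P, P); k >>= 1
    return R

points = [(x, y) for x in range(p) for y in range(p)
          if (y*y - (x**3 + a*x + b)) % p == 0]
N = len(points) + 1          # group order, counting infinity
r = max(q for q in range(2, N + 1) if N % q == 0
        and all(q % d for d in range(2, q)))  # largest prime factor
h = N // r                   # cofactor

for P in points:
    Q = mul(h, P)            # clear the cofactor
    assert mul(r, Q) is O    # Q is now in the order-r subgroup
```

The fast methods replace the generic `mul(h, P)` with a curve-specific shortcut (e.g. multiplication by 1-u for BLS G1), which lands in the same subgroup but is not the same map as cofactor multiplication.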
