
Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, and C, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐

Home Page: https://ashvardanian.com/posts/simsimd-faster-scipy/

License: Apache License 2.0


SimSIMD 📏

SimSIMD banner

Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geo-Spatial Analysis, and Information Retrieval. These algorithms generally have linear complexity in time, constant complexity in space, and are data-parallel. In other words, they are easily parallelizable and vectorizable, and are available in packages like BLAS and LAPACK, as well as the higher-level numpy and scipy Python libraries. Ironically, even after decades of evolution in compilers and numerical computing, most libraries can be 3-200x slower than the hardware potential, even on the most popular hardware, like 64-bit x86 and Arm CPUs. SimSIMD attempts to fill that gap. 1️⃣ SimSIMD functions are practically as fast as memcpy. 2️⃣ SimSIMD compiles to more platforms than NumPy (105 vs 35) and has more backends than most BLAS implementations.

Features

SimSIMD provides over 100 SIMD-optimized kernels for various distance and similarity measures, accelerating search in USearch and several DBMS products. Implemented distance functions include:

  • Euclidean (L2) and Cosine (Angular) spatial distances for Vector Search.
  • Dot-Products for real & complex vectors for DSP & Quantum computing.
  • Hamming (~ Manhattan) and Jaccard (~ Tanimoto) bit-level distances.
  • Kullback-Leibler and Jensen–Shannon divergences for probability distributions.
  • Haversine and Vincenty's formulae for Geospatial Analysis.
  • For Levenshtein, Needleman–Wunsch and other text metrics, check StringZilla.

Moreover, SimSIMD...

  • handles f64, f32, and f16 real & complex vectors.
  • handles i8 integral and b8 binary vectors.
  • is a zero-dependency, header-only C99 library.
  • has bindings for Python, Rust and JavaScript.
  • has Arm backends for NEON and Scalable Vector Extensions (SVE).
  • has x86 backends for Haswell, Skylake, Ice Lake, and Sapphire Rapids.

Due to the high level of fragmentation of SIMD support across x86 CPUs, SimSIMD uses the names of select Intel CPU generations for its backends. They also work on AMD CPUs, however: Intel Haswell kernels are compatible with AMD Zen 1/2/3, while AMD's Zen 4 (Genoa) covers the AVX-512 instructions added in Intel Skylake and Ice Lake. You can learn more about the technical implementation details in the author's blog posts.

Benchmarks

Against NumPy and SciPy

Given 1000 embeddings from OpenAI Ada API with 1536 dimensions, running on the Apple M2 Pro Arm CPU with NEON support, here's how SimSIMD performs against conventional methods:

Kind                      | f32 improvement | f16 improvement | i8 improvement | Conventional method                  | SimSIMD
--------------------------|-----------------|-----------------|----------------|--------------------------------------|--------------
Inner Product             | 2 x             | 9 x             | 18 x           | numpy.inner                          | inner
Cosine Distance           | 32 x            | 79 x            | 133 x          | scipy.spatial.distance.cosine        | cosine
Euclidean Distance ²      | 5 x             | 26 x            | 17 x           | scipy.spatial.distance.sqeuclidean   | sqeuclidean
Jensen-Shannon Divergence | 31 x            | 53 x            | -              | scipy.spatial.distance.jensenshannon | jensenshannon

Against GCC Auto-Vectorization

On the Intel Sapphire Rapids platform, SimSIMD was benchmarked against auto-vectorized code using GCC 12. GCC handles single-precision float but might not be the best choice for int8 and _Float16 arrays, which have been part of the C language since 2011.

Kind                      | GCC 12 f32 | GCC 12 f16 | SimSIMD f16 | f16 improvement
--------------------------|------------|------------|-------------|----------------
Inner Product             | 3,810 K/s  | 192 K/s    | 5,990 K/s   | 31 x
Cosine Distance           | 3,280 K/s  | 336 K/s    | 6,880 K/s   | 20 x
Euclidean Distance ²      | 4,620 K/s  | 147 K/s    | 5,320 K/s   | 36 x
Jensen-Shannon Divergence | 1,180 K/s  | 18 K/s     | 2,140 K/s   | 118 x
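
For context, auto-vectorized baselines of this kind are usually plain reduction loops like the sketch below. The snippet is illustrative, not the actual benchmark harness, and the compiler flags are an assumption:

#include <stddef.h>

// Compiled with something like `gcc -O3 -march=sapphirerapids -ffast-math`,
// GCC can vectorize this reduction loop on its own.
float dot_f32(float const* a, float const* b, size_t n) {
    float sum = 0;
    for (size_t i = 0; i != n; ++i)
        sum += a[i] * b[i];
    return sum;
}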

Broader Benchmarking Results:

Using SimSIMD in Python

The package is intended to replace the usage of numpy.inner, numpy.dot, and scipy.spatial.distance. Aside from drastic performance improvements, SimSIMD significantly improves accuracy in mixed precision setups. NumPy and SciPy, processing i8 or f16 vectors, will use the same types for accumulators, while SimSIMD can combine i8 enumeration, i16 multiplication, and i32 accumulation to avoid overflows entirely. The same applies to processing f16 values with f32 precision.
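
A minimal sketch of why that matters, relying on NumPy's documented behavior of keeping the input dtype for integer products and sums (the first printed value is a wrap-around artifact):

import numpy as np

a = np.full(1536, 100, dtype=np.int8)
b = np.full(1536, 100, dtype=np.int8)

# NumPy multiplies and accumulates in i8, so every 100 * 100 product overflows
print(np.inner(a, b))  # wrapped-around garbage
# Upcasting first avoids the overflow, at the cost of extra memory traffic
print(np.inner(a.astype(np.int32), b.astype(np.int32)))  # 15360000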

Installation

Use the following snippet to install SimSIMD and list the hardware acceleration options available on your machine:

pip install simsimd
python -c "import simsimd; print(simsimd.get_capabilities())"

One-to-One Distance

import simsimd
import numpy as np

vec1 = np.random.randn(1536).astype(np.float32)
vec2 = np.random.randn(1536).astype(np.float32)
dist = simsimd.cosine(vec1, vec2)

Supported functions include cosine, inner, sqeuclidean, hamming, and jaccard. Dot products are supported for both real and complex numbers:

vec1 = np.random.randn(768).astype(np.float64) + 1j * np.random.randn(768).astype(np.float64)
vec2 = np.random.randn(768).astype(np.float64) + 1j * np.random.randn(768).astype(np.float64)

dist = simsimd.dot(vec1.astype(np.complex128), vec2.astype(np.complex128))
dist = simsimd.dot(vec1.astype(np.complex64), vec2.astype(np.complex64))
dist = simsimd.vdot(vec1.astype(np.complex64), vec2.astype(np.complex64)) # conjugate, same as `np.vdot`

Unlike SciPy, SimSIMD allows explicitly stating the precision of the input vectors, which is especially useful for mixed-precision setups.

dist = simsimd.cosine(vec1, vec2, "i8")
dist = simsimd.cosine(vec1, vec2, "f16")
dist = simsimd.cosine(vec1, vec2, "f32")
dist = simsimd.cosine(vec1, vec2, "f64")

It also allows using SimSIMD for half-precision complex numbers, which NumPy does not support. For that, view the data as contiguous even-length np.float16 vectors and override the type resolution with the complex32 string.

vec1 = np.random.randn(1536).astype(np.float16)
vec2 = np.random.randn(1536).astype(np.float16)
simsimd.dot(vec1, vec2, "complex32")
simsimd.vdot(vec1, vec2, "complex32")

One-to-Many Distances

Every distance function can be used not only for one-to-one but also one-to-many and many-to-many distance calculations. For one-to-many:

vec1 = np.random.randn(1536).astype(np.float32) # rank 1 tensor
batch1 = np.random.randn(1, 1536).astype(np.float32) # rank 2 tensor
batch2 = np.random.randn(100, 1536).astype(np.float32)

dist_rank1 = simsimd.cosine(vec1, batch2)
dist_rank2 = simsimd.cosine(batch1, batch2)

Many-to-Many Distances

All distance functions in SimSIMD can be used to compute many-to-many distances. For two batches of 100 vectors each, computing 100 pairwise distances looks like this:

batch1 = np.random.randn(100, 1536).astype(np.float32)
batch2 = np.random.randn(100, 1536).astype(np.float32)
dist = simsimd.cosine(batch1, batch2)

Input matrices must have identical shapes. This functionality isn't natively present in NumPy or SciPy, and generally requires creating intermediate arrays, which is inefficient and memory-consuming.
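
For comparison, a pure NumPy sketch of the same row-wise computation; it materializes intermediate arrays for the products and norms, which is exactly the overhead SimSIMD sidesteps:

import numpy as np

def cosine_rowwise(batch1: np.ndarray, batch2: np.ndarray) -> np.ndarray:
    # One distance per row pair, via intermediate arrays for dots and norms
    dots = np.einsum("ij,ij->i", batch1, batch2)
    norms = np.linalg.norm(batch1, axis=1) * np.linalg.norm(batch2, axis=1)
    return 1.0 - dots / norms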

Many-to-Many All-Pairs Distances

One can use SimSIMD to compute distances between all possible pairs of rows across two matrices (akin to scipy.spatial.distance.cdist). The resulting object will have the type DistancesTensor and is zero-copy compatible with NumPy and other libraries. For two arrays of 10 and 1,000 entries, the resulting tensor will have 10,000 cells:

import numpy as np
from simsimd import cdist, DistancesTensor

matrix1 = np.random.randn(1000, 1536).astype(np.float32)
matrix2 = np.random.randn(10, 1536).astype(np.float32)
distances: DistancesTensor = cdist(matrix1, matrix2, metric="cosine") # zero-copy
distances_array: np.ndarray = np.array(distances, copy=True) # now managed by NumPy

Multithreading

By default, computations use a single CPU core. To optimize and utilize all CPU cores on Linux systems, add the threads=0 argument. Alternatively, specify a custom number of threads:

distances = simsimd.cdist(matrix1, matrix2, metric="cosine", threads=0)

Using Python API with USearch

Want to use it in Python with USearch? You can wrap the raw C function pointers of SimSIMD backends into a CompiledMetric and pass it to USearch, similar to how it handles Numba's JIT-compiled code.

from usearch.index import Index, CompiledMetric, MetricKind, MetricSignature
from simsimd import pointer_to_sqeuclidean, pointer_to_cosine, pointer_to_inner

metric = CompiledMetric(
    pointer=pointer_to_cosine("f16"),
    kind=MetricKind.Cos,
    signature=MetricSignature.ArrayArraySize,
)

index = Index(256, metric=metric)
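
From there, the index behaves like any other USearch index; a rough usage sketch, assuming USearch's usual add and search methods:

import numpy as np

# Continues the example above: the index holds 256-dimensional f16 vectors
vectors = np.random.randn(100, 256).astype(np.float16)
for key, vector in enumerate(vectors):
    index.add(key, vector)

matches = index.search(vectors[42], 10)  # 10 approximate nearest neighbors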

Using SimSIMD in Rust

To install, add the following to your Cargo.toml:

[dependencies]
simsimd = "..."

Before using the SimSIMD library, ensure you have imported the necessary traits and types into your Rust source file. The library provides several traits for different distance/similarity kinds - SpatialSimilarity, BinarySimilarity, and ProbabilitySimilarity.

Spatial Similarity: Cosine and Euclidean Distances

use simsimd::SpatialSimilarity;

fn main() {
    let vector_a: Vec<f32> = vec![1.0, 2.0, 3.0];
    let vector_b: Vec<f32> = vec![4.0, 5.0, 6.0];

    // Compute the cosine similarity between vector_a and vector_b
    let cosine_similarity = f32::cosine(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Cosine Similarity: {}", cosine_similarity);

    // Compute the squared Euclidean distance between vector_a and vector_b
    let sq_euclidean_distance = f32::sqeuclidean(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Squared Euclidean Distance: {}", sq_euclidean_distance);
}

Spatial similarity functions are available for f64, f32, f16, and i8 types.
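
Quantized vectors work through the same trait; a small sketch, assuming the i8 implementation listed above:

use simsimd::SpatialSimilarity;

fn main() {
    let vector_a: Vec<i8> = vec![10, 20, 30];
    let vector_b: Vec<i8> = vec![40, 50, 60];

    // Same trait method, now on quantized 8-bit vectors
    let cosine_distance = i8::cosine(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Cosine Distance: {}", cosine_distance);
}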

Dot-Products: Inner and Complex Inner Products

use simsimd::SpatialSimilarity;
use simsimd::ComplexProducts;

fn main() {
    let vector_a: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
    let vector_b: Vec<f32> = vec![5.0, 6.0, 7.0, 8.0];

    // Compute the inner product between vector_a and vector_b
    let inner_product = SpatialSimilarity::dot(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Inner Product: {}", inner_product);

    // Compute the complex inner product between complex_vector_a and complex_vector_b
    let complex_inner_product = ComplexProducts::dot(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    let complex_conjugate_inner_product = ComplexProducts::vdot(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Complex Inner Product: {:?}", complex_inner_product); // -18, 69
    println!("Complex C. Inner Product: {:?}", complex_conjugate_inner_product); // 70, -8
}

Complex inner products are available for f64, f32, and f16 types.

Probability Distributions: Jensen-Shannon and Kullback-Leibler Divergences

use simsimd::ProbabilitySimilarity;

fn main() {
    // Divergences expect probability distributions, so the inputs sum to 1
    let vector_a: Vec<f32> = vec![0.1, 0.3, 0.6];
    let vector_b: Vec<f32> = vec![0.2, 0.3, 0.5];

    let jensen_shannon_divergence = f32::jensenshannon(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Jensen-Shannon Divergence: {}", jensen_shannon_divergence);

    let kullback_leibler_divergence = f32::kullbackleibler(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Kullback-Leibler Divergence: {}", kullback_leibler_divergence);
}

Probability similarity functions are available for f64, f32, and f16 types.

Binary Similarity: Hamming and Jaccard Distances

Similar to spatial distances, one can compute bit-level distance functions between slices of unsigned integers:

use simsimd::BinarySimilarity;

fn main() {
    let vector_a = &[0b11110000, 0b00001111, 0b10101010];
    let vector_b = &[0b11110000, 0b00001111, 0b01010101];

    // Compute the Hamming distance between vector_a and vector_b
    let hamming_distance = u8::hamming(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Hamming Distance: {}", hamming_distance);

    // Compute the Jaccard distance between vector_a and vector_b
    let jaccard_distance = u8::jaccard(&vector_a, &vector_b)
        .expect("Vectors must be of the same length");

    println!("Jaccard Distance: {}", jaccard_distance);
}

Binary similarity functions are available only for u8 types.

Half-Precision Floating-Point Numbers

Rust has no native support for half-precision floating-point numbers, but SimSIMD provides a f16 type. It has no functionality - it is a transparent wrapper around u16 and can be used with half or any other half-precision library.

use simsimd::SpatialSimilarity;
use simsimd::f16 as SimF16;
use half::f16 as HalfF16;

fn main() {
    let vector_a: Vec<HalfF16> = vec![1.0, 2.0, 3.0].into_iter().map(HalfF16::from_f32).collect();
    let vector_b: Vec<HalfF16> = vec![4.0, 5.0, 6.0].into_iter().map(HalfF16::from_f32).collect();

    // Reinterpret the `half::f16` buffers as SimSIMD's transparent `f16` wrapper
    let buffer_a: &[SimF16] = unsafe { std::slice::from_raw_parts(vector_a.as_ptr() as *const SimF16, vector_a.len()) };
    let buffer_b: &[SimF16] = unsafe { std::slice::from_raw_parts(vector_b.as_ptr() as *const SimF16, vector_b.len()) };

    // Compute the cosine similarity between the two half-precision vectors
    let cosine_similarity = SimF16::cosine(buffer_a, buffer_b)
        .expect("Vectors must be of the same length");

    println!("Cosine Similarity: {}", cosine_similarity);
}

Dynamic Dispatch

SimSIMD provides a dynamic dispatch mechanism to select the most advanced micro-kernel for the current CPU. You can query supported backends and use the SimSIMD::capabilities function to select the best one.

println!("uses neon: {}", capabilties::uses_neon());
println!("uses sve: {}", capabilties::uses_sve());
println!("uses haswell: {}", capabilties::uses_haswell());
println!("uses skylake: {}", capabilties::uses_skylake());
println!("uses ice: {}", capabilties::uses_ice());
println!("uses sapphire: {}", capabilties::uses_sapphire());

Using SimSIMD in JavaScript

To install, choose one of the following options depending on your environment:

  • npm install --save simsimd
  • yarn add simsimd
  • pnpm add simsimd
  • bun install simsimd

The package is distributed with prebuilt binaries, but if your platform is not supported, you can build the package from source via npm run build. This will happen automatically unless you install the package with the --ignore-scripts flag or use Bun. After installing, you will be able to call SimSIMD functions on various TypedArray variants:

const { sqeuclidean, cosine, inner, hamming, jaccard } = require('simsimd');

const vectorA = new Float32Array([1.0, 2.0, 3.0]);
const vectorB = new Float32Array([4.0, 5.0, 6.0]);

const distance = sqeuclidean(vectorA, vectorB);
console.log('Squared Euclidean Distance:', distance);

Other numeric types and precision levels are supported as well. For double-precision floating-point numbers, use Float64Array:

const vectorA = new Float64Array([1.0, 2.0, 3.0]);
const vectorB = new Float64Array([4.0, 5.0, 6.0]);
const distance = cosine(vectorA, vectorB);

When doing machine learning and vector search with high-dimensional vectors, you may want to quantize them to 8-bit integers: for example, project values from the $[-1, 1]$ range to the $[-100, 100]$ range and then cast them to Int8Array:

const quantizedVectorA = new Int8Array(vectorA.map(v => (v * 100)));
const quantizedVectorB = new Int8Array(vectorB.map(v => (v * 100)));
const distance = cosine(quantizedVectorA, quantizedVectorB);

A more extreme quantization case would be to use binary vectors. You can map all positive values to 1 and all negative values and zero to 0, packing eight values into a single byte. After that, Hamming and Jaccard distances can be computed.

const { toBinary, hamming } = require('simsimd');

const binaryVectorA = toBinary(vectorA);
const binaryVectorB = toBinary(vectorB);
const distance = hamming(binaryVectorA, binaryVectorB);
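
If you need to reproduce that packing manually, a rough sketch follows; the MSB-first bit order here is an assumption and may differ from toBinary's actual layout:

function packBits(vector) {
    const packed = new Uint8Array(Math.ceil(vector.length / 8));
    for (let i = 0; i < vector.length; i++) {
        // Positive values map to 1; zeros and negatives map to 0
        if (vector[i] > 0) packed[i >> 3] |= 1 << (7 - (i & 7));
    }
    return packed;
}

const packedVectorA = packBits(vectorA);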

Using SimSIMD in C

For integration within a CMake-based project, add the following segment to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
    simsimd
    GIT_REPOSITORY https://github.com/ashvardanian/simsimd.git
    GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(simsimd)

After that, you can use the SimSIMD library in your C code in several ways. Simplest of all, you can include the headers, and the compiler will automatically select the most recent CPU extensions that SimSIMD will use.

#include <simsimd/simsimd.h>

int main() {
    simsimd_f32_t vector_a[1536];
    simsimd_f32_t vector_b[1536];
    simsimd_metric_punned_t distance_function = simsimd_metric_punned(
        simsimd_metric_cos_k, // Metric kind, like the angular cosine distance
        simsimd_datatype_f32_k, // Data type, like: f16, f32, f64, i8, b8, and complex variants
        simsimd_cap_any_k); // Which CPU capabilities are we allowed to use
    simsimd_distance_t distance;
    distance_function(vector_a, vector_b, 1536, &distance);
    return 0;
}

Dynamic Dispatch

To avoid hard-coding the backend, you can rely on c/lib.c to prepackage all possible backends in one binary, and select the most recent CPU features at runtime. That feature of the C library is called dynamic dispatch and is extensively used in the Python, JavaScript, and Rust bindings. To test which CPU features are available on the machine at runtime, use the following APIs:

int uses_neon = simsimd_uses_neon();
int uses_sve = simsimd_uses_sve();
int uses_haswell = simsimd_uses_haswell();
int uses_skylake = simsimd_uses_skylake();
int uses_ice = simsimd_uses_ice();
int uses_sapphire = simsimd_uses_sapphire();

simsimd_capability_t capabilities = simsimd_capabilities();

To differentiate between runtime and compile-time dispatch, define the following macro:

#define SIMSIMD_DYNAMIC_DISPATCH 1 // or 0

Spatial Distances: Cosine and Euclidean Distances

#include <simsimd/simsimd.h>

int main() {
    simsimd_f64_t f64s[1536];
    simsimd_f32_t f32s[1536];
    simsimd_f16_t f16s[1536];
    simsimd_i8_t i8s[1536];
    simsimd_distance_t distance;

    // Cosine distance between two vectors
    simsimd_cos_i8(i8s, i8s, 1536, &distance);
    simsimd_cos_f16(f16s, f16s, 1536, &distance);
    simsimd_cos_f32(f32s, f32s, 1536, &distance);
    simsimd_cos_f64(f64s, f64s, 1536, &distance);
    
    // Euclidean distance between two vectors
    simsimd_l2sq_i8(i8s, i8s, 1536, &distance);
    simsimd_l2sq_f16(f16s, f16s, 1536, &distance);
    simsimd_l2sq_f32(f32s, f32s, 1536, &distance);
    simsimd_l2sq_f64(f64s, f64s, 1536, &distance);

    return 0;
}

Dot-Products: Inner and Complex Inner Products

#include <simsimd/simsimd.h>

int main() {
    simsimd_f64_t f64s[1536];
    simsimd_f32_t f32s[1536];
    simsimd_f16_t f16s[1536];
    simsimd_distance_t distance;

    // Inner product between two vectors
    simsimd_dot_f16(f16s, f16s, 1536, &distance);
    simsimd_dot_f32(f32s, f32s, 1536, &distance);
    simsimd_dot_f64(f64s, f64s, 1536, &distance);

    // Complex inner product between two vectors
    simsimd_dot_f16c(f16s, f16s, 1536, &distance);
    simsimd_dot_f32c(f32s, f32s, 1536, &distance);
    simsimd_dot_f64c(f64s, f64s, 1536, &distance);

    // Complex conjugate inner product between two vectors
    simsimd_vdot_f16c(f16s, f16s, 1536, &distance);
    simsimd_vdot_f32c(f32s, f32s, 1536, &distance);
    simsimd_vdot_f64c(f64s, f64s, 1536, &distance);

    return 0;
}

Binary Distances: Hamming and Jaccard Distances

#include <simsimd/simsimd.h>

int main() {
    simsimd_b8_t b8s[1536 / 8]; // 8 bits per word
    simsimd_distance_t distance;

    // Hamming distance between two vectors
    simsimd_hamming_b8(b8s, b8s, 1536 / 8, &distance);

    // Jaccard distance between two vectors
    simsimd_jaccard_b8(b8s, b8s, 1536 / 8, &distance);

    return 0;
}

Probability Distributions: Jensen-Shannon and Kullback-Leibler Divergences

#include <simsimd/simsimd.h>

int main() {
    simsimd_f64_t f64s[1536];
    simsimd_f32_t f32s[1536];
    simsimd_f16_t f16s[1536];
    simsimd_distance_t distance;

    // Jensen-Shannon divergence between two vectors
    simsimd_js_f16(f16s, f16s, 1536, &distance);
    simsimd_js_f32(f32s, f32s, 1536, &distance);
    simsimd_js_f64(f64s, f64s, 1536, &distance);

    // Kullback-Leibler divergence between two vectors
    simsimd_kl_f16(f16s, f16s, 1536, &distance);
    simsimd_kl_f32(f32s, f32s, 1536, &distance);
    simsimd_kl_f64(f64s, f64s, 1536, &distance);

    return 0;
}

Half-Precision Floating-Point Numbers

If you aim to utilize the _Float16 functionality in SimSIMD, ensure your development environment is compatible with C11. For other SimSIMD functionality, C99 compatibility will suffice. To explicitly disable half-precision support, define the following macro before including the headers:

#define SIMSIMD_NATIVE_F16 0 // or 1
#include <simsimd/simsimd.h>

Target Specific Backends

SimSIMD exposes all kernels for all backends, and you can select the most advanced one for the current CPU without relying on built-in dispatch mechanisms. All of the function names follow the same pattern: simsimd_{function}_{type}_{backend}.

  • The backend can be serial, haswell, skylake, ice, sapphire, neon, or sve.
  • The type can be f64, f32, f16, f64c, f32c, f16c, i8, or b8.
  • The function can be dot, vdot, cos, l2sq, hamming, jaccard, kl, or js.

To avoid hard-coding the backend, you can use the simsimd_metric_punned_t to pun the function pointer and the simsimd_capabilities function to get the available backends at runtime.

simsimd_dot_f64_sve
simsimd_cos_f64_sve
simsimd_l2sq_f64_sve
simsimd_dot_f64_skylake
simsimd_cos_f64_skylake
simsimd_l2sq_f64_skylake
simsimd_dot_f64_serial
simsimd_cos_f64_serial
simsimd_l2sq_f64_serial
simsimd_js_f64_serial
simsimd_kl_f64_serial
simsimd_dot_f32_sve
simsimd_cos_f32_sve
simsimd_l2sq_f32_sve
simsimd_dot_f32_neon
simsimd_cos_f32_neon
simsimd_l2sq_f32_neon
simsimd_js_f32_neon
simsimd_kl_f32_neon
simsimd_dot_f32_skylake
simsimd_cos_f32_skylake
simsimd_l2sq_f32_skylake
simsimd_js_f32_skylake
simsimd_kl_f32_skylake
simsimd_dot_f32_serial
simsimd_cos_f32_serial
simsimd_l2sq_f32_serial
simsimd_js_f32_serial
simsimd_kl_f32_serial
simsimd_dot_f16_sve
simsimd_cos_f16_sve
simsimd_l2sq_f16_sve
simsimd_dot_f16_neon
simsimd_cos_f16_neon
simsimd_l2sq_f16_neon
simsimd_js_f16_neon
simsimd_kl_f16_neon
simsimd_dot_f16_sapphire
simsimd_cos_f16_sapphire
simsimd_l2sq_f16_sapphire
simsimd_js_f16_sapphire
simsimd_kl_f16_sapphire
simsimd_dot_f16_haswell
simsimd_cos_f16_haswell
simsimd_l2sq_f16_haswell
simsimd_js_f16_haswell
simsimd_kl_f16_haswell
simsimd_dot_f16_serial
simsimd_cos_f16_serial
simsimd_l2sq_f16_serial
simsimd_js_f16_serial
simsimd_kl_f16_serial
simsimd_cos_i8_neon
simsimd_l2sq_i8_neon
simsimd_cos_i8_ice
simsimd_l2sq_i8_ice
simsimd_cos_i8_haswell
simsimd_l2sq_i8_haswell
simsimd_cos_i8_serial
simsimd_l2sq_i8_serial
simsimd_hamming_b8_sve
simsimd_jaccard_b8_sve
simsimd_hamming_b8_neon
simsimd_jaccard_b8_neon
simsimd_hamming_b8_ice
simsimd_jaccard_b8_ice
simsimd_hamming_b8_haswell
simsimd_jaccard_b8_haswell
simsimd_hamming_b8_serial
simsimd_jaccard_b8_serial
simsimd_dot_f32c_sve
simsimd_vdot_f32c_sve
simsimd_dot_f32c_neon
simsimd_vdot_f32c_neon
simsimd_dot_f32c_haswell
simsimd_vdot_f32c_haswell
simsimd_dot_f32c_skylake
simsimd_vdot_f32c_skylake
simsimd_dot_f32c_serial
simsimd_vdot_f32c_serial
simsimd_dot_f64c_sve
simsimd_vdot_f64c_sve
simsimd_dot_f64c_skylake
simsimd_vdot_f64c_skylake
simsimd_dot_f64c_serial
simsimd_vdot_f64c_serial
simsimd_dot_f16c_sve
simsimd_vdot_f16c_sve
simsimd_dot_f16c_neon
simsimd_vdot_f16c_neon
simsimd_dot_f16c_haswell
simsimd_vdot_f16c_haswell
simsimd_dot_f16c_sapphire
simsimd_vdot_f16c_sapphire
simsimd_dot_f16c_serial
simsimd_vdot_f16c_serial
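
As a sketch, manually selecting among the kernels above could look like this; the individual capability flags, like simsimd_cap_sve_k and simsimd_cap_skylake_k, are assumed to mirror the simsimd_cap_any_k constant used earlier:

#include <simsimd/simsimd.h>

simsimd_distance_t cosine_f32(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n) {
    simsimd_distance_t distance;
    simsimd_capability_t caps = simsimd_capabilities();
    // Prefer the most advanced backend reported at runtime, fall back to serial
    if (caps & simsimd_cap_sve_k)
        simsimd_cos_f32_sve(a, b, n, &distance);
    else if (caps & simsimd_cap_skylake_k)
        simsimd_cos_f32_skylake(a, b, n, &distance);
    else
        simsimd_cos_f32_serial(a, b, n, &distance);
    return distance;
}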


simsimd's Issues

simsimd not working on AWS Lambda public image

Hi,

First of all, thanks to everybody involved in developing and maintaining this package.
I am not really an expert on the package internals, also because on my Windows 10 laptop it has always worked like a charm.

Now I am moving my script as a Lambda function, using a custom container based on AWS public image public.ecr.aws/lambda/python:3.11 as a starting image.

Here I am not able to execute any function provided by the library, since I always receive a "module 'SimSIMD' has no attribute 'cdist'" error message.

From GitHub I see that the library supports x86 AVX2, and if I execute cat /proc/cpuinfo inside the container image, the flags field does list avx2, so I thought the library would have worked.

Can you help me understand what am I missing?
Thanks again

Please see the full reproducible output below:

(venv) C:\Users\jl1andricca\Documents\Python Scripts\aws\generative-ai>docker run -it -d --name example public.ecr.aws/lambda/python:3.11 bash
b899f02bfb3b29bb6c479d91bb1fbba575f5f9aaaa4e7c325935d1dd27c53aca

(venv) C:\Users\jl1andricca\Documents\Python Scripts\aws\generative-ai>docker exec -it example /bin/bash
bash-4.2# pip install numpy simsimd
Collecting numpy
  Obtaining dependency information for numpy from https://files.pythonhosted.org/packages/5a/62/007b63f916aca1d27f5fede933fda3315d931ff9b2c28b9c2cf388cd8edb/numpy-1.26.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading numpy-1.26.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 890.1 kB/s eta 0:00:00
Collecting simsimd
  Obtaining dependency information for simsimd from https://files.pythonhosted.org/packages/9c/5e/dff35e629b764cda0a5cf488a9348df187473eb896f2d9e58c19de913805/simsimd-1.4.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.metadata
  Downloading simsimd-1.4.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading numpy-1.26.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 15.6 MB/s eta 0:00:00
Downloading simsimd-1.4.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (33 kB)
Installing collected packages: simsimd, numpy
Successfully installed numpy-1.26.3 simsimd-1.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
bash-4.2# python
Python 3.11.6 (main, Dec  4 2023, 13:34:04) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import simsimd
>>> matrix1 = np.random.randn(1000, 1536).astype(np.float32)
>>> matrix2 = np.random.randn(10, 1536).astype(np.float32)
>>> distances = simsimd.cdist(matrix1, matrix2, metric="cosine")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'SimSIMD' has no attribute 'cdist'
>>> quit()
bash-4.2# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
stepping        : 12
microcode       : 0xffffffff
cpu MHz         : 2112.005
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit srbds
bogomips        : 4224.01
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

[... the output repeats identically for logical processors 1-7 ...]

Maybe a faster dot product (I'm not sure)

Looking at the AVX-512 dot product, I tried to avoid the if inside the loop to make the code faster. Here is a (not tested) idea of the proposed code:

#include <immintrin.h>
#include <simsimd/simsimd.h>

float simsimd_avx512_f32_ip(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n)
{
    __m512 ab_vec = _mm512_setzero_ps();
    __m512 a_vec, b_vec;

    int n_tail = n & 15; // Equivalent of n % 16, but faster
    n -= n_tail;

    while(n) // faster loop without the "being on the tail" condition
    {
        a_vec = _mm512_loadu_ps(a);
        b_vec = _mm512_loadu_ps(b);
        ab_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_vec);

        a += 16, b += 16, n -= 16;
    }
    if(n_tail)
    {
        __mmask16 mask = _bzhi_u32(0xFFFFFFFF, n_tail);
        a_vec = _mm512_maskz_loadu_ps(mask, a);
        b_vec = _mm512_maskz_loadu_ps(mask, b);
        ab_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_vec);
    }

    return _mm512_reduce_add_ps(ab_vec);
}
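
A minimal harness to sanity-check the idea might look like this, assuming it is compiled together with the snippet above and with AVX-512 enabled (e.g. -mavx512f -mbmi2):

#include <stdio.h>

int main() {
    simsimd_f32_t a[1000], b[1000];
    for (int i = 0; i < 1000; ++i) a[i] = b[i] = 1.0f;
    // 1000 is not a multiple of 16, so the masked tail path is exercised too
    printf("%f\n", simsimd_avx512_f32_ip(a, b, 1000)); // expect 1000.000000
    return 0;
}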

Benchmark torch vs. NumPy

I tried to simplify the benchmarks a bit to understand better what was going on, as I found the reported 10x speed-up of NumPy vs. PyTorch weird (we know there is a bit of extra overhead in PyTorch, but 10x is way too much). I wrote the following script, which delegates the benchmarking to the PyTorch benchmark suite.

import torch
from torch.utils.benchmark import Compare, Timer

import numpy as np
import simsimd as simd

# Set to ignore all floating-point errors
np.seterr(all="ignore")

def get_timer(fn, x, y):
    t = str(x.dtype)
    if t.startswith("torch."):
        t = t[len("torch."):]

    timer = Timer(
        "fn(x, y)",
        globals=locals(),
        label="dot",
        description=f"{fn.__module__}",
        sub_label=f"{tuple(x.shape)}, {t}",
        num_threads=torch.get_num_threads()
    )
    return timer.blocked_autorange(min_run_time=1)


count = 1000
ndim = 1536

generators = {
    np.float64: lambda: np.random.randn(count, ndim).astype(np.float64),
    np.float32: lambda: np.random.randn(count, ndim).astype(np.float32),
    np.float16: lambda: np.random.randn(count, ndim).astype(np.float16),
    np.int8: lambda: np.random.randint(-100, high=100, size=(count, ndim), dtype=np.int8),
}

print()
print("## Between 2 Vectors, Batch Size: 1")
print()

# Benchmark functions
funcs = [
    (
        [np.dot, simd.dot, torch.dot],
        [np.float64, np.float32, np.float16, np.int8],
        [np.array, np.array, torch.tensor],
    ),
]


def get_params():
    for fns, dtypes, tensor_types in funcs:
        for dtype in dtypes:
            for fn, tensor_type in zip(fns, tensor_types):
                A = generators[dtype]()[0]
                B = generators[dtype]()[0]
                yield fn, tensor_type(A), tensor_type(B)

compare = Compare([get_timer(*params) for params in get_params()])
compare.trim_significant_figures()
compare.print()

In my machine, an AMD 3970X, I get

## Between 2 Vectors, Batch Size: 1

[----------------------- dot ------------------------]
                        |  numpy  |  SimSIMD  |  torch
32 threads: ------------------------------------------
      (1536,), float64  |   1000  |     800   |   2000
      (1536,), float32  |    900  |     400   |   2000
      (1536,), float16  |  11000  |    7400   |  21000
      (1536,), int8     |   2000  |     600   |   2000

Times are in nanoseconds (ns).

It's also rather surprising how slow float16 is in all three backends, but that's a story for another day.

Question on Cosine Similarity Result

First of all, thank you for creating and maintaining this project. It helped a lot with my SIMD implementation of distance functions for vectors.

I encountered some oddity when using f32::cosine in Rust. When comparing it to a manual cosine similarity calculation, it produces a different result.

use simsimd::SpatialSimilarity;

fn main() {
    let a = vec![1.0, 3.0, 5.0];
    let b = vec![2.0, 4.0, 6.0];

    let dot = f32::dot(&a, &b).unwrap() as f32;
    let ma = a.iter().map(|x| x.powi(2)).sum::<f32>().sqrt();
    let mb = b.iter().map(|x| x.powi(2)).sum::<f32>().sqrt();
    let cosine = dot / (ma * mb);

    assert_eq!(cosine, f32::cosine(&a, &b).unwrap() as f32);
}

I'm just curious, is there something I'm missing in the implementation?

Note: f32::dot and f32::sqeuclidean do produce the correct results compared to the manual calculation.

simsimd does not work with bun

What version of Bun is running?

❯ bun -v
1.0.21

What platform is your computer?

❯ uname -mprs
Darwin 23.2.0 arm64 arm

What steps can reproduce the bug?

install bun

bun i simsimd

test.ts:

import { inner, cosine } from "simsimd";
console.log({ inner, cosine });

bun run test.ts

What is the expected behavior?

In this simple example, show that the functions exist

What do you see instead?

❯ bun run test.ts
83 |         throw e
84 |       }
85 |     }
86 |   }
87 | 
88 |   err = new Error('Could not locate the bindings file. Tried:\n'
             ^
error: Could not locate the bindings file. Tried:
 → /Users/steve/Code/elmers/node_modules/simsimd/build/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/build/Debug/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/build/Release/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/out/Debug/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/Debug/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/out/Release/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/Release/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/build/default/simsimd.node
 → /Users/steve/Code/elmers/node_modules/simsimd/compiled/20.8.0/darwin/arm64/simsimd.node
      at bindings (/Users/steve/Code/elmers/node_modules/bindings/bindings.js:88:9)
      at /Users/steve/Code/elmers/node_modules/simsimd/javascript/simsimd.js:1:7

Will not work in a Vercel instance

Both for Node 18 and Node 20:

[14:23:34.076] .../[email protected]/node_modules/usearch install: ../simsimd/include/simsimd/types.h:27: warning: "SIMSIMD_TARGET_ARM_NEON" redefined
[14:23:34.076] .../[email protected]/node_modules/usearch install:    27 | #define SIMSIMD_TARGET_ARM_NEON 0
[14:23:34.076] .../[email protected]/node_modules/usearch install:       | 
[14:23:34.077] .../[email protected]/node_modules/usearch install: <command-line>: note: this is the location of the previous definition
[14:23:34.077] .../[email protected]/node_modules/usearch install: In file included from ../simsimd/include/simsimd/binary.h:23,
[14:23:34.077] .../[email protected]/node_modules/usearch install:                  from ../simsimd/include/simsimd/simsimd.h:17,
[14:23:34.077] .../[email protected]/node_modules/usearch install:                  from ../include/usearch/index_plugins.hpp:68,
[14:23:34.077] .../[email protected]/node_modules/usearch install:                  from ../include/usearch/index_dense.hpp:10,
[14:23:34.078] .../[email protected]/node_modules/usearch install:                  from ../javascript/lib.cpp:18:
[14:23:34.078] .../[email protected]/node_modules/usearch install: ../simsimd/include/simsimd/types.h:36: warning: "SIMSIMD_TARGET_ARM_SVE" redefined
[14:23:34.078] .../[email protected]/node_modules/usearch install:    36 | #define SIMSIMD_TARGET_ARM_SVE 0
[14:23:34.078] .../[email protected]/node_modules/usearch install:       | 
[14:23:34.078] .../[email protected]/node_modules/usearch install: <command-line>: note: this is the location of the previous definition
[14:23:35.137] .../[email protected]/node_modules/usearch install: In file included from ../simsimd/include/simsimd/simsimd.h:18,
[14:23:35.138] .../[email protected]/node_modules/usearch install:                  from ../include/usearch/index_plugins.hpp:68,
[14:23:35.138] .../[email protected]/node_modules/usearch install:                  from ../include/usearch/index_dense.hpp:10,
[14:23:35.138] .../[email protected]/node_modules/usearch install:                  from ../javascript/lib.cpp:18:
[14:23:35.138] .../[email protected]/node_modules/usearch install: ../simsimd/include/simsimd/probability.h:457:8: error: '__m512h' does not name a type; did you mean '__m512i'?
[14:23:35.138] .../[email protected]/node_modules/usearch install:   457 | inline __m512h
[14:23:35.138] .../[email protected]/node_modules/usearch install:       |        ^~~~~~~
[14:23:35.138] .../[email protected]/node_modules/usearch install:       |        __m512i
[14:23:35.138] .../[email protected]/node_modules/usearch install: ../simsimd/include/simsimd/probability.h:477:1: error: attribute 'avx512fp16' argument 'target' is unknown
[14:23:35.139] .../[email protected]/node_modules/usearch install:   477 | simsimd_avx512_f16_kl(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n) {
[14:23:35.139] .../[email protected]/node_modules/usearch install:       | ^~~~~~~~~~~~~~~~~~~~~

Vercel error log:

500: INTERNAL_SERVER_ERROR
Code: FUNCTION_INVOCATION_FAILED
ID: cle1:cle1::sw2js-1705001182897-78fa89a33f7c
19:26:23 [ERROR] Error: No native build was found for platform=linux arch=x64 runtime=node abi=115 uv=1 libc=glibc node=20.10.0
    loaded from: /var/task/node_modules/.pnpm/[email protected]/node_modules/simsimd

    at load.resolve.load.path (/var/task/node_modules/.pnpm/[email protected]/node_modules/node-gyp-build/node-gyp-build.js:60:9)
    at load (/var/task/node_modules/.pnpm/[email protected]/node_modules/node-gyp-build/node-gyp-build.js:22:30)
    at Object.<anonymous> (/var/task/node_modules/.pnpm/[email protected]/node_modules/simsimd/javascript/simsimd.js:3:18)
    at Module._compile (node:internal/modules/cjs/loader:1376:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1435:10)
    at Module.load (node:internal/modules/cjs/loader:1207:32)
    at Module._load (node:internal/modules/cjs/loader:1023:12)
    at r.<computed>.e._load (/var/task/___vc/__launcher/bridge-server-72TT5FOD.js:1:1574)
    at cjsLoader (node:internal/modules/esm/translators:345:17)
    at ModuleWrap.<anonymous> (node:internal/modules/esm/translators:294:7)

simsimd_avx512_i8_cos under vnni

Thanks for your great work!

When i reading codes in simsimd_avx512_i8_cos, https://github.com/ashvardanian/SimSIMD/blob/main/include/simsimd/spatial.h#L1120
I am a little confused: both inputs a and b are signed, but _mm512_dpbusd_epi32 computes products of zero-extended and sign-extended bytes, so I think there may be a problem when vector a contains a negative number like -1.

I think it may need additional code like this:
https://github.com/my-vegetable-has-exploded/dot-bench/blob/main/src/lib.rs#L54-L57
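
A scalar model of the lane semantics makes the concern concrete; this snippet is illustrative, based on Intel's documented behavior of zero-extending a and sign-extending b:

#include <stdint.h>

// What _mm512_dpbusd_epi32 computes per 32-bit lane
int32_t dpbusd_lane(uint8_t const a[4], int8_t const b[4], int32_t acc) {
    for (int i = 0; i < 4; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i]; // a zero-extended, b sign-extended
    return acc;
}
// If a signed `a` byte holds -1 (bits 0xFF), the lane adds 255 * b[i], not -1 * b[i]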

Disabling usage of AVX-512 instructions

First of all, thank you for this amazing project!

However, on my laptop equipped with an i7-10750H, the AVX-512 capability is activated despite the processor not supporting those instructions. Is there any way to disable AVX-512 instructions in Python (like a "set capability" method)?

Thank you for the time and effort.

Cannot import in go project

After importing "github.com/ashvardanian/simsimd" in my Golang project, I try to call simsimd.CosineI8, but it isn't recognized.

I used go get github.com/ashvardanian/simsimd@latest to import this package.

and here is my simple test:

package main

import "github.com/ashvardanian/simsimd"

func main() {
	a := []int8{1, 2, 3}
	b := []int8{4, 5, 6}
	result := simsimd.CosineI8(a, b)
	println(result)
}

I'm wondering if there's something wrong with my approach. How do I call this?

Enhanced Load Masking for Prefixes and Suffixes

SimSIMD predominantly relies on unaligned loads for its operations. In instances where AVX-512 is utilized, masked loads are employed to bypass sequential operations on tail elements. However, a more expedient, albeit advanced, scheme can be explored. Under the assumption that any byte within a 64-byte cache line partaking in a vector implies the entire cache line is accessible, we can shift towards exclusively using aligned loads. This approach entails fetching the complete cache line with each load, inevitably conducting some superfluous operations but decidedly evading unaligned loads.

To circumvent potential complications with memory sanitizers, it's advised to incorporate the following attributes: __attribute__((no_sanitize_address)) and __attribute__((no_sanitize_thread)).

AttributeError: module 'SimSIMD' has no attribute 'cosine'

When I run the provided example, I get the following error: AttributeError: module 'SimSIMD' has no attribute 'cosine'.

import simsimd
import numpy as np

vec1 = np.random.randn(1536).astype(np.float32)
vec2 = np.random.randn(1536).astype(np.float32)
dist = simsimd.cosine(vec1, vec2)

I installed simsimd using pip, but it can only install version 1.4.0. When I specified version 3.9.0, the following error occurred:

ERROR: Could not find a version that satisfies the requirement simsimd==3.9.0 (from versions: 1.1.2, 1.2.0, 1.3.0, 1.4.0)
ERROR: No matching distribution found for simsimd==3.9.0

Add covariance estimators

Covariance isn't a distance function, as it can be negative. However, it is often used for similarity search over time-series and should be implemented in SimSIMD.

I want to add simsimd to xmake's package management tool and have encountered some errors.

xmake-repo simsimd pr
There are some errors with the simsimd package during the testing and compilation phases.
I don't understand SIMD-related details very well.
How should I resolve these errors?

iphoneos | android v8a r22:

simsimd/spatial.h:350:16: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        ab_vec = vdotq_s32(ab_vec, a_vec, b_vec);
               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/.xmake/packages/s/simsimd/v3.9.0/181e36b5210a41de9a550584e58f8699/include/simsimd/spatial.h:351:16: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        a2_vec = vdotq_s32(a2_vec, a_vec, a_vec);
               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/.xmake/packages/s/simsimd/v3.9.0/181e36b5210a41de9a550584e58f8699/include/simsimd/spatial.h:352:16: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
        b2_vec = vdotq_s32(b2_vec, b_vec, b_vec);
               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ubuntu:

simsimd/types.h:122:9: error: ‘_Float16’ is not supported on this target
  122 | typedef _Float16 simsimd_f16_t;
      |         ^~~~~~~~

Golang: '_Float16' is not supported on this target

After fixing a few issues (#70), I'm getting the following error:

$ go test
# github.com/ashvardanian/simsimd/golang
In file included from ./../include/simsimd/binary.h:23,
                 from ./../include/simsimd/simsimd.h:21,
                 from ./simsimd.go:6:
./../include/simsimd/types.h:119:9: error: '_Float16' is not supported on this target
  119 | typedef _Float16 simsimd_f16_t;
      |         ^~~~~~~~
FAIL    github.com/ashvardanian/simsimd/golang [build failed]

Can't install on institutional linux cluster

Hi,

pip install simsimd can only get up to version 1.4.0 due to wheel/manylinux incompatibility with my OS. Do you have any advice on how I could install later versions? simsimd is a dependency of another package I'm using.

NAME="Red Hat Enterprise Linux Server" VERSION="7.9 (Maipo)" ID="rhel" ID_LIKE="fedora" VARIANT="Server" VARIANT_ID="server" VERSION_ID="7.9" PRETTY_NAME="Red Hat Enterprise Linux" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.9 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.9"

Latest version not on pypi

Hey @ashvardanian,
I've just found this cool library, started experimenting with it (awesome speedups!), and noticed the latest version on PyPI is v3.5.5, so just letting you know :)
Thanks for your work!

Inconsistency with scipy

Results are sometimes different from scipy.spatial.distance.
For example (I checked using float64 and float32):

import numpy as np
import simsimd
from scipy.spatial import distance

a1 = np.array([0.10, 0.62])
a2 = np.array([0.16, 0.69])

print(simsimd.cosine(a1, a2))
print(distance.cosine(a1, a2))

print(simsimd.sqeuclidean(a1, a2))
print(distance.sqeuclidean(a1, a2))

Output:

0.0023124534636735916
0.0023073006911024097

0.008500000461935997
0.008499999999999994

Sparse Distances

All existing metrics imply dense vector representations. Dealing with very high-dimensional vectors, sparse representations may provide huge space-efficiency gains.

The only operation that needs to be implemented for Jaccard, Hamming, Inner Product, L2, and Cosine is a float-weighted vectorized set-intersection. We may expect the following kinds of vectors:

  • u16 - high priority
  • u32 - high priority
  • u16f16 - medium priority
  • u32f16 - medium priority
  • u32f32 - low priority?

The last may not be practically useful. The AVX-512 backends (Intel Ice Lake and newer, plus AMD Genoa) and the SVE backends (AWS Graviton, Nvidia Grace, Microsoft Cobalt) will see the biggest gains. Together with a serial backend, multiplied by 4-5 input types and 5 distance functions, this may result in over 100 new kernels.
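
For concreteness, here is a minimal serial reference of that float-weighted set-intersection in Python, assuming sorted, de-duplicated index arrays; the unweighted u16/u32 variants would simply count matches instead of accumulating products:

def sparse_dot(idx_a, val_a, idx_b, val_b):
    # Two-pointer merge over sorted index arrays: advance the cursor with
    # the smaller index; on a match, multiply the weights and advance both.
    i, j, acc = 0, 0, 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] < idx_b[j]:
            i += 1
        elif idx_a[i] > idx_b[j]:
            j += 1
        else:
            acc += float(val_a[i]) * float(val_b[j])
            i += 1
            j += 1
    return acc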

Any thoughts and recommendations? Someone else looking for this functionality?

How to work with the OutputDistances object?

In simsimd 3.7.7, this code:

import simsimd
import numpy as np

X = np.array([[1, 2, 3]], dtype=np.float32)
Y = np.array([[4, 5, 6]], dtype=np.float32)

simsimd.cosine(X, Y)

returns

array([0.02940369], dtype=float32)

but in 3.9.0, it now returns:

<simsimd.OutputDistances at 0xffff6257ec70>

How do I get the same result as in 3.7.7?
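
Assuming the new object exposes the Python buffer protocol (which the zero-copy design of the 3.9 bindings suggests), wrapping it in numpy.asarray appears to recover the old behavior:

import numpy as np
import simsimd

X = np.array([[1, 2, 3]], dtype=np.float32)
Y = np.array([[4, 5, 6]], dtype=np.float32)

result = simsimd.cosine(X, Y)  # OutputDistances in 3.9.0
print(np.asarray(result))      # array([0.02940369], dtype=float32), as in 3.7.7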

[Rust Bindings] Poor performance vs ndarray (BLAS) and optimized iteration impls

Recently we've been implementing some spatial distance functions and benchmarking them against some existing libraries. When testing with high-dimensional data (1024 dims), we observe simsimd taking 619ns per vector on average, compared to 43ns for ndarray (when backed by OpenBLAS), and 234ns or 95ns for an optimized bit of pure Rust with ffast-math-like intrinsics disabled or enabled, respectively.

These benchmarks are taken with Criterion doing 1,000 vector ops per iteration, to account for clock-accuracy issues at such low nanosecond timings.

dot ndarray 1024 auto   time:   [43.270 µs 43.285 µs 43.302 µs]
Found 17 outliers among 500 measurements (3.40%)
  5 (1.00%) high mild
  12 (2.40%) high severe

Benchmarking dot simsimd 1024 auto: Warming up for 3.0000 s
Warning: Unable to complete 500 samples in 60.0s. You may wish to increase target time to 77.7s, enable flat sampling, or reduce sample count to 310.
dot simsimd 1024 auto   time:   [618.85 µs 619.93 µs 621.15 µs]
Found 43 outliers among 500 measurements (8.60%)
  7 (1.40%) low mild
  17 (3.40%) high mild
  19 (3.80%) high severe

dot fallback 1024 nofma time:   [232.92 µs 234.19 µs 235.76 µs]
Found 16 outliers among 500 measurements (3.20%)
  11 (2.20%) high mild
  5 (1.00%) high severe

dot fallback 1024 fma   time:   [95.456 µs 95.586 µs 95.729 µs]
Found 19 outliers among 500 measurements (3.80%)
  17 (3.40%) high mild
  2 (0.40%) high severe

Notes

  • CPU: AMD Ryzen 9 5900X 12-Core Processor, 3701 MHz, 12 Core(s), 24 Logical Processor(s)
  • Benchmarked with Criterion 0.5.1, OpenBLAS 0.3.25
  • Compiled with RUSTFLAGS="-C target-feature=+avx2,+fma"
    • Results can also be replicated via RUSTFLAGS="-C target-cpu=native"
  • We only measure ndarray for the dot product, as there are no BLAS-specific ops for Euclidean or Cosine distance, but a similar performance difference can be observed between the pure Rust and simsimd versions for those additional distance measures.

Loose benchmark structure (within Criterion)

There is a bit too much code to paste the exact benchmarks, but each step is the following:

use criterion::black_box; // Criterion re-exports std::hint::black_box

// `implementation_dot` stands in for whichever dot-product variant is measured.
fn bench_me(a: &[f32], b: &[f32]) {
    for _ in 0..1_000 {
        black_box(implementation_dot(black_box(a), black_box(b)));
    }
}

Pure Rust impl

Below is a fallback impl I've made. For simplicity, I've removed the generic that was used to replace regular math operations with their ffast-math equivalents when running the dot fallback 1024 fma benchmark; however, the asm for dot fallback 1024 nofma is identical.

Notes

  • We only target vectors whose length is a multiple of 8, so there is no additional loop to handle the remainder if DIMS were not a multiple of 8. That said, even with that final loop, the difference is minimal.
unsafe fn fallback_dot_product_demo<const DIMS: usize>(
    a: &[f32],
    b: &[f32],
) -> f32 {
    debug_assert_eq!(
        b.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        a.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        DIMS % 8,
        0,
        "DIMS must be able to fit entirely into chunks of 8 lanes."
    );

    let mut i = 0;

    // We do this manual unrolling to allow the compiler to vectorize
    // the loop and avoid some branching even if we're not doing it explicitly.
    // This made a significant difference (~4x) in benchmarking.
    let mut acc1 = 0.0;
    let mut acc2 = 0.0;
    let mut acc3 = 0.0;
    let mut acc4 = 0.0;
    let mut acc5 = 0.0;
    let mut acc6 = 0.0;
    let mut acc7 = 0.0;
    let mut acc8 = 0.0;

    while i < a.len() {
        let a1 = *a.get_unchecked(i);
        let a2 = *a.get_unchecked(i + 1);
        let a3 = *a.get_unchecked(i + 2);
        let a4 = *a.get_unchecked(i + 3);
        let a5 = *a.get_unchecked(i + 4);
        let a6 = *a.get_unchecked(i + 5);
        let a7 = *a.get_unchecked(i + 6);
        let a8 = *a.get_unchecked(i + 7);

        let b1 = *b.get_unchecked(i);
        let b2 = *b.get_unchecked(i + 1);
        let b3 = *b.get_unchecked(i + 2);
        let b4 = *b.get_unchecked(i + 3);
        let b5 = *b.get_unchecked(i + 4);
        let b6 = *b.get_unchecked(i + 5);
        let b7 = *b.get_unchecked(i + 6);
        let b8 = *b.get_unchecked(i + 7);

        acc1 = acc1 + (a1 * b1);
        acc2 = acc2 + (a2 * b2);
        acc3 = acc3 + (a3 * b3);
        acc4 = acc4 + (a4 * b4);
        acc5 = acc5 + (a5 * b5);
        acc6 = acc6 + (a6 * b6);
        acc7 = acc7 + (a7 * b7);
        acc8 = acc8 + (a8 * b8);

        i += 8;
    }

    // Pairwise tree reduction: collapse the eight accumulators in
    // three rounds, keeping the dependency chains short.
    acc1 = acc1 + acc2;
    acc3 = acc3 + acc4;
    acc5 = acc5 + acc6;
    acc7 = acc7 + acc8;

    acc1 = acc1 + acc3;
    acc5 = acc5 + acc7;

    acc1 + acc5
}

Build CPython wrapper to connect with USearch

USearch already uses SimSIMD for hardware-accelerated distance functions. That, however, greatly complicates backward-compatible compilation and testing. Instead, for Python users, we can wrap SimSIMD functions into CPython capsules and pass them into USearch through usearch.index.Index.
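
A hypothetical sketch of the intended workflow: the helper name and exact signature below are assumptions, since the wrapper described here doesn't exist yet, but USearch's existing CompiledMetric interface already accepts raw function pointers:

from usearch.index import Index, CompiledMetric, MetricKind, MetricSignature
import simsimd  # assuming a future build that exports kernel addresses

# Hypothetical helper returning the address of the f16 cosine kernel.
pointer = simsimd.pointer_to_cosine("f16")

metric = CompiledMetric(
    pointer=pointer,
    kind=MetricKind.Cos,
    signature=MetricSignature.ArrayArraySize,
)
index = Index(ndim=1536, metric=metric)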

Need clarity

In response to your blog: which AVX variant did you use to achieve that 118ns for floats?
I ran the same experiment and obtained 7.30 ns for Time and 350 ns for CPU time in the Google Benchmark report for avx2_f32_cos_1536d.
Could you please clarify which value I should consider, and whether you used Time or CPU time for the 118 ns figure?
@ashvardanian
