Giter Club home page Giter Club logo

hnswlib-wasm's Introduction

hnswlib-wasm

This is a WebAssembly (Wasm) version of the hnswlib index library written in C++. This Wasm port was created by @ShravanSunder using the emcc Wasm compiler, see the repository here.

Inspired by the library hnswlib-node. @yoshoku has provided some wonderful documentation for hnswlib-node. Thanks, @yoshoku!

Note: This library is still in its early days! It is being built for a use case that requires running hnswlib in the browser.

hnswlib-wasm provides wasm bindings for Hnswlib that implements approximate nearest-neighbor search based on hierarchical navigable small world graphs. It works in browsers and is compiled with Emscripten.

Installation

$ yarn add hnswlib-wasm

See the npm package here.

Documentation

  • hnswlib-node API Documentation by @yoshoku for hnswlib-node provides an accurate description of the API. The TypeScript definitions of the API have been modified to work with Wasm. The documentation will be updated as the library progresses, and more examples will be added.
  • The major differences are in loading and saving the index. It supports indexedDB (in browser) and uses FS from Emscripten to save and load the index via the virtual file system and IDBFS.
  • See the changelog from hnswlib-node for more details by @yoshoku changelog.

Usage

First, create a runtime instance of the library:

import { loadHnswlib } from 'hnswlib-wasm';

const lib = await loadHnswlib();

Here is a full example of loading a index if it exists or creating a new index if it doesn't exist:

  const filename = 'ghost.dat';
  const { loadHnswlib } = await import('hnswlib-wasm');
  this.hnswlib = await loadHnswlib();
  this.hnswlib.EmscriptenFileSystemManager.setDebugLogs(true);
  this.vectorHnswIndex = new this.hnswlib.HierarchicalNSW('cosine', 1536);
  await syncFileSystem('read');

  const exists = this.hnswlib.EmscriptenFileSystemManager.checkFileExists(filename);
  if (!exists) {
    this.vectorHnswIndex.initIndex(100000, 48, 128, 100);
    this.vectorHnswIndex.setEfSearch(32);
    this.vectorHnswIndex.writeIndex('ghost.dat');
  } else {
    this.vectorHnswIndex.readIndex(filename, 100000, true);
    this.vectorHnswIndex.setEfSearch(32);
  }

You can create the index and use it like so.

// Here you're creating a new index with the L2 distance metric and 1000 as the max number of elements
const hnswIndex = lib.HierarchicalNSW('l2', 100);

// Initialize the index with the dimensions (1536), m, efConstruction. See the section below on parameters for more details. These cannot be changed after the index is created.
index.initIndex(1536, 36, 16, 200);

// Set efSearch parameters. This can be changed after the index is created.
index.setEfSearch(efSearch);

// Now you can add items to the index, labels are returned as an array for the vectors.  It will reuse deleted labels if possible based on the second parameter.
const labels = index.addItems(vectors, true);

// Now you can search the index
const result1 = index.searchKnn(vectors[10], 10, undefined);

// You can also search the index with a label filter
const labelFilter = (label: number) => {
  return label >= 10 && label < 20;
}
const result2 = index.searchKnn(testVectorData.vectors[10], 10, labelFilter);

More usage examples to be added.

For now, see the HierarchicalNSW.test.ts file in the tests folder and refer to the hnswlib-node API Documentation.

Extended IndexedDB (IDBFS) Support

The hnswlib-wasm library provides extended support for IndexedDB (IDBFS) to store and manage the search index in the browser. This allows you to save and load the search index easily in a web environment and you don't have to move data from the emcc file system to javascript memory. It uses the FS from Emscripten to save and load the index via the virtual file system. The virtual file system is synchronized with IDBFS.

Saving and Loading the Search Index with IDBFS

To save the search index, use the writeIndex method:

await index.writeIndex('savedIndex');

To load a previously saved search index, use the readIndex method:

await index.readIndex('savedIndex', false);

Synchronizing the Emscripten File System with IDBFS

The syncFs method is used to synchronize the Emscripten file system with the persistent storage IDBFS. You can use this method to save or read data from the file system's persistent source.

await lib.EmscriptenFileSystemManager.syncFS(true, emscripten::val::undefined()); // Read data from the persistent source
await lib.EmscriptenFileSystemManager.syncFS(false, emscripten::val::undefined()); // Save data to the persistent source

HNSW Algorithm Parameters for hnswlib-wasm

This section will provide an overview of the HNSW algorithm parameters and their impact on performance when using the hnswlib-wasm library. HNSW (Hierarchical Navigable Small World) is a graph-based index structure for efficient similarity search in high-dimensional spaces.

Image from pinecone.io

It has several parameters that can be tuned to control the trade-off between search quality and index size or construction time. Here are some of the key parameters.

Search Parameters: efSearch

efSearch is the size of the dynamic list for the nearest neighbors used during the search. Higher efSearch values lead to more accurate but slower searches. efSearch cannot be set lower than the number of queried nearest neighbors k and can be any value between k and the size of the dataset.

Construction Parameters: M

M is the number of bi-directional links created for every new element during index construction. A reasonable range for M is 2-100. Higher M values work better on datasets with high intrinsic dimensionality and/or high recall, while lower M values work better for datasets with low intrinsic dimensionality and/or low recall. The parameter also determines the algorithms memory consumption, which is roughly M * 8-10 bytes per stored element.

Construction Parameters: efConstruction

efConstruction controls the index construction time and accuracy. Bigger efConstruction values lead to longer construction times but better index quality. At some point, increasing efConstruction does not improve the quality of the index. To check if the selected efConstruction value is appropriate, measure recall for M nearest neighbor search when efSearch = efConstruction. If the recall is lower than 0.9, there is room for improvement.

Parameter Selection for hnswlib-wasm

When using hnswlib-wasm, it is essential to choose appropriate values for M, efSearch, and efConstruction based on your datasets size and dimensionality. Since hnswlib-wasm is running in the browser, you should consider the available memory and performance limitations. Here are some recommendations:

M:

Choose a value in the range of 12-48, as it works well for most use cases. You may need to experiment to find the optimal value for your specific dataset.

efSearch:

Start with a value close to M and adjust it based on your desired trade-off between search speed and accuracy. Lower values will be faster but less accurate, while higher values will be more accurate but slower.

efConstruction:

Set this value considering the expected query volume. If you anticipate low query volume, you can set a higher value for efConstruction to improve recall with minimal impact on search time, especially when using lower M values.

Remember that higher M values will increase the memory usage of the index, so you should balance performance and memory constraints when choosing your parameters for hnswlib-wasm.

Resources

Learn hnsw by pinecone

Vector indexes by pinecone

Images from pinecone.io

Other Notes

License

hnswlib-wasm is available as open source under the terms of the Apache-2.0 License.

Contributing

To build

yarn install
make rebuild
yarn build

To test

yarn test

Contact @ShravanSunder first!

hnswlib-wasm's People

Contributors

dependabot[bot] avatar shravansunder avatar yoshoku avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hnswlib-wasm's Issues

Cannot build from source

git clone [email protected]:ShravanSunder/hnswlib-wasm.git
cd hnswlib-wasm
yarn install
make rebuild
yarn build

results in these errors:

mkdir -p lib
em++ -O3 -fwasm-exceptions -s ALLOW_MEMORY_GROWTH=1 -s ALLOW_TABLE_GROWTH=1 -s WASM=1 -s MODULARIZE=1 -s EXPORT_NAME='hnswlib' -s ASSERTIONS=1 -s DEMANGLE_SUPPORT=1 -s SINGLE_FILE --bind -s ENVIRONMENT=web -gsource-map -lidbfs.js -I././src/hnswlib  ././src/wrapper.cpp -o ./lib/hnswlib.mjs 
In file included from ././src/wrapper.cpp:6:
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:440:13: error: too many arguments to function call, expected single argument 'vec', have 2 arguments
  438 |         return internal::BindingType<ReturnType>::toWireType(
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  439 |             fn(internal::BindingType<Args>::fromWireType(args)...),
  440 |             ReturnPolicy{}
      |             ^~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:585:63: note: in instantiation of member function 'emscripten::internal::Invoker<emscripten::internal::rvp::default_tag, std::vector<float>, const std::vector<float> &>::invoke' requested here
  585 |     auto invoke = Invoker<ReturnPolicy, ReturnType, Args...>::invoke;
      |                                                               ^
././src/wrapper.cpp:1036:5: note: in instantiation of function template specialization 'emscripten::function<std::vector<float>, const std::vector<float> &>' requested here
 1036 |     function("normalizePoint", &normalizePointsPure);
      |     ^
././src/wrapper.cpp:180:23: note: 'toWireType' declared here
  180 |       static WireType toWireType(const std::vector<T, Allocator>& vec) {
      |                       ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
././src/wrapper.cpp:181:16: error: no matching function for call to 'toWireType'
  181 |         return ValBinding::toWireType(val::array(vec));
      |                ^~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:438:51: note: in instantiation of member function 'emscripten::internal::BindingType<std::vector<float>>::toWireType' requested here
  438 |         return internal::BindingType<ReturnType>::toWireType(
      |                                                   ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:585:63: note: in instantiation of member function 'emscripten::internal::Invoker<emscripten::internal::rvp::default_tag, std::vector<float>, const std::vector<float> &>::invoke' requested here
  585 |     auto invoke = Invoker<ReturnPolicy, ReturnType, Args...>::invoke;
      |                                                               ^
././src/wrapper.cpp:1036:5: note: in instantiation of function template specialization 'emscripten::function<std::vector<float>, const std::vector<float> &>' requested here
 1036 |     function("normalizePoint", &normalizePointsPure);
      |     ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/val.h:809:19: note: candidate function not viable: requires 2 arguments, but 1 was provided
  809 |   static WireType toWireType(val&& v, rvp::default_tag) {
      |                   ^          ~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/val.h:815:19: note: candidate function not viable: requires 2 arguments, but 1 was provided
  815 |   static WireType toWireType(const val& v, rvp::default_tag) {
      |                   ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from ././src/wrapper.cpp:6:
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:667:13: error: too many arguments to function call, expected single argument 'vec', have 2 arguments
  662 |         return internal::BindingType<ReturnType>::toWireType(
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  663 |             (internal::BindingType<ThisType>::fromWireType(wireThis)->*method)(
  664 |                 internal::BindingType<Args>::fromWireType(args)...
  665 |             )
  666 |             ,
  667 |             ReturnPolicy{}
      |             ^~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:1452:111: note: in instantiation of member function 'emscripten::internal::MethodInvoker<emscripten::internal::rvp::default_tag, std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(const std::vector<std::vector<float>> &, bool), std::vector<unsigned int>, emscripten::HierarchicalNSW *, const std::vector<std::vector<float>> &, bool>::invoke' requested here
 1452 |         auto invoke = MethodInvoker<ReturnPolicy, decltype(memberFunction), ReturnType, ClassType*, Args...>::invoke;
      |                                                                                                               ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:1740:27: note: in instantiation of function template specialization 'emscripten::internal::RegisterClassMethod<std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(const std::vector<std::vector<float>> &, bool)>::invoke<emscripten::HierarchicalNSW>' requested here
 1740 |         invoker::template invoke<ClassType, Policies...>(methodName, callable);
      |                           ^
././src/wrapper.cpp:1076:8: note: in instantiation of function template specialization 'emscripten::class_<emscripten::HierarchicalNSW>::function<emscripten::internal::DeduceArgumentsTag, std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(const std::vector<std::vector<float>> &, bool)>' requested here
 1076 |       .function("addItems", &HierarchicalNSW::addItems)
      |        ^
././src/wrapper.cpp:180:23: note: 'toWireType' declared here
  180 |       static WireType toWireType(const std::vector<T, Allocator>& vec) {
      |                       ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
././src/wrapper.cpp:181:16: error: no matching function for call to 'toWireType'
  181 |         return ValBinding::toWireType(val::array(vec));
      |                ^~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:662:51: note: in instantiation of member function 'emscripten::internal::BindingType<std::vector<unsigned int>>::toWireType' requested here
  662 |         return internal::BindingType<ReturnType>::toWireType(
      |                                                   ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:1452:111: note: in instantiation of member function 'emscripten::internal::MethodInvoker<emscripten::internal::rvp::default_tag, std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(const std::vector<std::vector<float>> &, bool), std::vector<unsigned int>, emscripten::HierarchicalNSW *, const std::vector<std::vector<float>> &, bool>::invoke' requested here
 1452 |         auto invoke = MethodInvoker<ReturnPolicy, decltype(memberFunction), ReturnType, ClassType*, Args...>::invoke;
      |                                                                                                               ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:1740:27: note: in instantiation of function template specialization 'emscripten::internal::RegisterClassMethod<std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(const std::vector<std::vector<float>> &, bool)>::invoke<emscripten::HierarchicalNSW>' requested here
 1740 |         invoker::template invoke<ClassType, Policies...>(methodName, callable);
      |                           ^
././src/wrapper.cpp:1076:8: note: in instantiation of function template specialization 'emscripten::class_<emscripten::HierarchicalNSW>::function<emscripten::internal::DeduceArgumentsTag, std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(const std::vector<std::vector<float>> &, bool)>' requested here
 1076 |       .function("addItems", &HierarchicalNSW::addItems)
      |        ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/val.h:809:19: note: candidate function not viable: requires 2 arguments, but 1 was provided
  809 |   static WireType toWireType(val&& v, rvp::default_tag) {
      |                   ^          ~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/val.h:815:19: note: candidate function not viable: requires 2 arguments, but 1 was provided
  815 |   static WireType toWireType(const val& v, rvp::default_tag) {
      |                   ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from ././src/wrapper.cpp:6:
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:667:13: error: too many arguments to function call, expected single argument 'vec', have 2 arguments
  662 |         return internal::BindingType<ReturnType>::toWireType(
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  663 |             (internal::BindingType<ThisType>::fromWireType(wireThis)->*method)(
  664 |                 internal::BindingType<Args>::fromWireType(args)...
  665 |             )
  666 |             ,
  667 |             ReturnPolicy{}
      |             ^~~~~~~~~~~~~~
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:1452:111: note: in instantiation of member function 'emscripten::internal::MethodInvoker<emscripten::internal::rvp::default_tag, std::vector<unsigned int> (emscripten::HierarchicalNSW::*)(), std::vector<unsigned int>, emscripten::HierarchicalNSW *>::invoke' requested here
 1452 |         auto invoke = MethodInvoker<ReturnPolicy, decltype(memberFunction), ReturnType, ClassType*, Args...>::invoke;
      |                                                                                                               ^
/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot/include/emscripten/bind.h:1740:27: note: in instantiation of function template specialization 'emscripten::internal::RegisterClassMethod<std::vector<unsigned int> (emscripten::HierarchicalNSW::*)()>::invoke<emscripten::HierarchicalNSW>' requested here
 1740 |         invoker::template invoke<ClassType, Policies...>(methodName, callable);
      |                           ^
././src/wrapper.cpp:1077:8: note: in instantiation of function template specialization 'emscripten::class_<emscripten::HierarchicalNSW>::function<emscripten::internal::DeduceArgumentsTag, std::vector<unsigned int> (emscripten::HierarchicalNSW::*)()>' requested here
 1077 |       .function("getUsedLabels", &HierarchicalNSW::getUsedLabels)
      |        ^
././src/wrapper.cpp:180:23: note: 'toWireType' declared here
  180 |       static WireType toWireType(const std::vector<T, Allocator>& vec) {
      |                       ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5 errors generated.
em++: error: '/opt/homebrew/Cellar/emscripten/3.1.61/libexec/llvm/bin/clang++ -target wasm32-unknown-emscripten -mllvm -combiner-global-alias-analysis=false -mllvm -wasm-enable-sjlj -mllvm -disable-lsr --sysroot=/opt/homebrew/Cellar/emscripten/3.1.61/libexec/cache/sysroot -DEMSCRIPTEN -Xclang -iwithsysroot/include/fakesdl -Xclang -iwithsysroot/include/compat -O3 -fwasm-exceptions -g -I././src/hnswlib ././src/wrapper.cpp -c -o /var/folders/2r/wm98mb7x1ms1bng9qjxt71q40000gn/T/emscripten_temp_56liivkk/wrapper_0.o' failed (returned 1)
make: *** [lib/hnswlib] Error 1

import() is disallowed on ServiceWorkerGlobalScope by the HTML specification.

Hello,

I'm trying to use your package to run hnswlib in a Chrome extension's background script. When I call loadHnswlib I get the error:

import() is disallowed on ServiceWorkerGlobalScope by the HTML specification.

Note that I don't get this error when importing transformers.js and it's successfully working in the background script, so it's not generally failing to do imports.

Any idea why your module might be different? I'm stumped why one module would import and work but not another, given this error message.

loading multiple indexes

Hi, thanks for compiling hnsw to webassembly! I am using this library for several of my open source packages.

I have found the following issue: When I try to load multiple indexes in parallel and read the saved index from indexeddb or diskemscripten FS using readIndex, Webassembly errors get thrown around. I don't get a specific error message. Just "webassembly.exception"

like this:

image

The error message doesn't have any real information attached to it.

--> my solution right now is to add a pause of 500ms before loading another index. And that seems to mitigate the issue for now.

here is the code if you're interested:

https://github.com/Xyntopia/taskyon/blob/5bf6076ad4b3d456f05185bc5fbeac728c740a99/src/modules/vectorSearch.ts#L49

This works. But I want to throw that in here. Maybe there is an easy solution to this. My workaround is quiet dirty.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.