
oasysdb's People

Contributors

dteare, edwinkys, noneback


oasysdb's Issues

FEAT: add collection api to retrieve record with filtered metadata

Purpose & use case

We should add an API to Collection, perhaps called filter, that takes a condition and returns a hash map of VectorID to Record. This API is useful for users who need to retrieve only the records whose data match the provided condition.
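
A minimal sketch of what such a method could look like. The VectorID, Record, and Collection types here are hypothetical stand-ins rather than OasysDB's actual definitions, and the condition is assumed to be a closure over a record:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for OasysDB's types, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct VectorID(pub u32);

#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub vector: Vec<f32>,
    pub data: String,
}

pub struct Collection {
    records: HashMap<VectorID, Record>,
}

impl Collection {
    pub fn new(records: HashMap<VectorID, Record>) -> Self {
        Self { records }
    }

    /// Returns a copy of every record for which `condition` returns true.
    pub fn filter<F>(&self, condition: F) -> HashMap<VectorID, Record>
    where
        F: Fn(&Record) -> bool,
    {
        self.records
            .iter()
            .filter(|(_, record)| condition(*record))
            .map(|(id, record)| (*id, record.clone()))
            .collect()
    }
}
```

Taking a closure keeps the sketch independent of how conditions would actually be expressed; a real API might accept a structured filter instead.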

FEAT: allow database reset

Use case

Currently, there is no way to reset all of the values stored in the database. This is a good-to-have feature for users exploring how to insert values into OasysDB. It will allow a quick reset via the API without replacing the Docker container or removing the data storage manually.

Proposed solution

Users should be able to call two endpoints: DELETE /values and DELETE /graphs. Each endpoint will be responsible for resetting the data stored in its database. I think this is better than having one endpoint that resets both databases, as it allows more fine-grained control over the database.

An alternative solution that I thought about is to have one utility endpoint like /reset to remove all of the database storage from disk. One issue with this approach is that it's less RESTful, which is what we are trying to fully implement with OasysDB.

BUG: periodically saving collection bloats db storage size

Description

This bug happens when saving a collection in a periodic manner like the example below.

# Create the DB and collection...
for _ in range(300):
    records = Record.many_random(dimension=512, len=100)
    collection.insert_many(records)
    db.save_collection("storage", collection)

This causes unflushed dirty IO buffers to bloat the database folder to an unimaginable size 🤯
From a 70MB database to 10GB.

Read the discussion #77 for more context about what is happening.

FEAT: api to get all collection records as map

Description

This new API should allow users to access all of the records inside the collection so they can perform operations beyond the capability of the database, such as filtering or iterating.

This issue is correlated with PR #40.

FEAT: python integration as optional feature

Use case

I'm trying to integrate OasysDB as part of a Rust staticlib, which gets linked into a C++-based project. I'm not using the Python integration at all, yet all of the pyo3 dependencies are compiled and the final binary needs to be linked with the Python library.

Proposed solution

Please make pyo3 an optional feature.

FEAT: index params to add dimension configuration

Use Case

In OasysDB, there are currently no dimension checks when inserting new vectors.
This could lead to unchecked mistakes by users.

Proposed Solution

In the IndexParams trait, we should add a method:

fn dimension(&self) -> usize;

This method will return the configured dimension of the index as an integer. When building the index and inserting vectors in each index implementation, we need to add a check to ensure the vector dimension is valid.
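
The check described above could be sketched as follows. Only the dimension method comes from the proposal; the FlatParams struct, the DimensionMismatch error, and the check_dimension helper are hypothetical names used for illustration:

```rust
// Hypothetical trait mirroring the proposal; not OasysDB's actual definition.
pub trait IndexParams {
    /// Configured dimension of the index.
    fn dimension(&self) -> usize;
}

// Hypothetical error carrying both sides of the mismatch.
#[derive(Debug, PartialEq)]
pub struct DimensionMismatch {
    pub expected: usize,
    pub found: usize,
}

// Hypothetical parameter struct for some index implementation.
pub struct FlatParams {
    pub dimension: usize,
}

impl IndexParams for FlatParams {
    fn dimension(&self) -> usize {
        self.dimension
    }
}

/// Rejects a vector whose length differs from the configured dimension.
pub fn check_dimension<P: IndexParams>(
    params: &P,
    vector: &[f32],
) -> Result<(), DimensionMismatch> {
    let expected = params.dimension();
    if vector.len() == expected {
        Ok(())
    } else {
        Err(DimensionMismatch { expected, found: vector.len() })
    }
}
```

An index implementation would call a check like this at the start of both its build and insert paths.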

CHORE: refactor database returns as result

Description

The purpose of this chore is to refactor the public methods of the database (src/db/database.rs) to include more robust error handling. This includes changing the return types of the functions to Result.

Example

// Current return type on create_collection method.
Collection<D, N, M>

// Expected return type.
Result<Collection<D, N, M>, Box<dyn Error>>

Why is this needed?

This chore will help users who use OasysDB as a crate with error handling and debugging. Along with issue #24, it also lays the groundwork for more robust error messages in OasysDB in general.

DOCS: create documentation website

Use case

OasysDB only has Rust-generated documentation, hosted at docs.rs/oasysdb. Not everyone who uses OasysDB uses Rust, so some users will be unfamiliar with that documentation.

With a dedicated documentation website for OasysDB, users should be able to understand and use OasysDB better.

CHORE: improve collection error handling and returns

Description

The purpose of this chore is to improve the error handling of the collection's public functionality by treating errors as values and returning Result. The scope of this chore is limited to the src/func/collection.rs file to avoid a huge PR.

The result of this chore should look like the code below:

// Old returns
Vec<SearchResult<D>>

// New returns
Result<Vec<SearchResult<D>>, Box<dyn Error>>

Why is this needed?

This improvement will help greatly when users use OasysDB as a vector database, making error handling with OasysDB much more intuitive.

FEAT: add optimized cosine distance function for normalized vectors

Use case

When calculating Cosine similarity on normalized vectors there is no need to calculate the magnitudes of each vector as they are already of unit length.

OpenAI embeddings are normalized, and presumably many others are as well. From the OpenAI embeddings FAQs:

OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product

The Cosine distance calculation from distance.rs#L46 could benefit from this optimization and avoid the last 3 calculations:

    fn cosine(a: &Vector, b: &Vector) -> f32 {
        let dot = Self::dot(a, b);
        let ma = a.0.iter().map(|x| x.powi(2)).sum::<f32>().sqrt();
        let mb = b.0.iter().map(|y| y.powi(2)).sum::<f32>().sqrt();
        dot / (ma * mb)
    }

Proposed solution

I suggest the Distance enum be expanded to include a CosineOptimizedUnitLength variant. Doing so would fit in nicely with the current config:

    let mut config = Config::default();

    // Using optimized calculation as our embeddings are normalized
    config.distance = Distance::CosineOptimizedUnitLength;

The Distance::calculate function could then match on this and call an optimized method:

fn cosine_normalized(a: &Vector, b: &Vector) -> f32 {
    Self::dot(a, b)
}
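
As a quick numeric check of the claim, the sketch below compares the full cosine formula with a plain dot product on vectors that have been normalized to unit length. Plain slices stand in for OasysDB's Vector type, and the normalize helper is an assumption added for the demonstration:

```rust
// Plain slices stand in for OasysDB's Vector type in this sketch.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

pub fn magnitude(a: &[f32]) -> f32 {
    a.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Full cosine similarity: dot product divided by both magnitudes.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (magnitude(a) * magnitude(b))
}

/// Hypothetical helper that scales a vector to unit length.
pub fn normalize(a: &[f32]) -> Vec<f32> {
    let m = magnitude(a);
    a.iter().map(|x| x / m).collect()
}
```

For unit-length inputs both magnitudes are 1 (up to floating-point rounding), so cosine and dot agree and the two square roots can be skipped.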

Additional Context

Not applicable / did so already inline where appropriate.

FEAT: collection insert methods return vector id

Use case

When inserting a vector record to a collection, users need a way to get the newly inserted record vector ID. When users need to update or delete the vector records, this returned vector ID can be used as the input parameters to the methods.

Proposed solution

Modify the return type and value of the Collection.insert method.

// Before
-> Result<(), Error>

// After
-> Result<VectorID, Error>

FEAT: add another distance metrics

Use case

OasysDB uses the Euclidean distance formula to measure the distance between vectors by default. Some users' use cases might require other distance metrics such as Dot Product or Cosine Similarity.

Proposed solution

I think the best solution would be to allow users to configure the choice of distance formula at the Collection configuration level. Something like:

pub struct Config {
    ...
    distance: String
}

Then after that, in the Vector implementation:

pub fn distance(&self, other: &Self, formula: &str) -> f32 {...}
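
One way to keep the string-based configuration while avoiding string matching on every distance call is to parse the name once into an enum. This is a sketch under assumptions: the Distance enum, its variant names, and the accepted strings are hypothetical, and plain slices stand in for Vector:

```rust
use std::str::FromStr;

// Hypothetical enum; the variant names and accepted strings are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Distance {
    Euclidean,
    Dot,
    Cosine,
}

impl FromStr for Distance {
    type Err = String;

    /// Parses a configured formula name, rejecting unknown names.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_lowercase().as_str() {
            "euclidean" => Ok(Distance::Euclidean),
            "dot" => Ok(Distance::Dot),
            "cosine" => Ok(Distance::Cosine),
            other => Err(format!("unknown distance formula: {}", other)),
        }
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl Distance {
    /// Dispatches to the configured formula.
    pub fn calculate(&self, a: &[f32], b: &[f32]) -> f32 {
        match self {
            Distance::Euclidean => a
                .iter()
                .zip(b)
                .map(|(x, y)| (x - y).powi(2))
                .sum::<f32>()
                .sqrt(),
            Distance::Dot => dot(a, b),
            Distance::Cosine => dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt()),
        }
    }
}
```

Parsing up front means a typo in the configured name surfaces as an error when the collection is configured rather than during a search.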

CHORE: measure recall rate

Description

This feature should create a script that allows us to measure the recall rate of OasysDB. This is very useful to make sure that when developing new features and refactoring, we don't decrease the recall performance of the database as this is a very important metric.
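
The core of such a script is the recall computation itself. A minimal sketch, assuming record IDs are plain u32 values and that the exact nearest neighbors (for example from a brute-force search) are available as ground truth; the function name recall_at_k is hypothetical:

```rust
use std::collections::HashSet;

/// Fraction of the true k nearest neighbors that appear in the
/// first k approximate results.
pub fn recall_at_k(ground_truth: &[u32], retrieved: &[u32], k: usize) -> f32 {
    let truth: HashSet<u32> = ground_truth.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0;
    }

    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| truth.contains(*id))
        .count();

    hits as f32 / truth.len() as f32
}
```

Averaging this value over many queries gives a single number that can be tracked across refactors.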

FEAT: add hnsw index

Use Case

This feature should add a new index implementation for HNSW indexing algorithm.
This allows users to configure the index to use this algorithm.

Proposed Solution

  1. Under the indices module, add idx_hnsw.rs file.
  2. In this file, implement the traits required for index implementations.

FEAT: add qol methods for vector id

Use case & proposed solution

This is a request to add several utility methods to help users use the VectorID struct better. These are some ideas for improvement:

  • Type conversion from VectorID to usize and u32.
  • Utility methods to_usize and to_u32.
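
Both ideas could be sketched as below. The definition of VectorID here is hypothetical and assumes it wraps a u32; the real struct may differ:

```rust
// Hypothetical definition; assumes VectorID wraps a u32.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct VectorID(pub u32);

impl VectorID {
    /// Returns the ID as a u32.
    pub fn to_u32(&self) -> u32 {
        self.0
    }

    /// Returns the ID as a usize, convenient for indexing.
    pub fn to_usize(&self) -> usize {
        self.0 as usize
    }
}

impl From<VectorID> for u32 {
    fn from(id: VectorID) -> Self {
        id.0
    }
}

impl From<VectorID> for usize {
    fn from(id: VectorID) -> Self {
        id.0 as usize
    }
}
```

With the From implementations in place, callers can also write `let index: usize = id.into();`.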

FEAT: add boolean type to metadata

Use case

Allow users to store a boolean as metadata or, most commonly, an object containing boolean values like:

{
  "text": "...",
  "is_private": true
}

Proposed solution

Add a boolean variant to the metadata enum and add type conversions to/from JSON, PyO3, and Rust native types.
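
For the Rust native side, the change could look roughly like this. The Metadata enum below is a reduced, hypothetical version (the real enum has its own set of variants), and the JSON and PyO3 conversions are left out:

```rust
use std::convert::TryFrom;

// Reduced, hypothetical version of the metadata enum.
#[derive(Debug, Clone, PartialEq)]
pub enum Metadata {
    Text(String),
    Integer(i64),
    Boolean(bool),
}

impl From<bool> for Metadata {
    fn from(value: bool) -> Self {
        Metadata::Boolean(value)
    }
}

impl TryFrom<Metadata> for bool {
    type Error = &'static str;

    /// Succeeds only when the metadata value is the boolean variant.
    fn try_from(value: Metadata) -> Result<Self, Self::Error> {
        match value {
            Metadata::Boolean(b) => Ok(b),
            _ => Err("metadata value is not a boolean"),
        }
    }
}
```

The conversion back to bool is fallible because a Metadata value may hold any other variant.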

DISCUSS: called `Option::unwrap()` on a `None` value on insert record

Short description

Hi, I'm using OasysDB 0.5.1.

I got a panic when I called the insert method of a collection:

thread 'dialog-flow-system' panicked at D:\programData\rust\cargo\registry\src\index.crates.io-6f17d22bba15001f\oasysdb-0.5.1\src\func\utils.rs:413:14:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library\std\src\panicking.rs:645
   1: core::panicking::panic_fmt
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library\core\src\panicking.rs:72
   2: core::panicking::panic
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library\core\src\panicking.rs:145
   3: core::option::unwrap_failed
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library\core\src\option.rs:1985
   4: enum2$<core::option::Option<usize> >::unwrap
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6\library\core\src\option.rs:933
   5: oasysdb::func::utils::IndexConstruction::insert
             at D:\programData\rust\cargo\registry\src\index.crates.io-6f17d22bba15001f\oasysdb-0.5.1\src\func\utils.rs:410
   6: oasysdb::func::collection::Collection::insert_to_layers
             at D:\programData\rust\cargo\registry\src\index.crates.io-6f17d22bba15001f\oasysdb-0.5.1\src\func\collection.rs:748
   7: oasysdb::func::collection::Collection::insert
             at D:\programData\rust\cargo\registry\src\index.crates.io-6f17d22bba15001f\oasysdb-0.5.1\src\func\collection.rs:240

Steps to reproduce

I created a simple repository and tried to reproduce this panic, but I can't trigger the error again.
Repo: https://github.com/dialogflowchatbot/oasysdb

After I deleted the data directory (which may have caused the panic), the error was gone.

Do you have any thoughts on what causes this panic?

FEAT: database persistence to disk

Use case

Currently, the database data (the key-value pairs and the index) is stored only in memory. This means that every time the database is restarted, we need to re-add the key-value pairs and rebuild the index.

This feature should allow the database data to be persisted to disk so that after a crash or shutdown, the data and index of the previous session can still be used.

Proposed solution

The database should store a snapshot of the key-value pairs periodically. Every time the database is started, it will look up the most recent snapshot, prepopulate the key-value store, and rebuild the index based on the existing key-value pairs.


FEAT: embedding model connectors

Purpose

This feature will add a utility connector, built on the EmbeddingModel trait, that streamlines the insert, update, and search operations of OasysDB, allowing users to perform these operations directly without converting the content to vector embeddings beforehand.

Example:

// Before
...
let vector: Vector = generate_vector(content);
let record = Record::new(&vector, &data.into());
collection.insert(&record);

// After
collection.insert_content(content, &data.into());

Usage

  • Users should be able to configure a collection with the embedding model, and it should be optional.
  • When configured, this should enable the functionality of insert_content, search_content, etc.
  • In Rust, this functionality should be gated behind a feature called gen.
  • In Python, this functionality will be accessible but only usable when the embedding model is configured.

pub fn insert_content(content: &str, data: &Metadata) ...
// All vector methods with suffix _content or for batch operations _contents.
// The output of the method should be the same as their default counterparts.

Scope

The scope of this feature doesn't include accommodating third-party embedding models. We will do that in the future, depending on demand.

This change should be backward compatible if possible.

FEAT: add display implementation to error struct

Use case & proposed solution

A lot of applications written in Rust use the justerror or anyhow crates to handle errors. Currently, it's not intuitive to use OasysDB's built-in Error struct with other commonly used error handling crates. We should add a Display implementation to Error so that it's easier to wrap the built-in error with these crates.
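
A minimal sketch of the idea. The Error struct below is hypothetical, with code and message fields assumed for illustration, and the open and run functions exist only to show the error being propagated:

```rust
use std::fmt;

// Hypothetical error struct; the field names are assumptions.
#[derive(Debug)]
pub struct Error {
    pub code: String,
    pub message: String,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.code, self.message)
    }
}

// With Debug and Display in place, the standard Error trait can be
// implemented, which is what generic error-handling code expects.
impl std::error::Error for Error {}

/// Hypothetical fallible operation returning the crate's own error.
pub fn open() -> Result<(), Error> {
    Err(Error {
        code: "FileError".to_string(),
        message: "not found".to_string(),
    })
}

/// The `?` operator can now convert the error into a boxed trait object.
pub fn run() -> Result<(), Box<dyn std::error::Error>> {
    open()?;
    Ok(())
}
```

Crates such as anyhow build their conversions on the standard Error trait, so implementing it alongside Display is what lets `?` work without an explicit map_err.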

FEAT: add batch insert collection method

Use case

Users need a way to insert a lot of vector records at once outside of the Collection.build method.

Technically, users can use the Collection.insert method in a loop. But this is highly inefficient because the private collection method that inserts into the graph layers is designed to insert one vector record at a time.

Proposed solution

We might need a new private method dedicated to inserting multiple vector records into the index layers at once. We don't want to use a for loop combined with the insert_to_layers method, as it is not optimized for batch inserts.

FEAT: create python binding

Use case

Most use cases of vector databases seem to be implemented in Python, as it has a more mature AI ecosystem than Rust. For example, Chroma allows its users to use its vector database simply by installing its Python libraries.

Some LLM frameworks like LangChain are also widely used when implementing LLM-related functionality, which typically includes some form of vector store. By having a Python binding for OasysDB, we open the door to this integration.

Proposed solution

Users should be able to install OasysDB using the pip install oasysdb command. Also, all of the functionality required to run the vector database should be accessible from Python. For this, I found a library that can help with creating Rust-to-Python bindings: https://github.com/PyO3/pyo3.

BUG: database open failed on windows when temporary directory is in different drive

Short Description

Running Database::open fails.

Steps to Reproduce

Version of Oasysdb: 0.7.1

use std::path::Path;
let p = Path::new(".").join("data").join("intentev"); // relative path to D:\work\dialogflow-backend\data
Database::open(p, Some("sqlite://./data/intentev/e.dat"))?;

The open function returns:

code: FileError, message: "The system can't move files to a different disk drive.  (os error 17)"

I suspect that this error is thrown by utils/file.rs at this line: fs::rename(&tmp_file, &path)?;

I printed the temporary directory on Windows, which was: C:\Users\dialogf\AppData\Local\Temp\
and the database path I specified was: D:\work\dialogflow-backend\data\intentev\
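
If the rename is indeed the cause, one possible workaround is to fall back to copy-then-delete when the rename fails, since a rename cannot cross drives or filesystems. This is a sketch only; move_file is a hypothetical helper, not a function in OasysDB:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Moves `from` to `to`. Tries a plain rename first and falls back to
/// copying the file and removing the original when the rename fails,
/// for example because the two paths are on different drives.
pub fn move_file(from: &Path, to: &Path) -> io::Result<()> {
    match fs::rename(from, to) {
        Ok(()) => Ok(()),
        Err(_) => {
            fs::copy(from, to)?;
            fs::remove_file(from)
        }
    }
}
```

Note that the fallback path is not atomic. Creating the temporary file in the same directory as the destination avoids the cross-drive rename altogether.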

Expected Behavior

The database opens successfully.


BUG: unresolved import `std::hash::DefaultHasher` after update to 0.5.0

Short description

Updated from oasysdb 0.4.4 to 0.5.0. Building my program now gives this error:

--> ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/oasysdb-0.5.0/src/db/mod.rs:9:17
  |
9 | use std::hash::{DefaultHasher, Hash, Hasher};
  |                 ^^^^^^^^^^^^^ no `DefaultHasher` in `hash`
  |
  = help: consider importing this struct instead:
          std::collections::hash_map::DefaultHasher

For more information about this error, try `rustc --explain E0432`.
error: could not compile `oasysdb` (lib) due to previous error

Compiling on an M1 MacBook Pro.
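
This looks like a toolchain-version issue: as far as I know, the std::hash::DefaultHasher re-export was only stabilized in Rust 1.76, so older compilers reject the import. The path suggested by the compiler works on both older and newer toolchains. A small sketch, where hash_of is a hypothetical helper:

```rust
// Longer-standing path to the same type as `std::hash::DefaultHasher`.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes any hashable value with the standard library's default hasher.
pub fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Alternatively, updating the toolchain with rustup update should make the original import compile.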

BUG: collection search result incorrect

Short description

I wanted to do sentence similarity

Steps to reproduce

git clone https://github.com/dialogflowchatbot/oasysdb-test
cd oasysdb-test
git lfs install
git -C resources clone https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
cargo r

I'm using rust-bert to generate sentence embeddings, so the model files need to be downloaded first.

Expected Behavior

The first search should not match anything.

Severity

How severe is the bug?
Normal

Are you using OasysDB in production?
Not yet, but planning to.


FEAT: make relevancy score pre-search filtering

Purpose

Currently, the relevancy score is applied after the search result has been retrieved. This could cause some other nodes that are more relevant to be left out due to the nature of the index.

We want to apply the relevancy score during the creation of the candidate pool.

FEAT: add ivf index algorithm

Use Case

This feature should add a flat IVF index implementation, excluding IVFPQ. IVF can offer a much higher recall rate than IVFPQ because it avoids the loss that happens when PQ vectors are encoded and decoded.

Proposed Solution

  1. Add a new file under the indices module called idx_ivf.rs.
  2. Implement the vector index trait and other traits for the index.

FEAT: improve filtering

Purpose

Add filtering based on metadata when performing a search. This gives better search results because the filter is deterministic in a way that ANN is not.

Solution

This implementation must be pre-filtering. This means the filtering happens not after the ANN result is produced but during the creation of the nearest-neighbor candidate pool.
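
To illustrate the difference, here is a sketch that uses a brute-force scan in place of the real ANN index. The Record struct, its is_private field, and the search_prefiltered function are hypothetical; the point is only that the filter runs before candidates are ranked and truncated to k:

```rust
// Hypothetical record with one metadata field to filter on.
pub struct Record {
    pub vector: Vec<f32>,
    pub is_private: bool,
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

/// Returns the indices of up to `k` nearest records that pass `filter`.
/// The filter is applied while building the candidate pool, so a
/// filtered-out record never takes up one of the `k` slots.
pub fn search_prefiltered<F>(
    records: &[Record],
    query: &[f32],
    k: usize,
    filter: F,
) -> Vec<usize>
where
    F: Fn(&Record) -> bool,
{
    let mut candidates: Vec<(usize, f32)> = records
        .iter()
        .enumerate()
        .filter(|(_, record)| filter(*record))
        .map(|(i, record)| (i, euclidean(&record.vector, query)))
        .collect();

    candidates.sort_by(|a, b| a.1.total_cmp(&b.1));
    candidates.into_iter().take(k).map(|(i, _)| i).collect()
}
```

With post-filtering and k = 1, a query whose single nearest record is filtered out would return nothing; with pre-filtering it returns the nearest record that passes the filter.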

FEAT: metadata interoperability with json value

Use case

When users are dealing with complex data structure such as that of serde_json::Value, it requires a really complex workaround to make it work well with Metadata even though they share the same model.

OasysDB's Metadata and serde_json::Value should be able to convert into one another by using the From and Into implementations.

BUG: python library not loaded, no LC_RPATH's found

Short description

The bug happens when running cargo test in the terminal. Reason: no LC_RPATH's found.

Steps to reproduce

The bug happens after following the steps below from a clean setup:
(Running VS Code on macOS 14.2 with an M2 chip; installed Python version is 3.11.8)

  1. Install Rust following the steps listed on this website: https://www.rust-lang.org/tools/install
  2. Install the rust-analyzer extension in VS Code.
  3. Fork the OasysDB main branch repo and clone it to the local machine.
  4. Change directory to the cloned "oasysdb" folder.
  5. Run cargo test from the terminal.

Expected Behavior

Expected: The compilation succeeds without errors.
Error encountered: Library not loaded: @rpath/Python3.framework/Versions/3.9/Python3, error: test failed, to rerun pass --lib.

Severity

This might block new users or contributors from using or contributing to the project.

Additional Context

Screenshot of the full error is provided here:
OasysDB Error

FEAT: support for wasm32 target (error due to simsimd crate)

Use case

Need target support for wasm32-unknown-unknown for developing on the ICP ecosystem. I'm currently getting this error when building, which has something to do with the simsimd crate:

Caused by: Failed while trying to install all canisters.
Caused by: Failed to install wasm module to canister 'elna_db_backend'.
Caused by: Failed during wasm installation call
Caused by: The replica returned a rejection error: reject code CanisterError, reject message Error from Canister bkyz2-fmaaa-aaaaa-qaaaq-cai: Canister's Wasm module is not valid: Wasm module has an invalid import section. Module imports function 'simsimd_cos_f32' from 'env' that is not exported by the runtime..
This is likely an error with the compiler/CDK toolchain being used to build the canister. Please report the error to IC devs on the forum: https://forum.dfinity.org and include which language/CDK was used to create the canister., error code None

Here's the relevant error from above: Module imports function 'simsimd_cos_f32' from 'env' that is not exported by the runtime.

Please support wasm32-unknown-unknown as a target.

FEAT: error interoperability with anyhow

Use case

When users use the anyhow crate to handle errors in their Rust projects, there is no straightforward way to convert OasysDB's native error type into an anyhow error besides using the .map_err() method. Although it works, when handling multiple errors from OasysDB, users have to write more boilerplate code.

FIX: atomic collection write to disk

Problem

Currently, when performing the Database::save_collection method, OasysDB serializes the collection data into binary and stores it directly in a file. When dealing with a large file that takes multiple seconds to write, there is a high risk of the file being corrupted if the write process is interrupted.

Solution

We want to make the save_collection operation atomic by writing the serialized collection into a temporary file before replacing the actual collection file. This way, if either the write or the replace operation fails, the actual collection file won't be corrupted.
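
The write-then-replace pattern could be sketched like this, using only the standard library. The write_atomic name and the .tmp extension are hypothetical; the temporary file is deliberately created next to the destination so the final rename stays on one filesystem:

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to a sibling temporary file, flushes it to disk,
/// then renames it over `path`.
pub fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Same directory as the destination; the extension is an assumption.
    let tmp = path.with_extension("tmp");

    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Make sure the data reaches the disk before the rename.
    file.sync_all()?;
    drop(file);

    // Replaces the destination if it already exists.
    fs::rename(&tmp, path)
}
```

If the process is interrupted before the rename, the previous collection file is left untouched and only the temporary file is incomplete.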

BUG: dynamic module does not define module export function on 0.4.1 python

Short description

After upgrading from 0.3.0 to 0.4.1, my code breaks.

ImportError
Traceback (most recent call last) Cell In[1], line 1
----> 1 from oasysdb.prelude import Vector, Config, Database, Collection, Record

File ~/.pyenv/versions/3.11.4/envs/env/lib/python3.11/site-packages/oasysdb/__init__.py:2
      1 # flake8: noqa F401
----> 2 from .oasysdb import *

ImportError: dynamic module does not define module export function (PyInit_oasysdb)

Steps to reproduce

Just import from oasysdb.prelude.

Expected Behavior

Works like a charm

BTW, great work.

CHORE: add benchmark tests

Description

Add a script to benchmark the database performance, the results of the benchmark, and documentation of the benchmarking process. The primary operation that we want to benchmark is graph querying.

Why is this needed?

This helps users compare OasysDB to other vector databases and make a decision. It also helps us improve OasysDB by comparing performance from benchmark to benchmark, making sure our changes actually improve the database.

FEAT: min distance config to include to search result

Use case & purpose

Currently, the Collection.search method returns the approximate nearest neighbors regardless of their distance. A result may be totally irrelevant to the query yet still be included in the search result. Refer to issue #46.

Expected result

This feature should allow users to configure the collection so that if a result candidate's distance doesn't pass the configured threshold, the candidate won't be included in the search result.

This makes sure that only relevant result candidates are included in the search result.

Proposed solution

This value should be included in the Config struct or directly in the Collection struct.

One potential roadblock is how to make this configuration intuitive. Distance values differ depending on the distance formula. We could use an arbitrary float value, but that might require the user to readjust the value.
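
The thresholding step itself is small. A sketch, assuming a distance where smaller means more relevant (as with Euclidean); the SearchResult struct and the apply_threshold function are hypothetical:

```rust
// Hypothetical search result carrying the distance to the query.
#[derive(Debug, Clone, PartialEq)]
pub struct SearchResult {
    pub id: u32,
    pub distance: f32,
}

/// Keeps only the results whose distance does not exceed `max_distance`.
/// Assumes smaller distances mean more relevant results.
pub fn apply_threshold(results: Vec<SearchResult>, max_distance: f32) -> Vec<SearchResult> {
    results
        .into_iter()
        .filter(|result| result.distance <= max_distance)
        .collect()
}
```

For similarity-style metrics such as dot product, where larger means more relevant, the comparison would need to be reversed, which is part of why a single float setting is hard to make intuitive.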

DISCUSS: certificate of https://www.oasysai.com/ was expired.

Hi,

I just visited https://www.oasysai.com/ and the browser said the certificate had expired.

Websites prove their identity via certificates, which are valid for a set time period.
The certificate for www.oasysai.com expired on 4/23/2024.

I found that the certificate was issued by Let's Encrypt, so maybe a renewal helper could be used to obtain certificates automatically.

BTW, if this is not the proper place to report this, please feel free to delete this thread.

FEAT: add manhattan distance metric

Use Case

This feature should allow users to configure Manhattan distance metric for their index:
Manhattan Distance

Proposed Solution

  • Add an enum variant, Manhattan, to DistanceMetric enum.
  • Add function to calculate Manhattan distance.
  • If possible, add SIMD variant to calculate the distance.
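
The scalar version of the calculation is the sum of absolute coordinate differences. A minimal sketch over plain slices; the function name is an assumption, and no SIMD variant is shown:

```rust
/// Manhattan (L1) distance: the sum of absolute coordinate differences.
/// Extra elements in the longer slice are ignored by `zip`.
pub fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}
```

For example, the distance between [1, 2] and [4, -2] is |1 - 4| + |2 - (-2)| = 7.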

FEAT: optimize collection read and write from/to disk

Purpose

Improve the performance when reading and writing collections from/to disk.
This optimization should stay backward compatible.

Ultimately, to improve the speed, we need to reduce the overall collection size by offloading the storage of the vectors and metadata to disk permanently.

Proposed Solution

Use bincode::serialize_into() and bincode::deserialize_from().

CHORE: measure memory usage

Description

We need functionality to measure the memory usage of the collection, along with documentation about the memory usage when storing and searching the collection. For example, how much memory is required to store and query the SIFT datasets.

Why is this needed?

This functionality and the documentation of memory usage measurements will help users evaluate OasysDB and better understand the requirements for their use case.

Relevant resources:
