
annlite's Introduction




AnnLite logo: A fast and efficient ANN library

A fast embedded library for approximate nearest neighbor search


What is AnnLite?

AnnLite is a lightweight and embeddable library for fast and filterable approximate nearest neighbor search (ANNS). It lets you search for nearest neighbors in a dataset of millions of points with a Pythonic API.

Highlighted features:

  • 🐥 Easy-to-use: a simple, Pythonic API that is intuitive to set up in production.

  • 🐎 Fast: the library uses a highly optimized approximate nearest neighbor search algorithm (HNSW) to search for nearest neighbors.

  • 🔎 Filterable: the library allows you to search for nearest neighbors within a subset of the dataset.

  • 🍱 Integration: smooth integration with the neural search ecosystem, including Jina and DocArray, so that users can easily expose a search API over gRPC and/or HTTP.

The library is easy to install and use, and is designed to be used from Python.

Installation

The easiest way to install AnnLite is via pip:

pip install -U annlite

or install from source:

python setup.py install
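
Alternatively, a pip-based source install also works and avoids invoking setup.py directly (a sketch, assuming you have cloned the repository and are in its root directory):

pip install .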

Quick start

Before you start, you should have some experience with DocArray. AnnLite is designed to work with DocArray, so make sure you know how to use it first.

For example, you can create a DocumentArray with 1,000 random 128-dimensional vectors:

from docarray import DocumentArray
import numpy as np

docs = DocumentArray.empty(1000)
docs.embeddings = np.random.random([1000, 128]).astype(np.float32)

Index

Then you can create an AnnLite indexer to index the created docs and search for nearest neighbors:

from annlite import AnnLite

ann = AnnLite(128, metric='cosine', data_path="/tmp/annlite_data")
ann.index(docs)

Note that this will create a directory /tmp/annlite_data to persist the indexed documents. If the directory already exists, the index will be loaded from it. If you want to start with a fresh index, delete the directory first.
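
For example, a minimal sketch for starting fresh (assuming nothing else under /tmp/annlite_data needs to be kept):

import shutil

from annlite import AnnLite

shutil.rmtree('/tmp/annlite_data', ignore_errors=True)  # drop any previously persisted index
ann = AnnLite(128, metric='cosine', data_path='/tmp/annlite_data')  # starts empty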

Search

Then you can search for nearest neighbors for some query docs with ann.search():

query = DocumentArray.empty(5)
query.embeddings = np.random.random([5, 128]).astype(np.float32)

result = ann.search(query)

You can then inspect the retrieved docs for each query doc via its matches:

for q in query:
    print(f'Query {q.id}')
    for k, m in enumerate(q.matches):
        print(f'{k}: {m.id} {m.scores["cosine"]}')
Query ddbae2073416527bad66ff186543eff8
0: 47dcf7f3fdbe3f0b8d73b87d2a1b266f {'value': 0.17575037}
1: 7f2cbb8a6c2a3ec7be024b750964f317 {'value': 0.17735684}
2: 2e7eed87f45a87d3c65c306256566abb {'value': 0.17917466}
Query dda90782f6514ebe4be4705054f74452
0: 6616eecba99bd10d9581d0d5092d59ce {'value': 0.14570713}
1: d4e3147fc430de1a57c9883615c252c6 {'value': 0.15338594}
2: 5c7b8b969d4381f405b8f07bc68f8148 {'value': 0.15743542}
...

Or shorten the loop into a one-liner using DocArray's element and attribute selectors:

print(query['@m', ('id', 'scores__cosine')])

Query

You can get a specific document by its id:

doc = ann.get_doc_by_id('<doc_id>')

And you can also get the documents with limit and offset, which is useful for pagination:

docs = ann.get_docs(limit=10, offset=0)
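
For example, a simple pagination loop might look like this (a sketch; page_size and the processing step are placeholders, and it assumes get_docs returns an empty result past the end):

page_size = 10
offset = 0
while True:
    page = ann.get_docs(limit=page_size, offset=offset)
    if len(page) == 0:
        break
    # ... process the current page of documents ...
    offset += page_size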

Furthermore, you can also get the documents ordered by a specific column from the index:

docs = ann.get_docs(limit=10, offset=0, order_by='x', ascending=True)

Note: the order_by column must be one of the columns in the index.
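
A minimal sketch, assuming a hypothetical numeric column x that was declared when the index was created (and a hypothetical data path):

from annlite import AnnLite

# 'x' must be declared as a column at construction time for order_by='x' to work
ann = AnnLite(128, columns=[('x', float)], data_path='/tmp/annlite_data_x')
# ... after indexing docs whose tags contain 'x' ...
docs = ann.get_docs(limit=10, offset=0, order_by='x', ascending=True)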

Update

After you have indexed the docs, you can update the docs in the index by calling ann.update():

updated_docs = docs.sample(10)
updated_docs.embeddings = np.random.random([10, 128]).astype(np.float32)

ann.update(updated_docs)

Delete

And finally, you can delete the docs from the index by calling ann.delete():

to_delete = docs.sample(10)
ann.delete(to_delete)

Search with filters

To support search with filters, the AnnLite index must be created with the columns parameter, which specifies the fields you want to filter by. At query time, AnnLite filters the dataset by applying conditions on those fields.

import annlite

# the column schema: (name:str, dtype:type, create_index: bool)
ann = annlite.AnnLite(128, columns=[('price', float)], data_path="/tmp/annlite_data")

Then you can index docs where each doc has a price field with a float value stored in its tags:

import random

from docarray import Document

docs = DocumentArray(
    [
        Document(id=f'{i}', tags={'price': random.random()})
        for i in range(1000)
    ]
)

docs.embeddings = np.random.random([1000, 128]).astype(np.float32)

ann.index(docs)

Then you can search for nearest neighbors with filtering conditions:

query = DocumentArray.empty(5)
query.embeddings = np.random.random([5, 128]).astype(np.float32)

ann.search(query, filter={"price": {"$lte": 50}}, limit=10)
print('results with filtering:')
for i, q in enumerate(query):
    print(f'query [{i}]:')
    for m in q.matches:
        print(f'\t{m.id} {m.scores["euclidean"].value} (price={m.tags["price"]})')

The filter parameter is a dictionary of conditions: each key is a field name, and each value is a dictionary mapping operators to values. The query language is the same as the MongoDB Query Language; we currently support a subset of its selectors.

  • $eq - Equal to (number, string)
  • $ne - Not equal to (number, string)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - Included in an array
  • $nin - Not included in an array

A document matches when its fields satisfy the given conditions. Here are some example queries:

  1. White Nike shoes:

    {
      "brand": {"$eq": "Nike"},
      "category": {"$eq": "Shoes"},
      "color": {"$eq": "White"}
    }

    We also support boolean operators $or and $and:

    {
      "$and":
        {
          "brand": {"$eq": "Nike"},
          "category": {"$eq": "Shoes"},
          "color": {"$eq": "White"}
        }
    }
  2. Nike shoes, or a price of at most $100:

    {
      "$or":
        {
          "brand": {"$eq": "Nike"},
          "price": {"$lte": 100}
        }
    }

Dump and Load

By default, the HNSW index lives in memory. You can dump it to data_path by calling .dump():

from annlite import AnnLite

ann = AnnLite(128, metric='cosine', data_path="/path/to/data_path")
ann.index(docs)
ann.dump()

And you can restore the HNSW index from data_path if a dump exists:

new_ann = AnnLite(128, metric='cosine', data_path="/path/to/data_path")

If you didn't dump the HNSW index, it will be rebuilt from scratch, which can take a while.

Supported distance metrics

AnnLite supports the following distance metrics:

| Distance          | Parameter     | Equation                                                   |
|-------------------|---------------|------------------------------------------------------------|
| Euclidean         | euclidean     | d = sqrt(sum((Ai - Bi)^2))                                 |
| Inner product     | inner_product | d = 1.0 - sum(Ai * Bi)                                     |
| Cosine similarity | cosine        | d = 1.0 - sum(Ai * Bi) / sqrt(sum(Ai * Ai) * sum(Bi * Bi)) |

Note that inner product is not an actual metric: an element can be closer to some other element than to itself. That allows some speedup if you remove from the index all elements that are not the closest to themselves, e.g., inner_product([1.0, 1.0], [1.0, 1.0]) < inner_product([1.0, 1.0], [2.0, 2.0]).
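
A quick numeric check of that example, using the inner-product distance formula from the table above:

import numpy as np

def inner_product_distance(a, b):
    # d = 1.0 - sum(Ai * Bi)
    return 1.0 - float(np.dot(a, b))

a = np.array([1.0, 1.0])
print(inner_product_distance(a, np.array([1.0, 1.0])))  # -1.0
print(inner_product_distance(a, np.array([2.0, 2.0])))  # -3.0: [2.0, 2.0] is "closer" to a than a is to itself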

HNSW algorithm parameters

The HNSW algorithm has several parameters that can be tuned to improve the search performance.

Search parameters

  • ef_search - The size of the dynamic candidate list for the nearest neighbors during search (default: 50). Larger values give more accurate results at the cost of slower search. ef_search must be larger than the limit parameter in search(..., limit=...).

  • limit - The maximum number of results to return (default: 10).

Construction parameters

  • max_connection - The number of bi-directional links created for every new element during construction (default: 16). A reasonable range is 2 to 100. Higher values work better for datasets with high dimensionality and/or high recall requirements. This parameter also affects memory consumption during construction, which is roughly max_connection * 8-10 bytes per stored element.

    As an example, for n_dim=4 random vectors the optimal max_connection for search is somewhere around 6, while for high-dimensional datasets higher values (e.g., max_connection=48-64) are required for optimal performance at high recall. The range max_connection=12-48 is fine for most use cases. When max_connection is changed, the other parameters should be updated as well; ef_search and ef_construction can be roughly estimated by assuming that max_connection * ef_construction is constant.

  • ef_construction - The size of the dynamic candidate list for the nearest neighbors during construction (default: 200). Higher values give better accuracy but increase construction time and memory consumption. At some point, increasing ef_construction no longer improves accuracy. To set it to a reasonable value, measure the recall: if the recall is lower than 0.9, increase ef_construction and re-run the search.
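
To make the recall measurement concrete, here is a minimal sketch (not an AnnLite API) that estimates recall@k by comparing AnnLite's matches against exact brute-force cosine search, reusing the docs and query arrays from the quick start:

import numpy as np

def recall_at_k(docs, queries, k=10):
    emb = docs.embeddings
    norms = np.linalg.norm(emb, axis=1)
    hits = 0
    for q in queries:
        # exact top-k neighbors by cosine distance
        d = 1.0 - emb @ q.embedding / (norms * np.linalg.norm(q.embedding))
        exact_ids = {docs[int(i)].id for i in np.argsort(d)[:k]}
        approx_ids = {m.id for m in q.matches[:k]}
        hits += len(exact_ids & approx_ids)
    return hits / (k * len(queries))

ann.search(query, limit=10)  # populates query[i].matches
print('recall@10:', recall_at_k(docs, query, k=10))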

To set these parameters, you can pass them when creating the AnnLite index:

from annlite import AnnLite

ann = AnnLite(
    128,
    columns=[('price', float)],
    data_path="/tmp/annlite_data",
    ef_construction=200,
    max_connection=16,
)

Benchmark

One can run executor/benchmark.py to get a quick performance overview.

| Stored data | Indexing time | Query size=1 | Query size=8 | Query size=64 |
|------------:|--------------:|-------------:|-------------:|--------------:|
| 10000       | 2.970         | 0.002        | 0.013        | 0.100         |
| 100000      | 76.474        | 0.011        | 0.078        | 0.649         |
| 500000      | 467.936       | 0.046        | 0.356        | 2.823         |
| 1000000     | 1025.506      | 0.091        | 0.695        | 5.778         |

Results with filtering can be generated from examples/benchmark_with_filtering.py. This script should produce a table similar to:

| Stored data | % same filter | Indexing time | Query size=1 | Query size=8 | Query size=64 |
|------------:|--------------:|--------------:|-------------:|-------------:|--------------:|
| 10000       | 5             | 2.869         | 0.004        | 0.030        | 0.270         |
| 10000       | 15            | 2.869         | 0.004        | 0.035        | 0.294         |
| 10000       | 20            | 3.506         | 0.005        | 0.038        | 0.287         |
| 10000       | 30            | 3.506         | 0.005        | 0.044        | 0.356         |
| 10000       | 50            | 3.506         | 0.008        | 0.064        | 0.484         |
| 10000       | 80            | 2.869         | 0.013        | 0.098        | 0.910         |
| 100000      | 5             | 75.960        | 0.018        | 0.134        | 1.092         |
| 100000      | 15            | 75.960        | 0.026        | 0.211        | 1.736         |
| 100000      | 20            | 78.475        | 0.034        | 0.265        | 2.097         |
| 100000      | 30            | 78.475        | 0.044        | 0.357        | 2.887         |
| 100000      | 50            | 78.475        | 0.068        | 0.565        | 4.383         |
| 100000      | 80            | 75.960        | 0.111        | 0.878        | 6.815         |
| 500000      | 5             | 497.744       | 0.069        | 0.561        | 4.439         |
| 500000      | 15            | 497.744       | 0.134        | 1.064        | 8.469         |
| 500000      | 20            | 440.108       | 0.152        | 1.199        | 9.472         |
| 500000      | 30            | 440.108       | 0.212        | 1.650        | 13.267        |
| 500000      | 50            | 440.108       | 0.328        | 2.637        | 21.961        |
| 500000      | 80            | 497.744       | 0.580        | 4.602        | 36.986        |
| 1000000     | 5             | 1052.388      | 0.131        | 1.031        | 8.212         |
| 1000000     | 15            | 1052.388      | 0.263        | 2.191        | 16.643        |
| 1000000     | 20            | 980.598       | 0.351        | 2.659        | 21.193        |
| 1000000     | 30            | 980.598       | 0.461        | 3.713        | 29.794        |
| 1000000     | 50            | 980.598       | 0.732        | 5.975        | 47.356        |
| 1000000     | 80            | 1052.388      | 1.151        | 9.255        | 73.552        |

Note that:

  • query times are given in seconds.
  • % same filter indicates the percentage of the stored data that satisfies the filter.
    • For example, if % same filter = 10 and Stored data = 1_000_000, then 100_000 documents satisfy the filter.

Next steps

If you already have experience with Jina and DocArray, you can start using AnnLite right away.

Otherwise, you can check out this advanced tutorial to learn how to use AnnLite in practice.

🙋 FAQ

1. Why should I use AnnLite?

AnnLite is easy to use and intuitive to set up in production. It is also very fast and memory efficient, making it a great choice for approximate nearest neighbor search.

2. How do I use AnnLite with Jina?

We have implemented an executor for AnnLite that can be used with Jina.

from jina import Flow

with Flow().add(uses='jinahub://AnnLiteIndexer', uses_with={'n_dim': 128}) as f:
    f.post('/index', inputs=docs)
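
Queries go through the same flow; a minimal sketch, assuming the indexer exposes the standard /search endpoint and reusing the query DocumentArray from above:

from jina import Flow

with Flow().add(uses='jinahub://AnnLiteIndexer', uses_with={'n_dim': 128}) as f:
    f.post('/index', inputs=docs)
    results = f.post('/search', inputs=query)  # returned docs carry their matches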
3. Does AnnLite support search with filters?

Yes. See the Search with filters section above.

Documentation

You can find the documentation on GitHub and ReadTheDocs.

🤝 Contribute and spread the word

We are also looking for contributors who want to help us improve: code, documentation, issues, feedback! Here is how you can get started:

  • Have a look through GitHub issues labeled "Good first issue".
  • Read our Contributor Covenant Code of Conduct
  • Open an issue or submit your pull request!

License

AnnLite is licensed under the Apache License 2.0.

annlite's People

Contributors

alaeddine-13, bwanglzu, cristianmtr, davidbp, gusye1234, hanxiao, hippopotamus0308, jemmyshin, jina-bot, joanfm, numb3r3, orangesodahub, ziniuyu


annlite's Issues

PQLite: less data than limit

After some iterations, python examples/hnsw_benchmark.py (included in the PR) seems to fail. Can you reproduce the following?

Xtr: (124980, 128) vs Xte: (20, 128)
2021-11-23 11:42:03.020 | WARNING  | pqlite.index:train:131 - The pqlite has been trained or is not trainable. Please use ``force_retrain=True`` to retrain.
2021-11-23 11:42:46.358 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:43:30.497 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.95, 'recall': 0.95, 'train_time': 0.000240325927734375, 'index_time': 87.68162178993225, 'query_time': 0.1407299041748047, 'query_qps': 142.1162056300232, 'index_qps': 1425.384219049087, 'indexer_hyperparams': {'n_cells': 1, 'n_subvectors': 64}}
2021-11-23 11:43:30.908 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:43:31.179 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=8)
2021-11-23 11:43:33.298 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=8) with 20480 data...
2021-11-23 11:43:34.021 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:43:34.021 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/c905ae006031e55b1d8d51e87803d278
2021-11-23 11:44:19.429 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:23.197 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:24.466 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:28.833 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:30.179 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:33.951 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:35.024 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:38.036 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.99, 'recall': 0.99, 'train_time': 0.7391390800476074, 'index_time': 64.20390892028809, 'query_time': 0.2022690773010254, 'query_qps': 98.8781887319096, 'index_qps': 1946.6104494536003, 'indexer_hyperparams': {'n_cells': 8, 'n_subvectors': 64}}
2021-11-23 11:44:38.510 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:44:38.736 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:44:38.951 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:44:39.172 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:44:39.610 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:44:39.836 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:44:40.064 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:44:40.290 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:44:40.518 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=16)
2021-11-23 11:44:46.918 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=16) with 20480 data...
2021-11-23 11:44:47.653 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:44:47.653 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/75115be8393181300ec49112b88b2445
2021-11-23 11:45:30.760 | DEBUG    | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:45:46.906 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.9850000000000001, 'recall': 0.9850000000000001, 'train_time': 0.7362098693847656, 'index_time': 59.48536229133606, 'query_time': 0.28374195098876953, 'query_qps': 70.48658095958322, 'index_qps': 2101.021077889663, 'indexer_hyperparams': {'n_cells': 16, 'n_subvectors': 64}}
2021-11-23 11:45:47.490 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:45:47.970 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:45:48.488 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:45:48.952 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:45:49.400 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:45:49.650 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:45:50.114 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:45:50.552 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:45:50.987 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:45:51.448 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:45:51.888 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:45:52.329 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:45:52.773 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:45:53.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:45:53.662 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:45:54.112 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:45:54.555 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=32)
2021-11-23 11:46:10.758 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=32) with 20480 data...
2021-11-23 11:46:11.500 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:46:11.500 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/8f37c3b2ffd1c67e4c81e81f64db0eea
2021-11-23 11:47:01.267 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 1.0, 'recall': 1.0, 'train_time': 0.7422680854797363, 'index_time': 49.93944001197815, 'query_time': 0.30017614364624023, 'query_qps': 66.62754660333749, 'index_qps': 2502.6311862933007, 'indexer_hyperparams': {'n_cells': 32, 'n_subvectors': 64}}
2021-11-23 11:47:01.804 | INFO     | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:47:02.336 | INFO     | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:47:02.945 | INFO     | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:47:03.533 | INFO     | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:47:04.493 | INFO     | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:47:05.305 | INFO     | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:47:06.152 | INFO     | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:47:06.923 | INFO     | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:47:07.710 | INFO     | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:47:08.489 | INFO     | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:47:09.025 | INFO     | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:47:09.471 | INFO     | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:47:09.901 | INFO     | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:47:10.344 | INFO     | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:47:10.775 | INFO     | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:47:11.211 | INFO     | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:47:11.639 | INFO     | pqlite.index:clear:259 - Clear the index of cell-16
2021-11-23 11:47:12.076 | INFO     | pqlite.index:clear:259 - Clear the index of cell-17
2021-11-23 11:47:12.503 | INFO     | pqlite.index:clear:259 - Clear the index of cell-18
2021-11-23 11:47:12.932 | INFO     | pqlite.index:clear:259 - Clear the index of cell-19
2021-11-23 11:47:13.367 | INFO     | pqlite.index:clear:259 - Clear the index of cell-20
2021-11-23 11:47:13.815 | INFO     | pqlite.index:clear:259 - Clear the index of cell-21
2021-11-23 11:47:14.261 | INFO     | pqlite.index:clear:259 - Clear the index of cell-22
2021-11-23 11:47:14.700 | INFO     | pqlite.index:clear:259 - Clear the index of cell-23
2021-11-23 11:47:15.132 | INFO     | pqlite.index:clear:259 - Clear the index of cell-24
2021-11-23 11:47:15.587 | INFO     | pqlite.index:clear:259 - Clear the index of cell-25
2021-11-23 11:47:16.032 | INFO     | pqlite.index:clear:259 - Clear the index of cell-26
2021-11-23 11:47:16.472 | INFO     | pqlite.index:clear:259 - Clear the index of cell-27
2021-11-23 11:47:16.903 | INFO     | pqlite.index:clear:259 - Clear the index of cell-28
2021-11-23 11:47:17.350 | INFO     | pqlite.index:clear:259 - Clear the index of cell-29
2021-11-23 11:47:17.787 | INFO     | pqlite.index:clear:259 - Clear the index of cell-30
2021-11-23 11:47:18.227 | INFO     | pqlite.index:clear:259 - Clear the index of cell-31
2021-11-23 11:47:18.661 | INFO     | pqlite.index:__init__:86 - Initialize VQ codec (K=64)
2021-11-23 11:47:56.044 | INFO     | pqlite.index:train:137 - Start training VQ codec (K=64) with 20480 data...
2021-11-23 11:47:57.003 | INFO     | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:47:57.004 | INFO     | pqlite.index:dump_model:282 - Save the trained parameters to data/e01ce8063d859fe594084b33a10515e8
2021-11-23 11:48:55.512 | DEBUG    | pqlite.container:insert:180 - => 124980 new docs added
Traceback (most recent call last):
  File "examples/hnsw_benchmark.py", line 95, in <module>
    pq.search(docs, limit=top_k)
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/index.py", line 238, in search
    match_dists, match_docs = self.search_cells(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 144, in search_cells
    dists, doc_ids, cells = self.ivf_search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 107, in ivf_search
    _dists, _doc_idx = self.vec_index(cell_id).search(
  File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/core/index/hnsw/index.py", line 77, in search
    ids, dists = self._index.knn_query(query, k=limit)
RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small

Originally posted by @davidbp in #18 (comment)

[Bug] Executor from hub fails to start

I've been trying to use the example from Alex's multimodal search demo and also tested the example code for the PQLite extension on the Jina Hub page. Testing both examples, I get the following errors with jina 2.6.0, 2.6.2, and the latest version.

python ./app.py -t index -n 10
Fetching PQLiteIndexer from Jina Hub ...
DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses (raised from /home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/flatbuffers/compat.py:19)
  image_encoder@263243[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Traceback (most recent call last):
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 444, in main
    run()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "./app.py", line 94, in <module>
    main()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "./app.py", line 88, in main
    index(csv_file=CSV_FILE, max_docs=num_docs)
  File "./app.py", line 43, in index
    with flow_index:
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1132, in __enter__
    return self.start()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1179, in start
    self.enter_context(v)
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 208, in __enter__
    return self.start()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 692, in start
    self.enter_context(self.replica_set)
  File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
    result = _cm_type.__enter__(cm)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 476, in __enter__
    self._peas.append(BasePea(_args).start())
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 135, in __init__
    self.runtime_cls = self._get_runtime_cls()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 427, in _get_runtime_cls
    update_runtime_cls(self.args)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/helper.py", line 106, in update_runtime_cls
    _args.uses = HubIO(_hub_args).pull()
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 672, in pull
    executor, from_cache = HubIO.fetch_meta(
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/helper.py", line 323, in wrapper
    result = func(*args, **kwargs)
  File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 588, in fetch_meta
    image_name=resp['image'],
KeyError: 'image'

Workspace size doubles during first use (query-time init)

Hello,

I have a large dataset. My workspace is 27 GB after indexing (I stored sentence embeddings, token embeddings, and metadata).
But after the first inference init, the workspace grows to 53 GB, which is totally strange.

I have
annlite==0.3.5
jina==3.6.9

Any clues why this is happening?

Thanks

Support PCA in ANNlite

In order to reduce memory usage, we need to implement PCA inside AnnLite. There will be two PRs for this feature:

  1. implement PCA based on scikit-learn
  2. integrate PCA with ANNlite

Feat: Additional filters

I was just chatting with @davidbp about filtering in PQLite. For my fashion search example I'm looking at adding filters (similar to Amazon) to pre-filter results. This work is being done in a separate branch of my repo.

At present I'm able to easily search in ranges (e.g. price, year), or above a certain threshold (e.g. rating):

filter = {
    "$and": {
        "year": {"$gte": 2011, "$lte": 2014},
        "price": {"$gte": 100, "$lte": 200},
        "rating": {"$gte": 3},
    },
}

But what would be really useful is a convenient way to search for AND and XOR.

Current implementation

Previously I tried something (which actually works) like:

filter = {
    "$and": {
        "year": {"$lte": 2014, "$gte": 2011},
        "price": {"$gte": 0, "$lte": 200},
    },
    "$or": {
        "baseColour": {"$eq": "Black"},
        "$or": {
            "baseColour": {"$eq": "White"},
            "$or": {
                "baseColour": {"$eq": "Blue"}
            }
        }
    }
}

But this is:

  1. Inelegant
  2. A real pain to build programmatically (i.e. by taking into account checked boxes on my frontend)
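
For what it's worth, the nested shape above can be generated with a small recursive helper; a sketch of a hypothetical helper, not part of PQLite:

def nested_or(field, values):
    # builds {"field": {"$eq": v0}, "$or": {...}} chains like the filter above
    # (assumes values is non-empty)
    clause = {field: {'$eq': values[0]}}
    if len(values) > 1:
        clause['$or'] = nested_or(field, values[1:])
    return clause

filter = {
    '$and': {
        'year': {'$lte': 2014, '$gte': 2011},
        'price': {'$gte': 0, '$lte': 200},
    },
    '$or': nested_or('baseColour', ['Black', 'White', 'Blue']),
}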

Desired implementation

Some new operators: $one_of and $all_of

filter = {
    "$and": {
        "year": {"$gte": 2011, "$lte": 2014},
        "price": {"$gte": 100, "$lte": 200},
        "rating": {"$gte": 3},
        "baseColour": {"$one_of": ['White', 'Blue', 'Black']},
        "season": {"$all_of": ['Summer', 'Spring', 'Fall']},
    },
}

Other thoughts

In Commsor (our community analysis tool) we use a lot of filters that are useful in the real world.

So I'd also like to propose the following operators:

  • $contains
  • $notcontains (e.g. we often want to filter out universities since we focus on enterprises, so we would say company_name $notcontains "university")

Notes

  • Rating and price aren't in the original dataset (as used by @bwanglzu in his notebook). I generated them programmatically to give us a richer dataset to play with.

PQLite: improve table query performance

The bottleneck of search is the table SQL query:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    71                                               @line_profile
    72                                               def ivf_search(
    73                                                   self,
    74                                                   x: np.ndarray,
    75                                                   cells: np.ndarray,
    76                                                   where_clause: str = '',
    77                                                   where_params: Tuple = (),
    78                                                   limit: int = 10,
    79                                               ):
    80        15         18.0      1.2      0.0          dists = []
    81
    82        15         11.0      0.7      0.0          doc_idx = []
    83        15          7.0      0.5      0.0          cell_ids = []
    84        15          6.0      0.4      0.0          count = 0
    85        30        141.0      4.7      0.0          for cell_id in cells:
    86        15         23.0      1.5      0.0              cell_table = self.cell_table(cell_id)
    87        15      54765.0   3651.0      4.5              cell_size = cell_table.count()
    88        15         20.0      1.3      0.0              if cell_size == 0:
    89                                                           continue
    90
    91        15          6.0      0.4      0.0              indices = None
    92        15         10.0      0.7      0.0              if where_clause or (cell_table.deleted_count() > 0):
    93        15         11.0      0.7      0.0                  indices = []
    94    500030     806655.0      1.6     66.3                  for doc in cell_table.query(
    95        15          9.0      0.6      0.0                      where_clause=where_clause, where_params=where_params
    96                                                           ):
    97    500000     274113.0      0.5     22.5                      indices.append(doc['_id'])
    98
    99        15         27.0      1.8      0.0                  if len(indices) == 0:
   100                                                               continue
   101
   102        15      13655.0    910.3      1.1                  indices = np.array(indices, dtype=np.int64)
   103
   104        30      63932.0   2131.1      5.3              _dists, _doc_idx = self.vec_index(cell_id).search(
   105        15         32.0      2.1      0.0                  x, limit=min(limit, cell_size), indices=indices
   106                                                       )
   107
   108        15         22.0      1.5      0.0              if count >= limit and _dists[0] > dists[-1][-1]:
   109                                                           continue
   110
   111        15         24.0      1.6      0.0              dists.append(_dists)
   112        15          9.0      0.6      0.0              doc_idx.append(_doc_idx)
   113        15         41.0      2.7      0.0              cell_ids.extend([cell_id] * len(_dists))
   114        15         13.0      0.9      0.0              count += len(_dists)
   115
   116        15        113.0      7.5      0.0          cell_ids = np.array(cell_ids, dtype=np.int64)
   117        15         13.0      0.9      0.0          if len(dists) != 0:
   118        15        459.0     30.6      0.0              dists = np.hstack(dists)
   119        15        125.0      8.3      0.0              doc_idx = np.hstack(doc_idx)
   120
   121        15        105.0      7.0      0.0              indices = dists.argsort(axis=0)[:limit]
   122        15         28.0      1.9      0.0              dists = dists[indices]
   123        15         14.0      0.9      0.0              cell_ids = cell_ids[indices]
   124        15          9.0      0.6      0.0              doc_idx = doc_idx[indices]
   125
   126        15          6.0      0.4      0.0          doc_ids = []
   127       165        163.0      1.0      0.0          for cell_id, offset in zip(cell_ids, doc_idx):
   128       150       1750.0     11.7      0.1              doc_id = self.cell_table(cell_id).get_docid_by_offset(offset)
   129       150         94.0      0.6      0.0              doc_ids.append(doc_id)
   130        15          8.0      0.5      0.0          return dists, doc_ids, cell_ids

PQLite: update README

PQLite has breaking changes; as a result, the following two use cases are now supported:

  • (Basic) For small-scale data (e.g., < 10M docs),

    1. directly use HNSW indexing without training (dtype=np.float32)
  • (Advanced) For large-scale data (e.g., > 10M docs): combine 1) Product Quantization, 2) IVF, and 3) HNSW

    1. train the VQ to conduct IVF index
    2. train the PQ to compress embeddings
    3. build the IVF-HNSW indexing using pq codes (dtype=np.uint8)

Can't be installed on Mac with M1 chip

I am trying to install annlite on my MacBook with an M1 chip using pip install annlite, but I receive the following error:

clang: error: the clang compiler does not support '-march=native'
      error: command '/usr/bin/clang' failed with exit code 1

Are there any suggestions for fixing it?

acquire training data from LMDB

Sometimes we need to train the PCA model after we have already created an indexer (for example, when a memory issue appears after we have indexed thousands or even millions of documents, and we need PCA to fix it).

We need to fetch training data from LMDB, but this is tricky when we move to JCloud, since we need to fetch the data from the server instead of the local machine.

One way to solve this is to add a new /fetch endpoint to the client:

data = client.post('/fetch', params={'batch_size': 1024})

For training, we can use partial_train():

annlite.partial_train(data)

Filtering using $in keyword crashes the executors and the database

I have a simple flow that looks like this:

f = (
    Flow(port_expose=8082, protocol='http', monitoring=True, port_monitoring=9090)
    .add(name='encoder', uses='jinahub+docker://CLIPEncoder')
    .add(name='processor',
         uses='jinahub+docker://PQLiteIndexer/latest',
         uses_with={
            'dim': 512,
            'columns': columns
         },
    )
)

Indexing works fine and I can verify it using /status endpoint where it shows the number of indexed documents. When I hit the /search endpoint, I can search and retrieve results correctly.

I also verified that filtering works by testing it with $eq. However, when I test it with $in, things go south. Not only does it not return any results, but it also seems to crash my entire database, to the point where I can't make calls to endpoints like /status and /search. Does anyone have any idea what is happening? Here is how I am structuring my filter query:

# this query searches the files with a tag 'owners' of type array which includes the given string
search_results = c.post(on="/search",
                 parameters={
                     "query": QUERIES[0],
                     "traversal_paths": '@r,c',
                     "limit": 3,
                     "filter":{"owners": {"$in": ["EGGWLJSUHT6GLWU2KIB0"]}}
                 })

support load and save operations in HNSW

We don't support loading and saving in HNSW yet; every time, we need to rebuild the whole graph from LMDB, which is very slow when the data size is huge. A better way is to save the HNSW graph directly and load it when initializing the indexer.

We need two APIs inside HNSW:

hnsw_indexer.load_index() and hnsw_indexer.save_index()

PQLite: implement pq in hnsw via C++

Pros:

  • compress the embeddings, saving memory
  • speed up distance computation via the ADC (asymmetric distance computation) method

Cons:

  • degraded search quality

PQLite: benchmark with filtering

Performance benchmark experiment
Something like this:

  • QPS with filtering out 10% data
  • QPS with filtering out 30% data
  • QPS with filtering out 50% data
  • QPS with filtering out 80% data

annlite failed to build on M1 Mac

Python version: 3.9
MacOS version: 12.2.1
CMD used: pip install https://github.com/jina-ai/annlite/archive/refs/heads/main.zip (or pip install "docarray[full]")
Error log:

Building wheels for collected packages: annlite
Building wheel for annlite (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /opt/anaconda3/envs/jina/bin/python /opt/anaconda3/envs/jina/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmps65jnhoh
cwd: /private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-req-build-zbgplt5p
Complete output (48 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-11.1-arm64-cpython-39
creating build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/enums.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/profile.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/container.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/utils.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/helper.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/filter.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/math.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core
copying annlite/core/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/kv.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/table.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/vq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/pq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/pq_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/flat_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/__init__.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
running build_ext
creating var
creating var/folders
creating var/folders/8m
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include/python3.9 -c /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.cpp -o var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.o -std=c++14
building 'annlite.hnsw_bind' extension
creating build/temp.macosx-11.1-arm64-cpython-39
creating build/temp.macosx-11.1-arm64-cpython-39/bindings
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/pybind11/include -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/numpy/core/include -I./include/hnswlib -I/opt/anaconda3/envs/jina/include/python3.9 -c ./bindings/hnsw_bindings.cpp -o build/temp.macosx-11.1-arm64-cpython-39/./bindings/hnsw_bindings.o -O3 -march=native -stdlib=libc++ -mmacosx-version-min=10.7 -DVERSION_INFO="0.3.2" -std=c++14
clang: error: the clang compiler does not support '-march=native'
error: command '/usr/bin/clang' failed with exit code 1
+++
ERROR: Failed building wheel for annlite
Failed to build annlite
ERROR: Could not build wheels for annlite which use PEP 517 and cannot be installed directly

PQlite: restore index from local storage

Rebuild the index (SQLite and the vector index) from the local LMDB data:

  • refactor abstract class BaseIndex
  • refactor the fit function to check whether the training is valid
  • add stat and clear APIs
  • rebuild index from local disk (i.e., lmdb data)
    • restore trained model from disk
    • rebuild index from disk

support upload/download model to/from hubble

Since we are moving to JCloud deployment, it's necessary to support uploading/downloading the PCA/PQ model to/from Hubble.

Thus, we need to implement these APIs:

self._projector_codec.upload(artifact='...')
self._projector_codec.download(artifact='...')

The artifact is determined by users and should be consistent throughout the whole pipeline. It should also be passed to jcloud.yaml.

PQLite: optimize product quantization index

Improve the following parts (probably in Cython):

  • the lookup operation of the asymmetric distance computation
  • the asymmetric distance computation table, and benchmark the improvement
  • add tests to the functions

The main issue with the current version is that, unless a lot of data is filtered out, the code ends up being slower than cdist.

Indexing long text documents is tricky

Hello,

My use case is search in long text documents.
Documents are split into chunks (let's say sentences) and each chunk has its own embedding; the root document has no embedding.
I am not able to index documents with the annlite indexer because of the missing embedding of the root document; only the chunks can be indexed.
If I store documents directly to LMDB via self._index.doc_store(0).insert(root_docs), then loading the query flow throws an error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.

The 10 refers to 5 root docs and 5 chunks together (dummy data).

Can you please help me?
Thanks

Deployment on Google Cloud Platform

Hi!

I am currently taking a look at jina-ai. The plan is to get a simple text-based document search going and so far I've managed to make a simple demo locally which uses the PQLiteIndexer (based on AnnLite).

flow = Flow(port=5050)
flow = (
    flow
        .add(uses=TfIdfEncoder, uses_with=dict(tfidf_fp=tfidf_fp))
        .add(uses='jinahub://PQLiteIndexer/latest', install_requirements=True, uses_with=dict(dim=dim))
)

The next step would be for me to see how I can deploy a prototype to Google Cloud Platform (GCP) and, if possible, use Cloud Run in order to keep costs at a minimum.

However, since AnnLite requires access to a local file system, I am not sure if that's possible. I intended to use Cloud Storage, but it seems AnnLite would not support this.

What options do I have here?

Add comments for py

Hi,
I'm reading the main body of annlite, and I found that some core functions lack comments, which may cause some confusion (at least to me).
Maybe I can add some comments while I'm reading and open a PR for that?

Clean codebase

From today's meeting, we noted the following:

  • fix pip in README
  • fix table format (percentage, decimal)
  • use TYPE_CHECKING to protect unnecessary input
  • use from_bytes and to_bytes for reading/writing binary of Document
  • an example of using jina 3 and pqlite to achieve sharding in K8s.

Persist Documents in a Database

Currently, only the HNSWPostgresIndexer supports persistence of Documents. Could we add database persistence to the PQLiteIndexer, not necessarily Postgres, but any database provider?

Formatting headers files

Hi, should we consider using clang-format or something similar to format all the cpp/h files?
If it's necessary, maybe I can open a PR for that.
