
pgvector-remote

Introduction

pgvector-remote is a fork of pgvector that combines the simplicity of pgvector with the power of remote vector databases by introducing a new remote vector index type. Currently, pgvector-remote only supports pinecone, but we plan to support other vendors in the future.

Figure: Benchmarks for a 10M filtered search workload (https://big-ann-benchmarks.com/neurips23.html#tracks). Results for pgvector are shown for a tuned hnsw index on a t2.2xlarge (32 GB RAM); results for pinecone are for a p2.x8 pod.

Short Version

CREATE TABLE products (name text, embedding vector(1536), price float);
CREATE INDEX my_remote_index ON products USING pinecone (embedding, price) with (host = 'my-pinecone-index.pinecone.io');
-- [insert, update, and delete billions of records in products]
SELECT * FROM products WHERE price < 40.0 ORDER BY embedding <-> '[...]' LIMIT 10; -- pinecone performs this query, including the price predicate

Use Cases

Benefits of using pgvector-remote for pinecone users

  • Vector databases like pinecone aren't docstores (and they shouldn't try to be). That means your documents and their embeddings live in separate databases. pgvector-remote lets you keep your metadata in postgres and your embeddings in pinecone, while hiding this complexity by presenting a unified sql interface for creating, querying, and updating pinecone indexes.
  • Control your data. Using pgvector-remote means that all your vectors are in postgres. This makes it easy to test out a different index type (like hnsw) and drop pinecone in favor of a different vendor.

Benefits of using pinecone for pgvector users

  • Scalability: Pinecone is designed to scale to billions of vectors. pgvector does not easily accommodate such large datasets. Large vector indexes are highly memory intensive, so it makes sense to separate them from the main database. For example, indexing 200M vectors of 1536 dimensions would require roughly 1.2TB of memory.
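As a rough sanity check on that figure (a sketch assuming 4-byte floats and ignoring any index graph overhead), the raw vector data alone comes to:

SELECT pg_size_pretty(200000000::bigint * 1536 * 4); -- about 1144 GB (~1.2 TB) of raw float4 data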

Benefits of using pgvector-remote for users who already use pinecone and pgvector

  • Seamless integration: You don't need to write a line of pinecone application logic. Use a unified sql interface to leverage pinecone as if it were any other postgres index type.
  • Synchronization: pgvector-remote ensures that the data in pinecone and postgres are always in sync. For example, if your postgres transaction rolls back, you don't need to worry about cleaning up the data in pinecone (sketched below).
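A minimal illustration of that guarantee, using the products table from above (the vector literal is a placeholder):

BEGIN;
INSERT INTO products (name, embedding, price) VALUES ('widget', '[...]', 9.99);
ROLLBACK; -- the aborted insert is never flushed to pinecone, so nothing needs to be cleaned up remotely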

Why is this integration better than confluent's kafka-connect?

  • Liveness and correctness: pgvector-remote sends inserted vectors to pinecone in batches and locally scans unflushed records, guaranteeing that all data is always visible to index queries.
  • Query and integration logic: traditional ETL won't help you write queries like the one above. pgvector-remote translates select predicates to pinecone filters.

When should I just use pgvector?

  • Small datasets: If you have a small to medium dataset (up to roughly 10M vectors at 768 dimensions), you can use pgvector without a remote vector store; a local hnsw index will be sufficient (see the example below).
  • Minimal metadata: You aren't performing metadata filtering. pgvector does not currently handle metadata filtering, so queries like the one above can sometimes be inefficient and inaccurate.
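For the small-dataset case, a plain pgvector hnsw index on the example table would look something like this (a sketch using pgvector's standard operator classes):

CREATE INDEX my_local_index ON products USING hnsw (embedding vector_l2_ops);
SELECT * FROM products ORDER BY embedding <-> '[...]' LIMIT 10; -- served entirely by the local hnsw index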

Installation

Follow the installation instructions for pgvector. You can specify which remote providers to install using MAKEFLAGS. For example,

make USE_PINECONE=1 USE_MILVUS=1
sudo make install

Pinecone Installation

Install libcurl headers. For example,

sudo apt-get install libcurl4-openssl-dev

Milvus Installation

Build the milvus c++ sdk. For example,

sudo apt install libgrpc++-dev libgrpc-dev libprotobuf-dev
git clone https://github.com/oscarlaird/milvus-sdk-cpp
cd milvus-sdk-cpp
git submodule update --init
make && sudo make install

Configuration

Set the pinecone API key in the postgres configuration. For example,

ALTER DATABASE mydb SET pinecone.api_key = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx';
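You can also set the key for just the current session and verify it (useful if the database-level setting doesn't appear to take effect):

SET pinecone.api_key = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'; -- session-level alternative
SHOW pinecone.api_key; -- verify the key visible to the current session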

Index Creation

There are two ways to specify the pinecone index:

  • By providing the host of an existing pinecone index. For example,
CREATE INDEX my_remote_index ON products USING pinecone (embedding) with (host = 'example-23kshha.svc.us-east-1-aws.pinecone.io');
  • By specifying the spec of the pinecone index. For example,
CREATE INDEX my_remote_index ON products USING pinecone (embedding) with (spec = '"spec": {
        "serverless": {
            "region": "us-west-2",
            "cloud": "aws"
        }
    }');

All spec options can be found in the pinecone documentation.

Performance Considerations

  • Place your pinecone index in the same region as your postgres instance to minimize latency.
  • Make use of connection pooling to run queries in postgres concurrently. For example, use asyncpg in python.
  • Records are sent to the remote index in batches. Therefore, pgvector-remote performs a local scan of the unflushed records before every query. To disable this, set pinecone.max_buffer_scan to 0. For example,
ALTER DATABASE mydb SET pinecone.max_buffer_scan = 0;
  • You can adjust the number of vectors sent in each request and the number of concurrent requests per batch using pinecone.vectors_per_request and pinecone.requests_per_batch respectively. For example,
ALTER DATABASE mydb SET pinecone.vectors_per_request = 100; --default
ALTER DATABASE mydb SET pinecone.requests_per_batch = 40; --default
  • You can control the number of results returned by pinecone using pinecone.top_k. Lowering this parameter can decrease latency, but keep in mind that setting it too low could cause fewer results to be returned than expected.
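For example (the value below is illustrative, not the default; tune it for your workload),

ALTER DATABASE mydb SET pinecone.top_k = 500;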

Docker

An example docker image can be obtained with,

docker pull kslohith17/pgvector-remote:latest

It contains postgres with pgvector-remote installed and configured.
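A minimal way to try it, assuming the image follows the standard postgres image conventions for environment variables and ports (adjust the password and port as needed):

docker run -d --name pgvector-remote -e POSTGRES_PASSWORD=postgres -p 5432:5432 kslohith17/pgvector-remote:latest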

Credits

We give special thanks to these projects, which enabled us to develop our extension:

pgvector-remote's People

Contributors

ankane, oscarlaird, chitti-ankith, hlinnaka, swetavooda, jkatz, yihong0618, oneturkmen, mulander, jeff-davis, wlaurance, dungnmaster, pashkinelfe, nathan-bossart, kslohith, japinli, florents-tselai, nodomain


pgvector-remote's Issues

[BUG] Seg Fault when creating index with NULL vectors

I have a table with a nullable vector column. When I tried to create an index on it, the command failed, and on checking the logs I found it was due to a segmentation fault.

SQL Query for Creating Index

CREATE INDEX "my-remote-index"
ON my_table 
USING pinecone (embedding) 
WITH (host = 'xxx.svc.aped-4627-b74a.pinecone.io')

Postgres Log:

2024-04-21 19:27:50.412 IST [15464] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2024-04-21 19:27:50.420 IST [15464] DEBUG:  building index "al-remote-index" on table "application_log" serially
2024-04-21 19:27:50.421 IST [15464] DEBUG:  Initializing CURL handle
2024-04-21 19:27:50.428 IST [15394] DEBUG:  snapshot of 1+0 running transaction ids (lsn 5/A6EC45E8 oldest xid 4527446 latest complete 4527445 next xid 4527447)
2024-04-21 19:27:51.564 IST [15464] DEBUG:  Response (write_callback): {"namespaces":{"":{"vectorCount":33000}},"dimension":384,"indexFullness":0,"totalVectorCount":33000}
2024-04-21 19:27:51.565 IST [15464] DEBUG:  Host specified in reloptions, checking if it is empty. Got response: {
		"namespaces":	{
			"":	{
				"vectorCount":	33000
			}
		},
		"dimension":	384,
		"indexFullness":	0,
		"totalVectorCount":	33000
	}
2024-04-21 19:27:54.352 IST [15464] DEBUG:  Freeing request body
2024-04-21 19:27:54.352 IST [15464] DEBUG:  Response (write_callback): {"upsertedCount":100}
...
2024-04-21 19:29:05.245 IST [15464] DEBUG:  Freeing request body
2024-04-21 19:29:05.245 IST [15464] DEBUG:  Response (write_callback): {"upsertedCount":100}
2024-04-21 19:29:05.302 IST [83682] DEBUG:  reaping dead processes
2024-04-21 19:29:05.302 IST [83682] DEBUG:  server process (PID 15464) was terminated by signal 11: Segmentation fault: 11
2024-04-21 19:29:05.302 IST [83682] DETAIL:  Failed process was running: CREATE INDEX "my-remote-index"
	ON my_table 
	USING pinecone (embedding) 
	WITH (host = 'xxx.svc.aped-4627-b74a.pinecone.io');
2024-04-21 19:29:05.302 IST [83682] LOG:  server process (PID 15464) was terminated by signal 11: Segmentation fault: 11
2024-04-21 19:29:05.302 IST [83682] DETAIL:  Failed process was running: CREATE INDEX "my-remote-index"
	ON my_table 
	USING my_table (embedding) 
	WITH (host = 'xxx.svc.aped-4627-b74a.pinecone.io');
2024-04-21 19:29:05.302 IST [83682] LOG:  terminating any other active server processes
2024-04-21 19:29:05.302 IST [83682] DEBUG:  sending SIGQUIT to process 15397
2024-04-21 19:29:05.302 IST [83682] DEBUG:  sending SIGQUIT to process 15394
2024-04-21 19:29:05.303 IST [83682] DEBUG:  sending SIGQUIT to process 15393
2024-04-21 19:29:05.303 IST [83682] DEBUG:  sending SIGQUIT to process 15395
2024-04-21 19:29:05.303 IST [83682] DEBUG:  sending SIGQUIT to process 15396
2024-04-21 19:29:05.303 IST [83682] DEBUG:  reaping dead processes
2024-04-21 19:29:05.304 IST [83682] DEBUG:  forked new backend, pid=15601 socket=11
2024-04-21 19:29:05.304 IST [83682] DEBUG:  reaping dead processes
2024-04-21 19:29:05.306 IST [15601] FATAL:  the database system is in recovery mode
2024-04-21 19:29:05.306 IST [15601] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2024-04-21 19:29:05.306 IST [15601] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
2024-04-21 19:29:05.306 IST [15601] DEBUG:  proc_exit(1): 1 callbacks to make
2024-04-21 19:29:05.306 IST [15601] DEBUG:  exit(1)
2024-04-21 19:29:05.306 IST [15601] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-04-21 19:29:05.306 IST [15601] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-04-21 19:29:05.306 IST [15601] DEBUG:  proc_exit(-1): 0 callbacks to make
2024-04-21 19:29:05.307 IST [83682] DEBUG:  reaping dead processes
2024-04-21 19:29:05.307 IST [83682] DEBUG:  server process (PID 15601) exited with exit code 1
2024-04-21 19:29:05.307 IST [83682] LOG:  all server processes terminated; reinitializing
2024-04-21 19:29:05.309 IST [83682] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2024-04-21 19:29:05.309 IST [83682] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2024-04-21 19:29:05.309 IST [83682] DEBUG:  cleaning up orphaned dynamic shared memory with ID 2288057058
2024-04-21 19:29:05.309 IST [83682] DEBUG:  cleaning up dynamic shared memory control segment with ID 762501520
2024-04-21 19:29:05.310 IST [83682] DEBUG:  invoking IpcMemoryCreate(size=149528576)
2024-04-21 19:29:05.312 IST [83682] DEBUG:  dynamic shared memory system will support 674 segments
2024-04-21 19:29:05.312 IST [83682] DEBUG:  created dynamic shared memory control segment 2607114590 (26976 bytes)
2024-04-21 19:29:05.313 IST [15602] LOG:  database system was interrupted; last known up at 2024-04-21 19:26:17 IST
2024-04-21 19:29:05.313 IST [15602] DEBUG:  removing all temporary WAL segments
2024-04-21 19:29:05.313 IST [15603] DEBUG:  checkpointer updated shared memory configuration values
2024-04-21 19:29:05.443 IST [15602] DEBUG:  checkpoint record is at 5/A6EB0E10
2024-04-21 19:29:05.443 IST [15602] DEBUG:  redo record is at 5/A6EB0E10; shutdown true
2024-04-21 19:29:05.443 IST [15602] DEBUG:  next transaction ID: 4527446; next OID: 589824
2024-04-21 19:29:05.443 IST [15602] DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
2024-04-21 19:29:05.443 IST [15602] DEBUG:  oldest unfrozen transaction ID: 722, in database 1
2024-04-21 19:29:05.443 IST [15602] DEBUG:  oldest MultiXactId: 1, in database 1
2024-04-21 19:29:05.443 IST [15602] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2024-04-21 19:29:05.443 IST [15602] LOG:  database system was not properly shut down; automatic recovery in progress
2024-04-21 19:29:05.443 IST [15602] DEBUG:  transaction ID wrap limit is 2147484369, limited by database with OID 1
2024-04-21 19:29:05.443 IST [15602] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2024-04-21 19:29:05.443 IST [15602] DEBUG:  starting up replication slots
2024-04-21 19:29:05.443 IST [15602] DEBUG:  xmin required by slots: data 0, catalog 0
2024-04-21 19:29:05.443 IST [15602] DEBUG:  starting up replication origin progress state
2024-04-21 19:29:05.443 IST [15602] DEBUG:  didn't need to unlink permanent stats file "pg_stat/pgstat.stat" - didn't exist
2024-04-21 19:29:05.443 IST [15602] DEBUG:  resetting unlogged relations: cleanup 1 init 0
2024-04-21 19:29:05.445 IST [15602] LOG:  redo starts at 5/A6EB0E88
2024-04-21 19:29:05.445 IST [83682] DEBUG:  postmaster received pmsignal signal
2024-04-21 19:29:05.445 IST [15602] LOG:  invalid record length at 5/A6EC4840: expected at least 24, got 0
2024-04-21 19:29:05.445 IST [15602] LOG:  redo done at 5/A6EC4808 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-04-21 19:29:05.445 IST [15602] DEBUG:  resetting unlogged relations: cleanup 0 init 1
2024-04-21 19:29:05.447 IST [15602] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2024-04-21 19:29:05.447 IST [15602] DEBUG:  MultiXact member stop limit is now 4294914944 based on MultiXact 1
2024-04-21 19:29:05.447 IST [15603] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-04-21 19:29:05.447 IST [15603] DEBUG:  performing replication slot checkpoint
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=1 file=base/395417/3455 time=0.058 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=2 file=base/395417/2674 time=0.066 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=3 file=base/395417/2608 time=0.079 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=4 file=base/395417/589824 time=0.063 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=5 file=base/395417/2678 time=0.044 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=6 file=base/395417/2658 time=0.049 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=7 file=base/395417/1249 time=0.048 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=8 file=base/395417/2662 time=0.075 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=9 file=base/395417/1259 time=0.045 ms
2024-04-21 19:29:05.449 IST [15603] DEBUG:  checkpoint sync: number=10 file=pg_multixact/offsets/0000 time=0.024 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=11 file=base/395417/2679 time=0.052 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=12 file=pg_xact/0004 time=0.025 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=13 file=base/395417/2608_fsm time=0.086 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=14 file=base/395417/2673 time=0.044 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=15 file=base/395417/2610 time=0.048 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=16 file=base/395417/2663 time=0.051 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  checkpoint sync: number=17 file=base/395417/2659 time=0.049 ms
2024-04-21 19:29:05.450 IST [15603] DEBUG:  attempting to remove WAL segments older than log file 0000000000000005000000A5
2024-04-21 19:29:05.450 IST [15603] LOG:  checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.002 s, total=0.003 s; sync files=17, longest=0.001 s, average=0.001 s; distance=78 kB, estimate=78 kB; lsn=5/A6EC4840, redo lsn=5/A6EC4840
2024-04-21 19:29:05.450 IST [15602] DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
2024-04-21 19:29:05.450 IST [15602] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
2024-04-21 19:29:05.450 IST [15602] DEBUG:  proc_exit(0): 1 callbacks to make
2024-04-21 19:29:05.450 IST [15602] DEBUG:  exit(0)

I was able to fix the issue by adding a WHERE clause (a partial index):

CREATE INDEX "my-remote-index"
ON my_table 
USING pinecone (embedding) 
WITH (host = 'xxx.svc.aped-4627-b74a.pinecone.io')
WHERE embedding IS NOT NULL;

Please let me know if you need any more info.

Add testing for pinecone index type

Configuration and Index parameter tests

  • pinecone.api_key is empty
  • spec is empty or unprovided; spec is not valid json
  • pinecone index is empty because all the tuples are in the buffer
  • the buffersize parameter actually controls the buffer size (you can check that the buffersize guc is working by increasing it so that all vectors stay in the buffer, which should produce the empty-pinecone-index notice)

Other tests

  • pinecone.top_k guc really controls the number of results from pinecone (you can check by making sure all vectors are in pinecone, not the buffer, and then seeing how this parameter affects the number of results returned).
  • an error is raised when trying to build or insert a vector with all zero values (the error is only raised when the buffer is flushed, so you should set buffersize to 1 for the insertion part of this test).
  • building from a base table works: the tuples are actually sent to pinecone.
  • CORRECTNESS TESTS. Put a few tuples in pinecone and in the buffer. Verify that the query correctly interleaves the top hits from the buffer and the remote index. Do this for each of the three metrics and verify that the order really is correct. Choose values such that the results would be wrong if using a different metric (e.g. using euclidean instead of cosine when we are supposed to use cosine would give the wrong order).
  • METRIC TESTS. For each metric, build an index with that metric type. Use EXPLAIN to verify that the index is actually used for queries using this metric (you might need to SET enable_seqscan = 'off'), and that it is not used for queries using a different metric. (The operators are <->, <=>, and <#> for euclidean, cosine, and inner product; a sketch follows this list.)
  • select pinecone_indexes();
  • select pinecone_delete_unused_indexes();
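A rough sketch of such a metric test, assuming a pinecone index built with the cosine metric on the products table from the README (illustrative, not an actual regression test):

SET enable_seqscan = 'off';
EXPLAIN SELECT * FROM products ORDER BY embedding <=> '[...]' LIMIT 10; -- expect an index scan on the pinecone index (cosine operator matches)
EXPLAIN SELECT * FROM products ORDER BY embedding <#> '[...]' LIMIT 10; -- expect no index scan (inner product does not match the index metric)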

Wish list (not implemented)

  • poll pinecone to wait for index to finish building
  • hangs if spec is empty
  • it is possible to connect to an already existing index by directly specifying a host instead of a spec.

Q: Why is this a fork, rather than an extension that depends on pgvector?

PGVector is a good extension that Just Works within PostgreSQL, with all batteries included for a PostgreSQL cluster to operate. To me, this means pgvector supports full snapshot consistency, streaming replication, point-in-time recovery, basebackups, etc.

This "pgvector-remote" fork, however, is now fundamentally incompatible with pgvector: while it installs mostly the same features as pgvector (plus some more); it critically installs the same access methods that pgvector installs, using the same name. As access methods can't be renamed and their names are supposed to be unique and database-global namespaced (so not schema-namespaced like table names), this results in conflicts when you want to use both extensions; which is reasonable, considering pgvector's progress.

So, what's the rationale for choosing to Fork, rather than Extend?

Few suggestions to improve CX

Hey Guys,
Thanks for creating this wonderful extension. I'm trying to integrate it into one of our existing databases to perform similarity search and clustering. While setting it up in my test database, I ran into a few issues and thought I'd share them, along with a few suggestions to improve the dev experience.

API Key

I'm not entirely sure why (maybe because I'm using macOS), but ALTER DATABASE and ALTER SYSTEM didn't set the Pinecone API key.

  • ALTER DATABASE succeeded but didn't actually set the value so running CREATE INDEX failed.
  • ALTER SYSTEM failed with unrecognized configuration parameter "pinecone.api_key" error
  • I was finally able to get it working using SET pinecone.api_key TO 'xxxx' and verifying with SHOW pinecone.api_key.

It would be great if all these options were documented in the README, along with how to verify that the api_key was set.
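For reference, the session-level commands that worked for me:

SET pinecone.api_key TO 'xxxx';
SHOW pinecone.api_key;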

Handle https:// in host for CREATE INDEX

I initially ran the following command to create an index; it failed with the error curl_easy_perform(): Timeout was reached. The error was not intuitive. Removing https:// worked, so an appropriate error message, or handling this automatically, would be great.

CREATE INDEX my_remote_index 
ON products 
USING pinecone (embedding) 
WITH (host = 'https://xxx.svc.aped-4627-b74a.pinecone.io');
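For comparison, the form that worked once the scheme was removed:

CREATE INDEX my_remote_index 
ON products 
USING pinecone (embedding) 
WITH (host = 'xxx.svc.aped-4627-b74a.pinecone.io');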

Option to migrate from pgvector to pgvector-remote

Having this option would be great in the long run. I already have a database that uses pgvector. I tried to run ALTER EXTENSION ... UPDATE to switch the existing database to pgvector-remote, and postgres threw an exception.
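The command was roughly the following (illustrative; the exact target version string comes from the error below):

ALTER EXTENSION vector UPDATE TO 'remote0.1.0';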

extension "vector" has no update path from version "0.6.2" to version "remote0.1.0"

It would be great to allow existing pgvector users to seamlessly transition to pgvector-remote without having to create a new database and spend hours migrating from the old database to the new one.

nit: Installation Instructions

This is a nitpick, but the installation instructions were confusing at first glance. I wasn't sure whether I also had to run the commands for Milvus; I elected to skip them and was able to use pinecone without any issues. Having simple steps to follow would make it easier for devs trying to test, and it would also be great to have instructions for Mac.
