
pgvector's Issues

How to install pgvector with an existing postgres docker image?

Sorry if this is trivial, but I'm trying to get pgvector to work with my existing Docker Postgres container and am looking for some pointers.

I currently have a setup using docker-compose, which is running a postgres image and another service.

$docker ps

CONTAINER ID   IMAGE                           COMMAND                   CREATED        STATUS        PORTS                                       NAMES
e1efada5776d   postgres                        "docker-entrypoint.sā€¦"    3 years ago    Up 2 months   5432/tcp                                    hasura_postgres_1
...

I've pulled the pgvector image.

$docker image ls

REPOSITORY              TAG              IMAGE ID       CREATED        SIZE
ankane/pgvector         latest           1865afc96f66   3 weeks ago    381MB

How do I add the extension into my existing postgres container? The documentation mentions: "This adds pgvector to the Postgres image" but it is not clear what to do next.

Running CREATE EXTENSION vector; inside PostgreSQL currently yields "/usr/share/postgresql/12/extension/vector.control": No such file or directory, which seems expected, as the extension files aren't there.

My docker-compose.yaml looks like this:

version: '3.6'
services:
  postgres:
    image: postgres
    restart: always
    volumes:
    - db_data:/var/lib/postgresql/data
  graphql_service:
    image: graphql_service
    depends_on:
    - "postgres"
    restart: always
volumes:
  db_data:
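
One approach that fits the compose file above (a sketch, not the only way): swap the postgres service's image for ankane/pgvector, which extends the official postgres image and ships the extension files, then enable the extension inside the database. Service and volume names below mirror the file above; do check that the image's Postgres major version matches the data already in db_data before switching.

version: '3.6'
services:
  postgres:
    image: ankane/pgvector   # same behavior as the official postgres image, plus vector.control etc.
    restart: always
    volumes:
    - db_data:/var/lib/postgresql/data
  graphql_service:
    image: graphql_service
    depends_on:
    - "postgres"
    restart: always
volumes:
  db_data:

After docker-compose up -d, running CREATE EXTENSION vector; in the database should succeed, since the control file the error above complains about is present in that image.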

Issue when using L2 distance

I have an issue when using the L2 distance operator. No matter what I compare against, it does not respect the LIMIT clause and always returns just one (and always the same) record.

Here is an example query:

SELECT * FROM public.connected c ORDER BY factors <-> '[0.05613021180033684,-0.046761274337768555,-0.04585694149136543]' LIMIT 5;

I have no problems with cosine (<=>) and inner product (<#>), they work as expected.
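
A first diagnostic worth trying (a sketch using the table and column from the query above): check whether the query runs through an ivfflat index scan or a sequential scan, since an approximate ivfflat scan can legitimately return fewer rows than LIMIT asks for.

EXPLAIN ANALYZE
SELECT * FROM public.connected c
ORDER BY factors <-> '[0.05613021180033684,-0.046761274337768555,-0.04585694149136543]'
LIMIT 5;

If an Index Scan on an ivfflat index shows up, the ivfflat.probes setting discussed in the "Output returns 1 result" issue further down is the usual knob to turn; a plain sequential scan returning one row would point at something else entirely.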

Can't install using make, sql/vector--0.2.7.sql not available

#After cloning into extensions folder

C:\Program Files\PostgreSQL\Extensions>cd pgvector

C:\Program Files\PostgreSQL\Extensions\pgvector>make
process_begin: CreateProcess(NULL, uname -s, ...) failed.
Makefile:15: pipe: No error
process_begin: CreateProcess(NULL, pg_config --pgxs, ...) failed.
Makefile:40: pipe: No error
cp sql/vector.sql sql/vector--0.2.7.sql
process_begin: CreateProcess(NULL, cp sql/vector.sql sql/vector--0.2.7.sql, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [Makefile:35: sql/vector--0.2.7.sql] Error 2

The sql directory even has files for all versions up to 0.2.7.

Installing pgvector on a cloud pgsql instance

The instructions mention:

git clone --branch v0.1.2 https://github.com/ankane/pgvector.git
cd pgvector
make
make install # may need sudo

How do I do it for a pg instance on Google cloud? Is it possible?

Using this extension on Windows OS

Hi, I am using this extension in my Windows-based project. PostgreSQL is v14, and I have tried to compile the extension with both MSVC 2019 and GCC (MinGW-w64).
The DLL compiled with GCC does not work: PostgreSQL crashes as soon as you touch the vector interface.
The DLL compiled with MSVC 2019 works as long as no index is created, but when you create an index on a vector column, the PostgreSQL service process crashes immediately.
Does anyone have experience using this extension on Windows?

Ideas

Plan

  • Use pairing heap for index scan for performance - stages branch
  • Use mini-batch k-means for index creation for reduced memory - minibatch branch
  • Add support for product quantization (in-progress)

Ideas

  • Use tuplesort_set_bound for performance - bound branch (not needed w/ pairing heap)
  • Add functions to view lists and/or pages like pageinspect (require superuser)

On-hold

  • Add support for parallel index scans (planner gets cost estimate but doesn't use) - parallel-index-scan branch
  • Change return type of distance functions from float8 to float4 for performance (maybe, needs benchmarking)

Wrong row estimate for "order by limit"

Hi, I'm facing some performance issues on a 10-million-row table. Note that I don't use pgvector indexes; instead I do my own clustering and store a cluster_id with each row.

So the query looks something like this:

select person_id from detections 
where cluster_id in (select cluster_id from clusters order by embedding <-> $1 limit 10) 
order by embedding <-> $1 
limit 10

After some investigation I noticed that Postgres wrongly estimates the number of rows returned after the sort:

(screenshot of the EXPLAIN ANALYZE output)

As you can see, it does not seem to understand that a top-N heapsort will be executed and estimates 4000 rows instead of 10.
What can I do about it?

postgres version: 14
pgvector version: 0.2.7
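
One rewrite that is sometimes worth experimenting with here (a hedged sketch, assuming an index on detections(cluster_id); not a guaranteed fix for the estimate itself): bound each cluster's contribution explicitly with a lateral subquery, so the final top-N sort only ever sees at most 10 × 10 rows regardless of what the planner estimates.

SELECT best.person_id
FROM (SELECT cluster_id FROM clusters ORDER BY embedding <-> $1 LIMIT 10) c
CROSS JOIN LATERAL (
    SELECT dt.person_id, dt.embedding <-> $1 AS dist
    FROM detections dt
    WHERE dt.cluster_id = c.cluster_id
    ORDER BY dist
    LIMIT 10
) best
ORDER BY best.dist
LIMIT 10;

Whether this helps depends on how many detections each cluster holds, since each lateral branch still sorts one cluster's rows by distance.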

Multi-column index?

The GiST index doesn't seem to support multicolumn right now. How hard is it to add this support so that we can search within a subset of the table?

Index build error: `invalid memory alloc request size` >= 1GB

I encountered a new problem when working with our very large dataset: we hit the palloc size request limit.
palloc allocations are limited to 1 GB - 1; see https://github.com/postgres/postgres/blob/REL_14_5/src/backend/utils/mmgr/mcxt.c#L1077.
This limit is exceeded when allocating the samples array during the index build.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);

Insert 1M rows into the table

SET maintenance_work_mem='16GB';
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 13909);
ERROR:  invalid memory alloc request size 1073774812

1073774812 = VECTOR_ARRAY_SIZE(50*13909, 384)

Versions:

postgresql 14.5
pgvector v0.3.1 (2d8b7e5)

Output returns 1 result although setting LIMIT 5 after creating index

Hi, I am using pgvector to store 128-dimensional embedding vectors. I use the <-> operator to calculate the L2 distance between a given 128-dimensional embedding vector and the 100+ embedding vectors I have already stored in PostgreSQL. Before creating an index with CREATE INDEX ON <my_table_name> USING ivfflat (embedding vector_l2_ops);, the query SELECT * FROM <my_table_name> ORDER BY embedding <-> '<given_vector>' LIMIT 5; returns 5 results. However, after creating the index, the same query returns just 1 result (sometimes 2), never 5. I am very confused and have been researching this problem without success. Would anybody help me? Thank you in advance.
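
This is a known trait of the approximate ivfflat index rather than data loss: only a limited number of lists are probed per query, so fewer than LIMIT rows can come back. A hedged sketch of the usual remedy, reusing the placeholders from the question (ivfflat.probes defaults to 1):

SET ivfflat.probes = 10;  -- raising probes trades speed for recall
SELECT * FROM <my_table_name> ORDER BY embedding <-> '<given_vector>' LIMIT 5;

With more probes the five nearest neighbors should reappear; dropping the index, as observed, always gives exact results.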

Cast to double precision / numeric

Currently, casting to real[] is useful as it allows this to interoperate with existing clients; however, it loses precision.

A function that casts to double precision[] or possibly numeric[] (either would be fine, really) would be extremely useful.
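
Until a dedicated cast exists, one workaround sketch (the items/embedding names are hypothetical, and it assumes the vector → real[] cast mentioned above is available) is to let Postgres re-cast the array element type:

SELECT embedding::real[]::double precision[] FROM items;

Note this cannot recover precision already lost when the values were stored as 4-byte floats; it only hands clients a double precision[] to work with.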

Vector operations

Hi, is it possible to do type casting between arrays and vectors? I would like to do some operations inside Postgres (averaging vectors, ...), but I could not find anything to do so.
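
A workaround sketch that does not rely on any dedicated cast (values and the items/embedding names are illustrative): round-trip through the text representation, which both types can always produce and parse.

-- array -> vector
SELECT ('[' || array_to_string(ARRAY[1.0, 2.0, 3.0]::float4[], ',') || ']')::vector;

-- vector -> array
SELECT string_to_array(trim(both '[]' from embedding::text), ',')::float4[] FROM items;

Once values are back in float4[], ordinary array and aggregate machinery applies; see the aggregate-functions issue further down for an averaging sketch.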

Server crash when inserting with ivfflat index with high number of clusters

Hello, thank you for this amazing extension!

However, we have encountered some server crashes when inserting rows into a table with an index whose lists parameter is greater than around 6500.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 10000);

Then, to trigger the crash, insert some rows (as few as 1k did it for us almost every time).

Server logs

TRAP: FailedAssertion("((PageHeader) (page))->pd_special >= SizeOfPageHeaderData", File: "/usr/include/postgresql/server/storage/bufpage.h", Line: 317, PID: 54277)
postgres: root root [local] COPY(ExceptionalCondition+0xab)[0x5644680a2c2d]
/usr/lib/postgresql/vector.so(+0x54b1)[0x7f6309e364b1]
/usr/lib/postgresql/vector.so(+0x585c)[0x7f6309e3685c]
/usr/lib/postgresql/vector.so(ivfflatinsert+0xd1)[0x7f6309e36abe]
postgres: root root [local] COPY(index_insert+0x9b)[0x564467bf9a70]
postgres: root root [local] COPY(ExecInsertIndexTuples+0x1dd)[0x564467da2c76]
postgres: root root [local] COPY(+0x24e970)[0x564467d0b970]
postgres: root root [local] COPY(+0x24ebec)[0x564467d0bbec]
postgres: root root [local] COPY(CopyFrom+0xa64)[0x564467d0c889]
postgres: root root [local] COPY(DoCopy+0x41f)[0x564467d0afa9]
postgres: root root [local] COPY(standard_ProcessUtility+0x4a0)[0x564467f6c0b3]
postgres: root root [local] COPY(ProcessUtility+0xdb)[0x564467f6c7bc]
postgres: root root [local] COPY(+0x4acaf6)[0x564467f69af6]
postgres: root root [local] COPY(+0x4acdd6)[0x564467f69dd6]
postgres: root root [local] COPY(PortalRun+0x1c9)[0x564467f6a1a3]
postgres: root root [local] COPY(+0x4a9084)[0x564467f66084]
postgres: root root [local] COPY(PostgresMain+0x83f)[0x564467f682a4]
postgres: root root [local] COPY(+0x408e46)[0x564467ec5e46]
postgres: root root [local] COPY(+0x40b318)[0x564467ec8318]
postgres: root root [local] COPY(+0x40b565)[0x564467ec8565]
postgres: root root [local] COPY(PostmasterMain+0x1183)[0x564467ec9b9e]
postgres: root root [local] COPY(main+0x214)[0x564467e0aa52]
/usr/lib/libc.so.6(+0x23290)[0x7f63154e7290]
/usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7f63154e734a]
postgres: root root [local] COPY(_start+0x25)[0x564467b7d045]
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  server process (PID 54277) was terminated by signal 6: Aborted
2022-10-26 08:30:59.409 UTC [53734] DETAIL:  Failed process was running: COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)
2022-10-26 08:30:59.409 UTC [53734] LOG:  server process (PID 54277) was terminated by signal 6: Aborted
2022-10-26 08:30:59.409 UTC [53734] DETAIL:  Failed process was running: COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)
2022-10-26 08:30:59.409 UTC [53734] LOG:  terminating any other active server processes
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53741
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53765
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53737
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53736
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53738
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53739
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53740
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/global.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/db_16385.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  removing temporary stats file "pg_stat_tmp/db_16385.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/db_13780.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  removing temporary stats file "pg_stat_tmp/db_13780.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/db_0.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  removing temporary stats file "pg_stat_tmp/db_0.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 08:30:59.414 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  server process (PID 53765) exited with exit code 2
2022-10-26 08:30:59.417 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 10000);
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] LOG:  all server processes terminated; reinitializing

GDB stack trace

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f631554c64c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f631554c64c in ?? () from /usr/lib/libc.so.6
#1  0x00007f63154fc958 in raise () from /usr/lib/libc.so.6
#2  0x00007f63154e653d in abort () from /usr/lib/libc.so.6
#3  0x00005644680a2c4f in ExceptionalCondition (conditionName=conditionName@entry=0x7f6309e3b280 "((PageHeader) (page))->pd_special >= SizeOfPageHeaderData",
    errorType=errorType@entry=0x7f6309e3b0cb "FailedAssertion", fileName=fileName@entry=0x7f6309e3b218 "/usr/include/postgresql/server/storage/bufpage.h", lineNumber=lineNumber@entry=317) at assert.c:69
#4  0x00007f6309e364b1 in PageValidateSpecialPointer (page=page@entry=0x564469f59a78 "") at /usr/include/postgresql/server/storage/bufpage.h:317
#5  0x00007f6309e3685c in InsertTuple (rel=rel@entry=0x7f6309e40fc8, itup=itup@entry=0x564469e47ac0, heapRel=heapRel@entry=0x7f6309efb728, values=values@entry=0x7fff8198cfb0) at src/ivfinsert.c:90
#6  0x00007f6309e36abe in ivfflatinsert (index=0x7f6309e40fc8, values=0x7fff8198d0f0, isnull=<optimized out>, heap_tid=0x564469e48c38, heap=0x7f6309efb728, checkUnique=<optimized out>, indexUnchanged=false,
    indexInfo=0x564469c661e8) at src/ivfinsert.c:167
#7  0x0000564467bf9a70 in index_insert (indexRelation=indexRelation@entry=0x7f6309e40fc8, values=values@entry=0x7fff8198d0f0, isnull=isnull@entry=0x7fff8198d0d0, heap_t_ctid=heap_t_ctid@entry=0x564469e48c38,
    heapRelation=heapRelation@entry=0x7f6309efb728, checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, indexUnchanged=false, indexInfo=0x564469c661e8) at indexam.c:193
#8  0x0000564467da2c76 in ExecInsertIndexTuples (resultRelInfo=resultRelInfo@entry=0x564469c65ed8, slot=0x564469e48c08, estate=estate@entry=0x564469c675a0, update=update@entry=false,
    noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0, arbiterIndexes=0x0) at execIndexing.c:411
#9  0x0000564467d0b970 in CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x7fff8198d370, buffer=0x564469c57910) at copyfrom.c:344
#10 0x0000564467d0bbec in CopyMultiInsertInfoFlush (miinfo=miinfo@entry=0x7fff8198d370, curr_rri=curr_rri@entry=0x564469c65ed8) at copyfrom.c:426
#11 0x0000564467d0c889 in CopyFrom (cstate=cstate@entry=0x564469c65ca0) at copyfrom.c:1067
#12 0x0000564467d0afa9 in DoCopy (pstate=pstate@entry=0x564469c844c0, stmt=stmt@entry=0x564469b706a0, stmt_location=0, stmt_len=0, processed=processed@entry=0x7fff8198d490) at copy.c:299
#13 0x0000564467f6c0b3 in standard_ProcessUtility (pstmt=0x564469b71298, queryString=0x564469b6fa10 "COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)", readOnlyTree=<optimized out>,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x564469b71388, qc=0x7fff8198d780) at utility.c:739
#14 0x0000564467f6c7bc in ProcessUtility (pstmt=pstmt@entry=0x564469b71298, queryString=<optimized out>, readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=0x564469b71388, qc=0x7fff8198d780) at utility.c:527
#15 0x0000564467f69af6 in PortalRunUtility (portal=portal@entry=0x564469bd92f0, pstmt=pstmt@entry=0x564469b71298, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x564469b71388, qc=qc@entry=0x7fff8198d780) at pquery.c:1155
#16 0x0000564467f69dd6 in PortalRunMulti (portal=portal@entry=0x564469bd92f0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x564469b71388,
    altdest=altdest@entry=0x564469b71388, qc=qc@entry=0x7fff8198d780) at pquery.c:1312
#17 0x0000564467f6a1a3 in PortalRun (portal=portal@entry=0x564469bd92f0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564469b71388,
    altdest=altdest@entry=0x564469b71388, qc=0x7fff8198d780) at pquery.c:788
#18 0x0000564467f66084 in exec_simple_query (query_string=query_string@entry=0x564469b6fa10 "COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)") at postgres.c:1213
#19 0x0000564467f682a4 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff8198d980, dbname=<optimized out>, username=<optimized out>) at postgres.c:4496
#20 0x0000564467ec5e46 in BackendRun (port=port@entry=0x564469b97410) at postmaster.c:4530
#21 0x0000564467ec8318 in BackendStartup (port=port@entry=0x564469b97410) at postmaster.c:4252
#22 0x0000564467ec8565 in ServerLoop () at postmaster.c:1745
#23 0x0000564467ec9b9e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x564469b691b0) at postmaster.c:1417
#24 0x0000564467e0aa52 in main (argc=5, argv=0x564469b691b0) at main.c:209

Versions:

postgresql 14.5
pgvector v0.3.0 (379a760)

Index usage in queries

Hello. I created table

create table garment_vector(
garment_id int primary key,
vecdata vector(192)
);

CREATE INDEX l2_vector_index ON garment_vector
USING ivfflat (vecdata vector_l2_ops) WITH (lists = 1000);

It has 350000 rows

In this query Postgres uses the ivfflat index:
explain analyze
select *
from garment_vector v1
order by v1.vecdata <-> '[-0.047943115234375,0.02069091796875,-0.04364013671875, .... ,0.04705810546875]' limit 10

Limit  (cost=142.46..146.59 rows=10 width=788) (actual time=1.206..1.231 rows=10 loops=1)
  ->  Index Scan using l2_vector_index on garment_vector v1  (cost=142.46..142075.46 rows=344080 width=788) (actual time=1.205..1.229 rows=10 loops=1)
        Order By: (vecdata <-> '[-0.047943115234375,0.02069091796875,...0.04705810546875]'::vector)
Planning Time: 0.439 ms
Execution Time: 1.272 ms

The same happens with:
explain analyze
select *
from garment_vector v1
order by v1.vecdata <-> (select v2.vecdata from garment_vector v2 where v2.garment_id=1) limit 10

Limit  (cost=153.53..157.66 rows=10 width=788) (actual time=1.053..1.076 rows=10 loops=1)
  InitPlan 1 (returns $0)
    ->  Index Scan using garment_vector_pkey on garment_vector v2  (cost=0.42..8.44 rows=1 width=776) (actual time=0.008..0.009 rows=1 loops=1)
          Index Cond: (garment_id = 1)
  ->  Index Scan using l2_vector_index on garment_vector v1  (cost=145.09..144812.97 rows=350710 width=788) (actual time=1.052..1.073 rows=10 loops=1)
        Order By: (vecdata <-> $0)
Planning Time: 0.702 ms
Execution Time: 1.110 ms

But in this query, Postgres uses a sequential scan:
explain analyze
select *
from garment_vector v1,
garment_vector v2
where v2.garment_id=1
order by v1.vecdata <-> v2.vecdata limit 10

Limit  (cost=51011.72..51011.75 rows=10 width=1568) (actual time=357.449..357.453 rows=10 loops=1)
  ->  Sort  (cost=51011.72..51896.52 rows=353920 width=1568) (actual time=357.448..357.450 rows=10 loops=1)
        Sort Key: ((v1.vecdata <-> v2.vecdata))
        Sort Method: top-N heapsort  Memory: 59kB
        ->  Nested Loop  (cost=0.42..43363.64 rows=353920 width=1568) (actual time=0.031..256.816 rows=353920 loops=1)
              ->  Index Scan using garment_vector_pkey on garment_vector v2  (cost=0.42..8.44 rows=1 width=780) (actual time=0.009..0.011 rows=1 loops=1)
                    Index Cond: (garment_id = 1)
              ->  Seq Scan on garment_vector v1  (cost=0.00..38931.20 rows=353920 width=780) (actual time=0.020..116.430 rows=353920 loops=1)
Planning Time: 0.768 ms
Execution Time: 357.487 ms

How do I use the index in queries with a join?
I'm using PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit, in a Docker container, built from the latest version (0.2) of pgvector.
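
The difference between the three plans above suggests the index is only considered when the ORDER BY target is a single vector value (a constant or an InitPlan parameter), not an arbitrary join column. A hedged rewrite sketch of the join query in that shape; whether the planner uses the ivfflat index inside the lateral branch may vary, and the scalar-subquery form already shown above is the proven one:

SELECT nearest.*
FROM garment_vector v2
CROSS JOIN LATERAL (
    SELECT v1.*
    FROM garment_vector v1
    ORDER BY v1.vecdata <-> v2.vecdata
    LIMIT 10
) nearest
WHERE v2.garment_id = 1;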

pgvector vs FAISS

Update:
Upgrading to v0.1.1 and building with PG_CFLAGS=-ffast-math make reduced the query time to 2.2s! A big speed jump, but still 1.7x slower than the FAISS / Python service.


I imported 792010 rows of 512-dimensional image vectors (~5 GB, i.e. not random) and ran a test[1] to find the 4 closest vectors to an exact vector in the dataset.

Searching with:

  • 1.279357709s - FAISS python web service (using json and IndexFlatL2) (with 791963 vectors [2]).
  • 11.381s - Searching (l2_distance) with the pgvector extension (with 792010 rows).

Hardware:

MacBook Pro (15-inch, 2018)
2.6 GHz 6-Core Intel Core i7
16 GB 2400 MHz DDR4

Importing took 11.381 seconds with the COPY cmd from a csv file with each row being the vector.

Any ideas why pgvector would be so much slower? The testing environments for the two tools were significantly different, to FAISS's disadvantage, but FAISS was still much quicker.

[1] Not a "scientific" test. I had other programs running on the machine when running this test. Mileage may vary.
[2] The slight difference is the fais's vector import filters duplicate vectors.

Error during indexing large data

I'm experiencing an unknown error when creating indexes for a large table. I tried to create an index for a smaller chunk and it works well, but for larger chunks it throws an error. Could you please suggest where to dig to solve this issue?

osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 10;
CREATE INDEX
osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 1000;
CREATE INDEX
osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 10000000;
CREATE INDEX
osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 1000000000;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

What information should I provide to debug the problem?

Create index failed (terminated by signal 9: Killed)

Hi, this extension is exactly what I need for our nearest neighbor searches.
But it always fails when I try to create the index.

To reproduce:

  • insert 5848000 float vectors with length 768
  • create index with: CREATE INDEX ON <schema.table_name> USING ivfflat (<vector_column_name> vector_cosine_ops)
    exact query: CREATE INDEX ON nlp.vector_test USING ivfflat (embedding vector_cosine_ops);
  • Get exception: [08006] An I/O error occurred while sending to the backend.

Logs:

LOG:  server process (PID 97127) was terminated by signal 9: Killed
DETAIL:  Failed process was running: CREATE INDEX ON nlp.vector_test USING ivfflat (embedding vector_cosine_ops)
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2021-05-24 11:24:42 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 5DD/9D2C8818
FATAL:  the database system is in recovery mode
redo done at 5DD/BE66DE20
database system is ready to accept connections

Details:
It seems that index creation eats all of the server memory (64 GB). Then the system steps in (Ubuntu 20.04 LTS) and kills the process. I also tried reducing the number of inverted lists to 10, but it doesn't help.

Do I need some partitioning, or is this a bug?

Vector values are changed after inserting them to the database!

I inserted this in the database:

INSERT INTO items VALUES
('[0.459, 0.276, 1, -12.803, 1, 0.0804, 0.889, 0.009, 0.217, 0.354, 89.477, 56253, 3]');

After fetching the same vector, the result is as follows:
[0.458999991416931,0.275999993085861,1,-12.8030004501343,1,0.0803999975323677,0.888999998569489,0.00899999961256981,0.216999992728233,0.354000002145767,89.4769973754883,56253,3]

Are they stored as double precision or floats?

Can I order by dot product? And how large vector columns can I create?

I need to create vector columns with thousands of values; what is the upper limit for the number of values a vector column can have?

And can I do something similar to the following:

select * from xyz where dot_product(xyz.vectors, @some_argument) > @some_other_argument order by dot_product(xyz.vectors, @some_argument) desc

...?
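
pgvector's <#> operator is the relevant one here; it returns the negative inner product (so that smaller means more similar), so both the ordering and the threshold comparison need the sign flipped. A hedged sketch with an illustrative query vector and threshold standing in for @some_argument / @some_other_argument:

SELECT *
FROM xyz
WHERE -(xyz.vectors <#> '[0.1, 0.2, 0.3]') > 0.8   -- dot product above the threshold
ORDER BY xyz.vectors <#> '[0.1, 0.2, 0.3]'         -- ascending negative inner product = descending dot product
LIMIT 10;

On the size question, the VECTOR_MAX_DIM issue elsewhere on this page notes the compile-time cap was 1024 dimensions at the time.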

Rebuild index

In the documentation you write that the index should be built after a bunch of vectors have been added.
I have an application where new vectors are regularly being added; do I assume correctly that I then need to rebuild the index regularly?
What would be a good frequency?

(And as a side question: this library performs exact search, correct?)

Increase VECTOR_MAX_DIM

Hello, great project! Just what I was looking for for a vector database in Rust.

I was working on an OpenAI embedding project using pgvector with async-openai. However, I immediately ran into the VECTOR_MAX_DIM limit of 1024 on the first insert. The output of the latest OpenAI embedding model has dimension 1536.

Doubling VECTOR_MAX_DIM to 2048 would solve this particular use case, so I'm wondering whether increasing it is something that would align with the goals of the pgvector project?

Thank you

Q: How to estimate max dimensions supported?

Hey, thanks for authoring this project; really fantastic work.

What is the maximum possible vector size with the pg blocksize set at 32k?

A 1024-dimension vector would take 4 * 1024 + 8 = 4104 bytes. This is approximately half of the default 8k blocksize.

I'm unfamiliar with pg internals, but I'm assuming this relationship (max vector memory = 1/2 blocksize) has significance.

Assuming so, would I be correct in estimating ~4000 dimensions (~16k bytes) as an upper limit with the blocksize set at 32k?
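
The same arithmetic, spelled out as a query (a sketch that simply follows the asker's half-page assumption; it is not a statement about Postgres internals):

SELECT 4 * 1024 + 8        AS bytes_for_1024_dims,          -- 4104, roughly half an 8 kB page
       (32768 / 2 - 8) / 4 AS approx_max_dims_at_32k_page;  -- ~4094 under the same assumption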

The real reason I'm asking is that I'd like to store and use the largest embeddings provided by OpenAI's davinci model, which are around 12K dimensions. Is that not possible without dimensionality reduction?

Again, thanks for the amazing work!

30x size used by index vs tables

pg_relation_filepath |    kb    |                    relname
----------------------+----------+------------------------------------------------
 base/138600/8163593  | 24799472 | qa_embedding_idx
 base/138600/8163639  | 22196984 | terms_embedding_idx
 base/138600/8163587  |   854448 | qa
 base/138600/8163633  |   628264 | terms
 base/138600/8163580  |   450504 | qacluster_embedding_idx
 base/138600/8163614  |   207072 | termsearch
 base/138600/8163619  |   168464 | termsearch_pkey
 base/138600/8163592  |    68000 | qa_pkey
 base/138600/8163638  |    60864 | terms_pkey
 base/138600/8163594  |    25440 | qa_setid_idx
 base/138600/8163581  |    20008 | qasets
 base/138600/8163574  |     6352 | qacluster
 base/138600/8163586  |     4560 | qasets_pkey
 base/138600/8163579  |     1232 | qacluster_pkey

My indices are HUGE.
I use 1000 lists, and my vectors are fairly large, 1500 dimensions.

Am I doing something wrong?

Server closed the connection unexpectedly after set maintenance_work_mem

Hi, my table contains 13 847 564 rows, I store 13 847 564 128-dimensional embedding vectors and create the index with:

CREATE INDEX ON wiki_tb USING ivfflat (embedd vector_ip_ops) WITH (lists = 14885);

However, I get an error telling me: memory required is 43495 MB, maintenance_work_mem is 26700 MB. So I set maintenance_work_mem to match the requirement:

SET maintenance_work_mem TO '43495 MB';
CREATE INDEX ON wiki_tb USING ivfflat (embedd vector_ip_ops) WITH (lists = 14885);

Setting maintenance_work_mem works, but the CREATE INDEX command then throws the following error:

Error in query (7): server closed the connection unexpectedly
This probably means the server terminated abnormally 
before or while processing the request

I have no idea how to solve this problem; could you please help me with it? I would appreciate it. Thank you in advance.

Aggregate functions for vectors

I'm trying to generate means of vectors, but I'm getting the error message ERROR: function avg(vector) does not exist. Is this something that's easy to implement in pgvector, or should I rely on code outside the database for aggregation?
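
Until an avg(vector) aggregate exists, here is one in-database workaround sketch (it assumes the vector → real[] cast mentioned in another issue on this page is available, and uses hypothetical items/embedding names): average each dimension with the ordinary avg() aggregate and reassemble the result as an array.

SELECT array_agg(avg_val ORDER BY dim) AS mean_embedding
FROM (
    SELECT t.dim, avg(t.val) AS avg_val
    FROM items,
         unnest(embedding::real[]) WITH ORDINALITY AS t(val, dim)
    GROUP BY t.dim
) per_dim;

The result is a double precision[]; it can be fed back through the text round-trip shown in the vector-operations issue above if a vector is needed.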

Server crash when building an ivfflat index with a high number of clusters

Hello again, this issue is somewhat related to #41 but happens under different conditions and at a different code location.

This time the crash happens when creating an index on a table with a large number of rows and with the lists parameter also greater than around 6500 clusters.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);

Insert 1M rows into the table

SET maintenance_work_mem='16GB';
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 8000);
Server logs

2022-10-26 09:17:07.110 UTC [54774] STATEMENT:  create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:07.110 UTC [54774] DEBUG:  building index "embed_vec_idx" on table "embed" serially
2022-10-26 09:17:19.927 UTC [54584] DEBUG:  snapshot of 1+0 running transaction ids (lsn 0/6A280188 oldest xid 773 latest complete 772 next xid 774)
2022-10-26 09:17:47.742 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  server process (PID 54774) was terminated by signal 11: Segmentation fault
2022-10-26 09:17:47.743 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:47.743 UTC [53734] LOG:  server process (PID 54774) was terminated by signal 11: Segmentation fault
2022-10-26 09:17:47.743 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:47.743 UTC [53734] LOG:  terminating any other active server processes
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54588
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54584
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54583
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54585
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54586
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54587
2022-10-26 09:17:47.743 UTC [54587] DEBUG:  writing stats file "pg_stat/global.stat"
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  forked new backend, pid=54840 socket=9
2022-10-26 09:17:47.744 UTC [54840] LOG:  connection received: host=[local]
2022-10-26 09:17:47.744 UTC [54840] FATAL:  the database system is in recovery mode
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  proc_exit(1): 1 callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  exit(1)
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  server process (PID 54840) exited with exit code 1
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:48.242 UTC [54587] DEBUG:  writing stats file "pg_stat/db_16385.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  removing temporary stats file "pg_stat_tmp/db_16385.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  writing stats file "pg_stat/db_0.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  removing temporary stats file "pg_stat_tmp/db_0.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 09:17:48.245 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:48.245 UTC [53734] LOG:  all server processes terminated; reinitializing

GDB stack trace

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6309e36df9 in InitCenters (index=index@entry=0x7f6309ea5d38, samples=samples@entry=0x7f62dde6b050, centers=centers@entry=0x564469e56d20, lowerBound=lowerBound@entry=0x7f5fe2f62050)
    at src/ivfkmeans.c:63
63                              lowerBound[j * numCenters + i] = distance;
(gdb) bt
#0  0x00007f6309e36df9 in InitCenters (index=index@entry=0x7f6309ea5d38, samples=samples@entry=0x7f62dde6b050, centers=centers@entry=0x564469e56d20, lowerBound=lowerBound@entry=0x7f5fe2f62050)
    at src/ivfkmeans.c:63
#1  0x00007f6309e37164 in ElkanKmeans (index=0x7f6309ea5d38, samples=0x7f62dde6b050, centers=0x564469e56d20) at src/ivfkmeans.c:254
#2  0x00007f6309e37ada in IvfflatKmeans (index=0x7f6309ea5d38, samples=<optimized out>, centers=0x564469e56d20) at src/ivfkmeans.c:513
#3  0x00007f6309e354c0 in ComputeCenters (buildstate=buildstate@entry=0x7fff8198cd30) at src/ivfbuild.c:401
#4  0x00007f6309e35f4c in BuildIndex (heap=<optimized out>, index=0x7f6309ea5d38, indexInfo=<optimized out>, buildstate=buildstate@entry=0x7fff8198cd30, forkNum=forkNum@entry=MAIN_FORKNUM) at src/ivfbuild.c:580
#5  0x00007f6309e35fc4 in ivfflatbuild (heap=<optimized out>, index=<optimized out>, indexInfo=<optimized out>) at src/ivfbuild.c:599
#6  0x0000564467c8022c in index_build (heapRelation=heapRelation@entry=0x7f6309ea5a30, indexRelation=indexRelation@entry=0x7f6309ea5d38, indexInfo=indexInfo@entry=0x564469d00f38,
    isreindex=isreindex@entry=false, parallel=parallel@entry=true) at index.c:3012
#7  0x0000564467c81e06 in index_create (heapRelation=heapRelation@entry=0x7f6309ea5a30, indexRelationName=indexRelationName@entry=0x564469de0920 "embed_vec_idx", indexRelationId=40964, indexRelationId@entry=0,
    parentIndexRelid=parentIndexRelid@entry=0, parentConstraintId=parentConstraintId@entry=0, relFileNode=0, indexInfo=0x564469d00f38, indexColNames=0x564469de08a8, accessMethodObjectId=16435, tableSpaceId=0,
    collationObjectId=0x564469dd3360, classObjectId=0x564469dd3380, coloptions=0x564469dd33a0, reloptions=94851833923912, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false,
    constraintId=0x7fff8198d0f4) at index.c:1232
#8  0x0000564467d39c55 in DefineIndex (relationId=relationId@entry=16462, stmt=stmt@entry=0x564469b70780, indexRelationId=indexRelationId@entry=0, parentIndexId=parentIndexId@entry=0,
    parentConstraintId=parentConstraintId@entry=0, is_alter_table=is_alter_table@entry=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at indexcmds.c:1164
#9  0x0000564467f6d24b in ProcessUtilitySlow (pstate=pstate@entry=0x564469d00e20, pstmt=pstmt@entry=0x564469b71820,
    queryString=queryString@entry=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0,
    queryEnv=queryEnv@entry=0x0, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:1534
#10 0x0000564467f6c6d2 in standard_ProcessUtility (pstmt=0x564469b71820, queryString=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);",
    readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:1066
#11 0x0000564467f6c7bc in ProcessUtility (pstmt=pstmt@entry=0x564469b71820, queryString=<optimized out>, readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:527
#12 0x0000564467f69af6 in PortalRunUtility (portal=portal@entry=0x564469bb35d0, pstmt=pstmt@entry=0x564469b71820, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x564469c87e28, qc=qc@entry=0x7fff8198d780) at pquery.c:1155
#13 0x0000564467f69dd6 in PortalRunMulti (portal=portal@entry=0x564469bb35d0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x564469c87e28,
    altdest=altdest@entry=0x564469c87e28, qc=qc@entry=0x7fff8198d780) at pquery.c:1312
#14 0x0000564467f6a1a3 in PortalRun (portal=portal@entry=0x564469bb35d0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564469c87e28,
    altdest=altdest@entry=0x564469c87e28, qc=0x7fff8198d780) at pquery.c:788
#15 0x0000564467f66084 in exec_simple_query (query_string=query_string@entry=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);") at postgres.c:1213
#16 0x0000564467f682a4 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff8198d980, dbname=<optimized out>, username=<optimized out>) at postgres.c:4496
#17 0x0000564467ec5e46 in BackendRun (port=port@entry=0x564469b98a70) at postmaster.c:4530
#18 0x0000564467ec8318 in BackendStartup (port=port@entry=0x564469b98a70) at postmaster.c:4252
#19 0x0000564467ec8565 in ServerLoop () at postmaster.c:1745
#20 0x0000564467ec9b9e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x564469b691b0) at postmaster.c:1417
#21 0x0000564467e0aa52 in main (argc=5, argv=0x564469b691b0) at main.c:209

Versions:

postgresql 14.5
pgvector v0.3.0 (379a760)

pgvector-php

Do you plan to build a pgvector library for PHP?

Issues running `make` on `m1`

Running make:

bram@Brams-MacBook-Air pgvector % make                        
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument  -mmacosx-version-min=10.12 -arch arm64 -arch x86_64 -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math  -I. -I./ -I/Applications/Postgres.app/Contents/Versions/14/include/postgresql/server -I/Applications/Postgres.app/Contents/Versions/14/include/postgresql/internal -I/Applications/Postgres.app/Contents/Versions/14/share/icu -I/Applications/Postgres.app/Contents/Versions/14/include/libxml2 -I/Applications/Postgres.app/Contents/Versions/14/include  -I/Applications/Postgres.app/Contents/Versions/14/include  -c -o src/ivfbuild.o src/ivfbuild.c
clang: error: the clang compiler does not support '-march=native'
make: *** [src/ivfbuild.o] Error 1

Running make -mcpu=apple-m1

bram@Brams-MacBook-Air pgvector % make install -mcpu=apple-m1
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- c
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- u
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- =
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- a
Usage: make [options] [target] ...
...
This program built for i386-apple-darwin11.3.0
...

pgvector on conda-forge

Hello, I've just created some staged recipes for pgvector and pgvector-python on conda forge's staged-recipes and gotten them building: conda-forge/staged-recipes#21432. I'm happy to maintain the feedstocks, but wanted to reach out and ask if there was anyone else that I should add to them.

array equals comparison stopped working

I am having some problems after installing the extension in Postgres 14.

First, array equality comparisons seem to have stopped working. Message:

HINT: Could not choose a best candidate operator. You might need to add explicit type casts.

This is after adding the casts as well. The relevant part of the query is:

and CAST(word_ids as int[]) = CAST([myvalues] as int[])

I'm not sure if this is related, but the bigint comparison also seems to have stopped working. When I do a very simple query like this:

select * from word w where w.signature = 3866747771;

I get ERROR: unrecognized strategy number: 2

If I provide a shorter number, it appears to work fine.

I apologize I don't have a full test case worked up, but I wanted to see if this rings any bells.

16bit float support

I'm using this library on a rather large set of vectors (>10^9), and so I'm considering forking the code so that the internal storage format is 16bit floats rather than 32bit floats. I don't really care about any potential speedup from using half precision floats as I imagine even under ideal hardware this will be negligible, I'm really just interested in space savings.

Ideally, I'd figure out a way to do some macro magic in order to have all of your existing index code work on both a 16bit and 32bit type (and possibly a 64bit type while I'm at it), but before I get started on this, I'm wondering if you've thought about this at all? And if you'd be open to a pull request that does this even if it makes the existing code quite a bit more complicated?

bug: free page can not use again

ivfflatbulkdelete()
{
    ...
    /* Set to first free page */
    if (!BlockNumberIsValid(insertPage))
        insertPage = searchPage;
    ...
}

Here only the first free page is recorded; the other deleted pages can never be used again, which causes the index to grow larger and larger.

Any solution? Thanks.

bug? - unexpected data beyond EOF in block

While using pgvector on a table with frequent updates / inserts on Postgres 14 on macOS on Intel, I've been encountering this error frequently on UPDATES:

psycopg2.errors.InternalError_: unexpected data beyond EOF in block 1638807 of relation base/24349058/41425278
HINT:  This has been seen to occur with buggy kernels; consider updating your system.

Looking through the PostgreSQL mailing list about this error, most posts pertain to linux kernels from the ~2010s, and don't seem applicable.

I've run VACUUM FULL on the table a few times, as well as completely dumping the table using pg_dump, deleting the table and recreating. The table is ~340 GiB and there is also a 13 GiB IVFFlat index referencing one of the vector(768) columns.

Wondering if there might be a bug in how large vectors are stored.

My table, notably, contains columns of types:

  • vector(768)
  • vector(768)[]
  • character varying[]
  • character varying

And each row is easily around 2 or 3 MiB.

Clarification about accuracy of partially indexed dataset

Hi, thanks for this wonderful extension, great work!

So, as far as I understand, users are supposed to create the index on the vector column after data has been inserted. I regularly need to insert, update, and delete vectors in that column, so I set up a periodic job that runs REINDEX CONCURRENTLY on a timer to keep search queries fast.

The question is: I might run into a scenario where, first, 100% of the data has been indexed, then 50% of that data is removed, and then 50% of new data is added BEFORE the reindex happens. That would mean 50% indexed data and 50% non-indexed data. How does the algorithm work in this situation?

Segmentation fault when count table

First, nice work!

But I found an error when counting a table.

These are my SQL scripts:

create or replace function gen_float4_arr(int) returns float4[] as $$    
  select array_agg((random()*100)::float4) from generate_series(1,$1);    
$$ language sql strict;

CREATE TABLE test2 (id int, feature vector(256));

INSERT INTO test2 SELECT id, gen_float4_arr(256) FROM generate_series(1, 50000) id;  

CREATE index index_test2_f ON test2 USING ivfflat (feature);

select count(1) from test2;

I think the reason is in the function ivfflatgettuple: the scan object does not have the orderByData member set (https://github.com/ankane/pgvector/blob/fb015adc25a9dad540caf8332dadcfdb1043f1d6/src/ivfscan.c#L255).

This was the log:

2021-06-09 04:37:56.745 EDT [13064] LOG:  server process (PID 13103) was terminated by signal 11: Segmentation fault
2021-06-09 04:37:56.745 EDT [13064] DETAIL:  Failed process was running: select count(1) from test2
2021-06-09 04:37:56.745 EDT [13064] LOG:  terminating any other active server processes
2021-06-09 04:37:56.746 EDT [13070] WARNING:  terminating connection because of crash of another server process

How to set threshold for search

Thanks for your awesome project. I want to query for vectors whose L2 distance from a given vector is larger than a given threshold. How can I do that?
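
The distance operators return plain double precision values, so a threshold can go straight into WHERE. A hedged sketch with illustrative names and numbers; note that a "farther than" filter is not something the ivfflat index can accelerate, so expect a sequential scan:

SELECT *
FROM items
WHERE embedding <-> '[1, 2, 3]' > 5.0   -- keep only rows farther than the threshold
ORDER BY embedding <-> '[1, 2, 3]';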

Ideas

Please create a new issue to discuss any ideas or share your own.

HNSW

  • Add support for inline filtering
    • Scan only - hnsw-filtering branch
    • HQANN - hqann branch (WIP)
  • Use scan->kill_prior_tuple - kill-prior-tuple branch (WIP)
  • Add support for index-only scans - hnsw-index-only-v2 branch
  • Add support for INCLUDE clause with CREATE INDEX (not more performant since still accesses heap)
  • Improve numIndexTuples in cost estimation

IVFFlat

  • Add support for product quantization
  • Add support for another coarse quantizer
  • Add support for parallel k-means
  • Update parallel workers calculation - ivfflat-parallel-workers branch
  • Increase default value of ivfflat.probes
  • Add support for setting default probes on index level
  • Scan additional lists if number of entries is low after scanning probes
  • Use pairing heap for index scan for performance - stages and pairingheap branches
  • Use tuplesort_set_bound for performance - bound branch (not needed w/ pairing heap)
  • Add support for parallel index scans (planner gets cost estimate but doesn't use) - parallel-index-scan3 branch
  • Use Lloyd's or MO-Elkan's k-means when not enough memory for Elkan's
  • Use FB-Elkan's k-means for faster index creation
  • Use mini-batch k-means for index creation for reduced memory - minibatch branch
  • Add support for index-only scans - ivfflat-index-only branch
  • Add support for INCLUDE clause with CREATE INDEX
  • Prevent compression for index tuples - ivfflat-storage branch

Indexes

  • Include vectors with zero norm in indexes for cosine distance - index-zero-norm branch
  • Update cost estimation to not use index if LIMIT + OFFSET > expected tuples - index-limit/index-limit2 branch
  • Update cost estimation to not use index if large % of rows will be filtered by WHERE condition - hnsw-filtering-cost branch

Types

  • Add support for int8 vectors - intvec branch

Functions

  • Add subscript function - subscript branch (requires Postgres 14+)
  • Add angular_distance function - angular_distance branch - #234
  • Add random_vector function - random_vector branch
  • Add functions to view lists and/or pages like pageinspect (require superuser)
  • Add function to estimate index balance (require superuser?) - #105

Installation

  • Create installer for Windows - #108

Building on Windows

Continuation of #37

@cdtaichen Re #37 (comment), for GCC, I think the commands will probably be similar to Linux, which is:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfbuild.o src/ivfbuild.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfflat.o src/ivfflat.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfinsert.o src/ivfinsert.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfkmeans.o src/ivfkmeans.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfscan.o src/ivfscan.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfutils.o src/ivfutils.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfvacuum.o src/ivfvacuum.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/vector.o src/vector.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -shared -o vector.so src/ivfbuild.o src/ivfflat.o src/ivfinsert.o src/ivfkmeans.o src/ivfscan.o src/ivfutils.o src/ivfvacuum.o src/vector.o -L/usr/lib/x86_64-linux-gnu  -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -L/usr/lib/llvm-10/lib  -Wl,--as-needed 

Also, outstanding questions from #37:

  1. What does the Postgres server log say about the crashes?
  2. Does the windows branch work for MSVC?

Documentation improvement ideas

My friend and I were just chatting about adding vector search to pg the other day!

Could you add in the docs (or answer here?) these questions:

  1. What algorithms are used to find the closest vectors? I see you mention FAISS, but it's unclear exactly what is used.
  2. Could you provide documentation on how this scales? Would this support 1B vectors?
  3. Does this support partitioned indexes?

Q: max dimensions / page size

Hi,
if PG page size 8192 (8KB), and
"Each vector takes 4 * dimensions + 8 bytes of storage"
why the limit is 1024? it looks like it can be bigger (considering other columns in the same table)

can I just increase it to 1080 in src/vector.h?
Without having PG compiled with 16KB page?

10x!

Q: Index Creation

I have a question about index creation.

What happens if you don't specify the index operator class?

So instead of doing it like this:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops);
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops);
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);

I just do:

CREATE INDEX ON items USING ivfflat (embedding); (it does allow me to do this)

Do I gain anything by doing it like this?

What would be the approach if I'd like to use all 3 operations in my app (L2, IP, cosine) - should I create 3 indexes?
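
A hedged summary in SQL form: without an explicit operator class, CREATE INDEX falls back to the type's default opclass (on a stock pgvector install that is expected to be vector_l2_ops), so the bare form only helps <-> queries; covering all three operators takes three indexes, one per opclass.

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops);      -- accelerates <->  (L2 distance)
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops);      -- accelerates <#>  (inner product)
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);  -- accelerates <=>  (cosine distance)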

Postgres 12 issues with installation and indexes

I had issues with folders when installing the vector extension on Postgres 12, as follows:

/usr/bin/clang-10 -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -O2  -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o src/vector.bc src/vector.c
root@cfsdb-en1-postgres-1:~/pgvector# make install
/bin/mkdir -p '/usr/lib/postgresql/14/lib'
/bin/mkdir -p '/usr/share/postgresql/14/extension'
/bin/mkdir -p '/usr/share/postgresql/14/extension'

However, after solving the issues:
(screenshot omitted)
I am trying to execute the code for cosine similarity (the same happens for all the similarity measures):

CREATE TABLE items (embedding vector(3));
INSERT INTO items VALUES ('[1,2,3]'), ('[4,5,6]');
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 1;

It returns zero rows. I can see that the index is present in the pg_index table. However if I delete the index it returns as it should:

 embedding 
-----------
 [4,5,6]

Dockerize pgvector

Hi @ankane,

First of all, thanks for open-sourcing this project.

I have been testing a few use cases and the results are very promising!

I wanted to share the docker setup I'm using in case other devs want to run pgvector without having to compile or deal with dependencies. The resulting Docker image supports all the features as in https://hub.docker.com/_/postgres, as it extends directly from it.

Here is the PR #4

I'm curious:

  • have you taken perf benchmarks?
  • any important limitations (scalability, sharding, vector size)?

Thanks!
