lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

Home Page: https://lancedb.github.io/lance/

License: Apache License 2.0

Python 12.96% Dockerfile 0.01% Rust 74.70% CMake 0.05% Makefile 0.03% C 0.13% C++ 0.03% Shell 0.05% Jupyter Notebook 11.27% Java 0.77%
machine-learning computer-vision data-format deep-learning python apache-arrow duckdb mlops data-analysis data-analytics

lance's Introduction

LanceDB Logo

Developer-friendly database for multimodal AI


LanceDB Multimodal Search


LanceDB is an open-source database for vector search built with persistent storage, which greatly simplifies retrieval, filtering, and management of embeddings.

The key features of LanceDB include:

  • Production-scale vector search with no servers to manage.

  • Store, query and filter vectors, metadata and multi-modal data (text, images, videos, point clouds, and more).

  • Support for vector similarity search, full-text search and SQL.

  • Native Python and JavaScript/TypeScript support.

  • Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.

  • GPU support for building vector indexes(*).

  • Ecosystem integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache-Arrow, Pandas, Polars, DuckDB and more on the way.

LanceDB's core is written in Rust 🦀 and is built using Lance, an open-source columnar format designed for performant ML workloads.

Quick Start

Javascript

npm install vectordb

const lancedb = require('vectordb');
const db = await lancedb.connect('data/sample-lancedb');

const table = await db.createTable({
  name: 'vectors',
  data:  [
    { id: 1, vector: [0.1, 0.2], item: "foo", price: 10 },
    { id: 2, vector: [1.1, 1.2], item: "bar", price: 50 }
  ]
})

const query = table.search([0.1, 0.3]).limit(2);
const results = await query.execute();

// You can also search for rows by specific criteria without involving a vector search.
const rowsByCriteria = await table.search(undefined).where("price >= 10").execute();

Python

pip install lancedb

import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table",
                         data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                               {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
result = table.search([100, 100]).limit(2).to_pandas()

Blogs, Tutorials & Videos

lance's People

Contributors

albertlockett, ananis25, bubblecal, changhiskhan, chebbychefneq, da-tubi, dnsco, eddyxu, frankliee, gsilvestrin, haoxins, hzhang86, jaichopra, jerryyifei, lillianye, liweijie, luqqiu, renkai, rok, tanaymeh, tevinwang, trueutkarsh, universalmind303, unkn-wn, wangfenjin, weijun-h, westonpace, wjones127, yah01, yliang412


lance's Issues

Support and enforce primary key

Problem Statement

We assume that each Lance dataset has a primary key, to guarantee fast point queries later.
Currently, the WriteTable API takes a mandatory primary key parameter; however, we do not enforce uniqueness or sorting by the primary key at the moment.

Desired Behavior

WriteTable should enforce (see the validation sketch after this list):

  • A primary key is provided
  • The column of primary key exists
  • The uniqueness of the primary key in the column.
  • Optionally, sort the dataset by the primary key?
  • Open question: in the first version, do we want to limit the data types supported as a primary key?
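A minimal sketch of what these checks could look like on the Python side, assuming a PyArrow Table as input; the validate_primary_key helper is hypothetical and not part of the current API:

import pyarrow as pa
import pyarrow.compute as pc

def validate_primary_key(table: pa.Table, primary_key: str) -> None:
    """Hypothetical pre-write validation mirroring the rules above."""
    if not primary_key:
        raise ValueError("a primary key must be provided")
    if primary_key not in table.column_names:
        raise ValueError(f"primary key column '{primary_key}' does not exist")
    column = table.column(primary_key)
    if column.null_count > 0:
        raise ValueError("primary key values must not be null")
    # Uniqueness: the number of distinct values must equal the row count.
    if len(pc.unique(column)) != table.num_rows:
        raise ValueError("primary key values must be unique")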

parallel reads seems to be turned off in lance?

I ran the label distribution query against COCO for Parquet and Lance. For Parquet, the total CPU time was greater than the wall time, but for Lance that was not the case. I vaguely remember discussing this before; not sure if it's already a known issue.

Support offset and limit pushdown

Problem Statement

A major use case for the Lance format is holding ML datasets that are easy to inspect. It should answer queries like SELECT * FROM dataset WHERE ... LIMIT 20 quickly. Combined with the fact that Lance is designed for embedded assets, such as images, it is desirable to reduce the amount of data scanned.

Desired Behavior

Support offset and limit pushdown, to only scan large columns if necessary.
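As an illustration of the intended call shape, reusing the scanner arguments that appear elsewhere in these issues (the dataset path is hypothetical):

import lance

ds = lance.dataset("s3://bucket/coco.lance")
# With limit/offset pushdown, only ~20 rows of the large image column
# should be read from storage, not the entire column.
tbl = lance.scanner(ds, columns=["image", "annotations"], limit=20, offset=50).to_table()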

Protobuf build error on MacOS

Problem Statement

make failed on macOS after protobuf was upgraded by Homebrew:

[ 82%] Built target io
[ 83%] Linking CXX shared library liblance.dylib
Undefined symbols for architecture arm64:
  "google::protobuf::internal::InternalMetadata::~InternalMetadata()", referenced from:
      google::protobuf::MessageLite::~MessageLite() in format.pb.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [liblance.dylib] Error 1
make[1]: *** [CMakeFiles/lance.dir/all] Error 2
make: *** [all] Error 2

OS: macOS 12.4
Protobuf: 21.2 (Homebrew)
Arrow: 8.0.0_4 (Homebrew)

Pyarrow Dataset Scanner has no public Cython definition

Problem Statement

#61 attempts to create a new Arrow Scanner object to support nested column parsing, as well as LIMIT / OFFSET pushdown. However, pyarrow._dataset.Scanner does not have a public definition in _dataset.pxd. This makes it difficult to wrap the C++ CScanner object into a pyarrow.dataset.Scanner Python object.

For example, we can not call https://github.com/apache/arrow/blob/02c8598d264c839a5b5cf3109bfd406f3b8a6ba5/python/pyarrow/_dataset.pyx#L2361-L2362

Desired Behavior

Short term, we'd like to be able to create something that DuckDB can consume correctly from Python.
Long term, we either need to be able to create the Scanner on our own, or fix Arrow / DuckDB to support nested column parsing.

Improve build and release process for the python package

Problem Statement

Currently, we build the Python package using a manually compiled pyarrow. The reasons were that:

  • We are using the official C++ packages via apt-get to install libarrow-*.dev. However, it is not compatible with the libarrow.so installed by pip install pyarrow.
  • The PyPI pyarrow was built via manylinux2014(?), but we are building Lance on Ubuntu 22.04 with C++20.

Desired behavior

Figure out the correct practice to build and release python package.

Apache Arrow does not support FieldRef to list of structs

Problem Statement

Apache Arrow does not support field reference to a list<struct>

import duckdb
import lance

ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])

Error:

Traceback (most recent call last):
  File "/Users/lei/work/lance/./query.py", line 6, in <module>
    ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])
  File "pyarrow/_dataset.pyx", line 271, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2328, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2174, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(annotations.label) in id: int64
width: int64
height: int64
file_name: string
image: struct<data: binary>
annotations: list<item: struct<area: double, box: struct<xmax: double, xmin: double, ymax: double, ymin: double>, label: string, label_id: int64, segmentation: struct<height: int64, polygon: list<item: list<item: double>>, rle: list<item: int64>, type: int64, width: int64>, supercategory: string>>
__index_level_0__: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Expected Behavior

Using annotations.label should return values with type list<struct<label: string>>, a subset view of the original annotations list<struct>.

Establish performance baseline

Problem Statement

We should maintain a set of benchmarks to establish a performance baseline, observe regressions, and drive performance optimization.

Desired behavior

We have a set of representative benchmarks that we can run between releases to monitor performance improvement/degradation.

Release intel and apple silicon wheels from Github Action

Problem Statement

Let's figure out how to build x86_64 and arm64 targets from GitHub Actions mac runners, so that we can release directly from GH Actions.

  • Install / Build arrow / protobuf for both x86_64 and arm64
  • Build lance cpp SDK for both x86_64 and arm64.
  • Build python wheel via cibuildwheel for both x86_64 and arm64

Support simple predicates push down

Problem Statement

Assuming a dataset with schema primary_key: string, label: dict[string], image: Image, if we run the query:

SELECT image FROM dataset WHERE label = "cat" LIMIT 20

we should expect that Lance does not actually scan all image data. Instead, it should use the filter (label = 'cat') to selectively read the image column.

Desired Behavior

Use predicate pushdown to reduce the scan traffic over a large column (i.e., Tensor / Image).
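A sketch of the intended usage, assuming a hypothetical dataset path and the to_table(filter=...) shape used elsewhere in these issues:

import pyarrow.compute as pc
import lance

ds = lance.dataset("s3://bucket/pets.lance")
# Ideally the small `label` column is evaluated first, and only the matching
# rows of the large `image` column are ever fetched from storage.
tbl = ds.to_table(columns=["image"], filter=pc.field("label") == "cat")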

Null support in all encodings

Problem Statement

Apache Arrow supports arrays with nulls. We should be able to store them efficiently on disk as well.

  • PlainEncoding
  • VarBinaryEncoding

Dictionary encoding uses plain encoding for its index, so Null support can be done via Plain encoding.
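A minimal illustration of the inputs in question, i.e. Arrow arrays containing nulls that the two encoders would need to round-trip:

import pyarrow as pa

ints = pa.array([1, None, 3], type=pa.int32())          # PlainEncoding input
strings = pa.array(["a", None, "c"], type=pa.string())  # VarBinaryEncoding input
print(ints.null_count, strings.null_count)  # 1 1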

Enable Prefetching when Limit / Offset clause is included.

Problem Statement

For correctness and simplicity, we disabled prefetch and parallel scan when a LIMIT / OFFSET clause is provided (#45), to make counting the number of rows read easier. We expect that to serve a web app, usually only LIMIT 20-100 is applied, for pagination purposes. However, there are more opportunities to improve the I/O process and reduce latency.

For example, we can still allow sequential prefetch threads to run in the background without sacrificing correctness. Furthermore, we can use hedged reads to improve parallelism as well, though that needs fine-tuned scheduling.

Provide WriteTable API in Python.

Problem

Currently, the Python package can read a Lance file. We need to add write support in Python as well.

The WriteTable functionality has already been implemented in C++ side.

Desired behavior

Implement a pylance.write_table(...) API that wraps the C++ WriteTable API, takes in a PyArrow Table, and writes it to disk.
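A sketch of the intended usage; this mirrors the call shape appearing in later issues, and the final signature may differ:

import pyarrow as pa
import lance

table = pa.table({"id": [1, 2, 3], "label": ["cat", "dog", "cat"]})
# Proposed: wrap the existing C++ WriteTable and write the table to disk.
lance.write_table(table, "/tmp/example.lance")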

Cannot write datetime column

In [7]: import pandas as pd

In [8]: import pyarrow as pa

In [9]: import lance

In [10]: df = pd.DataFrame({'date': pd.date_range('2022-01-01', periods=100, freq='D')})

In [11]: df.dtypes
Out[11]: 
date    datetime64[ns]
dtype: object

In [12]: table = pa.Table.from_pandas(df)

In [13]: table.schema
Out[13]: 
date: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 377

In [14]: lance.write_table(table, '/tmp/test.lance')
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 lance.write_table(table, '/tmp/test.lance')

File ~/code/eto/lance/python/lance/__init__.py:63, in write_table(table, destination)
     53 def write_table(table: pa.Table, destination: Union[str, Path]):
     54     """Write an Arrow Table into the destination.
     55 
     56     Parameters
   (...)
     61         The destination to write dataset to.
     62     """
---> 63     WriteTable(table, destination)

File ~/code/eto/lance/python/lance/_lib.pyx:117, in lance.lib.WriteTable()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: PlainEncoder:: does not support data type timestamp[ns]
> /home/chang/code/eto/lance/python/scripts/pyarrow/error.pxi(100)pyarrow.lib.check_status()

ipdb> 

add project description to pypi

basically just do something like this:

from setuptools import setup

# read the contents of your README file
from pathlib import Path
this_directory = Path(__file__).parent
long_description = (this_directory / "README.md").read_text()

setup(
    name='an_example_package',
    # other arguments omitted
    long_description=long_description,
    long_description_content_type='text/markdown'
)

Verify initial installation scenarios

  • Support python 3.8-3.10
  • Support mac x86, mac arm64, manylinux
  • pip install pylance into a clean venv (not conda) on a system that hasn't installed libarrow* separately
  • Verify that we can import lance and pyarrow
  • Verify that we can call lance.dataset and lance.scan

Put this in a GH action so it can be re-run before every release and also at least once a day on a schedule

Compute min/max values for columns to support file-level and chunk-level pruning.

Problem Statement

Using min/max values to skip / prune chunks is a common practice in columnar storage. Let's add it as well to support coarse-grained filtering.

Open questions:

  • Should we use different bit widths for different data types (e.g., 8 bits for int8/uint8, 64 bits for int64/uint64/double), or just one implementation for all?
  • How to support string values efficiently. Do we only want to support min/max for dictionary string values or string values in general?

Desired behavior

Compute min/max values and store them in a fashion that does not hurt either full scans or point queries. Ideally, such indices are only loaded when necessary, while also taking advantage of optimal I/O sizes on S3/GCS, and vectorization.
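The pruning rule itself is simple; a hypothetical sketch of the chunk-level check (names and stats layout are assumptions, not the actual design):

def chunk_may_match(chunk_min, chunk_max, lower, upper):
    """Coarse-grained pruning: a chunk can be skipped only when its
    [min, max] range cannot intersect the query range."""
    return not (chunk_max < lower or chunk_min > upper)

# A filter like `width BETWEEN 100 AND 200` skips a chunk with min=300, max=500.
assert chunk_may_match(300, 500, 100, 200) is False
assert chunk_may_match(150, 400, 100, 200) is True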

Examples of lance files for verifying different implementations

Since we may implement Lance parsers/generators in different languages (Rust/Scala/etc.), a collection of Lance files would be useful for testing the correctness and compatibility of the different implementations.

s3://eto-public/datasets/oxford_pet/pet.lance is 1 GiB, which is too big for this purpose.

Benchmark Pet / COCO raw dataset performance

Problem Statement

We could keep a benchmark suite to test analytics and point query performance over raw ML datasets (e.g., COCO and Oxford Pet), and use this as a baseline to test Lance performance against.

Conceptually, the benchmarks should answer queries like the following from raw JSON / XML metadata files:

# Label distribution
SELECT label, count(1) FROM dataset GROUP BY 1

# Point query
SELECT image, label, id FROM dataset WHERE label in ("dog", "cat") LIMIT 20 OFFSET 50;

Desired Behavior

Implement a benchmark that establishes the baseline of raw ML dataset performance and answers the desired queries.

scanner api should take string filter expressions

automatically convert str to pyarrow._compute.Expression (pyarrow should have APIs to do this already)

Also, columns is typed Optional[str] but actually requires Iterable[str] currently. It should be able to handle None, a single str, and a list-like of str, as in the sketch below.
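A minimal sketch of the argument normalization, assuming a hypothetical helper on the Python side (the string-to-Expression conversion is out of scope here):

from typing import Iterable, Optional, Union

def normalize_columns(columns: Union[None, str, Iterable[str]]) -> Optional[list]:
    """Hypothetical helper: accept None, a single column name, or any
    iterable of names, and return what the scanner currently expects."""
    if columns is None:
        return None
    if isinstance(columns, str):
        return [columns]
    return list(columns)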

Support dictionary encoding

Problem

Dictionary encoding can help save storage space as well as accelerate filtering/scans for columns with small cardinality. In Arrow there is a Dictionary type, and Pandas has a Categorical type. Parquet has dictionary (plain and RLE) encoding as well.

Desired behavior.

  • Lance supports dictionary encoding. P0: plain dictionary, P1: RLE dictionary.
  • Design the physical layout.
  • Understand the implementation of DictionaryArray in Arrow: is the dictionary at the array level or the table level?

The desired result: we can write a pyarrow DictionaryArray via the Python API.
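For reference, a dictionary-encoded array can be produced in PyArrow as follows; in Arrow the dictionary is attached to each array (chunk), not to the table:

import pyarrow as pa

labels = pa.array(["cat", "dog", "cat", "cat"]).dictionary_encode()
print(labels.type)        # dictionary<values=string, indices=int32, ordered=0>
print(labels.dictionary)  # ["cat", "dog"]
print(labels.indices)     # [0, 1, 0, 0]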

Lance filtering not working

Use case: build a scanner that filters on annotations.name using lance.scanner.
We cannot do that directly because of #60.
So as a workaround, I'm filtering for the image_ids using unnest, then setting up a scanner with the filtered ids.

However, the filter does not work on Lance but does for parquet.

Repro:

Works in parquet:

import pyarrow as pa
import pyarrow.compute as pc
from pyarrow.dataset import dataset

ids = [391895, 522418, 184613, 318219, 554625, 574769, 60623, 309022, 5802, 222564]
ds = dataset('s3://eto-public/datasets/coco/coco_links.parquet')
tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

Fails with Lance:

import lance
import pyarrow.compute as pc
ids = [391895, 522418, 184613, 318219, 554625, 574769, 60623, 309022, 5802, 222564]
ds = lance.dataset('s3://eto-public/datasets/coco/coco_links.lance')
tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

With error message:

---------------------------------------------------------------------------
ArrowIndexError                           Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/_dataset.pyx:331, in pyarrow._dataset.Dataset.to_table()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/_dataset.pyx:2577, in pyarrow._dataset.Scanner.to_table()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:127, in pyarrow.lib.check_status()

ArrowIndexError: Index 9 out of bounds

Duckdb issues

Keep track of a list of DuckDB issues affecting Sophon's workload. We can later decide whether we want to report these issues/feature requests upstream.

  • DuckDB does not push down LIMIT / OFFSET via Apache Arrow. Apache Arrow does not support LIMIT / OFFSET in Scanner in general.
    • We might need to tune the batch size to simulate the "limit" request.
  • Does not support projection over list<struct>
>>> duckdb.query("SELECT annotations.label FROM coco")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Binder Error: Cannot extract field 'label' from expression "annotations" because it is not a struct

Fix CMake Warnings for not finding c-ares

Problem Statement

We have the following warnings on Ubuntu during cmake configuration.

CMake Warning at /usr/lib/x86_64-linux-gnu/cmake/arrow/Findc-aresAlt.cmake:25 (find_package):
  By not providing "Findc-ares.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "c-ares", but
  CMake did not find one.

  Could not find a package configuration file provided by "c-ares" with any
  of the following names:

    c-aresConfig.cmake
    c-ares-config.cmake

  Add the installation prefix of "c-ares" to CMAKE_PREFIX_PATH or set
  "c-ares_DIR" to a directory containing one of the above files.  If "c-ares"
  provides a separate development package or SDK, be sure it has been
  installed.
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/ArrowConfig.cmake:97 (find_dependency)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:223 (find_package)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:344 (arrow_find_package_cmake_package_configuration)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:389 (arrow_find_package)
  CMakeLists.txt:61 (find_package)

Desired Behavior

No such warnings during cmake configuration.

segfault querying coco data

In [1]: import lance

In [2]: import duckdb

In [3]: uri_lance = "s3://eto-public/datasets/coco/coco.lance"

In [4]: dataset_lance = lance.dataset(uri_lance)

In [5]: duckdb.query("SELECT id FROM dataset_lance")
Out[5]: 
---------------------
--- Relation Tree ---
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- id (BIGINT)

---------------------
-- Result Preview  --
---------------------
id	
BIGINT	
[ Rows: 10]
31625	
257943	
153631	
461884	
376449	
534373	
486864	
94795	
193474	
482362	



In [6]: [1]    996655 segmentation fault (core dumped)  ipython

Improve column scan performance

Problem Statement

On commit fbc3645, running on EC2 (m6i.8xlarge)

import lance
import time
import pyarrow.dataset

ds = lance.dataset("s3://.../pet.lance")
parq_ds = pyarrow.dataset.dataset("s3://.../pet.parquet")

start = time.time()
parq_ds.scanner(columns=["label"]).to_table()
print(f"Parquet: Loading label column: {time.time() - start}")

start = time.time()
parq_ds.scanner(columns=["width"]).to_table()
print(f"Parquet: Loading width column: {time.time() - start}")

start = time.time()
lance.scanner(ds, columns=["label"]).to_table()
print(f"Lance: Loading label column: {time.time() - start}")

start = time.time()
lance.scanner(ds, columns=["width"]).to_table()
print(f"Lance: Loading width column: {time.time() - start}")
Parquet: Loading label column: 0.0885462760925293
Parquet: Loading width column: 0.05171823501586914
Lance: Loading label column: 0.3289780616760254
Lance: Loading width column: 0.262282133102417

Using parquet-utils to inspect the Parquet schema:

############ Column(width) ############
name: width
path: width
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 31%)

############ Column(label) ############
name: label
path: label
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 6%)

Desired Behavior

Given that the storage space of these two columns is similar between lance/parquet formats, there should be some low-hanging fruit to improve lance scan performance.

Reading COCO segmentation failed due to nulls

Problem Statement

Reading the COCO dataset with segmentation failed with the error message:

row_id=12472 Mismatching child array lengths

Adding more debug info into FileReader::GetListArray() and FileReader::GetStructArray(), I got the following messages:

Failed to build struct array: field=segmentation children=["height", "polygon", "rle", "type", "width"] batch=17: Mismatching child array lengths
Arrays: [[
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  ...
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478
], <Invalid array: Length spanned by list offsets (23) larger than values array (length 22)>, [
  [],
...

Be more graceful when handling unknown data types

Currently, if you try to write an Arrow extension dtype using lance.write_table, it segfaults:

./cpp/src/arrow/result.cc:28: ValueOrDie called on an error: NotImplemented: FromLogicalType: logical_type "extension<my_package.uuid<UuidType>>" is not supported yet
/lib/x86_64-linux-gnu/libarrow.so.800(+0x156f1df)[0x7f433acac1df]
/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow4util8ArrowLogD1Ev+0xe1)[0x7f433a5a40f1]
/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x12a)[0x7f433a694b8a]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZNO5arrow6ResultISt10shared_ptrINS_8DataTypeEEE10ValueOrDieEv+0x4b)[0x7f431290cf05]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZNK5lance6format5Field4typeEv+0x369)[0x7f43129056a5]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance2io10FileWriter10WriteArrayERKSt10shared_ptrINS_6format5FieldEERKS2_IN5arrow5ArrayEE+0x63)[0x7f4312950bab]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance2io10FileWriter5WriteERKSt10shared_ptrIN5arrow11RecordBatchEE+0x137)[0x7f43129509c5]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance5arrow10WriteTableERKN5arrow5TableESt10shared_ptrINS1_2io12OutputStreamEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8optionalINS0_16FileWriteOptionsEE+0x2d8)[0x7f43128b842f]


projection design

When we project annotations.label, for example, the output schema is still a list of structs, so we still face the same query syntax issues as before.

Would it make sense for the output schema to become just a list of strings instead?

Expected Behavior

For a schema of list of struct, annotations.label returns list<string> instead of list<struct<label: string>>

Improve select by indices performance on list array

Problem Statement

The current implementation has FileReader read the full column range between indices[0] and indices[-1] into memory first, and then applies the Arrow compute "take" function to filter the data.

Desired Behavior

We can push the selection further down to reduce the amount of data read from S3 / disk, instead of materializing the whole range as illustrated below.
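A rough illustration of the current strategy, using an in-memory array as a stand-in for the column range read from S3/disk (the numbers are made up):

import pyarrow as pa
import pyarrow.compute as pc

column = pa.array(range(10_000))        # stand-in for the on-disk column
indices = pa.array([3, 250, 9_000])

# Today: read everything between indices[0] and indices[-1], then "take".
full_range = column.slice(3, 9_000 - 3 + 1)
local = pc.subtract(indices, indices[0])   # rebase indices to the slice
selected = pc.take(full_range, local)
print(selected.to_pylist())  # [3, 250, 9000]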

Change Var-length Binary Encoding to store the offset first then store the values.

Problem Statement

Currently, we are storing the values before offsets:

| value1 | value2 | value3 | value4 | offset0 | offset1 | offset2 | offset3 | offset4 |

This makes decoding a batch of string values require two separate reads, one of them for the offset array. Moving the offsets in front of the values can potentially allow a speculative read to fetch both offsets and values in one I/O.

We can also evaluate other alternatives to reduce the number of I/Os to scan a Var-length binary encoded page.

Related to #67

Expose filter APIs via Python

Problem Statement

The Lance format itself supports nested column projection pushdown, filter/predicate pushdown, as well as LIMIT / OFFSET pushdown (#45). However, Apache Arrow ScanOptions and the DuckDB PyArrow integration do not fully support these optimizations yet (#46).

We could expose these filters via our Python and C++ APIs first to make them usable.

Desired Behavior

Enrich Python / C++ API support for nested column projection pushdown, filter/predicate pushdown, as well as LIMIT / OFFSET pushdown.

Proposed python API change:

# lance/__init__.py
def dataset(
   uri: str,
   columns: Optional[list[str]] = None,
   filters: Optional[pyarrow.Expression] = None,
   limit: Optional[int] = None,
   offset: int = 0
) -> pyarrow.dataset.Dataset:
    ...

Support Image, 2D bounding box and segmentation as Arrow extension types

Problem Statement

To support rich semantic types, we can utilize Arrow's extension types. It will help us reach feature parity with the Rikai storage format.

Questions to answer:

  • Do we want to only implement the data type in Lance?
  • How should we consolidate with Rikai's type system?
  • How to make the type system extensible?
  • Or should we keep the extension types out of the Lance data format, so that Lance just accepts extension types?

Desired Behavior

  • Support Image type
  • Bounding Box
  • Segmentation
  • Label?
  • These objects can be serialized/deserialized transparently without manually providing a schema.

Use Apache Arrow's Executor to manage thread pool

Problem

We implemented a naive thread pool in Scanner to perform batch prefetch. Apache Arrow offers a thread executor for other data formats; we should probably just re-use their thread pool executor, making Lance play nicer with the ecosystem.

https://arrow.apache.org/docs/cpp/threading.html
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread_pool.h

Desired Behavior

Use Apache Arrow to manage threads in Lance Reader / Scanner.

Update README

Update README to prepare for public consumption.

  • Project goals and non-goals
  • Instructions for building the project
  • Simple code snippet / examples.
  • Link to docs / spec / references / talks / presentations

Prepare Python Release

Release Checklist

  • Python CI #4
  • Manylinux 2014 on Linux x86 (python 3.8 - 3.10)
  • Intel / Apple Silicon release on Mac (target latest MacOS and default python3 version, which is 3.10)
  • NO conda
  • Verify that we can pip install into clean env with no libarrow* pre-installed
  • Verify we can import pyarrow and lance
  • Verify we can create a pyarrow dataset with lance.dataset

Reduce cython compile warnings

Problem Statement

When compiling Cython via python setup.py build, there are warnings like:

lance/_lib.cpp:4810:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_tensor’ defined but not used [-Wunused-variable]
 4810 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_tensor)(std::shared_ptr< arrow::Tensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4809:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix’ defined but not used [-Wunused-variable]
 4809 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix)(std::shared_ptr< arrow::SparseCSRMatrix>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4808:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor’ defined but not used [-Wunused-variable]
 4808 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor)(std::shared_ptr< arrow::SparseCSFTensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4807:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix’ defined but not used [-Wunused-variable]
 4807 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix)(std::shared_ptr< arrow::SparseCSCMatrix>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4806:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor’ defined but not used [-Wunused-variable]
 4806 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor)(std::shared_ptr< arrow::SparseCOOTensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4805:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_chunked_array’ defined but not used [-Wunused-variable]
 4805 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_chunked_array)(std::shared_ptr< arrow::ChunkedArray>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4804:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_array’ defined but not used [-Wunused-variable]
 4804 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_array)(std::shared_ptr< arrow::Array>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4803:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_scalar’ defined but not used [-Wunused-variable]
 4803 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_scalar)(std::shared_ptr< arrow::Scalar>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4802:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_schema’ defined but not used [-Wunused-variable]
 4802 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_schema)(std::shared_ptr< arrow::Schema>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4801:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_field’ defined but not used [-Wunused-variable]
 4801 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_field)(std::shared_ptr< arrow::Field>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4800:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type’ defined but not used [-Wunused-variable]
 4800 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type)(std::shared_ptr< arrow::DataType>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4799:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer’ defined but not used [-Wunused-variable]
 4799 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer)(std::shared_ptr< arrow::ResizableBuffer>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4798:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_buffer’ defined but not used [-Wunused-variable]
 4798 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_buffer)(std::shared_ptr< arrow::Buffer>  const &); /*proto*/

It is probably because we did cimport *, which imports unused structures.

from pyarrow.includes.common cimport *

Desired Behavior

We should expect no such unused-variable warnings during python build.

Segment fault using lance.Scanner(offset=N)

Problem Statement

The following code causes Python to crash, on commit 3264363:

start = time.time()
lance.scanner(ds, columns=["image", "label"], limit=20).to_table()
print(f"Executing: Lance {time.time() - start}")

Fixed size binary encoding

Problem Statement

Fixed-size binary encoding is very useful when the users know that each blob has the same size, e.g., storing prepared tensors. It allows fast point queries without lookups on the offsets.

Additionally, since the data is aligned on disk/cloud storage, it allows the "dataset loader" to read a batch of tensors (either in BCHW or BCWH form) without further transposing them in memory.
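A minimal illustration of the data this encoding targets, using Arrow's fixed-size binary type: every value has the same byte width, so row i starts at byte offset i * 16 and no offsets array is needed.

import pyarrow as pa

tensors = pa.array([b"\x00" * 16, b"\x01" * 16], type=pa.binary(16))
print(tensors.type)             # fixed_size_binary[16]
print(tensors.type.byte_width)  # 16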
