lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

Home Page: https://lancedb.github.io/lance/

License: Apache License 2.0

Python 12.96% Dockerfile 0.01% Rust 74.70% CMake 0.05% Makefile 0.03% C 0.13% C++ 0.03% Shell 0.05% Jupyter Notebook 11.27% Java 0.77%
machine-learning computer-vision data-format deep-learning python apache-arrow duckdb mlops data-analysis data-analytics

lance's Introduction

LanceDB Logo

Developer-friendly database for multimodal AI


LanceDB Multimodal Search


LanceDB is an open-source database for vector search built with persistent storage, which greatly simplifies retrieval, filtering, and management of embeddings.

The key features of LanceDB include:

  • Production-scale vector search with no servers to manage.

  • Store, query and filter vectors, metadata and multi-modal data (text, images, videos, point clouds, and more).

  • Support for vector similarity search, full-text search and SQL.

  • Native Python and JavaScript/TypeScript support.

  • Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.

  • GPU support for building vector indexes(*).

  • Ecosystem integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache-Arrow, Pandas, Polars, DuckDB and more on the way.

LanceDB's core is written in Rust 🦀 and is built using Lance, an open-source columnar format designed for performant ML workloads.

Quick Start

Javascript

npm install vectordb

const lancedb = require('vectordb');
const db = await lancedb.connect('data/sample-lancedb');

const table = await db.createTable({
  name: 'vectors',
  data:  [
    { id: 1, vector: [0.1, 0.2], item: "foo", price: 10 },
    { id: 2, vector: [1.1, 1.2], item: "bar", price: 50 }
  ]
})

const query = table.search([0.1, 0.3]).limit(2);
const results = await query.execute();

// You can also search for rows by specific criteria without involving a vector search.
const rowsByCriteria = await table.search(undefined).where("price >= 10").execute();

Python

pip install lancedb

import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table",
                         data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                               {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
result = table.search([100, 100]).limit(2).to_pandas()

Blogs, Tutorials & Videos

lance's People

Contributors

albertlockett, ananis25, bubblecal, changhiskhan, chebbychefneq, da-tubi, dnsco, eddyxu, frankliee, gsilvestrin, haoxins, hzhang86, jaichopra, jerryyifei, lillianye, liweijie, luqqiu, renkai, rok, tanaymeh, tevinwang, trueutkarsh, universalmind303, unkn-wn, wangfenjin, weijun-h, westonpace, wjones127, yah01, yliang412


lance's Issues

Support and enforce primary key

Problem Statement

We assume that each Lance dataset has a primary key, to guarantee fast point queries later.
Currently, the WriteTable API takes a mandatory primary key parameter; however, we do not enforce uniqueness or sorting by the primary key at the moment.

Desired Behavior

WriteTable should enforce (see the validation sketch after this list):

  • A primary key is provided
  • The column of primary key exists
  • The uniqueness of the primary key in the column.
  • Optionally, sort the dataset by the primary key?
  • Open question: in the first version, do we want to limit the data types supported as a primary key?
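A minimal sketch of what these checks could look like on the Python side, assuming a PyArrow Table as input; the validate_primary_key helper is hypothetical and not part of the current API:

import pyarrow as pa
import pyarrow.compute as pc

def validate_primary_key(table: pa.Table, primary_key: str) -> None:
    """Hypothetical pre-write validation mirroring the rules above."""
    if not primary_key:
        raise ValueError("a primary key must be provided")
    if primary_key not in table.column_names:
        raise ValueError(f"primary key column '{primary_key}' does not exist")
    column = table.column(primary_key)
    if column.null_count > 0:
        raise ValueError("primary key values must not be null")
    # Uniqueness: the number of distinct values must equal the row count.
    if len(pc.unique(column)) != table.num_rows:
        raise ValueError("primary key values must be unique")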

parallel reads seems to be turned off in lance?

I ran the label distribution query against COCO for Parquet and Lance. For Parquet, the total CPU time was greater than the wall time, but for Lance that was not the case. I vaguely remember discussing this before; not sure if it's already a known issue.

Support offset and limit pushdown

Problem Statement

A major use case for the Lance format is holding ML datasets that are easy to inspect. It should answer queries like SELECT * FROM dataset WHERE ... LIMIT 20 quickly. Combined with the fact that Lance is designed for embedded assets, such as images, it is desirable to reduce the amount of data scanned.

Desired Behavior

Support offset and limit pushdown, to only scan large columns if necessary.
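As an illustration of the intended call shape, reusing the scanner arguments that appear elsewhere in these issues (the dataset path is hypothetical):

import lance

ds = lance.dataset("s3://bucket/coco.lance")
# With limit/offset pushdown, only ~20 rows of the large image column
# should be read from storage, not the entire column.
tbl = lance.scanner(ds, columns=["image", "annotations"], limit=20, offset=50).to_table()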

Protobuf build error on MacOS

Problem Statement

make failed on macOS after protobuf was upgraded by Homebrew:

[ 82%] Built target io
[ 83%] Linking CXX shared library liblance.dylib
Undefined symbols for architecture arm64:
  "google::protobuf::internal::InternalMetadata::~InternalMetadata()", referenced from:
      google::protobuf::MessageLite::~MessageLite() in format.pb.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [liblance.dylib] Error 1
make[1]: *** [CMakeFiles/lance.dir/all] Error 2
make: *** [all] Error 2

OS: macOS 12.4
Protobuf: 21.2 (Homebrew)
Arrow: 8.0.0_4 (Homebrew)

Pyarrow Dataset Scanner has no public Cython definition

Problem Statement

#61 attempts to create a new Arrow Scanner object to support nested column parsing, as well as LIMIT / OFFSET pushdown. However, pyarrow._dataset.Scanner does not have a public definition in _dataset.pxd. This makes it difficult to wrap the C++ CScanner object into a pyarrow.dataset.Scanner Python object.

For example, we can not call https://github.com/apache/arrow/blob/02c8598d264c839a5b5cf3109bfd406f3b8a6ba5/python/pyarrow/_dataset.pyx#L2361-L2362

Desired Behavior

Short term, we'd like to be able to create something that DuckDB can consume correctly from Python.
Long term, we either need to be able to create the Scanner on our own, or fix Arrow / DuckDB to support nested column parsing.

Improve build and release process for the python package

Problem Statement

Currently, we build the Python package using a manually compiled pyarrow. The reasons were that:

  • We are using the official C++ packages via apt-get to install libarrow-*.dev. However, it is not compatible with the libarrow.so installed by pip install pyarrow.
  • The PyPI pyarrow was built via manylinux2014(?), but we are building Lance on Ubuntu 22.04 with C++20.

Desired behavior

Figure out the correct practice to build and release python package.

Apache Arrow does not support FieldRef to list of structs

Problem Statement

Apache Arrow does not support field reference to a list<struct>

import duckdb
import lance

ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])

Error:

Traceback (most recent call last):
  File "/Users/lei/work/lance/./query.py", line 6, in <module>
    ds = lance.dataset("./coco.lance").scanner(columns=["id", "annotations.label"])
  File "pyarrow/_dataset.pyx", line 271, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2328, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2174, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(annotations.label) in id: int64
width: int64
height: int64
file_name: string
image: struct<data: binary>
annotations: list<item: struct<area: double, box: struct<xmax: double, xmin: double, ymax: double, ymin: double>, label: string, label_id: int64, segmentation: struct<height: int64, polygon: list<item: list<item: double>>, rle: list<item: int64>, type: int64, width: int64>, supercategory: string>>
__index_level_0__: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Expected Behavior

Using annotations.label should return values with type list<struct<label: string>>, a subset view of the original annotations list<struct>.

Establish performance baseline

Problem Statement

We should maintain a set of benchmarks to establish a performance baseline, observe regressions, and drive performance optimization.

Desired behavior

We have a set of representative benchmarks that we can run between releases to monitor performance improvement/degradation.

Release intel and apple silicon wheels from Github Action

Problem Statement

Let's figure out how to build x86_64 and arm64 targets from GitHub Actions mac runners, so that we can release directly from GH Actions.

  • Install / Build arrow / protobuf for both x86_64 and arm64
  • Build lance cpp SDK for both x86_64 and arm64.
  • Build python wheel via cibuildwheel for both x86_64 and arm64

Support simple predicates push down

Problem Statement

Assuming a dataset with schema primary_key: string, label: dict[string], image: Image, if we run the query:

SELECT image FROM dataset WHERE label = "cat" LIMIT 20

we should expect that Lance does not actually scan all image data. Instead, it should use the filter (label = 'cat') to selectively read the image column.

Desired Behavior

Use predicate pushdown to reduce the scan traffic over a large column (i.e., Tensor / Image).
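A sketch of the intended usage, assuming a hypothetical dataset path and the to_table(filter=...) shape used elsewhere in these issues:

import pyarrow.compute as pc
import lance

ds = lance.dataset("s3://bucket/pets.lance")
# Ideally the small `label` column is evaluated first, and only the matching
# rows of the large `image` column are ever fetched from storage.
tbl = ds.to_table(columns=["image"], filter=pc.field("label") == "cat")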

Null support in all encodings

Problem Statement

Apache Arrow supports arrays with nulls. We should be able to store them efficiently on disk as well.

  • PlainEncoding
  • VarBinaryEncoding

Dictionary encoding uses plain encoding for its index, so Null support can be done via Plain encoding.
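A minimal illustration of the inputs in question, i.e. Arrow arrays containing nulls that the two encoders would need to round-trip:

import pyarrow as pa

ints = pa.array([1, None, 3], type=pa.int32())          # PlainEncoding input
strings = pa.array(["a", None, "c"], type=pa.string())  # VarBinaryEncoding input
print(ints.null_count, strings.null_count)  # 1 1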

Enable Prefetching when Limit / Offset clause is included.

Problem Statement

For correctness and simplicity, we disabled prefetch and parallel scan when a LIMIT / OFFSET clause is provided (#45), to make counting the number of rows read easier. We expect that to serve a web app, usually only LIMIT 20-100 is applied, for pagination purposes. However, there are more opportunities to improve the I/O process and reduce latency.

For example, we can still allow sequential prefetch threads to run in the background without sacrificing correctness. Furthermore, we can use hedged reads to improve parallelism as well, though that needs fine-tuned scheduling.

Provide WriteTable API in Python.

Problem

Currently, the Python package can read a Lance file. We need to add write support in Python as well.

The WriteTable functionality has already been implemented in C++ side.

Desired behavior

Implement a pylance.write_table(...) API that wraps the C++ WriteTable API, takes in a PyArrow Table, and writes it to disk.
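A sketch of the intended usage; this mirrors the call shape appearing in later issues, and the final signature may differ:

import pyarrow as pa
import lance

table = pa.table({"id": [1, 2, 3], "label": ["cat", "dog", "cat"]})
# Proposed: wrap the existing C++ WriteTable and write the table to disk.
lance.write_table(table, "/tmp/example.lance")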

Cannot write datetime column

In [7]: import pandas as pd

In [8]: import pyarrow as pa

In [9]: import lance

In [10]: df = pd.DataFrame({'date': pd.date_range('2022-01-01', periods=100, freq='D')})

In [11]: df.dtypes
Out[11]: 
date    datetime64[ns]
dtype: object

In [12]: table = pa.Table.from_pandas(df)

In [13]: table.schema
Out[13]: 
date: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 377

In [14]: lance.write_table(table, '/tmp/test.lance')
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 lance.write_table(table, '/tmp/test.lance')

File ~/code/eto/lance/python/lance/__init__.py:63, in write_table(table, destination)
     53 def write_table(table: pa.Table, destination: Union[str, Path]):
     54     """Write an Arrow Table into the destination.
     55 
     56     Parameters
   (...)
     61         The destination to write dataset to.
     62     """
---> 63     WriteTable(table, destination)

File ~/code/eto/lance/python/lance/_lib.pyx:117, in lance.lib.WriteTable()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: PlainEncoder:: does not support data type timestamp[ns]
> /home/chang/code/eto/lance/python/scripts/pyarrow/error.pxi(100)pyarrow.lib.check_status()

ipdb> 

add project description to pypi

basically just do something like this:

from setuptools import setup

# read the contents of your README file
from pathlib import Path
this_directory = Path(__file__).parent
long_description = (this_directory / "README.md").read_text()

setup(
    name='an_example_package',
    # other arguments omitted
    long_description=long_description,
    long_description_content_type='text/markdown'
)

Verify initial installation scenarios

  • Support python 3.8-3.10
  • Support mac x86, mac arm64, manylinux
  • pip install pylance into a clean venv (not conda) on a system that hasn't installed libarrow* separately
  • Verify that we can import lance and pyarrow
  • Verify that we can call lance.dataset and lance.scan

Put this in a GH action so it can be re-run before every release and also at least once a day on a schedule

Compute min/max values for columns to support file-level and chunk-level pruning.

Problem Statement

Using min/max values to skip / prune chunks is a common practice in columnar storage. Let's add it as well to support coarse-grained filtering.

Open questions:

  • Should we use different bit widths for different data types (e.g., 8 bits for int8/uint8, 64 bits for int64/uint64/double), or just one implementation for all?
  • How to support string values efficiently. Do we only want to support min/max for dictionary string values or string values in general?

Desired behavior

Compute min/max values and store them in a fashion that does not hurt either full scans or point queries. Ideally, such indices are only loaded when necessary, while also taking advantage of optimal I/O sizes on S3/GCS, and vectorization.
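The pruning rule itself is simple; a hypothetical sketch of the chunk-level check (names and stats layout are assumptions, not the actual design):

def chunk_may_match(chunk_min, chunk_max, lower, upper):
    """Coarse-grained pruning: a chunk can be skipped only when its
    [min, max] range cannot intersect the query range."""
    return not (chunk_max < lower or chunk_min > upper)

# A filter like `width BETWEEN 100 AND 200` skips a chunk with min=300, max=500.
assert chunk_may_match(300, 500, 100, 200) is False
assert chunk_may_match(150, 400, 100, 200) is True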

Examples of lance files for verifying different implementations

Since we may implement Lance parsers/generators in different languages (Rust/Scala/etc.), a collection of Lance files would be useful for testing the correctness and compatibility of the different implementations.

s3://eto-public/datasets/oxford_pet/pet.lance is 1 GiB, which is too big for this purpose.

Benchmark Pet / COCO raw dataset performance

Problem Statement

We could keep a benchmark suite to test analytics and point query performance over raw ML datasets (e.g., COCO and Oxford Pet), and use this as a baseline to test Lance performance against.

Conceptually, the benchmarks should answer queries like the following from raw JSON / XML metadata files:

# Label distribution
SELECT label, count(1) FROM dataset GROUP BY 1

# Point query
SELECT image, label, id FROM dataset WHERE label in ("dog", "cat") LIMIT 20 OFFSET 50;

Desired Behavior

Implement a benchmark that establishes the baseline of raw ML dataset performance and answers the desired queries.

scanner api should take string filter expressions

automatically convert str to pyarrow._compute.Expression (pyarrow should have APIs to do this already)

Also, columns is typed Optional[str] but actually requires Iterable[str] currently. It should be able to handle None, a single str, and a list-like of str, as in the sketch below.
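A minimal sketch of the argument normalization, assuming a hypothetical helper on the Python side (the string-to-Expression conversion is out of scope here):

from typing import Iterable, Optional, Union

def normalize_columns(columns: Union[None, str, Iterable[str]]) -> Optional[list]:
    """Hypothetical helper: accept None, a single column name, or any
    iterable of names, and return what the scanner currently expects."""
    if columns is None:
        return None
    if isinstance(columns, str):
        return [columns]
    return list(columns)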

Support dictionary encoding

Problem

Dictionary encoding can help save storage space as well as accelerate filtering/scans for columns with small cardinality. In Arrow there is a Dictionary type, and Pandas has a Categorical type. Parquet has dictionary (plain and RLE) encoding as well.

Desired behavior.

  • Lance supports dictionary encoding. P0: plain dictionary, P1: RLE dictionary.
  • Design the physical layout.
  • Understand the implementation of DictionaryArray in Arrow: is the dictionary at the array level or the table level?

The desired result: we can write a pyarrow DictionaryArray via the Python API.
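For reference, a dictionary-encoded array can be produced in PyArrow as follows; in Arrow the dictionary is attached to each array (chunk), not to the table:

import pyarrow as pa

labels = pa.array(["cat", "dog", "cat", "cat"]).dictionary_encode()
print(labels.type)        # dictionary<values=string, indices=int32, ordered=0>
print(labels.dictionary)  # ["cat", "dog"]
print(labels.indices)     # [0, 1, 0, 0]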

Lance filtering not working

Use case: build a scanner that filters on annotations.name using lance.scanner.
We cannot do that directly because of #60.
So as a workaround, I'm filtering for the image_ids using unnest, then setting up a scanner with the filtered ids.

However, the filter does not work on Lance but does for parquet.

Repro:

Works in parquet:

import pyarrow as pa
import pyarrow.compute as pc
from pyarrow.dataset import dataset

ids = [391895, 522418, 184613, 318219, 554625, 574769, 60623, 309022, 5802, 222564]
ds = dataset('s3://eto-public/datasets/coco/coco_links.parquet')
tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

Fails with Lance:

import lance
import pyarrow.compute as pc
ids = [391895, 522418, 184613, 318219, 554625, 574769, 60623, 309022, 5802, 222564]
ds = lance.dataset('s3://eto-public/datasets/coco/coco_links.lance')
tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

With error message:

---------------------------------------------------------------------------
ArrowIndexError                           Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 tbl = ds.to_table(filter=pc.field('image_id').isin(ids))

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/_dataset.pyx:331, in pyarrow._dataset.Dataset.to_table()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/_dataset.pyx:2577, in pyarrow._dataset.Scanner.to_table()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/code/eto/lance/python/thirdparty/arrow/python/pyarrow/error.pxi:127, in pyarrow.lib.check_status()

ArrowIndexError: Index 9 out of bounds

Duckdb issues

Keep track of a list of DuckDB issues affecting Sophon's workload. We can later decide whether we want to report these issues/feature requests upstream.

  • DuckDB does not push down LIMIT / OFFSET via Apache Arrow. Apache Arrow does not support LIMIT / OFFSET in Scanner in general.
    • We might need to tune the batch size to simulate the "limit" request.
  • Does not support projection over list<struct>
>>> duckdb.query("SELECT annotations.label FROM coco")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Binder Error: Cannot extract field 'label' from expression "annotations" because it is not a struct

Fix CMake Warnings for not finding c-ares

Problem Statement

We have the following warnings on Ubuntu during cmake configuration.

CMake Warning at /usr/lib/x86_64-linux-gnu/cmake/arrow/Findc-aresAlt.cmake:25 (find_package):
  By not providing "Findc-ares.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "c-ares", but
  CMake did not find one.

  Could not find a package configuration file provided by "c-ares" with any
  of the following names:

    c-aresConfig.cmake
    c-ares-config.cmake

  Add the installation prefix of "c-ares" to CMAKE_PREFIX_PATH or set
  "c-ares_DIR" to a directory containing one of the above files.  If "c-ares"
  provides a separate development package or SDK, be sure it has been
  installed.
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/ArrowConfig.cmake:97 (find_dependency)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:223 (find_package)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:344 (arrow_find_package_cmake_package_configuration)
  /usr/lib/x86_64-linux-gnu/cmake/arrow/FindArrow.cmake:389 (arrow_find_package)
  CMakeLists.txt:61 (find_package)

Desired Behavior

No such warnings during cmake configuration.

segfault querying coco data

In [1]: import lance

In [2]: import duckdb

In [3]: uri_lance = "s3://eto-public/datasets/coco/coco.lance"

In [4]: dataset_lance = lance.dataset(uri_lance)

In [5]: duckdb.query("SELECT id FROM dataset_lance")
Out[5]: 
---------------------
--- Relation Tree ---
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- id (BIGINT)

---------------------
-- Result Preview  --
---------------------
id	
BIGINT	
[ Rows: 10]
31625	
257943	
153631	
461884	
376449	
534373	
486864	
94795	
193474	
482362	



In [6]: [1]    996655 segmentation fault (core dumped)  ipython

Improve column scan performance

Problem Statement

On commit fbc3645, running on EC2 (m6i.8xlarge)

import lance
import time
import pyarrow.dataset

ds = lance.dataset("s3://.../pet.lance")
parq_ds = pyarrow.dataset.dataset("s3://.../pet.parquet")

start = time.time()
parq_ds.scanner(columns=["label"]).to_table()
print(f"Parquet: Loading label column: {time.time() - start}")

start = time.time()
parq_ds.scanner(columns=["width"]).to_table()
print(f"Parquet: Loading width column: {time.time() - start}")

start = time.time()
lance.scanner(ds, columns=["label"]).to_table()
print(f"Lance: Loading label column: {time.time() - start}")

start = time.time()
lance.scanner(ds, columns=["width"]).to_table()
print(f"Lance: Loading width column: {time.time() - start}")
Parquet: Loading label column: 0.0885462760925293
Parquet: Loading width column: 0.05171823501586914
Lance: Loading label column: 0.3289780616760254
Lance: Loading width column: 0.262282133102417

Using parquet-utils to inspect the Parquet schema:

############ Column(width) ############
name: width
path: width
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 31%)

############ Column(label) ############
name: label
path: label
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 6%)

Desired Behavior

Given that the storage space of these two columns is similar between lance/parquet formats, there should be some low-hanging fruit to improve lance scan performance.

Reading COCO segmentation failed due to nulls

Problem Statement

Reading the COCO dataset with segmentation failed with the error message:

row_id=12472 Mismatching child array lengths

Adding more debug info into FileReader::GetListArray() and FileReader::GetStructArray(), I got the following messages:

Failed to build struct array: field=segmentation children=["height", "polygon", "rle", "type", "width"] batch=17: Mismatching child array lengths
Arrays: [[
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  ...
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478,
  478
], <Invalid array: Length spanned by list offsets (23) larger than values array (length 22)>, [
  [],
...

Be more graceful when handling unknown data types

Currently, if you try to write an Arrow extension dtype using lance.write_table, it segfaults:

./cpp/src/arrow/result.cc:28: ValueOrDie called on an error: NotImplemented: FromLogicalType: logical_type "extension<my_package.uuid<UuidType>>" is not supported yet
/lib/x86_64-linux-gnu/libarrow.so.800(+0x156f1df)[0x7f433acac1df]
/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow4util8ArrowLogD1Ev+0xe1)[0x7f433a5a40f1]
/lib/x86_64-linux-gnu/libarrow.so.800(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x12a)[0x7f433a694b8a]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZNO5arrow6ResultISt10shared_ptrINS_8DataTypeEEE10ValueOrDieEv+0x4b)[0x7f431290cf05]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZNK5lance6format5Field4typeEv+0x369)[0x7f43129056a5]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance2io10FileWriter10WriteArrayERKSt10shared_ptrINS_6format5FieldEERKS2_IN5arrow5ArrayEE+0x63)[0x7f4312950bab]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance2io10FileWriter5WriteERKSt10shared_ptrIN5arrow11RecordBatchEE+0x137)[0x7f43129509c5]
/home/chang/code/eto/lance/cpp/build/liblance.so(_ZN5lance5arrow10WriteTableERKN5arrow5TableESt10shared_ptrINS1_2io12OutputStreamEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8optionalINS0_16FileWriteOptionsEE+0x2d8)[0x7f43128b842f]


projection design

When we project annotations.label, for example, the output schema is still a list of structs, so we still face the same query syntax issues as before.

Would it make sense for the output schema to become just a list of strings instead?

Expected Behavior

For a schema of list of struct, annotations.label returns list<string> instead of list<struct<label: string>>

Improve select by indices performance on list array

Problem Statement

The current implementation has FileReader read the full column range between indices[0] and indices[-1] into memory first, and then applies the Arrow compute "take" function to filter the data.

Desired Behavior

We can push the selection further down to reduce the amount of data read from S3 / disk, instead of materializing the whole range as illustrated below.
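A rough illustration of the current strategy, using an in-memory array as a stand-in for the column range read from S3/disk (the numbers are made up):

import pyarrow as pa
import pyarrow.compute as pc

column = pa.array(range(10_000))        # stand-in for the on-disk column
indices = pa.array([3, 250, 9_000])

# Today: read everything between indices[0] and indices[-1], then "take".
full_range = column.slice(3, 9_000 - 3 + 1)
local = pc.subtract(indices, indices[0])   # rebase indices to the slice
selected = pc.take(full_range, local)
print(selected.to_pylist())  # [3, 250, 9000]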

Change Var-length Binary Encoding to store the offset first then store the values.

Problem Statement

Currently, we are storing the values before offsets:

| value1 | value2 | value3 | value4 | offset0 | offset1 | offset2 | offset3 | offset4 |

This makes decoding a batch of string values require two separate reads, one of them for the offset array. Moving the offsets in front of the values can potentially allow a speculative read to fetch both offsets and values in one I/O.

We can also evaluate other alternatives to reduce the number of I/Os to scan a Var-length binary encoded page.

Related to #67

Expose filter APIs via Python

Problem Statement

The Lance format itself supports nested column projection pushdown, filter/predicate pushdown, as well as LIMIT / OFFSET pushdown (#45). However, Apache Arrow ScanOptions and the DuckDB PyArrow integration do not fully support these optimizations yet (#46).

We could expose these filters via our Python and C++ APIs first to make them usable.

Desired Behavior

Enrich Python / C++ API support for nested column projection pushdown, filter/predicate pushdown, as well as LIMIT / OFFSET pushdown.

Proposed python API change:

# lance/__init__.py
def dataset(
   uri: str,
   columns: Optional[list[str]] = None,
   filters: Optional[pyarrow.Expression] = None,
   limit: Optional[int] = None,
   offset: int = 0
) -> pyarrow.dataset.Dataset:
    ...

Support Image, 2D bounding box and segmentation as Arrow extension types

Problem Statement

To support rich semantic types, we can utilize Arrow's extension types. It will help us reach feature parity with the Rikai storage format.

Questions to answer:

  • Do we want to only implement the data type in Lance?
  • How should we consolidate with Rikai's type system?
  • How to make the type system extensible?
  • Or should we keep the extension types out of the Lance data format, so that Lance just accepts extension types?

Desired Behavior

  • Support Image type
  • Bounding Box
  • Segmentation
  • Label?
  • These objects can be serialized/deserialized transparently without manually providing a schema.

Use Apache Arrow's Executor to manage thread pool

Problem

We implemented a naive thread pool in Scanner to perform batch prefetch. Apache Arrow offers a thread executor for other data formats; we should probably just re-use their thread pool executor, making Lance play nicer with the ecosystem.

https://arrow.apache.org/docs/cpp/threading.html
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread_pool.h

Desired Behavior

Use Apache Arrow to manage threads in Lance Reader / Scanner.

Update README

Update README to prepare for public consumption.

  • Project goals and non-goals
  • Instructions for building the project
  • Simple code snippet / examples.
  • Link to docs / spec / references / talks / presentations

Prepare Python Release

Release Checklist

  • Python CI #4
  • Manylinux 2014 on Linux x86 (python 3.8 - 3.10)
  • Intel / Apple Silicon release on Mac (target latest MacOS and default python3 version, which is 3.10)
  • NO conda
  • Verify that we can pip install into clean env with no libarrow* pre-installed
  • Verify we can import pyarrow and lance
  • Verify we can create a pyarrow dataset with lance.dataset

Reduce cython compile warnings

Problem Statement

When compiling Cython via python setup.py build, there are warnings like:

lance/_lib.cpp:4810:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_tensor’ defined but not used [-Wunused-variable]
 4810 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_tensor)(std::shared_ptr< arrow::Tensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4809:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix’ defined but not used [-Wunused-variable]
 4809 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csr_matrix)(std::shared_ptr< arrow::SparseCSRMatrix>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4808:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor’ defined but not used [-Wunused-variable]
 4808 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csf_tensor)(std::shared_ptr< arrow::SparseCSFTensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4807:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix’ defined but not used [-Wunused-variable]
 4807 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_csc_matrix)(std::shared_ptr< arrow::SparseCSCMatrix>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4806:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor’ defined but not used [-Wunused-variable]
 4806 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_sparse_coo_tensor)(std::shared_ptr< arrow::SparseCOOTensor>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4805:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_chunked_array’ defined but not used [-Wunused-variable]
 4805 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_chunked_array)(std::shared_ptr< arrow::ChunkedArray>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4804:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_array’ defined but not used [-Wunused-variable]
 4804 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_array)(std::shared_ptr< arrow::Array>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4803:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_scalar’ defined but not used [-Wunused-variable]
 4803 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_scalar)(std::shared_ptr< arrow::Scalar>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4802:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_schema’ defined but not used [-Wunused-variable]
 4802 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_schema)(std::shared_ptr< arrow::Schema>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4801:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_field’ defined but not used [-Wunused-variable]
 4801 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_field)(std::shared_ptr< arrow::Field>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4800:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type’ defined but not used [-Wunused-variable]
 4800 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type)(std::shared_ptr< arrow::DataType>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4799:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer’ defined but not used [-Wunused-variable]
 4799 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_resizable_buffer)(std::shared_ptr< arrow::ResizableBuffer>  const &); /*proto*/
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lance/_lib.cpp:4798:20: warning: ‘__pyx_f_7pyarrow_3lib_pyarrow_wrap_buffer’ defined but not used [-Wunused-variable]
 4798 | static PyObject *(*__pyx_f_7pyarrow_3lib_pyarrow_wrap_buffer)(std::shared_ptr< arrow::Buffer>  const &); /*proto*/

It is probably because we did cimport *, which imports unused structures.

from pyarrow.includes.common cimport *

Desired Behavior

We should expect no such unused-variable warnings during python build.

Segment fault using lance.Scanner(offset=N)

Problem Statement

The following code causes Python to crash, on commit 3264363:

start = time.time()
lance.scanner(ds, columns=["image", "label"], limit=20).to_table()
print(f"Executing: Lance {time.time() - start}")

Fixed size binary encoding

Problem Statement

Fixed-size binary encoding is very useful when the users know that each blob has the same size, e.g., storing prepared tensors. It allows fast point queries without lookups on the offsets.

Additionally, since the data is aligned on disk/cloud storage, it allows the "dataset loader" to read a batch of tensors (either in BCHW or BCWH form) without further transposing them in memory.
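A minimal illustration of the data this encoding targets, using Arrow's fixed-size binary type: every value has the same byte width, so row i starts at byte offset i * 16 and no offsets array is needed.

import pyarrow as pa

tensors = pa.array([b"\x00" * 16, b"\x01" * 16], type=pa.binary(16))
print(tensors.type)             # fixed_size_binary[16]
print(tensors.type.byte_width)  # 16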
