Lance: A Columnar Data Format for Computer Vision

Lance is a cloud-native columnar data format designed for managing large-scale computer vision datasets in production environments. Lance delivers blazing fast performance for image and video data use cases from analytics to point queries to training scans.

The Lance core is written in C++ and ships with Python bindings as its first language binding. With first-class Apache Arrow integration, Lance is queryable by tools like DuckDB out of the box, and existing Parquet data can be converted to Lance with a single line of code.

What problems does Lance solve?

Today, the data tooling stack for computer vision is insufficient to serve the needs of the ML engineering community.

Working with vision data for ML is different from working with tabular data:

  • Training, analytics, and labeling use different tools that require different formats
  • Data annotations are almost always deeply nested
  • Images / videos are large blobs that are difficult to query by existing engines

This results in some major pain-points:

  • Too much time spent on low level data munging
  • Multiple copies create data quality issues, even for well-known datasets
  • Reproducibility and data versioning are extremely difficult to achieve

Lance to the rescue

To solve these pain-points, we are building Lance, an open-source columnar data format optimized for computer vision with the following goals:

  • Blazing fast performance for analytical scans and random access to individual records (for visualization and annotation)
  • Rich ML data types and integrations to eliminate manual data conversions
  • Support for vector and search indices, versioning, and schema evolution

Quick Start

We provide Linux and macOS wheels for Lance on PyPI. You can install the Lance Python bindings via:

pip install pylance

Thanks to its Apache Arrow-first APIs, Lance can be used as a native Arrow extension. For example, users can analyze a Lance dataset directly in DuckDB via DuckDB's Arrow integration.

# pip install pylance duckdb
import lance
import duckdb

# Understand Label distribution of Oxford Pet Dataset
ds = lance.dataset("s3://eto-public/datasets/oxford_pet/pet.lance")
duckdb.query('select label, count(1) from ds group by label').to_arrow_table()

What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.

Encodings: to achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.
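As a rough illustration of why layout matters for point queries, the sketch below stores variable-length blobs back to back and keeps a separate offsets array, so fetching record i takes two offset lookups and one slice instead of a scan. This is a toy model and not Lance's actual encoding; the `BlobColumn` class and its methods are hypothetical names introduced here.

```python
from array import array


class BlobColumn:
    """Toy column of variable-length binary values with an offsets index.

    Hypothetical illustration only; not Lance's on-disk layout.
    """

    def __init__(self, values):
        # offsets[i] .. offsets[i + 1] delimits value i inside self.data.
        self.offsets = array("Q", [0])
        buf = bytearray()
        for value in values:
            buf += value
            self.offsets.append(len(buf))
        self.data = bytes(buf)

    def __len__(self):
        return len(self.offsets) - 1

    def get(self, i):
        """Point query: two offset lookups and one slice, no scan."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.data[self.offsets[i]:self.offsets[i + 1]]

    def scan(self):
        """Columnar scan: walk the values in storage order."""
        for i in range(len(self)):
            yield self.get(i)


# Three fake "images" of different sizes, including an empty one.
images = BlobColumn([b"\x89PNG-cat", b"", b"\x89PNG-dog-larger"])
second_to_last = images.get(len(images) - 2)
```

The same offsets array serves both access patterns: a scan reads the data buffer sequentially, while random access jumps straight to the requested slice.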

Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.
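To make the idea of one column per subfield concrete, here is a minimal pure-Python sketch that shreds nested annotation records into columns keyed by dotted paths. It assumes every record has the same shape and that lists of structs are non-empty; the function names and the sample records are made up for illustration and do not reflect Lance's internal implementation.

```python
def leaf_columns(value, path=""):
    """Yield (column_path, value) pairs for one nested value."""
    if isinstance(value, dict):
        # Struct: recurse into each field, extending the dotted path.
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from leaf_columns(child, child_path)
    elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        # List of structs: produce one list-valued entry per subfield.
        merged = {}
        for item in value:
            for col, leaf in leaf_columns(item, path):
                merged.setdefault(col, []).append(leaf)
        yield from merged.items()
    else:
        # Primitive leaf (or a plain list of primitives).
        yield path, value


def to_columns(records):
    """Turn a list of nested records into {column_path: [value per row]}."""
    columns = {}
    for record in records:
        for col, value in leaf_columns(record):
            columns.setdefault(col, []).append(value)
    return columns


# Hypothetical detection annotations for two images.
records = [
    {"image": "a.jpg", "objects": [
        {"label": "cat", "box": {"x": 1, "y": 2}},
        {"label": "dog", "box": {"x": 5, "y": 6}},
    ]},
    {"image": "b.jpg", "objects": [
        {"label": "dog", "box": {"x": 0, "y": 0}},
    ]},
]
columns = to_columns(records)

# "Find images where detected objects include cats" reads only one column.
cat_rows = [i for i, labels in enumerate(columns["objects.label"]) if "cat" in labels]
cat_images = [columns["image"][i] for i in cat_rows]
```

Because the filter only touches `objects.label`, the bounding boxes and image bytes never need to be read to answer it.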

Versioning / updates (ROADMAP): a Manifest can be used to record snapshots. Updates are supported via write-ahead logs.
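Since versioning is still on the roadmap, the following is only a conceptual sketch of how a manifest could record snapshots: each commit creates a new immutable version listing the data files visible at that point. The `Manifest` and `Snapshot` classes and the file names are hypothetical, and write-ahead logging is not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable dataset version: a version number and its data files."""
    version: int
    files: tuple


@dataclass
class Manifest:
    """Hypothetical manifest: an append-only list of snapshots."""
    snapshots: list = field(default_factory=list)

    def commit(self, new_files):
        """Record a new version that adds new_files to the previous one."""
        previous = self.snapshots[-1].files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + tuple(new_files))
        self.snapshots.append(snapshot)
        return snapshot

    def checkout(self, version=None):
        """Return the requested version, or the latest when version is None."""
        if not self.snapshots:
            raise LookupError("manifest has no snapshots")
        if version is None:
            return self.snapshots[-1]
        if not 1 <= version <= len(self.snapshots):
            raise LookupError(f"unknown version {version}")
        return self.snapshots[version - 1]


manifest = Manifest()
manifest.commit(["part-0.lance"])   # version 1
manifest.commit(["part-1.lance"])   # version 2 appends more data
```

Older versions stay readable after new commits, which is what makes a training run reproducible against a fixed snapshot.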

Secondary Indices (ROADMAP):

  • Vector index for similarity search over embedding space
  • Inverted index for fuzzy search over many label / annotation fields
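The two kinds of secondary index can be sketched in a few lines of plain Python. The inverted index below maps label tokens to row ids, using exact token matching rather than fuzzy matching to keep it short, and `nearest` is the exhaustive cosine-similarity search that a real vector index would speed up. All names and sample data are illustrative assumptions, not Lance APIs.

```python
import math
from collections import defaultdict


def build_inverted_index(rows):
    """Map each lowercase label token to the set of row ids containing it."""
    index = defaultdict(set)
    for row_id, labels in enumerate(rows):
        for label in labels:
            for token in label.lower().split():
                index[token].add(row_id)
    return index


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query, embeddings, k=1):
    """Exhaustive top-k search; a vector index would avoid the full scan."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(query, embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-image labels and 2-d embeddings.
labels = [["Maine Coon cat"], ["beagle dog"], ["Persian cat", "toy"]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

index = build_inverted_index(labels)
cat_rows = sorted(index["cat"])
similar_rows = nearest([1.0, 0.05], embeddings, k=2)
```

Both structures answer their query without touching the image data itself, which is the point of keeping them as secondary indices next to the columns.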

Benchmarks

We created a Lance dataset from the Oxford Pet dataset to do some preliminary performance testing of Lance against Parquet and the raw image / XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and raw files.

Why are you building yet another data format?!

The machine learning development cycle involves the following steps:

graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineering];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;

People use different data representations at different stages, either for performance or because of the limits of the available tooling. Academia mainly uses XML / JSON for annotations and zipped images / sensor data for deep learning, which are difficult to integrate into data infrastructure and slow to train on over cloud storage. Industry uses data lakes (Parquet-based technologies such as Delta Lake and Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but then has to convert the data into training-friendly formats such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice among ML practitioners.

While each of the existing data formats excels at the workload it was originally designed for, we need a new data format tailored to the multistage ML development cycle to reduce the fragmentation of tools and data silos.

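A comparison of different data formats in each stage of the ML development cycle.

|                     | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |
|---------------------|-------|---------------|------------|----------|----------|-----------|
| Analytics           | Fast  | Fast          | Slow       | Slow     | Decent   | Fast      |
| Feature Engineering | Fast  | Fast          | Decent     | Slow     | Decent   | Good      |
| Training            | Fast  | Decent        | Slow       | Fast     | N/A      | N/A       |
| Exploration         | Fast  | Slow          | Fast       | Slow     | Fast     | Decent    |
| Infra Support       | Rich  | Rich          | Decent     | Limited  | Rich     | Rich      |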

Presentations and Talks
