tiledb's Introduction

The Universal Storage Engine

TileDB is a powerful engine for storing and accessing dense and sparse multi-dimensional arrays, which can help you model any complex data efficiently. It is an embeddable C++ library that works on Linux, macOS, and Windows. It is open-sourced under the permissive MIT License, developed and maintained by TileDB, Inc. To distinguish this project from other TileDB offerings, we often refer to it as TileDB Embedded.

TileDB includes the following features:

  • Support for both dense and sparse arrays
  • Support for dataframes and key-value stores (via sparse arrays)
  • Cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage)
  • Chunked (tiled) arrays
  • Multiple compression, encryption and checksum filters
  • Fully multi-threaded implementation
  • Parallel IO
  • Data versioning (rapid updates, time traveling)
  • Array metadata
  • Array groups
  • Numerous APIs on top of the C++ library
  • Numerous integrations (Spark, Dask, MariaDB, GDAL, etc.)

You can use TileDB to store data in a variety of applications, such as Genomics, Geospatial, Finance and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array, which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract all the data storage and management pains, while efficiently accessing the data with your favorite data science tool.
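To make the dense/sparse distinction concrete, here is a minimal sketch in plain Python. It does not use the TileDB API; the names `dense_get` and `sparse_get` are hypothetical and exist only for illustration.

```python
# Illustrative only: plain Python data structures, not the TileDB API.

# Dense array: every cell in the domain holds a value, so cells can be
# stored positionally (here, row-major in a flat list).
ROWS, COLS = 3, 4
dense = [r * COLS + c for r in range(ROWS) for c in range(COLS)]


def dense_get(row, col):
    """Read a cell by computing its row-major offset."""
    return dense[row * COLS + col]


# Sparse array: only non-empty cells are stored, keyed by their coordinates.
sparse = {(0, 1): 2.5, (2, 3): 7.0}


def sparse_get(row, col, default=None):
    """Read a cell by coordinate lookup; empty cells yield the default."""
    return sparse.get((row, col), default)


# A key-value store can be viewed as a 1-D sparse array whose single
# dimension is the key.
key_value = {("sample-a",): "value-a", ("sample-b",): "value-b"}
```

The dense form pays for every cell but needs no coordinates; the sparse form stores coordinates explicitly and skips empty cells.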

Quickstart

You can install the TileDB C++ library as follows:

# Conda (macOS, Linux, Windows):
$ conda install -c conda-forge tiledb

(see links below for Python, R, and other API installation instructions)

Alternatively, you can use the Docker image we provide:

$ docker pull tiledb/tiledb
$ docker run -it tiledb/tiledb

We include several examples. You can start with the following:

Documentation

You can find the detailed TileDB documentation at https://docs.tiledb.com.

Building from source

Please see building from source in the documentation.

Format Specification

The TileDB data format is open-source and can be found here.

Application-specific Packages

APIs

The TileDB team maintains a variety of APIs built on top of the C++ library:

Integrations

TileDB is also integrated with several popular databases and data science tools:

Get involved

TileDB Embedded is an open-source project and welcomes all forms of contributions. Contributors to the project should read over the contribution docs for more information.

We'd love to hear from you. Drop us a line at [email protected], visit our forum or contact form, or follow us on Twitter to stay informed of updates and news.

tiledb's People

Contributors

abigalekim, bdeng-xt, bekadavis9, cngzhnp, davisp, dhoke4tdb, dudoslav, eddelbuettel, eric-hughes-tiledb, ihnorton, jakebolewski, jeffhammond, joe-maley, johnkerl, joshblum, jp-dark, kdatta, kgururaj, kiterluc, lums658, nguyenv, npapa, ravigaddipati, robertbindar, shaunrd0, shelnutt2, stavrospapadopoulos, tdenniston, teo-tsirpanis, ypatia

tiledb's Issues

Add a cache manager for tiles

Add a cache so that decompressed (and, in the future, decrypted) tiles can be kept in main memory for subsequent accesses.
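One possible shape for such a cache is a byte-bounded LRU map from a tile key to its decompressed bytes. The sketch below is plain Python and only illustrates the eviction policy; the class name `TileCache`, its methods, and the choice of LRU are assumptions, not the design adopted by the project.

```python
from collections import OrderedDict


class TileCache:
    """Hypothetical LRU cache for decompressed tiles, bounded by total bytes."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._tiles = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        """Return the cached tile, or None on a miss."""
        tile = self._tiles.get(key)
        if tile is not None:
            self._tiles.move_to_end(key)  # mark as most recently used
        return tile

    def put(self, key, tile):
        """Insert or replace a tile, evicting old tiles if over capacity."""
        old = self._tiles.pop(key, None)
        if old is not None:
            self.used_bytes -= len(old)
        if len(tile) > self.capacity_bytes:
            return  # a tile larger than the whole cache is never stored
        self._tiles[key] = tile
        self.used_bytes += len(tile)
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._tiles.popitem(last=False)
            self.used_bytes -= len(evicted)
```

Bounding by bytes rather than by tile count keeps memory use predictable when tile sizes vary.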

Add encryption

There should be an option to encrypt all TileDB data at rest.

Fix the C API for setting the array schema

Set array (and metadata) schema members via functions. Also create a tiledb_array_schema_check function that checks if the array schema is correct. Get rid of array_schema_c.h and metadata_schema_c.h.
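The pattern requested here (setter functions plus a separate validation step) might look roughly like the following. This is a Python sketch of the idea, not the C API: `ArraySchema`, `add_dimension`, `add_attribute`, and `check_schema` are hypothetical names, and the specific checks are assumptions about what "correct" could mean.

```python
class SchemaError(ValueError):
    """Raised when a schema fails validation."""


class ArraySchema:
    """Hypothetical schema object whose members are set through functions."""

    def __init__(self, name):
        self.name = name
        self.dimensions = []  # list of (name, low, high)
        self.attributes = []  # list of attribute names

    def add_dimension(self, name, low, high):
        self.dimensions.append((name, low, high))
        return self

    def add_attribute(self, name):
        self.attributes.append(name)
        return self


def check_schema(schema):
    """Validate a schema, raising SchemaError instead of printing a message."""
    if not schema.dimensions:
        raise SchemaError("schema has no dimensions")
    if not schema.attributes:
        raise SchemaError("schema has no attributes")
    names = [dim[0] for dim in schema.dimensions] + schema.attributes
    if len(set(names)) != len(names):
        raise SchemaError("dimension and attribute names must be unique")
    for name, low, high in schema.dimensions:
        if low > high:
            raise SchemaError(
                f"dimension {name!r}: lower bound {low} exceeds upper bound {high}"
            )
```

Separating construction from validation lets callers build a schema incrementally and get one clear error, with an error code or exception, when it is checked.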

Metadata Creation

When I try to create metadata, TileDB is giving me the following error (when in debug mode): [TileDB::ArraySchema] Error: Cannot set domain; Lower domain bound larger than its corresponding upper.
It does not actually return an error code; it only prints the error. Everything else seems to work correctly.

Since the metadata schema doesn't have a domain that is set (by the user at least), is this something that's crossing over from the data schema?

Test workspace not cleaned up.

Currently, the tests create a local workspace that is not cleaned up on failure. This triggers an assertion and causes subsequent tests to fail unless the workspace is removed manually.
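One common way to guarantee cleanup is to register it at setup time so that it runs whether or not the test body fails. The sketch below uses Python's standard `unittest` and `tempfile` modules; the class and test names are hypothetical and the directory it creates merely stands in for a TileDB workspace.

```python
import os
import shutil
import tempfile
import unittest


class WorkspaceTest(unittest.TestCase):
    def setUp(self):
        # Each test gets its own fresh directory.
        self.workspace = tempfile.mkdtemp(prefix="tiledb_test_")
        # Cleanups registered here run even if the test fails or raises.
        self.addCleanup(shutil.rmtree, self.workspace, ignore_errors=True)

    def test_create_array_dir(self):
        path = os.path.join(self.workspace, "array")
        os.mkdir(path)
        self.assertTrue(os.path.isdir(path))
```

Using a unique temporary directory per test also prevents leftover state from one run leaking into the next.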

`tiledb_move` does not work as expected

Given the following workspace setup:

workspace/test1
workspace/test2/test3

you cannot move from test2/test3 -> test1/test3 as it fails with the error
[TileDB::StorageManager] Error: Move failed; Invalid source directory
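For reference, the expected behavior can be expressed with ordinary filesystem semantics. The sketch below is plain Python using `shutil.move`, not `tiledb_move`; the helper name `move_dir` and its precondition checks are assumptions about what a reasonable move should verify.

```python
import os
import shutil


def move_dir(src, dst):
    """Move directory src to dst. dst must not exist; its parent must."""
    if not os.path.isdir(src):
        raise ValueError(f"invalid source directory: {src}")
    if os.path.exists(dst):
        raise ValueError(f"destination already exists: {dst}")
    parent = os.path.dirname(os.path.abspath(dst))
    if not os.path.isdir(parent):
        raise ValueError(f"destination parent does not exist: {parent}")
    shutil.move(src, dst)
```

With the layout above, `move_dir("workspace/test2/test3", "workspace/test1/test3")` would be expected to succeed, leaving `workspace/test2` empty.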

Remove master catalog

The storage manager should not maintain master_catalog metadata for the workspaces.

Ditch the OpenSSL dependency

In the storage manager, the only functionality OpenSSL provides is the MD5 hash function. It would be better to include a standalone MD5 implementation to simplify the build process.

Hash function used in Metadata

The current metadata system uses MD5 as its hash function. As a cryptographic hash function, it has the nice property that collisions are exceedingly rare. This is important because there is currently no key collision handling.

However, there are much faster non-cryptographic hash functions than MD5, such as MurmurHash (and many others). Switching to a faster hash function might be a performance win for metadata operations. The metadata schema could also tag the hash function used to derive the keys so that different hash functions can be substituted based on workload.
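A rough sketch of the "tag the hash function in the schema" idea follows. It is plain Python, not TileDB code. FNV-1a (64-bit) stands in for the faster non-cryptographic hash only because it fits in a few lines; it is not MurmurHash. The `key_hash` schema field and the `derive_key` helper are hypothetical.

```python
import hashlib

FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def md5_key(key: bytes) -> bytes:
    """128-bit cryptographic digest (what the issue says is used today)."""
    return hashlib.md5(key).digest()


def fnv1a64_key(key: bytes) -> bytes:
    """64-bit FNV-1a, a simple non-cryptographic hash used as a stand-in."""
    h = FNV64_OFFSET
    for byte in key:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h.to_bytes(8, "little")


# Registry keyed by the tag stored in the (hypothetical) metadata schema.
HASH_FUNCTIONS = {"md5": md5_key, "fnv1a64": fnv1a64_key}


def derive_key(schema: dict, key: str) -> bytes:
    """Hash a metadata key with the function named in the schema."""
    hash_fn = HASH_FUNCTIONS[schema["key_hash"]]
    return hash_fn(key.encode("utf-8"))
```

Note the trade-off: a 64-bit hash collides far more readily than a 128-bit one, so a switch like this would likely need to be paired with explicit collision handling.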

Handle endianness for portability

I don't know what the performance hit is for converting data and metadata to a non-native byte order, but I think we should use little-endian as the standard for all internal data.
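As an illustration of fixing the on-disk byte order, the sketch below serializes 32-bit integer cells as little-endian regardless of the host. It uses Python's standard `struct` module; the function names and the choice of `int32` cells are assumptions made for the example.

```python
import struct


def encode_cells(values):
    """Serialize int32 cells as little-endian, whatever the host byte order."""
    return struct.pack(f"<{len(values)}i", *values)


def decode_cells(buf):
    """Deserialize little-endian int32 cells back into a list of ints."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}i", buf))
```

On little-endian hosts (the common case) the conversion is a no-op at the byte level, so the cost falls only on big-endian machines.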

Create a BasicArray class

This is the array with the simplest possible schema. It constitutes the base case for the recursive array architecture we are building.

tiledb_constants.h header should go away

The #defines related to the C API should be moved to tiledb.h. Constants related to the C++ storage manager should be moved into the storage manager.

In general, we need a better separation between the C and C++ layers. The C++ storage manager should be self-contained and not entangled with the C API layer that uses it.

Bug when hitting Fragment Error

Upon exercising the error path in TileDB::Fragment, subsequent tests trigger an assertion.

Error:

======================================================================
ERROR: test_workspace_list (test_libtiledb.LibTileDBTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jacobbolewski/anaconda3/envs/tiledb/TileDB-Py/tiledb/tests/test_libtiledb.py", line 191, in test_workspace_list
    libtiledb.workspace_create(ctx, wrk2)
  File "tiledb/libtiledb.pyx", line 468, in tiledb.libtiledb.workspace_create (tiledb/libtiledb.cpp:7020)
    raise TileDBError()
tiledb.libtiledb.TileDBError: [TileDB::Fragment] Error: Cannot rename fragment directory; Directory not empty

----------------------------------------------------------------------

Assertion after error:

(tiledb) tiledb/TileDB-Py [jcb/cytiledb●] » python -c "import tiledb; tiledb.test()"
test_delete (test_libtiledb.LibTileDBTest) ... Assertion failed: (starts_with(stripped_fragment_name, "__")), function sort_fragment_names, file /Users/jacobbolewski/TileDB/TileDB-dev/core/src/storage_manager/storage_manager.cc, line 2394.
[1]    4100 abort      python -c "import tiledb; tiledb.test()"

(tiledb) tiledb/TileDB-Py [jcb/cytiledb●] » python -c "import tiledb; tiledb.test()"
test_delete (test_libtiledb.LibTileDBTest) ... Assertion failed: (starts_with(stripped_fragment_name, "__")), function sort_fragment_names, file /Users/jacobbolewski/TileDB/TileDB-dev/core/src/storage_manager/storage_manager.cc, line 2394.
[1]    4114 abort      python -c "import tiledb; tiledb.test()"

Clearing the ~/.tiledb directory causes the error to go away, but some bit of state is not being cleaned up upon hitting the original error path.

Refactor asynchronous I/O

Currently, each array spawns and manages its own AIO thread. Instead, a single thread (or one per disk) in the storage manager should handle all AIO.
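The proposed arrangement can be sketched as one worker thread draining a shared request queue. The code below is plain Python using the standard `queue` and `threading` modules; `AioWorker` and its methods are hypothetical names, and the real storage manager would be written in C++.

```python
import queue
import threading


class AioWorker:
    """One shared worker thread that serves I/O requests from every array."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, io_fn, *args):
        """Queue a request; returns (result dict, completion event)."""
        done = threading.Event()
        result = {}
        self._requests.put((io_fn, args, result, done))
        return result, done

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            io_fn, args, result, done = item
            try:
                result["value"] = io_fn(*args)
            except Exception as exc:  # report the failure to the caller
                result["error"] = exc
            finally:
                done.set()

    def shutdown(self):
        """Stop the worker after all previously queued requests finish."""
        self._requests.put(None)
        self._thread.join()
```

A caller submits a request, continues working, and later waits on the returned event. A one-thread-per-disk variant would keep one such worker per device and route each request by path.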
