Datamon

Datamon is a data science tool sponsored by OneConcern that helps manage data at scale.

Primer

Goals

The primary goal of datamon is to manage versioned data at rest, providing CLI tools for creation, access and tracking in an environment where data repositories and their lifecycles are linked.

Datamon links the various sources of data with the way they are processed, and tracks the new data generated from existing data.

More on design and architecture.

Features

  • Manage data sets as versioned repositories stored on a cloud storage backend
  • Manage metadata for these data sets (versions, labels, file sets...)
  • Multi-tenancy using contexts
  • Lineage tracking backed by cloud authentication
  • Store data sets as fixed size deduplicated blobs, using blake hashing
  • Versions ("bundles") may be uploaded, then downloaded to local storage
  • Versions may be accessed directly on a mounted file system (FUSE)
  • CLI management tool
  • Metrics collection

Added value

  • Leverages low-cost frozen storage (e.g. S3, GCS)
  • Minimizes billed storage operations: no billable backend store options (such as concurrency control) are used
  • Optimized for speed: parallel I/Os together with deduplication vastly outperform usual tools like gsutil
  • A well-defined and tested immutable metadata model ensures that no data is ever lost or unrecoverable. Datamon is an effective substitute for many bespoke gsutil scripting utilities.
  • Versioning and tagging apply to whole data sets, not individual files. This makes it easy to restore consistent inputs to a reproducible computation.
  • Less storage bucket administration: datamon uses only a few buckets, defined according to IAM policies (i.e. a datamon context)
  • Repositories provide a convenient abstraction for datasets, and share the same underlying cloud storage bucket configuration (abstracted as a "context")

Extra tools

  • Scripted interface to use as a sidecar container (e.g. for Argo workflows)

Experimental

  • Mutable FUSE mount, to commit versioned data sets directly from a mounted file system

Coming soon...

  • Diamond workflow: several collaborating nodes produce a versioned dataset in parallel
  • Python bindings
  • Write Ahead / Read Ahead logs

Environment

Although flexible in its concepts and architecture, the current version of datamon is primarily developed and tested against the Google Cloud environment. Note that AWS S3 storage buckets are supported (see datamover tool).

Storage backends

Datamon supports the following cloud storage backends:

  • Google Cloud Storage
  • AWS S3

Concepts

  • Repo: analogous to a git repo. A repo in datamon is a dataset that has a unified lifecycle.
  • Bundle: a bundle is a point-in-time, read-only view of a repo:branch and is composed of individual files. Analogous to a commit in git.
  • Label: a name given to a bundle, analogous to tags in git. Examples: Latest, production.
  • Context: a context provides a way to define multiple instances of datamon.
  • Write Ahead Log: a WAL tracks data updates and their ordering.
  • Read Log: logs all read operations, with their originator.
  • Authentication: datamon keeps track of who contributed what, when and in which order (WAL) and who accessed what (Read Log).
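
As a rough illustration of how repos, bundles, labels and contexts relate, here is a minimal Go sketch. It is not datamon's API or storage model: every type, field and method name (`Context`, `Repo`, `Bundle`, `Commit`, `SetLabel`, `Resolve`) and the sample repo and bundle names are hypothetical, and everything lives in memory.

```go
package main

import (
	"errors"
	"fmt"
)

// Bundle is a read-only, point-in-time set of files (analogous to a git commit).
type Bundle struct {
	ID    string
	Files []string
}

// Repo is a dataset with a unified lifecycle (analogous to a git repo).
// All names and fields here are hypothetical, for illustration only.
type Repo struct {
	Name    string
	bundles map[string]Bundle
	labels  map[string]string // label name -> bundle ID
}

// Context groups the repos that belong to one instance of the tool.
type Context struct {
	Name  string
	Repos map[string]*Repo
}

var (
	ErrExists   = errors.New("bundle already exists")
	ErrNotFound = errors.New("not found")
)

// NewRepo registers an empty repo in the context and returns it.
func (c *Context) NewRepo(name string) *Repo {
	r := &Repo{Name: name, bundles: map[string]Bundle{}, labels: map[string]string{}}
	c.Repos[name] = r
	return r
}

// Commit adds a bundle; an existing bundle is never overwritten.
func (r *Repo) Commit(b Bundle) error {
	if _, ok := r.bundles[b.ID]; ok {
		return ErrExists
	}
	r.bundles[b.ID] = b
	return nil
}

// SetLabel points a label at an existing bundle. In this sketch a label
// may later be moved to another bundle (an assumption).
func (r *Repo) SetLabel(label, bundleID string) error {
	if _, ok := r.bundles[bundleID]; !ok {
		return ErrNotFound
	}
	r.labels[label] = bundleID
	return nil
}

// Resolve returns the bundle a label currently points at.
func (r *Repo) Resolve(label string) (Bundle, error) {
	id, ok := r.labels[label]
	if !ok {
		return Bundle{}, ErrNotFound
	}
	return r.bundles[id], nil
}

func main() {
	ctx := &Context{Name: "dev", Repos: map[string]*Repo{}}
	repo := ctx.NewRepo("model-inputs")
	_ = repo.Commit(Bundle{ID: "b1", Files: []string{"a.csv"}})
	_ = repo.Commit(Bundle{ID: "b2", Files: []string{"a.csv", "b.csv"}})
	_ = repo.SetLabel("production", "b1")
	_ = repo.SetLabel("Latest", "b2")
	b, _ := repo.Resolve("production")
	fmt.Println(b.ID, b.Files)
}
```

Resolving a label to a whole bundle, rather than to individual files, mirrors the idea that versioning applies to complete data sets.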

Installation

Please follow the installation instructions.

Migrating from v1 to v2

v2 comes with breaking changes. The migration process replaces older repos with new ones.

See the migration guide.

CLI guide

Datamon comes as a CLI tool: see usage.

Use cases

Feature requests and bugs

Please file GitHub issues for feature requests or bug reports.

Contributing

Please read our contributing guidelines.

License

Datamon is developed by OneConcern Inc. under the MIT license.
