

Ruby ETL Gem

This ETL (Extract Transform Load) Ruby gem provides a back-end system for moving data between data sources in a reliable and scalable manner. It includes features for extracting from several different data sources, transforming the data, and loading to different databases with a user-specified load strategy.

The ETL system is designed from the ground up to be highly available and scalable through usage of queued ETL jobs, distributed workers, and optimized load paths.
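The queued-job and worker design described above can be illustrated with a small sketch. It is not the gem's API: the `JOBS` registry, the `daily_rollup` job name, and the payload fields are hypothetical, and an in-process `Queue` stands in for a real message queue. It only shows the general shape of JSON-described jobs being pulled off a queue by several workers.

```ruby
require 'json'

# Hypothetical job registry mapping a job name to a callable.
# These names are illustrative and are not part of the gem.
JOBS = {
  'daily_rollup' => lambda do |params|
    "rolled up #{params.fetch('table')} for #{params.fetch('day')}"
  end
}.freeze

# Thread-safe in-process queues standing in for a real message queue.
queue = Queue.new
results = Queue.new

# Producer side: each job is described entirely by a JSON payload.
%w[2016-01-01 2016-01-02].each do |day|
  queue << JSON.generate('job' => 'daily_rollup',
                         'params' => { 'table' => 'events', 'day' => day })
end
queue.close # no more work; pop returns nil once the queue is drained

# Worker side: each worker pops a message, parses it, and runs the job.
workers = Array.new(2) do
  Thread.new do
    while (message = queue.pop)
      payload = JSON.parse(message)
      job = JOBS.fetch(payload.fetch('job'))
      results << job.call(payload.fetch('params'))
    end
  end
end
workers.each(&:join)

processed = []
processed << results.pop until results.empty?
puts processed.sort
```

Because each payload carries everything a job needs, any worker can run any message, which is what lets the number of workers scale independently of the producers.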

Key Use Cases

  • Loading data in batches into a data warehouse (e.g. Redshift) for downstream analytic workloads

  • Creating aggregated/roll-up tables from raw data

Because the ETL system is configured through Ruby and JSON, the target users are back-end developers who are looking to integrate a lightweight solution into their system.

Key Features

Currently implemented features:

  • Connects to various data sources including relational, non-relational, and flat file

  • Flexible load strategies including insert, partition, and upsert

  • Job configuration and specification through Ruby for flexibility

  • Jobs are parameterized by JSON payloads that can be sent as queue messages for distributed processing

  • Automated management of common warehouse columns such as load date

  • Worker process that reads jobs from queue and runs them
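To make the difference between the three load strategies concrete, here is a minimal sketch that uses an in-memory array of hashes in place of a database table. The helper names are hypothetical and are not the gem's API. The meaning given to "partition" (replace every existing row in any partition touched by the incoming batch) is an assumption, since the exact semantics are not spelled out here.

```ruby
# Illustrative only: an in-memory "table" stands in for a database, and
# these helpers are hypothetical, not the gem's API.

# insert: append every incoming row.
def load_insert(table, rows)
  table + rows
end

# upsert: replace rows whose key matches an incoming row, append the rest.
def load_upsert(table, rows, key)
  incoming = rows.to_h { |row| [row.fetch(key), row] }
  kept = table.reject { |row| incoming.key?(row.fetch(key)) }
  kept + incoming.values
end

# partition (assumed meaning): drop existing rows in any partition touched
# by the batch, then append the batch.
def load_partition(table, rows, partition_col)
  touched = rows.map { |row| row.fetch(partition_col) }.uniq
  table.reject { |row| touched.include?(row.fetch(partition_col)) } + rows
end

table = [
  { id: 1, day: '2016-01-01', clicks: 5 },
  { id: 2, day: '2016-01-02', clicks: 7 },
  { id: 4, day: '2016-01-02', clicks: 3 }
]
batch = [
  { id: 2, day: '2016-01-02', clicks: 9 },
  { id: 3, day: '2016-01-02', clicks: 1 }
]

inserted    = load_insert(table, batch)          # 5 rows, id 2 appears twice
upserted    = load_upsert(table, batch, :id)     # ids 1, 2, 3, 4
partitioned = load_partition(table, batch, :day) # ids 1, 2, 3

puts inserted.size, upserted.size, partitioned.size
```

In this sketch, upsert keeps row 4 because its key is not in the batch, while the partition load drops it because it shares a `day` with the incoming rows.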

Future features:

  • Parameterizable Ruby process for scheduling jobs

  • Posts metrics on job scheduling and execution performance

  • Additional data sources/destinations

  • Job scheduling and dependency representation

  • Support for streaming data sources

  • Code hooks for validation and auditing of data loads

Data Sources and Destinations

Currently the following sources are supported for both data input and output.

  • CSV

  • MySQL

  • PostgreSQL

  • InfluxDB

The following are on the short list for future support:

  • Redshift

  • CloudWatch

Status

Although many parts of this system work and it is used in production at my current company, the interface is still evolving and not all features are fully implemented. Consider this pre-alpha software. Please contact me if you’re interested in contributing or have questions.

To be completed before I’d consider this “alpha”:

  • Finish scheduling and metrics features listed above

  • Provide documentation and examples of the API

  • Publish to RubyGems

To run the existing tests

Run the following commands in order:

  • Run `docker build . -t outreach/etl-test`

  • Run `docker run -it -v /MY_TEST_CONFIG_DIR:/etl_config outreach/etl-test bash`

  • Run `service postgresql start`

  • Run `export ETL_CONFIG_DIR=/etl_config`

  • Run `bundle exec rspec`

Copyright © 2015-2016 Charles Smith

This is an Open Source project licensed under the terms of the MIT license as described in the LICENSE file in the root directory of this repository.
