mongo-to-s3

Exports whitelisted mongo fields to an s3 bucket.

Usage:

  -config string
        String corresponding to an env var config
  -collection string
        The mongo collection you wish to pull from (required)
  -database string
        Database url if using existing instance (required)
  -bucket string
        s3 bucket to upload to

Behavior

mongo-to-s3 does a few things:

  1. connects to the provided mongo database
  2. determines the correct "data date" by rounding down to the nearest hour
  3. parses the provided config file
  4. for each table in the config file:
     • pulls the whitelisted fields from mongo
     • flattens objects into dot-separated fields
     • streams to gzipped, timestamped JSON files on s3
  5. prints the payload to be used in an s3-to-redshift job to process this data
     • the job is kickstarted automatically by a workflow
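
The "data date" rounding in the steps above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function name `data_date` is hypothetical, and using UTC for the run time is an assumption.

```python
from datetime import datetime, timezone


def data_date(now: datetime) -> datetime:
    """Round a timestamp down to the start of its hour."""
    return now.replace(minute=0, second=0, microsecond=0)


# Assumed example run time; the real worker would use the current time.
run_time = datetime(2016, 3, 14, 15, 47, 12, tzinfo=timezone.utc)
print(data_date(run_time).isoformat())  # 2016-03-14T15:00:00+00:00
```

Rounding down means every run within the same hour produces the same data date.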

Right now, mongo-to-s3 will attempt to export all fields/tables in the X_config.yml whitelist it is called with.
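
To illustrate what "whitelisted" and "dot-separated" mean for a single document, here is a rough sketch. The helper names `flatten` and `pick_whitelisted` and the sample document are made up for this example. The sketch flattens first and filters afterwards, which may not match the order the tool uses internally, and it leaves lists untouched since the handling of arrays is not described here.

```python
from typing import Any, Dict, Iterable


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def pick_whitelisted(doc: Dict[str, Any], whitelist: Iterable[str]) -> Dict[str, Any]:
    """Keep only the flattened fields named in the whitelist."""
    flat = flatten(doc)
    return {field: flat[field] for field in whitelist if field in flat}


# Hypothetical document; "ssn" is not whitelisted, so it is never exported.
student = {"_id": "abc123", "name": {"first": "Ada", "last": "Lovelace"}, "ssn": "secret"}
print(pick_whitelisted(student, ["_id", "name.first"]))
# {'_id': 'abc123', 'name.first': 'Ada'}
```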

Updating config files

Configs are YAML stored in env vars and follow this format:

tablename-whateveryouwant:
  dest: <redshift_table_name>
  source: <mongo_table_name>
  columns:
    -
      dest: _data_timestamp
      type: timestamp
      sortord: 1
    -
      dest: <column_name_in_redshift>
      source: <column_name_in_mongo>
      type: text
      primarykey: true
      notnull: true
      distkey: true
  meta:
    datadatecolumn: _data_timestamp
    schema: <redshift_schema_name>

Internal note: configs are located in ark-config
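
As a reading aid, the sketch below shows how a config entry in the format above might look once parsed, and how the field whitelist and destination table names could be derived from it. Everything here is illustrative: the table, collection, and schema names are placeholders, and `source_fields` and `dest_tables` are hypothetical helpers, not functions from mongo-to-s3.

```python
from typing import Any, Dict, List

# Hypothetical parsed form of one config entry (all names are placeholders).
config: Dict[str, Any] = {
    "students-example": {
        "dest": "students",
        "source": "students_collection",
        "columns": [
            {"dest": "_data_timestamp", "type": "timestamp", "sortord": 1},
            {"dest": "id", "source": "_id", "type": "text",
             "primarykey": True, "notnull": True, "distkey": True},
        ],
        "meta": {"datadatecolumn": "_data_timestamp", "schema": "example_schema"},
    }
}


def source_fields(table: Dict[str, Any]) -> List[str]:
    """Mongo fields to pull: every column that declares a source."""
    return [col["source"] for col in table["columns"] if "source" in col]


def dest_tables(cfg: Dict[str, Any], collections: List[str]) -> List[str]:
    """Destination table names for the given source collections."""
    return [t["dest"] for t in cfg.values() if t["source"] in collections]


table = config["students-example"]
print(source_fields(table))                          # ['_id']
print(dest_tables(config, ["students_collection"]))  # ['students']
```

The `_data_timestamp` column has no `source`, so it does not appear in the whitelist.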

There are a few tricky things, including some items that are changing in the near future.

Tricky things:

  1. The datadatecolumn helps track the date of the data going into the data warehouse and prevents us from overwriting new data with old. Therefore, we want to set it to approximately when the data was created.

     Currently, we do this via a special column specified in the meta section. Whatever column you specify there will be overwritten with the time the mongo-to-s3 worker is run, rounded down to the nearest hour. Note that this column doesn't require a source, since mongo-to-s3 populates it.

  2. We currently don't support more than one sortkey, so the only valid value for sortord is 1.

  3. You also have to set notnull for primarykey columns, even though that is implied.

  4. Accepted column types are:

     • boolean
     • float
     • int
     • bigint
     • timestamp
     • text (256 characters)
     • longtext (65535 characters)

     It should be easy to add more, however.

  5. You may want to think about what happens if some data arrives in the data warehouse sooner than other data. For instance, suppose item A is only "active" if an item B exists in the database and points to A. If you've synced A significantly before B, A may appear "inactive" until B is synced over. In reality, A has always been "active".

  6. While you pass collections to run on as parameters to mongo-to-s3, the eventual s3-to-redshift job will post with the destination table names as parameters.
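
The sortord, notnull, and column-type rules above can be checked mechanically before a config is deployed. The following is a minimal sketch of such a check. `validate_table` is a hypothetical helper, not part of mongo-to-s3, and it assumes the table config has already been parsed into a dict.

```python
from typing import Any, Dict, List

ACCEPTED_TYPES = {"boolean", "float", "int", "bigint", "timestamp", "text", "longtext"}


def validate_table(table: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means none were found."""
    problems: List[str] = []
    for col in table.get("columns", []):
        name = col.get("dest")
        if col.get("type") not in ACCEPTED_TYPES:
            problems.append(f"{name}: unsupported type {col.get('type')!r}")
        # Only one sortkey is supported, so sortord can only be 1.
        if "sortord" in col and col["sortord"] != 1:
            problems.append(f"{name}: sortord must be 1")
        # notnull has to be set explicitly on primarykey columns.
        if col.get("primarykey") and not col.get("notnull"):
            problems.append(f"{name}: primarykey columns must also set notnull")
    return problems


# Hypothetical bad entry that breaks all three rules.
bad = {"columns": [{"dest": "id", "source": "_id", "type": "varchar",
                    "sortord": 2, "primarykey": True}]}
for problem in validate_table(bad):
    print(problem)
# id: unsupported type 'varchar'
# id: sortord must be 1
# id: primarykey columns must also set notnull
```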
