tube's Introduction

Gen3 Tube ETL - a process from PostgreSQL to ElasticSearch

Purpose

Answering every data query quickly is challenging because storage space has to be traded off against query performance. Given a database schema such as the one represented in the figure, querying all the data across all the tables requires multiple joins to connect the records to each other.

SQL databases provide a standard, space-efficient way to store data. The price of that storage efficiency is paid when processing or retrieving data. Given a database with a schema such as the one in the figure above, gathering all the information related to Subject from its descendant tables requires sixteen joins (one per link). With big data, that is an expensive task.

NoSQL and document databases offer a way to circumvent the cost by duplicating data or materializing necessary values for the frequent requests. Normally, when data are received by the system, they are stored in the "source of truth" database and streamed to the secondary document database via an Extract-Transform-Load (ETL) process.

The Gen3 Tube ETL is designed to translate data from a graph data model, stored in a PostgreSQL database, to indexed documents in ElasticSearch (ES), which supports efficient ways to query data from the front-end. Its purpose is to create indexed documents that reduce the response time of data queries. It is configured through an etlMapping.yaml configuration file, which specifies the tables and fields to load into ElasticSearch.

Key documentation

Gen3 graph data flow

tube's People

Contributors

abgeorge7, albertsnows, atharvar28, avantol13, cmlsn, dependabot-preview[bot], dependabot[bot], frickjack, giangbui, haraprasadj, jawadqur, m0nhawk, mfshao, michaellukowski, mikeabreu, mpingram, mysterious-progression, nss10, paulineribeyre, philloooo, plooploops, thanh-nguyen-dang, themarcelor, vpsx, williamhaley

tube's Issues

PXD-1809 ⁃ allow props to be object

Allow the ETL to rename a field:

_props:
  - name: prop
  - name: property_name_in_es
    source: property_name_in_psqlgraph  # optional, defaults to name

Allow the ETL to rename a value:

_props:
  - name: property_name_in_es
    source: property_name_in_psqlgraph
    value_mapping:
      # quoted so that YAML 1.1 parsers do not read Yes/No as booleans
      "Positive": "Yes"
      "Negative": "No"

PXD-1807 ⁃ create file index mapping and etl

We need the file index for the 'download manifest' button. Since it is only needed for that button, we don't need to aggregate any fields except 'case_id'. The traversal path will be different per file type, so the mapping will need one block per node. We will need to adjust the mappings and settings to allow:

  • pointing to different indices per top-level object in the mapping (doc type is deprecated, so we shouldn't use it).

PXD-1843 ⁃ handle tube bug fixes

We will keep running the ETL for a number of commons in this sprint.

This ticket was created to handle any upcoming blocking bugs and to add regression tests.

PXD-1842 ⁃ support creating index with version number

We use aliases in ElasticSearch. Tube can create index_0 and then create an alias "index" that points to "index_0". The next time, tube creates a new "index_1" and we move the alias to the new index. This way there is no downtime during ETL. Arranger just uses the alias endpoint.

Tube should let the manifest specify only the alias, detect the latest existing version in the current ElasticSearch, build a new index whose suffix is 1 greater than that version, and then roll the alias to point to the new index.
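The version detection and alias roll described above could look roughly like the following. This is a sketch under assumptions, not tube's code: latest_version and plan_roll are hypothetical names, index names are assumed to follow the alias_N pattern from this ticket, and the alias is assumed to currently point at the latest versioned index. The sketch only computes the new index name and a request body in the shape used by ElasticSearch's _aliases endpoint; it does not talk to a cluster.

```python
import re
from typing import Any, Dict, Iterable, Optional, Tuple


def latest_version(alias: str, index_names: Iterable[str]) -> Optional[int]:
    """Return the highest N among indices named '<alias>_<N>', or None if there are none."""
    pattern = re.compile(r"^" + re.escape(alias) + r"_(\d+)$")
    versions = [int(m.group(1)) for m in map(pattern.match, index_names) if m]
    return max(versions) if versions else None


def plan_roll(alias: str, index_names: Iterable[str]) -> Tuple[str, Dict[str, Any]]:
    """Pick the next index name and build the alias actions to apply after the ETL."""
    current = latest_version(alias, index_names)
    new_index = f"{alias}_{0 if current is None else current + 1}"
    actions = []
    if current is not None:
        # Assumption: the alias currently points at the latest versioned index.
        actions.append({"remove": {"index": f"{alias}_{current}", "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return new_index, {"actions": actions}


new_index, body = plan_roll("index", ["index_0", "index_1", "other_7", "index_old"])
```

Once the new index has been fully built, the returned body can be posted to the _aliases endpoint. ElasticSearch applies the remove and add actions in one request atomically, which is what avoids downtime for clients that query the alias.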
