uc-cdis / tube

ETL

License: Apache License 2.0

Python 97.61% Dockerfile 2.00% Shell 0.39%

gen3

tube's Introduction

Gen3 Tube ETL - a process from PostgreSQL to ElasticSearch

Purpose

Providing a quick response to every data query is challenging, because storage space must be balanced against query performance. Given a database schema such as the one represented in the figure, querying all the data across all the tables requires multiple joins to connect the records to each other.

SQL databases provide an optimal and standard way to store data. The price of that storage efficiency is paid when processing or retrieving data. Given a database with a schema such as the one in the figure above, gathering all information related to Subject from its descendant tables requires sixteen joins (one per link). With big data, this is an expensive task.

NoSQL and document databases offer a way to circumvent the cost by duplicating data or materializing necessary values for the frequent requests. Normally, when data are received by the system, they are stored in the "source of truth" database and streamed to the secondary document database via an Extract-Transform-Load (ETL) process.
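
As a minimal sketch of the materialization idea described above, the following Python snippet joins two small tables once, at load time, and emits one self-contained document per subject. The table names (`subjects`, `samples`), the field names, and the `denormalize` helper are all hypothetical and are not taken from Tube itself.

```python
from collections import defaultdict

# Hypothetical normalized rows; table and field names are illustrative only.
subjects = [
    {"subject_id": "s1", "gender": "female"},
    {"subject_id": "s2", "gender": "male"},
]
samples = [
    {"sample_id": "a", "subject_id": "s1", "tissue": "blood"},
    {"sample_id": "b", "subject_id": "s1", "tissue": "skin"},
    {"sample_id": "c", "subject_id": "s2", "tissue": "blood"},
]


def denormalize(subjects, samples):
    """Join once at load time and emit one document per subject."""
    samples_by_subject = defaultdict(list)
    for sample in samples:
        samples_by_subject[sample["subject_id"]].append(sample)

    documents = []
    for subject in subjects:
        children = samples_by_subject[subject["subject_id"]]
        documents.append({
            **subject,
            # Materialized values that would otherwise need a join per query.
            "sample_count": len(children),
            "tissues": sorted({child["tissue"] for child in children}),
        })
    return documents


documents = denormalize(subjects, samples)
```

Each resulting document can answer "how many samples does this subject have?" without touching the `samples` table again, at the cost of duplicating data that must be refreshed whenever the source changes.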

The Gen3 Tube ETL is designed to translate data from a graph data model, stored in a PostgreSQL database, to indexed documents in ElasticSearch (ES), which supports efficient ways to query data from the front-end. The purpose of the Gen3 Tube ETL is to create indexed documents to reduce the response time of requests to query data. It is configured through an etlMapping.yaml configuration file, which describes which tables and fields to ETL to ElasticSearch.

Key documentation

ETL overview: more information about general ETL processes
How to configure the Gen3 Tube ETL
Configuration examples
Local development installation guide
How to run unit tests locally
Configuring SSL

Gen3 graph data flow

tube's Issues

PXD-1549 ⁃ Implement PFB to elasticsearch ETL

Proof of concept: if rich data (i.e., metadata) is represented in protobuf, see what it looks like to ETL it to ElasticSearch via Spark.

How to flatten a child node that has a one-to-many relationship with its parent in etlMapping?

For example, the subject node has multiple family history nodes. How can the family history node be flattened in etlMapping as a new index in the ETL? @thanh-nguyen-dang

PXD-1809 ⁃ allow props to be object

Allow the ETL to rename a field:

_props:
  - name: prop
  - name: property_name_in_es
    source: property_name_in_psqlgraph  # optional, defaults to name

Allow the ETL to rename a value:

_props:
  - name: property_name_in_es
    source: property_name_in_psqlgraph
    value_mapping:
      # Quoted so that YAML 1.1 parsers do not read Yes/No as booleans.
      "Positive": "Yes"
      "Negative": "No"
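
The two proposals above (renaming a field and renaming a value) could be applied to a single source record roughly as follows. This is only a sketch of the requested behavior, not Tube's implementation: the `apply_props` helper and the dict-based record format are assumptions made for illustration.

```python
def apply_props(record, props):
    """Build an output document from a source record and a list of prop specs.

    Hypothetical helper: each spec is a dict with a required "name" and
    optional "source" and "value_mapping" keys, mirroring the proposal above.
    Values are assumed to be scalars (hashable).
    """
    document = {}
    for prop in props:
        name = prop["name"]
        source = prop.get("source", name)  # source defaults to name
        value = record.get(source)
        mapping = prop.get("value_mapping", {})
        # Values without an entry in value_mapping pass through unchanged.
        document[name] = mapping.get(value, value)
    return document


props = [
    {"name": "prop"},
    {
        "name": "property_name_in_es",
        "source": "property_name_in_psqlgraph",
        "value_mapping": {"Positive": "Yes", "Negative": "No"},
    },
]
record = {"prop": 1, "property_name_in_psqlgraph": "Positive"}
document = apply_props(record, props)
```

Here `document` carries `prop` unchanged, while `property_name_in_psqlgraph` is emitted under the name `property_name_in_es` with its value translated through the mapping.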

PXD-2226 ⁃ gather requirements from GDC for elastic search ETL

PXD-1821 ⁃ allow flatten props to do path traversal too

  _flatten_props:
    - path: edge1.edge2
      _props:
        - prop

If there is more than one child during map-reduce, take only the first one.
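
A minimal sketch of this path traversal, assuming a node is represented as a nested dict in which each edge name maps to a list of child nodes. That representation and the helper names `first_by_path` and `flatten_props` are hypothetical; Tube itself performs this work in Spark.

```python
def first_by_path(node, path):
    """Follow a dotted edge path, taking the first child at each step."""
    current = node
    for edge in path.split("."):
        children = current.get(edge) or []
        if not children:
            return None  # the path does not exist for this node
        current = children[0]  # more than one child: keep only the first
    return current


def flatten_props(node, path, props):
    """Copy the requested props from the node reached via path."""
    target = first_by_path(node, path)
    if target is None:
        return {prop: None for prop in props}
    return {prop: target.get(prop) for prop in props}


node = {
    "id": "n0",
    "edge1": [
        {"id": "n1", "edge2": [{"prop": "first"}, {"prop": "second"}]},
        {"id": "n2", "edge2": [{"prop": "other"}]},
    ],
}
flattened = flatten_props(node, "edge1.edge2", ["prop"])
```

Taking the first child keeps the output flat, but it silently drops the remaining children, so the result depends on the order in which children are stored.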

PXD-1984 ⁃ Travis to wait for Quay finish building the image before running the test

PXD-1723 ⁃ Research about Serialization format - PFB

PXD-1807 ⁃ create file index mapping and etl

We need the file index for the 'download manifest' button. Since it is only needed for that button, we don't need to aggregate any fields except 'case_id'. The traversal path will differ per file type, so the mapping will need one block per node. We will need to adjust the mappings and settings to allow:

pointing to different indices per top-level object in the mapping (doc type is deprecated, so we shouldn't use it).

PXD-1843 ⁃ handle tube bug fixes

We will keep running the ETL for a number of commons in this sprint.

This ticket was created to handle any upcoming blocking bugs and to add regression tests.

PXD-1842 ⁃ support creating index with version number

We use aliases in ElasticSearch. Tube can create index_0 and then create an alias "index" that points to "index_0". The next time, tube creates a new "index_1" and we move the alias pointer to the new index. This way there is no downtime during the ETL. Arranger just uses the alias endpoint.

Tube should support letting the manifest specify only the alias. It should detect the latest existing version in the current ElasticSearch, build a new index with a suffix one greater than that version, and then roll the alias to point to the new index.
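
The version detection and alias roll described in this issue could look roughly like the sketch below. The helper names are hypothetical and the `<alias>_<number>` naming scheme is assumed from the issue text. The returned dictionary follows the request body shape of ElasticSearch's `_aliases` endpoint, where the `remove` and `add` actions in one request are applied atomically; sending it with a client is left out so the example stays self-contained.

```python
import re


def next_versioned_index(alias, existing_indices):
    """Return (previous_index, new_index) for names of the form <alias>_<n>."""
    pattern = re.compile(rf"^{re.escape(alias)}_(\d+)$")
    versions = []
    for name in existing_indices:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    if not versions:
        return None, f"{alias}_0"  # first run: nothing to roll away from
    latest = max(versions)
    return f"{alias}_{latest}", f"{alias}_{latest + 1}"


def alias_roll_actions(alias, previous, new):
    """Build an _aliases request body that moves the alias to the new index."""
    actions = []
    if previous is not None:
        actions.append({"remove": {"index": previous, "alias": alias}})
    actions.append({"add": {"index": new, "alias": alias}})
    return {"actions": actions}


existing = ["index_0", "index_1", "other_5", "index_old"]
previous, new = next_versioned_index("index", existing)
body = alias_roll_actions("index", previous, new)
```

With the sample list above, only `index_0` and `index_1` match the naming scheme, so the new index is `index_2` and the alias is moved from `index_1` to it in a single request.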