Giter Club home page Giter Club logo

data-extractor's Introduction

data-extractor

Archive notice

This repository is archive now, SDP ingestion has moved to Azure data factory https://github.com/hmcts/mi-data-ingestion

Simple data extractor/shuffler

Runs a sql query against a database and saves the results in an azure storage container.

It can export the results as one of:

  • csv
  • json
  • jsonlines

It obtains its configuration from the following environment variables:

  • ETL_DB_URL: jdbc connection url (e.g. "jdbc:postgresql://localhost:5432")
  • ETL_DB_USER_FILE: file containing the db username relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-user)
  • ETL_DB_PASSWORD_FILE: file containing the db password relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-pwd)
  • ETL_MSI_CLIENT_ID: pod identity client id to get credentials from keyvault and write access to blob storage.
  • ETL_CONNECTIONSTRING: This is an alternative authentication system in case manage identity is not setup in the cluster.
  • ETL_ACCOUNT: Azure storage account where output should be saved (e.g. "devstoreaccount1")
  • EXTRACTION_CASETYPES: Configuration list to extract each case type. This list can contain the following values:
    • CONTAINER: Azure storage container where output should be saved (e.g. "testcontainer")
    • TYPE: output file type. One of: jsonlines, csv, json (default "jsonlines")
    • PREFIX: prefix for the output file (e.g. "test01").
    • CASETYPE: CCD case type to extract

The 2 values: ETL_DB_USER_FILE and ETL_DB_PASSWORD_FILE are useful if the username and password are retrieved from Azure keyvault and exposed as flexvolumes. The same username and password can alternatively be passed as environment variables (ETL_DB_USER and ETL_DB_PASSWORD).

The output file obtained contains all the records generated by the query. File naming follows this convention: <prefix>-<datetime>.<type>

Helm chart

The easiest way to run a job is by using the included helm chart which is based on chart-job. This can be done running the following command: helm install hmcts/data-extractor-job --name data-extractor-job-001 --namespace mi -f job-values.yaml --wait where job-values.yaml is:

job:
  image: hmcts.azurecr.io/hmcts/data-extractor-job:prod-f888e665
  aadIdentityName: mi
  keyVaults:
    "data-extractor":
      resourceGroup: data-extractor 
      secrets:
        - ccdro-user
        - ccdro-password
        - appinsights-instrumentationkey
  labels:
    app.kubernetes.io/instance : data-extractor-job-001
    app.kubernetes.io/name: data-extractor-job
  environment:
    ETL_DB_URL: jdbc:postgresql://ccd-data-store-api-postgres-db-aat.postgres.database.azure.com:5432/ccd_data_store
    ETL_DB_USER_FILE: data-extractor/ccdro-user
    ETL_DB_PASSWORD_FILE: data-extractor/ccdro-password
    ETL_SQL: >
      SELECT id, created_date, event_id, summary, description, user_id, case_data_id,
      case_type_id, case_type_version, state_id, user_first_name, user_last_name,
      event_name, state_name, security_classification
      FROM case_event
      WHERE created_date >= (current_date-1 + time '00:00')
      AND created_date < (current_date + time '00:00')
      ORDER BY created_date ASC;
    ETL_ACCOUNT: midatastg
    ETL_MSI_CLIENT_ID: 1461ff03-675c-423c-95a4-fb50d31254ff
    EXTRACTION_CASETYPES_0__CONTAINER: "test-container"
    EXTRACTION_CASETYPES_0__CASETYPE: "Caveat"
    EXTRACTION_CASETYPES_0__TYPE: "jsonlines"
    EXTRACTION_CASETYPES_0__PREFIX: "CCD-TEST"
global:
  job:
    kind: Job
  subscriptionId: "1c4f0704-a29e-403d-b719-b90c34ef14c9"
  tenantId: "531ff96d-0ae9-462a-8d2d-bec7c0b42082"
  environment: aat

For an example of how to run this using flux please see: flux github repo

data-extractor's People

Contributors

adusumillipraveen avatar dependabot-preview[bot] avatar luigibk avatar qzhou-hmcts avatar tchow8 avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.