The Historic Data Loader

Overview

The historic data loader (hereinafter termed 'HDL') is a tool for bulk loading UC exports.

The process uses HBase's bulk loading capabilities, which use MapReduce to prepare HFiles directly; HBase is then directed to adopt these files. A useful post on this technique can be found here.

Each run of HDL should process files that are all destined for the same table, so a full import of the historic data needs many runs of HDL, one for each table in HBase. This is a requirement of the HBase incremental load feature, which this application uses.
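
The one-run-per-table rule can be sketched as a small driver loop. This is only an illustration: the topic/table pairs are hypothetical, and run_hdl is a placeholder that prints the planned run rather than submitting a job.

```shell
# Hypothetical topic/table pairs, one per line: "<topic> <table>".
RUNS="db.agent_core.agentToDo agent_core:agentToDo
db.agent_core.agentToDoArchive agent_core:agentToDoArchive"

# Placeholder for the real job submission; here it only prints the plan.
run_hdl() {
  local topic="$1" table="$2"
  echo "HDL run: topic=${topic} table=${table}"
}

# Reads "<topic> <table>" lines on stdin and starts one run per line,
# so that every run targets exactly one table.
run_all() {
  local topic table
  while read -r topic table; do
    [ -n "$topic" ] || continue   # skip blank lines
    run_hdl "$topic" "$table"
  done
}

printf '%s\n' "$RUNS" | run_all
```

In a real driver, run_hdl would set the per-run environment variables and submit the job instead of echoing.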

Local running

The application can only run on EMR. Until we have a licence for localstack pro (which provides EMR), testing will be performed by the end to end tests.

Running in the development environment.

Build the jar

gradle build

Transfer the jar to S3 (replace development-bucket with an actual writable bucket in S3)

aws --profile dataworks-development s3 cp ./build/libs/historic-data-loader-1.0-SNAPSHOT-all.jar s3://development-bucket/

Transfer the script to run the jar to the same bucket (one time activity)

aws --profile dataworks-development s3 cp ./resources/scripts/run.sh s3://development-bucket/

Log on to the HBase master

aws ssm ....

Fetch the jar and the script

aws --profile dataworks-development s3 cp s3://development-bucket/historic-data-loader-1.0-SNAPSHOT-all.jar .
aws --profile dataworks-development s3 cp s3://development-bucket/run.sh .

Create a table to load into (must match the table name specified in the script)

hbase shell
> create 'agent_core:agentToDo', { NAME => 'cf', VERSIONS => 100 }

.. or truncate an existing table

> truncate 'agent_core:agentToDo'

.. or do nothing and use an existing table.

Kick off the job

./run.sh

The script is very rough and ready and only intended for kicking off dev runs; alter the exported environment variables in it to suit your needs.

Check the logs

First you need the application ID, which looks something like application_1601048545520_0023. Look for a line like this on the console after the run:

20/10/01 08:45:43 INFO impl.YarnClientImpl: Submitted application application_1601048545520_0023

Then to see the logs:

yarn logs --applicationId <id-determined-above> 
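
If the console output has been captured, the application ID can be pulled out rather than copied by hand. The helper below is a minimal sketch; the function name and the run.log file in the usage comment are assumptions made for illustration.

```shell
# Reads console output on stdin and prints the first YARN application ID found.
extract_application_id() {
  grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Usage sketch (run.log is a hypothetical file holding the console output):
#   ./run.sh 2>&1 | tee run.log
#   yarn logs --applicationId "$(extract_application_id < run.log)"
```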

AWS Deployment notes

| Name | Purpose | Default value | Notes |
|------|---------|---------------|-------|
| AWS_REGION | Location of infrastructure | eu-west-2 | Default probably suitable for deployed instances. |
| AWS_USE_LOCALSTACK | Indicates whether the code is running in a localstack environment (for integration tests) | false | Can use default for deployed instances. |
| HBASE_TABLE | The table to which the data should be written | data | Needs to target a single table for each run, so to some extent needs to be able to be set dynamically by whatever initiates a load. |
| MAP_REDUCE_OUTPUT_DIRECTORY | Where the HFiles should be written to, and from where HBase will pick them up when it is directed to adopt them | /user/hadoop/bulk | Will need to be set uniquely for each topic, and must not be an existing directory in HDFS. |
| S3_BUCKET | The bucket containing the objects of previously streamed records | corporatestorage | Will need to be set explicitly for AWS deployed instances. |
| S3_MAX_CONNECTIONS | How many concurrent S3 connections to allow | 1000 | Default probably OK. |
| S3_PREFIX | The path to the files to be restored | data | Should be one table's worth of data. |
| TOPIC_NAME | The name of the topic whose files are being reloaded | | Needed to filter the S3 object list down to one topic's worth of files. |
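
Since MAP_REDUCE_OUTPUT_DIRECTORY has to be unique for each topic and must not already exist, one option is to derive it from the topic name plus a run identifier. The following is a sketch under assumptions: the topic name, the timestamp-based run ID and the directory layout are illustrative and not something HDL prescribes.

```shell
# Sketch only: the topic name, run ID format and directory layout are
# assumptions for illustration, not part of HDL.

# Prints a per-topic, per-run directory beneath the default /user/hadoop/bulk.
output_dir_for_topic() {
  local topic="$1" run_id="$2" safe_topic
  # Replace anything other than letters, digits, '.', '_' and '-' with '_'.
  safe_topic=$(printf '%s' "$topic" | tr -c 'A-Za-z0-9._-' '_')
  printf '/user/hadoop/bulk/%s/%s\n' "$safe_topic" "$run_id"
}

export TOPIC_NAME="db.agent_core.agentToDo"   # hypothetical topic name
export HBASE_TABLE="agent_core:agentToDo"
MAP_REDUCE_OUTPUT_DIRECTORY=$(output_dir_for_topic "$TOPIC_NAME" "$(date +%Y%m%d%H%M%S)")
export MAP_REDUCE_OUTPUT_DIRECTORY
echo "$MAP_REDUCE_OUTPUT_DIRECTORY"
```

On the cluster, running hdfs dfs -test -d against the chosen path before the job starts is one way to confirm that the directory does not already exist.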

Contributors

connoravo, danielchicot, dataworks-ci, dependabot[bot], snyk-bot, steveburton4
