
Data Engineer Challenge


Installation

  • docker
  • docker-compose
  • bash (optional)

Setup

  • build the project
    $ make start
  • initialize the database
    $ make init_db
  • add calendar years
    $ make calendar start=2018 end=2023

Ingestion script

  • customer.csv

    $ make customer FILE=/app/resources/customer.csv
  • transaction.csv

    $ make transaction FILE=/app/resources/transaction.csv
  • exchange_rate.csv

    $ make exchange_rate FILE=/app/resources/exchange_rate.csv


Information System Architecture

Design a Data Flow architecture

[Data Flow architecture diagram]

Which technologies do you propose? What are the advantages and disadvantages of those?

  • API - receives requests from other platforms and provides a single entry point for triggering the pipeline. The API code has access to metadata tables, which show the progress of the data journey: ingestion, validation, transformation, and extraction.
  • Metadata storage - stores the details of each request received to ingest/export data. It can also capture key attributes from each request to derive insights at later stages, and it enables logging of the data received and overrides (versioning) when requested.
  • SQS/RabbitMQ - any queueing mechanism; the API registers each request in the queue, to be picked up as soon as a processor becomes available.
  • Processor - the key engine that listens for messages arriving on SQS/RabbitMQ and spawns the process needed to execute each request.
  • Python script
    • Loader - request-specific, and can be extended to incorporate more data source types such as Salesforce, Kafka, S3, direct upload, and other databases.
    • Staging - this layer holds the raw data in its as-is form.
    • Pipeline - a set of queries written to validate, clean, and transform data before ingesting it into fact, dimension, and lookup tables (see the sketch after this list).
  • Views - virtual tables that abstract the necessary joins and aggregations so the data can be accessed via Tableau or other BI tools.
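
A minimal sketch of such a pipeline step, assuming a staging_transaction table with an is_valid flag; all table and column names here are illustrative, not the repo's actual DDL:

    -- Hypothetical pipeline step: mark invalid rows in staging, then load
    -- only the valid ones into the fact table.
    UPDATE staging_transaction
    SET is_valid = FALSE
    WHERE amount IS NULL OR transaction_date IS NULL;

    INSERT INTO fact_transaction (customer_id, date_id, currency_id, amount)
    SELECT c.customer_id, d.date_id, cur.currency_id, s.amount
    FROM staging_transaction s
    JOIN dim_customer c   ON c.customer_code = s.customer_code
    JOIN dim_calendar d   ON d.date = s.transaction_date
    JOIN dim_currency cur ON cur.currency_code = s.currency_code
    WHERE s.is_valid;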

Data Modelling

Design a relational data model to represent the given datasets. It should be optimized for analytical purposes.

ER diagram file name: /sql/db_setup.sql

Explain the modeling approach taken. Is there any alternative to the model chosen (trade-offs, limitations, advantages)?

I took a snowflake schema approach, introducing two fact tables to store the exchange rate and transaction data, along with the following dimensions:

  • dim_country - stores distinct country records
  • dim_currency - stores distinct currencies
  • dim_calendar - an indexed table with date, month, and year columns
  • dim_customer - stores distinct customers so they can be referenced from the fact tables

A lookup table (map_country_currency) is also introduced to capture the dependency between a customer's country and the currency of their outgoing transactions, derived from the outgoing transaction data (per customer), as sketched below.
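
A hedged sketch of how that lookup table could be populated from outgoing transactions; the direction flag and column names are assumptions, not the repo's actual schema:

    -- Pair each customer's country with the currencies of their outgoing
    -- transactions (illustrative names; the 'OUTGOING' flag is assumed).
    INSERT INTO map_country_currency (country_id, currency_id)
    SELECT DISTINCT cu.country_id, t.currency_id
    FROM fact_transaction t
    JOIN dim_customer cu ON cu.customer_id = t.customer_id
    WHERE t.direction = 'OUTGOING';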

  • Trade-offs:
    • query complexity - because the data is normalized, fetching it requires several joins
    • join overhead - which makes proper indexing and query optimization essential
  • Limitations:
    • transformations are required before data can be ingested.
    • validations/updates must be in place on the dimension tables for the data to be ingested.
  • Advantages:
    • reduced data redundancy - redundant attributes are moved out into dimension tables.
    • improved data consistency - updates to dimension data are better controlled.

Propose a primary key implementation for each table, elaborating on the solution adopted.

  • dim_country - country_id
  • dim_currency - currency_id
  • map_country_currency - map_country_currency_id
  • dim_calendar - date_id
  • dim_customer - customer_id
  • fact_transaction - transaction_id
  • fact_exchange_rate - exchange_rate_id
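
The listed keys follow a surrogate-key pattern. One reasonable implementation (assuming PostgreSQL syntax and an illustrative natural key) pairs the auto-incrementing surrogate with a unique natural key, so joins stay cheap while duplicates are still rejected:

    -- Assumed DDL: surrogate key for joins, natural key for uniqueness.
    CREATE TABLE dim_currency (
        currency_id   SERIAL PRIMARY KEY,      -- surrogate key
        currency_code CHAR(3) UNIQUE NOT NULL  -- natural key, e.g. 'EUR'
    );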

Python

Explain why your code is performant or how it can be improved.

I use a staging table to ingest the data as-is and then validate it inside the database. Because validation happens in the staging table and only validated data is loaded into the production schema, the queries can scale with the database.
This keeps the code performant by pushing validation and transformation down into the staging layer (an ELT approach), as sketched below.
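
A sketch of that split, assuming PostgreSQL and a staging_transaction table with illustrative columns: the Python side only bulk-loads the raw file, while validation and transformation run as set-based SQL like the pipeline step shown earlier.

    -- Bulk-load the raw CSV into staging in a single pass (assumed columns).
    COPY staging_transaction (customer_code, transaction_date, currency_code, amount)
    FROM '/app/resources/transaction.csv' WITH (FORMAT csv, HEADER true);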


SQL

  • Create a view which shows the total amount in EUR, number of transactions and number of unique users per month and country.
    view name: transaction_summary_view
    file name: /sql/db_setup.sql

  • Based on a date, show the evolution of the exchange rate for each currency (based on GBP) in percentage for the next 15 days
    view name: exchange_rate_view
    file name: /sql/db_setup.sql
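
The actual view definitions live in /sql/db_setup.sql. For illustration, a sketch of the shape transaction_summary_view could take, assuming an amount_eur column (the EUR conversion via fact_exchange_rate is omitted here) and the dimension names above:

    -- Illustrative only; see /sql/db_setup.sql for the real definition.
    CREATE VIEW transaction_summary_view AS
    SELECT cal.year,
           cal.month,
           co.country_name,
           SUM(t.amount_eur)             AS total_amount_eur,
           COUNT(*)                      AS transaction_count,
           COUNT(DISTINCT t.customer_id) AS unique_users
    FROM fact_transaction t
    JOIN dim_calendar cal ON cal.date_id = t.date_id
    JOIN dim_customer cu  ON cu.customer_id = t.customer_id
    JOIN dim_country co   ON co.country_id = cu.country_id
    GROUP BY cal.year, cal.month, co.country_name;

exchange_rate_view would follow a similar pattern, using a window function to express each of the 15 days after the chosen base date as a percentage of that date's GBP-based rate.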


Issues

documentation

  • code setup guide
  • script execution steps
  • design data flow architecture
    • ELT approach
    • propose technologies
    • advantages
    • disadvantages
  • ER diagram
  • explain data model approach

SQL views

  • Create a view which shows the total amount in EUR, number of transactions and number of unique users per month and country.
  • Based on a date, show the evolution of the exchange rate for each currency (based on GBP) in percentage for the next 15 days

ingestion script

ingestion script to load

  • customer.csv
  • exchange_rate.csv
  • transaction.csv

features:

  • validation
  • clean-up (deduplication) of data
  • enable logging to store time taken

data modeling

Design a relational data model to represent the given datasets. It should be optimized for analytical purposes.

  • Consider best practices for historization (see the sketch below).
  • Requirement: there is a dependency between the country of a customer and the currency of an outgoing transaction
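
Historization is commonly handled with a type-2 slowly changing dimension; a minimal sketch on dim_customer, with assumed columns and a hypothetical customer id (not the repo's actual DDL):

    -- Hypothetical SCD type-2 columns: a change closes the current row
    -- and inserts a new one, preserving history.
    ALTER TABLE dim_customer
        ADD COLUMN valid_from DATE NOT NULL DEFAULT CURRENT_DATE,
        ADD COLUMN valid_to   DATE,                  -- NULL = current row
        ADD COLUMN is_current BOOLEAN NOT NULL DEFAULT TRUE;

    -- Closing the old row when, e.g., customer 42 changes country:
    UPDATE dim_customer
    SET valid_to = CURRENT_DATE, is_current = FALSE
    WHERE customer_id = 42 AND is_current;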
