Investify Analytics

Need

An investment advisory company, Investify, has decided to introduce more automation and monitoring into their data warehouse ETL pipelines. They also want a single source of truth on the cloud for publicly traded stock information so their analysts have a reliable basis for their investment decisions.

  • They have two data sources: one that provides daily stock price information in CSV format and another that provides stock descriptive information in JSON format. The stock price source has "ticker", "open", "close", "volume", "date", etc. for each publicly traded stock; the stock information source has the full name, exchange, sector, and industry for each stock ticker.

  • Investment advisers at Investify have basic to intermediate SQL skills, and they need to be able to query the stock information as needed from a central data warehouse to serve their clients' needs. Their queries will analyze the volume or price of stocks across dimensions such as exchange, industry, sector, date, and stock name (see the query sketch after this list). They expect the data to be clean, the queries to be highly performant, and the solution to scale with more users and more data over time.

  • The new solution needs to create a star schema with 1 fact table and 5 dimension tables, and the data warehouse needs to be available for their query needs. There is an expectation of over 1 million rows of historical information, growing to 100 million rows over time, and of more than 100 analysts accessing this data warehouse as part of their daily jobs. The new solution should be schedulable to run automatically at 7 am every morning and finish its ETL process in less than 5 minutes.
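
For illustration, a query of the kind the advisers would run might look like the sketch below. This is a minimal sketch only: the table and column names (fact_stock_price, dim_sector, dim_date, close_price, and so on) and the connection details are assumptions about the eventual star schema, not the project's actual design.

```python
# Minimal sketch of an analyst query against the warehouse, run with psycopg2
# (Amazon Redshift speaks the PostgreSQL wire protocol). All table/column
# names and connection details below are illustrative assumptions.
import psycopg2

SAMPLE_QUERY = """
    SELECT s.sector,
           d.year,
           AVG(f.close_price) AS avg_close,
           SUM(f.volume)      AS total_volume
    FROM   fact_stock_price f
    JOIN   dim_sector s ON f.sector_id  = s.sector_id
    JOIN   dim_date   d ON f.trade_date = d.trade_date
    GROUP  BY s.sector, d.year
    ORDER  BY s.sector, d.year;
"""

conn = psycopg2.connect(
    host="<redshift-cluster-endpoint>",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="<user>",
    password="<password>",
)
with conn, conn.cursor() as cur:
    cur.execute(SAMPLE_QUERY)
    for sector, year, avg_close, total_volume in cur.fetchall():
        print(sector, year, avg_close, total_volume)
conn.close()
```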

Solution

Historical stock source data resides in S3, and new source data is added to S3 on a daily basis. Our technology team picked Amazon Redshift for the data warehouse because of the storage requirements, the size of the data, and the need to ETL large amounts of data quickly. Apache Airflow was selected to automate, schedule, and monitor the ETL workflow.

  • Input Data Model: there are two datasets available for processing (a staging sketch follows this list).
    • Stock price information: the S3 link is s3://akcapstone/priceinfo. The price information dataset is in CSV format, and a sample line from it looks like AHH,11.5,11.5799999237061,8.49315452575684,11.25,11.6800003051758,4633900,2013-05-08; these values correspond to the ticker, open, close, adj_close, low, high, volume, and date of a single publicly traded stock.
    • Stock descriptive information: the S3 link is s3://akcapstone/descinfo. The stock descriptive information dataset is in JSON format, and a sample record from it looks like {"ticker":"PIH","exchange":"NASDAQ","name":"1347 PROPERTY INSURANCE HOLDINGS, INC.","sector":"FINANCE","industry":"PROPERTY-CASUALTY INSURERS"}. The ticker in each record will have price information for multiple dates in the other file.
  • ETL Process Pipeline
  • Output Data Model: star schema (see the schema sketch below).
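
As a rough illustration of the staging step, the two source datasets could be bulk-loaded from S3 into Redshift with COPY commands along the lines of the sketch below. The staging table names (staging_price_info, staging_desc_info) and the IAM role placeholder are assumptions for illustration; the actual names come from the project's create_tables.sql, and the project itself stores its AWS credentials in the aws_credentials Airflow connection.

```python
# Minimal sketch of Redshift COPY statements for the two inputs, kept as
# Python strings so they could be executed from an Airflow task or psycopg2.
# Staging table names and the IAM role ARN are illustrative assumptions.
COPY_PRICE_INFO = """
    COPY staging_price_info
    FROM 's3://akcapstone/priceinfo'
    IAM_ROLE '<redshift-iam-role-arn>'
    FORMAT AS CSV
    REGION 'us-east-1';
"""

COPY_DESC_INFO = """
    COPY staging_desc_info
    FROM 's3://akcapstone/descinfo'
    IAM_ROLE '<redshift-iam-role-arn>'
    FORMAT AS JSON 'auto'
    REGION 'us-east-1';
"""
```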
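
The star schema itself (1 fact table and 5 dimension tables) might be shaped roughly as in the sketch below. This is a minimal sketch inferred from the source fields listed above; all table names, column names, and types are assumptions, not the project's actual DDL.

```python
# Minimal sketch of a possible star schema DDL for Redshift, kept as a Python
# string so it could be executed from an Airflow task or psycopg2.
# Every name and type here is an illustrative assumption.
STAR_SCHEMA_DDL = """
CREATE TABLE IF NOT EXISTS dim_stock (
    ticker      VARCHAR(10)  PRIMARY KEY,
    stock_name  VARCHAR(256)
);

CREATE TABLE IF NOT EXISTS dim_exchange (
    exchange_id INT IDENTITY(0,1) PRIMARY KEY,
    exchange    VARCHAR(64)
);

CREATE TABLE IF NOT EXISTS dim_sector (
    sector_id   INT IDENTITY(0,1) PRIMARY KEY,
    sector      VARCHAR(64)
);

CREATE TABLE IF NOT EXISTS dim_industry (
    industry_id INT IDENTITY(0,1) PRIMARY KEY,
    industry    VARCHAR(128)
);

CREATE TABLE IF NOT EXISTS dim_date (
    trade_date  DATE PRIMARY KEY,
    year        SMALLINT,
    month       SMALLINT,
    day         SMALLINT
);

CREATE TABLE IF NOT EXISTS fact_stock_price (
    ticker      VARCHAR(10) REFERENCES dim_stock(ticker),
    exchange_id INT         REFERENCES dim_exchange(exchange_id),
    sector_id   INT         REFERENCES dim_sector(sector_id),
    industry_id INT         REFERENCES dim_industry(industry_id),
    trade_date  DATE        REFERENCES dim_date(trade_date),
    open_price  DECIMAL(18,6),
    close_price DECIMAL(18,6),
    adj_close   DECIMAL(18,6),
    low_price   DECIMAL(18,6),
    high_price  DECIMAL(18,6),
    volume      BIGINT
);
"""
```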

Instructions to run the code

  • Start Airflow
  • Set up Redshift in the us-east-1 region, since the source S3 bucket is in that region
  • Run the create_tables.sql script in the Redshift database to create the two staging tables
  • Set up two connections in Airflow (one for S3 storage called aws_credentials and another for Redshift called redshift)
  • Run the DAG from Airflow to execute the ETL pipeline, which creates the fact and dimension tables (see the DAG sketch after this list)
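
A minimal sketch of how the scheduled pipeline might be wired in Airflow is shown below, assuming Airflow 2.x. The dag_id, task names, and callables are placeholders for illustration; the project's actual DAG and operators live in the repository and may differ.

```python
# Minimal sketch of an Airflow DAG scheduled to run daily at 7 am.
# dag_id, task names, and callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_from_s3(**_):
    # Placeholder: would issue the COPY statements against Redshift,
    # e.g. through a hook that uses the "redshift" and "aws_credentials"
    # connections set up above.
    pass


def load_star_schema(**_):
    # Placeholder: would INSERT ... SELECT from the staging tables into the
    # fact and dimension tables.
    pass


with DAG(
    dag_id="investify_etl_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 7 * * *",  # every morning at 7 am
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_from_s3", python_callable=stage_from_s3)
    load = PythonOperator(task_id="load_star_schema", python_callable=load_star_schema)

    stage >> load
```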

ETL test performance

  • The project includes a zip file with the publicly traded stock data in it; this CSV file has over 1.3 million rows of data. The project's ETL pipeline was able to move this file completely from S3 into Redshift in around 1 minute and 30 seconds.
