Giter Club home page Giter Club logo

udacity-data-engineering-capstone's Introduction

Project description

  • As more and more immigrants move to the US, people want quick and reliable ways to access certain information that can help inform their immigration, such as weather of the destination, demographics of destination. And for regulators to keep track of immigrants and their immigration meta data such as visa type, visa expire date, entry method to the US.

  • Using the available data sources listed above, we build a Data Lake available on S3 that can be used to query for weather and demographics of popular immigration destinations, which could be useful for both immigrants and regulators. Regulators can also access data about individual immigrants, date of arrival to the US, visa expiry dates and method of entries to improve decision making.

Data sources

  • I94 Immigration Data: This data comes from the US National Tourism and Trade Office Source. This data records immigration records partitioned by month of every year.
  • World temperature Data: This dataset comes from Kaggle Source. Includes temperature recordings of cities around the world for a period of time
  • US City Demographic Data: This dataset comes from OpenSoft Source. Includes population formation of US states, like race and gender.
  • Aiport Code table: Source. Includes a collection of airport codes and their respective cities, countries around the world.

Data Lake Star Schema designs

The decision to build a Data Lake on S3 is due to:

  • Ease of schema design, rely on schema-on-read
  • Flexibility in adding / removing additional data
  • Availability to a wide range of users with access to the S3 bucket

Table designs

  1. Normalized us city: built on city code data from raw airport and demographics data
  2. Normalized us airport: built on raw airport data, filtered for US airports, joined with city table to get city_id
    • Each airport can either be identified by icao_code or iata_code or both
  3. Normalized country: built on country codes from I94 immigration dictionary
  4. Normalized us state code: built on state codes from I94 immigration dictionary
  5. Normalized us weather: built on global weather data, filtered for US cities, joined with city table to get city_id
    • Filtered for weather data in the US at the latest date
    • The raw weather data only has temperatures on a fraction of all the cities in the US
    • Some cities in the data are duplicates, but that means they're on different states of the US. However the state is not available in the data, but instead we have the latitude - longitude coordinates. This issue is currently NOT addressed in this project, but in a production setting,we should join the latitude - longitude from weather dataset with this data, which includes city coordinates and their respective states
  6. Normalized us demographics: built on raw demographics data, not much transform is needed
  7. Denormalized airport weather: Joining weather data with airport location, to get the respective weather information for each US airports
  8. Normalized immigrant: Information about individual immigrants, like age, gender, occupation, visa type, built on I94 Immigration dataset
  9. Normalized immigration table: Information about immigration information, such as date of arrival, visa expiry date, airport, means of travel
  10. Denormalized immigration demographics: Joining immigration with demographics data, to get the population of places where immigrants go

AWS Infrastructure

  • The AWS infrastructure is set up according to this tutorial
  • Upload the CloudFormation script to create the resources, such as EC2 instance, RSD database for Airflow, security groups, S3 bucket
  • Then connect to the EC2 instance:
sudo su
cd ~/airflow
source ~/.bash_profile
bash start.sh
bash run_dags.sh

ETL

  • dag_cluster: start the EMR cluster, and wait for all data transformation is finished, then terminate the cluster

alt text

  • dag_normalize: wait for EMR cluster to be ready, then use Apache Livy REST API to create interactive Spark session on the cluster, submit a Spark script to read data from S3, do transformation and write the output to S3
    • This DAG handles normalized tables

alt text

  • dag_analytics: wait for EMR cluster to be ready, and that normalized tables are processed, then read normalized tables to create analytics tables, and write to S3
    • This DAG handles immigration data, which is partitioned for 12 months from jan-2016 to dec-2016
    • To re-run this DAG, change the DAG name, then delete the existing DAG from Airflow, and refresh the UI

alt text

Possible errors

  • Livy session NOT started, restart EMR, restart Airflow scheduler

Scenarios

  • Data increase by 100x. read > write. write > read

    • Redshift: Analytical database, optimized for aggregation, also good performance for read-heavy workloads
    • Cassandra: Is optimized for writes, can be used to write online transactions as it comes in, and later aggregated into analytics tables in Redshift
    • Increase EMR cluster size to handle bigger volume of data
  • Pipelines would be run on 7am daily. how to update dashboard? would it still work?

    • DAG retries, or send emails on failures
    • daily intervals with quality checks
    • if checks fail, then send emails to operators, freeze dashboard, look at DAG logs to figure out what went wrong
  • Make it available to 100+ people

    • Redshift with auto-scaling capabilities and good read performance
    • Cassandra with pre-defined indexes to optimize read queries

udacity-data-engineering-capstone's People

Stargazers

 avatar  avatar biZsmith avatar  avatar Chris Seal avatar Renee Liu avatar Jason Huang avatar Jacob Dodd avatar Venkatraghavan Ramesh avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.