azure-databricks-etl-project

ETL motor racing data

#### The data is from the Ergast website.

The data is available through an API, as downloadable CSVs, and as nested or non-nested JSON files. Azure Databricks (which runs on Apache Spark), Azure notebooks, and Azure Data Lake Storage are the main tools for this ETL project.

In this project, I focused on extracting from the CSV and JSON files for my ETL. This can be done on a free Azure trial from Microsoft.

Here is a quick diagram of the high-level plan.

etl_motor_racing_1

Quick Overview of my ETL Processes

Purple blocks show columns that were renamed and/or transformed. Red blocks show columns that were dropped. Green blocks show columns that were added.

etl_motor_racing_2

etl_motor_racing_3
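The rename, drop, and add steps in the diagrams can be sketched as one small helper. This is a minimal illustration, not the code from the project notebooks: `apply_column_plan` is a hypothetical name, and the column names in the usage note are assumptions. The helper only calls the standard DataFrame methods `withColumnRenamed`, `drop`, and `withColumn`.

```python
def apply_column_plan(df, renames, drops, additions):
    """Apply rename, drop, and add steps to a Spark DataFrame.

    renames:   dict of old column name -> new column name (purple blocks)
    drops:     list of column names to remove (red blocks)
    additions: dict of new column name -> Column expression (green blocks)
    """
    # Rename first, so later steps can refer to the new names.
    for old_name, new_name in renames.items():
        df = df.withColumnRenamed(old_name, new_name)

    # Drop unwanted columns in a single call.
    if drops:
        df = df.drop(*drops)

    # Add derived or audit columns last.
    for name, expression in additions.items():
        df = df.withColumn(name, expression)

    return df
```

In a notebook, a call might look like `apply_column_plan(circuits_df, {"circuitId": "circuit_id"}, ["url"], {"ingestion_date": current_timestamp()})`, with `current_timestamp` imported from `pyspark.sql.functions`. Here `circuits_df` and the column names are placeholders.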

Both horizontal and vertical scaling are possible, but a larger budget would be needed to take full advantage of Azure Databricks.

etl_motor_racing_4

Below are a few snapshots. The reproducible Databricks files are available in the folder.

Creating secure secret keys, then connecting to, creating, and mounting the empty raw folder

etl_adls_notebook_1
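As a rough sketch of the secret-and-mount step, assuming a service principal with OAuth access to ADLS Gen2: the helper names, the mount point layout, and the scope and key names in the usage note are all hypothetical. `dbutils` is the object Databricks predefines in notebooks, so it is passed in as a parameter here.

```python
def build_oauth_configs(client_id, tenant_id, client_secret):
    """Return Spark configs for service-principal (OAuth) access to ADLS Gen2."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_container(dbutils, storage_account, container, configs):
    """Mount a container under /mnt and return the mount point."""
    # Hypothetical mount point layout: /mnt/<account>/<container>
    mount_point = f"/mnt/{storage_account}/{container}"

    # Skip the mount if this path is already mounted.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return mount_point

    dbutils.fs.mount(
        source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
    return mount_point
```

In a notebook, the three credentials would typically come from a secret scope rather than plain text, for example `dbutils.secrets.get(scope="racing-scope", key="client-id")`, where the scope and key names are placeholders.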

Uploading raw files to the Data Lake Storage raw folder

etl_adls_notebook_2

Reading the JSON file into a Spark DataFrame

etl_adls_notebook_3
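A minimal sketch of reading a JSON file with an explicit schema, which may differ from the notebook in the screenshot. The schema is written as a Spark DDL string, so the helper needs no imports. The field names and the `read_json` helper are assumptions for illustration, and `spark` is the session object Databricks provides in notebooks.

```python
# Assumed schema in Spark DDL form; the nested "name" struct shows how a
# nested JSON object can be declared. Field names are illustrative.
DRIVERS_SCHEMA = (
    "driverId INT, driverRef STRING, "
    "name STRUCT<forename: STRING, surname: STRING>, "
    "dob DATE, nationality STRING"
)


def read_json(spark, path, schema, multiline=False):
    """Read a JSON file into a DataFrame using an explicit schema."""
    reader = spark.read.schema(schema)
    if multiline:
        # Needed when a single JSON document spans several lines.
        reader = reader.option("multiLine", True)
    return reader.json(path)
```

Supplying the schema up front avoids the extra pass over the data that schema inference requires, and it keeps column types stable between runs.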

Writing the output to a Parquet file

etl_adls_notebook_4
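The write step could look roughly like the sketch below. `write_parquet` is a hypothetical helper, and the folder layout (one subfolder per table under a base path) is an assumption rather than the project's actual layout.

```python
def write_parquet(df, base_path, table_name, mode="overwrite"):
    """Write a DataFrame as Parquet under base_path/table_name and return the path."""
    # Normalise the base path so there is exactly one slash before the table name.
    target = f"{base_path.rstrip('/')}/{table_name}"

    # "overwrite" replaces earlier output, so the notebook can be rerun safely.
    df.write.mode(mode).parquet(target)
    return target
```

Using `overwrite` keeps reruns idempotent; `append` would suit incremental loads instead.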

Contributors

randyroac, ranroac
