Data Engineering Zoomcamp

  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

Goal: Orchestrating a job to ingest web data to a Data Lake in its raw form.

  • Data Lake (GCS)
    • Basics, What is a Data Lake
    • ELT vs. ETL
    • Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)
  • Orchestration (Airflow)
    • Basics
      • What is an Orchestration Pipeline?
      • What is a DAG?
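
As a rough illustration of the DAG idea, here is a minimal sketch using only Python's standard library, not Airflow itself. The task names are hypothetical and stand in for an ingest job like the one in the goal above; the point is only that a DAG lets a scheduler derive a run order in which every task comes after the tasks it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names (assumptions for illustration only).
# Each key maps to the set of tasks that must finish before it can run.
dag = {
    "download_web_data": set(),
    "upload_raw_to_gcs": {"download_web_data"},
    "load_to_bigquery": {"upload_raw_to_gcs"},
}


def execution_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

If the graph contained a cycle (a task that indirectly depends on itself), `static_order()` would raise `graphlib.CycleError`, which is why an orchestration pipeline has to be acyclic.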

Goal: Structuring data into a Data Warehouse

  • Data warehouse (BigQuery)
    • What is a data warehouse solution
    • What is BigQuery, why is it so fast, cost of BQ
    • Partitioning and clustering, automatic re-clustering
    • Pointing to a location in Google Cloud Storage
    • Loading data to BigQuery & PG (10 min) -- using Airflow operator?
    • BQ best practices
    • Misc: BQ Geo location, BQ ML
    • Alternatives (Snowflake/Redshift)

Goal: Transforming Data in DWH to Analytical Views

  • Basics
    • What is DBT?
    • ETL vs ELT
    • Data modeling
    • Where DBT fits in the tech stack
  • Usage (Combination of coding + theory)
    • Anatomy of a dbt model: written code vs compiled Sources
    • Materialisations: table, view, incremental, ephemeral
    • Seeds
    • Sources and ref
    • Jinja and Macros
    • Tests
    • Documentation
    • Packages
    • Deployment: local development vs production
    • DBT Cloud: scheduler, sources and data catalog (Airflow)
  • Google Data Studio -> Dashboard
  • Extra knowledge:
    • DBT CLI (local)

Goal:

  • Distributed processing (Spark)
    • What is Spark, Spark cluster
    • Explaining potential of Spark
    • What are broadcast variables, partitioning, shuffle
    • Pre-joining data
    • Use case
    • What else is out there (Flink)
  • Extending Orchestration env (Airflow)
    • BigQuery on Airflow
    • Spark on Airflow

Goal:

  • Basics
    • What is Kafka
    • Internals of Kafka, broker
    • Partitioning of Kafka topic
    • Replication of Kafka topic
  • Consumer-producer
  • Schemas (Avro)
  • Streaming
    • Kafka Streams
  • Kafka Connect
  • Alternatives (PubSub/Pulsar)
  • Putting everything we learned to practice

Duration: 2-3 weeks

  • Upcoming buzzwords
    • Delta Lake/Lakehouse
    • Databricks
    • Apache Iceberg
    • Apache Hudi
    • Data mesh
    • ksqlDB
    • Streaming analytics
    • MLOps
