Data Engineering Roadmap
==========================

Welcome to our Data Engineering Roadmap!

This roadmap is designed to help you navigate the world of data engineering, from the fundamentals to advanced topics. Whether you're a beginner or an experienced professional, this roadmap will guide you through the key concepts, tools, and technologies you need to master.

Data Engineering Fundamentals


  • Introduction to Data Engineering
  • Data Engineering Lifecycle:
    • Data Collection
    • Data Ingestion
    • Data Storage and Management
    • Data Transformation
    • Data Serving
  • Overview of Data Pipelines (ETL/ELT)
  • Overview of Data Modeling
  • Overview of Cloud Data Engineering
  • Soft Skills for Data Engineers
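
To make the lifecycle stages and the ETL pattern listed above a little more concrete, here is a minimal sketch using only the Python standard library. The sample data, the `orders` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical and chosen for illustration; a production pipeline would read from real sources and write to a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or database.
RAW_CSV = """order_id,amount,country
1,19.99,ng
2,,us
3,5.00,NG
"""


def extract(text):
    """Collection/ingestion: parse raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop rows with a missing amount, then fix types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned


def load(records, conn):
    """Storage: write the cleaned records into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract, transform, and load in order (the 'ETL' ordering)."""
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")
run_pipeline(RAW_CSV, conn)

# Serving: downstream consumers query the stored data.
totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In an ELT variant, the raw rows would be loaded first and the cleaning step would run inside the storage system, typically as SQL.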

Linux and Git


  • Basic Linux Commands:

    • Introduction to the Command Line
    • Creating and Navigating Directories
    • Listing Files in Directories
    • Creating and Viewing Files
    • Copying and Moving Files
    • Renaming Files
    • Absolute and Relative Paths
    • Viewing and Managing Processes
  • Git and GitHub:

    • Creating a Repo
    • Cloning a Repo
    • Git Add
    • Git Commit
    • Git Push
    • Git Branch
    • Pull Request
    • Resolving Git Conflicts
    • Creating a Git README and Documenting Projects

SQL


  • Introduction to Databases and Data Warehousing
  • Installing the Postgres Server Locally
  • Basic Queries
    • DDL - Data Definition Language
    • DML - Data Manipulation Language
    • DCL - Data Control Language
  • Joins
  • SQL Data Cleaning
  • Window Functions
  • Introduction to Advanced SQL (Subquery & CTE)
  • Creating Tables/Views (Working with Tables)
  • Stored Procedures
  • Entity-Relationship Diagrams (ERDs)

Python


  • Python Basics
    • Control Flow
    • Operators:
      • Arithmetic Operators
      • Assignment Operators
      • Comparison Operators
      • Logical Operators
      • Identity Operators
      • Membership Operators
    • Logical Statements
      • If and Else Statements
    • Loops:
      • For Loop
      • While Loop
    • Functions:
      • Normal Functions
      • Generic Functions
        • Non-Default Arguments
        • Default Arguments
      • *args and **kwargs
    • Modules & Packages:
      • In-Built Modules
      • Custom Modules
      • Packages
    • Errors and Exceptions
  • Data Structures
  • File Handling
  • Data Manipulation with Pandas
  • Database Interaction
  • API and Web Scraping
  • ETL Process and Data Pipeline
  • Version Control for Projects
  • Introduction to OOP
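
The function topics above (non-default arguments, default arguments, `*args`, and `**kwargs`) can be seen together in one short sketch. The function `build_query` and its parameters are hypothetical and exist only for illustration; building SQL by string formatting like this is not safe against injection and should not be used with untrusted input.

```python
def build_query(table, *columns, limit=10, **filters):
    """Build a simple SELECT statement as a string (illustration only).

    table     -- non-default (required) argument
    *columns  -- any number of extra positional arguments, collected in a tuple
    limit     -- default argument, used when the caller does not pass one
    **filters -- any number of extra keyword arguments, collected in a dict
    """
    selected = ", ".join(columns) if columns else "*"
    query = f"SELECT {selected} FROM {table}"
    if filters:
        conditions = " AND ".join(
            f"{key} = {value!r}" for key, value in filters.items()
        )
        query += f" WHERE {conditions}"
    return f"{query} LIMIT {limit}"


# Only the required argument: columns is empty and limit falls back to 10.
q1 = build_query("orders")

# Extra positionals go to *columns; unknown keywords go to **filters.
q2 = build_query("orders", "id", "amount", limit=5, country="NG")
```

Because `limit` is declared after `*columns`, it can only be passed by keyword, which keeps it from being mistaken for a column name.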

Data Modeling


  • Fundamental Concepts
  • Basic Techniques for Dimension Tables
  • Basic Techniques for Fact Tables
  • Slowly Changing Dimensions

DBT


  • DBT Fundamentals
  • Understanding Jinja, Macros, and Testing in DBT
  • DBT Packages
  • Introduction to DBT Cloud

Docker


  • Overview of Docker and Internals of Docker
  • Dockerfile
  • Docker Images
  • Docker Containers
  • Understanding Docker Volumes
  • Docker Networking
  • Introduction to YAML
  • Docker Compose and Anchors in Docker Compose

CI/CD


  • GitHub Actions

Data Integration


  • Airbyte:
    • Airbyte Concepts:
      • Source
      • Destination
      • Connection
      • Connector
      • Sync
    • Airbyte Architecture:
      • Architecture Overview
      • WebApp
      • API Server
      • Metadata Database
      • Temporal
      • Worker
    • Running Airbyte in Docker
    • Understanding Source Configuration
    • Understanding Destination Configuration
    • Configuring a Full Synchronization between Source and Destination:
      • S3 to Postgres Database
      • Postgres Database to Redshift
    • How Sync Works Under the Hood

Orchestration


  • Introduction to Apache Airflow
  • Airflow Concepts:
    • Workflow
    • DAG
    • Task
    • Operators
    • Dependencies
  • Installation and Setup:
    • Prerequisites
    • Installation
    • Configuration
  • Airflow Architecture:
    • Architecture Overview
      • WebServer
      • Metadata Database
      • Scheduler
      • Worker
      • How Airflow Works
  • Creating Your First DAG
  • Understanding DAG Configuration
  • Understanding Task Configuration
  • Understanding Airflow Variables
  • Advanced DAG Concepts
  • Monitoring and Debugging
  • Airflow Configuration and Best Practices
  • Projects
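
The core ideas listed under Airflow Concepts (a workflow as a DAG of tasks linked by dependencies) can be sketched without Airflow itself. The example below is not the Airflow API; it assumes Python 3.9+ and uses the standard library's `graphlib` to order hypothetical tasks so that each one runs only after the tasks it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_task(name, log):
    """Stand-in for real work; records the order in which tasks run."""
    log.append(name)


execution_log = []

# static_order() yields tasks so that every dependency comes before its dependents.
for task in TopologicalSorter(dependencies).static_order():
    run_task(task, execution_log)
```

A scheduler such as Airflow adds much more on top of this ordering, including scheduling, retries, and distributing tasks to workers.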

Cloud


  • Introduction to the Cloud
  • IAM
  • Data Lake
  • Python Libraries to Interact with the Cloud
  • Data Catalog
  • Relational Database Services (RDS)
  • Data Warehouse
  • ETL Services
  • Orchestration Services
  • Compute

Spark


  • Introduction to Spark
  • Installation
  • Spark SQL and DataFrame API
  • RDDs
  • Transformations and Actions
  • Spark Streaming
  • Structured Streaming
  • Tuning and Optimization

Terraform


  • What Terraform Is
  • How Terraform Works
  • Terraform State File
  • Remote State File
  • Basic Provisioning
    • Resource Referencing
    • Data Source
    • Local Usage
  • Modules
    • Module Overview
    • Module Structure
    • Create a Simple Resource Module

Kafka


  • Apache Kafka Overview
  • Kafka Architecture
  • Kafka Topic, Partition, and Offset
  • Producer and Consumer
  • Consumer Group and Rebalancing Protocol
  • Kafka Connect

Optional


  • Introduction to Kubernetes

Contributors

dataengineering-community
