Building ETL Pipelines with Python

Create production-ready ETL pipelines with Python and open-source libraries. The book uses Pipenv for dependency management and recommends PyCharm as the Integrated Development Environment (IDE).

Table of Contents

  1. Installation
  2. Getting Started
  3. Chapter Descriptions
  4. Contributing
  5. License

Installation

To set up the development environment for Building ETL Pipelines with Python, follow the instructions below:

  1. Install Python: Ensure that Python is installed on your system. You can download the latest version of Python from the official Python website.
  2. Install Pipenv: Pipenv is used for managing dependencies. Install Pipenv by running the following command:
$ pip install pipenv
  3. Fork the Repository: Fork this repository, then clone your fork to your local machine using Git. Alternatively, download the ZIP file from the repository's main page.
  4. Install Dependencies: Some code examples in this book require additional Python packages or libraries. These dependencies are listed in the Pipfile in this GitHub repository. To install them with Pipenv, navigate to the project directory and run the following command:
$ pipenv install --dev

This will create a virtual environment and install all the required packages specified in the Pipfile.

  5. Jupyter Notebooks: Install Jupyter Notebook (https://jupyter.org/install) to open and interact with the code examples. Jupyter Notebook provides an interactive and visual environment for running Python code. You can install it using the following command:
$ pip install notebook

To start a Jupyter Notebook instance, run the following command:

$ jupyter notebook
  6. Set Up PyCharm (optional): If you prefer to use PyCharm as your IDE, follow the PyCharm installation instructions on the official JetBrains website.
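
As a quick sanity check after the installation steps, a small script along these lines can report whether the tools are visible on your PATH. This is only a sketch: `check_environment`, `REQUIRED_TOOLS`, and the minimum Python version are hypothetical names and values chosen for illustration, not part of the repository. Use the Python version pinned in the Pipfile as the real requirement.

```python
import shutil
import sys

# Hypothetical helper, not part of the repository: reports whether the
# tools named in the installation steps can be found on the current PATH.
REQUIRED_TOOLS = ("pipenv", "jupyter")  # PyCharm is optional, so it is not checked
MIN_PYTHON = (3, 8)  # assumed floor; the Pipfile is the authoritative source


def check_environment(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `jupyter` is reported as missing even though it was installed inside the Pipenv environment, run the script from within that environment.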

Getting Started

To start working with Building ETL Pipelines with Python, follow the steps below:

  1. Activate the Pipenv shell: Navigate to the repository's root directory in a terminal (for example, PyCharm's built-in terminal) and run the following command to activate the Pipenv shell:
$ pipenv shell
  2. Start Coding: Follow along with the book's chapters and the corresponding code examples in the repository. Each chapter has its own directory containing code files, exercises, and supporting materials.

Chapter Descriptions

Building ETL Pipelines with Python consists of the following chapters:

| Chapter | Title | Description |
| --- | --- | --- |
| Chapter 1 | A Primer on Python and the Development Environment | A brief overview of Python and setting up the development environment with an IDE and Git. |
| Chapter 2 | Understanding Data Pipelines and the ETL Process | Overview of the ETL process, its significance, and the difference between ETL and ELT. |
| Chapter 3 | Design Principles for ETL Pipelines | How to implement design patterns using open-source Python libraries for robust ETL pipelines. |
| Chapter 4 | Sourcing Insightful Data and Data Extraction Strategies | Strategies for obtaining high-quality data from various source systems. |
| Chapter 5 | Data Cleansing and Transformation | Data cleansing, handling missing data, and applying transformation techniques to achieve the desired data format. |
| Chapter 6 | Loading Transformed Data | Overview of best practices for data loading activities in ETL pipelines and various data loading techniques for RDBMS and NoSQL databases. |
| Chapter 7 | Tutorial: Building a Full ETL Pipeline in Raw Python | Guides the creation of an end-to-end ETL pipeline using different tools and technologies, with a PostgreSQL database as an example. |
| Chapter 8 | Powerful ETL-Specific Libraries and Tools in Python | Creating ETL pipelines using Python libraries: Bonobo, Odo, mETL, and Riko. Introduction to using big data tools: pETL, Luigi, and Apache Airflow. |
| Chapter 9 | Primer on AWS Tools for ETL Process | Explains AWS tools for ETL pipelines, including strategies for tool selection, creating a development environment, deployment, testing, and automation. |
| Chapter 10 | Tutorial: Creating Production-Grade ETL Pipelines in AWS | Guides the creation of ETL pipelines in AWS using Step Functions, Bonobo, EC2, and RDS. |
| Chapter 11 | Building a Robust Deployment Pipeline in AWS | Demonstrates using CI/CD tools to create a more resilient ETL pipeline deployment environment with AWS CodePipeline, CodeDeploy, CodeCommit, and Git integration. |
| Chapter 12 | Orchestration and Scaling ETL Pipelines | Covers scaling strategies, creating robust orchestration, and hands-on exercises for scaling and orchestration in ETL pipelines. |
| Chapter 13 | Testing ETL Pipelines | Examines the importance of ETL testing and strategies for catching bugs before production, including unit testing and external testing. |
| Chapter 14 | Best Practices for ETL Pipelines | Highlights industry best practices and common pitfalls to avoid when building ETL pipelines. |
| Chapter 15 | Use Cases and Further Reading | Practical exercises, mini-project outlines, and further reading suggestions. Includes a case study of creating a robust ETL pipeline for New York yellow taxi data and US construction market data in AWS. |
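
To give a feel for the extract, transform, and load stages that the chapters above describe, here is a minimal sketch using only the Python standard library. It is not taken from the book: the sample data, the `trips` table, and the function names are invented for illustration, and SQLite stands in for the PostgreSQL database used in Chapter 7 so that the sketch runs without a database server.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for a real source system.
RAW_CSV = """trip_id,passenger_count,fare
1,2,12.50
2,,7.00
3,1,not_a_number
4,3,20.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: default a missing passenger count to 1 and drop rows
    whose fare cannot be parsed as a number."""
    clean = []
    for row in rows:
        try:
            fare = float(row["fare"])
        except ValueError:
            continue  # skip the row rather than load bad data
        passengers = int(row["passenger_count"] or 1)
        clean.append((int(row["trip_id"]), passengers, fare))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(trip_id INTEGER PRIMARY KEY, passenger_count INTEGER, fare REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database, discarded on exit
    load(transform(extract(RAW_CSV)), conn)
    print(conn.execute("SELECT COUNT(*) FROM trips").fetchone())
```

Keeping each stage as a separate function makes it easy to swap one out, for example replacing `load` with a PostgreSQL writer, without touching the others.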

Each chapter directory contains code examples, exercises, and any additional resources required for that specific chapter.
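
Chapter 13 covers unit testing as one way to catch bugs before production. As a rough illustration of the idea, and not code from the book, the sketch below tests a single transformation step with the standard `unittest` module. `normalize_fare` and its rules (blank, invalid, or negative fares become `None`) are hypothetical.

```python
import unittest


def normalize_fare(value):
    """Hypothetical transform step: parse a fare, returning None when the
    value is missing, not numeric, or negative."""
    try:
        fare = float(value)
    except (TypeError, ValueError):
        return None
    return round(fare, 2) if fare >= 0 else None


class NormalizeFareTest(unittest.TestCase):
    def test_valid_fare_is_parsed(self):
        self.assertEqual(normalize_fare("12.5"), 12.5)

    def test_blank_or_missing_fare_is_none(self):
        self.assertIsNone(normalize_fare(""))
        self.assertIsNone(normalize_fare(None))

    def test_negative_fare_is_rejected(self):
        self.assertIsNone(normalize_fare("-3"))
```

Saved as a file such as `test_transform.py` (a name chosen here for illustration), the tests can be run with `python -m unittest test_transform`.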

Contributing

We encourage our readers to fork and clone this repository to use in tandem with each chapter. If you find any issues or have suggestions for enhancements, please feel free to submit a pull request or open an issue in the repository.

License

The Building ETL Pipelines with Python repository is released under Packt Publishing's MIT License.

Let's get started!

Contributors

emschoof, honestsoul, packt-itservice, davids-packt