skswar / data_engineering_pipelines

๐Ÿ—„๏ธ This repository is about how at No Financial Overhead, Python can be used to build ETL Data Pipelines an Automate them with Task Scheduler. ๐Ÿ‘ทโ€โ™‚๏ธ

License: GNU General Public License v3.0




This and all the following images are licensed by Sayan Swar under a Creative Commons Attribution 4.0 International License.

How to Build Efficient Data Engineering Pipelines at No Additional Software Overhead Cost?


Table of contents

  • Introduction
  • Methodology
  • Building the Pipelines
  • Results and Conclusion

Introduction

As the global business landscape increasingly transitions towards data-driven decision-making and artificial intelligence, gathering and structuring data in an organized manner is crucial for every organization. With the technologies at our disposal, building data pipelines and data storage has become easy and fast, but it is not always economically viable, especially for growing businesses and startups. Immediate investment in high-end software or cloud solutions might not be feasible, and in some cases the workload simply does not warrant large-scale software solutions and can be handled effectively with in-house, open-source tools.

In this project I illustrate the process of implementing automated data pipelines with open-source tools. I aim to demonstrate how simple solutions and proactive initiatives can empower us to start gathering data from day one and offer valuable insights for strategic decision-making.

Methodology

The basic idea behind these pipelines is simple: once a data storage area is determined, a scripting tool performs all the data manipulation and ingestion tasks, and a task scheduler automates the execution of those scripts.

Tools Used

  • Python as the scripting language, with libraries such as pyodbc/SQLAlchemy for communicating with the database and smtplib for sending email over SMTP (a minimal sketch of the notification piece follows this list)
  • SQL Server as the database
  • Windows Task Scheduler for automation
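
For illustration, here is a minimal sketch of the email-notification piece that both demos rely on, using Python's built-in smtplib. The SMTP host and addresses are placeholders, not the repository's actual configuration:

    import smtplib
    from email.message import EmailMessage

    def send_notification(subject: str, body: str) -> None:
        """Email a status message to the stakeholders."""
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "etl-bot@example.com"      # placeholder sender
        msg["To"] = "stakeholders@example.com"   # placeholder recipients
        msg.set_content(body)
        # Assumes an SMTP relay reachable on the default port (25).
        with smtplib.SMTP("smtp.example.com") as server:
            server.send_message(msg)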

Building the Pipelines

In this project I have implemented two different types of data pipelines:

  1. One in which data is loaded from one database to another, daily and incrementally (project_demo_1).
  2. One in which data is loaded from a file and upserted into a database table (project_demo_2).

Project Demo 1

In this project the requirement is to incrementally load a table from a transactional data source into a data mart that serves all data science and BI needs. As both sources live on the same server, a stored procedure was written to move the data from source to destination. A stage table was created where data is first loaded and manipulated before being pushed into the destination table. The Python script connects to the database using the pyodbc library and executes the stored procedure. On success, the script emails the stakeholders the number of records processed and the load time; if the process aborts for any reason, a failure notification is emailed instead. The finished Python script is converted into an executable file using PyInstaller, and this executable is scheduled to run daily at a specified time. The flowchart below describes the general idea of the overall process flow.
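
A minimal sketch of what such a script could look like is shown below. The connection string and stored procedure name (dbo.usp_LoadDailyDelta) are hypothetical, the procedure is assumed to SELECT the number of rows it processed, and send_notification is the helper sketched under Tools Used:

    import time

    import pyodbc

    from notify import send_notification  # helper sketched under Tools Used

    # Hypothetical connection string; the real one lives in the project files.
    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=DataMart;Trusted_Connection=yes;"
    )

    def run_incremental_load() -> None:
        start = time.time()
        try:
            conn = pyodbc.connect(CONN_STR)
            try:
                cursor = conn.cursor()
                # Hypothetical procedure that stages, transforms, and merges the
                # daily delta, then SELECTs the number of rows it processed.
                cursor.execute("EXEC dbo.usp_LoadDailyDelta")
                rows = cursor.fetchone()[0]
                conn.commit()
            finally:
                conn.close()
            send_notification(
                "Daily incremental load succeeded",
                f"{rows} records processed in {time.time() - start:.1f} s",
            )
        except Exception as exc:
            send_notification("Daily incremental load FAILED", str(exc))
            raise

    if __name__ == "__main__":
        run_incremental_load()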

Link to Project 1 files:

Project Demo 2

In this project, the goal is to load data from files into a destination database. A Python script checks whether a file is available; if so, it reads the file and performs all the necessary data transformations. To load the data, the script first truncates the stage table, then uses the SQLAlchemy library with pandas' to_sql function to bulk-load the data into the stage table, which performs considerably better than inserting row by row through a cursor. Finally, the script calls a stored procedure that upserts the data from stage into the destination table, and the processed file is moved to an archive folder. On success the script emails the stakeholders the number of records processed and the load time; if the process aborts for any reason, a failure notification is emailed instead. The finished Python script is converted into an executable file using PyInstaller, and this executable is scheduled to run daily at a specified time. The flowchart below describes the general idea of the overall process flow.
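
Again a minimal sketch under assumed names: the file paths, stage table (dbo.stg_sales), and upsert procedure (dbo.usp_UpsertSales) are hypothetical, and fast_executemany=True is what makes the SQLAlchemy/pandas load noticeably faster than row-by-row inserts:

    import shutil
    from pathlib import Path

    import pandas as pd
    from sqlalchemy import create_engine, text

    from notify import send_notification  # helper sketched under Tools Used

    # Hypothetical paths and connection URL.
    INCOMING = Path(r"C:\pipelines\incoming\sales.csv")
    ARCHIVE_DIR = Path(r"C:\pipelines\archive")
    engine = create_engine(
        "mssql+pyodbc://localhost/DataMart"
        "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
        fast_executemany=True,  # bulk insert instead of one round trip per row
    )

    def run_file_load() -> None:
        if not INCOMING.exists():
            return  # no file dropped today; nothing to do
        try:
            df = pd.read_csv(INCOMING)
            # ... file-specific data transformations would go here ...
            with engine.begin() as conn:
                conn.execute(text("TRUNCATE TABLE dbo.stg_sales"))
                df.to_sql("stg_sales", conn, schema="dbo",
                          if_exists="append", index=False)
                conn.execute(text("EXEC dbo.usp_UpsertSales"))  # stage -> destination
            ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
            shutil.move(str(INCOMING), str(ARCHIVE_DIR / INCOMING.name))
            send_notification("File load succeeded", f"{len(df)} records processed")
        except Exception as exc:
            send_notification("File load FAILED", str(exc))
            raise

    if __name__ == "__main__":
        run_file_load()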

Link to Project 2 files:

Note: Screenshots showing how to use Task Scheduler can be found in the img folder, with names task_schd_#

Results and Conclusion

The result is a set of smooth, low-to-no-cost data pipelines that keep data flowing into a data lake, data mart, or any other storage area, where it can then be used for data science and BI tasks. We do need to make sure these pipelines are scheduled in their dependency order. The pipelines will also require maintenance, and proactive action may be needed if the volume of data suddenly grows beyond what was expected.
