skswar / data_engineering_pipelines

๐Ÿ—„๏ธ This repository is about how at No Financial Overhead, Python can be used to build ETL Data Pipelines an Automate them with Task Scheduler. ๐Ÿ‘ทโ€โ™‚๏ธ

License: GNU General Public License v3.0




This and all the following images are licensed by Sayan Swar under a Creative Commons Attribution 4.0 International License.

How to Build Efficient Data Engineering Pipelines at No Additional Software Overhead Cost?


Table of contents

  • Introduction
  • Methodology
  • Building the Pipelines
  • Results and Conclusion

Introduction

As the global business landscape increasingly transitions towards data-driven decision-making and artificial intelligence, gathering and structuring data in an organized manner is crucial for every organization. With the technologies at our disposal, building data pipelines and data storage has become easy and fast, but it is not always economically viable, especially for growing businesses and startups. Immediate investment in high-end software or cloud solutions might not be feasible, and in some cases the workload simply does not warrant large-scale software solutions and can be handled effectively with in-house, open-source tools.

In this project I illustrate the process of implementing automated data pipelines with open-source tools. I aim to demonstrate how simple solutions and proactive initiatives can empower us to start gathering data from day one and offer valuable insights for strategic decision-making.

Methodology

The basic idea behind these pipelines is simple: once a data storage area is determined, a scripting tool performs all the data manipulation and ingestion tasks, and a task scheduler automates the execution of those scripts.

Tools Used

  • Python as the scripting language, with libraries such as pyodbc/SQLAlchemy for communicating with the database and smtplib for sending email over SMTP (a minimal sketch of the notification piece follows this list)
  • SQL Server as the database
  • Windows Task Scheduler for automation
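
For illustration, here is a minimal sketch of the email-notification piece that both demos rely on, using Python's built-in smtplib. The SMTP host and addresses are placeholders, not the repository's actual configuration:

    import smtplib
    from email.message import EmailMessage

    def send_notification(subject: str, body: str) -> None:
        """Email a status message to the stakeholders."""
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "etl-bot@example.com"      # placeholder sender
        msg["To"] = "stakeholders@example.com"   # placeholder recipients
        msg.set_content(body)
        # Assumes an SMTP relay reachable on the default port (25).
        with smtplib.SMTP("smtp.example.com") as server:
            server.send_message(msg)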

Building the Pipelines

In this project I have implemented two different types of data pipelines:

  1. One in which data is loaded from one database to another, daily and incrementally (project_demo_1).
  2. One in which data is loaded from a file and upserted into a database table (project_demo_2).

Project Demo 1

In this project the requirement is to incrementally load a table from a transactional data source into a data mart that serves all data science and BI needs. As both sources live on the same server, a stored procedure was written to move the data from source to destination. A stage table was created where data is first loaded and manipulated before being pushed into the destination table. The Python script connects to the database using the pyodbc library and executes the stored procedure. On success, the script emails the stakeholders the number of records processed and the load time; if the process aborts for any reason, a failure notification is emailed instead. The finished Python script is converted into an executable file using PyInstaller, and this executable is scheduled to run daily at a specified time. The flowchart below describes the general idea of the overall process flow.
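
A minimal sketch of what such a script could look like is shown below. The connection string and stored procedure name (dbo.usp_LoadDailyDelta) are hypothetical, the procedure is assumed to SELECT the number of rows it processed, and send_notification is the helper sketched under Tools Used:

    import time

    import pyodbc

    from notify import send_notification  # helper sketched under Tools Used

    # Hypothetical connection string; the real one lives in the project files.
    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=DataMart;Trusted_Connection=yes;"
    )

    def run_incremental_load() -> None:
        start = time.time()
        try:
            conn = pyodbc.connect(CONN_STR)
            try:
                cursor = conn.cursor()
                # Hypothetical procedure that stages, transforms, and merges the
                # daily delta, then SELECTs the number of rows it processed.
                cursor.execute("EXEC dbo.usp_LoadDailyDelta")
                rows = cursor.fetchone()[0]
                conn.commit()
            finally:
                conn.close()
            send_notification(
                "Daily incremental load succeeded",
                f"{rows} records processed in {time.time() - start:.1f} s",
            )
        except Exception as exc:
            send_notification("Daily incremental load FAILED", str(exc))
            raise

    if __name__ == "__main__":
        run_incremental_load()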

Link to Project 1 files:

Project Demo 2

In this project, the goal is to load data from files into a destination database. A Python script checks whether a file is available; if so, it reads the file and performs all the necessary data transformations. To load the data, the script first truncates the stage table, then uses the SQLAlchemy library with pandas' to_sql function to bulk-load the data into the stage table, which performs considerably better than inserting row by row through a cursor. Finally, the script calls a stored procedure that upserts the data from stage into the destination table, and the processed file is moved to an archive folder. On success the script emails the stakeholders the number of records processed and the load time; if the process aborts for any reason, a failure notification is emailed instead. The finished Python script is converted into an executable file using PyInstaller, and this executable is scheduled to run daily at a specified time. The flowchart below describes the general idea of the overall process flow.
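
Again a minimal sketch under assumed names: the file paths, stage table (dbo.stg_sales), and upsert procedure (dbo.usp_UpsertSales) are hypothetical, and fast_executemany=True is what makes the SQLAlchemy/pandas load noticeably faster than row-by-row inserts:

    import shutil
    from pathlib import Path

    import pandas as pd
    from sqlalchemy import create_engine, text

    from notify import send_notification  # helper sketched under Tools Used

    # Hypothetical paths and connection URL.
    INCOMING = Path(r"C:\pipelines\incoming\sales.csv")
    ARCHIVE_DIR = Path(r"C:\pipelines\archive")
    engine = create_engine(
        "mssql+pyodbc://localhost/DataMart"
        "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
        fast_executemany=True,  # bulk insert instead of one round trip per row
    )

    def run_file_load() -> None:
        if not INCOMING.exists():
            return  # no file dropped today; nothing to do
        try:
            df = pd.read_csv(INCOMING)
            # ... file-specific data transformations would go here ...
            with engine.begin() as conn:
                conn.execute(text("TRUNCATE TABLE dbo.stg_sales"))
                df.to_sql("stg_sales", conn, schema="dbo",
                          if_exists="append", index=False)
                conn.execute(text("EXEC dbo.usp_UpsertSales"))  # stage -> destination
            ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
            shutil.move(str(INCOMING), str(ARCHIVE_DIR / INCOMING.name))
            send_notification("File load succeeded", f"{len(df)} records processed")
        except Exception as exc:
            send_notification("File load FAILED", str(exc))
            raise

    if __name__ == "__main__":
        run_file_load()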

Link to Project 2 files:

Note: Screenshots showing how to use Task Scheduler can be found in the img folder, with names task_schd_#

Results and Conclusion

The result is a set of smooth, low-to-no-cost data pipelines that keep data flowing into a data lake, data mart, or any other storage area, where it can then be used for data science and BI tasks. We do need to make sure these pipelines are scheduled in their dependency order. The pipelines will also require maintenance, and proactive action may be needed if the volume of data suddenly grows beyond what was expected.
