Giter Club home page Giter Club logo

sql-etl-analytics's Introduction

Hacktoberfest 2023 project: building ETL and RAG pipelines with open source

Team's project: Extract Transform Load (ETL) pipeline of Adidas sales and further product information, with an analytics component for sales trends and successful product identification, competitive research, and more

All team members have completed all steps in the set up document.

Ploombler project: Transforming a Company through Building a ETL for Data Analysis with Ploomber

Description

We've developed an ETL (Extract, Transform, Load) pipeline for data analysis automation, specific to Adidas sales and main competitor sales, based on the provided datasets on section Data sources. This we think, can help address several important business problems and drive informed decision-making, for Adidas and for anyone in hopes of better understanding one's business as the ideas discussed here are universal.

Appart from the contribution of our own insights displayed through our EDA (Exploratory Data Analysis); With the help of the good people from Ploomber, we've build an application (Dashboard) with JupySQL + Voila as framework, with the use of a pipeline to prepare the data (ETL) automatically by tasks (see .yaml), and in the process generate subsecuent products as reports or logs (metadata), and Dashboard updates. We perform dataset extraction from 3 different Kaggle sources, cleaning, organizing and saving to an in-memory database DuckDB once prepared for storage and subsequent data analysis. Application is Dockerized as well. We used MotherDuck for in-cloud data storage for our application.

We hope our project provides Adidas with a comprehensive understanding of its sales and customer data on a simple Dashboard over an easy to mantain, modify and improve data process. This knowledge can be used to make data-driven decisions, tailor marketing strategies, optimize product offerings, improve customer satisfaction, and align business strategies with customer needs and preferences.

Data sources

The following are the used data sources, all of public domain:

  1. adidas-sales-dataset by Heemali Chaudhari licensed under CC0 1.0.
    • Adidas sales dataset is a collection of data that includes information on the sales of Adidas products. This type of dataset may include details such as the number of units sold, the total sales revenue, the location of the sales, the type of product sold, and any other relevant information.
    • It contains 9652 rows and 14 columns in total. (698.66 kB)
  2. adidas-vs-nike by Kaushik Suresh licensed under CC0 1.0.
    • Contains product information about Nike and Adidas (Adidas is further divided into sub-brands), feature information including their ratings, discount, sales price, listed price, product description, and the number of reviews.
    • It contains 3268 rows and 10 columns in total. (1.21 MB)
  3. customer-shopping-trends-dataset by Sourav Banerjee licensed under CC0 1.0.
    • The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more.
    • It contains 3900 rows and 18 columns in total. (453.25 kB)

Specific provenance is listed for all datasets in the respective Kaggle websites.

Methods

Staging

  1. Extraction, Transformation and Loading
    • Cast read datasets to dataframe using Pandas according to file extention; .csv, .xlsx.
      • Align dataset for better tabular form.
      • Formatting; Date, etc.
      • Drop unuseful data for process speed; ID, Date retrieval, etc. (This datasets do not have NaN, already seen from describe() Pandas method)
    • Save each dataset as table to local database DuckDB file logically.
      • Tables existence verification.

Back and forth between ETL and wrangle during EDA process.

  1. Wrangle. Heavy use of matplotlib for plotting, numpy and pandas for calculations and statistics
    • adidas-sales-dataset
      • Descriptive statistics
    • adidas-vs-nike
      • Descriptive statistics, i.e. Average Listing Price.
      • Discount gap.
      • Product offer amount by brand and by sub-brand.
      • Costumer satisfaction and popularity from rating and reviews.
      • Same brand best selling products.
    • customer-shopping-trends-dataset

Production

  1. Dockerize
  2. MotherDuck
  3. Voila
  4. Ploomber Cloud

User interface your project will have

User Interface of dashboard

We convert Jupyter Notebooks visualizations to interactive Dashboards with Python Voila, application is hosted on Ploomber Cloud. This way users don't need to install Python or any other dependencies to interact with the dashboard; all scripts run on the browser, users are also allowed to read and not edit the visualizations so as to mantain the integrity of the EDA. Easy to share HTML files.

We are proud to show the fininshed Dashboard for the you to see.

Team members

sql-etl-analytics's People

Contributors

fullmakeralchemist avatar delatorrena2016 avatar cchidalgog avatar lfunderburk avatar

Stargazers

 avatar

Watchers

 avatar  avatar

sql-etl-analytics's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.