STEDI - Data Lakehouse


Introduction

This project develops a data lake solution for sensor data used to train machine learning algorithms, built with AWS Glue, AWS S3, Python, and Spark.

The AWS infrastructure provides storage zones (landing, trusted, and curated), a data catalog, data transformations between zones, and queries over semi-structured data.

Datasets

  • Customer Records - from fulfillment and the STEDI website
  • Step Trainer Records - data from the motion sensor
  • Accelerometer Records - from the mobile app

Required Steps

  • Data Acquisition: AWS S3 directories were created to simulate data coming from various sources. These directories served as landing zones for customer, step trainer, and accelerometer data.

  • Data Sanitization: AWS Glue Studio jobs were written to sanitize customer and accelerometer data from their respective landing zones. The sanitized data was then stored in a trusted zone.

  • Data Verification: AWS Athena was used to query and verify the data in the Glue customer_trusted table.

  • Data Curation: Additional Glue jobs were written to further sanitize the customer data and create a curated zone that only included customers who have accelerometer data and agreed to share their data for research.

  • Data Streaming: Glue Studio jobs were created to read the Step Trainer IoT data stream and populate a trusted zone Glue table.

  • Data Aggregation: Lastly, an aggregated table was created that matched Step Trainer readings and the associated accelerometer reading data for the same timestamp.

Implementation

Landing Zone

The Landing Zone stores the raw customer, accelerometer, and step trainer data.

1. customer_landing.sql

2. accelerometer_landing.sql
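
The contents of the two `.sql` files are not shown here. As a rough sketch only, a landing table over JSON files in S3 is typically declared in Athena with a `CREATE EXTERNAL TABLE` statement. The helper below assembles such a statement; the function name, the `stedi` database name, the column list, and the bucket path are all hypothetical placeholders, not values taken from this project.

```python
def build_landing_ddl(table, columns, location, database="stedi"):
    """Return an Athena CREATE EXTERNAL TABLE statement for JSON data.

    `columns` is a list of (name, sql_type) pairs. The database name and
    every value passed in are illustrative assumptions.
    """
    # Backticks protect names such as `timestamp` that are reserved words.
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical usage: column names and bucket are placeholders.
print(
    build_landing_ddl(
        "accelerometer_landing",
        [("user", "string"), ("x", "float")],
        "s3://example-bucket/accelerometer/landing/",
    )
)
```

The generated text can be pasted into the Athena query editor; the real column list should come from the landing data itself.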

Trusted Zone

The Trusted Zone stores the tables that contain records from customers who agreed to share their data for research purposes.

1. customer_landing_to_trusted.py - script used to build the customer_trusted table, which contains customer records from customers who agreed to share their data for research purposes.

2. accelerometer_landing_to_trusted.py - script used to build the accelerometer_trusted table, which contains accelerometer records from customers who agreed to share their data for research purposes.
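
The filtering logic behind these two scripts can be illustrated without Spark. The sketch below uses plain Python lists of dictionaries rather than Glue DynamicFrames, so it shows the idea and not the actual job code. The field names `shareWithResearchAsOfDate`, `email`, and `user` are assumptions about the data layout and are not confirmed by this README.

```python
def customers_with_consent(customers, consent_field="shareWithResearchAsOfDate"):
    """Keep only customers whose consent field is present and not None.

    The consent field name is an assumption about the landing data.
    """
    return [c for c in customers if c.get(consent_field) is not None]


def accelerometer_for_trusted(readings, trusted_customers,
                              reading_key="user", customer_key="email"):
    """Keep accelerometer readings that belong to a trusted customer.

    Equivalent to an inner join on reading_key == customer_key that keeps
    only the accelerometer columns. Both key names are assumptions.
    """
    trusted_ids = {c[customer_key] for c in trusted_customers}
    return [r for r in readings if r.get(reading_key) in trusted_ids]
```

In a Glue job the same two steps would usually be a filter transform followed by a join; the set lookup here simply stands in for that join.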

The customer_trusted table was queried in Athena to show that it only contains customer records from people who agreed to share their data.

[Screenshots: customer_trusted, customer_trust_with_filter]

Curated Zone

The Curated Zone stores the tables that contain the correct serial numbers.

Glue job scripts

customer_trusted_to_curated.py - script used to build the customer_curated table, which contains customers who have accelerometer data and have agreed to share their data for research.

step_trainer_trusted_to_curated.py - script used to build the machine_learning_curated table, which contains each of the step trainer readings, and the associated accelerometer reading data for the same timestamp, but only for customers who have agreed to share their data.
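
The two curated tables described above can be sketched in plain Python to make the join logic concrete. This is a minimal illustration under stated assumptions, not the Glue scripts themselves: it works on lists of dictionaries, assumes the inputs already come from the trusted zone, and uses hypothetical field names (`email`, `user`, `sensorReadingTime`, `timeStamp`).

```python
def curate_customers(trusted_customers, accelerometer_readings,
                     customer_key="email", reading_key="user"):
    """Keep trusted customers that have at least one accelerometer reading."""
    users_with_readings = {r[reading_key] for r in accelerometer_readings}
    return [c for c in trusted_customers if c[customer_key] in users_with_readings]


def build_machine_learning_rows(step_readings, accelerometer_readings,
                                step_time_key="sensorReadingTime",
                                accel_time_key="timeStamp"):
    """Inner-join step trainer and accelerometer readings on timestamp.

    Each output row merges the two records; if both share a field name,
    the accelerometer value wins.
    """
    # Index accelerometer readings by timestamp for constant-time lookup.
    accel_by_time = {}
    for reading in accelerometer_readings:
        accel_by_time.setdefault(reading[accel_time_key], []).append(reading)

    rows = []
    for step in step_readings:
        for accel in accel_by_time.get(step[step_time_key], []):
            rows.append({**step, **accel})
    return rows
```

Step trainer readings with no accelerometer reading at the same timestamp are dropped, which matches the inner-join reading of "for the same timestamp".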
