Giter Club home page Giter Club logo

big-data-etl's Introduction

A SQL Database for Amazon Product Reviews
(Big Data ETL Demonstration)

Overview:

This project utilizes Amazon RDS, Google Colaboratory, and PySpark to illustrate the ETL process for very large data sets. For each jupyter notebook in this repository, a large file containing customer reviews for products sold on Amazon is retrieved from the Amazon Web Service and loaded into a PySpark dataframe. The dataframe is then broken into subsequent dataframes that will fit the tables defined in the schema file. The subsequent dataframes are cleaned and then loaded into an Amazon RDS database.

To examine and/or run the code files, their locations are indicated in the tree below.
(Note: Run the schema file in PostgreSQL before running the notebooks to set up the tables in your RDS database.)
(Note: You will need to supply your RDS endpoint, database name, and password in the notebookds bnefore running them.)
(Note: The Amazon files are very large, so the notebooks should be run in Google Colaboratory.)

Files/Folders:

  • "Code"
    • "schema.sql" (sets up the database tables, RUN THIS FIRST)
    • "amazon_reviews_1A.ipynb" (performs ETL on Amazon reviews for Health & Personal Care products)
    • "amazon_reviews_1B.ipynb" (performs ETL on Amazon reviews for Pet products)

(Please do not delete, move, rename, or alter!)

Sample Code:

PostgreSQL schema file to set up database tables:

CREATE TABLE review_id_table (
  review_id TEXT PRIMARY KEY NOT NULL,
  customer_id INTEGER,
  product_id TEXT,
  product_parent INTEGER,
  review_date DATE -- this should be in the formate yyyy-mm-dd
);

-- This table will contain only unique values
CREATE TABLE products (
  product_id TEXT PRIMARY KEY NOT NULL UNIQUE,
  product_title TEXT
);

-- Customer table for first data set
CREATE TABLE customers (
  customer_id INT PRIMARY KEY NOT NULL UNIQUE,
  customer_count INT
);

-- vine table
CREATE TABLE vine_table (
  review_id TEXT PRIMARY KEY,
  star_rating INTEGER,
  helpful_votes INTEGER,
  total_votes INTEGER,
  vine TEXT
);

big-data-etl's People

Contributors

rickmora98 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.