
emaynard10 / amazon_vine_analysis


Pick one of 50 datasets and use PySpark to perform the ETL process to extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin. Next, use PySpark to determine if there is any bias toward favorable reviews from Vine members.

Jupyter Notebook 100.00%
aws big-data etl googlecolaboratory pgadmin4 postgresql pyspark s3-buckets sql


Amazon_Vine_Analysis

The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. This project accesses approximately 50 datasets. Each one contains reviews of a specific product category, from clothing apparel to wireless products. The analysis picks one of these datasets and uses PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database viewed through pgAdmin. It then uses PySpark to determine whether there is any bias toward favorable reviews from Vine members in the chosen dataset.

Tools: PySpark, Google Colaboratory, PostgreSQL/pgAdmin, AWS, Pandas

Overview of the analysis:

The purpose of this analysis is to determine whether there is bias in the paid reviews. PySpark is used to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database viewed through pgAdmin. The bias check itself could be done with PySpark, Pandas, or SQL; this analysis uses PySpark. The dataset selected for analysis was 'Pet products'.

The four tables created with data loaded into pgAdmin from the AWS RDS connection are shown below.

The review_id_table:

[Screenshot: review_id_table in pgAdmin]

The customers_table:

[Screenshot: customers_table in pgAdmin]

The products_table:

[Screenshot: products_table in pgAdmin]

The vine_table:

[Screenshot: vine_table in pgAdmin]

Results:

The analysis begins by creating the Vine table, which holds star ratings, helpful votes, total votes, whether the review is part of the Vine program, and whether the purchase was verified. This table is first filtered to keep only reviews with more than 20 total votes, which ensures there are no division-by-zero errors when percentages are calculated later. The result is then filtered to reviews where helpful votes are over 50% of total votes. Finally, that table is split in two: one with reviews that are in the Vine program and one with reviews that are not. The Vine table filtered to over 50% helpful votes is shown below:

[Screenshot: Vine table filtered to reviews with over 50% helpful votes]
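
The filtering order described above can be sketched in plain Python on a few made-up rows, which makes the division-by-zero reasoning easy to see without a Spark session. This is only an illustration: the notebook does these steps with PySpark DataFrame filters, and the field names `helpful_votes`, `total_votes`, and `vine`, the `"Y"`/`"N"` flag values, the strict `>` comparisons, and the sample rows are all assumptions.

```python
# Hypothetical sample rows; the real data lives in a PySpark DataFrame.
rows = [
    {"review_id": "R1", "star_rating": 5, "helpful_votes": 30, "total_votes": 40, "vine": "Y"},
    {"review_id": "R2", "star_rating": 3, "helpful_votes": 5, "total_votes": 25, "vine": "N"},
    {"review_id": "R3", "star_rating": 5, "helpful_votes": 0, "total_votes": 0, "vine": "N"},
    {"review_id": "R4", "star_rating": 4, "helpful_votes": 22, "total_votes": 30, "vine": "N"},
]

# Step 1: keep reviews with more than 20 total votes.
# Every remaining row has total_votes > 0, so the ratio in step 2 is safe.
enough_votes = [r for r in rows if r["total_votes"] > 20]

# Step 2: keep reviews where more than half of the votes were helpful.
helpful = [r for r in enough_votes if r["helpful_votes"] / r["total_votes"] > 0.5]

# Step 3: split into Vine (paid) and non-Vine (unpaid) reviews.
paid = [r for r in helpful if r["vine"] == "Y"]
unpaid = [r for r in helpful if r["vine"] == "N"]

print([r["review_id"] for r in paid])    # ['R1']
print([r["review_id"] for r in unpaid])  # ['R4']
```

Row R3 has zero total votes and would raise a `ZeroDivisionError` in step 2 if step 1 did not remove it first.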

  • How many Vine reviews and non-Vine reviews were there? There were far more non-Vine reviews than Vine reviews: 37,840 unpaid (non-Vine) reviews and 170 paid (Vine) reviews in total.
# Count the unpaid (non-Vine) reviews
total_review_unpaid = unpaid_df.count()
total_review_unpaid

# Count the paid (Vine) reviews
total_review = paid_df.count()
total_review
  • How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars? Of the 170 Vine reviews, 65 were five-star reviews.
# Count the five-star paid (Vine) reviews
five_star_paid = paid_df.filter(paid_df.star_rating == 5).count()
five_star_paid

There were 20,612 five-star unpaid reviews, which are not part of the Vine program.

# Count the five-star unpaid (non-Vine) reviews
five_star_unpaid = unpaid_df.filter(unpaid_df.star_rating == 5).count()
five_star_unpaid
  • What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars? 38.24% of Vine reviews were five stars, whereas 54.47% of unpaid reviews were five stars.
# Percentage of paid (Vine) reviews that are five stars
percent_fivestar_paid = (five_star_paid / total_review) * 100
percent_fivestar_paid

# Percentage of unpaid (non-Vine) reviews that are five stars
percent_fivestar_unpaid = (five_star_unpaid / total_review_unpaid) * 100
percent_fivestar_unpaid
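
As a quick arithmetic check, the reported percentages can be reproduced from the counts quoted above. This is plain Python rather than the notebook's PySpark code, and the variable and function names are illustrative only.

```python
# Counts quoted in the results above.
paid_total, paid_five_star = 170, 65
unpaid_total, unpaid_five_star = 37_840, 20_612


def percent(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(part / whole * 100, 2)


print(percent(paid_five_star, paid_total))      # 38.24
print(percent(unpaid_five_star, unpaid_total))  # 54.47
```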

Summary

There does not appear to be a bias toward favorable reviews in the Vine program: unpaid reviews are both more numerous and more often five-star (54.47% versus 38.24%). If the paid reviews were mostly five-star reviews, that would suggest a bias. A further analysis that could help determine whether there is bias in Vine program reviews would be to filter the verified purchase column for verified purchases and compare the share of five-star reviews among verified purchases in and out of the Vine program.
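
The suggested follow-up could look roughly like the sketch below, again in plain Python on made-up rows rather than on the real PySpark DataFrame. The field name `verified_purchase`, the `"Y"`/`"N"` values, the helper `five_star_share_by_vine`, and the sample data are all hypothetical, so the numbers it prints say nothing about the actual dataset.

```python
from collections import defaultdict

# Hypothetical rows; only the fields needed for the comparison are shown.
rows = [
    {"star_rating": 5, "vine": "Y", "verified_purchase": "Y"},
    {"star_rating": 4, "vine": "Y", "verified_purchase": "Y"},
    {"star_rating": 5, "vine": "N", "verified_purchase": "Y"},
    {"star_rating": 5, "vine": "N", "verified_purchase": "Y"},
    {"star_rating": 2, "vine": "N", "verified_purchase": "Y"},
    {"star_rating": 5, "vine": "N", "verified_purchase": "N"},
]


def five_star_share_by_vine(rows):
    """Percent of five-star reviews among verified purchases, keyed by Vine flag."""
    totals = defaultdict(int)
    five_star = defaultdict(int)
    for r in rows:
        if r["verified_purchase"] != "Y":
            continue  # keep verified purchases only
        totals[r["vine"]] += 1
        if r["star_rating"] == 5:
            five_star[r["vine"]] += 1
    return {flag: round(five_star[flag] / totals[flag] * 100, 2) for flag in totals}


shares = five_star_share_by_vine(rows)
print(shares)  # {'Y': 50.0, 'N': 66.67}
```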
