

Data Pipeline ETL

Getting Started with Data Pipelines for ETL


Extracting Data
In this code-along, I'll focus on extracting data from flat files. A flat file might be something like a .csv or a .json file. The two files I'll be extracting data from are apps_data.csv and review_data.csv. To do this, I'll use pandas.

After importing pandas, read the data stored in apps_data.csv into a DataFrame. Print the head of the DataFrame.
Similar to before, read in the DataFrame stored in review_data.csv. Take a look at the first few rows of this DataFrame.
Print the column names, shape, and data types of the apps DataFrame.
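
A minimal sketch of this extraction step (assuming both CSV files live in the working directory):

```python
import pandas as pd

# Read both flat files into memory
apps_data = pd.read_csv("apps_data.csv")
reviews_data = pd.read_csv("review_data.csv")

# Take a look at the first few rows of each DataFrame
print(apps_data.head())
print(reviews_data.head())

# Column names, shape, and data types of the apps DataFrame
print(apps_data.columns)
print(apps_data.shape)
print(apps_data.dtypes)
```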


Transforming Data
Here, I'm interested in working with the apps and their corresponding reviews in the "FOOD_AND_DRINK" category. I'd like to do the following:

Define a function named transform. This function will take five parameters: apps, review, category, min_rating, and min_reviews.
Drop duplicates from both DataFrames.
For each of the apps in the desired category, find the number of positive reviews, and filter the columns.
Join this back to the apps dataset, keeping only the following columns: App, Rating, Reviews, Installs, and Sentiment_Polarity.
Filter out all records that don't have at least the min_rating and more than the min_reviews.
Order by the rating and number of installs, both in descending order.
Call the function for the "FOOD_AND_DRINK" category, with a minimum average rating of 4 stars, and at least 1000 reviews.
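
Here's a sketch of what transform could look like. The column names (Category, App, Sentiment, Sentiment_Polarity, Rating, Reviews, Installs) are assumptions based on the usual Google Play Store apps/reviews datasets, and I read "find the number of positive reviews" as aggregating the positive reviews per app; adjust the details to match the actual files.

```python
def transform(apps, review, category, min_rating, min_reviews):
    # Drop duplicate records from both DataFrames
    apps = apps.drop_duplicates()
    review = review.drop_duplicates()

    # Keep only the apps in the desired category
    subset_apps = apps.loc[apps["Category"] == category, :]

    # For those apps, keep the positive reviews and only the columns we need
    positive_reviews = review.loc[
        review["App"].isin(subset_apps["App"]) & (review["Sentiment"] == "Positive"),
        ["App", "Sentiment_Polarity"],
    ]

    # Aggregate the positive reviews per app (average sentiment polarity)
    aggregated_reviews = positive_reviews.groupby("App").mean()

    # Join back to the apps data, keeping only the required columns
    joined = subset_apps.join(aggregated_reviews, on="App", how="left")
    filtered = joined.loc[:, ["App", "Rating", "Reviews", "Installs", "Sentiment_Polarity"]]

    # Make sure Reviews is numeric before comparing against min_reviews
    # (in the raw file it may be read in as text)
    filtered = filtered.astype({"Reviews": "int64"})

    # Keep records with at least min_rating and more than min_reviews
    top_apps = filtered.loc[
        (filtered["Rating"] >= min_rating) & (filtered["Reviews"] > min_reviews), :
    ]

    # Order by rating and number of installs, both descending
    top_apps = top_apps.sort_values(by=["Rating", "Installs"], ascending=False)

    return top_apps


# Transform the extracted data for the "FOOD_AND_DRINK" category
top_apps_data = transform(
    apps=apps_data,
    review=reviews_data,
    category="FOOD_AND_DRINK",
    min_rating=4.0,
    min_reviews=1000,
)
print(top_apps_data.head())
```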


Loading Data
Next, I'd like to load the transformed dataset into a SQL database. I'll be using pandas along with sqlite3 to do just that!

After importing sqlite3, create a function named load. The function will take three parameters: dataframe, database_name, and table_name.
Connect to the database using the connect() function.
Write the DataFrame to the provided table name. Replace the table if it exists, and do not include the index.
Now, we'll validate that the data was loaded correctly. Use the read_sql() function to return the DataFrame that was just loaded.
Assert that the number of rows and columns match in the original and loaded DataFrame.
Return the DataFrame read from the SQLite database.
Call the function for the top_apps_data DataFrame, for the "market_research" database and the top_apps table.
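
A sketch of the load step with sqlite3 and pandas (the database name is used exactly as given above, so sqlite3 will create a file called market_research if it doesn't exist):

```python
import sqlite3


def load(dataframe, database_name, table_name):
    # Connect to (or create) the SQLite database
    con = sqlite3.connect(database_name)

    # Write the DataFrame to the table, replacing it if it already exists,
    # and without writing the index
    dataframe.to_sql(name=table_name, con=con, if_exists="replace", index=False)

    # Read the table back to validate that the data was loaded correctly
    loaded_dataframe = pd.read_sql(f"SELECT * FROM {table_name}", con=con)

    # The number of rows and columns should match the original DataFrame
    assert dataframe.shape == loaded_dataframe.shape

    return loaded_dataframe


# Load the transformed DataFrame into the top_apps table
top_apps_loaded = load(
    dataframe=top_apps_data,
    database_name="market_research",
    table_name="top_apps",
)
```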


Running the Pipeline
Now that functions have been defined and tested, I'll run this pipeline end-to-end!

For completeness, import pandas and sqlite3.
Extract data from the apps_data.csv and review_data.csv files.
Transform the data by passing in the following: category="FOOD_AND_DRINK", min_rating=4.0, min_reviews=1000.
Load the transformed DataFrame to the top_apps table in the market_research database.
Check out the output!
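
Putting it all together, reusing the transform and load functions sketched above:

```python
import pandas as pd
import sqlite3

# Extract
apps_data = pd.read_csv("apps_data.csv")
reviews_data = pd.read_csv("review_data.csv")

# Transform
top_apps_data = transform(
    apps=apps_data,
    review=reviews_data,
    category="FOOD_AND_DRINK",
    min_rating=4.0,
    min_reviews=1000,
)

# Load to the top_apps table in the market_research database, and validate
top_apps_loaded = load(
    dataframe=top_apps_data,
    database_name="market_research",
    table_name="top_apps",
)

# Check out the output
print(top_apps_loaded.head())
```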

