Giter Club home page Giter Club logo

crowdfunding_etl's Introduction

Crowdfunding_ETL

Challenge Synopsis:

For the ETL mini project, you will work with a partner to practice building an ETL pipeline using Python, Pandas, and either Python dictionary methods or regular expressions to extract and transform the data. After you transform the data, you'll create four CSV files and use the CSV file data to create an ERD and a table schema. Finally, you’ll upload the CSV file data into a Postgres database.

Since this is a one-week project, make sure that you have done at least half of your project before the third day of class to stay on track.

Although you and your partner will divide the work, it’s essential to collaborate and communicate while working on different parts of the project. Be sure to check in with your partner regularly and offer support.

Instructions:

The instructions for this mini project are divided into the following subsections:

  • Create the Category and Subcategory DataFrames
  • Create the Campaign DataFrame
  • Create the Contacts DataFrame
  • Create the Crowdfunding Database

Create the Category and Subcategory DataFrames

  • Extract and transform the crowdfunding.xlsx Excel data to create a category DataFrame that has the following columns:

  • A "category_id" column that has entries going sequentially from "cat1" to "catn", where n is the number of unique categories

  • A "category" column that contains only the category titles

  • The following image shows this category DataFrame:

  • Export the category DataFrame as category.csv and save it to your GitHub repository.

  • Extract and transform the crowdfunding.xlsx Excel data to create a subcategory DataFrame that has the following columns:

  • A "subcategory_id" column that has entries going sequentially from "subcat1" to "subcatn", where n is the number of unique subcategories

  • A "subcategory" column that contains only the subcategory titles

  • Export the subcategory DataFrame as subcategory.csv and save it to your GitHub repository.

Create the Campaign DataFrame

  • Extract and transform the crowdfunding.xlsx Excel data to create a campaign DataFrame has the following columns:

  • The "cf_id" column

  • The "contact_id" column

  • The "company_name" column

  • The "blurb" column, renamed to "description"

  • The "goal" column, converted to the float data type

  • The "pledged" column, converted to the float data type

  • The "outcome" column

  • The "backers_count" column

  • The "country" column

  • The "currency" column

  • The "launched_at" column, renamed to "launch_date" and with the UTC times converted to the datetime format

  • The "deadline" column, renamed to "end_date" and with the UTC times converted to the datetime format

  • The "category_id" column, with unique identification numbers matching those in the "category_id" column of the category DataFrame

  • The "subcategory_id" column, with the unique identification numbers matching those in the "subcategory_id" column of the subcategory DataFrame

  • Export the campaign DataFrame as campaign.csv and save it to your GitHub repository.

Create the Contacts DataFrame

  • Choose one of the following two options for extracting and transforming the data from the contacts.xlsx Excel data:

  • Option 1: Use Python dictionary methods.

  • Option 2: Use regular expressions.

  • If you chose Option 1, complete the following steps:

    • Import the contacts.xlsx file into a DataFrame.
    • Iterate through the DataFrame, converting each row to a dictionary.
    • Iterate through each dictionary, doing the following:
      • Extract the dictionary values from the keys by using a Python list comprehension.
      • Add the values for each row to a new list.
    • Create a new DataFrame that contains the extracted data.
    • Split each "name" column value into a first and last name, and place each in a new column.
    • Clean and export the DataFrame as contacts.csv and save it to your GitHub repository.
  • If you chose Option 2, complete the following steps:

    - Import the contacts.xlsx file into a DataFrame.

    - Extract the "contact_id", "name", and "email" columns by using regular expressions.

    - Create a new DataFrame with the extracted data.

    - Convert the "contact_id" column to the integer type.

    - Split each "name" column value into a first and a last name, and place each in a new column.

    - Clean and then export the DataFrame as contacts.csv and save it to your GitHub repository.

Create the Crowdfunding Database

  • Inspect the four CSV files, and then sketch an ERD of the tables by using QuickDBD

  • Use the information from the ERD to create a table schema for each CSV file.

    • Note: Remember to specify the data types, primary keys, foreign keys, and other constraints.
  • Save the database schema as a Postgres file named crowdfunding_db_schema.sql, and save it to your GitHub repository.

  • Create a new Postgres database, named crowdfunding_db.

  • Using the database schema, create the tables in the correct order to handle the foreign keys.

  • Verify the table creation by running a SELECT statement for each table.

  • Import each CSV file into its corresponding SQL table.

  • Verify that each table has the correct data by running a SELECT statement for each

Notes:

  • Headers were irritating in the CSV and we had to create extra work-arounds/cleaning.

  • Originally we thought we would try to automate the creation of the tables in our database for our graders. But then we found that this was not apart of the assignment. In order to do that, we would have had to path all the way to our C drive because the files for pgAdmin are all within the program. We ended up scrapping our original code for something more basic and simplistic.

  • Understanding the gargon is very important in databasing and ETL in general. Without it, all of the moving parts can be hard to conceptualize.

Citations:

Data for this dataset was generated by edX Boot Camps LLC, and is intended for educational purposes only.

Instructor: Othmane Benyoucef

Student Collaborations: Scott Arterbury Cory Chapman

crowdfunding_etl's People

Contributors

scottarterbury avatar thatcorygirl avatar

Watchers

 avatar

Forkers

thatcorygirl

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.