
dataengineer's Introduction

Data Engineering Project

What problem will we solve?

We will look at the most populous U.S. cities and identify which ones are the most expensive and the most affordable to live in. This will help us decide which city we'd like to move to next.

What datasets will we use?

We will scrape three datasets:

  1. Wikipedia's "List of United States cities by population", which lists the most populous U.S. cities.

  2. Zillow Home Value Index data, an estimate of median home values by city.

  3. Wikipedia's "Household income in the United States", which lists 2017 median household income by state.

How will we use these datasets to solve the problem?

We will append (2) median home value data and (3) household income data to (1) the list of most populous U.S. cities, and then calculate a "Cost Score": for each city, the number of years of income required to purchase a median-value home. This tells us how costly it is to live in a particular city relative to the others. We will then create a "Cost Rank" based on this score.
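As a rough sketch of the calculation described above, the Cost Score can be read as median home value divided by median household income. The function name, field names, and sample figures below are hypothetical placeholders, not values from the notebook, and the rank direction (1 = most expensive) is an assumption, since the description does not say which way the rank runs.

```python
def add_cost_score_and_rank(cities):
    """Return new rows with 'cost_score' and 'cost_rank' added.

    cost_score = median home value / median household income,
                 i.e. years of income needed to buy a median-value home.
    cost_rank  = 1 for the highest score (assumed direction).
    """
    scored = [
        {**row, "cost_score": row["home_value"] / row["household_income"]}
        for row in cities
    ]
    # Sort from most to least expensive, then number the rows from 1.
    ordered = sorted(scored, key=lambda row: row["cost_score"], reverse=True)
    return [{**row, "cost_rank": rank} for rank, row in enumerate(ordered, start=1)]


# Placeholder figures for illustration only, not real data.
sample = [
    {"city": "City A", "home_value": 600_000, "household_income": 60_000},
    {"city": "City B", "home_value": 200_000, "household_income": 50_000},
    {"city": "City C", "home_value": 450_000, "household_income": 75_000},
]

ranked = add_cost_score_and_rank(sample)
for row in ranked:
    print(row["city"], round(row["cost_score"], 1), row["cost_rank"])
```

With these placeholder inputs, "City A" needs 10 years of income and ranks first, while "City B" needs 4 years and ranks last.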

What steps will we take to do this?

We will:

I. Scrape the Data

II. Join the Data

III. Analyze the Data
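Step II might look something like the sketch below. It assumes the city list carries both a city and a state name, that home values are keyed by (city, state), and that income is keyed by state, so each city inherits its state's median household income. All names and figures are hypothetical placeholders; the notebook may join the data differently.

```python
def join_city_data(cities, home_values, state_incomes):
    """Left-join home values and state incomes onto the city list.

    Every city is kept; a missing match yields None rather than dropping the row.
    """
    joined = []
    for row in cities:
        key = (row["city"], row["state"])
        joined.append({
            **row,
            "home_value": home_values.get(key),                   # city-level lookup
            "household_income": state_incomes.get(row["state"]),  # state-level lookup
        })
    return joined


# Placeholder inputs for illustration only, not real data.
cities = [
    {"city": "City A", "state": "State X"},
    {"city": "City B", "state": "State Y"},
]
home_values = {("City A", "State X"): 600_000}
state_incomes = {"State X": 60_000, "State Y": 50_000}

joined = join_city_data(cities, home_values, state_incomes)
for row in joined:
    print(row)
```

Keeping unmatched cities with `None` values makes gaps in the scraped data visible before the analysis step, instead of silently shrinking the list.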

For step-by-step code, refer to 'DataEngineering.ipynb'. The output of this notebook is 'topCitiesJoined.csv', which lists the most populous U.S. cities along with their 'Cost Score' and 'Cost Rank'. This CSV is ready to be uploaded to a BigQuery table.

NOTE: to run this notebook, you will need to have Anaconda Distribution with Python 3.7 installed.

Contributors

rayleegit
