
Final Project Submission


King County House Sales Analysis

Data Cleaning and Linear Regression

Table of Contents

  1. Introduction

Introduction

As I began this project, I divided it into three broad steps: data processing, exploratory data analysis, and linear regression modeling with analysis.

The processing stage consisted of loading my data and libraries, addressing any placeholders and null values, and formatting the data types.

Once my data was processed, I began my exploratory data analysis. I engineered new features to better understand the trends in the data. I eliminated extraneous features and removed outliers. I also investigated three different features and their relationship to price: waterfront, location, and condition.

After exploring my data, I began my linear regression modeling. I first resolved any multicollinearity between variables. I scaled and normalized the variables that remained. Then I partitioned my dataset for testing and ran my regression model followed by validation testing. I iterated my model a few times and ended up with a model that I thought best predicted housing prices based on the chosen features.
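Since the modeling code appears later in the notebook, here is a minimal sketch of that split-and-validate workflow, using placeholder data rather than the actual cleaned features:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data standing in for the scaled features and price target
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([3.0, 2.0, 1.0]) + rng.normal(0, 0.1, size=200)

# Hold out a test set, fit, then check hold-out and cross-validated R^2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print('Test R^2:', model.score(X_test, y_test))
print('5-fold CV R^2:', cross_val_score(model, X_train, y_train, cv=5).mean())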

Back to the top

Import libraries and load data

import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline  
import seaborn as sns
sns.set_style('whitegrid')

import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from statsmodels.formula.api import ols
import scipy.stats as stats

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score

import folium
from folium.plugins import MarkerCluster
from folium.plugins import FastMarkerCluster

import os

from math import sqrt

import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('kc_house_data.csv')

Back to the top

Gather information

df.shape
(21597, 21)
df.describe()
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 2.159700e+04 21597.000000 19221.000000 21534.000000 21597.000000 21597.000000 21597.000000 21597.000000 17755.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000
mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 1.509941e+04 1.494096 0.007596 0.233863 3.409825 7.657915 1788.596842 1970.999676 83.636778 98077.951845 47.560093 -122.213982 1986.620318 12758.283512
std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 4.141264e+04 0.539683 0.086825 0.765686 0.650546 1.173200 827.759761 29.375234 399.946414 53.513072 0.138552 0.140724 685.230472 27274.441950
min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 1975.000000 0.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 1.068500e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
id               21597 non-null int64
date             21597 non-null object
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
lat              21597 non-null float64
long             21597 non-null float64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(8), int64(11), object(2)
memory usage: 3.5+ MB
df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')
df.hist(figsize=(16,12)) ;

png

df.head(20)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503
5 7237550310 5/12/2014 1230000.0 4 4.50 5420 101930 1.0 0.0 0.0 3 11 3890 1530.0 2001 0.0 98053 47.6561 -122.005 4760 101930
6 1321400060 6/27/2014 257500.0 3 2.25 1715 6819 2.0 0.0 0.0 3 7 1715 ? 1995 0.0 98003 47.3097 -122.327 2238 6819
7 2008000270 1/15/2015 291850.0 3 1.50 1060 9711 1.0 0.0 NaN 3 7 1060 0.0 1963 0.0 98198 47.4095 -122.315 1650 9711
8 2414600126 4/15/2015 229500.0 3 1.00 1780 7470 1.0 0.0 0.0 3 7 1050 730.0 1960 0.0 98146 47.5123 -122.337 1780 8113
9 3793500160 3/12/2015 323000.0 3 2.50 1890 6560 2.0 0.0 0.0 3 7 1890 0.0 2003 0.0 98038 47.3684 -122.031 2390 7570
10 1736800520 4/3/2015 662500.0 3 2.50 3560 9796 1.0 NaN 0.0 3 8 1860 1700.0 1965 0.0 98007 47.6007 -122.145 2210 8925
11 9212900260 5/27/2014 468000.0 2 1.00 1160 6000 1.0 0.0 0.0 4 7 860 300.0 1942 0.0 98115 47.6900 -122.292 1330 6000
12 114101516 5/28/2014 310000.0 3 1.00 1430 19901 1.5 0.0 0.0 4 7 1430 0.0 1927 NaN 98028 47.7558 -122.229 1780 12697
13 6054650070 10/7/2014 400000.0 3 1.75 1370 9680 1.0 0.0 0.0 4 7 1370 0.0 1977 0.0 98074 47.6127 -122.045 1370 10208
14 1175000570 3/12/2015 530000.0 5 2.00 1810 4850 1.5 0.0 0.0 3 7 1810 0.0 1900 0.0 98107 47.6700 -122.394 1360 4850
15 9297300055 1/24/2015 650000.0 4 3.00 2950 5000 2.0 0.0 3.0 3 9 1980 970.0 1979 0.0 98126 47.5714 -122.375 2140 4000
16 1875500060 7/31/2014 395000.0 3 2.00 1890 14040 2.0 0.0 0.0 3 7 1890 0.0 1994 0.0 98019 47.7277 -121.962 1890 14018
17 6865200140 5/29/2014 485000.0 4 1.00 1600 4300 1.5 0.0 0.0 4 7 1600 0.0 1916 0.0 98103 47.6648 -122.343 1610 4300
18 16000397 12/5/2014 189000.0 2 1.00 1200 9850 1.0 0.0 0.0 4 7 1200 ? 1921 0.0 98002 47.3089 -122.210 1060 5095
19 7983200060 4/24/2015 230000.0 3 1.00 1250 9774 1.0 0.0 0.0 4 7 1250 0.0 1969 0.0 98003 47.3343 -122.306 1280 8850
df.isna().any()
id               False
date             False
price            False
bedrooms         False
bathrooms        False
sqft_living      False
sqft_lot         False
floors           False
waterfront        True
view              True
condition        False
grade            False
sqft_above       False
sqft_basement    False
yr_built         False
yr_renovated      True
zipcode          False
lat              False
long             False
sqft_living15    False
sqft_lot15       False
dtype: bool

Let's look at a preliminary regression model

y = df['price']
# Take out price, objects, and columns with null values
x = df.drop(['price', 'date', 'sqft_basement', 'waterfront', 'yr_renovated', 'view'], axis=1)
#X = sm.add_constant(x)
linreg = sm.OLS(y, x).fit()
linreg.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.661
Model: OLS Adj. R-squared: 0.661
Method: Least Squares F-statistic: 3003.
Date: Mon, 03 Jun 2019 Prob (F-statistic): 0.00
Time: 22:57:49 Log-Likelihood: -2.9572e+05
No. Observations: 21597 AIC: 5.915e+05
Df Residuals: 21582 BIC: 5.916e+05
Df Model: 14
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
id -9.659e-07 5.13e-07 -1.883 0.060 -1.97e-06 3.93e-08
bedrooms -4.683e+04 2003.587 -23.374 0.000 -5.08e+04 -4.29e+04
bathrooms 4.735e+04 3443.479 13.750 0.000 4.06e+04 5.41e+04
sqft_living 180.2720 4.582 39.347 0.000 171.292 189.252
sqft_lot 0.1450 0.051 2.845 0.004 0.045 0.245
floors 1.348e+04 3764.645 3.581 0.000 6101.521 2.09e+04
condition 2.53e+04 2439.664 10.372 0.000 2.05e+04 3.01e+04
grade 1.034e+05 2279.305 45.361 0.000 9.89e+04 1.08e+05
sqft_above 6.4953 4.514 1.439 0.150 -2.352 15.343
yr_built -3096.6095 67.880 -45.619 0.000 -3229.660 -2963.559
zipcode -518.5378 18.971 -27.333 0.000 -555.723 -481.353
lat 5.504e+05 1.13e+04 48.493 0.000 5.28e+05 5.73e+05
long -2.457e+05 1.39e+04 -17.664 0.000 -2.73e+05 -2.18e+05
sqft_living15 37.5591 3.570 10.520 0.000 30.561 44.557
sqft_lot15 -0.3379 0.078 -4.326 0.000 -0.491 -0.185
Omnibus: 19489.941 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1942760.660
Skew: 3.955 Prob(JB): 0.00
Kurtosis: 48.786 Cond. No. 5.20e+10


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.2e+10. This might indicate that there are
strong multicollinearity or other numerical problems.

Observations:

  • R squared value .661
  • Skew 3.955
  • Kurtosis 48.786
  • Fairly low p-values except for sqft_above and id
  • Note: with sm.add_constant commented out, this model is fit without an intercept, so the reported R-squared is not directly comparable to a model that includes one
  • Need to clean up the data first

Back to the top

Replace placeholders

While examining the data, I noticed '?' placeholders in the sqft_basement column. I will convert those '?'s to null values here and resolve the null values in the following step.

df.isna().sum()
id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64
df.sqft_basement.unique()
array(['0.0', '400.0', '910.0', '1530.0', '?', '730.0', '1700.0', '300.0',
       '970.0', '760.0', '720.0', '700.0', '820.0', '780.0', '790.0',
       '330.0', '1620.0', '360.0', '588.0', '1510.0', '410.0', '990.0',
       '600.0', '560.0', '550.0', '1000.0', '1600.0', '500.0', '1040.0',
       '880.0', '1010.0', '240.0', '265.0', '290.0', '800.0', '540.0',
       '710.0', '840.0', '380.0', '770.0', '480.0', '570.0', '1490.0',
       '620.0', '1250.0', '1270.0', '120.0', '650.0', '180.0', '1130.0',
       '450.0', '1640.0', '1460.0', '1020.0', '1030.0', '750.0', '640.0',
       '1070.0', '490.0', '1310.0', '630.0', '2000.0', '390.0', '430.0',
       '850.0', '210.0', '1430.0', '1950.0', '440.0', '220.0', '1160.0',
       '860.0', '580.0', '2060.0', '1820.0', '1180.0', '200.0', '1150.0',
       '1200.0', '680.0', '530.0', '1450.0', '1170.0', '1080.0', '960.0',
       '280.0', '870.0', '1100.0', '460.0', '1400.0', '660.0', '1220.0',
       '900.0', '420.0', '1580.0', '1380.0', '475.0', '690.0', '270.0',
       '350.0', '935.0', '1370.0', '980.0', '1470.0', '160.0', '950.0',
       '50.0', '740.0', '1780.0', '1900.0', '340.0', '470.0', '370.0',
       '140.0', '1760.0', '130.0', '520.0', '890.0', '1110.0', '150.0',
       '1720.0', '810.0', '190.0', '1290.0', '670.0', '1800.0', '1120.0',
       '1810.0', '60.0', '1050.0', '940.0', '310.0', '930.0', '1390.0',
       '610.0', '1830.0', '1300.0', '510.0', '1330.0', '1590.0', '920.0',
       '1320.0', '1420.0', '1240.0', '1960.0', '1560.0', '2020.0',
       '1190.0', '2110.0', '1280.0', '250.0', '2390.0', '1230.0', '170.0',
       '830.0', '1260.0', '1410.0', '1340.0', '590.0', '1500.0', '1140.0',
       '260.0', '100.0', '320.0', '1480.0', '1060.0', '1284.0', '1670.0',
       '1350.0', '2570.0', '1090.0', '110.0', '2500.0', '90.0', '1940.0',
       '1550.0', '2350.0', '2490.0', '1481.0', '1360.0', '1135.0',
       '1520.0', '1850.0', '1660.0', '2130.0', '2600.0', '1690.0',
       '243.0', '1210.0', '1024.0', '1798.0', '1610.0', '1440.0',
       '1570.0', '1650.0', '704.0', '1910.0', '1630.0', '2360.0',
       '1852.0', '2090.0', '2400.0', '1790.0', '2150.0', '230.0', '70.0',
       '1680.0', '2100.0', '3000.0', '1870.0', '1710.0', '2030.0',
       '875.0', '1540.0', '2850.0', '2170.0', '506.0', '906.0', '145.0',
       '2040.0', '784.0', '1750.0', '374.0', '518.0', '2720.0', '2730.0',
       '1840.0', '3480.0', '2160.0', '1920.0', '2330.0', '1860.0',
       '2050.0', '4820.0', '1913.0', '80.0', '2010.0', '3260.0', '2200.0',
       '415.0', '1730.0', '652.0', '2196.0', '1930.0', '515.0', '40.0',
       '2080.0', '2580.0', '1548.0', '1740.0', '235.0', '861.0', '1890.0',
       '2220.0', '792.0', '2070.0', '4130.0', '2250.0', '2240.0',
       '1990.0', '768.0', '2550.0', '435.0', '1008.0', '2300.0', '2610.0',
       '666.0', '3500.0', '172.0', '1816.0', '2190.0', '1245.0', '1525.0',
       '1880.0', '862.0', '946.0', '1281.0', '414.0', '2180.0', '276.0',
       '1248.0', '602.0', '516.0', '176.0', '225.0', '1275.0', '266.0',
       '283.0', '65.0', '2310.0', '10.0', '1770.0', '2120.0', '295.0',
       '207.0', '915.0', '556.0', '417.0', '143.0', '508.0', '2810.0',
       '20.0', '274.0', '248.0'], dtype=object)
# Replace placeholders with null values
df.sqft_basement.replace(to_replace='?', value=np.nan, inplace=True)
df.isna().sum()
id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement     454
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

Excellent. Now those 454 placeholders in sqft_basement have been converted to null values.

Back to the top

Resolve null values

The next step is to eliminate our null values. There are several ways to do this. Let's look through the columns with null values.

# Count of rows with null values
sum([True for idx,row in df.iterrows() if any(row.isnull())])
6168
# Percentage of rows with null values
6168/21597
0.28559522155854977
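As an aside, the same count can be computed without iterating row by row; a vectorized equivalent:

# Vectorized count of rows containing at least one null value
df.isnull().any(axis=1).sum()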

Waterfront

Let's examine waterfront more closely.

  • Waterfront is a binary variable containing 0s and 1s. I could convert this to a boolean or leave it as a float. Either way, I think it will function the same so I will leave it.
  • I could assume that a null value means no waterfront and replace all nulls with 0. The null values make up 11% of my data, so replacing them all with 0s could skew my data.
  • I could drop all of the rows with null values in waterfront but in doing so I could inadvertently drop important information.
  • The dominant value is 0, so replacing the nulls with 0 would not change the overall distribution much.
  • I could also split waterfront into two columns and look at just the null values.
df.shape
(21597, 21)
# Number of nulls
df.waterfront.isna().sum()
2376
# Percentage of null values in waterfront:
pnw = 2376/21597 * 100
pnw
11.00152798999861

I'm going to create two dataframes: one containing the rows without null values for waterfront and the other containing the rows with null values for waterfront. Then I can use them to look closer at the data.

# Make new dataframe with nulls for waterfront removed
df_w = df[(df.waterfront==df.waterfront)]
df_w.shape
(19221, 21)
# Check
df_w.waterfront.value_counts()
0.0    19075
1.0      146
Name: waterfront, dtype: int64
# Map all properties with nulls for waterfront removed
m_w = folium.Map([df_w['lat'].mean(), df_w['long'].mean()], zoom_start=11)
for index, row in df_w.iterrows():
    folium.CircleMarker([row['lat'], row['long']],
                        radius=3,
                        fill_color="#FF0000",
                        fill=True
                       ).add_to(m_w)
m_w.save('map_w.html')
# Make new dataframe of just nulls from waterfront
df_nw = df[(df.waterfront != df.waterfront)]
df_nw.shape
(2376, 21)
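The selections above work because NaN never compares equal to itself: df.waterfront == df.waterfront keeps the non-null rows, while != keeps the nulls. The built-in null checks are equivalent and a bit more explicit:

# Equivalent selections using pandas' null checks
df_w = df[df.waterfront.notna()]   # rows with a waterfront value
df_nw = df[df.waterfront.isna()]   # rows missing waterfront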
# Map all properties with null values for waterfront
m_nw = folium.Map([df_nw['lat'].mean(), df_nw['long'].mean()], zoom_start=11)
for index, row in df_nw.iterrows():
    folium.CircleMarker([row['lat'], row['long']],
                        radius=3,
                        fill_color="#FF0000",
                        fill=True
                       ).add_to(m_nw)
m_nw.save('map_nw.html')
# Make new dataframe of just waterfront properties
df_water = df[df['waterfront'] == 1]
df_water.shape
(146, 21)
# Map all the waterfront properties
m_water = folium.Map([df_water['lat'].mean(), df_water['long'].mean()], zoom_start=11)
for index, row in df_water.iterrows():
    folium.CircleMarker([row['lat'], row['long']],
                        radius=3,
                        fill_color="#FF0000",
                        fill=True
                       ).add_to(m_water)
m_water.save('map_water.html')
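Drawing one CircleMarker per row is slow for this many points, which is part of why the maps are saved to HTML rather than rendered inline. The FastMarkerCluster plugin imported above is a lighter-weight alternative; a rough sketch:

# Sketch: clustered markers render much faster for ~20k points
m_fast = folium.Map([df_w['lat'].mean(), df_w['long'].mean()], zoom_start=11)
FastMarkerCluster(data=df_w[['lat', 'long']].values.tolist()).add_to(m_fast)
m_fast.save('map_w_clustered.html')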

The maps are saved to HTML files and opened in separate browser windows; rendering this many markers inline crashes the Jupyter notebook.

Now I can see the properties with null values for waterfront on a map. Overall, they seem to be distributed similarly to our full dataset. The vast majority of properties are not waterfront, but there are still a handful that seem to be on the water.

Waterfront

From my analysis above, I am going to delete the rows containing null values for waterfront. I think this is an important feature, and I do not want to skew my data.

# Delete rows containing null values for waterfront
df = df.dropna(subset=['waterfront'])
# Check
df.waterfront.isna().sum()
0

View

Let's examine view more closely.

  • View refers to how many times a property has been viewed. It contains values 0, 1, 2, 3, and 4.
  • I could assume a null value in view means it has not been viewed.
  • I could drop all of the rows with null values in view.
  • There are only 63 rows with null values for view.
df.view.unique()
array([ 0., nan,  3.,  4.,  2.,  1.])
df.view.value_counts()
0.0    17312
2.0      836
3.0      435
1.0      291
4.0      290
Name: view, dtype: int64
df.view.isna().sum()
57

There are so few rows with null values in view. I'm just going to delete those rows.

# Delete rows containing null values for view
df = df.dropna(subset=['view'])
# Check
df.view.isna().sum()
0

Sqft_basement

Let's take a closer look at sqft_basement.

  • This column refers to the size of the basement.
  • I could drop these rows.
  • I could replace them with the mean basement size.
  • I could assume they mean no basement and replace them with 0s.
df.sqft_basement.isna().sum()
415
# Make new dataframe with nulls for sqft_basement removed
df_b = df[df['sqft_basement'] == df['sqft_basement']]
df_b.sqft_basement.isna().sum()
0
# Cast to float for plotting: sqft_basement is still stored as strings at this point
plt.scatter(df_b['sqft_basement'].astype(float), df_b['price'], label='sqft_basement')
plt.title('sqft_basement vs price')
plt.legend()
plt.xlabel('sqft_basement')
plt.ylabel('price')
plt.show()

png

Sqft_basement does not seem to have much relationship to the price of a house. For that reason, I am going to replace the null values with 0. I am not worried about skewing my data because I probably won't use it anyway.

# Replace null values with 0
df.sqft_basement.replace(to_replace=np.nan, value=0.0, inplace=True)
# Check
df.sqft_basement.isna().sum()
0

Yr_renovated

Let's take a closer look at yr_renovated.

  • This is the year that the house was renovated.
  • I could drop the rows but then I would lose a lot of data.
  • I could assume null value means the house has not been renovated.
  • I don't think it would be appropriate to put another value in these cells.
  • I am going to replace null values with 0s because I don't think it will negatively impact my data too much.
df.yr_renovated.isna().sum()
3402
plt.scatter(df['yr_renovated'], df['price'], label='yr_renovated')
plt.title('yr_renovated vs price')
plt.legend()
plt.xlabel('yr_renovated')
plt.ylabel('price')
plt.show()

png

df.yr_renovated.value_counts()
0.0       15111
2014.0       64
2013.0       29
2005.0       27
2000.0       25
2007.0       24
2003.0       24
1990.0       22
2009.0       19
2006.0       18
2004.0       18
1989.0       17
2002.0       15
1998.0       15
2008.0       15
1984.0       14
2010.0       14
2001.0       14
1987.0       13
1999.0       13
1994.0       13
1991.0       13
2015.0       13
1997.0       12
1992.0       12
1983.0       11
1986.0       11
1985.0       11
1993.0       11
1988.0       10
          ...  
1975.0        5
1968.0        5
1969.0        4
1973.0        4
1960.0        3
1978.0        3
1958.0        3
1965.0        3
1972.0        3
1956.0        3
1981.0        3
1963.0        3
1945.0        2
1940.0        2
1957.0        2
1955.0        2
1946.0        1
1951.0        1
1944.0        1
1976.0        1
1934.0        1
1974.0        1
1962.0        1
1967.0        1
1948.0        1
1953.0        1
1950.0        1
1954.0        1
1959.0        1
1971.0        1
Name: yr_renovated, Length: 70, dtype: int64

Yr_renovated does not seem to have much relationship to the price of a house. For that reason, I am going to replace the null values with 0. I am not worried about skewing my data because I probably won't use it anyway.

# Replace null values in yr_renovated with 0
df.yr_renovated.replace(to_replace=np.nan, value=0.0, inplace=True)
#Check
df.isna().sum()
id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
df.isna().any().any()
False

Back to the top

Format data types

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19164 entries, 1 to 21596
Data columns (total 21 columns):
id               19164 non-null int64
date             19164 non-null object
price            19164 non-null float64
bedrooms         19164 non-null int64
bathrooms        19164 non-null float64
sqft_living      19164 non-null int64
sqft_lot         19164 non-null int64
floors           19164 non-null float64
waterfront       19164 non-null float64
view             19164 non-null float64
condition        19164 non-null int64
grade            19164 non-null int64
sqft_above       19164 non-null int64
sqft_basement    19164 non-null object
yr_built         19164 non-null int64
yr_renovated     19164 non-null float64
zipcode          19164 non-null int64
lat              19164 non-null float64
long             19164 non-null float64
sqft_living15    19164 non-null int64
sqft_lot15       19164 non-null int64
dtypes: float64(8), int64(11), object(2)
memory usage: 3.2+ MB

The next step is to format the data types. All of the columns are either int or float except for date and sqft_basement. Date should be type datetime so we can use it for modeling later. Column sqft_basement should be int for consistency with the other sqft variables. Column waterfront could also be cast to boolean, but since a 0/1 float behaves the same way in the regression, I will leave it as is.

pd.to_datetime(df['date'], format='%m/%d/%Y').head()
1   2014-12-09
2   2015-02-25
3   2014-12-09
4   2015-02-18
5   2014-05-12
Name: date, dtype: datetime64[ns]
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['sqft_basement'] = df['sqft_basement'].astype('float64').astype('int64')
#df['waterfront'] = df['waterfront'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19164 entries, 1 to 21596
Data columns (total 21 columns):
id               19164 non-null int64
date             19164 non-null datetime64[ns]
price            19164 non-null float64
bedrooms         19164 non-null int64
bathrooms        19164 non-null float64
sqft_living      19164 non-null int64
sqft_lot         19164 non-null int64
floors           19164 non-null float64
waterfront       19164 non-null float64
view             19164 non-null float64
condition        19164 non-null int64
grade            19164 non-null int64
sqft_above       19164 non-null int64
sqft_basement    19164 non-null int64
yr_built         19164 non-null int64
yr_renovated     19164 non-null float64
zipcode          19164 non-null int64
lat              19164 non-null float64
long             19164 non-null float64
sqft_living15    19164 non-null int64
sqft_lot15       19164 non-null int64
dtypes: datetime64[ns](1), float64(8), int64(12)
memory usage: 3.2 MB

Back to the top

Exploratory Data Analysis

Preliminary EDA

Let's look through the variables in our dataset.

  • Continuous variables - 'price', 'sqft_above', 'sqft_basement', 'sqft_living', 'sqft_lot', 'sqft_living15', 'sqft_lot15', 'lat', 'long'
  • Discrete variables - 'bedrooms', 'bathrooms', 'floors', 'yr_built', 'yr_renovated', 'view'
  • Categorical variables - 'condition', 'grade', 'zipcode'
  • Boolean variables - 'waterfront'
  • Unimportant variables - 'id'

The continuous variables will need to be scaled and normalized. The discrete variables can all be treated as continuous variables. The categorical variables will need to be binned and one-hot encoded before modeling. The binary variables will be ignored for now. Finally, the unimportant variables will be removed.
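The binning and one-hot encoding are shown below, while the scaling happens in the modeling section. As a rough sketch of what that scaling step can look like (log-transform followed by min-max scaling is one common choice, not necessarily the exact transform used later):

# Sketch: log-transform and min-max scale a continuous column such as sqft_living
logged = np.log(df['sqft_living'])
scaled = (logged - logged.min()) / (logged.max() - logged.min())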

# Flag feature pairs whose absolute correlation exceeds 0.7
abs(df.corr()) > .7
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
id True False False False False False False False False False False False False False False False False False False False
price False True False False True False False False False False False False False False False False False False False False
bedrooms False False True False False False False False False False False False False False False False False False False False
bathrooms False False False True True False False False False False False False False False False False False False False False
sqft_living False True False True True False False False False False True True False False False False False False True False
sqft_lot False False False False False True False False False False False False False False False False False False False True
floors False False False False False False True False False False False False False False False False False False False False
waterfront False False False False False False False True False False False False False False False False False False False False
view False False False False False False False False True False False False False False False False False False False False
condition False False False False False False False False False True False False False False False False False False False False
grade False False False False True False False False False False True True False False False False False False True False
sqft_above False False False False True False False False False False True True False False False False False False True False
sqft_basement False False False False False False False False False False False False True False False False False False False False
yr_built False False False False False False False False False False False False False True False False False False False False
yr_renovated False False False False False False False False False False False False False False True False False False False False
zipcode False False False False False False False False False False False False False False False True False False False False
lat False False False False False False False False False False False False False False False False True False False False
long False False False False False False False False False False False False False False False False False True False False
sqft_living15 False False False False True False False False False False True True False False False False False False True False
sqft_lot15 False False False False False True False False False False False False False False False False False False False True
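The boolean table above gets hard to scan with this many columns. A compact alternative is to list only the strongly correlated pairs (each pair will appear twice, once in each order):

# List feature pairs with absolute correlation above 0.7, excluding self-correlations
corr_abs = df.corr().abs()
pairs = corr_abs.unstack().sort_values(ascending=False)
pairs[(pairs > 0.7) & (pairs < 1.0)]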

Back to the top

Feature engineering

Let's create some new columns to describe our data.

  • total_sqft_inside = sqft_above + sqft_basement
  • total_sqft = sqft_living + sqft_lot
  • last_updated = greater of (yr_built, yr_renovated)
  • yr_sold = df['date'].dt.year
  • age_when_sold = yr_sold - yr_built
  • age_since_updated = yr_sold - last_updated
# Create yr_sold column
df['yr_sold'] = df['date'].dt.year
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 yr_sold
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 1951 1991.0 98125 47.7210 -122.319 1690 7639 2014
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 1933 0.0 98028 47.7379 -122.233 2720 8062 2015
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 1965 0.0 98136 47.5208 -122.393 1360 5000 2014
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 1987 0.0 98074 47.6168 -122.045 1800 7503 2015
5 7237550310 2014-05-12 1230000.0 4 4.50 5420 101930 1.0 0.0 0.0 3 11 3890 1530 2001 0.0 98053 47.6561 -122.005 4760 101930 2014
# Create age_when_sold column
df['age_when_sold'] = df['yr_sold'] - df['yr_built']
df.age_when_sold.unique()
array([ 63,  82,  49,  28,  13,  19,  55,  12,  72,  87,  37, 115,  36,
        20,  98,  93,  46,  67,  29,  73,  99, 105,  66,  10,  11,   9,
        85,  33,  84,  81, 110,  18,  30,   0,  92,  62,  59,  64,  34,
        88,   6,  24,  60,  35,  89,  41,  43,  50,  58,  74,  22,  51,
        53,   8,  26,  14,  75,  68,  48,  39, 104,  31, 114,   4,  69,
        90,  27, 101,  25,  54, 109,   7,  47,  52,  15,  38,  16,  61,
       113,  21,  94,  45, 100, 111,  96,  97,  86, 103,  78,  77, 107,
        40,  57, 102,  17,  71,   1,  23,  91,  42,  65,  95,  32,   3,
        56,  44, 112,   5,  70,   2,  80,  76,  83, 106, 108,  79,  -1])
# Create total_sqft_inside column
df['total_sqft_inside'] = df['sqft_above'] + df['sqft_basement']
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 yr_sold age_when_sold total_sqft_inside
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 1951 1991.0 98125 47.7210 -122.319 1690 7639 2014 63 2570
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 1933 0.0 98028 47.7379 -122.233 2720 8062 2015 82 770
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 1965 0.0 98136 47.5208 -122.393 1360 5000 2014 49 1960
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 1987 0.0 98074 47.6168 -122.045 1800 7503 2015 28 1680
5 7237550310 2014-05-12 1230000.0 4 4.50 5420 101930 1.0 0.0 0.0 3 11 3890 1530 2001 0.0 98053 47.6561 -122.005 4760 101930 2014 13 5420
# Create total_sqft column
df['total_sqft'] = df['sqft_living'] + df['sqft_lot']
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 yr_sold age_when_sold total_sqft_inside total_sqft
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 1951 1991.0 98125 47.7210 -122.319 1690 7639 2014 63 2570 9812
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 1933 0.0 98028 47.7379 -122.233 2720 8062 2015 82 770 10770
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 1965 0.0 98136 47.5208 -122.393 1360 5000 2014 49 1960 6960
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 1987 0.0 98074 47.6168 -122.045 1800 7503 2015 28 1680 9760
5 7237550310 2014-05-12 1230000.0 4 4.50 5420 101930 1.0 0.0 0.0 3 11 3890 1530 2001 0.0 98053 47.6561 -122.005 4760 101930 2014 13 5420 107350
# Create last_updated column
df['last_updated'] = np.nan
df.loc[df['yr_renovated'] > df['yr_built'], 'last_updated'] = df['yr_renovated']
df.loc[df['yr_renovated'] <= df['yr_built'], 'last_updated'] = df['yr_built']  
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 yr_sold age_when_sold total_sqft_inside total_sqft last_updated
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 1951 1991.0 98125 47.7210 -122.319 1690 7639 2014 63 2570 9812 1991.0
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 1933 0.0 98028 47.7379 -122.233 2720 8062 2015 82 770 10770 1933.0
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 1965 0.0 98136 47.5208 -122.393 1360 5000 2014 49 1960 6960 1965.0
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 1987 0.0 98074 47.6168 -122.045 1800 7503 2015 28 1680 9760 1987.0
5 7237550310 2014-05-12 1230000.0 4 4.50 5420 101930 1.0 0.0 0.0 3 11 3890 1530 2001 0.0 98053 47.6561 -122.005 4760 101930 2014 13 5420 107350 2001.0
# Create age_since_updated column
df['age_since_updated'] = df['yr_sold'] - df['last_updated']
df['age_since_updated'].values[df['age_since_updated'].values <= 0] = 0.1
df['age_since_updated'].unique()
array([2.30e+01, 8.20e+01, 4.90e+01, 2.80e+01, 1.30e+01, 1.90e+01,
       5.50e+01, 1.20e+01, 7.20e+01, 8.70e+01, 3.70e+01, 1.15e+02,
       3.60e+01, 2.00e+01, 9.80e+01, 9.30e+01, 4.60e+01, 6.70e+01,
       2.90e+01, 7.30e+01, 9.90e+01, 1.05e+02, 6.60e+01, 1.00e+01,
       1.10e+01, 9.00e+00, 8.50e+01, 3.30e+01, 8.10e+01, 1.10e+02,
       1.80e+01, 3.00e+01, 1.00e-01, 9.20e+01, 6.20e+01, 5.90e+01,
       6.40e+01, 3.40e+01, 8.80e+01, 6.00e+00, 2.40e+01, 6.00e+01,
       3.50e+01, 8.90e+01, 4.10e+01, 4.30e+01, 5.00e+01, 5.80e+01,
       7.40e+01, 2.20e+01, 5.10e+01, 5.30e+01, 8.00e+00, 2.60e+01,
       1.40e+01, 7.50e+01, 4.80e+01, 3.90e+01, 1.04e+02, 3.10e+01,
       4.00e+00, 1.14e+02, 6.90e+01, 9.00e+01, 2.70e+01, 2.50e+01,
       5.40e+01, 1.09e+02, 7.00e+00, 6.80e+01, 4.70e+01, 1.00e+00,
       5.20e+01, 1.50e+01, 3.80e+01, 1.60e+01, 6.10e+01, 1.13e+02,
       2.10e+01, 9.40e+01, 4.50e+01, 1.00e+02, 9.60e+01, 9.70e+01,
       8.60e+01, 1.03e+02, 7.80e+01, 7.70e+01, 1.07e+02, 8.40e+01,
       4.00e+01, 5.70e+01, 1.02e+02, 1.70e+01, 7.10e+01, 6.30e+01,
       9.10e+01, 4.20e+01, 6.50e+01, 3.00e+00, 9.50e+01, 3.20e+01,
       5.60e+01, 4.40e+01, 1.12e+02, 5.00e+00, 7.00e+01, 2.00e+00,
       8.00e+01, 7.60e+01, 1.01e+02, 8.30e+01, 1.06e+02, 1.08e+02,
       1.11e+02, 7.90e+01])
df.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 yr_sold age_when_sold total_sqft_inside total_sqft last_updated age_since_updated
1 6414100192 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 1951 1991.0 98125 47.7210 -122.319 1690 7639 2014 63 2570 9812 1991.0 23.0
2 5631500400 2015-02-25 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 1933 0.0 98028 47.7379 -122.233 2720 8062 2015 82 770 10770 1933.0 82.0
3 2487200875 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 1965 0.0 98136 47.5208 -122.393 1360 5000 2014 49 1960 6960 1965.0 49.0
4 1954400510 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 1987 0.0 98074 47.6168 -122.045 1800 7503 2015 28 1680 9760 1987.0 28.0
5 7237550310 2014-05-12 1230000.0 4 4.50 5420 101930 1.0 0.0 0.0 3 11 3890 1530 2001 0.0 98053 47.6561 -122.005 4760 101930 2014 13 5420 107350 2001.0 13.0
df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'yr_sold',
       'age_when_sold', 'total_sqft_inside', 'total_sqft', 'last_updated',
       'age_since_updated'],
      dtype='object')

Back to the top

Eliminate extraneous features

Let's eliminate some of the columns that we definitely won't use in the regression model.

# Check linearity assumption with scatter plots
data = ['bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'yr_sold',
       'age_when_sold', 'total_sqft_inside', 'total_sqft', 'last_updated',
       'age_since_updated']

for column in data:
    plt.scatter(df[column], df.price, label=column)
    plt.legend()
    plt.show()

png (one scatter plot per feature in the list above, each plotted against price)

From these scatter plots, I can determine which columns have linear relationships with price and eliminate the others.

I will eventually eliminate sqft_lot, floors, view, sqft_basement, yr_built, yr_renovated, zipcode, sqft_lot15, yr_sold, age_when_sold, total_sqft, and last_updated, and keep bedrooms, bathrooms, sqft_living, waterfront, condition, grade, sqft_above, lat, long, sqft_living15, total_sqft_inside, and age_since_updated.

For now, the code below only drops the columns that clearly won't be used (id, date, and the raw year columns); the remaining candidates are revisited when resolving multicollinearity before modeling.

df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'yr_sold',
       'age_when_sold', 'total_sqft_inside', 'total_sqft', 'last_updated',
       'age_since_updated'],
      dtype='object')
df2 = df.drop(['id', 'date', 'yr_built', 'yr_renovated', 'yr_sold',
               'last_updated'], axis=1)
df2.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'zipcode', 'lat', 'long', 'sqft_living15',
       'sqft_lot15', 'age_when_sold', 'total_sqft_inside', 'total_sqft',
       'age_since_updated'],
      dtype='object')
# This takes a really long time and sometimes crashes
pd.plotting.scatter_matrix(df2,figsize  = [9, 9], diagonal='hist', alpha=0.6);

png

df2.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated
count 1.916400e+04 19164.000000 19164.000000 19164.000000 1.916400e+04 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 1.916400e+04 19164.000000
mean 5.414490e+05 3.374452 2.117029 2082.038301 1.506174e+04 1.495173 0.007566 0.231058 3.409987 7.660718 1791.425433 284.380662 98077.726466 47.560018 -122.214011 1987.221874 12797.653882 43.282091 2075.806095 1.714378e+04 41.363677
std 3.709009e+05 0.928676 0.769241 921.918226 4.077215e+04 0.540308 0.086657 0.763368 0.650456 1.175601 831.776073 439.443059 53.448716 0.138661 0.140728 684.787028 27552.634083 29.389697 921.149889 4.094228e+04 28.972579
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000 -1.000000 370.000000 1.452000e+03 0.100000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1200.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000 18.000000 1420.000000 7.032000e+03 16.000000
50% 4.500000e+05 3.000000 2.250000 1920.000000 7.620000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 0.000000 98065.000000 47.572000 -122.231000 1840.000000 7620.000000 39.000000 1910.000000 9.578000e+03 37.000000
75% 6.439625e+05 4.000000 2.500000 2550.000000 1.072000e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2217.750000 540.000000 98117.000000 47.678100 -122.125000 2360.000000 10093.250000 63.000000 2550.000000 1.301000e+04 61.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 4820.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 115.000000 13540.000000 1.652659e+06 115.000000

I still need to remove outliers. That will come soon.

# Let's look at a heatmap of the correlation matrix
corr = df2.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr, annot=True, fmt='.1g', cmap=sns.color_palette('coolwarm'))
plt.show()

png

Columns with a high correlation to price include sqft_living, grade, and total_sqft_inside. Columns sqft_living, total_sqft_inside and grade are also correlated to each other. I will handle multicollinearity soon.

df2.hist(figsize=(16,12)) ;

png

I will need to transform my categorical values.

target = df2.price
for column in df2.columns:
    plt.scatter(df2[column], target, label=column)
    plt.legend()
    plt.show()

png (one scatter plot per df2 column, each plotted against price)

Back to the top

Remove outliers

Several of my columns have outliers. I want to write a function that will easily remove the outliers from each column. I'm going to use the interquartile range (IQR) of the column to set lower and upper bounds, then remove any points that fall outside those bounds. Specifically, I will remove all points more than two IQRs below the first quartile or above the third quartile.

def remove_outlier(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound  = q1 - (2 * iqr)
    upper_bound = q3 + (2 * iqr)
    out_df = df.loc[(df[col] > lower_bound) & (df[col] < upper_bound)]
    return out_df
df2.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'zipcode', 'lat', 'long', 'sqft_living15',
       'sqft_lot15', 'age_when_sold', 'total_sqft_inside', 'total_sqft',
       'age_since_updated'],
      dtype='object')
df2.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated
count 1.916400e+04 19164.000000 19164.000000 19164.000000 1.916400e+04 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 19164.000000 1.916400e+04 19164.000000
mean 5.414490e+05 3.374452 2.117029 2082.038301 1.506174e+04 1.495173 0.007566 0.231058 3.409987 7.660718 1791.425433 284.380662 98077.726466 47.560018 -122.214011 1987.221874 12797.653882 43.282091 2075.806095 1.714378e+04 41.363677
std 3.709009e+05 0.928676 0.769241 921.918226 4.077215e+04 0.540308 0.086657 0.763368 0.650456 1.175601 831.776073 439.443059 53.448716 0.138661 0.140728 684.787028 27552.634083 29.389697 921.149889 4.094228e+04 28.972579
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000 -1.000000 370.000000 1.452000e+03 0.100000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1200.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000 18.000000 1420.000000 7.032000e+03 16.000000
50% 4.500000e+05 3.000000 2.250000 1920.000000 7.620000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 0.000000 98065.000000 47.572000 -122.231000 1840.000000 7620.000000 39.000000 1910.000000 9.578000e+03 37.000000
75% 6.439625e+05 4.000000 2.500000 2550.000000 1.072000e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2217.750000 540.000000 98117.000000 47.678100 -122.125000 2360.000000 10093.250000 63.000000 2550.000000 1.301000e+04 61.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 4820.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 115.000000 13540.000000 1.652659e+06 115.000000

Bedrooms clearly has some outliers. Let's test the function on that column.

sns.boxplot(x=df2['bedrooms']).set_title('Box Plot of Bedrooms') ;

png

df2 = remove_outlier(df2, 'bedrooms')
sns.boxplot(x=df2['bedrooms']).set_title('Box Plot of Bedrooms') ;

png

df2 = remove_outlier(df2, 'sqft_lot')
sns.boxplot(x=df2['sqft_lot']).set_title('Box Plot of Lot Size') ;

png

df2 = remove_outlier(df2, 'sqft_lot15')
sns.boxplot(x=df2['sqft_lot15']).set_title('Box Plot of Neighboring Lot Sizes') ;

png

sns.boxplot(x=df2['long']).set_title('Box Plot of Longitude') ;

png

sns.boxplot(x=df2['lat']).set_title('Box Plot of Latitude') ;

png

df2 = remove_outlier(df2, 'long')
df2 = remove_outlier(df2, 'lat')
sns.boxplot(x=df2['long']).set_title('Box Plot of Longitude') ;

png

df2.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated
count 1.638200e+04 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000 16382.000000
mean 5.155945e+05 3.329508 2.075235 1985.477170 7355.434562 1.494201 0.003968 0.198327 3.412282 7.581553 1705.778476 274.095349 98080.828043 47.562858 -122.230745 1915.266756 7153.541936 43.713832 1979.873825 9340.911732 41.852771
std 3.160177e+05 0.801885 0.727018 809.415974 3677.205602 0.547090 0.062867 0.697289 0.650867 1.076539 735.350401 414.214514 54.087096 0.136627 0.125561 630.291515 3161.678554 30.056369 809.792088 3973.101692 29.662303
min 7.800000e+04 2.000000 0.500000 520.000000 520.000000 1.000000 0.000000 0.000000 1.000000 4.000000 480.000000 0.000000 98001.000000 47.177500 -122.503000 620.000000 651.000000 -1.000000 520.000000 1452.000000 0.100000
25% 3.150000e+05 3.000000 1.500000 1400.000000 4871.000000 1.000000 0.000000 0.000000 3.000000 7.000000 1170.000000 0.000000 98033.000000 47.481500 -122.335000 1460.000000 5000.000000 16.000000 1390.000000 6678.000000 14.000000
50% 4.400000e+05 3.000000 2.250000 1840.000000 7200.000000 1.500000 0.000000 0.000000 3.000000 7.000000 1510.000000 0.000000 98074.000000 47.574800 -122.264000 1780.000000 7210.000000 41.000000 1840.000000 9030.000000 38.000000
75% 6.200000e+05 4.000000 2.500000 2420.000000 9329.750000 2.000000 0.000000 0.000000 4.000000 8.000000 2090.000000 530.000000 98118.000000 47.679200 -122.153000 2250.000000 8993.500000 64.000000 2420.000000 11363.000000 62.000000
max 4.670000e+06 5.000000 6.750000 9640.000000 21883.000000 3.500000 1.000000 4.000000 5.000000 13.000000 6110.000000 4820.000000 98199.000000 47.777600 -121.784000 5790.000000 17707.000000 115.000000 9640.000000 25818.000000 115.000000

That looks much better!

Back to the top

Transform categorical variables

While there are no obvious categorical variables (such as those stored as strings), there are ordinal variables. These ordinal variables include condition and grade. I will transform these columns by binning them and one-hot encoding the bins.

df2.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'zipcode', 'lat', 'long', 'sqft_living15',
       'sqft_lot15', 'age_when_sold', 'total_sqft_inside', 'total_sqft',
       'age_since_updated'],
      dtype='object')
df2.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 98125 47.7210 -122.319 1690 7639 63 2570 9812 23.0
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 98028 47.7379 -122.233 2720 8062 82 770 10770 82.0
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 98136 47.5208 -122.393 1360 5000 49 1960 6960 49.0
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 98074 47.6168 -122.045 1800 7503 28 1680 9760 28.0
6 257500.0 3 2.25 1715 6819 2.0 0.0 0.0 3 7 1715 0 98003 47.3097 -122.327 2238 6819 19 1715 8534 19.0

Create bins

# Work on a copy so df2 keeps the original condition and grade columns
df_bins = df2.copy()

Grade

I first considered dropping grade rather than transforming it and getting a bunch of extra columns, but instead I will bin it into a smaller set of categories before encoding.

# df2 = df2.drop(['grade'], axis=1)
df_bins.grade.value_counts().plot(kind='bar').set_title('Grade Value Counts') ;

png

# First, create bins based on the values observed
# 8 bin edges will result in 7 bins
bins = [0, 5, 6, 7, 8, 9, 10, 13]
# use pd.cut
bins_grade = pd.cut(df_bins['grade'], bins)
# pd.cut returns ordered categories.
# Convert them to unordered categories before one-hot encoding.
bins_grade = bins_grade.cat.as_unordered()
bins_grade.head()
1    (6, 7]
2    (5, 6]
3    (6, 7]
4    (7, 8]
6    (6, 7]
Name: grade, dtype: category
Categories (7, interval[int64]): [(0, 5], (5, 6], (6, 7], (7, 8], (8, 9], (9, 10], (10, 13]]
# Inspect the result
bins_grade.value_counts().plot(kind='bar').set_title('Grade Value Counts') ;

png

Condition

df_bins.condition.value_counts().plot(kind='bar').set_title('Condition Value Counts') ;

png

# First, create bins based on the values observed
# 4 bin edges will result in 3 bins
bins = [0, 2, 3, 5]
# use pd.cut
bins_condition = pd.cut(df_bins['condition'], bins)
bins_condition.head()
1    (2, 3]
2    (2, 3]
3    (3, 5]
4    (2, 3]
6    (2, 3]
Name: condition, dtype: category
Categories (3, interval[int64]): [(0, 2] < (2, 3] < (3, 5]]
# pd.cut returns ordered categories.
# Convert them to unordered categories before one-hot encoding.
bins_condition = bins_condition.cat.as_unordered()
bins_condition.head()
1    (2, 3]
2    (2, 3]
3    (3, 5]
4    (2, 3]
6    (2, 3]
Name: condition, dtype: category
Categories (3, interval[int64]): [(0, 2], (2, 3], (3, 5]]
# Inspect the result
bins_condition.value_counts().plot(kind='bar').set_title('Condition Value Counts') ;

png

Set up dummy variables

# Set up dummy variable
grade_dummies = pd.get_dummies(bins_grade, prefix="grade",
                              drop_first=True)
condition_dummies = pd.get_dummies(bins_condition, 
                                   prefix="condition", 
                                   drop_first=True)
condition_dummies.head()
condition_(2, 3] condition_(3, 5]
1 1 0
2 1 0
3 0 1
4 1 0
6 1 0
grade_dummies.head()
grade_(5, 6] grade_(6, 7] grade_(7, 8] grade_(8, 9] grade_(9, 10] grade_(10, 13]
1 0 1 0 0 0 0
2 1 0 0 0 0 0
3 0 1 0 0 0 0
4 0 0 1 0 0 0
6 0 1 0 0 0 0
# Drop the original column and put in our dummy column
df_bins = df_bins.drop(['condition', 'grade'], axis=1)
df_bins = pd.concat([df_bins, condition_dummies, grade_dummies], axis=1)
df_bins.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated condition_(2, 3] condition_(3, 5] grade_(5, 6] grade_(6, 7] grade_(7, 8] grade_(8, 9] grade_(9, 10] grade_(10, 13]
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 2170 400 98125 47.7210 -122.319 1690 7639 63 2570 9812 23.0 1 0 0 1 0 0 0 0
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 770 0 98028 47.7379 -122.233 2720 8062 82 770 10770 82.0 1 0 1 0 0 0 0 0
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 1050 910 98136 47.5208 -122.393 1360 5000 49 1960 6960 49.0 0 1 0 1 0 0 0 0
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 1680 0 98074 47.6168 -122.045 1800 7503 28 1680 9760 28.0 1 0 0 0 1 0 0 0
6 257500.0 3 2.25 1715 6819 2.0 0.0 0.0 1715 0 98003 47.3097 -122.327 2238 6819 19 1715 8534 19.0 1 0 0 1 0 0 0 0
df2.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400 98125 47.7210 -122.319 1690 7639 63 2570 9812 23.0
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0 98028 47.7379 -122.233 2720 8062 82 770 10770 82.0
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910 98136 47.5208 -122.393 1360 5000 49 1960 6960 49.0
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0 98074 47.6168 -122.045 1800 7503 28 1680 9760 28.0
6 257500.0 3 2.25 1715 6819 2.0 0.0 0.0 3 7 1715 0 98003 47.3097 -122.327 2238 6819 19 1715 8534 19.0

All categorical variables have been transformed!

Back to the top

Question 1

What effect does waterfront have on price?

Waterfront properties account for a small portion of the house sales in King County. Nevertheless, it seems to have a sizable impact on price. When I created the map of house prices, I noticed that waterfront properties are consistently in the upper quartile of prices. While it might not be statistically significant, I'm interested enough to explore this question more deeply.

Reminder: df2 differs from df in that extraneous features and outliers have been removed.

plt.scatter(df2['grade'], df2['price'], c=df2['waterfront'], 
            cmap='coolwarm')
plt.title('Grade vs price colored by waterfront')
plt.xlabel('grade')
plt.ylabel('price')
plt.show()

png

plt.scatter(df2['view'], df2['price'], c=df2['waterfront'], 
            cmap='coolwarm')
plt.title('Number of views vs price colored by waterfront')
plt.xlabel('number of views')
plt.ylabel('price')
plt.show()

png

df['view'].value_counts()
0.0    17312
2.0      836
3.0      435
1.0      291
4.0      290
Name: view, dtype: int64
# Violin plot of waterfront vs price
sns.violinplot(x=df2['waterfront'], y=df2['price'], 
               palette="Blues").set_title('Distribution of price by waterfront') ;

png

#Visualizing the data with outliers
sns.pairplot(df, 
             x_vars=["bedrooms", "total_sqft_inside", "total_sqft"], 
             y_vars=["price"],
             hue = "waterfront") ;

png

I couldn't figure out how to add titles to these pair plots; one possible approach is sketched below.
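
A minimal sketch of one way to do it (not from the original notebook): sns.pairplot returns a PairGrid object, and a figure-level title can be attached to the figure it wraps.

# Possible approach: capture the PairGrid returned by pairplot and add a suptitle
g = sns.pairplot(df, 
                 x_vars=["bedrooms", "total_sqft_inside", "total_sqft"], 
                 y_vars=["price"],
                 hue="waterfront")
g.fig.suptitle('Price vs size features (with outliers)', y=1.05) ;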

#Visualizing the data without outliers
sns.pairplot(df2, 
             x_vars=["bedrooms", "total_sqft_inside", "total_sqft"], 
             y_vars=["price"],
             hue = "waterfront") ;

png

Question 1 Conclusions

Clearly demand for waterfront properties is very high. The vast majority of properties were viewed 0 times; however, most waterfront properties were viewed 4 times. This leads me to believe that there are a lot of buyers in the market looking for waterfront properties. High demand drives up price. The violin plot shows that waterfront properties tend to have a wider distribution of prices and are generally more expensive.

Waterfront properties are some of the most expensive in relation to their size. In the pairplot visualization above, the yellow dots represent waterfront properties. We can see that waterfront properties generally sell for a higher price compared to other houses of similar size. Waterfront properties are generally not the largest of the houses in King County, but they are usually on the upper end of the price scale for their size. Waterfront properties also tend to have smaller lot sizes than other houses in the same price range.
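
A quick way to put a number on this (a small check, not in the original notebook): compare the median sale price of waterfront and non-waterfront homes.

# Median sale price by waterfront status (assumes df2 from the cleaning steps above)
print(df2.groupby('waterfront')['price'].median())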

Back to the top

Question 2

What effect does location have on price?

I'm going to create a map of King County with all house sales marked and color coded based on price. I want to see trends in the locations of the most expensive houses. I also want to make a map of the density of house sales to see which areas are the most populated.

m = folium.Map([df2['lat'].mean(), df2['long'].mean()], 
               zoom_start=11)
for index, row in df.iterrows():
    folium.CircleMarker([row['lat'], row['long']],
                        radius=3,
                        fill_color="#000000",
                        fill=True
                       ).add_to(m)
m.save('map.html')
df2['price'].describe()
count    1.638200e+04
mean     5.155945e+05
std      3.160177e+05
min      7.800000e+04
25%      3.150000e+05
50%      4.400000e+05
75%      6.200000e+05
max      4.670000e+06
Name: price, dtype: float64
Q1_priced_houses = df2.loc[df2['price'] < 315000]
Q2_priced_houses = df2.loc[(df2['price'] >= 315000) & (df2['price'] < 440000)]
Q3_priced_houses = df2.loc[(df2['price'] >= 440000) & (df2['price'] < 620000)]
Q4_priced_houses = df2.loc[df2['price'] >= 620000]
min_lat=min(df2.lat)
max_lat=max(df2.lat)
min_lon=min(df2.long)
max_lon=max(df2.long)
print((min_lat, max_lat), (min_lon, max_lon))
(47.1775, 47.7776) (-122.50299999999999, -121.78399999999999)
m2 = folium.Map([47.5112, -122.257], zoom_start=10.5, 
                min_lat=min_lat, max_lat=max_lat, 
                min_lon=min_lon, max_lon=max_lon)
# Creating color coded dots for different quarters of house prices
price_quartiles = [(Q1_priced_houses, "#ffff00"),
                   (Q2_priced_houses, "#00ff00"),
                   (Q3_priced_houses, "#05c1ff"),
                   (Q4_priced_houses, "#0000ff")]
for houses, color in price_quartiles:
    for index, row in houses.iterrows():
        folium.CircleMarker([row['lat'], row['long']],
                            radius=1,
                            color=color,
                            popup=f"Price: {row['price']} Sqft: {row['sqft_living']} Bedrooms: {row['bedrooms']}",
                            fill=True
                           ).add_to(m2)
m2.save('color_map.html')
# This takes a long time and sometimes crashes or shows a blank box
#m2
# 2D density plot
sns.set_style("white")
sns.kdeplot(df.long, df.lat)
 
# Two options specific to 2D KDE plots:
# the color palette and whether to shade the lowest density range
sns.kdeplot(df.long, df.lat, cmap="Blues", shade=True, shade_lowest=True)
plt.title('Density Map of Housing Sales')
plt.show()

png

# Build cluster map to show density of house sales
mc = folium.Map([47.5112, -122.257], zoom_start=10.5, 
                min_lat=min_lat, max_lat=max_lat, 
                min_lon=min_lon, max_lon=max_lon)
locations = list(zip(df2.lat, df2.long))

#popup_content = []
#for house in df.itertuples():
#    p = "Price: {} ".format(df.price)
#    s = "Sqft: {}".format(df.sqft_living)
#    b = "Bedrooms: {}".format(df.bedrooms)
#    content = p + s + b
#    popup_content.append(content)
#popups = [folium.Popup(content) for content in popup_content]

icons = [folium.Icon(prefix="fa") for _ in range(len(locations))]

cluster = MarkerCluster(locations=locations, icons=icons, 
                        name='Cluster Map of Housing Sales')
mc.add_child(cluster)
mc.save('color_map_clusters.html')
# This takes a long time and sometimes crashes or shows a blank box
#mc
df.zipcode.unique().shape
(70,)
plt.scatter(df2['zipcode'], df2['price'], c=df2['sqft_living'], 
            cmap='viridis')
plt.title('Zipcode vs price colored by sqft_living')
plt.xlabel('zipcode')
plt.ylabel('price')
plt.show()

png

Question 2 Conclusions

Open color_map.html and color_map_clusters.html (saved in the project folder) to view the relevant maps.

Location clearly has a big impact on price. This relationship is most evident in color_map.html, where the most expensive properties can be seen clustered together. The scatterplot of zipcode vs price colored by sqft_living also shows that some zipcodes contain a larger share of expensive houses than others.

I could not get my zipcode map to work. For future work, I want to create the zipcode map with the prices color coded. I would also like to create centroids for each zipcode, using either the average lat/long of its sales or the PO Box map on the KC gov website, so that zipcode could feed into my regression model; a rough sketch of the centroid idea follows below. I think location has a huge impact on price, and I am unhappy with how lat and long are currently represented in my model because it doesn't account for regionality properly. Another strategy could be to split the lat and long data into the 4 quadrants of the map, though that would not be as good as having all the zipcodes.
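
A minimal sketch of the centroid idea (not from the original notebook; the distance_to_downtown feature and the downtown Seattle reference point are illustrative assumptions):

# Hypothetical sketch: per-zipcode centroids plus a simple distance feature
zip_centroids = df2.groupby('zipcode')[['lat', 'long']].mean().rename(
    columns={'lat': 'centroid_lat', 'long': 'centroid_long'}).reset_index()
df_geo = df2.merge(zip_centroids, on='zipcode', how='left')

# Illustrative derived feature: distance (in degrees) to downtown Seattle (assumed reference point)
df_geo['distance_to_downtown'] = np.sqrt((df_geo['lat'] - 47.6062)**2 + (df_geo['long'] + 122.3321)**2)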

Back to the top

Question 3

What effect does size have on price?

I relied on feature engineering to help me answer this question. The original dataset included the following columns related to house size: 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'sqft_living15', and 'sqft_lot15'.

After examining these columns, I decided to engineer the following new columns:

  • 'total_sqft_inside' = 'sqft_above' + 'sqft_basement'
  • 'total_sqft' = 'sqft_living' + 'sqft_lot'
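
For reference, a minimal sketch of how these columns could be computed (the actual engineering happens in the processing steps earlier in the notebook; df here stands for the cleaned dataframe):

# Sketch of the engineered size features
df['total_sqft_inside'] = df['sqft_above'] + df['sqft_basement']
df['total_sqft'] = df['sqft_living'] + df['sqft_lot']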

Let's examine the relationship between these features and price.

df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'yr_sold',
       'age_when_sold', 'total_sqft_inside', 'total_sqft', 'last_updated',
       'age_since_updated'],
      dtype='object')
results = [['ind_var', 'r_squared', 'intercept', 'slope', 'p-value', 'normality(JB)']]

for idx, val in enumerate(['total_sqft_inside', 'total_sqft']):
    print("King County Housing DataSet - Regression Analysis  and Diagnostics for formula: price~" + val)
    print ("-------------------------------------------------------------------------------------")

    f = 'price~' + val
    
    model = smf.ols(formula=f, data=df2).fit()
    
    X_new = pd.DataFrame({val: [df2[val].min(), df2[val].max()]});
    preds = model.predict(X_new)
    df2.plot(kind='scatter', x=val, y='price');
    plt.plot(X_new, preds, c='red', linewidth=2);
    plt.show()
    fig = plt.figure(figsize=(15,8))
    fig = sm.graphics.plot_regress_exog(model, val, fig=fig)
    fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
    plt.show()
    
    results.append([val, model.rsquared, model.params[0], model.params[1], model.pvalues[1], sms.jarque_bera(model.resid)[0]])
#    input("Press Enter to continue...")
King County Housing DataSet - Regression Analysis  and Diagnostics for formula: price~total_sqft_inside
-------------------------------------------------------------------------------------

png

png

png

King County Housing DataSet - Regression Analysis  and Diagnostics for formula: price~total_sqft
-------------------------------------------------------------------------------------

png

png

png

As a last step, let's build a map of property locations colored by total_sqft_inside.

df2['total_sqft_inside'].describe()
count    16382.000000
mean      1979.873825
std        809.792088
min        520.000000
25%       1390.000000
50%       1840.000000
75%       2420.000000
max       9640.000000
Name: total_sqft_inside, dtype: float64
Q4_size_houses.describe()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement zipcode lat long sqft_living15 sqft_lot15 age_when_sold total_sqft_inside total_sqft age_since_updated
count 1.0 1.0 1.00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000 1.00 1.0 1.0 1.0 1.0 1.0 1.0
mean 4670000.0 5.0 6.75 9640.0 13068.0 1.0 1.0 4.0 3.0 12.0 4820.0 4820.0 98040.0 47.557 -122.21 3270.0 10454.0 31.0 9640.0 22708.0 5.0
std NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 4670000.0 5.0 6.75 9640.0 13068.0 1.0 1.0 4.0 3.0 12.0 4820.0 4820.0 98040.0 47.557 -122.21 3270.0 10454.0 31.0 9640.0 22708.0 5.0
25% 4670000.0 5.0 6.75 9640.0 13068.0 1.0 1.0 4.0 3.0 12.0 4820.0 4820.0 98040.0 47.557 -122.21 3270.0 10454.0 31.0 9640.0 22708.0 5.0
50% 4670000.0 5.0 6.75 9640.0 13068.0 1.0 1.0 4.0 3.0 12.0 4820.0 4820.0 98040.0 47.557 -122.21 3270.0 10454.0 31.0 9640.0 22708.0 5.0
75% 4670000.0 5.0 6.75 9640.0 13068.0 1.0 1.0 4.0 3.0 12.0 4820.0 4820.0 98040.0 47.557 -122.21 3270.0 10454.0 31.0 9640.0 22708.0 5.0
max 4670000.0 5.0 6.75 9640.0 13068.0 1.0 1.0 4.0 3.0 12.0 4820.0 4820.0 98040.0 47.557 -122.21 3270.0 10454.0 31.0 9640.0 22708.0 5.0
Q1_size_houses = df2.loc[df2['total_sqft_inside'] < 1390]
Q2_size_houses = df2.loc[(df2['total_sqft_inside'] >= 1390) & (df2['total_sqft_inside'] < 1840)]
Q3_size_houses = df2.loc[(df2['total_sqft_inside'] >= 1840) & (df2['total_sqft_inside'] < 2420)]
Q4_size_houses = df2.loc[df2['total_sqft_inside'] >= 2420]  # top quartile, from the 75th percentile up
m_size = folium.Map([47.5112, -122.257], zoom_start=10.5, 
                min_lat=min_lat, max_lat=max_lat, 
                min_lon=min_lon, max_lon=max_lon)
# Creating color coded dots for different quarters of house sizes
size_quartiles = [(Q1_size_houses, "#ffff00"),
                  (Q2_size_houses, "#00ff00"),
                  (Q3_size_houses, "#05c1ff"),
                  (Q4_size_houses, "#0000ff")]
for houses, color in size_quartiles:
    for index, row in houses.iterrows():
        folium.CircleMarker([row['lat'], row['long']],
                            radius=1,
                            color=color,
                            popup=f"Price: {row['price']} Sqft: {row['total_sqft_inside']} Bedrooms: {row['bedrooms']}",
                            fill=True
                           ).add_to(m_size)
m_size.save('size_color_map.html')

Question 3 Conclusions

After running basic linear regressions for both 'total_sqft_inside' and 'total_sqft', I've determined that 'total_sqft_inside' is a much better predictor of house price. Total sqft inside is the sum of sqft above and sqft basement. Generally, larger houses sell for higher prices.

Back to the top

Resolve multicollinearity

df2.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'zipcode', 'lat', 'long', 'sqft_living15',
       'sqft_lot15', 'age_when_sold', 'total_sqft_inside', 'total_sqft',
       'age_since_updated'],
      dtype='object')
# Let's look at a heatmap of variables that are left
corr = df2.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr, annot=True, fmt='.1g', cmap=sns.color_palette('Blues'))
plt.show()

png

df3 = df2.drop(['sqft_living', 'sqft_lot', 'sqft_living15', 'sqft_lot15', 
                'sqft_above', 'sqft_basement', 'bathrooms', 'age_since_updated',
                'bedrooms', 'floors', 'grade'], axis=1)
corr_sub = df3.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr_sub, annot=True, fmt='.1g', cmap=sns.color_palette('Blues'))
plt.show()

png
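
As a complementary check (not in the original notebook), variance inflation factors could be computed for the remaining predictors; values above roughly 5-10 would flag lingering multicollinearity.

# Hypothetical check: variance inflation factors for the remaining predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = df3.drop(['price'], axis=1)
for i, col in enumerate(predictors.columns):
    print(col, round(variance_inflation_factor(predictors.values, i), 2))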

Back to the top

Scale and normalize variables

Next I need to scale and normalize my continuous variables. The continuous variables include price, lat, long, age_when_sold, total_sqft_inside, and total_sqft.

df3.columns
Index(['price', 'waterfront', 'view', 'condition', 'zipcode', 'lat', 'long',
       'age_when_sold', 'total_sqft_inside', 'total_sqft'],
      dtype='object')
# First, I check out what the distribution looks like before transformation. 
df3[['price', 'waterfront', 'lat', 'long', 'total_sqft_inside', 'total_sqft']].hist(figsize = [12,12]) ;

png

# Log-transform total_sqft_inside
# (price and total_sqft could also be transformed, but I'm going to wait on those for analysis purposes)
# Creating a new dataframe for log transformations and plotting histograms
df_log = pd.DataFrame([])
#df_log['log_price'] = np.log(df3['price'])
df_log['log_total_sqft_inside'] = np.log(df3['total_sqft_inside'])
#df_log['log_total_sqft'] = np.log(df3['total_sqft'])
#df_log['log_age_since_updated'] = np.log(df3['age_since_updated'])

df_log.hist(figsize=[12,12]) ;

png

What to do with lat and long?

  • A log transformation won't work (long contains negative values)
  • Let's try scaling them by their vector norms instead
scaled = pd.DataFrame([])
lat = df3['lat']
long = df3['long']
scaled["lat"] = (lat)/(np.linalg.norm(lat))
scaled["long"] = (long)/(np.linalg.norm(long))
scaled.hist(figsize  = [8, 8]);

png
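
An alternative I did not pursue here (a hedged sketch only): z-score standardization, which centers each coordinate at zero with unit variance instead of dividing by the vector norm.

# Hypothetical alternative: z-score standardization of lat and long
scaled_z = pd.DataFrame([])
scaled_z['lat'] = (df3['lat'] - df3['lat'].mean()) / df3['lat'].std()
scaled_z['long'] = (df3['long'] - df3['long'].mean()) / df3['long'].std()
scaled_z.hist(figsize=[8, 8]) ;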

df3.columns
Index(['price', 'waterfront', 'view', 'condition', 'zipcode', 'lat', 'long',
       'age_when_sold', 'total_sqft_inside', 'total_sqft'],
      dtype='object')
df3['lat_scaled'] = scaled["lat"]
df3['long_scaled'] = scaled["long"]
df3['log_total_sqft_inside'] = df_log['log_total_sqft_inside']
#df3['log_total_sqft'] = df_log['log_total_sqft']
df3.head()
price waterfront view condition zipcode lat long age_when_sold total_sqft_inside total_sqft lat_scaled long_scaled log_total_sqft_inside
1 538000.0 0.0 0.0 3 98125 47.7210 -122.319 63 2570 9812 0.007839 -0.007819 7.851661
2 180000.0 0.0 0.0 3 98028 47.7379 -122.233 82 770 10770 0.007842 -0.007813 6.646391
3 604000.0 0.0 0.0 5 98136 47.5208 -122.393 49 1960 6960 0.007806 -0.007823 7.580700
4 510000.0 0.0 0.0 3 98074 47.6168 -122.045 28 1680 9760 0.007822 -0.007801 7.426549
6 257500.0 0.0 0.0 3 98003 47.3097 -122.327 19 1715 8534 0.007771 -0.007819 7.447168
df_final = pd.DataFrame([])
df_final = df3.drop(['lat', 'long', 'total_sqft_inside', 'view', 
                     'condition', 'zipcode'], axis=1)
df_final.head()
| | price | waterfront | age_when_sold | total_sqft | lat_scaled | long_scaled | log_total_sqft_inside |
|---|---|---|---|---|---|---|---|
| 1 | 538000.0 | 0.0 | 63 | 9812 | 0.007839 | -0.007819 | 7.851661 |
| 2 | 180000.0 | 0.0 | 82 | 10770 | 0.007842 | -0.007813 | 6.646391 |
| 3 | 604000.0 | 0.0 | 49 | 6960 | 0.007806 | -0.007823 | 7.580700 |
| 4 | 510000.0 | 0.0 | 28 | 9760 | 0.007822 | -0.007801 | 7.426549 |
| 6 | 257500.0 | 0.0 | 19 | 8534 | 0.007771 | -0.007819 | 7.447168 |
df_final.columns
Index(['price', 'waterfront', 'age_when_sold', 'total_sqft', 'lat_scaled',
       'long_scaled', 'log_total_sqft_inside'],
      dtype='object')

After transformation, these continuous variables appear much better suited to a normal distribution than before. Keep in mind that the transformations will change the interpretation of the coefficients during regression.

Back to the top

Partition the dataset

# Preparing the data
X = df_final.drop(['price'], axis=1)
#X = sm.add_constant(x)
y = df_final[['price']]

# Splitting the dataset into train and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.3, 
                                                    random_state = 10)

Back to the top

Create a linear regression model

linreg = sm.OLS(y_train, X_train).fit()
linreg.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.862
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 1.193e+04
Date: Mon, 03 Jun 2019 Prob (F-statistic): 0.00
Time: 23:13:41 Log-Likelihood: -1.5761e+05
No. Observations: 11467 AIC: 3.152e+05
Df Residuals: 11461 BIC: 3.153e+05
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
waterfront 8.786e+05 3.21e+04 27.352 0.000 8.16e+05 9.42e+05
age_when_sold 1947.5520 75.687 25.732 0.000 1799.192 2095.912
total_sqft -3.1982 0.592 -5.400 0.000 -4.359 -2.037
lat_scaled 3.72e+09 9.47e+07 39.283 0.000 3.53e+09 3.91e+09
long_scaled 4.147e+09 9.4e+07 44.140 0.000 3.96e+09 4.33e+09
log_total_sqft_inside 5.051e+05 6370.280 79.298 0.000 4.93e+05 5.18e+05
Omnibus: 7566.791 Durbin-Watson: 1.988
Prob(Omnibus): 0.000 Jarque-Bera (JB): 177895.954
Skew: 2.814 Prob(JB): 0.00
Kurtosis: 21.457 Cond. No. 6.43e+08


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.43e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
target = df_final['price']
df_final2 = df_final.drop(['price'], axis=1)
df_final2 = df_final2.drop(['total_sqft'], axis=1)
df_final2.columns
Index(['waterfront', 'age_when_sold', 'lat_scaled', 'long_scaled',
       'log_total_sqft_inside'],
      dtype='object')
# Preparing the data
X = df_final2
#X = sm.add_constant(x)
y = target

# Splitting the dataset into train and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.3, 
                                                    random_state = 10)
linreg = sm.OLS(y_train, X_train).fit()
linreg.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.862
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 1.428e+04
Date: Mon, 03 Jun 2019 Prob (F-statistic): 0.00
Time: 23:16:03 Log-Likelihood: -1.5763e+05
No. Observations: 11467 AIC: 3.153e+05
Df Residuals: 11462 BIC: 3.153e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
waterfront 8.687e+05 3.21e+04 27.055 0.000 8.06e+05 9.32e+05
age_when_sold 1889.6556 75.016 25.190 0.000 1742.611 2036.700
lat_scaled 3.764e+09 9.45e+07 39.846 0.000 3.58e+09 3.95e+09
long_scaled 4.18e+09 9.39e+07 44.524 0.000 4e+09 4.36e+09
log_total_sqft_inside 4.897e+05 5696.509 85.961 0.000 4.79e+05 5.01e+05
Omnibus: 7445.178 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 168849.004
Skew: 2.761 Prob(JB): 0.00
Kurtosis: 20.970 Cond. No. 3.38e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.38e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
corr_sub = df_final2.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr_sub, annot=True, fmt='.1g', cmap=sns.color_palette('Blues'))
plt.show()

png

for i in range(5): 
    print(round(linreg.params[i]))
868700.0
1890.0
3763939677.0
4179706673.0
489675.0
df_final2.columns
Index(['waterfront', 'age_when_sold', 'lat_scaled', 'long_scaled',
       'log_total_sqft_inside'],
      dtype='object')

Multiple Linear Regression Equation

$$\text{price} = 868700 \cdot \text{waterfront} + 1890 \cdot \text{age\_when\_sold} + 3763939677 \cdot \text{lat\_scaled} + 4179706673 \cdot \text{long\_scaled} + 489675 \cdot \text{log\_total\_sqft\_inside}$$

Back to the top

Validate model

print(len(X_train), len(X_test), len(y_train), len(y_test))
11467 4915 11467 4915
linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
train_residuals = y_hat_train - y_train
test_residuals = y_hat_test - y_test
mse_train = np.sum((y_train-y_hat_train)**2)/len(y_train)
mse_test =np.sum((y_test-y_hat_test)**2)/len(y_test)
print('Train Mean Squared Error:', mse_train)
print('Test Mean Squared Error:', mse_test)
Train Mean Squared Error: 48355343324.23764
Test Mean Squared Error: 45620797653.64758

Great! There doesn't seem to be a big difference between the train and test MSE.

Next let's look at the RMSE.

mse_train = np.sum((y_train-y_hat_train)**2)/len(y_train)
mse_test =np.sum((y_test-y_hat_test)**2)/len(y_test)
print('Train Root Mean Squared Error:', sqrt(mse_train))
print('Test Root Mean Squared Error:', sqrt(mse_test))
Train Root Mean Squared Error: 219898.48413356024
Test Root Mean Squared Error: 213590.25645765677

The RMSE is larger than I would like, but I'll go with it for now.

K-Fold Cross Validation

cv_5_results = np.mean(cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error"))
cv_10_results = np.mean(cross_val_score(linreg, X, y, cv=10, scoring="neg_mean_squared_error"))
cv_20_results = np.mean(cross_val_score(linreg, X, y, cv=20, scoring="neg_mean_squared_error"))
cv_5_results
-48119782941.5707
cv_10_results
-48068898699.20207
cv_20_results
-47792083424.44407

All values of the cross validation tests are close to my training MSE. Success!
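
To compare these more directly with the RMSE above, the negative MSE scores can be converted back to RMSE (a small sketch, not part of the original notebook); they come out to roughly 219,000, close to the training RMSE.

# Convert the negative MSE cross-validation scores to RMSE for comparison
for cv_result in [cv_5_results, cv_10_results, cv_20_results]:
    print(round(np.sqrt(-cv_result)))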

Back to the top

Analysis

Below is my final multiple linear regression equation:

$$\text{price} = 868700 \cdot \text{waterfront} + 1890 \cdot \text{age\_when\_sold} + 3763939677 \cdot \text{lat\_scaled} + 4179706673 \cdot \text{long\_scaled} + 489675 \cdot \text{log\_total\_sqft\_inside}$$

The main features of my model are house age, location, and size. For age, I engineered a feature from the year built and the date sold. For size, I summed the total sqft above with the total sqft basement and ran a log transformation. Location was trickier. I included waterfront even though it was a small percentage of my data because I thought it had a big enough impact. I used normalized latitude and longitude in my model to account for location. I tried to group by zipcode but was unsuccessful.
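
As a rough back-of-the-envelope check of what the log term implies (my own calculation, not from the original notebook): because only the size predictor is log-transformed, a 10% increase in total_sqft_inside shifts the predicted price by about the coefficient times ln(1.1):

$$\Delta\text{price} \approx 489675 \times \ln(1.1) \approx 46{,}700 \text{ dollars}$$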

My R-squared value is 0.862, meaning my model explains 86.2% of the variability in price around its mean. The RMSE of my model is 213,590, a measure of how far the observed prices typically fall from the model's predictions; in other words, my model predicts the price of a house to within roughly $214,000 on average. There is definitely room for improvement in the fit of my model.

Back to the top

Conclusion

I feel like I gained a large amount of new knowledge through this project. I was able to apply everything I've learned about processing data and building multiple regression models. I also gained a lot of outside knowledge about creating maps and graphs in python. Overall, I'm very happy with how my project turned out. There are still many areas for improvement and more work to be done.

Back to the top

Future work

Here are some ideas for future work on this project:

  • Location: Group properties by zipcode. Use centroids in zipcodes to get distance from centroid. Find a way to use in regression modeling.
  • Figure out how to make a regression model using grade.
  • Use feature selection methods
  • Create an input function where the user can put in different parameter values and it calculates the price of the house (see the sketch after this list)
  • Check QQ Plots for linear regression assumptions
  • Build residuals histogram with bell curve
  • Graph actual and predicted house prices
  • Make model iteration more clear (numbered models)
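
A minimal sketch of what the input-function idea above could look like (hypothetical; predict_price simply hard-codes the fitted equation from the analysis section):

# Hypothetical helper: estimate a sale price from the final regression equation
def predict_price(waterfront, age_when_sold, lat_scaled, long_scaled, total_sqft_inside):
    return (868700 * waterfront
            + 1890 * age_when_sold
            + 3763939677 * lat_scaled
            + 4179706673 * long_scaled
            + 489675 * np.log(total_sqft_inside))

# Illustrative inputs: a 30-year-old, non-waterfront house of 2,000 sqft near the middle of the scaled coordinates
print(predict_price(0, 30, 0.00782, -0.00781, 2000))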

Back to the top

The code below is for a zipcode map which I could not get to work


#https://opendata.arcgis.com/datasets/e6c555c6ae7542b2bdec92485892b6e6_113.geojson
#df_map = df2
#df_map.head()
# Set zipcode type to string (folium)
#df_map['zipcode'] = df_map['zipcode'].astype('str')
# get the mean value across all data points
#zipcode_data = df_map.groupby('zipcode').aggregate(np.mean)
#zipcode_data.reset_index(inplace = True)
#zipcode_data = df2.groupby('zipcode').aggregate(np.mean)
#zipcode_data.reset_index(inplace=True)

#df2['count'] = 1
#count_houses_zipcode = df2.groupby('zipcode').sum()
#count_houses_zipcode.reset_index(inplace=True)
#count_houses_zipcode = count_houses_zipcode[['zipcode','count']]
#df2.drop(['count'], axis = 1, inplace = True)

#zipcode_data = pd.merge(zipcode_data, count_houses_zipcode, how='left', on=['zipcode'])
#zipcode_data.head(2)
# count number of houses grouped by zipcode
#df_map['count'] = 1
#temp = df_map.groupby('zipcode').sum()
#temp.reset_index(inplace = True)
#temp = temp[['zipcode', 'count']]
#zipcode_data = pd.merge(zipcode_data, temp, on='zipcode')
# drop count from org dataset
#df_map.drop(['count'], axis = 1, inplace = True)
#def show_zipcode_map(col):
#    geo_path = 'zipcode/zipcode_king_county.geojson'
#    zipcode = folium.Map(location=[df2['lat'].mean(), df2['long'].mean()], zoom_start=9)
#    zipcode.geo_json(geo_path = geo_path,
#                     data=zipcode_data, 
#                     columns = ['zipcode', col], key_on = 'feature.properties.ZCTA5CE10',
#                 fill_color='OrRd', fill_opacity=0.9,line_opacity=0.2)
#    zipcode.save(col + '.html')
#    return zipcode
#show_zipcode_map(['price'])
# zip codes with po box as centroid

# https://opendata.arcgis.com/datasets/02a7fd7667214af88303f0c1cf580099_2013.geojson
#df.plot(kind="scatter", x=df['long'], y=df['lat'], alpha=0.2)
# Maybe try making bins for zipcodes based on the average house prices.

#zipcode_data = df2.groupby('zipcode').aggregate(np.mean)
#zipcode_data.reset_index(inplace=True)

#df2['count'] = 1
#count_houses_zipcode = df2.groupby('zipcode').sum()
#count_houses_zipcode.reset_index(inplace=True)
#count_houses_zipcode = count_houses_zipcode[['zipcode','count']]
#df2.drop(['count'], axis = 1, inplace = True)

#zipcode_data = pd.merge(zipcode_data, count_houses_zipcode, 
#                        how='left', on=['zipcode'])
#zipcode_data.head()

Back to the top

Goals for what my analysis could someday be:

https://www.kaggle.com/harlfoxem/house-price-prediction-part-1
