In this lab, you'll practice fitting a multiple linear regression model on the Boston Housing dataset!
You will be able to:
- Determine if it is necessary to perform normalization/standardization for a specific model or set of data
- Use standardization/normalization on features of a dataset
- Identify if it is necessary to perform log transformations on a set of features
- Perform log transformations on different features of a dataset
- Use statsmodels to fit a multiple linear regression model
- Evaluate a linear regression model by using statistical performance metrics pertaining to overall model and specific parameters
We pre-processed the Boston Housing data again. This time, however, we did things slightly differently:
- We dropped 'ZN' and 'NOX' completely
- We categorized 'RAD' into 2 bins and 'TAX' into 3 bins
- We transformed 'RAD' and 'TAX' into dummy variables and dropped the first level of each to eliminate multicollinearity
- We used min-max scaling on 'B', 'CRIM', and 'DIS' (and log transformed all of them first, except 'B')
- We used standardization on 'AGE', 'INDUS', 'LSTAT', and 'PTRATIO' (and log transformed all of them first, except for 'AGE')
import pandas as pd
import numpy as np
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell assumes an older scikit-learn version
from sklearn.datasets import load_boston
boston = load_boston()
boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_features = boston_features.drop(['NOX', 'ZN'],axis=1)
# First, create bins based on the values observed. 3 bin edges will result in 2 bins
bins = [0, 6, 24]
bins_rad = pd.cut(boston_features['RAD'], bins)
bins_rad = bins_rad.cat.as_unordered()
# First, create bins based on the values observed. 4 bin edges will result in 3 bins
bins = [0, 270, 360, 712]
bins_tax = pd.cut(boston_features['TAX'], bins)
bins_tax = bins_tax.cat.as_unordered()
tax_dummy = pd.get_dummies(bins_tax, prefix='TAX', drop_first=True)
rad_dummy = pd.get_dummies(bins_rad, prefix='RAD', drop_first=True)
boston_features = boston_features.drop(['RAD','TAX'], axis=1)
boston_features = pd.concat([boston_features, rad_dummy, tax_dummy], axis=1)
age = boston_features['AGE']
b = boston_features['B']
logcrim = np.log(boston_features['CRIM'])
logdis = np.log(boston_features['DIS'])
logindus = np.log(boston_features['INDUS'])
loglstat = np.log(boston_features['LSTAT'])
logptratio = np.log(boston_features['PTRATIO'])
# Min-Max scaling
boston_features['B'] = (b-min(b))/(max(b)-min(b))
boston_features['CRIM'] = (logcrim-min(logcrim))/(max(logcrim)-min(logcrim))
boston_features['DIS'] = (logdis-min(logdis))/(max(logdis)-min(logdis))
# Standardization
boston_features['AGE'] = (age-np.mean(age))/np.sqrt(np.var(age))
boston_features['INDUS'] = (logindus-np.mean(logindus))/np.sqrt(np.var(logindus))
boston_features['LSTAT'] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
boston_features['PTRATIO'] = (logptratio-np.mean(logptratio))/(np.sqrt(np.var(logptratio)))
boston_features.head()
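As a sanity check, the manual formulas above are equivalent to scikit-learn's built-in scalers: `MinMaxScaler` computes `(x - min) / (max - min)` and `StandardScaler` divides by the population standard deviation, matching `np.sqrt(np.var(...))`. A minimal sketch on stand-in data (the array `x` below is just an illustration, not part of the lab):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in column of values (hypothetical, for illustration only)
x = np.array([[1.0], [2.0], [4.0], [8.0]])

# (x - min) / (max - min), same as the manual min-max formula above
mm = MinMaxScaler().fit_transform(x)

# (x - mean) / population std, same as the manual standardization above
ss = StandardScaler().fit_transform(x)

print(mm.ravel())   # scaled to the [0, 1] range
print(ss.ravel())   # mean 0, standard deviation 1
```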
# Your code here
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
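The scikit-learn equivalent can be sketched as follows, again on synthetic stand-in data (in the lab, reuse the same predictors and target you passed to statsmodels, then compare `intercept_` and `coef_` against `model.params`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical); replace with boston_features / boston.target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Unlike statsmodels, LinearRegression fits an intercept by default
linreg = LinearRegression()
linreg.fit(X, y)

print(linreg.intercept_)  # should match the 'const' term from statsmodels
print(linreg.coef_)       # should match the slope estimates from statsmodels
```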
- CRIM: per capita crime rate by town
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of African American individuals by town
- LSTAT: % lower status of the population
Use your fitted model to predict the house price for a home with the following (untransformed) characteristics. Make sure to transform your variables as needed!
- CRIM: 0.15
- INDUS: 6.07
- CHAS: 1
- RM: 6.1
- AGE: 33.2
- DIS: 7.6
- PTRATIO: 17
- B: 383
- LSTAT: 10.87
- RAD: 8
- TAX: 284
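One sketch of how to start: the categorical features can be encoded by reusing the same bin edges as above (RAD = 8 falls in the (6, 24] bin and TAX = 284 in the (270, 360] bin), while each continuous feature must be transformed with the statistics of the *training* data, not of the single new value:

```python
import numpy as np
import pandas as pd

# Encode the new observation's RAD and TAX with the same bin edges used earlier
rad_bin = pd.cut([8], bins=[0, 6, 24])
tax_bin = pd.cut([284], bins=[0, 270, 360, 712])
print(rad_bin[0], tax_bin[0])

# Continuous features: apply the training-set transform, e.g. for CRIM = 0.15
# crim_scaled = (np.log(0.15) - min(logcrim)) / (max(logcrim) - min(logcrim))
# where logcrim is the log-transformed training column from the cells above
```

From the resulting bins you can read off the dummy-variable columns (remembering that the first level of each was dropped), assemble the full transformed row, and feed it to your fitted model.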
Congratulations! You pre-processed the Boston Housing data using scaling and standardization. You also fitted your first multiple linear regression model on the Boston Housing data using statsmodels and scikit-learn!