In this lab, you'll explore the Boston Housing dataset for categorical variables, and you'll transform your data so you'll be able to use categorical data as predictors!
You will be able to:
- Determine whether variables are categorical or continuous
- Use bins to turn continuous data into categorical data
- Use one hot encoding to create dummy variables
- Describe why dummy variables are necessary
Let's start by importing the Boston Housing dataset. This dataset is available in Scikit-Learn, and can be imported by running the cell below:
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
If you'll inspect boston
now, you'll see that this basically returns a dictionary. Let's have a look at what exactly is stored in the dictionary by looking at the dictionary keys:
# Print boston
# Look at the keys
Let's create a Pandas DataFrame with the data (which are the features, not including the target) and the feature names as column names.
boston_features = None
Now look at the first five rows of boston_features
:
# Inspect the first few rows
For your reference, we copied the attribute information below. Additional information can be found here: http://scikit-learn.org/stable/datasets/index.html#boston-dataset
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of African American individuals by town
- LSTAT: % lower status of the population
Recall that the values corresponding to the data key are the features. The target is not included. For this dataset, the target is the median value of owner-occupied homes in $1000s and the values can be accessed using the target key. Using the target key, convert the target to a separate DataFrame and set 'MEDV'
as the column name.
boston_target = None
# Inspect the first few rows
The target is described as:
- MEDV: Median value of owner-occupied homes in $1000s
Next, let's merge the target and the predictors in one DataFrame boston_df
:
boston_df = None
boston_df.head()
Let's inspect these 13 features using .describe()
and .info()
# Use .describe()
# Use .info()
Now, take a look at the scatter plots for each predictor with the target on the y-axis.
import matplotlib.pyplot as plt
%matplotlib inline
# Create scatter plots
If you created your scatterplots correctly, you'll notice that except for CHAS
(the Charles River Dummy variable), there is clearly no categorical data. You will have seen though that RAD
and TAX
have more of a vertical-looking structure like the one seen in the lesson, and that there is less of a "cloud"-looking structure compared to most other variables. It is difficult to justify a linear pattern between predictor and target here. In this situation, it might make sense to restructure data into bins so that they're treated as categorical variables. We'll start by showing how this can be done for RAD
and then it's your turn to do this for TAX
.
Look at the structure of RAD
to decide how to create your bins.
# Your code here
# First, create bins based on the values observed. 5 values will result in 4 bins
bins = [0, 3, 4 , 5, 24]
# Use pd.cut()
bins_rad = None
# Using pd.cut() returns unordered categories. Transform this to ordered categories
bins_rad = None
bins_rad.head()
# Inspect the result
bins_rad.value_counts().plot(kind='bar')
# Replace the existing 'RAD' column
boston_df['RAD'] = bins_rad
Split the TAX
column up in 5 categories. You can chose the bins as desired but make sure they're pretty well-balanced.
# Repeat everything for "TAX"
# Perform label encoding and replace in boston_df
# Inspect first few columns
Create dummy variables, and make sure their column names contain 'TAX'
and 'RAD'
remembering to drop the first. Add the new dummy variables to boston_df
and remove the old 'TAX'
and 'RAD'
columns.
# Create dummpy variables for TAX and RAD columns
Note how you end up with 19 columns now!
In this lab, you practiced your knowledge of categorical variables on the Boston Housing dataset! Specifically, you practiced distinguishing continuous and categorical data. You also used binning to transform continuous into categorical data. You then created dummy variables using one hot encoding.