Dealing with Categorical Variables - Lab

Introduction

In this lab, you'll explore the Boston Housing dataset for categorical variables, and you'll transform your data so you'll be able to use categorical data as predictors!

Objectives

You will be able to:

Determine whether variables are categorical or continuous
Use bins to turn continuous data into categorical data
Use one hot encoding to create dummy variables
Describe why dummy variables are necessary

Importing the Boston Housing dataset

Let's start by importing the Boston Housing dataset. This dataset is available in Scikit-Learn, and can be imported by running the cell below:

import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()

If you'll inspect boston now, you'll see that this basically returns a dictionary. Let's have a look at what exactly is stored in the dictionary by looking at the dictionary keys:

# Print boston

# Look at the keys

Let's create a Pandas DataFrame with the data (which are the features, not including the target) and the feature names as column names.

boston_features = None

Now look at the first five rows of boston_features:

# Inspect the first few rows

For your reference, we copied the attribute information below. Additional information can be found here: http://scikit-learn.org/stable/datasets/index.html#boston-dataset

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of African American individuals by town
LSTAT: % lower status of the population

Recall that the values corresponding to the data key are the features. The target is not included. For this dataset, the target is the median value of owner-occupied homes in $1000s and the values can be accessed using the target key. Using the target key, convert the target to a separate DataFrame and set 'MEDV' as the column name.

boston_target = None

# Inspect the first few rows

The target is described as:

MEDV: Median value of owner-occupied homes in $1000s

Next, let's merge the target and the predictors in one DataFrame boston_df:

boston_df = None
boston_df.head()

Let's inspect these 13 features using .describe() and .info()

# Use .describe()

# Use .info()

Now, take a look at the scatter plots for each predictor with the target on the y-axis.

import matplotlib.pyplot as plt
%matplotlib inline

# Create scatter plots

To categorical: binning

If you created your scatterplots correctly, you'll notice that except for CHAS (the Charles River Dummy variable), there is clearly no categorical data. You will have seen though that RAD and TAX have more of a vertical-looking structure like the one seen in the lesson, and that there is less of a "cloud"-looking structure compared to most other variables. It is difficult to justify a linear pattern between predictor and target here. In this situation, it might make sense to restructure data into bins so that they're treated as categorical variables. We'll start by showing how this can be done for RAD and then it's your turn to do this for TAX.

RAD

Look at the structure of RAD to decide how to create your bins.

# Your code here

# First, create bins based on the values observed. 5 values will result in 4 bins
bins = [0, 3, 4 , 5, 24]

# Use pd.cut()
bins_rad = None

# Using pd.cut() returns unordered categories. Transform this to ordered categories 
bins_rad = None
bins_rad.head()

# Inspect the result
bins_rad.value_counts().plot(kind='bar')

# Replace the existing 'RAD' column
boston_df['RAD'] = bins_rad

TAX

Split the TAX column up in 5 categories. You can chose the bins as desired but make sure they're pretty well-balanced.

# Repeat everything for "TAX"

Perform label encoding

# Perform label encoding and replace in boston_df

# Inspect first few columns

Create dummy variables

Create dummy variables, and make sure their column names contain 'TAX' and 'RAD' remembering to drop the first. Add the new dummy variables to boston_df and remove the old 'TAX' and 'RAD' columns.

# Create dummpy variables for TAX and RAD columns

Note how you end up with 19 columns now!

Summary

In this lab, you practiced your knowledge of categorical variables on the Boston Housing dataset! Specifically, you practiced distinguishing continuous and categorical data. You also used binning to transform continuous into categorical data. You then created dummy variables using one hot encoding.

khuloodnasher / dsc-dealing-with-categorical-variables-lab-onl01-dtsc-ft-030220 Goto Github PK

dsc-dealing-with-categorical-variables-lab-onl01-dtsc-ft-030220's Introduction

Dealing with Categorical Variables - Lab

Introduction

Objectives

Importing the Boston Housing dataset

To categorical: binning

RAD

TAX

Perform label encoding

Create dummy variables

Summary

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent