Giter Club home page Giter Club logo

dsc-dealing-with-categorical-variables-lab-onl01-dtsc-ft-030220's Introduction

Dealing with Categorical Variables - Lab

Introduction

In this lab, you'll explore the Boston Housing dataset for categorical variables, and you'll transform your data so you'll be able to use categorical data as predictors!

Objectives

You will be able to:

  • Determine whether variables are categorical or continuous
  • Use bins to turn continuous data into categorical data
  • Use one hot encoding to create dummy variables
  • Describe why dummy variables are necessary

Importing the Boston Housing dataset

Let's start by importing the Boston Housing dataset. This dataset is available in Scikit-Learn, and can be imported by running the cell below:

import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()

If you'll inspect boston now, you'll see that this basically returns a dictionary. Let's have a look at what exactly is stored in the dictionary by looking at the dictionary keys:

# Print boston
# Look at the keys

Let's create a Pandas DataFrame with the data (which are the features, not including the target) and the feature names as column names.

boston_features = None

Now look at the first five rows of boston_features:

# Inspect the first few rows

For your reference, we copied the attribute information below. Additional information can be found here: http://scikit-learn.org/stable/datasets/index.html#boston-dataset

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted distances to five Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2 where Bk is the proportion of African American individuals by town
  • LSTAT: % lower status of the population

Recall that the values corresponding to the data key are the features. The target is not included. For this dataset, the target is the median value of owner-occupied homes in $1000s and the values can be accessed using the target key. Using the target key, convert the target to a separate DataFrame and set 'MEDV' as the column name.

boston_target = None

# Inspect the first few rows

The target is described as:

  • MEDV: Median value of owner-occupied homes in $1000s

Next, let's merge the target and the predictors in one DataFrame boston_df:

boston_df = None
boston_df.head()

Let's inspect these 13 features using .describe() and .info()

# Use .describe()
# Use .info()

Now, take a look at the scatter plots for each predictor with the target on the y-axis.

import matplotlib.pyplot as plt
%matplotlib inline

# Create scatter plots

To categorical: binning

If you created your scatterplots correctly, you'll notice that except for CHAS (the Charles River Dummy variable), there is clearly no categorical data. You will have seen though that RAD and TAX have more of a vertical-looking structure like the one seen in the lesson, and that there is less of a "cloud"-looking structure compared to most other variables. It is difficult to justify a linear pattern between predictor and target here. In this situation, it might make sense to restructure data into bins so that they're treated as categorical variables. We'll start by showing how this can be done for RAD and then it's your turn to do this for TAX.

RAD

Look at the structure of RAD to decide how to create your bins.

# Your code here
# First, create bins based on the values observed. 5 values will result in 4 bins
bins = [0, 3, 4 , 5, 24]

# Use pd.cut()
bins_rad = None
# Using pd.cut() returns unordered categories. Transform this to ordered categories 
bins_rad = None
bins_rad.head()
# Inspect the result
bins_rad.value_counts().plot(kind='bar')
# Replace the existing 'RAD' column
boston_df['RAD'] = bins_rad

TAX

Split the TAX column up in 5 categories. You can chose the bins as desired but make sure they're pretty well-balanced.

# Repeat everything for "TAX"

Perform label encoding

# Perform label encoding and replace in boston_df
# Inspect first few columns

Create dummy variables

Create dummy variables, and make sure their column names contain 'TAX' and 'RAD' remembering to drop the first. Add the new dummy variables to boston_df and remove the old 'TAX' and 'RAD' columns.

# Create dummpy variables for TAX and RAD columns

Note how you end up with 19 columns now!

Summary

In this lab, you practiced your knowledge of categorical variables on the Boston Housing dataset! Specifically, you practiced distinguishing continuous and categorical data. You also used binning to transform continuous into categorical data. You then created dummy variables using one hot encoding.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.