Exploring Your Data - Lab

Introduction

In this lab, you'll carry out exploratory data analysis (EDA) using descriptive statistics and data visualizations. You'll continue working with the Lego dataset that you acquired and cleaned in the previous labs.

Objectives

You will be able to:

Examine the descriptive statistics of a dataset
Create visualizations to better understand the distributions of variables in a dataset

Data Exploration

At this point, you've already done a modest amount of EDA, from investigating the initial dataset to exploring individual features while cleaning things up in preparation for modeling. Along the way, you've become more familiar with the idiosyncrasies of the dataset. That familiarity helps you spot difficulties and potential pitfalls in working with the data, as well as avenues for feature engineering that could improve the predictive performance of your model down the line. Remember that this is not a linear process: if an initial model does not meet your needs and expectations, you might go back and mine the dataset for additional features to improve its performance. Here, you'll continue this process by investigating the distributions of several features and their relationship to the target variable, list_price.

In the cells below:

Import pandas and set the standard alias.
Import numpy and set the standard alias.
Import matplotlib.pyplot and set the standard alias.
Import seaborn and set the alias sns (this is the standard alias for seaborn).
Use the ipython magic command to set all matplotlib visualizations to display inline in the notebook.
Load the dataset stored in the 'Lego_data_cleaned.csv' file into a DataFrame, df.
Inspect the head of the DataFrame to ensure everything loaded correctly.

import warnings
warnings.filterwarnings('ignore')

# Import libraries and load the dataset
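
As one possible way to approach the setup steps listed above, here is a minimal sketch. The helper name load_dataset, the column names other than list_price, and the three sample rows are hypothetical stand-ins used only so the snippet runs on its own; in the lab you would pass the CSV filename to the loader instead.

```python
import io

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# In a Jupyter notebook, add the IPython magic on its own line
# (it is not valid syntax in a plain Python script):
# %matplotlib inline


def load_dataset(source):
    """Read a CSV (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Hypothetical stand-in rows, not real data from the lab's CSV file.
sample_csv = io.StringIO(
    "piece_count,num_reviews,list_price\n"
    "277,2,29.99\n"
    "168,2,19.99\n"
    "74,11,12.99\n"
)

df = load_dataset(sample_csv)

# Inspect the first rows to confirm the columns parsed as expected.
print(df.head())
```

Checking df.head() right after loading is a quick way to catch a wrong delimiter or an unexpected index column before any analysis begins.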

Describe the dataset using its five-point summary statistics (minimum, 25th percentile, median, 75th percentile, and maximum).

# Your code here
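
A minimal sketch of this step, assuming a small made-up DataFrame in place of the Lego data (the values and the piece_count column are illustrative only). DataFrame.describe() reports the count, mean, and standard deviation in addition to the five summary points, so the sketch selects just those five rows.

```python
import pandas as pd

# Hypothetical stand-in values, not taken from the lab's dataset.
df = pd.DataFrame({
    "piece_count": [74, 168, 277, 500, 1200],
    "list_price": [12.99, 19.99, 29.99, 49.99, 99.99],
})

# describe() summarizes every numeric column.
summary = df.describe()

# Keep only the five summary points: min, quartiles, and max.
five_point = summary.loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)
```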

Use pandas to plot histograms for all the numeric variables in the dataset.

# Your code here
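
One way this could look, sketched on synthetic right-skewed data rather than the real Lego file (the column names other than list_price and all of the values are assumptions). DataFrame.hist() draws one histogram per numeric column, and DataFrame.skew() puts a number on the asymmetry you see in the plots.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "piece_count": rng.lognormal(mean=5, sigma=1, size=200),
    "num_reviews": rng.poisson(lam=3, size=200),
    "list_price": rng.lognormal(mean=3.5, sigma=0.8, size=200),
})

# Restrict to numeric columns, then draw one histogram per column.
numeric = df.select_dtypes(include="number")
axes = numeric.hist(figsize=(10, 8), bins=30)
plt.tight_layout()

# Positive skewness indicates a long right tail.
print(numeric.skew())
```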

Note how skewed most of these distributions are. While linear regression does not assume that each individual predictor is normally distributed, it does assume a linear relationship between the predictors and the target variable (list_price in this case). To investigate whether this assumption holds, you can use seaborn to plot single-variable regression plots of each feature against the target variable.

Check for Linearity

Recall that one assumption in linear regression is that the target variable is linearly related to the input features. As shown in the previous lesson, you can use the sns.jointplot() function to investigate whether this relation holds true for the various predictors on hand.

# Your code here

# Your code here

# Your code here

# Your code here

# Your code here

Checking for Multicollinearity

It's also important to note whether your predictive features will introduce multicollinearity into the resulting model. While definitive checks for multicollinearity require analyzing the fitted model, predictors with a high pairwise correlation (for example, |r| > 0.65) are very likely to produce it. With that in mind, take a minute to generate the pairwise Pearson correlation coefficients of your predictive features and visualize these coefficients as a heatmap.

# Your code here
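
A rough sketch of the correlation step, using synthetic data in which a made-up box_weight column is deliberately built to track piece_count (all column names except list_price are hypothetical). It computes the Pearson correlation matrix of the predictors and lists the pairs whose absolute correlation exceeds the 0.65 threshold mentioned above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns and values).
rng = np.random.default_rng(2)
piece_count = rng.integers(20, 2000, size=200).astype(float)
df = pd.DataFrame({
    "piece_count": piece_count,
    "num_reviews": rng.poisson(lam=5, size=200).astype(float),
    "box_weight": 0.002 * piece_count + rng.normal(0, 0.2, size=200),
    "list_price": 0.1 * piece_count + rng.normal(0, 15, size=200),
})

# Correlate the predictors only; the target is left out.
predictors = df.drop(columns="list_price")
corr = predictors.corr(method="pearson")

# Keep each pair once by masking everything except the upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Masked (NaN) entries fail the comparison, so only real pairs remain.
high = pairs[pairs.abs() > 0.65]
print(high)
```

Using the absolute value matters because a strong negative correlation is as much a multicollinearity concern as a strong positive one.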

# Your code here
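
The heatmap itself might be drawn along these lines with sns.heatmap(). As before, the data is synthetic and the predictor names are assumptions. Fixing the color scale to the range -1 to 1 keeps colors comparable across different datasets.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in predictors (hypothetical columns and values).
rng = np.random.default_rng(3)
piece_count = rng.integers(20, 2000, size=200).astype(float)
predictors = pd.DataFrame({
    "piece_count": piece_count,
    "num_reviews": rng.poisson(lam=5, size=200).astype(float),
    "box_weight": 0.002 * piece_count + rng.normal(0, 0.2, size=200),
})

corr = predictors.corr(method="pearson")

# Annotated heatmap with a diverging palette centered on zero.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    center=0,
    ax=ax,
)
ax.set_title("Pairwise Pearson correlation of predictors")
```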

Further Resources

Have a look at the following resources on how to deal with complex datasets that don't meet our initial expectations:

Summary

In this lab, you performed some initial EDA, using descriptive statistics and data visualizations to check regression assumptions. In the upcoming lessons, you'll continue through the standard data science process and begin to fit and refine an initial model.
