In this lab, you'll perform exploratory data analysis (EDA), applying your skills with statistics and data visualization. You'll continue using the Lego dataset that you acquired and cleaned in the previous labs.
You will be able to:
- Examine the descriptive statistics of a dataset
- Create visualizations to better understand the distributions of variables in a dataset
At this point, you've already done a modest amount of EDA, from investigating the initial dataset to exploring individual features while cleaning things up in preparation for modeling. During this process, you've become more familiar with the particular idiosyncrasies of the dataset. This gives you an opportunity to uncover difficulties and potential pitfalls in working with the dataset, as well as potential avenues for feature engineering that could improve the predictive performance of your model down the line. Remember that this is not a linear process: if the results of an initial model do not meet your needs and expectations, you might go back and continue to mine the dataset for additional features that improve the model's performance. Here, you'll continue this process, investigating the distributions of some of the features and their relationship to the target variable, list_price.
In the cells below:
- Import pandas and set the standard alias.
- Import numpy and set the standard alias.
- Import matplotlib.pyplot and set the standard alias.
- Import seaborn and set the alias sns (this is the standard alias for seaborn).
- Use the IPython magic command to set all matplotlib visualizations to display inline in the notebook.
- Load the dataset stored in the 'Lego_data_cleaned.csv' file into a DataFrame, df.
- Inspect the head of the DataFrame to ensure everything loaded correctly.
import warnings
warnings.filterwarnings('ignore')
# Import libraries and load Lego_data_cleaned.csv
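The setup steps listed above could look roughly like the sketch below. Because the real file is not available here, the example reads a tiny in-memory stand-in instead of 'Lego_data_cleaned.csv'; every column other than list_price is a hypothetical placeholder, and the values are made up. In a notebook you would run the `%matplotlib inline` magic rather than selecting the Agg backend.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend for a plain script; in a notebook, run `%matplotlib inline` instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in for 'Lego_data_cleaned.csv'. Every column except list_price is a
# hypothetical placeholder, and the values are made up for illustration.
csv_text = """piece_count,num_reviews,star_rating,list_price
277,2,4.5,29.99
168,2,5.0,19.99
74,11,4.3,12.99
1032,23,3.6,99.99
744,14,3.2,79.99
"""

# In the lab itself this would be: df = pd.read_csv('Lego_data_cleaned.csv')
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows to confirm the columns loaded as expected
print(df.head())
```

Checking `df.info()` at this point is also a cheap way to confirm that the numeric columns were not read in as strings.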
- Describe the dataset using five-point summary statistics (minimum, first quartile, median, third quartile, and maximum).
# Your code here
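One way to get these statistics is `DataFrame.describe()`, which reports the count, mean, and standard deviation alongside the minimum, quartiles, and maximum. The sketch below runs on synthetic data with hypothetical column values, since the cleaned Lego file is not available here, and then keeps only the five-point rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Lego DataFrame (values are made up)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "piece_count": rng.integers(10, 2000, size=200),
    "list_price": rng.lognormal(mean=3.5, sigma=0.8, size=200),
})

# describe() summarizes every numeric column
summary = df.describe()

# Keep only the five-point summary: min, quartiles, and max
five_point = summary.loc[["min", "25%", "50%", "75%", "max"]]
print(five_point)
```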
- Use pandas to plot histograms for all the numeric variables in the dataset.
# Your code here
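A minimal sketch of this step uses `DataFrame.hist()`, which draws one histogram per numeric column and skips non-numeric ones. The data is synthetic, and the column names other than list_price (including the text column theme_name) are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use `%matplotlib inline`
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Lego DataFrame (values are made up)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "piece_count": rng.integers(10, 2000, size=n),
    "num_reviews": rng.poisson(15, size=n),
    "list_price": rng.lognormal(mean=3.5, sigma=0.8, size=n),
    "theme_name": rng.choice(["City", "Friends", "Technic"], size=n),
})

# One histogram per numeric column; theme_name is ignored because it is text
axes = df.hist(figsize=(10, 6), bins=30)
plt.tight_layout()
```

The `bins` and `figsize` values are arbitrary choices here; more bins make the long right tails of skewed columns easier to see.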
Note how skewed most of these distributions are. While linear regression does not assume that each individual predictor is normally distributed, it does assume a linear relationship between the predictors and the target variable (list_price in this case). To investigate whether this assumption holds, you can plot single-variable regression plots of each feature against the target variable using seaborn.
Recall that one assumption in linear regression is that the target variable is linearly related to the input features. As shown in the previous lesson, you can use the sns.jointplot() function to investigate whether this relation holds true for the various predictors on hand.
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
It's also important to note whether your predictive features will result in multicollinearity in the resulting model. While definitive checks for multicollinearity require analyzing the fitted model, predictors with overly high pairwise correlation (r > 0.65) are almost certain to produce multicollinearity in a model. With that in mind, take a minute to generate the pairwise (Pearson) correlation coefficients of your predictive features and visualize these coefficients as a heatmap.
# Your code here
# Your code here
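A minimal sketch of the correlation check, assuming synthetic data with hypothetical predictor names: `DataFrame.corr()` computes the Pearson coefficients and `sns.heatmap()` visualizes them. The list of flagged pairs compares the absolute value of r against 0.65 so that strongly negative correlations are caught too; that use of the absolute value is an assumption of this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use `%matplotlib inline`
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: box_weight is built to track piece_count closely
rng = np.random.default_rng(0)
n = 300
piece_count = rng.integers(10, 2000, size=n).astype(float)
df = pd.DataFrame({
    "piece_count": piece_count,
    "box_weight": 0.002 * piece_count + rng.normal(0, 0.2, size=n),
    "star_rating": rng.uniform(1, 5, size=n),
    "list_price": 0.1 * piece_count + rng.normal(0, 20, size=n),
})

# Correlate the predictors only, leaving out the target
features = df.drop(columns="list_price")
corr = features.corr(method="pearson")

# List each predictor pair once if its |r| exceeds the 0.65 threshold
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.65
]
print(high_pairs)

# Heatmap with a fixed color scale so 0 sits at the midpoint of the colormap
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```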
Have a look at the following resources on how to deal with complex datasets that don't meet our initial expectations:
In this lesson, you performed some initial EDA, using descriptive statistics and data visualizations to check regression assumptions. In the upcoming lessons, you'll continue through the standard data science process and begin to fit and refine an initial model.