Regression with CART Trees - Lab

Introduction

In this lab, we'll apply what we learned in the previous lesson to build a model for the "Petrol Consumption Dataset" from Kaggle. The model will predict petrol (gasoline) consumption from four features: petrol tax, average income, paved highways, and the proportion of the population holding a driver's licence.

Objectives

You will be able to:

  • Conduct a regression experiment using CART trees
  • Evaluate the model fit and study the impact of hyperparameters on the final tree
  • Understand the training, prediction, evaluation and visualization steps required to run regression experiments using trees

Import necessary libraries

# Import libraries 
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

Read the dataset petrol_consumption.csv and view its head and dimensions

# Read the dataset and view head and dimensions

# Code here
(48, 5)
   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410
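
A minimal sketch of this step, assuming petrol_consumption.csv is a comma-separated file with the header row shown above. So that the snippet runs on its own, the five displayed rows are held in a string and read through io.StringIO; the helper name load_petrol is hypothetical.

```python
import io

import pandas as pd

# The five rows displayed above, standing in for the full 48-row file.
SAMPLE_CSV = """Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
9.0,3571,1976,0.525,541
9.0,4092,1250,0.572,524
9.0,3865,1586,0.580,561
7.5,4870,2351,0.529,414
8.0,4399,431,0.544,410
"""


def load_petrol(source):
    """Read the petrol data from a path or a file-like object (hypothetical helper)."""
    return pd.read_csv(source)


# In the lab you would call load_petrol('petrol_consumption.csv') instead.
df = load_petrol(io.StringIO(SAMPLE_CSV))
print(df.shape)   # (rows, columns)
print(df.head())  # first five rows
```

With the real file, df.shape should report the (48, 5) shown in the output above.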

Check the basic statistics for the dataset and inspect the target variable Petrol_Consumption

# Describe the dataset

# Code here
       Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
count   48.000000       48.000000       48.000000                     48.000000           48.000000
mean     7.668333     4241.833333     5565.416667                      0.570333          576.770833
std      0.950770      573.623768     3491.507166                      0.055470          111.885816
min      5.000000     3063.000000      431.000000                      0.451000          344.000000
25%      7.000000     3739.000000     3110.250000                      0.529750          509.500000
50%      7.500000     4298.000000     4735.500000                      0.564500          568.500000
75%      8.125000     4578.750000     7156.000000                      0.595250          632.750000
max     10.000000     5342.000000    17782.000000                      0.724000          968.000000

Create features, labels and train/test datasets with an 80/20 split

As with the classification task, we will divide our data into attributes/features and labels, and then into training and test sets.
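
The split described above can be sketched as follows. This is one possible approach, not the lab's reference solution: the five head rows stand in for the full dataset, and random_state=0 is an arbitrary choice made only so the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Five rows from the head of the dataset, used only to keep the sketch runnable.
df = pd.DataFrame({
    "Petrol_tax": [9.0, 9.0, 9.0, 7.5, 8.0],
    "Average_income": [3571, 4092, 3865, 4870, 4399],
    "Paved_Highways": [1976, 1250, 1586, 2351, 431],
    "Population_Driver_licence(%)": [0.525, 0.572, 0.580, 0.529, 0.544],
    "Petrol_Consumption": [541, 524, 561, 414, 410],
})

# Features are every column except the target; the label is Petrol_Consumption.
X = df.drop("Petrol_Consumption", axis=1)
y = df["Petrol_Consumption"]

# test_size=0.2 holds out 20% of the rows for testing (the 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

On the full 48-row dataset the same call would hold out 10 rows for testing and keep 38 for training.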

# Create datasets for training and test


# Code here

Create an instance of CART regressor and fit the data to the model

As mentioned earlier, for a regression task we'll use a different sklearn class than we did for the classification task. The class we'll be using here is the DecisionTreeRegressor class, as opposed to the DecisionTreeClassifier from before.

# Train a regression tree model with training data 


# Code here
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
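
A minimal sketch of fitting the regressor, using four rows from the head of the dataset as a stand-in training set. The default hyperparameters match the output above, except that random_state=0 is added here as an assumption for repeatability. The printed parameters above come from an older scikit-learn release; newer releases name the default criterion 'squared_error' rather than 'mse'.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Four rows from the head of the dataset:
# petrol tax, average income, paved highways, driver-licence share.
X_train = np.array([
    [9.0, 3571, 1976, 0.525],
    [9.0, 4092, 1250, 0.572],
    [9.0, 3865, 1586, 0.580],
    [7.5, 4870, 2351, 0.529],
])
y_train = np.array([541, 524, 561, 414])

# Default hyperparameters: no depth limit, one sample per leaf allowed.
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)

# An unrestricted tree keeps splitting until each leaf is pure,
# so it reproduces the training targets exactly.
print(regressor.predict(X_train))
print("depth:", regressor.get_depth(), "leaves:", regressor.get_n_leaves())
```

Fitting the training data perfectly is the reason an unconstrained tree tends to overfit, which is what the hyperparameters control.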

Using the test set, make predictions and calculate the MAE, MSE and RMSE

Just as with Decision Trees for classification, there are several commonly used metrics for evaluating the performance of our model. The most common metrics are:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)

If these look familiar, it's likely because you have already seen them before: they are common evaluation metrics for any sort of regression model, and regression with decision trees is no exception!

Since these are common evaluation metrics, sklearn has functions for each of them that we can use to make our job easier. You'll find these functions inside the metrics module. In the cell below, calculate each of the three evaluation metrics listed above!

# Predict and evaluate the predictions


# Code here
Mean Absolute Error: 55.6
Mean Squared Error: 6286.2
Root Mean Squared Error: 79.28555984540942
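
To make the three metrics concrete, here is a small sketch using mean_absolute_error and mean_squared_error from sklearn.metrics. The y_test and y_pred values are hypothetical, chosen so the arithmetic is easy to check by hand (errors of 10, 10 and 0); they are not the lab's actual predictions. The RMSE is taken as the square root of the MSE, which is also how the figures reported above relate to each other.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions (absolute errors: 10, 10, 0).
y_test = np.array([541, 524, 561])
y_pred = np.array([551, 514, 561])

mae = mean_absolute_error(y_test, y_pred)  # (10 + 10 + 0) / 3
mse = mean_squared_error(y_test, y_pred)   # (100 + 100 + 0) / 3
rmse = np.sqrt(mse)                        # square root of the MSE

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)

# The same relation holds for the values reported above: sqrt(6286.2).
reported_rmse = np.sqrt(6286.2)
print("RMSE from the reported MSE:", reported_rmse)
```

Because the MSE squares each error, a few large misses raise it much faster than they raise the MAE, so comparing the two gives a hint about outlying predictions.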

Level Up - Optional

  • To understand and interpret a tree structure, we need some domain knowledge about how the data was generated. That knowledge helps us inspect each leaf and investigate or prune the tree based on qualitative analysis.

  • Look at the hyperparameters used in the regression tree, check their value ranges in the official documentation, and try some optimization by growing a number of trees in a loop.

  • Use a dataset that you are familiar with and run tree regression to see if you can interpret the results.

  • Check for outliers, try normalization, and see the impact on the output.
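
The idea of growing a number of trees in a loop can be sketched as below. This is an illustration under stated assumptions rather than a tuned experiment: the data is synthetic (generated with make_regression, not the petrol dataset), max_depth is the only hyperparameter varied, and the depth range 1 to 7 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the petrol dataset): 48 rows, 4 features.
X, y = make_regression(n_samples=48, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Grow one tree per candidate depth and record its test RMSE.
results = {}
for depth in range(1, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))
    results[depth] = rmse
    print(f"max_depth={depth}: test RMSE = {rmse:.2f}")

# Depth with the lowest recorded test RMSE.
best_depth = min(results, key=results.get)
print("Lowest test RMSE at max_depth =", best_depth)
```

With only a handful of test rows, picking the depth on the test set gives an optimistic estimate; cross-validation on the training data is a more reliable way to compare settings. The same loop works for min_samples_split or min_samples_leaf.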

Summary

In this lab, we built a tree regressor, trained it, and predicted values for unseen data. We saw that with a vanilla approach the results were not so great (a test RMSE of about 79 on a target whose mean is about 577), so the model requires further tuning (what we described as hyperparameter optimization, or pruning in the case of trees).

Contributors

loredirick, shakeelraja
