Predicting Airline Delays

Introduction

For this analysis, I will consider the 2016 airline delay data only, and my goal will be to predict delays using this one year of data.

I have several hypotheses which I will evaluate throughout this presentation.

Goal: Given all of the airline data from 2016, can we predict delays? I tried this using two methods:

  1. Regression on the departure delay.
  2. Classification of each flight as delayed or not, where a flight counts as delayed if its departure delay exceeds 20 minutes (the threshold most major airlines use to define a delay).
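The two targets can be derived from the same departure-delay values. The following is a minimal sketch, assuming delays are measured in minutes and that "above" means strictly greater than the threshold. The function names and sample values are illustrative and are not taken from the notebooks.

```python
DELAY_THRESHOLD_MINUTES = 20  # threshold quoted in the list above


def regression_target(dep_delay_minutes):
    """Regression target: the departure delay itself, in minutes."""
    return [float(d) for d in dep_delay_minutes]


def classification_target(dep_delay_minutes, threshold=DELAY_THRESHOLD_MINUTES):
    """Classification target: 1 if the delay exceeds the threshold, else 0."""
    return [1 if d > threshold else 0 for d in dep_delay_minutes]


# Made-up delays in minutes (a negative value means an early departure).
sample_delays = [-5, 0, 20, 21, 95]
print(regression_target(sample_delays))
print(classification_target(sample_delays))
```

Whether a delay of exactly 20 minutes counts as delayed is a modelling choice; the sketch treats it as on time.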

File Structure

analysis/: The notebooks used for all of the data processing and analysis.

doc/: The research paper and TeX file for the research paper.

fig/: Figures generated for the research paper.

json/: JSON files used for processing some of the weather data.

data/: All data sets used. Note that this directory is not included on GitHub due to size constraints, but it is available upon request.

src/: Source files used for scraping weather data.

Research Results

A research paper written on the results is contained here:

Predicting Airline Delays: A Comparison of Models and Features

Analysis

The analysis is done in four IPython notebooks:

This notebook contains the processing of all of the weather data from 2016.

This notebook includes analysis of the features involved including exploratory analysis. It also includes merging of the data sets.

This notebook includes analysis of the time series, and creation of the Poisson variables used.

The final notebook includes the comparison of the models with and without the new features added below.

I used the following data in my analysis.

Data sets:

Caveats of the above data: The plane data is from 2009, so planes that entered service between 2010 and 2016 will be missing.
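One way to keep flights whose plane is missing from the older plane data is a left join that fills the plane attributes with `None`. This is a minimal sketch in plain Python rather than the notebooks' own merge code. The key name `tail_num`, the `plane_found` flag, and all sample records are hypothetical.

```python
def left_join_planes(flights, planes, key="tail_num"):
    """Attach plane attributes to each flight; unmatched flights get None values."""
    planes_by_key = {p[key]: p for p in planes}
    plane_fields = sorted({f for p in planes for f in p if f != key})
    merged = []
    for flight in flights:
        plane = planes_by_key.get(flight.get(key))
        row = dict(flight)
        for field in plane_fields:
            row[field] = plane.get(field) if plane is not None else None
        # Explicit flag so missing planes can be counted or imputed later.
        row["plane_found"] = plane is not None
        merged.append(row)
    return merged


# Made-up records for illustration only.
flights = [
    {"tail_num": "N101", "dep_delay": 12},
    {"tail_num": "N999", "dep_delay": 45},  # not present in the plane data
]
planes = [{"tail_num": "N101", "model": "737-700", "issue_date": "2004-03-01"}]

merged = left_join_planes(flights, planes)
missing = sum(1 for row in merged if not row["plane_found"])
print(merged)
print("flights without plane data:", missing)
```

Keeping an explicit flag makes it easy to check how many flights are affected by the caveat before deciding whether to drop or impute them.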

Main Results:

Performance:

We compare performance of multiple models with and without the additional data used above. More precisely, we define:

Original: Only the Airline On-Time Performance Data.

Enhanced: All of the data included in the above links, in addition to a customized Poisson variable. More precisely, I define two time-dependent variables with Poisson priors given by: [equation image]

where P denotes a Poisson distribution whose time-dependent mean depends on the delay of the previous hour or the previous day, respectively.
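The exact definition of these variables is not reproduced in this text, so the following is only a sketch of one possible construction, not the author's formula. It assumes the Poisson mean for an hour is the number of delayed flights observed in the previous hour. The function names, the `(hour_index, is_delayed)` input format, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def lagged_poisson_means(flights):
    """Map each hour to the number of delayed flights seen in the previous hour.

    flights: iterable of (hour_index, is_delayed) pairs with integer hour_index.
    The lagged count serves as the assumed time-dependent Poisson mean.
    """
    flights = list(flights)
    delayed_per_hour = Counter(hour for hour, is_delayed in flights if is_delayed)
    hours = sorted({hour for hour, _ in flights})
    return {hour: delayed_per_hour.get(hour - 1, 0) for hour in hours}


# Made-up flights: two delays in hour 0, none in hour 1, one in hour 2.
flights = [(0, True), (0, True), (0, False), (1, False), (2, True)]
means = lagged_poisson_means(flights)

# Probability of seeing no delayed flights in each hour under the lagged mean.
p_no_delay = {hour: poisson_pmf(0, lam) for hour, lam in means.items()}
print(means)
print(p_no_delay)
```

A daily variant would follow the same pattern with a day index in place of the hour index.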

The hour variable (called hourly_poisson below) is the most important variable for the Gradient Boosted Trees and the second most important for the Random Forest classifier.

Below we see a comparison of three different classification models on the original and enriched variable sets. The best performance was obtained by the Random Forest classifier on the enriched data set, which achieved an area under the ROC curve (AUC) of 0.76.

[Figure: ROC comparison of the three classification models on the original and enriched variable sets]
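For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight, with ties counted as one half. The sketch below computes it directly from that definition. It is not the code used in the notebooks, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: 0/1 class labels; scores: predicted scores, higher means more
    likely positive. Quadratic in the input size, so suited to small samples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = delayed) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives some context for the 0.76 reported above.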

Variable Importances:

The variables are explained further in the notebook (Processing Weather Data), but for the reader's convenience here are some descriptions.

Weather: PRCP, SNOW and WT03 are weather variables representing precipitation, snowfall and thunder (WT03 is the thunder weather-type code in NOAA's GHCN-Daily data).

Plane Model: model and issue_date correspond to the make of the plane.

Poisson Regression Time Series: poisson_hourly and poisson_daily are the two time-dependent Poisson variables introduced above.

Gradient Boosted Trees:

[Figure: variable importances for Gradient Boosted Trees]

Random Forest:

[Figure: variable importances for Random Forest]
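The text does not say how the importances were computed. Tree ensembles usually report impurity-based importances, but a model-agnostic alternative that is easy to show in a few lines is permutation importance: shuffle one feature and measure how much accuracy drops. The sketch below uses that swapped-in technique with a toy stand-in for a fitted classifier. The model, the rows, and the labels are all made up.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    shuffled_values = [r[feature] for r in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_values)]
    return accuracy(predict, rows, labels) - accuracy(predict, shuffled_rows, labels)


def toy_predict(row):
    """Toy stand-in for a fitted model: predicts a delay when the hourly feature is high."""
    return 1 if row["poisson_hourly"] > 3 else 0


# Made-up feature rows and labels (1 = delayed).
rows = [
    {"poisson_hourly": 0, "PRCP": 0.0},
    {"poisson_hourly": 5, "PRCP": 1.2},
    {"poisson_hourly": 1, "PRCP": 0.3},
    {"poisson_hourly": 7, "PRCP": 0.0},
]
labels = [0, 1, 0, 1]

importances = {
    name: permutation_importance(toy_predict, rows, labels, name)
    for name in ("poisson_hourly", "PRCP")
}
ranking = sorted(importances, key=importances.get, reverse=True)
print(importances)
print(ranking)
```

Because the toy model ignores PRCP, shuffling it never changes a prediction and its importance is exactly zero. With a real model, the shuffle would normally be repeated several times and averaged.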

Contributors

doriang102
