For this analysis, I will consider the 2016 airline delay data only, and my goal will be to predict delays using this one year of data.
I have several hypotheses which I will evaluate throughout this presentation.
Goal: Given all of the airline data from 2016, can we predict delays? I tried this using two methods:
- Regression on the departure delay.
- Classification of whether the departure delay exceeds a 20-minute threshold (the point at which most major airlines define a flight as delayed).
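The two targets above can be sketched as follows. This is a minimal illustration, not the notebook code: the function name `make_targets`, the sample delays, and the choice of a strict "greater than 20 minutes" comparison are all assumptions.

```python
DELAY_THRESHOLD_MIN = 20  # threshold named in the list above


def make_targets(dep_delays, threshold=DELAY_THRESHOLD_MIN):
    """Build both prediction targets from departure delays in minutes.

    Returns (regression_target, classification_target). The strict '>'
    comparison is an assumption; the notebooks may treat the boundary
    value differently.
    """
    y_regression = [float(d) for d in dep_delays]
    y_classification = [int(d > threshold) for d in dep_delays]
    return y_regression, y_classification


# Hypothetical delays (negative values mean an early departure).
sample_delays = [-3, 0, 12, 20, 21, 95]
y_reg, y_clf = make_targets(sample_delays)
print(y_clf)  # [0, 0, 0, 0, 1, 1]
```

Keeping both targets next to each other makes it easy to train a regressor and a classifier on the same rows.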
analysis/
: The notebooks used for all of the data processing and analysis.
doc/
: The research paper and TeX file for the research paper.
fig/
: Figures generated for and used in the research paper.
json/
: JSON files used for processing some of the weather data.
data/
: All data sets used. These are not included on GitHub due to size constraints, but are available upon request.
src/
: Source files used for scraping weather data.
A research paper written on the results is contained here:
Predicting Airline Delays: A Comparison of Models and Features
The analysis is done in four IPython notebooks:
This notebook contains the processing of all of the weather data from 2016.
This notebook contains the exploratory analysis of the features and merges the data sets.
This notebook includes analysis of the time series, and creation of the Poisson variables used.
The final notebook compares the models with and without the new features described below.
I used the following data in my analysis.
- Airline On-Time Performance Data. RITA/BTS. Bureau of Transportation Statistics. 2016.
- Local Climatological Data. National Centers for Environmental Information. 2016.
- Flight Standards Service – Civil Aviation Registry. Federal Aviation Administration. 2009.
- [Passenger Boardings at Commercial Service Airports. Federal Aviation Administration.](https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy14-commercial-service-enplanements.pdf) 2014.
Caveat for the above data: the plane data is from 2009, so planes registered from 2010 to 2016 are missing.
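One way to handle this caveat is a left join that keeps every flight and leaves the plane fields empty when the tail number is not in the registry. The sketch below is illustrative only: the function `attach_plane_info`, the `tail_num` key, and the sample records are hypothetical, and the notebooks may treat missing planes differently.

```python
def attach_plane_info(flights, planes):
    """Left-join flight records to plane records by tail number.

    Flights whose tail number is absent from the registry are kept,
    with None in the plane fields, so no flight is silently dropped.
    """
    enriched = []
    for flight in flights:
        plane = planes.get(flight["tail_num"])
        enriched.append({
            **flight,
            "model": plane["model"] if plane else None,
            "issue_date": plane["issue_date"] if plane else None,
        })
    return enriched


# Hypothetical registry and flights; the second plane is missing.
planes = {"N101AA": {"model": "737-800", "issue_date": "2005-03-14"}}
flights = [
    {"tail_num": "N101AA", "dep_delay": 5.0},
    {"tail_num": "N999ZZ", "dep_delay": 32.0},
]

enriched = attach_plane_info(flights, planes)
missing_share = sum(row["model"] is None for row in enriched) / len(enriched)
print(missing_share)  # 0.5
```

Reporting the share of flights without plane data shows how much of the enhanced feature set is affected.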
We compare performance of multiple models with and without the additional data used above. More precisely, we define:
Original: Only the Airline On-Time Performance Data.
Enhanced: All of the data included in the above links, in addition to a customized Poisson variable. More precisely, I define two time dependent variables with Poisson priors given by:
where P represents a Poisson distribution whose time-dependent mean depends on the delay in the previous hour or day, respectively.
The hourly variable (called poisson_hourly below) is the top variable for the Gradient Boosted Trees and the second most important variable for the Random Forest classifier.
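The actual construction of these variables lives in the time series notebook. As a rough sketch of the idea only, the code below assumes the hourly variable is driven by the number of delayed departures in the preceding clock hour, which is also the maximum-likelihood estimate of a Poisson mean from a single hourly count. The function name, the 20-minute threshold, and the sample flights are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def previous_hour_delay_counts(flights, threshold_min=20):
    """For each (departure_time, delay_minutes) pair, return the number
    of delayed departures observed in the preceding clock hour.

    That count serves here as the estimated Poisson mean for the
    flight's own hour (an illustrative assumption).
    """
    def hour_bucket(ts):
        return ts.replace(minute=0, second=0, microsecond=0)

    delayed_per_hour = Counter(
        hour_bucket(ts) for ts, delay in flights if delay > threshold_min
    )
    one_hour = timedelta(hours=1)
    # Counter returns 0 for hours with no delayed departures.
    return [delayed_per_hour[hour_bucket(ts) - one_hour] for ts, _ in flights]


# Hypothetical flights: (scheduled departure, departure delay in minutes).
sample_flights = [
    (datetime(2016, 1, 1, 8, 5), 35.0),
    (datetime(2016, 1, 1, 8, 40), 50.0),
    (datetime(2016, 1, 1, 9, 10), 0.0),
    (datetime(2016, 1, 1, 9, 30), 25.0),
    (datetime(2016, 1, 1, 10, 15), 5.0),
]
print(previous_hour_delay_counts(sample_flights))  # [0, 0, 2, 2, 1]
```

Using only the previous hour keeps the feature free of information from the flight being predicted. A daily version would bucket by date instead of by hour.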
Below we see a comparison of three classification models on the original and enhanced variable sets. The best performance was obtained by the Random Forest classifier with the enhanced data set, which achieved an ROC AUC of 0.76.
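Reading the reported ROC figure as the area under the ROC curve (an assumption), the metric is the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The sketch below computes it directly from that definition. The function `roc_auc` and the sample labels and scores are illustrative, not taken from the notebooks.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    example gets the higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = delayed) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise form is quadratic in the number of flights, so it is suitable for illustration only. On the full data set a library implementation would be the practical choice.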
The variables are explained further in the notebook (Processing Weather Data), but for the reader's convenience here are some descriptions.
Weather: PRCP, SNOW and WT03 are weather variables representing precipitation, snowfall and thunder.
Plane Model: model and issue_date correspond to the make of the plane.
Poisson Regression Time Series: poisson_hourly and poisson_daily represent the variables I discussed briefly above.
Gradient Boosted Trees:
Random Forest: