
nfl_elo_predictions's Introduction

Introduction

This project analyzes and predicts NFL game outcomes based on ELO ratings. Do the ratings favor home or visiting teams? Does the team rating or the quarterback rating have more impact? You can read more about ELO ratings here
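For readers new to ELO, the sketch below shows the standard ELO expected-score formula, which turns a rating gap into a win probability. It is the textbook formula only; the `elo_prob1` and `elo_prob2` columns in the dataset may include adjustments (for example for home field) that are not modeled here, and the function name and the 400-point scale default are illustrative choices.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score (win probability) of side A against side B, standard ELO formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Equal ratings give an even game.
even = elo_expected_score(1500, 1500)

# A 100-point favorite and the matching underdog; the two probabilities sum to 1.
favorite = elo_expected_score(1600, 1500)
underdog = elo_expected_score(1500, 1600)

print(f"even={even:.3f} favorite={favorite:.3f} underdog={underdog:.3f}")
```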

Data

Source

The dataset analyzed in this project is from FiveThirtyEight. It contains 33 features and covers seasons from 1920 to 2022. Some features, such as importance, were introduced very late, in 2021. It contains ELO ratings for teams and quarterbacks, with both pre-game and post-game values.

Data Analysis

The dataset has 33 features, split between the two teams (home and visiting). There are around 17K rows, and the columns are:

['date', 'season', 'neutral', 'playoff', 'team1', 'team2', 'elo1_pre',
       'elo2_pre', 'elo_prob1', 'elo_prob2', 'elo1_post', 'elo2_post',
       'qbelo1_pre', 'qbelo2_pre', 'qb1', 'qb2', 'qb1_value_pre',
       'qb2_value_pre', 'qb1_adj', 'qb2_adj', 'qbelo_prob1', 'qbelo_prob2',
       'qb1_game_value', 'qb2_game_value', 'qb1_value_post', 'qb2_value_post',
       'qbelo1_post', 'qbelo2_post', 'score1', 'score2', 'quality',
       'importance', 'total_rating']

The dataset holds several ELO ratings for teams and quarterbacks: pre-game ratings, win probabilities and post-game ratings. It also records the date each game was played and the season. We ignored the dates and the post-game ratings. Because importance was not recorded until 2021, it was removed as well, and so was total_rating, which is derived from quality and importance.
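The column selection described above can be sketched as a simple filter over the column names. This is a minimal illustration, not the project's actual code: it drops only what the text names (the date, every `_post` column, `importance` and `total_rating`), and the helper name `keep_column` is hypothetical. Whether other columns such as `qb1_game_value` were also dropped is not stated, so they are kept here.

```python
ALL_COLUMNS = [
    "date", "season", "neutral", "playoff", "team1", "team2", "elo1_pre",
    "elo2_pre", "elo_prob1", "elo_prob2", "elo1_post", "elo2_post",
    "qbelo1_pre", "qbelo2_pre", "qb1", "qb2", "qb1_value_pre",
    "qb2_value_pre", "qb1_adj", "qb2_adj", "qbelo_prob1", "qbelo_prob2",
    "qb1_game_value", "qb2_game_value", "qb1_value_post", "qb2_value_post",
    "qbelo1_post", "qbelo2_post", "score1", "score2", "quality",
    "importance", "total_rating",
]


def keep_column(name: str) -> bool:
    """Return False for the columns the text says were ignored or removed."""
    if name == "date":
        return False
    if name.endswith("_post"):  # post-game ratings
        return False
    if name in {"importance", "total_rating"}:
        return False
    return True


kept_columns = [name for name in ALL_COLUMNS if keep_column(name)]
print(len(ALL_COLUMNS), "->", len(kept_columns), "columns")
```

With a pandas DataFrame, the resulting list could be used as `df[kept_columns]`.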

Here is the distribution of a subset of the features, along with the derived features (feature engineering):

Features_distribution

Preprocessing

After removing the features mentioned above and focusing on the pre-game and probability ratings, we were left with around 15K rows. There are no nulls in this data. The dataset has no target column, so for game prediction we derived a Winner feature by comparing the scores: 0 for a home team win and 1 for a visitors' win.

Other feature engineering includes the ELO difference between the teams and between the quarterbacks.
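A minimal sketch of this feature engineering for a single game is shown below. It assumes `team1` is the home team and that the quarterback difference uses the `qbelo*_pre` columns; the text does not say how tied games were handled, so returning `None` for a tie is an assumption, and the function names and sample values are hypothetical.

```python
from typing import Optional


def derive_winner(score1: int, score2: int) -> Optional[int]:
    """0 for a home (team1) win, 1 for a visitor (team2) win, None for a tie (assumption)."""
    if score1 > score2:
        return 0
    if score2 > score1:
        return 1
    return None


def add_engineered_features(game: dict) -> dict:
    """Return a copy of the game record with the derived target and difference features."""
    enriched = dict(game)
    enriched["winner"] = derive_winner(game["score1"], game["score2"])
    enriched["elo_diff"] = game["elo1_pre"] - game["elo2_pre"]
    enriched["qbelo_diff"] = game["qbelo1_pre"] - game["qbelo2_pre"]
    return enriched


# Hypothetical game record, for illustration only.
example_game = {
    "elo1_pre": 1550, "elo2_pre": 1500,
    "qbelo1_pre": 1540, "qbelo2_pre": 1510,
    "score1": 24, "score2": 17,
}
engineered = add_engineered_features(example_game)
print(engineered["winner"], engineered["elo_diff"], engineered["qbelo_diff"])
```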

Data analysis shows that the home team has an advantage and that the ELO ratings favor the home team. The effect is not very obvious, but there is a slight positive impact for home teams.

The ELO ratings and how the winners are distributed:
ELO Rating vs Winner

The ELO ratings differences and how the winners are distributed:
ELO Difference vs Winner

ELO Probability vs Scores and how winners are distributed:
ELO Prob vs Scores

Quarterback ELO ratings difference and how winners are distributed:
Quarterback ELO Difference vs Winner

Heatmap with all main features:
Heatmap

Modeling

Approach

Taking all the main features (mainly ELO ratings), we processed the data to evaluate different classification models. We separated the features from the target, split the data into training and test sets, and scaled both sets. Below are the accuracy scores for all the classification models we trained:

Accuracy Scores of all Models
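The split, scale and train steps can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the project's code: the data is synthetic (two made-up difference features and a simulated winner), and the 80/20 split, the `random_state` values and the use of a pipeline are choices made for the example. Putting `StandardScaler` in a pipeline means the scaler is fitted on the training set only and then applied to the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption, for illustration only).
rng = np.random.default_rng(0)
n_games = 500
elo_diff = rng.normal(0, 100, n_games)
qbelo_diff = elo_diff + rng.normal(0, 50, n_games)
X = np.column_stack([elo_diff, qbelo_diff])

# Simulated target: 0 = home win, 1 = visitor win, drawn from an ELO-style probability.
p_home = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
y = (rng.random(n_games) > p_home).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training data only, then reused on the test data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Swapping `LogisticRegression()` for another classifier in the pipeline gives a like-for-like comparison on the same split.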

Model Selection

After reviewing all the classification models, Logistic Regression, AdaBoost and SVC scored close to each other. We performed more analysis on these 3 models using GridSearch with different hyperparameters. Logistic Regression and SVC came out close or similar to each other. Considering the cost of processing, we went with Logistic Regression.

Training and Evaluation

Models were trained with different sets of features, hyperparameters and training sets. We varied the training sets by size and random state, checking the accuracy for each set, inspecting the coefficients to determine which features have an impact, and evaluating the scores as we went.

Best scores and hyperparameters from GridSearch:

Logistic Regression:

Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Score: 0.6559197355113419
Test Score: 0.6508407517309595

AdaBoostClassifier:

Best Parameters: {'learning_rate': 0.1, 'n_estimators': 200}
Best Score: 0.6574033876985577
Test Score: 0.648203099241675

SVC:

Best Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Best Score: 0.6546003178623334
Test Score: 0.6571051763930102
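A grid search of the kind that produces output like the above might look like the sketch below. The parameter grid is an assumption: it merely contains the reported best Logistic Regression settings, since the grid actually searched is not given. The data is synthetic, and 5-fold cross-validation with accuracy scoring is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid; liblinear supports both the l1 and l2 penalties.
param_grid = {
    "C": [0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}

search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best Parameters:", search.best_params_)
print("Best Score:", search.best_score_)             # mean cross-validated accuracy
print("Test Score:", search.score(X_test, y_test))   # accuracy on the held-out split
```

Here "Best Score" is the mean cross-validated accuracy on the training data, while "Test Score" comes from the held-out test split, which is why the two numbers can differ in either direction.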

Metrics

We evaluated multiple metrics available from sklearn to determine the best model: predicted probabilities (predict_proba), coefficients, cross-validation scores, balanced accuracy scores, classification reports and accuracy scores. Here is one of the confusion matrix plots for Logistic Regression.

conf_matrix

Here is the classification report:

Logistic Regression Classification Report
              precision    recall  f1-score   support

           0       0.60      0.50      0.55      1274
           1       0.68      0.76      0.72      1759

    accuracy                           0.65      3033
   macro avg       0.64      0.63      0.63      3033
weighted avg       0.65      0.65      0.64      3033
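As a reading aid for the report, the short sketch below recomputes the per-class F1 scores from the rounded precision and recall values shown above, and the balanced accuracy as the mean of the two recalls. The helper name `f1_score_from` is hypothetical; because the inputs are rounded to two decimals, only the rounded results are meaningful.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall taken from the report above.
f1_class0 = f1_score_from(0.60, 0.50)
f1_class1 = f1_score_from(0.68, 0.76)

# Balanced accuracy is the unweighted mean of the per-class recalls,
# which is the same quantity as the macro-average recall.
balanced_accuracy = (0.50 + 0.76) / 2

print(round(f1_class0, 2), round(f1_class1, 2), round(balanced_accuracy, 2))
```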

Cross-validation scores:

Cross-validation scores: [0.67194197 0.66094987 0.6378628  0.6444591  0.65435356]
Mean accuracy: 0.6539134602921078
Standard deviation of accuracy: 0.01201449638315885
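The summary figures can be reproduced from the five fold scores listed above. The sketch uses the standard library only; the reported standard deviation corresponds to the population standard deviation (the default in NumPy's `std`), not the sample standard deviation.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above.
cv_scores = [0.67194197, 0.66094987, 0.6378628, 0.6444591, 0.65435356]

mean_accuracy = mean(cv_scores)
std_accuracy = pstdev(cv_scores)  # population standard deviation (ddof = 0)

print(f"Mean accuracy: {mean_accuracy:.4f}")
print(f"Standard deviation of accuracy: {std_accuracy:.4f}")
```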

Conclusion

Based on the analysis, Logistic Regression gives somewhat good results, and we need to explore optimizing it further by adding other features.
We also plan to work on other data elements that impact NFL games and incorporate them into our model. The model could then be applied to other sports, using datasets from other kinds of games, and made generic enough to predict any ELO-rated game.

nfl_elo_predictions's People

Contributors

itsakcode, acraigsen, lhmelton, northcoastbuzz
