
nflscrapr-data's Introduction

nflscrapR-data repository

This repository contains data accessed from NFL.com using nflscrapR, along with all of the statistics generated by the nflscrapR expected points and win probability models (source code available here).

The data folders are organized in the following manner (will be updating):

  • play_by_play_data - all play-by-play data accessed with nflscrapR, with three folders for pre-, post-, and regular season games.
  • games_data - all game data accessed with the nflscrapR::scrape_game_ids function, containing info such as the home and away teams, the score, and the game's URL, with three folders for pre-, post-, and regular season games.
  • legacy_data - all data accessed and generated with a previous version of nflscrapR

Additionally, code examples are located in the R folder.

During the 2018 NFL season, this repository will be updated at least once a week on either Tuesday or Wednesday to account for each week of games played.

nflscrapr-data's People

Contributors

dutta, ryurko


nflscrapr-data's Issues

Unreasonably large playtimediff in 2 games

GameID: 2011120406, 2014100510

In 2011120406, the data might not be sorted properly, several plays with >200 playtimediff and even >600.
In 2014100510, there are a couple plays with playtimediff>=900.
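A minimal base R sketch of how such plays could be surfaced. The column names `GameID` and `playtimediff`, the toy values, and the 200-second cutoff are assumptions for illustration, not taken from the actual files.

```r
# Toy play-by-play rows; column names and values are illustrative assumptions.
pbp <- data.frame(
  GameID = c(2011120406, 2011120406, 2011120406, 2014100510, 2014100510),
  playtimediff = c(35, 640, 12, 900, 41)
)

# Hypothetical cutoff: flag any gap between plays longer than 200 seconds.
threshold <- 200
suspect <- pbp[!is.na(pbp$playtimediff) & pbp$playtimediff > threshold, ]

# Count flagged plays per game.
suspect_counts <- table(suspect$GameID)
print(suspect_counts)
```

Running the same filter on a full season would show whether the problem is limited to these two games.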

Error in EPA and WPA on a muffed punt

GameID: 2017100808. 2nd quarter 12:36 left
Seattle punts, LA muffs punt, Seattle takes over. Pastebin of the datapoints: https://pastebin.com/ADrw1B6c

Seattle's expected points before is -1.558698 (makes sense, they're about to punt)
Expected points after is 4.058728 (makes sense, 1st and 10 from opponent 30)

So EPA should be 5.617426 (the difference between before and after) but instead it's listed as -2.50002953082569.
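The arithmetic can be checked in a couple of lines of R. This sketch only uses the numbers quoted in this issue and assumes EPA is the simple difference between expected points after and before the play, from the same team's perspective.

```r
# Values quoted in the issue for the muffed punt in game 2017100808.
ep_before  <- -1.558698          # Seattle about to punt
ep_after   <-  4.058728          # Seattle 1st and 10 at the opponent's 30
epa_listed <- -2.50002953082569  # value found in the data

# EPA recomputed as the change in expected points for the same team.
epa_expected <- ep_after - ep_before
print(epa_expected)

# Gap between the listed value and the recomputed one.
print(epa_listed - epa_expected)
```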

home_wp_post is also different than home_wp_pre on the next line (I think it's home_wp_post that's wrong)

Yards Gained

Thanks a lot for the data!
I am working a lot with it and have to parse the yards gained out of the String descriptions if I want to know the yards gained on a run play. For passes you can of course use AirYards+YAC but a column like netyrds but per play and not for the whole drive would be awesome to have.
Thx again, all the best
Vincent
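Until such a column exists, the parsing described above might look roughly like the sketch below. `parse_yards` is a hypothetical helper, not part of nflscrapR. It takes only the first "for N yards" or "for no gain" phrase in a description, so plays with returns, laterals, or reversed rulings would need more care. The first three descriptions are quoted from other issues on this page; the last one is made up to show the no-match case.

```r
# Hypothetical helper: pull the first "for N yards" / "for no gain" out of a description.
parse_yards <- function(desc) {
  m <- regmatches(desc, regexpr("for (-?[0-9]+) yards?|for no gain", desc))
  if (length(m) == 0) return(NA_integer_)   # no yardage phrase found
  if (m == "for no gain") return(0L)
  as.integer(sub("for (-?[0-9]+) yards?", "\\1", m))
}

descs <- c(
  "(11:41) M.Stafford sacked at DET 13 for -7 yards (S.Shanle).",
  "(11:06) (Shotgun) M.Stafford pass short left to K.Smith to DET 31 for 18 yards (J.Casillas).",
  "Z.Zenner to DET 11 for no gain (E.Kendricks).",
  "(15:00) Two-minute warning."
)

yards <- vapply(descs, parse_yards, integer(1), USE.NAMES = FALSE)
print(yards)
```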

"Set-of-Downs" ID or "Series" ID

Would be great if a "Set-of-Downs" or "Series" ID could be included to identify when a new set of downs begins. Maybe a sequential number similar to "Drive".
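One possible way to derive such an ID, sketched in base R. The column names `Drive`, `down`, and `first_down` are assumptions, and the rule used here (a new series starts on a new drive or right after a play that earned a first down) is only one reasonable definition; penalties and turnovers would need extra handling.

```r
# Toy plays in game order; column names and values are illustrative assumptions.
plays <- data.frame(
  Drive      = c(1, 1, 1, 1, 1, 2, 2),
  down       = c(1, 2, 3, 1, 2, 1, 2),
  first_down = c(0, 0, 1, 0, 0, 0, 0)  # 1 if the play earned a new set of downs
)

n <- nrow(plays)
# TRUE where the drive number differs from the previous play (and on the first play).
new_drive <- c(TRUE, plays$Drive[-1] != plays$Drive[-n])
# TRUE where the previous play earned a first down.
after_first_down <- c(FALSE, plays$first_down[-n] == 1)

# Sequential series number, analogous to the drive number.
plays$Series <- cumsum(new_drive | after_first_down)
print(plays)
```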

Incorrect Pick-six data

See e.g. gsis_id '2013092903', drive 22. The sixth play results in a pick-six of Houston by Seattle. This is recorded in the points fields as follows:

total_home_score | total_away_score | posteam_score | defteam_score | score_differential | posteam_score_post | defteam_score_post | score_differential_post
20 | 19 | 20 | 13 | 7 | 20 | 19 | 1

The PAT is not captured in this data. In fact, for all future drives in this game, SEA's score is off by a point.

player IDs

I think that you should include a player_id field in the csv files. The name fields, such as "rusher", identify a player only by the first initial and last name and are not unique.

Incorrect negative air yards on numerous plays

game_id: 2019100611
play_id: 4135

Shows -15 air yards and 68 yards after catch. In reality the play was 21 air yards and 32 yards after catch.

game_id: 2019120101
play_id: 3896

Shows -13 air yards and 28 yards after catch. In reality the play was 14 air yards and 1 yard after catch.

Just a Question

I've been playing with some of the NFL data myself and I was wondering how you put together the rosters?

Missing Points (PosTeamScore,DefTeamScore)

Two things:

  1. Points seem to be missing from the PosTeamScore and/or DefTeamScore fields when points are scored as time expires. See 2017 Eagles/Giants for example (2017092409). Final scores are 24-24 despite Elliott's game-winning FG.
  2. Can you add HomeTeamScore and AwayTeamScore fields?

Touchdown plays without WPA

In the PBP data across all seasons there are over 400 good touchdown plays for which the WPA is <NA>. I have tried to find some common factor between them but there doesn't appear to be one. The home/away_wp_post are, logically, also <NA>. There are a further 500 + plays with <NA> WPA which are not TDs.

However, since TD plays are more likely to add significant WP I consider those to be the biggest concern, as it could conceivably skew any analysis.
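A rough sketch of how the two counts above could be reproduced. The column names `touchdown` and `wpa` and the toy rows are assumptions; the real files may name these fields differently.

```r
# Toy rows; column names and values are illustrative assumptions.
pbp <- data.frame(
  play_id   = 1:5,
  touchdown = c(1, 0, 1, 0, 1),
  wpa       = c(0.12, NA, NA, -0.03, NA)
)

missing_wpa <- is.na(pbp$wpa)

# Touchdown plays with no WPA, and all other plays with no WPA.
td_missing    <- pbp[missing_wpa & pbp$touchdown == 1, ]
other_missing <- pbp[missing_wpa & pbp$touchdown != 1, ]

cat("TD plays with NA WPA:", nrow(td_missing), "\n")
cat("Other plays with NA WPA:", nrow(other_missing), "\n")
```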

Missing extra points

GameID: 2017102908

Earl Thomas scores a TD on an INT return and Seattle's following extra point is not included in the play by play data. The score is then wrong for the remainder of the game. This may affect the WP calculations too.

The same thing happened in SEA-IND when Seattle had a pick-6 in that game, so it might be related to defensive scores (GameID 2017100113).

Play-by-Play: Yards.Gained truncates 100+ yard plays down to 10 yards

This problem shows up in kickoff playtypes where the return is more than 99 yards. The easiest way to find examples is to open up the Play-by-Play data and filter PlayType for "Kickoff" and sp for "1". Numerous examples of 100+ yard kickoff return touchdowns will show up for any year (2013 and 2017 play-by-play data is confirmed to have this problem).

Issue: Play by Play Data for GameID 2011120406 has many repeated plays

This 2011 game between the Saints and Lions has repeated play-by-play entries showing up at different points in the game. For example:

Drive: 6, qtr: 2, down: 2, time: 11:41, desc: (11:41) M.Stafford sacked at DET 13 for -7 yards (S.Shanle).
next play desc: (11:06) (Shotgun) M.Stafford pass short left to K.Smith to DET 31 for 18 yards (J.Casillas).

then it repeats:
Drive: 11, qtr: 2, down: 2, time 11:39, desc: (11:39) M.Stafford sacked at DET 13 for -7 yards (S.Shanle).
next play desc: (11:07) (Shotgun) M.Stafford pass short left to K.Smith to DET 31 for 18 yards (J.Casillas).

This seems to go on for a few drives, where plays from the 2nd quarter are showing up in the 3rd quarter.
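Since the repeated entries differ only in the clock stamp, one way to find candidates is to strip the leading "(MM:SS)" and look for duplicate descriptions. This is a sketch using the four descriptions quoted above; the column names are assumptions. Identical descriptions can also occur legitimately, so this only produces candidates for manual review.

```r
# Descriptions and drive numbers as quoted in the issue.
plays <- data.frame(
  Drive = c(6, 6, 11, 11),
  desc = c(
    "(11:41) M.Stafford sacked at DET 13 for -7 yards (S.Shanle).",
    "(11:06) (Shotgun) M.Stafford pass short left to K.Smith to DET 31 for 18 yards (J.Casillas).",
    "(11:39) M.Stafford sacked at DET 13 for -7 yards (S.Shanle).",
    "(11:07) (Shotgun) M.Stafford pass short left to K.Smith to DET 31 for 18 yards (J.Casillas)."
  ),
  stringsAsFactors = FALSE
)

# Remove the leading clock stamp such as "(11:41) " or "(:09) ".
plays$desc_key <- sub("^\\([0-9]*:[0-9]{2}\\) ", "", plays$desc)

# Mark every occurrence after the first of each description.
plays$is_repeat <- duplicated(plays$desc_key)
print(plays[plays$is_repeat, c("Drive", "desc")])
```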

Mistyped Blocked Punts

There are a number of instances where a punt was blocked but it is labeled as "clean". Game ID 2013100700, Play ID 204 is an example. Punt was blocked, then a pass attempted. This is labeled "Clean" instead of "Blocked". Additionally, a number of Fake Punts are labeled as "Clean". I suggest a third category be made here called "Fake".

Subscript Out of Bounds

When executing season_play_by_play, getting a subscript out of bounds error. See attachment.


Data

What website(s) do you get the data from?

Game player stats - proposed changes

I think the game player stats files would be more useful if you had a field for season and week. This would allow you to do lead/lag and other window functions.

Also, there are some NaN, Inf, and '-' values in certain fields. Do you think it would be better to change those to empty strings? In the attached file, I used the foreign data wrapper in postgresql to join the csv file with my existing tables to add in the season and week and I had to remove all of those values for the process to work.

game_passing_df.csv.tar.gz

Thanks!
Eric
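The cleanup described above could be sketched in base R as follows. The table and its column names are made up; the idea is to treat `NaN`, `Inf`, and a bare `-` as missing and write missing values as empty fields.

```r
# Toy stats table read as text; column names and values are made up.
stats <- data.frame(
  player   = c("A", "B", "C", "D"),
  comp_pct = c("0.65", "NaN", "Inf", "-"),
  yards    = c("250", "-", "0", "-12"),
  stringsAsFactors = FALSE
)

# Tokens to treat as missing. A bare "-" is matched exactly, so "-12" is kept.
bad_tokens <- c("NaN", "Inf", "-Inf", "-")

clean <- stats
for (col in names(clean)) {
  clean[[col]][clean[[col]] %in% bad_tokens] <- NA
}

# na = "" writes missing values as empty fields in the CSV.
out <- tempfile(fileext = ".csv")
write.csv(clean, out, row.names = FALSE, na = "")
print(readLines(out))
```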

Some plays are classified as pass attempts and sacks.

Hey! First off thanks for providing all this data. I did find an issue when looking over the pbp data for 2017. It looks like some pass plays were labeled as pass attempts and sacks. Most of them tend to be plays that were challenges. Not sure if it's an issue with the NFL API or with how you are parsing the data, but I thought you should be aware of the problem. Below is a full list of the plays from the 2017 season, including data frame index (they are 0-based), game id and play description.

index: 800
GameID: 2017091005
play: (1:18) (No Huddle, Shotgun) T.Savage sacked at HOU 38 for -9 yards (M.Jackson). FUMBLES (M.Jackson), RECOVERED by JAX-T.Smith at HOU 46. T.Smith for 46 yards, TOUCHDOWN. The Replay Official reviewed the fumble ruling, and the play was REVERSED. (No Huddle, Shotgun) T.Savage pass incomplete short left to D.Hopkins (M.Jackson).

index: 5154
GameID: 2017091713
play: (14:17) (Shotgun) A.Rodgers sacked at GB 21 for -9 yards (V.Beasley). FUMBLES (V.Beasley) [V.Beasley], RECOVERED by ATL-D.Trufant at GB 15. D.Trufant for 15 yards, TOUCHDOWN. The Replay Official reviewed the backward pass ruling, and the play was Upheld. The ruling on the field stands.

index: 9269
GameID: 2017100106
play: (3:08) (Shotgun) M.Stafford pass incomplete short middle to Z.Zenner. Minnesota challenged the incomplete pass ruling, and the play was REVERSED. (Shotgun) M.Stafford sacked at DET 26 for -9 yards. FUMBLES, recovered by DET-Z.Zenner at DET 11. Z.Zenner to DET 11 for no gain (E.Kendricks).

index: 9997
GameID: 2017100104
play: (11:12) (Shotgun) D.Prescott sacked at DAL 18 for -4 yards (M.Brockers). FUMBLES (M.Brockers) [M.Brockers], RECOVERED by LA-M.Barron at DAL 29. M.Barron to DAL 29 for no gain (C.Beasley). The Replay Official reviewed the fumble ruling, and the play was REVERSED. (Shotgun) D.Prescott pass short middle intended for C.Beasley INTERCEPTED by M.Barron [M.Brockers] at DAL 33. M.Barron to DAL 29 for 4 yards (C.Beasley).

index: 10820
GameID: 2017100113
play: (3:38) (Shotgun) R.Wilson pass incomplete short middle to J.McKissic [N.Hairston]. Indianapolis challenged the incomplete pass ruling, and the play was REVERSED. (Shotgun) R.Wilson sacked in End Zone for -6 yards, SAFETY (N.Hairston).

index: 10830
GameID: 2017100113
play: (:09) R.Wilson pass short left to L.Willson to IND 31 for 19 yards (V.Davis; M.Farley) [J.Sheard]. Indianapolis challenged the pass completion ruling, and the play was REVERSED. R.Wilson sacked at SEA 47 for -3 yards (J.Sheard).

index: 11305
GameID: 2017100500
play: (2:39) (Shotgun) J.Winston FUMBLES (Aborted) at NE 45, and recovers at NE 48. J.Winston pass incomplete short left to A.Humphries. New England challenged the runner was not down by contact ruling, and the play was REVERSED. (Shotgun) J.Winston FUMBLES (Aborted) at NE 45, and recovers at NE 48. J.Winston sacked at NE 47 for -10 yards (D.Wise).

index: 14995
GameID: 2017101506
play: (1:52) B.Hoyer sacked at SF 6 for -7 yards (M.Foster). FUMBLES (M.Foster) [M.Foster], RECOVERED by WAS-D.Swearinger at SF 8. D.Swearinger for 8 yards, TOUCHDOWN. The Replay Official reviewed the fumble ruling, and the play was REVERSED. B.Hoyer pass incomplete short right [M.Foster].

index: 17219
GameID: 2017102204
play: (14:27) (No Huddle, Shotgun) J.Brissett pass incomplete short left to M.Mack [Y.Ngakoue]. Jacksonville challenged the incomplete pass ruling, and the play was REVERSED. (No Huddle, Shotgun) J.Brissett sacked at IND 9 for -14 yards (M.Jackson). FUMBLES (M.Jackson) [Y.Ngakoue], recovered by IND-A.Castonzo at IND 12. A.Castonzo to IND 12 for no gain (T.Smith). Credit a sack for minus-11 yards

index: 20434
GameID: 2017102908
play: (3:15) (Shotgun) R.Wilson pass incomplete short right to L.Willson (J.Clowney) [D.Reader]. Seattle challenged the incomplete pass ruling, and the play was REVERSED. (Shotgun) R.Wilson sacked at HOU 41 for -10 yards (J.Clowney). FUMBLES (J.Clowney) [D.Reader], recovered by SEA-L.Willson at HOU 20. L.Willson to HOU 20 for no gain (A.Hal).

index: 22630
GameID: 2017110507
play: (12:26) C.Beathard sacked at ARI 43 for -7 yards (K.Martin). FUMBLES (K.Martin), recovered by SF-K.Juszczyk at ARI 49. K.Juszczyk to ARI 49 for no gain (K.Martin). San Francisco challenged the fumble ruling, and the play was REVERSED. C.Beathard pass incomplete short right to M.Goodwin (K.Martin).

index: 24154
GameID: 2017111204
play: (3:44) B.Bortles pass short left to L.Fournette to JAX 43 for 20 yards (T.Boston). Los Angeles Chargers challenged the runner was not down by contact ruling, and the play was REVERSED. B.Bortles sacked at JAX 19 for -4 yards (K.Emanuel).

index: 35452
GameID: 2017121001
play: (1:41) (No Huddle, Shotgun) C.Keenum pass incomplete short left to J.McKinnon [J.Peppers]. The Replay Official reviewed the incomplete pass ruling, and the play was REVERSED. (No Huddle, Shotgun) C.Keenum sacked at MIN 37 for -6 yards (J.Peppers).

index: 37358
GameID: 2017121703
play: (13:21) (Shotgun) T.Yates sacked at HOU 12 for -6 yards (M.Jackson). FUMBLES (M.Jackson), RECOVERED by JAX-M.Jack at HOU 19. M.Jack for 19 yards, TOUCHDOWN. The Replay Official reviewed the fumble ruling, and the play was REVERSED. (Shotgun) T.Yates pass incomplete short middle to S.Anderson.

index: 41615
GameID: 2017122409
play: (7:54) B.Bortles sacked at 50 for -8 yards (D.Buckner). FUMBLES (D.Buckner) [D.Buckner], RECOVERED by SF-R.Foster at SF 33. The Replay Official reviewed the fumble ruling, and the play was REVERSED. B.Bortles pass incomplete short middle to K.Cole [D.Buckner].

index: 44082
GameID: 2017123112
play: (1:18) (Shotgun) P.Lynch sacked at DEN 30 for -7 yards (T.Kpassagnon).

index: 44100
GameID: 2017123112
play: (8:35) (No Huddle, Shotgun) P.Lynch sacked at KC 21 for -10 yards (T.Kpassagnon).
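A list like the one above could be produced with a filter along these lines. The column names `PassAttempt`, `Sack`, and `desc` are assumptions about the play-by-play layout, and the rows are made-up placeholders rather than real plays.

```r
# Toy rows; column names, flags, and descriptions are illustrative assumptions.
pbp <- data.frame(
  GameID      = c(1, 2, 3, 4),
  PassAttempt = c(1, 1, 1, 0),
  Sack        = c(1, 1, 0, 1),
  desc = c(
    "QB sacked. The play was REVERSED. QB pass incomplete.",
    "QB sacked for -7 yards.",
    "QB pass complete for 12 yards.",
    "QB sacked for -3 yards."
  ),
  stringsAsFactors = FALSE
)

# Plays flagged as both a pass attempt and a sack.
both <- pbp[pbp$PassAttempt == 1 & pbp$Sack == 1, ]

# Check how many of them involve a reversed ruling.
both$reversed <- grepl("REVERSED", both$desc, fixed = TRUE)
# Zero-based row index, matching the convention used in the list above.
both$index <- as.integer(rownames(both)) - 1L
print(both[, c("index", "GameID", "reversed")])
```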

EPA Inconsistency with Missed FG

I'm looking at Legacy data and contrasting two missed 62-yard FGs (no return): play_id 4138 from 2012 and play_id 5178 from 2016. For some reason the 2012 play was scored as positive EPA, but the 2016 play was not, despite nearly identical circumstances.

R Errors

Hi,

When running the season stats script in R, I'm getting the following errors. Not sure if these are causing issues with the data. Also running very slow.

Note that I'm running this while the Thursday night game is being played. Not sure if that's part of it.

Thanks guys,
-Mike

TKIntRJMP.R version 5.05
Warning: Loading required package: nnet
Warning: Loading required package: magrittr
Warning: Loading required package: XML
Warning: Loading required package: RCurl
Warning: Loading required package: bitops
Warning: Warning messages:
Warning: 1: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 2: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 3: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 4: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 5: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 6: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 7: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 8: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Warning: 9: In .parse_hms(..., order = "MS", quiet = quiet) :
Warning: Some strings failed to parse, or all strings are NAs
Scriptable[]

EPA

not sure if I understand that correctly, but there are many instances where 2nd&short has higher EP than following 1st&10 -> that results in a negative EPA for successful 2nd down plays.
example KC vs NE game ID 2017090700
drive no 4
chiefs play at Q1 8:48 2nd & 1 from own 19, EP 0.80 -> 4 yard gain, new 1st down, results in:
1st & 10 from own 23, EP 0.74 -> so a positive play has negative EPA

there are many instances of that.
could you check that?

Question about updating data

Is the play by play data going to be continuously updated throughout this season? If not, are you open to support of others updating the play by play?

Playoff/SB games?

Does the NFL have the same structure for Playoff and/or Super Bowl games? Would be really cool to analyze how/if teams change their tendencies in the playoffs.

This data is awesome, btw.

away_wp incorrect

play_id = 4134
game_id = 2018090900
The away_wp is 0 for most of this game but then jumps to almost 1 for this one play.

Weird.
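A single-play spike like this can be found by looking for a play whose win probability differs sharply from both its neighbours. The sketch below uses made-up values, and the 0.5 threshold is an arbitrary assumption.

```r
# Toy win probability series in play order; all values are made up.
pbp <- data.frame(
  play_id = 1:5,
  away_wp = c(0.00, 0.00, 0.98, 0.00, 0.00)
)

threshold <- 0.5                 # hypothetical cutoff for a "large" change
d <- abs(diff(pbp$away_wp))      # absolute change from each play to the next

jump_in  <- c(NA, d)             # change going into each play
jump_out <- c(d, NA)             # change going out of each play

# A play is an isolated spike if it jumps away and then straight back.
isolated <- which(jump_in > threshold & jump_out > threshold)
print(pbp[isolated, ])
```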

Additional Roster Information

Hi Ron,

I would like to suggest the addition of some roster information to your roster_data. I came to this need when I noticed that players have a profile id and an esb id in addition to the gsis id. The esb id is needed to build the url for headshots.

I have developed a solution, shown below. Since scraping the esb ids on my laptop took 0.7s on average per profile, it is not something that should be run every time the esb id is needed. That's why I am asking you to add it to your csv data.

library(tidyverse)
library(jsonlite)
library(rvest)

roster_ron <-
  read_csv(
    "https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/roster_data/regular_season/reg_roster_2019.csv"
  )

player_info_json <-
  bind_rows(lapply(
    fromJSON(
      "https://raw.githubusercontent.com/derek-adair/nflgame/master/nflgame/players.json"
    ),
    as.data.frame
  ))

roster_ron_new <-
  roster_ron %>%
  inner_join(
    player_info_json %>%
      select(birthdate, college, gsis_id, profile_id, profile_url),
    by = "gsis_id"
  )

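# Scrape the esb id from each profile page (slow: one HTTP request per player).
roster_ron_new$esb_id <-
  vapply(as.character(roster_ron_new$profile_url),
         function(url) {
           id <- read_html(url) %>%
             html_nodes(xpath = '//meta[@id="playerId"]') %>%
             html_attr('content')
           # Fall back to NA when the page has no playerId meta tag.
           if (length(id) == 0) NA_character_ else id[[1]]
         },
         character(1),
         USE.NAMES = FALSE)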

roster_ron_new$headshot_url <-
  glue::glue(
    "http://static.nfl.com/static/content/public/static/img/fantasy/transparent/200x200/{roster_ron_new$esb_id}.png"
  )

Another possibility would be to add esb_id and headshot_url to all the data in player_info_json (that would probably take more than an hour and a half) and save it as a new csv.

GameID 2010101004 Incorrect Score

In PlayID 2140 of GameID 2010101004 (STL vs DET), the posteam_score_post column incorrectly increases from 17 to 23 on a TD that was reversed by replay. As a result, this drive (#11) has two different scores on it (the correct TD happens in PlayID 2339), and the final score is wrong (49-6, should be 44-6).

Issue: Play-by-Play EPA data for Kick-Off Fumble Returns for TD is wrong

EPA is defined as the expected points added for the Possession Team (PT). On Kickoffs, the team receiving the kick is the PT.

However, on Kickoffs, if the KR team fumbles the ball and the Defensive Team (Kickoff Team) returns it for a TD, the EPA is positive. It should be negative since the Defensive Team scored.

There are 3 examples from 2013 play by play data. Below are details of one of these examples:
GameID: 2013101303 TimeSecs: 1362 EPA: 6.54564648081907 Desc: G.Zuerlein kicks 70 yards from STL 35 to HOU -5. K.Martin to HOU 10 for 15 yards (R.McLeod). FUMBLES (R.McLeod), RECOVERED by STL-D.Bates at HOU 11. D.Bates for 11 yards, TOUCHDOWN.
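A sketch of how the affected plays might be flagged. The first row uses the teams and EPA from the example above; the other rows are made up. `td_team` is a hypothetical column naming the team credited with the touchdown, and the other column names are assumptions as well.

```r
# Row 1 follows the example above; rows 2 and 3 are made-up placeholders.
pbp <- data.frame(
  PlayType = c("Kickoff", "Kickoff", "Pass"),
  posteam  = c("HOU", "HOU", "STL"),
  td_team  = c("STL", NA, "STL"),   # hypothetical column: team credited with the TD
  EPA      = c(6.54564648081907, 0.3, 4.1),
  stringsAsFactors = FALSE
)

# A touchdown scored by the team that does not have possession.
def_td <- !is.na(pbp$td_team) & pbp$td_team != pbp$posteam

# Kickoffs where the kicking team scored but EPA is still positive.
wrong_sign <- pbp$PlayType == "Kickoff" & def_td & pbp$EPA > 0
print(pbp[wrong_sign, ])
```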
