jldbc / pybaseball
Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
License: MIT License
When I execute pybaseball.download_lahman(), the following happens.
This problem was already fixed by this commit, but I guess the fix may not be in the module released to PyPI.
(I checked 1.0.5 and 1.0.7; both still set lahman.url to baseballdatabank-2017.1.zip.)
Could you check the current state of the released module and fix this problem?
It seems like it's giving me an error on the requests dependency, but I have requests 2.18.4 and have also tried updating to 2.19.1.
Here's the full error message:
Collecting pybaseball
Using cached https://files.pythonhosted.org/packages/73/ed/032d64eddfbc0acad1cc509e5376ae63161f5ba2e079039ef04794fb51b7/pybaseball-1.0.7.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/s6/5hjjzxbs17g8t1ch00w6whvm0000gn/T/pip-install-pvC6TX/pybaseball/setup.py", line 90, in <module>
    'requests>=2.18.1'],
  File "/Users/irarickman/anaconda2/lib/python2.7/distutils/core.py", line 111, in setup
    _setup_distribution = dist = klass(attrs)
  File "/Users/irarickman/anaconda2/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 225, in __init__
    _Distribution.__init__(self, attrs)
  File "/Users/irarickman/anaconda2/lib/python2.7/distutils/dist.py", line 287, in __init__
    self.finalize_options()
  File "/Users/irarickman/anaconda2/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 257, in finalize_options
    ep.require(installer=self.fetch_build_egg)
  File "/Users/irarickman/anaconda2/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2029, in require
    working_set.resolve(self.dist.requires(self.extras), env, installer))
  File "/Users/irarickman/anaconda2/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 592, in resolve
    raise VersionConflict(dist, req)  # XXX put more info here
pkg_resources.VersionConflict: (certifi 2018.01.18 (/Users/irarickman/anaconda2/lib/python2.7/site-packages), Requirement.parse('certifi==2016.9.26'))
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/s6/5hjjzxbs17g8t1ch00w6whvm0000gn/T/pip-install-pvC6TX/pybaseball/
Hey @jldbc thanks for creating this module! I'm currently using the statcast_single_game module for a project, and was wondering if you had a list of game_ids for previous seasons, or if you could point me in the right direction I would really appreciate it!
Hi guys,
I'm on GitHub Pro, I'm happy to work on this, I have a fork already -- what would you fine folks say to moving to there and working over there?
This library relies on a set of data sources that are all online, which can be problematic for a few reasons.
So after discussing with @schorrm in #85 we'd like to add a caching layer that would allow for data to be stored locally for repeat calls. I have some ideas that I will list in further comments, but other suggestions are welcome as well.
Hi @jldbc - thanks for putting this project together. I am using the statcast function to retrieve pitching data, and am proposing a simple change.
Adding drop=True in the reset_index() call on line 170 of statcast.py prevents an unnecessary column named "index" from being created. Happy to add the change in a future PR.
Thanks again!
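For illustration, here is the difference in miniature (a minimal sketch, not the actual statcast.py code):

```python
import pandas as pd

# Resetting the index without drop=True materializes the old index
# as a new column named "index"; drop=True discards it instead.
df = pd.DataFrame({'a': [10, 20, 30]}, index=[5, 7, 9])

without_drop = df.reset_index()        # gains an 'index' column
with_drop = df.reset_index(drop=True)  # old index is discarded
```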
When generating a statcast query, there are now a number of pitches which are returned with 'pitch_type' set to what seems to be a date-string.
The easiest way to look at this is to use something like data['pitch_type'].unique().
After a quick look, the vast majority of these pitches also come back with blanks in several other fields.
However, this doesn't cover all of the cases; the remainder seem normal except for a blank 'pitch_name' field - though there are plenty of pitches with blank pitch types that also return a blank 'pitch_name' field but are otherwise normal.
This is an issue on the statcast side: I've replicated the behaviour using a simple 2-day query from baseballsavant. With that in mind, this isn't really an issue with pybaseball, as I don't think a cut-and-dried fix exists on the pybaseball end, but it's more something for people to be aware of.
For what it's worth, in case others want to remove these entries like I did, I used something like this:
import numpy as np

mask = np.in1d(data['pitch_type'].astype(str).str[0], '1')
data = data[~mask]
Which covers you for these entries occurring across multiple years.
Thanks again for the hard work James.
Cheers,
Rens
When the number of days in the date range is a multiple of 6 plus 1 (i.e. 7, 13, 19, ...), the last day cannot be imported successfully.
FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
final_data = pd.concat(dataframe_list, axis=0)
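A minimal sketch of silencing the warning, assuming the future no-sort behavior is acceptable for these frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2]})
df2 = pd.DataFrame({'b': [3], 'a': [4]})  # same columns, different order

# Passing sort=False opts into the future default and silences the FutureWarning.
final_data = pd.concat([df1, df2], axis=0, sort=False)
```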
It would be nice to have a link or at least the hash value where I can find the videoclip for a specific pitch. Right now I have to redo the search on the site to obtain it.
Why don't you create a dataset from this data, instead of sending a request to a website hosted on the internet every time? This idea makes sense.
Getting the following error but it doesn't reproduce reliably which is odd.
Error message is "pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 13331."
I received this message when running statcast('2008-03-20','2011-11-11'), and it appears to have happened in the sub-query from 2010-07-02 to 2010-07-07.
Re-running the same full query statcast('2008-03-20','2011-11-11') doesn't reproduce the issue, and neither does statcast('2010-07-02','2010-07-07').
The error doesn't seem to impact many of the smaller queries, but probably needs a fix since it becomes increasingly likely to break a query as the date range gets larger and the function depends on a larger number of requests.
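One possible mitigation would be to retry a failed sub-query a few times before giving up. A sketch, not part of pybaseball; the helper name and parameters are made up for illustration:

```python
import time

def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call a zero-argument fetch function, retrying on failure.

    Intended for intermittent failures such as a pandas ParserError
    caused by a truncated response.
    """
    last_err = None
    for i in range(attempts):
        try:
            return fetch()
        except Exception as err:
            last_err = err
            time.sleep(delay * i)  # no wait after the first failure, then linear backoff
    raise last_err

# Usage (hypothetical):
# data = fetch_with_retry(lambda: statcast('2010-07-02', '2010-07-07'))
```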
I have a handful of raggedy modules I use for fantasy purposes:
I also use the CBS / Yahoo APIs, which are deprecated but can still surface some insights.
Anyone see any value integrating?
The scraper broke a couple days ago. I fixed it and submitted a PR (#60).
I've been using this library for a little bit (and just submitted my first PR!) and have really been enjoying it.
I have a few suggestions that I'd like to make. I'm more than willing to do a lot of the work on these suggestions, but wanted to make sure I would be on the same page with other devs before I just start throwing code over the wall.
Add some unit/integration tests. This library would be ripe for adding some unit and integration tests. Both to help make sure code changes are good, but also as a way to determine if an external data source format has changed.
Refactor/reuse code. It seems that a lot of the code that is used for the same site (e.g., FanGraphs, BRef) is very repetitive. I think it would be great to distill some of this down to some shared code to help minimize issues when changes are made. Would do this after the above suggestion to ensure that the library is 100% backwards compatible.
Adding a caching layer. Sometimes when making calls to FanGraphs, the site will start rejecting your requests for exceeding their rate limit. Locally I wrote a caching wrapper that will save my DataFrames to CSV and return those if available. Integrating this into the library would allow others to take advantage.
Add an extra field to the FanGraphs results that returns a column that is joinable to the other data. E.g., team data or player data. Right now the best we'd be able to get is the FanGraphs integer id. We may have to set up our own internal mapping for teams. There are some other sources we could tap into, such as one of these for players:
http://crunchtimebaseball.com/baseball_map.html
https://www.smartfantasybaseball.com/tools/
Let me know what you folks think. Not trying to rock the boat, just trying to take this thing to the next level.
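The caching wrapper described in suggestion 3 could be sketched like this (the function and parameter names are made up for illustration, not part of pybaseball's API):

```python
import os
import pandas as pd

def cached_call(fetch, cache_path):
    """Return a DataFrame from a local CSV cache, fetching and saving on a miss."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = fetch()              # only hit the remote source when the cache is cold
    df.to_csv(cache_path, index=False)
    return df

# Usage (hypothetical):
# df = cached_call(lambda: batting_stats(2019), 'batting_stats_2019.csv')
```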
Hey, when I do pitching_stats I get tons of data. The only stat I can see that is missing is Quality Starts. Do you know how I can get that stat? I see it on the Baseball Reference page for pitchers, but am not sure how to get it via the tool.
Thank you for your time.
Apologies for another issue. I'm getting the following error fairly consistently when using the playerid_lookup(last, first) function:
sys:1: DtypeWarning: Columns (8,9,10) have mixed types. Specify dtype option on import or set low_memory=False.
This issue seems to occur when reading the people.csv file into a dataframe. The dtypes are inconsistent for some column(s) within that file and I'm not sure where.
There are two fixes and I wanted to see which is preferred. The first option is to specify the dtypes for each column in the file so that there is no guessing involved, and/or to read the entire file into memory before allocating and assigning a dataframe. The second option is to set the low_memory argument to pandas.read_csv() to False (the default is True). I'm finding conflicting statements about the argument, some saying it is deprecated and others saying it isn't. If it is deprecated, it will just suppress the warning; however, if it isn't, it will use more memory in order to read all values of a column into memory to validate a type.
Is the larger memory footprint at runtime satisfactory or should the dtypes be specified for all the columns in the Chadwick Register?
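The first option in miniature: give read_csv explicit dtypes so it never has to guess. The columns below are illustrative, not the actual register schema; 'Int64' (capital I) is pandas' nullable integer dtype, which tolerates missing values:

```python
import io
import pandas as pd

csv_text = "key_person,birth_year,key_mlbam\nperson1,1990,545361\nperson2,,\n"

# With explicit dtypes, mixed-type inference (and the DtypeWarning) never happens.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'key_person': str, 'birth_year': 'Int64', 'key_mlbam': 'Int64'},
)
```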
When using statcast_pitcher, there is a column "spin_rate_deprecated". There is also "spin_dir", "break_angle_deprecated", and "break_length_deprecated", all of which appear to be "NaN" for 2019 stats at the very least. Is there any way to access spin rate/spin direction/etc. for 2019 since these columns are now deprecated?
Is it possible to filter by a minimum number of pitches/events? For example I want to get every fastball for a certain time period thrown by pitchers who threw at least "x" amount of fastballs in that time period.
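As far as I know this isn't built into pybaseball, but the filtering can be done on the returned frame. A sketch: 'pitch_type' and 'pitcher' follow the Statcast column names, and the threshold and helper name are arbitrary:

```python
import pandas as pd

def filter_min_pitches(data, pitch_type='FF', min_count=100):
    """Keep rows for pitchers who threw at least min_count pitches of pitch_type."""
    pitches = data[data['pitch_type'] == pitch_type]
    counts = pitches.groupby('pitcher').size()          # pitches of this type per pitcher
    keep = counts[counts >= min_count].index
    return pitches[pitches['pitcher'].isin(keep)]
```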
Hi there,
I first want to say thank you for your awesome work in putting together pybaseball. I was playing around with the package yesterday and ran the fivethirtyeight New Science of Hitting example with no issues. Today I was running some statcast and statcast_batter queries in much the same way as in the example; however, I run into errors every time.
I now receive the following error when attempting to print the shape of the initial dataset in the New Science of Hitting example: ValueError: could not convert string to float: 'Sinker'.
When running statcast_batter with an end date as in the example, I receive an Error: Query Timeout message. However, when statcast_batter is run without an end date, it returns without an error, but the player_id is ignored in the returned data set.
I have attached an image to better illustrate what I'm experiencing (run here with David Ortiz's player_id as in the statcast_batter example).
I have run this using both Python 2.7.10 and Python 3.6.1, with the same behaviour for both.
EDIT: I have tried manually scraping the data using the technique set out by @alanrkessler and baseball savant is returning a 'Error: Query Timeout. Please try to limit your query to less data.' as the singular field of the csv when you attempt to scrape the data.
Getting "Error: Query Timeout. Please try to limit your query to less data" on some of the statcast() queries, but having trouble reproducing it reliably.
This is probably a bit of an arduous task but I think a quite valuable one. Fangraphs has a collection of the most valuable and in-depth stats and having this sort of granularity would be invaluable. Would even help out with some of the annoying stuff if some of the other contributors are on board.
The pitching_stats command returns a KeyError on "K-BB%".
Are there any plans to add WAR or any stats from the Player Value tables on Baseball Reference to pitching_stats_bref(season)?
I was looking to find the largest difference between bWAR and fWAR for pitchers, but I am unable to without a WAR column in the dataframe returned by pitching_stats_bref(season). Were there issues in obtaining that data, or was it just never implemented?
Hi James. First, thank you! I'm playing with your package to learn more python and matplotlib.
However, I'm not getting current season information with standings, schedule_and_record and batting_stats_range.
Everything works fine with previous seasons though.
Thanks!
Really like this library, but one thing I don't get is why the Lahman DB needs to be re-downloaded every time you try and use a function interfacing with the Lahman DB.
I'm proposing something like this:
def get_lahman_zip():
    if os.path.exists(base_string):
        z = None
    else:
        s = requests.get(url, stream=True)
        z = zipfile.ZipFile(BytesIO(s.content))
    return z
And then all Lahman interfacing functions can be edited like so:
def parks():
    z = get_lahman_zip()
    f = os.path.join(base_string, "Parks.csv")
    data = pd.read_csv(f if z is None else z.open(f), header=0, sep=',', quotechar="'")
    return data
This way you only have to call download_lahman once, and every subsequent call to parks() will just use the downloaded DB.
This probably isn't the most elegant way to do it, but I think something like this would be a good idea.
Happy to discuss, do the changes myself and file a pull request!
Currently all numeric columns in the Statcast data are coerced to a float data type. This happens in the postprocessing function in statcast.py.
numeric_cols = ['release_speed', 'release_pos_x', 'release_pos_z', 'batter', 'pitcher', 'zone', 'hit_location', 'balls',
                'strikes', 'game_year', 'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b', 'outs_when_up', 'inning',
                'hc_x', 'hc_y', 'fielder_2', 'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
                'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed', 'release_spin_rate', 'release_extension',
                'game_pk', 'pitcher.1', 'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5',
                'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
                'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value',
                'iso_value', 'launch_speed_angle', 'at_bat_number', 'pitch_number', 'home_score', 'away_score', 'bat_score',
                'fld_score', 'post_away_score', 'post_home_score', 'post_bat_score', 'post_fld_score']
data[numeric_cols] = data[numeric_cols].astype(float)
Many of those numeric columns always contain integer values (balls, strikes, outs_when_up, etc.). These columns should be coerced to an int data type.
Perhaps more importantly, many of the other columns are ID values (batter, pitcher, game_pk, etc.). It would make more sense to coerce these columns to a string data type (or int would also be better than float).
I would recommend the following changes in the postprocessing
function:
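For instance, a sketch of how the coercion might be split up (the column groupings below are illustrative, not the full Statcast schema):

```python
import pandas as pd

def coerce_statcast_dtypes(data):
    """Coerce Statcast columns to finer-grained dtypes than plain float."""
    float_cols = ['release_speed', 'pfx_x', 'pfx_z']
    int_cols = ['balls', 'strikes', 'outs_when_up']
    id_cols = ['batter', 'pitcher', 'game_pk']

    data[float_cols] = data[float_cols].astype(float)
    # 'Int64' (capital I) is pandas' nullable integer dtype; unlike int it tolerates NaN.
    data[int_cols] = data[int_cols].astype('Int64')
    # IDs are labels, not quantities, so string is the safest representation.
    data[id_cols] = data[id_cols].astype(str)
    return data
```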
Receiving the error below when trying to get batting stats based on a date range, using the code below. Can anyone provide any help with this?
CODE
from pybaseball import batting_stats_range
from pybaseball import pitching_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
data.head()
IndexError                                Traceback (most recent call last)
in <module>
----> 1 data = batting_stats_range('2017-05-01', '2017-05-08')
      2 data.head()

~/opt/anaconda3/lib/python3.7/site-packages/pybaseball/league_batting_stats.py in batting_stats_range(start_dt, end_dt)
     79     # retrieve html from baseball reference
     80     soup = get_soup(start_dt, end_dt)
---> 81     table = get_table(soup)
     82     table = table.dropna(how='all')  # drop if all columns are NA
     83     # scraped data is initially in string format.

~/opt/anaconda3/lib/python3.7/site-packages/pybaseball/league_batting_stats.py in get_table(soup)
     49
     50 def get_table(soup):
---> 51     table = soup.find_all('table')[0]
     52     data = []
     53     headings = [th.get_text() for th in table.find("tr").find_all("th")][1:]

IndexError: list index out of range
Baseball Savant changed some column names, which is currently breaking the statcast
function. Bill Petti gives some details here.
I'll look into the mapping later and try to get a quick deploy out.
https://baseballsavant.mlb.com/csv-docs
Some items don't match; e.g., hit distance is hit_distance_sc in the data but just hit_distance in the docs. Launch angle is launch_speed_angle; I don't see the point of this, and if there is one, full documentation needs to be made.
Also, quick help: can someone show me where I can find whether a strike was called or swung on?
from pybaseball import team_pitching
import pandas as pd
pd.set_option('display.max_columns', None)
data = team_pitching_bref('NYY', 2019)
print(data)
NameError: name 'team_pitching_bref' is not defined
So in a recent PR I tried to bring in some formatting changes (not necessarily on purpose - mostly because it was my first PR, and I always have some sort of auto PEP 8 formatter on).
This led to quite a few unrelated code changes and some understandable concern on @schorrm's part (especially for some of the choices that were made by the formatter).
However, I think (and I believe @schorrm agrees to some extent) that adding some code style standards could be fruitful, and if we can all coalesce around a shared tool and config to keep it painless, so much the better! The goal of the style guide would be to make the code more readable and internally consistent.
So I'd like to use this issue to discuss what some participants like in a style guide, don't like in a style guide, or are apathetic to.
I'll begin with a few of mine.
# Technically legal
cols = [col.replace('*', '').replace('#', '') for col in cols]

# More readable in my opinion
cols = [
    col.replace('*', '').replace('#', '') for col in cols
]

# For extra long lines I'd even break it this way as well
cols = [
    col.replace('*', '').replace('#', '').extraLongFunctionGoesHereToTakeUpRoom()
    for col in cols
]
my_string = ("Pretend this string goes on for something like 120 characters... "
             "The rest of the string goes here.")
data = fangraphs.get_fangraphs_tabular_data_from_url(
    _FG_TEAM_PITCHING_URL.format(
        start_season=start_season,
        end_season=end_season,
        league=league,
        ind=ind,
    )
)
def team_pitching(start_season: int = None):
    for season in range(start_season, start_season + 1):
        pass

def team_pitching(
    start_season: int,
    end_season: int = None,
    league: str = 'all',
    ind: int = 1,
):
I also really prefer when the code gets a pylint score of 10.0, but there are some linting failures I don't flip out about (like docstrings on modules).
When possible, I prefer type hinting in function params and returns so MyPy can help catch misuse before runtime.
I would like to eliminate all print statements if possible. Print statements are an uncontrolled side effect for anyone using the library downstream. Instead we should use the logging library and give the user some control over where the logs go:
https://docs.python.org/3/howto/logging.html#logging-basic-tutorial
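The print-to-logging move could look roughly like this (a sketch; the function name and signature are illustrative, not pybaseball's actual API):

```python
import logging

# Library code gets a named module-level logger instead of print();
# by default the messages are invisible unless the downstream user
# configures logging, so the side effect is under their control.
logger = logging.getLogger('pybaseball')

def fetch_chunk(start_dt, end_dt):
    logger.info('Fetching statcast data from %s to %s', start_dt, end_dt)
    # ... perform the request ...

# A downstream user opts in to seeing the messages with, e.g.:
# logging.basicConfig(level=logging.INFO)
```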
I'm thinking this package's next addition should be the Lahman database from http://seanlahman.com/. I'm going to add some starter code for this and use this for tracking any issues that arise.
I saw the post of the previous seasons' game id CSV file.
I was wondering if there is a function to scrape today's game_pks? It's sort of hard to use the statcast function without knowing the game_pks.
Tossing this out as a question before I attempt to implement it...
Is one of the numerous columns returned by batting_stats() the platoon split? Thanks!
Seamheads has the best Negro League data. By far.
I've been running batting_stats_range and pitching_stats_range once a day for the last 3 months, with the range parameters set to the current day.
I was getting data for both every day until July 27th, 2020; after that date I've been getting this error every day. Nothing changed on my side, so I assume it's an issue with pybaseball or one of its data providers.
Is anyone else experiencing the same issue?
I've noticed this a few times throughout this repo but is anyone else having this issue?
from pybaseball import team_pitching
team_pitching(1999)
AttributeError: 'NoneType' object has no attribute 'find_all'
Issue is coming from the get_table(soup, ind) function
I've been developing some projects, and one of my main pet peeves (let me be clear, it isn't major) is not having a good description of the tables and of the columns they contain.
I wonder if other people would appreciate having descriptions of the tables centralized in the documentation of this repo?
Basically more documentation for function outputs.
What do you guys think?
It's been a really long time since any pull requests have gotten dealt with. Would it be possible to get some more maintainers here? I'm extremely thankful for what you've done, and of course, I understand you have a real job and stuff, but I don't. I'm a 4th year CompSci student with some time to kill and I'd love to help out as a maintainer, and I'm sure there are other contributors here who would also be happy to help maintain this package.
Thank you
I'm adding @schorrm as a collaborator for both this repo and its associated PyPI project. Moshe has been active in using and improving the package both here and in his fork. Having a more active maintainer and keeping most users under a single PyPI installation will be good for the quality and stability of the project. I'm excited to have him on board!
Continuing a discussion that was started in issue #20, I think one of the ways we can solve the issues users are having when they try to run statcast_pitcher
and statcast_batter
queries over periods longer than about two months would be to break those queries up.
My general approach to this would be to first determine if a query is longer than some arbitrary maximum (probably around 60 days), then use some of the features in Python's datetime package to iterate over the user specified period in chunks calling the necessary function each time. This will result in a list of DataFrames returned by each function call and which can be bound together.
A secondary goal would be to run the queries in parallel, but I think that can wait until after this initial work is done.
I'm happy to work on adding this feature and would love to hear any feedback/ideas from others.
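The chunking step described above could be sketched like this (the helper name is made up; the 60-day default mirrors the arbitrary maximum mentioned):

```python
from datetime import date, timedelta

def date_chunks(start, end, max_days=60):
    """Split the inclusive range [start, end] into consecutive chunks of at most max_days days."""
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=max_days - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks

# Each (start, end) pair would then be passed to statcast_pitcher or
# statcast_batter, and the resulting list of DataFrames bound together
# with pd.concat.
```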
The current playerid_lookup() function finds player ids, taking name as input. Something doing the opposite would be useful, taking an id from a statcast query as input and returning the player's name. The option to do this in bulk would be even better.
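A sketch of what such a reverse lookup might look like. Here `register` stands in for the Chadwick register DataFrame that playerid_lookup() already downloads, and the column names (key_mlbam, name_first, name_last) follow that register; the function name is hypothetical:

```python
import pandas as pd

def playerid_reverse_lookup(mlbam_ids, register):
    """Return name rows for a list of MLBAM ids, supporting bulk lookups."""
    hits = register[register['key_mlbam'].isin(mlbam_ids)]
    return hits[['key_mlbam', 'name_first', 'name_last']]
```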
Did the statcast csv change its date format? Someone tagged me in this on Twitter https://twitter.com/ckurcon/status/1301913190465507328
I get: ValueError: time data "2020-09-03T00:00:00.000Z" doesn't match format specified
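If the savant CSV did switch to ISO-8601 timestamps, a tolerant parse on the consumer side might look like this (a sketch; pybaseball's internal parsing may differ):

```python
import pandas as pd

# pd.to_datetime understands the ISO-8601 'Z' (UTC) suffix without an
# explicit format string, unlike a hard-coded strptime format.
ts = pd.to_datetime("2020-09-03T00:00:00.000Z")
game_date = ts.date()
```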
from pybaseball import team_batting
team_batting(2016)
My error returns this:
team_batting(2016)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    team_batting(2016)
  File "/opt/anaconda3/lib/python3.7/site-packages/pybaseball/team_batting.py", line 76, in team_batting
    table = get_table(soup, ind)
  File "/opt/anaconda3/lib/python3.7/site-packages/pybaseball/team_batting.py", line 26, in get_table
    rows = table_body.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
quick question - is there an easy way to distinguish every at-bat (like an at bat id) when using the statcast batter package?
It may be a difficult feature to add, but I would like to see the addition of player position appearances added. I realize this data is recorded in the Lahman Database, but that does not include the current season appearances. This information could be valuable in determining differences between positions, position scarcity, and other important areas of analysis.