double-dose-larry / sportsref

A pleasant Python interface to sports-reference.com websites

License: MIT License

Python 9.66% Jupyter Notebook 90.34%

baseball-reference football-reference basketball-reference

sportsref's Introduction

sportsref

easily pull stats from sports-reference web sites

sportsref is designed to be used in an interactive Python environment, such as IPython or a Jupyter notebook.

The API tries to mirror the web experience:

- each subject area (e.g. player, season, league) is represented by a class.
- each class has methods representing the pages available.
  - for example, the Ozzie Albies player page has a menu of sub-pages, and each one maps to a method
  - if the menu is a dropdown, the method takes an additional parameter or two
- the methods return a Page object, which knows about all the tables on that page
- use Page.get_df("table_name") to get a pandas.DataFrame of the table you want.

The examples.ipynb Jupyter notebook has a few examples demonstrating a workflow.

Install

clone the repo then

pip install .

sportsref's Issues

add a delay to calls touching bref

to conform with Terms & Conditions of the website.

Specifically:

Except as specifically provided in this paragraph, you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read, view, and submit materials.

I suppose waiting at least half a second is good enough. The numberize_df stuff usually takes longer than that already, but I just want to make sure.
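A minimal sketch of how such a delay could be enforced, assuming every request goes through one function. The names `throttle` and `fetch` are hypothetical and not part of sportsref; the half-second interval is the figure suggested above.

```python
import time
from functools import wraps


def throttle(min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Make sure at least `min_interval` seconds pass between calls.

    `clock` and `sleep` are injectable so the behaviour can be tested
    without actually waiting.
    """
    def decorator(func):
        last_call = None  # time of the previous call, None before the first

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_call
            if last_call is not None:
                remaining = min_interval - (clock() - last_call)
                if remaining > 0:
                    sleep(remaining)  # wait out the rest of the interval
            last_call = clock()
            return func(*args, **kwargs)

        return wrapper
    return decorator


@throttle(min_interval=0.5)
def fetch(url):
    # Hypothetical stand-in for the real HTTP request; returns the URL unchanged.
    return url
```

Because the wait is measured from the previous call, time already spent in slow post-processing counts toward the interval and no extra sleep is added in that case.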

add play index matchups vs. pitchers for players

add vs_pitcher splits to the Player object

sample url below, only requires a key

https://www.baseball-reference.com/play-index/batter_vs_pitcher.cgi?batter=choiji01

add coaching staff to team_season

it's on the team_season page in div_coaches

add batting line ups to team_seasons

https://www.baseball-reference.com/teams/NYM/2019-batting-orders.shtml

move all of the url building logic into the convert_url function

There's no reason for all these url strings to be built all over the place.

The logic is quite simple:
in goes:

web page url
name of the div that contains the table you want
possibly a dictionary that will be translated to a url query
out comes:
a fully quoted embed url.
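The in/out description above could look roughly like this. It is a sketch only: the `EMBED_BASE` endpoint and the extra `site` argument are assumptions, chosen to match the `css`/`site`/`url`/`div` keys used elsewhere in these notes, and this is not the actual convert_url in the repo.

```python
from urllib.parse import urlencode, urlsplit

# Assumed embed endpoint; check it against what the library really uses.
EMBED_BASE = "https://widgets.sports-reference.com/wg.fcgi"


def convert_url(page_url, div, site, query=None):
    """Turn a web page url, a div name and an optional query dict into an embed url."""
    parts = urlsplit(page_url)
    path = parts.path
    # An explicit query dict wins over any query already present on page_url.
    page_query = urlencode(query) if query else parts.query
    if page_query:
        path = f"{path}?{page_query}"
    params = {"css": 1, "site": site, "url": path, "div": div}
    # urlencode percent-quotes the slashes, '?' and '=' inside the url value.
    return f"{EMBED_BASE}?{urlencode(params)}"


print(convert_url(
    "https://www.pro-football-reference.com/players/F/FarvBr00.htm",
    "div_passing",
    site="pfr",
))
```

Keeping all the quoting in one place means the callers only ever deal with plain page urls and div names.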

add league wide stats

maybe have a league object.
league can be MLB, or AL or NL

Then we can have league_season object that holds things like

Standings
For batting/pitching/fielding:
- standard tables
- value tables
- there's a whole bunch of other stuff in here
misc
- attendance

that's it for now, much more in there

Proper Tests

write proper tests, maybe with pytest.

get rid of the jupyter notebooks

usage guide

write docs, or update README to show folks how to use this.

Maybe supply a couple of jupyter notebooks. I like those.

generalize the parsing and url construction to work with all sports reference websites

Basic logic is this:

First we start with the sport module, for example:

from py_sportsref.football import Player

I guess that means we'll have to rename the library to a more general name like py_sportsref

then we'll build up the parts of the url in a dictionary:

my_dict = {
    'css': 1,
    'site': 'pfr',
    'url' : '/players/F/FarvBr00.htm',
    'div': 'div_passing'
}

we still care about the web url because we'll need to parse the valid divs on that page. we can construct it from known base locations. we can use the urllib.parse library to work with the pieces, and html.parser to quickly stream through the html and pick out the ids of divs with a 'stats_table' class

we pass this url to an enumerate_table_divs() function, which returns the ids of all the stats_table divs. these ids will be the valid table types
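A rough sketch of the div enumeration using html.parser, as suggested above. To stay runnable offline it takes the page html as a string rather than a url, and the helper class name `_TableDivCollector` is made up for the example.

```python
from html.parser import HTMLParser


class _TableDivCollector(HTMLParser):
    """Collect the ids of <div> tags whose class list includes 'stats_table'."""

    def __init__(self):
        super().__init__()
        self.div_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr_map = dict(attrs)
        # The class attribute may be missing or hold several class names.
        classes = (attr_map.get("class") or "").split()
        div_id = attr_map.get("id")
        if "stats_table" in classes and div_id:
            self.div_ids.append(div_id)


def enumerate_table_divs(html):
    """Return the ids of all stats_table divs, in document order."""
    collector = _TableDivCollector()
    collector.feed(html)
    collector.close()
    return collector.div_ids


sample = (
    '<div id="div_passing" class="table_container stats_table"></div>'
    '<div id="nav" class="menu"></div>'
)
print(enumerate_table_divs(sample))
```

One thing to verify against real pages: HTMLParser does not look inside html comments, so any table markup wrapped in a comment would need to be fed to the parser separately.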

Cloudfront CSV

Didn't see a way of contacting you, so I thought I'd do it through here. I found your code while Google searching some of the domains I had found, and came across your util.py
Was particularly looking at
csv_url = f"https://{cdn}.cloudfront.net/short/inc/{player_or_team}s_search_list.csv"

What do you mean by {player_or_team}? Would I replace this with its unique id?
I've found some links such as
https://d6rt22vwfyr3i.cloudfront.net/short/inc/players_search_list.csv
and
https://d6rt22vwfyr3i.cloudfront.net/short/inc/clubs_search_list.csv

But if I wanted info on a player doing
https://d6rt22vwfyr3i.cloudfront.net/short/inc/19538871_search_list.csv
Does not load

I was wondering what that line means, thanks
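Judging only from the quoted line and the two links that do load, {player_or_team} looks like an f-string placeholder for a word such as "player" or "club" rather than a unique id; the trailing "s" in the template then makes it plural. A small sketch of that reading, where the function name `search_list_url` is made up for illustration:

```python
def search_list_url(cdn, player_or_team):
    # Same template as the quoted line: the placeholder is filled with a word,
    # and the literal "s" after it turns "player" into "players".
    return f"https://{cdn}.cloudfront.net/short/inc/{player_or_team}s_search_list.csv"


print(search_list_url("d6rt22vwfyr3i", "player"))
print(search_list_url("d6rt22vwfyr3i", "club"))
```

Under that reading, the two calls reproduce the players and clubs links listed above, and a numeric id would not be expected to produce a valid file name.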

add play index matchups vs batters for players

https://www.baseball-reference.com/play-index/batter_vs_pitcher.cgi?pitcher=degroja01

maybe combine with vs_pitchers into one function.

something like

Player("Emilio Pagan").vs("b")
Player("Tommy Pham").vs("p")

could possibly use the Player().pit_or_bat_default to infer the default, so the code could just be

Player("Austin Meadows").vs() # returns vs_pitchers dataframe
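A sketch of the url side of a combined vs() call, based on the two sample urls in these issues (batter=... for matchups vs. pitchers, pitcher=... for matchups vs. batters). `vs_url`, `BVP_URL` and `default_versus` are hypothetical names; `default_versus` stands in for whatever Player().pit_or_bat_default would supply.

```python
from urllib.parse import urlencode

BVP_URL = "https://www.baseball-reference.com/play-index/batter_vs_pitcher.cgi"


def vs_url(player_key, versus=None, default_versus="p"):
    """Build the matchup url for a player key.

    versus="p": the player is a batter, show results vs. pitchers.
    versus="b": the player is a pitcher, show results vs. batters.
    """
    versus = versus or default_versus
    if versus == "p":
        param = "batter"
    elif versus == "b":
        param = "pitcher"
    else:
        raise ValueError(f"versus must be 'p' or 'b', got {versus!r}")
    return f"{BVP_URL}?{urlencode({param: player_key})}"


print(vs_url("choiji01", "p"))
print(vs_url("degroja01", "b"))
```

With a default in place, calling vs_url with only the key covers the no-argument Player(...).vs() case.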