
boxball's People

Contributors

brayanmnz, dependabot[bot], droher, zou000


boxball's Issues

Automatically trigger build when new data is published

There are a few manual tasks I need to do in order to update Boxball when Retrosheet publishes new data. While they don't take very long, they're still a pain and ended up resulting in months of extra delay getting the latest data out. As an alternative:

  • Set up a listener for changes in the latest commit hashes of baseballdatabank and retrosheet repos (and/or new releases)
  • Automatically create a new branch/release draft, update the .env file with the new information, put up a PR, build images, and upload data files to OneDrive
  • When PR is merged, push the images with their appropriate release labels
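The trigger logic in the first step can be sketched as a simple hash comparison. This is a hypothetical sketch, not boxball's actual automation; the function and field names are illustrative:

```python
# Hypothetical sketch: decide whether a rebuild is needed by comparing the
# commit hashes pinned in the .env file against the latest upstream hashes.
# Function and dict layouts are illustrative, not from the boxball codebase.

def needs_rebuild(pinned: dict, latest: dict) -> list:
    """Return the repos whose upstream hash differs from the pinned one."""
    return [repo for repo, sha in latest.items() if pinned.get(repo) != sha]

pinned = {"baseballdatabank": "abc123", "retrosheet": "def456"}
latest = {"baseballdatabank": "abc123", "retrosheet": "789fff"}
print(needs_rebuild(pinned, latest))  # only retrosheet changed upstream
```

A listener (or a scheduled job) would fetch the latest hashes, and a non-empty result would kick off the branch/PR/build steps above.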

Permission error in build script

Describe the bug
Can't get the db to build.

To Reproduce
docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" doublewick/boxball:postgres-cstore-fdw-latest

Expected behavior
I want the built database to be inside the container, but when I take out the -v flag on the docker run command I'm hit with a "directory not empty" error. I expect the postgres db to exist in /var/lib/postgresql/data inside the container.

Error Message

initdb: error: directory "/var/lib/postgresql/data" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/lib/postgresql/data" or run initdb
with an argument other than "/var/lib/postgresql/data".

Desktop (please complete the following information):
Docker Desktop on Windows 10

Billy Herman 1934 allstar_full_pkey unique violation in 2020 postgres and mysql images, value in parks.csv exceeds varchar(45) in 2020 postgres_cstore image

Tested a few of the new images and three of the containers were exiting shortly after launching. Removed the image and cleared the cache for each of these three and retried, but got the same error codes. I outputted their logs (see below) and saw these issues. I only managed to try postgres, postgres_cstore, mysql, and clickhouse. Clickhouse worked with no issues. Docker version 19.03.8, build afacb8b, running on CentOS Linux release 7.8.2003 (Core). These could probably be separated into two different issues, but I'm grouping them together because they're both a result of the 2020 updates.

Postgres Log
2020-05-02 05:15:21.721 UTC [42] ERROR: duplicate key value violates unique constraint "allstar_full_pkey"
2020-05-02 05:15:21.721 UTC [42] DETAIL: Key (player_id, year_id, game_num)=(hermabi01, 1934, 0) already exists.
2020-05-02 05:15:21.721 UTC [42] CONTEXT: COPY allstar_full, line 69
2020-05-02 05:15:21.721 UTC [42] STATEMENT: COPY baseballdatabank.allstar_full(player_id, year_id, game_num, game_id, team_id, lg_id, gp, starting_pos) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/allstar_full.csv.zst' WITH (FORMAT CSV, FORCE_NULL(player_id, year_id, game_num, game_id, team_id, lg_id, gp, starting_pos));
psql:/docker-entrypoint-initdb.d/postgres.sql:535: ERROR: duplicate key value violates unique constraint "allstar_full_pkey"
DETAIL: Key (player_id, year_id, game_num)=(hermabi01, 1934, 0) already exists.
CONTEXT: COPY allstar_full, line 69

MySQL Log
/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/mysql.sql
ERROR 1062 (23000) at line 536: Duplicate entry 'hermabi01-1934-0' for key 'PRIMARY'

Note: An issue exists on baseballdatabank for this Billy Herman record. He played twice in the game. 🤷‍♂️
https://github.com/chadwickbureau/baseballdatabank/issues/104

Postgres_cstore Log
/data/baseballdatabank/parks.csv.zst: 11602 bytes
2020-05-02 05:23:59.954 UTC [40] ERROR: value too long for type character varying(45)
2020-05-02 05:23:59.954 UTC [40] STATEMENT: COPY baseballdatabank_parks(park_id, park_name, park_alias, city, state, country) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/parks.csv.zst' WITH (FORMAT CSV, FORCE_NULL(park_id, park_name, park_alias, city, state, country));
psql:/docker-entrypoint-initdb.d/postgres_cstore_fdw.sql:527: ERROR: value too long for type character varying(45)

Inconsistencies in Parquet download files

Describe the bug
I downloaded the parquet files directly from the OneDrive link included in the repo's readme, and have been reading them with pyarrow and pandas. In digging through the data (2022 only, so far), I discovered two problems, one of which led me to the other.

  1. First (and maybe more of a feature request than a bug) is the fact that, best I can tell, there is no easy way to tell whether games are regular season or postseason/allstar/other from any of the parquet files (I would particularly expect game.parquet, gamelog.parquet, or schedule.parquet to have an indicator column for this, but I do not see one).

  2. Second appears to be more of a bug. Contrary to no. 1 above, schedule.parquet only seems to include regular season games. Great, we can simply filter game.parquet and other files by whether or not the game exists in schedule.parquet, right? Nope: schedule.parquet seems to list games that never actually occurred. As an example, schedule.parquet includes a game MIL @ CHN on 2022-04-08, not part of a doubleheader. However, game.parquet (as well as baseball-reference and other sources) tells us that no such game exists! MIL @ CHN games occurred on 4/7/22 and 4/9/22, but not 4/8/22. I find 88 of these 'phantom games' in schedule.parquet for 2022. And it's not a byproduct of the pandemic or lockout; I found these inconsistencies in every year I've looked at as far back as 2000.

To Reproduce
Steps to reproduce the behavior:

  1. Compare schedule.parquet and game.parquet as described above.
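The comparison amounts to a set difference on game keys. A minimal sketch with in-memory stand-ins for the two Parquet files (in practice you would load them with pyarrow or pandas; the key columns are assumptions):

```python
# Stand-ins for schedule.parquet and game.parquet, keyed on
# (date, away_team, home_team). Column names are assumptions.

schedule = {
    ("2022-04-07", "MIL", "CHN"),
    ("2022-04-08", "MIL", "CHN"),  # the 'phantom' game described above
    ("2022-04-09", "MIL", "CHN"),
}
games = {  # games that actually occurred
    ("2022-04-07", "MIL", "CHN"),
    ("2022-04-09", "MIL", "CHN"),
}

# Scheduled games with no matching played game.
phantom = schedule - games
print(sorted(phantom))
```

Running the same set difference over the real files for a season surfaces every scheduled-but-never-played key at once.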

Expected behavior
Games will be consistent across files, and a column easily delineates whether games are regular season or not.

I am entirely open to the fact that the files are exactly as intended and I am just missing something that explains the discrepancies, let me know if that is the case. Thanks!

Update data to the Summer 2023 Release

Hi,

Tom Thress recently announced the availability of the first semi-annual Summer 2023 Release.

https://www.retrosheet.org/june2023release.html

I was wondering if boxball can be updated to reflect the changes. Mainly, I'm interested in the new optional field, gametype, that's now included. It's been a real pain point for me to figure out the game type based on game dates and whatnot.

A new info record type was introduced in 2023 describing the type of game.

info,gametype,regular

The possible types are:
regular
exhibition
preseason
allstar
playoff
worldseries
lcs
divisionseries
wildcard
championship
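Ingesting the new record is a matter of splitting the comma-delimited info line. A small sketch, with the allowed values taken from the list above (the parser itself is illustrative, not boxball code):

```python
# Allowed gametype values per the 2023 Retrosheet release notes quoted above.
GAME_TYPES = {
    "regular", "exhibition", "preseason", "allstar", "playoff",
    "worldseries", "lcs", "divisionseries", "wildcard", "championship",
}

def parse_info(line: str):
    """Parse an 'info,<key>,<value>' event-file record into (key, value)."""
    record, key, value = line.strip().split(",", 2)
    if record != "info":
        raise ValueError(f"not an info record: {line!r}")
    return key, value

key, value = parse_info("info,gametype,regular")
print(key, value, value in GAME_TYPES)
```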

Thank you so much for putting this project together.

Update with 2021 Season data

I believe baseball-databank has come out with data files that now include 2021 season data.

Is it possible to include those?

SQLite image is not working

Hi,
I found that the sqlite image is not working when I follow the usage in the README:
docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-0.0.2

The container will exit immediately and the error message shows:

$ docker logs sqlite 
zstd: can't stat /db/boxball.db.zst : No such file or directory -- ignored

I also tried not mounting the data folder and hit another problem: the container doesn't print any messages and exits after a while. The weird thing is that the container will still be displayed as running if you use docker ps to check its status.

$ docker ps
CONTAINER ID        IMAGE                             COMMAND                  CREATED              STATUS              PORTS                               NAMES
8618c27cc60c        doublewick/boxball:sqlite-0.0.2   "/bin/sh -c 'zstd --…"   About a minute ago   Up About a minute   0.0.0.0:8080->8080/tcp              sqlite

But it's actually exited if you use docker inspect sqlite to display the status:

$ docker inspect sqlite
[
    {
        "Id": "8618c27cc60c2d227977a29ba7942b5d63e22daa227a376b079858523a9345ea",
        "Created": "2019-06-14T05:41:52.886829683Z",
        "Path": "/bin/sh",
        "Args": [
            "-c",
            "zstd --rm -d /db/boxball.db.zst && sqlite_web -H 0.0.0.0 -x /db/boxball.db"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 73,
            "Error": "",
            "StartedAt": "2019-06-14T05:41:53.700159123Z",
            "FinishedAt": "2019-06-14T05:42:23.108049305Z"
        },

I have no idea why this is happening; my guess is that it's because the data is too large.

Missing Retrosheet player data

Describe the bug
When analyzing RetroSheet data, there is no way to join to player information.

Additional context
Retrosheet offers a biofile.csv data source that does not come in from the upstream Chadwick repository. This data would be super helpful to have as a table within the retrosheet schema.

Reference: https://www.retrosheet.org/biofile.htm
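Once ingested as a table, the missing join is straightforward. An illustrative sketch with in-memory stand-ins (the dict layout and field names are assumptions, not the actual retrosheet schema):

```python
# Illustrative sketch: attach player names from a biofile-style lookup to
# event-level rows via player_id. Field names and ID format are assumptions
# for illustration only.

biofile = {"hermabi01": {"first": "Billy", "last": "Herman"}}
events = [{"player_id": "hermabi01", "year": 1934}]

# Left-join each event row against the biofile on player_id; rows with no
# biofile match simply keep their original fields.
joined = [{**row, **biofile.get(row["player_id"], {})} for row in events]
print(joined[0]["first"], joined[0]["last"])
```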

MySQL Container fails while loading data

Describe the bug
Error when loading the MySQL tag

To Reproduce
Ran suggested MySQL start point
docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest

Specs
Ubuntu through WSL2 on Windows 10

Error Message
2022-11-23 17:18:07 ERROR 1262 (01000) at line 2063: Row 112998 was truncated; it contained more data than there were input columns

Looks to happen when loading the Retrosheet event CSV file.

Additional context
Full container logs: https://pastebin.com/bD53UHXE

retrosheet_daily table missing game.source

cwdaily outputs daily lines for each player, which include the source for the game information.
For games with multiple sources, there will be multiple daily entries for a given (player_id, game_id) tuple, and right now there's no column that can be used to disambiguate.

E.g. select * from retrosheet_daily where game_dt = '1943-06-19' and player_id = 'mackr101';

yields 5 rows for 2 games (two halves of a doubleheader), with each game having a box score & deduced game, according to https://raw.githubusercontent.com/chadwickbureau/retrosplits/master/daybyday/playing-1943.csv. I'm not sure why Mack in particular has 2 deduced game entries for CHA194306191, but that's probably an issue in chadwick.
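The fix is exactly what the title suggests: carry the source through so the key becomes unique. A small sketch of why that disambiguates (the source codes here are illustrative, not necessarily cwdaily's actual values):

```python
# Two daily rows for the same player and game, differing only by where the
# data came from. Source codes ("box", "ded") are illustrative assumptions.
rows = [
    {"player_id": "mackr101", "game_id": "CHA194306191", "source": "box"},
    {"player_id": "mackr101", "game_id": "CHA194306191", "source": "ded"},
]

# Without the source column the key collides; with it, rows stay distinct.
without_source = {(r["player_id"], r["game_id"]) for r in rows}
with_source = {(r["player_id"], r["game_id"], r["source"]) for r in rows}
print(len(without_source), len(with_source))
```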

Update Parquet Files

Hey...

This may be a weird thing to ask here, but can you update the parquet files with updated retrosheet data? They've been updated with 2019 events.

If you can, thanks.

Need help? Want new features/endpoints?

I'm looking to help make this data as accessible as possible for experienced researchers and novice data analysts alike -- please let me know if anything would help! This includes documentation, additional features, new data sources, or new load endpoints (including the ones listed in the other features so I can prioritize).

retrosheet_event table empty for postgres-cstore-fdw

Executing SELECT COUNT(*) FROM retrosheet_event; returns 0. I've tried the latest image and 2020.2.

To Reproduce
Steps to reproduce the behavior:

  1. docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-latest
  2. psql -U postgres -p 5433
  3. SELECT COUNT(*) FROM retrosheet_event;
  4. Observe that the output is zero.

Expected behavior
Is the event data still being included in the images?

Desktop (please complete the following information):

  • OS: Windows 10 w/ WSL 2-based docker
  • Version: doublewick/boxball:postgres-cstore-fdw-latest

Create markdown documentation for database schemas

Is your feature request related to a problem? Please describe.
Currently, the database schemas are only documented in code.

Describe the solution you'd like
Create a markdown page with a more human readable representation of the database structure

Describe alternatives you've considered
There are other more human readable formats that could be used, but Github markdown seems to be a natural fit.

Additional context
I have a script that generates an (admittedly basic) markdown page for the schemas in my fork's transform/src/schemas directory.
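The core of such a generator is small. A minimal sketch; the schema dict layout here is hypothetical, not the actual format used in transform/src/schemas:

```python
# Turn a (hypothetical) schema description into a GitHub-flavored markdown
# table, one table per heading.

def schema_to_markdown(table: str, columns: list) -> str:
    """Render (name, type, description) tuples as a markdown table."""
    lines = [
        f"### {table}",
        "",
        "| column | type | description |",
        "| --- | --- | --- |",
    ]
    lines += [f"| {name} | {typ} | {desc} |" for name, typ, desc in columns]
    return "\n".join(lines)

md = schema_to_markdown("allstar_full", [
    ("player_id", "varchar", "Player ID"),
    ("year_id", "int", "Season"),
])
print(md)
```

Running this over every schema definition and concatenating the results yields one markdown page per database.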

Separate data files from image

The data files are currently baked into the images for each database, which is not ideal because they're a single huge layer and Docker can't share it between images even though the data is the same. Instead, the data should be left out of each image, and downloaded when the container starts. The download should first go to some general shareable volume between images, so that it gets cached and only has to be downloaded once across multiple images.

As a prerequisite, I'll need to add the data upload to OneDrive/wherever as part of the build instead of something I do manually.
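The download-once behavior described above can be sketched as a cache check before fetching. This is a hypothetical sketch of the container entrypoint logic; `ensure_data` and the fake downloader are illustrative names, not boxball code:

```python
# Sketch: fetch a data file into a shared cache volume only when it is not
# already there, so multiple containers download it at most once.
import tempfile
from pathlib import Path

def ensure_data(cache_dir: Path, name: str, download) -> Path:
    """Return the cached file, downloading it only on the first request."""
    target = cache_dir / name
    if not target.exists():
        target.write_bytes(download(name))  # first container pays the cost
    return target

# Hypothetical downloader that records how many times it was invoked.
calls = []
def fake_download(name):
    calls.append(name)
    return b"data"

with tempfile.TemporaryDirectory() as d:
    cache = Path(d)
    ensure_data(cache, "boxball.db.zst", fake_download)
    ensure_data(cache, "boxball.db.zst", fake_download)
    print(len(calls))  # the second call hits the cache
```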

2020 Season Data

I saw that 2020 went up on Retrosheet.

Is it possible to update the docker images with that data included?

Row-based Postgres not populating

Describe the bug
Waited well over 90 minutes for the row-based postgres docker image to populate, and it seems like the data isn't loading into the postgres database. The data files exist in the boxball directory, but nothing shows up when I query the database.

I installed the sqlite image to see if the problem affected all the images, but it worked as expected, so it might just be that particular image that has a problem.

To Reproduce
Steps to reproduce the behavior:

  1. Follow instructions to start up row-based postgres docker image
  2. Leave up for 6 hours, and the data is not populated

Expected behavior
Data should populate in around 90 minutes

Screenshots

(two screenshots attached in the original issue)

Desktop:

  • OS: Fedora 39 (Workstation) x86_64
  • Kernel: 6.5.9-300.fc39.x86_64
  • Docker: 24.0.7
