
Baseball Workbench

Baseball Workbench is an upcoming Web-based tool for Sabermetrics research.

The software for Baseball Workbench will be open-source, and will integrate with freely available data sources.

The hosted tool will have associated infrastructure costs, but we may look to offset these with sponsors, ads, or fees at some point.

Release Date

Update: I haven't released a hosted Alpha version of Baseball Workbench yet, but I'll keep posting updates here! Feel free to follow along. Alpha access may be private to help control costs; if so, invite information will be posted to this README.

Get development updates

If you're a GitHub user, you can "Watch" this project to get updates.

You can also check out the content on the GitHub Wiki.

Contributing

There are many ways you can contribute to Baseball Workbench!

Check out the Issues tab here on GitHub for work that is planned.

If you are a programmer (or want to try your hand at it!), check out the various code throughout the project and contribute enhancements:

  • There is Python code in shared/, api/api.py, and worker/service.py
  • There is R code in worker/service.py (this will move into a separate file soon!)
  • Groovy is used in the worker/extract/download.groovy script (but may be changed to Python in the future for consistency)

If you know or want to learn AWS concepts, you can check out the infra folder, which uses AWS CloudFormation to define the hosted version of the app.

If you are interested or experienced with baseball datasets such as the Lahman Database and Retrosheet Game Logs, you can check out the data extraction script in worker/extract and the datasource metadata files in shared/btr3baseball/datasource. Both of these are fairly easy to follow, and we will need a lot of work in these areas to integrate with interesting datasets.

Finally, once the hosted app is live for Alpha, you can contribute by logging Issues in the Issues tab for bugs you find and features you would like to see.

Contact

If you have any questions about Baseball Workbench, you can contact Bryan at [email protected].


Issues

UI: Equation Editor

To support the definition of new statistics from existing columns in a dataset, Baseball Workbench should have an Equation Editor.

  • Users should be able to easily add references to columns from their datasource.
  • Users should be able to easily add references to custom columns previously defined.
  • Attempts to add references to non-existent columns should result in a client-side error.
  • Users should be able to use simple mathematical operators: Add, Subtract, Multiply, Divide
  • Attempts to use unsupported operators (or include any extraneous characters) should result in a client-side error.
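
Below is a minimal sketch of this validation logic, written in Python for illustration (the actual check would run client-side, presumably in JavaScript, and the tokenization rules here are an assumption about the eventual grammar):

    import re

    ALLOWED_OPERATORS = set("+-*/()")  # assumed: the four operators plus grouping

    def validate_expression(expr, known_columns):
        """Return a list of error messages; an empty list means the expression is valid."""
        errors = []
        # Split into identifiers, numeric literals, and single other characters.
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d+)?|\S", expr)
        for tok in tokens:
            if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", tok):
                if tok not in known_columns:
                    errors.append("Unknown column: " + tok)
            elif re.fullmatch(r"\d+(?:\.\d+)?", tok):
                continue  # numeric literal
            elif tok not in ALLOWED_OPERATORS:
                errors.append("Unsupported operator or character: " + tok)
        return errors

For example, validate_expression("(H + BB) / AB", {"H", "BB", "AB"}) returns an empty list, while a reference to a non-existent column produces an error message.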

Add Consul to BuildHost

The EC2 Container Service from AWS does not handle service discovery. This forces you to explicitly map host ports (e.g., so that dependent containers can communicate over known ports).

I would like to add a Consul server to the BuildHost, and to use Consul to look up container IP and Port information in real-time. This will avoid having to hard-code the EC2 instance host ports when defining ECS-hosted components.

An example and further explanation are given here:
https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/
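
As a sketch of what the lookup might look like from Python, using Consul's catalog HTTP API (the service name "metadata-api" in the usage comment and the local agent address are assumptions):

    import requests

    def lookup_service(name, consul="http://localhost:8500"):
        """Ask Consul for registered instances of a service; returns (address, port) pairs."""
        resp = requests.get(consul + "/v1/catalog/service/" + name, timeout=5)
        resp.raise_for_status()
        return [
            (entry.get("ServiceAddress") or entry["Address"], entry["ServicePort"])
            for entry in resp.json()
        ]

    # e.g. lookup_service("metadata-api") -> [("10.0.1.12", 32768), ...]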

API: GetDatasets

Write a Lambda function which serves back a JSON block of dataset metadata.
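
A minimal sketch of such a handler in Python (the placeholder metadata and the API Gateway proxy response shape are assumptions):

    import json

    # Placeholder metadata; in practice this would come from the files in
    # shared/btr3baseball/datasource.
    DATASETS = {"Lahman_Batting": {"description": "Season-level batting statistics"}}

    def handler(event, context):
        """Lambda entry point: serve all dataset metadata as a JSON block."""
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(DATASETS),
        }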

API: GetJobInfo

Write a Lambda function which returns a job's info by doing a lookup in a DynamoDB table
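
A sketch with boto3 (the table name "jobs", its "jobId" key, and the event shape are assumptions):

    import json
    import boto3

    table = boto3.resource("dynamodb").Table("jobs")  # assumed table name

    def handler(event, context):
        """Look up a single job's info by ID in DynamoDB."""
        result = table.get_item(Key={"jobId": event["jobId"]})
        item = result.get("Item")
        if item is None:
            return {"statusCode": 404, "body": json.dumps({"error": "job not found"})}
        return {"statusCode": 200, "body": json.dumps(item, default=str)}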

Push UI container to ECR repo

Prove that this can be done from the existing Packer setup, via a post-processor that looks something like this (from Packer Docker documentation):

"post-processors": [
[
{
"type": "docker-tag",
"repository": "12345.dkr.ecr.us-east-1.amazonaws.com/packer",
"tag": "0.7"
},
{
"type": "docker-push",
"login": true,
"login_email": "none",
"login_username": "AWS",
"login_password": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"login_server": "https://12345.dkr.ecr.us-east-1.amazonaws.com/"
}
]
]

To support this, we need the ability to provide the ECR repo location and credentials via environment variables, as described here:

https://www.packer.io/docs/templates/user-variables.html

Backend proof of concept

Write a backend which:

  • Grabs a pending job's ID from a queue
  • Retrieves the job configuration from a database
  • Validates the job configuration
  • Executes the job in R
  • Pushes output files to S3
  • Updates the job status in a database
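
A condensed sketch of that loop in Python with boto3 (the queue URL, bucket, table name, R script path, and the validate helper are all assumptions):

    import subprocess
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    jobs = boto3.resource("dynamodb").Table("jobs")               # assumed table name
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/jobs"  # placeholder
    BUCKET = "workbench-output"                                   # placeholder bucket

    def poll_once():
        """Process at most one pending job from the queue."""
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        for msg in resp.get("Messages", []):
            job_id = msg["Body"]
            job = jobs.get_item(Key={"jobId": job_id})["Item"]
            validate(job)  # hypothetical helper: raises on a bad configuration
            # Execute the analysis in R; the script name is illustrative.
            subprocess.run(["Rscript", "run_job.R", job_id], check=True)
            s3.upload_file("/tmp/%s.csv" % job_id, BUCKET, "output/%s.csv" % job_id)
            jobs.update_item(
                Key={"jobId": job_id},
                UpdateExpression="SET #s = :s",
                ExpressionAttributeNames={"#s": "status"},
                ExpressionAttributeValues={":s": "COMPLETE"},
            )
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])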

UI: Allow configuration of Column Define transformation

Parameters:

  • Column (name of column)
  • Expression (must conform to grammar)

In general, the expression should be either a string constant OR a mathematical expression which allows:

  • Plus, Minus, Multiply, Divide
  • Exponents
  • Parentheses
  • Column references of the form $('COL')
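
A small sketch of a token-level check for this grammar in Python (it verifies the vocabulary only, not full syntactic validity, and the exact grammar is an assumption):

    import re

    TOKEN = re.compile(
        r"\$\('[^']+'\)"       # column reference, e.g. $('HR')
        r"|\d+(?:\.\d+)?"      # numeric literal
        r"|'[^']*'"            # string constant
        r"|[+\-*/^()]"         # operators, exponent, parentheses
        r"|\s+"                # whitespace between tokens
    )

    def conforms_to_grammar(expression):
        """Return True if every character belongs to an allowed token."""
        pos = 0
        while pos < len(expression):
            match = TOKEN.match(expression, pos)
            if match is None:
                return False
            pos = match.end()
        return True

    # e.g. conforms_to_grammar("($('H') + $('BB')) / $('AB')") -> True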

Create Dataset Metadata API

The system will know about a fixed number of datasets. We need an API for dataset metadata, driven by config files and serving back row and column info to the Metadata Viewer frontend.

Metadata viewing is completely separate from the Job Submission API that will be necessary for manipulating data. It could even be deployed as its own microservice.

Add encrypted hieradata to local puppet runs

Because the "standard-aws" repo is separate, it would be nice to be able to supply project-specific details to that standard setup during launch, so that the configured server(s) can have project-specific details.

Infra: Add database

Trying to decide between Postgres, MongoDB, and Riak. Would like to pick one and add it to the existing infrastructure, for now.

My current favorite is Riak, which is similar to AWS DynamoDB (both are based on the Dynamo paper).

Requirements:

  • Support for multiple nodes
  • Ability to run nodes in Docker containers, across different Docker hosts
  • No data loss on single node failure
  • Ability to back up and restore from a tarball
  • Low-latency querying interface

Deploy Metadata API container

  • Should have an ECR repo created in the base stack.
  • Should have a metadata-image job, which calls Packer (just as ui-image does) to create an image.
  • Should have a Consul Template configured on the BuildServer, defining backends as an Nginx "upstream".
  • Should have a service definition added to the DEV template, taking arguments for container count and version, as the UI container configurations do now.

API: SubmitJob

Write a Lambda function which writes a job's configuration info to a DynamoDB table, and places a notification of that job on a queue for processing.
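
A minimal sketch in Python with boto3 (the table name, queue URL, and the assumption that the incoming event is the job configuration are all placeholders):

    import json
    import uuid
    import boto3

    jobs = boto3.resource("dynamodb").Table("jobs")               # assumed table name
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/jobs"  # placeholder

    def handler(event, context):
        """Store the job configuration, then notify the worker queue."""
        job_id = str(uuid.uuid4())
        jobs.put_item(Item={
            "jobId": job_id,
            "status": "PENDING",
            "config": json.dumps(event),  # assumes the event is the job config
        })
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job_id)
        return {"jobId": job_id}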

UI: Viewing Dataset Metadata

There are a number of publicly available datasets with Baseball statistics.

As a key feature of its UI, Baseball Workbench should have a "Metadata Viewer" layout. The UI can retrieve dataset metadata from an API call.

Metadata can be assumed to include:

  • Dataset Name: A unique identifier for this dataset, such as "Lahman.Hitters"
  • Dataset Description: A text description of the dataset
  • Row Description: A text description of what each row in the dataset represents (all rows will represent the same thing, so only one description is required)
  • Column Metadata: A unique Name, text description, and data type (String, Count, Ratio) for each column in the dataset

Worker: Job Error Propagation

The current worker implementation does not update the DB entry in the case of errors (and stops processing all messages when the first error is encountered).

Things should not be this way. We should wrap each job in a generic try/except so that its status is always updated and processing continues to the next job as appropriate.

Propagation should also include an error message.
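
A sketch of the intended shape (run_job and update_status are hypothetical helpers):

    def process_messages(messages):
        """Handle each message independently so one bad job cannot stall the rest."""
        for msg in messages:
            job_id = msg["Body"]
            try:
                run_job(job_id)                    # hypothetical helper
                update_status(job_id, "COMPLETE")  # hypothetical helper
            except Exception as exc:
                # Deliberately broad: record the failure and its message,
                # then continue with the next job.
                update_status(job_id, "FAILED", error=str(exc))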

UI: Add Export components

The user should be able to select from a pre-defined set of export types, and provide any necessary options for each type.

The only available type for the MVP should be:

  • Ordered Table: requires choice of 1 to 5 columns to be in the table, and requires choice of order by column and order direction (ascending or descending)

UI: Default Use Case Flow

The Baseball Workbench UI should allow users to describe, execute, and export statistical analysis. To support this goal, there should be a flow of the following activities:

  • Select Initial Dataset, from a list of available Public datasets. For example: Lahman.Hitters
  • Define one or more new columns, in terms of columns from the Initial Dataset and basic arithmetic (add, subtract, multiply, divide). For example: RC = (H + BB) * TB / (AB + BB)
  • Define one or more row filters, in terms of column names, values, and comparators, to be applied to the updated data set (Initial + New Columns) prior to export. For example: Year > 1955
  • Define Exported Artifact, from a list of available export types and their options. For example: Histogram of RC
  • Click "Generate"
  • Receive temporary link to exported files.

The available datasets are:

  • Individual tables from the Lahman database (e.g., Hitters, Pitchers, Teams, etc.)
  • Retrosheet Gamelogs database (Regular Season, Postseason, or All-Star)

The supported Export Types are:

  • Table ordered by X (ASC or DESC)
  • Histogram of X
  • Scatter Plot of X vs. Y

API: Add validation to SubmitJob API

Before successfully writing to DynamoDB and placing a message on the queue, the SubmitJob API call should validate the parameters of the requested job.

Here is a sample JSON configuration object for a job:

{
  "dataset": "Lahman_Batting",
  "transformations": [
    {
      "type": "columnSelect",
      "columns": [
        "HR",
        "lgID"
      ]
    },
    {
      "type": "rowSelect",
      "column": "yearID",
      "operator": ">=",
      "criteria": "2000"
    },
    {
      "type": "columnDefine",
      "column": "custom",
      "expression": "2*(HR)"
    },
    {
      "type": "rowSum",
      "columns": [
        "playerID",
        "yearID",
        "lgID"
      ]
    }
  ],
  "output": {
    "type": "leaderboard",
    "column": "HR",
    "direction": "desc"
  }
}

Below is a list of required validations.

Dataset:

  • Dataset ID should be from the allowed set of datasets (currently just "Lahman_Batting")

Output:

  • Output parameter "type" should be from allowed set of output types (currently just "leaderboard")
  • Output parameter "column" should be the name of a single column from the set of selected and/or defined columns as of the end of all transformations
  • Output parameter "direction" must be one of "desc" or "asc"

ColumnSelect and RowSum Transformation:

  • Entries in the "columns" list should be the name of an existing column, with respect to any previously executed transformations.
  • After the ColumnSelect transformation, all columns not present in the "columns" list are lost.
  • After the RowSum transformation, all string-valued columns not present in the "columns" list are lost.

RowSelect Transformation:

  • "column" should be the name of an existing column, with respect to any previously executed transformations.
  • "operator" should be one of <, >, <=, >=, =, or !=.
  • "criteria" should be either a number or string, and not an expression.
  • The type of the criteria (number or string) should match the type of the corresponding column chosen.

ColumnDefine Transformation:

  • "column" should be a unique name for the new column being defined, and should not conflict with the name of any existing column, with respect to any previously executed transformations
  • "expression" should be a valid mathematical expression using only scalar values (strings or numbers) or the names of existing columns, with respect to any previously executed transformations.
  • "expression" may use the following numerical operators: +, -, *, /, ^
  • After the ColumnDefine transformation, a new column with the given name is added.
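
Taken together, here is a sketch of the validator in Python (type checks for rowSelect criteria and the rowSum string-column rule are omitted because they need per-column type metadata; all names are assumptions):

    ALLOWED_DATASETS = {"Lahman_Batting"}
    ALLOWED_OUTPUT_TYPES = {"leaderboard"}
    COMPARATORS = {"<", ">", "<=", ">=", "=", "!="}

    def validate_job(config, dataset_columns):
        """Raise ValueError on the first violation of the rules above.

        dataset_columns maps each allowed dataset ID to its initial column names.
        """
        dataset = config["dataset"]
        if dataset not in ALLOWED_DATASETS:
            raise ValueError("unknown dataset: " + dataset)
        columns = set(dataset_columns[dataset])
        for t in config.get("transformations", []):
            if t["type"] in ("columnSelect", "rowSum"):
                missing = [c for c in t["columns"] if c not in columns]
                if missing:
                    raise ValueError("unknown columns: " + ", ".join(missing))
                if t["type"] == "columnSelect":
                    columns = set(t["columns"])  # unselected columns are lost
            elif t["type"] == "rowSelect":
                if t["column"] not in columns:
                    raise ValueError("unknown column: " + t["column"])
                if t["operator"] not in COMPARATORS:
                    raise ValueError("unsupported operator: " + t["operator"])
            elif t["type"] == "columnDefine":
                if t["column"] in columns:
                    raise ValueError("column already exists: " + t["column"])
                # Expression validation (grammar + column references) would go here.
                columns.add(t["column"])
        output = config["output"]
        if output["type"] not in ALLOWED_OUTPUT_TYPES:
            raise ValueError("unsupported output type: " + output["type"])
        if output["column"] not in columns:
            raise ValueError("output column not available: " + output["column"])
        if output["direction"] not in ("asc", "desc"):
            raise ValueError('direction must be "asc" or "desc"')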

Automated Build Server Configuration

Acceptance criteria:

  • Single command creates all AWS resources
  • r10k runs to retrieve required puppet modules
  • Local puppet apply runs to configure Jenkins server
  • Single t2.nano server with Jenkins installed
  • Jenkins security enabled, with "admin" and "bryan" accounts created
  • Docker and Packer installed, to support image builds
  • Jenkins plugins installed to support Git checkout
  • Jenkins seed job configured to check out from a specified GitHub repo and run a file with a given name
