
Baseball Workbench

Baseball Workbench is an upcoming Web-based tool for Sabermetrics research.

The software for Baseball Workbench will be open-source, and will integrate with freely available data sources.

The hosted tool will have associated infrastructure costs, but we may look to offset these with sponsors, ads, or fees at some point.

Release Date

Update: I haven't released a hosted Alpha version of Baseball Workbench yet, but I'll keep posting updates here! Feel free to follow along. Alpha access may be private to help control costs; if so, invite information will be posted to this README.

Get development updates

If you're a GitHub user, you can "Watch" this project to get updates.

You can also check out the content on the GitHub Wiki.

Contributing

There are many ways you can contribute to Baseball Workbench!

Check out the Issues tab here on GitHub for work that is planned.

If you are a programmer (or want to try your hand at it!), check out the various code throughout the project and contribute enhancements:

  • There is Python code in shared/, api/api.py, and worker/service.py
  • There is R code in worker/service.py (this will move into a separate file soon!)
  • Groovy is used in the worker/extract/download.groovy script (but may be changed to Python in the future for consistency)

If you know or want to learn AWS concepts, you can check out the infra folder, which uses AWS CloudFormation to define the hosted version of the app.

If you are interested or experienced with baseball datasets such as the Lahman Database and Retrosheet Game Logs, you can check out the data extraction script in worker/extract and the datasource metadata files in shared/btr3baseball/datasource. Both of these are fairly easy to follow, and we will need a lot of work in these areas to integrate with interesting datasets.

Finally, once the hosted app is live for Alpha, you can contribute by logging Issues in the Issues tab for bugs you find and features you would like to see.

Contact

If you have any questions about Baseball Workbench, you can contact Bryan at [email protected].


Issues

UI: Equation Editor

To support the definition of new statistics from existing columns in a dataset, Baseball Workbench should have an Equation Editor.

  • Users should be able to easily add references to columns from their datasource.
  • Users should be able to easily add references to custom columns previously defined.
  • Attempts to add references to non-existent columns should result in a client-side error.
  • Users should be able to use simple mathematical operators: Add, Subtract, Multiply, Divide
  • Attempts to use unsupported operators (or include any extraneous characters) should result in a client-side error.
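
Below is a minimal sketch of this validation logic, written in Python for illustration (the actual check would run client-side, presumably in JavaScript, and the tokenization rules here are an assumption about the eventual grammar):

    import re

    ALLOWED_OPERATORS = set("+-*/()")  # assumed: the four operators plus grouping

    def validate_expression(expr, known_columns):
        """Return a list of error messages; an empty list means the expression is valid."""
        errors = []
        # Split into identifiers, numeric literals, and single other characters.
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d+)?|\S", expr)
        for tok in tokens:
            if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", tok):
                if tok not in known_columns:
                    errors.append("Unknown column: " + tok)
            elif re.fullmatch(r"\d+(?:\.\d+)?", tok):
                continue  # numeric literal
            elif tok not in ALLOWED_OPERATORS:
                errors.append("Unsupported operator or character: " + tok)
        return errors

For example, validate_expression("(H + BB) / AB", {"H", "BB", "AB"}) returns an empty list, while a reference to a non-existent column produces an error message.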

Add Consul to BuildHost

The EC2 Container Service from AWS does not handle service discovery. This forces you to explicitly map host ports (e.g., so that dependent containers can communicate over known ports).

I would like to add a Consul server to the BuildHost, and to use Consul to look up container IP and Port information in real-time. This will avoid having to hard-code the EC2 instance host ports when defining ECS-hosted components.

An example and further explanation are given here:
https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/
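
As a sketch of what the lookup might look like from Python, using Consul's catalog HTTP API (the service name "metadata-api" in the usage comment and the local agent address are assumptions):

    import requests

    def lookup_service(name, consul="http://localhost:8500"):
        """Ask Consul for registered instances of a service; returns (address, port) pairs."""
        resp = requests.get(consul + "/v1/catalog/service/" + name, timeout=5)
        resp.raise_for_status()
        return [
            (entry.get("ServiceAddress") or entry["Address"], entry["ServicePort"])
            for entry in resp.json()
        ]

    # e.g. lookup_service("metadata-api") -> [("10.0.1.12", 32768), ...]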

API: GetDatasets

Write a Lambda function which serves back a JSON block of dataset metadata.
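
A minimal sketch of such a handler in Python (the placeholder metadata and the API Gateway proxy response shape are assumptions):

    import json

    # Placeholder metadata; in practice this would come from the files in
    # shared/btr3baseball/datasource.
    DATASETS = {"Lahman_Batting": {"description": "Season-level batting statistics"}}

    def handler(event, context):
        """Lambda entry point: serve all dataset metadata as a JSON block."""
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(DATASETS),
        }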

API: GetJobInfo

Write a Lambda function which returns a job's info by doing a lookup in a DynamoDB table
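
A sketch with boto3 (the table name "jobs", its "jobId" key, and the event shape are assumptions):

    import json
    import boto3

    table = boto3.resource("dynamodb").Table("jobs")  # assumed table name

    def handler(event, context):
        """Look up a single job's info by ID in DynamoDB."""
        result = table.get_item(Key={"jobId": event["jobId"]})
        item = result.get("Item")
        if item is None:
            return {"statusCode": 404, "body": json.dumps({"error": "job not found"})}
        return {"statusCode": 200, "body": json.dumps(item, default=str)}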

Push UI container to ECR repo

Prove that this can be done from the existing Packer setup, via a post-processor that looks something like this (from Packer Docker documentation):

"post-processors": [
[
{
"type": "docker-tag",
"repository": "12345.dkr.ecr.us-east-1.amazonaws.com/packer",
"tag": "0.7"
},
{
"type": "docker-push",
"login": true,
"login_email": "none",
"login_username": "AWS",
"login_password": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"login_server": "https://12345.dkr.ecr.us-east-1.amazonaws.com/"
}
]
]

To support this, we need the ability to provide the ECR repo location and credentials via environment variables, as described here:

https://www.packer.io/docs/templates/user-variables.html

Backend proof of concept

Write a backend which:

  • Grabs a pending job's ID from a queue
  • Retrieves the job configuration from a database
  • Validates the job configuration
  • Executes the job in R
  • Pushes output files to S3
  • Updates the job status in a database
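
A condensed sketch of that loop in Python with boto3 (the queue URL, bucket, table name, R script path, and the validate helper are all assumptions):

    import subprocess
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    jobs = boto3.resource("dynamodb").Table("jobs")               # assumed table name
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/jobs"  # placeholder
    BUCKET = "workbench-output"                                   # placeholder bucket

    def poll_once():
        """Process at most one pending job from the queue."""
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        for msg in resp.get("Messages", []):
            job_id = msg["Body"]
            job = jobs.get_item(Key={"jobId": job_id})["Item"]
            validate(job)  # hypothetical helper: raises on a bad configuration
            # Execute the analysis in R; the script name is illustrative.
            subprocess.run(["Rscript", "run_job.R", job_id], check=True)
            s3.upload_file("/tmp/%s.csv" % job_id, BUCKET, "output/%s.csv" % job_id)
            jobs.update_item(
                Key={"jobId": job_id},
                UpdateExpression="SET #s = :s",
                ExpressionAttributeNames={"#s": "status"},
                ExpressionAttributeValues={":s": "COMPLETE"},
            )
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])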

UI: Allow configuration of Column Define transformation

Parameters:

  • Column (name of column)
  • Expression (must conform to grammar)

In general, the expression should be either a string constant OR a mathematical expression which allows:

  • Plus, Minus, Multiply, Divide
  • Exponents
  • Parentheses
  • Column references of the form $('COL')
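
A small sketch of a token-level check for this grammar in Python (it verifies the vocabulary only, not full syntactic validity, and the exact grammar is an assumption):

    import re

    TOKEN = re.compile(
        r"\$\('[^']+'\)"       # column reference, e.g. $('HR')
        r"|\d+(?:\.\d+)?"      # numeric literal
        r"|'[^']*'"            # string constant
        r"|[+\-*/^()]"         # operators, exponent, parentheses
        r"|\s+"                # whitespace between tokens
    )

    def conforms_to_grammar(expression):
        """Return True if every character belongs to an allowed token."""
        pos = 0
        while pos < len(expression):
            match = TOKEN.match(expression, pos)
            if match is None:
                return False
            pos = match.end()
        return True

    # e.g. conforms_to_grammar("($('H') + $('BB')) / $('AB')") -> True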

Create Dataset Metadata API

The system will know about a fixed number of datasets. We need an API for dataset metadata, driven by config files and serving back row and column info to the Metadata Viewer frontend.

Metadata viewing is completely separate from the Job Submission API that will be necessary for manipulating data. It could even be deployed as its own microservice.

Add encrypted hieradata to local puppet runs

Because the "standard-aws" repo is separate, it would be nice to be able to supply project-specific details to that standard setup during launch, so that the configured server(s) can have project-specific details.

Infra: Add database

Trying to decide between Postgres, MongoDB, and Riak. Would like to pick one and add it to the existing infrastructure, for now.

My current favorite is Riak, which is similar to AWS DynamoDB (both are based on the Dynamo paper).

Requirements:

  • Support for multiple nodes
  • Ability to run nodes in Docker containers, across different Docker hosts
  • No data loss on single node failure
  • Ability to back up and restore from a tarball
  • Low-latency querying interface

Deploy Metadata API container

  • Should have an ECR repo created in the base stack.
  • Should have a metadata-image job, which calls Packer (just as ui-image does) to create an image.
  • Should have a Consul Template configured on the BuildServer, defining backends as an Nginx "upstream".
  • Should have a service definition added to the DEV template, taking arguments for container count and version, as the UI container configurations do now.

API: SubmitJob

Write a Lambda function which writes a job's configuration info to a DynamoDB table, and places a notification of that job on a queue for processing.
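
A minimal sketch in Python with boto3 (the table name, queue URL, and the assumption that the incoming event is the job configuration are all placeholders):

    import json
    import uuid
    import boto3

    jobs = boto3.resource("dynamodb").Table("jobs")               # assumed table name
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/jobs"  # placeholder

    def handler(event, context):
        """Store the job configuration, then notify the worker queue."""
        job_id = str(uuid.uuid4())
        jobs.put_item(Item={
            "jobId": job_id,
            "status": "PENDING",
            "config": json.dumps(event),  # assumes the event is the job config
        })
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job_id)
        return {"jobId": job_id}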

UI: Viewing Dataset Metadata

There are a number of publicly available datasets with Baseball statistics.

As a key feature of its UI, Baseball Workbench should have a "Metadata Viewer" layout. The UI can retrieve dataset metadata from an API call.

Metadata can be assumed to include:

  • Dataset Name: A unique identifier for this dataset, such as "Lahman.Hitters"
  • Dataset Description: A text description of the dataset
  • Row Description: A text description of what each row in the dataset represents (all rows will represent the same thing, so only one description is required)
  • Column Metadata: A unique Name, text description, and data type (String, Count, Ratio) for each column in the dataset

Worker: Job Error Propagation

The current worker implementation does not update the DB entry in the case of errors (and stops processing all messages when the first error is encountered).

Things should not be this way. We should wrap each job in a generic try/except so that its status is always updated and processing continues to the next job as appropriate.

Propagation should also include an error message.
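
A sketch of the intended shape (run_job and update_status are hypothetical helpers):

    def process_messages(messages):
        """Handle each message independently so one bad job cannot stall the rest."""
        for msg in messages:
            job_id = msg["Body"]
            try:
                run_job(job_id)                    # hypothetical helper
                update_status(job_id, "COMPLETE")  # hypothetical helper
            except Exception as exc:
                # Deliberately broad: record the failure and its message,
                # then continue with the next job.
                update_status(job_id, "FAILED", error=str(exc))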

UI: Add Export components

The user should be able to select from a pre-defined set of export types, and provide any necessary options for each type.

The only available type for the MVP should be:

  • Ordered Table: requires choice of 1 to 5 columns to be in the table, and requires choice of order by column and order direction (ascending or descending)

UI: Default Use Case Flow

The Baseball Workbench UI should allow users to describe, execute, and export statistical analysis. To support this goal, there should be a flow of the following activities:

  • Select Initial Dataset, from a list of available Public datasets. For example: Lahman.Hitters
  • Define one or more new columns, in terms of columns from the Initial Dataset and basic arithmetic (add, subtract, multiply, divide). For example: RC = (H + BB) * TB / (AB + BB)
  • Define one or more row filters, in terms of column names, values, and comparators, to be applied to the updated data set (Initial + New Columns) prior to export. For example: Year > 1955
  • Define Exported Artifact, from a list of available export types and their options. For example: Histogram of RC
  • Click "Generate"
  • Receive temporary link to exported files.

The available datasets are:

  • Individual tables from the Lahman database (e.g., Hitters, Pitchers, Teams, etc.)
  • Retrosheet Gamelogs database (Regular Season, Postseason, or All-Star)

The supported Export Types are:

  • Table ordered by X (ASC or DESC)
  • Histogram of X
  • Scatter Plot of X vs. Y

API: Add validation to SubmitJob API

Before successfully writing to DynamoDB and placing a message on the queue, the SubmitJob API call should validate the parameters of the requested job.

Here is a sample JSON configuration object for a job:

{
  "dataset": "Lahman_Batting",
  "transformations": [
    {
      "type": "columnSelect",
      "columns": [
        "HR",
        "lgID"
      ]
    },
    {
      "type": "rowSelect",
      "column": "yearID",
      "operator": ">=",
      "criteria": "2000"
    },
    {
      "type": "columnDefine",
      "column": "custom",
      "expression": "2*(HR)"
    },
    {
      "type": "rowSum",
      "columns": [
        "playerID",
        "yearID",
        "lgID"
      ]
    }
  ],
  "output": {
    "type": "leaderboard",
    "column": "HR",
    "direction": "desc"
  }
}

Below is a list of required validations.

Dataset:

  • Dataset ID should be from the allowed set of datasets (currently just "Lahman_Batting")

Output:

  • Output parameter "type" should be from allowed set of output types (currently just "leaderboard")
  • Output parameter "column" should be the name of a single column from the set of selected and/or defined columns as of the end of all transformations
  • Output parameter "direction" must be one of "desc" or "asc"

ColumnSelect and RowSum Transformation:

  • Entries in the "columns" list should be the name of an existing column, with respect to any previously executed transformations.
  • After the ColumnSelect transformation, all columns not present in the "columns" list are lost.
  • After the RowSum transformation, all string-valued columns not present in the "columns" list are lost.

RowSelect Transformation:

  • "column" should be the name of an existing column, with respect to any previously executed transformations.
  • "operator" should be one of <, >, <=, >=, =, or !=.
  • "criteria" should be either a number or string, and not an expression.
  • The type of the criteria (number or string) should match the type of the corresponding column chosen.

ColumnDefine Transformation:

  • "column" should be a unique name for the new column being defined, and should not conflict with the name of any existing column, with respect to any previously executed transformations
  • "expression" should be a valid mathematical expression using only scalar values (strings or numbers) or the names of existing columns, with respect to any previously executed transformations.
  • "expression" may use the following numerical operators: +, -, *, /, ^
  • After the ColumnDefine transformation, a new column with the given name is added.
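
Taken together, here is a sketch of the validator in Python (type checks for rowSelect criteria and the rowSum string-column rule are omitted because they need per-column type metadata; all names are assumptions):

    ALLOWED_DATASETS = {"Lahman_Batting"}
    ALLOWED_OUTPUT_TYPES = {"leaderboard"}
    COMPARATORS = {"<", ">", "<=", ">=", "=", "!="}

    def validate_job(config, dataset_columns):
        """Raise ValueError on the first violation of the rules above.

        dataset_columns maps each allowed dataset ID to its initial column names.
        """
        dataset = config["dataset"]
        if dataset not in ALLOWED_DATASETS:
            raise ValueError("unknown dataset: " + dataset)
        columns = set(dataset_columns[dataset])
        for t in config.get("transformations", []):
            if t["type"] in ("columnSelect", "rowSum"):
                missing = [c for c in t["columns"] if c not in columns]
                if missing:
                    raise ValueError("unknown columns: " + ", ".join(missing))
                if t["type"] == "columnSelect":
                    columns = set(t["columns"])  # unselected columns are lost
            elif t["type"] == "rowSelect":
                if t["column"] not in columns:
                    raise ValueError("unknown column: " + t["column"])
                if t["operator"] not in COMPARATORS:
                    raise ValueError("unsupported operator: " + t["operator"])
            elif t["type"] == "columnDefine":
                if t["column"] in columns:
                    raise ValueError("column already exists: " + t["column"])
                # Expression validation (grammar + column references) would go here.
                columns.add(t["column"])
        output = config["output"]
        if output["type"] not in ALLOWED_OUTPUT_TYPES:
            raise ValueError("unsupported output type: " + output["type"])
        if output["column"] not in columns:
            raise ValueError("output column not available: " + output["column"])
        if output["direction"] not in ("asc", "desc"):
            raise ValueError('direction must be "asc" or "desc"')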

Automated Build Server Configuration

Acceptance criteria:

  • Single command creates all AWS resources
  • r10k runs to retrieve required puppet modules
  • Local puppet apply runs to configure Jenkins server
  • Single t2.nano server with Jenkins installed
  • Jenkins security enabled, with "admin" and "bryan" accounts created
  • Docker and Packer installed, to support image builds
  • Jenkins plugins installed to support Git checkout
  • Jenkins seed job configured to check out from a specified GitHub repo and run a file with a given name
