
TOTD-AI

Overengineering an entire data lake environment to support an AI application that predicts future Track of the Day (TOTD) selections in Trackmania 2020, then exposing those predictions via API routes that integrate with web browsers, Discord bots, and more.

Quickstart

Running this locally requires a beefy computer: expect to need 10GB+ of RAM.

  1. Run scripts/tmdb-volume.sh --create
  2. Run docker-compose up --build
  3. Navigate to the Airflow UI
  4. Manually trigger the master-ingestion-dag

Airflow:

UI: localhost:8080
Username: airflow
Password: airflow

Adminer for Postgres:

UI: localhost:8081
System: PostgreSQL
Server: trackmania-postgres
Username: airflow
Password: airflow
Database: trackmania
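
For ad-hoc queries outside Adminer, here is a minimal connection sketch using the credentials above. It assumes the compose stack publishes Postgres on the host's port 5432; check docker-compose.yml for the actual mapping:

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect(
        host="localhost",  # use "trackmania-postgres" from inside the compose network
        port=5432,         # assumption: verify the published port in docker-compose.yml
        user="airflow",
        password="airflow",
        dbname="trackmania",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])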

Issues

Ingest Player Club Data

Club tags are available through TMIO leaderboards. We should include this data anywhere we're including a username / user_id.
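
A minimal parsing sketch of what that might look like; the payload shape (a "tops" list of entries, each with a player object carrying id / name / tag) is an assumption about TMIO's leaderboard responses and should be verified against the live API:

    from typing import Any

    def extract_club_tags(leaderboard: dict[str, Any]) -> list[dict[str, str | None]]:
        # Pull (user_id, username, club_tag) triples out of a TMIO leaderboard
        # payload so club tags ride along anywhere we store a username / user_id.
        rows = []
        for entry in leaderboard.get("tops", []):
            player = entry.get("player", {})
            rows.append({
                "user_id": player.get("id"),
                "username": player.get("name"),
                "club_tag": player.get("tag"),  # may be absent for players with no club
            })
        return rows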

Ingest Previous Days' Leaderboards

When scraping TMX and TMIO, we are currently grabbing data exclusively relating to the current TOTD. It would be nice to grab information about the previous day's TOTD (to refresh our leaderboards and world records).

This should only require alterations to the collect_tmio_enhancements DAG. It is currently hardcoded to pull the top 10 leaderboard entries for the current TOTD. I would like to:

  • Make the lookup configurable with a top N entries
  • Use the top 2 map_uid values (rather than top 1)

To accomplish both (a code sketch follows this list):

  • Modify the get_totd_map_uid task to use LIMIT {n} (an f-string'ed Airflow Variable) rather than LIMIT 1. Default the variable to 2 so the DAG doesn't error out if it isn't set; setting it higher also enables (manual) full refreshes.
  • Rewrite scrape_leaderboards_today to be iterative over multiple days, still pushing to one XCom object
  • Refactor push_leaderboards_to_postgres to UNNEST() the JSON it pulls from XCom
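
A minimal sketch of all three changes. The table, column, and Variable names (conform.totd, conform.leaderboards, totd_map_lookback) are assumptions, and fetch_tmio_leaderboard stands in for the existing scraping logic:

    import json

    from airflow.decorators import task
    from airflow.models import Variable
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def fetch_tmio_leaderboard(map_uid: str) -> list[dict]:
        # Hypothetical stand-in for the existing TMIO scraping code.
        raise NotImplementedError

    @task
    def get_totd_map_uids() -> list[str]:
        # Defaults to 2 so the DAG doesn't error out when the Variable is unset.
        n = int(Variable.get("totd_map_lookback", default_var=2))
        hook = PostgresHook(postgres_conn_id="trackmania-postgres")
        rows = hook.get_records(
            f"SELECT map_uid FROM conform.totd ORDER BY totd_date DESC LIMIT {n}"
        )
        return [row[0] for row in rows]

    @task
    def scrape_leaderboards(map_uids: list[str]) -> list[dict]:
        # Iterate over several days' maps but return ONE object,
        # so downstream tasks still read a single XCom.
        entries: list[dict] = []
        for uid in map_uids:
            entries.extend(fetch_tmio_leaderboard(uid))
        return entries

    @task
    def push_leaderboards_to_postgres(entries: list[dict]) -> None:
        # Explode the single XCom payload server-side; for a JSON array,
        # jsonb_array_elements() plays the role the issue calls UNNEST().
        hook = PostgresHook(postgres_conn_id="trackmania-postgres")
        hook.run(
            """
            INSERT INTO conform.leaderboards (map_uid, player_id, time_ms)
            SELECT e->>'map_uid', e->>'player_id', (e->>'time_ms')::int
            FROM jsonb_array_elements(%s::jsonb) AS e
            """,
            parameters=(json.dumps(entries),),
        )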

Consumption Layer QA

Haven't been able to test consumption layer objects because TMX is down tonight. Need to do this and merge resource-names back into main.

This raises an interesting conversation about fault tolerance. I've opted for the TMX and OpenPlanet APIs to avoid dealing with Nadeo's API directly; not having data 100% of the time is the price you pay for a much faster development cycle and more stable APIs. We rely on both TMX and TMIO, and our AI model will require both to make accurate predictions, so we don't want to re-train the model on bad or partial data.

In other words, if TMX or OpenPlanet is down, we don't want to do anything, so fault tolerance is not a problem on the data ingestion front. One way to implement that "do nothing" gate is sketched below.
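
A minimal sketch using Airflow's ShortCircuitOperator; the DAG name and health-check URLs are assumptions:

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import ShortCircuitOperator

    def sources_are_up() -> bool:
        # Returning False skips every downstream task for this run.
        for url in ("https://trackmania.exchange", "https://trackmania.io"):
            try:
                requests.head(url, timeout=10).raise_for_status()
            except requests.RequestException:
                return False
        return True

    # "schedule" is the Airflow 2.4+ kwarg; older versions use schedule_interval.
    with DAG("source_health_gate", start_date=datetime(2023, 1, 1), schedule=None) as dag:
        check_sources = ShortCircuitOperator(
            task_id="check_sources",
            python_callable=sources_are_up,
        )
        # check_sources >> the ingestion tasks it guards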

Create DAGs for joining TMX and TMIO data from Conform and promoting to Consume

We're going to want a variety of tables in the Consume layer. This is what I've come up with off the top of my head (a DDL sketch follows the list):

  • TOTD
    track name, exchange_id, authors, totd date, upload date, days from upload to totd, medal times
  • TOTD_Tags
    exchange_id, tag name (repeated for multiple tags on one map)
  • TOTD_WorldRecords
    exchange_id, wr_time, wr_player, date driven
  • TOTD_Authors
    exchange_id, author user_id, all tags on all of author's maps
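
A minimal DDL sketch of those four tables, run through the Airflow Postgres hook; the consume schema name, column names, and types are assumptions derived from the field lists above:

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    CONSUME_DDL = """
    CREATE SCHEMA IF NOT EXISTS consume;

    CREATE TABLE IF NOT EXISTS consume.totd (
        exchange_id    INTEGER PRIMARY KEY,
        track_name     TEXT,
        authors        TEXT[],
        totd_date      DATE,
        upload_date    DATE,
        days_to_totd   INTEGER,   -- totd_date - upload_date
        medal_times_ms INTEGER[]  -- author / gold / silver / bronze
    );

    CREATE TABLE IF NOT EXISTS consume.totd_tags (
        exchange_id INTEGER REFERENCES consume.totd (exchange_id),
        tag_name    TEXT  -- repeated row per tag on one map
    );

    CREATE TABLE IF NOT EXISTS consume.totd_worldrecords (
        exchange_id INTEGER REFERENCES consume.totd (exchange_id),
        wr_time_ms  INTEGER,
        wr_player   TEXT,
        date_driven DATE
    );

    CREATE TABLE IF NOT EXISTS consume.totd_authors (
        exchange_id    INTEGER REFERENCES consume.totd (exchange_id),
        author_user_id TEXT,
        author_tags    TEXT[]  -- all tags across all of this author's maps
    );
    """

    def create_consume_tables() -> None:
        PostgresHook(postgres_conn_id="trackmania-postgres").run(CONSUME_DDL)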

Project Roadmap

  • Create DAG for ingesting TMIO data into Collection layer
  • Create DAG for cleaning TMIO data and promoting to Conform
  • Create DAG for ingesting TMX data into Collection layer
  • Create DAG for cleaning TMX data and promoting to Conform
  • Create master orchestration DAG to trigger all aforementioned DAGs sequentially
  • Pay down technical debt and use volumes to persist data properly
  • Perform one-time load of TMX tags data into Conform layer XREF table
  • Create DAGs for loading additional TMX data such as multiple authors, leaderboards, etc.
  • Create DAGs for loading TMIO leaderboards data to supplement TMX data
  • #2
  • Revisit technical debt for persistent volumes; create volumes independently from docker-compose stack
  • Perform historical load of past TMIO and TMX data
  • Train AI models using views on data from Consume, select one, and bake it into the Airflow environment
  • Create DAG for running AI model with data from Consume and publishing predictions to Application layer
  • Create DAG for demoting old predictions from the Application layer down to an archive in Consume
  • Build FastAPI route /predictions/{date} to expose predictions from the Application layer
  • Build FastAPI route /totd/{date} to expose actual / historical data from the Consume layer (both routes are sketched below)
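
A minimal sketch of those two routes; only the paths come from the roadmap, while the lookup helpers and response shapes are placeholders:

    from datetime import date

    from fastapi import FastAPI, HTTPException

    app = FastAPI(title="totd-ai")

    def fetch_prediction(d: date) -> dict | None:
        # Hypothetical Application-layer lookup; wire this to Postgres in practice.
        return None

    def fetch_totd(d: date) -> dict | None:
        # Hypothetical Consume-layer lookup.
        return None

    @app.get("/predictions/{totd_date}")
    def get_prediction(totd_date: date) -> dict:
        row = fetch_prediction(totd_date)
        if row is None:
            raise HTTPException(status_code=404, detail="No prediction for that date")
        return row

    @app.get("/totd/{totd_date}")
    def get_totd(totd_date: date) -> dict:
        row = fetch_totd(totd_date)
        if row is None:
            raise HTTPException(status_code=404, detail="No TOTD data for that date")
        return row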
