
TOTD-AI

Overengineering an entire data lake environment to support an AI application that predicts future Track of the Day (TOTD) selections in Trackmania 2020, then exposing those predictions via API routes that integrate with web browsers, Discord bots, and more.

Quickstart

Running this locally requires a beefy computer: expect to need 10GB+ of RAM.

  1. Run scripts/tmdb-volume.sh --create
  2. Run docker-compose up --build
  3. Navigate to the Airflow UI
  4. Manually trigger the master-ingestion-dag

Airflow:

UI: localhost:8080
Username: airflow
Password: airflow

Adminer for Postgres:

UI: localhost:8081
System: PostgreSQL
Server: trackmania-postgres
Username: airflow
Password: airflow
Database: trackmania
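
For ad-hoc queries outside Adminer, here is a minimal connection sketch using the credentials above. It assumes the compose stack publishes Postgres on the host's port 5432; check docker-compose.yml for the actual mapping:

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect(
        host="localhost",  # use "trackmania-postgres" from inside the compose network
        port=5432,         # assumption: verify the published port in docker-compose.yml
        user="airflow",
        password="airflow",
        dbname="trackmania",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])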

Issues

Ingest Player Club Data

Club tags are available through TMIO leaderboards. We should include this data anywhere we're including a username / user_id.
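
A minimal parsing sketch of what that might look like; the payload shape (a "tops" list of entries, each with a player object carrying id / name / tag) is an assumption about TMIO's leaderboard responses and should be verified against the live API:

    from typing import Any

    def extract_club_tags(leaderboard: dict[str, Any]) -> list[dict[str, str | None]]:
        # Pull (user_id, username, club_tag) triples out of a TMIO leaderboard
        # payload so club tags ride along anywhere we store a username / user_id.
        rows = []
        for entry in leaderboard.get("tops", []):
            player = entry.get("player", {})
            rows.append({
                "user_id": player.get("id"),
                "username": player.get("name"),
                "club_tag": player.get("tag"),  # may be absent for players with no club
            })
        return rows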

Ingest Previous Days' Leaderboards

When scraping TMX and TMIO, we are currently grabbing data exclusively relating to the current TOTD. It would be nice to grab information about the previous day's TOTD (to refresh our leaderboards and world records).

This should only require alterations to the collect_tmio_enhancements DAG. It is currently hardcoded to pull the top 10 leaderboard entries for the current TOTD. I would like to:

  • Make the lookup configurable with a top N entries
  • Use the top 2 map_uid values (rather than top 1)

To accomplish both (a code sketch follows this list):

  • Modify the get_totd_map_uid task to use LIMIT {n} (an f-string'ed Airflow Variable) rather than LIMIT 1. Default the variable to 2 so the DAG doesn't error out if it isn't set; setting it higher also enables (manual) full refreshes.
  • Rewrite scrape_leaderboards_today to be iterative over multiple days, still pushing to one XCom object
  • Refactor push_leaderboards_to_postgres to UNNEST() the JSON it pulls from XCom
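
A minimal sketch of all three changes. The table, column, and Variable names (conform.totd, conform.leaderboards, totd_map_lookback) are assumptions, and fetch_tmio_leaderboard stands in for the existing scraping logic:

    import json

    from airflow.decorators import task
    from airflow.models import Variable
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def fetch_tmio_leaderboard(map_uid: str) -> list[dict]:
        # Hypothetical stand-in for the existing TMIO scraping code.
        raise NotImplementedError

    @task
    def get_totd_map_uids() -> list[str]:
        # Defaults to 2 so the DAG doesn't error out when the Variable is unset.
        n = int(Variable.get("totd_map_lookback", default_var=2))
        hook = PostgresHook(postgres_conn_id="trackmania-postgres")
        rows = hook.get_records(
            f"SELECT map_uid FROM conform.totd ORDER BY totd_date DESC LIMIT {n}"
        )
        return [row[0] for row in rows]

    @task
    def scrape_leaderboards(map_uids: list[str]) -> list[dict]:
        # Iterate over several days' maps but return ONE object,
        # so downstream tasks still read a single XCom.
        entries: list[dict] = []
        for uid in map_uids:
            entries.extend(fetch_tmio_leaderboard(uid))
        return entries

    @task
    def push_leaderboards_to_postgres(entries: list[dict]) -> None:
        # Explode the single XCom payload server-side; for a JSON array,
        # jsonb_array_elements() plays the role the issue calls UNNEST().
        hook = PostgresHook(postgres_conn_id="trackmania-postgres")
        hook.run(
            """
            INSERT INTO conform.leaderboards (map_uid, player_id, time_ms)
            SELECT e->>'map_uid', e->>'player_id', (e->>'time_ms')::int
            FROM jsonb_array_elements(%s::jsonb) AS e
            """,
            parameters=(json.dumps(entries),),
        )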

Consumption Layer QA

Haven't been able to test consumption layer objects because TMX is down tonight. Need to do this and merge resource-names back into main.

This raises an interesting conversation about fault tolerance. I've opted for the TMX and OpenPlanet APIs to avoid dealing with Nadeo's API directly; not having data 100% of the time is the price you pay for a much faster development cycle and more stable APIs. We rely on both TMX and TMIO, and our AI model will require both to make accurate predictions, so we don't want to re-train the model on bad or partial data.

In other words, if TMX or OpenPlanet is down, we don't want to do anything, so fault tolerance is not a problem on the data ingestion front. One way to implement that "do nothing" gate is sketched below.
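
A minimal sketch using Airflow's ShortCircuitOperator; the DAG name and health-check URLs are assumptions:

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import ShortCircuitOperator

    def sources_are_up() -> bool:
        # Returning False skips every downstream task for this run.
        for url in ("https://trackmania.exchange", "https://trackmania.io"):
            try:
                requests.head(url, timeout=10).raise_for_status()
            except requests.RequestException:
                return False
        return True

    # "schedule" is the Airflow 2.4+ kwarg; older versions use schedule_interval.
    with DAG("source_health_gate", start_date=datetime(2023, 1, 1), schedule=None) as dag:
        check_sources = ShortCircuitOperator(
            task_id="check_sources",
            python_callable=sources_are_up,
        )
        # check_sources >> the ingestion tasks it guards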

Create DAGs for joining TMX and TMIO data from Conform and promoting to Consume

We're going to want a variety of tables in the Consume layer. This is what I've come up with off the top of my head (a DDL sketch follows the list):

  • TOTD
    track name, exchange_id, authors, totd date, upload date, days from upload to totd, medal times
  • TOTD_Tags
    exchange_id, tag name (repeated for multiple tags on one map)
  • TOTD_WorldRecords
    exchange_id, wr_time, wr_player, date driven
  • TOTD_Authors
    exchange_id, author user_id, all tags on all of author's maps
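
A minimal DDL sketch of those four tables, run through the Airflow Postgres hook; the consume schema name, column names, and types are assumptions derived from the field lists above:

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    CONSUME_DDL = """
    CREATE SCHEMA IF NOT EXISTS consume;

    CREATE TABLE IF NOT EXISTS consume.totd (
        exchange_id    INTEGER PRIMARY KEY,
        track_name     TEXT,
        authors        TEXT[],
        totd_date      DATE,
        upload_date    DATE,
        days_to_totd   INTEGER,   -- totd_date - upload_date
        medal_times_ms INTEGER[]  -- author / gold / silver / bronze
    );

    CREATE TABLE IF NOT EXISTS consume.totd_tags (
        exchange_id INTEGER REFERENCES consume.totd (exchange_id),
        tag_name    TEXT  -- repeated row per tag on one map
    );

    CREATE TABLE IF NOT EXISTS consume.totd_worldrecords (
        exchange_id INTEGER REFERENCES consume.totd (exchange_id),
        wr_time_ms  INTEGER,
        wr_player   TEXT,
        date_driven DATE
    );

    CREATE TABLE IF NOT EXISTS consume.totd_authors (
        exchange_id    INTEGER REFERENCES consume.totd (exchange_id),
        author_user_id TEXT,
        author_tags    TEXT[]  -- all tags across all of this author's maps
    );
    """

    def create_consume_tables() -> None:
        PostgresHook(postgres_conn_id="trackmania-postgres").run(CONSUME_DDL)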

Project Roadmap

  • Create DAG for ingesting TMIO data into Collection layer
  • Create DAG for cleaning TMIO data and promoting to Conform
  • Create DAG for ingesting TMX data into Collection layer
  • Create DAG for cleaning TMX data and promoting to Conform
  • Create master orchestration DAG to trigger all aforementioned DAGs sequentially
  • Pay down technical debt and use volumes to persist data properly
  • Perform one-time load of TMX tags data into Conform layer XREF table
  • Create DAGs for loading additional TMX data such as multiple authors, leaderboards, etc.
  • Create DAGs for loading TMIO leaderboards data to supplement TMX data
  • #2
  • Revisit technical debt for persistent volumes; create volumes independently from docker-compose stack
  • Perform historical load of past TMIO and TMX data
  • Train AI models using views on data from Consume, select one, and bake it into the Airflow environment
  • Create DAG for running AI model with data from Consume and publishing predictions to Application layer
  • Create DAG for demoting old predictions from the Application layer down to an archive in Consume
  • Build FastAPI route /predictions/{date} to expose predictions from the Application layer
  • Build FastAPI route /totd/{date} to expose actual / historical data from the Consume layer (both routes are sketched below)
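
A minimal sketch of those two routes; only the paths come from the roadmap, while the lookup helpers and response shapes are placeholders:

    from datetime import date

    from fastapi import FastAPI, HTTPException

    app = FastAPI(title="totd-ai")

    def fetch_prediction(d: date) -> dict | None:
        # Hypothetical Application-layer lookup; wire this to Postgres in practice.
        return None

    def fetch_totd(d: date) -> dict | None:
        # Hypothetical Consume-layer lookup.
        return None

    @app.get("/predictions/{totd_date}")
    def get_prediction(totd_date: date) -> dict:
        row = fetch_prediction(totd_date)
        if row is None:
            raise HTTPException(status_code=404, detail="No prediction for that date")
        return row

    @app.get("/totd/{totd_date}")
    def get_totd(totd_date: date) -> dict:
        row = fetch_totd(totd_date)
        if row is None:
            raise HTTPException(status_code=404, detail="No TOTD data for that date")
        return row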
