
brad's Introduction

BRAD

A virtualization layer for cloud data infrastructures.

License

Copyright 2024 Massachusetts Institute of Technology

AGPL 3.0. See LICENSE.

brad's People

Contributors

geoffxy, wuziniu, mmarkakis, ferdiko, amlatyrngom, sopzha, xingalbert

Watchers

Lucian, Tim Kraska, Sam Madden, Raul, Zeyuan Shang, Tianyu Li

brad's Issues

Record and process a query trace

Right now the daemon process receives queries as they are submitted to BRAD. We will need to record these queries for the offline planning pass.

Transactional logging should be done with sampling. But it should be fine to log every analytical query.

We should remove the codepath where we send queries to the daemon - this is a scaling bottleneck. Instead the frontend server should just log to its own files.

It would also be useful to write code to load the log format into a Workload class (depends on #97).
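A minimal sketch of the front-end-local logging described above, with sampled transactional logging and full analytical logging. The class name, log format, and sampling rate are illustrative, not part of BRAD today:

```python
import random
import time

# Hypothetical sketch: each front-end server appends to its own log file,
# sampling transactional statements while recording every analytical query.
TXN_SAMPLE_RATE = 0.01  # assumed sampling rate; tune as needed

class QueryLogger:
    def __init__(self, log_path, txn_sample_rate=TXN_SAMPLE_RATE):
        self._log_path = log_path
        self._txn_sample_rate = txn_sample_rate

    def log(self, query: str, is_transactional: bool) -> bool:
        """Append the query to this server's log. Returns True if recorded."""
        if is_transactional and random.random() >= self._txn_sample_rate:
            return False  # skip most transactional statements (sampling)
        with open(self._log_path, "a") as f:
            f.write(f"{time.time()}\t{int(is_transactional)}\t{query}\n")
        return True
```

A separate offline step could then parse these tab-separated lines when constructing the Workload class.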

Consider replacing sqlglot

We currently use sqlglot for SQL parsing and manipulation. It's convenient and has a user-friendly API. The problem is that it is written in Python and could be a bottleneck on the critical path. It's worth investigating:

  • Whether sqlglot ends up being a bottleneck
  • If there are alternatives written in native code (with Python bindings) that support the features we need (parsing, AST manipulation, conversion back to SQL)
  • Whether sqlglot supports enough of the Postgres SQL dialect that we care about

Investigate options for handling multiple concurrent connections

For convenience, we currently use a ThreadPoolExecutor to handle client requests. But because of the Python GIL, we will not be able to process multiple requests in parallel (even if the underlying server has multiple cores). We will likely need to revisit this design decision depending on the experiments we plan to run.

Some options

  • A multiprocess architecture (e.g., use ProcessPoolExecutor) - we cannot easily share state among requests in this way
  • Switch to a non-Python front end (e.g., C++) and embed a Python interpreter to run Python code
  • Move computationally heavy code into C++ and release the GIL
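The first option can be sketched as follows. The handler below is a made-up CPU-bound stand-in; the point is that each worker process has its own GIL, at the cost of not sharing in-memory state across requests (the trade-off noted above):

```python
from concurrent.futures import ProcessPoolExecutor

def handle_request(payload: int) -> int:
    """Stand-in for a CPU-bound request handler (e.g., parsing/planning)."""
    total = 0
    for i in range(payload):
        total += i * i
    return total

if __name__ == "__main__":
    # Each worker is a separate process with its own GIL, so CPU-bound
    # handlers run in parallel across cores. Shared routing state would
    # need to live in the daemon or an external store.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(handle_request, [10_000] * 8))
```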

Implement system blueprint abstraction

We already have a data blueprint. The implementation that probably makes the most sense is to generalize DataBlueprint into a Blueprint, since the code that currently uses DataBlueprint would benefit from access to the overall blueprint.
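A hypothetical sketch of the proposed shape: the existing data blueprint (table placement) becomes one part of a larger Blueprint that also carries system-level state such as engine provisioning. All field and class names here are illustrative only:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TablePlacement:
    table_name: str
    locations: List[str]  # e.g., ["aurora", "redshift", "athena"]

@dataclass
class Provisioning:
    instance_type: str
    num_nodes: int

@dataclass
class Blueprint:
    # What was previously the data blueprint:
    table_placements: Dict[str, TablePlacement] = field(default_factory=dict)
    # New system-level state that the planner also needs:
    engine_provisioning: Dict[str, Provisioning] = field(default_factory=dict)
```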

Explore and implement AWS provisioning abstractions

Right now we are manually setting up our AWS deployment (via the web console). BRAD will need to automatically make provisioning changes. We need to set up abstractions and utilities within BRAD to initiate these provisioning changes.

  • Explore AWS SDK integration (possibly boto3)
  • Decorate the DB engines with the appropriate permissions for S3 export/import
  • Decide how to represent the provisioning details for system planning
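As a sketch of the boto3 option: a provisioning change such as resizing an Aurora instance maps to the RDS client's modify_db_instance call (client("rds").modify_db_instance(**kwargs)). The helper below only assembles the request, so it runs without AWS credentials; the instance identifier is made up:

```python
def aurora_resize_request(instance_id: str, instance_class: str) -> dict:
    """Build the kwargs for boto3's rds modify_db_instance to resize an
    Aurora instance. BRAD would pass this dict to the live client."""
    return {
        "DBInstanceIdentifier": instance_id,
        "DBInstanceClass": instance_class,
        # Apply now rather than waiting for the next maintenance window.
        "ApplyImmediately": True,
    }

req = aurora_resize_request("brad-aurora-1", "db.r6g.xlarge")
```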

Make the BRAD daemon the "main" system component

We should restructure the BRAD architecture such that the daemon is the main component. It should be responsible for launching frontend servers and shutting them down. Right now it is the other way around because the system organically evolved with the frontend server being implemented first.

Why do this? Because the daemon forms BRAD's control plane. It handles system planning and mesh restructuring. The frontend server is just meant to handle user requests, and later we may have multiple of them.

Support multiple BRAD front end servers

See #17 for context. BRAD has a scalability ceiling because it processes requests in a single-threaded event loop, which is unfortunately a limitation of the Python runtime. If we need to scale beyond our current limits, we should run multiple instances of our front end (currently BradServer). This will require some light refactoring, but should be possible.

Fix transaction handling

We use pyodbc, which runs every statement inside a transaction by default. This gives end users a confusing experience in the iohtap CLI: they must issue a COMMIT to persist writes, even if they never issued a BEGIN.

We can fix this by turning on autocommit when establishing connections to the DBMSes. But we need to double-check that we are not relying on the current behavior elsewhere inside the server.
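With pyodbc the fix is a one-liner at connection time: pyodbc.connect(connection_string, autocommit=True). The same semantics can be demonstrated with sqlite3's autocommit mode (isolation_level=None), where each write is committed as it runs and a second connection sees it without any explicit COMMIT:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")  # committed immediately

reader = sqlite3.connect(db_path)  # a second connection sees the write
count = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```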

Implement a first pass workload generator and runner

No need to go overboard in generality. But we need to have a way to run workloads against BRAD and the underlying engines individually.

  • Issue queries via ODBC and the BRAD RPC interface
  • Scale a workload up in size and number of clients
    • Workloads can be hardcoded - no need for excessive generality here
  • Measure query latency (p50, p99)
  • Measure workload throughput
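A minimal sketch of the runner's measurement side. The issue_query callable is a placeholder for whatever sends one query (ODBC or the BRAD RPC interface); the percentile helper uses a simple nearest-rank definition:

```python
import time

def run_workload(issue_query, queries, repetitions=1):
    """Issue queries serially; return (per-query latencies, throughput in qps).

    issue_query is any callable that sends one query and blocks until the
    result arrives.
    """
    latencies = []
    wall_start = time.perf_counter()
    for _ in range(repetitions):
        for q in queries:
            start = time.perf_counter()
            issue_query(q)
            latencies.append(time.perf_counter() - start)
    elapsed = time.perf_counter() - wall_start
    return latencies, len(latencies) / elapsed

def percentile(latencies, p):
    """Nearest-rank percentile; p in [0, 100]. Use p=50 and p=99 above."""
    ordered = sorted(latencies)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[idx]
```

Scaling the number of clients would wrap run_workload in multiple processes or connections; the hardcoded-workload bullet above means queries can simply be a fixed list.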

Extend the data sync command to schedule and run transformations

  • Extend the schema description configuration file to include data dependencies
  • Implement a data placement plan abstraction (serializable)
  • Add transformation specifications to the schema description config file
  • Ensure routing is location-aware (not all tables are going to be available on all engines)
  • Update Aurora table extraction to work with the transforms we need to specify
  • Implement data sync plan logical abstractions
  • Implement the physical data sync operators
  • Implement the data sync planner
  • Implement a function that lowers the logical plan into a physical plan
  • Extend the data sync executor to run the data sync plan (execution operators, etc.)
  • Handle skipped extractions (because no changes were made to the table)
  • Verify (or handle) cases of delete-insert-delete
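The lowering step in the checklist above can be sketched as follows. Operator names and the plan shape are illustrative; the real data sync plan will have more node types (transforms, skipped extractions, and so on):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogicalSync:
    """One logical data sync: move a table's changes to other engines."""
    table: str
    source_engine: str
    dest_engines: List[str]

@dataclass
class PhysicalOp:
    description: str

def lower(plan: List[LogicalSync]) -> List[PhysicalOp]:
    """Lower each logical sync into one extract plus per-destination loads."""
    ops: List[PhysicalOp] = []
    for sync in plan:
        ops.append(PhysicalOp(f"extract {sync.table} from {sync.source_engine}"))
        for dest in sync.dest_engines:
            ops.append(PhysicalOp(f"load {sync.table} into {dest}"))
    return ops
```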

Investigate and implement end-to-end workloads

We need a "data mesh" workload (analytical queries and transactions). We should implement two to three workloads to showcase different parts of the system.

Possible workloads to adapt

  • IMDB dataset and queries (@wuziniu and Dean are exploring this option) (#36)
    • Need to expand this workload to include some plausible transactions
  • Adapt the HATtrick benchmark from Wisconsin (#37)
    • May need to alter the schemas to make the workload more transactionally friendly (it's based on SSB) and to showcase ETLs
    • May need to alter the analytical queries to make them more interesting
  • TPC-DS
    • Already includes a data maintenance segment
    • Already includes analytical queries (should see if they expose interesting routing decisions)
    • Need to add plausible transactions

Merge `DBType` and `Location`

We have them separate because data on S3 (for Athena) can also be read by Redshift (though we don't leverage this capability now). To simplify the code, we should just unify these two enums - having them separate adds unneeded code complexity at this stage in the project.
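A sketch of the proposed unification: one enum serves both as the engine type and as a data location, so call sites that previously took either a DBType or a Location take the same type. Member names are illustrative:

```python
import enum

class Engine(enum.Enum):
    Aurora = "aurora"
    Redshift = "redshift"
    Athena = "athena"

def route_query(available: set) -> "Engine":
    """Trivial policy for illustration: prefer Redshift when available.
    Previously this would have mixed DBType and Location values."""
    return Engine.Redshift if Engine.Redshift in available else Engine.Athena
```

If we later exploit Redshift reading S3 data directly, a separate location concept can be reintroduced then.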

Add more complex forecasters: moving average, linear

Beyond just forecasting that future values of the cloudwatch metrics will match the most recent value, we should explore using a moving average of their recent values, or fitting a linear model to predict the next value.

Later on, we can also add more complicated forecasting techniques that take seasonality into account, like Prophet by Meta.
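The two proposed forecasters can be sketched in a few lines over a metric series ordered oldest to newest. This is pure Python for illustration; a real implementation would run over the CloudWatch metric values BRAD already collects:

```python
def moving_average_forecast(values, window=3):
    """Predict the next value as the mean of the last `window` values."""
    recent = values[-window:]
    return sum(recent) / len(recent)

def linear_forecast(values):
    """Fit y = a*t + b by least squares over t = 0..n-1; predict t = n."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return intercept + slope * n

series = [10.0, 12.0, 11.0, 13.0, 14.0]
```

For series, the moving average predicts (11 + 13 + 14) / 3 ≈ 12.67, while the linear fit (slope 0.9, intercept 10.2) predicts 14.7 for the next point.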

Allow the bulk load to restart from where it left off

Treat tables that already contain data as "loaded" and only apply the bulk load to tables that are empty. This is useful because the bulk load sometimes gets "stuck" in the middle of processing Redshift data (I suspect some kind of synchronization bug in aioodbc).
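The resume check is a simple emptiness test per table. The sketch below uses sqlite3 as a stand-in for the real engine connections:

```python
import sqlite3

def tables_needing_load(conn, tables):
    """Return the subset of tables that are empty and still need loading."""
    todo = []
    for table in tables:
        # Table names come from BRAD's own schema config, not user input,
        # so string interpolation is acceptable for this sketch.
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        if count == 0:
            todo.append(table)  # empty: the bulk load did not finish it
    return todo
```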

Investigate better options for client/server interaction

We currently open a socket and implement a simple request-response protocol. This works for now, but we should avoid complicating the protocol if we need to add in more features. Instead, we should implement something more robust.

  • Server/daemon interaction: Consider implementing RPCs (e.g., with gRPC)
  • Client/server interaction: Ideally via the PostgreSQL wire protocol
    • gRPC can also be a stopgap method
