A virtualization layer for cloud data infrastructures.
Home Page: https://dsg.csail.mit.edu/projects/brad
License: GNU Affero General Public License v3.0
Copyright 2024 Massachusetts Institute of Technology
AGPL 3.0. See LICENSE.
Right now the daemon process receives queries as they are submitted to BRAD. We will need to record these queries for the offline planning pass.
Transactional logging should be done with sampling. But it should be fine to log every analytical query.
We should remove the codepath where we send queries to the daemon - this is a scaling bottleneck. Instead the frontend server should just log to its own files.
It would also be useful to write code to load the log format into a Workload
class (depends on #97).
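A rough sketch of what per-frontend logging with sampling could look like (the file name, sample rate, and log format here are made up, not from the codebase):

```python
import logging
import random
from logging.handlers import RotatingFileHandler

# Hypothetical sketch: each frontend server appends to its own log file,
# sampling transactional queries and recording every analytical query.
TXN_SAMPLE_RATE = 0.01  # assumed value; would come from the config

logger = logging.getLogger("brad.query_log")
logger.setLevel(logging.INFO)
logger.addHandler(
    RotatingFileHandler("frontend_0_queries.log", maxBytes=64 * 1024 * 1024, backupCount=4)
)

def log_query(sql: str, is_transactional: bool) -> None:
    if is_transactional and random.random() >= TXN_SAMPLE_RATE:
        return  # sampled out
    logger.info("%s\t%s", "txn" if is_transactional else "olap", sql)
```

The log files could then be replayed by the loader that builds the Workload class.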
We currently use sqlglot for SQL parsing and manipulation. It's convenient and has a user-friendly API. The problem is that it is written in Python and could be a bottleneck on the critical path. It's worth investigating:
- Whether sqlglot ends up being a bottleneck
- Whether sqlglot supports enough of the Postgres SQL dialect that we care about

See #25.
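As a quick way to probe both questions (the query text and iteration count are made up):

```python
import timeit

import sqlglot

QUERY = (
    "SELECT o.id, SUM(l.amount) FROM orders AS o "
    "JOIN lineitem AS l ON l.order_id = o.id "
    "WHERE o.placed_at >= '2023-01-01' GROUP BY o.id"
)

# Dialect coverage: parse with the Postgres dialect and round-trip back to SQL.
ast = sqlglot.parse_one(QUERY, read="postgres")
print(ast.sql(dialect="postgres"))

# Overhead: rough per-parse latency on the critical path.
per_parse = timeit.timeit(lambda: sqlglot.parse_one(QUERY, read="postgres"), number=1000) / 1000
print(f"~{per_parse * 1e6:.0f} us per parse")
```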
So that we keep consistent terminology.
Part of the data sync engine.
Call into one of the three systems.
See the comments in #75 for context.
For convenience, we currently use a ThreadPoolExecutor
to handle client requests. But because of the Python GIL, we will not be able to process multiple requests in parallel (even if the underlying server has multiple cores). We will likely need to revisit this design decision depending on the experiments we plan to run.
Some options:
- Use multiple processes (e.g., a ProcessPoolExecutor) - we cannot easily share state among requests in this way

See #25. We will want to change the schema slightly to capture data transformations across tables.
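Returning to the request-handling options above, a minimal sketch of the multi-process route (the handler body is a placeholder):

```python
from concurrent.futures import ProcessPoolExecutor

def handle_request(raw_request: bytes) -> bytes:
    # Runs in a separate worker process, so the GIL is not a bottleneck.
    # But any state it needs (caches, connections, the current blueprint)
    # must be re-created here or passed in explicitly - nothing is shared.
    return raw_request.upper()  # placeholder for real request handling

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(handle_request, b"select 1") for _ in range(8)]
        for fut in futures:
            print(fut.result())
```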
We'll use pytest
to run tests. We should also hook it into the CI.
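For illustration, a file like this would be picked up automatically (names are arbitrary):

```python
# tests/test_smoke.py
def test_smoke():
    assert 1 + 1 == 2
```

Hooking it into CI then just means running pytest as a workflow step.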
We already have a data blueprint. Probably the implementation that makes the most sense is to modify the DataBlueprint into a Blueprint. It makes sense to have access to the overall blueprint where we are currently using DataBlueprints.
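A rough sketch of the shape this refactor could take (the field names are hypothetical, not the actual class layout):

```python
from dataclasses import dataclass

@dataclass
class DataBlueprint:
    # Hypothetical: table name -> engines holding a copy of the table.
    table_locations: dict

@dataclass
class Blueprint:
    data: DataBlueprint
    # Room to grow: provisioning choices, routing policy, etc.
```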
We need to extend the brad admin
tooling with a bulk load utility. We need this tool to be able to load datasets into the underlying engines before running any experiments.
Right now we are manually setting up our AWS deployment (via the web console). BRAD will need to automatically make provisioning changes. We need to set up abstractions and utilities within BRAD to initiate these provisioning changes (e.g., via boto3).
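A sketch of what initiating these changes could look like with boto3 (the identifiers, instance class, and node type below are made up):

```python
import boto3

# Scale an Aurora instance up or down.
rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="brad-aurora-writer",  # hypothetical identifier
    DBInstanceClass="db.r6g.2xlarge",
    ApplyImmediately=True,
)

# Resize the Redshift cluster.
redshift = boto3.client("redshift")
redshift.resize_cluster(
    ClusterIdentifier="brad-redshift",  # hypothetical identifier
    NodeType="ra3.xlplus",
    NumberOfNodes=4,
)
```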
We should restructure the BRAD architecture such that the daemon is the main component. It should be responsible for launching frontend servers and shutting them down. Right now it is the other way around, because the system organically evolved with the frontend server being implemented first.
Why do this? Because the daemon forms BRAD's control plane. It handles system planning and mesh restructuring. The frontend server is just meant to handle user requests, and later we may have multiple of them.
See #17 for context. BRAD will have a scalability ceiling since it processes requests in a single-threaded event loop, but this is unfortunately a limitation of the Python runtime. If we need to scale beyond our current limits, we should run multiple instances of our front end (currently BradServer
). This will require some light refactoring, but should be possible.
We currently implement the REPL interface by hand (iohtap cli). We should leverage Python's standard library cmd module instead; we would then get goodies like command history for free.
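A minimal sketch of a cmd-based REPL (command names are illustrative):

```python
import cmd

class BradRepl(cmd.Cmd):
    prompt = "brad> "

    def do_query(self, line: str) -> None:
        """query <sql> - submit a query."""
        print(f"would run: {line}")

    def do_exit(self, line: str) -> bool:
        """exit - leave the REPL."""
        return True  # returning True stops cmdloop()

if __name__ == "__main__":
    # With readline available, history and line editing come for free.
    BradRepl().cmdloop()
```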
We use pyodbc
, which runs all statements in a transaction. This ends up giving end-users a weird experience when using iohtap cli
because they need to issue a COMMIT
to commit writes (even if they never issued a BEGIN
).
We can fix this by turning on autocommit
when establishing connections to the DBMSes. But we need to double-check that we are not relying on the transactional behavior elsewhere inside the server.
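For reference, enabling it at connection time looks like this (the DSN is a placeholder):

```python
import pyodbc

# autocommit=True makes each statement commit on its own unless the user
# explicitly opens a transaction.
conn = pyodbc.connect("DSN=brad_aurora", autocommit=True)
cur = conn.cursor()
cur.execute("INSERT INTO t VALUES (1)")  # visible immediately, no COMMIT needed

# It can also be toggled on an existing connection:
conn.autocommit = False
```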
Depends on #63.
See config/config_sample.yml
for an example configuration file. As we add more configuration values, the file will get messier.
Currently, the only way to submit queries to IOHTAP is via the CLI. We need programmatic access for performance benchmarking purposes.
Representing table names with a custom type (for type checking) is more trouble than it is worth.
Beyond maintaining metrics for RDS as a whole, maintain a separate set for Aurora writer instances.
No need to go overboard in generality. But we need to have a way to run workloads against BRAD and the underlying engines individually.
Needed for scoring.
Use the existing physical operators in the data_sync
module if possible.
Integrate the forecasted levels of system metrics into the overall system planning process.
Needed for filtering and scoring.
The sooner we do this the better.
We need a "data mesh" workload (analytical queries and transactions). We should implement 2 - 3 workloads to showcase different parts of the system.
Possible workloads to adapt
They are somewhat "left hanging" if the server shuts down first.
See the comments in #16.
We currently believe that we will use metrics signals to know when to trigger re-provisioning. We also plan to use system metrics to detect poor provisioning decisions. We should see if these metrics are easily queryable from AWS, and explore how to best process them.
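The metrics are queryable through boto3's CloudWatch client; a sketch (the namespace and dimension values are illustrative):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "brad-aurora-writer"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```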
Needed to estimate transition costs. Also useful for forecasting changes in the dataset size.
We will eventually need connection pooling. We should look into options for any pooling solutions we can re-use, or write our own if needed.
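If we end up rolling our own, a minimal asyncio pool might look roughly like this (connect_fn is an assumed async factory; a real pool also needs health checks and reconnection logic):

```python
import asyncio

class SimplePool:
    def __init__(self, connect_fn, size: int = 8):
        self._connect_fn = connect_fn
        self._size = size
        self._queue: asyncio.Queue = asyncio.Queue()

    async def start(self) -> None:
        for _ in range(self._size):
            await self._queue.put(await self._connect_fn())

    async def acquire(self):
        # Blocks when all connections are checked out.
        return await self._queue.get()

    def release(self, conn) -> None:
        self._queue.put_nowait(conn)
```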
We have them separate because data on S3 (for Athena) can also be read by Redshift (though we don't leverage this capability now). To simplify the code, we should just unify these two enums - having them separate adds unneeded code complexity at this stage in the project.
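Concretely, the unified enum could be as simple as (names are hypothetical):

```python
from enum import Enum

class Engine(Enum):
    Aurora = "aurora"
    Redshift = "redshift"
    # Athena data lives on S3 and can also be read by Redshift.
    Athena = "athena"
```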
We will likely start with decision trees - something simple and fast.
Beyond just forecasting that future values of the CloudWatch metrics will match the most recent value, we should explore using a moving average of their recent values, or fitting a linear model to predict the next value.
Later on, we can also add more complicated forecasting techniques that take seasonality into account, like Prophet by Meta.
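A sketch of the simpler baselines side by side (the history values are made up):

```python
import numpy as np

history = np.array([41.0, 44.5, 43.2, 47.8, 49.1])  # e.g., recent CPU readings

# Baseline: next value matches the most recent value.
naive = history[-1]

# Moving average over the last k observations.
k = 3
moving_avg = history[-k:].mean()

# Linear model: fit value ~ a * t + b and extrapolate one step ahead.
t = np.arange(len(history))
a, b = np.polyfit(t, history, deg=1)
linear = a * len(history) + b

print(naive, moving_avg, linear)
```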
Treat tables with data in them as "loaded". Only apply the bulk load to tables that are empty. This is useful because sometimes the bulk load gets "stuck" in the middle of processing Redshift data (I suspect some kind of synchronization bug in aioodbc
).
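A sketch of the emptiness check (the cursor usage is generic and run_load is a hypothetical per-table load routine):

```python
def table_is_loaded(cursor, table_name: str) -> bool:
    # Table names come from our own schema definitions, so interpolating
    # them here is trusted input.
    cursor.execute(f"SELECT 1 FROM {table_name} LIMIT 1")
    return cursor.fetchone() is not None

def bulk_load(cursor, tables) -> None:
    for table in tables:
        if table_is_loaded(cursor, table):
            print(f"Skipping {table}: already loaded")
            continue
        run_load(cursor, table)  # hypothetical per-table load routine
```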
We currently open a socket and implement a simple request-response protocol. This works for now, but we should avoid complicating the protocol if we need to add in more features. Instead, we should implement something more robust.
- iohtap in development mode
- config.yml for development