
brad's Introduction

BRAD

A virtualization layer for cloud data infrastructures.

License

Copyright 2024 Massachusetts Institute of Technology

AGPL 3.0. See LICENSE.

brad's People

Contributors

geoffxy, wuziniu, mmarkakis, ferdiko, amlatyrngom, sopzha, xingalbert

Watchers

Lucian, Tim Kraska, Sam Madden, Raul, Zeyuan Shang, Tianyu Li

brad's Issues

Record and process a query trace

Right now the daemon process receives queries as they are submitted to BRAD. We will need to record these queries for the offline planning pass.

Transactional logging should be done with sampling. But it should be fine to log every analytical query.

We should remove the codepath where we send queries to the daemon - this is a scaling bottleneck. Instead the frontend server should just log to its own files.

It would also be useful to write code to load the log format into a Workload class (depends on #97).
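A minimal sketch of the front-end-local logging described above, with sampled transactional logging and full analytical logging. The class name, log format, and sampling rate are illustrative, not part of BRAD today:

```python
import random
import time

# Hypothetical sketch: each front-end server appends to its own log file,
# sampling transactional statements while recording every analytical query.
TXN_SAMPLE_RATE = 0.01  # assumed sampling rate; tune as needed

class QueryLogger:
    def __init__(self, log_path, txn_sample_rate=TXN_SAMPLE_RATE):
        self._log_path = log_path
        self._txn_sample_rate = txn_sample_rate

    def log(self, query: str, is_transactional: bool) -> bool:
        """Append the query to this server's log. Returns True if recorded."""
        if is_transactional and random.random() >= self._txn_sample_rate:
            return False  # skip most transactional statements (sampling)
        with open(self._log_path, "a") as f:
            f.write(f"{time.time()}\t{int(is_transactional)}\t{query}\n")
        return True
```

A separate offline step could then parse these tab-separated lines when constructing the Workload class.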

Consider replacing sqlglot

We currently use sqlglot for SQL parsing and manipulation. It's convenient and has a user-friendly API. The problem is that it is written in Python and could be a bottleneck on the critical path. It's worth investigating:

  • Whether sqlglot ends up being a bottleneck
  • If there are alternatives written in native code (with Python bindings) that support the features we need (parsing, AST manipulation, conversion back to SQL)
  • Whether sqlglot supports enough of the Postgres SQL dialect that we care about

Investigate options for handling multiple concurrent connections

For convenience, we currently use a ThreadPoolExecutor to handle client requests. But because of the Python GIL, we will not be able to process multiple requests in parallel (even if the underlying server has multiple cores). We will likely need to revisit this design decision depending on the experiments we plan to run.

Some options

  • A multiprocess architecture (e.g., use ProcessPoolExecutor) - we cannot easily share state among requests in this way
  • Switch to a non-Python front end (e.g., C++) and embed a Python interpreter to run Python code
  • Move computationally heavy code into C++ and release the GIL
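The first option can be sketched as follows. The handler below is a made-up CPU-bound stand-in; the point is that each worker process has its own GIL, at the cost of not sharing in-memory state across requests (the trade-off noted above):

```python
from concurrent.futures import ProcessPoolExecutor

def handle_request(payload: int) -> int:
    """Stand-in for a CPU-bound request handler (e.g., parsing/planning)."""
    total = 0
    for i in range(payload):
        total += i * i
    return total

if __name__ == "__main__":
    # Each worker is a separate process with its own GIL, so CPU-bound
    # handlers run in parallel across cores. Shared routing state would
    # need to live in the daemon or an external store.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(handle_request, [10_000] * 8))
```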

Implement system blueprint abstraction

We already have a data blueprint. The implementation that probably makes the most sense is to generalize DataBlueprint into a Blueprint, since the code that currently uses DataBlueprint would benefit from access to the overall blueprint.
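A hypothetical sketch of the proposed shape: the existing data blueprint (table placement) becomes one part of a larger Blueprint that also carries system-level state such as engine provisioning. All field and class names here are illustrative only:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TablePlacement:
    table_name: str
    locations: List[str]  # e.g., ["aurora", "redshift", "athena"]

@dataclass
class Provisioning:
    instance_type: str
    num_nodes: int

@dataclass
class Blueprint:
    # What was previously the data blueprint:
    table_placements: Dict[str, TablePlacement] = field(default_factory=dict)
    # New system-level state that the planner also needs:
    engine_provisioning: Dict[str, Provisioning] = field(default_factory=dict)
```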

Explore and implement AWS provisioning abstractions

Right now we are manually setting up our AWS deployment (via the web console). BRAD will need to automatically make provisioning changes. We need to set up abstractions and utilities within BRAD to initiate these provisioning changes.

  • Explore AWS SDK integration (possibly boto3)
  • Decorate the DB engines with the appropriate permissions for S3 export/import
  • Decide how to represent the provisioning details for system planning
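As a sketch of the boto3 option: a provisioning change such as resizing an Aurora instance maps to the RDS client's modify_db_instance call (client("rds").modify_db_instance(**kwargs)). The helper below only assembles the request, so it runs without AWS credentials; the instance identifier is made up:

```python
def aurora_resize_request(instance_id: str, instance_class: str) -> dict:
    """Build the kwargs for boto3's rds modify_db_instance to resize an
    Aurora instance. BRAD would pass this dict to the live client."""
    return {
        "DBInstanceIdentifier": instance_id,
        "DBInstanceClass": instance_class,
        # Apply now rather than waiting for the next maintenance window.
        "ApplyImmediately": True,
    }

req = aurora_resize_request("brad-aurora-1", "db.r6g.xlarge")
```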

Make the BRAD daemon the "main" system component

We should restructure the BRAD architecture such that the daemon is the main component. It should be responsible for launching frontend servers and shutting them down. Right now it is the other way around because the system organically evolved with the frontend server being implemented first.

Why do this? Because the daemon forms BRAD's control plane. It handles system planning and mesh restructuring. The frontend server is just meant to handle user requests, and later we may have multiple of them.

Support multiple BRAD front end servers

See #17 for context. BRAD has a scalability ceiling because it processes requests in a single-threaded event loop, which is unfortunately a limitation of the Python runtime. If we need to scale beyond our current limits, we should run multiple instances of our front end (currently BradServer). This will require some light refactoring, but should be possible.

Fix transaction handling

We use pyodbc, which runs every statement inside a transaction by default. This gives end users a confusing experience in the iohtap CLI: they must issue a COMMIT to persist writes, even if they never issued a BEGIN.

We can fix this by turning on autocommit when establishing connections to the DBMSes. But we need to double-check that we are not relying on the current behavior elsewhere inside the server.
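With pyodbc the fix is a one-liner at connection time: pyodbc.connect(connection_string, autocommit=True). The same semantics can be demonstrated with sqlite3's autocommit mode (isolation_level=None), where each write is committed as it runs and a second connection sees it without any explicit COMMIT:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")  # committed immediately

reader = sqlite3.connect(db_path)  # a second connection sees the write
count = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```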

Implement a first pass workload generator and runner

No need to go overboard in generality. But we need to have a way to run workloads against BRAD and the underlying engines individually.

  • Issue queries via ODBC and the BRAD RPC interface
  • Scale a workload up in size and number of clients
    • Workloads can be hardcoded - no need for excessive generality here
  • Measure query latency (p50, p99)
  • Measure workload throughput
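A minimal sketch of the runner's measurement side. The issue_query callable is a placeholder for whatever sends one query (ODBC or the BRAD RPC interface); the percentile helper uses a simple nearest-rank definition:

```python
import time

def run_workload(issue_query, queries, repetitions=1):
    """Issue queries serially; return (per-query latencies, throughput in qps).

    issue_query is any callable that sends one query and blocks until the
    result arrives.
    """
    latencies = []
    wall_start = time.perf_counter()
    for _ in range(repetitions):
        for q in queries:
            start = time.perf_counter()
            issue_query(q)
            latencies.append(time.perf_counter() - start)
    elapsed = time.perf_counter() - wall_start
    return latencies, len(latencies) / elapsed

def percentile(latencies, p):
    """Nearest-rank percentile; p in [0, 100]. Use p=50 and p=99 above."""
    ordered = sorted(latencies)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[idx]
```

Scaling the number of clients would wrap run_workload in multiple processes or connections; the hardcoded-workload bullet above means queries can simply be a fixed list.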

Extend the data sync command to schedule and run transformations

  • Extend the schema description configuration file to include data dependencies
  • Implement a data placement plan abstraction (serializable)
  • Add transformation specifications to the schema description config file
  • Ensure routing is location-aware (not all tables are going to be available on all engines)
  • Update Aurora table extraction to work with the transforms we need to specify
  • Implement data sync plan logical abstractions
  • Implement the physical data sync operators
  • Implement the data sync planner
  • Implement a function that lowers the logical plan into a physical plan
  • Extend the data sync executor to run the data sync plan (execution operators, etc.)
  • Handle skipped extractions (because no changes were made to the table)
  • Verify (or handle) cases of delete-insert-delete
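The lowering step in the checklist above can be sketched as follows. Operator names and the plan shape are illustrative; the real data sync plan will have more node types (transforms, skipped extractions, and so on):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogicalSync:
    """One logical data sync: move a table's changes to other engines."""
    table: str
    source_engine: str
    dest_engines: List[str]

@dataclass
class PhysicalOp:
    description: str

def lower(plan: List[LogicalSync]) -> List[PhysicalOp]:
    """Lower each logical sync into one extract plus per-destination loads."""
    ops: List[PhysicalOp] = []
    for sync in plan:
        ops.append(PhysicalOp(f"extract {sync.table} from {sync.source_engine}"))
        for dest in sync.dest_engines:
            ops.append(PhysicalOp(f"load {sync.table} into {dest}"))
    return ops
```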

Investigate and implement end-to-end workloads

We need a "data mesh" workload (analytical queries and transactions). We should implement two to three workloads to showcase different parts of the system.

Possible workloads to adapt

  • IMDB dataset and queries (@wuziniu and Dean are exploring this option) (#36)
    • Need to expand this workload to include some plausible transactions
  • Adapt the HATtrick benchmark from Wisconsin (#37)
    • May need to alter the schemas to make the workload more transactionally friendly (it's based on SSB) and to showcase ETLs
    • May need to alter the analytical queries to make them more interesting
  • TPC-DS
    • Already includes a data maintenance segment
    • Already includes analytical queries (should see if they expose interesting routing decisions)
    • Need to add plausible transactions

Merge `DBType` and `Location`

We have them separate because data on S3 (for Athena) can also be read by Redshift (though we don't leverage this capability now). To simplify the code, we should just unify these two enums - having them separate adds unneeded code complexity at this stage in the project.
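A sketch of the proposed unification: one enum serves both as the engine type and as a data location, so call sites that previously took either a DBType or a Location take the same type. Member names are illustrative:

```python
import enum

class Engine(enum.Enum):
    Aurora = "aurora"
    Redshift = "redshift"
    Athena = "athena"

def route_query(available: set) -> "Engine":
    """Trivial policy for illustration: prefer Redshift when available.
    Previously this would have mixed DBType and Location values."""
    return Engine.Redshift if Engine.Redshift in available else Engine.Athena
```

If we later exploit Redshift reading S3 data directly, a separate location concept can be reintroduced then.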

Add more complex forecasters: moving average, linear

Beyond just forecasting that future values of the cloudwatch metrics will match the most recent value, we should explore using a moving average of their recent values, or fitting a linear model to predict the next value.

Later on, we can also add more complicated forecasting techniques that take seasonality into account, like Prophet by Meta.
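The two proposed forecasters can be sketched in a few lines over a metric series ordered oldest to newest. This is pure Python for illustration; a real implementation would run over the CloudWatch metric values BRAD already collects:

```python
def moving_average_forecast(values, window=3):
    """Predict the next value as the mean of the last `window` values."""
    recent = values[-window:]
    return sum(recent) / len(recent)

def linear_forecast(values):
    """Fit y = a*t + b by least squares over t = 0..n-1; predict t = n."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return intercept + slope * n

series = [10.0, 12.0, 11.0, 13.0, 14.0]
```

For series, the moving average predicts (11 + 13 + 14) / 3 ≈ 12.67, while the linear fit (slope 0.9, intercept 10.2) predicts 14.7 for the next point.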

Allow the bulk load to restart from where it left off

Treat tables that already contain data as "loaded" and only apply the bulk load to tables that are empty. This is useful because the bulk load sometimes gets "stuck" in the middle of processing Redshift data (I suspect some kind of synchronization bug in aioodbc).
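The resume check is a simple emptiness test per table. The sketch below uses sqlite3 as a stand-in for the real engine connections:

```python
import sqlite3

def tables_needing_load(conn, tables):
    """Return the subset of tables that are empty and still need loading."""
    todo = []
    for table in tables:
        # Table names come from BRAD's own schema config, not user input,
        # so string interpolation is acceptable for this sketch.
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        if count == 0:
            todo.append(table)  # empty: the bulk load did not finish it
    return todo
```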

Investigate better options for client/server interaction

We currently open a socket and implement a simple request-response protocol. This works for now, but we should avoid complicating the protocol if we need to add in more features. Instead, we should implement something more robust.

  • Server/daemon interaction: Consider implementing RPCs (e.g., with gRPC)
  • Client/server interaction: Ideally via the PostgreSQL wire protocol
    • gRPC can also be a stopgap method
