

Ballista: Distributed Compute Platform


Overview

Ballista is a distributed compute platform primarily implemented in Rust, using Apache Arrow as the memory model. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

Ballista can be deployed in Kubernetes, or as a standalone cluster using etcd for discovery.

Architecture Diagram

The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.

Ballista Architecture Diagram

How does this compare to Apache Spark?

Ballista differs from Apache Spark in many ways.

  • Because Ballista is implemented in Rust, memory usage can be up to 100x lower than with Apache Spark, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
  • Also because of Rust, there are no "cold start" overheads: the first run of a query can be up to 10x faster than with Apache Spark.
  • The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
  • Ballista is columnar rather than row-based, meaning that it can take advantage of vectorized processing both on the CPU (using SIMD) and on the GPU. GPU support is not available yet but is planned for a future release.
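
To illustrate the columnar point in the list above, here is a minimal sketch in plain Rust using only the standard library. It is not Ballista or Arrow code: the `TripRow` and `TripColumns` types and the `max_fare_rows` and `max_fare_column` functions are hypothetical names introduced for this example. In the columnar layout, the aggregate scans one contiguous array of `f64` values, which is the kind of loop a compiler may be able to vectorize.

```rust
/// Row-based layout: each record keeps all of its fields together.
#[derive(Debug, Clone, PartialEq)]
pub struct TripRow {
    pub passenger_count: u32,
    pub fare_amount: f64,
}

/// Columnar layout: each field is stored as its own contiguous array.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct TripColumns {
    pub passenger_count: Vec<u32>,
    pub fare_amount: Vec<f64>,
}

impl TripColumns {
    /// Convert a slice of rows into column vectors.
    pub fn from_rows(rows: &[TripRow]) -> Self {
        TripColumns {
            passenger_count: rows.iter().map(|r| r.passenger_count).collect(),
            fare_amount: rows.iter().map(|r| r.fare_amount).collect(),
        }
    }
}

/// Maximum fare over rows: each step reads one field out of a whole record.
pub fn max_fare_rows(rows: &[TripRow]) -> Option<f64> {
    rows.iter().map(|r| r.fare_amount).fold(None, |acc, x| match acc {
        None => Some(x),
        Some(m) => Some(m.max(x)),
    })
}

/// Maximum fare over a column: a tight loop over one contiguous f64 slice.
pub fn max_fare_column(cols: &TripColumns) -> Option<f64> {
    cols.fare_amount.iter().copied().fold(None, |acc, x| match acc {
        None => Some(x),
        Some(m) => Some(m.max(x)),
    })
}
```

Both functions return the same answer for the same data; only the memory layout being scanned differs.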

Example Rust Client

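use std::collections::HashMap;
// Context, col, max, pretty and Result also need to be imported from the
// ballista crate; the exact module paths depend on the Ballista version.

#[tokio::main]
async fn main() -> Result<()> {
    // Location of the Parquet data and address of a running Ballista executor
    let nyc_taxi_path = "/mnt/nyctaxi/parquet/year=2019";
    let executor_host = "localhost";
    let executor_port = 50051;

    // Connect to the remote executor with no extra settings
    let ctx = Context::remote(executor_host, executor_port, HashMap::new());

    // Build the query: maximum fare_amount per passenger_count,
    // then execute it remotely and collect the result batches
    let results = ctx
        .read_parquet(nyc_taxi_path, None)?
        .aggregate(vec![col("passenger_count")], vec![max(col("fare_amount"))])?
        .collect()
        .await?;

    // print the results
    pretty::print_batches(&results)?;

    Ok(())
}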
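
The query in the example groups trips by `passenger_count` and computes the maximum `fare_amount` for each group. As a rough, single-process illustration of that logic, the sketch below does the same thing over two in-memory columns using the standard library. It is not how Ballista executes the query, and `max_fare_by_passenger_count` is a hypothetical helper introduced here.

```rust
use std::collections::HashMap;

/// Group by `passenger_count` and keep the maximum `fare_amount` per group.
/// The two slices are parallel columns; if their lengths differ, the extra
/// elements of the longer one are ignored.
pub fn max_fare_by_passenger_count(
    passenger_count: &[u32],
    fare_amount: &[f64],
) -> HashMap<u32, f64> {
    let mut groups: HashMap<u32, f64> = HashMap::new();
    for (&key, &fare) in passenger_count.iter().zip(fare_amount.iter()) {
        groups
            .entry(key)
            // Existing group: keep the larger of the stored and the new fare.
            .and_modify(|current| *current = (*current).max(fare))
            // New group: the first fare seen is the maximum so far.
            .or_insert(fare);
    }
    groups
}
```

For example, the columns `[1, 2, 1, 2]` and `[5.0, 3.0, 7.5, 2.0]` produce a map with `1 -> 7.5` and `2 -> 3.0`.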

Status

Distributed execution using async Rust has now been proven, and we are working towards a 0.3.0 release in August 2020 that will support the following capabilities:

Operators:

  • Projection
  • Selection
  • Hash Aggregate
  • Limit
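
For readers unfamiliar with the relational terms, the following sketch shows what selection, projection and limit do, expressed as Rust iterator stages over in-memory records. It uses only the standard library and says nothing about how Ballista implements these operators; the `Trip` type and the `fares_for_groups` function are hypothetical names for this example.

```rust
/// A hypothetical input record used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Trip {
    pub passenger_count: u32,
    pub fare_amount: f64,
    pub tip_amount: f64,
}

/// Selection, projection and limit chained together, roughly:
/// SELECT fare_amount FROM trips WHERE passenger_count >= min_passengers LIMIT n
pub fn fares_for_groups(trips: &[Trip], min_passengers: u32, n: usize) -> Vec<f64> {
    trips
        .iter()
        // Selection: keep only the rows that satisfy the predicate.
        .filter(|t| t.passenger_count >= min_passengers)
        // Projection: keep only the fare_amount column.
        .map(|t| t.fare_amount)
        // Limit: stop after at most n rows.
        .take(n)
        .collect()
}
```

Because iterators are lazy, the limit stops the scan as soon as `n` matching rows have been produced.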

Expressions:

  • Basic aggregate expressions (MIN, MAX, SUM, COUNT, AVG)
  • Boolean expressions (AND, OR, NOT)
  • Comparison expressions (==, !=, <=, <, >, >=)
  • Basic math expressions (+, -, *, /, %)
  • Rust user-defined functions (UDFs)
  • Java user-defined functions (UDFs)
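
As an illustration of how math, comparison and boolean expressions compose, here is a minimal expression-tree evaluator in plain Rust. It is a sketch under stated assumptions, not Ballista's expression representation: the `Value` and `Expr` types and the `eval` function are hypothetical, cover only a handful of the operators listed above, and work on single scalar values rather than columns.

```rust
/// A scalar value produced by evaluating an expression.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Number(f64),
    Boolean(bool),
}

/// A hypothetical expression tree with a few math, comparison and boolean operators.
#[derive(Debug, Clone)]
pub enum Expr {
    Literal(Value),
    Add(Box<Expr>, Box<Expr>),
    Multiply(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Evaluate an expression tree, returning an error message on a type mismatch.
pub fn eval(expr: &Expr) -> Result<Value, String> {
    match expr {
        Expr::Literal(v) => Ok(v.clone()),
        Expr::Add(l, r) => Ok(Value::Number(number(l)? + number(r)?)),
        Expr::Multiply(l, r) => Ok(Value::Number(number(l)? * number(r)?)),
        Expr::Lt(l, r) => {
            let (a, b) = (number(l)?, number(r)?);
            Ok(Value::Boolean(a < b))
        }
        Expr::And(l, r) => {
            // Evaluate both sides first so a type error on either side is reported.
            let (a, b) = (boolean(l)?, boolean(r)?);
            Ok(Value::Boolean(a && b))
        }
        Expr::Not(e) => Ok(Value::Boolean(!boolean(e)?)),
    }
}

/// Evaluate a sub-expression that must produce a number.
fn number(expr: &Expr) -> Result<f64, String> {
    match eval(expr)? {
        Value::Number(n) => Ok(n),
        other => Err(format!("expected a number, found {:?}", other)),
    }
}

/// Evaluate a sub-expression that must produce a boolean.
fn boolean(expr: &Expr) -> Result<bool, String> {
    match eval(expr)? {
        Value::Boolean(b) => Ok(b),
        other => Err(format!("expected a boolean, found {:?}", other)),
    }
}
```

With this sketch, the tree for `(1 + 2) < 4` evaluates to `Value::Boolean(true)`, while `NOT 1` returns an error because its operand is not a boolean.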

File Formats:

  • CSV
  • Parquet

Roadmap

After the 0.3.0 release, we will start working on more complex operators, particularly joins, using the TPC-H benchmarks to drive requirements. The full roadmap is available here.

More Examples

The following examples should help illustrate the current capabilities of Ballista

Documentation

The user guide is hosted at https://ballistacompute.org, along with the blog where news and release notes are posted.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

Contributors

andygrove, blad, houqp, jorgecarleitao, kensuenobu, kyprifog, max-sixty, rrichardson, sd2k, stspyder, zznq
