DataFusion is a distributed data processing platform implemented in Rust. It is very much inspired by Apache Spark and has a similar programming style through the use of DataFrames and SQL.
DataFusion can also be used as a crate dependency in your project if you want the ability to perform SQL queries and DataFrame style data manipulation in-process.
The project home page is now at https://datafusion.rs
There are two working examples:
Both of these examples run a trivial query against a trivial CSV file using a single thread.
I've started defining milestones and issues in GitHub issues, but here's a high-level summary of the plan with some rough guesses at timescales.
For the POC, I want to be able to run a single worker process (preferably dockerized) and be able to send it a query (via JSON) and have it execute that query. This will be sufficient to run some representative (but trivial) workloads to compare with Apache Spark.
The workloads will read and write CSV files from HDFS.
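As a rough illustration of sending a query to a worker via JSON, here is a minimal sketch of how a client might build the request body using only the Rust standard library. The `QueryRequest` struct, its `sql` field, and the `{"sql": "..."}` payload shape are assumptions made for this example; the actual wire format is not defined here and may look quite different.

```rust
// Hypothetical request type: the real JSON format for the worker is not
// specified in this README, so the single `sql` field is an assumption.
struct QueryRequest {
    sql: String,
}

/// Escape the characters that may not appear raw inside a JSON string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl QueryRequest {
    /// Serialize the request as a JSON object with one string field.
    fn to_json(&self) -> String {
        format!("{{\"sql\":\"{}\"}}", escape_json(&self.sql))
    }
}

fn main() {
    let request = QueryRequest {
        sql: "SELECT id, name FROM people WHERE id > 10".to_string(),
    };
    // The resulting string is what a client would send to the worker.
    println!("{}", request.to_json());
}
```

In a real client a JSON library such as `serde_json` would be the more natural choice; the hand-rolled escaping is only there to keep the sketch free of dependencies.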
The MVP should be fully deployable, with a good UX and good documentation. It could still be lacking major features, though, such as JOIN, GROUP BY, and user-defined functions.
The 1.0 release should be able to support real-world workloads with performance, scalability, and reliability that generally exceed those of Apache Spark.
Contributors are welcome! Please see CONTRIBUTING.md for details.