Giter Club home page Giter Club logo

caerus-dike's Introduction

Caerus-Dike

This is a POC (Proof of Concept) of Near Data Processing (NDP), on Spark using pushdowns. This includes a demonstration of pushdown on Spark of filter, project and aggregate, for both HDFS, and S3.

NDP is a technique where processing of data is pushed closer to the source of the data in an attempt to leverage locality and limit the a) data transfer and b) processing cycles needed across the set of data.

Our approach will consider use cases on Spark, and specifically focus on the use cases where Spark is processing data using SQL (Structured Query Language). This is typical of data stored in a tabular format, and now very popular in Big Data applications and analytics. Today, these Spark cases use SQL queries, which operate on tables (see background for more detail). These cases require reading very large tables from storage in order to perform relational queries on the data.

Our approach will consist of pushing down portions of the SQL query to the storage itself so that the storage can perform the operations of filter, project, and aggregate and thereby limit the data transfer of these tables back to the compute node. This limiting of data transfer also has a secondary benefit of limiting the amount of data needing to be processed on the compute node after the fetch of data is complete.

This repo has everything needed to demonstrate this capability on a single machine using dockers. Everything from the build to the servers, to running the test are all utilizing dockers.

For more information, see the Design Specification

Branches

The design spec above talks about v1 and v2 versions of our code.
The v1 version can be found in the v1 branch.
The v2 version is currently in the master branch.

Components

Spark - The version of spark we build in this repo is a patch which contains the support for Aggregate pushdown in Spark. apache/spark#29695

pushdown-datasource - This is a new Spark V2 datasource, which has support for pushdown of filter, project and aggregate. This data source can operate against either HDFS or S3.

dikeCS - A new S3 server, built on the POCO C++ Libraries, but supporting S3-select API, which supports SQL Query pushdown, of filter, project and aggregate. This server utilizes SQLite for the query engine.

dikeHDFS - This contains our NDP server for HDFS.
This server utilizes a proxy in front of HDFS. This also contains use of SQLite as the query engine.

TPC-H - This is a fork of https://github.com/ssavvides/tpch-spark,
which is: TPC-H queries implemented in Spark using the DataFrames API. We modified this: * to support our data source for S3, HDFS * to support a variety of data formats (csv, tbl) * to support a variety of testing modes such as selecting the pushdown used (with or without pushdown, project, filter, aggregate), selecting the number of partitions, or spark workers, etc.

Setup

git git clone https://github.com/open-infrastructure-labs/caerus-dike.git
cd caerus-dike
git submodule init
git submodule update --recursive --progress
# Alternatively you can update specific submodules only
# git submodule update dikeCS pushdown-datasource

docker network create dike-net

Build

./build.sh

Clean out artifacts

In case you want to delete all the artifacts.

./clean.sh

Demo test of NDP

First, bring up all the server code (hdfs, s3, spark)

./start.sh

Next, run the demo script, and follow the prompts.

./demo.sh

Finally, bring down the servers.

./stop.sh

caerus-dike's People

Contributors

peterpuhov avatar peterpuhov-github avatar peterpuhov-linaro avatar rf972 avatar rfoley-fw avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.