
gazelle_plugin's Introduction

* Gazelle support has officially ended as of February 2023. Please see the end-of-life announcement below.

It all started with the Spark Summit session Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators. On 4/25/2019, we created the Gazelle project to explore new opportunities to reach higher performance in Spark with a vectorized execution engine. We are proud of the work done in Gazelle, not only reaching better performance than Vanilla Spark, but also unleashing the power of hardware capabilities and bringing them to another level. While pushing Gazelle to market, we heard many voices from customers asking us to refactor the Gazelle source code, to leverage Gazelle's JNI as a unified API, and to add existing, mature SQL engines or libraries such as ClickHouse or Velox as backend support. In 2023, we decided to no longer support the Gazelle project and to move to the next stage of extending the Spark experience with vectorized execution engine support. We encourage existing Gazelle users and developers to move their focus to our second-generation native SQL engine, Gluten, which offers more possibilities through multiple native SQL backend integrations, with more companies working together to build a new ecosystem for Spark vectorized execution. Thank you for joining Gazelle's journey; we look forward to you continuing the journey in Gluten with an even better experience.

* LEGAL NOTICE: Your use of this software and any required dependent software (the "Software Package") is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the "TPP.txt" or other similarly-named text file included with the Software Package for additional details.
* Optimized Analytics Package for Spark* Platform is under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

Gazelle Plugin

A native engine for Spark SQL with vectorized SIMD optimizations. Please refer to the user guide for details on how to enable Gazelle.

Online Documentation

You can find all the Gazelle Plugin documents on the project web page.

Introduction

Overview

Spark SQL works very well with structured, row-based data. It uses WholeStageCodeGen to improve performance by generating Java code for JIT compilation. However, the Java JIT is usually not good at utilizing the latest SIMD instructions, especially for complicated queries. Apache Arrow provides a CPU-cache-friendly columnar in-memory layout, and its SIMD-optimized kernels and LLVM-based SQL engine, Gandiva, are also very efficient.

Gazelle Plugin reimplements the Spark SQL execution layer with SIMD-friendly columnar data processing based on Apache Arrow, leveraging Arrow's CPU-cache-friendly columnar in-memory layout, SIMD-optimized kernels, and LLVM-based expression engine to bring better performance to Spark SQL.
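
To give a rough idea of how the plugin is wired into a Spark application, the sketch below configures a SparkSession in the style described by the user guide. The extension class, shuffle manager, jar path, and off-heap sizes are illustrative assumptions and should be checked against the exact release you deploy.

import org.apache.spark.sql.SparkSession

// Sketch only: every key and value below is an assumption following the
// user guide's conventions, not a verified configuration.
val spark = SparkSession.builder()
  .appName("gazelle-demo")
  .config("spark.sql.extensions", "com.intel.oap.ColumnarPlugin")
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  .config("spark.driver.extraClassPath", "/path/to/spark-columnar-core-jar-with-dependencies.jar")
  .config("spark.executor.extraClassPath", "/path/to/spark-columnar-core-jar-with-dependencies.jar")
  .config("spark.memory.offHeap.enabled", "true")   // Arrow buffers are allocated off-heap
  .config("spark.memory.offHeap.size", "20g")
  .getOrCreate()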

Performance data

For advanced performance testing, the charts below show the results of two benchmarks run with Gazelle v1.1: 1. Decision Support Benchmark1 and 2. Decision Support Benchmark2. The testing environment for Decision Support Benchmark1 uses 1 master + 3 workers, each node with an Intel(R) Xeon(R) Gold 6252 CPU, 384 GB memory, and 3x NVMe SSDs, on a 1.5 TB dataset in Parquet format.

  • Decision Support Benchmark1 is a query set modified from the TPC-H benchmark. We changed Decimal to Double since Decimal was not yet supported in OAP v1.0-Gazelle Plugin. Overall, the result shows a 1.49x speedup for OAP v1.0-Gazelle Plugin compared to Vanilla Spark 3.0.0. We also break down the results by query: most queries in Decision Support Benchmark1 benefit from Gazelle Plugin, though the performance boost ratio depends on the individual query.

[Performance charts: Decision Support Benchmark1 results]

The testing environment for Decision Support Benchmark2 uses 1 master + 3 workers, each node with an Intel(R) Xeon(R) Platinum 8360Y CPU, 1440 GB memory, and 4x NVMe SSDs, on a 3 TB dataset in Parquet format.

  • Decision Support Benchmark2 is a query set modified from the TPC-DS benchmark. We changed Decimal to Double since Decimal was not yet supported in OAP v1.0-Gazelle Plugin. We picked 10 queries that are fully supported in OAP v1.0-Gazelle Plugin, and the result shows a 1.26x speedup compared to Vanilla Spark 3.0.0.

[Performance chart: Decision Support Benchmark2 results]

Please note that this performance data is not official TPC-H or TPC-DS results. Actual performance may vary by individual workload. Please try your workloads with Gazelle Plugin first and check the DAG or log file to see whether all operators are supported in OAP-Gazelle Plugin. Please check the detailed pages on performance tuning for TPC-H and TPC-DS workloads.

Coding Style

Contact

[email protected] [email protected]

gazelle_plugin's People

Contributors

adrian-wang, carsonwang, chenghao-intel, eugene-mark, felixybw, gfl94, haojinintel, happycherry, ivoson, jackylee-ch, jerrychenhf, jikunshang, jkself, lee-lei, lidinghao, luciferyang, marin-ma, moonlit-sailor, philo-he, rui-mo, songzhan01, tigersong, weiting-chen, xuechendi, xwu99, yao531441, yma11, zhixingheyi-tian, zhouyuan, zhztheplayer


gazelle_plugin's Issues

columnar whole stage codegen failed due to empty results

We should guard the empty-results code path; a minimal guard is sketched after the stack trace below.

java.io.IOException: nativeEvaluate: evaluate failed with error msg Invalid: RecordBatch must be non-empty.
	at com.intel.oap.vectorized.ExpressionEvaluatorJniWrapper.nativeEvaluate(Native Method)
	at com.intel.oap.vectorized.ExpressionEvaluator.evaluate(ExpressionEvaluator.java:137)
	at com.intel.oap.vectorized.ExpressionEvaluator.evaluate(ExpressionEvaluator.java:107)
	at com.intel.oap.execution.ColumnarShuffledHashJoinExec.$anonfun$doExecuteColumnar$1(ColumnarShuffledHashJoinExec.scala:229)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
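
A minimal guard could drop zero-row batches before they reach the JNI layer. In the sketch below, evaluate() is the call seen in the stack trace; its return type is assumed for illustration.

import org.apache.arrow.vector.ipc.message.ArrowRecordBatch
import com.intel.oap.vectorized.ExpressionEvaluator

// Sketch only: skip empty RecordBatches instead of letting the native side
// fail with "RecordBatch must be non-empty".
def evaluateNonEmpty(evaluator: ExpressionEvaluator, batch: ArrowRecordBatch) =
  if (batch.getLength == 0) Array.empty[ArrowRecordBatch]  // nothing to compute, skip the JNI call
  else evaluator.evaluate(batch)                           // assumed to return ArrowRecordBatch[]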

Decimal fallback

We tested the current fallback logic on the generated decimal dataset and found it is not sufficient. Therefore, we need to add more fallback logic for the Decimal type to make the TPC-DS and TPC-H queries runnable; a minimal sketch of such a check follows the list below.

  • SHJ
  • Union
  • Sort
  • also update the ArrowWritableColumnarVector in arrow-datasource.
  • fix the error caused by row-based exchange and columnar BHJ.
  • SMJ
  • TPC-H
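
As a rough illustration of the kind of check involved (the helper below is hypothetical, not the plugin's actual fallback API), a fallback decision could scan an operator's attributes and expressions for DecimalType:

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.DecimalType

// Hypothetical helper: returns true when a plan node touches Decimal columns,
// in which case the columnar operator should fall back to the row-based one.
def touchesDecimal(plan: SparkPlan): Boolean =
  plan.output.exists(_.dataType.isInstanceOf[DecimalType]) ||
    plan.expressions.exists(_.dataType.isInstanceOf[DecimalType])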

TPC-DS Q95 failed in columnar WSCG

The issue seems to be due to multiple SMJs in one columnar WSCG

Stack: [0x00007f2ead550000,0x00007f2ead651000],  sp=0x00007f2ead64c9e0,  free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [spark-columnar-plugin-codegen-ad4b5a2b868bedea.so+0x9aff]  TypedWholeStageCodeGenImpl::WholeStageCodeGenResultIterator::SetDependencies(std::vector<std::shared_ptr<ResultIteratorBase>, std::allocator<std::shared_ptr<ResultIteratorBase> > > const&)+0x16bf

columnar BHJ failed on new memory pool

java.lang.NullPointerException
        at org.apache.arrow.vector.ipc.message.MessageSerializer.serializeMetadata(MessageSerializer.java:179)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:165)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:150)
        at com.intel.oap.vectorized.ExpressionEvaluator.getSchemaBytesBuf(ExpressionEvaluator.java:242)
        at com.intel.oap.vectorized.ExpressionEvaluator.build(ExpressionEvaluator.java:105)
        at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:118)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:182)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

CI: Add some TPC-H execution rounds for SMJ

Currently the TPC-H CI runs with BHJ (the RAM size of GitHub Actions VMs is quite limited, so BHJ is automatically chosen). We should add some execution rounds for SMJ too; a possible configuration is sketched below.
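
For those SMJ rounds, broadcast joins can be switched off with stock Spark settings (nothing Gazelle-specific), so the planner has to pick sort-merge join even on the small CI dataset:

import org.apache.spark.sql.SparkSession

// Force sort-merge joins: disable broadcast joins entirely and prefer SMJ
// over shuffled hash join.
val spark = SparkSession.builder()
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .config("spark.sql.join.preferSortMergeJoin", "true")
  .getOrCreate()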

Merge Arrow data source

Since the two repos share some pieces of code, it would be much easier to merge them together.

improve columnar SMJ

  • use a non-codegen operator for columnar SMJ in some cases
  • use data references when comparing tuples in multi-key cases

columnar sort codegen fallback to executor side

The driver failed to do sort codegen because of an issue with SortOrder. The quick fix below can bring it back:

diff --git a/core/src/main/scala/com/intel/oap/execution/ColumnarSortExec.scala b/core/src/main/scala/com/intel/oap/execution/ColumnarSortExec.scala
index 5302bf3b..9d7b16c5 100644
--- a/core/src/main/scala/com/intel/oap/execution/ColumnarSortExec.scala
+++ b/core/src/main/scala/com/intel/oap/execution/ColumnarSortExec.scala
@@ -135,10 +135,7 @@ case class ColumnarSortExec(

   /***********************************************************/
   def getCodeGenSignature =
-    if (!sortOrder
-          .filter(
-            expr => bindReference(expr.child, child.output, true).isInstanceOf[BoundReference])
-          .isEmpty) {
+    if (true) {
       ColumnarSorter.prebuild(
         sortOrder,
         child.output,

Data size exceeds max length that array can contain

TPC-DS q24a failed with the following exception:
Error: Error running query: java.io.IOException: nativeEvaluate: evaluate failed with error msg Capacity error: array cannot contain more than 2147483646 bytes, have 2147483672 (state=,code=0)
The limit (2,147,483,646 bytes, one less than Int.MaxValue) comes from Arrow's variable-width arrays using 32-bit signed offsets, so a single array's data buffer is capped just under 2 GiB and the 2,147,483,672 bytes this query produced do not fit.

Add cmake parameters to mvn install

We need to expose more cmake parameters through mvn install to allow more detailed configuration.
For example, cmake can be set to build BUILD_PROTOBUF or BUILD_ARROW from source, but mvn can only run the default settings of compile.sh.

ColumnarWSCG further optimization

We noticed that in Q72 the native WSCG is still slower than vanilla Spark; the ideas below may improve performance.

  • postpone GetValue in WSCG until the value is actually required.
  • BHJ build: decide the hash map's initial size according to the build table size (a minimal sketch follows this list).
  • BHJ get: optimize hash map probing and add a fast cache.
  • SMJ optimization: TBD.
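
A minimal sketch of the build-side sizing idea (generic logic, not the plugin's actual code): derive the initial capacity from the build table's row count and a load factor, rounded up to a power of two so inserts never trigger a resize.

// Sketch only: 0.75 load factor assumed, capacity capped at 2^30 entries.
def initialCapacity(buildRows: Long, loadFactor: Double = 0.75): Int = {
  val needed = math.min(math.ceil(buildRows / loadFactor).toLong, 1L << 30)
  var cap = 1L
  while (cap < needed) cap <<= 1   // round up to the next power of two
  cap.toInt
}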
