This repository contains a sample Spark application that implements the Data Source API. For simplicity's sake, the implementation works with text files containing three `$`-separated columns: name, surname, and salary.
- Create a class called `DefaultSource` that extends the `RelationProvider` and `SchemaRelationProvider` traits. The `RelationProvider` trait is implemented by objects that produce relations for a specific kind of data source. Users may omit the fully qualified class name of a given data source; in that case, Spark SQL appends the class name `DefaultSource` to the path, allowing for less verbose invocation. For example, `org.apache.spark.sql.json` would resolve to the data source `org.apache.spark.sql.json.DefaultSource`. (A sketch of such a class follows this list.)
- Create a class that extends `BaseRelation`. A `BaseRelation` represents a collection of tuples with a known schema; simply speaking, it is where the schema is inferred or defined.
- Mix the `TableScan` trait into the custom relation class. Its `buildScan` method should return all rows from the custom data source as an `RDD` of `Row`s. (See the relation sketch after this list.)
- To support write calls, `DefaultSource` has to implement one additional trait called `CreatableRelationProvider`. (A write-side sketch follows this list.)
- To support column pruning, the custom relation class has to implement the `PrunedScan` trait, whose `buildScan` variant receives only the columns the query actually needs, so the source can avoid materializing the rest.
- To optimize filtering, the custom relation class can extend the `PrunedFilteredScan` trait instead, whose `buildScan` variant additionally receives the pushed-down filters. (Both variants are sketched after this list.)
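
A minimal sketch of such a `DefaultSource` is shown below. It is not the repository's actual code: the package name `com.example.legacy`, the relation class `LegacyRelation`, and the `path` option handling are illustrative assumptions.

```scala
package com.example.legacy // hypothetical package; pick your own

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Invoked when the caller provides no schema; the relation infers its own.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // Invoked when the caller provides an explicit schema.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option must be specified"))
    new LegacyRelation(path, schema)(sqlContext) // LegacyRelation is sketched next
  }
}
```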
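
The relation itself could look like the sketch below, combining `BaseRelation` (schema definition) with `TableScan` (full scans). The class name, the fixed three-column schema, and the `$`-splitting logic are assumptions based on the file layout described above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class LegacyRelation(path: String, userSchema: StructType)
                    (@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Use the caller's schema when given; otherwise define the three known columns.
  override def schema: StructType =
    if (userSchema != null) userSchema
    else StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("surname", StringType, nullable = true),
      StructField("salary", IntegerType, nullable = true)))

  // TableScan: return every record in the file as an RDD of Rows.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext
      .textFile(path)
      .map(_.split('$').map(_.trim))
      .map(fields => Row(fields(0), fields(1), fields(2).toInt))
}
```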
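
To add write support, `DefaultSource` grows one more trait. The sketch below is shown standalone for readability (in practice you would fold it into the class from the first sketch, `SchemaRelationProvider` included); writing rows back in the same `$`-separated layout is an assumption about how this sample serializes output.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}

class DefaultSource extends RelationProvider with CreatableRelationProvider {

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option must be specified"))
    new LegacyRelation(path, null)(sqlContext) // the relation sketched above
  }

  // Invoked by DataFrame.write; SaveMode handling is left out for brevity.
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option must be specified"))
    data.rdd
      .map(row => row.toSeq.mkString("$")) // re-serialize as name$surname$salary
      .saveAsTextFile(path)
    createRelation(sqlContext, parameters) // hand back a relation over the written data
  }
}
```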
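
Column pruning and filter push-down can be sketched as below. `PrunedScan` declares `buildScan(requiredColumns: Array[String])`; `PrunedFilteredScan`, shown here, additionally receives the pushed-down `Filter`s. Handling only `GreaterThan` on `salary` is an illustrative choice: the pushed-down filters are purely an optimization, since Spark re-evaluates them on the returned rows, so ignoring a filter is safe, just slower.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class PrunedLegacyRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("surname", StringType, nullable = true),
    StructField("salary", IntegerType, nullable = true)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val indexOf = schema.fieldNames.zipWithIndex.toMap
    val projection = requiredColumns.map(indexOf) // positions of the columns to keep

    sqlContext.sparkContext
      .textFile(path)
      .map(_.split('$').map(_.trim))
      .map(f => Array[Any](f(0), f(1), f(2).toInt)) // typed values for all columns
      .filter { values =>
        filters.forall {
          case GreaterThan("salary", v: Int) =>
            values(indexOf("salary")).asInstanceOf[Int] > v
          case _ => true // unhandled filters pass through; Spark re-checks them
        }
      }
      .map(values => Row.fromSeq(projection.map(i => values(i)))) // prune columns
  }
}
```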
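
Finally, a hypothetical end-to-end usage, assuming the package name from the sketches above and a sample file in the `name$surname$salary` layout:

```scala
// Read through the custom source; "com.example.legacy" resolves to
// com.example.legacy.DefaultSource.
val df = sqlContext.read
  .format("com.example.legacy")
  .load("/path/to/employees.txt")

df.printSchema()
df.select("name", "salary").filter(df("salary") > 30000).show()

// Write through the same source (requires CreatableRelationProvider).
df.write
  .format("com.example.legacy")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .save("/path/to/output")
```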