This repository contains a sample Spark application that implements the Data Source API. For simplicity's sake, the implementation works with text files containing three `$`-separated columns: name, surname, and salary.
- Create a class called `DefaultSource` that extends the `RelationProvider` and `SchemaRelationProvider` traits. The `RelationProvider` trait is implemented by objects that produce relations for a specific kind of data source. Users may omit the fully qualified class name of a given data source; in that case, Spark SQL appends the class name `DefaultSource` to the path, allowing for less verbose invocation. For example, `org.apache.spark.sql.json` would resolve to the data source `org.apache.spark.sql.json.DefaultSource`. (A sketch of such a class follows this list.)
- Create a class that extends `BaseRelation`. A `BaseRelation` represents a collection of tuples with a known schema; simply speaking, it is where the schema is inferred or defined.
- Mix the `TableScan` trait into the custom relation class. Its `buildScan` method should return all rows from the custom data source as an `RDD` of `Row`s. (See the relation sketch after this list.)
- To support write calls, `DefaultSource` has to implement one additional trait called `CreatableRelationProvider`. (A write-side sketch follows this list.)
- To support column pruning, the custom relation class has to implement the `PrunedScan` trait, whose `buildScan` variant receives only the columns the query actually needs, so the source can avoid materializing the rest.
- To optimize filtering, the custom relation class can extend the `PrunedFilteredScan` trait instead, whose `buildScan` variant additionally receives the pushed-down filters. (Both variants are sketched after this list.)
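
A minimal sketch of such a `DefaultSource` is shown below. It is not the repository's actual code: the package name `com.example.legacy`, the relation class `LegacyRelation`, and the `path` option handling are illustrative assumptions.

```scala
package com.example.legacy // hypothetical package; pick your own

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Invoked when the caller provides no schema; the relation infers its own.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // Invoked when the caller provides an explicit schema.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option must be specified"))
    new LegacyRelation(path, schema)(sqlContext) // LegacyRelation is sketched next
  }
}
```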
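
The relation itself could look like the sketch below, combining `BaseRelation` (schema definition) with `TableScan` (full scans). The class name, the fixed three-column schema, and the `$`-splitting logic are assumptions based on the file layout described above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class LegacyRelation(path: String, userSchema: StructType)
                    (@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Use the caller's schema when given; otherwise define the three known columns.
  override def schema: StructType =
    if (userSchema != null) userSchema
    else StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("surname", StringType, nullable = true),
      StructField("salary", IntegerType, nullable = true)))

  // TableScan: return every record in the file as an RDD of Rows.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext
      .textFile(path)
      .map(_.split('$').map(_.trim))
      .map(fields => Row(fields(0), fields(1), fields(2).toInt))
}
```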
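
To add write support, `DefaultSource` grows one more trait. The sketch below is shown standalone for readability (in practice you would fold it into the class from the first sketch, `SchemaRelationProvider` included); writing rows back in the same `$`-separated layout is an assumption about how this sample serializes output.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}

class DefaultSource extends RelationProvider with CreatableRelationProvider {

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option must be specified"))
    new LegacyRelation(path, null)(sqlContext) // the relation sketched above
  }

  // Invoked by DataFrame.write; SaveMode handling is left out for brevity.
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option must be specified"))
    data.rdd
      .map(row => row.toSeq.mkString("$")) // re-serialize as name$surname$salary
      .saveAsTextFile(path)
    createRelation(sqlContext, parameters) // hand back a relation over the written data
  }
}
```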
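
Column pruning and filter push-down can be sketched as below. `PrunedScan` declares `buildScan(requiredColumns: Array[String])`; `PrunedFilteredScan`, shown here, additionally receives the pushed-down `Filter`s. Handling only `GreaterThan` on `salary` is an illustrative choice: the pushed-down filters are purely an optimization, since Spark re-evaluates them on the returned rows, so ignoring a filter is safe, just slower.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class PrunedLegacyRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("surname", StringType, nullable = true),
    StructField("salary", IntegerType, nullable = true)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val indexOf = schema.fieldNames.zipWithIndex.toMap
    val projection = requiredColumns.map(indexOf) // positions of the columns to keep

    sqlContext.sparkContext
      .textFile(path)
      .map(_.split('$').map(_.trim))
      .map(f => Array[Any](f(0), f(1), f(2).toInt)) // typed values for all columns
      .filter { values =>
        filters.forall {
          case GreaterThan("salary", v: Int) =>
            values(indexOf("salary")).asInstanceOf[Int] > v
          case _ => true // unhandled filters pass through; Spark re-checks them
        }
      }
      .map(values => Row.fromSeq(projection.map(i => values(i)))) // prune columns
  }
}
```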
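
Finally, a hypothetical end-to-end usage, assuming the package name from the sketches above and a sample file in the `name$surname$salary` layout:

```scala
// Read through the custom source; "com.example.legacy" resolves to
// com.example.legacy.DefaultSource.
val df = sqlContext.read
  .format("com.example.legacy")
  .load("/path/to/employees.txt")

df.printSchema()
df.select("name", "salary").filter(df("salary") > 30000).show()

// Write through the same source (requires CreatableRelationProvider).
df.write
  .format("com.example.legacy")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .save("/path/to/output")
```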