Delta Lake Connectors

We are building connectors to bring Delta Lake to popular big-data engines outside Apache Spark (e.g., Apache Hive, Presto).

Introduction

This is the repository for Delta Lake Connectors. It includes a library for querying Delta Lake metadata and connectors to popular big-data engines (e.g., Apache Hive, Presto). Please refer to the main Delta Lake repository if you want to learn more about the Delta Lake project.

Building

The project is compiled using SBT. It has the following subprojects.

Delta uber jar

This project generates a single uber jar containing Delta Lake and all its transitive dependencies (except Hadoop and its dependencies).

  • Most of the dependencies are shaded to avoid version conflicts. See the file build.sbt for details about which dependencies are not shaded.
  • Hadoop and its dependencies are not included in the jar because they are expected to be present in the deployment environment.
  • To generate the uber jar, run build/sbt core/compile
  • To test the uber jar, run build/sbt coreTest/test

Hive connector

This project contains all the code needed to make Hive read Delta Lake tables. The connector consists of two JARs: delta-core-shaded-assembly_<scala_version>-0.1.0.jar and hive-delta_<scala_version>-0.1.0.jar. You can use either Scala 2.11 or 2.12. The released JARs are available on the releases page. Please download the JARs for the Scala version you would like to use.

You can also use the following instructions to build them.

Build JARs

Please skip this section if you have downloaded the connector JARs.

  • To compile the project, run build/sbt hive/compile
  • To test the project, run build/sbt hive/test
  • To generate the connector JARs, run build/sbt hive/package

The above commands will generate two JARs in the following paths.

core/target/scala-2.12/delta-core-shaded-assembly_2.12-0.1.0.jar
hive/target/scala-2.12/hive-delta_2.12-0.1.0.jar

These two JARs include the Hive connector and all its dependencies. They need to be put in Hive’s classpath.

Note: if you would like to build the JARs using Scala 2.11, you can run the SBT command build/sbt "++ 2.11.12 hive/package" and the generated JARs will be in the following paths.

core/target/scala-2.11/delta-core-shaded-assembly_2.11-0.1.0.jar
hive/target/scala-2.11/hive-delta_2.11-0.1.0.jar

Setting up Hive

This section describes how to set up Hive to load the Delta Hive connector.

Before starting your Hive CLI or running your Hive script, add the following Hive configuration properties to the hive-site.xml file (its location is /etc/hive/conf/hive-site.xml on an EMR cluster).

<property>
  <name>hive.input.format</name>
  <value>io.delta.hive.HiveInputFormat</value>
</property>
<property>
  <name>hive.tez.input.format</name>
  <value>io.delta.hive.HiveInputFormat</value>
</property>

Alternatively, you can run the following SQL commands in the Hive CLI before reading Delta tables to set io.delta.hive.HiveInputFormat:

SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;

The second step is to upload the above two JARs to the machine that runs Hive. Finally, add the paths of the JARs to Hive's environment variable HIVE_AUX_JARS_PATH. You can set this environment variable in the hive-env.sh file, whose location is /etc/hive/conf/hive-env.sh on an EMR cluster. This setting tells Hive where to find the connector JARs.
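For example, on an EMR cluster the relevant line in hive-env.sh might look like the following. This is a sketch only: the /path/to/connector-jars directory is a placeholder for wherever you uploaded the JARs, the file names assume the Scala 2.12 build, and a comma-separated list of JAR paths is one common form accepted by Hive (check your distribution's conventions).

export HIVE_AUX_JARS_PATH=/path/to/connector-jars/delta-core-shaded-assembly_2.12-0.1.0.jar,/path/to/connector-jars/hive-delta_2.12-0.1.0.jar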

Create a Hive table

After finishing setup, you should be able to create a Delta table in Hive.

Right now the connector supports only EXTERNAL Hive tables. The Delta table must be created using Spark before an external Hive table can reference it.
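For reference, here is a minimal sketch of how such a Delta table could be created with Apache Spark before defining the Hive table. This is an illustration only: it assumes a Spark installation with the delta-core package on its classpath, and the column names and path simply mirror the Hive example below.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("create-delta-table").getOrCreate()
import spark.implicits._

// Write a small two-column DataFrame as a Delta table at the path the Hive table will point to.
Seq((1, "a"), (2, "b"))
  .toDF("col1", "col2")
  .write
  .format("delta")
  .save("/delta/table/path")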

Here is an example of a CREATE TABLE command that defines an external Hive table pointing to a Delta table at /delta/table/path.

CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'

io.delta.hive.DeltaStorageHandler is the class that implements Hive's data source APIs. It knows how to load a Delta table and extract its metadata. The table schema in the CREATE TABLE statement must be consistent with the underlying Delta metadata; otherwise, the connector will throw an error telling you about the inconsistency.

Frequently asked questions (FAQ)

Supported Hive versions

Hive 2.x.

Can I use this connector in Apache Spark or Presto?

No. The connector must be used with Apache Hive. It doesn't work in other systems, such as Apache Spark or Presto.

If I create a table using the connector in Hive, can I query it in Apache Spark or Presto?

No. A table created by this connector in Hive cannot be read by any other system right now. We recommend creating a separate table in each system, all pointing to the same path. Although you need to use different table names to query the same Delta table, the underlying data is shared by all systems.

Can I write to a Delta table using this connector?

No. The connector doesn't support writing to a Delta table.

Do I need to specify the partition columns when creating a Delta table?

No. The partition columns are read from the underlying Delta metadata. The connector discovers the partition columns and uses this information to perform partition pruning automatically.
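As an illustration only (the column names and path here are hypothetical), a partitioned Delta table could be written from Spark as follows; the connector picks up the partitioning from the Delta metadata, so no PARTITIONED BY clause is needed on the Hive side.

// Sketch: create a Delta table partitioned by the hypothetical "date" column.
// Assumes the same Spark session and implicits as in the earlier example.
Seq(("2019-01-01", 1), ("2019-01-02", 2))
  .toDF("date", "col1")
  .write
  .format("delta")
  .partitionBy("date")
  .save("/delta/partitioned/table/path")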

Why do I need to specify the table schema? Shouldn’t it exist in the underlying Delta table metadata?

Unfortunately, the table schema is a core concept of Hive and Hive needs it before calling the connector.

What if I change the underlying Delta table schema in Spark after creating the Hive table?

If the schema in the underlying Delta metadata is not consistent with the schema specified in the CREATE TABLE statement, the connector will report an error when loading the table and ask you to fix the schema. You must drop the table and recreate it with the new schema. Hive 3.x exposes a new API that allows a data source to hook into ALTER TABLE. You will be able to use ALTER TABLE to update a table schema once the connector supports Hive 3.x.

Hive has three execution engines: MapReduce, Tez, and Spark. Which ones does this connector support?

The connector supports MapReduce and Tez. It does not support the Spark execution engine in Hive.

Reporting issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.

Contributing

We welcome contributions to the Delta Lake Connectors repository. We use GitHub Pull Requests to accept changes.

Community

There are two mediums of communication within the Delta Lake community.
