Presto-Kudu

The Presto Kudu connector allows querying, inserting and deleting data in Apache Kudu

Integration into PrestoDB distribution

Starting with Presto 0.209 the presto-kudu connector is integrated into the Presto distribution. Syntax for creating tables has changed, but the functionality is the same. Please see Presto Documentation / Kudu Connector for more details.

Migration

If you want to migrate from Presto <= 0.208 with this Kudu connector to a newer Presto version with the integrated Kudu connector, perform following steps:

Stop Presto 0.208
Delete the table $schemas from Kudu using the Java or Python Kudu client. This table will be recreated automatically.
If you want to continue using the schema emulation used in this old connector, set following property in the kudu.properties:
```
kudu.schema-emulation.prefix=
```
Start Presto >= 0.209

Compatibility

Version	Compatibility	Details
Apache Kudu 1.8.0	yes	tests ok
Apache Kudu 1.7.0/1.7.1	yes	by full API- and ABI-compatibility of Kudu Java Client 1.8.0
Apache Kudu 1.6.0	yes	by full API- and ABI-compatibility of Kudu Java Client 1.8.0
Apache Kudu 1.5.0	yes	by full API- and ABI-compatibility of Kudu Java Client 1.8.0
Apache Kudu 1.4.0	yes	by full API- and ABI-compatibility of Kudu Java Client 1.8.0

Presto 0.208	yes	tests ok

Support for older Presto versions see release history

Installation

Please follow the below steps to query Apache Kudu in Presto.

Deploying Kudu server

Follow installation guide at Apache Kudu.

If you want to deploy Kudu 1.8.0 on RHE 7 or CentOS 7, you may also be interessed in my binary build project kudu-rpm.

Deploying Presto server

Install Presto according to the documentation: https://prestosql.io/docs/current/installation/deployment.html

Download Presto-Kudu connector

Download current release

Configuring Apache Kudu connector

Go to the directory $PRESTO_HOME$/plugin
Extract the content of presto-kudu-XXX.zip to this folder
Rename the extracted folder presto-kudu-XXX to kudu

Create a file name kudu.properties in $PRESTO_HOME/etc/catalog/:

connector.name=kudu

## List of Kudu master addresses, at least one is needed (comma separated)
## Supported formats: example.com, example.com:7051, 192.0.2.1, 192.0.2.1:7051,
##                    [2001:db8::1], [2001:db8::1]:7051, 2001:db8::1
kudu.client.master-addresses=localhost

## Optional restriction of tablets for specific tenant.
## If a tenant is set, only Kudu tablets starting with `<tenant>.` will
## be visible in Presto
#kudu.session.tenant=mytenant

#######################
### Advanced Kudu Java client configuration
#######################

## Default timeout used for administrative operations (e.g. createTable, deleteTable, etc.)
#kudu.client.defaultAdminOperationTimeout = 30s

## Default timeout used for user operations
#kudu.client.defaultOperationTimeout = 30s

## Default timeout to use when waiting on data from a socket
#kudu.client.defaultSocketReadTimeout = 10s

## Disable Kudu client's collection of statistics.
#kudu.client.disableStatistics = false

Query kudu in CLI of presto

Download presto cli client following: https://prestosql.io/docs/current/installation/cli.html

Start CLI:

./presto-cli --server localhost:8086 --catalog kudu --schema default

Replace the hostname, port and schema name with your own.

Querying Data

A Kudu table named mytable is available in Presto as table kudu.default.mytable. A Kudu table containing a dot is considered as a schema/table combination, e.g. dev.mytable is mapped to the Presto table `kudu.dev.mytable. Only Kudu table names in lower case are currently supported.

Now you can use any Kudu table, if it is lower case and contains no dots.
Alternatively you can create a users table with

CREATE TABLE users (
  user_id int,
  first_name varchar,
  last_name varchar
) WITH (
 column_design = '{"user_id": {"key": true}}',
 partition_design = '{"hash":[{"columns":["user_id"], "buckets": 2}]}',
 num_replicas = 1
);

On creating a Kudu table you must/can specify addition information about the primary key, encoding, and compression of columns and hash or range partitioning, and the number of replicas. Details see below in section "Create Kudu Table".

The table can be described using

DESCRIBE kudu.default.users;

You should get something like

   Column   |  Type   |                               Extra                               | Comment 
------------+---------+-------------------------------------------------------------------+---------
 user_id    | integer | key, encoding=AUTO_ENCODING, compression=DEFAULT_COMPRESSION      |         
 first_name | varchar | nullable, encoding=AUTO_ENCODING, compression=DEFAULT_COMPRESSION |         
 last_name  | varchar | nullable, encoding=AUTO_ENCODING, compression=DEFAULT_COMPRESSION |         
(3 rows)

Insert some data with

INSERT INTO users VALUES (1, 'Donald', 'Duck'), (2, 'Mickey', 'Mouse');

Select the inserted data

SELECT * FROM users;

Data Type Mapping

The data types of Presto and Kudu are mapped as far as possible:

Presto Data Type	Kudu Data Type	Comment
`BOOLEAN`	`BOOL`
`TINYINT`	`INT8`
`SMALLINT`	`INT16`
`INTEGER`	`INT32`
`BIGINT`	`INT64`
`REAL`	`FLOAT`
`DOUBLE`	`DOUBLE`
`VARCHAR`	`STRING`	see note 1
`VARBINARY`	`BINARY`	see note 1
`TIMESTAMP`	`UNIXTIME_MICROS`	µs resolution in Kudu column is reduced to ms resolution
`DECIMAL`	`DECIMAL`	only supported for Kudu server >= 1.7.0
`CHAR`	-	not supported, see note 2
`DATE`	-	not supported, see note 2
`TIME`	-	not supported
`JSON`	-	not supported
`TIME WITH TIMEZONE`	-	not supported
`TIMESTAMP WITH TIMEZONE`	-	not supported
`INTERVAL YEAR TO MONTH`	-	not supported
`INTERVAL DAY TO SECOND`	-	not supported
`ARRAY`	-	not supported
`MAP`	-	not supported
`IPADDRESS`	-	not supported

Note 1

On performing CREATE TABLE ... AS ... from a Presto table to Kudu, the optional maximum length is lost

Note 2

On performing CREATE TABLE ... AS ... from a Presto table to Kudu, a DATE or CHAR column is converted to STRING

Supported Presto SQL statements

Presto SQL statement	Supported	Comment
`SELECT`	[x]
`INSERT INTO ... VALUES`	[x]	behaves like `upsert`
`INSERT INTO ... SELECT ...`	[x]	behaves like `upsert`
`DELETE`	[x]
`CREATE SCHEMA`	[x]
`DROP SCHEMA`	[x]
`CREATE TABLE`	[x]
`CREATE TABLE ... AS`	[x]
`DROP TABLE`	[x]
`ALTER TABLE ... RENAME TO ...`	[x]
`ALTER TABLE ... RENAME COLUMN ...`	[x]	if not part of primary key
`ALTER TABLE ... ADD COLUMN ...`	[x]
`ALTER TABLE ... DROP COLUMN ...`	[x]	if not part of primary key
`SHOW SCHEMAS`	[x]
`SHOW TABLES`	[x]
`SHOW CREATE TABLE`	[x]
`SHOW COLUMNS FROM`	[x]
`DESCRIBE`	[x]	same as `SHOW COLUMNS FROM`
`CALL kudu.system.add_range_partition`	[x]	add range partition to an existing table
`CALL kudu.system.drop_range_partition`	[x]	drop an existing range partition from a table

Currently not supported are SHOW PARTITIONS FROM ..., ALTER SCHEMA ... RENAME

Create Kudu Table with `CREATE TABLE`

On creating a Kudu Table you need to provide following table properties:

column_design
partition_design (optional)
num_replicas (optional, defaults to 3)

Example:

CREATE TABLE users (
  user_id int,
  first_name varchar,
  last_name varchar
) WITH (
 column_design = '{"user_id": {"key": true}}',
 partition_design = '{"hash":[{"columns":["user_id"], "buckets": 2}]}',
 num_replicas = 1
);

Table property `column_design`

With the column design table property you define the columns for the primary key. Additionally you can overwrite the encoding and compression of every single column.

The value of this property must be a string of a valid JSON object. The keys are the columns and the values is a JSON object with the columns properties to set, i.e.

'{"<column name>": {"<column property name>": <value>, ...}, ...}'`

Column property name	Value	Comment
`key`	`true` or `false`	if column belongs to primary key, default: `false`
`nullable`	`true` or `false`	if column is nullable, default: `true` for non-key columns, key columns must not be nullable
`encoding`	"string value"	See Apache Kudu documentation: Column encoding
`compression`	"string value"	See Apache Kudu documentation: Column compression

Example:

'{"column1": {"key": true, "encoding": "dictionary", "compression": "LZ4"}, "column2": {...}}'

Table property `partition_design`

With the partition design table property you define the partition layout. In Apache Kudu you can define multiple hash partitions and at most one range partition. Details see Apache Kudu documentation: Partitioning

The value of this property must be a string of a valid JSON object. The keys are either hash or range or both, i.e.

'{"hash": [{...},...], "range": {...}}'`

Hash partitioning

You can provide multiple hash partition groups in Apache Kudu. Each group consists of a list of column names and the number of buckets.

Example:

'{"hash": [{"columns": ["region", "name"], "buckets": 5}]}'

This defines a hash partition with the columns "region" and "name", distributed over 5 buckets. All partition columns must be part of the primary key.

Range partitioning

You can provide at most one range partition in Apache Kudu. It consists of a list of columns. The ranges themselves are given either in the table property range_partitions. Alternatively, the procedures kudu.system.add_range_partition and kudu.system.drop_range_partition can be used to manage range partitions for existing tables. For both ways see below for more details.

Example:

'{"range": {"columns": ["event_time"]}}'

Defines range partitioning on the column "event".

To add concrete range partitions use either the table property range_partitions or call the procedure .

Table property `range_partitions`

With the range_partitions table property you specify the concrete range partitions to be created. The range partition definition itself must be given in the table property partition_design separately.

Example:

CREATE TABLE events (
  serialno varchar,
  event_time timestamp,
  message varchar
) WITH (
 column_design = '{"serialno": {"key": true}, "event_time": {"key": true}}',
 partition_design = '{"hash":[{"columns":["serialno"], "buckets": 4}],
                      "range": {"columns":["event_time"]}}',
 range_partitions = '[{"lower": null, "upper": "2017-01-01T00:00:00"},
                      {"lower": "2017-01-01T00:00:00", "upper": "2017-07-01T00:00:00"},
                      {"lower": "2017-07-01T00:00:00", "upper": "2018-01-01T00:00:00"}]',
 num_replicas = 1
);

This creates a table with a hash partition on column serialno with 4 buckets and range partitioning on column event_time. Additionally three range partitions are created:

for all event_times before the year 2017 (lower bound = null means it is unbound)
for the first half of the year 2017
for the second half the year 2017 This means any try to add rows with event_time of year 2018 or greater will fail, as no partition is defined.

Managing range partitions

For existing tables, there are procedures to add and drop a range partition.

adding a range partition

CALL kudu.system.add_range_partition(<schema>, <table>, <range_partition_as_json_string>),

dropping a range partition

CALL kudu.system.drop_range_partition(<schema>, <table>, <range_partition_as_json_string>)

<schema>: schema of the table
<table>: table names

<range_partition_as_json_string>: lower and upper bound of the range partition as json string in the form '{"lower": <value>, "upper": <value>}', or if the range partition has multiple columns: '{"lower": [<value_col1>,...], "upper": [<value_col1>,...]}'. The concrete literal for lower and upper bound values are depending on the column types.

Examples:

Presto Data Type	JSON string example
BIGINT	'{"lower": 0, "upper": 1000000}'
SMALLINT	'{"lower": 10, "upper": null}'
VARCHAR	'{"lower": "A", "upper": "M"}'
TIMESTAMP	'{"lower": "2018-02-01T00:00:00.000", "upper": "2018-02-01T12:00:00.000"}'
BOOLEAN	'{"lower": false, "upper": true}'
VARBINARY	values encoded as base64 strings

To specified an unbounded bound, use the value null.

Example:

CALL kudu.system.add_range_partition('myschema', 'events', '{"lower": "2018-01-01", "upper": "2018-06-01"}')

This would add a range partition for a table events in the schema myschema with the lower bound 2018-01-01 (more exactly 2018-01-01T00:00:00.000) and the upper bound 2018-07-01.

Use the sql statement SHOW CREATE TABLE to request the existing range partitions (they are shown in the table property range_partitions).

Known limitations

Only lower case table and column names in Kudu are supported
As schemas are not directly supported by Kudu, a special table named $schemas is created in Kudu when using this connector
Using a secured Kudu cluster has not been tested.

Build

The Presto-Kudu connector is a standard Maven project. Under Linux, simply run the following command from the project root directory:

mvn -DskipTests clean package

Building the package under Windows is currently not supported, as the maven plugin maven-presto-plugin has an open issue with Windows.

To run the build with tests, it is assumed that Kudu master server (and at least one Kudu tablet server) runs on localhost. If you have Docker installed on your machine, you can use following steps:

docker run --rm -d --name apache-kudu --net=host usuresearch/kudu-docker-slim:release-v1.8.0-1
mvn clean package
docker stop apache-kudu

790113107 / presto-kudu Goto Github PK

presto-kudu's Introduction

Presto-Kudu

Integration into PrestoDB distribution

Migration

Compatibility

Installation

Deploying Kudu server

Deploying Presto server

Download Presto-Kudu connector

Configuring Apache Kudu connector

Query kudu in CLI of presto

Querying Data

Data Type Mapping

Note 1

Note 2

Supported Presto SQL statements

Create Kudu Table with CREATE TABLE

Table property column_design

Table property partition_design

Hash partitioning

Range partitioning

Table property range_partitions

Managing range partitions

Known limitations

Build

presto-kudu's People

Contributors

Recommend Projects

Recommend Topics

Recommend Org

Create Kudu Table with `CREATE TABLE`

Table property `column_design`

Table property `partition_design`

Table property `range_partitions`