Comments (2)
Very happy to know that StreamX helps. Here is a sample configuration:
{
  "name": "s3connect",
  "config": {
    "connector.class": "com.qubole.streamx.s3.S3SinkConnector",
    "tasks.max": "1",
    "flush.size": "3000",
    "s3.url": "s3://streamx/demo/",
    "hadoop.conf.dir": "/usr/lib/hadoop2/etc/hadoop/",
    "topics": "clickstream",
    "rotate.interval.ms": "60000",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "3600000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/",
    "locale": "en",
    "timezone": "GMT",
    "hive.metastore.uris": "thrift://localhost:10000",
    "hive.integration": "true",
    "schema.compatibility": "BACKWARD"
  }
}
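As a sketch of how you might register this connector with a running Kafka Connect cluster: the config is POSTed as JSON to the Connect REST API. The host and port (localhost:8083, the Connect default) are assumptions here, and the config dict is trimmed to a few representative keys.

```python
import json
import urllib.request

# Trimmed version of the sample connector config above.
connector = {
    "name": "s3connect",
    "config": {
        "connector.class": "com.qubole.streamx.s3.S3SinkConnector",
        "tasks.max": "1",
        "s3.url": "s3://streamx/demo/",
        "topics": "clickstream",
        "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": "3600000",
        "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/",
        "hive.integration": "true",
    },
}

payload = json.dumps(connector).encode()

# Registering the connector is a POST to the Connect REST API
# (localhost:8083 is the default worker port, an assumption here):
request = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request)  # uncomment against a live Connect worker
```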
In general, users like to add hourly partitions. The config above creates a new partition every 3600000 ms (1 hour), writing to directories such as "s3://streamx/demo/topics/clickstream/year=2016/month=09/day=21/hour=22/".
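As a rough sketch of how the TimeBasedPartitioner derives that directory, a record's timestamp is floored to the partition duration and then formatted per path.format. The strftime mapping below is my own approximation of the YYYY/MM/dd/HH tokens, not StreamX code:

```python
from datetime import datetime, timezone

def partition_path(topic, ts, duration_ms=3_600_000):
    """Floor the timestamp to the partition duration, then format the
    year=/month=/day=/hour= directory (UTC, matching "timezone":"GMT")."""
    epoch_ms = int(ts.timestamp() * 1000)
    floored = epoch_ms - epoch_ms % duration_ms  # start of the hour
    start = datetime.fromtimestamp(floored / 1000, tz=timezone.utc)
    return "topics/%s/%s" % (topic,
                             start.strftime("year=%Y/month=%m/day=%d/hour=%H/"))

print(partition_path("clickstream",
                     datetime(2016, 9, 21, 22, 35, tzinfo=timezone.utc)))
# → topics/clickstream/year=2016/month=09/day=21/hour=22/
```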
Regarding Hive integration, StreamX packages the required Hive dependencies. You need a Hive metastore server running somewhere, and you point StreamX to it via hive.metastore.uris. Every hour, it issues an add-partition call to keep the Hive table up to date.
In connect-distributed.properties, you need to use AvroConverter and Schema Registry to store the Avro schemas (Avro messages in Kafka, and Avro output to S3):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
Let us know if you need any more information.
Thanks
Praveen
from streamx.
Hello Praveen,
Thank you so much for your answer. I tested with this configuration and it is working perfectly :)
Amazing job on this project, guys! Thanks!
Jocelyn
Related Issues (20)
- Folders as files with $ dollar sign in their name when using s3n
- NullPointerException when tasks.max > 1 and using s3a
- Writing Avro data
- NullPointerException using DBWAL and s3n
- Error while running copy job in standalone mode
- Support Openstack Swift object store
- Tests failing?
- JSON records to Parquet on S3 won't work
- S3 to Kafka
- S3 partition file per hourly batch
- `NoSuchMethodError` on Kafka 0.11.0.0
- What is the suggested way to configure path in s3 bucket?
- Can we store the data as txt/json format in s3?
- Strange problem of Parquet files in S3
- Saving json data, partition by specific field (timestamp)
- Not a valid partitioner class: io.confluent.connect.hdfs.partitioner.DefaultPartitioner
- Do I have to set up HDFS in order to use streamX?
- From kafka (avro format) to S3 (Parquet format)
- AWS Athena connection
- Update README.md on using AWS IAM ROLES