This project runs an Oozie workflow for Hive queries.
Prerequisite: create the database and tables for the medical dataset before running the workflow.
Update the following configuration before running the workflow in scheduled (coordinator) mode:
- Configure the NameNode URL and JobTracker address in coordinator.properties
- Configure the database and table names in coordinator.properties
- Update the Oozie command (oozie job -oozie http://manager-0:11000/oozie -config coordinator.properties -run) in setup.sh
- Run setup.sh to execute the workflow in scheduled mode
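As a sketch, coordinator.properties might look like the following. All host names, ports, paths, database/table names, and the date range are placeholder assumptions, and the exact property names the workflow reads (e.g. hiveDatabase, hiveTable) depend on how the coordinator XML is parameterized:

```
# Cluster endpoints (placeholders -- adjust for your cluster)
nameNode=hdfs://manager-0:8020
jobTracker=manager-0:8032
queueName=default

# Database and table used by the Hive queries (placeholder names)
hiveDatabase=medical_db
hiveTable=patient_records

# HDFS path of the deployed coordinator application (placeholder path)
oozie.coord.application.path=${nameNode}/user/${user.name}/medical-workflow

# Schedule window for the coordinator (placeholder dates)
start=2018-01-01T00:00Z
end=2018-12-31T00:00Z
timezone=UTC
```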
Update the following configuration before running the workflow once (one-time mode):
- Configure the NameNode URL and JobTracker address in job.properties
- Configure the database and table names in job.properties
- Update the Oozie command (oozie job -oozie http://manager-0:11000/oozie -config job.properties -run) in setup.sh
- Run setup.sh to execute the workflow
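For one-time runs, job.properties carries the same cluster and Hive settings but points at the workflow application path instead of a coordinator. A sketch with placeholder values (property names other than the oozie.* ones are assumptions that must match the workflow XML):

```
# Cluster endpoints (placeholders -- adjust for your cluster)
nameNode=hdfs://manager-0:8020
jobTracker=manager-0:8032
queueName=default

# Database and table used by the Hive queries (placeholder names)
hiveDatabase=medical_db
hiveTable=patient_records

# HDFS path of the deployed workflow application (placeholder path)
oozie.wf.application.path=${nameNode}/user/${user.name}/medical-workflow
```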
Convert the plain dataset into Parquet files with Snappy compression for better performance (Snappy is the default Parquet codec in Spark 2.3). See the spark_jobs directory for more information.
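A minimal PySpark sketch of the conversion (the actual jobs live in spark_jobs); the input/output paths, CSV input format, and header/schema options are assumptions:

```python
from pyspark.sql import SparkSession

# Build a Spark session; in Spark 2.3 Parquet output is Snappy-compressed
# by default, so no extra compression setting is strictly required.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the plain dataset. Paths and read options are placeholders.
df = spark.read.csv("hdfs:///data/medical/plain",
                    header=True, inferSchema=True)

# Write Parquet; the codec is set explicitly here for clarity, but
# "snappy" is already the default in Spark 2.3.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs:///data/medical/parquet"))

spark.stop()
```

Run it with spark-submit against the cluster before scheduling the Oozie workflow, so the Hive tables can be backed by the Parquet output.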
Running this Oozie job can hit a Jackson dependency conflict between Oozie and Spark: https://community.hortonworks.com/content/supportkb/186305/error-comfasterxmljacksondatabindjsonmappingexcept.html
To avoid it, apply the solution described in the post above.