This project runs an Oozie workflow for Hive queries.
Prerequisite: create the database and tables for the medical dataset before running the workflow.
Update the following configuration before running the workflow in scheduled (coordinator) mode:
- Configure the NameNode URL and JobTracker address in coordinator.properties
- Configure the database and table names in coordinator.properties
- Update the Oozie command (oozie job -oozie http://manager-0:11000/oozie -config coordinator.properties -run) in setup.sh
- Run setup.sh to execute the workflow in scheduled mode
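As a sketch, coordinator.properties might look like the following. All host names, ports, paths, database/table names, and the date range are placeholder assumptions, and the exact property names the workflow reads (e.g. hiveDatabase, hiveTable) depend on how the coordinator XML is parameterized:

```
# Cluster endpoints (placeholders -- adjust for your cluster)
nameNode=hdfs://manager-0:8020
jobTracker=manager-0:8032
queueName=default

# Database and table used by the Hive queries (placeholder names)
hiveDatabase=medical_db
hiveTable=patient_records

# HDFS path of the deployed coordinator application (placeholder path)
oozie.coord.application.path=${nameNode}/user/${user.name}/medical-workflow

# Schedule window for the coordinator (placeholder dates)
start=2018-01-01T00:00Z
end=2018-12-31T00:00Z
timezone=UTC
```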
Update the following configuration before running the workflow once (one-time mode):
- Configure the NameNode URL and JobTracker address in job.properties
- Configure the database and table names in job.properties
- Update the Oozie command (oozie job -oozie http://manager-0:11000/oozie -config job.properties -run) in setup.sh
- Run setup.sh to execute the workflow
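For one-time runs, job.properties carries the same cluster and Hive settings but points at the workflow application path instead of a coordinator. A sketch with placeholder values (property names other than the oozie.* ones are assumptions that must match the workflow XML):

```
# Cluster endpoints (placeholders -- adjust for your cluster)
nameNode=hdfs://manager-0:8020
jobTracker=manager-0:8032
queueName=default

# Database and table used by the Hive queries (placeholder names)
hiveDatabase=medical_db
hiveTable=patient_records

# HDFS path of the deployed workflow application (placeholder path)
oozie.wf.application.path=${nameNode}/user/${user.name}/medical-workflow
```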
Convert the plain dataset into Parquet files with Snappy compression for better performance (Snappy is the default Parquet codec in Spark 2.3). See the spark_jobs directory for more information.
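A minimal PySpark sketch of the conversion (the actual jobs live in spark_jobs); the input/output paths, CSV input format, and header/schema options are assumptions:

```python
from pyspark.sql import SparkSession

# Build a Spark session; in Spark 2.3 Parquet output is Snappy-compressed
# by default, so no extra compression setting is strictly required.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the plain dataset. Paths and read options are placeholders.
df = spark.read.csv("hdfs:///data/medical/plain",
                    header=True, inferSchema=True)

# Write Parquet; the codec is set explicitly here for clarity, but
# "snappy" is already the default in Spark 2.3.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs:///data/medical/parquet"))

spark.stop()
```

Run it with spark-submit against the cluster before scheduling the Oozie workflow, so the Hive tables can be backed by the Parquet output.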
Running this Oozie job can hit a Jackson dependency conflict between Oozie and Spark: https://community.hortonworks.com/content/supportkb/186305/error-comfasterxmljacksondatabindjsonmappingexcept.html
To avoid it, apply the solution described in the post above.