docker_spark_history_ui
A dockerised version of the Spark History Server, which enables us to access metrics in the Spark UI from a log generated by AWS Glue.
Background
AWS recently announced that it's possible to monitor and troubleshoot Glue ETL jobs in the Spark UI.
At first glance, the docs seem to suggest it's as simple as including:
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path'
in the job config options.
On more careful reading of the docs, it becomes clear that the Spark UI is not provided automatically as part of the AWS Glue GUI. Instead, AWS provide a CloudFormation template that allows you to run the Spark History Server yourself. See here for a description. We probably do not want to use this, as it's just another piece of complexity in the platform that we'd need to look after.
Essentially, all AWS Glue does is output logs in the format required by the Spark History Server. This means an alternative option is to run a local instance of the Spark History Server and import the logs generated by Glue into it.
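For context, a Spark event log is just newline-delimited JSON, one Spark listener event per line, which is why the history server can replay a Glue run from a plain file. A minimal sketch of inspecting one line (the field values below are illustrative, not taken from a real Glue run):

```python
import json

# One line from a Spark event log file (illustrative values only).
sample_line = (
    '{"Event": "SparkListenerApplicationStart", '
    '"App Name": "my-glue-job", '
    '"App ID": "spark-application-1", '
    '"Timestamp": 1550000000000}'
)

event = json.loads(sample_line)
print(event["Event"])     # SparkListenerApplicationStart
print(event["App Name"])  # my-glue-job
```

The history server reads a directory of such files, which is why the instructions below simply sync the S3 log path to a local folder.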
This repo provides a dockerised version of this server to make it as easy as possible to get up and running.
Instructions
1. Set your Glue job going
Glue job
In your Glue job, you need to enable the following options (this code uses etl_manager):
job = GlueJob('my_dir/', bucket=bucket, job_role=my_role,
              job_arguments={'--enable-spark-ui': 'true',
                             '--spark-event-logs-path': 's3://my-bucket/path-where-i-want-logs-to-go'})
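If you are not using etl_manager, the same two arguments can be passed when starting a job run with boto3 directly. This is a hedged sketch, not the repo's method: the job name and log path are placeholders, and it assumes boto3 is installed with AWS credentials configured.

```python
def spark_ui_arguments(log_path):
    """Glue job arguments that turn on Spark UI event logging.

    log_path is an s3:// path of your choosing (placeholder here).
    """
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": log_path,
    }


def start_job_with_spark_ui(job_name, log_path):
    # Assumes boto3 is installed and AWS credentials are available.
    import boto3

    glue = boto3.client("glue")
    return glue.start_job_run(
        JobName=job_name,
        Arguments=spark_ui_arguments(log_path),
    )
```

Whichever route you use, the log path you pass here is the path you'll sync from in step 4.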
2. Clone this repo
git clone [email protected]:moj-analytical-services/docker_spark_history_ui.git
cd docker_spark_history_ui
3. Build the docker image from this repo
docker build -t sparkhistoryserver .
4. Copy the events from the job to a local events folder
mkdir events
aws s3 sync s3://my-bucket/path-where-i-want-logs-to-go events
5. Run the spark history server and navigate to the web frontend
docker run -v ${PWD}/events:/tmp/spark-events -p 18080:18080 sparkhistoryserver
and go to http://127.0.0.1:18080 in your web browser
Notes on how I made this work
Note that in the Dockerfile, I set the SPARK_NO_DAEMONIZE environment variable, see here. Otherwise the container exits soon after it starts.
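To illustrate why that variable matters, the Dockerfile boils down to something like the following sketch. This is not the repo's actual Dockerfile: the base image is a placeholder and the Spark install path is an assumption; check the Dockerfile in the repo for the real version.

```dockerfile
# Illustrative sketch only - see the repo's actual Dockerfile.
# Assumes a base image with a Spark distribution at /opt/spark.
FROM some-spark-base-image

# Without this, start-history-server.sh daemonises and the container's
# main process exits immediately, taking the container down with it.
ENV SPARK_NO_DAEMONIZE=true

# The events folder is mounted here at `docker run` time (see step 5).
ENV SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/spark-events"

EXPOSE 18080
CMD ["/opt/spark/sbin/start-history-server.sh"]
```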