docker_spark_history_ui
A dockerised version of the Spark History Server, which enables us to access metrics in the Spark UI from a log generated by AWS Glue.
Background
AWS recently announced that it's possible to monitor and troubleshoot Glue ETL jobs in the Spark UI.
At first glance, the docs seem to suggest it's as simple as including:
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path'
in the job config options.
On more careful reading of the docs, it becomes clear that the Spark UI is not provided automatically as part of the AWS Glue GUI. Instead, AWS provide a CloudFormation template that allows you to run the Spark History Server yourself. See here for a description. We probably do not want to use this, as it's just another piece of complexity in the platform that we'd need to look after.
Essentially, all AWS Glue does is output logs in the format required by the Spark History Server. This means an alternative option is to run a local instance of the Spark History Server and import the logs generated by Glue into it.
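For context, a Spark event log is just newline-delimited JSON, one Spark listener event per line, which is why the history server can replay a Glue run from a plain file. A minimal sketch of inspecting one line (the field values below are illustrative, not taken from a real Glue run):

```python
import json

# One line from a Spark event log file (illustrative values only).
sample_line = (
    '{"Event": "SparkListenerApplicationStart", '
    '"App Name": "my-glue-job", '
    '"App ID": "spark-application-1", '
    '"Timestamp": 1550000000000}'
)

event = json.loads(sample_line)
print(event["Event"])     # SparkListenerApplicationStart
print(event["App Name"])  # my-glue-job
```

The history server reads a directory of such files, which is why the instructions below simply sync the S3 log path to a local folder.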
This repo provides a dockerised version of this server to make it as easy as possible to get up and running.
Instructions
1. Set your Glue job going
Glue job
In your Glue job, you need to enable the following options (this code uses etl_manager):
job = GlueJob('my_dir/', bucket=bucket, job_role=my_role,
              job_arguments={'--enable-spark-ui': 'true',
                             '--spark-event-logs-path': 's3://my-bucket/path-where-i-want-logs-to-go'})
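If you are not using etl_manager, the same two arguments can be passed when starting a job run with boto3 directly. This is a hedged sketch, not the repo's method: the job name and log path are placeholders, and it assumes boto3 is installed with AWS credentials configured.

```python
def spark_ui_arguments(log_path):
    """Glue job arguments that turn on Spark UI event logging.

    log_path is an s3:// path of your choosing (placeholder here).
    """
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": log_path,
    }


def start_job_with_spark_ui(job_name, log_path):
    # Assumes boto3 is installed and AWS credentials are available.
    import boto3

    glue = boto3.client("glue")
    return glue.start_job_run(
        JobName=job_name,
        Arguments=spark_ui_arguments(log_path),
    )
```

Whichever route you use, the log path you pass here is the path you'll sync from in step 4.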
2. Clone this repo
git clone [email protected]:moj-analytical-services/docker_spark_history_ui.git
cd docker_spark_history_ui
3. Build the docker image from this repo
docker build -t sparkhistoryserver .
4. Copy the events from the job to a local events folder
mkdir events
aws s3 sync s3://my-bucket/path-where-i-want-logs-to-go events
5. Run the spark history server and navigate to the web frontend
docker run -v ${PWD}/events:/tmp/spark-events -p 18080:18080 sparkhistoryserver
and go to http://127.0.0.1:18080 in your web browser
Notes on how I made this work
Note that in the Dockerfile, I set the SPARK_NO_DAEMONIZE environment variable, see here. Otherwise the container exits soon after it starts.
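To illustrate why that variable matters, the Dockerfile boils down to something like the following sketch. This is not the repo's actual Dockerfile: the base image is a placeholder and the Spark install path is an assumption; check the Dockerfile in the repo for the real version.

```dockerfile
# Illustrative sketch only - see the repo's actual Dockerfile.
# Assumes a base image with a Spark distribution at /opt/spark.
FROM some-spark-base-image

# Without this, start-history-server.sh daemonises and the container's
# main process exits immediately, taking the container down with it.
ENV SPARK_NO_DAEMONIZE=true

# The events folder is mounted here at `docker run` time (see step 5).
ENV SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/spark-events"

EXPOSE 18080
CMD ["/opt/spark/sbin/start-history-server.sh"]
```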