A tool to run Spark jobs on Kubernetes. Please refer to the wiki for full documentation.
piezo's Issues
Formalise web API
Adapt the prototype web API to a more formal structure (Handlers, Services, etc).
Use an adapter for the kubectl proxy at this stage.
Consolidate spark image for both prometheus and python support
The standard Spark image (`gcr.io/spark-operator/spark:v2.4.0`) supports Prometheus but not Python. The CERN Spark image (`gitlab-registry.cern.ch/db/spark-service/docker-registry/spark:v2.4.0-hadoop3-0.7`) is in the opposite situation. We currently have to switch between the two in the `image` field of the `validation_rules.json` file when deploying the web app. We need both at the same time.
Web app rejects poorly-formatted job names
Pod names must be composed of lower-case characters, digits, `.` and `-`. The characters `.` and `-` must be surrounded by characters and/or digits.
The Kubernetes API will reject poorly-formatted names, but the error message it returns is not very clear. Instead, we should warn the user of this at the same time as all other validation warnings.
In addition, see the failed test cases from #51.
Acceptance criteria
Poorly-formatted job names are rejected by the web app, with a 400 response and helpful error message. Well-formatted job names are accepted as before.
The documentation in the wiki user guide matches the accepted formatting.
Test Scenarios
- Test that a job name follows the naming convention rules (no space, no special characters, no uppercase letter).
- Test that the first character in the job name is a letter and not a number.
- Test that a 400 status code and a user-friendly error message is displayed.
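One strict reading of the rules above (lower-case start letter, `.` and `-` always followed by a letter or digit) can be sketched as follows; this is an illustration, not the actual Piezo validation code:

```python
import re

# Start with a lower-case letter; every subsequent '.' or '-' must be
# followed by a lower-case letter or digit, so the name also ends with
# a letter or digit. Stricter than plain DNS-1123, matching the issue.
JOB_NAME = re.compile(r"[a-z](?:[a-z0-9]|[-.][a-z0-9])*")

def is_valid_job_name(name: str) -> bool:
    """Return True if the job name should pass validation."""
    # Kubernetes caps object names at 253 characters.
    return len(name) <= 253 and JOB_NAME.fullmatch(name) is not None
```

With this sketch, `is_valid_job_name("example-job")` passes while names with spaces, upper-case letters, or a leading digit are rejected before ever reaching the Kubernetes API.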
Extend output of job status
As well as returning the status of the job also include:
- Its name
- When created
- When submitted
- When terminated
- Any error message
Acceptance criteria
- Calling the job status handler for a specified job returns the following information:
- When created
- Submission attempts
- When submitted
- When terminated
- Any error message
- If information is not available for any category then UNKNOWN is returned
- Trying to get the status of a non-existent job returns a 404 Not Found error
Test Scenarios
- Test that GET job status returns all of the above in the response.
- Test that a 404 and a user-friendly error message is displayed when a job is not found.
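Based on the criteria above, a response for a finished job might look like the following; the field names and envelope are illustrative assumptions, not the actual Piezo schema:

```json
{
  "status": "success",
  "data": {
    "name": "example-job",
    "created": "2019-03-01T10:00:00Z",
    "submission_attempts": 1,
    "submitted": "2019-03-01T10:00:05Z",
    "terminated": "2019-03-01T10:12:41Z",
    "error_message": "UNKNOWN"
  }
}
```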
Update Wiki with user guide for Piezo
Deploy Prometheus for kubernetes metrics
Install the prometheus operator onto the kubernetes cluster on openstack to scrape kubernetes cluster metrics.
Acceptance criteria
- Running `kubectl get pods --all-namespaces` shows the equivalent Prometheus pods:
alertmanager-prometheus-operator-alertmanager-0
prometheus-operator-grafana-8559d7df44-v2jf7
prometheus-operator-kube-state-metrics-6b6d6b8bbd-h65gd
prometheus-operator-operator-7d5577d9b5-qsh8m
prometheus-operator-prometheus-node-exporter-85blf
prometheus-operator-prometheus-node-exporter-khq77
prometheus-operator-prometheus-node-exporter-kmdlq
prometheus-prometheus-operator-prometheus-0
- All prometheus pods are running and have status Ready
- The prometheus dashboard can be accessed in a browser by port-forwarding `prometheus-prometheus-operator-prometheus-0` on port 9090
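For example, the port-forward can be set up with a command like the following (the `monitoring` namespace is an assumption; adjust to wherever the operator was installed):

```shell
kubectl port-forward -n monitoring prometheus-prometheus-operator-prometheus-0 9090:9090
```

then browse to http://localhost:9090.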
User can request job logs be written to S3
Sub-task to #16
User can call a `writelogs` handler with body `{"job_name": "example-job"}`. The web app will get the logs for that job and write them to the `kubernetes` bucket in S3 at `outputs/example-job/log.txt`
Subsequent tickets will:
- generalise this process to all jobs that have completed
- improve the configuration
- call the method on a `cron` job
Acceptance criteria
- A `POST` request to the new `writelogs` handler with body `{"job_name": "{name of job}"}`, where the job exists, produces a file in the `kubernetes` bucket in S3 at `outputs/example-job/log.txt`
- The contents of the produced log file match those obtained by running `kubectl logs {spark job name}-driver` from a command window (with kubectl access to the kubernetes cluster)
- If the job does not exist, a `404` is returned
- If the "job_name" argument is missing from the request, a `400` is returned with a message stating it is a required property. Fail: the error message includes irrelevant details: `"data": "'job_name' is a required property\n\nFailed validating 'required' in schema:\n {'properties': {'job_name': {'type': 'string'}},\n 'required': ['job_name'],\n 'type': 'object'}\n\nOn instance:\n {}"`
- The job is not deleted from the cluster after calling `writelogs` on it
- Calling the handler a second time for the same job overwrites the log file previously produced
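The destination key in the bucket follows directly from the job name; a minimal sketch (the helper name is hypothetical, not part of the Piezo codebase):

```python
def log_file_key(job_name: str) -> str:
    # Logs for each job are written under outputs/<job name>/log.txt,
    # so a second call for the same job overwrites the earlier file.
    return f"outputs/{job_name}/log.txt"
```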
Fix bug when deleting a job
Requesting that a job be deleted caused a 500 Internal Server Error (regardless of whether the job existed or not). This was caused by a poorly-formatted body provided by the python kubernetes library.
The route is `deletejob`, e.g. URL http://host-172-16-113-146.nubes.stfc.ac.uk:31924/deletejob
Acceptance criteria
- After submitting a job, requesting deletion results in a 200 Ok response. Subsequent attempts to delete or to get the status of the job result in a 404 Not Found response.
Test Scenarios
- Test that a delete job returns a 200 status.
- Test that a delete job on a job that doesn't exist gives a 404 status.
- Check the delete request's response time for a small job and a large job (if applicable).
- Check that the response content headers meet the OWASP ASVS checklist.
Investigate API logging when in a pod
Configuration and retrieval of web app logging may need to be different when the app is deployed in a pod, rather than hosted on a server. Investigate how that should be managed and write up findings in the wiki.
Stress-test system
Objective 5.4 involves identifying how to tune and optimise the operation of Spark Clusters. Design and implement performance tests that submit large numbers of diverse tests to Piezo in a short time frame. Investigate the effect of changing configuration settings on the system performance, and summarise the findings in a report.
Validation rules config file
Due to uncertainty around passing config files to pods, a temporary approach to specifying validation rules as python code was adopted. With this now resolved (see #49) validation rules should be passed in as formatted text files.
Acceptance Criteria
App can be deployed following the updated wiki guide. Behaviour observed by the user should be unchanged.
Test Scenarios
- Review Wiki
- Sanity-test the key endpoints
Package API into a pod
We currently know how to do this, but an automated deployment script and good documentation is needed.
Submit jobs against ECHO
Set application configuration to be able to access ECHO. This application will be able to read data and code directly from ECHO and should be able to return the results.
Acceptance criteria
- Running an application that requires a data source that is only located on ECHO completes successfully
- Logs/results show that the datasource was successfully integrated into the job
- A new file/files has been written to ECHO that contains the results
Remove "namespace" from handler inputs
Acceptance criteria
The input "namespace" is not accepted by any of the handlers. All spark jobs are submitted to the default namespace given by the validation rules specification.
Ingress for Prometheus
Add ingress rules to be able to access the prometheus and Grafana dashboards from outside of the kubernetes cluster.
Acceptance Criteria
- Prometheus and Grafana dashboards should be accessible on any machine on the STFC VPN
- Metrics should be accessible for the kubernetes cluster running on openstack
Set up system test infrastructure
Remove server from content header
The content header of the Piezo web app handlers returns the server type and version by default. This is a security risk and so needs to be disabled.
Acceptance criteria
None of the Piezo web app handlers returns the server in its response header.
Test Scenario:
- Test the header content of a few endpoints to ensure the content header does not display server details. FAIL: bad requests still show the header, but it is gone from good requests.
Make Spark job names unique
Submitting a job request to the web app more than once with the same `name` input causes all but the first request to be rejected (409 error: conflict). This happens even if the first job has finished, and can only be resolved by deleting the first job.
A solution is to append a unique identifier tag to each spark job before passing it onto the Spark Operator.
Notes
- The user must be made aware of the new job name in the submit job response (as well as the driver name)
- Care must be taken that the name is not too long
- driver/executor information is added to the pod names by the Spark Operator
- validation on the length of the job name may be needed
Acceptance criteria
When submitting a job, the response notifies the user of the new job name: this will be the submitted job name, plus a unique identifier tag.
Subsequent requests to get the logs of, get the status of, or to delete this job must use the unique job name (using the original job name should result in a 404 Not Found response).
Multiple submissions of the same job name should result in a 200 Ok response, and a different unique job name each time.
Submitting a job name that is very long (>200 characters) should result in a 400 Bad Request response, with a message explaining why.
Test Scenarios
- Test that the user is informed of the new job name with a unique identifier.
- Test that a job name with 200+ characters returns a 400 status and a user-friendly error message.
- Test that a job can be submitted with the same name multiple times. NA
- Test that a job name with different character casing is treated as the same name. NA
- Test that a 404 and a user-friendly error message is returned when using an invalid job name. FAIL: whitespace and special characters return 422 not 400. SHOULD BE FIXED WHEN #70 IS DONE - fix verified.
- Test numeric names. FAIL: e.g. 0123 fails to run spark-submit and the job remains in a failed state. SHOULD BE FIXED WHEN #70 IS DONE - fix verified.
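Appending a unique tag while guarding the overall length could look like this sketch; the suffix length is a free choice and the 200-character cap is taken from the acceptance criteria above:

```python
import uuid

MAX_NAME_LENGTH = 200  # longer job names are rejected with a 400

def make_unique_job_name(job_name: str) -> str:
    """Append a short random tag so repeated submissions do not clash."""
    if len(job_name) > MAX_NAME_LENGTH:
        raise ValueError(f"job name exceeds {MAX_NAME_LENGTH} characters")
    tag = uuid.uuid4().hex[:8]
    return f"{job_name}-{tag}"
```

The caller would return this generated name (and the derived driver name) to the user, who must use it for all subsequent status/log/delete requests.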
Configure prometheus to scrape metrics from spark operator
Deploy a kubernetes service and corresponding kubernetes service monitor to scrape kubernetes cluster metrics from the spark-operator pod.
Acceptance criteria
- From the prometheus dashboard it should be possible to access spark-job-specific metrics, e.g. `spark_app_executor_running_count`
- When running a spark job on the openstack cluster the metrics should update
Integration tests for get logs handler
There are currently no integration tests for the `getlogs` handler. No bugs are reported, but the tests still need to be written.
Acceptance criteria
- Tests show happy case for getting logs
- Tests show when job doesn't exist
- Tests show the results of providing a bad input
- Tests all run
Correct input of driver_core_limit
The input argument for `driver_core_limit` is different from `driver_cores`: although it can be a multiple of ten, it must be wrapped in quotation marks.
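For example, a request body fragment might look like the following; the values are illustrative, and the quoted string for the limit is the point being made above:

```json
{
  "driver_cores": 0.5,
  "driver_core_limit": "1200m"
}
```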
Set up/connect to Harbor Docker Registry
- Simplest: Docker images hosted on a public project, and so can be pulled without the need to login.
- Next step: Create a Piezo user on Harbor, who can log in to pull images from a private project.
- End goal: Use time-limited access tokens supplied by the user.
Deploy API pod to the cluster
Attach an ambassador sidecar to the API pod
Ensure logging implemented throughout Web API
Investigate/document job logging configuration
Job log files may roll over when a set size or time is reached. This is not yet an issue, since we are running small example jobs, but could become one when the system is put into production. The thresholds for time/size may be configurable - if this is the case, then documenting what the defaults are and how they could be configured is the first step.
Catch key errors on getting job status
Bug fix to return "UNKNOWN" rather than a 500 error when the job status is not available. Solve by catching the key error in `jobstatus` as in `getjobs`. This solves the issue when the request is called too soon after submitting a job.
Acceptance criteria
`jobstatus` returns a 200 response with message "UNKNOWN" when called very soon after the job is submitted. This bug occurs during system tests and when the kubernetes cluster is handling a large number of job submissions at once.
Test Scenario
- Test that 200 status with 'Unknown' message is returned when a request is sent very soon after the job is submitted.
- Test that 200 status with a valid message is returned when a request is sent for a completed job.
Add optional label input
Labels may be added to k8s pods; these can be specified in the manifest sent to the Spark Operator.
One benefit of this is that all pods with a certain label can be deleted. This would be useful when cleaning up after system tests.
Acceptance criteria
When submitting a job, the user has the option of including an input named "label" that has a string value.
Running the command `kubectl get pods -l userLabel=my_label` on a machine with `kubectl` configured should return only those pods for jobs with the label `my_label`.
Test Scenarios
- Test that labels can be defined for every pod available (all resource objects).
- Test that label names cannot be duplicated NA
- Test that label names are not case sensitive NA
- Test that label names are not restricted by the max character limit (check the max boundary, 256 chars?, and min boundary, 1 char?, limits)
- Test whether special characters are accepted in label names, especially :, /, and white space
Add ingress for web API
Add ingress rules to be able to send requests to the web API from your local machine.
Note: to test this will first require the web app to be packaged and deployed in a pod on the openstack kubernetes cluster.
Acceptance Criteria
- Given correct permissions, it should be possible to send requests from any machine on the STFC VPN to the web API
- Web API should have routes to run jobs, retrieve logs and clear up afterwards
Expose Spark UI to user
Spark generates a UI for monitoring a single job. This is currently unavailable to the user due to ingress rules (the URL is generated dynamically). It may be feasible to expose the UI to the user via port-forwarding or dynamic ingress rule updating. If not, update the decision record explaining why.
Update
Investigation proved that it is possible to expose a spark ui for each job. This will be implemented via a proxy pod with dynamic ingress rules.
Acceptance criteria
- When submitting a job, the user is returned a URL for the spark ui for their job
- If the spark ui is not set up properly then the URL returned to the user is `Unavailable`
- The spark ui proxy, service and ingress rule are deleted when a job is deleted
- It is possible to access the spark ui for multiple jobs simultaneously assuming they are both running
Investigate configuration when running in a pod
How do we pass in configuration (e.g. log file location) when running the web app in a pod?
Current approach is to configure as a python file, which is a shortcut and not appropriate for the final deliverable.
Validate arguments passed for a Spark job
Checks the arguments supplied as part of the POST request to the submit-job handler.
Acceptance Criteria
- If the arguments are all valid, they are then passed to the template builder
- If there are problems with the arguments, then a 400 error is returned together with a useful explanation of all the problems
- Problems include:
- request for excessive resources (e.g. too many executor nodes)
- unrecognised options (e.g. unsupported programming language)
Notes
The web app user guide describes the input arguments for the submit-job handler.
- Some arguments may have default values, so these can be missing from the arguments
- Some arguments will be language-specific
Test Scenarios
- Send a valid payload with mandatory only attributes to see 200 status and expected response body
- Send a valid payload with mandatory & all optional attributes to see 200 status and expected response body
- Send a payload with one or more mandatory attributes missing to see 400 status code and a user-friendly error message. Fail: too many details returned when removing a required input, e.g. language. Fix verified.
- Send a payload with invalid input to the attributes to see 400 status code and a user-friendly error message.
- Check the response time for request is acceptable
- Check the content headers meet the OWASP ASVS
User can ask for status of job
Status could be `Submitted`, `Running`, `Failed`, `Completed`.
The route is `jobstatus`, e.g. URL http://host-172-16-113-146.nubes.stfc.ac.uk:31924/jobstatus
Acceptance criteria
- After submitting a job, requesting its status returns a meaningful status (typically `Submitted` or `Running` immediately after submission, `Completed` or `Failed` long afterwards)
- Requesting the status of a job that does not exist returns a 404 Not Found response
Test Scenarios
- Test all feasible statuses with a valid job. See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase for the relevant status codes.
- Test that a 404 status and a user-friendly message is returned for a job that does not exist.
- Check the response time is acceptable when a bigger job is submitted (if anything like that is possible)
- Check that the response content header meets OWASP ASVS checklist.
- Test that a job that times out fails gracefully with a valid status code and response message.
Alternative to users supplying S3 keys
Options:
- upload file directly
- point to file location on Piezo S3 bucket
- supply access token to file on restricted S3 bucket
Automate job tidying process
Tasks #88 and #96 write job logs to S3 and tidy up. This is exposed as route handlers, meaning that it can be triggered manually or via an automated cURL script.
For this task, write a script that calls the route handler developed in #96 and is triggered on a regular basis (e.g. every 6 hours). This could run as a separate pod on the Kubernetes cluster.
Acceptance criteria
- Tidy jobs handler is called as a background process at regular intervals
- The length of the interval is settable in the configuration file
- Killing the web app also kills this background process
Heartbeat handler
Simple handler that reports that the Web App is up and running.
Set up API integration test framework
Dummy backend to k8s
Correct validation of driver/executor cores
- Driver cores must be (the string representation of) a float (multiples of 0.1)
- Executor cores must be (the string representation of) an integer
We currently use the same underlying method for both, which is not appropriate.
Acceptance criteria
The inputs passed to the submit-jobs handler can include "driver_cores" and "executor_cores" (both are optional inputs)
- "driver_cores" can be a multiple of 0.1 (e.g. "0.5" should be accepted) or an integer (e.g. "1" should be accepted) but nothing else (e.g. "0.53" should be rejected)
- "executor_cores" can be an integer (e.g. "3" should be accepted) but nothing else (e.g. "0.5" should be rejected)
Whenever an input is rejected, the message returned to the user should be clear and unambiguous.
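One way to separate the two checks is to validate the string form directly, so that "0.53" is rejected without floating-point comparisons. This is a sketch of the rule, not the Piezo implementation:

```python
import re

def valid_driver_cores(value: str) -> bool:
    # A whole number ("1") or exactly one decimal place ("0.5"),
    # i.e. a multiple of 0.1; "0.53" has two decimal places and fails.
    return re.fullmatch(r"\d+(\.\d)?", value) is not None

def valid_executor_cores(value: str) -> bool:
    # Executor cores must be a whole number of cores.
    return re.fullmatch(r"\d+", value) is not None
```

Matching on the string avoids the trap of `float("0.53") * 10` round-off checks, and gives an unambiguous rule to quote back in the 400 error message.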
Test Scenarios
- Test that driver_cores with multiple of 0.1 in the payload returns success status.
- Test that driver_cores with 2 decimals in the payload returns 400 status and a user-friendly error message.
- Test that driver_cores with non-numeric value in the payload returns 400 status and a user-friendly error message.
- Test that driver_cores with no input in the payload returns 400 status and a user-friendly error message.
- Test that executor_cores with integer as input in the payload returns success status.
- Test that executor_cores with decimal input in the payload returns 400 status and a user-friendly error message.
- Test that executor_cores with non-numeric value in the payload returns 400 status and a user-friendly error message.
- Test that executor_cores with no input in the payload returns 400 status and a user-friendly error message.
- Test for min and max boundary limits on driver_cores
- Test for min and max boundary limits on executor_cores
- Check the response time of the request is acceptable.
- Check that the response content headers meet the OWASP ASVS checklist
User can get names of all Spark jobs
A GET request to the `getjobs` handler will return a list of all Spark jobs present on the k8s cluster. For each Spark job, the information returned is the job name and the job status (the job status should match that given by the `jobstatus` handler, #46).
The request currently takes no body and performs no filtering on which applications it returns.
Test Scenarios
- Test /getjobs list all jobs present on the cluster with their status
Investigate secret volumes for supplying keys to pod
Have now got the logic to do this. Just need to produce documentation in the wiki explaining how to do it.
Allow/handle input of script input arguments
User job scripts may take input arguments to run, which need to be specified in the body to the piezo web app and passed onto the spark operator
Acceptance criteria
- Submit job handler accepts the argument `arguments`
- `arguments` is optional; if not provided it is not included in the manifest
- `arguments` is formatted as an array
- Each value in `arguments` is supplied in the spark job manifest in the same order as provided in `arguments`
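For instance, a submit-job request body including script arguments might look like the following; the values and the extra fields are illustrative:

```json
{
  "job_name": "example-job",
  "language": "python",
  "arguments": ["--input", "s3a://bucket/data.csv", "--iterations", "10"]
}
```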
Instructions for using Prometheus and Grafana
Add basic instructions for navigating around and using Prometheus and Grafana. Should also include an outline of what information can be gained from them
Make get logs return 404 and fail when driver doesn't exist
When a driver pod doesn't exist, get logs currently returns:
{ "status": "success", "data": "Kubernetes error when trying to get logs for driver \"driver-name\" in namespace \"default\": Not Found" }
The error message is fine, but the status code should be 404 (not 200) and the status should be fail.
Consider providing output files / log files as Swift temporary URLs
I remembered I was meant to provide some information/suggestions about Swift temporary URLs. A benefit of this is that users can directly download (or upload) files from ECHO (1) without having the credentials for accessing ECHO, and (2) without the data having to flow through another box.
Swift temporary URLs are described here:
http://docs.ceph.com/docs/jewel/radosgw/swift/tempurl/
An example of what they look like for ECHO (note that I've modified it to be invalid however):
You first of all need to create a key which can then be used for creating the temporary URLs. Once you have created this key you don’t need the original credentials anymore. The temporary URLs can be generated by a few lines of Python (an example of which is in the link above).
For downloading output files, your REST API could generate a temporary URL and redirect the user to this.
To create the key, the simplest way is to install the Swift client somewhere (`pip install python-swiftclient`) and run:
`swift -A https://s3.echo.stfc.ac.uk/auth/1.0 -U <username> -K <password> post -m "Temp-URL-Key:<key>"`
Here you need to specify your Swift credentials and the key you want to use. I used uuidgen to generate the key.
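Once the key is set, generating a temporary URL is a few lines of HMAC-SHA1, as in the Ceph docs linked above; the account path below is a placeholder:

```python
import hmac
import time
from hashlib import sha1

def make_temp_url(method: str, path: str, key: bytes, lifetime_s: int = 3600) -> str:
    """Build a Swift temporary URL for a single object on ECHO."""
    expires = int(time.time()) + lifetime_s
    # The signed body is "<method>\n<expiry epoch>\n<object path>".
    body = f"{method}\n{expires}\n{path}".encode()
    sig = hmac.new(key, body, sha1).hexdigest()
    return (f"https://s3.echo.stfc.ac.uk{path}"
            f"?temp_url_sig={sig}&temp_url_expires={expires}")
```

The REST API could call something like this for an output file and redirect the user to the resulting URL.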
User can request all finished jobs be tidied up
An extension of #88
All jobs with status `Failed`/`Completed` should have their logs written to S3 and then be deleted. Jobs with status `Pending`/`Running`/`Unknown`/`CrashLoopBackOff` are unaffected.
The return will give a list of jobs that have been tidied with their statuses, and will return any jobs that failed to be tidied with a reason.
Completing this task makes #16 ready
Acceptance criteria
- Calling tidyjobs handler writes the logs of all COMPLETED and FAILED spark applications to S3 and then deletes the job
- All jobs that are not in COMPLETED or FAILED state are skipped
- Any job that fails to write logs to S3 is not deleted
- The return groups jobs by whether they were tidied, skipped, or hit an error when tidying
- If an error occurs when tidying then the reason and status of the error is returned
Test Scenarios
- Test that tidyjobs clears all jobs that are in Completed/Failed status.
- Test that logs are generated for the jobs in S3 before deleting them.
- Check the log files content
- Check that User guide is updated with the tidy
Update Wiki to include missing development setup information
- Include link to install kubectl
- Include details to get kubeconfig file from VMs to be controlled outside of the cluster.
- Include link to set up secret keys
Spark job template builder
Given a complete set of appropriate arguments, this service produces the script body that is submitted to the k8s spark-operator.
Acceptance criteria
- a POST request can be submitted to the submit-job handler with the following arguments: [job_name, language (python, scala etc), path_to_main_application_file, executors, driver_cores, driver_memory, executor_cores, executor_memory, arguments]
- depending on the language, extra arguments may be required (e.g. mainClass for scala)
- Returns a dictionary object with the arguments within the SparkApplication definition
Notes
No validation is performed at this stage:
- If required arguments are not given then an exception should be thrown
- Resources for drivers and executors are unlimited at this stage
Configure prometheus to scrape spark specific metrics from driver and executor pods
Deploy a kubernetes service and service monitor to monitor the spark driver and executor pods of spark applications as they are dynamically spun up and torn down. Metrics should be accessible from the Prometheus dashboard.
### Note: currently on hold as there is a bug in the spark operator code for scraping these metrics, which will be fixed.
kubeflow/spark-operator#381
Acceptance criteria
- Spark driver and executor metrics are accessible from the prometheus dashboard
- Spark pods are dynamically discovered and scraped for metrics as they are created
Deploy services/service monitors
Deploy a service to expose the metrics from a spark driver and executors. The service should expose port 8090 for any pod with the label `piezo-spark-job`.
Deploy a second service to expose the metrics for spark applications from the spark operator. The service should expose port 10254 for any pod with the label `piezo-spark-operator`.
Deploy a service monitor to monitor the above services, and ensure that this service monitor is included in the services to be monitored by prometheus by ensuring the label matches that specified in the prometheus CRD definition.
Acceptance criteria
- Running `kubectl get svc --all-namespaces` shows the two services created
- Running `kubectl get servicemonitors --all-namespaces` shows the service monitor created
- When a spark application is run, additional spark-specific metrics are available via the prometheus dashboard
Test Scenarios
- Test that running `kubectl get svc --all-namespaces` shows the two services (driver and executor; labels: piezo-spark-job, piezo-spark-operator) created.
- Test that running `kubectl get servicemonitors --all-namespaces` shows the service monitor created.
- Check that the Prometheus dashboard is available when a spark application is running.
- Check the status of the Prometheus dashboard when there is no spark application running.
User can get names of all jobs with a given label
Acceptance criteria
Once a number of jobs are submitted with different labels attached (see Wiki page), the user can call the `getjobs` handler with a `label` argument in the body. The response should contain the name and status of all jobs that share this label.
If no label is given when calling `getjobs`, then all jobs are returned.
If no jobs match the label given to `getjobs` then an empty array is returned.
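The filtering behaviour described above can be sketched as follows; the job records and field names are illustrative, not the real handler's data model:

```python
def filter_jobs(jobs, label=None):
    """Return (name, status) pairs, optionally restricted to one label."""
    if label is None:
        return [(j["name"], j["status"]) for j in jobs]
    return [(j["name"], j["status"]) for j in jobs if j.get("label") == label]
```

Note that an unmatched label yields an empty list rather than an error, matching the acceptance criteria.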