A tool to run Spark jobs on Kubernetes. Please refer to the wiki for full documentation.
piezo's Issues
Formalise web API
Adapt the prototype web API to a more formal structure (Handlers, Services, etc).
Use an adapter for the kubectl proxy at this stage.
Consolidate spark image for both prometheus and python support
The standard Spark image (`gcr.io/spark-operator/spark:v2.4.0`) supports Prometheus but not Python. The CERN Spark image (`gitlab-registry.cern.ch/db/spark-service/docker-registry/spark:v2.4.0-hadoop3-0.7`) is in the opposite situation. We currently have to switch between the two in the `image` field of the `validation_rules.json` file when deploying the web app. We need both at the same time.
Web app rejects poorly-formatted job names
Pod names must be composed of lower-case characters, digits, `.` and `-`. The characters `.` and `-` must be surrounded by characters and/or digits.
The Kubernetes API will reject poorly-formatted names, but the error message it returns is not very clear. Instead, we should warn the user of this at the same time as all other validation warnings.
In addition, see the failed test cases from #51.
Acceptance criteria
Poorly-formatted job names are rejected by the web app, with a 400 response and helpful error message. Well-formatted job names are accepted as before.
The documentation in the wiki user guide matches the accepted formatting.
Test Scenarios
- Test that a job name follows the naming convention rules (no space, no special characters, no uppercase letter).
- Test that the first character in the job name is a letter and not a number.
- Test that a 400 status code and a user-friendly error message is displayed.
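One strict reading of the rules above (lower-case start letter, `.` and `-` always followed by a letter or digit) can be sketched as follows; this is an illustration, not the actual Piezo validation code:

```python
import re

# Start with a lower-case letter; every subsequent '.' or '-' must be
# followed by a lower-case letter or digit, so the name also ends with
# a letter or digit. Stricter than plain DNS-1123, matching the issue.
JOB_NAME = re.compile(r"[a-z](?:[a-z0-9]|[-.][a-z0-9])*")

def is_valid_job_name(name: str) -> bool:
    """Return True if the job name should pass validation."""
    # Kubernetes caps object names at 253 characters.
    return len(name) <= 253 and JOB_NAME.fullmatch(name) is not None
```

With this sketch, `is_valid_job_name("example-job")` passes while names with spaces, upper-case letters, or a leading digit are rejected before ever reaching the Kubernetes API.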
Extend output of job status
As well as returning the status of the job also include:
- Its name
- When created
- When submitted
- When terminated
- Any error message
Acceptance criteria
- Calling the job status handler for a specified job returns the following information:
- When created
- Submission attempts
- When submitted
- When terminated
- Any error message
- If information is not available for any category then UNKNOWN is returned
- Trying to get the status of a non-existent job returns a 404 Not Found error
Test Scenarios
- Test that GET job status returns all of the above in the response.
- Test that a 404 and a user-friendly error message is displayed when a job is not found.
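Based on the criteria above, a response for a finished job might look like the following; the field names and envelope are illustrative assumptions, not the actual Piezo schema:

```json
{
  "status": "success",
  "data": {
    "name": "example-job",
    "created": "2019-03-01T10:00:00Z",
    "submission_attempts": 1,
    "submitted": "2019-03-01T10:00:05Z",
    "terminated": "2019-03-01T10:12:41Z",
    "error_message": "UNKNOWN"
  }
}
```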
Update Wiki with user guide for Piezo
Deploy Prometheus for kubernetes metrics
Install the prometheus operator onto the kubernetes cluster on openstack to scrape kubernetes cluster metrics.
Acceptance criteria
- Running `kubectl get pods --all-namespaces` shows the equivalent Prometheus pods:
alertmanager-prometheus-operator-alertmanager-0
prometheus-operator-grafana-8559d7df44-v2jf7
prometheus-operator-kube-state-metrics-6b6d6b8bbd-h65gd
prometheus-operator-operator-7d5577d9b5-qsh8m
prometheus-operator-prometheus-node-exporter-85blf
prometheus-operator-prometheus-node-exporter-khq77
prometheus-operator-prometheus-node-exporter-kmdlq
prometheus-prometheus-operator-prometheus-0
- All prometheus pods are running and have status Ready
- The prometheus dashboard can be accessed in a browser by port-forwarding `prometheus-prometheus-operator-prometheus-0` on port 9090
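For example, the port-forward can be set up with a command like the following (the `monitoring` namespace is an assumption; adjust to wherever the operator was installed):

```shell
kubectl port-forward -n monitoring prometheus-prometheus-operator-prometheus-0 9090:9090
```

then browse to http://localhost:9090.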
User can request job logs be written to S3
Sub-task to #16
User can call a `writelogs` handler with body `{"job_name": "example-job"}`. The web app will get the logs for that job and write them to the `kubernetes` bucket in S3 at `outputs/example-job/log.txt`
Subsequent tickets will:
- generalise this process to all jobs that have completed
- improve the configuration
- call the method on a `cron` job
Acceptance criteria
- A `POST` request to the new `writelogs` handler with body `{"job_name": "{name of job}"}`, where the job exists, produces a file in the `kubernetes` bucket in S3 at `outputs/example-job/log.txt`
- The contents of the produced log file match those obtained by running `kubectl logs {spark job name}-driver` from a command window (with kubectl access to the kubernetes cluster)
- If the job does not exist, a `404` is returned
- If the "job_name" argument is missing from the request, a `400` is returned with a message stating it is a required property. Fail: the error message includes irrelevant details: `"data": "'job_name' is a required property\n\nFailed validating 'required' in schema:\n {'properties': {'job_name': {'type': 'string'}},\n 'required': ['job_name'],\n 'type': 'object'}\n\nOn instance:\n {}"`
- The job is not deleted from the cluster after calling `writelogs` on it
- Calling the handler a second time for the same job overwrites the log file previously produced
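The destination key in the bucket follows directly from the job name; a minimal sketch (the helper name is hypothetical, not part of the Piezo codebase):

```python
def log_file_key(job_name: str) -> str:
    # Logs for each job are written under outputs/<job name>/log.txt,
    # so a second call for the same job overwrites the earlier file.
    return f"outputs/{job_name}/log.txt"
```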
Fix bug when deleting a job
Requesting that a job be deleted caused a 500 Internal Server Error (regardless of whether the job existed or not). This was caused by a poorly-formatted body provided by the python kubernetes library.
The route is `deletejob`, e.g. URL http://host-172-16-113-146.nubes.stfc.ac.uk:31924/deletejob
Acceptance criteria
- After submitting a job, requesting deletion results in a 200 Ok response. Subsequent attempts to delete or to get the status of the job result in a 404 Not Found response.
Test Scenarios
- Test that a delete job returns a 200 status.
- Test that a delete job on a job that doesn't exist gives a 404 status.
- Check the delete request's response time for a small job and a large job (if applicable).
- Check that the response content headers meet the OWASP ASVS checklist.
Investigate API logging when in a pod
Configuration and retrieval of web app logging may need to be different when the app is deployed in a pod, rather than hosted on a server. Investigate how that should be managed and write up findings in the wiki.
Stress-test system
Objective 5.4 involves identifying how to tune and optimise the operation of Spark Clusters. Design and implement performance tests that submit large numbers of diverse tests to Piezo in a short time frame. Investigate the effect of changing configuration settings on the system performance, and summarise the findings in a report.
Validation rules config file
Due to uncertainty around passing config files to pods, a temporary approach to specifying validation rules as python code was adopted. With this now resolved (see #49) validation rules should be passed in as formatted text files.
Acceptance Criteria
App can be deployed following the updated wiki guide. Behaviour observed by the user should be unchanged.
Test Scenarios
- Review Wiki
- Sanity-test the key endpoints
Package API into a pod
We currently know how to do this, but an automated deployment script and good documentation is needed.
Submit jobs against ECHO
Set application configuration to be able to access ECHO. This application will be able to read data and code directly from ECHO and should be able to return the results.
Acceptance criteria
- Running an application that requires a data source that is only located on ECHO completes successfully
- Logs/results show that the datasource was successfully integrated into the job
- A new file/files has been written to ECHO that contains the results
Remove "namespace" from handler inputs
Acceptance criteria
The input "namespace" is not accepted by any of the handlers. All spark jobs are submitted to the default namespace given by the validation rules specification.
Ingress for Prometheus
Add ingress rules to be able to access the prometheus and Grafana dashboards from outside of the kubernetes cluster.
Acceptance Criteria
- Prometheus and Grafana dashboards should be accessible on any machine on the STFC VPN
- Metrics should be accessible for the kubernetes cluster running on openstack
Set up system test infrastructure
Remove server from content header
The content header of the Piezo web app handlers returns the server type and version by default. This is a security risk and so needs to be disabled.
Acceptance criteria
None of the Piezo web app handlers returns the server in its response header.
Test Scenario:
- Test the header content of a few endpoints to ensure the content header does not display server details. FAIL: bad requests still show the header, but it is gone from good requests.
Make Spark job names unique
Submitting a job request to the web app more than once with the same `name` input causes all but the first request to be rejected (409 error: conflict). This happens even if the first job has finished, and can only be resolved by deleting the first job.
A solution is to append a unique identifier tag to each spark job before passing it onto the Spark Operator.
Notes
- The user must be made aware of the new job name in the submit job response (as well as the driver name)
- Care must be taken that the name is not too long
- driver/executor information is added to the pod names by the Spark Operator
- validation on the length of the job name may be needed
Acceptance criteria
When submitting a job, the response notifies the user of the new job name: this will be the submitted job name, plus a unique identifier tag.
Subsequent requests to get the logs of, get the status of, or to delete this job must use the unique job name (using the original job name should result in a 404 Not Found response).
Multiple submissions of the same job name should result in a 200 Ok response, and a different unique job name each time.
Submitting a job name that is very long (>200 characters) should result in a 400 Bad Request response, with a message explaining why.
Test Scenarios
- Test that the user is informed of the new job name with a unique identifier.
- Test that a job name with 200+ characters returns a 400 status and a user-friendly error message.
- Test that a job can be submitted with the same name multiple times. NA
- Test that a job name with different character casing is treated as the same name. NA
- Test that a 404 and a user-friendly error message is returned when using an invalid job name. FAIL: whitespace and special characters return 422 not 400. SHOULD BE FIXED WHEN #70 IS DONE - fix verified.
- Test numeric names. FAIL: e.g. 0123 fails to run spark-submit and the job remains in a failed state. SHOULD BE FIXED WHEN #70 IS DONE - fix verified.
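Appending a unique tag while guarding the overall length could look like this sketch; the suffix length is a free choice and the 200-character cap is taken from the acceptance criteria above:

```python
import uuid

MAX_NAME_LENGTH = 200  # longer job names are rejected with a 400

def make_unique_job_name(job_name: str) -> str:
    """Append a short random tag so repeated submissions do not clash."""
    if len(job_name) > MAX_NAME_LENGTH:
        raise ValueError(f"job name exceeds {MAX_NAME_LENGTH} characters")
    tag = uuid.uuid4().hex[:8]
    return f"{job_name}-{tag}"
```

The caller would return this generated name (and the derived driver name) to the user, who must use it for all subsequent status/log/delete requests.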
Configure prometheus to scrape metrics from spark operator
Deploy a kubernetes service and corresponding kubernetes service monitor to scrape kubernetes cluster metrics from the spark-operator pod.
Acceptance criteria
- From the prometheus dashboard it should be possible to access spark-job-specific metrics, e.g. `spark_app_executor_running_count`
- When running a spark job on the openstack cluster the metrics should update
Integration tests for get logs handler
There are currently no integration tests for the `getlogs` handler. No bugs are reported, but the tests still need to be written.
Acceptance criteria
- Tests show happy case for getting logs
- Tests show when job doesn't exist
- Tests show the results of providing a bad input
- Tests all run
Correct input of driver_core_limit
The input argument for `driver_core_limit` is different from `driver_cores`: although it can be a multiple of ten, it must be wrapped in quotation marks.
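For example, a request body fragment might look like the following; the values are illustrative, and the quoted string for the limit is the point being made above:

```json
{
  "driver_cores": 0.5,
  "driver_core_limit": "1200m"
}
```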
Set up/connect to Harbor Docker Registry
- Simplest: Docker images hosted on a public project, and so can be pulled without the need to login.
- Next step: Create a Piezo user on Harbor, who can log in to pull images from a private project.
- End goal: Use time-limited access tokens supplied by the user.
Deploy API pod to the cluster
Attach an ambassador sidecar to the API pod
Ensure logging implemented throughout Web API
Investigate/document job logging configuration
Job log files may roll over when a set size or time is reached. This is not yet an issue, since we are running small example jobs, but could become one when the system is put into production. The thresholds for time/size may be configurable - if this is the case, then documenting what the defaults are and how they could be configured is the first step.
Catch key errors on getting job status
Bug fix to return "UNKNOWN" rather than a 500 error when the job status is not available. Solve by catching the key error in `jobstatus` as in `getjobs`. This solves the issue when the request is called too soon after submitting a job.
Acceptance criteria
`jobstatus` returns a 200 response with message "UNKNOWN" when called very soon after the job is submitted. This bug occurs during system tests and when the kubernetes cluster is handling a large number of job submissions at once.
Test Scenario
- Test that 200 status with 'Unknown' message is returned when a request is sent very soon after the job is submitted.
- Test that 200 status with a valid message is returned when a request is sent for a completed job.
Add optional label input
Labels may be added to k8s pods; these can be specified in the manifest sent to the Spark Operator.
One benefit of this is that all pods with a certain label can be deleted. This would be useful when cleaning up after system tests.
Acceptance criteria
When submitting a job, the user has the option of including an input named "label" that has a string value.
Running the command `kubectl get pods -l userLabel=my_label` on a machine with `kubectl` configured should return only those pods for jobs with the label `my_label`.
Test Scenarios
- Test that labels can be defined for every pod available (all resource objects).
- Test that label names cannot be duplicated NA
- Test that label names are not case sensitive NA
- Test that label names are not restricted by the max character limit (check the max boundary, 256 chars?, and min boundary, 1 char?, limits)
- Test whether special characters are accepted in label names, especially :, /, and white space
Add ingress for web API
Add ingress rules to be able to send requests to the web API from your local machine.
Note: to test this will first require the web app to be packaged and deployed in a pod on the openstack kubernetes cluster.
Acceptance Criteria
- Given correct permissions, it should be possible to send requests from any machine on the STFC VPN to the web API
- Web API should have routes to run jobs, retrieve logs and clear up afterwards
Expose Spark UI to user
Spark generates a UI for monitoring a single job. This is currently unavailable to the user due to ingress rules (the URL is generated dynamically). It may be feasible to expose the UI to the user via port-forwarding or dynamic ingress rule updating. If not, update the decision record explaining why.
Update
Investigation proved that it is possible to expose a spark ui for each job. This will be implemented via a proxy pod with dynamic ingress rules.
Acceptance criteria
- When submitting a job, the user is returned a URL for the spark ui for their job
- If the spark ui is not set up properly then the URL returned to the user is `Unavailable`
- The spark ui proxy, service and ingress rule are deleted when a job is deleted
- It is possible to access the spark ui for multiple jobs simultaneously assuming they are both running
Investigate configuration when running in a pod
How do we pass in configuration (e.g. log file location) when running the web app in a pod?
Current approach is to configure as a python file, which is a shortcut and not appropriate for the final deliverable.
Validate arguments passed for a Spark job
Checks the arguments supplied as part of the POST request to the submit-job handler.
Acceptance Criteria
- If the arguments are all valid, they are then passed to the template builder
- If there are problems with the arguments, then a 400 error is returned together with a useful explanation of all the problems
- Problems include:
- request for excessive resources (e.g. too many executor nodes)
- unrecognised options (e.g. unsupported programming language)
Notes
The web app user guide describes the input arguments for the submit-job handler.
- Some arguments may have default values, so these can be missing from the arguments
- Some arguments will be language-specific
Test Scenarios
- Send a valid payload with mandatory only attributes to see 200 status and expected response body
- Send a valid payload with mandatory & all optional attributes to see 200 status and expected response body
- Send a payload with one or more mandatory attributes missing to see 400 status code and a user-friendly error message. Fail: too many details returned when removing a required input, e.g. language. Fix verified.
- Send a payload with invalid input to the attributes to see 400 status code and a user-friendly error message.
- Check the response time for request is acceptable
- Check the content headers meet the OWASP ASVS
User can ask for status of job
Status could be `Submitted`, `Running`, `Failed`, `Completed`.
The route is `jobstatus`, e.g. URL http://host-172-16-113-146.nubes.stfc.ac.uk:31924/jobstatus
Acceptance criteria
- After submitting a job, requesting its status returns a meaningful status (typically `Submitted` or `Running` immediately after submission, `Completed` or `Failed` long afterwards)
- Requesting the status of a job that does not exist returns a 404 Not Found response
Test Scenarios
- Test all feasible statuses with a valid job. See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase for the relevant status codes.
- Test that a 404 status and a user-friendly message is returned for a job that does not exist.
- Check the response time is acceptable when a bigger job is submitted (if anything like that is possible)
- Check that the response content header meets OWASP ASVS checklist.
- Test that a job that times out fails gracefully with a valid status code and response message.
Alternative to users supplying S3 keys
Options:
- upload file directly
- point to file location on Piezo S3 bucket
- supply access token to file on restricted S3 bucket
Automate job tidying process
Tasks #88 and #96 write job logs to S3 and tidy up. This is exposed as route handlers, meaning that it can be triggered manually or via an automated cURL script.
For this task, write a script that calls the route handler developed in #96 and is triggered on a regular basis (e.g. every 6 hours). This could run as a separate pod on the Kubernetes cluster.
Acceptance criteria
- Tidy jobs handler is called as a background process at regular intervals
- The length of the interval is settable in the configuration file
- Killing the web app also kills this background process
Heartbeat handler
Simple handler that reports that the Web App is up and running.
Set up API integration test framework
Dummy backend to k8s
Correct validation of driver/executor cores
- Driver cores must be (the string representation of) a float (multiples of 0.1)
- Executor cores must be (the string representation of) an integer
We currently use the same underlying method for both, which is not appropriate.
Acceptance criteria
The inputs passed to the submit-jobs handler can include "driver_cores" and "executor_cores" (both are optional inputs)
- "driver_cores" can be a multiple of 0.1 (e.g. "0.5" should be accepted) or an integer (e.g. "1" should be accepted) but nothing else (e.g. "0.53" should be rejected)
- "executor_cores" can be an integer (e.g. "3" should be accepted) but nothing else (e.g. "0.5" should be rejected)
Whenever an input is rejected, the message returned to the user should be clear and unambiguous.
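One way to separate the two checks is to validate the string form directly, so that "0.53" is rejected without floating-point comparisons. This is a sketch of the rule, not the Piezo implementation:

```python
import re

def valid_driver_cores(value: str) -> bool:
    # A whole number ("1") or exactly one decimal place ("0.5"),
    # i.e. a multiple of 0.1; "0.53" has two decimal places and fails.
    return re.fullmatch(r"\d+(\.\d)?", value) is not None

def valid_executor_cores(value: str) -> bool:
    # Executor cores must be a whole number of cores.
    return re.fullmatch(r"\d+", value) is not None
```

Matching on the string avoids the trap of `float("0.53") * 10` round-off checks, and gives an unambiguous rule to quote back in the 400 error message.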
Test Scenarios
- Test that driver_cores with multiple of 0.1 in the payload returns success status.
- Test that driver_cores with 2 decimals in the payload returns 400 status and a user-friendly error message.
- Test that driver_cores with non-numeric value in the payload returns 400 status and a user-friendly error message.
- Test that driver_cores with no input in the payload returns 400 status and a user-friendly error message.
- Test that executor_cores with integer as input in the payload returns success status.
- Test that executor_cores with decimal input in the payload returns 400 status and a user-friendly error message.
- Test that executor_cores with non-numeric value in the payload returns 400 status and a user-friendly error message.
- Test that executor_cores with no input in the payload returns 400 status and a user-friendly error message.
- Test for min and max boundary limits on driver_cores
- Test for min and max boundary limits on executor_cores
- Check the response time of the request is acceptable.
- Check that the response content headers meet the OWASP ASVS checklist
User can get names of all Spark jobs
A GET request to the `getjobs` handler will return a list of all Spark jobs present on the k8s cluster. For each Spark job, the information returned is the job name and the job status (the job status should match that given by the `jobstatus` handler, #46).
The request currently takes no body and performs no filtering on which applications it returns.
Test Scenarios
- Test /getjobs list all jobs present on the cluster with their status
Investigate secret volumes for supplying keys to pod
Have now got the logic to do this. Just need to produce documentation in the wiki explaining how to do it.
Allow/handle input of script input arguments
User job scripts may take input arguments to run, which need to be specified in the body to the piezo web app and passed onto the spark operator
Acceptance criteria
- Submit job handler accepts the argument `arguments`
- `arguments` is optional; if not provided it is not included in the manifest
- `arguments` is formatted as an array
- Each value in `arguments` is supplied in the spark job manifest in the same order as provided in `arguments`
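For instance, a submit-job request body including script arguments might look like the following; the values and the extra fields are illustrative:

```json
{
  "job_name": "example-job",
  "language": "python",
  "arguments": ["--input", "s3a://bucket/data.csv", "--iterations", "10"]
}
```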
Instructions for using Prometheus and Grafana
Add basic instructions for navigating around and using Prometheus and Grafana. Should also include an outline of what information can be gained from them
Make get logs return 404 and fail when driver doesn't exist
When a driver pod doesn't exist, get logs currently returns:
{ "status": "success", "data": "Kubernetes error when trying to get logs for driver \"driver-name\" in namespace \"default\": Not Found" }
The error message is fine, but the status code should be 404 (not 200) and the status should be fail.
Consider providing output files / log files as Swift temporary URLs
I remembered I was meant to provide some information/suggestions about Swift temporary URLs. A benefit of this is that users can directly download (or upload) files from ECHO (1) without having the credentials for accessing ECHO, and (2) without the data having to flow through another box.
Swift temporary URLs are described here:
http://docs.ceph.com/docs/jewel/radosgw/swift/tempurl/
An example of what they look like for ECHO (note that I've modified it to be invalid however):
You first of all need to create a key which can then be used for creating the temporary URLs. Once you have created this key you don’t need the original credentials anymore. The temporary URLs can be generated by a few lines of Python (an example of which is in the link above).
For downloading output files, your REST API could generate a temporary URL and redirect the user to this.
To create the key, the simplest way is to install the Swift client somewhere (`pip install python-swiftclient`) and run:
`swift -A https://s3.echo.stfc.ac.uk/auth/1.0 -U <username> -K <password> post -m "Temp-URL-Key:<key>"`
Here you need to specify your Swift credentials and the key you want to use. I used uuidgen to generate the key.
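Once the key is set, generating a temporary URL is a few lines of HMAC-SHA1, as in the Ceph docs linked above; the account path below is a placeholder:

```python
import hmac
import time
from hashlib import sha1

def make_temp_url(method: str, path: str, key: bytes, lifetime_s: int = 3600) -> str:
    """Build a Swift temporary URL for a single object on ECHO."""
    expires = int(time.time()) + lifetime_s
    # The signed body is "<method>\n<expiry epoch>\n<object path>".
    body = f"{method}\n{expires}\n{path}".encode()
    sig = hmac.new(key, body, sha1).hexdigest()
    return (f"https://s3.echo.stfc.ac.uk{path}"
            f"?temp_url_sig={sig}&temp_url_expires={expires}")
```

The REST API could call something like this for an output file and redirect the user to the resulting URL.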
User can request all finished jobs be tidied up
An extension of #88
All jobs with status `Failed`/`Completed` should have their logs written to S3 and then be deleted. Jobs with status `Pending`/`Running`/`Unknown`/`CrashLoopBackOff` are unaffected.
The return will give a list of jobs that have been tidied with their statuses, and will return any jobs that failed to be tidied with a reason.
Completing this task makes #16 ready
Acceptance criteria
- Calling tidyjobs handler writes the logs of all COMPLETED and FAILED spark applications to S3 and then deletes the job
- All jobs that are not in COMPLETED or FAILED state are skipped
- Any job that fails to write logs to S3 is not deleted
- The return groups jobs by whether they were tidied, skipped, or hit an error when tidying
- If an error occurs when tidying then the reason and status of the error is returned
Test Scenarios
- Test that tidyjobs clears all jobs that are in Completed/Failed status.
- Test that logs are generated for the jobs in S3 before deleting them.
- Check the log files content
- Check that User guide is updated with the tidy
Update Wiki to include missing development setup information
- Include link to install kubectl
- Include details to get kubeconfig file from VMs to be controlled outside of the cluster.
- Include link to set up secret keys
Spark job template builder
Given a complete set of appropriate arguments, this service produces the script body that is submitted to the k8s spark-operator.
Acceptance criteria
- a POST request can be submitted to the submit-job handler with the following arguments: [job_name, language (python, scala etc), path_to_main_application_file, executors, driver_cores, driver_memory, executor_cores, executor_memory, arguments]
- depending on the language, extra arguments may be required (e.g. mainClass for scala)
- Returns a dictionary object with the arguments within the SparkApplication definition
Notes
No validation is performed at this stage:
- If required arguments are not given then an exception should be thrown
- Resources for drivers and executors are unlimited at this stage
Configure prometheus to scrape spark specific metrics from driver and executor pods
Deploy a kubernetes service and service monitor to monitor the spark driver and executor pods of spark applications as they are dynamically spun up and torn down. Metrics should be accessible from the Prometheus dashboard.
### Note: currently on hold as there is a bug in the spark operator code for scraping these metrics, which will be fixed.
kubeflow/spark-operator#381
Acceptance criteria
- Spark driver and executor metrics are accessible from the prometheus dashboard
- Spark pods are dynamically discovered and scraped for metrics as they are created
Deploy services/service monitors
Deploy a service to expose the metrics from a spark driver and executors. The service should expose port 8090 for any pod with the label `piezo-spark-job`.
Deploy a second service to expose the metrics for spark applications from the spark operator. The service should expose port 10254 for any pod with the label `piezo-spark-operator`.
Deploy a service monitor to monitor the above services, and ensure that this service monitor is included in the services to be monitored by prometheus by ensuring the label matches that specified in the prometheus CRD definition.
Acceptance criteria
- Running `kubectl get svc --all-namespaces` shows the two services created
- Running `kubectl get servicemonitors --all-namespaces` shows the service monitor created
- When a spark application is run, additional spark-specific metrics are available via the prometheus dashboard
Test Scenarios
- Test that running `kubectl get svc --all-namespaces` shows the two services (driver and executor; labels: piezo-spark-job, piezo-spark-operator) created.
- Test that running `kubectl get servicemonitors --all-namespaces` shows the service monitor created.
- Check that the Prometheus dashboard is available when a spark application is running.
- Check the status of the Prometheus dashboard when there is no spark application running.
User can get names of all jobs with a given label
Acceptance criteria
Once a number of jobs are submitted with different labels attached (see Wiki page), the user can call the `getjobs` handler with a `label` argument in the body. The response should contain the name and status of all jobs that share this label.
If no label is given when calling `getjobs`, then all jobs are returned.
If no jobs match the label given to `getjobs` then an empty array is returned.
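The filtering behaviour described above can be sketched as follows; the job records and field names are illustrative, not the real handler's data model:

```python
def filter_jobs(jobs, label=None):
    """Return (name, status) pairs, optionally restricted to one label."""
    if label is None:
        return [(j["name"], j["status"]) for j in jobs]
    return [(j["name"], j["status"]) for j in jobs if j.get("label") == label]
```

Note that an unmatched label yields an empty list rather than an error, matching the acceptance criteria.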