eidf-epcc-cluster

Additional documentation from the team managing the EIDF Cluster is available at this link. If you don't have access to that repository, you can get in touch with the EIDF service desk (see this link) and request access. Likewise, if you need to flag problems with the cluster, please get in touch with the helpdesk at this link.

For running ML/NLP experiments on a Kubernetes cluster, we prepared an introductory guide, available here. If you find that anything is missing from that guide, please feel free to add to it (you all have write access) or, if you are unable to do so, please open an issue.

WARNING: The cluster does not have any quota or permission management at the moment, so please behave: don't hoard resources or make life harder for other users, or we will have to restrict your access.

EIDF EPCC Cluster -- Guide for new users

A guide to onboarding new users. Be aware that this is a developing service.

New User Sign Up to EIDF

Full documentation on signing up is available in the EIDF Documentation.

Open a browser and go to the EIDF Portal
Click Login; this will redirect you to SAFE
If you have a SAFE account, use it; if you do not, register for a new SAFE account
Return to the EIDF Portal once your SAFE account has been activated and you have logged in.

New User Request Access to Informatics Project

Click on the dropdown menu "Projects"
Click "Request Access"
Choose from the list "eidf029 - Informatics K8s Support"
Click "Apply"
An approver will add you to the project and create an account.

Approver Steps to Create EIDF Accounts for New User

Log in to the EIDF Portal
Click on the dropdown menu "Projects"
Choose the applicant to the project from the "Project Management Requests"
Choose the approve option on the page and submit.
Click on the dropdown menu "Projects"
Click "Your Projects"
Choose from the list "eidf029 - Informatics K8s Support"
Find and click the button for "Create Account"
Create a username for the account in the form <name>-infk8s, e.g. dmckay-infk8s
In the account owner drop-down box, choose the applicant you are creating the account for.
Click "Submit"
In the "Project Accounts", click "Manage" next to the account you have just created.
Give "Access", not "Sudo", to the following entries: eidf-gateway, eidf029-host1

New User First Login

Log in to the EIDF Portal
Click on the dropdown menu "Projects"
Click "Your Projects"
Choose from the list "eidf029 - Informatics K8s Support"
Click your account user name from your Account
Click to view your initial password and copy/note it
Click VDI Login
From the project list of VMs, choose eidf029-host1_ssh
Enter your username
Enter the initial password
At the password change prompt, follow the instructions.

New User Login

Use the VDI as for the initial login, minus the password change step
Optionally, use the EIDF Gateway:

ssh -J [email protected] [email protected]

Alternatively, you can edit the ~/.ssh/config file (useful for VSCode):

Host eidf
    User USERNAME
    IdentityFile PATH-TO-KEY
    HostName 10.24.5.121
    ProxyJump [email protected]

and then access the cluster with ssh eidf.
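One way to check that the Host entry is picked up before connecting is ssh -G, which prints the configuration OpenSSH would use for an alias without opening a connection. The sketch below filters that output down to the settings relevant here; the helper name show_jump_settings is illustrative and not part of any EIDF tooling.

```shell
# Keep only the settings that matter for the jump-host setup.
# Reads the output of `ssh -G ALIAS` from stdin; ssh prints one
# lowercase "key value" pair per line.
show_jump_settings() {
    awk '$1 == "user" || $1 == "hostname" || $1 == "proxyjump" || $1 == "identityfile" { print }'
}

# Usage on your own machine (does not connect to anything):
#   ssh -G eidf | show_jump_settings
```

If the hostname line does not show the address from your config entry, the alias is probably not being matched, for example because of a typo in the Host line.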

Kubectl get nodes

(NB: This currently fails, but it is not needed.)

Run kubectl get nodes
Output should look like:

NAME       STATUS   ROLES                      AGE   VERSION
gpu-vm00   Ready    controlplane,etcd,worker   21d   v1.24.4
gpu-vm01   Ready    controlplane,etcd,worker   21d   v1.24.4
gpu-vm02   Ready    controlplane,etcd,worker   21d   v1.24.4
gpu-vm03   Ready    worker                     21d   v1.24.4
gpu-vm04   Ready    worker                     21d   v1.24.4
gpu-vm05   Ready    worker                     21d   v1.24.4
gpu-vm06   Ready    worker                     21d   v1.24.4
gpu-vm07   Ready    worker                     21d   v1.24.4

Run a pod

Open an editor of your choice and create the file test_NBody.yml
Put the following into the file:

apiVersion: v1
kind: Pod
metadata:
  # generateName makes Kubernetes append a random suffix, e.g. sample-7gdtb
  generateName: sample-
spec:
  restartPolicy: OnFailure
  containers:
  - name: cudasample
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
    args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"]
    resources:
      limits:
        # Request a single GPU for this pod
        nvidia.com/gpu: 1

Save the file and exit the editor
Run `kubectl create -f test_NBody.yml`
This will output something like:

pod/sample-7gdtb created
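Because the manifest uses generateName, the pod name is different on every run, so scripts need to capture it from the output. A minimal sketch, assuming the output has the "pod/NAME created" shape shown above (or the "pod/NAME" shape printed by kubectl create -o name); the function name is illustrative.

```shell
# Extract the bare pod name from a line such as "pod/sample-7gdtb created"
# (default output) or "pod/sample-7gdtb" (output of `kubectl create -o name`).
pod_name_from_create_output() {
    # Drop the "pod/" resource prefix, then anything after the first space.
    printf '%s\n' "$1" | sed -e 's|^pod/||' -e 's| .*$||'
}

# Hypothetical usage on the login host:
#   POD=$(pod_name_from_create_output "$(kubectl create -f test_NBody.yml)")
#   kubectl logs "$POD"
```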

Run kubectl get pods
This will output something like:

pi-tt9kq                                                          0/1     Completed   0              24h
sample-24n7n                                                      0/1     Completed   0              24h
sample-2j5tc                                                      0/1     Completed   0              24h
sample-2kjbx                                                      0/1     Completed   0              24h
sample-2mnvg                                                      0/1     Completed   0              24h
sample-4sng2                                                      0/1     Completed   0              24h
sample-5h6sr                                                      0/1     Completed   0              24h
sample-6bqql                                                      0/1     Completed   0              24h
sample-7gdtb                                                      0/1     Completed   0              39s
sample-8dnht                                                      0/1     Completed   0              24h
sample-8pxz4                                                      0/1     Completed   0              24h
sample-bphjx                                                      0/1     Completed   0              24h
sample-cp97f                                                      0/1     Completed   0              24h
sample-gcbbb                                                      0/1     Completed   0              24h
sample-hdlrr                                                      0/1     Completed   0              24h
sample-hkwk2                                                      0/1     Completed   0              24h
sample-j66ck                                                      0/1     Completed   0              24h
sample-jxhtk                                                      0/1     Completed   0              24h
sample-lzmg8                                                      0/1     Completed   0              24h
sample-nhrtk                                                      0/1     Completed   0              24h
sample-rh9v7                                                      0/1     Completed   0              24h
sample-v48jd                                                      0/1     Completed   0              24h
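On a shared cluster this list quickly fills with other users' pods. The sketch below picks out pod names by prefix and status; it assumes the default kubectl get pods columns (NAME, READY, STATUS, RESTARTS, AGE), and the function name pods_matching is illustrative.

```shell
# Print the names of pods whose name starts with PREFIX and whose STATUS
# column equals STATUS. Reads `kubectl get pods` output from stdin and
# skips the header row if one is present.
pods_matching() {
    prefix=$1
    status=$2
    awk -v prefix="$prefix" -v status="$status" \
        '$1 != "NAME" && index($1, prefix) == 1 && $3 == status { print $1 }'
}

# Hypothetical usage on the login host:
#   kubectl get pods | pods_matching sample- Completed
```

The resulting names can then be passed to other kubectl commands. Since sample- is shared by everyone who runs the test manifest, a personal prefix in generateName makes this kind of filtering safer.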

View the logs of the pod you ran with kubectl logs sample-7gdtb
This will output something like:

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Fullscreen mode
> Simulation data stored in video memory
> Double precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.0

> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]
number of bodies = 512000
512000 bodies, total time for 10 iterations: 10570.778 ms
= 247.989 billion interactions per second
= 7439.679 double-precision GFLOP/s at 30 flops per interaction
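As a sanity check, the last two figures can be recomputed from the body count and timing in the log. The sketch assumes an all-pairs simulation, i.e. N × N interactions per iteration; that assumption is not stated in the log, but it reproduces the reported numbers.

```shell
# Recompute the throughput figures from the benchmark log.
# Arguments: number of bodies, iterations, total time in milliseconds.
nbody_throughput() {
    awk -v n="$1" -v iters="$2" -v ms="$3" 'BEGIN {
        # Assumed all-pairs simulation: every body interacts with every body.
        interactions_per_s = n * n * iters / (ms / 1000)
        printf "%.3f billion interactions per second\n", interactions_per_s / 1e9
        # 30 floating-point operations per interaction, as stated in the log.
        printf "%.3f GFLOP/s\n", interactions_per_s * 30 / 1e9
    }'
}

nbody_throughput 512000 10 10570.778
```

With the values from the log above, this prints 247.989 billion interactions per second and 7439.679 GFLOP/s, matching the benchmark output.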

Delete your pod with kubectl delete pod sample-7gdtb
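When repeating the test, it can help to give your pods a personal name prefix and to make the GPU count explicit, so that your pods are easy to tell apart from everyone else's. The sketch below renders the same manifest with those two values as parameters; the function name and the idea of using your username as the prefix are illustrative assumptions, not a cluster requirement.

```shell
# Print the test pod manifest with a custom name prefix and GPU count.
render_pod_manifest() {
    prefix=$1   # hypothetical per-user name prefix, e.g. your username
    gpus=$2     # number of GPUs to request for the pod
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  generateName: ${prefix}-
spec:
  restartPolicy: OnFailure
  containers:
  - name: cudasample
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
    args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"]
    resources:
      limits:
        nvidia.com/gpu: ${gpus}
EOF
}

# Hypothetical usage on the login host:
#   render_pod_manifest dmckay 1 > test_NBody.yml
#   kubectl create -f test_NBody.yml
```

Keep the GPU count as low as the job allows, since the cluster currently has no quotas.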

Running your own experiments

Follow this guide to get started, and check the following tools from the amazing @AntreasAntoniou:

https://github.com/BayesWatch/kubeproject for general kubectl stuff and understanding what’s going on.
https://github.com/AntreasAntoniou/kubejobs for Python-based Kubernetes job launching that covers a lot of options for the YAML, but in Python class format.
https://github.com/AntreasAntoniou/minimal-ml-template/tree/main/kubernetes for a minimal ML project that can run on a Kubernetes cluster

Acknowledging EIDF in your work

More details here
