SX-Aurora-Slurm-Plugin

This GRES plugin for Slurm schedules whole Aurora cards on nodes through Slurm's generic resource (GRES) mechanism. Sharing one Aurora card between jobs is not supported.

Check https://github.com/SX-Aurora/SX-Aurora-Slurm-Plugin/releases for the latest release. The plugin has been tested with Slurm 20.11.

Getting started

Compiling

You need to compile custom code for Slurm:

  1. Clone this repo to src/plugins/gres/ve in your local copy of Slurm or unpack the release tarball and copy the files to that folder.
  2. Add ''src/plugins/gres/ve/Makefile'' to the configure.ac file in the slurm root directory.
  3. Remove the existing configure file.
  4. Add ''ve'' to the SUBDIRS variable in src/plugins/gres/Makefile.am
  5. Run autoreconf.
  6. Run make && make install if this is a new Slurm source tree, or patch the slurm.spec file instead.
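
Steps 2 to 5 can be sketched as two small shell helpers. This is only a sketch under stated assumptions: the helper names are hypothetical, configure.ac is assumed to list src/plugins/gres/gpu/Makefile on its own line inside AC_CONFIG_FILES, and SUBDIRS in src/plugins/gres/Makefile.am is assumed to be a single line without continuation. Check both files in your Slurm version before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers for steps 2 and 4; not part of the plugin itself.

# Step 2: list the ve Makefile in configure.ac, right after the gpu one.
# Assumption: the gpu Makefile entry sits on its own line.
add_ve_to_configure_ac() {
    configure_ac=$1
    if grep -q 'src/plugins/gres/ve/Makefile' "$configure_ac"; then
        return 0    # already listed, nothing to do
    fi
    awk '
        { print }
        /src\/plugins\/gres\/gpu\/Makefile/ { print "\t\t src/plugins/gres/ve/Makefile" }
    ' "$configure_ac" > "$configure_ac.new" && mv "$configure_ac.new" "$configure_ac"
}

# Step 4: append "ve" to the SUBDIRS line unless it is already there.
# Assumption: SUBDIRS is a single line. A .bak copy of the file is left behind.
add_ve_subdir() {
    makefile_am=$1
    if grep -Eq '^SUBDIRS[[:space:]]*=.*[[:space:]]ve([[:space:]]|$)' "$makefile_am"; then
        return 0
    fi
    sed -i.bak -E '/^SUBDIRS[[:space:]]*=/ s/[[:space:]]*$/ ve/' "$makefile_am"
}

# Example invocation from the Slurm source root (commented out so that
# sourcing this sketch has no side effects):
#   add_ve_to_configure_ac configure.ac          # step 2
#   rm -f configure                              # step 3
#   add_ve_subdir src/plugins/gres/Makefile.am   # step 4
#   autoreconf                                   # step 5
```

Both helpers are idempotent, so running them twice on the same tree should not add duplicate entries.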

It is recommended to build a separate Slurm cluster on the Aurora nodes for testing before moving the setup to production. In that case both slurmctld and slurmdbd are needed, because GRES is used. Remember to change the ports in slurm.conf if you have other slurmctlds running; otherwise the reconfiguration may crash the other slurmctld.
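
A quick way to catch a port clash before starting the test cluster is to compare the two slurm.conf files. The sketch below is not part of the plugin; the helper names and file paths are hypothetical, and it only compares SlurmctldPort and SlurmdPort when both files set them explicitly. Keys that are not set are skipped, although Slurm then falls back to its defaults (6817 and 6818).

```shell
#!/bin/sh
# Hypothetical helpers for comparing ports between two slurm.conf files.

# Print the value assigned to a key in a slurm.conf-style file.
# Comments and blanks are stripped; the key is matched case-insensitively.
conf_value() {
    awk -F= -v want="$1" '
        { sub(/#.*/, ""); gsub(/[ \t]/, "") }
        tolower($1) == tolower(want) { value = $2 }
        END { print value }
    ' "$2"
}

# Succeed (and report) if both files set the same SlurmctldPort or SlurmdPort.
ports_conflict() {
    status=1
    for port_key in SlurmctldPort SlurmdPort; do
        a=$(conf_value "$port_key" "$1")
        b=$(conf_value "$port_key" "$2")
        if [ -n "$a" ] && [ "$a" = "$b" ]; then
            echo "conflict: $port_key=$a in both files"
            status=0
        fi
    done
    return $status
}

# Example (paths are placeholders):
#   ports_conflict /etc/slurm/slurm.conf /opt/slurm-test/etc/slurm.conf
```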

Slurm-Configuration

  1. The nodes should be configured in your slurm.conf or an appropriate include file. The node definition should look like:

GresTypes=ve
SelectType=select/cons_tres
NodeName=<nodename> Gres=ve:<count>

With multiple Vector Engines (VEs) per Vector Host (VH) you have to create a shared partition. It is also recommended to define the CPUs and the memory of the nodes as shared resources, as in the example below with two A300-8 nodes:

GresTypes=ve
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
NodeName=vh[100-101] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=192078 Gres=ve:8
PartitionName=aurora Shared=Yes Nodes=vh[100-101]

  2. gres.conf should contain at least:

NodeName=<nodename> Name=ve File=/dev/veslot[<ve slot numbers as csv>]

So for the example with the two A300-8 nodes it would be:

NodeName=vh[100-101] Name=ve File=/dev/veslot[0,1,2,3,4,5,6,7]

  3. cgroup.conf needs to have ConstrainDevices=yes.

  4. cgroup_allowed_devices_file.conf must contain at least:

/dev/cpu/*/*
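
The gres.conf and cgroup.conf settings above can be sketched and sanity-checked with two small shell helpers. These are hypothetical helpers, not part of the plugin, and the first one assumes that the VE slots on a node are numbered contiguously from 0, as in the eight-slot example.

```shell
#!/bin/sh
# Hypothetical helpers for the gres.conf and cgroup.conf settings.

# Print a gres.conf line listing VE slots 0..count-1 for a node expression.
# Assumption: slots are numbered contiguously starting at 0.
ve_gres_line() {
    nodes=$1
    count=$2
    slots=$(seq 0 $((count - 1)) | paste -sd, -)
    printf 'NodeName=%s Name=ve File=/dev/veslot[%s]\n' "$nodes" "$slots"
}

# Succeed if the given cgroup.conf enables ConstrainDevices.
constrains_devices() {
    grep -Eiq '^[[:space:]]*ConstrainDevices[[:space:]]*=[[:space:]]*yes' "$1"
}

# Prints: NodeName=vh[100-101] Name=ve File=/dev/veslot[0,1,2,3,4,5,6,7]
ve_gres_line 'vh[100-101]' 8
```

The second helper could be used as `constrains_devices /etc/slurm/cgroup.conf || echo "ConstrainDevices is not enabled"`, where the path is an assumption about your installation.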

Once the configuration files are changed, restart slurmctld and the slurmds. Then run a small test such as:

srun -n1 --gres=ve:1 -p<yourpartition> env|sort 

and check that the SLURM variables are set; VE_NODE_NUMBER should also be present.
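
The check can be automated by piping the job's environment through a small filter. This is a sketch with a hypothetical helper name; it only tests that VE_NODE_NUMBER and at least one variable starting with SLURM_ are present, not that their values are correct.

```shell
#!/bin/sh
# Hypothetical helper: read `env` output on stdin and succeed only if
# VE_NODE_NUMBER and at least one SLURM_* variable are present.
check_ve_env() {
    awk -F= '
        $1 == "VE_NODE_NUMBER" { ve = 1; print "VE_NODE_NUMBER=" $2 }
        $1 ~ /^SLURM_/         { slurm = 1 }
        END { exit !(ve && slurm) }
    '
}

# Example, using the test command from above:
#   srun -n1 --gres=ve:1 -p<yourpartition> env | check_ve_env || echo "VE environment incomplete"
```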

This GRES module supports only single-node Aurora jobs! The environment variables inside a job are set to support NEC MPI in distributed mode, so inside your job script you should be able to simply run:

mpirun <executable>

Contributors

efocht, henkela
