Noded - A Slurm Node Monitoring Daemon

What is Noded?

Noded (pronounced node-d) is a Python daemon that runs on compute nodes in a Slurm HPC cluster. It collects system information and job-related details for all Slurm jobs running on the compute node.

Existing monitoring daemons can monitor operating system metrics, protocol metrics, and specific application metrics. However, Slurm jobs in a cluster typically use a variety of single- and multi-threaded applications that run on a single node or are parallelized across multiple nodes. Most monitoring daemons must be preconfigured with the process names to monitor. Noded, by contrast, automatically discovers all jobs running on the node.

What do you need?

supervisord

Slurm jobs can finish very quickly, and any uncaught exception kills noded. Under the supervisord watchdog, noded respawns automatically, within the thresholds set in the supervisord configuration (for example, no more than 3 times in 60 seconds).

Noded depends on Slurm being configured with cgroups for CPU and memory, using one of the following consumable-resource settings:

  • CR_CPU_Memory
  • CR_Core_Memory

How does it work?

Noded periodically ships all job information on a Slurm node to a Redis database in JSON format.

Noded reads uids and jobids from the cgroup hierarchy, then identifies all child PIDs of each job and the CPUs those processes are running on.
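The discovery step can be illustrated with a minimal sketch. It is not noded's actual code. It assumes the cgroup v1 path layout that Slurm commonly uses (slurm/uid_<uid>/job_<jobid>) and reads the "processor" field (field 39 in proc(5)) of /proc/<pid>/stat. The function names are hypothetical.

```python
import re

# Assumption: Slurm's cgroup v1 layout, .../slurm/uid_<uid>/job_<jobid>[/step_<id>]
_SLURM_CGROUP = re.compile(r"/slurm/uid_(\d+)/job_(\d+)")


def parse_slurm_cgroup(cgroup_text):
    """Return (uid, jobid) from the contents of /proc/<pid>/cgroup, or None."""
    for line in cgroup_text.splitlines():
        match = _SLURM_CGROUP.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def last_cpu_from_stat(stat_text):
    """Return the CPU a process last ran on, from the contents of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split only what follows the final ')'.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field 39 (processor) is fields[36].
    return int(fields[36])
```

On a compute node, the two helpers would be fed the text of /proc/<pid>/cgroup and /proc/<pid>/stat for each PID found under a job's cgroup.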

Job data can be queried directly from the Redis database rather than from the Slurm controller, which avoids adding load to slurmctld.
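As a rough sketch of the write and read paths described above, the helpers below store one job record as JSON with an expiry and read it back. The key layout (hostname:jobid), the function names, and the record fields are hypothetical; the real layout is described in the Schema Reference. The client is assumed to behave like a redis-py client, whose set() accepts an ex= expiry in seconds.

```python
import json


def job_key(hostname, jobid):
    # Hypothetical key layout, for illustration only.
    return "{0}:{1}".format(hostname, jobid)


def ship_job(client, hostname, job, ttl_seconds=60):
    """Store one job record as JSON; the key expires after ttl_seconds."""
    key = job_key(hostname, job["jobid"])
    client.set(key, json.dumps(job, sort_keys=True), ex=ttl_seconds)
    return key


def fetch_job(client, hostname, jobid):
    """Read a job record back, or return None if the key is missing or expired."""
    raw = client.get(job_key(hostname, jobid))
    if raw is None:
        return None
    if isinstance(raw, bytes):
        # redis-py returns bytes unless decode_responses=True is set.
        raw = raw.decode("utf-8")
    return json.loads(raw)
```

With a real server, client would be something like redis.StrictRedis(host=..., port=6379). Because every key carries an expiry, records for finished jobs disappear without any cleanup pass.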

Redis was chosen because it is in-memory, very fast, and allows for a stateless design. On a cluster with 2,000 nodes, 50,000 CPUs, and 5,000 concurrent jobs, Redis consumes only about 45 MB of RAM. If Redis goes down, all values are lost, because Redis is configured not to persist data. This is acceptable because once Redis becomes available again, all nodes repopulate the database within the default 30 seconds.

Noded is single threaded.

Noded reports only jobs in the RUNNING state. If a job is PENDING, it has not been allocated resources on a node, so Noded does not know about it.

Installation

  • Requires python-redis, plus hiredis for performance
  • Requires supervisord
  • Requires nose for testing
  • Requires a Redis server (>=2.8)

Works on Linux only (depends on /proc). Works with Python 2.6, 2.7 and 3.3+.

Configuration

  • Redis credentials
  • sleep timer
  • Redis key expiration
  • overloaded noded ring buffer length

Starting/Stopping Noded

As root, use supervisorctl to start and stop noded, as well as to check the status:

# supervisorctl status
noded                            RUNNING   pid 44311, uptime 7 days, 3:24:43
# supervisorctl stop noded
noded: stopped
# supervisorctl start noded
noded: started

Resources

  • Schema Reference
  • Examples
  • Tuning Redis and OS

Todo

  • Capture more exceptions
  • Log exceptions with intuitive messages
  • Use exceptions as test cases
  • Provide help
  • Check whether root is necessary; if not, provide instructions to create noded:noded
  • Provide an Ansible playbook for installation

License

This project is in the worldwide public domain. See LICENSE and CONTRIBUTING for more information.
