slurmmon's Introduction

Slurmmon is a system for gaining insight into Slurm and the jobs it runs. It's meant for cluster administrators looking to measure the effects of configuration changes and raise cluster utilization. Features include:

  • trending all the scheduler performance diagnostics (the numbers from sdiag)
  • measuring job turnaround time of probe jobs, as a bellwether of scheduling issues
  • creating daily whitespace reports -- identifying specific users and jobs with low utilization of their allocations (the jobs that lead to the dreaded whitespace gap in plots of total resources vs. used resources)

Slurmmon is meant to run on a RHEL/CentOS/SL 6 based system and currently uses Ganglia for data collection and Apache/mod_python for reporting. The components are:

  • slurmmon-daemon -- the daemons that query Slurm and send data to Ganglia
  • slurmmon-ganglia -- the Ganglia custom reports that use php to stack raw rrd data
  • slurmmon-web -- a set of web pages that organize all the reports and relevant plots
  • slurmmon-python -- a general python interface to Slurm, using dict-based io pipelines and lazy evaluation (but being replaced by dio and slyme)

See the doc directory for more information, specifically:

  • INSTALL for initial installation and setup
  • FAQ for answers to common questions and other details

Here is a screenshot of the basic diagnostic report from the production cluster at @fasrc:

slurmmon screenshot

It shows how something interesting happened on the 31st -- there was a spike in job turnaround and slurmctld agent queue size.


Here is an example daily whitespace (CPU waste) report:

slurmmon whitespace report screenshot

Of the jobs that completed in that day, the top CPU-waster was sophia's, and it was a case of mismatched Slurm -n (128) and mpirun -np (16) (the latter is unnecessary -- user education opportunity). Lots of other jobs show the issue of asking for many CPU cores but using only one. The job IDs are links to full details.


Here is a stack of plots from our Slurm upgrade from 2.6.9 to 14.03.4 around 10:00 a.m.:

slurm upgrade

It shows the much faster backfill scheduler runs (top plot), deeper backfill scheduler runs (middle plot), and higher job throughput (slope of completed jobs in bottom plot).

slurmmon's Issues

ensure all html content that's user-provided is escaped

e.g. JobScript, JobName, etc.

Most (all?) of it currently is, but I think that's by virtue of the syntax highlighting code, and might get lost if that's configured off.

Pull request #2 has some nice PIL code for converting to an image instead, but the copy/paste and grep-ability of text is really helpful, and it would be nice to keep package dependencies to a minimum.
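
A minimal sketch of escaping user-provided fields explicitly, so that it no longer depends on the syntax highlighter being enabled. It assumes Python 3 and the standard-library `html.escape`; the function name `render_job_field` and the table-row markup are hypothetical and not taken from slurmmon.

```python
import html


def render_job_field(label, value):
    """Return an HTML table row with the user-provided value escaped.

    Hypothetical helper for illustration; not part of slurmmon.
    """
    # html.escape replaces &, < and >, and with quote=True also " and '.
    safe_label = html.escape(str(label), quote=True)
    safe_value = html.escape(str(value), quote=True)
    return "<tr><th>%s</th><td><pre>%s</pre></td></tr>" % (safe_label, safe_value)


# A job name that would otherwise be interpreted as markup by the browser.
row = render_job_field("JobName", "<script>alert('x')</script>")
print(row)
```

The escaped text stays copyable and grep-able, and no extra package is needed.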

slurmmon web page

Yet another thing...
I like your plugin, which is why I really want to use it, but some things remain unclear to me.
This is about the web page. I installed to the default path /var/www/html/slurmmon. The PHP files for the graphs are located at /var/www/html/ganglia/graph.d.
When I look at my Ganglia web page I see the slurmmond_* reports. But when I open the slurmmon page I only see text and broken images, although the graph.php calls look the same as on my Ganglia page.
The whitespace report works, at least for the first 3 columns. The rest is not shown (CPU efficiency and so on). Unfortunately, it is not clear to me how to set the whole thing up so that I can look at the graphs in Ganglia and also use the slurmmon web page. Or is it either Ganglia or slurmmon?
I'd be thankful for any advice.

Definition Cores_Wasted

Hi again,
I'm looking through the code because I was wondering about the cores-wasted numbers in the whitespace report. In Node.py (line 89, I think) you define Cores_Wasted as CPUAlloc - CPULoad. As far as I can see, CPULoad is the 1-minute load average from /proc/loadavg and CPUAlloc is the number of allocated cores.
The name "Cores wasted" sounds to me like unused, non-utilized cores: for example, somebody requesting a full node with N cores, using just one of them, and therefore wasting N-1 cores.
The more I think about it, the more I can see why you chose CPUAlloc - CPULoad, since the difference approaches zero when the CPUs are "loaded" enough. Still, the name in the report is somewhat misleading. I will add the definition to the whitespace report table for now, but I would be interested in your thoughts on this.
By the way, are GPUs supported already? I think I didn't see anything in the code about it.
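
To make the definition discussed above concrete, here is a small sketch of the arithmetic. It only restates Cores_Wasted = CPUAlloc - CPULoad as described in this issue; the function name and the sample numbers are made up for illustration and are not taken from Node.py.

```python
def cores_wasted(cpu_alloc, cpu_load):
    """Cores_Wasted as described in this issue: allocated cores minus load average."""
    return cpu_alloc - cpu_load


# 16 allocated cores with a single busy process: almost everything is idle.
print(cores_wasted(16, 1.0))  # 15.0

# 16 allocated cores that are nearly fully loaded: the value approaches zero.
print(round(cores_wasted(16, 15.8), 2))

# A load average can exceed the core count, so the value can go negative
# (assumed sample load of 20.0 on 16 cores).
print(cores_wasted(16, 20.0))
```

Because the load average counts runnable processes rather than cores in use, the result is an estimate and not an exact count of idle cores.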

problems slurmmon_whitespace_report

Running /usr/sbin/slurmmon_whitespace_report shows:
Traceback (most recent call last):
File "/usr/sbin/slurmmon_whitespace_report", line 230, in <module>
live_nodes_cpu=True,
File "/usr/sbin/slurmmon_whitespace_report", line 117, in write_report
for i, x in enumerate((j['User'], j['JobID'], int(round(j['CPU_Wasted']/(60*60*24))), '%d%%' % int(round(j['CPU_Efficiency']*100)), j['NCPUS'], config.syntax_highlight(j['JobScriptPreview']))):
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Setting completed_jobs_cpu=False, I obtain:
*** ERROR *** unable to parse squeue job text ['']: need more than 0 values to unpack

Any advice? My current guess would be that you're using a different resource management in slurm. Maybe you could provide your settings in slurm.conf or add the configuration prerequisites for using slurmmon?
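
The `*` in the TypeError points at `j['CPU_Efficiency']*100`, since the other arithmetic on that line is a division, so `CPU_Efficiency` was apparently `None` for at least one job. A minimal sketch of a guard for that case follows; `format_efficiency` and the "n/a" placeholder are hypothetical and not part of slurmmon, and this does not address why the value is missing in the first place.

```python
def format_efficiency(cpu_efficiency):
    """Format a CPU efficiency fraction as a percentage string.

    Returns "n/a" when no value is available, instead of raising
    TypeError on None * 100. Hypothetical helper for illustration.
    """
    if cpu_efficiency is None:
        return "n/a"
    return "%d%%" % int(round(cpu_efficiency * 100))


print(format_efficiency(0.5))   # 50%
print(format_efficiency(None))  # n/a
```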

slurm commands not found

I'm seeing another issue, which may just be my own fault. When I install the RPMs and then start slurmmond, the daemon doesn't find the Slurm commands, which are located in /usr/local/bin. The PATH environment variable does include /usr/local/bin. Here's a trace of the messages that I just took:

slurmmond[5341]: starting
slurmmond[5341]: started sdiag metrics process, pid [5342]
slurmmond[5341]: started jobcount metrics process, pid [5343]
slurmmond[5341]: started reserved cores metrics process, pid [5345]
slurmmond[5341]: started [probejob-compute,IB] metrics process, pid [5346]
slurmmond(sdiag)[5342]: metrics for [slurmmond(sdiag)] failed with message [[Errno 2] No such file or directory]
slurmmond(jobcount)[5343]: metrics for [slurmmond(jobcount)] failed with message [shell code ["squeue -h -o '%u' -t PD | wc -l"] failed with exit status [0], stderr is ['/bin/sh: squeue: command not found\n']]
slurmmond(probejob-compute,IB)[5346]: metrics for [slurmmond(probejob-compute,IB)] failed with message [job submission ["sbatch '-p' 'compute,IB' '-J' 'probejob' '-n' '1' '-t' '2' '--mem' '10' '-o' '/dev/null' '-e' '/dev/null' --wrap 'true'"] failed with non-zero returncode [127] and/or non-empty stderr ['/bin/sh: sbatch: command not found']]
slurmmond(reservations)[5345]: metrics for [slurmmond(reservations)] failed with message [[Errno 2] No such file or directory]

If I add /usr/local/bin/ to all of the command calls, slurmmond works.
I also tried adding a sys.path.append, but this didn't work. Am I doing something wrong?
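
A note on why `sys.path.append` has no effect here: `sys.path` is Python's module search path, while executables such as `squeue` are looked up through the `PATH` environment variable of the running process. A daemon started from an init script or through sudo may also get a different `PATH` than an interactive shell. The sketch below assumes Python 3; the helper `ensure_on_path` is hypothetical and not part of slurmmon.

```python
import os
import shutil


def ensure_on_path(directory):
    """Prepend a directory to this process's PATH if it is not already there.

    Child processes started afterwards (e.g. via subprocess) inherit the change.
    """
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
        os.environ["PATH"] = os.pathsep.join(parts)
    return os.environ["PATH"]


ensure_on_path("/usr/local/bin")

# shutil.which returns the full path of the command, or None if it is not found.
print(shutil.which("squeue"))
```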

rhel7/centos7

Any plans to migrate to RHEL7/CentOS7?
I got the RPMs rebuilt for RHEL7, installed them, and have the services running, but I ran into problems with the index.psp file in slurmmon-web. It seems mod_python is no longer available and has been replaced by mod_wsgi. I'm not a Python web programmer, so I'm stuck here.
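
For orientation, this is roughly what the entry point of a page looks like under WSGI, the interface that mod_wsgi serves. It is a minimal sketch of a generic WSGI application under Python 3, not a port of index.psp; the page body is a placeholder.

```python
def application(environ, start_response):
    """Minimal WSGI callable; mod_wsgi looks for the name `application` by default."""
    # Placeholder body; a real port would generate the report page here.
    body = b"<html><body><h1>slurmmon</h1></body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # WSGI expects an iterable of byte strings.
    return [body]
```

Unlike a .psp template, the response is built in plain Python and returned as bytes, so the embedded template logic would need to move into functions like this one.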

Error start slurmmon service

Hello,
I'm trying to install slurmmon on Linux Mint 18.3.
I successfully installed the RPMs, but I get an error when I try to start the slurmmon service:

Job for slurmmond.service failed because the control process exited with error code. See "systemctl status slurmmond.service" and "journalctl -xe" for details.

And the "journalctl -xe" output is:

-- Unit slurmmond.service has begun starting up.
Nov 29 09:42:18 master.lbn.com sudo[1745]:     root : TTY=unknown ; PWD=/ ; USER=slurmmon ; COMMAND=/usr/bin/python /
Nov 29 09:42:18 master.lbn.com sudo[1745]: pam_unix(sudo:session): session opened for user slurmmon by (uid=0)
Nov 29 09:42:18 master.lbn.com slurmmond[1732]: Starting slurmmond... Traceback (most recent call last):
Nov 29 09:42:18 master.lbn.com slurmmond[1732]:   File "/usr/sbin/slurmmond", line 142, in <module>
Nov 29 09:42:18 master.lbn.com slurmmond[1732]:     import slurmmon
Nov 29 09:42:18 master.lbn.com slurmmond[1732]: ImportError: No module named slurmmon
Nov 29 09:42:18 master.lbn.com sudo[1745]: pam_unix(sudo:session): session closed for user slurmmon
Nov 29 09:42:18 master.lbn.com slurmmond[1732]: *** FAILED ***
Nov 29 09:42:18 master.lbn.com systemd[1]: slurmmond.service: Control process exited, code=exited status=1
Nov 29 09:42:18 master.lbn.com systemd[1]: Failed to start SYSV: slurmmond - gather data about SLURM behavior.
-- Subject: Unit slurmmond.service has failed

How can I fix this?
Thanks
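
One possible cause, offered as a guess rather than a diagnosis: RPMs built for RHEL/CentOS install Python modules into that distribution's module directory, which the interpreter on a Debian-based system such as Linux Mint may not search. A quick way to check is to ask the interpreter where it would load the module from. The sketch below assumes Python 3 (`importlib.util.find_spec`), whereas the log above shows `/usr/bin/python`, which may be Python 2 on that system; `module_location` is a hypothetical helper.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would be loaded from, or None if not importable."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# None here means the interpreter cannot see the module on its search path.
print(module_location("slurmmon"))

# The directories the interpreter searches; the installed slurmmon package
# has to live in (or be linked into) one of these.
for directory in sys.path:
    print(directory)
```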
