amq92 / simple_slurm

A simple Python wrapper for Slurm with flexibility in mind.

License: GNU Affero General Public License v3.0

Python 100.00%

simple_slurm's People

jacobdanovitch jacobog02 mamachra mkgessen cnut1648 thisiscam eric-vader hunoutl milmillin iraikov xiaoqiwang19 db0 cecilpert meliao tzachar

simple_slurm's Issues

Please Release 0.1.7

Thanks for your work here! I'd like to use the verbose option with simple-slurm, but it doesn't appear to be in the current release on PyPI. Could you please bump the version and release?

typo in README.md

Hello,

In the introduction section of the README file, the import statement should be:

from simple_slurm import Slurm

instead of

from slurm import Slurm

Optional output of sbatch script to file

Hello,

Thank you for creating this very useful package!

I found it convenient to add an option to the sbatch procedure to output to file the script that is submitted. This helps particularly with troubleshooting to verify that the correct arguments are passed to the sbatch command. Are you interested in adding such functionality, and would you like me to submit a pull request? Thank you!
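A minimal sketch of the requested behaviour, assuming (as reported in another issue on this page) that printing a Slurm object renders the batch-script header. The helper name `save_script` and its parameters are hypothetical and not part of simple_slurm.

```python
from pathlib import Path


def save_script(slurm_obj, run_cmd, path):
    """Write the would-be batch script to `path` and return its text.

    Assumption: str(slurm_obj) renders the script header (shebang plus
    '#SBATCH ...' lines). `save_script` is a hypothetical helper.
    """
    script = f"{str(slurm_obj).rstrip()}\n\n{run_cmd}\n"
    Path(path).write_text(script)
    return script
```

Calling `save_script(slurm, 'python train.py', 'job.sh')` before `slurm.sbatch(...)` would leave a copy of the script on disk for troubleshooting.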

squeue functionality

It would be nice to add some squeue functionality, so that I can access my jobs as dictionaries. The code below is functioning; it was mostly generated by ChatGPT-4 and checks the SQUEUE_FORMAT env var. One could argue that a malformed SQUEUE_FORMAT should lead not to an error but to a fallback to default_format.

import os
import subprocess
import csv
from io import StringIO

class SlurmSqueueWrapper:

    def __init__(self):
        self.command = "squeue"
        self.default_format = '"%i","%j","%t","%M","%L","%D","%C","%m","%b","%R"'
        self.output_format = os.getenv("SQUEUE_FORMAT", self.default_format)

        if not self._is_valid_csv_format(self.output_format):
            raise ValueError("Invalid CSV format in SQUEUE_FORMAT environment variable")

        self.jobs = []

    def run_squeue(self):
        result = subprocess.run([self.command, "--me", "-o", self.output_format],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

        if result.returncode != 0:
            raise RuntimeError(f"Error running squeue: {result.stderr.strip()}")

        self.jobs = self._parse_output(result.stdout.strip())

    def _is_valid_csv_format(self, format_str):
        try:
            sniffer = csv.Sniffer()
            dialect = sniffer.sniff(format_str, delimiters=',')
            dialect.strict = True
            # csv.reader is lazy: consume it so strict-mode errors are raised.
            list(csv.reader(StringIO(format_str), dialect=dialect))
            return True
        except csv.Error:
            return False

    def _parse_output(self, output):
        csv_file = StringIO(output)
        reader = csv.DictReader(csv_file, delimiter=',', quotechar='"', skipinitialspace=True)
        return list(reader)

    def display_jobs(self):
        for job in self.jobs:
            print(job)

if __name__ == "__main__":
    squeue = SlurmSqueueWrapper()
    squeue.run_squeue()
    squeue.display_jobs()

Output below (note that the squeue --me option is not available in older Slurm versions and should be replaced with -u $(whoami)):

[user123@login1 ~]$ squeue --me
             JOBID PART         NAME     USER ST       TIME  TIME_LEFT NOD CPU TRES_PER_ MIN_ NODELIST(REASON)
          22112971 prt1         wrap user123  R       0:02      59:58   1   1       N/A   4G node-2-44
          22112970 prt1         wrap user123  R       0:04      59:56   1   1       N/A   4G node-2-44
          22112966 prt1         wrap user123  R       1:29      58:31   1   1       N/A   4G node-2-44

[user123@login1 ~]$ python3 ./squeue.py
{'JOBID': '22112971', 'NAME': 'wrap', 'ST': 'R', 'TIME': '0:07', 'TIME_LEFT': '59:53', 'NODES': '1', 'CPUS': '1', 'MIN_MEMORY': '4G', 'TRES_PER_NODE': 'N/A', 'NODELIST(REASON)': 'node-2-44'}
{'JOBID': '22112970', 'NAME': 'wrap', 'ST': 'R', 'TIME': '0:09', 'TIME_LEFT': '59:51', 'NODES': '1', 'CPUS': '1', 'MIN_MEMORY': '4G', 'TRES_PER_NODE': 'N/A', 'NODELIST(REASON)': 'node-2-44'}
{'JOBID': '22112966', 'NAME': 'wrap', 'ST': 'R', 'TIME': '1:34', 'TIME_LEFT': '58:26', 'NODES': '1', 'CPUS': '1', 'MIN_MEMORY': '4G', 'TRES_PER_NODE': 'N/A', 'NODELIST(REASON)': 'node-2-44'}
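The fallback suggested in this issue (use the default format instead of raising on a malformed SQUEUE_FORMAT) and the `-u` replacement for `--me` could be sketched as follows. The function names and the field-matching regular expression are assumptions made for illustration; the regex only covers simple `%[.][size]type` fields.

```python
import csv
import os
import re
from io import StringIO

DEFAULT_FORMAT = '"%i","%j","%t","%M","%L","%D","%C","%m","%b","%R"'
# Assumed shape of one squeue field spec, e.g. %i or %.18i (a simplification).
_FIELD_RE = re.compile(r"%\.?\d*[A-Za-z]")


def is_csv_format(format_str):
    """Return True if format_str is one strict CSV row of squeue field specs."""
    try:
        rows = list(csv.reader(StringIO(format_str), delimiter=",",
                               quotechar='"', strict=True))
    except csv.Error:
        return False
    return len(rows) == 1 and all(_FIELD_RE.fullmatch(f) for f in rows[0])


def resolve_format(env=os.environ):
    """Use SQUEUE_FORMAT when it is CSV-shaped, else fall back to the default."""
    fmt = env.get("SQUEUE_FORMAT", DEFAULT_FORMAT)
    return fmt if is_csv_format(fmt) else DEFAULT_FORMAT


def squeue_args(user=None):
    """Build the argument list; pass a user name on Slurm versions without --me."""
    user_filter = ["--me"] if user is None else ["-u", user]
    return ["squeue", *user_filter, "-o", resolve_format()]


def parse_jobs(output):
    """Turn CSV-formatted squeue output (header row first) into dictionaries."""
    return list(csv.DictReader(StringIO(output), skipinitialspace=True))
```

With this variant a space-separated SQUEUE_FORMAT such as `%.18i %.9P` is silently replaced by the default rather than causing an exception.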

srun option in sbatch ?

In some contexts (e.g. multi-node execution), you have to call srun inside an sbatch script.
The workaround I use is to add 'srun' inside run_cmd:
slurm.sbatch('srun python train.py')

Would an srun option in the sbatch function be useful?
This is a possible contribution.

CLI for simple slurm

Hi,

Thank you very much for this package. It is super useful.

Would it be possible to have a simple command line interface for simple_slurm? It would be great if I could submit simple commands directly from the shell.
Something like:
simple_slurm --command "echo running;hostname" --cpus_per_task 1

Thanks!
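One way such a command line could be parsed, as a sketch only: `--command` is taken from the example above, and every other `--name value` pair is collected as a keyword option. The function `parse_cli` is hypothetical, and no such CLI is confirmed to ship with the package.

```python
import argparse


def parse_cli(argv):
    """Split argv into the command to submit and a dict of Slurm options."""
    parser = argparse.ArgumentParser(prog="simple_slurm", allow_abbrev=False)
    parser.add_argument("--command", required=True,
                        help="shell command(s) to submit")
    args, extra = parser.parse_known_args(argv)

    options = {}
    tokens = iter(extra)
    for token in tokens:
        if not token.startswith("--"):
            parser.error(f"unexpected argument: {token}")
        # Accept both '--name=value' and '--name value'.
        name, sep, value = token[2:].partition("=")
        if not sep:
            value = next(tokens, None)
            if value is None:
                parser.error(f"missing value for --{name}")
        options[name.replace("-", "_")] = value
    return args.command, options
```

The result could then be forwarded as `Slurm(**options).sbatch(command)`, assuming the constructor accepts the collected names as keyword arguments.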

Query: slurm.add_cmd() vs slurm.srun()

I'm seeing some confusing behavior with add_cmd() vs srun(). In particular, what I really want to do is add many spack commands to a batch script constructed by simple_slurm via add_cmd(), and then submit it.

The observed behavior is that I can put slurm.srun("spack find") and it will place the output of that command in the file I specify by output= in the initialization of Slurm as expected, but if I do something like slurm.add_cmd("spack find") and then slurm.srun("echo done!"), I'll only see the output for the echo.

What's going on here? Do I have an option other than chaining all the spack commands I want to use into the srun string? It seems that I may need to do that if I want to see the output at least. The print of the Slurm object looks just fine, but the stdout/stderr interactions of spack and simple_slurm may be tricky? Are you doing anything different with respect to stdout/stderr for add_cmd() vs srun()?

spack is also a Python package with likely complex subprocess handling/behavior, and I did confirm that if I copy/paste the output of "printing" the Slurm object to a batch script that it does correctly report stdout/stderr.

Add pre/post processing inside sbatch

Is adding a pre/post processing command possible?
For now I work around the problem like this, but it is not very clean:

slurm.sbatch(f'module purge \n module load pytorch-gpu \n srun python train.py')

My main use for the moment is managing virtual environments,
but I imagine there must be other use cases.

This is a possible contribution; how simple do you want to keep the package?
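Until the package offers something built in, the string concatenation can at least be kept readable with a small helper. This is a sketch; `build_run_cmd` and its parameters are hypothetical names, not part of simple_slurm.

```python
def build_run_cmd(main_cmd, pre=(), post=(), use_srun=False):
    """Join pre-processing, main and post-processing commands, one per line.

    `use_srun` prefixes only the main command with 'srun'.
    """
    main = f"srun {main_cmd}" if use_srun else main_cmd
    return "\n".join([*pre, main, *post])
```

For the case above: `slurm.sbatch(build_run_cmd('python train.py', pre=['module purge', 'module load pytorch-gpu'], use_srun=True))`.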

Bad error handling when failure

>>> slurm.sbatch('echo demo.py ' + Slurm.SLURM_ARRAY_TASK_ID)

File ".../site-packages/simple_slurm/core.py", line 131, in sbatch
    assert success_msg in stdout, result.stderr
AssertionError: None

It does not show what exactly went wrong, nor what the sbatch command printed.

E.g.,

sbatch: error: Invalid generic resource (gres) specification
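The `AssertionError: None` above is consistent with stderr not being captured when sbatch is run, although that is an inference from the traceback rather than something confirmed against the library's source. A sketch of submission code that surfaces the real message, with the hypothetical helper `submit`:

```python
import subprocess

SUCCESS_MSG = "Submitted batch job"


def submit(cmd_args, script):
    """Run an sbatch-like command, feeding `script` on stdin.

    Raises RuntimeError carrying the command's own error text instead of
    a bare assertion. `submit` is a hypothetical helper.
    """
    result = subprocess.run(cmd_args, input=script,
                            capture_output=True, text=True)
    if result.returncode != 0 or SUCCESS_MSG not in result.stdout:
        detail = result.stderr.strip() or result.stdout.strip() or "no output"
        raise RuntimeError(
            f"sbatch failed (exit code {result.returncode}): {detail}")
    return result.stdout.strip()
```

With `submit(['sbatch'], script_text)`, an invalid gres specification would then be reported as `RuntimeError: sbatch failed (exit code ...): sbatch: error: Invalid generic resource (gres) specification`.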

Dynamically added methods and attributes are not seen by linters

MWE :

from simple_slurm import Slurm
slurm = Slurm()
slurm.add_arguments(gres='gpu')  # <-- OK
slurm.add_gres('gpu')  # <-- NOK
cmd = 'echo ' + Slurm.SLURM_ARRAY_TASK_ID  # <-- NOK

pylint

main.py:5:0: E1101: Instance of 'Slurm' has no 'add_gres' member (no-member)
main.py:6:16: E1101: Class 'Slurm' has no 'SLURM_ARRAY_TASK_ID' member (no-member)

pylance / pyright

Cannot access member "add_gres" for type "Slurm"
Member "add_gres" is unknown Pylance(reportGeneralTypeIssues)

Cannot access member "SLURM_ARRAY_TASK_ID" for type "Type[Slurm]"
Member "SLURM_ARRAY_TASK_ID" is unknown Pylance(reportGeneralTypeIssues)

flake8 shows nothing (this is my default linter)

All linters were tested with their default configuration.
This is not a deal-breaker, but rather a nice-to-have feature.
A naive solution would be to hard-code the methods and attributes into the code (manually) instead of dynamically adding them.
However, I assume that a more elegant solution must exist ...
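One candidate for a less manual solution is to generate a type stub (`.pyi`) from the same name lists that drive the dynamic methods. The sketch below only renders the stub text; `render_stub`, the method signatures and the attribute types are assumptions, and whether a given linter reads stub files depends on the tool and its version.

```python
def render_stub(option_names, env_var_names):
    """Render stub source declaring the dynamically added members.

    The 'add_<option>' signature and the 'str' attribute type are
    assumptions made for illustration.
    """
    lines = ["class Slurm:"]
    lines += [f"    {name}: str" for name in env_var_names]
    lines.append("    def add_arguments(self, **kwargs: object) -> None: ...")
    lines += [
        f"    def add_{name}(self, value: object = ...) -> None: ..."
        for name in option_names
    ]
    return "\n".join(lines) + "\n"
```

Writing `render_stub(['gres'], ['SLURM_ARRAY_TASK_ID'])` next to the module as `core.pyi` at build time would keep the declarations in sync without hand-editing them.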

job_id fails to return when using clusters

Hi there,

It's possible to set a 'cluster' option in Slurm:

#SBATCH --clusters            {my_cluster_name}

From Slurm docs:

-M, --clusters=
Clusters to issue commands to. Multiple cluster names may be comma separated. The job will be submitted to the one cluster providing the earliest expected job initiation time. The default value is the current cluster. A value of 'all' will query to run on all clusters. Note the --export option to control environment variables exported between clusters. Note that the SlurmDBD must be up for this option to work properly.

When doing this, the return message will be:
Submitted batch job {job_ID} on cluster {my_cluster_name}

This causes a ValueError in simple-slurm at the moment:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-d58e7c3efb9b> in <module>
----> 1 slurm.sbatch(cmd)

simple_slurm/core.py in sbatch(self, run_cmd, convert, sbatch_cmd, shell)
    145         stdout = result.stdout.decode('utf-8')
    146         assert success_msg in stdout, result.stderr
--> 147         job_id = int(stdout.replace(success_msg, ''))
    148         print(success_msg, job_id)
    149         return job_id

ValueError: invalid literal for int() with base 10: ' {job_ID} on cluster {my_cluster_name}\n'

I think this would be fairly easy to catch with some additional string parsing
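The additional parsing could be a regular expression that accepts both forms of the success message. This is a sketch; `parse_submission` is a hypothetical helper, not the library's code.

```python
import re

# Matches 'Submitted batch job 123' with an optional ' on cluster NAME' suffix.
_SUBMIT_RE = re.compile(r"Submitted batch job (\d+)(?: on cluster (\S+))?")


def parse_submission(stdout):
    """Return (job_id, cluster) from sbatch output; cluster is None if absent."""
    match = _SUBMIT_RE.search(stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {stdout!r}")
    return int(match.group(1)), match.group(2)
```

An alternative that avoids parsing the human-readable message is sbatch's `--parsable` flag, which prints only the job ID and, when present, the cluster name separated by a semicolon.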

Shell hashbang option

It would be great if one could submit a batch job with a hashbang other than .../sh, e.g. .../bash. Perhaps this could be added as a run config? Something like slurm.sbatch(run commands, shell="bash").

How to get node TMPDIR?

I have trouble (1) retrieving the path of the node's temporary directory onto which files ought to be moved (usually available via $TMPDIR in the batch script) and (2) actually moving the necessary files to the node before starting the actual computation on the node. How do you go about this using the wrapper?
Thank you for your advice.
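One possible approach is to do the staging inside the batch script itself, so that `$TMPDIR` is resolved on the compute node. The sketch below only builds the command text; `staged_cmd` is a hypothetical helper, and it assumes the cluster sets `$TMPDIR` for jobs. Whether `$` variables are expanded at submission time or on the node depends on how the wrapper hands the script to sbatch, so it is worth checking the generated script.

```python
import shlex


def staged_cmd(inputs, main_cmd, outputs=(), results_dir="$SLURM_SUBMIT_DIR"):
    """Build commands that copy inputs to $TMPDIR, run there, copy results back."""
    lines = [f'cp {shlex.quote(path)} "$TMPDIR"/' for path in inputs]
    lines.append(f'cd "$TMPDIR" && {main_cmd}')
    lines += [
        f'cp "$TMPDIR"/{shlex.quote(name)} "{results_dir}"/' for name in outputs
    ]
    return "\n".join(lines)
```

The returned string would be passed as the run command, e.g. `slurm.sbatch(staged_cmd(['data.csv'], 'python run.py', outputs=['out.txt']))`.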

RuntimeError - Unable to find installation candidates for simple-slurm (0.1.7)

Our pipeline broke in production due to missing version (0.1.7) of simple-slurm

  RuntimeError
  Unable to find installation candidates for simple-slurm (0.1.7)
  at /usr/local/lib/python3.8/site-packages/poetry/installation/chooser.py:72 in choose_for
       68│ 
       69│             links.append(link)
       70│ 
       71│         if not links:
    →  72│             raise RuntimeError(
       73│                 "Unable to find installation candidates for {}".format(package)
       74│             )
       75│ 
       76│         # Get the best link

How to pass in options like --requeue?

Submitting Python functions?

Hi Arturo,

I was wondering whether it is possible to submit Python functions to Slurm with simple_slurm. It would be a super nice feature to have.
Do you know if anyone has already managed to do that?

Many thanks,
Federico
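Without dedicated support, one workaround is to turn a call to an importable function into a `python -c` command line and submit that. This sketch is limited to functions that can be imported on the compute node and to arguments whose `repr()` is valid Python source; `function_to_command` is a hypothetical helper.

```python
import shlex


def function_to_command(func, *args, python="python"):
    """Return a shell command that imports `func`, calls it and prints the result."""
    module, name = func.__module__, func.__qualname__
    if module == "__main__" or "<" in name:
        raise ValueError("function must be importable from a module")
    call_args = ", ".join(repr(arg) for arg in args)
    code = f"import {module}; print({module}.{name}({call_args}))"
    return f"{python} -c {shlex.quote(code)}"
```

For example, `slurm.sbatch(function_to_command(math.sqrt, 16))` would submit `python -c 'import math; print(math.sqrt(16))'`.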

Monitoring/waiting for job completion

Does this package support anything like this?

slurm.sbatch('python demo.py ' + Slurm.SLURM_ARRAY_JOB_ID)
slurm.wait_for_completion()

If not, I might be interested in contributing a bit.
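A polling loop built on `squeue` is one way this could work. The sketch treats a job as active for as long as `squeue -h -j <id>` lists it, which is an assumption about what "completed" should mean; both function names are hypothetical.

```python
import subprocess
import time


def job_is_active(job_id):
    """Return True while squeue still lists the job."""
    result = subprocess.run(["squeue", "-h", "-j", str(job_id)],
                            capture_output=True, text=True)
    return result.returncode == 0 and bool(result.stdout.strip())


def wait_for_completion(job_id, poll_interval=30.0, timeout=None,
                        is_active=job_is_active, sleep=time.sleep):
    """Block until the job is no longer active, polling every poll_interval seconds."""
    waited = 0.0
    while is_active(job_id):
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"job {job_id} still active after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Since `sbatch` returns the job ID, usage could be `wait_for_completion(slurm.sbatch('python demo.py'))`. The `is_active` and `sleep` parameters exist mainly so the loop can be tested without a cluster.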

Version 0.2.1 got uploaded to pypi without *.txt files included.

Hi @amq92 👋 Nice to meet you.
Thank you for maintaining/creating simple-slurm, I've been thoroughly enjoying using it.

I noticed that when installing simple-slurm==0.2.1, the *.txt files are not included in the package, and thus simple-slurm doesn't work with the latest version. I wasn't able to quickly spot what is wrong with the setup.py.

In case you were already aware of this, my apologies 🙇 .

Once again, thank you for your work.

Alexander