amq92 / simple_slurm
A simple Python wrapper for Slurm with flexibility in mind.
License: GNU Affero General Public License v3.0
Thanks for your work here! I'd like to be able to use the verbose option with simple-slurm, and it looks like it's not out in the current release on PyPI. Could you please bump and release?
Hello,
In the introduction section of the readme file, the import statement should be:
from simple_slurm import Slurm
instead of:
from slurm import Slurm
Hello,
Thank you for creating this very useful package!
I found it convenient to add an option to the sbatch procedure to output to file the script that is submitted. This helps particularly with troubleshooting to verify that the correct arguments are passed to the sbatch command. Are you interested in adding such functionality, and would you like me to submit a pull request? Thank you!
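For illustration, a minimal sketch of the proposed option, assuming only that printing a Slurm object yields the script it would submit (as the package's print behaviour suggests); save_script and script_path are hypothetical names, not part of the current API:

```python
from pathlib import Path

def save_script(slurm_object, script_path):
    """Write the batch script that would be submitted to script_path.

    Assumes str(slurm_object) renders the full #SBATCH script; the
    function and parameter names here are hypothetical.
    """
    script = str(slurm_object)
    Path(script_path).write_text(script)
    return script
```

The saved file can then be diffed against expectations before (or after) calling sbatch, which is exactly the troubleshooting use case described above.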
It would be nice to add some squeue functionality, where I can access my jobs as dictionaries. The code below is functioning, mostly generated by ChatGPT-4, and checks the SQUEUE_FORMAT env var. One could argue that a malformed SQUEUE_FORMAT should not lead to an error but should fall back to default_format.
import csv
import os
import subprocess
from io import StringIO


class SlurmSqueueWrapper:
    def __init__(self):
        self.command = "squeue"
        # Quoted CSV so that DictReader can parse squeue's output reliably.
        self.default_format = '"%i","%j","%t","%M","%L","%D","%C","%m","%b","%R"'
        self.output_format = os.getenv("SQUEUE_FORMAT", self.default_format)
        if not self._is_valid_csv_format(self.output_format):
            raise ValueError("Invalid CSV format in SQUEUE_FORMAT environment variable")
        self.jobs = []

    def run_squeue(self):
        # Note: --me requires a recent Slurm; use -u $(whoami) on older versions.
        result = subprocess.run([self.command, "--me", "-o", self.output_format],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"Error running squeue: {result.stderr.strip()}")
        self.jobs = self._parse_output(result.stdout.strip())

    def _is_valid_csv_format(self, format_str):
        # Reject a SQUEUE_FORMAT that the CSV sniffer cannot make sense of.
        try:
            sniffer = csv.Sniffer()
            dialect = sniffer.sniff(format_str, delimiters=',')
            dialect.strict = True
            csv.reader(StringIO(format_str), dialect=dialect)
            return True
        except csv.Error:
            return False

    def _parse_output(self, output):
        csv_file = StringIO(output)
        reader = csv.DictReader(csv_file, delimiter=',', quotechar='"', skipinitialspace=True)
        return [row for row in reader]

    def display_jobs(self):
        for job in self.jobs:
            print(job)


if __name__ == "__main__":
    squeue = SlurmSqueueWrapper()
    squeue.run_squeue()
    squeue.display_jobs()
Output below (note that the squeue --me option is not available in older Slurm versions and should be replaced with -u $(whoami)):
[user123@login1 ~]$ squeue --me
JOBID PART NAME USER ST TIME TIME_LEFT NOD CPU TRES_PER_ MIN_ NODELIST(REASON)
22112971 prt1 wrap user123 R 0:02 59:58 1 1 N/A 4G node-2-44
22112970 prt1 wrap user123 R 0:04 59:56 1 1 N/A 4G node-2-44
22112966 prt1 wrap user123 R 1:29 58:31 1 1 N/A 4G node-2-44
[user123@login1 ~]$ python3 ./squeue.py
{'JOBID': '22112971', 'NAME': 'wrap', 'ST': 'R', 'TIME': '0:07', 'TIME_LEFT': '59:53', 'NODES': '1', 'CPUS': '1', 'MIN_MEMORY': '4G', 'TRES_PER_NODE': 'N/A', 'NODELIST(REASON)': 'node-2-44'}
{'JOBID': '22112970', 'NAME': 'wrap', 'ST': 'R', 'TIME': '0:09', 'TIME_LEFT': '59:51', 'NODES': '1', 'CPUS': '1', 'MIN_MEMORY': '4G', 'TRES_PER_NODE': 'N/A', 'NODELIST(REASON)': 'node-2-44'}
{'JOBID': '22112966', 'NAME': 'wrap', 'ST': 'R', 'TIME': '1:34', 'TIME_LEFT': '58:26', 'NODES': '1', 'CPUS': '1', 'MIN_MEMORY': '4G', 'TRES_PER_NODE': 'N/A', 'NODELIST(REASON)': 'node-2-44'}
In some contexts, you have to call srun in an sbatch script (multi-node execution).
The workaround I use is to add 'srun' inside run_cmd:
slurm.sbatch('srun python train.py')
Would an srun option in the sbatch function be useful?
possible contribution
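A minimal sketch of what such an option could do, simply prepending srun to the run command; with_srun and srun_args are hypothetical names, not part of the current simple_slurm API:

```python
def with_srun(run_cmd, srun_args=""):
    """Return run_cmd wrapped in an srun invocation (hypothetical helper)."""
    prefix = f"srun {srun_args}".strip()
    return f"{prefix} {run_cmd}"

# Usage would then be roughly: slurm.sbatch(with_srun('python train.py'))
```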
Hi,
Thank you very much for this package. It is super useful.
Would it be possible to have a simple command line interface to use simple_slurm? It would be great if I can submit simple commands from the shell directly.
Something like
simple_slurm --command "echo running;hostname" --cpus_per_task 1
Thanks!
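A rough sketch of what such a CLI could look like, assuming only the documented Slurm(...).sbatch(...) API; the flag names simply mirror the example above, and everything here is a proposal, not the package's actual CLI:

```python
import argparse

def parse_cli(argv):
    """Parse the proposed CLI flags (names mirror sbatch's long options)."""
    parser = argparse.ArgumentParser(prog="simple_slurm")
    parser.add_argument("--command", required=True,
                        help="shell command to submit")
    parser.add_argument("--cpus_per_task", type=int, default=1)
    return parser.parse_args(argv)

# Submission would then be roughly:
#   from simple_slurm import Slurm
#   args = parse_cli(sys.argv[1:])
#   Slurm(cpus_per_task=args.cpus_per_task).sbatch(args.command)
```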
I'm seeing some confusing behavior with add_cmd() vs srun(). In particular, what I really want to do is add many spack commands to a batch script constructed by simple_slurm via add_cmd(), and then submit it.
The observed behavior is that I can put slurm.srun("spack find") and it will place the output of that command in the file I specify by output= in the initialization of Slurm, as expected, but if I do something like slurm.add_cmd("spack find") and then slurm.srun("echo done!"), I'll only see the output for the echo.
What's going on here? Do I have an option other than chaining all the spack commands I want to use into the srun string? It seems that I may need to do that if I want to see the output at least. The print of the Slurm object looks just fine, but the stdout/stderr interactions of spack and simple_slurm may be tricky? Are you doing anything different with respect to stdout/stderr for add_cmd() vs srun()?
spack is also a Python package with likely complex subprocess handling/behavior, and I did confirm that if I copy/paste the output of "printing" the Slurm object to a batch script, it does correctly report stdout/stderr.
Is adding a pre/post processing command possible?
For now I bypass the problem with this, but it is not very clean:
slurm.sbatch(f'module purge \n module load pytorch-gpu \n srun python train.py')
My basic use case for the moment concerns the management of virtual environments, but I imagine that there must be other use cases.
Possible contribution; how much do you want to keep the package simple?
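As a sketch, the manual f-string workaround above could be wrapped in a small helper that joins pre-processing commands, the run command, and post-processing commands; with_setup, pre and post are hypothetical names:

```python
def with_setup(run_cmd, pre=(), post=()):
    """Join pre-commands, the run command, and post-commands into one
    newline-separated string, mirroring the manual workaround above."""
    return "\n".join([*pre, run_cmd, *post])

# e.g. slurm.sbatch(with_setup('srun python train.py',
#                              pre=['module purge', 'module load pytorch-gpu']))
```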
>>> slurm.sbatch('echo demo.py ' + Slurm.SLURM_ARRAY_TASK_ID)
File ".../site-packages/simple_slurm/core.py", line 131, in sbatch
assert success_msg in stdout, result.stderr
AssertionError: None
It does not show what exactly went wrong or what the sbatch command's output was.
E.g.,
sbatch: error: Invalid generic resource (gres) specification
MWE :
from simple_slurm import Slurm
slurm = Slurm()
slurm.add_arguments(gres='gpu') # <-- OK
slurm.add_gres('gpu') # <-- NOK
cmd = 'echo ' + Slurm.SLURM_ARRAY_TASK_ID # <-- NOK
pylint
main.py:5:0: E1101: Instance of 'Slurm' has no 'add_gres' member (no-member)
main.py:6:16: E1101: Class 'Slurm' has no 'SLURM_ARRAY_TASK_ID' member (no-member)
pylance / pyright
Cannot access member "add_gres" for type "Slurm"
Member "add_gres" is unknown (Pylance: reportGeneralTypeIssues)
Cannot access member "SLURM_ARRAY_TASK_ID" for type "Type[Slurm]"
Member "SLURM_ARRAY_TASK_ID" is unknown (Pylance: reportGeneralTypeIssues)
flake8
shows nothing (this is my default linter)
All linters were tested with their default configuration.
This is not a deal-breaker, but rather a nice-to-have feature.
A naive solution would be to hard-code the methods and attributes into the code (manually) instead of dynamically adding them.
However, I assume that a more elegant solution must exist ...
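As a sketch of the naive solution, the dynamically generated members could be declared statically so linters can resolve them. The attribute value and method body below are illustrative guesses, not simple_slurm's actual implementation:

```python
class Slurm:
    # Statically declared, so pylint/pyright can resolve the member.
    SLURM_ARRAY_TASK_ID = "$SLURM_ARRAY_TASK_ID"  # illustrative value

    def add_gres(self, value):
        """Statically declared counterpart of the generated add_* method
        (the body here is a placeholder)."""
        self.gres = value
```

A .pyi stub file shipping only the declarations might achieve the same result without touching the runtime code.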
Hi there,
It's possible to set a 'cluster' option in Slurm:
#SBATCH --clusters {my_cluster_name}
From Slurm docs:
-M, --clusters=
Clusters to issue commands to. Multiple cluster names may be comma separated. The job will be submitted to the one cluster providing the earliest expected job initiation time. The default value is the current cluster. A value of 'all' will query to run on all clusters. Note the --export option to control environment variables exported between clusters. Note that the SlurmDBD must be up for this option to work properly.
When doing this, the return message will be:
Submitted batch job {job_ID} on cluster {my_cluster_name}
This causes a ValueError in simple-slurm at the moment:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-d58e7c3efb9b> in <module>
----> 1 slurm.sbatch(cmd)
simple_slurm/core.py in sbatch(self, run_cmd, convert, sbatch_cmd, shell)
145 stdout = result.stdout.decode('utf-8')
146 assert success_msg in stdout, result.stderr
--> 147 job_id = int(stdout.replace(success_msg, ''))
148 print(success_msg, job_id)
149 return job_id
ValueError: invalid literal for int() with base 10: ' {job_ID} on cluster {my_cluster_name}\n'
I think this would be fairly easy to catch with some additional string parsing.
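For illustration, a sketch of that extra parsing, using only the two reply formats quoted in this issue; parse_job_id is a hypothetical helper, not current simple_slurm code:

```python
import re

def parse_job_id(stdout):
    """Extract the numeric job id from sbatch's reply, with or without
    the trailing 'on cluster ...' suffix."""
    match = re.search(r"Submitted batch job (\d+)", stdout)
    if match is None:
        raise ValueError(f"could not parse sbatch output: {stdout!r}")
    return int(match.group(1))
```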
It would be great if one could submit a batch job with a hashbang other than .../sh, e.g. .../bash. Perhaps this could be added as a run config? Something like slurm.sbatch(run_commands, shell="bash").
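A sketch of what the proposed option might do when rendering the script: make the shebang configurable. render_script and shell are hypothetical names, not the package's API:

```python
def render_script(run_cmd, shell="/bin/sh"):
    """Prepend a configurable shebang to the submitted commands."""
    return f"#!{shell}\n{run_cmd}\n"
```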
I have trouble (1) retrieving the path of the node's temporary directory onto which files ought to be moved (usually available via $TMPDIR in the batch script) and (2) actually moving the necessary files to the node before starting the actual computation on the node. How do you go about this using the wrapper?
Thank you for your advice.
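One possible approach, sketched under the assumption that $TMPDIR is expanded by the shell on the compute node (Python on the login node cannot see it): build the staging commands into the submitted script itself. stage_and_run is a hypothetical helper:

```python
def stage_and_run(files, run_cmd):
    """Copy input files into $TMPDIR on the node, cd there, then run."""
    copies = [f'cp "{f}" "$TMPDIR"/' for f in files]
    return "\n".join([*copies, 'cd "$TMPDIR"', run_cmd])

# e.g. slurm.sbatch(stage_and_run(['data.bin'], 'python compute.py data.bin'))
```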
Our pipeline broke in production due to missing version (0.1.7) of simple-slurm
RuntimeError
Unable to find installation candidates for simple-slurm (0.1.7)
at /usr/local/lib/python3.8/site-packages/poetry/installation/chooser.py:72 in choose_for
68│
69│ links.append(link)
70│
71│ if not links:
→ 72│ raise RuntimeError(
73│ "Unable to find installation candidates for {}".format(package)
74│ )
75│
76│ # Get the best link
Hi Arturo,
I was wondering if it is possible to submit python functions to slurm with simple_slurm
. It would be a super nice feature to have.
Do you know if anyone already managed to do that?
Many thanks,
Federico
Does this package support anything like this?
slurm.sbatch('python demo.py ' + Slurm.SLURM_ARRAY_JOB_ID)
slurm.wait_for_completion()
If not, I might be interested in contributing a bit.
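A sketch of what wait_for_completion could do: poll squeue until the job id no longer appears in the queue. Only `squeue -h -j <id>` itself is standard Slurm; the helper names and polling interval are proposals:

```python
import subprocess
import time

def job_finished(squeue_output):
    """True when `squeue -h -j <id>` printed no rows for the job."""
    return not squeue_output.strip()

def wait_for_completion(job_id, poll_interval=30):
    """Block until squeue no longer lists job_id (hypothetical helper)."""
    while True:
        result = subprocess.run(
            ["squeue", "-h", "-j", str(job_id)],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
        if job_finished(result.stdout):
            return
        time.sleep(poll_interval)
```

Checking sacct for the final job state would be more robust (a job can leave the queue by failing), but that is beyond this sketch.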
Hi @amq92 👋 Nice to meet you.
Thank you for maintaining/creating simple-slurm, I've been thoroughly enjoying using it.
I noticed that when installing simple-slurm==0.2.1, the *.txt files are not included in the package, and thus simple-slurm doesn't work in the latest version. I wasn't able to quickly spot what is wrong with the setup.py.
In case you were already aware of this, my apologies 🙇 .
Once again, thank you for your work.
Alexander