aiida-hyperqueue's Introduction

AiiDA HyperQueue plugin

AiiDA plugin for the HyperQueue metascheduler.

โ—๏ธ This package is still in the early stages of development and we will most likely break the API regularly in new 0.X versions. Be sure to pin the version when installing this package in scripts.

Features

Allows task farming on Slurm machines through the submission of AiiDA calculations to the HyperQueue metascheduler. See the Documentation for more information on how to install and use the plugin.

aiida-hyperqueue's People

Contributors

giovannipizzi, mbercx, tsthakur

aiida-hyperqueue's Issues

Use already defined job resource classes?

I think we should limit the number of different JobResource subclasses used by different scheduler plugins: these make different schedulers behave differently, so it's harder for the user to know which resources to pass.

For this scheduler, we clearly need to specify the total number of cores.

Memory can probably be removed, as discussed in #7.

Do we need a different class, and in particular both num_mpiprocs and num_cores?
Or can we just reuse e.g. the ParEnvJobResource below, simply specifying tot_num_mpiprocs (plus a parallel_env, which is a string that I imagine would in the future be matched to the name of the allocation you want to run on, e.g. GPU vs CPU)?

https://github.com/aiidateam/aiida-core/blob/ff1318b485a8b803e115b78946cc4593fc661153/aiida/schedulers/datastructures.py#L177
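
To illustrate the idea, here is a hypothetical, self-contained stand-in (not the aiida-core class itself) mimicking the ParEnvJobResource interface, showing that `parallel_env` plus `tot_num_mpiprocs` could be enough for this scheduler:

```python
# Hypothetical sketch mimicking aiida-core's ParEnvJobResource interface;
# names and validation are simplified for illustration only.
class ParEnvJobResource:
    def __init__(self, parallel_env: str, tot_num_mpiprocs: int):
        if not isinstance(tot_num_mpiprocs, int) or tot_num_mpiprocs <= 0:
            raise ValueError('tot_num_mpiprocs must be a positive integer')
        # `parallel_env` could later be matched to the HQ allocation name
        # (e.g. 'cpu' vs 'gpu').
        self.parallel_env = parallel_env
        self.tot_num_mpiprocs = tot_num_mpiprocs

    def get_tot_num_mpiprocs(self) -> int:
        return self.tot_num_mpiprocs


resources = ParEnvJobResource(parallel_env='cpu', tot_num_mpiprocs=16)
```

With only these two fields, the scheduler plugin would not need its own num_mpiprocs/num_cores pair at all.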

Should the CLI use the same defaults as `hq`?

There are a few points where we change the defaults: the backlog size and the use of hyper-threading.
I suggest we stick to the same defaults as HQ:

  • HT used by default; have a --no-ht flag instead
  • don't set the backlog by default, and specify it only if the user passes an option

(and revert any other change we are currently making to the defaults).
Otherwise users will get confused when using HQ directly vs. via AiiDA.
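
As a sketch of this behaviour (using argparse as a stand-in for the plugin's actual click-based CLI; the `--cpus=no-ht` value is an assumption about the hq worker flag, for illustration): HT stays on by default with an opt-out flag, and the backlog is only forwarded to `hq` when the user explicitly passes it.

```python
import argparse

# Stand-in for the plugin's CLI: mirror hq's own defaults instead of
# overriding them.
parser = argparse.ArgumentParser(prog='aiida-hq-worker-start')
parser.add_argument('--no-ht', action='store_true',
                    help='disable hyper-threading (HQ uses it by default)')
parser.add_argument('--backlog', type=int, default=None,
                    help='only added to the hq command line when given')


def build_hq_worker_options(args: argparse.Namespace) -> list:
    """Return only the options the user explicitly requested."""
    options = []
    if args.no_ht:
        options.append('--cpus=no-ht')  # hypothetical hq flag value
    if args.backlog is not None:
        options.append(f'--backlog={args.backlog}')
    return options
```

With no flags passed, the list is empty and `hq` falls back to its own defaults, so behaviour matches running HQ by hand.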

Timeout to start server

I got the following error when starting the hq server on Eiger, although the server itself starts successfully. Traced it back to aiida-core and opened issue aiidateam/aiida-core#6377.

Traceback (most recent call last):
  File "/home/jyu/.aiida_venvs/sssp-project/bin/verdi", line 8, in <module>
    sys.exit(verdi())
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  [Previous line repeated 1 more time]
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/jyu/project/sssp-project/aiida-core/src/aiida/cmdline/utils/decorators.py", line 102, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/jyu/project/sssp-project/aiida-hyperqueue/aiida_hyperqueue/cli.py", line 34, in start_cmd
    retval, _, stderr = transport.exec_command_wait(
  File "/home/jyu/project/sssp-project/aiida-core/src/aiida/transports/transport.py", line 413, in exec_command_wait
    retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
  File "/home/jyu/project/sssp-project/aiida-core/src/aiida/transports/plugins/ssh.py", line 1413, in exec_command_wait_bytes
    stdout_bytes.append(stdout.read())
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/paramiko/file.py", line 200, in read
    new_data = self._read(self._DEFAULT_BUFSIZE)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/paramiko/channel.py", line 1361, in _read
    return self.channel.recv(size)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/paramiko/channel.py", line 701, in recv
    raise socket.timeout()
TimeoutError

Use both time-limit and time-request

This comment justifies why only `--time-request` was used:

# `--time-request` will only let the HQ job start on the worker in case there is still enough time available
# `--time-limit` means the HQ job will be killed after this time.
# It's better to use `--time-request`, since it will guarantee that the time is still available, but won't
# kill the job in case more time is needed and is available.
hq_options.append(f'--time-request={job_tmpl.max_wallclock_seconds}s')

However, I think both should be used, and set to the same value. It's expected that schedulers kill jobs that take too long.
Actually, this is even more important when sharing a node: I just had a case in which, for some reason, all jobs on a node got stuck and stopped producing output, even though they were still using 100% of the CPU. They blocked the worker until the end of its wall time. This means that if e.g. the worker has a 24-hour wall time, those 24 hours are wasted even if the job should have finished within 10 minutes. It's better to kill the job and let other jobs run.
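
The proposed change is small; a sketch, assuming the plugin's existing `hq_options` list and the wallclock value from aiida-core's JobTemplate:

```python
# Set both flags to the same wallclock value: HQ then only starts the job
# if enough worker time remains (--time-request) AND kills it if it
# overruns (--time-limit), freeing the worker for other jobs.
def wallclock_options(max_wallclock_seconds: int) -> list:
    return [
        f'--time-request={max_wallclock_seconds}s',
        f'--time-limit={max_wallclock_seconds}s',
    ]


hq_options = []
hq_options.extend(wallclock_options(3600))
```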

👌 IMPROVE: Use HQ directives for submit command

The current submit command uses quite a bit of hacky logic to convert the `#HQ` line into arguments for the `hq submit` command:

submit_command = (
    f"chmod 774 {submit_script}; options=$(grep '#HQ' {submit_script});"
    f"sed -i s/\\'srun\\'/srun\ --cpu-bind=map_cpu:\$HQ_CPUS/ {submit_script};"
    f'hq submit ${{options:3}} ./{submit_script}')

Once HQ directives are implemented, as discussed in It4innovations/hyperqueue#6, we can simply add these to separate lines in the jobscript header instead.
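
A hypothetical sketch of what the job-script header could then look like (the exact directive syntax is an assumption until the HQ feature lands): options move onto `#HQ` lines inside the script, so the submit command reduces to a plain `hq submit ./job.sh`.

```python
# Sketch: generate a job-script header carrying the HQ options as
# directives, instead of grepping them out at submit time.
def hq_header(num_cores: int, wallclock_seconds: int) -> str:
    lines = [
        '#!/bin/bash',
        f'#HQ --cpus={num_cores}',          # assumed directive spelling
        f'#HQ --time-request={wallclock_seconds}s',
        f'#HQ --time-limit={wallclock_seconds}s',
    ]
    return '\n'.join(lines) + '\n'
```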

Use existing properties to set the max memory, don't add them to the resources

Instead of defining a new memory_Mb key in the resources (very specific to this scheduler), we should reuse concepts that already exist in AiiDA and are independent of the scheduler.

E.g. we have metadata.options.max_memory_kb in CalcJobs:

https://github.com/aiidateam/aiida-core/blob/ff1318b485a8b803e115b78946cc4593fc661153/aiida/engine/processes/calcjobs/calcjob.py#L249

and this is passed to the scheduler in the JobTemplate:

https://github.com/aiidateam/aiida-core/blob/ff1318b485a8b803e115b78946cc4593fc661153/aiida/schedulers/datastructures.py#L284

See e.g. how it's used in SLURM:

https://github.com/aiidateam/aiida-core/blob/develop/aiida/schedulers/plugins/slurm.py#L383-L396
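
Following the SLURM plugin's pattern, a sketch of the conversion this plugin would need: take `max_memory_kb` from the JobTemplate (set via `metadata.options.max_memory_kb`) and turn it into a megabyte request on the hq command line (the `--resource mem=...` flag spelling is an assumption here).

```python
import math

# Sketch: convert aiida-core's scheduler-independent max_memory_kb into an
# hq option, rounding up to whole MB as the SLURM plugin does.
def memory_option(max_memory_kb) -> list:
    if max_memory_kb is None:
        return []  # no request: let the scheduler apply its default
    if not isinstance(max_memory_kb, int) or max_memory_kb <= 0:
        raise ValueError(
            f'max_memory_kb must be a positive integer, got {max_memory_kb!r}')
    virtual_memory_mb = math.ceil(max_memory_kb / 1024)
    return [f'--resource mem={virtual_memory_mb}']  # assumed flag name
```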

✨ NEW: Add support for PBS

Currently the scheduler has only been developed/tested with Slurm in mind. However, since HyperQueue also supports PBS, we should redesign some parts to be more general so both job managers can be supported.

👌 IMPROVE: Avoid using `sed` to add `$HQ_CPUS`

To avoid the current bash escaping that adds single quotes around every command and argument, I added a bit of hacky sed logic:

submit_command = (
    f"chmod 774 {submit_script}; options=$(grep '#HQ' {submit_script});"
    f"sed -i s/\\'srun\\'/srun\ --cpu-bind=map_cpu:\$HQ_CPUS/ {submit_script};"
    f'hq submit ${{options:3}} ./{submit_script}')

If/when it's possible to avoid the single quotes around all the items on the srun line, we can e.g. simply add this option to the srun command in the computer setup.

However, we should probably not add this option at all when running on multiple nodes. I haven't tested this so far.
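
A sketch of where this would land: if `--cpu-bind=map_cpu:$HQ_CPUS` is configured on the computer's srun command (or, per the caveat above, omitted for multi-node runs), the submit command itself no longer needs the grep/sed surgery at all.

```python
# Sketch of the simplified submit command once the CPU binding lives in
# the computer setup: just make the script executable and submit it.
def submit_command(submit_script: str) -> str:
    return f'chmod 774 {submit_script}; hq submit ./{submit_script}'
```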
