aiida-hyperqueue's Introduction

AiiDA HyperQueue plugin

AiiDA plugin for the HyperQueue metascheduler.

โ—๏ธ This package is still in the early stages of development and we will most likely break the API regularly in new 0.X versions. Be sure to pin the version when installing this package in scripts.

Features

Allows task farming on Slurm machines through the submission of AiiDA calculations to the HyperQueue metascheduler. See the Documentation for more information on how to install and use the plugin.

aiida-hyperqueue's People

Contributors

giovannipizzi, mbercx, tsthakur

aiida-hyperqueue's Issues

Use already defined job resource classes?

I think we should limit the number of different JobResource subclasses used by different scheduler plugins: these make different schedulers behave differently, so it's harder for the user to know which resources to pass.

For this scheduler, we clearly need to specify the total number of cores.

Memory can probably be removed, as discussed in #7.

Do we need a different class, and in particular both num_mpiprocs and num_cores?
Or can we just reuse e.g. the ParEnvJobResource below, simply specifying tot_num_mpiprocs (plus a parallel_env, which is a string that I imagine would in the future be matched to the name of the allocation you want to run on, e.g. GPU vs CPU)?

https://github.com/aiidateam/aiida-core/blob/ff1318b485a8b803e115b78946cc4593fc661153/aiida/schedulers/datastructures.py#L177
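
To illustrate the idea, here is a hypothetical, self-contained stand-in (not the aiida-core class itself) mimicking the ParEnvJobResource interface, showing that `parallel_env` plus `tot_num_mpiprocs` could be enough for this scheduler:

```python
# Hypothetical sketch mimicking aiida-core's ParEnvJobResource interface;
# names and validation are simplified for illustration only.
class ParEnvJobResource:
    def __init__(self, parallel_env: str, tot_num_mpiprocs: int):
        if not isinstance(tot_num_mpiprocs, int) or tot_num_mpiprocs <= 0:
            raise ValueError('tot_num_mpiprocs must be a positive integer')
        # `parallel_env` could later be matched to the HQ allocation name
        # (e.g. 'cpu' vs 'gpu').
        self.parallel_env = parallel_env
        self.tot_num_mpiprocs = tot_num_mpiprocs

    def get_tot_num_mpiprocs(self) -> int:
        return self.tot_num_mpiprocs


resources = ParEnvJobResource(parallel_env='cpu', tot_num_mpiprocs=16)
```

With only these two fields, the scheduler plugin would not need its own num_mpiprocs/num_cores pair at all.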

Should the CLI use the same defaults as `hq`?

There are a few points where we change the defaults: the backlog size and the use of hyper-threading.
I suggest we stick to the same defaults as HQ:

  • HT used by default; have a --no-ht flag instead
  • don't set the backlog by default, and specify it only if the user passes an option

(and revert any other change we are currently making to the defaults).
Otherwise users will get confused when using HQ directly vs. via AiiDA.
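
As a sketch of this behaviour (using argparse as a stand-in for the plugin's actual click-based CLI; the `--cpus=no-ht` value is an assumption about the hq worker flag, for illustration): HT stays on by default with an opt-out flag, and the backlog is only forwarded to `hq` when the user explicitly passes it.

```python
import argparse

# Stand-in for the plugin's CLI: mirror hq's own defaults instead of
# overriding them.
parser = argparse.ArgumentParser(prog='aiida-hq-worker-start')
parser.add_argument('--no-ht', action='store_true',
                    help='disable hyper-threading (HQ uses it by default)')
parser.add_argument('--backlog', type=int, default=None,
                    help='only added to the hq command line when given')


def build_hq_worker_options(args: argparse.Namespace) -> list:
    """Return only the options the user explicitly requested."""
    options = []
    if args.no_ht:
        options.append('--cpus=no-ht')  # hypothetical hq flag value
    if args.backlog is not None:
        options.append(f'--backlog={args.backlog}')
    return options
```

With no flags passed, the list is empty and `hq` falls back to its own defaults, so behaviour matches running HQ by hand.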

Timeout to start server

I got the following error when starting the hq server on Eiger, although the server itself starts successfully. Traced it back to aiida-core and opened issue aiidateam/aiida-core#6377.

Traceback (most recent call last):
  File "/home/jyu/.aiida_venvs/sssp-project/bin/verdi", line 8, in <module>
    sys.exit(verdi())
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  [Previous line repeated 1 more time]
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/jyu/project/sssp-project/aiida-core/src/aiida/cmdline/utils/decorators.py", line 102, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/jyu/project/sssp-project/aiida-hyperqueue/aiida_hyperqueue/cli.py", line 34, in start_cmd
    retval, _, stderr = transport.exec_command_wait(
  File "/home/jyu/project/sssp-project/aiida-core/src/aiida/transports/transport.py", line 413, in exec_command_wait
    retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
  File "/home/jyu/project/sssp-project/aiida-core/src/aiida/transports/plugins/ssh.py", line 1413, in exec_command_wait_bytes
    stdout_bytes.append(stdout.read())
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/paramiko/file.py", line 200, in read
    new_data = self._read(self._DEFAULT_BUFSIZE)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/paramiko/channel.py", line 1361, in _read
    return self.channel.recv(size)
  File "/home/jyu/.aiida_venvs/sssp-project/lib/python3.10/site-packages/paramiko/channel.py", line 701, in recv
    raise socket.timeout()
TimeoutError

Use both time-limit and time-request

This comment justifies why only `--time-request` was used:

# `--time-request` will only let the HQ job start on the worker in case there is still enough time available
# `--time-limit` means the HQ job will be killed after this time.
# It's better to use `--time-request`, since it will guarantee that the time is still available, but won't
# kill the job in case more time is needed and is available.
hq_options.append(f'--time-request={job_tmpl.max_wallclock_seconds}s')

However, I think both should be used, and set to the same value. It's expected that schedulers kill jobs that take too long.
Actually, this is even more important when sharing a node: I just had a case in which, for some reason, all jobs on a node got stuck and stopped producing output, even though they were still using 100% of the CPU. They blocked the worker until the end of its wall time. This means that if e.g. the worker has a 24-hour wall time, those 24 hours are wasted even if the job should have finished within 10 minutes. It's better to kill the job and let other jobs run.
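
The proposed change is small; a sketch, assuming the plugin's existing `hq_options` list and the wallclock value from aiida-core's JobTemplate:

```python
# Set both flags to the same wallclock value: HQ then only starts the job
# if enough worker time remains (--time-request) AND kills it if it
# overruns (--time-limit), freeing the worker for other jobs.
def wallclock_options(max_wallclock_seconds: int) -> list:
    return [
        f'--time-request={max_wallclock_seconds}s',
        f'--time-limit={max_wallclock_seconds}s',
    ]


hq_options = []
hq_options.extend(wallclock_options(3600))
```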

👌 IMPROVE: Use HQ directives for submit command

The current submit command uses quite a bit of hacky logic to convert the `#HQ` line into arguments for the `hq submit` command:

submit_command = (
    f"chmod 774 {submit_script}; options=$(grep '#HQ' {submit_script});"
    f"sed -i s/\\'srun\\'/srun\ --cpu-bind=map_cpu:\$HQ_CPUS/ {submit_script};"
    f'hq submit ${{options:3}} ./{submit_script}')

Once HQ directives are implemented, as discussed in It4innovations/hyperqueue#6, we can simply add these to separate lines in the jobscript header instead.
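
A hypothetical sketch of what the job-script header could then look like (the exact directive syntax is an assumption until the HQ feature lands): options move onto `#HQ` lines inside the script, so the submit command reduces to a plain `hq submit ./job.sh`.

```python
# Sketch: generate a job-script header carrying the HQ options as
# directives, instead of grepping them out at submit time.
def hq_header(num_cores: int, wallclock_seconds: int) -> str:
    lines = [
        '#!/bin/bash',
        f'#HQ --cpus={num_cores}',          # assumed directive spelling
        f'#HQ --time-request={wallclock_seconds}s',
        f'#HQ --time-limit={wallclock_seconds}s',
    ]
    return '\n'.join(lines) + '\n'
```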

Use existing properties to set the max memory, don't add them to the resources

Instead of defining a new memory_Mb key in the resources (very specific to this scheduler), we should reuse concepts that already exist in AiiDA and are independent of the scheduler.

E.g. we have metadata.options.max_memory_kb in CalcJobs:

https://github.com/aiidateam/aiida-core/blob/ff1318b485a8b803e115b78946cc4593fc661153/aiida/engine/processes/calcjobs/calcjob.py#L249

and this is passed to the scheduler in the JobTemplate:

https://github.com/aiidateam/aiida-core/blob/ff1318b485a8b803e115b78946cc4593fc661153/aiida/schedulers/datastructures.py#L284

See e.g. how it's used in SLURM:

https://github.com/aiidateam/aiida-core/blob/develop/aiida/schedulers/plugins/slurm.py#L383-L396
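
Following the SLURM plugin's pattern, a sketch of the conversion this plugin would need: take `max_memory_kb` from the JobTemplate (set via `metadata.options.max_memory_kb`) and turn it into a megabyte request on the hq command line (the `--resource mem=...` flag spelling is an assumption here).

```python
import math

# Sketch: convert aiida-core's scheduler-independent max_memory_kb into an
# hq option, rounding up to whole MB as the SLURM plugin does.
def memory_option(max_memory_kb) -> list:
    if max_memory_kb is None:
        return []  # no request: let the scheduler apply its default
    if not isinstance(max_memory_kb, int) or max_memory_kb <= 0:
        raise ValueError(
            f'max_memory_kb must be a positive integer, got {max_memory_kb!r}')
    virtual_memory_mb = math.ceil(max_memory_kb / 1024)
    return [f'--resource mem={virtual_memory_mb}']  # assumed flag name
```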

✨ NEW: Add support for PBS

Currently the scheduler has only been developed/tested with Slurm in mind. However, since HyperQueue also supports PBS, we should redesign some parts to be more general so both job managers can be supported.

👌 IMPROVE: Avoid using `sed` to add `$HQ_CPUS`

To avoid the current bash escaping that adds single quotes around every command and argument, I added a bit of hacky sed logic:

submit_command = (
    f"chmod 774 {submit_script}; options=$(grep '#HQ' {submit_script});"
    f"sed -i s/\\'srun\\'/srun\ --cpu-bind=map_cpu:\$HQ_CPUS/ {submit_script};"
    f'hq submit ${{options:3}} ./{submit_script}')

If/when it's possible to avoid the single quotes around all the items on the srun line, we can e.g. simply add this option to the srun command in the computer setup.

However, we should probably not add this option at all when running on multiple nodes. I haven't tested this so far.
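
A sketch of where this would land: if `--cpu-bind=map_cpu:$HQ_CPUS` is configured on the computer's srun command (or, per the caveat above, omitted for multi-node runs), the submit command itself no longer needs the grep/sed surgery at all.

```python
# Sketch of the simplified submit command once the CPU binding lives in
# the computer setup: just make the script executable and submit it.
def submit_command(submit_script: str) -> str:
    return f'chmod 774 {submit_script}; hq submit ./{submit_script}'
```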
