Comments (2)
@wjcunningham7 I think the issue we were discussing yesterday is either related to or is this one.
The plan is:
(1) Two new constructor inputs for `keepalive_interval` and `reconnect_retries` (different from `retries`, which are triggered when a task fails).
(2) Add reconnect attempts in the polling loop.
Looking at lines 511-535 in `~/covalent_slurm_plugin/slurm.py`, we have:
    async def _poll_slurm(self, job_id: int, conn: asyncssh.SSHClientConnection) -> None:
        """Poll a Slurm job until completion.

        Args:
            job_id: Slurm job ID.
            conn: SSH connection object.

        Returns:
            None
        """

        # Poll status every `poll_freq` seconds
        status = await self.get_status({"job_id": str(job_id)}, conn)

        while (
            "PENDING" in status
            or "RUNNING" in status
            or "COMPLETING" in status
            or "CONFIGURING" in status
        ):
            await asyncio.sleep(self.poll_freq)
            status = await self.get_status({"job_id": str(job_id)}, conn)

        if "COMPLETED" not in status:
            raise RuntimeError("Job failed with status:\n", status)
I assume there is something we can take from `conn` to check whether it has gone stale in some way. This should be checked every `keepalive_interval`, for a maximum of `reconnect_retries` times. If it has gone stale, we just re-run lines 201-333, i.e. the method `async def _client_connect(self) -> asyncssh.SSHClientConnection:`.
from covalent-slurm-plugin.
Further to the above, the thing to track is the output of `conn.is_closing()`.
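
To make the proposal concrete, here is a minimal sketch of a polling loop with reconnect attempts. It is not the plugin's code: `poll_with_reconnect`, `connect`, `get_status` and `is_stale` are hypothetical names, and the connection is deliberately left abstract so the example does not depend on any particular `asyncssh` call. As a simplification, staleness is checked once per poll rather than on a separate `keepalive_interval` timer.

```python
import asyncio
from typing import Awaitable, Callable

# Slurm states that mean "keep polling" (taken from the loop quoted above).
ACTIVE_STATES = ("PENDING", "RUNNING", "COMPLETING", "CONFIGURING")


async def poll_with_reconnect(
    get_status: Callable[[object], Awaitable[str]],  # hypothetical: queries job status over a connection
    connect: Callable[[], Awaitable[object]],        # hypothetical: stands in for _client_connect()
    is_stale: Callable[[object], bool],              # hypothetical: e.g. wraps whatever check conn offers
    poll_freq: float = 30.0,
    reconnect_retries: int = 3,
) -> str:
    """Poll until the job leaves the active states, reconnecting when the connection is stale."""
    conn = await connect()
    reconnects = 0

    while True:
        if is_stale(conn):
            # Give up once the reconnect budget is spent.
            if reconnects >= reconnect_retries:
                raise RuntimeError("Connection lost and reconnect_retries exhausted")
            reconnects += 1
            conn = await connect()
            continue  # re-check the fresh connection before using it

        status = await get_status(conn)
        if not any(state in status for state in ACTIVE_STATES):
            break
        await asyncio.sleep(poll_freq)

    if "COMPLETED" not in status:
        raise RuntimeError(f"Job failed with status:\n{status}")
    return status
```

In the plugin, `connect` would presumably be `self._client_connect` and `is_stale` a thin wrapper around the check mentioned above. Note that `reconnect_retries` here counts reconnects over the whole polling run, which is one possible reading of the plan.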