Is there an option (it seems not...) to restart jobs when running ASCOT5 with MPI=1?

Restart option for very time consuming jobs (ascot5 issue, open)

rui-coelho commented on July 30, 2024

Restart option for very time consuming jobs

from ascot5.

Comments (3)

miekkasarki commented on July 30, 2024

I support this feature. We already have a "worker" thread that monitors the simulation and whose only job right now is to print the progress file. In principle, we could signal that thread to "please stop my simulation ASAP", and the worker thread would then set the MAX_CPU_TIME end condition for all markers that are currently being simulated or whose simulation has not yet started. The whole particle queue would then be flushed within a minute or so, and you would get your intermediate results stored in the HDF5 file.
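
To make the idea concrete, here is a minimal sketch in Python, not ASCOT5 code. The `Marker` class, `run_queue`, and the `"FINISHED"` label are hypothetical names used only for illustration; `"MAX_CPU_TIME"` stands in for the end condition mentioned above. It also simplifies by simulating one marker at a time, so a marker that is already running finishes normally instead of being interrupted.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Marker:
    ident: int
    endcond: Optional[str] = None  # None: not simulated yet


def run_queue(markers: List[Marker],
              stop_event: threading.Event,
              simulate_one: Callable[[Marker], None]) -> List[Marker]:
    """Simulate markers in order; once stop_event is set, flush the rest."""
    for marker in markers:
        if stop_event.is_set():
            # Stop requested: do not simulate, just tag the marker so the
            # queue empties quickly and the results can be written out.
            marker.endcond = "MAX_CPU_TIME"
        else:
            simulate_one(marker)
    return markers


if __name__ == "__main__":
    stop = threading.Event()

    def simulate_one(marker: Marker) -> None:
        marker.endcond = "FINISHED"   # placeholder for the real simulation
        if marker.ident == 2:         # pretend a stop request arrives here
            stop.set()

    result = run_queue([Marker(i) for i in range(5)], stop, simulate_one)
    print([(m.ident, m.endcond) for m in result])
```

In a real run the event would be set by the monitoring thread rather than from inside the simulation; the point is only that a single shared flag is enough to drain the queue.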

As for the signal, what you suggest would probably work in your case where the progress meter can be trusted. However, I would prefer that the job be terminated gracefully in two cases:

1. SLURM signals that the job is approaching its time limit, at which point it would be forcefully terminated.
2. The user wants to terminate the run earlier and sends the signal him/herself.

These would work for you, right? The signal could be something as simple as creating a file called "stop" in the same folder where the job was launched. However, there are two open questions:

1. How can we make SLURM generate such a file, or could we pass the time limit from SLURM to ascot somehow at the beginning of the simulation?
2. What if the user is running multiple simulations in the same folder? In this case the file should be something like "stop_<JOBID>", and again we would have to communicate JOBID from SLURM to ascot somehow.
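
One possible shape for the stop-file check, sketched in Python under stated assumptions: SLURM exports the job ID to the job environment as `SLURM_JOB_ID`, and `sbatch` has a `--signal` option that can deliver a signal some time before the time limit (check the SLURM documentation for the exact form on your cluster). The function names and the use of `SIGUSR1` are choices made for this sketch; nothing here is existing ASCOT5 behaviour.

```python
import os
import signal
import threading
from pathlib import Path
from typing import Optional

stop_event = threading.Event()  # shared flag the simulation would poll


def stop_file_name(job_id: Optional[str] = None) -> str:
    """'stop' for a single run, 'stop_<JOBID>' when a job ID is known."""
    return f"stop_{job_id}" if job_id else "stop"


def stop_requested(directory: str, job_id: Optional[str] = None) -> bool:
    """True if the stop file for this job exists in the launch directory."""
    return (Path(directory) / stop_file_name(job_id)).exists()


def install_signal_handler() -> None:
    """Treat SIGUSR1 as a stop request (must be called from the main thread)."""
    signal.signal(signal.SIGUSR1, lambda signum, frame: stop_event.set())


if __name__ == "__main__":
    job_id = os.environ.get("SLURM_JOB_ID")  # None outside of SLURM
    install_signal_handler()
    if stop_requested(os.getcwd(), job_id):
        stop_event.set()
    print("stop requested:", stop_event.is_set())
```

With this layout a user would run `touch stop_<JOBID>` in the launch folder to end one specific job, while a signal from the scheduler would cover the time-limit case without any file.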


rui-coelho commented on July 30, 2024

I was thinking of something much more basic. Imagine we have an initial value code that simulates the time evolution of an instability. We know the run will exceed the maximum runtime, which on MARCONI typically means 24h. I first need a rough estimate of how many time steps this translates to, and then I can set the number of time steps accordingly. I then set the first run to do 1M time steps (1-1,000,000), the second run to do from 1,000,000 to 2,000,000, and so on and so forth.
Now, if ASCOT runs the markers "sequentially", i.e. dispatching let's say 1000 markers until their end conditions are met, then the next 1000 and so on, one could "trivially" instruct the code to dispatch only the "first" 1,000,000 markers and then store the result in the HDF5 file. The next call to ASCOT, however, would have to know which set of 1,000,000 markers had already been dealt with and then dispatch the next set of 1,000,000 markers. Very likely, for this to work, there should be an extra OPTIONS key to specify which "sequence number" of the multi-stage run we are running, so that the stupid code knows which set of 1,000,000 markers to launch in that run. And of course the number of markers to "push" in each "sequence" should also be an OPTIONS key in the dictionary.
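
The bookkeeping for such a multi-stage run is small. A sketch in Python, where `sequence` and `batch_size` are hypothetical stand-ins for the two proposed OPTIONS keys (they are not existing ASCOT5 options) and the sequence number is counted from 0:

```python
from typing import Tuple


def batch_range(sequence: int, batch_size: int, n_markers: int) -> Tuple[int, int]:
    """Half-open marker index range [start, stop) for one stage of the run."""
    if sequence < 0 or batch_size <= 0:
        raise ValueError("sequence must be >= 0 and batch_size must be > 0")
    start = sequence * batch_size
    if start >= n_markers:
        raise ValueError("sequence number is beyond the last batch")
    # The last batch may be shorter if batch_size does not divide n_markers.
    stop = min(start + batch_size, n_markers)
    return start, stop


if __name__ == "__main__":
    n_markers = 10_000_000
    batch_size = 1_000_000
    for sequence in range(3):
        print(sequence, batch_range(sequence, batch_size, n_markers))
```

Each job in the chain would then only differ in its sequence number, which makes it easy to submit the stages as dependent jobs.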


rui-coelho commented on July 30, 2024

....since ASCOT does not do beam-beam reactions, it should be doable to implement, since in reality once a given marker meets its end (poor guy....) it R.I.P., right?

So....we could potentially break up a run that has 10M markers into 10 runs of 1M each or 20 runs of 500k each (the number of markers I have been using most frequently...), run them in sequence, and update the HDF5 file as the sequences progress....and since the markers are all "tagged" with metadata, we could even check which ones have met their fate and which ones are waiting to go to the slaughter.....(too much jambon hanging and eating while at Salamanca....apologies for the analogies...)
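
Because the markers are independent, picking the next batch from the tags alone could look like the Python sketch below. The mapping from marker ID to end-condition label is a plain dictionary here; how that metadata would actually be read from the HDF5 file is not shown, and the labels and function names are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def split_by_status(endconds: Dict[int, Optional[str]]) -> Tuple[List[int], List[int]]:
    """Return (done, pending) marker IDs; None means 'not simulated yet'."""
    done = sorted(i for i, e in endconds.items() if e is not None)
    pending = sorted(i for i, e in endconds.items() if e is None)
    return done, pending


def next_batch(endconds: Dict[int, Optional[str]], batch_size: int) -> List[int]:
    """IDs of the next markers to dispatch, lowest IDs first."""
    _, pending = split_by_status(endconds)
    return pending[:batch_size]


if __name__ == "__main__":
    # Hypothetical state after a first stage: markers 1-3 are finished.
    state = {1: "WALL", 2: "TLIM", 3: "WALL", 4: None, 5: None, 6: None}
    print(split_by_status(state))
    print(next_batch(state, batch_size=2))
```

Selecting by tag rather than by a fixed index range means a stage that was cut short does not lose markers: whatever was not simulated is simply picked up by the next run.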
