Notification of Pending Change CHG0033049 - October 3, 2017 at 10:00AM
Problem:
Multicore jobs are currently dispatching at a poor rate. The scheduler does not have enough information to determine when a host will have the requested number of slots available, so it cannot reserve slots on the optimal host.
Solution:
The short and long queues will be consolidated into the “broad” queue.
A default maximum run time (h_rt) of 2 hours (02:00:00) will be applied to every job that does not request its own. This is only a default; any run time can be requested.
Backfilling will be enabled.
Impact:
With an appropriate h_rt on every job, the scheduler will be able to determine the order in which cores will free up on each execution host, and when. Knowing this order will allow the scheduler to reserve the appropriate slots, so multicore jobs will dispatch more quickly.
Job backfilling will also be enabled. Because every job will carry an h_rt, the scheduler will know when enough slots will free up on an execution host for a multicore job with reserved slots to run. With this knowledge, the scheduler can backfill short jobs into the reserved slots without delaying the multicore job’s start.
The new configuration will allow the scheduler to more effectively and efficiently schedule short jobs as well as multicore jobs.
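To illustrate why a known h_rt matters, here is a deliberately simplified model of reservation and backfilling, written as a POSIX shell sketch. It is not how the Grid Engine scheduler is actually implemented; the function names earliest_start and can_backfill and the END:SLOTS input format are hypothetical. The model assumes every running job uses its full h_rt, which is the worst case the scheduler has to plan for.

```shell
#!/bin/sh
# Simplified model, NOT Grid Engine's real scheduler code: once every running
# job has an h_rt, the times at which a host releases slots are known, so a
# reservation time for a multicore job can be computed.

# earliest_start FREE NEEDED [END:SLOTS ...]
#   FREE      slots idle on the host right now
#   NEEDED    slots the multicore job requests
#   END:SLOTS seconds until a running job reaches its h_rt, and slots it frees
# Prints the earliest guaranteed start (seconds from now); fails if the host
# can never supply NEEDED slots.
earliest_start() {
  es_free=$1
  es_needed=$2
  shift 2
  if [ "$es_free" -ge "$es_needed" ]; then
    echo 0
    return 0
  fi
  # Visit running jobs in the order their h_rt limits expire.
  for es_pair in $(printf '%s\n' "$@" | sort -t: -k1,1n); do
    es_free=$((es_free + ${es_pair##*:}))
    if [ "$es_free" -ge "$es_needed" ]; then
      echo "${es_pair%%:*}"
      return 0
    fi
  done
  return 1
}

# can_backfill HRT SLOTS FREE START
#   A short job may use the idle (reserved) slots now only if it fits in them
#   and its h_rt expires no later than the multicore job's reserved START.
can_backfill() {
  [ "$2" -le "$3" ] && [ "$1" -le "$4" ]
}
```

For example, on a host with 4 idle slots and running jobs that reach their h_rt in 1800, 3600, and 7200 seconds (freeing 4, 8, and 4 slots), `earliest_start 4 12 1800:4 3600:8 7200:4` prints 3600, so a 12-slot job can be guaranteed a start in one hour. A 2-slot job with a 30-minute h_rt passes `can_backfill 1800 2 4 3600` and can use the idle slots in the meantime; a job whose end time is unknown gives the scheduler nothing to compare against.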
Required actions:
All jobs will need to be submitted to a single queue named “broad”. Because “broad” will be the default queue, jobs submitted after the cutover will no longer need to specify a queue. Jobs that have already been dispatched to an execution host and are running will continue to run to completion. Jobs pending at the time of the cutover will need to be moved to the new broad queue, which is done with “qalter -q broad -u ”; any job that is not moved will pend indefinitely. During the cutover, BITS will move all pending jobs to the new queue, but any job submitted after the cutover that still requests the old short or long queue will need to be altered by its owner.
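For job owners who prefer to move only their pending jobs, one at a time, the following is a minimal POSIX shell sketch. It assumes your Grid Engine user name is in $USER and that plain qstat output begins with its usual two header lines; the function names pending_job_ids and move_pending_to_broad are hypothetical. It uses `qstat -s p` to list pending jobs and `qalter -q` to change each job's requested queue.

```shell
#!/bin/sh
# Sketch: move each of your pending jobs to the "broad" queue.
# Assumptions (not from this notice): your Grid Engine user name is $USER,
# and qstat prints a header line plus a dashed separator before job rows.

# pending_job_ids: read plain qstat output on stdin, print each job ID once.
pending_job_ids() {
  # Skip the two header lines; column 1 of each job row is the job ID.
  awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }' | sort -u
}

# move_pending_to_broad: list your pending jobs (-s p) and re-target each one.
move_pending_to_broad() {
  qstat -u "$USER" -s p | pending_job_ids | while read -r job_id; do
    echo "Moving job $job_id to broad"
    qalter -q broad "$job_id"
  done
}
```

Source the file, then run `move_pending_to_broad`; the loop prints each job ID before calling qalter on it, so you can see which jobs were changed. Running jobs are not touched because only pending jobs are listed.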
All jobs will have an h_rt of 2 hours by default, so you will need to request an h_rt that is appropriate for your job. There is no upper limit on h_rt, but the longer the h_rt, the less likely the job is to benefit from backfilling. The more accurate each job's h_rt is, the more quickly and efficiently the scheduler can dispatch jobs. The format for h_rt is HH:MM:SS.
Example:
qsub -l h_rt=120:15:30 my_script.sh
# This job will be killed after 120 hours (5 days), 15 minutes, and 30 seconds of run time.
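Because hours in h_rt are not capped at 24, multi-day limits have to be written as total hours, as in the example above. The POSIX shell sketch below does that arithmetic and checks a value against the HH:MM:SS format before you submit; the helper names to_hrt and hrt_seconds are hypothetical and are not Grid Engine commands.

```shell
#!/bin/sh
# Sketch of h_rt arithmetic (helper names are hypothetical).
# h_rt is HH:MM:SS, and HH may exceed 24 for multi-day jobs.

# to_hrt DAYS HOURS MINUTES SECONDS
# Prints the total as HH:MM:SS. Pass plain integers without leading zeros.
to_hrt() {
  th_total=$(( (($1 * 24 + $2) * 60 + $3) * 60 + $4 ))
  printf '%02d:%02d:%02d\n' \
    $((th_total / 3600)) $((th_total % 3600 / 60)) $((th_total % 60))
}

# hrt_seconds HH:MM:SS
# Prints the limit in seconds, or fails unless the value is exactly three
# non-empty, all-digit fields separated by colons.
hrt_seconds() {
  case $1 in
    *[!0-9:]* | :* | *: | *::* | *:*:*:*) return 1 ;;
    *:*:*) ;;
    *) return 1 ;;
  esac
  # awk reads "08" as decimal 8, whereas shell arithmetic may treat a
  # leading 0 as octal.
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

For example, `qsub -l h_rt="$(to_hrt 5 0 15 30)" my_script.sh` requests the same 120:15:30 limit as the example, and `hrt_seconds 02:00:00` prints 7200, the default. The same request can also be placed inside the job script as a `#$ -l h_rt=120:15:30` line, which qsub reads as an embedded option.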
If you have any questions or concerns, we strongly encourage you to send an email to [email protected].
Thank you,
BITS Operations