Comments (25)
It can still be overwritten
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1925
And default values are
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1153
"SoftTimeout": {"default": 129600, "type": int, "validate": lambda x: x > 0},
"GracePeriod": {"default": 300, "type": int, "validate": lambda x: x > 0},
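To make the two definitions above concrete, here is a minimal sketch of how a spec entry with "default", "type" and "validate" keys could be resolved into a value. The helper resolve_argument is hypothetical and written only for illustration; it is not the WMCore validation code.

```python
# Argument definitions copied from the StdBase.py lines quoted above.
ARG_SPEC = {
    "SoftTimeout": {"default": 129600, "type": int, "validate": lambda x: x > 0},
    "GracePeriod": {"default": 300, "type": int, "validate": lambda x: x > 0},
}


def resolve_argument(name, user_args, spec=ARG_SPEC):
    """Hypothetical helper: use the user value if given, else the default."""
    definition = spec[name]
    value = user_args.get(name, definition["default"])
    value = definition["type"](value)  # cast to the declared type
    if not definition["validate"](value):
        raise ValueError("%s=%r failed validation" % (name, value))
    return value


print(resolve_argument("SoftTimeout", {}))                    # falls back to the default
print(resolve_argument("GracePeriod", {"GracePeriod": 600}))  # user value wins
```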
from t0.
Question: this is just insurance, right? To make sure jobs actually exit where they are running and aren't just eventually declared bad by the agent for taking too much time?
from t0.
Question though: is that 36h actually used by most jobs? I've seen NERSC jobs on KNL that get evicted when they get close to 48h wall time, and there was certainly no runtime soft kill there.
from t0.
Sorry Dirk, I was wrong. It seems that the default value is overwritten by this call:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L230
So it is set to 47h. Sorry for the confusion. We need to clean up the code. I still think you can update those values as is done in RunConfigAPI.py.
> Question, this is just insurance, right ? To make sure jobs actually exit where they are running and aren't just declared bad eventually by the agent due to taking too much time ?
You could say that.
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L228
from t0.
Hm, if there are separate soft and hard timeouts, what is the grace period for?
from t0.
Hard timeout = soft timeout + grace period.
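As a worked example of that relation, using the default values quoted from StdBase.py earlier in the thread (the function name is made up for illustration):

```python
SOFT_TIMEOUT = 129600  # seconds, default quoted from StdBase.py
GRACE_PERIOD = 300     # seconds, default quoted from StdBase.py


def hard_timeout(soft_timeout, grace_period):
    """Hard timeout is the soft timeout plus the grace period (all in seconds)."""
    return soft_timeout + grace_period


total = hard_timeout(SOFT_TIMEOUT, GRACE_PERIOD)
print("hard timeout: %d s (%.2f h)" % (total, total / 3600.0))
```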
from t0.
The code you pointed me at hard-codes the grace period at 5 min though, so why is there still another parameter at all if it's ignored?
from t0.
It is not ignored, unlike maxRSS and maxVSize. If you call updateArguments after the spec is created, it will still update the values.
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1925
I think the default value we set below will be overwritten by the hard-coded one. (We should probably clean that up or remove it so it won't confuse people.) These values still need to be configurable, right?
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1153
from t0.
You are right, it's actually constructed here
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1831
and
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L1225
Still don't really know what the difference is between addRuntimeMonitors and setPerformanceMonitor. They seem to do the same thing in different ways.
from t0.
Yes, we need to clean up that code so it is more maintainable. I created an issue:
dmwm/WMCore#8735
from t0.
The pilot lifetime can be changed. Tier0 pilots can have a longer lifetime if we find it necessary, i.e. if jobs cannot be processed in less than 47h. So let's make sure we don't introduce limitations at the WMCore level that prevent us from having "7-day" long jobs if necessary.
from t0.
Let me try to clarify a few points that were misunderstood.
- HardTimeout is meant to be used and updated by ReqMgr2 itself. So T0/CompOps is supposed to update only SoftTimeout and GracePeriod (from which HardTimeout is derived, as Seangchan explained above). GracePeriod is not used anywhere except to calculate HardTimeout. Maybe we could get rid of it and keep only SoftTimeout and HardTimeout?
- When we create a workflow, yes, there are hard-coded settings for those soft/hard timeout parameters. They don't actually belong to the creation phase of a spec, but they are part of the whole workflow/workload construction, which is why we need to define something there. The code is: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L228
- That said, SoftTimeout and GracePeriod are assignment parameters, and they do have default values there (not hard-coded values). So one can pass whatever values one wants to updateArguments, and that is how long a job can run; afterwards the job watchdog sends it a signal to be killed.
from t0.
Just to complement, SoftTimeout setting is 36h:
In [34]: 129600 / 3600
Out[34]: 36
from t0.
Well, that doesn't match my observation. I see 47h (as Seangchan pointed out).
from t0.
Maybe those KNL jobs were resized?
from t0.
Yes, they were resized to more cores. Why does that matter?
from t0.
Actually, resizable jobs don't resize the timeout parameters, so scratch that.
Bottom line: the default SoftTimeout is 36h, unless someone provides a different value during assignment.
from t0.
Hi Alan, actually that default value seems to be overwritten by this call (for merging and log collection as well):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L347
and it sets 47h:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L230
I was confused initially as well.
Seangchan
from t0.
I insist, it's the 47h hard-coded value that gets overwritten, not the other way around :)
At no point during assignment do we call the method you pointed out, setupProcessingTask.
from t0.
But Tier0 doesn't use assignment, does it?
from t0.
Yes, it does. RunConfigAPI calls the workload method updateArguments, so we can say T0 also "assigns" workflows.
BTW, I just tested what we're discussing with the DMWM template MonteCarlo_eff.json: just remove the SoftTimeout key/value and give it a try. Workflow injected only:
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_MonteCarlo_eff_HG1808f_Validation_180810_210522_2455
and workflow injected and assigned (without providing SoftTimeout, so the default is used):
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_MonteCarlo_eff_HG1808f_Validation_180810_210557_6989
Now, there is one thing that I agree is confusing. If you look at the JSON of the injected-only request, it has the default value, even though the workload spec has 47h (and we know that what matters is the workload spec).
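The behaviour described above can be sketched with a toy model. This is not WMCore code: create_spec and assign are hypothetical stand-ins for workload creation and for the updateArguments step, and treating the 47h creation-time value as the soft timeout is an assumption made for the sake of the example.

```python
HOUR = 3600

# Assignment-time defaults, as quoted from StdBase.py in this thread.
ASSIGNMENT_DEFAULTS = {"SoftTimeout": 36 * HOUR, "GracePeriod": 300}


def create_spec():
    """Creation phase (toy stand-in): the spec starts with hard-coded values."""
    return {"SoftTimeout": 47 * HOUR, "GracePeriod": 300}


def assign(spec, user_args):
    """Assignment phase (toy stand-in): user values, else defaults, overwrite the spec."""
    updated = dict(spec)
    for key, default in ASSIGNMENT_DEFAULTS.items():
        updated[key] = user_args.get(key, default)
    return updated


injected_only = create_spec()
assigned_default = assign(create_spec(), {})
assigned_custom = assign(create_spec(), {"SoftTimeout": 46 * HOUR, "GracePeriod": HOUR})

for label, spec in [("injected only", injected_only),
                    ("assigned, defaults", assigned_default),
                    ("assigned, custom", assigned_custom)]:
    print("%-20s SoftTimeout = %d h" % (label, spec["SoftTimeout"] // HOUR))
```

In this model the hard-coded value only survives for a workflow that is never assigned, which is the point being argued in the thread.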
from t0.
I still don't quite believe we kill at 36h, no matter what the request says. But I'll need to debug further and compare the actual running jobs, when they are killed, and the request in reqmgr2 before I can follow up on this...
So @amaltaro, to summarize: according to you, what we set in
https://github.com/dmwm/T0/blob/master/src/python/T0/RunConfig/RunConfigAPI.py#L1096
actually matters?
from t0.
> Yes, it does. That RunConfigAPI calls the workload method updateArguments, so we can say T0 also "assigns" workflows.
Alan, you're right. I didn't see that it actually calls the line below.
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1900
Also, I got confused: Dirk is actually asking about jobs at NERSC, not Tier0 jobs. Sorry for the confusion.
Yes, it seems the default value is 36 hours.
> https://github.com/dmwm/T0/blob/master/src/python/T0/RunConfig/RunConfigAPI.py#L1096
> actually matters ?
Yes, that should actually matter.
from t0.
No problem.
Dirk, Seangchan is right, it does matter.
from t0.
We have moved to a 46h soft timeout and a 1h grace period for our jobs, as can be seen here:
T0/src/python/T0/RunConfig/RunConfigAPI.py, line 702 at commit 7b4f5b1
from t0.