Comments (25)

ticoann commented on July 29, 2024

It can still be overwritten
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1925

And default values are
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1153
"SoftTimeout": {"default": 129600, "type": int, "validate": lambda x: x > 0},
"GracePeriod": {"default": 300, "type": int, "validate": lambda x: x > 0},
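
To illustrate how entries of this shape can be consumed, here is a minimal sketch. The `ARG_SPEC` dict copies only the two entries quoted above, and `resolve_argument` is a hypothetical helper written for this example, not a WMCore function.

```python
# Hypothetical stand-in holding only the two entries quoted above;
# the real argument specification in StdBase.py contains many more.
ARG_SPEC = {
    "SoftTimeout": {"default": 129600, "type": int, "validate": lambda x: x > 0},
    "GracePeriod": {"default": 300, "type": int, "validate": lambda x: x > 0},
}


def resolve_argument(name, supplied):
    """Return the supplied value cast to the declared type, or the default.

    Hypothetical helper for illustration only.
    """
    entry = ARG_SPEC[name]
    if name not in supplied:
        return entry["default"]
    value = entry["type"](supplied[name])
    if not entry["validate"](value):
        raise ValueError(f"invalid value for {name}: {value!r}")
    return value


soft = resolve_argument("SoftTimeout", {})                       # falls back to the default
grace = resolve_argument("GracePeriod", {"GracePeriod": "600"})  # cast to int, then validated
```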

from t0.

hufnagel commented on July 29, 2024

Question, this is just insurance, right ? To make sure jobs actually exit where they are running and aren't just declared bad eventually by the agent due to taking too much time ?

hufnagel commented on July 29, 2024

Question though, is that 36h actually used by most jobs ? I've seen NERSC jobs on KNL that get evicted when they reach close to 48h wall time, certainly no runtime soft kill there.

ticoann commented on July 29, 2024

Sorry Dirk, I was wrong. It seems that the default value is overwritten by this call:

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L230

So it is set to 47h. Sorry for the confusion. We need to clean up the code. I still think you can update those values as in RunConfigAPI.py.

> Question, this is just insurance, right ? To make sure jobs actually exit where they are running and aren't just declared bad eventually by the agent due to taking too much time ?

You could say that.
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L228

hufnagel commented on July 29, 2024

Hm, if there are separate soft and hard timeouts, what is the grace period for ?

ticoann commented on July 29, 2024

hard timeout = soft timeout + grace period.
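
As a rough sketch of that relation, using the default values quoted earlier in the thread. The function names and the "running"/"soft"/"hard" labels are made up for this example and are not taken from PerformanceMonitor.py.

```python
def hard_timeout(soft_timeout, grace_period):
    """Hard timeout in seconds, as described in the thread: soft + grace."""
    return soft_timeout + grace_period


def timeout_state(elapsed, soft_timeout, grace_period):
    """Classify a job's elapsed wall time (seconds).

    Hypothetical labels: "running" before the soft timeout, "soft" once the
    soft timeout has passed, "hard" once soft + grace has passed.
    """
    if elapsed >= hard_timeout(soft_timeout, grace_period):
        return "hard"
    if elapsed >= soft_timeout:
        return "soft"
    return "running"


# Defaults quoted in the thread: SoftTimeout=129600 s (36h), GracePeriod=300 s (5 min).
default_hard = hard_timeout(129600, 300)
state_at_36h = timeout_state(129600, 129600, 300)
```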

hufnagel commented on July 29, 2024

That code you pointed me at hard codes the grace period at 5min though, so why is there still another parameter at all then if it's ignored ?

ticoann commented on July 29, 2024

It is not ignored, unlike maxRSS and maxVSize. If you call updateArguments after the spec is created, it will still update the values.

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1925

I think the default value we set below gets overwritten by the hard-coded one. (We should probably clean that up or remove it so it won't confuse people.) These values still need to be configurable, right?
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1153

hufnagel commented on July 29, 2024

You are right, it's actually constructed here

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1831

and

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L1225

Still don't really know what the difference is between addRuntimeMonitors and setPerformanceMonitor. They seem to do the same thing in different ways.

ticoann commented on July 29, 2024

Yes, we need to clean up that code so it is more maintainable. I created an issue:
dmwm/WMCore#8735

drkovalskyi commented on July 29, 2024

The pilot lifetime can be changed. Tier0 pilots can have a longer lifetime if we find it necessary, i.e. if jobs cannot be processed in less than 47h. So let's make sure we don't introduce limitations at the WMCore level that prevent us from having "7-day" long jobs if necessary.

amaltaro commented on July 29, 2024

Let me try to clarify a few points that were misunderstood.

  1. HardTimeout is meant to be used and updated by ReqMgr2 itself. So T0/CompOps is supposed to update only SoftTimeout and GracePeriod (which HardTimeout is derived from, as Seangchan explained above). GracePeriod is not used anywhere except to calculate the HardTimeout. Maybe we could get rid of it and keep only SoftTimeout and HardTimeout?

  2. When we create a workflow, yes, there are hard-coded settings for those Soft/Hard timeout parameters because they actually don't belong to the creation phase of a spec, but they are part of the whole workflow/workload construction and that's why we need to define something there. Code is:
    https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L228

  3. That said, SoftTimeout and GracePeriod are assignment parameters, and they do have default values there (not hard-coded values). So one can pass whatever values they want to updateArguments, and that's how long a job can run; afterwards the job watchdog signals it to be killed.
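
The two phases described in points 2 and 3 can be sketched roughly as follows. This is a toy model under stated assumptions: the 47h and 5 min creation values are the ones mentioned in this thread, the spec is modelled as a plain dict, and `create_spec` and `assign` are hypothetical names, not the WMCore API.

```python
# Values hard-coded at workload creation, as reported in this thread (assumption).
CREATION_SOFT_TIMEOUT = 47 * 3600
CREATION_GRACE_PERIOD = 300

# Assignment-time defaults quoted from the argument specification.
ASSIGNMENT_DEFAULTS = {"SoftTimeout": 129600, "GracePeriod": 300}


def create_spec():
    """Toy spec as it might look right after creation (hypothetical)."""
    return {"SoftTimeout": CREATION_SOFT_TIMEOUT, "GracePeriod": CREATION_GRACE_PERIOD}


def assign(spec, assignment_args):
    """Overwrite the creation-time values with assignment values or defaults."""
    updated = dict(spec)
    for key, default in ASSIGNMENT_DEFAULTS.items():
        updated[key] = assignment_args.get(key, default)
    updated["HardTimeout"] = updated["SoftTimeout"] + updated["GracePeriod"]
    return updated


injected_only = create_spec()                       # still carries the 47h value
assigned_default = assign(create_spec(), {})        # default 36h replaces it
assigned_custom = assign(create_spec(), {"SoftTimeout": 165600, "GracePeriod": 3600})
```

In this model an injected-only workload keeps the creation-time value, while any assignment replaces it with either the supplied value or the default.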

amaltaro commented on July 29, 2024

Just to complement, SoftTimeout setting is 36h:
In [34]: 129600 / 3600
Out[34]: 36

hufnagel commented on July 29, 2024

Well, that doesn't match my observation. I see 47h (like Seangchan pointed at).

amaltaro commented on July 29, 2024

Maybe those KNL jobs were resized?

hufnagel commented on July 29, 2024

Yes, they were, resized to more cores. How does that matter ?

amaltaro commented on July 29, 2024

Actually, resizable jobs don't resize the timeout parameters, so scratch that.
Bottom line: the default SoftTimeout is 36h, unless someone provides a different value during assignment.

ticoann commented on July 29, 2024

Hi Alan, actually that default value seems to be overwritten by this call (for merging and log collection as well):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L347

and it sets 47h
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L230

I was confused initially as well.

Seangchan

amaltaro commented on July 29, 2024

I insist, it's the 47h hard-coded value that gets overwritten, not the other way around :)
At no point during assignment do we call the method you pointed out, setupProcessingTask.

ticoann commented on July 29, 2024

But Tier0 doesn't use assignment, does it?

amaltaro commented on July 29, 2024

Yes, it does. That RunConfigAPI calls the workload method updateArguments, so we can say T0 also "assigns" workflows.

BTW, I just tested what we're discussing with the DMWM template MonteCarlo_eff.json. Just remove the SoftTimeout key/value and give it a try. Workflow injected only:
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_MonteCarlo_eff_HG1808f_Validation_180810_210522_2455

and workflow injected and assigned (without providing SoftTimeout, so the default applies):
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_MonteCarlo_eff_HG1808f_Validation_180810_210557_6989

Now, there is one thing that I agree is confusing. If you look at the JSON of the injected-only request, it has the default value, even though the workload spec has 47h (and we know that what matters is the workload spec).

hufnagel commented on July 29, 2024

I still don't quite believe we kill on 36h, no matter what the request says. But I'll need to debug further and compare actual jobs running and when they are killed and the request in reqmgr2 before I can follow up on this...

So @amaltaro , to summarize, according to you what we set in

https://github.com/dmwm/T0/blob/master/src/python/T0/RunConfig/RunConfigAPI.py#L1096

actually matters ?

ticoann commented on July 29, 2024

> Yes, it does. That RunConfigAPI calls the workload method updateArguments, so we can say T0 also "assigns" workflows.

Alan, you're right. I didn't see that it actually calls the line below.
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMWorkload.py#L1900
Also, I got confused: Dirk is actually asking about jobs at NERSC, not Tier0 jobs. Sorry for the confusion.

Yes, it seems the default value is 36 hours.

@hufnagel,

> https://github.com/dmwm/T0/blob/master/src/python/T0/RunConfig/RunConfigAPI.py#L1096
>
> actually matters ?

Yes, that should actually matter.

amaltaro commented on July 29, 2024

No problem.
Dirk, Seangchan is right, it does matter.

germanfgv commented on July 29, 2024

We have moved to a 46h Soft timeout and a 1h grace period for our jobs, as can be seen here:

'SoftTimeout': 165600, #46 hours
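
For reference, the arithmetic behind these numbers can be checked with a small sketch. Only the `'SoftTimeout'` line above comes from the actual configuration; the `GracePeriod` entry here is inferred from the "1h grace period" statement, and `hours_to_seconds` is a hypothetical helper.

```python
SECONDS_PER_HOUR = 3600


def hours_to_seconds(hours):
    """Convert whole hours to seconds (hypothetical helper)."""
    return int(hours * SECONDS_PER_HOUR)


timeouts = {
    "SoftTimeout": hours_to_seconds(46),  # 165600, matches the line quoted above
    "GracePeriod": hours_to_seconds(1),   # assumed from the "1h grace period" statement
}

# Hard timeout = soft timeout + grace period, expressed in hours.
hard_timeout_hours = (timeouts["SoftTimeout"] + timeouts["GracePeriod"]) / SECONDS_PER_HOUR
```

With these values the hard timeout comes out at 47 hours.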
