Comments (12)

belforte commented on September 2, 2024

PreJob code seems OK

def alter_submit(self, crab_retry):

and not affected by recent changes

One problem is that the PreJob log is full of jobs because logging initialization is not working. But in there I find both of these:
CRAB_RequestedMemory = 2000;
RequestMemory = 4500;

Could it be a hint? Or is CRAB_RequestedMemory simply the initial value?

hmm...
at the same time as prejob_logs/prejob.6.7.txt and prejob_logs/prejob.6.7.txt were created, the input for PreJob was written to resubmit_info/job.6.txt

 ls -l resubmit_info/job.6.txt 
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 resubmit_info/job.6.txt

that file is read in

def get_resubmit_info(self):
    """
    Need a doc string here.
    """
    file_name = "resubmit_info/job.%s.txt" % (self.job_id)
    if os.path.exists(file_name):
        with open(file_name, 'r', encoding='utf-8') as fd:
            self.resubmit_info = literal_eval(fd.read())

and content later used in
maxmemory = self.resubmit_info[inkey].get('maxmemory')

(but I do not know what that inkey outkey stuff is)

but in the resubmit_info file:

belforte@vocms0194/cluster108538413.proc0.subproc0> cat resubmit_info/job.6.txt |tr , '\n'|grep maxmem
 'maxmemory': 2000
 'maxmemory': 2000
 'maxmemory': 2000
 'maxmemory': 2000
 'maxmemory': 2000
 'maxmemory': 2000
 'maxmemory': 2000
 'maxmemory': 2000
belforte@vocms0194/cluster108538413.proc0.subproc0> 

actually:

belforte@vocms0194/cluster108538413.proc0.subproc0> ls -l resubmit_info/
total 40
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.10.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.1.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 20:24 job.2.txt
-rw-r--r--. 1 crabtw zh 1324 Aug 23 20:24 job.3.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.4.txt
-rw-r--r--. 1 crabtw zh  757 Aug 22 21:32 job.5.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.6.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.7.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.8.txt
-rw-r--r--. 1 crabtw zh 1513 Aug 23 22:01 job.9.txt
belforte@vocms0194/cluster108538413.proc0.subproc0> cat resubmit_info/job.*.txt |tr , '\n'|grep maxmem|uniq
 'maxmemory': 2000
belforte@vocms0194/cluster108538413.proc0.subproc0> 

that value never changed, and IIUC it is what drives PreJob and Job.x.submit.
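
The same check can be done in Python rather than with cat/tr/grep. This is only a sketch: the helper name collect_key and the sample text are hypothetical, and the real layout of resubmit_info/job.N.txt may differ. The only assumption is that the file holds a Python literal, as implied by the literal_eval call in get_resubmit_info.

```python
from ast import literal_eval


def collect_key(obj, key):
    """Return every value stored under `key` anywhere in a nested literal."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            found.extend(collect_key(v, key))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            found.extend(collect_key(item, key))
    return found


# Hypothetical content, standing in for the text read from one resubmit_info file.
sample_text = "{'0': {'maxmemory': 2000, 'numcores': 1}, '1': {'maxmemory': 2000, 'numcores': 1}}"

values = collect_key(literal_eval(sample_text), 'maxmemory')
# A single distinct value means the setting never changed across retries.
print(sorted(set(values)))
```

Because the walker does not depend on how the entries are nested, it should keep working if the per-retry layout turns out to be different from the sample.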

from crabserver.

belforte commented on September 2, 2024

the file in resubmit_info is created (and later read) by PreJob itself

try:
    self.get_resubmit_info()
    self.alter_submit(crab_retry)
    self.save_resubmit_info()
except Exception:

I do not find that directory referenced in any other place in the source base.
I am confused

belforte commented on September 2, 2024

DagmanResubmitter has some helpful documentation

schedd.edit(rootConst, "HoldKillSig", 'SIGKILL')
# Overwrite parameters in the os.environ[_CONDOR_JOB_AD] file. This will affect
# all the jobs, not only the ones we want to resubmit. That's why the pre-job
# is saving the values of the parameters for each job retry in text files (the
# files are in the directory resubmit_info in the schedd).
for adparam, taskparam in params.items():
    if taskparam in ad:
        # repr() in the line below is a workaround for V2 bindings bug
        # https://github.com/dmwm/CRABServer/issues/8604#issuecomment-2284346056
        schedd.edit(rootConst, adparam, repr(ad.lookup(taskparam)))
    elif task['resubmit_' + taskparam] is not None:
        schedd.edit(rootConst, adparam, str(task['resubmit_' + taskparam]))
schedd.act(htcondor.JobAction.Hold, rootConst)
schedd.edit(rootConst, "HoldKillSig", 'SIGUSR1')
schedd.act(htcondor.JobAction.Release, rootConst)

belforte commented on September 2, 2024

hmmm... this should be the initial setting

if 'CRAB_RequestedMemory' in self.task_ad:
    maxmemory = int(str(self.task_ad.lookup('CRAB_RequestedMemory')))

then DagmanResubmitter communicates via the simple RequestMemory
params = {'CRAB_ResubmitList' : 'jobids',
          'CRAB_SiteBlacklist' : 'site_blacklist',
          'CRAB_SiteWhitelist' : 'site_whitelist',
          'MaxWallTimeMinsRun' : 'maxjobruntime',
          'RequestMemory' : 'maxmemory',
          'RequestCpus' : 'numcores',
          'JobPrio' : 'priority'
         }

and edits those in the target dagman job.

schedd.edit(rootConst, adparam, repr(ad.lookup(taskparam)))

The Dagman job ads are then saved to _CONDOR_JOB_AD in the SPOOL_DIR, where indeed

belforte@vocms0194/cluster108538413.proc0.subproc0> grep RequestMemory _CONDOR_JOB_AD 
RequestMemory = 4500
RequestMemory_RAW = 4500

and that file is read by PreJob

self.logger.info("Loading classads from: %s", os.environ['_CONDOR_JOB_AD'])
self.task_ad = classad.parseOne(open(os.environ['_CONDOR_JOB_AD'], 'r', encoding='utf-8'))
self.logger.info(str(self.task_ad))
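
To see both attributes side by side without the classad bindings, a plain-text scan of the ad file is enough. This is a rough sketch, not the PreJob code: read_ad_attributes is a hypothetical helper, it assumes simple "Name = value" lines as in the grep output, and the sample ad is assembled from the values quoted in this thread.

```python
def read_ad_attributes(ad_text, names):
    """Pick selected 'Name = value' attributes out of a job ad dumped as text."""
    wanted = {}
    for line in ad_text.splitlines():
        name, sep, value = line.partition('=')
        if sep and name.strip() in names:
            # Tolerate the trailing ';' seen in the PreJob log dump.
            wanted[name.strip()] = value.strip().rstrip(';').strip()
    return wanted


# Sample built from the values quoted above; a real ad has many more lines.
sample_ad = """CRAB_RequestedMemory = 2000
RequestMemory = 4500
RequestMemory_RAW = 4500
"""

attrs = read_ad_attributes(sample_ad, {'CRAB_RequestedMemory', 'RequestMemory'})
print(attrs)
```

On a schedd the text would come from the file named by os.environ['_CONDOR_JOB_AD']. Differing values for the two attributes point to the resubmission editing one ad while PreJob reads the other.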

belforte commented on September 2, 2024

sounds like the only way is

  1. fix logging in PreJob --> done in #8668
  2. add zillions of logging statements
  3. try to resubmit in a test task

Maybe I can go through PreJob interactively, but I am not sure it will really reproduce the situation for resubmission: https://twiki.cern.ch/twiki/bin/view/CMSPublic/Crab3OperatorDebugging#To_run_a_pre_job_on_the_schedd

belforte commented on September 2, 2024

I should also test with an older tag, to find roughly when it broke. It will be a long job anyhow.

belforte commented on September 2, 2024

I finally managed to run PreJob with pdb and found that, as feared, I introduced this bug in f700045#diff-4523635b2647add8f64521c2cb1208629e53538fde1905e60c8a57b285ebee30

I changed the ads used to communicate from TW to Dagman by prepending CRAB_ "for clarity", but somehow the change was not applied consistently

belforte commented on September 2, 2024

I guess the problem is here

params = {'CRAB_ResubmitList' : 'jobids',
          'CRAB_SiteBlacklist' : 'site_blacklist',
          'CRAB_SiteWhitelist' : 'site_whitelist',
          'MaxWallTimeMinsRun' : 'maxjobruntime',
          'RequestMemory' : 'maxmemory',
          'RequestCpus' : 'numcores',
          'JobPrio' : 'priority'
         }

The root cause was the confusion between ads meant for the JDL of the initial dagman-startup script, and ads used to communicate to various scripts how to submit jobs to grid.
I tried to clean that up using CRAB_... for the latter, but did not catch the resubmission use case.
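
The mismatch can be reproduced with plain dictionaries standing in for the classads. This is a simplified model, not the CRAB code: resubmit_edit and prejob_maxmemory are hypothetical stand-ins for the schedd.edit loop in DagmanResubmitter and the CRAB_RequestedMemory lookup in PreJob, and the numbers are the ones seen in this task.

```python
def resubmit_edit(ad, params, new_values):
    """Mimic the resubmitter: write each new task value to the ad named in params."""
    for adparam, taskparam in params.items():
        if taskparam in new_values:
            ad[adparam] = new_values[taskparam]


def prejob_maxmemory(ad):
    """Mimic PreJob: memory is taken from CRAB_RequestedMemory only."""
    if 'CRAB_RequestedMemory' in ad:
        return int(ad['CRAB_RequestedMemory'])
    return None


# Ads as set at initial submission (stand-in values).
task_ad = {'CRAB_RequestedMemory': 2000, 'RequestMemory': 2000}

# Memory entry of the mapping quoted above: it targets RequestMemory.
old_params = {'RequestMemory': 'maxmemory'}

resubmit_edit(task_ad, old_params, {'maxmemory': 4500})
print(task_ad['RequestMemory'], prejob_maxmemory(task_ad))
```

In this model the edited ad carries the new value while the lookup still returns the original one, which matches RequestMemory = 4500 next to maxmemory = 2000 in the files above.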

belforte commented on September 2, 2024

I had only taken care of setting CRAB_RequestedMemory in DagmanSubmitter

('+CRAB_RequestedMemory', 'tm_maxmemory'),

belforte commented on September 2, 2024

so..
Dagman(Re)Submitter should set values for the CRAB_xxx classAds, which are then used in PreJob to create the JDL for vanilla job submission
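
One way to guard against this kind of drift would be a small consistency check between the ads PreJob reads and the ads the resubmitter edits. The sketch below is hypothetical and not part of CRABServer: the function name is made up, only CRAB_RequestedMemory is confirmed by this thread as an ad read by PreJob, and new_params shows just the memory entry of a possible corrected mapping.

```python
# Ads PreJob is known (from this thread) to read; other entries would be guesses.
ADS_READ_BY_PREJOB = {'CRAB_RequestedMemory'}


def unedited_ads(params, ads_read):
    """Return the ads PreJob reads that the resubmitter mapping never edits."""
    return sorted(set(ads_read) - set(params))


# Memory entry of the mapping as quoted in this issue.
old_params = {'RequestMemory': 'maxmemory'}
# Hypothetical corrected entry, following the CRAB_ naming used at submission.
new_params = {'CRAB_RequestedMemory': 'maxmemory'}

print(unedited_ads(old_params, ADS_READ_BY_PREJOB))
print(unedited_ads(new_params, ADS_READ_BY_PREJOB))
```

A non-empty result flags an ad that keeps its initial value across resubmissions.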

belforte commented on September 2, 2024

I tested on one job in crab-dev-tw01 and it works.

I will add some documentation comments

belforte commented on September 2, 2024

after review, it is fine as it is
