
cms-htcondor-es's People

Contributors

bbockelm, belforte, cronosnull, khurtado, lecriste, leggerf, mrceyhun, nikodemas, stiegerb, todor-ivanov, vkuznet


cms-htcondor-es's Issues

Update JobMonitoring schema with unknown keys

(Starting to hand over, it would be good if someone else would start doing these things.)

Here's a short guideline:

  • Check es-cms-logmon for a list of recent messages about validation failures because of unknown keys (message_type:validator_unknown_key).
  • Edit the JobMonitoring.json file in /home/cmsjobmon/cms-htcondor-es/ (in a safe place) and add key/example value pairs.
  • Create a PR on dmwm/CMSMonitoring to update the schemas/JobMonitoring.json and jsonschemas/JobMonitoring.json files. Create the latter using genson schemas/JobMonitoring.json > jsonschemas/JobMonitoring.schema.
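For reference, the same schema generation can be done with genson's Python API instead of the CLI; a minimal sketch, assuming genson is installed and using the paths from the guideline above:

# Minimal sketch: regenerate jsonschemas/JobMonitoring.schema from the example
# document in schemas/JobMonitoring.json using the genson package.
import json
from genson import SchemaBuilder

with open("schemas/JobMonitoring.json") as src:
    example = json.load(src)

builder = SchemaBuilder()
builder.add_object(example)  # infer a JSON schema from the example key/value pairs

with open("jsonschemas/JobMonitoring.schema", "w") as dst:
    dst.write(builder.to_json(indent=2))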

Jobs missing in MONIT

It seems we are still far from having full accounting of jobs in MONIT.

While investigating a problem with glideinWMS
https://ggus.eu/index.php?mode=ticket_info&ticket_id=138535
I tried to use ES+Kibana etc., but ran into a very puzzling situation: for this CRAB task [1]
I cannot find in ES the jobs which completed at T2_TR_METU.
An example of such a job (which completed successfully at that site a few hours ago) is [3].
If the job ran, HTCondor certainly knows about it, so why is it not shown?
Am I not looking in the right way, or is something wrong in the reporting?

By the way, those jobs are in ES while running (although with incomplete
information, e.g. missing CRAB_Id, i.e. their ID inside the CRAB task), but somehow not as Completed [4].

[1]
Here are pointers in the new and old dashboards.

https://monit-grafana.cern.ch/d/NvcuKTSiz/condor-jobs-test?orgId=6&var-task=181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1&var-user=areinsvo
http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=areinsvo&refresh=0&table=Mains&p=1&records=25&activemenu=2&pattern=181203_151212%3Aareinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1&task=&from=&till=&timerange=lastWeek

[3]
https://cmsweb.cern.ch/scheddmon/0122/cms1354/181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1/job_out.30.0.txt

[4]
https://monit-kibana.cern.ch/kibana/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(data.RequestCpus,data.Site,data.MachineAttrCpus0,data.Status,data.CRAB_Id),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'monit_prod_condor_raw_metric_v002-*',key:data.Type,negate:!f,type:phrase,value:analysis),query:(match:(data.Type:(query:analysis,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'monit_prod_condor_raw_metric_v002-*',key:data.RequestCpus,negate:!f,type:phrase,value:'4'),query:(match:(data.RequestCpus:(query:4,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'monit_prod_condor_raw_metric_v002-*',key:data.Status,negate:!f,type:phrase,value:Completed),query:(match:(data.Status:(query:Completed,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'monit_prod_condor_raw_metric_v002-*',key:data.JobStatus,negate:!f,type:phrase,value:'4'),query:(match:(data.JobStatus:(query:4,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'monit_prod_condor_raw_metric_v002-*',key:data.Site,negate:!f,type:phrase,value:T2_TR_METU),query:(match:(data.Site:(query:T2_TR_METU,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'monit_prod_condor_raw_metric_v002-*',key:data.CRAB_Workflow,negate:!f,type:phrase,value:'181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1'),query:(match:(data.CRAB_Workflow:(query:'181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1',type:phrase))))),index:'monit_prod_condor_raw_metric_v002-*',interval:auto,query:(match_all:()),sort:!(data.MachineAttrCpus0,desc))

Where are the shell scripts?

I found that the repository does not contain the shell scripts used in the cronjobs on vocms0240. Why are these scripts not present in this repository? Where are they located?

I suggest updating the repository with all the shell scripts we need to use on the production node.

Generate log tarball URL in ElasticSearch ads

For production jobs, we currently store the tarballs in CMS EOS (and they are readable by CMS members).

The URL can be constructed with the following recipe:

"https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/" + \
   <task name> + "/" + \
   <step name> + "/" + \
   <schedd name> + <WMAgent job ID> + "-" + <retry #> + "-log.tar.gz"

For WMAgent jobs, can we add a new attribute (LogURL?) containing a URL constructed as above?
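A purely illustrative sketch of how such a LogURL could be assembled in the converter; the ClassAd attribute names used below are assumptions for this example and would need to be checked against the real WMAgent ads:

# Sketch only: build the EOS log tarball URL following the recipe above.
# WMAgent_RequestName, WMAgent_SubTaskName, ScheddName, WMAgent_JobID and
# Retries are placeholder attribute names, not confirmed ones.
BASE_URL = "https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/"

def make_log_url(ad):
    try:
        return "%s%s/%s/%s%d-%d-log.tar.gz" % (
            BASE_URL,
            ad["WMAgent_RequestName"],   # <task name>
            ad["WMAgent_SubTaskName"],   # <step name>
            ad["ScheddName"],            # <schedd name>
            int(ad["WMAgent_JobID"]),    # <WMAgent job ID>
            int(ad["Retries"]),          # <retry #>
        )
    except KeyError:
        return None  # not a WMAgent job, or attributes not (yet) available

# result["LogURL"] = make_log_url(ad)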

Create a way to identify the last update for a job.

For the tasks' index, we only keep the last status for each job. However, the order in which the messages are consumed is not enough to ensure that the last message seen is the actual last update. We started using the RecordTime, and it worked when the status changed, but for a completed job all status updates carry the same RecordTime, so updates to completed jobs were not overwritten (as they have the same version).

Currently, DataCollection and RecordTime use the CompletionDate for Completed jobs. Could we use the _launch_time for DataCollection (or a new field) and use it as a version hint?
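One possible sketch of that idea, assuming _launch_time is the timestamp of the current collection cycle (the field name and its use as a version hint are illustrative, not an agreed design):

# Sketch: keep RecordTime based on CompletionDate, but carry the collection-cycle
# time in DataCollection so that later cycles always produce a newer version,
# even for jobs that are already Completed.
import time

_launch_time = int(time.time())  # start of this collection cycle

def add_version_hint(result):
    result["DataCollection"] = _launch_time
    return result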

Add CRAB_Retry and QDate for Idle and running tasks

In the new ES schema, the retry number is part of the document ID; it is needed to avoid duplicates.

The QDate can be useful in the task list for running and pending tasks, letting the user order (and, once Grafana enables us to do so, filter) the jobs.

Elasticsearch 6 migration

Discussed with CERN/ES about this. They will start setting up a new endpoint "es-cms6" and mirror new data there. We can then observe possible issues and start fixing them.

The main thing is the removal of mapping types, but we only use a single type ("job"), so that should not be an issue for us.

analysis jobs marked as production

I've noticed there are some analysis jobs that are pushed as Type:production and traced this issue down to this part of the code:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L574

There are several analysis jobs, mostly JobUniverse 7 and 12, that neither have CRAB_Id nor define AccountingGroup; hence they get marked as 'production'. Look at the condor query done on a CRAB schedd [1].

Wouldn't it be better to use the CMSGWMS_Type attribute in the schedd ClassAd to get the type of schedd (crabschedd, prodschedd, tier0schedd) and set the job Type accordingly?

[1]
[root@vocms0198 ~]# condor_q -const 'CRAB_Id=?=UNDEFINED' -af:h JobUniverse AccountingGroup | sort | uniq -c
240 12 undefined
10 5 production.cmsdataops
2 5 undefined
2 7 "highprio.spiga"
5242 7 undefined
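For illustration, a sketch of the suggested CMSGWMS_Type-based classification (not the current implementation; how the schedd ad reaches the converter is an assumption here):

# Sketch: derive the job Type from the schedd's CMSGWMS_Type attribute instead of
# per-job attributes such as CRAB_Id or AccountingGroup.
def job_type_from_schedd(schedd_ad, default="unknown"):
    mapping = {
        "crabschedd": "analysis",
        "prodschedd": "production",
        "tier0schedd": "tier0",
    }
    gwms_type = str(schedd_ad.get("CMSGWMS_Type", "")).lower()
    return mapping.get(gwms_type, default)

# result["Type"] = job_type_from_schedd(schedd_ad)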

Re-evaluate the need for constraining data for running/pending jobs

We're currently maintaining a limited list of fields to be retained when processing the condor queues:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L423

And we're dropping any field not in that list when processing the queues:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L780-L781

(The behavior can be switched off by using the --keep_full_queue_data option for spider_cms.py)

I had introduced this as a means of speeding up the uploading of documents to MONIT (it cuts about a factor of 5 in volume, iirc) as we were having trouble staying within the 12 min cycle. However, we were still uploading from a VM hosted at UNL at that time, and the faster upload time from the CERN hosted VM to MONIT might have made this unnecessary.

To check, we could run a few cycles without culling the fields and see how much the performance suffers.

Another aspect to keep in mind, however, is that this would significantly increase the data volume needed on the MONIT side, since it affects a large majority of the documents we send.
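For context, a rough sketch of what the culling amounts to (the real whitelist is the one linked above; the field names here are just examples):

# Sketch: keep only the whitelisted fields for queue (running/pending) documents,
# unless --keep_full_queue_data was requested.
RUNNING_FIELDS = {"Status", "Site", "RequestCpus", "CRAB_Id", "RecordTime"}  # example subset

def cull_queue_doc(doc, keep_full_queue_data=False):
    if keep_full_queue_data:
        return doc
    return {key: value for key, value in doc.items() if key in RUNNING_FIELDS}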

Setting new filter for all condor job monitoring queries: JobUniverse==5

As suggested by @belforte in MM (Distributed Analysis channel), we are planning to add a JobUniverse==5 constraint to the scripts that query the HTCondor schedds, both queue and history.

As the CMS Monitoring & Analytics team, we are just the operators of this Condor job monitoring; that's why we would like to get confirmation from all stakeholders.

You can see Completed jobs with JobUniverse different from 5 in [1], and other jobs with JobUniverse different from 5 in [2].

[1] https://es-cms.cern.ch/kibana/goto/9916fa1438a0d470cf620150e2ec8fb7
[2] https://monit-opensearch.cern.ch/dashboards/goto/9f7d3eaf122501164b28e0cb84b616b5?security_tenant=global
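For concreteness, a sketch of what adding the constraint could look like with the HTCondor Python bindings (illustrative only; the production query code and its existing constraints live in the spider scripts):

# Sketch: restrict both queue and history queries to vanilla-universe jobs.
import htcondor

schedd = htcondor.Schedd()  # in production, one Schedd object per collector-advertised schedd ad

constraint = "JobUniverse == 5"
queue_ads = schedd.query(constraint=constraint)
history_ads = list(schedd.history(constraint, [], match=10000))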

fyi @leggerf @brij01

  • Ceyhun

Add joblog and postJobLog urls

In order to avoid using JavaScript links to create the URLs, we'll send the URLs to ES.
Also, in order to use the value directly, we'll add the committed wall clock time in seconds (so that Grafana's duration formats can be used).

RecordTime for Held jobs

We currently set the RecordTime as the CompletionDate from the ad, and then overwrite it with the current time in case the CompletionDate is not set:

https://github.com/bbockelm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L536-L540

result['RecordTime'] = ad.get("CompletionDate", 0)

...

if not result['RecordTime']:
    result['RecordTime'] = _launch_time

The problem is when jobs have a CompletionDate but have not actually terminated, e.g. because they are "Held". (I'm not actually sure there are any other examples.) Those docs will be submitted with the same RecordTime over and over again, and pile up in the dashboard in one time bin.

Instead we could set RecordTime to be the launch time for all jobs with statuses other than "Completed".
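A minimal sketch of that proposal, assuming _launch_time is the collection-cycle timestamp already used in the converter:

# Sketch: only Completed jobs keep CompletionDate as RecordTime; everything else
# (Held, Running, Idle, Removed, ...) gets the collection-cycle time, so repeated
# documents spread over time bins instead of piling up in one.
def record_time(result, ad, launch_time):
    if result.get("Status") == "Completed" and ad.get("CompletionDate", 0):
        return ad["CompletionDate"]
    return launch_time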

Repeated records when a job is removed

It seems that when a job gets removed, the monitoring system keeps adding duplicate records of it every 12 minutes. In this example:
https://monit-kibana.cern.ch/kibana/app/kibana::/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2019-07-29T22:00:00.000Z',mode:absolute,to:'2019-08-07T21:59:59.999Z'))&_a=(columns:!(_source),index:'monit_prod_condor_raw_metric_v002-*',interval:auto,query:(query_string:(query:'data.CRAB_Retry:3+AND+data.CRAB_Id:11+AND+data.CRAB_Workflow:190702_095101*acarvalh_crab_2016v3_2019Jul02_ZZTo2L2Nu_13TeV_powheg_pythia8__RunIISummer16MiniAODv3-PUMoriond17_94X_mcotic_v3-v2')),sort:!(metadata.timestamp,desc))

you can see that all of the records point to the exact same job: same LastStartTime, same globalJobId, etc. One thing to notice is that the CpuEff does change; this could be the reason why the monitoring system thinks it is a different record. I think this is related to this bug in condor: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7083

CMSPrimaryProcessedDataset and CMSPrimaryPrimaryDataset not present in some ES indexes

Hello,

While looking into the following issue:
dmwm/WMCore#11613

It was found that some indexes, like the one used for monit_prod_condor, seem to be missing some entries from this block:

https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L751-L757

Basically, if we have something like (as shown in the ES link above):

data.DESIRED_CMSDataset : /EGamma/Run2022C-v1/RAW

Then, we should see:
data.CMSPrimaryPrimaryDataset, data.CMSPrimaryDataTier, data.CMSPrimaryProcessedDataset
but only data.CMSPrimaryDataTier is properly shown, with the other two set to Unknown. I haven't seen this in other indexes like cms-20*; is this a bug?

Also, we had a few questions regarding the names:

  • CMSPrimaryPrimaryDataset: why Primary twice in the name?

  • CMSPrimaryDataTier: would it be better to rename it to DESIRED_CMSDatatier? In that case, we would like to do some renaming to make things clearer, like:

    • CMSPrimaryPrimaryDataset -> DESIRED_CMSPrimaryDataset
    • CMSPrimaryProcessedDataset -> DESIRED_CMSProcessedDataset
    • CMSPrimaryDataTier -> DESIRED_CMSDatatier

what do you think?
pinging @amaltaro

Properly sort jobs in Grafana dashboard for task monitoring

Given the discussion in dmwm/CRABServer#5860
we agreed with @cronosnull to add a numerical field corresponding to the CRAB_Id string and sort jobs accordingly.

At the moment, the possible values of CRAB_Id are of two types:

  • "1", "2", "3", ..., up to ~"10000"
  • "n-1", "n-2", "n-3", ..., where n can be 0, 1, 2 or 3

and the overall order should be 0-1, 0-2, 0-3, 0-..., 1, 2, 3, ..., 1-1, 1-2, 1-3, 1-..., 2-1, 2-2, 2-3, 2-..., 3-1, 3-2, 3-3, 3-...
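A possible sketch of such a numerical field implementing exactly this ordering (the field name CRAB_Id_Sort and the encoding are assumptions):

# Sketch: map CRAB_Id strings to a single sortable integer.
# Ordering produced: "0-1", "0-2", ... < "1", "2", ... < "1-1", ... < "2-1", ... < "3-1", ...
# The stride of 100000 is safe because plain ids only go up to ~10000.
def crab_id_sort_key(crab_id):
    value = str(crab_id)
    if "-" in value:
        prefix, sub = value.split("-", 1)
        prefix, sub = int(prefix), int(sub)
        group = prefix if prefix == 0 else prefix + 1  # 0-* first, n-* (n >= 1) after plain ids
    else:
        group, sub = 1, int(value)                     # plain ids sit between 0-* and 1-*
    return group * 100000 + sub

# result["CRAB_Id_Sort"] = crab_id_sort_key(result["CRAB_Id"])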

Replace the collector cmssrv221 with cmssrv623

This change has already been pushed to production using a temporary Python 2 branch (keeping the two collectors), but once the Python 3 version is ready we need to apply the change there as well.

We use this strategy because the changes can be slightly different and because we want to keep the Python 3 branch (which is currently under review) able to merge automatically with master.

commonExitCode defaults to 0 (meaning "success") when no classad is available to provide the exitCode of the job

When looking into the following grafana dashboard:
https://monit-grafana.cern.ch/d/ifXAfjLVk/production-jobs-exit-code-monitoring?orgId=11

we realized we had some jobs that were "Removed" but the exit code was successful. When investigating this, we noticed the following line:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L1157

It basically means that when a job is removed, and hence neither CRAB, WMCore nor HTCondor gets to propagate a classad reporting an exit code, it defaults to "0".

This shouldn't be the case, but we are wondering what the value should be instead. After discussing this, we came up with some options and we wanted to hear feedback from the monitoring team.

Options are:

  • "Unknown": While this is possible, we don't know how LUCENE will treat it if we try conditionals like "ExitCode > 0"
  • -1 : Only issue with this is, exit codes are usually positive
  • A positive integer number > 99999 ? E.g.: 100001
    Because it's past the range used for CMS exit codes and it means it wasn't produced by WMException, although it's not easy for someone to say so without looking at the documentation/WMException comments. https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes
    On the other hand, it does fit the "ExitCode>0 == failed job" concept, as opposed to the "-1".
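For illustration, a minimal sketch of the third option, assuming a sentinel constant (the value 100001 is just the example above, not an agreed choice, and the attribute names are placeholders):

# Sketch: replace the default of 0 with a sentinel outside the CMS exit-code range,
# so removed jobs that never reported a classad are not counted as successful.
MISSING_EXIT_CODE = 100001  # > 99999, i.e. past the documented CMS exit codes

def common_exit_code(ad):
    for attr in ("CRAB_ExitCode", "ExitCode"):  # placeholder attribute names
        if attr in ad:
            return int(ad[attr])
    return MISSING_EXIT_CODE  # nothing was reported, e.g. the job was removed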

Do you have any comments on what the best default exit code should be (one that does not have side effects on the monitoring)?

Related issues:
dmwm/WMCore#11614

Add CRAB_JobCount to Running fields

In order to use the CRAB_JobCount to infer the number of unsubmitted jobs, we need this field to be present in the index for all the jobs.

Create python version of affiliation script

Christian,
after reviewing the cronAffiliation.sh script in the context of the k8s migration, I need two things:

  • the ability to produce the affiliation file elsewhere (e.g. outside of the home area)
  • the ability to read the affiliation file in spider_cms.py

This implies that you need to change the AffiliationManager class to accept an input file, which will either be created by the class or read by it.

I also suggest getting rid of the cronAffiliation.sh script and replacing it with a pure Python version with proper arguments, similar to spider_cms.py.

All of the environment should be set up externally, such that the scripts can be run from the shell.

This will make it easy to port the functionality of the spider/affiliation Python scripts to k8s.

In the end, what we need is to be able to call two scripts:

# to produce affiliation file
python affiliation.py --output=affiliation.txt
# to read affiliation
python affiliation.py --input=affiliation.txt
# to read affiliation in spider
python spider_cms.py --affiliation=affiliation.txt ...

Could you please adjust the code and make the necessary changes?
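A minimal sketch of what such a pure-Python affiliation.py could look like (argument names follow the commands above; the calls into AffiliationManager are left as comments because the final interface is still to be defined):

# Sketch of affiliation.py: produce or read the affiliation file from the shell,
# mirroring the command lines shown above.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Produce or read the affiliation file")
    parser.add_argument("--output", help="write the affiliation file to this path")
    parser.add_argument("--input", help="read an existing affiliation file from this path")
    args = parser.parse_args()

    if args.output:
        # here: have AffiliationManager create the file at args.output
        print("would create affiliation file at %s" % args.output)
    elif args.input:
        # here: have AffiliationManager load the file from args.input
        print("would load affiliation file from %s" % args.input)
    else:
        parser.error("either --output or --input is required")

if __name__ == "__main__":
    main()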

Consolidate RemoteHost and LastRemoteHost

As pointed out by @belforte, it would be good to have a single field to use for both Running/Idle and Completed jobs.

@bbockelm I'm assuming the logic is that RemoteHost is the current host for a running job, and LastRemoteHost is simply the last host that the job ran on?

If so, could we simply set RemoteHost to LastRemoteHost for completed jobs?
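If so, a one-line change in the converter could be sketched like this (illustrative only):

# Sketch: for Completed jobs, expose LastRemoteHost under the single RemoteHost field.
def consolidate_remote_host(result, ad):
    if result.get("Status") == "Completed" and "LastRemoteHost" in ad:
        result["RemoteHost"] = ad["LastRemoteHost"]
    return result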

Use CMS_CampaignName from WM when available

At present, a job campaign is defined in the following line:

https://github.com/dmwm/cms-htcondor-es/blob/vm-legacy/src/htcondor_es/convert_to_json.py#L699

We have recently implemented a mechanism to report the campaign name in the WM system:
dmwm/WMCore#10914

The class ad name is: CMS_CampaignName

When this classad is defined, this information should be used; when not, the current logic in monit can be used instead.
Central services will be updated next week and this feature will be present.
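A minimal sketch of the requested precedence (the fallback argument stands in for the existing campaign-guessing logic in monit):

# Sketch: prefer the WMCore-provided classad, fall back to the current heuristic.
def campaign(ad, fallback=lambda ad: "Unknown"):
    return ad.get("CMS_CampaignName") or fallback(ad)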

Recovery of history data after downtimes

I don't fully trust how the code recovers history data after it has not run for a few hours (e.g. when the VM is down). For example, the VM feeding es-cms was down for a few hours after a reboot on Sunday afternoon (October 7th), and the script was restarted only on Monday afternoon. It recovered some, but not all, of the data:
https://es-cms.cern.ch/kibana/goto/14b8189cfdd5119db8dc25405fa4a9f7

Looking at the code, I suspect this:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/history.py#L54

where we specify a limit of 10'000 jobs per query (per schedd). Depending on which 10'000 jobs this retrieves, the last_completion time will be set such that older jobs are never recovered.

@bbockelm can you clarify which jobs are returned when a limit is passed to schedd.history? Should we increase that number?
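For reference, a simplified sketch of the kind of history query involved (the real code is at the link above; the constraint shown here is an approximation):

# Sketch: fetch completions newer than the last checkpoint, limited to `match` ads.
# If more than `match` jobs completed since last_completion on a given schedd, the
# remainder can be lost once last_completion is advanced, which is the gap described above.
import htcondor

def fetch_history(schedd_ad, last_completion, match=10000):
    schedd = htcondor.Schedd(schedd_ad)
    constraint = "EnteredCurrentStatus >= %d" % int(last_completion)
    return list(schedd.history(constraint, [], match=match))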

Use CMS_extendedTaskType to support physics job task types

At present, physics task types are supported via:

https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L682-L684

WMCore has recently added support for this. The classad is called CMS_extendedJobType; it shows the physics types when this is a production/processing job, or just the WMCore job type (LogCollect, Merge, etc.) when it is not.

We would need to replace the current CMS_TaskType in the code lines above with this new classad CMS_extendedJobType, and still fall back to the guessTaskType function until this feature is fully propagated (see the sketch after the list below).
By the way, do I understand correctly that CMS_TaskType was a placeholder for a future WMCore-native implementation of this? I don't see this classad in the jobs in condor.

Documentation can be found below:
https://github.com/dmwm/WMCore/wiki/Job-task-type-characterization-based-on-cmsDriver-command-line-arguments

But it basically supports the following physics types (and combinations of them):

Physics types:

  • GEN
  • SIM
  • DIGI_nopileup
  • DIGI_premix
  • DIGI_classicalmix
  • RECO
  • MINIAOD
  • NANOAOD
  • UNKNOWN
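As mentioned above, a minimal sketch of the proposed precedence (the guess argument stands in for the existing guessTaskType fallback):

# Sketch: prefer the WMCore-provided classad, fall back to the current heuristic
# until CMS_extendedJobType is present everywhere.
def task_type(ad, guess=lambda ad: "UNKNOWN"):
    return ad.get("CMS_extendedJobType") or guess(ad)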

Elasticsearch 5 Migration

I'll collect here some issues that appear.

  • Need to change the type of (some or all?) string fields from text to keyword in the mappings for them to be used in aggregations. See here

Move chirped CMSSW IOSite fields into a single field

Rather than the existing format, which creates at least N(sitename) field names:

{
   "ChirpCMSSW(.*?)IOSite_Site1_ReadBytes" : 123456,
   "ChirpCMSSW(.*?)IOSite_Site1_ReadTimeMS" : 123456,
   "ChirpCMSSW(.*?)IOSite_Site2_ReadBytes" : 234567,
   "ChirpCMSSW(.*?)IOSite_Site2_ReadTimeMS" : 234567
}

we should store these in a single nested field:

{"ChirpCMSSW_IOSite" : [
   {
      "SiteName" : "Site1",
      "ReadBytes" : 123456,
      "ReadTimeMS" : 123456
   },
   {
      "SiteName" : "Site2",
      "ReadBytes" : 234567,
      "ReadTimeMS" : 234567
   }
]}
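A sketch of the conversion from the flat fields to the nested one (the regular expression is illustrative and would need to be checked against the real chirped attribute names):

# Sketch: collapse ChirpCMSSW*IOSite_<Site>_<Metric> attributes into a single
# nested ChirpCMSSW_IOSite list, as proposed above.
import re

IOSITE_RE = re.compile(r"^ChirpCMSSW.*?IOSite_(?P<site>.+)_(?P<metric>ReadBytes|ReadTimeMS)$")

def nest_iosite_fields(ad):
    sites = {}
    for name, value in ad.items():
        match = IOSITE_RE.match(name)
        if match:
            site = match.group("site")
            sites.setdefault(site, {"SiteName": site})[match.group("metric")] = value
    return [sites[site] for site in sorted(sites)]

# result["ChirpCMSSW_IOSite"] = nest_iosite_fields(ad)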

Aggregation by site name can still work when using a kibana plugin: https://ppadovani.github.io/knql_plugin/overview/

I will add this field, leaving the old fields in place for now. Then we can set up the plugin, test it, and decide how to proceed. @bbockelm what do you think? Do you know who's currently using this information?

CMSSW version not reported

It looks like in ES analysis jobs have their CMSSW version in the CRAB_JobSW attribute, but production jobs don't have anything like that. Would it be possible to add an attribute called JobSW for all jobs to contain the CMSSW version used? The field should be defined for any job running CMSSW.

Move vm-legacy to master

Because of the decision not to continue the K8s migration development, which lives in the master branch, the virtual-machine-based code in vm-legacy will be force-pushed to the master branch. The reason this is not done via a pull request (master <- vm-legacy) is to avoid mixing K8s development code into the production code (vm-legacy).

It was not efficient to use the master branch for K8s development, but it was necessary for quick development in this quiet repo at that time.

Fyi @leggerf @brij01

Add task creation date to all tasks

In order to filter the tasks in a more natural way for the user, we need the task creation date (it needs to be stored as a timestamp in order to be used in Grafana as a time field).
It can be generated from the CRAB_Workflow name, but maybe there is a more maintainable way to do this.
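A sketch of deriving it from the CRAB_Workflow name, assuming the leading "YYMMDD_HHMMSS:" prefix seen in the examples above and interpreting it as UTC (both assumptions):

# Sketch: parse the leading timestamp of a CRAB_Workflow name such as
# "181203_151212:areinsvo_crab_..." into an epoch value usable as a Grafana time field.
import calendar
import time

def task_creation_time(crab_workflow):
    stamp = crab_workflow.split(":", 1)[0]        # e.g. "181203_151212"
    parsed = time.strptime(stamp, "%y%m%d_%H%M%S")
    return calendar.timegm(parsed)                # seconds since the epoch, assuming UTC

# task_creation_time("181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1")
# -> 1543849932 (2018-12-03 15:12:12 UTC)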
