
cms-htcondor-es's People

Contributors

bbockelm, belforte, cronosnull, khurtado, lecriste, leggerf, mrceyhun, nikodemas, stiegerb, todor-ivanov, vkuznet


cms-htcondor-es's Issues

Update JobMonitoring schema with unknown keys

(Starting to hand over, it would be good if someone else would start doing these things.)

Here's a short guideline:

  • Check es-cms-logmon for a list of recent messages about validation failures because of unknown keys (message_type:validator_unknown_key).
  • Edit the JobMonitoring.json file in /home/cmsjobmon/cms-htcondor-es/ (in a safe place) and add key/example value pairs.
  • Create a PR on dmwm/CMSMonitoring to update the schemas/JobMonitoring.json and jsonschemas/JobMonitoring.json files. Create the latter using genson schemas/JobMonitoring.json > jsonschemas/JobMonitoring.schema.
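For reference, the same schema generation can be done with genson's Python API instead of the CLI; a minimal sketch, assuming genson is installed and using the paths from the guideline above:

# Minimal sketch: regenerate jsonschemas/JobMonitoring.schema from the example
# document in schemas/JobMonitoring.json using the genson package.
import json
from genson import SchemaBuilder

with open("schemas/JobMonitoring.json") as src:
    example = json.load(src)

builder = SchemaBuilder()
builder.add_object(example)  # infer a JSON schema from the example key/value pairs

with open("jsonschemas/JobMonitoring.schema", "w") as dst:
    dst.write(builder.to_json(indent=2))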

Jobs missing in MONIT

It seems we are still far from having full accounting of jobs in MONIT.

While investigating a problem with glideinWMS
https://ggus.eu/index.php?mode=ticket_info&ticket_id=138535
I tried to use ES+Kibana etc., but ran into a very puzzling situation: for this CRAB task [1]
I cannot find in ES the jobs which completed at T2_TR_METU.
An example of such a job (which completed successfully at that site a few hours ago) is [3].
If the job ran, HTCondor certainly knows about it, so why is it not shown?
Am I not looking in the right way, or is something wrong in the reporting?

By the way, those jobs are in ES while running (although with incomplete
information, e.g. missing CRAB_Id, i.e. their ID inside the CRAB task), but somehow not as Completed [4].

[1]
Here are pointers in the new and old dashboards.

https://monit-grafana.cern.ch/d/NvcuKTSiz/condor-jobs-test?orgId=6&var-task=181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1&var-user=areinsvo
http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=areinsvo&refresh=0&table=Mains&p=1&records=25&activemenu=2&pattern=181203_151212%3Aareinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1&task=&from=&till=&timerange=lastWeek

[3]
https://cmsweb.cern.ch/scheddmon/0122/cms1354/181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1/job_out.30.0.txt

[4]
https://monit-kibana.cern.ch/kibana/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(data.RequestCpus,data.Site,data.MachineAttrCpus0,data.Status,data.CRAB_Id),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'monit_prod_condor_raw_metric_v002-*',key:data.Type,negate:!f,type:phrase,value:analysis),query:(match:(data.Type:(query:analysis,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'monit_prod_condor_raw_metric_v002-*',key:data.RequestCpus,negate:!f,type:phrase,value:'4'),query:(match:(data.RequestCpus:(query:4,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'monit_prod_condor_raw_metric_v002-*',key:data.Status,negate:!f,type:phrase,value:Completed),query:(match:(data.Status:(query:Completed,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'monit_prod_condor_raw_metric_v002-*',key:data.JobStatus,negate:!f,type:phrase,value:'4'),query:(match:(data.JobStatus:(query:4,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'monit_prod_condor_raw_metric_v002-*',key:data.Site,negate:!f,type:phrase,value:T2_TR_METU),query:(match:(data.Site:(query:T2_TR_METU,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'monit_prod_condor_raw_metric_v002-*',key:data.CRAB_Workflow,negate:!f,type:phrase,value:'181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1'),query:(match:(data.CRAB_Workflow:(query:'181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1',type:phrase))))),index:'monit_prod_condor_raw_metric_v002-*',interval:auto,query:(match_all:()),sort:!(data.MachineAttrCpus0,desc))

Where are the shell scripts?

I found that the repository does not contain the shell scripts used in the cronjobs on vocms0240. Why are these scripts not present in this repository? Where are they located?

I suggest updating the repository with all the shell scripts we need to use on the production node.

Generate log tarball URL in ElasticSearch ads

For production jobs, we currently store the tarballs in CMS EOS (and they are readable by CMS members).

The URL can be constructed with the following recipe:

"https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/" + \
   <task name> + "/" + \
   <step name> + "/" + \
   <schedd name> + <WMAgent job ID> + "-" + <retry #> + "-log.tar.gz"

For WMAgent jobs, can we add a new attribute (LogURL?) containing a URL constructed as above?
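A purely illustrative sketch of how such a LogURL could be assembled in the converter; the ClassAd attribute names used below are assumptions for this example and would need to be checked against the real WMAgent ads:

# Sketch only: build the EOS log tarball URL following the recipe above.
# WMAgent_RequestName, WMAgent_SubTaskName, ScheddName, WMAgent_JobID and
# Retries are placeholder attribute names, not confirmed ones.
BASE_URL = "https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/"

def make_log_url(ad):
    try:
        return "%s%s/%s/%s%d-%d-log.tar.gz" % (
            BASE_URL,
            ad["WMAgent_RequestName"],   # <task name>
            ad["WMAgent_SubTaskName"],   # <step name>
            ad["ScheddName"],            # <schedd name>
            int(ad["WMAgent_JobID"]),    # <WMAgent job ID>
            int(ad["Retries"]),          # <retry #>
        )
    except KeyError:
        return None  # not a WMAgent job, or attributes not (yet) available

# result["LogURL"] = make_log_url(ad)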

Create a way to identify the last update for a job.

For the tasks' index, we only keep the last status for each job. However, the order in which the messages are consumed is not enough to ensure that the last message seen is the actual last update. We started using the RecordTime, and it worked when the status changed, but for a completed job all status updates carry the same RecordTime, so updates to completed jobs were not overwritten (as they have the same version).

Currently, DataCollection and RecordTime use the CompletionDate for Completed jobs. Could we use the _launch_time for DataCollection (or a new field) and use it as a version hint?
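One possible sketch of that idea, assuming _launch_time is the timestamp of the current collection cycle (the field name and its use as a version hint are illustrative, not an agreed design):

# Sketch: keep RecordTime based on CompletionDate, but carry the collection-cycle
# time in DataCollection so that later cycles always produce a newer version,
# even for jobs that are already Completed.
import time

_launch_time = int(time.time())  # start of this collection cycle

def add_version_hint(result):
    result["DataCollection"] = _launch_time
    return result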

Add CRAB_Retry and QDate for Idle and running tasks

In the new ES schema, the retry number is part of the document ID; it is needed to avoid duplicates.

The QDate can be useful in the task list for running and pending tasks, letting the user order (and, once Grafana enables us to do so, filter) the jobs.

Elasticsearch 6 migration

Discussed with CERN/ES about this. They will start setting up a new endpoint "es-cms6" and mirror new data there. We can then observe possible issues and start fixing them.

The main thing is the removal of mapping types, but we only use a single type ("job"), so that should not be an issue for us.

analysis jobs marked as production

I've noticed there are some analysis jobs that are pushed as Type:production and traced this issue down to this part of the code:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L574

There are several analysis jobs, mostly JobUniverse 7 and 12, that neither have CRAB_Id nor define AccountingGroup; hence they get marked as 'production'. Look at the condor query done on a CRAB schedd [1].

Wouldn't it be better to use the CMSGWMS_Type attribute in the schedd ClassAd to get the type of schedd (crabschedd, prodschedd, tier0schedd) and set the job Type accordingly?

[1]
[root@vocms0198 ~]# condor_q -const 'CRAB_Id=?=UNDEFINED' -af:h JobUniverse AccountingGroup | sort | uniq -c
240 12 undefined
10 5 production.cmsdataops
2 5 undefined
2 7 "highprio.spiga"
5242 7 undefined
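For illustration, a sketch of the suggested CMSGWMS_Type-based classification (not the current implementation; how the schedd ad reaches the converter is an assumption here):

# Sketch: derive the job Type from the schedd's CMSGWMS_Type attribute instead of
# per-job attributes such as CRAB_Id or AccountingGroup.
def job_type_from_schedd(schedd_ad, default="unknown"):
    mapping = {
        "crabschedd": "analysis",
        "prodschedd": "production",
        "tier0schedd": "tier0",
    }
    gwms_type = str(schedd_ad.get("CMSGWMS_Type", "")).lower()
    return mapping.get(gwms_type, default)

# result["Type"] = job_type_from_schedd(schedd_ad)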

Re-evaluate the need for constraining data for running/pending jobs

We're currently maintaining a limited list of fields to be retained when processing the condor queues:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L423

And we're dropping any field not in that list when processing the queues:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L780-L781

(The behavior can be switched off by using the --keep_full_queue_data option for spider_cms.py)

I had introduced this as a means of speeding up the uploading of documents to MONIT (it cuts about a factor of 5 in volume, iirc) as we were having trouble staying within the 12 min cycle. However, we were still uploading from a VM hosted at UNL at that time, and the faster upload time from the CERN hosted VM to MONIT might have made this unnecessary.

To check, we could run a few cycles without culling the fields and see how much the performance suffers.

Another aspect to keep in mind, however, is that this would significantly increase the data volume needed on the MONIT side, since it affects a large majority of the documents we send.
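For context, a rough sketch of what the culling amounts to (the real whitelist is the one linked above; the field names here are just examples):

# Sketch: keep only the whitelisted fields for queue (running/pending) documents,
# unless --keep_full_queue_data was requested.
RUNNING_FIELDS = {"Status", "Site", "RequestCpus", "CRAB_Id", "RecordTime"}  # example subset

def cull_queue_doc(doc, keep_full_queue_data=False):
    if keep_full_queue_data:
        return doc
    return {key: value for key, value in doc.items() if key in RUNNING_FIELDS}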

Setting new filter for all condor job monitoring queries: JobUniverse==5

As suggested by @belforte in MM (Distributed Analysis channel), we are planning to add a JobUniverse==5 constraint to the scripts that query the HTCondor schedds, both queue and history.

As the CMS Monitoring & Analytics team, we are just the operators of this Condor job monitoring; that's why we would like to get confirmation from all stakeholders.

You can see Completed jobs with JobUniverse different from 5 in [1], and other jobs with JobUniverse different from 5 in [2].

[1] https://es-cms.cern.ch/kibana/goto/9916fa1438a0d470cf620150e2ec8fb7
[2] https://monit-opensearch.cern.ch/dashboards/goto/9f7d3eaf122501164b28e0cb84b616b5?security_tenant=global
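For concreteness, a sketch of what adding the constraint could look like with the HTCondor Python bindings (illustrative only; the production query code and its existing constraints live in the spider scripts):

# Sketch: restrict both queue and history queries to vanilla-universe jobs.
import htcondor

schedd = htcondor.Schedd()  # in production, one Schedd object per collector-advertised schedd ad

constraint = "JobUniverse == 5"
queue_ads = schedd.query(constraint=constraint)
history_ads = list(schedd.history(constraint, [], match=10000))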

fyi @leggerf @brij01

  • Ceyhun

Add joblog and postJobLog urls

In order to avoid using JavaScript links to create the URLs, we'll send the URLs to ES.
Also, in order to use the value directly, we'll add the committed wall clock time in seconds (so that Grafana's duration formats can be used).

RecordTime for Held jobs

We currently set the RecordTime as the CompletionDate from the ad, and then overwrite it with the current time in case the CompletionDate is not set:

https://github.com/bbockelm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L536-L540

result['RecordTime'] = ad.get("CompletionDate", 0)

...

if not result['RecordTime']:
    result['RecordTime'] = _launch_time

The problem is when jobs have a CompletionDate but have not actually terminated, e.g. because they are "Held". (I'm not actually sure there are any other examples.) Those docs will be submitted with the same RecordTime over and over again, and pile up in the dashboard in one time bin.

Instead we could set RecordTime to be the launch time for all jobs with statuses other than "Completed".
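A minimal sketch of that proposal, assuming _launch_time is the collection-cycle timestamp already used in the converter:

# Sketch: only Completed jobs keep CompletionDate as RecordTime; everything else
# (Held, Running, Idle, Removed, ...) gets the collection-cycle time, so repeated
# documents spread over time bins instead of piling up in one.
def record_time(result, ad, launch_time):
    if result.get("Status") == "Completed" and ad.get("CompletionDate", 0):
        return ad["CompletionDate"]
    return launch_time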

Repeated records when a job is removed

It seems that when a job gets removed, the monitoring system keeps adding duplicate records of it every 12 minutes. In this example:
https://monit-kibana.cern.ch/kibana/app/kibana::/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2019-07-29T22:00:00.000Z',mode:absolute,to:'2019-08-07T21:59:59.999Z'))&_a=(columns:!(_source),index:'monit_prod_condor_raw_metric_v002-*',interval:auto,query:(query_string:(query:'data.CRAB_Retry:3+AND+data.CRAB_Id:11+AND+data.CRAB_Workflow:190702_095101*acarvalh_crab_2016v3_2019Jul02_ZZTo2L2Nu_13TeV_powheg_pythia8__RunIISummer16MiniAODv3-PUMoriond17_94X_mcotic_v3-v2')),sort:!(metadata.timestamp,desc))

you can see that all of the records point to the exact same job: same LastStartTime, same globalJobId, etc. One thing to notice is that the CpuEff does change; this could be the reason why the monitoring system thinks it is a different record. I think this is related to this bug in condor: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7083

CMSPrimaryProcessedDataset and CMSPrimaryPrimaryDataset not present in some ES indexes

Hello,

While looking into the following issue:
dmwm/WMCore#11613

It was found that some indexes, like the one used for monit_prod_condor, seem to be missing some entries from this block:

https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L751-L757

Basically, if we have something like (as shown in the ES link above):

data.DESIRED_CMSDataset : /EGamma/Run2022C-v1/RAW

Then, we should see:
data.CMSPrimaryPrimaryDataset, data.CMSPrimaryDataTier, data.CMSPrimaryProcessedDataset
but only data.CMSPrimaryDataTier is properly shown, with the other two set to Unknown. I haven't seen this in other indexes like cms-20*; is this a bug?

Also, we had a few questions regarding the names:

  • CMSPrimaryPrimaryDataset: why Primary twice in the name?

  • CMSPrimaryDataTier: would it be better to rename it to DESIRED_CMSDatatier? In that case, we would like to do some renaming to make things clearer, like:

    • CMSPrimaryPrimaryDataset -> DESIRED_CMSPrimaryDataset
    • CMSPrimaryProcessedDataset -> DESIRED_CMSProcessedDataset
    • CMSPrimaryDataTier -> DESIRED_CMSDatatier

what do you think?
pinging @amaltaro

Properly sort jobs in Grafana dashboard for task monitoring

Given the discussion in dmwm/CRABServer#5860
we agreed with @cronosnull to add a numerical field corresponding to the CRAB_Id string and sort jobs accordingly.

At the moment, the possible values of CRAB_Id are of two types:

  • "1", "2", "3", ..., up to ~"10000"
  • "n-1", "n-2", "n-3", ..., where n can be 0, 1, 2 or 3

and the overall order should be 0-1, 0-2, 0-3, 0-..., 1, 2, 3, ..., 1-1, 1-2, 1-3, 1-..., 2-1, 2-2, 2-3, 2-..., 3-1, 3-2, 3-3, 3-...
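A possible sketch of such a numerical field implementing exactly this ordering (the field name CRAB_Id_Sort and the encoding are assumptions):

# Sketch: map CRAB_Id strings to a single sortable integer.
# Ordering produced: "0-1", "0-2", ... < "1", "2", ... < "1-1", ... < "2-1", ... < "3-1", ...
# The stride of 100000 is safe because plain ids only go up to ~10000.
def crab_id_sort_key(crab_id):
    value = str(crab_id)
    if "-" in value:
        prefix, sub = value.split("-", 1)
        prefix, sub = int(prefix), int(sub)
        group = prefix if prefix == 0 else prefix + 1  # 0-* first, n-* (n >= 1) after plain ids
    else:
        group, sub = 1, int(value)                     # plain ids sit between 0-* and 1-*
    return group * 100000 + sub

# result["CRAB_Id_Sort"] = crab_id_sort_key(result["CRAB_Id"])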

Replace the collector cmssrv221 with cmssrv623

This change has already been pushed to production using a temporary Python 2 branch (keeping the two collectors), but once the Python 3 version is ready we need to apply the change there as well.

We use this strategy because the changes can be slightly different and because we want to keep the Python 3 branch (which is currently under review) able to merge automatically with master.

commonExitCode defaults to 0 (meaning "success") when no classad is available to provide the exitCode of the job

When looking into the following grafana dashboard:
https://monit-grafana.cern.ch/d/ifXAfjLVk/production-jobs-exit-code-monitoring?orgId=11

we realized we had some jobs that were "Removed" but the exit code was successful. When investigating this, we noticed the following line:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L1157

It basically means that when a job is removed, and hence neither CRAB, WMCore nor HTCondor gets to propagate a classad reporting an exit code, it defaults to "0".

This shouldn't be the case, but we are wondering what the value should be instead. After discussing this, we came up with some options and we wanted to hear feedback from the monitoring team.

Options are:

  • "Unknown": While this is possible, we don't know how LUCENE will treat it if we try conditionals like "ExitCode > 0"
  • -1 : Only issue with this is, exit codes are usually positive
  • A positive integer number > 99999 ? E.g.: 100001
    Because it's past the range used for CMS exit codes and it means it wasn't produced by WMException, although it's not easy for someone to say so without looking at the documentation/WMException comments. https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes
    On the other hand, it does fit the "ExitCode>0 == failed job" concept, as opposed to the "-1".
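For illustration, a minimal sketch of the third option, assuming a sentinel constant (the value 100001 is just the example above, not an agreed choice, and the attribute names are placeholders):

# Sketch: replace the default of 0 with a sentinel outside the CMS exit-code range,
# so removed jobs that never reported a classad are not counted as successful.
MISSING_EXIT_CODE = 100001  # > 99999, i.e. past the documented CMS exit codes

def common_exit_code(ad):
    for attr in ("CRAB_ExitCode", "ExitCode"):  # placeholder attribute names
        if attr in ad:
            return int(ad[attr])
    return MISSING_EXIT_CODE  # nothing was reported, e.g. the job was removed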

Do you have any comments on what the best default exit code should be (one that does not have side effects on the monitoring)?

Related issues:
dmwm/WMCore#11614

Add CRAB_JobCount to Running fields

In order to use the CRAB_JobCount to infer the number of unsubmitted jobs, we need this field to be present in the index for all the jobs.

Create python version of affiliation script

Christian,
after reviewing the cronAffiliation.sh script in the context of the k8s migration, I need two things:

  • the ability to produce the affiliation file elsewhere (e.g. outside of the home area)
  • the ability to read the affiliation file in spider_cms.py

This implies that you need to change the AffiliationManager class to accept an input file, which will either be created by the class or read by it.

I also suggest getting rid of the cronAffiliation.sh script and replacing it with a pure Python version with proper arguments, similar to spider_cms.py.

All of the environment should be set up externally, such that the scripts can be run from the shell.

This will make it easy to port the functionality of the spider/affiliation Python scripts to k8s.

In the end, what we need is to be able to call two scripts:

# to produce affiliation file
python affiliation.py --output=affiliation.txt
# to read affiliation
python affiliation.py --input=affiliation.txt
# to read affiliation in spider
python spider_cms.py --affiliation=affiliation.txt ...

Could you please adjust the code and make the necessary changes?
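A minimal sketch of what such a pure-Python affiliation.py could look like (argument names follow the commands above; the calls into AffiliationManager are left as comments because the final interface is still to be defined):

# Sketch of affiliation.py: produce or read the affiliation file from the shell,
# mirroring the command lines shown above.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Produce or read the affiliation file")
    parser.add_argument("--output", help="write the affiliation file to this path")
    parser.add_argument("--input", help="read an existing affiliation file from this path")
    args = parser.parse_args()

    if args.output:
        # here: have AffiliationManager create the file at args.output
        print("would create affiliation file at %s" % args.output)
    elif args.input:
        # here: have AffiliationManager load the file from args.input
        print("would load affiliation file from %s" % args.input)
    else:
        parser.error("either --output or --input is required")

if __name__ == "__main__":
    main()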

Consolidate RemoteHost and LastRemoteHost

As pointed out by @belforte, it would be good to have a single field to use for both Running/Idle and Completed jobs.

@bbockelm I'm assuming the logic is that RemoteHost is the current host for a running job, and LastRemoteHost is simply the last host that the job ran on?

If so, could we simply set RemoteHost to LastRemoteHost for completed jobs?
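If so, a one-line change in the converter could be sketched like this (illustrative only):

# Sketch: for Completed jobs, expose LastRemoteHost under the single RemoteHost field.
def consolidate_remote_host(result, ad):
    if result.get("Status") == "Completed" and "LastRemoteHost" in ad:
        result["RemoteHost"] = ad["LastRemoteHost"]
    return result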

Use CMS_CampaignName from WM when available

At present, a job campaign is defined in the following line:

https://github.com/dmwm/cms-htcondor-es/blob/vm-legacy/src/htcondor_es/convert_to_json.py#L699

We have recently implemented a mechanism to report the campaign name in the WM system:
dmwm/WMCore#10914

The class ad name is: CMS_CampaignName

When this classad is defined, this information should be used; when not, the current logic in monit can be used instead.
Central services will be updated next week and this feature will be present.
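A minimal sketch of the requested precedence (the fallback argument stands in for the existing campaign-guessing logic in monit):

# Sketch: prefer the WMCore-provided classad, fall back to the current heuristic.
def campaign(ad, fallback=lambda ad: "Unknown"):
    return ad.get("CMS_CampaignName") or fallback(ad)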

Recovery of history data after downtimes

I don't fully trust how the code recovers history data after it has not run for a few hours (e.g. when the VM is down). For example, the VM feeding es-cms was down for a few hours after a reboot on Sunday afternoon (October 7th), and the script was restarted only on Monday afternoon. It recovered some, but not all, of the data:
https://es-cms.cern.ch/kibana/goto/14b8189cfdd5119db8dc25405fa4a9f7

Looking at the code, I suspect this:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/history.py#L54

where we specify a limit of 10'000 jobs per query (per schedd). Depending on which 10'000 jobs this retrieves, the last_completion time will be set such that older jobs are never recovered.

@bbockelm can you clarify which jobs are returned when a limit is passed to schedd.history? Should we increase that number?
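For reference, a simplified sketch of the kind of history query involved (the real code is at the link above; the constraint shown here is an approximation):

# Sketch: fetch completions newer than the last checkpoint, limited to `match` ads.
# If more than `match` jobs completed since last_completion on a given schedd, the
# remainder can be lost once last_completion is advanced, which is the gap described above.
import htcondor

def fetch_history(schedd_ad, last_completion, match=10000):
    schedd = htcondor.Schedd(schedd_ad)
    constraint = "EnteredCurrentStatus >= %d" % int(last_completion)
    return list(schedd.history(constraint, [], match=match))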

Use CMS_extendedTaskType to support physics job task types

At present, physics task types are supported via:

https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L682-L684

WMCore has recently added support for this. The classad is called CMS_extendedJobType; it shows the physics types when this is a production/processing job, or just the WMCore job type (LogCollect, Merge, etc.) when it is not.

We would need to replace the current CMS_TaskType in the code lines above with this new classad CMS_extendedJobType, and still fall back to the guessTaskType function until this feature is fully propagated (see the sketch after the list below).
By the way, do I understand correctly that CMS_TaskType was a placeholder for a future WMCore-native implementation of this? I don't see this classad in the jobs in condor.

Documentation can be found below:
https://github.com/dmwm/WMCore/wiki/Job-task-type-characterization-based-on-cmsDriver-command-line-arguments

But it basically supports the following physics types (and combinations of them):

Physics types:

  • GEN
  • SIM
  • DIGI_nopileup
  • DIGI_premix
  • DIGI_classicalmix
  • RECO
  • MINIAOD
  • NANOAOD
  • UNKNOWN
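As mentioned above, a minimal sketch of the proposed precedence (the guess argument stands in for the existing guessTaskType fallback):

# Sketch: prefer the WMCore-provided classad, fall back to the current heuristic
# until CMS_extendedJobType is present everywhere.
def task_type(ad, guess=lambda ad: "UNKNOWN"):
    return ad.get("CMS_extendedJobType") or guess(ad)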

Elasticsearch 5 Migration

I'll collect here some issues that appear.

  • Need to change the type of (some or all?) string fields from text to keyword in the mappings for them to be used in aggregations. See here

Move chirped CMSSW IOSite fields into a single field

Rather than the existing format, which creates at least N(sitename) field names:

{
   "ChirpCMSSW(.*?)IOSite_Site1_ReadBytes" : 123456,
   "ChirpCMSSW(.*?)IOSite_Site1_ReadTimeMS" : 123456,
   "ChirpCMSSW(.*?)IOSite_Site2_ReadBytes" : 234567,
   "ChirpCMSSW(.*?)IOSite_Site2_ReadTimeMS" : 234567
}

we should store these in a single nested field:

{"ChirpCMSSW_IOSite" : [
   {
      "SiteName" : "Site1",
      "ReadBytes" : 123456,
      "ReadTimeMS" : 123456
   },
   {
      "SiteName" : "Site2",
      "ReadBytes" : 234567,
      "ReadTimeMS" : 234567
   }
]}
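A sketch of the conversion from the flat fields to the nested one (the regular expression is illustrative and would need to be checked against the real chirped attribute names):

# Sketch: collapse ChirpCMSSW*IOSite_<Site>_<Metric> attributes into a single
# nested ChirpCMSSW_IOSite list, as proposed above.
import re

IOSITE_RE = re.compile(r"^ChirpCMSSW.*?IOSite_(?P<site>.+)_(?P<metric>ReadBytes|ReadTimeMS)$")

def nest_iosite_fields(ad):
    sites = {}
    for name, value in ad.items():
        match = IOSITE_RE.match(name)
        if match:
            site = match.group("site")
            sites.setdefault(site, {"SiteName": site})[match.group("metric")] = value
    return [sites[site] for site in sorted(sites)]

# result["ChirpCMSSW_IOSite"] = nest_iosite_fields(ad)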

Aggregation by site name can still work when using a kibana plugin: https://ppadovani.github.io/knql_plugin/overview/

I will add this field, leaving the old fields in place for now. Then we can set up the plugin, test it, and decide how to proceed. @bbockelm what do you think? Do you know who's currently using this information?

CMSSW version not reported

It looks like in ES analysis jobs have their CMSSW version in the CRAB_JobSW attribute, but production jobs don't have anything like that. Would it be possible to add an attribute called JobSW for all jobs to contain the CMSSW version used? The field should be defined for any job running CMSSW.

Move vm-legacy to master

Because of the decision not to continue the K8s migration development, which lives in the master branch, the virtual-machine-based code in vm-legacy will be force-pushed to the master branch. The reason this is not done via a pull request (master <- vm-legacy) is to avoid mixing K8s development code into the production code (vm-legacy).

It was not efficient to use the master branch for K8s development, but it was necessary for quick development in this quiet repo at that time.

Fyi @leggerf @brij01

Add task creation date to all tasks

In order to filter the tasks in a more natural way for the user, we need the task creation date (it needs to be stored as a timestamp in order to be used in Grafana as a time field).
It can be generated from the CRAB_Workflow name, but maybe there is a more maintainable way to do this.
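A sketch of deriving it from the CRAB_Workflow name, assuming the leading "YYMMDD_HHMMSS:" prefix seen in the examples above and interpreting it as UTC (both assumptions):

# Sketch: parse the leading timestamp of a CRAB_Workflow name such as
# "181203_151212:areinsvo_crab_..." into an epoch value usable as a Grafana time field.
import calendar
import time

def task_creation_time(crab_workflow):
    stamp = crab_workflow.split(":", 1)[0]        # e.g. "181203_151212"
    parsed = time.strptime(stamp, "%y%m%d_%H%M%S")
    return calendar.timegm(parsed)                # seconds since the epoch, assuming UTC

# task_creation_time("181203_151212:areinsvo_crab_SingleMuon_Run2017F-31Mar2018-v1")
# -> 1543849932 (2018-12-03 15:12:12 UTC)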
