gwmsmon's Introduction

This is a simple monitoring page for CMS production.

It consists of two parts:

  • A cron job which periodically records current and historical workflow information from HTCondor.
  • A web application that allows a user to explore recorded data.

gwmsmon's People

Contributors

amaltaro, bbockelm, drkovalskyi, dtnrm, h4d4, justinasr, juztas

Forkers

amaltaro juztas h4d4

gwmsmon's Issues

CMSGWMS_Type will distinguish tier0 from production schedds

Currently CMSGWMS_Type represents 3 categories:

  1. prodschedd
  2. crabschedd
  3. others
    3.1 mit schedd
    3.2 cmsconnect schedd

We will further divide prodschedd into prodschedd and tier0schedd, so all tier0 schedds will have CMSGWMS_Type=tier0schedd instead of prodschedd. Nothing else will change.
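For consumers of this attribute, the practical effect is one extra value to handle. A minimal Python sketch of counting schedds per type follows. It assumes schedd ads are available as plain dictionaries; the function name `count_by_type` and the `others` bucket label are hypothetical and are not taken from gwmsmon.

```python
from collections import Counter

# Values named in this issue. The exact type strings used for the remaining
# schedds (MIT, CMS Connect) are not given here, so they fall into "others".
KNOWN_TYPES = ("prodschedd", "tier0schedd", "crabschedd")


def count_by_type(schedd_ads):
    """Count schedds per CMSGWMS_Type, bucketing unknown values as 'others'."""
    counts = Counter()
    for ad in schedd_ads:
        schedd_type = ad.get("CMSGWMS_Type", "others")
        if schedd_type not in KNOWN_TYPES:
            schedd_type = "others"
        counts[schedd_type] += 1
    return counts
```

Code that previously matched only prodschedd would need to add tier0schedd to keep covering the same set of schedds.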

Minimize queries to the schedds

The totalview runtime is becoming long, especially when schedds are overloaded (e.g. vocms0255/256...). The "Get CPU Count" step takes the most time. In any case, prodview/analysisview/institutional view already query all schedds and could produce a file that totalview re-uses to speed things up.
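One way to share results between views is a small JSON cache file. The sketch below is only an illustration under stated assumptions: the file layout, the function names `write_cache` and `read_cache`, and the 10-minute freshness limit are all hypothetical, and it targets Python 3 (`os.replace`).

```python
import json
import os
import tempfile
import time


def write_cache(path, per_schedd_data):
    """Write per-schedd results atomically so readers never see a partial file."""
    payload = {"timestamp": time.time(), "schedds": per_schedd_data}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)


def read_cache(path, max_age_seconds=600):
    """Return cached per-schedd data, or None if missing, unreadable or stale."""
    try:
        with open(path) as handle:
            payload = json.load(handle)
    except (OSError, ValueError):
        return None
    if time.time() - payload.get("timestamp", 0) > max_age_seconds:
        return None
    return payload.get("schedds")
```

With this shape, totalview would call `read_cache` first and query the schedds itself only when it returns None.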

Scan for factory entries in downtime

Here: https://github.com/dmwm/gwmsmon/blob/master/src/totalview-update#L837

It would be nice to also scan for downtimes inside the factory XMLs, for example: http://vocms0207.cern.ch/factory/monitor/schedd_status.xml

<entry name="CMSHTPC_T2_US_Caltech_cit">
  <downtime status="True"/>
  <total>
    <ClientMonitor CoresIdle="2" CoresRunning="1039" CoresTotal="1056" GlideIdle="2" GlideRunning="282" GlideTotal="310" InfoAge="183" JobsIdle="267424" JobsRunHere="282" JobsRunning="49740"/>
    <Requested Idle="0" IdleCores="0" MaxCores="0" MaxGlideins="0"/>
    <Status Held="0" Idle="0" IdleOther="0" Pending="0" Running="22" RunningCores="704" StageIn="0" StageOut="0" Wait="0"/>
  </total>
  <frontends>
    <frontend name="CMSG-v1_0_cmspilot">
      <ClientMonitor CoresIdle="2" CoresRunning="1039" CoresTotal="1056" GlideIdle="2" GlideRunning="282" GlideTotal="310" InfoAge="733" JobsIdle="267424" JobsRunHere="282" JobsRunning="49740"/>
      <Downtime status="True"/>
      <Requested Idle="0" IdleCores="0" MaxCores="0" MaxGlideins="0">
        <Parameters></Parameters>
      </Requested>
      <Status Held="0" Idle="0" IdleOther="0" Pending="0" Running="22" RunningCores="704" StageIn="0" StageOut="0" Wait="0"/>
      <StatusEntries/>
    </frontend>
  </frontends>
</entry>

A downtime status of "True" means the factory will not send new pilots.
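A scan like this could be sketched with Python's standard `xml.etree.ElementTree`. The function name `entries_in_downtime` is hypothetical, fetching the XML from the factory URL is left out, and the root element of the real file is not shown in the sample above, so the code does not depend on it.

```python
import xml.etree.ElementTree as ET


def entries_in_downtime(xml_text):
    """Return names of <entry> elements whose direct downtime child has status "True"."""
    root = ET.fromstring(xml_text)
    names = []
    # iter() also yields the root itself if the root is an <entry>.
    for entry in root.iter("entry"):
        for child in entry:
            # The sample uses <downtime> at entry level and <Downtime> at
            # frontend level, so compare the tag name case-insensitively.
            if child.tag.lower() == "downtime" and child.get("status") == "True":
                names.append(entry.get("name"))
                break
    return names
```

Only direct children of each entry are checked, so a per-frontend `<Downtime>` element alone does not mark the whole entry as being in downtime.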

JSON read issues from gwmsmon_site_summary

Unified uses the JSON files provided by gwmsmon. Could you please look at an issue we hit from time to time while reading this JSON:
https://cms-gwmsmon.cern.ch/totalview//json/site_summary

The command we use is:
https://github.com/CMSCompOps/WmAgentScripts/blob/master/utils.py#L1914
It does not give much detail in the error message beyond "No JSON object could be decoded".

I ran the same command in a terminal and, after a few retries, got an error around 11am this morning showing that the JSON is unreadable at a certain column.
Error message:
ValueError: Unterminated string starting at: line 1 column 434737 (char 434736)

The last occurrence I see in the Unified logs is at Tue Feb 4 02:47:50 2020.
@z4027163 FYI
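
An "Unterminated string" error is what a reader would see if it received a truncated document, though the issue does not confirm the cause. On the reading side, a retry wrapper can work around intermittent failures. This is a minimal sketch, not the code in WmAgentScripts: `load_json_with_retries` and its `fetch` callable (anything returning the response body as text) are hypothetical.

```python
import json
import time


def load_json_with_retries(fetch, attempts=3, delay_seconds=1.0):
    """Call fetch() and parse its result as JSON, retrying on decode errors."""
    last_error = None
    for attempt in range(attempts):
        text = fetch()
        try:
            return json.loads(text)
        except ValueError as error:  # json.JSONDecodeError subclasses ValueError
            last_error = error
            if attempt + 1 < attempts:
                time.sleep(delay_seconds)
    raise ValueError("no valid JSON after %d attempts: %s" % (attempts, last_error))
```

The final exception carries the decoder's message, so the line and column of the failure show up in the caller's logs.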

factory upgrade

https://ggus.eu/index.php?mode=ticket_info&ticket_id=126013
As far as I remember, only pilots from [FRONTENDS] can run CMS jobs from the Global pool. Please let me know if anything is wrong and other frontends should also be monitored.
Also, here are the improvements I can add to cms-gwmsmon, for which I created GitHub issues:
a) count the number of CPUs for the pilots which are running;
b) compare with the pilots which are registered to the pool and provide a graph. A constant value should give the difference, i.e. the number not registered to the pool;
c) there is already an API [5] (no GUI for per-pilot stats, but one can be added) for site admins, from which they can get all pilots registered to the pool along with information about each pilot, such as its name, what it is running, and how CPU and memory are used. An addition would be grouping by factory and entry name, so issues can be debugged more deeply;
d) GUI for each pilot and mapping to task/user/specific job.
keep cached values from SSB

From @amaltaro here: #8
If we want to get back to the pledges metric, then we need to keep a cache of the pledges for periods of SSB instabilities (I'm short on time to follow that up).
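
A fallback cache along these lines might look like the sketch below. It is an assumption-laden illustration: `get_pledges`, the `fetch_from_ssb` callable and the cache file are hypothetical, and no real SSB API is used.

```python
import json


def get_pledges(fetch_from_ssb, cache_path):
    """Return pledges from SSB, falling back to the last cached copy on failure."""
    try:
        pledges = fetch_from_ssb()
    except Exception:
        # Any SSB failure is treated the same way: fall back to the cache.
        pledges = None
    if pledges:
        # Fresh, non-empty data: refresh the cache and use it.
        with open(cache_path, "w") as handle:
            json.dump(pledges, handle)
        return pledges
    try:
        with open(cache_path) as handle:
            return json.load(handle)
    except (OSError, ValueError):
        return {}
```

An empty response is treated like a failure here, so a period of SSB instability does not overwrite the last good values.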
