edf-hpc / hpcstats
HPC clusters data guzzler for usage statistics and metrics
Home Page: http://edf-hpc.github.io/hpcstats/
License: GNU General Public License v2.0
HPCStats should retry MySQL requests several times, reconnecting if necessary, to avoid failing like this during long runs:
Traceback (most recent call last):
File "/usr/bin/hpcstats", line 39, in <module>
launcher.run()
File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsLauncher.py", line 154, in run
self.app.run()
File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 93, in run
self.import_cluster_data(db, cluster_name)
File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 142, in import_cluster_data
self.jobs.load_update_window()
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 166, in load_update_window
batch_id = self.load_window(batch_id) + 1
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 181, in load_window
return self.get_jobs_after_batchid(batch_id, self.window_size)
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 222, in get_jobs_after_batchid
old_schema = self._is_old_schema()
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 198, in _is_old_schema
self.cur.execute(req)
File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
self.errorhandler(self, exc, value)
File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
or:
Traceback (most recent call last):
File "/usr/bin/hpcstats", line 39, in <module>
launcher.run()
File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsLauncher.py", line 152, in run
self.app.run()
File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 91, in run
self.import_cluster_data(db, cluster_name)
File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 140, in import_cluster_data
self.jobs.load_update_window()
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 166, in load_update_window
batch_id = self.load_window(batch_id) + 1
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 181, in load_window
return self.get_jobs_after_batchid(batch_id, self.window_size)
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 222, in get_jobs_after_batchid
old_schema = self._is_old_schema()
File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 198, in _is_old_schema
self.cur.execute(req)
File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
self.errorhandler(self, exc, value)
File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')
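Both errors (2006, "server has gone away", and 1213, deadlock) are transient and safe to retry. A retry wrapper along these lines could cover both cases; the function and its signature are a hypothetical sketch, not existing HPCStats code:

```python
import time

# MySQL error 2006 (CR_SERVER_GONE_ERROR) and 1213 (ER_LOCK_DEADLOCK)
# are transient: retrying on a fresh connection usually succeeds.
RETRYABLE_ERRNOS = (1213, 2006)

def execute_with_retry(conn_factory, sql, max_attempts=3, delay=1.0):
    """Run `sql` on a cursor, reconnecting and retrying on transient
    MySQL errors. `conn_factory()` must return a connection object
    exposing cursor(); errors must carry the errno as args[0]."""
    conn = conn_factory()
    for attempt in range(1, max_attempts + 1):
        try:
            cur = conn.cursor()
            cur.execute(sql)
            return cur
        except Exception as exc:
            errno = exc.args[0] if exc.args else None
            if errno not in RETRYABLE_ERRNOS or attempt == max_attempts:
                raise  # non-transient error, or out of attempts
            time.sleep(delay)
            conn = conn_factory()  # restart the connection
```

The importer could route all SlurmDBD queries through such a helper instead of calling `self.cur.execute(req)` directly.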
Example backtrace:
Traceback (most recent call last):
File "/usr/share/hpcstats/bin/launch-jobstats", line 92, in <module>
main()
File "/usr/share/hpcstats/bin/launch-jobstats", line 89, in main
ssh_launch(user, frontend, privkey, script)
File "/usr/share/hpcstats/bin/launch-jobstats", line 70, in ssh_launch
ssh.connect(frontend, username=user, key_filename=privkey)
File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 332, in connect
self._auth(username, password, pkey, key_filenames, allow_agent, look_for_keys)
File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 436, in _auth
key = pkey_class.from_private_key_file(key_filename, password)
File "/usr/lib/pymodules/python2.6/paramiko/pkey.py", line 198, in from_private_key_file
key = cls(filename=filename, password=password)
File "/usr/lib/pymodules/python2.6/paramiko/rsakey.py", line 51, in __init__
self._from_private_key_file(filename, password)
File "/usr/lib/pymodules/python2.6/paramiko/rsakey.py", line 163, in _from_private_key_file
data = self._read_private_key_file('RSA', filename, password)
File "/usr/lib/pymodules/python2.6/paramiko/pkey.py", line 279, in _read_private_key_file
f = open(filename, 'r')
IOError: [Errno 2] No such file or directory: '/path/to/file'
The new SlurmDBD schema uses multi-valued TRES fields that replace several other fields, such as cpu_count in the event tables. Many SQL requests across JobImporterSlurm and EventImporterSlurm must be adapted to use these new fields where needed.
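For reference, a TRES field is a comma-separated list of `id=count` pairs, where id 1 is "cpu" in the default Slurm TRES table. A minimal sketch of extracting a CPU count from such a field (the helper name is hypothetical):

```python
# In SlurmDBD >= 15.08, fields like tres_alloc encode all trackable
# resources as "id=count" pairs, e.g. "1=16,2=65536,4=2".
# TRES id 1 is "cpu" in the default tres_table.
TRES_CPU_ID = 1

def cpu_count_from_tres(tres):
    """Return the cpu count encoded in a TRES string, or None if
    the string carries no cpu entry."""
    for pair in tres.split(','):
        if not pair:
            continue
        tres_id, _, count = pair.partition('=')
        if int(tres_id) == TRES_CPU_ID:
            return int(count)
    return None
```

Queries that previously selected cpu_count would instead select the TRES column and decode it this way on the importer side (or extract the pair directly in SQL).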
Currently, the event_type field in the DB is 50 characters long. This can be too short for some event types. It should be widened to 200 characters to avoid issues when importing events.
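The fix is a one-off schema migration. A sketch using a DB-API cursor, assuming PostgreSQL syntax and an Event table (the table name here is an assumption, not taken from the actual HPCStats schema):

```python
# Hypothetical migration sketch; the table name "Event" is assumed.
# PostgreSQL ALTER COLUMN preserves existing rows while widening
# the varchar limit from 50 to 200 characters.
WIDEN_EVENT_TYPE = (
    "ALTER TABLE Event "
    "ALTER COLUMN event_type TYPE character varying(200)"
)

def widen_event_type(cursor):
    """Widen event_type so long Slurm event type strings fit."""
    cursor.execute(WIDEN_EVENT_TYPE)
```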
Some code has been developed to support SlurmDBD 15.08. This new code makes 9 unit tests fail, and the new code itself is not covered by the tests. Both issues must be fixed.
The Slurm job importer checks whether runs exist even for new jobs, but this is useless since the DB model prevents runs from existing for new jobs. Skipping the check for new jobs could be a significant optimization.
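The proposed optimization amounts to guarding the DB lookup on the job's novelty. An illustrative sketch with hypothetical names (these are not actual HPCStats APIs):

```python
# Illustrative sketch of the proposed optimization. The DB model
# guarantees a brand-new job has no Run rows, so the existing-run
# lookup can be skipped entirely for new jobs.
def sync_job_runs(job_is_new, load_existing_runs, store_runs, new_runs):
    """Store `new_runs` for a job, querying the DB for existing runs
    only when the job was already known (job_is_new is False)."""
    if job_is_new:
        existing = []                    # new job: no runs can exist yet
    else:
        existing = load_existing_runs()  # known job: one DB round-trip
    to_store = [r for r in new_runs if r not in existing]
    store_runs(to_store)
    return to_store
```

Over a large import window this saves one SELECT per new job, which is most jobs on a first import.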
Example backtrace:
File "/usr/share/hpcstats/bin/launch-jobstats", line 92, in <module>
main()
File "/usr/share/hpcstats/bin/launch-jobstats", line 89, in main
ssh_launch(user, frontend, privkey, script)
File "/usr/share/hpcstats/bin/launch-jobstats", line 70, in ssh_launch
ssh.connect(frontend, username=user, key_filename=privkey)
File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 282, in connect
for (family, socktype, proto, canonname, sockaddr) in socket.getaddrinfo(hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known
HPCStats fails if "TLS_REQCERT allow" is not set in /etc/ldap/ldap.conf:
ERROR: HPCStatsLauncher: Source Error: unable to connect to LDAP server: {'info': '(unknown error code)', 'desc': "Can't contact LDAP server"}
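The workaround is the documented OpenLDAP client option, which accepts the server certificate without verification (so it is only appropriate when the certificate cannot be validated properly):

```
# /etc/ldap/ldap.conf -- accept the LDAP server certificate unverified
TLS_REQCERT allow
```

A cleaner long-term fix would be to make the server certificate verifiable (TLS_CACERT pointing at the right CA) rather than relaxing the check.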
Currently, nothing prevents multiple concurrent import runs on the same cluster. This systematically results in failed transactions. A mechanism might be added to HPCStats to avoid such concurrent runs.
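One simple mechanism is a per-cluster lock file taken with a non-blocking flock. A sketch, assuming a lock directory such as /var/run/hpcstats (the function name and path are hypothetical):

```python
import fcntl
import os

# Hypothetical per-cluster lock; path layout is an assumption.
def acquire_cluster_lock(cluster_name, lockdir='/var/run/hpcstats'):
    """Take an exclusive, non-blocking lock for one cluster.
    Returns the open lock file on success (keep it open for the
    whole run), or None if another import holds the lock."""
    path = os.path.join(lockdir, cluster_name + '.lock')
    lockfile = open(path, 'w')
    try:
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lockfile.close()
        return None  # another importer is already running
    return lockfile
```

The lock is released automatically when the file is closed or the process exits, so a crashed importer cannot leave a stale lock behind (unlike pid-file schemes).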
Currently, the importer app stops as soon as a source error is encountered. If a source error happens on one cluster, the data from the other clusters (even if they work perfectly) are not imported. This behaviour should be avoided: data must be imported for all working clusters, and clusters encountering source errors must simply be skipped.
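The proposed behaviour is a catch-and-continue loop over clusters; the names below are illustrative, not actual HPCStatsImporter code:

```python
# Sketch of per-cluster error isolation: a source error on one
# cluster is logged and the cluster skipped, instead of aborting
# the whole import run.
def import_all_clusters(clusters, import_one, log_error):
    """Import every cluster, collecting failures instead of
    stopping at the first source error. Returns the list of
    clusters that failed."""
    failed = []
    for name in clusters:
        try:
            import_one(name)
        except Exception as exc:   # source error on this cluster
            log_error(name, exc)
            failed.append(name)    # skip it and carry on
    return failed
```

The launcher could then exit non-zero if `failed` is non-empty, so monitoring still notices the broken cluster while healthy clusters stay up to date.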