hpcstats's People

Contributors

hmlth, mehdid, oldmanscave, phantez, rezib

hpcstats's Issues

Retry several times on failed MySQL requests

HPCStats should retry failed MySQL requests several times, restarting the connection if necessary, to avoid failures like the following during long runs:

Traceback (most recent call last):
  File "/usr/bin/hpcstats", line 39, in <module>
    launcher.run()
  File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsLauncher.py", line 154, in run
    self.app.run()
  File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 93, in run
    self.import_cluster_data(db, cluster_name)
  File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 142, in import_cluster_data
    self.jobs.load_update_window()
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 166, in load_update_window
    batch_id = self.load_window(batch_id) + 1
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 181, in load_window
    return self.get_jobs_after_batchid(batch_id, self.window_size)
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 222, in get_jobs_after_batchid
    old_schema = self._is_old_schema()
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 198, in _is_old_schema
    self.cur.execute(req)
  File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

or:

Traceback (most recent call last):
  File "/usr/bin/hpcstats", line 39, in <module>
    launcher.run()
  File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsLauncher.py", line 152, in run
    self.app.run()
  File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 91, in run
    self.import_cluster_data(db, cluster_name)
  File "/usr/lib/python2.6/dist-packages/HPCStats/CLI/HPCStatsImporter.py", line 140, in import_cluster_data
    self.jobs.load_update_window()
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 166, in load_update_window
    batch_id = self.load_window(batch_id) + 1
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 181, in load_window
    return self.get_jobs_after_batchid(batch_id, self.window_size)
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 222, in get_jobs_after_batchid
    old_schema = self._is_old_schema()
  File "/usr/lib/python2.6/dist-packages/HPCStats/Importer/Jobs/JobImporterSlurm.py", line 198, in _is_old_schema
    self.cur.execute(req)
  File "/usr/lib/pymodules/python2.6/MySQLdb/cursors.py", line 166, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/pymodules/python2.6/MySQLdb/connections.py", line 35, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')
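One possible shape for such a retry is sketched below. It is only an illustration, not HPCStats code: `run_with_retry`, `operation`, `reconnect` and `retriable` are hypothetical names, and the number of attempts and the delay are arbitrary defaults.

```python
import time


def run_with_retry(operation, reconnect, retriable, attempts=3, delay=1.0):
    """Call operation(); on a retriable error, reconnect and try again.

    operation -- callable that performs the request and returns its result
    reconnect -- callable that re-establishes the database connection
    retriable -- exception class (or tuple of classes) worth retrying
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
            reconnect()
```

With MySQLdb, `retriable` could be `MySQLdb.OperationalError`, which covers both errors shown above. A deadlock (1213) probably calls for replaying the whole transaction rather than only the last request, and `operation` would need to obtain a fresh cursor after each reconnect.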

Handle missing private key in launcher

Example of backtrace:

Traceback (most recent call last):
  File "/usr/share/hpcstats/bin/launch-jobstats", line 92, in <module>
    main()
  File "/usr/share/hpcstats/bin/launch-jobstats", line 89, in main
    ssh_launch(user, frontend, privkey, script)
  File "/usr/share/hpcstats/bin/launch-jobstats", line 70, in ssh_launch
    ssh.connect(frontend, username=user, key_filename=privkey)
  File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 332, in connect
    self._auth(username, password, pkey, key_filenames, allow_agent, look_for_keys)
  File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 436, in _auth
    key = pkey_class.from_private_key_file(key_filename, password)
  File "/usr/lib/pymodules/python2.6/paramiko/pkey.py", line 198, in from_private_key_file
    key = cls(filename=filename, password=password)
  File "/usr/lib/pymodules/python2.6/paramiko/rsakey.py", line 51, in __init__
    self._from_private_key_file(filename, password)
  File "/usr/lib/pymodules/python2.6/paramiko/rsakey.py", line 163, in _from_private_key_file
    data = self._read_private_key_file('RSA', filename, password)
  File "/usr/lib/pymodules/python2.6/paramiko/pkey.py", line 279, in _read_private_key_file
    f = open(filename, 'r')
IOError: [Errno 2] No such file or directory: '/path/to/file'
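A minimal sketch of a pre-flight check the launcher could run before calling `ssh.connect()`. The helper name `check_private_key` and the message wording are assumptions made for illustration.

```python
import os


def check_private_key(path):
    """Return an error message if the key file is unusable, else None."""
    if not os.path.isfile(path):
        return "private key file %s not found" % path
    if not os.access(path, os.R_OK):
        return "private key file %s is not readable" % path
    return None
```

The launcher could print the returned message and exit with a non-zero status instead of showing a backtrace. Catching `IOError` around `ssh.connect()` would be an alternative, at the cost of a less specific message.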

SlurmDBD >= 15.08 schema changes

The new SlurmDBD schema uses multi-valued TRES fields that replace several other fields, such as cpu_count in the event tables. Many SQL requests across JobImporterSlurm and EventImporterSlurm must be adapted to use these new fields when needed.
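As an illustration of what the importers would have to do, here is a hedged sketch that extracts a CPU count from a TRES string. It assumes the comma-separated `id=value` format (for example `1=16,2=64000,4=2`) and the id-to-name mapping shown in the comment. Both should be checked against the target SlurmDBD database. The function names are hypothetical.

```python
# Assumed mapping of TRES ids to resources; verify it against the
# TRES definitions stored in the accounting database.
TRES_NAMES = {1: "cpu", 2: "mem", 3: "energy", 4: "node"}


def parse_tres(tres):
    """Turn a string such as '1=16,2=64000' into {1: 16, 2: 64000}."""
    result = {}
    if not tres:
        return result
    for item in tres.split(","):
        tres_id, _, value = item.partition("=")
        result[int(tres_id)] = int(value)
    return result


def cpu_count(tres):
    """Return the CPU count of a TRES string, or 0 if it has none."""
    return parse_tres(tres).get(1, 0)
```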

Increase event_type field to 200 chars

Currently, the event_type field in the DB is 50 characters long, which can be too short for some event types. It should be increased to 200 characters to avoid issues when importing events.

Avoid checking runs on new jobs

The Slurm job importer checks whether runs exist even for new jobs. This check is useless, since the DB model prevents runs from existing for new jobs. Skipping it for new jobs could be a significant optimization.

Handle failed hostname resolution in the launcher

Example of backtrace:

Traceback (most recent call last):
  File "/usr/share/hpcstats/bin/launch-jobstats", line 92, in <module>
    main()
  File "/usr/share/hpcstats/bin/launch-jobstats", line 89, in main
    ssh_launch(user, frontend, privkey, script)
  File "/usr/share/hpcstats/bin/launch-jobstats", line 70, in ssh_launch
    ssh.connect(frontend, username=user, key_filename=privkey)
  File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 282, in connect
    for (family, socktype, proto, canonname, sockaddr) in socket.getaddrinfo(hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known
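A minimal sketch of how the launcher could detect this case before connecting. The helper name `resolve_or_none` and the `resolver` parameter (present only to make the function testable) are assumptions, not part of the existing launcher.

```python
import socket


def resolve_or_none(hostname, port=22, resolver=socket.getaddrinfo):
    """Return the address list for hostname, or None if it cannot be resolved."""
    try:
        return resolver(hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    except socket.gaierror:
        return None
```

When the result is `None`, the launcher could report that the frontend name cannot be resolved and exit cleanly. Wrapping `ssh.connect()` in a `try`/`except socket.gaierror` would achieve the same without a second lookup.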

LDAP parameter TLS_CACERT ignored

HPCStats returns an error if "TLS_REQCERT ALLOW" is not set in /etc/ldap/ldap.conf:

ERROR: HPCStatsLauncher: Source Error: unable to connect to LDAP server: {'info': '(unknown error code)', 'desc': "Can't contact LDAP server"}

Avoid full stop of importer app on first cluster fail

Currently, the importer app stops as soon as a source error is encountered. If a source error happens on one cluster, the data from the other clusters are not imported, even if those clusters work perfectly. This behaviour should be avoided: the data must be imported for all working clusters, and any cluster that encounters a source error must simply be skipped.
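The intended behaviour could look roughly like the sketch below. `SourceError` here is a hypothetical stand-in for the importer's source error exception, and `import_all` and `import_cluster` are illustrative names rather than existing HPCStats functions.

```python
class SourceError(Exception):
    """Hypothetical stand-in for the importer's source error exception."""


def import_all(clusters, import_cluster):
    """Import every cluster and return a list of (name, error) for failures."""
    failed = []
    for name in clusters:
        try:
            import_cluster(name)
        except SourceError as err:
            # Record the failure and carry on with the remaining clusters.
            failed.append((name, str(err)))
    return failed
```

The caller can log the returned failures at the end of the run and still exit with a non-zero status, so that a broken source does not go unnoticed.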
