
nimrodg's Introduction

Nimrod/G

Usage

The CLI is huge; consider using the -h flag on any command.

usage: nimrod [-h] [-c CONFIG] [-d] command ...

Invoke Nimrod/G CLI commands

optional arguments:
  -h, --help             show this help message and exit
  -c CONFIG, --config CONFIG
                         Path to configuration file. (default: /home/user/.config/nimrod/nimrod.ini)
  -d, --debug            Enable debug output. (default: false)

valid commands:
  command
    property             Property Operations.
    experiment           Experiment Operations.
    master               Start the experiment master.
    resource             Resource operations.
    resource-type        Resource type operations.
    job                  Job operations.
    setup                Nimrod/G setup functionality.
    compile              Compile a planfile.
    genconfig            Generate a default configuration file.
    agent                Agent Operations.
    staging              Execute staging commands.

Build Instructions

Use the nimw.sh wrapper script to invoke the CLI via Gradle.

To generate a tarball, use gradle nimrod:assembleDist.

Requirements

  • Java 11+
  • Gradle 5.3.1+

Installation

  • Create a nimrod.ini configuration file in ~/.config/nimrod
    • A sample is provided in nimrodg-cli/src/main/resources
  • Create a setup configuration file. This can be placed anywhere.
    • A sample is provided in nimrodg-cli/src/main/resources
  • Run nimrod setup init /path/to/setup-config.ini
  • You're ready to go.

License

This project is licensed under the Apache License, Version 2.0:

Copyright © 2019 The University of Queensland

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

3rd-party Licenses

Project | License | License URL
Antlr4 | The BSD License | http://www.antlr.org/license.html
icu4j | Unicode/ICU License | http://source.icu-project.org/repos/icu/trunk/icu4j/main/shared/licenses/LICENSE
PgJDBC | BSD-2-Clause License | https://jdbc.postgresql.org/about/license.html
ini4j | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Bouncy Castle Crypto APIs | Bouncy Castle License | https://www.bouncycastle.org/license.html
Jersey | CDDL 1.1 | https://jersey.github.io/license.html
sqlite-jdbc | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
RabbitMQ Java Client Library | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache log4j2 | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Commons CSV | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Commons IO | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Commons Collections | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Tomcat | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Mina SSHD | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
java_text_tables | MIT License | https://raw.githubusercontent.com/iNamik/java_text_tables/master/LICENSE.txt

nimrodg's People

Contributors

hoangnguyen177, vs49688

nimrodg's Issues

Job count querying

The Nimrod Portal requires completed, failed, running, pending, and total counts of each job. This information isn't available unless the status of each job (and thus each attempt) is queried and counted, which is slow.

A new method, JobCounts getJobCounts(Experiment exp);, should be added to NimrodAPI to provide this information efficiently.
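
A minimal sketch of what this could look like; the JobCounts fields below are illustrative only and not part of the existing API:

/* Hypothetical value class; the field names are illustrative. */
final class JobCounts {
	public final long completed;
	public final long failed;
	public final long running;
	public final long pending;
	public final long total;

	JobCounts(long completed, long failed, long running, long pending, long total) {
		this.completed = completed;
		this.failed = failed;
		this.running = running;
		this.pending = pending;
		this.total = total;
	}
}

/* Proposed addition to NimrodAPI. An implementation could back this with a
 * single aggregate query rather than walking every job and attempt. */
JobCounts getJobCounts(Experiment exp);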

debug2: Control master terminated unexpectedly

Travis builds sometimes fail with this error. I have no idea what causes it and it seems extremely inconsistent; it has only ever been seen on Travis builds, never on a local build.

https://travis-ci.org/github/UQ-RCC/nimrodg/jobs/712805294

    [Test worker] TRACE au.edu.uq.rcc.nimrodg.shell.OpenSSHClient - Executing command: /usr/bin/ssh -l user -i /tmp/junit9299936063122352927/openssh-1882757806-key -p 2292 -oPasswordAuthentication=no -oKbdInteractiveAuthentication=no -oChallengeResponseAuthentication=no -oBatchMode=yes -oControlMaster=auto -oControlPersist=yes -oControlPath=/tmp/junit9299936063122352927/openssh-1882757806-control -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oLogLevel=DEBUG3 127.0.0.1 -E /tmp/junit9299936063122352927/openssh-1882757806-log01.txt -- scp -q -p -t /asdf
    [Test worker] TRACE au.edu.uq.rcc.nimrodg.shell.OpenSSHClient - Attempting to dump OpenSSH log file at /tmp/junit9299936063122352927/openssh-1882757806-log01.txt
    [Test worker] TRACE au.edu.uq.rcc.nimrodg.shell.OpenSSHClient - debug1: Reading configuration data /home/travis/.ssh/config
    debug1: /home/travis/.ssh/config line 1: Applying options for *
    debug1: /home/travis/.ssh/config line 2: Deprecated option "useroaming"
    debug1: Reading configuration data /etc/ssh/ssh_config
    debug1: /etc/ssh/ssh_config line 19: Applying options for *
    debug1: auto-mux: Trying existing master
    debug2: fd 4 setting O_NONBLOCK
    debug2: mux_client_hello_exchange: master version 4
    debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
    debug3: mux_client_request_session: entering
    debug3: mux_client_request_alive: entering
    debug3: mux_client_request_alive: done pid = 7067
    debug3: mux_client_request_session: session request sent
    debug1: mux_client_request_session: master session id: 2
    debug3: mux_client_read_packet: read header failed: Broken pipe
    debug2: Control master terminated unexpectedly

[Nimrod/K] A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)

Running in Nimrod/K with

parameters = {"x", "y"}
jobs = {"x"=0, "y"=0}

throws the following exception:

ptolemy.kernel.util.IllegalActionException: [SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
  with tag colour {NOCOLOUR}
  in .Unnamed1.Nimrod/G Actor
Because:
[SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
	at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:575)
Caused by: au.edu.uq.rcc.nimrodg.impl.base.db.NimrodSQLException: [SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:540)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:67)
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:65)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:124)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:58)
	at au.edu.uq.rcc.nimrod.NimrodGActor.fire(NimrodGActor.java:340)
	at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:464)
Caused by: au.edu.uq.rcc.nimrodg.impl.base.db.NimrodSQLException: [SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:540)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:67)
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:65)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:124)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:58)
	at au.edu.uq.rcc.nimrod.NimrodGActor.fire(NimrodGActor.java:340)
	at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:464)

Actuator proxying

The SSH actuators should support proxying. Perhaps modify the resource configuration to include the following:

{
  "tunnels": [
    {
      "type": "reverse",
      "srcport": 5671,
      "dsthost": "203.101.225.94",
      "dstport": 5671
    }
  ]
}

MINA SSHD would create those tunnels upon connection. The user can specify custom AMQP/Transfer URIs in the resource configuration as normal, except this time they'd reference the head node.
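
As a rough sketch (assuming a recent MINA SSHD API; this is not the actual actuator code), a "reverse" tunnel entry like the one above could be applied once the client session is up:

import java.io.IOException;
import org.apache.sshd.client.session.ClientSession;
import org.apache.sshd.common.util.net.SshdSocketAddress;

/* Sketch: open a reverse tunnel. The remote side listens on srcport and
 * connections to it are forwarded (via the client) to dsthost:dstport. */
static void openReverseTunnel(ClientSession session, int srcPort, String dstHost, int dstPort) throws IOException {
	session.startRemotePortForwarding(
		new SshdSocketAddress("", srcPort),
		new SshdSocketAddress(dstHost, dstPort)
	);
}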

Add support for querying/filtering job attempts based on an experiment.

This will be useful for the master when doing state recovery.

Collection<? extends Job> NimrodAPI#filterJobAttempts(Experiment exp, EnumSet<JobAttempt.Status> status, long start, int limit);

Collection<? extends Job> Job#filterAttempt(EnumSet<JobAttempt.Status> status, long start, int limit);
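
As a hedged usage sketch (taking the proposed signatures as-is; rebuildMappingsFor() is a hypothetical helper), the master could page through running attempts during recovery:

/* Sketch of master-side state recovery using the proposed API. */
long start = 0;
final int limit = 1000;
Collection<? extends Job> batch;
do {
	batch = nimrod.filterJobAttempts(exp, EnumSet.of(JobAttempt.Status.RUNNING), start, limit);
	batch.forEach(j -> rebuildMappingsFor(j)); /* hypothetical helper */
	start += limit;
} while(!batch.isEmpty());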

External Execution Actuator

A rather simple concept: invoke a program that will start agents.

The program shall have two commands:

Usage:
  ./extlaunch.py launch
  ./extlaunch.py kill <uuid>
  • launch takes a JSON dump of the following format from stdin:
{
	"resource_path": "fl012",
	"amqp_uri": "amqp://user:pass@asdfasdfasd",
	"no_verify_peer": true,
	"no_verify_host": true,
	"cert_data": "asdfasd==",
	"amqp_routing_key": "iamthemaster",
	"uuids": [
		"ef449411-1407-4702-ad12-06a639c065fb",
		"a72188ba-24a7-4dd8-95c0-dea896a70141",
		"b725ad16-db6c-4744-995e-b254e4fcfd08"
	],
	"config": {
		"limit": 10,
		"program": "/home/user/Desktop/nimrod-embedded/extlaunch.py"
	}
}

Nimrod doesn't care how the program invokes the agents.
In the case of Embedded Nimrod, it'll generate a script and invoke ssh to another node, bypassing the requirement of having access to the host's keys.

  • kill <uuid> -- Attempt to kill a single agent.

Any nonzero return value is considered a failure.
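
A rough sketch of how the actuator side might drive such a program (ProcessBuilder is standard Java; building the JSON string is assumed to happen elsewhere):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/* Sketch: run "extlaunch.py launch", feed the JSON dump on stdin,
 * and treat any nonzero exit status as a launch failure. */
static boolean runLaunch(Path program, String json) throws IOException, InterruptedException {
	Process p = new ProcessBuilder(program.toString(), "launch")
		.redirectErrorStream(true)
		.start();
	try(OutputStream os = p.getOutputStream()) {
		os.write(json.getBytes(StandardCharsets.UTF_8));
	}
	return p.waitFor() == 0;
}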

Agent Protocol Updates

Everything that should be in the next agent protocol update:

  • Protocol versioning
    • A single number field would suffice. Agents are shipped with Nimrod, so this isn't a problem.
  • SSH key path field
    • So SSH transfer targets can be used securely
    • Or embed the key directly
  • state info in agent.pong
    • for accounting and debugging purposes
  • a new "resync" message
    • If agents get out-of-sync for some reason, the master can force-reset them.
  • a new CommandResult.Status enum value of Failed
    • Used for explicitly stating a process has returned nonzero. Blocks #16
  • add a timestamp field to each message
    • To AMQPBasicProperties (POSIX timestamp)
    • timestamp field (ISO8601)
    • X-NimrodG-Sent-At header (ISO8601)
      • time the message was actually sent
  • message signing
    • NIM1-HMAC-SHA224
    • NIM1-HMAC-SHA256
    • NIM1-HMAC-SHA384
    • NIM1-HMAC-SHA512
    • Change datestamp to timestamp
    • Use nonce
    • Add a nonce field. For future use.
  • agent.submit token field
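
For the message-signing item above, the NIM1-HMAC-* names are only proposed scheme identifiers; a minimal sketch of the underlying MAC with the JDK (how the payload is canonicalised, and where the timestamp and nonce go, is the open design question) might be:

import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/* Sketch: raw HMAC-SHA256 over an already-canonicalised message payload. */
static byte[] sign(byte[] key, String canonicalPayload) throws Exception {
	Mac mac = Mac.getInstance("HmacSHA256");
	mac.init(new SecretKeySpec(key, "HmacSHA256"));
	return mac.doFinal(canonicalPayload.getBytes(StandardCharsets.UTF_8));
}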

Unique Directory for each Actuator

Currently using the below, which isn't unique enough.

String.format("act-%s-%d", this.getClass().getSimpleName(), (long)uri.hashCode() & 0xFFFFFFFFL)

Should add a function in ActuatorUtils to do this properly.
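
A possible sketch for such a helper (the exact inputs are debatable; this assumes hashing the resource name and the full URI rather than relying on hashCode()):

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/* Sketch: derive a stable, collision-resistant directory name from the
 * actuator class, the resource name and the full URI. */
static String actuatorDirectoryName(Class<?> actuatorClass, String resourceName, URI uri) throws NoSuchAlgorithmException {
	MessageDigest md = MessageDigest.getInstance("SHA-256");
	md.update(resourceName.getBytes(StandardCharsets.UTF_8));
	md.update((byte)0);
	md.update(uri.toString().getBytes(StandardCharsets.UTF_8));
	StringBuilder sb = new StringBuilder("act-").append(actuatorClass.getSimpleName()).append('-');
	for(byte b : md.digest()) {
		sb.append(String.format("%02x", b));
	}
	return sb.toString();
}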

Statelessness

This is purely on the master.

All state is in the database except:

  • Launch failure counts

There are several ways of doing this:

  • The master and schedulers manually recover their state via the DB.
    • Easiest, as they can access their own internals.
    • Issues arise in the agent scheduler when rebuilding job<->agent mappings, as it has no access to that data.
  • The master gives the schedulers a NULL backend and "replays" the state.

"no schema has been selected to create in" when using currentSchema in JDBC URL

$ nimrod setup init
au.edu.uq.rcc.nimrodg.setup.NimrodSetupAPI$SetupException: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
	at au.edu.uq.rcc.nimrodg.impl.postgres.SetupAPIImpl.reset(SetupAPIImpl.java:144)
	at au.edu.uq.rcc.nimrodg.cli.commands.Setup.execute(Setup.java:110)
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:36)
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:125)
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:145)
Caused by: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
	at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
	at au.edu.uq.rcc.nimrodg.impl.postgres.SetupAPIImpl.reset(SetupAPIImpl.java:142)

Remove NimrodServeAPI

It is old and unused. The following should be removed:

  • NimrodServeAPI interface
  • nimrod_experiments::file_token Postgres
  • nimrod_experiments::file_token Sqlite3
  • agent.submit token field
    • Is being done as part of #17

The master thinks agents already exist on a resource and doesn't spawn any.

This happens after a hard crash.

zane=> TABLE nimrod_resource_agents;
 id |  state   |             queue              |              agent_uuid              | shutdown_signal | shutdown_reason |            created            | expiry_time | expired | location | location_full 
----+----------+--------------------------------+--------------------------------------+-----------------+-----------------+-------------------------------+-------------+---------+----------+---------------
  1 | READY    | amq.gen-thWSdFUXYM9_EgVti14nUA | ee7d5374-c222-4dc0-8e3a-4819c3550c1f |              -1 | HostSignal      | 2018-08-06 15:57:25.873506+10 |             | f       |        1 | local
  2 | READY    | amq.gen-j5xTdZeGjlQlR41iecTZvg | ff5d67b3-84ed-436a-917e-746a5503074e |              -1 | HostSignal      | 2018-08-06 15:57:50.770861+10 |             | f       |        1 | local
  3 | READY    | amq.gen-8FS_l8w1y_ZNHuHhXrJgDw | 48e076ab-f8c1-4f64-a190-6299e0e59119 |              -1 | HostSignal      | 2018-08-06 15:58:21.721877+10 |             | f       |        1 | local
  4 | SHUTDOWN | amq.gen-wDSdoFnccCrF_Uwu-mU_VA | 7ac88025-4245-4af2-af9f-f6059ffe3944 |               9 | HostSignal      | 2018-08-06 15:59:15.35432+10  |             | t       |        1 | local
(4 rows)

Ideally, the master would "rescan" agents at startup to see if they're still alive.
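
One possible shape for such a rescan (every name here is hypothetical and does not reflect the real master or API surface):

/* Sketch only: ping every non-SHUTDOWN agent recorded for the resource and
 * expire those that never answer with an agent.pong. */
for(AgentInfo ai : getResourceAgents(resource)) {      /* hypothetical lookup */
	if(ai.getState() == Agent.State.SHUTDOWN) {
		continue;
	}
	sendPing(ai.getUUID());                             /* hypothetical */
	expireIfNoPongWithin(ai.getUUID(), PING_TIMEOUT);   /* hypothetical */
}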

Attempts aren't failed if the final command fails

If onerror == fail and the last command in a job fails, the job scheduler doesn't count it as a failure.

This is a logic error in DefaultJobScheduler.java:

if(au.getAction() == AgentUpdate.Action.Stop) {
	if(cr.index < maxIdx || cr.status != CommandResult.CommandResultStatus.SUCCESS) {
		/* A command has failed and caused the job to stop. */
		ops.updateJobFinished(att, true);
	} else {
		/* We've finished successfully. */
		ops.updateJobFinished(att, false);
	}
}

Portal API feature requests

  1. Experiment
    • list job-related stats: the jobs belonging to an experiment and their status
    • list the number of agents belonging to an experiment
    • list CPU hours based on resource type
    • mean, median jobs
  2. Resource
    • show the number of CPU hours used so far for each resource
    • show the total number of successful and failed jobs
    • mean, median jobs
    • add a resource via the portal API, similar to the Nimrod API
  3. Plan file
    • compile returns 0 even when the compile command fails

And more.

Mina doesn't fully resolve the host key

The ecdsa-sha2-nistp521 key is missing the trailing aA==, causing a key verification error.

Command:

./nimw.sh resource add flashlite pbspro -- --uri=ssh://[email protected] --key /path/to/key --limit=10 --max-batch-size=2 --add-batch-res-scale=ncpus:1 --add-batch-res-static=walltime:01:00:00 --add-batch-res-scale=mem:1GiB -- -A UQ-RCC
{
  "agent_platform": "x86_64-pc-linux-musl",
  "transport": {
    "name": "sshd",
    "uri": "ssh://[email protected]",
    "keyfile": "/path/to/key",
    "hostkeys": [
      "ssh-dss AAAAB3NzaC1kc3MAAACBAJ5dwWbFpwVHS1XfxfNuEFG+gwt770d/eC1sKDkLkAmilGko2AB+DS5QrEkWUOKuhn0dsvuvi9g14iSz+439fqn0tHF0LPzp7KqGZmloGkjSOVjqy4JkAk+xthZrt671j0KUuq3DxIbmibcHRyuQQDCxjxZJnyz2RkSiP06N19V1AAAAFQDpoXU890ULUUVDnMlaHmYODe2nzwAAAIBna51ORkuWCviOBuHADVEiuC27ithK1YHzQW84eAXqUKiUXZWbEV7ByBSGRrzEc7WZU1e2dMAb7uACQburkQy3OIf2iJc6zZzVrYLLSZJmdtX/94A3CYWN/j2AGeZR+zNmX3DX8tl+Q0i2Amg22ewl24TSvy0q+fU/RFE9tO5NmgAAAIBDXXcLual8D2GiW5zeRKmp/EyfQjAzRJtj0v8lTntddXzE8ciYcWYARCrTYBjALvbAoPQqvUbvicABgaoGHXkmtoe7g4H0NpgcsXSZnEhqKsXUgaAZraO0r2qRBBeFtE3AMKmg1VPzSvgfdJMmezL0leJUhTmYQb/aEtADhLg3Ow==",
      "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAplbm/kI54sQLlIdGMH5tgf18Z+d6X3Ik3/y1T4l5ddDN6nPvXkVL4WsKJD2boIWo6L7kBiuhz5KY7AtQrLF+NNDoltP/x2j4jdxGnXTUakt59ARrPCcNPAhINOZMNOHqos2B1T0Ca5ZpYeZDu7yJ25Q1J6OpIayxanPot9MvchXTzJ5/dVvVF092ECuGXA9KfclzV0Al486hcWnEENm7KGxfCYY+46hGGpOCBcc+aHtL5mgNj39tRp7d4tK3cNT39SbAvfmd/V5DnTD8ODaPGS3rISYSWuGw/xQq/vpfGDRGtD4/TmKW1I0O+kn95B56HuZ4jiRQSZli5T6WcMdoWw==",
      "ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAD7bmdQjLozXCuciMh4rJ2TkAjznmqmKUdxTkgJDGeAXa2RtgkLSkYIV2SbzSVHgnPJMCQiAgvuCOuLArQS5OpvOwAsmwBoeRamhazYXuQGBwlycpBWJM8lZ7nh9vZAD2skn9MGJdlYaL4WsmQId6Bf3PnS78dQsVFQw3mGMkK5NHlz"
    ]
  },
  "tmpvar": "TMPDIR",
  "pbsargs": ["-A", "UQ-RCC"],
  "limit": 10,
  "max_batch_size": 2,
  "batch_config": [
    { "name": "walltime", "value": 86400, "scale": false },
    { "name": "ncpus", "value": 1, "scale": true },
    { "name": "mem", "value": 1073741824, "scale": true }
  ]
}
$ ssh-keyscan flashlite.rcc.uq.edu.au
# flashlite.rcc.uq.edu.au:22 SSH-2.0-OpenSSH_7.4
flashlite.rcc.uq.edu.au ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAplbm/kI54sQLlIdGMH5tgf18Z+d6X3Ik3/y1T4l5ddDN6nPvXkVL4WsKJD2boIWo6L7kBiuhz5KY7AtQrLF+NNDoltP/x2j4jdxGnXTUakt59ARrPCcNPAhINOZMNOHqos2B1T0Ca5ZpYeZDu7yJ25Q1J6OpIayxanPot9MvchXTzJ5/dVvVF092ECuGXA9KfclzV0Al486hcWnEENm7KGxfCYY+46hGGpOCBcc+aHtL5mgNj39tRp7d4tK3cNT39SbAvfmd/V5DnTD8ODaPGS3rISYSWuGw/xQq/vpfGDRGtD4/TmKW1I0O+kn95B56HuZ4jiRQSZli5T6WcMdoWw==
# flashlite.rcc.uq.edu.au:22 SSH-2.0-OpenSSH_7.4
flashlite.rcc.uq.edu.au ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAD7bmdQjLozXCuciMh4rJ2TkAjznmqmKUdxTkgJDGeAXa2RtgkLSkYIV2SbzSVHgnPJMCQiAgvuCOuLArQS5OpvOwAsmwBoeRamhazYXuQGBwlycpBWJM8lZ7nh9vZAD2skn9MGJdlYaL4WsmQId6Bf3PnS78dQsVFQw3mGMkK5NHlzaA==
# flashlite.rcc.uq.edu.au:22 SSH-2.0-OpenSSH_7.4
flashlite.rcc.uq.edu.au ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOeGyKYiqW7FEXnjGWOwShGWEhu124BLktd/q8CsiJwt

Move HPC definitions into the database

As it stands, API behaviour is dependent on hpc.json which may be different on each machine. This should be stored in the database instead.

Will be configured at setup time along with everything else.

Update shexec syntax

Update the shexec syntax to support different shells:

shexec[:<plat>[:<shell>]]
shexec:win32:cmd
shexec:win32:powershell
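
A sketch of how the extended syntax could be split into its parts (the real tokeniser lives in the planfile compiler and may differ):

/* Sketch: parse "shexec[:<plat>[:<shell>]]", e.g.
 * "shexec:win32:powershell" -> plat = "win32", shell = "powershell". */
static String[] parseShexec(String spec) {
	String[] parts = spec.split(":", 3);
	String plat  = parts.length > 1 ? parts[1] : null;
	String shell = parts.length > 2 ? parts[2] : null;
	return new String[] { plat, shell };
}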

"java.sql.SQLException: No such command" when using SQLite backend.

Sometimes Nimrod will exit with java.sql.SQLException: No such command. This has only been seen with the SQLite backend.

It is caused by DBExperimentHelpers#getCommandIdForResult() being called with cmdIndex == -1, traced to JobScheduler#recordCommandResult() entering NimrodMasterAPI with an invalid argument.

This doesn't happen on Postgres because of the following in _exp_t_command_result_add():

/* If NULL or negative command index, assume the next one. */
IF NEW.command_index IS NULL OR NEW.command_index < 0 THEN
    SELECT COALESCE(MAX(command_index) + 1, 0) INTO NEW.command_index FROM nimrod_command_results WHERE attempt_id = NEW.attempt_id;
END IF;

I'm not sure whether or not this behaviour is correct. Further investigation is required.

Stack Trace:

au.edu.uq.rcc.nimrodg.api.NimrodException$DbError: java.sql.SQLException: No such command
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:574) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:71) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:65) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addCommandResult(TempNimrodAPIImpl.java:372) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_JobOperations.recordCommandResult(Master.java:699) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.onJobFailure(DefaultJobScheduler.java:178) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_AgentOperations.lambda$reportJobFailure$10(Master.java:908) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.lambda$processQueue$18(Master.java:568) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
	at au.edu.uq.rcc.nimrodg.master.Master.processQueue(Master.java:568) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.startProc(Master.java:456) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.tick(Master.java:315) [nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.commands.MasterCmd.execute(MasterCmd.java:157) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLICommand.execute(NimrodCLICommand.java:43) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:43) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:125) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:145) [main/:?]
Caused by: java.sql.SQLException: No such command
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.DBExperimentHelpers.getCommandIdForResult(DBExperimentHelpers.java:857) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.DBExperimentHelpers.addCommandResult(DBExperimentHelpers.java:864) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.addCommandResult(SQLite3DB.java:429) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.lambda$addCommandResult$38(TempNimrodAPIImpl.java:372) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:50) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	... 14 more

Play nice with Jigsaw

The nimrod-* projects should be changed so they play nice with JPMS.

  • Add module-info.java to each project with the appropriate exports.
  • Make each project only "manage" one package. This is mostly done except for nimrodg-internal-api.

None of this should be done until Gradle properly supports it anyway.
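
As an illustration only (the module name and requires clauses below are guesses, not the project's actual layout), a module descriptor could look like:

/* Hypothetical module-info.java for the API project. */
module au.edu.uq.rcc.nimrodg.api {
	exports au.edu.uq.rcc.nimrodg.api;
	requires java.sql;
}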

Handle long-spawning agents.

Sometimes heartbeating will mark an agent for expiry while it's still in WAITING_FOR_HELLO. This can happen when an agent is stuck in a PBS/SLURM queue.

Options:

  1. Make AgentScheduler#onAgentExpiry() accept launching agents.

    • It is up to the scheduler to handle this.
    • The agent may launch and connect later.
  2. In Master#doExpire() call AgentScheduler#onAgentLaunchFailure().

    • The actuator may not know about the expiry, which causes issues.
  3. Ask the actuator:

Something like this in Actuator:

/** Agent status from an actuator's POV. */
enum AgentStatus {
	/** Agent is still launching. May be stuck in a queue. */
	Launching,
	/** Agent has launched, but not connected yet. */
	Launched,
	/** Agent has connected. */
	Connected,
	/** Agent has disconnected. */
	Disconnected,
	/** Unknown. The agent may not be ours, or we have stopped tracking it. */
	Unknown
}

default AgentStatus queryStatus(UUID uuid) {
	return AgentStatus.Unknown;
}

If the state is Launching, then either do nothing or extend the waiting time. Otherwise continue as normal.
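
In Master#doExpire(), that could look roughly like this (hedged; extendExpiry() and the exact method arguments are hypothetical):

/* Sketch: consult the actuator before expiring an agent still in WAITING_FOR_HELLO. */
Actuator.AgentStatus st = actuator.queryStatus(agent.getUUID());
if(st == Actuator.AgentStatus.Launching) {
	/* Probably still sitting in a PBS/SLURM queue; give it more time. */
	extendExpiry(agent);
} else {
	agentScheduler.onAgentExpiry(agent.getUUID());
}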

Strange scheduling behaviour for failed jobs.

Job picalc/5244 fails 3 times and is then scheduled (and succeeds) multiple times. Only seen with sqlite3.

[22/05/2019 21:19:57:455 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:27:371 |      a.e.u.r.n.r.s.OpenSSHClient | TRACE] Executing command: ssh -q -o PasswordAuthentication=no -o StrictHostKeyChecking=no np-compute-20 -- mkdir -p /mnt/nimrod/agent-33e070cc-5831-43e9-877b-320afc0bc549
[22/05/2019 21:20:28:293 |      a.e.u.r.n.r.s.OpenSSHClient | TRACE] Executing command: ssh -q -o PasswordAuthentication=no -o StrictHostKeyChecking=no np-compute-20 -- /home/nimrod/.nimrod/d33f7f73-ca03-4010-9f32-15fc48555621/agent-x86_64-pc-linux-musl --uuid 33e070cc-5831-43e9-877b-320afc0bc549 --amqp-uri amqp://nimrod:[email protected]/nimrod --amqp-routing-key nimrod --work-root /mnt/nimrod/agent-33e070cc-5831-43e9-877b-320afc0bc549 --batch --output workroot
[22/05/2019 21:20:30:027 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:30:040 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:30:784 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:30:784 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:31:443 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:31:443 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:31:855 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:31:855 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:32:600 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:32:600 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:33:310 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:33:310 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:34:090 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:34:090 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:35:041 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:35:041 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:35:672 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:35:672 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:36:142 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:36:142 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:37:113 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:37:113 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:37:952 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:37:952 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:39:268 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:39:269 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:40:542 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent 'c3ac4cd2-727c-421b-9f1d-00e32348f8c1'
[22/05/2019 21:20:40:961 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:40:961 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:40:962 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from WAITING_FOR_HELLO -> READY
[22/05/2019 21:20:42:047 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, WAITING_FOR_HELLO, READY)
[22/05/2019 21:20:42:164 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' failed on attempt 1, rescheduling...
[22/05/2019 21:20:42:164 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' failed on attempt 2, rescheduling...
[22/05/2019 21:20:42:164 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' failed on attempt 3, rescheduling...
[22/05/2019 21:20:42:224 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:42:224 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:42:225 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:43:592 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5594' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 21:20:43:592 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 21:20:45:916 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:20:45:918 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:20:45:918 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:20:45:959 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 21:37:29:922 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 4!
[22/05/2019 21:48:42:627 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:48:42:627 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from BUSY -> READY
[22/05/2019 21:48:42:628 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, BUSY, READY)
[22/05/2019 21:48:42:631 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/7101' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 21:48:42:632 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 21:48:42:735 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:48:42:736 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 21:48:42:835 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:48:42:936 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:480 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:480 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from BUSY -> READY
[22/05/2019 22:17:24:480 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, BUSY, READY)
[22/05/2019 22:17:24:484 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/8622' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 22:17:24:485 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 22:17:24:605 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:605 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 22:17:24:711 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:826 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:52:051 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent '92c20286-4e0e-4e53-85df-caaedb3a4f5a'
[22/05/2019 22:46:52:157 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 5!
[22/05/2019 22:46:52:168 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 6!
[22/05/2019 22:46:52:168 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 7!
[22/05/2019 22:46:52:820 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent 'a716c476-543e-4078-b9e6-cb22c840095d'
[22/05/2019 22:46:52:938 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 8!
[22/05/2019 22:46:53:047 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 9!
[22/05/2019 22:46:53:056 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 10!
[22/05/2019 22:46:55:440 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:55:440 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from BUSY -> READY
[22/05/2019 22:46:55:441 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, BUSY, READY)
[22/05/2019 22:46:55:447 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 22:46:55:448 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 22:46:55:560 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:55:560 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 22:46:55:561 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 11!
[22/05/2019 22:46:55:562 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:55:564 |               a.e.u.r.n.m.Master | ERROR] Caught exception during RUN:
[22/05/2019 22:46:55:564 |               a.e.u.r.n.m.Master | ERROR] Catching
java.lang.NullPointerException: null
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.tickJobAttempt(DefaultJobScheduler.java:172) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.onJobUpdate(DefaultJobScheduler.java:143) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_AgentListener.lambda$onJobUpdate$3(Master.java:912) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.lambda$processQueue$14(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
	at au.edu.uq.rcc.nimrodg.master.Master.processQueue(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.startProc(Master.java:436) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.tick(Master.java:304) [nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.commands.MasterCmd.execute(MasterCmd.java:137) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLICommand.execute(NimrodCLICommand.java:42) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:36) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:124) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:137) [nimrodg-cli-1.1.1.jar:?]
[22/05/2019 22:46:56:078 |                a.e.u.r.n.m.AAAAA | TRACE] Cancelling pending launches...
[22/05/2019 22:46:56:290 |               a.e.u.r.n.m.Master | TRACE] Terminating agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 22:46:56:415 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:56:419 |               a.e.u.r.n.m.Master | ERROR] Caught exception during RUN:
[22/05/2019 22:46:56:419 |               a.e.u.r.n.m.Master | ERROR] Catching
java.lang.NullPointerException: null
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.tickJobAttempt(DefaultJobScheduler.java:172) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.onJobUpdate(DefaultJobScheduler.java:143) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_AgentListener.lambda$onJobUpdate$3(Master.java:912) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.lambda$processQueue$14(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
	at au.edu.uq.rcc.nimrodg.master.Master.processQueue(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.stoppingProc(Master.java:639) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.tick(Master.java:304) [nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.commands.MasterCmd.execute(MasterCmd.java:137) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLICommand.execute(NimrodCLICommand.java:42) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:36) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:124) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:137) [nimrodg-cli-1.1.1.jar:?]
[22/05/2019 22:46:58:255 |                a.e.u.r.n.m.AAAAA |  INFO] Waiting on 40 actuator(s)...

Add the ability to instrument jobs

It would be useful to be able to instrument job/agent performance along the lines of iostat:

#root@gpfs1 10:25:50 /local/home/user> iostat
Linux 3.10.0-514.el7.x86_64 (host) 	10/08/18 	_x86_64_	(40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.59    0.00    0.87    0.01    0.00   98.53

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               4.31         2.57        38.68   32836930  493431848
sdb               0.39         0.26        21.33    3267452  272117192
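
A very rough sketch of what agent-side sampling could collect with the JDK alone (CPU only; per-device I/O figures like iostat's would need /proc or an external tool):

import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

/* Sketch: sample process and system CPU load; values are fractions in [0.0, 1.0],
 * or negative if the platform cannot provide them. */
static double[] sampleCpu() {
	OperatingSystemMXBean os =
		(OperatingSystemMXBean)ManagementFactory.getOperatingSystemMXBean();
	return new double[] { os.getProcessCpuLoad(), os.getSystemCpuLoad() };
}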
