
nimrodg's Introduction

Nimrod/G

Usage

The CLI is huge; consider using the -h flag on any command.

usage: nimrod [-h] [-c CONFIG] [-d] command ...

Invoke Nimrod/G CLI commands

optional arguments:
  -h, --help             show this help message and exit
  -c CONFIG, --config CONFIG
                         Path to configuration file. (default: /home/user/.config/nimrod/nimrod.ini)
  -d, --debug            Enable debug output. (default: false)

valid commands:
  command
    property             Property Operations.
    experiment           Experiment Operations.
    master               Start the experiment master.
    resource             Resource operations.
    resource-type        Resource type operations.
    job                  Job operations.
    setup                Nimrod/G setup functionality.
    compile              Compile a planfile.
    genconfig            Generate a default configuration file.
    agent                Agent Operations.
    staging              Execute staging commands.

Build Instructions

Use the nimw.sh wrapper script to invoke the CLI via Gradle.

To generate a tarball, use gradle nimrod:assembleDist.

Requirements

  • Java 11+
  • Gradle 5.3.1+

Installation

  • Create a nimrod.ini configuration file in ~/.config/nimrod
    • A sample is provided in nimrodg-cli/src/main/resources
  • Create a setup configuration file. This can be placed anywhere.
    • A sample is provided in nimrodg-cli/src/main/resources
  • Run nimrod setup init /path/to/setup-config.ini
  • You're ready to go.

License

This project is licensed under the Apache License, Version 2.0:

Copyright © 2019 The University of Queensland

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

3rd-party Licenses

Project | License | License URL
Antlr4 | The BSD License | http://www.antlr.org/license.html
icu4j | Unicode/ICU License | http://source.icu-project.org/repos/icu/trunk/icu4j/main/shared/licenses/LICENSE
PgJDBC | BSD-2-Clause License | https://jdbc.postgresql.org/about/license.html
ini4j | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Bouncy Castle Crypto APIs | Bouncy Castle License | https://www.bouncycastle.org/license.html
Jersey | CDDL 1.1 | https://jersey.github.io/license.html
sqlite-jdbc | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
RabbitMQ Java Client Library | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache log4j2 | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Commons CSV | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Commons IO | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Commons Collections | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Tomcat | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
Apache Mina SSHD | Apache 2.0 | http://www.apache.org/licenses/LICENSE-2.0.txt
java_text_tables | MIT License | https://raw.githubusercontent.com/iNamik/java_text_tables/master/LICENSE.txt

nimrodg's People

Contributors

hoangnguyen177, vs49688

nimrodg's Issues

Job count querying

The Nimrod Portal requires completed, failed, running, pending, and total counts of each job. This information isn't available unless the status of each job (and thus each attempt) is queried and counted, which is slow.

A new method, JobCounts getJobCounts(Experiment exp);, should be added to NimrodAPI to provide this information efficiently.
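
A minimal sketch of what this could look like; the JobCounts fields below are illustrative only and not part of the existing API:

/* Hypothetical value class; the field names are illustrative. */
final class JobCounts {
	public final long completed;
	public final long failed;
	public final long running;
	public final long pending;
	public final long total;

	JobCounts(long completed, long failed, long running, long pending, long total) {
		this.completed = completed;
		this.failed = failed;
		this.running = running;
		this.pending = pending;
		this.total = total;
	}
}

/* Proposed addition to NimrodAPI. An implementation could back this with a
 * single aggregate query rather than walking every job and attempt. */
JobCounts getJobCounts(Experiment exp);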

debug2: Control master terminated unexpectedly

Travis builds sometimes fail with this error. I have no idea what causes it and it seems extremely inconsistent; it has only ever been seen on Travis builds, never on a local build.

https://travis-ci.org/github/UQ-RCC/nimrodg/jobs/712805294

    [Test worker] TRACE au.edu.uq.rcc.nimrodg.shell.OpenSSHClient - Executing command: /usr/bin/ssh -l user -i /tmp/junit9299936063122352927/openssh-1882757806-key -p 2292 -oPasswordAuthentication=no -oKbdInteractiveAuthentication=no -oChallengeResponseAuthentication=no -oBatchMode=yes -oControlMaster=auto -oControlPersist=yes -oControlPath=/tmp/junit9299936063122352927/openssh-1882757806-control -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oLogLevel=DEBUG3 127.0.0.1 -E /tmp/junit9299936063122352927/openssh-1882757806-log01.txt -- scp -q -p -t /asdf
    [Test worker] TRACE au.edu.uq.rcc.nimrodg.shell.OpenSSHClient - Attempting to dump OpenSSH log file at /tmp/junit9299936063122352927/openssh-1882757806-log01.txt
    [Test worker] TRACE au.edu.uq.rcc.nimrodg.shell.OpenSSHClient - debug1: Reading configuration data /home/travis/.ssh/config
    debug1: /home/travis/.ssh/config line 1: Applying options for *
    debug1: /home/travis/.ssh/config line 2: Deprecated option "useroaming"
    debug1: Reading configuration data /etc/ssh/ssh_config
    debug1: /etc/ssh/ssh_config line 19: Applying options for *
    debug1: auto-mux: Trying existing master
    debug2: fd 4 setting O_NONBLOCK
    debug2: mux_client_hello_exchange: master version 4
    debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
    debug3: mux_client_request_session: entering
    debug3: mux_client_request_alive: entering
    debug3: mux_client_request_alive: done pid = 7067
    debug3: mux_client_request_session: session request sent
    debug1: mux_client_request_session: master session id: 2
    debug3: mux_client_read_packet: read header failed: Broken pipe
    debug2: Control master terminated unexpectedly

[Nimrod/K] A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)

Running in Nimrod/K with

parameters = {"x", "y"}
jobs = {"x"=0, "y"=0}

throws the following exception:

ptolemy.kernel.util.IllegalActionException: [SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
  with tag colour {NOCOLOUR}
  in .Unnamed1.Nimrod/G Actor
Because:
[SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
	at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:575)
Caused by: au.edu.uq.rcc.nimrodg.impl.base.db.NimrodSQLException: [SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:540)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:67)
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:65)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:124)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:58)
	at au.edu.uq.rcc.nimrod.NimrodGActor.fire(NimrodGActor.java:340)
	at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:464)
Caused by: au.edu.uq.rcc.nimrodg.impl.base.db.NimrodSQLException: [SQLITE_CONSTRAINT_CHECK]  A CHECK constraint failed (CHECK constraint failed: nimrod_jobs)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:540)
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:67)
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:65)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:124)
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addJobs(TempNimrodAPIImpl.java:58)
	at au.edu.uq.rcc.nimrod.NimrodGActor.fire(NimrodGActor.java:340)
	at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:464)

Actuator proxying

The SSH actuators should support proxying. Perhaps modify the resource configuration to include the following:

{
  "tunnels": [
    {
      "type": "reverse",
      "srcport": 5671,
      "dsthost": "203.101.225.94",
      "dstport": 5671
    }
  ]
}

MINA SSHD would create those tunnels upon connection. The user can specify custom AMQP/Transfer URIs in the resource configuration as normal, except this time they'd reference the head node.
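
As a rough sketch (assuming a recent MINA SSHD API; this is not the actual actuator code), a "reverse" tunnel entry like the one above could be applied once the client session is up:

import java.io.IOException;
import org.apache.sshd.client.session.ClientSession;
import org.apache.sshd.common.util.net.SshdSocketAddress;

/* Sketch: open a reverse tunnel. The remote side listens on srcport and
 * connections to it are forwarded (via the client) to dsthost:dstport. */
static void openReverseTunnel(ClientSession session, int srcPort, String dstHost, int dstPort) throws IOException {
	session.startRemotePortForwarding(
		new SshdSocketAddress("", srcPort),
		new SshdSocketAddress(dstHost, dstPort)
	);
}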

Add support for querying/filtering job attempts based on an experiment.

This will be useful for the master when doing state recovery.

Collection<? extends Job> NimrodAPI#filterJobAttempts(Experiment exp, EnumSet<JobAttempt.Status> status, long start, int limit);

Collection<? extends Job> Job#filterAttempt(EnumSet<JobAttempt.Status> status, long start, int limit);
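
As a hedged usage sketch (taking the proposed signatures as-is; rebuildMappingsFor() is a hypothetical helper), the master could page through running attempts during recovery:

/* Sketch of master-side state recovery using the proposed API. */
long start = 0;
final int limit = 1000;
Collection<? extends Job> batch;
do {
	batch = nimrod.filterJobAttempts(exp, EnumSet.of(JobAttempt.Status.RUNNING), start, limit);
	batch.forEach(j -> rebuildMappingsFor(j)); /* hypothetical helper */
	start += limit;
} while(!batch.isEmpty());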

External Execution Actuator

A rather simple concept: invoke a program that will start agents.

The program shall have two commands:

Usage:
  ./extlaunch.py launch
  ./extlaunch.py kill <uuid>
  • launch takes a JSON dump of the following format from stdin:
{
	"resource_path": "fl012",
	"amqp_uri": "amqp://user:pass@asdfasdfasd",
	"no_verify_peer": true,
	"no_verify_host": true,
	"cert_data": "asdfasd==",
	"amqp_routing_key": "iamthemaster",
	"uuids": [
		"ef449411-1407-4702-ad12-06a639c065fb",
		"a72188ba-24a7-4dd8-95c0-dea896a70141",
		"b725ad16-db6c-4744-995e-b254e4fcfd08"
	],
	"config": {
		"limit": 10,
		"program": "/home/user/Desktop/nimrod-embedded/extlaunch.py"
	}
}

Nimrod doesn't care how the program invokes the agents.
In the case of Embedded Nimrod, it'll generate a script and invoke ssh to another node, bypassing the requirement of having access to the host's keys.

  • kill <uuid> -- Attempt to kill a single agent.

Any nonzero return value is considered a failure.
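
A rough sketch of how the actuator side might drive such a program (ProcessBuilder is standard Java; building the JSON string is assumed to happen elsewhere):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/* Sketch: run "extlaunch.py launch", feed the JSON dump on stdin,
 * and treat any nonzero exit status as a launch failure. */
static boolean runLaunch(Path program, String json) throws IOException, InterruptedException {
	Process p = new ProcessBuilder(program.toString(), "launch")
		.redirectErrorStream(true)
		.start();
	try(OutputStream os = p.getOutputStream()) {
		os.write(json.getBytes(StandardCharsets.UTF_8));
	}
	return p.waitFor() == 0;
}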

Agent Protocol Updates

Everything that should be in the next agent protocol update:

  • Protocol versioning
    • A single number field would suffice. Agents are shipped with Nimrod, so this isn't a problem.
  • SSH key path field
    • So SSH transfer targets can be used securely
    • Or embed the key directly
  • state info in agent.pong
    • for accounting and debugging purposes
  • a new "resync" message
    • If agents get out-of-sync for some reason, the master can force-reset them.
  • a new CommandResult.Status enum value of Failed
    • Used for explicitly stating a process has returned nonzero. Blocks #16
  • add a timestamp field to each message
    • To AMQPBasicProperties (POSIX timestamp)
    • timestamp field (ISO8601)
    • X-NimrodG-Sent-At header (ISO8601)
      • time the message was actually sent
  • message signing
    • NIM1-HMAC-SHA224
    • NIM1-HMAC-SHA256
    • NIM1-HMAC-SHA384
    • NIM1-HMAC-SHA512
    • Change datestamp to timestamp
    • Use nonce
    • Add a nonce field. For future use.
  • agent.submit token field
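
For the message-signing item above, the NIM1-HMAC-* names are only proposed scheme identifiers; a minimal sketch of the underlying MAC with the JDK (how the payload is canonicalised, and where the timestamp and nonce go, is the open design question) might be:

import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/* Sketch: raw HMAC-SHA256 over an already-canonicalised message payload. */
static byte[] sign(byte[] key, String canonicalPayload) throws Exception {
	Mac mac = Mac.getInstance("HmacSHA256");
	mac.init(new SecretKeySpec(key, "HmacSHA256"));
	return mac.doFinal(canonicalPayload.getBytes(StandardCharsets.UTF_8));
}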

Unique Directory for each Actuator

Currently using the below, which isn't unique enough.

String.format("act-%s-%d", this.getClass().getSimpleName(), (long)uri.hashCode() & 0xFFFFFFFFL)

Should add a function in ActuatorUtils to do this properly.
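
A possible sketch for such a helper (the exact inputs are debatable; this assumes hashing the resource name and the full URI rather than relying on hashCode()):

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/* Sketch: derive a stable, collision-resistant directory name from the
 * actuator class, the resource name and the full URI. */
static String actuatorDirectoryName(Class<?> actuatorClass, String resourceName, URI uri) throws NoSuchAlgorithmException {
	MessageDigest md = MessageDigest.getInstance("SHA-256");
	md.update(resourceName.getBytes(StandardCharsets.UTF_8));
	md.update((byte)0);
	md.update(uri.toString().getBytes(StandardCharsets.UTF_8));
	StringBuilder sb = new StringBuilder("act-").append(actuatorClass.getSimpleName()).append('-');
	for(byte b : md.digest()) {
		sb.append(String.format("%02x", b));
	}
	return sb.toString();
}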

Statelessness

This is purely on the master.

All state is in the database except:

  • Launch failure counts

There are several ways of doing this:

  • The master and schedulers manually recover their state via the DB.
    • Easiest, as they can access their own internals.
    • Issues arise in the agent scheduler when rebuilding job<->agent mappings, as it has no access to that data.
  • The master gives the schedulers a NULL backend and "replays" the state.

"no schema has been selected to create in" when using currentSchema in JDBC URL

$ nimrod setup init
au.edu.uq.rcc.nimrodg.setup.NimrodSetupAPI$SetupException: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
	at au.edu.uq.rcc.nimrodg.impl.postgres.SetupAPIImpl.reset(SetupAPIImpl.java:144)
	at au.edu.uq.rcc.nimrodg.cli.commands.Setup.execute(Setup.java:110)
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:36)
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:125)
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:145)
Caused by: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
	at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
	at au.edu.uq.rcc.nimrodg.impl.postgres.SetupAPIImpl.reset(SetupAPIImpl.java:142)

Remove NimrodServeAPI

It is old and unused. The following should be removed:

  • NimrodServeAPI interface
  • nimrod_experiments::file_token Postgres
  • nimrod_experiments::file_token Sqlite3
  • agent.submit token field
    • Is being done as part of #17

The master thinks agents already exist on a resource and doesn't spawn any.

This happens after a hard crash.

zane=> TABLE nimrod_resource_agents;
 id |  state   |             queue              |              agent_uuid              | shutdown_signal | shutdown_reason |            created            | expiry_time | expired | location | location_full 
----+----------+--------------------------------+--------------------------------------+-----------------+-----------------+-------------------------------+-------------+---------+----------+---------------
  1 | READY    | amq.gen-thWSdFUXYM9_EgVti14nUA | ee7d5374-c222-4dc0-8e3a-4819c3550c1f |              -1 | HostSignal      | 2018-08-06 15:57:25.873506+10 |             | f       |        1 | local
  2 | READY    | amq.gen-j5xTdZeGjlQlR41iecTZvg | ff5d67b3-84ed-436a-917e-746a5503074e |              -1 | HostSignal      | 2018-08-06 15:57:50.770861+10 |             | f       |        1 | local
  3 | READY    | amq.gen-8FS_l8w1y_ZNHuHhXrJgDw | 48e076ab-f8c1-4f64-a190-6299e0e59119 |              -1 | HostSignal      | 2018-08-06 15:58:21.721877+10 |             | f       |        1 | local
  4 | SHUTDOWN | amq.gen-wDSdoFnccCrF_Uwu-mU_VA | 7ac88025-4245-4af2-af9f-f6059ffe3944 |               9 | HostSignal      | 2018-08-06 15:59:15.35432+10  |             | t       |        1 | local
(4 rows)

Ideally, the master would "rescan" agents at startup to see if they're still alive.
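
One possible shape for such a rescan (every name here is hypothetical and does not reflect the real master or API surface):

/* Sketch only: ping every non-SHUTDOWN agent recorded for the resource and
 * expire those that never answer with an agent.pong. */
for(AgentInfo ai : getResourceAgents(resource)) {      /* hypothetical lookup */
	if(ai.getState() == Agent.State.SHUTDOWN) {
		continue;
	}
	sendPing(ai.getUUID());                             /* hypothetical */
	expireIfNoPongWithin(ai.getUUID(), PING_TIMEOUT);   /* hypothetical */
}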

Attempts aren't failed if the final command fails

If onerror == fail and the last command in a job fails, the job scheduler doesn't count it as a failure.

This is a logic error in DefaultJobScheduler.java:

if(au.getAction() == AgentUpdate.Action.Stop) {
	if(cr.index < maxIdx || cr.status != CommandResult.CommandResultStatus.SUCCESS) {
		/* A command has failed and caused the job to stop. */
		ops.updateJobFinished(att, true);
	} else {
		/* We've finished successfully. */
		ops.updateJobFinished(att, false);
	}
}

Portal API feature requests

  1. Experiment
    • list job-related stats: the jobs belonging to an experiment and their status
    • list the number of agents belonging to an experiment
    • list CPU hours based on resource type
    • mean, median jobs
  2. Resource
    • show the number of CPU hours used so far for each resource
    • show the total number of successful and failed jobs
    • mean, median jobs
    • add a resource via the portal API, similar to the Nimrod API
  3. Plan file
    • compile returns 0 even when the compile command fails

And more.

Mina doesn't fully resolve the host key

The ecdsa-sha2-nistp521 key is missing the trailing aA==, causing a key verification error.

Command:

./nimw.sh resource add flashlite pbspro -- --uri=ssh://[email protected] --key /path/to/key --limit=10 --max-batch-size=2 --add-batch-res-scale=ncpus:1 --add-batch-res-static=walltime:01:00:00 --add-batch-res-scale=mem:1GiB -- -A UQ-RCC
{
  "agent_platform": "x86_64-pc-linux-musl",
  "transport": {
    "name": "sshd",
    "uri": "ssh://[email protected]",
    "keyfile": "/path/to/key",
    "hostkeys": [
      "ssh-dss AAAAB3NzaC1kc3MAAACBAJ5dwWbFpwVHS1XfxfNuEFG+gwt770d/eC1sKDkLkAmilGko2AB+DS5QrEkWUOKuhn0dsvuvi9g14iSz+439fqn0tHF0LPzp7KqGZmloGkjSOVjqy4JkAk+xthZrt671j0KUuq3DxIbmibcHRyuQQDCxjxZJnyz2RkSiP06N19V1AAAAFQDpoXU890ULUUVDnMlaHmYODe2nzwAAAIBna51ORkuWCviOBuHADVEiuC27ithK1YHzQW84eAXqUKiUXZWbEV7ByBSGRrzEc7WZU1e2dMAb7uACQburkQy3OIf2iJc6zZzVrYLLSZJmdtX/94A3CYWN/j2AGeZR+zNmX3DX8tl+Q0i2Amg22ewl24TSvy0q+fU/RFE9tO5NmgAAAIBDXXcLual8D2GiW5zeRKmp/EyfQjAzRJtj0v8lTntddXzE8ciYcWYARCrTYBjALvbAoPQqvUbvicABgaoGHXkmtoe7g4H0NpgcsXSZnEhqKsXUgaAZraO0r2qRBBeFtE3AMKmg1VPzSvgfdJMmezL0leJUhTmYQb/aEtADhLg3Ow==",
      "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAplbm/kI54sQLlIdGMH5tgf18Z+d6X3Ik3/y1T4l5ddDN6nPvXkVL4WsKJD2boIWo6L7kBiuhz5KY7AtQrLF+NNDoltP/x2j4jdxGnXTUakt59ARrPCcNPAhINOZMNOHqos2B1T0Ca5ZpYeZDu7yJ25Q1J6OpIayxanPot9MvchXTzJ5/dVvVF092ECuGXA9KfclzV0Al486hcWnEENm7KGxfCYY+46hGGpOCBcc+aHtL5mgNj39tRp7d4tK3cNT39SbAvfmd/V5DnTD8ODaPGS3rISYSWuGw/xQq/vpfGDRGtD4/TmKW1I0O+kn95B56HuZ4jiRQSZli5T6WcMdoWw==",
      "ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAD7bmdQjLozXCuciMh4rJ2TkAjznmqmKUdxTkgJDGeAXa2RtgkLSkYIV2SbzSVHgnPJMCQiAgvuCOuLArQS5OpvOwAsmwBoeRamhazYXuQGBwlycpBWJM8lZ7nh9vZAD2skn9MGJdlYaL4WsmQId6Bf3PnS78dQsVFQw3mGMkK5NHlz"
    ]
  },
  "tmpvar": "TMPDIR",
  "pbsargs": ["-A", "UQ-RCC"],
  "limit": 10,
  "max_batch_size": 2,
  "batch_config": [
    { "name": "walltime", "value": 86400, "scale": false },
    { "name": "ncpus", "value": 1, "scale": true },
    { "name": "mem", "value": 1073741824, "scale": true }
  ]
}
$ ssh-keyscan flashlite.rcc.uq.edu.au
# flashlite.rcc.uq.edu.au:22 SSH-2.0-OpenSSH_7.4
flashlite.rcc.uq.edu.au ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAplbm/kI54sQLlIdGMH5tgf18Z+d6X3Ik3/y1T4l5ddDN6nPvXkVL4WsKJD2boIWo6L7kBiuhz5KY7AtQrLF+NNDoltP/x2j4jdxGnXTUakt59ARrPCcNPAhINOZMNOHqos2B1T0Ca5ZpYeZDu7yJ25Q1J6OpIayxanPot9MvchXTzJ5/dVvVF092ECuGXA9KfclzV0Al486hcWnEENm7KGxfCYY+46hGGpOCBcc+aHtL5mgNj39tRp7d4tK3cNT39SbAvfmd/V5DnTD8ODaPGS3rISYSWuGw/xQq/vpfGDRGtD4/TmKW1I0O+kn95B56HuZ4jiRQSZli5T6WcMdoWw==
# flashlite.rcc.uq.edu.au:22 SSH-2.0-OpenSSH_7.4
flashlite.rcc.uq.edu.au ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAD7bmdQjLozXCuciMh4rJ2TkAjznmqmKUdxTkgJDGeAXa2RtgkLSkYIV2SbzSVHgnPJMCQiAgvuCOuLArQS5OpvOwAsmwBoeRamhazYXuQGBwlycpBWJM8lZ7nh9vZAD2skn9MGJdlYaL4WsmQId6Bf3PnS78dQsVFQw3mGMkK5NHlzaA==
# flashlite.rcc.uq.edu.au:22 SSH-2.0-OpenSSH_7.4
flashlite.rcc.uq.edu.au ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOeGyKYiqW7FEXnjGWOwShGWEhu124BLktd/q8CsiJwt

Move HPC definitions into the database

As it stands, API behaviour is dependent on hpc.json which may be different on each machine. This should be stored in the database instead.

Will be configured at setup time along with everything else.

Update shexec syntax

Update the shexec syntax to support different shells:

shexec[:<plat>[:<shell>]]
shexec:win32:cmd
shexec:win32:powershell
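
A sketch of how the extended syntax could be split into its parts (the real tokeniser lives in the planfile compiler and may differ):

/* Sketch: parse "shexec[:<plat>[:<shell>]]", e.g.
 * "shexec:win32:powershell" -> plat = "win32", shell = "powershell". */
static String[] parseShexec(String spec) {
	String[] parts = spec.split(":", 3);
	String plat  = parts.length > 1 ? parts[1] : null;
	String shell = parts.length > 2 ? parts[2] : null;
	return new String[] { plat, shell };
}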

"java.sql.SQLException: No such command" when using SQLite backend.

Sometimes Nimrod will exit with java.sql.SQLException: No such command. This has only been seen with the SQLite backend.

It is caused by DBExperimentHelpers#getCommandIdForResult() being called with cmdIndex == -1, traced to JobScheduler#recordCommandResult() entering NimrodMasterAPI with an invalid argument.

This doesn't happen on Postgres because of the following in _exp_t_command_result_add():

/* If NULL or negative command index, assume the next one. */
IF NEW.command_index IS NULL OR NEW.command_index < 0 THEN
    SELECT COALESCE(MAX(command_index) + 1, 0) INTO NEW.command_index FROM nimrod_command_results WHERE attempt_id = NEW.attempt_id;
END IF;

I'm not sure whether or not this behaviour is correct. Further investigation is required.

Stack Trace:

au.edu.uq.rcc.nimrodg.api.NimrodException$DbError: java.sql.SQLException: No such command
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:574) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.makeException(SQLite3DB.java:71) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:65) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.addCommandResult(TempNimrodAPIImpl.java:372) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_JobOperations.recordCommandResult(Master.java:699) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.onJobFailure(DefaultJobScheduler.java:178) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_AgentOperations.lambda$reportJobFailure$10(Master.java:908) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.lambda$processQueue$18(Master.java:568) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
	at au.edu.uq.rcc.nimrodg.master.Master.processQueue(Master.java:568) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.startProc(Master.java:456) ~[nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.tick(Master.java:315) [nimrodg-master-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.commands.MasterCmd.execute(MasterCmd.java:157) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLICommand.execute(NimrodCLICommand.java:43) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:43) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:125) [main/:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:145) [main/:?]
Caused by: java.sql.SQLException: No such command
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.DBExperimentHelpers.getCommandIdForResult(DBExperimentHelpers.java:857) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.DBExperimentHelpers.addCommandResult(DBExperimentHelpers.java:864) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.sqlite3.SQLite3DB.addCommandResult(SQLite3DB.java:429) ~[nimrodg-impl-sqlite3-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.TempNimrodAPIImpl.lambda$addCommandResult$38(TempNimrodAPIImpl.java:372) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	at au.edu.uq.rcc.nimrodg.impl.base.db.SQLUUUUU.runSQLTransaction(SQLUUUUU.java:50) ~[nimrodg-impl-base-db-1.9.0-100-0c4fbaff-longspawn-dirty.jar:?]
	... 14 more

Play nice with Jigsaw

The nimrod-* projects should be changed so they play nice with JPMS.

  • Add module-info.java to each project with the appropriate exports.
  • Make each project only "manage" one package. This is mostly done except for nimrodg-internal-api.

None of this should be done until Gradle properly supports it anyway.
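
As an illustration only (the module name and requires clauses below are guesses, not the project's actual layout), a module descriptor could look like:

/* Hypothetical module-info.java for the API project. */
module au.edu.uq.rcc.nimrodg.api {
	exports au.edu.uq.rcc.nimrodg.api;
	requires java.sql;
}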

Handle long-spawning agents.

Sometimes heartbeating will mark an agent for expiry while it's still in WAITING_FOR_HELLO. This can happen when an agent is stuck in a PBS/SLURM queue.

Options:

  1. Make AgentScheduler#onAgentExpiry() accept launching agents.

    • It is up to the scheduler to handle this.
    • The agent may launch and connect later.
  2. In Master#doExpire() call AgentScheduler#onAgentLaunchFailure().

    • The actuator may not know about the expiry, which causes issues.
  3. Ask the actuator:

Something like this in Actuator:

/** Agent status from an actuator's POV. */
enum AgentStatus {
	/** Agent is still launching. May be stuck in a queue. */
	Launching,
	/** Agent has launched, but not connected yet. */
	Launched,
	/** Agent has connected. */
	Connected,
	/** Agent has disconnected. */
	Disconnected,
	/** Unknown. The agent may not be ours, or we have stopped tracking it. */
	Unknown
}

default AgentStatus queryStatus(UUID uuid) {
	return AgentStatus.Unknown;
}

If the state is Launching, then either do nothing or extend the waiting time. Otherwise continue as normal.
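
In Master#doExpire(), that could look roughly like this (hedged; extendExpiry() and the exact method arguments are hypothetical):

/* Sketch: consult the actuator before expiring an agent still in WAITING_FOR_HELLO. */
Actuator.AgentStatus st = actuator.queryStatus(agent.getUUID());
if(st == Actuator.AgentStatus.Launching) {
	/* Probably still sitting in a PBS/SLURM queue; give it more time. */
	extendExpiry(agent);
} else {
	agentScheduler.onAgentExpiry(agent.getUUID());
}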

Strange scheduling behaviour for failed jobs.

Job picalc/5244 fails 3 times and is then scheduled (and succeeds) multiple times. Only seen with sqlite3.

[22/05/2019 21:19:57:455 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:27:371 |      a.e.u.r.n.r.s.OpenSSHClient | TRACE] Executing command: ssh -q -o PasswordAuthentication=no -o StrictHostKeyChecking=no np-compute-20 -- mkdir -p /mnt/nimrod/agent-33e070cc-5831-43e9-877b-320afc0bc549
[22/05/2019 21:20:28:293 |      a.e.u.r.n.r.s.OpenSSHClient | TRACE] Executing command: ssh -q -o PasswordAuthentication=no -o StrictHostKeyChecking=no np-compute-20 -- /home/nimrod/.nimrod/d33f7f73-ca03-4010-9f32-15fc48555621/agent-x86_64-pc-linux-musl --uuid 33e070cc-5831-43e9-877b-320afc0bc549 --amqp-uri amqp://nimrod:[email protected]/nimrod --amqp-routing-key nimrod --work-root /mnt/nimrod/agent-33e070cc-5831-43e9-877b-320afc0bc549 --batch --output workroot
[22/05/2019 21:20:30:027 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:30:040 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:30:784 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:30:784 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:31:443 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:31:443 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:31:855 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:31:855 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:32:600 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:32:600 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:33:310 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:33:310 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:34:090 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:34:090 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:35:041 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:35:041 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:35:672 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:35:672 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:36:142 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:36:142 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:37:113 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:37:113 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:37:952 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:37:952 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:39:268 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:39:269 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:40:542 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent 'c3ac4cd2-727c-421b-9f1d-00e32348f8c1'
[22/05/2019 21:20:40:961 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.hello)
[22/05/2019 21:20:40:961 |               a.e.u.r.n.m.Master | TRACE] Received agent.hello with (uuid, queue) = (33e070cc-5831-43e9-877b-320afc0bc549, amq.gen-iCyuqTcCwtudGsCAvxjD-g)
[22/05/2019 21:20:40:962 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from WAITING_FOR_HELLO -> READY
[22/05/2019 21:20:42:047 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, WAITING_FOR_HELLO, READY)
[22/05/2019 21:20:42:164 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' failed on attempt 1, rescheduling...
[22/05/2019 21:20:42:164 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' failed on attempt 2, rescheduling...
[22/05/2019 21:20:42:164 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' failed on attempt 3, rescheduling...
[22/05/2019 21:20:42:224 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:42:224 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:42:225 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Scheduling job 'picalc/5244'
[22/05/2019 21:20:43:592 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5594' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 21:20:43:592 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 21:20:45:916 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:20:45:918 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:20:45:918 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:20:45:959 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 21:37:29:922 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 4!
[22/05/2019 21:48:42:627 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:48:42:627 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from BUSY -> READY
[22/05/2019 21:48:42:628 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, BUSY, READY)
[22/05/2019 21:48:42:631 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/7101' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 21:48:42:632 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 21:48:42:735 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:48:42:736 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 21:48:42:835 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 21:48:42:936 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:480 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:480 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from BUSY -> READY
[22/05/2019 22:17:24:480 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, BUSY, READY)
[22/05/2019 22:17:24:484 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/8622' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 22:17:24:485 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 22:17:24:605 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:605 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 22:17:24:711 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:17:24:826 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:52:051 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent '92c20286-4e0e-4e53-85df-caaedb3a4f5a'
[22/05/2019 22:46:52:157 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 5!
[22/05/2019 22:46:52:168 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 6!
[22/05/2019 22:46:52:168 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 7!
[22/05/2019 22:46:52:820 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent 'a716c476-543e-4078-b9e6-cb22c840095d'
[22/05/2019 22:46:52:938 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 8!
[22/05/2019 22:46:53:047 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 9!
[22/05/2019 22:46:53:056 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 10!
[22/05/2019 22:46:55:440 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:55:440 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from BUSY -> READY
[22/05/2019 22:46:55:441 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, BUSY, READY)
[22/05/2019 22:46:55:447 |               a.e.u.r.n.m.Master |  INFO] Run job 'picalc/5244' on agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 22:46:55:448 |               a.e.u.r.n.m.Master | DEBUG] Agent 33e070cc-5831-43e9-877b-320afc0bc549: State change from READY -> BUSY
[22/05/2019 22:46:55:560 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:55:560 | .u.r.n.m.s.DefaultAgentScheduler | TRACE] onAgentStateUpdate(33e070cc-5831-43e9-877b-320afc0bc549, np_compute_20, READY, BUSY)
[22/05/2019 22:46:55:561 | .e.u.r.n.m.s.DefaultJobScheduler |  INFO] Job 'picalc/5244' succeeded on attempt 11!
[22/05/2019 22:46:55:562 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:55:564 |               a.e.u.r.n.m.Master | ERROR] Caught exception during RUN:
[22/05/2019 22:46:55:564 |               a.e.u.r.n.m.Master | ERROR] Catching
java.lang.NullPointerException: null
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.tickJobAttempt(DefaultJobScheduler.java:172) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.onJobUpdate(DefaultJobScheduler.java:143) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_AgentListener.lambda$onJobUpdate$3(Master.java:912) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.lambda$processQueue$14(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
	at au.edu.uq.rcc.nimrodg.master.Master.processQueue(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.startProc(Master.java:436) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.tick(Master.java:304) [nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.commands.MasterCmd.execute(MasterCmd.java:137) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLICommand.execute(NimrodCLICommand.java:42) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:36) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:124) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:137) [nimrodg-cli-1.1.1.jar:?]
[22/05/2019 22:46:56:078 |                a.e.u.r.n.m.AAAAA | TRACE] Cancelling pending launches...
[22/05/2019 22:46:56:290 |               a.e.u.r.n.m.Master | TRACE] Terminating agent '33e070cc-5831-43e9-877b-320afc0bc549'
[22/05/2019 22:46:56:415 |               a.e.u.r.n.m.Master | DEBUG] doProcessAgentMessage(33e070cc-5831-43e9-877b-320afc0bc549, agent.update)
[22/05/2019 22:46:56:419 |               a.e.u.r.n.m.Master | ERROR] Caught exception during RUN:
[22/05/2019 22:46:56:419 |               a.e.u.r.n.m.Master | ERROR] Catching
java.lang.NullPointerException: null
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.tickJobAttempt(DefaultJobScheduler.java:172) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.sched.DefaultJobScheduler.onJobUpdate(DefaultJobScheduler.java:143) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master$_AgentListener.lambda$onJobUpdate$3(Master.java:912) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.lambda$processQueue$14(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
	at au.edu.uq.rcc.nimrodg.master.Master.processQueue(Master.java:502) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.stoppingProc(Master.java:639) ~[nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.master.Master.tick(Master.java:304) [nimrodg-master-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.commands.MasterCmd.execute(MasterCmd.java:137) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLICommand.execute(NimrodCLICommand.java:42) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.DefaultCLICommand.execute(DefaultCLICommand.java:36) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.cliMain(NimrodCLI.java:124) [nimrodg-cli-1.1.1.jar:?]
	at au.edu.uq.rcc.nimrodg.cli.NimrodCLI.main(NimrodCLI.java:137) [nimrodg-cli-1.1.1.jar:?]
[22/05/2019 22:46:58:255 |                a.e.u.r.n.m.AAAAA |  INFO] Waiting on 40 actuator(s)...

Add the ability to instrument jobs

It would be useful to be able to instrument job/agent performance along the lines of iostat:

#root@gpfs1 10:25:50 /local/home/user> iostat
Linux 3.10.0-514.el7.x86_64 (host) 	10/08/18 	_x86_64_	(40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.59    0.00    0.87    0.01    0.00   98.53

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               4.31         2.57        38.68   32836930  493431848
sdb               0.39         0.26        21.33    3267452  272117192
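
A very rough sketch of what agent-side sampling could collect with the JDK alone (CPU only; per-device I/O figures like iostat's would need /proc or an external tool):

import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

/* Sketch: sample process and system CPU load; values are fractions in [0.0, 1.0],
 * or negative if the platform cannot provide them. */
static double[] sampleCpu() {
	OperatingSystemMXBean os =
		(OperatingSystemMXBean)ManagementFactory.getOperatingSystemMXBean();
	return new double[] { os.getProcessCpuLoad(), os.getSystemCpuLoad() };
}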
