Self-contained fat-jar micro-service to listen for events from Prison systems (NOMIS) to update offender sentence information in Probation systems (Delius)
Docker
Even when running the tests docker is used by the integration test to load localstack
(for AWS services). The build will automatically download and run localstack
on your behalf.
./gradlew build
localstack
is used to emulate the AWS SQS service. When running the integration test this will be started automatically. If you want the tests to use an already running version of locastack
run the tests with the environment AWS_PROVIDER=localstack
. This has the benefit of running the test quicker without the overhead of starting the localstack
container.
Any commands in localstack/setup-sns.sh
will be run when localstack
starts, so this should contain commands to create the appropriate queues.
Running all services locally:
TMPDIR=/private$TMPDIR docker-compose up
Queues and topics will automatically be created when the localstack
container starts.
Running all services except this application (hence allowing you to run this in the IDE)
TMPDIR=/private$TMPDIR docker-compose up --scale prison-to-probation-update=0
Check the docker-compose file for sample environment variables to run the application locally.
Or to just run localstack
which is useful when running against an a non-local test system
TMPDIR=/private$TMPDIR docker-compose up localstack
In all of the above the application should use the host network to communicate with localstack
since AWS Client will try to read messages from localhost rather than the localstack
network.
There are two handy scripts to add messages to the queue with data that matches either the T3 test environment or data in the test Docker version of the apps
T3 test data:
./create-prison-movements-messages-t3.bash
local test data:
./create-prison-movements-messages-local.bash
There are 4 source sets to handle the unit, integration, e2e and smoke tests. Note that check
only runs the unit tests
and you have to run the integration and e2e tests manually.
The source set test
contains the unit tests. These are run with the command ./gradlew test
and are also included in the check when running command ./gradlew check
.
The source set testIntegration
contains the integration tests. These are run with the
command ./gradlew testIntegration
but are NOT included in the check when running command ./gradlew check
and so must
be run manually.
Note that the integration tests currently use TestContainers to start localstack and so you do not need to start localstack manually.
If you DO wish to run localstack manually (as is done in the Circle build) then you must:
- start localstack with command
TMPDIR=/private$TMPDIR docker-compose up localstack
- run the tests with command
AWS_PROVIDER=localstack ./gradlew testIntegration
TL;DR - Integration tests do not start the JMS listener and do not start the scheduled DynamoDB message processor. If you need these then write an E2E test.
There is an issue with Spring such that when a test requires a new context (e.g. when configuration properties change),
the old (cached) context continues to a) read messages using the JMS listener and b) run scheduled jobs that read from
the DynamoDB database. The proper "Spring way" to handle this issue is to use @DirtiesContext
to stop the old context.
Unfortunately a separate issue exists where using @DirtiesContext
causes the Netty executor pool to be closed and the
new context is unable to make any outbound requests.
The upshot of this unfortunate combination of issues is that tests that require the JMS listener / DynamoDB scheduler to NOT run are corrupted by the JMS listener / DynamoDB scheduler running in the cached context.
Therefore you mustn't add any tests to the testIntegration
test module that use the real JMS listener or DynamoDB
scheduler - these should be added to the testE2e
test module.
The source set testE2e
contains end to end tests. These are defined as tests that require a real JMS listener and real
DynamoDB scheduled tasks to run.
These tests are run with command ./gradlew testE2e
but are NOT included in the check when running
command ./gradlew check
and so must be run manually.
Note that the end to end tests currently use TestContainers to start localstack and so you do not need to start localstack manually.
If you DO wish to run localstack manually (as is done in the Circle build) then you must:
- start localstack with command
TMPDIR=/private$TMPDIR docker-compose up localstack
- run the tests with command
AWS_PROVIDER=localstack ./gradlew testE2e
The source set testSmoke
contains the smoke tests.
These tests are not intended to be run locally, but instead are run against a deployed application (as happens in the Circle build).
For more information on the smoke tests see the project dps-smoketest
.
Understanding the architecture makes live support easier:
The service subscribes to a number of prison offender events
- IMPRISONMENT_STATUS-CHANGED
- EXTERNAL_MOVEMENT_RECORD-INSERTED
- BOOKING_NUMBER-CHANGED
- SENTENCE_DATES-CHANGED
- CONFIRMED_RELEASE_DATE-CHANGED
Each message is processed in up to 3 phases:
- Validation - where the associated booking is inspected to see if the message should be processed
- Initial processing - where an attempt to is made to synchronise data with probation
- Retry processing - where further attempts are made if there was a failure during initial processing
A Topic subscription is made to the offender_events.topic
which is a topic for prison offender events. The subscriptions are defined in the Cloud Platform terraform.
The listener for the queue is defined in PrisonerChangesListenerPusher.
Once a message has been validated it will be added to the data store ready to be picked up for processing. This is a DynamoDB database that is again defined in Cloud Platform terraform. This, essentially, is a queue of messages ready to be processed.
The scheduling of the message is processed by MessageRetryService.
The main processing loop is invoked by SerialiseBookingScheduler which is a Spring Scheduler that itself uses DynamoDB as a persistent store for locking - this ensures only a single pod is processing messages at a time.
Events are held back for a period of time (one hour in production); this is so multiple similar messages are processed for a record all in one go, there is no drip feeding of small changes, e.g. each of the key sentence dates are all changed at once.
Events are sorted and grouped so for each run of the processing events for a particular booking and are processed together rather than being processed by different pods simultaneously. This is done by the MessageAggregator.
Successful processing of an event leads to the DynamoDB record being marked as processed by setting the processedDate. Unsuccessful processing can lead to different paths depending on the event being processed:
- IMPRISONMENT_STATUS-CHANGED - will be retried for a period of time. Given this indicates a significant change to the prisoner's status, e.g. conviction, it may take some time before the associated Delius record has been created or is in the correct state. It will retry for around a month.
- EXTERNAL_MOVEMENT_RECORD-INSERTED - only tried once and then discarded
- BOOKING_NUMBER-CHANGED - only tried once and then discarded
- SENTENCE_DATES-CHANGED - only tried once and then discarded
- CONFIRMED_RELEASE_DATE-CHANGED - only tried once and then discarded
If there is a system exception while processing any of these messages they will be retried. However business "exceptions", for instance if the prisoner does not exist in Delius, are not retried since they are not expected to succeed because of the passing of time.
The IMPRISONMENT_STATUS-CHANGED event is used as a trigger to indicate there has possibly been a new conviction that therefore means the NOMIS and Delius record should be synchronised. Once it has been established the prisoner has a current sentence the initial matching with the Delius record begins. This is a multi-stage process:
- Search for the person in probation-offender-search
- if successful check they have a matching sentence (EVENT) in Delius via community-api
- if successful attempt to set the NOMS number in Delius via community-api
- if successful (or already set) attempt to set the book number in Delius via community-api
- if successful (or already set) attempt to set the prison location in Delius via community-api
- if successful (or already set) attempt to set the key dates in Delius via community-api
- if successful mark record as processed or retry again later
Retrying events marked as requiring a retry is orchestrated by the RetryScheduler. This uses 3 simple CRON expressions to simulate back-off retry processing.
- RETRY_SCHEDULES_SHORT_CRON: "0 0 * * * *"
- RETRY_SCHEDULES_MEDIUM_CRON: "0 15 */4 * * *"
- RETRY_SCHEDULES_LONG_CRON: "0 30 23 * * *"
Which retry schedule a failed record goes into is determined by the Message retryCount. So it will retry once an hour (for 4 hours), then every 4 hours (for 6 hours) and finally once a day until the record is auto-deleted.
The Message deleteBy attribute is set at initial creation and can be extended for a couple of reasons:
- A record should be retried until the sentence ends since Delius has no SLA for data is not in the correct state (e.g. multiple custodial events)
- A record has been processed, but we wish to retain for a period of time for reporting purposes.
Once the date specified by the deleteBy value is reached DynamoDB will automatically delete the record.
If an alert is raised in the dps_alert channel indicating the Dead Letter Queue (DLQ) is filling with messages it means something is going wrong when validating a message.
Given validation is relatively simple it is likely to be related to the prison-api call to get the booking failing.
Check Application Insights to show any errors, while the traces will indicate what messages have been received
exceptions
| where cloud_RoleName == "prison-to-probation-update"
| order by timestamp desc
traces
| where cloud_RoleName == "prison-to-probation-update"
| order by timestamp desc
It will likely be a transient issue with either retrieving an HMPPS auth token or prison-api being down. Once the transient issue has been resolved, the messages from the DLQ should be moved back to the main queue. Currently, there is no endpoint to do this so messages will have to be received and acknowledged manually and sent to the main queue.
Secrets for the AWS credentials are stored in namespace prison-to-probation-update-prod
under ptpu-sqs-dl-instance-output
for the DLQ and ptpu-sqs-instance-output
for the main queue.
A message could also appear on the DLQ if we were unable to add the message to the DynamoDB table.
The health check captures the health of DynamoDB. It also displays the table name that is dynamically created, and the number of rows currently on the table. Further details about the table state can be seen by running the AWS CLI commands, again secrets for accessing can be retrieved from the namespace as message-dynamodb-output which also includes the table ARN.
aws dynamodb describe-table --table-name <table name>
This will show the dynamic provisioned throughput e.g.
"ProvisionedThroughput": {
"LastIncreaseDateTime": "2021-01-28T10:16:00.613000+00:00",
"LastDecreaseDateTime": "2021-01-28T00:04:04.338000+00:00",
"NumberOfDecreasesToday": 1,
"ReadCapacityUnits": 665,
"WriteCapacityUnits": 1
}
All telemetry events are prefixed by P2P
so this query will show all significant events:
customEvents
| where cloud_RoleName in ("prison-to-probation-update", "community-api")
| where name startswith "P2P" or name startswith "KeyDate"
| summarize count() by name
Description for each event is as follows:
Prison transfers
- prison-to-probation-update service
P2PExternalMovement
- Movement message is about to be processedP2PTransferProbationUpdated
- Location has been updated or was already correct so required no change.P2PTransferProbationRecordNotFound
- Location could not be updated since prisoner or event has not been foundP2PTransferIgnored
- Location movement is no longer valid
- community-api service
P2PTransferPrisonUpdated
- Location has been updatedP2PTransferPrisonUpdateIgnored
- Location was already correctP2PTransferOffenderNotFound
- Location could not be updated since prisoner was not foundP2PTransferBookingNumberNotFound
- Location could not be updated since event not foundP2PTransferBookingNumberHasDuplicates
- Location could not be updated since mutiple events found with book number
Sentence dates changes
- prison-to-probation-update service
P2PSentenceDatesRecordNotFound
- Dates could not be updated since prisoner or event has not been foundP2PSentenceDatesChanged
- Dates have been updated or were correct alreadyP2PSentenceDatesChangeIgnored
- Dates changes are ignored since the dates or booking is no longer valid
- community-api service
keyDatesBulkUnchanged
- No dates required updatingkeyDatesBulkSummary
- Dates were successfully updatedkeyDatesBulkAllRemoved
- All dates were updated, this may indicate the sentence has changed to a Life sentence, they have been deported or new dates are about to be addedKeyDateDeleted
- One of the key the dates was deleted (also fired POM handover dates)KeyDateUpdated
- Existing date was updated (also fired POM handover dates)KeyDateAdded
- New date was added (also fired POM handover dates)
Imprisonment status change and offender matching
-
prison-to-probation-update service
P2POffenderNoMatch
- Prisoner not found in DeliusP2POffenderTooManyMatches
- Prisoner found in Delius but multiple hits foundP2POffenderMatch
- Details of the prisoner matching - a successful match would be represented by the number of times we see this event minus the P2POffenderNoMatch and P2POffenderTooManyMatches countP2POffenderNumberSet
- NOMS number set in Delius (new offender has been matched)P2PImprisonmentStatusUpdated
- Successfully completed matching and setting the dates and locationP2PImprisonmentStatusNoSentenceStartDate
- Change is being ignored since the prisoner has no sentence start dateP2PImprisonmentStatusIgnored
- Change is being ignored since booking is no longer relevant, e.g. the booking is no longer active and they have been releasedP2PBookingNumberNotAssigned
- Book number can be set in Delius, typically because there are multiple active eventsP2PLocationNotUpdated
- Location can not be set for the matched offender, typically where there are multiple custodial eventsP2PKeyDatesNotUpdated
- Dates can not be set for the matched offender, typically where there are multiple custodial events
-
community-api service
P2PImprisonmentStatusCustodyEventsHasDuplicates
- book number can not be set due multiple matching eventsP2PImprisonmentStatusCustodyEventNotFound
- book number can not be set due no matching eventsP2PImprisonmentStatusBookingNumberUpdated
- book number updatedP2PImprisonmentStatusBookingNumberInserted
- book number insertedP2PImprisonmentStatusBookingNumberAlreadySet
- book number already correct
Book number changed
- prison-to-probation-update service
P2PBookingNumberChanged
- NOMS number updatedP2PBookingNumberChangedOffenderNotFound
- Offender not found so can't update NOMS number
For the above telemetry events that are emitted from the community-api there would be an associated request - use the queries below to show these requests:
- Set NOMS number
requests
| where cloud_RoleName in ("community-api")
| where name == "PUT CustodyResource/updateOffenderNomsNumber"
| summarize count() by resultCode
- Set Book number
requests
| where cloud_RoleName in ("community-api")
| where name == "PUT CustodyResource/updateCustodyBookingNumber"
| summarize count() by resultCode
- Set prison location
requests
| where cloud_RoleName in ("community-api")
| where name == "PUT CustodyResource/updateCustody"
| summarize count() by resultCode
- Set key dates
requests
| where cloud_RoleName in ("community-api")
| where name == "POST CustodyKeyDatesController/replaceAllCustodyKeyDateByNomsNumberAndBookingNumber"
| summarize count() by resultCode
- Replace NOMS number
requests
| where cloud_RoleName in ("community-api")
| where name == "PUT CustodyResource/replaceOffenderNomsNumber"
| summarize count() by resultCode
There is a Grafana Dashboard that shows the totals for major processing counts.
The data for the dashboard is provided by micrometer. This requires the micrometer dependencies in
the build.gradle.kts
file which expose the /prometheus
endpoint. The metrics made available in this endpoint are
defined in the MetricsService
class.
The data for the dashboard is gathered by a Prometheus Service Monitor which uses a combination of namespace selector and service label selectors to find pods to query. You can see the Service Monitor polling the endpoints and also run a Prometheus query to check data is coming through .
There is a rule in the ingress to prevent access to the /prometheus
endpoint externally which is defined in the main
helm values.yaml
files (section generic-service.ingress.nginx.ingress.kubernetes.io/configuration-snippet
). The
Prometheus Service Monitor does not use the ingress - it polls the pods directly - and so it is not affected by this
rule. The rule exists to prevent metrics being exposed to the internet.
There are a number of CSV reports that are documented here .
To run the reports you require the ROLE_PTPU_REPORT
role. We would recommend asking the HMPPS Auth administrators (DPS
Tech team) to create personal production client credentials and just add that single role. Use Postman or cUrl to
request the report.