hyperledger / indy-node-monitor Goto Github PK

License: Apache License 2.0

Dockerfile 3.03% Python 66.73% Shell 30.24%

verifiable-organizations-network hyperledger-indy hyperledger-aries von hyperledger verifiable-credentials trust-over-ip indy vdr indy-vdr

indy-node-monitor's Issues

Develop Indy Node Monitor into a fully containerized monitoring stack

Wrap a full monitoring stack around Indy Node Monitor so data can be continuously collected and visualized.

Work has started on a fully containerized monitoring stack to include:

Indy Node Monitor
Telegraf
Prometheus and/or InfluxDB
Alert Manager
Grafana

Work streams can be found here:

The main working branch; https://github.com/WadeBarnes/indy-node-monitor/tree/monitoring-stack
Dashboard development; https://github.com/ConnorBarnes88/indy-node-monitor/tree/monitoring-stack
~~The prerequisite REST API support for indy-node-monitor; #35. The working branch is using the initial PoC REST API implementation, which will be replaced by the work in this PR.~~
GeoLocation services, yet to be integrated into the stack to map nodes; https://github.com/WadeBarnes/geolocation-api/tree/get-geo-location

The stack can be easily spun up using the ./manage script supplied in the monitoring-stack branch. Dashboards are auto-provisioned and Grafana is launched in a browser automatically.

Work on this monitoring stack was put on hold for a while due to other priorities, but is now resuming.

This issue is to document the locations of the work done so far. The work is currently lacking detailed documentation on the architecture and development processes. Detailed user documentation is also lacking. Current usage documentation can be found in the manage script; ./manage -h.

Add network diagnostics

Add MTR (My Traceroute) type diagnostic capabilities to the indy-node-monitor CLI and API, for use when troubleshooting connectivity to a network or to individual nodes.

Features should also include the ability to determine and list the current node configuration for a network based on the current pool transaction information. This would list all of the nodes contained in the pool transactions file along with their current details (e.g. IP addresses and validator status (active/inactive)).
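
As a rough illustration of the node listing described above, the sketch below assumes the pool transactions file follows the usual Indy genesis layout: one JSON NODE transaction per line, with the node details under txn.data.data and the node identifier under txn.data.dest. The function name list_pool_nodes is hypothetical, and treating a node as active when "VALIDATOR" appears in its services list is an assumption of this sketch.

```python
import json


def list_pool_nodes(genesis_text):
    """Summarize nodes from pool transactions (one JSON transaction per line)."""
    nodes = {}
    for line in genesis_text.splitlines():
        if not line.strip():
            continue
        txn_data = json.loads(line)["txn"]["data"]
        # Later transactions for the same node (keyed by "dest") override
        # fields set by earlier ones, e.g. a demotion that empties "services".
        nodes.setdefault(txn_data["dest"], {}).update(txn_data["data"])

    summary = []
    for info in nodes.values():
        summary.append({
            "alias": info.get("alias"),
            "node_address": f"{info.get('node_ip')}:{info.get('node_port')}",
            "client_address": f"{info.get('client_ip')}:{info.get('client_port')}",
            # Assumption: an empty services list means the node is inactive.
            "active": "VALIDATOR" in info.get("services", []),
        })
    return sorted(summary, key=lambda node: node["alias"] or "")
```

This only reflects what is registered on the ledger; live status would still have to come from querying the nodes themselves.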

Update Readme to indicate what analysis is performed by the scripts and how it's reported

Update Readme to indicate what analysis is performed by the scripts and how it's reported.

Include examples
Refer to #24 (comment) for additional context.

Update Fetch Validator Status Documentation - Clarify the reference to the Trustee seed

Update the documentation to clarify the use of the Trustee seed when using von-network in the initial demo. The use of the Trustee seed in this case is simply for convenience. It is a well known seed used by von-network, and it has the privileges to fetch the extended validator info data from the nodes. In real life you would use a NETWORK_MONITOR seed/DID for this purpose.

https://github.com/hyperledger/indy-node-monitor/tree/main/fetch-validator-status#run-the-validator-info-script

Add system clock skew detection and alerting

It is important for the nodes to have their system clocks synchronized. If they get too far out of sync it affects their ability to form consensus.

The timestamp reported by a node during an authenticated validator-info call indicates the node's system time. A comparison of the time differences between all of the nodes should be performed. When a clock skew of more than a few seconds (actual time TBD) is detected, a warning report (in the warnings list) should be generated.

This is to address NTP time sync discussions here; #24 (comment)

The implementation should build on the analysis plugin once merged; #26

Additional Requirements:

The "allowed" clock skew should be configurable, and have a reasonable default.
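
One possible shape for this check is sketched below. It assumes each node's timestamp has already been extracted as Unix epoch seconds, compares every node against the pool median, and uses a 5 second default that is only a placeholder since the issue leaves the actual value TBD. The function and constant names are hypothetical, not part of the existing plugin API.

```python
from statistics import median

# Assumed default; the issue leaves the actual allowed skew TBD.
DEFAULT_MAX_SKEW_SECONDS = 5.0


def clock_skew_warnings(node_timestamps, max_skew=DEFAULT_MAX_SKEW_SECONDS):
    """Return one warning string per node whose clock is too far from the median.

    node_timestamps maps a node alias to its reported time in epoch seconds.
    """
    if len(node_timestamps) < 2:
        return []  # nothing to compare against
    reference = median(node_timestamps.values())
    warnings = []
    for alias in sorted(node_timestamps):
        skew = node_timestamps[alias] - reference
        if abs(skew) > max_skew:
            warnings.append(
                f"{alias}: system clock differs from the pool median by "
                f"{skew:+.1f}s (allowed: {max_skew:.1f}s)"
            )
    return warnings
```

Because the nodes answer at slightly different moments, part of any measured difference is response latency rather than real skew, so the default should probably not be set too tight.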

Add detection and reporting of network/node response time

Record and report network and node response times.

At the network level this would be the time it takes to establish the initial connection with the pool.
At the node level this would be the time it takes for individual nodes to respond to queries.
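
A minimal sketch of how both measurements could be taken, assuming the pool connection and the node queries are exposed as awaitables. The helper name timed is hypothetical, and the coroutine in the usage note stands in for whatever call actually opens the pool or queries a node.

```python
import asyncio
import time


async def timed(awaitable):
    """Await something and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start
```

The same wrapper could be applied once around the pool connection (network level) and once around each per-node request (node level), for example `result, seconds = await timed(some_query())`, where some_query is a placeholder.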

Fetch validator status script error

An older version of Python and a missing dependency cause an error when running run.sh.

Traceback (most recent call last):
  File "main.py", line 29, in <module>
    monitor_plugins = PluginCollection('plugins')
  File "/home/indy/plugin_collection.py", line 69, in __init__
    self.reload_plugins()
  File "/home/indy/plugin_collection.py", line 78, in reload_plugins
    self.walk_package(self.plugin_package)
  File "/home/indy/plugin_collection.py", line 126, in walk_package
    self.walk_package(package + '.' + child_pkg)
  File "/home/indy/plugin_collection.py", line 100, in walk_package
    plugin_module = __import__(pluginname, fromlist=['blah'])
  File "/home/indy/plugins/metrics/google_sheets.py", line 3, in <module>
    import gspread
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/__init__.py", line 7, in <module>
    from .auth import (
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/auth.py", line 19, in <module>
    from .client import Client
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/client.py", line 17, in <module>
    from .spreadsheet import Spreadsheet
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/spreadsheet.py", line 15, in <module>
    from .worksheet import Worksheet
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/worksheet.py", line 11, in <module>
    from typing import (
ImportError: cannot import name 'Literal' from 'typing' (/usr/lib/python3.7/typing.py)

Incompatibility with latest fastapi/starlette package versions

Since roughly last week (2023-03-11 to 2023-03-18), building indy-node-monitor from scratch without cached build artefacts pulls in fastapi==0.95.0, which requires starlette==0.26.1.
This causes the indy-node-monitor process to crash on startup, with a stack trace containing errors like the following:

  File "/home/indy/rest_api.py", line 80, in <module>
    async def network(network: Network = Path(example_network_enum, example=example_network_name, description="The network code."),
  File "/home/indy/.local/lib/python3.7/site-packages/fastapi/param_functions.py", line 42, in Path
    **extra,
  File "/home/indy/.local/lib/python3.7/site-packages/fastapi/params.py", line 83, in __init__
    assert default is ..., "Path parameters cannot have a default value"
AssertionError: Path parameters cannot have a default value
[2023-03-24 20:06:52 +0000] [11] [INFO] Worker exiting (pid: 11)

This happens because creating a fastapi Path object with code like:

Path(Network.sbn, example="sbn", description="The network code.")

probably relied on a bug in the Path API, which has since been fixed.
The correct syntax is:

Path(path=Network.sbn, example="sbn", description="The network code.")

A PR to correct this problem is on its way.

Create a MAINTAINERS.md file

Example: https://github.com/hyperledger-labs/solang/blob/main/MAINTAINERS.md Example of a pointer: https://github.com/hyperledger-labs/homebrew-solang/blob/main/MAINTAINERS.md

Determine what information is available from validator-info to determine node compliance

The objective of this exercise is to determine what information provided by the validator-info call can be used to measure compliance with the Sovrin Technical Policies.

An example might be to determine if nodes have two Public IPs for node and client traffic respectively as per the technical policy.

Further analysis (i.e. a gap analysis) is required to determine what is already available versus what is missing. The outcome might be requested changes to the output provided by the validator-info call from indy-vdr.

Given the capabilities of InfluxDB 2.7+, investigate whether Prometheus is still required

This is a follow-up issue to #61.

#61 was raised because using a metric as a tag in Prometheus (or even InfluxDB) is an anti-pattern.

However, it also calls into question the relevance of using Prometheus at all, because Prometheus is a numeric time series data store and cannot make use of the textual data provided by indy-node-monitor.
It also appeared that InfluxDB 2.7+ (instead of 1.8) could fill the alerting gap that would be left if Prometheus were removed from the solution.

So we are investigating fully or partly replacing elements of the current (as of 2023-04-11) stack with the latest OSS version of InfluxDB (currently 2.7.0).

For reference, here is the current architecture:

flowchart TD
    Grafana[/Grafana\n* data display\n* dashboards\] -- queries --> Prometheus
    Grafana -- queries --> InfluxDB
    Prometheus{{Prometheus& \nAlertManager\n* numeric time series\n* alerting}} -- queries --> Telegraf
    Telegraf[[Telegraf\n\n* periodic data gathering\n* data dispatch]] -- stores --> InfluxDB{{InfluxDB\n* numeric time series\n* text-based time series}}
    Telegraf -- queries --> Indy-Node-Monitor[Indy-Node-Monitor\n* status/metrics\ngathering interface]  --> IndyClusters[\Indy Clusters/]
    Indy-Node-Monitor --> IndyClusters
    Indy-Node-Monitor --> IndyClusters

Add IP address configuration analysis

To follow established best practices, we want to ensure that each node has been configured to use separate NICs for node and client communications, and that they have been configured with different IP addresses on different subnets.

There can be two levels to this:

The public level which is reported in the node summary of each node (client-address and node-address fields). This data is based on the registration information for the node on the ledger.
The private level which is reported by a node during an authenticated validator-info call (Node_ip and Client_ip fields). This data is based on the configuration of the node itself.

This is to address Network interfaces (IPs and Ports) discussions here; #24 (comment)

The implementation should build on the analysis plugin once merged; #26

Requirements:

Ensure the public IPs for a given node are different, and are on different subnets. Generate descriptive warning report(s) (in the warnings list) for any detected issues.
Ensure the private IPs for a given node are different, and are on different subnets. Generate descriptive warning report(s) (in the warnings list) for any detected issues.
Ensure the private IPs are bound to separate NICs. Requires; hyperledger/indy-node#1669
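
The first two requirements could be checked along the lines of the sketch below, which uses Python's standard ipaddress module. The validator-info data carries no subnet mask, so the /24 prefix length is purely an assumption and is exposed as a parameter. The function name and warning wording are hypothetical. The NIC binding check is not covered, since it depends on hyperledger/indy-node#1669.

```python
import ipaddress


def ip_config_warnings(alias, node_ip, client_ip, prefix_len=24):
    """Return descriptive warnings about a node's node/client IP configuration."""
    node_addr = ipaddress.ip_address(node_ip)
    client_addr = ipaddress.ip_address(client_ip)

    if node_addr.is_unspecified or client_addr.is_unspecified:
        # e.g. 0.0.0.0: the service listens on all interfaces.
        return [f"{alias}: node or client IP is a wildcard address "
                f"({node_ip}, {client_ip}); cannot confirm separate interfaces."]
    if node_addr == client_addr:
        return [f"{alias}: node and client traffic share the same IP ({node_ip})."]

    # Assumption: subnets are compared using a fixed prefix length.
    node_net = ipaddress.ip_network(f"{node_ip}/{prefix_len}", strict=False)
    if client_addr in node_net:
        return [f"{alias}: node IP {node_ip} and client IP {client_ip} "
                f"are on the same /{prefix_len} subnet."]
    return []
```

The same function could be run twice per node: once with the public client-address and node-address values from the ledger, and once with the private Client_ip and Node_ip values from validator-info.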

Not sure why `response_result_data_Hardware_HDD_used_by_node` is used as a label in prometheus metrics

It artificially creates different time series. Is there a rationale for this?

Define initial metrics to export to Prometheus as MVP

Prometheus MVP Metrics

The Sovrin Network name being monitored
Should be able to get this from the pool being connected to
Node alias name
Detect when a node is inaccessible and produce standard output for that situation.

Should generate a timeout when trying to pull validator_info from inaccessible nodes.

Detect any nodes that are accessible but that are "unreachable" to some or all of the other Indy nodes.
- That indicates that the internal port to the node is not accessible, even though the public port is accessible.

"Reachable_nodes": [
  [
    "Node1",
    0
  ],
  [
    "Node3",
    null
  ],
  [
    "Node4",
    null
  ]
],
"Unreachable_nodes": [
  [
    "Node2",
    null
  ]
],
"Reachable_nodes_count": 3,
"Unreachable_nodes_count": 1,

The number of transactions per Indy ledger, especially the domain ledger.

"transaction-count": {
  "ledger": 21,
  "pool": 4,
  "config": 0,
  "audit": 1042
},

The average read and write times for the node.

"throughput": {
  "0": 0.0017547843
},
"master throughput": 0.0017547843,
"total requests": 16,
"avg backup throughput": null,
"master throughput ratio": null,
"average-per-second": {
  "read-transactions": 0.0338584473,
  "write-transactions": 0.0001539895
},

The average throughput time for the node.

"throughput": {
  "0": 0.0017547843
},
"master throughput": 0.0017547843,
"total requests": 16,
"avg backup throughput": null,
"master throughput ratio": null,
"average-per-second": {
  "read-transactions": 0.0338584473,
  "write-transactions": 0.0001539895
},

The uptime of the node (time since last restart).

  "transaction-count": {
    "ledger": 21,
    "pool": 4,
    "config": 0,
    "audit": 1042
  },
  "uptime": 103903
},

The time since last freshness check (should be less than 5 minutes).

"Freshness_status": {
  "1": {
    "Last_updated_time": "2020-07-06 23:55:07+00:00",
    "Has_write_consensus": true
  },
  "0": {
    "Last_updated_time": "2020-07-06 23:57:33+00:00",
    "Has_write_consensus": true
  },
  "2": {
    "Last_updated_time": "2020-07-06 23:57:33+00:00",
    "Has_write_consensus": true
  }
}

Node IP address information

"Node_info": {
          "Name": "Node4",
          "Mode": "participating",
          "Client_port": 9708,
          "Client_ip": "0.0.0.0",
          "Client_protocol": "tcp",
          "Node_port": 9707,
          "Node_ip": "0.0.0.0",

Total nodes in pool information

"Pool_info": {
          "Read_only": false,
          "Total_nodes_count": 4,

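Two of the checks listed in this issue can be sketched directly against the fragments shown above: listing unreachable nodes and flagging ledgers whose freshness timestamp is older than 5 minutes. The function names are hypothetical, and the sketch assumes the Pool_info and Freshness_status objects have already been pulled out of the validator-info response as Python dicts.

```python
from datetime import datetime, timedelta, timezone


def unreachable_aliases(pool_info):
    """Return the aliases listed under Unreachable_nodes."""
    # Each entry is a [alias, value] pair, as in the fragment above.
    return [entry[0] for entry in pool_info.get("Unreachable_nodes", [])]


def stale_ledgers(freshness_status, now, max_age=timedelta(minutes=5)):
    """Return ledger ids whose Last_updated_time is older than max_age."""
    stale = []
    for ledger_id, status in sorted(freshness_status.items()):
        # Timestamps look like "2020-07-06 23:55:07+00:00".
        updated = datetime.fromisoformat(status["Last_updated_time"])
        if now - updated > max_age:
            stale.append(ledger_id)
    return stale
```

In a Prometheus exporter these would more naturally become numeric gauges (an unreachable count and a per-ledger age in seconds) than lists, but the parsing step would be the same.
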
Using indy-node-monitor with cron

To use indy-node-monitor as a cron job, the -ti parameters should be removed from the docker command in run.sh.

It might be appropriate to expect indy-node-monitor to be used extensively with cron jobs in the future.

A quick fix might be a command line argument that indicates the script is running as a cron job, which would remove the -ti parameters from the docker command.

Implement a health, diversity, and compliance analysis plug-in

As a companion to hyperledger/indy-node#1670, integrate the analysis portion(s) of the Technical Verification Script into a new plugin to facilitate continuous monitoring.

The implementation should build on the analysis plugin once merged; #26

This is to address the discussions regarding incorporating the collection of the additional data analyzed by the Technical Verification Script, in order to facilitate continuous compliance monitoring; #24 (comment)

Cut new release(s)

Cut new release(s) to denote the significant changes that have occurred over the last while.

Add node counts to status when using network monitor DID

When not in anonymous mode, suggest that the --status result also return the following fields:

From Pool_info:

Unreachable_nodes_count
Reachable_nodes_count
Total_nodes_count
Possibly Suspicious_nodes if it is a count -- not if it is an array.

From Hardware:

HDD_used_by_node

Nice to have from the 'Extractions' node -- trickier because the values are in text:

node-control status: Memory: and Tasks:
- as node-control-memory and node-control-tasks
indy-node_status: Memory: and Tasks:
- as indy-node-memory and indy-node-tasks
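
A hedged sketch of how the text values might be extracted is shown below. It assumes the status text resembles systemctl status output, with lines such as "Tasks: 12 (limit: 4915)" and "Memory: 87.3M". The exact format in the 'Extractions' node should be confirmed before relying on these patterns. The function name and the conversion of memory to bytes are choices made for this sketch.

```python
import re

# Assumed unit suffixes, interpreted as binary multiples.
_UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_service_status(prefix, status_text):
    """Extract <prefix>-tasks and <prefix>-memory (in bytes) from status text."""
    fields = {}
    tasks = re.search(r"Tasks:\s*(\d+)", status_text)
    if tasks:
        fields[f"{prefix}-tasks"] = int(tasks.group(1))
    memory = re.search(r"Memory:\s*(\d+(?:\.\d+)?)\s*([BKMG])", status_text)
    if memory:
        fields[f"{prefix}-memory"] = float(memory.group(1)) * _UNITS[memory.group(2)]
    return fields
```

Calling it with the prefixes "node-control" and "indy-node" would yield the four field names suggested above. Fields that cannot be found are simply omitted rather than reported as zero.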
