indy-node-monitor's Introduction

Indy Node Monitor

Indy Node Monitor is a set of tools for monitoring the status of an Indy Ledger by querying the validator information of the nodes on the ledger. These tools are integrated into a full monitoring stack including the following components:

  • Indy Node Monitor (Fetch Validator Status)
  • Telegraf
  • Prometheus and/or InfluxDB
  • Alert Manager
  • Grafana

This allows you to:

  • Visualize the node and network data on dashboards.
  • Track trends about the status of nodes and the overall ledger.
  • Track read and write uptimes.
  • Track ledger usage, such as the number of transactions on the ledger.
  • Drive notifications of node outages.

The stack is managed and spun up in Docker using the included ./manage script. Starting the stack is as easy as:

./manage build
./manage start

A browser window to the Grafana instance should launch automatically. The login username is admin and the password is foobar. Once logged in, you can browse the various auto-provisioned dashboards, which begin to populate after a few minutes. Please note: to collect the detailed data from the network that populates many of the graphs, you need a privileged DID (NETWORK_MONITOR at minimum) on the ledger, and you must configure the monitoring stack with the seed(s).

For more information about the commands available in the ./manage script, refer to the command list.

For more information on how to set up and use the monitoring stack, refer to this readme.

Fetch Validator Status

This simple tool is the heart of the Indy Node Monitor. It is used to retrieve "validator-info"—detailed status data about an Indy node (aka "validator")—from all the nodes in a network. The results are returned as a JSON array with a record per validator, and the data can be manipulated and formatted through the use of plugins. Fetch validator status can be used as a stand-alone command line tool or through the Indy Node Monitor REST API which is used by the monitoring stack. The Indy Node Monitor REST API can be spun up easily by running ./manage start indy-node-monitor, and a browser window will automatically launch to display the API documents and interface.

For more details see the Fetch Validator Status readme

How to contribute to the monitoring stack

For more details on how to contribute follow the link to this readme

Contributions

Pull requests are welcome! Please read our contributions guide and submit your PRs. We enforce developer certificate of origin (DCO) commit signing. See guidance here.

We also welcome issues submitted about problems you encounter in using the tools within this repo.

Code of Conduct

All contributors are required to adhere to our Code of Conduct guidelines.

License

Apache License Version 2.0

indy-node-monitor's People

Contributors

andrewwhitehead, connorbarnes88, crajapakshe, dependabot[bot], donato-ods-devops, gavinbendixsen, jleach, kolebarnes, mrkaurelius, patstlouis, rajpalc7, repo-mountie[bot], swcurran, wadebarnes

indy-node-monitor's Issues

Implement a health, diversity, and compliance analysis plug-in

As a companion to hyperledger/indy-node#1670, integrate the analysis portion(s) of the Technical Verification Script into a new plugin to facilitate continuous monitoring.

The implementation should build on the analysis plugin once merged; #26

This is to address the discussions regarding incorporating the collection of the additional data analyzed by the Technical Verification Script, in order to facilitate continuous compliance monitoring; #24 (comment)

Update Fetch Validator Status Documentation - Clarify the reference to the Trustee seed

Update the documentation to clarify the use of the Trustee seed when using von-network in the initial demo. The use of the Trustee seed in this case is simply for convenience. It is a well known seed used by von-network, and it has the privileges to fetch the extended validator info data from the nodes. In real life you would use a NETWORK_MONITOR seed/DID for this purpose.

https://github.com/hyperledger/indy-node-monitor/tree/main/fetch-validator-status#run-the-validator-info-script

Cut new release(s)

Cut new release(s) to denote the significant changes that have occurred over the last while.

Add system clock skew detection and alerting

It is important for the nodes to have their system clocks synchronized. If they get too far out of sync, it affects their ability to form consensus.

The timestamp reported by a node during an authenticated validator-info call indicates the node's system time. The time differences between all of the nodes should be compared. When a clock skew of more than a few seconds (actual time TBD) is detected, a warning report (in the warnings list) should be generated.

This is to address NTP time sync discussions here; #24 (comment)

The implementation should build on the analysis plugin once merged; #26

Additional Requirements:

  • The "allowed" clock skew should be configurable, and have a reasonable default.
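
The comparison described above could be sketched roughly as follows. This is a minimal illustration, not the plugin's implementation: the function name, the input shape (a mapping of node name to reported epoch seconds), and the 5 second default are all assumptions, since the issue leaves the actual threshold TBD.

```python
from itertools import combinations
from typing import Dict, List

# Hypothetical default; the issue leaves the actual value TBD.
DEFAULT_MAX_SKEW_SECONDS = 5.0


def find_clock_skew_warnings(
    node_times: Dict[str, float],
    max_skew_seconds: float = DEFAULT_MAX_SKEW_SECONDS,
) -> List[str]:
    """Return one warning per pair of nodes whose clocks differ by more than the limit.

    node_times maps a node name to the timestamp (epoch seconds) that node reported.
    """
    warnings = []
    # Compare every pair of nodes once, in a stable (sorted) order.
    for (name_a, time_a), (name_b, time_b) in combinations(sorted(node_times.items()), 2):
        skew = abs(time_a - time_b)
        if skew > max_skew_seconds:
            warnings.append(
                f"Clock skew of {skew:.1f}s between {name_a} and {name_b} "
                f"exceeds the allowed {max_skew_seconds:.1f}s"
            )
    return warnings
```

Note that nodes answer the validator-info call at slightly different moments, so some of the measured difference is query latency rather than true skew; a real default would probably need to allow for that.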

Incompatibility with latest fastapi/starlette package versions

As of roughly last week (between 2023-03-11 and 2023-03-18), building indy-node-monitor from scratch without cached build artifacts pulls fastapi==0.95.0, which requires starlette==0.26.1.
This causes the indy-node-monitor process to crash on startup with a stack trace containing the following kind of error:

  File "/home/indy/rest_api.py", line 80, in <module>
    async def network(network: Network = Path(example_network_enum, example=example_network_name, description="The network code."),
  File "/home/indy/.local/lib/python3.7/site-packages/fastapi/param_functions.py", line 42, in Path
    **extra,
  File "/home/indy/.local/lib/python3.7/site-packages/fastapi/params.py", line 83, in __init__
    assert default is ..., "Path parameters cannot have a default value"
AssertionError: Path parameters cannot have a default value
[2023-03-24 20:06:52 +0000] [11] [INFO] Worker exiting (pid: 11)

This happens because creating a fastapi Path object with code like:

Path(Network.sbn, example="sbn", description="The network code.")

was probably relying on a bug in the Path API that has since been corrected.
The correct syntax is:

Path(path=Network.sbn, example="sbn", description="The network code.")

A PR to correct this problem is on its way.

Add node counts to status when using network monitor DID

When not in anonymous mode, suggest that the --status result also return the following fields:

From Pool_info:

  • Unreachable_nodes_count
  • Reachable_nodes_count
  • Total_nodes_count
  • Possibly Suspicious_nodes if it is a count -- not if it is an array.

From Hardware:

  • HDD_used_by_node

Nice to have from the 'Extractions' node -- trickier because the values are in text:

  • node-control status: Memory: and Tasks:
    • as node-control-memory and node-control-tasks
  • indy-node_status: Memory: and Tasks:
    • as indy-node-memory and indy-node-tasks
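
As a rough sketch of the extraction suggested above: the helper names and the way the Pool_info and Hardware sections are passed in are assumptions for illustration, and the text parsing simply assumes the Memory: and Tasks: values each sit on their own line of the status text.

```python
import re
from typing import Any, Dict, Optional

POOL_COUNT_FIELDS = ("Unreachable_nodes_count", "Reachable_nodes_count", "Total_nodes_count")


def extract_status_fields(pool_info: Dict[str, Any], hardware: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the suggested count fields out of the Pool_info and Hardware sections."""
    status = {field: pool_info[field] for field in POOL_COUNT_FIELDS if field in pool_info}
    # Include Suspicious_nodes only if it is a count, not an array (or anything else).
    suspicious = pool_info.get("Suspicious_nodes")
    if isinstance(suspicious, int) and not isinstance(suspicious, bool):
        status["Suspicious_nodes"] = suspicious
    if "HDD_used_by_node" in hardware:
        status["HDD_used_by_node"] = hardware["HDD_used_by_node"]
    return status


def extract_text_value(status_text: str, label: str) -> Optional[str]:
    """Return the text following 'label:' on its own line, or None if it is absent."""
    match = re.search(rf"^\s*{re.escape(label)}:\s*(.+?)\s*$", status_text, re.MULTILINE)
    return match.group(1) if match else None
```

With this shape, node-control-memory would be extract_text_value(text, "Memory") applied to the node-control status text, and likewise for Tasks.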

Develop Indy Node Monitor into a fully containerized monitoring stack

Wrap a full monitoring stack around Indy Node Monitor so data can be continuously collected and visualized.

Work has started on a fully containerized monitoring stack to include:

  • Indy Node Monitor
  • Telegraf
  • Prometheus and/or InfluxDB
  • Alert Manager
  • Grafana

Work streams can be found here:

The stack can be easily spun up using the ./manage script supplied in the monitoring-stack branch. Dashboards are auto-provisioned and Grafana is launched in a browser automatically.

Work on this monitoring stack was put on hold for a while due to other priorities, but is now resuming.

This issue is to document the locations of the work done so far. The work is currently lacking detailed documentation on the architecture and development processes. Detailed user documentation is also lacking. Current usage documentation can be found in the manage script; ./manage -h.

Fetch validator status script error

An older version of Python and a missing dependency cause an error in run.sh:

Traceback (most recent call last):
  File "main.py", line 29, in <module>
    monitor_plugins = PluginCollection('plugins')
  File "/home/indy/plugin_collection.py", line 69, in __init__
    self.reload_plugins()
  File "/home/indy/plugin_collection.py", line 78, in reload_plugins
    self.walk_package(self.plugin_package)
  File "/home/indy/plugin_collection.py", line 126, in walk_package
    self.walk_package(package + '.' + child_pkg)
  File "/home/indy/plugin_collection.py", line 100, in walk_package
    plugin_module = __import__(pluginname, fromlist=['blah'])
  File "/home/indy/plugins/metrics/google_sheets.py", line 3, in <module>
    import gspread
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/__init__.py", line 7, in <module>
    from .auth import (
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/auth.py", line 19, in <module>
    from .client import Client
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/client.py", line 17, in <module>
    from .spreadsheet import Spreadsheet
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/spreadsheet.py", line 15, in <module>
    from .worksheet import Worksheet
  File "/home/indy/.local/lib/python3.7/site-packages/gspread/worksheet.py", line 11, in <module>
    from typing import (
ImportError: cannot import name 'Literal' from 'typing' (/usr/lib/python3.7/typing.py)

Given the capabilities of InfluxDB 2.7+, investigate whether Prometheus is still required

This is a follow-up issue to #61.

#61 was raised because using a metric as a tag in Prometheus (or even InfluxDB) is an anti-pattern.

However, it also calls into question whether Prometheus is the right fit, because Prometheus is a numeric time series data store and cannot make use of the textual data provided by indy-node-monitor.
It also appears that InfluxDB 2.7+ (as opposed to 1.8) could fill the alerting gap that would be left if Prometheus were removed from the solution.

So we are investigating fully or partly replacing elements of the current (as of 2023-04-11) stack with the latest OSS version of InfluxDB (currently 2.7.0).

For reference, here is the current architecture:

flowchart TD
    Grafana[/Grafana\n* data display\n* dashboards\] -- queries --> Prometheus
    Grafana -- queries --> InfluxDB
    Prometheus{{Prometheus &\nAlertManager\n* numeric time series\n* alerting}} -- queries --> Telegraf
    Telegraf[[Telegraf\n\n* periodic data gathering\n* data dispatch]] -- stores --> InfluxDB{{InfluxDB\n* numeric time series\n* text-based time series}}
    Telegraf -- queries --> Indy-Node-Monitor[Indy-Node-Monitor\n* status/metrics\ngathering interface] --> IndyClusters[\Indy Clusters/]
    Indy-Node-Monitor --> IndyClusters
    Indy-Node-Monitor --> IndyClusters

Add detection and reporting of network/node response time

Record and report network and node response times.

  • At the network level this would be the time it takes to establish the initial connection with the pool.
  • At the node level this would be the time it takes for individual nodes to respond to queries.
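
A minimal sketch of how such timings could be captured is shown below. The helper names are hypothetical, and query_node stands in for whatever call actually fetches data from a node; an asynchronous client would need an awaitable variant of the same idea.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def measure_node_response_times(
    node_names: Iterable[str],
    query_node: Callable[[str], Any],
) -> Dict[str, float]:
    """Return the response time in seconds for each node.

    query_node is a placeholder for the real per-node query.
    """
    times = {}
    for name in node_names:
        _, elapsed = timed_call(query_node, name)
        times[name] = elapsed
    return times
```

The same timed_call wrapper could be applied to the pool connection step to get the network-level figure.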

Add IP address configuration analysis

To confirm that established best practices are being followed, we want to ensure that each node has been configured to use separate NICs for node and client communications, and that the NICs have been configured with different IP addresses on different subnets.

There can be two levels to this:

  • The public level which is reported in the node summary of each node (client-address and node-address fields). This data is based on the registration information for the node on the ledger.
  • The private level which is reported by a node during an authenticated validator-info call (Node_ip and Client_ip fields). This data is based on the configuration of the node itself.

This is to address Network interfaces (IPs and Ports) discussions here; #24 (comment)

The implementation should build on the analysis plugin once merged; #26

Requirements:

  • Ensure the public IPs for a given node are different, and are on different subnets. Generate descriptive warning report(s) (in the warnings list) for any detected issues.
  • Ensure the private IPs for a given node are different, and are on different subnets. Generate descriptive warning report(s) (in the warnings list) for any detected issues.
  • Ensure the private IPs are bound to separate NICs. Requires; hyperledger/indy-node#1669
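
The first two requirements could be sketched with Python's standard ipaddress module, as below. This is an illustration only: the function name is hypothetical, and because validator-info reports addresses without netmasks, the /24 prefix used to decide what counts as the "same subnet" is an assumption that would need to be configurable.

```python
import ipaddress
from typing import List

# Assumption: no netmask is reported, so a prefix length must be chosen.
DEFAULT_PREFIX_LENGTH = 24


def check_ip_separation(
    node_name: str,
    node_ip: str,
    client_ip: str,
    scope: str,
    prefix_length: int = DEFAULT_PREFIX_LENGTH,
) -> List[str]:
    """Return warnings if the node and client IPs are identical or share a subnet.

    scope is a free-text label such as "public" or "private" used in the message.
    """
    node_addr = ipaddress.ip_address(node_ip)
    client_addr = ipaddress.ip_address(client_ip)
    if node_addr == client_addr:
        return [f"{node_name}: {scope} node and client IPs are identical ({node_ip})"]
    if node_addr.version != client_addr.version:
        # An IPv4 and an IPv6 address cannot be on the same subnet.
        return []
    # strict=False lets us build the network from a host address.
    node_net = ipaddress.ip_network(f"{node_ip}/{prefix_length}", strict=False)
    if client_addr in node_net:
        return [
            f"{node_name}: {scope} node IP {node_ip} and client IP {client_ip} "
            f"are on the same /{prefix_length} subnet"
        ]
    return []
```

A node reporting 0.0.0.0 for both Node_ip and Client_ip (bound to all interfaces) would be flagged as identical by this check.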

Using indy-node-monitor with cron

To use indy-node-monitor as a cron job, the -ti parameters should be removed from the docker command in run.sh.

It might be appropriate to expect indy-node-monitor to be used extensively with cron jobs in the future.

A quick fix might be a command line argument that indicates the script is running as a cron job and removes the -ti parameters from the docker command.
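
An alternative to an explicit argument is to detect whether a terminal is attached and add -ti only in that case. The sketch below illustrates the idea in Python rather than in run.sh itself; the function name and the image name used in the usage note are hypothetical.

```python
import sys
from typing import List, Optional


def build_docker_run_args(
    image: str,
    script_args: List[str],
    interactive: Optional[bool] = None,
) -> List[str]:
    """Build a docker run command line, adding -ti only for interactive use.

    When interactive is None, it is inferred from whether stdin and stdout
    are attached to a terminal (they are not when started by cron).
    """
    if interactive is None:
        interactive = sys.stdin.isatty() and sys.stdout.isatty()
    args = ["docker", "run", "--rm"]
    if interactive:
        args.append("-ti")
    args.append(image)
    args.extend(script_args)
    return args
```

For example, build_docker_run_args("example-image", ["--status"], interactive=False) yields a command without -ti, which is the form a cron job needs.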

Determine what information is available from validator-info to determine node compliance

The objective of this exercise is to determine what information provided by the validator-info call can be used to measure compliance with the Sovrin Technical Policies.

An example might be to determine whether nodes have two public IPs, for node and client traffic respectively, as per the technical policy.

Further analysis (i.e. a gap analysis) is required to determine what is already available versus what is missing. The outcome might be requested changes to the output provided by the validator-info call from indy-vdr.

Add network diagnostics

Add MTR (My Traceroute) type diagnostic capabilities to the indy-node-monitor CLI and API, for use when troubleshooting network connectivity to a network or individual nodes.

Features should also include the ability to determine and list the current node configuration for a network based on the current pool transaction information. This would list all of the nodes that are contained in the pool transactions file along with their current status (i.e. IP addresses, and validator status (active/inactive), etc.).
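
The node listing could be derived along these lines. This is a sketch under stated assumptions: it assumes the pool transactions file holds one JSON transaction per line, that NODE transactions have txn.type "0" with the node fields under txn.data.data (alias, node_ip, node_port, client_ip, client_port, services), and that later transactions for the same dest override earlier ones. The function name is hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List

# Assumption: NODE transactions are identified by txn.type "0".
NODE_TXN_TYPE = "0"


def list_pool_nodes(genesis_lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Summarize the current node configuration from pool transaction lines."""
    merged: Dict[str, Dict[str, Any]] = {}
    for line in genesis_lines:
        line = line.strip()
        if not line:
            continue
        txn = json.loads(line)["txn"]
        if txn.get("type") != NODE_TXN_TYPE:
            continue
        # Later transactions for the same node (dest) update earlier values.
        merged.setdefault(txn["data"]["dest"], {}).update(txn["data"]["data"])

    nodes = []
    for data in merged.values():
        nodes.append({
            "alias": data.get("alias"),
            "node_address": f"{data.get('node_ip')}:{data.get('node_port')}",
            "client_address": f"{data.get('client_ip')}:{data.get('client_port')}",
            # A node is treated as active while it holds the VALIDATOR service.
            "active": "VALIDATOR" in data.get("services", []),
        })
    return nodes
```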

Define initial metrics to export to Prometheus as MVP

Prometheus MVP Metrics

  • The Sovrin Network name being monitored.
    • Should be able to get this from the pool being connected to.

  • Node alias name.

  • Detect when a node is inaccessible and produce standard output for that situation.
    • Should generate a timeout when trying to pull validator_info from inaccessible nodes.

  • Detect any nodes that are accessible but that are "unreachable" to some or all of the other Indy nodes.
    • That indicates that the internal port to the node is not accessible, even though the public port is accessible.
    "Reachable_nodes": [
      [
        "Node1",
        0
      ],
      [
        "Node3",
        null
      ],
      [
        "Node4",
        null
      ]
    ],
    "Unreachable_nodes": [
      [
        "Node2",
        null
      ]
    ],
    "Reachable_nodes_count": 3,
    "Unreachable_nodes_count": 1,
  • The number of transactions per Indy ledger, especially the domain ledger.
    "transaction-count": {
      "ledger": 21,
      "pool": 4,
      "config": 0,
      "audit": 1042
    },
  • The average read and write times for the node.
    "throughput": {
      "0": 0.0017547843
    },
    "master throughput": 0.0017547843,
    "total requests": 16,
    "avg backup throughput": null,
    "master throughput ratio": null,
    "average-per-second": {
      "read-transactions": 0.0338584473,
      "write-transactions": 0.0001539895
    },
  • The average throughput time for the node.
    "throughput": {
      "0": 0.0017547843
    },
    "master throughput": 0.0017547843,
    "total requests": 16,
    "avg backup throughput": null,
    "master throughput ratio": null,
    "average-per-second": {
      "read-transactions": 0.0338584473,
      "write-transactions": 0.0001539895
    },
  • The uptime of the node (time since last restart).
      "transaction-count": {
        "ledger": 21,
        "pool": 4,
        "config": 0,
        "audit": 1042
      },
      "uptime": 103903
    },
  • The time since last freshness check (should be less than 5 minutes).
    "Freshness_status": {
      "1": {
        "Last_updated_time": "2020-07-06 23:55:07+00:00",
        "Has_write_consensus": true
      },
      "0": {
        "Last_updated_time": "2020-07-06 23:57:33+00:00",
        "Has_write_consensus": true
      },
      "2": {
        "Last_updated_time": "2020-07-06 23:57:33+00:00",
        "Has_write_consensus": true
      }
    }
  • Node IP address information
    "Node_info": {
      "Name": "Node4",
      "Mode": "participating",
      "Client_port": 9708,
      "Client_ip": "0.0.0.0",
      "Client_protocol": "tcp",
      "Node_port": 9707,
      "Node_ip": "0.0.0.0",
  • Total nodes in pool information
    "Pool_info": {
      "Read_only": false,
      "Total_nodes_count": 4,

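To make the list above more concrete, here is a rough sketch of turning a few of these values into Prometheus text exposition lines, including the age of the last freshness update. The metric names, function names, and the way the Pool_info, metrics, and Freshness_status sections are passed in are all assumptions for illustration; only the field names come from the excerpts above.

```python
from datetime import datetime
from typing import Any, Dict, List

# From the list above: freshness should be less than 5 minutes old.
FRESHNESS_LIMIT_SECONDS = 300


def freshness_ages(freshness_status: Dict[str, Any], now: datetime) -> Dict[str, float]:
    """Return seconds since Last_updated_time for each ledger id."""
    ages = {}
    for ledger_id, status in freshness_status.items():
        updated = datetime.fromisoformat(status["Last_updated_time"])
        ages[ledger_id] = (now - updated).total_seconds()
    return ages


def to_prometheus_lines(
    node_name: str,
    pool_info: Dict[str, Any],
    metrics: Dict[str, Any],
    freshness_status: Dict[str, Any],
    now: datetime,
) -> List[str]:
    """Render selected values as 'name{labels} value' lines (metric names are hypothetical)."""
    samples = [
        ("indy_reachable_nodes_count", {}, pool_info["Reachable_nodes_count"]),
        ("indy_unreachable_nodes_count", {}, pool_info["Unreachable_nodes_count"]),
        ("indy_node_uptime_seconds", {}, metrics["uptime"]),
    ]
    for ledger, count in metrics["transaction-count"].items():
        samples.append(("indy_transaction_count", {"ledger": ledger}, count))
    for ledger_id, age in freshness_ages(freshness_status, now).items():
        samples.append(("indy_freshness_age_seconds", {"ledger_id": ledger_id}, age))

    lines = []
    for name, labels, value in samples:
        all_labels = {"node": node_name, **labels}
        label_text = ",".join(f'{key}="{val}"' for key, val in all_labels.items())
        lines.append(f"{name}{{{label_text}}} {value}")
    return lines
```

An alert on indy_freshness_age_seconds exceeding FRESHNESS_LIMIT_SECONDS would then cover the freshness requirement. The now argument must be timezone-aware, because the reported timestamps carry a UTC offset.
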