hyperledger / indy-node-monitor
License: Apache License 2.0
Wrap a full monitoring stack around Indy Node Monitor so data can be continuously collected and visualized.
Work has started on a fully containerized monitoring stack to include:
Work streams can be found here:
The stack can be easily spun up using the ./manage script supplied in the monitoring-stack branch. Dashboards are auto-provisioned and Grafana is launched in a browser automatically.
Work on this monitoring stack was put on hold for a while due to other priorities, but is now resuming.
This issue documents the locations of the work done so far. The work currently lacks detailed documentation of the architecture and development processes, as well as detailed user documentation. Current usage documentation is available from the manage script: ./manage -h
Add MTR (My Traceroute) type diagnostic capabilities into the indy-node-monitor CLI and API. For use when troubleshooting network connectivity to a network or individual nodes.
Features should also include the ability to determine and list the current node configuration for a network based on the current pool transaction information. This would list all of the nodes contained in the pool transactions file along with their current details (e.g. IP addresses and validator status, active or inactive).
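A minimal sketch of the node listing described above, assuming the pool transactions file holds one JSON transaction per line with the node details nested under txn.data.data and the node identifier under txn.data.dest. The sample entries, the helper name list_pool_nodes, and the output keys are illustrative assumptions, not part of indy-node-monitor.

```python
import json

# Hypothetical, trimmed-down pool transactions; real entries carry more fields.
SAMPLE_POOL_TXNS = [
    '{"txn": {"type": "0", "data": {"dest": "dest-1", "data": {"alias": "Node1", '
    '"node_ip": "10.0.0.1", "node_port": 9701, "client_ip": "10.0.1.1", '
    '"client_port": 9702, "services": ["VALIDATOR"]}}}}',
    '{"txn": {"type": "0", "data": {"dest": "dest-2", "data": {"alias": "Node2", '
    '"node_ip": "10.0.0.2", "node_port": 9703, "client_ip": "10.0.1.2", '
    '"client_port": 9704, "services": []}}}}',
]


def list_pool_nodes(txn_lines):
    """Return {alias: config} built from pool transaction lines (assumed layout)."""
    merged = {}
    for line in txn_lines:
        line = line.strip()
        if not line:
            continue
        txn_data = json.loads(line)["txn"]["data"]
        # Later transactions for the same node override earlier field values.
        merged.setdefault(txn_data["dest"], {}).update(txn_data["data"])
    return {
        node["alias"]: {
            "node_ip": node.get("node_ip"),
            "node_port": node.get("node_port"),
            "client_ip": node.get("client_ip"),
            "client_port": node.get("client_port"),
            # Assumption: a node is an active validator when its services
            # list contains "VALIDATOR".
            "active": "VALIDATOR" in node.get("services", []),
        }
        for node in merged.values()
    }


if __name__ == "__main__":
    for alias, config in list_pool_nodes(SAMPLE_POOL_TXNS).items():
        print(alias, config)
```

Merging by node identifier rather than by line means that a later transaction that demotes a node (an empty services list) is reflected in the reported status.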
Update Readme to indicate what analysis is performed by the scripts and how it's reported.
Update the documentation to clarify the use of the Trustee seed when using von-network in the initial demo. The use of the Trustee seed in this case is simply for convenience. It is a well-known seed used by von-network, and it has the privileges to fetch the extended validator info data from the nodes. In real life you would use a NETWORK_MONITOR seed/DID for this purpose.
It is important for the nodes to have their system clocks synchronized. If they get too far out of sync, it affects their ability to form consensus.
The timestamp reported by a node during an authenticated validator-info call indicates the node's system time. The timestamps of all of the nodes should be compared, and when a clock skew of more than a few seconds (actual threshold TBD) is detected, a warning (in the warnings list) should be generated.
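The comparison described above could be sketched as follows. This is only an illustration: it assumes each node's timestamp has already been extracted as Unix seconds, compares every node against the pool median, and uses a placeholder threshold of 5 seconds because the actual value is still TBD. The function name and the warning text are hypothetical.

```python
from statistics import median

# Placeholder threshold; the issue leaves the actual value TBD.
MAX_SKEW_SECONDS = 5


def find_clock_skew(node_timestamps, max_skew=MAX_SKEW_SECONDS):
    """Return warning strings for nodes whose clock is far from the pool median.

    node_timestamps maps a node alias to the Unix timestamp (in seconds)
    that the node reported.
    """
    if len(node_timestamps) < 2:
        return []  # nothing to compare against
    reference = median(node_timestamps.values())
    warnings = []
    for alias, timestamp in sorted(node_timestamps.items()):
        skew = timestamp - reference
        if abs(skew) > max_skew:
            warnings.append(
                f"{alias}: clock differs from the pool median by {skew:+.1f}s"
            )
    return warnings


if __name__ == "__main__":
    sample = {"Node1": 1000, "Node2": 1001, "Node3": 1002, "Node4": 1030}
    for warning in find_clock_skew(sample):
        print(warning)
```

Because the nodes are not queried at exactly the same instant, part of any measured difference is request latency, so the threshold needs to leave room for that.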
This is to address NTP time sync discussions here; #24 (comment)
The implementation should build on the analysis plugin once merged; #26
Additional Requirements:
Record and report network and node response times.
An older version of Python and a missing dependency cause an error when running run.sh
Traceback (most recent call last):
File "main.py", line 29, in <module>
monitor_plugins = PluginCollection('plugins')
File "/home/indy/plugin_collection.py", line 69, in __init__
self.reload_plugins()
File "/home/indy/plugin_collection.py", line 78, in reload_plugins
self.walk_package(self.plugin_package)
File "/home/indy/plugin_collection.py", line 126, in walk_package
self.walk_package(package + '.' + child_pkg)
File "/home/indy/plugin_collection.py", line 100, in walk_package
plugin_module = __import__(pluginname, fromlist=['blah'])
File "/home/indy/plugins/metrics/google_sheets.py", line 3, in <module>
import gspread
File "/home/indy/.local/lib/python3.7/site-packages/gspread/__init__.py", line 7, in <module>
from .auth import (
File "/home/indy/.local/lib/python3.7/site-packages/gspread/auth.py", line 19, in <module>
from .client import Client
File "/home/indy/.local/lib/python3.7/site-packages/gspread/client.py", line 17, in <module>
from .spreadsheet import Spreadsheet
File "/home/indy/.local/lib/python3.7/site-packages/gspread/spreadsheet.py", line 15, in <module>
from .worksheet import Worksheet
File "/home/indy/.local/lib/python3.7/site-packages/gspread/worksheet.py", line 11, in <module>
from typing import (
ImportError: cannot import name 'Literal' from 'typing' (/usr/lib/python3.7/typing.py)
As of roughly the week of 2023-03-11 to 2023-03-18, building indy-node-monitor from scratch with non-cached build artefacts pulls fastapi==0.95.0, which requires starlette==0.26.1.
This causes the indy-node-monitor process to crash on startup, with a stack trace containing the following kind of error:
File "/home/indy/rest_api.py", line 80, in <module>
async def network(network: Network = Path(example_network_enum, example=example_network_name, description="The network code."),
File "/home/indy/.local/lib/python3.7/site-packages/fastapi/param_functions.py", line 42, in Path
**extra,
File "/home/indy/.local/lib/python3.7/site-packages/fastapi/params.py", line 83, in __init__
assert default is ..., "Path parameters cannot have a default value"
AssertionError: Path parameters cannot have a default value
[2023-03-24 20:06:52 +0000] [11] [INFO] Worker exiting (pid: 11)
This happens because creating a FastAPI Path object with code like:
Path(Network.sbn, example="sbn", description="The network code.")
was probably relying on a bug in the Path API, which has since been corrected.
The correct syntax is:
Path(path=Network.sbn, example="sbn", description="The network code.")
A PR to correct this problem is on its way.
The objective of this exercise is to determine what information provided by the validator-info call can be used to measure compliance with the Sovrin Technical Policies.
An example might be to determine whether nodes have two public IPs, for node and client traffic respectively, as per the technical policy.
Further analysis (i.e. a gap analysis) is required to determine what is already available versus what is missing. The outcome might be requested changes to the output provided by the validator-info call from indy-vdr.
This is a follow-up issue to 61.
Issue 61 was raised because using a metric as a tag in Prometheus (or even InfluxDB) is an anti-pattern.
However, it calls into question the pertinence of using Prometheus, because Prometheus is a numeric time series data store, which cannot leverage the textual data provided by indy-node-monitor.
It also appeared that InfluxDB 2.7+ (instead of 1.8) could fill the alerting void that would be left if Prometheus were removed from the solution.
So we are investigating fully or partly replacing elements of the current (as of 2023-04-11) stack with the latest OSS version of InfluxDB (currently 2.7.0).
For reference, here is the current architecture:
flowchart TD
Grafana[/Grafana\n* data display\n* dashboards\] -- queries --> Prometheus
Grafana -- queries --> InfluxDB
Prometheus{{Prometheus &\nAlertManager\n* numeric time series\n* alerting}} -- queries --> Telegraf
Telegraf[[Telegraf\n\n* periodic data gathering\n* data dispatch]] -- stores --> InfluxDB{{InfluxDB\n* numeric time series\n* text-based time series}}
Telegraf -- queries --> Indy-Node-Monitor[Indy-Node-Monitor\n* status/metrics\ngathering interface] --> IndyClusters[\Indy Clusters/]
Indy-Node-Monitor --> IndyClusters
Indy-Node-Monitor --> IndyClusters
In order to ensure established best practices we want to ensure a node has been configured to use separate NICs for node and client communications, and they have been configured with different IP addresses on different subnets.
There can be two levels to this:
client-address and node-address fields). This data is based on the registration information for the node on the ledger.
validator-info call (Node_ip and Client_ip fields). This data is based on the configuration of the node itself.
This is to address the Network interfaces (IPs and Ports) discussions here; #24 (comment)
The implementation should build on the analysis plugin once merged; #26
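A minimal sketch of the address check, assuming the node and client IPs have already been read from one of the two sources mentioned above. The subnet size is not part of that data, so the /24 prefix used here is purely an assumption, and the function name and warning texts are hypothetical.

```python
import ipaddress


def check_node_interfaces(node_ip, client_ip, prefix_len=24):
    """Return warnings about the node/client address configuration.

    prefix_len is an assumed subnet size; the data discussed above
    carries addresses only, not netmasks.
    """
    node = ipaddress.ip_address(node_ip)
    client = ipaddress.ip_address(client_ip)

    # A wildcard bind address (e.g. 0.0.0.0) says nothing about the NICs in use.
    if node.is_unspecified or client.is_unspecified:
        return ["address is a wildcard bind address; separate NICs cannot be verified"]
    if node == client:
        return ["node and client traffic share the same IP address"]

    # strict=False lets a host address stand in for its containing network.
    node_net = ipaddress.ip_network(f"{node_ip}/{prefix_len}", strict=False)
    if client in node_net:
        return [f"node and client IPs are on the same /{prefix_len} subnet"]
    return []


if __name__ == "__main__":
    print(check_node_interfaces("10.0.0.5", "10.0.1.5"))
    print(check_node_interfaces("10.0.0.5", "10.0.0.6"))
    print(check_node_interfaces("0.0.0.0", "0.0.0.0"))
```

The wildcard case matters in practice: a node that reports 0.0.0.0 for both addresses can only be checked against its ledger registration.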
Requirements:
warnings list) for any detected issues.
warnings list) for any detected issues.
It artificially creates different time series; is there a rationale for this?
Prometheus MVP Metrics
The Sovrin Network name being monitored
Should be able to get this from the pool being connected to
Node alias name
Detect when a node is inaccessible and produce standard output for that situation.
Should generate a timeout when trying to pull validator_info from inaccessible nodes.
Detect any nodes that are accessible but that are "unreachable" to some or all of the other Indy nodes.
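One way to sketch this detection is to aggregate the Unreachable_nodes lists that each accessible node reports, using the same [alias, value] pair shape as the sample data in this issue. The input structure (a mapping from the reporting node's alias to its pool information) and the function name are assumptions made for illustration.

```python
def unreachable_report(pool_info_by_node):
    """Map each unreachable node alias to the list of nodes that cannot reach it.

    pool_info_by_node maps a reporting node's alias to a dict that holds an
    "Unreachable_nodes" list of [alias, value] pairs.
    """
    report = {}
    for reporter, pool_info in pool_info_by_node.items():
        for entry in pool_info.get("Unreachable_nodes", []):
            target = entry[0]  # first element of the pair is the node alias
            report.setdefault(target, []).append(reporter)
    return report


if __name__ == "__main__":
    sample = {
        "Node1": {"Unreachable_nodes": [["Node2", None]]},
        "Node3": {"Unreachable_nodes": [["Node2", None]]},
        "Node4": {"Unreachable_nodes": []},
    }
    for target, reporters in unreachable_report(sample).items():
        print(f"{target} is unreachable from: {', '.join(reporters)}")
```

Grouping by the unreachable node makes it possible to tell a node that is cut off from the whole pool apart from one that only a single peer cannot reach.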
"Reachable_nodes": [
[
"Node1",
0
],
[
"Node3",
null
],
[
"Node4",
null
]
],
"Unreachable_nodes": [
[
"Node2",
null
]
],
"Reachable_nodes_count": 3,
"Unreachable_nodes_count": 1,
"transaction-count": {
"ledger": 21,
"pool": 4,
"config": 0,
"audit": 1042
},
"throughput": {
"0": 0.0017547843
},
"master throughput": 0.0017547843,
"total requests": 16,
"avg backup throughput": null,
"master throughput ratio": null,
"average-per-second": {
"read-transactions": 0.0338584473,
"write-transactions": 0.0001539895
},
"throughput": {
"0": 0.0017547843
},
"master throughput": 0.0017547843,
"total requests": 16,
"avg backup throughput": null,
"master throughput ratio": null,
"average-per-second": {
"read-transactions": 0.0338584473,
"write-transactions": 0.0001539895
},
"transaction-count": {
"ledger": 21,
"pool": 4,
"config": 0,
"audit": 1042
},
"uptime": 103903
},
"Freshness_status": {
"1": {
"Last_updated_time": "2020-07-06 23:55:07+00:00",
"Has_write_consensus": true
},
"0": {
"Last_updated_time": "2020-07-06 23:57:33+00:00",
"Has_write_consensus": true
},
"2": {
"Last_updated_time": "2020-07-06 23:57:33+00:00",
"Has_write_consensus": true
}
}
"Node_info": {
"Name": "Node4",
"Mode": "participating",
"Client_port": 9708,
"Client_ip": "0.0.0.0",
"Client_protocol": "tcp",
"Node_port": 9707,
"Node_ip": "0.0.0.0",
"Pool_info": {
"Read_only": false,
"Total_nodes_count": 4,
To use indy-node-monitor as a cron job, the -ti parameters should be removed from the docker command in run.sh.
It may be reasonable to expect indy-node-monitor to be used extensively with cron jobs in the future.
A quick fix might be a command-line argument indicating that the script is running as a cron job, which would remove the -ti parameters from the docker command.
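The idea can be illustrated with a small sketch that builds the docker argument list and leaves out -ti when the run is not interactive. It is written in Python for illustration only and is not the content of run.sh; the image name and the function name are hypothetical. As an alternative to an explicit flag, it falls back to checking whether standard input is a terminal.

```python
import sys


def docker_run_args(image, interactive=None):
    """Build a docker run argument list, adding -ti only for interactive use.

    When interactive is None, fall back to checking whether stdin is a
    terminal, which is normally not the case under cron.
    """
    if interactive is None:
        interactive = sys.stdin is not None and sys.stdin.isatty()
    args = ["docker", "run", "--rm"]
    if interactive:
        args.append("-ti")
    args.append(image)
    return args


if __name__ == "__main__":
    # "indy-node-monitor" is a placeholder image name.
    print(" ".join(docker_run_args("indy-node-monitor", interactive=False)))
```

Detecting the terminal automatically avoids having to remember the extra argument in each crontab entry, while an explicit flag keeps the behaviour predictable.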
As a companion to hyperledger/indy-node#1670, integrate the analysis portion(s) of the Technical Verification Script into a new plugin to facilitate continuous monitoring.
The implementation should build on the analysis plugin once merged; #26
This is to address the discussions regarding incorporating the collection of the additional data analyzed by the Technical Verification Script, in order to facilitate continuous compliance monitoring; #24 (comment)
Cut new release(s) to denote the significant changes that have occurred over the last while.
When not in anonymous mode, suggest that the --status result also return the following fields:
From Pool_info:
Unreachable_nodes_count
Reachable_nodes_count
Total_nodes_count
Suspicious_nodes (if it is a count, not if it is an array)
From Hardware:
HDD_used_by_node
Nice to have from the 'Extractions' node (trickier because the values are in text):
node-control-memory and node-control-tasks
indy-node-memory and indy-node-tasks
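A minimal sketch of how the suggested fields might be pulled out of a node's data. It assumes the Pool_info and Hardware sections are available as top-level keys of a dict, which may not match the actual nesting in the validator-info response. The function name and the sample values are made up for illustration.

```python
POOL_FIELDS = ("Unreachable_nodes_count", "Reachable_nodes_count", "Total_nodes_count")


def status_summary(node_data):
    """Collect the suggested status fields from a node's data (assumed layout)."""
    pool_info = node_data.get("Pool_info", {})
    hardware = node_data.get("Hardware", {})

    summary = {field: pool_info.get(field) for field in POOL_FIELDS}

    # Include Suspicious_nodes only when it is a count, not an array.
    suspicious = pool_info.get("Suspicious_nodes")
    if isinstance(suspicious, int) and not isinstance(suspicious, bool):
        summary["Suspicious_nodes"] = suspicious

    summary["HDD_used_by_node"] = hardware.get("HDD_used_by_node")
    return summary


if __name__ == "__main__":
    # Hypothetical sample values.
    sample = {
        "Pool_info": {
            "Unreachable_nodes_count": 1,
            "Reachable_nodes_count": 3,
            "Total_nodes_count": 4,
            "Suspicious_nodes": [],
        },
        "Hardware": {"HDD_used_by_node": "2 MBs"},
    }
    print(status_summary(sample))
```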