
tenderduty's Introduction

TenderDuty v2


Tenderduty is a comprehensive monitoring tool for Tendermint chains. Its primary function is to alert a validator when they are missing blocks, and it has many other features.

v2 is a complete rewrite of the original tenderduty, graciously sponsored by the Osmosis Grants Program. This new version adds a web dashboard, a Prometheus exporter, Telegram and Discord notifications, multi-chain support, more granular alerting, and more types of alerts.

dashboard screenshot

Documentation

The documentation is a work-in-progress.

Runtime options:

$ tenderduty -h
Usage of tenderduty:
  -example-config
    	print an example config.yml and exit
  -f string
    	configuration file to use (default "config.yml")
  -state string
    	file for storing state between restarts (default ".tenderduty-state.json")
  -cc string
    	directory containing additional chain specific configurations (default "chains.d")

Installing

Detailed installation info is in the installation doc.

30 second quickstart if you already have Docker installed:

mkdir tenderduty && cd tenderduty
docker run --rm ghcr.io/blockpane/tenderduty:latest -example-config >config.yml
# edit config.yml and add chains, notification methods etc.
docker run -d --name tenderduty -p "8888:8888" -p "28686:28686" --restart unless-stopped -v $(pwd)/config.yml:/var/lib/tenderduty/config.yml ghcr.io/blockpane/tenderduty:latest
docker logs -f --tail 20 tenderduty

Split Configuration

For validators with many chains, chain-specific configuration may be split into additional files and placed in the directory "chains.d".

This directory can be changed with the -cc option.

The user-friendly chain label will be taken from the name of the file.

For example:

chains.d/Juno.yml -> Juno
chains.d/Lum Network.yml -> Lum Network

The configuration inside chains.d/Network.yml is the chain's YAML contents without the enclosing chain label.

For example, start directly with:

chain_id: demo-1
valoper_address: demovaloper...

Contributions

Contributions are welcome; please open pull requests against the 'develop' branch, not main.

tenderduty's People

Contributors

alchemydc, blockpane, chillyvee, czarcas7ic, deepsourcebot, dylanschultzie, elsehow, ericjohncarlson, jeongseup, leonoorscryptoman


tenderduty's Issues

Bug: Restarting tenderduty while an alert is active causes that alert to be stuck forever and never produce any new alerts (Critical)

Problem

Restarting tenderduty while an alert is active causes that alert to be stuck forever and never produce any new alerts.

image

image

The pictures above clearly show that the "xxx has missed 10 blocks on mocha" condition is already resolved. In contrast, tenderduty still shows the alert as unresolved, and neither PagerDuty nor Telegram receives a resolve event.

Moreover, it never produces a "missed 10 blocks" alert again in the future.

Impact

The alert is stuck forever and never triggers a new alert, which puts our validator node at risk of being jailed and slashed. (Critical)

Steps to reproduce

  1. Run both tenderduty and the validator normally
  2. Stop the validator and wait for it to trigger the alert "ALERT: - xxx has missed xx blocks on xxx"
  3. Stop tenderduty
  4. Restart tenderduty
  5. Restart the validator
  6. Notice that the alert doesn't get resolved
  7. Repeat these steps and notice that a new alert never gets sent

Possible HTTP client resource leak

Reported error is: http: Accept error: accept tcp [::]:8888: accept4: too many open files; retrying in 1s

This is likely caused by an HTTP GET response body not being closed. No more context was given.
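
In Go, this symptom usually points to HTTP response bodies that are never closed: each leaked body keeps a socket (and file descriptor) open until the process hits the OS limit. Below is a minimal sketch of the correct pattern, not taken from tenderduty's code and using a placeholder RPC URL:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// fetchStatus performs a GET and always closes the response body,
// which releases the underlying connection and file descriptor.
func fetchStatus(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	// Forgetting this deferred Close leaks one descriptor per call and
	// eventually produces "accept4: too many open files".
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetchStatus("http://localhost:26657/status") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(body), "bytes")
}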

Websocket Bad Handshake Error

OS: Ubuntu 20.04
Go Version: go1.18.1 linux/amd64

Issue:
Tenderduty cannot connect to the remote RPC server and fails with a "websocket: bad handshake" error.

Command:
./tenderduty -c darcvalcons12345... -p <pagerduty_api_key> -u "https://node1.konstellation.tech:26657"

Expected output:
Tenderduty connects to the RPC server and starts monitoring the address

Actual Outcome:
Tenderduty fails to run with the following error:

tenderduty   | 2022/04/26 15:53:27 main.go:88: checking that at least 1 of 1 endpoints are available
tenderduty   | 2022/04/26 15:53:27 main.go:183: Warning: could not connect to https://node1.konstellation.tech:26657 websocket: bad handshake
tenderduty   | 2022/04/26 15:53:27 main.go:91: FATAL:no RPC endpoints are available

Workarounds:
None found so far

Notes:
This is also occurring when run in a Docker container.
The RPC server appears to be functioning normally outside of Tenderduty.

Feature: Disk space / resource monitoring via node_exporter Prometheus metrics

I also use PANIC, and it has something really cool that I think tenderduty should have too, since people too often run out of disk space.

image

In the past I set up these alerts in Grafana, but later I switched to just using PANIC since it was really useful. I would love to see the same in tenderduty -- that would allow me to fully switch to tenderduty.

Setting up https node URLs: could not dial ws client, resulting in bad handshake

2022/08/15 14:20:29 tenderduty |  ⚙️ auravaloper1h2ney6qy ... is using consensus key: auravalcons1dnjfhr2vj74fd9vxray3wtvhgjc454sz7dlktw
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://aura-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  euphoria-1 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  ⚙️ found torivaloper1ykeu0kv4zuz9v74l34c9ca7k2mdpmh5yycr3eh (kjnodes) in validator set
2022/08/15 14:20:29 tenderduty |  ⚙️ torivaloper1ykeu0kv4 ... is using consensus key: torivalcons1cqry0nd6ayly2adm05p6r3w9a3ktmd5rghqrz8
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://teritori-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  teritori-testnet-v2 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  ⚙️ found rebusvaloper19y5d84d033f5vcnvcyxzw4zc9chlatkkcnesg9 (kjnodes) in validator set
2022/08/15 14:20:29 tenderduty |  ⚙️ rebusvaloper19y5d84d ... is using consensus key: rebusvalcons1pckg8f85m86y4j4902n2ucwpdpeu37xlasmdgw
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://rebus-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  reb_3333-1 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  ⚙️ found palomavaloper13uslh0y22ffnndyr3x30wqd8a6peqh25m8p743 (kjnodes) in validator set
2022/08/15 14:20:29 tenderduty |  ⚙️ palomavaloper13uslh0 ... is using consensus key: palomavalcons1j5v34hyr9qekezc4cpq0zvmftg568lcymedtw2
2022/08/15 14:20:29 tenderduty |  ⚙️ found celestiavaloper13y5d6t28kumgmsr6jhmtj0s8p08yqvel70gyyp (kjnodes) in validator set
2022/08/15 14:20:29 tenderduty |  ⚙️ celestiavaloper13y5d ... is using consensus key: celestiavalcons14ewk8w79karwka070j8nua27efdgr6l72p6ked
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://paloma-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  paloma-testnet-6 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://celestia-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  mamaki 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  ⚙️ found dewebvaloper190y0pl94cy3xec3dp7d6h6yy5uagm4t5qthddp (kjnodes) in validator set
2022/08/15 14:20:29 tenderduty |  ⚙️ dewebvaloper190y0pl9 ... is using consensus key: dewebvalcons1fu45f0wgr5ksrapwqkrgh02y8fpc0mmcwc04pu
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://dws-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  deweb-testnet-1 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  ⚙️ found ccvaloper1ehad6hausea0qc8t3amxlpqs5rx2c05z4mhnvm (kjnodes) in validator set
2022/08/15 14:20:29 tenderduty |  ⚙️ ccvaloper1ehad6hause ... is using consensus key: ccvalcons1ejdqpps56hg87fl8584vrm397ss05q9qexh3a2
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://cardchain-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  Cardchain 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  ❌ seivaloper1828et4w3u52na4h8m3mp8r22vjylxeaa5d36ca (kjnodes) is INACTIVE
2022/08/15 14:20:29 tenderduty |  ⚙️ seivaloper1828et4w3u ... is using consensus key: seivalcons1hssfezdle4ajj5e5g0z7g7qejftp24gzla0fe5
2022/08/15 14:20:29 tenderduty |  could not dial ws client to wss://sei-testnet.rpc.kjnodes.com:443/websocket: websocket: bad handshake
2022/08/15 14:20:29 tenderduty |  atlantic-1 🌀 websocket exited! Restarting monitoring
2022/08/15 14:20:29 tenderduty |  🟢 Stride       node https://stride-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Aura         node https://aura-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Teritori     node https://teritori-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Deweb        node https://dws-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Celestia     node https://celestia-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Rebus        node https://rebus-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Sei          node https://sei-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Paloma       node https://paloma-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Uptick       node https://uptick-testnet.rpc.kjnodes.com:443 is healthy
2022/08/15 14:20:29 tenderduty |  🟢 Cardchain    node https://cardchain-testnet.rpc.kjnodes.com:443 is healthy

Configuration doesn't work as expected.

I set up a global Discord webhook. The per-chain webhook option says it 'uses default if blank', but in fact this doesn't work: if I don't set the chain-level option, it reports that no webhook was found.

There is also a design issue: 'enabled' and 'webhook' look coupled, but one is a global option (which overrides the chain option) and the other is a default option (which is overridden by the chain option). This doesn't make sense and needs to be reorganized.

Upgrade (v13) Time Reached, No Alarm

Running the latest version

2022/12/08 05:44:23 tenderduty |  🧊 osmosis-1    block 7241420
2022/12/08 05:46:21 tenderduty |  🧊 osmosis-1    block 7241440
2022/12/08 05:48:16 tenderduty |  🧊 osmosis-1    block 7241460
2022/12/08 05:50:12 tenderduty |  🧊 osmosis-1    block 7241480 _(v13 Upgrade Height)_
2022/12/08 05:53:46 tenderduty |  🛑 osmosis-1 websocket idle for 1 minute, exiting
2022/12/08 05:53:46 tenderduty |  osmosis-1 🌀 websocket exited! Restarting monitoring
2022/12/08 05:53:51 tenderduty |  ⚙️ osmosis-1    watching for NewBlock and Vote events via http://192.168.0.56:31657
2022/12/08 05:54:51 tenderduty |  🛑 osmosis-1 websocket idle for 1 minute, exiting
2022/12/08 05:54:51 tenderduty |  osmosis-1 🌀 websocket exited! Restarting monitoring
2022/12/08 05:54:56 tenderduty |  ⚙️ osmosis-1    watching for NewBlock and Vote events via http://192.168.0.56:31657
2022/12/08 05:55:56 tenderduty |  🛑 osmosis-1 websocket idle for 1 minute, exiting
2022/12/08 05:55:56 tenderduty |  osmosis-1 🌀 websocket exited! Restarting monitoring

Then it just looped like that until, 2.5 hours later, I got another alert that the node was down and restarted it with the upgraded binary (we don't use Cosmovisor, and the upgrade height came earlier than expected).

2022/12/08 08:39:41 tenderduty |  osmosis-1 🌀 websocket exited! Restarting monitoring
2022/12/08 08:39:46 tenderduty |  ⚙️ osmosis-1    watching for NewBlock and Vote events via http://192.168.0.56:31657
2022/12/08 08:40:46 tenderduty |  🛑 osmosis-1 websocket idle for 1 minute, exiting
2022/12/08 08:40:46 tenderduty |  osmosis-1 🌀 websocket exited! Restarting monitoring
2022/12/08 08:40:51 tenderduty |  ⚙️ osmosis-1    watching for NewBlock and Vote events via http://192.168.0.56:31657
2022/12/08 08:41:25 tenderduty |  osmosis-1 🌀 websocket exited! Restarting monitoring
2022/12/08 08:41:30 tenderduty |  osmosis-1 ⛑ attemtping to use public fallback node https://rpc-osmosis.ecostake.com:443
2022/12/08 08:41:31 tenderduty |  osmosis-1 ⛑ connected to public endpoint https://rpc-osmosis.ecostake.com:443
2022/12/08 08:41:34 tenderduty |  ⚙️ osmosis-1    watching for NewBlock and Vote events via https://rpc-osmosis.ecostake.com:443
2022/12/08 08:41:35 tenderduty |  🚨 ALERT        new alarm on Osmosis (Validator has missed > 2% of the slashing window's blocks on osmosis-1) - notifying PagerDuty
2022/12/08 08:41:44 tenderduty |  🧊 osmosis-1    block 7243020
2022/12/08 08:41:44 tenderduty |  ❌ warning      Validator missed block 7243020 on osmosis-1
2022/12/08 08:41:44 tenderduty |  ❌ warning      Validator  missed block 7243021 on osmosis-1
2022/12/08 08:41:50 tenderduty |  ❌ warning      Validator  missed block 7243022 on osmosis-1

Settings:
image

Endpoint metrics are not published when no RPC endpoints are available.

The metrics tenderduty_total_monitored_endpoints and tenderduty_total_unhealthy_endpoints have no datapoints when RPC endpoints are not available.

This can easily be reproduced by adding an extra character to an endpoint URL within a chain's configuration; after that, those metrics filtered by the chain label are empty.

PagerDuty alert is triggered and immediately resolved for jailed node

We're currently jailed on Terra2's pisco-1 testnet because I was too slow with a binary upgrade.

Starting the new v2 tenderduty targeting our node produced the following log output:

tenderduty-v2-1  | 2022/07/21 08:04:29 tenderduty |  ❌ terravaloper12y453v20w2dlv2y0vlepf88ng0r6sfsz6k7qm4 (Sapient Nodes) is INACTIVE
tenderduty-v2-1  | 2022/07/21 08:04:29 tenderduty |  ⚙️ terravaloper12y453v2 ... is using consensus key: terravalcons15mh8ypm30as0ludpw0hfuc08r6fyejl8983stz
tenderduty-v2-1  | 2022/07/21 08:04:29 tenderduty |  ⚙️ pisco-1      watching for NewBlock and Vote events via tcp://10.0.0.4:26657
tenderduty-v2-1  | 2022/07/21 08:04:45 tenderduty |  🧊 pisco-1      block 891920
tenderduty-v2-1  | 2022/07/21 08:05:03 tenderduty |  🚨 ALERT        new alarm on Terra2Testnet (RPC node tcp://10.0.0.4:26657 has been down for > 3 minutes on pisco-1) - notifying PagerDuty
tenderduty-v2-1  | 2022/07/21 08:05:03 tenderduty |  🟢 Terra2Testnet node tcp://10.0.0.4:26657 is healthy
tenderduty-v2-1  | 2022/07/21 08:05:05 tenderduty |  💜 Resolved     alarm on Terra2Testnet (RPC node tcp://10.0.0.4:26657 has been down for > 3 minutes on pisco-1) - notifying PagerDuty
tenderduty-v2-1  | 2022/07/21 08:06:03 tenderduty |  🟢 Terra2Testnet node tcp://10.0.0.4:26657 is healthy

So I got an alert, which I think is correct, but then it was immediately resolved?
Given that I still haven't unjailed our validator, I would assume the alert should persist.

UPX not found

 => ERROR [builder 6/6] RUN upx tenderduty && upx -t tenderduty                                                                                                                                                                                                           0.4s 
------                                                                                                                                                                                                                                                                         
 > [builder 6/6] RUN upx tenderduty && upx -t tenderduty:                                                                                                                                                                                                                      
#0 0.343 /bin/sh: 1: upx: not found                                                                                                                                                                                                                                            
------                                                                                                                                                                                                                                                                         
Dockerfile:8                                                                                                                                                                                                                                                                   
--------------------
   6 |     
   7 |     RUN go get ./... && go build -ldflags "-s -w" -trimpath -o tenderduty main.go
   8 | >>> RUN upx tenderduty && upx -t tenderduty
   9 |     
  10 |     # 2nd stage, create a user to copy, and install libraries needed if connecting to upstream TLS server
--------------------
ERROR: failed to solve: process "/bin/sh -c upx tenderduty && upx -t tenderduty" did not complete successfully: exit code: 127
ERROR: Service 'tenderduty' failed to build : Build failed

'routing_key' cannot contain non-alphanumeric characters

$ td -p blahkeyblah -test -u http://192.168.0.102:26657 -c blahblahvalconsxxxxxxxxxxxxxxxxxxxx
tenderduty   | 2022/02/16 15:52:52 main.go:58: Sending trigger event
2022/02/16 15:52:53 HTTP Status Code: 400, Message: {"status":"invalid event","message":"Event object is invalid","errors":["Length of 'routing_key' is incorrect (should be 32 characters)","'routing_key' cannot contain non-alphanumeric characters."]}

The key used is a v2 key. Examples (already removed):
e+23rschqsESqk9Q6_-g
e+kd5BWM-NCknhB3XnYg

I understand that this message is generated by a PagerDuty library, but is it possible the library has not been updated to accept v2 keys?

Validator status and blockheight prometheus metrics

Hello. I'm creating a Grafana dashboard from tenderduty's exported metrics and noticed that not all of them are there: validator status (bonded / jailed, etc.) and current block height are missing. It would be great to have them. Thanks!

Issue: not getting status data on Sei testnet network (Atlantic 2)

When using it with the Sei Atlantic-2 testnet network, fetching status seems to fail because of JSON unmarshalling of a date, so block data cannot be retrieved or displayed.

Tenderduty version v2.2.0 compiled from source.
Go version go1.19 linux/amd64

โŒ could not get status for z_Sei Test: (tcp://5.9.78.252:26657) error unmarshalling result: JSON time must be UTC and end with 'Z', but got "\"2023-04-05T11:14:22.842212738+02:00\""

image
image

panic: pubkey is incorrect size

Receiving this error for a Band Protocol setup.

Error stack

panic: pubkey is incorrect size
github.com/cosmos/cosmos-sdk/crypto/keys/ed25519.(*PubKey).Address(0xc001145bd0?)
/var/lib/tenderduty/go/pkg/mod/github.com/cosmos/[email protected]/crypto/keys/ed25519/ed25519.go:161 +0x9a
github.com/blockpane/tenderduty/v2/td2.getVal({0x14052e0, 0xc00104e0c0}, 0xc0004f85a0, {0xc000f84940, 0x32})
/var/lib/tenderduty/tenderduty/td2/validator.go:159 +0x19d
github.com/blockpane/tenderduty/v2/td2.(*ChainConfig).GetValInfo(0xc000f8e900, 0x1)
/var/lib/tenderduty/tenderduty/td2/validator.go:43 +0x105
github.com/blockpane/tenderduty/v2/td2.Run.func4(0xc000f8e900, {0xc00027eff4, 0x4})
/var/lib/tenderduty/tenderduty/td2/run.go:100 +0x1dc
created by github.com/blockpane/tenderduty/v2/td2.Run
/var/lib/tenderduty/tenderduty/td2/run.go:81 +0x46e

I recreated this with multiple validator addresses from Band; here is one example: bandvaloper1wwl9d8gvpktdjhdpzxe9yupunpkthzayysf2hn

https://www.mintscan.io/band/validators/bandvaloper1wwl9d8gvpktdjhdpzxe9yupunpkthzayysf2hn
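
For context, the cosmos-sdk ed25519 PubKey.Address call panics when the key bytes are not exactly 32 bytes, which matches the trace above. A defensive check along these lines would surface the problem as an error instead of a crash; this is only a sketch of the guard, not tenderduty's actual fix:

package main

import (
	"fmt"

	"github.com/cosmos/cosmos-sdk/crypto/keys/ed25519"
)

// safeConsAddress derives the consensus address only when the key has the
// expected 32-byte length, instead of letting Address() panic.
func safeConsAddress(pk *ed25519.PubKey) (string, error) {
	if len(pk.Key) != ed25519.PubKeySize {
		return "", fmt.Errorf("unexpected ed25519 pubkey size %d, want %d", len(pk.Key), ed25519.PubKeySize)
	}
	return pk.Address().String(), nil
}

func main() {
	// Deliberately malformed (hypothetical) key to show the guarded error path.
	addr, err := safeConsAddress(&ed25519.PubKey{Key: []byte{0x01, 0x02}})
	fmt.Println(addr, err)
}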

Pass Source Of Alert

If I have tenderduty running on multiple servers (for redundancy), it would be useful to be able to pass something similar to payload.source in the API, so that I'd know whether the service is really down or just the RPC relied upon by the monitoring server in question is unavailable.

As I see it, I'd run two RPCs from each monitoring server, as in:
Monitoring Server A: uses its LAN RPC servers
Monitoring Server B: uses its LAN RPC servers

I'd like to know that the validator is indeed down, not just the local RPC server. This is only possible when I get an alert from both Server A and Server B.

Update Recommended Installation Procedure?

"it's not possible to use 'go install' remotely because tendermint uses replace directives in the go.mod file. It's necessary to clone the repo and build manually." but then later it is recommended (go install here)

Which is the proper install procedure? I ask because both go get ./... && go build -ldflags '-s -w' -trimpath -o ~/go/bin/tenderduty main.go and go install worked, but each method yields a binary with a different hash.

Issue: Segmentation fault on M1 Mac when running docker image

Env

  • MacBook Pro (14-inch, 2021)
  • MacOS 12.1
  • Docker Desktop 4.10.0 (82025)
`docker info`
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.8.2)
  compose: Docker Compose (Docker Inc., v2.6.1)
  extension: Manages Docker extensions (Docker Inc., v0.2.7)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 11
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.104-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 3.841GiB
 Name: docker-desktop
 ID: IBXS:EE4X:D3YC:TTQR:H6VP:IOUQ:7KPN:F3YU:Z7X5:KNCI:FE7V:G7GA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false

Steps

docker run --rm ghcr.io/blockpane/tenderduty:latest -example-config >config.yml
# or just bash
docker run --rm -it  --platform linux/amd64 ghcr.io/blockpane/tenderduty:latest bash

# Edit 1: The bash command was incorrect before, but after the command was fixed it still produced the same issue

Outputs

The warning is normal and can be ignored

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

Feature: Support minimum-signed threshold

Purpose

The core purpose of tenderduty is to check a validator's uptime, but it is inconvenient to intuitively understand the risk level. For example, we run a lot of validators and each protocol uses a different minSignedPerWindow. It's very annoying to query the nodes manually, so we added a feature to visually show the threshold.

threshold

Impact

Intuitively understand the risk level of your validators.
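
For reference, the threshold this request refers to is the x/slashing module's min_signed_per_window parameter, which can be queried from any node. A minimal sketch, assuming a standard Cosmos SDK gRPC endpoint (the localhost:9090 address is a placeholder):

package main

import (
	"context"
	"fmt"
	"log"

	slashingtypes "github.com/cosmos/cosmos-sdk/x/slashing/types"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Placeholder gRPC endpoint; point this at the chain's gRPC port.
	conn, err := grpc.Dial("localhost:9090", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Query the x/slashing parameters, which include the signing window and threshold.
	res, err := slashingtypes.NewQueryClient(conn).Params(context.Background(), &slashingtypes.QueryParamsRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("signed_blocks_window:", res.Params.SignedBlocksWindow)
	fmt.Println("min_signed_per_window:", res.Params.MinSignedPerWindow)
}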

Recovery from node-down monitor not detected

Not sure if this is intended. The node-down notification works only the first time, once the 3-minute limit has expired.

There is no recovery alert, and there is no subsequent down alert if I take the node down again after 3 minutes.

11:47:51 AM - ⚠️ Terra2.0-testnet node tcp://10.30.1.3:26657 is down
11:46:51 AM - ⚠️ Terra2.0-testnet node tcp://10.30.1.3:26657 is down
11:45:51 AM - ⚠️ Terra2.0-testnet node tcp://10.30.1.3:26657 is down
11:44:51 AM - ⚠️ Terra2.0-testnet node tcp://10.30.1.3:26657 is down
11:43:51 AM - 🟢 Terra2.0-testnet node tcp://10.30.1.3:26657 is healthy
11:42:51 AM - 🟢 Terra2.0-testnet node tcp://10.30.1.3:26657 is healthy
11:41:51 AM - 🟢 Terra2.0-testnet node tcp://10.30.1.3:26657 is healthy
11:40:51 AM - 🟢 Terra2.0-testnet node tcp://10.30.1.3:26657 is healthy
11:39:51 AM - 🟢 Terra2.0-testnet node tcp://10.30.1.3:26657 is healthy
11:38:52 AM - 🚨 ALERT        new alarm on Terra2.0-testnet (RPC node tcp://10.30.1.3:26657 has been down for > 3 minutes on pisco-1) - notifying Discord
11:38:51 AM - ⚠️ Terra2.0-testnet node tcp://10.30.1.3:26657 is down

v2 not working with Nomic

It seems that v2 (I'm using the latest, v2.0.2) doesn't work with the Nomic chain anymore.

refreshing signing info for nomicvalcons1y83npk4q35f6xjav0l9qjgm6xxjmkmddwg5eh8 could not find validator nomicvalcons1y83npk4q35f6xjav0l9qjgm6xxjmkmddwg5eh8

The address is retrieved using the hex value of the address shown by an RPC node

image

and converted with https://slowli.github.io/bech32-buffer/

image

This was previously a working configuration (reported by other users, not tried by me).
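
For reference, the same hex-to-bech32 conversion that the bech32-buffer page performs can be done in a few lines of Go. This is a sketch using the cosmos-sdk bech32 helper; the hex consensus address below is a placeholder:

package main

import (
	"encoding/hex"
	"fmt"
	"log"

	"github.com/cosmos/cosmos-sdk/types/bech32"
)

func main() {
	// Placeholder hex consensus address, as shown in an RPC node's /status or /validators output.
	hexAddr := "21E30C36A82344E8D2EB3FCA0923DE8C696DB6B5"
	raw, err := hex.DecodeString(hexAddr)
	if err != nil {
		log.Fatal(err)
	}
	// Re-encode the raw bytes with the chain's valcons prefix.
	valcons, err := bech32.ConvertAndEncode("nomicvalcons", raw)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(valcons)
}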

Information disclosure vulnerability at /debug/vars

The /debug/vars endpoint, registered by Go's expvar package (and commonly exposed alongside the net/http/pprof handlers), provides a JSON dump of the values of all expvar variables. These variables can include a variety of information about the internal state of your application, such as:

  • Command-line arguments with which the program was started
  • The number of garbage collections
  • The number of goroutines
  • Memory statistics
  • Information about the last garbage collection

While this information is not typically sensitive in the sense of revealing secrets or private data, it can provide a lot of information about the internal workings of your application. This could potentially be used by an attacker to understand more about your system and find ways to exploit it.

Therefore, it's generally a good idea to not expose this endpoint in a production environment, or at least restrict access to it.
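
A common mitigation in Go services is to keep the debug handlers off the public listener entirely, serving the dashboard from its own mux and leaving /debug/* bound to localhost. This is a generic sketch of that pattern, not tenderduty's actual code:

package main

import (
	_ "expvar" // registers /debug/vars on http.DefaultServeMux
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// The public listener uses its own mux, so the /debug handlers registered
	// on the default mux are not reachable from outside.
	public := http.NewServeMux()
	public.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("dashboard\n"))
	})
	go func() {
		log.Fatal(http.ListenAndServe(":8888", public))
	}()

	// Debug endpoints stay on localhost only.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", http.DefaultServeMux))
}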

Slack Alerts Bug

Hey,

I started using Tenderduty to push alerts to Slack; however, alerts seem to bug out on Slack only, since we also have alerts going to Discord from the same instance and those are fine.

Issues:
Regardless of the value passed to node_down_alert_minutes: in the config, Slack alerts on any node downtime, even if it is only 30 seconds; Discord does not.

Alerts are repetitive and do not clear: where Discord alerts once, as it should, Slack sends 30+ messages for the same alert, and then another 30+ when the alert is resolved.

Cleanup alarm state - stalled chain

After the chain has stalled (not producing blocks for more than 10 minutes) and has then been restored, the notification is still marked active.

Observed with the Terra2 testnet upgrade today. The chain was waiting for 2/3 quorum for a couple of hours, and the alarm was raised as expected after 10 minutes. But the alarm state is not invalidated, even after a restart of Tenderduty.

10:50:29 PM - 📂 restored %s alarm state - dashboard stalled: have not seen a new block on pisco-1 in 10 minutes

Expected: Automatic alarm state cleanup, IF the chain is producing again AND the validator participates in voting/proposals as expected.

Running tenderduty behind a reverse proxy (Nginx)

I'm having some trouble getting tenderduty to act as it should when viewing the web interface through an nginx reverse proxy.

The interface is not updating "live", and after a short while the log field shows "Socket is closed, retrying /ws ..."

It looks like a websocket issue, but since I'm no nginx wizard, maybe I'm missing something in my config.

Current nginx reverse proxy config is:

location /tenderduty/ {
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_pass http://127.0.0.1:8888/;
}

Feature: display validator voting power

Voting power (in number of tokens) is very interesting data for validator operators. A change in voting power is not a matter of validator performance, but it is a very interesting event for validator profitability and sustainability, and showing it would allow operators to check only Tenderduty instead of keeping lots of explorers open just for that data.

Other tools, such as PANIC, send alerts on validator stake changes, and they are quite useful for knowing the status of your validator in that regard.
