
Comments (7)

MSalopek commented on July 16, 2024

Thank you for opening the issue.

I don't think this is specific to v17; it likely applies to all versions built on cosmos-sdk v0.47.

Does the node respond when you query /status or /block?

from gaia.

shupcode commented on July 16, 2024

https://forum.cosmos.network/t/cosmos-hub-v17-1-chain-halt-post-mortem/13899

During the chain halt, the node was still running, peering, and responding to /status and /block.

Maybe this issue should be raised in the cosmos-sdk repository.


Reecepbcups commented on July 16, 2024

I am pretty sure this is a feature of the upstream SDK, yeah. Peer exchange continues despite a non-zero exit code (baseapp recovers from it to keep the instance alive), since the node could recover in some exit 1 cases (not this one, but others).


shupcode commented on July 16, 2024

> I am pretty sure this is a feature of the upstream SDK yea. Peer Exchange continues despite a non 0 exit code (baseapp recovers from it to keep the instances alive) since it could recover in some exit 1 cases (not this one, but others).

From a devops standpoint, I believe a node that halts on an irrecoverable error and stops making progress should exit with code 1. For example, if you run the node as a container on Docker or Kubernetes, a main process that exits on an error can trigger automatic restart attempts and surface error logs in the Docker/Kubernetes monitoring systems.
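Until the node itself exits on such errors, one workaround is an external probe that turns a stalled node into a non-zero exit code the orchestrator can act on. The following is only a minimal sketch under some assumptions: the RPC listens on the default `http://localhost:26657`, the `/status` response carries `result.sync_info` with `latest_block_time` and `catching_up`, and the 60-second threshold and the names `fetch_status`, `parse_block_time` and `probe_exit_code` are hypothetical choices for illustration, not part of gaia or the SDK.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone

# RFC 3339 UTC timestamp with an optional fractional part of any length.
_BLOCK_TIME = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_block_time(value: str) -> datetime:
    """Parse latest_block_time, truncating nanoseconds to microseconds."""
    match = _BLOCK_TIME.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    base, fraction = match.groups()
    micros = int((fraction or "0")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def fetch_status(url: str = "http://localhost:26657/status") -> dict:
    """Fetch the node status (the URL is an assumed default, adjust as needed)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def probe_exit_code(status: dict, now: datetime, max_block_age_s: float = 60.0) -> int:
    """Return 0 if the node looks live, 1 if its latest block is too old."""
    sync_info = status["result"]["sync_info"]
    if sync_info["catching_up"]:
        # A syncing node legitimately reports old blocks; do not restart it.
        return 0
    age = (now - parse_block_time(sync_info["latest_block_time"])).total_seconds()
    return 0 if age <= max_block_age_s else 1
```

A probe script could then end with `sys.exit(probe_exit_code(fetch_status(), datetime.now(timezone.utc)))` and be wired into a container health check. Note that this only detects staleness; on its own it cannot tell why the node stopped producing blocks.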

When you keep a node alive after an error it cannot recover from, there are a ton of funny edge cases you need to think about.

On a node that is in an error state but still connected to peers and responding to API calls, I have seen catching_up = false on status calls. Even if we monitor latest_block_time, how do we differentiate between a legitimate chain halt for an upgrade, where latest_block_time is stale, and a fatal error that crashes the node, which also leaves latest_block_time stale?
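One partial answer is to compare the local node against other RPC endpoints. The sketch below assumes you have already computed the age of `latest_block_time` in seconds for the local node and for a few independent reference endpoints; the function name `classify_stall`, the labels it returns and the 60-second threshold are hypothetical.

```python
from typing import Iterable


def classify_stall(
    local_age_s: float,
    reference_ages_s: Iterable[float],
    max_age_s: float = 60.0,
) -> str:
    """Classify a node from block ages (seconds since latest_block_time)."""
    if local_age_s <= max_age_s:
        return "healthy"
    references = list(reference_ages_s)
    if not references:
        # Nothing to compare against, so the cause cannot be narrowed down.
        return "unknown"
    if any(age <= max_age_s for age in references):
        # Someone else is still producing blocks: the problem is local.
        return "local_stall"
    # Every endpoint we can see is stale: upgrade halt or chain-wide failure.
    return "chain_stall"
```

This separates "only my node is stuck" from "everything I can see is stuck", but it still cannot distinguish a planned upgrade halt from a chain-wide fatal error, which is exactly the gap a non-zero exit code would help close.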

Furthermore, is there really a point in staying connected to peers during a fatal error that crashes a node?

Surely more work is needed to separate errors that are recoverable from errors that should immediately terminate the node and set off a cascade of warnings and errors in monitoring systems.


MSalopek commented on July 16, 2024

I agree with the points above.

Can you open this issue on cosmos-sdk? This is out of scope for a chain repository.

Please tag me on the new issue if you need further assistance from us.


tac0turtle commented on July 16, 2024

How was the fatal halt conveyed here? Through apphash or lastresulthash mismatches? It seems the framework acted as intended.


MSalopek commented on July 16, 2024

Discussion continues on cosmos-sdk.

Closing.

