
Comments (4)

nagisa commented on July 1, 2024

Crash early does work well in e.g. web applications where an esoteric request will at worst bring down a single server and something more reliable, like a load balancer, will reroute the rest of the traffic to the other instances.

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

I do like the mechanism myself quite a bit too, but I fear that with how neard and the near protocol are structured, we cannot really take advantage of it in many places of our code base.

No piece of code should rely on graceful shutdown in order to function correctly.

While reliance on crash early is a good motivation for the code to not rely on graceful termination, external factors like SIGKILL or just somebody pulling the power plug are already there. So the code that's not meant to corrupt state should not assume graceful termination regardless.


In the Rust world specifically, explicit errors always end up being part of the type signature, and are thus at least somewhat documented. Panics, asserts and such, on the other hand, are meant to be ignored. In fact, you could not do anything but ignore them all the way until the 1.9.0 release, which stabilized std::panic::catch_unwind!

This usually means that for most Rust projects any unwinds being introduced into the code should be very well considered. The community has a guideline along the lines of:

Panics should only be used to inform about programming mistakes.

So it is perhaps somewhat unreasonable to expect our developers to swim against the flow here, when the rest of the ecosystem subscribes to a different ideology altogether.
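To make the contrast concrete, here is a minimal sketch (the function and type names are made up for illustration, not taken from nearcore). The first function carries its failure modes in the signature; the second one unwinds, and nothing in its signature says so. The last helper uses std::panic::catch_unwind to observe the unwind, which only works when the binary is built with the default panic = "unwind" strategy.

```rust
use std::panic;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    NotANumber,
}

/// Explicit error: every failure mode is visible in the return type.
fn parse_height(input: &str) -> Result<u64, ParseError> {
    if input.is_empty() {
        return Err(ParseError::Empty);
    }
    input.parse::<u64>().map_err(|_| ParseError::NotANumber)
}

/// Same parsing, but a bad input unwinds instead of returning an error.
/// The signature gives the caller no hint that this can happen.
fn parse_height_or_panic(input: &str) -> u64 {
    input.parse::<u64>().expect("height must be a number")
}

/// Returns true if the panicking variant unwound for this input.
/// The default panic hook still prints the panic message to stderr.
fn panicked_on(input: &str) -> bool {
    panic::catch_unwind(|| parse_height_or_panic(input)).is_err()
}
```

A caller of parse_height is forced by the compiler to decide what to do with ParseError, while a caller of parse_height_or_panic has to read the body (or the docs, if any) to learn that it can unwind.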

I personally found it more straightforward to just always use Errors everywhere by default, even for said programming/invariant mistakes. At my previous workplace we built a green-field project where, to the best of my knowledge, we did not introduce unwinds of our own (especially not implicit ones) and religiously followed these guidelines. Not only was the resulting software rock solid and its code easy to modify, but whenever there was an error condition, the reasons leading to it naturally ended up being really easy for the operators of the software to understand as well!

Unfortunately, it is a little too late to apply the same strategy or guidelines to neard (it is not a green-field project!), but I don't think there is any reason not to follow a similar style in the new code that we write. At worst, the error can be explicitly propagated until it is no longer possible to do so. This propagation is still explicit, and whoever ends up dealing with anything in between will enjoy most of the benefits of the scheme.
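A rough sketch of what "propagate until it is no longer possible" could look like. Everything here (StoreError, read_raw, the keys) is hypothetical and only stands in for real storage code; the point is that each layer in between uses `?` and the error keeps its context until the outermost layer turns it into something an operator can read.

```rust
use std::fmt;

/// Hypothetical error type; the variants keep enough context for operators.
#[derive(Debug, PartialEq)]
enum StoreError {
    MissingKey(String),
    Corrupt { key: String, reason: String },
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StoreError::MissingKey(key) => write!(f, "missing key `{}`", key),
            StoreError::Corrupt { key, reason } => {
                write!(f, "corrupt value for `{}`: {}", key, reason)
            }
        }
    }
}

/// Stand-in for a database read (assumed data, for illustration only).
fn read_raw(key: &str) -> Result<&'static str, StoreError> {
    match key {
        "height" => Ok("12"),
        "epoch" => Ok("not-a-number"),
        _ => Err(StoreError::MissingKey(key.to_string())),
    }
}

/// Middle layer: adds context, then propagates with `?`.
fn read_number(key: &str) -> Result<u64, StoreError> {
    let raw = read_raw(key)?;
    raw.parse::<u64>().map_err(|e| StoreError::Corrupt {
        key: key.to_string(),
        reason: e.to_string(),
    })
}

/// Another layer in between: still explicit, still just `?`.
fn next_height() -> Result<u64, StoreError> {
    Ok(read_number("height")? + 1)
}

/// Outermost layer, where propagation is no longer possible:
/// the error is rendered for whoever operates the software.
fn report(result: Result<u64, StoreError>) -> String {
    match result {
        Ok(n) => format!("ok: {}", n),
        Err(e) => format!("error: {}", e),
    }
}
```

Even if the outermost layer ends up doing nothing smarter than logging and exiting, the path the error took is spelled out in the signatures along the way.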

from nearcore.

wacban commented on July 1, 2024

Here is my take, in order of priority:

  • Any assumptions or function contracts should be coded directly into the function signature. Make it impossible for the function to be used incorrectly. Rust is fairly good at it. Ideally all assumptions should be encoded that way and there shouldn't be any need for asserts.
  • For non-critical issues it's best to use debug_asserts and log the error. For this to work we need a proper setup with canaries (including validators) running in debug mode and monitoring for errors. I believe we have the former but I don't know about the latter.
  • I dislike pure asserts and, even worse, implicit asserts (e.g. array access with an unchecked index) because those create a potential attack vector. Asserts basically move the responsibility to the caller of the asserting function, implicitly, and that responsibility is not checked by the compiler.
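The three points above can be sketched roughly as follows. The function names and the 60-second threshold are invented for the example, and eprintln! stands in for whatever logging the project actually uses.

```rust
use std::num::NonZeroU64;

/// 1. Contract in the signature: a zero divisor cannot even be constructed,
///    so no assert is needed inside the function.
fn blocks_per_shard(total_blocks: u64, shards: NonZeroU64) -> u64 {
    total_blocks / shards.get()
}

/// 2. Non-critical issue: log it, and only abort in builds with debug
///    assertions enabled (e.g. canaries). Release builds keep running.
fn record_latency(samples: &mut Vec<u64>, millis: u64) {
    if millis > 60_000 {
        eprintln!("suspicious latency sample: {} ms", millis);
    }
    debug_assert!(millis <= 60_000, "latency sample out of expected range");
    samples.push(millis);
}

/// 3. No implicit assert: `get` returns None for an out-of-range index,
///    where `validators[index]` would panic.
fn validator_at(validators: &[&'static str], index: usize) -> Option<&'static str> {
    validators.get(index).copied()
}
```

With the first form the caller has to deal with the zero case once, at the point where the NonZeroU64 is constructed, and the compiler enforces that it did.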


akhi3030 commented on July 1, 2024

I am a big fan of the crash early concept. Some thoughts:

  • If you have a piece of code that is not properly tested, then you cannot be sure that it works correctly, and you should not have it in the code base. Code that handles very esoteric errors is often not properly tested, and it adds a lot of complexity.
  • As the link above suggests, when you have some sort of corruption and you do not think you can safely continue execution, the best thing to do is to end execution.
  • No piece of code should rely on graceful shutdown in order to function correctly. If your database code relies on files always being properly closed with all data properly written, at some point that assumption will fail. Similarly, if your networking code relies on always managing to cleanly close the connection with all data written and read by the client, that will eventually fail. These pieces of code should be able to start from potentially invalid states.


akhi3030 commented on July 1, 2024

I dislike pure asserts

I think this is a bit more nuanced. There are some cases where asserts (or their cousins, unwraps, etc.) can be a useful tool. There are always some types of corruption that the application is not designed to cope with. When such corruption is detected, there is no good reason to pass it around; just asserting can be fine. E.g. when you want to lock a mutex, should you unwrap or not? If you do not want to bother handling the edge case of poisoned locks, then unwrapping is fine. The same goes for asserting that a certain invariant holds which, due to legacy code or just the sheer amount of complexity, we decided not to properly capture in the types we have defined.
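For the mutex case, the two options look roughly like this with std::sync::Mutex (the counter and the helper names are just for illustration). A lock becomes poisoned when a thread panics while holding the guard; `lock().unwrap()` then panics in every later locker, while `PoisonError::into_inner` lets you take the guard anyway if you have decided the data is still usable.

```rust
use std::sync::{Arc, Mutex, PoisonError};
use std::thread;

/// Crash-early option: a poisoned lock is treated as unrecoverable.
fn increment_or_panic(counter: &Mutex<u64>) -> u64 {
    let mut guard = counter.lock().unwrap();
    *guard += 1;
    *guard
}

/// Alternative: deliberately ignore poisoning and keep using the data.
/// Only sensible if a panic mid-update cannot leave the value inconsistent.
fn increment_ignoring_poison(counter: &Mutex<u64>) -> u64 {
    let mut guard = counter.lock().unwrap_or_else(PoisonError::into_inner);
    *guard += 1;
    *guard
}

/// Helper for the demo: poisons the lock by panicking in another thread
/// while the guard is held. The panic message is printed to stderr.
fn poison(counter: &Arc<Mutex<u64>>) {
    let clone = Arc::clone(counter);
    let _ = thread::spawn(move || {
        let _guard = clone.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
}
```

Which of the two is right depends on whether anything above the lock is actually prepared to deal with a half-updated value; if not, the plain unwrap is the honest choice.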

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

Yes, this is absolutely true. If we have a systematic corruption that impacts all or a large number of nodes, then the entire network will enter an infinite crash loop. But what is the alternative here? It is not so difficult to throw an error to your caller if you experience an esoteric error. But if the caller is not designed to handle the error, then you are just increasing code complexity for little gain. So my suggestion would be that if we do this, we need to handle it on a case-by-case basis, and we have to make sure that the error gets propagated properly all the way up to the appropriate level that is able to handle and recover from it. If we are not building in proper recovery logic, then there is little point in propagating the error.
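As a toy illustration of "the appropriate level" (a hypothetical balance/transaction model, not how neard actually processes anything): the low-level function only detects the problem and reports it, and the one level that knows how to recover drops the offending item and carries on instead of crashing on the same input again after a restart.

```rust
/// Hypothetical error describing a rejected transaction.
#[derive(Debug, PartialEq)]
struct InvalidTx {
    id: u32,
    reason: &'static str,
}

/// Low level: detects the problem but does not decide what to do about it.
fn apply(balance: u64, id: u32, amount: u64) -> Result<u64, InvalidTx> {
    balance.checked_sub(amount).ok_or(InvalidTx {
        id,
        reason: "insufficient balance",
    })
}

/// The level that is able to recover: it rejects the offending
/// transaction and keeps processing the rest of the batch.
fn apply_all(mut balance: u64, txs: &[(u32, u64)]) -> (u64, Vec<InvalidTx>) {
    let mut rejected = Vec::new();
    for &(id, amount) in txs {
        match apply(balance, id, amount) {
            Ok(next) => balance = next,
            Err(e) => rejected.push(e),
        }
    }
    (balance, rejected)
}
```

If there were no apply_all-like level with real recovery logic, returning InvalidTx from apply would buy very little over asserting, which is the point made above.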

