There are some errors which should never happen, unless there is a big bug in the code

Guidelines for handling unexpected errors about nearcore HOT 4 OPEN

jancionear commented on July 1, 2024

Guidelines for handling unexpected errors

from nearcore.

Comments (4)

nagisa commented on July 1, 2024

Crash early does work well in e.g. web applications where an esoteric request will at worst bring down a single server and something more reliable, like a load balancer, will reroute the rest of the traffic to the other instances.

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

I do like the mechanism myself quite a bit too, but I fear that with how neard and the near protocol are structured, we cannot really take advantage of it in many places of our code base.

No piece of code should rely on graceful shutdown in order to function correctly.

While reliance on crash early is a good motivation for the code to not rely on graceful termination, external factors like SIGKILL or just somebody pulling the power plug are already there. So the code that's not meant to corrupt state should not assume graceful termination regardless.

In the Rust world specifically, explicit errors always end up being a part of the type state, and are thus at least somewhat documented. Panics, asserts and the like, on the other hand, are meant to be ignored. In fact, you could not do anything but ignore them all the way until the 1.9.0 release!

This usually means that, for most Rust projects, any unwinds being introduced into the code should be very well considered. The community has a guideline somewhere along the lines of:

Panics should only be used to inform about programming mistakes.

So it is perhaps somewhat unreasonable to expect our developers to swim against the flow here, when the rest of the ecosystem subscribes to a different ideology altogether.
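For reference, the 1.9.0 remark above refers to the stabilization of `std::panic::catch_unwind`. A minimal sketch of the difference between the two styles follows; `parse_height`, `parse_height_or_panic` and `try_parse_catching_panic` are hypothetical names made up for illustration and are not nearcore functions.

```rust
use std::num::ParseIntError;
use std::panic;

/// Explicit error: the failure mode is visible in the signature.
fn parse_height(input: &str) -> Result<u64, ParseIntError> {
    input.trim().parse::<u64>()
}

/// Implicit failure: nothing in the signature hints that this can unwind.
fn parse_height_or_panic(input: &str) -> u64 {
    input.trim().parse::<u64>().expect("height must be a number")
}

/// Observing the unwind is only possible via `catch_unwind` (stable since
/// Rust 1.9.0), and the payload is an opaque `Box<dyn Any + Send>` rather
/// than a typed error, so here it is simply discarded.
fn try_parse_catching_panic(input: &str) -> Option<u64> {
    panic::catch_unwind(|| parse_height_or_panic(input)).ok()
}
```

The caller of `parse_height` is forced by the compiler to decide what to do with the error, whereas the caller of `parse_height_or_panic` only learns about the failure mode by reading the body.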

I personally found it more straightforward to just always use Errors everywhere by default, even for said programming/invariant mistakes. At my previous workplace we crafted a green-field project where, to the best of my knowledge, we did not introduce unwinds (especially not the implicit ones) of our own and religiously followed these guidelines. Not only was the resulting software rock stable and its code easy to modify, but whenever there was an error condition, the reasons leading to it naturally ended up being really easy for the operators of the software to understand as well!

Unfortunately, it is a little too late to apply the same strategy or guidelines to all of neard (it is not a green-field project!), but I don't think there is any reason not to follow a similar style for the new code that we write. At worst, the error can be explicitly propagated until it is no longer possible to do so. This propagation is still explicit, and whoever ends up dealing with anything in between will enjoy most of the benefits of the scheme.
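A minimal sketch of what "Errors everywhere, even for invariant mistakes" could look like. The `ChunkError` type and the `total_gas` and `describe_chunk` functions are hypothetical and only illustrate the style; they are not taken from nearcore.

```rust
use std::fmt;

/// Hypothetical error type: invariant violations are ordinary variants.
#[derive(Debug, PartialEq)]
enum ChunkError {
    /// A "should never happen" condition, reported instead of panicking.
    EmptyChunk,
    /// Arithmetic overflow while summing gas, also reported explicitly.
    GasOverflow { so_far: u64, next: u64 },
}

impl fmt::Display for ChunkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChunkError::EmptyChunk => write!(f, "chunk contains no receipts"),
            ChunkError::GasOverflow { so_far, next } => {
                write!(f, "gas overflow: {} + {}", so_far, next)
            }
        }
    }
}

impl std::error::Error for ChunkError {}

/// Sums gas with checked arithmetic; no implicit panics on overflow.
fn total_gas(receipts: &[u64]) -> Result<u64, ChunkError> {
    if receipts.is_empty() {
        return Err(ChunkError::EmptyChunk);
    }
    receipts.iter().try_fold(0u64, |so_far, &next| {
        so_far
            .checked_add(next)
            .ok_or(ChunkError::GasOverflow { so_far, next })
    })
}

/// Intermediate code propagates with `?`; the error stays visible in the
/// signature until some caller is able to act on it.
fn describe_chunk(receipts: &[u64]) -> Result<String, ChunkError> {
    let gas = total_gas(receipts)?;
    Ok(format!("{} receipts, {} gas", receipts.len(), gas))
}
```

Even if the outermost caller can do nothing better than log the error and stop, every function in between documents in its signature that the failure exists.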

wacban commented on July 1, 2024

here is my take, in order of priority

1. Any assumptions or function contracts should be coded directly into the function signature. Make it impossible for the function to be used incorrectly. Rust is fairly good at it. Ideally all assumptions should be encoded that way and there shouldn't be any need for asserts.
2. For non-critical issues it's best to use debug_asserts and log the error. For this to work we need a proper setup with canaries (including validators) running in debug mode and monitoring for errors. I believe we have the former but I don't know about the latter.
3. I dislike pure asserts and, even worse, implicit asserts (e.g. array access with an unchecked index) because those create a potential attack vector. Asserts basically move the responsibility to the caller of the asserting function, implicitly, and this is not checked by the compiler.
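The three points above could be sketched roughly as follows. All function names here (`gas_per_shard`, `record_latency`, `validator_stake`) and the latency threshold are hypothetical, chosen only to illustrate each point, and `eprintln!` stands in for whatever logging facility the project actually uses.

```rust
use std::num::NonZeroU64;

/// Point 1: the contract lives in the signature. A zero divisor cannot be
/// passed, so no assert is needed inside the function.
fn gas_per_shard(total_gas: u64, num_shards: NonZeroU64) -> u64 {
    total_gas / num_shards.get()
}

/// Point 2: a non-critical invariant. The violation is logged, debug builds
/// (e.g. canaries) stop loudly, and release builds skip the bad sample.
fn record_latency(samples: &mut Vec<u64>, latency_ms: u64) {
    const MAX_PLAUSIBLE_MS: u64 = 60_000; // assumed threshold for the sketch
    if latency_ms > MAX_PLAUSIBLE_MS {
        eprintln!("implausible latency: {} ms", latency_ms);
        debug_assert!(false, "implausible latency: {} ms", latency_ms);
        return;
    }
    samples.push(latency_ms);
}

/// Point 3: checked access instead of `stakes[index]`, which would panic
/// implicitly on an out-of-range index.
fn validator_stake(stakes: &[u64], index: usize) -> Option<u64> {
    stakes.get(index).copied()
}
```

With `NonZeroU64`, the caller has to handle the zero case at construction time (`NonZeroU64::new` returns an `Option`), which is exactly the compiler-checked responsibility that a plain assert lacks.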

akhi3030 commented on July 1, 2024

I am a big fan of the crash early concept. Some thoughts:

If you have a piece of code that is not properly tested, then you cannot be sure that it works correctly, and you should not have it in the code base. Code that handles very esoteric errors is often not properly tested, and it adds a lot of complexity.
As the link above suggests, when you have some sort of corruption and you do not think you can safely continue execution, the best thing to do is to end execution.
No piece of code should rely on graceful shutdown in order to function correctly. If your database code relies on the files always being properly closed with all data properly written, at some point that assumption will fail. Similarly, if your networking code relies on always managing to cleanly close the connection with all data written and read by the client, that will eventually fail. These pieces of code should be able to start from potentially invalid states.
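As one possible illustration of code that does not rely on graceful shutdown, the sketch below persists a counter by writing to a temporary file and renaming it into place, and treats a missing or torn file as "no valid state" on startup. The file layout (value followed by its bitwise complement as a check) and the names `save_counter` and `load_counter` are assumptions made for this example only.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Writes the value to a sibling temporary file, syncs it, then renames it
/// over the target, so a crash mid-write does not leave a half-written target.
fn save_counter(path: &Path, value: u64) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(&value.to_le_bytes())?;
    file.write_all(&(!value).to_le_bytes())?; // simple integrity check
    file.sync_all()?;
    fs::rename(&tmp, path)
}

/// Never assumes the previous run shut down cleanly: a missing, truncated or
/// inconsistent file yields `None` and the caller decides how to recover.
fn load_counter(path: &Path) -> Option<u64> {
    let bytes = fs::read(path).ok()?;
    if bytes.len() != 16 {
        return None;
    }
    let mut value_bytes = [0u8; 8];
    let mut check_bytes = [0u8; 8];
    value_bytes.copy_from_slice(&bytes[..8]);
    check_bytes.copy_from_slice(&bytes[8..]);
    let value = u64::from_le_bytes(value_bytes);
    let check = u64::from_le_bytes(check_bytes);
    if check == !value {
        Some(value)
    } else {
        None
    }
}
```

The important part is the shape of `load_counter`: the startup path validates what it finds instead of trusting that the last writer finished.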

akhi3030 commented on July 1, 2024

I dislike pure asserts

I think this is a bit more nuanced. There are some cases where asserts (or their cousins, unwraps, etc.) can be a useful tool. There are always some types of corruption that the application is not designed to cope with. When such corruption is detected, there is no good reason to pass it around; just asserting can be fine. E.g. when you want to lock a mutex, should you unwrap or not? If you do not want to bother handling the edge case of poisoned locks, then unwrapping is fine. The same goes for asserting that a certain invariant holds which, due to legacy code or just the sheer amount of complexity, we decided not to properly capture in the types we have defined.
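To make the mutex example concrete, here is a small sketch of the two options with `std::sync::Mutex`. The function names are hypothetical; the `lock`, `PoisonError::into_inner` and `is_poisoned` calls are from the standard library.

```rust
use std::sync::{Mutex, MutexGuard};

/// Crash-early style: a poisoned lock means another thread panicked while
/// holding it, which this code treats as unrecoverable.
fn increment_or_crash(counter: &Mutex<u64>) -> u64 {
    let mut guard = counter.lock().unwrap();
    *guard += 1;
    *guard
}

/// Alternative: explicitly accept the possibly inconsistent data by taking
/// the guard out of the `PoisonError`.
fn increment_ignoring_poison(counter: &Mutex<u64>) -> u64 {
    let mut guard: MutexGuard<'_, u64> = counter
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard += 1;
    *guard
}
```

The second variant is only reasonable when the protected data cannot be left half-updated by a panic; otherwise unwrapping is the more honest choice.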

For neard (in absence of ZK/partial shard tracking/etc.) most of the validators end up seeing all of the traffic from the network, have the same view of the database state, etc. so an unfortunate esoteric event is much more likely to bring down a large number – if not all – of the instances that have seen it. Furthermore, when they come back up, they are likely going to see that same construct again…

Yes, this is absolutely true. If we have a systematic corruption that impacts all or a large number of nodes, then the entire network will enter an infinite crash loop. But what is the alternative here? It is not so difficult to return an error to your caller if you experience an esoteric error. But if the caller is not designed to handle the error, then you are just increasing code complexity for little gain. So my suggestion would be that, if we do this, we handle it on a case-by-case basis and make sure that the error gets propagated properly all the way up to the appropriate level, one which is able to properly handle and recover from it. If we are not building in proper recovery logic, then there is little point in propagating the error.
