Giter Club home page Giter Club logo

Comments (11)

dennisklein avatar dennisklein commented on August 16, 2024

@ktf Does this happen after sending SIGINT to the device process?

edit: nvm, Thread_24274069 can only exist, when we received a SIGINT signal and my initial thought was, that it screws with the startup of the device. But after some digging I conclude, that it is not the reason, why your process hangs. We need to make sure, that an exception cannot stop the state machine progress. Alexey and I are on holiday until Monday, but I might find some time to squeeze in a fix for this. Thanks for the bug report!

from fairmq.

dennisklein avatar dennisklein commented on August 16, 2024

@rbx I believe, in order to fix this, we need to intercept any exception, set the state machine into proper error state (which will notify the hanging control plugin threads), and then re-throw probably. Also, we need to enhance the control plugin to handle the error state properly. This could very well be a regression from the times before we stripped out the state machine thread from the device.

from fairmq.

rbx avatar rbx commented on August 16, 2024

What is the actual symptom/issue here? Does the device not shut down?
Is the exception being thrown from the same thread as the catcher?
Does it even come to ConditionalRun - it looks like Control is in the RunStartupSequence?
So far I don't see how anything would prevent an exception from being caught. Unless the actual issue is not shutting down.

Intercepting any exceptions from plugin seems like an overkill to me and may be really tricky since those may be running in different threads. What about notyfing the plugins out of the catch (which can be either one that Giulio already has, or an extra one inside our DeviceRunner/RunStateMachine.

A note on error state - it is built in a way that will keep the device hanging (for attaching debugger) and will not allow any subsequent state changes.

from fairmq.

dennisklein avatar dennisklein commented on August 16, 2024

What is the actual symptom/issue here? Does the device not shut down?

Yes, I assume the stack trace is collected in the hanging state. Only then it makes any sense to me. You can see, that the program waits in the destructor of the control plugin on one of its threads (The exception has led to the situation, that the device runner went out of scope and began destruction). All control plugin threads are also waiting. At that point no thread will progress the device state machine any more.

Intercepting any exceptions from plugin seems like an overkill to me

Not from plugin, but from the device. Basically wrap the states with a try..catch so no exception can escape unnoticed.

A note on error state - it is built in a way that will keep the device hanging (for attaching debugger) and will not allow any subsequent state changes.

Well, at least the way it is hanging now, is not this intended state of hanging.

from fairmq.

ktf avatar ktf commented on August 16, 2024

@dennisklein: Yes, I think there was also a SIGINT involved. Also I noticed I am throwing in Init, not in ConditionalRun.

@rbx: yes, it's hanging there in the destructor of the ~Control.

from fairmq.

rbx avatar rbx commented on August 16, 2024

We need to enhance the control plugin to handle the error state properly.

Agreed, it should react properly if anything else (device, user code, ..?) sets error state. What I am unsure about, is if an exception should in all cases lead to the error state. We could let it propagate until the DeviceRunner.Run()/RunWithExceptionHandlers() and if one does arrive, it will instruct all plugins to stop. Both methods will catch exceptions, but only Run() will propagate them further.

What do you think?

from fairmq.

rbx avatar rbx commented on August 16, 2024

Going through the error state will make it unrecoverable (error state does not allow any further state transitions).
This is probably fine in most cases and for what Giulio is doing.
It will not introduce any additional plugin interfaces and will contribute towards proper error state handling in the plugin.
Let's do it in this way now. If such condition ever needs to be recoverable, it can be a different issue.

from fairmq.

dennisklein avatar dennisklein commented on August 16, 2024

If any exception escapes any user code we call, I would say this is always a proper unrecoverable error state. I am open to having a separate error state for this, if the existing one is meant for something else (e.g. set by user code only), but I would prefer if we signal plugins through the state machine change signals.

from fairmq.

dennisklein avatar dennisklein commented on August 16, 2024

@ktf We believe we have fixed the problem in dev. Will tag it once #98 is resolved tomorrow or Friday.

It should not require any changes on your side. The intended behaviour is, that any uncaught exception may escape the DeviceRunner::Run() call. We took care, that our builtin controller thread is not hanging any more in this case.

Other plugins, that spawn long running threads, are now notified through the existing StateChange subscription plugin API and may react accordingly. We transition the device state machine to the error state now when we see an uncaught exception.

from fairmq.

ktf avatar ktf commented on August 16, 2024

from fairmq.

dennisklein avatar dennisklein commented on August 16, 2024

@ktf Thank you for helping to improve FairMQ! We believe your issue has been resolved in release v1.3.4. Reopen or create a new issue, if we missed anything.

from fairmq.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.