Comments (3)
Yes, currently Swarm doesn't have an API for saying "hey, we want to shut down, so hand off your state to the node which will become responsible for owning you". It's something I'd like to add, but it will require some thought as to the best way to do it.
On the other hand, if your node shuts down and the process is restarted elsewhere in the cluster, then as long as it can rebuild its state from some authoritative source (say mnesia, Redis, or some other datastore), you can just let Swarm handle redistributing those processes the way it already does. The only requirements are that those processes be started with register_name/4 and that they handle the various Swarm system messages. If these processes have state which can't be reconstructed on init, though, then this is probably not the right way to go for you at this point in time.
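To make the "handle the various Swarm system messages" part concrete, here is a minimal sketch of a worker that responds to Swarm's handoff protocol. The module name is hypothetical, and the message shapes follow Swarm's documented protocol, but verify them against the docs for your Swarm version:

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  def init(name), do: {:ok, %{name: name, count: 0}}

  # Swarm asks the process for its state before moving it to another node.
  # Reply {:resume, state} to hand the state off, or :restart to start fresh.
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # The replacement process on the new node receives the handed-off state.
  def handle_cast({:swarm, :end_handoff, handed_off}, _state) do
    {:noreply, handed_off}
  end

  # After a netsplit heals, two copies may exist; keep our copy here,
  # though a real worker might merge the two states instead.
  def handle_cast({:swarm, :resolve_conflict, _other_state}, state) do
    {:noreply, state}
  end

  # Swarm tells a duplicate copy to shut down.
  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end
end

# Registration via Swarm (requires the :swarm dependency):
#   Swarm.register_name(:my_worker, MyApp.Worker, :start_link, [:my_worker])
```

The module itself compiles without Swarm as a dependency; only the registration call needs the library.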
Also, Swarm guarantees that if a network partition occurs, a copy of those registered processes will be running in every partition - so if you need a guarantee that a given process will only ever be present once, across all partitions (in a netsplit scenario), Swarm wouldn't be the right solution.
from swarm.
I would say, at least initially, I'm not too concerned about the behavior in the case of a network partition; I find something like that unlikely in the context of Kubernetes. I think I might be a bit confused about the purpose of swarm's handoff capabilities. What is the purpose of being able to hand off process state if not that the current node is going down? Is it about balancing work across a cluster?
I would never underestimate the ability of the network to mess up your day ;), even within k8s (maybe even especially in k8s, due to the software-defined networking layer).
Swarm's current handoff implementation handles the case where a cluster suddenly loses one of its nodes: all of the processes on the downed node (those registered via register_name/4, anyway) need to be restarted on the remaining nodes, based on the distribution of the internal hash ring. This means some percentage of all processes will need to be moved to new parent nodes, so for those processes running on nodes which are still up, Swarm takes the time to coordinate a graceful handoff. This is fairly easy to accomplish automatically, because upon nodedown the hash ring on each node is updated automatically and deterministically; we know that all nodes have the same hash ring and will agree on where a given process should live.
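The "deterministic placement" property is the key to the paragraph above: every node computes the same answer for where a process belongs, with no coordination. This is not Swarm's actual ring implementation, but a toy consistent-hash ring in plain Elixir illustrates the idea:

```elixir
# Toy consistent-hash ring, for illustration only. Each node and each name
# is hashed onto a circular 2^32 space; a name is owned by the first node
# at or after its hash point, wrapping around.
defmodule ToyRing do
  @space 4_294_967_296

  defp point(term), do: :erlang.phash2(term, @space)

  def owner(name, nodes) do
    h = point(name)

    nodes
    |> Enum.sort_by(&point/1)
    |> Enum.find(fn n -> point(n) >= h end)
    |> case do
      # Wrapped past the end of the ring: owner is the first node.
      nil -> Enum.min_by(nodes, &point/1)
      node -> node
    end
  end
end

# Every node computes owner/2 identically, so when a node drops out of the
# list, only the names that node owned get a new owner; everything else
# stays put.
nodes = [:a@host, :b@host, :c@host]
ToyRing.owner(:some_process, nodes)
ToyRing.owner(:some_process, nodes -- [:c@host])
```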
It gets tricky when you want to simulate a nodedown event, because you need to do the handoff before you actually go down. Due to the way the internals are written (they rely on the hash ring for a priori knowledge of where to register processes), we need to remove the node from everyone's hash ring while still allowing handoff events from that node. That can be risky: if the broadcast telling the rest of the cluster that a node should be "soft-killed" fails to reach all parties, the hash rings become out of sync and chaos ensues. Only performing handoffs when a node actually goes down is safer, because the message is generated within each node when it loses communication with a connected node, so we don't risk losing the event and getting out of sync. This is definitely not an impossible problem to solve, but it is tricky, hence why it wasn't done at the same time as the current handoff implementation.
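Until a pre-shutdown handoff API exists, the workaround suggested in the first comment is to persist state to an authoritative store on every change and rebuild it on init, so a copy restarted on another node resumes where the old one left off. A minimal sketch, with an Agent standing in for mnesia/Redis and hypothetical module names:

```elixir
# Stand-in for an external datastore (mnesia, Redis, etc.).
defmodule MyApp.StateStore do
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def save(name, state), do: Agent.update(__MODULE__, &Map.put(&1, name, state))
  def load(name), do: Agent.get(__MODULE__, &Map.get(&1, name))
end

defmodule MyApp.DurableWorker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  # On init, recover state if a previous copy of this worker wrote any;
  # otherwise start fresh.
  def init(name) do
    state = MyApp.StateStore.load(name) || %{name: name, count: 0}
    {:ok, state}
  end

  def handle_cast(:increment, state) do
    state = %{state | count: state.count + 1}
    # Persist on every state change so a restart loses nothing.
    MyApp.StateStore.save(state.name, state)
    {:noreply, state}
  end
end
```

With this shape, Swarm can restart the process anywhere after a nodedown and the new copy rebuilds itself, at the cost of a write to the store on every change.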