Comments (3)
I'm seeing this problem too, with swarm 3.4.0 (strategy: ring) and libcluster 3.2.0 (strategy: kubernetes.dns).
Below is the Swarm.Tracker state on each of the 4 replicas.
It looks like
10.1.0.171 is waiting for 10.1.0.173,
10.1.0.173 is waiting for 10.1.0.172,
10.1.0.172 is waiting for 10.1.0.171
Could this be a circular (recursive) deadlock?
{:syncing,
%Swarm.Tracker.TrackerState{
clock: {1, 0},
nodes: [:"[email protected]", :"[email protected]", :"[email protected]"],
pending_sync_reqs: [#PID<11439.1343.0>, #PID<11437.1339.0>],
self: :"[email protected]",
strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]>,
sync_node: :"[email protected]",
sync_ref: #Reference<11224.2913252267.1040711683.223001>
}}
{:syncing,
%Swarm.Tracker.TrackerState{
clock: {1, 0},
nodes: [:"[email protected]", :"[email protected]", :"[email protected]"],
pending_sync_reqs: [#PID<11438.1339.0>],
self: :"[email protected]",
strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]>,
sync_node: :"[email protected]",
sync_ref: #Reference<11224.2390910966.1309409281.72932>
}}
{:syncing,
%Swarm.Tracker.TrackerState{
clock: {1, 0},
nodes: [:"[email protected]", :"[email protected]"],
pending_sync_reqs: [],
self: :"[email protected]",
strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]"]>,
sync_node: :"[email protected]",
sync_ref: #Reference<11224.3895545559.1309147137.190750>
}}
{:syncing,
%Swarm.Tracker.TrackerState{
clock: {1, 0},
nodes: [:"[email protected]", :"[email protected]", :"[email protected]"],
pending_sync_reqs: [#PID<11438.1341.0>],
self: :"[email protected]",
strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]>,
sync_node: :"[email protected]",
sync_ref: #Reference<11224.763090284.1309409282.67102>
}}
Any feedback would be appreciated!
Thank you
from swarm.
From a quick look at the code, it seems there is no deadlock resolution for the case where 3 or more nodes start syncing concurrently and the random choice of sync node happens to be unlucky.
There is some logic that handles the case of 2 nodes syncing with each other:
swarm/lib/swarm/tracker/tracker.ex
Line 378 in 4aee63d
Basically this works okay:
n1 picks n2
n2 picks n1
n3 picks n1 or n2
But this doesn't:
n1 picks n2
n2 picks n3
n3 picks n1
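The difference between the two scenarios above can be checked mechanically. The following is only a sketch, not code from Swarm: the module `SyncCycleSketch` and its functions are hypothetical, and it assumes every node picks exactly one sync node and that a mutual pick between 2 nodes is resolved by the existing tracker logic, as described above.

```elixir
defmodule SyncCycleSketch do
  # Hypothetical helper for illustration only; not part of Swarm.
  # `picks` maps each node to the node it chose to sync with.

  # Follows the chain of picks from `start` and returns the cycle
  # that the chain eventually enters, as a list of nodes.
  def cycle_from(picks, start), do: walk(picks, start, [])

  defp walk(picks, current, seen) do
    if current in seen do
      # `seen` is most-recent-first; keep the part from the first
      # visit of `current` onwards, which is exactly the cycle.
      seen |> Enum.reverse() |> Enum.drop_while(&(&1 != current))
    else
      walk(picks, Map.fetch!(picks, current), [current | seen])
    end
  end

  # Assumption: cycles of length 2 are handled by the existing logic,
  # so only cycles of 3 or more nodes count as a deadlock here.
  def deadlock?(picks) do
    picks
    |> Map.keys()
    |> Enum.any?(fn n -> length(cycle_from(picks, n)) >= 3 end)
  end
end

works = %{n1: :n2, n2: :n1, n3: :n1}
stuck = %{n1: :n2, n2: :n3, n3: :n1}

IO.inspect(SyncCycleSketch.deadlock?(works))
IO.inspect(SyncCycleSketch.deadlock?(stuck))
```

With these inputs the first map contains only the 2-node cycle n1/n2, while the second forms a 3-node cycle.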
I'm not an expert on this topic, so forgive me if I'm completely off the mark, but here's a quick idea for a fix:
After receiving a :sync message, respond with a new {:waiting_for, node} message (only if actually waiting for some other node).
The waiting node receives that message, checks whether a deadlock has occurred (by looking in its own pending_sync_reqs list), and if so cancels the sync.
After cancelling the sync, it can try again with a different node.
Only one of the nodes needs to cancel, so we can just pick the first one (based on node order).
n1 picks n2
n2 picks n3
n3 picks n1
n3 sends {:waiting_for, n1} to n2
n2 receives {:waiting_for, n1} | n1 in pending_sync_reqs but self() > n1, so the message is ignored
n1 sends {:waiting_for, n2} to n3
n3 receives {:waiting_for, n2} | n2 in pending_sync_reqs but self() > n2, so the message is ignored
n2 sends {:waiting_for, n3} to n1
n1 receives {:waiting_for, n3} | n3 in pending_sync_reqs and self() < n3
n1:
1. cancels the sync to n2 (it probably needs to notify n2, so yet another message)
2. picks a node other than n2 and tries syncing again
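The decision each node makes in the trace above could look roughly like this. It is a sketch of the idea only: the {:waiting_for, node} message does not exist in Swarm, the module and function names are hypothetical, and pending_sync_reqs is simplified to a list of node names (in the real tracker state it holds PIDs). Nodes are compared with standard Erlang term ordering, under which atoms sort by name.

```elixir
defmodule WaitingForSketch do
  # Hypothetical decision function for illustration; not part of Swarm.
  #
  # self_node         - the node that received {:waiting_for, waited_on}
  # pending_sync_reqs - nodes that have asked self_node to sync (simplified)
  # waited_on         - the node that our own sync target is waiting for
  def on_waiting_for(self_node, pending_sync_reqs, waited_on) do
    cond do
      # Our sync target waits for a node that is not waiting on us: no cycle.
      waited_on not in pending_sync_reqs -> :no_deadlock
      # We are part of the cycle and sort first: we are the one to back off.
      self_node < waited_on -> :cancel_and_retry
      # We are part of the cycle, but another node is expected to cancel.
      true -> :ignore
    end
  end
end

# Replaying the trace: n1 picks n2, n2 picks n3, n3 picks n1.
IO.inspect(WaitingForSketch.on_waiting_for(:n2, [:n1], :n1))
IO.inspect(WaitingForSketch.on_waiting_for(:n3, [:n2], :n2))
IO.inspect(WaitingForSketch.on_waiting_for(:n1, [:n3], :n3))
```

Only the last call, the one made on n1, returns :cancel_and_retry, which matches the trace. This rule only covers the 3-node cycle; the notification to n2 and the retry itself are left out.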
Instead of retrying with a different node, I suppose one could also pick the "best" node of the three (based on clock, then node ordering) and let it resolve the pending requests.
I'll attempt a PR for that when I have some more spare time.
A quick workaround for Kubernetes-based clusters would be a StatefulSet (with a proper readiness probe), since that would guarantee that nodes start one by one.
EDIT: Now that I think of it, this would only work for a simple 3-node sync cycle. Handling bigger cycles is going to be a little more complicated.
from swarm.
Hi, we just hit this too. We're due to go live this weekend, eek!
Is there any quick fix for this, such as a change of strategy, or some other config we can tweak to make it less likely?
from swarm.