foca's Issues

Manually leave the cluster and rejoin?

I need to declare a node offline, do some work, and then rejoin the cluster.

However, leave_cluster(mut self, mut runtime: impl Runtime<T>) takes self by value rather than &mut self, so I can't rejoin the cluster via announce after leaving.

Is there any way to achieve this?
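A rough workaround sketch (not an official foca recipe): since leave_cluster consumes the instance, build a fresh Foca afterwards and announce again. It assumes the usual Foca::new(identity, config, rng, codec) constructor and announce(dst, runtime) call, that the runtime can be passed by mutable reference, and that identity, config, rng, codec and bootstrap_addr stand in for whatever the application already holds.

// Sketch only: leave, do the offline work, then start over with a new instance.
let _ = foca.leave_cluster(&mut runtime);

// ... node is offline here, do whatever maintenance is needed ...

// Build a brand new Foca with a renewed identity so peers don't confuse it
// with the instance that just left, then announce to a known address.
let mut foca = Foca::new(
    identity.renew().expect("identity can be renewed"),
    config,
    rng,
    codec,
);
foca.announce(bootstrap_addr, &mut runtime)?;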

Ever growing members

I don't have many logs since they'd be too chatty, but using the latest (0.15.0) I noticed members grow uncontrollably over time:

[graph: recorded member counts over time per node]

We have roughly 400 nodes in that cluster, not the 6K+ that some nodes have recorded.

I decided to start from scratch on boot instead of reusing state from the database.

It looks like most of the nodes recording a very high member count are the ones geographically farthest from the others, and therefore the ones with higher latency and packet loss.

I don't think we had this issue with an older version of foca.

Anything I can try to troubleshoot this?

Getting a lot of "Broadcasts disabled" error logs when not using broadcasts

I turned off foca's broadcasts entirely a while ago and do my own thing now.

I'm not using Foca::broadcast or Foca::add_broadcast anywhere in my code, yet I keep seeing "Broadcasts disabled" error logs.

Can this happen any other way? Perhaps corrupted messages of mine are being interpreted by foca as custom broadcasts?

Document nice-to-have things when using foca for serious business

There are some things foca doesn't do that it would be nice to make explicit.

Foca doesn't:

  • Try to recover from zero membership: if it reaches zero members, implementors must decide how to handle it (announce to well-known addresses, restart the application, whatever) (update: v0.17.0 can recover from many zero-membership scenarios)
  • Version its protocol (yet, at least), so running distinct versions at once may lead to problems (inability to decode certain messages, for example)
  • Add any encryption to its payload, so it's trivial for anyone in the network to snoop in the topology and even join the cluster
  • ...More?

(Ref #15)

How to use BroadcastHandler in a user-friendly way

I've stumbled upon this crate and it looks like it's of very high quality and has all the features I need for a little project.

Everything has been pretty clear in the docs and examples, except I'm not sure how to correctly use BroadcastHandler.

I'm assuming I should be able to use it to send / receive an enum that may represent multiple message types. However, BroadcastHandler::Broadcast requires AsRef<[u8]>, so it seems the only thing I can really broadcast is serialized bytes.

Now, that might be OK, but it means that if my messages have any kind of complexity to them, I have to constantly deserialize them inside my Invalidates implementation to determine whether they should be invalidated.

I get that this was built to run in embedded environments without many resources, and the serialization I'm doing is probably far heavier than intended for the use cases here. I can see how it would be entirely possible to read a u64 from the start of the message to decide whether to invalidate it, but in my implementation there will be many different message variants, and invalidating them might even require a database round-trip (TBD).

Any insight on how I could use the BroadcastHandler to disseminate more complex types of messages?
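One pattern that may help here, sketched below (not an official foca recipe): parse just enough of the payload once, on receipt, and store that key next to the raw bytes so that invalidation compares keys instead of re-deserializing. The struct and field names are made up; it assumes the Invalidates trait is the single-method invalidates(&self, other: &Self) -> bool and that the broadcast type otherwise only needs AsRef<[u8]>, as described above.

use foca::Invalidates;

// Hypothetical wrapper: the decoded key travels alongside the raw bytes,
// so invalidation never has to re-deserialize the payload.
struct VersionedBroadcast {
    key: u64,       // e.g. a record id, parsed once when the message is received
    version: u64,   // monotonically increasing version for that key
    bytes: Vec<u8>, // the serialized message, handed back to foca as-is
}

impl AsRef<[u8]> for VersionedBroadcast {
    fn as_ref(&self) -> &[u8] {
        &self.bytes
    }
}

impl Invalidates for VersionedBroadcast {
    // A newer version of the same key replaces older entries in the backlog.
    fn invalidates(&self, other: &Self) -> bool {
        self.key == other.key && self.version > other.version
    }
}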

Should I worry about "Member not found" warning log?

We're using a fat identity with a bump field.

pub struct Actor {
    id: ActorId,
    name: ActorName,
    addr: SocketAddr,
    group: String,

    // An extra field to allow fast rejoin
    bump: u16,
}

and our Identity implementation:

impl Identity for Actor {
    fn has_same_prefix(&self, other: &Self) -> bool {
        // sometimes the ID can be nil, when we connect to a member,
        // we don't know its ID, we only know its address
        // the ID should be updated later on, at least I hope
        if other.id.is_nil() || self.id.is_nil() {
            self.addr.eq(&other.addr)
        } else {
            self.id.eq(&other.id)
        }
    }

    fn renew(&self) -> Option<Self> {
        Some(Self {
            id: self.id,
            name: self.name.clone(),
            addr: self.addr,
            group: self.group.clone(),
            bump: self.bump.wrapping_add(1),
        })
    }
}

We're handling members like:

match notification {
    Notification::MemberUp(actor) => {
        let added = { members.write().add_member(&actor) };
        info!("Member Up {actor:?} (added: {added})");
        if added {
            // actually added a member
            // notify of new cluster size
            if let Ok(size) = (members.read().0.len() as u32).try_into() {
                foca_tx.send(FocaInput::ClusterSize(size)).ok();
            }
        }
        tokio::spawn(write_sql(write_sql_sender, move |pool| async move {
            if let Err(e) = upsert_actor_name(&pool, actor.id(), actor.name().as_str()).await {
                warn!("could not upsert actor name: {e}");
            }
        }));
    }
    Notification::MemberDown(actor) => {
        let removed = { members.write().remove_member(&actor) };
        info!("Member Down {actor:?} (removed: {removed})");
        if removed {
            // actually removed a member
            // notify of new cluster size
            if let Ok(size) = (members.read().0.len() as u32).try_into() {
                foca_tx.send(FocaInput::ClusterSize(size)).ok();
            }
        }
    }
    // ...
}

I started listing foca members via iter_members() to try and debug this.

Some logs I see when restarting:

2022-07-22T16:14:38.007685Z [INFO] Current Actor ID: 385107b4-3423-4b2e-b4e8-637cfac5779e
2022-07-22T16:14:41.036757Z [INFO] Member Up Id(ActorName("ab03"), ActorId(9c84ee68-d7db-427a-9a08-b8c7d501e856), <ip>:7878, bump: 17951) (added: true)
2022-07-22T16:14:41.036810Z [INFO] Current node is considered ACTIVE
2022-07-22T16:14:41.036832Z [INFO] foca knows about: Id(ActorName("ab03"), ActorId(9c84ee68-d7db-427a-9a08-b8c7d501e856), <ip>:7878, bump: 17951)
2022-07-22T16:14:42.103568Z [INFO] Member Up Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 46452) (added: true)
2022-07-22T16:14:42.103698Z [INFO] foca knows about: Id(ActorName("ab03"), ActorId(9c84ee68-d7db-427a-9a08-b8c7d501e856), <ip>:7878, bump: 17951)
2022-07-22T16:14:42.103736Z [INFO] foca knows about: Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 46452)
2022-07-22T16:14:43.011694Z [INFO] Member Up Id(ActorName("09cd"), ActorId(b129db5e-650e-40c3-9046-852ac72ed1ef), <ip>:7878, bump: 57964) (added: true)
2022-07-22T16:14:43.011765Z [INFO] foca knows about: Id(ActorName("ab03"), ActorId(9c84ee68-d7db-427a-9a08-b8c7d501e856), <ip>:7878, bump: 17951)
2022-07-22T16:14:43.011779Z [INFO] foca knows about: Id(ActorName("09cd"), ActorId(b129db5e-650e-40c3-9046-852ac72ed1ef), <ip>:7878, bump: 57964)
2022-07-22T16:14:43.011787Z [INFO] foca knows about: Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 46452)
2022-07-22T16:14:47.527714Z [INFO] Member Up Id(ActorName("1337"), ActorId(69fddfaf-b618-452e-829b-897e54c91889), <ip>:7878, bump: 39791) (added: true)
2022-07-22T16:14:47.527816Z [INFO] foca knows about: Id(ActorName("ab03"), ActorId(9c84ee68-d7db-427a-9a08-b8c7d501e856), <ip>:7878, bump: 17951)
2022-07-22T16:14:47.527869Z [INFO] foca knows about: Id(ActorName("09cd"), ActorId(b129db5e-650e-40c3-9046-852ac72ed1ef), <ip>:7878, bump: 57964)
2022-07-22T16:14:47.527879Z [INFO] foca knows about: Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 46452)
2022-07-22T16:14:47.527887Z [INFO] foca knows about: Id(ActorName("1337"), ActorId(69fddfaf-b618-452e-829b-897e54c91889), <ip>:7878, bump: 39791)
2022-07-22T16:14:50.424720Z [INFO] Member Up Id(ActorName("957f"), ActorId(9a925261-675a-4fa7-8478-ccf378b01d90), <ip>:7878, bump: 6158) (added: true)
2022-07-22T16:14:50.424831Z [INFO] foca knows about: Id(ActorName("ab03"), ActorId(9c84ee68-d7db-427a-9a08-b8c7d501e856), <ip>:7878, bump: 17951)
2022-07-22T16:14:50.424902Z [INFO] foca knows about: Id(ActorName("957f"), ActorId(9a925261-675a-4fa7-8478-ccf378b01d90), <ip>:7878, bump: 6158)
2022-07-22T16:14:50.424913Z [INFO] foca knows about: Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 46452)
2022-07-22T16:14:50.424922Z [INFO] foca knows about: Id(ActorName("1337"), ActorId(69fddfaf-b618-452e-829b-897e54c91889), <ip>:7878, bump: 39791)
2022-07-22T16:14:50.424930Z [INFO] foca knows about: Id(ActorName("09cd"), ActorId(b129db5e-650e-40c3-9046-852ac72ed1ef), <ip>:7878, bump: 57964)
2022-07-22T16:14:53.231006Z [INFO] Member Up Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 57499) (added: false)
2022-07-22T16:15:11.043867Z [INFO] Member Up Id(ActorName("1337"), ActorId(69fddfaf-b618-452e-829b-897e54c91889), <ip>:7878, bump: 36078) (added: false)
2022-07-22T16:15:31.041785Z [INFO] Member Down Id(ActorName("2f55"), ActorId(67c1bd17-a13b-4c72-8e81-63c6a5b82977), <ip>:7878, bump: 46452) (removed: false)
2022-07-22T16:15:41.110738Z [INFO] Member Down Id(ActorName("1337"), ActorId(69fddfaf-b618-452e-829b-897e54c91889), <ip>:7878, bump: 36078) (removed: false)
2022-07-22T16:16:06.048567Z [WARN] foca: Member not found

Should we be worried about the "Member not found" log message, or is this normal? It doesn't seem to appear repeatedly. Could this be a race between two notifications?

Get traces / errors in a better shape

Traces are in a weird shape: when not running at debug level, the warnings foca emits don't contain any useful info.

Opening this issue to handle this with more care. Errors might include more data to make them more actionable (not sure that will be the case), but warn/error traces should definitely include useful metadata instead of requiring a rerun at the very noisy debug level.

REF #2 (comment)

Adaptive max_transmissions

Determining the value of max_transmissions should be dynamic, based on the number of members, probably following the memberlist implementation.

Or it should be possible to update this value at runtime as more members are added to a cluster.

Currently I'm passing a NODE_COUNT env var (for now I know this before deploying anything) and using the formula from memberlist.
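For reference, a sketch of that calculation. The formula mirrors memberlist's retransmit limit, retransmit_mult * ceil(log10(n + 1)); the function, the multiplier of 4, and the NODE_COUNT fallback are illustrative rather than anything foca provides.

// memberlist-style retransmit limit: retransmit_mult * ceil(log10(n + 1)).
fn max_transmissions_for(node_count: usize, retransmit_mult: u32) -> u8 {
    let limit = (retransmit_mult as f64) * ((node_count as f64) + 1.0).log10().ceil();
    limit.clamp(1.0, u8::MAX as f64) as u8
}

fn main() {
    // NODE_COUNT is the env var mentioned above; fall back to 32 if unset.
    let node_count: usize = std::env::var("NODE_COUNT")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(32);
    println!("max_transmissions = {}", max_transmissions_for(node_count, 4));
}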

FR: Responding to Broadcasts

I really like Foca's broadcasting feature for disseminating information. However, as far as I can tell, it is missing the ability to respond to broadcasts.

As I understand it, acknowledgements are a core part of the SWIM protocol, and can be used to send information back to the requesting node. It would be nice if users could piggy-back on this to send their own information back in response to broadcasts.

The main use case I am thinking of is Anti-Entropy. For example, node A might broadcast that it has version x of the metadata. Node B knows that version x is out-of-date (from vector clocks or some other way), so it responds back to A with the up-to-date metadata.

Member can lose all memberships and never recover

If there's a network event causing a node to lose all connectivity, all members will appear down from its point of view.

It doesn't appear to recover from this condition when the network goes back up.

Is this expected? If so, what would be a good way to remediate the situation? I assume detecting that a large number of nodes are down and then re-announcing to the original bootstrap list would work?
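A minimal sketch of that remediation idea, assuming foca exposes a live-member count (num_members() here, which may differ by version), the usual announce(dst, runtime) call, and that bootstrap_addrs is the list the node was originally seeded with:

// Sketch: if we no longer know about any other member, try the bootstrap list again.
if foca.num_members() == 0 {
    for addr in bootstrap_addrs.iter().cloned() {
        // Ignore individual failures; a single successful announce is enough
        // to start re-learning the cluster.
        let _ = foca.announce(addr, &mut runtime);
    }
}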

Some inconsistent behavior in the new version v0.17.0, compared to v0.16.0

The cluster has two nodes, node1 and node2.
When restarting node2 very quickly:

  1. node1 only receives Notification::Rename. However, there should be both Notification::Rename and Notification::Up, as described in the comments.
  2. node2 sometimes gets a DataFromOurselves error, which happens probabilistically. I looked into the Input::Data that causes this error and decoded it into payload::Header; the approximate structure is Header { src: node2, dst: node2, message: IndirectPing { origin: node1 } }.

Looping on "Sender is considered down, sending notification message"

I'm not sure how this situation happens, but with just 2 nodes it appears that, at some point after a few restarts, something gets loopy and both nodes keep writing messages like:

INFO handle_data{len=82 header=Header { src: Actor { id: ActorId(d402fd8c-68fe-4ef8-9a72-5929e346c4d4), addr: [fdaa:0:3b99:a7b:139:f465:d5c8:2]:8787, bump: 25953 }, src_incarnation: 0, dst: Actor { id: ActorId(ee386b44-3d9d-4dec-af91-596a5d7b6323), addr: [fdaa:0:3b99:a7b:f8:dc04:c68a:2]:8787, bump: 60930 }, message: TurnUndead }}: foca: Sender is considered down, sending notification message

This is a new project using foca, and I've re-used much of the code I had for my first project. The main difference is I'm now leaving the cluster when exiting.

I'm on v0.11.

Any ideas?

Lots of CPU time spent sorting broadcast bytes

I'm actually not sure exactly where the time is being spent, but most of the CPU time in my project is currently spent sorting something related to broadcast messages.

Flamegraph: corro.svg.zip

The call path looks like this:

  • Foca::broadcast
  • Foca::send_message
  • Broadcasts::fill
  • core::slice::sort::recurse
  • core::slice::sort::partial_insertion_sort
  • PartialOrd::lt

I have a loop that calls Foca::broadcast every 200ms to ensure maximum gossip propagation of broadcast messages.
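For context, a sketch of that loop under a tokio runtime. It assumes Foca::broadcast takes the runtime as an argument (as the Foca::broadcast -> Foca::send_message path in the flamegraph suggests) and that foca and runtime are the instance and Runtime impl the rest of the application already uses.

// Sketch of the 200ms re-broadcast loop described above (inside an async task).
let mut interval = tokio::time::interval(std::time::Duration::from_millis(200));
loop {
    interval.tick().await;
    // Ask foca to push pending custom broadcasts out again.
    if let Err(e) = foca.broadcast(&mut runtime) {
        tracing::warn!("broadcast tick failed: {e}");
    }
}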

Suspect members from saved state won't change state

As an optimization, on startup we apply the last known state of the cluster so it's a lot faster to learn about the whole cluster when there are hundreds of nodes.

I noticed that if the last saved state was Suspect, and it was applied on start, those members would never go back to a non-suspect state.

As an experiment I filtered out all Suspect members when using apply_many on startup and it appeared to fix it. The members were discovered as Alive again.

Is this the right way to prevent this behaviour, or is it a bug?
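For reference, a sketch of the startup filter described above. It assumes the persisted snapshot is an iterator of foca Member values, that the member state is exposed as State::Suspect via a state() accessor, and that apply_many takes the updates plus a runtime; exact names and signatures may vary between foca versions.

use foca::State;

// Sketch: drop Suspect entries from the saved snapshot before replaying it.
let filtered = saved_members
    .into_iter()
    .filter(|member| !matches!(member.state(), State::Suspect));
foca.apply_many(filtered, &mut runtime)?;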

WARN: "update about identity with same prefix as ours, declaring it down"

In our setup, I'm cleanly leaving the cluster by calling leave_cluster and waiting 2 seconds in the hope that the update propagates to as many nodes as possible.

Since leave_cluster moves the Foca instance, we can't call handle_data and such on it anymore; we've lost control of it. Does foca still handle dispatching the leave message thoroughly?

We often restart the cluster with a concurrency of 6 (or more) nodes at a time. I figure it's possible for some nodes not to receive the leave / down message and to keep considering this node up. So when they start again, they might apply_many an Up state for the node, and it might be outdated.

For example, there's no way to store the current state of the cluster past the leave_cluster call, so as other nodes are also leaving at the same time, we'll have stored the wrong identity for them.

When there's a deploy (and therefore a restart), we keep getting these log lines:

WARN: update about identity with same prefix as ours, declaring it down ...

I know these are mostly harmless, but I wonder if there's a way to either avoid them or to reduce their log level.

Encountered: "BUG! Probe cycle finished without running its full course"

I figured I might bring this up since it was labeled "BUG".

I'm not sure what can cause this; I've been restarting a few instances and encountered it. It seems to have recovered on its own.

From my logs, I see this sequence of things happened:

  • Notification::Rejoin
  • Notification::Active

and then:

ERROR foca: error handling timer: BUG! Probe cycle finished without running its full course

Decode error, possibly causing Member::Rename?

I run a cluster with 10 nodes, and I found that the error Decode(Found an Option discriminant that wasn't 0 or 1) occurs every 10-20 minutes on several nodes; other nodes then receive Member::Rename (I'm not sure whether that error causes it), but Member::Rename is even more frequent, even when there is no error.

Here is my foca config:

probe-period: 2s
probe-rtt: 1s
num-indirect-probes: 3
remove-down-after: 24h
max-packet-size: 1400
max-transmissions: 5
notify-down-members: true
suspect-to-down-after: 6s
periodic-announce:
    frequency: 30s
    num-members: 1
periodic-announce-to-down-members:
    frequency: 65s
    num-members: 2

Can't construct `PeriodicParams`

Just wanted to make a note that the PeriodicParams struct is part of Config but isn't re-exported at the crate root, so it can't actually be constructed by another crate using foca as a library.

Tweaking the `Config` for fast broadcast

Hey, it's me again!

I've been reading the docs on Config and trying to figure out how to tweak it. I'm just using Config::simple() for now; it seemed sensible.

I noticed it took a full ~4-5 minutes for all broadcasts to fully propagate on a cluster of 6 nodes geographically far apart. I'm sending foca messages over UDP.

How would you tweak the config to make these broadcasts propagate faster? Perhaps increasing max_transmissions, as the documentation suggests?

I can't help but feel like 4-5 minutes is a very long time, even with the default max_transmissions setting. If I send a single update it takes about 1 second to reach everywhere, but if I send ~20+ broadcasts from random nodes as a test, it takes 4-5 minutes to fully propagate everything once I stop the test.

I'm testing this by updating values in a KV store and comparing the final state of each node, diffing every state with every other state to make sure they're exactly the same. After the 4-5 minute delay, the logs stopped printing my log line stating that the node received a gossip broadcast item to process, which coincided with the state being the same everywhere.
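A sketch of the kind of tweak being asked about, starting from Config::simple() and raising max_transmissions as the docs suggest; the NonZeroU8 field type and the value 10 are assumptions and may not match every foca version.

use std::num::NonZeroU8;

// Sketch: allow each custom broadcast to piggyback on more outgoing
// messages before it is dropped from the backlog.
let mut config = foca::Config::simple();
config.max_transmissions = NonZeroU8::new(10).expect("non-zero");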

Bump postcard dependency to 1.0.0

Postcard 1.0.0 fixed an issue which caused chrono types and #[serde_as(as = "DisplayFromStr")] fields to fail to serialize. See jamesmunns/postcard#32.

In my use case, I put an http::Uri into my identity type, which doesn't implement the serde traits, so I used DisplayFromStr to work around it. However, the postcard codec failed to serialize it due to the issue stated above.
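For illustration, a sketch of that workaround (the struct and field names are made up): http::Uri implements Display and FromStr, so serde_with's DisplayFromStr lets it ride inside a serde-derived identity type.

use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};

// Illustrative identity type; http::Uri has no serde impls of its own,
// so it is serialized through its string form.
#[serde_as]
#[derive(Clone, Debug, Serialize, Deserialize)]
struct NodeIdentity {
    #[serde_as(as = "DisplayFromStr")]
    uri: http::Uri,
    bump: u16,
}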

Membership states are kept even when the identity has been renewed

(At the risk of creating an issue where it's all the same thing I've been misunderstanding...)

We've noticed that restarting a node or a small subset of nodes creates an odd situation where they'll receive far fewer payloads.

I think this might be because foca does not throw away old identities when they're renewed. We have a way to dump the result of iter_membership_states, and I noticed this:

$ corrosion cluster membership-states | grep -A4 -B5 3813b34
}
{
  "id": {
    "addr": "[fc01:a7b:152::]:8787",
    "cluster_id": 1,
    "id": "3813b347-28e3-47cd-b759-77030e0965b1",
    "ts": 7337274699724960448
  },
  "incarnation": 0,
  "state": "Alive"
--
}
{
  "id": {
    "addr": "[fc01:a7b:152::]:8787",
    "cluster_id": 1,
    "id": "3813b347-28e3-47cd-b759-77030e0965b1",
    "ts": 7335860957095395824
  },
  "incarnation": 0,
  "state": "Down"

Our identities include a timestamp (ts) which we use internally instead of a bump field (as previously discussed) to make sure we only keep the latest identity in Corrosion.

Our has_same_prefix implementation should be preventing duplicates:

impl Identity for Actor {
    fn has_same_prefix(&self, other: &Self) -> bool {
        // this happens if we're announcing ourselves to another node
        // we don't yet have any info about them, except their gossip addr
        if other.id.is_nil() || self.id.is_nil() {
            self.addr.eq(&other.addr)
        } else {
            self.id.eq(&other.id)
        }
    }

    fn renew(&self) -> Option<Self> {
        Some(Self {
            id: self.id,
            addr: self.addr,
            ts: NTP64::from(duration_since_epoch()).into(),
            cluster_id: self.cluster_id,
        })
    }
}

I suspect this behavior in foca is intended: keeping downed members around for however long remove_down_after is set to? Should it keep down members even when there's another live one with the same prefix?

I'm not sure yet what effect it's having on our project. We check the timestamp on Up and Down, making sure we only remove a member if it's the current one we know about, and we only replace/add a member if the new timestamp is higher than the previous one (sketched below). I'll have to start dumping the members too to figure that one out.
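A sketch of that bookkeeping, reusing the Actor/ts naming from the snippets above; the HashMap layout and function names are illustrative, not Corrosion's actual code, and they assume ActorId is Clone + Eq + Hash and ts is comparable.

use std::collections::HashMap;

// Illustrative member bookkeeping keyed by ActorId, keeping only the newest ts.
fn add_member(members: &mut HashMap<ActorId, Actor>, actor: Actor) -> bool {
    // Ignore the update if we already hold a newer (or equal) identity.
    if let Some(existing) = members.get(&actor.id) {
        if existing.ts >= actor.ts {
            return false;
        }
    }
    members.insert(actor.id.clone(), actor);
    true
}

fn remove_member(members: &mut HashMap<ActorId, Actor>, actor: &Actor) -> bool {
    // Only remove if the Down notification refers to the identity we currently hold.
    let is_current = members
        .get(&actor.id)
        .map(|existing| existing.ts == actor.ts)
        .unwrap_or(false);
    if is_current {
        members.remove(&actor.id);
    }
    is_current
}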

BUG! Probe cycle finished without running its full course

I noticed this happening a lot in my new project using foca:

2023-06-26T17:33:00Z app[5918571e9b3e83] waw [info]2023-06-26T17:33:00.162462Z ERROR corro_agent::broadcast: foca: error handling timer: BUG! Probe cycle finished without running its full course
2023-06-26T17:33:00Z app[5918571e9b3e83] waw [info]2023-06-26T17:33:00.162586Z  WARN handle_timer{event=SendIndirectProbe { probed_id: Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }, token: 0 }}: foca: SendIndirectProbe: Member not being probed probed_id=Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }
2023-06-26T17:34:31Z app[178199dc492228] cdg [info]2023-06-26T17:34:31.239548Z ERROR corro_agent::broadcast: foca: error handling timer: BUG! Probe cycle finished without running its full course
2023-06-26T17:34:31Z app[178199dc492228] cdg [info]2023-06-26T17:34:31.239671Z  WARN handle_timer{event=SendIndirectProbe { probed_id: Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }, token: 0 }}: foca: SendIndirectProbe: Member not being probed probed_id=Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }

I noticed on that cdg instance, this happened a few seconds before:

2023-06-26T17:34:10Z app[178199dc492228] cdg [info]2023-06-26T17:34:10.570726Z  INFO handle_timer{event=ProbeRandomMember(0)}: foca: Member failed probe, will declare it down if it doesn't react member_id=Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 } timeout=30s
2023-06-26T17:34:13Z app[178199dc492228] cdg [info]2023-06-26T17:34:13.369396Z  WARN handle_data{len=167 header=Header { src: Actor { id: ActorId(c5450e0c-52ac-4c33-9e71-fbc64a1c1f77), addr: [fdaa:2:4742:a7b:18e:91b5:604a:2]:8787, bump: 17046 }, src_incarnation: 0, dst: Actor { id: ActorId(341fe755-52b3-4808-888a-0cb45c1ff189), addr: [fdaa:2:4742:a7b:aebf:9821:21ff:2]:8787, bump: 43456 }, message: ForwardedAck { origin: Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 }, probe_number: 51 } } num_updates=1}: foca: Unexpected ForwardedAck sender
2023-06-26T17:34:13Z app[178199dc492228] cdg [info]2023-06-26T17:34:13.425022Z  WARN handle_data{len=167 header=Header { src: Actor { id: ActorId(339568fc-e676-48eb-9387-c67e381515f6), addr: [fdaa:2:4742:a7b:144:4161:8932:2]:8787, bump: 54737 }, src_incarnation: 0, dst: Actor { id: ActorId(341fe755-52b3-4808-888a-0cb45c1ff189), addr: [fdaa:2:4742:a7b:aebf:9821:21ff:2]:8787, bump: 43456 }, message: ForwardedAck { origin: Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 }, probe_number: 51 } } num_updates=1}: foca: Unexpected ForwardedAck sender
2023-06-26T17:34:13Z app[178199dc492228] cdg [info]2023-06-26T17:34:13.470475Z  WARN handle_data{len=167 header=Header { src: Actor { id: ActorId(90501bcc-b720-4327-961f-c2fb7f74b01c), addr: [fdaa:2:4742:a7b:104:f3da:d4ad:2]:8787, bump: 9228 }, src_incarnation: 0, dst: Actor { id: ActorId(341fe755-52b3-4808-888a-0cb45c1ff189), addr: [fdaa:2:4742:a7b:aebf:9821:21ff:2]:8787, bump: 43456 }, message: ForwardedAck { origin: Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 }, probe_number: 51 } } num_updates=1}: foca: Unexpected ForwardedAck sender

(the cdg node's actor_id is 341fe755-52b3-4808-888a-0cb45c1ff189)

I'm not sure what is happening exactly, but I assume it's a bug in foca :) I'm using it in a fairly straightforward way.

`updates_backlog` grows to a certain number and stays there

I hadn't been watching it, but foca.updates_backlog() appears to be reporting 10832. Restarting my program just makes it grow back up to almost the same number of backlog items.

Is this because we're trying to retransmit too much?

[graph: updates_backlog over time for a single node]

(That's a single node; it restarted at the time of that dip.)

All nodes:

[graph: updates_backlog for all nodes]
