
Comments (23)

Stebalien commented on July 21, 2024

Hm. So, this could be due to how we record bandwidth internally. We:

  1. Write.
  2. Record that we sent the bytes.
  3. Then, in a different thread, we keep an EWMA by averaging bandwidth second by second. We use a ticker (not a timer) for this.

Unfortunately, this means:

  1. A large write can be credited entirely to a single instant.
  2. If the ticker falls behind for some reason (heavily loaded machine), we'll have a very short tick (two back to back).

Given a 1ns "tick" and a 10MiB message, we'll get a 9 exabyte spike.
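
For illustration, a minimal sketch of that scheme (hypothetical names, not the actual go-flow-metrics code): bytes are accumulated on the write path, and a background goroutine folds the per-tick rate into an EWMA once per tick, so a very short tick makes the instantaneous rate explode.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type meter struct {
	accumulator uint64  // bytes recorded since the last tick
	rate        float64 // EWMA of bytes per second
}

// Record is called from the write path after the bytes are sent.
func (m *meter) Record(n uint64) { atomic.AddUint64(&m.accumulator, n) }

// tick runs in a separate goroutine, once per ticker interval.
func (m *meter) tick(tdiff time.Duration, alpha float64) {
	bytes := atomic.SwapUint64(&m.accumulator, 0)
	instant := float64(bytes) / tdiff.Seconds() // a tiny tdiff inflates this
	if m.rate == 0 {
		m.rate = instant
	} else {
		m.rate += alpha * (instant - m.rate)
	}
}

func main() {
	m := &meter{}
	m.Record(10 << 20)            // one 10 MiB write, recorded as a lump sum
	m.tick(time.Nanosecond, 0.63) // two back-to-back ticks ≈ a 1 ns interval
	fmt.Printf("spike: %.3g B/s\n", m.rate) // ~1e16 B/s from a single message
}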


Alternatively, this could be a clock adjustment, though Go's monotonic clock should normally prevent that.

albrow commented on July 21, 2024

@Stebalien thanks for your response.

I'm skeptical that this is purely due to a short sample time. Let's suppose that we recorded bandwidth over a period of 1 nanosecond (which is the shortest possible due to the precision of time.Time). In order for us to get a reading of 9 exabytes/second we would need to observe 9 gigabytes of bandwidth in a single nanosecond. In other words 9 exabytes/second == 9 gigabytes/nanosecond. That doesn't really seem possible. Is there maybe something else going on here?
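
(For reference, the conversion: 9 EB/s = 9×10^18 bytes/s and 1 ns = 10^-9 s, so a 1 ns sample reading 9 EB/s implies 9×10^9 bytes, i.e. 9 GB, observed in that single nanosecond.)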

albrow commented on July 21, 2024

In terms of reproducibility: since we don't know exactly what is causing the issue, it can take some time to reproduce. The spikes are consistent but not very frequent. The graphs I shared show spikes occurring across the entire network (about 3-5 per day on average). But if you just look at a single node, we're seeing one spike every 2-5 days.

Once you have a new version of libp2p/go-libp2p-core/metrics or libp2p/go-flow-metrics for us to test, it could take up to a few days for us to confirm whether the issue is still occurring.

Stebalien commented on July 21, 2024

You're right. I was calculating in tebibytes, not exbibytes.

  • How are you extracting this information from libp2p? I know we had an issue with spikes, but it was in our logging framework (or Grafana? Can't remember).
  • We could have a bad reader/writer that's returning -1 (along with an error).

albrow commented on July 21, 2024

How are you extracting this information from libp2p?

You can see our code that logs bandwidth usage every 5 minutes, and the separate code that bans peers which have high bandwidth usage. In both cases the code is really straightforward and we aren't doing any additional math or processing on top of what we get from BandwidthCounter.
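
For anyone following along, the read side of that is roughly the following (a minimal sketch against the go-libp2p-core metrics API, not the actual Mesh code; the logged values and sleep are only there to make the example self-contained):

package main

import (
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p-core/metrics"
)

func main() {
	// In a real node this reporter is handed to libp2p via the
	// BandwidthReporter option; here we feed it by hand to show the read side.
	counter := metrics.NewBandwidthCounter()
	counter.LogSentMessage(5 << 20)
	counter.LogRecvMessage(1 << 20)

	// Give the background sweeper a couple of ticks so the EWMA rates
	// (bytes per second) are populated.
	time.Sleep(2 * time.Second)

	totals := counter.GetBandwidthTotals()
	fmt.Printf("totalIn=%d totalOut=%d rateIn=%.0f rateOut=%.0f\n",
		totals.TotalIn, totals.TotalOut, totals.RateIn, totals.RateOut)

	// A ban check compares per-peer rates (GetBandwidthByPeer /
	// GetBandwidthForPeer) against a fixed threshold in the same way.
}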

These logs get converted to JSON and sent to Elasticsearch via Fluent Bit. The screenshots I shared are from Kibana, which we use to visualize the data in Elasticsearch.

One thing you just made me realize is that JSON cannot represent extraordinarily large numbers accurately. While that is true, I don't think it completely explains the behavior we're seeing. Number.MAX_SAFE_INTEGER is 2^53-1 (JavaScript numbers are 64-bit floats), and a large majority of the spikes we're seeing are below that number. Only the largest spike we saw (9 exabytes/sec) exceeds Number.MAX_SAFE_INTEGER. We likely lost some accuracy there, but that figure should still be roughly correct in order of magnitude.
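
As a quick standalone illustration of that precision limit (not Mesh code; the value is a made-up ~9×10^18 reading):

package main

import "fmt"

func main() {
	// float64 (what most JSON decoders use for numbers) has a 53-bit
	// mantissa, so integers above 2^53 are no longer represented exactly.
	const spike uint64 = 9_000_000_000_000_000_123
	fmt.Println(spike)                  // 9000000000000000123
	fmt.Println(uint64(float64(spike))) // 9000000000000000000: low digits lost, magnitude preserved
}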

Moreover, we can see that some peers are actually getting banned as a result of these spikes. That indicates that the numbers inside of Go are extraordinarily high independent of the limits of the JSON format or hypothetical issues with our ELK stack.

albrow commented on July 21, 2024

@Stebalien do you still want me to try out libp2p/go-flow-metrics#8? I'm not sure if we got consensus on that fix but it might still be worth a try.

Stebalien commented on July 21, 2024

Actually... are you doing this from WebAssembly? If the clock isn't monotonic, we could end up:

  1. Skipping a bunch of updates because the last update time is in the future.
  2. Applying all the updates at once, once we reach the last update time.

However, we'd still need to push 9 GiB of data for that to happen. On the other hand, that could happen every few days. Still, this is really fishy (unless WebAssembly is giving us a fake time?).

I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.
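
To make that failure mode concrete, here's a standalone sketch (hypothetical, not the actual sweeper; it reuses the EWMA update from the sketch earlier in the thread) showing how a backwards clock step turns an hour of skipped ticks into one huge credited interval:

package main

import (
	"fmt"
	"time"
)

type sweeper struct {
	lastUpdate  time.Time
	accumulator uint64
	rate        float64
}

func (s *sweeper) tick(now time.Time, alpha float64) {
	tdiff := now.Sub(s.lastUpdate)
	if tdiff <= 0 {
		return // lastUpdate is "in the future": skip, but keep accumulating
	}
	s.lastUpdate = now
	instant := float64(s.accumulator) / tdiff.Seconds()
	s.accumulator = 0
	if s.rate == 0 {
		s.rate = instant
	} else {
		s.rate += alpha * (instant - s.rate)
	}
}

func main() {
	start := time.Now()
	// Simulate a wall-clock step: a tick ran while the clock was an hour
	// fast, so after the clock is corrected lastUpdate sits in the future.
	s := &sweeper{lastUpdate: start.Add(time.Hour)}

	// Every tick for the next hour is skipped while ordinary traffic
	// (~2.5 MB/s here) keeps piling into the accumulator.
	for i := 0; i < 3600; i++ {
		s.accumulator += 2_500_000
		s.tick(start.Add(time.Duration(i)*time.Second), 0.63)
	}

	// The first tick after the clock catches up credits all ~9 GB to one second.
	s.tick(start.Add(time.Hour+time.Second), 0.63)
	fmt.Printf("rate ≈ %.3g B/s\n", s.rate)
}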

albrow commented on July 21, 2024

@Stebalien Good question. None of the results I've shared so far are from WebAssembly. (We actually don't have a way to collect logs from browser nodes at the moment).

I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.

Got it. We might try that PR if there is nothing else for us to do. In the meantime, we probably will need to implement a workaround on our end to account for these spikes. It's hard to do because most workarounds we can think of also open doors to potential attackers/spammers.

Stebalien commented on July 21, 2024

None of the results I've shared so far are from WebAssembly.

Damn. Ok. So, Go is usually pretty good about monotonic clocks. I only worried about WebAssembly because it's new and relatively unsupported.

Stebalien commented on July 21, 2024

I really can't find the bug. I've combed over github.com/libp2p/go-flow-metrics and can't find anything that could cause this.

If you're willing to annotate that code with some debug logging, that would help us figure out what's going on here. Specifically, we'd want to:

  1. Detect if Rate is very large here: https://github.com/libp2p/go-flow-metrics/blob/655b4706c9ab7e08ce0c43775ccb8f9c4fdd4f81/sweeper.go#L111
  2. If so, log an error with everything (total, rate, instant, previous total, previous rate, new time, old time).

albrow commented on July 21, 2024

@Stebalien Good idea. Yes we should be able to add logging for debugging purposes. But keep in mind the feedback loop here will be fairly slow.

Stebalien commented on July 21, 2024

Understood. At this point, I really have no idea what's going on.

albrow commented on July 21, 2024

@Stebalien we have a PR here: 0xProject/0x-mesh#513 which will log everything you asked for (and a bit more) whenever the reported rate is extraordinarily high. As I mentioned earlier, it might take a few days to get results. I'll report anything we find on this thread.

albrow commented on July 21, 2024

@Stebalien Actually it looks like I already have some useful logs. If I'm interpreting this correctly, the bug is occurring a lot more frequently than we thought. (Perhaps it only seemed rare because, relatively speaking, we don't check bandwidth inside of Mesh very frequently). I added the following logging code to go-flow-metrics:

// Added inside the sweeper's per-meter update loop in go-flow-metrics
// (logrus is github.com/sirupsen/logrus).
oldRate := m.snapshot.Rate
// Standard EWMA update: fold the instantaneous rate into the smoothed rate.
if m.snapshot.Rate == 0 {
	m.snapshot.Rate = instant
} else {
	m.snapshot.Rate += alpha * (instant - m.snapshot.Rate)
}
// Dump everything whenever the smoothed rate exceeds 1e8 B/s (~100 MB/s).
if m.snapshot.Rate >= 100000000 {
	logrus.WithFields(logrus.Fields{
		"oldRate":             oldRate,
		"rate":                m.snapshot.Rate,
		"oldTotal":            m.snapshot.Total,
		"newTotal":            total,
		"instant":             instant,
		"diff":                diff,
		"alpha":               alpha,
		"snapshot.LastUpdate": m.snapshot.LastUpdate,
		"sw.LastUpdate":       sw.lastUpdateTime,
		"tdiff":               tdiff,
		"timeMultiplier":      timeMultiplier,
	}).Debug("abnormal rate inside go-flow-metrics")
}

Here's a gist containing some recent results. Let me know if I can provide any additional information.

Stebalien commented on July 21, 2024

Ok, so we're resetting the total to 0 for some reason.

Stebalien commented on July 21, 2024

The bug happens here:

{
  "alpha_number": 0.6321205588285577,
  "diff_number": 18446744073709552000,
  "instant_number": 18467629448269930000,
  "level": "debug",
  "msg": "abnormal rate inside go-flow-metrics",
  "myPeerID": "16Uiu2HAm2qhk786g2KC5KvhHqtwot2hDbCLtwN82MJUrXDAAWzYU",
  "newTotal_number": 0,
  "oldRate_number": 0,
  "oldTotal_number": 818,
  "rate_number": 18467629448269930000,
  "snapshot.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "sw.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "tdiff_number": 998869082,
  "time": "2019-11-07T12:53:55-08:00",
  "timeMultiplier_number": 1.0011321984235768
}

We do reset the accumulator to 0 when we mark the meter as "idle" but then we go through a very careful dance to unregister old meters.
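
A side note on the numbers in that log entry: diff_number sits just below 2^64, which is exactly what an unsigned subtraction produces when the new total (0) is smaller than the old total (818). A quick standalone check (assuming the totals are uint64, as in go-flow-metrics):

package main

import "fmt"

func main() {
	var oldTotal, newTotal uint64 = 818, 0
	diff := newTotal - oldTotal // unsigned subtraction wraps around
	fmt.Println(diff)           // 18446744073709550798, i.e. 2^64 - 818
	// Rendered as a float64 (e.g. by a JSON logger) it shows up as roughly
	// 1.8446744073709552e19, matching the diff_number field above.
	fmt.Println(float64(diff))
}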

@vyzo, could you go over the code in sweeper.go in https://github.com/libp2p/go-flow-metrics and see if you can spot a bug in the logic? I can't find anything.

vyzo commented on July 21, 2024

The logic is not so simple, and there is suspicious floating-point arithmetic involved.

vyzo commented on July 21, 2024

So one possible issue is the subtraction of snapshot.Total from the accumulator when we are re-adding.
The comment says "Remove the snapshot total, it'll get added back on registration.", but I don't see that happening on register.

vyzo commented on July 21, 2024

Ah, it happens at the bottom of the loop; never mind.

vyzo commented on July 21, 2024

So the issue seems to be here: https://github.com/libp2p/go-flow-metrics/blob/master/sweeper.go#L151

We copy a meter from the back that hasn't been visited yet and could itself be an idle meter.
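
For context, the general hazard being described (a simplified, generic sketch of swap-removal during iteration, not the actual sweeper code or the exact consequence of this bug):

package main

import "fmt"

func main() {
	meters := []string{"active-1", "idle-a", "active-2", "idle-b"}

	// Buggy sweep: remove an idle element by copying the last (not yet
	// visited) element into its slot, without re-examining that slot.
	for i := 0; i < len(meters); i++ {
		if meters[i] == "idle-a" || meters[i] == "idle-b" {
			meters[i] = meters[len(meters)-1] // may copy in an unvisited idle meter
			meters = meters[:len(meters)-1]
			// Bug: the element now at index i is never checked (no i--).
		}
	}
	fmt.Println(meters) // [active-1 idle-b active-2]: an idle meter slipped through
}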

vyzo commented on July 21, 2024

libp2p/go-flow-metrics#11 fixes the issue discovered, and hopefully squashes the bug.
@albrow can you test it to verify that it fixes the problem?

vyzo commented on July 21, 2024

The bug might be worse than we thought. In my attempt to write a regression test in libp2p/go-flow-metrics#12, I didn't trigger the spike, but I did get mangled totals for the affected meters.
