Comments (23)
Hm. So, this could be due to how we record bandwidth internally. We:
- Write.
- Record that we sent the bytes.
- Then, in a different thread, we keep an EWMA by averaging bandwidth second by second. We use a ticker (not a timer) for this.
Unfortunately, this means:
- A large write can be treated as if it all happened in a single instant.
- If the ticker falls behind for some reason (heavily loaded machine), we'll have a very short tick (two back to back).
Given a 1ns "tick" and a 10MiB message, we'll get a 9 exabyte spike.
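Roughly, the failure mode looks like this (a toy sketch with made-up numbers, not the actual sweeper code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Bytes recorded between two ticks of the sweeper.
	accumulated := uint64(10 << 20) // a single 10MiB write

	// Ticks are normally ~1s apart, but if the ticker falls behind,
	// two ticks can fire back to back with a tiny elapsed time.
	tdiff := 1 * time.Nanosecond

	// The instantaneous rate scales the accumulated bytes by 1/tdiff,
	// so a short tick turns one large write into an absurd bytes/sec.
	timeMultiplier := float64(time.Second) / float64(tdiff)
	instant := timeMultiplier * float64(accumulated)

	fmt.Printf("instant rate: %.3g bytes/s\n", instant) // ~1e16, a petabyte-scale spike
}
```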
Alternatively, this could be a clock adjustment, although Go should guard against that by embedding a monotonic reading in time.Time.
@Stebalien thanks for your response.
I'm skeptical that this is purely due to a short sample time. Let's suppose that we recorded bandwidth over a period of 1 nanosecond (the shortest possible given the precision of time.Time). In order for us to get a reading of 9 exabytes/second, we would need to observe 9 gigabytes of bandwidth in a single nanosecond. In other words, 9 exabytes/second == 9 gigabytes/nanosecond. That doesn't really seem possible. Is there maybe something else going on here?
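To sanity-check the units (a throwaway snippet, not from our codebase):

```go
package main

import "fmt"

func main() {
	const (
		exabyte    = 1e18 // bytes
		nanosecond = 1e-9 // seconds
	)

	rate := 9.0 * exabyte // the reported spike: 9 exabytes/second
	perNanosecond := rate * nanosecond

	// 9e18 B/s over a 1e-9 s sample is 9e9 bytes: a genuine 1ns sample
	// would need to contain 9 gigabytes of traffic.
	fmt.Printf("%.0f bytes (~9 GB) per nanosecond\n", perNanosecond)
}
```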
In terms of reproducibility, since we don't know exactly what is causing the issue, it can take some time to reproduce. The spikes are consistent but not very frequent. The graphs I shared show spikes occurring across the entire network (about 3-5 spikes per day on average). But if you look at just a single node, we're seeing one spike every 2-5 days.
Once you have a new version of libp2p/go-libp2p-core/metrics or libp2p/go-flow-metrics for us to test, it could take up to a few days for us to confirm whether the issue is still occurring.
You're right. I was calculating in tebibytes, not exbibytes.
- How are you extracting this information from libp2p? I know we had an issue with spikes but it was in our logging framework (or grafana? Can't remember).
- We could have a bad reader/writer that's returning -1 (along with an error).
> How are you extracting this information from libp2p?
You can see our code that logs bandwidth usage every 5 minutes, and the separate code that bans peers which have high bandwidth usage. In both cases the code is really straightforward, and we aren't doing any additional math or processing on top of what we get from BandwidthCounter.
These logs get converted to JSON and sent to Elasticsearch via Fluent Bit. The screenshots I shared are from Kibana, which we use to visualize the data in Elasticsearch.
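The logging side is essentially a loop of this shape (a simplified sketch against the go-libp2p-core metrics API; the function name and details here are illustrative, not our exact code):

```go
package main

import (
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p-core/metrics"
)

// logBandwidth periodically dumps the counter's totals. Stats contains
// TotalIn/TotalOut (bytes) and RateIn/RateOut (bytes/sec, the EWMA values).
func logBandwidth(counter *metrics.BandwidthCounter) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		stats := counter.GetBandwidthTotals()
		fmt.Printf("rate_in=%f rate_out=%f total_in=%d total_out=%d\n",
			stats.RateIn, stats.RateOut, stats.TotalIn, stats.TotalOut)
	}
}
```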
One thing you just made me realize is that JSON cannot represent extraordinarily large numbers accurately. While that is true, I don't think it completely explains the behavior we're seeing. Number.MAX_SAFE_INTEGER is 2^53-1, and a large majority of the spikes we're seeing are below that number. Only the largest spike that we saw (9 exabytes/sec) exceeds Number.MAX_SAFE_INTEGER. We likely lost some accuracy there, but that figure should still be roughly correct in terms of magnitude.
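Concretely (standalone snippet; the smaller value is just an illustrative magnitude for a typical spike):

```go
package main

import "fmt"

func main() {
	const maxSafeInteger = 1<<53 - 1 // JS Number.MAX_SAFE_INTEGER = 9007199254740991

	smallerSpike := 4.0e15 // illustrative: most spikes fall below the safe-integer limit
	largestSpike := 9.0e18 // the 9 exabytes/sec outlier

	fmt.Println(smallerSpike > maxSafeInteger) // false: survives JSON round-tripping
	fmt.Println(largestSpike > maxSafeInteger) // true: only this one loses precision
}
```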
Moreover, we can see that some peers are actually getting banned as a result of these spikes. That indicates that the numbers inside of Go are extraordinarily high independent of the limits of the JSON format or hypothetical issues with our ELK stack.
@Stebalien do you still want me to try out libp2p/go-flow-metrics#8? I'm not sure if we got consensus on that fix but it might still be worth a try.
Actually... are you doing this from WebAssembly? If the clock isn't monotonic, we could end up:
- Skipping a bunch of updates because the last update time is in the future.
- Applying all the updates at once, once we reach the last update time.
However, we'd still need to push 9 GiB of data for that to happen. On the other hand, that could happen every few days, which would line up with how often you're seeing spikes. This is still really fishy, though (unless WebAssembly is giving us a fake time?).
I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.
@Stebalien Good question. None of the results I've shared so far are from WebAssembly. (We actually don't have a way to collect logs from browser nodes at the moment).
> I've pushed a fix for time travel to that PR in case you'd like to test it. But I'm far from confident that it's the issue.
Got it. We might try that PR if there is nothing else for us to do. In the meantime, we probably will need to implement a workaround on our end to account for these spikes. It's hard to do because most workarounds we can think of also open doors to potential attackers/spammers.
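The most obvious workaround we've considered is a plausibility cap on the reported rate, which also illustrates the tension (a hypothetical sketch, not code we've shipped; the threshold is invented):

```go
package main

import "fmt"

// plausibleRate is a hypothetical sanity filter: discard readings no real
// link could produce instead of acting on them. The ceiling is made up,
// and an attacker can simply stay just under whatever value we pick.
func plausibleRate(rate float64) (float64, bool) {
	const maxPlausible = 1e12 // 1 TB/s, far beyond any real NIC
	if rate < 0 || rate > maxPlausible {
		return 0, false // drop the sample rather than ban the peer
	}
	return rate, true
}

func main() {
	if _, ok := plausibleRate(9e18); !ok {
		fmt.Println("discarded bogus reading")
	}
}
```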
> None of the results I've shared so far are from WebAssembly.
Damn. Ok. So, Go is usually pretty good about monotonic clocks. I only worried about WebAssembly because it's new and relatively unsupported.
I really can't find the bug. I've combed over github.com/libp2p/go-flow-metrics and can't find anything that could cause this.
If you're willing to annotate that code with some debug logging, that would help us figure out what's going on here. Specifically, we'd want to:
- Detect if Rate is very large here: https://github.com/libp2p/go-flow-metrics/blob/655b4706c9ab7e08ce0c43775ccb8f9c4fdd4f81/sweeper.go#L111
- If so, log an error with everything (total, rate, instant, previous total, previous rate, new time, old time).
@Stebalien Good idea. Yes, we should be able to add logging for debugging purposes. But keep in mind that the feedback loop here will be fairly slow.
Understood. At this point, I really have no idea what's going on.
@Stebalien we have a PR here: 0xProject/0x-mesh#513 which will log everything you asked for (and a bit more) whenever the reported rate is extraordinarily high. As I mentioned earlier, it might take a few days to get results. I'll report anything we find on this thread.
@Stebalien Actually it looks like I already have some useful logs. If I'm interpreting this correctly, the bug is occurring a lot more frequently than we thought. (Perhaps it only seemed rare because, relatively speaking, we don't check bandwidth inside of Mesh very frequently.) I added the following logging code to go-flow-metrics:
```go
	// Inside the sweeper's per-meter update loop: compute the EWMA as usual,
	// then log every intermediate value whenever the resulting rate is absurd.
	oldRate := m.snapshot.Rate
	if m.snapshot.Rate == 0 {
		m.snapshot.Rate = instant
	} else {
		m.snapshot.Rate += alpha * (instant - m.snapshot.Rate)
	}
	if m.snapshot.Rate >= 100000000 {
		logrus.WithFields(logrus.Fields{
			"oldRate":             oldRate,
			"rate":                m.snapshot.Rate,
			"oldTotal":            m.snapshot.Total,
			"newTotal":            total,
			"instant":             instant,
			"diff":                diff,
			"alpha":               alpha,
			"snapshot.LastUpdate": m.snapshot.LastUpdate,
			"sw.LastUpdate":       sw.lastUpdateTime,
			"tdiff":               tdiff,
			"timeMultiplier":      timeMultiplier,
		}).Debug("abnormal rate inside go-flow-metrics")
	}
```
Here's a gist containing some recent results. Let me know if I can provide any additional information.
Ok, so we're resetting the total to 0 for some reason.
The bug happens here:
```json
{
  "alpha_number": 0.6321205588285577,
  "diff_number": 18446744073709552000,
  "instant_number": 18467629448269930000,
  "level": "debug",
  "msg": "abnormal rate inside go-flow-metrics",
  "myPeerID": "16Uiu2HAm2qhk786g2KC5KvhHqtwot2hDbCLtwN82MJUrXDAAWzYU",
  "newTotal_number": 0,
  "oldRate_number": 0,
  "oldTotal_number": 818,
  "rate_number": 18467629448269930000,
  "snapshot.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "sw.LastUpdate_string": "2019-11-07T12:53:55.723148-08:00",
  "tdiff_number": 998869082,
  "time": "2019-11-07T12:53:55-08:00",
  "timeMultiplier_number": 1.0011321984235768
}
```
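That diff value is the giveaway: it's almost exactly 2^64, which is what you get when newTotal (0) minus oldTotal (818) wraps around as a uint64. A standalone demonstration:

```go
package main

import "fmt"

func main() {
	var newTotal, oldTotal uint64 = 0, 818

	// Unsigned subtraction wraps: 0 - 818 == 2^64 - 818.
	diff := newTotal - oldTotal
	fmt.Println(diff) // 18446744073709550798

	// Converted to float64 this rounds to exactly 2^64; JavaScript (and
	// hence Kibana) renders that float as 18446744073709552000, matching
	// the diff_number in the log above.
	fmt.Printf("%.0f\n", float64(diff))
}
```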
We do reset the accumulator to 0 when we mark the meter as "idle", but then we go through a very careful dance to unregister old meters.
@vyzo, could you go over the code in sweeper.go in https://github.com/libp2p/go-flow-metrics and see if you can spot a bug in the logic? I can't find anything.
The logic is not so simple, and there is suspicious floating-point arithmetic involved.
So possibly an issue is the subtraction of snapshot.Total from the accumulator when we are re-adding. The comment says "Remove the snapshot total, it'll get added back on registration.", but I don't see that happening on register.
Ah, it happens at the bottom of the loop; nvm.
So the issue seems to be here: https://github.com/libp2p/go-flow-metrics/blob/master/sweeper.go#L151
We copy a meter from the back that hasn't been visited yet, and it could potentially be an idle meter.
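Schematically, the pitfall looks like this (a generic illustration of the swap-remove pattern, not the actual sweeper code):

```go
package main

import "fmt"

// meter stands in for go-flow-metrics' per-meter state.
type meter struct{ idle bool }

func main() {
	meters := []*meter{{idle: true}, {}, {idle: true}}

	for i := 0; i < len(meters); i++ {
		if meters[i].idle {
			// Swap-remove: copy the last element (not yet visited in this
			// sweep) into slot i, then shrink the slice.
			meters[i] = meters[len(meters)-1]
			meters = meters[:len(meters)-1]
			// Without revisiting slot i, the meter copied in from the back
			// is skipped for this pass -- and if it's itself idle, it
			// survives the sweep with inconsistent state.
			i--
		}
	}
	fmt.Println("meters remaining:", len(meters)) // 1
}
```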
libp2p/go-flow-metrics#11 fixes the issue discovered, and hopefully squashes the bug.
@albrow can you test it to verify that it fixes the problem?
The bug might be worse than we thought. In my attempt to write a regression test in libp2p/go-flow-metrics#12, I didn't trigger the spike, but I did get mangled totals for the affected meters.