Comments (12)

jamilbk avatar jamilbk commented on June 13, 2024

@AndrewDryga discovered that a simple speedtest from one client floods the metrics even at log level info:

https://cloudlogging.app.goo.gl/bTavBctEz1iXb6VT8

from firezone.

jamilbk avatar jamilbk commented on June 13, 2024

I think we might be instrumenting each client message?

    /// Process the bytes received from a client.
    ///
    /// After calling this method, you should call [`Server::next_command`] until it returns `None`.
    #[tracing::instrument(skip_all, fields(transaction_id, %sender, allocation, channel, recipient, peer), level = "error")]
    pub fn handle_client_input(&mut self, bytes: &[u8], sender: ClientSocket, now: SystemTime) {
        if tracing::enabled!(target: "wire", tracing::Level::TRACE) {
            let hex_bytes = hex::encode(bytes);
            tracing::trace!(target: "wire", %hex_bytes, "receiving bytes");
        }

        match self.decoder.decode(bytes) {
            Ok(Ok(message)) => {
                if let Some(id) = message.transaction_id() {
                    Span::current().record("transaction_id", hex::encode(id.as_bytes()));
                }

                self.handle_client_message(message, sender, now);
            }
            // Could parse the bytes but message was semantically invalid (like missing attribute).
            Ok(Err(error_code)) => {
                self.queue_error_response(sender, error_code);
            }
            // Parsing the bytes failed.
            Err(client_message::Error::BadChannelData(ref error)) => {
                tracing::debug!(%error, "failed to decode channel data")
            }
            Err(client_message::Error::DecodeStun(ref error)) => {
                tracing::debug!(%error, "failed to decode stun packet")
            }
            Err(client_message::Error::UnknownMessageType(t)) => {
                tracing::debug!(r#type = %t, "unknown STUN message type")
            }
            Err(client_message::Error::Eof) => {
                tracing::debug!("unexpected EOF while parsing message")
            }
        };
    }

AndrewDryga avatar AndrewDryga commented on June 13, 2024

Disabled OTLP export on staging. Now I can hit 160 Mbps (instead of 10) until the f1-micro CPU is throttled on the relay.

thomaseizinger avatar thomaseizinger commented on June 13, 2024

I think we might be instrumenting each client message?

The span should def be level "debug" I think, not error!
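For context, that is a one-word change on the attribute shown above (a sketch of the idea, not the actual patch). With `tracing`, a span's level decides whether it is created at all under the active filter: at the default `info` filter, a `debug`-level span is disabled and costs almost nothing, while an `error`-level span is allocated for every packet.

```rust
#[tracing::instrument(skip_all, fields(transaction_id, %sender, allocation, channel, recipient, peer), level = "debug")]
```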

thomaseizinger avatar thomaseizinger commented on June 13, 2024

I'll spend some cycles on this next week :)

thomaseizinger avatar thomaseizinger commented on June 13, 2024

Some benchmarking revealed that we were indeed allocating a lot of unnecessary spans.

The next improvement I can see is avoiding allocations for the actual relayed data. Based on the current design, those are expected. There are a number of things we can change here:

  1. If we want to stay in user-space, #4095 would be a first attempt, maybe paired with using mio directly (it is used by tokio under the hood) to dynamically listen on multiple ports.
  2. Move away from user-space and implement the relaying as an eBPF program.
  3. Build something using io-uring to avoid copying between user space and kernel space.
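Option (1) can be approximated with just the standard library before reaching for `sendmmsg` or `mio`: drain every queued datagram from a nonblocking socket in one wakeup, reusing a single receive buffer. A sketch under those assumptions (not firezone's actual code):

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;

/// Drain all currently-queued datagrams from a nonblocking socket in one
/// wakeup, reusing a single buffer. This approximates in std what
/// sendmmsg/recvmmsg (option 1) achieves with a single syscall.
fn drain(
    socket: &UdpSocket,
    buf: &mut [u8],
    mut handle: impl FnMut(&[u8]),
) -> std::io::Result<usize> {
    let mut n = 0;
    loop {
        match socket.recv_from(buf) {
            Ok((len, _from)) => {
                handle(&buf[..len]);
                n += 1;
            }
            // No more queued packets: return to the event loop (e.g. a mio poll).
            Err(e) if e.kind() == ErrorKind::WouldBlock => return Ok(n),
            Err(e) => return Err(e),
        }
    }
}

fn main() -> std::io::Result<()> {
    let rx = UdpSocket::bind("127.0.0.1:0")?;
    rx.set_nonblocking(true)?;
    let tx = UdpSocket::bind("127.0.0.1:0")?;
    for _ in 0..3 {
        tx.send_to(b"ping", rx.local_addr()?)?;
    }
    std::thread::sleep(std::time::Duration::from_millis(50));

    let mut buf = [0u8; 65535]; // one long-lived buffer for all packets
    let mut total = 0usize;
    let received = drain(&rx, &mut buf, |pkt| total += pkt.len())?;
    println!("received {received} datagrams, {total} bytes");
    Ok(())
}
```

With `mio`, the `drain` call would sit in the handler for each readable socket, so one wakeup services every pending packet on every port.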

jamilbk avatar jamilbk commented on June 13, 2024

I would say (1) is probably the least risky and might yield learnings or results we can apply to clients and gateways as well which are limited to user-space.

Have we verified where the major bottlenecks are? Surely we can copy packets from user space to kernel space faster than 150 Mbps

thomaseizinger avatar thomaseizinger commented on June 13, 2024

I would say (1) is probably the least risky and might yield learnings or results we can apply to clients and gateways as well which are limited to user-space.

(1) are already the learnings from clients & gateways that I'd like to apply back to the relay 😃

Have we verified where the major bottlenecks are? Surely we can copy packets from user space to kernel space faster than 150 Mbps

As far as I can tell, it is allocations. Also, we currently have queues, which isn't ideal I think.

I wasn't able to fully saturate my CPU yesterday and it topped out at about 1GBps. Not sure what the bottleneck is there?
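A sketch of the kind of per-packet allocation that adds up: copying each datagram into a fresh `Vec` versus parsing it in place and forwarding a borrowed slice of a long-lived receive buffer. The names and the 4-byte TURN channel-data header here are illustrative, not firezone's actual code:

```rust
/// Allocating version: one heap allocation plus a memmove per packet.
fn copy_out(datagram: &[u8]) -> Vec<u8> {
    datagram.to_vec()
}

/// Borrowing version: strip a 4-byte channel-data header and hand back
/// a subslice of the receive buffer. Zero allocations on the hot path.
fn payload(datagram: &[u8]) -> Option<&[u8]> {
    datagram.get(4..)
}

fn main() {
    let datagram = [0x40, 0x00, 0x00, 0x04, b'd', b'a', b't', b'a'];
    assert_eq!(copy_out(&datagram).len(), 8);
    assert_eq!(payload(&datagram), Some(&b"data"[..]));
    println!("payload: {:?}", payload(&datagram).unwrap());
}
```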

jamilbk avatar jamilbk commented on June 13, 2024

Not sure what the bottleneck is there?

My first hunch would be RAM speed, but you're on a fast system; I would expect more than 1 Gbps.

I think with the right profiling approach we can get much faster. @conectado has done some work in this area if you want to pick his brain when he's back from PTO.

topped out at about 1GBps

I'd be curious to know what the CPU wall time was on this vs waiting on IO. That should give us a rough indication of how much of the bottleneck is "our fault". We could also be copying between RAM and CPU multiple times.

So yeah, I guess the consensus is to start optimizing in user-space. We'll need a good grasp of the data patterns to make a good kernel-space implementation anyhow.
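The CPU-wall-time-vs-IO question can also be answered roughly from inside the process: time the blocking receive separately from the processing. A sketch with stand-ins for the socket calls (names hypothetical):

```rust
use std::time::{Duration, Instant};

/// Hypothetical accounting for one event-loop iteration: time spent
/// blocked waiting for data vs. time spent processing it.
struct LoopStats {
    waiting: Duration,
    working: Duration,
}

/// Run `f`, adding its elapsed wall time to `total`.
fn timed<T>(total: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *total += start.elapsed();
    out
}

fn main() {
    let mut stats = LoopStats { waiting: Duration::ZERO, working: Duration::ZERO };

    // Stand-ins for socket.recv() and packet processing:
    let pkt = timed(&mut stats.waiting, || {
        std::thread::sleep(Duration::from_millis(5)); // "blocked on IO"
        vec![0u8; 1500]
    });
    timed(&mut stats.working, || pkt.iter().map(|b| *b as u64).sum::<u64>());

    println!("waiting: {:?}, working: {:?}", stats.waiting, stats.working);
}
```

If `waiting` dominates, the bottleneck is not "our fault"; if `working` dominates, a flamegraph of that span is the next step.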

thomaseizinger avatar thomaseizinger commented on June 13, 2024

Did some more benchmarking and was able to reach ~ 7GBps locally:

    target/release/firezone-relay  46.97s user 75.49s system 325% cpu 37.636 total

Memory usage was at a constant 70MB.

thomaseizinger avatar thomaseizinger commented on June 13, 2024

Did some more benchmarking and was able to reach ~ 7GBps locally:

I am not sure how much I can actually trust these numbers because the benchmarking client I am using (https://github.com/vi/turnhammer) is able to generate the packets but somehow fails to receive them.

thomaseizinger avatar thomaseizinger commented on June 13, 2024

However, I can still generate a flamegraph from this usage, and ~50% of our time is spent allocating / moving memory in `__memmove_avx512_unaligned_erms`.
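A counting global allocator is a cheap way to confirm the "allocating stuff" half of that observation without a profiler (a sketch of the technique, not how the flamegraph above was produced):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts every heap allocation, so the
/// allocation count of a hot loop can be asserted in a test.
struct Counting;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static A: Counting = Counting;

fn main() {
    let before = ALLOCS.load(Ordering::Relaxed);
    let mut buf = vec![0u8; 65535]; // one allocation, reused below
    for i in 0..1000 {
        buf[i % 65535] = i as u8; // hot loop: no allocations
    }
    let after = ALLOCS.load(Ordering::Relaxed);
    assert_eq!(after - before, 1); // only the buffer itself was allocated
    println!("allocations in loop: {}", after - before);
}
```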
