Comments (6)
I've done further testing in response to your assertions.
First, I added a binary mode to the websocketpp, uwebsocket, and Go (golang.org/x/net/websocket) servers and updated the benchmark client to support it. Performance of all servers increased roughly proportionally to the message size reduction (the same message encoded in binary is about 75% the size of the message encoded in JSON), but not more. As you mention, a single broadcast from the server will cause thousands of JSON decodes in the clients, but this test indicates that, because multiple client machines were run in parallel, decoding was not a limiting factor.
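(For concreteness, here is a minimal sketch of the kind of binary framing this implies, assuming the one-command-byte layout visible in the uWS server code later in this thread; the helper name and exact layout are illustrative, not the shootout's actual code.)

#include <string>

// Hypothetical binary encoding: a one-byte command tag ('b' = broadcast,
// 'e' = echo) followed by the raw payload, instead of a JSON envelope such
// as {"type":"broadcast","payload":"..."}. Dropping the JSON text is where
// the roughly 25% size reduction comes from.
std::string encodeBinary(char command, const std::string &payload) {
    std::string out;
    out.reserve(1 + payload.size());
    out.push_back(command);
    out.append(payload);
    return out;
}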
Next, I tested using C++ instead of Go for the benchmark client. In addition, the C++ benchmark does not use a full websocket implementation; it works directly with libuv. I used your throughput benchmark as the starting point and updated it to run the same test as the Go tool's binary broadcast test. It was indeed able to get higher results than a single Go client, but running the Go tool on multiple machines in parallel produced higher results still.
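(For readers unfamiliar with the approach, this is a minimal sketch of the connect-and-read skeleton a bare libuv client is built on; the address is hypothetical, and the WebSocket handshake, framing, and RTT bookkeeping of the real benchmark are omitted.)

#include <uv.h>
#include <cstdlib>

// Allocate a buffer for each read.
static void allocCb(uv_handle_t *, size_t suggested, uv_buf_t *buf) {
    buf->base = (char *) malloc(suggested);
    buf->len = suggested;
}

// Incoming bytes land here; the real client would parse websocket frames
// and record round-trip times.
static void onRead(uv_stream_t *, ssize_t, const uv_buf_t *buf) {
    free(buf->base);
}

// Once connected, the real client would write the HTTP upgrade request
// before starting to read.
static void onConnect(uv_connect_t *req, int status) {
    if (status == 0) {
        uv_read_start(req->handle, allocCb, onRead);
    }
}

int main() {
    uv_loop_t *loop = uv_default_loop();
    uv_tcp_t socket;
    uv_tcp_init(loop, &socket);

    struct sockaddr_in dest;
    uv_ip4_addr("192.0.2.1", 3000, &dest); // hypothetical server address

    uv_connect_t req;
    uv_tcp_connect(&req, &socket, (const struct sockaddr *) &dest, onConnect);
    return uv_run(loop, UV_RUN_DEFAULT);
}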
None of the above tests found the dramatic differences your assertions would lead me to expect. However, I think I found something that explains the substantially different results we observe. I noticed the benchmarks for uwebsockets are hard-coded to connect to 127.0.0.1. This could confound the results in two ways. First, the client and server are running on the same machine, so any resources taken by the benchmark client have a direct negative effect on the server. This explains why a very low-overhead C++ client and a more heavyweight Go client produce substantially different results. Second, using the loopback interface instead of an actual network incurs far less overhead, which allows much higher numbers than are possible on a real network.
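(A minimal sketch of the fix, assuming nothing about the benchmark beyond the hard-coded address: take the target from argv so the client can run on a separate machine; the names here are illustrative, not the shootout's actual code.)

#include <cstdlib>
#include <iostream>
#include <string>

int main(int argc, char *argv[]) {
    // Default keeps the old loopback behavior; a real network test
    // overrides it with the server machine's address.
    std::string host = argc > 1 ? argv[1] : "127.0.0.1";
    int port = argc > 2 ? std::atoi(argv[2]) : 3000;
    std::cout << "benchmarking against " << host << ":" << port << std::endl;
    // ... connect the benchmark clients to host:port from here on ...
    return 0;
}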
You say the point of a good benchmark is to maximize the result difference between the test subjects. I do not see the fact that most implementations are within 50% of each other as a flaw; I see it as a valid data point that, for this particular workload, the choice of language and library should probably not be decided on throughput alone. For other workloads, results may be substantially different.
The raw results are here: https://github.com/hashrocket/websocket-shootout/blob/master/results/round-02-binary.md. The C++ benchmark is here: https://github.com/hashrocket/websocket-shootout/tree/master/cpp/bench.
To validate Chapter 6 of my first post, and to really show you how flawed your "benchmark of websocket libraries" is, I made my own server with uWS, and it performs several hundred percent better than the one you wrote (using the very same uWS):
clients: 1000 95per-rtt: 7ms min-rtt: 4ms median-rtt: 7ms max-rtt: 7ms
clients: 2000 95per-rtt: 15ms min-rtt: 8ms median-rtt: 11ms max-rtt: 18ms
clients: 3000 95per-rtt: 19ms min-rtt: 12ms median-rtt: 14ms max-rtt: 25ms
clients: 4000 95per-rtt: 22ms min-rtt: 16ms median-rtt: 19ms max-rtt: 27ms
clients: 5000 95per-rtt: 31ms min-rtt: 20ms median-rtt: 23ms max-rtt: 36ms
clients: 6000 95per-rtt: 37ms min-rtt: 23ms median-rtt: 27ms max-rtt: 39ms
clients: 7000 95per-rtt: 36ms min-rtt: 26ms median-rtt: 29ms max-rtt: 40ms
clients: 8000 95per-rtt: 41ms min-rtt: 30ms median-rtt: 33ms max-rtt: 45ms
clients: 9000 95per-rtt: 44ms min-rtt: 34ms median-rtt: 37ms max-rtt: 49ms
clients: 10000 95per-rtt: 50ms min-rtt: 38ms median-rtt: 42ms max-rtt: 50ms
clients: 11000 95per-rtt: 54ms min-rtt: 42ms median-rtt: 45ms max-rtt: 59ms
clients: 12000 95per-rtt: 59ms min-rtt: 46ms median-rtt: 49ms max-rtt: 61ms
clients: 13000 95per-rtt: 63ms min-rtt: 50ms median-rtt: 53ms max-rtt: 64ms
clients: 14000 95per-rtt: 65ms min-rtt: 55ms median-rtt: 57ms max-rtt: 68ms
clients: 15000 95per-rtt: 73ms min-rtt: 58ms median-rtt: 61ms max-rtt: 75ms
clients: 16000 95per-rtt: 78ms min-rtt: 62ms median-rtt: 65ms max-rtt: 83ms
clients: 17000 95per-rtt: 89ms min-rtt: 66ms median-rtt: 69ms max-rtt: 145ms
clients: 18000 95per-rtt: 91ms min-rtt: 69ms median-rtt: 73ms max-rtt: 95ms
clients: 19000 95per-rtt: 90ms min-rtt: 73ms median-rtt: 77ms max-rtt: 93ms
clients: 20000 95per-rtt: 94ms min-rtt: 77ms median-rtt: 80ms max-rtt: 95ms
clients: 21000 95per-rtt: 98ms min-rtt: 81ms median-rtt: 86ms max-rtt: 103ms
clients: 22000 95per-rtt: 101ms min-rtt: 86ms median-rtt: 89ms max-rtt: 103ms
clients: 23000 95per-rtt: 105ms min-rtt: 89ms median-rtt: 93ms max-rtt: 105ms
clients: 24000 95per-rtt: 105ms min-rtt: 94ms median-rtt: 97ms max-rtt: 109ms
clients: 25000 95per-rtt: 130ms min-rtt: 97ms median-rtt: 103ms max-rtt: 202ms
clients: 26000 95per-rtt: 115ms min-rtt: 102ms median-rtt: 106ms max-rtt: 116ms
clients: 27000 95per-rtt: 123ms min-rtt: 104ms median-rtt: 112ms max-rtt: 125ms
clients: 28000 95per-rtt: 131ms min-rtt: 110ms median-rtt: 115ms max-rtt: 134ms
Just like Chapter 6 states, a broadcast is ultimately going to end up being a loop of syscalls (which is a constant workload for all servers). That's why it is important to know what you are doing when implementing things like pub/sub (and this very benchmark of yours). You cannot use your grandmother as a test subject when testing how fast a sports car is and then conclude, based on the fact that your grandmother didn't go any faster, that "all cars are the same speed". What you benchmark in that case is your grandmother, not the car.
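(To make the "loop of syscalls" point concrete, here is a sketch using the same uWS 0.x API as the server code later in this thread: framing the broadcast once via a prepared message reduces the per-client work to roughly one write on an already-framed buffer. prepareMessage's exact signature is assumed to mirror the prepareMessageBatch call in that code.)

#include <uWS/uWS.h>

// Naive version: ws.send() re-frames and copies the message per client.
void broadcastNaive(uWS::Group<uWS::SERVER> &group, char *data, size_t length) {
    group.forEach([data, length](uWS::WebSocket<uWS::SERVER> ws) {
        ws.send(data, length, uWS::OpCode::BINARY);
    });
}

// Prepared version: frame once, then the loop is (roughly) one syscall
// per socket on the shared, already-framed buffer.
void broadcastPrepared(uWS::Group<uWS::SERVER> &group, char *data, size_t length) {
    uWS::WebSocket<uWS::SERVER>::PreparedMessage *prepared =
        uWS::WebSocket<uWS::SERVER>::prepareMessage(data, length, uWS::OpCode::BINARY, false);
    group.forEach([prepared](uWS::WebSocket<uWS::SERVER> ws) {
        ws.sendPrepared(prepared, nullptr);
    });
    uWS::WebSocket<uWS::SERVER>::finalizeMessage(prepared);
}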
By implementing a very simple server based on my own recommendations from this repo: https://github.com/alexhultman/High-performance-pub-sub, I was able to produce results on your own benchmark close to 5x better than those you came up with.
You need to stop tainting the benchmark with your own shortcomings. You cannot conclude that uWS is "about the same" as other low-perf implementations when the issue is what you put on top of the library. A server will not just magically be fast because you swapped to uWS; it requires that you know how to use it and the surrounding low-level matters.
Stick with the echo tests; they are standard in this industry: they benchmark receiving performance (parsing + memory management) as well as sending performance (framing + memory management). Everything else is up to the user; it's not part of the websocket library. Node.js, Apache, h2o, NGINX and all those HTTP servers measure performance in requests per second, i.e. echo, simply because that is the only way to show the performance of the server and only the server, without tainting it with user code.
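(For reference, an echo server in uWS is tiny; this is a minimal sketch using the same 0.x value-semantics API as the server code later in this thread.)

#include <uWS/uWS.h>

int main() {
    uWS::Hub hub;

    // Echo: send every received message straight back with the same opcode.
    hub.onMessage([](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        ws.send(message, length, opCode);
    });

    hub.listen(3000);
    hub.run();
}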
For reference, this is the result I get with the server you wrote in uWS:
clients: 1000 95per-rtt: 25ms min-rtt: 7ms median-rtt: 15ms max-rtt: 26ms
clients: 2000 95per-rtt: 41ms min-rtt: 10ms median-rtt: 32ms max-rtt: 44ms
clients: 3000 95per-rtt: 56ms min-rtt: 14ms median-rtt: 47ms max-rtt: 59ms
clients: 4000 95per-rtt: 72ms min-rtt: 19ms median-rtt: 62ms max-rtt: 76ms
clients: 5000 95per-rtt: 87ms min-rtt: 22ms median-rtt: 80ms max-rtt: 99ms
clients: 6000 95per-rtt: 106ms min-rtt: 25ms median-rtt: 96ms max-rtt: 111ms
clients: 7000 95per-rtt: 125ms min-rtt: 29ms median-rtt: 113ms max-rtt: 132ms
clients: 8000 95per-rtt: 139ms min-rtt: 33ms median-rtt: 129ms max-rtt: 144ms
clients: 9000 95per-rtt: 158ms min-rtt: 37ms median-rtt: 145ms max-rtt: 176ms
clients: 10000 95per-rtt: 182ms min-rtt: 48ms median-rtt: 164ms max-rtt: 189ms
clients: 11000 95per-rtt: 203ms min-rtt: 49ms median-rtt: 185ms max-rtt: 214ms
clients: 12000 95per-rtt: 217ms min-rtt: 49ms median-rtt: 200ms max-rtt: 225ms
clients: 13000 95per-rtt: 240ms min-rtt: 53ms median-rtt: 217ms max-rtt: 252ms
clients: 14000 95per-rtt: 257ms min-rtt: 57ms median-rtt: 234ms max-rtt: 263ms
clients: 15000 95per-rtt: 266ms min-rtt: 74ms median-rtt: 253ms max-rtt: 271ms
clients: 16000 95per-rtt: 282ms min-rtt: 69ms median-rtt: 269ms max-rtt: 285ms
clients: 17000 95per-rtt: 300ms min-rtt: 72ms median-rtt: 288ms max-rtt: 361ms
clients: 18000 95per-rtt: 316ms min-rtt: 88ms median-rtt: 306ms max-rtt: 323ms
clients: 19000 95per-rtt: 331ms min-rtt: 84ms median-rtt: 323ms max-rtt: 336ms
clients: 20000 95per-rtt: 349ms min-rtt: 80ms median-rtt: 341ms max-rtt: 353ms
clients: 21000 95per-rtt: 366ms min-rtt: 91ms median-rtt: 357ms max-rtt: 369ms
clients: 22000 95per-rtt: 386ms min-rtt: 93ms median-rtt: 375ms max-rtt: 388ms
clients: 23000 95per-rtt: 396ms min-rtt: 111ms median-rtt: 391ms max-rtt: 406ms
clients: 24000 95per-rtt: 416ms min-rtt: 98ms median-rtt: 408ms max-rtt: 429ms
clients: 25000 95per-rtt: 436ms min-rtt: 104ms median-rtt: 428ms max-rtt: 537ms
clients: 26000 95per-rtt: 453ms min-rtt: 107ms median-rtt: 446ms max-rtt: 454ms
clients: 27000 95per-rtt: 473ms min-rtt: 112ms median-rtt: 465ms max-rtt: 479ms
clients: 28000 95per-rtt: 487ms min-rtt: 117ms median-rtt: 480ms max-rtt: 492ms
As you can see, the difference is major. Yet the very same websocket library has been utilized. I hope this will get you to realize how flawed this benchmark is.
This yet again validates Chapter 6 of my very first post.
Yes, I can post it, but it would be very unfair if you used it, since the other servers would be using a different broadcasting algorithm.
This is what I have currently. It depends on a new function that is not fully decided on yet but should land some time soon (I have discussed this function for a while with other people doing pub/sub):
#include <uWS/uWS.h>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// A pending broadcast: the message and the socket that sent it.
struct Sender {
    std::string data;
    uWS::WebSocket<uWS::SERVER> ws;
};

std::vector<Sender> senders;
uWS::Hub hub;
bool newThisIteration, inBatch;

int main(int argc, char *argv[]) {
    uv_timer_t timer;
    uv_timer_init(hub.getLoop(), &timer);

    // The prepare handle runs just before the loop would block on I/O.
    // While a batch is open, arm a 1 ms timer so the loop keeps iterating
    // instead of sleeping, and clear the "new this iteration" flag.
    uv_prepare_t prepare;
    prepare.data = &timer;
    uv_prepare_init(hub.getLoop(), &prepare);
    uv_prepare_start(&prepare, [](uv_prepare_t *prepare) {
        if (inBatch) {
            uv_timer_start((uv_timer_t *) prepare->data, [](uv_timer_t *t) {}, 1, 0);
            newThisIteration = false;
        }
    });

    // The check handle runs after I/O. If a batch is open and no new
    // broadcast arrived during this iteration, flush the whole batch.
    uv_check_t checker;
    uv_check_init(hub.getLoop(), &checker);
    uv_check_start(&checker, [](uv_check_t *checker) {
        if (inBatch && !newThisIteration) {
            std::vector<std::string> messages;
            std::vector<int> excludes;
            for (Sender &s : senders) {
                messages.push_back(s.data);
            }
            if (messages.size()) {
                // Frame all batched messages once, send the prepared batch
                // to every connected socket, then release it.
                uWS::WebSocket<uWS::SERVER>::PreparedMessage *prepared = uWS::WebSocket<uWS::SERVER>::prepareMessageBatch(messages, excludes, uWS::OpCode::BINARY, false, nullptr);
                hub.getDefaultGroup<uWS::SERVER>().forEach([&prepared](uWS::WebSocket<uWS::SERVER> ws) {
                    ws.sendPrepared(prepared, nullptr);
                });
                uWS::WebSocket<uWS::SERVER>::finalizeMessage(prepared);
            }
            // Acknowledge each sender with an 'r' (result) message.
            for (Sender &s : senders) {
                s.data[0] = 'r';
                s.ws.send(s.data.data(), s.data.length(), uWS::OpCode::BINARY);
            }
            senders.clear();
            inBatch = false;
        }
    });

    // The first payload byte selects the action: 'b' queues a broadcast
    // for the next batch flush, 'e' echoes the message straight back.
    hub.onMessage([](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        switch (message[0]) {
        case 'b':
            senders.push_back({std::string(message, length), ws});
            newThisIteration = true;
            inBatch = true;
            break;
        case 'e':
            ws.send(message, length, opCode);
        }
    });

    hub.listen(3000);
    hub.run();
}
I landed the initial commit here: uNetworking/uWebSockets@e4b7584
Can you share the code for this?
I love the fact that you've put together a nice set of socket implementations in various languages (especially Elixir!).
I would very much like to see a more optimized version of the Node implementation, though. If it took advantage of inline caching and V8 Crankshaft's optimizer, I think it could do dramatically better.
Most.js does an amazing job at that: https://github.com/cujojs/most/tree/master/test/perf
Good write-up. I also wonder why the ws websocket library was used instead of uWS, when uWS is far better in performance. That's not fair to Node.js or the author, and I think the blog chart should be updated with uWS numbers instead.