websocket-shootout's Issues

Logo for this project

Does websocket-shootout need a logo?
Does Indiana Jones need his hat?
Does Thor need his hammer?
Does Batman need his crippling fear of bats?

I don't know the answer to any of these questions, but I sure think it would be great if this project had a fancy logo.

Document build dependencies

I'm trying to build the benchmark tool on Fedora 23, but I'm struggling to satisfy the dependencies; I'm using Go 1.5.4. Initially I got these errors:

$ make bin/websocket-bench                                                                                              
cd go/src/hashrocket/websocket-bench && go install
main.go:11:2: cannot find package "github.com/spf13/cobra" in any of:
    /usr/lib/golang/src/github.com/spf13/cobra (from $GOROOT)
    ($GOPATH not set)
action_cable_server_adapter.go:6:2: cannot find package "golang.org/x/net/websocket" in any of:
    /usr/lib/golang/src/golang.org/x/net/websocket (from $GOROOT)
    ($GOPATH not set)
make: *** [Makefile:16: /home/alewis/git/websocket-shootout/go/bin/websocket-bench] Error 1

I then installed the following packages and their dependencies from the Fedora repos:

golang-golangorg-net-devel-0-0.33.git4d38db7.fc23.noarch
golang-github-spf13-cobra-devel-0-0.16.git8e91712.fc23.noarch

After this, make threw exactly the same errors. I set my GOPATH to ./go and then got this error:

$ make bin/websocket-bench 
cd go/src/hashrocket/websocket-bench && go install
main.go:11:2: no buildable Go source files in /home/alewis/git/websocket-shootout/go/src/github.com/spf13/cobra
action_cable_server_adapter.go:6:2: cannot find package "golang.org/x/net/websocket" in any of:
    /usr/lib/golang/src/golang.org/x/net/websocket (from $GOROOT)
    /home/alewis/git/websocket-shootout/go/src/golang.org/x/net/websocket (from $GOPATH)
make: *** [Makefile:16: /home/alewis/git/websocket-shootout/go/bin/websocket-bench] Error 1

If it isn't obvious already, I have little experience with building Go applications; a few instructions in the README would be really helpful.
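
For anyone hitting the same wall, two observations: GOPATH must be an absolute path, which is likely why ./go did not work, and the "no buildable Go source files" error for an existing go/src/github.com/spf13/cobra directory suggests the vendored dependencies are git submodules that were never initialized. A guess at a working sequence, assuming the dependencies are indeed submodules (otherwise go get them):

$ git submodule update --init    # populate vendored deps under go/src
$ export GOPATH=$(pwd)/go        # must be absolute, not ./go
$ make bin/websocket-bench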

Criticism of the benchmarks

Chapter 1: You are benchmarking the client, not the server

Let's look at the client you are using to "benchmark" these servers:

  • It's written in Go (facepalm level: 6/10)
  • It makes use of a full implementation of the WebSocket protocol (facepalm level: 10/10)
  • It has blocking sleeps inside of it (facepalm level: not sure?)

Golden rule of benchmarking: benchmark the server, NOT the client. You are benchmarking a high performance C++ server with a low performance golang client. Every JSON message received server side (this whole JSON-tainting story is a chapter of its own) results in many receives client side. In fact, if you look at µWS as an example, the only thing happening in user space on the server is:

  • recv syscall
  • Parse WebSocket frame
  • Copy it into RapidJSON
  • Parse it (JSON parsing is about 10x slower than parsing a WebSocket frame, which is itself what you claim to be benchmarking)
  • Format the WebSocket message (once!)
  • Loop over all sockets: call Unix send syscall for all sockets

So what are you benchmarking server side here? Well, you are benchmarking the receipt of one WebSocket frame followed by one JSON parse and one WebSocket frame formatting; the rest is 100% the operating system (in other words, there is no theoretical way to make it any more efficient).

Now, let's look at the client side: since you are using a low performance golang client with a full WebSocket implementation, every broadcast results in thousands of WebSocket frame parsings client side. Are you starting to see what I'm pointing at? You are benchmarking 1 WebSocket frame parse + 1 JSON parse server side, followed by thousands of WebSocket frame parsings client side, and you are parsing these in golang!

I immediately saw a HUGE tainting factor client side when I started benchmarking WebSocket servers. So what did I do about it? I wrote the client in low-level TCP in C++ and made sure the server was stressed at 100% all the time. This dramatically increased the gap between the slow WebSocket servers and the fast ones (as you can see in my benchmark, WebSocket++ is many tens of times faster than ws).

If you are going to act like you are benchmarking a high performance server, you had better write a client capable of outperforming the server; otherwise you are benchmarking nothing but the client. No matter how many client instances you have, there is still a massive difference between having many slow clients in a cluster and having one ultra-fast client. You are completely tainting any kind of result with this client.

Chapter 2: the broadcasting benchmark in general

You told me that you were not able to see any difference when doing an echo test, so instead you made this broadcast test. That statement alone solidifies my criticism: your client is so slow that it makes no difference whether you have a fast server or a slow one, while in my tests I can see dramatic differences in server performance even when dealing with 1 single echo message! I can see a 6x difference in performance between ws and µWS with 1 single echo message, and up to 150x when doing thousands of echoes per TCP chunk. But my point is not the 150x; my point is that it is absolutely possible to showcase a massive difference in performance with simple echo benchmarks! But like I said: it requires that your client is able to stress the server, and that means you cannot possibly write it in golang with the standard golang bullshit WebSocket client implementation.

Chapter 3: the JSON tainting

Like you have already heard, the fact that you benchmark 1 WebSocket frame parse together with 1 JSON parse, where the JSON parsing is massively dominant, is simply unacceptable. And you pass this off as a WebSocket benchmark! Parsing JSON is extremely slow compared to parsing a WebSocket frame: every single byte of the JSON data has to be scanned for a matching end token (if you are inside a string, it has to check EVERY BYTE for the end token). Compare this to the WebSocket format, where the length of the whole message is given in the header, which makes the parsing O(1) while the JSON parsing is AT LEAST O(n).
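
To make the complexity gap concrete, here is a minimal Go sketch (illustrative code, not from this repo): reading a WebSocket payload length inspects at most ten header bytes no matter how large the message is, while JSON validation has to walk the entire buffer.

package main

import (
    "encoding/binary"
    "encoding/json"
    "fmt"
)

// wsPayloadLen decodes the payload length from a WebSocket frame header
// (RFC 6455). It reads 2-10 bytes regardless of payload size: O(1).
// (A masked client-to-server frame adds a 4-byte masking key after this.)
func wsPayloadLen(h []byte) uint64 {
    switch l := h[1] & 0x7f; {
    case l < 126:
        return uint64(l)
    case l == 126:
        return uint64(binary.BigEndian.Uint16(h[2:4]))
    default:
        return binary.BigEndian.Uint64(h[2:10])
    }
}

func main() {
    // 0x81 = FIN + text opcode; 0x05 = unmasked, 5-byte payload.
    frame := []byte{0x81, 0x05, 'h', 'e', 'l', 'l', 'o'}
    fmt.Println(wsPayloadLen(frame)) // 5, after looking at two header bytes

    // json.Valid must scan every byte of the document: O(n).
    fmt.Println(json.Valid([]byte(`{"payload":"hello"}`))) // true
}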

Chapter 4: the threading and other random variance

Some servers are threaded, some are not. Some servers are implemented with hash tables, some are not. Some servers use RapidJSON, some use other JSON implementations. You simply have WAY too many variables moving at random to get any kind of stable result. Comparing a server utilizing 8 CPU cores with a server restricted to 1 is just mindblowingly invalid. It's not just a bunch of "threads" you can toss in to get a speed-up; you also need to take into account the efficient and the inefficient ways of using threading, and that varies with the implementation.

Chapter 5: gold comments

  • Comments in #20: about the JSON variance
  • Comments in #19: C++ was run with no optimization and yet there was "no difference". If you are running a benchmark where you cannot see any difference between optimized and non-optimized builds, then you are obviously not properly benchmarking the server! How hard is that to understand? I see humongous differences at different optimization levels!
  • Comments in #41: initial criticisms and answers.

Chapter 6: low-level primitives vs. high level algorithms

A WebSocket library exposes some very fundamental, low-level functionality that you as an app developer can use to construct more complex algorithms, such as efficient broadcasting. What this benchmark is trying to simulate is very close to a pub/sub server: you get an event and you push it to all the connected sockets.

Now, as you might know, broadcasting can be implemented with a simple for-loop and a call to the WebSocket send function, and that is what you are doing in this "benchmark". The problem is that this kind of algorithm for distributing 100 events to X connections is very far from efficient, and it does not reflect the underlying low-level library so much as your own abstract interpretation of "pub/sub".

As an example, I work for a company where pub/sub is part of the problem to optimize. That pub/sub was implemented with a for-loop and a call to send for each socket. I changed it into a far more efficient algorithm that merges the broadcasts and prepares the WebSocket frames for them up front. This resulted in a 270x speed-up and far outperforms the most common pub/sub implementations out there. Had I used a slow server as the low-level implementation, this speed-up would not have been remotely possible. Yet it still required me to design the algorithm efficiently.

My point is: you cannot benchmark the low-level fundamentals of a library by benchmarking your own inefficient for-loop, which pretty much just calls into the kernel and leaves no room for the user-space server to shine.
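
To illustrate the difference, here is a minimal Go sketch (illustrative names, unmasked server-to-client text frames only, no error handling, payloads capped at 64 KiB for brevity): the naive loop builds the frame once per socket, while the merged version builds the frame bytes once and reuses the same buffer for every send.

package main

import (
    "fmt"
    "net"
)

// makeTextFrame builds an unmasked server-to-client text frame (RFC 6455).
func makeTextFrame(payload []byte) []byte {
    n := len(payload)
    var hdr []byte
    if n < 126 {
        hdr = []byte{0x81, byte(n)}
    } else { // up to 64 KiB, for brevity
        hdr = []byte{0x81, 126, byte(n >> 8), byte(n)}
    }
    return append(hdr, payload...)
}

// naiveBroadcast re-frames the message for every socket:
// len(conns) frame builds and allocations.
func naiveBroadcast(conns []net.Conn, msg []byte) {
    for _, c := range conns {
        c.Write(makeTextFrame(msg))
    }
}

// mergedBroadcast frames the message once and reuses the bytes:
// one frame build, then len(conns) plain writes.
func mergedBroadcast(conns []net.Conn, msg []byte) {
    frame := makeTextFrame(msg)
    for _, c := range conns {
        c.Write(frame)
    }
}

func main() {
    // Smoke test over an in-memory pipe.
    a, b := net.Pipe()
    go mergedBroadcast([]net.Conn{a}, []byte("hello"))
    buf := make([]byte, 16)
    n, _ := b.Read(buf)
    fmt.Printf("% x\n", buf[:n]) // 81 05 68 65 6c 6c 6f
}

A production pub/sub goes further (merging multiple queued broadcasts into one send per socket), but even this removes the per-socket framing work from the hot loop.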

End notes

This benchmark is completely flawed and does not in any way show the real personalities of the underlying WebSocket servers. I know for a fact that WebSocket++ far outperforms most other servers, and that needs to be properly displayed here. The point of a good benchmark is to maximize the difference between the test subjects: you want to show differences in terms of multiples, not minor percentages.

node uws version doesn't work

wss.clients.forEach(function each(client) {
^

TypeError: Cannot read property 'forEach' of undefined
at broadcast (/home/haofei/work/websocket-shootout/js/uws/index.js:12:14)
at Socket.incoming [as internalOnMessage]
at Server.nativeServer.onMessage (/home/haofei/work/websocket-shootout/js/uws/node_modules/uws/uws.js:395:20)

Rust benchmark results

Is it possible to run the Rust code on the same machines to see how the results compare?

JSON and the C++ benchmark

I don't think this was intentional, but it seems that one of the slowest C++ JSON parsers out there was used for the C++ benchmark. Perhaps a review of the following might help:


Also, with regard to the following:
https://github.com/hashrocket/websocket-shootout/blob/master/cpp/src/server.cpp#L48

Perhaps keep a per-thread instance of the JSON reader and its variables (seeing as the io_service is threaded), instead of constructing and destructing the instances on each call to on_message.

Types such as Json::Reader (the r instance) can be relatively expensive to construct and destruct.
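
That reuse pattern, rendered in Go for illustration only (in the C++ server it would be, e.g., a thread_local Json::Reader, one per io_service thread): pool the expensive-to-construct object and reset it per message instead of constructing and destructing it on every callback. All names here are hypothetical, not from this repo.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "sync"
)

// bufPool hands out reusable buffers so each message doesn't pay for a
// fresh allocation, mirroring the per-thread-reader suggestion above.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func encodeJSON(v interface{}) ([]byte, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    defer bufPool.Put(buf)
    buf.Reset()
    if err := json.NewEncoder(buf).Encode(v); err != nil {
        return nil, err
    }
    // Copy out: the pooled buffer will be reused after Put.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out, nil
}

func main() {
    b, err := encodeJSON(map[string]string{"payload": "hello"})
    if err != nil {
        panic(err)
    }
    fmt.Print(string(b)) // {"payload":"hello"}
}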

Error in actioncable benchmark

System: FreeBSD 11.0
Go: go version go1.7.4 freebsd/amd64
Error:

./bin/websocket-bench broadcast ws://10.10.0.20:8888/ -l 10.10.0.21 -c 4 -s 100 --step-size 100 --origin http://10.10.0.20/ --server-type actioncable
panic: interface conversion: interface is nil, not string

goroutine 61 [running]:
panic(0x6db480, 0xc4201055c0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
hashrocket/websocket-bench/benchmark.(*ActionCableServerAdapter).Receive(0xc42008a2f8, 0x0, 0x0, 0x0)
        /data/home/test/websocket-shootout/go/src/hashrocket/websocket-bench/benchmark/action_cable_server_adapter.go:79 +0x169
hashrocket/websocket-bench/benchmark.(*localClient).rx(0xc4202b62d0)
        /data/home/test/websocket-shootout/go/src/hashrocket/websocket-bench/benchmark/local_client.go:163 +0x49
created by hashrocket/websocket-bench/benchmark.newLocalClient
        /data/home/test/websocket-shootout/go/src/hashrocket/websocket-bench/benchmark/local_client.go:140 +0x5bd

I see payloads like this in the Rails log:

I, [2017-01-21T12:52:16.513466 #1355]  INFO -- : BenchmarkChannel transmitting {"action"=>"broadcast", "payload"=>{"SendTime"=>"2017-01-21T12:52:17.131527137+01:00", "Padding"=>""}} (via streamed from all)
I, [2017-01-21T12:52:16.505506 #1355]  INFO -- : BenchmarkChannel transmitting {"action"=>"broadcast", "payload"=>{"SendTime"=>"2017-01-21T12:52:17.131527137+01:00", "Padding"=>""}} (via streamed from all)
I, [2017-01-21T12:52:16.513070 #1355]  INFO -- : BenchmarkChannel transmitting {"action"=>"broadcast", "payload"=>{"SendTime"=>"2017-01-21T12:52:17.131527137+01:00", "Padding"=>""}} (via streamed from all)
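
The panic message means a direct type assertion, x.(string), ran against a nil interface value, i.e. the adapter's Receive pulled a field out of this message that isn't there (or isn't a string). A defensive sketch of the pattern (illustrative only; the field name is hypothetical, not necessarily what action_cable_server_adapter.go:79 asserts):

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var msg map[string]interface{}
    if err := json.Unmarshal([]byte(`{"action":"broadcast"}`), &msg); err != nil {
        panic(err)
    }

    // A direct assertion panics when the key is absent:
    // "interface conversion: interface is nil, not string"
    // id := msg["identifier"].(string)

    // The comma-ok form degrades gracefully instead:
    id, ok := msg["identifier"].(string)
    if !ok {
        fmt.Println("skipping message without identifier")
        return
    }
    fmt.Println(id)
}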

Questions about JavaScript in Round 2

I've got some questions about JavaScript/uws performance in the Round 2 benchmark.

  1. Regarding the single-threaded uws benchmark, it says:

"Note - was only able to get near these results once."

Why is that? Were there errors in the other runs? Did the results vary too much? If so, what were the average performance numbers?

  2. In the clustered uws benchmark, the results were less than half of those in the non-clustered (single-threaded) benchmark. Since Node generally gets a huge bump in performance when running in cluster mode, do you have any guess as to why the result here was the opposite (clustered running at half the speed of non-clustered)?

JVM settings for Clojure

I'm not sure what version of the JVM or JVM settings are being used for Clojure, but they will have some impact on performance. I would highly recommend using the latest version of Java 8 and running with at least the following JVM options:

-server # probably will be auto-selected, but doesn't hurt to be specific
-XX:+AggressiveOpts # enables optimizations that will be default in the next major JDK version

Those are pretty much guaranteed to be no worse, and likely better, than not using them. Other settings that might help are those around heap size and GC, but that's harder to analyze without knowledge of the runtime profile (and whether it's memory constrained).
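
For reference, an invocation would look something like this (server.jar is a hypothetical uberjar name; the heap sizes are placeholders, set equal to pin the heap and avoid resizing during the run):

$ java -server -XX:+AggressiveOpts -Xms4g -Xmx4g -jar server.jar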

Node cluster

Any reason why you don't run Node in a cluster to take advantage of the multiple CPUs?

Rust Suggestion

Hi!

Suggesting that in Rust you may wish to eliminate the atomic worker counter: it doesn't seem to be present in the other implementations.

A SeqCst operation forces a full memory ordering and could have a large performance impact.
