This project aims to quantify the overhead of creating threads in various WebAssembly environments.
- As a baseline, it spawns native OS threads (`native`)
- Then, we spawn threads in Wasmtime compiled using the `wasm32-wasi-threads` target in wasi-sdk
  (`wasi-threads`)
- We also spawn threads in NodeJS web workers compiled using Emscripten (`emscripten`); we assume
  that web worker performance in NodeJS is similar (identical?) to that of browser web workers
- For some additional data points, we measure the raw cost to start up a web worker in a browser
  (`manual`); this does not involve compiling any C code

The bottom line: web worker thread-spawning is 2-3 orders of magnitude slower than native or wasi-threads spawning. See the results for more details.
It was not always clear to me that this infrastructure recorded the correct measurements (see, e.g.,
emscripten#19788). The timing code in `benchmark-now.h` is not designed to be particularly precise,
but it must have microsecond precision. To check this, run:
$ make calibrate
This runs the `calibrate` driver over the compiled `benchmark-calibration.c` code. One can then
visually inspect that the passed-in value (`> run ... <n>`), the recorded value (`<n>` printed to
stdout), and the `time` output all roughly line up. `n` is microseconds. Do not worry too much about
the `n = 1` case (e.g., `time` is measuring a bunch of startup overhead); all that is important here
is that the printed `n` is some low integer, not `0` or `1000` (as recorded originally with
Emscripten).
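
As an illustration of the calibration idea, here is a minimal sketch: sleep for a known number of
microseconds and compare that with what the timer recorded. It is not the contents of
`benchmark-now.h` or `benchmark-calibration.c`; the names `now_us` and `calibrate_us` are
hypothetical, and using `clock_gettime(CLOCK_MONOTONIC, ...)` as the time source is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the project's timer: monotonic time in microseconds. */
static uint64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Sleep for roughly n microseconds and return the elapsed time the timer saw. */
static uint64_t calibrate_us(uint64_t n) {
    struct timespec req;
    req.tv_sec = (time_t)(n / 1000000u);
    req.tv_nsec = (long)(n % 1000000u) * 1000L;

    uint64_t start = now_us();
    nanosleep(&req, NULL);
    return now_us() - start;
}
```

If the timer really has microsecond resolution, `calibrate_us(n)` should come back close to `n` for
small `n`, rather than `0` or a value rounded to a multiple of `1000`.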
To gather some measurements for threads spawned sequentially, run:
$ taskset --cpu-list 0-12 make sequential
This runs the `sequential` driver over the compiled `benchmark-sequential.c` code. The idea here is
to spawn `n` threads one after the other, measuring the time it takes from `pthread_create` in the
main thread to the first instruction being executed in the child thread. This is done for different
`n`, attempting to collect at least 100 samples in the low `n` cases and, for the larger `n`, also
reporting the last 10 samples separately.
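
The measurement described above can be sketched roughly as follows. This is an illustration, not
the code in `benchmark-sequential.c`: the names `now_us`, `child`, and `spawn_sequential` are
hypothetical, and joining each thread before spawning the next is an assumption about what
"one after the other" means.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical timer: monotonic time in microseconds. */
static uint64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* The child records a timestamp as its very first action. */
static void *child(void *arg) {
    *(uint64_t *)arg = now_us();
    return NULL;
}

/* Spawn n threads one after another; out[i] receives the i-th spawn latency
 * in microseconds. Returns 0 on success, -1 if a thread could not be created. */
static int spawn_sequential(int n, uint64_t *out) {
    for (int i = 0; i < n; i++) {
        pthread_t thread;
        uint64_t started = 0;
        uint64_t before = now_us();
        if (pthread_create(&thread, NULL, child, &started) != 0) {
            return -1;
        }
        pthread_join(thread, NULL); /* assumption: wait before the next spawn */
        out[i] = started - before;
    }
    return 0;
}
```

Averaging all of `out` gives the per-`n` mean; averaging only the final 10 entries gives a
"last 10" figure that excludes the slower early spawns.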
To gather some measurements for threads spawned in parallel, run:
$ taskset --cpu-list 0-12 make parallel
This runs the `parallel` driver over the compiled `benchmark-parallel.c` code. The idea here is to
spawn `n` threads all around the same time, again measuring the time it takes from `pthread_create`
in the main thread to the first instruction being executed in the child thread. As with
`sequential`, this is done for different `n`, attempting to collect at least 100 samples in the low
`n` cases. Expect the Emscripten-Node version to fail at a certain point (in the results recorded
here, it ran out of memory at `n = 1000`).
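
A rough sketch of the parallel variant follows. It is an illustration rather than the code in
`benchmark-parallel.c`; `sample_t`, `now_us`, `child`, and `spawn_parallel` are hypothetical names.
The only difference from a sequential loop is that all threads are created before any of them is
joined.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical timer: monotonic time in microseconds. */
static uint64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* One measurement: time just before pthread_create, and time at child start. */
typedef struct {
    uint64_t before;
    uint64_t started;
} sample_t;

static void *child(void *arg) {
    ((sample_t *)arg)->started = now_us();
    return NULL;
}

/* Create n threads back to back, then join them all; out[i] receives the
 * i-th spawn latency in microseconds. Returns 0 on success, -1 on failure. */
static int spawn_parallel(int n, uint64_t *out) {
    pthread_t *threads = malloc((size_t)n * sizeof *threads);
    sample_t *samples = malloc((size_t)n * sizeof *samples);
    if (threads == NULL || samples == NULL) {
        free(threads);
        free(samples);
        return -1;
    }

    int created = 0;
    for (; created < n; created++) {
        samples[created].before = now_us();
        if (pthread_create(&threads[created], NULL, child, &samples[created]) != 0) {
            break; /* e.g., resource exhaustion at large n */
        }
    }
    for (int i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
        out[i] = samples[i].started - samples[i].before;
    }

    free(threads);
    free(samples);
    return created == n ? 0 : -1;
}
```

Because later threads queue up behind earlier ones that have not finished starting, the per-thread
latency in this shape can grow with `n` instead of shrinking.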
The `manual` approach attempts to measure something slightly different than the previous ones. The
Emscripten-compiled versions make use of web workers under the hood, but there are some confounding
variables here (e.g., worker caching? `postMessage` time?). To understand this better, the `manual`
directory is simpler: it spawns a web worker using JS and uses `performance.now()` to measure the
time for that worker to send a message back. (Though we would like to, we cannot measure at the
first worker instruction because each worker has its own 0-based time origin; we have to accept a
slight overhead with a `postMessage` back to the main thread.)
This directory runs some experiments in a browser; see the browser console for the output. It
essentially captures the `sequential` and `parallel` approaches but with slightly different
terminology: it uses the JS `async` and `await` concepts instead. No compilation is required; serve
the directory and open `manual.html` in a browser:
$ $EMSDK_DIR/upstream/emscripten/emrun --no_browser manual
# open http://localhost:6931
The measurements will vary with system load and this infrastructure's precision is rather rough, but
one can get a general overview of the trends regardless. This table summarizes the mean measurements
(rounded to microseconds) for various `n`, measured on my system with no special care taken to
isolate it for benchmarking:

| benchmark | 1 | 10 | 100 | 1000 | 100 (last 10) | 1000 (last 10) |
|---|---|---|---|---|---|---|
| native sequential | 110 | 56 | 16 | 10 | 7 | 7 |
| native parallel | 91 | 24 | 6 | 5 | n/a | n/a |
| wasi-threads sequential | 239 | 133 | 20 | 34 | 15 | 14 |
| wasi-threads parallel | 245 | 67 | 12 | 10 | n/a | n/a |
| emscripten sequential | 29817 | 3119 | 336 | 73 | 36 | 182 |
| emscripten parallel | 29570 | 37775 | 409055 | OOM* | n/a | n/a |
| manual sequential | 5238 | 2836 | 3188 | n/a | n/a | n/a |
| manual parallel | 9233 | 18661 | 122577 | n/a | n/a | n/a |

Note that this data is very noisy and prone to variation; you may see different results on your system. But I notice several trends:
- First, the first few threads spawned take longer. I added the two right-hand columns to
  eliminate the skew from the slower early spawns: these columns discard all data except the last
  10 spawns from `n` threads spawned sequentially. In general with sequential spawning, the more
  threads we spawn, the less time they take (except for `emscripten` with `n = 1000`?)
- Secondly, one might conclude that, eventually, wasi-threads are only about 2-3x slower to spawn
  than native ones.
- Glancing at `manual` and `emscripten`, we can see differences of 2-3 orders of magnitude versus
  `native` and `wasi-threads`. This trend is only bucked by `emscripten` sequential: I suspect that
  Emscripten here reuses the same web worker repeatedly to avoid the high cost of spawning a web
  worker thread. In scenarios where it cannot do this, the cost has to be at least the base cost
  (see `manual`) plus some Emscripten overhead.
- One might wonder: can we move the cost of spawning a web worker somewhere else? Emscripten has a
  `-sPTHREAD_POOL_SIZE` flag to spawn a pool of web workers so that later `pthread_create` calls
  are faster. But note that the high cost is paid somewhere (earlier in the application lifetime,
  sure) and that not all applications will fit neatly into this paradigm.

Thread spawning via web workers is much slower than native thread spawning, even with some WASI
overhead thrown in. At the lower limit, with Emscripten's overhead removed (i.e., `manual`), there's
roughly a 200x slowdown (at `n = 100`, 3188 / 16). With Emscripten's thread-caching tricks (and
only spawning the "right" number of threads; don't look at parallel!), there's roughly a 7x
slowdown (at `n = 1000`, 73 / 10). In the worst case, spawning 100 threads in parallel, there's
roughly a 68000x slowdown (at `n = 100`, 409055 / 6). All of these differences, picked rather
arbitrarily, are meant to illustrate the point: actually spawning new web worker threads can be
very slow.
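
For reference, the three ratios quoted above can be reproduced from the table values with a few
lines of C. The helper names `slowdown` and `print_slowdowns` are hypothetical; the inputs are the
mean microsecond figures from the results table.

```c
#include <stdio.h>

/* Ratio of a slower mean spawn time to a faster one (both in microseconds). */
static double slowdown(double slow_us, double fast_us) {
    return slow_us / fast_us;
}

/* Print the three comparisons discussed in the text. */
static void print_slowdowns(void) {
    /* manual sequential vs. native sequential, n = 100 */
    printf("lower limit:  %.1fx\n", slowdown(3188.0, 16.0));
    /* emscripten sequential vs. native sequential, n = 1000 */
    printf("with caching: %.1fx\n", slowdown(73.0, 10.0));
    /* emscripten parallel vs. native parallel, n = 100 */
    printf("worst case:   %.1fx\n", slowdown(409055.0, 6.0));
}
```

The exact values are about 199, 7.3, and 68176, which the text rounds to 200x, 7x, and 68000x.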