This is an interesting observation that I haven't made before. I can reproduce this.
Let's first rule out the shell as a possible source for this, which you also did by using `--shell=none`. Note that this will have an effect on which command we run, because `echo` and `sleep` are both shell builtins. So when running `hyperfine "echo test"`, you are benchmarking the shell builtin. And when running `hyperfine -N "echo test"`, you are benchmarking `/usr/bin/echo`, which is a proper executable. The former is going to be much faster, since we don't have to spawn a new process (note that hyperfine subtracts the time to spawn the shell). We could also run `hyperfine "/usr/bin/echo test"`, which should be roughly equivalent to `hyperfine -N "echo test"`.
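To get a feel for the fork/exec cost that a builtin avoids, here is a minimal sketch (independent of hyperfine) that times repeated spawns of the real `echo` executable; the run count and output formatting are my own choices:

```python
import shutil
import subprocess
import time

# Locate the real echo binary; fall back to the usual path.
echo = shutil.which("echo") or "/usr/bin/echo"

# Time N spawns of the standalone executable. Each run pays the full
# process-creation cost that a shell builtin would skip entirely.
N = 200
start = time.perf_counter()
for _ in range(N):
    subprocess.run([echo, "test"], stdout=subprocess.DEVNULL)
elapsed = time.perf_counter() - start
print(f"mean spawn+run time: {elapsed / N * 1e6:.1f} µs")
```

The absolute numbers will differ from hyperfine's (this loop has more Python overhead), but the order of magnitude of the spawn cost should be visible.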
So we are going to compare launching `/usr/bin/echo test` with and without an intermediate `--prepare` command. We can clearly see a difference, even more pronounced than in your results:
```
▶ hyperfine -N --time-unit microsecond --warmup 1000 --runs 2000 "echo test"
Benchmark 1: echo test
  Time (mean ± σ):     578.5 µs ± 49.1 µs    [User: 461.3 µs, System: 66.9 µs]
  Range (min … max):   491.5 µs … 953.0 µs    2000 runs
```
```
▶ hyperfine -N --time-unit microsecond --warmup 1000 --runs 2000 "echo test" --prepare "sleep 0.001"
Benchmark 1: echo test
  Time (mean ± σ):     2949.0 µs ± 400.6 µs    [User: 1097.9 µs, System: 1516.1 µs]
  Range (min … max):   590.0 µs … 3670.7 µs    2000 runs
```
Interestingly, if I use `echo foo` as the `--prepare` command, I do not see this effect:
```
▶ hyperfine -N --time-unit microsecond --warmup 1000 --runs 2000 "echo test" --prepare "echo foo"
Benchmark 1: echo test
  Time (mean ± σ):     585.6 µs ± 54.9 µs    [User: 472.3 µs, System: 63.1 µs]
  Range (min … max):   495.8 µs … 924.7 µs    2000 runs
```
This seems to be a real effect which is not related to hyperfine.
Another thing I did was to repeat the run with `--prepare "sleep 0.001"`, record 10k runs, export the results as JSON, and plot a histogram:
We can see that the main peak is around 3 ms, but there is a clear second peak at the low (~500 µs) runtime that we saw in the benchmark without `--prepare`. So even when running the prepare command, we sometimes see the fast runtime.
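For anyone who wants to reproduce this: hyperfine's `--export-json` output stores the per-run wall-clock times (in seconds) under `results[].times`. A minimal stdlib-only sketch that turns such an export into a coarse text histogram (the filename `results.json` and the bin count are my own choices):

```python
import json
import os

def histogram(times, bins=20, width=50):
    """Print a coarse text histogram of per-run times (in seconds)."""
    lo, hi = min(times), max(times)
    counts = [0] * bins
    for t in times:
        # Map each sample into one of `bins` equal-width buckets.
        idx = min(int((t - lo) / ((hi - lo) or 1) * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    for i, n in enumerate(counts):
        left_us = (lo + (hi - lo) * i / bins) * 1e6
        print(f"{left_us:9.1f} µs | {'#' * (width * n // peak)}")
    return counts

# Load a hyperfine JSON export if one is present.
if os.path.exists("results.json"):
    with open("results.json") as f:
        histogram(json.load(f)["results"][0]["times"])
```

A bimodal distribution shows up as two separate clusters of `#` bars, matching the two peaks described above.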
I don't know the real reason for this, but here are two wild guesses:
- When using `sleep` as a `--prepare` command, we essentially tell the OS that it can use the current CPU core for scheduling another process. On the other hand, when running `echo test` in fast succession, there might be some OS-internal mechanism that pins hyperfine/echo on a single core.
- There might be some (OS-internal) caching effects that make running the same process over and over again fast.
Some evidence for hypothesis 1 comes from the following experiment, where I pin `hyperfine` to a certain core, which speeds up the benchmarked command:
```
▶ taskset -c 3 hyperfine -N --time-unit microsecond --warmup 1000 --runs 2000 "echo test" --prepare "sleep 0.001"
Benchmark 1: echo test
  Time (mean ± σ):     738.5 µs ± 332.4 µs    [User: 555.6 µs, System: 85.6 µs]
  Range (min … max):   510.2 µs … 3666.1 µs    2000 runs
```
---
I did not know that shells use builtin commands instead of the actual binary, but when I think about it, it is kind of obvious that they would do this for performance reasons. Is there a more convenient way to check whether a builtin is used, other than checking if a new process is created?
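One convenient check: the shell's own `type` builtin (or the POSIX `command -V`) reports how a name resolves, without running it. A small sketch that asks a shell from Python (the helper name is mine):

```python
import subprocess

def resolves_to(shell, name):
    """Ask a shell how it would resolve `name` (builtin, alias, function, or file)."""
    out = subprocess.run([shell, "-c", f"type {name}"],
                         capture_output=True, text=True)
    return out.stdout.strip()

print(resolves_to("sh", "echo"))  # e.g. "echo is a shell builtin"
print(resolves_to("sh", "ls"))    # e.g. "ls is /usr/bin/ls"
```

Note that the answer depends on the shell you ask: a name can be a builtin in bash but not in another shell, so query the same shell hyperfine would use.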
Hypothesis 1:
If that is the case, a spinlock should not show the same behavior, because it never signals the OS that it is idling.
To test this, I created a simple busy loop with the `asm!` macro to prevent compiler optimizations:
```rust
use std::arch::asm;

const COUNT: usize = 100_000_000;

fn main() {
    // Busy-wait: decrement a counter COUNT times without ever yielding
    // to the OS. The inline asm keeps the compiler from removing the loop.
    unsafe {
        asm!(
            "mov rax, {0:r}",
            "2:",
            "dec rax",
            "jnz 2b",
            in(reg) COUNT,
            out("rax") _, // rax is clobbered by the loop
        );
    }
}
```
Benchmarking this with hyperfine reports a runtime of around 23 ms on my system, and htop shows 100% usage of one CPU core (100 million iterations in 23 ms is roughly one decrement per cycle at ~4.3 GHz, so the loop really is spinning).
Using this spinlock as the `--prepare` command, the effect is still present in the echo benchmark, even with pinning:
```
❯ taskset -c 3 hyperfine -N '/usr/bin/echo test' --warmup 100 --runs 1000
Benchmark 1: /usr/bin/echo test
  Time (mean ± σ):     386.2 µs ± 57.4 µs    [User: 297.7 µs, System: 27.6 µs]
  Range (min … max):   352.7 µs … 931.4 µs    1000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
```
```
❯ taskset -c 3 hyperfine -N '/usr/bin/echo test' --warmup 100 --runs 1000 --prepare target/release/spinlock
Benchmark 1: /usr/bin/echo test
  Time (mean ± σ):     413.2 µs ± 32.8 µs    [User: 318.0 µs, System: 33.8 µs]
  Range (min … max):   358.7 µs … 672.3 µs    1000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
```
However, in the `head`/`/dev/urandom` benchmark it is no longer present:
```
❯ taskset -c 3 hyperfine -N '/usr/bin/head -c 3000000 /dev/urandom' --warmup 100 --runs 1000
Benchmark 1: /usr/bin/head -c 3000000 /dev/urandom
  Time (mean ± σ):     6.4 ms ± 0.3 ms    [User: 0.2 ms, System: 6.2 ms]
  Range (min … max):   6.2 ms … 7.8 ms    1000 runs
```
```
❯ taskset -c 3 hyperfine -N '/usr/bin/head -c 3000000 /dev/urandom' --warmup 100 --runs 1000 --prepare target/release/spinlock
Benchmark 1: /usr/bin/head -c 3000000 /dev/urandom
  Time (mean ± σ):     6.4 ms ± 0.2 ms    [User: 0.2 ms, System: 6.1 ms]
  Range (min … max):   6.2 ms … 7.8 ms    1000 runs
```
Afaik there should be no caching possible with `/dev/urandom`, as it should always produce new random output? Maybe it can cache opening the "file" or something. So the fact that this effect is still present in the echo benchmark could be related to caching effects, i.e. your second hypothesis.
Therefore I think the first hypothesis may well be correct.
Hypothesis 2:
We can use the same command for `--prepare` and for benchmarking. This way, the same process is always executed, and caching should not be affected:
```
❯ taskset -c 3 hyperfine -N 'echo test' --warmup 1000 --runs 10000
Benchmark 1: echo test
  Time (mean ± σ):     589.2 µs ± 87.3 µs    [User: 447.1 µs, System: 76.8 µs]
  Range (min … max):   519.1 µs … 1510.2 µs    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
```
```
❯ taskset -c 3 hyperfine -N 'echo test' --warmup 1000 --runs 10000 --prepare 'echo test'
Benchmark 1: echo test
  Time (mean ± σ):     582.3 µs ± 87.6 µs    [User: 444.7 µs, System: 73.2 µs]
  Range (min … max):   517.9 µs … 1604.4 µs    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
```
And now the two results actually look equal. So maybe both of your hypotheses are correct.
Confirming that it is not tool specific:
It is probably a good idea to try the same benchmark with a different tool and see whether it behaves the same. For that, we would need to find a tool that offers these capabilities; I can't think of one off the top of my head.
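Lacking such a tool, a throwaway harness can serve as a cross-check. This sketch (my own, not part of hyperfine) times spawns of the `echo` binary with and without a sleeping prepare step in between; if the slowdown also shows up here, it is not hyperfine-specific:

```python
import shutil
import statistics
import subprocess
import time

echo = shutil.which("echo") or "/usr/bin/echo"

def bench(runs=300, prepare=None):
    """Time spawns of `echo test`, optionally running a prepare step before each."""
    samples = []
    for _ in range(runs):
        if prepare:
            prepare()
        t0 = time.perf_counter()
        subprocess.run([echo, "test"], stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples) * 1e6  # mean in µs

print(f"no prepare:    {bench():8.1f} µs")
print(f"sleep prepare: {bench(prepare=lambda: time.sleep(0.001)):8.1f} µs")
```

The absolute times include Python overhead, so only the relative difference between the two lines is meaningful.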
Adding a warning / notice about this behavior
I think it would be a good idea to document this behavior somewhere to make people aware of it, even though it might not be specific to this tool but rather to the system architecture.
Possible solutions
Maybe it is possible to gain exclusive access to a single core? I am not sure whether Linux supports this. It would prevent the scheduler from running other processes on that core while the benchmarked process is waiting for I/O or otherwise idling.
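For reference, Linux does offer the building blocks: `sched_setaffinity(2)` pins a process to chosen cores (this is what `taskset` uses), and truly exclusive cores can be set up with the `isolcpus=` boot parameter or cpusets. A minimal Linux-only sketch of self-pinning from inside a process:

```python
import os

# Restrict the current process (pid 0 = self) to CPU 0, the same
# mechanism `taskset -c 0` uses. Linux-only API.
os.sched_setaffinity(0, {0})
print("allowed CPUs:", os.sched_getaffinity(0))
```

Note the limitation: affinity only constrains the pinned process itself; the scheduler may still place other processes on that core unless it is isolated system-wide (e.g. via `isolcpus` or cpusets).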