
tasty-bench

Featherlight benchmark framework (only one file!) for performance measurement with API mimicking criterion and gauge. A prominent feature is built-in comparison against previous runs and between benchmarks.

How lightweight is it?

There is only one source file Test.Tasty.Bench and no non-boot dependencies except tasty. So if you already depend on tasty for a test suite, there is nothing else to install.

Compare this to criterion (10+ modules, 50+ dependencies) and gauge (40+ modules, depends on basement and vector). A build on a clean machine is up to 16x faster than criterion and up to 4x faster than gauge. A build without dependencies is up to 6x faster than criterion and up to 8x faster than gauge.

tasty-bench is a native Haskell library and works everywhere GHC does, including WASM. We support a full range of architectures (i386, amd64, armhf, arm64, ppc64le, s390x) and operating systems (Linux, Windows, macOS, FreeBSD, OpenBSD, NetBSD), plus any GHC from 7.0 to 9.10.

How is it possible?

Our benchmarks are literally regular tasty tests, so we can leverage all existing machinery for command-line options, resource management, structuring, listing and filtering benchmarks, running and reporting results. It also means that tasty-bench can be used in conjunction with other tasty ingredients.

Unlike criterion and gauge we use a very simple statistical model, described below. This is arguably a questionable choice, but it works pretty well in practice. Few developers are sufficiently well-versed in probability theory to make sense of, let alone use, all the numbers generated by criterion.

How to switch?

Cabal mixins allow you to taste tasty-bench instead of criterion or gauge without changing a single line of code:

cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion, Test.Tasty.Bench as Criterion.Main, Test.Tasty.Bench as Gauge, Test.Tasty.Bench as Gauge.Main)

This works vice versa as well: if you use tasty-bench, but at some point need a more comprehensive statistical analysis, it is easy to switch temporarily back to criterion.

How to write a benchmark?

Benchmarks are declared in a separate section of the cabal file:

cabal-version:   2.0
name:            bench-fibo
version:         0.0
build-type:      Simple
synopsis:        Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
  ghc-options:   "-with-rtsopts=-A32m"
  if impl(ghc >= 8.6)
    ghc-options: -fproc-alignment=64

And here is BenchFibo.hs:

import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "Fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]

Since tasty-bench provides an API compatible with criterion, one can refer to its documentation for more examples.
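
Besides nf, the criterion-compatible combinators whnf, nfIO and whnfIO are available too. A minimal sketch (the benchmarked expressions and the file path are placeholders, not recommendations):

import Data.List (sort)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ bench "sum to WHNF"  $ whnf sum  ([1 .. 1000]      :: [Int])
  , bench "sort to NF"   $ nf   sort ([1000, 999 .. 1] :: [Int])
    -- the file path below is only a placeholder
  , bench "read a file"  $ nfIO (readFile "bench-fibo.cabal")
  ]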

How to read results?

Running the example above (cabal bench or stack bench) results in the following output:

All
  Fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ±  73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)

The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall-clock time), its predicted mean CPU time was 63 nanoseconds, and the means of individual samples rarely diverge from it by more than ±3.4 nanoseconds (twice the standard deviation). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.

Wall-clock time vs. CPU time

What time are we talking about? Both criterion and gauge by default report wall-clock time, which is affected by any other application running concurrently. Ideally benchmarks are executed on a dedicated server without any other load, but — let's face the truth — most developers run benchmarks on a laptop with a hundred other services and a window manager, and watch videos while waiting for benchmarks to finish. That's the cause of the notorious "variance introduced by outliers: 88% (severely inflated)" warning.

To alleviate this issue tasty-bench measures CPU time by getCPUTime instead of wall-clock time by default. It does not provide a perfect isolation from other processes (e. g., if CPU cache is spoiled by others, populating data back from RAM is your burden), but is a bit more stable.

Caveat: this means that for multithreaded algorithms tasty-bench reports total elapsed CPU time across all cores, while criterion and gauge print the maximum of per-core wall-clock times. It also means that by default tasty-bench does not measure time spent out of process, e. g., calls to other executables. To work around this limitation use the --time-mode command-line option or set it locally via the TimeMode option.
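
For example, assuming your version of tasty-bench exports the TimeMode option with a WallTime constructor (true for recent releases; otherwise stick to the --time-mode flag), a sketch of switching a subtree to wall-clock measurements could look like this:

import Test.Tasty (localOption)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- measure this subtree by wall-clock time instead of CPU time;
    -- the benchmarked IO action is only a placeholder
    localOption WallTime $
      bench "out-of-process work" $ whnfIO (readFile "/dev/null")
  ]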

Statistical model

Here is a procedure used by tasty-bench to measure execution time:

  1. Set $n \leftarrow 1$.
  2. Measure execution time $t_n$ of $n$ iterations and execution time $t_{2n}$ of $2n$ iterations.
  3. Find $t$ which minimizes deviation of $(nt,2nt)$ from $(t_n,t_{2n})$, namely $t \leftarrow (t_n + 2t_{2n}) / 5n$.
  4. If deviation is small enough (see --stdev below) or time is running out soon (see --timeout below), return $t$ as a mean execution time.
  5. Otherwise set $n \leftarrow 2n$ and jump back to Step 2.
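
For reference, Step 3 is an ordinary least-squares fit of the single parameter $t$: setting the derivative of the squared deviation to zero yields the closed form used above.

$$\frac{d}{dt}\Bigl[(nt - t_n)^2 + (2nt - t_{2n})^2\Bigr] = 2n(nt - t_n) + 4n(2nt - t_{2n}) = 0 \;\Longrightarrow\; t = \frac{t_n + 2t_{2n}}{5n}$$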

This is roughly similar to the linear regression approach which criterion takes, but we fit only the last two points. This allows us to simplify away all heavy-weight statistical analysis. More importantly, earlier measurements, which are presumably shorter and noisier, do not affect the overall result. This is in contrast to criterion, which fits all measurements and is biased to use more data points corresponding to shorter runs (it employs an $n \leftarrow 1.05n$ progression).

Mean time and its deviation do not say much about the distribution of individual timings. E. g., imagine a computation which (according to a coarse system timer) takes either 0 ms or 1 ms with equal probability. While one would be able to establish that its mean time is 0.5 ms with a very small deviation, this does not imply that individual measurements are anywhere near 0.5 ms. Even assuming an infinite precision of a system timer, the distribution of individual times is not known to be normal.

Obligatory disclaimer: statistics is a tricky matter; there is no one-size-fits-all approach. In the absence of a good theory, simplistic approaches are as (un)sound as obscure ones. Those who seek statistical soundness should rather collect raw data and process it themselves using a proper statistical toolbox. Data reported by tasty-bench is only of indicative and comparative significance.

Memory usage

Configuring RTS to collect GC statistics (e. g., via cabal bench --benchmark-options '+RTS -T' or stack bench --ba '+RTS -T') enables tasty-bench to estimate and report memory usage:

All
  Fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns, 223 B  allocated,   0 B  copied, 2.0 MB peak memory
    tenth:     OK (1.71s)
      809 ns ±  73 ns, 2.3 KB allocated,   0 B  copied, 4.0 MB peak memory
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs, 277 KB allocated,  59 B  copied, 5.0 MB peak memory

All 3 tests passed (7.25s)

This data is reported as per RTSStats fields: allocated_bytes, copied_bytes and max_mem_in_use_bytes.
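
If you want GC statistics collected on every run without passing RTS options by hand, one possible approach (this is a generic GHC RTS facility, nothing specific to tasty-bench) is to bake -T into the benchmark executable next to other RTS options:

benchmark bench-fibo
  ...
  ghc-options:   "-with-rtsopts=-A32m -T"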

Combining tests and benchmarks

When optimizing an existing function, it is important to check that its observable behavior remains unchanged. One can rebuild both tests and benchmarks after each change, but it is more convenient to run sanity checks within the benchmark suite itself. Since our benchmarks are compatible with tasty tests, we can easily do so.

Imagine you come up with a faster function myFibo to generate Fibonacci numbers:

import Test.Tasty.Bench
import Test.Tasty.QuickCheck -- from tasty-quickcheck package

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

myFibo :: Int -> Integer
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)

main :: IO ()
main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain
  [ bench "fibo   20" $ nf fibo   20
  , bench "myFibo 20" $ nf myFibo 20
  , testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n
  ]

This outputs:

All
  fibo   20:     OK (3.02s)
    104 μs ± 4.9 μs
  myFibo 20:     OK (1.99s)
     71 μs ± 5.3 μs
  myFibo = fibo: FAIL
    *** Failed! Falsified (after 5 tests and 1 shrink):
    2
    1 /= 2
    Use --quickcheck-replay=927711 to reproduce.

1 out of 3 tests failed (5.03s)

We see that myFibo is indeed significantly faster than fibo, but unfortunately does not do the same thing. One should probably look for another way to speed up generation of Fibonacci numbers.

Troubleshooting

  • If benchmarks take too long, set --timeout to limit execution time of individual benchmarks, and tasty-bench will do its best to fit into a given time frame. Without --timeout we rerun benchmarks until achieving a target precision set by --stdev, which in a noisy environment of a modern laptop with GUI may take a lot of time.

    While criterion runs each benchmark for at least 5 seconds, tasty-bench is happy to conclude earlier if it does not compromise the quality of results. In our experiments tasty-bench suites tend to finish earlier, even if some individual benchmarks take longer than with criterion.

    A common source of noisiness is garbage collection. Setting a larger allocation area (nursery) is often a good idea, either via cabal bench --benchmark-options '+RTS -A32m' or stack bench --ba '+RTS -A32m'. Alternatively, bake it into the cabal file as ghc-options: "-with-rtsopts=-A32m".

  • Never compile benchmarks with -fstatic-argument-transformation, because it breaks a trick we use to force GHC into reevaluation of the same function application over and over again.

  • If benchmark results look malformed, as below, make sure that you are invoking Test.Tasty.Bench.defaultMain and not Test.Tasty.defaultMain (the difference is consoleBenchReporter vs. consoleTestReporter):

    All
      fibo 20:       OK (1.46s)
        Response {respEstimate = Estimate {estMean = Measurement {measTime = 87496728, measAllocs = 0, measCopied = 0}, estStdev = 694487}, respIfSlower = FailIfSlower Infinity, respIfFaster = FailIfFaster Infinity}
    
  • If benchmarks fail with an error message

    Unhandled resource. Probably a bug in the runner you're using.
    

    or

    Unexpected state of the resource (NotCreated) in getResource. Report as a tasty bug.
    

    this is likely caused by env or envWithCleanup affecting the benchmark structure. You can use env to read test data from IO, but not to read benchmark names or affect their hierarchy in any other way. This is a fundamental restriction of tasty, needed to list and filter benchmarks without launching missiles.

    Strict pattern-matching on the resource is also prohibited. For instance, if it is a tuple, the second argument of env should use a lazy pattern match ~(a, b) -> ... (see the sketch after this list).

  • If benchmarks fail with Test dependencies form a loop or Test dependencies have cycles, this is likely because of bcompare, which compares a benchmark with itself. Locating a benchmark in a global environment may be tricky, please refer to tasty documentation for details and consider using locateBenchmark.

  • When seeing

    This benchmark takes more than 100 seconds. Consider setting --timeout, if this is unexpected (or to silence this warning).
    

    do follow the advice: abort benchmarks and pass -t100 or similar. Unless you are benchmarking a very computationally expensive function, a single benchmark should stabilize after a couple of seconds. This warning is a sign that your environment is too noisy, in which case tasty-bench will continue trying with exponentially longer intervals, often unproductively.

  • The following error can be thrown when benchmarks are built with ghc-options: -threaded:

    Benchmarks must not be run concurrently. Please pass -j1 and/or avoid +RTS -N.
    

    The underlying cause is that tasty runs tests concurrently, which is harmful for reliable performance measurements. Make sure to use tasty-bench >= 0.3.4 and invoke Test.Tasty.Bench.defaultMain and not Test.Tasty.defaultMain. Note that localOption (NumThreads 1) quashes the warning, but does not eliminate the cause.

  • If benchmarks using GHC 9.4.4+ segfault on Windows, check that you are not using the non-moving garbage collector (--nonmoving-gc). This is likely caused by a GHC issue. Previous releases of tasty-bench recommended enabling --nonmoving-gc to stabilise benchmarks, but this is now discouraged.

  • If you see

    <stdout>: commitBuffer: invalid argument (cannot encode character '\177')
    

    it means that your locale does not support UTF-8. tasty-bench makes an effort to force the locale to UTF-8, but sometimes, when benchmarks are a part of a larger application, it's impossible to do so. In such a case run locale -a to list available locales and set a UTF-8-capable one (e. g., export LANG=C.UTF-8) before starting benchmarks.
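
Following up on the env restrictions above, here is a minimal sketch of a shape that satisfies them: benchmark names are static, the resource is consumed only inside Benchmarkable, and the tuple is matched lazily (the generated lists are placeholders):

import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ env (evaluate (force ([1 .. 1000000], [1000000, 999999 .. 1]))) $
      -- note the lazy pattern: strict matching on the resource is prohibited
      \ ~(ascending, descending) ->
        bgroup "minimum"
          [ bench "ascending input"  $ nf minimum (ascending  :: [Int])
          , bench "descending input" $ nf minimum (descending :: [Int])
          ]
  ]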

Isolating interfering benchmarks

One difficulty of benchmarking in Haskell is that it is hard to isolate benchmarks so that they do not interfere. Changing the order of benchmarks or skipping some of them has an effect on the heap's layout and thus affects garbage collection. This issue is well attested in both criterion and gauge.

Usually (but not always) skipping some benchmarks speeds up the remaining ones. That's because once a benchmark has allocated heap which for some reason was not promptly released afterwards (e. g., it forced a top-level thunk in an underlying library), all further benchmarks are slowed down by the garbage collector processing this additional amount of live data over and over again.

There are several mitigation strategies. First of all, giving the garbage collector more breathing space with +RTS -A32m (or more) is often good enough.

Further, avoid using top-level bindings to store large test data. Once such thunks are forced, they remain allocated forever, which detrimentally affects subsequent, unrelated benchmarks. Treat them as external data, supplied via env: instead of

largeData :: String
largeData = replicate 1000000 'a'

main :: IO ()
main = defaultMain
  [ bench "large" $ nf length largeData, ... ]

use

import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = defaultMain
  [ env (evaluate (force (replicate 1000000 'a'))) $ \largeData ->
    bench "large" $ nf length largeData, ... ]

Finally, as an ultimate measure to reduce interference between benchmarks, one can run each of them in a separate process. We do not quite recommend this approach, but if you are desperate, here is how:

cabal run -v0 all:benches -- -l | sed -e 's/[\"]/\\\\\\&/g' | while read -r name; do cabal run -v0 all:benches -- -p '$0 == "'"$name"'"'; done

This assumes that there is a single benchmark suite in the project and that benchmark names do not contain newlines.

Comparison against baseline

One can compare benchmark results against an earlier run in an automatic way.

When using this feature, it's especially important to compile benchmarks with ghc-options: -fproc-alignment=64, otherwise results could be skewed by intermittent changes in cache-line alignment.

First, run tasty-bench with the --csv FILE option to dump results to FILE in CSV format (it could be a good idea to set a smaller --stdev, if possible):

Name,Mean (ps),2*Stdev (ps)
All.Fibonacci numbers.fifth,48453,4060
All.Fibonacci numbers.tenth,637152,46744
All.Fibonacci numbers.twentieth,81369531,3342646

Now modify the implementation and rerun the benchmarks with the --baseline FILE option. This produces a report as follows:

All
  Fibonacci numbers
    fifth:     OK (0.44s)
       53 ns ± 2.7 ns,  8% more than baseline
    tenth:     OK (0.33s)
      641 ns ±  59 ns,       same as baseline
    twentieth: OK (0.36s)
       77 μs ± 6.4 μs,  5% less than baseline

All 3 tests passed (1.50s)

You can also fail benchmarks that deviate too far from the baseline, using the --fail-if-slower and --fail-if-faster options. For example, setting both of them to 6 will fail the first benchmark above (because it is more than 6% slower), but the last one still succeeds (even though it is measurably faster than the baseline, the deviation is less than 6%). Consider also using --hide-successes to show only problematic benchmarks, or even the tasty-rerun package to focus on rerunning failing items only.
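
For instance, a CI-style invocation combining these flags could look like this (a sketch; the file name is a placeholder):

cabal bench --benchmark-options \
  '--baseline baseline.csv --fail-if-slower 6 --fail-if-faster 6 --hide-successes'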

If you wish to compare two CSV reports non-interactively, here is a handy awk incantation:

awk 'BEGIN{FS=",";OFS=",";print "Name,Old,New,Ratio"}FNR==1{trueNF=NF;next}NF<trueNF{print "Benchmark names should not contain newlines";exit 1}FNR==NR{oldTime=$(NF-trueNF+2);NF-=trueNF-1;a[$0]=oldTime;next}{newTime=$(NF-trueNF+2);NF-=trueNF-1;print $0,a[$0],newTime,newTime/a[$0];gs+=log(newTime/a[$0]);gc++}END{if(gc>0)print "Geometric mean,,",exp(gs/gc)}' old.csv new.csv

A larger shell snippet to compare two git commits can be found in compare_benches.sh.

Note that the columns in the CSV report are different from what criterion or gauge would produce. If names do not contain commas, missing columns can be faked this way:

awk 'BEGIN{FS=",";OFS=",";print "Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);stddev=$(NF-trueNF+3);NF-=trueNF-1;print $0,mean/1e12,mean/1e12,mean/1e12,stddev/2e12,stddev/2e12,stddev/2e12}'

To fake gauge in --csvraw mode use

awk 'BEGIN{FS=",";OFS=",";print "name,iters,time,cycles,cpuTime,utime,stime,maxrss,minflt,majflt,nvcsw,nivcsw,allocated,numGcs,bytesCopied,mutatorWallSeconds,mutatorCpuSeconds,gcWallSeconds,gcCpuSeconds"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);fourth=$(NF-trueNF+4);fifth=$(NF-trueNF+5);sixth=$(NF-trueNF+6);NF-=trueNF-1;print $0,1,mean/1e12,0,mean/1e12,mean/1e12,0,sixth+0,0,0,0,0,fourth+0,0,fifth+0,0,0,0,0}'

Comparison between benchmarks

You can also compare benchmarks to each other without any external tools, all in the comfort of your terminal.

import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "Fibonacci numbers"
    [ bcompare "tenth"  $ bench "fifth"     $ nf fibo  5
    ,                     bench "tenth"     $ nf fibo 10
    , bcompare "tenth"  $ bench "twentieth" $ nf fibo 20
    ]
  ]

This produces a report, comparing mean times of fifth and twentieth to tenth:

All
  Fibonacci numbers
    fifth:     OK (16.56s)
      121 ns ± 2.6 ns, 0.08x
    tenth:     OK (6.84s)
      1.6 μs ±  31 ns
    twentieth: OK (6.96s)
      203 μs ± 4.1 μs, 128.36x

To locate a baseline benchmark in a larger suite use locateBenchmark.

One can leverage comparisons between benchmarks to implement portable performance tests, expressing properties like "this algorithm must be at least twice as fast as that one" or "this operation should not be more than three times slower than that one". This can be achieved with bcompareWithin, which takes an acceptable interval of performance as an argument.
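
For example, assuming the bounds passed to bcompareWithin constrain the ratio of this benchmark's mean time to the referenced one, "at least twice as fast" can be sketched as keeping that ratio below 0.5:

import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "Fibonacci numbers"
    [ bench "tenth" $ nf fibo 10
      -- fail unless "fifth" is at least twice as fast as "tenth"
    , bcompareWithin 0 0.5 "tenth" $ bench "fifth" $ nf fibo 5
    ]
  ]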

Plotting results

Users can dump results into CSV with --csv FILE and plot them using gnuplot or other software. But for convenience there is also a built-in quick-and-dirty SVG plotting feature, which can be invoked by passing --svg FILE. Here is a sample of its output:

[sample SVG plot omitted]

Build flags

Build flags are a brittle subject and users do not normally need to touch them.

  • If you find yourself in an environment where tasty is not available and you have access to boot packages only, you can still use tasty-bench! Just copy Test/Tasty/Bench.hs to your project (think of it as a header-only C library). It will provide you with functions to build Benchmarkable and run them manually via measureCpuTime. This mode of operation can also be configured by disabling the Cabal flag tasty.

Command-line options

Use --help to list all command-line options.

  • -p, --pattern

    This is a standard tasty option, which allows filtering benchmarks by a pattern or awk expression. Please refer to tasty documentation for details.

  • -t, --timeout

    This is a standard tasty option, setting timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: tasty-bench will make an effort to report results (even if of subpar quality) before timeout. Setting timeout too tight (insufficient for at least three iterations) will result in a benchmark failure. One can adjust it locally for a group of benchmarks, e. g., localOption (mkTimeout 100000000) for 100 seconds.

  • --stdev

    Target relative standard deviation of measurements in percent (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise ones. It can also be adjusted locally for a group of benchmarks, e. g., localOption (RelStDev 0.02). If benchmarking takes far too long, consider setting --timeout, which will interrupt benchmarks, potentially before reaching the target deviation.

  • --csv

    File to write results in CSV format.

  • --baseline

    File to read baseline results in CSV format (as produced by --csv).

  • --fail-if-slower, --fail-if-faster

    Upper bounds of acceptable slowdown / speed-up in percent. If a benchmark is unacceptably slower / faster than the baseline (see --baseline), it will be reported as failed. Can be used in conjunction with the standard tasty option --hide-successes to show only problematic benchmarks. Both options can be adjusted locally for a group of benchmarks, e. g., localOption (FailIfSlower 0.10).

  • --svg

    File to plot results in SVG format.

  • --time-mode

    Whether to measure CPU time (cpu, default) or wall-clock time (wall).

  • +RTS -T

    Estimate and report memory usage.
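
Putting several of these together, a typical invocation could look like this (a sketch; adjust the file names to taste):

cabal bench --benchmark-options \
  '--stdev 2 --timeout 100 --csv results.csv --svg results.svg --time-mode wall +RTS -T'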

Custom command-line options

As usual with tasty, it is easy to extend benchmarks with custom command-line options. Here is an example:

import Data.Proxy
import Test.Tasty.Bench
import Test.Tasty.Ingredients.Basic
import Test.Tasty.Options
import Test.Tasty.Runners

newtype RandomSeed = RandomSeed Int

instance IsOption RandomSeed where
  defaultValue = RandomSeed 42
  parseValue = fmap RandomSeed . safeRead
  optionName = pure "seed"
  optionHelp = pure "Random seed used in benchmarks"

main :: IO ()
main = do
  let customOpts  = [Option (Proxy :: Proxy RandomSeed)]
      ingredients = includingOptions customOpts : benchIngredients
  opts <- parseOptions ingredients benchmarks
  let RandomSeed seed = lookupOption opts
  defaultMainWithIngredients ingredients benchmarks

benchmarks :: Benchmark
benchmarks = bgroup "All" []


tasty-bench's Issues

Some benchmarks written for `criterion` just complete immediately with `tasty-bench`.

All.Comparison.1000.find.vector-hashtables,6580352,647420
All.Comparison.1000.find.vector-hashtables (frozen),1469,38

the same benchmarks ran using criterion

Comparison/1000/find/vector-hashtables,7.08775428580093e-6,6.878445436163731e-6,7.573894628958587e-6,1.0076997615131766e-6,5.435319381788577e-7,1.8001436015276278e-6
Comparison/1000/find/vector-hashtables (frozen),3.011253725428155e-6,2.997528898847814e-6,3.034834992200024e-6,5.824131186108332e-8,4.132973586165513e-8,8.391675029687467e-8

https://github.com/klapaucius/vector-hashtables/blob/2dc4c0dc2d9471bec8f1d1b8edd161fcc78250fc/bench/Main.hs#L52

 +RTS --info
 [("GHC RTS", "YES")
 ,("GHC version", "8.10.7")
 ,("RTS way", "rts_v")
 ,("Build platform", "x86_64-unknown-mingw32")
 ,("Build architecture", "x86_64")
 ,("Build OS", "mingw32")
 ,("Build vendor", "unknown")
 ,("Host platform", "x86_64-unknown-mingw32")
 ,("Host architecture", "x86_64")
 ,("Host OS", "mingw32")
 ,("Host vendor", "unknown")
 ,("Target platform", "x86_64-unknown-mingw32")
 ,("Target architecture", "x86_64")
 ,("Target OS", "mingw32")
 ,("Target vendor", "unknown")
 ,("Word size", "64")
 ,("Compiler unregisterised", "NO")
 ,("Tables next to code", "YES")
 ,("Flag -with-rtsopts", "")
 ]

Can we get a probabilistic check of asymptotic complexity?

For example, consider the following output:

    1:                            OK (0.18s)
       10 ns ± 920 ps
    2:                            OK (0.21s)
       12 ns ± 714 ps
    4:                            OK (0.17s)
       19 ns ± 1.8 ns
    8:                            OK (0.16s)
       37 ns ± 2.7 ns
    16:                           OK (0.32s)
       74 ns ± 2.6 ns
    32:                           OK (0.17s)
      157 ns ±  14 ns
    64:                           OK (0.17s)
      311 ns ±  29 ns
    128:                          OK (0.17s)
      622 ns ±  53 ns
    256:                          OK (0.17s)
      1.3 μs ±  86 ns
    512:                          OK (0.17s)
      2.5 μs ± 203 ns
    1024:                         OK (0.17s)
      4.9 μs ± 352 ns
    2048:                         OK (0.17s)
      9.9 μs ± 800 ns
    4096:                         OK (0.17s)
       20 μs ± 1.9 μs
    8192:                         OK (0.17s)
       40 μs ± 2.8 μs
    16384:                        OK (0.17s)
       80 μs ± 5.6 μs
    32768:                        OK (0.17s)
      158 μs ±  12 μs
    65536:                        OK (0.17s)
      318 μs ±  31 μs

What happens here is the same function is measured with exponentially increasing input. It looks from the numbers as though its time complexity is linear.

Can we make a magical automated check out of these numbers?

No indication of progress with --csv

I'm currently running some benchmarks with --stdev 1 --csv and I'm wondering whether they will ever finish. Since there's no terminal output, I can't tell whether they are hanging at a particular benchmark.

Could terminal output be provided even when the --csv option is enabled, just like criterion and gauge handle it?

Print more digits?

Hi, is it possible to print more digits in the output?

For example, I see a benchmark allocates approximately 1.9 G; is it possible to show e.g. 1.867 G instead?

Thanks!

Run an IO action in the bench without timing it

I want to benchmark a search algorithm implementation. As a result, I'd like to generate random needles for each iteration of the bench. However, I don't want to also measure the PRNG. Is there a way to do this currently?

When running benchmarks against a baseline, it is unclear whether the baseline was compared against

The above line shows that when the change in runtime against the baseline is essentially 0, then no indication is given. The first output is a typical slowdown:

      41.7 ms ± 3.7 ms, 157 MB allocated,  19 MB copied,  14 MB peak memory, 97203242% slower than baseline

but the thing is that when running against a baseline, I want to see in words "unchanged against baseline". Not seeing that text makes it confusing as to whether the baseline was even tested against. Here is an example:

      41.7 ms ± 3.7 ms, 157 MB allocated,  19 MB copied,  14 MB peak memory

This is confusing to at least myself and my coworker. What do you think?

By the way - thanks for the awesome library!

Misleading verbalization of speed up / slow down

I was surprised that a

  1. "75% faster than baseline" (see https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8072#note_427599)
  2. was advertised as "4 times faster" (see haskell/core-libraries-committee#59 (comment))

75% faster means 1.75 times as fast, so which one is true? Turns out "4 times faster" is (sort of) true, but the verbalization of the speed-up is incorrect:

formatSlowDown :: Double -> String
formatSlowDown n = case m `compare` 0 of
    LT -> printf ", %2i%% faster than baseline" (-m)
    EQ -> ""
    GT -> printf ", %2i%% slower than baseline" m
  where
    m :: Int64
    m = truncate ((n - 1) * 100)

(In this code n is the fraction new duration divided by old duration.)

The verb "faster" is about comparison of velocities. So if run 5 km/h and you 10 km/h, you run
5 km/h faster than me. Relativising this to the baseline of my speed, you run 100% faster than me. Much easier to understand is the sentence you run 2 times as fast as me (where proper English would be "twice as fast").

This bug is another proof that people are just terrible when it comes to percentage calculations.

I suggest avoiding percentage calculations and also comparative adjectives (which require you to subtract the baseline). Rather, just print the factor. So, instead of

75% faster than baseline

print:

4.0 times as fast as the baseline

Elaborate the statistical model

Refer to this in the docs:

Here is a procedure used by tasty-bench to measure execution time:

  1. Set $n \leftarrow 1$.
  2. Measure execution time $t_n$ of $n$ iterations and execution time $t_{2n}$ of $2n$ iterations.
  3. Find $t$ which minimizes deviation of $(nt,2nt)$ from $(t_n,t_{2n})$.
  4. If deviation is small enough (see --stdev below), return $t$ as a mean execution time.
  5. Otherwise set $n \leftarrow 2n$ and jump back to Step 2.

I stumbled on "Find $t$ which minimizes deviation of $(nt,2nt)$ from $(t_n,t_{2n})$". Do you calculate the mean per iteration, i.e. $(t_n+t_{2n})/3n$, and then calculate the deviation of the two points from that mean? If that stddev is less than 5% then you stop, otherwise continue?

What happens if the deviation never comes below the threshold? Is --timeout the only bailout option? What is the default behavior? Does it continue forever?

Format of the CSV file

Wanted to discuss a few minor details about the CSV file format.

  1. Would it be better to not have spaces in the names of the columns?
  2. Instead of "Mean", "cpuTime" may be a more informative choice.
  3. Also, why not keep the units of time as seconds instead of ps?
  4. Instead of Copied, gcBytesCopied would be more informative.
  5. Can we store the whole series of measurements in the CSV file, instead of just one data point? Like in gauge, we can have an iterations column reporting the number of iterations and rest of the columns report raw data corresponding to those many iterations as usual. This will allow other tools to do any statistical analysis over the whole series of measurements.

Allow running benchmarks a given number of times.

With criterion based benchmarks I often found the -n option helpful to run them a fixed number of times without doing any analysis or the like.

This can be useful to produce a profile of a given benchmark, get a clearer idea of the cost of the benchmark vs the overhead from analyzing the results or just for heating your room.

So it would be nice if tasty-bench would also support this feature.

Benchmarking a memoized function

Hello,

I noticed that the benchmarking results for functions including internal memoization are surprising (or maybe they shouldn't be surprising; see haskell/criterion#85). In particular, the first function call takes long, as expected, and all other calls use the memoized value (I didn't expect this). The final result only reflects the time it takes to retrieve the memoized value, once it has been calculated.

Is it possible to benchmark a memoized function? That is, let the memoization work "within a benchmarking function call", but not "across benchmarking function calls".

EDIT: Is this maybe also important for normal functions that use some type of data that has to be initialized (calculated) and can be reused?

Not enough significant figures are shown

It appears that tasty-bench doesn't take into account the requested standard deviation when displaying results. For instance, if I pass --stdev 1 then I would expect that three significant figures are shown. However, this is currently not the case:

$ cabal run bench -- --stdev 1
All
  CUChar: OK (10.28s)
    1.3 ms ± 23 μs,   0 B  allocated,   0 B  copied

Add debug mode

I am using tasty-bench to benchmark a piece of code that forks a process. Not sure how it interacts with the benchmarking process. But I get figures that are wildly off. See below, the figures reported by "tasty-bench", "time" and "+RTS -s":

$ time -v cabal run Benchmark.System.Process -- --stdev 1000000 +RTS -s
...
Generating files...
Running benchmarks...
All
  processChunks tr: OK (3.25s)
    312 μs ± 765 μs, 129 KB allocated, 3.7 KB copied

All 1 tests passed (3.25s)
  13,403,301,168 bytes allocated in the heap
      24,131,192 bytes copied during GC
         828,192 bytes maximum residency (48 sample(s))
         101,600 bytes maximum slop
               4 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     12772 colls,     0 par    0.071s   0.078s     0.0000s    0.0004s
  Gen  1        48 colls,     0 par    0.026s   0.026s     0.0005s    0.0008s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    3.170s  (  3.213s elapsed)
  GC      time    0.098s  (  0.104s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    3.268s  (  3.318s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,227,761,766 bytes per MUT second

  Productivity  97.0% of total user, 96.9% of total elapsed

        Command being timed: "cabal run Benchmark.System.Process -- --stdev 1000000 +RTS -s"
        User time (seconds): 7.98
        System time (seconds): 0.53
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.49
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 320096
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 13
        Minor (reclaiming a frame) page faults: 153321
        Voluntary context switches: 1694
        Involuntary context switches: 67
        Swaps: 0
        File system inputs: 24
        File system outputs: 27752
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

There is some more code outside the benchmarks which is contributing to the "+RTS -s" and "time" figures, but even after accounting for that, the figures are wildly off.

The code to reproduce this is available here: https://github.com/composewell/streamly-process/tree/tasty-bench-issue .

Even gauge has the same issue. Maybe this is some fundamental issue with this type of code? Or am I doing something wrong in the measurements/code?

Think about benchmarking of linear and unlifted data

For benchmarking of linear functions there are two options. One approach is to make nf and friends multiplicity-polymorphic. This does not require any implementation changes, just tweaking of type signatures and CPP around it. Not convinced that it's worth it. The second option is more of a workaround: if you have a linear function f which you wish to benchmark, you can benchmark \x -> f x instead, which is no longer linear and thus does not need any special support. I'm currently leaning to say that the second option is good enough. Cf. haskell/criterion#263

More interesting is how to benchmark functions whose inputs and outputs are not in Type. I can very well imagine someone wishing to use Int# as an input and an unlifted type as an output, so that no time is wasted on pointer chasing and tag checking. While this is doable for every specific RuntimeRep, I have not found an ergonomic way to be levity-polymorphic here. And the type signature grows to be quite scary.

Report RTSStats.max_mem_in_use_bytes

It will be useful to know how much memory an application being benchmarked could be holding at peak. Ideally we want maxrss as seen by the OS. But I guess we can trust the GHC RTS stats; more importantly, it's easy to implement.

Spurious allocations reported with --csv option

From composewell/streamly#1042.

It seems when I use --csv option it shows more allocations and bytesCopied:

$ cabal run bench:Data.Fold -- --stdev 1000000 +RTS -T -K36K -M36M -RTS -p /elimination.product/ --csv=fold.csv
Up to date
All
  Data.Fold/o-1-space
    elimination
      product: OK
         70 μs ± 112 μs,  13 KB allocated, 5.9 KB copied

All 1 tests passed (0.01s)
$ cabal run bench:Data.Fold -- --stdev 1000000 +RTS -T -K36K -M36M -RTS -p /elimination.product/
Up to date
All
  Data.Fold/o-1-space
    elimination
      product: OK
         58 μs ±  88 μs,   0 B  allocated,   0 B  copied

All 1 tests passed (0.01s)

A flag to run each benchmark exactly *n* times

This might be a bit of an odd one, so feel free to reject it. I'm doing some very fine performance tuning, and it would be helpful if I could run benchmarks so that each benchmark is run a deterministic number of times. I'm actually less interested in timings here, but am very interested in total allocations over the entire benchmark run.

Font color might be indistinguishable from terminal foreground color

I'll just paste two screenshots. My default terminal theme:

vs a dark one:

I don't think this is a problem with the color scheme, as I don't remember other cli/tui applications having this issue. On the bright side, this is the weirdest bug I've come across in a while (I was puzzled for about half an hour about where the measurements went!), which gave me a good chuckle after understanding the problem.

Duplicate CLI options

The -j / --num-threads and --csv options appear twice in the options list:

Available options:
  -h,--help                Show this help text
  -p,--pattern PATTERN     Select only tests which satisfy a pattern or awk
                           expression
  -t,--timeout DURATION    Timeout for individual tests (suffixes: ms,s,m,h;
                           default: s)
  -l,--list-tests          Do not run the tests; just print their names
  -j,--num-threads NUMBER  Number of threads to use for tests execution
                           (default: # of cores/capabilities)
  --csv ARG                File to write results in CSV format. If specified,
                           suppresses console output
  -j,--num-threads NUMBER  Number of threads to use for tests execution
                           (default: # of cores/capabilities)
  -q,--quiet               Do not produce any output; indicate success only by
                           the exit code
  --hide-successes         Do not print tests that passed successfully
  --color never|always|auto
                           When to use colored output (default: auto)
  --ansi-tricks ARG        Enable various ANSI terminal tricks. Can be set to
                           'true' or 'false'. (default: true)
  --stdev ARG              Target relative standard deviation of measurements in
                           percents (5 by default). Large values correspond to
                           fast and loose benchmarks, and small ones to long and
                           precise. If it takes far too long, consider setting
                           --timeout, which will interrupt benchmarks,
                           potentially before reaching the target deviation.
  --csv ARG                File to write results in CSV format. If specified,
                           suppresses console output

--pattern option seems to impact benchmark results?!

I'm seeing some very weird behaviour in pandoc's benchmarks at jgm/pandoc@a9ef6b4:

$ cabal bench -w ghc-8.10.4 --benchmark-options "" --constraint 'doclayout==0.3.0.1'
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - pandoc-2.12 (lib) (configuration changed)
 - pandoc-2.12 (bench:benchmark-pandoc) (configuration changed)
Configuring library for pandoc-2.12..
Preprocessing library for pandoc-2.12..
Building library for pandoc-2.12..
Configuring benchmark 'benchmark-pandoc' for pandoc-2.12..
Warning: The package has an extraneous version range for a dependency on an
internal library: pandoc >=0 && ==2.12, pandoc >=0 && ==2.12, pandoc >=0 &&
==2.12. This version range includes the current package but isn't needed as
the current package's library will always be used.
Preprocessing benchmark 'benchmark-pandoc' for pandoc-2.12..
Building benchmark 'benchmark-pandoc' for pandoc-2.12..
Running 1 benchmarks...
Benchmark benchmark-pandoc: RUNNING...
All
  writers
    asciidoc:              OK (0.50s)
       10 ms ± 381 μs
    asciidoctor:           OK (0.29s)
       10 ms ± 788 μs
    beamer:                OK (0.28s)
       12 ms ± 888 μs
    commonmark:            OK (0.32s)
       12 ms ± 828 μs
    commonmark_x:          OK (0.34s)
       13 ms ± 723 μs
    context:               OK (0.24s)
      7.0 ms ± 697 μs
    docbook:               ^C⏎                                                                                                                        
$ cabal bench -w ghc-8.10.4 --benchmark-options "-p writer" --constraint 'doclayout==0.3.0.1'
Build profile: -w ghc-8.10.4 -O1
In order, the following will be built (use -v for more details):
 - pandoc-2.12 (lib) (configuration changed)
 - pandoc-2.12 (bench:benchmark-pandoc) (configuration changed)
Configuring library for pandoc-2.12..
Preprocessing library for pandoc-2.12..
Building library for pandoc-2.12..
Configuring benchmark 'benchmark-pandoc' for pandoc-2.12..
Warning: The package has an extraneous version range for a dependency on an
internal library: pandoc >=0 && ==2.12, pandoc >=0 && ==2.12, pandoc >=0 &&
==2.12. This version range includes the current package but isn't needed as
the current package's library will always be used.
Preprocessing benchmark 'benchmark-pandoc' for pandoc-2.12..
Building benchmark 'benchmark-pandoc' for pandoc-2.12..
Running 1 benchmarks...
Benchmark benchmark-pandoc: RUNNING...
All
  writers
    asciidoc:              OK (0.32s)
       15 ms ± 1.0 ms
    asciidoctor:           OK (0.26s)
       15 ms ± 685 μs
    beamer:                OK (0.15s)
       18 ms ± 1.8 ms
    commonmark:            OK (0.59s)
       18 ms ± 666 μs
    commonmark_x:          OK (0.61s)
       19 ms ± 417 μs
    context:               ^C⏎                                                                                                              

The benchmarks appear to run ~50% slower when I add -p writer to the benchmark options?!

Excessive inlining may optimize away the function to benchmark

I realized that it is really easy to be fooled by excessive inlining.

The following file illustrates this using GHC 9.4.5:

  • If run with -O1, then every function is benched normally.
  • If run with -O2, then some functions are optimized away (see comments). This is visible in Core output.

The functions are trivial, but I got this issue with less trivial functions in unicode-data.

Bench.hs
#!/usr/bin/env cabal

-- Make this executable and run it with ./Bench.hs

{- cabal:
build-depends:
    base        >= 4.16  && < 4.19,
    deepseq     >= 1.4   && < 1.5,
    tasty-bench >= 0.3.4 && < 0.4,
ghc-options:
  -O2 -Wall -fdicts-strict -rtsopts -with-rtsopts=-A32m -fproc-alignment=64
  -ddump-simpl -ddump-stg-final -ddump-cmm -ddump-to-file
  -dsuppress-idinfo -dsuppress-coercions -dsuppress-type-applications
-}

module Main where

import Test.Tasty.Bench (defaultMain, bench, nf, bgroup)
import Control.DeepSeq (NFData (..), deepseq)
import Data.Ix (Ix(..))

newtype MyInt = MyInt Int

instance NFData MyInt where
  {-# NOINLINE rnf #-}
  rnf (MyInt a) = rnf a

{-# INLINE f #-}
f :: Int -> Int
f cp = if cp < 790 then negate cp else cp

main :: IO ()
main = defaultMain
  [ bgroup "single value"
    [ bench "negate Int"   (nf negate           (789 :: Int)) -- not ok: rnf & negate too small
    , bench "negate MyInt" (nf (MyInt . negate) (789 :: Int)) -- ok: rnf not inlined
    , bench "negate maybe" (nf (Just . negate)  (789 :: Int)) -- not ok: rnf & negate too small
    , bench "negate list"  (nf ((:[]) . negate) (789 :: Int)) -- ok: rnf big enough
    , bench "gcd"          (nf (gcd 123)        (789 :: Int)) -- ok: gcd big enough
    , bench "f"            (nf f                (789 :: Int)) -- ok: f big enough
    ]
  , bgroup "range (rnf on function)"
    [ bench "negate MyInt" (nf (foldr (\n -> deepseq (MyInt (negate n))) () . range) (789 :: Int, 799)) -- not ok
    , bench "gcd"          (nf (foldr (\n -> deepseq (gcd n 123)) () . range)        (789 :: Int, 799)) -- ok
    , bench "f"            (nf (foldr (\c -> deepseq (f c)) () . range)              (789 :: Int, 799)) -- not ok
    ]
  , bgroup "range (rnf on accumulator)"
    [ bench "negate MyInt" (nf (foldr (\n -> (`deepseq` (MyInt (negate n)))) (MyInt minBound) . range) (789 :: Int, 799)) -- ok
    , bench "gcd"          (nf (foldr (\n -> (`deepseq` (gcd n 123))) minBound . range)        (789 :: Int, 799)) -- ok
    , bench "f"            (nf (foldr (\c -> (`deepseq` (f c))) minBound . range)              (789 :: Int, 799)) -- ok
    ]
  ]

Conclusion:

  • This is not a bug per se, but I think it is worth adding a warning about excessive inlining to the documentation.
  • Is there a systematic way to avoid this excessive inlining? Or do we have to check the Core output every time?

@harendra-kumar

Combinators to assert performance metrics

Super cool library!

I could imagine there being combinators to assert performance metrics for individual test cases. Like "fail this test if its runtime differs by x% from some baseline".

  • allow annotation of tests/benches with expected runtime + allowed deviation
  • maybe even save metrics to a file and have a command line flag to accept changes

Imagine CI checking for performance regressions, very similar to what GHC does in its CI infrastructure.

Unable to match pattern

I have a benchmark named All.Unicode.Stream/o-1-space.ungroup-group.US.unlines . S.splitOnSuffix ([Word8]) (1/10). I am able to substring match the benchmark as follows:

Unicode.Stream -p "/All.Unicode.Stream\/o-1-space.ungroup-group.US.unlines . S.splitOnSuffix ([Word8]) (1\/10)/"
All
  Unicode.Stream/o-1-space
    ungroup-group
      US.unlines . S.splitOnSuffix ([Word8]) (1/10): OK (0.55s)
        182 ms ±  17 ms

All 1 tests passed (0.55s)

But I need an exact match instead of substring match so I use an awk pattern like this:

Unicode.Stream -p '$0 == "All.Unicode.Stream\/o-1-space.ungroup-group.US.unlines . S.splitOnSuffix ([Word8]) (1\/10)"'
option -p: Could not parse pattern

Usage: Unicode.Stream [-p|--pattern PATTERN] [-t|--timeout DURATION]
                      [-l|--list-tests] [-j|--num-threads NUMBER] [-q|--quiet]
                      [--hide-successes] [--color never|always|auto]
                      [--ansi-tricks ARG] [--baseline ARG] [--csv ARG]
                      [--svg ARG] [--stdev ARG] [--fail-if-slower ARG]
                      [--fail-if-faster ARG]

It fails with option -p: Could not parse pattern. Can someone tell me what's wrong with this? Is there any way some debug information can be printed which can tell why it could not parse pattern?

Subtract benchmark baseline

I have a benchmark for mutable hashtables. It's my understanding that env re-uses whatever is created for every benchmark iteration, so benchmarks that change the table (insert/delete) cannot use env to set up a table.
So to mitigate this I set up a baseline benchmark which only does the setup + iterating the keys I insert/delete. The actual numbers I would be interested in are the following benchmark times minus this baseline benchmark.
It would be nice to have a bsubtract (or similar) which does just that.
Oh and to add: if the setup cost is much larger than the function itself (i.e. huge table vs 1k inserts), the scale with which the time is output makes subtracting the baseline manually useless, as it loses precision.

Another thing I'd love in the same context is being able to divide the result by some number.
I am running all the methods over a few thousand keys to get stable results and while dividing by that number to get the actual runtime isn't difficult, it would be nice to have the framework provide a combinator for it.

Estimate standard deviation for memory statistics

(Disclaimer: I am not particularly well-versed in benchmarking or statistics, so there may be reasons this is a bad idea.)

Consider the output

All
  fibonacci numbers
    fifth: OK (2.13s)
      63 ns ± 3.4 ns, 223 B allocated, 0 B copied, 2.0 MB peak memory

Notice that only timing has an estimated (double) standard deviation. Could we add exactly the same analysis to allocation and copying statistics? (I am not sure what variance information would be useful for "peak memory", as we report the maximum, rather than the mean.)

I guess in the majority of programs GHC is deterministic enough for these variations to be very small, so perhaps it is not worth doing?

Recommend mitigations for benchmark instability introduced by GHC's SpecConstr.

As part of the GHC 9.6 release process, we (the GHC team) have been investigating various performance changes in libraries and discovered that GHC's SpecConstr pass can get in the way of benchmark stability.

The problem

The core of the issue is that by default SpecConstr specializes a function to its argument only at a limited number of call sites. This is controlled by the -fspec-constr-count flag, and by default the number of specializations is three.

This can be problematic for benchmarks which often look like this:

        [ bench "FindIndices/inlined"     $ nf (S.findIndices    (== nl)) absurdlong
        , bench "FindIndices/non-inlined" $ nf (S.findIndices (nilEq nl)) absurdlong
        , bench "FindIndex/inlined"       $ nf (S.findIndex      (== nl)) absurdlong
        , bench "FindIndex/non-inlined"   $ nf (S.findIndex   (nilEq nl)) absurdlong
        ]

Why is this problematic? Before SpecConstr is run, the above snippet will have been transformed into something along the lines of:

bench1 = $wbenchLoop (S.findIndices    (== nl)) absurdlong
bench2 = $wbenchLoop (S.findIndices (nilEq nl)) absurdlong
bench3 = $wbenchLoop (S.findIndex      (== nl)) absurdlong
bench4 = $wbenchLoop (S.findIndex   (nilEq nl)) absurdlong

This is problematic as we now have four calls to $wbenchLoop but only three of
them will be specialized because of -fspec-constr-count being set to three by
default.

This results in:

  • One of the four benchmarks "mysteriously" being slower.
  • The performance of benchmarks "mysteriously" changing if they are reordered as
    that affects which functions get specialized.
  • Similarly the performance of a benchmarks might change if new benchmarks are
    added or other benchmarks removed.

Potential workarounds

Ensure that bench isn't inlined.

SpecConstr currently won't generate specializations of functions defined in other modules. Therefore, if bench is prevented from inlining, SpecConstr cannot trigger on it.

Alternatively bench could also be marked as OPAQUE for newer versions of GHC for similar effect.

This would add some (constant) overhead to benchmarks but would sidestep the SpecConstr issue
and should prevent users from running into this issue in most common scenarios.

Recommend users to disable SpecConstr

This will stabilize results as no occurrences of bench will get specialized.
However this results in slightly higher overhead for the benchmarking framework.

Recommend users to set -fspec-constr-count to a large number.

This can solve the issue by ensuring all occurrences will be specialized equally.

Custom, ad-hoc metrics?

What: I would like to use tasty-bench to not only measure performance metrics, but also custom-made metrics. My use case is that I am working on a hobby project where I work on an optimization problem. I would like to see the effect of different heuristics reflected in the benchmarks: "When using this heuristic, the solution produced is this or that good". It would be quite nice to be able to use the tasty-bench machinery (i.e. comparing between benchmark runs)!

Question: Is this something that makes sense for tasty-bench design-wise?

Here is a draft of how I image using that feature:

{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE ExistentialQuantification #-}

module Demo where

import Control.DeepSeq (NFData)
import GHC.Generics (Generic)
import Test.Tasty.Bench (Benchmark, bench, defaultMain, nf)

data Problem = Problem
data Solution = Solution deriving (Generic, NFData)
problem1 :: Problem
problem1 = undefined

solveProblemInstance :: Problem -> Solution
solveProblemInstance = undefined

steps :: Solution -> Int
steps = undefined

customBench :: String -> [AdditionalDataKV] -> Benchmark
customBench = undefined

-- Ord, because the values should be comparable between benchmarks.
data AdditionalDataKV = forall a. Ord a => AdditionalDataKV String a

main :: IO ()
main =
    Test.Tasty.Bench.defaultMain
        [ bench "solveProblemInstance problem1" $ nf solveProblemInstance problem1
        , customBench
            "steps required in solution of 'solveProblemInstance problem1'"
            [AdditionalDataKV "numberOfSteps" (steps $ solveProblemInstance problem1)]
        ]

CSV output appears to be incompatible with criterion-compare

criterion-compare is a nice tool for visualizing diffs between benchmark runs. AFAIK it can digest the CSV output from both criterion and gauge. It seems unable to handle the output from tasty-bench though. Given

Name,Mean (ps),2*Stdev (ps)
All.Data.ByteString.Builder.Encoding wrappers.primMapListFixed word8 (10000),27122086,1442142
All.Data.ByteString.Builder.Encoding wrappers.primUnfoldrFixed word8 (10000),7367723,667980
All.Data.ByteString.Builder.Encoding wrappers.primMapByteStringFixed word8 (10000),4706674,282574
All.Data.ByteString.Builder.Encoding wrappers.primMapLazyByteStringFixed word8 (10000),4794733,344816

it says:

$ criterion-compare bench.csv bench.csv 
criterion-compare: user error (parse error (Failed reading: conversion error: empty) at 
All.Data.ByteString.Builder.Encoding wrappers.primUnfoldrFixed word8 (10000),7367723,667980
All.Dat (truncated))

Or should I possibly use a different tool for viewing the diffs?

Output of the benchmark function is retained in memory

See my discourse thread for the full story.

The short version is that this issue is reproduced by this program:

import Test.Tasty.Bench
-- import Criterion.Main
import Control.DeepSeq

drop' :: Int -> [a] -> [a]
drop' n s =
      case n of
        0 ->
          case s of
            [] -> []
            x : xs -> x : drop' 0 xs
        _ ->
          case s of
            [] -> []
            x : xs -> drop' (n - 1) xs

main = do
  let input = replicate 1000000 'a'
  defaultMain
    [ bench "1" $ whnf (rnf . drop' 100000) input
    , bench "2" $ nf (drop' 100000) input
    ]

The result is:

All
  1: OK (2.67s)
    4.88 ms ± 250 μs,  41 MB allocated, 1.3 KB copied,  53 MB peak memory
  2: OK (1.72s)
    26.0 ms ± 2.6 ms,  41 MB allocated,  40 MB copied, 120 MB peak memory

Changing to criterion yields these wildly different results:

benchmarking 1
time                 4.943 ms   (4.867 ms .. 5.039 ms)
                     0.993 R²   (0.987 R² .. 0.998 R²)
mean                 5.130 ms   (5.039 ms .. 5.235 ms)
std dev              309.2 μs   (245.2 μs .. 386.1 μs)
variance introduced by outliers: 35% (moderately inflated)

benchmarking 2
time                 5.058 ms   (4.957 ms .. 5.163 ms)
                     0.996 R²   (0.991 R² .. 0.998 R²)
mean                 5.213 ms   (5.124 ms .. 5.349 ms)
std dev              337.9 μs   (248.5 μs .. 504.7 μs)
variance introduced by outliers: 39% (moderately inflated)

I've tracked it down in Core to this difference:

-- Expression that evaluates benchmark 1
seq# (case $wgo ($wdrop' 100000# x1) of { (# #) -> () }) eta2

-- Helper function for benchmark 2
eta1 :: [Char] -> [Char]
eta1 = \ (s :: [Char]) -> $wdrop' 100000# s

-- Expression that evaluates benchmark 2
seq#
  (let {
    x2 :: [Char]
    x2 = eta1 x1 } 
   in
    case $wgo x2 of { (# #) -> x2 })
  eta2

This shows benchmark 2 retains x2 in memory during the normalization ($wgo).

I think this can be solved by using rnf instead of force:

-nf = funcToBench force
+nf = funcToBench rnf

Benchmarks sometimes get stuck

I unfortunately don't have a reproducible example, as this doesn't happen deterministically, but sometimes the benchmarks seem to get stuck. Restarting the benchmarks usually fixes the problem.

Do not overwrite CSV file, instead append to it

tasty-bench overwrites the CSV file every time, whereas the behavior of gauge is to append to it. The latter is more convenient as we can keep appending measurements to the same CSV file and then the tools processing the CSV file can select or compare the measurements.

With the appending behavior as default, it is easy to get the overwriting behavior by just removing the existing file before performing measurements. Alternatively, a CLI option could be provided to select appending vs overwriting behavior. Though it's always better to have fewer CLI options, so I would prefer the appending behavior by default and no CLI option.

Duplicate benchmark names result in nonsensical comparisons

For example in bytestring:

$ cabal bench --benchmark-options '--csv baseline.csv -p traversals'
<snip>
    map (+1): OK (0.22s)
      217 μs ±  11 μs
    map (+1): OK (0.45s)
       52 ns ± 1.3 ns

All 2 tests passed (0.69s)
Benchmark bytestring-bench: FINISH
$ cabal bench --benchmark-options '--baseline baseline.csv -p traversals'
<snip>
    map (+1): OK (0.23s)
      206 μs ±  11 μs
    map (+1): OK (0.23s)
       53 ns ± 3.3 ns, 99% faster than baseline

All 2 tests passed (0.45s)
Benchmark bytestring-bench: FINISH

The 99% improvement is reported because this benchmark is compared to the result of the much slower other benchmark of the same name.

Unhandled resource exception on pure IO action

I have the following code:

myBenchEncode :: ( forall (st :: ps) (st' :: ps). NFData (Message ps st st')
                 , forall (st :: ps) pr. NFData (PeerHasAgency pr st)
                 )
              => String
              -> (NodeToNodeVersion -> Codec ps DeserialiseFailure IO ByteString)
              -> NodeToNodeVersion
              -> AnyMessageAndAgency ps
              -> Benchmark
myBenchEncode title getCodec ntnVersion (AnyMessageAndAgency a m) =
  let Codec { encode } = getCodec ntnVersion
   in env (evaluate (force (a, m))) $ \(agency, message) ->
        bench title $ nf (encode agency) message

ignoring the details specific to my code, this is a function that generates benchmarks. I want to force-evaluate some of the inputs, which are pure values, and as you can see, the benchmark name does not depend on the IO action. The benchmark hierarchy does not depend on it either. What's weirder is that if I only evaluate one element of the tuple (i.e. either a or m) the code works, but as soon as I tuple them it doesn't, giving bench: Unhandled resource. Probably a bug in the runner you're using. Is this a bug?
