
hs-gauge's People

Contributors

aslatter, basvandijk, bgamari, bos, cotrone, erikd, errge, gabriella439, harendra-kumar, hvr, ivan-m, jmillikin, melted, merijn, ndmitchell, neilccbrown, nomeata, ntc2, osa1, phadej, psibi, rrnewton, rufflewind, ryanglscott, shimuuar, sjakobi, snoyberg, tibbe, tserduke, vincenthz


hs-gauge's Issues

Question: what is the expected behaviour when using the option `iters :: Maybe Int64`

Background:

  • In criterion I cannot control the maximum number of runs, which can be problematic for long-running processes and leads to very long bench times.
  • I see that the gauge config has more options than criterion's; notably, you have iters :: Maybe Int64.

However, it seems this option used to exist in criterion (initially introduced as --no-measurments) and is also available in the bench CLI tool, but in a way that, if selected, produces no statistics.

Is that also the case in gauge (no measurements produced)?

  • If yes: could this be added to the documentation of that option?
  • If no: is there a minimum number of runs (2, 3?) to perform so that the statistical measures do not crash?

For comparison, hyperfine has --max-runs and --runs but imposes a minimum of two runs to compute a stdev.
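For intuition, the sample standard deviation divides by n - 1, so at least two runs are needed before it is even defined. A minimal Haskell sketch of that guard (illustration only, not gauge's code):

stdDev :: [Double] -> Maybe Double
stdDev xs
  | n < 2     = Nothing                      -- undefined for fewer than two runs
  | otherwise = Just (sqrt (sumSq / (n - 1)))
  where
    n     = fromIntegral (length xs) :: Double
    mean  = sum xs / n
    sumSq = sum [(x - mean) ^ (2 :: Int) | x <- xs]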

Thanks in advance for your help on this

data for some iterations goes missing in --csvraw output

For example, we do not see iteration 24 in the following output:

compose/all-out-filters/streamly,22,0.366724004,804929524,0.366298,0.0,0.0,3547136,0,0,0,111,2111865480,2046,50080,0.358753,0.358753,3.284e-3,4.393824e-3
compose/all-out-filters/streamly,23,0.371523978,815465102,0.371458,0.0,0.0,3547136,0,0,0,81,2207859448,2139,52280,0.364126,0.364126,3.261e-3,4.404201e-3
compose/all-out-filters/streamly,25,0.410539736,901101540,0.410004,0.0,0.0,3547136,0,0,0,172,2399847288,2325,56680,0.401229,0.401229,3.807e-3,5.039309e-3
compose/all-out-filters/streamly,26,0.421200579,924501238,0.420976,0.0,0.0,3547136,0,0,0,111,2495841144,2418,58880,0.41244,0.41244,3.652e-3,4.978051e-3

Make calculations use integral numbers

It would be more accurate to have calculations done on integral values at the 10 or 100 picosecond level instead of relying on Double doing the right thing with respect to rounding. Some calculations might still require conversion to Double, but some (like min, max, ...) don't.
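A minimal sketch of the idea (not gauge's actual representation): keep raw measurements as integral picosecond counts, do order-based operations such as min and max directly on them, and convert to Double only where genuinely needed.

import Data.Word (Word64)

newtype Picos = Picos Word64 deriving (Eq, Ord, Show)

-- min/max need no conversion at all; they just use the integral ordering
fastest, slowest :: [Picos] -> Picos
fastest = minimum
slowest = maximum

-- conversion only at the boundary, e.g. for means or display
toSeconds :: Picos -> Double
toSeconds (Picos p) = fromIntegral p * 1e-12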

We should not allow statistical analysis when the number of samples is less than 3

Otherwise it leads to a crash:

analysing with 1000 resamples
bootstrapping with 2 samples
benchmarks: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (-9223372036854775808,1000)
CallStack (from HasCallStack):
  error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.0.1-JlawpRjIcMJIYPJVsWriIA:Data.Vector.Internal.Check
Benchmark benchmarks: ERROR
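A minimal sketch of the suggested guard (canAnalyse is a hypothetical helper, not gauge's API): check the sample count up front and fall back to reporting raw numbers instead of bootstrapping.

-- Require at least 3 samples before attempting bootstrap/regression analysis.
canAnalyse :: [a] -> Bool
canAnalyse samples = length samples >= 3

-- Callers would then do something like:
--   if canAnalyse samples then runFullAnalysis samples else reportRawMean samples
-- where runFullAnalysis and reportRawMean stand in for the real code paths.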

Add more counters from Linux HW PMC and perf_event infrastructure

We can add more counters to the microbenchmarking from the Linux perf_event infrastructure. The perf list command shows the available counters; some of them are:

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]
  dummy                                              [Software event]

...

I have used many of these in the past to analyse and improve the performance of a C/Haskell program, especially instructions, cache-misses and branch-misses. We can add some of the most useful ones to start with and then add more later. We are already using the perf_event counters for rdtsc on Linux, so it should be pretty easy to add a few more counters.
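A hypothetical sketch of the shape this could take, assuming a small C shim (gauge_perf_open and gauge_perf_read are made-up names) that wraps the perf_event_open(2) syscall, much like rdtsc is already wrapped in cbits:

import Data.Word (Word64)
import Foreign.C.Types (CInt (..))

foreign import ccall unsafe "gauge_perf_open"
  c_perfOpen :: CInt -> IO CInt        -- event selector -> fd, -1 on failure

foreign import ccall unsafe "gauge_perf_read"
  c_perfRead :: CInt -> IO Word64      -- fd -> current counter value

-- Measure how far a counter advances while an action runs.
withCounter :: CInt -> IO a -> IO (a, Word64)
withCounter evt act = do
  fd     <- c_perfOpen evt
  before <- c_perfRead fd
  result <- act
  after  <- c_perfRead fd
  return (result, after - before)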

Automatically find the gauge executable for --measure-with when possible

PR #3 introduced the --measure-with option for isolated measurement. This option requires the user to specify the path of the gauge executable itself so that it can be invoked again for running benchmarks in isolation. This is inconvenient for the user; we can find the path automatically in most, if not all, cases.

On Unices the exact path of the executable as it was invoked can be determined, and we can use that. On Windows that is not possible; only the name of the executable can be determined. However, if the gauge executable is in PATH we can try to find an executable of that name there. If we still cannot find it, we can ask the user to specify the path with --measure-with.
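A minimal sketch of that logic using only standard library functions (getExecutablePath from base, findExecutable and doesFileExist from directory); findSelf is a hypothetical helper name:

import System.Directory (doesFileExist, findExecutable)
import System.Environment (getExecutablePath, getProgName)

findSelf :: IO (Maybe FilePath)
findSelf = do
  exe <- getExecutablePath              -- exact path of the running binary on Unices
  ok  <- doesFileExist exe
  if ok
    then return (Just exe)
    else getProgName >>= findExecutable -- fallback: search PATH by program name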

For automatic determination we can use an --isolate command line switch, and in the same vein the --measure-with option can be renamed to --isolate-with to keep the two options intuitively related.

crash in toOptional

I am seeing this quite often:

benchmarking constantSlowConsumer/asyncly ... Adaptive: Creating an optional valid value using the optional tag
CallStack (from HasCallStack):
  error, called at ./Gauge/Optional.hs:58:25 in gauge-0.2.1-HPl6LVZXpCZ5TsFkdrilsb:Gauge.Optional
  toOptional, called at ./Gauge/Measurement.hs:478:44 in gauge-0.2.1-HPl6LVZXpCZ5TsFkdrilsb:Gauge.Measurement

Here are the places in the stack trace:

471 applyRUStatistics end start m
472     | RUsage.supported = m { measUtime   = Optional.toOptional $ diffTV RUsage.userCpuTime
473                            , measStime   = Optional.toOptional $ diffTV RUsage.systemCpuTime
474                            , measMaxrss  = Optional.toOptional $ RUsage.maxResidentSetSize end
475                            , measMinflt  = Optional.toOptional $ diff RUsage.minorFault
476                            , measMajflt  = Optional.toOptional $ diff RUsage.majorFault
477                            , measNvcsw   = Optional.toOptional $ diff RUsage.nVoluntaryContextSwitch
478                            , measNivcsw  = Optional.toOptional $ diff RUsage.nInvoluntaryContextSwitch
479                            }
 55 -- | Create an optional value from a-
 56 toOptional :: (HasCallStack, OptionalTag a) => a -> Optional a
 57 toOptional v
 58     | isOptionalTag v = error "Creating an optional valid value using the optional tag"
 59     | otherwise       = Optional v

Cycles issues

Just noticed the comment on #15. Cycles are broken on Linux and OS X, and presumably Windows.

On Linux, the fddev is not properly initialized, leading to an unreported invalid handle and then an unreported read error, with the cycle value being the uninitialized variable.

On OS X, the missing <Rts.h> means the CPP check for i386 and x86_64 returns False, and thus the backup function is used.

I haven't checked, but the same probably applies on Windows (given the same missing include).

StatsTables not visible when using a custom config

For some odd reason, I cannot get benchmark results tables to show when using a modified Config.

When I run with defaultMain [...] it gives me the results table as it finishes each group.

However, when I customize the config with a fixed iteration count like this:

fixedIters :: Config
fixedIters = defaultConfig { iters = Just (6000 :: Int64) }

main = do
   ...
   defaultMainWith fixedIters [queryRenderGroup, dbManipulationsGroup]

I don't get any benchmark results shown in the terminal at all; it just tells me it has finished benchmarking:

Benchmark benchmarks: FINISH

I wonder if this is just a problem with my setup or if I'm doing something wrong here. The modified config also has displayMode = StatsTable. I'm running on Windows 10.

Here's a repo to test it out with:
https://github.com/tuomohopia/squeal

Enter the squeal-postgresql folder and run stack bench.

Comparison chart

I'd like a simple terminal comparison output à la the graphical output of Criterion. How about I submit a PR that produces something like the following when you pass --bars?

bench1    ██████░░░░░░   41.65 ns
bench2    ████████████   163.9 ns
bench3    ██░░░░░░░░░░   21.25 ns

Can't build gauge on Apple Silicon

I tried to build gauge on a MacBook Air with an Apple M1 chip and it failed with this error:

#error Unsupported OS/architecture/compiler!

I ran

❯ cabal install --lib gauge

and got this output:

Output
Resolving dependencies...
Build profile: -w ghc-8.10.7 -O1
In order, the following will be built (use -v for more details):
 - gauge-0.2.5 (lib) (requires build)
Starting     gauge-0.2.5 (lib)
Building     gauge-0.2.5 (lib)

Failed to build gauge-0.2.5.
Build log ( /Users/***/.cabal/logs/ghc-8.10.7/gg-0.2.5-a59d4712.log ):
Configuring library for gauge-0.2.5..
Preprocessing library for gauge-0.2.5..
Building library for gauge-0.2.5..
[ 1 of 43] Compiling Gauge.CSV        ( Gauge/CSV.hs, dist/build/Gauge/CSV.o, dist/build/Gauge/CSV.dyn_o )
[ 2 of 43] Compiling Gauge.ListMap    ( Gauge/ListMap.hs, dist/build/Gauge/ListMap.o, dist/build/Gauge/ListMap.dyn_o )
[ 3 of 43] Compiling Gauge.Optional   ( Gauge/Optional.hs, dist/build/Gauge/Optional.o, dist/build/Gauge/Optional.dyn_o )
[ 4 of 43] Compiling Gauge.Source.Time ( dist/build/Gauge/Source/Time.hs, dist/build/Gauge/Source/Time.o, dist/build/Gauge/Source/Time.dyn_o )
[ 5 of 43] Compiling Gauge.Time       ( Gauge/Time.hs, dist/build/Gauge/Time.o, dist/build/Gauge/Time.dyn_o )
[ 6 of 43] Compiling Gauge.Source.RUsage ( dist/build/Gauge/Source/RUsage.hs, dist/build/Gauge/Source/RUsage.o, dist/build/Gauge/Source/RUsage.dyn_o )
[ 7 of 43] Compiling Gauge.Source.GC  ( Gauge/Source/GC.hs, dist/build/Gauge/Source/GC.o, dist/build/Gauge/Source/GC.dyn_o )
[ 8 of 43] Compiling Gauge.Measurement ( Gauge/Measurement.hs, dist/build/Gauge/Measurement.o, dist/build/Gauge/Measurement.dyn_o )
[ 9 of 43] Compiling Gauge.Format     ( Gauge/Format.hs, dist/build/Gauge/Format.o, dist/build/Gauge/Format.dyn_o )
[10 of 43] Compiling Numeric.MathFunctions.Comparison ( math-functions/Numeric/MathFunctions/Comparison.hs, dist/build/Numeric/MathFunctions/Comparison.o, dist/build/Numeric/MathFunctions/Comparison.dyn_o )
[11 of 43] Compiling Numeric.MathFunctions.Constants ( math-functions/Numeric/MathFunctions/Constants.hs, dist/build/Numeric/MathFunctions/Constants.o, dist/build/Numeric/MathFunctions/Constants.dyn_o )
[12 of 43] Compiling Numeric.SpecFunctions.Internal ( math-functions/Numeric/SpecFunctions/Internal.hs, dist/build/Numeric/SpecFunctions/Internal.o, dist/build/Numeric/SpecFunctions/Internal.dyn_o )
[13 of 43] Compiling Numeric.SpecFunctions ( math-functions/Numeric/SpecFunctions.hs, dist/build/Numeric/SpecFunctions.o, dist/build/Numeric/SpecFunctions.dyn_o )
[14 of 43] Compiling Numeric.Sum      ( math-functions/Numeric/Sum.hs, dist/build/Numeric/Sum.o, dist/build/Numeric/Sum.dyn_o )
[15 of 43] Compiling Paths_gauge      ( dist/build/autogen/Paths_gauge.hs, dist/build/Paths_gauge.o, dist/build/Paths_gauge.dyn_o )
[16 of 43] Compiling Gauge.Main.Options ( Gauge/Main/Options.hs, dist/build/Gauge/Main/Options.o, dist/build/Gauge/Main/Options.dyn_o )
[17 of 43] Compiling Statistics.Distribution ( statistics/Statistics/Distribution.hs, dist/build/Statistics/Distribution.o, dist/build/Statistics/Distribution.dyn_o )
[18 of 43] Compiling Statistics.Function ( statistics/Statistics/Function.hs, dist/build/Statistics/Function.o, dist/build/Statistics/Function.dyn_o )
[19 of 43] Compiling Statistics.Internal ( statistics/Statistics/Internal.hs, dist/build/Statistics/Internal.o, dist/build/Statistics/Internal.dyn_o )
[20 of 43] Compiling Statistics.Distribution.Normal ( statistics/Statistics/Distribution/Normal.hs, dist/build/Statistics/Distribution/Normal.o, dist/build/Statistics/Distribution/Normal.dyn_o )
[21 of 43] Compiling Statistics.Math.RootFinding ( statistics/Statistics/Math/RootFinding.hs, dist/build/Statistics/Math/RootFinding.o, dist/build/Statistics/Math/RootFinding.dyn_o )
[22 of 43] Compiling Statistics.Matrix.Types ( statistics/Statistics/Matrix/Types.hs, dist/build/Statistics/Matrix/Types.o, dist/build/Statistics/Matrix/Types.dyn_o )
[23 of 43] Compiling Statistics.Matrix.Mutable ( statistics/Statistics/Matrix/Mutable.hs, dist/build/Statistics/Matrix/Mutable.o, dist/build/Statistics/Matrix/Mutable.dyn_o )
[24 of 43] Compiling Statistics.Quantile ( statistics/Statistics/Quantile.hs, dist/build/Statistics/Quantile.o, dist/build/Statistics/Quantile.dyn_o )
[25 of 43] Compiling Statistics.Sample.Histogram ( statistics/Statistics/Sample/Histogram.hs, dist/build/Statistics/Sample/Histogram.o, dist/build/Statistics/Sample/Histogram.dyn_o )
[26 of 43] Compiling Statistics.Sample.Internal ( statistics/Statistics/Sample/Internal.hs, dist/build/Statistics/Sample/Internal.o, dist/build/Statistics/Sample/Internal.dyn_o )
[27 of 43] Compiling Statistics.Sample ( statistics/Statistics/Sample.hs, dist/build/Statistics/Sample.o, dist/build/Statistics/Sample.dyn_o )
[28 of 43] Compiling Statistics.Matrix ( statistics/Statistics/Matrix.hs, dist/build/Statistics/Matrix.o, dist/build/Statistics/Matrix.dyn_o )
[29 of 43] Compiling Statistics.Matrix.Algorithms ( statistics/Statistics/Matrix/Algorithms.hs, dist/build/Statistics/Matrix/Algorithms.o, dist/build/Statistics/Matrix/Algorithms.dyn_o )
[30 of 43] Compiling Statistics.Transform ( statistics/Statistics/Transform.hs, dist/build/Statistics/Transform.o, dist/build/Statistics/Transform.dyn_o )
[31 of 43] Compiling Statistics.Sample.KernelDensity ( statistics/Statistics/Sample/KernelDensity.hs, dist/build/Statistics/Sample/KernelDensity.o, dist/build/Statistics/Sample/KernelDensity.dyn_o )
[32 of 43] Compiling Statistics.Types.Internal ( statistics/Statistics/Types/Internal.hs, dist/build/Statistics/Types/Internal.o, dist/build/Statistics/Types/Internal.dyn_o )
[33 of 43] Compiling Statistics.Types ( statistics/Statistics/Types.hs, dist/build/Statistics/Types.o, dist/build/Statistics/Types.dyn_o )
[34 of 43] Compiling System.Random.MWC ( mwc-random/System/Random/MWC.hs, dist/build/System/Random/MWC.o, dist/build/System/Random/MWC.dyn_o )
[35 of 43] Compiling Statistics.Resampling ( statistics/Statistics/Resampling.hs, dist/build/Statistics/Resampling.o, dist/build/Statistics/Resampling.dyn_o )
[36 of 43] Compiling Statistics.Resampling.Bootstrap ( statistics/Statistics/Resampling/Bootstrap.hs, dist/build/Statistics/Resampling/Bootstrap.o, dist/build/Statistics/Resampling/Bootstrap.dyn_o )
[37 of 43] Compiling Statistics.Regression ( statistics/Statistics/Regression.hs, dist/build/Statistics/Regression.o, dist/build/Statistics/Regression.dyn_o )
[38 of 43] Compiling Gauge.Monad      ( Gauge/Monad.hs, dist/build/Gauge/Monad.o, dist/build/Gauge/Monad.dyn_o )
[39 of 43] Compiling Gauge.IO.Printf  ( Gauge/IO/Printf.hs, dist/build/Gauge/IO/Printf.o, dist/build/Gauge/IO/Printf.dyn_o )
[40 of 43] Compiling Gauge.Benchmark  ( Gauge/Benchmark.hs, dist/build/Gauge/Benchmark.o, dist/build/Gauge/Benchmark.dyn_o )
[41 of 43] Compiling Gauge.Analysis   ( Gauge/Analysis.hs, dist/build/Gauge/Analysis.o, dist/build/Gauge/Analysis.dyn_o )
[42 of 43] Compiling Gauge.Main       ( Gauge/Main.hs, dist/build/Gauge/Main.o, dist/build/Gauge/Main.dyn_o )
[43 of 43] Compiling Gauge            ( Gauge.hs, dist/build/Gauge.o, dist/build/Gauge.dyn_o )

cbits/cycles.c:55:2: error:
     error: Unsupported OS/architecture/compiler!
   |
55 | #error Unsupported OS/architecture/compiler!
   |  ^
#error Unsupported OS/architecture/compiler!
 ^
1 error generated.
`gcc' failed in phase `C Compiler'. (Exit code: 1)
cabal: Failed to build gauge-0.2.5. See the build log above for details.

Release gauge-0.2.4 on hackage

I specifically need the fix in #85; it is becoming very painful to always switch to a local copy in stack.yaml before benchmarking. There have been no breaking changes since the last release, though a few APIs (nfAppIO and friends) have been added, so a minor version bump may be enough.

documentation of command line options

It took me some time to find out that I can select benchmarks on the command line (my-benchmark-exe -m pattern foobar).

I suggest showing some typical ways of calling an executable that uses defaultMain, right at https://hackage.haskell.org/package/gauge-0.2.4/docs/Gauge-Main.html#v:defaultMain, and maybe even on the top page https://hackage.haskell.org/package/gauge-0.2.4, with some text like:

An executable that uses defaultMain
is called from the command line as executable <options> <arguments>
where options are described at https://hackage.haskell.org/package/gauge/docs/src/Gauge.Main.Options.html#opts , and each argument is a benchmark name (or pattern)

The current documentation/landing page addresses people who want to switch from criterion (and wastes a lot of space listing the libraries that are avoided), but my use case is that I want to evaluate gauge (or recommend it to my students) without prior knowledge of other frameworks.

(NB: I find criterion's defaultMain equally underdocumented.)

Formatting numbers

Big numbers are difficult to read without separators. What is the best way to format them? I found https://hackage.haskell.org/package/format-numbers on Hackage for this purpose, which is a tiny package. Should we add a dependency on it or just copy over the code, or are there better options in base or in something we already depend on? Or should we write our own?
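If we end up writing our own, a hand-rolled version is small; a sketch (groupThousands is a hypothetical helper, not part of gauge):

import Data.List (intercalate)

-- 1234567 -> "1,234,567"
groupThousands :: Integer -> String
groupThousands n
  | n < 0     = '-' : groupThousands (negate n)
  | otherwise = reverse (intercalate "," (chunksOf 3 (reverse (show n))))
  where
    chunksOf k xs = case splitAt k xs of
      (c, [])   -> [c]
      (c, rest) -> c : chunksOf k rest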

ridiculously low benchmarking results with nfIO

Refer to this line of code. Reproduced below for easy reference:

benchIO name f n = bench name $ nfIO $ f n >>= return

If I remove the bind in this operation and just keep f n, some of the streaming libraries (conduit, for example) benchmarked by the streaming-benchmarks package show ridiculously low results (nanoseconds for a million operations). To reproduce, remove that bind on that line and run this benchmark with and without the change:

./run.sh elimination/toList/conduit

Clearly there is something wrong with that. Perhaps this function just gets optimized out because the compiler thinks the value is not being used? Or something else?

I did not investigate much; I am hoping that someone knows what's going on and can figure this out quickly.

Without understanding this, the results will remain unpredictable, and benchmarking is of little use unless it is predictable. If the bind is really required, then perhaps we can do it inside a wrapper so that the results are always predictable for users.
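One possible workaround sketch, using nfAppIO (mentioned elsewhere in this tracker as a recently added API; treat the exact signature as an assumption): applying the function to its argument inside the benchmark loop should keep the compiler from sharing or discarding the result.

import Control.DeepSeq (NFData)
import Gauge

benchIO :: NFData b => String -> (Int -> IO b) -> Int -> Benchmark
benchIO name f n = bench name (nfAppIO f n)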

Throw away & Retry wrapped metrics

Instead of defaulting to max 0 .. for different values, it would be a good idea to just throw away this call and retry the whole gathering.

failing test when verbose is selected (and not quick)

sanity: benchmarking fib/fib 10 ... FAIL
  Exception: Creating an optional valid value using the optional tag
  CallStack (from HasCallStack):
    error, called at ./Gauge/Optional.hs:55:25 in gauge-0.2.0-1DreTMaUhTv9BugFF6fY9:Gauge.Optional

1 out of 1 tests failed (5.11s)

This happened on a slow machine, 32-bit Intel, Archlinux32.

Build failures with GHC-7.8

I've sadly lost my build logs, but when I tried to build with lts-2.22 I ran into a bad import from math-functions and missing imports of pure.

Currently the cabal file claims that you're testing with GHC 7.8, so you should probably do that. :)

Nice package BTW. Looks very useful for benchmarking dependencies of criterion! :)

Cycles are no longer being reported on mac

After commit c52432c, cycles are not reported on Mac OS X. I see this in time-osx.c:

    tr->rdtsc = 0;

Older code used this for mac (cycles.c):

#if x86_64_HOST_ARCH || i386_HOST_ARCH

StgWord64 gauge_rdtsc(void)
{
  StgWord32 hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((StgWord64) lo) | (((StgWord64) hi)<<32);
}

@vincenthz is this deliberate or pending?

Dump all raw measurements

Each round of measurement in a benchmark could optionally be dumped in CSV format:

name-bench,clock-time-diff,cpu-time-diff,cycles-diff,gc-diff...
...

It should make measurement/analysis debugging easier, plus it could be loaded or transformed without re-running the measurement.

Always output the same units

I haven't actually run Gauge yet, but I assume the output looks similar to Criterion. One thing that always trips me up is comparing timings with different units. How do 123 ns, 234 μs, and 345 ms compare? It would be nice if Gauge always output the same unit (probably ns).

Time accounting is incorrect on Mac and not accurate on Linux

Ideally utime + stime = cpuTime. On Linux this holds approximately (though I would like it to be more accurate there as well; I have seen significant aberration), but on Mac it doesn't hold at all. In fact, on Mac I observed that stime is always 0. Sometimes even utime is zero even though cpuTime is a significant value, for example:

cpuTime              175.8 ms  
utime                0.0 s     
stime                0.0 s     

@vincenthz I saw some commits from you fixing the cycles stuff; while your L1 cache is hot, can you take a look at what's going on here? It seems there is still something wrong with measuring time on Mac. I tested at commit 6164dc4.

On Linux there is some irregularity. I know that we measure cpuTime and utime/stime using different methods and at different points, but even then I do not expect an aberration of milliseconds. I am seeing an aberration of up to 2-3 ms, which is quite significant:

cpuTime              182.0 ms  
utime                180.0 ms  
stime                4.000 ms  
...
cpuTime              166.6 ms  
utime                156.0 ms  
stime                8.000 ms  
...
cpuTime              52.83 ms  
utime                56.00 ms  
stime                0.0 s     

I also observed that utime and stime never have a fractional millisecond part; maybe the aberration is due to this loss of precision, and that may be why it is always up to 2 ms.

Quick mode

In most cases I have not found the statistical regression very useful. The result from a single sample with a sufficient number of iterations is almost always good enough; I have yet to see an example where the regression is really useful. This fancy machinery takes more CPU and generates its own load, impacting the benchmarks and taking more time. In fact, it has also been doing wrong computations which went unnoticed for a long time, perhaps because of the complexity. We can keep this mode for when we really want that kind of rigor or have reason to believe that it will be useful.

For day-to-day or minute-to-minute runs we can use a fast mode that keeps things simple and stupid: do a bit of warmup, measure one sample quickly, and report the results. We can use a --quick option to enable this. It will save time and we can rely on the simplicity.

Add --compare flag

When generating results, it would be nice to provide a --save baseline flag that would produce e.g.

bench1    ██████░░░░░░   41.65 ns
bench2    ████████████   163.9 ns
bench3    ██░░░░░░░░░░   21.25 ns

Saved to: baseline

And write the benchmark results to a file under the current directory, e.g. .gauge/saves/.

And then you could re-run (like when you run a test suite again with --seed) with

--compare baseline --save inline

and you would get:

bench1
  baseline:     ██████░░░░░░   41.65 ns
  inline:       ████░░░░░░░░   32.65 ns
bench2
  baseline:     ████████████   163.9 ns
  inline:       █████████░░░   113.9 ns
bench3
  baseline:     ██░░░░░░░░░░   21.25 ns
  inline:       ██░░░░░░░░░░   20.22 ns

Saved to: inline

This would be a great way to optimize things and experiment with ideas. Other tooling could also use it; e.g. Emacs could automate this. The --compare flag could be passed multiple times to compare many benchmark results at once.

--output silently does nothing for HTML

Overview

Edit: Changed issue title, somehow missed that HTML support isn't available.

So I was going to go through http://www.serpentine.com/criterion/tutorial.html to see if all the instructions there also work without modification for Gauge, and discovered that --output=file.html doesn't appear to work (or I'm holding it wrong). The benchmark completes successfully but doesn't write an output file.

Steps to Reproduce

  1. Visit http://www.serpentine.com/criterion/tutorial.html
  2. Create the file Fibber.hs with the content of the code block under the Getting Started section but replace import Criterion.Main with import Gauge
  3. ghc -O --make Fibber
  4. ./Fibber --output=fibber.html

Checking the help options (./Fibber --help) it appears that --output is a valid option.

Other info

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.4.4

$ ./Fibber --help
Microbenchmark suite - built with gauge 0.2.4
<snip>
  -o FILE   --output=FILE             File to write report to
<snip>

When using multiple benchmarks earlier ones affect the ones coming later

I reported this against criterion and am reporting it here as well; I have tested it with gauge and, as expected, the same problem persists.

I have the following benchmarks in a group:

        bgroup "map"
          [ bench "machines" $ whnf drainM (M.mapping (+1))
          , bench "streaming" $ whnf drainS (S.map (+1))
          , bench "pipes" $ whnf drainP (P.map (+1))
          , bench "conduit" $ whnf drainC (C.map (+1))
          , bench "list-transformer" $ whnf drainL (lift . return . (+1))
          ]

The last two benchmarks take significantly more time when I run all these benchmarks in one go using stack bench --benchmark-arguments "-m glob ops/map/*".

$ stack bench --benchmark-arguments "-m glob ops/map/*"

benchmarking ops/map/machines
time                 30.23 ms   (29.22 ms .. 31.04 ms)

benchmarking ops/map/streaming
time                 17.91 ms   (17.48 ms .. 18.37 ms)

benchmarking ops/map/pipes
time                 29.30 ms   (28.12 ms .. 30.03 ms)

benchmarking ops/map/conduit
time                 36.69 ms   (35.73 ms .. 37.58 ms)

benchmarking ops/map/list-transformer
time                 84.06 ms   (75.02 ms .. 90.34 ms)

However when I run individual benchmarks the results are different:

$ stack bench --benchmark-arguments "-m glob ops/map/conduit"

benchmarking ops/map/conduit
time                 31.64 ms   (31.30 ms .. 31.86 ms)

$ stack bench --benchmark-arguments "-m glob ops/map/list-transformer"

benchmarking ops/map/list-transformer
time                 68.67 ms   (66.84 ms .. 70.96 ms)

To reproduce the issue just run those commands in this repo. The repo works with gauge.

I cannot figure out what the problem is. I tried using env to run the benchmarks and adding a threadDelay of a few seconds and a performGC, but nothing helps.

I am now resorting to always running each benchmark individually in a separate process. Maybe we can have support for running each benchmark in a separate process in criterion itself to guarantee isolation of benchmarks, as I have seen this sort of problem too often. Now I am always skeptical of the results produced by criterion.

Add some ANSI colours to stdout

Now that Basement.Terminal.ANSI has colors, it's probably a good idea to add some simple (and hopefully tasteful) coloring to the output.

Default match of benchmark name should be Exact and not Prefix

I was expecting the default match to be exact and I got bitten by this. I had two benchmark names such that one was a prefix of the other; when running in isolated mode and passing the benchmark name as an argument to run that particular benchmark in isolation, I always got the results for the longer name, and therefore the results of both benchmarks were the same. I spent precious hours debugging what was going on.

Do not average the iters

In verbose mode, the number of iterations is also printed as an average. It should not be an average; it should be an absolute number. It does not make sense to print the average of the iterations.

benchmarked elimination/toNull/streamly
time                 14.56 ms   (13.28 ms .. 15.77 ms)
                     0.997 R²   (0.990 R² .. 1.000 R²)
mean                 14.66 ms   (14.24 ms .. 15.06 ms)
std dev              571.9 μs   (406.1 μs .. 712.1 μs)
variance introduced by outliers: 14% (moderately inflated)
iters                4          (1 .. 6)
time                 14.66 ms   (13.98 ms .. 15.34 ms)

Notice that iters is printed as 4 (1 .. 6); there are 6 samples in total and we print 4 in the average field.

The relevant code to fix this is in analyseBenchmark:

              _ <- traverse
                    (\(k, (a, s, _)) -> reportStat Verbose a s k)
                    measureAccessors_

We should make an exception for the iters accessor and print it differently.
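A rough sketch of that special case, reusing the names from the snippet above and assuming the accessor key for iterations is the string "iters" (reportIterTotal is a hypothetical helper that would print the absolute count rather than a mean):

              _ <- traverse
                    (\(k, (a, s, _)) ->
                        if k == "iters"
                          then reportIterTotal a
                          else reportStat Verbose a s k)
                    measureAccessors_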

Ability to select perf counters to measure

It would be nice to have the ability to select specific counters, à la perf stat -e on Linux, for two reasons: one, if we support HW PMC measurements, the processor allows only a limited number of counters at a time; two, measuring one counter may impact the value of others, though I am not sure how important this is.

semantics of whnf (documentation issue?)

Gauge.Benchmark.whnf doc says "Apply an argument to a function, and evaluate the result to weak head normal form (WHNF)." What it does not say is whether the time needed for evaluating the argument is included in the measurement, or not.

I guess it is not, and I take the following experiment as confirmation:

Prelude Gauge.Main> benchmark $ whnf id (sum . enumFromTo 0 $ 1000)
benchmarking function ... took 16.17 s, total 60947101 iterations
function                                 time                 23.96 ns  

Prelude Gauge.Main> benchmark $ whnf (sum . enumFromTo 0) 1000
benchmarking function ... took 6.913 s, total 20152 iterations
function                                 time                 47.58 μs  

But what exactly are the semantics: is the argument (value) simply shared, so that the cost of its computation is accounted for in the first function call? Or is the argument forced (to WHNF?) before that? (That would be better.)
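One way to take the argument's cost out of the picture, regardless of the exact semantics, is to force it yourself before handing it to whnf; a sketch using Control.Exception.evaluate (gauge's env combinator would serve a similar purpose for more involved setup):

import Control.Exception (evaluate)
import Gauge.Main

main :: IO ()
main = do
  arg <- evaluate (sum (enumFromTo 0 (1000 :: Integer)))  -- pay for the argument up front
  benchmark (whnf id arg)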

Negative time value in analysis

It shouldn't really happen, but some people are seeing negative time values in reports:

e.g.

benchmarked xxx
time                 136.9 ms   (-340.9 ms .. 511.1 ms)
                     0.063 R²   (0.000 R² .. 0.999 R²)
mean                 1.010 s    (425.0 ms .. 3.335 s)
std dev              1.833 s    (12.24 ms .. 3.048 s)

Use better way to represent and manipulate the counters

Each counter has the following properties:

  • A way to measure the counter
  • A way to measure the diff of counters to get the measurement interval values
  • A way to represent an absent counter
  • A way to accumulate counters across multiple iterations. Some might be added; others might use max to record maximum values.
  • A measureAccessor (keys) to retrieve the value from the Measured record.
  • A show function for the counter

It may be a good idea to have a Counter typeclass to represent all of these in a better way than the ad hoc approach we use now. The MeasureDiff class is better, but it is not sufficient as it represents only the diff operation.

In addition to the operations listed above, we can also attach a measurement source to each counter and batch multiple counters having the same source together, measuring them in one go and then retrieving each value from the common source. This can help group counters dynamically if we allow users to select counters on the command line.
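A rough shape for such a typeclass (all names hypothetical, not gauge's API); the accessor into the Measured record is omitted for brevity:

class Counter c where
  readCounter :: IO c            -- take a raw reading of the counter
  diffCounter :: c -> c -> c     -- end minus start for a measurement interval
  absent      :: c               -- representation of an unavailable counter
  accumulate  :: c -> c -> c     -- combine across iterations (add, or max for peak values)
  render      :: c -> String     -- show function for reports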

Format of the benchmarking output on console

The Linux perf tool has a nice, readable output format and we can adopt the same; at least for the quick mode it applies directly:

interceptor:~$ perf stat ls

 Performance counter stats for 'ls':

          1.265128      task-clock (msec)         #    0.703 CPUs utilized          
                 4      context-switches          #    0.003 M/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                97      page-faults               #    0.077 M/sec                  
         2,006,759      cycles                    #    1.586 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
         1,332,533      instructions              #    0.66  insns per cycle        
           273,542      branches                  #  216.217 M/sec                  
     <not counted>      branch-misses            

       0.001799526 seconds time elapsed

Multiple iterations with --min-duration=0

I want to run my benchmarks as quickly as possible, just to check that they work.

So I'm using --time-limit=0 to get only a single sample and --min-duration=0 to get only a single iteration per sample.

Nonetheless gauge reports multiple iterations:

benchmarking Issue #108/Text ... took 341.6 ms, total 56 iterations
benchmarked Issue #108/Text
time                 6.434 ms   (5.934 ms .. 6.876 ms)
                     0.994 R²   (0.990 R² .. 1.000 R²)
mean                 6.017 ms   (5.917 ms .. 6.214 ms)
std dev              232.3 μs   (131.1 μs .. 337.9 μs)

The header of the CSV file generated by the --csvraw option is not correct

The header should include a column for the benchmark name as well, since each data row starts with it.

iters,time,cycles,cpuTime,utime,stime,maxrss,minflt,majflt,nvcsw,nivcsw,allocated,numGcs,bytesCopied,mutatorWallSeconds,mutatorCpuSeconds,gcWallSeconds,gcCpuSeconds

elimination/toNull/streamly,1,1.6136415e-2,35418132,1.6138e-2,0.0,0.0,3706880,0,0,0,3,95989920,93,3304,1.5653e-2,1.5653e-2,2.62e-4,3.24074e-4
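For example, keeping the existing column order, a corrected header could simply gain a leading name column (a suggestion, not current gauge output):

name,iters,time,cycles,cpuTime,utime,stime,maxrss,minflt,majflt,nvcsw,nivcsw,allocated,numGcs,bytesCopied,mutatorWallSeconds,mutatorCpuSeconds,gcWallSeconds,gcCpuSeconds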

Discuss: the use of overhead

I cannot get my head around this:

      stime     = measure (measTime . rescale) .
                  G.filter ((>= threshold) . measTime) . G.map fixTime .
                  G.tail $ meas
      fixTime m = m { measTime = measTime m - overhead / 2 }

Why are we adjusting the measurement time using the overhead? The overhead is the amount of time measure itself takes; how is that relevant? measTime has no relation to it. Maybe I am missing something, but it looks to me as if overhead and its use here can simply be removed.

Design of the command line interface

It may be a good idea to use a command-based CLI interface, with a command for each logical set of tasks. For example, I can imagine the following commands (each with its own options, not shown here):

$ gauge list     # List all available benchmarks
$ gauge evlist   # List all available event counters
$ gauge stats    # Run selected benchmarks and measure selected counter stats
$ gauge report   # Read the raw perf data file and generate a report (text, html, ...)
$ gauge diff     # Show a nice diff between two raw perf data files

Where possible we can also take cues from the Linux perf tool.

Hackage Release

Can this be released to hackage? I want to switch to this for benchmarking suites in most of my libraries (since I never need criterion's plots), but not having a hackage release is a showstopper.
