smarr / rebench
Execute and document benchmarks reproducibly.
License: MIT License
I think the support for getting multiple results for the same benchmark from the same run is currently missing or broken.
Configurations that use template parameters in their extra arguments are not recognized as identical to configurations in which the 'template parameters' have been filled in with hardcoded values.
The problem is probably that we add the extra arguments by format-variable expansion:
ReBench/rebench/model/run_id.py
Line 196 in e5ebe70
Context-oriented programming is only used sparingly.
While it is kind of nice, it is not really necessary and makes things more complex than they need to be.
To avoid systematic bias that might be caused by operating system caches or hardware/memory properties, the execution order of benchmark runs should be randomized.
This has the additional benefit of producing data points for more benchmarks early on, making it possible to see trends in the results sooner for long-running benchmark sets.
There needs to be an option to suppress the randomization, of course.
Also, we should have an estimate of what that means in terms of memory usage when many microbenchmarks are executed (we need to keep the data in memory to calculate confidence intervals, etc.).
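A minimal sketch of such a scheduler, assuming a hypothetical list of run identifiers and a no_shuffling option (the names are illustrative, not ReBench's actual API):

import random

def plan_execution_order(runs, no_shuffling=False, seed=None):
    # Shuffling the run order avoids systematic bias from OS caches and
    # memory layout; a fixed seed reproduces a specific order for debugging.
    order = list(runs)
    if not no_shuffling:
        random.Random(seed).shuffle(order)
    return order

# Interleaves invocations of different benchmarks instead of completing
# one benchmark before starting the next:
runs = ["Bench%d/invocation-%d" % (b, i) for b in (1, 2) for i in (1, 2, 3)]
print(plan_execution_order(runs, seed=42))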
If we have a complex configuration, it can be that benchmarks with identical configurations are listed multiple times to produce separate data sets.
Try to avoid executing them more than necessary.
How do we determine that runs are identical? Either based on the configuration, which is tricky because we can use string expansion à la %(cores)s, or based on the resulting command line, which might also be tricky if there are subtle whitespace differences.
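A sketch of deduplication keyed on the resulting command line, with whitespace normalized to sidestep the subtle-differences problem (the helper names are hypothetical):

import shlex

def normalized_cmdline(cmdline):
    # Tokenize like a shell and re-join, so runs that differ only in
    # whitespace map to the same key.
    return " ".join(shlex.split(cmdline))

def unique_cmdlines(cmdlines):
    seen = {}
    for cmd in cmdlines:
        seen.setdefault(normalized_cmdline(cmd), cmd)
    return list(seen.values())

print(unique_cmdlines(["./vm  --cores 4 Bench1", "./vm --cores 4  Bench1"]))
# -> one run instead of two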
Typical scenarios:
One might also want to rerun, or run specific experiments with specific parameters, but this is outside the scope of this feature request.
So, we want something like:
rebench test.conf TestExperiment vm:TestRunner1
rebench -r test.conf TestExperiment vm:TestRunner1 # -r for rerun (or perhaps -c, for clear)
rebench -r test.conf TestExperiment s:TestSuite1
rebench -r test.conf TestExperiment vm:TestRunner1 s:TestSuite1
rebench -r test.conf TestExperiment s:TestSuite1:Bench1
Add a performance reader for JMH.
Example output:
# Run progress: 0.00% complete, ETA 01:00:00
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: benchmarks.DynamicProxy.directAdd
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Fork: 1 of 10
# Warmup Iteration 1: 847885.303 ops/ms
# Warmup Iteration 2: 869209.997 ops/ms
# Warmup Iteration 3: 787127.216 ops/ms
# Warmup Iteration 4: 849333.002 ops/ms
# Warmup Iteration 5: 862511.213 ops/ms
# Warmup Iteration 6: 786574.891 ops/ms
# Warmup Iteration 7: 867692.766 ops/ms
# Warmup Iteration 8: 791901.852 ops/ms
# Warmup Iteration 9: 868440.246 ops/ms
# Warmup Iteration 10: 873144.727 ops/ms
# Warmup Iteration 11: 858841.746 ops/ms
# Warmup Iteration 12: 864258.483 ops/ms
# Warmup Iteration 13: 867792.566 ops/ms
# Warmup Iteration 14: 873802.641 ops/ms
# Warmup Iteration 15: 789308.386 ops/ms
# Warmup Iteration 16: 872348.119 ops/ms
# Warmup Iteration 17: 876049.520 ops/ms
# Warmup Iteration 18: 855590.678 ops/ms
# Warmup Iteration 19: 790754.207 ops/ms
# Warmup Iteration 20: 844763.982 ops/ms
Iteration 1: 851585.492 ops/ms
Iteration 2: 855210.272 ops/ms
Iteration 3: 863139.120 ops/ms
Iteration 4: 854572.548 ops/ms
Iteration 5: 848365.018 ops/ms
Iteration 6: 868452.069 ops/ms
Iteration 7: 874102.630 ops/ms
Iteration 8: 871221.945 ops/ms
Iteration 9: 872087.960 ops/ms
Iteration 10: 871954.737 ops/ms
Iteration 11: 866641.653 ops/ms
Iteration 12: 871745.541 ops/ms
Iteration 13: 873303.464 ops/ms
Iteration 14: 871289.619 ops/ms
Iteration 15: 871734.355 ops/ms
Iteration 16: 879997.366 ops/ms
Iteration 17: 871969.580 ops/ms
Iteration 18: 804149.821 ops/ms
Iteration 19: 874024.426 ops/ms
Iteration 20: 875282.521 ops/ms
Result: 864541.507 ±(99.9%) 14445.542 ops/ms [Average]
Statistics: (min, avg, max) = (804149.821, 864541.507, 879997.366), stdev = 16635.508
Confidence interval (99.9%): [850095.965, 878987.049]
# Run progress: 1.11% complete, ETA 01:11:46
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: benchmarks.DynamicProxy.directAdd
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Fork: 2 of 10
# Warmup Iteration 1: 857739.490 ops/ms
# Warmup Iteration 2: 869167.053 ops/ms
# Warmup Iteration 3: 866989.763 ops/ms
# Warmup Iteration 4: 867616.777 ops/ms
# Warmup Iteration 5: 855941.061 ops/ms
# Warmup Iteration 6: 863636.436 ops/ms
# Warmup Iteration 7: 869266.711 ops/ms
# Warmup Iteration 8: 864455.908 ops/ms
# Warmup Iteration 9: 865891.557 ops/ms
# Warmup Iteration 10: 864545.288 ops/ms
# Warmup Iteration 11: 785449.100 ops/ms
# Warmup Iteration 12: 871062.463 ops/ms
# Warmup Iteration 13: 865995.950 ops/ms
# Warmup Iteration 14: 869501.998 ops/ms
# Warmup Iteration 15: 880105.688 ops/ms
# Warmup Iteration 16: 870951.292 ops/ms
# Warmup Iteration 17: 869497.593 ops/ms
# Warmup Iteration 18: 789584.957 ops/ms
# Warmup Iteration 19: 865307.329 ops/ms
# Warmup Iteration 20: 864320.819 ops/ms
Iteration 1: 846892.297 ops/ms
Iteration 2: 858812.483 ops/ms
Iteration 3: 779040.228 ops/ms
Iteration 4: 866954.433 ops/ms
Iteration 5: 874218.456 ops/ms
Iteration 6: 871035.856 ops/ms
Iteration 7: 878649.265 ops/ms
Iteration 8: 791281.176 ops/ms
Iteration 9: 863840.816 ops/ms
Iteration 10: 870654.903 ops/ms
Iteration 11: 858951.775 ops/ms
Iteration 12: 781786.693 ops/ms
Iteration 13: 857076.130 ops/ms
Iteration 14: 869513.038 ops/ms
Iteration 15: 872952.031 ops/ms
Iteration 16: 871831.447 ops/ms
Iteration 17: 787480.350 ops/ms
Iteration 18: 870333.741 ops/ms
Iteration 19: 878597.978 ops/ms
Iteration 20: 868287.689 ops/ms
Result: 850909.539 ±(99.9%) 30180.380 ops/ms [Average]
Statistics: (min, avg, max) = (779040.228, 850909.539, 878649.265), stdev = 34755.771
Confidence interval (99.9%): [820729.159, 881089.919]
# Run progress: 10.00% complete, ETA 01:05:16
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: benchmarks.DynamicProxy.directAdd
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Fork: 10 of 10
# Warmup Iteration 1: 850773.837 ops/ms
# Warmup Iteration 2: 859996.736 ops/ms
# Warmup Iteration 3: 842337.855 ops/ms
# Warmup Iteration 4: 834024.179 ops/ms
# Warmup Iteration 5: 848325.927 ops/ms
# Warmup Iteration 6: 851807.203 ops/ms
# Warmup Iteration 7: 865749.630 ops/ms
# Warmup Iteration 8: 841759.249 ops/ms
# Warmup Iteration 9: 843872.638 ops/ms
# Warmup Iteration 10: 852743.625 ops/ms
# Warmup Iteration 11: 870366.746 ops/ms
# Warmup Iteration 12: 860670.067 ops/ms
# Warmup Iteration 13: 855269.930 ops/ms
# Warmup Iteration 14: 860215.809 ops/ms
# Warmup Iteration 15: 862334.297 ops/ms
# Warmup Iteration 16: 861751.244 ops/ms
# Warmup Iteration 17: 855697.310 ops/ms
# Warmup Iteration 18: 773933.681 ops/ms
# Warmup Iteration 19: 855363.310 ops/ms
# Warmup Iteration 20: 860512.882 ops/ms
Iteration 1: 784181.953 ops/ms
Iteration 2: 861926.105 ops/ms
Iteration 3: 854354.042 ops/ms
Iteration 4: 863663.976 ops/ms
Iteration 5: 869966.533 ops/ms
Iteration 6: 821378.328 ops/ms
Iteration 7: 866908.171 ops/ms
Iteration 8: 787129.306 ops/ms
Iteration 9: 780665.906 ops/ms
Iteration 10: 783710.954 ops/ms
Iteration 11: 869766.607 ops/ms
Iteration 12: 874771.924 ops/ms
Iteration 13: 874007.935 ops/ms
Iteration 14: 786184.900 ops/ms
Iteration 15: 867349.710 ops/ms
Iteration 16: 818710.228 ops/ms
Iteration 17: 786727.556 ops/ms
Iteration 18: 853130.618 ops/ms
Iteration 19: 869214.341 ops/ms
Iteration 20: 867435.571 ops/ms
Result: 837059.233 ±(99.9%) 33100.880 ops/ms [Average]
Statistics: (min, avg, max) = (780665.906, 837059.233, 874771.924), stdev = 38119.022
Confidence interval (99.9%): [803958.353, 870160.113]
# Run progress: 11.11% complete, ETA 01:04:28
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: benchmarks.DynamicProxy.proxiedAdd
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Fork: 1 of 10
# Warmup Iteration 1: 99681.741 ops/ms
# Warmup Iteration 2: 111011.439 ops/ms
# Warmup Iteration 3: 83221.079 ops/ms
# Warmup Iteration 4: 86550.458 ops/ms
# Warmup Iteration 5: 87100.410 ops/ms
# Warmup Iteration 6: 86719.841 ops/ms
# Warmup Iteration 7: 87628.795 ops/ms
# Warmup Iteration 8: 86472.207 ops/ms
# Warmup Iteration 9: 86111.395 ops/ms
# Warmup Iteration 10: 86871.991 ops/ms
# Warmup Iteration 11: 87797.001 ops/ms
# Warmup Iteration 12: 86590.364 ops/ms
# Warmup Iteration 13: 87005.565 ops/ms
# Warmup Iteration 14: 88105.287 ops/ms
# Warmup Iteration 15: 88517.748 ops/ms
# Warmup Iteration 16: 86863.272 ops/ms
# Warmup Iteration 17: 87413.754 ops/ms
# Warmup Iteration 18: 85960.142 ops/ms
# Warmup Iteration 19: 87216.054 ops/ms
# Warmup Iteration 20: 86368.302 ops/ms
Iteration 1: 85897.591 ops/ms
Iteration 2: 85818.520 ops/ms
Iteration 3: 86150.077 ops/ms
Iteration 4: 86313.090 ops/ms
Iteration 5: 86278.108 ops/ms
Iteration 6: 86504.070 ops/ms
Iteration 7: 85584.778 ops/ms
Iteration 8: 86987.707 ops/ms
Iteration 9: 85158.246 ops/ms
Iteration 10: 87069.476 ops/ms
Iteration 11: 88860.713 ops/ms
Iteration 12: 87230.651 ops/ms
Iteration 13: 88672.239 ops/ms
Iteration 14: 87435.816 ops/ms
Iteration 15: 83644.226 ops/ms
Iteration 16: 86858.133 ops/ms
Iteration 17: 86321.756 ops/ms
Iteration 18: 87300.606 ops/ms
Iteration 19: 85362.787 ops/ms
Iteration 20: 86763.998 ops/ms
Result: 86510.629 ±(99.9%) 1020.717 ops/ms [Average]
Statistics: (min, avg, max) = (83644.226, 86510.629, 88860.713), stdev = 1175.459
Confidence interval (99.9%): [85489.913, 87531.346]
We need a parser that extracts the measurement iterations from this output.
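A minimal sketch of such a parser; the regular expressions are written against the example above, and the tuple format is illustrative rather than ReBench's actual data-point model:

import re

BENCH_RE = re.compile(r"^# Benchmark:\s+(\S+)$")
# Measurement lines look like "Iteration 3: 863139.120 ops/ms";
# warmup lines carry a leading "# Warmup " prefix and are skipped here.
MEASURE_RE = re.compile(r"^Iteration\s+(\d+):\s+([0-9.]+)\s+(\S+)$")

def parse_jmh_output(output):
    benchmark = None
    points = []
    for line in output.splitlines():
        m = BENCH_RE.match(line)
        if m:
            benchmark = m.group(1)
            continue
        m = MEASURE_RE.match(line)
        if m:
            points.append((benchmark, int(m.group(1)),
                           float(m.group(2)), m.group(3)))
    return points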
When restarting ReBench based on an existing data file, it would be useful to first sort out all the runs that are already completed, to get a better estimate of the remaining time.
With large benchmark suites it can take hours to go through all runs, and it would be nice to get early feedback and to allow the results to be refined with more measurements later on.
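A sketch of that filtering step, assuming each run knows how many invocations the existing data file already holds (the attribute names are hypothetical):

from collections import namedtuple

Run = namedtuple("Run", "name completed_invocations")

def split_completed(runs, invocations_wanted):
    # Separate runs that already have enough data points from those that
    # still need executing, so the time estimate only counts the latter.
    done = [r for r in runs if r.completed_invocations >= invocations_wanted]
    todo = [r for r in runs if r.completed_invocations < invocations_wanted]
    return done, todo

done, todo = split_completed([Run("Bench1", 10), Run("Bench2", 3)], 10)
print([r.name for r in todo])  # -> ['Bench2']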
I am not using it anyway, because it makes reasoning about comparability of ratios hard for me.
And then, there seems to be an argument against it from a theoretical perspective: http://blog.regehr.org/archives/1024
I think it relies on the wrong number of total runs; we have already filtered out all the ones that are done.
Generally, the warning is confusing, and the right way to fix it is to solve issue #45.
ReBench should set a return code for the process when it ends but wasn't able to collect all desired data.
This is to avoid the need for wrapper scripts.
This is especially useful for round-robin or random execution.
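A sketch of what that could look like at the end of the main function, with a hypothetical count of successfully completed runs:

import sys

def finish(total_runs, completed_runs):
    # A non-zero exit status lets CI systems and shell scripts detect,
    # without wrapper scripts, that not all desired data was collected.
    sys.exit(0 if completed_runs == total_runs else 1)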
With this distinction, we can easily unify the benchmark names in post-processing.
We need to think about the codespeed reporting name as well; do we still need it?
There was a problem on serentity with using the latest version: a missing RunId module, I think.
The version that is currently deployed, however, also has issues with failing runs.
Some return value does not have enough components to be decomposed properly.
Sometimes we get a really unstable VM, or a VM that doesn't shut down properly after running the benchmarks. To be able to obtain at least some of the results, it would be useful to have a command-line switch to override the error checking.
Necessary adaptations include:
--- a/rebench/executor.py
+++ b/rebench/executor.py
@@ -151,15 +154,15 @@ class Executor:
stderr=subprocess.STDOUT,
shell=True,
timeout=run_id.bench_cfg.suite.max_runtime)
- if return_code != 0:
- run_id.indicate_failed_execution()
- run_id.report_run_failed(cmdline, return_code, output)
- if return_code == 126:
- logging.error(("Could not execute %s. A likely cause is that "
- "the file is not marked as executable.")
- % run_id.bench_cfg.vm.name)
- else:
- self._eval_output(output, run_id, gauge_adapter, cmdline)
+ #if return_code != 0:
+ # run_id.indicate_failed_execution()
+ # run_id.report_run_failed(cmdline, return_code, output)
+ # if return_code == 126:
+ # logging.error(("Could not execute %s. A likely cause is that "
+ # "the file is not marked as executable.")
+ # % run_id.bench_cfg.vm.name)
+ #else:
+ self._eval_output(output, run_id, gauge_adapter, cmdline)
return self._check_termination_condition(run_id, termination_check)
and
--- a/rebench/interop/rebench_log_adapter.py
+++ b/rebench/interop/rebench_log_adapter.py
@@ -47,9 +47,9 @@ class RebenchLogAdapter(GaugeAdapter):
current = DataPoint(run_id)
for line in data.split("\n"):
- if self.check_for_error(line):
- raise ResultsIndicatedAsInvalid(
- "Output of bench program indicated error.")
+ #if self.check_for_error(line):
+ # raise ResultsIndicatedAsInvalid(
+ # "Output of bench program indicated error.")
m = self.re_log_line.match(line)
if m:
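Instead of commenting the checks out, a command-line switch could guard them; a self-contained sketch, where include_faulty stands in for a hypothetical --faulty option:

import logging

def handle_run_result(return_code, output, eval_output, report_failed,
                      include_faulty=False):
    # With --faulty we still try to salvage whatever data points the
    # unstable VM managed to produce before failing.
    if return_code != 0 and not include_faulty:
        report_failed(return_code, output)
        if return_code == 126:
            logging.error("Could not execute the VM. A likely cause is "
                          "that the file is not marked as executable.")
    else:
        eval_output(output)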
This is required for use in CI environments.
Useful notifications would be, for example, on completion.
Remove the graph generation that's currently in ReBench and replace it with an R implementation.
See whether it's possible and efficient to have incremental graph generation.
How can we express the graphs' dependencies on benchmarks properly in ReBench?
Avoid having to keep them all in one file, also to make it easier to drop in new ones.
Most output is currently only displayed with the -d switch.
Make output such as the information that enables debugging of failed runs conditional on a run actually failing.
Display general progress output without requiring the -d switch.
Also, make the final reporting output human-readable; we already have machine-readable files.
Needed for debugging and problem analysis.
We need to be able to express in the config whether parallel execution is allowed.
Perhaps at least at the VM level.
Current use case: Graal+Truffle are highly parallel and use all cores, but RPython is strictly single-core.
Also, the interpreter versions probably do not interfere with each other.
We need to be able to configure some maximum degree of parallelism.
I am not entirely sure that parallel execution is going to be interference-free, so we definitely do not want to overload the machine.
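A sketch of a bounded scheduler honoring a per-VM parallelizability flag (the Run fields and the two-phase split are hypothetical, not ReBench's model):

from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor
import subprocess

Run = namedtuple("Run", "cmdline vm_allows_parallel")

def execute_run(run):
    subprocess.call(run.cmdline, shell=True)

def execute_all(runs, max_parallelism=4):
    # Parallel-safe runs share a bounded pool to avoid overloading the
    # machine; runs on single-core VMs (e.g., RPython) execute one by one.
    parallel = [r for r in runs if r.vm_allows_parallel]
    serial = [r for r in runs if not r.vm_allows_parallel]
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        list(pool.map(execute_run, parallel))
    for run in serial:
        execute_run(run)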
I think it consumes JSON for the data files, so that part should be easy.
https://github.com/scalameter/scalameter
License seems to be a 3-clause BSD version, so having the code side by side in the repo should be fine, I guess...
For example, warn when finding old ulimit entries, which are not supported anymore.
The profiling support is probably broken, and hasn't been used in years.
Remove it to simplify the code.
Currently, the support for Caliper's output is not yet adapted to the new ReBench implementation.
Relevant properties could be minimum runtime and error (standard deviation or confidence interval size).
It is demonstrated by configurator_test.py test_number_of_experiments_testconf.
Add incremental result reporting to codespeed.
This allows us to see results earlier, and also when the benchmark run was aborted for some reason.
ulimit does not work for tile-monitor or for wrapper scripts that do not themselves consume CPU time.
It needs to be replaced by something that times out with respect to wall-clock time.
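A sketch of a wall-clock-based replacement, using the timeout support of Python's subprocess module (max_runtime would come from the suite configuration):

import subprocess

def run_with_wallclock_limit(cmdline, max_runtime):
    # Unlike ulimit's CPU-time limit, this fires after max_runtime
    # wall-clock seconds even if the process sits idle.
    try:
        return subprocess.run(cmdline, shell=True,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              timeout=max_runtime)
    except subprocess.TimeoutExpired:
        return None  # treat as a timed-out run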
Currently, the warning is also shown when another config is executed and the 'unavailable' one merely appears in the same config file.
For continuous performance tracking during development, it is important to automatically account for changes in the warmup time benchmarks take, in order to keep track of the achievable peak performance. At the same time, it is still important to minimize the overall benchmark runtime to be able to experiment properly.
[Note: this is targeted towards micro- and macrobenchmarks with reasonably small runtimes to be practical.]
While Kalibera and Jones (2013, http://kar.kent.ac.uk/33611/) advocate a convincing manual method to determine whether a real steady state is reached and whether the measurements from the same VM invocation are independent, I need something more practical: something that is completely automated, robust, and parameterizable.
I think I am going to take a slightly parameterized version of Georges et al.'s method (2007, http://buytaert.net/files/oopsla07-georges.pdf), with the following parameters:
CoV, min_i, k, max_i, max_runtime
CoV: standard deviation over all measurements (at least the minimum number of iterations) divided by their mean (sd(m)/mean(m))
The reported data should include whether max_i was reached, the number of iterations i before reaching the 'reasonably-steady-state', and the k measurements (after i iterations).
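A sketch of the detection loop with these parameters (pure statistics, no benchmark plumbing; the 0.02 threshold is just an example, and max_runtime is left out):

import statistics

def cov(measurements):
    # Coefficient of variation: sd(m) / mean(m)
    return statistics.stdev(measurements) / statistics.mean(measurements)

def run_until_steady(measure, min_i=5, k=10, max_i=100, threshold=0.02):
    # Invoke measure() until the CoV over the last k measurements drops
    # below the threshold, then report those k data points; give up once
    # max_i is reached (max_runtime would additionally bound wall-clock time).
    m = []
    for i in range(max_i):
        m.append(measure())
        if len(m) >= max(min_i, k) and cov(m[-k:]) < threshold:
            return m[-k:]
    return m  # 'reasonably-steady-state' not reached within max_i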
Currently, a failing benchmark (Richards) is recognized properly, but ReBench fails to process the resulting exception and terminates.
Output:
Starting Richards benchmark ...
Results are incorrect
Traceback (most recent call last):
File "/usr/bin/rebench", line 9, in <module>
load_entry_point('ReBench==0.2.2', 'console_scripts', 'rebench')()
File "/home/smarr/Projects/ReBench/rebench/rebench.py", line 161, in main_func
return ReBench().run()
File "/home/smarr/Projects/ReBench/rebench/rebench.py", line 141, in run
self.execute_experiment()
File "/home/smarr/Projects/ReBench/rebench/rebench.py", line 156, in execute_experiment
executor.execute()
File "/home/smarr/Projects/ReBench/rebench/executor.py", line 177, in execute
self._scheduler.execute()
File "/home/smarr/Projects/ReBench/rebench/executor.py", line 70, in execute
completed = self._executor.execute_run(run)
File "/home/smarr/Projects/ReBench/rebench/executor.py", line 114, in execute_run
termination_check)
File "/home/smarr/Projects/ReBench/rebench/executor.py", line 148, in _generate_data_point
self._eval_output(output, run_id, perf_reader, cmdline)
File "/home/smarr/Projects/ReBench/rebench/executor.py", line 154, in _eval_output
data_points = perf_reader.parse_data(output, run_id)
File "/home/smarr/Projects/ReBench/rebench/performance.py", line 92, in parse_data
raise RuntimeError("Output of bench program indicated error.")
RuntimeError: Output of bench program indicated error.
It would be nice to pass environment variables on to the binary/VM.
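A sketch using subprocess; the per-VM env mapping is a hypothetical config entry:

import os
import subprocess

def execute_with_env(cmdline, extra_env):
    # Extend, rather than replace, the inherited environment so that
    # PATH and similar variables keep working.
    env = dict(os.environ)
    env.update(extra_env)
    return subprocess.call(cmdline, shell=True, env=env)

# e.g.: execute_with_env("./vm Bench1", {"JAVA_OPTS": "-Xmx2g"})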
Currently, the exception terminates ReBench; it shouldn't...
It looks like this happens when "all" is used and there are multiple experiments.
Currently, results are reported in microseconds, which is not useful.
The time resolution isn't good enough anyway, and measurement errors are well beyond microseconds, too.
We need to check whether the nice tool can be executed, i.e., whether sufficient permissions are available.
ReBench should report this and automatically suppress its usage to avoid failing runs. Another option would be to abort directly.
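A sketch of the availability check: try to raise the priority of a trivial command and fall back when that is not permitted (the fallback policy itself is the open question above):

import logging
import subprocess

def can_use_nice():
    try:
        proc = subprocess.run(["nice", "-n", "-20", "true"],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.PIPE)
        # GNU nice may still run the command when it cannot raise the
        # priority, but it prints a diagnostic; treat that as failure.
        return proc.returncode == 0 and not proc.stderr
    except OSError:
        return False  # no nice binary at all

if not can_use_nice():
    logging.warning("Cannot use nice; executing benchmarks without it.")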
Having somewhere a "%(benchmark)" without the s at the end (i.e., not "%(benchmark)s") leads to a cryptic ValueError.
Handle the error and print a nice error message, perhaps even pointing out where the conversion character is missing.
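For reference, the underlying behavior, and a sketch of catching it to produce a friendlier message:

# "%(benchmark)" lacks the conversion character, so %-expansion fails
# with a rather unhelpful error:
template = "./vm %(benchmark)"
try:
    template % {"benchmark": "Richards"}
except ValueError as e:
    print(e)  # -> incomplete format
    print("Invalid placeholder in '%s'; did you mean '%%(benchmark)s'?"
          % template)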