
hardware-effects's Introduction

Hardware effects

This repository demonstrates various hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU and OS architecture. For each effect I try to create a proof of concept program that is as small as possible so that it can be understood easily.

You can check out my talk about hardware effects from the Meeting C++ 2019 conference.

Related repository with GPU hardware effects: https://github.com/kobzol/hardware-effects-gpu

Those effects obviously depend heavily on your CPU microarchitecture and model, so the demonstration programs may not showcase the slowdown on your CPU, but I try to make them as general as I can. That said, the examples are targeting x86-64 processors (Intel and AMD) and may not make sense on other CPU architectures. I focus on effects that should be observable on commodity (desktop/notebook) hardware, so I don't include things like NUMA effects here (although in a few years they might be common even in personal computers). The code is mainly tested on Linux.

Currently the following effects are demonstrated:

  • 4k aliasing
  • bandwidth saturation
  • branch misprediction
  • branch target misprediction
  • cache conflicts
  • cache/memory hierarchy bandwidth
  • data dependencies
  • denormal floating point numbers
  • DRAM refresh interval
  • false sharing
  • hardware prefetching
  • hardware store elimination
  • memory-bound program
  • misaligned accesses
  • non-temporal stores
  • software prefetching
  • store buffer capacity
  • write combining

Every example directory has a README that explains the individual effects.

Isolating those hardware effects can be very tricky, so it's possible that some of the examples are actually demonstrating something else entirely (or nothing at all :) ). If you have a better explanation of what is happening, please let me know in the issues. Ideally the code would be written in assembly, but that would lower its readability. I wrote it in C++ in a way that (hopefully) forces the compiler to emit the instructions that I want (even with -O3).

Benchmarking

For all benchmarks I recommend turning off frequency scaling, hyper-threading, Turbo mode, address space randomization and anything else that can increase noise. I'm using the following commands:

$ sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space"           # address randomization
$ sudo bash -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" # Turbo mode
$ sudo bash -c "echo 0 > /sys/devices/system/cpu/cpuX/online"           # hyper-threading
$ ...                                                                   # for all hyper-threading CPUs
$ sudo cpupower frequency-set --governor performance                    # frequency scaling

You can find more tips here.

Build

$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j

If you want to use the benchmark scripts (written in Python 3), you should also install the Python dependencies:

$ pip install -r requirements.txt

Docker

You can download a prebuilt image:

$ docker pull kobzol/hardware-effects

or build it yourself:

$ docker build -t hardware-effects .

Then run it:

# interactive run
$ docker run --rm -it hardware-effects

# directly launch a program
$ docker run hardware-effects build/branch-misprediction/branch-misprediction 1

License

MIT

hardware-effects's People

Contributors

awesomebytes, bejado, kobzol, nanxiao, paullee73, travisdowns, vellvisher

hardware-effects's Issues

cache-memory-bound: varying results

Hi,

I compiled the cache-memory-bound example with the Intel C++ compiler. I wanted to check what VTune says about how memory bound the program is.

The issue is that (in the increment 1 case) I saw various numbers reaching from 0% memory bound to 100% memory bound.

So I increased the size of the array by a factor of 4 but without luck.

Additionally, multiple executions of the program lead to pretty different timings:

12:58:33 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
902
12:58:57 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
881
12:59:25 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
884
12:59:27 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
887
12:59:29 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
887
12:59:31 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
741
12:59:33 fnick@leo3:cache-memory-bound (master)$ cache-memory-bound 1
737

The processor is an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (two of them), so I cannot rule out that the processor is changing its core frequency at some point. But the difference in the results looks somewhat suspicious, so I'd be interested in your opinion.
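When run-to-run timings drift like this, one common way to make them comparable is to repeat the measurement in-process and report the median instead of a single value. The following is only a minimal sketch of that idea; `median_ms` and `time_runs_ms` are hypothetical helper names and are not part of this repository.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a set of timing samples. The vector is taken by value so the
// caller's data stays in its original order.
double median_ms(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return n % 2 == 1 ? samples[n / 2]
                      : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

// Runs `work` the given number of times and returns each wall-clock
// duration in milliseconds (hypothetical helper, not the repository's code).
template <typename Fn>
std::vector<double> time_runs_ms(Fn&& work, std::size_t runs) {
    std::vector<double> times;
    times.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        times.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return times;
}
```

Applied to the seven timings listed above, the median is 884, which hides the two faster runs. Printing the minimum and maximum next to it keeps that kind of shift visible.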

Intel Compiler problem

Hi,

I tried to compile everything with the Intel compilers and the first thing that happened was the same as in #3. This can be easily fixed by #if defined(__clang__) || defined(__INTEL_COMPILER) in software-prefetching.cpp.

Next, the following error comes up:

[ 96%] Building CXX object prefetching/CMakeFiles/software-prefetching.dir/software-prefetching.cpp.o
/home/fnick/hardware-effects/prefetching/software-prefetching.cpp(37): error: argument of type "Type *" is incompatible with parameter of type "const char *"
              if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                         ^
          detected during instantiation of "size_t={unsigned long} test_memory<Prefetch,Hint>(const std::vector<Type *, std::allocator<Type *>> &, int) [with Prefetch=true, Hint=(unsigned char)'\001']" at line 79

/home/fnick/hardware-effects/prefetching/software-prefetching.cpp(37): error: argument of type "Type *" is incompatible with parameter of type "const char *"
              if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                         ^
          detected during instantiation of "size_t={unsigned long} test_memory<Prefetch,Hint>(const std::vector<Type *, std::allocator<Type *>> &, int) [with Prefetch=true, Hint=(unsigned char)'\002']" at line 80

/home/fnick/hardware-effects/prefetching/software-prefetching.cpp(37): error: argument of type "Type *" is incompatible with parameter of type "const char *"
              if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                         ^
          detected during instantiation of "size_t={unsigned long} test_memory<Prefetch,Hint>(const std::vector<Type *, std::allocator<Type *>> &, int) [with Prefetch=true, Hint=(unsigned char)'\003']" at line 81

/home/fnick/hardware-effects/prefetching/software-prefetching.cpp(37): error: argument of type "Type *" is incompatible with parameter of type "const char *"
              if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                         ^
          detected during instantiation of "size_t={unsigned long} test_memory<Prefetch,Hint>(const std::vector<Type *, std::allocator<Type *>> &, int) [with Prefetch=true, Hint=(unsigned char)'\000']" at line 82

/home/fnick/hardware-effects/prefetching/software-prefetching.cpp(37): error: argument of type "Type *" is incompatible with parameter of type "const char *"
              if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                         ^
          detected during instantiation of "size_t={unsigned long} test_memory<Prefetch,Hint>(const std::vector<Type *, std::allocator<Type *>> &, int) [with Prefetch=false, Hint=(unsigned char)'\001']" at line 86

Might be my mistake though, any hints? If any testing is required with the Intel Compilers, I'm happy to help...
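The error message says the Intel compiler declares the first parameter of `_mm_prefetch` as `const char *`, so one possible workaround is an explicit cast at the call site. The sketch below assumes an x86 target with SSE. `Node` is a hypothetical stand-in for the example's `Type`, and `prefetch_t0` and `sum_with_prefetch` are illustrative names, not the repository's functions.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#endif

struct Node { std::size_t value; };  // hypothetical stand-in for "Type"

// Prefetch wrapper: the cast to const char* satisfies compilers that declare
// _mm_prefetch(const char*, ...) and is harmless where it takes const void*.
template <typename T>
inline void prefetch_t0(const T* p) {
#if defined(__SSE__) || defined(_M_X64)
    _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
#else
    (void)p;  // no-op on targets without SSE
#endif
}

// Sums the pointed-to values while prefetching `distance` elements ahead.
std::size_t sum_with_prefetch(const std::vector<Node*>& memory,
                              std::size_t distance) {
    std::size_t sum = 0;
    for (std::size_t j = 0; j < memory.size(); ++j) {
        if (j + distance < memory.size()) prefetch_t0(memory[j + distance]);
        sum += memory[j]->value;
    }
    return sum;
}
```

The hint is passed as a literal constant here because the type of the hint argument differs between compilers (an enum in some headers, a plain integer macro in others).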

store-buffer-capacity result not correct and varied a lot

When I run it with

perf stat -e resource_stalls.sb store-buffer-capacity

the resource_stalls.sb result varies a lot (each run gives quite a different result).

It seems this approach cannot be used to judge the store buffer capacity. Is this tool still valid for modern CPUs?

Wrong bandwidth-saturation benchmark results due to compiler optimisations

Hey, I ran bandwidth-saturation on a Ryzen 9 5950X (Debian Bullseye) and got some concerning results:

./bandwidth-saturation 0 32 -> 224 ms
./bandwidth-saturation 1 32 -> 1979 ms

Those benchmarks as provided are doing ~85GB of writes to memory. If the first number were correct, I'd be getting 280GB/s of throughput to my RAM, which is almost three times more than should be possible.

Disassembly of non-temporal thread-fn:

Upon disassembling, we see that the REPETITIONS loop got optimized away. This is in contrast to the temporal version, which has this loop:


I can also artificially manipulate the benchmark results by recompiling with a different REPETITIONS constant:
REPETITIONS=20 (original value)

REPETITIONS=1


One way of fixing this (THAT I AM NOT SURE IS CORRECT) is to declare items as volatile Type. I'm too stupid to really argue about its correctness, or about the reason for the compiler eating away that loop only in the non-temporal version, but:

REPETITIONS=20 with line 20 changed to void thread_fn(volatile Type* items, size_t size)

the assembly looks as follows (exactly like the temporal version, except the mov instruction changed to movnti)


EDIT: removed a wrong conclusion, since I had mistakenly assumed that more is better in this benchmark. Non-temporal stores are still superior.
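The `volatile` change suggested above can be sketched as follows. Each write through a `volatile` pointer counts as observable behaviour, so the compiler has to keep every store in every repetition. The `Type` alias, the stored value and the use of plain stores instead of non-temporal ones are assumptions made for this illustration; only the function signature and `REPETITIONS=20` come from the issue.

```cpp
#include <cstddef>
#include <cstdint>

using Type = std::uint64_t;      // assumption: the real element type may differ
constexpr int REPETITIONS = 20;  // the "original value" quoted in the issue

// Writes every element REPETITIONS times. Because `items` points to volatile
// data, none of the stores may be merged or dropped by the optimizer.
void thread_fn(volatile Type* items, std::size_t size) {
    for (int r = 0; r < REPETITIONS; ++r) {
        for (std::size_t i = 0; i < size; ++i) {
            items[i] = static_cast<Type>(r);  // hypothetical payload
        }
    }
}
```

Whether this is the right fix for the benchmark is left open in the issue. The sketch only shows why the repetition loop can no longer be removed once the stores are volatile.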

cache-aliasing example should be "cache-thrashing"

Hi Jakub,

Greetings from me!

I think the cache-aliasing example actually demonstrates "cache-thrashing" rather than "cache-aliasing". According to Wikipedia:

Aliasing: Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.

The point of "cache-aliasing" is that more than one copy of a physical address may reside in the cache at the same time. What your program demonstrates is that, because of conflicts in a cache set, the cache frequently evicts cache lines, which is typical "cache-thrashing".
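The set conflicts described here can be illustrated with a little address arithmetic. This is a simplified sketch: the geometry in the usage note (64-byte lines, 64 sets) is an assumption, not a property of any particular CPU, and the model ignores the difference between virtual and physical indexing.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified cache geometry (both fields are assumptions for illustration).
struct CacheGeometry {
    std::size_t line_size;  // bytes per cache line
    std::size_t num_sets;   // number of sets in the cache
};

// Index of the set that an address maps to in a set-associative cache.
constexpr std::size_t set_index(std::uintptr_t address, CacheGeometry g) {
    return (address / g.line_size) % g.num_sets;
}

// Addresses that are a multiple of this stride apart land in the same set.
constexpr std::size_t conflict_stride(CacheGeometry g) {
    return g.line_size * g.num_sets;
}
```

With `CacheGeometry{64, 64}` the conflict stride is 4096 bytes. Once more lines with that spacing are accessed than the set has ways, they start evicting each other, which is the thrashing behaviour described above.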

Hope you check it, thanks!

Best Regards
Nan Xiao

4K aliasing correct perf counter

I am curious what the 'correct' perf counter for 4K aliasing is. You have mentioned ld_blocks.store_forward, but I was wondering about the other counter, ld_blocks_partial.address_alias, as well.

Here is the perf list description:

  ld_blocks.store_forward                           
       [loads blocked by overlapping with store buffer that cannot be forwarded]
  ld_blocks_partial.address_alias                   
       [False dependencies in MOB due to partial compare on address]

Here are the perf results on my machine:

$ perf stat -e ld_blocks_partial.address_alias,ld_blocks.store_forward ./a.out 4096
222

 Performance counter stats for './a.out 4096':

             6,852      ld_blocks_partial.address_alias:u                                   
                32      ld_blocks.store_forward:u                                   

       0.224647447 seconds time elapsed

$ perf stat -e ld_blocks_partial.address_alias,ld_blocks.store_forward ./a.out 4092
359

 Performance counter stats for './a.out 4092':

       132,139,399      ld_blocks_partial.address_alias:u                                   
         2,097,093      ld_blocks.store_forward:u                                   

       0.361229917 seconds time elapsed

As you can see, both of them are hugely different for 4092 and 4096.
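For reference, the "partial compare on address" in the counter description is usually understood as a comparison of only the low 12 address bits. The sketch below models just that check. It ignores access sizes and partial overlaps, so on its own it does not explain which increment triggers the counters in the runs above, since that depends on which load and store are paired in the loop. The function names are hypothetical.

```cpp
#include <cstdint>

// True when two addresses agree in their low 12 bits (same offset in a 4 KiB page).
constexpr bool same_low_12_bits(std::uintptr_t a, std::uintptr_t b) {
    return ((a ^ b) & 0xFFFu) == 0;
}

// A load may be falsely treated as dependent on an earlier store when the
// addresses differ but a low-12-bit comparison cannot tell them apart.
constexpr bool may_4k_alias(std::uintptr_t store_addr, std::uintptr_t load_addr) {
    return store_addr != load_addr && same_low_12_bits(store_addr, load_addr);
}
```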

It's not write combining

Despite what the mechanical sympathy article says, it's not write combining that produces the effects you see in the write-combining tests. Neither Intel nor AMD use WC buffers for normal writes, only for "WC protocol" writes, which basically means writes to WC memory and writes to WB memory ("normal" memory) using NT stores.

You could probe the WC behavior using NT stores. It still leaves open the question of what produces the effect you see.

Determining cache line size

This isn't really an issue, but more of a request.

I think it would be really nice to have a program (much like cache-hierarchy-bandwidth) which determines the cache line size. It is surprisingly difficult to isolate the cache line size effect (prefetching makes everything hard). I tried to follow the advice mentioned in the top answer here: https://stackoverflow.com/questions/12675092/how-to-find-the-size-of-the-l1-cache-line-size-with-io-timing-measurements but it doesn't work on my computer.

Additional hardware effects ideas

  • non-temporal stores
  • multiple threads saturating the memory bus
  • hardware prefetching with indexed accesses
  • floating point handling (denormals etc.)
  • 4k aliasing
  • store buffer capacity
  • instruction cache misses
  • TLB misses
  • more multithreading examples (lock contention etc.)
  • vector instructions
  • critical word load
  • CUDA examples

taskset?

I'm not sure if I'm calling stuff the right way. Is there an outer script that I need to call to get the setup done correctly?

[prefetching:118] python3 software-prefetching.py ../../build/prefetching/software-prefetching 
Traceback (most recent call last):
  File "software-prefetching.py", line 18, in <module>
    frame = benchmark(data, pin_to_cpu=True, repeat=2)
  File "/Users/eijkhout/Current/istc/hpc-book-private/hardware/hardware-effects/utils.py", line 53, in benchmark
    times = [float(res) for res in run_repeatable(executable, repeat, values, pin_to_cpu)]
  File "/Users/eijkhout/Current/istc/hpc-book-private/hardware/hardware-effects/utils.py", line 53, in <listcomp>
    times = [float(res) for res in run_repeatable(executable, repeat, values, pin_to_cpu)]
  File "/Users/eijkhout/Current/istc/hpc-book-private/hardware/hardware-effects/utils.py", line 36, in run_repeatable
    stderr=subprocess.PIPE)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'taskset': 'taskset'

Repetition count

I'm toying around with the cache-conflicts code, and I'm using a higher count than you did in your benchmark. However, the repetition count then makes the program run for absolutely forever. So maybe you could do:

#ifndef REPETITIONS
#define REPETITIONS 10000000
#endif

so that it becomes possible to override that count.

After that I can of course recompile by hand, but beats me if I can figure out how to use your build system and pass the -DREPETITIONS=1000 to the compiler.

Thoughts?

Unable to launch from prebuilt-image

Hi!
I just tried docker pull kobzol/hardware-effects which gives me

Status: Image is up to date for kobzol/hardware-effects:latest

then docker run --rm -it hardware-effects which gives me

Unable to find image 'hardware-effects:latest' locally

Am I missing a step or is there something that needs to be added to the README?

Compiler error, "error: ‘_mm_hint’ has not been declared"

The build seems to be failing on my system, and it seems some people on reddit have the same issue. I am running Arch with GCC 8.2.1, as can be seen when running cmake. Let me know if you need any other information and I will be happy to provide it.

Attached is the build log:

⋊> /t/l/b cmake -DCMAKE_BUILD_TYPE=Release ../hardware-effects/                                       22:29:04
-- The C compiler identification is GNU 8.2.1
-- The CXX compiler identification is GNU 8.2.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/lol/b
⋊> /t/l/b make -j                                                                                     22:29:26
Scanning dependencies of target cache-memory-bound
Scanning dependencies of target data-dependency
Scanning dependencies of target branch-misprediction
Scanning dependencies of target branch-target-misprediction
Scanning dependencies of target cache-hierarchy-bandwidth
Scanning dependencies of target cache-aliasing
Scanning dependencies of target prefetching
[  5%] Building CXX object cache-memory-bound/CMakeFiles/cache-memory-bound.dir/cache-memory-bound.cpp.o
Scanning dependencies of target false-sharing
[ 11%] Building CXX object data-dependency/CMakeFiles/data-dependency.dir/data-dependency.cpp.o
[ 16%] Building CXX object branch-misprediction/CMakeFiles/branch-target-misprediction.dir/branch-target-misprediction.cpp.o
[ 22%] Building CXX object branch-misprediction/CMakeFiles/branch-misprediction.dir/branch-misprediction.cpp.o
[ 27%] Building CXX object cache-aliasing/CMakeFiles/cache-aliasing.dir/cache-aliasing.cpp.o
[ 33%] Building CXX object cache-hierarchy-bandwidth/CMakeFiles/cache-hierarchy-bandwidth.dir/cache-hierarchy-bandwidth.cpp.o
[ 38%] Building CXX object false-sharing/CMakeFiles/false-sharing.dir/false-sharing.cpp.o
[ 44%] Building CXX object prefetching/CMakeFiles/prefetching.dir/prefetching.cpp.o
Scanning dependencies of target write-combining
[ 50%] Building CXX object write-combining/CMakeFiles/write-combining.dir/write-combining.cpp.o
[ 55%] Linking CXX executable cache-hierarchy-bandwidth
[ 61%] Linking CXX executable cache-aliasing
[ 66%] Linking CXX executable cache-memory-bound
[ 72%] Linking CXX executable data-dependency
[ 72%] Built target cache-aliasing
[ 72%] Built target cache-hierarchy-bandwidth
/tmp/lol/hardware-effects/prefetching/prefetching.cpp:19:26: error: ‘_mm_hint’ has not been declared
 template <bool Prefetch, _mm_hint Hint>
                          ^~~~~~~~
/tmp/lol/hardware-effects/prefetching/prefetching.cpp: In function ‘size_t test_memory(const std::vector<Data*>&, int)’:
/tmp/lol/hardware-effects/prefetching/prefetching.cpp:32:62: error: ‘Hint’ was not declared in this scope
             if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                                              ^~~~
/tmp/lol/hardware-effects/prefetching/prefetching.cpp:32:62: note: suggested alternative: ‘uint’
             if (Prefetch) _mm_prefetch(memory[j + distance], Hint);
                                                              ^~~~
                                                              uint
/*** SNIP ****/
/tmp/lol/hardware-effects/prefetching/prefetching.cpp:20:8: note:   template argument deduction/substitution failed:
/tmp/lol/hardware-effects/prefetching/prefetching.cpp:75:65: error: template argument 2 is invalid
         sum = test_memory<false, _MM_HINT_T0>(pointers, distance);
                                                                 ^
[ 72%] Built target cache-memory-bound
[ 77%] Linking CXX executable branch-misprediction
[ 77%] Built target data-dependency
[ 83%] Linking CXX executable false-sharing
make[2]: *** [prefetching/CMakeFiles/prefetching.dir/build.make:63: prefetching/CMakeFiles/prefetching.dir/prefetching.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:460: prefetching/CMakeFiles/prefetching.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 88%] Linking CXX executable write-combining
[ 94%] Linking CXX executable branch-target-misprediction
[ 94%] Built target false-sharing
[ 94%] Built target branch-misprediction
[ 94%] Built target branch-target-misprediction
[ 94%] Built target write-combining
make: *** [Makefile:84: all] Error 2
