Giter Club home page Giter Club logo

spring_2020_tutorial's Introduction

Hardware, Software, and Compilers! Oh My!

This repository contains all code/links to all examples presented in the Spring 2020 "Hardware, Software, and Compilers! Oh My!" tutorial. I've provided a breakdown of all of the topics below, along with supplemental links for those interested.

Cheers,

Nick

The Small Buffer Optimization

Dynamic allocations are expensive, and we'd like to avoid them (if possible). One way we can do this is the small buffer optimization. We can leave some extra space inside of a handle to a larger dynamic allocation (like a std::string) where we can store a small number of elements. By doing this, we can defer any dynamic allocation until we run out of space inside the handle. All major compilers implement a form of this optimization for std::string in C++ (but interestingly, they don't all do it the same way!). In this example, we will be looking at a few implementations of the small buffer optimization (known as the short/small string optimization), and benchmarking the cost of dynamic allocation.

Short string optimization source code

Relevant Links

The strange details of std::string at Facebook

Copy Elision

Copying a large object can be expensive. Compiler writers understand this, and have been implementing optimization passes for copy elision to remove unnecessary copies. A common form of this is the Return Value Optimization (RVO) and Named Return Value Optimizaiton (NRVO). These deal with eliding the copy of an object during a function return. However, you're compiler can't always elide the copy! In this example, we will look at some source and disassembly of a simple function to better understand RVO, and when/where it may not be applicable

Source and disassembly

Relevant Links

Copy elision in the C++ standard

Arthur O'Dwyer's 2018 CppCon Talk

Aliasing and Compiler Optimization

Compilers are smart, and can exploit opportunties that even the most veteran of programmers can't notice. However, compilers aren't omniciant, and this can lead them to being overly conservative. A great example of this is aliasing. If a compiler can not figure out if two references point to the same piece of memory, it may not be able to perform certain optimizations (e.g, vectorization). In this example, we will look at the source and disassembly of a simple function, and measure the performance impact of aliasing on a simple matrix multiplication CUDA kernel.

Source and disassembly

CUDA matrix multiplication source code

Relevant Links

Strict aliasing in C++

Aliasing in CUDA

Link Time Optimization (LTO)

Changing how we compile a program can change it's performance. One example of this is when we break our compilation into multiple translation units. Because the compiler doesn't have the full context of the program, it may omit some optimizations. However, we can get some of these optimizations back at link time using Link Time Optimization (LTO). In this example, we will look at how the compiler optimizes a matrix multiplication benchmark in a single translation unit, split across two translation units, and split across multiple translation units with Link Time Optimization enabled.

CPU matrix multiplication source code

Relevant Links

GCC's Link Time Optimization

Branch Prediction

Modern processors rely on branch predictors to keep the pipeline filled in the presence of branches. However, the programmer has a cruicial role in helping out the branch predictor. If we are able to write more predictable code, the branch predictor is able to speculate better. However, if our branches become somewhat random, we can suffer sever performance penalties. In this example we will take a look at the role of branch prediction with virtual functions/dynamic dispatch. We will benchmark different orderings of virtual function calls, and study the affects on the branch miss prediction rate.

Virtual function source code

Relevant Links

Agner Fog's Assembly Optimization Guide

Agner Fog's Optimizing C++

Code Scheduling

The dynamic order of instructions can have a significant impact on performance. One example of this is with branches. Preferring one side of a branch over another can lead to significant performance differences. In this example we will look at using a compiler intrinsic to give hints as to what the "hot-side" of a branch is, and measure the performance difference.

Link to modulo benchmark

Relevant Links

Anger Fog's Instruction Tables

Chandler Carruth's 2015 CppCon Talk

Cache Associativity

Caches are critical to providing performance in modern processors. However, seemingly innacuous access patterns can lead to unexpected drops in performance. One example of this is a power-of-two stride. Modern caches are set-associative (each cache line gets mapped to a set, but there are only N ways where it can be placed). When we do a power of two set, we stumble across the relatively simple mapping of cache lines to sets. As we increase the power of two, the number of unique sets we access decreases, until every cache line we access maps to the same one. In this example, we will show how we can predict such access patterns for an arbitrary processor by looking at the cache organization details.

Link to associativity benchmark source code

Relevant Links

What Every Programmer Should Know About Memory

Prefetching

Not all access patterns have the same performance. As we showed in the cache associativity example, power-of-two stride can lead to unfortunate performance consequences. Another part of the hardware we should think about is the prefetcher. If we have a constant stride access pattern, our hardware prefetcher can begin fetching cache lines before we need them. However, more random access patterns do not get this benefit. In this example, we will be looking at a number of different access patterns to explore the limits of hardware prefetching, and briefly discuss software prefetching intrinsics.

Link to prefetching benchmark

Relevant Links

Stony Brook University lecture on prefetching

What Every Programmer Should Know About Memory

False Sharing

2004 marked the beginning of the multi-core era, as Intel cancelled a 4GHz Pentium 4 processor in favor of dual core. Since that time, all major processor vendors have focused on multi-core performance over single-core. Optimizing multithreaded applications is difficult. Poorly implemented synchronization strategies can lead to execution times that look worse than if the program was single-threaded. However, some performance pitfalls can be more subtle. One of these is false sharing. In this example, we will show how false sharing can impact performance, and briefly discuss the role of coherence in performance.

Link to false sharing benchmark

Relevant Links

Intel blog on false sharing

SIMD Intrinsics - An Optimization Case Study

The majority of comput hardware on modern processors is in the vector units. While compiler-based vectorization can typically achieve ~80% maximum performance, there may be situations where we need to take matter into our own hands. However, knowing if you should be applying vectorization is another story. In this example, we will be examining the optimization of matrix-vector multiplication. We will explore whether we are compute or memory-bound, then manually apply some vectorization to boost performance.

Link to matrix-vector benchmark

Vectorized Dot Product Intrinsic

Relevant Links

Intel Intrinsics Guide

spring_2020_tutorial's People

Contributors

coffeebeforearch avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

spring_2020_tutorial's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.