The astproject from thetheodor

SIMD

Single Instruction Multiple Data: a single instruction operates on multiple data elements.

Autovectorization

Compilers can automatically vectorize code using SIMD instructions, as shown in the previous examples.
Autovectorization is an active area of research and development in compilers with many new papers every year.
Compilers can inform the programmer about the vectorization process using flags like -fopt-info-vec in GCC, LLVM generates similar diagnostics.
Specialized languages and compilers such as ISPC and Halide are designed to make it easier to write vectorized code.

Example: Copying an array

cat copy.c
void copy(long *restrict a, long *restrict b, unsigned long n) {
    for (unsigned long i = 0ul; i < n; i++) {
        a[i] = b[i];
    }
}

Without SIMD:

gcc -O3 -fno-tree-loop-distribute-patterns -fno-tree-vectorize -S  copy.c  -o /dev/stdout
# -fno-tree-loop-distribute-patterns: prevents replacing the loop with a call to memcpy
# -fno-tree-vectorize: prevents vectorization

# Loop body:
 # Each iteration of the loop copies 8 bytes from b[i] to a[i]
 .L3:
	movq	(%rsi,%rax,8), %rcx      # load 8 bytes from b[i] into rcx
	movq	%rcx, (%rdi,%rax,8)      # store 8 bytes from rcx into a[i]
	addq	$1, %rax
	cmpq	%rax, %rdx
	jne	.L3

With SIMD:

gcc -O3 -fno-tree-loop-distribute-patterns  -S -o /dev/stdout copy.c


# Loop body:
 # Each iteration of the loop copies 16 bytes from b[i] to a[i]
 .L4:
 	movdqu	(%rsi,%rax), %xmm0     # load 16 bytes from b[i] into xmm0
 	movups	%xmm0, (%rdi,%rax)     # store 16 bytes from xmm0 into a[i]
 	addq	$16, %rax             
 	cmpq	%rcx, %rax
 	jne	.L4

Wider SIMD registers can be used to copy more data per iteration:

gcc -O3 -fno-tree-loop-distribute-patterns -mavx512f  -S -o /dev/stdout copy.c
# -mavx512f: enable AVX-512 instructions
# Loop body:
 # Each iteration of the loop copies 64 bytes from b[i] to a[i]
 .L4:
 	vmovdqu64	(%rsi,%rax), %zmm0      # load 64 bytes from b[i] into zmm0
 	vmovdqu64	%zmm0, (%rdi,%rax)      # store 64 bytes from zmm0 into a[i]
 	addq	$64, %rax
 	cmpq	%rax, %rcx
 	jne	.L4

We can also vectorize manually using intrinsics:

cat copy_intrinsics.c

#include <immintrin.h>

void copy(long *restrict a, long *restrict b, unsigned long n) {

    // Only works if n is a multiple of 8
    for (unsigned long i = 0ul; i < n; i+=8ul) {
        __m512i v = _mm512_loadu_si512(&b[i]);
        _mm512_storeu_si512(&a[i], v);
    }
}

gcc -O3 -fno-tree-loop-distribute-patterns -mavx512f  -S -o /dev/stdout copy_intrinsics.c
Loop body:
 # Each iteration of the loop copies 64 bytes from b[i] to a[i]
 .L3:
 	vmovdqu64	(%rsi,%rax,8), %zmm0    # load 64 bytes from b[i] into zmm0
 	vmovdqu64	%zmm0, (%rdi,%rax,8)    # store 64 bytes from zmm0 into a[i]
 	addq	$8, %rax
 	cmpq	%rdx, %rax
 	jb	.L3

Example: Vectorized control flow

cat control_flow.c

void copy(long *restrict a, long *restrict b, unsigned long n) {
    for (unsigned long i = 0ul; i < n; i++) {
        if (a[i] > 0)
            b[i] = a[i];
        else
            b[i] = a[i] * 2;
    }
}

gcc -O3 -fno-tree-loop-distribute-patterns -mavx512f  -S -o /dev/stdout control_flow.c

# Loop body:
.L4:
	vmovdqu64	(%rdi,%rax), %zmm3     # load 64 bytes from a[i] into zmm3
	vpsllq	$1, (%rdi,%rax), %zmm0     # load 64 bytes from a[i] and shift left by 1 (multiply by 2) into zmm0
	vpcmpq	$6, %zmm1, %zmm3, %k1      # compare all elementf of zmm3 with 0 (stored in zmm1) and set mask k1 ($6 is the comparison type)
	vmovdqa64	%zmm3, %zmm0{%k1}      # copy elements from zmm3 to zmm0 using mask k1
	vmovdqu64	%zmm0, (%rsi,%rax)     # store 64 bytes from zmm0 into b[i]
	addq	$64, %rax
	cmpq	%rax, %rcx
	jne	.L4

The goal of this project

Autovectorization is difficult. Vectorizing a piece of code may require significant changes to the code (e.g., All you need is superword-level parallelism: systematic control-flow vectorization with SLP. The main goal of this project is not to develop an autovectorizer, but to develop a method for identifying missed opportunities for vectorization in existing code. That is, given an existing binary, we want to identify loops (or even straighline) that could be vectorized but are not.

Approach

(This is a suggestion, feel free to deviate from it.)

The detection of missed opportunities for vectorization can be done in several ways. One approach is to use a dynamic analysis that traces an execution and identifies vectorizable code.

There are many dynamic analysis frameworks, for example:

A dataflow-based approach

Given a trace of execution (instructions and memory accesses), we can use a dataflow analysis to identify "parallel" computations. For example in the following code (from the first example that copies an array)

 .L3:
	movq	(%rsi,%rax,8), %rcx      # load 8 bytes from b[i] into rcx
	movq	%rcx, (%rdi,%rax,8)      # store 8 bytes from rcx into a[i]
	addq	$1, %rax
	cmpq	%rax, %rdx
	jne	.L3

We can identify that the load and store instructions are independent and can be executed in parallel (i.e., they can be vectorized) by looking at the dataflow:

Iteration 1:
load b+0  -> rcx -> store a+0

Iteration 2:
load b+8  -> rcx -> store a+8

Iteration 3:
load b+18 -> rcx -> store a+16
...

All the loads and stores are "parallel". Similarly, in a slightly more complex example:

 .L3:
	movq	(%rsi,%rax,8), %rcx      # load 8 bytes from array1[i] into rcx
	movq	(%rbx,%rax,8), %r10      # load 8 bytes from array2[i] into rdx
	addq    %r10, %rcx               # add the two values
	movq	%rcx, (%rdi,%rax,8)      # store 8 bytes from rdx into array3[i]
	addq	$1, %rax
	cmpq	%rax, %rdx
	jne	.L3

Dataflow:

Iteration 1:
load array1+0  -> rcx 
                     \    
                      -> add r10 and rcx -> store array3+0
                     / 
load array2+0  -> r10 

Iteration 2:
load array1+8  -> rcx 
                     \    
                      -> add r10 and rcx -> store array3+8
                     / 
load array2+8  -> r10 


Iteration 3:
load array1+16 -> rcx 
                     \    
                      -> add r10 and rcx -> store array3+16
                     / 
load array2+16 -> r10 
...

All the necessary data to build the dataflow graph can be obtained from the trace (generated by the dynamic analysis tool).

Basic idea

Identify parallel dataflow in the trace and match it to known vectorization instruction/patterns. In the above examples it is easy to see that the dataflow can be vectorized using SIMD instructions, all the loads and stores are independent and on concecutive addresses.

Challenges:

The dataflow graph can be very large. Hint: if this becomes a problem, you could first identify loops and only trace these.
Control flow: some loops iterations might behave differently due to divergent control flow (like in the control.flow.c example ). Your approach should be able to handle this.
The number of vector instructions and patterns is large. You might want to start with a small subset of instructions and patterns and then expand it.
How do you match the dataflow to a vectorization pattern? This is a key question and there are many possible approaches. You might want to start with a simple pattern matching approach and then expand it (e.g., using a solver) to identify more complex patterns.

Test inputs

One popular autovectorization benchmark is TSVC with both simple and complex loops that compilers (GCC and LLVM) fail to vectorize. This might be useful for testing your approach.YARPGen is another potential source of test programs.

thetheodor / astproject Goto Github PK

astproject's Introduction

SIMD

Autovectorization

Example: Copying an array

Example: Vectorized control flow

The goal of this project

Approach

A dataflow-based approach

Basic idea

Test inputs

astproject's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent