Giter Club home page Giter Club logo

vc4c's People

Contributors

doe300 avatar kaminarifox avatar long-long-float avatar nomaddo avatar pfoof avatar thijswithaar avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

vc4c's Issues

AArch64 requires -fPIC

I'm working to build this on a rpi3 with debian/Aarch64 and got a few hundred of these messages when compiling vc4c:

/usr/bin/ld: CMakeFiles/objvc4asm.dir/Parser.cpp.o: relocation R_AARCH64_ADR_PREL_PG_HI21 against symbol `_ZTVN6Parser7MessageE' which may bind externally can not be used when making a shared object; recompile with -fPIC

The fix looks simple enough?

Remove common expression

kernel void sum_f16 (global float a[], global float b[], global float c[]) {
  int id = get_global_id(0);
  float16 x = vload16(id, b);
  float16 y = vload16(id, c);
  vstore16 (x * y + x * y + x * y, id, a);
}

https://gist.github.com/nomaddo/9a7921afd53d08cb34d736e7432b8036

In this code, it seems that the same calculation is performed repeatedly (line 59 and 60).
I imaged this such typical optimization (removing subcommon expression) is done by clang.

Latest `VC4C` exit with throwing exception

If the following program is passed to latest VC4C, the exception Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111) is raised.
Is it probably a bug of the register allocation?

kernel void loop1 (global float a[], global float b[]) {
  for (int i = 0; i < 1000; i++) {
    a[i] = -i;
    b[i] = -i;
  }
}
./build/VC4C -o /tmp/hoge.o ~/idein/clABCMarks/loop1.cl 

nomaddo@nomaddo-AS:~/idein/VC4C$  ./build/VC4C -o /tmp/hoge.o ~/idein/clABCMarks/loop1.cl 
[D] Tue Feb 13 13:24:06 2018: Compiling '/home/nomaddo/idein/clABCMarks/loop1.cl' into '/tmp/hoge.o' with options '' ...
[D] Tue Feb 13 13:24:06 2018: Temporary file '/tmp/vc4c-GFIvlj' created
[D] Tue Feb 13 13:24:06 2018: Temporary file '/tmp/vc4c-gseRZG' created
[I] Tue Feb 13 13:24:06 2018: Compiling OpenCL to LLVM-IR with :/opt/SPIRV-LLVM/build/bin/clang -cc1 -triple spir-unknown-unknown  -I /home/nomaddo/idein/clABCMarks -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant  -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /home/nomaddo/idein/VC4C/../VC4CLStdLib/include/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-gseRZG /home/nomaddo/idein/clABCMarks/loop1.cl
[I] Tue Feb 13 13:24:07 2018: Converting LLVM-IR to SPIR-V with :/opt/SPIRV-LLVM/build/bin/llvm-spirv -o /tmp/vc4c-GFIvlj /tmp/vc4c-gseRZG
[D] Tue Feb 13 13:24:07 2018: Temporary file '/tmp/vc4c-gseRZG' deleted
[I] Tue Feb 13 13:24:07 2018: Compilation complete!
[I] Tue Feb 13 13:24:07 2018: Using SPIR-V frontend...
[D] Tue Feb 13 13:24:07 2018: Read SPIR-V binary with 219 words
[D] Tue Feb 13 13:24:07 2018: Starting parsing...
[D] Tue Feb 13 13:24:07 2018: SPIR-V header parsed: magic-number 0x7230203, version 0x10000, generator 6000e, max-ID 29
[D] Tue Feb 13 13:24:07 2018: Using supported capability: Addresses
[D] Tue Feb 13 13:24:07 2018: Using supported capability: Linkage
[D] Tue Feb 13 13:24:07 2018: Using supported capability: Kernel
[D] Tue Feb 13 13:24:07 2018: Importing extended instruction set: OpenCL.std
[D] Tue Feb 13 13:24:07 2018: Using a physical OpenCL memory model
[D] Tue Feb 13 13:24:07 2018: Kernel-method found: loop1
[D] Tue Feb 13 13:24:07 2018: Reading function: %4 (arm_get_core_id)
[D] Tue Feb 13 13:24:07 2018: Reading function: %11 (loop1)
[D] Tue Feb 13 13:24:07 2018: Reading parameter: float* %12
[D] Tue Feb 13 13:24:07 2018: Reading parameter: float* %13
[D] Tue Feb 13 13:24:07 2018: SPIR-V binary successfully parsed
[D] Tue Feb 13 13:24:07 2018: Mapping instructions to intermediate...
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %0
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %5
[D] Tue Feb 13 13:24:07 2018: Generating intermediate return of value: i32 0
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %0
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %14
[D] Tue Feb 13 13:24:07 2018: Generating intermediate branch to %15
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %15
[D] Tue Feb 13 13:24:07 2018: Generating Phi-Node with 2 options into i32 %18
[D] Tue Feb 13 13:24:07 2018: Generating intermediate binary operation 'sub' with i32 0 and i32 %18 into i32 %19
[D] Tue Feb 13 13:24:07 2018: Generating intermediate unary operation 'sitofp' with i32 %19 into float %20
[D] Tue Feb 13 13:24:07 2018: Generating calculating indices of float* %12 into float* %21
[D] Tue Feb 13 13:24:07 2018: Generating writing of float %20 into float* %21
[D] Tue Feb 13 13:24:07 2018: Generating calculating indices of float* %13 into float* %22
[D] Tue Feb 13 13:24:07 2018: Generating writing of float %20 into float* %22
[D] Tue Feb 13 13:24:07 2018: Generating intermediate binary operation 'add' with i32 %18 and i32 1 into i32 %17
[D] Tue Feb 13 13:24:07 2018: Generating intermediate comparison 'eq' of i32 %17 and i32 1000 into bool %27
[D] Tue Feb 13 13:24:07 2018: Generating intermediate conditional branch on %27 to either %16 or %15
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %16
[D] Tue Feb 13 13:24:07 2018: Generating intermediate return
[D] Tue Feb 13 13:24:07 2018: Eliminating phi-node by inserting moves: i32 %18 = phi %15 -> i32 %17, %14 -> i32 0
[D] Tue Feb 13 13:24:07 2018: Inserting 'move' into end of basic-block: %15
[D] Tue Feb 13 13:24:07 2018: Inserting 'move' into end of basic-block: %14
[I] Tue Feb 13 13:24:07 2018: -----
[I] Tue Feb 13 13:24:07 2018: Inlining functions for kernel: loop1
[I] Tue Feb 13 13:24:07 2018: -----
[D] Tue Feb 13 13:24:07 2018: -----
[I] Tue Feb 13 13:24:07 2018: Running optimization passes for: loop1
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: MapMemoryAccess
[D] Tue Feb 13 13:24:07 2018: Generating memory access which cannot be lowered into VPM: store float %20 into float* %22
[D] Tue Feb 13 13:24:07 2018: Increased the scratch size to 1 rows (64 bytes)
[D] Tue Feb 13 13:24:07 2018: 0 16 0 1
[D] Tue Feb 13 13:24:07 2018: Generating memory access which cannot be lowered into VPM: store float %20 into float* %21
[D] Tue Feb 13 13:24:07 2018: 0 16 0 1
[D] Tue Feb 13 13:24:07 2018: Found base address - with offset -1 for writing into memory
[D] Tue Feb 13 13:24:07 2018: Found base address - with offset -1 for writing into memory
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: ResolveStackAllocations
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: SingleSteps
[D] Tue Feb 13 13:24:07 2018: Running steps: EliminateReturns, CombineDuplicateBranches, EliminateUselessBranch, IntrinsifyBuiltin, HandleLiteralVector, MapGlobalDataToAddress, CalculateConstantValue, EliminateUselessInstruction, LoadImmediateValues, HandleUseWithImmediateValues, MoveRotationSourcesToAccs, CheckMethodCalls, CombineSelectionWithZero, CombineSettingSameFlags, 
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Tue Feb 13 13:24:07 2018: Removing branch to next instruction: br %15
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Tue Feb 13 13:24:07 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %19 = sub i32 0 (0), i32 %18
[D] Tue Feb 13 13:24:07 2018: Intrinsifying multiplication with left-shift
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Tue Feb 13 13:24:07 2018: Re-using temporary to split up use of long-living local with immediate value: i32 %index_offset.0 = shl i32 %18, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: Intrinsifying multiplication with left-shift
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Tue Feb 13 13:24:07 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %index_offset.1 = shl i32 %18, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %17 = add i32 %18, i32 1 (1)
[D] Tue Feb 13 13:24:07 2018: Intrinsifying comparison 'eq' to arithmetic operations
[D] Tue Feb 13 13:24:07 2018: Loading immediate value: 1000
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value true to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value true to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value true to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Replacing return in kernel-function with branch to end-label
[D] Tue Feb 13 13:24:07 2018: Removing branch to next instruction: br %end_of_function
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: CombineLiteralLoads
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: CombineRotations
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: EliminateDeadStores
[D] Tue Feb 13 13:24:07 2018: Cleaned 7 unused locals from method loop1
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: VectorizeLoops
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: SplitReadAfterWrites
[D] Tue Feb 13 13:24:07 2018: Inserting NOP to split up read-after-write before: i32 %use_with_literal.3 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Inserting NOP to split up read-after-write before: register - = bool %27 (setf )
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: ReorderInstructions
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpm, register mutex_acq, register vpw_busy, register vpw_wait): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpm, register mutex_acq, register vpw_busy, register vpw_wait): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: CombineALUIinstructions
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: UnrollWorkGroups
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: AddStartStopSegment
[D] Tue Feb 13 13:24:07 2018: 
[D] Tue Feb 13 13:24:07 2018: Running pass: ExtendBranches
[I] Tue Feb 13 13:24:07 2018: 
[I] Tue Feb 13 13:24:07 2018: Optimizations done, changed number of instructions from 21 to 95
[D] Tue Feb 13 13:24:07 2018: -----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %start_of_function
[D] Tue Feb 13 13:24:07 2018: i32 %work_dim = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %local_sizes = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %local_ids = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %num_groups_x = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %num_groups_y = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %num_groups_z = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %group_id_x = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %group_id_y = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %group_id_z = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_offset_x = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_offset_y = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_offset_z = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_data_address = register unif
[D] Tue Feb 13 13:24:07 2018: float* %12 = register unif
[D] Tue Feb 13 13:24:07 2018: float* %13 = register unif
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %0
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %14
[D] Tue Feb 13 13:24:07 2018: i32 %18 = i32 0 (0) (phi)
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %15
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.3 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %19 = sub i32 0 (0), i32 %use_with_literal.3
[D] Tue Feb 13 13:24:07 2018: float %20 = itof i32 %19
[D] Tue Feb 13 13:24:07 2018: i32 %index_offset.0 = shl i32 %use_with_literal.3, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: float* %21 = add float* %12, i32 %index_offset.0
[D] Tue Feb 13 13:24:07 2018: mutex_acq
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpm = float %20
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(rows: 1, elements: 1 words, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(stride: 0)
[D] Tue Feb 13 13:24:07 2018: register vpw_addr = float* %21
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: register - = register vpw_wait
[D] Tue Feb 13 13:24:07 2018: mutex_rel
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %index_offset.1 = shl i32 %use_with_literal.6, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: float* %22 = add float* %13, i32 %index_offset.1
[D] Tue Feb 13 13:24:07 2018: mutex_acq
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpm = float %20
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(rows: 1, elements: 1 words, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(stride: 0)
[D] Tue Feb 13 13:24:07 2018: register vpw_addr = float* %22
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: register - = register vpw_wait
[D] Tue Feb 13 13:24:07 2018: mutex_rel
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %17 = add i32 %use_with_literal.8, i32 1 (1)
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: register - = xor i32 %17, i32 %immediate.10 (setf )
[D] Tue Feb 13 13:24:07 2018: bool %27 = bool 1 (1) (ifz )
[D] Tue Feb 13 13:24:07 2018: bool %27 = xor bool 1 (1), bool 1 (1) (ifzc )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: register - = bool %27 (setf )
[D] Tue Feb 13 13:24:07 2018: i32 %18 = i32 %17 (ifz phi)
[D] Tue Feb 13 13:24:07 2018: register - = or register elem_num, bool %27 (setf )
[D] Tue Feb 13 13:24:07 2018: br.ifzc %16 (on bool %27) (ifzc )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: br.ifz %15 (on bool %27) (ifz )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %16
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %end_of_function
[D] Tue Feb 13 13:24:07 2018: i32 %group_loop_size = register unif
[D] Tue Feb 13 13:24:07 2018: register - = or register elem_num, i32 %group_loop_size (setf )
[D] Tue Feb 13 13:24:07 2018: br.ifzc %start_of_function (on i32 %group_loop_size) (ifzc )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: register irq = not register qpu_num
[D] Tue Feb 13 13:24:07 2018: nop (thrend )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Cleaned 0 unused locals from method loop1
[D] Tue Feb 13 13:24:07 2018: Created node: %group_loop_size init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %17 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %index_offset.1 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %20 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %19 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %18 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %27 init: acc,B, avail: acc,B (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %13 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %12 init: A,B, avail: A,B (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %global_data_address is never read!
[D] Tue Feb 13 13:24:07 2018: Local %global_offset_z is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %21 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %global_offset_y is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %22 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %global_offset_x is never read!
[D] Tue Feb 13 13:24:07 2018: Local %group_id_z is never read!
[D] Tue Feb 13 13:24:07 2018: Local %group_id_y is never read!
[D] Tue Feb 13 13:24:07 2018: Local %group_id_x is never read!
[D] Tue Feb 13 13:24:07 2018: Local %num_groups_z is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %immediate.10 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %num_groups_y is never read!
[D] Tue Feb 13 13:24:07 2018: Local %num_groups_x is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %index_offset.0 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %local_ids is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %use_with_literal.8 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %use_with_literal.6 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %use_with_literal.3 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %local_sizes is never read!
[D] Tue Feb 13 13:24:07 2018: Local %work_dim is never read!
[D] Tue Feb 13 13:24:07 2018: Colored graph with 31 nodes created!
[D] Tue Feb 13 13:24:07 2018: -----
[D] Tue Feb 13 13:24:07 2018: Mapping label '%start_of_function' to byte-position 0
[D] Tue Feb 13 13:24:07 2018: Mapping label '%0' to byte-position 120
[D] Tue Feb 13 13:24:07 2018: Mapping label '%14' to byte-position 120
[D] Tue Feb 13 13:24:07 2018: Mapping label '%15' to byte-position 136
[D] Tue Feb 13 13:24:07 2018: Mapping label '%16' to byte-position 632
[D] Tue Feb 13 13:24:07 2018: Mapping label '%end_of_function' to byte-position 632
[D] Tue Feb 13 13:24:07 2018: Mapped 6 labels to positions
[D] Tue Feb 13 13:24:07 2018: Assigned local %local_sizes to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %use_with_literal.6 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %use_with_literal.8 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %local_ids to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %num_groups_x to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %num_groups_y to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %num_groups_z to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %19 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %group_id_x to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %13 to register ra0
[D] Tue Feb 13 13:24:07 2018: Assigned local %18 to register r3
[D] Tue Feb 13 13:24:07 2018: Assigned local %use_with_literal.3 to register r1
[D] Tue Feb 13 13:24:07 2018: Assigned local %21 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %index_offset.1 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %immediate.10 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %22 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %17 to register r1
[D] Tue Feb 13 13:24:07 2018: Assigned local %work_dim to register -
[E] Tue Feb 13 13:24:07 2018:  (1) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x130 [0x7fa0eaa3dcda]
[E] Tue Feb 13 13:24:07 2018:  (2) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::qpu_asm::ColoredNode::getRegisterFixed() const+0x19d [0x7fa0eaad6819]
[E] Tue Feb 13 13:24:07 2018:  (3) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::qpu_asm::GraphColoring::toRegisterMap() const+0x1ec [0x7fa0eaadef2a]
[E] Tue Feb 13 13:24:07 2018:  (4) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::qpu_asm::CodeGenerator::generateInstructions(vc4c::Method&)+0x4e7 [0x7fa0eaacd4b9]
[E] Tue Feb 13 13:24:07 2018:  (5) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : +0x8553db [0x7fa0eaa3f3db]
[E] Tue Feb 13 13:24:07 2018:  (6) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : +0x8555f2 [0x7fa0eaa3f5f2]
[E] Tue Feb 13 13:24:07 2018:  (7) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : +0x8570b0 [0x7fa0eaa410b0]
[E] Tue Feb 13 13:24:07 2018:  (8) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x32 [0x7fa0eaa4323a]
[E] Tue Feb 13 13:24:07 2018:  (9) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const+0x60 [0x7fa0eaa428b0]
[E] Tue Feb 13 13:24:07 2018:  (10) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)+0x28 [0x7fa0eaa49562]
[E] Tue Feb 13 13:24:07 2018:  (11) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()()+0x2c [0x7fa0eaa494b8]
[E] Tue Feb 13 13:24:07 2018:  (12) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : std::thread::_Impl<std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()> >::_M_run()+0x1c [0x7fa0eaa493ee]
[E] Tue Feb 13 13:24:07 2018:  (13) /usr/lib/x86_64-linux-gnu/libstdc++.so.6 : +0xb8c80 [0x7fa0e9f20c80]
[E] Tue Feb 13 13:24:07 2018:  (14) /lib/x86_64-linux-gnu/libpthread.so.0 : +0x76ba [0x7fa0e91656ba]
[E] Tue Feb 13 13:24:07 2018:  (15) /lib/x86_64-linux-gnu/libc.so.6 : clone+0x6d [0x7fa0e968641d]
[E] Tue Feb 13 13:24:07 2018: Background worker threw error: Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Temporary file '/tmp/vc4c-GFIvlj' deleted
[E] Tue Feb 13 13:24:07 2018: Compiler threw exception: Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
terminate called after throwing an instance of 'vc4c::CompilationError'
  what():  Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
Aborted (core dumped)

Add loop vectorization

Add an optimization steps to vectorize loops. Status:

  • Find all loops in a kernel function (single and multiple blocks)
  • Determine loop-control data (iteration step variable/operation, initial value, upper bound, repetition branch instruction/condition, ...)
  • Determine optimum vectorization factor. NOTE: very basic implementation
  • Determine cost vs. benefits. NOTE: Benefits are currently calculated in total number of instructions saved, calculation of costs is missing
  • Vectorize by changing all required operations into vector-operations. NOTE: very basic implementation, needs to be improved
  • Adjust the initial value and the iteration step.

Notes:
Currently works only on basic for-range loops with literal initial value, a literal constant step and a literal constant upper bound (e.g. for(int i = 0; i < 1000; ++i).
Can't vectorize loops using one of those instructions: vector-rotations, memory barriers and semaphore-instructions.

The optimization is enabled via a new configuration-entry in config.h, which is disabled by default for now.

Needs extensive testing!

Suggestion: first release milestone

I am thinking how to develop VC4C in efficient way.
I really make it worth to adapt realistic application.

To do that, how about setting milestone for first release.
If you permit, I want to list current improvements what we should now.


In my opinion, we need evaluate output of VC4C and compare this with ideal output.

For example, current VC4C of master output is as follows:

__kernel void loop1 (__global float * a) {
  int id = get_local_id(0);
  float16 v = vload16 (id, a);
  vstore16(v * 2, id, a);
}

``asm
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 49 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or r0, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -45
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never


In my opinion, this should be like:

```asm
;; This buffer is for text that is not saved, and for Lisp evaluation.
;; To create a file, visit it with <open> and enter text in its buffer.

or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or r0, unif, unif
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed
or -, unif, unif  // should be removed 
or -, unif, unif  // should be removed

                  // label1: should be beginning of loop

or ra0, unif, unif
or r0, r0, r0     // just duplication, should be removed

ldi r1, 255       //
and r0, r0, r1    // these have no effects
and r0, r0, r1    //

shl r0, r0, 4 (4) //
shl r0, r0, 2 (2) // why shifted ?

add r0, ra0, r0   //

or r2, r0, r0     // r0 is reused by `v8add`. why need copy?

mul24 r1, 4 (4), elem_num // this should compute only once, should move to outside of loop
v8adds r0, 8 (8), 8 (8)   // this should compute only once, should move to outside of loop
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)  // r0 is re-assigned in next instruction, no affects
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never

or r0, 2.000000 (33), 2.000000 (33) // 2.0 includes smallImm, can be fused
fmul r0, r4, r0                     //

or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -45 // should be jump to label1
nop.never                // should be replaced to meaningful op
nop.never                // should be replaced to meaningful op
nop.never                // should be replaced to meaningful op
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never

So we need:

  1. Optimize layout of parameters
  2. Enhancement of basic optimization (remove redundant moves, peepohole)
  3. Dual issue (by Instruction Scheduler)
  4. Use both TMU0 and TMU1 (by Instruction Scheduler)
  5. Improve register allocation (but it might be very difficult)

How about make first release after implementation of such improvement?
@doe300 What do you think of that?

Implement register spilling

We need to implement register spilling to be able to support more complex kernels.

If the size of spilled registers (times 12 QPUs!) is small enough, we could store them in VPM and save from accessing memory. Access to the spilled registers would still need to be synchronized via the hardware-mutex.

The actual problem of this implementation is not the spilling/loading of locals, but in determining the minimum number of registers to spill.

(see doe300/VC4CL#24)

Is the output inefficient?

Recently, I read the output of VC4C. To me, it includes inefficient instructions.
I would like to know your design concept of the output.

kernel void add (global float a[], global float b[]) {
  int id = get_global_id(0);
  float8 v1 = vload8(id, a);
  float8 v2 = vload8(id, b);
  v1 += v2;
  vstore8 (v1, id, a);
}

https://gist.github.com/nomaddo/728ffc2fa605ab5b87f316a6280246be

My question is:

  • In line 3, 6, 7 or so on, this read from uniform and drop it. Why do it do?
    I want to know the layout of uniform which VC4CL runtime passes to the kernel.
  • It reads values from VPM, not TMU.
    In general, load from TMU is better than VPM , because we have two TMUs and these are free from mutex locks. Why do you choose the load way from VPM?

And some optimizations seem to lack:

  • Line 30 and line 31, can be fused.
or r0, r2, r2
shl r0, r0, 2

is equal to

shl r0, r2, r0
  • In line 22, or r2, r2, r2 has no meaning effects.
  • Have you implemented instruction scheduling? It affects a lot in simple architecture like vc4.

Manipulating local memories get weird results

I am trying to implement a dot product operation. However, it seems like working on local memories always get weird results. The code snippet looks like this:

__kernel void dot3(__global float* a_vec, __global float* b_vec, 
    __global float* output, __local float* l_vec) {
  size_t gid = get_global_id(0);
  size_t lid = get_local_id(0);

  l_vec[lid]  = a_vec[gid] * b_vec[gid];

  barrier(CLK_LOCAL_MEM_FENCE);
  if(lid==0) {
    l_vec[lid] = l_vec[lid] + l_vec[lid+1]; //sometimes, this line is ignored.
  }
  output[gid] = l_vec[lid];
}

The output returns different values each run. The system does not sum the first two items in each workgroup.

Thank you.

Manipulate instructions in `BasicBlock`

I am trying to implement instruction scheduler.
I am wondering how to treat instructions.

BasicBlock.instructions has type
RandomModificationList<std::unique_ptr<intermediate::IntermediateInstruction>> and InstructionWalker treat raw pointers of intermediate::intermediateinstruction.
I want to change it to shared_ptr, and change raw pointers to shared_ptr.

Do you have any reason to use unique_ptr and raw pointers?
@doe300

EDIT:
I am struggling to cache instruction: When I cache instructions (as raw pointers because only them are available out-side from the class) and remove instructions from BasicBlock, data is released. The problem is come from raw pointers.

Remove constant load in loops

ldi remain in for-loop even though it can be moved to outside of the loop.

kernel void loop1 (global float a[], global float b[]) {
  for (int i = 0; i < 1000; i++) {
    a[i] = -i;
    b[i] = -i;
  }
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 67 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), __global out float* b (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r2, unif, unif
or r1, 0 (0), 0 (0)
nop.never 
or r0, r1, r1
sub r0, 0 (0), r0
itof r3, r0; v8min r0, r1, r1
shl r0, r0, 2 (2)
add r0, ra0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or r0, r1, r1
or r1, r1, r1
shl r0, r0, 2 (2)
add r1, r1, 1 (1)
add r0, r2, r0
ldi ra1, 1000                                   //// load the constant  every time !!!
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
xor.setf -, r1, ra1
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)
or.setf -, r0, r0
or.ifz r1, r1, r1
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + 4
nop.never 
nop.never 
nop.never 
brr.ifanyz (pc+4) + -40
nop.never 
nop.never 
nop.never 
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -63
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

Automatically add artifacts to releases

The next step after #8 and fixing #9 would be to create a github-release nightly, were after every successful build on CircleCI (or at least once per day), the resulting artifacts (library, executable and debian-package) are released.

A more light-weight version of this could also be implemented by (simply?) linking to the latest versions of the artifacts in the CircleCI builds. In this case, the artifacts would need a stable URL across builds.

Issues with rebuilding and default clang

Sometimes, when VC4C is re-built (CMake already configured) using the default clang, the compilation fails due to CLANG_PATH not found.

My theory is that is has something to with CLANG_PATH being set as a macro in CMake, not a CMake variable and therefore might not be saved?!

A current solution is to clean and re-build cmake (see doe300/VC4CL#19 (comment)):

git clean -fdx && cmake . && make clean && make -j2

See also doe300/VC4CL#19 and #34.

CI build of x86_64 binary and test

Currently, CI only checks that the source code can be compiled to arm binaries.
In my motivation, it is convinient that we can do cross-compilation (output vc4 binaries from x86_64 host environments).

My plan is:

  • Enable compile x86_64 version of VC4C in circleci
  • Add tests (check, at least compilations can finish successfully)

Compilation of `testing/test_inner_loops.cl` failed

The latest VC4C fail to compile testing/test_inner_loops.cl as follows:

nomaddo@nomaddo-AS:~/VC4C_cross$ valgrind build/VC4C --asm -o /tmp/hoge2.s ~/VC4C_cross/testing/test_inner_loops.cl 
==15066== Memcheck, a memory error detector
==15066== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==15066== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==15066== Command: build/VC4C --asm -o /tmp/hoge2.s /home/nomaddo/VC4C_cross/testing/test_inner_loops.cl
==15066== 
[D] Fri Mar 23 19:03:51 2018: Compiling '/home/nomaddo/VC4C_cross/testing/test_inner_loops.cl' into '/tmp/hoge2.s' with options '' ...
[D] Fri Mar 23 19:03:51 2018: Temporary file '/tmp/vc4c-po6NJo' created
[I] Fri Mar 23 19:03:51 2018: Compiling OpenCL to LLVM-IR with :/usr/bin/clang-3.9 -cc1 -triple spir-unknown-unknown  -I /home/nomaddo/VC4C_cross/testing -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant  -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-po6NJo /home/nomaddo/VC4C_cross/testing/test_inner_loops.cl
[I] Fri Mar 23 19:03:54 2018: Compilation complete!
[D] Fri Mar 23 19:03:57 2018: Reading LLVM module from bit-code...
[D] Fri Mar 23 19:03:59 2018: Found SPIR kernel-function: test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:03:59 2018: Reading function void test_remove_constant_load_in_loops_opt(...)
[D] Fri Mar 23 19:03:59 2018: Reading parameter float* %a
[D] Fri Mar 23 19:03:59 2018: Reading parameter float* %b
[D] Fri Mar 23 19:04:00 2018: Mapping function 'test_remove_constant_load_in_loops_opt'...
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.0
[D] Fri Mar 23 19:04:00 2018: Generating unconditional branch to 0xccb9520
[D] Fri Mar 23 19:04:00 2018: Generating label label %.preheader22
[D] Fri Mar 23 19:04:00 2018: Generating Phi-Node with 2 options into i32 %j.025
[D] Fri Mar 23 19:04:00 2018: Generating binary operation mul with i32 %j.025 and i32 200 into i32 %mul2
[D] Fri Mar 23 19:04:00 2018: Generating unconditional branch to 0xccba520
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.1
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %j.025 and i32 1 into i32 %inc3
[D] Fri Mar 23 19:04:00 2018: Generating comparison slt with i32 %inc3 and i32 200 into %cmp
[D] Fri Mar 23 19:04:00 2018: Generating branch on condition bool %cmp to either label %.preheader22 or label %.preheader.preheader
[D] Fri Mar 23 19:04:00 2018: Generating label label %.preheader.preheader
[D] Fri Mar 23 19:04:00 2018: Generating unconditional branch to 0xccbb390
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.2
[D] Fri Mar 23 19:04:00 2018: Generating Phi-Node with 2 options into i32 %i.024
[D] Fri Mar 23 19:04:00 2018: Generating binary operation mul with i32 %i.024 and i32 %j.025 into i32 %mul
[D] Fri Mar 23 19:04:00 2018: Generating unary operation sitofp with i32 %mul into float %conv
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %i.024 and i32 %mul2 into i32 %add
[D] Fri Mar 23 19:04:00 2018: Generating calculating index i32 %add of float* %a into float* %arrayidx
[D] Fri Mar 23 19:04:00 2018: Generating writing of float %conv into float* %arrayidx
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %i.024 and i32 1 into i32 %inc
[D] Fri Mar 23 19:04:00 2018: Generating comparison slt with i32 %inc and i32 200 into %cmp1
[D] Fri Mar 23 19:04:00 2018: Generating branch on condition bool %cmp1 to either label %tmp.2 or label %tmp.1
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.3
[D] Fri Mar 23 19:04:00 2018: Generating return nothing
[D] Fri Mar 23 19:04:00 2018: Generating label label %.preheader
[D] Fri Mar 23 19:04:00 2018: Generating Phi-Node with 2 options into i32 %i4.023
[D] Fri Mar 23 19:04:00 2018: Generating unary operation sitofp with i32 %i4.023 into float %conv7
[D] Fri Mar 23 19:04:00 2018: Generating calculating index i32 %i4.023 of float* %b into float* %arrayidx8
[D] Fri Mar 23 19:04:00 2018: Generating writing of float %conv7 into float* %arrayidx8
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %i4.023 and i32 1 into i32 %inc9
[D] Fri Mar 23 19:04:00 2018: Generating comparison slt with i32 %inc9 and i32 40000 into %cmp5
[D] Fri Mar 23 19:04:00 2018: Generating branch on condition bool %cmp5 to either label %.preheader or label %tmp.3
[D] Fri Mar 23 19:04:00 2018: Eliminating phi-node by inserting moves: i32 %j.025 = phi %tmp.1 -> i32 %inc3, %tmp.0 -> i32 0
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %tmp.1
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %tmp.0
[D] Fri Mar 23 19:04:00 2018: Eliminating phi-node by inserting moves: i32 %i.024 = phi %tmp.2 -> i32 %inc, %.preheader22 -> i32 0
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %tmp.2
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %.preheader22
[D] Fri Mar 23 19:04:00 2018: Eliminating phi-node by inserting moves: i32 %i4.023 = phi %.preheader.preheader -> i32 0, %.preheader -> i32 %inc9
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %.preheader.preheader
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %.preheader
[I] Fri Mar 23 19:04:00 2018: -----
[I] Fri Mar 23 19:04:00 2018: Inlining functions for kernel: test_remove_constant_load_in_loops_opt
[I] Fri Mar 23 19:04:00 2018: -----
[D] Fri Mar 23 19:04:00 2018: -----
[I] Fri Mar 23 19:04:00 2018: Running optimization passes for: test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:04:00 2018: 
[D] Fri Mar 23 19:04:00 2018: Running pass: MapMemoryAccess
[D] Fri Mar 23 19:04:00 2018: Generating memory access which cannot be lowered into VPM: store float %conv7 into float* %arrayidx8
[D] Fri Mar 23 19:04:00 2018: 0 16 0 1
[D] Fri Mar 23 19:04:00 2018: Generating memory access which cannot be lowered into VPM: store float %conv into float* %arrayidx
[D] Fri Mar 23 19:04:00 2018: 0 16 0 1
[D] Fri Mar 23 19:04:00 2018: Found base address - with offset -1 for writing into memory
[D] Fri Mar 23 19:04:00 2018: Found base address - with offset -1 for writing into memory
[D] Fri Mar 23 19:04:00 2018: 
[D] Fri Mar 23 19:04:00 2018: Running pass: ResolveStackAllocations
[D] Fri Mar 23 19:04:00 2018: 
[D] Fri Mar 23 19:04:00 2018: Running pass: SingleSteps
[D] Fri Mar 23 19:04:00 2018: Running steps: EliminateReturns, CombineDuplicateBranches, EliminateUselessBranch, IntrinsifyBuiltin, HandleLiteralVector, MapGlobalDataToAddress, CalculateConstantValue, EliminateUselessInstruction, LoadImmediateValues, HandleUseWithImmediateValues, MoveRotationSourcesToAccs, CheckMethodCalls, CombineSelectionWithZero, CombineSettingSameFlags, 
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Removing branch to next instruction: br %.preheader22
[D] Fri Mar 23 19:04:00 2018: Intrinsifying unsigned multiplication of integers
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.21 = clz i32 1 (1)' with constant value: i32 31
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 65535
[D] Fri Mar 23 19:04:00 2018: Replacing 'i16 %immediate.23 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 200
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.24 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Replacing obsolete i32 %tmp.19 = xor i32 %sign.8, i8 0 with move
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %inc3 = add i32 %j.025, i32 1 (1)
[D] Fri Mar 23 19:04:00 2018: Intrinsifying comparison 'slt' to arithmetic operations
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 200
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Removing branch to next instruction: br.ifz %.preheader.preheader (on bool %cmp) (ifz )
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Intrinsifying unsigned multiplication of integers
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.49 = clz i32 1 (1)' with constant value: i32 31
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.50 = clz i32 1 (1)' with constant value: i32 31
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 65535
[D] Fri Mar 23 19:04:00 2018: Replacing 'i16 %immediate.52 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 65535
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.54 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.55 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.56 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Intrinsifying multiplication with left-shift
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %inc = add i32 %i.024, i32 1 (1)
[D] Fri Mar 23 19:04:01 2018: Intrinsifying comparison 'slt' to arithmetic operations
[D] Fri Mar 23 19:04:01 2018: Loading immediate value: 200
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Replacing return in kernel-function with branch to end-label
[D] Fri Mar 23 19:04:01 2018: Intrinsifying multiplication with left-shift
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Fri Mar 23 19:04:01 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %index_offset.5 = shl i32 %i4.023, i32 2 (2)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %inc9 = add i32 %i4.023, i32 1 (1)
[D] Fri Mar 23 19:04:01 2018: Intrinsifying comparison 'slt' to arithmetic operations
[D] Fri Mar 23 19:04:01 2018: Loading immediate value: 40000
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: 
[D] Fri Mar 23 19:04:01 2018: Running pass: CombineLiteralLoads
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.50 = clz i32 1 (1)
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.53 = loadi i32 65535
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.54 = v8adds i32 8 (8), i32 8 (8)
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.56 = v8adds i32 8 (8), i32 8 (8)
[D] Fri Mar 23 19:04:01 2018: 
[D] Fri Mar 23 19:04:01 2018: Running pass: CombineRotations
[D] Fri Mar 23 19:04:01 2018: 
[D] Fri Mar 23 19:04:01 2018: Running pass: RemoveRedundantMoves
[D] Fri Mar 23 19:04:01 2018: Removing obsolete move by replacing uses of the output with the input: i32 %mul.out1.16 = i32 %mul.out0.15
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpm = float %conv
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpw_addr = float* %arrayidx
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpm = float %conv7
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpw_addr = float* %arrayidx8
[D] Fri Mar 23 19:04:01 2018: 
[D] Fri Mar 23 19:04:01 2018: Running pass: EliminateDeadStores
[D] Fri Mar 23 19:04:01 2018: Removing instruction i32 %mul.b0.13 = i8 0 (0), since its output is never read
[D] Fri Mar 23 19:04:01 2018: Cleaned 22 unused locals from method test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:04:01 2018: 
[D] Fri Mar 23 19:04:01 2018: Running pass: RemoveConstantLoadInLoops
[D] Fri Mar 23 19:04:01 2018: CFG created/updated for function: test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:04:01 2018: Found a control-flow loop: label: %tmp.2 -> label: %tmp.1 -> label: %.preheader22 -> 
[D] Fri Mar 23 19:04:01 2018: Found a control-flow loop: label: %tmp.2 -> 
[D] Fri Mar 23 19:04:01 2018: Found a control-flow loop: label: %.preheader -> 
==15066== Thread 2 Optimizer:
==15066== Invalid read of size 8
==15066==    at 0x526D78A: vc4c::ControlFlowLoop::includes(vc4c::ControlFlowLoop const&) const (ControlFlowGraph.cpp:121)
==15066==    by 0x53F03C4: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:990)
==15066==    by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066==    by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066==    by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066==    by 0x542C365: runOptimizationPasses(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&, std::set<vc4c::optimizations::OptimizationPass, std::less<vc4c::optimizations::OptimizationPass>, std::allocator<vc4c::optimizations::OptimizationPass> > const&) (Optimizer.cpp:173)
==15066==    by 0x542C6BB: vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}::operator()() const (Optimizer.cpp:211)
==15066==    by 0x542D989: std::_Function_handler<void (), vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
==15066==    by 0x522B767: std::function<void ()>::operator()() const (functional:2267)
==15066==    by 0x522AD43: threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const (BackgroundWorker.h:62)
==15066==    by 0x52313B9: void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1531)
==15066==    by 0x523130F: std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()() (functional:1520)
==15066==  Address 0xcdb4488 is 0 bytes after a block of size 24 alloc'd
==15066==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15066==    by 0x527A997: __gnu_cxx::new_allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*>::allocate(unsigned long, void const*) (new_allocator.h:104)
==15066==    by 0x52797D4: std::allocator_traits<std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::allocate(std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*>&, unsigned long) (alloc_traits.h:491)
==15066==    by 0x52781F3: std::_Vector_base<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::_M_allocate(unsigned long) (stl_vector.h:170)
==15066==    by 0x5276ACF: vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const** std::vector<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::_M_allocate_and_copy<std::move_iterator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const**> >(unsigned long, std::move_iterator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const**>, std::move_iterator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const**>) (stl_vector.h:1224)
==15066==    by 0x5274484: std::vector<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::reserve(unsigned long) (vector.tcc:75)
==15066==    by 0x526EE83: vc4c::ControlFlowGraph::findLoopsHelper(vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, vc4c::PerformanceMap<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, int, (vc4c::OrderType)2, vc4c::hash<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >&, vc4c::PerformanceMap<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, int, (vc4c::OrderType)2, vc4c::hash<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >&, vc4c::PerformanceList<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, (vc4c::AccessType)2, (vc4c::InsertRemoveType)3>&, int&) (ControlFlowGraph.cpp:273)
==15066==    by 0x526DEAA: vc4c::ControlFlowGraph::findLoops() (ControlFlowGraph.cpp:192)
==15066==    by 0x53F02BA: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:982)
==15066==    by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066==    by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066==    by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066== 
==15066== Invalid read of size 8
==15066==    at 0x526D78D: vc4c::ControlFlowLoop::includes(vc4c::ControlFlowLoop const&) const (ControlFlowGraph.cpp:121)
==15066==    by 0x53F03C4: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:990)
==15066==    by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066==    by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066==    by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066==    by 0x542C365: runOptimizationPasses(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&, std::set<vc4c::optimizations::OptimizationPass, std::less<vc4c::optimizations::OptimizationPass>, std::allocator<vc4c::optimizations::OptimizationPass> > const&) (Optimizer.cpp:173)
==15066==    by 0x542C6BB: vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}::operator()() const (Optimizer.cpp:211)
==15066==    by 0x542D989: std::_Function_handler<void (), vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
==15066==    by 0x522B767: std::function<void ()>::operator()() const (functional:2267)
==15066==    by 0x522AD43: threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const (BackgroundWorker.h:62)
==15066==    by 0x52313B9: void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1531)
==15066==    by 0x523130F: std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()() (functional:1520)
==15066==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==15066== 
==15066== 
==15066== Process terminating with default action of signal 11 (SIGSEGV)
==15066==  Access not within mapped region at address 0x0
==15066==    at 0x526D78D: vc4c::ControlFlowLoop::includes(vc4c::ControlFlowLoop const&) const (ControlFlowGraph.cpp:121)
==15066==    by 0x53F03C4: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:990)
==15066==    by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066==    by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066==    by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066==    by 0x542C365: runOptimizationPasses(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&, std::set<vc4c::optimizations::OptimizationPass, std::less<vc4c::optimizations::OptimizationPass>, std::allocator<vc4c::optimizations::OptimizationPass> > const&) (Optimizer.cpp:173)
==15066==    by 0x542C6BB: vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}::operator()() const (Optimizer.cpp:211)
==15066==    by 0x542D989: std::_Function_handler<void (), vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
==15066==    by 0x522B767: std::function<void ()>::operator()() const (functional:2267)
==15066==    by 0x522AD43: threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const (BackgroundWorker.h:62)
==15066==    by 0x52313B9: void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1531)
==15066==    by 0x523130F: std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()() (functional:1520)
==15066==  If you believe this happened as a result of a stack
==15066==  overflow in your program's main thread (unlikely but
==15066==  possible), you can try to increase the size of the
==15066==  main thread stack using the --main-stacksize= flag.
==15066==  The main thread stack size used in this run was 8720384.
==15066== 
==15066== HEAP SUMMARY:
==15066==     in use at exit: 18,678,964 bytes in 192,720 blocks
==15066==   total heap usage: 229,603 allocs, 36,883 frees, 33,442,982 bytes allocated
==15066== 
==15066== LEAK SUMMARY:
==15066==    definitely lost: 0 bytes in 0 blocks
==15066==    indirectly lost: 0 bytes in 0 blocks
==15066==      possibly lost: 7,449,688 bytes in 53,136 blocks
==15066==    still reachable: 11,229,276 bytes in 139,584 blocks
==15066==         suppressed: 0 bytes in 0 blocks
==15066== Rerun with --leak-check=full to see details of leaked memory
==15066== 
==15066== For counts of detected and suppressed errors, rerun with: -v
==15066== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Killed

Remove load instruction of SmallImm

In this code, smallImm 2.0 is loaded my ldi.
Discussed in #45.

__kernel void loop1 (__global float * a) {
  int id = get_local_id(0);
  float16 v = vload16 (id, a);
  vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 37 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items) (lids)
or r0, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255    
and r0, r0, r1  
and r0, r0, r1  
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)  // can be fused
fmul r0, r4, r0                      //
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never

Better flattening of basic blocks

kernel void if1 (global float a[]) {
  int id = get_global_id(0);
  if ((id & 0x01) == 0)
    a[id] = 3.0;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'if1' with 46 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items) (lSize, lids, gidX, offX)
// label %start_of_function
or r1, unif, unif
or r2, unif, unif
or r0, unif, unif
or ra2, unif, unif
or ra1, unif, unif
// label %tmp.0
or ra0, r0, r0; v8min r0, r1, r1
ldi r1, 255
and r3, r0, r1
or r2, ra2, ra2; v8min r0, r2, r2
and r1, r0, r1
mul24 r0, ra0, r3
add r0, r2, r0
add r1, r0, r1
nop.never 
or r0, r1, r1
and.setf -, r0, 1 (1)
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + 4 // to %tmp.1      --- (1)
nop.never 
nop.never 
nop.never 
brr.ifanyz (pc+4) + 10 // to %tmp.2    --- (2)
nop.never 
nop.never 
nop.never 
// label %tmp.1
or r0, r1, r1
shl r0, r0, 2 (2)
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
itof vpm, 3 (3)
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add vpw_addr, ra1, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -42 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

In this code, the branch of (1) can be removed. We should,

  • check condition of (2) first
  • if the condition is false, fall through to the basic block (of writing to memory)

Anyway, we need better flattening of basic blocks.
Currently, we just use the iterator of traverseing all instructions.
We need more carefully flattening, which is keen to the shape of the graph.

Move CircleCI to LLVM-library front-end

Using the LLVM-library front-end for debian packages is currently the easiest and best way (for the user) to use the VC4C compiler:

  • the dependency on CLang/LLVM library installed can be expressed in the debian-package
  • the LLVM library front-end has the same test-sample acceptance rate as the SPIR-V front-end
  • the LLVM library front-end is the fastest front-end

Current problems:

  • CircleCI seems to provide llvm-config-3.9 (should be provided by llvm-3.9)
  • "Default" llvm-config seems not to find the LLVM library, headers, include flags, etc. (see CircleCI log section configure)
  • But for SPIRV-LLVM, llvm-config finds all required components

Do we have to install/provide llvm-dev cross-compiled for Raspberry Pi? Or do we have to re-build the docker image?

@nomaddo, any ideas on this?

precomplication failed

I am trying to run VC4C on fresh raspbian env, but somehow compile failed as follows:

pi@raspberrypi:~/VC4C$ ./build/VC4C -o /tmp/hoge ../VC4CL/mul2.cl 
[D] Tue Mar 13 22:47:53 2018: Compiling '../VC4CL/mul2.cl' into '/tmp/hoge' with options '' ...
[D] Tue Mar 13 22:47:53 2018: Temporary file '/tmp/vc4c-PXslFI' created
[I] Tue Mar 13 22:47:53 2018: Compiling OpenCL to LLVM-IR with :/usr/bin/clang-3.9 -cc1 -triple spir-unknown-unknown  -I ../VC4CL -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant  -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-PXslFI ../VC4CL/mul2.cl
[E] Tue Mar 13 22:47:59 2018: Errors in precompilation:
[E] Tue Mar 13 22:47:59 2018: ../VC4CL/mul2.cl:3:12: error: use of undeclared identifier 'i'
  a[i] = a[i] * 2;
           ^
../VC4CL/mul2.cl:3:5: error: use of undeclared identifier 'i'
  a[i] = a[i] * 2;
    ^
2 errors generated.

[E] Tue Mar 13 22:47:59 2018:  (1) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf4 [0x76c21d78]
[E] Tue Mar 13 22:47:59 2018:  (2) /home/pi/VC4C/build/libVC4CC.so.1.2 : +0x349ba0 [0x76c55ba0]
[E] Tue Mar 13 22:47:59 2018:  (3) /home/pi/VC4C/build/libVC4CC.so.1.2 : +0x349edc [0x76c55edc]
[E] Tue Mar 13 22:47:59 2018:  (4) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::Precompiler::run(std::unique_ptr<std::istream, std::default_delete<std::istream> >&, vc4c::SourceType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)+0x36c [0x76c5674c]
[E] Tue Mar 13 22:47:59 2018:  (5) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::Precompiler::precompile(std::istream&, std::unique_ptr<std::istream, std::default_delete<std::istream> >&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)+0x154 [0x76c54d34]
[E] Tue Mar 13 22:47:59 2018:  (6) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0xd4 [0x76c23ce4]
[E] Tue Mar 13 22:47:59 2018:  (7) ./build/VC4C : main+0xc4c [0x232e4]
[E] Tue Mar 13 22:47:59 2018:  (8) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x73f3c678]
[D] Tue Mar 13 22:47:59 2018: Temporary file '/tmp/vc4c-PXslFI' deleted
[E] Tue Mar 13 22:47:59 2018: Compiler threw exception: Pre-compilation: Error in precompilation: ../VC4CL/mul2.cl:3:12: error: use of undeclared identifier 'i'
  a[i] = a[i] * 2;
           ^
../VC4CL/mul2.cl:3:5: error: use of undeclared identifier 'i'
  a[i] = a[i] * 2;
    ^
2 errors generated.

terminate called after throwing an instance of 'vc4c::CompilationError'
  what():  Pre-compilation: Error in precompilation: ../VC4CL/mul2.cl:3:12: error: use of undeclared identifier 'i'
  a[i] = a[i] * 2;
           ^
../VC4CL/mul2.cl:3:5: error: use of undeclared identifier 'i'
  a[i] = a[i] * 2;
    ^
2 errors generated.

Aborted

mul2.cl is as follows:

kernel void f (global float * a) {
  int id = get_global_id(0);
  a[i] = a[i] * 2;
}

This issue happened in x86 env. @doe300 any idea?

Remove move instruction in mutex

In the following code, fadd r1, r2, r0 can be fused to move instruction that is located in mutex block.

kernel void sum_f16 (global float a[], global float b[], global float c[]) {
  int id = get_global_id(0);
  float16 x = vload16(id, b);
  float16 y = vload16(id, c);
  vstore16 (x + y, id, a);
}
or -, unif, unif
or r2, unif, unif
or ra5, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or r0, unif, unif
or -, unif, unif
or -, unif, unif
or ra4, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra3, unif, unif
or ra2, unif, unif
or ra1, unif, unif
or ra0, r0, r0; v8min r1, 0 (0), 0 (0)
shr r2, r2, r1
ldi r0, 255
and r3, r2, r0
or r2, ra4, ra4
shr r1, ra5, r1
and r1, r1, r0
mul24 r0, ra0, r3
add r0, r2, r0
add r0, r0, r1
shl r3, r0, 4 (4)
mul24 r2, 4 (4), elem_num
or r0, r3, r3
shl r0, r0, 2 (2)
add r0, ra2, r0
or r1, r0, r0; v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r1, r2
or tmu0s, r0, r0
nop.load_tmu0.never 
or r2, r4, r4
or r0, r3, r3
shl r0, r0, 2 (2)
add r0, ra1, r0
or ra0, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, ra0, r1
or tmu0s, r0, r0
nop.load_tmu0.never 
or r0, r4, r4
fadd r1, r2, r0                             !!! can be fused
or r0, r3, r3
shl r0, r0, 2 (2)
add r0, ra3, r0
or r0, r0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r1, r1                              !!! can be fused
ldi vpw_setup, 2156937216
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0                    
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -69
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

Remove moves of `r4`

In the following kernel code, redundant move instructions from r4 is problem.

void kernel test(float global * a, const int n){
  for (int i = 0; i < 10; i++)
    a[i] += 1;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'test' with 68 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), int n (4 B, 1 items)
// label: %start_of_function
or ra0, unif, unif
or -, unif, unif
// label: %tmp.0
or tmu0s, ra0, ra0
nop.load_tmu0.never 
or r0, r4, r4
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fadd vpm, r0, 1.000000 (32)
ldi vpw_setup, vdw_setup(rows: 10, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or r0, ra0, ra0
add tmu0s, r0, 4 (4)
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
or r0, ra0, ra0
add tmu0s, r0, 8 (8)
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
or r0, ra0, ra0
add tmu0s, r0, 12 (12)
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 8 (8), 8 (8)
add tmu0s, ra0, r0
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 10 (10), 10 (10)
add tmu0s, ra0, r0
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 12 (12), 12 (12)
add tmu0s, ra0, r0
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 14 (14), 14 (14)
add tmu0s, ra0, r0
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
ftoi r0, 32.000000 (37)
add tmu0s, ra0, r0
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
mul24 r0, 6 (6), 6 (6)
add tmu0s, ra0, r0
nop.load_tmu0.never 
or r0, r4, r4                                  // shoudl be removed
fadd vpm, r0, 1.000000 (32)
or vpw_addr, ra0, ra0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -64 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

Cross-compiled build does not run on Rpi B

The cross-compiled build does not run on original Raspberry Pi B (not B+, not 2B) due to probably architecture mismatch.
The original error message when trying to execute VC4C is:

Illegal instruction

cat /proc/cpuinfo spills out:

ARMv6-compatible processor rev 7 (v6l)

readelf -A /usr/local/bin/VC4C gives

Tag_CPU_name: "7-A"
Tag_CPU_arch: v7

Remove redundant instructions

Discussed in #45.

__kernel void loop1 (__global float * a) {
  int id = get_local_id(0);
  float16 v = vload16 (id, a);
  vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 37 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items) (lids)
or r0, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255         // 
and r0, r0, r1      // Redundant 
and r0, r0, r1      // 
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never

Rewrite structure of methods

See also #50

  • Rewrite method to store basic blocks in graph according to their predecessors/successors
    • Rewrite direct accesses to basic-block data structure
    • Rewrite iterations over all instructions in method
    • Make sure, remaining iterations/accesses do not need a specific order of basic blocks
    • Rewrite CFG to directly store blocks, not just references, rewrite method to use CFG
    • Rewrite existing generations of CFG by reusing CFG from method
  • Rewrite basic block to store instructions in graph according to their data and order dependencies
    • TDB

PyOpenCL demo benchmark fails to compile - bit-cast vector-size error

I'm testing out some basic PyOpenCL examples (PyOpenCL detects the library and parameters correctly). The demo benchmark looked like a nice simple kernel to try:
https://raw.githubusercontent.com/inducer/pyopencl/master/examples/benchmark.py

This gives an error:
[E] Mon Feb 5 06:43:19 2018: Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!
terminate called after throwing an instance of 'vc4c::CompilationError'
what(): Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!

The exception comes from here:

throw CompilationError(CompilationStep::LLVM_2_IR, "Bit-casts across different vector-sizes are not yet supported!");

I'm happy to look at fixing it, but some pointers as to what needs to happen would be great.

Can not get files in example folder working

I am new to OpenCL. Apologies if the question is dumb.

My environment is Raspberry Pi 3, rasbian version 8(jessie), clang 3.9.0-4. VC4C, VC4CL and VC4CLStdLib have been built according to your instruction. However, when I compile the file /example/hello_world.cl using the following command
VC4C --hex -o hello_world.hex hello_world.cl, I get the following error

[E] Fri Dec 22 21:45:46 2017: Cannot map phi-node to label: %0
[E] Fri Dec 22 21:45:46 2017:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::string const&)+0xb8 [0x76c843b8]
[E] Fri Dec 22 21:45:46 2017:  (2) /usr/local/lib/libVC4CC.so.1.2 : +0x394b88 [0x76e0fb88]
[E] Fri Dec 22 21:45:46 2017:  (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::optimizations::eliminatePhiNodes(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)+0xec [0x76e10028]
[E] Fri Dec 22 21:45:46 2017:  (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const+0x188 [0x76e366a0]
[E] Fri Dec 22 21:45:46 2017:  (5) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0x27c [0x76c86458]
[E] Fri Dec 22 21:45:46 2017:  (6) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::string const&, vc4c::Optional<std::string> const&)+0x3cc [0x76c86b78]
[E] Fri Dec 22 21:45:46 2017:  (7) VC4C : main+0xb3c [0x1ea38]
[E] Fri Dec 22 21:45:46 2017:  (8) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x76793294]
[D] Fri Dec 22 21:45:50 2017: Temporary file '/tmp/vc4c-OPk2ns' deleted
[E] Fri Dec 22 21:45:50 2017: Compiler threw exception: Optimizer: Failed to map all phi-options to valid basic-blocks
terminate called after throwing an instance of 'vc4c::CompilationError'
  what():  Optimizer: Failed to map all phi-options to valid basic-blocks
Aborted

What did I do wrong? Could you give some detailed instruction of setting up a working environment?

Thanks.
BTW, great project!

Eliminate loading of constants

When loading constant global data, the LLVM-IR/SPIR-V output loads initialized constant global data from memory. Since we know the value and it cannot change, we can use the data as literal constants and skip the reading from memory.

  • Find accesses to initialized constant global data
  • Replace with usage of constant value on either:
    • Access is without index (e.g. global is scalar/vector)
    • Index accessed is constant, element and value of index can be determined at compile-time
    • Global data has zero-initializer

Add linker

Since SPIRV-Tools now implements a linker, we could use it, to provide a linker for the VC4C compiler.

Provide an interface (can be separate from compiler-interface) to link together two intermediate binaries (LLVM-IR or SPIR-V).

LLVM/CLang support linking via llvm-link, SPIR-V via tools/spirv-link in SPIRV-Tools.

Problem:
Both utils only support "full linking", where implementations for all functions need to be available, which is not the case since the vc4cl_ intrinsic functions are not implemented in any intermediate format.

Use clang-format

We should use clang-format to make sure, we have a unified formatting.

Any comments on this?
Does anyone have a good template for clang-format, we could use?

This would also apply to the VC4CL project.

Add instruction scheduler

Not sure how to implement this, but here a few notes on an instruction scheduler:

Goal

Reorder instructions within a basic block to utilize the delay introduced by certain operations by inserting meaningful instructions minimizing the number of cycles spent waiting (via nop or on periphery registers).

Target features

  • remove delays introduced by accessing periphery (waiting on SFU/TMU/VPM)
  • remove delay introduced by read-after-write of physical register
  • allow queueing up to 4 memory load-requests per TMU and balance load on both TMUs (see #15 )
  • re-order instructions to maximize the possible candidates for instruction-combinations without violating the above distances

Implementation

  • create graph of dependencies between instructions
  • dependencies have weight determining the minimum distance between the two instructions to not require delay-slots.

See also the Wikipedia.

Optimize uniform layout

Currently, unused mov instructions from uniform are generated at the beginning of kernel functions.

__kernel void loop1 (__global float * a) {
  int id = get_local_id(0);
  float16 v = vload16 (id, a);
  vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 49 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or r0, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never 
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -45
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

These loads from uniform can be removed, ideally. And should be removed, because these are repeatedly executed (looped by brr.ifallzc (pc+4) + -45).

Compilation Failure on Arch Linux

$ gcc --version
gcc (GCC) 7.2.1 20180116
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


$ clang --version
clang version 5.0.1 (tags/RELEASE_501/final)
Target: armv7l-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin

$ uname -a
Linux alarmpi 4.14.22-1-ARCH #1 SMP Tue Feb 27 06:45:49 UTC 2018 armv7l GNU/Linux

I am trying to compile VC4C, and I get the following compiler error:

/home/alarm/opencl/VC4C/src/llvm/BitcodeReader.cpp: In member function 'vc4c::Method& vc4c::llvm2qasm::BitcodeReader::parseFunction(vc4c::Module&, const llvm::Function&)':
/home/alarm/opencl/VC4C/src/llvm/BitcodeReader.cpp:466:34: error: 'const class llvm::Function' has no member named 'getArgumentList'; did you mean 'Arguments'?
  method->parameters.reserve(func.getArgumentList().size());
                                  ^~~~~~~~~~~~~~~
                                  Arguments
/home/alarm/opencl/VC4C/src/llvm/BitcodeReader.cpp:467:39: error: 'const class llvm::Function' has no member named 'getArgumentList'; did you mean 'Arguments'?
  for(const llvm::Argument& arg : func.getArgumentList())
                                       ^~~~~~~~~~~~~~~
                                       Arguments
make[2]: *** [build/CMakeFiles/VC4CC.dir/build.make:1047: build/CMakeFiles/VC4CC.dir/llvm/BitcodeReader.cpp.o] Error 1

I think this is caused by this change to LLVM:

https://reviews.llvm.org/D31052

But I'm not sure.

Rewrite building of standard-library PCH

Especially now, that cross-compilation and debian packaging is in the making (see #8), the way the VC4CLStdLib precompiled header is being built needs to be changed.

Currently, the PCH-file is being built on the development machine:

  • a PCH-file only works with the exact version of the CLang used to compile it. For cross-compilation or packaging, this won't match the version of the destination machine(s).
  • the PCH-file requires the original source headers to be available and their (relative?) path not to change, also not given for cross-compiled or packaged distribution.

The solution would be to build the PCH-file on the destination machine:

  • requires the VC4CLStdLib header-files to be available there (e.g. installed via extra vc4cl-stdlib package).
  • the PCH-file would need to be built on first compilation using VC4C.
  • this also fixes the issues arising, that the compilation options for the PCH-file don't match the options for the actual compilations (see #6 ).

=> On every build (or in a post-installation script), if the PCH-file doesn't exist, create it first (using the same options as the actual compilations).

  • how to get path to the VC4CLStdLib headers? Need to distinguish between in-source-builds (e.g. on development machines, the headers lie in a project somewhere) and packaged build (e.g. on destination machines, the headers are installed into e.g. /usr/include/vc4cl-stdlib via extra package)
  • where to store the PCH-file to and how to save the path to the file? Similar to first point, for development build, we can simply store it on ./include/ or ./build. For packaged builds, it should reside somewhere in /usr/share/vc4c or a similar path

Build failed?

I am not sure what happened, circleci build failed.
Do you have any idea?

CircleCI

Speed-up memory access

Memory access can be sped up in several ways:

  • utilize delays for DMA access for other calculations
  • using VPM as cache and access DMA for blocks of memory at a time
  • use TMU to read from memory!

See also #14 and this post.

Enhance `combineLoadingLiterals`

The output of deepCL/backproweights.cl can be improved by enhacement of reducing the number of ldi.
For example, loading constant 256 happens 69 times.
If we can combine them to one instruction, rapid speedup can be archived.
https://gist.github.com/nomaddo/220b867143eff68f2c0f83f9188ab382

More small example, the following code can be improved:

kernel void f(global float * a) {
  for (int i = 0; i < 4; i++){
    a[i] += 129;
  }
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'f' with 37 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items)
// label: %start_of_function
or ra0, unif, unif
// label: %tmp.0
nop.never 
or tmu0s, ra0, ra0
nop.load_tmu0.never 
ldi r0, 1124139008                       // should be fused
fadd r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 4, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add tmu0s, ra0, 4 (4)
nop.load_tmu0.never 
ldi r1, 1124139008                       // should be fused
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 8 (8)
nop.load_tmu0.never 
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 12 (12)
nop.load_tmu0.never 
ldi r0, 1124139008                       // should be fused
fadd r0, r4, r0
or vpm, r0, r0
or vpw_addr, ra0, ra0
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

PyOpenCL: values only computed correctly for first worker in the group

I'm testing out some basic PyOpenCL examples (PyOpenCL detects the library and parameters correctly). The demo benchmark looked like a nice simple kernel to try:
https://raw.githubusercontent.com/inducer/pyopencl/master/examples/benchmark.py

The value only gets reliably assigned for the first worker in the group. Writing a constant value of 1 to the output only results in 1's being written to the first worker in each group (the rest of the values are 0). This might be a writing issue, or it might be a problem with the call to get_global_id(0)

A secondary issue is that the openCL driver will only allow worker groups in powers of 2 - so I cannot assign 12 workers here, only 8:
Traceback_ (most recent call last):
File "opencl_test.py", line 91, in
exec_evt = prg.sum(queue, global_size, local_size, a_buf, b_buf, dest_buf)
File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1766, in call
return self._enqueue(self, queue, global_size, local_size, *args, **kwargs)
File "", line 90, in enqueue_knl_sum
File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1952, in enqueue_nd_range_kernel
global_work_size, local_work_size, c_wait_for, num_wait_for))
File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 664, in _handle_error
raise e
pyopencl.cffi_cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE

I'm unsure if there's an incorrect assumption in the pyopencl code somewhere... but this reduces total throughput by a third. With 8 workers the result is:
gpu: 2.32745s
cpu: 0.469361066818s

Segmentation fault when running VC4C

Running VC4C on a Raspberry Pi 3 with a fresh Raspbian Stretch setup.
VC4C has been built with only the LLVM front-end.

I am trying to run the following command against fibonacci.cl from the examples folder:

VC4C --llvm --hex -o fibonacci.hex fibonacci.cl

I get the following message:

Compiling 'fibonacci.cl' into 'fibonacci.hex' with options '' ...
[E] Mon Nov 27 14:27:43 2017: Errors in precompilation:
[E] Mon Nov 27 14:27:43 2017: Stack dump:
0.	Program arguments: /usr/lib/llvm-3.9/bin/clang -cc1 -triple armv6-unknown-linux-gnueabihf -emit-llvm -disable-free -disable-llvm-verifier -discard-value-names -main-file-name fibonacci.cl -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -fuse-init-array -target-cpu arm1136jf-s -target-feature +strict-align -target-abi aapcs-linux -mfloat-abi hard -dwarf-column-info -debugger-tuning=gdb -coverage-file /tmp/vc4c-sbCfYb -resource-dir /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1 -include-pch /home/pi/VC4C/include/VC4CLStdLib.h.pch -I . -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1/include -internal-externc-isystem /usr/include/arm-linux-gnueabihf -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -fdebug-compilation-dir /home/pi/VC4CL-test -ferror-limit 19 -fmessage-length 0 -fallow-half-arguments-and-returns -fno-signed-char -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o /tmp/vc4c-sbCfYb -x cl fibonacci.cl 
clang: error: unable to execute command: Segmentation fault
clang: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 3.9.1-9+rpi1 (tags/RELEASE_391/rc2)
Target: armv6-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin

I have tried to run the given /usr/lib/llvm-3.9/bin/clang command line and got a segmentation fault as well.
The command succeeds if I remove the argument: -include-pch /home/pi/VC4C/include/VC4CLStdLib.h.pch:

/usr/lib/llvm-3.9/bin/clang -cc1 -triple armv6-unknown-linux-gnueabihf -emit-llvm -disable-free -disable-llvm-verifier -discard-value-names -main-file-name fibonacci.cl -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -fuse-init-array -target-cpu arm1136jf-s -target-feature +strict-align -target-abi aapcs-linux -mfloat-abi hard -dwarf-column-info -debugger-tuning=gdb -coverage-file /tmp/vc4c-kK1RbC -resource-dir /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1 -I . -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1/include -internal-externc-isystem /usr/include/arm-linux-gnueabihf -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -fdebug-compilation-dir /home/pi/VC4CL-test -ferror-limit 19 -fmessage-length 0 -fallow-half-arguments-and-returns -fno-signed-char -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o /tmp/vc4c-kK1RbC -x cl fibonacci.cl

Here is an extract of strace run with the faulty command:

stat64("/home/pi/VC4C/include", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/home/pi/VC4C/include/VC4CLStdLib.h.pch", O_RDONLY) = 5
readlink("/proc/self/fd/5", "/home/pi/VC4C/include/VC4CLStdLi"..., 4096) = 39
fstat64(5, {st_mode=S_IFREG|0644, st_size=12789540, ...}) = 0
mmap2(NULL, 12789540, PROT_READ, MAP_PRIVATE, 5, 0) = 0x737a9000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x157f000} ---

Fuse instructions

__kernel void loop1 (__global float * a) {
  int id = get_local_id(0);
  a[id] = a[id] * 2;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 30 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items) (lids)
// label: %start_of_function
or r0, unif, unif
or ra0, unif, unif
// label: %tmp.0
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 2 (2)
add r1, ra0, r0          /// (1)
nop.never 
or tmu0s, r1, r1          /// (2)
nop.load_tmu0.never 
or r0, r4, r4
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fmul vpm, r0, 2.000000 (33)
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r1, r1
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -26 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

(1) and (2) can be fused. To remove (1), we should

  • find move instruction (2)
  • get the definition of r1
  • if the number of references to r1 is only once, replace (2) to (1)

This is related to the plan of implementation of enhancement of general porpose optimizations #47.

Pre-compiler deletes /dev/stdout

Sometimes, when failing to compile an OpenCL C source-code, the pre-compiler (clang) deletes the symbolic link /dev/stdout. This results in any successive compilations with VC4C (and probably any other program using stdout) to fail.

To "fix" this, the /dev/stdout symbolic link needs to be re-created. This can be done by rebooting the Raspberry Pi.

Another "fix" is to configure VC4C with the CMake option PRECOMPILER_DROP_RIGHTS enabled, which will run the pre-compilation process with the user pi (instead of root, if run with sudo). This should prevent the process from deleting /dev/stdout.

Since I cannot reliably reproduce the issue, I don't know if the fix works. If anyone can create a stable reproduction, please notify me.

Remove dependency from `/opt/SPIRV-LLVM`

Currently, vc4c-0.4.deb depends of /opt/SPIRV-LLVM/build/bin/clang in preinst stage.
It is inconvenient for ordinary users because it requires compilation of SPIRV-LLVM by hand.

My suggestion is,

  • Use clang as default front-end
  • If some environment variable (VC4C_SPIRV_LLVM or so on) are set, use it as front-end

In my understand, VC4C can use official clang as front-end, but use of SPIRV-LLVM is better for better performance. Is it correct?

Describe Usage

Hi,

I want to try your works, but I cannot find the usage.
Can you add it?

I don't want to compile LLVM tool-chains in rasberrypi, so I want cross-compilation usages.

Use LLVM to read bitcode

Problem:

Currently, the LLVM IR front-end reads in the textual representation of the LLVM-IR, this has several drawbacks:

  • parsing textual input is error-prone, esp. with different syntax between LLVM-versions
  • parsing textual input is slow
  • in contrast to SPIRV-LLVM does the "standard" CLang not filter functions not used in the current kernels. This means, "standard" CLang writes all functions (including the standard-library) into the IR-file, resulting in about 7,5MB for the simple fibonacci.cl example and a very long loading time.

Solution:

Similar to the SPIR-V front-end, use a library to parse the LLVM IR (in binary format). The LLVM library could be used (component bitreader)

  • would depend on the llvm-dev package (which is available in Raspbian repo), but would be unaffected by changes in the IR (bitcode)
  • should be much faster than parsing the text
  • required components and flags are easily added to CMake via llvm-config (see here and here)

Steps to efficiently read binary IR:

  1. read binary module
  2. get OpenCL kernel meta-data (which functions are kernels)
  3. read kernel functions (and recursively all used functions/globals/types/...), so unread functions don't have to be converted

Installation of libcpplog

make install does not install libcpplog

I had to manually run sudo make install from build/build/cpplog.
I also observed that libcpplog is not installed in the CMAKE_INSTALL_PREFIX that I set when running cmake.

Compilation failed due to spirv-llvm?

Currently VC4C on x86_64 fail to compile as follows:
Debian packages are downloaded from https://circleci.com/gh/nomaddo/VC4C/137#artifacts/containers/0.

$ docker run -v /tmp:/tmp -v ${HOME}/clABCMarks:/marks --rm -it nomaddo/cross-rpi:0.1

$ dpkg-deb -i vc4cl-stdlib-0.4-Linux.deb
$ dpkg-deb -x  vc4c-0.4-Linux.deb /
$ /opt/SPIRV-LLVM/build/bin/clang -cc1 -triple spir-unknown-unknown -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-all -Wno-gcc-compat -x cl -emit-pch -o /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -I /usr/local/include/vc4cl-stdlib/ /usr/local/include/vc4cl-stdlib/opencl-c.h
$ VC4C --asm -o /tmp/hoge.s /marks/mul2.cl
VC4C --asm -o /tmp/hoge.s /marks/mul2.cl
[D] Fri Mar  2 13:26:31 2018: Compiling '/marks/mul2.cl' into '/tmp/hoge.s' with options '' ...
[D] Fri Mar  2 13:26:31 2018: Temporary file '/tmp/vc4c-PdE2jQ' created
[D] Fri Mar  2 13:26:31 2018: Temporary file '/tmp/vc4c-Ptoo1K' created
[I] Fri Mar  2 13:26:31 2018: Compiling OpenCL to LLVM-IR with :/opt/SPIRV-LLVM/build/bin/clang -cc1 -triple spir-unknown-unknown  -I /marks -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant  -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-Ptoo1K /marks/mul2.cl
[I] Fri Mar  2 13:26:31 2018: Converting LLVM-IR to SPIR-V with :/opt/SPIRV-LLVM/build/bin/llvm-spirv -o /tmp/vc4c-PdE2jQ /tmp/vc4c-Ptoo1K
[D] Fri Mar  2 13:26:31 2018: Temporary file '/tmp/vc4c-Ptoo1K' deleted
[I] Fri Mar  2 13:26:31 2018: Compilation complete!
[I] Fri Mar  2 13:26:31 2018: Using SPIR-V frontend...
[D] Fri Mar  2 13:26:31 2018: Read SPIR-V binary with 161 words
[D] Fri Mar  2 13:26:31 2018: Starting parsing...
[D] Fri Mar  2 13:26:31 2018: SPIR-V header parsed: magic-number 0x7230203, version 0x10000, generator 6000e, max-ID 21
[D] Fri Mar  2 13:26:31 2018: Using supported capability: Addresses
[D] Fri Mar  2 13:26:31 2018: Using supported capability: Linkage
[D] Fri Mar  2 13:26:31 2018: Using supported capability: Kernel
[D] Fri Mar  2 13:26:31 2018: Using supported capability: Vector16
[D] Fri Mar  2 13:26:31 2018: Importing extended instruction set: OpenCL.std
[D] Fri Mar  2 13:26:31 2018: Using a physical OpenCL memory model
[D] Fri Mar  2 13:26:31 2018: Kernel-method found: loop1
[E] Fri Mar  2 13:26:31 2018: Met unsupported instruction-decoration 11
[E] Fri Mar  2 13:26:31 2018: Unsupported operation:  at ?
[E] Fri Mar  2 13:26:31 2018:  (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x121 [0x7fead7f2fe13]
[E] Fri Mar  2 13:26:31 2018:  (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::spirv2qasm::SPIRVParser::parse(vc4c::Module&)+0x53b [0x7fead813ca6d]
[E] Fri Mar  2 13:26:31 2018:  (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0x11f [0x7fead7f3165d]
[E] Fri Mar  2 13:26:31 2018:  (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x250 [0x7fead7f31d44]
[E] Fri Mar  2 13:26:31 2018:  (5) VC4C : main+0xcf6 [0x5642ad15ddfc]
[E] Fri Mar  2 13:26:31 2018:  (6) /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main+0xf1 [0x7fead66fc2b1]
[E] Fri Mar  2 13:26:31 2018:  (7) VC4C : _start+0x2a [0x5642ad15c72a]
[D] Fri Mar  2 13:26:31 2018: Temporary file '/tmp/vc4c-PdE2jQ' deleted
[E] Fri Mar  2 13:26:31 2018: Compiler threw exception: Parser: Unsupported operation: 
terminate called after throwing an instance of 'vc4c::CompilationError'
  what():  Parser: Unsupported operation: 
Aborted (core dumped)

@doe300 Do you have any idea?

CLANG_PATH not found during build

A current git pull on a box with a previously working build has produced compile errors. CLANG_PATH appears to be undefined in the Precompiler source for the current commit. I have clang-3.9 installed in /usr/bin/.

Command used for building:
cd VC4C && git pull && git checkout . && cmake . && make clean && make

Output:
[ 37%] Building CXX object build/CMakeFiles/VC4CC.dir/Module.cpp.o
[ 37%] Building CXX object build/CMakeFiles/VC4CC.dir/Locals.cpp.o
[ 38%] Building CXX object build/CMakeFiles/VC4CC.dir/Precompiler.cpp.o
/home/pi/VC4C/src/Precompiler.cpp: In static member function ‘static vc4c::SourceType vc4c::Precompiler::getSourceType(std::istream&)’:
/home/pi/VC4C/src/Precompiler.cpp:125:38: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
else if(std::atol(buffer.data()) == QPUASM_MAGIC_NUMBER || std::atol(buffer.data()) == QPUASM_NUMBER_MAGIC)
~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/home/pi/VC4C/src/Precompiler.cpp:125:89: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
else if(std::atol(buffer.data()) == QPUASM_MAGIC_NUMBER || std::atol(buffer.data()) == QPUASM_NUMBER_MAGIC)
~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/home/pi/VC4C/src/Precompiler.cpp: In function ‘void compileOpenCLToLLVMIR(std::istream&, std::ostream&, const string&, bool, const vc4c::Optional<std::__cxx11::basic_string >&, const vc4c::Optional<std::__cxx11::basic_string >&)’:
/home/pi/VC4C/src/Precompiler.cpp:254:31: error: ‘CLANG_PATH’ was not declared in this scope
const std::string compiler = CLANG_PATH;

Remove redundant instructions for checking for-loop-exit-condition

The for-loop creates redundant instructions in the instructions of checking exit condition.

ex)

kernel void loop1 (global float a[], global float b[]) {
  for (int i = 0; i < 1000; i++) {
    a[i] = -i;
    b[i] = -i;
  }
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 67 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), __global out float* b (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r2, unif, unif
or r1, 0 (0), 0 (0)
nop.never 
or r0, r1, r1
sub r0, 0 (0), r0
itof r3, r0; v8min r0, r1, r1
shl r0, r0, 2 (2)
add r0, ra0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or r0, r1, r1
or r1, r1, r1
shl r0, r0, 2 (2)
add r1, r1, 1 (1)
add r0, r2, r0
ldi ra1, 1000
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
xor.setf -, r1, ra1                                             /// redundant ?
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)           /// 
or.setf -, r0, r0                                               ///
or.ifz r1, r1, r1                                               ///
or.setf -, elem_num, r0                                         ///
brr.ifallzc (pc+4) + 4
nop.never 
nop.never 
nop.never 
brr.ifanyz (pc+4) + -40
nop.never 
nop.never 
nop.never 
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -63
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

To me, above instructions (specifying ///) can be expressed as isub.setf -, r1, ra1.
Then, if the condition is negative, jump to the head of loop. Otherwise go through.

Convert multiple additions to one multiplication

In the following code, multiple additions can be combined to one multiplication instruction.

void kernel test(float global * a, const int n){
  float sum = 0.0;
  float c = a[0];
  for (int i = 0; i < 10; i++)
    sum += 10.0 + c;
  a[get_global_id(0)] = sum;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'test' with 47 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), int n (4 B, 1 items) (lSize, lids, gidX, offX)
// label: %start_of_function
or r3, unif, unif
or ra3, unif, unif
or r1, unif, unif
or r2, unif, unif
or ra1, unif, unif
or -, unif, unif
// label: %tmp.0
or tmu0s, ra1, ra1
nop.load_tmu0.never 
itof r0, 10 (10)
fadd rb0, r4, r0
or ra2, r1, r1
or r0, rb0, rb0
fadd r0, rb0, r0   // can be improved
fadd r0, rb0, r0   //
fadd r0, rb0, r0   //
fadd r0, rb0, r0   //
fadd r0, rb0, r0   //
fadd r0, rb0, r0   //
fadd r0, rb0, r0   //
fadd ra0, rb0, r0 //
or r0, r3, r3
ldi r1, 255
and r3, r0, r1
or r2, r2, r2; v8min r0, ra3, ra3
and r1, r0, r1
mul24 r0, ra2, r3
add r0, r2, r0
add r0, r0, r1
shl r0, r0, 2 (2)
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fadd vpm, rb0, ra0
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add vpw_addr, ra1, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -43 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.