doe300 / vc4c Goto Github PK
View Code? Open in Web Editor NEWCompiler for the VC4CL OpenCL implementation
License: MIT License
Compiler for the VC4CL OpenCL implementation
License: MIT License
[E] Fri Mar 16 18:59:04 2018: error: unable to read PCH file /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch: 'No such file or directory' fatal error: PCH file
I'm working to build this on a rpi3 with debian/Aarch64 and got a few hundred of these messages when compiling vc4c:
/usr/bin/ld: CMakeFiles/objvc4asm.dir/Parser.cpp.o: relocation R_AARCH64_ADR_PREL_PG_HI21 against symbol `_ZTVN6Parser7MessageE' which may bind externally can not be used when making a shared object; recompile with -fPIC
The fix looks simple enough?
kernel void sum_f16 (global float a[], global float b[], global float c[]) {
int id = get_global_id(0);
float16 x = vload16(id, b);
float16 y = vload16(id, c);
vstore16 (x * y + x * y + x * y, id, a);
}
https://gist.github.com/nomaddo/9a7921afd53d08cb34d736e7432b8036
In this code, it seems that the same calculation is performed repeatedly (line 59 and 60).
I imaged this such typical optimization (removing subcommon expression) is done by clang
.
If the following program is passed to latest VC4C
, the exception Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
is raised.
Is it probably a bug of the register allocation?
kernel void loop1 (global float a[], global float b[]) {
for (int i = 0; i < 1000; i++) {
a[i] = -i;
b[i] = -i;
}
}
./build/VC4C -o /tmp/hoge.o ~/idein/clABCMarks/loop1.cl
nomaddo@nomaddo-AS:~/idein/VC4C$ ./build/VC4C -o /tmp/hoge.o ~/idein/clABCMarks/loop1.cl
[D] Tue Feb 13 13:24:06 2018: Compiling '/home/nomaddo/idein/clABCMarks/loop1.cl' into '/tmp/hoge.o' with options '' ...
[D] Tue Feb 13 13:24:06 2018: Temporary file '/tmp/vc4c-GFIvlj' created
[D] Tue Feb 13 13:24:06 2018: Temporary file '/tmp/vc4c-gseRZG' created
[I] Tue Feb 13 13:24:06 2018: Compiling OpenCL to LLVM-IR with :/opt/SPIRV-LLVM/build/bin/clang -cc1 -triple spir-unknown-unknown -I /home/nomaddo/idein/clABCMarks -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /home/nomaddo/idein/VC4C/../VC4CLStdLib/include/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-gseRZG /home/nomaddo/idein/clABCMarks/loop1.cl
[I] Tue Feb 13 13:24:07 2018: Converting LLVM-IR to SPIR-V with :/opt/SPIRV-LLVM/build/bin/llvm-spirv -o /tmp/vc4c-GFIvlj /tmp/vc4c-gseRZG
[D] Tue Feb 13 13:24:07 2018: Temporary file '/tmp/vc4c-gseRZG' deleted
[I] Tue Feb 13 13:24:07 2018: Compilation complete!
[I] Tue Feb 13 13:24:07 2018: Using SPIR-V frontend...
[D] Tue Feb 13 13:24:07 2018: Read SPIR-V binary with 219 words
[D] Tue Feb 13 13:24:07 2018: Starting parsing...
[D] Tue Feb 13 13:24:07 2018: SPIR-V header parsed: magic-number 0x7230203, version 0x10000, generator 6000e, max-ID 29
[D] Tue Feb 13 13:24:07 2018: Using supported capability: Addresses
[D] Tue Feb 13 13:24:07 2018: Using supported capability: Linkage
[D] Tue Feb 13 13:24:07 2018: Using supported capability: Kernel
[D] Tue Feb 13 13:24:07 2018: Importing extended instruction set: OpenCL.std
[D] Tue Feb 13 13:24:07 2018: Using a physical OpenCL memory model
[D] Tue Feb 13 13:24:07 2018: Kernel-method found: loop1
[D] Tue Feb 13 13:24:07 2018: Reading function: %4 (arm_get_core_id)
[D] Tue Feb 13 13:24:07 2018: Reading function: %11 (loop1)
[D] Tue Feb 13 13:24:07 2018: Reading parameter: float* %12
[D] Tue Feb 13 13:24:07 2018: Reading parameter: float* %13
[D] Tue Feb 13 13:24:07 2018: SPIR-V binary successfully parsed
[D] Tue Feb 13 13:24:07 2018: Mapping instructions to intermediate...
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %0
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %5
[D] Tue Feb 13 13:24:07 2018: Generating intermediate return of value: i32 0
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %0
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %14
[D] Tue Feb 13 13:24:07 2018: Generating intermediate branch to %15
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %15
[D] Tue Feb 13 13:24:07 2018: Generating Phi-Node with 2 options into i32 %18
[D] Tue Feb 13 13:24:07 2018: Generating intermediate binary operation 'sub' with i32 0 and i32 %18 into i32 %19
[D] Tue Feb 13 13:24:07 2018: Generating intermediate unary operation 'sitofp' with i32 %19 into float %20
[D] Tue Feb 13 13:24:07 2018: Generating calculating indices of float* %12 into float* %21
[D] Tue Feb 13 13:24:07 2018: Generating writing of float %20 into float* %21
[D] Tue Feb 13 13:24:07 2018: Generating calculating indices of float* %13 into float* %22
[D] Tue Feb 13 13:24:07 2018: Generating writing of float %20 into float* %22
[D] Tue Feb 13 13:24:07 2018: Generating intermediate binary operation 'add' with i32 %18 and i32 1 into i32 %17
[D] Tue Feb 13 13:24:07 2018: Generating intermediate comparison 'eq' of i32 %17 and i32 1000 into bool %27
[D] Tue Feb 13 13:24:07 2018: Generating intermediate conditional branch on %27 to either %16 or %15
[D] Tue Feb 13 13:24:07 2018: Generating intermediate label %16
[D] Tue Feb 13 13:24:07 2018: Generating intermediate return
[D] Tue Feb 13 13:24:07 2018: Eliminating phi-node by inserting moves: i32 %18 = phi %15 -> i32 %17, %14 -> i32 0
[D] Tue Feb 13 13:24:07 2018: Inserting 'move' into end of basic-block: %15
[D] Tue Feb 13 13:24:07 2018: Inserting 'move' into end of basic-block: %14
[I] Tue Feb 13 13:24:07 2018: -----
[I] Tue Feb 13 13:24:07 2018: Inlining functions for kernel: loop1
[I] Tue Feb 13 13:24:07 2018: -----
[D] Tue Feb 13 13:24:07 2018: -----
[I] Tue Feb 13 13:24:07 2018: Running optimization passes for: loop1
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: MapMemoryAccess
[D] Tue Feb 13 13:24:07 2018: Generating memory access which cannot be lowered into VPM: store float %20 into float* %22
[D] Tue Feb 13 13:24:07 2018: Increased the scratch size to 1 rows (64 bytes)
[D] Tue Feb 13 13:24:07 2018: 0 16 0 1
[D] Tue Feb 13 13:24:07 2018: Generating memory access which cannot be lowered into VPM: store float %20 into float* %21
[D] Tue Feb 13 13:24:07 2018: 0 16 0 1
[D] Tue Feb 13 13:24:07 2018: Found base address - with offset -1 for writing into memory
[D] Tue Feb 13 13:24:07 2018: Found base address - with offset -1 for writing into memory
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: ResolveStackAllocations
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: SingleSteps
[D] Tue Feb 13 13:24:07 2018: Running steps: EliminateReturns, CombineDuplicateBranches, EliminateUselessBranch, IntrinsifyBuiltin, HandleLiteralVector, MapGlobalDataToAddress, CalculateConstantValue, EliminateUselessInstruction, LoadImmediateValues, HandleUseWithImmediateValues, MoveRotationSourcesToAccs, CheckMethodCalls, CombineSelectionWithZero, CombineSettingSameFlags,
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Tue Feb 13 13:24:07 2018: Removing branch to next instruction: br %15
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Tue Feb 13 13:24:07 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %19 = sub i32 0 (0), i32 %18
[D] Tue Feb 13 13:24:07 2018: Intrinsifying multiplication with left-shift
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Tue Feb 13 13:24:07 2018: Re-using temporary to split up use of long-living local with immediate value: i32 %index_offset.0 = shl i32 %18, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: Intrinsifying multiplication with left-shift
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Tue Feb 13 13:24:07 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %index_offset.1 = shl i32 %18, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %17 = add i32 %18, i32 1 (1)
[D] Tue Feb 13 13:24:07 2018: Intrinsifying comparison 'eq' to arithmetic operations
[D] Tue Feb 13 13:24:07 2018: Loading immediate value: 1000
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value true to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value true to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Mapping constant for immediate value true to: 1 (1)
[D] Tue Feb 13 13:24:07 2018: Replacing return in kernel-function with branch to end-label
[D] Tue Feb 13 13:24:07 2018: Removing branch to next instruction: br %end_of_function
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: CombineLiteralLoads
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: CombineRotations
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: EliminateDeadStores
[D] Tue Feb 13 13:24:07 2018: Cleaned 7 unused locals from method loop1
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: VectorizeLoops
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: SplitReadAfterWrites
[D] Tue Feb 13 13:24:07 2018: Inserting NOP to split up read-after-write before: i32 %use_with_literal.3 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Inserting NOP to split up read-after-write before: register - = bool %27 (setf )
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: ReorderInstructions
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpm, register mutex_acq, register vpw_busy, register vpw_wait): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %use_with_literal.8): i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpw_wait, register vpw_busy, register vpm, register mutex_acq, i32 %index_offset.1, i32 %use_with_literal.6, float* %22): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register vpm, register mutex_acq, register vpw_busy, register vpw_wait): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (register mutex_acq, i32 %immediate.10): i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: Found instruction not using any of the excluded values (i32 %17, register mutex_acq, i32 %use_with_literal.8): i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: Replacing NOP with: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: CombineALUIinstructions
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: UnrollWorkGroups
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: AddStartStopSegment
[D] Tue Feb 13 13:24:07 2018:
[D] Tue Feb 13 13:24:07 2018: Running pass: ExtendBranches
[I] Tue Feb 13 13:24:07 2018:
[I] Tue Feb 13 13:24:07 2018: Optimizations done, changed number of instructions from 21 to 95
[D] Tue Feb 13 13:24:07 2018: -----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %start_of_function
[D] Tue Feb 13 13:24:07 2018: i32 %work_dim = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %local_sizes = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %local_ids = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %num_groups_x = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %num_groups_y = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %num_groups_z = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %group_id_x = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %group_id_y = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %group_id_z = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_offset_x = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_offset_y = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_offset_z = register unif
[D] Tue Feb 13 13:24:07 2018: i32 %global_data_address = register unif
[D] Tue Feb 13 13:24:07 2018: float* %12 = register unif
[D] Tue Feb 13 13:24:07 2018: float* %13 = register unif
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %0
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %14
[D] Tue Feb 13 13:24:07 2018: i32 %18 = i32 0 (0) (phi)
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %15
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.3 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %19 = sub i32 0 (0), i32 %use_with_literal.3
[D] Tue Feb 13 13:24:07 2018: float %20 = itof i32 %19
[D] Tue Feb 13 13:24:07 2018: i32 %index_offset.0 = shl i32 %use_with_literal.3, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: float* %21 = add float* %12, i32 %index_offset.0
[D] Tue Feb 13 13:24:07 2018: mutex_acq
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpm = float %20
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(rows: 1, elements: 1 words, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(stride: 0)
[D] Tue Feb 13 13:24:07 2018: register vpw_addr = float* %21
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: register - = register vpw_wait
[D] Tue Feb 13 13:24:07 2018: mutex_rel
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.6 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %index_offset.1 = shl i32 %use_with_literal.6, i32 2 (2)
[D] Tue Feb 13 13:24:07 2018: float* %22 = add float* %13, i32 %index_offset.1
[D] Tue Feb 13 13:24:07 2018: mutex_acq
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpm = float %20
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(rows: 1, elements: 1 words, address: h32(0))
[D] Tue Feb 13 13:24:07 2018: register vpw_setup = loadi vdw_setup(stride: 0)
[D] Tue Feb 13 13:24:07 2018: register vpw_addr = float* %22
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: register - = register vpw_wait
[D] Tue Feb 13 13:24:07 2018: mutex_rel
[D] Tue Feb 13 13:24:07 2018: i32 %use_with_literal.8 = i32 %18
[D] Tue Feb 13 13:24:07 2018: i32 %17 = add i32 %use_with_literal.8, i32 1 (1)
[D] Tue Feb 13 13:24:07 2018: i32 %immediate.10 = loadi i32 1000
[D] Tue Feb 13 13:24:07 2018: register - = xor i32 %17, i32 %immediate.10 (setf )
[D] Tue Feb 13 13:24:07 2018: bool %27 = bool 1 (1) (ifz )
[D] Tue Feb 13 13:24:07 2018: bool %27 = xor bool 1 (1), bool 1 (1) (ifzc )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: register - = bool %27 (setf )
[D] Tue Feb 13 13:24:07 2018: i32 %18 = i32 %17 (ifz phi)
[D] Tue Feb 13 13:24:07 2018: register - = or register elem_num, bool %27 (setf )
[D] Tue Feb 13 13:24:07 2018: br.ifzc %16 (on bool %27) (ifzc )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: br.ifz %15 (on bool %27) (ifz )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %16
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Basic block ----
[D] Tue Feb 13 13:24:07 2018: label: %end_of_function
[D] Tue Feb 13 13:24:07 2018: i32 %group_loop_size = register unif
[D] Tue Feb 13 13:24:07 2018: register - = or register elem_num, i32 %group_loop_size (setf )
[D] Tue Feb 13 13:24:07 2018: br.ifzc %start_of_function (on i32 %group_loop_size) (ifzc )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: register irq = not register qpu_num
[D] Tue Feb 13 13:24:07 2018: nop (thrend )
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: nop
[D] Tue Feb 13 13:24:07 2018: Block end ----
[D] Tue Feb 13 13:24:07 2018: Cleaned 0 unused locals from method loop1
[D] Tue Feb 13 13:24:07 2018: Created node: %group_loop_size init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %17 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %index_offset.1 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %20 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %19 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %18 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %27 init: acc,B, avail: acc,B (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %13 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %12 init: A,B, avail: A,B (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %global_data_address is never read!
[D] Tue Feb 13 13:24:07 2018: Local %global_offset_z is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %21 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %global_offset_y is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %22 init: any, avail: any (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %global_offset_x is never read!
[D] Tue Feb 13 13:24:07 2018: Local %group_id_z is never read!
[D] Tue Feb 13 13:24:07 2018: Local %group_id_y is never read!
[D] Tue Feb 13 13:24:07 2018: Local %group_id_x is never read!
[D] Tue Feb 13 13:24:07 2018: Local %num_groups_z is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %immediate.10 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %num_groups_y is never read!
[D] Tue Feb 13 13:24:07 2018: Local %num_groups_x is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %index_offset.0 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %local_ids is never read!
[D] Tue Feb 13 13:24:07 2018: Created node: %use_with_literal.8 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %use_with_literal.6 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Created node: %use_with_literal.3 init: acc, avail: acc (1111, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Local %local_sizes is never read!
[D] Tue Feb 13 13:24:07 2018: Local %work_dim is never read!
[D] Tue Feb 13 13:24:07 2018: Colored graph with 31 nodes created!
[D] Tue Feb 13 13:24:07 2018: -----
[D] Tue Feb 13 13:24:07 2018: Mapping label '%start_of_function' to byte-position 0
[D] Tue Feb 13 13:24:07 2018: Mapping label '%0' to byte-position 120
[D] Tue Feb 13 13:24:07 2018: Mapping label '%14' to byte-position 120
[D] Tue Feb 13 13:24:07 2018: Mapping label '%15' to byte-position 136
[D] Tue Feb 13 13:24:07 2018: Mapping label '%16' to byte-position 632
[D] Tue Feb 13 13:24:07 2018: Mapping label '%end_of_function' to byte-position 632
[D] Tue Feb 13 13:24:07 2018: Mapped 6 labels to positions
[D] Tue Feb 13 13:24:07 2018: Assigned local %local_sizes to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %use_with_literal.6 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %use_with_literal.8 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %local_ids to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %num_groups_x to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %num_groups_y to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %num_groups_z to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %19 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %group_id_x to register -
[D] Tue Feb 13 13:24:07 2018: Assigned local %13 to register ra0
[D] Tue Feb 13 13:24:07 2018: Assigned local %18 to register r3
[D] Tue Feb 13 13:24:07 2018: Assigned local %use_with_literal.3 to register r1
[D] Tue Feb 13 13:24:07 2018: Assigned local %21 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %index_offset.1 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %immediate.10 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %22 to register r0
[D] Tue Feb 13 13:24:07 2018: Assigned local %17 to register r1
[D] Tue Feb 13 13:24:07 2018: Assigned local %work_dim to register -
[E] Tue Feb 13 13:24:07 2018: (1) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x130 [0x7fa0eaa3dcda]
[E] Tue Feb 13 13:24:07 2018: (2) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::qpu_asm::ColoredNode::getRegisterFixed() const+0x19d [0x7fa0eaad6819]
[E] Tue Feb 13 13:24:07 2018: (3) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::qpu_asm::GraphColoring::toRegisterMap() const+0x1ec [0x7fa0eaadef2a]
[E] Tue Feb 13 13:24:07 2018: (4) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : vc4c::qpu_asm::CodeGenerator::generateInstructions(vc4c::Method&)+0x4e7 [0x7fa0eaacd4b9]
[E] Tue Feb 13 13:24:07 2018: (5) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : +0x8553db [0x7fa0eaa3f3db]
[E] Tue Feb 13 13:24:07 2018: (6) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : +0x8555f2 [0x7fa0eaa3f5f2]
[E] Tue Feb 13 13:24:07 2018: (7) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : +0x8570b0 [0x7fa0eaa410b0]
[E] Tue Feb 13 13:24:07 2018: (8) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : std::function<void ()>::operator()() const+0x32 [0x7fa0eaa4323a]
[E] Tue Feb 13 13:24:07 2018: (9) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const+0x60 [0x7fa0eaa428b0]
[E] Tue Feb 13 13:24:07 2018: (10) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)+0x28 [0x7fa0eaa49562]
[E] Tue Feb 13 13:24:07 2018: (11) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()()+0x2c [0x7fa0eaa494b8]
[E] Tue Feb 13 13:24:07 2018: (12) /home/nomaddo/idein/VC4C/build/libVC4CC.so.1.2 : std::thread::_Impl<std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()> >::_M_run()+0x1c [0x7fa0eaa493ee]
[E] Tue Feb 13 13:24:07 2018: (13) /usr/lib/x86_64-linux-gnu/libstdc++.so.6 : +0xb8c80 [0x7fa0e9f20c80]
[E] Tue Feb 13 13:24:07 2018: (14) /lib/x86_64-linux-gnu/libpthread.so.0 : +0x76ba [0x7fa0e91656ba]
[E] Tue Feb 13 13:24:07 2018: (15) /lib/x86_64-linux-gnu/libc.so.6 : clone+0x6d [0x7fa0e968641d]
[E] Tue Feb 13 13:24:07 2018: Background worker threw error: Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
[D] Tue Feb 13 13:24:07 2018: Temporary file '/tmp/vc4c-GFIvlj' deleted
[E] Tue Feb 13 13:24:07 2018: Compiler threw exception: Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
terminate called after throwing an instance of 'vc4c::CompilationError'
what(): Label/Register Mapping: Cannot get register of non-fixed node: %14 init: any, avail: any (1101, 11111111111111111111111111111111, 11111111111111111111111111111111)
Aborted (core dumped)
Add an optimization steps to vectorize loops. Status:
Notes:
Currently works only on basic for-range loops with literal initial value, a literal constant step and a literal constant upper bound (e.g. for(int i = 0; i < 1000; ++i
).
Can't vectorize loops using one of those instructions: vector-rotations, memory barriers and semaphore-instructions.
The optimization is enabled via a new configuration-entry in config.h
, which is disabled by default for now.
Needs extensive testing!
I am thinking how to develop VC4C
in efficient way.
I really make it worth to adapt realistic application.
To do that, how about setting milestone for first release.
If you permit, I want to list current improvements what we should now.
In my opinion, we need evaluate output of VC4C
and compare this with ideal output.
For example, current VC4C
of master output is as follows:
__kernel void loop1 (__global float * a) {
int id = get_local_id(0);
float16 v = vload16 (id, a);
vstore16(v * 2, id, a);
}
``asm
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 49 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or r0, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -45
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
In my opinion, this should be like:
```asm
;; This buffer is for text that is not saved, and for Lisp evaluation.
;; To create a file, visit it with <open> and enter text in its buffer.
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or r0, unif, unif
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
or -, unif, unif // should be removed
// label1: should be beginning of loop
or ra0, unif, unif
or r0, r0, r0 // just duplication, should be removed
ldi r1, 255 //
and r0, r0, r1 // these have no effects
and r0, r0, r1 //
shl r0, r0, 4 (4) //
shl r0, r0, 2 (2) // why shifted ?
add r0, ra0, r0 //
or r2, r0, r0 // r0 is reused by `v8add`. why need copy?
mul24 r1, 4 (4), elem_num // this should compute only once, should move to outside of loop
v8adds r0, 8 (8), 8 (8) // this should compute only once, should move to outside of loop
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0) // r0 is re-assigned in next instruction, no affects
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33) // 2.0 includes smallImm, can be fused
fmul r0, r4, r0 //
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -45 // should be jump to label1
nop.never // should be replaced to meaningful op
nop.never // should be replaced to meaningful op
nop.never // should be replaced to meaningful op
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
So we need:
How about make first release after implementation of such improvement?
@doe300 What do you think of that?
We need to implement register spilling to be able to support more complex kernels.
If the size of spilled registers (times 12 QPUs!) is small enough, we could store them in VPM and save from accessing memory. Access to the spilled registers would still need to be synchronized via the hardware-mutex.
The actual problem of this implementation is not the spilling/loading of locals, but in determining the minimum number of registers to spill.
(see doe300/VC4CL#24)
Recently, I read the output of VC4C
. To me, it includes inefficient instructions.
I would like to know your design concept of the output.
kernel void add (global float a[], global float b[]) {
int id = get_global_id(0);
float8 v1 = vload8(id, a);
float8 v2 = vload8(id, b);
v1 += v2;
vstore8 (v1, id, a);
}
https://gist.github.com/nomaddo/728ffc2fa605ab5b87f316a6280246be
My question is:
VC4CL
runtime passes to the kernel.And some optimizations seem to lack:
or r0, r2, r2
shl r0, r0, 2
is equal to
shl r0, r2, r0
or r2, r2, r2
has no meaning effects.I am trying to implement a dot product operation. However, it seems like working on local memories always get weird results. The code snippet looks like this:
__kernel void dot3(__global float* a_vec, __global float* b_vec,
__global float* output, __local float* l_vec) {
size_t gid = get_global_id(0);
size_t lid = get_local_id(0);
l_vec[lid] = a_vec[gid] * b_vec[gid];
barrier(CLK_LOCAL_MEM_FENCE);
if(lid==0) {
l_vec[lid] = l_vec[lid] + l_vec[lid+1]; //sometimes, this line is ignored.
}
output[gid] = l_vec[lid];
}
The output returns different values each run. The system does not sum the first two items in each workgroup.
Thank you.
I am trying to implement instruction scheduler.
I am wondering how to treat instructions.
BasicBlock.instructions
has type
RandomModificationList<std::unique_ptr<intermediate::IntermediateInstruction>>
and InstructionWalker
treat raw pointers of intermediate::intermediateinstruction
.
I want to change it to shared_ptr
, and change raw pointers to shared_ptr
.
Do you have any reason to use unique_ptr
and raw pointers?
@doe300
EDIT:
I am struggling to cache instruction: When I cache instructions (as raw pointers because only them are available out-side from the class) and remove instructions from BasicBlock
, data is released. The problem is come from raw pointers.
ldi
remain in for-loop even though it can be moved to outside of the loop.
kernel void loop1 (global float a[], global float b[]) {
for (int i = 0; i < 1000; i++) {
a[i] = -i;
b[i] = -i;
}
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 67 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), __global out float* b (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r2, unif, unif
or r1, 0 (0), 0 (0)
nop.never
or r0, r1, r1
sub r0, 0 (0), r0
itof r3, r0; v8min r0, r1, r1
shl r0, r0, 2 (2)
add r0, ra0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or r0, r1, r1
or r1, r1, r1
shl r0, r0, 2 (2)
add r1, r1, 1 (1)
add r0, r2, r0
ldi ra1, 1000 //// load the constant every time !!!
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
xor.setf -, r1, ra1
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)
or.setf -, r0, r0
or.ifz r1, r1, r1
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + 4
nop.never
nop.never
nop.never
brr.ifanyz (pc+4) + -40
nop.never
nop.never
nop.never
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -63
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
The next step after #8 and fixing #9 would be to create a github-release nightly
, were after every successful build on CircleCI (or at least once per day), the resulting artifacts (library, executable and debian-package) are released.
A more light-weight version of this could also be implemented by (simply?) linking to the latest versions of the artifacts in the CircleCI builds. In this case, the artifacts would need a stable URL across builds.
Sometimes, when VC4C is re-built (CMake already configured) using the default clang, the compilation fails due to CLANG_PATH
not found.
My theory is that is has something to with CLANG_PATH
being set as a macro in CMake, not a CMake variable and therefore might not be saved?!
A current solution is to clean and re-build cmake (see doe300/VC4CL#19 (comment)):
git clean -fdx && cmake . && make clean && make -j2
See also doe300/VC4CL#19 and #34.
Currently, CI only checks that the source code can be compiled to arm binaries.
In my motivation, it is convinient that we can do cross-compilation (output vc4 binaries from x86_64 host environments).
My plan is:
VC4C
in circleci@doe300
Line 60 and 61, the same integer is assigned. Is it okey?
https://github.com/doe300/VC4C/blob/master/src/Values.h#L60
https://github.com/doe300/VC4C/blob/master/src/Values.h#L61
And, I want to replace such magic number into enum.......
The latest VC4C
fail to compile testing/test_inner_loops.cl
as follows:
nomaddo@nomaddo-AS:~/VC4C_cross$ valgrind build/VC4C --asm -o /tmp/hoge2.s ~/VC4C_cross/testing/test_inner_loops.cl
==15066== Memcheck, a memory error detector
==15066== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==15066== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==15066== Command: build/VC4C --asm -o /tmp/hoge2.s /home/nomaddo/VC4C_cross/testing/test_inner_loops.cl
==15066==
[D] Fri Mar 23 19:03:51 2018: Compiling '/home/nomaddo/VC4C_cross/testing/test_inner_loops.cl' into '/tmp/hoge2.s' with options '' ...
[D] Fri Mar 23 19:03:51 2018: Temporary file '/tmp/vc4c-po6NJo' created
[I] Fri Mar 23 19:03:51 2018: Compiling OpenCL to LLVM-IR with :/usr/bin/clang-3.9 -cc1 -triple spir-unknown-unknown -I /home/nomaddo/VC4C_cross/testing -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-po6NJo /home/nomaddo/VC4C_cross/testing/test_inner_loops.cl
[I] Fri Mar 23 19:03:54 2018: Compilation complete!
[D] Fri Mar 23 19:03:57 2018: Reading LLVM module from bit-code...
[D] Fri Mar 23 19:03:59 2018: Found SPIR kernel-function: test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:03:59 2018: Reading function void test_remove_constant_load_in_loops_opt(...)
[D] Fri Mar 23 19:03:59 2018: Reading parameter float* %a
[D] Fri Mar 23 19:03:59 2018: Reading parameter float* %b
[D] Fri Mar 23 19:04:00 2018: Mapping function 'test_remove_constant_load_in_loops_opt'...
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.0
[D] Fri Mar 23 19:04:00 2018: Generating unconditional branch to 0xccb9520
[D] Fri Mar 23 19:04:00 2018: Generating label label %.preheader22
[D] Fri Mar 23 19:04:00 2018: Generating Phi-Node with 2 options into i32 %j.025
[D] Fri Mar 23 19:04:00 2018: Generating binary operation mul with i32 %j.025 and i32 200 into i32 %mul2
[D] Fri Mar 23 19:04:00 2018: Generating unconditional branch to 0xccba520
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.1
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %j.025 and i32 1 into i32 %inc3
[D] Fri Mar 23 19:04:00 2018: Generating comparison slt with i32 %inc3 and i32 200 into %cmp
[D] Fri Mar 23 19:04:00 2018: Generating branch on condition bool %cmp to either label %.preheader22 or label %.preheader.preheader
[D] Fri Mar 23 19:04:00 2018: Generating label label %.preheader.preheader
[D] Fri Mar 23 19:04:00 2018: Generating unconditional branch to 0xccbb390
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.2
[D] Fri Mar 23 19:04:00 2018: Generating Phi-Node with 2 options into i32 %i.024
[D] Fri Mar 23 19:04:00 2018: Generating binary operation mul with i32 %i.024 and i32 %j.025 into i32 %mul
[D] Fri Mar 23 19:04:00 2018: Generating unary operation sitofp with i32 %mul into float %conv
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %i.024 and i32 %mul2 into i32 %add
[D] Fri Mar 23 19:04:00 2018: Generating calculating index i32 %add of float* %a into float* %arrayidx
[D] Fri Mar 23 19:04:00 2018: Generating writing of float %conv into float* %arrayidx
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %i.024 and i32 1 into i32 %inc
[D] Fri Mar 23 19:04:00 2018: Generating comparison slt with i32 %inc and i32 200 into %cmp1
[D] Fri Mar 23 19:04:00 2018: Generating branch on condition bool %cmp1 to either label %tmp.2 or label %tmp.1
[D] Fri Mar 23 19:04:00 2018: Generating label label %tmp.3
[D] Fri Mar 23 19:04:00 2018: Generating return nothing
[D] Fri Mar 23 19:04:00 2018: Generating label label %.preheader
[D] Fri Mar 23 19:04:00 2018: Generating Phi-Node with 2 options into i32 %i4.023
[D] Fri Mar 23 19:04:00 2018: Generating unary operation sitofp with i32 %i4.023 into float %conv7
[D] Fri Mar 23 19:04:00 2018: Generating calculating index i32 %i4.023 of float* %b into float* %arrayidx8
[D] Fri Mar 23 19:04:00 2018: Generating writing of float %conv7 into float* %arrayidx8
[D] Fri Mar 23 19:04:00 2018: Generating binary operation add with i32 %i4.023 and i32 1 into i32 %inc9
[D] Fri Mar 23 19:04:00 2018: Generating comparison slt with i32 %inc9 and i32 40000 into %cmp5
[D] Fri Mar 23 19:04:00 2018: Generating branch on condition bool %cmp5 to either label %.preheader or label %tmp.3
[D] Fri Mar 23 19:04:00 2018: Eliminating phi-node by inserting moves: i32 %j.025 = phi %tmp.1 -> i32 %inc3, %tmp.0 -> i32 0
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %tmp.1
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %tmp.0
[D] Fri Mar 23 19:04:00 2018: Eliminating phi-node by inserting moves: i32 %i.024 = phi %tmp.2 -> i32 %inc, %.preheader22 -> i32 0
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %tmp.2
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %.preheader22
[D] Fri Mar 23 19:04:00 2018: Eliminating phi-node by inserting moves: i32 %i4.023 = phi %.preheader.preheader -> i32 0, %.preheader -> i32 %inc9
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %.preheader.preheader
[D] Fri Mar 23 19:04:00 2018: Inserting 'move' into end of basic-block: %.preheader
[I] Fri Mar 23 19:04:00 2018: -----
[I] Fri Mar 23 19:04:00 2018: Inlining functions for kernel: test_remove_constant_load_in_loops_opt
[I] Fri Mar 23 19:04:00 2018: -----
[D] Fri Mar 23 19:04:00 2018: -----
[I] Fri Mar 23 19:04:00 2018: Running optimization passes for: test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:04:00 2018:
[D] Fri Mar 23 19:04:00 2018: Running pass: MapMemoryAccess
[D] Fri Mar 23 19:04:00 2018: Generating memory access which cannot be lowered into VPM: store float %conv7 into float* %arrayidx8
[D] Fri Mar 23 19:04:00 2018: 0 16 0 1
[D] Fri Mar 23 19:04:00 2018: Generating memory access which cannot be lowered into VPM: store float %conv into float* %arrayidx
[D] Fri Mar 23 19:04:00 2018: 0 16 0 1
[D] Fri Mar 23 19:04:00 2018: Found base address - with offset -1 for writing into memory
[D] Fri Mar 23 19:04:00 2018: Found base address - with offset -1 for writing into memory
[D] Fri Mar 23 19:04:00 2018:
[D] Fri Mar 23 19:04:00 2018: Running pass: ResolveStackAllocations
[D] Fri Mar 23 19:04:00 2018:
[D] Fri Mar 23 19:04:00 2018: Running pass: SingleSteps
[D] Fri Mar 23 19:04:00 2018: Running steps: EliminateReturns, CombineDuplicateBranches, EliminateUselessBranch, IntrinsifyBuiltin, HandleLiteralVector, MapGlobalDataToAddress, CalculateConstantValue, EliminateUselessInstruction, LoadImmediateValues, HandleUseWithImmediateValues, MoveRotationSourcesToAccs, CheckMethodCalls, CombineSelectionWithZero, CombineSettingSameFlags,
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Removing branch to next instruction: br %.preheader22
[D] Fri Mar 23 19:04:00 2018: Intrinsifying unsigned multiplication of integers
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.21 = clz i32 1 (1)' with constant value: i32 31
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 65535
[D] Fri Mar 23 19:04:00 2018: Replacing 'i16 %immediate.23 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 200
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.24 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Replacing obsolete i32 %tmp.19 = xor i32 %sign.8, i8 0 with move
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %inc3 = add i32 %j.025, i32 1 (1)
[D] Fri Mar 23 19:04:00 2018: Intrinsifying comparison 'slt' to arithmetic operations
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 200
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:00 2018: Removing branch to next instruction: br.ifz %.preheader.preheader (on bool %cmp) (ifz )
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 0 to: 0 (0)
[D] Fri Mar 23 19:04:00 2018: Intrinsifying unsigned multiplication of integers
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.49 = clz i32 1 (1)' with constant value: i32 31
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.50 = clz i32 1 (1)' with constant value: i32 31
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 65535
[D] Fri Mar 23 19:04:00 2018: Replacing 'i16 %immediate.52 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Loading immediate value: 65535
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.54 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.55 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Replacing 'i8 %immediate.56 = v8adds i32 8 (8), i32 8 (8)' with constant value: i32 16
[D] Fri Mar 23 19:04:00 2018: Intrinsifying multiplication with left-shift
[D] Fri Mar 23 19:04:00 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %inc = add i32 %i.024, i32 1 (1)
[D] Fri Mar 23 19:04:01 2018: Intrinsifying comparison 'slt' to arithmetic operations
[D] Fri Mar 23 19:04:01 2018: Loading immediate value: 200
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Replacing return in kernel-function with branch to end-label
[D] Fri Mar 23 19:04:01 2018: Intrinsifying multiplication with left-shift
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value 2 to: 2 (2)
[D] Fri Mar 23 19:04:01 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %index_offset.5 = shl i32 %i4.023, i32 2 (2)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value 1 to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Inserting temporary to split up use of long-living local with immediate value: i32 %inc9 = add i32 %i4.023, i32 1 (1)
[D] Fri Mar 23 19:04:01 2018: Intrinsifying comparison 'slt' to arithmetic operations
[D] Fri Mar 23 19:04:01 2018: Loading immediate value: 40000
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018: Mapping constant for immediate value true to: 1 (1)
[D] Fri Mar 23 19:04:01 2018:
[D] Fri Mar 23 19:04:01 2018: Running pass: CombineLiteralLoads
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.50 = clz i32 1 (1)
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.53 = loadi i32 65535
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.54 = v8adds i32 8 (8), i32 8 (8)
[D] Fri Mar 23 19:04:01 2018: Removing duplicate loading of local: i8 %immediate.56 = v8adds i32 8 (8), i32 8 (8)
[D] Fri Mar 23 19:04:01 2018:
[D] Fri Mar 23 19:04:01 2018: Running pass: CombineRotations
[D] Fri Mar 23 19:04:01 2018:
[D] Fri Mar 23 19:04:01 2018: Running pass: RemoveRedundantMoves
[D] Fri Mar 23 19:04:01 2018: Removing obsolete move by replacing uses of the output with the input: i32 %mul.out1.16 = i32 %mul.out0.15
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpm = float %conv
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpw_addr = float* %arrayidx
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpm = float %conv7
[D] Fri Mar 23 19:04:01 2018: Replacing obsolete move with instruction calculating its source: register vpw_addr = float* %arrayidx8
[D] Fri Mar 23 19:04:01 2018:
[D] Fri Mar 23 19:04:01 2018: Running pass: EliminateDeadStores
[D] Fri Mar 23 19:04:01 2018: Removing instruction i32 %mul.b0.13 = i8 0 (0), since its output is never read
[D] Fri Mar 23 19:04:01 2018: Cleaned 22 unused locals from method test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:04:01 2018:
[D] Fri Mar 23 19:04:01 2018: Running pass: RemoveConstantLoadInLoops
[D] Fri Mar 23 19:04:01 2018: CFG created/updated for function: test_remove_constant_load_in_loops_opt
[D] Fri Mar 23 19:04:01 2018: Found a control-flow loop: label: %tmp.2 -> label: %tmp.1 -> label: %.preheader22 ->
[D] Fri Mar 23 19:04:01 2018: Found a control-flow loop: label: %tmp.2 ->
[D] Fri Mar 23 19:04:01 2018: Found a control-flow loop: label: %.preheader ->
==15066== Thread 2 Optimizer:
==15066== Invalid read of size 8
==15066== at 0x526D78A: vc4c::ControlFlowLoop::includes(vc4c::ControlFlowLoop const&) const (ControlFlowGraph.cpp:121)
==15066== by 0x53F03C4: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:990)
==15066== by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066== by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066== by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066== by 0x542C365: runOptimizationPasses(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&, std::set<vc4c::optimizations::OptimizationPass, std::less<vc4c::optimizations::OptimizationPass>, std::allocator<vc4c::optimizations::OptimizationPass> > const&) (Optimizer.cpp:173)
==15066== by 0x542C6BB: vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}::operator()() const (Optimizer.cpp:211)
==15066== by 0x542D989: std::_Function_handler<void (), vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
==15066== by 0x522B767: std::function<void ()>::operator()() const (functional:2267)
==15066== by 0x522AD43: threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const (BackgroundWorker.h:62)
==15066== by 0x52313B9: void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1531)
==15066== by 0x523130F: std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()() (functional:1520)
==15066== Address 0xcdb4488 is 0 bytes after a block of size 24 alloc'd
==15066== at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15066== by 0x527A997: __gnu_cxx::new_allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*>::allocate(unsigned long, void const*) (new_allocator.h:104)
==15066== by 0x52797D4: std::allocator_traits<std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::allocate(std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*>&, unsigned long) (alloc_traits.h:491)
==15066== by 0x52781F3: std::_Vector_base<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::_M_allocate(unsigned long) (stl_vector.h:170)
==15066== by 0x5276ACF: vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const** std::vector<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::_M_allocate_and_copy<std::move_iterator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const**> >(unsigned long, std::move_iterator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const**>, std::move_iterator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const**>) (stl_vector.h:1224)
==15066== by 0x5274484: std::vector<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, std::allocator<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >::reserve(unsigned long) (vector.tcc:75)
==15066== by 0x526EE83: vc4c::ControlFlowGraph::findLoopsHelper(vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, vc4c::PerformanceMap<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, int, (vc4c::OrderType)2, vc4c::hash<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >&, vc4c::PerformanceMap<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, int, (vc4c::OrderType)2, vc4c::hash<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*> >&, vc4c::PerformanceList<vc4c::Node<vc4c::BasicBlock*, vc4c::CFGRelation> const*, (vc4c::AccessType)2, (vc4c::InsertRemoveType)3>&, int&) (ControlFlowGraph.cpp:273)
==15066== by 0x526DEAA: vc4c::ControlFlowGraph::findLoops() (ControlFlowGraph.cpp:192)
==15066== by 0x53F02BA: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:982)
==15066== by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066== by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066== by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066==
==15066== Invalid read of size 8
==15066== at 0x526D78D: vc4c::ControlFlowLoop::includes(vc4c::ControlFlowLoop const&) const (ControlFlowGraph.cpp:121)
==15066== by 0x53F03C4: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:990)
==15066== by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066== by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066== by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066== by 0x542C365: runOptimizationPasses(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&, std::set<vc4c::optimizations::OptimizationPass, std::less<vc4c::optimizations::OptimizationPass>, std::allocator<vc4c::optimizations::OptimizationPass> > const&) (Optimizer.cpp:173)
==15066== by 0x542C6BB: vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}::operator()() const (Optimizer.cpp:211)
==15066== by 0x542D989: std::_Function_handler<void (), vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
==15066== by 0x522B767: std::function<void ()>::operator()() const (functional:2267)
==15066== by 0x522AD43: threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const (BackgroundWorker.h:62)
==15066== by 0x52313B9: void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1531)
==15066== by 0x523130F: std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()() (functional:1520)
==15066== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==15066==
==15066==
==15066== Process terminating with default action of signal 11 (SIGSEGV)
==15066== Access not within mapped region at address 0x0
==15066== at 0x526D78D: vc4c::ControlFlowLoop::includes(vc4c::ControlFlowLoop const&) const (ControlFlowGraph.cpp:121)
==15066== by 0x53F03C4: vc4c::optimizations::removeConstantLoadInLoops(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (ControlFlow.cpp:990)
==15066== by 0x543261F: std::_Function_handler<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&), void (*)(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::_M_invoke(std::_Any_data const&, vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) (functional:1871)
==15066== by 0x54317FB: std::function<void (vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)>::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (functional:2267)
==15066== by 0x542B941: vc4c::optimizations::OptimizationPass::operator()(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&) const (Optimizer.cpp:35)
==15066== by 0x542C365: runOptimizationPasses(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&, std::set<vc4c::optimizations::OptimizationPass, std::less<vc4c::optimizations::OptimizationPass>, std::allocator<vc4c::optimizations::OptimizationPass> > const&) (Optimizer.cpp:173)
==15066== by 0x542C6BB: vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}::operator()() const (Optimizer.cpp:211)
==15066== by 0x542D989: std::_Function_handler<void (), vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1871)
==15066== by 0x522B767: std::function<void ()>::operator()() const (functional:2267)
==15066== by 0x522AD43: threading::BackgroundWorker::operator()()::{lambda()#1}::operator()() const (BackgroundWorker.h:62)
==15066== by 0x52313B9: void std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1531)
==15066== by 0x523130F: std::_Bind_simple<threading::BackgroundWorker::operator()()::{lambda()#1} ()>::operator()() (functional:1520)
==15066== If you believe this happened as a result of a stack
==15066== overflow in your program's main thread (unlikely but
==15066== possible), you can try to increase the size of the
==15066== main thread stack using the --main-stacksize= flag.
==15066== The main thread stack size used in this run was 8720384.
==15066==
==15066== HEAP SUMMARY:
==15066== in use at exit: 18,678,964 bytes in 192,720 blocks
==15066== total heap usage: 229,603 allocs, 36,883 frees, 33,442,982 bytes allocated
==15066==
==15066== LEAK SUMMARY:
==15066== definitely lost: 0 bytes in 0 blocks
==15066== indirectly lost: 0 bytes in 0 blocks
==15066== possibly lost: 7,449,688 bytes in 53,136 blocks
==15066== still reachable: 11,229,276 bytes in 139,584 blocks
==15066== suppressed: 0 bytes in 0 blocks
==15066== Rerun with --leak-check=full to see details of leaked memory
==15066==
==15066== For counts of detected and suppressed errors, rerun with: -v
==15066== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Killed
In this code, smallImm 2.0
is loaded my ldi
.
Discussed in #45.
__kernel void loop1 (__global float * a) {
int id = get_local_id(0);
float16 v = vload16 (id, a);
vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 37 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items) (lids)
or r0, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33) // can be fused
fmul r0, r4, r0 //
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
kernel void if1 (global float a[]) {
int id = get_global_id(0);
if ((id & 0x01) == 0)
a[id] = 3.0;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'if1' with 46 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items) (lSize, lids, gidX, offX)
// label %start_of_function
or r1, unif, unif
or r2, unif, unif
or r0, unif, unif
or ra2, unif, unif
or ra1, unif, unif
// label %tmp.0
or ra0, r0, r0; v8min r0, r1, r1
ldi r1, 255
and r3, r0, r1
or r2, ra2, ra2; v8min r0, r2, r2
and r1, r0, r1
mul24 r0, ra0, r3
add r0, r2, r0
add r1, r0, r1
nop.never
or r0, r1, r1
and.setf -, r0, 1 (1)
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + 4 // to %tmp.1 --- (1)
nop.never
nop.never
nop.never
brr.ifanyz (pc+4) + 10 // to %tmp.2 --- (2)
nop.never
nop.never
nop.never
// label %tmp.1
or r0, r1, r1
shl r0, r0, 2 (2)
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
itof vpm, 3 (3)
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add vpw_addr, ra1, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -42 // to %start_of_function
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
In this code, the branch of (1) can be removed. We should,
Anyway, we need better flattening of basic blocks.
Currently, we just use the iterator of traverseing all instructions.
We need more carefully flattening, which is keen to the shape of the graph.
Using the LLVM-library front-end for debian packages is currently the easiest and best way (for the user) to use the VC4C compiler:
Current problems:
llvm-config-3.9
(should be provided by llvm-3.9
)llvm-config
seems not to find the LLVM library, headers, include flags, etc. (see CircleCI log section configure
)llvm-config
finds all required componentsDo we have to install/provide llvm-dev
cross-compiled for Raspberry Pi? Or do we have to re-build the docker image?
@nomaddo, any ideas on this?
I am trying to run VC4C
on fresh raspbian env, but somehow compile failed as follows:
pi@raspberrypi:~/VC4C$ ./build/VC4C -o /tmp/hoge ../VC4CL/mul2.cl
[D] Tue Mar 13 22:47:53 2018: Compiling '../VC4CL/mul2.cl' into '/tmp/hoge' with options '' ...
[D] Tue Mar 13 22:47:53 2018: Temporary file '/tmp/vc4c-PXslFI' created
[I] Tue Mar 13 22:47:53 2018: Compiling OpenCL to LLVM-IR with :/usr/bin/clang-3.9 -cc1 -triple spir-unknown-unknown -I ../VC4CL -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-PXslFI ../VC4CL/mul2.cl
[E] Tue Mar 13 22:47:59 2018: Errors in precompilation:
[E] Tue Mar 13 22:47:59 2018: ../VC4CL/mul2.cl:3:12: error: use of undeclared identifier 'i'
a[i] = a[i] * 2;
^
../VC4CL/mul2.cl:3:5: error: use of undeclared identifier 'i'
a[i] = a[i] * 2;
^
2 errors generated.
[E] Tue Mar 13 22:47:59 2018: (1) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf4 [0x76c21d78]
[E] Tue Mar 13 22:47:59 2018: (2) /home/pi/VC4C/build/libVC4CC.so.1.2 : +0x349ba0 [0x76c55ba0]
[E] Tue Mar 13 22:47:59 2018: (3) /home/pi/VC4C/build/libVC4CC.so.1.2 : +0x349edc [0x76c55edc]
[E] Tue Mar 13 22:47:59 2018: (4) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::Precompiler::run(std::unique_ptr<std::istream, std::default_delete<std::istream> >&, vc4c::SourceType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)+0x36c [0x76c5674c]
[E] Tue Mar 13 22:47:59 2018: (5) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::Precompiler::precompile(std::istream&, std::unique_ptr<std::istream, std::default_delete<std::istream> >&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)+0x154 [0x76c54d34]
[E] Tue Mar 13 22:47:59 2018: (6) /home/pi/VC4C/build/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0xd4 [0x76c23ce4]
[E] Tue Mar 13 22:47:59 2018: (7) ./build/VC4C : main+0xc4c [0x232e4]
[E] Tue Mar 13 22:47:59 2018: (8) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x73f3c678]
[D] Tue Mar 13 22:47:59 2018: Temporary file '/tmp/vc4c-PXslFI' deleted
[E] Tue Mar 13 22:47:59 2018: Compiler threw exception: Pre-compilation: Error in precompilation: ../VC4CL/mul2.cl:3:12: error: use of undeclared identifier 'i'
a[i] = a[i] * 2;
^
../VC4CL/mul2.cl:3:5: error: use of undeclared identifier 'i'
a[i] = a[i] * 2;
^
2 errors generated.
terminate called after throwing an instance of 'vc4c::CompilationError'
what(): Pre-compilation: Error in precompilation: ../VC4CL/mul2.cl:3:12: error: use of undeclared identifier 'i'
a[i] = a[i] * 2;
^
../VC4CL/mul2.cl:3:5: error: use of undeclared identifier 'i'
a[i] = a[i] * 2;
^
2 errors generated.
Aborted
mul2.cl is as follows:
kernel void f (global float * a) {
int id = get_global_id(0);
a[i] = a[i] * 2;
}
This issue happened in x86 env. @doe300 any idea?
In the following code, fadd r1, r2, r0
can be fused to move instruction that is located in mutex block.
kernel void sum_f16 (global float a[], global float b[], global float c[]) {
int id = get_global_id(0);
float16 x = vload16(id, b);
float16 y = vload16(id, c);
vstore16 (x + y, id, a);
}
or -, unif, unif
or r2, unif, unif
or ra5, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or r0, unif, unif
or -, unif, unif
or -, unif, unif
or ra4, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra3, unif, unif
or ra2, unif, unif
or ra1, unif, unif
or ra0, r0, r0; v8min r1, 0 (0), 0 (0)
shr r2, r2, r1
ldi r0, 255
and r3, r2, r0
or r2, ra4, ra4
shr r1, ra5, r1
and r1, r1, r0
mul24 r0, ra0, r3
add r0, r2, r0
add r0, r0, r1
shl r3, r0, 4 (4)
mul24 r2, 4 (4), elem_num
or r0, r3, r3
shl r0, r0, 2 (2)
add r0, ra2, r0
or r1, r0, r0; v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r1, r2
or tmu0s, r0, r0
nop.load_tmu0.never
or r2, r4, r4
or r0, r3, r3
shl r0, r0, 2 (2)
add r0, ra1, r0
or ra0, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, ra0, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, r4, r4
fadd r1, r2, r0 !!! can be fused
or r0, r3, r3
shl r0, r0, 2 (2)
add r0, ra3, r0
or r0, r0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r1, r1 !!! can be fused
ldi vpw_setup, 2156937216
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -69
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
In the following kernel code, redundant move instructions from r4
is problem.
void kernel test(float global * a, const int n){
for (int i = 0; i < 10; i++)
a[i] += 1;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'test' with 68 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), int n (4 B, 1 items)
// label: %start_of_function
or ra0, unif, unif
or -, unif, unif
// label: %tmp.0
or tmu0s, ra0, ra0
nop.load_tmu0.never
or r0, r4, r4
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fadd vpm, r0, 1.000000 (32)
ldi vpw_setup, vdw_setup(rows: 10, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or r0, ra0, ra0
add tmu0s, r0, 4 (4)
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
or r0, ra0, ra0
add tmu0s, r0, 8 (8)
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
or r0, ra0, ra0
add tmu0s, r0, 12 (12)
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 8 (8), 8 (8)
add tmu0s, ra0, r0
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 10 (10), 10 (10)
add tmu0s, ra0, r0
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 12 (12), 12 (12)
add tmu0s, ra0, r0
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
v8adds r0, 14 (14), 14 (14)
add tmu0s, ra0, r0
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
ftoi r0, 32.000000 (37)
add tmu0s, ra0, r0
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
mul24 r0, 6 (6), 6 (6)
add tmu0s, ra0, r0
nop.load_tmu0.never
or r0, r4, r4 // shoudl be removed
fadd vpm, r0, 1.000000 (32)
or vpw_addr, ra0, ra0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -64 // to %start_of_function
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
The cross-compiled build does not run on original Raspberry Pi B (not B+, not 2B) due to probably architecture mismatch.
The original error message when trying to execute VC4C
is:
Illegal instruction
cat /proc/cpuinfo
spills out:
ARMv6-compatible processor rev 7 (v6l)
readelf -A /usr/local/bin/VC4C
gives
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Discussed in #45.
__kernel void loop1 (__global float * a) {
int id = get_local_id(0);
float16 v = vload16 (id, a);
vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 37 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items) (lids)
or r0, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255 //
and r0, r0, r1 // Redundant
and r0, r0, r1 //
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
See also #50
I'm testing out some basic PyOpenCL examples (PyOpenCL detects the library and parameters correctly). The demo benchmark looked like a nice simple kernel to try:
https://raw.githubusercontent.com/inducer/pyopencl/master/examples/benchmark.py
This gives an error:
[E] Mon Feb 5 06:43:19 2018: Compiler threw exception: Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!
terminate called after throwing an instance of 'vc4c::CompilationError'
what(): Instruction Mapping: Bit-casts across different vector-sizes are not yet supported!
The exception comes from here:
VC4C/src/intermediate/TypeConversions.cpp
Line 25 in 2a95e8a
I'm happy to look at fixing it, but some pointers as to what needs to happen would be great.
I am new to OpenCL. Apologies if the question is dumb.
My environment is Raspberry Pi 3, rasbian version 8(jessie), clang 3.9.0-4. VC4C, VC4CL and VC4CLStdLib have been built according to your instruction. However, when I compile the file /example/hello_world.cl using the following command
VC4C --hex -o hello_world.hex hello_world.cl, I get the following error
[E] Fri Dec 22 21:45:46 2017: Cannot map phi-node to label: %0
[E] Fri Dec 22 21:45:46 2017: (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::string const&)+0xb8 [0x76c843b8]
[E] Fri Dec 22 21:45:46 2017: (2) /usr/local/lib/libVC4CC.so.1.2 : +0x394b88 [0x76e0fb88]
[E] Fri Dec 22 21:45:46 2017: (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::optimizations::eliminatePhiNodes(vc4c::Module const&, vc4c::Method&, vc4c::Configuration const&)+0xec [0x76e10028]
[E] Fri Dec 22 21:45:46 2017: (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::optimizations::Optimizer::optimize(vc4c::Module&) const+0x188 [0x76e366a0]
[E] Fri Dec 22 21:45:46 2017: (5) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0x27c [0x76c86458]
[E] Fri Dec 22 21:45:46 2017: (6) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::string const&, vc4c::Optional<std::string> const&)+0x3cc [0x76c86b78]
[E] Fri Dec 22 21:45:46 2017: (7) VC4C : main+0xb3c [0x1ea38]
[E] Fri Dec 22 21:45:46 2017: (8) /lib/arm-linux-gnueabihf/libc.so.6 : __libc_start_main+0x114 [0x76793294]
[D] Fri Dec 22 21:45:50 2017: Temporary file '/tmp/vc4c-OPk2ns' deleted
[E] Fri Dec 22 21:45:50 2017: Compiler threw exception: Optimizer: Failed to map all phi-options to valid basic-blocks
terminate called after throwing an instance of 'vc4c::CompilationError'
what(): Optimizer: Failed to map all phi-options to valid basic-blocks
Aborted
What did I do wrong? Could you give some detailed instruction of setting up a working environment?
Thanks.
BTW, great project!
When loading constant global data, the LLVM-IR/SPIR-V output loads initialized constant global data from memory. Since we know the value and it cannot change, we can use the data as literal constants and skip the reading from memory.
Since SPIRV-Tools now implements a linker, we could use it, to provide a linker for the VC4C compiler.
Provide an interface (can be separate from compiler-interface) to link together two intermediate binaries (LLVM-IR or SPIR-V).
LLVM/CLang support linking via llvm-link
, SPIR-V via tools/spirv-link
in SPIRV-Tools.
Problem:
Both utils only support "full linking", where implementations for all functions need to be available, which is not the case since the vc4cl_
intrinsic functions are not implemented in any intermediate format.
We should use clang-format to make sure, we have a unified formatting.
Any comments on this?
Does anyone have a good template for clang-format, we could use?
This would also apply to the VC4CL project.
Not sure how to implement this, but here a few notes on an instruction scheduler:
Reorder instructions within a basic block to utilize the delay introduced by certain operations by inserting meaningful instructions minimizing the number of cycles spent waiting (via nop
or on periphery registers).
See also the Wikipedia.
Currently, unused mov instructions from uniform
are generated at the beginning of kernel functions.
__kernel void loop1 (__global float * a) {
int id = get_local_id(0);
float16 v = vload16 (id, a);
vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 49 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or r0, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -45
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
These loads from uniform
can be removed, ideally. And should be removed, because these are repeatedly executed (looped by brr.ifallzc (pc+4) + -45
).
$ gcc --version
gcc (GCC) 7.2.1 20180116
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ clang --version
clang version 5.0.1 (tags/RELEASE_501/final)
Target: armv7l-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
$ uname -a
Linux alarmpi 4.14.22-1-ARCH #1 SMP Tue Feb 27 06:45:49 UTC 2018 armv7l GNU/Linux
I am trying to compile VC4C, and I get the following compiler error:
/home/alarm/opencl/VC4C/src/llvm/BitcodeReader.cpp: In member function 'vc4c::Method& vc4c::llvm2qasm::BitcodeReader::parseFunction(vc4c::Module&, const llvm::Function&)':
/home/alarm/opencl/VC4C/src/llvm/BitcodeReader.cpp:466:34: error: 'const class llvm::Function' has no member named 'getArgumentList'; did you mean 'Arguments'?
method->parameters.reserve(func.getArgumentList().size());
^~~~~~~~~~~~~~~
Arguments
/home/alarm/opencl/VC4C/src/llvm/BitcodeReader.cpp:467:39: error: 'const class llvm::Function' has no member named 'getArgumentList'; did you mean 'Arguments'?
for(const llvm::Argument& arg : func.getArgumentList())
^~~~~~~~~~~~~~~
Arguments
make[2]: *** [build/CMakeFiles/VC4CC.dir/build.make:1047: build/CMakeFiles/VC4CC.dir/llvm/BitcodeReader.cpp.o] Error 1
I think this is caused by this change to LLVM:
https://reviews.llvm.org/D31052
But I'm not sure.
Especially now, that cross-compilation and debian packaging is in the making (see #8), the way the VC4CLStdLib precompiled header is being built needs to be changed.
Currently, the PCH-file is being built on the development machine:
The solution would be to build the PCH-file on the destination machine:
vc4cl-stdlib
package).=> On every build (or in a post-installation script), if the PCH-file doesn't exist, create it first (using the same options as the actual compilations).
/usr/include/vc4cl-stdlib
via extra package)./include/
or ./build
. For packaged builds, it should reside somewhere in /usr/share/vc4c
or a similar pathNow the dependencies of .deb
package is as follows.
https://github.com/doe300/VC4C/blob/master/CMakeLists.txt#L344
I am wondering it is right:
llvm-3.9-dev
is always necessary? If LLVMLIB_FRONTEND
is disabled, VC4C
might not depend on it?clang-3.9
is always necessary? if LLVMIR_FRONTEND
is disabled, VC4C
doesn't depend on it?@doe300 Any Idea?
The output of deepCL/backproweights.cl
can be improved by enhacement of reducing the number of ldi
.
For example, loading constant 256
happens 69 times.
If we can combine them to one instruction, rapid speedup can be archived.
https://gist.github.com/nomaddo/220b867143eff68f2c0f83f9188ab382
More small example, the following code can be improved:
kernel void f(global float * a) {
for (int i = 0; i < 4; i++){
a[i] += 129;
}
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'f' with 37 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items)
// label: %start_of_function
or ra0, unif, unif
// label: %tmp.0
nop.never
or tmu0s, ra0, ra0
nop.load_tmu0.never
ldi r0, 1124139008 // should be fused
fadd r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 4, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add tmu0s, ra0, 4 (4)
nop.load_tmu0.never
ldi r1, 1124139008 // should be fused
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 8 (8)
nop.load_tmu0.never
fadd r0, r4, r1
or vpm, r0, r0
add tmu0s, ra0, 12 (12)
nop.load_tmu0.never
ldi r0, 1124139008 // should be fused
fadd r0, r4, r0
or vpm, r0, r0
or vpw_addr, ra0, ra0
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33 // to %start_of_function
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
I'm testing out some basic PyOpenCL examples (PyOpenCL detects the library and parameters correctly). The demo benchmark looked like a nice simple kernel to try:
https://raw.githubusercontent.com/inducer/pyopencl/master/examples/benchmark.py
The value only gets reliably assigned for the first worker in the group. Writing a constant value of 1 to the output only results in 1's being written to the first worker in each group (the rest of the values are 0). This might be a writing issue, or it might be a problem with the call to get_global_id(0)
A secondary issue is that the openCL driver will only allow worker groups in powers of 2 - so I cannot assign 12 workers here, only 8:
Traceback_ (most recent call last):
File "opencl_test.py", line 91, in
exec_evt = prg.sum(queue, global_size, local_size, a_buf, b_buf, dest_buf)
File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1766, in call
return self._enqueue(self, queue, global_size, local_size, *args, **kwargs)
File "", line 90, in enqueue_knl_sum
File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 1952, in enqueue_nd_range_kernel
global_work_size, local_work_size, c_wait_for, num_wait_for))
File "/usr/local/lib/python2.7/dist-packages/pyopencl/cffi_cl.py", line 664, in _handle_error
raise e
pyopencl.cffi_cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE
I'm unsure if there's an incorrect assumption in the pyopencl code somewhere... but this reduces total throughput by a third. With 8 workers the result is:
gpu: 2.32745s
cpu: 0.469361066818s
Running VC4C on a Raspberry Pi 3 with a fresh Raspbian Stretch setup.
VC4C has been built with only the LLVM front-end.
I am trying to run the following command against fibonacci.cl from the examples
folder:
VC4C --llvm --hex -o fibonacci.hex fibonacci.cl
I get the following message:
Compiling 'fibonacci.cl' into 'fibonacci.hex' with options '' ...
[E] Mon Nov 27 14:27:43 2017: Errors in precompilation:
[E] Mon Nov 27 14:27:43 2017: Stack dump:
0. Program arguments: /usr/lib/llvm-3.9/bin/clang -cc1 -triple armv6-unknown-linux-gnueabihf -emit-llvm -disable-free -disable-llvm-verifier -discard-value-names -main-file-name fibonacci.cl -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -fuse-init-array -target-cpu arm1136jf-s -target-feature +strict-align -target-abi aapcs-linux -mfloat-abi hard -dwarf-column-info -debugger-tuning=gdb -coverage-file /tmp/vc4c-sbCfYb -resource-dir /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1 -include-pch /home/pi/VC4C/include/VC4CLStdLib.h.pch -I . -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1/include -internal-externc-isystem /usr/include/arm-linux-gnueabihf -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -fdebug-compilation-dir /home/pi/VC4CL-test -ferror-limit 19 -fmessage-length 0 -fallow-half-arguments-and-returns -fno-signed-char -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o /tmp/vc4c-sbCfYb -x cl fibonacci.cl
clang: error: unable to execute command: Segmentation fault
clang: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 3.9.1-9+rpi1 (tags/RELEASE_391/rc2)
Target: armv6-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
I have tried to run the given /usr/lib/llvm-3.9/bin/clang
command line and got a segmentation fault as well.
The command succeeds if I remove the argument: -include-pch /home/pi/VC4C/include/VC4CLStdLib.h.pch
:
/usr/lib/llvm-3.9/bin/clang -cc1 -triple armv6-unknown-linux-gnueabihf -emit-llvm -disable-free -disable-llvm-verifier -discard-value-names -main-file-name fibonacci.cl -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -fuse-init-array -target-cpu arm1136jf-s -target-feature +strict-align -target-abi aapcs-linux -mfloat-abi hard -dwarf-column-info -debugger-tuning=gdb -coverage-file /tmp/vc4c-kK1RbC -resource-dir /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1 -I . -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-3.9/bin/../lib/clang/3.9.1/include -internal-externc-isystem /usr/include/arm-linux-gnueabihf -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -fdebug-compilation-dir /home/pi/VC4CL-test -ferror-limit 19 -fmessage-length 0 -fallow-half-arguments-and-returns -fno-signed-char -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -o /tmp/vc4c-kK1RbC -x cl fibonacci.cl
Here is an extract of strace
run with the faulty command:
stat64("/home/pi/VC4C/include", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/home/pi/VC4C/include/VC4CLStdLib.h.pch", O_RDONLY) = 5
readlink("/proc/self/fd/5", "/home/pi/VC4C/include/VC4CLStdLi"..., 4096) = 39
fstat64(5, {st_mode=S_IFREG|0644, st_size=12789540, ...}) = 0
mmap2(NULL, 12789540, PROT_READ, MAP_PRIVATE, 5, 0) = 0x737a9000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x157f000} ---
__kernel void loop1 (__global float * a) {
int id = get_local_id(0);
a[id] = a[id] * 2;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 30 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items) (lids)
// label: %start_of_function
or r0, unif, unif
or ra0, unif, unif
// label: %tmp.0
or r0, r0, r0
ldi r1, 255
and r0, r0, r1
and r0, r0, r1
shl r0, r0, 2 (2)
add r1, ra0, r0 /// (1)
nop.never
or tmu0s, r1, r1 /// (2)
nop.load_tmu0.never
or r0, r4, r4
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fmul vpm, r0, 2.000000 (33)
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r1, r1
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -26 // to %start_of_function
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
(1) and (2) can be fused. To remove (1), we should
r1
r1
is only once, replace (2) to (1)This is related to the plan of implementation of enhancement of general porpose optimizations #47.
Sometimes, when failing to compile an OpenCL C source-code, the pre-compiler (clang) deletes the symbolic link /dev/stdout
. This results in any successive compilations with VC4C (and probably any other program using stdout
) to fail.
To "fix" this, the /dev/stdout
symbolic link needs to be re-created. This can be done by rebooting the Raspberry Pi.
Another "fix" is to configure VC4C with the CMake option PRECOMPILER_DROP_RIGHTS
enabled, which will run the pre-compilation process with the user pi
(instead of root, if run with sudo). This should prevent the process from deleting /dev/stdout
.
Since I cannot reliably reproduce the issue, I don't know if the fix works. If anyone can create a stable reproduction, please notify me.
Currently, vc4c-0.4.deb
depends of /opt/SPIRV-LLVM/build/bin/clang
in preinst stage.
It is inconvenient for ordinary users because it requires compilation of SPIRV-LLVM
by hand.
My suggestion is,
clang
as default front-endVC4C_SPIRV_LLVM
or so on) are set, use it as front-endIn my understand, VC4C
can use official clang
as front-end, but use of SPIRV-LLVM
is better for better performance. Is it correct?
Hi,
I want to try your works, but I cannot find the usage.
Can you add it?
I don't want to compile LLVM tool-chains in rasberrypi, so I want cross-compilation usages.
Currently, the LLVM IR front-end reads in the textual representation of the LLVM-IR, this has several drawbacks:
fibonacci.cl
example and a very long loading time.Similar to the SPIR-V front-end, use a library to parse the LLVM IR (in binary format). The LLVM library could be used (component bitreader
)
llvm-dev
package (which is available in Raspbian repo), but would be unaffected by changes in the IR (bitcode)llvm-config
(see here and here)Steps to efficiently read binary IR:
make install
does not install libcpplog
I had to manually run sudo make install
from build/build/cpplog
.
I also observed that libcpplog
is not installed in the CMAKE_INSTALL_PREFIX
that I set when running cmake
.
Currently VC4C
on x86_64 fail to compile as follows:
Debian packages are downloaded from https://circleci.com/gh/nomaddo/VC4C/137#artifacts/containers/0.
$ docker run -v /tmp:/tmp -v ${HOME}/clABCMarks:/marks --rm -it nomaddo/cross-rpi:0.1
$ dpkg-deb -i vc4cl-stdlib-0.4-Linux.deb
$ dpkg-deb -x vc4c-0.4-Linux.deb /
$ /opt/SPIRV-LLVM/build/bin/clang -cc1 -triple spir-unknown-unknown -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-all -Wno-gcc-compat -x cl -emit-pch -o /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -I /usr/local/include/vc4cl-stdlib/ /usr/local/include/vc4cl-stdlib/opencl-c.h
$ VC4C --asm -o /tmp/hoge.s /marks/mul2.cl
VC4C --asm -o /tmp/hoge.s /marks/mul2.cl
[D] Fri Mar 2 13:26:31 2018: Compiling '/marks/mul2.cl' into '/tmp/hoge.s' with options '' ...
[D] Fri Mar 2 13:26:31 2018: Temporary file '/tmp/vc4c-PdE2jQ' created
[D] Fri Mar 2 13:26:31 2018: Temporary file '/tmp/vc4c-Ptoo1K' created
[I] Fri Mar 2 13:26:31 2018: Compiling OpenCL to LLVM-IR with :/opt/SPIRV-LLVM/build/bin/clang -cc1 -triple spir-unknown-unknown -I /marks -O3 -ffp-contract=off -cl-std=CL1.2 -cl-kernel-arg-info -cl-single-precision-constant -Wno-undefined-inline -Wno-unused-parameter -Wno-unused-local-typedef -Wno-gcc-compat -include-pch /usr/local/include/vc4cl-stdlib/VC4CLStdLib.h.pch -x cl -S -emit-llvm-bc -o /tmp/vc4c-Ptoo1K /marks/mul2.cl
[I] Fri Mar 2 13:26:31 2018: Converting LLVM-IR to SPIR-V with :/opt/SPIRV-LLVM/build/bin/llvm-spirv -o /tmp/vc4c-PdE2jQ /tmp/vc4c-Ptoo1K
[D] Fri Mar 2 13:26:31 2018: Temporary file '/tmp/vc4c-Ptoo1K' deleted
[I] Fri Mar 2 13:26:31 2018: Compilation complete!
[I] Fri Mar 2 13:26:31 2018: Using SPIR-V frontend...
[D] Fri Mar 2 13:26:31 2018: Read SPIR-V binary with 161 words
[D] Fri Mar 2 13:26:31 2018: Starting parsing...
[D] Fri Mar 2 13:26:31 2018: SPIR-V header parsed: magic-number 0x7230203, version 0x10000, generator 6000e, max-ID 21
[D] Fri Mar 2 13:26:31 2018: Using supported capability: Addresses
[D] Fri Mar 2 13:26:31 2018: Using supported capability: Linkage
[D] Fri Mar 2 13:26:31 2018: Using supported capability: Kernel
[D] Fri Mar 2 13:26:31 2018: Using supported capability: Vector16
[D] Fri Mar 2 13:26:31 2018: Importing extended instruction set: OpenCL.std
[D] Fri Mar 2 13:26:31 2018: Using a physical OpenCL memory model
[D] Fri Mar 2 13:26:31 2018: Kernel-method found: loop1
[E] Fri Mar 2 13:26:31 2018: Met unsupported instruction-decoration 11
[E] Fri Mar 2 13:26:31 2018: Unsupported operation: at ?
[E] Fri Mar 2 13:26:31 2018: (1) /usr/local/lib/libVC4CC.so.1.2 : vc4c::CompilationError::CompilationError(vc4c::CompilationStep, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x121 [0x7fead7f2fe13]
[E] Fri Mar 2 13:26:31 2018: (2) /usr/local/lib/libVC4CC.so.1.2 : vc4c::spirv2qasm::SPIRVParser::parse(vc4c::Module&)+0x53b [0x7fead813ca6d]
[E] Fri Mar 2 13:26:31 2018: (3) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::convert()+0x11f [0x7fead7f3165d]
[E] Fri Mar 2 13:26:31 2018: (4) /usr/local/lib/libVC4CC.so.1.2 : vc4c::Compiler::compile(std::istream&, std::ostream&, vc4c::Configuration, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, vc4c::Optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x250 [0x7fead7f31d44]
[E] Fri Mar 2 13:26:31 2018: (5) VC4C : main+0xcf6 [0x5642ad15ddfc]
[E] Fri Mar 2 13:26:31 2018: (6) /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main+0xf1 [0x7fead66fc2b1]
[E] Fri Mar 2 13:26:31 2018: (7) VC4C : _start+0x2a [0x5642ad15c72a]
[D] Fri Mar 2 13:26:31 2018: Temporary file '/tmp/vc4c-PdE2jQ' deleted
[E] Fri Mar 2 13:26:31 2018: Compiler threw exception: Parser: Unsupported operation:
terminate called after throwing an instance of 'vc4c::CompilationError'
what(): Parser: Unsupported operation:
Aborted (core dumped)
@doe300 Do you have any idea?
A current git pull on a box with a previously working build has produced compile errors. CLANG_PATH appears to be undefined in the Precompiler source for the current commit. I have clang-3.9 installed in /usr/bin/.
Command used for building:
cd VC4C && git pull && git checkout . && cmake . && make clean && make
Output:
[ 37%] Building CXX object build/CMakeFiles/VC4CC.dir/Module.cpp.o
[ 37%] Building CXX object build/CMakeFiles/VC4CC.dir/Locals.cpp.o
[ 38%] Building CXX object build/CMakeFiles/VC4CC.dir/Precompiler.cpp.o
/home/pi/VC4C/src/Precompiler.cpp: In static member function ‘static vc4c::SourceType vc4c::Precompiler::getSourceType(std::istream&)’:
/home/pi/VC4C/src/Precompiler.cpp:125:38: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
else if(std::atol(buffer.data()) == QPUASM_MAGIC_NUMBER || std::atol(buffer.data()) == QPUASM_NUMBER_MAGIC)
~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/home/pi/VC4C/src/Precompiler.cpp:125:89: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
else if(std::atol(buffer.data()) == QPUASM_MAGIC_NUMBER || std::atol(buffer.data()) == QPUASM_NUMBER_MAGIC)
~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/home/pi/VC4C/src/Precompiler.cpp: In function ‘void compileOpenCLToLLVMIR(std::istream&, std::ostream&, const string&, bool, const vc4c::Optional<std::__cxx11::basic_string >&, const vc4c::Optional<std::__cxx11::basic_string >&)’:
/home/pi/VC4C/src/Precompiler.cpp:254:31: error: ‘CLANG_PATH’ was not declared in this scope
const std::string compiler = CLANG_PATH;
The for-loop creates redundant instructions in the instructions of checking exit condition.
ex)
kernel void loop1 (global float a[], global float b[]) {
for (int i = 0; i < 1000; i++) {
a[i] = -i;
b[i] = -i;
}
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 67 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), __global out float* b (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r2, unif, unif
or r1, 0 (0), 0 (0)
nop.never
or r0, r1, r1
sub r0, 0 (0), r0
itof r3, r0; v8min r0, r1, r1
shl r0, r0, 2 (2)
add r0, ra0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or r0, r1, r1
or r1, r1, r1
shl r0, r0, 2 (2)
add r1, r1, 1 (1)
add r0, r2, r0
ldi ra1, 1000
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
xor.setf -, r1, ra1 /// redundant ?
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1) ///
or.setf -, r0, r0 ///
or.ifz r1, r1, r1 ///
or.setf -, elem_num, r0 ///
brr.ifallzc (pc+4) + 4
nop.never
nop.never
nop.never
brr.ifanyz (pc+4) + -40
nop.never
nop.never
nop.never
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -63
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
To me, above instructions (specifying ///
) can be expressed as isub.setf -, r1, ra1
.
Then, if the condition is negative, jump to the head of loop. Otherwise go through.
In the following code, multiple additions can be combined to one multiplication instruction.
void kernel test(float global * a, const int n){
float sum = 0.0;
float c = a[0];
for (int i = 0; i < 10; i++)
sum += 10.0 + c;
a[get_global_id(0)] = sum;
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'test' with 47 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), int n (4 B, 1 items) (lSize, lids, gidX, offX)
// label: %start_of_function
or r3, unif, unif
or ra3, unif, unif
or r1, unif, unif
or r2, unif, unif
or ra1, unif, unif
or -, unif, unif
// label: %tmp.0
or tmu0s, ra1, ra1
nop.load_tmu0.never
itof r0, 10 (10)
fadd rb0, r4, r0
or ra2, r1, r1
or r0, rb0, rb0
fadd r0, rb0, r0 // can be improved
fadd r0, rb0, r0 //
fadd r0, rb0, r0 //
fadd r0, rb0, r0 //
fadd r0, rb0, r0 //
fadd r0, rb0, r0 //
fadd r0, rb0, r0 //
fadd ra0, rb0, r0 //
or r0, r3, r3
ldi r1, 255
and r3, r0, r1
or r2, r2, r2; v8min r0, ra3, ra3
and r1, r0, r1
mul24 r0, ra2, r3
add r0, r2, r0
add r0, r0, r1
shl r0, r0, 2 (2)
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fadd vpm, rb0, ra0
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add vpw_addr, ra1, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -43 // to %start_of_function
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.