cucapra / diospyros

Search-based compiler for high-performance DSP programming

License: MIT License

Racket 33.67% Makefile 1.61% Python 16.11% C 29.88% Rust 10.54% C++ 4.19% Shell 0.06% JavaScript 0.77% HTML 3.08% CSS 0.09%

diospyros's People

Contributors

sgpthomas akothen

diospyros's Issues

Adding XTENSA guards to generated intrinsic kernels

For generated intrinsic kernels, maybe we should wrap the code in #ifdef XTENSA ... #endif so that the kernel is transparent to non-Tensilica compilation targets. If we want, there can also be a fallback to the specification code. If we need both specification and intrinsic code, we may need some sort of namespace solution. Perhaps we should discuss more.

The use case is regression testing and CI. Not all CI machines are likely to have access to the Tensilica toolchain, but we can still validate any functionality we care about using a CPU compiler.
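A minimal sketch of the guard with a fallback to the specification code. The macro spelling XTENSA is taken from the text above (the macro the Tensilica toolchain actually defines may differ), and the names scale2, scale2_spec, and scale2_intrinsics are hypothetical. The intrinsic kernel is only declared here, standing in for the generated code.

```c
#include <stddef.h>

/* Specification code: plain C that any CPU compiler accepts. */
static void scale2_spec(const float *a_in, float k, float *b_out) {
  for (size_t i = 0; i < 2; i++) {
    b_out[i] = a_in[i] * k;
  }
}

#ifdef XTENSA
/* Placeholder for the generated intrinsic kernel (declaration only). */
void scale2_intrinsics(const float *a_in, float k, float *b_out);
#endif

/* Single entry point: intrinsics on Tensilica targets, spec elsewhere. */
void scale2(const float *a_in, float k, float *b_out) {
#ifdef XTENSA
  scale2_intrinsics(a_in, k, b_out);
#else
  scale2_spec(a_in, k, b_out);
#endif
}
```

On a CI machine without the Tensilica toolchain, XTENSA is undefined, so the same test harness exercises the specification path.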

Runtime validation failing for various matrix multiply configurations

Initial results from running coverage over different matrix multiply dimensions show the following configurations failing validation. I think this may again be related to how we store the results back to memory, but I'm not sure. The failing cases appear to be corner cases with very small dimensions. The remaining results are still being generated.

Failing configurations:
input_rows, input_cols, output_cols
1, 1, 2
1, 2, 3
1, 1, 5
1, 1, 7
1, 2, 1
1, 2, 2
1, 2, 3
1, 3, 1
1, 3, 2
1, 3, 3
1, 4, 2
1, 6, 1

Repro from 1054f25:

eigen --kernel multiple --input_rows <input_rows> --input_cols <input_cols> --output_cols <output_cols>
Running the test harness then reports a failure when validating the synthesized kernel against the specification.

Shared benchmark Makefile

So that for each benchmark, we only need to change parameters and reference names

Implement dead code elimination/value numbering for shuffles

We should run DCE after register allocation and shuffle truncation. We generate quite a few unnecessary registers.

multiple functions in `cdios`

Support compiling multiple functions in one module in cdios. For simplicity, enforce the specification restrictions on the last function (which can call into other functions).

Partial PDX_SAV_MXF32_XP stores must be flushed to memory

The intrinsic that saves to memory works if filling all 16 bytes when called as such:
PDX_SAV_MXF32_XP(v_43, align_c_out, (xb_vecMxf32 *)c_out, 16);
However, when filling fewer than 16 bytes, the store does not propagate. For instance, invoking this instruction by itself will not actually propagate the store to the first word of memory:
PDX_SAV_MXF32_XP(v_43, align_c_out, (xb_vecMxf32 *)c_out, 4);
The solution is to either add the following flush after the PDX_SAV_MXF32_XP instruction or in the case of writing one word, to just store to the pointer:
PDX_SAPOS_FP(align_c_out, (xb_vec4Mx8 *)c_out); // added after PDX_SAV_MXF32_XP
or
*c_out = ; // just directly save value

Currently the compiler-generated kernel does not flush the data out, so values are not propagated and the results are partially incorrect.

Kernel fails with contract violation (support scalar outputs)

Minimal repro:

void foo(float x_in, float y_in[3], float z_out) {
  float acc = 0.0f;
  for (int a = 0; a < 3; a++) {
    acc = acc + x_in * y_in[a];
  }
  z_out = acc;
}

Error message:

[vtlee@vtlee-fedora-IT2277812 ifelse0]$ diospyros.py --manifest spec/diospyros.json 
Standard C compilation successful
Writing intermediate files to: compile-out
in-range: contract violation
  expected: real?
  given: #f
Error: Compilation aborted. cdios return error code 1.

Pre-calculate reg-of table

quotient and bsdiv are expensive. Avoid exposing them to the solver by pre-calculating the register-of function up to the maximum index size.

Profile to check that this is faster.
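To illustrate the idea, here is a sketch in C of a pre-calculated register-of table. In diospyros the table would live on the Racket side and be handed to the solver as concrete values; REG_SIZE, MAX_INDEX, and the function names are assumptions made for this example.

```c
#include <stddef.h>

#define REG_SIZE 4   /* assumed number of lanes per vector register */
#define MAX_INDEX 64 /* assumed maximum index size */

static size_t reg_of_table[MAX_INDEX];

/* Fill the table once, so later lookups need no division. */
void init_reg_of_table(void) {
  for (size_t i = 0; i < MAX_INDEX; i++) {
    reg_of_table[i] = i / REG_SIZE;
  }
}

/* Register holding element idx; the caller guarantees idx < MAX_INDEX. */
size_t reg_of(size_t idx) {
  return reg_of_table[idx];
}
```

The division happens only while the table is built, so each query sees a plain lookup.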

Implement predicated VMAC

The SDK instruction has a predicated vector mac instruction which might be easier to compile to than a predicated store instruction (which doesn't seem to exist).

Enable scalar inputs to specification

Need something like this supported:

void foo(float a_in, float b_in, float c_in, float *x0_out, float *x1_out) {
// function body
}

Can probably be implemented by a 1x1 pointer right now but would improve quality of life.
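A sketch of that workaround, with each scalar passed as a one-element array. Whether cdios accepts exactly this form is an assumption, and the function name and body are hypothetical.

```c
/* Workaround sketch: every scalar becomes a one-element array. */
void foo_scalar_workaround(float a_in[1], float b_in[1], float c_in[1],
                           float x0_out[1], float x1_out[1]) {
  /* Hypothetical body, for illustration only. */
  x0_out[0] = a_in[0] + b_in[0];
  x1_out[0] = a_in[0] * c_in[0];
}
```

The caller has to wrap and unwrap each scalar, which is the quality-of-life cost that native scalar inputs would remove.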

Cdios: update import for c-utils library

We should import as (require (prefix-in c: c)) so as to not overwrite the definition of struct

cdios: support for sqrt

Needed for vector norms and unit quaternion calculations.

Example specification:

  float acc = 0.0f;
  for (int i = 0; i < 4; i++) {
    acc += a_in[i] * a_in[i];
  }
  for (int i = 0; i < 4; i++) {
    b_out[i] = a_in[i] / std::sqrt(acc);
  }
}

Timeout for translation validation in synth.rkt

Would make sense to throw in timing data per query here, too.

Emit PDX_SAPOS_FP for final stores

I think at some point we fixed this issue with the flush instruction but it appears to have resurfaced again on my end.

Repro from diospyros/utils:

eigen.py --kernel multiply
By default produces a 3x3x3x3 matrix multiply kernel
Running test should throw a failure since matrices do not match

Fix for kernel is to add this at the bottom of the generated kernel:
PDX_SAPOS_FP(align_c_out, (xb_vec4Mx8 *)c_out);

Diffing the kernel generated by master with the kernel that was generated after commit 9e0119b only shows changes to the DRAM section. The code body appears to be the same but both cases seem to be failing on my machine now without the flush instruction.

Cdios error messages for unhandled C constructs should be lifted to surface syntax

Conditionals in `cdios`

Compile basic data-independent conditionals (e.g., 2dconv) with cdios.

Control namespace for generated kernels

Generated kernels should probably be encapsulated in a namespace to avoid polluting the base namespace and so that we can precisely call different implementations with potentially the same function call for testing. Probably can get away with a default namespace like dios or diospyros for now but to facilitate automation we may want to allow it to eventually be user controlled since we'll be generating override kernels.

For instance, we may need to call:
// base definition
Eigen::foo()
// our optimized override
OurNamespace::Eigen::foo()

Need control over emitted kernel function name

Need control over the code generated kernel function name to facilitate integration. Currently the tool defaults to void kernel() in the generated file. For automation flows, this will eventually need to be supplied by the user so that it can be changed to something arbitrary to eliminate manually copying and adjusting the function call.

This probably will need us to plumb through arguments to the tool to propagate it through to the code generator.

Update mat-mul to use vec-write

Then remove continuous-aligned-vec

Demo requires pypy3 installation

Repro:

cdios demo/matrix-multiply.c
Standard C compilation successful
Writing intermediate files to: compile-out
cat compile-out/spec.rkt | pypy3 src/dios-egraphs/vec-dsl-conversion.py -w 4 -p > compile-out/spec-egg.rkt
/bin/sh: pypy3: command not found
make: *** [Makefile:52: compile-out/spec-egg.rkt] Error 127
cat: compile-out/kernel.c: No such file or directory

Fixed by installing PyPy following https://doc.pypy.org/en/latest/install.html
After installation:

pypy3 -m ensurepip
pypy3 -m pip install sexpdata

Aligned writes into output matrix

Instead of using vector-shuffle-set! to set the computation on the output array, create a sketch where the write is done at locations $i, $i+1, ... and so on.

This forces the sketch to never shuffle the output.

Investigate XCC's "super software pipelining"

Investigate whether we can improve the baseline's performance with "super software pipelining"

See section 4.5 Software Pipelining of Xtensa C and C++ Compiler user guide.

Compute-shuffle sketch that can use `n` previous shuffle vectors

Implement a sketch where the compute gets to select one of the last n shuffle vectors defined before it. Having a parameter n allows us to make trade-offs with the size of the formula and the flexibility of the sketch.

Better error message for failed synth with run-experiment

When a configuration passed to run-experiment fails to synthesize, the error message is opaque:

application: not a procedure;
 expected a procedure that can be applied to arguments
  given: #<void>
  arguments...:
   #f

We should fail more gracefully

Replace the sketch's shuffle/select with a more general permutation

Shuffle/select should be added by later compiler passes, after register allocation.

Discrete Fourier transform example

Tensilica header imports missing from generated kernel.c

Looks like the generated kernel.c is missing the Tensilica header imports that support some of these typedefs and macros. Maybe we can emit all the assumed headers in the generated kernel.c file by default to make it self-contained?

By itself, kernel.c generates a bunch of errors like these when compiled on our build system:

stderr: arvr/projects/surreal/flash/diospyros/src/MatMult3x3x3x3.cpp:36:3: error: unknown type name 'valign'
valign align_a_in;
^
arvr/projects/surreal/flash/diospyros/src/MatMult3x3x3x3.cpp:37:33: error: use of undeclared identifier 'xb_vecMxf32'
align_a_in = PDX_LA_MXF32_PP((xb_vecMxf32 *)a_in);
^
arvr/projects/surreal/flash/diospyros/src/MatMult3x3x3x3.cpp:37:46: error: expected expression
align_a_in = PDX_LA_MXF32_PP((xb_vecMxf32 *)a_in);

Incremental synthesis

The first query should not use the cost model at all. The subsequent queries can then use the resulting program as an upper bound on cost.

scalars.h emitted as relative path breaks self-contained build

For the generated kernel, the scalars.h import uses a relative path in the file. Recommend having the build copy scalars.h into compile-out/include or something similar so that the compile output is self-contained.

#include <float.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <xtensa/sim.h>
#include <xtensa/tie/xt_pdxn.h>
#include <xtensa/tie/xt_timer.h>
#include <xtensa/xt_profiling.h>
#include "../../../../src/scalars.h" // scalars.h requires relative path -> would rather have #include <scalars.h> or <diospyros/scalars.h> or something similar.

Generated code comment fields to improve tracking

Could be useful to add additional comment information at the top of the generated code file. Right now it just includes a Git hash and status but at some point we may want to add some of these:

generated code timestamp
something along the lines of "This code was automatically generated by "
code path to the original specification file location
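A rough sketch of how such a header comment could be assembled. The function name, argument list, and field layout are all assumptions, not what the code generator currently does. The timestamp is passed in as a string so the output stays reproducible.

```c
#include <stdio.h>

/* Writes a provenance comment into buf; returns snprintf's result. */
int format_provenance(char *buf, size_t len, const char *tool,
                      const char *git_hash, const char *spec_path,
                      const char *timestamp) {
  return snprintf(buf, len,
                  "/* This code was automatically generated by %s.\n"
                  " * Git hash: %s\n"
                  " * Specification: %s\n"
                  " * Generated: %s\n"
                  " */\n",
                  tool, git_hash, spec_path, timestamp);
}
```

The caller would write the resulting string at the top of the generated file before the includes.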

Emit include header file to allow #include of kernel into codebase

For generated kernels, an accompanying header file is needed so that they can be compiled into larger code bases. A quick fix might be to emit a kernel.h and preprocessor.h file for now with just the signature definitions, which a build system can pick up.

C harness for codegen

cdios: support local static array initialization

In a point product kernel, we want to write:

float qvec[3] = {q_in[0], q_in[1], q_in[2]};

But currently need to write:

  float qvec[3];
  qvec[0] = q_in[0];
  qvec[1] = q_in[1];
  qvec[2] = q_in[2];

Tensilica codegen can re-order kernel input arguments

We use:

(define (all-args)
    (hash-keys map))

Which does not keep argument order in the kernel as desired: (order should be I, F, O)

void kernel(float * input_F, float * input_I, float * input_O)

Update runt tests

Ternary operator support

Enable support for ternary operators like the following to improve quality of life:

Repro specification:

void abs_test(float a_in[4], float b_out[4]) {
  for (int i = 0; i < 4; i++) {
    b_out[i] = (a_in[i] < 0) ? -a_in[i] : a_in[i];
  }
}

Compiler aborts with:

can't handle expr #s((expr:if expr 1) #s(src 95 3 15 129 3 49 #f) #s((expr:binop expr 1) #s(src 96 3 16 107 3 27 #f) #s((expr:array-ref expr 1) #s(src 96 3 16 103 3 23 #f) #s((expr:ref expr 1) #s(src 96 3 16 100 3 20 #f) #s((id:var id 1) #s(src 96 3 16 100 3 20 #f) a_in)) #s((expr:ref expr 1) #s(src 101 3 21 102 3 22 #f) #s((id:var id 1) #s(src 101 3 21 102 3 22 #f) i))) #s((id:op id 1) #s(src 104 3 24 105 3 25 #f) <) #s((expr:int expr 1) #s(src 106 3 26 107 3 27 #f) 0 ())) #s((expr:unop expr 1) #s(src 111 3 31 119 3 39 #f) #s((id:op id 1) #s(src 111 3 31 112 3 32 #f) -) #s((expr:array-ref expr 1) #s(src 112 3 32 119 3 39 #f) #s((expr:ref expr 1) #s(src 112 3 32 116 3 36 #f) #s((id:var id 1) #s(src 112 3 32 116 3 36 #f) a_in)) #s((expr:ref expr 1) #s(src 117 3 37 118 3 38 #f) #s((id:var id 1) #s(src 117 3 37 118 3 38 #f) i)))) #s((expr:array-ref expr 1) #s(src 122 3 42 129 3 49 #f) #s((expr:ref expr 1) #s(src 122 3 42 126 3 46 #f) #s((id:var id 1) #s(src 122 3 42 126 3 46 #f) a_in)) #s((expr:ref expr 1) #s(src 127 3 47 128 3 48 #f) #s((id:var id 1) #s(src 127 3 47 128 3 48 #f) i))))
  context...:
   /home/vtlee/.racket/7.3/pkgs/rosette/rosette/base/form/control.rkt:30:25
   /home/vtlee/diospyros/src/c-meta.rkt:135:12: for-loop
   /home/vtlee/.racket/7.3/pkgs/rosette/rosette/base/form/control.rkt:31:25
   /home/vtlee/.racket/7.3/pkgs/rosette/rosette/base/form/control.rkt:30:25
   /home/vtlee/diospyros/src/c-meta.rkt:135:12: for-loop
   /home/vtlee/.racket/7.3/pkgs/rosette/rosette/base/form/control.rkt:31:25
   /home/vtlee/diospyros/src/c-meta.rkt:205:0: translate-fn-decl
   "/home/vtlee/diospyros/src/c-meta.rkt": [running body]
   temp37_0
   for-loop
   run-module-instance!125
   perform-require!78
Error: Compilation aborted. cdios return error code 1.

Would enable macros and other compact code like:

#define abs(x) (((x) < 0) ? -(x) : (x))

Probably a nice-to-have feature but not absolutely necessary. This one can probably be ignored if if-else support works.
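For reference, a sketch of the same kernel written both ways. The if-else form is what a user would fall back to, assuming data-dependent if-else is supported by then; the function names are hypothetical.

```c
/* Ternary form, as in the repro specification. */
void abs_ternary(const float a_in[4], float b_out[4]) {
  for (int i = 0; i < 4; i++) {
    b_out[i] = (a_in[i] < 0) ? -a_in[i] : a_in[i];
  }
}

/* Equivalent if-else form. */
void abs_if_else(const float a_in[4], float b_out[4]) {
  for (int i = 0; i < 4; i++) {
    if (a_in[i] < 0) {
      b_out[i] = -a_in[i];
    } else {
      b_out[i] = a_in[i];
    }
  }
}
```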

Support for exponentiation in cdios

We can support squares written as n * n; it should be fairly simple to extend this to target Racket's expt.

 float powf(float x, float y);

2D matrices in `cdios`

Support multidimensional (but still statically defined) matrices in cdios. Plan is to still unroll to single-dimensional Racket.
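A small sketch of the intended unrolling, shown in C. Row-major layout and the names ROWS, COLS, flat_index, and flatten are assumptions made for this example (note that the matrix multiply specifications elsewhere index column-major).

```c
#define ROWS 2
#define COLS 3

/* Row-major offset of element (i, j) in a ROWS x COLS matrix. */
static int flat_index(int i, int j) {
  return i * COLS + j;
}

/* Copies a statically sized 2D matrix into a flat 1D array. */
void flatten(float m[ROWS][COLS], float flat[ROWS * COLS]) {
  for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
      flat[flat_index(i, j)] = m[i][j];
    }
  }
}
```

Because the dimensions are static, every m[i][j] access can be rewritten to a single-dimensional access at a constant stride.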

2x1 x 1x4 matrix multiply causes ISS to abort

While running coverage over different matrix multiply dimensions, the following pathological case appears to generate a kernel that causes the instruction set simulator validation to abort.

Repro configurations:

Specification definition specification.c :

/*!

  Specification file of the target kernel to be consumed by the Diospyros tool

*/

#define A_ROWS 2
#define A_COLS 1
#define B_COLS 4

void MatMult2x1x1x4(
    float a_in[A_ROWS * A_COLS],
    float b_in[A_COLS * B_COLS],
    float c_out[A_ROWS * B_COLS]) {
  for (int i = 0; i < A_ROWS; i++) {
    for (int j = 0; j < B_COLS; j++) {
      c_out[j * A_ROWS + i] = 0;

      for (int k = 0; k < A_COLS; k++) {
        c_out[j * A_ROWS + i] += a_in[k * A_ROWS + i] * b_in[j * A_COLS + k];
      }
    }
  }
}

Manifest definition file diospyros.json:
{
  "inputs": {
    "a": "Eigen::Matrix<float, 2, 1>",
    "b": "Eigen::Matrix<float, 1, 4>"
  },
  "outputs": {
    "c": "Eigen::Matrix<float, 2, 4>"
  },
  "test": "c = a * b",
  "name": "MatMult2x1x1x4",
  "namespace": "Eigen",
  "specification": "specification.c",
  "specification_kernel": "MatMult2x1x1x4",
  "manifest_path": "build_2x1x1x4/spec",
  "build": "build_2x1x1x4",
  "src_dir": "build_2x1x1x4/src",
  "include_dir": "build_2x1x1x4/include",
  "bin": "build_2x1x1x4/bin",
  "test_dir": "build_2x1x1x4/test"
}

Call diospyros --manifest diospyros.json and run test and/or benchmark code. Throws the following exception during ISS:
*WARNING* Unhandled user exception: LoadStoreAlignmentCause (0xbf26daea)

cdios: disallow modification of for loop variable in body

If the body of a Racket for loop modifies the index variable, it does not persist to the next iteration: the front end of cdios should error on this (or do the smart thing and translate to an equivalent recursive while). Currently FFT is rewritten manually to a while loop to work around this.
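A sketch of the pattern in question and the while-loop rewrite, in plain C. The function names are hypothetical. In C the body's change to i carries into the next iteration; a direct translation to a Racket for loop would lose it.

```c
/* C semantics: the body's change to i persists into the next iteration. */
int count_for(int n) {
  int visits = 0;
  for (int i = 0; i < n; i++) {
    visits++;
    if (i % 2 == 0) {
      i++; /* skip the next index */
    }
  }
  return visits;
}

/* Equivalent while loop with the index update made explicit. */
int count_while(int n) {
  int visits = 0;
  int i = 0;
  while (i < n) {
    visits++;
    if (i % 2 == 0) {
      i++; /* skip the next index */
    }
    i++; /* the for loop's increment step */
  }
  return visits;
}
```

The while form keeps all index updates in the body, which is the shape the manual FFT rewrite relies on.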

raco pkg install c possibly missing from setup

Repro:

Set up according to the README
cdios demo/matrix-multiply.c

Throws:
standard-module-name-resolver: collection not found
for module path: c
collection: "c"
in collection directories:
/home/vtlee/.racket/7.3/collects
/usr/local/share/racket/collects
... [171 additional linked and package directories]

Fixed it with:
raco pkg install c-utils

Support early returns/break/continue in cdios

Need continuations to translate the following patterns in C:

continue
break
(early) return

for (i =0; i < N; i++) {
    if (i > N/2) continue; ... 
}

for (i =0; i < N; i++) {
    if (i > N/2) break; ... 
}

if (foo < bar) return;

Racket's for construct does not support these out of the box; we'll want to use explicit continuations.

Update (make-symbolic-bv-list ty size) to take an optional name

For readability and parsing specs, it would be nice for (make-symbolic-bv-list ty size) to take an optional name parameter instead of naming all symbolic values v.

Codegen needs to handle boundary conditions in load/store

Right now loads/stores are to multiples of the full register width, but need to handle inputs that are not aligned to that size.
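One common way to handle this, sketched in scalar C rather than with intrinsics: process full register-width chunks, then finish the leftover elements in a tail loop. VEC_WIDTH and the function name are assumptions, and the inner lane loop stands in for a single vector operation.

```c
#include <stddef.h>

#define VEC_WIDTH 4 /* assumed number of float lanes per register */

void add_with_tail(const float *a, const float *b, float *out, size_t n) {
  /* Largest multiple of VEC_WIDTH that fits in n. */
  size_t full = n - (n % VEC_WIDTH);

  /* Full-width chunks: each inner loop models one vector operation. */
  for (size_t i = 0; i < full; i += VEC_WIDTH) {
    for (size_t lane = 0; lane < VEC_WIDTH; lane++) {
      out[i + lane] = a[i + lane] + b[i + lane];
    }
  }

  /* Tail: the remaining n % VEC_WIDTH elements, handled one at a time. */
  for (size_t i = full; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
```

In generated code the tail would more likely be a partial vector load/store than a scalar loop, but the split between full chunks and remainder is the same.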

QR decomposition in cdios

Debug the C-based implementation of QR decomposition to reach parity with the Racket DSL.

Enable support for data-dependent if-else

Needed for catching corner cases regarding pathological values.

Repro specification:

  float acc = 0.0f;
  for (int i = 0; i < 4; i++) {
    acc += a_in[i] * a_in[i];
  }
  for (int i = 0; i < 4; i++) {
    if (acc == 0.0f) {
      b_out[i] = 0.0f;
    } else {
      b_out[i] = a_in[i] / acc;
    }
  }
}

Compilation aborts with:

Writing intermediate files to: compile-out
==: this match expander must be used inside match
in: (== acc 0.0)
context...:
do-raise-syntax-error
apply-transformer-in-context
apply-transformer52
dispatch-transformer41
for-loop
[repeats 1 more time]
finish-bodys
for-loop
finish-bodys
for-loop
[repeats 1 more time]
finish-bodys
for-loop
[repeats 1 more time]
finish-bodys
for-loop
...
Error: Compilation aborted. cdios return error code 1.

cdios should build fresh racket executable if it doesn't exist

Flag to run translation validation from cdios

CDIOS: Compiling C->Racket failed

cdios cdios-tests/matrix-multiply.c
Standard C compilation successful
Writing intermediate files to: compile-out
CDIOS: Compiling C->Racket failed
src/utils.rkt:47:25: current-oracle: unbound identifier
in: current-oracle
context...:
/home/diospyros/Downloads/rosette/rosette/base/form/module.rkt:16:0

I have just started learning Racket. With Racket 8.0, cdios fails with the error above. I don't think it is a problem with the code itself. I installed the necessary components according to the instructions, but I may have missed some. Any suggestions would be appreciated, thank you!

Only emit `scalars.h` when necessary

Codegen currently emits #include "../../../../src/scalars.h" unconditionally; it should do this only when necessary (when the code uses scalar negation or sgn).