cucapra / dahlia
Time-sensitive affine types for predictable hardware generation
Home Page: https://capra.cs.cornell.edu/dahlia
License: MIT License
%{ stmts }%
allows the program to recover the capabilities for arrays and implies that a sequence of statements is executed on different clock cycles. Implementing this for now requires changing the capability checking and the type checker.
This program is valid:
%{ a[i] := c[i] + 1 }%
%{ b[i] := 2 * a[i] }%
This case broke with the re-implementation of index types. See the comment in matadd_1dfullnrl.sea for details.
This program should be rejected since it creates a loop-carried dependency. The sequential and parallel semantics don't match up:
let c = 0;
for (...) unroll 3 {
c := a[i] + c
}
However, this example is fine:
for (...) unroll 3 {
let c = 0;
c := a[i] + c;
}
since c cannot carry dependencies across iterations.
Adrian says:
I think the right restriction is probably something like “variables declared outside a loop can be read or written in a loop body, but not both”
To figure out a solution, consider whether these should FAIL or PASS:
(1)
let c = 0;
for (...) unroll 3 {
c := 1;
a[i] := c;
}
(2)
let c = 0;
for (...) unroll 3 {
a[i] := c
}
(3) Note the missing unroll
let c = 0;
for (...) {
c := a[i] + c;
}
(4) add unroll to (3)
let c = 0;
for (...) unroll 3 {
c := a[i] + c;
}
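Adrian's proposed restriction is mechanical enough to prototype. Below is a hypothetical sketch (the `check_loop` helper and its read/write sets are invented for illustration, not part of the seac implementation) that applies the rule "declared outside, read and written inside: reject" to the examples above:

```python
# Sketch of the proposed restriction: a variable declared outside a
# loop may be read OR written in the body, but not both.
# check_loop and its read/write-set inputs are hypothetical.

def check_loop(outer_vars, reads, writes):
    """Reject if any variable declared outside the loop is both
    read and written inside the loop body."""
    bad = outer_vars & reads & writes
    return len(bad) == 0  # True = passes the restriction

# Examples (1), (2), (4) from above; `c` is declared outside each loop.
ex1 = check_loop({"c"}, reads={"c", "a"}, writes={"c", "a"})  # c read and written
ex2 = check_loop({"c"}, reads={"c"}, writes={"a"})            # c only read
ex4 = check_loop({"c"}, reads={"c", "a"}, writes={"c"})       # loop-carried dep
```

Under this rule, (1) and (4) fail while (2) passes. Note that the rule as stated does not mention unroll, so (3) would be rejected as well even though its sequential execution looks harmless.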
This program compiles:
func madd(a: int[1024 bank(32)], b: int, c: int[1024 bank(32)]) {
for (let i = 0..31) unroll 4 {
c[i] := a{0}[i] + 1;
}
}
It shouldn't.
We have the following type checking rule for e1 --- e2:

G, e1 --> G', t1    G, e2 --> G'', t2    Gf = G' + G''
------------------------------------------------------- where G' + G'' = G''
G, e1 --- e2 --> Gf, t2
The definition of + on G is incorrect for this case:
{
let x = 1;
---
let y = 2;
};
x + y // error: x is unbound
Figure out the right + and implement it.
For the benchmark harness (high priority):
- Store command output in log.txt, instead of in the JSON job info dictionary. As a consequence, running commands could "stream" to the log, letting us see in-progress output before the command is done executing.
- Finish fixing the Dockerfile to work with opam2 so it can actually run on Gorgonzola again. (@sampsyo is working on this.)
- Merge the workproc branch described in https://github.com/cucapra/seashell-gorgonzola/issues/1 to fix reliability issues.
- Make the buildbot actually run things on the Zynq board. (@sa2257 is working on this.)
- Add a no-Seashell mode to just HLSify and execute plain C programs.
- Add an option to run just the Vivado hardware resource estimation instead of doing the full synthesis flow.
- Add the ability to change the status of a job. This is really useful for debugging, when you want to take an old job and "rewind" it to an earlier stage that failed before. (It's currently possible to do this by hand by editing the JSON job data file, but then you have to give the server a kick somehow to pick up the new status.)
cpp or fuse file in the provided archive.

@sa2257 Can you complete the tutorial docs you were writing?
Proposal to bring back functions:
- Add CFuncDef and EApp nodes to represent function definitions and applications.
- Change Prog to be a list of declarations, followed by a list of function definitions, followed by top-level commands. (Possibly remove decls altogether and require a main function to be defined.)
- Type check CFuncDef nodes. Checking EApps is non-trivial since memories used in an EApp expression need to have their banks consumed correctly.
- Extend Emit to spit out function definitions.

Some of the examples in the Xilinx manual allow for this:
void inp(int a[10], int readline) {
int c = a[readline/2];
}
It seems we can conservatively allow for these by making dynamic accesses consume all banks of the array.
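One conservative reading of that rule can be sketched as follows. This is a hypothetical helper, not the actual seac logic, and it assumes interleaved banking where bank = index mod number-of-banks:

```python
# Sketch of the conservative rule: a static index consumes exactly one
# bank; a dynamic index (like `readline/2` above) consumes all banks.
# The representation here is invented for illustration.

def banks_consumed(index, num_banks):
    """Return the set of banks an access consumes, assuming
    interleaved banking (bank = index mod num_banks)."""
    if isinstance(index, int):          # static index: one bank
        return {index % num_banks}
    return set(range(num_banks))        # dynamic index: all banks

print(banks_consumed(5, 4))             # static access
print(banks_consumed("readline/2", 4))  # dynamic access: every bank
```

Consuming every bank keeps the affine accounting sound at the cost of serializing any other access to the same array in that time step.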
Some of the MachSuite apps need bitwise operations: bitwise ands, ors, nots, and shifting. Here are a couple of apps where this appears to be important:
fft/strided
sort/radix
What does this piece of code mean?
decl a: bit<16>[10 bank 2];
for (let i = 0..10) unroll 2 {
let x = a[i];
} combine {
sum += x;
---
a[i]
}
Note: --- is the new syntax for %{ }%.

For some of the MachSuite apps, it would be helpful to have some kind of "and" type, like a struct, that could be declared in the Fuse body (by "Fuse body" I mean not declared with decl, which would get compiled to a function argument). It would be helpful in particular in:
- bfs/queue: need a way to represent a node of a graph.
- md/grid: in the C code, vectors are represented as structs that are declared in the "Fuse body" I referred to above. I thought of just using Fuse arrays to represent the vectors, but they can only be declared with decl.

We're seeing an issue where job tasks (e.g., commands) run in the wrong directory: that is, the right directory for the wrong job. For example, here's the log from a run I manufactured where I configured the "HLS prefix" to just echo the directory name and then sleep, to force concurrent job submissions to run at the same time in different stages:
$ sh -c 'pwd ; sleep 1' -- sds++ -sds-pf zed -sds-hw float float.cpp -sds-end -clkid 3 -poll-mode 1 -verbose -Wall -O3 -c -MMD -MP '-MF"vsadd.d"' float.cpp -o float.o
/Users/asampson/Documents/cu/research/seashell/buildbot/instance/jobs/26QGyg_2IGg/code
$ sh -c 'pwd ; sleep 1' -- sds++ -sds-pf zed -sds-hw float float.cpp -sds-end -clkid 3 -poll-mode 1 -verbose -Wall -O3 -c -MMD -MP '-MF"main.d"' main.cpp -o main.o
/Users/asampson/Documents/cu/research/seashell/buildbot/instance/jobs/wllOwUSEbpg/code
$ sh -c 'pwd ; sleep 1' -- sds++ -sds-pf zed -sds-hw float float.cpp -sds-end -clkid 3 -poll-mode 1 -verbose -Wall -O3 '' float.o main.o -o sdsoc
/Users/asampson/Documents/cu/research/seashell/buildbot/instance/jobs/wllOwUSEbpg/code
The correct job name is 26QGyg_2IGg
. The first command runs in the correct job directory; the latter two get the directory for the previous job. No good!
The problem is that our work context manager chdirs to the right directory to process a given job, but the working directory is shared for an entire process. That means that different threads that are in different stages are clobbering one another's working directories; in other words, the logic about working directories that works sequentially for one thread doesn't work in the presence of multiple concurrent threads.
To fix this, I need to explicitly keep track of each thread's desired local directory and use that directly to construct each call that might touch the filesystem.
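For illustration, here is a minimal sketch of that fix in Python, assuming the tasks shell out via subprocess (the `run_in_job` helper and its arguments are invented, not the buildbot's actual API): the per-job directory is passed explicitly to each call via `cwd=` instead of a process-global `chdir`.

```python
import os
import subprocess
import tempfile
import threading

def run_in_job(job_dir, cmd, results, key):
    # cwd= scopes the working directory to this one subprocess call,
    # so concurrent jobs in different threads cannot clobber each
    # other's directories the way a shared os.chdir does.
    out = subprocess.run(cmd, cwd=job_dir, capture_output=True, text=True)
    results[key] = out.stdout.strip()

# Two "jobs" with distinct directories, run concurrently.
job_a, job_b = tempfile.mkdtemp(), tempfile.mkdtemp()
results = {}
threads = [
    threading.Thread(target=run_in_job, args=(job_a, ["pwd"], results, "a")),
    threading.Thread(target=run_in_job, args=(job_b, ["pwd"], results, "b")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each command observes its own job directory regardless of thread interleaving, which is exactly the property the shared-chdir version violates.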
For the current implementation of sized integers, huge numbers are stored in the index types, and with bit widths as high as 32 or so there are overflow errors. It seems a bit silly to store such big numbers, so maybe we could change the TIndex representation in the implementation? We would need to think about how to differentiate static ints.
The following program passes:
func mmul(a: float[9 bank(3)], b: float[9 bank(3)], c: float[9 bank(3)]) {
for (let i = 0..3) unroll 1 {
for (let j = 0..3) unroll 3 {
c[i] := a[i*3+j] + b[i*3+j];
}
}
}
But the following fails with the error [Type error] can't apply operator '+' to idx<0 .. 1, -4611686018427387904 .. 4611686018427387903> and idx<0 .. 3, 0 .. 1>:
func mmul(a: float[9 bank(3)], b: float[9 bank(3)], c: float[9 bank(3)]) {
for (let i = 0..3) unroll 1 {
for (let j = 0..3) unroll 3 {
c[i*3+j] := a[i*3+j] + b[i*3+j];
}
}
}
Shouldn't both give the same error?
This is to keep track of the status of the MachSuite apps implementation. Subsequent posts will track each app.
Some translations:

| App | C | T | R | Notes |
|---|---|---|---|---|
| gemm ncubed | ✅ | Yes | Simple kernel | |
| stencil2d | | | | |
| gemm block | | | | |
| stencil3d | | | | |
| fft | ✅ | Yes | | |
| kmp | ✅ | Yes | | |
| radix sort | ✅ | Yes | | |
| merge sort | | | | |
| spmv | ✅ | Yes | | |
| bfs | ✅ | Yes | | |
| md | ✅ | Yes | | |
| backprop | | | | |
| nw | ✅ | Yes | | |
| aes | | | | |
| viterbi | | | | |
https://github.com/cucapra/seashell/blob/master/examples/vsadd_nrl.sea fails due to a "Bank accessor must be static" error. (Simple examples for explicit access seem to work, e.g. array1d_physical.sea.)
@sa2257 is building a simple harness that will allow us to do the following:
- Keep benchmarks in the benchmarks/ directory.
- Put each benchmark in its own benchmarks/foo.
- (With make) build each benchmark using the HLS flow.

This program, which uses the int annotation (which is not valid, as there's no type definition for int):
func test_view (a : int[8 bank(2)]) {
let v_a = view a [0:2:1];
for (let j = 0..2) unroll 2 {
v_a[j] := 1;
}
}
Produces this:
seac: internal error, uncaught exception:
Seashell__Context.NoBinding("int")
instead of some error indicating that int is not a valid type.
It seems like these are doing effectively the same thing, one maybe a special case of the other: figure out which indices of an array access to consume given some input, and then update the context accordingly.
The pretty printer for binary ops currently adds parens around every binary op expression. The AST that represents this:
1 * 2 + 3 * 4
is printed as:
((1 * 2) + (3 * 4))
All the parens in the above example are unnecessary. Since our parser already implements C/C++ operator precedence, our printer should not emit these parens.
Use this file as a reference implementation to implement this correctly: https://github.com/plasma-umass/Tortoise/blob/master/src/main/scala/PrettyFS.scala
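One standard technique is to thread the parent operator's precedence through the printer and parenthesize a subexpression only when its operator binds looser. A minimal sketch, assuming a toy tuple-based AST rather than the compiler's actual one:

```python
# Precedence-aware printing: parenthesize a subexpression only when its
# operator binds looser than the context requires. The tuple AST here
# is hypothetical, for illustration only.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def pretty(e, parent_prec=0):
    if isinstance(e, int):
        return str(e)
    op, lhs, rhs = e
    p = PREC[op]
    # Left-associative operators: the right child needs strictly
    # higher precedence to avoid parens, hence p + 1.
    s = f"{pretty(lhs, p)} {op} {pretty(rhs, p + 1)}"
    return f"({s})" if p < parent_prec else s

# (1 * 2) + (3 * 4): no parens needed under C precedence.
expr = ("+", ("*", 1, 2), ("*", 3, 4))
print(pretty(expr))  # 1 * 2 + 3 * 4
```

The same scheme keeps necessary parens, e.g. a `+` under a `*` still prints as `(1 + 2) * 3`.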
Without a specific proposal, this issue is to remind us that we will need some manner of pipelining annotation for loops. It's likely we can use a desugaring strategy similar to the new one for unrolling as described in #58, which takes a source loop like this:
for (...) unroll k {
c1
---
c2
}
To:
for (...) {
c1(0); c1(1); ...
---
c2(0); c2(2); ...
}
Where a pipelined version of the original loop would look something like this, where the "duplicated" commands get offset according to the initiation interval:
for (...) {
c1(0)
---
c2(0); c1(1)
---
c2(1)
}
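The offsets can be computed mechanically: if each ---separated command is one pipeline stage and iteration i issues at time i * II (the initiation interval), the commands that share a time slot fall out directly. A toy model of this schedule (assumed scheduling, not the compiler's implementation):

```python
# Compute which command copies run together after pipelining, assuming
# each `---`-separated command is one stage and iteration i starts at
# time i * II. Hypothetical model for illustration.

def pipeline(num_stages, k, ii=1):
    total = (k - 1) * ii + num_stages
    timeline = [[] for _ in range(total)]
    for i in range(k):                 # unrolled iteration copies
        for s in range(num_stages):    # stage s of iteration i
            timeline[i * ii + s].append(f"c{s + 1}({i})")
    return timeline

# Two stages (c1 --- c2), k = 2 copies, II = 1: matches the example.
print(pipeline(2, 2))  # [['c1(0)'], ['c2(0)', 'c1(1)'], ['c2(1)']]
```

Raising the initiation interval spreads the copies out; II equal to the number of stages degenerates to the unpipelined schedule.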
@sampsyo said the following in #46:
It would be nice to make sure that the number fits into a 32-bit size. And even if not, it could be sensible to put the default integer type into a constant somewhere.
I realize this seems like a low-level detail, but this could actually be a deeper question: we might want to let people specify the default numeric type within a scope. Because numeric widths are so important for accelerator efficiency, people often want to decide the numeric type for an entire design or module at once. That should probably affect (most?) literals as well as the buffers they eventually flow into. Seems sort of tricky.
This can be implemented in two ways: (1) Parsing magic comments in fuse files that specify bit lengths, or (2) Making default bit length a CLI option.
Since the long term goal of seashell is to demonstrate its superiority over current HLS implementations, we will need to port and test existing HLS benchmarks (even make our own?).
@sampsyo has suggested looking at the following suites:
We should start collecting the HLS kernels from these and put them in the folder. The type checker people will stare at them really hard and come up with language features and turn them into test cases.
Can we support dumb while loops that don’t support any unrolling?
@sa2257 what does HLS do with while loops?
Tracking various relevant papers and articles.
CircleCI passes with programs that are not type checked. I recall they used to fail. Any clue why this is the case?
This is a common pattern for offset array accesses:
for (int i = 0; i < 8; i++) {
for (int j = i; j < 8; j++) {
a[i][j];
}
}
The type checker will reject this program because the ranges must be static. However, we can rewrite the two loops to:
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
if (j >= i) { a[i][j]; }
}
}
The benefit of the rewrite is that both of the loops can be unrolled. Furthermore, it seems to me that the HLS compiler has to essentially discover this relationship between i and j to compile the hardware efficiently.
(1) Should we force programmers to write programs in the second form?
@sampsyo pointed out that the mux in front of the loop implies a logical time gap (@sampsyo please expand).
If we do choose (1), then this easily extends to while loops too (after #63):
int f = 0;
while(f > 10) {
f++;
for (int i = f; i < 10; i++) {
a[i] *= f
}
}
turns into
int f = 0;
while(f > 10) {
f++;
for (let i = 0..10) {
if (i >= f) { a[i] *= f }
}
}
Concretely, this will allow us to rewrite the loops in fft
(@tedbauer #60):
for(span=FFT_SIZE>>1; span; span>>=1, log++){
for(odd=span; odd<FFT_SIZE; odd++) {
...
}
}
into:
let span = FFT_SIZE>>1;
while(span != 0) {
for(let odd = 0..FFT_SIZE) {
if (odd >= span) {
...
}
}
span = span >> 1;
log = log + 1;
}
The current formulation of views requires that the underlying array have the same number of banks as the width of the dimension of the view.
Examples:
(1) Should fail:
decl a: int[10 bank 10];
view v_a = a[off = 0, w = 5];
(2) Should pass
decl a: int[10 bank 10];
view v_a = a{shrink 5}[off = 0, w = 5]
(3) Should pass
decl a: int[10 bank 10];
view v_a = a{flex 3}[off = 0, w= 3]
Q1. Should the programmer be allowed to freely mix flex and shrink in multidimensional cases? Which examples should pass:
(4)
decl a: int[10 bank 10][10 bank 10]
view v_a = a{shrink 5}{}[off = 0, w = 5][off = 0, w = 10]
(5)
decl a: int[10 bank 10][10 bank 10]
view v_a = a{shrink 5}{shrink 5}[off = 0, w = 5][off = 0, w = 5]
(6)
decl a: int[10 bank 10][10 bank 10]
view v_a = a{shrink 5}{flex 3}[off = 0, w = 5][off = 0, w = 3]
Q2. What are the stepping semantics of shrink and view? Did we decide that the step of a shrink array is always equal to the length and that the steps for flex can be arbitrary?
Q3. Syntax suggestions for shrink, flex, and views?
Reductions are implemented on top of loops. We add the optional seq block to every loop:
let sum = 0;
for (...) unroll k {
v = x[i];
seq:
sum += v;
}
There are two semantic effects of seq blocks:
1. Variables from the non-seq part become length-k vectors inside the seq part. In the above example, the type of v in the non-seq part is int, while in the seq part it is int[k].
2. Inside the seq part, the arithmetic operators are overloaded to be vector operations. In the above example, += is a vector add.

A future design proposal will add custom reduction operators.
Example: A is a 3x3 matrix and B is 3x1. The computation B[i] = A[i][0] + A[i][1] + A[i][2] is written as:
for (let i = 0..3) {
for (let j = 0..3) unroll 3 {
v = a[i][j];
seq:
b[i] += v;
}
}
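To make the proposed semantics concrete, here is an executable model of the example above with the j loop fully unrolled (k = 3): v becomes a length-3 vector in the seq part, and b[i] += v reduces it. This models the proposal only; it is not how the compiler would implement it.

```python
# Executable model of the proposed `seq` semantics for the example:
# with the j-loop fully unrolled (k = 3), `v` is a length-3 vector in
# the seq part, and `b[i] += v` is a reduction over that vector.

k = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [0, 0, 0]

for i in range(3):
    v = [a[i][j] for j in range(k)]  # non-seq part: k parallel copies of v
    b[i] += sum(v)                   # seq part: `+=` reduces the vector

print(b)  # row sums of a
```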
Missing: What happens when I unroll the i loop?
What is the correct restriction on += in the non-seq world?
+= and its variants are defined to be reducers. Reducers can only be used within seq. For example, neither of these programs is correct:
for (...) {
a[i] += b[i]
}
for (...) {
v += b[i]
}
The correct version of the first program is:
for (...) {
v = b[i]
seq:
a[i] += v
}
What if I want two loops feeding into one seq block?
You can either (1) manually fuse the loops' operations, or (2) create two loops in sequence that fill in the values.
Observation: Removing all unrolls and seqs results in a correct program. This feeds into our erasure story.
If it wouldn't be too much trouble to add, being able to write hex literals would be handy.
@sa2257 mentioned (#63 (comment)) that this example:
for (i = 0; i < N*N; i++) {
#pragma HLS unroll factor=N
Minim_Loop: while (A[i] != B[i]) {
if (A[i] > B[i])
A[i] -= B[i];
else
B[i] -= A[i];
}
C[i] = A[i];
}
fails with:
ERROR: [XFORM 203-103] Cannot partition array 'A' (/.../gcd.cpp:3): array access out of bound (/.../gcd.cpp:20:5).
ERROR: [HLS 200-70] Pre-synthesis failed.
where line 3 is the function definition and line 20 is C[i] = A[i].
Sachille will update when he debugs it.
An array access needs to be able to generate sufficient resources to be consumed in an unrolled context:
decl a: bit<10>[3 bank 3][3 bank 3];
for (i ...) unroll 3 {
for (j ...) unroll 3 {
a[i][j] // correct
---
a[0][j]
---
a[j][i] //correct
}
}
A non-unrolled context implicitly unrolls by 1, which means constants are allowed since they generate exactly one copy of the access:
for(i ...) {
a[0] // correct
for (j ...) unroll 3 {
a[0][j] //correct I think?
a[j][0] // correct I think?
a[i][j] // correct I think
a[i][0] // wrong
}
}
Building outputs these warnings:
Warning: 3 states have shift/reduce conflicts.
Warning: 14 shift/reduce conflicts were arbitrarily resolved.
Global tracker for the compiler and the type system:
- Clean up type.ml.
- Use ocaml-ppx for the AST.
- Turn seac into a proper CLI with some help info.

Some of the MachSuite apps require loops that aren't supported in Fuse. AFAIK we only allow looping over ranges. Here are some loops that I think we can't write yet:
- while(a_idx>0 || b_idx>0) { ... } (from nw/nw, a "dynamic programming algorithm for optimal sequence alignment"). (Do we have boolean operators like "and" and "or"?)
- for(span=FFT_SIZE>>1; span; span>>=1, log++){ ... } (from fft/strided, a "recursive formulation of the Fast Fourier Transform").

This test should fail typechecking:
func double(a: int[10], b: int[10]) {
for (let i = 0..9) {
b[i] := a[i] + a[i];
b[i] := a[i] + a[i];
}
}
It doesn't.
EDIT: Fixed in cbe1fcc
A common optimization for the seq representation is to have it take a list of commands instead of a tuple of two. This keeps the AST flat and reduces the chance of stack overflows.
Fuse disallows:
a[2 * i] // i is an iterator
Since * is not defined on index types. However, it will also reject this:
x[0] := 2*i; // i is an iterator
I think the latter should be allowed with the following semantics:
When an index type has an operation performed on it, it gets upcast to a sized int. This also has the nice property of making sure that the first program is still rejected (since access using dynamic ints is not allowed).
The result of the operation is also a dynamic int.
@sampsyo if this sounds good, please LGTM.
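The proposal in miniature, as a hypothetical checker sketch (the type names and helpers here are invented, not the seac implementation): arithmetic involving an index type yields a dynamic int, and dynamic ints may not index a memory, so x[0] := 2*i is accepted while a[2*i] stays rejected.

```python
# Miniature model of the proposed rule: arithmetic on an index type
# upcasts the result to a dynamic int, and dynamic ints cannot be used
# to index a banked memory. Type names are illustrative only.

def binop_type(t1, t2):
    # Any operation touching an index or dynamic int is dynamic.
    if "idx" in (t1, t2) or "dyn_int" in (t1, t2):
        return "dyn_int"
    return "static_int"

def can_index_memory(t):
    return t != "dyn_int"   # dynamic ints cannot access banked arrays

t = binop_type("idx", "static_int")   # the type of `2 * i`
print(t, can_index_memory(t))         # x[0] := 2*i OK; a[2*i] rejected
```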
[comparisons] Should probably only be supported for primitive/numeric types (not memories, for example).
Should we not allow stuff like a[0] == 0? Constant comparisons might be fine, but what would this mean, and what would the hardware look like:
for (...) unroll 3 {
if (a[i] < 10) { a[i] := 1 }
}
(Search might require comparisons on dynamic memory locations)
In the formal semantics, we define unrolls to have the following desugaring:
(1)
for (...) unroll k {
C1
---
C2
}
into:
for (...) {
C1(0); C1(1); ... C1(k - 1);
---
C2(0); C2(1); ... C2(k - 1);
}
The desugar function only does this with top-level ---. So, this:
(2)
for (...) unroll k {
{ C1 --- C2 }
}
turns into:
for (...) {
{ C1(0) --- C2(0) };
{ C1(1) --- C2(1) };
...
{ C1(k - 1) --- C2(k - 1) };
}
Concretely, this means (1) should be accepted and (2) should be rejected. In the current implementation, both pass. We need to come up with a type checking rule at the source level that works correctly.
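A sketch of the intended desugaring, using a toy list-of-commands representation (invented for illustration, not the real AST): top-level commands are each duplicated k times in place, while a nested { C1 --- C2 } group is duplicated as a single unit.

```python
# Desugar `unroll k` as described above: commands separated by a
# top-level `---` are each duplicated k times; a nested `{ C1 --- C2 }`
# group (modeled here as a tuple) is duplicated whole. Toy AST only.

def desugar(body, k):
    """body: list of top-level commands (strings); a nested
    `{ C1 --- C2 }` group is a tuple of commands."""
    out = []
    for cmd in body:
        if isinstance(cmd, tuple):   # nested group: duplicate as a unit
            out.extend([tuple(f"{c}({i})" for c in cmd) for i in range(k)])
        else:                        # top-level command: duplicate in place
            out.append([f"{cmd}({i})" for i in range(k)])
    return out

print(desugar(["C1", "C2"], 2))    # case (1): per-command duplication
print(desugar([("C1", "C2")], 2))  # case (2): whole group duplicated
```

The difference in the two outputs is exactly why (1) should be accepted and (2) rejected: in (2), C1(1) runs after C2(0), so the `---` no longer separates all C1 copies from all C2 copies.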
Some of the MachSuite apps are lengthy, like backprop/backprop, and it would be nice to organize Fuse programs into logical chunks. Would it be possible to have some sort of construct for this, if not functions?
With functions in place, we can start thinking about polymorphism.
- EVar can either be an Id or a Hole. A Hole is a type parameter. def foo(a: int[M bank N]) parses to Func(foo, List(a -> TArray((EVar(M), EVar(N)))).
- Use a trait to create an AST parameterized on the type of Id.
- Allow holes to exist in EAA, CSeq, and CPar.
- Express constraints in a preamble: given M = 24, we want M % 4 = 0.
- An application foo[M, N](a, b) instantiates foo with these concrete values, adding to a set of concrete function instantiations.

Even at a high level, this proposal implies 3-4 days of concerted hacking.
Functions check_for and check_for_impl can be consolidated. The fact that they're different is an artifact of when ints were not represented as special cases of the idx type.
If the user writes an explicit program with inconsistent capabilities, the type checker will not catch the misuse. For example, the following program type checks:
read a[1] as a1;
let x = a1;
a1 = 10;
This is because a1 is not given a linear type and is therefore not consumed by the read.
I keep finding myself writing this pattern:
if (cond) {
...
}
if (!cond) {
...
}
Doesn't seem critical, but it would make my soul feel better if there were an else construct.