cucapra / dahlia
Time-sensitive affine types for predictable hardware generation
Home Page: https://capra.cs.cornell.edu/dahlia
License: MIT License
%{ stmts }%
allows the program to recover the capabilities for arrays and implies that a sequence of statements is executed on different clock cycles. Implementing this for now requires changing the capability checking and the type checker.
This program is valid:
%{ a[i] := c[i] + 1 }%
%{ b[i] := 2 * a[i] }%
This case broke with the re-implementation of index types. See the comment in matadd_1dfullnrl.sea for details.
This program should be rejected since it creates a loop-carried dependency. The sequential and parallel semantics don't match up:
let c = 0;
for (...) unroll 3 {
c := a[i] + c
}
However, this example is fine:
for (...) unroll 3 {
let c = 0;
c := a[i] + c;
}
since c cannot carry dependencies across iterations.
Adrian says:
I think the right restriction is probably something like “variables declared outside a loop can be read or written in a loop body, but not both”
To figure out a solution, consider whether these should FAIL or PASS:
(1)
let c = 0;
for (...) unroll 3 {
c := 1;
a[i] := c;
}
(2)
let c = 0;
for (...) unroll 3 {
a[i] := c
}
(3) Note the missing unroll
let c = 0;
for (...) {
c := a[i] + c;
}
(4) add unroll to (3)
let c = 0;
for (...) unroll 3 {
c := a[i] + c;
}
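Adrian's proposed restriction is mechanical enough to prototype. Below is a hypothetical sketch (the `check_loop` helper and its read/write sets are invented for illustration, not part of the seac implementation) that applies the rule "declared outside, read and written inside: reject" to the examples above:

```python
# Sketch of the proposed restriction: a variable declared outside a
# loop may be read OR written in the body, but not both.
# check_loop and its read/write-set inputs are hypothetical.

def check_loop(outer_vars, reads, writes):
    """Reject if any variable declared outside the loop is both
    read and written inside the loop body."""
    bad = outer_vars & reads & writes
    return len(bad) == 0  # True = passes the restriction

# Examples (1), (2), (4) from above; `c` is declared outside each loop.
ex1 = check_loop({"c"}, reads={"c", "a"}, writes={"c", "a"})  # c read and written
ex2 = check_loop({"c"}, reads={"c"}, writes={"a"})            # c only read
ex4 = check_loop({"c"}, reads={"c", "a"}, writes={"c"})       # loop-carried dep
```

Under this rule, (1) and (4) fail while (2) passes. Note that the rule as stated does not mention unroll, so (3) would be rejected as well even though its sequential execution looks harmless.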
This program compiles:
func madd(a: int[1024 bank(32)], b: int, c: int[1024 bank(32)]) {
for (let i = 0..31) unroll 4 {
c[i] := a{0}[i] + 1;
}
}
It shouldn't.
We have the following type checking rule for e1 --- e2:

G, e1 --> G', t1    G, e2 --> G'', t2    Gf = G' + G''
------------------------------------------------------- where G' + G'' = G''
G, e1 --- e2 --> Gf, t2
The definition of + on G is incorrect for this case:
{
let x = 1;
---
let y = 2;
};
x + y // error: x is unbound
Figure out the right + and implement it.
For the benchmark harness (high priority):
- Store command output in log.txt, instead of in the JSON job info dictionary. As a consequence, running commands could "stream" to the log, letting us see in-progress output before the command is done executing.
- Finish fixing the Dockerfile to work with opam2 so it can actually run on Gorgonzola again. (@sampsyo is working on this.)
- Merge the workproc branch described in https://github.com/cucapra/seashell-gorgonzola/issues/1 to fix reliability issues.
- Make the buildbot actually run things on the Zynq board. (@sa2257 is working on this.)
- Add a no-Seashell mode to just HLSify and execute plain C programs.
- Add an option to run just the Vivado hardware resource estimation instead of doing the full synthesis flow.
- Add the ability to change the status of a job. This is really useful for debugging, when you want to take an old job and "rewind" it to an earlier stage that failed before. (It's currently possible to do this by hand by editing the JSON job data file, but then you have to give the server a kick somehow to pick up the new status.)
cpp or fuse file in the provided archive.

@sa2257 Can you complete the tutorial docs you were writing?
Proposal to bring back functions:
- Add CFuncDef and EApp nodes to represent function definitions and applications.
- Change Prog to be a list of declarations, followed by a list of function definitions, followed by top-level commands. (Possibly remove decls altogether and require a main function to be defined.)
- Type check CFuncDef nodes. Checking EApps is non-trivial since memories used in an EApp expression need to have their banks consumed correctly.
- Extend Emit to spit out function definitions.

Some of the examples in the Xilinx manual allow for this:
void inp(int a[10], int readline) {
int c = a[readline/2];
}
It seems we can conservatively allow for these by making dynamic accesses consume all banks of the array.
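One conservative reading of that rule can be sketched as follows. This is a hypothetical helper, not the actual seac logic, and it assumes interleaved banking where bank = index mod number-of-banks:

```python
# Sketch of the conservative rule: a static index consumes exactly one
# bank; a dynamic index (like `readline/2` above) consumes all banks.
# The representation here is invented for illustration.

def banks_consumed(index, num_banks):
    """Return the set of banks an access consumes, assuming
    interleaved banking (bank = index mod num_banks)."""
    if isinstance(index, int):          # static index: one bank
        return {index % num_banks}
    return set(range(num_banks))        # dynamic index: all banks

print(banks_consumed(5, 4))             # static access
print(banks_consumed("readline/2", 4))  # dynamic access: every bank
```

Consuming every bank keeps the affine accounting sound at the cost of serializing any other access to the same array in that time step.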
Some of the MachSuite apps need bitwise operations: bitwise ands, ors, nots, and shifting. Here are a couple of apps where this appears to be important:
fft/strided
sort/radix
What does this piece of code mean?
decl a: bit<16>[10 bank 2];
for (let i = 0..10) unroll 2 {
let x = a[i];
} combine {
sum += x;
---
a[i]
}
Note: --- is the new syntax for %{ }%.

For some of the MachSuite apps, it would be helpful to have some kind of "and" type, like a struct, that could be declared in the Fuse body (by "Fuse body" I mean not declared with decl, which would get compiled to a function argument). It would be helpful in particular in:
- bfs/queue: need a way to represent a node of a graph.
- md/grid: in the C code, vectors are represented as structs that are declared in the "Fuse body" I referred to above. I thought of just using Fuse arrays to represent the vectors, but they can only be declared with decl.

We're seeing an issue where job tasks (e.g., commands) run in the wrong directory: that is, the right directory for the wrong job. For example, here's the log from a run I manufactured where I configured the "HLS prefix" to just echo the directory name and then sleep, to force concurrent job submissions to run at the same time in different stages:
$ sh -c 'pwd ; sleep 1' -- sds++ -sds-pf zed -sds-hw float float.cpp -sds-end -clkid 3 -poll-mode 1 -verbose -Wall -O3 -c -MMD -MP '-MF"vsadd.d"' float.cpp -o float.o
/Users/asampson/Documents/cu/research/seashell/buildbot/instance/jobs/26QGyg_2IGg/code
$ sh -c 'pwd ; sleep 1' -- sds++ -sds-pf zed -sds-hw float float.cpp -sds-end -clkid 3 -poll-mode 1 -verbose -Wall -O3 -c -MMD -MP '-MF"main.d"' main.cpp -o main.o
/Users/asampson/Documents/cu/research/seashell/buildbot/instance/jobs/wllOwUSEbpg/code
$ sh -c 'pwd ; sleep 1' -- sds++ -sds-pf zed -sds-hw float float.cpp -sds-end -clkid 3 -poll-mode 1 -verbose -Wall -O3 '' float.o main.o -o sdsoc
/Users/asampson/Documents/cu/research/seashell/buildbot/instance/jobs/wllOwUSEbpg/code
The correct job name is 26QGyg_2IGg
. The first command runs in the correct job directory; the latter two get the directory for the previous job. No good!
The problem is that our work context manager chdirs to the right directory to process a given job, but the working directory is shared for an entire process. That means that different threads that are in different stages are clobbering one another's working directories; in other words, the logic about working directories that works sequentially for one thread doesn't work in the presence of multiple concurrent threads.
To fix this, I need to explicitly keep track of each thread's desired local directory and use that directly to construct each call that might touch the filesystem.
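For illustration, here is a minimal sketch of that fix in Python, assuming the tasks shell out via subprocess (the `run_in_job` helper and its arguments are invented, not the buildbot's actual API): the per-job directory is passed explicitly to each call via `cwd=` instead of a process-global `chdir`.

```python
import os
import subprocess
import tempfile
import threading

def run_in_job(job_dir, cmd, results, key):
    # cwd= scopes the working directory to this one subprocess call,
    # so concurrent jobs in different threads cannot clobber each
    # other's directories the way a shared os.chdir does.
    out = subprocess.run(cmd, cwd=job_dir, capture_output=True, text=True)
    results[key] = out.stdout.strip()

# Two "jobs" with distinct directories, run concurrently.
job_a, job_b = tempfile.mkdtemp(), tempfile.mkdtemp()
results = {}
threads = [
    threading.Thread(target=run_in_job, args=(job_a, ["pwd"], results, "a")),
    threading.Thread(target=run_in_job, args=(job_b, ["pwd"], results, "b")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each command observes its own job directory regardless of thread interleaving, which is exactly the property the shared-chdir version violates.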
For the current implementation of sized integers, huge numbers are stored in the index types, and with bit widths as high as 32 or so there are overflow errors. It seems a bit silly to store such big numbers, so maybe we could change the TIndex representation in the implementation? We would need to think about how to differentiate static ints.
The following program passes:
func mmul(a: float[9 bank(3)], b: float[9 bank(3)], c: float[9 bank(3)]) {
for (let i = 0..3) unroll 1 {
for (let j = 0..3) unroll 3 {
c[i] := a[i*3+j] + b[i*3+j];
}
}
}
But the following fails with the error [Type error] can't apply operator '+' to idx<0 .. 1, -4611686018427387904 .. 4611686018427387903> and idx<0 .. 3, 0 .. 1>:
func mmul(a: float[9 bank(3)], b: float[9 bank(3)], c: float[9 bank(3)]) {
for (let i = 0..3) unroll 1 {
for (let j = 0..3) unroll 3 {
c[i*3+j] := a[i*3+j] + b[i*3+j];
}
}
}
Shouldn't both give the same error?
This is to keep track of the status of the MachSuite apps implementation. Subsequent posts will track each app.
Some translations:

| App | C | T | R | Notes |
|---|---|---|---|---|
| gemm ncubed | ✅ | Yes | Simple kernel | |
| stencil2d | | | | |
| gemm block | | | | |
| stencil3d | | | | |
| fft | ✅ | Yes | | |
| kmp | ✅ | Yes | | |
| radix sort | ✅ | Yes | | |
| merge sort | | | | |
| spmv | ✅ | Yes | | |
| bfs | ✅ | Yes | | |
| md | ✅ | Yes | | |
| backprop | | | | |
| nw | ✅ | Yes | | |
| aes | | | | |
| viterbi | | | | |
https://github.com/cucapra/seashell/blob/master/examples/vsadd_nrl.sea fails due to a "Bank accessor must be static" error. (Simple examples for explicit access seem to work, e.g. array1d_physical.sea.)
@sa2257 is building a simple harness that will allow us to do the following:
- Keep benchmarks in the benchmarks/ directory.
- Put each benchmark in its own benchmarks/foo.
- (With make) build each benchmark using the HLS flow.

This program, which uses the int annotation (which is not valid, as there's no type definition for int):
func test_view (a : int[8 bank(2)]) {
let v_a = view a [0:2:1];
for (let j = 0..2) unroll 2 {
v_a[j] := 1;
}
}
Produces this:
seac: internal error, uncaught exception:
Seashell__Context.NoBinding("int")
instead of some error indicating that int is not a valid type.
It seems like these are doing effectively the same thing, one maybe a special case of the other: figure out which indices of an array access to consume given some input, and then update the context accordingly.
The pretty printer for binary ops currently adds parens around every binary op expression. The AST that represents this:
1 * 2 + 3 * 4
is printed as:
((1 * 2) + (3 * 4))
All the parens in the above example are unnecessary. Since our parser already implements C/C++ operator precedence, our printer should not emit these parens.
Use this file as a reference implementation to implement this correctly: https://github.com/plasma-umass/Tortoise/blob/master/src/main/scala/PrettyFS.scala
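One standard technique is to thread the parent operator's precedence through the printer and parenthesize a subexpression only when its operator binds looser. A minimal sketch, assuming a toy tuple-based AST rather than the compiler's actual one:

```python
# Precedence-aware printing: parenthesize a subexpression only when its
# operator binds looser than the context requires. The tuple AST here
# is hypothetical, for illustration only.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def pretty(e, parent_prec=0):
    if isinstance(e, int):
        return str(e)
    op, lhs, rhs = e
    p = PREC[op]
    # Left-associative operators: the right child needs strictly
    # higher precedence to avoid parens, hence p + 1.
    s = f"{pretty(lhs, p)} {op} {pretty(rhs, p + 1)}"
    return f"({s})" if p < parent_prec else s

# (1 * 2) + (3 * 4): no parens needed under C precedence.
expr = ("+", ("*", 1, 2), ("*", 3, 4))
print(pretty(expr))  # 1 * 2 + 3 * 4
```

The same scheme keeps necessary parens, e.g. a `+` under a `*` still prints as `(1 + 2) * 3`.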
Without a specific proposal, this issue is to remind us that we will need some manner of pipelining annotation for loops. It's likely we can use a desugaring strategy similar to the new one for unrolling as described in #58, which takes a source loop like this:
for (...) unroll k {
c1
---
c2
}
To:
for (...) {
c1(0); c1(1); ...
---
c2(0); c2(2); ...
}
Where a pipelined version of the original loop would look something like this, where the "duplicated" commands get offset according to the initiation interval:
for (...) {
c1(0)
---
c2(0); c1(1)
---
c2(1)
}
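The offsets can be computed mechanically: if each ---separated command is one pipeline stage and iteration i issues at time i * II (the initiation interval), the commands that share a time slot fall out directly. A toy model of this schedule (assumed scheduling, not the compiler's implementation):

```python
# Compute which command copies run together after pipelining, assuming
# each `---`-separated command is one stage and iteration i starts at
# time i * II. Hypothetical model for illustration.

def pipeline(num_stages, k, ii=1):
    total = (k - 1) * ii + num_stages
    timeline = [[] for _ in range(total)]
    for i in range(k):                 # unrolled iteration copies
        for s in range(num_stages):    # stage s of iteration i
            timeline[i * ii + s].append(f"c{s + 1}({i})")
    return timeline

# Two stages (c1 --- c2), k = 2 copies, II = 1: matches the example.
print(pipeline(2, 2))  # [['c1(0)'], ['c2(0)', 'c1(1)'], ['c2(1)']]
```

Raising the initiation interval spreads the copies out; II equal to the number of stages degenerates to the unpipelined schedule.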
@sampsyo said the following in #46:
It would be nice to make sure that the number fits into a 32-bit size. And even if not, it could be sensible to put the default integer type into a constant somewhere.
I realize this seems like a low-level detail, but this could actually be a deeper question: we might want to let people specify the default numeric type within a scope. Because numeric widths are so important for accelerator efficiency, people often want to decide the numeric type for an entire design or module at once. That should probably affect (most?) literals as well as the buffers they eventually flow into. Seems sort of tricky.
This can be implemented in two ways: (1) Parsing magic comments in fuse files that specify bit lengths, or (2) Making default bit length a CLI option.
Since the long term goal of seashell is to demonstrate its superiority over current HLS implementations, we will need to port and test existing HLS benchmarks (even make our own?).
@sampsyo has suggested looking at the following suites:
We should start collecting the HLS kernels from these and put them in the folder. The type checker people will stare at them really hard and come up with language features and turn them into test cases.
Can we support dumb while loops that don’t support any unrolling?
@sa2257 what does HLS do with while loops?
Tracking various relevant papers and articles.
CircleCI passes with programs that are not type checked. I recall they used to fail. Any clue why this is the case?
This is a common pattern for offset array accesses:
for (int i = 0; i < 8; i++) {
for (int j = i; j < 8; j++) {
a[i][j];
}
}
The type checker will reject this program because the ranges must be static. However, we can rewrite the two loops to:
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
if (j >= i) { a[i][j]; }
}
}
The benefit of the rewrite is that both of the loops can be unrolled. Furthermore, it seems to me that the HLS compiler has to essentially discover this relationship between i and j to compile the hardware efficiently.
(1) Should we force programmers to write programs in the second form?
@sampsyo pointed out that the mux in front of the loop implies a logical time gap (@sampsyo please expand).
If we do choose (1), then this easily extends to while loops too (after #63):
int f = 0;
while(f > 10) {
f++;
for (int i = f; i < 10; i++) {
a[i] *= f
}
}
turns into
int f = 0;
while(f > 10) {
f++;
for (let i = 0..10) {
if (i >= f) { a[i] *= f }
}
}
Concretely, this will allow us to rewrite the loops in fft
(@tedbauer #60):
for(span=FFT_SIZE>>1; span; span>>=1, log++){
for(odd=span; odd<FFT_SIZE; odd++) {
...
}
}
into:
let span = FFT_SIZE>>1;
while(span != 0) {
for(let odd = 0..FFT_SIZE) {
if (odd >= span) {
...
}
}
span = span >> 1;
log = log + 1;
}
The current formulation of views requires that the underlying array have the same number of banks as the width of the dimension of the view.
Examples:
(1) Should fail:
decl a: int[10 bank 10];
view v_a = a[off = 0, w = 5];
(2) Should pass
decl a: int[10 bank 10];
view v_a = a{shrink 5}[off = 0, w = 5]
(3) Should pass
decl a: int[10 bank 10];
view v_a = a{flex 3}[off = 0, w= 3]
Q1. Should the programmer be allowed to freely mix flex and shrink in multidimensional cases? Which examples should pass:
(4)
decl a: int[10 bank 10][10 bank 10]
view v_a = a{shrink 5}{}[off = 0, w = 5][off = 0, w = 10]
(5)
decl a: int[10 bank 10][10 bank 10]
view v_a = a{shrink 5}{shrink 5}[off = 0, w = 5][off = 0, w = 5]
(6)
decl a: int[10 bank 10][10 bank 10]
view v_a = a{shrink 5}{flex 3}[off = 0, w = 5][off = 0, w = 3]
Q2. What are the stepping semantics of shrink and view? Did we decide that the step of a shrink array is always equal to the length and that the steps for flex can be arbitrary?
Q3. Syntax suggestions for shrink, flex, and views?
Reductions are implemented on top of loops. We add the optional seq block to every loop:
let sum = 0;
for (...) unroll k {
v = x[i];
seq:
sum += v;
}
There are two semantic effects of seq blocks:
1. Variables from the non-seq part become length-k vectors inside the seq part. In the above example, the type of v in the non-seq part is int, while in the seq part it is int[k].
2. Inside the seq part, the arithmetic operators are overloaded to be vector operations. In the above example, += is a vector add.

A future design proposal will add custom reduction operators.
Example: A is a 3x3 matrix and B is 3x1. The computation B[i] = A[i][0] + A[i][1] + A[i][2] is written as:
for (let i = 0..3) {
for (let j = 0..3) unroll 3 {
v = a[i][j];
seq:
b[i] += v;
}
}
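To make the proposed semantics concrete, here is an executable model of the example above with the j loop fully unrolled (k = 3): v becomes a length-3 vector in the seq part, and b[i] += v reduces it. This models the proposal only; it is not how the compiler would implement it.

```python
# Executable model of the proposed `seq` semantics for the example:
# with the j-loop fully unrolled (k = 3), `v` is a length-3 vector in
# the seq part, and `b[i] += v` is a reduction over that vector.

k = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [0, 0, 0]

for i in range(3):
    v = [a[i][j] for j in range(k)]  # non-seq part: k parallel copies of v
    b[i] += sum(v)                   # seq part: `+=` reduces the vector

print(b)  # row sums of a
```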
Missing: What happens when I unroll the i loop?
What is the correct restriction on += in the non-seq world?
+= and its variants are defined to be reducers. Reducers can only be used within seq. For example, neither of these programs is correct:
for (...) {
a[i] += b[i]
}
for (...) {
v += b[i]
}
The correct version of the first program is:
for (...) {
v = b[i]
seq:
a[i] += v
}
What if I want two loops feeding into one seq block?
You can either (1) manually fuse the loops' operations, or (2) create two loops in sequence that fill in the values.
Observation: Removing all unrolls and seqs results in a correct program. This feeds into our erasure story.
If it wouldn't be too much trouble to add, being able to write hex literals would be handy.
@sa2257 mentioned (#63 (comment)) that this example:
for (i = 0; i < N*N; i++) {
#pragma HLS unroll factor=N
Minim_Loop: while (A[i] != B[i]) {
if (A[i] > B[i])
A[i] -= B[i];
else
B[i] -= A[i];
}
C[i] = A[i];
}
fails with:
ERROR: [XFORM 203-103] Cannot partition array 'A' (/.../gcd.cpp:3): array access out of bound (/.../gcd.cpp:20:5).
ERROR: [HLS 200-70] Pre-synthesis failed.
where line 3 is the function definition and line 20 is C[i] = A[i].
Sachille will update when he debugs it.
An array access needs to be able to generate sufficient resources to be consumed in an unrolled context:
decl a: bit<10>[3 bank 3][3 bank 3];
for (i ...) unroll 3 {
for (j ...) unroll 3 {
a[i][j] // correct
---
a[0][j]
---
a[j][i] //correct
}
}
A non-unrolled context implicitly unrolls by 1, which means constants are allowed since they generate exactly one copy of the access:
for(i ...) {
a[0] // correct
for (j ...) unroll 3 {
a[0][j] //correct I think?
a[j][0] // correct I think?
a[i][j] // correct I think
a[i][0] // wrong
}
}
Building outputs these warnings:
Warning: 3 states have shift/reduce conflicts.
Warning: 14 shift/reduce conflicts were arbitrarily resolved.
Global tracker for the compiler and the type system:
- Clean up type.ml.
- Use ocaml-ppx for the AST.
- Turn seac into a proper CLI with some help info.

Some of the MachSuite apps require loops that aren't supported in Fuse. AFAIK we only allow looping over ranges. Here are some loops that I think we can't write yet:
- while(a_idx>0 || b_idx>0) { ... } (from nw/nw, a "dynamic programming algorithm for optimal sequence alignment"). (Do we have boolean operators like "and" and "or"?)
- for(span=FFT_SIZE>>1; span; span>>=1, log++){ ... } (from fft/strided, a "recursive formulation of the Fast Fourier Transform").

This test should fail typechecking:
func double(a: int[10], b: int[10]) {
for (let i = 0..9) {
b[i] := a[i] + a[i];
b[i] := a[i] + a[i];
}
}
It doesn't.
EDIT: Fixed in cbe1fcc
A common optimization for the seq representation is to have it take a list of commands instead of a tuple of two. This keeps the AST flat and reduces the chance of stack overflows.
Fuse disallows:
a[2 * i] // i is an iterator
Since * is not defined on index types. However, it will also reject this:
x[0] := 2*i; // i is an iterator
I think the latter should be allowed with the following semantics:
When an index type has an operation performed on it, it gets upcast to a sized int. This also has the nice property of making sure that the first program is still rejected (since access using dynamic ints is not allowed).
The result of the operation is also a dynamic int.
@sampsyo if this sounds good, please LGTM.
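The proposal in miniature, as a hypothetical checker sketch (the type names and helpers here are invented, not the seac implementation): arithmetic involving an index type yields a dynamic int, and dynamic ints may not index a memory, so x[0] := 2*i is accepted while a[2*i] stays rejected.

```python
# Miniature model of the proposed rule: arithmetic on an index type
# upcasts the result to a dynamic int, and dynamic ints cannot be used
# to index a banked memory. Type names are illustrative only.

def binop_type(t1, t2):
    # Any operation touching an index or dynamic int is dynamic.
    if "idx" in (t1, t2) or "dyn_int" in (t1, t2):
        return "dyn_int"
    return "static_int"

def can_index_memory(t):
    return t != "dyn_int"   # dynamic ints cannot access banked arrays

t = binop_type("idx", "static_int")   # the type of `2 * i`
print(t, can_index_memory(t))         # x[0] := 2*i OK; a[2*i] rejected
```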
[comparisons] Should probably only be supported for primitive/numeric types (not memories, for example).
Should we not allow stuff like a[0] == 0? Constant comparisons might be fine, but what would this mean, and what would the hardware look like:
for (...) unroll 3 {
if (a[i] < 10) { a[i] := 1 }
}
(Search might require comparisons on dynamic memory locations)
In the formal semantics, we define unrolls to have the following desugaring:
(1)
for (...) unroll k {
C1
---
C2
}
into:
for (...) {
C1(0); C1(1); ... C1(k - 1);
---
C2(0); C2(1); ... C2(k - 1);
}
The desugar function only does this with top-level ---. So, this:
(2)
for (...) unroll k {
{ C1 --- C2 }
}
turns into:
for (...) {
{ C1(0) --- C2(0) };
{ C1(1) --- C2(1) };
...
{ C1(k - 1) --- C2(k - 1) };
}
Concretely, this means (1) should be accepted and (2) should be rejected. In the current implementation, both pass. We need to come up with a type checking rule at the source level that works correctly.
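A sketch of the intended desugaring, using a toy list-of-commands representation (invented for illustration, not the real AST): top-level commands are each duplicated k times in place, while a nested { C1 --- C2 } group is duplicated as a single unit.

```python
# Desugar `unroll k` as described above: commands separated by a
# top-level `---` are each duplicated k times; a nested `{ C1 --- C2 }`
# group (modeled here as a tuple) is duplicated whole. Toy AST only.

def desugar(body, k):
    """body: list of top-level commands (strings); a nested
    `{ C1 --- C2 }` group is a tuple of commands."""
    out = []
    for cmd in body:
        if isinstance(cmd, tuple):   # nested group: duplicate as a unit
            out.extend([tuple(f"{c}({i})" for c in cmd) for i in range(k)])
        else:                        # top-level command: duplicate in place
            out.append([f"{cmd}({i})" for i in range(k)])
    return out

print(desugar(["C1", "C2"], 2))    # case (1): per-command duplication
print(desugar([("C1", "C2")], 2))  # case (2): whole group duplicated
```

The difference in the two outputs is exactly why (1) should be accepted and (2) rejected: in (2), C1(1) runs after C2(0), so the `---` no longer separates all C1 copies from all C2 copies.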
Some of the MachSuite apps are lengthy, like backprop/backprop, and it would be nice to organize Fuse programs into logical chunks. Would it be possible to have some sort of construct for this, if not functions?
With functions in place, we can start thinking about polymorphism.
- EVar can either be an Id or a Hole. A Hole is a type parameter. def foo(a: int[M bank N]) parses to Func(foo, List(a -> TArray((EVar(M), EVar(N)))).
- Use a trait to create an AST parameterized on the type of Id.
- Allow holes to exist in EAA, CSeq, and CPar.
- Express constraints in a preamble: given M = 24, we want M % 4 = 0.
- An application foo[M, N](a, b) instantiates foo with these concrete values, adding to a set of concrete function instantiations.

Even at a high level, this proposal implies 3-4 days of concerted hacking.
Functions check_for and check_for_impl can be consolidated. The fact that they're different is an artifact of when ints were not represented as special cases of the idx type.
If the user writes an explicit program with inconsistent capabilities, the type checker will not catch the misuse. For example, the following program type checks:
read a[1] as a1;
let x = a1;
a1 = 10;
This is because a1 is not given a linear type and is therefore not consumed by the read.
I keep finding myself writing this pattern:
if (cond) {
...
}
if (!cond) {
...
}
Doesn't seem critical, but it would make my soul feel better if there were an else construct.