Glenside

Check out the web demo!

Glenside is a pure, low-level tensor program representation which enables tensor program optimization via program rewriting, using rewriting frameworks such as the egg equality saturation library. If you are interested in transforming and optimizing tensor kernels (e.g. fusing kernels, exploring data layouts, or mapping to custom hardware), then Glenside is of interest to you! See the web demo for concrete examples. See our MAPS 2021 paper to understand why Glenside exists and how it can be used. Finally, see the docs for technical documentation.
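
For a flavor of the workflow, here is a minimal sketch that parses a Glenside-style expression and runs equality saturation with egg. It uses egg's generic SymbolLang rather than Glenside's actual language definition, and the collapse-access rule is made up purely to show the mechanics; it is not one of Glenside's rewrites.

use egg::{rewrite, AstSize, Extractor, RecExpr, Rewrite, Runner, SymbolLang};

fn main() {
    // A Glenside-style expression: sum-reduce tensor `a` over its item dimensions.
    let expr: RecExpr<SymbolLang> = "(compute reduce-sum (access (access-tensor a) 1))"
        .parse()
        .unwrap();

    // A single illustrative rewrite rule (hypothetical, not from Glenside).
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![rewrite!(
        "collapse-access";
        "(access (access ?t ?d) ?d)" => "(access ?t ?d)"
    )];

    // Run equality saturation, then extract the smallest equivalent expression.
    let runner = Runner::default().with_expr(&expr).run(&rules);
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (_cost, best) = extractor.find_best(runner.roots[0]);
    println!("{}", best);
}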

Quickstart

The fastest way to verify that this code actually does something is to build the Docker image and run the tests:

git clone <this repo>
cd <this repo>
docker build --tag glenside .
docker run -it glenside cargo test

...and "soon" I will add interactive web demos and pretty visualizations!

Dependencies

Glenside optionally depends on TVM and CPLEX. To disable these optional dependencies, use the --no-default-features flag with cargo, e.g. cargo test --no-default-features.

CPLEX

Glenside uses the CPLEX ILP solver. It is no longer used in Glenside's core and should eventually be removed or cordoned off, but for now you need CPLEX to get Glenside fully working. To set up CPLEX, follow these steps:

  1. Get access to CPLEX. Students and academics can do so by making an account through their academic program. Download and install CPLEX on your machine.
  2. Set environment variables. Set $CPLEX_LIB to the location of the newly-installed CPLEX library on your machine. For me on OSX, it resides in /Applications/CPLEX_Studio1210/cplex/lib/x86-64_osx/static_pic/.

Publications

See our MAPS 2021 paper, linked above.

Issues

Remove old constructs

Before removing these constructs, take their documentation and use it to improve the corresponding newer constructs.

  • BsgSystolicArray
  • MoveAxis
  • CartesianProduct
  • MapDotProduct
  • Slice
  • Concatenate
  • ElementwiseAdd

Remove `MyAnalysisData::Legacy`

It's basically only used for Usize now. Should be replaced with Usize.

  • Remove MyAnalysisDataLegacyData struct
  • Change the implementation of get_usize()
  • Remove implementation of get_shape()

Tensorize Mobilenet to automatically-blocking systolic arrays

This is a shorter path to victory that I honestly think will perhaps be good enough for the paper by itself. I'm starting to get a little defensive about it, knowing that I want Glenside to work a certain way (i.e. wanting Glenside to do the blocking), but I can't get too broken up about it.

Using/instantiating accumulators via Glenside

When Glenside blocks up a computation, it sometimes needs to do accumulation explicitly (if you split along the dimension of the vectors being dot-prodded). This is what the accumulators in Scott's systolic arrays are for, and there are commands to do accumulation in pieces, so I think this is possible; however, Scott's current high-level C API doesn't have a command for this. The only way to take advantage of the accumulators is to let Scott's tools do the blocking.

Tasks

  • Confirm that this is currently not possible with Scott's API
  • See whether we could do lower-level codegen to activate the accumulators
  • Confirm that representing this kind of...stateful? computation is possible in Glenside
    • I'm not sure stateful is the right word, but there is some notion that certain data needs to reside in the right place in memory at a specific time for this to work

Speed up Relay-to-Glenside tests

The Relay-to-Glenside tests do a number of random samples, comparing the output of the Relay program and the output of the Glenside program to see if they match given random inputs. We can just do one random sample if we feel like multiple samples aren't useful.

Speeding up the interpreter should also speed up these tests.

Glenside should find a blocking strategy for Mobilenet convolutions

This is mostly an issue of egg search performance and the fact that Glenside is blowing up the egraph with rewrites. See the comment thread for the debugging chain.

  • Add (failing) test for just one convolutional layer of Mobilenet
  • Debug why it's not tensorizing
  • Add needed rewrites (unsure whether this is actually the issue)

Clean up codegen tests using Rust macros

The codegen tests are super bloated, though their structure is really repetitive. There's probably a clever way to write them using macros. And then we might write more of them, because they wouldn't be so hard to write!
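
A rough sketch of what such a macro could look like; codegen here is a hypothetical stand-in for the real codegen entry point, and the assertion is purely illustrative:

// Hypothetical stand-in for the real codegen entry point, just so the
// sketch is self-contained.
fn codegen(program: &str) -> String {
    format!("// generated from: {}", program)
}

// Each invocation expands to a full #[test] function, so adding a new
// codegen test is a few lines instead of a copy-pasted block.
macro_rules! codegen_test {
    ($name:ident, $program:expr, $expected_substring:expr) => {
        #[test]
        fn $name() {
            let c_source = codegen($program);
            assert!(c_source.contains($expected_substring));
        }
    };
}

codegen_test!(
    reduce_sum_codegen,
    "(compute reduce-sum (access (access-tensor a) 1))",
    "generated from"
);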

Remove unnecessary language constructs

There are at least a few language constructs that do roughly the same things.

  • shape vs raw list: We can probably just remove shape.
  • shape-of probably isn't needed for now, as all of our workloads are statically sized

Memoize interpreter

Use a more sane interpretation strategy -- use memoization or something, rather than naive recursion.
I think (I don't actually remember) the interpreter is totally naive and doesn't memoize at all. Thus, when subexpressions appear multiple times, we completely recompute them each time.
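
A minimal memoization sketch, not Glenside's actual interpreter; the Expr type, the arena representation, and eval are all hypothetical, but they show how a HashMap keyed by node index lets a shared subexpression be evaluated only once:

use std::collections::HashMap;

// Hypothetical expression type; nodes live in an arena and refer to their
// children by index, so shared subexpressions are literally the same node.
enum Expr {
    Const(i64),
    Add(usize, usize),
}

fn eval(arena: &[Expr], node: usize, memo: &mut HashMap<usize, i64>) -> i64 {
    if let Some(&v) = memo.get(&node) {
        return v; // already computed: reuse instead of recursing again
    }
    let v = match arena[node] {
        Expr::Const(c) => c,
        Expr::Add(a, b) => eval(arena, a, memo) + eval(arena, b, memo),
    };
    memo.insert(node, v);
    v
}

fn main() {
    // (x + x) + (x + x) with x = 21: the shared subexpressions are evaluated once each.
    let arena = vec![Expr::Const(21), Expr::Add(0, 0), Expr::Add(1, 1)];
    let mut memo = HashMap::new();
    assert_eq!(eval(&arena, 2, &mut memo), 84);
}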

Interpreter benchmarks are measuring too much

The benchmarks in the interpreter code are measuring the setup code and the result-verifying code as well. It works for now, but in the future, we might want to resolve that. It looks like it could be annoying to resolve, however. The easiest solution might be to have the macro create both a #[test] and a #[bench] separately, where the test tests functionality, and the benchmark is just set up for timing.
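
A rough sketch of that split, assuming a nightly toolchain (for #[bench]) and a hypothetical run_interpreter entry point; the real macro would thread through whatever setup the current tests do:

#![feature(test)] // nightly-only, needed for #[bench]
extern crate test;

// Hypothetical stand-in for the real interpreter entry point.
fn run_interpreter(input: &[i64]) -> i64 {
    input.iter().sum()
}

// One invocation produces both a functional test and a benchmark, so the
// benchmark times only the interpreter call, not setup or result checking.
macro_rules! interp_test_and_bench {
    ($test_name:ident, $bench_name:ident, $input:expr, $expected:expr) => {
        #[test]
        fn $test_name() {
            // Setup and verification happen here, in the test...
            assert_eq!(run_interpreter(&$input), $expected);
        }

        #[bench]
        fn $bench_name(b: &mut test::Bencher) {
            // ...while the benchmark only times the run itself.
            let input = $input;
            b.iter(|| run_interpreter(&input));
        }
    };
}

interp_test_and_bench!(sum_works, bench_sum, vec![1, 2, 3], 6);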

Add new atom (with rewrites and codegen): systolic array with externally (non-Glenside) managed blocking

Scott's tools can block up computations just like Glenside can. He wants us to add a new atom that will let him do the blocking himself.

Tasks:

  • Add atom: systolic-array-automatic-blocking
    • naming isn't great
  • Add rewrite: need to figure out exactly what can be blocked up by Scott's scripts
  • Interpret: we're not actually implementing the systolic arrays in the interpreter (they're not needed yet... and at that point, the interpreter would become, like, an emulator)
  • Add codegen: call Scott's API
  • Add demo

Rethink representation for binary computations

I'm not sure the representation for binary computations is as clean as it could be at the moment. For example, computing elementwise add between two tensors is done as (compute elementwise-add (access-pair (access (access-tensor a) 0) (access (access-tensor b) 0))); however, there's no reason they can't be accessed at some other (matching) dimension. The access-pair conveys that they should be "paired" elementwise. The compute then looks for the "tuple" or "pair" dimension and reduces that dimension via an add. It feels very ad-hoc.

Interestingly, as I'm writing this, I realize that elementwise-add and reduce-sum computations have very similar semantics; elementwise-add reduces along a single dimension in the item shape, though, while reduce-sum reduces along all dimensions in the item shape.

Glenside "named tensors" implementation (named dimensions, algebra of dimensions, rank polymorphism, layout agnostic tensors, and many other names/ideas)

Named tensors are a great idea for making deep learning programming easier and safer at compile time and runtime. The basic concept is this: tensors are n-dimensional arrays, and so traditionally, their dimensions have been indexed using numbers. As deep learning has evolved, though, common tensor layouts have arisen, e.g. NCHW for activation tensors, where N is the batch dimension, C is the channels dimension, and H/W are height and width. Using named tensors for activation tensors makes sense, as it

  • Makes it clear what each dimension is for
  • Removes the notion of "layout" (in-memory ordering of dimensions) from the tensor

I would like to explore some related ideas in Glenside. Currently, Glenside represents tensors as "access patterns", which represent dimensions in an ordered list (just as a standard tensor representation would). However, access patterns add the concept of some dimensions at the end being "accessed" as the item dimensions, creating a tensor which iterates over tensors. For example, a tensor a of shape (1, 3, 32, 32) accessed at dimension 1 becomes an access pattern with shape (1) and "item shape" (3, 32, 32): conceptually, a tensor of shape (1) whose items are tensors of shape (3, 32, 32).

Access patterns serve a number of purposes:

  • They convey information to operators, telling an operator which dimension to operate over. For example, (compute reduce-sum (access (access-tensor a) 1)) will reduce all of the item dimensions via summing, producing a tensor of shape (1).
  • They (implicitly) talk about how memory is logically and literally accessed (hence the name access pattern). The dimensions are ordered, with the implicit assumption that they match with the underlying memory layout. Thus, the access pattern describes the literal way we access memory, while also putting a logical layer on top of that, indicating in a very basic way which dimensions are the "items" we're accessing, and which dimensions are not.

Instead of high-level Glenside operators operating over access patterns with ordered dimensions, I would like to try having them operate over some tensor construct with named dimensions which convey the same information. When it comes to telling a compute operator which dimensions to compute over, the access pattern conveys this by grouping the relevant dimensions together in the "item" dimensions group. The actual order of the dimensions is totally irrelevant in this context. Thus, if we instead could mark the dimensions we want to compute over by name, then the underlying tensor representation wouldn't need its dimensions to have any order. For example (compute reduce-sum (dims h w) (tensor (dims n c h w) a)) conveys that we want to compute over dimensions h and w in a tensor with dimensions n, c, h, and w.
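
A rough sketch of the data structure this proposal suggests; nothing here is an actual Glenside design (NamedShape and its reduce method are made up, and only shapes are modeled), but it shows how an unordered, name-keyed shape plus a set of dimension names to reduce over carries the same information that the ordered item-dimension grouping does today:

use std::collections::HashMap;

// Hypothetical named shape: dimension name -> size, with no ordering.
#[derive(Debug, Clone)]
struct NamedShape {
    dims: HashMap<String, usize>,
}

impl NamedShape {
    // Models (compute reduce-sum (dims h w) (tensor (dims n c h w) a)):
    // reducing over `h` and `w` leaves a shape containing only `n` and `c`.
    fn reduce(&self, over: &[&str]) -> NamedShape {
        let dims = self
            .dims
            .iter()
            .filter(|(name, _)| !over.contains(&name.as_str()))
            .map(|(name, &size)| (name.clone(), size))
            .collect();
        NamedShape { dims }
    }
}

fn main() {
    let shape = NamedShape {
        dims: [("n", 1), ("c", 3), ("h", 32), ("w", 32)]
            .into_iter()
            .map(|(k, v)| (k.to_string(), v))
            .collect(),
    };
    let reduced = shape.reduce(&["h", "w"]);
    assert_eq!(reduced.dims.len(), 2); // only `n` and `c` remain
}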

At the same time, we would need to "add back" the idea of layout, or dimension ordering. This is good, though, as we can now explicitly design layout semantics for Glenside, where they are currently entirely implicit. Describing hardware atoms relies heavily on the implicit layout semantics of access patterns.

Implementing batch norms in hardware

cc @stdavids

I know I'm a bit behind on this, but we're finally ready to start looking at implementing batch norms from the Glenside side of things. We can talk in hackathon about it. From the Glenside perspective, this involves:

  • Whatever type of hardware block the batch norm will run on, need to create a representation for it in Glenside
  • Implement rewrites to simplify/fuse the computes, if we can and want to
  • Generate code for the batch norm operators

The interesting thing about batch norms in our workloads is that they aren't one "batch norm" operator, but instead, they're a chain of simpler primitive operators implementing the linear transformation that batch norm represents at inference time. This is because I've run the SimplifyInference Relay pass over the workloads before importing them from Relay. This is a habitual thing to do when working with workloads in Relay, and so that's why I did it; however, it might be easier for Glenside to take in opaque, un-simplified batch norm operators. It could then implement its own equivalent of SimplifyInference and create a "batch norm inference" node. Currently, this isn't possible, as the batch norm nodes in Relay have been blown up into smaller operators. I'm not sure this will be necessary, though -- I'm hoping we can just enable Glenside to fold the computation back together into some efficient vectorized format.
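
For reference, the linear transformation that batch norm represents at inference time folds into a per-channel multiply-add, which is the chain of primitive ops SimplifyInference leaves behind. A minimal sketch using ndarray on a single channel (the function name and signature are hypothetical, not Glenside or Relay code):

use ndarray::Array1;

// At inference time, batch norm is an affine transform:
//   y = gamma * (x - mean) / sqrt(var + eps) + beta
//     = scale * x + shift
// where scale = gamma / sqrt(var + eps) and shift = beta - scale * mean.
fn batch_norm_inference(
    x: &Array1<f32>, // one channel's activations
    gamma: f32,
    beta: f32,
    mean: f32,
    var: f32,
    eps: f32,
) -> Array1<f32> {
    let scale = gamma / (var + eps).sqrt();
    let shift = beta - scale * mean;
    x.mapv(|v| scale * v + shift)
}

fn main() {
    let x = Array1::from(vec![1.0_f32, 2.0, 3.0]);
    let y = batch_norm_inference(&x, 1.0, 0.0, 2.0, 1.0, 1e-5);
    println!("{}", y);
}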

Tighter TVM Integration

I'd like to explore how Glenside and TVM can be more closely integrated. Some ideas:

  • Represent portions of Glenside programs using TVM/Relay
    • If we just want to explore tensorization of a portion of a Glenside program, it would be neat if we could represent the rest of the program (the "software" portion) with TVM/Relay, hopefully with the goal of compiling that portion much more quickly when needed.
  • Explore how Bring Your Own Codegen can be used with Glenside to potentially generate a compiler for Glenside-created hardware
  • Glenside -> Relay/TVM compiler? We've done it the other way around. Not sure if this would be useful.

Associative, commutative, etc "traits" for access patterns

A lot of Glenside's rewrites focus on implementing simple things, e.g. "bubbling" an access pattern through another access pattern (i.e. using composition commutativity to reorder the access patterns). I would love it if we could implement a way to automatically derive these types of rewrites. At first, we could use unproven claims that access patterns have specific traits (e.g. that one commutes with another access pattern). Perhaps in the future, we can check these claims, or even automatically infer them.
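
A rough sketch of what deriving such rewrites from declared (unproven) trait claims could look like, again using egg's generic SymbolLang; op-a and op-b are placeholder unary operators, not Glenside's actual access-pattern operators:

use egg::{Pattern, Rewrite, SymbolLang};

// Given a claim that unary operator `a` commutes with unary operator `b`,
// generate the corresponding "bubbling" rewrite instead of writing it by hand.
fn commutes_with(a: &str, b: &str) -> Rewrite<SymbolLang, ()> {
    let name = format!("commute-{}-through-{}", a, b);
    let lhs: Pattern<SymbolLang> = format!("({} ({} ?x))", a, b).parse().unwrap();
    let rhs: Pattern<SymbolLang> = format!("({} ({} ?x))", b, a).parse().unwrap();
    Rewrite::new(name, lhs, rhs).unwrap()
}

fn main() {
    // An unproven claim: op-a commutes with op-b.
    let rules = vec![commutes_with("op-a", "op-b")];
    println!("derived {} rewrite(s)", rules.len());
}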

Implement non-im2col'ed convolutions

cc @stdavids, who knows how these work on systolic arrays.

We had a conversation about it a while back, and I'll need to refresh my memory.

The current Glenside syntax works very well for im2col, which was pretty serendipitous. Hopefully we have the same luck with the normal implementation of conv2ds...

Build Errors from Fresh Clone

Replication Steps:

$ git clone git@github.com:gussmith23/glenside.git
$ cd glenside
$ cargo run demo \
     conv2d \
     data/conv2d/conv2d.glenside \
     data/conv2d/conv2d-shapes.json \
     conv2d.c \
     conv2d-hw-design.json \
     --iter-limit 40 \
     --time-limit 40 \
     --node-limit 500000 \
     --find-monolithic-designs '(64,64)' \
     --blocking '(64,64)' \
     --prefer-bsg-blocking

Error Message:

   Compiling serde v1.0.116
   Compiling tvm-sys v0.1.0 (https://github.com/mwillsey/incubator-tvm?rev=3b6edf9ec0b6b3ab6a91174e7e2aa321cd8ec9b2#3b6edf9e)
   Compiling tvm-macros v0.1.1 (https://github.com/mwillsey/incubator-tvm?rev=3b6edf9ec0b6b3ab6a91174e7e2aa321cd8ec9b2#3b6edf9e)
error: failed to run custom build command for `tvm-sys v0.1.0 (https://github.com/mwillsey/incubator-tvm?rev=3b6edf9ec0b6b3ab6a91174e7e2aa321cd8ec9b2#3b6edf9e)`

Caused by:
  process didn't exit successfully: `/home/stdavids/bsg_ml_atoms/extern/glenside/target/debug/build/tvm-sys-36ae5b38ceb5a8d5/build-script-build` (exit code: 101)
--- stdout
cargo:rerun-if-env-changed=TVM_HOME
cargo:rustc-link-lib=dylib=tvm
cargo:rustc-link-search=/home/stdavids/.cargo/git/checkouts/incubator-tvm-4920fa2689ba3c60/3b6edf9/build
cargo:warning=couldn't execute `llvm-config --prefix` (error: No such file or directory (os error 2))
cargo:warning=set the LLVM_CONFIG_PATH environment variable to a valid `llvm-config` executable

--- stderr
thread 'main' panicked at 'Unable to find libclang: "the `libclang` shared library at /usr/lib64/clang-private/libclang.so.7 could not be opened: libclangAST.so.7: cannot open shared object file: No such file or directory"', /home/stdavids/.cargo/registry/src/github.com-1ecc6299db9ec823/bindgen-0.51.1/src/lib.rs:1731:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

warning: build failed, waiting for other jobs to finish...
error: build failed

Glenside expressions have too many common subexpressions; not capturing what's really happening

The basic problem can be captured with the syntax of access-slice. Each slice slices out a piece of a tensor. For a given slice, it's currently ambiguous whether the tensor is precomputed and resident in memory, or computed just for that slice. Well, most of the time, that slice will have already been computed, and we're just reading from memory. That isn't what the language seems to capture, though! Instead, it seems to say that every slice expression computes its entire subexpression again.

Now, at first this wasn't a problem, because we could do whatever we wanted with the Glenside expression once we extracted it. So when we compile a program, we make sure each subexpression is computed just once.

This raises the minor issue of the fact that the semantics of the language are vague/contradictory. But we can live with that, if things are working.

However, there's a much bigger problem now: with real models (i.e. Mobilenet), subexpressions show up a LOT. Mobilenet's convolutional layers are currently implemented using slice to slice out parts of the weight/image to be convolved. This happens many times (16x, 32x at least) per convolutional layer. This means that when we go to write a cost function, the cost is going to grow 16x per convolutional layer, as egg cost functions must be monotonic (that is, an expression's cost must be >= the cost of any of its children). Those 16s stack fast and quickly overflow usize::MAX.
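
One possible mitigation on the cost-function side, which does not fix the underlying representation problem: saturate instead of overflowing. A minimal sketch using egg's generic SymbolLang rather than Glenside's actual language and cost model:

use egg::{CostFunction, Extractor, Id, Language, RecExpr, Rewrite, Runner, SymbolLang};

// Like egg's built-in AstSize, but with saturating_add, so stacked 16x factors
// pin at usize::MAX instead of overflowing.
struct SaturatingAstSize;

impl CostFunction<SymbolLang> for SaturatingAstSize {
    type Cost = usize;

    fn cost<C>(&mut self, enode: &SymbolLang, mut costs: C) -> Self::Cost
    where
        C: FnMut(Id) -> Self::Cost,
    {
        enode.fold(1usize, |sum, id| sum.saturating_add(costs(id)))
    }
}

fn main() {
    // A toy expression with a repeated subexpression.
    let expr: RecExpr<SymbolLang> = "(f (g x) (g x))".parse().unwrap();
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![];
    let runner = Runner::default().with_expr(&expr).run(&rules);
    let extractor = Extractor::new(&runner.egraph, SaturatingAstSize);
    let (cost, _best) = extractor.find_best(runner.roots[0]);
    println!("cost = {}", cost);
}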

Use Rust's ndarray library more efficiently in the interpreter

We use ndarray as our tensor representation, but we don't use it very efficiently. Specifically, we copy data everywhere, for nearly every operation. The ndarray library is smart; it can do many of the operations we need to do without copying data, if only we understood how to use it well. There are also opportunities to contribute back to ndarray, if we find something we'd like to be able to do that they haven't yet implemented. From what little I've dug in their source code already, it seems like these opportunities might be abundant, and contributing shouldn't be too hard!
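
A minimal sketch of the kind of copy avoidance this is asking for, using plain ndarray (nothing Glenside-specific); slicing and transposition can return stride-adjusted views instead of owned copies:

use ndarray::{s, Array2, ArrayView2};

fn main() {
    let a: Array2<f32> = Array2::zeros((4, 4));

    // A view of the first two rows: no data is copied, only strides/offsets change.
    let rows: ArrayView2<f32> = a.slice(s![0..2, ..]);

    // Transposing the view is also copy-free.
    let transposed = rows.t();

    // An owned copy is made only when we explicitly ask for one.
    let _owned: Array2<f32> = transposed.to_owned();
}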

Enable more sharing in program nodes

When we instantiate a CNN in the egraph, every layer gets its own set of enodes, even if two layers are the same (e.g. two identically-sized convolutions). This means that discoveries about program nodes (e.g. that a certain layer can be computed with a 16x16 systolic array) are not shared across identical layers, and so we need to give the egraph enough time to make sure it discovers the same facts about all relevant layers.

There are arguments for both sides. For keeping it the way it is:

  • Well, it's already built!
  • Also, we currently have the ability to reason about layer-level values and their equality. That is, we could, if we wanted to, say that values in a specific layer are equal to or different from values in another layer. Equality is about tensor values. This proposal would change equality to be about something else: it would say that, if the inputs to two expressions are the same, then the expressions are equal.
    • I guess in a way we're actually already doing this. It's just a matter of, what do we consider as "inputs"? What do we consider as "expressions"? We want to separate layer by layer, but right now, a layer represents the result of this layer and all the layers before it.
  • There might be unforeseen consequences...we might no longer be able to represent some things
    • I was scared at first that, for example, the zero-value tracking that is necessary to enable tensorization of padded data to systolic arrays wasn't going to work after this, because it's all about reasoning about values. However, I realized I think we'll be OK, because those zeroes will be there regardless of the input data.

For changing:

  • We might get massive sharing
    • We would then get massive speedup
    • We would also get more consistency in what gets tensorized. E.g. two identical layers will both have the same tensorization options (because they will be equal) rather than perhaps having one not get tensorized.
