Glenside

Check out the web demo!

Glenside is a pure, low-level tensor program representation which enables tensor program optimization via program rewriting, using rewriting frameworks such as the egg equality saturation library. If you are interested in transforming and optimizing tensor kernels (e.g. fusing kernels, exploring data layouts, or mapping to custom hardware), then Glenside is of interest to you! See the web demo for concrete examples. See our MAPS 2021 paper to understand why Glenside exists and how it can be used. Finally, see the docs for technical documentation.
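
For a flavor of the workflow, here is a minimal sketch that parses a Glenside-style expression and runs equality saturation with egg. It uses egg's generic SymbolLang rather than Glenside's actual language definition, and the collapse-access rule is made up purely to show the mechanics; it is not one of Glenside's rewrites.

use egg::{rewrite, AstSize, Extractor, RecExpr, Rewrite, Runner, SymbolLang};

fn main() {
    // A Glenside-style expression: sum-reduce tensor `a` over its item dimensions.
    let expr: RecExpr<SymbolLang> = "(compute reduce-sum (access (access-tensor a) 1))"
        .parse()
        .unwrap();

    // A single illustrative rewrite rule (hypothetical, not from Glenside).
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![rewrite!(
        "collapse-access";
        "(access (access ?t ?d) ?d)" => "(access ?t ?d)"
    )];

    // Run equality saturation, then extract the smallest equivalent expression.
    let runner = Runner::default().with_expr(&expr).run(&rules);
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (_cost, best) = extractor.find_best(runner.roots[0]);
    println!("{}", best);
}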

Quickstart

The fastest way to verify that this code actually does something is to build the Docker image and run the tests:

git clone <this repo>
cd <this repo>
docker build --tag glenside .
docker run -it glenside cargo test

...and "soon" I will add interactive web demos and pretty visualizations!

Dependencies

Glenside optionally depends on TVM and CPLEX. To disable these optional dependencies, use the --no-default-features flag with cargo, e.g. cargo test --no-default-features.

CPLEX

Glenside uses the CPLEX ILP solver. It is no longer used in Glenside's core and should eventually be removed or cordoned off, but for now you need CPLEX to get Glenside fully working. To set up CPLEX, follow these steps:

  1. Get access to CPLEX. Students and academics can do so by making an account through their academic program. Download and install CPLEX on your machine.
  2. Set environment variables. Set $CPLEX_LIB to the location of the newly-installed CPLEX library on your machine. For me on OSX, it resides in /Applications/CPLEX_Studio1210/cplex/lib/x86-64_osx/static_pic/.

Publications

See our MAPS 2021 paper, linked above.

Issues

Remove old constructs

Before removing these constructs, take their documentation and use it to improve the corresponding newer constructs.

  • BsgSystolicArray
  • MoveAxis
  • CartesianProduct
  • MapDotProduct
  • Slice
  • Concatenate
  • ElementwiseAdd

Remove `MyAnalysisData::Legacy`

It's basically only used for Usize now. Should be replaced with Usize.

  • Remove MyAnalysisDataLegacyData struct
  • Change the implementation of get_usize()
  • Remove implementation of get_shape()

Tensorize Mobilenet to automatically-blocking systolic arrays

This is a shorter path to victory that I honestly think will perhaps be good enough for the paper by itself. I'm starting to get a little defensive about it, knowing that I want Glenside to work a certain way (i.e. wanting Glenside to do the blocking), but I can't get too broken up about it.

Using/instantiating accumulators via Glenside

When Glenside blocks up a computation, it sometimes needs to do accumulation explicitly (if you split along the dimension of the vectors being dot-prodded). This is what the accumulators in Scott's systolic arrays are for, and there are commands to do accumulation in pieces, so I think this is possible; however, Scott's current high-level C API doesn't have a command for this. The only way to take advantage of the accumulators is to let Scott's tools do the blocking.

Tasks

  • Confirm that this is currently not possible with Scott's API
  • See whether we could do lower-level codegen to activate the accumulators
  • Confirm that representing this kind of...stateful? computation is possible in Glenside
    • I'm not sure stateful is the right word, but there is some notion that certain data needs to reside in the right place in memory at a specific time for this to work

Speed up Relay-to-Glenside tests

The Relay-to-Glenside tests do a number of random samples, comparing the output of the Relay program and the output of the Glenside program to see if they match given random inputs. We can just do one random sample if we feel like multiple samples aren't useful.

Speeding up the interpreter should also speed up these tests.

Glenside should find a blocking strategy for Mobilenet convolutions

This is mostly an issue of egg search performance and the fact that Glenside is blowing up the egraph with rewrites. See the comment thread for the debugging chain.

  • Add (failing) test for just one convolutional layer of Mobilenet
  • Debug why it's not tensorizing
  • Add needed rewrites (unsure whether this is actually the issue)

Clean up codegen tests using Rust macros

The codegen tests are super bloated, though their structure is really repetitive. There's probably a clever way to write them using macros. And then we might write more of them, because they wouldn't be so hard to write!
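
A rough sketch of what such a macro could look like; codegen here is a hypothetical stand-in for the real codegen entry point, and the assertion is purely illustrative:

// Hypothetical stand-in for the real codegen entry point, just so the
// sketch is self-contained.
fn codegen(program: &str) -> String {
    format!("// generated from: {}", program)
}

// Each invocation expands to a full #[test] function, so adding a new
// codegen test is a few lines instead of a copy-pasted block.
macro_rules! codegen_test {
    ($name:ident, $program:expr, $expected_substring:expr) => {
        #[test]
        fn $name() {
            let c_source = codegen($program);
            assert!(c_source.contains($expected_substring));
        }
    };
}

codegen_test!(
    reduce_sum_codegen,
    "(compute reduce-sum (access (access-tensor a) 1))",
    "generated from"
);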

Remove unnecessary language constructs

There are at least a few language constructs that do roughly the same things.

  • shape vs raw list: We can probably just remove shape.
  • shape-of probably isn't needed for now, as all of our workloads are statically sized

Memoize interpreter

Use a more sane interpretation strategy -- use memoization or something, rather than naive recursion.
I think (I don't actually remember) the interpreter is totally naive and doesn't memoize at all. Thus, when subexpressions appear multiple times, we completely recompute them each time.
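
A minimal memoization sketch, not Glenside's actual interpreter; the Expr type, the arena representation, and eval are all hypothetical, but they show how a HashMap keyed by node index lets a shared subexpression be evaluated only once:

use std::collections::HashMap;

// Hypothetical expression type; nodes live in an arena and refer to their
// children by index, so shared subexpressions are literally the same node.
enum Expr {
    Const(i64),
    Add(usize, usize),
}

fn eval(arena: &[Expr], node: usize, memo: &mut HashMap<usize, i64>) -> i64 {
    if let Some(&v) = memo.get(&node) {
        return v; // already computed: reuse instead of recursing again
    }
    let v = match arena[node] {
        Expr::Const(c) => c,
        Expr::Add(a, b) => eval(arena, a, memo) + eval(arena, b, memo),
    };
    memo.insert(node, v);
    v
}

fn main() {
    // (x + x) + (x + x) with x = 21: the shared subexpressions are evaluated once each.
    let arena = vec![Expr::Const(21), Expr::Add(0, 0), Expr::Add(1, 1)];
    let mut memo = HashMap::new();
    assert_eq!(eval(&arena, 2, &mut memo), 84);
}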

Interpreter benchmarks are measuring too much

The benchmarks in the interpreter code are measuring the setup code and the result-verifying code as well. It works for now, but in the future, we might want to resolve that. It looks like it could be annoying to resolve, however. The easiest solution might be to have the macro create both a #[test] and a #[bench] separately, where the test tests functionality, and the benchmark is just set up for timing.
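
A rough sketch of that split, assuming a nightly toolchain (for #[bench]) and a hypothetical run_interpreter entry point; the real macro would thread through whatever setup the current tests do:

#![feature(test)] // nightly-only, needed for #[bench]
extern crate test;

// Hypothetical stand-in for the real interpreter entry point.
fn run_interpreter(input: &[i64]) -> i64 {
    input.iter().sum()
}

// One invocation produces both a functional test and a benchmark, so the
// benchmark times only the interpreter call, not setup or result checking.
macro_rules! interp_test_and_bench {
    ($test_name:ident, $bench_name:ident, $input:expr, $expected:expr) => {
        #[test]
        fn $test_name() {
            // Setup and verification happen here, in the test...
            assert_eq!(run_interpreter(&$input), $expected);
        }

        #[bench]
        fn $bench_name(b: &mut test::Bencher) {
            // ...while the benchmark only times the run itself.
            let input = $input;
            b.iter(|| run_interpreter(&input));
        }
    };
}

interp_test_and_bench!(sum_works, bench_sum, vec![1, 2, 3], 6);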

Add new atom (with rewrites and codegen): systolic array with externally (non-Glenside) managed blocking

Scott's tools can block up computations just like Glenside can. He wants us to add a new atom that will let him do the blocking himself.

Tasks:

  • Add atom: systolic-array-automatic-blocking
    • naming isn't great
  • Add rewrite: need to figure out exactly what can be blocked up by Scott's scripts
  • Interpret: we're not actually implementing the systolic arrays in the interpreter (they're not needed yet... and at that point, the interpreter would become, like, an emulator)
  • Add codegen: call Scott's API
  • Add demo

Rethink representation for binary computations

I'm not sure the representation for binary computations is as clean as it could be at the moment. For example, computing elementwise add between two tensors is done as (compute elementwise-add (access-pair (access (access-tensor a) 0) (access (access-tensor b) 0))); however, there's no reason they can't be accessed at some other (matching) dimension. The access-pair conveys that they should be "paired" elementwise. The compute then looks for the "tuple" or "pair" dimension and reduces that dimension via an add. It feels very ad-hoc.

Interestingly, as I'm writing this, I realize that elementwise-add and reduce-sum computations have very similar semantics; elementwise-add reduces along a single dimension in the item shape, though, while reduce-sum reduces along all dimensions in the item shape.

Glenside "named tensors" implementation (named dimensions, algebra of dimensions, rank polymorphism, layout agnostic tensors, and many other names/ideas)

Named tensors are a great idea for making deep learning programming easier and safer at compile time and runtime. The basic concept is this: tensors are n-dimensional arrays, and so traditionally, their dimensions have been indexed using numbers. As deep learning has evolved, though, common tensor layouts have arisen, e.g. NCHW for activation tensors, where N is the batch dimension, C is the channels dimension, and H/W are height and width. Using named tensors for activation tensors makes sense, as it

  • Makes it clear what each dimension is for
  • Removes the notion of "layout" (in-memory ordering of dimensions) from the tensor

I would like to explore some related ideas in Glenside. Currently, Glenside represents tensors as "access patterns", which represent dimensions in an ordered list (just as a standard tensor representation would). However, access patterns add the concept of some dimensions at the end being "accessed" as the item dimensions, creating a tensor which iterates over tensors. For example, a tensor a of shape (1, 3, 32, 32) accessed at dimension 1 becomes an access pattern with shape (1) and "item shape" (3, 32, 32): conceptually, a tensor of shape (1) whose items are tensors of shape (3, 32, 32).

Access patterns serve a number of purposes:

  • They convey information to operators, telling an operator which dimension to operate over. For example, (compute reduce-sum (access (access-tensor a) 1)) will reduce all of the item dimensions via summing, producing a tensor of shape (1).
  • They (implicitly) talk about how memory is logically and literally accessed (hence the name access pattern). The dimensions are ordered, with the implicit assumption that they match with the underlying memory layout. Thus, the access pattern describes the literal way we access memory, while also putting a logical layer on top of that, indicating in a very basic way which dimensions are the "items" we're accessing, and which dimensions are not.

Instead of high-level Glenside operators operating over access patterns with ordered dimensions, I would like to try having them operate over some tensor construct with named dimensions which convey the same information. When it comes to telling a compute operator which dimensions to compute over, the access pattern conveys this by grouping the relevant dimensions together in the "item" dimensions group. The actual order of the dimensions is totally irrelevant in this context. Thus, if we instead could mark the dimensions we want to compute over by name, then the underlying tensor representation wouldn't need its dimensions to have any order. For example (compute reduce-sum (dims h w) (tensor (dims n c h w) a)) conveys that we want to compute over dimensions h and w in a tensor with dimensions n, c, h, and w.
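
A rough sketch of the data structure this proposal suggests; nothing here is an actual Glenside design (NamedShape and its reduce method are made up, and only shapes are modeled), but it shows how an unordered, name-keyed shape plus a set of dimension names to reduce over carries the same information that the ordered item-dimension grouping does today:

use std::collections::HashMap;

// Hypothetical named shape: dimension name -> size, with no ordering.
#[derive(Debug, Clone)]
struct NamedShape {
    dims: HashMap<String, usize>,
}

impl NamedShape {
    // Models (compute reduce-sum (dims h w) (tensor (dims n c h w) a)):
    // reducing over `h` and `w` leaves a shape containing only `n` and `c`.
    fn reduce(&self, over: &[&str]) -> NamedShape {
        let dims = self
            .dims
            .iter()
            .filter(|(name, _)| !over.contains(&name.as_str()))
            .map(|(name, &size)| (name.clone(), size))
            .collect();
        NamedShape { dims }
    }
}

fn main() {
    let shape = NamedShape {
        dims: [("n", 1), ("c", 3), ("h", 32), ("w", 32)]
            .into_iter()
            .map(|(k, v)| (k.to_string(), v))
            .collect(),
    };
    let reduced = shape.reduce(&["h", "w"]);
    assert_eq!(reduced.dims.len(), 2); // only `n` and `c` remain
}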

At the same time, we would need to "add back" the idea of layout, or dimension ordering. This is good, though, as we can now explicitly design layout semantics for Glenside, where they are currently entirely implicit. Describing hardware atoms relies heavily on the implicit layout semantics of access patterns.

Implementing batch norms in hardware

cc @stdavids

I know I'm a bit behind on this, but we're finally ready to start looking at implementing batch norms from the Glenside side of things. We can talk in hackathon about it. From the Glenside perspective, this involves:

  • Whatever type of hardware block the batch norm will run on, need to create a representation for it in Glenside
  • Implement rewrites to simplify/fuse the computes, if we can and want to
  • Generate code for the batch norm operators

The interesting thing about batch norms in our workloads is that they aren't one "batch norm" operator, but instead, they're a chain of simpler primitive operators implementing the linear transformation that batch norm represents at inference time. This is because I've run the SimplifyInference Relay pass over the workloads before importing them from Relay. This is a habitual thing to do when working with workloads in Relay, and so that's why I did it; however, it might be easier for Glenside to take in opaque, un-simplified batch norm operators. It could then implement its own equivalent of SimplifyInference and create a "batch norm inference" node. Currently, this isn't possible, as the batch norm nodes in Relay have been blown up into smaller operators. I'm not sure this will be necessary, though -- I'm hoping we can just enable Glenside to fold the computation back together into some efficient vectorized format.
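
For reference, the linear transformation that batch norm represents at inference time folds into a per-channel multiply-add, which is the chain of primitive ops SimplifyInference leaves behind. A minimal sketch using ndarray on a single channel (the function name and signature are hypothetical, not Glenside or Relay code):

use ndarray::Array1;

// At inference time, batch norm is an affine transform:
//   y = gamma * (x - mean) / sqrt(var + eps) + beta
//     = scale * x + shift
// where scale = gamma / sqrt(var + eps) and shift = beta - scale * mean.
fn batch_norm_inference(
    x: &Array1<f32>, // one channel's activations
    gamma: f32,
    beta: f32,
    mean: f32,
    var: f32,
    eps: f32,
) -> Array1<f32> {
    let scale = gamma / (var + eps).sqrt();
    let shift = beta - scale * mean;
    x.mapv(|v| scale * v + shift)
}

fn main() {
    let x = Array1::from(vec![1.0_f32, 2.0, 3.0]);
    let y = batch_norm_inference(&x, 1.0, 0.0, 2.0, 1.0, 1e-5);
    println!("{}", y);
}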

Tighter TVM Integration

I'd like to explore how Glenside and TVM can be more closely integrated. Some ideas:

  • Represent portions of Glenside programs using TVM/Relay
    • If we just want to explore tensorization of a portion of a Glenside program, it would be neat if we could represent the rest of the program (the "software" portion) with TVM/Relay, hopefully with the goal of compiling that portion much more quickly when needed.
  • Explore how Bring Your Own Codegen can be used with Glenside to potentially generate a compiler for Glenside-created hardware
  • Glenside -> Relay/TVM compiler? We've done it the other way around. Not sure if this would be useful.

Associative, commutative, etc "traits" for access patterns

A lot of Glenside's rewrites focus on implementing simple things, e.g. "bubbling" an access pattern through another access pattern (i.e. using composition commutativity to reorder the access patterns). I would love it if we could implement a way to automatically derive these types of rewrites. At first, we could use unproven claims that access patterns have specific traits (e.g. that one commutes with another access pattern). Perhaps in the future, we can check these claims, or even automatically infer them.
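
A rough sketch of what deriving such rewrites from declared (unproven) trait claims could look like, again using egg's generic SymbolLang; op-a and op-b are placeholder unary operators, not Glenside's actual access-pattern operators:

use egg::{Pattern, Rewrite, SymbolLang};

// Given a claim that unary operator `a` commutes with unary operator `b`,
// generate the corresponding "bubbling" rewrite instead of writing it by hand.
fn commutes_with(a: &str, b: &str) -> Rewrite<SymbolLang, ()> {
    let name = format!("commute-{}-through-{}", a, b);
    let lhs: Pattern<SymbolLang> = format!("({} ({} ?x))", a, b).parse().unwrap();
    let rhs: Pattern<SymbolLang> = format!("({} ({} ?x))", b, a).parse().unwrap();
    Rewrite::new(name, lhs, rhs).unwrap()
}

fn main() {
    // An unproven claim: op-a commutes with op-b.
    let rules = vec![commutes_with("op-a", "op-b")];
    println!("derived {} rewrite(s)", rules.len());
}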

Implement non-im2col'ed convolutions

cc @stdavids, who knows how these work on systolic arrays.

We had a conversation about it a while back, and I'll need to refresh my memory.

The current Glenside syntax works very well for im2col, which was pretty serendipitous. Hopefully we have the same luck with the normal implementation of conv2ds...

Build Errors from Fresh Clone

Replication Steps:

$ git clone git@github.com:gussmith23/glenside.git
$ cd glenside
$ cargo run demo \
     conv2d \
     data/conv2d/conv2d.glenside \
     data/conv2d/conv2d-shapes.json \
     conv2d.c \
     conv2d-hw-design.json \
     --iter-limit 40 \
     --time-limit 40 \
     --node-limit 500000 \
     --find-monolithic-designs '(64,64)' \
     --blocking '(64,64)' \
     --prefer-bsg-blocking

Error Message:

   Compiling serde v1.0.116
   Compiling tvm-sys v0.1.0 (https://github.com/mwillsey/incubator-tvm?rev=3b6edf9ec0b6b3ab6a91174e7e2aa321cd8ec9b2#3b6edf9e)
   Compiling tvm-macros v0.1.1 (https://github.com/mwillsey/incubator-tvm?rev=3b6edf9ec0b6b3ab6a91174e7e2aa321cd8ec9b2#3b6edf9e)
error: failed to run custom build command for `tvm-sys v0.1.0 (https://github.com/mwillsey/incubator-tvm?rev=3b6edf9ec0b6b3ab6a91174e7e2aa321cd8ec9b2#3b6edf9e)`

Caused by:
  process didn't exit successfully: `/home/stdavids/bsg_ml_atoms/extern/glenside/target/debug/build/tvm-sys-36ae5b38ceb5a8d5/build-script-build` (exit code: 101)
--- stdout
cargo:rerun-if-env-changed=TVM_HOME
cargo:rustc-link-lib=dylib=tvm
cargo:rustc-link-search=/home/stdavids/.cargo/git/checkouts/incubator-tvm-4920fa2689ba3c60/3b6edf9/build
cargo:warning=couldn't execute `llvm-config --prefix` (error: No such file or directory (os error 2))
cargo:warning=set the LLVM_CONFIG_PATH environment variable to a valid `llvm-config` executable

--- stderr
thread 'main' panicked at 'Unable to find libclang: "the `libclang` shared library at /usr/lib64/clang-private/libclang.so.7 could not be opened: libclangAST.so.7: cannot open shared object file: No such file or directory"', /home/stdavids/.cargo/registry/src/github.com-1ecc6299db9ec823/bindgen-0.51.1/src/lib.rs:1731:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

warning: build failed, waiting for other jobs to finish...
error: build failed

Glenside expressions have too many common subexpressions; not capturing what's really happening

The basic problem can be captured with the syntax of access-slice. Each slice slices out a piece of a tensor. For a given slice, it's currently ambiguous whether the tensor is precomputed and resident in memory, or computed just for that slice. Well, most of the time, that slice will have already been computed, and we're just reading from memory. That isn't what the language seems to capture, though! Instead, it seems to say that every slice expression computes its entire subexpression again.

Now, at first this wasn't a problem, because we could do whatever we wanted with the Glenside expression once we extracted it. So when we compile a program, we make sure each subexpression is computed just once.

This raises the minor issue of the fact that the semantics of the language are vague/contradictory. But we can live with that, if things are working.

However, there's a much bigger problem now: with real models (i.e. Mobilenet), subexpressions show up a LOT. Mobilenet's convolutional layers are currently implemented using slice to slice out parts of the weight/image to be convolved. This happens many times (16x, 32x at least) per convolutional layer. This means that when we go to write a cost function, the cost is going to grow 16x per convolutional layer, as egg cost functions must be monotonic (that is, an expression's cost must be >= the cost of any of its children). Those 16s stack fast and quickly overflow usize::MAX.
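
One possible mitigation on the cost-function side, which does not fix the underlying representation problem: saturate instead of overflowing. A minimal sketch using egg's generic SymbolLang rather than Glenside's actual language and cost model:

use egg::{CostFunction, Extractor, Id, Language, RecExpr, Rewrite, Runner, SymbolLang};

// Like egg's built-in AstSize, but with saturating_add, so stacked 16x factors
// pin at usize::MAX instead of overflowing.
struct SaturatingAstSize;

impl CostFunction<SymbolLang> for SaturatingAstSize {
    type Cost = usize;

    fn cost<C>(&mut self, enode: &SymbolLang, mut costs: C) -> Self::Cost
    where
        C: FnMut(Id) -> Self::Cost,
    {
        enode.fold(1usize, |sum, id| sum.saturating_add(costs(id)))
    }
}

fn main() {
    // A toy expression with a repeated subexpression.
    let expr: RecExpr<SymbolLang> = "(f (g x) (g x))".parse().unwrap();
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![];
    let runner = Runner::default().with_expr(&expr).run(&rules);
    let extractor = Extractor::new(&runner.egraph, SaturatingAstSize);
    let (cost, _best) = extractor.find_best(runner.roots[0]);
    println!("cost = {}", cost);
}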

Use Rust's ndarray library more efficiently in the interpreter

We use ndarray as our tensor representation, but we don't use it very efficiently. Specifically, we copy data everywhere, for nearly every operation. The ndarray library is smart; it can do many of the operations we need to do without copying data, if only we understood how to use it well. There are also opportunities to contribute back to ndarray, if we find something we'd like to be able to do that they haven't yet implemented. From what little I've dug in their source code already, it seems like these opportunities might be abundant, and contributing shouldn't be too hard!
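
A minimal sketch of the kind of copy avoidance this is asking for, using plain ndarray (nothing Glenside-specific); slicing and transposition can return stride-adjusted views instead of owned copies:

use ndarray::{s, Array2, ArrayView2};

fn main() {
    let a: Array2<f32> = Array2::zeros((4, 4));

    // A view of the first two rows: no data is copied, only strides/offsets change.
    let rows: ArrayView2<f32> = a.slice(s![0..2, ..]);

    // Transposing the view is also copy-free.
    let transposed = rows.t();

    // An owned copy is made only when we explicitly ask for one.
    let _owned: Array2<f32> = transposed.to_owned();
}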

Enable more sharing in program nodes

When we instantiate a CNN in the egraph, every layer gets its own set of enodes, even if two layers are the same (e.g. two identically-sized convolutions). This means that discoveries about program nodes (e.g. that a certain layer can be computed with a 16x16 systolic array) are not shared across identical layers, and so we need to give the egraph enough time to make sure it discovers the same facts about all relevant layers.

There are arguments for both sides. For keeping it the way it is:

  • Well, it's already built!
  • Also, we currently have the ability to reason about layer-level values and their equality. That is, we could, if we wanted to, say that values in a specific layer are equal to or different from values in another layer. Equality is about tensor values. This proposal would change equality to be about something else: it would say that, if the inputs to two expressions are the same, then the expressions are equal.
    • I guess in a way we're actually already doing this. It's just a matter of, what do we consider as "inputs"? What do we consider as "expressions"? We want to separate layer by layer, but right now, a layer represents the result of this layer and all the layers before it.
  • There might be unforeseen consequences...we might no longer be able to represent some things
    • I was scared at first that, for example, the zero-value tracking that is necessary to enable tensorization of padded data to systolic arrays wasn't going to work after this, because it's all about reasoning about values. However, I realized I think we'll be OK, because those zeroes will be there regardless of the input data.

For changing:

  • We might get massive sharing
    • We would then get massive speedup
    • We would also get more consistency in what gets tensorized. E.g. two identical layers will both have the same tensorization options (because they will be equal) rather than perhaps having one not get tensorized.
