Comments (10)
For completeness:
sequentialAFoldl :: (Shape sh, Elt e, Arrays a)
                 => (Acc a -> Acc (Array sh e) -> Acc a)
                 -> Acc a
                 -> Acc (Array (sh :. Int) e)
                 -> Acc a
sequentialAFoldl f a0 as =
  let _ ::. n = shape as
  in afst (awhile (\(T2 _ i) -> map (< n) i)
                  (\(T2 a i) ->
                     let part = slice as (constant Any ::. the i)
                     in T2 (f a part) (map (+ 1) i))
                  (T2 a0 (unit 0)))
WARNING: Untested!
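For intuition, the intended semantics can be sketched in plain Haskell (a plain-list analogue, not Accelerate code; the names and data are illustrative), viewing the array as a list of its slices:

```haskell
-- Plain-list analogue (not Accelerate): an Array (sh :. Int) e is viewed
-- as a list of Array sh e slices, and the state is threaded through them
-- from the left, just like foldl.
sequentialAFoldlList :: (a -> [e] -> a) -> a -> [[e]] -> a
sequentialAFoldlList = foldl

-- Example: summing every element while walking the slices in order.
example1 :: Int
example1 = sequentialAFoldlList (\acc row -> acc + sum row) 0 [[1,2],[3,4],[5,6]]
```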
from accelerate.
For completeness, with vectorisation I meant something like the following: (note: typechecks, but untested)
forwardAffine' :: (Shape sh, Elt t, A.Floating t)
               => Acc (Array (sh :. Int :. Int) t)
               -> Acc (Array (sh :. Int :. Int :. Int) t, Array (sh :. Int :. Int) t)
               -> Acc (Array (sh :. Int :. Int) t)
forwardAffine' activations params =
  let
    (weights, biases) = unlift params
    rest ::. r ::. c ::. _batches = shape weights
    act' = A.replicate (constant A.Any ::. r ::. constant All ::. constant All) activations
  in
    A.zipWith (+) biases $ A.fold (+) 0 (transpose' (A.zipWith (*) weights act'))
  where
    transpose' :: (Shape sh, Elt t)
               => Acc (Array (sh :. Int :. Int) t)
               -> Acc (Array (sh :. Int :. Int) t)
    transpose' arr =
      let rest ::. n ::. m = shape arr
      in A.backpermute (rest ::. m ::. n) (\(idx ::. i ::. j) -> idx ::. j ::. i) arr
Firstly note that transpose' just exchanges the outermost two dimensions of an array. This can perhaps also be done using the recent lens-like additions to Accelerate, but I'm more familiar with the "traditional" method, hence the above. :)
What I did was add an extra outer dimension to all inputs and outputs of forwardAffine; the intention is that all inputs and outputs are equally large in this outer dimension, and forwardAffine' just parallelises over that dimension. weights thus gets an additional outer dimension, reflected in the _batches addition to the pattern match. Furthermore, we just copy the outer dimension of activations over to act' unchanged using All. In the last line, then, the zipWith (*) needs no modification because it is an elementwise operation that doesn't care about an additional dimension. For the fold (note by the way that this is A.sum), we previously wanted to fold over the outermost dimension, but that dimension has now shifted one inward. Hence we first transpose the two outermost dimensions and fold afterwards. The zipWith (+) is again elementwise, so no need to change that.
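The invariant this vectorisation is meant to preserve can be checked with a plain-list model (a sketch, not Accelerate; affine, affineBatched and the parameters are made-up illustrations): the batched layer must compute, per batch element, exactly what the unbatched layer computes.

```haskell
-- Plain-list model (not Accelerate): w is a list of weight rows, b the
-- bias vector, x one activation vector.
affine :: [[Double]] -> [Double] -> [Double] -> [Double]
affine w b x = zipWith (+) b [ sum (zipWith (*) row x) | row <- w ]

-- The batched variant shares one set of parameters across all inputs;
-- the vectorised forwardAffine' is meant to compute exactly this, but
-- as one flat parallel computation instead of an outer map.
affineBatched :: [[Double]] -> [Double] -> [[Double]] -> [[Double]]
affineBatched w b = map (affine w b)
```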
The way I'm hoping to structure the library, f could be practically anything, provided the library or the user provides a derivative for f.
Parallelising the outermost layer (mapping over the batches) as well as the inner computation (the worker) will not work in current Accelerate without vectorising the worker as above, unfortunately. This is because the worker could theoretically do wildly different things for each outer element, and Accelerate currently wants all parallelism to be regular, as it is called. (There is work to try to change this, but here we're talking about very long-term, multiple years in the future.)
However, even making the outer layer sequential wouldn't work currently if you want to collect the output values in an array. (If you want to e.g. sum the results anyway, that would be possible of course.) The loop would be an awhile, but you then need to incrementally build the output array -- and this is the problem. While you can repeatedly concatenate a new result onto the end of the output array using (A.++), that entails an array copy on every iteration and thus leads to the loop having quadratic complexity in the output size.
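The quadratic blow-up is easy to quantify: iteration i copies the i slices already accumulated, so n iterations copy on the order of n(n-1)/2 slices in total. A small sketch of the copy count (illustrative only; the helper name is made up):

```haskell
-- Growing the output by concatenation copies everything accumulated so
-- far on every step: iteration i re-copies i earlier slices of size k,
-- so n iterations copy k * n*(n-1)/2 elements in total.
copiesForAppendLoop :: Int -> Int -> Int
copiesForAppendLoop n k = sum [ i * k | i <- [0 .. n - 1] ]
```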
Another not-really-viable option is to put the batch items in a large tuple; this can only work if you know the number of batch items at accelerate-compile-time, and would be quite hard on the simplifier anyway if there are a non-trivial number of items.
It is expected that later this year we will have an implementation of permute that can perform in-place mutation on the target array if that array is not used anywhere else. When that lands, there is a way to get your desired functionality (with a sequential outer loop) by modifying the output array in each awhile iteration, putting a single slice of the result in place using permute const.
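In list terms, each iteration would then perform the equivalent of the following (a plain-list stand-in with a hypothetical helper name, not the Accelerate API): overwrite exactly one slice of the output, leaving the rest untouched and uncopied.

```haskell
-- Plain-list stand-in (hypothetical helper, not Accelerate): overwrite
-- slice i of an accumulating output with a freshly computed slice,
-- which is what a `permute const` into an otherwise-unshared target
-- could do in place, without copying the other slices.
writeSlice :: Int -> [e] -> [[e]] -> [[e]]
writeSlice i new rows = [ if j == i then new else old | (j, old) <- zip [0 ..] rows ]
```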
(Some of the ideas here should be credited to the Accelerate team members, with whom I spoke today)
from accelerate.
Yeah a typical batch size would be >=64, and even if it were smaller this solution would sacrifice shape polymorphism, which is not nice.
I was thinking about an even uglier setup where the batch is a tuple of input arrays; then you don't lose shape polymorphism, because the arrays in a tuple don't need to have the same shape.
The function representing the model should include the batch index, and backpermute can be used to shift this index to the back so that the forward and backward pass are ordinary accelerate functions -- mapLastIndex not required.
Smart! Yes, if you can require that model functions are rank-polymorphic such that they ignore the leftmost dimensions, then that approach could make it work, since the model function would automatically map over the array without any work whatsoever on your part (apart from permuting the dimensions if necessary, but even that could be avoided if you import the data in the right way).
Then you even get parallelisation on both the batch index and the model computation.
from accelerate.
The only reason I can imagine why you would want a loop over an array to be sequential is because you want to thread some state through, presumably the model you are building. Is that correct or am I way out of whack here?
Would a function like this solve your problems (modulo the name I guess)?
sequentialAFoldl :: (Shape sh, Elt e, Arrays a)
                 => (Acc a -> Acc (Array sh e) -> Acc a)
                 -> Acc a
                 -> Acc (Array (sh :. Int) e)
                 -> Acc a
This function sequentialAFoldl would do a left fold over its third argument (of type Array (sh :. Int) e, here unconventionally considered an array of Array sh e values), threading through a value of type Acc a. It's a bit like foldl in standard Haskell.
Note that you speak about folding over the inner dimension instead of the outer dimension as my hypothetical sequentialAFoldl does; however, note that with the right backpermute you can permute dimensions at will, so that should not represent a loss of expressivity (though possibly of some performance due to shuffling of indices, depending on how things are fused).
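The dimension-permuting argument can be stated in list terms (a sketch, not Accelerate): reducing along the inner dimension is the same as transposing first and reducing along the outer one, which is what a backpermute buys you.

```haskell
import Data.List (transpose)

-- foldOuter reduces each outer-dimension slice; foldInner reduces along
-- the inner dimension by transposing first.
foldOuter, foldInner :: ([Double] -> Double) -> [[Double]] -> [Double]
foldOuter f = map f
foldInner f = map f . transpose
```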
Or would you rather need this?
sequentialABuildl :: (Shape sh, Shape sh', Elt e, Elt e', Arrays a)
                  => (Acc a -> Acc (Array sh e) -> Acc (Array sh' e', a))
                  -> Acc a
                  -> Acc (Array (sh :. Int) e)
                  -> Acc (Array (sh' :. Int) e', a)
which is mostly the same except that it additionally builds up a new array (of shape sh' :. Int
) in the process, which it fills piece by piece along the outer dimension.
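The semantics of sequentialABuildl correspond closely to Data.List.mapAccumL over the outer slices (a plain-list sketch, not the proposed primitive; note only the order of the result pair differs):

```haskell
import Data.List (mapAccumL)

-- Plain-list analogue: thread a state through the outer slices while
-- emitting one output slice per input slice, returning (slices, state).
sequentialABuildlList :: (a -> [e] -> ([e'], a)) -> a -> [[e]] -> ([[e']], a)
sequentialABuildlList f a0 xs =
  let (a', ys) = mapAccumL (\a x -> let (y, a1) = f a x in (a1, y)) a0 xs
  in (ys, a')
```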
Both functions above are hypothetical, but sequentialAFoldl can be implemented using awhile, I think. sequentialABuildl, on the other hand, cannot currently be implemented using the Accelerate surface language with the expected performance characteristics -- it would need to be a new primitive in the internal language. But I've found a need for sequentialABuildl for other reasons, so I wanted to ask if you happened to be looking for the same.
from accelerate.
The loop needs to be sequential in order to avoid nested data parallelism. Sorry if I wasn't clear in the original post: the outer loop is a map (on really any index, as you point out) which can be parallelized, each worker for this map has to be sequential.
Here is a very inefficient implementation of the kind of mapping function I am thinking of, which would probably throw a runtime error, but just so the idea is clear:
mapLastIndex :: forall a sh sh'. (Elt a, Shape sh, Shape sh')
             => (Acc (Array sh a) -> Acc (Array sh' a))  -- worker function
             -> Acc (Array (sh :. Int) a)                -- input array
             -> Exp (sh' :. Int)                         -- expected output shape (not sure if this argument is necessary)
             -> Acc (Array (sh' :. Int) a)               -- output array
mapLastIndex f arr outshape =
  let
    generator :: Exp (sh' :. Int) -> Exp a
    generator ix =
      let a ::. i = ix
      in f (A.slice arr (constant Any ::. i)) A.! a
  in
    A.generate outshape generator
Obviously there are a few problems with this, but the critical part is that the worker function is the forward pass (or backward pass) of the model, and, as far as I can tell, it has to be evaluated sequentially in order for this outer loop to be parallel. A typical example of this worker would be some composition of matrix multiplications, vector additions, and sigmoid functions mapped over the whole array.
Your sequentialAFoldl gives me an idea of how this would be done, but what I am looking for is more like
sequentialFold :: (Shape sh, Elt e, Arrays a)
               => (Acc a -> Exp e -> Acc a)
               -> Acc a
               -> Acc (Array sh e)
               -> Acc a
I'm guessing this can be done using awhile?
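The awhile encoding of such a fold can be sketched in plain Haskell (illustrative names; recursion stands in for awhile): a (state, counter) pair is stepped until the counter reaches the element count, exactly the structure of the (T2 a i) loop shown earlier in the thread.

```haskell
-- Plain-Haskell sketch of the awhile pattern: thread a (state, counter)
-- pair over the elements until the counter runs out.
sequentialFoldList :: (a -> e -> a) -> a -> [e] -> a
sequentialFoldList f a0 es = go (a0, 0)
  where
    n = length es
    go (a, i)
      | i < n     = go (f a (es !! i), i + 1)
      | otherwise = a
```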
from accelerate.
That sequentialFold can indeed probably be implemented using awhile.
I get what your mapLastIndex is trying to do; let's indeed ignore its wildly inefficient/non-working implementation. ;) Whether you can implement that depends on how f is structured. If possible, the most natural way to do this (in current Accelerate) would be to vectorise f so that it runs in parallel over an array of inputs. That array of inputs would then be your arr, which I assume is indexed by your "batch index". The idea then is that you don't need to have a parallel layer inside of a sequential layer; you just have one parallel layer that does it all.
If what you're implementing is just a basic dense neural network, I think this should be very possible. I believe I've written some experimental code at some point that does exactly that, in fact.
Could you perhaps share what the worker function looks like? Perhaps I can give some more concrete pointers about how to vectorise it.
from accelerate.
Side note: I believe my sequentialABuildl would allow you to do what you want directly; you just ignore the state argument and treat it like a map that can incrementally fill the output array sequentially.
from accelerate.
The way I'm hoping to structure the library, f could be practically anything, provided the library or the user provides a derivative for f. This is essentially how it's done in pytorch/tensorflow, the main difference being that I want to use generalized algebraic datatypes to give the user more explicit control over the compute graph.
Anyway, you asked for an example, so suppose we are dealing with a rank-polymorphic affine layer:
forwardAffine :: (Shape sh, Elt t, A.Floating t)
              => Acc (Array (sh :. Int) t)
              -> Acc (Array (sh :. Int :. Int) t, Array (sh :. Int) t)
              -> Acc (Array (sh :. Int) t)
forwardAffine activations params =
  let
    (weights, biases) = unlift params
    rest ::. r ::. c = shape weights
    act' = A.replicate (constant Any ::. r ::. constant All) activations
  in
    A.zipWith (+) biases $ A.fold (+) 0 (A.zipWith (*) weights act')
This typechecks but I didn't test it. In this case f would be \a -> forwardAffine a params for some fixed params. The forward pass of the entire model would be things like this composed with element-wise sigmoids, ReLUs, and anything else the user might want, like exponentials, trigonometry, etc.
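Such a composed worker can be sketched as a list model (a sketch only; the layer shapes and parameter values are arbitrary illustrative choices, and affineL/sigmoid/model are made-up names):

```haskell
-- List-model affine layer: y = b + W.x
affineL :: [[Double]] -> [Double] -> [Double] -> [Double]
affineL w b x = zipWith (+) b [ sum (zipWith (*) row x) | row <- w ]

sigmoid :: Double -> Double
sigmoid x = 1 / (1 + exp (-x))

-- A tiny two-layer model as a composition of layer functions.
model :: [Double] -> [Double]
model = map sigmoid . affineL [[1,-1]] [0] . map (max 0) . affineL [[1,0],[0,1]] [0,0]
```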
Typically this forward pass would then be mapped over the batch index in parallel; same with the backward pass. What do you think would be the most efficient way of approaching it in accelerate?
from accelerate.
Another not-really-viable option is to put the batch items in a large tuple; this can only work if you know the number of batch items at accelerate-compile-time, and would be quite hard on the simplifier anyway if there are a non-trivial number of items.
Yeah a typical batch size would be >=64, and even if it were smaller this solution would sacrifice shape polymorphism, which is not nice.
I think I'm beginning to understand the essential idea though. The function representing the model should include the batch index, and backpermute can be used to shift this index to the back so that the forward and backward pass are ordinary accelerate functions -- mapLastIndex not required.
from accelerate.
This seems great, thanks. I'll close the issue.
from accelerate.