Comments (10)
For completeness:
sequentialAFoldl :: (Shape sh, Elt e, Arrays a)
                 => (Acc a -> Acc (Array sh e) -> Acc a)
                 -> Acc a
                 -> Acc (Array (sh :. Int) e)
                 -> Acc a
sequentialAFoldl f a0 as =
  let _ ::. n = shape as
  in afst (awhile (\(T2 _ i) -> map (< n) i)
                  (\(T2 a i) ->
                     let part = slice as (constant Any ::. the i)
                     in T2 (f a part) (map (+ 1) i))
                  (T2 a0 (unit 0)))
WARNING: Untested!
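For intuition, the intended semantics can be sketched in plain Haskell (a plain-list analogue, not Accelerate code; the names and data are illustrative), viewing the array as a list of its slices:

```haskell
-- Plain-list analogue (not Accelerate): an Array (sh :. Int) e is viewed
-- as a list of Array sh e slices, and the state is threaded through them
-- from the left, just like foldl.
sequentialAFoldlList :: (a -> [e] -> a) -> a -> [[e]] -> a
sequentialAFoldlList = foldl

-- Example: summing every element while walking the slices in order.
example1 :: Int
example1 = sequentialAFoldlList (\acc row -> acc + sum row) 0 [[1,2],[3,4],[5,6]]
```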
from accelerate.
For completeness, with vectorisation I meant something like the following: (note: typechecks, but untested)
forwardAffine' :: (Shape sh, Elt t, A.Floating t)
               => Acc (Array (sh :. Int :. Int) t)
               -> Acc (Array (sh :. Int :. Int :. Int) t, Array (sh :. Int :. Int) t)
               -> Acc (Array (sh :. Int :. Int) t)
forwardAffine' activations params =
  let
    (weights, biases) = unlift params
    rest ::. r ::. c ::. _batches = shape weights
    act' = A.replicate (constant A.Any ::. r ::. constant All ::. constant All) activations
  in
    A.zipWith (+) biases $ A.fold (+) 0 (transpose' (A.zipWith (*) weights act'))
  where
    transpose' :: (Shape sh, Elt t)
               => Acc (Array (sh :. Int :. Int) t)
               -> Acc (Array (sh :. Int :. Int) t)
    transpose' arr =
      let rest ::. n ::. m = shape arr
      in A.backpermute (rest ::. m ::. n) (\(idx ::. i ::. j) -> idx ::. j ::. i) arr
Firstly note that transpose' just exchanges the outermost two dimensions of an array. This can perhaps also be done using the recent lens-like additions to Accelerate, but I'm more familiar with the "traditional" method, hence the above. :)
What I did was add an extra outer dimension to all inputs and outputs of forwardAffine; the intention is that all inputs and outputs are equally large in this outer dimension, and forwardAffine' just parallelises over that dimension. weights thus gets an additional outer dimension, reflected in the _batches addition to the pattern match. Furthermore, we just copy the outer dimension of activations over to act' unchanged using All. In the last line, then, the zipWith (*) needs no modification because it is an elementwise operation that doesn't care about an additional dimension. For the fold (note by the way that this is A.sum), we previously wanted to fold over the outermost dimension, but that dimension has now shifted one inward. Hence we first transpose the two outermost dimensions and fold afterwards. The zipWith (+) is again elementwise, so no need to change that.
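The invariant this vectorisation is meant to preserve can be checked with a plain-list model (a sketch, not Accelerate; affine, affineBatched and the parameters are made-up illustrations): the batched layer must compute, per batch element, exactly what the unbatched layer computes.

```haskell
-- Plain-list model (not Accelerate): w is a list of weight rows, b the
-- bias vector, x one activation vector.
affine :: [[Double]] -> [Double] -> [Double] -> [Double]
affine w b x = zipWith (+) b [ sum (zipWith (*) row x) | row <- w ]

-- The batched variant shares one set of parameters across all inputs;
-- the vectorised forwardAffine' is meant to compute exactly this, but
-- as one flat parallel computation instead of an outer map.
affineBatched :: [[Double]] -> [Double] -> [[Double]] -> [[Double]]
affineBatched w b = map (affine w b)
```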
The way I'm hoping to structure the library, f could be practically anything, provided the library or the user provides a derivative for f.
Parallelising the outermost layer (mapping over the batches) as well as the inner computation (the worker) will not work in current Accelerate without vectorising the worker as above, unfortunately. This is because the worker could theoretically do wildly different things for each outer element, and Accelerate currently wants all parallelism to be regular, as it is called. (There is work to try to change this, but here we're talking about very long-term, multiple years in the future.)
However, even making the outer layer sequential wouldn't work currently if you want to collect the output values in an array. (If you want to e.g. sum the results anyway, that would be possible of course.) The loop would be an awhile, but you then need to incrementally build the output array -- and this is the problem. While you can repeatedly concatenate a new result onto the end of the output array using (A.++), that entails an array copy on every iteration and thus leads to the loop having quadratic complexity in the output size.
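The quadratic blow-up is easy to quantify: iteration i copies the i slices already accumulated, so n iterations copy on the order of n(n-1)/2 slices in total. A small sketch of the copy count (illustrative only; the helper name is made up):

```haskell
-- Growing the output by concatenation copies everything accumulated so
-- far on every step: iteration i re-copies i earlier slices of size k,
-- so n iterations copy k * n*(n-1)/2 elements in total.
copiesForAppendLoop :: Int -> Int -> Int
copiesForAppendLoop n k = sum [ i * k | i <- [0 .. n - 1] ]
```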
Another not-really-viable option is to put the batch items in a large tuple; this can only work if you know the number of batch items at accelerate-compile-time, and would be quite hard on the simplifier anyway if there are a non-trivial number of items.
It is expected that later this year we will have an implementation of permute that can perform in-place mutation on the target array if that array is not used anywhere else. When that lands, there is a way to get your desired functionality (with a sequential outer loop) by modifying the output array in each awhile iteration, putting a single slice of the result in place using permute const.
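In list terms, each iteration would then perform the equivalent of the following (a plain-list stand-in with a hypothetical helper name, not the Accelerate API): overwrite exactly one slice of the output, leaving the rest untouched and uncopied.

```haskell
-- Plain-list stand-in (hypothetical helper, not Accelerate): overwrite
-- slice i of an accumulating output with a freshly computed slice,
-- which is what a `permute const` into an otherwise-unshared target
-- could do in place, without copying the other slices.
writeSlice :: Int -> [e] -> [[e]] -> [[e]]
writeSlice i new rows = [ if j == i then new else old | (j, old) <- zip [0 ..] rows ]
```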
(Some of the ideas here should be credited to the Accelerate team members, with whom I spoke today)
from accelerate.
Yeah a typical batch size would be >=64, and even if it were smaller this solution would sacrifice shape polymorphism, which is not nice.
I was thinking about an even uglier setup where the batch is a tuple of input arrays; then you don't lose shape polymorphism, because the arrays in a tuple don't need to have the same shape.
The function representing the model should include the batch index, and backpermute can be used to shift this index to the back so that the forward and backward pass are ordinary accelerate functions -- mapLastIndex not required.
Smart! Yes, if you can require that model functions are rank-polymorphic such that they ignore the leftmost dimensions, then that approach could make it work, since the model function would automatically map over the array without any work whatsoever on your part (apart from permuting the dimensions if necessary, but even that could be avoided if you import the data in the right way).
Then you even get parallelisation on both the batch index and the model computation.
from accelerate.
The only reason I can imagine why you would want a loop over an array to be sequential is because you want to thread some state through, presumably the model you are building. Is that correct or am I way out of whack here?
Would a function like this solve your problems (modulo the name I guess)?
sequentialAFoldl :: (Shape sh, Elt e, Arrays a)
                 => (Acc a -> Acc (Array sh e) -> Acc a)
                 -> Acc a
                 -> Acc (Array (sh :. Int) e)
                 -> Acc a
This function sequentialAFoldl would do a left fold over its third argument (of type Array (sh :. Int) e, here unconventionally considered an array of Array sh e values), threading through a value of type Acc a. It's a bit like foldl in standard Haskell.
Note that you speak about folding over the inner dimension instead of the outer dimension as my hypothetical sequentialAFoldl does; however, note that with the right backpermute you can permute dimensions at will, so that should not represent a loss of expressivity (though possibly of some performance due to shuffling of indices, depending on how things are fused).
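The dimension-permuting argument can be stated in list terms (a sketch, not Accelerate): reducing along the inner dimension is the same as transposing first and reducing along the outer one, which is what a backpermute buys you.

```haskell
import Data.List (transpose)

-- foldOuter reduces each outer-dimension slice; foldInner reduces along
-- the inner dimension by transposing first.
foldOuter, foldInner :: ([Double] -> Double) -> [[Double]] -> [Double]
foldOuter f = map f
foldInner f = map f . transpose
```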
Or would you rather need this?
sequentialABuildl :: (Shape sh, Shape sh', Elt e, Elt e', Arrays a)
                  => (Acc a -> Acc (Array sh e) -> Acc (Array sh' e', a))
                  -> Acc a
                  -> Acc (Array (sh :. Int) e)
                  -> Acc (Array (sh' :. Int) e', a)
which is mostly the same except that it additionally builds up a new array (of shape sh' :. Int
) in the process, which it fills piece by piece along the outer dimension.
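The semantics of sequentialABuildl correspond closely to Data.List.mapAccumL over the outer slices (a plain-list sketch, not the proposed primitive; note only the order of the result pair differs):

```haskell
import Data.List (mapAccumL)

-- Plain-list analogue: thread a state through the outer slices while
-- emitting one output slice per input slice, returning (slices, state).
sequentialABuildlList :: (a -> [e] -> ([e'], a)) -> a -> [[e]] -> ([[e']], a)
sequentialABuildlList f a0 xs =
  let (a', ys) = mapAccumL (\a x -> let (y, a1) = f a x in (a1, y)) a0 xs
  in (ys, a')
```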
Both functions above are hypothetical, but sequentialAFoldl can be implemented using awhile, I think. sequentialABuildl, on the other hand, cannot currently be implemented using the Accelerate surface language with the expected performance characteristics -- it would need to be a new primitive in the internal language. But I've found a need for sequentialABuildl for other reasons, so I wanted to ask if you happened to be looking for the same.
from accelerate.
The loop needs to be sequential in order to avoid nested data parallelism. Sorry if I wasn't clear in the original post: the outer loop is a map (on really any index, as you point out) which can be parallelized, each worker for this map has to be sequential.
Here is a very inefficient implementation of the kind of mapping function I am thinking of, which would probably throw a runtime error, but just so the idea is clear:
mapLastIndex :: forall a sh sh'. (Elt a, Shape sh, Shape sh')
             => (Acc (Array sh a) -> Acc (Array sh' a))  -- worker function
             -> Acc (Array (sh :. Int) a)                -- input array
             -> Exp (sh' :. Int)                         -- expected output shape (not sure if this argument is necessary)
             -> Acc (Array (sh' :. Int) a)               -- output array
mapLastIndex f arr outshape =
  let
    generator :: Exp (sh' :. Int) -> Exp a
    generator ix =
      let a ::. i = ix
      in f (A.slice arr (constant Any ::. i)) A.! a
  in
    A.generate outshape generator
Obviously there are a few problems with this, but the critical part is that the worker function is the forward pass (or backward pass) of the model, and, as far as I can tell, it has to be evaluated sequentially in order for this outer loop to be parallel. A typical example of this worker would be some composition of matrix multiplications, vector additions, and sigmoid functions mapped over the whole array.
Your sequentialAFoldl gives me an idea of how this would be done, but what I am looking for is more like
sequentialFold :: (Shape sh, Elt e, Arrays a)
               => (Acc a -> Exp e -> Acc a)
               -> Acc a
               -> Acc (Array sh e)
               -> Acc a
I'm guessing this can be done using awhile?
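The awhile encoding of such a fold can be sketched in plain Haskell (illustrative names; recursion stands in for awhile): a (state, counter) pair is stepped until the counter reaches the element count, exactly the structure of the (T2 a i) loop shown earlier in the thread.

```haskell
-- Plain-Haskell sketch of the awhile pattern: thread a (state, counter)
-- pair over the elements until the counter runs out.
sequentialFoldList :: (a -> e -> a) -> a -> [e] -> a
sequentialFoldList f a0 es = go (a0, 0)
  where
    n = length es
    go (a, i)
      | i < n     = go (f a (es !! i), i + 1)
      | otherwise = a
```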
from accelerate.
That sequentialFold can indeed probably be implemented using awhile.
I get what your mapLastIndex is trying to do; let's indeed ignore its wildly inefficient/non-working implementation. ;) Whether you can implement that depends on how f is structured. If possible, the most natural way to do this (in current Accelerate) would be to vectorise f so that it runs in parallel over an array of inputs. That array of inputs would then be your arr, which I assume is indexed by your "batch index". The idea then is that you don't need to have a parallel layer inside of a sequential layer; you just have one parallel layer that does it all.
If what you're implementing is just a basic dense neural network, I think this should be very possible. I believe I've written some experimental code at some point that does exactly that, in fact.
Could you perhaps share what the worker function looks like? Perhaps I can give some more concrete pointers about how to vectorise it.
from accelerate.
Side note: I believe my sequentialABuildl would allow you to do what you want directly; you just ignore the state argument and treat it like a map that can incrementally fill the output array sequentially.
from accelerate.
The way I'm hoping to structure the library, f could be practically anything, provided the library or the user provides a derivative for f. This is essentially how it's done in pytorch/tensorflow, the main difference being that I want to use generalized algebraic datatypes to give the user more explicit control over the compute graph.
Anyway, you asked for an example, so suppose we are dealing with a rank-polymorphic affine layer:
forwardAffine :: (Shape sh, Elt t, A.Floating t)
              => Acc (Array (sh :. Int) t)
              -> Acc (Array (sh :. Int :. Int) t, Array (sh :. Int) t)
              -> Acc (Array (sh :. Int) t)
forwardAffine activations params =
  let
    (weights, biases) = unlift params
    rest ::. r ::. c = shape weights
    act' = A.replicate (constant Any ::. r ::. constant All) activations
  in
    A.zipWith (+) biases $ A.fold (+) 0 (A.zipWith (*) weights act')
This typechecks but I didn't test it. In this case f would be \a -> forwardAffine a params for some fixed params. The forward pass of the entire model would be things like this composed with element-wise sigmoids, ReLUs, and anything else the user might want, like exponentials, trigonometry, etc.
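Such a composed worker can be sketched as a list model (a sketch only; the layer shapes and parameter values are arbitrary illustrative choices, and affineL/sigmoid/model are made-up names):

```haskell
-- List-model affine layer: y = b + W.x
affineL :: [[Double]] -> [Double] -> [Double] -> [Double]
affineL w b x = zipWith (+) b [ sum (zipWith (*) row x) | row <- w ]

sigmoid :: Double -> Double
sigmoid x = 1 / (1 + exp (-x))

-- A tiny two-layer model as a composition of layer functions.
model :: [Double] -> [Double]
model = map sigmoid . affineL [[1,-1]] [0] . map (max 0) . affineL [[1,0],[0,1]] [0,0]
```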
Typically this forward pass would then be mapped over the batch index in parallel; same with the backward pass. What do you think would be the most efficient way of approaching it in accelerate?
from accelerate.
Another not-really-viable option is to put the batch items in a large tuple; this can only work if you know the number of batch items at accelerate-compile-time, and would be quite hard on the simplifier anyway if there are a non-trivial number of items.
Yeah a typical batch size would be >=64, and even if it were smaller this solution would sacrifice shape polymorphism, which is not nice.
I think I'm beginning to understand the essential idea though. The function representing the model should include the batch index, and backpermute can be used to shift this index to the back so that the forward and backward pass are ordinary accelerate functions -- mapLastIndex not required.
from accelerate.
This seems great, thanks. I'll close the issue.
from accelerate.