Giter Club home page Giter Club logo

allocationkit.jl's Introduction

AllocationKit

Build Status Code Style: Blue

This is an in-progress package to provide a set of tools for the allocation of intermediate tensors within TensorOperations.jl.

Rationale

The problem of allocating intermediate tensors is a difficult one. The optimal allocation strategy seems to be dependent on the size of the tensors, the number of operations, the number of threads, the memory bandwidth, and the cache size. This package aims to provide a set of tools to experiment with, in order to hopefully provide an efficient allocation strategy for a given problem.

The expressions that are generated by TensorOperations allocate new tensors in a variety of ways. The first and foremost is the allocation of output tensors, i.e. objects that are used further down the line in the program, and should thus in almost all cases be just allocated like any julia variable, and handled by the garbage collector. If a user desires to change this behaviour, this is probably something that should be handled externally, and not by TensorOperations, thus this type of allocation is not focussed on by this package. The code pattern that achieves this would just use in-place operations for output tensors, and would look something like this:

C = my_allocator()
@tensor C[a,b,c] = A[a,b,d] * B[d,c]

The second type of allocation is the intermediate tensors that arise from the pairwise handling of larger tensor networks. I will refer to these as external intermediate tensors. In practise, these are very hard to remove, and in general settings, it seems that these should and could also be handled by the native garbage collection mechanism. However, often the same network is evaluated many times with the same size of input tensors, which arises for example when building up a Krylov subspace for a linear solver or an eigenvalue solver. In these cases, it is possible to reuse the same intermediate tensors, and thus hoist the allocation out of the loop. For this case, we aim to provide some solutions that differ in the amount of additional work/code a user is willing to supply, as well as the performance gains that can be achieved.

Finally, there is the case of the intermediate tensors that arise from the pairwise contraction of tensors when using the common pattern of TTGT (transpose-transpose-gemm-transpose), where the input tensors are first permuted to a form where the contraction can be dispatched to a BLAS routine. I will refer to these as internal intermediate tensors. Note that this is implementation-dependent, and various alternatives exist which already aim to handle this problem. For example, tblis does not require these additional intermediate tensors, and other implementations exist (https://arxiv.org/abs/1607.00145), ... Additionally, these intermediate tensors are shorter-lived, and thus may benefit from a different strategy than the intermediate tensors that arise from the pairwise contraction of tensors. We will also provide some examples of these strategies, but this should be considered separately.

Benchmarking

As the problem is inherently dependent on the network, the size of the tensors, the hardware, and the surrounding code (e.g. the number of threads, the available memory, ...), I currently only provide a limited set of benchmarks, but feel free to open an issue and I'll gladly consider adding it to the benchmark suite.

Because of the aforementioned reasons, the benchmarks will be run without envoking the garbage collector inbetween different samples, and we will focus on the repeated evaluation of the same tensor network. This should hopefully replicate a more realistic setting than simply isolating a single contraction.

Please also note that my experience with benchmarking allocations is limited, so any suggestions or notes are more than welcome.

Solutions

Manual allocation and deallocation

One of the simplest solutions to try is to manually allocate and deallocate the intermediate tensors. While this will not reduce the amount of allocations, it will bypass the garbage collector, and at the very least alleviate these costs. Additionally, it might reduce the total memory usage at a given time, as memory will be freed as early as possible.

Nevertheless, this process might be error-prone, as interruption of the program might lead to memory-leaks, while explicit finalizers encur some performance cost.

allocationkit.jl's People

Contributors

lkdvos avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.