Giter Club home page Giter Club logo

array's Introduction

About

This library provides a multidimensional array class for C++, with the following design goals:

  • Enable specification of array parameters as compile-time constants per parameter, enabling more efficient code generation, while retaining run-time flexibility where needed.
  • Provide an API following the conventions of the C++ STL where possible.
  • Minimal dependencies and requirements (the library is currently a single header file, and depends only on the C++ STL).

The library uses some ideas established in other existing projects, such as numpy, Halide, and Eigen. Array shapes are specified as a list of N dimensions, where each dimension has parameters such as an extent and a stride. Array references and objects use shape objects to map N-dimensional indices to a flat index. N-dimensional indices are mapped to flat offsets with the following formula:

flat_offset = (x0 - min0)*stride0 + (x1 - min1)*stride1 + ... + (xN - minN)*strideN

where:

  • xN are the indices in each dimension.
  • minN are the mins in each dimension. The min is the value of the first in-range index in this dimension (the max is minN + extentN - 1).
  • strideN are the distances in the flat offsets between elements in each dimension.

Arrays efficiently support advanced manipulations like cropping, slicing, and splitting arrays and loops, all while preserving compile-time constant parameters when possible. Although it is a heavily templated library, incorrect usage generates informative and helpful error messages. Typically, an issue will result in only one error message, located at the site of the problem in user code. This is something we test for.

Many other libraries offering multi-dimensional arrays or tensors allow compile-time constant shapes. However, most if not all of them only allow either all of the shape parameters to be compile-time constant, or none of them. This is really limiting; often only a few key parameters of a shape need to be compile-time constant for performance, while other dimensions need flexibility to accommodate runtime-valued shape parameters. Some examples of this are:

  • 'Chunky' image formats with a small fixed number of channels.
  • Matrices where one dimension represent variables intrinsic to the problem, while the other dimension represents a number of samples of data.
  • Algorithms optimized by splitting or tiling intermediate stages will have intermediate buffers that have a constant extent in the dimensions that are split or tiled.

Some other features of the library are:

  • CUDA support for use in __device__ functions.
  • Einstein reduction helpers, enabling many kinds of reductions and other array operations to be expressed safely.

For more detailed documentation, see the generated documentation.

Usage

Shapes

The basic types provided by the library are:

  • dim<Min, Extent, Stride>, a description of a single dimension. The template parameters specify a compile-time constant min, extent, or stride, or are dynamic (meaning unknown) and are specified at runtime.
  • shape<Dim0, Dim1, ...>, a description of multiple dimensions. Dim0 is referred to as the innermost dimension.
  • array<T, Shape, Allocator>, a container following the conventions of std::vector where possible. This container manages the allocation of a buffer associated with a Shape.
  • array_ref<T, Shape>, a wrapper for addressing existing memory with a shape Shape.

To define an array, define a shape type, and use it to define an array object:

  using my_3d_shape_type = shape<dim<>, dim<>, dim<>>;
  constexpr int width = 16;
  constexpr int height = 10;
  constexpr int depth = 3;
  my_3d_shape_type my_3d_shape(width, height, depth);
  array<int, my_3d_shape_type> my_array(my_3d_shape);

General shapes and arrays like this have the following built-in aliases:

  • shape_of_rank<N>, an N-dimensional shape.
  • array_ref_of_rank<T, N> and array_of_rank<T, N, Allocator>, N-dimensional arrays with a shape of shape_of_rank<N>.

Access and iteration

Accessing array or array_ref is done via operator(...) and operator[index_type]. There are both variadic and index_type overloads of operator(). index_type is a specialization of std::tuple defined by shape (and array and array_ref), e.g. my_3d_shape_type::index_type.

  for (int z = 0; z < depth; z++) {
    for (int y = 0; y < height; y++) {
      for (int x = 0; x < width; x++) {
        // Variadic version:
        my_array(x, y, z) = 5;
        // Or the index_type version:
        my_array[{x, y, z}] = 5;
      }
    }
  }

array::for_each_value and array_ref::for_each_value calls a function with a reference to each value in the array.

  my_array.for_each_value([](int& value) {
    value = 5;
  });

for_all_indices is a free function taking a shape object and a function to call with every index in the shape. for_each_index is similar, calling a free function with the index as an instance of the index type my_3d_shape_type::index_type.

  for_all_indices(my_3d_shape, [&](int x, int y, int z) {
    my_array(x, y, z) = 5;
  });
  for_each_index(my_3d_shape, [&](my_3d_shape_type::index_type i) {
    my_array[i] = 5;
  });

The order in which each of for_each_value, for_each_index, and for_all_indices execute their traversal over the shape is defined by shape_traits<Shape>. The default implementation of shape_traits<Shape>::for_each_index iterates over the innermost dimension as the innermost loop, and proceeds in order to the outermost dimension.

  my_3d_shape_type my_shape(2, 2, 2);
  for_all_indices(my_shape, [](int x, int y, int z) {
    std::cout << x << ", " << y << ", " << z << std::endl;
  });
  // Output:
  // 0, 0, 0
  // 1, 0, 0
  // 0, 1, 0
  // 1, 1, 0
  // 0, 0, 1
  // 1, 0, 1
  // 0, 1, 1
  // 1, 1, 1

The default implementation of shape_traits<Shape>::for_each_value iterates over a dynamically optimized shape. The order will vary depending on the properties of the shape.

There are overloads of for_all_indices and for_each_index accepting a permutation to indicate the loop order. In this example, the permutation <2, 0, 1> iterates over the z dimension as the innermost loop, then x, then y.

  for_all_indices<2, 0, 1>(my_shape, [](int x, int y, int z) {
    std::cout << x << ", " << y << ", " << z << std::endl;
  });
  // Output:
  // 0, 0, 0
  // 0, 0, 1
  // 1, 0, 0
  // 1, 0, 1
  // 0, 1, 0
  // 0, 1, 1
  // 1, 1, 0
  // 1, 1, 1

Compile-time constant shapes

In the previous examples, no array parameters are compile time constants, so all of these accesses and loops expand to a flat_offset expression where the mins, extents, and strides are runtime variables. This can prevent the compiler from generating efficient code. For example, the compiler may be able to auto-vectorize these loops, but if the stride of the dimension accessed by the vectorized loop is a runtime variable, the compiler will have to generate gathers and scatters instead of vector load and store instructions, even if the stride is one at runtime.

To avoid this, we need to make array parameters compile time constants. However, while making array parameters compile time constants helps the compiler generate efficient code, it also makes the program less flexible.

This library helps balance this tradeoff by enabling any of the array parameters to be compile time constants, but not require it. Which parameters should be made into compile time constants will vary depending on the use case. A common case is to make the innermost dimension have stride 1:

  using my_dense_3d_shape_type = shape<
      dim</*Min=*/dynamic, /*Extent=*/dynamic, /*Stride=*/1>,
      dim<>,
      dim<>>;
  array<char, my_dense_3d_shape_type> my_dense_array({16, 3, 3});
  for (auto x : my_dense_array.x()) {
    // The compiler knows that each loop iteration accesses
    // elements that are contiguous in memory for contiguous x.
    my_dense_array(x, y, z) = 0;
  }

A dimension with unknown min and extent, and stride 1, is common enough that it has a built-in alias dense_dim<>, and shapes with a dense first dimension are common enough that shapes and arrays have the following built-in aliases:

  • dense_shape<N>, an N-dimensional shape with the first dimension being dense.
  • dense_array_ref<T, N> and dense_array<T, N, Allocator>, N-dimensional arrays with a shape of dense_shape<N>.

There are other common examples that are easy to support. A very common array is an image where 3-channel RGB or 4-channel RGBA pixels are stored together in a 'chunky' format.

template <int Channels, int XStride = Channels>
using chunky_image_shape = shape<
    strided_dim</*Stride=*/XStride>,
    dim<>,
    dense_dim</*Min=*/0, /*Extent=*/Channels>>;
array<uint8_t, chunky_image_shape<3>> my_chunky_image({1920, 1080, {}});

strided_dim<> is another alias for dim<> where the min and extent are unknown, and the stride may be a compile-time constant. image.h is a small helper library of typical image shape and object types defined using arrays, including chunky_image_shape.

Another common example is matrices indexed (row, column) with the column dimension stored densely:

  using matrix_shape = shape<dim<>, dense_dim<>>;
  array<double, matrix_shape> my_matrix({10, 4});
  for (auto i : my_matrix.i()) {
    for (auto j : my_matrix.j()) {
      // This loop ordering is efficient for this type.
      my_matrix(i, j) = 0.0;
    }
  }

There are also many use cases for matrices with small constant sizes. This library provides auto_allocator<T, N>, an std::allocator compatible allocator that only allocates buffers of N small fixed sized objects with automatic storage. This makes it possible to define a small matrix type that will not use any dynamic memory allocation:

template <int M, int N>
using small_matrix_shape = shape<
    dim<0, M>,
    dense_dim<0, N>>;
template <typename T, int M, int N>
using small_matrix = array<T, small_matrix_shape<M, N>, auto_allocator<T, M*N>>;
small_matrix<float, 4, 4> my_small_matrix;
// my_small_matrix is only one fixed size allocation, no new/delete calls
// happen. sizeof(small_matrix) = sizeof(float) * 4 * 4 + (overhead)

matrix.h is a small helper library of typical matrix shape and object types defined using arrays, including the examples above.

Slicing, cropping, and splitting

Shapes and arrays can be sliced and cropped using interval<Min, Extent> objects, which are similar to dim<>s. They can have either a compile-time constant or runtime valued min and extent. range(begin, end) is a helper functions to construct an interval.

  // Slicing
  array_ref_of_rank<int, 2> channel1 = my_array(_, _, 1);
  array_ref_of_rank<int, 1> row4_channel2 = my_array(_, 4, 2);

  // Cropping
  array_ref_of_rank<int, 3> top_left = my_array(interval<>{0, 2}, interval<>{0, 4}, _);
  array_ref_of_rank<int, 2> center_channel0 = my_array(interval<>{1, 2}, interval<>{2, 4}, 0);

The _ or all constants are placeholders indicating the entire dimension should be preserved. Dimensions that are sliced are removed from the shape of the array.

When iterating a dim, it is possible to split it first by either a compile-time constant or a runtime-valued split factor. A split dim produces an iterator range that produces interval<> objects. This allows easy tiling of algorithms:

  constexpr index_t x_split_factor = 3;
  const index_t y_split_factor = 5;
  for (auto yo : split(my_array.y(), y_split_factor)) {
    for (auto xo : split<x_split_factor>(my_array.x())) {
      auto tile = my_array(xo, yo, _);
      for (auto x : tile.x()) {
        // The compiler knows this loop has a fixed extent x_split_factor!
        tile(x, y, z) = x;
      }
    }
  }

Both loops have extents that are not divided by their split factors. To avoid generating an array_ref referencing data out of bounds of the original array, the split iterators modify the last iteration. The behavior of each kind of split is different:

  • Because the extent of yo can vary, it is reduced on the last iteration. This strategy can accommodate dimensions of any extent.
  • Because the extent of xo must be a constant, the last iteration will be shifted to overlap the previous iteration. This strategy requires the extent of the dimension being split is greater than the split factor (but not a multiple!)

Compile-time constant split factors produce ranges with compile-time extents, and shapes and arrays cropped with these ranges will have a corresponding dim<> with a compile-time constant extent. This allows potentially significant optimizations to be expressed relatively easily!

Einstein reductions

The ein_reduce.h header provides Einstein notation reductions and summation helpers, similar to np.einsum or tf.einsum. These are zero-cost abstractions for describing loops that allow expressing a wide variety of array operations. Einstein notation expression operands are constructed using the ein<i, j, ...>(x) helper function, where x can be any callable object, including an array<> or array_ref<>. i, j, ... are constexpr integers indicating which dimensions of the reduction operation are used to evaluate x. Therefore, the number of arguments of x must match the number of dimensions provided to ein. Operands can be combined into larger expressions using typical binary operators.

Einstein notation expressions can be evaluated using one of the following functions:

  • ein_reduce(expression), evaluate an arbitrary Einstein notation expression.
  • lhs = make_ein_sum<T, i, j, ...>(rhs), evaluate the summation ein<i, j, ...>(lhs) += rhs, and return lhs. The shape of lhs is inferred from the expression.

Here are some examples using these reduction operations to compute summations:

  // Name the dimensions we use in Einstein reductions.
  enum { i = 0, j = 1, k = 2, l = 3 };

  // Dot product dot1 = dot2 = x.y:
  vector<float> x({10});
  vector<float> y({10});
  float dot1 = make_ein_sum<float>(ein<i>(x) * ein<i>(y));
  float dot2 = 0.0f;
  ein_reduce(ein(dot2) += ein<i>(x) * ein<i>(y));

  // Matrix multiply C1 = C2 = A*B:
  matrix<float> A({10, 10});
  matrix<float> B({10, 15});
  matrix<float> C1({10, 15});
  fill(C1, 0.0f);
  ein_reduce(ein<i, j>(C1) += ein<i, k>(A) * ein<k, j>(B));
  auto C2 = make_ein_sum<float, i, j>(ein<i, k>(A) * ein<k, j>(B));

We can use arbitrary functions as expression operands:

  // Cross product array crosses_n = x_n x y_n:
  using vector_array = array<float, shape<dim<0, 3>, dense_dim<>>>;
  vector_array xs({3, 100});
  vector_array ys({3, 100});
  vector_array crosses({3, 100});
  auto epsilon3 = [](int i, int j, int k) { return sgn(j - i) * sgn(k - i) * sgn(k - j); };
  ein_reduce(ein<i, l>(crosses) += ein<i, j, k>(epsilon3) * ein<j, l>(xs) * ein<k, l>(ys));

These operations generally produce loop nests that are as readily optimized by the compiler as hand-written loops. In this example, crosses, xs, and ys have shape shape<dim<0, 3>, dense_dim<>>, so the compiler will see small constant loops and likely be able to optimize this to similar efficiency as hand-written code, by unrolling and evaluating the function at compile time. The compiler also should be able to efficiently vectorize the l dimension of the ein_reduce, because that dimension has a constant stride 1.

The expression can be another kind of reduction, or not a reduction at all:

  // Matrix transpose AT = A^T:
  matrix<float> AT({10, 10});
  ein_reduce(ein<i, j>(AT) = ein<j, i>(A));

  // Maximum of each x-y plane of a 3D volume:
  dense_array<float, 3> volume({8, 12, 20});
  dense_array<float, 1> max_xy({20});
  auto r = ein<k>(max_xy);
  ein_reduce(r = max(r, ein<i, j, k>(volume)));

Reductions can have a mix of result and operand types:

  // Compute X1 = X2 = DFT[x]:
  using complex = std::complex<float>;
  dense_array<complex, 2> W({10, 10});
  for_all_indices(W.shape(), [&](int j, int k) {
    W(j, k) = exp(-2.0f * pi * complex(0, 1) * (static_cast<float>(j * k) / 10));
  });
  // Using `make_ein_sum`, returning the result:
  auto X1 = make_ein_sum<complex, j>(ein<j, k>(W) * ein<k>(x));
  // Using `ein_reduce`, computing the result in place:
  vector<complex> X2({10}, 0.0f);
  ein_reduce(ein<j>(X2) += ein<j, k>(W) * ein<k>(x));

These reductions also compose well with loop transformations like split and array operations like slicing and cropping. For example, a matrix multiplication can be tiled like so:

  // Adjust this depending on the target architecture. For AVX2, vectors are 256-bit.
  constexpr index_t vector_size = 32 / sizeof(float);

  // We want the tiles to be big without spilling the accumulators to the stack.
  constexpr index_t tile_rows = 4;
  constexpr index_t tile_cols = vector_size * 3;

  for (auto io : split<tile_rows>(C.i())) {
    for (auto jo : split<tile_cols>(C.j())) {
      auto C_ijo = C(io, jo);
      fill(C_ijo, 0.0f);
      ein_reduce(ein<i, j>(C_ijo) += ein<i, k>(A) * ein<k, j>(B));
    }
  }

This generates the following machine code(*) for the inner loop using clang 18 with -O2 -ffast-math:

vbroadcastss	(%rsi,%rdi,4), %ymm12
vmovups	-64(%r12,%r15,4), %ymm13
vmovups	-32(%r12,%r15,4), %ymm14
vmovups	(%r12,%r15,4), %ymm15
addq	%rbx, %r15
vfmadd231ps	%ymm12, %ymm13, %ymm11
vfmadd231ps	%ymm12, %ymm14, %ymm10
vfmadd231ps	%ymm12, %ymm15, %ymm9
vbroadcastss	(%r8,%rdi,4), %ymm12
vfmadd231ps	%ymm12, %ymm13, %ymm8
vfmadd231ps	%ymm12, %ymm14, %ymm7
vfmadd231ps	%ymm12, %ymm15, %ymm6
vbroadcastss	(%r10,%rdi,4), %ymm12
vfmadd231ps	%ymm12, %ymm13, %ymm5
vfmadd231ps	%ymm12, %ymm14, %ymm4
vfmadd231ps	%ymm12, %ymm15, %ymm3
vbroadcastss	(%rdx,%rdi,4), %ymm12
incq	%rdi
vfmadd231ps	%ymm13, %ymm12, %ymm2
vfmadd231ps	%ymm14, %ymm12, %ymm1
vfmadd231ps	%ymm12, %ymm15, %ymm0
cmpq	%rdi, %r13
jne	.LBB8_12

This is 40-50x faster than a naive C implementation of nested loops on my machine, and it should be within a factor of 2 of the peak possible performance. A similar example that is only a little more complicated achieves around 90% of the peak possible performance.

(*) Unfortunately, this doesn't generate performant code currently and requires a few tweaks to work around an issue in LLVM. See the matrix example for the code that produces the above assembly. To summarise, it is currently necessary to perform the accumulation into a temporary buffer instead of accumulating directly into the output.

CUDA support

Most of the functions in this library are marked with __device__, enabling them to be used in CUDA code. This includes array_ref<T, Shape> and most of its helper functions. The exceptions to this are functions and classes that allocate memory, primarily array<T, Shape, Alloc>.

Try it on Compiler Explorer

This library is available on Compiler Explorer as Array.

array's People

Contributors

benweiss avatar dsharlet avatar dsharletg avatar fbleibel-g avatar jiawen avatar woodychow avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

array's Issues

Comparison with mdspan/mdarray

midspan and mdarray are being proposed for inclusion in the C++ standard library. Looking at this library, it seems that some of the basic concerns are the same, but the proposed model and solution is different. It would be very helpful for users that are deciding about which low-level type to use to have a comparison between the two approaches.

`array::set_shape` is unsafe

array::set_shape does not destroy elements in the old shape but not in the new shape, and it does not construct elements in the new shape but not in the old shape.

Remove `shape<>` specialization

The only reason this specialization is necessary is because shape<Dims...> defines two default constructors when Dims... is empty.

Consider free function to access dims instead of member functions

When accessing a dim of a shape, ideally we would allow simply auto x = s.dim<0>(). Unfortunately, the reality of template syntax issues in many situations means it has to be auto x = s.template dim<0>().

This is the main reason I added named aliases like auto x = s.x().

But, maybe a better direction to go is a free function dim<i>(s).

3x1x3 interleaved layout returns y().stride() of 1 instead of 9

Hi Dillon 👋

Hope you're doing well.

Ran into a bug today that I'd like to share with you.

Creating a 3x1x3 interleaved shape and querying the y().stride() returns 1 instead of 9. I believe it is fine for the stride to not be well-defined in terms of the design assumptions, but when the stride is used to calculate number of bytes per row, it becomes problematic.

This can be triggered by wrapping it in a non-owning view such as OpenCV cv::Mat which checks against a minimum step size (https://github.com/opencv/opencv/blob/4.x/modules/core/src/matrix.cpp#L434) and fails.

Use cases for 1px height images include LUTs and unit tests.

For OpenCV we can work around it by updating those specific wrappers to test for height == 1 at run-time and switch to a different logic for the number of bytes per row, or even algorithmically by allocating an extra row that will not be used, but more generally it seems undesirable for clients of the array library to have to work around it and I was wondering if there's something we can to do address this within the library.

dim allows copy construction with compile-time incompatible strides

Probably what is going on is

array/array.h

Lines 363 to 371 in b36aaef

/** Copy another dim object, possibly with different compile-time template
* parameters. */
template <index_t CopyMin, index_t CopyExtent, index_t CopyStride,
class = internal::enable_if_compatible<Min, CopyMin>,
class = internal::enable_if_compatible<Extent, CopyExtent>,
class = internal::enable_if_compatible<Stride, CopyStride>>
dim(const dim<CopyMin, CopyExtent, CopyStride>& other)
: dim(other.min(), other.extent(), other.stride()) {
}
fails due to the enable_ifs. Then the argument is converted to a range and

array/array.h

Lines 359 to 360 in b36aaef

dim(const base_range& range, index_t stride = Stride)
: dim(range.min(), range.extent(), stride) {}
is used.

This is a really ugly behavior, something that should fail at compile time builds and produces subtly incorrect results at runtime instead!

const_cast for array_ref<const T>

When doing interop with other libraries (e.g., CoreML's MLMultiArray), we occasionally need to turn a array_ref<const T> to a array_ref<T>, since the consumer wants a T* even though it is not written to.

Should we provide one?

Error messages can be unwieldy

When using constructors or operator() incorrectly, the error messages are often huge and hard to decipher, for a few reasons:

  1. The template parameters of all of the shape parameters make symbol names huge.
  2. The errors occur deep in the call stack in variadic templates, when the error should occur closer to the user's code.

Issue (2) could be improved with more aggressive use of static_assert earlier in the API.

Issue (1) is harder to address, but would be mitigated somewhat by addressing the second issue (because fewer symbols would be reported as part of errors).

`make` broken on Apple Silicon Macs with Xcode

It complains:

$ make
mkdir -p obj/test
c++ -I. -c -o obj/test/algorithm.o test/algorithm.cpp  -O2 -ffast-math -fstrict-aliasing -march=native  -std=c++14 -Wall
clang: error: the clang compiler does not support '-march=native'
make: *** [obj/test/algorithm.o] Error 1

This is because Apple Clang bundled with Xcode no longer supports the flag. They want you to pass something like -mcpu=apple-a14.

Slicing does not reduce rank

It would better match behavior of similar systems, and probably would be more useful, if slicing an array reduced its rank. Currently, the behavior of a(_, 5) is to produce an array_ref with a 2D shape {a.x(), {5, 1, a.y().stride()}. It would be better if it produced a 1D shaped result with a.x() as the only dimension.

call of overloaded ‘for_each_value_in_order<(nda::shape<>::rank() - 1)>(nda::shape<>::index_type, void (&)(const int&, int&), std::pair<const int*, std::tuple<> >&, std::pair<int*, std::tuple<> >&)’ is ambiguous

mkdir -p obj/test g++ -I. -c -o obj/test/algorithm.o test/algorithm.cpp -O2 -ffast-math -fstrict-aliasing -march=native -std=c++14 -Wall In file included from test/algorithm.cpp:15: ./array.h: In instantiation of ‘void nda::for_each_value_in_order(const Shape&, const ShapeA&, PtrA, const ShapeB&, PtrB, Fn&&) [with Shape = nda::shape<>; ShapeA = nda::shape<>; PtrA = const int*; ShapeB = nda::shape<>; PtrB = int*; Fn = void (&)(const int&, int&); <template-parameter-1-7> = void]’: ./array.h:1796:28: required from ‘static void nda::copy_shape_traits<ShapeSrc, ShapeDst>::for_each_value(const ShapeSrc&, TSrc, const ShapeDst&, TDst, Fn&&) [with Fn = void (&)(const int&, int&); TSrc = const int*; TDst = int*; ShapeSrc = nda::shape<>; ShapeDst = nda::shape<>]’ ./array.h:2598:56: required from ‘void nda::copy(const nda::array_ref<TSrc, ShapeSrc>&, const nda::array_ref<TDst, ShapeDst>&) [with TSrc = const int; TDst = int; ShapeSrc = nda::shape<>; ShapeDst = nda::shape<>; <template-parameter-1-5> = void]’ ./array.h:2614:7: required from ‘void nda::copy(const nda::array<TSrc, ShapeSrc, AllocSrc>&, nda::array<TDst, ShapeDst, AllocDst>&) [with TSrc = int; TDst = int; ShapeSrc = nda::shape<>; ShapeDst = nda::shape<>; AllocSrc = std::allocator<int>; AllocDst = std::allocator<int>; <template-parameter-1-7> = void]’ test/algorithm.cpp:76:12: required from here ./array.h:1602:55: error: call of overloaded ‘for_each_value_in_order<(nda::shape<>::rank() - 1)>(nda::shape<>::index_type, void (&)(const int&, int&), std::pair<const int*, std::tuple<> >&, std::pair<int*, std::tuple<> >&)’ is ambiguous 1602 | internal::for_each_value_in_order<Shape::rank() - 1>(shape.extent(), fn, a, b); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~ ./array.h:1361:41: note: candidate: ‘void nda::internal::for_each_value_in_order(const ExtentType&, Fn&&, Ptrs ...) [with long unsigned int D = 18446744073709551615; ExtentType = std::tuple<>; Fn = void (&)(const int&, int&); Ptrs = {std::pair<const int*, std::tuple<> >, std::pair<int*, std::tuple<> >}]’ 1361 | NDARRAY_INLINE NDARRAY_HOST_DEVICE void for_each_value_in_order( | ^~~~~~~~~~~~~~~~~~~~~~~ ./array.h:1369:41: note: candidate: ‘void nda::internal::for_each_value_in_order(const std::tuple<>&, Fn&&, Ptrs ...) [with long unsigned int D = 18446744073709551615; Fn = void (&)(const int&, int&); Ptrs = {std::pair<const int*, std::tuple<> >, std::pair<int*, std::tuple<> >}]’ 1369 | NDARRAY_INLINE NDARRAY_HOST_DEVICE void for_each_value_in_order( | ^~~~~~~~~~~~~~~~~~~~~~~ ./array.h: In instantiation of ‘void nda::for_each_value_in_order(const Shape&, Ptr, Fn&&) [with Shape = nda::shape<>; Ptr = int*; Fn = nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]::<lambda(int&)>&; <template-parameter-1-4> = void]’: ./array.h:1771:28: required from ‘static void nda::shape_traits<Shape>::for_each_value(const Shape&, Ptr, Fn&&) [with Ptr = int*; Fn = nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]::<lambda(int&)>&; Shape = nda::shape<>]’ ./array.h:1978:38: required from ‘void nda::array_ref<T, Shape>::for_each_value(Fn&&) const [with Fn = nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]::<lambda(int&)>; <template-parameter-2-2> = void; T = int; Shape = nda::shape<>]’ ./array.h:2747:3: required from ‘void nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]’ ./array.h:2752:11: required from ‘void nda::generate(nda::array<T, Shape, Alloc>&, Generator&&) [with T = int; Shape = nda::shape<>; Alloc = std::allocator<int>; Generator = int (&)(); <template-parameter-1-5> = int]’ test/algorithm.cpp:73:19: required from here ./array.h:1585:55: error: call of overloaded ‘for_each_value_in_order<(nda::shape<>::rank() - 1)>(nda::shape<>::index_type, nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]::<lambda(int&)>&, std::pair<int*, std::tuple<> >&)’ is ambiguous 1585 | internal::for_each_value_in_order<Shape::rank() - 1>(shape.extent(), fn, base_and_stride); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ./array.h:1361:41: note: candidate: ‘void nda::internal::for_each_value_in_order(const ExtentType&, Fn&&, Ptrs ...) [with long unsigned int D = 18446744073709551615; ExtentType = std::tuple<>; Fn = nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]::<lambda(int&)>&; Ptrs = {std::pair<int*, std::tuple<> >}]’ 1361 | NDARRAY_INLINE NDARRAY_HOST_DEVICE void for_each_value_in_order( | ^~~~~~~~~~~~~~~~~~~~~~~ ./array.h:1369:41: note: candidate: ‘void nda::internal::for_each_value_in_order(const std::tuple<>&, Fn&&, Ptrs ...) [with long unsigned int D = 18446744073709551615; Fn = nda::generate(const nda::array_ref<T, Shape>&, Generator&&) [with T = int; Shape = nda::shape<>; Generator = int (&)(); <template-parameter-1-4> = int]::<lambda(int&)>&; Ptrs = {std::pair<int*, std::tuple<> >}]’ 1369 | NDARRAY_INLINE NDARRAY_HOST_DEVICE void for_each_value_in_order( | ^~~~~~~~~~~~~~~~~~~~~~~ ./array.h: In instantiation of ‘void nda::for_each_value_in_order(const Shape&, const ShapeA&, PtrA, const ShapeB&, PtrB, Fn&&) [with Shape = nda::shape<>; ShapeA = nda::shape<>; PtrA = const int*; ShapeB = nda::shape<>; PtrB = const int*; Fn = nda::array_ref<T, Shape>::operator!=(const nda::array_ref<T, Shape>&) const [with T = const int; Shape = nda::shape<>]::<lambda(nda::array_ref<const int, nda::shape<> >::const_reference, nda::array_ref<const int, nda::shape<> >::const_reference)>&; <template-parameter-1-7> = void]’: ./array.h:1796:28: required from ‘static void nda::copy_shape_traits<ShapeSrc, ShapeDst>::for_each_value(const ShapeSrc&, TSrc, const ShapeDst&, TDst, Fn&&) [with Fn = nda::array_ref<T, Shape>::operator!=(const nda::array_ref<T, Shape>&) const [with T = const int; Shape = nda::shape<>]::<lambda(nda::array_ref<const int, nda::shape<> >::const_reference, nda::array_ref<const int, nda::shape<> >::const_reference)>; TSrc = const int*; TDst = const int*; ShapeSrc = nda::shape<>; ShapeDst = nda::shape<>]’

Use helper class for individual compile-time constants

Currently, dim has 3 template parameter compile-time constants. Each of these is implemented manually, resulting in a ~3x code duplication.

More importantly, methods like shape::min, shape::extent, and shape::stride lose compile-time constants, because they are simply instances of std::tuple<index_t, ...>.

This could be avoided if dim::min, dim::extent, etc. returned a constant<Value = UNK> object, and would reduce the code duplication.

The downside is errors would become less clear, and it would require implicit conversions from the new object to index_t for many basic operations, and might be annoying if it fails in some cases, or require a lot of operator overloading to implement <, <=, etc.

Find a good syntax for expressing ranges

The current interval/range syntax is pretty verbose. It would be great to be able to write something like a(x0:x1, y0:y1, z) to crop x, y and slice z, similar to numpy style. In C++ we are more limited. The current syntax is something like a(interval(x0, x1), interval(y0, y1), z), or a(range<>(x0, x1 - x0), range<>(y0, y1 - y0), z).

Possible options I can think of are:

  • Some single-letter helper function V(begin, end): a(V(x0, x1), V(y0, y1), z). If you have an active imagination, V is kind of like the for-all symbol...
  • If x0, y0 where some kind of class, we could do some crazy (and possibly expensive) trick with operator->: a(x0 -> x1, y0 -> y1, z). It is unreasonable to require x0, y0 to be a particular type though.

Some additional thoughts:

  • Both intervals and ranges are important. Ranges are useful because compile-time constant extents are possible (hence the template for range<>).
  • Global operators can't be overloaded for index-like types, so we couldn't do something like make x0 > x1 return a bool_or_range implicitly convertible type. Nor should we...

make_compact should give dimensions constexpr strides where possible

shape<dense_dim<0, 10>, dim<>> s({}, {0, 20});
auto compact = make_compact(s);

Currently, compact will be of the same shape type as s. However, we could produce shape<dense_dim<0, 10>, strided_dim<10>> safely instead.

This arises in practice when tiling algorithms:

for (auto yo : split<4>(output.y()) {
  for (auto xo : split<4>(output.x()) {
    auto output_tile = input(xo, yo);
    auto tile_buffer = make_array<T>(make_compact(output_tile.shape()));
  }
}

In an example like this, we'd like tile_buffer.shape().y().stride() to evaluate to a constant, rather than a runtime value.

Consider adding "flat iterators" for arrays

It would be great to have "flat iterators" for arrays. Among other things, this could enable the standard <algorithm> algorithms to work on arrays where it makes sense, and we could get rid of our subset of these mirroring the STL. However, it is difficult to implement a performant flat iterator, due to the branchiness at each iteration required in the implementation.

Shapes should only get automatically defined strides when they are used to construct arrays

Currently, shapes with unknown strides are resolved upon shape construction. It would be better if shape strides were only resolved when they are used to construct an array. It is questionable whether unknown strides should be resolved when using the shape to construct an array_ref. Unfortunately, it would likely break a large amount of existing code if array_ref did not get automatic strides.

whole-matrix einsum is slower than hand-written code

The current timing I see for the matrix tests is:

reduce_matrix time: 10.6356 ms
einsum time: 13.4737 ms
reduce_tiles time: 0.890935 ms
einsum_tiles time: 0.929729 ms

The einsum_tiles version matches reduce_tiles closely enough to be zero-cost, but something is wrong with einsum vs. reduce_matrix. This is strange because it is the easier case.

Implement `shape::is_one_to_one` and `shape::is_subset_of`

In order to determine if a shape maps two different indices to the same flat indices (i.e. is_one_to_one is false), we need to solve:

x_0*S_0 + x_1*S_1 + x_2*S_2 + ... == y_0*S_0 + y_1*S_1 + y_2*S_2 + ...

where x_i, y_i are (different) indices, and S_i are the strides of this shape. This is equivalent to:

(x_0 - y_0)*S_0 + (x_1 - y_1)*S_1 + ... == 0

We don't actually care what x_i and y_i are, so this is equivalent to:

x_0*S_0 + x_1*S_1 + x_2*S_2 + ... == 0

where x_i != 0. This is a linear diophantine equation, and we already have one solution at x_i = 0, so we just need to find other solutions, and check that they are in range.

This is pretty hard. Wikipedia research so far suggests we need to rewrite the equation as a system of linear diophantine equations, and then use the "Hermite normal form" to get the unbounded solutions, and then do some combinatoric search for the in-bounds solutions. This is an NP-hard problem, but the size of the problems are small, and I don't think these functions need to be fast.

is_subset_of is possibly similar.

Rename stack_allocator

The name stack_allocator is misleading because the memory may not be on the stack, it is more accurate to say the allocator inherits the storage class of the container it is used for.

runtime assert failures regarding incompatible dimension parameters are hard to debug

When dim runtime checks fail, they are hard to debug, because which dimension and failing values are not identified.

The tests in this repo define some more informative assert macros like ASSERT_EQ(x, y) that show the values involved in the assertion failure, but this requires runtime string handling, which I'd rather avoid as a dependency (outside of the tests).

Cropping via ranges results in an array with the same indices as the old array

In numpy and many other systems, b = a(interval(5, 12)) results in an array where b(x) == a(x + 5) and b has an extent of 7 (and a min of 0).

In array, this doesn't happen. b(x) == a(x), where b has a min of x.min() + 5 and an extent of 7. The only other system that is similar that I'm aware of is Halide.

This is very convenient for tiling algorithms transparently, but may not be what people would prefer to see in other use cases.

I lean towards the current behavior (which enables transparent tiling of algorithms), but it would be good to find a way to seamlessly enable the numpy style as well.

Should move/make_*_move use rvalue references for the source?

These functions move elements from the source, but not the source itself.

However, it is also possible that the make_move, make_dense_move, and make_compact_move functions should move the whole source if the shape is equal to the source's shape.

Better differentiate between ranges and shapes

It would be better if shapes/dims and ranges were better differentiated. They are very similar, but differ in the following ways:

  1. shapes/dims have strides, ranges do not.
  2. Typically, one cares more about the min of a range, and the extent of a dim.

A common pattern, e.g. a(b.x(), b.y()), but it's passing the stride of x and y to b's operator(). It doesn't cause problems, but it is messy. A worse example is array b(a.shape()): If b has padding (e.g. if it were the result of a crop), it allocates b with the same padding, which is a surprising behavior.

The second issue manifests in a current ugly minor issue: range(index_t x) is interpreted as a range with a min of x and extent 1, while dim(index_t) is interpreted as a dim with extent 1.

This is a bit of a generalization of #12.

Need a version of generate() that takes indices

generate() works like std::generate() and needs to maintain state elsewhere. It'd be nice to have a function that does a fill based on "pattern function" that takes in indices instead:

Equivalent to something like:

for_each_index(image.shape(), [&](const typename Shape::index_type& i) {
  image(i) = std::apply(pattern, i);
});

Add CUDA support

@jiawen found that adding CUDA support just required adding __host__ __device__ to some of the functions in array.h.

I think we should do a few things here:

  1. Add __host__ __device__ to appropriate helper functions in array.h to enable shape and array_ref to be used in __device__ functions.
  2. Add an allocator that uses cudaMalloc/cudaFree to enable array to be used on the host.
  3. Maybe we need some wrappers around cudaMemcpy*, ideally transparently with copy.

(1) by itself would be really useful and a good first step.

Numpy -> nda::array cheat sheet?

It would be very helpful to provide a cheat sheet table at the top of the array.h header file.
Something that shows how to do an equivalent numpy, Matlab, or Eigen operation using array.h

I was recently confused by slicing vs cropping. Something like this would help a lot

numpy nda::array
x[:, :, 0] x(_, _, 0)
x[:, :, [0]] x(_, _, range(0, 1))

etc

It should be possible to be completely oblivious to mins

Currently, it is sometimes unavoidable to pass a placeholder or value for the min of a range or dim. It should be possible to work with extents and strides without having to work around mins. This applies to both compile-time and runtime mins.

Template argument order issue in make_* functions that return an array

When allocators are expensive to construct (e.g. auto_allocator uses a lot of stack space), the current make_array and similar helpers are problematic, because they use a default argument for the allocator value, and a default template type argument.

An example of currently problematic code is this:

auto temp = make_array(other.shape(), auto_allocator<T, 32 * 32 * 4>());

i.e. a temporary array the same shape as another array, but possibly placed on the stack.

The problem with this is that there are two arrays of size 32 * 32 * 4 * sizeof(T) on the stack: one owned by temp, the other just created to make the call to make_array and never actually used.

It would be better if we could specify the type of the allocator, without ever instantiating it.

We clearly need to split each instance of make_* into two overloads: one with an allocator argument, and one without (to avoid the default argument instance). The overload that takes an allocator instance is not a problem, template type deduction is fine in this case (and we don't even need a default value of the allocator type).

However, the overload without it is problematic:

template <typename T, typename Shape, typename Alloc = std::allocator<T>>
auto make_array(const Shape& shape) {
   ...
}

We want to be able to take the default allocator 99% of the time. But when we want to specify the type, sometimes specifying the type of the shape is hard, e.g. if using make_compact on a shape with non-trivial constant extents/strides.

So we want the Alloc template parameter to be earlier in the list, at least before Shape, but then Shape needs a default type. Maybe just giving this a dummy type, maybe even just void is reasonable? That would hopefully not interfere with type deduction, and making an array with no shape argument would be an error anyways?

Remove copies from Einstein expression construction

Currently, it is not possible to construct an Einstein reduction expression without copying the operands. Most of these copies are small/lightweight (never arrays, only array_refs), but operands that are functions/lambdas may be expensive to copy or not copyable at all.

Alias for a fixed size array allocated on the stack

I've recently had use for a fixed size array allocated on the stack, i.e. the n-dimensional counterpart to std::array or the equivalent of Eigen::Matrix<T, Rows, Cols>: this is an array with.

template <typename T, size_t... Extents>
using FixedSizeArray =
    nda::array<T, typename CompactShapeHelper<Extents...>::Shape,
               nda::auto_allocator<T, (... * Extents)>>;

Where Shape defines each dim with completely static values: min = 0, extent = Extent, Stride = (the product of all subsequent extents), that is row-major for a 2D matrix. I guess Stride doesn't have to make the array compact for all use cases, but this can help with naive indexing (offset = h * w * i + w * j + k).

Do you think it would be interesting to have such a type defined in array.h?

Bounds issues should be reported consistently

Functions which can fail due to bounds issues should use a consistent reporting mechanism. std::out_of_range is used by the STL for std::vector::at, which motivates using that consistently everywhere. However, constructing a dim with a runtime value inconsistent with the compile-time constant produces an assert.

Reshape an array in runtime

Hi,

Is it possible to reshape the array in runtime ?

I think you have more available functions in the code than in the README.md. Is it possible to consider a wiki for the list of functions and performance notes ?

If I understand well, the array is implemented as a normal array in C++ (in the stack) ?

Regards,
Tien Tu

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.