
MatX - GPU-Accelerated Numerical Computing in Modern C++


MatX is a modern C++ library for numerical computing on NVIDIA GPUs and CPUs. Near-native performance can be achieved while using a simple syntax common in higher-level languages such as Python or MATLAB.

[Image: side-by-side comparison of an FFT resampler implemented in NumPy and in MatX]

The above image shows the Python (NumPy) version of an FFT resampler next to the MatX version. The total runtimes of the NumPy, CuPy, and MatX versions are shown below:

  • Python/NumPy: 5360ms (Xeon(R) CPU E5-2698 v4 @ 2.20GHz)
  • CuPy: 10.6ms (A100)
  • MatX: 2.54ms (A100)

While the code complexity and length are roughly the same, the MatX version shows a 2100x speedup over the NumPy version and is over 4x faster than the CuPy version on the same GPU.

Key features include:

  • ⚡ MatX is fast. By using existing, optimized libraries as a backend, and efficient kernel generation when needed, no hand-optimizations are necessary

  • 👍 MatX is easy to learn. Users familiar with high-level languages will pick up the syntax quickly

  • 📑 MatX easily integrates with existing libraries and code

  • 🎇 Visualize data from the GPU right in a web browser

  • ↕️ I/O capabilities for reading and writing files

Requirements

MatX support is currently limited to Linux, due to the effort required to test Windows. If you'd like to voice your support for native Windows support using Visual Studio, please comment on the issue here: #153.

Note: CUDA 12.0.0 through 12.2.0 have an issue that causes compiler errors or compiler segfaults when building the MatX unit tests. Please use CUDA 11.4-11.8 or CUDA 12.2.1+ with MatX.

MatX uses features from C++17 and recent CUDA compilers and libraries. For this reason, GPU support requires CUDA 11.4 or newer and g++ 9 or newer. You can download the CUDA Toolkit here.

MatX has been tested on and supports Pascal, Turing, Volta, Ampere, Ada, and Hopper GPU architectures. Jetson products are supported with Jetpack 5.0 or above.

When used with CMake, the MatX build system will automatically fetch any packages that are missing or out of date from the internet. If you are on a machine without internet access or want to manage the packages yourself, please follow the offline instructions and pay attention to the required versions of the dependencies.

Note for CPU/Host support: CPU/Host execution is considered beta. Only operator execution is currently supported; functions that require backend libraries (FFT, GEMM, etc.) are not. If you find a bug in an operator on CPU, please report it in the issues.
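
As a rough sketch of what beta host execution looks like (the executor type name below is an assumption and has varied across MatX versions; check the documentation for your release):

#include <matx.h>

int main() {
  auto a = matx::make_tensor<float>({8});
  auto b = matx::make_tensor<float>({8});

  (a = 1.0f).run();        // evaluated on the default CUDA stream
  cudaDeviceSynchronize();

  // Assumption: a host executor passed to run() evaluates the element-wise
  // expression on the CPU instead of the GPU (no FFT/GEMM support yet).
  (b = a + a).run(matx::HostExecutor{});
  return 0;
}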

Installation

MatX is a header-only library that does not require compilation to use in your applications. However, unit tests, benchmarks, and examples must be compiled. CPM is used as a package manager for CMake to download and configure any dependencies. If MatX is to be used in an air-gapped environment, CPM can be configured to search locally for files. Depending on which options are enabled, compiling can take a long time without parallelism; using the -j flag on make with the highest number your system will accommodate is suggested.
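
For example, CPM's standard options for offline use can be passed on the cmake command line (paths are illustrative):

# Reuse a pre-populated download cache instead of fetching from the internet
cmake .. -DCPM_SOURCE_CACHE=/path/to/cpm-cache

# Or prefer dependency packages already installed on the system
cmake .. -DCPM_USE_LOCAL_PACKAGES=ON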

Building MatX

To build all components, issue the standard cmake build commands in a cloned repo:

mkdir build && cd build
cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=ON -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_DOCS=OFF ..
make -j

By default CMake will target the GPU architecture(s) of the system you're compiling on. If you wish to target other architectures, pass the CMAKE_CUDA_ARCHITECTURES flag with a list of architectures to build for:

cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90"

By default, nothing is compiled since MatX is header-only. If you wish to compile certain components, use the CMake flags below with ON or OFF values:

MATX_BUILD_TESTS
MATX_BUILD_BENCHMARKS
MATX_BUILD_EXAMPLES
MATX_BUILD_DOCS

For example, to enable unit test building:

mkdir build && cd build
cmake -DMATX_BUILD_TESTS=ON ..
make -j

Integrating MatX With Your Own Projects

MatX uses CMake as a first-class build generator, and therefore provides the proper config files to include into your own project. There are typically two ways to do this:

  1. Adding MatX as a subdirectory
  2. Installing MatX to the system

MatX as a Subdirectory

Adding the subdirectory is useful if you include the MatX source into the directory structure of your project. Using this method, you can simply add the MatX directory:

add_subdirectory(path/to/matx)

An example of using this method can be found in the examples/cmake_sample_project directory.

MatX Installed to the System

The other option is to install MatX and use the configuration file provided after building. This is typically done in a way similar to what is shown below:

cd /path/to/matx
mkdir build && cd build
cmake ..
make && make install

If you have the correct permissions, the headers and CMake packages will be installed on your system in the expected paths for your operating system. With the package installed, you can use find_package as follows:

find_package(matx CONFIG REQUIRED)

MatX CMake Targets

Once either of the two methods above is done, you can link against the transitive target matx::matx inside target_link_libraries. MatX may add other optional targets under the matx:: namespace in the future.
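
Putting it together, a minimal consumer CMakeLists.txt might look like the following sketch (project and target names are placeholders):

cmake_minimum_required(VERSION 3.18)
project(sample_matx LANGUAGES CUDA CXX)
set(CMAKE_CUDA_ARCHITECTURES 80)

# Either find an installed MatX package...
find_package(matx CONFIG REQUIRED)
# ...or vendor the source instead: add_subdirectory(path/to/matx)

add_executable(sample_matx main.cu)
target_link_libraries(sample_matx PRIVATE matx::matx)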

Documentation

Documentation for MatX can be built locally as shown above with the -DMATX_BUILD_DOCS=ON CMake flag. Building the documentation requires the following to be installed: doxygen, breathe, sphinx, sphinx-rtd-theme, libjs-mathjax, texlive-font-utils, flex, and bison.

MatX uses semantic versioning and reserves the right to introduce breaking API changes on major releases.

Supported Data Types

MatX supports all types that use standard C++ operators for math (+, -, etc.). Unit tests are run against all common types shown below.

  • Integer: int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t
  • Floating Point: matxFp16 (fp16), matxBf16 (bfloat16), float, double
  • Complex: matxFp16Complex, matxBf16Complex, cuda::std::complex<float>, cuda::std::complex<double>

Since CUDA half precision types (__half and __nv_bfloat16) do not support all C++ operators on the host side, MatX provides the matxFp16 and matxBf16 types for scalars, and matxFp16Complex and matxBf16Complex for complex types. These wrappers are needed so that tensor views can be evaluated on both the host and device, regardless of CUDA or hardware support. When possible, the half types will use hardware-accelerated intrinsics automatically. Existing code using __half and __nv_bfloat16 may be converted to the MatX equivalent types directly and leverage all operators.
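
For instance, a minimal sketch using the wrapper types with ordinary C++ operators (shapes and values are illustrative):

#include <matx.h>

int main() {
  auto a = matx::make_tensor<matx::matxFp16>({16});
  auto b = matx::make_tensor<matx::matxFp16>({16});
  auto c = matx::make_tensor<matx::matxFp16>({16});

  (a = 1.5f).run();
  (b = 2.0f).run();

  // Standard operators work on the wrapper types on both host and device;
  // hardware-accelerated intrinsics are used automatically when available.
  (c = a * b + a).run();
  cudaStreamSynchronize(0);
  return 0;
}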

Unit Tests

MatX contains a suite of unit tests covering the primitive functions, plus end-to-end tests of example code. MatX uses pybind11 to generate some of the unit test inputs and outputs. This avoids the need to store large test vector files in git; instead, they are generated as needed.

To run the unit tests, from the cmake build directory run:

test/matx_test

This will execute all unit tests defined. If you wish to execute a subset of tests, or run with different options, you may run test/matx_test directly with parameters defined by Google Test. To run matx_test directly, you must be inside the build/test directory for the correct paths to be set. For example, to run only tests with the name FFT:

cd build/test
./matx_test --gtest_filter="*FFT*"

Quick Start Guide

We provide a variety of training materials and examples to quickly learn the MatX API.

  • A quick start guide can be found in the docs directory or from the main documentation site. The MatX quick start guide is modeled after NumPy's and demonstrates how to manipulate and create tensors.
  • A set of MatX notebooks can be found in the docs directory. These four notebooks walk through the major MatX features and allow the developer to practice writing MatX code with guided examples and questions.
  • Finally, for new MatX developers, browsing the example applications can provide familiarity with the API and best practices; a minimal first program is sketched below.
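
For a first taste of the API, here is a minimal tensor-add sketch in the style of the notebook exercises (using API forms that appear elsewhere in this document):

#include <matx.h>
using namespace matx;

int main() {
  auto A = make_tensor<float>({2, 3});
  auto B = make_tensor<float>({2, 3});
  auto C = make_tensor<float>({2, 3});

  A.SetVals({{1, 2, 3}, {4, 5, 6}});
  B.SetVals({{7, 8, 9}, {10, 11, 12}});

  (C = A + B).run();        // element-wise add, queued on the default stream
  cudaStreamSynchronize(0); // wait for the GPU before reading the result
  C.Print();
  return 0;
}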

Release Major Features

v0.8.0:

  • Features
    • Updated cuTENSOR and cuTensorNet versions
    • Added configurable print formatting
    • ARM FFT support via NVPL
    • New operators: abs2(), outer(), isnan(), isinf()
    • Many more unit tests for CPU support
  • Bug fixes for matmul on Hopper, 2D FFTs, and more

v0.7.0:

  • Features
    • Automatic documentation generation
    • Use CCCL instead of CUB/libcudac++
    • New operators: polyval, matvec
    • Improved caching and teardown of transforms
    • Optimized polyphase resampler
    • Negative slice indexing
  • Many new bug fixes and error checking

v0.6.0:

  • Breaking changes
    • This marks the first release using "transforms as operators". Transforms can now be used in any operator expression, whereas the previous release required them to be on separate lines; for an example, please see: https://nvidia.github.io/MatX/basics/fusion.html. This is a breaking change to transform usage, but converting to the new format is as simple as moving the function parameters, as shown in the snippet after this list. For example: matmul(C, A, B, stream); becomes (C = matmul(A,B)).run(stream);.
  • Features
    • Polyphase channelizer
    • Many new operators, including upsample, downsample, pwelch, overlap, at, etc
    • Added more lvalue semantics for operators based on view manipulation
  • Bug fixes
    • Fixed cache issues
    • Fixed stride = 0 in matmul
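
To illustrate the v0.6.0 transform change described above:

// Pre-v0.6.0: transforms were standalone calls on separate lines
matmul(C, A, B, stream);

// v0.6.0 and later: transforms compose inside operator expressions
(C = matmul(A, B)).run(stream);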

Discussions

We have an open discussions board here. We encourage any questions about the library to be posted here for other users to learn from and read through.

Filing Issues

We welcome and encourage the creation of issues against MatX. When creating a new issue, please use the following syntax in the title of your submission to help us prioritize responses and planned work.

  • Bug Report: Prefix the issue title with [BUG], e.g. [BUG] MatX fails to build on P100 GPU
  • Documentation Request: Prefix the issue title with [DOC]
  • Feature Request: Prefix the issue title with [FEA]
  • Submit a Question: Prefix the issue title with [QST]

As with all issues, please be as verbose as possible and, if relevant, include a test script that demonstrates the bug or expected behavior. It's also helpful if you provide environment details about your system (bare-metal, cloud GPU, etc).

Contributing Guide

Please review the CONTRIBUTING.md file for information on how to contribute code and issues to MatX. We require all pull requests to have a linear history and rebase to main before merge.

matx's People

Contributors

atomicvar, awthomp, ax3l, benbarsdell, bhaskarrakshit, bonevbs, brycelelbach, cliffburdick, cwharris, dagardner-nv, drnikolaev, eschmidt-nvidia, galv, hugo-syn, kshitij12345, leofang, lucifer1004, luitjens, nvjonwong, pkestene, tbensonatl, tmartin-gh, tylera-nvidia, yaraslaut


matx's Issues

[BUG] Unit Test Failure

Most of the unit tests pass. I think this one may be towards the end of the list. One test fails on an assertion, and a subsequent test throws an exception.

After pulling updates from the main branch, I built and ran it using:

cmake -DBUILD_TESTS=ON ..
make -j
cd test
./matx_test

Output of failures:

Comparison failed at /path/to/MatX/test/00_transform/MatMul.cu:136:0/64: val=0.438721+13.132812j file=8.766909+-0.536683j (c)
/path/to/MatX/test/00_transform/MatMul.cu:136: Failure
Failed
[  FAILED  ] MatMulTestFloatTypes/6.MediumRect, where TypeParam = matx::matxHalfComplex<matx::matxHalf<__half> > (255 ms)
[----------] 3 tests from MatMulTestFloatTypes/6 (262 ms total)

[----------] 3 tests from MatMulTestFloatTypes/7, where TypeParam = matx::matxHalfComplex<matx::matxHalf<__nv_bfloat16> >
[ RUN      ] MatMulTestFloatTypes/7.SmallRect
matxException (matxMatMulError: ret == CUBLAS_STATUS_SUCCESS) - /path/to/MatX/include/matx_matmul.h:505

Stack Trace:
 ./matx_test() [0x4cd619]
 ./matx_test() [0xd58d45]
 ./matx_test() [0xd4ed50]
 ./matx_test() [0xd40f9a]
 ./matx_test() [0xd3873b]
 ./matx_test : void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x65
 ./matx_test : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x4b
 ./matx_test : testing::Test::Run()+0xea
 ./matx_test : testing::TestInfo::Run()+0x122
 ./matx_test : testing::TestSuite::Run()+0x133
 ./matx_test : testing::internal::UnitTestImpl::RunAllTests()+0x3c1
 ./matx_test : bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x65
 ./matx_test : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x4b
 ./matx_test : testing::UnitTest::Run()+0xaa
 ./matx_test() [0x4a3c99]
 ./matx_test() [0x4a27d6]
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./matx_test() [0x4a266e]

System details:

  • GPU: Quadro RTX 5000
  • OS: CentOS 7
  • CUDA version: 11.5
  • gcc/g++ version: 9.3.1 20200408 (Red Hat 9.3.1-2)
  • cmake version: 3.22.1

[FEA] Add find() function

find() is used in MATLAB to find values meeting some criteria and save only the output that meets the criteria. This is notably different from the current IF conditional operator, since IF only operates element-wise and cannot save a subset of values. CUB provides a select primitive to gain this functionality.
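
For reference, a minimal sketch of the CUB select primitive mentioned above, using CUB's usual two-phase temporary-storage pattern (buffer names are illustrative):

#include <cub/cub.cuh>

struct IsPositive {
  __device__ bool operator()(const float &x) const { return x > 0.0f; }
};

// d_in: n input values; d_out: selected values; d_num_selected: device int*
void select_positive(const float *d_in, float *d_out, int *d_num_selected, int n) {
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call only computes the required temporary storage size
  cub::DeviceSelect::If(d_temp, temp_bytes, d_in, d_out, d_num_selected, n, IsPositive{});
  cudaMalloc(&d_temp, temp_bytes);
  // Second call performs the selection
  cub::DeviceSelect::If(d_temp, temp_bytes, d_in, d_out, d_num_selected, n, IsPositive{});
  cudaFree(d_temp);
}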

[BUG] matx::argmax returns weird results for tensors that have all-negative elements

Describe the bug
When the input tensor has all-negative elements, like [-3, -1, -7], matx::argmax returns a strange index.

To Reproduce

#include <matx.h>

int main() {
  // Create a tensor with all negative elements
  auto t = matx::make_tensor<float, 1>({3});
  t.SetVals({-3, -1, -7});
  t.Print();

  // Apply matx::argmax
  matx::tensor_t<float, 0> max_val{};
  matx::tensor_t<matx::index_t, 0> max_idx{};
  matx::argmax(max_val, max_idx, t);
  max_val.Print();
  max_idx.Print();

  return 0;
}

Output:

000000: -3.0000e+00 
000001: -1.0000e+00 
000002: -7.0000e+00 
1.1755e-38 
9223372036854775807

The max_val is 1.1755e-38 and the max_idx is 9223372036854775807, which makes no sense.

Expected behavior
The max_val should be -1 and max_idx should be 1.

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[FEA] Add host compilation support

Is your feature request related to a problem? Please describe.
MatX should be includable from a host compiler even if it's not going to be used inside of it

Describe the solution you'd like
Not error out

[BUG] matx::argmax returns strange indices when the input has a fractional part.

Describe the bug
matx::argmax returns strange indices when the input float has a fractional part.

To Reproduce

#include <matx.h>

int main() {
  auto a = matx::make_tensor<float, 1>({5});
  a.SetVals({1.0, 2.0, 3.0, 4.0, 0.0});     // it works
  // a.SetVals({0.1, 0.2, 0.3, 0.4, 0.0});  // it DOES NOT work
  a.Print();

  matx::tensor_t<float, 0> max_val{};
  matx::tensor_t<matx::index_t, 0> idx{};

  matx::argmax(max_val, idx, a);
  
  cudaStreamSynchronize(0);

  max_val.Print();
  idx.Print();

  return 0;
}

Output 1 (1.0, 2.0, 3.0, 4.0, 0.0 version):

000000: 1.0000e+00 
000001: 2.0000e+00 
000002: 3.0000e+00 
000003: 4.0000e+00 
000004: 0.0000e+00 
4.0000e+00 
3

Output 2 (0.1, 0.2, 0.3, 0.4, 0.0 version):

000000: 1.0000e-01 
000001: 2.0000e-01 
000002: 3.0000e-01 
000003: 4.0000e-01 
000004: 0.0000e+00 
4.0000e-01 
9223372036854775807

Expected behavior
The second version should give the same result.

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[FEA] We don't have unary operator `-`?

It seems like MatX doesn't currently have a unary operator -, for something like:

// a, b are tensors
(b = -a).run();

I looked at the code, and it turns out that the operator is commented out:

[image: screenshot of the commented-out operator]

Why don't we add this operator?

[QST] Performance in comparison to PyTorch (w/ jit)

Dear MatX Team,

thank you for this nice software :)

I wanted to ask how MatX would compare to PyTorch (potentially with jit). For simplicity, I made a test case where one simplified self-attention layer is computed. Ultimately, I want to implement custom attention/transformer layers and thought MatX could be a good choice here, giving better speed than PyTorch but more flexibility than NVIDIA/FasterTransformer. However, for this simple test case I made, it seems that PyTorch is faster (or at least as fast), and I wanted to ask whether this is expected or whether the benchmark is set up incorrectly.

Thank you
Lukas

PyTorch Code

import torch
import time

print(torch.__version__)

L = 128
D = 4096
loops = 256
kw = {'device': torch.device('cuda:0'), 'dtype': torch.float32}
def randn(*args): return torch.randn(args, **kw)

torch.manual_seed(0)
emb = randn(L, D)
Wq = randn(D, D)
Wk = randn(D, D)
Wv = randn(D, D)

def transformer(x):
    q = x @ Wq
    k = x @ Wk
    v = x @ Wv
    
    score = q @ k.T
    attn = torch.softmax(score, axis=1)
    res = attn @ v
    
    return res

traced_transformer = torch.jit.trace(transformer, (emb,), check_trace=True)

def test(fun, name):
    testlist = [None]*loops
    st = time.time()
    for i in range(loops):
        testlist[i] = fun(emb)
    torch.cuda.synchronize()
    ed = time.time()
    print(f"{name:50s} {int((ed-st)/loops*1_000_000)} us")
    
    
test(transformer, "Warmup")
test(transformer, "Pytorch")
test(traced_transformer, "JIT Trace")
with torch.jit.optimized_execution(True):
    test(traced_transformer, "JIT Trace + Opt")

Output (please note that I know about CUDA timers but didn't think they would be necessary here due to the loop. Please correct me if I am wrong)

1.9.0
Warmup                                             1458 us
Pytorch                                            1379 us
JIT Trace                                          1402 us
JIT Trace + Opt                                    1388 us

C++

#include <stdio.h>
#include <matx.h>
#include "matx_viz.h"

int main(int argc, char **argv) {
    typedef float scalar_t;

    constexpr int L = 128;
    constexpr int D = 4096;
    constexpr int loops = 30;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto emb = matx::make_tensor<scalar_t>({L, D});
    auto Wq = matx::make_tensor<scalar_t>({D, D});
    auto Wk = matx::make_tensor<scalar_t>({D, D});
    auto Wv = matx::make_tensor<scalar_t>({D, D});

    matx::randomGenerator_t<scalar_t> randData_emb(emb.TotalSize(), 0);
    matx::randomGenerator_t<scalar_t> randData_W(Wq.TotalSize(), 0);

    auto randTensor_emb = randData_emb.GetTensorView<2>({L,D}, matx::NORMAL);
    auto randTensor_W = randData_W.GetTensorView<2>({D,D}, matx::NORMAL);

    // Fill with random (each time different)
    (emb = randTensor_emb).run(stream);
    (Wq = randTensor_W).run(stream);
    (Wk = randTensor_W).run(stream);
    (Wv = randTensor_W).run(stream);

    auto q = matx::make_tensor<scalar_t>({L, D});
    auto k = matx::make_tensor<scalar_t>({L, D});
    auto kT = matx::make_tensor<scalar_t>({D, L});
    auto v = matx::make_tensor<scalar_t>({L, D});
    auto res = matx::make_tensor<scalar_t>({L, D});
    auto score = matx::make_tensor<scalar_t>({D, D});
    auto exp_score = matx::make_tensor<scalar_t>({D, D});
    auto sum_exp_score = matx::make_tensor<scalar_t>({D});
    auto attn = matx::make_tensor<scalar_t>({D, D});

    for (int loop = 0; loop < loops; loop++){
        cudaStreamSynchronize(stream);
        cudaEventRecord(start, stream);

        // My Code
        matx::matmul(q, emb, Wq, stream);   // q = emb @ Wq
        matx::matmul(k, emb, Wk, stream);   // k = emb @ Wk
        matx::matmul(v, emb, Wv, stream);   // v = emb @ Wv
        matx::transpose(kT, k, stream);     // kT = k^T
        matx::matmul(score, q, kT, stream); // score = q @ k^T
        (exp_score = matx::exp(score)).run(stream);
        matx::sum(sum_exp_score, exp_score, stream);
        (attn = exp_score / sum_exp_score).run(stream); // softmax: exp(s) / sum(exp(s))
        matx::matmul(res, attn, v, stream); // res = attn @ v
        cudaEventRecord(stop, stream);
        cudaStreamSynchronize(stream);
        float time_ms;
        cudaEventElapsedTime(&time_ms, start, stop);

        printf("%d us\n", (int) (1000*time_ms));
    }
}

Output

245243 us
2277 us
2257 us
2260 us
2254 us
2255 us
3160 us
2500 us
2260 us
2260 us
2261 us
3659 us
2641 us
2255 us
2256 us
2254 us
2257 us
2255 us
2519 us
2259 us
2257 us
2256 us
2258 us
3501 us
2270 us
2256 us
2754 us
2261 us
2256 us
2253 us

Compile options

cmake -DMATX_BUILD_TESTS=OFF -DMATX_BUILD_BENCHMARKS=OFF -DMATX_BUILD_EXAMPLES=OFF -DMATX_BUILD_DOCS=OFF -D CMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11 -DCMAKE_CUDA_HOST_COMPILER=g++-11 -DMATX_EN_PYBIND11=ON -DCMAKE_BUILD_TYPE=Release ..
-- The CUDA compiler identification is NVIDIA 11.7.64
-- The CXX compiler identification is GNU 11.1.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/lib/ccache/g++-11 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found suitable version "11.7.64", minimum required is "11.5") 
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Using GPU architectures 70
-- Downloading CPM.cmake to /home/lukas/oss/MatX/examples/cmake_sample_project/build/cmake/CPM_0.32.2.cmake
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.7.64") 
-- Recent libcuda++ found in CUDA toolkit. Skipping finding...
-- Enabling pybind11 support
-- CPM: adding package pybind11@2.6.2 (v2.6.2)
-- pybind11 v2.6.2 
CMake Warning (dev) at /usr/share/cmake-3.22/Modules/CMakeDependentOption.cmake:84 (message):
  Policy CMP0127 is not set: cmake_dependent_option() supports full Condition
  Syntax.  Run "cmake --help-policy CMP0127" for policy details.  Use the
  cmake_policy command to set the policy and suppress this warning.
Call Stack (most recent call first):
  build/_deps/pybind11-src/CMakeLists.txt:98 (cmake_dependent_option)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found PythonInterp: /usr/bin/python3.6 (found version "3.6.12") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
 -- Found Python3: /usr/bin/python3.6 (found version "3.6.12") found components: Interpreter Development Development.Module Development.Embed 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lukas/oss/MatX/examples/cmake_sample_project/build

System

➜  build git:(main) ✗ nvidia-smi
Thu Jun  9 22:13:07 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    19W / 250W |   2267MiB /  8192MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1296      G   /usr/lib/xorg/Xorg                 24MiB |
|    0   N/A  N/A      1467      G   /usr/bin/gnome-shell               82MiB |
|    0   N/A  N/A      2100      G   /usr/lib/xorg/Xorg                143MiB |
|    0   N/A  N/A      2227      G   /usr/bin/gnome-shell               54MiB |
|    0   N/A  N/A      2762      G   ...493856081279591316,131072       94MiB |
|    0   N/A  N/A     10390      C   ...nda3/envs/t1.9/bin/python     1861MiB |
+-----------------------------------------------------------------------------+
➜  build git:(main) ✗ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 SUPER (UUID: GPU-2d3310c9-8492-2f50-c678-e0185f5f508a)

[BUG] CUDA 11.4 with libcudacxx 1.7.0-ea still won't compile the examples.

Describe the bug
I cannot build the examples with CUDA 11.4 and libcudacxx 1.7.0-ea

To Reproduce

  1. CMake configuration:
(base) root@fdecaa517939:~/matx_test/MatX/build# https_proxy=10.162.14.132:7890 cmake -DBUILD_EXAMPLES=ON ..
-- Auto-detecting GPU architectures since CMAKE_CUDA_ARCHITECTURES not defined
-- The CUDA compiler identification is NVIDIA 11.4.100
-- The CXX compiler identification is GNU 9.3.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Auto detection of gpu-archs: 86
-- Using GPU architectures 86-real
-- Downloading CPM.cmake to /root/matx_test/MatX/build/cmake/CPM_0.32.2.cmake
-- Setting build type to 'Debug' as none was specified.
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.4.100") 
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Need libcuda++ 1.7.0 or higher (included in CTK 11.5+). Finding...
-- CPM: adding package libcudacxx@1.7.0 (1.7.0-ea)
-- CPM: adding package pybind11@2.6.2 (v2.6.2)
-- pybind11 v2.6.2 
-- Found PythonInterp: /root/miniconda3/bin/python (found version "3.9.5") 
-- Found PythonLibs: /root/miniconda3/lib/libpython3.9.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Found Python3: /root/miniconda3/bin/python3.9 (found version "3.9.5") found components: Interpreter Development Development.Module Development.Embed 
-- Configuring done
-- Generating done
-- Build files have been written to: /root/matx_test/MatX/build
  2. Make the convolution example:
(base) root@fdecaa517939:~/matx_test/MatX/build# make convolution
[ 50%] Building CUDA object examples/CMakeFiles/convolution.dir/convolution.cu.o
/root/matx_test/MatX/include/matx_tensor_impl.h(700): error: missing return statement at end of non-void function "matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, I=1, Is=<matx::index_t>]"
          detected during:
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, I=1, Is=<matx::index_t>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, I=0, Is=<matx::index_t>]" 
(740): here
            instantiation of "T &matx::detail::tensor_impl_t<T, RANK, Desc>::operator()(Is...) noexcept [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, M=1, Is=<matx::index_t>, <unnamed>=true]" 
/root/matx_test/MatX/include/matx_tensor.h(1103): here
            instantiation of "void matx::tensor_t<T, RANK, Storage, Desc>::SetVals(const std::initializer_list<T> &) noexcept [with T=float, RANK=1, Storage=matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, Desc=matx::static_tensor_desc_t<10LL>]" 
/root/matx_test/MatX/examples/convolution.cu(93): here

/root/matx_test/MatX/include/matx_tensor_impl.h(700): error: missing return statement at end of non-void function "matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=2, Is=<matx::index_t, matx::index_t>]"
          detected during:
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=2, Is=<matx::index_t, matx::index_t>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=1, Is=<matx::index_t, matx::index_t>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=0, Is=<matx::index_t, matx::index_t>]" 
(740): here
            instantiation of "T &matx::detail::tensor_impl_t<T, RANK, Desc>::operator()(Is...) noexcept [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, M=2, Is=<matx::index_t, matx::index_t>, <unnamed>=true]" 
/root/matx_test/MatX/examples/convolution.cu(77): here

/root/matx_test/MatX/include/matx_tensor_impl.h(700): error: missing return statement at end of non-void function "matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, I=1, Is=<unsigned int>]"
          detected during:
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, I=1, Is=<unsigned int>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, I=0, Is=<unsigned int>]" 
(740): here
            instantiation of "T &matx::detail::tensor_impl_t<T, RANK, Desc>::operator()(Is...) noexcept [with T=float, RANK=1, Desc=matx::static_tensor_desc_t<10LL>, M=1, Is=<unsigned int>, <unnamed>=true]" 
/root/matx_test/MatX/include/kernels/matx_conv_kernels.cuh(93): here
            instantiation of "void matx::Conv1D(OutType, InType, FilterType, matx::index_t, matx::index_t, matx::matxConvCorrMode_t) [with OutType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(89): here
            instantiation of "void matx::detail::matxDirectConv1DInternal(OutputType &, const InType &, const FilterType &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(187): here
            instantiation of "void matx::conv1d(OutputType &, const In1Type &, const In2Type &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384009LL>>, In1Type=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, In2Type=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/examples/convolution.cu(103): here

/root/matx_test/MatX/include/matx_tensor_impl.h(700): error: missing return statement at end of non-void function "matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=2, Is=<unsigned int, unsigned int>]"
          detected during:
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=2, Is=<unsigned int, unsigned int>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=1, Is=<unsigned int, unsigned int>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=0, Is=<unsigned int, unsigned int>]" 
(740): here
            instantiation of "T &matx::detail::tensor_impl_t<T, RANK, Desc>::operator()(Is...) noexcept [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, M=2, Is=<unsigned int, unsigned int>, <unnamed>=true]" 
/root/matx_test/MatX/include/kernels/matx_conv_kernels.cuh(117): here
            instantiation of "void matx::Conv1D(OutType, InType, FilterType, matx::index_t, matx::index_t, matx::matxConvCorrMode_t) [with OutType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(89): here
            instantiation of "void matx::detail::matxDirectConv1DInternal(OutputType &, const InType &, const FilterType &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(187): here
            instantiation of "void matx::conv1d(OutputType &, const In1Type &, const In2Type &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384009LL>>, In1Type=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, In2Type=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/examples/convolution.cu(103): here

/root/matx_test/MatX/include/matx_tensor_impl.h(700): error: missing return statement at end of non-void function "matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=2, Is=<unsigned int, int>]"
          detected during:
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=2, Is=<unsigned int, int>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=1, Is=<unsigned int, int>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, I=0, Is=<unsigned int, int>]" 
(740): here
            instantiation of "T &matx::detail::tensor_impl_t<T, RANK, Desc>::operator()(Is...) noexcept [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384000LL>, M=2, Is=<unsigned int, int>, <unnamed>=true]" 
/root/matx_test/MatX/include/kernels/matx_conv_kernels.cuh(134): here
            instantiation of "void matx::Conv1D(OutType, InType, FilterType, matx::index_t, matx::index_t, matx::matxConvCorrMode_t) [with OutType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(89): here
            instantiation of "void matx::detail::matxDirectConv1DInternal(OutputType &, const InType &, const FilterType &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(187): here
            instantiation of "void matx::conv1d(OutputType &, const In1Type &, const In2Type &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384009LL>>, In1Type=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, In2Type=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/examples/convolution.cu(103): here

/root/matx_test/MatX/include/matx_tensor_impl.h(700): error: missing return statement at end of non-void function "matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384009LL>, I=2, Is=<unsigned int, matx::index_t>]"
          detected during:
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384009LL>, I=2, Is=<unsigned int, matx::index_t>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384009LL>, I=1, Is=<unsigned int, matx::index_t>]" 
(695): here
            instantiation of "matx::detail::tensor_impl_t<T, RANK, Desc>::stride_type matx::detail::tensor_impl_t<T, RANK, Desc>::GetVal(std::tuple<Is...>) [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384009LL>, I=0, Is=<unsigned int, matx::index_t>]" 
(740): here
            instantiation of "T &matx::detail::tensor_impl_t<T, RANK, Desc>::operator()(Is...) noexcept [with T=float, RANK=2, Desc=matx::static_tensor_desc_t<100LL, 16384009LL>, M=2, Is=<unsigned int, matx::index_t>, <unnamed>=true]" 
/root/matx_test/MatX/include/kernels/matx_conv_kernels.cuh(161): here
            instantiation of "void matx::Conv1D(OutType, InType, FilterType, matx::index_t, matx::index_t, matx::matxConvCorrMode_t) [with OutType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(89): here
            instantiation of "void matx::detail::matxDirectConv1DInternal(OutputType &, const InType &, const FilterType &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::detail::tensor_impl_t<float, 2, matx::static_tensor_desc_t<100LL, 16384009LL>>, InType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, FilterType=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/include/matx_conv.h(187): here
            instantiation of "void matx::conv1d(OutputType &, const In1Type &, const In2Type &, matx::matxConvCorrMode_t, cudaStream_t) [with OutputType=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384009LL>>, In1Type=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<100LL, 16384000LL>>, In2Type=matx::tensor_t<float, 1, matx::basic_storage<matx::raw_pointer_buffer<float, matx::owning, matx::matx_allocator<float>>>, matx::static_tensor_desc_t<10LL>>]" 
/root/matx_test/MatX/examples/convolution.cu(103): here

6 errors detected in the compilation of "/root/matx_test/MatX/examples/convolution.cu".
make[3]: *** [examples/CMakeFiles/convolution.dir/build.make:76: examples/CMakeFiles/convolution.dir/convolution.cu.o] Error 255
make[2]: *** [CMakeFiles/Makefile2:306: examples/CMakeFiles/convolution.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:313: examples/CMakeFiles/convolution.dir/rule] Error 2
make: *** [Makefile:248: convolution] Error 2

Expected behavior
The example is expected to be compiled.

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0
  • Commit ID: 8976874

[FEA] Add static shape/strides

Currently only the tensor rank is a compile-time constant, but shape/strides are often known at compile-time as well. Add these in after #48 is complete.

[BUG] SetVals failed on tensors that are created from user pointers

Describe the bug
When using SetVals on tensors that are created from user pointers, a segmentation fault occurs.

To Reproduce

float* dev_float;
cudaMalloc(&dev_float, sizeof(float) * 6);

auto t = matx::make_tensor<float, 2, matx::non_owning>(dev_float, {2, 3});
t.SetVals({{1, 2, 3}, {4, 5, 6}});
t.Print();

VSCode Debug Error:
[image: VS Code debug error screenshot]

Expected behavior
The SetVals should work for tensors created from user pointers.

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[FEA] Support concatenate operator

Support an operator for concatenating tensors and/or operators together. Initial support can be for two tensors, but it should be extended to the more generalized case of unlimited tensors and any dimension.
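
A hypothetical usage sketch for such an operator (the concat name, argument order, and semantics below are assumptions for illustration, not an existing API):

// Hypothetical API: concatenate a {2,3} and a {4,3} tensor along dimension 0
auto a = matx::make_tensor<float>({2, 3});
auto b = matx::make_tensor<float>({4, 3});
auto c = matx::make_tensor<float>({6, 3});
(c = matx::concat(0, a, b)).run();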

[FEA] Jetson Support

Thank you for this nice library!

Is Jetson support on the horizon and if so, what is the timeline for that?

I am considering starting a project that could build on MatX but it would be vital for the project to eventually support Nvidia Jetsons.

Thanks a lot
Cheers
Lukas

[FEA] Add detail:: namespace

Is your feature request related to a problem? Please describe.
Use detail:: to hide internal implementation details from users

[BUG] Maintain storage and descriptor types across views

Currently when a view is created from a tensor, or when temporary tensors are created, they aren't necessarily using the same types of storage and descriptors as the original tensors. This only shows up when using the advanced API where custom storage is used, but it should be uniform across the entire API.

[FEA] Add unique()

unique() is a commonly used function for finding the unique set of values in a list. The primitives are available in CUB.
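
As a reference point, a sketch of the corresponding CUB primitive (cub::DeviceSelect::Unique removes consecutive duplicates, so unsorted input is typically sorted first):

#include <cub/cub.cuh>

// d_in: n sorted input values; d_out: unique values; d_num_unique: device int*
void unique_values(const int *d_in, int *d_out, int *d_num_unique, int n) {
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  // Size query, then allocation, then the actual run
  cub::DeviceSelect::Unique(d_temp, temp_bytes, d_in, d_out, d_num_unique, n);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSelect::Unique(d_temp, temp_bytes, d_in, d_out, d_num_unique, n);
  cudaFree(d_temp);
}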

[QST] Comparison to Pytorch

Hi, thanks for releasing MatX, I really find it interesting! I was wondering about a speed comparison to PyTorch (e.g. comparing GEMM)? I only found this, which also says MatX is 4x faster than CuPy:
https://thomasaarholt.github.io/fftspeedtest/fftspeedtest.html

But I am not too familiar with the technical details.
Do the two frameworks compare similar in terms of speed (ignoring the intended use cases of either framework)?

I appreciate your work and help! :)
Lukas

[FEA] N-D Tensors

Is your feature request related to a problem? Please describe.
MatX only supports up to 4D tensors, but there are use cases for higher dimensions (as mentioned in other issues).

Describe the solution you'd like
Support unlimited dimensions

ML specific operators most needed

Is your feature request related to a problem? Please describe.
Widely used by ML software, ArgMax and ArgMin (coordinate of the maximum/minimum value, a.k.a. ISAMAX/IDAMAX/ISAMIN/IDAMIN in the BLAS world) are surprisingly rare in tensor libraries these days. The same problem exists for partial sorting (a.k.a. the TopK operator). The third item is concatenation, which would be a good addition to the currently implemented slicing.

Describe the solution you'd like
This would be terrific to have all these facilities in MatX.

Unable to compile sample application

Describe the bug
Unable to compile a sample application against MatX; the .cu file does nothing but #include <matx.h>.

These are the compile errors

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(135): error: namespace "cuda::std::__3::detail" has no member "__atomic_thread_fence_cuda"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(135): error: namespace "cuda::std::__3::detail" has no member "__thread_scope_system_tag"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(142): error: namespace "cuda::std::__3::detail" has no member "__atomic_signal_fence_cuda"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: namespace "cuda::std::__3::detail" has no member "__atomic_load_n_cuda"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: namespace "cuda::std::__3::detail" has no member "__scope_tag"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: expected an expression

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(224): error: no instance of overloaded function "cuda::std::__3::__cxx_atomic_alignment_unwrap" matches the argument list
            argument types are: (<error-type>)

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: namespace "cuda::std::__3::detail" has no member "__atomic_exchange_n_cuda"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: namespace "cuda::std::__3::detail" has no member "__scope_tag"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: expected an expression

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(235): error: no instance of overloaded function "cuda::std::__3::__cxx_atomic_alignment_unwrap" matches the argument list
            argument types are: (<error-type>)

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(215): error: namespace "cuda::std::__3::detail" has no member "__atomic_store_n_cuda"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(215): error: namespace "cuda::std::__3::detail" has no member "__scope_tag"

/usr/local/cuda-11.4/targets/x86_64-linux/include/cuda/std/detail/__atomic(215): error: expected an expression

To Reproduce
Steps to reproduce the behavior:

  1. Compile the sample application with CUDA 11.4

Expected behavior
compilation

Code snippets

cmake_minimum_required(VERSION 3.18)

project(SAMPLE_MATX LANGUAGES CUDA CXX)
find_package(CUDAToolkit 11.4 REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 75)

find_package(matx CONFIG REQUIRED)

add_executable(sample_matx main.cu)
target_link_libraries(sample_matx PRIVATE matx::matx)

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: 11.4
  • g++ version:9.3

Additional context
Does the sample application require CUDA 11.5?

[BUG] An empty macro MATX_ASSERT_STR_EXP is missing in matx_error.h when NDEBUG is defined

Describe the bug
An empty macro MATX_ASSERT_STR_EXP is missing in matx_error.h when NDEBUG is defined.

[image: screenshot of matx_error.h]

When building with -DNDEBUG (e.g. cmake with -DCMAKE_BUILD_TYPE=Release), this error will be thrown by the compiler:

/usr/local/include/matx_einsum.h(105): error: identifier "MATX_ASSERT_STR_EXP" is undefined

Expected behavior
An empty definition of MATX_ASSERT_STR_EXP should be added between #else and #endif, just like MATX_ASSERT and MATX_ASSERT_STR.
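
That is, the NDEBUG branch would gain an empty definition along these lines (a sketch; the exact macro arity is an assumption based on the MATX_ASSERT_STR pattern):

#ifndef NDEBUG
  // ... debug assertion definitions ...
#else
  #define MATX_ASSERT(a, error)
  #define MATX_ASSERT_STR(a, error, str)
  #define MATX_ASSERT_STR_EXP(a, expected, error, str)  // the missing empty definition
#endif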

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[BUG] The exercises in the notebooks need to be updated. Some of them cannot be compiled successfully.

Describe the bug
The simple example MatX/docs_input/notebooks/exercises/example2_tensor_add.cu cannot be compiled successfully.

To Reproduce
The current version of the code is

#include <matx.h>

using namespace matx;

int main() {

  tensorShape_t<2> shape({2, 3});
  tensor_t<float, 2> A(shape);
  tensor_t<float, 2> B(shape);
  tensor_t<float, 2> C(shape);
  tensor_t<float, 1> V({3});

  A.SetVals({ {1, 2, 3},
        {4, 5, 6}});
  
  B.SetVals({ {7, 8, 9},
        {10, 11, 12}});

  // TODO: Add tensors A and B and store the result in C

  
  A.Print();
  printf("\n");
  B.Print();
  printf("\n");
  C.Print();
}

If I change it to

#include <matx.h>

using namespace matx;

int main() {
  tensor_t<float, 2> A({2, 3});
  tensor_t<float, 2> B({2, 3});
  tensor_t<float, 2> C({2, 3});
  tensor_t<float, 1> V({3});

  A.SetVals({ {1, 2, 3},
        {4, 5, 6}});
  
  B.SetVals({ {7, 8, 9},
        {10, 11, 12}});

  // TODO: Add tensors A and B and store the result in C
  
  A.Print(0, 0);
  printf("\n");
  B.Print(0, 0);
  printf("\n");
  C.Print(0, 0);
}

It works.

Expected behavior
The examples should compile and work.

System details (please complete the following information):

  • OS: 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0
  • Codebase commit: d22dc84


[QST] Creating 2D view from multiple device pointers

Looking through the documentation here, I see the example of how to create a tensor view from a device pointer. Is there any way to create a single tensor view from multiple device pointers? If anything, a workaround could involve copying the contents at the pointers to a single contiguous location and creating a view from that. Just looking for a way to avoid having to do that.
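
For reference, the copy-based workaround described above might look like this sketch (d_a and d_b are assumed pre-existing device buffers of n floats each; the non_owning make_tensor form is the one shown elsewhere in these issues):

// Copy both source buffers into one contiguous allocation
int n = 1024;
float *d_all;
cudaMalloc(&d_all, 2 * n * sizeof(float));
cudaMemcpy(d_all,     d_a, n * sizeof(float), cudaMemcpyDeviceToDevice);
cudaMemcpy(d_all + n, d_b, n * sizeof(float), cudaMemcpyDeviceToDevice);

// Create a single 2D view over the contiguous buffer without taking ownership
auto t = matx::make_tensor<float, 2, matx::non_owning>(d_all, {2, n});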

RTX3090 Fails To Compile Recursive Filter

Describe the bug

[ 12%] Building CXX object _deps/fmt-build/CMakeFiles/fmt.dir/src/os.cc.o
[ 13%] Building CUDA object examples/CMakeFiles/fft_conv.dir/fft_conv.cu.o
ptxas error   : Value of threads per SM for entry _ZN4matx15RecursiveFilterILj2ELj2ENS_8tensor_tIfLi2EEES2_fEEvT1_T2_PKT3_S7_PVS5_S9_xPViS7_ is out of range. .minnctapersm will be ignored
make[2]: *** [examples/CMakeFiles/recursive_filter.dir/build.make:76: examples/CMakeFiles/recursive_filter.dir/recursive_filter.cu.o] Error 255
make[1]: *** [CMakeFiles/Makefile2:330: examples/CMakeFiles/recursive_filter.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

To Reproduce
Steps to reproduce the behavior:

  1. Compile on 3090

Expected behavior
No errors


System details (please complete the following information):
CUDA 11.5
Ubuntu 20.04


[FEA] Add custom allocator interface

Is your feature request related to a problem? Please describe.
MatX currently takes raw non-owned pointers, smart unowned pointers, and otherwise self-allocates owned pointers. While this allows for many different options, it does not allow users to supply their own memory allocators for MatX to call.

Describe the solution you'd like
Allow functionality similar to xtensor
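
For illustration, a rough sketch of what such a hook could look like; none of these names exist in MatX today, and the real design may differ:

// Hypothetical user-supplied allocator with an xtensor-like shape
struct my_device_allocator {
  void* allocate(size_t bytes) {
    void *ptr;
    cudaMalloc(&ptr, bytes);  // could also call into RMM or a pool allocator
    return ptr;
  }
  void deallocate(void *ptr, size_t bytes) { cudaFree(ptr); }
};

// Hypothetical: a tensor type that allocates through the user's allocator
// tensor_t<float, 2, my_device_allocator> A({m, n});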

Describe alternatives you've considered
Above

[BUG] Problems with matx::reduce in the newest version

There seems to be a bug in the newest version of MatX: matx::reduce(x_norm, norm(x), detail::reduceOpSum<SType>(), stream, true); no longer works and always produces 0 in the accumulation. Replacing that line with matx::sum(x_norm, norm(x), stream); resolved the issue for me. I am not sure why, since matx::sum appears to be derived from the general reduce operation.

[BUG] matmul throws a CUBLAS_STATUS_INVALID_VALUE exception when passed a permuted view

Describe the bug
matmul throws an exception if a permuted view is passed to it. The call below leads to CUBLAS_STATUS_INVALID_VALUE being returned at matx_matmul.h:535.

To Reproduce

    tensor_t<InType, 2> At({n, m});
    randomGenerator_t<InType> randData(At.TotalSize(), 0);
    auto randTensor = randData.GetTensorView<2>({n, m}, NORMAL);
    (At = randTensor).run(stream);
    auto A = At.PermuteMatrix();

    tensor_t<InType, 2> b({n, 1});
    tensor_t<InType, 2> c({m, 1});
    (b = 1).run(stream);

    matmul(c, A, b, stream); 

System
OS: Ubuntu x86_64
CPU: Xeon Silver 4214 @ 2.2GHz
GPU: NVIDIA A100

Workaround
An effective workaround is to simply copy the transposed matrix first:

    tensor_t<InType, 2> A({m, n});
    transpose(A, At, stream);
    matmul(c, A, b, stream);

[BUG] Reduce operations give partly wrong results with 3D tensors.

Describe the bug
I tried to use matx::sum on 3D tensors, which are common in deep learning (batch_size, sequence_len, embedding_size). The results are partly wrong. I don't know whether I'm misusing it or whether it's not supported yet.

To Reproduce
Steps to reproduce the behavior:

  1. Create a 3D tensor mat with shape {2, 3, 2}:
auto vec = matx::make_tensor<float, 1>({12});
vec.SetVals({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
auto mat = vec.View({2, 3, 2});
// now mat is
// [
//   [[1, 2],
//    [3, 4],
//    [5, 6]],
//   [[7, 8],
//    [9, 10],
//    [11, 12]]
// ]
  2. Perform a sum operation on each row:
auto mat_sum = matx::make_tensor<float, 1>({6});
matx::sum(mat_sum, mat);
mat_sum.Print();
  3. The result is:
000000: 3.0000 
000001: 7.0000 
000002: 11.0000 
000003: 33554528.0000 
000004: 0.0000 
000005: 0.0000

Expected behavior
The result should be:

000000: 3.0000 
000001: 7.0000 
000002: 11.0000 
000003: 15.0000 
000004: 19.0000 
000005: 23.0000

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[BUG] matx::mean() computes different results for owned vs non-owned views.

Describe the bug
matx::mean() does not produce identical results for an owned vs non-owned view.

To Reproduce
Run this unit test from my fork of MatX:
https://github.com/bpinzone-nvidia/MatX/blob/bpinzone_mean_transpose_unit_test/test/00_tensor/ViewTests.cu#L232

Expected behavior
first_method_mean_over_rows and second_method_mean_over_rows should compare equal after computation.

System details

  • Ubuntu 20.04.3 LTS
  • CUDA Version: 11.5
  • g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

[BUG] matx::matmul can't be performed on tensors created from user-defined pointers

Describe the bug
I'm trying to use matx::matmul to perform a GEMM on two tensors created from device pointers allocated by CUDA APIs, but the code fails to compile.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare the data buffers on CPU:
float data_a[4] = {1.0, 2.0, 3.0, 4.0};
float data_b[4] = {1.0, 2.0, 3.0, 4.0};
  2. Allocate GPU memory for three 2x2 tensors using cudaMalloc:
float *pa, *pb, *pc;
CUDA_CHECK(cudaMalloc(&pa, sizeof(float) * 4));
CUDA_CHECK(cudaMalloc(&pb, sizeof(float) * 4));
CUDA_CHECK(cudaMalloc(&pc, sizeof(float) * 4));
  3. Copy the data to GPU:
CUDA_CHECK(cudaMemcpy(pa, data_a, sizeof(float) * 4, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(pb, data_b, sizeof(float) * 4, cudaMemcpyHostToDevice));
  4. Create 3 tensors from these pointers:
auto ta = matx::make_tensor<float, 2, matx::non_owning>(pa, {2, 2});
auto tb = matx::make_tensor<float, 2, matx::non_owning>(pb, {2, 2});
auto tc = matx::make_tensor<float, 2, matx::non_owning>(pc, {2, 2});
  5. Perform matmul:
matx::matmul(tc, ta, tb);
tc.Print();
  6. Compilation fails:
[ 98%] Building CUDA object test/operator_tests/CMakeFiles/embedding.dir/embedding.cu.o
/root/gs/nexus3/3rdparty/MatX/include/matx_tensor_desc.h(230): error: no operator "=" matches these operands
            operand types are: std::array<long long, 2UL> = const matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>
          detected during:
            instantiation of "void matx::tensor_desc_t<ShapeType, StrideType, RANK>::InitFromShape(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=const matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &]" 
(112): here
            instantiation of "matx::tensor_desc_t<ShapeType, StrideType, RANK>::tensor_desc_t(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=const matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &, <unnamed>=true]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(225): here
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

/root/gs/nexus3/3rdparty/MatX/include/matx_tensor_desc.h(233): error: class "matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>" has no member "begin"
          detected during:
            instantiation of "void matx::tensor_desc_t<ShapeType, StrideType, RANK>::InitFromShape(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=const matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &]" 
(112): here
            instantiation of "matx::tensor_desc_t<ShapeType, StrideType, RANK>::tensor_desc_t(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=const matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &, <unnamed>=true]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(225): here
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

/root/gs/nexus3/3rdparty/MatX/include/matx_tensor_desc.h(230): error: no operator "=" matches these operands
            operand types are: std::array<long long, 2UL> = matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>
          detected during:
            instantiation of "void matx::tensor_desc_t<ShapeType, StrideType, RANK>::InitFromShape(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &]" 
(112): here
            instantiation of "matx::tensor_desc_t<ShapeType, StrideType, RANK>::tensor_desc_t(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &, <unnamed>=true]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(227): here
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

/root/gs/nexus3/3rdparty/MatX/include/matx_tensor_desc.h(233): error: class "matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>" has no member "begin"
          detected during:
            instantiation of "void matx::tensor_desc_t<ShapeType, StrideType, RANK>::InitFromShape(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &]" 
(112): here
            instantiation of "matx::tensor_desc_t<ShapeType, StrideType, RANK>::tensor_desc_t(S2 &&) [with ShapeType=std::array<long long, 2UL>, StrideType=std::array<long long, 2UL>, RANK=2, S2=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>> &, <unnamed>=true]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(227): here
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(236): error: no suitable user-defined conversion from "matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>" to "const matx::tensor_t<float, 2, matx::DefaultStorage<float>, matx::DefaultDescriptor<2>>" exists
          detected during:
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(237): error: no suitable user-defined conversion from "matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>" to "const matx::tensor_t<float, 2, matx::DefaultStorage<float>, matx::DefaultDescriptor<2>>" exists
          detected during:
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

/root/gs/nexus3/3rdparty/MatX/include/matx_matmul.h(238): error: no suitable user-defined conversion from "matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>" to "const matx::tensor_t<float, 2, matx::DefaultStorage<float>, matx::DefaultDescriptor<2>>" exists
          detected during:
            instantiation of "matx::detail::MatMulParams_t matx::detail::matxMatMulHandle_t<TensorTypeC, TensorTypeA, TensorTypeB, PROV>::GetGemmParams(TensorTypeC &, const TensorTypeA &, const TensorTypeB &) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
(916): here
            instantiation of "void matx::matmul(TensorTypeC &, const TensorTypeA &, const TensorTypeB &, cudaStream_t, float, float) [with TensorTypeC=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeA=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, TensorTypeB=matx::tensor_t<float, 2, matx::basic_storage<matx::raw_pointer_buffer<float, matx::non_owning, matx::matx_allocator<float>>>, matx::DefaultDescriptor<2>>, PROV=(matx::MatXMatMulProvider_t)1]" 
/root/gs/nexus3/test/operator_tests/embedding.cu(36): here

7 errors detected in the compilation of "/root/gs/nexus3/test/operator_tests/embedding.cu".
make[3]: *** [test/operator_tests/CMakeFiles/embedding.dir/build.make:76: test/operator_tests/CMakeFiles/embedding.dir/embedding.cu.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:1179: test/operator_tests/CMakeFiles/embedding.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:1186: test/operator_tests/CMakeFiles/embedding.dir/rule] Error 2
make: *** [Makefile:533: embedding] Error 2

Expected behavior
The call to matx::matmul should compile.

System details (please complete the following information):

  • OS: Ubuntu 18.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[FEA] Implementation of MATLAB's find()

First of all, let me just say that I'm excited to see this project. I'm a long-time user of MATLAB but very new to CUDA, and I love that you are lowering the barrier of entry to GPU programming.

Is your feature request related to a problem? Please describe.
Slicing a matrix is often paired with MATLAB's built-in find function. I think it would be a valuable addition to this project.

Describe the solution you'd like
Similar to how you have implemented linspace, meshgrid, etc., it would be great to see comparable syntax and functionality to the function linked above; a hypothetical sketch follows.
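
A hypothetical usage sketch of what this could look like in MatX; the function name, signature, and output convention shown here are assumptions, not an existing API:

// MATLAB: vals = find(A > 5);
auto vals = matx::make_tensor<float, 1>({N});  // N = worst-case count, e.g. A.TotalSize()
matx::index_t num_found;
matx::find(vals, &num_found, A > 5);           // hypothetical signature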

Describe alternatives you've considered
I just stumbled upon this library today--I read through all the help docs and didn't see this anywhere. Hopefully I didn't miss it!

[BUG] einsum is not working

Describe the bug
matx::cutensor::einsum outputs wrong results.

To Reproduce
Steps to reproduce the behavior:

  1. Create a 2x3 tensor a:
auto a = matx::make_tensor<float, 2>({2, 3});
a.SetVals({
    {1, 2, 3},
    {4, 5, 6}
});
  2. Perform an einsum operation to reduce-sum a:
auto a_reduced = matx::make_tensor<float, 1>({3});
matx::cutensor::einsum(a_reduced, "ij->j", 0, a);
  3. Print the result:
cudaStreamSynchronize(0);
a_reduced.Print();

The shape is correct but all values are zeros.

Expected behavior

NumPy computes [5. 7. 9.] for the same contraction.

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[BUG] CMake fails while building without documentation on CUDA 11.4

Hi! Super cool library!

The CMake configuration fails when MatX is configured without documentation; it works fine with -DBUILD_DOCS=ON. This machine is running CUDA 11.4.

โžœ  MatX git:(main) mkdir build
โžœ  MatX git:(main) cd build
โžœ  build git:(main) cmake -DBUILD_TESTS=ON -DBUILD_BENCHMARKS=ON -DBUILD_EXAMPLES=ON -DBUILD_DOCS=OFF ..
-- Auto-detecting GPU architectures since CMAKE_CUDA_ARCHITECTURES not defined
-- The CUDA compiler identification is NVIDIA 11.4.120
-- The CXX compiler identification is GNU 11.1.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cuda/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Auto detection of gpu-archs: 86
-- Using GPU architectures 86-real
-- Downloading CPM.cmake to /home/luigi/sandbox/MatX/build/cmake/CPM_0.32.2.cmake
-- Setting build type to 'Debug' as none was specified.
-- Found CUDAToolkit: /opt/cuda/include (found version "11.4.120")
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Need libcuda++ 1.6.0 or higher (included in CTK 11.5+). Finding...
CMake Error at build/cmake/CPM_0.32.2.cmake:204 (find_package):
  find_package called with invalid argument "3bc2d4b"
Call Stack (most recent call first):
  build/cmake/CPM_0.32.2.cmake:255 (cpm_find_package)
  build/_deps/rapids-cmake-src/rapids-cmake/cpm/find.cmake:152 (CPMFindPackage)
  cmake/FindLibcudacxx.cmake:21 (rapids_cpm_find)
  cmake/FindLibcudacxx.cmake:44 (find_and_configure_libcudacxx)
  CMakeLists.txt:95 (include)

[FEA] Added indices to find()

Currently the find() function returns all values meeting the criteria, but the locations of those values in the original tensor are not returned. This request is to add the indices as an optional return.

more flexible memory handling

  1. You chose a managed memory scheme. This is good for entry level, but to consider it for production systems it would be nice to have more allocators implemented within the library (or just support the ones from RAPIDS).
  2. In the absence of managed memory it would be nice to have a (pinned-memory) CPU variation of the tensor that handles GPU/CPU copies, also with allocators.
  3. Vector type support would be nice (uchar4, ...).

This sort of basic data structure is very much needed, and the lazy execution model looks compact and useful.

[FEA] Allow all CUB functions to take operators

After a recent commit, we support CUB taking arbitrary tensors that can be permuted/strided. Operators could also be passed to CUB with the correct iterator functionality. Currently several pieces block this, since operators don't have all the functions used in the tensor iterator (Data(), LSize(), etc.). Because the tensor iterator is too specialized for operators, a second operator iterator can be made that supports only what operators have in common (Size(), Rank(), and operator()).
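
A rough sketch of that second iterator under the stated assumptions (1-D indexing shown for brevity; this is not existing MatX code):

template <typename Op>
struct op_iterator {
  Op op;             // any operator exposing Size(), Rank(), and operator()
  matx::index_t idx; // current linear position

  __host__ __device__ auto operator*() const { return op(idx); }
  __host__ __device__ op_iterator operator+(matx::index_t n) const { return {op, idx + n}; }
  __host__ __device__ bool operator==(const op_iterator &o) const { return idx == o.idx; }
  // remaining random-access operations (++, --, difference, etc.) omitted
};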

[QST] about ccmake and downloading rapids.cmake

Currently, when using ccmake (instead of cmake), I can't configure the project; after hitting "c" twice, cmake fails with the following error message:

CMake Error at CMakeLists.txt:43 (include):
   include could not find requested file:

     rapids-cmake

 CMake Error at CMakeLists.txt:44 (include):
   include could not find requested file:

     rapids-cpm

 CMake Error at CMakeLists.txt:45 (include):
   include could not find requested file:

     rapids-export

 CMake Error at CMakeLists.txt:46 (include):
   include could not find requested file:

     rapids-find

 CMake Error at CMakeLists.txt:49 (include):
   include could not find requested file:

     rapids-cuda

 Auto-detecting GPU architectures since CMAKE_CUDA_ARCHITECTURES not defined
 CMake Error at CMakeLists.txt:52 (rapids_cuda_init_architectures):
   Unknown CMake command "rapids_cuda_init_architectures".

 Configuring incomplete, errors occurred!
 See also "/home/pkestene/install/matX/github/MatX/build_sais_pas/CMakeFiles/CMakeOutput.log".

In the top-level CMakeLists.txt, if I change

if(NOT EXISTS ${CMAKE_BINARY_DIR}/RAPIDS.cmake)
    file(DOWNLOAD https://raw.githubusercontent.com/rapidsai/rapids-cmake/branch-21.12/RAPIDS.cmake
         ${CMAKE_BINARY_DIR}/RAPIDS.cmake)
       include(${CMAKE_BINARY_DIR}/RAPIDS.cmake)
endif()

into

if(NOT EXISTS ${CMAKE_BINARY_DIR}/RAPIDS.cmake)
    file(DOWNLOAD https://raw.githubusercontent.com/rapidsai/rapids-cmake/branch-21.12/RAPIDS.cmake
         ${CMAKE_BINARY_DIR}/RAPIDS.cmake)
endif()
include(${CMAKE_BINARY_DIR}/RAPIDS.cmake)

then ccmake configure is OK.

I don't claim this is a bug, but it is a bit disturbing to me.

[FEA] Tensor contractions

Great work so far on MatX!

I wonder if tensor contractions (aka tensordot or einsum) are on the roadmap for MatX. Until now this has existed in cuTENSOR, but it is quite verbose, so it would be great to write tensor contractions using MatX's high-level syntax.

[QST] Transform Reduce - Kernel Fusion possible?

Hi, at first sight of this project I had hoped that the deferred executors would allow fusing transformations into reductions, but after trying it out I'm not so sure anymore:

int const row = 3;
int const col = 4;
auto tensor = matx::make_tensor<int, 2>({row, col});
tensor.SetVals({{3, 2, 4, 5},
                {0, -2, 3, 1},
                {9, 8, 7, 6}});
tensor.Print(0,0);
auto count = matx::make_tensor<int, 1>({row});

// Goal: For each row, count the number of elements greater than 2.
matx::sum(count, tensor > 2);
cudaDeviceSynchronize();
count.Print(0);

This code compiles and runs, but the result is not the expected {3, 1, 4}, but {1, 1, 1}.
Running the transformation and reduction independently works as expected:

(tensor = tensor > 2).run();
matx::sum(count, tensor);

Is it, or will it be, possible to avoid the unnecessary reads/writes to tensor via kernel fusion in MatX?

[BUG] Tensors created from user-defined pointers cannot be printed

Describe the bug
Tensors created from user-defined pointers cannot be printed.

To Reproduce

  1. Allocate memory for a 1-D tensor with 4 elements using CUDA APIs:
float* ptr_dev;
CUDA_CHECK(cudaMalloc(&ptr_dev, sizeof(float) * 4));
  2. Make a tensor from that device pointer:
long long shape[1] = {4};
auto t = matx::make_tensor<float, 1>(ptr_dev, shape);
t.Print();
  3. Result: GetPointerKind in Print(Args... dims) throws an error at this line:
MATX_ASSERT(tmp != nullptr, matxInvalidParameter);

It seems GetPointerKind doesn't work for user-defined pointers because allocationMap is empty.

Expected behavior
t.Print() should print the tensor.

System details (please complete the following information):

  • OS: Ubuntu 20.04
  • CUDA version: CUDA 11.4
  • g++ version: 9.3.0

[FEA] Add more static_asserts

Many of the errors seen by users are the result of SFINAE failures and are hard to read. Use static_assert wherever possible instead.
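
For example, a constraint that today fails deep inside SFINAE could instead be spelled out up front (illustrative only, not taken from the MatX codebase):

template <typename TensorC, typename TensorA, typename TensorB>
void matmul_checked(TensorC &c, const TensorA &a, const TensorB &b) {
  static_assert(TensorA::Rank() >= 2 && TensorB::Rank() >= 2,
                "matmul requires input tensors of rank 2 or higher");
  // ... dispatch to the real implementation
}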

Giving an error about index_t

Hi, I was trying this API. I used the example simple_pipeline.cu.

I tried to compile it with nvcc -I ../include/ ./simple_pipeline.cu.
The error I got was something like:


../include/matx_tensor_ops.h(941): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(947): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(979): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(999): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1018): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1037): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1056): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1087): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1138): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1189): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1241): error: identifier "index_t" is undefined

../include/matx_tensor_ops.h(1300): error: identifier "index_t" is undefined

Error limit reached.
100 errors detected in the compilation of "./simple_pipeline.cu".
Compilation terminated.

How can I overcome this issue? I have CUDA Toolkit 11.6 and an RTX 3070 Max-Q.
Can you provide a script to compile it properly?

[BUG] Visual studio fails to compile unit tests/examples

Describe the bug
CMake generates a solution that fails to compile 11 of 20 projects on VS2022.

From #147 :

The problem is not that nvcc is being passed an incorrect flag, but rather that -fvisibility is not valid on VS. We use the option -forward-unknown-to-host-compiler, so nvcc automatically forwards any unknown parameter (of which this is one) to VS.

To Reproduce
Steps to reproduce the behavior:

cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=ON -DMATX_BUILD_EXAMPLES=ON -DMATX_BUILD_DOCS=OFF -DCMAKE_CUDA_ARCHITECTURES=52 -DCMAKE_BUILD_TYPE=Debug ..

Expected behavior
Expect to successfully compile all unit tests & examples.

Code snippets
Output log attached.

System details (please complete the following information):

  • Windows 10 Pro
  • CMake 3.22.1
  • VS2022 (MSVC 19.30.30706.0)
  • CUDA 11.6
  • pybind11 2.6.2

[FEA] Add option for col-major memory layout

A lot of existing code already uses column-major layout, and many existing libraries expect it. We should give the option to declare a tensor with that layout to allow further optimizations.
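
A hypothetical sketch of what the option could look like; no such parameter exists in MatX today:

// Possible opt-in at construction time (names are placeholders):
// auto A = matx::make_tensor<float, 2>({m, n}, matx::MATX_COL_MAJOR);
// A would then report strides {1, m} instead of the row-major {n, 1}.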
