cudnn-frontend's Introduction

cuDNN FrontEnd(FE) API

Introduction

The cuDNN FrontEnd(FE) API is a C++ header-only library that wraps the cuDNN C backend API. Both the FE and backend APIs are entry points to the same set of functionality that is commonly referred to as the "graph API".

While there are two entry points to the graph API (i.e. backend and frontend), it is expected that most users will use the FE API. The reasons are:

  • FE API is less verbose without loss of control. All functionality accessible through the backend API is also accessible through the FE API.
  • FE API adds functionality on top of the backend API, like errata filters and autotuning.

Also, for those using the backend API, the FE API source and samples can serve as a reference implementation.

In the FE v1.0 API, users can describe multiple operations that form a subgraph through a persistent cudnn_frontend::graph::Graph object. Unlike the FE v0.x API, users don't need to worry about specifying the shapes and sizes of intermediate virtual tensors. The FE v1.0 API extends the groundwork of earlier versions and introduces a new set of APIs to further simplify the workflow. For detailed information on the FE v1.0 API, see README.FE.1.0.md.
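For orientation, below is a minimal sketch of the v1.0 C++ workflow building a single conv_fprop graph. The dimensions, strides, and the handle/buffer arguments are illustrative assumptions, and the exact setter names should be checked against README.FE.1.0.md and samples/cpp; error checking on the returned statuses is omitted.

#include <cudnn_frontend.h>

#include <memory>
#include <unordered_map>

namespace fe = cudnn_frontend;

// Hypothetical helper: builds and runs one conv_fprop graph.
// `handle`, `x_dev`, `w_dev`, `y_dev`, and `workspace_dev` are assumed to be
// created/allocated by the caller.
void run_conv_fprop_once(cudnnHandle_t handle, void* x_dev, void* w_dev, void* y_dev, void* workspace_dev) {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
         .set_intermediate_data_type(fe::DataType_t::FLOAT)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    // Only graph inputs need explicit dims/strides; intermediate (virtual) tensors are inferred.
    auto X = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("image")
                              .set_dim({8, 64, 56, 56})
                              .set_stride({64 * 56 * 56, 1, 64 * 56, 64}));  // NHWC-style strides
    auto W = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("filter")
                              .set_dim({32, 64, 3, 3})
                              .set_stride({64 * 3 * 3, 1, 64 * 3, 64}));

    auto conv_options =
        fe::graph::Conv_fprop_attributes().set_padding({1, 1}).set_stride({1, 1}).set_dilation({1, 1});
    auto Y = graph.conv_fprop(X, W, conv_options);
    Y->set_output(true);

    // Validate, lower to the backend, pick an engine via heuristics, and build the plan.
    graph.validate();
    graph.build_operation_graph(handle);
    graph.create_execution_plans({fe::HeurMode_t::A, fe::HeurMode_t::FALLBACK});
    graph.check_support(handle);
    graph.build_plans(handle);

    // Bind device buffers to the graph tensors and execute.
    std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void*> variant_pack = {
        {X, x_dev}, {W, w_dev}, {Y, y_dev}};
    graph.execute(handle, variant_pack, workspace_dev);
}

In real code, each of these stages returns a status that should be checked; the samples show the full pattern.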

Additionally, the FE v1.0 API provides Python bindings to all APIs through pybind11. It is recommended that new users of cuDNN start with the frontend v1.0 API. See samples/cpp and samples/python for more details on its usage.

Usage

For C++ users, in order to include the entire library, include the header file include/cudnn_frontend.h in your compilation unit.

For Python users, run import cudnn.

Build:

Dependencies

With the release of v1.0, we are bumping the minimum supported cuDNN version to 8.5.0.

CUDA can be downloaded from the NVIDIA dev-zone.

cuDNN can be installed from the NVIDIA dev-zone or from PyPI wheels.

The minimum Python version needed is 3.6. Compiling the Python bindings requires the Python development package, which can be installed by running apt-get install python-dev.

To run the Python samples, you will need the dependencies mentioned in requirements.txt. They can be installed by running: pip install -r requirements.txt

Python API

pip wheel installation

Download the pip wheel corresponding to your python installation.

pip install nvidia_cudnn_frontend

Source installation:

Install FE python API by running:

pip install -v git+https://github.com/NVIDIA/cudnn-frontend.git

The above command picks up CUDA and cuDNN from the default system paths.

To provide a custom CUDA installation path, use environment variable: CUDAToolkit_ROOT.
To provide a custom CUDNN installation path, use environment variable: CUDNN_PATH.
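For example (the paths shown are illustrative):

CUDAToolkit_ROOT=/usr/local/cuda CUDNN_PATH=/usr/local/cudnn pip install -v git+https://github.com/NVIDIA/cudnn-frontend.git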

Checking the installation

To test whether installation is successful, run:

pytest test/python_fe

NOTE: Only the v1.0 API is exposed via the Python bindings.

C++ API

The C++ API is a header-only library.

The root CMakeLists.txt can be used as a reference for including cudnn_frontend in your project's build system.

Building samples

The following compilation steps are only required for building the samples.

Provide CUDA installation path according to: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html

Provide CUDNN installation path using CUDNN_PATH env variable or cmake parameter.

CUDNN_PATH points to the cuDNN installation:

  • Headers are in CUDNN_PATH/include.
  • Libraries are in CUDNN_PATH/lib or CUDNN_PATH/lib64 or CUDNN_PATH/lib/x64.

For an in-source build:

mkdir build
cd build
cmake -DCUDNN_PATH=/path/to/cudnn -DCUDAToolkit_ROOT=/path/to/cuda  ../
cmake --build . -j16
bin/samples

To skip building samples, use -DCUDNN_FRONTEND_BUILD_SAMPLES=OFF.

To skip building python bindings, use -DCUDNN_FRONTEND_BUILD_PYTHON_BINDINGS=OFF.
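For example, to configure a build without samples or Python bindings (paths illustrative):

cmake -DCUDNN_PATH=/path/to/cudnn -DCUDAToolkit_ROOT=/path/to/cuda -DCUDNN_FRONTEND_BUILD_SAMPLES=OFF -DCUDNN_FRONTEND_BUILD_PYTHON_BINDINGS=OFF ../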

If you have a stale CMake cache and want to update the cuDNN/CUDA paths, delete the CMake cache (or the build directory) and redo the above steps.

Debugging

For initial debugging, we recommend turning on cuDNN FE logging and checking for warnings and errors. cuDNN frontend API logging records the execution flow through the cuDNN frontend API. This functionality is disabled by default and can be enabled through the methods described in this section.

Method 1: Using Environment Variables:

Logging is controlled by two environment variables. With CUDNN_FRONTEND_LOG_INFO=0 (or unset), there is no logging regardless of CUDNN_FRONTEND_LOG_FILE. With CUDNN_FRONTEND_LOG_INFO=1:

  • CUDNN_FRONTEND_LOG_FILE not set: no logging
  • CUDNN_FRONTEND_LOG_FILE set to stdout or stderr: logging to cout or cerr
  • CUDNN_FRONTEND_LOG_FILE set to filename.txt: logging to that file

Method 2: Using API calls:

Calling cudnn_frontend::isLoggingEnabled() = true|false has the same effect as setting the environment variable. Calling cudnn_frontend::getStream() = stream_name can be used to assign the output stream directly.
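A minimal sketch of the programmatic route (it mirrors the calls described above; the stream-assignment details should be checked against the logging header):

// Shell equivalent: CUDNN_FRONTEND_LOG_INFO=1 CUDNN_FRONTEND_LOG_FILE=stderr ./my_app
// Programmatic equivalent of CUDNN_FRONTEND_LOG_INFO=1:
cudnn_frontend::isLoggingEnabled() = true;
// cudnn_frontend::getStream() can likewise be assigned to redirect the log output.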

For further debugging, please turn on the cuDNN backend logs described here: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html#error-reporting-and-api-logging

Documentation

Contributing:

Please refer to our contribution guide.

Feedback

Support, resources, and information about cuDNN can be found online at https://developer.nvidia.com/cudnn.

Also, bugs and RFEs can be reported in the issues section.

cudnn-frontend's Issues

Support "make install"

Even though the cudnn-frontend library is header-only and effectively consists of just the files under include/, the lack of a CMake install target to populate an installation location is unconventional.

Adding the following to the top-level CMakeLists.txt will provide the minimum expected behavior:

include(GNUInstallDirs)

install(DIRECTORY include/
    DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
    FILES_MATCHING PATTERN "*.h" PATTERN "*.hpp"
)

Additional files (e.g. documentation under ${CMAKE_INSTALL_PREFIX}/share/doc/cudnn-frontend/) may be added as appropriate.

Error: CUDNN_STATUS_EXECUTION_FAILED

Hi,

I'm currently using cuDNN to write a deep learning super-resolution sample, but the function cudnnConvolutionBiasActivationForward returns the status CUDNN_STATUS_EXECUTION_FAILED. I checked the API Reference doc and only found the description "The function failed to launch on the GPU." for this status.

I also found a similar post in the NVIDIA forum, but there are no replies to it. I noticed that the GPU memory usage rises rapidly when running the sample, but I can't debug further without specific diagnostic information. Could you give any suggestions for this issue?

Thanks a lot in advance!

Bug in Flash with rng dropout sample

I believe I've found a bug within the Flash with rng dropout sample test

if (seq_len_override) {
        Surface<int32_t> devActualSeqlenQ(b, false);
        Surface<int32_t> devActualSeqlenKV(b, false);
        std::vector<int32_t> hostActualSeqlenQ(b, 20);
        std::vector<int32_t> hostActualSeqlenKV(b, 20);
        
        checkCudaErr(cudaMemcpy(devActualSeqlenQ.devPtr,
                                hostActualSeqlenQ.data(),
                                sizeof(hostActualSeqlenQ[0]) * b,
                                cudaMemcpyHostToDevice));
        checkCudaErr(cudaMemcpy(devActualSeqlenKV.devPtr,
                                hostActualSeqlenKV.data(),
                                sizeof(hostActualSeqlenKV[0]) * b,
                                cudaMemcpyHostToDevice));
        checkCudaErr(cudaDeviceSynchronize());
        
        variant_pack[seq_q]  = devActualSeqlenQ.devPtr;
        variant_pack[seq_kv] = devActualSeqlenKV.devPtr;
}
    
Surface<float> statsTensor(b * h * s_q * 1, false);
if (is_inference == false) {
        variant_pack[stats] = statsTensor.devPtr;
}

When the devActualSeqlenQ and devActualSeqlenKV Surfaces go out of scope at the end of the if statement, the destructors are called and both devPtrs are cudaFree'd, even though they are part of the variant pack.

The bug is hidden due to the fact that upon the construction of the statsTensor Surface, the next cudaMalloc call returns the same device address that was previously freed, resulting in the statsTensor.devPtr being the same address as that of either the devActualSeqLenQ or devActualSeqlenKV devPtr. I'm not sure how the test still passes even though the hostActualSeqlen vectors that are cudaMemcpy'd in are subsequently freed on the device.
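A minimal sketch of one possible fix, assuming the Surface and variant_pack types from the sample, is to hoist the seqlen Surfaces out of the if block so they stay alive until after the graph has executed:

// Keep the device seqlen buffers alive for the rest of the test instead of
// letting their destructors free the memory before execution.
// (They are now allocated even when seq_len_override is false.)
Surface<int32_t> devActualSeqlenQ(b, false);
Surface<int32_t> devActualSeqlenKV(b, false);

if (seq_len_override) {
    std::vector<int32_t> hostActualSeqlenQ(b, 20);
    std::vector<int32_t> hostActualSeqlenKV(b, 20);

    checkCudaErr(cudaMemcpy(devActualSeqlenQ.devPtr,
                            hostActualSeqlenQ.data(),
                            sizeof(hostActualSeqlenQ[0]) * b,
                            cudaMemcpyHostToDevice));
    checkCudaErr(cudaMemcpy(devActualSeqlenKV.devPtr,
                            hostActualSeqlenKV.data(),
                            sizeof(hostActualSeqlenKV[0]) * b,
                            cudaMemcpyHostToDevice));
    checkCudaErr(cudaDeviceSynchronize());

    variant_pack[seq_q]  = devActualSeqlenQ.devPtr;
    variant_pack[seq_kv] = devActualSeqlenKV.devPtr;
}

// ... graph execution later uses variant_pack while devActualSeqlenQ/KV are still in scope ...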

Lack of activation function LeakyReLU

Hi,

I'm currently using cuDNN to write a deep learning super-resolution sample, but I found that support for LeakyReLU is not mentioned in the documentation, and LeakyReLU is used in my pre-trained model. Could you please give any suggestions or sample code to solve this problem?

Thanks a lot in advance!

Windows build error

include\cudnn_frontend\graph_interface.h(444,19): Error C2248: 'cudnn_frontend::graph::Layernorm_attributes::forward_phase': cannot access private member declared in class 'cudnn_frontend::graph::Layernorm_attributes'

identifier "geomlib::_NV_ANON_NAMESPACE::kEps" is undefined in device code

VS 2019 (16.11.34), CMake (3.20.21032501-MSVC_2)
CUDA (10.2)
cuDNN (9.0 + 10.2)
libtorch (1.10.1 + CUDA 10.2)
VulkanSDK-1.3.275.0
hdf5-1.14.3

Configuring the project with CMake is OK, but when I build the generated PolycubeHexMesher.sln it shows:
identifier "geomlib::_NV_ANON_NAMESPACE::kEps" is undefined in device code
(in the file "generalized_projection_cuda.cu", there are several such errors)

Cudnn Error InstanceNormalizationPlugin

Hello, I am working on a project to convert a pytorch tracking algorithm to TensorRT

I can successfully export the tracking algorithm to onnx then to TensorRT, but when I try to run inference on it, I get this error:
[E] C:\_src\plugin\instanceNormalizationPlugin\instanceNormalizationPlugin.cu (335) - Cudnn Error in nvinfer1::plugin::InstanceNormalizationPlugin::enqueue: 14 (CUDNN_STATUS_VERSION_MISMATCH)

The tracking module uses a couple of submodules; one of them is the bb regressor. I exported it to ONNX, and we can see that it contains InstanceNormalization:
(image: bb_regressor_sanitized ONNX graph)

Now, when I try to run inference in a PyTorch Docker container on Linux, the inference is OK, but I am using Windows 10 and for some reason it is not working.

Keep in mind that I reinstalled the latest cuDNN 4 times because of the CUDNN_STATUS_VERSION_MISMATCH, just in case I messed up the installation, but the error still occurs, so maybe it is a compatibility issue with Windows?

Thank you for your help!

question about the fusion_sample

Hi there,

I am looking at the samples for convolution, and I came across 2 sample functions: run_conv_add_bias_activation and run_conv_scale_bias_add_leaky_relu. Looking through the code, I noticed that both of them are identical except for the alignment value: 4 for the former and 16 for the latter.

My questions are:

  1. Is there a reason for putting one in conv_sample and the other in fusion_sample?
  2. Are there any differences besides the alignment value?
  3. Did you set the alignment value to 16 "to run a tensor core engine", which in turn enables some sort of op fusion by cuda/cudnn? (Not a cuda coding expert here)

run_conv_add_bias_activation(int64_t* x_dim,

run_conv_scale_bias_add_leaky_relu(int64_t* x_dim,

Thanks!

[Question] How to match Flash Attention 2 performance?

I wrote a helper that allows someone to use CuDNN attention within Pytorch seamlessly.

import cudnn
import torch
import math

# export CUDNN_FRONTEND_LOG_FILE=fe.log
# export CUDNN_FRONTEND_LOG_INFO=1

# import os
# os.environ["CUDNN_FRONTEND_LOG_FILE"] = "fe.log"
# os.environ["CUDNN_FRONTEND_LOG_INFO"] = "1"

def convert_to_cudnn_type(torch_type):
    if torch_type == torch.float16:
        return cudnn.data_type.HALF
    elif torch_type == torch.bfloat16:
        return cudnn.data_type.BFLOAT16
    elif torch_type == torch.float32:
        return cudnn.data_type.FLOAT
    elif torch_type == torch.int32:
        return cudnn.data_type.INT32
    elif torch_type == torch.int64:
        return cudnn.data_type.INT64
    else:
        raise ValueError("Unsupported tensor data type.")

def make_cudnn_autograd(*, num_heads, head_dim, dtype):
    assert dtype in [torch.float16, torch.bfloat16], f"Invalid dtype {dtype}"
    dtype = convert_to_cudnn_type(dtype)
    # match CuDNN's docs
    H, D = num_heads, head_dim
    del num_heads, head_dim

    cache = {}

    def init_or_check_tensor_attrs(tensor_name, tensor):
        nonlocal cache
        for attr in ['shape', 'stride', 'dtype', 'device']:
            key = f'{tensor_name}_{attr}'
            if key not in cache:
                cache[key] = getattr(tensor, attr)
                if callable(cache[key]):
                    cache[key] = cache[key]()
            else:
                v = cache[key]() if callable(cache[key]) else cache[key]
                assert cache[key] == v, f"Expected {cache[key]} but got {v}"


    class CuDNNAttention(torch.autograd.Function):
        @staticmethod
        def forward(ctx, B, N, L, q, kv, seqlens_kv):
            assert q.shape == (B, N, H, D)
            assert kv.shape == (B, N + L, 2, H, D)
            assert seqlens_kv.shape == (B,)

            # CuDNN plans are compiled for a specific shape, stride, dtype
            # So we need to verify those attributes
            init_or_check_tensor_attrs('q', q)
            init_or_check_tensor_attrs('kv', kv)
            init_or_check_tensor_attrs('seqlens_kv', seqlens_kv)

            q = q.permute(0, 2, 1, 3)  # B N H D -> B H N D
            kv_view = kv.permute(2, 0, 3, 1, 4) # B S KV H D -> KV B H S D
            k_view, v_view = torch.unbind(kv_view, dim=0)

            assert not k_view.is_contiguous() and not v_view.is_contiguous(), f"kv should not be contiguous (unnecessary copy)"
            assert k_view.shape == (B, H, (N + L), D), f"Got shape {k_view.shape} instead of {(B, H, (N + L), D)}"
            assert v_view.shape == (B, H, (N + L), D)

            # TODO: Is this safe?
            if 'stats' not in cache:
                cache['stats'] = torch.empty(B, H, N, 1, dtype=torch.float32, device=q.device)
                cache['seqlens_q'] = torch.tensor([N] * B, device=q.device, dtype=torch.int32).view(B, 1, 1, 1)
                cache['o'] = torch.empty_like(q)

            stats = cache['stats']
            seqlens_q = cache['seqlens_q']
            o = cache['o']

            seqlens_kv = seqlens_kv.view(B, 1, 1, 1)

            if 'compiled_graph_fwd' not in cache:
                print("Compiling CuDNN graphs ...")
                g_fwd = cudnn.pygraph(
                    io_data_type=dtype,
                    intermediate_data_type=cudnn.data_type.FLOAT,
                    compute_data_type=cudnn.data_type.FLOAT,
                )
                cache['name_to_cu_tensor'] = {
                    'q_cu': g_fwd.tensor_like(q.detach()),
                    'k_cu': g_fwd.tensor_like(k_view.detach()),
                    'v_cu': g_fwd.tensor_like(v_view.detach()),
                    'seqlens_q_cu': g_fwd.tensor_like(seqlens_q.detach()),
                    'seqlens_kv_cu': g_fwd.tensor_like(seqlens_kv.detach())
                }
                cu_tens = cache['name_to_cu_tensor']

                o_forward, stats_forward = g_fwd.sdpa(
                    name="sdpa",
                    q=cu_tens['q_cu'],
                    k=cu_tens['k_cu'],
                    v=cu_tens['v_cu'],
                    is_inference=False,
                    attn_scale=1.0 / math.sqrt(D),
                    use_causal_mask=False,
                    use_padding_mask=True,
                    seq_len_q=cu_tens['seqlens_q_cu'],
                    seq_len_kv=cu_tens['seqlens_kv_cu']
                )

                o_forward.set_output(True).set_dim(o.shape).set_stride(o.stride()).set_data_type(dtype)
                stats_forward.set_output(True).set_data_type(cudnn.data_type.FLOAT).set_dim(stats.shape).set_stride(stats.stride())

                cu_tens['o_forward_cu'] = o_forward
                cu_tens['stats_forward_cu'] = stats_forward

                def assert_cudnn_shape(tensor, expected_shape):
                    assert tuple(tensor.get_dim()) == expected_shape, f"Expected shape {expected_shape} but got {tensor.get_dim()}"

                assert_cudnn_shape(cu_tens['q_cu'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['k_cu'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['v_cu'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['o_forward_cu'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['stats_forward_cu'], (B, H, N, 1))
                assert_cudnn_shape(cu_tens['seqlens_q_cu'], (B, 1, 1, 1))
                assert_cudnn_shape(cu_tens['seqlens_kv_cu'], (B, 1, 1, 1))

                g_fwd.validate()
                g_fwd.build_operation_graph()
                g_fwd.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
                g_fwd.check_support()
                g_fwd.build_plans()

                cache['compiled_graph_fwd'] = g_fwd

                g_bwd = cudnn.pygraph(
                     io_data_type=dtype,
                     intermediate_data_type=cudnn.data_type.FLOAT,
                     compute_data_type=cudnn.data_type.FLOAT,
                )

                cu_tens['q_cu_bwd'] = g_bwd.tensor_like(q.detach())
                cu_tens['k_cu_bwd'] = g_bwd.tensor_like(k_view.detach())
                cu_tens['v_cu_bwd'] = g_bwd.tensor_like(v_view.detach())
                cu_tens['o_cu_bwd'] = g_bwd.tensor_like(o.detach())
                cu_tens['dO_cu_bwd'] = g_bwd.tensor_like(o.detach())
                cu_tens['stats_cu_bwd'] = g_bwd.tensor_like(stats.detach())
                cu_tens['seqlens_q_cu_bwd'] = g_bwd.tensor_like(seqlens_q.detach())
                cu_tens['seqlens_kv_cu_bwd'] = g_bwd.tensor_like(seqlens_kv.detach())

                dQ_bwd_cu, dK_bwd_cu, dV_bwd_cu = g_bwd.sdpa_backward(
                    name="sdpa_backward",
                    q=cu_tens['q_cu_bwd'],
                    k=cu_tens['k_cu_bwd'],
                    v=cu_tens['v_cu_bwd'],
                    o=cu_tens['o_cu_bwd'],
                    dO=cu_tens['dO_cu_bwd'],
                    stats=cu_tens['stats_cu_bwd'],
                    attn_scale=1.0 / math.sqrt(D),
                    use_causal_mask=False,
                    use_padding_mask=True,
                    seq_len_q=cu_tens['seqlens_q_cu_bwd'],
                    seq_len_kv=cu_tens['seqlens_kv_cu_bwd']
                )

                # TODO: Is this safe?
                # cache['dQ'] = torch.empty_like(q).contiguous()
                # cache['dK'] = torch.empty_like(k_view).contiguous()
                # cache['dV'] = torch.empty_like(v_view).contiguous()

                cache['dQ'] = torch.empty_like(q)
                cache['dK'] = torch.empty_like(k_view)
                cache['dV'] = torch.empty_like(v_view)

                dQ_bwd_cu.set_output(True).set_dim(cache['dQ'].size()).set_stride(cache['dQ'].stride())
                dK_bwd_cu.set_output(True).set_dim(cache['dK'].size()).set_stride(cache['dK'].stride())
                dV_bwd_cu.set_output(True).set_dim(cache['dV'].size()).set_stride(cache['dV'].stride())

                cu_tens['dQ_cu_bwd'] = dQ_bwd_cu
                cu_tens['dK_cu_bwd'] = dK_bwd_cu
                cu_tens['dV_cu_bwd'] = dV_bwd_cu

                assert_cudnn_shape(cu_tens['q_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['k_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['v_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['dQ_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['dK_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['dV_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['o_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['dO_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['stats_cu_bwd'], (B, H, N, 1))
                assert_cudnn_shape(cu_tens['seqlens_q_cu_bwd'], (B, 1, 1, 1))
                assert_cudnn_shape(cu_tens['seqlens_kv_cu_bwd'], (B, 1, 1, 1))

                g_bwd.validate()
                g_bwd.build_operation_graph()
                g_bwd.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
                g_bwd.check_support()
                g_bwd.build_plans()

                cache['compiled_graph_bwd'] = g_bwd

                # TODO: Is this safe?
                cache['workspace'] = torch.empty(
                    max(g_fwd.get_workspace_size(), g_bwd.get_workspace_size()),
                    device=q.device, dtype=torch.uint8
                )
            
            name_to_cu_tensor = cache['name_to_cu_tensor']
            variant_pack_forward = {
                name_to_cu_tensor[name]: tensor for name, tensor in [
                    ('q_cu', q),
                    ('k_cu', k_view),
                    ('v_cu', v_view),
                    ('o_forward_cu', o),
                    ('stats_forward_cu', stats),
                    ('seqlens_q_cu', seqlens_q),
                    ('seqlens_kv_cu', seqlens_kv)
                ]
            }
            cache['compiled_graph_fwd'].execute(variant_pack_forward, cache['workspace'])
            ctx.save_for_backward(q, k_view, v_view, o, stats, seqlens_kv)
            ctx.B, ctx.N, ctx.L = B, N, L
            ctx.dtype = dtype
            return o

        @staticmethod
        def backward(ctx, grad_output):
            q, k, v, o, stats, seqlens = ctx.saved_tensors
            B, N, L = ctx.B, ctx.N, ctx.L
            seqlens_q = cache['seqlens_q']

            cu_tens = cache['name_to_cu_tensor']

            assert tuple(grad_output.shape) ==  (B, H, N, D)
            assert tuple(grad_output.shape) == tuple(cu_tens['dO_cu_bwd'].get_dim())
            # For batch size 1, the stride can have 2 1s, I think this is a Pytorch bug
            # https://discuss.pytorch.org/t/stride-has-2-1s-in-it/208036
            assert tuple(grad_output.stride())[1:] == tuple(cu_tens['dO_cu_bwd'].get_stride())[1:], f"{tuple(cu_tens['dO_cu_bwd'].get_stride())} (expected) != {tuple(grad_output.stride())} (actual) for shape {tuple(grad_output.shape)}"
            assert convert_to_cudnn_type(grad_output.dtype) == cu_tens['dO_cu_bwd'].get_data_type()

            variant_pack_backward = {
                cu_tens[name]: tensor for name, tensor in [
                    ('dQ_cu_bwd', cache['dQ']),
                    ('dK_cu_bwd', cache['dK']),
                    ('dV_cu_bwd', cache['dV']),
                    ('q_cu_bwd', q),
                    ('k_cu_bwd', k),
                    ('v_cu_bwd', v),
                    ('o_cu_bwd', o),
                    ('dO_cu_bwd', grad_output),
                    ('stats_cu_bwd', stats),
                    ('seqlens_q_cu_bwd', seqlens_q),
                    ('seqlens_kv_cu_bwd', seqlens)
                ]
            }

            cache['compiled_graph_bwd'].execute(variant_pack_backward, cache['workspace'])
            assert cache['dQ'].shape == (B, H, N, D)
            dQ = cache['dQ'].permute(0, 2, 1, 3) # B H N D -> B N H D

            assert cache['dK'].shape == (B, H, N + L, D)
            assert cache['dV'].shape == (B, H, N + L, D)

            dKV = torch.stack([cache['dK'], cache['dV']], dim=2)
            assert dKV.shape == (B, H, 2, N + L, D)

            dKV = dKV.permute(0, 3, 2, 1, 4) # B H 2 N D -> B N 2 H D

            return None, None, None, dQ, dKV, None

    return CuDNNAttention

However, while this gets better forward-pass performance, it gets far worse backward-pass performance. Any thoughts on why this might be the case? I'm hoping there might be some obvious deficiency in my code.

(Unit is ms).

attention-forward-performance:
   batch_size     CuDNN  FlashAttention
0         1.0  0.022976        0.033024
1         2.0  0.021664        0.039456
2         4.0  0.047680        0.058112
3         6.0  0.056800        0.072208

attention-backward-performance:
   batch_size     CuDNN  FlashAttention
0         2.0  0.386144        0.282272
1         4.0  0.741664        0.301184
2         6.0  1.108608        0.464320

Number of heuristic engine configs mismatch by calling getEngineConfigCount and getEngineConfig

Hi, I'm trying to use the engine heuristic builder for autotuning. The build call looks successful. Then I called getEngineConfigCount; it succeeded and returned 9, which means the engine builder created 9 configs. The weird thing is that when I called getEngineConfig, the status is success, but the returned count is 0, which means no data was added into heuristic_results. This leaves all the heuristic configs uninitialized.

Does anyone have suggestions on possible directions? Thanks!

Default `cudnnConvolutionMode_t` and how to set it?

Hello,

I've been reading through the documentation, especially regarding the convolution operation, and I'm confused about what the default mode for the convolution operation is, whether it is CUDNN_CONVOLUTION or CUDNN_CROSS_CORRELATION. I would appreciate it if someone could clarify this, and additionally I would like to ask whether there is currently a setter to change between operation modes.

Thank you!

[Question] Making dO contiguous affects output?

I've noticed, when using PyTorch's custom autograd functions, that sometimes the stride of dO can be (0, 0, 0, 0).
Here's a very simple example: https://discuss.pytorch.org/t/getting-unusual-strides-when-using-pytorchs-autograd/208093.

In my custom wrapper for cuDNN, I solve this by making dO contiguous if the stride is all zeros. Code (ctrl-f for "CHECK FOR WEIRD STRIDE"):

import cudnn
import torch
import math

def convert_to_cudnn_type(torch_type):
    if torch_type == torch.float16:
        return cudnn.data_type.HALF
    elif torch_type == torch.bfloat16:
        return cudnn.data_type.BFLOAT16
    elif torch_type == torch.float32:
        return cudnn.data_type.FLOAT
    elif torch_type == torch.int32:
        return cudnn.data_type.INT32
    elif torch_type == torch.int64:
        return cudnn.data_type.INT64
    else:
        raise ValueError("Unsupported tensor data type.")

def make_cudnn_autograd(*, num_heads, head_dim, dtype):
    assert dtype in [torch.float16, torch.bfloat16], f"Invalid dtype {dtype}"
    dtype = convert_to_cudnn_type(dtype)
    # match CuDNN's docs
    H, D = num_heads, head_dim
    del num_heads, head_dim

    cache = {}

    def assert_cudnn_shape(tensor, expected_shape):
        assert tuple(tensor.get_dim()) == expected_shape, f"Expected shape {expected_shape} but got {tensor.get_dim()}"

    def init_or_check_tensor_attrs(tensor_name, tensor):
        nonlocal cache
        for attr in ['shape', 'stride', 'dtype', 'device']:
            key = f'{tensor_name}_{attr}'
            if key not in cache:
                cache[key] = getattr(tensor, attr)
                if callable(cache[key]):
                    cache[key] = cache[key]()
            else:
                v = cache[key]() if callable(cache[key]) else cache[key]
                assert cache[key] == v, f"Expected {cache[key]} but got {v}"


    class CuDNNAttention(torch.autograd.Function):
        @staticmethod
        def forward(ctx, B, N, L, q, kv, seqlens_kv):
            assert q.shape == (B, N, H, D)
            assert kv.shape == (B, N + L, 2, H, D)
            assert seqlens_kv.shape == (B,)

            # CuDNN plans are compiled for a specific shape, stride, dtype
            # So we need to verify those attributes
            init_or_check_tensor_attrs('q', q)
            init_or_check_tensor_attrs('kv', kv)
            init_or_check_tensor_attrs('seqlens_kv', seqlens_kv)

            q = q.permute(0, 2, 1, 3)  # B N H D -> B H N D
            kv_view = kv.permute(2, 0, 3, 1, 4) # B S KV H D -> KV B H S D
            k_view, v_view = torch.unbind(kv_view, dim=0)

            assert not k_view.is_contiguous() and not v_view.is_contiguous(), f"kv should not be contiguous (unnecessary copy)"
            assert k_view.shape == (B, H, (N + L), D), f"Got shape {k_view.shape} instead of {(B, H, (N + L), D)}"
            assert v_view.shape == (B, H, (N + L), D)

            # TODO: Is this safe?
            if 'stats' not in cache:
                cache['stats'] = torch.empty(B, H, N, 1, dtype=torch.float32, device=q.device)
                cache['seqlens_q'] = torch.tensor([N] * B, device=q.device, dtype=torch.int32).view(B, 1, 1, 1)
                cache['o'] = torch.empty_like(q)

            stats = cache['stats']
            seqlens_q = cache['seqlens_q']
            o = cache['o']

            seqlens_kv = seqlens_kv.view(B, 1, 1, 1)

            if 'compiled_graph_fwd' not in cache:
                print("Compiling CuDNN forward graph ...")
                g_fwd = cudnn.pygraph(
                    io_data_type=dtype,
                    intermediate_data_type=cudnn.data_type.FLOAT,
                    compute_data_type=cudnn.data_type.FLOAT,
                )
                cache['name_to_cu_tensor'] = {
                    'q_cu': g_fwd.tensor_like(q.detach()),
                    'k_cu': g_fwd.tensor_like(k_view.detach()),
                    'v_cu': g_fwd.tensor_like(v_view.detach()),
                    'seqlens_q_cu': g_fwd.tensor_like(seqlens_q.detach()),
                    'seqlens_kv_cu': g_fwd.tensor_like(seqlens_kv.detach())
                }
                cu_tens = cache['name_to_cu_tensor']

                o_forward, stats_forward = g_fwd.sdpa(
                    name="sdpa",
                    q=cu_tens['q_cu'],
                    k=cu_tens['k_cu'],
                    v=cu_tens['v_cu'],
                    is_inference=False,
                    attn_scale=1.0 / math.sqrt(D),
                    use_causal_mask=False,
                    use_padding_mask=True,
                    seq_len_q=cu_tens['seqlens_q_cu'],
                    seq_len_kv=cu_tens['seqlens_kv_cu']
                )

                o_forward.set_output(True).set_dim(o.shape).set_stride(o.stride()).set_data_type(dtype)
                stats_forward.set_output(True).set_data_type(cudnn.data_type.FLOAT).set_dim(stats.shape).set_stride(stats.stride())

                cu_tens['o_forward_cu'] = o_forward
                cu_tens['stats_forward_cu'] = stats_forward

                assert_cudnn_shape(cu_tens['q_cu'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['k_cu'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['v_cu'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['o_forward_cu'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['stats_forward_cu'], (B, H, N, 1))
                assert_cudnn_shape(cu_tens['seqlens_q_cu'], (B, 1, 1, 1))
                assert_cudnn_shape(cu_tens['seqlens_kv_cu'], (B, 1, 1, 1))

                g_fwd.validate()
                g_fwd.build_operation_graph()
                g_fwd.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
                g_fwd.check_support()
                g_fwd.build_plans()

                cache['compiled_graph_fwd'] = g_fwd

                # TODO: Is this safe?
                cache['workspace'] = torch.empty(
                    g_fwd.get_workspace_size(),
                    device=q.device, dtype=torch.uint8
                )
            
            name_to_cu_tensor = cache['name_to_cu_tensor']
            variant_pack_forward = {
                name_to_cu_tensor[name]: tensor for name, tensor in [
                    ('q_cu', q),
                    ('k_cu', k_view),
                    ('v_cu', v_view),
                    ('o_forward_cu', o),
                    ('stats_forward_cu', stats),
                    ('seqlens_q_cu', seqlens_q),
                    ('seqlens_kv_cu', seqlens_kv)
                ]
            }
            cache['compiled_graph_fwd'].execute(variant_pack_forward, cache['workspace'])
            ctx.save_for_backward(q, k_view, v_view, o, stats, seqlens_kv)
            ctx.B, ctx.N, ctx.L = B, N, L
            ctx.dtype = dtype
            return o

        @staticmethod
        def backward(ctx, dO):
            q, k_view, v_view, o, stats, seqlens_kv = ctx.saved_tensors
            B, N, L = ctx.B, ctx.N, ctx.L
            seqlens_q = cache['seqlens_q']
            cu_tens = cache['name_to_cu_tensor']

            init_or_check_tensor_attrs('dO', dO)

            # CHECK FOR WEIRD STRIDE
            # if dO's stride is all zeros, make it contiguous first
            if all(s == 0 for s in dO.stride()):
                dO = dO.contiguous()
            assert dO.shape == (B, H, N, D)
            # dO = dO.contiguous()

            if 'dQ' not in cache:
                cache['dQ'] = torch.empty_like(q)
                cache['dK'] = torch.empty_like(k_view)
                cache['dV'] = torch.empty_like(v_view)

            dQ, dK, dV = cache['dQ'], cache['dK'], cache['dV']

            if 'compiled_graph_bwd' not in cache:
                print(f"Compiling CuDNN backward graph ...")
                g_bwd = cudnn.pygraph(
                     io_data_type=dtype,
                     intermediate_data_type=cudnn.data_type.FLOAT,
                     compute_data_type=cudnn.data_type.FLOAT,
                )

                cu_tens['q_cu_bwd'] = g_bwd.tensor_like(q.detach())
                cu_tens['k_cu_bwd'] = g_bwd.tensor_like(k_view.detach())
                cu_tens['v_cu_bwd'] = g_bwd.tensor_like(v_view.detach())
                cu_tens['o_cu_bwd'] = g_bwd.tensor_like(o.detach())
                cu_tens['dO_cu_bwd'] = g_bwd.tensor_like(dO.detach())
                cu_tens['stats_cu_bwd'] = g_bwd.tensor_like(stats.detach())
                cu_tens['seqlens_q_cu_bwd'] = g_bwd.tensor_like(seqlens_q.detach())
                cu_tens['seqlens_kv_cu_bwd'] = g_bwd.tensor_like(seqlens_kv.detach())

                dQ_bwd_cu, dK_bwd_cu, dV_bwd_cu = g_bwd.sdpa_backward(
                    name="sdpa_backward",
                    q=cu_tens['q_cu_bwd'],
                    k=cu_tens['k_cu_bwd'],
                    v=cu_tens['v_cu_bwd'],
                    o=cu_tens['o_cu_bwd'],
                    dO=cu_tens['dO_cu_bwd'],
                    stats=cu_tens['stats_cu_bwd'],
                    attn_scale=1.0 / math.sqrt(D),
                    use_causal_mask=False,
                    use_padding_mask=True,
                    seq_len_q=cu_tens['seqlens_q_cu_bwd'],
                    seq_len_kv=cu_tens['seqlens_kv_cu_bwd']
                )

                dQ_bwd_cu.set_output(True).set_dim(dQ.size()).set_stride(dQ.stride())
                dK_bwd_cu.set_output(True).set_dim(dK.size()).set_stride(dK.stride())
                dV_bwd_cu.set_output(True).set_dim(dV.size()).set_stride(dV.stride())

                cu_tens['dQ_cu_bwd'] = dQ_bwd_cu
                cu_tens['dK_cu_bwd'] = dK_bwd_cu
                cu_tens['dV_cu_bwd'] = dV_bwd_cu

                assert_cudnn_shape(cu_tens['q_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['k_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['v_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['dQ_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['dK_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['dV_cu_bwd'], (B, H, N + L, D))
                assert_cudnn_shape(cu_tens['o_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['dO_cu_bwd'], (B, H, N, D))
                assert_cudnn_shape(cu_tens['stats_cu_bwd'], (B, H, N, 1))
                assert_cudnn_shape(cu_tens['seqlens_q_cu_bwd'], (B, 1, 1, 1))
                assert_cudnn_shape(cu_tens['seqlens_kv_cu_bwd'], (B, 1, 1, 1))

                g_bwd.validate()
                g_bwd.build_operation_graph()
                g_bwd.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
                g_bwd.check_support()
                g_bwd.build_plans()

                cache['compiled_graph_bwd'] = g_bwd
                cache['workspace'] = torch.empty(
                    max(cache['compiled_graph_fwd'].get_workspace_size(), cache['compiled_graph_bwd'].get_workspace_size()),
                    device=q.device, dtype=torch.uint8
                )

            variant_pack_backward = {
                cu_tens[name]: tensor for name, tensor in [
                    ('dQ_cu_bwd', cache['dQ']),
                    ('dK_cu_bwd', cache['dK']),
                    ('dV_cu_bwd', cache['dV']),
                    ('q_cu_bwd', q),
                    ('k_cu_bwd', k_view),
                    ('v_cu_bwd', v_view),
                    ('o_cu_bwd', o),
                    ('dO_cu_bwd', dO),
                    ('stats_cu_bwd', stats),
                    ('seqlens_q_cu_bwd', seqlens_q),
                    ('seqlens_kv_cu_bwd', seqlens_kv)
                ]
            }

            cache['compiled_graph_bwd'].execute(variant_pack_backward, cache['workspace'])
            assert cache['dQ'].shape == (B, H, N, D)
            dQ = cache['dQ'].permute(0, 2, 1, 3) # B H N D -> B N H D

            assert cache['dK'].shape == (B, H, N + L, D)
            assert cache['dV'].shape == (B, H, N + L, D)

            dKV = torch.stack([cache['dK'], cache['dV']], dim=2)
            assert dKV.shape == (B, H, 2, N + L, D)

            dKV = dKV.permute(0, 3, 2, 1, 4) # B H 2 N D -> B N 2 H D

            return None, None, None, dQ, dKV, None

    return CuDNNAttention

The problem is, when I do this, I get massive numerical error. Do you have thoughts on why making dO contiguous might cause issues?

[ERROR] Exception CUDNN_BACKEND_TENSOR_DESCRIPTOR cudnnFinalize failed cudnn_status: CUDNN_STATUS_NOT_INITIALIZED

Hello, I have hit a problem: I cannot run the test case bin/samples [fusion]. The error messages are below:

bin/samples [fusion]
Filters: [fusion]
Randomness seeded to: 415973869
TEST_CASE :: ConvScaleBiasAddAct sample
====DIMENSIONS====
input dims are 4, 24, 31, 31
filter dims are 32, 24, 3, 3
output dims are 4, 32, 31, 31
[ERROR] Exception CUDNN_BACKEND_TENSOR_DESCRIPTOR cudnnFinalize failed cudnn_status: CUDNN_STATUS_NOT_INITIALIZED

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samples is a Catch2 v3.5.3 host application.
Run with -? for options

-------------------------------------------------------------------------------
ConvScaleBiasAddAct sample
-------------------------------------------------------------------------------
/home/tairan/work/cuda/cudnn-frontend/samples/legacy_samples/test_list.cpp:705
...............................................................................

/home/tairan/work/cuda/cudnn-frontend/samples/legacy_samples/fusion_sample.cpp:347: FAILED:
  CHECK( false )
with message:
  TEST_CASE :: ConvScaleBiasAddAct sample

cuda version:12.2
driver version:535.129.03
cudnn version:9

On the other hand, when I compile the project with CMake, it gives me these messages:

Warning: Unused direct dependencies:
        /usr/local/cuda-12.2/lib64/libnvrtc.so.12
        /lib/x86_64-linux-gnu/libcuda.so.1
        /usr/local/cuda-12.2/lib64/libcudnn_adv.so.9
        /usr/local/cuda-12.2/lib64/libcudnn_ops.so.9
        /usr/local/cuda-12.2/lib64/libcudnn_cnn.so.9
        /usr/local/cuda-12.2/lib64/libcudnn_graph.so.9
        /usr/local/cuda-12.2/lib64/libcudnn_engines_runtime_compiled.so.9
        /usr/local/cuda-12.2/lib64/libcudnn_engines_precompiled.so.9
        /usr/local/cuda-12.2/lib64/libcudnn_heuristic.so.9

Maybe the problem comes from here? I can run normal cuDNN code such as the MNIST sample, so I don't think it is a compatibility problem between CUDA and cuDNN.

Error running Flash & BatchNormalization tests

Hello!

We've been working on a remoting system for CUDA calls that generally works pretty well with most applications, including most cudnn-frontend tests. However, we've been having issues with a few of the Flash and BatchNorm tests, specifically the following:

Flash with rng dropout
Flash with no dropout
Flash backward
BN Finalize Graph
SGBN Add Relu Graph
LayerNorm Training
LayerNorm Inference
SDPA Graph with serialization
Fused scalar
IMMA execution with manual autotuning
BN Finalize
Batch normalization
BF16 LLM Flash MHA Fprop sample
BF16 LLM Flash MHA Bprop sample

All of these segfault on cudnnBackendExecute. Here's an example stack trace for one of them:

Thread 6 "gpu-controller" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc1fff000 (LWP 85939)]
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:383
383	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:383
#1  0x00007fff5ece7876 in ?? () from /usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9.1.0
#2  0x00007fff5ece849d in ?? () from /usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9.1.0
#3  0x00007fffc0328345 in cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan const&, cudnn::backend::VariantPack&) () from /usr/local/cuda/lib64/libcudnn_graph.so.9.1.0
#4  0x00007fffc03297ef in cudnnBackendExecute () from /usr/local/cuda/lib64/libcudnn_graph.so.9.1.0
#5  0x00005555557934ee in CudaExecutor::wcudnnBackendExecute (
    this=0x555556e36b80 <CudaExecutor::getInstance()::instance>, cudaBuffer=0x300001b8)
    at /home/ubuntu/cedana-gpu/src/gpu-controller/execution/cudnn_execution.cpp:374
#6  0x0000555555754c03 in CudaExecutor::Execute (this=0x555556e36b80 <CudaExecutor::getInstance()::instance>)
    at /home/ubuntu/cedana-gpu/src/gpu-controller/execution/cuda_executor.cpp:1378

I'm mostly curious whether there's anything specific about the way the kernels for Flash or BN are written that would cause an error like this. The inputs to the execute() call in all cases seem to be correct, which we validated by printing them out manually. The memory addresses seem fine as well.

Thank you!

Why `cudnnConvolutionBackwardData` call `cudnn::ops::convertTensor_kernel<__half, __half, float, 0>(float, __half const*` ?

cudnnConvolutionBackwardData calls cudnn::ops::convertTensor_kernel<__half, __half, float, 0>(float, __half const*, ...), which makes this API call take 6 ms longer.

NSight log:
(image: Nsight timeline of the cudnnConvolutionBackwardData kernels)

  • 1 & 2 are normal kernels in cudnnConvolutionBackwardData
  • 3 is an abnormal convertTensor_kernel call in cudnnConvolutionBackwardData, which makes it very slow.

This was tested on a V100.

Normal cuDNN API call log:

I! CuDNN (v8101) function cudnnConvolutionBackwardData() called:
i!     handle: type=cudnnHandle_t; streamId=0x55cd8ad4dbc0;
i!     alpha: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     wDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[128,256,1,1];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NHWC (1);
i!     wData: location=dev; addr=0x7f620d0f0800;
i!     dyDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,128,56,56];
i!         strideA: type=int; val=[401408,1,7168,128];
i!     dyData: location=dev; addr=0x7f5cd8aba200;
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
i!         reorderType: type=int; val=0;
i!         arrayLength: type=int; val=2;
i!         padA: type=int; val=[0,0];
i!         strideA: type=int; val=[1,1]; 
i!         dilationA: type=int; val=[1,1];
i!         groupCount: type=int; val=1;
i!     algo: type=cudnnConvolutionBwdDataAlgo_t; val=CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 (1);
i!     workSpace: location=dev; addr=0x7f5cf12ba200;
i!     workSpaceSizeInBytes: type=size_t; val=1;
i!     beta: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     dxDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,256,56,56];
i!         strideA: type=int; val=[802816,1,14336,256];
i!     dxData: location=dev; addr=0x7f5e91aba200;
i! Time: 2021-12-09T11:24:46.391427 (0d+0h+0m+55s since start)
i! Process=3115; Thread=3923; GPU=0; Handle=0x55cd8aaacfe0; StreamId=0x55cd8ad4dbc0.

Abnormal cuDNN API call log (the one that calls cudnn::ops::convertTensor_kernel<__half, __half, float, 0>(float, __half const*, ...)):

I! CuDNN (v8101) function cudnnConvolutionBackwardData() called:
i!     handle: type=cudnnHandle_t; streamId=0x56322bc56140;
i!     alpha: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     wDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,256,1,1];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NHWC (1);
i!     wData: location=dev; addr=0x7ef58d662a00;
i!     dyDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,512,28,28];
i!         strideA: type=int; val=[401408,1,14336,512];
i!     dyData: location=dev; addr=0x7ef1f9bb6400;
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
i!         reorderType: type=int; val=0;
i!         arrayLength: type=int; val=2;
i!         padA: type=int; val=[0,0];
i!         strideA: type=int; val=[2,2];
i!         dilationA: type=int; val=[1,1];
i!         groupCount: type=int; val=1;
i!     algo: type=cudnnConvolutionBwdDataAlgo_t; val=CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 (1);
i!     workSpace: location=dev; addr=0x7ef0593b6200;
i!     workSpaceSizeInBytes: type=size_t; val=822083600;
i!     beta: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     dxDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,256,56,56];
i!         strideA: type=int; val=[802816,1,14336,256];
i!     dxData: location=dev; addr=0x7ef2123b6400;
i! Time: 2021-12-09T11:21:19.554570 (0d+0h+0m+41s since start)
i! Process=1839; Thread=2213; GPU=0; Handle=0x56322bbabe60; StreamId=0x56322bc56140.

Thanks in advance!

Support use of external/system Catch2 installation

I am building cudnn-frontend v1.1.0 in a restricted environment that is unable to use the FetchContent mechanism. I already have an existing build of Catch2 that should be consumed in its stead.

Elaborating the upper part of samples/CMakeLists.txt as follows will make this possible:

find_package(Catch2 QUIET)

if(NOT Catch2_FOUND)
    Include(FetchContent)

    # Fetch and build catch2
    FetchContent_Declare(
      Catch2
      GIT_REPOSITORY https://github.com/catchorg/Catch2.git
      GIT_TAG        v3.3.2
    )
    FetchContent_MakeAvailable(Catch2)
endif()

(I would submit this as a PR, but I see that PRs are not being accepted.)

I am then able to specify -DCatch2_ROOT=/path/to/catch2-install to CMake, and the samples build makes use of the existing library without issue.

CUDNN not working with RTX A4000

Hi there,

I'm not sure where to open this issue, so I'm dropping it here since I cannot find the appropriate channel.

I have been trying to crunch some numbers with several Python packages, but none of them works with the RTX A4000. After debugging for days, I found the issue is not my environment but the GPU.

More specifically, I have been trying to use the following packages:

MXNet
CuPy

My first impression was that there was something wrong with my environment. Then I tried to just run a Docker container to test the GPU along with the MXNet package:

docker pull mxnet/python:gpu

Trying to run either mxnet or cupy hangs... it does not compute anything...

Then I tried the default nvidia container:

docker run --rm --gpus all nvidia/cuda nvidia-smi

The output seems ok:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:01:00.0  On |                  Off |
| 41%   49C    P2    37W / 140W |    787MiB / 16109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

It is very confusing because I can run PyTorch, TensorRT, etc. on this machine...

Then I tried to run the installation again, following the NVIDIA guide for cuDNN:

$ cp -r /usr/src/cudnn_samples_v8/ $HOME
$ cd $HOME/cudnn_samples_v8/mnistCUDNN
$ make clean && make
$ time ./mnistCUDNN

Then what I get is:

Executing: mnistCUDNN
cudnnGetVersion() : 8300 , CUDNN_VERSION from cudnn.h : 8300 (8.3.0)
Host compiler version : GCC 7.5.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 48  Capabilities 8.6, SmClock 1560.0 Mhz, MemSize (Mb) 16109, MemClock 7001.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...

^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.017408 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.018432 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.027648 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.060416 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.096256 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 2000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037888 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.046080 time requiring 2000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.055296 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.057344 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.078848 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.016384 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.016384 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.017408 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.055296 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.060416 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 2000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037888 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.041984 time requiring 2000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.050176 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.063488 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.076800 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.009216 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.031744 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.061440 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.187392 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.235520 time requiring 178432 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.066560 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.080864 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.116736 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.120832 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.121856 time requiring 4656640 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.024576 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.029696 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.041984 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.076800 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.078848 time requiring 178432 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 2057744 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.037888 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.057344 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.091136 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.106496 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.108544 time requiring 64000 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_EXECUTION_FAILED for Algo 7: -1.000000 time requiring 1433120 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

real	7m18.842s
user	7m17.724s
sys	0m0.516s

This takes a LONG time to execute.

Then, for a final test, I swapped the RTX A4000 for an old NVIDIA Tesla K80... and suddenly everything works like a charm.

I'm almost sure something is wrong either with the driver or with the GPU firmware. I tried the same code on Google Colab, on the old K80, and even on a Jetson AGX, and it works everywhere except on the RTX A4000.

The CuPy (Python) system information, which also takes a long time to print, is:

import cupy as cp
print(cp.show_config())
OS                           : Linux-5.4.0-42-generic-x86_64-with-Ubuntu-18.04-bionic
Python Version               : 3.6.9
CuPy Version                 : 9.6.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.19.5
SciPy Version                : 1.5.4
Cython Build Version         : 0.29.22
Cython Runtime Version       : 0.29.22
CUDA Root                    : /usr/local/cuda
nvcc PATH                    : /usr/local/cuda/bin/nvcc
CUDA Build Version           : 10020
CUDA Driver Version          : 11050
CUDA Runtime Version         : 10020
cuBLAS Version               : (available)
cuFFT Version                : 10102
cuRAND Version               : 10102
cuSOLVER Version             : (10, 3, 0)
cuSPARSE Version             : (available)
NVRTC Version                : (10, 2)
Thrust Version               : 100907
CUB Build Version            : 100800
Jitify Build Version         : 60e9e72
cuDNN Build Version          : 8204
cuDNN Version                : 8300
NCCL Build Version           : 21104
NCCL Runtime Version         : 21104
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA RTX A4000
Device 0 Compute Capability  : 86
Device 0 PCI Bus ID          : 0000:01:00.0
None

Apologies if this is not the right place for this issue; I really appreciate your help.

[Question] How to do attention with different sequence lengths for queries and keys/values?

I have more keys/values than queries, and I'm trying to update the queries only.
I have structured my tensors as follows:

    # B = batch_size, N and L are sequence lengths.
    # There are N + L k/v pairs, and only N queries

    assert not k_view.is_contiguous() and k_view.shape == (B, num_heads, (N + L),  head_dim)
    assert not v_view.is_contiguous() and v_view.shape == (B, num_heads, (N + L), head_dim)
    assert q.shape == (B, num_heads, N, head_dim)

    q_cu = g_fwd.tensor_like(q)
    k_cu = g_fwd.tensor_like(k_view)
    v_cu = g_fwd.tensor_like(v_view)
    
    seqlens_q_cu = g_fwd.tensor_like(seqlens_q)
    seqlens_kv_cu = g_fwd.tensor_like(seqlens_kv)
    stats = torch.empty(B, num_heads, N, 1, dtype=torch.float32, device=q.device)

    o_forward, stats_forward = g_fwd.sdpa(
        name="sdpa",
        q=q_cu,
        k=k_cu,
        v=v_cu,
        is_inference=False,
        attn_scale=1.0 / math.sqrt(head_dim),
        use_causal_mask=False,
        use_padding_mask=True,
        seq_len_q=seqlens_q_cu,
        seq_len_kv=seqlens_kv_cu
    )

But this gives the error:

ValueError: dimensions mismatch as broadcasting 2 non-one dimension sizes.

How can I fix this? Should I be using the ragged tensor API?
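For context, a minimal sketch of the padding-mask setup following the convention used in this repository's Python SDPA samples. The int32, shape (B, 1, 1, 1) layout for the sequence-length tensors is an assumption taken from those samples, and all sizes below are hypothetical:

import math
import torch
import cudnn

B, num_heads, N, L, head_dim = 2, 8, 128, 64, 64  # hypothetical sizes

q = torch.randn(B, num_heads, N, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(B, num_heads, N + L, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(B, num_heads, N + L, head_dim, device="cuda", dtype=torch.float16)

# Per-batch actual lengths: int32 with shape (B, 1, 1, 1), as in the SDPA samples.
seqlens_q = torch.full((B, 1, 1, 1), N, device="cuda", dtype=torch.int32)
seqlens_kv = torch.full((B, 1, 1, 1), N + L, device="cuda", dtype=torch.int32)

g_fwd = cudnn.pygraph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

q_cu = g_fwd.tensor_like(q)
k_cu = g_fwd.tensor_like(k)
v_cu = g_fwd.tensor_like(v)
seqlens_q_cu = g_fwd.tensor_like(seqlens_q)
seqlens_kv_cu = g_fwd.tensor_like(seqlens_kv)

o_cu, stats_cu = g_fwd.sdpa(
    name="sdpa",
    q=q_cu,
    k=k_cu,
    v=v_cu,
    is_inference=False,
    attn_scale=1.0 / math.sqrt(head_dim),
    use_causal_mask=False,
    use_padding_mask=True,
    seq_len_q=seqlens_q_cu,
    seq_len_kv=seqlens_kv_cu,
)
o_cu.set_output(True).set_data_type(q.dtype)
stats_cu.set_output(True).set_data_type(torch.float32)
# validate / build / execute would follow as in the repository's SDPA samples.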

Cannot build nvidia-tensorflow with v0.5

A Dockerfile to reproduce:

FROM nvidia/cuda:11.4.2-cudnn8-devel-ubuntu20.04
ENV TZ=Europe/London
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ >/etc/timezone
RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y build-essential git git-lfs wget vim software-properties-common unzip python3-pip
RUN update-alternatives --install /usr/bin/python python $(which python3) 10
RUN pip install --upgrade numpy astor
WORKDIR /workdir
RUN wget https://github.com/bazelbuild/bazel/releases/download/0.26.1/bazel-0.26.1-installer-linux-x86_64.sh
RUN chmod +x bazel-0.26.1-installer-linux-x86_64.sh && ./bazel-0.26.1-installer-linux-x86_64.sh

RUN git clone https://github.com/NVIDIA/cudnn-frontend.git
RUN git clone --branch r1.15.5+nv21.10 --single-branch https://github.com/NVIDIA/tensorflow.git
WORKDIR /workdir/tensorflow
ENV TF_ENABLE_XLA=0 \
    TF_NEED_OPENCL_SYCL=0 \
    TF_NEED_ROCM=0 \
    TF_NEED_CUDA=1 \
    TF_NEED_TENSORRT=0 \
    TF_CUDA_VERSION=11 \
    TF_CUBLAS_VERSION=11 \
    TF_NCCL_VERSION=2 \
    TF_CUDNN_VERSION=8 \
    TF_CUDA_PATHS="/usr/include,/usr/lib/x86_64-linux-gnu,/usr/local/cuda/include,/usr/local/cuda/lib64,/usr/local/cuda/bin,/usr/local/cuda" \
    TF_CUDA_COMPUTE_CAPABILITIES=3.5,5.0,5.2,6.1,7.0,7.5,8.6 \
    CC_OPT_FLAGS="-march=sandybridge -mfma -mfpmath=both -fopenmp"
RUN PYTHON_BIN_PATH=$(which python) ./configure
RUN bazel build --config=opt --config=noaws --config=nogcp --config=nohdfs --config=noignite --config=nokafka //tensorflow/tools/pip_package:build_pip_package

Error message:

ERROR: /workdir/tensorflow/tensorflow/stream_executor/cuda/BUILD:343:1: C++ compilation of rule '//tensorflow/stream_executor/cuda:cudnn_plugin' failed (Exit 1)
In file included from bazel-out/host/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend.h:116,
                 from tensorflow/stream_executor/cuda/cuda_dnn.cc:50:
bazel-out/host/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend/third_party/cudnn_frontend/include/cudnn_frontend_ExecutionPlanCache.h:29:10: fatal error: cudnn_frontend_OperationGraph.h: No such file or directory
   29 | #include <cudnn_frontend_OperationGraph.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Target //tensorflow/tools/pip_package:build_pip_package failed to build

Using git clone --branch v0.4.1 --single-branch https://github.com/NVIDIA/cudnn-frontend.git instead seems to work.

Question About Reduce Node

Does the reduce node support performing a reduction on a non-contiguous dimension?
I appreciate your help!

Forward conv1d + transposition + conv1d ?

Hi, I'm trying to find out whether one can use cudnn-frontend to perform/fuse the following:

  • a forward 1-D convolution on n channels, with one 1-D filter of size w applied separately to every channel of s samples (with padding of w/2, one gets approximately n s-sized vectors)
  • a transposition of the result ('lines become columns') - some call this a 'corner turn'
  • then a forward convolution on s channels, with f filters, each of size n (yielding f s-sized vectors)

I'm a bit fuzzy on how to perform that middle operation. It might just be a simple swap of dimensions somewhere instead of an actual memory-move, but I'm not sure how the code would look.
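To illustrate just the "swap of dimensions without an actual memory move" part, here is a PyTorch reference sketch with hypothetical sizes; the graph API would describe the same layout change through the dims and strides of its tensor descriptors:

import torch

n, s = 4, 6                       # hypothetical channel count and sample count
y = torch.randn(1, n, s)          # result of the first conv: n channels of s samples

y_t = y.transpose(1, 2)           # the "corner turn": logical shape becomes (1, s, n)
print(y.stride(), y_t.stride())   # (24, 6, 1) vs (24, 1, 6): same storage, swapped strides
print(y_t.is_contiguous())        # False: no data was moved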

Thanks in advance for your help

Support batch size 0

Downstream PyTorch issue:
pytorch/pytorch#133780

Describe the bug
cuDNN frontend rejects batch_size=0 input with CUDNN_STATUS_BAD_PARAM

Expected behavior
cuDNN should return a tensor of shape [0, num_head, sequence_length, dims_per_head],
or something like that; maybe the heads/seq dims are permuted differently, but the important part is that batch_size would be 0.
It would have the same dimensions as Q.
You could even just return Q, probably.

System Environment (please complete the following information):
Accessing cuDNN via torch sdpa,
PyTorch 2.5.0.dev20240811+cu121

torch.backends.cudnn.version()
90100
  • cudnn_frontend version: not sure how to look this up in PyTorch
  • cudnn_backend version: 90100
  • GPU arch: RTX 4090
  • cuda runtime version: PyTorch bundled 12.1 (though 12.2 is installed on the system)
  • cuda driver version: 535.183.01
  • host compiler: g++ (Ubuntu 12.3.0-17ubuntu1) 12.3.0
  • OS: ubuntu24.04

API logs
Please attach API logs for both cudnn_frontend and cudnn_backend.

[cudnn_frontend] INFO: Validating SDPANode CUDNN_SDPA...
[cudnn_frontend] INFO: Inferrencing properties for Scaled_dot_product_flash_attention node  CUDNN_SDPA...
[cudnn_frontend] INFO: Validating matmul node bmm1...
[cudnn_frontend] INFO: Inferrencing properties for matmul node bmm1...
[cudnn_frontend] INFO: Validating pointwise node attn_scale...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node attn_scale...
[cudnn_frontend] INFO: Validating SoftmaxNode softmax...
[cudnn_frontend] INFO: Inferrencing properties for Softmax node softmax.
[cudnn_frontend] INFO: Validating reduction node M...
[cudnn_frontend] INFO: Inferrencing properties for reduction node M...
[cudnn_frontend] INFO: Validating pointwise node sub...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node sub...
[cudnn_frontend] INFO: Validating pointwise node exp...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node exp...
[cudnn_frontend] INFO: Validating reduction node sum...
[cudnn_frontend] INFO: Inferrencing properties for reduction node sum...
[cudnn_frontend] INFO: Validating pointwise node log...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node log...
[cudnn_frontend] INFO: Validating pointwise node add...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node add...
[cudnn_frontend] INFO: Validating pointwise node div...
[cudnn_frontend] INFO: Inferrencing properties for pointwise node div...
[cudnn_frontend] INFO: Validating matmul node bmm2...
[cudnn_frontend] INFO: Inferrencing properties for matmul node bmm2...
[cudnn_frontend] INFO: Creating cudnn tensors for node named 'CUDNN_SDPA':
[cudnn_frontend] ERROR: CUDNN_BACKEND_TENSOR_DESCRIPTOR: Check and Set the CUDNN_ATTR_TENSOR_DIMENSIONS Correctly cudnn_status: CUDNN_STATUS_BAD_PARAM. ["CUDNN_BACKEND_API_FAILED"] because (e.getCudnnStatus() != CUDNN_STATUS_SUCCESS) at /pytorch/third_party/cudnn_frontend/include/cudnn_frontend/cudnn_interface.h:89
[cudnn_frontend] ERROR: create_cudnn_tensor(tensor, tensors) at /pytorch/third_party/cudnn_frontend/include/cudnn_frontend/node_interface.h:707
[cudnn_frontend] ERROR: create_cudnn_tensors_(uid_to_backend_tensors) at /pytorch/third_party/cudnn_frontend/include/cudnn_frontend/node_interface.h:316
[cudnn_frontend] ERROR: sub_node->create_cudnn_tensors(uid_to_backend_tensors) at /pytorch/third_party/cudnn_frontend/include/cudnn_frontend/node_interface.h:318
[cudnn_frontend] ERROR: create_cudnn_tensors(uid_to_tensors) at /pytorch/third_party/cudnn_frontend/include/cudnn_frontend/node_interface.h:408


I! CuDNN (v90100 70) function cudnnCreate() called:
i!     handle: location=host; addr=0x582fcf32a680;
i! Time: 2024-08-18T02:59:15.167979 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGraphLibraryConfigInit() called:
i!     apiLog: type=cudnnLibConfig_t; val=CUDNN_STANDARD;
i! Time: 2024-08-18T02:59:15.168030 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnSetStream() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     streamId: type=cudaStream_t; streamId=(nil) (defaultStream);
i! Time: 2024-08-18T02:59:15.169306 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=0; Handle=0x582fcf7a4130; StreamId=(nil) (defaultStream).


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169405 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169445 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169458 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169468 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169483 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169494 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169516 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-08-18T02:59:15.169533 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90100 70) function cudnnGetErrorString() called:
i!     status: type=int; val=2000;
i! Time: 2024-08-18T02:59:15.169624 (0d+0h+0m+0s since start)
i! Process=7823; Thread=7823; GPU=NULL; Handle=NULL; StreamId=NULL.


To Reproduce
Steps to reproduce the behavior:

import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import SDPBackend, sdpa_kernel

device = torch.device('cuda')
dtype = torch.float16

batch = 0
q_heads = kv_heads = 10
q_tokens = 3952
kv_tokens = 16
head_dim = 64
q = torch.zeros(batch, q_heads, q_tokens, head_dim, device=device, dtype=dtype)
k = torch.zeros(batch, kv_heads, kv_tokens, head_dim, device=device, dtype=dtype)
v = torch.zeros(batch, kv_heads, kv_tokens, head_dim, device=device, dtype=dtype)

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    scaled_dot_product_attention(q, k, v)

Additional context
Add any other context about the problem here.

I'm trying to do attention on a batch-of-zero, because my program uses a static graph and I rely on zero-batching (index_select zero-batch of inputs, index_add zero-batch of outputs) to toggle functionality without adding branches to the logic.
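To make the zero-batching pattern above concrete, a small PyTorch-only illustration (shapes and values are hypothetical):

import torch

x = torch.randn(4, 3)                     # full batch of candidate inputs
idx = torch.empty(0, dtype=torch.long)    # empty index -> select a batch of zero rows

sub = torch.index_select(x, 0, idx)       # shape (0, 3): a zero-sized batch
out = sub * 2.0                           # ops on zero-sized tensors are effectively no-ops

acc = torch.zeros(4, 3)
acc.index_add_(0, idx, out)               # scattering a zero batch changes nothing

# The static graph stays branch-free: the same ops run whether idx is empty or not.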

Inference result of deep learning model is all NAN

Hi,

I'm currently using cuDNN to write a deep-learning super-resolution sample, but I found that the inference result of the model is all NaN. I then tried printing the outputs of the intermediate layers and found that the values grow after each block, exceeding the representable range after more than 20 blocks, as shown in the figure below.

[screenshot: intermediate layer outputs growing block by block until overflow]

I compared it with the Python version and confirmed that the input image and the weights of the convolutional layers are consistent, so I think I made some mistake while building the model with the cuDNN API. Could you please take a look at my code to see whether I made any obvious mistakes, and give me some debugging suggestions?
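(For concreteness, one way to localize where the two implementations diverge is to dump each block's output from both and compare them block by block; the file names and block count below are purely hypothetical.)

import numpy as np

num_blocks = 20  # hypothetical
for i in range(num_blocks):
    ref = np.load(f"ref_block_{i}.npy")     # block i output from the Python reference
    got = np.load(f"cudnn_block_{i}.npy")   # block i output from the cuDNN implementation
    print(f"block {i}: max|ref|={np.abs(ref).max():.3g} "
          f"max|cudnn|={np.abs(got).max():.3g} "
          f"max diff={np.abs(ref - got).max():.3g}")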

Thanks a lot in advance!

CUDNN_FRONTEND_BUILD_UNIT_TESTS option is broken

When I configure with -DCUDNN_FRONTEND_BUILD_UNIT_TESTS=ON, I get

CMake Error at CMakeLists.txt:49 (add_subdirectory):
  The source directory

    /path/to/cudnn-frontend/1.1.0/test

  does not contain a CMakeLists.txt file.

Is the test/ directory supposed to have some content that comes from somewhere else?

need default return value for cudnn_frontend::PointWiseDesc_v8::getPortCount() const

Hi all,

When building PyTorch, it treats warnings as errors and fails here:

In file included from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_Operation.h:36,
                 from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_OperationGraph.h:36,
                 from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_Heuristics.h:30,
                 from /opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend.h:101,
                 from /opt/pytorch/pytorch/aten/src/ATen/native/cudnn/Conv_v8.cpp:10:
/opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h: In member function ‘int64_t cudnn_frontend::PointWiseDesc_v8::getPortCount() const’:
/opt/pytorch/pytorch/cmake/../third_party/cudnn_frontend/include/cudnn_frontend_PointWiseDesc.h:120:5: error: control reaches end of non-void function [-Werror=return-type]
  120 |     }

I think a default value should be returned to fix this.

error: ambiguous overload for ‘operator*’ in test_list.cpp

Hey, I have CUDA Toolkit 12.2 and cuDNN 8.9 installed on Ubuntu 22.04 LTS.
When following the samples build instructions below, I encounter the error in the title.

mkdir build; cd build
cmake -DCUDNN_PATH=/usr/lib/x86_64-linux-gnu/ -DCUDAToolkit_ROOT=/usr/local/cuda/bin ..
cmake --build . -j8

The configure step seems fine.

cmake configure output
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.91") 
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- cudnn found at /usr/lib/x86_64-linux-gnu/libcudnn.so.
-- cudnn_adv_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.
-- Found LIBRARY: /usr/include  
-- cuDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- cuDNN: /usr/include
-- Configuring done
-- Generating done
-- Build files have been written to: /home/kevin/code/thirdparty/cudnn-frontend/build

Then, in test_list.cpp, there is an overload-resolution error for operator* because of the data types.

nvcc error
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp: In function ‘void CATCH2_INTERNAL_TEST_102()’:
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: error: ambiguous overload for ‘operator*’ (operand types are ‘half’ and ‘float’)
 2127 |         half globalScaleOutput = afterConvOutput * scale1.hostPtr[0] * scale2.hostPtr[0];
      |                                  ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
      |                                  |                                 |
      |                                  half                              float
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(int, float)’ (built-in)
 2127 |         half globalScaleOutput = afterConvOutput * scale1.hostPtr[0] * scale2.hostPtr[0];
      |                                  ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(long long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(long long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:2127:50: note: candidate: ‘operator*(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:618:44: note: candidate: ‘__half operator*(const __half&, const __half&)’
  618 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator*(const __half &lh, const __half &rh) { return __hmul(lh, rh); }
      |                                            ^~~~~~~~
In file included from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h: In instantiation of ‘void scale_and_bias_tensor_cpu(const inputType*, outputType*, const scale_bias_type*, const scale_bias_type*, int64_t, const int64_t*) [with inputType = __half; scale_bias_type = float; outputType = float; int64_t = long int]’:
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:1560:50:   required from here
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: error: ambiguous overload for ‘operator*’ (operand types are ‘const __half’ and ‘const float’)
  128 |         outputData[i] = inputData[i] * scaleData[c] + biasData[c];
      |                         ~~~~~~~~~~~~~^~~~~~~~~~~~
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(long long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(long long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:128:38: note: candidate: ‘operator*(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:618:44: note: candidate: ‘__half operator*(const __half&, const __half&)’
  618 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator*(const __half &lh, const __half &rh) { return __hmul(lh, rh); }
      |                                            ^~~~~~~~
In file included from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h: In instantiation of ‘void gen_stats_cpu(const T_ELEM*, std::vector<std::pair<float, float> >&, int64_t, const int64_t*) [with T_ELEM = __half; int64_t = long int]’:
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:1650:24:   required from here
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: error: ambiguous overload for ‘operator+’ (operand types are ‘float’ and ‘const __half’)
  169 |         outputData[channel_index].first = outputData[channel_index].first + inputData[i];
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, long long unsigned int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, long long int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, long unsigned int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, long int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, unsigned int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:169:75: note: candidate: ‘operator+(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:606:44: note: candidate: ‘__half operator+(const __half&, const __half&)’
  606 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator+(const __half &lh, const __half &rh) { return __hadd(lh, rh); }
      |                                            ^~~~~~~~
In file included from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: error: ambiguous overload for ‘operator-’ (operand types are ‘const __half’ and ‘float’)
  182 |         T_ELEM diff = (inputData[i] - outputData[channel_index].first) * (inputData[i] - outputData[channel_index].first);
      |                       ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(long long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(long long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:37: note: candidate: ‘operator-(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:612:44: note: candidate: ‘__half operator-(const __half&, const __half&)’
  612 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator-(const __half &lh, const __half &rh) { return __hsub(lh, rh); }
      |                                            ^~~~~~~~
In file included from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: error: ambiguous overload for ‘operator-’ (operand types are ‘const __half’ and ‘float’)
  182 |         T_ELEM diff = (inputData[i] - outputData[channel_index].first) * (inputData[i] - outputData[channel_index].first);
      |                                                                          ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(long long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(long long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:182:88: note: candidate: ‘operator-(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:612:44: note: candidate: ‘__half operator-(const __half&, const __half&)’
  612 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator-(const __half &lh, const __half &rh) { return __hsub(lh, rh); }
      |                                            ^~~~~~~~
In file included from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: error: ambiguous overload for ‘operator+’ (operand types are ‘float’ and ‘__half’)
  183 |         outputData[channel_index].second = outputData[channel_index].second + diff;
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, long long unsigned int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, long long int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, long unsigned int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, long int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, unsigned int)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:183:77: note: candidate: ‘operator+(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:606:44: note: candidate: ‘__half operator+(const __half&, const __half&)’
  606 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator+(const __half &lh, const __half &rh) { return __hadd(lh, rh); }
      |                                            ^~~~~~~~
In file included from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h: In instantiation of ‘void batch_normalize(const T_ELEM*, T_ELEM*, const std::vector<std::pair<float, float> >&, int64_t, const int64_t*) [with T_ELEM = __half; int64_t = long int]’:
/home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:1652:26:   required from here
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: error: ambiguous overload for ‘operator-’ (operand types are ‘const __half’ and ‘const float’)
  204 |         outputData[i] = (inputData[i] - stats[batch_index].first) / (T_ELEM) std::sqrt(stats[batch_index].second);
      |                         ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(long long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(long long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(long unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(long int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(unsigned int, float)’ (built-in)
/home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:204:39: note: candidate: ‘operator-(float, float)’ (built-in)
In file included from /usr/local/cuda/include/cuda_fp16.h:4070,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_emu.h:28,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/fp16_dev.h:27,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/helpers.h:36,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/cpu_references.h:25,
                 from /home/kevin/code/thirdparty/cudnn-frontend/samples/test_list.cpp:27:
/usr/local/cuda/include/cuda_fp16.hpp:612:44: note: candidate: ‘__half operator-(const __half&, const __half&)’
  612 | __CUDA_HOSTDEVICE__ __forceinline__ __half operator-(const __half &lh, const __half &rh) { return __hsub(lh, rh); }
      |                                            ^~~~~~~~

how to map to the original algorithm

Hello, thanks for your great work @YangXu1990uiuc.

I am wondering whether we can find a mapping between the execution engines and the original algorithms like CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD / CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM / CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.

INT8 sample didn't work?

I tried CUDA 11.2 and 11.4 with cuDNN 8.1, 8.2, and 8.4; the sample ConvScaleBiasAct_int8 didn't work in any combination of these environments, and it returns the error:
"[ERROR] Exception CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED"
My device is an A100; is there anything wrong?

Update single header file for nlohmann json

nlohmann JSON is now at 3.11; consider updating that header file. It would also eliminate the need for compiling with C++11 features.

Be aware that in the latest 3.11 release, the single header file tries to specialize a templated value from the std library outside its namespace, which it shouldn't. I demonstrate a fix here:

ernestyalumni@728a474

Why is graph::check_support really slow?

I am using cudnn_frontend to perform a simple matmul. Everything works as expected with the FLOAT data type, but graph::check_support is really slow. Is there any way to speed this up?

Also, the check_support fails when using a DOUBLE data type - is this expected?

Thanks.

Execute matmul op failed

I tried to run the matmul op, following the code in the function "void run_matmul_bias_gelu" in fusion_sample.cpp. My code is below:

int main() {
  auto ha = loadBinary<float>("/workspace/features.bin");
  auto hb = loadBinary<float>("/workspace/weight.bin");
  auto hc = loadBinary<float>("/workspace/output.bin");

  float *a_ptr;
  cudaMalloc((void **)&a_ptr, ha.size() * sizeof(float));
  cudaMemcpy(a_ptr, ha.data(), ha.size() * sizeof(float),
             cudaMemcpyHostToDevice);
  float *b_ptr;
  cudaMalloc((void **)&b_ptr, hb.size() * sizeof(float));
  cudaMemcpy(b_ptr, hb.data(), hb.size() * sizeof(float),
             cudaMemcpyHostToDevice);

  float *c_ptr;
  cudaMalloc((void **)&c_ptr, hc.size() * sizeof(float));

  const int m = ha.size() / 96;
  const int n = 96;
  const int k = 96;

  int64_t stride[3];
  int64_t a_dim[3] = {1, m, k};
  int64_t b_dim[3] = {1, k, n};
  int64_t c_dim[3] = {1, m, n};
  generateStrides(a_dim, stride, 3, CUDNN_TENSOR_NCHW);
  auto aMatrixTensor =
      cudnn_frontend::TensorBuilder()
          .setDim(3, a_dim)
          .setStride(3, stride)
          .setId('a')
          .setAlignment(
              16)  // 16B alignment is needed to run a tensor core engine
          .setDataType(CUDNN_DATA_FLOAT)
          .build();

  generateStrides(b_dim, stride, 3, CUDNN_TENSOR_NCHW);
  auto bMatrixTensor = cudnn_frontend::TensorBuilder()
                           .setDim(3, b_dim)
                           .setStride(3, stride)
                           .setId('b')
                           .setAlignment(16)
                           .setDataType(CUDNN_DATA_FLOAT)
                           .build();

  generateStrides(c_dim, stride, 3, CUDNN_TENSOR_NCHW);
  auto afterMatMulTensor = cudnn_frontend::TensorBuilder()
                               .setDim(3, c_dim)
                               .setStride(3, stride)
                               .setId('A')  // after matmul
                               .setAlignment(16)
                               .setVirtual()
                               .setDataType(CUDNN_DATA_FLOAT)
                               .build();

  std::cout << aMatrixTensor.describe() << std::endl;
  std::cout << bMatrixTensor.describe() << std::endl;
  std::cout << afterMatMulTensor.describe() << std::endl;

  // Define the matmul desc
  auto matmulDesc = cudnn_frontend::MatMulDescBuilder()
                        .setComputeType(CUDNN_DATA_FLOAT)
                        .build();
  std::cout << matmulDesc.describe() << std::endl;

  // Create a matmul Node
  auto matmul_op = cudnn_frontend::OperationBuilder(
                       CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR)
                       .setaMatDesc(aMatrixTensor)
                       .setbMatDesc(bMatrixTensor)
                       .setcMatDesc(afterMatMulTensor)
                       .setmatmulDesc(matmulDesc)
                       .build();
  std::cout << matmul_op.describe() << std::endl;

  std::array<cudnn_frontend::Operation const *, 1> ops = {&matmul_op};
  cudnnHandle_t handle_;
  checkCudnnErr(cudnnCreate(&handle_));

  auto opGraph = cudnn_frontend::OperationGraphBuilder()
                     .setHandle(handle_)
                     .setOperationGraph(ops.size(), ops.data())
                     .build();
  auto plan =
      get_execplan_from_heuristics_else_fall_back(std::move(opGraph), handle_);

  auto workspace_size = plan.getWorkspaceSize();
  std::cout << plan.describe() << " requires workspace " << workspace_size
            << std::endl;

  void *workspace_ptr = nullptr;
  if (workspace_size > 0) {
    cudaMalloc(&workspace_ptr, (size_t)workspace_size);
  }

  void *data_ptrs[] = {a_ptr, b_ptr, c_ptr};
  int64_t uids[] = {'a', 'b', 'c'};
  auto variantPack = cudnn_frontend::VariantPackBuilder()
                         .setWorkspacePointer(workspace_ptr)
                         .setDataPointers(3, data_ptrs)
                         .setUids(3, uids)
                         .build();

  cudnnStatus_t status = cudnnBackendExecute(handle_, plan.get_raw_desc(),
                                             variantPack.get_raw_desc());

  if (workspace_size > 0) {
    (cudaFree(workspace_ptr));
  }
  checkCudnnErr(cudnnDestroy(handle_));
  cudnn_frontend::throw_if(
      [status]() { return (status != CUDNN_STATUS_SUCCESS); },
      "Plan execute error", status);
}

I ran it under cuda-memcheck and got the logs below. Is there anything wrong with my code? I use the docker image nvcr.io/nvidia/pytorch:21.10-py3, and my driver version is 515.65.01.

root@3c261e44c675:/workspace/trt_inference/bin# cuda-memcheck ./trtexc
========= CUDA-MEMCHECK
ha size is 5164704
hb size is 9216
hc size is 5164704
CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: CUDNN_DATA_FLOAT Id: 97 Alignment: 16 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 1,53799,96 ] Str [,5164704,96,1] isVirtual: 0 isByValue: 0
CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: CUDNN_DATA_FLOAT Id: 98 Alignment: 16 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 1,96,96 ] Str [,9216,96,1] isVirtual: 0 isByValue: 0
CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: CUDNN_DATA_FLOAT Id: 65 Alignment: 16 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 1,53799,96 ] Str [,5164704,96,1] isVirtual: 1 isByValue: 0
CUDNN_BACKEND_MATMUL_DESCRIPTOR : Math precision 0
CUDNN_BACKEND_OPERATION : OpMode: 19 X 0 Y 0 W 0 B 0 T 0 DW 0 DY 0 DX 0 C 0 A Mtrix 0x557468e919e0 B Mtrix 0x5574c343d5d0 C Mtrix 0x5574c35f37b0 P 0 MatMul 0x5574c35f39c0 Reduction 0 alphabetaType 4 Alpha: 1 1 Alpha2: 1 1 Beta: 0 0
Heuristic has 3 configurations
CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR : Matmul_eng0_k24=27, numeric_notes:[CUDNN_NUMERICAL_NOTE_TENSOR_CORE,] behavior_notes:[CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION,] workSpaceSize: 0 requires workspace 0
terminate called after throwing an instance of 'cudnn_frontend::cudnnException'
what(): Plan execute error
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

Hope to get help! Thanks.

Is dgrad+relu with fp32 supported?

I have tried to fuse dgrad with a pointwise operation like relu,
but cudnnBackendFinalize(plan) returns an error:
CUDNN_STATUS_NOT_SUPPORTED.

I am not sure whether my code is at fault.
I want to know: is dgrad + pointwise fusion supported now in cuDNN 8.6 for the Turing architecture?
Could you give me a sample of dgrad fusion?
In this project there is only one case for dgrad, and it is only supported for the Ampere architecture.

About cudnn backend

Thanks for your great work. I have a small question: was the actual implementation of cuDNN written in C++ and compiled with nvcc?

Trouble when fusing layernorm with pointwise operation.

Describe the bug
I'm trying to fuse a layernorm node with a pointwise node (mul).
I get "CUDNN_STATUS_BAD_PARAM_NULL_POINTER" when executing the graph. The error seems to come from the params passed to the variant pack, but I can't locate where my mistake is.

System Environment (please complete the following information):

  • cudnn_frontend version: 1.5.1
  • cudnn_backend version: 9.1.0
  • GPU arch: RTX 3050
  • cuda runtime version: 12.4
  • cuda driver version: 550.90.07
  • host compiler: gcc12.3.0
  • OS: Ubuntu22.04

API logs
be.log
fe.log

To Reproduce

import cudnn
import pytest
import torch

import functools

def torch_fork_set_rng(seed=None):
    def decorator_(func):
        @functools.wraps(func)
        def wrapper_(*args, **kwargs):
            with torch.random.fork_rng(devices=range(torch.cuda.device_count())):
                if seed is not None:
                    torch.manual_seed(seed)
                return func(*args, **kwargs)

        return wrapper_

    return decorator_

@torch_fork_set_rng(seed=0)
def test_layernorm(param_extract):

    embedding_dim, input_type = param_extract

    if input_type == torch.bfloat16:
        atol, rtol = 0.125, 0.125
    else:
        atol, rtol = 1e-2, 1e-2

    batch_size, seq_size = 16, 128
    N, C, H, W = batch_size * seq_size, embedding_dim, 1, 1

    epsilon_value = 1e-3

    x_gpu = (
        3
        * torch.randn(
            N, C, H, W, requires_grad=False, device="cuda", dtype=input_type
        ).to(memory_format=torch.channels_last)
        - 0.5
    )
    scale_gpu = (
        5
        * torch.randn(
            1, C, H, W, requires_grad=False, device="cuda", dtype=input_type
        ).to(memory_format=torch.channels_last)
        - 1
    )
    bias_gpu = (
        7
        * torch.randn(
            1, C, H, W, requires_grad=False, device="cuda", dtype=input_type
        ).to(memory_format=torch.channels_last)
        - 2
    )
    epsilon_cpu = torch.full(
        (1, 1, 1, 1),
        epsilon_value,
        requires_grad=False,
        device="cpu",
        dtype=torch.float32,
    )

    mask_gpu = torch.ones(N, C, H, W, device="cuda", dtype=input_type).to(memory_format=torch.channels_last)

    Y_expected = torch.nn.functional.layer_norm(
        x_gpu,
        [C, H, W],
        weight=scale_gpu.squeeze(0),
        bias=bias_gpu.squeeze(0),
        eps=epsilon_value,
    )
    mean_expected = x_gpu.to(torch.float32).mean(dim=(1, 2, 3), keepdim=True)
    inv_var_expected = torch.rsqrt(
        torch.var(x_gpu.to(torch.float32), dim=(1, 2, 3), keepdim=True) + epsilon_value
    )

    handle = cudnn.create_handle()
    stream = torch.cuda.current_stream().cuda_stream
    cudnn.set_stream(handle=handle, stream=stream)

    graph = cudnn.pygraph(
        intermediate_data_type=cudnn.data_type.FLOAT,
        compute_data_type=cudnn.data_type.FLOAT,
        handle=handle,
    )

    X = graph.tensor(
        name="X", dim=x_gpu.size(), stride=x_gpu.stride(), data_type=x_gpu.dtype
    )
    scale = graph.tensor(
        name="scale",
        dim=scale_gpu.size(),
        stride=scale_gpu.stride(),
        data_type=scale_gpu.dtype,
    )
    bias = graph.tensor(
        name="bias",
        dim=bias_gpu.size(),
        stride=bias_gpu.stride(),
        data_type=bias_gpu.dtype,
    )
    epsilon = graph.tensor(
        name="epsilon",
        dim=epsilon_cpu.size(),
        stride=epsilon_cpu.stride(),
        is_pass_by_value=True,
        data_type=epsilon_cpu.dtype,
    )

    mask = graph.tensor(
        name="mask", dim=mask_gpu.size(), stride=mask_gpu.stride(), data_type=mask_gpu.dtype
    )

    X_after_mul = graph.mul(name="mul", a=X, b=mask, compute_data_type=cudnn.data_type.FLOAT)

    Y, mean, inv_var = graph.layernorm(
        name="LN",
        norm_forward_phase=cudnn.norm_forward_phase.TRAINING,
        input=X_after_mul,
        scale=scale,
        bias=bias,
        epsilon=epsilon,
    )

    Y.set_output(True).set_data_type(x_gpu.dtype)
    mean.set_output(True).set_data_type(mean_expected.dtype)
    inv_var.set_output(True).set_data_type(inv_var_expected.dtype)

    graph.validate()
    graph.build_operation_graph()
    graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
    graph.check_support()
    graph.build_plans(cudnn.build_plan_policy.ALL)

    Y_actual = torch.empty_like(x_gpu)
    mean_actual = torch.empty_like(mean_expected)
    inv_var_actual = torch.empty_like(inv_var_expected)

    workspace = torch.empty(
        graph.get_workspace_size(), device="cuda", dtype=torch.uint8
    )

    graph.execute(
        {
            X: x_gpu.detach(),
            scale: scale_gpu.detach(),
            bias: bias_gpu.detach(),
            mask: mask_gpu,
            epsilon: epsilon_cpu,
            Y: Y_actual,
            mean: mean_actual,
            inv_var: inv_var_actual,
        },
        workspace,
        handle=handle,
    )

    torch.cuda.synchronize()

    torch.testing.assert_close(Y_expected, Y_actual, atol=atol, rtol=rtol)
    torch.testing.assert_close(mean_expected, mean_actual, atol=atol, rtol=rtol)
    torch.testing.assert_close(inv_var_expected, inv_var_actual, atol=atol, rtol=rtol)

    cudnn.destroy_handle(handle)

if __name__ == "__main__":
    test_layernorm((1600, torch.float))

Additional context
Thank you for any help.

Matmul test failure

I encountered a test failure after building and running the tests. Here are the details:

  • GPU: RTX 4090
  • Repo branch: v1.4.0
  • Operating System: Ubuntu 22.04.3
  • CUDA version: 12.2
  • cuDNN version: 8.9.7
  • g++ version: 11.4.0

I followed the build instructions as provided in the README:

mkdir build
cd build
cmake ..
make -j8

Output is:

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDAToolkit: /usr/local/cuda-12.2/targets/x86_64-linux/include (found version "12.2.140")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Performing Test HAVE_FLAG__ffile_prefix_map__nvme2_medsam_cuda_mode_cudnn_frontend_build__deps_catch2_src__
-- Performing Test HAVE_FLAG__ffile_prefix_map__nvme2_medsam_cuda_mode_cudnn_frontend_build__deps_catch2_src__ - Success
-- cudnn found at /usr/local/cuda-12.2/lib64/libcudnn.so.
-- Found LIBRARY: /usr/local/cuda-12.2/include
-- cuDNN: /usr/local/cuda-12.2/lib64/libcudnn.so
-- cuDNN: /usr/local/cuda-12.2/include
-- cudnn_adv_infer found at /usr/local/cuda-12.2/lib64/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/local/cuda-12.2/lib64/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/local/cuda-12.2/lib64/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/local/cuda-12.2/lib64/libcudnn_ops_train.so.
-- cudnn found at /usr/local/cuda-12.2/lib64/libcudnn.so.
-- cuDNN: /usr/local/cuda-12.2/lib64/libcudnn.so
-- cuDNN: /usr/local/cuda-12.2/include
-- cudnn_adv_infer found at /usr/local/cuda-12.2/lib64/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/local/cuda-12.2/lib64/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/local/cuda-12.2/lib64/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/local/cuda-12.2/lib64/libcudnn_ops_train.so.
-- Configuring done (6.0s)
-- Generating done (0.0s)
-- Build files have been written to: /nvme2/medsam/cuda-mode/cudnn-frontend/build
[100%] Linking CXX executable ../bin/samples
Warning: Unused direct dependencies:
        /usr/local/cuda-12.2/lib64/libnvrtc.so.12
        /usr/local/cuda-12.2/lib64/libnvrtc-builtins.so.12.2
        /lib/x86_64-linux-gnu/libcuda.so.1
        /usr/local/cuda-12.2/lib64/libnvJitLink.so.12
        /usr/local/cuda-12.2/lib64/libcudnn_adv_train.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_ops_train.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_cnn_train.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_adv_infer.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_cnn_infer.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_ops_infer.so.8
[100%] Built target samples

Then I ran the matmul test:

CUDNN_FRONTEND_LOG_FILE=stdout CUDNN_FRONTEND_LOG_INFO=1 ./build/bin/samples MatMul

Output is:

Filters: "MatMul"
Randomness seeded to: 1045110732
[cudnn_frontend] INFO: Validating matmul node GEMM...
[cudnn_frontend] INFO: Inferrencing properties for matmul node GEMM...
[cudnn_frontend] INFO: Creating cudnn tensors for node named 'GEMM':
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["BFLOAT16"] Id: 2 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,32,128 ] Str [ 4096,128,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["BFLOAT16"] Id: 3 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,128,64 ] Str [ 8192,64,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["FLOAT"] Id: 4 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,32,64 ] Str [ 2048,64,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: Building MatmulNode operations GEMM...
[cudnn_frontend] CUDNN_BACKEND_MATMUL_DESCRIPTOR : Math precision ["FLOAT"]
[cudnn_frontend] CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR has 1operations.
Tag: Matmul_

[cudnn_frontend] INFO:  Getting plan from heuristics for Matmul_ ...
[cudnn_frontend] CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR :
Heuristic Mode 3 has 6 configurations 
[cudnn_frontend] INFO: get_heuristics_list statuses: CUDNN_STATUS_SUCCESS 
[cudnn_frontend] INFO: config list has 6 configurations.
[cudnn_frontend] INFO: config list has 6 good configurations.
[cudnn_frontend] INFO: Extracting engine configs.
[cudnn_frontend] INFO: Querying engine config properties
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 0 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 1 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 2 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 3 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 4 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 5 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: plans.check_support(h) at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/graph_interface.h:260

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samples is a Catch2 v3.3.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
Matmul
-------------------------------------------------------------------------------
/nvme2/medsam/cuda-mode/cudnn-frontend/samples/cpp/matmuls.cpp:31
...............................................................................

/nvme2/medsam/cuda-mode/cudnn-frontend/samples/cpp/matmuls.cpp:80: FAILED:
  REQUIRE( graph.check_support(handle).is_good() )
with expansion:
  false

===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 11 | 10 passed | 1 failed
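
For comparison, here is a minimal sketch of the same matmul graph through the FE v1.x Python API, using the shapes and data types from the logged tensor descriptors (16x32x128 BF16 times 16x128x64 BF16, FP32 output). This is only a sketch: the graph.matmul keyword names (A, B) and cudnn.create_handle() are assumed from the patterns in test/python_fe and may differ between releases.

import cudnn
import torch

# Build the same 16x32x128 @ 16x128x64 BF16 matmul that the C++ sample runs.
handle = cudnn.create_handle()
graph = cudnn.pygraph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

a_gpu = torch.randn(16, 32, 128, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(16, 128, 64, device="cuda", dtype=torch.bfloat16)

A = graph.tensor(name="A", dim=a_gpu.size(), stride=a_gpu.stride(), data_type=a_gpu.dtype)
B = graph.tensor(name="B", dim=b_gpu.size(), stride=b_gpu.stride(), data_type=b_gpu.dtype)

# Keyword names A/B are assumed from the Python tests; adjust if your version differs.
C = graph.matmul(name="GEMM", A=A, B=B)
C.set_output(True).set_data_type(cudnn.data_type.FLOAT)

graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()  # the step that fails in the C++ sample above
graph.build_plans()

c_gpu = torch.empty(16, 32, 64, device="cuda", dtype=torch.float32)
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({A: a_gpu, B: b_gpu, C: c_gpu}, workspace, handle=handle)
torch.cuda.synchronize()
cudnn.destroy_handle(handle)

If this Python version also fails in check_support(), that points at the installed cuDNN library or driver combination rather than the sample code itself.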

Many samples don't work for me

I'm familiar with how to use CUDA and, more specifically, cuDNN, and I'm now trying to get started with the cuDNN backend API by experimenting with the samples in this repository. However, many of them don't seem to work for me.

My environment

  • Windows 11 version 21H2 (OS Build 22000.613)
  • CUDA 11.5, V11.5.119
  • cuDNN 8.3.3
  • Visual Studio 2019 version 16.0
  • (the machine I'm compiling this on only has a GTX 1060 MaxQ, but I don't think that's very relevant yet)

General errors

  • ConvBiasAct: Setting a relu clip slope doesn't work; the result is always
    [ERROR] Exception CUDNN_BACKEND_POINTWISE_DESCRIPTOR: SetAttribute CUDNN_ATTR_POINTWISE_RELU_LOWER_CLIP_SLOPE, Failed cudnn_status: CUDNN_STATUS_BAD_PARAM

  • ConvBiasAct: After commenting out the previous code, the next issue is
    [ERROR] Exception CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: GetAttribute CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE Failed cudnn_status: CUDNN_STATUS_BAD_PARAM.
    This code is gated behind a cuDNN version check, but my version is high enough, so I don't know why it fails. After commenting out this and the previous issue, the sample runs fine.

  • ConvBiasScaleAct, ConvBiasScaleAct_int8, ConvScaleBiasAddAct sample, and ConvScaleBiasAddAct sample_float all fail with the exact same
    [ERROR] Exception CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR: cudnn Finalize failed cudnn_status: CUDNN_STATUS_BAD_PARAM.
    I have no idea what could cause this.

  • Multihead attention sample, MatmulBiasAct sample, and MatmulBiasAct sample_float all fail with
    [ERROR] Exception CUDNN_BACKEND_MATMUL_DESCRIPTOR: cudnnCreate Failed cudnn_status: CUDNN_STATUS_ALLOC_FAILED.
    I'm not sure what could cause this; I have plenty of free CPU and GPU memory.

  • ConvDrelu and DgradDrelu fail with
    [ERROR] Exception CUDNN_BACKEND_OPERATION: SetAttribute CUDNN_ATTR_OPERATION_POINTWISE_DYDESC Failed cudnn_status: CUDNN_STATUS_BAD_PARAM
    Maybe pointwise derivatives aren't properly supported yet? There is no documentation in the API reference for this attribute name yet.

Hard-crash in BN Finalize:

  • Creation of the third tensor here fails:
    auto epsilonTensor = scalar_tensor_create(CUDNN_DATA_DOUBLE, 300);
    auto expDecayTensor = scalar_tensor_create(CUDNN_DATA_DOUBLE, 301);
    auto accumCountTensor = scalar_tensor_create(CUDNN_DATA_INT64, 302);
    I assume this is because there is some issue with the CUDNN_DATA_INT64 datatype? Specifically, the cudnnBackendFinalize of the tensor descriptor crashes internally (it's not just a non-success return value; the call itself actually generates an exception).

Conclusion

In conclusion, everything seems to be very brittle at the moment. Is this because I am using the wrong versions? Or is the cuDNN backend/frontend API still too much of a work in progress for end users? Or is Windows support still lacking?

cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.

When running pytest test/python_fe on the latest version, it returns:

        graph.validate()
        graph.build_operation_graph()
        graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
>       graph.check_support()
E       cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.

test/python_fe/test_matmul_bias_relu.py:278: cudnnGraphNotSupportedError
==================================================================================== warnings summary =====================================================================================
test/python_fe/test_apply_rope.py::test_apply_rope
  /home/vipuser/miniconda3/envs/llm-env/lib/python3.10/site-packages/torch/random.py:159: UserWarning: CUDA reports that you have 8 available devices, and you have used fork_rng without explicitly specifying which devices are being used. For safety, we initialize *every* CUDA device by default, which can be quite slow if you have a lot of CUDAs. If you know that you are only making use of a few CUDA devices, set the environment variable CUDA_VISIBLE_DEVICES or the 'devices' keyword argument of fork_rng with the set of devices you are actually using. For example, if you are using CPU only, set device.upper()_VISIBLE_DEVICES= or devices=[]; if you are using device 0 only, set CUDA_VISIBLE_DEVICES=0 or devices=[0].  To initialize all devices and suppress this warning, set the 'devices' keyword argument to `range(torch.cuda.device_count())`.
    warnings.warn(message)

test/python_fe/test_conv_genstats.py::test_conv_genstats
  /mnt/zzd/llm.c/cudnn-frontend/test/python_fe/test_conv_genstats.py:14: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
    conv_output = torch.nn.functional.conv2d(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================= short test summary info =================================================================================
FAILED test/python_fe/test_apply_rope.py::test_apply_rope - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_batchnorm.py::test_bn_relu_with_mask - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_batchnorm.py::test_drelu_dadd_dbn - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_bias.py::test_conv_bias_relu - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_bias.py::test_conv_relu - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_bias.py::test_conv3d_bias_leaky_relu - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_bias.py::test_leaky_relu_backward - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_bias.py::test_conv_int8 - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_genstats.py::test_conv_genstats - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_conv_reduction.py::test_reduction - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_matmul_bias_relu.py::test_matmul_bias_relu[param_extract0] - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_matmul_bias_relu.py::test_matmul_bias_relu[param_extract1] - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_matmul_bias_relu.py::test_matmul_bias_relu[param_extract4] - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
FAILED test/python_fe/test_matmul_bias_relu.py::test_matmul_bias_relu[param_extract5] - cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No execution plans built successfully.
================================================================ 14 failed, 3514 skipped, 2 warnings in 100.57s (0:01:40) =================================================================

My CUDA version is 12.4, cuDNN is 9.1, and the driver version is 550.54.15 on Ubuntu 22.04.
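
All of these failures come from check_support() raising after every candidate plan fails to build. As a rough sketch (build_or_skip is a hypothetical helper, not part of the FE API; the exception path is taken from the traceback above), a test could skip instead of failing when no plan can be built on the local cuDNN/GPU combination:

import pytest
import cudnn

def build_or_skip(graph):
    # Hypothetical helper: lower the graph and skip the test when the installed
    # cuDNN cannot build any execution plan for it.
    graph.validate()
    graph.build_operation_graph()
    graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
    try:
        graph.check_support()
    except cudnn._compiled_module.cudnnGraphNotSupportedError:
        pytest.skip("graph not supported by this cuDNN/GPU combination")
    graph.build_plans()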
