[Installation]: Build from source: Could NOT find Python. Could not build wheels for vllm.

Brennanzuz commented on July 3, 2024

[Installation]: Build from source: Could NOT find Python. Could not build wheels for vllm.

Comments (4)

youkaichao commented on July 3, 2024

I don't know why CMake fails to find Python. One thing worth trying is to follow the documentation and set up a conda environment.
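If it helps narrow things down, here is a minimal sketch for checking what the active interpreter exposes to a build. It assumes the failure is related to CMake not locating the interpreter or its development headers, which is only a guess and is not confirmed in this thread. `python_build_info` is a hypothetical helper name, not part of vLLM or CMake.

```python
import os
import sys
import sysconfig


def python_build_info():
    """Collect interpreter details that a CMake Python lookup commonly depends on."""
    include_dir = sysconfig.get_paths()["include"]
    return {
        # Interpreter that is active in the current shell or environment.
        "executable": sys.executable,
        # Major.minor version of that interpreter.
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Directory where the C headers for this interpreter are expected.
        "include_dir": include_dir,
        # Python.h is only present when the development headers are installed.
        "has_python_h": os.path.isfile(os.path.join(include_dir, "Python.h")),
    }


if __name__ == "__main__":
    for key, value in python_build_info().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter that runs the build. If `has_python_h` comes back `False`, or `executable` is not the interpreter you expected, that may be worth fixing before retrying.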

Brennanzuz commented on July 3, 2024

Okay, I've set up the conda environment, but now there's another issue that produces far too much output for the terminal to handle:

      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146346; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146350; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146354; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146358; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146362; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146366; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146370; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146406; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146410; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146414; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146418; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146422; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146426; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146430; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146434; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146470; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146474; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146478; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146482; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146486; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146490; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146494; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146498; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151390; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151394; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151398; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151402; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151406; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151410; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151414; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151418; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151479; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151483; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151487; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151491; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151495; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151499; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151503; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151507; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151568; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151572; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151576; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151580; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151584; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151588; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151592; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151596; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151657; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151661; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151665; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151669; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151673; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151677; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151681; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151685; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151876; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151880; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151884; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151888; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151892; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151896; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151900; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151904; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151965; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151969; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151973; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151977; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151981; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151985; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151989; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 151993; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152054; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152058; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152062; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152066; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152070; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152074; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152078; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152082; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152143; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152147; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152151; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152155; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152159; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152163; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152167; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152171; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152344; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152348; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152352; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152356; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152360; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152364; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152368; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152372; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152433; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152437; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152441; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152445; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152449; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152453; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152457; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152461; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152522; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152526; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152530; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152534; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152538; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152542; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152546; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152550; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152611; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152615; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152619; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152623; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152627; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152631; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152635; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152639; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152803; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152807; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152811; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152815; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152819; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152823; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152827; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152831; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152892; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152896; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152900; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152904; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152908; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152912; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152916; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152920; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152981; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152985; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152989; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152993; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152997; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153001; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153005; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153009; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153070; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153074; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153078; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153082; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153086; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153090; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153094; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153098; info    : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
      [24/26] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu.o
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184:   required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, 
cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, 
cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, 
cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17:   required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:336:336:   required from here
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
        141 | struct TiledCopy : Copy_Atom
            |        ^~~~~~~~~
      /usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
       3349 |     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
            | ^   ~~
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184:   required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, 
cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, 
cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, 
cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17:   required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:342:332:   required from here
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
        141 | struct TiledCopy : Copy_Atom
            |        ^~~~~~~~~
      /usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
       3349 |     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
            | ^   ~~
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184:   required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > 
> >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17:   required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:306:103:   required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::bfloat16_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:352:130:   required from here
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
        141 | struct TiledCopy : Copy_Atom
            |        ^~~~~~~~~
      /usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
       3349 |     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
            | ^   ~~
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184:   required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17:   required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:310:104:   required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::bfloat16_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:352:130:   required from here
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
        141 | struct TiledCopy : Copy_Atom
            |        ^~~~~~~~~
      /usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
       3349 |     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
            | ^   ~~
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184:   required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > 
>, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17:   required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:306:103:   required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::half_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:357:126:   required from here
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
        141 | struct TiledCopy : Copy_Atom
            |        ^~~~~~~~~
      /usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
       3349 |     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
            | ^   ~~
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184:   required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17:   required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17:   required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:310:104:   required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::half_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
      /home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:357:126:   required from here
      /tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
        141 | struct TiledCopy : Copy_Atom
            |        ^~~~~~~~~
      /usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
       3349 |     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
            | ^   ~~
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 155, in run
          self._create_wheel_file(bdist_wheel)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 357, in _create_wheel_file
          files, mapping = self._run_build_commands(dist_name, unpacked, lib, tmp)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 280, in _run_build_commands
          self._run_build_subcommands()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 307, in _run_build_subcommands
          self.run_command(name)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
          self.distribution.run_command(command)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 968, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 91, in run
          _build_ext.run(self)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
          self.build_extensions()
        File "<string>", line 205, in build_extensions
        File "/opt/conda/envs/vllm/lib/python3.11/subprocess.py", line 413, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=96', '--target=_moe_C', '--target=_C']' returned non-zero exit status 1.
      /tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py:988: _DebuggingTips: Problem in editable installation.
      !!
      
              ********************************************************************************
              An error happened while installing `vllm` in editable mode.
      
              The following steps are recommended to help debug this problem:
      
              - Try to install the project normally, without using the editable mode.
                Does the error still persist?
                (If it does, try fixing the problem before attempting the editable mode).
              - If you are using binary extensions, make sure you have all OS-level
                dependencies installed (e.g. compilers, toolchains, binary libraries, ...).
              - Try the latest version of setuptools (maybe the error was already fixed).
              - If you (or your project dependencies) are using any setuptools extension
                or customization, make sure they support the editable mode.
      
              After following the steps above, if the problem still persists and
              you think this is related to how setuptools handles editable installations,
              please submit a reproducible example
              (see https://stackoverflow.com/help/minimal-reproducible-example) to:
      
                  https://github.com/pypa/setuptools/issues
      
              See https://setuptools.pypa.io/en/latest/userguide/development_mode.html for details.
              ********************************************************************************
      
      !!
        cmd_obj.run()
      Traceback (most recent call last):
        File "/opt/conda/envs/vllm/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/conda/envs/vllm/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/envs/vllm/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 273, in build_editable
          return hook(wheel_directory, config_settings, metadata_directory)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 443, in build_editable
          return self._build_with_temp_dir(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 395, in _build_with_temp_dir
          self.run_setup()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 413, in <module>
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 184, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
          dist.run_commands()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 968, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 155, in run
          self._create_wheel_file(bdist_wheel)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 357, in _create_wheel_file
          files, mapping = self._run_build_commands(dist_name, unpacked, lib, tmp)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 280, in _run_build_commands
          self._run_build_subcommands()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 307, in _run_build_subcommands
          self.run_command(name)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
          self.distribution.run_command(command)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 968, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 91, in run
          _build_ext.run(self)
        File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
          self.build_extensions()
        File "<string>", line 205, in build_extensions
        File "/opt/conda/envs/vllm/lib/python3.11/subprocess.py", line 413, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=96', '--target=_moe_C', '--target=_C']' returned non-zero exit status 1.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building editable for vllm
Failed to build vllm
ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects

Also, probably a beginner question, but why does pip install take so long? Does it need to download the whole 776 MB of torch again? I wonder why it can't just cache it.


youkaichao commented on July 3, 2024

cc @tlrmchlsmth, this seems to be related to the mma stuff in #5642


Msiavashi commented on July 3, 2024

Same issue with the latest version:

Using MAX_JOBS=8 as the number of jobs.
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Build type: RelWithDebInfo
      -- Target device: cuda
      -- Could NOT find Python (missing: Python_INCLUDE_DIRS Interpreter Development.Module Development.SABIModule)
      CMake Error at cmake/utils.cmake:10 (message):
        Unable to find python matching: /home/user/Desktop/vllm/venv/bin/python.
      Call Stack (most recent call first):
        CMakeLists.txt:43 (find_python_from_executable)
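
The `Could NOT find Python (missing: Python_INCLUDE_DIRS ... Development.Module ...)` line suggests CMake located no usable development headers for the interpreter it was given. One way to narrow this down is a small check run with that same interpreter (here `/home/user/Desktop/vllm/venv/bin/python`). This is only a diagnostic sketch: the function name `python_dev_report` is made up for illustration, and a missing `Python.h` being the cause in this thread is an assumption, not something the log confirms.

```python
import os
import sys
import sysconfig


def python_dev_report() -> dict:
    """Collect what the running interpreter reports about its C headers.

    Hypothetical helper for illustration; it is not part of vllm or CMake.
    """
    # Directory where this interpreter expects its C headers to live.
    include_dir = sysconfig.get_paths()["include"]
    header = os.path.join(include_dir, "Python.h")
    return {
        "executable": sys.executable,
        "version": sysconfig.get_python_version(),
        "include_dir": include_dir,
        # False here would mean the development headers are not installed
        # for this interpreter (assumed, not confirmed, to be the problem).
        "has_python_h": os.path.isfile(header),
    }


if __name__ == "__main__":
    for key, value in python_dev_report().items():
        print(f"{key}: {value}")
```

If `has_python_h` comes back `False`, installing the Python development headers for that interpreter version (for example the `python3-dev` package on Debian/Ubuntu) may be worth trying before rebuilding.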

