Comments (4)
I don't know why CMake fails to find Python. One thing worth trying is to follow the docs and set up a conda environment.
from vllm.
Okay, I've set up the conda environment, but now there's another issue, and the output is way too much for the terminal to handle:
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 146346; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
[The same advisory repeats 116 more times for the same file, for PTX lines 146350 through 152631.]
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152635; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152639; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152803; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152807; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152811; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152815; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152819; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152823; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152827; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152831; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152892; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152896; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152900; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152904; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152908; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152912; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152916; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152920; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152981; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152985; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152989; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152993; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 152997; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153001; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153005; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153009; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153070; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153074; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153078; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153082; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153086; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153090; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153094; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00028e53_00000000-6_marlin_24_cuda_kernel.compute_86.ptx, line 153098; info : Advisory: Modifier '.sp::ordered_metadata' should be used on instruction 'mma' instead of modifier '.sp' as it is expected to have substantially reduced performance on some future architectures
[24/26] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu.o
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, 
cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, 
cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, 
cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17: required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:336:336: required from here
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
  141 | struct TiledCopy : Copy_Atom
      |        ^~~~~~~~~
/usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
3349 | cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
| ^ ~~
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, 
cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, 
cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<6, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, signed char, cute::tuple<long int, cute::C<1>, long int>, signed char, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_S32S8S8_SS_TN>, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, 
cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17: required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<signed char, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:342:332: required from here
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
  141 | struct TiledCopy : Copy_Atom
      |        ^~~~~~~~~
/usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
3349 | cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
| ^ ~~
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17: required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:306:103: required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::bfloat16_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:352:130: required from here
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
  141 | struct TiledCopy : Copy_Atom
      |        ^~~~~~~~~
/usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
3349 | cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
| ^ ~~
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::bfloat16_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::bfloat16_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17: required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::bfloat16_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:310:104: required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::bfloat16_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:352:130: required from here
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
141 | struct TiledCopy : Copy_Atom
| ^~~~~~~~~
/usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
3349 | cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
| ^ ~~
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<13, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<2, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<3, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17: required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<64>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<8>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:306:103: required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::half_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:357:126: required from here
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
141 | struct TiledCopy : Copy_Atom
| ^~~~~~~~~
/usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
3349 | cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
| ^ ~~
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of ‘static constexpr cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple<int, int, int, int>; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple<cute::C<64>, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple<long int, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, 
cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]’:
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from ‘static cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Params cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, typename std::enable_if<is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, typename CollectiveMainloop_::DispatchPolicy::Schedule>, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple<int, int, int, int>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileScheduler_ = cutlass::gemm::PersistentScheduler]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from ‘cutlass::Status cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>::GemmKernel; typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter<GemmKernel_, typename std::enable_if<cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel_>::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<9, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cutlass::float_e4m3_t, cute::tuple<long int, cute::C<1>, long int>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<cute::GMMA::ScaleIn::One, cute::GMMA::ScaleIn::One> >, cute::Layout<cute::tuple<cute::C<1>, cute::C<1>, cute::C<1> > >, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128> >, cute::tuple<cute::C<128>, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<4, 2, 16, false, true>, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<64>, cute::C<32> >, void, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple<long int, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<0, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<1>, cute::C<0>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, cutlass::FloatRoundStyle::round_to_nearest, void>, cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<2, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, float, cute::tuple<cute::C<0>, cute::C<1>, cute::C<0> >, 4>, cutlass::epilogue::fusion::Sm90AccFetch> >, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32> >, cute::tuple<cute::C<32>, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Arguments; cudaStream_t = CUstream_st*]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:232:17: required from ‘void _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_gemm_caller(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with Gemm = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::cutlass_3x_gemm<cutlass::float_e4m3_t, cutlass::half_t, _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue, cute::tuple<cute::C<64>, cute::C<128>, cute::C<128> >, cute::tuple<cute::C<2>, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::epilogue::TmaWarpSpecialized>; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:310:104: required from ‘void cutlass_gemm_sm90_fp8_dispatch(at::Tensor&, const at::Tensor&, const at::Tensor&, EpilogueArgs&& ...) [with InType = cutlass::float_e4m3_t; OutType = cutlass::half_t; Epilogue = _GLOBAL__N__605ca568_16_scaled_mm_c3x_cu_2bec3df9::ScaledEpilogue; EpilogueArgs = {const at::Tensor&, const at::Tensor&}]’
/home/jovyan/vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu:357:126: required from here
/tmp/tmpn64e116l.build-temp/_deps/cutlass-src/include/cute/atom/copy_atom.hpp:141:8: note: ‘using TMA_D = struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’ {aka ‘struct cute::TiledCopy<cute::Copy_Atom<cute::Copy_Traits<cute::SM90_TMA_STORE, cute::C<32768>, cute::AuxTmaParams<cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0>, cute::C<0> >, const cute::Layout<cute::tuple<cute::C<32>, cute::C<64> >, cute::tuple<cute::ScaledBasis<cute::C<1>, 1>, cute::ScaledBasis<cute::C<1>, 0> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::half_t>, cute::Layout<cute::tuple<cute::C<1>, cute::tuple<cute::tuple<cute::C<32>, cute::C<64> > > >, cute::tuple<cute::C<0>, cute::tuple<cute::tuple<cute::C<64>, cute::C<1> > > > >, cute::tuple<cute::C<64>, cute::C<32> > >’} has no user-provided default constructor
141 | struct TiledCopy : Copy_Atom
| ^~~~~~~~~
/usr/local/cuda/include/cuda.h:3349:1: note: and the implicitly-defined constructor does not initialize ‘cuuint64_t CUtensorMap_st::opaque [16]’
3349 | cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
| ^ ~~
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 155, in run
self._create_wheel_file(bdist_wheel)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 357, in _create_wheel_file
files, mapping = self._run_build_commands(dist_name, unpacked, lib, tmp)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 280, in _run_build_commands
self._run_build_subcommands()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 307, in _run_build_subcommands
self.run_command(name)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
self.distribution.run_command(command)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 968, in run_command
super().run_command(command)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 91, in run
_build_ext.run(self)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
self.build_extensions()
File "<string>", line 205, in build_extensions
File "/opt/conda/envs/vllm/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=96', '--target=_moe_C', '--target=_C']' returned non-zero exit status 1.
/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py:988: _DebuggingTips: Problem in editable installation.
!!
********************************************************************************
An error happened while installing `vllm` in editable mode.
The following steps are recommended to help debug this problem:
- Try to install the project normally, without using the editable mode.
Does the error still persist?
(If it does, try fixing the problem before attempting the editable mode).
- If you are using binary extensions, make sure you have all OS-level
dependencies installed (e.g. compilers, toolchains, binary libraries, ...).
- Try the latest version of setuptools (maybe the error was already fixed).
- If you (or your project dependencies) are using any setuptools extension
or customization, make sure they support the editable mode.
After following the steps above, if the problem still persists and
you think this is related to how setuptools handles editable installations,
please submit a reproducible example
(see https://stackoverflow.com/help/minimal-reproducible-example) to:
https://github.com/pypa/setuptools/issues
See https://setuptools.pypa.io/en/latest/userguide/development_mode.html for details.
********************************************************************************
!!
cmd_obj.run()
Traceback (most recent call last):
File "/opt/conda/envs/vllm/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/opt/conda/envs/vllm/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/vllm/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 273, in build_editable
return hook(wheel_directory, config_settings, metadata_directory)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 443, in build_editable
return self._build_with_temp_dir(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 395, in _build_with_temp_dir
self.run_setup()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 413, in <module>
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 184, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
dist.run_commands()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 968, in run_command
super().run_command(command)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 155, in run
self._create_wheel_file(bdist_wheel)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 357, in _create_wheel_file
files, mapping = self._run_build_commands(dist_name, unpacked, lib, tmp)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 280, in _run_build_commands
self._run_build_subcommands()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/editable_wheel.py", line 307, in _run_build_subcommands
self.run_command(name)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
self.distribution.run_command(command)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 968, in run_command
super().run_command(command)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 91, in run
_build_ext.run(self)
File "/tmp/pip-build-env-tfbqim18/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
self.build_extensions()
File "<string>", line 205, in build_extensions
File "/opt/conda/envs/vllm/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=96', '--target=_moe_C', '--target=_C']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building editable for vllm
Failed to build vllm
ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects
Also, probably a beginner question, but why does pip install take so long? Does it need to download the whole 776MB of torch again? I wonder why it can't just use a cached copy.
from vllm.
cc @tlrmchlsmth, this seems to be related to the mma stuff in #5642
from vllm.
Same issue with the latest version:
Using MAX_JOBS=8 as the number of jobs.
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: RelWithDebInfo
-- Target device: cuda
-- Could NOT find Python (missing: Python_INCLUDE_DIRS Interpreter Development.Module Development.SABIModule)
CMake Error at cmake/utils.cmake:10 (message):
Unable to find python matching: /home/user/Desktop/vllm/venv/bin/python.
Call Stack (most recent call first):
CMakeLists.txt:43 (find_python_from_executable)
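One thing that may be worth checking here: the `missing: Python_INCLUDE_DIRS ... Development.Module` part of the message can show up when the interpreter CMake was pointed at has no development headers (no `Python.h`). That is only a guess for this particular setup. The sketch below prints what the interpreter itself reports, using only the standard library; run it with the same venv interpreter named in the error. The function name `describe_python_for_cmake` is made up for illustration and is not part of vllm or CMake.

```python
import os
import sys
import sysconfig


def describe_python_for_cmake():
    """Collect interpreter facts that are useful when CMake cannot find Python.

    Hypothetical helper for illustration; it only reads values that the
    running interpreter reports about itself.
    """
    # Directory where this interpreter expects its C headers to live.
    include_dir = sysconfig.get_paths()["include"]
    header = os.path.join(include_dir, "Python.h")
    return {
        "executable": sys.executable,
        "version": sysconfig.get_python_version(),
        "include_dir": include_dir,
        # False suggests the development headers are not installed.
        "has_python_h": os.path.isfile(header),
    }


if __name__ == "__main__":
    for key, value in describe_python_for_cmake().items():
        print(f"{key}: {value}")
```

If `has_python_h` comes back `False`, installing the Python development package for that interpreter version (or using a conda environment, as suggested earlier in the thread) is a reasonable next step to try. If it is `True`, the cause is probably something else.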
from vllm.