Comments (8)
Hi, this data is actually rather old. Auto-TVM has improved since then in FP32, and the focus of Triton has shifted to tensor cores. There is more data in Chapter 5 of my PhD dissertation if you are interested ( https://search.proquest.com/openview/8e39f66528f4f49a4aab91412cff9d05/1?pq-origsite=gscholar&cbl=18750&diss=y ).
I am working on a library of standard ops for deep learning (matmul, conv, batchnorm, softmax, etc.). This should provide a good number of examples for people interested in learning more about Triton. My ultimate goal is to provide a compact, viable, open-source alternative to cuBLAS and cuDNN. When I do, I will include a script to generate these kinds of plots, including a comparison against TVM -- Tensor Comprehensions is now deprecated AFAIK. It should be out in about 1 month (2 at most).
Note that support for INT8 is pretty poor at the moment (and in fact nonexistent for tensor cores) and that the focus is more on training. TensorRT would be hard to replace at this point.
from triton.
I see.
Actually, from daily work I conclude that the most time-consuming parts are CONV and data movement. Data-movement ops like nhwc2nchw in TRT are also good at reaching the peak bandwidth provided by the HW.
What TRT leaves out is that, besides ops like nhwc2nchw, there are actually many other data-movement ops that need to be optimized. So if Triton could help there, that would be great.
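To make the bandwidth comparison concrete, a layout change like nhwc2nchw can be micro-benchmarked with the read-once/write-once accounting one would use against the HW peak. This is just an illustrative NumPy sketch on the CPU; the shapes and timing approach are my own choices, not from TRT or Triton:

```python
import time
import numpy as np

# Illustrative NHWC -> NCHW layout change; the sizes are arbitrary examples.
n, h, w, c = 8, 64, 64, 128
x = np.random.rand(n, h, w, c).astype(np.float32)

start = time.perf_counter()
y = np.ascontiguousarray(x.transpose(0, 3, 1, 2))  # NHWC -> NCHW copy
elapsed = time.perf_counter() - start

# A pure layout change reads the tensor once and writes it once,
# so effective bandwidth is roughly 2 * nbytes / time.
gbytes = 2 * x.nbytes / 1e9
print(f"shape {y.shape}, effective bandwidth: {gbytes / elapsed:.2f} GB/s")
```

The same accounting applied to a GPU permute kernel is what would tell you whether a Triton implementation reaches TRT-level bandwidth.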
Do you think you could create a list of ops that you need? I can try to include them in the upcoming op library if possible.
I think the ops below may be a good first try:
- deformable convolution, which may refer to the DCNv2 op in https://github.com/msracver/Deformable-ConvNets.
- nchw2nhwc and nhwc2nchw implemented in Triton, whose bandwidth could be compared with TRT
- nn.shuffle()
- op fusion that combines [nhwc2nchw -> nn.shuffle() -> nchw2nhwc] into one op
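For the last item, the payoff of fusing the chain is that [nhwc2nchw -> PixelShuffle -> nchw2nhwc] collapses into a single reshape/transpose, so the tensor is read and written once instead of three times. Here is a NumPy sketch with my own reference implementations, assuming nn.shuffle() means PixelShuffle with upscale factor r:

```python
import numpy as np

def pixel_shuffle_nchw(x, r):
    # Reference PixelShuffle on NCHW input: (N, C*r*r, H, W) -> (N, C, H*r, W*r)
    n, c, h, w = x.shape
    x = x.reshape(n, c // (r * r), r, r, h, w)
    return x.transpose(0, 1, 4, 2, 5, 3).reshape(n, c // (r * r), h * r, w * r)

def pixel_shuffle_nhwc_fused(x, r):
    # Fused op: consumes NHWC and produces NHWC directly, with a single
    # reshape/transpose instead of permute -> shuffle -> permute.
    n, h, w, c = x.shape
    x = x.reshape(n, h, w, c // (r * r), r, r)
    return x.transpose(0, 1, 4, 2, 5, 3).reshape(n, h * r, w * r, c // (r * r))
```

The two paths agree: converting NHWC to NCHW, shuffling, and converting back gives exactly the fused result, which is what makes the fusion safe to do in one kernel.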
There will definitely be a permute op similar to https://github.com/ptillet/torch-blocksparse/blob/master/torch_blocksparse/permute.py that will allow high-BW conversion to/from any CNN input format. Right now this is only tested for conversion between NCHW and CHWN, but it can easily be edited to accommodate NHWC as well.
By nn.shuffle, do you mean https://pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html ? I could do that; it seems like a low-hanging fruit and a generally useful op.
Yes, it is torch.nn.PixelShuffle.
To answer the initial question, I have dug through my filesystem and found the plot code I used for the roofline method:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

poster_mode = False
if poster_mode:
    font = {'size': 22}
    matplotlib.rc('font', **font)
    matplotlib.rc('lines', markersize=11)
    plt.figure(figsize=(20, 10))

colors = {'cublas': '#0d770e',
          'triton': '#a81034',
          'tvm': '#f4b95f',
          'tc': '#adb9d3',
          'plaidml': '#00a0dc'}
perf = {'cublas': [0.37, 0.77, 1.55, 3, 4.42, 5.51, 6.43, 6.66, 6.92],
        'triton': [0.45, 0.91, 1.79, 2.99, 4.38, 5.27, 5.84, 5.97, 6.5],
        'tc': [0.29, 0.32, 0.52, 0.96, 1.1, 1.48, 1.8, 2.2, 2.25],
        'tvm': [0.14, 0.37, 0.93, 1.85, 2.91, 3.43, 3.38, 3.85, 3.86],
        'plaidml': [0.1, 0.17, 0.3, 0.52, 0.93, 1.51, 1.76, 2.12, 2.44]}

# arithmetic intensity: GEMM with M = K = 1760 and N swept over powers of two
m = np.repeat(1760, 9)
n = 2**np.arange(2, 11)
k = np.repeat(1760, 9)
flops = 2.*m*n*k
transfer = 4.*(m*k + k*n)
intensity = flops / transfer

# device properties
bandwidth = 256*1e9   # bytes/s
max_flops = 7.5       # TFLOP/s
roofline = np.minimum(bandwidth*intensity*1e-12, max_flops)

plt.loglog(intensity, roofline, label='Roofline Model', color='black')
plt.scatter(intensity, perf['cublas'], label='cuBLAS 10.0', color=colors['cublas'])
plt.scatter(intensity, perf['triton'], label='Triton', color=colors['triton'])
plt.scatter(intensity, perf['tvm'], label='Auto-TVM', color=colors['tvm'])
plt.scatter(intensity, perf['tc'], label='Tensor Comprehensions', color=colors['tc'])
plt.scatter(intensity, perf['plaidml'], label='PlaidML', color=colors['plaidml'])
plt.legend()
plt.xlabel('Arithmetic Intensity (TFLOP/GB)')
plt.ylabel('Performance (TFLOP/S)')

name = 'roofline-baseline'
if poster_mode:
    name += '-poster'
plt.savefig(name + '.pdf', transparent=False, bbox_inches='tight', pad_inches=0)
plt.show()
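As a sanity check, the roofline arithmetic in the script can be re-derived by hand for the largest point in the sweep (N = 1024). This is just my own re-computation of values the script already produces, under the same illustrative device limits:

```python
# Re-deriving the roofline numbers from the script above for N = 1024.
m = k = 1760
n = 1024
flops = 2.0 * m * n * k              # each multiply-add counted as 2 FLOPs
transfer = 4.0 * (m * k + k * n)     # FP32 bytes for the A and B operands
intensity = flops / transfer         # roughly 324 FLOP/byte
bandwidth = 256 * 1e9                # bytes/s
max_flops = 7.5                      # TFLOP/s
roofline = min(bandwidth * intensity * 1e-12, max_flops)
print(intensity, roofline)
```

At this size the kernel is firmly compute-bound (the bandwidth bound alone would allow roughly 83 TFLOP/s), so the roofline sits at the 7.5 TFLOP/s compute ceiling. Note that the transfer term counts only the input operands, not the M*N output, which is a common simplification in this kind of plot.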
Unfortunately, I don't have the script I used to get the performance data. I don't think the FP32 data is meaningful anymore: Auto-TVM got significantly better, and TC/PlaidML got deprecated.
Closing the issue for now. Opened an issue on the permute op here: https://github.com/ptillet/triton/issues/56