Comments (8)
Hi, this data is actually rather old. Auto-TVM has improved since then in FP32, and the focus of Triton has shifted to tensor cores. There is more data in Chapter 5 of my PhD dissertation if you are interested ( https://search.proquest.com/openview/8e39f66528f4f49a4aab91412cff9d05/1?pq-origsite=gscholar&cbl=18750&diss=y ).
I am working on a library of standard ops for deep learning (matmul, conv, batchnorm, softmax, etc.). This should provide a good number of examples for people interested in learning more about Triton. My ultimate goal is to provide a compact, viable, open-source alternative to cuBLAS and cuDNN. When I do, I will include a script to generate these kinds of plots, including a comparison against TVM -- Tensor Comprehensions is now deprecated AFAIK. It should be out in about 1 month (2 at most).
Note that support for INT8 is pretty poor at the moment (and in fact nonexistent for tensor cores) and that the focus is more on training. TensorRT would be hard to replace at this point.
from triton.
I see.
Actually, from daily work I conclude that the most time-consuming parts are CONV and data movement. Data-movement ops like nhwc2nchw in TRT are also good at reaching the peak bandwidth provided by the HW.
What TRT leaves out is that, besides ops like nhwc2nchw, there are actually many other data-movement ops that need to be optimized. So if Triton could help there, that would be great.
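To make the bandwidth comparison concrete, a layout change like nhwc2nchw can be micro-benchmarked with the read-once/write-once accounting one would use against the HW peak. This is just an illustrative NumPy sketch on the CPU; the shapes and timing approach are my own choices, not from TRT or Triton:

```python
import time
import numpy as np

# Illustrative NHWC -> NCHW layout change; the sizes are arbitrary examples.
n, h, w, c = 8, 64, 64, 128
x = np.random.rand(n, h, w, c).astype(np.float32)

start = time.perf_counter()
y = np.ascontiguousarray(x.transpose(0, 3, 1, 2))  # NHWC -> NCHW copy
elapsed = time.perf_counter() - start

# A pure layout change reads the tensor once and writes it once,
# so effective bandwidth is roughly 2 * nbytes / time.
gbytes = 2 * x.nbytes / 1e9
print(f"shape {y.shape}, effective bandwidth: {gbytes / elapsed:.2f} GB/s")
```

The same accounting applied to a GPU permute kernel is what would tell you whether a Triton implementation reaches TRT-level bandwidth.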
Do you think you could create a list of ops that you need? I can try to include them in the upcoming op library if possible.
I think the ops below may be a good first try:
- deformable convolution, which may refer to the DCNv2 op in https://github.com/msracver/Deformable-ConvNets.
- nchw2nhwc and nhwc2nchw implemented in Triton, whose bandwidth could be compared with TRT
- nn.shuffle()
- op fusion that combines [nhwc2nchw -> nn.shuffle() -> nchw2nhwc] into one op
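For the last item, the payoff of fusing the chain is that [nhwc2nchw -> PixelShuffle -> nchw2nhwc] collapses into a single reshape/transpose, so the tensor is read and written once instead of three times. Here is a NumPy sketch with my own reference implementations, assuming nn.shuffle() means PixelShuffle with upscale factor r:

```python
import numpy as np

def pixel_shuffle_nchw(x, r):
    # Reference PixelShuffle on NCHW input: (N, C*r*r, H, W) -> (N, C, H*r, W*r)
    n, c, h, w = x.shape
    x = x.reshape(n, c // (r * r), r, r, h, w)
    return x.transpose(0, 1, 4, 2, 5, 3).reshape(n, c // (r * r), h * r, w * r)

def pixel_shuffle_nhwc_fused(x, r):
    # Fused op: consumes NHWC and produces NHWC directly, with a single
    # reshape/transpose instead of permute -> shuffle -> permute.
    n, h, w, c = x.shape
    x = x.reshape(n, h, w, c // (r * r), r, r)
    return x.transpose(0, 1, 4, 2, 5, 3).reshape(n, h * r, w * r, c // (r * r))
```

The two paths agree: converting NHWC to NCHW, shuffling, and converting back gives exactly the fused result, which is what makes the fusion safe to do in one kernel.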
There will definitely be a permute op similar to https://github.com/ptillet/torch-blocksparse/blob/master/torch_blocksparse/permute.py that will allow high-BW conversion to/from any CNN input format. Right now this is only tested for conversion between NCHW and CHWN, but it can easily be edited to accommodate NHWC as well.
By nn.shuffle, do you mean https://pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html ? I could do that; it seems like a low-hanging fruit and a generally useful op.
Yes, it is torch.nn.PixelShuffle.
To answer the initial question, I have dug through my filesystem and found the plot code I used for the roofline method:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

poster_mode = False
if poster_mode:
    font = {'size': 22}
    matplotlib.rc('font', **font)
    matplotlib.rc('lines', markersize=11)
    plt.figure(figsize=(20, 10))

colors = {'cublas': '#0d770e',
          'triton': '#a81034',
          'tvm': '#f4b95f',
          'tc': '#adb9d3',
          'plaidml': '#00a0dc'}
perf = {'cublas': [0.37, 0.77, 1.55, 3, 4.42, 5.51, 6.43, 6.66, 6.92],
        'triton': [0.45, 0.91, 1.79, 2.99, 4.38, 5.27, 5.84, 5.97, 6.5],
        'tc': [0.29, 0.32, 0.52, 0.96, 1.1, 1.48, 1.8, 2.2, 2.25],
        'tvm': [0.14, 0.37, 0.93, 1.85, 2.91, 3.43, 3.38, 3.85, 3.86],
        'plaidml': [0.1, 0.17, 0.3, 0.52, 0.93, 1.51, 1.76, 2.12, 2.44]}

# arithmetic intensity: GEMM with M = K = 1760 and N swept over powers of two
m = np.repeat(1760, 9)
n = 2**np.arange(2, 11)
k = np.repeat(1760, 9)
flops = 2.*m*n*k
transfer = 4.*(m*k + k*n)
intensity = flops / transfer

# device properties
bandwidth = 256*1e9   # bytes/s
max_flops = 7.5       # TFLOP/s
roofline = np.minimum(bandwidth*intensity*1e-12, max_flops)

plt.loglog(intensity, roofline, label='Roofline Model', color='black')
plt.scatter(intensity, perf['cublas'], label='cuBLAS 10.0', color=colors['cublas'])
plt.scatter(intensity, perf['triton'], label='Triton', color=colors['triton'])
plt.scatter(intensity, perf['tvm'], label='Auto-TVM', color=colors['tvm'])
plt.scatter(intensity, perf['tc'], label='Tensor Comprehensions', color=colors['tc'])
plt.scatter(intensity, perf['plaidml'], label='PlaidML', color=colors['plaidml'])
plt.legend()
plt.xlabel('Arithmetic Intensity (TFLOP/GB)')
plt.ylabel('Performance (TFLOP/S)')

name = 'roofline-baseline'
if poster_mode:
    name += '-poster'
plt.savefig(name + '.pdf', transparent=False, bbox_inches='tight', pad_inches=0)
plt.show()
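As a sanity check, the roofline arithmetic in the script can be re-derived by hand for the largest point in the sweep (N = 1024). This is just my own re-computation of values the script already produces, under the same illustrative device limits:

```python
# Re-deriving the roofline numbers from the script above for N = 1024.
m = k = 1760
n = 1024
flops = 2.0 * m * n * k              # each multiply-add counted as 2 FLOPs
transfer = 4.0 * (m * k + k * n)     # FP32 bytes for the A and B operands
intensity = flops / transfer         # roughly 324 FLOP/byte
bandwidth = 256 * 1e9                # bytes/s
max_flops = 7.5                      # TFLOP/s
roofline = min(bandwidth * intensity * 1e-12, max_flops)
print(intensity, roofline)
```

At this size the kernel is firmly compute-bound (the bandwidth bound alone would allow roughly 83 TFLOP/s), so the roofline sits at the 7.5 TFLOP/s compute ceiling. Note that the transfer term counts only the input operands, not the M*N output, which is a common simplification in this kind of plot.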
Unfortunately, I don't have the script I used to get the performance data. I don't think the FP32 data is meaningful anymore: Auto-TVM got significantly better, and TC/PlaidML got deprecated.
Closing the issue for now. Opened an issue on the permute op here: https://github.com/ptillet/triton/issues/56