
Comments (5)

jstoecker commented on August 22, 2024

We're only just starting to optimize our shaders for training scenarios, so it's premature to look too closely at performance while it's still being built. It's not necessary to write custom HLSL to determine where the bottlenecks are: there are great GPU profiling tools available that make this process much simpler. We're already aware of, and working on, optimizations that benefit the models you're referencing. Our internal measurements show different performance numbers than what you're suggesting, even with older versions of TF-DML, but we'll have more to share on this once we actually start integrating our optimizations.

I do appreciate the benefits of extensibility, and Antares is an intriguing project! I'm just not convinced it's the right path for TF-DML at this very moment. Consider the complexity involved in setting up a plugin like Antares: it appears to require a server (running in WSL or on a Linux host) that communicates with an agent on the Windows host. It also requires modifications to the TF client script to use the custom operators. For a low-level engineer or researcher trying to squeeze every drop of performance out of their model this is a reasonable ask, but it presents huge challenges for an app developer who wants to deploy their application, built on top of TF, to users who may have no idea what WSL or Docker are.

The goal we have right now is building a stable and efficient backend for TF that works with the breadth of DX hardware and runs existing TF models without modification. Most of our users want something that just works with their hardware and is competitive in perf with existing acceleration backends. We're working on this first and foremost, and prioritizing ways to bypass DirectML shaders does not serve this purpose.

That said, the DML device backend in this fork is obviously open source; everything you need to try building the custom ops is available to you. I would genuinely be very interested to see what kinds of improvements you can demonstrate with this approach, and it may be something we want to look at in the future!


jstoecker commented on August 22, 2024

Currently, the operator shaders in tensorflow-directml are not only lacking, but also not the most efficient.

Can you be more specific about the operators you feel are lacking or inefficient? If you haven't seen it yet, our operator roadmap shows the current and planned coverage for the next few months. We're focused on CNNs initially, so if the scenarios you have in mind aren't covered by the common assortment of layers used in CNNs, it would be good to understand what specifically you're looking to support. I'm sensing that you have a more research-oriented focus, which is an interesting topic on its own.

As far as shader efficiency, this is also something we're working on right now. Specifically, we have improvements that target convolution and normalization in training since these tend to be the bottlenecks for CNNs. Are there other operators you have in mind?

For TF-DML, the build stack is Windows with VS + DirectML SDK + .., which is much more complex.

It definitely isn't as straightforward to build TensorFlow on Windows, but that's true regardless of the changes in this fork. If I had to guess, it's because Windows support was added later in TF's life, and even the build system (Bazel) has a heavy bias toward Unix-like platforms. That said, the DirectML integration is automatic, so I'm not sure what you mean by the DirectML SDK. Are you referring to the Windows SDK? We'll likely switch to the open-source DirectX headers in the near future, which should make it possible to build with older Windows SDKs.

We are looking into extending the Antares plugin to solve this, which has been verified in Linux CUDA/ROCm environments.

From a cursory glance at your project it looks like a compiler infrastructure approach with an intermediate representation. We've noticed a few of these initiatives, like TF's own MLIR, and it will be interesting to see how they play out. One concern is the true portability of the generated code (in both a functional and performance sense). Regardless, it's a neat area of exploration.

I have to ask, though: if Antares can generate robust HLSL, then what is the interest in TF+DirectML? Can it use the existing Eigen source code to generate the kernels, and if so, why not just generate everything entirely with Antares? Is this to fill in coverage where DirectML doesn't support certain ops, or do you want to add truly new operators? I can see how, from a researcher's perspective, it would be useful to quickly prototype new ops and have them run with hardware acceleration; concerns about robustness are not paramount in this scenario, so the implementation can be specialized to your own hardware. This is a bit different from our (initial) goals with this project, which are mostly focused on end users who don't have an interest in extending TF's operator catalog or recompiling TF from scratch to run on their specific GPU.

How to build TF C++ custom operators for TF-DML?

The steps to define custom operators are dictated by TensorFlow, not TF+DirectML. I assume what you're really asking is how to write operator kernels (i.e. OpKernels, the implementations of existing TensorFlow operators) that invoke custom HLSL shaders. This is technically possible, but it's not the approach we're taking here. It's actually a very good question, and hopefully I can shed some light on the philosophy here.
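For reference, here is a minimal sketch of the standard TensorFlow C++ custom-op machinery I'm referring to (the same registration path every device backend hooks into). `MyIdentity` is a hypothetical op used purely for illustration, and the kernel shown is a plain CPU reference; a DML-device kernel would subclass `OpKernel` the same way but would need the D3D12 resources discussed below:

```cpp
// Minimal sketch (TF 1.x C++ API): registering a hypothetical op and a
// CPU kernel for it. A DML kernel would record GPU work in Compute()
// instead of touching host memory.
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

REGISTER_OP("MyIdentity")
    .Input("x: float")
    .Output("y: float")
    .SetShapeFn([](shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));  // output shape == input shape
      return Status::OK();
    });

class MyIdentityOp : public OpKernel {
 public:
  explicit MyIdentityOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const Tensor& input = ctx->input(0);
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));
    output->flat<float>() = input.flat<float>();  // plain host copy
  }
};

REGISTER_KERNEL_BUILDER(Name("MyIdentity").Device(DEVICE_CPU), MyIdentityOp);
```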

First, let's step back from the TensorFlow side of things for a second. The DirectML API does not expose any extensibility hooks for custom operators because the API is low-level enough that it's not necessary. The user of the API is in total control of how, where, and when the work is recorded and executed. A game engine, for example, can record 3D rendering work with its own shaders into the same command list (if desired) as DirectML shaders. Adding any extensibility APIs to DirectML itself would just impose unnatural (and likely restrictive) abstractions on how developers connect DirectML to their applications.
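To make that concrete, here is a hedged sketch (not code from this repository) of a single D3D12 command list carrying both a DirectML operator dispatch and a custom HLSL compute dispatch; every object is assumed to have been created and bound elsewhere, and the function name is illustrative:

```cpp
// Hedged sketch: interleaving a DirectML dispatch with a custom HLSL
// compute shader on one D3D12 command list. Creation of the device,
// compiled operator, binding table, PSO, and root signature is assumed
// to happen elsewhere; error handling is omitted.
#include <d3d12.h>
#include <DirectML.h>

void RecordMixedWork(ID3D12GraphicsCommandList* commandList,
                     IDMLCommandRecorder* recorder,
                     IDMLCompiledOperator* dmlOp,
                     IDMLBindingTable* bindingTable,
                     ID3D12PipelineState* customPso,
                     ID3D12RootSignature* customRootSig,
                     UINT groupCountX) {
  // 1. Record a DirectML operator (e.g. a convolution) onto the list.
  recorder->RecordDispatch(commandList, dmlOp, bindingTable);

  // 2. Record a custom HLSL compute shader onto the *same* list.
  commandList->SetComputeRootSignature(customRootSig);
  commandList->SetPipelineState(customPso);
  commandList->Dispatch(groupCountX, 1, 1);

  // A UAV barrier between the two dispatches would be required if the
  // custom shader consumes the DirectML operator's output (not shown).
}
```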

It should be clear that custom HLSL can interoperate with DirectML, but you won't see examples of that in this repository. Why? We want DirectML to support multiple frameworks or applications with the same core shaders, not just TensorFlow. Writing HLSL in this repo would mean duplicating code in every application or framework. And, even worse, it would jeopardize any reasonable way to wrangle the numerous hardware-specific features (or limitations), work around bugs, and ensure a minimum bar for quality. Writing shader code that works well for one GPU is not hard, but writing shader code that works well on multiple devices from multiple vendors across multiple generations and driver versions is an entirely different challenge.

So, in summary, our priority is ensuring DirectML can support the important ML operators that frameworks use in inference or training. In some cases this requires new DirectML APIs, which we add. For the overwhelming majority of operators, however, it's possible (and efficient enough) to simply chain existing DirectML ops together to express the underlying computation. You'll see this in hundreds of the DML kernels in this repo. A small subset of operators aren't general enough to warrant a new DirectML API, but they're also too complicated to implement with composition; these may be candidates for framework-specific shaders in the future, but it's not high on our list of priorities right now.
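As a hedged illustration of the composition approach, here is a small sketch using the open-source DirectMLX helper header, which wraps the DML graph APIs; the swish activation and tensor shape are my own examples, not kernels from this repo:

```cpp
// Hedged sketch: composing existing DML operators into one compiled
// graph via DirectMLX. swish(x) = x * sigmoid(x) is built from existing
// DML ops rather than a custom shader; the shape is arbitrary.
#include <wrl/client.h>
#include <DirectMLX.h>

Microsoft::WRL::ComPtr<IDMLCompiledOperator> CompileSwish(IDMLDevice* device) {
  dml::Graph graph(device);
  auto x = dml::InputTensor(graph, 0,
      dml::TensorDesc(DML_TENSOR_DATA_TYPE_FLOAT32, {1, 1, 128, 128}));
  auto y = x * dml::ActivationSigmoid(x);  // element-wise chain of DML ops
  return graph.Compile(DML_EXECUTION_FLAG_NONE, {y});
}
```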


ghostplant commented on August 22, 2024

OK, I am not quite clear on which operators are slow in TF-DML, but the overall result shows it reaches about 17% of TF-CUDA's performance on resnet50/alexnet/inception3. Developing custom HLSL operators can help us see which operators implemented by DML are not efficient. Currently, the Antares plugin (as a custom TF-DML operator) can be a tool to create DNN-purpose HLSL operators to analyze this.

On the other hand, we want to develop a TF custom operator contributed by Antares for every TF backend, to make up for a disadvantage of the existing TensorFlow framework: it has good support for every CUDA GPU using kernels whose performance is not too bad, but weak support for extremely fast kernels as well as general-purpose fusion (e.g. fusing slice -> transpose -> matmul -> biasadd into a single kernel, which is proven much faster in most cases). Besides, extending new operators in the future is also helpful for TensorFlow on any backend, since TF 1.15.x is in maintenance mode and no longer adds operators proposed after it was frozen.

So if developers don't care much about performance, I think TF-DML with everything implemented in HLSL is already nice, but it leaves TF-DML limited in many other high-performance scenarios.

If Antares can generate robust HLSL, then what is the interest in TF+DirectML?
It is a way to make up for weak points in the current TensorFlow design (e.g. high-performance kernels, new ops, deep fusions, ..).

Can it use the existing Eigen source code to generate the kernels, and if so, why not just generate everything entirely with Antares?
Generating everything entirely with Antares would work, but the complexity is as large as finishing all the HLSL implementations by hand. I don't think such a large piece of work needs to be opened; it is okay for Antares to just be a helper that solves TensorFlow's weak points.

Is this to fill in coverage where DirectML doesn't support certain ops, or do you want to add truly new operators?
Mostly the former, but not exactly. The Antares plugin is just a 3-file plugin which doesn't extend any operators itself; it just exposes an interface that allows TF users to define the new operators they want, similar to how TC is used in PyTorch. This not only solves new-operator extension, but also helps identify which operators are inefficient. A complication we found is that one HLSL implementation can be the fastest on one GPU vendor but is unlikely to be as fast on another GPU vendor. Luckily, Antares can solve this by tuning toward each vendor's fastest HLSL implementation.


ghostplant commented on August 22, 2024

Thanks, we'll show some results which could be even faster than TF-CUDA.
BTW, I think the answer to this issue is not exactly the official custom operators link provided above, which only covers customized CPU and CUDA kernels.
For the DirectX backend, the developer needs to know some critical DX resources, like the command queue to push our customized shaders to, and other important D3D resources.


jstoecker commented on August 22, 2024

Thanks ghostplant.

To make the answer to this issue clear: TF-DML currently supports three approaches to implementing ops (ignoring the framework "glue" ops that don't involve GPU work); a sketch of the first approach follows the list:

  • Calling DML operator APIs directly. Example: MatMul
  • Calling DML graph APIs to chain together DML operators. Example: ApplyAdaDelta
  • Calling D3D APIs to manipulate resources (no shaders). Example: DeepCopy
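As a sketch of the first approach (again hedged; this is illustrative code, not a kernel from the repo), creating a single GEMM operator directly through the DirectML API looks roughly like this, with tensor descs supplied by the caller and compilation/binding/dispatch omitted:

```cpp
// Hedged sketch: approach 1, creating one DML operator (GEMM, the kind
// of primitive backing a MatMul kernel) directly via the DirectML API.
#include <wrl/client.h>
#include <DirectML.h>

Microsoft::WRL::ComPtr<IDMLOperator> CreateGemmOp(
    IDMLDevice* device,
    const DML_TENSOR_DESC* aDesc,
    const DML_TENSOR_DESC* bDesc,
    const DML_TENSOR_DESC* outDesc) {
  DML_GEMM_OPERATOR_DESC gemm = {};
  gemm.ATensor = aDesc;
  gemm.BTensor = bDesc;
  gemm.CTensor = nullptr;                    // no bias/accumulator tensor
  gemm.OutputTensor = outDesc;
  gemm.TransA = DML_MATRIX_TRANSFORM_NONE;
  gemm.TransB = DML_MATRIX_TRANSFORM_NONE;
  gemm.Alpha = 1.0f;
  gemm.Beta = 0.0f;

  DML_OPERATOR_DESC desc = {DML_OPERATOR_GEMM, &gemm};
  Microsoft::WRL::ComPtr<IDMLOperator> op;
  device->CreateOperator(&desc, IID_PPV_ARGS(&op));  // check HRESULT in real code
  return op;
}
```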

Like I said earlier, we have no examples of (or plans to add) DML kernels with loose HLSL bytecode, but there's nothing technical that prevents you from experimenting with this. If you're familiar enough with D3D12, it should make sense that all the D3D interfaces needed to do this are available, but we're focused on ensuring DML delivers on these needs instead of relying on custom ops that override or extend the TF API.

I'm going to close this issue for now, but perhaps we can revisit in the future!

