microsoft / tensorflow-directml

Fork of TensorFlow accelerated by DirectML

License: Apache License 2.0

tensorflow-directml's Introduction

TensorFlow-DirectML

⚠️ Warning: h5py 3.0.0 and 3.1.0 broke compatibility with TensorFlow. Please make sure that your environment has a different version of h5py before using TensorFlow-DirectML.
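
One way to guard against the incompatible h5py releases is to check the installed version before importing TensorFlow. The sketch below is only an illustration: the helper name `h5py_version_is_known_bad` is hypothetical, and it flags only the two versions named in the warning.

```python
# Versions named in the warning; nothing else is assumed to be broken.
KNOWN_BAD_H5PY = {(3, 0, 0), (3, 1, 0)}


def h5py_version_is_known_bad(version):
    """Return True if `version` (e.g. "3.1.0") is one of the flagged releases."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) in KNOWN_BAD_H5PY


if __name__ == "__main__":
    try:
        import h5py  # only needed for the live check
    except ImportError:
        print("h5py is not installed")
    else:
        if h5py_version_is_known_bad(h5py.__version__):
            print("h5py", h5py.__version__, "is flagged; install a different version")
        else:
            print("h5py", h5py.__version__, "is not one of the flagged versions")
```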

TensorFlow is an end-to-end open source platform for machine learning. This repository is a fork of tensorflow that leverages DirectML to provide cross-vendor hardware acceleration on Windows and the Windows Subsystem for Linux (WSL). TensorFlow with DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

Questions, Issues, and Feedback

Frequently asked questions: FAQ
Learn about our roadmap: Wiki
Ask a question: Discussions
Report a bug: Issues

You can also contact us directly at [email protected].

Getting Started

TensorFlow with DirectML is supported on both the latest versions of Windows and the Windows Subsystem for Linux. For detailed instructions on getting started, see GPU accelerated ML training (docs.microsoft.com).

TensorFlow with DirectML is compatible with TensorFlow 1.15 and is supported for production use. Official Python packages are available on the tensorflow-directml PyPI project, and C library packages are available for download on GitHub.

The DirectML repository includes a few samples that have been tested to work with the latest builds on PyPI. These samples include both inference and training scripts, and you can either train the models from scratch or use the supplied pre-trained weights. However, we encourage testing on any TensorFlow 1.15-compatible models -- if you run into issues, please let us know!

The following resources provide additional background on DirectML and TensorFlow:

System Requirements

Windows

Windows 10 Version 1709, 64-bit (Build 16299 or higher) or Windows 11 Version 21H2, 64-bit (Build 22000 or higher)
Python x86-64 3.5, 3.6, or 3.7¹
One of the following supported GPUs:
- AMD Radeon R5/R7/R9 2xx series or newer
- Intel HD Graphics 5xx or newer
- NVIDIA GeForce GTX 9xx series GPU or newer

¹ Note: Python 3.8 or newer is not currently supported. To use the official PyPI packages, the CPython interpreter is required. NumPy 1.19.4 requires the KB4598291 update to work properly on Windows.

Windows Subsystem for Linux

Windows 10 Version 21H2, 64-bit (Build 20150 or higher) or Windows 11 Version 21H2, 64-bit (Build 22000 or higher)
Python x86-64 3.5, 3.6, or 3.7²
One of the following supported GPUs:
- AMD Radeon R5/R7/R9 2xx series or newer, and 20.20.01.05 driver or newer
- Intel HD Graphics 6xx or newer, and 28.20.100.8322 driver or newer
- NVIDIA GeForce GTX 9xx series GPU or newer, and 460.20 driver or newer

² Note: Python 3.8 or newer is not currently supported. To use the official PyPI packages, the CPython interpreter is required.
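
The interpreter requirements listed for both platforms (CPython, 64-bit, Python 3.5 to 3.7) can be checked from Python itself. This is a minimal sketch using only the standard library; the function name `interpreter_problems` is hypothetical, and the check covers neither the OS build nor the GPU driver.

```python
import platform
import struct
import sys

SUPPORTED_MINORS = {5, 6, 7}  # Python 3.5, 3.6, 3.7 per the requirements above


def interpreter_problems(version_info, implementation, pointer_bits):
    """Return a list of reasons this interpreter does not match the requirements."""
    problems = []
    major, minor = version_info[0], version_info[1]
    if major != 3 or minor not in SUPPORTED_MINORS:
        problems.append("Python %d.%d is outside the supported 3.5-3.7 range" % (major, minor))
    if implementation != "CPython":
        problems.append("the official PyPI packages require CPython, found " + implementation)
    if pointer_bits != 64:
        problems.append("a 64-bit interpreter is required, found %d-bit" % pointer_bits)
    return problems


if __name__ == "__main__":
    found = interpreter_problems(
        sys.version_info,
        platform.python_implementation(),
        struct.calcsize("P") * 8,  # pointer size in bits
    )
    print("\n".join(found) if found else "interpreter matches the listed requirements")
```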

Contribute

If you would like to contribute to tensorflow-directml, please see our contribution guidelines and read the Microsoft Open Source Code of Conduct. We use GitHub issues for tracking requests and bugs. Please do not report security vulnerabilities through public GitHub issues. See SECURITY.md for more details.

See BUILD.md for instructions on how to produce private builds of tensorflow-directml.

License

This project is licensed under Apache License 2.0.

The tensorflow-directml Python wheel binary package includes a redistributable version of the DirectML library, which is downloaded automatically as a part of the build. The use of the redistributable DirectML library is governed by a separate license that is found as part of the package (found in tensorflow_core/python/DirectML_LICENSE.txt when extracted).

Data Collection Notice

The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft's privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.

Disabling Telemetry

The official builds of tensorflow-directml (hosted on PyPI) have data collection enabled. This telemetry is enabled when building with --config=dml_telemetry (i.e. the --telemetry switch in build.py), but it is disabled by default for local builds.

Trademarks Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

tensorflow-directml's Issues

Issues running AI Benchmark..

Hi,
My previous issue (microsoft/DirectML#16) was closed, so I updated to the latest tensorflow-directml build (200626) to test on "native" Windows (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64).
I'm on an NVIDIA Titan V with the 451.48 driver.
The "1/19. MobileNet-V2" training step now runs without issues, so my previous issue is solved.
However, the benchmark still fails to complete: it now crashes on the "2/19. Inception-V3" training step.
I suspect a GPU memory allocation issue: the Task Manager GPU tab shows "dedicated GPU memory" almost full before the training step (11.8/12 GB allocated).
Either the DirectML backend is not optimized for GPU memory usage (this benchmark runs on the CUDA backend without issues), or AI Benchmark or the DirectML backend is not freeing GPU memory buffers between benchmark steps.
I hope we can eventually run the full AI Benchmark on DirectML without issues.
Later I will also ask about training performance, since:
1.2 - training | batch=50, size=224x224: 9138 ± 137 ms
seems too slow for a Titan V; on CUDA this is much faster.

python
Python 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from ai_benchmark import AIBenchmark
>>> results = AIBenchmark().run()

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.3
*  Platform: Windows-10-10.0.19564-SP0
*  CPU: N/A
*  CPU RAM: 32 GB
*  GPU/0: N/A
*  GPU RAM: N/A GB
*  CUDA Version: 11.0
*  CUDA Build: V11.0.167

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 52.5 ± 10.7 ms
1.2 - training  | batch=50, size=224x224: 9138 ± 137 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 812 ± 36 ms
2020-06-28 22:50:23.358757: F tensorflow/core/common_runtime/dml/dml_allocator.cc:97] Check failed: (((HRESULT)((hr))) >= 0) == true (0 vs. 1)
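
The failed check in `dml_allocator.cc` has the form of the Windows `SUCCEEDED` macro: an HRESULT signals failure when it is negative as a signed 32-bit integer. The log above does not include the actual HRESULT value, so the sketch below only illustrates the convention. `E_OUTOFMEMORY` is used as an example and is not confirmed to be the code returned here.

```python
E_OUTOFMEMORY = 0x8007000E  # standard Windows HRESULT, used here only as an example
S_OK = 0x00000000


def to_signed32(value):
    """Interpret a 32-bit pattern as a signed integer, as HRESULT is."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def succeeded(hr):
    """Mirror of the SUCCEEDED macro: true when the HRESULT is non-negative."""
    return to_signed32(hr) >= 0


print(succeeded(S_OK))           # True
print(succeeded(E_OUTOFMEMORY))  # False
```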

It's not working on Intel Graphics 5500

OS : Windows 10
Python version : 3.9.15
tensorflow-cpu==2.10.0

When I run import tensorflow as tf, it reports 0 compatible devices.

fit with split produces error

System Information

Amd 3900x 5700x
windows 10
python 3.7

Repro Details

I have an array x of shape (100,100),
where
x[0].shape = (100,)

Describe the current behavior
fit tries to multiply a shape by a float.
Describe the expected behavior
It should recognize that the first dimension is what needs to be multiplied by the float.

Other info / logs
x.shape: (594, 4096)
y.shape: (594, 4096)
(4096,)
(4096,)
<class 'tensorflow.python.framework.ops.EagerTensor'>

TypeError Traceback (most recent call last)
in
38 print(type(X_train[0]))
39 history = model.fit(X_train,y_train,validation_split=.5,verbose=1,
---> 40 batch_size=40,shuffle=False)
41 curLoss = history.history["loss"]
42 curVal = history.history["val_loss"]

c:\users\tasha\anaconda3\envs\directml\lib\site-packages\tensorflow_core\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
725 max_queue_size=max_queue_size,
726 workers=workers,
--> 727 use_multiprocessing=use_multiprocessing)
728
729 def evaluate(self,

c:\users\tasha\anaconda3\envs\directml\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
226 max_queue_size=max_queue_size,
227 workers=workers,
--> 228 use_multiprocessing=use_multiprocessing)
229
230 total_samples = _get_total_number_of_samples(training_data_adapter)

c:\users\tasha\anaconda3\envs\directml\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py in _process_training_inputs(model, x, y, batch_size, epochs, sample_weights, class_weights, steps_per_epoch, validation_split, validation_data, validation_steps, shuffle, distribution_strategy, max_queue_size, workers, use_multiprocessing)
533 val_x, val_y,
534 val_sample_weights) = training_utils.split_training_and_validation_data(
--> 535 x, y, sample_weights, validation_split)
536 train_adapter = adapter_cls(
537 x,

c:\users\tasha\anaconda3\envs\directml\lib\site-packages\tensorflow_core\python\keras\engine\training_utils.py in split_training_and_validation_data(x, y, sample_weights, validation_split)
1869 'you cannot use validation_split.')
1870 if hasattr(x[0], 'shape'):
-> 1871 split_at = int(x[0].shape[0] * (1. - validation_split))
1872 else:
1873 split_at = int(len(x[0]) * (1. - validation_split))

TypeError: unsupported operand type(s) for *: 'Dimension' and 'float'
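
The traceback shows `x[0].shape[0]` being multiplied by a float. When the input is an `EagerTensor` in TensorFlow 1.15, `shape[0]` is a `Dimension` object rather than an `int`, which is what the `TypeError` reports. A possible workaround, sketched below under the assumption that the data fits in memory, is to split the data yourself and pass the held-out part through `validation_data` instead of `validation_split`. The helper name `split_train_validation` is hypothetical; it works on anything that supports `len()` and slicing (lists, NumPy arrays) and takes the validation samples from the end, as `validation_split` does.

```python
def split_train_validation(x, y, validation_split):
    """Split x and y so the last `validation_split` fraction is held out."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be between 0 and 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    # int() on a plain length avoids the Dimension * float multiplication.
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Toy data standing in for X_train / y_train (594 samples in the report).
x = list(range(594))
y = list(range(594))
(train_x, train_y), (val_x, val_y) = split_train_validation(x, y, 0.5)
print(len(train_x), len(val_x))  # 297 297
```

With tensors, converting to NumPy first (for example `X_train.numpy()`) and then calling `model.fit(train_x, train_y, validation_data=(val_x, val_y), ...)` should bypass the failing code path. This has not been verified against tensorflow-directml.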

How to build TF C++ custom operators for TF-DML?

Currently, the operator shaders in tensorflow-directml are not only incomplete but also not the most efficient.

We are looking into extending the Antares plugin to solve this; it has been verified in Linux CUDA/ROCm environments.

For TF-DML, the build stack targets Windows with VS + DirectML SDK + .., which is much more complex.
So how can we easily extend TF C++ custom operators for TF-DML?

Windows Camera post process(DMFT) with DirectML(Tensorflow)

We are developing camera post-processing in a DMFT (a Windows user-space DLL),
and we hope to run DirectML (with TensorFlow) inside the DMFT.
We have ported the DirectML C++ sample (DirectMLSuperResolution) to the DMFT and it works,
but we don't know how to proceed with the TensorFlow part.

Best regards

Not able to use my own callbacks

Hi, it's my first time reporting an issue, so I'm sorry if I misclassified it.
I need to do some research with TF2.0 with my team. When I run the code in an environment with tensorflow-cpu, the program works as expected. However, when I run it in another environment with tensorflow-directml (to use my GPU), the code breaks as follows:

Emphasis on error:

'''python
File "C:\Users\berna\anaconda3\envs\tf2-directml\lib\site-packages\tensorflow_core\python\ops\gen_resource_variable_ops.py",
line 64, in assign_add_variable_op
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'AssignAddVariableOp' OpKernel for 'DML' devices
compatible with node {{node AssignAddVariableOp}}
(OpKernel was found, but attributes didn't match) Requested Attributes: dtype=DT_DOUBLE
. Registered: device='CPU'; dtype in [DT_INT64]
device='CPU'; dtype in [DT_INT32]
device='CPU'; dtype in [DT_UINT16]
device='CPU'; dtype in [DT_INT16]
device='CPU'; dtype in [DT_UINT8]
device='CPU'; dtype in [DT_INT8]
device='CPU'; dtype in [DT_HALF]
device='CPU'; dtype in [DT_BFLOAT16]
device='CPU'; dtype in [DT_FLOAT]
device='CPU'; dtype in [DT_DOUBLE]
device='CPU'; dtype in [DT_COMPLEX64]
device='CPU'; dtype in [DT_COMPLEX128]
device='DML'; dtype in [DT_FLOAT]
device='DML'; dtype in [DT_HALF]
device='DML'; dtype in [DT_INT64]
[Op:AssignAddVariableOp]
'''
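
The error message lists which dtypes have an `AssignAddVariableOp` kernel on each device. Reading it programmatically makes the mismatch explicit: the op was requested with `DT_DOUBLE`, which is registered for CPU but not for DML. The sketch below is just a log-reading aid with a hypothetical helper name; it does not call any TensorFlow API.

```python
import re


def registered_dtypes(message):
    """Map each device name to the set of dtypes listed in the error message."""
    by_device = {}
    pattern = re.compile(r"device='(\w+)'; dtype in \[(\w+)\]")
    for device, dtype in pattern.findall(message):
        by_device.setdefault(device, set()).add(dtype)
    return by_device


# Excerpt of the message above (CPU list shortened).
message = """
Requested Attributes: dtype=DT_DOUBLE
. Registered: device='CPU'; dtype in [DT_FLOAT]
device='CPU'; dtype in [DT_DOUBLE]
device='DML'; dtype in [DT_FLOAT]
device='DML'; dtype in [DT_HALF]
device='DML'; dtype in [DT_INT64]
"""

kernels = registered_dtypes(message)
requested = re.search(r"Requested Attributes: dtype=(\w+)", message).group(1)
print(requested in kernels["DML"])  # False
print(sorted(kernels["DML"]))       # ['DT_FLOAT', 'DT_HALF', 'DT_INT64']
```

If the custom callback updates a variable or metric with a float64 value (NumPy defaults to float64), casting to float32 before the update might avoid the missing kernel. That is a guess from the message, not something confirmed in this report.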

Keras allows creating a custom callback, as explained at https://keras.io/guides/writing_your_own_callbacks/. I know the problem is only with the custom callback because if I comment it out and use only the built-in Keras callbacks, the code works again with DirectML.

Where callbacks are called [custom callback is called as "PlotLearning(X_val,y_val)"]:

Eager is activated (I'm not sure if it matters) with tf.compat.v1.enable_eager_execution()

My specifications:

Hardware:
- GPU = AMD Radeon RX 6600 XT
- CPU = Intel Core i5-11400F
Software:
- Anaconda 2.2.0
- Python 3.6.13
- tensorflow-directml 1.15.7
- Windows 11

Thanks for your help!

Latest TensorFlow-DirectML C API package

Would it be possible to release a C API package each time the Python packages are released? I don't have access to the Windows Insider SDKs and have not been successful in building it from source.

LSTM training is super slow on GPU

This training loop takes more than a second per epoch with tensorflow-directml but a fraction of a second with standard TensorFlow.
It also doesn't train correctly (the error becomes NaN after a couple of iterations), but I have already opened a separate issue for that.

Code:

import tensorflow as tf
import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt
import time
from datetime import timedelta

def fn(x):
    return tf.sin(x)

seq_length = 200
x = tf.linspace(tf.constant(0, dtype=tf.float32), 50, seq_length)
y = fn(x)

n_outputs = 50
model = keras.layers.LSTM(n_outputs, return_sequences=True)
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = keras.losses.MSE

loss_history = []
epochs = 2_000
out_epochs = 10
start = time.time()
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        y_pred = model(tf.zeros(shape=(1, seq_length, 1)))
        y_pred_data = y_pred[0, :, 0]
        loss = loss_fn(y, y_pred_data)
    loss_history.append(loss.numpy())
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if epoch % out_epochs == 0:
        print(f"Epoch {epoch}: Loss = {loss} ({timedelta(seconds=time.time()-start)})")

System: Intel i5-7200U with Intel HD graphics 620

Validation accuracy not improving?

System Information:

Windows 10 Build/Version: 20H2 (OS Build 19042.906)
native windows
Python Version: Python 3.7 via anaconda virtual env
TensorFlow-DirectML Version: 1.15.4.dev201216
Graphics card driver version: Radeon Adrenalin 21.2.3
Radeon RX Vega 64 8gb

Repro:

Hey, I did transfer learning for image classification with ResNet50 on my own dataset, based on the "Transfer Learning Resnet 50.ipynb" script from https://github.com/krishnaik06/Tomato-Leaf-Disease-Prediction.

I compared two results: one from CPU TensorFlow 2 and one from DirectML TensorFlow 1.15 running on the Vega 64.
However, the two results differ drastically.

Five epochs, CPU TensorFlow 2:

Five epochs, GPU DirectML TensorFlow 1.15:

I also tried running the code on Google Colab, but only for one epoch due to the long training time:
epoch 1/10 323/323 [==============================] - 13357s 41s/step - loss: 0.7214 - accuracy: 0.8985 - val_loss: 0.1454 - val_accuracy: 0.9617

Is this normal because of the different TensorFlow versions?
Execution time for one epoch on CPU TensorFlow is about 20 min.
Execution time for one epoch on GPU DirectML TensorFlow is about 16 min.

Help Needed on Out of Memory Issue

Hi,

I tried to run the SSD MobileNet V1 model from http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz with the precompiled tensorflow-directml C libraries available at https://github.com/microsoft/tensorflow-directml/releases/tag/v1.15.5.dev210429.

However, I get an OOM error when running the model. A snippet of the output is below:

2021-06-28 09:23:32.032871: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-06-28 09:23:32.035434: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library C:\Users\vince\Downloads\Release/directml.adbd007a01a52364381a1c71ebb6fa1b2389c88d.dll
2021-06-28 09:23:32.166394: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:249] DirectML device enumeration: found 1 compatible adapters.
2021-06-28 09:23:32.232103: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:185] DirectML: creating device on adapter 0 (NVIDIA GeForce GTX 950)
2021-06-28 09:23:32.299033: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library Kernel32.dll
...
2021-06-28 09:24:26.743588: I tensorflow/core/common_runtime/bfc_allocator.cc:935]      Summary of in-use Chunks by size: 
2021-06-28 09:24:26.743597: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 147330 Chunks of size 256 totalling 35.97MiB
2021-06-28 09:24:26.743604: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 704 Chunks of size 512 totalling 352.0KiB
2021-06-28 09:24:26.743609: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 69 Chunks of size 768 totalling 51.8KiB
2021-06-28 09:24:26.743615: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 9 Chunks of size 1024 totalling 9.0KiB
2021-06-28 09:24:26.743620: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 3 Chunks of size 1280 totalling 3.8KiB
2021-06-28 09:24:26.743626: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 19 Chunks of size 2048 totalling 38.0KiB
2021-06-28 09:24:26.743632: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 6 Chunks of size 2304 totalling 13.5KiB
2021-06-28 09:24:26.743637: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 3584 totalling 3.5KiB
2021-06-28 09:24:26.743643: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 4 Chunks of size 4096 totalling 16.0KiB
2021-06-28 09:24:26.743648: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 2 Chunks of size 4608 totalling 9.0KiB
2021-06-28 09:24:26.743654: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 4864 totalling 4.8KiB
2021-06-28 09:24:26.743660: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 63201 Chunks of size 7680 totalling 462.90MiB
2021-06-28 09:24:26.743665: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 101 Chunks of size 7936 totalling 782.8KiB
2021-06-28 09:24:26.743671: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 69 Chunks of size 8192 totalling 552.0KiB
2021-06-28 09:24:26.743676: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 39 Chunks of size 8448 totalling 321.8KiB
2021-06-28 09:24:26.743682: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 51 Chunks of size 8704 totalling 433.5KiB
2021-06-28 09:24:26.743687: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 62 Chunks of size 8960 totalling 542.5KiB
2021-06-28 09:24:26.743693: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 117 Chunks of size 9216 totalling 1.03MiB
2021-06-28 09:24:26.743698: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 189 Chunks of size 9472 totalling 1.71MiB
2021-06-28 09:24:26.743704: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 2960 Chunks of size 9728 totalling 27.46MiB
2021-06-28 09:24:26.743709: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 54 Chunks of size 9984 totalling 526.5KiB
2021-06-28 09:24:26.743715: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 38 Chunks of size 10240 totalling 380.0KiB
2021-06-28 09:24:26.743720: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 51 Chunks of size 10496 totalling 522.8KiB
2021-06-28 09:24:26.743726: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 94 Chunks of size 10752 totalling 987.0KiB
2021-06-28 09:24:26.743732: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 122 Chunks of size 11008 totalling 1.28MiB
2021-06-28 09:24:26.743737: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 290 Chunks of size 11264 totalling 3.12MiB
2021-06-28 09:24:26.743743: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 49 Chunks of size 11520 totalling 551.3KiB
2021-06-28 09:24:26.743748: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 616 Chunks of size 11776 totalling 6.92MiB
2021-06-28 09:24:26.743754: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 27 Chunks of size 12032 totalling 317.3KiB
2021-06-28 09:24:26.743759: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 55 Chunks of size 12288 totalling 660.0KiB
2021-06-28 09:24:26.743766: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 72 Chunks of size 12544 totalling 882.0KiB
2021-06-28 09:24:26.743772: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 174 Chunks of size 12800 totalling 2.12MiB
2021-06-28 09:24:26.743778: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 494 Chunks of size 13056 totalling 6.15MiB
2021-06-28 09:24:26.743783: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 943 Chunks of size 13312 totalling 11.97MiB
2021-06-28 09:24:26.743789: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 22 Chunks of size 13568 totalling 291.5KiB
2021-06-28 09:24:26.743794: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 129 Chunks of size 13824 totalling 1.70MiB
2021-06-28 09:24:26.743800: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 17 Chunks of size 14080 totalling 233.8KiB
2021-06-28 09:24:26.743805: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 44 Chunks of size 14336 totalling 616.0KiB
2021-06-28 09:24:26.743810: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 132 Chunks of size 14592 totalling 1.84MiB
2021-06-28 09:24:26.743816: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 304 Chunks of size 14848 totalling 4.30MiB
2021-06-28 09:24:26.743821: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1399 Chunks of size 15104 totalling 20.15MiB
2021-06-28 09:24:26.743827: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 69491 Chunks of size 15360 totalling 1017.93MiB
2021-06-28 09:24:26.743833: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 32 Chunks of size 15616 totalling 488.0KiB
2021-06-28 09:24:26.743839: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 45 Chunks of size 15872 totalling 697.5KiB
2021-06-28 09:24:26.743844: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 16 Chunks of size 16128 totalling 252.0KiB
2021-06-28 09:24:26.743849: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 25 Chunks of size 16384 totalling 400.0KiB
2021-06-28 09:24:26.743855: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 14 Chunks of size 16640 totalling 227.5KiB
2021-06-28 09:24:26.743860: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 14 Chunks of size 16896 totalling 231.0KiB
2021-06-28 09:24:26.743865: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 19 Chunks of size 17152 totalling 318.3KiB
2021-06-28 09:24:26.743871: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 462 Chunks of size 17408 totalling 7.67MiB
2021-06-28 09:24:26.743876: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 12 Chunks of size 17664 totalling 207.0KiB
2021-06-28 09:24:26.743882: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 20 Chunks of size 17920 totalling 350.0KiB
2021-06-28 09:24:26.743887: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 14 Chunks of size 18176 totalling 248.5KiB
2021-06-28 09:24:26.743892: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 21 Chunks of size 18432 totalling 378.0KiB
2021-06-28 09:24:26.743898: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 25 Chunks of size 18688 totalling 456.3KiB
2021-06-28 09:24:26.743904: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 14 Chunks of size 18944 totalling 259.0KiB
2021-06-28 09:24:26.743909: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 20 Chunks of size 19200 totalling 375.0KiB
2021-06-28 09:24:26.743915: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 214 Chunks of size 19456 totalling 3.97MiB
2021-06-28 09:24:26.743920: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 19712 totalling 211.8KiB
2021-06-28 09:24:26.743925: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 17 Chunks of size 19968 totalling 331.5KiB
2021-06-28 09:24:26.743931: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 9 Chunks of size 20224 totalling 177.8KiB
2021-06-28 09:24:26.743936: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 24 Chunks of size 20480 totalling 480.0KiB
2021-06-28 09:24:26.743942: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 32 Chunks of size 20736 totalling 648.0KiB
2021-06-28 09:24:26.743949: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 54 Chunks of size 20992 totalling 1.08MiB
2021-06-28 09:24:26.743954: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 13 Chunks of size 21248 totalling 269.8KiB
2021-06-28 09:24:26.743959: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 64 Chunks of size 21504 totalling 1.31MiB
2021-06-28 09:24:26.743965: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 8 Chunks of size 21760 totalling 170.0KiB
2021-06-28 09:24:26.743970: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 19 Chunks of size 22016 totalling 408.5KiB
2021-06-28 09:24:26.743979: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 15 Chunks of size 22272 totalling 326.3KiB
2021-06-28 09:24:26.743985: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 25 Chunks of size 22528 totalling 550.0KiB
2021-06-28 09:24:26.743990: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 62 Chunks of size 22784 totalling 1.35MiB
2021-06-28 09:24:26.743996: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 511 Chunks of size 23040 totalling 11.23MiB
2021-06-28 09:24:26.744001: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 23296 totalling 250.3KiB
2021-06-28 09:24:26.744007: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 22 Chunks of size 23552 totalling 506.0KiB
2021-06-28 09:24:26.744012: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 4 Chunks of size 23808 totalling 93.0KiB
2021-06-28 09:24:26.744018: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 15 Chunks of size 24064 totalling 352.5KiB
2021-06-28 09:24:26.744023: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 3 Chunks of size 24320 totalling 71.3KiB
2021-06-28 09:24:26.744029: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 24576 totalling 264.0KiB
2021-06-28 09:24:26.744034: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 24 Chunks of size 24832 totalling 582.0KiB
2021-06-28 09:24:26.744040: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 113 Chunks of size 25088 totalling 2.70MiB
2021-06-28 09:24:26.744046: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 7 Chunks of size 25344 totalling 173.3KiB
2021-06-28 09:24:26.744051: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 19 Chunks of size 25600 totalling 475.0KiB
2021-06-28 09:24:26.744056: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 9 Chunks of size 25856 totalling 227.3KiB
2021-06-28 09:24:26.744062: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 26112 totalling 280.5KiB
2021-06-28 09:24:26.744068: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 26368 totalling 283.3KiB
2021-06-28 09:24:26.744073: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 14 Chunks of size 26624 totalling 364.0KiB
2021-06-28 09:24:26.744078: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 16 Chunks of size 26880 totalling 420.0KiB
2021-06-28 09:24:26.744084: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 79 Chunks of size 27136 totalling 2.04MiB
2021-06-28 09:24:26.744089: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 6 Chunks of size 27392 totalling 160.5KiB
2021-06-28 09:24:26.744095: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 13 Chunks of size 27648 totalling 351.0KiB
2021-06-28 09:24:26.744100: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 8 Chunks of size 27904 totalling 218.0KiB
2021-06-28 09:24:26.744106: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 13 Chunks of size 28160 totalling 357.5KiB
2021-06-28 09:24:26.744111: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 28416 totalling 305.3KiB
2021-06-28 09:24:26.744116: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 26 Chunks of size 28672 totalling 728.0KiB
2021-06-28 09:24:26.744122: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 10 Chunks of size 28928 totalling 282.5KiB
2021-06-28 09:24:26.744128: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 51 Chunks of size 29184 totalling 1.42MiB
2021-06-28 09:24:26.744134: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 15 Chunks of size 29440 totalling 431.3KiB
2021-06-28 09:24:26.744140: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 23 Chunks of size 29696 totalling 667.0KiB
2021-06-28 09:24:26.744145: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 11 Chunks of size 29952 totalling 321.8KiB
2021-06-28 09:24:26.744150: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 20 Chunks of size 30208 totalling 590.0KiB
2021-06-28 09:24:26.744156: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 21 Chunks of size 30464 totalling 624.8KiB
2021-06-28 09:24:26.744161: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 30720 totalling 30.0KiB
2021-06-28 09:24:26.744167: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 32768 totalling 32.0KiB
2021-06-28 09:24:26.744172: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 36864 totalling 36.0KiB
2021-06-28 09:24:26.744177: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 49152 totalling 48.0KiB
2021-06-28 09:24:26.744183: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 2 Chunks of size 65536 totalling 128.0KiB
2021-06-28 09:24:26.744188: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 98304 totalling 96.0KiB
2021-06-28 09:24:26.744194: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 2 Chunks of size 131072 totalling 256.0KiB
2021-06-28 09:24:26.744199: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 2 Chunks of size 262144 totalling 512.0KiB
2021-06-28 09:24:26.744205: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 279552 totalling 273.0KiB
2021-06-28 09:24:26.744210: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 294912 totalling 288.0KiB
2021-06-28 09:24:26.744216: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 524288 totalling 512.0KiB
2021-06-28 09:24:26.744221: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 3 Chunks of size 559104 totalling 1.60MiB
2021-06-28 09:24:26.744227: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 6 Chunks of size 1048576 totalling 6.00MiB
2021-06-28 09:24:26.744232: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 1118208 totalling 1.07MiB
2021-06-28 09:24:26.744238: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 2 Chunks of size 1179648 totalling 2.25MiB
2021-06-28 09:24:26.744244: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 2097152 totalling 2.00MiB
2021-06-28 09:24:26.744249: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 2236416 totalling 2.13MiB
2021-06-28 09:24:26.744254: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 4194304 totalling 4.00MiB
2021-06-28 09:24:26.744260: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 4718592 totalling 4.50MiB
2021-06-28 09:24:26.744265: I tensorflow/core/common_runtime/bfc_allocator.cc:938] 1 Chunks of size 10957568 totalling 10.45MiB
2021-06-28 09:24:26.744270: I tensorflow/core/common_runtime/bfc_allocator.cc:942] Sum Total of in-use chunks: 1.66GiB
2021-06-28 09:24:26.744277: I tensorflow/core/common_runtime/bfc_allocator.cc:944] total_region_allocated_bytes_: 1789375232 memory_limit_: 1789375284 available bytes: 52 curr_region_allocation_bytes_: 3578750976
2021-06-28 09:24:26.744285: I tensorflow/core/common_runtime/bfc_allocator.cc:950] Stats: 
Limit:                  1789375284
InUse:                  1784465664
MaxInUse:               1788396032
NumAllocs:                  821973
MaxAllocSize:             11416576

2021-06-28 09:24:26.758302: W tensorflow/core/common_runtime/bfc_allocator.cc:445] ****************************************************************************************************
2021-06-28 09:24:26.758346: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at dml_kernel_context.cc:167 : Resource exhausted: OOM when allocating tensor with shape[1,64,150,150] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,64,150,150] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
	 [[{{node FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Postprocessor/BatchMultiClassNonMaxSuppression/map/while/LoopCond/_1539]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,64,150,150] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
	 [[{{node FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_pointwise/BatchNorm/batchnorm/add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
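
The numbers in the log above are enough to see why this particular allocation fails. As a rough check, the sketch below recomputes the size of the requested tensor and the headroom left under the allocator limit. The values are copied from the log, the variable names are illustrative only, and the calculation ignores fragmentation and allocator rounding.

```python
# Values copied from the "Stats" block of the allocator log above.
limit_bytes = 1789375284      # "Limit"
in_use_bytes = 1784465664     # "InUse"

# The failing tensor has shape [1, 64, 150, 150] and type float (4 bytes each).
shape = (1, 64, 150, 150)
bytes_per_float = 4

num_elements = 1
for dim in shape:
    num_elements *= dim
tensor_bytes = num_elements * bytes_per_float

# Headroom left under the limit, ignoring fragmentation.
free_bytes = limit_bytes - in_use_bytes
fits = tensor_bytes <= free_bytes

print(f"tensor needs {tensor_bytes} bytes, {free_bytes} bytes left, fits: {fits}")
```

The request (about 5.5 MiB) is larger than the remaining headroom (about 4.7 MiB), so the allocator was essentially full before this tensor was requested. That is consistent with memory use growing over time rather than with a single oversized tensor.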

System Information

Host System

Windows 10 Version : Windows 10 Pro 64-bit (10.0, Build 19041) (19041.vb_release.191206-1406)
Processor : AMD Ryzen 5 3600 6-Core Processor (12 CPUs), ~3.6GHz
Memory : 16384MB RAM
DirectX Version : DirectX 12

Python Environment

Python Version : 3.6.8
TensorFlow-DirectML : 1.15.5.dev210429

DirectX Device

Description : NVIDIA GeForce GTX 950
Manufacturer : NVIDIA
Chip Type : GeForce GTX 950
Dedicated Memory : 2007 MB
Driver Version : 27.21.14.5167
Driver Model : WDDM 2.7
Driver Date : 5/7/2020 8:00:00 AM
Feature Levels : 12_1,12_0,11_1,11_0,10_1,10_0,9_3,9_2,9_1

Repro Details

Describe the current behavior
The sample hits an OOM error after running for some time (>10 s).

Describe the expected behavior
No OOM issue

Code to reproduce the issue
The attached VS2019 projects can be compiled with std c++17, opencv and tensorflow-directml C libraries.
https://drive.google.com/drive/folders/126U92nV160TaWZLOUSH5FYUK6-mpXCZK?usp=sharing

The executable can be run as below:
tf_directml_ssdmobilenetv1.exe [video file]

FYI, I have used https://github.com/serizba/cppflow/tree/243ff2fc4e33632b91676cad7d6cfc3c92308601 in the sample codes.

Other info / logs
The complete logs along with system info are located in https://drive.google.com/drive/folders/126U92nV160TaWZLOUSH5FYUK6-mpXCZK?usp=sharing as well.

In addition, I tested exactly the same code linked against the regular tensorflow-cpu C libraries and did not see the OOM issue.

Thanks,
Vincent

AMD Radeon R7 350X GPU not improving computational time

System

Windows: Win Home Version 20H2 (OS Build 19042.985)
Processor: i7-6700 CPU 3.40GHz
Memory: 32768MB Ram
DirectX Version : DirectX 12

Python Environment

Python Version : 3.6
TensorFlow-DirectML : 1.15.5.dev210429

I am trying to train a model on an AMD Radeon R7 350X GPU. The GPU is recognized after setting up tensorflow-directml, but it does not improve computation time in the slightest (to be precise, the time stays exactly the same).

For comparison, I ran the same training on the CPU (i7-6700, 3.40GHz).

Is this to be expected? I understand the R7 350X is not precisely a powerful GPU, but shouldn't there be some improvement still?

Tensorflow optimizers approach NaN

The loss of this model/training loop becomes NaN after a couple of iterations:

import tensorflow as tf
import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt
import time
from datetime import timedelta

def fn(x):
    return tf.sin(x)

seq_length = 200
x = tf.linspace(tf.constant(0, dtype=tf.float32), 50, seq_length)
y = fn(x)

n_outputs = 50
model = keras.layers.LSTM(n_outputs, return_sequences=True)
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = keras.losses.MSE

loss_history = []
epochs = 2_000
out_epochs = 10
start = time.time()
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        y_pred = model(tf.zeros(shape=(1, seq_length, 1)))
        y_pred_data = y_pred[0, :, 0]
        loss = loss_fn(y, y_pred_data)
    loss_history.append(loss.numpy())
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if epoch % out_epochs == 0:
        print(f"Epoch {epoch}: Loss = {loss} ({timedelta(seconds=time.time()-start)})")

After a couple of training iterations, the loss is NaN instead of a finite value.

System: Intel i5-7200U with Intel HD Graphics 620
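
To pin down the iteration at which the loss first turns NaN, a check along these lines could be run on the recorded loss values. This is only a sketch using the standard library. `first_nan_step` is a hypothetical helper name, and the example values are made up, not measurements from this report.

```python
import math

def first_nan_step(losses):
    """Return the index of the first NaN loss, or None if there is none."""
    for step, value in enumerate(losses):
        if math.isnan(float(value)):
            return step
    return None

# Example with made-up loss values (not taken from the report).
example_history = [0.52, 0.49, float("nan"), float("nan")]
print(first_nan_step(example_history))
```

Applied to `loss_history` from the loop above, the returned index would show how many iterations complete before the loss stops being finite.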

CrossFire & MultiGPU & GPU workload

1. How can I check that the library uses both AMD RX570 GPUs (as one logical card with two nodes)?
2. For maximum performance, should I
   a) enable or disable CrossFire?
   b) set GPU Workload to Compute or Graphics?

WARNING:tensorflow:From C:\Users\yuriy\source\repos\PythonApplication1\PythonApplication1\env\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-11-15 19:13:48.169040: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectML70b4b8b341c8bda5dc82ecd28a29d918c28282b0.dll
2020-11-15 19:13:48.243361: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:132] DirectML device enumeration: found 2 compatible adapters.
2020-11-15 19:13:48.244920: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-11-15 19:13:48.246223: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 0 (Radeon RX 570 Series)
2020-11-15 19:13:48.411156: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:114] DirectML: creating device on adapter 1 (Intel(R) HD Graphics 530)

update pypi project?

The last version was published in December.
Can you update to the latest?

[Feature Request] DirectML as Pluggable device for TensorFlow

The latest version of TensorFlow (v2.5.0) adds PluggableDevice support.
I initially followed both your DML implementation and Apple's apple/tensorflow_macos implementation. I recently noticed that the Apple repo has been archived; after looking into it, I learned that they now use PluggableDevice directly to run TensorFlow on Metal.
I think this plug-in approach would also suit DirectML very well.

With two DirectML devices present, the squeezenet sample code does not use Radeon adapter

On my system two DirectML devices are found (correct), but the sample defaults to the slower device 0, and I was not able to change the deployment config to force it to use the Radeon device, which has id 1.

2020-06-23 16:30:30.909523: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-06-23 16:30:30.950090: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 2 compatible adapters.
2020-06-23 16:30:30.950294: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) UHD Graphics 620)
2020-06-23 16:30:30.971162: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll
2020-06-23 16:30:30.981310: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 1 (Radeon (TM) RX 550X)
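
The enumeration lines above already give the adapter index for each GPU. As a sketch, the helper below parses those lines and builds a device string for the adapter whose name matches. It assumes that DML device N corresponds to adapter N in the log, which the report does not confirm, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   "... DirectML: creating device on adapter 1 (Radeon (TM) RX 550X)"
ADAPTER_LINE = re.compile(r"DirectML: creating device on adapter (\d+) \((.*)\)\s*$")

def list_adapters(log_text):
    """Map adapter index -> adapter name, parsed from the startup log."""
    adapters = {}
    for line in log_text.splitlines():
        match = ADAPTER_LINE.search(line)
        if match:
            adapters[int(match.group(1))] = match.group(2)
    return adapters

def device_string_for(log_text, name_fragment):
    """Return a string such as "/device:DML:1" for the first adapter whose
    name contains name_fragment (case-insensitive), or None if none match."""
    for index, name in sorted(list_adapters(log_text).items()):
        if name_fragment.lower() in name.lower():
            # Assumption: DML device N corresponds to adapter N in the log.
            return f"/device:DML:{index}"
    return None

log = """\
2020-06-23 16:30:30.950294: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Intel(R) UHD Graphics 620)
2020-06-23 16:30:30.981310: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 1 (Radeon (TM) RX 550X)
"""
print(device_string_for(log, "radeon"))
```

The resulting string could then be passed to `tf.device(...)` around the model-building code. Whether the sample's deployment config honors that placement is not something this sketch verifies.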

Cannot assign a device for operation embedding/embeddings/Initializer/random_uniform/

System Information

Windows 10
Python Version (3.6.13)
TensorFlow-DirectML Version (1.15.7)
Graphics card driver version ( AMD Radeon Pro V520 MxGPU)

Hi,
I have an AMD GPU on my local machine and I want to train an LSTM model that requires TensorFlow. With TensorFlow-DirectML installed, the machine detects the GPU. The code and results are below:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 5162271997438626014,
name: "/device:DML:0"
device_type: "DML"
memory_limit: 6797208279
locality {
}
incarnation: 12883817374713471833
physical_device_desc: "{"name": "AMD Radeon Pro V520 MxGPU", "vendor_id": 4098, "device_id": 29538, "driver_version": "27.20.11025.4019"}"]

No problems so far. But when training the model, is there a step needed to activate this GPU? I am getting the error below. Without the GPU, the model starts running and I can see the epochs progressing, but the model is fairly complex, so it takes a long time to get a result.
TensorFlow detects the GPU, but a device placement problem occurs while training the model.
Do you have any idea what the problem is?

InvalidArgumentError: Cannot assign a device for operation embedding/embeddings/Initializer/random_uniform/sub: Could not satisfy explicit device specification '' because the node node embedding/embeddings/Initializer/random_uniform/sub (defined at C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) placed on device Device assignments active during op 'embedding/embeddings/Initializer/random_uniform/sub' creation:
with tf.device(None): <C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1535> was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:DML:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:DML:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:DML:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:DML:0' resource_device_name_='/job:localhost/replica:0/task:0/device:DML:0' supported_device_types_=[CPU] possible_devices_=[]
Add: DML CPU
Const: DML CPU
RandomUniform: DML CPU
Sub: DML CPU
Mul: DML CPU
Sqrt: DML CPU
VarHandleOp: DML CPU
AssignVariableOp: DML CPU
VarIsInitializedOp: DML CPU
ReadVariableOp: DML CPU
ResourceGather: DML CPU
Identity: DML CPU
ResourceScatterAdd: DML CPU
Fill: DML CPU
Shape: DML CPU
Unique: DML CPU
StridedSlice: DML CPU
UnsortedSegmentSum: CPU
AddV2: DML CPU
RealDiv: DML CPU
AssignSubVariableOp: DML CPU
NoOp: DML CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
embedding/embeddings/Initializer/random_uniform/shape (Const)
embedding/embeddings/Initializer/random_uniform/min (Const)
embedding/embeddings/Initializer/random_uniform/max (Const)
embedding/embeddings/Initializer/random_uniform/RandomUniform (RandomUniform) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
embedding/embeddings/Initializer/random_uniform/sub (Sub)
embedding/embeddings/Initializer/random_uniform/mul (Mul)
embedding/embeddings/Initializer/random_uniform (Add)
embedding/embeddings (VarHandleOp) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
embedding/embeddings/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
embedding/embeddings/Assign (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
embedding/embeddings/Read/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
embedding/embedding_lookup (ResourceGather) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
embedding/embedding_lookup/Identity (Identity)
VarIsInitializedOp (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:DML:0
training/Adam/embedding/embeddings/m/Initializer/zeros/shape_as_tensor (Const)
training/Adam/embedding/embeddings/m/Initializer/zeros/Const (Const)
training/Adam/embedding/embeddings/m/Initializer/zeros (Fill)
training/Adam/embedding/embeddings/m (VarHandleOp)
training/Adam/embedding/embeddings/m/IsInitialized/VarIsInitializedOp (VarIsInitializedOp)
training/Adam/embedding/embeddings/m/Assign (AssignVariableOp)
training/Adam/embedding/embeddings/m/Read/ReadVariableOp (ReadVariableOp)
training/Adam/embedding/embeddings/v/Initializer/zeros/shape_as_tensor (Const)
training/Adam/embedding/embeddings/v/Initializer/zeros/Const (Const)
training/Adam/embedding/embeddings/v/Initializer/zeros (Fill)
training/Adam/embedding/embeddings/v (VarHandleOp)
training/Adam/embedding/embeddings/v/IsInitialized/VarIsInitializedOp (VarIsInitializedOp)
training/Adam/embedding/embeddings/v/Assign (AssignVariableOp)
training/Adam/embedding/embeddings/v/Read/ReadVariableOp (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/Unique (Unique)
training/Adam/Adam/update_embedding/embeddings/Shape (Shape)
training/Adam/Adam/update_embedding/embeddings/strided_slice/stack (Const)
training/Adam/Adam/update_embedding/embeddings/strided_slice/stack_1 (Const)
training/Adam/Adam/update_embedding/embeddings/strided_slice/stack_2 (Const)
training/Adam/Adam/update_embedding/embeddings/strided_slice (StridedSlice)
training/Adam/Adam/update_embedding/embeddings/UnsortedSegmentSum (UnsortedSegmentSum)
training/Adam/Adam/update_embedding/embeddings/mul (Mul)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/mul_1 (Mul)
training/Adam/Adam/update_embedding/embeddings/AssignVariableOp (AssignVariableOp)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp_1 (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/ResourceScatterAdd (ResourceScatterAdd)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp_2 (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/mul_2 (Mul)
training/Adam/Adam/update_embedding/embeddings/mul_3 (Mul)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp_3 (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/mul_4 (Mul)
training/Adam/Adam/update_embedding/embeddings/AssignVariableOp_1 (AssignVariableOp)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp_4 (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/ResourceScatterAdd_1 (ResourceScatterAdd)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp_5 (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/Sqrt (Sqrt)
training/Adam/Adam/update_embedding/embeddings/mul_5 (Mul)
training/Adam/Adam/update_embedding/embeddings/add (AddV2)
training/Adam/Adam/update_embedding/embeddings/truediv (RealDiv)
training/Adam/Adam/update_embedding/embeddings/AssignSubVariableOp (AssignSubVariableOp)
training/Adam/Adam/update_embedding/embeddings/ReadVariableOp_6 (ReadVariableOp)
training/Adam/Adam/update_embedding/embeddings/group_deps (NoOp)
VarIsInitializedOp_19 (VarIsInitializedOp)
VarIsInitializedOp_37 (VarIsInitializedOp)

 [[node embedding/embeddings/Initializer/random_uniform/sub (defined at C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]Additional information about colocations:No node-device colocations were active during op 'embedding/embeddings/Initializer/random_uniform/sub' creation.

Device assignments active during op 'embedding/embeddings/Initializer/random_uniform/sub' creation:
with tf.device(None): <C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1535>

Original stack trace for 'embedding/embeddings/Initializer/random_uniform/sub':
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel_launcher.py", line 16, in
app.launch_new_instance()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\traitlets\config\application.py", line 664, in launch_instance
app.start()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel\kernelapp.py", line 612, in start
self.io_loop.start()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\platform\asyncio.py", line 199, in start
self.asyncio_loop.run_forever()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\asyncio\base_events.py", line 442, in run_forever
self._run_once()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\asyncio\base_events.py", line 1462, in _run_once
handle._run()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\asyncio\events.py", line 145, in _run
self._callback(*self._args)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\ioloop.py", line 688, in
lambda f: self._run_callback(functools.partial(callback, future))
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
ret = callback()
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 814, in inner
self.ctx_run(self.run)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 775, in run
yielded = self.gen.send(value)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
yield gen.maybe_future(dispatch(*args))
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel\kernelbase.py", line 268, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel\kernelbase.py", line 545, in execute_request
user_expressions, allow_stdin,
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel\ipkernel.py", line 306, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\IPython\core\interactiveshell.py", line 2867, in run_cell
raw_cell, store_history, silent, shell_futures)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\IPython\core\interactiveshell.py", line 2895, in _run_cell
return runner(coro)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\IPython\core\async_helpers.py", line 68, in pseudo_sync_runner
coro.send(None)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\IPython\core\interactiveshell.py", line 3072, in run_cell_async
interactivity=interactivity, compiler=compiler, result=result)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\IPython\core\interactiveshell.py", line 3263, in run_ast_nodes
if (await self.run_code(code, result, async=asy)):
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
concat_lstm = get_model1(tf_idf_train,X_meta_train, results,embedding_dimensions)
File "", line 17, in get_model1
mask_zero=True)(tf_idf_input) # Use masking to handle the variable sequence lengths
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 824, in call
self._maybe_build(inputs)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 2146, in _maybe_build
self.build(input_shapes)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\utils\tf_utils.py", line 306, in wrapper
output_shape = fn(instance, input_shape)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\layers\embeddings.py", line 146, in build
constraint=self.embeddings_constraint)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 529, in add_weight
aggregation=aggregation)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\training\tracking\base.py", line 712, in _add_variable_with_custom_getter
**kwargs_for_getter)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\engine\base_layer_utils.py", line 139, in make_variable
shape=variable_shape if variable_shape else None)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\variables.py", line 258, in call
return cls._variable_v1_call(*args, **kwargs)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\variables.py", line 219, in _variable_v1_call
shape=shape)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\variables.py", line 197, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\variable_scope.py", line 2503, in default_variable_creator
shape=shape)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\variables.py", line 262, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py", line 1406, in init
distribute_strategy=distribute_strategy)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py", line 1537, in _init_from_args
initial_value() if init_from_fn else initial_value,
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\keras\engine\base_layer_utils.py", line 119, in
init_val = lambda: initializer(shape, dtype=dtype)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\init_ops.py", line 283, in call
shape, self.minval, self.maxval, dtype, seed=self.seed)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\random_ops.py", line 246, in random_uniform
result = math_ops.add(rnd * (maxval - minval), minval, name=name)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\math_ops.py", line 899, in binary_op_wrapper
return func(x, y, name=name)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py", line 11926, in sub
"Sub", x=x, y=y, name=name)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3371, in create_op
attrs, op_def, compute_device)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3440, in _create_op_internal
op_def=op_def)
File "C:\Users\kagan.senturk\Anaconda3\envs\tfradeon\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in init
self._traceback = tf_stack.extract_stack()
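
The "Colocation Debug Info" list earlier in this trace shows which device types each op in the group supports. One way to read it: the group was requested on DML:0, yet `supported_device_types_` is `[CPU]`, which suggests that at least one op in the group has no DML kernel. The sketch below scans lines of that form for ops lacking a given device. `ops_without_device` is a hypothetical helper, not part of TensorFlow.

```python
def ops_without_device(debug_lines, device="DML"):
    """Return op types whose supported-device list does not include `device`.

    Each input line is expected to look like "OpName: DML CPU".
    """
    missing = []
    for line in debug_lines:
        if ":" not in line:
            continue
        op_name, devices = line.split(":", 1)
        if device not in devices.split():
            missing.append(op_name.strip())
    return missing

# A few entries copied from the "Colocation Debug Info" list in the trace.
debug_info = [
    "ResourceScatterAdd: DML CPU",
    "Unique: DML CPU",
    "StridedSlice: DML CPU",
    "UnsortedSegmentSum: CPU",
    "AddV2: DML CPU",
]
print(ops_without_device(debug_info))
```

In the list from this report, the only op without DML support is `UnsortedSegmentSum`, which belongs to the Adam update of the embedding variable.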

Windows version could not load dynamic library 'DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll'

Hi all,

I just installed the Windows version and ran the test from here, but TF complains that it could not load the dynamic library 'DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll'.

The first time when I was installing tensorflow-directml, I got an error message

ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: 'C:\\Users\\Username\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python37\\site-packages\\tensorflow_estimator\\python\\estimator\\canned\\linear_optimizer\\python\\utils\\__pycache__\\sharded_mutable_dense_hashtable.cpython-37.pyc'

which didn't show up when I tried to do pip install again. I'm not sure if this error is related to the library issue.

The full input & output is

>>> import tensorflow.compat.v1 as tf
>>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
>>> print(tf.add([1.0, 2.0], [3.0, 4.0]))
2020-08-16 22:42:06.778056: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
2020-08-16 22:42:06.778504: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-08-16 22:42:06.780186: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Radeon RX Vega)
2020-08-16 22:42:06.854914: W tensorflow/stream_executor/platform/default/dso_loader.cc:71] Could not load dynamic library 'DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll'; dlerror: DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll not found

I tried to force-reinstall tensorflow-directml, but it didn't work.

Here is some additional information which could be useful.

Python version: 3.7.8 (it's standalone installation from the store, not conda)
Windows version: 19042.450
GPU: Vega 56
Driver version: 20.7.1
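
Since the first pip install was interrupted by an error, one thing that could be checked is whether the DLL named in the warning is on disk at all. The sketch below searches a directory tree for files that look like a DirectML DLL. It assumes the DLL ships somewhere inside the installed package directory, which this report does not state, and `find_directml_dlls` is a hypothetical helper name.

```python
from pathlib import Path

def find_directml_dlls(root):
    """Return all files under `root` whose name looks like a DirectML DLL."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and path.name.lower().startswith("directml")
        and path.name.lower().endswith(".dll")
    )

# Example usage (the path is a placeholder, not taken from the report):
# for dll in find_directml_dlls(r"C:\path\to\site-packages\tensorflow_core"):
#     print(dll)
```

An empty result would point to an incomplete installation rather than a loader problem.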

Does it support usage of Intel(R) Iris(R) Xe Graphics ?

Below are my system details. Thanks.

Windows 10 Build/Version (Version 2004 / Build 19041)
Python Version (e.g. 3.7.11)
TensorFlow-DirectML Version - 1.15.3.dev200626
Graphics card driver version - Intel(R) Iris(R) Xe Graphics)

I keep getting the error messages below when I try to run the test code.

2021-11-27 06:47:22.188398: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-11-27 06:47:22.188473: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-27 06:47:24.000891: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2021-11-27 06:47:24.000943: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
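
The messages above refer only to CUDA libraries (`cudart64_110.dll`, `nvcuda.dll`), whereas the tensorflow-directml logs elsewhere on this page contain "DirectML device enumeration" lines. That may mean a CUDA-oriented TensorFlow build is being imported instead of tensorflow-directml, though the report does not confirm it. A rough check of a captured log, with a hypothetical helper name, might look like this:

```python
def mentions_cuda_but_not_directml(log_text):
    """True if the log tries to load CUDA libraries and never mentions DirectML."""
    lowered = log_text.lower()
    tries_cuda = "cudart" in lowered or "nvcuda" in lowered
    mentions_dml = "directml" in lowered
    return tries_cuda and not mentions_dml

# Two of the lines from the report above.
log = """\
2021-11-27 06:47:22.188398: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-11-27 06:47:24.000891: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
"""
print(mentions_cuda_but_not_directml(log))
```

If the check returns True, it may be worth confirming which TensorFlow package is installed in the active Python environment.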

Is there a way to use it with R

LSTM 8x slow on gpu

Train on 36090 samples
2022-04-25 21:56:12.505195: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library C:\Users\onurb\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python/directml.24bfac66e4ee42ec393a5fb471412d0177bc7bcf.dll
2022-04-25 21:56:12.506028: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library dxgi.dll
2022-04-25 21:56:12.509302: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library d3d12.dll
2022-04-25 21:56:12.961954: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:250] DirectML device enumeration: found 1 compatible adapters.
2022-04-25 21:56:12.962441: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2022-04-25 21:56:12.966749: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 0 (AMD Radeon(TM) Graphics)
2022-04-25 21:56:13.055907: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library Kernel32.dll
36090/36090 - 232s - loss: 0.0014 - acc: 0.0396
Train on 36090 samples

With only the CPU, the same run takes 30-40 s, so the difference is huge. It also looks like the GPU is not taking any load.
I am using a 4750U APU.

"import tensorflow.compat.v1 as tf" exits with "Illegal instruction"

I'm following the instructions here https://docs.microsoft.com/en-gb/windows/win32/direct3d12/gpu-tensorflow-wsl

(directml) cht@DESKTOP-IHJTV93:~$ python
Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow.compat.v1 as tf
Illegal instruction
(directml) cht@DESKTOP-IHJTV93:~$

This is quite exotic for me, so not sure ahead-of-time what information will help, but here's a start,

OS Name: Microsoft Windows 10 Pro Insider Preview
OS Version: 10.0.20152 N/A Build 20152
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
Linux version 4.19.104
GPU: Nvidia GeForce GTX 1050 Ti

Does this mitigate the requirement of installing CuDNN?

If I use DirectML, do I still need to install the CUDA libraries? What about support for upcoming RDNA2 AMD GPUs?

Tensorflow 2 support (?)

Are you planning to support tensorflow 2 in the near future?
It would be awesome, in particular for unlucky AMD GPU owners (like me..)

No device assignments were active

When I run a complex model on Windows 10, it fails with the following error:
No device assignments were active during op 'Embedding-Token/embedding_lookup/Identity' creation.
Everything else runs correctly.

Caffe Model in AMD GPU

I have the following code that detects objects in real time. Is it possible to run a Caffe model on an AMD RX 480 GPU instead of the CPU? Are there any tutorials or tips on how to do this?

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

Thank you!

OpenAI baselines very slow

When using OpenAI's baselines:
https://github.com/openai/baselines

I get very slow training performance despite GPU usage being close to 100%.
I get around 30 fps when training on the GPU, but over 200 fps when using the CPU.

python -m baselines.run --alg=ppo2 --network=cnn --env=PongNoFrameskip-v4 --num_timesteps=2e7

I see the same problem when I switch to the NVIDIA MX 150 dedicated GPU: I get 50 fps, compared with 500 fps when using the CUDA drivers.

That said when I use the MLP model the FPS is much better (over 300 fps).

System Information

Windows 11 Pro 21H2
Python 3.7
TensorFlow-DirectML 1.15.5 (latest)
Intel 620 Integrated GPU (latest driver)
Intel i7-8550U

Memory error on tensorflow directml

Hi, I'd like to know more about the memory errors that I'm getting while using TensorFlow. I can process 128x128 images just fine, but when working with bigger images I get memory errors.

21:43:27.369699: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at dml_kernel_context.cc:174 : Resource exhausted: OOM when allocating tensor with shape[16,128,98,98] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocato.
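
To put the failing allocation in perspective, the size of the single tensor named in the message can be worked out from its shape. This is only a back-of-the-envelope sketch: `tensor_bytes` is a hypothetical helper (not part of TensorFlow), and it assumes "type float" means 32-bit floats at 4 bytes per element.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (4 bytes per element assumes float32)."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape taken from the OOM message above: [16, 128, 98, 98].
size = tensor_bytes((16, 128, 98, 98))
print(size, "bytes =", size / 2**20, "MiB")
# 78675968 bytes = 75.03125 MiB

# The leading dimension looks like the batch size; halving it halves this tensor.
print(tensor_bytes((8, 128, 98, 98)) / 2**20, "MiB")
# 37.515625 MiB
```

A single tensor of roughly 75 MiB is small on its own, so the error more likely reflects the sum of many such activations. If the first dimension is indeed the batch size, reducing it is the simplest knob to try.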

Adam (and some others) optimizers not working

Hey guys, thanks for your time reading this. Basically, I've created a GANN with tensorflow-directml to run on my AMD GPU. All seems to be fine except when I try to use some specific optimizers. Namely, Adam which is the one I would like to use most, but the error is the same for the rest of the optimizers which don't work.

ADAM optimizer:

2021-06-16 23:30:02.076691: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python/directml.adbd007a01a52364381a1c71ebb6fa1b2389c88d.dll
2021-06-16 23:30:02.499379: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:249] DirectML device enumeration: found 1 compatible adapters.
2021-06-16 23:30:02.499695: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-06-16 23:30:02.887365: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:185] DirectML: creating device on adapter 0 (AMD Radeon RX 6900 XT)
2021-06-16 23:30:03.183315: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library Kernel32.dll
WARNING:tensorflow:From H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\ops\nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Traceback (most recent call last):
File "H:\NFT_AI_generator\main.py", line 20, in
training.train(data, discriminator, generator, EPOCHES, BATCH_SIZE, optimiser='Adam', update=True)
File "H:\NFT_AI_generator\src\GANN\training.py", line 43, in train
discriminator_loss, generator_loss = training_step(generator_optimizer,discriminator_optimizer, generator, discriminator, batch, batch_size=batch_size, k=1)
File "H:\NFT_AI_generator\src\GANN\training.py", line 25, in training_step
discriminator.trainable_variables)) # Takes a list of gradient and variables pairs
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\optimizer_v2.py", line 439, in apply_gradients
kwargs={"name": name})
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 1940, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 1947, in _merge_call
return merge_fn(self._strategy, *args, **kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\optimizer_v2.py", line 483, in _distributed_apply
var, apply_grad_to_update_var, args=(grad,), group=False))
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 1553, in update
return self._update(var, fn, args, kwargs, group)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 2165, in _update
return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 2171, in _update_non_slot
result = fn(*args, **kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\optimizer_v2.py", line 465, in apply_grad_to_update_var
update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\adam.py", line 207, in _resource_apply_dense
use_locking=self._use_locking)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\training\gen_training_ops.py", line 1644, in resource_apply_adam
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'ResourceApplyAdam' OpKernel for 'DML' devices compatible with node {{node ResourceApplyAdam}}
(OpKernel was found, but attributes didn't match) Requested Attributes: T=DT_DOUBLE, use_locking=true, use_nesterov=false
. Registered: device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_BFLOAT16]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_DOUBLE]
device='DML'; T in [DT_FLOAT]
device='DML'; T in [DT_HALF]
[Op:ResourceApplyAdam]

For FTRL:

2021-06-16 23:31:26.282677: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python/directml.adbd007a01a52364381a1c71ebb6fa1b2389c88d.dll
2021-06-16 23:31:26.702632: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:249] DirectML device enumeration: found 1 compatible adapters.
2021-06-16 23:31:26.703000: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-06-16 23:31:27.093813: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:185] DirectML: creating device on adapter 0 (AMD Radeon RX 6900 XT)
2021-06-16 23:31:27.384261: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library Kernel32.dll
WARNING:tensorflow:From H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\ops\nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Traceback (most recent call last):
File "H:\NFT_AI_generator\main.py", line 20, in
training.train(data, discriminator, generator, EPOCHES, BATCH_SIZE, optimiser='Ftrl', update=True)
File "H:\NFT_AI_generator\src\GANN\training.py", line 43, in train
discriminator_loss, generator_loss = training_step(generator_optimizer,discriminator_optimizer, generator, discriminator, batch, batch_size=batch_size, k=1)
File "H:\NFT_AI_generator\src\GANN\training.py", line 25, in training_step
discriminator.trainable_variables)) # Takes a list of gradient and variables pairs
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\optimizer_v2.py", line 439, in apply_gradients
kwargs={"name": name})
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 1940, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 1947, in _merge_call
return merge_fn(self._strategy, *args, **kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\optimizer_v2.py", line 483, in _distributed_apply
var, apply_grad_to_update_var, args=(grad,), group=False))
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 1553, in update
return self._update(var, fn, args, kwargs, group)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 2165, in _update
return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\distribute\distribute_lib.py", line 2171, in _update_non_slot
result = fn(*args, **kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\optimizer_v2.py", line 465, in apply_grad_to_update_var
update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\keras\optimizer_v2\ftrl.py", line 178, in _resource_apply_dense
use_locking=self._use_locking)
File "H:\NFT_AI_generator\venv\lib\site-packages\tensorflow_core\python\training\gen_training_ops.py", line 2079, in resource_apply_ftrl
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'ResourceApplyFtrl' OpKernel for 'DML' devices compatible with node {{node ResourceApplyFtrl}}
. Registered: device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_BFLOAT16]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_DOUBLE]
[Op:ResourceApplyFtrl]
Press any key to continue . . .
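
The two errors above differ in an important way, which becomes easier to see by extracting the "Registered:" entries. The sketch below uses a hypothetical helper, `registered_dtypes`, that parses only the message format shown in these logs; it is not a TensorFlow API.

```python
import re

# One "device='X'; T in [DT_A, DT_B]" entry from the error text.
_KERNEL_RE = re.compile(r"device='(\w+)';\s*T in \[([^\]]*)\]")


def registered_dtypes(message):
    """Map each device name to the set of dtypes listed as registered."""
    kernels = {}
    for device, dtypes in _KERNEL_RE.findall(message):
        names = {d.strip() for d in dtypes.split(",") if d.strip()}
        kernels.setdefault(device, set()).update(names)
    return kernels


# Excerpts copied from the two tracebacks above.
adam_message = """. Registered: device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_BFLOAT16]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_DOUBLE]
device='DML'; T in [DT_FLOAT]
device='DML'; T in [DT_HALF]
"""
ftrl_message = """. Registered: device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_BFLOAT16]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_DOUBLE]
"""

adam = registered_dtypes(adam_message)
ftrl = registered_dtypes(ftrl_message)
print(sorted(adam.get("DML", ())))  # ['DT_FLOAT', 'DT_HALF']
print(sorted(ftrl.get("DML", ())))  # []
```

For Adam, a DML kernel exists but only for DT_FLOAT and DT_HALF, while the request was for T=DT_DOUBLE. Keeping the model and its inputs in float32 may therefore avoid the error, although that is an assumption rather than something confirmed in this thread. For FTRL, no DML kernel is listed at all.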

Host System
--------------------------------------------------------------------------------
Windows 10 Version  : Windows 10 Version 20H2 (OS Build 19042.1052)
Processor           : AMD Ryzen 9 3900X 12-Core Processor, 3793 Mhz, 12 Core(s), 24 Logical Processor(s)
Memory              : 32GB RAM
DirectX Version     : DirectX 12

Python Environment
--------------------------------------------------------------------------------
Python Version      : 3.7.0
TensorFlow-DirectML : 1.15.5

DirectX Device
--------------------------------------------------------------------------------
Description         : AMD RX 6900 XT
Manufacturer        : AMD
Chip Type           : AMD RX 6900 XT
Dedicated Memory    : 16000 MB
Driver Version      : 21.4.1

Let me know if you need any other info :) Thank you in advance for all of your help!

tensorflow/directml is slow compared to coremltools on MacOS

I'm trying to deploy a multi-model tool for mouse behavior classification on Linux, Windows, and Mac. For Linux, I use TensorFlow 1.15 directly, with CUDA drivers to access the GPU(s). For Mac, I translate the models into .mlmodel files using coremltools. For Windows, I'm trying to use tensorflow-directml in order to easily utilize whatever GPU (Nvidia or AMD) is available. I'm finding that, on the same laptop with an AMD GPU (a MacBook Pro), the tf-directml version runs about 3x slower than the mlmodel version in macOS. Here are some stats:

Model       mlmodel      tf-directml   notes
detection   0.033 sec    0.088 sec     based on inception resnet v2
pose        0.067 sec    0.248 sec     8-stack hourglass
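
The "about 3x" figure can be checked per model from the timings in the table. This is a small arithmetic sketch; the dictionary layout and the `slowdown` helper are illustrative only.

```python
# Per-model timings in seconds, copied from the table: (mlmodel, tf-directml).
timings = {
    "detection": (0.033, 0.088),
    "pose": (0.067, 0.248),
}


def slowdown(baseline_sec, other_sec):
    """How many times slower `other_sec` is than `baseline_sec`."""
    return other_sec / baseline_sec


for model, (mlmodel_sec, dml_sec) in timings.items():
    print(f"{model}: {slowdown(mlmodel_sec, dml_sec):.2f}x")
# detection: 2.67x
# pose: 3.70x
```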

I realize I'm running a very early version. Do you expect the performance to improve substantially? Do you have a guess as to when we might see performance improvements?

directml on custom tensorflow build?

Is a custom tensorflow-directml build possible for TensorFlow 2.3.0? If yes, which DirectML files or libraries would be required?

ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.

(directml) PS C:\Workshop\PyDirectML> python
Python 3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow.compat.v1 as tf
Traceback (most recent call last):
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in
from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in
_pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\Manob Biswas.conda\envs\directml\lib\imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "C:\Users\Manob Biswas.conda\envs\directml\lib\imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_init_.py", line 102, in
from tensorflow_core import *
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core_init_.py", line 28, in
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_init_.py", line 50, in getattr
module = self.load()
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_init.py", line 44, in _load
module = importlib.import_module(self.name)
File "C:\Users\Manob Biswas.conda\envs\directml\lib\importlib_init.py", line 126, in import_module
return _bootstrap.gcd_import(name[level:], package, level)
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python_init.py", line 49, in
from tensorflow.python import pywrap_tensorflow
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 74, in
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in
from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in
_pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\Manob Biswas.conda\envs\directml\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\Manob Biswas.conda\envs\directml\lib\imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "C:\Users\Manob Biswas.conda\envs\directml\lib\imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.

Complex number support ?

The current complex-number kernel appears to run on the CPU in eager execution, and fails when run in non-eager mode on GPU devices.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation Complex_1: Could not satisfy explicit device specification '/device:DML:2' because no supported kernel for DML devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:DML:2' assigned_device_name_='' resource_device_name_='' supported_device_types_=[CPU] possible_devices_=[]
Complex: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
Complex_1 (Complex) /device:DML:2

Op: Complex
Node attrs: Tout=DT_COMPLEX64, T=DT_FLOAT
Registered kernels:
device='CPU'; T in [DT_FLOAT]; Tout in [DT_COMPLEX64]
device='CPU'; T in [DT_DOUBLE]; Tout in [DT_COMPLEX128]

Is GPU complex number support expected, as in mainline TensorFlow?

Tensorflow-directml doesn't work with AMD Radeon RX 6700 XT

Host System

Windows 10 Version : Windows 10 Pro 64-bit (10.0, Build 19042) (19041.vb_release.191206-1406)
Processor : AMD Ryzen 5 5600X 6-Core Processor (12 CPUs), ~3.7GHz
Memory : 16384MB RAM
DirectX Version : DirectX 12

Python Environment

Python Version : 3.6.13
TensorFlow-DirectML : 1.15.5.dev210429

DirectX Device

Description : AMD Radeon RX 6700 XT
Manufacturer : Advanced Micro Devices, Inc.
Chip Type : AMD Radeon Graphics Processor (0x73DF)
Dedicated Memory : 12243 MB
Driver Version : 27.20.21003.8013
Driver Model : WDDM 2.7
Driver Date : 11.05.2021 02:00:00
Feature Levels : 12_1,12_0,11_1,11_0,10_1,10_0,9_3,9_2,9_1

Repro Details

I am trying to run inference using TensorFlow SSD MobileNetV2, but it doesn't work:
python detect_realtime_nano.py --trt-graph ssd_mobilenet_v2_coco_2018_03_29\frozen_inference_graph.pb --labels mscoco_label_map.pbtxt

This is the terminal output:

[INFO] loading TRT graph...
WARNING:tensorflow:From C:\Users\bukar\tensorflow1\project2\detect_realtime_nano.py:26: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From C:\Users\bukar\tensorflow1\project2\detect_realtime_nano.py:28: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

[INFO] initializing TensorFlow session...
WARNING:tensorflow:From C:\Users\bukar\tensorflow1\project2\detect_realtime_nano.py:97: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From C:\Users\bukar\tensorflow1\project2\detect_realtime_nano.py:99: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

[INFO] starting video stream...
[INFO] warming up the nano...
2021-05-24 10:10:51.758549: F tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:103] Check failed: ptr != nullptr Invalid pointer
2021-05-24 10:10:51.758616: F tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:103] Check failed: ptr != nullptr Invalid pointer

The same script worked fine using the CPU version of TensorFlow. Attached: the Python script as a txt file.
detect_realtime_nano_py.txt

I just downloaded the SSD model and tried running inference using a USB webcam.
What could be the problem?

Is multi-gpu training available?

So, the support for TensorFlow 1.x seems to be almost complete. Is the multi_gpu function from utils working? This is something that I'm looking forward to; any information on this would be extremely important.

Would something like this work?

model = multi_gpu_model(model, gpus=2)

New update doesn't show GPU Usage

I was running the new update (tensorflow-directml 1.15.4.dev201216) and noticed that I can't see how much of the GPU is being used. Performance also got a little worse, about 8% slower. It seems that Windows doesn't recognize this version on the GPU Engine.

tensorflow-directml 1.15.4.dev201216 GPU Usage

tensorflow-directml 1.15.3.dev200911 GPU Usage

Happy Christmas and New Year everyone!!

Training ResNet fails on Intel HD 620: LLVM ERROR: SPIRV internal error: Invalid magic number

The error appears as soon as the training starts.

I am running this sample: https://github.com/losttech/Gradient-Samples/tree/master/ResNetBlock
I used TensorFlow-DirectML by replacing this line with GradientEngine.UseCondaEnvironment("tf-dx-1.x");

The same code also with TensorFlow-DirectML appears to be working fine on another machine with NVidia GPU.

"Device Removed Error" When Radeon Software in "Compute" mode

System Information

Windows 10 Build/Version: 20H2 (OS Build 19042.906)
native windows (not in wsl)
Python Version: Python 3.6.13 :: Anaconda, Inc.
TensorFlow-DirectML Version: 1.15.4.dev201216
Graphics card driver version: Radeon Adrenalin 21.3.2
Radeon RX 580 8gb

Repro Details

Put the radeon software in Compute mode under graphics -> global -> advanced (drop down) -> workload
Follow instructions on microsoft docs until:

import tensorflow.compat.v1 as tf
tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
print(tf.add([1.0, 2.0], [3.0, 4.0]))

when the driver is in Compute mode, this happens consistently:

   ...: print(tf.add([1.0, 2.0], [3.0, 4.0]))
2021-04-05 00:09:26.099098: I tensorflow/stream_executor/platform/default/dso_loader.cc:98] Successfully opened dynamic library D:\Python\Anaconda3\envs\directml\lib\site-packages\tensorflow_core\python/directml.bdb07c797e1af1b4a42d21c67ce5494d73991459.dll
2021-04-05 00:09:26.148238: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:126] DirectML device enumeration: found 1 compatible adapters.
2021-04-05 00:09:26.148707: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-04-05 00:09:26.150293: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:109] DirectML: creating device on adapter 0 (Radeon RX 580 Series)
Executing op Add in device /job:localhost/replica:0/task:0/device:DML:0
2021-04-05 00:09:26.343999: E tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:53] The DirectML device has encountered an unrecoverable error (DXGI_ERROR_DEVICE_REMOVED). This is most often caused by a timeout occurring on the GPU. Please visit https://aka.ms/tfdmltimeout for more information and troubleshooting steps.
2021-04-05 00:09:26.344165: F tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:53] HRESULT failed with 0x887a0005: hr

and then python (ipython at least) crashes and sends me back to pwsh.
Describe the expected behavior
For it to work, e.g.

2021-04-05 00:10:11.517762: I tensorflow/stream_executor/platform/default/dso_loader.cc:98] Successfully opened dynamic library D:\Python\Anaconda3\envs\directml\lib\site-packages\tensorflow_core\python/directml.bdb07c797e1af1b4a42d21c67ce5494d73991459.dll
2021-04-05 00:10:11.564196: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:126] DirectML device enumeration: found 1 compatible adapters.
2021-04-05 00:10:11.564781: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-04-05 00:10:11.566155: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:109] DirectML: creating device on adapter 0 (Radeon RX 580 Series)
Executing op Add in device /job:localhost/replica:0/task:0/device:DML:0
tf.Tensor([4. 6.], shape=(2,), dtype=float32)

To be clear, it does work in "Graphics mode", but when it is in "Compute mode" it doesn't. I've somewhat underclocked (1106 MHz) and undervolted (~980 mV) the GPU, while overclocking the VRAM (2250 MHz) (for other reasons). That might have something to do with it, but I didn't change any of that other than flipping it back to Graphics mode to get it to work.
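
The HRESULT in the fatal log line, 0x887a0005, can be split into its fields to confirm that it matches the DXGI_ERROR_DEVICE_REMOVED name in the preceding message. The sketch below follows the standard HRESULT layout (severity in the top bit, facility in the bits above the low 16, code in the low 16 bits); `decode_hresult` is a hypothetical helper, not a Windows or TensorFlow API.

```python
def decode_hresult(hr):
    """Split a 32-bit HRESULT into failure flag, facility, and code."""
    hr &= 0xFFFFFFFF  # normalize in case a negative signed value is passed
    return {
        "failed": bool(hr >> 31),          # severity bit: 1 means failure
        "facility": (hr >> 16) & 0x1FFF,   # same mask as the HRESULT_FACILITY macro
        "code": hr & 0xFFFF,               # facility-specific error code
    }


info = decode_hresult(0x887A0005)
print(info["failed"], hex(info["facility"]), info["code"])
# True 0x87a 5
```

Facility 0x87a is the DXGI facility, so the value is a DXGI failure with code 5, which is consistent with the device-removed error reported one line earlier in the log.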

It seems like a minor bug, but it was literally my first experience with this, and I'm sure others might run into it too, and you know, first impressions and all. Other than that, thank you very much for this; I was getting bummed when all the frameworks were saying I needed Nvidia.

Use Tensorflow-directml in RStudio

Hello, thank you a lot for this incredible package, can I use this version of tensorflow in RStudio ?

Best of best regards

R470R

Tensorflow-directml is not making any difference in processing times in GPU vs CPU

System Information

Windows 10 - Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz 2.81 GHz

Python Version = 3.6.
TensorFlow-DirectML Version 21.2.2
Graphics card driver version - Intel(R) HD Graphics 520

Repro Details

Execute the following code in an environment with directml to run on gpu and an environment without directml to run on cpu

Describe the expected behavior
I have been trying to execute the following code using DirectML and compare the training times on the CPU and the GPU, but I am not seeing any difference in training times. Can someone help me troubleshoot the issue?

Code to reproduce the issue
import tensorflow.compat.v1 as tf
from tensorflow.keras import layers
import numpy as np

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

model = tf.keras.models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model.summary()

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(np.expand_dims(x_train, 3), y_train, epochs=2, batch_size=1024)
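
To compare the two environments on equal terms, it may help to time the training call explicitly rather than rely on the progress bar. The following is a minimal sketch using only the standard library; `stopwatch` and the `results` dictionary are hypothetical names, and the summation is a stand-in workload so the snippet runs without TensorFlow.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with stopwatch("demo", results):
    # Stand-in workload; in the script above this would be the model.fit(...) call.
    total = sum(i * i for i in range(100_000))

print({label: f"{seconds:.4f}s" for label, seconds in results.items()})
```

Running the same wrapped call once in the DirectML environment and once in the CPU-only environment gives two numbers that can be compared directly. Timing a second epoch separately would help exclude one-time setup cost from the comparison.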

Other info / logs
GPU Usage-

CPU Usage-

why /DML:0 instead of /GPU:0 ?

Can you explain the logic behind this decision?

If I run any TensorFlow project from GitHub, I have to rename every /GPU to /DML, which is not convenient.
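
Until device names are unified, one possible workaround is to translate device strings in a single place instead of editing each project by hand. The helper below is a hypothetical sketch, not part of TensorFlow or tensorflow-directml; it only rewrites the device-type token and assumes the device index carries over unchanged.

```python
import re

# Match "GPU" only as a whole device-type token followed by ":<index>".
_GPU_TOKEN = re.compile(r"(?<![A-Za-z])GPU(?=:\d+)", re.IGNORECASE)


def to_dml_device(device):
    """Rewrite a GPU device string such as '/GPU:0' to its DML form."""
    return _GPU_TOKEN.sub("DML", device)


print(to_dml_device("/GPU:0"))
# /DML:0
print(to_dml_device("/job:localhost/replica:0/task:0/device:GPU:1"))
# /job:localhost/replica:0/task:0/device:DML:1
print(to_dml_device("/device:CPU:0"))
# /device:CPU:0
```

A project could then pass every hard-coded device string through `to_dml_device` before handing it to `tf.device(...)`.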

'protobuf.bzl no such package' error while compiling directML from source

Hello, I'm trying to compile tf-directml from source but I get weird errors about 'protobuf.bzl'.

I've followed your suggested build-guide and tensorflow's one. I've run the configure.py and build command using msys2 on Windows 10.

Below is the full log of the run, any hints?

 python build.py
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Options provided by the client:
  'build' options: --python_path=C:/Users/masc9/AppData/Local/Programs/Python/Python38/python.exe
INFO: Reading rc options for 'build' from d:\projects\tensorflow-directml\.bazelrc:
  'build' options: --apple_platform_type=macos --define framework_shared_object=true --define open_source_build=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --copt=-w --config=v1
INFO: Reading rc options for 'build' from d:\projects\tensorflow-directml\.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=C:/Users/masc9/AppData/Local/Programs/Python/Python38/python.exe --action_env PYTHON_LIB_PATH=C:/Users/masc9/AppData/Local/Programs/Python/Python38/lib/site-packages --python_path=C:/Users/masc9/AppData/Local/Programs/Python/Python38/python.exe --config=xla --config monolithic --copt=-w --host_copt=-w --copt=-DWIN32_LEAN_AND_MEAN --host_copt=-DWIN32_LEAN_AND_MEAN --copt=-DNOGDI --host_copt=-DNOGDI --verbose_failures --distinct_host_configuration=false --define=override_eigen_strong_inline=true --action_env TF_CONFIGURE_IOS=0
INFO: Found applicable config definition build:v1 in file d:\projects\tensorflow-directml\.bazelrc: --define=tf_api_version=1 --action_env=TF2_BEHAVIOR=0
INFO: Found applicable config definition build:xla in file d:\projects\tensorflow-directml\.bazelrc: --action_env=TF_ENABLE_XLA=1 --define=with_xla_support=true
INFO: Found applicable config definition build:xla in file d:\projects\tensorflow-directml\.tf_configure.bazelrc: --define with_xla_support=true
INFO: Found applicable config definition build:monolithic in file d:\projects\tensorflow-directml\.bazelrc: --define framework_shared_object=false
INFO: Found applicable config definition build:opt in file d:\projects\tensorflow-directml\.tf_configure.bazelrc: --copt=/arch:AVX --define with_default_optimizations=true
INFO: Found applicable config definition build:dml in file d:\projects\tensorflow-directml\.bazelrc: --define=using_dml=true --copt -DTENSORFLOW_USE_DIRECTML
Loading:
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
DEBUG: Rule 'io_bazel_rules_docker' indicated that a canonical reproducible form can be obtained by modifying arguments shallow_since = "1556410077 -0400"
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Analyzing: target //tensorflow/tools/pip_package:build_pip_package (1 packages loaded, 0 targets configured)
ERROR: D:/projects/tensorflow-directml/tensorflow/tools/pip_package/BUILD:233:1: error loading package 'tensorflow': in D:/projects/tensorflow-directml/tensorflow/core/platform/default/build_config.bzl: Encountered error while reading extension file 'protobuf.bzl': no such package '@com_google_protobuf//': Traceback (most recent call last):
        File "D:/projects/tensorflow-directml/third_party/repo.bzl", line 104
                _apply_patch(ctx, ctx.attr.patch_file)
        File "D:/projects/tensorflow-directml/third_party/repo.bzl", line 71, in _apply_patch
                _execute_and_check_ret_code(ctx, cmd)
        File "D:/projects/tensorflow-directml/third_party/repo.bzl", line 52, in _execute_and_check_ret_code
                fail("Non-zero return code({1}) when ...))
Non-zero return code(127) when executing 'C:\msys64\usr\bin\bash.exe -l -c "patch" "-p1" "-d" "D:/projects/dml_build/3t5smh3m/external/com_google_protobuf" "-i" "D:/projects/tensorflow-directml/third_party/protobuf/protobuf.patch"':
Stdout:
Stderr: /usr/bin/bash: line 1: patch: command not found
 and referenced by '//tensorflow/tools/pip_package:build_pip_package'
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: error loading package 'tensorflow': in D:/projects/tensorflow-directml/tensorflow/core/platform/default/build_config.bzl: Encountered error while reading extension file 'protobuf.bzl': no such package '@com_google_protobuf//': Traceback (most recent call last):
        File "D:/projects/tensorflow-directml/third_party/repo.bzl", line 104
                _apply_patch(ctx, ctx.attr.patch_file)
        File "D:/projects/tensorflow-directml/third_party/repo.bzl", line 71, in _apply_patch
                _execute_and_check_ret_code(ctx, cmd)
        File "D:/projects/tensorflow-directml/third_party/repo.bzl", line 52, in _execute_and_check_ret_code
                fail("Non-zero return code({1}) when ...))
Non-zero return code(127) when executing 'C:\msys64\usr\bin\bash.exe -l -c "patch" "-p1" "-d" "D:/projects/dml_build/3t5smh3m/external/com_google_protobuf" "-i" "D:/projects/tensorflow-directml/third_party/protobuf/protobuf.patch"':
Stdout:
Stderr: /usr/bin/bash: line 1: patch: command not found
INFO: Elapsed time: 15.738s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (4 packages loaded, 0 targets configured)
FAILED: Build did NOT complete successfully (4 packages loaded, 0 targets configured)
Traceback (most recent call last):
  File "build.py", line 371, in 
    main()
  File "build.py", line 358, in main
    build(args)
  File "build.py", line 221, in build
    subprocess.run(" ".join(cl), shell=True, check=True)
  File "C:\Users\masc9\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'bazel --output_user_root=D:\Projects\tensorflow-directml\..\dml_build build --config=opt --config=dml --strip never --copt /wd4716 --copt /Z7 --copt /FS --linkopt /DEBUG:FASTLINK //tensorflow/tools/pip_package:build_pip_package' returned non-zero exit status 1.
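
The `Stderr` line in the log above suggests that `patch` is not installed in the MSYS2 environment that Bazel invokes; exit code 127 is the shell's conventional "command not found" status. The TensorFlow Windows build instructions list `pacman -S git patch unzip` as an MSYS2 prerequisite. As a rough sketch only, a preflight check like the one below could be run before the build. The names `REQUIRED_TOOLS` and `find_missing_tools` are hypothetical and not part of build.py, and the check looks at the PATH of the calling Python process, which may differ from the PATH seen by the MSYS2 login shell.

```python
import shutil

# Illustrative list of tools, based only on the failing "patch" call in the log.
REQUIRED_TOOLS = ("bash", "patch")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All required build tools were found.")
```

The `which` parameter is injectable so the check can be exercised without touching the real PATH.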

dml_command_recorder.cc:366 Check failed: (((HRESULT)((dml_device_->GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)

System Information

Windows 10 Version 2004 / Build 19041.508
WSL2 cat /etc/debian_version > 10.0
Python 3.7.7
TensorFlow-DirectML Version tensorflow_directml-1.15.3.dev200911-cp37-cp37m-win_amd64.whl
GPU AMD Radeon VII - rn-rad-win-20-20-01-05

Host System
--------------------------------------------------------------------------------
Windows 10 Version  : Windows 10 Home 64-bit (10.0, Build 19041) (19041.vb_release.191206-1406)
Processor           : AMD Ryzen 7 3700X 8-Core Processor              (16 CPUs), ~3.6GHz
Memory              : 65536MB RAM
DirectX Version     : DirectX 12

Python Environment
--------------------------------------------------------------------------------
Python Version      : 3.7.7
TensorFlow-DirectML : 1.15.3.dev200911

DirectX Device
--------------------------------------------------------------------------------
Description         : AMD Radeon VII
Manufacturer        : Advanced Micro Devices, Inc.
Chip Type           : AMD Radeon Graphics Processor (0x66AF)
Dedicated Memory    : 33270 MB
Driver Version      : 27.20.2001.5003
Driver Model        : WDDM 2.7
Driver Date         : 29/06/2020 03:00:00
Feature Levels      : 12_1,12_0,11_1,11_0,10_1,10_0,9_3,9_2,9_1

DirectX Device
--------------------------------------------------------------------------------
Description         : AMD Radeon VII
Manufacturer        : Advanced Micro Devices, Inc.
Chip Type           : AMD Radeon Graphics Processor (0x66AF)
Dedicated Memory    : 33270 MB
Driver Version      : 27.20.2001.5003
Driver Model        : WDDM 2.7
Driver Date         : 29/06/2020 03:00:00
Feature Levels      : 12_1,12_0,11_1,11_0,10_1,10_0,9_3,9_2,9_1

Repro Details

In [1]: from ai_benchmark import AIBenchmark
In [2]: benchmark = AIBenchmark(use_CPU=None, verbose_level=1)
>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..
In [3]: results = benchmark.run()

Describe the current behavior

14/19. ICNet

14.1 - inference | batch=5, size=1024x1536: 330 ± 6 ms
2020-09-15 10:35:21.207385: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:366] Check failed: (((HRESULT)((dml_device_->GetDeviceRemovedReason()))) >= 0) == true (0 vs. 1)
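
The failed check above tests whether the HRESULT returned by `GetDeviceRemovedReason()` is non-negative, which is the same test the Windows `SUCCEEDED()` macro performs. The log does not show which HRESULT was returned, so the sketch below only illustrates how such a value is interpreted. The function names are hypothetical, and the table holds a few well-known DXGI codes rather than a complete list.

```python
# A few DXGI device-removal codes from the Windows SDK headers (not exhaustive).
KNOWN_HRESULTS = {
    0x00000000: "S_OK",
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
    0x887A0020: "DXGI_ERROR_DRIVER_INTERNAL_ERROR",
}


def to_signed_hresult(value):
    """Interpret a 32-bit value as a signed HRESULT."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def succeeded(value):
    """Mirror SUCCEEDED(): a non-negative HRESULT means success."""
    return to_signed_hresult(value) >= 0


def describe(value):
    """Format an HRESULT with its name (if known) and success flag."""
    unsigned = value & 0xFFFFFFFF
    name = KNOWN_HRESULTS.get(unsigned, "unknown")
    return "0x%08X (%s), succeeded=%s" % (unsigned, name, succeeded(value))


print(describe(0x887A0006))
```

Any of the failure codes in the table would make the `>= 0` comparison false, which matches the "(0 vs. 1)" in the log.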

Describe the expected behavior
The benchmark should complete successfully.

Code to reproduce the issue
See Repro Details above.

Other info / logs

conda list

(directml) PS F:\DSML\GPT2\gpt-2> conda list
# packages in environment at F:\DSML\Soft\Anaconda\envs\directml:
#
# Name                    Version                   Build  Channel
_tflow_select             2.2.0                     eigen
absl-py                   0.9.0                    py37_0
ai-benchmark              0.1.2                    pypi_0    pypi
astor                     0.8.0                    py37_0
attrs                     19.3.0                     py_0
backcall                  0.1.0                    py37_0
blas                      1.0                         mkl
bleach                    3.1.4                      py_0
blinker                   1.4                      py37_0
brotlipy                  0.7.0           py37he774522_1000
ca-certificates           2020.6.20            hecda079_0    conda-forge
cachetools                4.1.0                      py_1
certifi                   2020.6.20        py37hc8dfbb8_0    conda-forge
cffi                      1.14.0           py37h7a1dbc1_0
chardet                   3.0.4                 py37_1003
click                     7.1.2                      py_0
colorama                  0.4.3                      py_0
cryptography              2.9.2            py37h7a1dbc1_0
cycler                    0.10.0                   pypi_0    pypi
decorator                 4.4.2                      py_0
defusedxml                0.6.0                      py_0
entrypoints               0.3                      py37_0
fire                      0.3.1              pyh9f0ad1d_0    conda-forge
gast                      0.2.2                    py37_0
google-auth               1.17.2                     py_0
google-auth-oauthlib      0.4.1                      py_2
google-pasta              0.2.0                      py_0
grpcio                    1.27.2           py37h351948d_0
h5py                      2.10.0           py37h5e291fa_0
hdf5                      1.10.4               h7ebc959_0
icc_rt                    2019.0.0             h0cc432a_1
idna                      2.10                       py_0
importlib_metadata        1.5.0                    py37_0
intel-openmp              2020.1                      216
ipykernel                 5.1.4            py37h39e3cac_0
ipython                   7.13.0           py37h5ca1d4c_0
ipython_genutils          0.2.0                    py37_0
jedi                      0.17.0                   py37_0
jinja2                    2.11.2                     py_0
jsonschema                3.2.0                    py37_0
jupyter_client            6.1.3                      py_0
jupyter_contrib_core      0.3.3                      py_2    conda-forge
jupyter_contrib_nbextensions 0.5.1                    py37_0    conda-forge
jupyter_core              4.6.3                    py37_0
jupyter_highlight_selected_word 0.2.0                 py37_1000    conda-forge
jupyter_latex_envs        1.4.4                 py37_1000    conda-forge
jupyter_nbextensions_configurator 0.4.1            py37hc8dfbb8_1    conda-forge
jupyterthemes             0.20.0                   pypi_0    pypi
keras-applications        1.0.8                      py_1
keras-preprocessing       1.1.0                      py_1
kiwisolver                1.2.0                    pypi_0    pypi
lesscpy                   0.14.0                   pypi_0    pypi
libiconv                  1.15              hfa6e2cd_1006    conda-forge
libprotobuf               3.12.3               h7bd577a_0
libsodium                 1.0.16               h9d3ae62_0
libxml2                   2.9.10               h5d81f1c_2    conda-forge
libxslt                   1.1.33               h579f668_1    conda-forge
lxml                      4.5.2            py37h8ba8a40_0    conda-forge
m2w64-gcc-libgfortran     5.3.0                         6
m2w64-gcc-libs            5.3.0                         7
m2w64-gcc-libs-core       5.3.0                         7
m2w64-gmp                 6.1.0                         2
m2w64-libwinpthread-git   5.0.0.4634.697f757               2
markdown                  3.1.1                    py37_0
markupsafe                1.1.1            py37he774522_0
matplotlib                3.3.0                    pypi_0    pypi
mistune                   0.8.4            py37he774522_0
mkl                       2020.1                      216
mkl-service               2.3.0            py37hb782905_0
mkl_fft                   1.1.0            py37h45dec08_0
mkl_random                1.1.1            py37h47e9c7a_0
msys2-conda-epoch         20160418                      1
nbconvert                 5.6.1                    py37_0
nbformat                  5.0.6                      py_0
notebook                  6.0.3                    py37_0
numpy                     1.18.5           py37h6530119_0
numpy-base                1.18.5           py37hc3f5095_0
oauthlib                  3.1.0                      py_0
openssl                   1.1.1g               he774522_1    conda-forge
opt_einsum                3.1.0                      py_0
pandoc                    2.2.3.2                       0
pandocfilters             1.4.2                    py37_1
parso                     0.7.0                      py_0
pickleshare               0.7.5                    py37_0
pillow                    7.2.0                    pypi_0    pypi
pip                       20.0.2                   py37_3
ply                       3.11                     pypi_0    pypi
powershell_shortcut       0.0.1                         3
prometheus_client         0.7.1                      py_0
prompt-toolkit            3.0.4                      py_0
prompt_toolkit            3.0.4                         0
protobuf                  3.12.3           py37h33f27b4_0
psutil                    5.7.2                    pypi_0    pypi
py-cpuinfo                7.0.0                    pypi_0    pypi
pyasn1                    0.4.8                      py_0
pyasn1-modules            0.2.7                      py_0
pycparser                 2.20                       py_2
pygments                  2.6.1                      py_0
pyjwt                     1.7.1                    py37_0
pyopenssl                 19.1.0                     py_1
pyparsing                 2.4.7                    pypi_0    pypi
pyreadline                2.1                      py37_1
pyrsistent                0.16.0           py37he774522_0
pysocks                   1.7.1                    py37_1
python                    3.7.7                h81c818b_4
python-dateutil           2.8.1                      py_0
python_abi                3.7                     1_cp37m    conda-forge
pywin32                   227              py37he774522_1
pywinpty                  0.5.7                    py37_0
pyyaml                    5.3.1            py37h8055547_0    conda-forge
pyzmq                     18.1.1           py37ha925a31_0
regex                     2020.7.14        py37h4ab8f01_0    conda-forge
requests                  2.24.0                     py_0
requests-oauthlib         1.3.0                      py_0
rsa                       4.0                        py_0
scipy                     1.5.0            py37h9439919_0
send2trash                1.5.0                    py37_0
setuptools                46.4.0                   py37_0
six                       1.14.0                   py37_0
sqlite                    3.31.1               h2a8f88b_1
tensorboard               1.15.0                   pypi_0    pypi
tensorboard-plugin-wit    1.6.0                      py_0
tensorflow                1.15.0          eigen_py37h9f89a44_0
tensorflow-base           1.15.0          eigen_py37h07d2309_0
tensorflow-directml       1.15.3.dev200911          pypi_0    pypi
tensorflow-estimator      1.15.1             pyh2649769_0
termcolor                 1.1.0                    py37_1
terminado                 0.8.3                    py37_0
testpath                  0.4.4                      py_0
tornado                   6.0.4            py37he774522_1
tqdm                      4.48.2             pyh9f0ad1d_0    conda-forge
traitlets                 4.3.3                    py37_0
urllib3                   1.25.9                     py_0
vc                        14.1                 h0510ff6_4
vs2015_runtime            14.16.27012          hf0eaf9b_2
wcwidth                   0.1.9                      py_0
webencodings              0.5.1                    py37_1
werkzeug                  0.16.1                     py_0
wheel                     0.34.2                   py37_0
win_inet_pton             1.1.0                    py37_0
wincertstore              0.2                      py37_0
winpty                    0.4.3                         4
wrapt                     1.12.1           py37he774522_1
yaml                      0.2.5                he774522_0    conda-forge
zeromq                    4.3.1                h33f27b4_3
zipp                      3.1.0                      py_0
zlib                      1.2.11               h62dcd97_4

Huge performance difference between onnxruntime and tensorflow-directml runtime

System Information

Knowing your system configuration can help us diagnose issues more easily, so please provide as much information as possible. At a minimum, it is useful to know the following:

Windows 10 Build/Version : Version 1909 / Build 18363.1139
WSL distribution and its WSL version if testing Linux package (e.g. Ubuntu 20.04 WSL2)
Python Version 3.7.4
TensorFlow-DirectML Version: 1.15.3.dev200911
Graphics card driver version (Intel Iris 580 Pro -27.20.100.8682)

print_system_info.py was not found!

Host System
--------------------------------------------------------------------------------
Windows 10 Version  : Windows 10 Enterprise 64-bit (10.0, Build 18363.1139)
Processor           : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (1 CPUs)
Memory              : 16GB RAM
DirectX Version     : DirectX 12

Repro Details

Describe the current behavior
When a converted TensorFlow model (.pb) is used to run inference on the latest version of tensorflow-directml, the inference time is many times worse than the ONNX Runtime execution time on the CPU: it takes about 1.2 seconds on the GPU, while ONNX Runtime takes only around 30 ms in CPU mode. Changing the TensorFlow code to run on the CPU doesn't change anything; I still get 1.2 to 1.3 seconds.

Describe the expected behavior
It should run much faster; at the very least it should match the ONNX Runtime CPU numbers.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
This is the model: simpnet_onnx_10312020.zip, and here is the code to reproduce the problem:

import tensorflow as tf
import numpy as np

from timeit import default_timer as timer
from tensorflow.python.client import device_lib 
import logging, os

logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
 
INPUT_TENSOR_NAME = 'input:0'
OUTPUT_TENSOR_NAME = 'add_13:0'
PB_PATH=r"D:\simpnet_onnx_10312020.pb"
 
print(device_lib.list_local_devices()) 
with tf.device("/device:DML:0"):
    img = np.random.randn(1,3,112,112)

    with tf.gfile.FastGFile(PB_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
    
    input_tensor = graph.get_tensor_by_name(INPUT_TENSOR_NAME)
    output_tensor = graph.get_tensor_by_name(OUTPUT_TENSOR_NAME)

    # Warm-up run; the result is discarded.
    with tf.Session(graph=graph) as sess:
        output_vals = sess.run(output_tensor, feed_dict={input_tensor: img})
    
    # Timed run; note that it uses a second, newly created session.
    with tf.Session(graph=graph) as sess:
        start = timer()
        output_vals = sess.run(output_tensor, feed_dict={input_tensor: img})
        end = timer()

    elapsed = end-start
    prediction=int(np.argmax(np.array(output_vals).squeeze(), axis=0))
    print(f'took {elapsed:.4f} sec or {elapsed*1000:.2f} ms')

This results in:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10759035991921129072
, name: "/device:DML:0"
device_type: "DML"
memory_limit: 7678597939
locality {
}
incarnation: 11366947237963948902
physical_device_desc: "Intel(R) Iris(R) Pro Graphics 580"
]
took 1.2200 sec or 1220.03 ms

The ONNX model was converted like this:

import onnx
from onnx_tf.backend import prepare
onnx_model = onnx.load(r"D:\simpnet_onnx_10312020.onnx")
tf_rep = prepare(onnx_model)
tf_rep.export_graph(r"D:\simpnet_onnx_10312020.pb")

The original ONNX model can be found here: simpnet_onnx_10312020.zip
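
One thing that may be worth ruling out in the measurement above: the repro times a single `sess.run` call in a newly created session, so the 1.2 s figure could include one-time setup cost rather than steady-state inference time. This is an assumption, not a diagnosis. A minimal timing sketch that warms up first and then reports the median over several calls is shown below; `time_calls` and `summarize` are hypothetical helper names.

```python
import statistics
from timeit import default_timer as timer


def time_calls(fn, warmup=3, repeats=20):
    """Call `fn` `warmup` times untimed, then return per-call timings in seconds."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = timer()
        fn()
        timings.append(timer() - start)
    return timings


def summarize(timings):
    """Return (median, minimum) of the timings, converted to milliseconds."""
    return statistics.median(timings) * 1000.0, min(timings) * 1000.0


# Hypothetical use with the repro script, inside one tf.Session that is
# kept open for both the warm-up and the timed calls:
#     run = lambda: sess.run(output_tensor, feed_dict={input_tensor: img})
#     print("median %.2f ms, best %.2f ms" % summarize(time_calls(run)))
```

Reporting the median rather than a single sample also makes the comparison with the ONNX Runtime number less sensitive to outliers.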

How to get device names?

tf.config.experimental.get_visible_devices()
returns TF devices with names like /DML:0.
How can I get the real device name, such as 'GeForce ...'?
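
The `device_lib.list_local_devices()` listing in the performance report above includes a `physical_device_desc` field holding the adapter name ("Intel(R) Iris(R) Pro Graphics 580" in that case), so one possible approach is to read that field. The sketch below assumes each device object exposes `name` and `physical_device_desc` as in that listing; the helper names are hypothetical, and the sample objects are stand-ins so the snippet runs without TensorFlow installed.

```python
from types import SimpleNamespace


def device_descriptions(devices):
    """Map each device name to its physical description, or None if empty."""
    result = {}
    for dev in devices:
        result[dev.name] = getattr(dev, "physical_device_desc", "") or None
    return result


def local_device_descriptions():
    """Query TensorFlow for its local devices (requires TensorFlow installed)."""
    from tensorflow.python.client import device_lib
    return device_descriptions(device_lib.list_local_devices())


# Stand-in objects shaped like the listing in the report, for illustration only.
sample = [
    SimpleNamespace(name="/device:CPU:0", physical_device_desc=""),
    SimpleNamespace(name="/device:DML:0",
                    physical_device_desc="Intel(R) Iris(R) Pro Graphics 580"),
]
print(device_descriptions(sample))
```

With TensorFlow-DirectML installed, calling `local_device_descriptions()` would query the real devices instead of the stand-ins.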

Multi gpu DML?

I'm considering getting another GPU. First I need to know: will tensorflow-directml use both GPUs when training one model?

Force Full Usage of Dedicated VRAM instead of Shared Memory (RAM)

System Information

Windows 10 Build 19041
Not running on WSL
Python 3.7.9
TensorFlow-DirectML Version 1.15.3.dev200911
Graphics card AMD Rx580 8GB, Driver version 20.8.3

Describe the current behavior
During training, TensorFlow-DirectML seems to use my shared GPU memory, which is basically system RAM, rather than my VRAM. This causes a severe performance penalty.

Describe the expected behavior
It would make sense for the program to use all of the VRAM first, and only then fall back to RAM if necessary. The problem happens from time to time, and can leave up to half of my VRAM unused.
Apparently DirectX "sees" the VRAM and shared RAM as one 16 GB pool of VRAM, which is undesirable.

Code to reproduce the issue
I am training a stock Keras model with previously saved weights loaded. The problem does not seem to happen if I do not load the saved weights. A snippet of my code:

import os

from keras.applications.densenet import DenseNet201
# The two imports below are assumed; the original snippet did not show them.
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam

# no_class, inputShape, INIT_LR, EPOCHS, train_gen and val_gen are defined elsewhere in the script.
model = DenseNet201(include_top=True, weights=None, classes=no_class, input_shape=inputShape)
weigh_load = "model_ckpt_DenseNet201\\weights.10-0.812-0.618.hdf5"
model.load_weights(weigh_load)
opt = Adam(lr=INIT_LR, decay=1e-6)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
path = 'model_ckpt_DenseNet201'
ckpt_cb = ModelCheckpoint(os.path.join(path, 'weights.{epoch:02d}-{acc:.3f}-{val_acc:.3f}.hdf5'),
                          save_weights_only=True)
history = model.fit_generator(
    generator=train_gen,
    steps_per_epoch=len(train_gen),
    validation_data=val_gen,
    validation_steps=len(val_gen),
    epochs=EPOCHS,
    shuffle=False,
    verbose=1,
    callbacks=[ckpt_cb],
    use_multiprocessing=False,
    workers=8,
)

AMD APU support?

Hi,

Will there be support for AMD APUs? I would like to test performance on a Steam Deck, but no GPU is detected. Thanks

Low performance on RX 580 with TF benchmarks

I get low performance on TF benchmarks with my RX 580:
https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

using their example command:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

I get this error and performance result:
2020-06-19 16:01:17.369204: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost'

Step Img/sec total_loss
1 images/sec: 4.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.696
30 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.753
40 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 8.007
50 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 7.520
60 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.989
70 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 8.028
80 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.932
90 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.850
100 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.798

total images/sec: 4.90

Note:

GPU and VRAM usage are at 100%, so it's not using the CPU
I get around 88 images/sec on the latest version of ROCm (Ubuntu 20.04) with this computer

Info:

RX 580 8GB driver 26.20.12028.2
Dual Intel Xeon 2680 v2
64 GB ram
Windows 10 2004
OSbuild 19041.329
python 3.7
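
To put the two figures in this report side by side, the per-step throughput can be pulled out of the benchmark log and compared with the ROCm number. The sketch below is only an illustration: the regular expression assumes step lines shaped like the ones quoted above, the function names are hypothetical, and the embedded sample holds three of the quoted lines plus the summary line.

```python
import re

# Matches step lines such as "10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593".
STEP_LINE = re.compile(r"^\s*(\d+)\s+images/sec:\s+([0-9.]+)")


def parse_throughput(log_text):
    """Return (step, images_per_sec) pairs found in the log text."""
    pairs = []
    for line in log_text.splitlines():
        match = STEP_LINE.match(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def mean_throughput(pairs):
    """Average images/sec over the parsed steps."""
    return sum(rate for _, rate in pairs) / len(pairs)


sample_log = """\
1 images/sec: 4.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593
100 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.798
total images/sec: 4.90
"""

pairs = parse_throughput(sample_log)
directml_rate = mean_throughput(pairs)
rocm_rate = 88.0  # the ROCm figure quoted in the report
print("DirectML mean: %.2f images/sec" % directml_rate)
print("ROCm is roughly %.0fx faster" % (rocm_rate / directml_rate))
```

The summary line starting with "total" is deliberately not matched, so it does not skew the per-step average.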

Other info / logs
x.shape: (594, 4096)
y.shape: (594, 4096)
(4096,)
(4096,)
<class 'tensorflow.python.framework.ops.EagerTensor'>