Comments (15)

PatriceVignola avatar PatriceVignola commented on August 22, 2024 1

Thanks for this, @Herobring. We'll investigate and get back to you.

from tensorflow-directml.

ffleader1 avatar ffleader1 commented on August 22, 2024 1

I also ran into this problem with ai_benchmark. My GPU is an RX 580.

Glyphus avatar Glyphus commented on August 22, 2024

I had the very same problem at the very same section yesterday with a Vega 56. If uploading some logs would help, please let me know what you need.

hr0109 avatar hr0109 commented on August 22, 2024

Same problem with the integrated GPU on a Ryzen 3500U.
@PatriceVignola

Bkindmonk avatar Bkindmonk commented on August 22, 2024

Ran into the same issue using:
CPU: Ryzen 5 3600
GPU: Radeon RX 5700 XT
OS: Windows 10 (Version 10.0.18363 Build 18363)
Python: 3.7.4

I also ran into the same issue when use_CPU was set to True.

PatriceVignola avatar PatriceVignola commented on August 22, 2024

Hi everyone,

We just released a new version of our PyPI package with changes that should reduce the occurrence of device removals on AMD hardware. The package also has many bug fixes, some performance improvements, and better operator coverage. Please update and let us know how it goes!

ffleader1 avatar ffleader1 commented on August 22, 2024

It still did not work. I tried running ai-benchmark and, on bench 7/19, got:

2020-12-20 15:10:19.371787: F tensorflow/core/common_runtime/dml/dml_command_queue.cc:35] Check failed: (((HRESULT)((queue_->Signal(fence_.Get(), last_fence_value_)))) >= 0) == true (0 vs. 1)

PatriceVignola avatar PatriceVignola commented on August 22, 2024

@ffleader1 Are you sure that you upgraded tensorflow-directml to 1.15.4.dev201216? The error message here looks like a message from the previous version. In 1.15.4.dev201216, we changed the message of the DML_CHECK_SUCCEEDED macro and it should now output a more helpful message along the lines of HRESULT failed with <error_code>

dml_command_queue.cc
DML_CHECK_SUCCEEDED definition
HandledFailedHr definition
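
A quick way to confirm which tensorflow-directml build the active interpreter actually sees is sketched below. It assumes Python 3.8+ for importlib.metadata and falls back to the importlib_metadata backport on older interpreters (such as the 3.7 mentioned earlier in this thread); the helper name installed_version is made up for illustration.

```python
try:
    # Standard library on Python 3.8 and later.
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Assumption: the importlib_metadata backport is installed on older Pythons.
    from importlib_metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The release discussed above is 1.15.4.dev201216; compare it with what is installed.
print("tensorflow-directml:", installed_version("tensorflow-directml"))
```

Running this from the same interpreter (or IDE run configuration) that launches the benchmark matters, since an IDE may manage a different environment from the one that was upgraded on the command line.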

ffleader1 avatar ffleader1 commented on August 22, 2024

Oh, my bad. PyCharm did not update the plugin properly for me. Anyway, I tried again and got to 14/19, with this error:

2020-12-20 21:33:35.204926: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:366] HRESULT failed with 0x887a0006: dml_device_->GetDeviceRemovedReason()
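
For readers decoding this kind of message: 0x887a0006 is the DXGI code DXGI_ERROR_DEVICE_HUNG, one of the standard device-loss HRESULTs from the Windows SDK. The snippet below is a minimal sketch that pulls the code out of a log line in the new message format and maps it to a name; the helper decode_hresult and the small lookup table are illustrative and cover only a few common codes.

```python
import re

# A few well-known DXGI device-loss HRESULTs (values from the Windows SDK headers).
DXGI_ERRORS = {
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
    0x887A0020: "DXGI_ERROR_DRIVER_INTERNAL_ERROR",
}


def decode_hresult(log_line):
    """Return (code, name) for the first 'HRESULT failed with 0x...' match, else None."""
    match = re.search(r"HRESULT failed with (0x[0-9a-fA-F]{8})", log_line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, DXGI_ERRORS.get(code, "unknown")


line = (
    "2020-12-20 21:33:35.204926: F tensorflow/core/common_runtime/dml/"
    "dml_command_recorder.cc:366] HRESULT failed with 0x887a0006: "
    "dml_device_->GetDeviceRemovedReason()"
)
decoded = decode_hresult(line)
if decoded is not None:
    print(hex(decoded[0]), decoded[1])
```

A "device hung" reason means the GPU stopped responding to a submitted command for too long, which is the condition Windows timeout detection and recovery (TDR) reacts to.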

PatriceVignola avatar PatriceVignola commented on August 22, 2024

Thank you, that makes more sense :) The change that we pushed in this release doesn't apply to all situations or all hardware equally, so getting more data on which hardware it helps and which hardware still hits device removals will help our ongoing efforts to minimize them. Can I ask which driver version you have installed?

In the meantime, you can also follow the instructions over here to temporarily disable TDRs and see if it helps. This is the same manipulation that CUDA users are asked to do to avoid TDRs on Windows, which is probably the reason you're getting a device removal in the first place.

ffleader1 avatar ffleader1 commented on August 22, 2024

I do not update my GPU driver regularly. I use an RX 580 with the 20.9.1 driver.
I will try both updating the driver and increasing the TDR delay, then report back.

ffleader1 avatar ffleader1 commented on August 22, 2024

Oh, I think I got it working. Changing the TDR setting does seem to work. I currently have it set to 10. I wonder if I should revert it back to 2.

PatriceVignola avatar PatriceVignola commented on August 22, 2024

Good, it confirms that the device removal is due to a TDR and not a driver issue. In general you should keep the TDR as the default value since it protects you from having an unresponsive system, but since training models can take a long time, it's ok to disable it temporarily while you're executing the workload. Don't forget to set it back to the default value when you're done though!
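
The instructions linked earlier are not preserved in this copy of the thread, so the following is only a sketch under an assumption: that the value changed from 2 to 10 is the documented TdrDelay registry value (a REG_DWORD in seconds, default 2) under the GraphicsDrivers key. The helper tdr_delay_reg_file is hypothetical; it only builds the text of a .reg file and does not touch the registry itself.

```python
# Registry key that holds the Windows TDR settings.
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)

# Documented default timeout, in seconds.
DEFAULT_TDR_DELAY = 2


def tdr_delay_reg_file(delay_seconds):
    """Build the text of a .reg file that sets TdrDelay to delay_seconds."""
    if not isinstance(delay_seconds, int) or not 0 < delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay_seconds must be a positive 32-bit integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + GRAPHICS_DRIVERS_KEY + "]",
        # .reg files encode DWORD values as eight hexadecimal digits.
        '"TdrDelay"=dword:{:08x}'.format(delay_seconds),
        "",
    ]
    return "\n".join(lines)


# One file to raise the timeout for a long workload, one to restore the default.
print(tdr_delay_reg_file(10))
print(tdr_delay_reg_file(DEFAULT_TDR_DELAY))
```

Importing either file needs administrator rights, and the new value is generally only picked up after a restart. Keeping the second file around makes it easy to follow the advice above and go back to the default once the workload is done.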

Herobring avatar Herobring commented on August 22, 2024

Awesome! I'll rerun the test on the new version over one of the coming weekends.

Herobring avatar Herobring commented on August 22, 2024

Hi @PatriceVignola! Thank you for the fix, it works! 🎉

For those who are interested: the result for a non-overclocked AMD Radeon VII + DirectML setup is 7935, which is about 41% of the 19367 scored by AMD Radeon VII + TF 2.1.0 + ROCm (OpenCL) + Debian 10 (probably overclocked).
Anyway, this is still a great start toward making AMD-on-Windows builds possible, cheers! 🍾 🎉

  • TF Version: 1.15.4
  • Platform: Windows-10-10.0.19041-SP0
  • CPU: N/A
  • CPU RAM: 64 GB
  • GPU/0:
  • GPU RAM: 31.7 GB
  • CUDA Version: N/A
  • CUDA Build: N/A

Device Inference Score: 4486
Device Training Score: 3449
Device AI Score: 7935
