Comments (16)

valeriob01 commented on July 26, 2024
2019-09-07 12:06:44 90348611    33410000 36.98%;  886 us/sq; ETA 0d 14:01; bac38bb8e27196e5
2019-09-07 12:06:53 90348611    33420000 36.99%;  886 us/sq; ETA 0d 14:01; 5dc04e6cd38ab191
2019-09-07 12:07:02 90348611    33430000 37.00%;  887 us/sq; ETA 0d 14:01; b91d6d315cae4932
Queue at 0x7f23e803a000 inactivated due to async error:
        HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION:  The agent attempted to execute an illegal shader instruction.

Recovering from this requires a reboot.

I don't know whether gpuowl registers an error in this case; I don't think so. This is a severe error that stops the program. The nErrors counter can only capture certain events, I would say ones less severe than this one.
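
As an illustration of how such a crash could be caught from outside the program, here is a minimal sketch of a log watcher. It is not part of gpuowl: the regular expression is inferred only from the three progress lines quoted above, the names `parse_progress` and `classify` are hypothetical, and the assumption that an all-zero residue (reported later in this thread) would appear as sixteen zeros in the last column is just that, an assumption.

```python
import re
from typing import NamedTuple, Optional

# Format inferred from the progress lines quoted in this thread; this is an
# assumption about the log layout, not taken from gpuowl's source code.
PROGRESS_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+"
    r"(?P<exponent>\d+)\s+(?P<iteration>\d+)\s+[\d.]+%;\s+"
    r"(?P<us_per_sq>\d+) us/sq;\s+ETA [^;]+;\s+"
    r"(?P<residue>[0-9a-f]{16})\s*$"
)


class Progress(NamedTuple):
    exponent: int
    iteration: int
    us_per_sq: int
    residue: str


def parse_progress(line: str) -> Optional[Progress]:
    """Return the fields of a progress line, or None if the line does not match."""
    m = PROGRESS_RE.match(line)
    if m is None:
        return None
    return Progress(int(m["exponent"]), int(m["iteration"]),
                    int(m["us_per_sq"]), m["residue"])


def classify(line: str) -> str:
    """Label a log line as 'fatal', 'suspect', 'progress' or 'other'."""
    if "HSA_STATUS_ERROR" in line:
        return "fatal"        # the runtime error shown above; needs a reboot
    p = parse_progress(line)
    if p is None:
        return "other"
    if p.residue == "0" * 16:
        return "suspect"      # assumed rendering of an all-zero residue
    return "progress"
```

A watcher built on `classify` could alert on the first `fatal` line instead of relying on the program's own error counter.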

from gpuowl.

preda commented on July 26, 2024

Hi, I've never encountered this error myself; I'll probably have to wait until I can reproduce it.


valeriob01 commented on July 26, 2024

Hi Mihai, this was a one-time error that I have never reproduced myself, but I have two Radeon VII cards and both show the same computation errors, including the all-zero residue error. I have also tested them on separate, different mainboards and on both Debian and Ubuntu; the computation errors are common to all of these setups. It seems to me that the dealer got a batch of buggy Radeon VII cards.


preda commented on July 26, 2024

I don't know; I also don't see the all-zero residue. Many things could be causing it... we need more information.


valeriob01 commented on July 26, 2024

One thing I can say is that each all-zero occurrence corresponds to a page fault.


valeriob01 commented on July 26, 2024

Also, more information: when the all-zero error occurs, it is repeated over and over until the next Gerbicz check, which fails; then, on reload, the error may disappear. It may then reappear at random. I have also seen three consecutive errors, which makes gpuowl exit. I have observed this behaviour carefully: the error rate tends to increase with temperature. By cooling the GPU very well I can keep the errors to a minimum, but I still cannot eliminate them reliably.
Tested on two different mainboards, CPUs, RAM and hard disks, with two different Radeon VII cards.
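
To make the sequence described above concrete (bad residues persist until the next Gerbicz check, the check fails, the state is reloaded, and repeated failures end the run), here is a toy sketch in Python. It is not gpuowl's code: the function names, the check after every block, and the limit of three consecutive failures are assumptions chosen only to mirror the behaviour reported in this comment. The identity used is the standard Gerbicz one: with u_{i+1} = u_i^2 mod n and d_{k+1} = d_k * u_{(k+1)L} mod n, a correct block satisfies d_{k+1} = u_0 * d_k^(2^L) mod n.

```python
def make_flaky_square(n, bad_calls):
    """Hypothetical fault injector: returns 0 on the listed call numbers."""
    calls = 0

    def square(x):
        nonlocal calls
        calls += 1
        return 0 if calls in bad_calls else x * x % n

    return square


def prp_with_gerbicz(n, iterations, block, square=None, max_consecutive=3):
    """Square 3 repeatedly mod n, verifying every block of `block` squarings.

    `iterations` is assumed to be a multiple of `block`. Returns the final
    residue and the number of rollbacks. Checking after every block is a
    simplification that roughly doubles the work.
    """
    square = square or (lambda x: x * x % n)
    u0 = 3 % n
    u, d = u0, u0              # last verified residue and running product
    done = rollbacks = consecutive = 0
    while done < iterations:
        x = u
        for _ in range(block):
            x = square(x)      # possibly faulty arithmetic
        d_next = d * x % n
        # Gerbicz identity, computed here with trusted arithmetic.
        expected = u0 * pow(d, 1 << block, n) % n
        if d_next == expected:
            u, d = x, d_next   # accept the block as the new checkpoint
            done += block
            consecutive = 0
        else:
            rollbacks += 1     # discard the block, retry from the checkpoint
            consecutive += 1
            if consecutive >= max_consecutive:
                raise RuntimeError("too many consecutive check failures")
    return u, rollbacks


if __name__ == "__main__":
    n = 2**31 - 1
    flaky = make_flaky_square(n, {5})   # fifth squaring returns zero
    residue, rollbacks = prp_with_gerbicz(n, 16, 4, square=flaky)
    print(residue == pow(3, 2**16, n), rollbacks)
```

Once a squaring returns zero, every later squaring in that block is also zero, which matches the "repeated over and over until the next check" observation; a single transient fault costs one block, while a fault that recurs on every retry ends the run.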


valeriob01 commented on July 26, 2024

It just happened again on the dual Radeon VII system. The GPU in error is idle now and gpuowl has been killed, but the other GPU is still working and computing. I thought the error was more severe, but I still need to reboot to restart the GPU in error.

[Image: gpuerror]


preda commented on July 26, 2024

Are you using PCIe risers?


valeriob01 commented on July 26, 2024

> Are you using PCIe risers?

No. ROCm doesn't support PCIe risers. Risers are a thing of the past for me.
Maybe the source of the errors is some other component involved in the computation.


valeriob01 commented on July 26, 2024

> Are you using PCIe risers?

No. ROCm doesn't support PCIe risers. Risers are a thing of the past for me.
Maybe the source of the errors is some other component involved in the computation.

However, the Radeon VII is the only GPU model that shows these errors. The other GPUs I have, RX580 and Verga64, have never shown a single error...


selroc commented on July 26, 2024

I typed an extra r; that's Vega64!
Well, I will investigate whether the RAM is suffering from being too near the CPU cooler fan.
This is a new account I created to divide my work.


valeriob01 commented on July 26, 2024

I went ahead and installed Debian 10.1 with ROCm 2.8. This seems to have greatly reduced the errors, and the all-zero residue error has not occurred so far.


valeriob01 commented on July 26, 2024

> I typed an extra r; that's Vega64!
> Well, I will investigate whether the RAM is suffering from being too near the CPU cooler fan.
> This is a new account I created to divide my work.

I will just use the mprime stress test to verify the RAM.


valeriob01 commented on July 26, 2024

ROCm/ROCm#873 (comment)


preda commented on July 26, 2024

Are you overclocking the GPU RAM, or undervolting? If so, maybe that is too aggressive.


selroc commented on July 26, 2024

> Are you overclocking the GPU RAM, or undervolting? If so, maybe that is too aggressive.

The irony is that I never touch the voltage or clock settings; I have simply found a way to cool the GPU very well. With Debian 10.1 things are going better: the number of errors has dropped by 90%.

