NVIDIA Open GPU Kernel Modules Version 535.129.03

cuInit breaks kernel logging with CC-on (open-gpu-kernel-modules issue, closed)

derpsteb commented on June 30, 2024

cuInit breaks kernel logging with CC-on

from open-gpu-kernel-modules.

Comments (7)

derpsteb commented on June 30, 2024

I found out that the issue only starts once I connect to the GPU for the first time. Assuming nvidia-persistenced is not running, I can reboot the VM and the kernel logs work as expected. After I run nvidia-smi for the first time, the logs are broken.
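One way to check the "works before, broken after" behaviour described above is to inject a marker into the kernel log and see whether it shows up in dmesg, once before and once after the first nvidia-smi run. The following is only a sketch under stated assumptions: it needs root to write to /dev/kmsg, the function names `marker_visible` and `probe_kernel_log` are hypothetical, and it is not known whether the breakage reported here would show up in exactly this way.

```python
import subprocess
import uuid


def marker_visible(marker, log_text):
    """Return True if the marker string occurs on any line of the log text."""
    return any(marker in line for line in log_text.splitlines())


def probe_kernel_log(kmsg_path="/dev/kmsg"):
    """Write a unique marker to the kernel log and look for it in dmesg.

    Returns True/False depending on whether the marker is visible, or None
    if the probe itself could not run (no permission, no dmesg binary, ...).
    """
    marker = f"klog-probe-{uuid.uuid4().hex}"
    try:
        # Writing a line to /dev/kmsg injects a message into the kernel log.
        with open(kmsg_path, "w") as kmsg:
            kmsg.write(marker + "\n")
        log_text = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return marker_visible(marker, log_text)


if __name__ == "__main__":
    print("kernel log probe:", probe_kernel_log())
```

Running it as root before and after the first GPU access gives a quick yes/no signal instead of eyeballing dmesg.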

Tan-YiFan commented on June 30, 2024

According to https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html:

Confidential Compute applications will not work on this release. Please continue to use 535.104.05 for this use case.

Could you try again using 535.104.05?

derpsteb commented on June 30, 2024

I tried. The problem persists. :/
I also changed the kernel to exactly match what the docs require here, so I am now running 6.2.0-26 instead of 6.2.0-37.

Tan-YiFan commented on June 30, 2024

Please check:

Does the disk for the VM have enough space?
Could you build the nvidia driver without NV_VERBOSE=1 DEBUG=1?

derpsteb commented on June 30, 2024

Yes.
The problem also exists with an unmodified driver from https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run

I can reproduce the error by running a minimal program that just executes cuInit.
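The original reproducer is not shown in the thread. A minimal stand-in, assuming the CUDA driver library is installed as `libcuda.so.1`, could call `cuInit` through Python's ctypes instead of a compiled C program. The helper name `run_cuinit` is hypothetical; `cuInit(unsigned int flags)` with flags set to 0 and a return value of 0 for `CUDA_SUCCESS` are part of the CUDA driver API.

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API


def run_cuinit(lib_name="libcuda.so.1"):
    """Load the CUDA driver library and call cuInit(0).

    Returns the CUresult code as an int, or None if the library
    could not be loaded (for example, no driver installed).
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)  # flags must currently be 0


if __name__ == "__main__":
    result = run_cuinit()
    if result is None:
        print("libcuda.so.1 could not be loaded")
    elif result == CUDA_SUCCESS:
        print("cuInit succeeded")
    else:
        print(f"cuInit failed with CUresult {result}")
```

Checking dmesg before and after running this script should show whether the first contact with the GPU is what breaks the kernel log.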

Tan-YiFan commented on June 30, 2024

Could you find the broken log at /var/log/ ?

Also check nvidia-smi conf-compute -f and nvidia-smi conf-compute -gc.

By the way, does the CUDA application fail, or is only dmesg broken?

derpsteb commented on June 30, 2024

Okay. It seems like using NVreg_RmMsg=":" as described here overwhelms the kernel logging subsystem if one is using CC-mode.

When the module is configured to print only warnings, the logs continue to work as expected, even with CC-on. I guess something in the CC-only code paths produces more log output than the kernel can handle.

To answer your questions: the cuda application continues to work. Both smi commands print the expected output. There are no logs in /var/log/kernel if the logging is overwhelmed.
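To spot this setting on another machine, one could scan the modprobe configuration for an `NVreg_RmMsg` option on the nvidia module. The sketch below assumes the option is set through an `options nvidia ...` line in a modprobe.d-style file; the function name `find_rmmsg_values` and the sample text are hypothetical. The thread does not state the exact value used for the warnings-only configuration, so the sketch only reports what it finds and flags the `":"` value mentioned above.

```python
import re

# Matches NVreg_RmMsg="quoted value" or NVreg_RmMsg=bare_value
RMMSG_PATTERN = re.compile(r'NVreg_RmMsg=(?:"([^"]*)"|(\S+))')


def find_rmmsg_values(conf_text):
    """Return every NVreg_RmMsg value set on 'options nvidia' lines."""
    values = []
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        parts = line.split()
        if len(parts) < 3 or parts[0] != "options" or parts[1] != "nvidia":
            continue
        for match in RMMSG_PATTERN.finditer(line):
            quoted, bare = match.group(1), match.group(2)
            values.append(quoted if quoted is not None else bare)
    return values


if __name__ == "__main__":
    sample = 'options nvidia NVreg_RmMsg=":"\n'  # hypothetical file content
    for value in find_rmmsg_values(sample):
        note = " (the setting reported to overwhelm logging with CC-on)" if value == ":" else ""
        print(f"NVreg_RmMsg={value!r}{note}")
```

In practice the text would come from files under /etc/modprobe.d/, read one at a time and passed to `find_rmmsg_values`.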
