
Comments (9)

shovsj commented on August 16, 2024

Sorry, I found out that the error message appears because an nv-hostengine instance is already running on the host:

root@cl-platform-gpu01:/# dcgmi dmon -e 1004
# Entity  TENSO
      Id
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error

So, never mind about the error message.

The real problem is that the profiling metrics become zero when I run the following command:

dcgmi dmon -e 155,1001,1004

image
As you can see, power usage is more than 100 W and a GPU job is currently running, but the profiling metrics are zero.

If I reconnect using the same command, I get proper results for a while.
image

I think NVIDIA/gpu-monitoring-tools#189 is the same issue.
Any ideas? Is this a bug?

from dcgm.

nikkon-dev commented on August 16, 2024

Thanks for the report!

The behavior does indeed look suspicious. If it's consistent, could you provide debug logs from nv-hostengine captured at the moment the 1001 and 1004 metrics become zero?
nv-hostengine -f /tmp/host.log --log-level debug

Could you also try version 2.2.3 (https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/datacenter-gpu-manager_2.2.3_amd64.deb) instead of 2.1.4?


shovsj commented on August 16, 2024

Thank you for your help!

I killed the running nv-hostengine and ran the command from your comment as root:

root     27160     1  0 20:33 ?        00:00:02 nv-hostengine -f /tmp/host.log --log-level debug

Then I tried again and got the following log:

2021-06-07 20:42:16.079 DEBUG [27160:27173] Checking status for gpu 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2019] [DcgmCacheManager::GetGpuStatus]
2021-06-07 20:42:16.099 DEBUG [27160:27173] Appended entity double eg 1, eid 0, fieldId 155, ts 1623066136079523, value1 101.231000, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:16.162 DEBUG [27160:27173] Preparing to update watchInfo 0x7f8ed4000c40, eg 1, eid 1, fieldId 155 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:5120] [DcgmCacheManager::ActuallyUpdateAllFields]
2021-06-07 20:42:16.162 DEBUG [27160:27173] Checking status for gpu 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2019] [DcgmCacheManager::GetGpuStatus]
2021-06-07 20:42:16.164 DEBUG [27160:27173] Appended entity double eg 1, eid 1, fieldId 155, ts 1623066136162076, value1 158.219000, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:16.265 DEBUG [27160:27162] Processing request of type 56 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2359] [DcgmHostEngineHandler::ProcessRequest]
2021-06-07 20:42:16.266 DEBUG [27160:27174] Sending message to 2 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/common/transport/DcgmIpc.cpp:1081] [DcgmIpc::SendMessageImpl]
2021-06-07 20:42:16.492 DEBUG [27160:27481] [PerfWorks] This iteration Samples Decoded: 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:548] [DcgmLopConfig::GetSamples]
2021-06-07 20:42:16.492 DEBUG [27160:27481] [PerfWorks] Overall number of Samples Decoded: 8 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:549] [DcgmLopConfig::GetSamples]
2021-06-07 20:42:16.493 DEBUG [27160:27481] [PerfModule][MIG] GetMigInstanceCount called for a non-mig device. GpuId: 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopGpu.cpp:536] [DcgmLopGpu::GetComputeInstancesCount]
2021-06-07 20:42:16.494 DEBUG [27160:27481] [PerfWorks] This iteration Samples Decoded: 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:548] [DcgmLopConfig::GetSamples]
2021-06-07 20:42:16.494 DEBUG [27160:27481] [PerfWorks] Overall number of Samples Decoded: 2 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:549] [DcgmLopConfig::GetSamples]
2021-06-07 20:42:16.495 DEBUG [27160:27481] [PerfModule][MIG] GetMigInstanceCount called for a non-mig device. GpuId: 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopGpu.cpp:536] [DcgmLopGpu::GetComputeInstancesCount]
2021-06-07 20:42:16.495 DEBUG [27160:27481] Field ID 1001 is being cached with no samples available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/LopValues.hpp:102] [DcgmNs::Modules::Profiling::PrepareValue]
2021-06-07 20:42:16.495 DEBUG [27160:27481] Field ID 1004 is being cached with no samples available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/LopValues.hpp:102] [DcgmNs::Modules::Profiling::PrepareValue]
2021-06-07 20:42:16.495 DEBUG [27160:27481] Appended entity double eg 1, eid 0, fieldId 1001, ts 1623066136493204, value1 0.400895, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:16.495 DEBUG [27160:27481] Appended entity double eg 1, eid 0, fieldId 1004, ts 1623066136493204, value1 0.151762, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:16.495 DEBUG [27160:27481] Appended entity double eg 1, eid 1, fieldId 1001, ts 1623066136495190, value1 0.000000, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:16.495 DEBUG [27160:27481] Appended entity double eg 1, eid 1, fieldId 1004, ts 1623066136495190, value1 0.000000, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:16.495 DEBUG [27160:27481] minWakeupMs = 1000 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1652] [DcgmModuleProfiling::ReadAndCacheMetrics]
2021-06-07 20:42:17.079 DEBUG [27160:27173] Preparing to update watchInfo 0x7f8ed4000e90, eg 1, eid 0, fieldId 155 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:5120] [DcgmCacheManager::ActuallyUpdateAllFields]
2021-06-07 20:42:17.079 DEBUG [27160:27173] Checking status for gpu 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2019] [DcgmCacheManager::GetGpuStatus]
2021-06-07 20:42:17.082 DEBUG [27160:27173] Appended entity double eg 1, eid 0, fieldId 155, ts 1623066137079858, value1 95.980000, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
2021-06-07 20:42:17.162 DEBUG [27160:27173] Preparing to update watchInfo 0x7f8ed4000c40, eg 1, eid 1, fieldId 155 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:5120] [DcgmCacheManager::ActuallyUpdateAllFields]
2021-06-07 20:42:17.162 DEBUG [27160:27173] Checking status for gpu 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2019] [DcgmCacheManager::GetGpuStatus]
2021-06-07 20:42:17.190 DEBUG [27160:27173] Appended entity double eg 1, eid 1, fieldId 155, ts 1623066137162162, value1 162.274000, value2 0.000000, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6082] [DcgmCacheManager::AppendEntityDouble]
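For anyone scanning a long host.log for the same symptom, here is a minimal sketch (not part of DCGM) that pulls the `Appended entity double` lines out of the log and lists GPUs whose most recent power reading (field 155) is high while their most recent profiling samples (fields 1001 and 1004) are all zero. The regular expression is inferred only from the lines pasted above, and the function names and the 100 W threshold are illustrative assumptions.

```python
import re

# Pattern inferred from the "Appended entity double" debug lines in this thread.
# Other DCGM versions may format these lines differently (assumption).
SAMPLE_RE = re.compile(
    r"Appended entity double eg (?P<eg>\d+), eid (?P<eid>\d+), "
    r"fieldId (?P<field>\d+), ts (?P<ts>\d+), value1 (?P<value>-?\d+\.\d+)"
)

POWER_FIELD = 155                 # power usage, per the thread
PROFILING_FIELDS = (1001, 1004)   # the profiling fields being watched


def parse_samples(lines):
    """Yield (gpu_id, field_id, ts, value) for every matching log line."""
    for line in lines:
        m = SAMPLE_RE.search(line)
        if m:
            yield int(m["eid"]), int(m["field"]), int(m["ts"]), float(m["value"])


def suspicious_gpus(lines, min_power_w=100.0):
    """Return GPU ids with latest power >= min_power_w and all-zero profiling."""
    latest = {}
    for gpu_id, field_id, _ts, value in parse_samples(lines):
        latest[(gpu_id, field_id)] = value  # later lines overwrite earlier ones

    flagged = []
    for gpu_id in sorted({g for g, _ in latest}):
        power = latest.get((gpu_id, POWER_FIELD))
        prof = [latest[(gpu_id, f)] for f in PROFILING_FIELDS
                if (gpu_id, f) in latest]
        if power is not None and power >= min_power_w and prof \
                and all(v == 0.0 for v in prof):
            flagged.append(gpu_id)
    return flagged
```

Calling `suspicious_gpus(open("/tmp/host.log"))` on a log like the one above would be expected to flag GPU 1, which draws roughly 160 W while both of its profiling fields read 0.000000.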

image

and here is the entire log
host.log

The link must be a newer version of DCGM, right? My OS is CentOS 7, so I will install the matching newer version and test again.
Many thanks.


nikkon-dev commented on August 16, 2024

For CentOS 7/RHEL 7, you can use this RPM: https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/datacenter-gpu-manager-2.2.3-1-x86_64.rpm
Newer versions have better logging and some bug fixes related to the issue you are experiencing.

You may also try to run dcgmi profile --pause && dcgmi profile --resume once you start seeing zeroes. The outcome of that command would also help us to diagnose the issue further.
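If it helps to repeat that pause/resume step from a monitoring script, a small wrapper could look like the sketch below. It only shells out to the two `dcgmi profile` commands quoted above; the function name, the `run` parameter (there so the helper can be exercised without DCGM installed), and the one-second settle delay are assumptions made for illustration.

```python
import subprocess
import time


def pause_and_resume(run=subprocess.run, settle_seconds=1.0):
    """Run `dcgmi profile --pause`, then `dcgmi profile --resume`.

    Returns True only if both commands exit with status 0.
    """
    for flag in ("--pause", "--resume"):
        proc = run(["dcgmi", "profile", flag], capture_output=True, text=True)
        if proc.returncode != 0:
            return False  # stop early; do not resume after a failed pause
        time.sleep(settle_seconds)  # arbitrary delay, not a DCGM requirement
    return True
```

Recording whether the metrics recover after each call, together with the timestamp, would give the maintainers the outcome they ask for here.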

WBR,
Nik


shovsj commented on August 16, 2024

Thank you for your kind help, but it doesn't seem to work.

I upgraded DCGM using the RPM you provided:

[parl@cl-platform-gpu02:/home/parl] dcgmi --version

dcgmi  version: 2.2.3

[parl@cl-platform-gpu02:/home/parl] nv-hostengine --version
Version : 2.2.3
Build ID : 7
Build Date : 2021-04-30
Build Type : Release
Commit ID : 627b2ed27157d9c5c1cddc468bcccd3ba086ec4e
Branch Name : rel_dcgm_2_2
CPU Arch : x86_64
Build Platform : Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64
CRC : ac449ff8470d86588114e8fdf901332f

But the problem still occurs.
While watching the profiling metrics, I ran the suggested command:

[parl@cl-platform-gpu02:/home/parl] dcgmi profile --pause && dcgmi profile --resume
Successfully paused profiling.
Successfully resumed profiling.

This occasionally works, but usually the results are still zeros.

If you need any more data, please let me know.


nikkon-dev commented on August 16, 2024

I would need to see the debug logs from the newer version as well.
Also, could you clarify if the following is correct?

  • Only GPU1 is affected. GPU0 always gets some values.
  • The issue reproduces after approximately the same amount of time once the initial monitoring starts.
  • There is some load on both GPUs all the time, and GPU1 values become zeroes in the middle of some load-generating process.

Another question that I'd like to clarify - does the issue reproduce if just GPU1 is monitored? dcgmi dmon -e 1001,1004 -i 1
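To check the second point above (whether the zeros appear after roughly the same amount of time), one option is to measure, per GPU and field, how long after the first sample a profiling value first drops to zero. The sketch below assumes samples are already available as `(gpu_id, field_id, ts_us, value)` tuples and that `ts` in the debug log is microseconds since the Unix epoch, which is how the values pasted earlier appear to read; the function name is hypothetical.

```python
def seconds_until_zero(samples, fields=(1001, 1004)):
    """Map (gpu_id, field_id) to the seconds elapsed between the first sample
    and the first zero sample that follows at least one nonzero sample."""
    first_ts = {}
    seen_nonzero = set()
    result = {}
    # Sort by timestamp so out-of-order input does not skew the result.
    for gpu_id, field_id, ts_us, value in sorted(samples, key=lambda s: s[2]):
        key = (gpu_id, field_id)
        if field_id not in fields or key in result:
            continue
        first_ts.setdefault(key, ts_us)
        if value != 0.0:
            seen_nonzero.add(key)
        elif key in seen_nonzero:
            result[key] = (ts_us - first_ts[key]) / 1_000_000
    return result
```

Comparing the returned durations across several runs would show whether the drop happens at a consistent offset, and whether only GPU1 ever appears in the result.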

WBR,
Nik


bstollenvidia commented on August 16, 2024

Closing due to inactivity.


lw921014 commented on August 16, 2024

nv-hostengine -f /tmp/host.log --log-level debug

I'm hitting the same problem. How can it be solved?

The host error log is shown below:
image


nikkon-dev commented on August 16, 2024

@lw921014,

  • What DCGM version are you using? The dcgmi -v output, while the nv-hostengine is running, would be sufficient.
  • What GPUs are used?
  • How long does it take to reproduce?
  • Is there anything Nvidia related in the dmesg logs?
  • Do you have multiple GPUs installed or just one? If multiple GPUs are used, does this issue always happen on GPU 0?
  • Are you using Nvidia profiling tools at the same time?
