
Comments (35)

nikkon-dev avatar nikkon-dev commented on July 17, 2024

@ligeweiwu,

You need to start the nv-hostengine process. For your layout, that would be: sudo ./DCGM/_out/Linux-amd64-debug/bin/nv-hostengine -f host.log --log-level debug
nv-hostengine should be run as root; otherwise, some functionality will not be available.
This command starts the nv-hostengine daemon, which writes debug logs to the host.log file.
If you want to debug nv-hostengine itself, it's better to add the -n argument, which prevents nv-hostengine from daemonizing.

Alternatively, you could generate .deb/.rpm packages and install them with the package manager, but that is a very inconvenient workflow for debugging.

from dcgm.

ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
It works. Thanks for your help!


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi, I have another question.
I have now added the GPU to the group (Group 2) successfully. When I enter
"./dcgmi diag -r 1 -g 2"
it gives me this message:
"Couldn't parse json: 'WARNING: You must also provide env __NVVS_DBG_FILE='"
How can I fix it?
Below is my working environment.
Thanks.

+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 3 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2                                      |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | GPU_GROUP                                                |
|    -> Entities    | GPU 2                                                    |
+-------------------+----------------------------------------------------------+


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
By the way, I built the debug version of DCGM with the command
"./build.sh -d -c"
After that, it generated the _out/Linux-amd64-debug folder.
Thanks.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

@ligeweiwu,

Most likely, nv-hostengine cannot find the nvvs binary. It looks for it in the default location where the package manager installs it.
To override that logic, set the environment variable NVVS_BIN_PATH to the full path of the nvvs location (in your case, under _out/Linux-amd64-debug/share/nvidia-validation-suite) before running the nv-hostengine process.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi, thanks for your help. Now I have another question.
Now, when I enter "sudo ./nv-hostengine -n -f host.log --log-level debug" again, it gives me an error:
"Err: Failed to start DCGM Server: -7
User defined signal 1"
How can I fix it?
Thanks


nikkon-dev avatar nikkon-dev commented on July 17, 2024

@ligeweiwu,

-7 stands for DCGM_ST_INIT_ERROR (documented in the header as "DCGM Init error"). It's hard to tell what is wrong without the debug logs (the host.log in your command).


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi, below is a snapshot of host.log.
By the way, this error occurs after I enter "export NVVS_BIN_PATH=full path" and then run "sudo ./nv-hostengine -n -f host.log --log-level debug".
[image: snapshot of host.log]
Thanks.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

Hi, this is the content of debug log.
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 1, fieldId 513, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101f500000002 (eg 1, entityId 2, fieldId 501) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher]
2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 501, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101fd00000002 (eg 1, entityId 2, fieldId 509) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher]
2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 509, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101fe00000002 (eg 1, entityId 2, fieldId 510) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher]
2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 510, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101ff00000002 (eg 1, entityId 2, fieldId 511) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher]
2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 511, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x1020000000002 (eg 1, entityId 2, fieldId 512) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher]
2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 512, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x1020100000002 (eg 1, entityId 2, fieldId 513) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher]
2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 513, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Skipping waitForUpdate since the cache manager thread is not running yet. [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2394] [DcgmCacheManager::UpdateAllFields]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Added field group id 3, name DCGM_INTERNAL_JOB, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmFieldGroup.cpp:172] [DcgmFieldGroupManager::AddFieldGroup]
2022-12-02 10:41:26.277 INFO [83727:83727] Created thread named "cache_mgr_main" ID 1854199552 DcgmThread ptr 0x0xe95a10 [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:110] [DcgmThread::Start]
2022-12-02 10:41:26.277 DEBUG [83727:83727] Skipping waitForUpdate since the cache manager thread is not running yet. [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2394] [DcgmCacheManager::UpdateAllFields]
2022-12-02 10:41:26.278 DEBUG [83727:83727] dcgmStartEmbedded(): Embedded host engine started [/workspaces/HGDCGM/dcgmlib/src/DcgmApi.cpp:4826] [{anonymous}::StartEmbeddedV2]
2022-12-02 10:41:26.278 DEBUG [83727:83732] Thread handle 1854199552 running [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:299] [DcgmThread::RunInternal]
2022-12-02 10:41:26.278 DEBUG [83727:83727] Entering dcgmEngineRun(unsigned short portNumber, char const *socketPath, unsigned int isConnectionTCP) (5555 127.0.0.1 1) [/workspaces/HGDCGM/dcgmlib/entry_point.h:73] [dcgmEngineRun]
2022-12-02 10:41:26.278 INFO [83727:83732] Cache manager update thread starting [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:6053] [DcgmCacheManager::run]
2022-12-02 10:41:26.278 DEBUG [83727:83732] Preparing to update watchInfo 0xeaf500, eg 1, eid 1, fieldId 512 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:5278] [DcgmCacheManager::ActuallyUpdateAllFields]
2022-12-02 10:41:26.278 INFO [83727:83727] Created thread named "dcgm_ipc" ID 1845806848 DcgmThread ptr 0x0xe94610 [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:110] [DcgmThread::Start]
2022-12-02 10:41:26.278 DEBUG [83727:83732] Checking status for gpu 1 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2275] [DcgmCacheManager::GetGpuStatus]
2022-12-02 10:41:26.278 DEBUG [83727:83733] Thread handle 1845806848 running [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:299] [DcgmThread::RunInternal]
2022-12-02 10:41:26.278 DEBUG [83727:83732] Appended entity blob eg 1, eid 1, fieldId 512, ts 1669948886278497, valueSize 2048, cached 1, buffered 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:6452] [DcgmCacheManager::AppendEntityBlob]
2022-12-02 10:41:26.278 ERROR [83727:83733] bind failed. port 5555, address 127.0.0.1, errno 98 [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:326] [DcgmIpc::InitTCPListenerSocket]
2022-12-02 10:41:26.278 ERROR [83727:83733] InitTCPListenerSocket() returned Generic unspecified error [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:195] [DcgmIpc::run]
2022-12-02 10:41:26.278 DEBUG [83727:83732] Preparing to update watchInfo 0xeaf360, eg 1, eid 1, fieldId 510 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:5278] [DcgmCacheManager::ActuallyUpdateAllFields]
2022-12-02 10:41:26.278 ERROR [83727:83727] initFuture returned Generic unspecified error [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:131] [DcgmIpc::Init]
2022-12-02 10:41:26.278 DEBUG [83727:83732] Checking status for gpu 1 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2275] [DcgmCacheManager::GetGpuStatus]
2022-12-02 10:41:26.278 ERROR [83727:83727] Got error Generic unspecified error from m_dcgmIpc.Init [/workspaces/HGDCGM/dcgmlib/src/DcgmHostEngineHandler.cpp:3767] [DcgmHostEngineHandler::RunServer]
2022-12-02 10:41:26.278 DEBUG [83727:83733] Thread id 1845806848 stopped [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:308] [DcgmThread::RunInternal]
2022-12-02 10:41:26.278 DEBUG [83727:83727] Returning -7 [/workspaces/HGDCGM/dcgmlib/entry_point.h:73] [dcgmEngineRun]

Thanks.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

@ligeweiwu,

That's an easy one: either you already have an nv-hostengine process running, or something else is listening on TCP port 5555.

2022-12-02 10:41:26.278 ERROR [83727:83733] bind failed. port 5555, address 127.0.0.1, errno 98

Normally, when nv-hostengine runs, it creates a PID file that later lets it detect whether another instance is already running. That may not be true for debug builds.
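
The bind failure can be checked from a shell before restarting the host engine. On Linux, errno 98 is EADDRINUSE ("Address already in use"). The following is a minimal sketch, assuming bash (it relies on bash's /dev/tcp redirection); the helper name port_in_use is hypothetical and not part of DCGM:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): succeeds if something accepts
# TCP connections on host $1, port $2. Relies on bash's /dev/tcp feature.
port_in_use() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# 127.0.0.1:5555 is the address and port from the "bind failed" log line.
if port_in_use 127.0.0.1 5555; then
    echo "port 5555 is already in use; look for a running nv-hostengine"
else
    echo "port 5555 looks free"
fi
```

If the port turns out to be taken, `pgrep -a nv-hostengine` lists any host engine process that is still running.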


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Thanks for your help. This error has been solved.
But when I set the environment variable
"export NVVS_BIN_PATH=XXX/GDCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/", start the nv-hostengine service, and then run
"./dcgmi diag -r 1 -g 2" (the GPU group has already been created),
it still gives me this message:
Couldn't parse json: 'WARNING: You must also provide env __NVVS_DBG_FILE='
By the way, when I install the deb package, this error does not happen. Is there another step that I missed?
Thanks


nikkon-dev avatar nikkon-dev commented on July 17, 2024

@ligeweiwu,

In the debug nv-hostengine logs, look for the NVVS-related lines.
Also, there may be an nvvs.log file next to the nvvs binary.
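
One way to pull those lines out of a large debug log is a plain grep. This is only a sketch: it assumes the diag module tags its lines with "[[Diag]]", as in the log excerpts posted later in this thread, and the helper name diag_lines is hypothetical:

```shell
# Hypothetical helper: print the diag-module lines (tagged "[[Diag]]" in the
# host engine log) together with their line numbers in the file.
diag_lines() {
    grep -n -F '[[Diag]]' "$1"
}

# host.log is the file passed to nv-hostengine with -f.
if [ -f host.log ]; then
    diag_lines host.log
fi
```

The -F flag makes grep treat the brackets literally instead of as a character class.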


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi, these are the NVVS-related lines.
2022-12-01 22:01:31.184 DEBUG [1909:1911] [[Diag]] Unknown subcommand: 1 [/workspaces/HGDCGM/modules/diag/DcgmModuleDiag.cpp:144] [DcgmModuleDiag::ProcessCoreMessage]
2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] External command stdout: WARNING: You must also provide env __NVVS_DBG_FILE= [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:676] [DcgmDiagManager::PerformExternalCommand]
2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] External command stderr: { "DCGM GPU Diagnostic" : { "runtime_error" : "Unable to get the driver version: Host engine connection invalid/disconnected. Couldn't succeed despite 0 retries.", "version" : "440.37" } } [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:677] [DcgmDiagManager::PerformExternalCommand]
2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] The external command '/usr/share/nvidia-validation-suite/nvvs' returned a non-zero exit code: 1 [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:705] [DcgmDiagManager::PerformExternalCommand]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] Failed to parse NVVS output: WARNING: You must also provide env __NVVS_DBG_FILE= [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:1145] [DcgmDiagManager::ValidateNvvsOutput]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] Error happened during JSON parsing of NVVS output: The GPU Diagnostic returned Json that cannot be parsed. [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:885] [DcgmDiagManager::RunDiag]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] NVVS stderr:
{ "DCGM GPU Diagnostic" : { "runtime_error" : "Unable to get the driver version: Host engine connection invalid/disconnected. Couldn't succeed despite 0 retries.", "version" : "440.37" } } [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:886] [DcgmDiagManager::RunDiag]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] RunDiagAndAction returned -40 [/workspaces/HGDCGM/modules/diag/DcgmModuleDiag.cpp:120] [DcgmModuleDiag::ProcessRun_v6]
2022-12-01 22:01:31.195 DEBUG [1909:2000] Sending message to 17 [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:1198] [DcgmIpc::SendMessageImpl]

I see there is a message:
"The external command '/usr/share/nvidia-validation-suite/nvvs' returned a non-zero exit code: 1 [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:705] [DcgmDiagManager::PerformExternalCommand]"
That means it uses the location "/usr/share/nvidia-validation-suite/nvvs".
But my workspace is not in /usr/share/nvidia-validation-suite/nvvs, and I also set the environment variable NVVS_BIN_PATH to my workspace
(export NVVS_BIN_PATH=XXX/GDCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/)
Is there another environment variable that I should set (e.g. __NVVS_DBG_FILE)?
Thanks.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

Did you set NVVS_BIN_PATH for the root user or for your current user?
If you run export NVVS_BIN_PATH= and then sudo ./nv-hostengine..., the environment variable is not set for the root user that runs nv-hostengine.
The log file should also mention the NVVS_BIN_PATH variable and its value.
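
A common way around this is to set the variable on the sudo command line itself, since sudo typically resets the caller's environment by default. The sketch below is an assumption-laden illustration: NVVS_DIR is a hypothetical path and child_nvvs_path is a hypothetical helper that only demonstrates how env(1) hands the variable to a child process:

```shell
# Hypothetical location of the locally built nvvs; adjust to your checkout.
NVVS_DIR="${NVVS_DIR:-$PWD/_out/Linux-amd64-debug/share/nvidia-validation-suite}"

# Hypothetical helper: show the NVVS_BIN_PATH value that a child process
# started through env(1) would see.
child_nvvs_path() {
    env NVVS_BIN_PATH="$1" sh -c 'printf "%s\n" "${NVVS_BIN_PATH:-unset}"'
}

child_nvvs_path "$NVVS_DIR"

# The real launch would then look like this (left commented out here):
# sudo env NVVS_BIN_PATH="$NVVS_DIR" ./nv-hostengine -n -f host.log --log-level debug
```

Because env runs as the child of sudo, the variable is set inside the root process regardless of how sudo filtered the caller's environment.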


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon,
(1): I could not find NVVS_BIN_PATH in the log file.
(2): I ran export NVVS_BIN_PATH= and then sudo ./nv-hostengine...
When I run ./nv-hostengine... instead, it works fine.

Thanks for your help.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon,
Sorry to bother you again, but I still have a problem debugging DCGM.
Now I can run the diag test on the corresponding GPU group. But when I use gdb to step into the source code, it cannot find the source code correctly.
For example:

  1. Build the debug version:
    ./build.sh -d -c
    After that, it generates _out/Linux-amd64-debug/bin, which contains the binaries such as dcgmi.
  2. Run the test:
    ./dcgmi diag -r 1 -g 2
    It works fine.
  3. Use gdb to step into the source code:
    gdb --args ./dcgmi diag -r 1 -g 2
    I want to set a breakpoint at Diag::RunDiagOnce, so I enter
    b Diag::RunDiagOnce
    and then run. But the program never hits the breakpoint. When I add some print statements and rebuild, the new output is shown, but the breakpoint at Diag::RunDiagOnce is still never reached.

Which step am I missing? Or am I running gdb on the wrong executable (maybe not dcgmi)?
Thanks


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Another example: if I enter in gdb
b CommandLineParser.cpp:99 //Entry method into this class for a given command line provided by main()
gdb responds with:
Breakpoint at 0x44b61e: file /opt/cross/x86_64-linux-gnu/include/tclap/CommandLineParser.cpp, line 165.

The breakpoint that gdb reports is not the one that I set.
So I think the debug symbols in dcgmi may be broken? Is there anything else that I should add to the command "./build.sh -d -c"?

Thanks


nikkon-dev avatar nikkon-dev commented on July 17, 2024

@ligeweiwu,

There are several steps that you need to take to be able to debug:

  1. If you used the build container to build DCGM, the source location differs from the one on your host. You need to tell gdb where to find the sources and how to convert the build-container path to your host path. Something like this should be added to a .gdbinit file:
    directory /home/your_host_user/src/DCGM
    set substitute-path /workspaces/dcgm /home/your_host_user/src/DCGM
    The build container mounts the sources at /workspaces/dcgm.

  2. dcgmi diag does not let you debug the diagnostic itself. In a nutshell, when you run dcgmi diag, it connects to nv-hostengine to send the commands, and nv-hostengine runs the nvvs binary to perform the actual diagnostic. So the path is dcgmi->nv-hostengine->nvvs. You need to debug nv-hostengine and tell gdb to follow children on forks.
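
The two steps above can be combined into one gdb command file. The following is a sketch under stated assumptions: write_gdb_commands is a hypothetical helper, the paths are the placeholders from the reply, and the output file name dcgm.gdb is made up. The gdb commands themselves (directory, set substitute-path, set follow-fork-mode) are standard gdb commands:

```shell
# Hypothetical helper: write a gdb command file that maps the build-container
# source path ($1) to the host checkout ($2) and follows forked children.
# $3 is the file to write.
write_gdb_commands() {
    cat > "$3" <<EOF
directory $2
set substitute-path $1 $2
set follow-fork-mode child
EOF
}

# Placeholder paths taken from the reply above; replace them with your own.
GDB_FILE="${TMPDIR:-/tmp}/dcgm.gdb"
write_gdb_commands /workspaces/dcgm /home/your_host_user/src/DCGM "$GDB_FILE"
cat "$GDB_FILE"

# Load it explicitly when debugging the host engine (commented out here):
# sudo gdb -x "$GDB_FILE" --args ./nv-hostengine -n -f host.log --log-level debug
```

Loading the file with -x sidesteps gdb's auto-load safe-path check, which can cause a .gdbinit in the current directory to be ignored. The path to substitute should match exactly what gdb reports for a source file, for example in its breakpoint messages.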


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon, I tried it, but I still have a problem.
(1): my workspace is:
/mnt/ssd/liuzhenli.lzl/DCGM
(2):
The execute path is:
/mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin
(3):
I add .gdbinit file in /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin, and the content is:

directory /mnt/ssd/liuzhenli.lzl/DCGM
set substitute-path /workspaces/DCGM /mnt/ssd/liuzhenli.lzl/DCGM

(4):
Now I just want to test the commandLineParser function, e.g. in CommandLineParser.cpp:
1307: dcgmReturn_t CommandLineParser::ProcessDiagCommandLine(int argc, char const *const *argv)
1308:{
1309: // Check for stop diag request
1310: const char *value = std::getenv(STOP_DIAG_ENV_VARIABLE_NAME);
1311: if (value != nullptr)

so I do:
4.1): gdb --args ./dcgmi diag -r 1 -g 2 (I think CommandLineParser belongs to dcgmi, is that right?)
4.2): b CommandLineParser.cpp:1310
And then run in gdb

(5):
The gdb feedback is still:
Breakpoint 1 at 0x408399: file /workspaces/DCGM/dcgmi/main_dcgmi.cpp, line 3419

(6):
I also looked at build.sh in the DCGM repo, and the docker start command is:
docker run --rm -u "$(id -u)":"$(id -g)"
${DOCKER_ARGS:-}
-v "${DIR}":"${REMOTE_DIR}"
I printed ${DIR}, and it is actually /mnt/ssd/liuzhenli.lzl/DCGM. ${REMOTE_DIR} is /workspaces/DCGM.
So I think I set substitute-path correctly in the .gdbinit file.
Is there anything that I am doing wrong?

By the way, when I add other print statements to CommandLineParser.cpp and then run ./dcgmi diag -r 1 -g 2, it always prints the new output.

Sorry to bother you.
Thanks


nikkon-dev avatar nikkon-dev commented on July 17, 2024

It's /workspaces/dcgm, not capital letters. The path is case-sensitive.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon,
This is now the content of my .gdbinit file in /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin (I have changed DCGM to dcgm for /workspaces):

directory /mnt/ssd/liuzhenli.lzl/DCGM
set substitute-path /workspaces/dcgm /mnt/ssd/liuzhenli.lzl/DCGM

However, when I run gdb (gdb --args ./dcgmi diag -r 1 -g 2, b CommandLineParser.cpp:1310), it still gives me the feedback:
"Breakpoint 1 at 0x408389: file /workspaces/DCGM/dcgmi/main_dcgmi.cpp, line 3419."

Maybe there is some problem with gdb stepping into the source code of the DCGM debug build?

By the way, my gdb version is:
GNU gdb (Ubuntu 8.2-0ubuntu1~16.04.1) 8.2, and when I start gdb for ./dcgmi, it also gives these warnings:
"
BFD: warning: /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin/../lib/libdcgm.so.3: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
BFD: warning: /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin/../lib/libdcgm.so.3: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
"

Thanks for your help.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

You need to update gdb to a version that understands the DWARF 5 debug info format. DCGM is built with GCC 11, which uses DWARF 5 by default. You need GDB 10 or newer to be able to read that debug info.
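
A quick way to check the installed gdb against that requirement is to parse the first line of `gdb --version`. This is a minimal sketch; gdb_major is a hypothetical helper, and it assumes the version number is the last word on the first line, as in the Ubuntu banner quoted earlier in this thread:

```shell
# Hypothetical helper: extract the major version from the first line of
# `gdb --version` output, taking the last word and cutting at the first dot.
gdb_major() {
    version=$(printf '%s\n' "$1" | awk 'NR == 1 { print $NF }')
    printf '%s\n' "${version%%.*}"
}

if command -v gdb >/dev/null 2>&1; then
    major=$(gdb_major "$(gdb --version)")
    if [ "$major" -ge 10 ]; then
        echo "gdb $major: new enough according to the reply above"
    else
        echo "gdb $major: older than 10, upgrade before debugging DCGM"
    fi
fi
```

For the gdb 8.2 banner posted above, the helper yields 8, which is below the stated minimum.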


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon, thanks for your help. I have another question.
As you said,
"The provided command would start nv-hostengine daemon process that writes debug logs into the host.log file."
For instance, if I run
./dcgmi diag -r 3 -g 2
and look at the host.log file, it contains the debug log of many functions (e.g. functions defined in DcgmCacheManager.cpp).
In other words, is it fair to say that host.log covers almost all of the functions used by "./dcgmi diag -r 3 -g 2"?
Thanks.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

The host.log contains only the nv-hostengine part. The real work is done by the nvvs binary, which has its own log. For debugging, you could run nvvs directly.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon
The path is dcgmi->nv-hostengine->nvvs.
In other words, the nvvs binary executes the diagnostic test on the corresponding GPU, and nv-hostengine can use NVML functions to monitor the GPU's performance metrics (e.g. power, temperature, ECC, ...).
Is my understanding right?
Thanks.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

That's correct.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon
I have another question. Does DCGM support CUDA 12 now? After I updated the CUDA driver and ran the previous workflow again, it gave me an error:

_out/Linux-amd64-debug/share/nvidia-validation-suite/nvvs.log:78:2022-12-12 15:56:40.014 ERROR [29094:29094] Detected unsupported Cuda version: 12.0 [/workspaces/DCGM/nvvs/src/TestFramework.cpp:243] [TestFramework::GetPluginDirExtension]

Thanks


nikkon-dev avatar nikkon-dev commented on July 17, 2024


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hello nikkon
I am now debugging "dcgmi diag -r 1".
As you said before, I can run the nvvs binary directly for debugging purposes. However, when I run ./nvvs -h, I cannot find a test option equivalent to "dcgmi diag -r 1".
Could you tell me how to run nvvs directly so that it has the same effect as "dcgmi diag -r 1"?
Thank you very much.


nikkon-dev avatar nikkon-dev commented on July 17, 2024

That's --specifiedtest short

r 1 - short
r 2 - long
r 3 - xlong


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Thanks for your help. It works.


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hello nikkon

I have another debugging issue.
I am now running the command
"./nvvs --specifiedtest pcie"
I found that gdb can follow the source code located in the nvvs folder.
But gdb cannot step into the source code in the pcie folder (e.g. PcieWrapper.cpp).

Let me give a more specific example.
PluginLib.cpp calls a callback function, "m_runTestCB(timeout, numParameters, parms, m_userData);". gdb can step through the source code of PluginLib.cpp, but when I enter "s" at "m_runTestCB", it does not step into the source code of RunTest, which is defined in PcieWrapper.cpp. I also tried to add other breakpoints in the pcie source code, but they do not work either.

Could you tell me how to fix it?

Thanks


nikkon-dev avatar nikkon-dev commented on July 17, 2024

In gdb, you need to set solib-search-path to plugins/cudaXX directory, where XX is your driver's Cuda version: 10, 11 or 12
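
As a sketch of how that setting could be passed on the command line: the helper plugin_dir and the NVVS_DIR default are hypothetical, and the directory layout (plugins/cudaXX under share/nvidia-validation-suite) is taken from paths quoted elsewhere in this thread. `set solib-search-path` and the -ex option are standard gdb features:

```shell
# Hypothetical helper: build the plugin directory for a CUDA major version
# ($2) under the nvidia-validation-suite directory ($1).
plugin_dir() {
    case "$2" in
        10|11|12) printf '%s/plugins/cuda%s\n' "$1" "$2" ;;
        *) echo "unsupported CUDA major version: $2" >&2; return 1 ;;
    esac
}

# Hypothetical layout; adjust NVVS_DIR to your checkout.
NVVS_DIR="${NVVS_DIR:-$PWD/_out/Linux-amd64-debug/share/nvidia-validation-suite}"
PLUGINS=$(plugin_dir "$NVVS_DIR" 11)
echo "set solib-search-path $PLUGINS"

# Passing it to gdb on the command line (commented out here):
# gdb -ex "set solib-search-path $PLUGINS" --args ./nvvs --specifiedtest pcie
```

The same `set solib-search-path` line can also be typed at the gdb prompt before running the program.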


ligeweiwu avatar ligeweiwu commented on July 17, 2024

@nikkon-dev
Hi nikkon

I am building the DCGM source code and using nvvs/dcgmi to perform the diagnostic tests. I can see all the test plugins, and they are all .so files. But when I try to run the "memory bandwidth" diagnostic, I get an error:

./dcgmi diag -r "memory bandwidth" -g 2
Error: requested test "memory bandwidth" was not found among possible test choices.

In my case, all the plugin .so files are located in
/username/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11, and none of them is named "memory bandwidth". I also looked at the source code, and I think it does not have an option named "memory bandwidth". It only has "memtest".

So could you please tell me how to run "memory bandwidth" using the DCGM source code?

By the way, memtest is OK ("./dcgmi diag -r memtest -g 2" works fine, the corresponding libMemtest.so is in plugins/cuda11, and the source code has the option "memtest").

Thanks

