See: <a class="issue-link js-issue-link" data-error-text="Failed to

Following appears to fix the issue (psm_user.h). <div class="snippet-clipboard-con

Unified virtual memory failure in multi-GPU box about opa-psm2 HOT 6 CLOSED

cornelisnetworks commented on August 20, 2024

Unified virtual memory failure in multi-GPU box

from opa-psm2.

Comments (6)

paboyle commented on August 20, 2024

Feedback from NVIDIA:

They said:

"The documentation for cuPointerGetAttribute is here

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html#group__CUDA__UNIFIED_1g0c28ed0aff848042bc0533110e45820c

When testing for the CU_POINTER_ATTRIBUTE_MEMORY_TYPE, the possible values for data are CU_MEMORYTYPE_HOST and CU_MEMORYTYPE_DEVICE only. The other two options are never returned by this call.

A different call to cuPointerGetAttribute is needed to determine if a pointer points to managed memory. Testing for the CU_POINTER_ATTRIBUTE_IS_MANAGED, sets data to either 0 or 1. "

Thus, I think they say it looks like psm_user.h _psmi_is_cuda_mem is being used to indicate the CUDA IPC can be used, but is false triggering when the memory is managed, according to the CUDA documentation.

However, I disagree with them:

I have read the documentation, and it is actually unclear what the behaviour is defined as under unified memory pointers.

In fact it seems to miss a few words to say what pointer type is being talked about in places.

"Returns in *data the physical memory type of the memory that ptr addresses as a CUmemorytype enumerated value."

Can reasonably be interpreted that the Unified memory type should return CU_MEMORYTYPE_UNIFIED, on account of it being unified memory type.

It seems very very odd to have CU_MEMORYTYPE_UNIFIED and a query function for MEMORY_TYPE that is unable to return some of the values.

Whatever happens: the combination of CUDA and PSM2 has a bug until this is fixed and the usage agreed upon.

from opa-psm2.

paboyle commented on August 20, 2024

Following appears to fix the issue (psm_user.h).

PSMI_ALWAYS_INLINE(
int
_psmi_is_cuda_mem(void *ptr))
{
        CUresult cres;
        CUmemorytype mt;
        unsigned uvm;

        cres = psmi_cuPointerGetAttribute(&mt, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, (CUdeviceptr) ptr);

        if ((cres == CUDA_SUCCESS) && (mt == CU_MEMORYTYPE_DEVICE)) {

          cres = psmi_cuPointerGetAttribute(&uvm, CU_POINTER_ATTRIBUTE_IS_MANAGED, (CUdeviceptr) ptr);

          if ((cres==CUDA_SUCCESS) && (uvm==0) ) return 1;
          else                                   return 0;

        } else {

          return 0;

        }
}

Personally, I think cuPointerGetAttribute looks to be in the wrong, though

from opa-psm2.

paboyle commented on August 20, 2024

The cuda 10.1 execution of the code was run on a system with the cuda 9.2 kernel driver. it is possible (Tim Lanfear produced same output with cuda 9.2 and 10.1) that the 10.1 behaviour will match 9.2 once the kernel driver is updated.

This is on a centrally run supercomputer, so I can't update kernel drivers to check.

from opa-psm2.

paboyle commented on August 20, 2024

Received the following from Nvidia, and they confirm this is their defined behaviour of cuPointerGetAttribute. Perhaps even worse, I think this means the returned class will vary page by page with UVM according to location:

The behaviour of cuPointerGetAttribute() with the CU_POINTER_ATTRIBUTE_MEMORY_TYPE attribute is that it will return the physical memory type of the memory that the pointer addresses. With a pointer to managed memory, at any given time this could be host memory or device memory depending on where the data is mapped at that time. Putting it another way, ptr is a virtual address backed by some physical storage, and CU_POINTER_ATTRIBUTE_MEMORY_TYPE allows you to find where the pointer is at this moment in time.

While the documentation is supposed to make it clear that only these two values can be returned (if cuPointerGetAttribute() is successful), I do think it could be improved, especially since reusing the CUmemorytype enum might suggest otherwise.

I'm not familiar with PSM2, but from the context I gather it is going to use cudaIpcGetMemHandle() (or driver API equivalent) and then pass the returned pointer to another process. Unfortunately, the IPC API is not supported for managed memory allocations. That's because in the source process the CUDA Runtime is able to intercept page faults on the allocation and handle them, but the receiving process does not know that the pointer is managed by the CUDA driver and so tries to use the normal kernel page fault handler, which doesn't know about the GPU (*).

Without knowing the rest of the code, I'm not sure if your proposed fix would work – it depends what PSM2 does if it determines that the pointer is "not a CUDA pointer."

It may also be helpful to clarify the difference between "unified" and "managed" memory. Unified Memory is the ability for a virtual memory allocation to be physically located in either device or host memory, and for page faults from CPU or GPU to be handled to access the correct memory or migrate the page as appropriate (e.g. based on heuristics). Managed Memory is unified memory that is managed by the CUDA driver rather than the Linux kernel.

(*) As you may know, we have been working with the Linux community on the Heterogeneous Memory Manager (HMM) in the Linux kernel. HMM allows the Linux kernel to understand about pointers in non-conventional memory (e.g. GPU memory) which means that we no longer need "managed memory" since the memory can be handled by the kernel instead of by the CUDA driver. With HMM, you would be able to use malloc() instead of cudaMallocManaged() and you should be able to pass the pointer via IPC and have it behave correctly.

from opa-psm2.

mwheinz commented on August 20, 2024

Reviewing old issues - it appears that Adam submitted a pair of patches for this back in 2019.

@paboyle - I know it's been a ridiculously long time but do you know if this was fixed in more recent IFS releases or is this still a problem for you?

from opa-psm2.

mwheinz commented on August 20, 2024

Pete has agreed that PSM2 was patched to correct the issue.

from opa-psm2.

Unified virtual memory failure in multi-GPU box about opa-psm2 HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent