Hi, I tried to run the FODO example without changes on the Perlmutter GPU partition an

Illegal memory access with FODO example on GPU (impactx issue, closed)

ecp-warpx commented on July 16, 2024 1

Illegal memory access with FODO example on GPU

Comments (7)

n01r commented on July 16, 2024 1

With cudatoolkit/11.5, running cuda-gdb fails with a Python initialization error:

(impactx) mgarten@nid001512:/pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU_DEBUG> cuda-gdb
cuda-gdb: warning: PyMemoryView_FromObject: called while Python is not available!
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = '/opt/cray/pe/python/3.9.7.1'
  program name = 'python3'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/global/homes/m/mgarten/sw/perlmutter/venvs/impactx/bin/python3'
  sys.base_prefix = '/opt/cray/pe/python/3.9.7.1'
  sys.base_exec_prefix = '/opt/cray/pe/python/3.9.7.1'
  sys.platlibdir = 'lib'
  sys.executable = '/global/homes/m/mgarten/sw/perlmutter/venvs/impactx/bin/python3'
  sys.prefix = '/opt/cray/pe/python/3.9.7.1'
  sys.exec_prefix = '/opt/cray/pe/python/3.9.7.1'
  sys.path = [
    '/opt/cray/pe/python/3.9.7.1',
    '/opt/cray/pe/python/3.9.7.1/lib/python39.zip',
    '/opt/cray/pe/python/3.9.7.1/lib/python3.9',
    '/opt/cray/pe/python/3.9.7.1/lib/python3.9/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
cuda-gdb: warning: PyMemoryView_FromObject: called while Python is not available!
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
cuda-gdb: warning: PyMemoryView_FromObject: called while Python is not available!
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
cuda-gdb: warning: PyMemoryView_FromObject: called while Python is not available!
  File "<frozen importlib._bootstrap_external>", line 846, in exec_module
cuda-gdb: warning: PyMemoryView_FromObject: called while Python is not available!
  File "<frozen importlib._bootstrap_external>", line 951, in get_code
cuda-gdb: warning: PyMemoryView_FromObject: called while Python is not available!
SystemError: <class 'memoryview'> returned NULL without setting an error

But swapping it out for cudatoolkit/11.0 lets me run the debugger.

cuda-gdb run

(cuda-gdb) file impactx
Reading symbols from impactx...done.
(cuda-gdb) run input_fodo.in amrex.throw_exception=1 amrex.signal_handling=0
Starting program: /pscratch/sd/m/mgarten/impactx/001_FODO_single-GPU_DEBUG/impactx input_fodo.in amrex.throw_exception=1 amrex.signal_handling=0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/opt/cray/pe/gcc/11.2.0/snos/lib64/libstdc++.so.6.0.29-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /opt/cray/pe/gcc/11.2.0/snos/lib64/libstdc++.so.6.0.29-gdb.py
line to your configuration file "/global/homes/m/mgarten/.cuda-gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/global/homes/m/mgarten/.cuda-gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
[New Thread 0x7fffe5ed7000 (LWP 67128)]
Initializing CUDA...
[Detaching after fork from child process 67129]
[New Thread 0x7fffdbbb0000 (LWP 67141)]
[New Thread 0x7fffdb3af000 (LWP 67142)]
warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

CUDA initialized with 1 GPU per MPI rank; 1 GPU(s) used in total
warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
AMReX (22.06-39-g2d931f63cb4d) initialized
warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

boxArray(0) (BoxArray maxbox(1)
       m_ref->m_hash_sig(0)
       ((0,0,0) (7,7,7) (0,0,0)) )

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

Beam kinetic energy (MeV): 2000
Bunch charge (C): 0
Particle type: electron
Number of particles: 10000
Beam distribution type: waterbag
Static units
Initialized beam distribution parameters
warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

# of particles: 10000
Initialized element list
 ++++ Starting step=0
warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)

warning: Cuda API error detected: cuPointerGetAttribute returned (0x1)


CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x6f5c240 (Drift.H:69)

Thread 1 "impactx" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 61, block (0,0,0), thread (128,0,0), device 0, sm 0, warp 4, lane 0]
0x0000000006f5c250 in impactx::Drift::operator() (this=0x131f9ad0, p=..., px=<optimized out>, py=<optimized out>, pt=<optimized out>, refpart=...)
    at /global/homes/m/mgarten/src/impactx/src/particles/elements/Drift.H:69
69	            p.pos(0) = x + m_ds * px;

Backtrace

(cuda-gdb) backtrace
#0  0x0000000006f5c250 in impactx::Drift::operator() (this=0x131f9ad0, p=..., px=<optimized out>, py=<optimized out>, pt=<optimized out>, refpart=...)
    at /global/homes/m/mgarten/src/impactx/src/particles/elements/Drift.H:69
#1  impactx::detail::PushSingleParticle<impactx::Drift const&>::operator() (this=0x7fffddfffbf8, i=<optimized out>) at /global/homes/m/mgarten/src/impactx/src/particles/Push.cpp:81
#2  amrex::detail::call_f<impactx::detail::PushSingleParticle<impactx::Drift const&>, int> (f=..., i=<optimized out>)
    at /global/u1/m/mgarten/src/impactx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H:752
#3  0x00000000070cd460 in _ZZN5amrex11ParallelForIiRKN7impactx6detail18PushSingleParticleIRKNS1_5DriftEEEvEENSt9enable_ifIXsr5amrex19MaybeDeviceRunnableIT0_vEE5valueEvE4typeERKNS_3Gpu10KernelInfoET_OSB_ENKUlvE_clEv (this=<optimized out>) at /global/u1/m/mgarten/src/impactx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuLaunchFunctsG.H:802
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I could not see any values for variables because the compiler optimizes them out in Drift.H.

69	            p.pos(0) = x + m_ds * px;
(cuda-gdb) print px
$1 = <optimized out>
(cuda-gdb) print x
$2 = <optimized out>
(cuda-gdb) print p
$3 = (@local _ZN7impactx5Drift5PTypeE & @local) <error reading variable>
(cuda-gdb) break Drift.H:67

So I built again with the options -g -O0, and hopefully I will see more.

Edit:
... I actually tried to build it again without optimization but it still shows <optimized out>. Should I have deleted the build directory completely before?

cemitch99 commented on July 16, 2024 1

The object p is a complicated struct, so I think the final line makes sense. I'm not sure whether gdb will allow print p.pos(0), etc.

ax3l commented on July 16, 2024 1

In the end, the current AMReX particle AoS object p is really just a

struct {
   amrex::ParticleReal r[n];
   int i[m];
};

You could check in cuda-gdb whether the object p itself is valid memory (on the device) by printing its address and checking its range, and then printing its first member (which we interpret as position x).
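To make the layout argument concrete, here is a minimal host-only sketch of such a struct. The names ParticleSketch, first_member_at_object_address and inspect, the choice of double for the real type, and the sizes n = 4 and m = 2 are all assumptions for illustration; this is not the actual AMReX particle type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Assumptions for illustration only: real type and component counts.
using ParticleReal = double;
constexpr int n = 4;  // number of real components (hypothetical)
constexpr int m = 2;  // number of integer components (hypothetical)

// Hypothetical stand-in for the AoS particle layout described above.
struct ParticleSketch {
    ParticleReal r[n];
    int i[m];

    // pos(0) is interpreted as position x, i.e. the first real component.
    ParticleReal& pos(int d) { return r[d]; }
    ParticleReal const& pos(int d) const { return r[d]; }
};

// The real array sits at offset 0, so the object's address is also the
// address of its first member.
static_assert(offsetof(ParticleSketch, r) == 0,
              "first real component starts at the beginning of the struct");

// Returns true if the first real component lives at the object's own address.
inline bool first_member_at_object_address(ParticleSketch const& p) {
    return static_cast<void const*>(&p) == static_cast<void const*>(&p.r[0]);
}

// Prints the address of p and its first member, similar to what one would
// inspect by hand in a debugger.
inline void inspect(ParticleSketch const& p) {
    assert(first_member_at_object_address(p));
    std::printf("p at %p, x = %g\n", static_cast<void const*>(&p), p.pos(0));
}
```

With this layout, a valid address for p implies a valid address for its first member, so an illegal access on p.pos(0) suggests that the address of p itself is not valid where it is dereferenced.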

... I actually tried to build it again without optimization but it still shows <optimized out>. Should I have deleted the build directory completely before?

Yes, you need to redo the configure step with a fresh build dir. CXXFLAGS are only added at the first configure in a build directory (they change defaults for the configure step).

ax3l commented on July 16, 2024 1

That should work in general. If you are unsure, though, doing everything in a single configure is the safest bet.
You can configure with -DCMAKE_VERBOSE_MAKEFILE=ON to see exactly what ends up on the compiler command line.

ax3l commented on July 16, 2024

Memo from our discussion:

Debug workflow: https://warpx.readthedocs.io/en/latest/usage/workflows/debugging.html
cuda-gdb with AMReX runtime options amrex.throw_exception = 1 amrex.signal_handling = 0

n01r commented on July 16, 2024

yes, you need to redo the configure step with a fresh build dir. CXXFLAGS are only added at the first configure in a build directory (they change defaults for the configure step).

But deleting build, running cmake -S . -B build and then doing ccmake build, editing stuff, hitting c to configure and g to generate should work, no?

ax3l commented on July 16, 2024

cc @WeiqunZhang @atmyers @kngott turns out this is in part a bug in AMReX init with GPU-aware MPI on Perlmutter.

If I set export MPICH_GPU_SUPPORT_ENABLED=0, the warning Cuda API error detected: cuPointerGetAttribute returned (0x1) vanishes. Backtrace:

The other issue above, CUDA Exception: Warp Illegal Address, occurs when we try to access fundamental types (not even pointers) of lattice elements on device, e.g., the amrex::ParticleReal m_ds member. The problem is so weird that I am starting to think it's a compiler bug... and it probably is: #174
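For readers unfamiliar with the code, the following host-only sketch shows the shape of the access in question: an element functor reads its own plain-value member m_ds while updating a particle position, mirroring the statement reported at Drift.H:69. The types ParticleSketch and DriftSketch, the use of double, and the reading of m_ds as a segment length are assumptions for illustration. This is not the ImpactX implementation and does not reproduce the GPU failure.

```cpp
#include <cassert>

// Assumption for illustration only: the real type is double.
using ParticleReal = double;

// Hypothetical minimal particle with a position accessor.
struct ParticleSketch {
    ParticleReal r[3];
    ParticleReal& pos(int d) { return r[d]; }
    ParticleReal const& pos(int d) const { return r[d]; }
};

// Hypothetical stand-in for a drift element. m_ds is a fundamental-type
// member (no pointer involved), assumed here to be a segment length.
struct DriftSketch {
    ParticleReal m_ds;

    // Advance the x position by m_ds * px, as in the statement
    // `p.pos(0) = x + m_ds * px;` from the debugger output.
    void operator()(ParticleSketch& p, ParticleReal px) const {
        ParticleReal const x = p.pos(0);
        p.pos(0) = x + m_ds * px;
    }
};
```

On the host this is an ordinary member read. The surprising part of the report is that the same kind of read triggers an illegal address on device.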
