To whom it may concern, Hello! I'm very interested in the work mobil

Please provide suggestions to avoid program being killed

google-research commented on June 18, 2024

Please provide suggestions to avoid program being killed

Comments (4)

AuthorityWang commented on June 18, 2024

Could you list your PC configuration? The process was probably killed by the Linux OOM killer, possibly because your GPU does not have enough memory. You can run sudo dmesg | tail -7 to check whether it was killed for OOM. If so, you could try disabling the OOM killer, reducing the resolution, or changing fp32 to fp16 (this may cause some problems). Reducing batch_size did not help much when I tried it. By the way, when I train this model, even an RTX 3090 with 24 GB of memory sometimes gives an OOM error, so a GPU with more memory may be needed :)
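
As a rough illustration of the dmesg check suggested above, the following Python sketch scans kernel log text for OOM-kill lines. The function name find_oom_kills and the regular expression are assumptions made for this example. The wording of kernel messages varies between kernel versions, so the pattern may need adjusting.

```python
import re

# Matches kernel lines such as
#   "Out of memory: Killed process 123 (python) ..."
#   "Memory cgroup out of memory: Killed process 123 (python) ..."
# Assumption: other kernel versions may phrase this differently.
OOM_KILL = re.compile(r"[Oo]ut of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(dmesg_text):
    """Return (pid, process_name) for each OOM kill found in dmesg output."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(dmesg_text)]


# Example kernel line in the format this pattern expects.
sample = (
    "[Fri Sep 16 10:37:38 2022] Memory cgroup out of memory: "
    "Killed process 19712 (python) total-vm:87716756kB"
)
print(find_oom_kills(sample))
```

In practice the text could come from the output of sudo dmesg -T saved to a file. An empty result only means the pattern found nothing, not that no kill occurred.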

Kirstihly commented on June 18, 2024

Thank you so much for the reply! I'm using one of four V100 GPUs. While it is running, nvidia-smi shows the following.

+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   65C    P0   275W / 300W |  29505MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Right after it was killed, this is the logs printed out.

dmesg -T | tail -20
                           thp_fault_alloc 109032
                           thp_collapse_alloc 660
[Fri Sep 16 10:37:38 2022] Tasks state (memory values in pages):
[Fri Sep 16 10:37:38 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Sep 16 10:37:38 2022] [  16912]     0 16912     4661      747    81920      154             0 bash
[Fri Sep 16 10:37:38 2022] [  19712]     0 19712 21929189  4178964 42479616   524132             0 python
[Fri Sep 16 10:37:38 2022] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,mems_allowed=0-1,oom_memcg=/docker/37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,task_memcg=/docker/37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,task=python,pid=19712,uid=0
[Fri Sep 16 10:37:38 2022] Memory cgroup out of memory: Killed process 19712 (python) total-vm:87716756kB, anon-rss:16416528kB, file-rss:223552kB, shmem-rss:75776kB, UID:0 pgtables:41484kB oom_score_adj:0
[Fri Sep 16 10:37:39 2022] oom_reaper: reaped process 19712 (python), now anon-rss:0kB, file-rss:92372kB, shmem-rss:75776kB
[Fri Sep 16 10:51:02 2022] kauditd_printk_skb: 132 callbacks suppressed
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.455:111772): apparmor="ALLOWED" operation="exec" profile="/usr/sbin/sssd" name="/usr/bin/nsupdate" pid=20469 comm="sssd_be" requested_mask="x" denied_mask="x" fsuid=0 ouid=0 target="/usr/sbin/sssd//null-/usr/bin/nsupdate"
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111773): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/bin/nsupdate" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111774): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/lib/x86_64-linux-gnu/ld-2.27.so" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111775): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/etc/ld.so.cache" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111776): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/liblwres.so.160.0.1" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111777): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/liblwres.so.160.0.1" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111778): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libdns.so.1100.1.1" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111779): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libdns.so.1100.1.1" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111780): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libkrb5.so.3.3" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111781): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libkrb5.so.3.3" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0

Yes, looks like it does not have enough memory.
Luckily, I changed all fp32 to fp16 and it runs smoothly now.
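
For readers interpreting the log above: the kill line reports CONSTRAINT_MEMCG and a /docker/ memory cgroup, which points at the container's host-memory limit rather than GPU memory. The sketch below redoes the arithmetic from the logged numbers. The 4 KiB page size is an assumption (the usual x86-64 default), and the helper name kib_to_gib is made up for this example.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, the usual x86-64 default


def kib_to_gib(kib):
    """Convert KiB (what the kernel prints as 'kB') to GiB."""
    return kib / (1024 ** 2)


# Kill line copied from the log above.
kill_line = (
    "Memory cgroup out of memory: Killed process 19712 (python) "
    "total-vm:87716756kB, anon-rss:16416528kB, file-rss:223552kB, "
    "shmem-rss:75776kB, UID:0 pgtables:41484kB oom_score_adj:0"
)

# Collect the three resident-set counters (anon, file, shmem), in KiB.
rss_kib = {name: int(value) for name, value in re.findall(r"(\w+-rss):(\d+)kB", kill_line)}
total_rss_kib = sum(rss_kib.values())

# The task table lists rss in pages: 4178964 for pid 19712.
table_rss_kib = 4178964 * PAGE_KIB

print(f"resident: {kib_to_gib(total_rss_kib):.1f} GiB")
print("matches task table:", total_rss_kib == table_rss_kib)
```

With these numbers the process held roughly 15.9 GiB of host RAM when it was killed, and the sum of the three rss counters agrees with the page count in the task table. The actual limit configured for the container is not shown in the log.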

AuthorityWang commented on June 18, 2024

Well, a V100 with 32 GB of memory should be fine for this model, so the OOM is somewhat strange; something else may be responsible for the problem. Changing fp32 to fp16 may work, but I have to say it is not the best solution.

AuthorityWang commented on June 18, 2024

Hi, I may have found a possible cause of the OOM related to JAX. Please see this JAX docs link; it describes some OOM cases when using JAX. Hope it is helpful :)
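
The JAX documentation has a page on GPU memory allocation; whether that is the page linked above is an assumption. It documents the environment variables XLA_PYTHON_CLIENT_PREALLOCATE and XLA_PYTHON_CLIENT_MEM_FRACTION. By default JAX preallocates a large fraction of GPU memory, so a high nvidia-smi reading does not by itself show how much the model needs. A minimal sketch of setting these variables follows; the helper name configure_jax_gpu_memory is hypothetical, and the settings affect GPU memory only, not a host-memory cgroup limit.

```python
import os


def configure_jax_gpu_memory(preallocate=False, mem_fraction=None):
    """Set XLA client memory options. Call this before the first `import jax`.

    Hypothetical helper for illustration; the environment variable names
    are the ones documented by JAX.
    """
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        # Fraction of total GPU memory JAX may preallocate, e.g. 0.50.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"


configure_jax_gpu_memory(preallocate=False, mem_fraction=0.5)
# import jax  # import only after the environment has been configured
```

Disabling preallocation makes JAX allocate GPU memory on demand, which can make real usage easier to observe, possibly at the cost of more fragmentation.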
