
Comments (4)

AuthorityWang commented on June 18, 2024

Could you list your PC configuration? The process was probably killed by the Linux OOM killer, perhaps because your GPU does not have enough memory. You can run sudo dmesg | tail -7 to check whether it was killed for OOM. If so, you could try disabling the OOM killer, reducing the resolution, or changing fp32 to fp16 (which may cause some problems). Reducing batch_size did not help much when I tried it. By the way, when I trained this model, even an RTX 3090 with 24 GB of memory sometimes hit an OOM error, so a GPU with more memory may be needed :)
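To make the fp32 versus fp16 trade-off mentioned above concrete, here is a minimal sketch using only the Python standard library. It is not code from jax3d: the tensor shape and the helper names tensor_bytes and round_to_fp16 are hypothetical, chosen only to show that half precision halves storage while dropping precision.

```python
import struct
from math import prod

# Bytes per element for the two precisions discussed in this thread.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def tensor_bytes(shape, dtype):
    """Return the storage needed for a dense tensor of the given shape."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


def round_to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical activation tensor (assumption, not taken from the model):
# a batch of 4 feature maps at 800x800 with 64 channels.
shape = (4, 800, 800, 64)
print(tensor_bytes(shape, "fp32") / 2**20, "MiB in fp32")
print(tensor_bytes(shape, "fp16") / 2**20, "MiB in fp16")

# fp16 has a 10-bit mantissa, so a small change to a value near 1.0
# can be rounded away entirely.
print(round_to_fp16(1.0001))
```

The rounding in the last line is one likely source of the "problems" that switching to fp16 can cause, for example small gradient updates being lost.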

from jax3d.

Kirstihly commented on June 18, 2024

Thank you so much for the reply! I'm using one of four V100 GPUs. While the job is running, the GPU status looks like this.

+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   65C    P0   275W / 300W |  29505MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Right after the process was killed, these logs were printed.

dmesg -T | tail -20
                           thp_fault_alloc 109032
                           thp_collapse_alloc 660
[Fri Sep 16 10:37:38 2022] Tasks state (memory values in pages):
[Fri Sep 16 10:37:38 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Sep 16 10:37:38 2022] [  16912]     0 16912     4661      747    81920      154             0 bash
[Fri Sep 16 10:37:38 2022] [  19712]     0 19712 21929189  4178964 42479616   524132             0 python
[Fri Sep 16 10:37:38 2022] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,mems_allowed=0-1,oom_memcg=/docker/37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,task_memcg=/docker/37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,task=python,pid=19712,uid=0
[Fri Sep 16 10:37:38 2022] Memory cgroup out of memory: Killed process 19712 (python) total-vm:87716756kB, anon-rss:16416528kB, file-rss:223552kB, shmem-rss:75776kB, UID:0 pgtables:41484kB oom_score_adj:0
[Fri Sep 16 10:37:39 2022] oom_reaper: reaped process 19712 (python), now anon-rss:0kB, file-rss:92372kB, shmem-rss:75776kB
[Fri Sep 16 10:51:02 2022] kauditd_printk_skb: 132 callbacks suppressed
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.455:111772): apparmor="ALLOWED" operation="exec" profile="/usr/sbin/sssd" name="/usr/bin/nsupdate" pid=20469 comm="sssd_be" requested_mask="x" denied_mask="x" fsuid=0 ouid=0 target="/usr/sbin/sssd//null-/usr/bin/nsupdate"
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111773): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/bin/nsupdate" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111774): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/lib/x86_64-linux-gnu/ld-2.27.so" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111775): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/etc/ld.so.cache" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111776): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/liblwres.so.160.0.1" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111777): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/liblwres.so.160.0.1" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111778): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libdns.so.1100.1.1" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111779): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libdns.so.1100.1.1" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111780): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libkrb5.so.3.3" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111781): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libkrb5.so.3.3" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0

Yes, it looks like the process ran out of memory.
Luckily, after I changed all fp32 to fp16, it runs smoothly now.
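For anyone reading a similar log, the kill line can be checked programmatically. The sketch below is an illustration under assumptions: the regular expression is written against the message format shown in the log above, other kernel versions may word it differently, and the helper name parse_oom_kill is hypothetical. Note that the kernel OOM killer accounts for host RAM rather than GPU memory, and the "Memory cgroup out of memory" wording together with CONSTRAINT_MEMCG suggests the Docker container's memory limit was reached.

```python
import re

# Pattern for the kernel's "Killed process" line, based on the log above.
KILL_LINE = re.compile(
    r"(?P<scope>Memory cgroup out of memory|Out of memory): "
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon>\d+)kB, file-rss:(?P<file>\d+)kB, "
    r"shmem-rss:(?P<shmem>\d+)kB"
)


def parse_oom_kill(line):
    """Extract pid, name, and resident memory from an OOM kill line."""
    match = KILL_LINE.search(line)
    if match is None:
        return None
    rss_kb = sum(int(match[key]) for key in ("anon", "file", "shmem"))
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        # True when a cgroup (for example a container) limit was hit.
        "cgroup_limit": match["scope"].startswith("Memory cgroup"),
        "rss_kb": rss_kb,
        "rss_gib": rss_kb / 2**20,
    }


# The kill line copied from the dmesg output in this thread.
line = (
    "[Fri Sep 16 10:37:38 2022] Memory cgroup out of memory: Killed process "
    "19712 (python) total-vm:87716756kB, anon-rss:16416528kB, "
    "file-rss:223552kB, shmem-rss:75776kB, UID:0 pgtables:41484kB "
    "oom_score_adj:0"
)
info = parse_oom_kill(line)
print(info)
```

As a cross-check, the task table above lists rss as 4178964 pages for the same process; at 4 kB per page that equals the 16715856 kB total computed here, roughly 15.9 GiB of host memory.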


AuthorityWang commented on June 18, 2024

Well, a V100 with 32 GB of memory should be enough for this model, so the OOM is somewhat strange; something else may be responsible for the problem. Changing fp32 to fp16 may work, but I have to say it is not the best solution.


AuthorityWang commented on June 18, 2024

Hi, I may have found a possible cause of the OOM related to JAX. Please see this JAX docs link, which describes some cases where OOM occurs when using JAX. I hope it helps :)
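The link itself did not survive in this copy of the thread. Assuming it points to JAX's GPU memory allocation documentation, the settings described there can be applied roughly as follows. This is a sketch, not the jax3d setup: the environment variable names come from the JAX documentation, while the helper configure_jax_gpu_memory and the 0.5 fraction are hypothetical. These options control GPU memory preallocation, so they may not affect a host-memory kill like the one logged above.

```python
import os


def configure_jax_gpu_memory(preallocate=False, mem_fraction=None):
    """Set XLA client memory options; call before the first `import jax`."""
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = (
        "true" if preallocate else "false"
    )
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        # Fraction of GPU memory JAX may preallocate when preallocation is on.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"
    return {
        key: value
        for key, value in os.environ.items()
        if key.startswith("XLA_PYTHON_CLIENT_")
    }


# Hypothetical values: disable preallocation and cap the fraction at 50%.
settings = configure_jax_gpu_memory(preallocate=False, mem_fraction=0.5)
print(settings)

# import jax  # import JAX only after the variables above are set
```

The same variables can also be exported in the shell before launching training, which avoids any dependence on import order.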

