Comments (4)
Could you list your PC configuration? The process was probably killed by the Linux OOM killer, maybe because your GPU does not have enough memory. You can run sudo dmesg | tail -7 to check whether it was killed for OOM. If so, you could try disabling the OOM killer, reducing the resolution, or changing fp32 to fp16 (this may cause some problems). Reducing batch_size did not help much when I tried it. By the way, when I train this model, even an RTX 3090 with 24 GB of memory sometimes gives an OOM error, so a GPU with more memory may be needed :)
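To illustrate why reducing the resolution or switching from fp32 to fp16 lowers memory use, here is a rough back-of-the-envelope sketch. It only counts the bytes of one dense array; the grid shape is hypothetical and not taken from the MobileNeRF code, and real training also holds gradients, optimizer state and temporaries.

```python
from math import prod

# Bytes per element for the two precisions discussed above.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2}


def array_bytes(shape, dtype):
    """Memory of one dense array: number of elements times bytes per element."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2**30


# Hypothetical feature grid, chosen only for illustration.
full_res = (128, 128, 128, 8, 3)
half_res = (64, 64, 64, 8, 3)  # each spatial side halved

for shape in (full_res, half_res):
    for dtype in ("float32", "float16"):
        print(shape, dtype, f"{to_gib(array_bytes(shape, dtype)):.4f} GiB")
```

Switching dtype halves the footprint of the array, while halving each side of a 3D grid cuts it by a factor of eight, which is one reason lowering the resolution tends to have the larger effect.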
from jax3d.
Thank you so much for the reply! I'm using one of the four V100 GPUs. While it is running, it shows the following.
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:62:00.0 Off | 0 |
| N/A 65C P0 275W / 300W | 29505MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Right after it was killed, these logs were printed.
dmesg -T | tail -20
thp_fault_alloc 109032
thp_collapse_alloc 660
[Fri Sep 16 10:37:38 2022] Tasks state (memory values in pages):
[Fri Sep 16 10:37:38 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Fri Sep 16 10:37:38 2022] [ 16912] 0 16912 4661 747 81920 154 0 bash
[Fri Sep 16 10:37:38 2022] [ 19712] 0 19712 21929189 4178964 42479616 524132 0 python
[Fri Sep 16 10:37:38 2022] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,mems_allowed=0-1,oom_memcg=/docker/37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,task_memcg=/docker/37a209c3c5f0070c118aa65fd9ca14ebb82c28b13a3fc702fb4e4cb5614da3cf,task=python,pid=19712,uid=0
[Fri Sep 16 10:37:38 2022] Memory cgroup out of memory: Killed process 19712 (python) total-vm:87716756kB, anon-rss:16416528kB, file-rss:223552kB, shmem-rss:75776kB, UID:0 pgtables:41484kB oom_score_adj:0
[Fri Sep 16 10:37:39 2022] oom_reaper: reaped process 19712 (python), now anon-rss:0kB, file-rss:92372kB, shmem-rss:75776kB
[Fri Sep 16 10:51:02 2022] kauditd_printk_skb: 132 callbacks suppressed
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.455:111772): apparmor="ALLOWED" operation="exec" profile="/usr/sbin/sssd" name="/usr/bin/nsupdate" pid=20469 comm="sssd_be" requested_mask="x" denied_mask="x" fsuid=0 ouid=0 target="/usr/sbin/sssd//null-/usr/bin/nsupdate"
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111773): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/bin/nsupdate" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111774): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/lib/x86_64-linux-gnu/ld-2.27.so" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111775): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/etc/ld.so.cache" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111776): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/liblwres.so.160.0.1" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111777): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/liblwres.so.160.0.1" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111778): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libdns.so.1100.1.1" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111779): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libdns.so.1100.1.1" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111780): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libkrb5.so.3.3" pid=20469 comm="nsupdate" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Fri Sep 16 10:51:02 2022] audit: type=1400 audit(1663350722.459:111781): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/bin/nsupdate" name="/usr/lib/x86_64-linux-gnu/libkrb5.so.3.3" pid=20469 comm="nsupdate" requested_mask="rm" denied_mask="rm" fsuid=0 ouid=0
Yes, looks like it does not have enough memory.
Luckily, I changed all fp32 to fp16 and it runs smoothly now.
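For anyone reading a log like the one above, a small sketch that pulls the numbers out of the "Killed process" line may help. The regular expression is written against the exact line format shown in this thread and is an assumption about other kernel versions; the function name is made up for illustration.

```python
import re

# Matches the "Killed process" line as it appears in the dmesg output above.
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB, shmem-rss:(?P<shmem_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return a dict describing an OOM kill line, or None if it does not match."""
    m = KILL_RE.search(line)
    if m is None:
        return None
    rss_kib = int(m["anon_rss"]) + int(m["file_rss"]) + int(m["shmem_rss"])
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        # True when the kill came from a memory cgroup limit (e.g. a container).
        "memcg": "Memory cgroup out of memory" in line,
        "total_vm_gib": int(m["total_vm"]) / 2**20,
        "rss_gib": rss_kib / 2**20,
    }


LINE = (
    "[Fri Sep 16 10:37:38 2022] Memory cgroup out of memory: Killed process "
    "19712 (python) total-vm:87716756kB, anon-rss:16416528kB, "
    "file-rss:223552kB, shmem-rss:75776kB, UID:0 pgtables:41484kB oom_score_adj:0"
)

info = parse_oom_kill(LINE)
print(info["name"], info["pid"], "memcg" if info["memcg"] else "global",
      f"rss={info['rss_gib']:.2f} GiB")
```

On the line from this thread, the resident memory comes out at roughly 15.9 GiB, which agrees with the rss column of the task table (4178964 pages, assuming 4 KiB pages). The CONSTRAINT_MEMCG and oom_memcg=/docker/... fields suggest the limit that was hit is the host-memory limit of the Docker container's cgroup rather than the GPU's 32 GB.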
from jax3d.
Well, a V100 with 32 GB of memory should be enough for this model, so the OOM is somewhat strange; something else may be responsible for the problem. Changing fp32 to fp16 may work, but I have to say it is not the best solution.
from jax3d.
Hi, I may have found a possible cause of the OOM that is related to JAX. Please see this JAX docs link; it describes some OOM cases when using JAX. Hope it is helpful :)
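The link itself did not survive in this copy of the thread. Assuming it points to the JAX notes on GPU memory allocation, the settings described there are environment variables that must be set before jax is imported. A minimal sketch follows; the helper name and its parameters are made up for illustration, while the three variable names are the ones documented by JAX. Note that these control GPU memory, so they may not help if the kill was caused by a host-memory limit.

```python
import os


def configure_jax_gpu_memory(preallocate=False, mem_fraction=None,
                             platform_allocator=False):
    """Set JAX/XLA GPU memory environment variables (hypothetical helper).

    Must be called before `import jax`, otherwise the settings are ignored.
    """
    # Disable (or enable) preallocation of a large share of GPU memory.
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"

    # Optionally cap the fraction of GPU memory that is preallocated.
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"

    # Optionally allocate and free on demand; slow, mainly useful for debugging.
    if platform_allocator:
        os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"


configure_jax_gpu_memory(preallocate=False)
# import jax  # import only after the variables above are set
```

Turning preallocation off makes JAX allocate GPU memory as needed, which can make it easier to see how much the model actually uses, at the cost of possible fragmentation.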
from jax3d.