
Comments (8)

skeydan commented on June 6, 2024

... but this is not tabnet-specific (as opposed to: it always happens with torch)?

I found I always have to restart R after CUDA OOM.
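A sketch of what can be tried before restarting R (hedged: it assumes a CUDA-enabled torch build, the variable name `x` is hypothetical, and it only helps if the R wrappers for the dead tensors can actually be collected):

```r
library(torch)

# Drop the R bindings that still reference large CUDA tensors
# ("x" is a hypothetical name), then force a collection so that
# torch's finalizers run, and finally ask the caching allocator
# to hand its unused blocks back to the driver.
rm(x)
gc()
cuda_empty_cache()
```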

from tabnet.

dfalbel commented on June 6, 2024

This doesn't always happen. For example, I can easily do:

library(torch)
x <- torch_randn(1e3, 8e6, device = "cuda")
#> Error in (function (size, options) : CUDA out of memory. Tried to allocate 29.80 GiB (GPU 0; 10.91 GiB total capacity; 0 bytes already allocated; 10.20 GiB free; 0 bytes reserved in total by PyTorch)
#> Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
#> frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f3b388b6b89 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10.so)
#> frame #1: <unknown function> + 0x25cbf (0x7f3b38650cbf in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #2: <unknown function> + 0x27227 (0x7f3b38652227 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #3: <unknown function> + 0x2788a (0x7f3b3865288a in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x35b (0x7f3ad288f97b in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #5: <unknown function> + 0x4023416 (0x7f3ad2a69416 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #6: <unknown function> + 0x404ac9f (0x7f3ad2a90c9f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #7: <unknown function> + 0x1322e5d (0x7f3b28c15e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #8: <unknown function> + 0x12fd301 (0x7f3b28bf0301 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #9: <unknown function> + 0x1305f1f (0x7f3b28bf8f1f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #10: <unknown function> + 0x1322e5d (0x7f3b28c15e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #11: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xc3 (0x7f3b28cf7783 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #12: at::native::randn(c10::ArrayRef<long>, c10::optional<at::Generator>, c10::TensorOptions const&) + 0x40 (0x7f3b288ea720 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #13: at::native::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x30 (0x7f3b288ea850 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #14: <unknown function> + 0x15b62b1 (0x7f3b28ea92b1 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #15: <unknown function> + 0x15f2b9e (0x7f3b28ee5b9e in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #16: <unknown function> + 0xb5e655 (0x7f3b28451655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #17: <unknown function> + 0x132d1eb (0x7f3b28c201eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #18: <unknown function> + 0x13037c0 (0x7f3b28bf67c0 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #19: <unknown function> + 0xb5e655 (0x7f3b28451655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #20: <unknown function> + 0x132d1eb (0x7f3b28c201eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #21: at::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x1f7 (0x7f3b28d1bb57 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #22: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&)::{lambda()#1}::operator()() const + 0x99 (0x7f3b38fbbc4d in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #23: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x54 (0x7f3b38fbbce9 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #24: _lantern_randn_intarrayref_tensoroptions + 0x14c (0x7f3b38dac9b2 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #25: cpp_torch_namespace_randn_size_IntArrayRef(std::vector<long, std::allocator<long> >, XPtrTorchTensorOptions) + 0x58 (0x7f3b39760fa8 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #26: _torch_cpp_torch_namespace_randn_size_IntArrayRef + 0x87 (0x7f3b39554a97 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #27: <unknown function> + 0xf932c (0x7f3b4caa932c in /usr/lib/R/lib/libR.so)
#> frame #28: <unknown function> + 0xf9826 (0x7f3b4caa9826 in /usr/lib/R/lib/libR.so)
#> frame #29: <unknown function> + 0x137106 (0x7f3b4cae7106 in /usr/lib/R/lib/libR.so)
#> frame #30: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #31: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #32: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #33: Rf_eval + 0x353 (0x7f3b4caf38c3 in /usr/lib/R/lib/libR.so)
#> frame #34: <unknown function> + 0xc650d (0x7f3b4ca7650d in /usr/lib/R/lib/libR.so)
#> frame #35: <unknown function> + 0x137106 (0x7f3b4cae7106 in /usr/lib/R/lib/libR.so)
#> frame #36: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #37: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #38: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #39: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #40: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #41: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #42: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #43: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #44: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #45: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #46: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #47: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #48: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #49: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #50: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #51: Rf_eval + 0x353 (0x7f3b4caf38c3 in /usr/lib/R/lib/libR.so)
#> frame #52: <unknown function> + 0xc650d (0x7f3b4ca7650d in /usr/lib/R/lib/libR.so)
#> frame #53: <unknown function> + 0x137106 (0x7f3b4cae7106 in /usr/lib/R/lib/libR.so)
#> frame #54: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #55: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #56: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #57: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #58: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #59: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #60: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #61: Rf_eval + 0x353 (0x7f3b4caf38c3 in /usr/lib/R/lib/libR.so)
#> frame #62: <unknown function> + 0x148693 (0x7f3b4caf8693 in /usr/lib/R/lib/libR.so)
#> frame #63: Rf_eval + 0x572 (0x7f3b4caf3ae2 in /usr/lib/R/lib/libR.so)
x <- torch_randn(1e3, 1e6, device = "cuda")

Note that the first call fails but the second one succeeds. This wouldn't be possible if the memory were not released.
It could be a bug in torch, but it might be easier to reproduce here.


skeydan commented on June 6, 2024

but does the first ever get as far as allocating the memory?


dfalbel commented on June 6, 2024

Maybe not, but then that indicates the problem occurs before the allocation that fails, and not that torch always fails to release CUDA memory when there's a CUDA OOM error.

You can do something like this too:

library(torch)
f <- function() {
  x <- list()
  for (i in 1:10000) {
    x[[i]] <- torch_randn(1e6, device = "cuda")
  }
}
f()
#> Error in (function (size, options) : CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.91 GiB total capacity; 10.20 GiB already allocated; 2.50 MiB free; 10.20 GiB reserved in total by PyTorch)
#> Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
#> frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7ff813de5b89 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10.so)
#> frame #1: <unknown function> + 0x25cbf (0x7ff813b7fcbf in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #2: <unknown function> + 0x27227 (0x7ff813b81227 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #3: <unknown function> + 0x2788a (0x7ff813b8188a in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x35b (0x7ff7addbe97b in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #5: <unknown function> + 0x4023416 (0x7ff7adf98416 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #6: <unknown function> + 0x404ac9f (0x7ff7adfbfc9f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #7: <unknown function> + 0x1322e5d (0x7ff804144e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #8: <unknown function> + 0x12fd301 (0x7ff80411f301 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #9: <unknown function> + 0x1305f1f (0x7ff804127f1f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #10: <unknown function> + 0x1322e5d (0x7ff804144e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #11: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xc3 (0x7ff804226783 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #12: at::native::randn(c10::ArrayRef<long>, c10::optional<at::Generator>, c10::TensorOptions const&) + 0x40 (0x7ff803e19720 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #13: at::native::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x30 (0x7ff803e19850 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #14: <unknown function> + 0x15b62b1 (0x7ff8043d82b1 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #15: <unknown function> + 0x15f2b9e (0x7ff804414b9e in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #16: <unknown function> + 0xb5e655 (0x7ff803980655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #17: <unknown function> + 0x132d1eb (0x7ff80414f1eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #18: <unknown function> + 0x13037c0 (0x7ff8041257c0 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #19: <unknown function> + 0xb5e655 (0x7ff803980655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #20: <unknown function> + 0x132d1eb (0x7ff80414f1eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #21: at::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x1f7 (0x7ff80424ab57 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #22: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&)::{lambda()#1}::operator()() const + 0x99 (0x7ff8144eac4d in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #23: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x54 (0x7ff8144eace9 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #24: _lantern_randn_intarrayref_tensoroptions + 0x14c (0x7ff8142db9b2 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #25: cpp_torch_namespace_randn_size_IntArrayRef(std::vector<long, std::allocator<long> >, XPtrTorchTensorOptions) + 0x58 (0x7ff814c8ffa8 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #26: _torch_cpp_torch_namespace_randn_size_IntArrayRef + 0x87 (0x7ff814a83a97 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #27: <unknown function> + 0xf932c (0x7ff827fd832c in /usr/lib/R/lib/libR.so)
#> frame #28: <unknown function> + 0xf9826 (0x7ff827fd8826 in /usr/lib/R/lib/libR.so)
#> frame #29: <unknown function> + 0x137106 (0x7ff828016106 in /usr/lib/R/lib/libR.so)
#> frame #30: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #31: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #32: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #33: Rf_eval + 0x353 (0x7ff8280228c3 in /usr/lib/R/lib/libR.so)
#> frame #34: <unknown function> + 0xc650d (0x7ff827fa550d in /usr/lib/R/lib/libR.so)
#> frame #35: <unknown function> + 0x137106 (0x7ff828016106 in /usr/lib/R/lib/libR.so)
#> frame #36: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #37: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #38: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #39: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #40: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #41: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #42: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #43: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #44: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #45: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #46: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #47: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #48: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #49: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #50: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #51: Rf_eval + 0x353 (0x7ff8280228c3 in /usr/lib/R/lib/libR.so)
#> frame #52: <unknown function> + 0xc650d (0x7ff827fa550d in /usr/lib/R/lib/libR.so)
#> frame #53: <unknown function> + 0x137106 (0x7ff828016106 in /usr/lib/R/lib/libR.so)
#> frame #54: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #55: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #56: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #57: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #58: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #59: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #60: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #61: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #62: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #63: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
x <- torch_randn(1e3, 1e6)
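If a program needs to survive the OOM instead of aborting, one pattern along the lines of the example above is to catch the error, collect, and retry with a smaller request. This is a hedged sketch: `alloc_or_shrink` and the sizes are illustrative, not torch API, and it assumes a CUDA device is available.

```r
library(torch)

# Hedged sketch: catch a CUDA OOM, let R finalize unreferenced
# tensor wrappers, then retry with a smaller tensor.
alloc_or_shrink <- function(n) {
  tryCatch(
    torch_randn(n, device = "cuda"),
    error = function(e) {
      gc()  # finalize dead tensor wrappers so their memory can be reused
      torch_randn(round(n / 10), device = "cuda")
    }
  )
}

# First attempt is deliberately far too big; the retry should fit.
x <- alloc_or_shrink(8e9)
```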


skeydan commented on June 6, 2024

Convinced! So it really depends on whether the action following the OOM error still "fits" or not:

library(torch)
f <- function() {
  x <- list()
  for (i in 1:10000) {
    x[[i]] <- torch_randn(1e6, device = "cuda")
  }
}
f()

nvidia-smi says

4025MiB /  4035MiB

y <- torch_randn(1, device = "cuda")

nvidia-smi says

4027MiB /  4035MiB 

So that's good to know!


dfalbel commented on June 6, 2024

If you try a larger allocation for y, gc() will be called and eventually all GPU memory will be released.
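To release the memory without waiting for a later allocation to trigger it, something like this should also work (hedged: `cuda_empty_cache()` can only return blocks that are no longer backing live tensors, and this needs a CUDA build):

```r
library(torch)

# After f() above has failed, its local list of tensors is unreachable.
# An explicit collection finalizes those wrappers, and emptying the
# cache returns the freed blocks to the driver, so the usage reported
# by nvidia-smi should drop without restarting R.
gc()
cuda_empty_cache()
```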


skeydan commented on June 6, 2024

True! In this isolated setup, it works.


dfalbel commented on June 6, 2024

Should be fixed with d5dbfe3

