
Comments (8)

skeydan commented on June 6, 2024

... but this is not tabnet-specific (as opposed to: it always happens with torch)?

I found I always have to restart R after CUDA OOM.
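A sketch of what can be tried before restarting R (hedged: it assumes a CUDA-enabled torch build, the variable name `x` is hypothetical, and it only helps if the R wrappers for the dead tensors can actually be collected):

```r
library(torch)

# Drop the R bindings that still reference large CUDA tensors
# ("x" is a hypothetical name), then force a collection so that
# torch's finalizers run, and finally ask the caching allocator
# to hand its unused blocks back to the driver.
rm(x)
gc()
cuda_empty_cache()
```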

from tabnet.

dfalbel commented on June 6, 2024

This doesn't always happen. For example, I can easily do:

library(torch)
x <- torch_randn(1e3, 8e6, device = "cuda")
#> Error in (function (size, options) : CUDA out of memory. Tried to allocate 29.80 GiB (GPU 0; 10.91 GiB total capacity; 0 bytes already allocated; 10.20 GiB free; 0 bytes reserved in total by PyTorch)
#> Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
#> frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f3b388b6b89 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10.so)
#> frame #1: <unknown function> + 0x25cbf (0x7f3b38650cbf in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #2: <unknown function> + 0x27227 (0x7f3b38652227 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #3: <unknown function> + 0x2788a (0x7f3b3865288a in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x35b (0x7f3ad288f97b in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #5: <unknown function> + 0x4023416 (0x7f3ad2a69416 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #6: <unknown function> + 0x404ac9f (0x7f3ad2a90c9f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #7: <unknown function> + 0x1322e5d (0x7f3b28c15e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #8: <unknown function> + 0x12fd301 (0x7f3b28bf0301 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #9: <unknown function> + 0x1305f1f (0x7f3b28bf8f1f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #10: <unknown function> + 0x1322e5d (0x7f3b28c15e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #11: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xc3 (0x7f3b28cf7783 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #12: at::native::randn(c10::ArrayRef<long>, c10::optional<at::Generator>, c10::TensorOptions const&) + 0x40 (0x7f3b288ea720 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #13: at::native::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x30 (0x7f3b288ea850 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #14: <unknown function> + 0x15b62b1 (0x7f3b28ea92b1 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #15: <unknown function> + 0x15f2b9e (0x7f3b28ee5b9e in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #16: <unknown function> + 0xb5e655 (0x7f3b28451655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #17: <unknown function> + 0x132d1eb (0x7f3b28c201eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #18: <unknown function> + 0x13037c0 (0x7f3b28bf67c0 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #19: <unknown function> + 0xb5e655 (0x7f3b28451655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #20: <unknown function> + 0x132d1eb (0x7f3b28c201eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #21: at::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x1f7 (0x7f3b28d1bb57 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #22: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&)::{lambda()#1}::operator()() const + 0x99 (0x7f3b38fbbc4d in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #23: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x54 (0x7f3b38fbbce9 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #24: _lantern_randn_intarrayref_tensoroptions + 0x14c (0x7f3b38dac9b2 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #25: cpp_torch_namespace_randn_size_IntArrayRef(std::vector<long, std::allocator<long> >, XPtrTorchTensorOptions) + 0x58 (0x7f3b39760fa8 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #26: _torch_cpp_torch_namespace_randn_size_IntArrayRef + 0x87 (0x7f3b39554a97 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #27: <unknown function> + 0xf932c (0x7f3b4caa932c in /usr/lib/R/lib/libR.so)
#> frame #28: <unknown function> + 0xf9826 (0x7f3b4caa9826 in /usr/lib/R/lib/libR.so)
#> frame #29: <unknown function> + 0x137106 (0x7f3b4cae7106 in /usr/lib/R/lib/libR.so)
#> frame #30: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #31: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #32: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #33: Rf_eval + 0x353 (0x7f3b4caf38c3 in /usr/lib/R/lib/libR.so)
#> frame #34: <unknown function> + 0xc650d (0x7f3b4ca7650d in /usr/lib/R/lib/libR.so)
#> frame #35: <unknown function> + 0x137106 (0x7f3b4cae7106 in /usr/lib/R/lib/libR.so)
#> frame #36: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #37: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #38: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #39: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #40: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #41: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #42: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #43: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #44: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #45: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #46: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #47: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #48: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #49: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #50: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #51: Rf_eval + 0x353 (0x7f3b4caf38c3 in /usr/lib/R/lib/libR.so)
#> frame #52: <unknown function> + 0xc650d (0x7f3b4ca7650d in /usr/lib/R/lib/libR.so)
#> frame #53: <unknown function> + 0x137106 (0x7f3b4cae7106 in /usr/lib/R/lib/libR.so)
#> frame #54: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #55: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #56: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #57: <unknown function> + 0x13a989 (0x7f3b4caea989 in /usr/lib/R/lib/libR.so)
#> frame #58: Rf_eval + 0x180 (0x7f3b4caf36f0 in /usr/lib/R/lib/libR.so)
#> frame #59: <unknown function> + 0x14550f (0x7f3b4caf550f in /usr/lib/R/lib/libR.so)
#> frame #60: Rf_applyClosure + 0x1c7 (0x7f3b4caf62d7 in /usr/lib/R/lib/libR.so)
#> frame #61: Rf_eval + 0x353 (0x7f3b4caf38c3 in /usr/lib/R/lib/libR.so)
#> frame #62: <unknown function> + 0x148693 (0x7f3b4caf8693 in /usr/lib/R/lib/libR.so)
#> frame #63: Rf_eval + 0x572 (0x7f3b4caf3ae2 in /usr/lib/R/lib/libR.so)
x <- torch_randn(1e3, 1e6, device = "cuda")

Note that the first call fails but the second one succeeds. This wouldn't be possible if the memory were not released.
It could be a bug in torch, but it might be easier to reproduce here.


skeydan commented on June 6, 2024

but does the first ever get as far as allocating the memory?


dfalbel commented on June 6, 2024

Maybe not, but then that indicates the problem occurs before the allocation that fails, and not that torch always fails to release CUDA memory when there's a CUDA OOM error.

You can do something like this too:

library(torch)
f <- function() {
  x <- list()
  for (i in 1:10000) {
    x[[i]] <- torch_randn(1e6, device = "cuda")
  }
}
f()
#> Error in (function (size, options) : CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.91 GiB total capacity; 10.20 GiB already allocated; 2.50 MiB free; 10.20 GiB reserved in total by PyTorch)
#> Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
#> frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7ff813de5b89 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10.so)
#> frame #1: <unknown function> + 0x25cbf (0x7ff813b7fcbf in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #2: <unknown function> + 0x27227 (0x7ff813b81227 in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #3: <unknown function> + 0x2788a (0x7ff813b8188a in /home/dfalbel/torch/lantern/build/libtorch/lib/libc10_cuda.so)
#> frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x35b (0x7ff7addbe97b in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #5: <unknown function> + 0x4023416 (0x7ff7adf98416 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #6: <unknown function> + 0x404ac9f (0x7ff7adfbfc9f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cuda.so)
#> frame #7: <unknown function> + 0x1322e5d (0x7ff804144e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #8: <unknown function> + 0x12fd301 (0x7ff80411f301 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #9: <unknown function> + 0x1305f1f (0x7ff804127f1f in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #10: <unknown function> + 0x1322e5d (0x7ff804144e5d in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #11: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xc3 (0x7ff804226783 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #12: at::native::randn(c10::ArrayRef<long>, c10::optional<at::Generator>, c10::TensorOptions const&) + 0x40 (0x7ff803e19720 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #13: at::native::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x30 (0x7ff803e19850 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #14: <unknown function> + 0x15b62b1 (0x7ff8043d82b1 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #15: <unknown function> + 0x15f2b9e (0x7ff804414b9e in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #16: <unknown function> + 0xb5e655 (0x7ff803980655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #17: <unknown function> + 0x132d1eb (0x7ff80414f1eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #18: <unknown function> + 0x13037c0 (0x7ff8041257c0 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #19: <unknown function> + 0xb5e655 (0x7ff803980655 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #20: <unknown function> + 0x132d1eb (0x7ff80414f1eb in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #21: at::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x1f7 (0x7ff80424ab57 in /home/dfalbel/torch/lantern/build/libtorch/lib/libtorch_cpu.so)
#> frame #22: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&)::{lambda()#1}::operator()() const + 0x99 (0x7ff8144eac4d in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #23: torch::randn(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x54 (0x7ff8144eace9 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #24: _lantern_randn_intarrayref_tensoroptions + 0x14c (0x7ff8142db9b2 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/deps/liblantern.so)
#> frame #25: cpp_torch_namespace_randn_size_IntArrayRef(std::vector<long, std::allocator<long> >, XPtrTorchTensorOptions) + 0x58 (0x7ff814c8ffa8 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #26: _torch_cpp_torch_namespace_randn_size_IntArrayRef + 0x87 (0x7ff814a83a97 in /home/dfalbel/R/x86_64-pc-linux-gnu-library/4.0/torch/libs/torchpkg.so)
#> frame #27: <unknown function> + 0xf932c (0x7ff827fd832c in /usr/lib/R/lib/libR.so)
#> frame #28: <unknown function> + 0xf9826 (0x7ff827fd8826 in /usr/lib/R/lib/libR.so)
#> frame #29: <unknown function> + 0x137106 (0x7ff828016106 in /usr/lib/R/lib/libR.so)
#> frame #30: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #31: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #32: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #33: Rf_eval + 0x353 (0x7ff8280228c3 in /usr/lib/R/lib/libR.so)
#> frame #34: <unknown function> + 0xc650d (0x7ff827fa550d in /usr/lib/R/lib/libR.so)
#> frame #35: <unknown function> + 0x137106 (0x7ff828016106 in /usr/lib/R/lib/libR.so)
#> frame #36: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #37: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #38: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #39: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #40: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #41: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #42: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #43: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #44: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #45: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #46: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #47: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #48: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #49: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #50: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #51: Rf_eval + 0x353 (0x7ff8280228c3 in /usr/lib/R/lib/libR.so)
#> frame #52: <unknown function> + 0xc650d (0x7ff827fa550d in /usr/lib/R/lib/libR.so)
#> frame #53: <unknown function> + 0x137106 (0x7ff828016106 in /usr/lib/R/lib/libR.so)
#> frame #54: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #55: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #56: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #57: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #58: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #59: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
#> frame #60: Rf_applyClosure + 0x1c7 (0x7ff8280252d7 in /usr/lib/R/lib/libR.so)
#> frame #61: <unknown function> + 0x13a989 (0x7ff828019989 in /usr/lib/R/lib/libR.so)
#> frame #62: Rf_eval + 0x180 (0x7ff8280226f0 in /usr/lib/R/lib/libR.so)
#> frame #63: <unknown function> + 0x14550f (0x7ff82802450f in /usr/lib/R/lib/libR.so)
x <- torch_randn(1e3, 1e6)
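If a program needs to survive the OOM instead of aborting, one pattern along the lines of the example above is to catch the error, collect, and retry with a smaller request. This is a hedged sketch: `alloc_or_shrink` and the sizes are illustrative, not torch API, and it assumes a CUDA device is available.

```r
library(torch)

# Hedged sketch: catch a CUDA OOM, let R finalize unreferenced
# tensor wrappers, then retry with a smaller tensor.
alloc_or_shrink <- function(n) {
  tryCatch(
    torch_randn(n, device = "cuda"),
    error = function(e) {
      gc()  # finalize dead tensor wrappers so their memory can be reused
      torch_randn(round(n / 10), device = "cuda")
    }
  )
}

# First attempt is deliberately far too big; the retry should fit.
x <- alloc_or_shrink(8e9)
```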


skeydan commented on June 6, 2024

Convinced! So it really depends on whether the action following the OOM error still "fits" or not:

library(torch)
f <- function() {
  x <- list()
  for (i in 1:10000) {
    x[[i]] <- torch_randn(1e6, device = "cuda")
  }
}
f()

nvidia-smi says

4025MiB /  4035MiB

y <- torch_randn(1, device = "cuda")

nvidia-smi says

4027MiB /  4035MiB 

So that's good to know!


dfalbel commented on June 6, 2024

If you try a larger allocation for y, gc() will be called and eventually all GPU memory will be released.
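To release the memory without waiting for a later allocation to trigger it, something like this should also work (hedged: `cuda_empty_cache()` can only return blocks that are no longer backing live tensors, and this needs a CUDA build):

```r
library(torch)

# After f() above has failed, its local list of tensors is unreachable.
# An explicit collection finalizes those wrappers, and emptying the
# cache returns the freed blocks to the driver, so the usage reported
# by nvidia-smi should drop without restarting R.
gc()
cuda_empty_cache()
```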


skeydan commented on June 6, 2024

True! In this isolated setup, it works.


dfalbel commented on June 6, 2024

Should be fixed with d5dbfe3

