Comments (12)
Thanks, yes -- I found that adding this line at the end of the for loop:

    if i mod 30 = 29 then
      Caml.Gc.major ()

keeps the memory from blowing up. It also improves performance to the point that it's faster than PyTorch. Yay.
from ocaml-torch.
That's indeed one of the disadvantages of using a GC rather than reference counting: in libtorch, GPU memory is handled via RAII, so it is only released when the GC triggers and collects the data, rather than being released as early as possible.
The easy way to get around this is to trigger the GC manually, though there is a balance to strike here, as running the GC has a significant cost. This is done in most examples, e.g. for min-gpt on this line.
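The pattern described above (trigger a collection periodically rather than every iteration, to amortize its cost) can be sketched in Python; this is a hedged illustration of the idea, not ocaml-torch code, and the interval of 30 iterations is arbitrary and workload-dependent.

```python
import gc

# Trigger a full collection every 30 iterations instead of every
# iteration, amortizing the (significant) cost of a collection
# over many loop steps.
major_collections = 0

def step(i):
    global major_collections
    # ... allocate short-lived tensors / buffers here ...
    if i % 30 == 29:
        gc.collect()          # analogous to Caml.Gc.major ()
        major_collections += 1

for i in range(90):
    step(i)

print(major_collections)  # 3 collections over 90 iterations
```

The trade-off is exactly the one described: collect too rarely and unreleased external memory piles up between collections; collect too often and the collection cost dominates the loop.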
Hi Laurent,
Still struggling with this issue; repeated calls to Caml.Gc.major () or Caml.Gc.full_major () don't seem to be helping.
I'm guessing that the GC sees the Tensors only as their small pointers / OCaml structures, not as the many MB of GPU RAM that they consume, so it chooses not to deallocate them? For example, the change in GC state below corresponds to losing > 10 GB of GPU RAM -- thereby making the app fail.
minor_collections: 8269
major_collections: 10
compactions: 0
forced_major_collections: 1
minor_words: 61346823
promoted_words: 8816555
major_words: 17280763
top_heap_words: 17656542
heap_words: 4194144
live_words: 3420356
free_words: 768036
largest_free: 0
fragments: 5752
live_blocks: 856672
free_blocks: 0
heap_chunks: 0
------- after algorithm (similar to Dec 12 code):
minor_collections: 8412 (+143)
major_collections: 13 (+3)
compactions: 0
forced_major_collections: 2
minor_words: 61731907 (+385084)
promoted_words: 8938926 (+122371)
major_words: 17404160 (+123397)
top_heap_words: 17656542 (same)
heap_words: 4109158 (-84986)
live_words: 3420403 (+47)
free_words: 682575 (-85461)
largest_free: 0
fragments: 6180 (+428)
live_blocks: 842693 (-13979)
free_blocks: 0
heap_chunks: 0
------- after Gc.full_major ():
minor_collections: 8646 (+234)
major_collections: 19 (+6)
compactions: 0
forced_major_collections: 4
minor_words: 61732479 (+572)
promoted_words: 8938926 (same)
major_words: 17404160 (same)
top_heap_words: 17656542 (same)
heap_words: 4080486 (-28672)
live_words: 3374323 (-46080)
free_words: 700011 (+17436)
largest_free: 0
fragments: 6152 (-28)
live_blocks: 827333 (-15360)
free_blocks: 0
heap_chunks: 0
Note: the code from Dec 12, with repeated calls to Caml.Gc.full_major (), consumes > 50x the GPU RAM that it naively ought to... excluding the 916 MB allocated by default ...
Am considering writing the critical code in C++ and deallocating through RAII, as you mention. If there are examples of this in your source code, I would be happy to study them and report back.
Thank you again for this excellent library!
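The guess above -- that the collector only accounts for the small wrapper object, not the external allocation it names -- can be illustrated with a toy Python sketch. The `TensorHandle` class and `EXTERNAL_POOL` registry are hypothetical stand-ins; the point is only that the managed-heap footprint of a handle is tiny compared to the memory it refers to.

```python
import sys

# Stand-in for memory that lives outside the managed heap (e.g. GPU RAM):
EXTERNAL_POOL = {}

class TensorHandle:
    """Hypothetical wrapper: a small heap object naming a huge external buffer."""
    def __init__(self, handle_id, nbytes):
        self.handle_id = handle_id
        EXTERNAL_POOL[handle_id] = nbytes  # allocation the GC cannot see

t = TensorHandle(0, 10 * 2**30)  # handle naming ~10 GB of device memory

# The collector's heuristics only see the tiny wrapper, not the 10 GB,
# so heap-size-based triggers never fire for it:
print(sys.getsizeof(t) < 1024)         # True: wrapper is a few dozen bytes
print(EXTERNAL_POOL[0] == 10 * 2**30)  # True: external allocation is huge
```

This is why the GC stats above barely move while gigabytes of GPU RAM disappear: the heap words the collector tracks are unrelated to the external allocations hanging off them.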
In wrapper_generated.ml, there are (deferred) calls to C.Tensor.free:

    Gc.finalise C.Tensor.free t0;

via

    open Ctypes
    module C = Torch_bindings.C (Torch_generated)
    open C.TensorG

Is it possible to call C.Tensor.free directly? It seems easier than writing in C++.
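The `Gc.finalise C.Tensor.free t0` line registers a free that runs when the wrapper is collected. The same "deferred free, optionally invoked early by hand" pattern exists in Python as `weakref.finalize`, which guarantees the callback runs at most once -- a toy sketch, with `tensor_free` as a hypothetical stand-in for C.Tensor.free:

```python
import weakref

freed = []

class Tensor:
    """Hypothetical stand-in for a libtorch tensor handle."""
    pass

def tensor_free(handle_id):
    # Stand-in for C.Tensor.free: releases the underlying device memory.
    freed.append(handle_id)

t = Tensor()
# Register a deferred free, like Gc.finalise C.Tensor.free t0:
fin = weakref.finalize(t, tensor_free, 42)

# Calling the free eagerly is safe: finalize() runs its callback at
# most once, so the later collection-time call becomes a no-op.
fin()          # early, explicit free
del t          # collection does not free a second time

print(freed)   # [42]
```

Whether calling the underlying free early is safe in ocaml-torch depends on whether its finalizer is guarded against double frees; the at-most-once guarantee here is a property of `weakref.finalize`, not something the sketch can promise about C.Tensor.free.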
It's indeed the case that the GC doesn't see the tensor as occupying a large amount of memory. However, this should not be an issue, as a call to Gc.full_major ()
should collect all the dangling memory regardless of how much it uses (the GC knowing about memory usage is only useful for deciding when to trigger a collection).
So if you still see memory usage increasing despite regular full major collections, there is a deeper issue somewhere: it could be a bug in ocaml-torch, or your code may somehow retain references to the tensors. I would suggest reducing the example as much as possible until the memory leak disappears; hopefully this will give an idea of what is going on (and if you end up with a very short repro, that would be useful for debugging if the issue is within ocaml-torch).
from ocaml-torch.
Can you try this?
https://github.com/tlh24/ocaml-torch-leaktest
Calling Gc.full_major () does not decrease the memory allocation.
Interestingly, memory allocation climbs for the first several iterations, then saturates by the 20th.
Could this be some issue with gradient tracing?
Sorry, I haven't found the time to look at your repro so far. Gradient tracing may indeed be a culprit, though that issue usually involves some form of global accumulator, which I don't see in your code. Anyway, you can try running this within a Tensor.no_grad
block to deactivate gradient tracing. It might be worth looking at this PyTorch FAQ too, in case anything related applies. The memory allocation climbing only for the first iterations makes me more suspicious of the allocator doing some caching; it could be interesting to check what happens when using a CPU device rather than a GPU, as well as whether the same thing happens when running similar code with the Python API.
I just tried your example, and it seems to me that after adding a call to Gc.full_major
at the beginning of each loop iteration, the GPU memory stays roughly constant. This would tend to agree with some allocation caching taking place within libtorch, so it's not really a leak on the OCaml side.
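The climb-then-saturate behaviour reported above is consistent with a caching allocator: freed blocks go back to a free list and get reused, so the total reservation grows only until the working set is covered. PyTorch's CUDA caching allocator behaves roughly this way; the sketch below is a toy model of the idea, not libtorch's implementation.

```python
from collections import defaultdict

class CachingAllocator:
    """Toy caching allocator: frees are cached, never returned to the device."""
    def __init__(self):
        self.free_lists = defaultdict(list)
        self.reserved = 0  # total bytes ever requested from the "device"

    def alloc(self, nbytes):
        if self.free_lists[nbytes]:
            return self.free_lists[nbytes].pop()  # reuse a cached block
        self.reserved += nbytes                   # grow the reservation
        return object()

    def free(self, nbytes, block):
        self.free_lists[nbytes].append(block)     # cache, don't release

alloc = CachingAllocator()
history = []
for i in range(30):
    # Fixed per-iteration working set: one 1 MB and one 4 MB block.
    blocks = [(n, alloc.alloc(n)) for n in (1 << 20, 4 << 20)]
    for n, b in blocks:
        alloc.free(n, b)
    history.append(alloc.reserved)

# With a fixed working set, reserved memory plateaus after the first
# iteration; with varying shapes it climbs for several iterations
# before saturating, matching the behaviour observed in the thread.
print(history[0] == history[-1])  # True: plateau
```

In this model the memory is not leaked, just held by the allocator's cache, which is why Gc.full_major cannot give it back to the device.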
Yes, that makes sense -- I wonder what it's caching, though. It would be really nice to get my GPU RAM back.
FWIW, I added a C++ test (thank you, GPT-4), which uses even less memory. I suppose I can FFI it? It might be useful for other ocaml-torch users?