Comments (10)

skol101 commented on July 23, 2024

When resuming training from your pretrained G and D checkpoints, I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 23.69 GiB total capacity; 5.86 GiB already allocated; 2.71 GiB free; 19.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
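
For anyone who wants to try the hint in the error message: below is a minimal sketch of setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. The helper name configure_cuda_allocator and the value 128 are illustrative assumptions, not something taken from freevc, and nothing in this thread confirms that the setting fixes this particular error.

```python
import os


def configure_cuda_allocator(max_split_size_mb: int = 128) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF so the caching allocator avoids splitting large blocks.

    The helper name and the default of 128 MB are illustrative assumptions;
    a suitable value depends on the model and the GPU.
    """
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# Call this before the first CUDA allocation; the safest place is
# before `import torch` at the top of the training script.
configure_cuda_allocator()
```

The same setting can be passed from the shell by exporting the variable before launching training, which avoids touching the script at all.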

from freevc.

skol101 commented on July 23, 2024

Works fine with num_workers=4. It's a minor issue, but could be useful for somebody.

skol101 commented on July 23, 2024

Hmm, it also crashes during the eval step with num_workers=4:

INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/freevc/G_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/G_195000.pth' (iteration 2053)
./logs/freevc/D_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/D_195000.pth' (iteration 2053)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 2053 [97%]
INFO:freevc:[2.2948668003082275, 2.467487096786499, 9.334207534790039, 16.157079696655273, 1.7016690969467163, 379800, 0.00015467115812058983]
INFO:freevc:====> Epoch: 2053
INFO:freevc:====> Epoch: 2054
INFO:freevc:Train Epoch: 2055 [5%]
INFO:freevc:[2.3948028087615967, 2.7103328704833984, 10.981183052062988, 18.12336540222168, 1.8913426399230957, 380000, 0.0001546324927477965]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fd1f94cc430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 30152) is killed by signal: Aborted. 
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/G_380000.pth
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/D_380000.pth


OlaWod commented on July 23, 2024

I have not encountered this problem, so for now I suspect it is specific to your machine.

skol101 commented on July 23, 2024

Yes, maybe it's some local misconfiguration of the env.

skol101 commented on July 23, 2024

Which PyTorch/CUDA versions are you running, please?

OlaWod commented on July 23, 2024

torch 1.10.0
cudatoolkit 11.1.1

skol101 commented on July 23, 2024

Cheers, mine is:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0

yt605155624 commented on July 23, 2024

Setting num_workers=0 works well for me.

yt605155624 commented on July 23, 2024

Setting persistent_workers=True in the train and eval DataLoader works well for me when num_workers>1.
check link
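
The two workarounds above (num_workers=0, or persistent_workers=True with several workers) can be sketched together as follows. The helper names dataloader_kwargs and build_loader are hypothetical and are not part of freevc's train.py; this is only a sketch of how the DataLoader arguments might be chosen. PyTorch rejects persistent_workers=True when num_workers is 0, so the flag is only added when workers are actually used.

```python
def dataloader_kwargs(num_workers: int) -> dict:
    """Return DataLoader keyword arguments for the given worker count.

    Hypothetical helper: persistent_workers keeps worker processes alive
    between epochs instead of tearing them down each time, and it is only
    valid when num_workers > 0.
    """
    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        kwargs["persistent_workers"] = True
    return kwargs


def build_loader(dataset, batch_size: int, num_workers: int, shuffle: bool = False):
    """Create a DataLoader with the arguments above (torch is imported lazily)."""
    from torch.utils.data import DataLoader

    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        **dataloader_kwargs(num_workers),
    )
```

With num_workers=0 all loading happens in the main process, which sidesteps worker shutdown entirely at the cost of slower data loading.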
