Executing command CUDA_VISIBLE_DEVICES="0" python train.py -c configs/freevc.json -m f

Bug with num_workers=8 about freevc HOT 10 CLOSED

olawod commented on July 23, 2024

Bug with num_workers=8

from freevc.

Comments (10)

skol101 commented on July 23, 2024

When resuming from your pretrained G and D,

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.71 GiB (GPU 0; 23.69 GiB total capacity; 5.86 GiB already allocated; 2.71 GiB free; 19.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
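The error message itself suggests one thing to try: reserved memory (19.53 GiB) is far above allocated memory (5.86 GiB), which is the pattern it associates with fragmentation. Below is a minimal sketch of setting the allocator option from Python. The value 128 is an arbitrary illustrative starting point, not something reported in this thread, and whether it helps in this case is untested.

```python
import os

# Assumption: 128 (MB) is an illustrative value, not a recommendation from this thread.
# The variable has to be set before the first CUDA allocation, so the safest place
# is before `import torch` at the top of the training script, or in the launching shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same setting can be passed on the command line by prefixing the training command with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128, next to CUDA_VISIBLE_DEVICES.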

skol101 commented on July 23, 2024

Works fine with num_workers=4. It's a minor issue, but could be useful for somebody.

skol101 commented on July 23, 2024

Hmm, there are also bugs during the eval step with num_workers=4:

INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
./logs/freevc/G_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/G_195000.pth' (iteration 2053)
./logs/freevc/D_195000.pth
INFO:freevc:Loaded checkpoint './logs/freevc/D_195000.pth' (iteration 2053)
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:freevc:Train Epoch: 2053 [97%]
INFO:freevc:[2.2948668003082275, 2.467487096786499, 9.334207534790039, 16.157079696655273, 1.7016690969467163, 379800, 0.00015467115812058983]
INFO:freevc:====> Epoch: 2053
INFO:freevc:====> Epoch: 2054
INFO:freevc:Train Epoch: 2055 [5%]
INFO:freevc:[2.3948028087615967, 2.7103328704833984, 10.981183052062988, 18.12336540222168, 1.8913426399230957, 380000, 0.0001546324927477965]
terminate called without an active exception
terminate called without an active exception
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fd1f94cc430>
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1430, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/sk/anaconda3/envs/freevc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 30152) is killed by signal: Aborted. 
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/G_380000.pth
INFO:freevc:Saving model and optimizer state at iteration 2055 to ./logs/freevc/D_380000.pth

OlaWod commented on July 23, 2024

I have not encountered this problem, so for now I suspect it is specific to the machine.

skol101 commented on July 23, 2024

Yes, maybe it's some local misconfiguration of the env.

skol101 commented on July 23, 2024

What PyTorch/CUDA versions are you running, please?

OlaWod commented on July 23, 2024

torch 1.10.0
cudatoolkit 11.1.1

skol101 commented on July 23, 2024

Cheers, mine has
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0
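For anyone comparing environments the same way, here is a small sketch that prints the versions PyTorch reports at runtime. The helper name describe_env is made up for illustration. It reports the CUDA version the installed build was compiled against, which may differ from the system toolkit.

```python
import platform


def describe_env():
    """Collect version details that are useful to paste into a bug report."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None, "cudnn": None}
    try:
        import torch  # imported lazily so the helper still runs where torch is absent
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    return info


if __name__ == "__main__":
    for name, value in describe_env().items():
        print(f"{name}: {value}")
```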

yt605155624 commented on July 23, 2024

Setting num_workers=0 works well for me.

yt605155624 commented on July 23, 2024

Setting persistent_workers=True in the train and eval DataLoader works well for me when I set num_workers>1.
check link
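A minimal sketch of that workaround follows. With persistent_workers=True, the DataLoader does not shut its worker processes down after each pass over the dataset, which may be why it avoids the abort in _shutdown_workers shown in the traceback above. The helper names loader_kwargs and build_loader are hypothetical and are not part of freevc; train.py constructs its loaders in its own way.

```python
def loader_kwargs(num_workers, batch_size=1, shuffle=False):
    """Build DataLoader keyword arguments for the given worker count.

    persistent_workers is only enabled when num_workers > 0, because
    DataLoader raises a ValueError if it is True with num_workers=0.
    """
    return {
        "batch_size": batch_size,
        "shuffle": shuffle,
        "num_workers": num_workers,
        "persistent_workers": num_workers > 0,
    }


def build_loader(dataset, num_workers, **overrides):
    """Hypothetical helper: wrap any map-style dataset in a DataLoader."""
    from torch.utils.data import DataLoader  # lazy import: needs PyTorch installed

    kwargs = loader_kwargs(num_workers)
    kwargs.update(overrides)
    return DataLoader(dataset, **kwargs)


if __name__ == "__main__":
    print(loader_kwargs(4))  # workers kept alive between epochs
    print(loader_kwargs(0))  # single-process loading, no workers to persist
```

Using num_workers=0, as suggested in the earlier comment, sidesteps worker shutdown entirely at the cost of loading data in the main process.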
