1080Ti is out of memory for testing 1024P pretrained model (pix2pixHD)

nvidia commented on May 17, 2024

1080Ti is out of memory for testing 1024P pretrained model

from pix2pixhd.

Comments (20)

9of9 commented on May 17, 2024

If you're using PyTorch 1.0.0, you'll also get a CUDA out-of-memory error. To fix it, find line 214 in pix2pixHD_model.py and comment out:

        if torch.__version__.startswith('0.4'):
            with torch.no_grad():
                fake_image = self.netG.forward(input_concat)
        else:
            fake_image = self.netG.forward(input_concat)

And replace it with just

        with torch.no_grad():
            fake_image = self.netG.forward(input_concat)

Or use your own, improved PyTorch version-detecting code. `with torch.no_grad()` is correct for PyTorch 0.4, but it should also be used for later versions of PyTorch, which this code does not do.
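
A minimal sketch of what such version-detecting code might look like. The helper name `supports_no_grad` is hypothetical and not part of pix2pixHD; it only assumes that `torch.no_grad()` exists from PyTorch 0.4 onward, and it parses the version string with plain Python so that suffixes like `0.3.1.post2` are handled too:

```python
import re

def supports_no_grad(version: str) -> bool:
    """Return True if a PyTorch version string is 0.4 or newer.

    Hypothetical helper: compares (major, minor) numerically instead of
    using startswith('0.4'), so 1.0.0 and later are also accepted.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        # Unparseable version string: assume a modern build.
        return True
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 4)

print(supports_no_grad("0.3.1.post2"))  # False
print(supports_no_grad("0.4.0"))        # True
print(supports_no_grad("1.0.0"))        # True
```

With this, the check in `inference` could become `if supports_no_grad(torch.__version__):` followed by the `with torch.no_grad():` block, keeping the plain forward call only for 0.3.x.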

tcwang0509 commented on May 17, 2024

A 1080 Ti should be able to run the inference perfectly fine; it should only take about 4 GB of memory. Are you sure the GPU is not running something else at the same time?

nejyeah commented on May 17, 2024

I am sure there are no other jobs running at the same time.
PyTorch is built from Docker images. Here are the Dockerfile and the docker-compose file.

# Dockerfile
FROM pytorch-cuda8-cudnn6:gpu-py3
RUN mkdir /app \
    && pip install dominate
WORKDIR /app

docker-compose.yml

version: '2'
services:
  pix2pixHD:
    build: .
    image: pytorch/pix2pixhd:gpu-py3
    container_name: pytorch_pix2pixHD
    volumes:
      - .:/app
    #environment:
    #  - CUDA_VISIBLE_DEVICES=0
    command:
      - bash
      - ./scripts/test_1024p.sh

Error information:

pytorch_pix2pixHD | ---------- Networks initialized -------------
pytorch_pix2pixHD | model [Pix2PixHDModel] was created
pytorch_pix2pixHD | THCudaCheck FAIL file=/tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
pytorch_pix2pixHD | /app/models/pix2pixHD_model.py:112: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
pytorch_pix2pixHD |   input_label = Variable(input_label, volatile=infer)
pytorch_pix2pixHD | process image... ['./datasets/cityscapes/test_label/frankfurt_000000_000576_gtFine_labelIds.png']
pytorch_pix2pixHD | Traceback (most recent call last):
pytorch_pix2pixHD |   File "test.py", line 29, in <module>
pytorch_pix2pixHD |     generated = model.inference(data['label'], data['inst'])
pytorch_pix2pixHD |   File "/app/models/pix2pixHD_model.py", line 188, in inference
pytorch_pix2pixHD |     fake_image = self.netG.forward(input_concat)
pytorch_pix2pixHD |   File "/app/models/networks.py", line 182, in forward
pytorch_pix2pixHD |     output_prev = model_upsample(model_downsample(input_i) + output_prev)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 75, in forward
pytorch_pix2pixHD |     input = module(input)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
pytorch_pix2pixHD |     self.padding, self.dilation, self.groups)
pytorch_pix2pixHD | RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu:58

That's weird!

arthur-qiu commented on May 17, 2024

I ran into a similar problem and solved it by adding the proper options. You may need to read the README carefully.

xmengli commented on May 17, 2024

@tcwang0509
Thanks for your excellent work!!

I ran the inference code bash ./scripts/test_1024p.sh on my server with batchSize set to 1, but it shows this error:

---------- Networks initialized -------------
Pretrained network G has fewer layers; The following are not initialized:
['model', 'model1_1']
model [Pix2PixHDModel] was created
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "test.py", line 29, in <module>
    generated = model.inference(data['label'], data['inst'])
  File "/home/xmli/pheng4/pix2pixHD/models/pix2pixHD_model.py", line 188, in inference
    fake_image = self.netG.forward(input_concat)
  File "/home/xmli/pheng4/pix2pixHD/models/networks.py", line 182, in forward
    output_prev = model_upsample(model_downsample(input_i) + output_prev)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
    self.padding, self.dilation, self.groups)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu:58

I ran on a TITAN Xp and used an idle GPU for the inference.
My torch version is 0.3.0.

nvidia-smi
Sat Apr  7 19:19:50 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 0000:04:00.0      On |                  N/A |
| 28%   49C    P2    61W / 250W |    251MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 0000:05:00.0     Off |                  N/A |
| 50%   78C    P2   269W / 250W |  10280MiB / 12189MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 0000:08:00.0     Off |                  N/A |
| 23%   36C    P8    16W / 250W |      3MiB / 12189MiB |      0%      Default |
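
One way to double-check which GPU is actually idle before launching the test script is to parse `nvidia-smi` output. The sketch below assumes the CSV form produced by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`; the function name `pick_freest_gpu` is made up for illustration, and the sample string simply reuses the memory figures from the table above:

```python
def pick_freest_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the most free memory (in MiB).

    Each line is expected to look like "index, used, total".
    """
    best_index, best_free = -1, -1
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        free = total - used
        if free > best_free:
            best_index, best_free = index, free
    return best_index

# Sample text built from the memory figures in the nvidia-smi table above.
sample = """0, 251, 12189
1, 10280, 12189
2, 3, 12189"""

print(pick_freest_gpu(sample))  # prints 2
```

The returned index could then be exported as `CUDA_VISIBLE_DEVICES` before running `./scripts/test_1024p.sh`, so the job does not land on the busy GPU 1.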

xmengli commented on May 17, 2024

@tcwang0509 @ArthurQiuu Could you provide any solutions to the problems? Thanks so much!!

xmengli commented on May 17, 2024

The problem was solved when I updated torch from 0.3.0 to 0.3.1.post2.
Here is my PyTorch info.
Thanks, all!


$ conda list | grep pytorch
cuda80                    1.0                  h205658b_0    pytorch
pytorch                   0.3.1           py27_cuda8.0.61_cudnn7.0.5_2    pytorch
torchvision               0.2.0            py27hfb27419_1    pytorch

borisfom commented on May 17, 2024

I am running top-of-tree PyTorch, and 1024p does not fit in 16 GB by default for inference (test.py). I have added an FP16 option (see my PR) to make it fit.

cchen156 commented on May 17, 2024

I hit the same problem when using a Titan X GPU to test the pre-trained 1024p model. Did anyone solve the out-of-memory problem?

@tcwang0509 Is it possible to provide the 512p pre-trained model for testing? Thank you!

hahakid commented on May 17, 2024

I hit the same problem on a 1080 Ti. I ran the program on an otherwise idle GPU and it failed, although I still got two pictures.
So I read options.py and commented out --resize_or_crop none. It then works, but the generated images (1024×512) are not as good as expected. With the default --resize_or_crop scale_width, I get only one generated image (2048×1024), but it is much better.

Therefore, I tried to train my own models using ./scripts/train_512p.sh,
and I ran into the following problem:
create web directory ./checkpoints/label2city_512p/web...
Traceback (most recent call last):
  File "train.py", line 61, in <module>
    Variable(data['image']), Variable(data['feat']), infer=save_fake)
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 154, in forward
    input_label, inst_map, real_image, feat_map = self.encode_input(label, inst, image, feat)
  File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 122, in encode_input
    if self.opt.data_type==16:
AttributeError: 'Namespace' object has no attribute 'data_type'

Actually, all the other training scripts produce the same error.
Any help?

The datasets are organized as follows:
train_img: ****leftImg8bit.png
train_inst: ****gtFine_instanceIds.png
train_label: ****gtFine_labelIds.png
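
The `AttributeError` above means the parsed options object has no `data_type` field, which would be consistent with the model code and the options code coming from different revisions. Updating the whole checkout is the cleaner fix; as a stopgap, a defensive read with a fallback would avoid the crash. This is only a sketch: the helper name `get_data_type` is hypothetical, the `ngf=64` values are placeholders, and the fallback of 32 (meaning full precision) is an assumption, not something confirmed in this thread:

```python
import argparse

def get_data_type(opt: argparse.Namespace, default: int = 32) -> int:
    """Read opt.data_type, falling back to a default if the option is missing."""
    return getattr(opt, "data_type", default)

# An options object as an older options file might produce it (no data_type).
old_opt = argparse.Namespace(ngf=64)
# An options object where the flag was parsed.
new_opt = argparse.Namespace(ngf=64, data_type=16)

print(get_data_type(old_opt))  # falls back to the assumed default, 32
print(get_data_type(new_opt))  # 16
```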

hahakid commented on May 17, 2024

@tcwang0509 I tried different combinations of parameters in test_1024p.sh and found that --ngf strongly affects memory use. I also watched memory consumption during the runs: training at 512p may use only about 4 GB, but testing uses much more. Reducing --ngf to 20 lets the test run, but the quality of the images is very strange. I tested on both a 1080 Ti and a Titan X.
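
A rough back-of-the-envelope calculation shows why `--ngf` and the numeric precision matter so much at this resolution. The sketch below only counts a single feature map of shape (channels, height, width); the channel counts are illustrative values, not taken from the pix2pixHD code, and a real forward pass holds many such tensors at once (more still if autograd is recording them for a backward pass):

```python
def feature_map_mib(channels: int, height: int, width: int,
                    bytes_per_element: int = 4) -> float:
    """Size in MiB of one (channels, height, width) activation tensor."""
    return channels * height * width * bytes_per_element / 2**20

# One full-resolution 1024x2048 feature map, illustrative channel counts.
print(feature_map_mib(32, 1024, 2048))     # float32 -> 256.0
print(feature_map_mib(20, 1024, 2048))     # fewer channels -> 160.0
print(feature_map_mib(32, 1024, 2048, 2))  # float16 -> 128.0
```

The size grows linearly with the channel count and halves when going from 4-byte to 2-byte elements, which matches the direction of the savings reported for a lower `--ngf` and for FP16.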

tcwang0509 commented on May 17, 2024

@ouyangkid Are you using PyTorch 0.4? It seems the problem is that volatile is no longer supported, so inference costs a lot more memory than it should. Please pull the latest version and see if it works.

hahakid commented on May 17, 2024

@tcwang0509 Yes, thanks for your response. It seems the latest version will be 1.0, but it is not publicly available yet. I will wait and try again after the official version is published.

marioft commented on May 17, 2024

@ouyangkid I got the same error as you: "... AttributeError: 'Namespace' object has no attribute 'data_type'". Did you only change the --ngf parameter? I have already tried that and it did not work.
Thanks in advance.

hahakid commented on May 17, 2024

@marioft According to @tcwang0509, the problem comes from the software versions. In my tests, reducing --ngf is one way to decrease GPU memory consumption; however, the outputs are weird.

I suggest you wait for the new version of PyTorch 1.0 / TensorRT. As you can see, NVIDIA currently has only one person supporting this project, so I have also given up on further testing.
This is my environment:
cuda 9.0, cudnn 7.1.5, tensorrt 4.0, pytorch 0.4

marioft commented on May 17, 2024

Thanks for your reply, I'll update the software then and hope it works. I'm working with Cuda7.5, cudnn7.1.3, tensorrt 4.0.1, and pytorch 0.4.0.

Avyukth commented on May 17, 2024

I ran the code with the default bash ./scripts/test_1024p.sh and it works fine with PyTorch 0.4. Then I replaced the train label with a custom image of the same dimensions as the given test case (1024x2048), and it throws the error below:
Traceback (most recent call last):
  File "test.py", line 61, in <module>
    generated = model.inference(data['label'], data['inst'])
  File "project/pix2pixHD/models/pix2pixHD_model.py", line 216, in inference
    fake_image = self.netG.forward(input_concat)
  File "project/pix2pixHD/models/networks.py", line 180, in forward
    output_prev = self.model(input_downsampled[-1])
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

Any insight? Thanks in advance.

commented on May 17, 2024

Hi @nejyeah, I am trying to run pix2pixHD in a Docker container. I used your Dockerfile, but this line

FROM pytorch-cuda8-cudnn6:gpu-py3

raises an error:

pull access denied for pytorch-cuda8-cudnn6, repository does not exist or may require 'docker login'

Can you help me dockerize pix2pixHD?

nejyeah commented on May 17, 2024

@fabio-C Sorry, I did not keep the dockerfile and the docker image.

royaljain commented on May 17, 2024

@9of9's solution worked for me (thanks!). I noticed one interesting thing, though: if I pass --resize_or_crop none, I don't run out of memory (although the output images don't make sense). OOM occurs only with --resize_or_crop scale_width.
