Comments (20)
If you're using PyTorch 1.0.0, you'll also get a CUDA out-of-memory error. You'll want to find line 214 in pix2pixHD_model.py and comment out
if torch.__version__.startswith('0.4'):
    with torch.no_grad():
        fake_image = self.netG.forward(input_concat)
else:
    fake_image = self.netG.forward(input_concat)
And replace it with just
with torch.no_grad():
    fake_image = self.netG.forward(input_concat)
Or your own, improved, PyTorch version-detecting code. with torch.no_grad() is correct for PyTorch 0.4, but it should also be used for later versions of PyTorch, which this code does not do.
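A minimal sketch of what "improved version-detecting code" might look like, assuming torch.no_grad() is available from PyTorch 0.4 onward. The helper name supports_no_grad is hypothetical and not part of pix2pixHD; it compares the numeric major and minor components instead of matching the '0.4' prefix, so 1.0.0 and later are covered as well.

```python
import re


def supports_no_grad(version):
    """Return True if this PyTorch version string is 0.4 or newer.

    Hypothetical helper, not part of pix2pixHD. Only the leading
    major.minor numbers are compared, so suffixes such as '.post2'
    are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        # Unknown format: assume an old release and skip no_grad.
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 4)


# Intended use inside inference(), shown as comments because it needs
# the model object from the thread:
#
#     if supports_no_grad(torch.__version__):
#         with torch.no_grad():
#             fake_image = self.netG.forward(input_concat)
#     else:
#         fake_image = self.netG.forward(input_concat)

for v in ("0.3.1.post2", "0.4.0", "1.0.0"):
    print(v, supports_no_grad(v))
```

Comparing a (major, minor) tuple avoids the string-prefix trap in the original check, where '1.0.0' does not start with '0.4' and therefore falls into the branch without no_grad.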
from pix2pixhd.
1080Ti should be able to run the inference perfectly fine; it should only take about 4G memory. Are you sure the GPU is not running something else at the same time?
from pix2pixhd.
I am sure there are no other jobs running at the same time.
PyTorch is built from Docker images. Here are the Dockerfile and the docker-compose file.
# Dockerfile
FROM pytorch-cuda8-cudnn6:gpu-py3
RUN mkdir /app \
    && pip install dominate
WORKDIR /app
# docker-compose.yml
version: '2'
services:
  pix2pixHD:
    build: .
    image: pytorch/pix2pixhd:gpu-py3
    container_name: pytorch_pix2pixHD
    volumes:
      - .:/app
    #environment:
    #  - CUDA_VISIBLE_DEVICES=0
    command:
      - bash
      - ./scripts/test_1024p.sh
Error information:
pytorch_pix2pixHD | ---------- Networks initialized -------------
pytorch_pix2pixHD | model [Pix2PixHDModel] was created
pytorch_pix2pixHD | THCudaCheck FAIL file=/tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
pytorch_pix2pixHD | /app/models/pix2pixHD_model.py:112: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
pytorch_pix2pixHD | input_label = Variable(input_label, volatile=infer)
pytorch_pix2pixHD | process image... ['./datasets/cityscapes/test_label/frankfurt_000000_000576_gtFine_labelIds.png']
pytorch_pix2pixHD | Traceback (most recent call last):
pytorch_pix2pixHD | File "test.py", line 29, in <module>
pytorch_pix2pixHD | generated = model.inference(data['label'], data['inst'])
pytorch_pix2pixHD | File "/app/models/pix2pixHD_model.py", line 188, in inference
pytorch_pix2pixHD | fake_image = self.netG.forward(input_concat)
pytorch_pix2pixHD | File "/app/models/networks.py", line 182, in forward
pytorch_pix2pixHD | output_prev = model_upsample(model_downsample(input_i) + output_prev)
pytorch_pix2pixHD | File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD | result = self.forward(*input, **kwargs)
pytorch_pix2pixHD | File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 75, in forward
pytorch_pix2pixHD | input = module(input)
pytorch_pix2pixHD | File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD | result = self.forward(*input, **kwargs)
pytorch_pix2pixHD | File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
pytorch_pix2pixHD | self.padding, self.dilation, self.groups)
pytorch_pix2pixHD | RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu:58
That's weird!
from pix2pixhd.
I ran into a similar problem and solved it by adding the proper options. You may need to read the README carefully.
from pix2pixhd.
@tcwang0509
Thanks for your excellent work!!
I ran the inference script, bash ./scripts/test_1024p.sh, on my server, but it fails with the error below.
I set batchSize to 1.
---------- Networks initialized -------------
Pretrained network G has fewer layers; The following are not initialized:
['model', 'model1_1']
model [Pix2PixHDModel] was created
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "test.py", line 29, in <module>
generated = model.inference(data['label'], data['inst'])
File "/home/xmli/pheng4/pix2pixHD/models/pix2pixHD_model.py", line 188, in inference
fake_image = self.netG.forward(input_concat)
File "/home/xmli/pheng4/pix2pixHD/models/networks.py", line 182, in forward
output_prev = model_upsample(model_downsample(input_i) + output_prev)
File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
self.padding, self.dilation, self.groups)
File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu:58
I am running on a TITAN Xp and used an idle GPU for the inference.
My torch version is 0.3.0.
nvidia-smi
Sat Apr 7 19:19:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 0000:04:00.0      On |                  N/A |
| 28%   49C    P2    61W / 250W |    251MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 0000:05:00.0     Off |                  N/A |
| 50%   78C    P2   269W / 250W |  10280MiB / 12189MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 0000:08:00.0     Off |                  N/A |
| 23%   36C    P8    16W / 250W |      3MiB / 12189MiB |      0%      Default |
from pix2pixhd.
@tcwang0509 @ArthurQiuu Could you provide any solutions to the problems? Thanks so much!!
from pix2pixhd.
The problem was solved when I updated torch from 0.3.0 to 0.3.1.post2.
My PyTorch info is posted below.
Thanks all!
$ conda list | grep pytorch
cuda80 1.0 h205658b_0 pytorch
pytorch 0.3.1 py27_cuda8.0.61_cudnn7.0.5_2 pytorch
torchvision 0.2.0 py27hfb27419_1 pytorch
from pix2pixhd.
I am running ToT PyTorch, and 1024p inference (test.py) does not fit in 16 GB by default. I have added an FP16 option (see my PR) to make it fit.
from pix2pixhd.
I hit the same problem when using a Titan X GPU to test the pre-trained 1024p model. Did anyone solve the out-of-memory problem?
@tcwang0509 Is it possible to provide the 512p pre-trained model for testing? Thank you!
from pix2pixhd.
I hit the same problem on a 1080 Ti. I ran the program on an idle GPU and it failed, but I still got two pictures.
So I read options.py and set --resize_or_crop none. That works, but the generated images (1024×512) are not as good as expected. With the default --resize_or_crop scale_width, I get only one generated image (2048×1024), but it is much better.
Therefore, I tried to train my own models using ./scripts/train_512p.sh.
I ran into the following problem:
create web directory ./checkpoints/label2city_512p/web...
Traceback (most recent call last):
File "train.py", line 61, in <module>
Variable(data['image']), Variable(data['feat']), infer=save_fake)
File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 154, in forward
input_label, inst_map, real_image, feat_map = self.encode_input(label, inst, image, feat)
File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 122, in encode_input
if self.opt.data_type==16:
AttributeError: 'Namespace' object has no attribute 'data_type'
Actually, all the other train scripts produce the same error.
Any help?
The datasets are organized as follows:
train_img: ****leftImg8bit.png
train_inst: ****gtFine_instanceIds.png
train_label: ****gtFine_labelIds.png
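The AttributeError above means the parsed options object has no data_type field when encode_input reads it. One possible cause, which the thread does not confirm, is that the model file and the options file come from different revisions of the repository. A hedged sketch of a defensive read, using a stand-in argparse.Namespace rather than the real options class, and assuming 32 is the intended default:

```python
from argparse import Namespace


def get_data_type(opt, default=32):
    """Read opt.data_type, falling back to `default` when it is absent.

    The default of 32 (full precision) is an assumption made for this
    sketch, not something stated in the thread.
    """
    return getattr(opt, "data_type", default)


# Stand-in for options parsed by an options file without data_type.
old_opt = Namespace(ngf=64)
# Stand-in for options that do define it.
new_opt = Namespace(ngf=64, data_type=16)

print(get_data_type(old_opt))  # falls back to the assumed default
print(get_data_type(new_opt))
```

Pulling matching versions of the model and option files is likely the cleaner fix; a fallback like this only hides the mismatch.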
from pix2pixhd.
@tcwang0509 I tried different combinations of parameters in test_1024p.sh and found that --ngf strongly affects memory use. I also watched memory usage while running: training at 512p may use only about 4 GB, but testing consumes much more. Reducing --ngf to 20 lets testing run, but the resulting images look very strange. I tested on both a 1080 Ti and a Titan X.
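A rough back-of-the-envelope sketch of why --ngf and precision matter, assuming the memory of one convolutional feature map is batch × channels × height × width × bytes per element. The function name and the example shapes are illustrative only; the real generator holds many feature maps at several resolutions, so this does not predict total usage.

```python
def feature_map_bytes(batch, channels, height, width, bytes_per_element=4):
    """Size in bytes of one dense activation tensor (4 bytes = float32)."""
    return batch * channels * height * width * bytes_per_element


MIB = 1024 ** 2

# One full-resolution 1024x2048 feature map at different channel counts.
for ngf in (64, 32, 20):
    size = feature_map_bytes(1, ngf, 1024, 2048)
    print(f"ngf={ngf}: {size / MIB:.0f} MiB in float32")

# Half precision (2 bytes per element) halves the same tensor.
fp16 = feature_map_bytes(1, 64, 1024, 2048, bytes_per_element=2)
print(f"ngf=64: {fp16 / MIB:.0f} MiB in float16")
```

The size scales linearly with the channel count, which is consistent with the observation that lowering --ngf reduces memory. When gradients are tracked, intermediate maps are also kept for the backward pass, which would explain why inference without torch.no_grad() costs far more than expected.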
from pix2pixhd.
@ouyangkid are you using PyTorch 0.4? It seems the problem is due to volatile no longer being supported, so inference costs a lot more memory than it should. Please pull the latest version and see if it works.
from pix2pixhd.
@tcwang0509 Yes, thanks for your response. It seems that the latest version will be 1.0, but it is not publicly available yet. I will wait and try again after the official version is published.
from pix2pixhd.
@ouyangkid I got the same error as you: "... AttributeError: 'Namespace' object has no attribute 'data_type'". Did you only change the --ngf parameter? I have already tried that and it did not work.
Thanks in advance.
from pix2pixhd.
@marioft according to @tcwang0509, the problem is caused by mismatched software versions. In my tests, reducing --ngf is one way to decrease GPU memory consumption; however, the outputs are weird.
I suggest you wait for the new version of PyTorch 1.0 / TensorRT. As you can see, NVIDIA currently has only one person supporting this project, so I have also given up testing.
These are my envs:
cuda9.0 cudnn 7.1.5 tensorrt 4.0 pytorch 4.0
from pix2pixhd.
Thanks for your reply, I'll update the software then and hope it works. I'm working with Cuda7.5, cudnn7.1.3, tensorrt 4.0.1, and pytorch 0.4.0.
from pix2pixhd.
I ran the code with the default bash ./scripts/test_1024p.sh.
It works fine with PyTorch 0.4. Then I replaced the train label with a custom image of the same dimensions as the test case (1024x2048), and it throws the error below:
Traceback (most recent call last):
File "test.py", line 61, in <module>
generated = model.inference(data['label'], data['inst'])
File "project/pix2pixHD/models/pix2pixHD_model.py", line 216, in inference
fake_image = self.netG.forward(input_concat)
File "project/pix2pixHD/models/networks.py", line 180, in forward
output_prev = self.model(input_downsampled[-1])
File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
self.padding, self.dilation, self.groups)
File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
Any insight? Thanks in advance.
from pix2pixhd.
Hi @nejyeah, I am trying to run pix2pixHD in a Docker container. I used your Dockerfile, but this line
FROM pytorch-cuda8-cudnn6:gpu-py3
raises an error:
pull access denied for pytorch-cuda8-cudnn6, repository does not exist or may require 'docker login'
Can you help me dockerize pix2pixHD?
from pix2pixhd.
@fabio-C Sorry, I did not keep the dockerfile and the docker image.
from pix2pixhd.
@9of9's solution worked for me (thanks!). I noticed one interesting thing, though: if I pass --resize_or_crop none, I don't get an out-of-memory error (although the output images don't make sense). OOM occurs only with --resize_or_crop scale_width.
from pix2pixhd.