gpleiss / efficient_densenet_pytorch
A memory-efficient implementation of DenseNets
License: MIT License
Hi,
Thanks for sharing the code. I read your code and have two small questions.
flake8 testing of https://github.com/gpleiss/efficient_densenet_pytorch on Python 3.6.2
$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
./models/densenet.py:15:66: E999 TabError: inconsistent use of tabs and spaces in indentation
self.add_module('conv.1', nn.Conv2d(num_input_features, bn_size *
^
./models/densenet_efficient.py:404:15: E999 TabError: inconsistent use of tabs and spaces in indentation
output = input
^
I can run the network by using CIFAR-10 as my dataset.
However, when I use my own dataset, whose images are 256x256, it does not work.
When I resize my data to 32x32, it works.
So how can I solve the scaling problem?
Maybe this is not the best place to ask, but I thought I could get some insight directly from the author :)
You used _EfficientReLU, which calls the backend operations directly. Is this necessary? What is the gain, and could I simply substitute nn.ReLU?
My intention is to change the type of activation here, and I wonder whether I can do it in a simpler way.
0.1
Training
Traceback (most recent call last):
File "demo.py", line 272, in
fire.Fire(demo)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "demo.py", line 250, in demo
n_epochs=n_epochs, batch_size=batch_size, seed=seed)
File "demo.py", line 166, in train
train=True,
File "demo.py", line 101, in run_epoch
output_var = model(input_var)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call
result = self.forward(*input, **kwargs)
File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 218, in forward
features = self.features(x)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call
result = self.forward(*input, **kwargs)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call
result = self.forward(*input, **kwargs)
File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 152, in forward
outputs.append(module.forward(outputs))
File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 107, in forward
new_features = super(_DenseLayer, self).forward(prev_features)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/yyj/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in call
result = self.forward(*input, **kwargs)
File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 85, in forward
return fn(self.norm_weight, self.norm_bias, self.conv_weight, *inputs)
File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 265, in forward
conv_output = self.efficient_conv.forward(conv_weight, None, relu_output)
File "/home/yyj/Downloads/efficient_densenet_pytorch-master/models/densenet_efficient.py", line 457, in forward
self.groups, cudnn.benchmark
TypeError: _cudnn_convolution_full_forward received an invalid combination of arguments - got (torch.cuda.FloatTensor, torch.cuda.FloatTensor, NoneType, torch.cuda.FloatTensor, tuple, tuple, tuple, int, bool), but expected (torch.cuda.RealTensor input, torch.cuda.RealTensor weight, torch.cuda.RealTensor bias, torch.cuda.RealTensor output, std::vector pad, std::vector stride, std::vector dilation, int groups, bool benchmark, bool deterministic)
It appears that the models in torchvision now use adaptive pooling (adaptive_avg_pool2d) to remove the dependence on the input size: vision/densenet.py
Perhaps that would simplify the constructor a little and generalize the usage even more?
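A minimal sketch of what such a change could look like, assuming a classifier head that takes the final feature map and feeds a linear layer. `TinyHead` and its sizes are hypothetical and not taken from this repository; only `F.adaptive_avg_pool2d` is the call mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyHead(nn.Module):
    """Hypothetical classifier head: ReLU -> global average pool -> linear."""

    def __init__(self, num_features, num_classes):
        super().__init__()
        self.classifier = nn.Linear(num_features, num_classes)

    def forward(self, features):
        out = F.relu(features)
        # Pool every channel down to 1x1, whatever the incoming spatial size is.
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        return self.classifier(out)


head = TinyHead(num_features=16, num_classes=10)
small = head(torch.randn(2, 16, 8, 8))    # e.g. feature map from a 32x32 input
large = head(torch.randn(2, 16, 64, 64))  # e.g. feature map from a 256x256 input
print(small.shape, large.shape)
```

Because the pooled output is always 1x1 per channel, the linear layer's input size depends only on the channel count and not on the image size.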
Hi,
Thanks for your work!
Recently I upgraded my network to PyTorch 0.4 with your implementation of DenseNet, and I found that the new version is slower than before. I thought the shared memory would noticeably speed up the forward pass. In my application, predicting one subject took 9s on the 0.3.x version but now takes 11s.
The Dice metric also got worse than before. I found that the new code uses Kaiming normal initialization, whereas the old code used the default initialization (uniform?). I have tried setting all parameters as before, but it had no effect. Do you have any advice for me?
Thanks.
Hi, thanks for your great work!
I'm working on densenet169 these days, do you know where I can find the ImageNet pretrained weights for this efficient implementation? Or do you have any example code to show how to convert the other implementation's pretrained model to this one?
I have noticed #13, but it seems @ZhengRui didn't provide any example code, and I don't know where to start.
Hi. The link for the MxNet implementation provided in README is broken.
Hi,
I tested the 0.4 version and found that torch.utils.checkpoint uses more memory than your PyTorch 0.3 implementation, which used torch._C._cudnn_batch_norm_forward.
But I cannot find functions similar to torch._C._cudnn_batch_norm_forward in 0.4.
Do you plan to implement it for PyTorch 0.4?
Thanks for the great repo! I just found out that the transformation is applied before the train/validation split, which effectively augments the validation set. I don't have a neat solution for this, but maybe this gist https://gist.github.com/kevinzakka/d33bf8d6c7f06a9d8c76d97a7879f5cb could be a way out, although it has to load the same data twice.
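One way to avoid loading the data twice is to keep a single untransformed dataset and apply the transform per subset. The sketch below is plain Python with a toy list standing in for the dataset; `TransformedSubset`, `split_indices`, and the `augment` function are hypothetical names, not part of this repository or of torchvision.

```python
import random


class TransformedSubset:
    """View onto base[indices] that applies its own transform on access."""

    def __init__(self, base, indices, transform=None):
        self.base = base
        self.indices = list(indices)
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        sample, label = self.base[self.indices[i]]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, label


def split_indices(n, valid_fraction, seed=0):
    """Shuffle range(n) once and return (train_indices, valid_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_valid = int(n * valid_fraction)
    return indices[n_valid:], indices[:n_valid]


# Toy stand-in for an untransformed dataset of (sample, label) pairs.
raw = [(float(i), i % 10) for i in range(100)]
train_idx, valid_idx = split_indices(len(raw), valid_fraction=0.1)


def augment(x):
    # Stand-in for random crop / flip; only the training view uses it.
    return x + random.uniform(-0.5, 0.5)


train_set = TransformedSubset(raw, train_idx, transform=augment)
valid_set = TransformedSubset(raw, valid_idx, transform=None)
```

Since both views only implement `__len__` and `__getitem__`, they should be usable wherever a map-style dataset is expected.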
Thanks very much for sharing this implementation. I forked the code. It works great on PyTorch 0.3.1. But when I ran it with 0.4.0 (master version), I got the following error (I made some minor changes, so the line numbers won't match):
File "../networks/densenet_efficient.py", line 330, in forward
bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)
TypeError: Variable data has to be a tensor, but got torch.cuda.FloatStorage
It turned out that for this line:
bn_input_var = Variable(type(inputs[0])(storage).resize_(size), volatile=True)
In version 0.3.1 the inputs are FloatTensors, but in 0.4.0 they are Variables.
I am wondering what's the best way to update the code for 0.4.0?
Many thanks!
Hi, I have run into the same GPU memory problem. Is there a memory-efficient TensorFlow implementation?
Thanks very much!
Hi, thanks for the implementation.
I find one weird thing using the multi-GPU option. I am trying to replicate the DenseNet-190-40 model which in the DenseNet paper produced the best result on CIFAR dataset. Using a minibatch size of 256, I can train with >= 4 GPUs but not with 2 GPUs, which indicates the multi-GPU version is working properly. However, I also find that per minibatch training time is about 30% slower with 8 GPUs vs. 4 GPUs. Is this expected and do you know what is slowing down the training?
Thanks!
the efficient_densenet_bottleneck_test.py failed in test_backward_computes_backward_pass
> assert(almost_equal(layer.conv.weight.grad.data, layer_efficient.conv_weight.grad.data))
E assert False
E + where False = almost_equal(\n(0 ,0 ,.,.) = \n 0.3746\n\n(0 ,1 ,.,.) = \n 70.7402\n\n(0 ,2 ,.,.) = \n 68.3647\n\n(0 ,3 ,.,.) = \n 5.2501\n\n(0 ,4 ,.,...) = \n 101.7459\n\n(3 ,6 ,.,.) = \n 10.9038\n\n(3 ,7 ,.,.) = \n 0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n, \n(0 ,0 ,.,.) = \n 0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n 2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n 0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n)
E + where \n(0 ,0 ,.,.) = \n 0.3746\n\n(0 ,1 ,.,.) = \n 70.7402\n\n(0 ,2 ,.,.) = \n 68.3647\n\n(0 ,3 ,.,.) = \n 5.2501\n\n(0 ,4 ,.,...) = \n 101.7459\n\n(3 ,6 ,.,.) = \n 10.9038\n\n(3 ,7 ,.,.) = \n 0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n 0.3746\n\n(0 ,1 ,.,.) = \n 70.7402\n\n(0 ,2 ,.,.) = \n 68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n 101.7459\n\n(3 ,6 ,.,.) = \n 10.9038\n\n(3 ,7 ,.,.) = \n 0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
E + where Variable containing:\n(0 ,0 ,.,.) = \n 0.3746\n\n(0 ,1 ,.,.) = \n 70.7402\n\n(0 ,2 ,.,.) = \n 68.3647\n\n(0 ,3 ,.,.) = \n ...) = \n 101.7459\n\n(3 ,6 ,.,.) = \n 10.9038\n\n(3 ,7 ,.,.) = \n 0.0000\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n 0.0978\n\n(0 ,1 ,.,.) = \n 1.9624\n\n(0 ,2 ,.,.) = \n 2.4802\n\n(0 ,3 ,.,.) = \n 1.06...5 ,.,.) = \n 0.4832\n\n(3 ,6 ,.,.) = \n 1.0052\n\n(3 ,7 ,.,.) = \n 1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
E + where Parameter containing:\n(0 ,0 ,.,.) = \n 0.0978\n\n(0 ,1 ,.,.) = \n 1.9624\n\n(0 ,2 ,.,.) = \n 2.4802\n\n(0 ,3 ,.,.) = \n 1.06...5 ,.,.) = \n 0.4832\n\n(3 ,6 ,.,.) = \n 1.0052\n\n(3 ,7 ,.,.) = \n 1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False).weight
E + where Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False) = Sequential (\n (norm): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True)\n (relu): ReLU (inplace)\n (conv): Conv2d(8, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)\n).conv
E + and \n(0 ,0 ,.,.) = \n 0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,.) = \n 2.1138e+21\n\n(...-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n 0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Variable containing:\n(0 ,0 ,.,.) = \n 0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n 0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.data
E + where Variable containing:\n(0 ,0 ,.,.) = \n 0.0000e+00\n\n(0 ,1 ,.,.) = \n -2.0594e+24\n\n(0 ,2 ,.,.) = \n -9.6653e+20\n\n(0 ,3 ,.,....-1.5375e+00\n\n(3 ,6 ,.,.) = \n -7.0127e-03\n\n(3 ,7 ,.,.) = \n 0.0000e+00\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = Parameter containing:\n(0 ,0 ,.,.) = \n 0.0978\n\n(0 ,1 ,.,.) = \n 1.9624\n\n(0 ,2 ,.,.) = \n 2.4802\n\n(0 ,3 ,.,.) = \n 1.06...5 ,.,.) = \n 0.4832\n\n(3 ,6 ,.,.) = \n 1.0052\n\n(3 ,7 ,.,.) = \n 1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n.grad
E + where Parameter containing:\n(0 ,0 ,.,.) = \n 0.0978\n\n(0 ,1 ,.,.) = \n 1.9624\n\n(0 ,2 ,.,.) = \n 2.4802\n\n(0 ,3 ,.,.) = \n 1.06...5 ,.,.) = \n 0.4832\n\n(3 ,6 ,.,.) = \n 1.0052\n\n(3 ,7 ,.,.) = \n 1.7624\n[torch.cuda.FloatTensor of size 4x8x1x1 (GPU 0)]\n = _EfficientDensenetBottleneck (\n).conv_weight
I uncommented the code in densenet_efficient.py
self.efficient_batch_norm.training = False,
but the issue persists.
The reduce() builtin was dropped in Python 3. https://docs.python.org/3.0/whatsnew/3.0.html#builtins There is still an implementation in functools, but the advice is to use a loop instead. How should https://github.com/gpleiss/efficient_densenet_pytorch/blob/master/models/densenet_efficient.py#L146 be modified to work in both Python 2 and Python 3?
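For reference, `functools.reduce` exists in Python 2.6+ as well as in Python 3, so an explicit import is one portable option; an explicit loop is the other. What the linked line actually reduces over is not shown here, so the `num_elements` helper below is a hypothetical example (product of the dimensions of a size tuple).

```python
from functools import reduce  # importable on Python 2.6+ and on Python 3
import operator


def num_elements(size):
    """Product of all dimensions, written with functools.reduce."""
    return reduce(operator.mul, size, 1)


def num_elements_loop(size):
    """Same result with an explicit loop, as the Python 3 notes suggest."""
    total = 1
    for dim in size:
        total *= dim
    return total


print(num_elements((4, 8, 1, 1)), num_elements_loop((4, 8, 1, 1)))
```

Both versions return the same value; passing the initial value `1` to `reduce` keeps the empty-tuple case well defined.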
Hi @gpleiss,
I really appreciate your great work. But when I try your code on 448*448 images, a network of depth 40 does not run with a batch size of 10, so a network of depth 101 cannot be trained either. My GPU has 11GB of memory.
Can you help me?
I tried to run the demo.py file, and got the error :
('The function received no value for the required argument:', 'data')
Any plans to make pretrained models available?
Or is it possible to use the pretrained models from https://github.com/liuzhuang13/DenseNet#results-on-imagenet-and-pretrained-models?
Thanks in advance,
I tried to train on multiple GPUs, and after a lot of tries I found that DenseNetEfficientMulti does not give the same output as DenseNetEfficient. In fact, DenseNetEfficientMulti's output depends on how many device_ids are specified in DataParallel. However, when only one GPU is specified, it does give the same result as DenseNetEfficient. And when the selected GPUs are fixed, DenseNetEfficientMulti's outputs are also fixed.
If I just use DenseNetEfficient for the multi-GPU case, it reports Tensors are on different gpus. I guess some buffer causes this issue.
@taineleau @gpleiss would you either make DenseNetEfficient support the multi-GPU case, or fix the bug in DenseNetEfficientMulti? I think DenseNetEfficientMulti might have some bugs, as its output depends on the number of GPUs used. Any insights to solve this would be helpful, thanks.
Hello,
I am a beginner in python and pyTorch and am trying to use your densenet efficient implementation on a different dataset than CIFAR (images are 80 pixels wide, instead of 32). I use a windows 10 laptop with the experimental pyTorch port on Windows by peterjc123 (see pytorch/pytorch#494).
I have incorporated your DenseNetEfficient model in a training script adapted from andreasveit's densenet implementation for pyTorch and replaced the CIFAR datasets loaders with datasets ImageFolder as follows:
train_loader = torch.utils.data.DataLoader(
datasets.ImageFolder(root=args.dataroot + '/train', transform=transform_train), batch_size=args.batch_size, shuffle=True, **kwargs)
When launching the training script, I get an error that is cryptic to me:
Traceback (most recent call last):
File "train.py", line 312, in
main()
File "train.py", line 153, in main
train(train_loader, model, criterion, optimizer, epoch)
File "train.py", line 185, in train
output = model(input_var)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "D:\deepLearning\densenet\densenetEfficient.py", line 213, in forward
out = self.classifier(out)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\linear.py", line 54, in forward
return self.backend.Linear.apply(input, self.weight, self.bias)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn_functions\linear.py", line 12, in forward
output.addmm(0, 1, input, weight.t())
RuntimeError: size mismatch at d:\downloads\pytorch-master-1\torch\lib\thc\generic/THCTensorMathBlas.cu:243
I am surely doing something wrong but searched a lot and did not find anything,
Any recommendation would be welcome,
Thanks a lot,
Christophe
I checked the small_inputs parameter in the model, but I couldn't find the corresponding loading options in the demo code.
Hi,
In demo.py#L212, the mean value and stdv value are given directly:
mean = [0.5071, 0.4867, 0.4408]
stdv = [0.2675, 0.2565, 0.2761]
but when I use the compute-cifar10-mean.py to calculate them, I get the result as follows:
means: [0.53129727, 0.52593911, 0.52069134]
stdevs: [0.28938246, 0.28505746, 0.27971658]
These two results are clearly different. Can you tell me how the mean and stdv in the original demo were calculated?
Thanks!
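For reference, per-channel statistics are usually computed over every pixel of every training image after scaling pixel values to [0, 1]. A minimal sketch, assuming the images are already stacked in a float tensor shaped (N, C, H, W); `channel_stats` is a hypothetical helper, and the random tensor only stands in for a real training set, so it says nothing about where the values in demo.py came from.

```python
import torch


def channel_stats(images):
    """Per-channel mean and std for a float tensor shaped (N, C, H, W)."""
    # Put channels first and flatten the rest: result is (C, N*H*W).
    flat = images.transpose(0, 1).reshape(images.size(1), -1)
    return flat.mean(dim=1), flat.std(dim=1)


torch.manual_seed(0)
fake_images = torch.rand(64, 3, 32, 32)  # stand-in for real data in [0, 1]
mean, stdv = channel_stats(fake_images)
print(mean.tolist(), stdv.tolist())
```

Differences from published values can come from using a different dataset or split, or from skipping the scaling to [0, 1], so those are worth checking first.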
Are there pre-trained models ready to use?
If so, please tell me which version. Thanks.
Hi,
Thanks for your code. I read both the single-gpu and multi-gpus codes. For the single-gpu version, you create the shared memory inside each dense block. Could all the dense blocks share the same memory and you only allocate one block of space? I think it should further reduce the space usage.
For the multi-GPU version, you create the shared memory in the initialization method of the whole network, i.e. one level above the dense block initialization. However, you register a buffer for the shared memory inside each dense block, which is done by
self.register_buffer('CatBN_output_buffer', self.storage)
Does this mean each dense block has independent shared memory? If so, why don't you let them share the same area?
Thanks
Hi, thanks for this efficient densenet code.
But I found a probable mistake in demo.py at line 112:
error = 1 - torch.eq(predictions_var, target_var).float().mean()
it might have to be corrected to:
error = 1 - torch.eq(predictions_var.view(-1), target_var).float().mean()
Because the sizes of predictions_var and target_var are (train_size, 1) and (train_size,), torch.eq(...) broadcasts them and returns a train_size * train_size matrix. Its entry (i, j) is 1 only where prediction i equals target j, so the mean no longer measures per-sample accuracy and the reported error rate cannot decrease properly.
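The shape issue can be reproduced on a tiny example. The logits and targets below are made up for illustration; `torch.topk` is used only as an assumed way of obtaining predictions with shape (N, 1).

```python
import torch

logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.1, 0.2, 3.0],
                       [1.0, 0.5, 0.2]])
target = torch.tensor([0, 1, 2, 1])

# topk with k=1 keeps the last dimension, so predictions has shape (4, 1).
_, predictions = torch.topk(logits, 1)

# (4, 1) against (4,) broadcasts to (4, 4): every prediction vs every target.
broadcast_eq = torch.eq(predictions, target)
broadcast_error = 1 - broadcast_eq.float().mean().item()

# Flattening first compares prediction i with target i only.
elementwise_eq = torch.eq(predictions.view(-1), target)
correct_error = 1 - elementwise_eq.float().mean().item()

print(broadcast_eq.shape, elementwise_eq.shape)
print(broadcast_error, correct_error)
```

Three of the four samples are classified correctly here, so the flattened comparison gives an error of 0.25, while the broadcast version averages over all 16 pairs.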
Hi,
The table shows a comparison of speed (sec/mini-batch).
Is that speed the training time?
I wonder whether the inference speed of efficient one is also slower than the naive implementation.
Did you compare the inference time, too?
Hi, thanks for this efficient densenet code.
I have run into a problem.
I am trying to train the model with different input sizes.
If the size is smaller than 64 * 64 there is no problem,
but if the size is bigger than 64 * 64 the following error appears:
RuntimeError: size mismatch at /pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:247
Can you point me to how to fix it?
Thank you so much.
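A size mismatch in the final linear layer is what one would expect if the last pooling step uses a fixed kernel. The arithmetic below is only a sketch under assumed CIFAR-style settings (stride-1 first convolution, two transitions that each halve the spatial size, final average pool with kernel 8); these numbers and the helper names are assumptions for illustration, not read from the repository.

```python
def final_spatial_size(input_size, num_transitions=2, final_pool=8):
    """Side length of the feature map that reaches the classifier."""
    size = input_size
    for _ in range(num_transitions):
        size //= 2  # each transition: 2x2 average pool with stride 2
    return size // final_pool  # fixed-kernel average pool before the classifier


def flattened_features(input_size, channels):
    """Number of values per sample after flattening the pooled feature map."""
    side = final_spatial_size(input_size)
    return channels * side * side


for s in (32, 63, 64, 256):
    print(s, final_spatial_size(s), flattened_features(s, channels=342))
```

Under these assumptions, inputs from 32 up to 63 pixels wide end at 1x1, so the classifier sees exactly `channels` values; at 64 pixels the map is 2x2 and the linear layer receives four times as many values as it was built for.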
Hi,
Would it be possible to also have a test example in the demo? (not only train)
When I try to use the network I trained, the following command:
outputs = model(torch.autograd.Variable(images))
Throws this kind of error message:
File "evaluate.py", line 65, in main
outputs = model(torch.autograd.Variable(images))
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "D:\deepLearning\densenet\densenetEfficient.py", line 205, in forward
features = self.features(x)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\container.py", line 64, in forward
input = module(input)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 206, in call
result = self.forward(*input, **kwargs)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\modules\conv.py", line 237, in forward
self.padding, self.dilation, self.groups)
File "D:\deepLearning\Anaconda\lib\site-packages\torch\nn\functional.py", line 43, in conv2d
return f(input, weight, bias)
RuntimeError: expected CPU tensor (got CUDA tensor)
Thanks in advance,
Christophe
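This message generally means the input tensor and the model's weights are on different devices. A minimal inference sketch, written against a recent PyTorch API (`torch.device`, `torch.no_grad`, `nn.Flatten`) rather than the `Variable` API in the traceback; the small `nn.Sequential` model is a hypothetical stand-in for the trained network.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the trained DenseNet.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model = model.to(device)
model.eval()  # use running BatchNorm statistics and disable dropout

images = torch.randn(4, 3, 32, 32)  # a batch as it comes from a DataLoader (CPU)
with torch.no_grad():  # no autograd bookkeeping during evaluation
    outputs = model(images.to(device))  # move the input to the model's device
predictions = outputs.argmax(dim=1)
print(predictions.cpu().tolist())
```

The key point is that `model.to(device)` and `images.to(device)` use the same `device`, whichever one it is.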
Hi, it works in a Python 2 environment but fails in Python 3.
*** Error in `/home/tengbq/.virtualenvs/py3/bin/python': free(): invalid next size (fast): 0x00007f5f8d34c3b0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f640b7e57e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f640b7ede0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f640b7f198c]
/usr/local/cuda-8.0/lib64/libcudnn.so.6(cudnnDestroyConvolutionDescriptor+0x9)[0x7f63772f4c69]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0x2dfe17)[0x7f6364044e17]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f6364f16834]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch8autograd11ConvForward5applyERKSt6vectorINS0_8VariableESaIS3_EE+0x1192)[0x7f6364287712]
/home/tengbq/.virtualenvs/py3/local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0x410c7e)[0x7f6364175c7e]
/home/tengbq/.virtualenvs/py3/bin/python(_PyObject_FastCallDict+0x8b)[0x55a279d841bb]
/home/tengbq/.virtualenvs/py3/bin/python(+0x19cd3e)[0x55a279e11d3e]
/home/tengbq/.virtualenvs/py3/bin/python(_PyEval_EvalFrameDefault+0x30a)[0x55a279e3619a]
/home/tengbq/.virtualenvs/py3/bin/python(+0x1959a6)[0x55a279e0a9a6]
/home/tengbq/.virtualenvs/py3/bin/python(+0x196a11)[0x55a279e0ba11]
Amazon p3.2xlarge: 1 GPUs - Tesla V100 -- GPU Memory: 16GB -- Batch Size = 64
If efficient = False:
Error: RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 15.75 GiB total capacity; 14.71 GiB already allocated; 4.88 MiB free; 4.02 MiB cached)
If efficient = True:
Error: RuntimeError: CUDA out of memory. Tried to allocate 61.25 MiB (GPU 0; 15.75 GiB total capacity; 14.65 GiB already allocated; 50.88 MiB free; 5.33 MiB cached)
Amazon g3.4xlarge: 1 GPUs - Tesla M60 -- GPU Memory: 8GB -- Batch Size = 64
If efficient = False:
RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)
If efficient = True:
RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 7.44 GiB total capacity; 6.98 GiB already allocated; 25.81 MiB free; 5.57 MiB cached)
Hi, Thanks for this implementation ! I'm wondering how to obtain the quite strong test set result on CIFAR-10, as reported in the original densenet paper (e.g., error rate <=3.5 on C-10+, with depth =190, growth_rate = 40). When I run the script as:
CUDA_VISIBLE_DEVICES=0,1,2,3 python demo.py --depth 190 --efficient False --data ./data --save ./ckpts
The final test error is reported as 0.0535. I'm wondering whether the high error is because no data augmentation is performed in the default setting. May I know whether this is the C10+ dataset or C10?
Best
Environment:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp


def _cat_function_factory(conv, relu):
    # Build the function that gets checkpointed: concat -> conv -> relu.
    def cat_function(*inputs):
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = relu(conv(concated_features))
        return bottleneck_output
    return cat_function


class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate, 1))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.drop_rate = drop_rate

    def forward(self, *inputs):
        cat_function = _cat_function_factory(self.conv1, self.relu1)
        # Only checkpoint when gradients are needed; otherwise run directly.
        if any(feature.requires_grad for feature in inputs):
            output = cp.checkpoint(cat_function, *inputs)
        else:
            output = cat_function(*inputs)
        new_features = self.relu2(self.conv2(output))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return new_features


class _DenseBlock(nn.Module):
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i * growth_rate,
                                growth_rate, bn_size, drop_rate)
            self.add_module(f'denselayer{i + 1}', layer)

    def forward(self, init_features):
        # Each layer receives all previous feature maps as separate inputs.
        features = [init_features]
        for name, layer in self.named_children():
            new_features = layer(*features)
            features.append(new_features)
        return torch.cat(features, 1)
It runs on a single GPU, but it throws a Segmentation fault (core dumped) error when running on multiple GPUs. What could be causing this issue?
I just want to benchmark the new implementation of efficient densenet with the code here. However, it seems that the used checkpointed modules are not broadcast to multiple GPUs as I got the following errors:
File "/home/changmao/efficient_densenet_pytorch/models/densenet.py", line 16, in bn_function
bottleneck_output = conv(relu(norm(concated_features)))
File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
self.training or not self.track_running_stats, self.momentum, self.eps)
File "/home/changmao/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1194, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
I think that the checkpoint feature provides weak support for nn.DataParallel.
Just tried the new implementation in PyTorch 0.3, but it consumes much more memory than the old implementation. Some issues:
When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) part in _SharedAllocation() requires some modification and optimization.
When the model runs on multiple GPUs, the batch size it can afford is much less than the single-GPU batch size times the number of GPUs. In my test it can only afford the same batch size as the single-GPU version.
Hi @gpleiss , thanks for this efficient densenet code.
Would you kindly implement an 'Efficient Conv3d Class'?
I would really appreciate it if you could give me some guidance on implementing this class based on your 'EfficientConv2d' class.
Thank you for your effort. Do you know where to find pre-trained weights for the PyTorch implementation?
AssertionError in test_forward_training_true_computes_forward_pass:
assert almost_equal(layer.norm.running_mean, layer_efficient.norm_running_mean)
assert almost_equal(layer.norm.running_var, layer_efficient.norm_running_var)
layer.norm.running_mean =
0.2516
0.0036
-0.6237
0.2686
-1.1193
1.2112
-0.0139
0.0237
[torch.FloatTensor of size 8]
layer_efficient.norm_running_mean=
0.2840
-0.0010
-0.7056
0.3032
-1.2588
1.3351
-0.0184
0.0538
[torch.FloatTensor of size 8]
layer.norm.running_var =
0.7604
0.6536
1.3444
0.1388
1.1254
0.1573
1.3377
0.9247
[torch.FloatTensor of size 8]
layer_efficient.norm_running_var =
0.7321
0.6162
1.3844
0.0518
1.1621
0.0664
1.3355
0.9229
[torch.FloatTensor of size 8]
Hi @gpleiss,
I was trying to train an ensemble of DenseNets_BC_100_12 on 2 NVIDIA K80 GPUs when I ran into the memory-efficiency problem. However, my research is sensitive to the number of parameters, and when I moved to this implementation they no longer match.
In this implementation file you can see how the number of parameters exactly matches the ones reported:
+-------------+-------------+-------+--------------+
| Model       | Growth Rate | Depth | M. of Params |
+-------------+-------------+-------+--------------+
| DenseNet    | 12          | 40    | 1.02         |
+-------------+-------------+-------+--------------+
| DenseNet    | 12          | 100   | 6.98         |
+-------------+-------------+-------+--------------+
| DenseNet    | 24          | 100   | 27.249       |
+-------------+-------------+-------+--------------+
| DenseNet-BC | 12          | 100   | 0.769        |
+-------------+-------------+-------+--------------+
| DenseNet-BC | 24          | 250   | 15.324       |
+-------------+-------------+-------+--------------+
| DenseNet-BC | 40          | 190   | 25.624       |
+-------------+-------------+-------+--------------+
However, in this other implementation, which follows your indications, the numbers are:
+-------------+-------------+-------+--------------+
| Model       | Growth Rate | Depth | M. of Params |
+-------------+-------------+-------+--------------+
| DenseNet-BC | 12          | 100   | 1.108        |
+-------------+-------------+-------+--------------+
| DenseNet-BC | 24          | 250   | 4.275        |
+-------------+-------------+-------+--------------+
| DenseNet-BC | 40          | 190   | 11.7         |
+-------------+-------------+-------+--------------+
Is there something else that needs to be taken care of that I am not seeing?
Thanks a lot in advance,
Pablo
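One way to narrow down such a mismatch is to count the parameters analytically and compare against `sum(p.numel() for p in model.parameters())`. The sketch below assumes the standard DenseNet-BC layout for CIFAR (2k initial channels, three equal dense blocks, bottleneck width 4k, compression 0.5, no convolution biases, 10 classes); `densenet_bc_params` is a hypothetical helper, not part of either implementation.

```python
def densenet_bc_params(depth, growth_rate, num_classes=10,
                       bn_size=4, compression=0.5, num_blocks=3):
    """Analytic parameter count for a CIFAR-style DenseNet-BC."""
    layers_per_block = (depth - 4) // (2 * num_blocks)
    channels = 2 * growth_rate
    total = 3 * channels * 3 * 3              # initial 3x3 conv on RGB, no bias
    for block in range(num_blocks):
        for _ in range(layers_per_block):
            mid = bn_size * growth_rate
            total += 2 * channels              # BatchNorm weight and bias
            total += channels * mid            # 1x1 bottleneck conv
            total += 2 * mid                   # BatchNorm weight and bias
            total += mid * growth_rate * 3 * 3 # 3x3 conv
            channels += growth_rate            # dense connectivity: concat
        if block != num_blocks - 1:
            out_channels = int(channels * compression)
            total += 2 * channels              # transition BatchNorm
            total += channels * out_channels   # transition 1x1 conv
            channels = out_channels
    total += 2 * channels                      # final BatchNorm
    total += channels * num_classes + num_classes  # linear classifier
    return total


for depth, k in ((100, 12), (250, 24), (190, 40)):
    print(depth, k, round(densenet_bc_params(depth, k) / 1e6, 3))
```

Under these assumptions the three DenseNet-BC configurations come out at 0.769, 15.324 and 25.624 million parameters, matching the first table, so a different count points to a difference in one of the assumed settings (for example initial channels, bottleneck width, or compression).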
Hi,
I am currently trying to understand your DenseNet code. The DenseNet paper states:
At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached.
But I am unable to find the softmax layer in your optimised code. Could you kindly explain how this has been implemented in the Densenet.py file?
Thanks
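One common reason a PyTorch model has no explicit softmax layer is that the loss applies it: `F.cross_entropy` combines `log_softmax` with the negative log-likelihood loss. Whether the demo trains with this loss is an assumption here. A small check of the equivalence on made-up logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)               # raw classifier outputs, no softmax
target = torch.tensor([3, 1, 4, 1, 5])

# Fused version: softmax is applied inside the loss.
loss_fused = F.cross_entropy(logits, target)

# Explicit version: log-softmax followed by negative log-likelihood.
loss_explicit = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Probabilities are only needed at inference time, if at all.
probs = F.softmax(logits, dim=1)

print(loss_fused.item(), loss_explicit.item())
```

Since softmax is monotonic, the predicted class is the same whether `argmax` is taken over the logits or over the probabilities.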
Hi, I'm using this repo in my project which I plan to release soon, but it seems illegal to use your code without your permission. Could you add an open source license to this repo so that I can include your copyright in my project? It should take less than one minute to do so. Thanks.
Is there an expert who can convert the PyTorch implementation to TensorFlow?
I ran the demo on CIFAR-10 and got the result shown below. The error is around 0.9 in all the epochs, and the final error is very high (0.897). Is this right?
Eval: (Epoch 300 of 300) [0016/0020] Time: 0.06535 (1.054) Loss: 0.47998 (0.388) Error: 0.89749 (0.897)
Eval: (Epoch 300 of 300) [0017/0020] Time: 0.06668 (1.120) Loss: 0.34982 (0.385) Error: 0.89259 (0.897)
Eval: (Epoch 300 of 300) [0018/0020] Time: 0.06468 (1.185) Loss: 0.29784 (0.380) Error: 0.89301 (0.897)
Eval: (Epoch 300 of 300) [0019/0020] Time: 0.06477 (1.250) Loss: 0.52827 (0.388) Error: 0.89644 (0.897)
Eval: (Epoch 300 of 300) [0020/0020] Time: 0.03533 (1.285) Loss: 0.39355 (0.389) Error: 0.89657 (0.897)
I am trying to use torch.Storage in my network. I use PyTorch 0.3.
if self.storage.size() < size:
    is_cuda = self.storage.is_cuda
    if is_cuda:
        gpu_ID = self.storage.get_device()
        print('gpu_ID1:', gpu_ID)
    self.storage.resize_(size)
    gpu_ID = self.storage.get_device()
    print('gpu_ID2:', gpu_ID)
    if is_cuda:
        self.storage = self.storage.cuda(gpu_ID)
        gpu_ID = self.storage.get_device()
        print('gpu_ID3:', gpu_ID)
The output is
gpu_ID1: 1
gpu_ID2: 0
gpu_ID3: 0
The self.storage comes from self.storage = torch.Storage(1024).
It seems the resize_ function changes the GPU on which the storage is kept.
I want the storage to stay on GPU 1 rather than GPU 0.
How can I do that?
Hi, I ran the demo in this repo without changing anything except the CIFAR data directory. However, it raised RuntimeError: tensors are on different GPUs. Then I tried the naive implementation with the command line flag --efficient=False, and it worked fine. By the way, the efficient implementation works well after I modify the code to not use torch.nn.DataParallel. Did you actually test your implementation on multiple GPUs? I guess that since you manually change the behaviour of the gradient flow, something goes wrong...
Hi, thank you for your great work! I can only find a BatchNorm layer before the final classifier. Should there be a global average pooling layer before the classifier? Sorry if this is a simple question. Looking forward to your reply.