Comments (19)
Hi Debapriya,
The results in the paper were generated with single-GPU code, and modifying the published code to enable multi-GPU training may hinder replicability, so we are not planning to enable it.
But if you need multi-GPU training, you can apply the following fix. The main reason the DepthDecoder does not get replicated is that the convolutions (and their associated parameters) are stored in a self.convs dictionary. PyTorch does not detect that, even when it is declared with nn.ModuleList. So, instead of using self.convs as
self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)
you can register the layers as attributes:
setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out))
This means that calling these convolutions requires changing
x = self.convs[("upconv", i, 0)](x)
to
x = getattr(self, "upconv_{}_0".format(i))(x)
The same modifications need to happen to the ("upconv", i, 1) and ("disp", i) convolutional blocks.
Regards,
Daniyar
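For illustration, here is a minimal, self-contained sketch of this registration pattern on a toy module (the ToyDecoder class and channel sizes below are made up for this example; they are not from monodepth2):

import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Toy example: register per-scale blocks via setattr so nn.Module tracks them."""
    def __init__(self, channels=(64, 32, 16)):
        super(ToyDecoder, self).__init__()
        for i in range(len(channels) - 1):
            # setattr makes each block a proper sub-module, so its parameters show up
            # in .parameters(), state_dict(), and get replicated by nn.DataParallel
            setattr(self, "upconv_{}_0".format(i),
                    nn.Conv2d(channels[i], channels[i + 1], 3, padding=1))
        self.n_blocks = len(channels) - 1

    def forward(self, x):
        for i in range(self.n_blocks):
            x = getattr(self, "upconv_{}_0".format(i))(x)
        return x

if __name__ == "__main__":
    toy = ToyDecoder()
    print([name for name, _ in toy.named_parameters()])  # upconv_0_0.weight, upconv_0_0.bias, ...
    print(toy(torch.randn(1, 64, 32, 32)).shape)          # torch.Size([1, 16, 32, 32])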
Oh, you should also convert the dispconv operations in the decoder to getattr and setattr operations.
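As a rough sketch of that conversion (assuming the Conv3x3 block, self.num_ch_dec, self.scales, and the self.outputs dictionary that appear in the decoder code quoted later in this thread):

# In __init__, instead of self.convs[("dispconv", s)] = ...:
for s in self.scales:
    setattr(self, "dispconv_{}".format(s),
            Conv3x3(self.num_ch_dec[s], self.num_output_channels))

# In forward, instead of indexing self.convs:
if i in self.scales:
    self.outputs[("disp", i)] = self.sigmoid(getattr(self, "dispconv_{}".format(i))(x))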
@cswwp glad it worked. I noticed the discrepancy after re-reading my own replies, which I no longer remembered:
The same modifications need to happen to the ("upconv", i, 1) and ("disp", i) convolutional blocks.
Hi Daniyar,
Thanks for the prompt reply.
It worked. I had to make the change in depth_decoder as well as in pose_decoder.
nn.ModuleList is no longer required then.
Regards - Debapriya
Thanks for the update, I'm going to close this issue now.
Hi Daniyar,
I hope you are in good health. Thanks for helping us out and for the great responses.
Please see this code, modified as you suggested in the relevant issue, and suggest some fixes.
class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        self.convs = OrderedDict()
        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out))
            # self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

        for s in self.scales:
            self.convs[("dispconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            # x = self.convs[("upconv", i, 0)](x)
            x = getattr(self, "upconv_{}_0".format(i))(x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)
            x = getattr(self, "upconv_{}_1".format(i))(x)
            # x = self.convs[("upconv", i, 1)](x)
            if i in self.scales:
                # self.outputs[("disp", i)] = self.sigmoid(self, "dispconv_{}".format(i)(x))
                setattr(self, "disp_{}".format(i), self.sigmoid(self, "dispconv_{}".format(i))(x))

        return self.outputs
This gives me an error
RuntimeError: Error(s) in loading state_dict for DepthDecoder:
size mismatch for decoder.1.conv.conv.weight: copying a param with shape torch.Size([256, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 256, 3, 3]).
size mismatch for decoder.1.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for decoder.2.conv.conv.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 128, 3, 3]).
size mismatch for decoder.2.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for decoder.3.conv.conv.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 96, 3, 3]).
size mismatch for decoder.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for decoder.4.conv.conv.weight: copying a param with shape torch.Size([64, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([16, 16, 3, 3]).
size mismatch for decoder.4.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([16]).
Thank you
Hi @danish87, are you loading a pre-trained model using the multi-GPU code? As I said earlier, modifying the published code to enable multi-GPU training may hinder replicability.
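As a debugging aid, a rough sketch for comparing parameter shapes between a checkpoint and the modified decoder (the checkpoint path and the depth_decoder instance below are placeholders, not names from the repo):

import torch

checkpoint = torch.load("models/depth.pth", map_location="cpu")  # placeholder path
model_state = depth_decoder.state_dict()                          # your modified decoder

for key, value in checkpoint.items():
    if key not in model_state:
        print("missing in model:", key)
    elif value.shape != model_state[key].shape:
        print("shape mismatch:", key, tuple(value.shape), "vs", tuple(model_state[key].shape))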
Hi @daniyar-niantic, thanks a lot for sharing. I tried modifying DepthDecoder.py as you said, and then replaced 'self.models["encoder"].to(self.device)' with 'self.models["encoder"] = nn.DataParallel(self.models["encoder"]).cuda()', and likewise replaced 'self.models["depth"].to(self.device)' with 'self.models["depth"] = nn.DataParallel(self.models["depth"]).cuda()', but running it gives an error:
'RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)'
It seems the data's device does not match the model's. Any suggestions are welcome, or please share how you got it working.
@cswwp are you trying to load a trained model, or train from scratch? See my comment above; getting a trained model to load correctly on multiple GPUs is outside the scope of this repo.
@cswwp are you trying to load a trained model, or train from scratch? See my comment above; getting a trained model to load correctly on multiple GPUs is outside the scope of this repo.
@daniyar-niantic I just want to train from scratch on multiple GPUs. I didn't load the pretrained model, just trained from scratch, but it still can't run on multiple GPUs.
Ok, try doing the following
self.models["encoder"] = nn.DataParallel(self.models["encoder"])
self.models["depth"] = nn.DataParallel(self.models["depth"])
self.models["encoder"].cuda()
self.models["depth"].cuda()
Ok, try doing the following
self.models["encoder"] = nn.DataParallel(self.models["encoder"]) self.models["depth"] = nn.DataParallel(self.models["depth"]) self.models["encoder"].cuda() self.models["depth"].cuda()
@daniyar-niantic I modified the code in trainer.py as you said, but it still gives this error.
Could you try calling cuda() on the encoder and decoder after both of them were cast to nn.DataParallel, as in my snippet?
Still the same error as I said, even with your snippet. I'm confused @daniyar-niantic
Even if I don't add a # before self.models["encoder"].to(self.device), I still get:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
@daniyar-niantic Yes, it works, really thanks for your patience and kind help.
glad it worked. I noticed the discrepancy after re-reading my own replies, which I no longer remembered:
Yes, you did say it; I had only looked at the upconv part, and I need to pay more attention to the comments.
@daniyar-niantic, I get the same error when trying DistributedDataParallel. I did all the changes you mentioned above. It gives me the following error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
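For reference, DistributedDataParallel normally needs per-process device pinning rather than the DataParallel wrapping above; a rough sketch of the standard PyTorch setup (not specific to monodepth2, meant to be launched with torchrun) looks like this:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU; torchrun sets the env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)                # pin this process to its own GPU

model = nn.Conv2d(3, 8, 3, padding=1).cuda(local_rank)  # toy model standing in for the real one
model = DDP(model, device_ids=[local_rank])      # one replica per process, each on its own device

x = torch.randn(2, 3, 64, 64).cuda(local_rank)   # inputs must live on this rank's GPU
y = model(x)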
I did the steps as you discussed, but it seems everything is placed on GPU 0 and nothing on the other GPUs. With a larger batch size it immediately hits an OOM error while the other GPUs remain idle. Could you please help me figure it out?
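One rough way to check whether anything is actually being placed on the other GPUs is to print per-device memory after a forward pass:

import torch

# After building the model and running one forward pass:
for i in range(torch.cuda.device_count()):
    print("GPU", i, round(torch.cuda.memory_allocated(i) / 1024**2, 1), "MiB allocated")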