
Comments (19)

daniyar-niantic avatar daniyar-niantic commented on September 7, 2024 4

Hi Debapriya,

The results in the paper were generated with a single-GPU code and modifying the published code to enable multi-GPU training may hinder replicability, so we are not planning to enable it.

But, if you need multi-GPU training, you can apply the following fix. The main reason the DepthDecoder does not get replicated is that the convolutions (and their parameters) are stored in a self.convs dictionary; PyTorch does not detect them there, even when the dict's values are also wrapped in an nn.ModuleList. So, instead of registering layers via self.convs as
self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out),
you can try registering them as attributes:
setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out)).

This would mean that calling these convolutions would require changing
x = self.convs[("upconv", i, 0)](x)
to
x = getattr(self, "upconv_{}_0".format(i))(x).

The same modifications need to happen to ("upconv", i, 1) and ("disp", i) convolutional blocks.
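To make the idea concrete, here is a minimal, torch-free sketch of why the registration style matters. The toy `Module` below mimics, in a very simplified way, how `nn.Module.__setattr__` records submodules: layers stored inside a plain dict attribute are invisible to that mechanism, while `setattr`-registered ones are found. All class and method names here are illustrative stand-ins, not the real PyTorch API.

```python
class Module:
    """Toy stand-in for torch.nn.Module's attribute-registration behaviour."""
    def __init__(self):
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Module):        # submodules are recorded here,
            self._modules[name] = value      # roughly like nn.Module does
        object.__setattr__(self, name, value)

    def children(self):
        return list(self._modules)

class ConvBlock(Module):
    pass

class BadDecoder(Module):
    def __init__(self):
        super().__init__()
        self.convs = {}                      # plain dict: never registered
        for i in range(2):
            self.convs[("upconv", i, 0)] = ConvBlock()

class GoodDecoder(Module):
    def __init__(self):
        super().__init__()
        for i in range(2):                   # attribute per layer: registered
            setattr(self, "upconv_{}_0".format(i), ConvBlock())

print(BadDecoder().children())   # [] -- the dict-stored layers are invisible
print(GoodDecoder().children())  # ['upconv_0_0', 'upconv_1_0']

# Lookup at forward time mirrors the registration:
good = GoodDecoder()
print(isinstance(getattr(good, "upconv_{}_0".format(1)), ConvBlock))  # True
```

The real consequence in PyTorch is analogous: replication for multi-GPU training only copies what the module system knows about, so dict-held layers get left behind.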

Regards,

Daniyar

from monodepth2.

daniyar-niantic avatar daniyar-niantic commented on September 7, 2024 2

Oh, you should also convert dispconv operations in the decoder to getattr and setattr operations.


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024 1

@cswwp glad it worked. I noticed the discrepancy after re-reading my own replies, which I no longer remembered:

The same modifications need to happen to ("upconv", i, 1) and ("disp", i) convolutional blocks.


debapriyamaji avatar debapriyamaji commented on September 7, 2024

Hi Daniyar,
Thanks for the prompt reply.
It worked. I had to make the change in depth_decoder as well as pose_decoder.
nn.ModuleList is then no longer required.

Regards - Debapriya


mdfirman avatar mdfirman commented on September 7, 2024

Thanks for the update, I'm going to close this issue now.


danish87 avatar danish87 commented on September 7, 2024

Hi Debapriya,

The results in the paper were generated with a single-GPU code and modifying the published code to enable multi-GPU training may hinder replicability, so we are not planning to enable it.

But, if you need multi-GPU training, you can apply the following fix. The main reason the DepthDecoder does not get replicated is that the convolutions (and their parameters) are stored in a self.convs dictionary; PyTorch does not detect them there, even when the dict's values are also wrapped in an nn.ModuleList. So, instead of registering layers via self.convs as
self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out),
you can try registering them as attributes:
setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out)).

This would mean that calling these convolutions would require changing
x = self.convs[("upconv", i, 0)](x)
to
x = getattr(self, "upconv_{}_0".format(i))(x).

The same modifications need to happen to ("upconv", i, 1) and ("disp", i) convolutional blocks.

Regards,

Daniyar

Hi Daniyar

I hope you are in good health. Thank you for helping us out and for the great responses.
Please see this code, modified as you mentioned in the related issue, and suggest some fixes.

class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        self.convs = OrderedDict()
        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out))
            #self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

        for s in self.scales:
            self.convs[("dispconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            #x = self.convs[("upconv", i, 0)](x)
            x = getattr(self, "upconv_{}_0".format(i))(x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)
            x = getattr(self, "upconv_{}_1".format(i))(x)
            #x = self.convs[("upconv", i, 1)](x)
            if i in self.scales:
                #self.outputs[("disp", i)] = self.sigmoid(self, "dispconv_{}".format(i)(x))
                setattr(self, "disp_{}".format(i), self.sigmoid(self, "dispconv_{}".format(i))(x))
        return self.outputs

This gives me an error
RuntimeError: Error(s) in loading state_dict for DepthDecoder:
size mismatch for decoder.1.conv.conv.weight: copying a param with shape torch.Size([256, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 256, 3, 3]).
size mismatch for decoder.1.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for decoder.2.conv.conv.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 128, 3, 3]).
size mismatch for decoder.2.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for decoder.3.conv.conv.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 96, 3, 3]).
size mismatch for decoder.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for decoder.4.conv.conv.weight: copying a param with shape torch.Size([64, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([16, 16, 3, 3]).
size mismatch for decoder.4.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([16]).

Thank you
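One plausible reading of the size mismatches above (an assumption, not verified against this checkpoint): `self.decoder = nn.ModuleList(list(self.convs.values()))` derives the `decoder.0`, `decoder.1`, … state_dict keys from whatever remains in `self.convs`, so moving only the `("upconv", i, 0)` blocks out to `setattr` shifts every surviving block to a different index than the one it had when the checkpoint was saved. A torch-free sketch of that index shift:

```python
# Insertion order of self.convs in the original DepthDecoder: the loop runs
# i = 4..0 adding ("upconv", i, 0) then ("upconv", i, 1), followed by the
# ("dispconv", s) blocks for s = 0..3.
original = []
for i in range(4, -1, -1):
    original.append(("upconv", i, 0))
    original.append(("upconv", i, 1))
original += [("dispconv", s) for s in range(4)]

# After moving only the ("upconv", i, 0) blocks to setattr, self.convs keeps:
modified = [("upconv", i, 1) for i in range(4, -1, -1)]
modified += [("dispconv", s) for s in range(4)]

# The same block now sits at a different decoder.N position, so weights of
# mismatched shapes get paired up when loading the checkpoint.
print(original.index(("upconv", 3, 1)))  # 3
print(modified.index(("upconv", 3, 1)))  # 1
```

Under this reading, converting all three kinds of blocks consistently (as suggested earlier in the thread) would keep the key layout self-consistent, but a checkpoint saved from the unmodified code would still not line up, which matches the maintainer's caveat about replicability.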


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

Hi @danish87, Are you loading a pre-trained model using multi-gpu code? As I said earlier, modifying the published code to enable multi-GPU training may hinder replicability.


cswwp avatar cswwp commented on September 7, 2024

Hi @daniyar-niantic, thanks a lot for sharing. I tried modifying DepthDecoder.py as you said, then replaced 'self.models["encoder"].to(self.device)' with 'self.models["encoder"] = nn.DataParallel(self.models["encoder"]).cuda()' and likewise replaced 'self.models["depth"].to(self.device)' with 'self.models["depth"] = nn.DataParallel(self.models["depth"]).cuda()', but running it produces this error:

'RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)'

It seems the data's device does not match the model's. Any suggestions are welcome, or please share the approach that worked for you.


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

@cswwp are you trying to load a trained model, or train from scratch? See my comment above, getting a trained model to load correctly on multiple GPUs is outside the scope of this repo.


cswwp avatar cswwp commented on September 7, 2024

@cswwp are you trying to load a trained model, or train from scratch? See my comment above, getting a trained model to load correctly on multiple GPUs is outside the scope of this repo.

@daniyar-niantic I just want to train from scratch with multiple GPUs; I didn't load a pretrained model. But it still can't run on multiple GPUs.


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

Ok, try doing the following

self.models["encoder"] = nn.DataParallel(self.models["encoder"])
self.models["depth"] = nn.DataParallel(self.models["depth"])
self.models["encoder"].cuda()
self.models["depth"].cuda()


cswwp avatar cswwp commented on September 7, 2024

Ok, try doing the following

self.models["encoder"] = nn.DataParallel(self.models["encoder"])
self.models["depth"] = nn.DataParallel(self.models["depth"])
self.models["encoder"].cuda()
self.models["depth"].cuda()

@daniyar-niantic I modified the code in trainer.py as you said, but it still gives this error:

[screenshots of the modified trainer.py and the resulting error]

daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

Could you try calling cuda() on the encoder and decoder after both of them have been wrapped in nn.DataParallel, as in my snippet?


cswwp avatar cswwp commented on September 7, 2024

[screenshot of the modified trainer.py]

I still get the error doing it as you say; I'm confused @daniyar-niantic


cswwp avatar cswwp commented on September 7, 2024

Even without commenting out self.models["encoder"].to(self.device), I get:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)


cswwp avatar cswwp commented on September 7, 2024

@daniyar-niantic Yes, it works, really thanks for your patience and kindly help.


cswwp avatar cswwp commented on September 7, 2024

glad it worked. I noticed the discrepancy after reading through my own replies that I don't remember anymore:

Yes, you did say it; I had only looked at the upconv part. I need to pay more attention to the comments.


zshn25 avatar zshn25 commented on September 7, 2024

@daniyar-niantic, I get the same error when trying DistributedDataParallel. I made all the changes you mentioned above, but it gives me the following error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)


sunnyHelen avatar sunnyHelen commented on September 7, 2024

I did the steps as you discussed, but it seems everything is placed on GPU 0 and nothing on the other GPUs. With a larger batch size, an OOM error occurs immediately while the other GPUs show no occupation. Could you please help me figure this out?

