
Comments (19)

daniyar-niantic avatar daniyar-niantic commented on September 7, 2024 4

Hi Debapriya,

The results in the paper were generated with a single-GPU code and modifying the published code to enable multi-GPU training may hinder replicability, so we are not planning to enable it.

But, if you need multi-GPU training, you can apply the following fix. The main reason the DepthDecoder does not get replicated is that the convolutions (and their parameters) are stored in a self.convs dictionary; PyTorch does not detect them there, even when the dict's values are also wrapped in an nn.ModuleList. So, instead of registering layers via self.convs as
self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out),
you can try registering them as attributes:
setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out)).

This would mean that calling these convolutions would require changing
x = self.convs[("upconv", i, 0)](x)
to
x = getattr(self, "upconv_{}_0".format(i))(x).

The same modifications need to happen to ("upconv", i, 1) and ("disp", i) convolutional blocks.
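To make the idea concrete, here is a minimal, torch-free sketch of why the registration style matters. The toy `Module` below mimics, in a very simplified way, how `nn.Module.__setattr__` records submodules: layers stored inside a plain dict attribute are invisible to that mechanism, while `setattr`-registered ones are found. All class and method names here are illustrative stand-ins, not the real PyTorch API.

```python
class Module:
    """Toy stand-in for torch.nn.Module's attribute-registration behaviour."""
    def __init__(self):
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Module):        # submodules are recorded here,
            self._modules[name] = value      # roughly like nn.Module does
        object.__setattr__(self, name, value)

    def children(self):
        return list(self._modules)

class ConvBlock(Module):
    pass

class BadDecoder(Module):
    def __init__(self):
        super().__init__()
        self.convs = {}                      # plain dict: never registered
        for i in range(2):
            self.convs[("upconv", i, 0)] = ConvBlock()

class GoodDecoder(Module):
    def __init__(self):
        super().__init__()
        for i in range(2):                   # attribute per layer: registered
            setattr(self, "upconv_{}_0".format(i), ConvBlock())

print(BadDecoder().children())   # [] -- the dict-stored layers are invisible
print(GoodDecoder().children())  # ['upconv_0_0', 'upconv_1_0']

# Lookup at forward time mirrors the registration:
good = GoodDecoder()
print(isinstance(getattr(good, "upconv_{}_0".format(1)), ConvBlock))  # True
```

The real consequence in PyTorch is analogous: replication for multi-GPU training only copies what the module system knows about, so dict-held layers get left behind.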

Regards,

Daniyar

from monodepth2.

daniyar-niantic avatar daniyar-niantic commented on September 7, 2024 2

Oh, you should also convert dispconv operations in the decoder to getattr and setattr operations.


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024 1

@cswwp glad it worked. I noticed the discrepancy after re-reading my own replies, which I no longer remembered:

The same modifications need to happen to ("upconv", i, 1) and ("disp", i) convolutional blocks.


debapriyamaji avatar debapriyamaji commented on September 7, 2024

Hi Daniyar,
Thanks for the prompt reply.
It worked. I had to make the change in depth_decoder as well as pose_decoder.
nn.ModuleList is then no longer required.

Regards - Debapriya


mdfirman avatar mdfirman commented on September 7, 2024

Thanks for the update, I'm going to close this issue now.


danish87 avatar danish87 commented on September 7, 2024

Hi Debapriya,

The results in the paper were generated with a single-GPU code and modifying the published code to enable multi-GPU training may hinder replicability, so we are not planning to enable it.

But, if you need multi-GPU training, you can apply the following fix. The main reason the DepthDecoder does not get replicated is that the convolutions (and their parameters) are stored in a self.convs dictionary; PyTorch does not detect them there, even when the dict's values are also wrapped in an nn.ModuleList. So, instead of registering layers via self.convs as
self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out),
you can try registering them as attributes:
setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out)).

This would mean that calling these convolutions would require changing
x = self.convs[("upconv", i, 0)](x)
to
x = getattr(self, "upconv_{}_0".format(i))(x).

The same modifications need to happen to ("upconv", i, 1) and ("disp", i) convolutional blocks.

Regards,

Daniyar

Hi Daniyar

I hope you are in good health. Thank you for helping us out and for the great responses.
Please see this code, modified as you mentioned in the related issue, and suggest some fixes.

class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        self.convs = OrderedDict()
        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            setattr(self, "upconv_{}_0".format(i), ConvBlock(num_ch_in, num_ch_out))
            #self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

        for s in self.scales:
            self.convs[("dispconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            #x = self.convs[("upconv", i, 0)](x)
            x = getattr(self, "upconv_{}_0".format(i))(x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)
            x = getattr(self, "upconv_{}_1".format(i))(x)
            #x = self.convs[("upconv", i, 1)](x)
            if i in self.scales:
                #self.outputs[("disp", i)] = self.sigmoid(self, "dispconv_{}".format(i)(x))
                setattr(self, "disp_{}".format(i), self.sigmoid(self, "dispconv_{}".format(i))(x))
        return self.outputs

This gives me an error
RuntimeError: Error(s) in loading state_dict for DepthDecoder:
size mismatch for decoder.1.conv.conv.weight: copying a param with shape torch.Size([256, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 256, 3, 3]).
size mismatch for decoder.1.conv.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for decoder.2.conv.conv.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 128, 3, 3]).
size mismatch for decoder.2.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for decoder.3.conv.conv.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 96, 3, 3]).
size mismatch for decoder.3.conv.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for decoder.4.conv.conv.weight: copying a param with shape torch.Size([64, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([16, 16, 3, 3]).
size mismatch for decoder.4.conv.conv.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([16]).

Thank you
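One plausible reading of the size mismatches above (an assumption, not verified against this checkpoint): `self.decoder = nn.ModuleList(list(self.convs.values()))` derives the `decoder.0`, `decoder.1`, … state_dict keys from whatever remains in `self.convs`, so moving only the `("upconv", i, 0)` blocks out to `setattr` shifts every surviving block to a different index than the one it had when the checkpoint was saved. A torch-free sketch of that index shift:

```python
# Insertion order of self.convs in the original DepthDecoder: the loop runs
# i = 4..0 adding ("upconv", i, 0) then ("upconv", i, 1), followed by the
# ("dispconv", s) blocks for s = 0..3.
original = []
for i in range(4, -1, -1):
    original.append(("upconv", i, 0))
    original.append(("upconv", i, 1))
original += [("dispconv", s) for s in range(4)]

# After moving only the ("upconv", i, 0) blocks to setattr, self.convs keeps:
modified = [("upconv", i, 1) for i in range(4, -1, -1)]
modified += [("dispconv", s) for s in range(4)]

# The same block now sits at a different decoder.N position, so weights of
# mismatched shapes get paired up when loading the checkpoint.
print(original.index(("upconv", 3, 1)))  # 3
print(modified.index(("upconv", 3, 1)))  # 1
```

Under this reading, converting all three kinds of blocks consistently (as suggested earlier in the thread) would keep the key layout self-consistent, but a checkpoint saved from the unmodified code would still not line up, which matches the maintainer's caveat about replicability.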


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

Hi @danish87, Are you loading a pre-trained model using multi-gpu code? As I said earlier, modifying the published code to enable multi-GPU training may hinder replicability.


cswwp avatar cswwp commented on September 7, 2024

Hi @daniyar-niantic, thanks a lot for sharing. I tried modifying DepthDecoder.py as you said, then replaced 'self.models["encoder"].to(self.device)' with 'self.models["encoder"] = nn.DataParallel(self.models["encoder"]).cuda()' and likewise replaced 'self.models["depth"].to(self.device)' with 'self.models["depth"] = nn.DataParallel(self.models["depth"]).cuda()', but running it produces this error:

'RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)'

It seems the data's device does not match the model's. Any suggestions are welcome, or please share the approach that worked for you.


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

@cswwp are you trying to load a trained model, or train from scratch? See my comment above, getting a trained model to load correctly on multiple GPUs is outside the scope of this repo.


cswwp avatar cswwp commented on September 7, 2024

@cswwp are you trying to load a trained model, or train from scratch? See my comment above, getting a trained model to load correctly on multiple GPUs is outside the scope of this repo.

@daniyar-niantic I just want to train from scratch with multiple GPUs; I didn't load a pretrained model. But it still can't run on multiple GPUs.


daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

Ok, try doing the following

self.models["encoder"] = nn.DataParallel(self.models["encoder"])
self.models["depth"] = nn.DataParallel(self.models["depth"])
self.models["encoder"].cuda()
self.models["depth"].cuda()


cswwp avatar cswwp commented on September 7, 2024

Ok, try doing the following

self.models["encoder"] = nn.DataParallel(self.models["encoder"])
self.models["depth"] = nn.DataParallel(self.models["depth"])
self.models["encoder"].cuda()
self.models["depth"].cuda()

@daniyar-niantic I modified the code in trainer.py as you said, but it still gives this error:

[screenshots of the modified trainer.py and the resulting error]

daniyar-niantic avatar daniyar-niantic commented on September 7, 2024

Could you try calling cuda() on the encoder and decoder after both of them have been wrapped in nn.DataParallel, as in my snippet?


cswwp avatar cswwp commented on September 7, 2024

[screenshot of the modified trainer.py]

I still get the error doing it as you say; I'm confused @daniyar-niantic


cswwp avatar cswwp commented on September 7, 2024

Even without commenting out self.models["encoder"].to(self.device), I get:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)


cswwp avatar cswwp commented on September 7, 2024

@daniyar-niantic Yes, it works, really thanks for your patience and kindly help.


cswwp avatar cswwp commented on September 7, 2024

glad it worked. I noticed the discrepancy after reading through my own replies that I don't remember anymore:

Yes, you did say it; I had only looked at the upconv part. I need to pay more attention to the comments.


zshn25 avatar zshn25 commented on September 7, 2024

@daniyar-niantic, I get the same error when trying DistributedDataParallel. I made all the changes you mentioned above, but it gives me the following error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)


sunnyHelen avatar sunnyHelen commented on September 7, 2024

I did the steps as you discussed, but it seems everything is placed on GPU 0 and nothing on the other GPUs. With a larger batch size, an OOM error occurs immediately while the other GPUs show no occupation. Could you please help me figure this out?

