zxcqlf / monovit
Self-supervised monocular depth estimation with a vision transformer
License: MIT License
Hello Zhaocq,
First of all, thank you very much for sharing your amazing work. I managed to integrate the model into the Monodepth training pipeline and train it using the pre-trained weights that you provided. However, the results were not great: for far objects/scenes the model did not keep a smooth disparity or properly interpret the scene. The top-right image corresponds to the ZED2 output depth.
My goal is to train it from "scratch" (starting from the ImageNet pre-trained models), as I would like to include a semi-supervised term in the loss function using GT depth (LiDAR or ZED2 depth) to obtain metric depth and pose and keep consistency for far elements of the image, inspired by this paper: https://arxiv.org/pdf/1910.01765.pdf. A sketch of the idea follows.
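To make the plan concrete, this is the kind of extra term I have in mind (a minimal sketch assuming sparse GT depth aligned with the image; the function name, depth range, and weighting are placeholders of mine, not code from this repo):

import torch

def supervised_depth_loss(pred_disp, gt_depth, min_depth=0.1, max_depth=100.0):
    """Hypothetical semi-supervised term: L1 on inverse depth, evaluated only
    where the sparse GT (LiDAR / ZED2) is valid. All names are placeholders."""
    valid = (gt_depth > min_depth) & (gt_depth < max_depth)  # mask out holes in sparse GT
    gt_inv = 1.0 / gt_depth.clamp(min=min_depth)             # compare in disparity space
    return (pred_disp[valid] - gt_inv[valid]).abs().mean()

# total_loss = photometric + smoothness + lambda_sup * supervised_depth_loss(disp, gt)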
Here I attach some sample images that I am using for training:
I am using the same parameters that you mention in the experiments section, where you explain how you trained from scratch using the ImageNet pre-trained models, combined with the information provided in the Monodepth and Monodepth2 papers.
When starting training, the first mini-batch output is as shown in the following image:
However, the following mini-batches from the same epoch look like this, which suggests the network is not training properly:
I am using the ZED2 camera, which has a baseline of 12 cm between the lenses, and I rectify the images before inference (see the geometry sketch below).
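For metric scale, depth follows from disparity via the standard stereo relation; a minimal sketch (fx_px is the rectified focal length in pixels, and the 0.12 m default matches the ZED2 baseline above):

def disparity_to_depth(disparity_px, fx_px, baseline_m=0.12):
    # pinhole stereo: depth = focal_length * baseline / disparity
    return fx_px * baseline_m / disparity_px

# e.g. fx = 700 px and a 20 px disparity give 700 * 0.12 / 20 = 4.2 m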
With this information, would you be able to tell what I may be missing from the papers and code that is preventing the model from training?
Thank you very much for your time.
I'm interested in evaluating monocular depth estimation solutions on the DrivingStereo dataset, but since the dataset is focused on stereo matching, I'm not yet sure how to do this. Your experiment implementation could serve as a reference for me and other people trying to conduct similar experiments.
Thanks in advance!
Dear author,
Thank you for your fantastic contribution. I noticed that you mention results on the DrivingStereo dataset in the paper. However, I cannot locate any corresponding code for this dataset. DrivingStereo is different from the KITTI dataset and does not have a toolkit for loading :(
Thank you for your time and assistance!
First of all, thank you for sharing your nice work.
Could you share the weights of the pose network in addition to the depth network?
Thank you.
Hi, thanks for your excellent work. But it seems a lot of code is missing from your trainer.py.
For training, please download monodepth2, replace the depth network, and revise the setting of the depth network, the optimizer and learning rate according to trainer.py.
I read the sentence above, but I don't know how to modify trainer.py. Can you help me?
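For anyone stuck at the same point, here is a minimal sketch of the replacement as I read the README, assuming the networks package from this repo exposes mpvit_small() and its DepthDecoder (the channel list below is the one quoted in later issues; none of this is the author's verified code):

# Inside monodepth2's Trainer.__init__, replacing the ResNet depth network:
self.models["encoder"] = networks.mpvit_small()            # was networks.ResnetEncoder(...)
self.models["encoder"].num_ch_enc = [64, 128, 216, 288, 288]
self.models["encoder"].to(self.device)

self.models["depth"] = networks.DepthDecoder()             # MonoViT's HR depth decoder
self.models["depth"].to(self.device)
self.parameters_to_train += list(self.models["depth"].parameters())

# The encoder then goes into its own AdamW param group with a separate
# learning rate, as in the trainer.py excerpt quoted further down this page.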
Hello,
I would like to express my gratitude for your outstanding work. I am currently working with monodepth2 and have encountered a model-mismatch error while attempting to replace the depth network. The specific error I am facing is as follows:
I would be immensely grateful if you could provide some suggestions or advice on how to address this issue.
Thanks for your paper and repo. I am working on self-supervised odometry estimation, and I am interested in your approach of depth prediction with local and global context.
Just asking your opinion: if I implement your architecture contribution on the PoseNet side, what advantage could be expected? Thanks!!
Do you have an equivalent of monodepth2's "test_simple.py" script for MonoViT? Or is there a way to use your evaluate_depth.py to evaluate on a folder of RGB images?
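In the meantime, a minimal folder-inference sketch along the lines of test_simple.py, assuming monodepth2-style encoder.pth/depth.pth checkpoints and a 640x192 feed size (the paths, sizes, and key filtering are my assumptions):

import glob
import torch
from PIL import Image
from torchvision import transforms
import networks  # this repo's networks package

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder = networks.mpvit_small().to(device).eval()
decoder = networks.DepthDecoder().to(device).eval()

enc_dict = torch.load("weights/encoder.pth", map_location=device)
# monodepth2-style encoder checkpoints may carry extra keys (height/width/use_stereo)
encoder.load_state_dict({k: v for k, v in enc_dict.items() if k in encoder.state_dict()})
decoder.load_state_dict(torch.load("weights/depth.pth", map_location=device))

to_tensor = transforms.ToTensor()
with torch.no_grad():
    for path in sorted(glob.glob("my_images/*.png")):
        img = Image.open(path).convert("RGB").resize((640, 192))
        disp = decoder(encoder(to_tensor(img).unsqueeze(0).to(device)))[("disp", 0)]
        # disp is the sigmoid disparity map; colormap or save it as needed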
In the paper, the ablation study is done by removing modules from MPViT, which comes from another paper. What is the reasoning behind this? Usually, ablation experiments remove the modules proposed in one's own paper.
@zxcqlf, why is your code 99% similar to SegFormer? You never mention it in your paper.
Can you send me the full code?
Thank you for your great work! Could you tell me the license for this repository?
Thank you in advance.
Hi,
Thanks for your fantastic work. I am a bit confused by the following description:
please download monodepth2, replace the depth network, and revise the setting of the depth network, the optimizer and learning rate according to trainer.py.
Can you provide more info on how to do the replacement? Thanks in advance.
Training stage: regarding "For training, please download monodepth2, replace the depth network, and revise the setting of the depth network". Can you give some details? I run into some errors.
Hello, author, thanks for your remarkable work.
I noticed that you changed the stride (from 2 to 1) of the second conv of the stem block to get an H/2 × W/2 feature map. And after the first "Joint CNN & Transformer Layer", the feature map is downsampled by two again, to H/4 × W/4. But according to the MPViT paper, the first "Joint CNN & Transformer Layer" should not change the height and width of the feature map. Did you make any additional changes?
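To make the bookkeeping concrete, here is how the extra factor of 2 can appear between the stem and the first stage output (a toy sketch with illustrative layers, not MonoViT's actual modules; the usual suspect is a stride-2 patch-embedding conv between stages):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 192, 640)                       # KITTI-style input, H x W
stem = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1),         # -> H/2 x W/2
    nn.Conv2d(64, 64, 3, stride=1, padding=1),        # stride changed 2 -> 1, stays H/2
)
patch_embed = nn.Conv2d(64, 128, 3, stride=2, padding=1)  # downsampling before stage 2

print(stem(x).shape)               # torch.Size([1, 64, 96, 320])   -> H/2 x W/2
print(patch_embed(stem(x)).shape)  # torch.Size([1, 128, 48, 160])  -> H/4 x W/4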
The following error occurs due to an mmcv version problem; how did you solve it?
ModuleNotFoundError: No module named 'mmcv._ext'
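(In case it helps: this error usually means mmcv was installed without its compiled ops; reinstalling a pre-built mmcv-full wheel matched to your torch/CUDA versions, e.g. pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu116/torch1.12/index.html, tends to resolve it. The cu116/torch1.12 pair is only an example; pick the one matching your environment.)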
class Trainer:
    def __init__(self, options):
        ...
        # self.models["encoder"] = networks.ResnetEncoder(
        #     self.opt.num_layers, self.opt.weights_init == "pretrained")
        self.models["encoder"] = networks.mpvit_small()
        self.models["encoder"].to(self.device)
        # self.parameters_to_train += list(self.models["encoder"].parameters())

        self.models["depth"] = networks.DepthDecoder()
        self.models["depth"].to(self.device)
        # self.parameters_to_train += list(self.models["depth"].parameters())
        # note: with the line above commented out, the depth decoder's parameters
        # end up in neither of the two optimizer param groups built below

        if self.use_pose_net:
            if self.opt.pose_model_type == "separate_resnet":
                self.models["pose_encoder"] = networks.ResnetEncoder(
                    self.opt.num_layers,
                    self.opt.weights_init == "pretrained",
                    num_input_images=self.num_pose_frames)
                self.models["pose_encoder"].to(self.device)
                self.parameters_to_train += list(self.models["pose_encoder"].parameters())
                self.models["pose"] = networks.PoseDecoder(
                    self.models["pose_encoder"].num_ch_enc,
                    num_input_features=1,
                    num_frames_to_predict_for=2)
            elif self.opt.pose_model_type == "shared":
                self.models["pose"] = networks.PoseDecoder(
                    self.models["encoder"].num_ch_enc, self.num_pose_frames)
            elif self.opt.pose_model_type == "posecnn":
                self.models["pose"] = networks.PoseCNN(
                    self.num_input_frames if self.opt.pose_model_input == "all" else 2)
            self.models["pose"].to(self.device)
            self.parameters_to_train += list(self.models["pose"].parameters())

        if self.opt.predictive_mask:
            assert self.opt.disable_automasking, \
                "When using predictive_mask, please disable automasking with --disable_automasking"
            self.models["predictive_mask"] = networks.DepthDecoder()
            self.models["predictive_mask"].to(self.device)
            self.parameters_to_train += list(self.models["predictive_mask"].parameters())

        # self.model_optimizer = optim.Adam(self.parameters_to_train, self.opt.learning_rate)
        # self.model_lr_scheduler = optim.lr_scheduler.StepLR(
        #     self.model_optimizer, self.opt.scheduler_step_size, 0.1)

        #######################
        ####    MonoViT    ####
        #######################
        # self.model_optimizer = optim.AdamW(self.parameters_to_train, self.opt.learning_rate)
        self.params = [{
            "params": self.parameters_to_train,  # pose (and mask) networks
            "lr": 1e-4,
            # "weight_decay": 0.01
        }, {
            "params": list(self.models["encoder"].parameters()),  # encoder gets its own LR
            "lr": self.opt.learning_rate,
            # "weight_decay": 0.01
        }]
        self.model_optimizer = optim.AdamW(self.params)
        self.model_lr_scheduler = optim.lr_scheduler.ExponentialLR(
            self.model_optimizer, 0.9)
        ...
I made the modifications to the monodepth2 trainer according to your readme, but the results I obtained after training were significantly different from the results in your paper. I suspect that my modifications may not be correct, but due to my limited abilities, I cannot find the error. Could you help me take a look and see where I may have made a mistake? Thank you very much for your open source contributions, and I hope to receive your help.
Hello, I want to replace the Convolutional Block with a new module. Could you tell me which part of the code that block corresponds to?
The MonoViT results are impressive, thanks for your work. But I had a small problem reproducing them.
In trainer.py, I wrote:
self.models["encoder"] = networks.mpvit_small()
self.models["encoder"].to(self.device)
self.models["encoder"].num_ch_enc = [64, 128, 216, 288, 288]
self.models["depth"] = networks.DepthDecoder()
self.models["depth"].to(self.device)
self.parameters_to_train += list(self.models["depth"].parameters())
and:
self.models["encoder"] did not put it in self.parameters_to_train. Learning rate and optimizer are the same as you give.
Everything else remains in the same setting as the monodepth2.
My environment:
torch        1.12.1+cu116
torchaudio   0.12.1+cu116
torchvision  0.13.1+cu116
The results obtained:
abs_rel | sq_rel | rmse  | rmse_log | a1    | a2    | a3
0.106   | 0.766  | 4.491 | 0.182    | 0.893 | 0.965 | 0.983
This is very different from your result. Can you give me some hints, or send your related files to me?
[email protected]
Thanks for the great work and for open-sourcing your code!
I have a question about the pose network: which network should I use? According to the paper I should use a 'lightweight ResNet18', but I failed to find it in the repo, and the PoseCNN used in monodepth2 does not seem to be a ResNet. Is there something I missed?
Best regards,
Liancheng
Dear authors, thanks so much for releasing your code! I was wondering whether some scripts have not yet been uploaded to the repo. What would be the expected date for the full repo to be released, for reproducing training and evaluation?
Hello, may I ask how the attention maps of SoTA methods and the corresponding error maps were made? Could you please let me know?
> thank you !!!!!!
Could you send me the code? I'm having problems adjusting the train.py and trainer.py files.
My email is: [email protected]
Originally posted by @wasup07 in #14 (comment)
Thank you for your amazing work. Many people will probably want to try it right away, so a Hugging Face demo would be wonderful :)
Dear author,
Thank you for your fantastic contribution! However, I had some problems reproducing the results of MPViT-base. I'd really appreciate it if you could help me check what the problem is :-) @zxcqlf
I evaluated my MPViT-base model on KITTI and got the following results:
I think it may be because I set num_ch_enc or ch_enc incorrectly in depth_decoder; would you help me confirm what the correct values should be?
In hr_decoder.py, I changed self.num_ch_dec to np.array([64, 64, 128, 256, 512]), as shown below:

class DepthDecoder(nn.Module):
    def __init__(self, ch_enc=[64, 128, 216, 288, 288], scales=range(4),
                 num_ch_enc=[64, 64, 128, 256, 512], num_output_channels=1):
        super(DepthDecoder, self).__init__()
        self.num_output_channels = num_output_channels
        self.num_ch_enc = num_ch_enc
        self.ch_enc = ch_enc
        self.scales = scales
        # self.num_ch_dec = np.array([16, 32, 64, 128, 256])  # mpvit_small
        self.num_ch_dec = np.array([64, 64, 128, 256, 512])   # mpvit_base
In trainer.py, I reassigned the ch_enc and num_ch_enc arguments to DepthDecoder. It looks like this:

class Trainer:
    def __init__(self, options, ngpus_per_node=None):
        ... ...
        self.models["encoder"] = networks.mpvit_base()
        self.models["encoder"].to(self.device)
        # self.parameters_to_train += list(self.models["encoder"].parameters())
        self.models["depth"] = networks.DepthDecoder(
            ch_enc=[128, 224, 368, 480, 480],
            num_ch_enc=[128, 128, 256, 512, 1024])
        self.models["depth"].to(self.device)
        self.parameters_to_train += list(self.models["depth"].parameters())
        ... ...
In evaluate_depth.py, I changed the parameters of the encoder and decoder:

def evaluate(opt, ngpus_per_node=None):
    ... ...
    encoder = networks.mpvit_base().to(device)  # was: networks.ResnetEncoder(opt.num_layers, False)
    encoder.num_ch_enc = [128, 224, 368, 480, 480]
    depth_decoder = networks.DepthDecoder(
        ch_enc=[128, 224, 368, 480, 480],
        num_ch_enc=[128, 128, 256, 512, 1024]).to(device)
    ... ...
As a supplement, my training loss looks like this:
Thank you for your time and assistance!
I still can't reproduce the results in the paper.