zxcqlf / monovit
Self-supervised monocular depth estimation with a vision transformer
License: MIT License
Hello Zhaocq,
First of all, thank you very much for sharing your amazing work. I managed to integrate the model into the Monodepth training pipeline and train it using the pre-trained weights that you provided. However, the results were not great: for far objects/scenes the model did not keep a smooth disparity or properly interpret the scene. The top-right image corresponds to the ZED2 output depth.
My goal is to train it from "scratch" (starting from the ImageNet pre-trained models), as I would like to include a semi-supervised term in the loss function using GT depth (LiDAR or ZED2 depth) to obtain metric depth and pose and keep consistency for far elements of the image, inspired by this paper: https://arxiv.org/pdf/1910.01765.pdf. A sketch of the idea follows.
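To make the plan concrete, this is the kind of extra term I have in mind (a minimal sketch assuming sparse GT depth aligned with the image; the function name, depth range, and weighting are placeholders of mine, not code from this repo):

import torch

def supervised_depth_loss(pred_disp, gt_depth, min_depth=0.1, max_depth=100.0):
    """Hypothetical semi-supervised term: L1 on inverse depth, evaluated only
    where the sparse GT (LiDAR / ZED2) is valid. All names are placeholders."""
    valid = (gt_depth > min_depth) & (gt_depth < max_depth)  # mask out holes in sparse GT
    gt_inv = 1.0 / gt_depth.clamp(min=min_depth)             # compare in disparity space
    return (pred_disp[valid] - gt_inv[valid]).abs().mean()

# total_loss = photometric + smoothness + lambda_sup * supervised_depth_loss(disp, gt)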
Here I attach some sample images that I am using for training:
I am using the same parameters that you mention in the experiments section, where you explain how you trained from scratch using the ImageNet pre-trained models, combined with the information provided in the Monodepth and Monodepth2 papers.
When starting training, the first mini-batch output is as shown in the following image:
However, the following mini-batches from the same epoch look like this, which suggests the network is not training properly:
I am using the ZED2 camera, which has a baseline of 12 cm between the lenses, and I rectify the images before inference (see the geometry sketch below).
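For metric scale, depth follows from disparity via the standard stereo relation; a minimal sketch (fx_px is the rectified focal length in pixels, and the 0.12 m default matches the ZED2 baseline above):

def disparity_to_depth(disparity_px, fx_px, baseline_m=0.12):
    # pinhole stereo: depth = focal_length * baseline / disparity
    return fx_px * baseline_m / disparity_px

# e.g. fx = 700 px and a 20 px disparity give 700 * 0.12 / 20 = 4.2 m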
With this information, would you be able to tell what I may be missing from the papers and code that is preventing the model from training?
Thank you very much for your time.
I'm interested in evaluating monocular depth estimation solutions on the DrivingStereo dataset, but since the dataset is focused on stereo matching, I'm not yet sure how to do this. Your experiment implementation could serve as a reference for me and other people trying to conduct similar experiments.
Thanks in advance!
Dear author,
Thank you for your fantastic contribution. I noticed that you mention results on the DrivingStereo dataset in the paper. However, I cannot locate any corresponding code for this dataset. DrivingStereo is different from the KITTI dataset and does not have a toolkit for loading :(
Thank you for your time and assistance!
First of all, thank you for sharing your nice work.
Could you share the weights of the pose network in addition to the depth network?
Thank you.
Hi, thanks for your excellent work. But it seems a lot of code is missing from your trainer.py.
For training, please download monodepth2, replace the depth network, and revise the setting of the depth network, the optimizer and learning rate according to trainer.py.
I read the sentence above, but I don't know how to modify trainer.py. Can you help me?
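For anyone stuck at the same point, here is a minimal sketch of the replacement as I read the README, assuming the networks package from this repo exposes mpvit_small() and its DepthDecoder (the channel list below is the one quoted in later issues; none of this is the author's verified code):

# Inside monodepth2's Trainer.__init__, replacing the ResNet depth network:
self.models["encoder"] = networks.mpvit_small()            # was networks.ResnetEncoder(...)
self.models["encoder"].num_ch_enc = [64, 128, 216, 288, 288]
self.models["encoder"].to(self.device)

self.models["depth"] = networks.DepthDecoder()             # MonoViT's HR depth decoder
self.models["depth"].to(self.device)
self.parameters_to_train += list(self.models["depth"].parameters())

# The encoder then goes into its own AdamW param group with a separate
# learning rate, as in the trainer.py excerpt quoted further down this page.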
Hello,
I would like to express my gratitude for your outstanding work. I am currently working with monodepth2 and have encountered a model-mismatch error while attempting to replace the depth network. The specific error I am facing is as follows:
I would be immensely grateful if you could provide some suggestions or advice on how to address this issue.
Thanks for your paper and repo. I am working on self-supervised odometry estimation, and I am interested in your approach of depth prediction with local and global context.
Just asking your opinion: if I implement your architecture contribution on the PoseNet side, what advantage could be expected? Thanks!!
Do you have an equivalent of monodepth2's "test_simple.py" script for MonoViT? Or is there a way to use your evaluate_depth.py to evaluate on a folder of RGB images?
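In the meantime, a minimal folder-inference sketch along the lines of test_simple.py, assuming monodepth2-style encoder.pth/depth.pth checkpoints and a 640x192 feed size (the paths, sizes, and key filtering are my assumptions):

import glob
import torch
from PIL import Image
from torchvision import transforms
import networks  # this repo's networks package

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder = networks.mpvit_small().to(device).eval()
decoder = networks.DepthDecoder().to(device).eval()

enc_dict = torch.load("weights/encoder.pth", map_location=device)
# monodepth2-style encoder checkpoints may carry extra keys (height/width/use_stereo)
encoder.load_state_dict({k: v for k, v in enc_dict.items() if k in encoder.state_dict()})
decoder.load_state_dict(torch.load("weights/depth.pth", map_location=device))

to_tensor = transforms.ToTensor()
with torch.no_grad():
    for path in sorted(glob.glob("my_images/*.png")):
        img = Image.open(path).convert("RGB").resize((640, 192))
        disp = decoder(encoder(to_tensor(img).unsqueeze(0).to(device)))[("disp", 0)]
        # disp is the sigmoid disparity map; colormap or save it as needed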
In the paper, the ablation study is done by removing modules from MPViT, which comes from another paper. What is the reasoning behind this? Usually, ablation experiments remove the modules proposed in one's own paper.
@zxcqlf, why is your code 99% similar to SegFormer? You never mention it in your paper.
Can you send me the full code?
Thank you for your great work! Could you tell me the license for this repository?
Thank you in advance.
Hi,
Thanks for your fantastic work. I am a bit confused by the following description:
please download monodepth2, replace the depth network, and revise the setting of the depth network, the optimizer and learning rate according to trainer.py.
Can you provide more info on how to do the replacement? Thanks in advance.
Training stage: regarding "For training, please download monodepth2, replace the depth network, and revise the setting of the depth network". Can you give some details? I run into some errors.
Hello, author, thanks for your remarkable work.
I noticed that you changed the stride (from 2 to 1) of the second conv of the stem block to get an H/2 × W/2 feature map. And after the first "Joint CNN & Transformer Layer", the feature map is downsampled by two again, to H/4 × W/4. But according to the MPViT paper, the first "Joint CNN & Transformer Layer" should not change the height and width of the feature map. Did you make any additional changes?
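To make the bookkeeping concrete, here is how the extra factor of 2 can appear between the stem and the first stage output (a toy sketch with illustrative layers, not MonoViT's actual modules; the usual suspect is a stride-2 patch-embedding conv between stages):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 192, 640)                       # KITTI-style input, H x W
stem = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1),         # -> H/2 x W/2
    nn.Conv2d(64, 64, 3, stride=1, padding=1),        # stride changed 2 -> 1, stays H/2
)
patch_embed = nn.Conv2d(64, 128, 3, stride=2, padding=1)  # downsampling before stage 2

print(stem(x).shape)               # torch.Size([1, 64, 96, 320])   -> H/2 x W/2
print(patch_embed(stem(x)).shape)  # torch.Size([1, 128, 48, 160])  -> H/4 x W/4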
The following error occurs due to an mmcv version problem; how did you solve it?
ModuleNotFoundError: No module named 'mmcv._ext'
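(In case it helps: this error usually means mmcv was installed without its compiled ops; reinstalling a pre-built mmcv-full wheel matched to your torch/CUDA versions, e.g. pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu116/torch1.12/index.html, tends to resolve it. The cu116/torch1.12 pair is only an example; pick the one matching your environment.)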
class Trainer:
    def __init__(self, options):
        ...
        # self.models["encoder"] = networks.ResnetEncoder(
        #     self.opt.num_layers, self.opt.weights_init == "pretrained")
        self.models["encoder"] = networks.mpvit_small()
        self.models["encoder"].to(self.device)
        # self.parameters_to_train += list(self.models["encoder"].parameters())

        self.models["depth"] = networks.DepthDecoder()
        self.models["depth"].to(self.device)
        # self.parameters_to_train += list(self.models["depth"].parameters())
        # note: with the line above commented out, the depth decoder's parameters
        # end up in neither of the two optimizer param groups built below

        if self.use_pose_net:
            if self.opt.pose_model_type == "separate_resnet":
                self.models["pose_encoder"] = networks.ResnetEncoder(
                    self.opt.num_layers,
                    self.opt.weights_init == "pretrained",
                    num_input_images=self.num_pose_frames)
                self.models["pose_encoder"].to(self.device)
                self.parameters_to_train += list(self.models["pose_encoder"].parameters())
                self.models["pose"] = networks.PoseDecoder(
                    self.models["pose_encoder"].num_ch_enc,
                    num_input_features=1,
                    num_frames_to_predict_for=2)
            elif self.opt.pose_model_type == "shared":
                self.models["pose"] = networks.PoseDecoder(
                    self.models["encoder"].num_ch_enc, self.num_pose_frames)
            elif self.opt.pose_model_type == "posecnn":
                self.models["pose"] = networks.PoseCNN(
                    self.num_input_frames if self.opt.pose_model_input == "all" else 2)
            self.models["pose"].to(self.device)
            self.parameters_to_train += list(self.models["pose"].parameters())

        if self.opt.predictive_mask:
            assert self.opt.disable_automasking, \
                "When using predictive_mask, please disable automasking with --disable_automasking"
            self.models["predictive_mask"] = networks.DepthDecoder()
            self.models["predictive_mask"].to(self.device)
            self.parameters_to_train += list(self.models["predictive_mask"].parameters())

        # self.model_optimizer = optim.Adam(self.parameters_to_train, self.opt.learning_rate)
        # self.model_lr_scheduler = optim.lr_scheduler.StepLR(
        #     self.model_optimizer, self.opt.scheduler_step_size, 0.1)

        #######################
        ####    MonoViT    ####
        #######################
        # self.model_optimizer = optim.AdamW(self.parameters_to_train, self.opt.learning_rate)
        self.params = [{
            "params": self.parameters_to_train,  # pose (and mask) networks
            "lr": 1e-4,
            # "weight_decay": 0.01
        }, {
            "params": list(self.models["encoder"].parameters()),  # encoder gets its own LR
            "lr": self.opt.learning_rate,
            # "weight_decay": 0.01
        }]
        self.model_optimizer = optim.AdamW(self.params)
        self.model_lr_scheduler = optim.lr_scheduler.ExponentialLR(
            self.model_optimizer, 0.9)
        ...
I made the modifications to the monodepth2 trainer according to your readme, but the results I obtained after training were significantly different from the results in your paper. I suspect that my modifications may not be correct, but due to my limited abilities, I cannot find the error. Could you help me take a look and see where I may have made a mistake? Thank you very much for your open source contributions, and I hope to receive your help.
Hello, I want to replace the Convolutional Block with a new module. Could you tell me which part of the code that block corresponds to?
The MonoViT results are impressive, thanks for your work. But I had a small problem reproducing them.
In trainer.py, I wrote:
self.models["encoder"] = networks.mpvit_small()
self.models["encoder"].to(self.device)
self.models["encoder"].num_ch_enc = [64, 128, 216, 288, 288]
self.models["depth"] = networks.DepthDecoder()
self.models["depth"].to(self.device)
self.parameters_to_train += list(self.models["depth"].parameters())
and:
self.models["encoder"] did not put it in self.parameters_to_train. Learning rate and optimizer are the same as you give.
Everything else remains in the same setting as the monodepth2.
My environment:
torch        1.12.1+cu116
torchaudio   0.12.1+cu116
torchvision  0.13.1+cu116
The results obtained:
abs_rel | sq_rel | rmse  | rmse_log | a1    | a2    | a3
0.106   | 0.766  | 4.491 | 0.182    | 0.893 | 0.965 | 0.983
This is very different from your result. Can you give me some hints, or send your related files to me?
[email protected]
Thanks for the great work and for open-sourcing your code!
I have a question about the pose network: which network should I use? According to the paper I should use a 'lightweight ResNet18', but I failed to find it in the repo, and the PoseCNN used in monodepth2 does not seem to be a ResNet. Is there something I missed?
Best regards,
Liancheng
Dear authors, thanks so much for releasing your code! I was wondering whether some scripts have not yet been uploaded to the repo. What would be the expected date for the full repo to be released, for reproducing training and evaluation?
Hello, may I ask how the attention maps of SoTA methods and the corresponding error maps were made? Could you please let me know?
> thank you !!!!!!
Could you send me the code? I'm having problems adjusting the train.py and trainer.py files.
My email is: [email protected]
Originally posted by @wasup07 in #14 (comment)
Thank you for your amazing work. Many people will probably want to try it right away, so a Hugging Face demo would be wonderful :)
Dear author,
Thank you for your fantastic contribution! However, I had some problems reproducing the results of MPViT-base. I'd really appreciate it if you could help me check what the problem is :-) @zxcqlf
I evaluated my MPViT-base model on KITTI and got the following results:
I think it may be because I set num_ch_enc or ch_enc incorrectly in depth_decoder; would you help me confirm what the correct values should be?
In hr_decoder.py, I changed self.num_ch_dec to np.array([64, 64, 128, 256, 512]), as shown below:

class DepthDecoder(nn.Module):
    def __init__(self, ch_enc=[64, 128, 216, 288, 288], scales=range(4),
                 num_ch_enc=[64, 64, 128, 256, 512], num_output_channels=1):
        super(DepthDecoder, self).__init__()
        self.num_output_channels = num_output_channels
        self.num_ch_enc = num_ch_enc
        self.ch_enc = ch_enc
        self.scales = scales
        # self.num_ch_dec = np.array([16, 32, 64, 128, 256])  # mpvit_small
        self.num_ch_dec = np.array([64, 64, 128, 256, 512])   # mpvit_base
In trainer.py, I reassigned the ch_enc and num_ch_enc arguments to DepthDecoder. It looks like this:

class Trainer:
    def __init__(self, options, ngpus_per_node=None):
        ... ...
        self.models["encoder"] = networks.mpvit_base()
        self.models["encoder"].to(self.device)
        # self.parameters_to_train += list(self.models["encoder"].parameters())
        self.models["depth"] = networks.DepthDecoder(
            ch_enc=[128, 224, 368, 480, 480],
            num_ch_enc=[128, 128, 256, 512, 1024])
        self.models["depth"].to(self.device)
        self.parameters_to_train += list(self.models["depth"].parameters())
        ... ...
In evaluate_depth.py, I changed the parameters of the encoder and decoder:

def evaluate(opt, ngpus_per_node=None):
    ... ...
    encoder = networks.mpvit_base().to(device)  # was: networks.ResnetEncoder(opt.num_layers, False)
    encoder.num_ch_enc = [128, 224, 368, 480, 480]
    depth_decoder = networks.DepthDecoder(
        ch_enc=[128, 224, 368, 480, 480],
        num_ch_enc=[128, 128, 256, 512, 1024]).to(device)
    ... ...
As a supplement, my training loss looks like this:
Thank you for your time and assistance!
I still can't reproduce the results in the paper.