nianticlabs / monodepth2


[ICCV 2019] Monocular depth estimation from a single image

License: Other

Python 20.74% Jupyter Notebook 78.30% Shell 0.96%

computer-vision deep-learning depth-estimation monodepth neural-network pytorch self-supervision

monodepth2's Issues

Training on a different dataset - Intrinsic parameters

Hello,

This is not an issue with the code, but rather a question about training on cityscapes instead of kitti. I created a new Dataset class in which I read the camera parameters stored in a json file and set them in a numpy array:

K = np.array([[fx, 0, u0, 0], [0, fy, v0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=np.float32)

These intrinsics are then used in generate_images_pred. I noticed on tensorboard that the color_pred_s_0 was sometimes really messy. Apart from that, the other outputs look good. So, I was wondering, is there anything else to change in the code?
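One thing that may be worth checking: the repository's KITTI dataset class stores K in normalized form (the first row divided by image width, the second by image height) and rescales it for each output scale. If the JSON file holds pixel-unit values, a conversion along these lines might be needed. This is only a sketch; the helper name `normalized_intrinsics` and the sample numbers are hypothetical, not real Cityscapes calibration.

```python
import numpy as np

def normalized_intrinsics(fx, fy, u0, v0, width, height):
    """Build a 4x4 K whose first row is divided by the image width and
    whose second row is divided by the image height (hypothetical helper)."""
    return np.array([[fx / width, 0, u0 / width, 0],
                     [0, fy / height, v0 / height, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]], dtype=np.float32)

# Illustrative values only, not taken from any calibration file.
K = normalized_intrinsics(fx=2000.0, fy=2000.0, u0=1024.0, v0=512.0,
                          width=2048, height=1024)
```

With this convention the principal point of a centered camera lands near 0.5 in both rows, which is a quick sanity check on the values read from the file.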

Best regards

How is static mask being computed?

I am trying to understand how the static mask is being computed. I haven't had the chance to run the code as I just want to incorporate the idea into my own project.

I noticed that the computation is all being done here:

if not self.opt.disable_automasking:
    # add random numbers to break ties
    identity_reprojection_loss += torch.randn(
        identity_reprojection_loss.shape).cuda() * 0.00001

    combined = torch.cat((identity_reprojection_loss, reprojection_loss), dim=1)
else:
    combined = reprojection_loss

if combined.shape[1] == 1:
    to_optimise = combined
else:
    to_optimise, idxs = torch.min(combined, dim=1)

if not self.opt.disable_automasking:
    outputs["identity_selection/{}".format(scale)] = (idxs > 1).float()

loss += to_optimise.mean()

In pytorch, the tensors are of shape [B,C,H,W], correct? So you are concatenating the reprojection loss from the target-source pair with the reprojection loss from the target-warped pair along the channel dimension. This means combined should only have exactly two channels? Then you choose to keep the minimum of the two to add to the loss.

How are you calculating the mask? You check for when idxs > 1. But if there were only two channels, how can idxs ever be greater than 1?
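A possible reading, sketched here in NumPy rather than PyTorch: if each loss tensor carries one channel per source frame, then with two source frames (assumed here, e.g. frame ids -1 and 1) the concatenation has four channels, and `idxs > 1` selects the pixels where one of the warped sources gave the lowest loss. The shapes and the variable `num_sources` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 1, 2, 3
num_sources = 2  # assumed: two source frames, e.g. frame ids -1 and 1

# One loss channel per source frame, stacked along axis 1.
identity_loss = rng.random((B, num_sources, H, W))      # unwarped source vs. target
reprojection_loss = rng.random((B, num_sources, H, W))  # warped source vs. target

combined = np.concatenate([identity_loss, reprojection_loss], axis=1)  # (B, 4, H, W)
to_optimise = combined.min(axis=1)
idxs = combined.argmin(axis=1)

# Channels 0..num_sources-1 hold the identity losses, so an index above
# num_sources - 1 means a warped source explained the pixel best.
mask = (idxs > num_sources - 1).astype(np.float32)
```

Under this reading the mask is 1 where the pixel is kept in the loss and 0 where the unwarped source already matches the target better.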

Mean and std for KITTI

How did you get this mean and std?

x = (input_image - 0.45) / 0.225
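One plausible origin, offered as a guess rather than a statement about the authors' method: the two constants are close to the channel-averaged ImageNet statistics. The arithmetic can be checked as follows.

```python
# Widely used ImageNet per-channel statistics (RGB order).
imagenet_mean = (0.485, 0.456, 0.406)
imagenet_std = (0.229, 0.224, 0.225)

# Averaging over the three channels gives a single scalar for each.
mean_avg = sum(imagenet_mean) / 3
std_avg = sum(imagenet_std) / 3
print(round(mean_avg, 3), round(std_avg, 3))
```

If that guess is right, the constants are not KITTI statistics at all, which would also explain why they match the pretrained ImageNet encoder.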

Why is the input not normalized for PoseCNN network?

https://github.com/nianticlabs/monodepth2/blob/master/networks/pose_cnn.py#L39

Question about the evaluation of pose, the optimized scale

monodepth2/evaluate_pose.py

Line 43 in 1cc8b81

scale = np.sum(gtruth_xyz * pred_xyz) / np.sum(pred_xyz ** 2)

The scale is optimized for every 5-frame track_length. I wonder if this is common practice (e.g. the code is borrowed from SfMLearner); in my opinion it is sort of "cheating" (no offense), because in reality you cannot optimize the scale all the time since no ground truth is available. A more reasonable way would be to fix the scale beforehand, e.g. using the optimal scale on the training set, and use the same scale for the whole sequence.

What's your opinion?
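For reference, the quoted expression is the closed-form least-squares scale: setting the derivative of sum((gt - s * pred)^2) with respect to s to zero gives s = sum(gt * pred) / sum(pred^2). A small sketch with synthetic data follows; the helper names and the fixed-scale variant are hypothetical illustrations of the alternative proposed above, not code from the repository.

```python
import numpy as np

def optimal_scale(gtruth_xyz, pred_xyz):
    """Least-squares s minimising sum((gtruth_xyz - s * pred_xyz) ** 2)."""
    return np.sum(gtruth_xyz * pred_xyz) / np.sum(pred_xyz ** 2)

def fixed_scale(all_gt, all_pred):
    """One shared scale fitted jointly over several snippets (hypothetical)."""
    return optimal_scale(np.concatenate(all_gt), np.concatenate(all_pred))

rng = np.random.default_rng(0)
pred = rng.normal(size=(5, 3))   # one 5-frame snippet of predicted positions
gt = 2.5 * pred                  # ground truth differing only by a global scale
per_snippet = optimal_scale(gt, pred)
shared = fixed_scale([gt, gt], [pred, pred])
```

Comparing errors under `per_snippet` and `shared` scaling would show how much of the reported accuracy depends on re-fitting the scale for every snippet.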

Resnet-50 separate_resnet pose network bug

Resnet-50 architecture for 'separate_resnet' pose network has a bug.

Flags:

--num_layers 50
--pose_model_type separate_resnet

Error:

Traceback (most recent call last):
  File "/media/Data/Alwyn/github/monodepth2/train.py", line 17, in <module>
    trainer = Trainer(opts)
  File "/media/Data/Alwyn/github/monodepth2/trainer.py", line 68, in __init__
    num_input_images=self.num_pose_frames)
  File "/media/Data/Alwyn/github/monodepth2/networks/resnet_encoder.py", line 80, in __init__
    self.encoder = resnet_multiimage_input(num_layers, pretrained, num_input_images)
  File "/media/Data/Alwyn/github/monodepth2/networks/resnet_encoder.py", line 58, in resnet_multiimage_input
    model.load_state_dict(loaded)
  File "/home/iitp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNetMultiImageInput:
	Unexpected key(s) in state_dict: "layer1.0.conv3.weight", "layer1.0.bn3.running_mean", "layer1.0.bn3.running_var", "layer1.0.bn3.weight", "layer1.0.bn3.bias", "layer1.0.downsample.0.weight", "layer1.0.downsample.1.running_mean", "layer1.0.downsample.1.running_var", "layer1.0.downsample.1.weight", "layer1.0.downsample.1.bias", "layer1.1.conv3.weight", "layer1.1.bn3.running_mean", "layer1.1.bn3.running_var", "layer1.1.bn3.weight", "layer1.1.bn3.bias", "layer1.2.conv3.weight", "layer1.2.bn3.running_mean", "layer1.2.bn3.running_var", "layer1.2.bn3.weight", "layer1.2.bn3.bias", "layer2.0.conv3.weight", "layer2.0.bn3.running_mean", "layer2.0.bn3.running_var", "layer2.0.bn3.weight", "layer2.0.bn3.bias", "layer2.1.conv3.weight", "layer2.1.bn3.running_mean", "layer2.1.bn3.running_var", "layer2.1.bn3.weight", "layer2.1.bn3.bias", "layer2.2.conv3.weight", "layer2.2.bn3.running_mean", "layer2.2.bn3.running_var", "layer2.2.bn3.weight", "layer2.2.bn3.bias", "layer2.3.conv3.weight", "layer2.3.bn3.running_mean", "layer2.3.bn3.running_var", "layer2.3.bn3.weight", "layer2.3.bn3.bias", "layer3.0.conv3.weight", "layer3.0.bn3.running_mean", "layer3.0.bn3.running_var", "layer3.0.bn3.weight", "layer3.0.bn3.bias", "layer3.1.conv3.weight", "layer3.1.bn3.running_mean", "layer3.1.bn3.running_var", "layer3.1.bn3.weight", "layer3.1.bn3.bias", "layer3.2.conv3.weight", "layer3.2.bn3.running_mean", "layer3.2.bn3.running_var", "layer3.2.bn3.weight", "layer3.2.bn3.bias", "layer3.3.conv3.weight", "layer3.3.bn3.running_mean", "layer3.3.bn3.running_var", "layer3.3.bn3.weight", "layer3.3.bn3.bias", "layer3.4.conv3.weight", "layer3.4.bn3.running_mean", "layer3.4.bn3.running_var", "layer3.4.bn3.weight", "layer3.4.bn3.bias", "layer3.5.conv3.weight", "layer3.5.bn3.running_mean", "layer3.5.bn3.running_var", "layer3.5.bn3.weight", "layer3.5.bn3.bias", "layer4.0.conv3.weight", "layer4.0.bn3.running_mean", "layer4.0.bn3.running_var", "layer4.0.bn3.weight", "layer4.0.bn3.bias", "layer4.1.conv3.weight", 
"layer4.1.bn3.running_mean", "layer4.1.bn3.running_var", "layer4.1.bn3.weight", "layer4.1.bn3.bias", "layer4.2.conv3.weight", "layer4.2.bn3.running_mean", "layer4.2.bn3.running_var", "layer4.2.bn3.weight", "layer4.2.bn3.bias". 
	size mismatch for layer1.0.conv1.weight: copying a param with shape torch.Size([64, 64, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
	size mismatch for layer1.1.conv1.weight: copying a param with shape torch.Size([64, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
	size mismatch for layer1.2.conv1.weight: copying a param with shape torch.Size([64, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
	size mismatch for layer2.0.conv1.weight: copying a param with shape torch.Size([128, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 3, 3]).
	size mismatch for layer2.0.downsample.0.weight: copying a param with shape torch.Size([512, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 1, 1]).
	size mismatch for layer2.0.downsample.1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for layer2.0.downsample.1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for layer2.0.downsample.1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for layer2.0.downsample.1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for layer2.1.conv1.weight: copying a param with shape torch.Size([128, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
	size mismatch for layer2.2.conv1.weight: copying a param with shape torch.Size([128, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
	size mismatch for layer2.3.conv1.weight: copying a param with shape torch.Size([128, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
	size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 128, 3, 3]).
	size mismatch for layer3.0.downsample.0.weight: copying a param with shape torch.Size([1024, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 128, 1, 1]).
	size mismatch for layer3.0.downsample.1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for layer3.0.downsample.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for layer3.0.downsample.1.running_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for layer3.0.downsample.1.running_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for layer3.1.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for layer3.2.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for layer3.3.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for layer3.4.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for layer3.5.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for layer4.0.conv1.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 256, 3, 3]).
	size mismatch for layer4.0.downsample.0.weight: copying a param with shape torch.Size([2048, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 256, 1, 1]).
	size mismatch for layer4.0.downsample.1.weight: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for layer4.0.downsample.1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for layer4.0.downsample.1.running_mean: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for layer4.0.downsample.1.running_var: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for layer4.1.conv1.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for layer4.2.conv1.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for fc.weight: copying a param with shape torch.Size([1000, 2048]) from checkpoint, the shape in current model is torch.Size([1000, 512]).

Process finished with exit code 1

Scaled translation in PoseCNN

Why do you scale translation just for posecnn but not for shared or separate?

monodepth2/trainer.py

Line 372 in 5cc5c85

axisangle[:, 0], translation[:, 0] * mean_inv_depth[:, 0], frame_id < 0)

Element 0 of tensors does not require grad and does not have a grad_fn

Sorry to bother you.
I changed some parts of the code and got the error "Element 0 of tensors does not require grad and does not have a grad_fn". I think it's here

So, I want to ask why we use "with torch.no_grad()" here. If I delete it, will it affect the result?
Thanks a lot for your help~

Why does the model project relatively static pixels to infinity?

I thought I'd open up a discussion.

I recently tried training monodepth2 on Cityscapes and discovered that even with the static mask, it was failing to mask out some vehicles moving with the ego car and was projecting them to infinity. In order to resolve this issue, I want to understand exactly why projecting these pixels to infinity minimizes the loss. This is what I don't entirely understand. The best understanding I have is that in order to make a point in 3D not move across frames, the easiest way is to set its z-component to infinity, so that the Euclidean distance between the two points is zero. The z-component would be the depth from the depth map. But somehow, I don't believe this is right. Does anyone have a clear mathematical formulation that shows, given the re-projection equation, that minimizing the loss comes from setting z to infinity? I believe that if I know the exact mathematical component that is causing this issue, it will be clearer how to resolve it.

Reprojection Equation:

The most obvious thing to try would be to come up with a different loss function to pass the static pixels through to prevent them from being projected to infinity. But then you'd need some sort of idea of what the actual depth should be, and considering this is an unsupervised approach, I don't see how you can do that. I know struct2depth added a regularizing term to approximate what the depth of an object should be by learning the object's height, using similar triangles. But I tried struct2depth as well and it was still projecting objects to infinity, although this might have been due to the fact that Mask R-CNN was missing too many objects. Just throwing ideas out there. Thanks.
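One way to make the question concrete is to write out the standard pinhole reprojection and look at the large-depth limit. The notation below is assumed for this sketch and is not copied from the paper: p_t is the target pixel in homogeneous coordinates, D its predicted depth, K the intrinsics, and (R, t) the relative pose from target to source camera.

```latex
% Reprojection of a target pixel into the source view (assumed notation).
p_s \;\sim\; K\left( D\, R\, K^{-1} p_t + t \right)
\quad\Longleftrightarrow\quad
p_s \;\sim\; K R K^{-1} p_t + \frac{1}{D}\, K t .

% As the depth grows, the translation term vanishes:
\lim_{D \to \infty} p_s \;\sim\; K R K^{-1} p_t ,
\qquad \text{and with } R \approx I : \quad p_s \approx p_t .
```

An object moving with the camera appears at nearly the same pixel in both frames, so the photometric loss is lowest when the sampling location satisfies p_s ≈ p_t. The pose network, driven by the static background, predicts a nonzero t, so under these assumptions the only depth that cancels the translation term for those pixels is D → ∞.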

odometry split test file list

Hi,
I'm wondering how you generated these files for the different sequences. I want to evaluate the model on
other sequences. How can I get those?

Why use padding_mode='border'?

When warping the image, you use padding_mode='border', which causes border effects (e.g. on the left and bottom for image 2->1)

monodepth2/trainer.py

Lines 381 to 384 in ec10cf1

outputs[("color", frame_id, scale)] = F.grid_sample(
    inputs[("color", frame_id, source_scale)],
    outputs[("sample", frame_id, scale)],
    padding_mode="border")

In the code it seems that you compute the reconstruction loss on the whole image, including the parts affected by the border (correct me if I'm wrong). Wouldn't that degrade the model's performance?

What I've seen others doing (e.g. struct2depth) is to use a mask, and only compute the loss on the masked (valid) area. This can be done using padding_mode='zeros' to generate the valid mask. E.g., black areas are not contributing to the loss.

What's your opinion?
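The masking idea described above can be sketched as follows, in NumPy for brevity. It relies on the fact that grid_sample expects sampling coordinates normalized to [-1, 1], so anything outside that range falls outside the source image. The function name `masked_mean_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def masked_mean_loss(per_pixel_loss, sample_grid):
    """Average a (B, H, W) loss over pixels whose (B, H, W, 2) sampling
    coordinates lie inside [-1, 1], i.e. inside the source image."""
    valid = np.all(np.abs(sample_grid) <= 1.0, axis=-1)
    count = valid.sum()
    if count == 0:
        return 0.0  # no valid pixel: contribute nothing to the loss
    return float((per_pixel_loss * valid).sum() / count)

loss = np.array([[[1.0, 3.0]]])                  # B=1, H=1, W=2
grid = np.array([[[[0.0, 0.0], [1.5, 0.0]]]])    # second pixel samples outside
value = masked_mean_loss(loss, grid)
```

Here only the first pixel contributes, so the out-of-view pixel's loss of 3.0 is ignored regardless of which padding mode filled it.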

Speed up the training process

Dear,

Thanks a lot for your open source project!
While reproducing the results, I found that the training time is longer than the time reported in the readme. I have already moved the dataset to an SSD and used the default settings (I am using a Titan XP, an i7-7700, and pillow-simd). I am wondering whether you use any extra operations, for example pre-resizing the images?

For example: python train.py --model_name stereo_model2 --frame_ids 0 --use_stereo --split eigen_full takes about 14 hours.

TypeError in training.

Thank you for sharing your great code. Your kind usage instructions were a great help in setting up the environment. The simple prediction test using your trained model worked, but we get an error in training. I would appreciate any help from you. Thank you in advance.

python train.py --model_name stereo_model \
  --frame_ids 0 --use_stereo --split eigen_full

Training model named:
stereo_model
Models and tensorboard events files are saved to:
/root/tmp
Training is using:
cuda
Using split:
eigen_full
There are 45200 training items and 1776 validation items

Training
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    trainer.train()
  File "/workspace/trainer.py", line 186, in train
    self.run_epoch()
  File "/workspace/trainer.py", line 198, in run_epoch
    for batch_idx, inputs in enumerate(self.train_loader):
  File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
  File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/workspace/datasets/mono_dataset.py", line 174, in __getitem__
    self.brightness, self.contrast, self.saturation, self.hue)
  File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 725, in get_params
    if brightness > 0:
TypeError: '>' not supported between instances of 'tuple' and 'int'

why do we need the track_length?

monodepth2/evaluate_pose.py

Line 118 in 1cc8b81

track_length = 5

Why do we need track_length when evaluating the pose results?
Thank you for your help~

Error while Running monodepth2 on Jetson Nano

Hi, thanks a lot for this project. I am trying to use your work for a summer research project at my university and want to run monodepth2 on the Nvidia Jetson Nano. However, when I try to run the example command to test out the depth prediction for a single image, I get this error:

Weirdly, the file in question (SpatialUpSamplingBilinear.cu) doesn't exist anywhere on my machine. I installed pytorch using this guide: https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano/post/5324123/#5324123
(This might be a pytorch specific issue, in which case I'd be happy to close this)

Thank you!

How to work in indoor scene

Hi, thanks for your work.
I want to train an ASN model in stereo mode on indoor scenes. I changed max_depth in options.py, but the model does not fit. Do more parameters need to change? I hope you can give me some advice, thank you.

RuntimeError: CUDA error: an illegal memory access was encountered

Hello, when I trained your stereo+mono model, this error occurred after 10 epochs. By the way, my GPU is a TITAN V.

and the error is

batch size changed and error occur

Sorry to bother you again~
Due to memory limitations, I changed batch_size from 12 to 2, and then an error occurred.

I didn't change anything else.

Pre-trained models on Odom split?

It seems that the released pre-trained models are trained on the Eigen split (for depth). I wonder if the Odom split models (for pose) are also available. Thanks!

Translation vector for flipped image

Does the translation get affected when an image is flipped? You have changed tx here:
https://github.com/nianticlabs/monodepth2/blob/master/datasets/mono_dataset.py#L194

In contrast, here the principal point is changed for a flipped image:
https://github.com/ClementPinard/SfmLearner-Pytorch/blob/0caec9ed0f83cb65ba20678a805e501439d2bc25/custom_transforms.py#L55

Model trained on cityscapes projects static objects to infinity

Hello,

I added support for training monodepth2 on Cityscapes, and trained it using the default hyperparameters used for monocular training on KITTI. It looks like it is having trouble masking out all the static pixels and is projecting cars that move with the ego-car to infinity. I was wondering if anyone had any useful input on this? The most obvious thing to do is to weight the static mask in such a way that it masks out more pixels as static. But I was wondering if the authors ran into this issue when training on KITTI. It's possible that this didn't happen on KITTI because the environment is mostly static?

How to prevent holes during training?

Hello,

First off, thank you for all your hard work on this. I really enjoyed reading your paper and learning about the method.

I'm currently training the model on my own dataset. It's an indoor office-type setting with a lot more low texture regions than KITTI (blank walls and floors). I read in the paper that these regions can give rise to "holes" in the lower resolution depth maps and I can see signs of this after the first epoch of training. As training goes on, the holes get worse to the point where all resolutions are hole patterns or completely blank.

Any tips on how to make the model more robust to these artifacts? I was thinking of adjusting the number of scales or maybe changing alpha in the photometric error.

Also, the video sequences were taken at 30 fps. I noticed that adjacent frames did not look very different than the target image. Could this also be contributing to the holes?

I also want to avoid going through my data and removing clips with a lot of low texture.

Thank you

BackprojectDepth

Thanks a lot. I found the BackprojectDepth class, but I still cannot compute the distance between an object (such as a single point) and the ground. Can you give me more help? Thank you.

Why have the model learn disparity instead of depth directly?

I noticed that during evaluation, you have to compute a ratio to correct for a scaling factor when computing the errors. Specifically in evaluate_depth.py,

if not opt.disable_median_scaling:
            ratio = np.median(gt_depth) / np.median(pred_depth)
            ratios.append(ratio)
            pred_depth *= ratio

If you need the ground truth to run inference properly, then what is the point of this model? What's the point of running inference if you need ground truth? Why not just have the depth net predict depth directly instead of predicting disparity?

I would expect the depth to be off by a scaling factor and maybe even an offset if you tried inferencing with a camera with different intrinsics than the camera the model was trained with, but if the model was properly learning the depth, and you are evaluating on kitti, the same dataset you trained with, why would you need to adjust by this scaling factor?

From what I can tell, this is due to how you are converting between disparity and depth. depth = 1 / disp. You don't know the scaling factor, so you just hand-wave it away and call it one.

Normally, to convert between depth and disparity, you need the focal length and the baseline distance between the two cameras as explained by opencv here: https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_calib3d/py_depthmap/py_depthmap.html

But this model assumes the two cameras are in a binocular setting. This is a monocular setting, and the pose between two frames almost never represents a binocular setting with identity rotation matrix and zero translation in y and z directions. So what does disparity even mean in this model??? Is it still just the difference in pixels of two correspondence points, except now the distance can be in both x and y?

I've been trying to think of how to properly compute depth from disparity in this model. But it's not as simple as computing the distance between the two cameras because they are not binocular. I suppose you'd have to apply rectification so they are co-planar? But even then, that approach requires two images to properly compute disparity/depth. And the depth net always only takes in one image.

The conclusion I've arrived at is that it'd be better to have the depth net directly predict depth instead of predicting disparity. And then the model would properly learn scale? To me, it doesn't even make sense to talk about disparity in a monocular setting anyway. What is the disparity even relative to?

But something tells me that there is a reason why everyone who is doing this unsupervised approach is directly predicting disparity instead of depth. I am worried that if the model tries to learn depth directly, then it may focus too much on points that are far away during the loss optimization because it will be more likely to get larger errors for points that are far away than points that are close up, and then it won't really learn proper depth for objects that are up close, which are the points we actually care about. How far away the sky is normally isn't very important.

Sorry that this is so long. I wanted to make it very clear where I am coming from. Thank you.
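For context, a sketch of how the network output might relate to depth, written from memory of the repository's disp_to_depth helper and simplified to NumPy: the sigmoid output is mapped linearly to an inverse depth between 1/max_depth and 1/min_depth, and depth is its reciprocal. The default bounds of 0.1 and 100 are assumed here. Separately, monocular video alone cannot fix absolute scale, because scaling all depths and the translation by the same factor leaves the reprojection unchanged; median scaling at evaluation compensates for that ambiguity.

```python
import numpy as np

def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in [0, 1] to depth in [min_depth, max_depth]."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp

def median_scale(pred_depth, gt_depth):
    """Rescale a prediction so that its median matches the ground truth."""
    return pred_depth * (np.median(gt_depth) / np.median(pred_depth))

far = disp_to_depth(0.0)    # smallest output maps to the largest depth
near = disp_to_depth(1.0)   # largest output maps to the smallest depth
```

In this parameterization "disparity" is just a bounded inverse depth, not a pixel offset between two rectified views, which may resolve part of the confusion about what it is relative to.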

Multi GPU training

Hi,
It's mentioned that the code runs on a single GPU. Did you try running it across multiple GPUs?
I tried across two GPUs. The encoder works fine. However, there is a problem creating replicas of the depth decoder across GPUs. Any clue about that?
Regards - Debapriya

how to calculate the real world distance value from the camera to object

Hi, I want to know how to calculate the real-world distance from the camera to an object. I am using my own images, which means the pretrained model was trained on a different camera configuration. How can I calibrate this pretrained model for images from my own camera?
Thanks in advance.

3-D coordinates

Hello,
How can I use the disp output to get 3-D coordinates?
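A minimal sketch of the usual pinhole back-projection, assuming the disparity has already been converted to a depth map and that K is a 3x3 intrinsics matrix in pixel units: each pixel (u, v) maps to the camera-frame point depth * K^-1 [u, v, 1]. The function name and the toy values are illustrative; the resulting units are only metric if the depth itself is metric.

```python
import numpy as np

def backproject(depth, K):
    """Return an (H, W, 3) array of camera-frame points
    X = depth * K^-1 [u, v, 1] for every pixel (u, v)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T   # apply K^-1 to each pixel vector
    return rays * depth[..., None]

K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])      # illustrative pixel-unit intrinsics
depth = np.full((3, 5), 2.0)          # 3x5 depth map, 2 units everywhere
points = backproject(depth, K)
```

The pixel at the principal point (u=2, v=1) lands on the optical axis at the given depth, and distances between objects can then be measured between their 3-D points.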

how to train without depth gt

I want to train using data, but without depth ground truth.

How can I do this?

png but added a jpg extension

Just to let you know that test_simple saved a png but added a jpg extension, which might cause confusion in the initial test example. Great work!

Bruce

What GPUs have you used for training the Monodepth2

Hi,

Thank you for sharing the code. I have one question about the GPU you used. I use an RTX 2080 Ti and there seem to be issues:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    trainer.train()
  File "/home/mingfu/monodepth2/trainer.py", line 186, in train
    self.run_epoch()
  File "/home/mingfu/monodepth2/trainer.py", line 202, in run_epoch
    outputs, losses = self.process_batch(inputs)
  File "/home/mingfu/monodepth2/trainer.py", line 252, in process_batch
    outputs.update(self.predict_poses(inputs, features))
  File "/home/mingfu/monodepth2/trainer.py", line 292, in predict_poses
    axisangle[:, 0], translation[:, 0], invert=(f_i < 0))
  File "/home/mingfu/monodepth2/layers.py", line 41, in transformation_from_parameters
    M = torch.matmul(R, T)
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCBlas.cu:441

However, when I change to V100, the code works.

Stereo trained with baseline 0.1

The models which were trained with stereo supervision were trained with a nominal baseline of 0.1 units.

monodepth2/datasets/mono_dataset.py

Line 196 in 5cc5c85

stereo_T[0, 3] = side_sign * baseline_sign * 0.1

The KITTI rig has a baseline of 54cm. Therefore, to convert the stereo predictions to real-world scale you multiply the depths by 5.4.

monodepth2/evaluate_depth.py

Line 24 in 5cc5c85

STEREO_SCALE_FACTOR = 5.4

Is there any significance in training with a nominal baseline of 0.1?
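The factor quoted above follows directly from the ratio of the two baselines; a tiny check, where `predicted_depth` is a hypothetical network output in training units:

```python
KITTI_BASELINE_M = 0.54     # real stereo baseline quoted above, in metres
NOMINAL_BASELINE = 0.1      # baseline used during training, arbitrary units

scale = KITTI_BASELINE_M / NOMINAL_BASELINE
predicted_depth = 2.0       # hypothetical network output in training units
metric_depth = predicted_depth * scale
```

Any other nominal baseline would work the same way, with the scale factor changing accordingly.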

How to get the Three-dimensional coordinates

Hello,
How can I get three-dimensional coordinates from image_disp and the image? I want to know the distance between an object and the ground.

Why the baseline is so high?

SfMLearner shows that the best result on KITTI with delta<1.25 is around 0.73, so why is the baseline in this paper so high, more than 10% above SfMLearner? I notice that the baseline in this paper already outperforms some newly published papers. So I'm curious which part contributes such a promising improvement.

Different intrinsics in stereo setting

The current code uses the same intrinsics for both cameras in the stereo setting, and it is not easy to adapt if the two cameras have different intrinsics. This is because of how the input dict is used: we have no clue which of the images in the batch is left or right. We have to hack many places in the code to make it work. Is there an easy way to handle this issue?

monodepth2/datasets/mono_dataset.py

Line 159 in 2e0c261

 inputs[("color", i, -1)] = self.get_color(folder, frame_index, other_side, do_flip) 

Question about the fixed intrinsic you used in the KITTIDataset class

Hi,

I notice that someone has asked about the intrinsic matrix when changing to a new dataset, and I am now trying to modify the code to train on my own dataset. I notice that in the monocular setting, in the class KITTIDataset(MonoDataset), you fixed the intrinsic matrix K as:

self.K = np.array([[0.58, 0, 0.5, 0],
                   [0, 1.92, 0.5, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
self.full_res_shape = (1242, 375)

Then I looked at the devkit_kitti_rawdata folder provided on the official website of the KITTI dataset; inside there is a readme about the camera parameters of the four cameras. Since the paper uses color input, I suppose that in monocular training the images come from camera 2 or 3, which provide the left and right color image sequences. The K_02 and K_03 given in the readme, which are the calibration matrices of cameras 2 and 3 before rectification, are:

K_02: 9.597910e+02 0.000000e+00 6.960217e+02 0.000000e+00 9.569251e+02 2.241806e+02 0.000000e+00 0.000000e+00 1.000000e+00

K_03: 9.037596e+02 0.000000e+00 6.957519e+02 0.000000e+00 9.019653e+02 2.242509e+02 0.000000e+00 0.000000e+00 1.000000e+00

And I tried to recover these matrices from the K you fixed in the code by rescaling with image_width = 1242 and image_height = 375 and performing the multiplication

K[0,:] *= image_width
K[1,:] *= image_height

but they seem not to match with the raw data K matrix.

Did I do something wrong or misunderstand something?
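The rescaling described above can be checked numerically. Note that full_res_shape = (1242, 375) reads as (width, height), so the first row is multiplied by 1242 and the second by 375. The sketch below simply carries out that arithmetic.

```python
import numpy as np

K = np.array([[0.58, 0, 0.5, 0],
              [0, 1.92, 0.5, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=np.float64)
width, height = 1242, 375            # full_res_shape is (width, height)

K_pixels = K.copy()
K_pixels[0, :] *= width              # fx and cx scale with image width
K_pixels[1, :] *= height             # fy and cy scale with image height
fx, fy = K_pixels[0, 0], K_pixels[1, 1]
cx, cy = K_pixels[0, 2], K_pixels[1, 2]
```

This gives a focal length of roughly 720 pixels on both axes with the principal point at the image center, which indeed differs from the K_02 and K_03 values quoted above. One possible explanation, stated here only as an assumption, is that the hard-coded K is a nominal approximation of the rectified calibration rather than of the unrectified matrices.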

How to use one model on different datasets? (Make3D dataset)

Hi,

You reported performance on the Make3D dataset using models trained on KITTI. Could you give more details on why you chose a center crop with a 2 × 1 ratio? Is it because of the different camera intrinsics?
Have you tried training on different datasets with different intrinsics? That way, a model could be adapted to different datasets, right?

Best regards

Why do you not use the augmented images for computing loss?

I noticed that you are only using the augmented images for passing through the depth net. But when you do the inverse warping and compute the losses, you always use the original color image. Why is that?

Is fly-out mask being computed?

I was wondering if you were computing a fly-out mask to mask out pixels that "fly out" when computing the re-projection loss. I didn't notice it in your paper or code.

Inference over multiple images at a time

Empty Epoch Problem

Hi, sorry to trouble you.
I'm trying to use monodepth2 to train on my own dataset, and I ran into a problem during training. Some epochs seem to be empty: they produce no output (no loss, no time elapsed, no time left) and appear to be skipped.

I wonder what causes this.

I really hope you can look at this issue. I appreciate it a lot!

Training with custom dataset

Hi, thank you always for your support.
It seems that monodepth2 trains on KITTI data with the Eigen split, but I would like to train with another dataset.
The training part of your code seems to depend on the KITTI format and the Eigen split algorithm.
How can I train with a general dataset, e.g. just a stream of frames?

Thank you in advance.

how to draw a trajectory picture using the pose result

Sorry to bother you again~
How can I transform the pose evaluation result into the same format as the ground-truth file provided by KITTI? When I want to draw the pose trajectory, I find that the output pose result's format is N × 4 × 4, but the ground-truth file's format is N × 12.

Actually I just want to draw a trajectory picture like this to see the pose result.

Thank you very much for your help.
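A sketch of the format conversion, under stated assumptions: each line of a KITTI odometry ground-truth file holds the top three rows of a 4x4 pose in row-major order, and the predicted matrices are relative transforms that must first be chained into global poses. Whether `relative_poses[i]` maps frame i+1 into frame i (as assumed below) or the inverse depends on the evaluation code, so that convention should be checked; both helper names are hypothetical.

```python
import numpy as np

def to_kitti_rows(global_poses):
    """Flatten (N, 4, 4) poses into (N, 12): the top three rows, row-major."""
    return global_poses[:, :3, :].reshape(len(global_poses), 12)

def accumulate(relative_poses):
    """Chain relative 4x4 transforms into global poses, starting at identity.
    Assumes relative_poses[i] maps frame i+1 coordinates into frame i."""
    poses = [np.eye(4)]
    for rel in relative_poses:
        poses.append(poses[-1] @ rel)
    return np.stack(poses)

step = np.eye(4)
step[2, 3] = 1.0                        # hypothetical: 1 unit forward per frame
rows = to_kitti_rows(accumulate([step, step]))
```

The resulting array can be written with np.savetxt, one line of 12 numbers per frame, and columns 3 and 11 (x and z translation) are the ones typically plotted for a top-down trajectory. Because monocular predictions have no absolute scale, the plotted path will generally need rescaling before it overlays the ground truth.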

Resnet-50 trained models

Hi, are the models listed here all with resnet-18 backbone? Do you also provide resnet-50 trained ones (the one in the paper page 12 table 6c)?

Thank you very much.

Monodepth on Tensorflow

Is there (or is there a plan for) a TensorFlow version of monodepth2?

Key in depth_prediction_example.ipynb

Hi,
Cool stuff!
The key model_name in the depth_prediction_example notebook is flipped. It should be model_name = "mono_640x192".

Custom dimension input to model

Hi,

When I try to pass an image with custom dimensions instead of a 1024x320 image, I get the following error:

line 60, in forward
x = torch.cat(x, 1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 351 and 352 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:83

My understanding is it needs to have 2^5 as a factor due to the scales. Is there a workaround to bypass this mismatch?
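If that understanding is right (five stride-2 stages, hence a factor of 2^5 = 32), one workaround might be to resize or pad the input so both dimensions are multiples of 32 before feeding it to the network, then resize the prediction back. The helper below is a hypothetical sketch of the rounding step only.

```python
def round_to_multiple(value, multiple=32, up=True):
    """Round a positive integer dimension to a multiple of `multiple`."""
    if up:
        return ((value + multiple - 1) // multiple) * multiple
    return (value // multiple) * multiple

# Example: a 1000-pixel-wide input could be resized to either neighbour.
width_up = round_to_multiple(1000)
width_down = round_to_multiple(1000, up=False)
```

Rounding up with padding preserves all image content, while rounding down with a resize or crop keeps the input smaller; which is preferable depends on the use case.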

a little question about pose

I'm trying to figure out the transform relationship among frames.

After this step, are we getting T_{0→-1} and T_{0→1}, or T_{-1→0} and T_{0→1}?
Thank you~

Training on wild videos

Dear authors,

In your paper, you also showed training on videos from YouTube ("Wind Walk Travel Videos"). I am wondering how you deal with the intrinsic parameters (camera focal length, distortion, etc.).

Regards,

Kaixuan

Using flyout mask leads to weird border effects

I know this was mentioned in this issue: #20
However, it was said that this was only an issue for the stereo case.

I tried manually computing the fly-out mask and replacing the invalid region of the warped image with the actual pixels from the target image. And in some cases, this led to the depth net predicting inf depth at the edges. I realize that this is different from using zero padding when performing the grid sampling. But I would think this would lead to better results, not worse, because by using the fly-out mask, you are not allowing the fly-out pixels to contaminate the loss.

Error in convert_to_HWC: "size of input tensor and input format are different"

Thanks for your great work.
But I get this error when I run train.py. It seems something goes wrong when writing the output log. Could you please help me fix it?

small object depth estimation

Hi, thanks a lot for this project and your help. I am new to this. I want to use the project to estimate depth in my own pictures, and my goal is to estimate the depth of a small object relative to the rest of the picture.

Here is an example; I want to get the depth of the taxi.

The original picture is:

The depth estimation picture after I use the mono+stereo_640x192 model:

How can I achieve this? If I train on my own picture dataset, can I get a better result?
Thank you very much!
