nianticlabs / monodepth2
[ICCV 2019] Monocular depth estimation from a single image
License: Other
Hello,
This is not an issue with the code, but rather a question about training on Cityscapes instead of KITTI. I created a new Dataset class in which I read the camera parameters stored in a JSON file and set them in a NumPy array:
K = np.array([[fx, 0, u0, 0], [0, fy, v0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=np.float32)
These intrinsics are then used in generate_images_pred. I noticed on TensorBoard that color_pred_s_0 was sometimes really messy. Apart from that, the other outputs look good. So I was wondering: is there anything else to change in the code?
Best regards
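One thing that may be worth checking, sketched here under assumptions: the KITTI K quoted elsewhere in this thread has entries such as 0.58 and 0.5, which suggests the loaders store intrinsics normalized by image width and height rather than in pixels. If that convention applies, a pixel-unit K built from fx, fy, u0 and v0 would need to be divided by the full-resolution size first. The function name and the calibration numbers below are hypothetical.

```python
import numpy as np

def normalize_intrinsics(fx, fy, u0, v0, width, height):
    """Return a 4x4 K whose first row is divided by width and second by height."""
    return np.array([[fx / width, 0, u0 / width, 0],
                     [0, fy / height, v0 / height, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]], dtype=np.float32)

# Hypothetical pixel-unit calibration for a 2048x1024 image.
K = normalize_intrinsics(fx=2262.5, fy=2265.3, u0=1096.9, v0=513.1,
                         width=2048, height=1024)
```

With normalized values, the principal point entries should land near 0.5 for a roughly centered camera, which gives a quick sanity check.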
I am trying to understand how the static mask is being computed. I haven't had the chance to run the code as I just want to incorporate the idea into my own project.
I noticed that the computation is all being done here:
if not self.opt.disable_automasking:
    # add random numbers to break ties
    identity_reprojection_loss += torch.randn(
        identity_reprojection_loss.shape).cuda() * 0.00001
    combined = torch.cat((identity_reprojection_loss, reprojection_loss), dim=1)
else:
    combined = reprojection_loss
if combined.shape[1] == 1:
    to_optimise = combined
else:
    to_optimise, idxs = torch.min(combined, dim=1)
if not self.opt.disable_automasking:
    outputs["identity_selection/{}".format(scale)] = (idxs > 1).float()
loss += to_optimise.mean()
In PyTorch, the tensors are of shape [B,C,H,W], correct? So you are concatenating the reprojection loss from the target-source pair with the reprojection loss from the target-warped pair along the channel dimension. Does this mean combined should have exactly two channels? Then you keep the minimum of the two and add it to the loss.
How are you calculating the mask? You check for when idxs > 1. But if there were only two channels, how can idxs ever be greater than 1?
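A possible reading of the snippet, sketched with NumPy on random data: if each loss tensor carries one channel per source frame, and two source frames are used (an assumption, e.g. frame ids -1 and +1), then combined has four channels rather than two, and idxs > 1 marks the pixels where one of the warped sources won the minimum. All shapes and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 1, 2, 3
num_sources = 2  # assumption: two source frames, e.g. frame ids -1 and +1

# One loss channel per source frame, for the unwarped and the warped source.
identity_loss = rng.random((B, num_sources, H, W))
reprojection_loss = rng.random((B, num_sources, H, W))

combined = np.concatenate((identity_loss, reprojection_loss), axis=1)  # 4 channels
idxs = combined.argmin(axis=1)   # values in {0, 1, 2, 3}
to_optimise = combined.min(axis=1)

# Channels 0-1 hold identity losses and channels 2-3 hold reprojection losses,
# so idxs > 1 selects pixels where a warped source gave the lowest loss.
selection = (idxs > num_sources - 1).astype(np.float32)
```

Under this reading, pixels with selection equal to 0 are the ones treated as static, because leaving the source image unwarped already explains them best.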
How did you get this mean and std?
x = (input_image - 0.45) / 0.225
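One plausible origin for these constants, not confirmed anywhere in this thread: they are close to the averages of the widely used ImageNet per-channel mean and standard deviation. The arithmetic is easy to check.

```python
# Widely used ImageNet per-channel statistics (RGB), for images scaled to [0, 1].
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

# Average over the three channels to get one scalar mean and one scalar std.
mean_avg = sum(imagenet_mean) / 3
std_avg = sum(imagenet_std) / 3
print(round(mean_avg, 3), round(std_avg, 3))
```

Both averages fall within 0.005 of 0.45 and 0.225, which would make a single scalar normalization a reasonable shortcut.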
Why is the input not normalized for the PoseCNN network?
https://github.com/nianticlabs/monodepth2/blob/master/networks/pose_cnn.py#L39
In
Line 43 in 1cc8b81
track_length
. I wonder if this is common practice (e.g. the code is borrowed from SfMLearner); in my opinion it is sort of "cheating" (no offense), because in reality you cannot optimize the scale all the time, since no ground truth is available. A more reasonable approach would be to fix the scale beforehand, e.g. using the optimal scale on the training set, and use the same scale for the whole sequence.
What's your opinion?
The ResNet-50 architecture for the 'separate_resnet' pose network has a bug.
Flags:
--num_layers 50
--pose_model_type separate_resnet
Error:
Traceback (most recent call last):
File "/media/Data/Alwyn/github/monodepth2/train.py", line 17, in <module>
trainer = Trainer(opts)
File "/media/Data/Alwyn/github/monodepth2/trainer.py", line 68, in __init__
num_input_images=self.num_pose_frames)
File "/media/Data/Alwyn/github/monodepth2/networks/resnet_encoder.py", line 80, in __init__
self.encoder = resnet_multiimage_input(num_layers, pretrained, num_input_images)
File "/media/Data/Alwyn/github/monodepth2/networks/resnet_encoder.py", line 58, in resnet_multiimage_input
model.load_state_dict(loaded)
File "/home/iitp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNetMultiImageInput:
Unexpected key(s) in state_dict: "layer1.0.conv3.weight", "layer1.0.bn3.running_mean", "layer1.0.bn3.running_var", "layer1.0.bn3.weight", "layer1.0.bn3.bias", "layer1.0.downsample.0.weight", "layer1.0.downsample.1.running_mean", "layer1.0.downsample.1.running_var", "layer1.0.downsample.1.weight", "layer1.0.downsample.1.bias", "layer1.1.conv3.weight", "layer1.1.bn3.running_mean", "layer1.1.bn3.running_var", "layer1.1.bn3.weight", "layer1.1.bn3.bias", "layer1.2.conv3.weight", "layer1.2.bn3.running_mean", "layer1.2.bn3.running_var", "layer1.2.bn3.weight", "layer1.2.bn3.bias", "layer2.0.conv3.weight", "layer2.0.bn3.running_mean", "layer2.0.bn3.running_var", "layer2.0.bn3.weight", "layer2.0.bn3.bias", "layer2.1.conv3.weight", "layer2.1.bn3.running_mean", "layer2.1.bn3.running_var", "layer2.1.bn3.weight", "layer2.1.bn3.bias", "layer2.2.conv3.weight", "layer2.2.bn3.running_mean", "layer2.2.bn3.running_var", "layer2.2.bn3.weight", "layer2.2.bn3.bias", "layer2.3.conv3.weight", "layer2.3.bn3.running_mean", "layer2.3.bn3.running_var", "layer2.3.bn3.weight", "layer2.3.bn3.bias", "layer3.0.conv3.weight", "layer3.0.bn3.running_mean", "layer3.0.bn3.running_var", "layer3.0.bn3.weight", "layer3.0.bn3.bias", "layer3.1.conv3.weight", "layer3.1.bn3.running_mean", "layer3.1.bn3.running_var", "layer3.1.bn3.weight", "layer3.1.bn3.bias", "layer3.2.conv3.weight", "layer3.2.bn3.running_mean", "layer3.2.bn3.running_var", "layer3.2.bn3.weight", "layer3.2.bn3.bias", "layer3.3.conv3.weight", "layer3.3.bn3.running_mean", "layer3.3.bn3.running_var", "layer3.3.bn3.weight", "layer3.3.bn3.bias", "layer3.4.conv3.weight", "layer3.4.bn3.running_mean", "layer3.4.bn3.running_var", "layer3.4.bn3.weight", "layer3.4.bn3.bias", "layer3.5.conv3.weight", "layer3.5.bn3.running_mean", "layer3.5.bn3.running_var", "layer3.5.bn3.weight", "layer3.5.bn3.bias", "layer4.0.conv3.weight", "layer4.0.bn3.running_mean", "layer4.0.bn3.running_var", "layer4.0.bn3.weight", "layer4.0.bn3.bias", "layer4.1.conv3.weight", 
"layer4.1.bn3.running_mean", "layer4.1.bn3.running_var", "layer4.1.bn3.weight", "layer4.1.bn3.bias", "layer4.2.conv3.weight", "layer4.2.bn3.running_mean", "layer4.2.bn3.running_var", "layer4.2.bn3.weight", "layer4.2.bn3.bias".
size mismatch for layer1.0.conv1.weight: copying a param with shape torch.Size([64, 64, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
size mismatch for layer1.1.conv1.weight: copying a param with shape torch.Size([64, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
size mismatch for layer1.2.conv1.weight: copying a param with shape torch.Size([64, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
size mismatch for layer2.0.conv1.weight: copying a param with shape torch.Size([128, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 3, 3]).
size mismatch for layer2.0.downsample.0.weight: copying a param with shape torch.Size([512, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 64, 1, 1]).
size mismatch for layer2.0.downsample.1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer2.0.downsample.1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer2.0.downsample.1.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer2.0.downsample.1.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for layer2.1.conv1.weight: copying a param with shape torch.Size([128, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer2.2.conv1.weight: copying a param with shape torch.Size([128, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer2.3.conv1.weight: copying a param with shape torch.Size([128, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 128, 3, 3]).
size mismatch for layer3.0.downsample.0.weight: copying a param with shape torch.Size([1024, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 128, 1, 1]).
size mismatch for layer3.0.downsample.1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for layer3.0.downsample.1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for layer3.0.downsample.1.running_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for layer3.0.downsample.1.running_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for layer3.1.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for layer3.2.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for layer3.3.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for layer3.4.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for layer3.5.conv1.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for layer4.0.conv1.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 256, 3, 3]).
size mismatch for layer4.0.downsample.0.weight: copying a param with shape torch.Size([2048, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 256, 1, 1]).
size mismatch for layer4.0.downsample.1.weight: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for layer4.0.downsample.1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for layer4.0.downsample.1.running_mean: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for layer4.0.downsample.1.running_var: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for layer4.1.conv1.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
size mismatch for layer4.2.conv1.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
size mismatch for fc.weight: copying a param with shape torch.Size([1000, 2048]) from checkpoint, the shape in current model is torch.Size([1000, 512]).
Process finished with exit code 1
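The size mismatches in the log are consistent with the multi-image encoder being built from BasicBlock layers while the ResNet-50 checkpoint uses Bottleneck layers. This is an inference from the log, not a confirmed diagnosis. The sketch below only reproduces the arithmetic for the final fc layer, using the block types and expansion factors of the standard torchvision ResNets; the helper name is hypothetical.

```python
# Channel expansion of the two standard torchvision ResNet block types.
EXPANSION = {"BasicBlock": 1, "Bottleneck": 4}
# Block type used by each standard ResNet depth.
BLOCK_FOR_DEPTH = {18: "BasicBlock", 34: "BasicBlock", 50: "Bottleneck"}

def fc_in_features(block_name):
    """Input width of the final fc layer: 512 times the block expansion."""
    return 512 * EXPANSION[block_name]

checkpoint_width = fc_in_features(BLOCK_FOR_DEPTH[50])  # what the ResNet-50 weights expect
model_width = fc_in_features("BasicBlock")              # what a BasicBlock model provides
```

The two widths match the fc.weight shapes in the error ([1000, 2048] in the checkpoint, [1000, 512] in the model), so choosing the block type from num_layers when constructing the multi-image encoder looks like a likely fix.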
Why do you scale the translation only for posecnn but not for shared or separate?
Line 372 in 5cc5c85
I thought I'd open up a discussion.
I recently tried training monodepth2 on Cityscapes and discovered that, even with the static mask, it was failing to mask out some vehicles moving with the ego car and was projecting them to infinity. To resolve this, I want to understand exactly why projecting these pixels to infinity minimizes the loss. This is what I don't entirely understand. My best understanding is that the easiest way to make a 3D point not move across frames is to set its z-component to infinity, so that the Euclidean distance between the two points is zero. The z-component would be the depth from the depth map. But somehow, I don't believe this is right. Does anyone have a clear mathematical formulation that shows, given the re-projection equation, that minimizing the loss comes from setting z to infinity? I believe that if I know the exact mathematical component that is causing this, it will be clearer how to resolve the issue.
Reprojection Equation:
The most obvious thing to try would be to come up with a different loss function for the static pixels, to prevent them from being projected to infinity. But then you'd need some idea of what the actual depth should be, and considering this is an unsupervised approach, I don't see how you can do that. I know struct2depth added a regularizing term that approximates what the depth of an object should be by learning the object's height, using similar triangles. But I tried struct2depth as well and it was still projecting objects to infinity, although this might have been because Mask R-CNN was missing too many objects. Just throwing ideas out there. Thanks.
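A numerical sketch of the effect asked about, under simplifying assumptions: a pinhole camera, identity rotation, and a pure forward translation. A car moving with the ego vehicle looks almost the same in consecutive frames, so the photometric loss is lowest when the warp leaves its pixels in place. With this model the pixel shift caused by translation shrinks as depth grows, so a very large depth produces a near-identity warp. The intrinsics, pixel, and translation below are hypothetical.

```python
import numpy as np

def reproject(pixel, depth, K, R, t):
    """Back-project a pixel at the given depth, apply (R, t), and project again."""
    p = np.array([pixel[0], pixel[1], 1.0])
    point = depth * (np.linalg.inv(K) @ p)   # 3D point in the target camera frame
    moved = R @ point + t                    # same point in the source camera frame
    projected = K @ moved
    return projected[:2] / projected[2]

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 120.0],
              [0.0, 0.0, 1.0]])   # hypothetical pinhole intrinsics, in pixels
R = np.eye(3)                     # assumption: no rotation between frames
t = np.array([0.0, 0.0, -1.0])    # assumption: 1 unit of forward motion
pixel = (400.0, 150.0)

# Pixel displacement induced by the camera motion, for increasing depth.
shifts = {}
for depth in (5.0, 50.0, 1e6):
    shifts[depth] = np.linalg.norm(reproject(pixel, depth, K, R, t) - np.array(pixel))
```

The shift drops from tens of pixels at depth 5 to practically zero at depth 1e6, which is one way to see why "infinite" depth is a cheap solution for objects that appear stationary in the image.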
When warping the image, you use padding_mode='border', which causes border effects (e.g. on the left and bottom for image 2->1)
Lines 381 to 384 in ec10cf1
What I've seen others do (e.g. struct2depth) is to use a mask and compute the loss only on the masked (valid) area. This can be done using padding_mode='zeros' to generate the valid mask, so that black areas do not contribute to the loss.
What's your opinion?
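The masking idea described above can be sketched without any framework. The sketch assumes sampling coordinates are normalized so that [-1, 1] spans the source image, which is the convention grid_sample uses; samples outside that range are treated as invalid and excluded from the average. The function name and toy values are hypothetical.

```python
import numpy as np

def masked_mean(per_pixel_loss, grid):
    """Average the loss over pixels whose sampling coordinates are in bounds.

    grid has shape [H, W, 2] holding (x, y) in normalized coordinates.
    """
    valid = np.all(np.abs(grid) <= 1.0, axis=-1)
    if not valid.any():
        return 0.0, valid
    return float(per_pixel_loss[valid].mean()), valid

loss = np.array([[0.2, 0.4],
                 [0.6, 9.0]])
grid = np.array([[[-0.5, -0.5], [0.5, -0.5]],
                 [[-0.5, 0.5], [1.3, 0.5]]])  # the last sample falls outside the image
mean_loss, valid = masked_mean(loss, grid)
```

Here the large loss at the out-of-bounds pixel is ignored, so the average is taken over the three valid pixels only.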
Dear,
Thanks a lot for your open source project!
While reproducing the results, I found that training takes longer than the time reported in the readme. I have already moved the dataset to an SSD and used the default settings (I am using a Titan XP, an i7-7700, and pillow-simd). I am wondering whether you perform some extra operations, for example pre-resizing the images?
For example: python train.py --model_name stereo_model2 --frame_ids 0 --use_stereo --split eigen_full takes about 14 hours.
Thank you for sharing your great code. Your clear usage instructions were a great help in setting up the environment. The simple prediction test using your trained model worked successfully, but we get an error during training. I would appreciate any help from you. Thank you in advance.
python train.py --model_name stereo_model \
--frame_ids 0 --use_stereo --split eigen_full
Training model named:
stereo_model
Models and tensorboard events files are saved to:
/root/tmp
Training is using:
cuda
Using split:
eigen_full
There are 45200 training items and 1776 validation items
Training
Traceback (most recent call last):
File "train.py", line 18, in <module>
trainer.train()
File "/workspace/trainer.py", line 186, in train
self.run_epoch()
File "/workspace/trainer.py", line 198, in run_epoch
for batch_idx, inputs in enumerate(self.train_loader):
File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/workspace/datasets/mono_dataset.py", line 174, in __getitem__
self.brightness, self.contrast, self.saturation, self.hue)
File "/root/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/torchvision/transforms/transforms.py", line 725, in get_params
if brightness > 0:
TypeError: '>' not supported between instances of 'tuple' and 'int'
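The traceback suggests that mono_dataset.py passes tuple ranges while the installed torchvision compares each argument with 0, which points to a torchvision version mismatch rather than a data problem. One hedged workaround is to probe which argument style is accepted and fall back to the other. The two get_params stand-ins below are hypothetical substitutes for the real torchvision function, and the jitter values are illustrative only.

```python
def pick_jitter_params(get_params):
    """Try range-style arguments first, fall back to scalars on TypeError."""
    ranged = ((0.8, 1.2), (0.8, 1.2), (0.8, 1.2), (-0.1, 0.1))
    scalar = (0.2, 0.2, 0.2, 0.1)
    try:
        get_params(*ranged)
        return ranged
    except TypeError:
        get_params(*scalar)
        return scalar

def old_style_get_params(brightness, contrast, saturation, hue):
    # Stand-in for an API that compares its arguments with 0, as in the traceback.
    return [v for v in (brightness, contrast, saturation, hue) if v > 0]

def new_style_get_params(brightness, contrast, saturation, hue):
    # Stand-in for an API that expects (min, max) pairs.
    return [(lo, hi) for lo, hi in (brightness, contrast, saturation, hue)]

old_choice = pick_jitter_params(old_style_get_params)
new_choice = pick_jitter_params(new_style_get_params)
```

Installing the torchvision version the repository was developed against would be the simpler route if it is an option.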
Line 118 in 1cc8b81
Hi, thanks a lot for this project. I am trying to use your work for a summer research project at my university and want to run monodepth2 on the Nvidia Jetson Nano. However, when I try to run the example command to test out the depth prediction for a single image, I get this error:
Weirdly, the file in question (SpatialUpSamplingBilinear.cu) doesn't exist anywhere on my machine. I installed pytorch using this guide: https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano/post/5324123/#5324123
(This might be a pytorch specific issue, in which case I'd be happy to close this)
Thank you!
Hi, thanks for your work.
I want to train an ASN model in stereo mode on indoor scenes. I changed max_depth in option.py, but the model does not fit. Do I need to change more parameters? I hope you can give me some advice. Thank you.
It seems that the released pre-trained models are trained on the Eigen split (for depth). I wonder if Odom split models (for pose) are also available. Thanks!
Does the translation get affected when an image is flipped? You have changed tx here:
https://github.com/nianticlabs/monodepth2/blob/master/datasets/mono_dataset.py#L194
But for a flipped image, the principal point changes instead, as done here:
https://github.com/ClementPinard/SfmLearner-Pytorch/blob/0caec9ed0f83cb65ba20678a805e501439d2bc25/custom_transforms.py#L55
Hello,
I added support for training monodepth2 on Cityscapes, and trained it using the default hyperparameters used for monocular training on KITTI. It looks like it is having trouble masking out all the static pixels and is projecting cars that move with the ego car to infinity. I was wondering if anyone had any useful input on this? The most obvious thing to do is to weight the static mask so that it masks out more pixels as static. But I was wondering if the authors ran into this issue when training on KITTI. Is it possible that this didn't happen on KITTI because the environment is mostly static?
Hello,
First off, thank you for all your hard work on this. I really enjoyed reading your paper and learning about the method.
I'm currently training the model on my own dataset. It's an indoor office-type setting with a lot more low texture regions than KITTI (blank walls and floors). I read in the paper that these regions can give rise to "holes" in the lower resolution depth maps and I can see signs of this after the first epoch of training. As training goes on, the holes get worse to the point where all resolutions are hole patterns or completely blank.
Any tips on how to make the model more robust to these artifacts? I was thinking of adjusting the number of scales or maybe changing alpha in the photometric error.
Also, the video sequences were taken at 30 fps. I noticed that adjacent frames did not look very different than the target image. Could this also be contributing to the holes?
I also want to avoid going through my data and removing clips with a lot of low texture.
Thank you
Thanks a lot, I found the BackprojectDepth class, but I still cannot compute the distance between an object (such as a single point) and the ground. Can you give me more help? Thank you.
I noticed that during evaluation, you have to compute a ratio to correct for a scaling factor when computing the errors. Specifically in evaluate_depth.py,
if not opt.disable_median_scaling:
    ratio = np.median(gt_depth) / np.median(pred_depth)
    ratios.append(ratio)
    pred_depth *= ratio
If you need the ground truth to run inference properly, then what is the point of this model? What's the point of running inference if you need ground truth? Why not just have the depth net predict depth directly instead of predicting disparity?
I would expect the depth to be off by a scaling factor and maybe even an offset if you tried inferencing with a camera with different intrinsics than the camera the model was trained with, but if the model was properly learning the depth, and you are evaluating on kitti, the same dataset you trained with, why would you need to adjust by this scaling factor?
From what I can tell, this is due to how you are converting between disparity and depth. depth = 1 / disp. You don't know the scaling factor, so you just hand-wave it away and call it one.
Normally, to convert between depth and disparity, you need the focal length and the baseline distance between the two cameras as explained by opencv here: https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_calib3d/py_depthmap/py_depthmap.html
But this model assumes the two cameras are in a binocular setting. This is a monocular setting, and the pose between two frames almost never represents a binocular setting with identity rotation matrix and zero translation in y and z directions. So what does disparity even mean in this model??? Is it still just the difference in pixels of two correspondence points, except now the distance can be in both x and y?
I've been trying to think of how to properly compute depth from disparity in this model. But it's not as simple as computing the distance between the two cameras because they are not binocular. I suppose you'd have to apply rectification so they are co-planar? But even then, that approach requires two images to properly compute disparity/depth. And the depth net always only takes in one image.
The conclusion I've arrived at is that it'd be better to have the depth net directly predict depth instead of predicting disparity. And then the model would properly learn scale? To me, it doesn't even make sense to talk about disparity in a monocular setting anyway. What is the disparity even relative to?
But something tells me that there is a reason why everyone who is doing this unsupervised approach is directly predicting disparity instead of depth. I am worried that if the model tries to learn depth directly, then it may focus too much on points that are far away during the loss optimization because it will be more likely to get larger errors for points that are far away than points that are close up, and then it won't really learn proper depth for objects that are up close, which are the points we actually care about. How far away the sky is normally isn't very important.
Sorry that this is so long. I wanted to make it very clear where I am coming from. Thank you.
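The median scaling quoted in this post can be tried on synthetic data. The sketch assumes the prediction is correct up to a single unknown global scale, which is the usual ambiguity when depth and camera motion are both learned from monocular video; the depth range and the scale value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gt_depth = rng.uniform(2.0, 80.0, size=1000)  # synthetic "ground truth" depths
unknown_scale = 0.03                          # hypothetical scale the network settled on
pred_depth = unknown_scale * gt_depth         # prediction that is right up to scale

# Same ratio as in the evaluation snippet: align the medians.
ratio = np.median(gt_depth) / np.median(pred_depth)

abs_rel_before = np.mean(np.abs(gt_depth - pred_depth) / gt_depth)
abs_rel_after = np.mean(np.abs(gt_depth - pred_depth * ratio) / gt_depth)
```

In this toy case the relative error goes from 0.97 to essentially zero, so the scaling measures how good the depth is up to scale rather than supplying the depth itself. A scale fixed by other means, such as a known stereo baseline or camera height, would serve the same purpose at inference time.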
Hi,
It's mentioned that the code runs on a single GPU. Did you try running it across multiple GPUs?
I tried across two GPUs. The encoder works fine. However, there is some problem in creating replicas of depth decoder across GPUs. Any clue with that?
Regards - Debapriya
Hi, I want to know how to calculate the real-world distance from the camera to an object. I am using my own images, which means the pretrained model was trained on a different camera configuration. How can I calibrate this pretrained model for my own camera's input images?
Thanks in advance
Hello,
How can I use the disp to get the 3-D coordinates?
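A minimal back-projection sketch, assuming the disparity has already been converted to a depth map and that pixel-unit pinhole intrinsics K are available. Each pixel (u, v) with depth d maps to the camera-frame point d * K^-1 [u, v, 1]^T. The function name, intrinsics, and depth values are hypothetical, and the result is only metric if the depth itself is in metric units.

```python
import numpy as np

def backproject(depth, K):
    """Turn an [H, W] depth map into [H, W, 3] camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T   # K^-1 [u, v, 1]^T for every pixel
    return rays * depth[..., None]

K = np.array([[500.0, 0.0, 2.0],
              [0.0, 500.0, 1.0],
              [0.0, 0.0, 1.0]])    # hypothetical pixel-unit intrinsics
depth = np.full((3, 5), 10.0)      # hypothetical depth map
points = backproject(depth, K)
```

The pixel at the principal point lands on the optical axis, at (0, 0, 10) in this example; distances between objects can then be measured between the resulting 3D points.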
I want to train using data, but without depth ground truth.
How can I do that?
Hi
Just to let you know that test_simple saves a PNG but gives it a .jpg extension, which might be confusing in the initial test example. Great work!
Bruce
Hi,
Thank you for sharing the code. I have one question about the GPU you used. I am using an RTX 2080 Ti and there seem to be issues:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
File "train.py", line 18, in <module>
trainer.train()
File "/home/mingfu/monodepth2/trainer.py", line 186, in train
self.run_epoch()
File "/home/mingfu/monodepth2/trainer.py", line 202, in run_epoch
outputs, losses = self.process_batch(inputs)
File "/home/mingfu/monodepth2/trainer.py", line 252, in process_batch
outputs.update(self.predict_poses(inputs, features))
File "/home/mingfu/monodepth2/trainer.py", line 292, in predict_poses
axisangle[:, 0], translation[:, 0], invert=(f_i < 0))
File "/home/mingfu/monodepth2/layers.py", line 41, in transformation_from_parameters
M = torch.matmul(R, T)
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCBlas.cu:441
However, when I change to V100, the code works.
The models which were trained with stereo supervision were trained with a nominal baseline of 0.1 units.
monodepth2/datasets/mono_dataset.py
Line 196 in 5cc5c85
The KITTI rig has a baseline of 54cm. Therefore, to convert the stereo predictions to real-world scale you multiply the depths by 5.4.
Line 24 in 5cc5c85
Is there any significance in training with a nominal baseline of 0.1?
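The factor of 5.4 follows directly from the two baselines given in this post, as the short calculation below shows. The constant and function names are hypothetical.

```python
NOMINAL_BASELINE = 0.1   # units used during stereo training, per the post above
KITTI_BASELINE_M = 0.54  # 54 cm, per the post above

# Predicted depths are expressed in units where the baseline is 0.1,
# so multiplying by real/nominal converts them to metres.
scale = KITTI_BASELINE_M / NOMINAL_BASELINE

def to_metres(pred_depth):
    """Convert a stereo-trained depth prediction to metres."""
    return pred_depth * scale
```

For example, a predicted depth of 2.0 corresponds to about 10.8 m under these assumptions.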
Hello,
How can I get the three-dimensional coordinates from image_disp and the image? I want to know the distance between an object and the ground.
SfMLearner reports that its best result on KITTI for delta<1.25 is around 0.73, so why is the baseline in this paper so high, more than 10% above SfMLearner? I notice that the baseline in this paper already outperforms some newly published papers. So I'm curious which part contributes such a promising improvement.
The current code uses the same intrinsics for both cameras in the stereo setting, and it is not easy to adapt if the two cameras have different intrinsics. This is because of how the input dict is used: we have no clue which images in the batch are left or right, so we have to hack many places in the code to make it work. Is there an easy way to handle this?
monodepth2/datasets/mono_dataset.py
Line 159 in 2e0c261
Hi,
I notice that someone has asked about the intrinsic matrix when changing to a new dataset, and I am now trying to modify the code to train on my own dataset. I notice that in the monocular setting, in the class KITTIDataset(MonoDataset), you fixed the intrinsic matrix K as:
self.K = np.array([[0.58, 0, 0.5, 0],
                   [0, 1.92, 0.5, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
self.full_res_shape = (1242, 375)
Then I looked at the devkit_kitti_rawdata folder provided on the official KITTI website; inside there is a readme about the camera parameters of the four cameras. Since the paper uses color input, I assumed that in monocular training the images come from camera 2 or 3, which capture the left and right color sequences. The K_02 and K_03 given in the readme, which are the calibration matrices of cameras 2 and 3 before rectification, are:
K_02: 9.597910e+02 0.000000e+00 6.960217e+02 0.000000e+00 9.569251e+02 2.241806e+02 0.000000e+00 0.000000e+00 1.000000e+00
K_03: 9.037596e+02 0.000000e+00 6.957519e+02 0.000000e+00 9.019653e+02 2.242509e+02 0.000000e+00 0.000000e+00 1.000000e+00
I then tried to recover these matrices from the K fixed in the code by rescaling it with image_height = 1242 and image_width = 375, performing the multiplication
K[0,:] *= image_width
K[1,:] *= image_height
but the result does not seem to match the raw-data K matrices.
Did I do something wrong or misunderstand something?
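Two things may explain the mismatch, offered as a hedged reading rather than a confirmed answer. First, full_res_shape = (1242, 375) reads more naturally as (width, height), whereas the post assigns 1242 to the height. Second, K_02 and K_03 are the calibration matrices before rectification, so they would not be expected to match if the loader works with rectified images. The rescaling with width 1242 and height 375 looks like this:

```python
import numpy as np

K_norm = np.array([[0.58, 0, 0.5, 0],
                   [0, 1.92, 0.5, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]], dtype=np.float32)
width, height = 1242, 375  # assumption: full_res_shape is (width, height)

K_pix = K_norm.copy()
K_pix[0, :] *= width    # first row scales with image width
K_pix[1, :] *= height   # second row scales with image height
print(K_pix[0, 0], K_pix[1, 1], K_pix[0, 2], K_pix[1, 2])
```

Both focal lengths come out at roughly 720 pixels with the principal point at the image center, which is far from the 959.8 in K_02 and is more plausibly an approximation of the rectified camera.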
Hi,
Best regards
I noticed that you are only using the augmented images for passing through the depth net. But when you do the inverse warping and compute the losses, you always use the original color image. Why is that?
I was wondering if you were computing a fly-out mask to mask out pixels that "fly out" when computing the re-projection loss. I didn't notice it in your paper or code.
Hi, sorry to trouble you.
I'm trying to use monodepth2 to train on my own dataset, and I ran into a problem during training. Some epochs seem to be empty, because there are no outputs (no loss, no time elapsed, no time left; they seem to be skipped).
I wonder what the reason for this is.
I really hope you see this issue. I'd appreciate it a lot!
Hi, thank you always for your support.
It seems that monodepth2 trains on KITTI data with the Eigen split, but I would like to train on another dataset.
The training part of your code seems to depend on the KITTI format and the Eigen split.
How can I train on a general dataset, e.g. just a stream of frames?
Thank you in advance.
Sorry to bother you again~
How can I transform the pose evaluation result into the same format as the ground-truth file provided by KITTI? When I want to draw the pose trajectory, I find that the output pose result has the format N × 4 × 4, but the ground-truth file has the format N × 12.
Actually I just want to draw a trajectory picture like this to see the pose result.
Thank you very much for your help.
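A possible conversion, sketched with NumPy. KITTI odometry ground-truth files store one pose per line as 12 values, the top 3×4 part of the homogeneous matrix in row-major order, so a 4×4 pose can be flattened by dropping its last row. If the predictions are frame-to-frame transforms, they first need to be chained into poses relative to the first frame. Depending on the convention, each relative transform may need to be inverted before chaining, which is not checked here. Both function names are hypothetical.

```python
import numpy as np

def poses_to_kitti_rows(poses):
    """Convert [N, 4, 4] homogeneous poses to [N, 12] rows (top 3x4, row-major)."""
    poses = np.asarray(poses)
    return poses[:, :3, :].reshape(len(poses), 12)

def accumulate(relative_poses):
    """Chain frame-to-frame transforms into poses relative to the first frame."""
    current = np.eye(4)
    out = [current.copy()]
    for rel in relative_poses:
        current = current @ rel
        out.append(current.copy())
    return np.stack(out)

# Toy example: two identical steps of one unit along x.
step = np.eye(4)
step[0, 3] = 1.0
rows = poses_to_kitti_rows(accumulate([step, step]))
```

Columns 3, 7 and 11 of each row then hold the x, y and z translation, which is what a trajectory plot needs. Writing the rows with np.savetxt produces a file in the same layout as the ground truth.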
Hi, are the models listed here all with resnet-18 backbone? Do you also provide resnet-50 trained ones (the one in the paper page 12 table 6c)?
Thank you very much.
Is there (or is there a plan for) a TensorFlow version of monodepth2?
Hi,
Cool stuff!
The key model_name in the depth_prediction_example notebook is flipped. It should be model_name = "mono_640x192".
Hi,
When I try to add a custom dimension image instead of a 1024x320 image, I get the following error:
line 60, in forward
x = torch.cat(x, 1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 351 and 352 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:83
My understanding is it needs to have 2^5 as a factor due to the scales. Is there a workaround to bypass this mismatch?
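If the factor-of-32 reading above is right, one workaround is to resize the input to the nearest multiple of 32 before running the network and resize the predicted disparity back to the original size afterwards. The sketch below only computes the target size; the helper names are hypothetical and the resizing itself is left to whatever image library is in use.

```python
def round_to_multiple(value, multiple=32):
    """Nearest multiple of `multiple`, never smaller than `multiple` itself."""
    return max(multiple, int(round(value / multiple)) * multiple)

def network_size(width, height, multiple=32):
    """Width and height to feed the network, both divisible by `multiple`."""
    return round_to_multiple(width, multiple), round_to_multiple(height, multiple)

feed_w, feed_h = network_size(1242, 375)
```

For a 1242x375 image this gives 1248x384. Padding instead of resizing would also work, at the cost of border regions that the network never saw during training.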
Dear authors,
In your paper, you also showed training on the YouTube "Wind Walk Travel Videos". I am wondering how you deal with the intrinsic parameters (camera focal length, distortion, etc.).
Regards,
Kaixuan
I know this was mentioned in this issue: #20
However, it was said that this was only an issue for the stereo case.
I tried manually computing the fly-out mask and replacing the invalid region of the warped image with the actual pixels from the target image. And in some cases, this led to the depth net predicting inf depth at the edges. I realize that this is different from using zero padding when performing the grid sampling. But I would think this would lead to better results, not worse, because by using the fly-out mask, you are not allowing the fly-out pixels to contaminate the loss.
Hi, thanks a lot for this project and your help. I am new to this field. I want to use the project to get depth estimates for my own pictures, and my aim is to estimate the depth of a small object relative to the rest of the picture.
Here is an example; I want to get the depth of the taxi.
The depth estimate after using the mono+stereo_640x192 model:
How can I achieve this? If I train on my own picture dataset, can I get a better result?
Thank you very much!