
Comments (5)

JamieWatson683 commented on June 5, 2024

Hi, thanks a lot for your interest in the project.

Manydepth is all about training a depth model from monocular video sequences alone. In this setting (similar to Monodepth2 "M" models), depths and poses are estimated only up to some arbitrary scale factor (in the same way that monocular SLAM or SfM cannot resolve absolute scale). We do not know this scale in advance, and it will change with each model that is trained (the network effectively gets to decide its own scale) - note this is why we need the "adaptive cost volume" we introduce as one of our contributions in the paper.

At evaluation time we need to apply median scaling to our estimated depths to allow for comparison to ground truth lidar (see here). This is identical to Monodepth2 for a "mono" model. If you want to get to a rough real world scale, you could try scaling Manydepth's outputs by the average median scaling of the test set.
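
To make that concrete, here is a minimal sketch of what per-image median scaling looks like (a simplified illustration rather than our exact evaluation code; names are illustrative):

```python
import numpy as np

def median_scale(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
    """Rescale a predicted depth map so its median matches the ground truth median.

    pred_depth, gt_depth: 2D arrays of per-pixel depths for a single image.
    Only pixels with valid ground truth are used to compute the ratio.
    """
    mask = (gt_depth > min_depth) & (gt_depth < max_depth)
    ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    scaled = np.clip(pred_depth * ratio, min_depth, max_depth)
    return scaled, ratio
```

Averaging the per-image `ratio` values over a test set is what would give you the single rough scale factor mentioned above.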

In your comment you are comparing Manydepth to a model from Monodepth2 which was trained using both monocular sequences and stereo pairs (hence the name mono+stereo_1024x320). Since this model uses stereo pairs during training, it will have a real world scale given by the fixed, known baseline between the 2 cameras. In Monodepth2, this baseline is set to be 0.1m, whereas in reality the cameras are 0.54m apart. This is why at evaluation time they scale by 5.4.
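
In code that correction is nothing more than a fixed constant applied to the stereo-trained model's outputs (a sketch; I believe Monodepth2's evaluation script defines a 5.4 constant for exactly this purpose):

```python
# The training loss assumed a 0.1m baseline, but the real cameras are 0.54m apart,
# so depths from the mono+stereo model are multiplied by 0.54 / 0.1 = 5.4.
STEREO_SCALE_FACTOR = 0.54 / 0.1  # = 5.4

def to_metric_depth(pred_depth):
    """Convert a stereo-trained model's predicted depth to (approximately) metric scale."""
    return pred_depth * STEREO_SCALE_FACTOR
```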

I hope this makes sense, but if not please let me know and I can try to clarify.


ChauChorHim commented on June 5, 2024

> Hi, thanks a lot for your interest in the project.
>
> Manydepth is all about training a depth model from monocular video sequences alone. In this setting (similar to Monodepth2 "M" models), depths and poses are estimated only up to some arbitrary scale factor (in the same way that monocular SLAM or SfM cannot resolve absolute scale). We do not know this scale in advance, and it will change with each model that is trained (the network effectively gets to decide its own scale) - note this is why we need the "adaptive cost volume" we introduce as one of our contributions in the paper.
>
> At evaluation time we need to apply median scaling to our estimated depths to allow for comparison to ground truth lidar (see here). This is identical to Monodepth2 for a "mono" model. If you want to get to a rough real world scale, you could try scaling Manydepth's outputs by the average median scaling of the test set.
>
> In your comment you are comparing Manydepth to a model from Monodepth2 which was trained using both monocular sequences and stereo pairs (hence the name mono+stereo_1024x320). Since this model uses stereo pairs during training, it will have a real world scale given by the fixed, known baseline between the 2 cameras. In Monodepth2, this baseline is set to be 0.1m, whereas in reality the cameras are 0.54m apart. This is why at evaluation time they scale by 5.4.
>
> I hope this makes sense, but if not please let me know and I can try to clarify.

Hi,
Thanks for your detailed answer, but I am still a little bit confused. Since the estimated depths need median scaling applied anyway, what is the point of the adaptive cost volume?


JamieWatson683 commented on June 5, 2024

@ChauChorHim - the median scaling is purely an evaluation step, so we can compare to the GT and obtain scores. This is the same as in Monodepth2, and indeed in (almost) all depth estimation works trained from monocular video.

The adaptive cost volume is a training-time technique. A cost volume warps features from previous frames onto a set of hypothesised depth planes, to measure how likely each hypothesised depth is to be correct. When training from monocular video we do not know the scale, so we cannot fix the hypothesised depth planes in advance (imagine that we define our min/max depth planes to be 0.5m and 100m, but the network can pick any scale, and perhaps it compresses everything such that its max depth is only 10m. This would mean almost all of our cost volume is unhelpful).

Instead we need to estimate the depth planes as we train - hence the adaptive cost volume.
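
As a rough sketch of what this looks like (a simplified plane-sweep in PyTorch, not our actual implementation; shapes, names and the linear spacing of planes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def build_cost_volume(feat_target, feat_source, K, K_inv, T_target_to_source,
                      d_min, d_max, num_planes=96):
    """Plane-sweep cost volume between a target and one source feature map.

    feat_*:              (B, C, H, W) feature maps at the same resolution as K.
    K, K_inv:            (B, 3, 3) intrinsics and their inverse at feature resolution.
    T_target_to_source:  (B, 4, 4) relative camera pose.
    d_min, d_max:        scalars; in the adaptive cost volume these are estimated
                         from the network's own predictions during training rather
                         than fixed in advance.
    Returns a (B, num_planes, H, W) volume of per-plane matching costs.
    """
    B, C, H, W = feat_target.shape
    device = feat_target.device

    # Homogeneous pixel coordinates of the target image: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    costs = []
    for d in torch.linspace(float(d_min), float(d_max), num_planes, device=device):
        # Back-project every target pixel to 3D assuming it lies at depth d ...
        cam_points = K_inv @ pix * d                                    # (B, 3, H*W)
        cam_points = torch.cat([cam_points,
                                torch.ones(B, 1, H * W, device=device)], dim=1)
        # ... move it into the source camera and project it back to 2D
        src_points = (T_target_to_source @ cam_points)[:, :3]           # (B, 3, H*W)
        src_pix = K @ src_points
        src_pix = src_pix[:, :2] / (src_pix[:, 2:3] + 1e-7)

        # Sample the source features at those locations (grid_sample wants [-1, 1])
        x_norm = src_pix[:, 0] / (W - 1) * 2 - 1
        y_norm = src_pix[:, 1] / (H - 1) * 2 - 1
        grid = torch.stack([x_norm, y_norm], dim=-1).view(B, H, W, 2)
        warped = F.grid_sample(feat_source, grid,
                               padding_mode="border", align_corners=True)

        # If the hypothesised depth is right, warped and target features should agree
        costs.append((warped - feat_target).abs().mean(dim=1))          # (B, H, W)

    return torch.stack(costs, dim=1)
```

If d_min and d_max were fixed badly (the 0.5m-100m vs 10m example above), most of those planes would warp features to implausible locations and contribute nothing useful - estimating them from the network's current predictions keeps the planes covering the scale the network has actually chosen.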

Does that help at all?

@JinraeKim - the median scaling approach is the standard for evaluating depth estimation works trained on monocular video (going back to SfMLearner). It can be thought of in the same way as monocular SLAM techniques, or structure from motion - both of these give outputs only up to scale, and are (highly) unlikely to be in real world scale. There are a few monocular depth papers which try to address the scale issue, but that was not the goal of ManyDepth.


JinraeKim commented on June 5, 2024

@JamieWatson683
TBH, I don't understand the point of this median scaling approach. Can it be consistent across multiple test datasets? If not, what is the point of self-supervised depth estimation? Maybe I'm missing something...


JinraeKim commented on June 5, 2024

> @JinraeKim - the median scaling approach is the standard for evaluating depth estimation works trained on monocular video (going back to SfMLearner). It can be thought of in the same way as monocular SLAM techniques, or structure from motion - both of these give outputs only up to scale, and are (highly) unlikely to be in real world scale. There are a few monocular depth papers which try to address the scale issue, but that was not the goal of ManyDepth.

Thank you so much! This helped me a lot.
I also studied more after asking the question and realised the inherent scale ambiguity of monocular depth estimation.

If you don't mind, could you answer this as well?
I'm reading the ManyDepth paper and still don't understand how the networks work "after" the cost volume part.
For example, the cost volume is constructed with predefined depths (linearly spaced between d_min and d_max).
This can also be interpreted as a "likelihood", as mentioned in the paper.
It is fed (together with the image features) into the additional encoder-decoder that maps to the depth of the target frame.
I think the encoder-decoder needs to be trained as well, but it's not obvious to me how one can train it to predict the depth of the target frame. If we needed ground-truth depths to train it, the scheme would not be self-supervised, so I may be missing something important.

Thank you in advance!
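
My current guess, in case it helps clarify the question: the training signal would be a self-supervised photometric reprojection loss in the style of Monodepth2, where the predicted depth and pose warp a source image into the target view and the result is compared against the real target image. A rough sketch of what I assume that term looks like (my own simplification, not ManyDepth's code):

```python
import torch
import torch.nn.functional as F

def photometric_loss(warped_source, target, alpha=0.85):
    """Self-supervised photometric error (SSIM + L1), Monodepth2-style.

    warped_source: a source frame warped into the target view using the
                   *predicted* depth and relative pose (no ground truth used).
    target:        the real target frame.
    Both are (B, 3, H, W) tensors with values in [0, 1].
    """
    l1 = (warped_source - target).abs().mean(1, keepdim=True)

    # Simple 3x3 windowed SSIM
    mu_x = F.avg_pool2d(warped_source, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sigma_x = F.avg_pool2d(warped_source ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(warped_source * target, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_loss = torch.clamp((1 - ssim) / 2, 0, 1).mean(1, keepdim=True)

    # Gradients flow back through warped_source into the depth (and pose) networks,
    # so minimising this error trains them without any ground-truth depth.
    return alpha * ssim_loss + (1 - alpha) * l1
```

If that is roughly right, then no ground-truth depth is needed at all, but I would appreciate confirmation.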


