
Comments (5)

JamieWatson683 commented on June 5, 2024

Hi, thanks a lot for your interest in the project.

Manydepth is all about training a depth model from monocular video sequences alone. In this setting (similar to Monodepth2 "M" models), depths and poses are estimated only up to some arbitrary scale factor (in the same way that monocular SLAM or SfM cannot resolve absolute scale). We do not know this scale in advance, and it will change with each model that is trained (the network effectively gets to decide its own scale) - note this is why we need the "adaptive cost volume" we introduce as one of our contributions in the paper.

At evaluation time we need to apply median scaling to our estimated depths to allow for comparison to ground truth lidar (see here). This is identical to Monodepth2 for a "mono" model. If you want to get to a rough real world scale, you could try scaling Manydepth's outputs by the average median scaling of the test set.
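
To make that concrete, here is a minimal sketch of what per-image median scaling looks like (a simplified illustration rather than our exact evaluation code; names are illustrative):

```python
import numpy as np

def median_scale(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
    """Rescale a predicted depth map so its median matches the ground truth median.

    pred_depth, gt_depth: 2D arrays of per-pixel depths for a single image.
    Only pixels with valid ground truth are used to compute the ratio.
    """
    mask = (gt_depth > min_depth) & (gt_depth < max_depth)
    ratio = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    scaled = np.clip(pred_depth * ratio, min_depth, max_depth)
    return scaled, ratio
```

Averaging the per-image `ratio` values over a test set is what would give you the single rough scale factor mentioned above.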

In your comment you are comparing Manydepth to a model from Monodepth2 which was trained using both monocular sequences and stereo pairs (hence the name mono+stereo_1024x320). Since this model uses stereo pairs during training, it will have a real world scale given by the fixed, known baseline between the 2 cameras. In Monodepth2, this baseline is set to be 0.1m, whereas in reality the cameras are 0.54m apart. This is why at evaluation time they scale by 5.4.
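
In code that correction is nothing more than a fixed constant applied to the stereo-trained model's outputs (a sketch; I believe Monodepth2's evaluation script defines a 5.4 constant for exactly this purpose):

```python
# The training loss assumed a 0.1m baseline, but the real cameras are 0.54m apart,
# so depths from the mono+stereo model are multiplied by 0.54 / 0.1 = 5.4.
STEREO_SCALE_FACTOR = 0.54 / 0.1  # = 5.4

def to_metric_depth(pred_depth):
    """Convert a stereo-trained model's predicted depth to (approximately) metric scale."""
    return pred_depth * STEREO_SCALE_FACTOR
```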

I hope this makes sense, but if not please let me know and I can try to clarify.


ChauChorHim commented on June 5, 2024

> Hi, thanks a lot for your interest in the project.
>
> Manydepth is all about training a depth model from monocular video sequences alone. In this setting (similar to Monodepth2 "M" models), depths and poses are estimated only up to some arbitrary scale factor (in the same way that monocular SLAM or SfM cannot resolve absolute scale). We do not know this scale in advance, and it will change with each model that is trained (the network effectively gets to decide its own scale) - note this is why we need the "adaptive cost volume" we introduce as one of our contributions in the paper.
>
> At evaluation time we need to apply median scaling to our estimated depths to allow for comparison to ground truth lidar (see here). This is identical to Monodepth2 for a "mono" model. If you want to get to a rough real world scale, you could try scaling Manydepth's outputs by the average median scaling of the test set.
>
> In your comment you are comparing Manydepth to a model from Monodepth2 which was trained using both monocular sequences and stereo pairs (hence the name mono+stereo_1024x320). Since this model uses stereo pairs during training, it will have a real world scale given by the fixed, known baseline between the 2 cameras. In Monodepth2, this baseline is set to be 0.1m, whereas in reality the cameras are 0.54m apart. This is why at evaluation time they scale by 5.4.
>
> I hope this makes sense, but if not please let me know and I can try to clarify.

Hi,
Thanks for your detailed answer, but I am still a little bit confused. Since the estimated depths need median scaling applied anyway, what is the point of the adaptive cost volume?


JamieWatson683 commented on June 5, 2024

@ChauChorHim - the median scaling is purely an evaluation step, so we can compare to the GT and obtain scores. This is the same as in Monodepth2, and indeed in (almost) all depth estimation works trained from monocular video.

The adaptive cost volume is a training-time technique. A cost volume warps features from previous frames onto a set of hypothesised depth planes, to measure how likely each hypothesised depth is to be correct. When training from monocular video we do not know the scale, so we cannot fix the hypothesised depth planes in advance (imagine that we define our min/max depth planes to be 0.5m and 100m, but the network can pick any scale, and perhaps it compresses everything such that its max depth is only 10m. This would mean almost all of our cost volume is unhelpful).

Instead we need to estimate the depth planes as we train - hence the adaptive cost volume.
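
As a rough sketch of what this looks like (a simplified plane-sweep in PyTorch, not our actual implementation; shapes, names and the linear spacing of planes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def build_cost_volume(feat_target, feat_source, K, K_inv, T_target_to_source,
                      d_min, d_max, num_planes=96):
    """Plane-sweep cost volume between a target and one source feature map.

    feat_*:              (B, C, H, W) feature maps at the same resolution as K.
    K, K_inv:            (B, 3, 3) intrinsics and their inverse at feature resolution.
    T_target_to_source:  (B, 4, 4) relative camera pose.
    d_min, d_max:        scalars; in the adaptive cost volume these are estimated
                         from the network's own predictions during training rather
                         than fixed in advance.
    Returns a (B, num_planes, H, W) volume of per-plane matching costs.
    """
    B, C, H, W = feat_target.shape
    device = feat_target.device

    # Homogeneous pixel coordinates of the target image: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    costs = []
    for d in torch.linspace(float(d_min), float(d_max), num_planes, device=device):
        # Back-project every target pixel to 3D assuming it lies at depth d ...
        cam_points = K_inv @ pix * d                                    # (B, 3, H*W)
        cam_points = torch.cat([cam_points,
                                torch.ones(B, 1, H * W, device=device)], dim=1)
        # ... move it into the source camera and project it back to 2D
        src_points = (T_target_to_source @ cam_points)[:, :3]           # (B, 3, H*W)
        src_pix = K @ src_points
        src_pix = src_pix[:, :2] / (src_pix[:, 2:3] + 1e-7)

        # Sample the source features at those locations (grid_sample wants [-1, 1])
        x_norm = src_pix[:, 0] / (W - 1) * 2 - 1
        y_norm = src_pix[:, 1] / (H - 1) * 2 - 1
        grid = torch.stack([x_norm, y_norm], dim=-1).view(B, H, W, 2)
        warped = F.grid_sample(feat_source, grid,
                               padding_mode="border", align_corners=True)

        # If the hypothesised depth is right, warped and target features should agree
        costs.append((warped - feat_target).abs().mean(dim=1))          # (B, H, W)

    return torch.stack(costs, dim=1)
```

If d_min and d_max were fixed badly (the 0.5m-100m vs 10m example above), most of those planes would warp features to implausible locations and contribute nothing useful - estimating them from the network's current predictions keeps the planes covering the scale the network has actually chosen.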

Does that help at all?

@JinraeKim - the median scaling approach is the standard for evaluating depth estimation works trained on monocular video (going back to SfMLearner). It can be thought of in the same way as monocular SLAM techniques, or structure from motion - both of these give outputs only up to scale, and are (highly) unlikely to be in real world scale. There are a few monocular depth papers which try to address the scale issue, but that was not the goal of ManyDepth.


JinraeKim commented on June 5, 2024

@JamieWatson683
TBH, I don't understand the point of this median scaling approach. Can it be consistent across multiple test datasets? If not, what is the point of self-supervised depth estimation? Maybe I'm missing something...


JinraeKim commented on June 5, 2024

> @JinraeKim - the median scaling approach is the standard for evaluating depth estimation works trained on monocular video (going back to SfMLearner). It can be thought of in the same way as monocular SLAM techniques, or structure from motion - both of these give outputs only up to scale, and are (highly) unlikely to be in real world scale. There are a few monocular depth papers which try to address the scale issue, but that was not the goal of ManyDepth.

Thank you so much! This helped me a lot.
I also studied more after asking the question and realised the inherent scale ambiguity of monocular depth estimation.

If you don't mind, could you answer this as well?
I'm reading the ManyDepth paper and still don't understand how the networks work "after" the cost volume part.
For example, the cost volume is constructed with predefined depths (linearly spaced between d_min and d_max).
This can also be interpreted as a "likelihood", as mentioned in the paper.
It is fed (together with the image features) into the additional encoder-decoder that maps to the depth of the target frame.
I think the encoder-decoder needs to be trained as well, but it's not obvious to me how one can train it to predict the depth of the target frame. If we needed ground-truth depths to train it, the scheme would not be self-supervised, so I may be missing something important.

Thank you in advance!
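
My current guess, in case it helps clarify the question: the training signal would be a self-supervised photometric reprojection loss in the style of Monodepth2, where the predicted depth and pose warp a source image into the target view and the result is compared against the real target image. A rough sketch of what I assume that term looks like (my own simplification, not ManyDepth's code):

```python
import torch
import torch.nn.functional as F

def photometric_loss(warped_source, target, alpha=0.85):
    """Self-supervised photometric error (SSIM + L1), Monodepth2-style.

    warped_source: a source frame warped into the target view using the
                   *predicted* depth and relative pose (no ground truth used).
    target:        the real target frame.
    Both are (B, 3, H, W) tensors with values in [0, 1].
    """
    l1 = (warped_source - target).abs().mean(1, keepdim=True)

    # Simple 3x3 windowed SSIM
    mu_x = F.avg_pool2d(warped_source, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sigma_x = F.avg_pool2d(warped_source ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(warped_source * target, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_loss = torch.clamp((1 - ssim) / 2, 0, 1).mean(1, keepdim=True)

    # Gradients flow back through warped_source into the depth (and pose) networks,
    # so minimising this error trains them without any ground-truth depth.
    return alpha * ssim_loss + (1 - alpha) * l1
```

If that is roughly right, then no ground-truth depth is needed at all, but I would appreciate confirmation.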


