
mdfirman commented on May 23, 2024

Interesting ideas! Thanks for sharing these.

I'm surprised that the disp_multi output is so blurry. I would guess there is a bug somewhere in how your multiple views are being preprocessed or used. It might be worth carefully inspecting the tensorboard images (e.g. the cost volume minimums) to verify that sensible things are being done with the intrinsics, extrinsics etc.
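For example, a quick sanity check along these lines (a rough sketch – the tensor name and shape here are made up, not ManyDepth's actual variables):

import torch

# Hypothetical cost volume of shape [batch, depth_bins, height, width].
cost_volume = torch.rand(1, 96, 192, 512)

# The per-pixel argmin over the depth-bin dimension should already resemble a
# coarse inverse-depth map when intrinsics/extrinsics are handled correctly;
# if it looks like noise, the geometry being fed in is probably wrong.
lowest_cost_bin = cost_volume.argmin(dim=1)  # [batch, height, width]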

Using multiple views

To start with, I would say that the idea of using the backwards camera (in reverse) seems very sensible – a good idea! I would avoid using the side cameras though, at least initially; as you point out, the pose network is going to have a hard job there, and those cameras see quite different things to what the front and back cameras see.

Cropping with intrinsics

I agree that you do have to change the intrinsics when you crop – but I'm not sure I quite follow the logic you're using here:

# The center points are the original center points + (0.5 * the number of cropped pixels on the bottom) - (0.5 * the number of pixels cropped on the top)

I'm also not sure of all the conventions used by Berkeley.

Overall – when you crop an image like so:

cropped_image = uncropped_image[crop_top:(crop_top + crop_height), crop_left:(crop_left + crop_width)]

my understanding is that the principal point changes as follows:

cropped_cx = uncropped_cx - crop_left
cropped_cy = uncropped_cy - crop_top

and the focal lengths don't change at all. Perhaps you could check that this is happening in your code?
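In other words, something like this (a minimal sketch; crop_intrinsics is just an illustrative name, not something from the repo):

import numpy as np

def crop_intrinsics(K: np.ndarray, crop_left: int, crop_top: int) -> np.ndarray:
    # Adjust a 3x3 intrinsics matrix for a crop that removes `crop_left`
    # pixels from the left and `crop_top` pixels from the top. Only the
    # principal point moves; the focal lengths are unchanged.
    K_cropped = K.copy()
    K_cropped[0, 2] -= crop_left  # cx
    K_cropped[1, 2] -= crop_top   # cy
    return K_cropped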

Different image sizes

In theory – yes! But in practice this introduces a lot of potential for hard-to-find bugs, especially in the intrinsics and extrinsics. So make sure a simple version (e.g. one that only uses sequences with a single image size) works well first!


didriksg commented on May 23, 2024


Hi!
I have been able to test out some of the ideas I mentioned, and this has produced some interesting results.

Using multiple views
I have tested a bit using the backward-facing camera simultaneously with the front-facing camera. I reversed the temporal order of the backward-facing camera's images and treated them as a separate "scene," consisting of between 100 and 120 frames. Here is a GIF displaying the images in a scene:

[GIF: consecutive frames from one scene]

Here, I have cropped the images so that the bottom part of the vehicle is not visible, which removes ~300 px from the bottom; I have also cropped the same amount from the top. The total number of samples in my training set is now ~35 000.
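Roughly, the reversal and cropping amount to this (a simplified sketch – the paths, names, and data layout are made up):

import pathlib

import numpy as np

CROP_TOP, CROP_BOTTOM = 300, 300  # pixels removed from top and bottom

# Hypothetical layout: one directory of frames per scene and camera.
scene_dir = pathlib.Path('data/scene_0001/cam_back')

# Reverse the frame order for the backward-facing camera so that forward
# driving looks like forward ego-motion from this camera's point of view.
frames = sorted(scene_dir.glob('*.png'), reverse=True)

def crop_frame(image: np.ndarray) -> np.ndarray:
    # Crop the ego-vehicle away: slice off the bottom rows and the same
    # amount from the top; image has shape [height, width, channels].
    return image[CROP_TOP:image.shape[0] - CROP_BOTTOM]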

When training with this dataset, I now get some interesting-looking disparity images:

Loss values:
[three loss-curve images]

Back-facing camera:
[images: color_0, color_pred_1, color_pred_-1, consistency_mask, disp_mono, disp_multi]

Front-facing camera:
[images: color_0, color_pred_1, color_pred_-1, consistency_mask, disp_mono, disp_multi]

These results look somewhat like the failure cases your paper describes for moving objects when using the baseline model without the consistency loss. However, from my understanding, the disparity maps generated from a single image looked OK in that case.

Cropping
I noticed that my earlier train of thought was a bit vague. I'm unsure whether the principal point should represent coordinates in the original image or in the cropped one. E.g., say I have a 1200x1000 image with a cx of 600 and a cy of 500, and I crop 300 pixels from the top and bottom, resulting in a 1200x400 image. Should cx and cy still represent coordinates in the 1200x1000 image, leaving the values unchanged, or should they represent coordinates in the new image, keeping cx at 600 but moving cy to 250?

I will try to change my intrinsics calculation to match your suggestion, and retrain the network:

    def load_intrinsics(self, folder, frame_index):
        # Assumes `import pathlib` and `import numpy as np` at module level,
        # and that self.crop_value holds (left, top, right, bottom) crop
        # sizes in pixels.
        path = pathlib.Path(self.data_path + folder).parent
        cam_name = folder.split('/')[-1]

        # np.load (rather than np.fromfile) parses the .npy header correctly.
        K = np.load(f'{path}/{cam_name}_k_matrix.npy').reshape(3, 3)

        fx = K[0, 0]
        fy = K[1, 1]
        # Shift the principal point by the number of pixels cropped from the
        # left and the top; the focal lengths are unchanged by cropping.
        cx = K[0, 2] - self.crop_value[0]
        cy = K[1, 2] - self.crop_value[1]

        cropped_width = self.full_res_shape[0] - self.crop_value[0] - self.crop_value[2]
        cropped_height = self.full_res_shape[1] - self.crop_value[1] - self.crop_value[3]

        intrinsics = np.array([[fx, 0, cx, 0],
                               [0, fy, cy, 0],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], dtype=np.float32)

        # Normalise by the *cropped* image size, so that the same scale is
        # applied to the focal lengths and the principal point.
        intrinsics[0] /= cropped_width
        intrinsics[1] /= cropped_height

        return intrinsics

Btw, these are the training parameters used:

  • --height 224
  • --width 608
  • --freeze_teacher_epoch 12
  • --batch_size 12


mdfirman commented on May 23, 2024

Nice! Yes, the disp_multi results you have here look much sharper than the ones you posted before. Did you change anything else?


didriksg commented on May 23, 2024

The results are from training with the larger dataset (with the added backward-facing camera images) and from changing the intrinsic matrix as described in my first comment. No changes other than those.


mdfirman commented on May 23, 2024

Super – thanks for reporting back on this. Very interesting results.

I hope that disp_multi now gives results on a par with (or ideally better than) disp_mono.


didriksg commented on May 23, 2024

Do you have any thoughts on the cars (moving objects) that are predicted as being far away? I have noticed this behavior both when using the front camera only and when using the backward/forward cameras. I have also noticed it on another dataset I am training on (DDAD from the Toyota Research Institute), using only the front-facing camera.


mdfirman commented on May 23, 2024

Yes – this 'hole punching' behaviour is pretty common when training on monocular videos with moving objects.

This is discussed in some detail in the monodepth2 paper ('Auto-Masking Stationary Pixels' section), and, to a lesser extent, the ManyDepth paper.

Automasking in monodepth2 helps a little with these, but doesn't solve them completely. You might want to look at some more recent works, e.g. [1], if they are causing you significant bother. (Or perhaps consider a more hacky solution, e.g. using semantics with some heuristics.)

[1] Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, and Anelia Angelova. Unsupervised monocular depth learning in dynamic scenes. In CoRL, 2020
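For reference, the auto-masking rule boils down to something like this (a rough PyTorch sketch of the idea from the monodepth2 paper, not code from either repo):

import torch

def automask(reprojection_losses: torch.Tensor,
             identity_reprojection_losses: torch.Tensor) -> torch.Tensor:
    # Both inputs are per-pixel photometric losses with shape
    # [batch, num_source_frames, height, width].
    # A pixel is kept only if warping from some source frame explains it
    # better than simply copying the unwarped source frame; pixels on objects
    # moving at the same speed as the camera get masked out.
    min_reproj, _ = reprojection_losses.min(dim=1)
    min_identity, _ = identity_reprojection_losses.min(dim=1)
    return (min_reproj < min_identity).float()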


didriksg commented on May 23, 2024

It's not really a big problem at the moment, but it is certainly something I'll look into improving if possible! Thank you so much for your help and input – I really appreciate you taking the time to write such detailed answers! :)

