
mdfirman commented on May 23, 2024

Interesting ideas! Thanks for sharing these.

I'm surprised that the disp_multi output is so blurry. I would guess there is a bug somewhere in how your multiple views are being preprocessed or used. It might be worth carefully inspecting the tensorboard images (e.g. the cost volume minimums) to verify that sensible things are being done with the intrinsics, extrinsics etc.
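For example, a quick sanity check along these lines (a rough sketch – the tensor name and shape here are made up, not ManyDepth's actual variables):

import torch

# Hypothetical cost volume of shape [batch, depth_bins, height, width].
cost_volume = torch.rand(1, 96, 192, 512)

# The per-pixel argmin over the depth-bin dimension should already resemble a
# coarse inverse-depth map when intrinsics/extrinsics are handled correctly;
# if it looks like noise, the geometry being fed in is probably wrong.
lowest_cost_bin = cost_volume.argmin(dim=1)  # [batch, height, width]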

Using multiple views

To start with, I would say that the idea of using the backwards camera (in reverse) seems very sensible – a good idea! I would avoid using the side cameras though, at least initially; as you point out, the pose network is going to have a hard job there, and those cameras see quite different things to what the front and back cameras see.

Cropping with intrinsics

I agree that you do have to change the intrinsics when you crop – but I'm not sure I quite follow the logic you're using here:

# The center points are the original center points + (0.5 * the number of cropped pixels on the bottom) - (0.5 * the number of pixels cropped on the top)

I'm also not sure of all the conventions used by Berkeley.

Overall – when you crop an image like so:

cropped_image = uncropped_image[crop_top:(crop_top + crop_height), crop_left:(crop_left + crop_width)]

my understanding is that the principal point changes as follows:

cropped_cx = uncropped_cx - crop_left
cropped_cy = uncropped_cy - crop_top

and the focal lengths don't change at all. Perhaps you could check that this is happening in your code?
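In other words, something like this (a minimal sketch; crop_intrinsics is just an illustrative name, not something from the repo):

import numpy as np

def crop_intrinsics(K: np.ndarray, crop_left: int, crop_top: int) -> np.ndarray:
    # Adjust a 3x3 intrinsics matrix for a crop that removes `crop_left`
    # pixels from the left and `crop_top` pixels from the top. Only the
    # principal point moves; the focal lengths are unchanged.
    K_cropped = K.copy()
    K_cropped[0, 2] -= crop_left  # cx
    K_cropped[1, 2] -= crop_top   # cy
    return K_cropped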

Different image sizes

In theory – yes! But in practice this introduces a lot of potential for hard-to-find bugs, especially in the intrinsics and extrinsics. So make sure a simple version (e.g. one that only uses sequences with a single image size) works well first!


didriksg commented on May 23, 2024


Hi!
I have been able to test out some of the ideas I mentioned, and this has produced some interesting results.

Using multiple views
I have tested a bit using the backward-facing camera simultaneously with the front-facing camera. I reversed the temporal order of the backward-facing camera's images and treated them as a separate "scene," consisting of between 100 and 120 frames. Here is a GIF displaying the images in a scene:

[GIF: consecutive frames from one scene]

Here, I have cropped the images so that the bottom part of the vehicle is not visible, which removes ~300 px from the bottom; I have also cropped the same amount from the top. The total number of samples in my training set is now ~35 000.
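Roughly, the reversal and cropping amount to this (a simplified sketch – the paths, names, and data layout are made up):

import pathlib

import numpy as np

CROP_TOP, CROP_BOTTOM = 300, 300  # pixels removed from top and bottom

# Hypothetical layout: one directory of frames per scene and camera.
scene_dir = pathlib.Path('data/scene_0001/cam_back')

# Reverse the frame order for the backward-facing camera so that forward
# driving looks like forward ego-motion from this camera's point of view.
frames = sorted(scene_dir.glob('*.png'), reverse=True)

def crop_frame(image: np.ndarray) -> np.ndarray:
    # Crop the ego-vehicle away: slice off the bottom rows and the same
    # amount from the top; image has shape [height, width, channels].
    return image[CROP_TOP:image.shape[0] - CROP_BOTTOM]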

When training with this dataset, I now get some interesting-looking disparity images:

Loss values:
[three loss-curve images]

Back-facing camera:
[images: color_0, color_pred_1, color_pred_-1, consistency_mask, disp_mono, disp_multi]

Front-facing camera:
[images: color_0, color_pred_1, color_pred_-1, consistency_mask, disp_mono, disp_multi]

These results look somewhat like the failure cases your paper describes for moving objects when using the baseline model without the consistency loss. However, from my understanding, the disparity maps generated from a single image looked OK in that case.

Cropping
I noticed that my earlier train of thought was a bit vague. I'm unsure whether the principal point should represent coordinates in the original image or in the cropped one. E.g., say I have a 1200x1000 image with a cx of 600 and a cy of 500, and I crop 300 pixels from the top and bottom, resulting in a 1200x400 image. Should cx and cy still represent coordinates in the 1200x1000 image, leaving the values unchanged, or should they represent coordinates in the new image, keeping cx at 600 but moving cy to 250?

I will try to change my intrinsics calculation to match your suggestion, and retrain the network:

    def load_intrinsics(self, folder, frame_index):
        # Assumes `import pathlib` and `import numpy as np` at module level,
        # and that self.crop_value holds (left, top, right, bottom) crop
        # sizes in pixels.
        path = pathlib.Path(self.data_path + folder).parent
        cam_name = folder.split('/')[-1]

        # np.load (rather than np.fromfile) parses the .npy header correctly.
        K = np.load(f'{path}/{cam_name}_k_matrix.npy').reshape(3, 3)

        fx = K[0, 0]
        fy = K[1, 1]
        # Shift the principal point by the number of pixels cropped from the
        # left and the top; the focal lengths are unchanged by cropping.
        cx = K[0, 2] - self.crop_value[0]
        cy = K[1, 2] - self.crop_value[1]

        cropped_width = self.full_res_shape[0] - self.crop_value[0] - self.crop_value[2]
        cropped_height = self.full_res_shape[1] - self.crop_value[1] - self.crop_value[3]

        intrinsics = np.array([[fx, 0, cx, 0],
                               [0, fy, cy, 0],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], dtype=np.float32)

        # Normalise by the *cropped* image size, so that the same scale is
        # applied to the focal lengths and the principal point.
        intrinsics[0] /= cropped_width
        intrinsics[1] /= cropped_height

        return intrinsics

Btw, these are the training parameters used:

  • --height 224
  • --width 608
  • --freeze_teacher_epoch 12
  • --batch_size 12


mdfirman commented on May 23, 2024

Nice! Yes, the disp_multi results you have here look much sharper than the ones you posted before. Did you change anything else?


didriksg commented on May 23, 2024

The results are from training with the larger dataset (with the added backward-facing camera images) and from changing the intrinsic matrix as described in my first comment. No changes other than those.


mdfirman commented on May 23, 2024

Super – thanks for reporting back on this. Very interesting results.

I hope that disp_multi now gives results on a par with (or ideally better than) disp_mono.


didriksg commented on May 23, 2024

Do you have any thoughts on the cars (moving objects) that are predicted as being far away? I have noticed this behavior both when using the front camera only and when using the backward/forward cameras. I have also noticed it on another dataset I am training on (DDAD from the Toyota Research Institute), using only the front-facing camera.


mdfirman commented on May 23, 2024

Yes – this 'hole punching' behaviour is pretty common when training on monocular videos with moving objects.

This is discussed in some detail in the monodepth2 paper ('Auto-Masking Stationary Pixels' section), and, to a lesser extent, the ManyDepth paper.

Automasking in monodepth2 helps a little with these, but doesn't solve them completely. You might want to look at some more recent works, e.g. [1], if they are causing you significant bother. (Or perhaps consider a more hacky solution, e.g. using semantics with some heuristics.)

[1] Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, and Anelia Angelova. Unsupervised monocular depth learning in dynamic scenes. In CoRL, 2020
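For reference, the auto-masking rule boils down to something like this (a rough PyTorch sketch of the idea from the monodepth2 paper, not code from either repo):

import torch

def automask(reprojection_losses: torch.Tensor,
             identity_reprojection_losses: torch.Tensor) -> torch.Tensor:
    # Both inputs are per-pixel photometric losses with shape
    # [batch, num_source_frames, height, width].
    # A pixel is kept only if warping from some source frame explains it
    # better than simply copying the unwarped source frame; pixels on objects
    # moving at the same speed as the camera get masked out.
    min_reproj, _ = reprojection_losses.min(dim=1)
    min_identity, _ = identity_reprojection_losses.min(dim=1)
    return (min_reproj < min_identity).float()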


didriksg commented on May 23, 2024

It's not really a big problem at the moment, but it is certainly something I'll look into improving if possible! Thank you so much for your help and input – I really appreciate you taking the time to write such detailed answers! :)

