
manydepth's People

Contributors

daniyar-niantic, jamiewatson683, mdfirman


manydepth's Issues

inconsistent and incorrect scale in predicted depth

Hi there, thanks for sharing your code. I am trying to train and predict on my own dataset.
The predicted depth JPEG looks fine to the eye. However, when I reconstruct the depth image into a point cloud in camera coordinates, the points are inconsistent within the same object and across the whole scene.

For example, the upper pole on the right is stretched, which is not obvious in the JPEG. Also, the scale of the car is not normal.

I think one reason is that the depth scale is not correct, either locally or globally.

Has your group noticed these issues? I would really appreciate any advice!
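For reference, this is a minimal back-projection sketch (assuming an undistorted pinhole model, a metric depth map, and pixel-unit intrinsics aligned with the RGB image; depth_to_pointcloud is an illustrative helper, not from the repo) that can help rule out viewer or intrinsics problems:

    import numpy as np

    # Minimal back-projection sketch (pinhole model, no distortion assumed).
    # `depth` is an HxW metric depth map, `K` the 3x3 pixel-unit intrinsics
    # of the image the depth map is aligned with.
    def depth_to_pointcloud(depth, K):
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - K[0, 2]) * depth / K[0, 0]
        y = (v - K[1, 2]) * depth / K[1, 1]
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

Note that the raw network output is a sigmoid disparity up to an unknown scale, so it has to be converted to depth and scaled before back-projecting.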

RGB image:
[image]

Depth image:
[image]

Point cloud (the poles are twisted):
[image]

Point cloud (the car is stretched):
[image]

Training with Lyft

Hi, I have some questions regarding training with a custom dataset.

(I noticed that my issue became a bit lengthy, so here's a TL;DR):

  1. Can I use images pointing in more than one direction to increase the number of samples in the dataset?
  2. Do I need to modify the intrinsic matrix when cropping the images?
  3. Can I use images from different cameras, with different dimensions, but pointing in the same direction?

More in-depth questions

I'm trying to use the data from the Lyft dataset. It contains images from multiple cameras, all pointing in different directions. I've mainly used the front-facing camera, but I'm not sure how good the result actually is. I've attached some samples of the original data and their corresponding disparity images:

Original image:
[image]

Disp_mono:
[image]

Disp_multi:
[image]

Training stats after 42k batches:
[three TensorBoard charts]

As you can see, the model has clearly learned the most important principles, but I still feel that these disparity images are not as good as those created by training with the Kitti dataset.

The total number of images from the front-facing camera is ~17,000. I guess the model would benefit from more data, but this leads me to my questions:

Do you think it would be possible to use data from cameras pointing in different directions simultaneously with the data from the front-facing camera? I'm a bit concerned about how this will affect the pose network, as the cameras move differently relative to each other. The Lyft vehicles are equipped with cameras in the following setup:

[diagram of the Lyft camera setup]

Another possibility that I might try is to use the backward-facing camera. Using this in reverse temporal order would simulate the car moving forward (although with some other views than the forward-facing ones).

I have also tried to crop the images a bit, as the original images contain the lower part of the vehicle. In doing so, I have also changed the cx and cy parameters in the intrinsic matrix (I used the Berkeley Automation perception library here: https://berkeleyautomation.github.io/perception/api/camera_intrinsics.html), but I'm not quite sure whether I should change the intrinsics at all. I've done it like this:

import pathlib
import numpy as np
# CameraIntrinsics comes from the Berkeley perception library linked above.

# This is defined in __init__()
self.crop_value = (4, 200, 4, 216)

# The intrinsic matrix is different for each vehicle, so each sequence contains the associated vehicle's intrinsic.
path = pathlib.Path(self.data_path + folder).parent
K = np.fromfile(f'{path}/CAM_FRONT_k_matrix.npy')
K = K.reshape(3, 3)

fx = K[0, 0] 
cx = K[0, 2]
fy = K[1, 1]
cy = K[1, 2]

# Initialize the camera intrinsic params.
cam_intrinsics = CameraIntrinsics(
            fx=fx,
            fy=fy,
            cx=cx,
            cy=cy,
            width=self.full_res_shape[0],
            height=self.full_res_shape[1]
        )

# Calculate the new dimensions and center points.
cropped_width = self.full_res_shape[0] - self.crop_value[2] - self.crop_value[0]
cropped_height = self.full_res_shape[1] - self.crop_value[3] - self.crop_value[1]

# The crop centre (in original-image coordinates) is the image centre shifted by
# half the difference between the amounts cropped on the two opposite sides.
crop_cj = (self.full_res_shape[0] - self.crop_value[2] + self.crop_value[0]) // 2
crop_ci = (self.full_res_shape[1] - self.crop_value[3] + self.crop_value[1]) // 2

# Generate the new cropped intrinsics.
cropped_intrinsics = cam_intrinsics.crop(
    height=cropped_height,
    width=cropped_width,
    crop_ci=crop_ci,
    crop_cj=crop_cj,
)

# Create the 4x4 version.
intrinsics = np.array([[cropped_intrinsics.fx, 0, cropped_intrinsics.cx, 0],
                       [0, cropped_intrinsics.fy, cropped_intrinsics.cy, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1]]).astype(np.float32)

# Normalise fx, fy by the original dimensions and cx, cy by the cropped dimensions.
intrinsics[0, 0] /= self.full_res_shape[0]
intrinsics[1, 1] /= self.full_res_shape[1]
intrinsics[0, 2] /= cropped_width
intrinsics[1, 2] /= cropped_height
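
For comparison, here is a minimal sketch of the usual adjustment for a pure crop (crop_intrinsics is an illustrative helper, not from the repo or the perception library): the focal lengths stay unchanged and the principal point simply shifts by the number of pixels removed on the left and top.

    import numpy as np

    # Illustrative sketch: adjust 3x3 pixel-unit intrinsics for a crop that
    # removes `crop_left` columns and `crop_top` rows from the image.
    def crop_intrinsics(K, crop_left, crop_top):
        K = K.copy()
        K[0, 2] -= crop_left   # cx shifts by the cropped columns
        K[1, 2] -= crop_top    # cy shifts by the cropped rows
        return K

If the cropped image is what the loader later resizes, would it be more consistent to normalise all four entries by the cropped dimensions (fx, cx by the cropped width; fy, cy by the cropped height), rather than mixing original and cropped sizes as I do above?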

I have also noticed that some of the sequences in the Lyft dataset contain images with different dimensions: some images are 1224x1024 and some are 1920x1080. As long as I normalize the intrinsic matrix with the corresponding image dimensions, do you think there would be any problem using these images simultaneously? One possibility might be to crop both so that they end up in the same format, if that is feasible (as per my other question).

How can I save predicted depth map?

It's great work! But I have one question.

After evaluating the model, how can I save the predicted depth maps?
I noticed there is an 'eval_split' option in the code and I think it can save the predicted depth maps.
If I set 'eval_split' to 'benchmark', an error occurs:
    Traceback (most recent call last):
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/hzc/manydepth/manydepth/evaluate_depth.py", line 371, in <module>
        evaluate(options.parse())
      File "/home/hzc/manydepth/manydepth/evaluate_depth.py", line 158, in evaluate
        for i, data in tqdm.tqdm(enumerate(dataloader)):
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tqdm/std.py", line 1178, in __iter__
        for obj in iterable:
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/hzc/manydepth/manydepth/datasets/mono_dataset.py", line 157, in __getitem__
        folder, frame_index + i, side, do_flip)
      File "/home/hzc/manydepth/manydepth/datasets/kitti_dataset.py", line 65, in get_color
        color = self.loader(self.get_image_path(folder, frame_index, side))
      File "/home/hzc/manydepth/manydepth/datasets/kitti_dataset.py", line 82, in get_image_path
        self.data_path, folder, "image_0{}/data".format(self.side_map[side]), f_str)
    KeyError: None

Detach the reprojection loss mask needed?

Hi,

Thanks a lot for this open repo and your interesting paper! I have a question about a detail in your code. In trainer.py, line 607:

reprojection_loss = reprojection_loss * reprojection_loss_mask 
reprojection_loss = reprojection_loss.sum() / (reprojection_loss_mask.sum() + 1e-7) 

I wonder, does this reprojection_loss_mask need to be detached?

Thanks!

Hi, I would like to make two small suggestions about the evaluate code

Thanks for your awesome work!

The first suggestion is to use transforms.InterpolationMode.LANCZOS instead of Image.ANTIALIAS to avoid the UserWarning. In the Image.py of PIL we can find LANCZOS = ANTIALIAS = 1, so the performance will not change after the replacement.

self.interp = Image.ANTIALIAS

UserWarning:
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/transforms.py:280: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.

The second suggestion is about tqdm.

for i, data in tqdm.tqdm(enumerate(dataloader)):

The correct way to use it is for i, data in enumerate(tqdm.tqdm(dataloader)):, and this will allow the progress bar to display correctly. 😄
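
For completeness, here is a quick sketch of the first replacement (illustrative only, not a patch from the repo); both arguments select the same Lanczos filter, so the output is unchanged, and Image.ANTIALIAS has since been removed in Pillow 10 anyway:

    from PIL import Image
    from torchvision import transforms

    # Both resizes use the Lanczos filter; only the first triggers the UserWarning
    # (and it breaks entirely on Pillow >= 10, where Image.ANTIALIAS was removed).
    resize_old = transforms.Resize((192, 640), interpolation=Image.ANTIALIAS)
    resize_new = transforms.Resize((192, 640), interpolation=transforms.InterpolationMode.LANCZOS)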

Input channels for input_features for PoseDecoder

Hi, I got a question regarding the input_features data for PoseDecoder network.

From the line below, the PoseDecoder accepts an input feature with number of channels equal to self.num_ch_enc[-1], which according to the ResnetMultiImageInput encoder, should be 512.
self.convs[("squeeze")] = nn.Conv2d(self.num_ch_enc[-1], 256, 1)

However, the output features of the ResnetEncoder have the following shapes, which suggests that only the last element of the feature list is accepted by the PoseDecoder?
torch.Size([1, 64, 320, 96])
torch.Size([1, 64, 160, 48])
torch.Size([1, 128, 80, 24])
torch.Size([1, 256, 40, 12])
torch.Size([1, 512, 20, 6])

Perhaps I am reading the code wrongly, so I would appreciate it if anyone could explain it to me. Thank you so much!
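
To check my reading, here is a standalone sketch (not the repo's code; shapes copied from above) showing that only the final 512-channel map feeds the squeeze conv. If the decoder follows monodepth2's PoseDecoder, its forward pass takes only f[-1] from each input's feature list.

    import torch
    import torch.nn as nn

    # Standalone sketch: only features[-1] (the 512-channel map) is consumed by
    # the pose decoder's "squeeze" conv; the earlier scales are unused there.
    num_ch_enc = [64, 64, 128, 256, 512]
    squeeze = nn.Conv2d(num_ch_enc[-1], 256, 1)

    features = [torch.randn(1, c, 320 // 2 ** i, 96 // 2 ** i)
                for i, c in enumerate(num_ch_enc)]
    out = squeeze(features[-1])
    print(out.shape)   # torch.Size([1, 256, 20, 6])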

Feature request: Tensorflow lite version

First, congratulations on the project, and thank you for sharing this research and the models.

I am interested in a TensorFlow Lite version of this model; I would appreciate it if you could share one in case you have it. I would like to test it on an Android device with very limited resources.

batch size 8 got abs rel 0.130

Hi, thanks for your great work and for sharing the code.
I have a V100 GPU but I cannot start training with batch size 12; the maximum batch size I can use is 8. I didn't change any other parameters, and I got poor performance:

[evaluation results and TensorBoard screenshots]

What should I do to get better performance?

About monodepth teacher

May I ask how you initialize the teacher monodepth2 network in your training? Do you load pretrained monodepth weights, or train it from scratch (not exactly scratch, but from the pretrained ResNet)?

I ask because I think my monodepth result is good, but when I disable the pose network and use my ground-truth poses, the mono prediction after training with manydepth is not nearly as good, and of course neither is the multi prediction.

Thank you.

Can't get custom dataset to perform as well as monodepth2

I'm running into issues getting manydepth to produce good results on a custom dataset. The same dataset works pretty well with monodepth2 (though dynamic objects are incorrect). I'm using the same camera intrinsics as for monodepth2, so I'm pretty sure they're correct. I've also tried freezing the teacher at 5 epochs, but it produces the same results.

I've got about 26k pairs of images at 20 fps at 640x416 resolution.

python -m manydepth.train --dataset <custom> --data_path ../my-dataset/ --batch_size 9 --log_frequency 5 --height 416 --num_workers 16 --png --freeze_teacher_epoch 5

Original:
[image]

Manydepth:
[image]

Monodepth2:
[image]

Depth map scale for KITTI data

What is the scaling factor needed to get metric depth maps from output disparity maps with the KITTI dataset?

I see that a lot of the code is from monodepth2, including the same disparity-to-depth transformation when predicting for KITTI images; that is, disp_to_depth with default values 0.1 and 100, followed by scaling by the KITTI stereo factor of 5.4. Using these default values, the transformation can be summarised by the following formula:

depth = 5.4 / (0.01+9.99*disparity)
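
As a reference for the numbers below, here is a sketch of that conversion (monodepth2-style disp_to_depth; treat the exact implementation details as an assumption):

    import numpy as np

    # Sketch of the disp_to_depth conversion described above
    # (min_depth=0.1, max_depth=100, KITTI stereo factor 5.4).
    def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
        min_disp = 1.0 / max_depth                      # 0.01
        max_disp = 1.0 / min_depth                      # 10.0
        scaled_disp = min_disp + (max_disp - min_disp) * disp
        return 1.0 / scaled_disp

    disp = np.array([0.027917, 0.187170, 0.651358])     # sample raw outputs
    metric_depth = 5.4 * disp_to_depth(disp)            # == 5.4 / (0.01 + 9.99 * disp)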

However, using this same transformation on the output of manydepth results in depth maps with completely different scales from those of the monodepth2 depth maps. For example, the output for test_sequence_target.jpg on the manydepth KITTI_HR model in multi mode has the following statistics:

output          max value   mean       median     min value
raw disparity   0.651358    0.247255   0.187170   0.027917
depth map       18.6921     3.23547    2.87261    0.828594

Compare this with the output of running the same image on the monodepth2 mono+stereo_1024x320 model:

output          max value   mean       median     min value
raw disparity   0.114764    0.037749   0.026548   0.006090
depth map       76.2298     20.6049    19.6213    4.66927

The same can be seen for any images in the KITTI dataset.

Clearly, because the scale of the raw output disparities is very different, a different scale needs to be applied when transforming to depth, but I can't find anywhere in the code what it should be. Is there a known value to scale the depth maps for KITTI images so that depth is in metric scale, or at least so that they better match the scale used by monodepth2 for KITTI images?

How to debug when training with "python -m manydepth.train"

Thanks for the impressive work! I have a question about the tools you use for debugging manydepth. Since you use "python -m manydepth.train" so that the package paths resolve, I can't find a way to debug with VS Code because there is no runnable .py file to start from. Can you give some insights on how to debug the code?
Thanks a lot!
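
In case it helps, a tiny hypothetical wrapper file gives the debugger a runnable script that mimics python -m (VS Code's launch.json also accepts a "module" entry such as "manydepth.train" instead of "program"):

    # debug_train.py -- hypothetical wrapper so an IDE debugger has a runnable
    # file; behaves like "python -m manydepth.train".
    import runpy

    if __name__ == "__main__":
        runpy.run_module("manydepth.train", run_name="__main__", alter_sys=True)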

Moving object from left to right

Thanks for the excellent work.

I see you use self-supervised training to deal with the cost-volume overfitting, so the network can predict well from multiple frames when there is an object moving in front of the camera in the same direction as the camera, like the cars ahead.

I also tested your model, predicting from multiple frames, when an object moves from left to right in front of the camera. I thought it would give a wrong result, but the result is just fine, which is what I don't understand. I think in this case the cost volume has two image regions (the areas covered by the moving object in the two frames) that cannot find a match at any depth. So why does it still predict well? Can I trust this result?

Just to demonstrate, I post these two images, but this is not what I use to train/predict:
[image]
[image]

parameters of the model

Hi,
I have a question about the number of parameters and the computational cost of the model (MR for KITTI): should the parameters and computation of the teacher network also be included?

Can you provide more ablations about mask?

Hi, thank you for your excellent work!
The mask you propose is very useful, but I still have some questions.
In the body of the paper, your loss is:

[loss equation image from the paper]

but have you tried removing the (1 - M) factor on L_p? Like this:

[equation image: the loss without (1 - M) on L_p]

or removing both M and (1 - M):

[equation image: the loss without M and (1 - M)]

And another question: in your ablation you show that ManyDepth (with motion masking, w/o teacher) performs much worse than ManyDepth (w/o motion masking). What are your thoughts on this? I would expect adding the motion masking to L_p to make the model better, because the model would no longer attempt to reproject moving objects, but the results seem to get much worse.

Stereo + Temporal

Hi:

Thank you for sharing this wonderful work!

In Monodepth2 you tried Monocular, Stereo, and M+S; have you tried stereo in this manydepth setting? Does it give much of a performance boost over monocular?

Thank you!

About test-time-refinement (TTR)

Hi @JamieWatson683, thank you for this very exciting project! May I ask a question: do you provide code for the test-time refinement (TTR) shown in the main table of the Results section? If so, how can I use it on my own sequence?

Depth estimation from underwater monocular video sequences

Hi @mdfirman @daniyar-niantic,
Thanks for your work!
I tested your model on an underwater dataset, but the results are not very good. After debugging, the loss drops normally and the pose network works normally, but the final result is very strange: the depth values are almost all between 0.01 and 0.15 m. Does the model simply not work for this type of dataset? Here are some images from my dataset; do you know what the problem is? Thanks!
[example underwater images and a predicted disparity map]

about confidence_mask

I don't think I understand what confidence_mask is and what is this function doing:

    def compute_confidence_mask(self, cost_volume, num_bins_threshold=None):
        """ Returns a 'confidence' mask based on how many times a depth bin was observed"""

        if num_bins_threshold is None:
            num_bins_threshold = self.num_depth_bins
        confidence_mask = ((cost_volume > 0).sum(1) == num_bins_threshold).float()

        return confidence_mask

Is this just the same as 1 - missing_mask?
Can you please explain it? Or is there an explanation in the paper? Thanks!

Contact person / email

Hi,
thanks for the interesting paper.
I found a small issue in the content of the paper and would like to discuss it with the main contact, but couldn't find any contact details in the paper.

Looking forward to hearing from you
Yevhen

"Normal" Training Loss and Strange Test Result

After fixing the "--png" bug, I also faced difficulties in reproducing good results.

Training Loss

[training loss curve]

with command

CUDA_VISIBLE_DEVICES=0 python3 -m manydepth.train --data_path /home/kitti_raw/ --log_dir workdirs/ --model_name manydepth --png

which looks quite normal (I don't know exactly what to expect, but it seems reasonable at least).

Test Results

   abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 |                                                                                                                              
&   0.454  &   4.961  &  12.336  &   0.607  &   0.288  &   0.541  &   0.754  \\ 

which is clearly wrong.

Tensorboard Validation events

[validation colour images]
[validation disparity images]

I can't detect a bug from these.

Local Code Modifications

I modified the colour augmentation part of datasets/mono_dataset.py for compatibility with the new torchvision (which does not seem to be the main problem).
[git diff of mono_dataset.py]

I also modified the export_gt script (the original script does not work for me because the splits live in the folder above the script).

[git diff of the export_gt script]

Strange training loss and test result

I ran

CUDA_VISIBLE_DEVICES=0 python3 -m manydepth.train --data_path /home/kitti_raw/ --log_dir workdirs/ --model_name manydepth

epoch 0 | batch 0 | examples/s: 2.8 | loss: 0.00810 | time elapsed: 00h00m09s | time left: 00h00m00s
epoch 0 | batch 250 | examples/s: 22.4 | loss: 0.00049 | time elapsed: 00h02m34s | time left: 11h19m40s
epoch 0 | batch 500 | examples/s: 20.6 | loss: 0.00024 | time elapsed: 00h05m00s | time left: 10h59m55s
epoch 0 | batch 750 | examples/s: 22.2 | loss: 0.00013 | time elapsed: 00h07m26s | time left: 10h50m49s
epoch 0 | batch 1000 | examples/s: 21.5 | loss: 0.00008 | time elapsed: 00h09m52s | time left: 10h44m55s
epoch 0 | batch 1250 | examples/s: 21.0 | loss: 0.00018 | time elapsed: 00h12m18s | time left: 10h41m15s
epoch 0 | batch 1500 | examples/s: 21.7 | loss: 0.00019 | time elapsed: 00h14m45s | time left: 10h37m37s
epoch 0 | batch 1750 | examples/s: 21.5 | loss: 0.00011 | time elapsed: 00h17m10s | time left: 10h33m45s

The loss is extremely small.

The result at epoch 12 (it should be reasonable by this point) is not:

abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 |
&   0.443  &   4.757  &  12.083  &   0.588  &   0.303  &   0.561  &   0.766  \\

get_depth never enabled

Hi, and thank you for your contribution!

I have previously trained monodepth2 on the Lyft dataset with success, and I'm now trying to train manydepth with the same dataloader (with some modifications, e.g. for the new load_intrinsics() function). When using GT depth generated from scans from an onboard lidar, I noticed that get_depth() is never called, even though check_depth() returns True. Looking at MonoDataset, I noticed on line 192 that this functionality seems to be disabled. Is this intentional?

From MonoDataset;

        if self.load_depth and False:
            depth_gt = self.get_depth(folder, frame_index, side, do_flip)
            inputs["depth_gt"] = np.expand_dims(depth_gt, 0)
            inputs["depth_gt"] = torch.from_numpy(inputs["depth_gt"].astype(np.float32))

I tried removing the additional False, but it seems the lidar data in the Lyft dataset does not have a value count divisible by 4, as per this ValueError:

  File "/cluster/work/didriksg/depth_detection/manydepth/manydepth/kitti_utils.py", line 70, in generate_depth_map
    velo = load_velodyne_points(velo_filename)
  File "/cluster/work/didriksg/depth_detection/manydepth/manydepth/kitti_utils.py", line 16, in load_velodyne_points
    points = np.fromfile(filename, dtype=np.float32).reshape(-1, 4)
ValueError: cannot reshape array of size 555895 into shape (4)

I suppose a solution here is to drop the last/first few values so that the total is divisible by 4?
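
For reference, here is a sketch of both workarounds (the function name is mine, not from kitti_utils). Note that 555895 is divisible by 5 but not by 4, so each Lyft point may actually carry five values (e.g. x, y, z, intensity, ring index), in which case reshaping to (-1, 5) and keeping the first four columns seems safer than dropping values:

    import numpy as np

    # Illustrative loader: handle scans whose flat length is a multiple of 5
    # (five values per point) and fall back to trimming stray trailing values.
    def load_lyft_velodyne_points(filename):
        points = np.fromfile(filename, dtype=np.float32)
        if points.size % 4 != 0 and points.size % 5 == 0:
            return points.reshape(-1, 5)[:, :4]
        return points[: (points.size // 4) * 4].reshape(-1, 4)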

I also have some questions regarding some suspicious-looking loss, but I will look a bit more into it and possibly post it in a separate issue.

how to choose the best freeze_teacher_epoch parameter?

Hi, thanks for sharing the great work!

But I'm confused: when training on a custom dataset, how should I choose the best freeze_teacher_epoch parameter? Does it depend on the amount of data?

Looking forward to your reply. Thank you!

Question About Input Image Size

Hi there, I am trying to train the model using KITTI raw images. I realize that the KITTI raw images have a resolution of 1242 x 375, while the default image setting for the model is 640 x 192. Do I have to resize all the KITTI raw images to 640 x 192 before using them for training? Thank you for your advice!

Supplementary materials

Hi nianticlabs:

Thank you for sharing this amazing research project with the community!
Just one question: where can we access the supplementary material for this paper?

thank you!

sincerely
Ziyue Feng

How to get the error map between predicted depth and the GT depth map

Hi,

I'm very interested in the error-map visualization in Fig. 4 of your paper. Do you use the projected lidar point cloud as the GT, or the improved KITTI ground truth, for the error computation? I also wonder whether you interpolate the GT depth map.

Can you provide the code to show the error map? Thank you a lot :D

how to do the ablation study without teacher

Hi, I want to reproduce your ablation study without the teacher network, i.e. "ManyDepth (with motion masking, w/o teacher) 0.154" in Table 4 of your paper.
I notice there are options for freezing the teacher network. How can I get the results of the model without the teacher?

Bug in intrinsics re-scaling?

I noticed you are updating intrinsics like so:

K[0, :] *= self.width // (2 ** scale)
K[1, :] *= self.height // (2 ** scale)

Aren't you supposed to multiply the intrinsics by the ratio new_shape / orig_shape? If you are resizing your image to self.width // (2 ** scale), then shouldn't it be K[0, :] *= (self.width // (2 ** scale)) / orig_width, where orig_width is the original width of the image before resizing?

What you are doing here seems to be just multiplying the intrinsics by the size of the new image. That can't be right. Am I misreading the code?
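
For reference, here is my reading of what would make this correct (a sketch; the normalised values are an assumption about the dataloader): if K is stored already divided by the original image dimensions, then multiplying by the target resolution at each scale is exactly new_size / orig_size applied to pixel-unit intrinsics.

    import numpy as np

    # Normalised intrinsics: fx, cx divided by the original width; fy, cy by the
    # original height (KITTI-style values, shown for illustration only).
    K_normalised = np.array([[0.58, 0.0, 0.5, 0.0],
                             [0.0, 1.92, 0.5, 0.0],
                             [0.0, 0.0, 1.0, 0.0],
                             [0.0, 0.0, 0.0, 1.0]], dtype=np.float32)

    width, height, scale = 640, 192, 0
    K = K_normalised.copy()
    K[0, :] *= width // (2 ** scale)    # back to pixel units at this scale
    K[1, :] *= height // (2 ** scale)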

AssertionError: size of input tensor and input format are different.

Traceback (most recent call last):
  File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hzc/manydepth/manydepth/train.py", line 16, in <module>
    trainer.train()
  File "/home/hzc/manydepth/manydepth/trainer.py", line 211, in train
    self.run_epoch()
  File "/home/hzc/manydepth/manydepth/trainer.py", line 242, in run_epoch
    self.log("train", inputs, outputs, losses)
  File "/home/hzc/manydepth/manydepth/trainer.py", line 742, in log
    consistency_target, self.step)
  File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tensorboardX/writer.py", line 608, in add_image
    image(tag, img_tensor, dataformats=dataformats), global_step, walltime)
  File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tensorboardX/summary.py", line 283, in image
    tensor = convert_to_HWC(tensor, dataformats)
  File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tensorboardX/utils.py", line 103, in convert_to_HWC
    tensor shape: {}, input_format: {}".format(tensor.shape, input_format)
AssertionError: size of input tensor and input format are different.         tensor shape: (1, 3, 192, 640), input_format: CHW

I didn't edit any code, but this error occurs when I train. How can I solve it?
Thanks in advance.

A tiny bug during training

Hi, thanks for your great work!

I found a tiny bug that affects the learning-rate decay: in the train() function in train.py, when the epoch reaches freeze_teacher_epoch, the optimizer and lr_scheduler are reset, which makes the epoch count restart from 0 from the lr_scheduler's point of view.

I have verified that the lr never decays in normal training, because step_size=15 and at epoch == 15 the lr_scheduler is reset.

I fixed it and trained a new model under the same conditions, getting the following results:

          abs_rel   sq_rel   rmse    rmse_log   a1      a2      a3
KITTI_MR  0.098     0.770    4.459   0.176      0.900   0.965   0.983
NEW       0.100     0.755    4.423   0.178      0.899   0.964   0.983

It seems better in sq_rel and rmse.
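
For reference, a minimal sketch of the fix (names and values are illustrative, not a verbatim patch): when the optimizer is rebuilt at freeze_teacher_epoch, tell the new StepLR scheduler which epoch training has already reached, so the decay at step_size=15 still happens.

    import torch

    # Illustrative: rebuild optimizer + scheduler at the freeze epoch while
    # preserving the scheduler's notion of the current epoch.
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    current_epoch = 15                          # e.g. the epoch at which the reset happens

    for group in optimizer.param_groups:        # StepLR needs this when last_epoch != -1
        group.setdefault("initial_lr", group["lr"])

    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=15, gamma=0.1, last_epoch=current_epoch - 1)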

Custom data sets (high speed scenes) do not work well

First of all, thank you for your contribution to depth estimation!
When reproducing your code, I used my own custom dataset to train the monocular model; it consists of 6k consecutive frames.
I have also changed the intrinsics matrix K in the data loader, which should be correct after verification.
However, the training results are still unsatisfactory: I cannot even generate a correct depth map of the road, nor get correct depths for the vehicles on it.
So I would like to ask what could be causing this. My personal guess, apart from the relatively small dataset, is that in high-speed scenes where the environment is relatively simple, monocular training may not produce large losses and therefore the network cannot be trained sufficiently.
I would be grateful for a solution!
Thanks!

Multi-GPU

Hi:

The training time seems similar to Monodepth2, which is pretty efficient!
I'm wondering if it's possible to utilize multiple GPUs?

Thank you

About depth bins

Hi,
Thanks for the interesting paper. It is really impressive and inspiring.
I want to ask some questions about the binning strategy.
In options.py there are two options, inverse and linear, and linear is the default chosen for your model.
As far as I know, many deep-learning MVS depth papers construct the cost volume with planes sampled in inverse depth space. In your case, does linear perform better than inverse sampling? Could you also explain any insight behind this choice?
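
For concreteness, here is a generic sketch of the two strategies I mean (not the repo's exact compute_depth_bins): linear spaces the planes uniformly in depth, while inverse spaces them uniformly in inverse depth (disparity), putting more planes near the camera.

    import numpy as np

    # Generic sketch of the two binning strategies (illustrative only).
    def make_depth_bins(min_depth, max_depth, num_bins, mode="linear"):
        if mode == "linear":
            return np.linspace(min_depth, max_depth, num_bins)
        # "inverse": uniform in 1/depth, then mapped back to depth
        return 1.0 / np.linspace(1.0 / min_depth, 1.0 / max_depth, num_bins)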

how to get the RGBD map like the video shows

The demo video shows an RGBD map.
[frame from the demo video showing the reconstructed RGBD map]

I'm curious about how to get this RGBD map.
A possible method is: depth image + intrinsics -> point cloud -> aggregate all point clouds with poses -> voxelization -> RGBD map (sketched below).
Does anybody know how to generate this RGBD map?
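
A sketch of that pipeline using Open3D (assuming Open3D is acceptable; the paths, intrinsics, and frames list are placeholders, not values from the demo):

    import open3d as o3d

    # Sketch: back-project each frame to a coloured point cloud, move it into a
    # common world frame with its pose, accumulate, then voxelise.
    def frame_to_cloud(rgb_path, depth_path, intrinsic, cam_to_world):
        color = o3d.io.read_image(rgb_path)
        depth = o3d.io.read_image(depth_path)        # assumed 16-bit depth in mm
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, convert_rgb_to_intensity=False)
        cloud = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
        cloud.transform(cam_to_world)                # 4x4 camera-to-world pose
        return cloud

    intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 192, 320.0, 320.0, 320.0, 96.0)  # placeholder values
    frames = []   # fill with (rgb_path, depth_path, 4x4 pose) tuples
    world_map = o3d.geometry.PointCloud()
    for rgb_path, depth_path, pose in frames:
        world_map += frame_to_cloud(rgb_path, depth_path, intrinsic, pose)
    world_map = world_map.voxel_down_sample(voxel_size=0.05)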

Train on own dataset with not good result

Hi, thanks for your interesting paper and innovative ideas on depth estimation. I am trying to train your model on our own campus dataset to see if it works well in real time. As a newcomer to deep learning, I followed your experiment setup and code instructions but still get frustrating results. Could you give me some advice on training to get a better result?

My frame order is [0, -1, 1], so I changed the code to match the input.
[screenshot of the modified code]

My result:
[screenshots of input frames and predicted disparity maps]

My settings:
{
"data_path": "/media/xzy/daa84e38-7f66-4aa4-a0ce-4fe978abe706/xzy/Downloads/manydepth/dump_root",
"log_dir": "/media/xzy/daa84e38-7f66-4aa4-a0ce-4fe978abe706/xzy/Downloads/manydepth/log",
"model_name": "Vecan_model",
"split": "vecan",
"num_layers": 18,
"depth_binning": "linear",
"num_depth_bins": 96,
"dataset": "cityscapes_preprocessed",
"png": true,
"height": 192,
"width": 640,
"disparity_smoothness": 0.001,
"scales": [
0,
1,
2,
3
],
"min_depth": 0.1,
"max_depth": 80.0,
"frame_ids": [
0,
-1,
1
],
"batch_size": 8,
"learning_rate": 0.0001,
"num_epochs": 20,
"scheduler_step_size": 15,
"freeze_teacher_and_pose": false,
"freeze_teacher_epoch": 5,
"v1_multiscale": false,
"avg_reprojection": false,
"disable_automasking": false,
"no_ssim": false,
"weights_init": "pretrained",
"use_future_frame": false,
"num_matching_frames": 1,
"disable_motion_masking": false,
"no_matching_augmentation": false,
"no_cuda": false,
"num_workers": 8,
"load_weights_folder": "/media/xzy/daa84e38-7f66-4aa4-a0ce-4fe978abe706/xzy/Downloads/manydepth/manydepth/checkpoint/KITTI_MR",
"mono_weights_folder": null,
"models_to_load": [
"encoder",
"depth",
"pose_encoder",
"pose"
],
"log_frequency": 250,
"save_frequency": 1,
"eval_stereo": false,
"eval_mono": false,
"disable_median_scaling": false,
"pred_depth_scale_factor": 1,
"ext_disp_to_eval": null,
"eval_split": "eigen",
"save_pred_disps": false,
"no_eval": false,
"eval_eigen_to_benchmark": false,
"eval_out_dir": null,
"post_process": false,
"zero_cost_volume": false,
"static_camera": false
}

Training on NYU-V2 Dataset

Hi, is it possible to train the model on the NYU v2 dataset and evaluate it with the existing code? Do we need any preprocessing for that?

depth ground truth error

Thank you for your excellent work; I have learned many things from your source code.

I saw the error that occurs when the ground-truth depth is loaded (issue 9).

Can I just remove "and False" on line 192 of mono_dataset.py?

Or do I need to make further modifications to other parts of the source code?

Wrong depth scale when using ground-truth camera poses.

Hi, I have a question about using ground-truth camera poses instead of predicted camera poses. I tried using camera poses with the correct scale from the KITTI dataset, but I find the output scale is still not correct. Is there anything I missed? I only changed the code as follows:

output, lowest_cost, costvol = encoder(input_color, lookup_frames,
                                                       relative_poses, # change to relative_poses_gt
                                                       K,
                                                       invK,
                                                       min_depth_bin, max_depth_bin)

Thanks a lot!

pose_enc.load_state_dict error

I tested the KITTI_HR model and got this error:

    -> Loading weights from /home/wangshuo/PycharmProjects/test_list/depth/manydepth/models/KITTI_HR
    Traceback (most recent call last):
      File "/home/wangshuo/anaconda3/envs/many/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/wangshuo/anaconda3/envs/many/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/wangshuo/PycharmProjects/test_list/depth/manydepth/manydepth/evaluate_depth.py", line 399, in <module>
        evaluate(options.parse())
      File "/home/wangshuo/PycharmProjects/test_list/depth/manydepth/manydepth/evaluate_depth.py", line 146, in evaluate
        pose_enc.load_state_dict(pose_enc_dict, strict=True)
      File "/home/wangshuo/anaconda3/envs/many/lib/python3.6/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for ResnetEncoder:
        Unexpected key(s) in state_dict: "encoder.bn1.num_batches_tracked", "encoder.layer1.0.bn1.num_batches_tracked", "encoder.layer1.0.bn2.num_batches_tracked", "encoder.layer1.1.bn1.num_batches_tracked", "encoder.layer1.1.bn2.num_batches_tracked", "encoder.layer2.0.bn1.num_batches_tracked", "encoder.layer2.0.bn2.num_batches_tracked", "encoder.layer2.0.downsample.1.num_batches_tracked", "encoder.layer2.1.bn1.num_batches_tracked", "encoder.layer2.1.bn2.num_batches_tracked", "encoder.layer3.0.bn1.num_batches_tracked", "encoder.layer3.0.bn2.num_batches_tracked", "encoder.layer3.0.downsample.1.num_batches_tracked", "encoder.layer3.1.bn1.num_batches_tracked", "encoder.layer3.1.bn2.num_batches_tracked", "encoder.layer4.0.bn1.num_batches_tracked", "encoder.layer4.0.bn2.num_batches_tracked", "encoder.layer4.0.downsample.1.num_batches_tracked", "encoder.layer4.1.bn1.num_batches_tracked", "encoder.layer4.1.bn2.num_batches_tracked".

Why disable gradients of on lookup images?

In resnet_encoder.py, lines 275-291:

# feature extraction on lookup images - disable gradients to save memory
with torch.no_grad():
    if self.adaptive_bins:
        self.compute_depth_bins(min_depth_bin, max_depth_bin)
    ......

I don't understand why gradients are disabled for the lookup images. If this isn't done, will the result be affected?

Depth evaluation for single frame mode?

Thanks for the wonderful work!

I have a question for the depth evaluation:
When I evaluate the depth performance on a single image that has no previous frame, I set
"--zero_cost_volume" and "--num_matching_frames 0"
in the evaluation options.
However, "evaluate_depth" encounters a failure: because "frames_to_load[1:]" is empty, "lookup_frames" receives an empty tensor list.

What should I set or change for single-frame evaluation, where the test frame has no previous or future frames at all?

About update_adaptive_depth_bins in trainer.py

Thanks for sharing your amazing work and code.

I have a question about the update_adaptive_depth_bins() function (around line 364). The paper mentions that the depth range is dynamically updated using the min and max of the MVS depth (i.e. the student network), but when checking the code, mono_depth is used instead. Do I misunderstand this? Or is the MVS depth learned to mimic the mono depth? Thanks for clarifying.

What should Cityscapes look like?

I followed this repo to preprocess the Cityscapes dataset, but I get 'FileNotFoundError: [Errno 2] No such file or directory: '/home/hzc/cityscape/ulm/ulm_000056_000015.jpg'' when I train the model.

So, what should this dataset look like?

Freeze Teacher network from beginning

Hi,

thanks a lot for sharing this.

I have a fully pre-trained teacher network and tried to freeze its weights directly from the beginning, as I guess it does not make sense to train it further.

However, if I set --freeze_teacher_and_pose as a run option, then self.min_depth_tracker and self.max_depth_tracker are never set in trainer.py, because the following lines are never reached:

self.min_depth_tracker = 0.1
self.max_depth_tracker = 10.0

Thus, I get an error on the following lines:

min_depth_bin = self.min_depth_tracker
max_depth_bin = self.max_depth_tracker

How do you suggest initializing self.min_depth_tracker and self.max_depth_tracker when the teacher weights are frozen from the very beginning? I suppose it makes sense to initialize them to reflect the range of depth values that my pretrained model produces?
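
One possible patch I'm considering (a sketch; the exact values are assumptions, not from the repo) is to seed the trackers whenever the teacher and pose networks are frozen from epoch 0, and later tighten them from a few teacher predictions:

    # Sketch of a possible patch in trainer.py __init__ (values are assumptions):
    # seed the trackers even when the teacher/pose networks are frozen from the
    # start; they could later be refined from a few teacher predictions.
    if self.opt.freeze_teacher_and_pose:
        self.min_depth_tracker = self.opt.min_depth     # e.g. 0.1
        self.max_depth_tracker = self.opt.max_depth     # or ~10.0, matching the unfrozen path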

Thanks in advance and best regards,
Patrick

Pre-trained models for monocular depth networks

Thank you for open-sourcing this very interesting work. Would it be possible to also provide weights for the monocular depth networks (the teacher networks) that go along with the currently available pre-trained models? Thank you!

Can't reproduce the results on cityscapes

Hi, it's great work!

I followed the instructions to train and evaluate on Cityscapes, but got the following result, which is slightly different from the paper:
[evaluation results screenshot]

So, how can I achieve the reported SOTA results?
