nianticlabs / manydepth
[CVPR 2021] Self-supervised depth estimation from short sequences
License: Other
Hi there, thanks for sharing your code. I am trying to train on and predict with my own dataset.
The predicted depth JPEG image looks fine to the eye. However, when I reconstruct a point cloud in camera coordinates from the depth image, the points are inconsistent within a single object and across the whole scene.
For example, the upper pole on the right is stretched, which is not obvious in the JPEG image. Also, the scale of the car looks wrong.
I think one of the reasons is that the depth scale is not correct, either locally or globally.
Has your group noticed these issues? I would really appreciate any advice!
RGB image
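For readers checking a reconstruction like the one described above, here is a minimal sketch of pinhole back-projection in plain Python. The intrinsics and the tiny depth map are hypothetical values chosen for illustration; nothing here is taken from the manydepth code. It assumes the input is actual depth (not a colour-mapped or normalised JPEG visualisation), which is one thing worth verifying when a point cloud looks stretched.

```python
def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) to camera-frame 3D points.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid pixels
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Hypothetical 2x3 depth map and intrinsics, for illustration only.
depth = [[2.0, 2.0, 2.0],
         [4.0, 0.0, 4.0]]
points = backproject(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
```

Because X and Y are both proportional to Z, a single global scale error rescales the whole cloud uniformly, whereas per-pixel depth errors distort shapes along the viewing rays.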
Hi, I have some questions regarding training with a custom dataset.
(I noticed that my issue became a bit lengthy, so here's a TL;DR):
I'm trying to use the data from the Lyft dataset. It contains images from multiple cameras, all pointing in different directions. I've mainly used the front-facing camera, but I'm not sure how good the result actually is. I've attached some samples of the original data and its corresponding disparity images:
Training stats after 42k batches:
As you can see, the model has clearly learned the most important principles, but I still feel that these disparity images are not as good as those created by training with the Kitti dataset.
The total number of images in the dataset from the front-facing camera is ~17 000. I guess that the model would benefit from more data, which leads me to my questions:
Do you think it would be possible to use data from cameras pointing in different directions together with the data from the front-facing camera? I'm a bit concerned about how this will affect the pose network, as the cameras move differently relative to each other. The Lyft vehicles are equipped with cameras in the following setup:
Another possibility that I might try is to use the backward-facing camera. Using this in reverse temporal order would simulate the car moving forward (although with some other views than the forward-facing ones).
I have also tried to crop the images a bit, as the original images contain the lower part of the vehicle. By doing so, I have also changed the cx and cy parameters in the intrinsic matrix. (I used Berkeley Automation's library here: https://berkeleyautomation.github.io/perception/api/camera_intrinsics.html), but I'm not quite sure if I should change the intrinsics at all. I've done it like this:
# This is defined in __init__()
self.crop_value = (4, 200, 4, 216)
# The intrinsic matrix is different for each vehicle, so each sequence contains the associated vehicle's intrinsic.
path = pathlib.Path(self.data_path + folder).parent
K = np.fromfile(f'{path}/CAM_FRONT_k_matrix.npy')
K = K.reshape(3, 3)
fx = K[0, 0]
cx = K[0, 2]
fy = K[1, 1]
cy = K[1, 2]
# Initialize the camera intrinsic params.
cam_intrinsics = CameraIntrinsics(
    fx=fx,
    fy=fy,
    cx=cx,
    cy=cy,
    width=self.full_res_shape[0],
    height=self.full_res_shape[1]
)
# Calculate the new dimensions and center points.
cropped_width = self.full_res_shape[0] - self.crop_value[2] - self.crop_value[0]
cropped_height = self.full_res_shape[1] - self.crop_value[3] - self.crop_value[1]
# The center points are the original center points + (0.5 * the number of cropped pixels on the bottom) - (0.5 * the number of pixels cropped on the top)
crop_cj = (self.full_res_shape[0] - self.crop_value[2] + self.crop_value[0]) // 2
crop_ci = (self.full_res_shape[1] - self.crop_value[3] + self.crop_value[1]) // 2
# Generate the new cropped intrinsics.
cropped_intrinsics = cam_intrinsics.crop(
    height=cropped_height,
    width=cropped_width,
    crop_ci=crop_ci,
    crop_cj=crop_cj,
)
# Create the 4x4 version.
intrinsics = np.array([[cropped_intrinsics.fx, 0, cropped_intrinsics.cx, 0],
                       [0, cropped_intrinsics.fy, cropped_intrinsics.cy, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1]]).astype(np.float32)
# Resize fx and fy by the original dimensions and cx, cy by the cropped dimensions.
intrinsics[0, 0] /= self.full_res_shape[0]
intrinsics[1, 1] /= self.full_res_shape[1]
intrinsics[0, 2] /= cropped_width
intrinsics[1, 2] /= cropped_height
I have also noticed that some of the sequences in the Lyft dataset contain images of different dimensions. Some of the images are 1224x1024, and some are 1920x1080. As long as I normalize the intrinsic matrix with the corresponding image dimensions, do you think there would be any problems with using these images simultaneously? One possibility is to crop both kinds of image to the same format, if this is possible (as per my other question).
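As a point of comparison for the cropping question above, the sketch below adjusts intrinsics for a crop using only the pinhole model, without any external library. It rests on two assumptions that are not confirmed by the post: that crop_value is ordered (left, top, right, bottom), and that the training code later multiplies the normalised matrix by the network input width and height, so every entry should be normalised by the cropped size. The focal length and principal point are hypothetical.

```python
def crop_and_normalise_intrinsics(fx, fy, cx, cy, full_size, crop):
    """Adjust pinhole intrinsics for a crop, then normalise by the cropped size.

    full_size: (width, height) in pixels.
    crop: (left, top, right, bottom) pixels removed from each side.
          This ordering is an assumption made for this sketch.
    """
    width, height = full_size
    left, top, right, bottom = crop
    new_w = width - left - right
    new_h = height - top - bottom
    # Cropping only shifts the principal point; focal lengths in pixels are unchanged.
    cx_c = cx - left
    cy_c = cy - top
    # Normalise all entries by the size of the image that will later be resized.
    return [[fx / new_w, 0.0, cx_c / new_w, 0.0],
            [0.0, fy / new_h, cy_c / new_h, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


K = crop_and_normalise_intrinsics(
    fx=880.0, fy=880.0, cx=612.0, cy=512.0,  # hypothetical values
    full_size=(1224, 1024),
    crop=(4, 200, 4, 216),
)
```

Under these assumptions, dividing fx and fy by the full-resolution size (as in the snippet above) while dividing cx and cy by the cropped size would mix two conventions, which may be worth double-checking.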
As far as I know, the cost volume in multi-view stereo uses a lot of memory, around 11 GB. How much memory does this method need?
Great work! But I have one question.
After evaluating the model, how can I save the predicted depth map?
I noticed that there is an option 'eval_split' in the code, and I think it can save the predicted depth map.
If I set 'eval_split' to 'benchmark', an error occurs:
Traceback (most recent call last):
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/hzc/manydepth/manydepth/evaluate_depth.py", line 371, in <module>
evaluate(options.parse())
File "/home/hzc/manydepth/manydepth/evaluate_depth.py", line 158, in evaluate
for i, data in tqdm.tqdm(enumerate(dataloader)):
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/hzc/manydepth/manydepth/datasets/mono_dataset.py", line 157, in __getitem__
folder, frame_index + i, side, do_flip)
File "/home/hzc/manydepth/manydepth/datasets/kitti_dataset.py", line 65, in get_color
color = self.loader(self.get_image_path(folder, frame_index, side))
File "/home/hzc/manydepth/manydepth/datasets/kitti_dataset.py", line 82, in get_image_path
self.data_path, folder, "image_0{}/data".format(self.side_map[side]), f_str)
KeyError: None
Hi,
Thanks a lot for this open repo and your interesting paper! I have a question about a detail in your code. In trainer.py, line 607:
reprojection_loss = reprojection_loss * reprojection_loss_mask
reprojection_loss = reprojection_loss.sum() / (reprojection_loss_mask.sum() + 1e-7)
I wonder, does this reprojection_loss_mask need to be detached?
Thanks!
Thanks for your awesome work!
The first suggestion is to use transforms.InterpolationMode.LANCZOS instead of Image.ANTIALIAS to avoid the UserWarning. In PIL's Image.py we can find LANCZOS = ANTIALIAS = 1, so the performance will not change after the replacement.
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/transforms.py:280: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
The second suggestion is about tqdm. Line 158 of manydepth/manydepth/evaluate_depth.py (at commit 44e2cb8) could be changed to
for i, data in enumerate(tqdm.tqdm(dataloader)):
and this will allow the progress bar to display correctly. 😄
Hi, @JamieWatson683 @daniyar-niantic
I evaluated the model on Cityscapes (512x192) and got the same numbers as in Table 3 of your paper (both the IEEE and arXiv versions).
Is the resolution in Table 3 a clerical error? (416x128 in the paper, 512x192 in the weights)
Is there a revised version of the paper available for reference?
Hi, I have a question regarding the input_features data for the PoseDecoder network.
From the line below, the PoseDecoder accepts an input feature with a number of channels equal to self.num_ch_enc[-1], which according to the ResnetMultiImageInput encoder should be 512.
self.convs[("squeeze")] = nn.Conv2d(self.num_ch_enc[-1], 256, 1)
However, the output features of the ResnetEncoder have the following shapes. Does this mean that only the last element of the features array is accepted by the PoseDecoder?
torch.Size([1, 64, 320, 96])
torch.Size([1, 64, 160, 48])
torch.Size([1, 128, 80, 24])
torch.Size([1, 256, 40, 12])
torch.Size([1, 512, 20, 6])
Perhaps I am reading the code wrongly, so I would appreciate it if anyone could explain it to me. Thank you so much!
First, congratulations on the project, and thank you for sharing this research and the models.
I am interested in a TensorFlow Lite version of this model and would appreciate it if you could share one, in case you have it. I would like to test it on an Android device with very limited resources.
May I ask how you initialize the teacher monodepth2 network in your training? Do you load pretrained monodepth weights, or train from scratch (not exactly from scratch, but from the pretrained ResNet)?
I ask because my monodepth result is good, but when I disable the pose network and use my ground-truth poses, the mono prediction after training with manydepth is not nearly as good, and neither is the multi-frame prediction.
Thank you.
I'm running into issues getting manydepth to produce good results with a custom dataset. The same dataset works pretty well with monodepth2 (though dynamic objects are incorrect). I'm using the same camera intrinsics as for monodepth2, so I'm pretty sure they're correct. I've tried freezing the teacher at 5 epochs as well, but it produces the same results.
I've got about 26k pairs of images at 20 fps at 640x416 resolution.
python -m manydepth.train --dataset <custom> --data_path ../my-dataset/ --batch_size 9 --log_frequency 5 --height 416 --num_workers 16 --png --freeze_teacher_epoch 5
What is the scaling factor needed to get metric depth maps from output disparity maps with the KITTI dataset?
I see that a lot of the code is from monodepth2, including the same disparity-to-depth transformation when predicting for KITTI images: disp_to_depth with default values 0.1 and 100, followed by scaling with the KITTI stereo factor of 5.4. Using these default values, the transformation can be summarised by the following formula:
depth = 5.4 / (0.01+9.99*disparity)
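The formula above can be checked numerically with a small plain-Python sketch. The function name disparity_to_depth is chosen for this illustration and is not the repository's disp_to_depth; the defaults 0.1, 100 and 5.4 are the values quoted in the issue.

```python
def disparity_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map a network output in [0, 1] to depth via a linear rescaling in disparity space."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp


STEREO_SCALE_FACTOR = 5.4  # KITTI stereo factor quoted in the issue


def kitti_depth(disp):
    """depth = 5.4 / (0.01 + 9.99 * disp) with the default arguments."""
    return STEREO_SCALE_FACTOR * disparity_to_depth(disp)


closest = kitti_depth(0.651358)   # largest raw disparity reported for manydepth
farthest = kitti_depth(0.027917)  # smallest raw disparity reported for manydepth
```

With the reported extreme disparities this reproduces the depth range in the statistics table, which suggests the formula itself was applied as described and the difference lies in the scale of the raw network output.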
However, using this same transformation on the output of manydepth results in depth maps with completely different scales to those of the monodepth2 depth maps. For example, the output for test_sequence_target.jpg on the manydepth KITTI_HR model using multi mode has the following statistics:
output | max value | mean | median | min value |
---|---|---|---|---|
raw disparity | 0.651358 | 0.247255 | 0.187170 | 0.027917 |
depth map | 18.6921 | 3.23547 | 2.87261 | 0.828594 |
Compare this with the output of running the same image on the monodepth2 mono+stereo_1024x320 model:
output | max value | mean | median | min value |
---|---|---|---|---|
raw disparity | 0.114764 | 0.037749 | 0.026548 | 0.006090 |
depth map | 76.2298 | 20.6049 | 19.6213 | 4.66927 |
The same can be seen for any images in the KITTI dataset.
Clearly, because the scale of the raw output disparities is very different, a different scale needs to be applied when transforming into depth, but I can't find anywhere in the code what this should be. Is there a known value to scale the depth maps for KITTI images so that depth is in a metric scale, or at least more closely matches the scale used by monodepth2 for KITTI images?
Thanks for the impressive work! I have a question about the tools you use for debugging manydepth. Since the code is started with "python -m manydepth.train", I can't find a way to debug it in VS Code, because there is no runnable .py file to launch. Can you give some advice on how to debug the code?
Thanks a lot!
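One possible workaround for the debugging question above, sketched with the standard-library runpy module: a small launcher file that a debugger can start directly and that then runs the package module the way python -m would. The helper name run_as_module and the file name debug_train.py are hypothetical; the flags in the usage comment are the ones that appear in commands elsewhere on this page.

```python
import runpy
import sys


def run_as_module(module_name, argv=()):
    """Run `module_name` roughly as `python -m module_name *argv` would."""
    saved_argv = sys.argv
    sys.argv = [module_name, *argv]
    try:
        # alter_sys=True lets runpy update sys.argv[0] and sys.modules["__main__"]
        # for the duration of the run, as happens with `python -m`.
        runpy.run_module(module_name, run_name="__main__", alter_sys=True)
    finally:
        sys.argv = saved_argv


# In a hypothetical debug_train.py at the repository root, one could call:
# run_as_module("manydepth.train", ["--data_path", "/path/to/data", "--log_dir", "logs"])
```

Breakpoints set inside the package should then be hit when this launcher is started from the repository root, so that the manydepth package is importable.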
Thanks for the excellent work.
I see you use self-supervised training to deal with cost volume overfitting, so the network predicts well with multiple frames even when an object moves in front of the camera in the same direction, such as a car ahead.
I also tested your model in multi-frame mode on a case where an object moves from left to right in front of the camera. I expected a wrong result, but the result is fine, which I don't understand. I think in this case the cost volume will contain two image regions (the moving object's regions in the two frames) that cannot find a match at any depth. So why does it still predict well? Can I trust this result?
Just to demonstrate, I am posting these two images, but they are not what I use to train or predict:
Hi,
I have a question about the number of parameters and the amount of computation of the model (MR for KITTI): should the parameters and computation of the teacher network also be included?
Hi, thank you for your excellent work!
The mask you proposed is very useful, but I still have some questions.
In the body of the paper, your loss is:
but have you tried removing (1-M) for Lp, like this:
And another question: in your ablation you show that the performance of Manydepth (with motion masking, w/o teacher) is much worse than the performance of Manydepth (w/o motion masking). What are your thoughts on this? I would expect that adding the motion masking to Lp makes the model better, because the model will not attempt to reproject moving objects, but the results seem to get much worse.
Hi:
Thank you for sharing this wonderful work!
In Monodepth2 you tried Monocular, Stereo, and M+S. Have you tried stereo in this manydepth setting? Does it give much of a performance boost over monocular?
Thank you!
Hi @JamieWatson683, thank you for this very exciting project! May I ask you a question: Do you provide code for the test-time-refinement (TTR) as shown in the main table of the Results section? If so, how to use that for my own sequence?
Hi @mdfirman @daniyar-niantic,
Thanks for your work!
I tested your model on an underwater dataset, but the results are not very good. After debugging, the loss decreases normally and the pose network works, but the final result is very strange: the depth values are almost all between 0.01 and 0.15 m. Does the model not work for this type of dataset? Here are some images from my dataset. Do you know what the problem is? Thanks!
I don't think I understand what confidence_mask is and what is this function doing:
def compute_confidence_mask(self, cost_volume, num_bins_threshold=None):
    """ Returns a 'confidence' mask based on how many times a depth bin was observed"""
    if num_bins_threshold is None:
        num_bins_threshold = self.num_depth_bins
    confidence_mask = ((cost_volume > 0).sum(1) == num_bins_threshold).float()
    return confidence_mask
Is this just the same as 1-missing_mask?
Can you please explain it? Or does this have any explanation in the paper? Thanks!
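To make the counting in the quoted function concrete, here is a plain-Python re-statement on a toy cost volume. It mirrors only the arithmetic of the snippet above (count bins with cost > 0 per pixel, compare to a threshold with ==). The reading that a zero entry means "no valid match was sampled for this bin" is an assumption of this sketch, and whether the result equals 1-missing_mask is a question for the authors.

```python
def confidence_mask(cost_volume, num_bins_threshold=None):
    """cost_volume[b][p]: matching cost for depth bin b at pixel p (toy layout)."""
    num_bins = len(cost_volume)
    if num_bins_threshold is None:
        num_bins_threshold = num_bins
    num_pixels = len(cost_volume[0])
    mask = []
    for p in range(num_pixels):
        # Number of depth bins with a strictly positive cost at this pixel.
        observed = sum(1 for b in range(num_bins) if cost_volume[b][p] > 0)
        mask.append(1.0 if observed == num_bins_threshold else 0.0)
    return mask


# Toy volume: 3 depth bins, 3 pixels.
cost_volume = [
    [0.3, 0.0, 0.2],
    [0.1, 0.4, 0.0],
    [0.5, 0.2, 0.0],
]
mask = confidence_mask(cost_volume)
```

With the default threshold, only the first pixel (positive cost in every bin) is marked confident. Note that the comparison is an equality, so passing a smaller threshold selects pixels with exactly that many positive bins, not at least that many.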
Hi,
thanks for the interesting paper.
I found a small issue in the content of the paper and would like to discuss it with the main contact, but couldn't find any contact details in the paper.
Looking forward to hearing from you
Yevhen
After fixing the "--png" bug, I also faced difficulties in reproducing good results.
with command
CUDA_VISIBLE_DEVICES=0 python3 -m manydepth.train --data_path /home/kitti_raw/ --log_dir workdirs/ --model_name manydepth --png
which is quite normal (I don't know what it is expected to be but that is reasonable at least).
abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3 |
---|---|---|---|---|---|---|
0.454 | 4.961 | 12.336 | 0.607 | 0.288 | 0.541 | 0.754 |
which is of course a wrong one.
I can't find the bug from here.
As for the code, I modified the color augmentation part of datasets/mono_dataset.py for compatibility with the newer torchvision (which does not seem to be the main problem).
I also modified the export_gt script (the original script did not work for me because the splits are in the folder one level above the script).
I ran
CUDA_VISIBLE_DEVICES=0 python3 -m manydepth.train --data_path /home/kitti_raw/ --log_dir workdirs/ --model_name manydepth
epoch 0 | batch 0 | examples/s: 2.8 | loss: 0.00810 | time elapsed: 00h00m09s | time left: 00h00m00s
epoch 0 | batch 250 | examples/s: 22.4 | loss: 0.00049 | time elapsed: 00h02m34s | time left: 11h19m40s
epoch 0 | batch 500 | examples/s: 20.6 | loss: 0.00024 | time elapsed: 00h05m00s | time left: 10h59m55s
epoch 0 | batch 750 | examples/s: 22.2 | loss: 0.00013 | time elapsed: 00h07m26s | time left: 10h50m49s
epoch 0 | batch 1000 | examples/s: 21.5 | loss: 0.00008 | time elapsed: 00h09m52s | time left: 10h44m55s
epoch 0 | batch 1250 | examples/s: 21.0 | loss: 0.00018 | time elapsed: 00h12m18s | time left: 10h41m15s
epoch 0 | batch 1500 | examples/s: 21.7 | loss: 0.00019 | time elapsed: 00h14m45s | time left: 10h37m37s
epoch 0 | batch 1750 | examples/s: 21.5 | loss: 0.00011 | time elapsed: 00h17m10s | time left: 10h33m45s
The loss is extremely small.
The result at epoch 12 should be reasonable by this point, but it is not.
abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3 |
---|---|---|---|---|---|---|
0.443 | 4.757 | 12.083 | 0.588 | 0.303 | 0.561 | 0.766 |
Hi, and thank you for your contribution!
I have earlier trained monodepth2 with the Lyft dataset with success, and I'm trying to train manydepth with the same dataloader (with some modifications, e.g. for the new load_intrinsics() function). When using gt depth generated from lidar scans from an onboard lidar, I noticed that the functionality is never called, even though check_depth() returns True. After looking in MonoDataset, I noticed on line 192 that this functionality seems to be disabled. Is this intentional?
From MonoDataset:
if self.load_depth and False:
    depth_gt = self.get_depth(folder, frame_index, side, do_flip)
    inputs["depth_gt"] = np.expand_dims(depth_gt, 0)
    inputs["depth_gt"] = torch.from_numpy(inputs["depth_gt"].astype(np.float32))
I tried removing the additional False, but it seems that the lidar data in the Lyft dataset does not have a number of values divisible by 4, as per this ValueError:
File "/cluster/work/didriksg/depth_detection/manydepth/manydepth/kitti_utils.py", line 70, in generate_depth_map
velo = load_velodyne_points(velo_filename)
File "/cluster/work/didriksg/depth_detection/manydepth/manydepth/kitti_utils.py", line 16, in load_velodyne_points
points = np.fromfile(filename, dtype=np.float32).reshape(-1, 4)
ValueError: cannot reshape array of size 555895 into shape (4)
I suppose a solution here is to drop the three last/first points so that the number of points is divisible by 4?
I also have some questions regarding some suspicious-looking loss, but I will look a bit more into it and possibly post it in a separate issue.
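Before dropping values from the lidar file mentioned above, it may be worth checking which per-point stride the file size is consistent with. The sketch below only does that arithmetic. The idea that these files might store five float32 values per point (as nuScenes-style lidar files do, rather than KITTI's four) is an assumption that should be verified against the dataset's documentation.

```python
def values_per_point(num_floats, candidates=(4, 5)):
    """Return the candidate strides that divide the float count exactly."""
    return [k for k in candidates if num_floats % k == 0]


# 555895 is the array size reported in the ValueError above.
strides = values_per_point(555895)
remainder_for_kitti_layout = 555895 % 4

# If a five-value layout is confirmed, one option (untested here) would be
# to reshape to (-1, 5) and keep the first four columns instead of
# discarding trailing values.
```

If the five-value layout applies, trimming three values to force a four-column reshape would silently misalign every point after the first.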
Hi, thanks for sharing the great work!
But I'm unsure how to choose the best freeze_teacher_epoch parameter when training with a custom dataset. Does this have anything to do with the amount of data?
Looking forward to your reply. Thank you!
Hi there, I am trying to train the model using KITTI Raw dataset images. I realize that the KITTI raw images have a resolution of 1242 x 375, while the default image size for the model is 640 x 192. Do I have to resize all the KITTI raw images to 640 x 192 before using them for training? Thank you for your advice!
Hi nianticlabs:
Thank you for sharing this amazing research project to the community!
just one question, where can we access the supplementary materials for this paper?
thank you!
sincerely
Ziyue Feng
Hi,
I'm very interested in the error map visualization in Fig.4 of your paper. Do you use the projected LIDAR point cloud as the GT, or the improved ground truth image in KITTI for error computation? I wonder whether you conduct the interpolation to the GT depth map?
Can you provide the code to show the error map? Thank you a lot :D
Hi, I want to try to do your ablation study without the teacher network.
It is "ManyDepth (with motion masking, w/o teacher) 0.154" in table 4 in your paper.
I noticed there are options for freezing the teacher network. How can I get the results of models without the teacher?
I noticed you are updating intrinsics like so:
K[0, :] *= self.width // (2 ** scale)
K[1, :] *= self.height // (2 ** scale)
Aren't you supposed to multiply the intrinsics by the ratio of the new_shape / orig_shape? If you are resizing your img to be self.width // (2 ** scale), then shouldn't it be K[0, :] *= (self.width // (2 ** scale * orig_width)) where orig_width is the original width of the image before resizing?
What you are doing here seems to be just multiplying the intrinsics by the size of the new image. That cannot be. Am I mis-reading the code?
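A small numeric check may help with the question above. It assumes the intrinsics are stored normalised by the original image size (as monodepth2-style data loaders do), in which case multiplying by the new width is the same as multiplying the pixel-unit focal length by new_width / orig_width. The focal length is a hypothetical value; 1242 and 640 are the KITTI widths mentioned elsewhere on this page.

```python
def scale_normalised_fx(fx_norm, width, scale):
    """Mirror of `K[0, :] *= self.width // (2 ** scale)` for a single entry."""
    return fx_norm * (width // (2 ** scale))


orig_width = 1242   # KITTI raw width mentioned elsewhere on this page
fx_pixels = 720.0   # hypothetical focal length in original-image pixels
fx_norm = fx_pixels / orig_width  # what a normalised K would store

new_width = 640
via_normalised = scale_normalised_fx(fx_norm, new_width, scale=0)
via_ratio = fx_pixels * new_width / orig_width

# The alternative written in the question uses integer division on the ratio,
# which truncates to zero whenever the new image is smaller than the original.
truncated_ratio = new_width // (2 ** 0 * orig_width)
```

Under that assumption the two routes agree, so the quoted lines are consistent with resizing, provided K really is normalised when it is loaded.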
Traceback (most recent call last):
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/hzc/manydepth/manydepth/train.py", line 16, in <module>
trainer.train()
File "/home/hzc/manydepth/manydepth/trainer.py", line 211, in train
self.run_epoch()
File "/home/hzc/manydepth/manydepth/trainer.py", line 242, in run_epoch
self.log("train", inputs, outputs, losses)
File "/home/hzc/manydepth/manydepth/trainer.py", line 742, in log
consistency_target, self.step)
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tensorboardX/writer.py", line 608, in add_image
image(tag, img_tensor, dataformats=dataformats), global_step, walltime)
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tensorboardX/summary.py", line 283, in image
tensor = convert_to_HWC(tensor, dataformats)
File "/home/hzc/anaconda3/envs/cas/lib/python3.6/site-packages/tensorboardX/utils.py", line 103, in convert_to_HWC
tensor shape: {}, input_format: {}".format(tensor.shape, input_format)
AssertionError: size of input tensor and input format are different. tensor shape: (1, 3, 192, 640), input_format: CHW
I have not edited any code, but this error occurs when I train. How can I solve it?
Thanks in advance.
Hi, thanks for your great work!
I found a tiny bug that affects the learning rate decay: in the function train() in train.py, when epoch reaches freeze_teacher_epoch, the optimizer and lr_scheduler are reset, which resets epoch to 0 from the lr_scheduler's point of view.
I have confirmed that the lr will never decay in normal training, because step_size=15 and lr_scheduler is reset when epoch == 15.
I fixed it and trained a new model under the same condition, getting the following results,
model | abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3 |
---|---|---|---|---|---|---|---|
KITTI_MR | 0.098 | 0.770 | 4.459 | 0.176 | 0.900 | 0.965 | 0.983 |
NEW | 0.100 | 0.755 | 4.423 | 0.178 | 0.899 | 0.964 | 0.983 |
It seems better in sq_rel and rmse.
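The scheduler reset described in this issue can be illustrated with a plain-Python model of a step schedule. This is a simplified stand-in, not PyTorch's scheduler: gamma=0.1 is an assumed decay factor, while the 20 epochs, learning rate 1e-4 and step size 15 match values shown in a settings dump elsewhere on this page, and the reset at epoch 15 follows the issue text.

```python
def simulate_lr(num_epochs, base_lr, step_size, gamma, reset_epoch=None):
    """Return the learning rate used in each epoch under a step schedule.

    If reset_epoch is given, the scheduler's internal counter restarts at
    that epoch, mimicking re-creation of the optimizer and scheduler.
    """
    lrs = []
    sched_epoch = 0
    for epoch in range(num_epochs):
        if reset_epoch is not None and epoch == reset_epoch:
            sched_epoch = 0
        lrs.append(base_lr * gamma ** (sched_epoch // step_size))
        sched_epoch += 1
    return lrs


with_reset = simulate_lr(20, 1e-4, step_size=15, gamma=0.1, reset_epoch=15)
without_reset = simulate_lr(20, 1e-4, step_size=15, gamma=0.1)
```

In this model the run with the reset keeps the base learning rate for all 20 epochs, while the run without it decays from epoch 15 onward, which matches the behaviour the issue reports.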
First of all, thank you for your contribution to depth estimation!
When I reproduced your code, I trained the monocular model on my own custom dataset, which consists of 6k consecutive frames.
I have also changed the intrinsics matrix K in the data loader, and I have verified that it is correct.
However, the training results are still unsatisfactory: I cannot generate a correct depth map of the road or get the correct depth of the vehicles on it.
What could be causing this? My guess is that, apart from the relatively small dataset, in high-speed scenarios where the environment is relatively simple, monocular training may not produce large losses, so the network cannot be trained sufficiently.
I would be grateful for a solution!
Thanks!
Hi:
Seems the training time is similar to Monodepth2, pretty efficient!
I'm wondering if it's possible to utilize multiple GPUs?
Thank you
Hi,
Thanks for the interesting paper. It is really impressive and inspiring.
I want to ask you some questions about the binning strategy.
In options.py there are two options, inverse and linear, and linear is the default and the one chosen for your model.
As far as I know, many papers of MVS depth using DNN construct cost volume with planes sampled from the inverse depth space. But in your case, does linear perform better than inverse sampling? Also, would you please explain any insights behind this choice?
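For readers comparing the two binning strategies asked about above, this sketch generates depth hypotheses spaced linearly in depth and linearly in inverse depth. The range 0.1 to 80.0 and the 96 bins are taken from a settings dump elsewhere on this page; the function names are made up for this illustration and the repository's own bin computation may differ.

```python
def linear_bins(min_depth, max_depth, num_bins):
    """Depth hypotheses evenly spaced in depth."""
    step = (max_depth - min_depth) / (num_bins - 1)
    return [min_depth + i * step for i in range(num_bins)]


def inverse_bins(min_depth, max_depth, num_bins):
    """Depth hypotheses evenly spaced in inverse depth, returned in ascending depth."""
    lo, hi = 1.0 / max_depth, 1.0 / min_depth
    step = (hi - lo) / (num_bins - 1)
    inv = [lo + i * step for i in range(num_bins)]
    return [1.0 / d for d in reversed(inv)]


lin = linear_bins(0.1, 80.0, 96)
inv = inverse_bins(0.1, 80.0, 96)
```

Both lists share the same end points, but inverse spacing concentrates most hypotheses close to the camera, while linear spacing spreads them evenly over the range. Which works better is the empirical question the post raises.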
The demo video shows an RGB-D map.
I'm curious about how to get this RGB-D map.
A possible method is: depth image + intrinsics -> point cloud -> aggregate all point clouds with poses -> voxelization -> RGB-D map.
Does anybody know how to generate this RGB-D map?
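The aggregation step of the pipeline proposed above can be sketched in plain Python as follows. This is only an illustration of the poster's idea, not a description of how the demo video was produced. The poses are assumed to be 4x4 camera-to-world matrices, and the example poses and points are hypothetical.

```python
def transform_point(pose, point):
    """Apply a 4x4 camera-to-world matrix (nested lists, row-major) to a 3D point."""
    x, y, z = point
    return tuple(pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
                 for r in range(3))


def aggregate(frames):
    """frames: iterable of (camera_to_world_pose, camera_frame_points) pairs."""
    world_points = []
    for pose, points in frames:
        world_points.extend(transform_point(pose, p) for p in points)
    return world_points


# Two hypothetical frames: the camera moves 1 m forward along +Z between them.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
forward_1m = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1.0], [0, 0, 0, 1]]
cloud = aggregate([
    (identity, [(0.0, 0.0, 5.0)]),    # point seen 5 m ahead in frame 0
    (forward_1m, [(0.0, 0.0, 4.0)]),  # same point seen 4 m ahead in frame 1
])
```

For the merged cloud to be consistent, depths and pose translations need to share the same scale, which is not guaranteed to be metric for self-supervised monocular training.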
My settings:
{
"data_path": "/media/xzy/daa84e38-7f66-4aa4-a0ce-4fe978abe706/xzy/Downloads/manydepth/dump_root",
"log_dir": "/media/xzy/daa84e38-7f66-4aa4-a0ce-4fe978abe706/xzy/Downloads/manydepth/log",
"model_name": "Vecan_model",
"split": "vecan",
"num_layers": 18,
"depth_binning": "linear",
"num_depth_bins": 96,
"dataset": "cityscapes_preprocessed",
"png": true,
"height": 192,
"width": 640,
"disparity_smoothness": 0.001,
"scales": [
0,
1,
2,
3
],
"min_depth": 0.1,
"max_depth": 80.0,
"frame_ids": [
0,
-1,
1
],
"batch_size": 8,
"learning_rate": 0.0001,
"num_epochs": 20,
"scheduler_step_size": 15,
"freeze_teacher_and_pose": false,
"freeze_teacher_epoch": 5,
"v1_multiscale": false,
"avg_reprojection": false,
"disable_automasking": false,
"no_ssim": false,
"weights_init": "pretrained",
"use_future_frame": false,
"num_matching_frames": 1,
"disable_motion_masking": false,
"no_matching_augmentation": false,
"no_cuda": false,
"num_workers": 8,
"load_weights_folder": "/media/xzy/daa84e38-7f66-4aa4-a0ce-4fe978abe706/xzy/Downloads/manydepth/manydepth/checkpoint/KITTI_MR",
"mono_weights_folder": null,
"models_to_load": [
"encoder",
"depth",
"pose_encoder",
"pose"
],
"log_frequency": 250,
"save_frequency": 1,
"eval_stereo": false,
"eval_mono": false,
"disable_median_scaling": false,
"pred_depth_scale_factor": 1,
"ext_disp_to_eval": null,
"eval_split": "eigen",
"save_pred_disps": false,
"no_eval": false,
"eval_eigen_to_benchmark": false,
"eval_out_dir": null,
"post_process": false,
"zero_cost_volume": false,
"static_camera": false
}
Hi, is it possible to train the model on NYU V2 dataset and evaluate the model with existing code? Do we need any preprocessing for that?
Thank you for your excellent work. I have learned many things from your source code.
I saw that an error occurs when the ground truth depth is loaded (issue #9).
Can I just remove "and False" on line 192 of mono_dataset.py?
Or do I need to make further modifications to other parts of the source code?
Hi, I have a question about using ground-truth camera poses instead of predicted camera poses. I tried to use camera poses with the correct scale in the KITTI dataset, but the resulting scale is still not correct. Is there anything I missed? I only changed the code as follows.
output, lowest_cost, costvol = encoder(input_color, lookup_frames,
                                       relative_poses, # change to relative_poses_gt
                                       K,
                                       invK,
                                       min_depth_bin, max_depth_bin)
Thanks a lot!
I tested the KITTI_HR models and got this error:
-> Loading weights from /home/wangshuo/PycharmProjects/test_list/depth/manydepth/models/KITTI_HR
Traceback (most recent call last):
File "/home/wangshuo/anaconda3/envs/many/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/wangshuo/anaconda3/envs/many/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/wangshuo/PycharmProjects/test_list/depth/manydepth/manydepth/evaluate_depth.py", line 399, in <module>
evaluate(options.parse())
File "/home/wangshuo/PycharmProjects/test_list/depth/manydepth/manydepth/evaluate_depth.py", line 146, in evaluate
pose_enc.load_state_dict(pose_enc_dict, strict=True)
File "/home/wangshuo/anaconda3/envs/many/lib/python3.6/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResnetEncoder:
Unexpected key(s) in state_dict: "encoder.bn1.num_batches_tracked", "encoder.layer1.0.bn1.num_batches_tracked", "encoder.layer1.0.bn2.num_batches_tracked", "encoder.layer1.1.bn1.num_batches_tracked", "encoder.layer1.1.bn2.num_batches_tracked", "encoder.layer2.0.bn1.num_batches_tracked", "encoder.layer2.0.bn2.num_batches_tracked", "encoder.layer2.0.downsample.1.num_batches_tracked", "encoder.layer2.1.bn1.num_batches_tracked", "encoder.layer2.1.bn2.num_batches_tracked", "encoder.layer3.0.bn1.num_batches_tracked", "encoder.layer3.0.bn2.num_batches_tracked", "encoder.layer3.0.downsample.1.num_batches_tracked", "encoder.layer3.1.bn1.num_batches_tracked", "encoder.layer3.1.bn2.num_batches_tracked", "encoder.layer4.0.bn1.num_batches_tracked", "encoder.layer4.0.bn2.num_batches_tracked", "encoder.layer4.0.downsample.1.num_batches_tracked", "encoder.layer4.1.bn1.num_batches_tracked", "encoder.layer4.1.bn2.num_batches_tracked".
In resnet_encoder.py, lines 275 to 291:
# feature extraction on lookup images - disable gradients to save memory
with torch.no_grad():
    if self.adaptive_bins:
        self.compute_depth_bins(min_depth_bin, max_depth_bin)
    ......
I don't understand why gradients are disabled for the lookup images. If they were not disabled, would the result be affected?
Thanks for the wonderful work!
I have a question for the depth evaluation:
When I evaluate depth on a single image that has no previous frame, I set
"--zero_cost_volume" and "--num_matching_frames = 0"
for the evaluation options.
However, "evaluate_depth" fails: because "frames_to_load[1:]" is empty, "lookup_frames" receives an empty tensor list.
What should I set or change for single-frame evaluation, where the test frame has no previous or future frames at all?
Thanks for sharing your amazing work and code.
I have a question about the update_adaptive_depth_bins() function (around line 364). Your paper mentions that the depth range is dynamically updated by the min and max of the MVS depth (i.e., the student network). In the code, however, mono_depth is used instead. Am I misunderstanding something? Or is the MVS depth learned to mimic the mono depth? Thanks for your clarification.
I followed this repo to preprocess the Cityscapes dataset, but I got 'FileNotFoundError: [Errno 2] No such file or directory: '/home/hzc/cityscape/ulm/ulm_000056_000015.jpg'' when I trained the model.
So, what should this dataset look like?
Hi,
thanks a lot for sharing this.
I have a fully pre-trained teacher network and tried to freeze its weights directly from the beginning, as I guess it does not make sense to train it further.
However, if I set --freeze_teacher_and_pose as a run option, then self.min_depth_tracker and self.max_depth_tracker are never set in trainer.py, because the following lines are not called:
manydepth/manydepth/trainer.py, lines 59 to 60 in 7e4c46f
Thus, I get an error on the following lines:
manydepth/manydepth/trainer.py, lines 306 to 307 in 7e4c46f
How do you suggest initializing self.min_depth_tracker and self.max_depth_tracker when freezing the teacher weights from the beginning? I suppose it makes sense to initialize them to reflect the range of depth values that my pretrained model produces?
Thanks in advance and best regards,
Patrick
Thank you for open-sourcing this very interesting work. Would it be possible to also provide weights for the monocular depth networks (the teacher networks) that go along with the currently available pre-trained models? Thank you!