deepv2d's Introduction


This repository contains the source code for our paper:

DeepV2D: Video to Depth with Differentiable Structure from Motion
Zachary Teed and Jia Deng
International Conference on Learning Representations (ICLR) 2020


Our code was tested using TensorFlow 1.12.0 and Python 3. To use the code, first install the following Python packages:

First create a clean virtualenv

virtualenv --no-site-packages -p python3 deepv2d_env
source deepv2d_env/bin/activate
pip install tensorflow-gpu==1.12.0
pip install h5py
pip install easydict
pip install scipy
pip install opencv-python
pip install pyyaml
pip install toposort
pip install vtk

You can optionally compile our cuda backprojection operator by running

cd deepv2d/special_ops && ./ && cd ../..

This will reduce peak GPU memory usage. You may need to change CUDALIB to point to where CUDA is installed.


Video to Depth (V2D)

Try it out on one of the provided test sequences. First download our pretrained models


or from google drive

The demo code will output a depth map and display a point cloud for visualization. Once the depth map has appeared, press any key to open the point cloud visualization.


python demos/ --model=models/nyu.ckpt --sequence=data/demos/nyu_0


python demos/ --model=models/scannet.ckpt --sequence=data/demos/scannet_0


python demos/ --model=models/kitti.ckpt --sequence=data/demos/kitti_0

You can also run motion estimation in global mode, which updates all the poses jointly as a single optimization problem.

python demos/ --model=models/nyu.ckpt --sequence=data/demos/nyu_0 --mode=global

Uncalibrated Video to Depth (V2D-Uncalibrated)

If you do not know the camera intrinsics you can run DeepV2D in uncalibrated mode. In the uncalibrated setting, the motion module estimates the focal length during inference.

python demos/ --video=data/demos/


DeepV2D can also be used for tracking and mapping on longer videos. First, download some test sequences


Try it out on NYU-Depth, ScanNet, TUM-RGBD, or KITTI. Using more keyframes (set with --n_keyframes) reduces drift but slows tracking.

python demos/ --dataset=kitti --n_keyframes=2
python demos/ --dataset=scannet --n_keyframes=3

The --cinematic flag forces the visualization to follow the camera

python demos/ --dataset=nyu --n_keyframes=3 --cinematic

The --clear_points flag can be used so that only the point cloud of the current depth is plotted.

python demos/ --dataset=tum --n_keyframes=3 --clear_points


You can evaluate the trained models on one of the datasets...

python evaluation/ --model=models/nyu.ckpt

First download the dataset using this script provided on the official website. Then run the evaluation script, where KITTI_PATH is the location where the dataset was downloaded.

python evaluation/ --model=models/kitti.ckpt --dataset_dir=KITTI_PATH

First download the ScanNet dataset.

Then run the evaluation script, where SCANNET_PATH is the location where you downloaded ScanNet.

python evaluation/ --model=models/scannet.ckpt --dataset_dir=SCANNET_PATH


You can train a model on one of the datasets

First download the training tfrecords file here (143 GB) containing the NYU data. Once the data has been downloaded, train the model by running the command below (training takes about 1 week on an NVIDIA 1080Ti GPU).

Camera poses for NYU were estimated with ORB-SLAM2 from Kinect measurements. You can download the estimated poses from Google Drive.

python training/ --cfg=cfgs/nyu.yaml --name=nyu_model --tfrecords=nyu_train.tfrecords

Note: this creates a temporary directory used to store intermediate depth predictions. You can specify its location with the --tmp flag and use multiple GPUs with the --num_gpus flag. If you train with multiple GPUs, you can reduce the number of training iterations in cfgs/nyu.yaml.

First download the dataset using this script provided on the official website. Once the dataset has been downloaded, write the training sequences to a tfrecords file

python training/ --dataset=kitti --dataset_dir=KITTI_DIR --records_file=kitti_train.tfrecords

You can now train the model (training takes about 1 week on an NVIDIA 1080Ti GPU). Note: this creates a temporary directory used to store intermediate depth predictions. You can specify its location with the --tmp flag and use multiple GPUs with the --num_gpus flag.

python training/ --cfg=cfgs/kitti.yaml --name=kitti_model --tfrecords=kitti_train.tfrecords
python training/ --cfg=cfgs/scannet.yaml --name=scannet_model --dataset_dir="path to scannet"


deepv2d's Issues

Image reconstruction quality on kitti

Hi! I am trying to reconstruct the image at frame T from the image at frame T+1. However, the visualization looks odd.

Here is how I do the reconstruction:
Let D, RgbT, RgbT+, PoseT, PoseT+ denote the predicted depth (unscaled), the input RGB at frame T, the input RGB at frame T+1, the pose predicted at T, and the pose predicted at T+1.

1. pts3d = backproject(Depth)
2. pts3d_at_frameT+1 = PoseT+ * inv(PoseT) * pts3d
3. pts2d_at_frameT+1 = project(pts3d_at_frameT+1)
4. grid sample
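
The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's code: the function name `warp_coords`, the intrinsics order (fx, fy, cx, cy), and the convention that poses are 4x4 world-to-camera matrices are all assumptions. It returns, for every pixel of frame T, the coordinates in frame T+1 to sample from; the final grid-sample step is left out.

```python
import numpy as np

def warp_coords(depth, intrinsics, pose_t, pose_t1):
    """Return (H, W, 2) pixel coordinates in frame T+1 for every pixel of frame T.

    Assumptions (hypothetical, for illustration): intrinsics = (fx, fy, cx, cy),
    poses are 4x4 world-to-camera matrices, and depth is a float (H, W) array
    in the same units as the pose translations.
    """
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # 1. Backproject pixels of frame T to homogeneous 3D points in camera T.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # (H, W, 4)

    # 2. Move the points into the coordinate frame of camera T+1.
    rel = pose_t1 @ np.linalg.inv(pose_t)
    pts_t1 = pts @ rel.T

    # 3. Project into the image plane of frame T+1.
    z = pts_t1[..., 2]
    u1 = fx * pts_t1[..., 0] / z + cx
    v1 = fy * pts_t1[..., 1] / z + cy
    return np.stack([u1, v1], axis=-1)
```

Note that step 2 only gives consistent results when the depth and the pose translations share one scale; if one of them is rescaled and the other is not, static objects will appear to move in the reconstruction.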

However, below is a visualized reconstruction at 2011_10_03_drive_0027_0000000799.png. The first row is the original input, the second row is the reconstructed RGB, and the third row is the flow visualization:

I notice an obvious lack of scale in the reconstruction, and the same holds for other sequences. The poses I used come from the depth prediction process (the pose results from the eval_kitti script). Ideally, the car in the left corner should not move since it is static.

InvalidArgumentError (see above for traceback): Cholesky decomposition was not successful. The input might not be valid.

Caused by op 'motion/PnP_1/Cholesky', defined at: 
  File "demos/", line 152, in <module>  
  File "demos/", line 90, in main  
    use_fcrn=True, is_calibrated=False, use_regressor=False)
  File "deepv2d/", line 68, in __init__ 
  File "deepv2d/", line 129, in _build_motion_graph    
    images, depths, intrinsics, edge_inds, init=do_init)    
  File "deepv2d/modules/", line 287, in forward    
    (jj,ii), num_fixed=num_fixed, include_intrinsics=(not self.is_calibrated))  
  File "deepv2d/geometry/", line 527, in global_optim
    delta_update = cholesky_solve(H, b) 
  File "deepv2d/geometry/", line 32, in solve    
    x = cholesky_solve(H, b)  
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/", line 111, in decorated  
    return _graph_mode_decorator(f, *args, **kwargs)
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/", line 132, in 
    result, grad_fn = f(*args)
  File "deepv2d/geometry/", line 9, in cholesky_solve
    chol = tf.linalg.cholesky(H)
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/", line 709, in 
    "Cholesky", input=input, name=name)
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/", line 787, in 
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/", line 488, in 
    return func(*args, **kwargs)
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/", line 3274, in 
  File "/mnt/lustre/xiehaozhe/Applications/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/", line 1770, in 
    self._traceback = tf_stack.extract_stack() 

InvalidArgumentError (see above for traceback): Cholesky decomposition was not successful. The input might not be valid.
    [[node motion/PnP_1/Cholesky (defined at deepv2d/geometry/ = Cholesky[T=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:CPU:0"](motion/PnP_1/Cast_3)]]
    [[{{node motion/PnP_2/Cast_5/_2999}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_5542_motion/PnP_2/Cast_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
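
A Cholesky factorization fails when the matrix is not positive definite, which can happen when the linear system is rank-deficient or contains NaNs. One common remedy, shown here as a NumPy sketch and not as the repository's fix, is to add a small damping term to the diagonal before factorizing. The function name `damped_cholesky_solve` and the damping value are illustrative assumptions.

```python
import numpy as np

def damped_cholesky_solve(H, b, damping=1e-4):
    """Solve (H + damping * I) x = b using a Cholesky factorization."""
    H_damped = H + damping * np.eye(H.shape[0])
    L = np.linalg.cholesky(H_damped)   # raises LinAlgError if still not positive definite
    y = np.linalg.solve(L, b)          # solves L y = b
    return np.linalg.solve(L.T, y)     # solves L^T x = y
```

Damping changes the solution slightly, so it trades a small bias for numerical robustness.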

Blas xGEMMBatched launch failed : a.shape=[7,3,3], b.shape=[7,3,1], m=3, n=1, k=3, batch_size=7

When I run the demo with a GPU, something goes wrong:
Caused by op 'stereo/MatMul', defined at:
File "demos/", line 81, in <module>
File "demos/", line 55, in main
deepv2d = DeepV2D(cfg, args.model, use_fcrn=args.fcrn, is_calibrated=is_calibrated, mode=args.mode)
File "Deepv2d/", line 73, in __init__
File "Deepv2d/", line 164, in _build_depth_graph
depths = self.depth_net.forward(Ts, images, intrinsics, adj_list)
File "Deepv2d/modules/", line 187, in forward
spred = self.stereo_network_avg(poses, images, intrinsics, idx)
File "Deepv2d/modules/", line 116, in stereo_network_avg
volume = operators.backproject_avg(Ts, depths, intrinsics, fmaps, adj_list)
File "Deepv2d/special_ops/", line 55, in backproject_avg
Tii = Ts.gather(ii) * Ts.gather(ii).inv() # this is just a set of id trans.
File "Deepv2d/geometry/", line 146, in inv
Ginv = se3_matrix_inverse(self.matrix())
File "Deepv2d/geometry/", line 203, in se3_matrix_inverse
t = -tf.matmul(R, t)
File "/home/duanzm/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/", line 2019, in matmul
a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
File "/home/duanzm/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/", line 1245, in batch_mat_mul
"BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
File "/home/duanzm/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/", line 787, in _apply_op_helper
File "/home/duanzm/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/util/", line 488, in new_func
return func(*args, **kwargs)
File "/home/duanzm/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/", line 3274, in create_op
File "/home/duanzm/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[7,3,3], b.shape=[7,3,1], m=3, n=1, k=3, batch_size=7
[[node stereo/MatMul (defined at Deepv2d/geometry/ = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](stereo/transpose, stereo/strided_slice_1)]]
[[{{node Sum/_2107}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3915_Sum", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

My CUDA version is 9.0. What should I do?

NYU gt association and pose

Hi, thx for sharing the code.
However, I have one problem about the dataloader for NYUv2.

associations_file = osp.join(scene_dir, 'associations.txt')
camera_file = osp.join(scene_dir, 'pose.txt')

I have already downloaded the NYUv2 raw dataset, and want to generate the tfrecords on my own.
However, it seems that the association file and the ground-truth pose file are not provided in the official dataset.
Did you generate them with another approach?

Dynamic Frames For Inference

Hi Zachary,

I am trying to check the depth performance of DeepV2D with different numbers of frames. If I change the KITTI config to Frames:2, the network parameters of the motion predictor are mismatched. Do we have to retrain the network with two frames in this setting?


question about view pooling

Hi, thanks for your code. Although I have read the code, I still don't understand the meaning of view pooling. In '3D Matching Network with view concatenation', you build a cost volume for each image pair, then stack them all, refine the result with a 3D CNN (_hourglass_3d), and output the depth probability. I don't see where pooling happens across the different cost volumes; it seems that you stack all the cost volumes and output the depth map.
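
To make the distinction in this question concrete, here is a NumPy sketch contrasting pooling over views with concatenating views. It assumes that "view pooling" means averaging the per-view volumes, which matches the usual meaning of the term but is not confirmed in this thread; the array shapes are hypothetical.

```python
import numpy as np

# Hypothetical shapes: 4 source views, D = 8 depth planes,
# H x W = 6 x 10 spatial grid, C = 16 feature channels.
volumes = np.random.rand(4, 8, 6, 10, 16)

# View pooling: average over the view axis. The output shape does not
# depend on how many views were used.
pooled = volumes.mean(axis=0)                            # (8, 6, 10, 16)

# View concatenation: stack the views along the channel axis. The channel
# count now grows with the number of views.
concatenated = np.concatenate(list(volumes), axis=-1)    # (8, 6, 10, 64)
```

With pooling, layers that follow can be shared across any number of views, while concatenation ties the layer shapes to a fixed view count.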

tensorflow-gpu version

FYR: Tested ok under tensorflow-gpu==1.14.0

Errors using tensorflow-gpu==1.12.0

module 'tensorflow' has no attribute 'custom_gradient'

Errors using tensorflow-gpu==1.13.1

failed to run optimizer arithmeticoptimizer, stage removestackstridedslicesameaxis node

Testing command

python demos/ --dataset=scannet --n_keyframes=3

Here's my conda environments.yml

name: py37-deepv2d
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.1.0=gpu
  - absl-py=0.7.1=py37_0
  - astor=0.7.1=py37_0
  - blas=1.0=mkl
  - c-ares=1.15.0=h7b6447c_1
  - ca-certificates=2019.5.15=0
  - certifi=2019.3.9=py37_0
  - cudatoolkit=10.0.130=0
  - cudnn=7.6.0=cuda10.0_0
  - cupti=10.0.130=0
  - gast=0.2.2=py37_0
  - grpcio=1.16.1=py37hf8bcb03_1
  - h5py=2.9.0=py37h7918eee_0
  - hdf5=1.10.4=hb1b8bf9_0
  - intel-openmp=2019.4=243
  - keras-applications=1.0.8=py_0
  - keras-preprocessing=1.1.0=py_1
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libprotobuf=3.8.0=hd408876_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - markdown=3.1.1=py37_0
  - mkl=2019.4=243
  - mkl_fft=1.0.12=py37ha843d7b_0
  - mkl_random=1.0.2=py37hd81dba3_0
  - mock=3.0.5=py37_0
  - ncurses=6.1=he6710b0_1
  - numpy=1.16.4=py37h7e9f1db_0
  - numpy-base=1.16.4=py37hde5b4d6_0
  - openssl=1.1.1c=h7b6447c_1
  - pip=19.1.1=py37_0
  - protobuf=3.8.0=py37he6710b0_0
  - python=3.7.3=h0371630_0
  - readline=7.0=h7b6447c_5
  - scipy=1.2.1=py37h7c811a0_0
  - setuptools=41.0.1=py37_0
  - six=1.12.0=py37_0
  - sqlite=3.28.0=h7b6447c_0
  - tensorboard=1.13.1=py37hf484d3e_0
  - tensorflow=1.13.1=gpu_py37hc158e3b_0
  - tensorflow-base=1.13.1=gpu_py37h8d69cac_0
  - tensorflow-estimator=1.13.0=py_0
  - tensorflow-gpu=1.13.1=h0d30ee6_0
  - termcolor=1.1.0=py37_1
  - tk=8.6.8=hbc83047_0
  - werkzeug=0.15.4=py_0
  - wheel=0.33.4=py37_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - attrs==19.1.0
    - backcall==0.1.0
    - bleach==3.1.0
    - cycler==0.10.0
    - decorator==4.4.0
    - defusedxml==0.6.0
    - easydict==1.9
    - entrypoints==0.3
    - google-pasta==0.1.8
    - ipykernel==5.1.1
    - ipython==7.5.0
    - ipython-genutils==0.2.0
    - jedi==0.13.3
    - jinja2==2.10.1
    - jsonschema==3.0.1
    - jupyter-client==5.2.4
    - jupyter-core==4.4.0
    - jupyterlab==0.35.6
    - jupyterlab-server==0.2.0
    - kiwisolver==1.1.0
    - markupsafe==1.1.1
    - matplotlib==3.1.0
    - mistune==0.8.4
    - nbconvert==5.5.0
    - nbformat==4.4.0
    - notebook==5.7.8
    - opencv-python==
    - pandas==0.24.2
    - pandocfilters==1.4.2
    - parso==0.4.0
    - pexpect==4.7.0
    - pickleshare==0.7.5
    - prometheus-client==0.7.0
    - prompt-toolkit==2.0.9
    - ptyprocess==0.6.0
    - pygments==2.4.2
    - pyparsing==2.4.0
    - pyrsistent==0.15.2
    - python-dateutil==2.8.0
    - pytz==2019.1
    - pyyaml==5.3
    - pyzmq==18.0.1
    - seaborn==0.9.0
    - send2trash==1.5.0
    - terminado==0.8.2
    - testpath==0.4.2
    - toposort==1.5
    - tornado==6.0.2
    - tqdm==4.43.0
    - traitlets==4.3.2
    - vtk==8.1.2
    - wcwidth==0.1.7
    - webencodings==0.5.1
    - wrapt==1.12.0
prefix: /home/yoyee/miniconda3/envs/py37-deepv2d

Intrinsics file - what do numbers mean?

In all kitti demo sequences there is a file called "intrinsics.txt" with four numbers. What do they mean, why are they necessary and where do you get the values from?

If I followed the code correctly, they refer to fx, fy, cx, cy (see, line50) and you need them to reproject 2D points to 3D. This makes sense to me. But how do you get those values? As I understand it, KITTI raw provides the camera intrinsic matrix in the files "calib_cam_to_cam.txt" in lines starting with "K_0". It also provides the projection matrix directly in lines starting with "P_0". But the values specified there differ significantly from the values in the "intrinsics.txt" file. In particular, fx is approximately equal to fy in "calib_cam_to_cam.txt", whereas in the "intrinsics.txt" file the first and second values differ by ~10%.
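
One way fx and fy can end up differing is an anisotropic resize: pinhole intrinsics must be rescaled whenever the image is resized and shifted whenever it is cropped. Whether this explains the values in "intrinsics.txt" is not confirmed here; the sketch below only shows the arithmetic. The function name `adjust_intrinsics` and the example numbers are hypothetical, and half-pixel center conventions are ignored.

```python
def adjust_intrinsics(fx, fy, cx, cy, scale_x, scale_y, crop_left=0, crop_top=0):
    """Rescale pinhole intrinsics for a resize followed by a crop.

    scale_x and scale_y are new_size / old_size along each axis; crop_left
    and crop_top are the pixels removed from the left and top after resizing.
    """
    return (fx * scale_x,
            fy * scale_y,
            cx * scale_x - crop_left,
            cy * scale_y - crop_top)

# Hypothetical example: a camera with equal focal lengths, resized by
# different factors in x and y, then cropped by 100 rows from the top.
fx, fy, cx, cy = adjust_intrinsics(700.0, 700.0, 600.0, 180.0,
                                   scale_x=0.9, scale_y=0.8, crop_top=100)
```

In this example the two focal lengths start out equal and end up about 12% apart purely because of the different resize factors.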

Problem running demo

Hi, I'm trying to run the demo with both KITTI and NYU, but I'm getting the following error:

Backprojection Op not available: Using python implementation
Traceback (most recent call last):
File "demos/", line 68, in <module>
File "demos/", line 41, in main
net = DeepV2D(INPUT_DIMS, cfg)
File "lib/", line 38, in __init__
poses_pred = motion.forward(images[:, 1:], image_star, depth, intrinsics)
File "lib/networks/", line 84, in forward
G = self.flowse3(feat1, feat2, depth1, intrinsics/SC, G=G, reuse=i>0)
File "lib/networks/", line 107, in flowse3
coords = camera.camera_transform_project(G, depth, intrinsics)
File "lib/", line 87, in camera_transform_project
X = point_cloud_from_depth(depth, intrinsics)
File "lib/", line 72, in point_cloud_from_depth
X = iproj(pix, depth, kv)
File "lib/", line 54, in iproj
fx, fy, cx, cy = tf.split(kv, [1, 1, 1, 1], axis=-1)
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/", line 1226, in split
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/ops/", line 3289, in _split_v
num_split=num_split, name=name)
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 767, in apply_op
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 2508, in create_op
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 1873, in set_shapes_for_outputs
shapes = shape_func(op)
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 1823, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 610, in call_cpp_shape_fn
debug_python_shape_fn, require_shape_fn)
File "/data/work/depth_estimation/DeepV2D/venv/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 676, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension size, given by scalar input 2, must be non-negative but is -1 for 'motion/split' (op: 'SplitV') with input shapes: [7,4], [4], [] and with computed input tensors: input[2] = <-1>.

demo_slam on custom video?

Thanks for the great work. I see that the code takes its video sequences from the NYU/KITTI/ScanNet datasets. Is there a way to use demo_slam with my own video of an indoor scene recorded with a smartphone camera?

about jacobian

Thank you for your great work!
I would like to know how the following derivation is obtained:
Screenshot from 2021-03-03 19-43-48
Could you please point me to some material where I can learn the related background?
It would be very helpful.

Problem running KITTI demo

After installing the requirements, entering "python demos/ --cfg cfgs/kitti.yaml --sequence demo_videos/kitti_demos/032/" yields ...

Traceback (most recent call last):
  File "demos/", line 67, in <module>
  File "demos/", line 44, in main
    depths = net.forward(data_blob)
  File "lib/", line 113, in forward
    output =, feed_dict=feed_dict)
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/client/", line 929, in run
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/client/", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/client/", line 1328, in _do_run
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/client/", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3,47,271] = [3, 47, 272] does not index into param shape [4,48,272,64]
	 [[node motion/GatherNd_1 (defined at lib/utils/ ]]

Caused by op 'motion/GatherNd_1', defined at:
  File "demos/", line 67, in <module>
  File "demos/", line 38, in main
    net = DeepV2D(INPUT_DIMS, cfg)
  File "lib/", line 38, in __init__
    poses_pred = motion.forward(images[:, 1:], image_star, depth, intrinsics)
  File "lib/networks/", line 84, in forward
    G = self.flowse3(feat1, feat2, depth1, intrinsics/SC, G=G, reuse=i>0)
  File "lib/networks/", line 108, in flowse3
    featw = bilinear_sampler.bilinear_sampler(feat2, coords)
  File "lib/utils/", line 88, in bilinear_sampler
    output = bilinear_sampler_general(imgs, coords)
  File "lib/utils/", line 50, in bilinear_sampler_general
    img01 = tf.gather_nd(imgs, coords01)
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/ops/", line 3647, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/framework/", line 788, in _apply_op_helper
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/util/", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/framework/", line 3300, in create_op
  File "/home/cgebbe/.local/lib/python3.6/site-packages/tensorflow/python/framework/", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[3,47,271] = [3, 47, 272] does not index into param shape [4,48,272,64]
	 [[node motion/GatherNd_1 (defined at lib/utils/ ]]
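
The message says index 272 was requested along an axis of size 272, so a sampling coordinate fell one past the last valid column. A generic guard, sketched in NumPy and not taken from the repository's bilinear_sampler, is to clamp coordinates to the valid range before gathering. The function name `clamp_coords` and the (x, y) ordering of the last axis are assumptions.

```python
import numpy as np

def clamp_coords(coords, height, width):
    """Clamp (..., 2) pixel coordinates ordered (x, y) to valid indices."""
    x = np.clip(coords[..., 0], 0, width - 1)
    y = np.clip(coords[..., 1], 0, height - 1)
    return np.stack([x, y], axis=-1)
```

In a bilinear sampler the clamp has to be applied to all four neighbor indices (floor and floor + 1 in each axis), since the "+ 1" neighbor is the one that typically runs off the edge.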

Results even better using "default" validation method?

I believe your results might be even slightly better if you use the default validation method:

The paper states that you directly use the 192x1088 output image of the CNN for evaluation. In contrast, other papers first resize the inferred image to the RGB size, crop it and then evaluate it, see

You can do the same if you first pad the output image with 108 pixels to undo the previous cropping and then perform the resizing and cropping. In that case I get an absRelErr=0.0640. I believe the improvement is due to the fact that I see some artifacts at the top which are simply cropped away with this method.
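
For reference, the absolute relative error quoted above is commonly defined as the mean of |pred - gt| / gt over pixels with valid ground truth. The sketch below assumes invalid ground truth is marked with 0 and works on flat sequences of depth values; the function name is hypothetical and this is not the repository's evaluation code.

```python
def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over entries with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)
```

Because invalid pixels are excluded, cropping or padding changes the result only through the set of valid pixels that remain.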

Note however, that I have skipped some of the 697 images from the Eigen split, if one of the four neighboring images was not available. How have you dealt with these cases? It is not mentioned at all in the paper.

About ScanNet

Thanks for sharing the wonderful work.

I have a question for the usage of the scenes in the ScanNet dataset.
While ScanNet itself provides train/val/test splits, it seems like this paper utilized specific scenes as below.


I want to double-check whether I correctly understand the author's intentions.

ScanNet training

In the paper you train on KITTI, NYU and ScanNet for your best results (ScanNet gets the most iterations during stage 1). However, only training scripts for KITTI and NYU are present here.

What is the reason for this? Could this be additionally provided?

Another question: Afterwards, you report that stage 2 is trained for another 120k iterations. On what benchmark is this? Is this on the individual benchmark, where you report your results or do you train on several as in stage 1?

Best regards!

How did you evaluate TUM using translational rmse(m/s)

Hi, thank you for your nice work.
I'm wondering how you get the results from the paper.
I ran the code by

python demos/ --dataset=tum

I extracted the poses from slam.poses.
Then I used evo_rpe for evaluation.
But the metrics from evo_rpe are

{"title": "RPE w.r.t. translation part (m)\nfor delta = 1 (frames) using consecutive pairs\n(with Sim(3) Umeyama alignment)", "ref_name": "DeepV2D/data/slam/tum/rgbd_dataset_freiburg1_room/groundtruth.txt", "est_name": "DeepV2D/results/tum/poses.tum", "label": "RPE (m)"}

The aligned trajectory also doesn't look right.
May I ask if there's some conversion I missed?

Thank you.

NYUD pose

Hi, thanks for your work. The tfrecords file is too big to download. Could you share a compressed file with the pose information?

How to load 2-stage checkpoints for evaluation?

Hi @zachteed @heilaw @anewell @jiadeng ,

Thank you for your work! I have a question about the ckpt for demo and evaluation.

When we train the model we get two checkpoints, for stage_1 and stage_2, but I notice we only need to load one ckpt file for evaluation and the demo. How can we get this final ckpt file, and could you please explain the relationship between this final ckpt and the two stage checkpoints obtained from training?

Thank you so much!

cannot load checkpoint

I tried the demo
python3 demos/ --video=data/demos/
but it crashed:

tensorflow/core/framework/] OP_REQUIRES failed at : Not found: Key stereo/BatchNorm/moving_stddev not found in checkpoint
Traceback (most recent call last):
  File "/Users/l0stpenguin/Library/Python/3.7/lib/python/site-packages/tensorflow_core/python/client/", line 1365, in _do_call
    return fn(*args)
  File "/Users/l0stpenguin/Library/Python/3.7/lib/python/site-packages/tensorflow_core/python/client/", line 1350, in _run_fn
    target_list, run_metadata)
  File "/Users/l0stpenguin/Library/Python/3.7/lib/python/site-packages/tensorflow_core/python/client/", line 1443, in _call_tf_sessionrun
tensorflow.python.framework.errors_impl.NotFoundError: Key stereo/BatchNorm/moving_stddev not found in checkpoint
         [[{{node save/RestoreV2}}]]

I do not have a GPU machine. Is it possible to run it without a GPU?
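
TensorFlow can be kept off the GPU by hiding all CUDA devices before it is imported, as in the sketch below. Whether DeepV2D then runs correctly on CPU is not confirmed here, and this does not address the missing checkpoint key reported above.

```python
import os

# Hide all CUDA devices; this must be set before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```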

kitti.ckpt does not support global mode

Hi, when I use kitti.ckpt and set --mode=global, the following error arises:

Traceback (most recent call last):
  File "demos/", line 84, in <module>
  File "demos/", line 66, in main
    depths, poses = deepv2d(images, intrinsics, viz=True, iters=args.n_iters)
  File "deepv2d/", line 467, in __call__
  File "deepv2d/", line 368, in update_poses
    self.poses, self.intrinsics, self.weights =, feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/", line 950, in run
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/", line 1149, in _run
ValueError: Cannot feed value of shape (1, 192, 1088) for Tensor 'Placeholder_1:0', which has shape '(5, 192, 1088)'

nyu.ckpt works normally in both global and keyframe mode.

What is the problem?

Many thanks.

Questions about intrinsics for ScanNet

Hello. Thanks for the great work.

I am trying to run your evaluation code on ScanNet. I think I am using a newer version of ScanNet where some intrinsics files are placed elsewhere, so I got a FileNotFound error here:

Can you let me know what intrinsics (with respect to what image size) should be put here so that I can assign them manually?

Also, I am confused why only the depth intrinsics were used:
while I guess the network will need color intrinsics instead.

Question about pose preprocessing


I have a question about retrieving the pose data.
As referenced below, after the pose is converted from a quaternion to a matrix, it is followed by an inverse operation. Why is this inverse operation necessary?

pose_mat = pose_vec2mat(pose_vec)
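
A pose matrix can map camera coordinates to world coordinates or the reverse, and inverting it switches between the two conventions. Which convention each dataset and the network use is not stated in this thread, so the sketch below only shows the closed-form inverse of a rigid transform; the function name `invert_se3` is hypothetical.

```python
import numpy as np

def invert_se3(T):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv
```

If a dataset stores camera-to-world poses and the model expects world-to-camera poses, an inverse like this one is exactly the conversion needed.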


Significance of multiplying the translation vector by 0.1(args['scale'])


Thank you for sharing the code.

I am not able to understand the significance of multiplying the translation vector by the constant 0.1 (args['scale']) when updating trajectory[i][0:3, 3].

Can you explain to me why you multiplied the translation vector by 0.1(args['scale'])?

        for i in range(len(trajectory)):
            trajectory[i] =, util.inv_SE3(trajectory[i]))
            trajectory[i][0:3, 3] *= self.args['scale']
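
One property that may be relevant here: a pinhole projection is unchanged when the depth and the camera translation are multiplied by the same factor, so rescaling both only changes the numeric range of the values. The sketch below checks this for a single point under a translation-only motion. All names and numbers are illustrative assumptions, not values from the repository.

```python
def project_after_translation(point, translation, fx, fy, cx, cy):
    """Translate a 3D point, then project it with a pinhole camera."""
    x, y, z = (p + t for p, t in zip(point, translation))
    return (fx * x / z + cx, fy * y / z + cy)

point = (1.0, -0.5, 8.0)          # 3D point in the first camera frame
translation = (0.2, 0.0, -1.0)    # camera motion between the two frames
scale = 0.1

original = project_after_translation(point, translation, 700.0, 700.0, 600.0, 180.0)
rescaled = project_after_translation(
    tuple(scale * p for p in point),
    tuple(scale * t for t in translation),
    700.0, 700.0, 600.0, 180.0)
# original and rescaled agree up to floating-point rounding.
```

The same argument shows why depth and translation must be scaled together: scaling only one of them changes the projected pixel.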

Is your method end-2-end ?

I have read your paper! Thanks for uploading the code.
However, I would like to ask whether your method can be trained end-2-end.
As I understand it, the Depth module builds a cost volume around the keyframe and then uses a 3D CNN to predict the depth of that keyframe. In the Motion module, images and depths are required as input to predict the relative poses.
If you have N = 5 input images, does that mean you have to run the Depth module N times to get all N depth maps as input to the Motion module?

Question about scaling?

Hi @zachteed @heilaw @anewell @jiadeng

I notice that you scale both the depth map and the translation in pose matrix with scaling ratio 0.1 when training the KITTI dataset.
However, in the data streaming script for NYU and SCANNET, I only find the scaling for depth map with scaling ratio 1/5000 and 1/1000. Could you please explain why we don't need to scale the translation for NYU and SCANNET?

Thank you so much!

LS-OPTIMIZATION LAYER back propagation

In the paper I found the sentence
"In the backward pass, the gradients can be found by solving another linear system." in the appendix under the title LS-OPTIMIZATION LAYER.
1. Which linear system is that?
2. How did you get equation (16) in the appendix?

can anyone please help me
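
A standard way to see why a second linear system appears, written in generic notation that may not match the paper's equation (16): assume the layer computes x by solving H x = b with H symmetric positive definite, and that the loss \mathcal{L} depends on x. The symbols y and g are introduced here for illustration.

```latex
% Forward pass: solve the linear system for x.
H x = b
% Differentiating both sides: dH\,x + H\,dx = db, hence dx = H^{-1}(db - dH\,x).
% Backward pass: given g = \partial\mathcal{L}/\partial x, solve a second system with the same matrix.
H y = g
% Since d\mathcal{L} = g^{\top} dx = y^{\top} db - y^{\top} dH\, x, the input gradients are
\frac{\partial \mathcal{L}}{\partial b} = y, \qquad
\frac{\partial \mathcal{L}}{\partial H} = -\, y\, x^{\top}
```

The Cholesky factor computed in the forward pass can be reused to solve the second system, so the backward pass needs no new factorization.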

why scale depth_pred for evaluation?

Hello. I notice you scale both the depth and pose estimates for evaluation. It's reasonable to scale the pose as in previous works, but it seems unfair to scale depth_pred too, since the ground truth depth is used in the loss function. Yours is a supervised depth estimation method, so why do you also scale the estimated depth?
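
For context, the scaling commonly used in depth evaluation is per-image median scaling, which multiplies the prediction by median(gt) / median(pred) before computing errors. Whether the evaluation scripts here do exactly this is not confirmed in this thread; the sketch below shows the generic technique on flat sequences of depth values, with a hypothetical function name.

```python
from statistics import median

def median_scale(pred, gt):
    """Scale predictions so their median matches the ground-truth median."""
    factor = median(gt) / median(pred)
    return [factor * p for p in pred]
```

Reporting errors both with and without this step makes it clear how much of the accuracy depends on the scale alignment.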

How to implement a single-view demo?

Hi, I'm trying to implement a single-view NYU evaluation. I noticed that you use 8 frames in your code and estimate the depth map of the first keyframe. When I reduced the number of frames to 1, the predicted depth maps were all NaN. I also tried concatenating two identical frames, but the resulting depth was poor.

How to implement a single-view demo correctly?

NYU preprocessing

Hi, thanks for your excellent work. If I make the NYUD tfrecord myself, should I first preprocess the depth (using the official MATLAB tool) to align it?

Can't run demo with batch size 8

Hi I was trying to run the demo
python demos/ --model=models/scannet.ckpt --sequence=data/demos/scannet_0
But got the following error

2020-02-27 14:07:27.062479: E tensorflow/stream_executor/cuda/] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_NOT_SUPPORTED
2020-02-27 14:07:27.062517: E tensorflow/stream_executor/cuda/] Internal: failed BLAS call, see log for details
Traceback (most recent call last):   
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 1334, in _do_call
    return fn(*args)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 1407, in _call_tf_sessionrun
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[134400,2,3], b.shape=[134400,3,6], m=2, n=6, k=3, batch_size=134400
         [[{{node motion/PnP/einsum_1/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](motion/PnP/einsum_1/Reshape, motion/PnP/einsum_1/Reshape
         [[{{node motion/PnP_2/einsum_7/Reshape_2/_2363}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", 
send_device_incarnation=1, tensor_name="edge_5308_motion/PnP_2/einsum_7/Reshape_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):   
  File "demos/", line 82, in <module>
  File "demos/", line 64, in main
    depths, poses = deepv2d(images, intrinsics, viz=True, iters=args.n_iters)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/", line 462, in __call__
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/", line 368, in update_poses
    self.poses, self.intrinsics, self.weights =, feed_dict=feed_dict)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 929, in run
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 1328, in _do_run
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[134400,2,3], b.shape=[134400,3,6], m=2, n=6, k=3, batch_size=134400
         [[node motion/PnP/einsum_1/MatMul (defined at /projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/utils/  = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job
:localhost/replica:0/task:0/device:GPU:0"](motion/PnP/einsum_1/Reshape, motion/PnP/einsum_1/Reshape_1)]]
         [[{{node motion/PnP_2/einsum_7/Reshape_2/_2363}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", 
send_device_incarnation=1, tensor_name="edge_5308_motion/PnP_2/einsum_7/Reshape_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'motion/PnP/einsum_1/MatMul', defined at:
  File "demos/", line 82, in <module>
  File "demos/", line 55, in main
    deepv2d = DeepV2D(cfg, args.model, use_fcrn=args.fcrn, is_calibrated=is_calibrated, mode=args.mode)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/", line 68, in __init__
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/", line 129, in _build_motion_graph
    images, depths, intrinsics, edge_inds, init=do_init)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/modules/", line 282, in forward
    Tij = Tij.keyframe_optim(target, weight, depths, intrinsics)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/geometry/", line 364, in keyframe_optim
    J = einsum('...ij,...jk->...ik', jproj, jtran)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/utils/", line 49, in einsum
    out = tf.einsum(equation, *inputs)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/", line 257, in einsum
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/", line 389, in _einsum_reduction
    product = math_ops.matmul(t0, t1)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/", line 2019, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/", line 1245, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/", line 787, in _apply_op_helper
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/util/", line 488, in new_func
    return func(*args, **kwargs)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/", line 3274, in create_op
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[134400,2,3], b.shape=[134400,3,6], m=2, n=6, k=3, batch_size=134400
         [[node motion/PnP/einsum_1/MatMul (defined at /projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/utils/  = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](motion/PnP/einsum_1/Reshape, motion/PnP/einsum_1/Reshape_1)]]
         [[{{node motion/PnP_2/einsum_7/Reshape_2/_2363}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5308_motion/PnP_2/einsum_7/Reshape_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

My environment is Python 3.6.7 with tensorflow-gpu 1.12.0.
It seems the problem is that the batch size is too big: the demo succeeds when I use only 4 images. Can you help?
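
The report suspects that the size of the batched matrix product (134400 small matrices) triggers the failure. If that is the case, one generic way to test the hypothesis is to split the batch dimension into smaller slices and multiply them separately. The sketch below shows the idea in NumPy, using the shapes from the log. The helper name `chunked_batch_matmul` and the slice size are placeholders chosen for illustration, not a documented cuBLAS limit, and the approach has not been verified against the repository's TensorFlow code.

```python
import numpy as np

def chunked_batch_matmul(a, b, max_batch=65535):
    """Multiply [N, m, k] by [N, k, n] in slices of at most max_batch matrices."""
    assert a.shape[0] == b.shape[0], "batch dimensions must match"
    pieces = [
        np.matmul(a[i:i + max_batch], b[i:i + max_batch])
        for i in range(0, a.shape[0], max_batch)
    ]
    return np.concatenate(pieces, axis=0)

# Shapes taken from the error message: a=[134400, 2, 3], b=[134400, 3, 6].
rng = np.random.default_rng(0)
a = rng.standard_normal((134400, 2, 3))
b = rng.standard_normal((134400, 3, 6))

J = chunked_batch_matmul(a, b)
print(J.shape)
```

The result is identical to a single batched product, so the slicing only changes how much work each call receives.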

NYU Depth TFRecord

Thanks for sharing this great work.
I'm wondering: where does the nyu_train.tfrecords file come from?
It seems there are 13776 examples, each with 9 RGB images, 1 depth image and smaller things like intrinsics and poses.
It's about 138GB but NYU Depth V2 is more like 400GB, which surprises me (even though encoding is not the same). Maybe this file was built using NYU Depth V1, which is 90GB? Is this file the one used in the experiments reported in the paper?

Simultaneous Tracking and Mapping?

Hi, thanks for the great work. I wonder if you can provide demo code that performs tracking (camera pose estimation) and mapping (depth estimation) simultaneously.

About heavy distortion of video.

Hi @heilaw @anewell @jiadeng @zachteed
Thanks for your work. I noticed that for uncalibrated video with unknown focal length, you offer demos/ , which estimates the focal length during inference. Do you estimate the distortion parameters k1, k2, p1, p2 as well? I want to run on uncalibrated video with heavy distortion.
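
For readers unfamiliar with the parameters named above: k1, k2 (radial) and p1, p2 (tangential) are the coefficients of the widely used radial-tangential lens model, the same convention OpenCV uses. The sketch below applies that model to normalized image coordinates. It only illustrates what the coefficients mean; it is not something the repository is known to estimate, and the function name `distort` and the sample values are assumptions.

```python
import numpy as np

def distort(xy, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to
    normalized image points of shape [N, 2]."""
    x, y = xy[:, 0], xy[:, 1]
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return np.stack([x_d, y_d], axis=1)

# Normalized points: (pixel - principal point) / focal length.
pts = np.array([[0.0, 0.0], [0.5, 0.0], [0.3, -0.4]])
print(distort(pts, k1=-0.2, k2=0.05, p1=0.001, p2=-0.001))
```

With all four coefficients set to zero the points are returned unchanged, which corresponds to the pinhole model assumed when only the focal length is estimated.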


CUDA version error?

Hi, thanks for your work. I encountered the following error:

2020-09-06 16:43:02.192510: E tensorflow/stream_executor/cuda/] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_NOT_SUPPORTED

Is this a CUDA version problem? My environment is TF 1.12 with CUDA 9.2.

How to inference on a longer video?

Hello Teed,
I'm new to the video-to-depth area; thanks for your excellent work.
I'm using your code to predict depth maps from " ". However, I found that you only predict 8 depth maps from a single video. I tried to remove this constraint, but then I got an out-of-memory error.
How can I predict dense depth maps for a whole video with your project? I'm looking forward to your reply. Thank you very much!
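
One generic way to keep memory bounded on a long video is to process it in fixed-size, overlapping windows of frames rather than all at once. The sketch below only generates the frame indices for such windows. It is an illustration under stated assumptions, not the repository's method: the function name `sliding_windows`, the window size of 8 (taken from the question) and the stride of 4 are all hypothetical choices.

```python
def sliding_windows(n_frames, window=8, stride=4):
    """Return lists of frame indices; each list is one fixed-size chunk."""
    if n_frames <= window:
        return [list(range(n_frames))]
    starts = list(range(0, n_frames - window + 1, stride))
    if starts[-1] != n_frames - window:
        starts.append(n_frames - window)  # make sure the final frames are covered
    return [list(range(s, s + window)) for s in starts]

# Example: a 20-frame clip split into windows of 8 frames with 50% overlap.
windows = sliding_windows(20)
for idx in windows:
    print(idx[0], idx[-1])
```

Each window could then be passed to the model separately, so peak memory depends on the window size rather than on the video length.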

Test on video


I'd like to test on a video sequence from TUM, and I'm wondering how you test on a video sequence.
If the poses are unknown, how do you compute them? What batch size do you use to optimize the poses? Which images do you sample to compute the pose for a given frame?

If the poses are known, do you update the depth for only one iteration?


Scale in evaluation?

Hi, thanks for sharing the good work.
However, I'm curious about the scaling used in evaluation.

From my understanding, DeepV2D is supervised and should require no scaling in depth or pose evaluation.
However, in your evaluation script, all depths and poses are rescaled. Why is that needed?

Another question concerns the scaling factor used when calculating trans (cm):

a =, t2) /, t2)

Shouldn't it be,t1)/*t2)?
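
As background for this question: the scalar a that minimizes ||t1 - a * t2||^2 is found by setting the derivative with respect to a to zero, which gives a = (t1 . t2) / (t2 . t2). The sketch below checks this numerically. Which vector the evaluation script actually rescales is not visible here, so the roles of `t1` (reference) and `t2` (vector being rescaled) are an assumption, as are the function name `ls_scale` and the sample values.

```python
import numpy as np

def ls_scale(t1, t2):
    """Scalar a that minimizes ||t1 - a * t2||^2 (least-squares scale)."""
    return np.dot(t1, t2) / np.dot(t2, t2)

t1 = np.array([0.2, -0.1, 0.4])                 # assumed reference translation
t2 = 0.5 * t1 + np.array([0.01, 0.0, -0.01])    # assumed estimate, off in scale

a = ls_scale(t1, t2)
residual = t1 - a * t2
# At the optimum the residual is orthogonal to t2.
print(a, np.dot(residual, t2))
```

Note that the formula is not symmetric: swapping the two vectors changes the denominator and therefore the scale.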

Output of Demo scripts

Thanks for uploading the code for this research paper.
I was able to run the demo code for NYU successfully. However, the output is a single depth image, and the same goes for the demo_uncalibrated script, where an entire video is provided as input.
Shouldn't the output be multiple depth maps for different video frames, or something similar, as described in the paper?
