
relpose-plus-plus's Introduction

Code for RelPose++

[arXiv] [Colab] [Project Page] [Bibtex]

Setup Dependencies

We recommend using a conda environment to manage dependencies. Install a version of PyTorch compatible with your CUDA version from the PyTorch website.

git clone --depth 1 https://github.com/amyxlase/relpose-plus-plus.git
conda create -n relposepp python=3.8
conda activate relposepp
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -r requirements.txt

Then, follow the directions to install PyTorch3D here.
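As a rough sketch of one common route, assuming a matching CUDA/PyTorch combination is already installed (defer to the official PyTorch3D install guide, which also lists prebuilt wheels):

# Build PyTorch3D from source at the stable tag; this can take a while.
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"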

Run Demo

A Colab notebook is available here.

To run locally, first download pre-trained weights:

mkdir -p weights
gdown "https://drive.google.com/uc?id=1FGwMqgLXv4R0xMzEKVv3n3Aghn0MQXKY&export=download"
unzip relposepp_weights.zip -d weights

The demo can be run on any image directory with 2-8 images. Each image must be associated with a bounding box; the Colab notebook has an interactive interface for selecting bounding boxes.

The bounding boxes can either be extracted automatically from masks or specified in a JSON file.
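The authoritative format is whatever examples/robot/bboxes.json contains; as an unverified sketch, assuming the file maps each image filename to [x1, y1, x2, y2] pixel coordinates, such a file could be produced like this:

import json

# Hypothetical layout: image filename -> [x1, y1, x2, y2] in pixels.
# Compare with examples/robot/bboxes.json before relying on this.
bboxes = {
    "0001.jpg": [120, 85, 640, 510],
    "0002.jpg": [100, 90, 615, 500],
}
with open("bboxes.json", "w") as f:
    json.dump(bboxes, f, indent=2)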

Run demo by extracting bounding boxes from masks:

python relpose/demo.py  --image_dir examples/robot/images \
    --mask_dir examples/robot/masks --output_path robot.html

Run demo using the masked model (ignores background):

python relpose/demo.py  --image_dir examples/robot/images --model_dir weights/relposepp_masked \
    --mask_dir examples/robot/masks --output_path robot.html

Run demo with specified bounding boxes:

python relpose/demo.py  --image_dir examples/robot/images \
    --bbox_path examples/robot/bboxes.json --output_path robot.html

The demo will output an HTML file that can be opened in a browser, displaying the input images and the predicted cameras. An example is shown here.

Pre-processing CO3D

Download the CO3Dv2 dataset from here.

Then, pre-process the annotations:

python -m preprocess.preprocess_co3d --category all --precompute_bbox \
    --co3d_v2_dir /path/to/co3d_v2
python -m preprocess.preprocess_co3d --category all \
    --co3d_v2_dir /path/to/co3d_v2

Training

The trainer should be run via:

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=8 \
relpose/trainer_ddp.py --batch_size=48 --num_images=8 --random_num_images=true  \
--gpu_ids=0,1,2,3,4,5,6,7 --lr=1e-5 --normalize_cameras --use_amp 

Our released model was trained for 800,000 iterations using 8 GPUs (A6000).

Evaluation Directions

Please refer to eval.md for instructions on running evaluations.

Citing RelPose++

If you find this code helpful, please cite:

@article{lin2023relposepp,
    title={RelPose++: Recovering 6D Poses from Sparse-view Observations},
    author={Lin, Amy and Zhang, Jason Y and Ramanan, Deva and Tulsiani, Shubham},
    journal={arXiv preprint arXiv:2305.04926},
    year={2023}
}


relpose-plus-plus's Issues

Expected date for full evaluation code

Hi, congrats on the great work and thank you for sharing code.

I was wondering if there is an estimated timeline for the release of the complete evaluation code. Thank you for any
information you can provide on this matter.

Failed to predict camera poses

Hi there,

I tried the relposepp_masked checkpoint to predict camera poses of some images with masks: images.zip

The input images are attached above. However, the result does not seem correct (see the attached visualization).

I'd like to know if I have made any mistake in testing the code. Thanks a lot!

Maximum of the input num_images

Hi, what brilliant work! I have been reading RelPose++ these days, and after running your code I have a question: what is the maximum number of input images (num_images) for the transformer encoder? What if the image sequence is so large that 8 images cannot cover the object sufficiently?

Camera convention of pose prediction

Hi, thanks a lot for sharing this code! What is the convention of the predicted poses? Are they camera-to-world or world-to-camera poses? And what should I do if I want to convert the poses to camera-to-world in OpenCV style (i.e., z for the looking direction, y for the down direction, and x for the right direction)?

Thanks in advance!
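For reference, whichever convention the predictions use, world-to-camera and camera-to-world extrinsics are related by a rigid inverse, so converting between them is a one-liner. A generic numpy sketch (not code from this repo), assuming [R | t] extrinsics:

import numpy as np

def invert_extrinsics(R, t):
    # x_cam = R @ x_world + t  <=>  x_world = R.T @ x_cam - R.T @ t,
    # so the inverse extrinsics are (R.T, -R.T @ t).
    return R.T, -R.T @ t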

The demo combined with ners

Thanks for the awesome work! I have noticed that you provide an experiment combining RelPose++ with NeRS. Could you provide a demo of this experiment, or tell me how to initialize the mesh cube using the estimated camera poses? For example, how should the lengths and centroids of the mesh cube be set?

a question about your paper

Hi,

Thanks for your brilliant work! I have a small question about your work.

In Figure 3 ('Coordinate Systems for Estimating Camera Translation') of your paper, I am confused about the translation coordinates. Taking the right one, 'Look-at Centered', as an example: the world origin is set at the unique point closest to the optical axes of all cameras. I can understand why T1 = [0,0,1], but why is T2 = [0,0,2]? I suppose it should be [0,1.5,1.5] according to the relative position between the left camera and the world origin. Am I wrong? Could you please explain this? Thanks in advance!
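For reference, assuming the figure uses the standard world-to-camera convention, the translation is expressed in the camera's own frame (T = -R C), so a camera looking straight at the world origin always has T = [0, 0, d], where d is its distance to the origin, regardless of where it sits in the world. A numpy sketch illustrating this (not code from the paper):

import numpy as np

def look_at_rotation(center, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    # World-to-camera rotation for a camera at `center` looking at `target`
    # (+z is the viewing direction); rows are the camera axes in world coords.
    z = target - center
    z = z / np.linalg.norm(z)
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])

C = np.array([0.0, 1.5, 1.5])  # camera center in world coordinates
R = look_at_rotation(C)        # camera looks at the world origin
T = -R @ C                     # world-to-camera translation
print(T)                       # ~[0, 0, 2.12], i.e. [0, 0, ||C||]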

Some clarification questions as well as question how to use a custom dataset

Hi,

Congratulations on your work, and many thanks for making it public. It seemed quite interesting for my project, so I was curious to try it and see how it performs for my task. I have a set of turntable images of different fragments; for each fragment there are 121 images taken 3 degrees apart, and for each image I have the corresponding mask as well as the ground-truth camera pose. An example can be seen in the attached images.

Now the idea is that I would like to extract the camera pose for an unseen view based on the existing ones.

Thus, I was curious to see how your approach performs. I set up a virtual environment, downloaded the pre-trained weights, and ran the model on a set of 8 images; the resulting output (screenshot attached) already works quite nicely out of the box. Therefore, I would like to see if I can optimize it so that it works with my dataset. For that I have some questions:

  1. I got the aforementioned result by using the masked pre-trained model. I guess the other one is based on the bounding boxes if I am not mistaken, right?
  2. As it stands, the demo script resolves camera pose estimation for between 2 and 8 images. Can it be modified to give an output for only one image, or for more than 8? I guess I would need to retrain the model?
  3. How can I plot the ground-truth camera poses so that I can get a qualitative representation of the error?
  4. Do you already have a script to compute the quantitative rotation and translation errors with respect to the ground-truth camera poses?
  5. As I understand it, you estimate the camera poses relative to the camera pose of the first image, right? From what I've seen, the estimated camera pose of the first image is always an identity matrix, and I guess you then use the utils/normalize_cameras.py script to transform the camera poses back and forth when needed (see the sketch after this list).
  6. If I wanted to retrain the model for my fragments, would it make sense to train on all the fragments together, or to train a separate model for each piece individually?
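Regarding point 5, a generic sketch of re-expressing absolute poses relative to the first camera, assuming world-to-camera [R | t] extrinsics (this is only an illustration; the repo's utils/normalize_cameras.py may differ):

import numpy as np

def relative_to_first(Rs, ts):
    # Map world-to-camera extrinsics (R_i, t_i) into the frame of camera 0,
    # so that camera 0 has R = I and t = 0.
    R0, t0 = Rs[0], ts[0]
    Rs_rel = np.stack([R @ R0.T for R in Rs])
    ts_rel = np.stack([t - R @ R0.T @ t0 for R, t in zip(Rs, ts)])
    return Rs_rel, ts_rel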

Thank you for your time.

When running the colab code, the visualization is not ideal

Load pretrained weights

model, args = get_model(
    model_dir="/content/relposepp_masked/",
    num_images=8,
    device="cuda",
)

Loading checkpoint ckpt_000800000.pth
Missing keys: ['feature_extractor.feature_positional_encoding.pos_table_1']
Unexpected keys: []

about dataset choice

Hi,

Could you tell me why you chose 'set_lists_fewview_dev' instead of 'set_lists_fewview_train'?
How long does one training run (800,000 iterations) take using 8x A6000? About 3 days?

Thanks beforehand.

Camera cone with image 3D visualization

Dear Authors,

The visualizations in the paper and the animations on the website look really cool. Would you be able to share the piece of code that reproduces such a visualization?

Thank you in Advance!

about the dataset

Because all the zip files of CO3Dv2 occupy 5.5 TB of disk space, it is difficult to follow this work if all the files are used. I am wondering if you only used the single-sequence subset, which takes 8.9 GB.

eval translation

Hi,

In your training code, normalize_cameras=True, which means T and R are decoupled, am I right? But in your eval code, when you load the GT, normalize_cameras=False (see eval/util.py line 25), which means the first-camera frame is adopted there (SfM format). From my perspective, these formats are inconsistent. I don't really understand this part; can you explain a little bit? I really appreciate it!

training loss

Hi,

Thank you for your wonderful work. I would like to ask: at what loss level can the model be considered to have basically converged?
My data contains 140 sequences. The current model has been trained for 1,640 iterations; the rotation loss is about 1.26 and the translation loss is about 0.14.

Please Provide Some Advice on Sparse Viewpoint Reconstruction

Hi, I am very grateful for your research and how it has helped my work. However, I have encountered a problem: I would like to generate sparse point clouds from the results produced by your method. In the past I used COLMAP to generate sparse point clouds, but my research scenario has changed to reconstruction under sparse viewpoints, where COLMAP fails.

If you know how to reconstruct sparse point clouds from the sparse-viewpoint poses produced by your method, please provide me with some guidance. Thank you very much.
