
relpose-plus-plus's Introduction

Code for RelPose++

[arXiv] [Colab] [Project Page] [Bibtex]

Setup Dependencies

We recommend using a conda environment to manage dependencies. Install a version of PyTorch compatible with your CUDA version from the PyTorch website.

git clone --depth 1 https://github.com/amyxlase/relpose-plus-plus.git
conda create -n relposepp python=3.8
conda activate relposepp
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -r requirements.txt

Then, follow the directions to install PyTorch3D here.
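As a rough sketch of one common route, assuming a matching CUDA/PyTorch combination is already installed (defer to the official PyTorch3D install guide, which also lists prebuilt wheels):

# Build PyTorch3D from source at the stable tag; this can take a while.
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"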

Run Demo

A Colab notebook is available here.

To run locally, first download pre-trained weights:

mkdir -p weights
gdown "https://drive.google.com/uc?id=1FGwMqgLXv4R0xMzEKVv3n3Aghn0MQXKY&export=download"
unzip relposepp_weights.zip -d weights

The demo can be run on any image directory with 2-8 images. Each image must be associated with a bounding box; the Colab notebook has an interactive interface for selecting bounding boxes.

The bounding boxes can either be extracted automatically from masks or specified in a JSON file.
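The authoritative format is whatever examples/robot/bboxes.json contains; as an unverified sketch, assuming the file maps each image filename to [x1, y1, x2, y2] pixel coordinates, such a file could be produced like this:

import json

# Hypothetical layout: image filename -> [x1, y1, x2, y2] in pixels.
# Compare with examples/robot/bboxes.json before relying on this.
bboxes = {
    "0001.jpg": [120, 85, 640, 510],
    "0002.jpg": [100, 90, 615, 500],
}
with open("bboxes.json", "w") as f:
    json.dump(bboxes, f, indent=2)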

Run demo by extracting bounding boxes from masks:

python relpose/demo.py  --image_dir examples/robot/images \
    --mask_dir examples/robot/masks --output_path robot.html

Run demo using the masked model (ignores background):

python relpose/demo.py  --image_dir examples/robot/images --model_dir weights/relposepp_masked \
    --mask_dir examples/robot/masks --output_path robot.html

Run demo with specified bounding boxes:

python relpose/demo.py  --image_dir examples/robot/images \
    --bbox_path examples/robot/bboxes.json --output_path robot.html

The demo will output an HTML file that can be opened in a browser, displaying the input images and the predicted cameras. An example is shown here.

Pre-processing CO3D

Download the CO3Dv2 dataset from here.

Then, pre-process the annotations:

python -m preprocess.preprocess_co3d --category all --precompute_bbox \
    --co3d_v2_dir /path/to/co3d_v2
python -m preprocess.preprocess_co3d --category all \
    --co3d_v2_dir /path/to/co3d_v2

Training

The trainer should be run via:

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 --nproc_per_node=8 \
relpose/trainer_ddp.py --batch_size=48 --num_images=8 --random_num_images=true  \
--gpu_ids=0,1,2,3,4,5,6,7 --lr=1e-5 --normalize_cameras --use_amp 

Our released model was trained for 800,000 iterations using 8 GPUs (A6000).

Evaluation Directions

Please refer to eval.md for instructions on running evaluations.

Citing RelPose++

If you find this code helpful, please cite:

@article{lin2023relposepp,
    title={RelPose++: Recovering 6D Poses from Sparse-view Observations},
    author={Lin, Amy and Zhang, Jason Y and Ramanan, Deva and Tulsiani, Shubham},
    journal={arXiv preprint arXiv:2305.04926},
    year={2023}
}


relpose-plus-plus's Issues

Expected date for full evaluation code

Hi, congrats on the great work and thank you for sharing code.

I was wondering if there is an estimated timeline for the release of the complete evaluation code. Thank you for any
information you can provide on this matter.

Failed to predict camera poses

Hi there,

I tried the relposepp_masked checkpoint to predict camera poses of some images with masks: images.zip

The input images are attached above. However, the result does not seem correct (see the attached visualization).

I'd like to know if I have made any mistake in testing the code. Thanks a lot!

Maximum of the input num_images

Hi, what brilliant work! I have been reading RelPose++ these days, and after running your code I have a question: what is the maximum number of input images (num_images) for the transformer encoder? What if the image sequence is so large that 8 images cannot cover the object sufficiently?

Camera convention of pose prediction

Hi, thanks a lot for sharing this code! What is the convention of the predicted poses? Are they camera-to-world or world-to-camera poses? And what should I do if I want to convert the poses to camera-to-world in OpenCV style (i.e., z for the looking direction, y for the down direction, and x for the right direction)?

Thanks in advance!
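For reference, whichever convention the predictions use, world-to-camera and camera-to-world extrinsics are related by a rigid inverse, so converting between them is a one-liner. A generic numpy sketch (not code from this repo), assuming [R | t] extrinsics:

import numpy as np

def invert_extrinsics(R, t):
    # x_cam = R @ x_world + t  <=>  x_world = R.T @ x_cam - R.T @ t,
    # so the inverse extrinsics are (R.T, -R.T @ t).
    return R.T, -R.T @ t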

The demo combined with ners

Thanks for the awesome work! I have noticed that you provide an experiment combining RelPose++ with NeRS. Could you provide a demo of this experiment, or tell me how to initialize the mesh cube using the estimated camera poses? For example, how should the lengths and centroids of the mesh cube be set?

a question about your paper

Hi,

Thanks for your brilliant work! I have a small question about your work.

In Figure 3 ('Coordinate Systems for Estimating Camera Translation') of your paper, I am confused about the translation coordinates. Taking the right one, 'Look-at Centered', as an example: the world origin is set at the unique point closest to the optical axes of all cameras. I can understand why T1 = [0,0,1], but why is T2 = [0,0,2]? I suppose it should be [0,1.5,1.5] according to the relative position between the left camera and the world origin. Am I wrong? Could you please explain this? Thanks in advance!
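For reference, assuming the figure uses the standard world-to-camera convention, the translation is expressed in the camera's own frame (T = -R C), so a camera looking straight at the world origin always has T = [0, 0, d], where d is its distance to the origin, regardless of where it sits in the world. A numpy sketch illustrating this (not code from the paper):

import numpy as np

def look_at_rotation(center, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    # World-to-camera rotation for a camera at `center` looking at `target`
    # (+z is the viewing direction); rows are the camera axes in world coords.
    z = target - center
    z = z / np.linalg.norm(z)
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])

C = np.array([0.0, 1.5, 1.5])  # camera center in world coordinates
R = look_at_rotation(C)        # camera looks at the world origin
T = -R @ C                     # world-to-camera translation
print(T)                       # ~[0, 0, 2.12], i.e. [0, 0, ||C||]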

Some clarification questions as well as question how to use a custom dataset

Hi,

Congratulations on your work, and many thanks for making it public. It seemed quite interesting for my project, so I was curious to try it and see how it performs for my task. I have a set of turntable images of different fragments; for each fragment there are 121 images taken 3 degrees apart, and for each image I have the corresponding mask as well as the ground-truth camera pose. An example can be seen in the attached images.

Now the idea is that I would like to extract the camera pose for an unseen view based on the existing ones.

Thus, I was curious to see how your approach performs. I set up a virtual environment, downloaded the pre-trained weights, and ran the model on a set of 8 images; the resulting output (screenshot attached) already works quite nicely out of the box. Therefore, I would like to see if I can optimize it so that it works with my dataset. For that I have some questions:

  1. I got the aforementioned result by using the masked pre-trained model. I guess the other one is based on the bounding boxes if I am not mistaken, right?
  2. As it stands, the demo script resolves camera pose estimation for between 2 and 8 images. Can it be modified to give an output for only one image, or for more than 8? I guess I would need to retrain the model?
  3. How can I plot the ground-truth camera poses so that I can get a qualitative representation of the error?
  4. Do you already have a script to compute the quantitative rotation and translation errors with respect to the ground-truth camera poses?
  5. As I understand it, you estimate the camera poses relative to the camera pose of the first image, right? From what I've seen, the estimated camera pose of the first image is always an identity matrix, and I guess you then use the utils/normalize_cameras.py script to transform the camera poses back and forth when needed (see the sketch after this list).
  6. If I wanted to retrain the model for my fragments, would it make sense to train on all the fragments together, or to train a separate model for each piece individually?
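Regarding point 5, a generic sketch of re-expressing absolute poses relative to the first camera, assuming world-to-camera [R | t] extrinsics (this is only an illustration; the repo's utils/normalize_cameras.py may differ):

import numpy as np

def relative_to_first(Rs, ts):
    # Map world-to-camera extrinsics (R_i, t_i) into the frame of camera 0,
    # so that camera 0 has R = I and t = 0.
    R0, t0 = Rs[0], ts[0]
    Rs_rel = np.stack([R @ R0.T for R in Rs])
    ts_rel = np.stack([t - R @ R0.T @ t0 for R, t in zip(Rs, ts)])
    return Rs_rel, ts_rel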

Thank you for your time.

When running the colab code, the visualization is not ideal

Load pretrained weights

model, args = get_model(
    model_dir="/content/relposepp_masked/",
    num_images=8,
    device="cuda",
)

Loading checkpoint ckpt_000800000.pth
Missing keys: ['feature_extractor.feature_positional_encoding.pos_table_1']
Unexpected keys: []

about dataset choice

Hi,

Could you tell me why you chose 'set_lists_fewview_dev' instead of 'set_lists_fewview_train'?
How long does one training run (800,000 iterations) take using 8x A6000? About 3 days?

Thanks beforehand.

Camera cone with image 3D visualization

Dear Authors,

The visualizations in the paper and the animations on the website look really cool. Would you be able to share the piece of code that reproduces such a visualization?

Thank you in Advance!

about the dataset

Because all the zip files of CO3Dv2 occupy 5.5 TB of disk space, it is difficult to follow this work if all the files are used. I am wondering if you only used the single-sequence subset, which takes 8.9 GB.

eval translation

Hi,

In your training code, normalize_cameras=True, which means T and R are decoupled, am I right? But in your eval code, when you load the GT, normalize_cameras=False (see eval/util.py line 25), which means the first-camera frame is adopted there (SfM format). From my perspective, these formats are inconsistent. I don't really understand this part; can you explain a little bit? I really appreciate it!

training loss

Hi,

Thank you for your wonderful work. I would like to ask: at what loss level can the model be considered to have basically converged?
My data contains 140 sequences. The current model has been trained for 1,640 iterations; the rotation loss is about 1.26 and the translation loss is about 0.14.

Please Provide Some Advice on Sparse Viewpoint Reconstruction

Hi, I am very grateful for your research and how it has helped my work. However, I have encountered a problem: I would like to generate sparse point clouds from the results produced by your method. In the past I used COLMAP to generate sparse point clouds, but my research scenario has changed to reconstruction under sparse viewpoints, where COLMAP fails.

If you know how to reconstruct sparse point clouds from the sparse-viewpoint poses produced by your method, please provide me with some guidance. Thank you very much.
