epro-pnp-v2's People

Contributors

lakonik

epro-pnp-v2's Issues

inference is non-deterministic?

Hi!

I've been doing inference with the following script, using the previous repo and this config:

from mmcv.parallel import MMDataParallel
from mmdet.datasets import build_dataloader
from epropnp_det.datasets.builder import build_dataset
from epropnp_det.apis.inference import init_detector
from mmcv import Config
import torch
from mmdet.apis import set_random_seed

set_random_seed(0, deterministic=True)

config_file = 'configs/epropnp_det_basic.py'
checkpoint_file = '/path/to/checkpoint/file'
device = 'cuda:0'
cfg = Config.fromfile(config_file)
distributed = False
samples_per_gpu = cfg.data.val.pop('samples_per_gpu', 1)
samples_per_gpu = 1
dataset = build_dataset(cfg.data.val)
model = init_detector(cfg, checkpoint_file, device=device)
model.test_cfg['debug'] = ['orient']
model = MMDataParallel(model, device_ids=[0])

data_loader = build_dataloader(
    dataset,
    samples_per_gpu=samples_per_gpu,
    workers_per_gpu=cfg.data.workers_per_gpu,
    dist=distributed,
    shuffle=False)

for i, data in enumerate(data_loader):
    with torch.no_grad():
        result = model(return_loss=False, rescale=True, **data)
    print(result[0]["orient_logprob"][0].shape)
    print(result[0]["bbox_results"][0].shape)
    print(result[0]["bbox_3d_results"][0].shape)
    print("------------------------------------")
    if i == 20:
        break

print('2nd for cycle')

for i, data in enumerate(data_loader):

    with torch.no_grad():
        result = model(return_loss=False, rescale=True, **data)
    print(result[0]["orient_logprob"][0].shape)
    print(result[0]["bbox_results"][0].shape)
    print(result[0]["bbox_3d_results"][0].shape)
    print("------------------------------------")

    logprob = result[0]["orient_logprob"]
    bbox_3d = result[0]["bbox_3d_results"]
    if i == 20:
        break

This way I'm printing the shapes of the results for cars in each image. The first dimension of each shape corresponds to the number of detected objects in the image. I noticed that, despite setting the seed, I sometimes (in every run of 2×20 iterations so far) get a different number of detections between the two passes over the same dataloader (separated by print('2nd for cycle')).

Outputs for the above script:
FIRST ITERATION:

(1, 128)
(1, 5)
(1, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(5, 128)
(5, 5)
(5, 20)
------------------------------------
(33, 128)
(33, 5)
(33, 20)
------------------------------------
(14, 128)
(14, 5)
(14, 20)
------------------------------------
(2, 128)
(2, 5)
(2, 20)
------------------------------------
(5, 128)
(5, 5)
(5, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(4, 128)
(4, 5)
(4, 20)
------------------------------------
(35, 128)
(35, 5)
(35, 20)
------------------------------------
(12, 128)
(12, 5)
(12, 20)
------------------------------------
(1, 128)
(1, 5)
(1, 20)
------------------------------------
(2, 128)
(2, 5)
(2, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(5, 128)
(5, 5)
(5, 20)
------------------------------------
(33, 128)
(33, 5)
(33, 20)
------------------------------------
(15, 128)
(15, 5)
(15, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(3, 128)
(3, 5)
(3, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(3, 128)
(3, 5)
(3, 20)

SECOND ITERATION:

(1, 128)
(1, 5)
(1, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(6, 128)
(6, 5)
(6, 20)
------------------------------------
(32, 128)
(32, 5)
(32, 20)
------------------------------------
(15, 128)
(15, 5)
(15, 20)
------------------------------------
(2, 128)
(2, 5)
(2, 20)
------------------------------------
(5, 128)
(5, 5)
(5, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(5, 128)
(5, 5)
(5, 20)
------------------------------------
(34, 128)
(34, 5)
(34, 20)
------------------------------------
(12, 128)
(12, 5)
(12, 20)
------------------------------------
(1, 128)
(1, 5)
(1, 20)
------------------------------------
(2, 128)
(2, 5)
(2, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(4, 128)
(4, 5)
(4, 20)
------------------------------------
(32, 128)
(32, 5)
(32, 20)
------------------------------------
(12, 128)
(12, 5)
(12, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(3, 128)
(3, 5)
(3, 20)
------------------------------------
(0, 128)
(0, 5)
(0, 20)
------------------------------------
(3, 128)
(3, 5)
(3, 20)

As you can see, over these iterations there are 8 differences in the detected object counts. Only one difference is larger than 1: 12 instead of 15.

What could be the cause of this? Maybe the non-deterministic nature of the PnP solver?
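
For reference, one way to test that hypothesis would be to reset the RNG state immediately before each pass, so that both passes consume the same random stream. This is only a minimal sketch reusing model and data_loader from the script above; if the random sampling inside the solver is the source, the two passes should then agree exactly:

import torch
from mmdet.apis import set_random_seed

# Sketch: if the differences come from random sampling inside the solver
# (e.g. the Monte Carlo / AMIS steps), re-seeding right before each pass
# should make both passes consume identical RNG streams and match.
def run_pass(model, data_loader, max_iters=20):
    counts = []
    for i, data in enumerate(data_loader):
        with torch.no_grad():
            result = model(return_loss=False, rescale=True, **data)
        counts.append(result[0]["bbox_3d_results"][0].shape[0])  # detections per image
        if i == max_iters:
            break
    return counts

set_random_seed(0, deterministic=True)
first = run_pass(model, data_loader)
set_random_seed(0, deterministic=True)   # reset RNG state before the 2nd pass
second = run_pass(model, data_loader)
print(first == second)
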
Thanks in advance for the help!

monitoring perspective inference?

Hello author, your work is great! My custom dataset consists of scenes captured from a surveillance (monitoring) viewpoint, i.e. a tilted camera. I want to run inference on these images. Is there a way to do this just by modifying the configuration? (I do not have the corresponding annotations for retraining, so I cannot retrain at this time.)

Object detector training code

Hi there,
Thanks for sharing all the work and code, very interesting.

I was wondering if you have any code for training the object detectors mentioned in your papers? This repo only contains the code for the "second" stage of your work.

Kind regards,
Chris

coordinate issue

Dear hansheng,

Thanks for the second version. I am wondering how to change the coordinates in the following setting: x3d is fixed, and x2d and w2d are predicted. Currently the x2d is [0,1,2,....]×[0,1,2,....], while x3d and x2d lie within [0,1]. In my case, x3d is fixed as [0,1,2,....]×[0,1,2,....]×[depth]; how should I normalize it? Thank you very much.
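
To make the setup concrete, this is roughly the kind of per-axis rescaling I have in mind (just a sketch; W, H and max_depth are placeholder values, not names from the repo):

import torch

# Hypothetical sketch: a fixed x3d grid of [0..W-1] x [0..H-1] x [depth],
# rescaled so that every coordinate lies in [0, 1].
W, H, max_depth = 64, 48, 80.0
us, vs = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                        torch.arange(H, dtype=torch.float32), indexing='ij')
depth = torch.full_like(us, 40.0)                     # placeholder constant depth
x3d = torch.stack([us / (W - 1), vs / (H - 1), depth / max_depth], dim=-1)
print(x3d.shape)   # (W, H, 3), every coordinate now in [0, 1]
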

Why is CLS_ORIENTATION False for barriers in dataset?

I see that in the code:

CLS_ORIENTATION = [True, True, True, True, True, True, True, True, False, False]

CLS_ORIENTATION is set to false for barriers. Why is it this way?

nuScenes states about the TP metrics in the detection task:

We omit measurements for classes where they are not well defined: AVE for cones and barriers since they are stationary; AOE of cones since they do not have a well defined orientation; and AAE for cones and barriers since there are no attributes defined on these classes.

So orientation should be computed for barriers too.

Or did I misunderstand the meaning of CLS_ORIENTATION?

mini-dataset output JSON file

Thank you for the exciting paper and for providing the code!
I need the JSON output file of the mini dataset for my bachelor project; could you please share it with me?
Here is my email:
[email protected]

How to understand the weight w2d?

w2d : Shape (num_obj, num_points, 2)
I'm sorry, but after reading the papers and code for a long time, I still haven't understood the physical meaning of w2d.
For 2D matching, the weight is a score of shape (num_obj, num_points, 1), which represents the matching probability between a pair of 2D feature points. But why does this 3D-2D weight have two columns, and what is their specific meaning?
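
My current guess (which may well be wrong, so please correct me) is that the two columns weight the x and y components of each point's reprojection residual separately, roughly like the sketch below (project_fn and the cost form are placeholders, not code from the repo):

import torch

# Sketch of my current reading: column 0 of w2d weights the x residual and
# column 1 weights the y residual, so the PnP cost becomes an anisotropic
# weighted least squares over the 2D points.
def weighted_reproj_cost(x3d, x2d, w2d, pose, project_fn):
    # x3d: (num_obj, num_points, 3); x2d, w2d: (num_obj, num_points, 2)
    proj = project_fn(x3d, pose)                  # (num_obj, num_points, 2)
    residual = (proj - x2d) * w2d                 # weight x and y separately
    return residual.pow(2).sum(dim=(-2, -1))      # one scalar cost per object
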

Question about image shapes

Hello!
I have some questions about image shapes:

  • I see that the images coming out of the dataloader after the pipelines are at 1600x672 resolution. But the backbone is a ResNet-101 pretrained on ImageNet, which I thought accepts 224x224 images. If that were true, the images would be resized by the backbone while the ground truths stayed at the original scale, which confuses me. For example:
    I see in the code that the center predictions in the FCOS head are based on strides, so they would correspond to the 224x224 images, but the GT 2D centers come from the 1600x672 annotations, so they wouldn't match.
    So how does this work? My intuition is that ResNet isn't actually restricted to 224x224 here, but I couldn't find any evidence (see the sketch after this list).

  • In multiple places the code makes it seem as if the images in a batch are not of the same shape (but that can't be the case, right?):

    img_shapes = cam_intrinsic.new_tensor([img_meta['img_shape'][:2] for img_meta in img_metas])
    ori_shapes = cam_intrinsic.new_tensor([img_meta['ori_shape'][:2] for img_meta in img_metas])

    This part of the code I really don't understand because I think 'batch_input_shape' and 'img_shape' are always the same here, so this will be an all-zero mask:
    with default_timers['FCOS head forward time']:
        batch_size = mlvl_feats[0].size(0)
        input_img_h, input_img_w = img_metas[0]['batch_input_shape']
        img_masks = mlvl_feats[0].new_ones(
            (batch_size, input_img_h, input_img_w))
        for img_id in range(batch_size):
            img_h, img_w, _ = img_metas[img_id]['img_shape']
            img_masks[img_id, :img_h, :img_w] = 0
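
Regarding the first point above, here is a quick check I tried (a sketch only, using torchvision's stock ResNet-101 as a stand-in for the actual backbone and assuming a recent PyTorch/torchvision, so the exact numbers are just illustrative):

import torch
import torchvision

# The backbone is fully convolutional, so it accepts any input resolution;
# nothing gets resized to 224x224 internally. (Stand-in model, not the repo's.)
backbone = torchvision.models.resnet101(weights=None)
features = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()  # drop avgpool/fc

with torch.no_grad():
    feat = features(torch.randn(1, 3, 672, 1600))   # a padded 1600x672 input
print(feat.shape)   # torch.Size([1, 2048, 21, 50]) -- stride 32, spatial dims scale with input

# FCOS-style heads place one point per feature location; multiplying by the
# stride maps it back to input-image pixels, so predictions can be compared
# directly with the 1600x672 ground-truth centers.
stride = 32
ys, xs = torch.meshgrid(torch.arange(feat.shape[-2]), torch.arange(feat.shape[-1]),
                        indexing='ij')
centers = torch.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], dim=-1)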

Thanks in advance for the help!

Could a new type of loss be introduced for classes?

In EPro-PnP-Det_v2, if we want to improve the classification performance, could a new type of loss theoretically be introduced with the help of the deformable correspondence head?

I was thinking about how the yaw angle distribution corresponds to different classes. During the AMIS algorithm we could take the generated rotation distribution and evaluate it from 0 to 2π at some density, then feed this distribution to a simple network that classifies based on the yaw angle. Maybe this isn't suitable for all classes, but it might be useful to train a binary classifier for pedestrians and cones (which can be mixed up by classifiers based purely on image inputs) and add its scores to the corresponding ones in the FCOS detection head with some weighting.

Or could we just use these orientation log-probs for this purpose?

if 'orient' in debug:
    orient_bins = getattr(self.test_cfg, 'orient_bins', 128)
    orient_grid = torch.linspace(
        0, 2 * np.pi * (orient_bins - 1) / orient_bins,
        steps=orient_bins, device=x3d.device)
    # (orient_bins, num_obj, 4)
    pose_grid = pose_opt[None].expand(orient_bins, -1, -1).clone()
    pose_grid[..., 3] = orient_grid[None, :, None]
    cost = evaluate_pnp(
        x3d, x2d, w2d, pose_grid, self.camera, self.cost_fun, out_cost=True)[1]
    orient_logprob = cost.neg().log_softmax(dim=0) + np.log(orient_bins / (2 * np.pi))
    orient_logprob = orient_logprob.transpose(1, 0).cpu().numpy()
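
Concretely, what I'm imagining is something along these lines (a rough sketch with hypothetical names like YawClassifier; none of this is code from the repo):

import torch.nn as nn

# Hypothetical: treat the 128-bin yaw log-density as a feature vector, train a
# small classifier on it, and blend its scores into the FCOS classification
# scores with some weighting.
class YawClassifier(nn.Module):
    def __init__(self, orient_bins=128, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(orient_bins, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_classes))

    def forward(self, orient_logprob):        # (num_obj, orient_bins)
        return self.mlp(orient_logprob)

# Training-time usage (assuming orient_logprob is kept as a tensor in the graph,
# i.e. not the detached .cpu().numpy() debug output above):
# cls_logits = yaw_classifier(orient_logprob)
# loss_yaw_cls = nn.functional.cross_entropy(cls_logits, gt_labels)
# fused_scores = fcos_scores + 0.1 * cls_logits.softmax(dim=-1)  # arbitrary weight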

This is just an idea, and my question is: could this theoretically work? Can it be backpropagated at all?

Thanks in advance for the answer, and for the previous ones too, they've been very useful.
