roytseng-tw / detectron.pytorch

A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available.

License: MIT License

Python 83.46% Shell 0.37% Cuda 8.39% C 7.17% C++ 0.36% MATLAB 0.26%

mask-rcnn pytorch detection pose-estimation segmentation detectron

detectron.pytorch's Issues

ImportError: dynamic module does not define module export function (PyInit_bbox)

Hi, @roytseng-tw
I hit the import error below on Python 3.5 and need help. Thanks.

$ python3 tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4.yml --use_tfboard --bs 4 --nw 4

Traceback (most recent call last):
  File "tools/train_net_step.py", line 25, in <module>
    from datasets.roidb import combined_roidb_for_training
  File "/home/yuekaiyu/code/Detectron.pytorch/lib/datasets/roidb.py", line 27, in <module>
    import utils.boxes as box_utils
  File "/home/yuekaiyu/code/Detectron.pytorch/lib/utils/boxes.py", line 52, in <module>
    import utils.bbox as cython_bbox
ImportError: dynamic module does not define module export function (PyInit_bbox)

hangs in training

Thanks for your code!
I was able to train configs/e2e_mask_rcnn_R-50-FPN_1x.yaml with (batch_size, learning_rate) = (8, 0.01) for up to about 60K iterations. So far, the losses look quite similar to your benchmark.
Training speed is also comparable to Detectron.
The problem is that training hangs at a random iteration that differs from run to run: sometimes after 1K, 5K, or 60K iterations.
I'm using 4 V100 GPUs.

Any thoughts?

cuda runtime error: out of memory after 15K iteration of 4 GPUs training

Hi @roytseng-tw, I successfully ran train_net.py with the ResNet-C4 config on 4 GPUs, but after 15K training steps it failed with a CUDA runtime error: out of memory.

I suspect this is caused by GPU memory growing with the dynamic graph, for example through an accumulation such as loss+=xxx. Can you give some advice on solving this issue?

Negative areas found

When I try to run the training code, I get the following error:

RuntimeWarning: Negative areas found: 3

I'm running: e2e_mask_rcnn_R-101-FPN_2x.yaml
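
A warning like this usually points at boxes whose corners are inverted (x2 < x1 or y2 < y1). The sketch below is only an illustration of how such annotations could be located before training. It assumes boxes are given as (x1, y1, x2, y2) tuples, and `find_inverted_boxes` is a hypothetical helper, not a function from this repository.

```python
def find_inverted_boxes(boxes):
    """Return the indices of (x1, y1, x2, y2) boxes with inverted corners."""
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        # A box is suspicious when its right/bottom edge lies before its
        # left/top edge, which gives a negative width or height.
        if x2 < x1 or y2 < y1:
            bad.append(i)
    return bad


# Illustrative data: the second and third boxes are inverted.
boxes = [(10, 10, 50, 40), (60, 20, 30, 80), (5, 90, 25, 70)]
bad_indices = find_inverted_boxes(boxes)
print(bad_indices)
```

Printing the offending indices makes it easier to trace the warning back to specific annotations in the dataset file.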

About resume

Hello, I tried to resume training with this command:

 python tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_faster_rcnn_R-101-FPN_1x.yaml --use_tfboard --load_ckpt  Outputs/e2e_faster_rcnn_R-101-FPN_1x/May02-12-15-12_faster_step/ckpt/model_step69999.pth --resume

However, it throws a runtime error:

Traceback (most recent call last):
  File "tools/train_net_step.py", line 367, in main
    optimizer.step()
  File "/home/philokey/.virtualenvs/py3/lib/python3.5/site-packages/torch/optim/sgd.py", line 94, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:271

How can I solve this problem?

Undefined names: CUDA, CylinderGridGenFunction

flake8 testing of https://github.com/roytseng-tw/mask-rcnn.pytorch on Python 3.6

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./lib/setup.py:93:48: F821 undefined name 'CUDA'
            self.set_executable('compiler_so', CUDA['nvcc'])
                                               ^
./lib/model/roi_crop/modules/gridgen.py:39:18: F821 undefined name 'CylinderGridGenFunction'
        self.f = CylinderGridGenFunction(self.height, self.width, lr=lr)
                 ^
2     F821 undefined name 'CylinderGridGenFunction'
2

Combine train and val?

Hi, would you consider combining training and validation?
The standard pattern would be: every epoch, run validation before checkpointing, and save the checkpoint only if the validation accuracy improves (or the loss drops).
I know this needs some work.
But it would be very convenient and make it easy to get started.

I am working on it now, but I think I can only run validation after each checkpoint is generated during training (based on your test_net.py: load the checkpoint and validate it), i.e. model.train->model.save->model.eval.
It would be more efficient if you could provide an example that does model.train->model.eval->model.save.
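
The train->eval->save order described above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not code from the repository: `train_one_epoch`, `evaluate`, and `save_checkpoint` are hypothetical callables supplied by the caller, and a higher validation score is assumed to be better.

```python
def train_with_validation(train_one_epoch, evaluate, save_checkpoint, num_epochs):
    """Train, validate, and save only when the validation score improves."""
    best_score = float("-inf")
    saved_epochs = []
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
        score = evaluate(epoch)  # validation runs before any checkpoint is written
        if score > best_score:
            best_score = score
            save_checkpoint(epoch)
            saved_epochs.append(epoch)
    return best_score, saved_epochs


# Stand-in callables so the loop can be exercised without a model.
fake_scores = [0.2, 0.5, 0.4, 0.6]
best, saved = train_with_validation(
    train_one_epoch=lambda epoch: None,
    evaluate=lambda epoch: fake_scores[epoch],
    save_checkpoint=lambda epoch: None,
    num_epochs=len(fake_scores),
)
print(best, saved)
```

If the validation metric is a loss rather than an accuracy, the comparison would be reversed.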

Fine-tuning on Detectron pre-trained weights

Thanks for the great work!

I wonder if I can load pre-trained weights from Detectron and fine-tune it with this code?

Error when running test on one GPU when multiple are available.

Hi, I tried to run a test today and got the following error:

Traceback (most recent call last):
  File "tools/test_net.py", line 108, in <module>
    check_expected_results=True)
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/core/test_engine.py", line 158, in test_net_on_dataset
    args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/core/test_engine.py", line 253, in test_net
    cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/core/test.py", line 66, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/core/test.py", line 127, in im_detect_bbox
    return_dict = model(**inputs)
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/nn/parallel/data_parallel.py", line 82, in forward
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
  File "/home/rizhiy/object-detection/Detectron.pytorch/lib/nn/parallel/data_parallel.py", line 82, in <listcomp>
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range

The command I used to run the test: python tools/test_net.py --cfg configs/e2e_mask_rcnn_R-101-FPN_2x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-101-FPN_2x/Apr19-11-34-35_devbox/ckpt/model_7_29315.pth --dataset coco2017.

It appears that there is some inconsistency in the number of devices during setup.

Not sure what needs to be fixed, but as a workaround, you can just restrict python to one GPU with CUDA_VISIBLE_DEVICES=0.
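
The workaround above can also be applied from Python when launching the test script as a child process. The sketch below only builds the environment dictionary. `env_with_visible_gpus` is a hypothetical helper, and passing the result to `subprocess.run(..., env=env)` is one possible way to use it.

```python
import os


def env_with_visible_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when it initializes, so it has to be set
    # before the process that uses the GPU starts.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


env = env_with_visible_gpus([0], base_env={"PATH": "/usr/bin"})
print(env["CUDA_VISIBLE_DEVICES"])
```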

error when training using one GPU when multiple GPUs are available

I have 1 trivial GPU (GPU0) and 4 GPUs (1,2,3,4) in my machine. If I do not specify which GPU to use and run:
python tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_faster_rcnn_R-101-FPN_1x.yaml
the error is:
path/miniconda3/lib/python3.6/site-packages/torch/cuda/__init__.py:116: UserWarning:
    Found GPU1 Quadro K600 which is of cuda capability 3.0.
    PyTorch no longer supports this GPU because it is too old.

  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
INFO train_net_step.py: 361: Training starts !
INFO net.py: 72: Changing learning rate 0.000000 -> 0.006667
Traceback (most recent call last):
  File "tools/train_net_step.py", line 437, in <module>
    main()
  File "tools/train_net_step.py", line 407, in main
    net_outputs = maskRCNN(**input_data)
  File "path/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "path/CODE/Detectron.pytorch/lib/nn/parallel/data_parallel.py", line 82, in forward
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
  File "path/CODE/Detectron.pytorch/lib/nn/parallel/data_parallel.py", line 82, in <listcomp>
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range

Do you have plans to support R-FCN or Light-Head-RCNN?

Don't have good results when use pre-trained Detectron model

When I run infer_simple.py with a pre-trained Detectron model, I don't get good results. The command is:
python3 tools/infer_simple.py --dataset coco --cfg configs/e2e_mask_rcnn_R-101-FPN_2x.yaml --load_detectron configs/e2e_mask_rcnn_R-101-FPN_2x.pkl --image_dir demo/sample_images --output_dir demo/out
The object scores are very low, around 0.08, so I can't get accurate results.
What's wrong?

Problem while implementing light_head_rcnn

loss_bbox does not converge. The other losses (loss_cls, loss_rpn_cls, loss_bbox) converge. Can I push the code to you for debugging?

Undefined names

See #5

flake8 testing of https://github.com/roytseng-tw/Detectron.pytorch on Python 3.6.3

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./lib/core/test.py:315:13: F821 undefined name 'image_utils'
    im_ar = image_utils.aspect_ratio_rel(im, aspect_ratio)
            ^
./lib/core/test.py:402:18: F821 undefined name 'im_conv_body_only'
    im_scale_i = im_conv_body_only(model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE)
                 ^
./lib/core/test.py:465:16: F821 undefined name 'im_conv_body_only'
    im_scale = im_conv_body_only(model, im_hf, target_scale, target_max_size)
               ^
./lib/core/test.py:482:20: F821 undefined name 'im_conv_body_only'
        im_scale = im_conv_body_only(model, im, target_scale, target_max_size)
                   ^
./lib/core/test.py:491:13: F821 undefined name 'image_utils'
    im_ar = image_utils.aspect_ratio_rel(im, aspect_ratio)
            ^
./lib/core/test.py:499:20: F821 undefined name 'im_conv_body_only'
        im_scale = im_conv_body_only(
                   ^
./lib/core/test.py:569:16: F821 undefined name 'im_conv_body_only'
    im_scale = im_conv_body_only(model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE)
               ^
./lib/core/test.py:640:16: F821 undefined name 'im_conv_body_only'
    im_scale = im_conv_body_only(model, im_hf, target_scale, target_max_size)
               ^
./lib/core/test.py:658:20: F821 undefined name 'im_conv_body_only'
        im_scale = im_conv_body_only(model, im, target_scale, target_max_size)
                   ^
./lib/core/test.py:669:13: F821 undefined name 'image_utils'
    im_ar = image_utils.aspect_ratio_rel(im, aspect_ratio)
            ^
./lib/core/test.py:677:20: F821 undefined name 'im_conv_body_only'
        im_scale = im_conv_body_only(
                   ^
11    F821 undefined name 'image_utils'
11

pytorch 0.4 support ?

Hi, thanks for your great work!
Will Detectron.pytorch support pytorch>=0.4?

How to train with a smaller net-input size such as (640,480)?

Thanks for sharing your great work.
By printing the image size, I see 768x1344 for the resnet50-fpn model.
However, in my case, I want to retrain this network with a smaller net-input size, such as 640x480.

I simply tried to modify the config file e2e_mask_rcnn_R-50-FPN_2x.yaml as follows:

But during training, Python reported a double free or corruption error:

TensorBoard shows that this error occurred after 120 steps, not at the start of training.

Can you give me some advice on solving this error?
Have you trained with a smaller net-input size?

Train Using Precomputed RPN Proposals?

Do you plan to support training using precomputed RPN proposals?

Doubts about the loss_cls and accuracy_cls calculation

Hi:
I have some doubts about how loss_cls and accuracy_cls are computed in the fast_rcnn_losses function in lib/modeling/fast_rcnn_heads.py.
As I understand it, the calculation below seems to assume that cls_score and rois_label have the same length and the same order, e.g. pred [0,1,1], label [0,1,2] (just to illustrate the idea).
But in practice it is more like pred [0,1,2,3], label [0,1,2] (the number of predictions may be larger or smaller, and the order may not match).
In my experience with matterport's Mask R-CNN, before the class loss and accuracy are calculated, there is an operation that matches predictions to labels based on the nearest box. Basically, it takes the nearest predicted bbox as the 'right' prediction for each label box, which makes sense.

I did not find such an operation in this code yet, so I guess I overlooked or misunderstood something (I am new to Mask R-CNN/Faster R-CNN).
So my real question is: how do you make sure the predicted classes and the label classes match before calculating the loss/accuracy?
Thanks.

def fast_rcnn_losses(cls_score, bbox_pred, label_int32, bbox_targets,
                     bbox_inside_weights, bbox_outside_weights):
    device_id = cls_score.get_device()
    rois_label = Variable(torch.from_numpy(label_int32.astype('int64'))).cuda(device_id)
    loss_cls = F.cross_entropy(cls_score, rois_label)
    ........
    # class accuracy
    cls_preds = cls_score.max(dim=1)[1].type_as(rois_label)
    accuracy_cls = cls_preds.eq(rois_label).float().mean(dim=0)

    return loss_cls, loss_bbox, accuracy_cls
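
For context on the question above: in Faster R-CNN-style training, a class target is typically assigned to each sampled RoI before the head runs, so scores and labels have the same length and order by construction. The sketch below illustrates that idea only. It is not this repository's implementation: `iou` and `assign_labels` are hypothetical helpers, the 0.5 foreground threshold is an assumption, and label 0 is assumed to mean background.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(rois, gt_boxes, gt_classes, fg_thresh=0.5):
    """Give every RoI the class of its best-overlapping ground-truth box."""
    labels = []
    for roi in rois:
        best_iou, best_cls = 0.0, 0
        for box, cls in zip(gt_boxes, gt_classes):
            overlap = iou(roi, box)
            if overlap > best_iou:
                best_iou, best_cls = overlap, cls
        # RoIs that overlap no ground-truth box well enough become background.
        labels.append(best_cls if best_iou >= fg_thresh else 0)
    return labels


gt_boxes = [(0, 0, 10, 10), (20, 20, 40, 40)]
gt_classes = [1, 2]
rois = [(1, 1, 10, 10), (22, 22, 40, 40), (50, 50, 60, 60), (0, 0, 4, 4)]
labels = assign_labels(rois, gt_boxes, gt_classes)
print(labels)
```

Because one label is produced per RoI, a cross-entropy over per-RoI scores needs no further matching step.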

mismatch of shape while loading from .pkl file

I tried inference with e2e_keypoint_rcnn_R-50-FPN_s1x.yaml using pkl file available from Detectron @https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md

CUDA :9.0
GPU: K80
Pytorch: 0.4.0
python:2.7

Got this error:

  File "tools/infer_simple.py", line 176, in <module>
    main()
  File "tools/infer_simple.py", line 128, in main
    load_detectron_weight(maskRCNN, args.load_detectron)
  File "/home/tester/detectron/mask-rcnn.pytorch/lib/utils/detectron_weight_helper.py", line 22, in load_detectron_weight
    p_tensor.copy_(torch.Tensor(src_blobs[d_name]))
RuntimeError: The expanded size of the tensor (81) must match the existing size (2) at non-singleton dimension 0

rpn_cls_prob overwritten by rpn_bbox_pred

@roytseng-tw Nice work BTW,

https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/modeling/rpn_heads.py#L106-L109
The rpn_cls_prob values are overwritten by rpn_bbox_pred.

Is this a bug, or is it intentional?

Cannot run inference in Jupyter

@roytseng-tw Hi,
I ran inference with your infer_simple.py successfully.
In the same environment (one GPU, same machine, same folder path), I tried to run inference in Jupyter, but it gives me an error in data_parallel. Any idea?

This is what I do in Jupyter:
I use the following to load the pre-trained model, which succeeds and returns the model structure.

cfg.MODEL.NUM_CLASSES = 3
cfg_file = 'configs/e2e_mask_rcnn_R-50-C4_1x.yaml'
load_name= '/home/ubuntu/Detectron_master/Outputs/e2e_mask_rcnn_R-50-C4_1x/May04-11-28-11_ubuntu16_step/ckpt/model_step19999.pth'

cfg_from_file(cfg_file)
assert_and_infer_cfg()

maskRCNN = Generalized_RCNN()
maskRCNN.cuda()
checkpoint = torch.load(load_name, map_location=lambda storage, loc: storage)
net_utils.load_ckpt(maskRCNN, checkpoint['model'])
maskRCNN = mynn.DataParallel(maskRCNN, cpu_keywords=['im_info', 'roidb'],
                                 minibatch=True)
maskRCNN.eval()

However, when I then call it through im_detect_all:

im=cv2.imread('test.jpg')
cls_boxes, cls_segms, cls_keyps = im_detect_all(maskRCNN, im, timers=timers)

It gives me an error at mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()]):

IndexError                                Traceback (most recent call last)
<ipython-input-5-f3b1e8bf1385> in <module>()
      9 timers = defaultdict(Timer)
     10 print('entry[image]',entry['image'])
---> 11 cls_boxes, cls_segms, cls_keyps = im_detect_all(maskRCNN, im, timers=timers)

~/Detectron_master/lib/core/test.py in im_detect_all(model, im, box_proposals, timers)
     68     else:
     69         scores, boxes, im_scale, blob_conv = im_detect_bbox(
---> 70             model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
     71     timers['im_detect_bbox'].toc()
     72 

~/Detectron_master/lib/core/test.py in im_detect_bbox(model, im, target_scale, target_max_size, boxes)
    133     inputs['im_info'] = [Variable(torch.from_numpy(inputs['im_info']), volatile=True)]
    134 
--> 135     return_dict = model(**inputs)
    136 
    137     if cfg.MODEL.FASTER_RCNN:

~/.local/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

~/Detectron_master/lib/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
     83                 mini_inputs = [x[i] for x in inputs]
     84 
---> 85                 mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
     86                 # print('mini_kwargs',mini_kwargs)
     87                 a, b = self._minibatch_scatter(device_id, *mini_inputs, **mini_kwargs)

~/Detectron_master/lib/nn/parallel/data_parallel.py in <listcomp>(.0)
     83                 mini_inputs = [x[i] for x in inputs]
     84 
---> 85                 mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
     86                 # print('mini_kwargs',mini_kwargs)
     87                 a, b = self._minibatch_scatter(device_id, *mini_inputs, **mini_kwargs)

IndexError: list index out of range

Simple demo.py file for an example usage

Do you have a simple demo.py file that allows you to load a pretrained model and predict on a single example image ?

Thanks.

RetinaNet

Is RetinaNet (or any other single-stage detector) training/inference supported? I saw some fields that correspond to RetinaNet in config.py, hence this question.

Thanks,

Error at testing with Detectron pretrained ResNet-50 architecture

Expected results

I was trying to test the Detectron ResNet50 architecture with pretrained Caffe weights on the COCO val 2017 set and got the error below.

Update: The Detectron repo was updated with a "group batch norm" feature 12 days ago (https://github.com/facebookresearch/Detectron/tree/master/configs/04_2018_gn_baselines). I believe they also changed the model files and now only provide pkl files for the new baselines (https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md). If my assumption is true, can you upload the previous .pkl files somewhere, so that we can continue using your PyTorch implementation?

Actual results

loading annotations into memory...
Done (t=0.70s)
creating index...
index created!
loading annotations into memory...
Done (t=0.93s)
creating index...
index created!
INFO test_engine.py: 335: loading detectron weights data/pretrained_model/R-50.pkl
Traceback (most recent call last):
  File "tools/test_net.py", line 112, in <module>
    check_expected_results=True)
  File "/home/john/Desktop/cvav_proj/detectorn_roytseng/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/home/john/Desktop/cvav_proj/detectorn_roytseng/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/home/john/Desktop/cvav_proj/detectorn_roytseng/mask-rcnn.pytorch/lib/core/test_engine.py", line 158, in test_net_on_dataset
    args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id
  File "/home/john/Desktop/cvav_proj/detectorn_roytseng/mask-rcnn.pytorch/lib/core/test_engine.py", line 232, in test_net
    model = initialize_model_from_cfg(args, gpu_id=gpu_id)
  File "/home/john/Desktop/cvav_proj/detectorn_roytseng/mask-rcnn.pytorch/lib/core/test_engine.py", line 336, in initialize_model_from_cfg
    load_detectron_weight(model, args.load_detectron)
  File "/home/john/Desktop/cvav_proj/detectorn_roytseng/mask-rcnn.pytorch/lib/utils/detectron_weight_helper.py", line 21, in load_detectron_weight
    p_tensor.copy_(torch.Tensor(src_blobs[d_name]))
KeyError: 'fpn_inner_res5_2_sum_w'

Detailed steps to reproduce

I've downloaded ResNet-50 model file from Detectron github page (https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/MSRA/R-50.pkl).

The command I ran:

python tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-FPN_1x.yaml --load_detectron data/pretrained_model/R-50.pkl

I also get KeyError: 'conv_rpn_w' when I change the config to the R-50-C4_1x or R-50-C4_2x files.

System information

Operating system: Ubuntu 16.04
CUDA version: 9
cuDNN version: ?
GPU models (for all devices if they are not all the same): 1050 Ti
python version: 3.6.4 (Anaconda custom)
pytorch version: 0.3.4
Anything else that seems relevant: ?

coco eval performance

Excellent work! Have you trained from scratch and how's the performance on COCO evaluation? BTW, could you share some pre-trained weights to test on? Thanks a lot!

batch size, lr, and schedule.

According to the documentation, if I understand correctly, in some settings you changed the batch size, and thus lr proportionally, but you did not change the schedule (in terms of "iterations"). You should scale the schedule proportionally and let the solver see the same total number of images. To match curves, the x-axis should also be # of images (or equivalently, epochs), but not iterations.
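
The proportional scaling described above can be written out as a short calculation. This is a sketch under stated assumptions: `scale_schedule` is a hypothetical helper, and the base values in the example (learning rate 0.02, batch size 16, 90000 iterations, steps at 0/60000/80000) are illustrative inputs rather than values taken from this issue.

```python
def scale_schedule(base_lr, base_batch, base_max_iter, base_steps, new_batch):
    """Scale lr with the batch size and stretch the schedule to keep epochs fixed."""
    factor = new_batch / base_batch
    lr = base_lr * factor
    # Fewer images per iteration means more iterations to see the same
    # total number of images.
    max_iter = int(round(base_max_iter / factor))
    steps = [int(round(step / factor)) for step in base_steps]
    return lr, max_iter, steps


lr, max_iter, steps = scale_schedule(
    base_lr=0.02,
    base_batch=16,
    base_max_iter=90000,
    base_steps=[0, 60000, 80000],
    new_batch=8,
)
print(lr, max_iter, steps)
```

With the batch size halved, the learning rate halves and every iteration count doubles, so the total number of images seen stays the same.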

Dataloader throws error during iter()

Hello, I'm trying to get the repo to work with PyTorch 0.4.
While most of the changes are rather trivial, the sampler this repo uses returns both an index and an aspect ratio (correct me if it is something else, but it is a tuple and the batch sampler assumes an integer). There isn't any straightforward way to fix this with the new dataloader structure introduced in pytorch/pytorch#1867.
What do you think is the best way to make it compatible without breaking anything?
Thank you
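
One possible direction, sketched here as an assumption rather than a tested fix, is a batch sampler that groups the sampler's items without casting them to int. `TupleBatchSampler` is a hypothetical class. An object like it could be passed through the `batch_sampler` argument of `torch.utils.data.DataLoader`, provided the dataset's `__getitem__` accepts the (index, ratio) tuple.

```python
class TupleBatchSampler:
    """Group items from a sampler into batches without converting them to int."""

    def __init__(self, sampler, batch_size, drop_last=False):
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for item in self.sampler:
            batch.append(item)  # keep (index, ratio) tuples intact
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch and not self.drop_last:
            yield batch

    def __len__(self):
        n = len(self.sampler)
        if self.drop_last:
            return n // self.batch_size
        return (n + self.batch_size - 1) // self.batch_size


# A plain list stands in for the repository's sampler in this example.
fake_sampler = [(0, 1.0), (1, 0.75), (2, 1.5)]
batches = list(TupleBatchSampler(fake_sampler, batch_size=2))
print(batches)
```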

Python 2 support

Will you add Python 2 support for this repo? In general, I have done the following three things to make the infer_simple.py script work for python2.

fix super: 3to2 -f super -w .
rename utils.collections to utils.collections2 to avoid conflicting with the official collections library
pickle.load(fp, encoding='latin1') -> pickle.load(fp)
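
The third change in the list above could also be written so that one code path serves both interpreters. This is a minimal sketch: `load_pickle_compat` is a hypothetical helper, and the in-memory buffer only stands in for a real weights file.

```python
import io
import pickle
import sys


def load_pickle_compat(fp):
    """Load a pickle on Python 2 or 3, decoding Python 2 strings on Python 3."""
    if sys.version_info[0] >= 3:
        # The encoding argument exists only on Python 3.
        return pickle.load(fp, encoding="latin1")
    return pickle.load(fp)


# Round-trip a small dictionary through an in-memory buffer.
buf = io.BytesIO(pickle.dumps({"conv1_w": [1.0, 2.0]}, protocol=2))
blobs = load_pickle_compat(buf)
print(blobs)
```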

An example repo is at https://github.com/taoari/Detectron.pytorch/commits/dev, will you add full support of this repo for python 2?

'Detectron.pytorch/lib/utils/detectron_weight_helper.py' can be used for mask inference, but not for keypoint inference.

Mask inference works correctly:
(cuipt) cui@DemonHunters:~/mask-rcnn.pytorch$ python tools/infer_simple.py --dataset coco --cfg configs/e2e_mask_rcnn_R-101-FPN_2x.yaml --load_detectron data/model_final.pkl --image_dir demo/sample_images
Called with args:
Namespace(cfg_file='configs/e2e_mask_rcnn_R-101-FPN_2x.yaml', cuda=True, dataset='coco', image_dir='demo/sample_images', images=None, load_ckpt=None, load_detectron='data/model_final.pkl', merge_pdfs=True, output_dir='infer_outputs', set_cfgs=[])
load cfg from file: configs/e2e_mask_rcnn_R-101-FPN_2x.yaml
loading detectron weights data/model_final.pkl
img 0 person 0.999168
img 1 suitcase 0.741572 chair 0.996991 chair 0.995423 chair 0.974603 chair 0.902452 chair 0.748457 book 0.762648 chair 0.9888 clock 0.992333
img 2 train 0.99889 person 0.826093
img 3 car 0.994156 car 0.999019 truck 0.839317 car 0.995135 car 0.9096 traffic light 0.984154 car 0.99167 car 0.995001 car 0.981888
However, keypoint inference fails. How should 'detectron_weight_helper.py' be modified to support keypoints?
(cuipt) cui@DemonHunters:~/mask-rcnn.pytorch$ python tools/infer_simple.py --dataset keypoints_coco \
    --cfg configs/e2e_mask_rcnn_R-101-FPN_2x.yaml \
    --load_detectron data/model_final.pkl \
    --image_dir demo/sample_images_keypoints
Called with args:
Namespace(cfg_file='configs/e2e_mask_rcnn_R-101-FPN_2x.yaml', cuda=True, dataset='keypoints_coco', image_dir='demo/sample_images_keypoints', images=None, load_ckpt=None, load_detectron='data/model_final.pkl', merge_pdfs=True, output_dir='infer_outputs', set_cfgs=[])
load cfg from file: configs/e2e_mask_rcnn_R-101-FPN_2x.yaml
loading detectron weights data/model_final.pkl
Traceback (most recent call last):
  File "tools/infer_simple.py", line 176, in <module>
    main()
  File "tools/infer_simple.py", line 128, in main
    load_detectron_weight(maskRCNN, args.load_detectron)
  File "/home/cui/mask-rcnn.pytorch/lib/utils/detectron_weight_helper.py", line 21, in load_detectron_weight
    p_tensor.copy_(torch.Tensor(src_blobs[d_name]))
RuntimeError: invalid argument 2: sizes do not match at /pytorch/torch/lib/THC/generic/THCTensorCopy.c:51

ImportError: no defaultdict

When I run infer_simple.py, I get this error in "utils/misc.py" at the line
from collections import defaultdice,Iterable
How can I solve this problem?
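
For reference, the standard-library names are `defaultdict` (the line as quoted above spells it `defaultdice`) and `Iterable`, which lives in `collections.abc` on Python 3. The sketch below shows an import form that works on both Python 2 and 3. Whether the failure in this report instead comes from a local module named `collections` shadowing the standard library is an assumption that would need checking.

```python
from collections import defaultdict

try:
    from collections.abc import Iterable  # Python 3 location
except ImportError:
    from collections import Iterable  # Python 2 fallback

# Quick check that both names behave as expected.
counts = defaultdict(int)
for name in ["person", "car", "person"]:
    counts[name] += 1
print(dict(counts), isinstance([], Iterable))
```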

Support for different class ckpt loaded?

Hi:
I used a customized dataset with 3 classes. Training is fine and the checkpoint is generated.
But at test time, loading the checkpoint fails: the output dimensions of the checkpoint and the model do not match.

maskRCNN = Generalized_RCNN() assumes 81 classes (the COCO classes), while my checkpoint is based on 3 classes.

What I usually do is change the output layer of the model to fit a different number of classes. But Mask R-CNN is more complicated than a "normal" model.

Could you show me which layers should be changed to fit the customized num_class?

RuntimeError: While copying the parameter named Mask_Outs.classify.weight, whose dimensions in the model are torch.Size([81, 256, 1, 1]) and whose dimensions in the checkpoint are torch.Size([3, 256, 1, 1]).

Thanks
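
To see which layers depend on the class count, the shapes in the model and in the checkpoint can be compared name by name. The sketch below works on plain dictionaries of shapes, and `find_shape_mismatches` is a hypothetical helper. With PyTorch, the shapes could presumably be collected from the `state_dict()` of the model and from the loaded checkpoint, which is an assumption not shown here. The second entry in the example is made up for illustration.

```python
def find_shape_mismatches(model_shapes, ckpt_shapes):
    """Return {name: (model_shape, ckpt_shape)} for parameters whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in ckpt_shapes.items():
        if name not in model_shapes:
            continue  # parameters missing from the model are out of scope here
        model_shape = tuple(model_shapes[name])
        if model_shape != tuple(ckpt_shape):
            mismatches[name] = (model_shape, tuple(ckpt_shape))
    return mismatches


# The first entry uses the shapes from the error message above.
model_shapes = {
    "Mask_Outs.classify.weight": (81, 256, 1, 1),
    "some.shared.weight": (256, 256, 3, 3),
}
ckpt_shapes = {
    "Mask_Outs.classify.weight": (3, 256, 1, 1),
    "some.shared.weight": (256, 256, 3, 3),
}
mismatches = find_shape_mismatches(model_shapes, ckpt_shapes)
print(mismatches)
```

Every name reported this way is a layer whose size follows the number of classes, so the list shows what would have to agree between the model configuration and the checkpoint.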

Unable to Properly Load Classes

I am trying to train a model using a custom JSON dataset that I converted to the COCO format. I've adapted the code given in train.py, but I am unable to load the classes properly. Regardless of what number of classes I specify in the config file, I am getting this same error. Is there an obvious mistake that I am making? Thank you!

timers = defaultdict(Timer)

### Dataset ###
timers['roidb'].tic()
roidb, ratio_list, ratio_index = combined_roidb_for_training(cfg.TRAIN.DATASETS, cfg.TRAIN.PROPOSAL_FILES)
timers['roidb'].toc()
train_size = len(roidb)
logger.info('{:d} roidb entries'.format(train_size))
logger.info('Takes %.2f sec(s) to construct roidb', timers['roidb'].average_time)

sampler = MinibatchSampler(ratio_list, ratio_index)
dataset = RoiDataLoader(
    roidb,
    cfg.MODEL.NUM_CLASSES,
    training=True)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=args.batch_size,
    sampler=sampler,
    num_workers=cfg.DATA_LOADER.NUM_THREADS,
    collate_fn=collate_minibatch)

assert_and_infer_cfg()

The output:

INFO:datasets.json_dataset:Loading cached gt_roidb from /home/cees2/Image Project/Code/mask-rcnn.pytorch/data/cache/init_data_gt_roidb.pkl
INFO:datasets.roidb:Appending horizontally-flipped training examples...
INFO:datasets.roidb:Loaded dataset: init_data
INFO:datasets.roidb:Filtered 120 roidb entries: 120 -> 0
INFO:datasets.roidb:Computing image aspect ratios and ordering the ratios...
INFO:datasets.roidb:done
INFO:datasets.roidb:Computing bounding-box regression targets...
INFO:datasets.roidb:done
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
[]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-f5a66a92f826> in <module>()
      3 ### Dataset ###
      4 timers['roidb'].tic()
----> 5 roidb, ratio_list, ratio_index = combined_roidb_for_training(cfg.TRAIN.DATASETS, cfg.TRAIN.PROPOSAL_FILES)
      6 timers['roidb'].toc()
      7 train_size = len(roidb)

~/Image Project/Code/mask-rcnn.pytorch/lib/datasets/roidb.py in combined_roidb_for_training(dataset_names, proposal_files)
     77     logger.info('done')
     78 
---> 79     _compute_and_log_stats(roidb)
     80 
     81     return roidb, ratio_list, ratio_index

~/Image Project/Code/mask-rcnn.pytorch/lib/datasets/roidb.py in _compute_and_log_stats(roidb)
    229 def _compute_and_log_stats(roidb):
    230     print(roidb)
--> 231     classes = roidb[0]['dataset'].classes
    232     char_len = np.max([len(c) for c in classes])
    233     hist_bins = np.arange(len(classes) + 1)

IndexError: list index out of range
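
The log above shows "Filtered 120 roidb entries: 120 -> 0", so the list is already empty when roidb[0] is indexed. A guard like the following would report that more directly. It is a sketch only: `require_nonempty_roidb` is a hypothetical helper, and the suggested cause in the message is a guess.

```python
def require_nonempty_roidb(roidb, dataset_name):
    """Fail early with a descriptive message when every entry was filtered out."""
    if len(roidb) == 0:
        raise ValueError(
            "No usable roidb entries for dataset %r after filtering; "
            "check that the annotations contain valid boxes and category ids."
            % dataset_name
        )
    return roidb


try:
    require_nonempty_roidb([], "init_data")
    message = None
except ValueError as err:
    message = str(err)
print(message)
```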

Hi roytseng, may I publish a project based on your 'Detectron.pytorch' project in my GitHub repository?

Inspired by '4K Video Demo by Karol Majek' at https://github.com/matterport/Mask_RCNN#projects-using-this-model, and based on your 'Detectron.pytorch' project, I built a toy project.
Compared with Karol Majek's, my project blends human masks and human keypoints together, which looks fun, so I'd like to publish it in my GitHub repository. May I?
The demo video is below.
Could you watch the demo video on youku.com? http://v.youku.com/v_show/id_XMzU2MDYyNDQ5Mg==.html?spm=a2hzp.8244740.0.0
Looking forward to hearing from you soon.

loss_rcnn_bbox is NaN

I was able to successfully train a model with a custom dataset using the command line arguments and the provided train.py file. I refactored the train.py code to run with hardcoded variables instead of command line arguments. Yet in my own script, after the first step, the loss_rcnn_bbox values are NaN, which then crashes the program. What could be the possible causes?

        outputs = maskRCNN(**input_data)

        rois_label = outputs['rois_label']
        cls_score = outputs['cls_score']
        bbox_pred = outputs['bbox_pred']
        loss_rpn_cls = outputs['loss_rpn_cls'].mean()
        loss_rpn_bbox = outputs['loss_rpn_bbox'].mean()
        loss_rcnn_cls = outputs['loss_rcnn_cls'].mean()
        print(outputs['loss_rcnn_bbox'].mean()) #this value is Nan
        loss_rcnn_bbox = outputs['loss_rcnn_bbox'].mean()
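
When hunting for the first step at which a loss goes bad, a small check on the loss values can stop the run before the NaN spreads into the weights. The sketch below assumes the losses have already been converted to Python floats, and `first_non_finite` is a hypothetical helper.

```python
import math


def first_non_finite(losses):
    """Return the name of the first loss that is NaN or infinite, else None."""
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None


# Illustrative values for one training step.
step_losses = {
    "loss_rpn_cls": 0.31,
    "loss_rpn_bbox": 0.12,
    "loss_rcnn_cls": 0.85,
    "loss_rcnn_bbox": float("nan"),
}
bad_loss = first_non_finite(step_losses)
print(bad_loss)
```

Running such a check every step, together with logging the offending batch, narrows the search to the inputs that trigger the problem.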

Poor training results

Hi, I have trained R-101-FPN with coco2017, using 4 GPUs, but only got mmAP=0.33 during test which is well below Detectron result of 0.40.

What can be the problem?

I have used python tools/train_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-101-FPN_2x.yaml --use-tfboard --nw 8 --b 8 for training and python tools/test_net.py --cfg configs/e2e_mask_rcnn_R-101-FPN_2x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-101-FPN_2x/Apr19-11-34-35_devbox/ckpt/model_7_29315.pth --dataset coco2017

The loss at the end was about 0.6 which also seems a bit high.

A bug when running train_net_step.py

Hi roytseng-tw, I ran into the following bug when running "train_net_step.py". Do you have any idea what causes it? Thanks.

    main()
  File "tools/train_net_step.py", line 227, in main
    dataiterator = iter(dataloader)
  File "/home/wxk/anaconda/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 428, in __iter__
    return _DataLoaderIter(self)
  File "/home/wxk/anaconda/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 244, in __init__
    self._put_indices()
  File "/home/wxk/anaconda/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 292, in _put_indices
    indices = next(self.sample_iter, None)
  File "/home/wxk/anaconda/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 120, in __iter__
    batch.append(int(idx))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

Cannot unpickle Pretrained weights

Do you plan to support RetinaNet architecture?

Do you plan to support RetinaNet from the Focal Loss paper?

Compile error: 'cuda.h'

Not really an issue, just want to share my experience.

If you are running the code on a cluster, CUDA might not be installed under /usr/local/cuda/. In that case, in addition to modifying CUDA_PATH in make.sh, you might also need to specify CPATH=/path/to/your/cuda/include.

For example:
CPATH=/path/to/your/cuda/include ./make.sh

Trouble understanding the "training" attribute of the class "CollectAndDistributeFpnRpnProposalsOp"

I was running test_net.py. In /lib/modeling/FPN.py, the constructor of the "fpn_rpn_outputs" class contains the line "self.CollectAndDistributeFpnRpnProposals = CollectAndDistributeFpnRpnProposalsOp()". Since CollectAndDistributeFpnRpnProposalsOp inherits from nn.Module, which has an attribute named "training" that is True by default, I expected the "training" attribute of the CollectAndDistributeFpnRpnProposals object to be True as well.

But when I print "self.CollectAndDistributeFpnRpnProposals.training" in the "forward" function of the "fpn_rpn_outputs" class, I see False.

Do you know where the "training" attribute of the CollectAndDistributeFpnRpnProposals object is set to False?
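
For context: in PyTorch, nn.Module.eval() is equivalent to train(False), and train() sets the flag on the module and then recurses into all of its child modules. So if the test script calls eval() on the top-level model (an assumption here; where exactly this repository does so is not shown above), every submodule, including CollectAndDistributeFpnRpnProposals, would end up with training == False. The sketch below imitates that propagation with a hypothetical ToyModule class. It is a simplified stand-in, not PyTorch's actual implementation.

```python
class ToyModule:
    """Hypothetical stand-in that mimics how the training flag propagates."""

    def __init__(self):
        self.training = True  # modules start in training mode
        self._children = []

    def add_child(self, child):
        """Register a submodule and return it for convenience."""
        self._children.append(child)
        return child

    def train(self, mode=True):
        """Set the flag on this module, then recurse into every child."""
        self.training = mode
        for child in self._children:
            child.train(mode)
        return self

    def eval(self):
        """Switch to evaluation mode; same as train(False)."""
        return self.train(False)


# A top-level "model" with a nested op, two levels deep.
model = ToyModule()
rpn_outputs = model.add_child(ToyModule())
collect_op = rpn_outputs.add_child(ToyModule())

flag_before = collect_op.training  # still the default
model.eval()                       # called once on the top-level model
flag_after = collect_op.training   # flipped by the recursive call
print(flag_before, flag_after)
```

The nested op never has its flag set directly; it changes only because the call on the outermost module walks down the tree.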

inference time

Hi, have you compared the inference time with Caffe2? Which one is faster?
If I run inference on many images at the same time, would the average processing time per image be shorter?

data_parallel error?

Hi, thanks for contributing this Mask R-CNN implementation. I like the idea of building it from different modules (you can try different backbones, box heads, and mask heads), which has high potential for improvements.
I am trying to use my custom COCO-style dataset with this project. (I have already done this successfully with matterport's TF+Keras Mask R-CNN.)
But I get the following errors and have no clue why.
I guess it is something in data_parallel?
Any suggestions or ideas are welcome.

Namespace(batch_size=2, cfg_file='/home/ubuntu/skin_demo/Tooth/Detection/configs/e2e_mask_rcnn_R-50-C4_1x.yaml', cuda=True, dataset='coco2014', disp_interval=20, load_ckpt=None, load_detectron=None, lr=None, lr_decay_gamma=None, no_save=False, num_workers=1, optimizer=None, resume=False, set_cfgs=[], start_step=0, use_tfboard=True)
Batch size change from 1 (in config file) to 2
NUM_GPUs: 1, TRAIN.IMS_PER_BATCH: 2
Number of data loading threads: 1
Adjust BASE_LR linearly according to batch size change: 0.01 --> 0.02
loading annotations into memory...
Done (t=0.26s)
creating index...
index created!
INFO json_dataset.py: 298: Loading cached gt_roidb from /home/ubuntu/skin_demo/Tooth/Detection/Detectron.pytorch-master/data/cache/coco_2014_train_gt_roidb.pkl
INFO roidb.py:  50: Appending horizontally-flipped training examples...
INFO roidb.py:  52: Loaded dataset: coco_2014_train
INFO roidb.py: 143: Filtered 0 roidb entries: 578 -> 578
INFO roidb.py:  69: Computing image aspect ratios and ordering the ratios...
INFO roidb.py:  71: done
INFO roidb.py:  75: Computing bounding-box regression targets...
INFO roidb.py:  77: done
INFO train_net_step.py: 203: 578 roidb entries
INFO train_net_step.py: 204: Takes 1.24 sec(s) to construct roidb
INFO train_net_step.py: 319: Training starts !
INFO net.py:  72: Changing learning rate 0.000000 -> 0.006667
Traceback (most recent call last):
  File "tools/train_net_step.py", line 397, in <module>
    main()
  File "tools/train_net_step.py", line 364, in main
    net_outputs = maskRCNN(**input_data)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/skin_demo/Tooth/Detection/Detectron.pytorch-master/lib/nn/parallel/data_parallel.py", line 113, in forward
    outputs = [self.module(*inputs[0], **kwargs[0])]
  File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/skin_demo/Tooth/Detection/Detectron.pytorch-master/lib/modeling/model_builder.py", line 116, in forward
    roidb = list(map(lambda x: blob_utils.deserialize(x)[0], roidb))
  File "/home/ubuntu/skin_demo/Tooth/Detection/Detectron.pytorch-master/lib/modeling/model_builder.py", line 116, in <lambda>
    roidb = list(map(lambda x: blob_utils.deserialize(x)[0], roidb))
  File "/home/ubuntu/skin_demo/Tooth/Detection/Detectron.pytorch-master/lib/utils/blob.py", line 176, in deserialize
    return pickle.loads(arr.astype(np.uint8).tobytes())
AttributeError: 'list' object has no attribute 'astype'

Unpickling error while training from scratch e2e mask rcnn for Resnet-50-C4 (1x).

Conda 4.5, Python 3.6, Pytorch 0.3.1

Traceback (most recent call last):
  File "tools/train_net_step.py", line 391, in <module>
    main()
  File "tools/train_net_step.py", line 222, in main
    maskRCNN = Generalized_RCNN()
mask-rcnn.pytorch/lib/modeling/model_builder.py", line 98, in __init__
    self._init_modules()
mask-rcnn.pytorch/lib/modeling/model_builder.py", line 102, in _init_modules
    resnet_utils.load_pretrained_imagenet_weights(self)
/mask-rcnn.pytorch/lib/utils/resnet_weights_helper.py", line 21, in load_pretrained_imagenet_weights
    pretrianed_state_dict = convert_state_dict(torch.load(weights_file))
lib/python3.6/site-packages/torch/serialization.py", line 267, in load
    return _load(f, map_location, pickle_module)
lib/python3.6/site-packages/torch/serialization.py", line 410, in _load
    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: invalid load key, '<'.

What am I missing?
Please help.
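
A load key of '<' means the first byte of the weights file is '<', which is how HTML or XML documents usually start. One possible cause (an assumption, not confirmed in this issue) is that the download saved an error or login page instead of the real weights. Below is a minimal sketch for checking this; sniff_weights_file is a hypothetical helper, not part of this repository.

```python
import os
import tempfile


def sniff_weights_file(path, num_bytes=64):
    """Classify a file by its first bytes: 'empty', 'markup', or 'binary'."""
    with open(path, "rb") as f:
        head = f.read(num_bytes)
    if not head:
        return "empty"
    # HTML/XML pages start with '<' (possibly after whitespace),
    # which is not a valid first byte for pickled data.
    if head.lstrip().startswith(b"<"):
        return "markup"
    return "binary"


# Demo on two throwaway files (stand-ins for a real weights file).
with tempfile.TemporaryDirectory() as tmp_dir:
    bad_path = os.path.join(tmp_dir, "bad.pth")
    good_path = os.path.join(tmp_dir, "good.pth")
    with open(bad_path, "wb") as f:
        f.write(b"<!DOCTYPE html><html>not weights</html>")
    with open(good_path, "wb") as f:
        f.write(b"\x80\x02}q\x00.")  # starts like a protocol-2 pickle
    bad_kind = sniff_weights_file(bad_path)
    good_kind = sniff_weights_file(good_path)

print(bad_kind, good_kind)
```

If the real weights file is reported as 'markup', opening it in a text editor should show what was actually downloaded.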

Eval code for COCO

Hi, can you provide some eval APIs so that we can test the performance on COCO?

Documentation

Hello!

Is it possible to add documentation for the model, for example for the parameters of forward?

Those two functions do not exist

File "tools/train_net.py", line 25, in <module>
import utils.misc as misc_utils
File "/mnt/disk1/oujie/pytorch_mask/Detectron.pytorch-master/lib/utils/misc.py", line 3, in <module>
from collections import defaultdict, Iterable
ImportError: cannot import name defaultdict

RuntimeError: received 0 items of ancdata

I got the following error during training:

Traceback (most recent call last):
  File "tools/train_net.py", line 316, in main
    for step, input_data in zip(range(args.start_iter, iters_per_epoch), dataloader):
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 275, in __next__
    idx, batch = self._get_batch()
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 254, in _get_batch
    return self.data_queue.get()
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/rizhiy/miniconda3/envs/Detectron.pytorch/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
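
This error is commonly reported when DataLoader worker processes hand tensors over via file descriptors and the process runs into its open-file limit. Two workarounds that are often suggested are raising that limit (ulimit -n in the shell) or calling torch.multiprocessing.set_sharing_strategy('file_system') before creating the DataLoader. Whether either one resolves this particular case is an assumption. Below is a minimal sketch of the first option using Python's standard resource module (Unix only); raise_open_file_limit and the target of 4096 are hypothetical choices.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit toward `target`, capped by the hard limit.

    Returns the (soft, hard) limits in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do
    # The soft limit may not exceed the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the platform refused the change; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


limits_before = resource.getrlimit(resource.RLIMIT_NOFILE)
limits_after = raise_open_file_limit()
print("open-file limits:", limits_before, "->", limits_after)
```

Calling this at the top of the training script, before any worker processes are started, lets the workers inherit the raised limit.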

Low GPU utilization

I'm training on 4 GPUs with 8 workers but getting only about 50% GPU utilization.

What can be the problem?

Trouble understanding the __getitem__ method in the RoiDataLoader class

I am trying to understand the signature of the "__getitem__" method of the "RoiDataLoader" class in /lib/roi_data/loader.py. That class is a subclass of the abstract class "Dataset" in PyTorch. In PyTorch's definition of "Dataset", the "__getitem__" method supports integer indexing in the range from 0 to len(self), exclusive. But for RoiDataLoader, the parameter of the "__getitem__" method is an index_tuple. Could you explain how it works?

Thanks.
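
As a general point, a PyTorch DataLoader passes whatever its sampler yields straight to the dataset's __getitem__, so a custom sampler that yields tuples can drive a dataset that expects tuples. The sketch below imitates this in plain Python. ToyRoiDataset, tuple_sampler, and batch_sampler are hypothetical stand-ins, and the (index, ratio) pairing is an assumption about what index_tuple holds, not something confirmed by the code quoted here.

```python
class ToyRoiDataset:
    """Hypothetical dataset whose __getitem__ takes a tuple, not an int."""

    def __init__(self, entries):
        self.entries = entries

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, index_tuple):
        # Unpack the pair produced by the sampler.
        index, ratio = index_tuple
        return {"entry": self.entries[index], "ratio": ratio}


def tuple_sampler(ratios):
    """Yield (index, aspect_ratio) pairs instead of bare integers."""
    for index, ratio in enumerate(ratios):
        yield (index, ratio)


def batch_sampler(sampler, batch_size):
    """Group sampler outputs into batches without calling int() on them."""
    batch = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


dataset = ToyRoiDataset(["img_a", "img_b", "img_c"])
ratios = [0.75, 1.0, 1.33]

# What a loader does conceptually: index the dataset with each sampled item.
batches = [
    [dataset[item] for item in batch]
    for batch in batch_sampler(tuple_sampler(ratios), batch_size=2)
]
print(batches)
```

A batch sampler that calls int() on each sampled item, as in the sampler.py frame quoted in an earlier traceback on this page, would raise a TypeError on such tuples, so the tuple-yielding sampler needs a matching batch sampler.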

Training gets NaN loss when GPUS>2 is set using train_net.py

Thanks for sharing your great work.
When I set GPUS=3 and batchsize=15, I got a NaN loss value. I also tested GPUS=4 with batchsize=16/20 and got the same NaN loss. (Training e2e-resnet50-c4-2x from scratch.)

But with GPUS=2 and batchsize=10, training works correctly.