lufficc / SSD
High quality, fast, modular reference implementation of SSD in PyTorch
License: MIT License
In ssd300_voc0712.yaml, ssd512_voc0712.yaml, and ssd300_coco_trainval35k.yaml, some of the settings referenced below do not exist:
# build the training pipeline: image augmentation, then matching of
# ground-truth boxes to priors (encoding them into regression targets)
train_transform = TrainAugmentation(cfg.INPUT.IMAGE_SIZE, cfg.INPUT.PIXEL_MEAN)
target_transform = MatchPrior(PriorBox(cfg)(), cfg.MODEL.CENTER_VARIANCE,
                              cfg.MODEL.SIZE_VARIANCE, cfg.MODEL.THRESHOLD)
My training always hangs at around 10k iterations, so I have never finished the training procedure. Has anyone encountered this?
Below is my screen output.
It doesn't print any error, it just hangs...
2019-01-15 17:46:13,572 SSD.trainer INFO: Iter: 016500, Lr: 0.00100, Cost: 30.35s, Eta: 18:05:08, total_loss: 2.746, classification_loss: 1.881, regression_loss: 0.864
2019-01-15 17:46:44,709 SSD.trainer INFO: Iter: 016550, Lr: 0.00100, Cost: 30.74s, Eta: 18:04:35, total_loss: 3.110, classification_loss: 2.112, regression_loss: 0.998
2019-01-15 17:47:15,735 SSD.trainer INFO: Iter: 016600, Lr: 0.00100, Cost: 30.60s, Eta: 18:04:01, total_loss: 2.336, classification_loss: 1.702, regression_loss: 0.634
2019-01-15 17:47:46,991 SSD.trainer INFO: Iter: 016650, Lr: 0.00100, Cost: 30.91s, Eta: 18:03:28, total_loss: 2.972, classification_loss: 2.040, regression_loss: 0.932
2019-01-15 17:48:18,479 SSD.trainer INFO: Iter: 016700, Lr: 0.00100, Cost: 31.07s, Eta: 18:02:57, total_loss: 2.584, classification_loss: 1.810, regression_loss: 0.774
2019-01-15 17:48:49,426 SSD.trainer INFO: Iter: 016750, Lr: 0.00100, Cost: 30.55s, Eta: 18:02:22, total_loss: 2.723, classification_loss: 1.915, regression_loss: 0.807
Hi,
I can run the demo with the provided SSD300 model, but when using the provided SSD512 config file and weights (configs/ssd512_voc0712.yaml, ssd512_voc0712_mAP80.25.pth) I get this error:
model.load(weights)
File "SSD/ssd/modeling/ssd.py", line 97, in load
self.load_state_dict(torch.load(model, map_location=lambda storage, loc: storage))
File "/anaconda/envs/maskRcnnB/lib/python3.5/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SSD:
Unexpected key(s) in state_dict: "extras.8.weight", "extras.8.bias", "extras.9.weight", "extras.9.bias", "classification_headers.6.weight", "classification_headers.6.bias", "regression_headers.6.weight", "regression_headers.6.bias".
size mismatch for classification_headers.4.bias: copying a param with shape torch.Size([126]) from checkpoint, the shape in current model is torch.Size([84]).
size mismatch for classification_headers.4.weight: copying a param with shape torch.Size([126, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([84, 256, 3, 3]).
size mismatch for regression_headers.4.bias: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for regression_headers.4.weight: copying a param with shape torch.Size([24, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([16, 256, 3, 3]).
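The shapes in this error suggest the model was built from the SSD300 config while the checkpoint is for SSD512 (extra extras/header layers, and 126 = 21 classes × 6 anchors vs 84 = 21 × 4), so double-check that the SSD512 config file is the one actually passed to the script. As a minimal diagnostic sketch (not part of the repo), you can diff the checkpoint against a freshly built model:

import torch

def diff_state_dict(model, weights_path):
    ckpt = torch.load(weights_path, map_location="cpu")
    model_sd = model.state_dict()
    for k in sorted(ckpt.keys() - model_sd.keys()):
        print("unexpected key:", k)
    for k in sorted(model_sd.keys() - ckpt.keys()):
        print("missing key:", k)
    for k in ckpt.keys() & model_sd.keys():  # shared keys: compare shapes
        if ckpt[k].shape != model_sd[k].shape:
            print(f"shape mismatch {k}: checkpoint {tuple(ckpt[k].shape)}, model {tuple(model_sd[k].shape)}")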
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>python setup.py build_ext install
running build_ext
building 'pycocotools._mask' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>
Hi Li,
I've been trying to train a custom SSD but I'm running into some issues. I annotated some 1,200 images with only one class. I used RectLabel, which outputs one XML file per image file. I then created the same directory structure as VOC2007 (Annotations, JPEGImages, ImageSets), saving the files trainval.txt, test.txt, val.txt and {class_name}_trainval.txt, ..., in ImageSets/Main. I then modified configs/ssd300_voc0712.yaml to take NUM_CLASSES: 2 and modified classes_name in voc_dataset.py. (I've also tried the steps you outline here.)
The dataset gets recognized, but when it goes through the DataLoader (each image, boxes, labels) I get the following error:
2019-01-15 09:58:13,478 SSD.trainer INFO: Init from base net vgg16_reducedfc.pth
2019-01-15 09:58:13,580 SSD.trainer INFO: Train dataset size: 752
2019-01-15 09:58:13,580 SSD.trainer INFO: Start training
Traceback (most recent call last):
File "train_ssd.py", line 139, in <module>
main()
File "train_ssd.py", line 130, in main
model = train(cfg, args)
File "train_ssd.py", line 71, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/engine/trainer.py", line 68, in do_train
for iteration, (images, boxes, labels) in enumerate(data_loader):
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/data/datasets/your_dataset.py", line 37, in __getitem__
image, boxes, labels = self.transform(image, boxes, labels)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/modeling/data_preprocessing.py", line 33, in __call__
return self.augment(img, boxes, labels)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/transforms/transforms.py", line 55, in __call__
img, boxes, labels = t(img, boxes, labels)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/transforms/transforms.py", line 347, in __call__
boxes[:, :2] += (int(left), int(top))
IndexError: too many indices for array
I'm running CUDA 10.
Is there any other step that I'm missing? Thank you very much!
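One common cause of this exact IndexError (not confirmed for this dataset, but worth checking) is an image whose annotation parses to an empty or 1-D boxes array: boxes[:, :2] then fails with "too many indices for array". A hedged sketch of the guard, with parsed as a hypothetical list of (xmin, ymin, xmax, ymax) tuples:

import numpy as np

def to_boxes(parsed):
    # reshape(-1, 4) keeps the array 2-D even when an image has no objects
    return np.array(parsed, dtype=np.float32).reshape(-1, 4)

boxes = to_boxes([])          # shape (0, 4) instead of (0,)
boxes[:, :2] += (10, 10)      # safe now, even for empty annotations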
This project is the best SSD implementation!! But I have a task to detect small objects, and 512×512 input is not suitable. How can I change it to take 1024×1024 input? Can somebody give me a configuration? Thanks so much! Urgent! Waiting online!
python build.py build_ext develop
running build_ext
building 'torch_extension' extension
gcc -pthread -B /home/marco/anaconda2/envs/SSD/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/home/marco/Documenti/github/SSD-1.0.1/ext -I/home/marco/anaconda2/envs/SSD/lib/python3.6/site-packages/torch/lib/include -I/home/marco/anaconda2/envs/SSD/lib/python3.6/site-packages/torch/lib/include/TH -I/home/marco/anaconda2/envs/SSD/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/marco/anaconda2/envs/SSD/include/python3.6m -c /home/marco/Documenti/github/SSD-1.0.1/ext/vision.cpp -o build/temp.linux-x86_64-3.6/home/marco/Documenti/github/SSD-1.0.1/ext/vision.o -DTORCH_EXTENSION_NAME=torch_extension -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/marco/Documenti/github/SSD-1.0.1/ext/nms.h:3,
from /home/marco/Documenti/github/SSD-1.0.1/ext/vision.cpp:2:
/home/marco/Documenti/github/SSD-1.0.1/ext/cpu/vision.h:3:10: fatal error: torch/extension.h: File or directory does not exist
#include <torch/extension.h>
^~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
It seems that inference is very slow in the PostProcessor part.
Hi
I find that before you resume/init_from a pre-trained model, the SSD class has already reset the parameters.
But when I cancel the resume process, it leads to errors like the one below. How can I init the weights without the pretrained model (either vgg_reduced.pth or ...):
File "train_ssd.py", line 139, in
main()
File "train_ssd.py", line 130, in main
model = train(cfg, args)
File "train_ssd.py", line 71, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
File "/home/fmming/test/SSD/SSD-master/ssd/engine/trainer.py", line 76, in do_train
loss_dict = model(images, targets=(boxes, labels))
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/fmming/test/SSD/SSD-master/ssd/modeling/ssd.py", line 86, in forward
regression_loss, classification_loss = self.criterion(confidences, locations, gt_labels, gt_boxes)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/fmming/test/SSD/SSD-master/ssd/modeling/multibox_loss.py", line 31, in forward
mask = box_utils.hard_negative_mining(loss, labels, self.neg_pos_ratio)
File "/home/fmming/test/SSD/SSD-master/ssd/utils/box_utils.py", line 123, in hard_negative_mining
_, indexes = loss.sort(dim=1, descending=True)
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
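A frequent cause of this kind of illegal memory access during sorting is an out-of-range class index rather than a sorting bug: a label >= NUM_CLASSES corrupts the loss tensor upstream. A hedged checking sketch (it assumes the dataset yields (image, boxes, labels), as the tracebacks here show):

import os
import torch

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA errors point at the real failing kernel

def check_labels(dataset, num_classes):
    for i in range(len(dataset)):
        _, _, labels = dataset[i]
        labels = torch.as_tensor(labels)
        if labels.numel() and (labels.min() < 0 or labels.max() >= num_classes):
            print(f"sample {i} has out-of-range labels: {labels.tolist()}")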
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>python setup.py build_ext install
running build_ext
building 'pycocotools._mask' extension
D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -ID:\ai\Anaconda3\envs\ssd\lib\site-packages\numpy\core\include -I../common -ID:\ai\Anaconda3\envs\ssd\include -ID:\ai\Anaconda3\envs\ssd\include /Tcpycocotools/_mask.c /Fobuild\temp.win-amd64-3.7\Release\pycocotools/_mask.obj
_mask.c
d:\ai\anaconda3\envs\ssd\include\pyconfig.h(59): fatal error C1083: Cannot open include file: 'io.h': No such file or directory
error: command 'D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe' failed with exit status 2
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>
@lufficc
First of all, thank you for the implementation. It's very helpful.
But have you trained SSD on COCO yourself? Could you please provide detailed performance results? Furthermore, it would be highly appreciated if you could share the pre-trained model.
Hello, how about the detection speed? I have run demo.py, but I can't reach the speed reported in the paper.
Hey there,
Thank you for your amazing job! But I was wondering, what is the inference performance for batch size 1? I trained SSD on my own datasets and I'm getting ~0.40 s/image, which feels quite slow... I also trained a Faster R-CNN, and even with a ResNeXt-152 backbone I get similar or faster inference times.
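It may help to time the raw network forward and the post-processing separately before comparing against Faster R-CNN, since one of the issues above notes that inference is slow in the PostProcessor part. A rough timing sketch; the function being timed (e.g. a model forward or a post-processor call) and its arguments are stand-ins for your own objects:

import time
import torch

@torch.no_grad()
def avg_time(fn, *args, n=50):
    torch.cuda.synchronize()        # make GPU timing honest
    t0 = time.time()
    for _ in range(n):
        out = fn(*args)
    torch.cuda.synchronize()
    return (time.time() - t0) / n, out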
Hi, a question about the configuration files, e.g. the SSD300 VOC and SSD512 VOC files:
I found that in the 512 file the anchor sizes are not the same as in the 300 file.
I think the only difference between the 300 and 512 files is the input image size; the network model is the same.
So the anchor sizes for anchors in the same layer should not be changed, because the receptive field is the same, right? (The input sizes differ but the network structure is identical, so for anchors in the same layer the configured sizes should stay the same; yet in your 300 and 512 config files the anchor sizes differ. I think anchor size relates to the receptive field, so it should not grow the way you set it.)
Hi, @lufficc!
Great work. How could I use your implementation for image size = 256, for example?
UPD: It is already done, sorry
Traceback (most recent call last):
File "train_ssd.py", line 138, in
main()
File "train_ssd.py", line 119, in main
cfg.merge_from_file(args.config_file)
File "/root/anaconda3/lib/python3.7/site-packages/yacs/config.py", line 172, in merge_from_file
with open(cfg_filename, "r") as f:
IsADirectoryError: [Errno 21] Is a directory: 'configs'
I got this error.
I don't know what these settings in default.py mean:
_C.MODEL.CENTER_VARIANCE = 0.1
_C.MODEL.SIZE_VARIANCE = 0.2
They're used in ssd/utils/box_utils.py when boxes are converted into locations or locations are converted back into boxes, but I don't know why.
_C.TEST.MAX_PER_CLASS = 200
_C.TEST.MAX_PER_IMAGE = -1
I don't know these either, and I can't find where they're used in the project.
Can anyone help? Thanks.
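For the first pair: the variances come from the original Caffe SSD implementation. They rescale the regression targets so the four location offsets have roughly comparable magnitude; box_utils divides by them when encoding boxes into locations and multiplies them back when decoding (dividing a target by a variance is equivalent to up-weighting that term in the loss). For the second pair: MAX_PER_CLASS is the per-class detection cap passed to NMS (it appears as self.max_per_class in ssd/modeling/post_processor.py, visible in a traceback below), and MAX_PER_IMAGE presumably caps total detections per image, with -1 disabling the cap. A worked sketch of the standard encoding and its inverse, assuming boxes and priors in (cx, cy, w, h) form:

import torch

def encode(gt, prior, center_variance=0.1, size_variance=0.2):
    # locations the regression head is trained to predict
    return torch.cat([
        (gt[..., :2] - prior[..., :2]) / (prior[..., 2:] * center_variance),
        torch.log(gt[..., 2:] / prior[..., 2:]) / size_variance,
    ], dim=-1)

def decode(loc, prior, center_variance=0.1, size_variance=0.2):
    # the exact inverse: multiply the variances back in
    return torch.cat([
        loc[..., :2] * center_variance * prior[..., 2:] + prior[..., :2],
        torch.exp(loc[..., 2:] * size_variance) * prior[..., 2:],
    ], dim=-1)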
Hello, your matching strategy is wrong; please check it again.
Hello, thanks for the amazing job!
I have a question about training speed.
I trained on the VOC dataset using two 1080 Ti GPUs and the speed is about 0.75 s/iteration; in my TensorFlow implementation the training speed is about 0.45 s/iteration. Also, I hear some other PyTorch projects can achieve 0.3 s/iteration.
Can you share your training speed?
I noticed that iteration is scaled but batch size is not:
Line 67 in 7691b27
batch_sampler = torch.utils.data.sampler.BatchSampler(sampler=sampler, batch_size=cfg.SOLVER.BATCH_SIZE*args.num_gpus, drop_last=False)
I have downloaded the COCO2014 annotations, but they don't include "annotations/instances_minival2014.json" and "annotations/instances_valminusminival2014.json". Did you make these yourself?
Thanks a lot.
After evaluating, I can find a file in the path 'output/voc_2007_test/' named 'predictions.pth'.
What is it? Is it the same as 'ssd512_vgg_final.pth'?
Does 'predictions.pth' strip something out of 'ssd512_vgg_final.pth'?
Can it be used for prediction the same way as 'ssd512_vgg_final.pth'?
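Most likely (judging by similar repos in this family, not a confirmed statement about this one), predictions.pth is the cached detection output of the evaluation run, saved so metrics can be recomputed without re-running inference; it is not a model checkpoint like ssd512_vgg_final.pth and cannot be loaded as one. You can inspect it directly:

import torch

preds = torch.load("output/voc_2007_test/predictions.pth", map_location="cpu")
print(type(preds), len(preds))  # expected: one detection record per test image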
As the title says, I don't understand why you divide by center_variance and size_variance when computing the regression targets for the loss; this is not mentioned in the original SSD paper.
Hi,
I am trying to train on coco. I used dockerfile to build the image.
FROM nvcr.io/nvidia/pytorch:18.12.1-py3
# FROM pytorch/pytorch:nightly-devel-cuda10.0-cudnn7
RUN pip install tensorboardX yacs tqdm pillow
RUN conda install -y opencv cython
RUN git clone https://github.com/cocodataset/cocoapi.git && cd cocoapi/PythonAPI && python setup.py build_ext install
RUN git clone https://github.com/pytorch/vision.git \
&& cd vision \
&& python setup.py install
COPY . /SSD
WORKDIR /SSD
RUN python /SSD/ext/build.py build_ext develop
CMD [ "bash" ]
But I experience severe CPU usage (almost 100%) and low GPU usage on several machines (20C/40T CPU with a V100 GPU; 4C/8T CPU with an RTX 2080 GPU), and training is extremely slow.
I tried the conda install of PyTorch and the same thing happens.
Meanwhile, another PyTorch 1.0 repo (maskrcnn-benchmark) was fine using its provided Dockerfile.
Is anyone else experiencing the same problem?
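A hedged workaround worth trying: with opencv installed via conda, every DataLoader worker can spin up its own OpenCV/OpenMP thread pool, which matches the symptom of saturated CPUs and a starved GPU. Capping thread counts before creating the loaders often restores throughput:

import cv2
import torch

cv2.setNumThreads(0)       # keep OpenCV single-threaded inside DataLoader workers
torch.set_num_threads(1)   # limit intra-op CPU threading; tune per machine
# setting OMP_NUM_THREADS=1 in the environment before launch has a similar effect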
Hi
Thank you very much for the repository. I'm using gcc 7.3.0 to build the NMS extension. Should that be OK?
I get the following output on stderr:
stderr.log
Also, I had to install Cython before building pycocotools; perhaps you could mention that in the documentation.
Thanks for your amazing implementation!
I want to replace the VGG in your project with a net I define, but it seems that I must load the pre-trained .pth file?
Also, to train a custom dataset, what modifications should I make?
When I run demo.py, I get the following error:
Traceback (most recent call last):
File "demo.py", line 9, in
from ssd.modeling.predictor import Predictor
File "/home/guo/workspace/Object_Detection/SSD/Pytorch_SSD/ssd/modeling/predictor.py", line 3, in
from ssd.modeling.post_processor import PostProcessor
File "/home/guo/workspace/Object_Detection/SSD/Pytorch_SSD/ssd/modeling/post_processor.py", line 3, in
from ssd.utils.nms import boxes_nms
File "/home/guo/workspace/Object_Detection/SSD/Pytorch_SSD/ssd/utils/nms.py", line 1, in
import torch_extension
ModuleNotFoundError: No module named 'torch_extension'
1. When I try to train a model with input size 512, the loss always becomes NaN/Inf. I tried reducing warmup.factor to 0.1 and changing the learning rate to 0.00001; neither seems to work.
It only works well when I use a smaller batch size (<= 8).
I'm confused about that. Batch size should be decided by GPU memory (I run the program on 8x Tesla V100 with 32 GB memory each, which I think is enough for training), so why does it cause such an error in the loss function?
2. I think the batch size should be related to the iteration count, but the iterations are independent of the batch size, so a small batch size leads to a shorter training time (see the arithmetic sketch below).
Do you have any suggestions about these 2 questions?
Thanks a lot.
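The arithmetic behind question 2, as a sketch (the 120,000-iteration schedule and batch size 32 are illustrative defaults; 16551 is the VOC07+12 trainval size from a log below):

def effective_epochs(iterations, batch_size, dataset_size):
    # iteration-based training fixes the step count, so smaller batches
    # simply see less data unless iterations are scaled up to compensate
    return iterations * batch_size / dataset_size

print(effective_epochs(120_000, 32, 16551))  # ~232 epochs
print(effective_epochs(120_000, 8, 16551))   # ~58 epochs: 4x less data seen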
Hi lufficc:
I have a problem. When I change the GPU configuration, e.g. change cuda to cuda:2 because I want to train on the third GPU, the following error happens:
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /home/xxx/lufficc_ssd_shifted_anchor/ext/cuda/nms.cu:103
And the command I use to start the training is
python train_ssd.py --config-file configs/ssd300_voc0712.yaml --save_step 5000 --eval_step 1 --resume output/ssd513_vgg_iteration_005000.pth
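A hedged workaround for this one: compiled CUDA extensions like the nms op typically launch kernels on the current CUDA device, which stays at GPU 0 when you only change the tensor device string to cuda:2, producing exactly this illegal access. Exposing just the target GPU to the process sidesteps the mismatch:

import os

# must run before torch initializes CUDA; afterwards plain "cuda" maps to physical GPU 2
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# alternatively (also before any kernels run): torch.cuda.set_device(2)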
When I ran the eval section, I got a Segmentation fault (core dumped) in the NMS part. I think it is a .so error. My gcc version is gcc 4.8.5 20150623 (Red Hat 4.8.5-4) and my CUDA is 8.0.
I have checked ssd300_voc0712.yaml, but lr=0.000 during the training process. Why?
Also, I followed the multi-GPU training setup with gpus=2, but the training speed is half that of a single GPU. Why?
Hello, when I use the code to train my own datasets, executing the commands in the README step by step, I meet this problem:
2019-01-17 20:52:48,754 SSD.trainer INFO: Iter: 004550, Lr: 0.00100, Cost: 223.93s, Eta: 6 days, 3:12:37, total_loss: 3.036, regression_loss: 0.752, classification_loss: 2.283
2019-01-17 20:56:37,353 SSD.trainer INFO: Iter: 004600, Lr: 0.00100, Cost: 225.57s, Eta: 6 days, 3:08:25, total_loss: 3.188, regression_loss: 1.202, classification_loss: 1.987
2019-01-17 21:00:23,136 SSD.trainer INFO: Iter: 004650, Lr: 0.00100, Cost: 222.75s, Eta: 6 days, 3:03:03, total_loss: 2.773, regression_loss: 0.871, classification_loss: 1.902
2019-01-17 21:04:11,656 SSD.trainer INFO: Iter: 004700, Lr: 0.00100, Cost: 225.49s, Eta: 6 days, 2:58:50, total_loss: 2.975, regression_loss: 0.941, classification_loss: 2.034
2019-01-17 21:08:00,204 SSD.trainer INFO: Iter: 004750, Lr: 0.00100, Cost: 225.52s, Eta: 6 days, 2:54:39, total_loss: 2.737, regression_loss: 0.806, classification_loss: 1.932
2019-01-17 21:11:48,767 SSD.trainer INFO: Iter: 004800, Lr: 0.00100, Cost: 225.54s, Eta: 6 days, 2:50:28, total_loss: 2.875, regression_loss: 1.041, classification_loss: 1.834
2019-01-17 21:15:37,322 SSD.trainer INFO: Iter: 004850, Lr: 0.00100, Cost: 225.53s, Eta: 6 days, 2:46:17, total_loss: 3.588, regression_loss: 1.296, classification_loss: 2.292
2019-01-17 21:19:25,896 SSD.trainer INFO: Iter: 004900, Lr: 0.00100, Cost: 225.55s, Eta: 6 days, 2:42:08, total_loss: 2.428, regression_loss: 0.619, classification_loss: 1.809
2019-01-17 21:23:11,709 SSD.trainer INFO: Iter: 004950, Lr: 0.00100, Cost: 222.78s, Eta: 6 days, 2:36:54, total_loss: 2.225, regression_loss: 0.690, classification_loss: 1.535
2019-01-17 21:27:00,831 SSD.trainer INFO: Iter: 005000, Lr: 0.00100, Cost: 226.09s, Eta: 6 days, 2:32:59, total_loss: 2.845, regression_loss: 1.028, classification_loss: 1.817
2019-01-17 21:27:00,913 SSD.trainer INFO: Saved checkpoint to output/ssd300_vgg_iteration_005000.pth
2019-01-17 21:27:00,914 SSD.inference INFO: Will evaluate 1 dataset(s):
2019-01-17 21:27:00,914 SSD.inference INFO: Evaluating voc_2007_test dataset(75 images):
2019-01-17 21:27:00,914 SSD.inference INFO: Progress on CUDA 0:
0%| | 0/75 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_ssd.py", line 138, in
main()
File "train_ssd.py", line 129, in main
model = train(cfg, args)
File "train_ssd.py", line 71, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
File "/home/t/github/SSD/ssd/engine/trainer.py", line 113, in do_train
do_evaluation(cfg, model, cfg.OUTPUT_DIR, distributed=args.distributed)
File "/home/t/github/SSD/ssd/engine/inference.py", line 93, in do_evaluation
_evaluation(cfg, dataset_name, test_dataset, predictor, distributed, output_dir)
File "/home/t/github/SSD/ssd/engine/inference.py", line 62, in _evaluation
output = predictor.predict(image)
File "/home/t/github/SSD/ssd/modeling/predictor.py", line 27, in predict
results = self.post_processor(scores, boxes, width=width, height=height)
File "/home/t/github/SSD/ssd/modeling/post_processor.py", line 66, in call
keep = boxes_nms(boxes, probs, self.iou_threshold, self.max_per_class)
File "/home/t/github/SSD/ssd/utils/nms.py", line 18, in boxes_nms
keep = _nms(boxes, scores, nms_thresh)
RuntimeError: Not compiled with GPU support (nms at /home/t/github/SSD/ext/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f88fb9dfcc5 in /home/t/anaconda3/envs/tf/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7f88f76ed274 in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #2: + 0x13697 (0x7f88f76f8697 in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x1380e (0x7f88f76f880e in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x10a0a (0x7f88f76f5a0a in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #50: __libc_start_main + 0xf0 (0x7f894e009830 in /lib/x86_64-linux-gnu/libc.so.6)
(tf) t@t-System-Product-Name:~/github/SSD$ python
Since transforms.py is a copy of ssd.pytorch's augmentation code, please refer to these issues:
amdegroot/ssd.pytorch#119
amdegroot/ssd.pytorch#68
https://github.com/lufficc/SSD/blob/master/ssd/transforms/transforms.py#L282
Thanks so much for answering my question, Mr. Author!
I modified config/xxx.yaml, computed the feature_map_size, strides, min/max_size, aspect_ratios and so on, especially modified vgg_ssd.py, and it works.
I have set the lr from 1e-3 down to 1e-5, but the Inf still appears.
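If lowering the learning rate alone does not stop the Inf, one stabilizer to try is gradient clipping in the training step. A hedged sketch: the criterion signature is borrowed from the tracebacks in this thread, while the forward signature and the max_norm cap are illustrative:

import torch

def train_step(model, optimizer, criterion, images, boxes, labels, max_norm=10.0):
    confidence, locations = model(images)  # illustrative forward signature
    regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
    loss = regression_loss + classification_loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # the key line
    optimizer.step()
    return loss.item()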
I submitted results for the ssd300_coco_trainval35k_AP22.9.pth model to the COCO server.
Here are the results
How to submit to test-dev-2015:
use detection server: https://competitions.codalab.org/competitions/5181
choose test-dev2018 (bbox)
COCO-test-dev-2015 server
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.255
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.435
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.263
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.067
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.270
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.236
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.345
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.359
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.098
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.567
Done (t=334.90s)
To compare my non-max suppression against lufficc's, here are my results for the models:
local : COCO-test-dev-2014 (instances_minival2014.json, num_images = 5k)
ssd300_coco_trainval35k_AP22.9.pth model
DONE (t=6.61s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.251
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.428
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.261
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.271
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.419
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.234
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.342
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.358
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.097
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.397
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.562
local : COCO-test-dev-2014 (instances_minival2014.json, num_images = 5k)
ssd300_voc0712_mAP77.83.pth
metric_type = voc07
#name ap
aeroplane 0.825236
bicycle 0.844450
bird 0.759660
boat 0.710224
bottle 0.527462
bus 0.864337
car 0.865986
cat 0.874129
chair 0.617937
cow 0.827866
diningtable 0.786153
dog 0.851901
horse 0.863020
motorbike 0.851469
person 0.802394
pottedplant 0.507871
sheep 0.768501
sofa 0.792603
train 0.870370
tvmonitor 0.755360
---------------------------------
mAP 0.778346
factory = globals()[data['factory']]
dataset = factory(**args)
I don't understand the above two lines.
Hi, I found my training script always stops at the end of the first epoch.
My training script works with a for epoch in range(MAX_EPOCH) loop instead of a sampler. I just want to know how to make my training script keep running (see the sketch below).
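Since the trainer here is iteration-based, a plain epoch loop exhausts the DataLoader iterator once and then stops; recycling the loader keeps batches coming for as many iterations as you need. A minimal sketch:

def infinite_iter(data_loader):
    # restart the loader whenever it runs out, preserving shuffling per pass
    while True:
        yield from data_loader

# usage: batches = infinite_iter(train_loader); call next(batches) each iteration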
The results drop to nearly 71 mAP when using distributed training (4 GPUs).
Hello, I've been reading this code recently and I'm not very familiar with the new DistributedDataParallel.
In the training loop, reduce_loss_dict(...) and save_to_disk = distributed_util.get_rank() == 0 (for saving the model) execute different statements for rank 0 and non-rank-0 processes, but there is no such check when the logger prints each loss and the timing. Watching it run, what's displayed seems to be the loss reduced on rank 0, so how is the logger output of the non-rank-0 processes suppressed?
The code does a rank-0 check in one function, but the logger there is named SSD, while the logger in the training loop is called SSD.trainer. Could you explain? Thanks.
Also, where in the program does execution start running on different GPUs in parallel? Thanks.
I want to see the loss curves, but how do I visualize them in tensorboardX? Which URL? And where is the code in the project?
Why is the parameter last_epoch used? And what is the reason for alpha = self.last_epoch / self.warmup_iters?
class WarmupMultiStepLR(MultiStepLR):
    def __init__(self, optimizer, milestones, gamma=0.1, warmup_factor=1.0 / 3,
                 warmup_iters=500, last_epoch=-1):
        self.warmup_factor = warmup_factor
        self.warmup_iters = warmup_iters
        super().__init__(optimizer, milestones, gamma, last_epoch)

    def get_lr(self):
        lr = super().get_lr()
        if self.last_epoch < self.warmup_iters:
            alpha = self.last_epoch / self.warmup_iters
            warmup_factor = self.warmup_factor * (1 - alpha) + alpha
            return [l * warmup_factor for l in lr]
        return lr
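last_epoch is inherited from the base scheduler, and since scheduler.step() is apparently called once per training iteration here (the warmup spans 500 steps in an iteration-based trainer), it effectively counts iterations. alpha = last_epoch / warmup_iters is then the warmup progress in [0, 1], and warmup_factor * (1 - alpha) + alpha linearly ramps the LR multiplier from warmup_factor up to 1. A worked check of that interpolation (warmup_factor = 1/3, warmup_iters = 500):

for it in (0, 250, 499, 500):
    if it < 500:
        alpha = it / 500
        factor = (1 / 3) * (1 - alpha) + alpha
    else:
        factor = 1.0
    print(it, round(factor, 3))  # 0 -> 0.333, 250 -> 0.667, 499 -> 0.999, 500 -> 1.0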
Hello, I trained COCO on 2 GPUs and found the speed getting slower during training (0.8 s/iter for early iterations, 1.9 s/iter by the end of training). I wonder if you have encountered this problem.
Can you provide a COCO-trained model (SSD300)? I want to use it to run the evaluation code and reproduce the result below:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.229
Thanks a lot!
The loss is unstable, and the error comes after 430 iters.
2018-12-17 13:55:45,716 SSD.trainer INFO: Train dataset size: 16551
2018-12-17 13:55:45,716 SSD.trainer INFO: Start training
2018-12-17 13:55:53,054 SSD.trainer INFO: Iter: 000010, Lr: 0.00069, Cost: 6.79s, Eta: 11:18:23, Loss: 16.110, Regression Loss 2.962, Classification Loss: 13.149
2018-12-17 13:55:59,009 SSD.trainer INFO: Iter: 000020, Lr: 0.00072, Cost: 5.54s, Eta: 10:35:03, Loss: 14.744, Regression Loss 2.703, Classification Loss: 12.041
2018-12-17 13:56:05,192 SSD.trainer INFO: Iter: 000030, Lr: 0.00074, Cost: 5.78s, Eta: 10:29:42, Loss: 13.971, Regression Loss 2.775, Classification Loss: 11.196
2018-12-17 13:56:11,117 SSD.trainer INFO: Iter: 000040, Lr: 0.00077, Cost: 5.54s, Eta: 10:20:49, Loss: 13.053, Regression Loss 2.877, Classification Loss: 10.176
2018-12-17 13:56:17,044 SSD.trainer INFO: Iter: 000050, Lr: 0.00080, Cost: 5.54s, Eta: 10:14:58, Loss: 11.377, Regression Loss 2.694, Classification Loss: 8.683
2018-12-17 13:56:22,996 SSD.trainer INFO: Iter: 000060, Lr: 0.00082, Cost: 5.57s, Eta: 10:11:33, Loss: 12.235, Regression Loss 2.856, Classification Loss: 9.379
2018-12-17 13:56:28,939 SSD.trainer INFO: Iter: 000070, Lr: 0.00085, Cost: 5.56s, Eta: 10:08:50, Loss: 9.304, Regression Loss 2.722, Classification Loss: 6.582
2018-12-17 13:56:34,890 SSD.trainer INFO: Iter: 000080, Lr: 0.00088, Cost: 5.57s, Eta: 10:06:57, Loss: 9.608, Regression Loss 2.600, Classification Loss: 7.008
2018-12-17 13:56:40,899 SSD.trainer INFO: Iter: 000090, Lr: 0.00090, Cost: 5.63s, Eta: 10:06:10, Loss: 9.044, Regression Loss 2.633, Classification Loss: 6.411
2018-12-17 13:56:46,872 SSD.trainer INFO: Iter: 000100, Lr: 0.00093, Cost: 5.59s, Eta: 10:05:02, Loss: 10.493, Regression Loss 2.597, Classification Loss: 7.896
2018-12-17 13:56:52,839 SSD.trainer INFO: Iter: 000110, Lr: 0.00096, Cost: 5.58s, Eta: 10:04:02, Loss: 9.837, Regression Loss 2.504, Classification Loss: 7.333
2018-12-17 13:56:58,813 SSD.trainer INFO: Iter: 000120, Lr: 0.00098, Cost: 5.59s, Eta: 10:03:18, Loss: 8.993, Regression Loss 2.577, Classification Loss: 6.416
2018-12-17 13:57:04,785 SSD.trainer INFO: Iter: 000130, Lr: 0.00101, Cost: 5.58s, Eta: 10:02:36, Loss: 9.234, Regression Loss 2.366, Classification Loss: 6.868
2018-12-17 13:57:10,782 SSD.trainer INFO: Iter: 000140, Lr: 0.00104, Cost: 5.61s, Eta: 10:02:12, Loss: 9.572, Regression Loss 2.397, Classification Loss: 7.175
2018-12-17 13:57:16,768 SSD.trainer INFO: Iter: 000150, Lr: 0.00106, Cost: 5.60s, Eta: 10:01:47, Loss: 10.361, Regression Loss 2.455, Classification Loss: 7.906
2018-12-17 13:57:22,772 SSD.trainer INFO: Iter: 000160, Lr: 0.00109, Cost: 5.62s, Eta: 10:01:32, Loss: 11.323, Regression Loss 2.497, Classification Loss: 8.826
2018-12-17 13:57:28,794 SSD.trainer INFO: Iter: 000170, Lr: 0.00112, Cost: 5.63s, Eta: 10:01:19, Loss: 11.311, Regression Loss 2.368, Classification Loss: 8.942
2018-12-17 13:57:34,801 SSD.trainer INFO: Iter: 000180, Lr: 0.00114, Cost: 5.62s, Eta: 10:01:07, Loss: 14.360, Regression Loss 2.493, Classification Loss: 11.866
2018-12-17 13:57:40,815 SSD.trainer INFO: Iter: 000190, Lr: 0.00117, Cost: 5.62s, Eta: 10:00:55, Loss: 9.740, Regression Loss 2.547, Classification Loss: 7.192
2018-12-17 13:57:46,815 SSD.trainer INFO: Iter: 000200, Lr: 0.00120, Cost: 5.61s, Eta: 10:00:42, Loss: 12.304, Regression Loss 2.444, Classification Loss: 9.860
2018-12-17 13:57:53,002 SSD.trainer INFO: Iter: 000210, Lr: 0.00122, Cost: 5.76s, Eta: 10:01:10, Loss: 9.891, Regression Loss 2.465, Classification Loss: 7.427
2018-12-17 13:57:59,044 SSD.trainer INFO: Iter: 000220, Lr: 0.00125, Cost: 5.66s, Eta: 10:01:18, Loss: 10.401, Regression Loss 2.495, Classification Loss: 7.905
2018-12-17 13:58:05,060 SSD.trainer INFO: Iter: 000230, Lr: 0.00128, Cost: 5.63s, Eta: 10:01:06, Loss: 9.791, Regression Loss 2.253, Classification Loss: 7.538
2018-12-17 13:58:11,072 SSD.trainer INFO: Iter: 000240, Lr: 0.00130, Cost: 5.63s, Eta: 10:00:55, Loss: 9.441, Regression Loss 2.396, Classification Loss: 7.045
2018-12-17 13:58:17,158 SSD.trainer INFO: Iter: 000250, Lr: 0.00133, Cost: 5.68s, Eta: 10:00:57, Loss: 8.072, Regression Loss 2.440, Classification Loss: 5.632
2018-12-17 13:58:23,013 SSD.trainer INFO: Iter: 000260, Lr: 0.00136, Cost: 5.47s, Eta: 10:00:13, Loss: 8.662, Regression Loss 2.442, Classification Loss: 6.221
2018-12-17 13:58:29,099 SSD.trainer INFO: Iter: 000270, Lr: 0.00138, Cost: 5.70s, Eta: 10:00:20, Loss: 8.421, Regression Loss 2.250, Classification Loss: 6.171
2018-12-17 13:58:35,208 SSD.trainer INFO: Iter: 000280, Lr: 0.00141, Cost: 5.72s, Eta: 10:00:30, Loss: 8.425, Regression Loss 2.143, Classification Loss: 6.281
2018-12-17 13:58:41,339 SSD.trainer INFO: Iter: 000290, Lr: 0.00144, Cost: 5.73s, Eta: 10:00:41, Loss: 9.016, Regression Loss 2.448, Classification Loss: 6.568
2018-12-17 13:58:47,479 SSD.trainer INFO: Iter: 000300, Lr: 0.00146, Cost: 5.74s, Eta: 10:00:57, Loss: 11.354, Regression Loss 2.215, Classification Loss: 9.139
2018-12-17 13:58:53,697 SSD.trainer INFO: Iter: 000310, Lr: 0.00149, Cost: 5.81s, Eta: 10:01:23, Loss: 12.369, Regression Loss 2.147, Classification Loss: 10.221
2018-12-17 13:58:59,810 SSD.trainer INFO: Iter: 000320, Lr: 0.00152, Cost: 5.72s, Eta: 10:01:33, Loss: 10.004, Regression Loss 2.278, Classification Loss: 7.726
2018-12-17 13:59:05,849 SSD.trainer INFO: Iter: 000330, Lr: 0.00154, Cost: 5.65s, Eta: 10:01:26, Loss: 7.794, Regression Loss 2.384, Classification Loss: 5.411
2018-12-17 13:59:11,847 SSD.trainer INFO: Iter: 000340, Lr: 0.00157, Cost: 5.61s, Eta: 10:01:11, Loss: 8.697, Regression Loss 2.366, Classification Loss: 6.331
2018-12-17 13:59:17,999 SSD.trainer INFO: Iter: 000350, Lr: 0.00160, Cost: 5.75s, Eta: 10:01:21, Loss: 12.521, Regression Loss 2.570, Classification Loss: 9.951
2018-12-17 13:59:24,357 SSD.trainer INFO: Iter: 000360, Lr: 0.00162, Cost: 5.98s, Eta: 10:02:10, Loss: 12.485, Regression Loss 2.474, Classification Loss: 10.012
2018-12-17 13:59:30,369 SSD.trainer INFO: Iter: 000370, Lr: 0.00165, Cost: 5.63s, Eta: 10:01:55, Loss: 12.791, Regression Loss 2.641, Classification Loss: 10.150
2018-12-17 13:59:36,477 SSD.trainer INFO: Iter: 000380, Lr: 0.00168, Cost: 5.73s, Eta: 10:01:59, Loss: 11.360, Regression Loss 2.661, Classification Loss: 8.699
2018-12-17 13:59:42,585 SSD.trainer INFO: Iter: 000390, Lr: 0.00170, Cost: 5.72s, Eta: 10:01:59, Loss: 11.183, Regression Loss 2.592, Classification Loss: 8.591
2018-12-17 13:59:48,701 SSD.trainer INFO: Iter: 000400, Lr: 0.00173, Cost: 5.72s, Eta: 10:01:59, Loss: 10.166, Regression Loss 2.575, Classification Loss: 7.590
2018-12-17 13:59:54,813 SSD.trainer INFO: Iter: 000410, Lr: 0.00176, Cost: 5.72s, Eta: 10:02:02, Loss: 17.562, Regression Loss 2.554, Classification Loss: 15.008
2018-12-17 14:00:00,942 SSD.trainer INFO: Iter: 000420, Lr: 0.00178, Cost: 5.74s, Eta: 10:02:05, Loss: 10.339, Regression Loss 2.592, Classification Loss: 7.747
2018-12-17 14:00:07,075 SSD.trainer INFO: Iter: 000430, Lr: 0.00181, Cost: 5.75s, Eta: 10:02:10, Loss: 28.599, Regression Loss 9.237, Classification Loss: 19.362
Traceback (most recent call last):
File "train_ssd.py", line 139, in <module>
main()
File "train_ssd.py", line 130, in main
model = train(cfg, args)
File "train_ssd.py", line 76, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, criterion, device, args)
File "/home/ycg/workspace/SSD/ssd/engine/trainer.py", line 78, in do_train
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ycg/workspace/SSD/ssd/modeling/multibox_loss.py", line 31, in forward
mask = box_utils.hard_negative_mining(loss, labels, self.neg_pos_ratio)
File "/home/ycg/workspace/SSD/ssd/utils/box_utils.py", line 123, in hard_negative_mining
_, indexes = loss.sort(dim=1, descending=True)
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
terminate called without an active exception
Hello, I trained for 400,000 iterations on the COCO2014 dataset, and the final AP is close to what https://github.com/lufficc/SSD#details describes. But when I run demo.py with the trained model, I find that only 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck' are matched accurately; many of the later categories are mismatched, e.g. 'dog' -> 'cat', 'zebra' -> 'bear', 'horse' -> 'dog', 'sheep' -> 'horse'. The pattern I see is that the wrong category index is generally one ahead of the correct index.
I wonder whether you saw this problem after training on COCO. If needed I can email you the images that fail. I wrote in Chinese because I can explain it more clearly that way; sorry about that.
Looking forward to your reply.
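A hedged hypothesis for the off-by-one category shift: COCO category ids are non-contiguous (80 categories spread over ids 1-90), so indexing a plain class-name list with a raw training label drifts once the first gap in the id range is passed, which matches "accurate up to 'truck', shifted afterwards". Building the name lookup from the annotation file avoids this (the path below is illustrative):

from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2014.json")
cat_ids = sorted(coco.getCatIds())                     # 80 ids, with gaps
label_to_name = {i + 1: coco.loadCats(cid)[0]["name"]  # index 0 reserved for background
                 for i, cid in enumerate(cat_ids)}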