lufficc / SSD
High quality, fast, modular reference implementation of SSD in PyTorch
License: MIT License
In ssd300_voc0712.yaml, ssd512_voc0712.yaml, and ssd300_coco_trainval35k.yaml, some of the settings referenced below do not exist:
# build the training pipeline: image augmentation, then matching of
# ground-truth boxes to priors (encoding them into regression targets)
train_transform = TrainAugmentation(cfg.INPUT.IMAGE_SIZE, cfg.INPUT.PIXEL_MEAN)
target_transform = MatchPrior(PriorBox(cfg)(), cfg.MODEL.CENTER_VARIANCE,
                              cfg.MODEL.SIZE_VARIANCE, cfg.MODEL.THRESHOLD)
My training always hangs at around 10k iterations, so I have never finished the training procedure. Has anyone encountered this?
Below is my screen output.
It doesn't print any error, it just hangs...
2019-01-15 17:46:13,572 SSD.trainer INFO: Iter: 016500, Lr: 0.00100, Cost: 30.35s, Eta: 18:05:08, total_loss: 2.746, classification_loss: 1.881, regression_loss: 0.864
2019-01-15 17:46:44,709 SSD.trainer INFO: Iter: 016550, Lr: 0.00100, Cost: 30.74s, Eta: 18:04:35, total_loss: 3.110, classification_loss: 2.112, regression_loss: 0.998
2019-01-15 17:47:15,735 SSD.trainer INFO: Iter: 016600, Lr: 0.00100, Cost: 30.60s, Eta: 18:04:01, total_loss: 2.336, classification_loss: 1.702, regression_loss: 0.634
2019-01-15 17:47:46,991 SSD.trainer INFO: Iter: 016650, Lr: 0.00100, Cost: 30.91s, Eta: 18:03:28, total_loss: 2.972, classification_loss: 2.040, regression_loss: 0.932
2019-01-15 17:48:18,479 SSD.trainer INFO: Iter: 016700, Lr: 0.00100, Cost: 31.07s, Eta: 18:02:57, total_loss: 2.584, classification_loss: 1.810, regression_loss: 0.774
2019-01-15 17:48:49,426 SSD.trainer INFO: Iter: 016750, Lr: 0.00100, Cost: 30.55s, Eta: 18:02:22, total_loss: 2.723, classification_loss: 1.915, regression_loss: 0.807
Hi,
I can run the demo with the provided SSD300 model, but when using the provided SSD512 config file and weights (configs/ssd512_voc0712.yaml, ssd512_voc0712_mAP80.25.pth) I get this error:
model.load(weights)
File "SSD/ssd/modeling/ssd.py", line 97, in load
self.load_state_dict(torch.load(model, map_location=lambda storage, loc: storage))
File "/anaconda/envs/maskRcnnB/lib/python3.5/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SSD:
Unexpected key(s) in state_dict: "extras.8.weight", "extras.8.bias", "extras.9.weight", "extras.9.bias", "classification_headers.6.weight", "classification_headers.6.bias", "regression_headers.6.weight", "regression_headers.6.bias".
size mismatch for classification_headers.4.bias: copying a param with shape torch.Size([126]) from checkpoint, the shape in current model is torch.Size([84]).
size mismatch for classification_headers.4.weight: copying a param with shape torch.Size([126, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([84, 256, 3, 3]).
size mismatch for regression_headers.4.bias: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for regression_headers.4.weight: copying a param with shape torch.Size([24, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([16, 256, 3, 3]).
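The shapes in this error suggest the model was built from the SSD300 config while the checkpoint is for SSD512 (extra extras/header layers, and 126 = 21 classes × 6 anchors vs 84 = 21 × 4), so double-check that the SSD512 config file is the one actually passed to the script. As a minimal diagnostic sketch (not part of the repo), you can diff the checkpoint against a freshly built model:

import torch

def diff_state_dict(model, weights_path):
    ckpt = torch.load(weights_path, map_location="cpu")
    model_sd = model.state_dict()
    for k in sorted(ckpt.keys() - model_sd.keys()):
        print("unexpected key:", k)
    for k in sorted(model_sd.keys() - ckpt.keys()):
        print("missing key:", k)
    for k in ckpt.keys() & model_sd.keys():  # shared keys: compare shapes
        if ckpt[k].shape != model_sd[k].shape:
            print(f"shape mismatch {k}: checkpoint {tuple(ckpt[k].shape)}, model {tuple(model_sd[k].shape)}")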
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>python setup.py build_ext install
running build_ext
building 'pycocotools._mask' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>
Hi Li,
I've been trying to train a custom SSD but I'm running into some issues. I annotated some 1,200 images with only one class. I used RectLabel, which outputs one XML file per image file. I then created the same directory structure as VOC2007 (Annotations, JPEGImages, ImageSets), saving the files trainval.txt, test.txt, val.txt and {class_name}_trainval.txt, ..., in ImageSets/Main. I then modified configs/ssd300_voc0712.yaml to take NUM_CLASSES: 2 and modified classes_name in voc_dataset.py. (I've also tried the steps you outline here.)
The dataset gets recognized, but when it goes through the DataLoader (each image, boxes, labels) I get the following error:
2019-01-15 09:58:13,478 SSD.trainer INFO: Init from base net vgg16_reducedfc.pth
2019-01-15 09:58:13,580 SSD.trainer INFO: Train dataset size: 752
2019-01-15 09:58:13,580 SSD.trainer INFO: Start training
Traceback (most recent call last):
File "train_ssd.py", line 139, in <module>
main()
File "train_ssd.py", line 130, in main
model = train(cfg, args)
File "train_ssd.py", line 71, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/engine/trainer.py", line 68, in do_train
for iteration, (images, boxes, labels) in enumerate(data_loader):
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
return self._process_next_batch(batch)
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ldap/mariano.metallo/anaconda3/envs/SSD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/data/datasets/your_dataset.py", line 37, in __getitem__
image, boxes, labels = self.transform(image, boxes, labels)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/modeling/data_preprocessing.py", line 33, in __call__
return self.augment(img, boxes, labels)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/transforms/transforms.py", line 55, in __call__
img, boxes, labels = t(img, boxes, labels)
File "/home/ldap/mariano.metallo/03_SSD_Classifier/SSD/ssd/transforms/transforms.py", line 347, in __call__
boxes[:, :2] += (int(left), int(top))
IndexError: too many indices for array
I'm running CUDA 10.
Is there any other step that I'm missing? Thank you very much!
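One common cause of this exact IndexError (not confirmed for this dataset, but worth checking) is an image whose annotation parses to an empty or 1-D boxes array: boxes[:, :2] then fails with "too many indices for array". A hedged sketch of the guard, with parsed as a hypothetical list of (xmin, ymin, xmax, ymax) tuples:

import numpy as np

def to_boxes(parsed):
    # reshape(-1, 4) keeps the array 2-D even when an image has no objects
    return np.array(parsed, dtype=np.float32).reshape(-1, 4)

boxes = to_boxes([])          # shape (0, 4) instead of (0,)
boxes[:, :2] += (10, 10)      # safe now, even for empty annotations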
This project is the best SSD implementation!! But I have a task to detect small objects, and 512×512 input is not suitable. How can I change it to take 1024×1024 input? Can somebody give me a configuration? Thanks so much! Urgent! Waiting online!
python build.py build_ext develop
running build_ext
building 'torch_extension' extension
gcc -pthread -B /home/marco/anaconda2/envs/SSD/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/home/marco/Documenti/github/SSD-1.0.1/ext -I/home/marco/anaconda2/envs/SSD/lib/python3.6/site-packages/torch/lib/include -I/home/marco/anaconda2/envs/SSD/lib/python3.6/site-packages/torch/lib/include/TH -I/home/marco/anaconda2/envs/SSD/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/marco/anaconda2/envs/SSD/include/python3.6m -c /home/marco/Documenti/github/SSD-1.0.1/ext/vision.cpp -o build/temp.linux-x86_64-3.6/home/marco/Documenti/github/SSD-1.0.1/ext/vision.o -DTORCH_EXTENSION_NAME=torch_extension -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/marco/Documenti/github/SSD-1.0.1/ext/nms.h:3,
from /home/marco/Documenti/github/SSD-1.0.1/ext/vision.cpp:2:
/home/marco/Documenti/github/SSD-1.0.1/ext/cpu/vision.h:3:10: fatal error: torch/extension.h: File or directory does not exist
#include <torch/extension.h>
^~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
It seems that inference is very slow in the PostProcessor part.
Hi
I find that before you resume/init_from a pre-trained model, the SSD class has already reset the parameters.
But when I cancel the resume process, it leads to errors like the one below. How can I init the weights without the pretrained model (either vgg_reduced.pth or ...):
File "train_ssd.py", line 139, in
main()
File "train_ssd.py", line 130, in main
model = train(cfg, args)
File "train_ssd.py", line 71, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
File "/home/fmming/test/SSD/SSD-master/ssd/engine/trainer.py", line 76, in do_train
loss_dict = model(images, targets=(boxes, labels))
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/fmming/test/SSD/SSD-master/ssd/modeling/ssd.py", line 86, in forward
regression_loss, classification_loss = self.criterion(confidences, locations, gt_labels, gt_boxes)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/fmming/test/SSD/SSD-master/ssd/modeling/multibox_loss.py", line 31, in forward
mask = box_utils.hard_negative_mining(loss, labels, self.neg_pos_ratio)
File "/home/fmming/test/SSD/SSD-master/ssd/utils/box_utils.py", line 123, in hard_negative_mining
_, indexes = loss.sort(dim=1, descending=True)
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
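A frequent cause of this kind of illegal memory access during sorting is an out-of-range class index rather than a sorting bug: a label >= NUM_CLASSES corrupts the loss tensor upstream. A hedged checking sketch (it assumes the dataset yields (image, boxes, labels), as the tracebacks here show):

import os
import torch

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA errors point at the real failing kernel

def check_labels(dataset, num_classes):
    for i in range(len(dataset)):
        _, _, labels = dataset[i]
        labels = torch.as_tensor(labels)
        if labels.numel() and (labels.min() < 0 or labels.max() >= num_classes):
            print(f"sample {i} has out-of-range labels: {labels.tolist()}")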
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>python setup.py build_ext install
running build_ext
building 'pycocotools._mask' extension
D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -ID:\ai\Anaconda3\envs\ssd\lib\site-packages\numpy\core\include -I../common -ID:\ai\Anaconda3\envs\ssd\include -ID:\ai\Anaconda3\envs\ssd\include /Tcpycocotools/_mask.c /Fobuild\temp.win-amd64-3.7\Release\pycocotools/_mask.obj
_mask.c
d:\ai\anaconda3\envs\ssd\include\pyconfig.h(59): fatal error C1083: Cannot open include file: 'io.h': No such file or directory
error: command 'D:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe' failed with exit status 2
(ssd) D:\ai\Anaconda3\envs\SSD\github\cocoapi\PythonAPI>
@lufficc
First of all, thank you for the implementation. It's very helpful.
But have you trained SSD on COCO yourself? Could you please provide detailed performance results? Furthermore, it would be highly appreciated if you could share the pre-trained model.
Hello, how about the detection speed? I have run demo.py, but I can't reach the speed reported in the paper.
Hey there,
Thank you for your amazing job! But I was wondering, what is the inference performance for batch size 1? I trained SSD on my own datasets and I'm getting ~0.40 s/image, which feels quite slow... I also trained a Faster R-CNN, and even with a ResNeXt-152 backbone I get similar or faster inference times.
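It may help to time the raw network forward and the post-processing separately before comparing against Faster R-CNN, since one of the issues above notes that inference is slow in the PostProcessor part. A rough timing sketch; the function being timed (e.g. a model forward or a post-processor call) and its arguments are stand-ins for your own objects:

import time
import torch

@torch.no_grad()
def avg_time(fn, *args, n=50):
    torch.cuda.synchronize()        # make GPU timing honest
    t0 = time.time()
    for _ in range(n):
        out = fn(*args)
    torch.cuda.synchronize()
    return (time.time() - t0) / n, out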
Hi, a question about the configuration files, e.g. the SSD300 VOC and SSD512 VOC files:
I found that in the 512 file the anchor sizes are not the same as in the 300 file.
I think the only difference between the 300 and 512 files is the input image size; the network model is the same.
So the anchor sizes for anchors in the same layer should not be changed, because the receptive field is the same, right? (The input sizes differ but the network structure is identical, so for anchors in the same layer the configured sizes should stay the same; yet in your 300 and 512 config files the anchor sizes differ. I think anchor size relates to the receptive field, so it should not grow the way you set it.)
Hi, @lufficc!
Great work. How could I use your implementation for image size = 256, for example?
UPD: It is already done, sorry
Traceback (most recent call last):
File "train_ssd.py", line 138, in
main()
File "train_ssd.py", line 119, in main
cfg.merge_from_file(args.config_file)
File "/root/anaconda3/lib/python3.7/site-packages/yacs/config.py", line 172, in merge_from_file
with open(cfg_filename, "r") as f:
IsADirectoryError: [Errno 21] Is a directory: 'configs'
I got this error.
I don't know what these settings in default.py mean:
_C.MODEL.CENTER_VARIANCE = 0.1
_C.MODEL.SIZE_VARIANCE = 0.2
They're used in ssd/utils/box_utils.py when boxes are converted into locations or locations are converted back into boxes, but I don't know why.
_C.TEST.MAX_PER_CLASS = 200
_C.TEST.MAX_PER_IMAGE = -1
I don't know these either, and I can't find where they're used in the project.
Can anyone help? Thanks.
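For the first pair: the variances come from the original Caffe SSD implementation. They rescale the regression targets so the four location offsets have roughly comparable magnitude; box_utils divides by them when encoding boxes into locations and multiplies them back when decoding (dividing a target by a variance is equivalent to up-weighting that term in the loss). For the second pair: MAX_PER_CLASS is the per-class detection cap passed to NMS (it appears as self.max_per_class in ssd/modeling/post_processor.py, visible in a traceback below), and MAX_PER_IMAGE presumably caps total detections per image, with -1 disabling the cap. A worked sketch of the standard encoding and its inverse, assuming boxes and priors in (cx, cy, w, h) form:

import torch

def encode(gt, prior, center_variance=0.1, size_variance=0.2):
    # locations the regression head is trained to predict
    return torch.cat([
        (gt[..., :2] - prior[..., :2]) / (prior[..., 2:] * center_variance),
        torch.log(gt[..., 2:] / prior[..., 2:]) / size_variance,
    ], dim=-1)

def decode(loc, prior, center_variance=0.1, size_variance=0.2):
    # the exact inverse: multiply the variances back in
    return torch.cat([
        loc[..., :2] * center_variance * prior[..., 2:] + prior[..., :2],
        torch.exp(loc[..., 2:] * size_variance) * prior[..., 2:],
    ], dim=-1)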
Hello, your matching strategy is wrong; please check it again.
Hello, thanks for the amazing job!
I have a question about training speed.
I trained on the VOC dataset using two 1080 Ti GPUs and the speed is about 0.75 s/iteration; in my TensorFlow implementation the training speed is about 0.45 s/iteration. Also, I hear some other PyTorch projects can achieve 0.3 s/iteration.
Can you share your training speed?
I noticed that iteration is scaled but batch size is not:
Line 67 in 7691b27
batch_sampler = torch.utils.data.sampler.BatchSampler(sampler=sampler, batch_size=cfg.SOLVER.BATCH_SIZE*args.num_gpus, drop_last=False)
I have downloaded the COCO2014 annotations, but they don't include "annotations/instances_minival2014.json" and "annotations/instances_valminusminival2014.json". Did you make these yourself?
Thanks a lot.
After evaluating, I can find a file in the path 'output/voc_2007_test/' named 'predictions.pth'.
What is it? Is it the same as 'ssd512_vgg_final.pth'?
Does 'predictions.pth' strip something out of 'ssd512_vgg_final.pth'?
Can it be used for prediction the same way as 'ssd512_vgg_final.pth'?
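Most likely (judging by similar repos in this family, not a confirmed statement about this one), predictions.pth is the cached detection output of the evaluation run, saved so metrics can be recomputed without re-running inference; it is not a model checkpoint like ssd512_vgg_final.pth and cannot be loaded as one. You can inspect it directly:

import torch

preds = torch.load("output/voc_2007_test/predictions.pth", map_location="cpu")
print(type(preds), len(preds))  # expected: one detection record per test image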
As the title says, I don't understand why you divide by center_variance and size_variance when computing the regression targets for the loss; this is not mentioned in the original SSD paper.
Hi,
I am trying to train on coco. I used dockerfile to build the image.
FROM nvcr.io/nvidia/pytorch:18.12.1-py3
# FROM pytorch/pytorch:nightly-devel-cuda10.0-cudnn7
RUN pip install tensorboardX yacs tqdm pillow
RUN conda install -y opencv cython
RUN git clone https://github.com/cocodataset/cocoapi.git && cd cocoapi/PythonAPI && python setup.py build_ext install
RUN git clone https://github.com/pytorch/vision.git \
&& cd vision \
&& python setup.py install
COPY . /SSD
WORKDIR /SSD
RUN python /SSD/ext/build.py build_ext develop
CMD [ "bash" ]
But I experience severe CPU usage (almost 100%) and low GPU usage on several machines (20C/40T CPU with a V100 GPU; 4C/8T CPU with an RTX 2080 GPU), and training is extremely slow.
I tried the conda install of PyTorch and the same thing happens.
Meanwhile, another PyTorch 1.0 repo (maskrcnn-benchmark) was fine using its provided Dockerfile.
Is anyone else experiencing the same problem?
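A hedged workaround worth trying: with opencv installed via conda, every DataLoader worker can spin up its own OpenCV/OpenMP thread pool, which matches the symptom of saturated CPUs and a starved GPU. Capping thread counts before creating the loaders often restores throughput:

import cv2
import torch

cv2.setNumThreads(0)       # keep OpenCV single-threaded inside DataLoader workers
torch.set_num_threads(1)   # limit intra-op CPU threading; tune per machine
# setting OMP_NUM_THREADS=1 in the environment before launch has a similar effect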
Hi
Thank you very much for the repository. I'm using gcc 7.3.0 to build the NMS extension. Should that be OK?
I get the following output on stderr:
stderr.log
Also, I had to install Cython before building pycocotools; perhaps you could mention that in the documentation.
Thanks for your amazing implementation!
I want to replace the VGG in your project with a net I define, but it seems that I must load the pre-trained .pth file?
Also, to train a custom dataset, what modifications should I make?
When I run demo.py, I get the following error:
Traceback (most recent call last):
File "demo.py", line 9, in
from ssd.modeling.predictor import Predictor
File "/home/guo/workspace/Object_Detection/SSD/Pytorch_SSD/ssd/modeling/predictor.py", line 3, in
from ssd.modeling.post_processor import PostProcessor
File "/home/guo/workspace/Object_Detection/SSD/Pytorch_SSD/ssd/modeling/post_processor.py", line 3, in
from ssd.utils.nms import boxes_nms
File "/home/guo/workspace/Object_Detection/SSD/Pytorch_SSD/ssd/utils/nms.py", line 1, in
import torch_extension
ModuleNotFoundError: No module named 'torch_extension'
1. When I try to train a model with input size 512, the loss always becomes NaN/Inf. I tried reducing warmup.factor to 0.1 and changing the learning rate to 0.00001; neither seems to work.
It only works well when I use a smaller batch size (<= 8).
I'm confused about that. Batch size should be decided by GPU memory (I run the program on 8x Tesla V100 with 32 GB memory each, which I think is enough for training), so why does it cause such an error in the loss function?
2. I think the batch size should be related to the iteration count, but the iterations are independent of the batch size, so a small batch size leads to a shorter training time (see the arithmetic sketch below).
Do you have any suggestions about these 2 questions?
Thanks a lot.
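The arithmetic behind question 2, as a sketch (the 120,000-iteration schedule and batch size 32 are illustrative defaults; 16551 is the VOC07+12 trainval size from a log below):

def effective_epochs(iterations, batch_size, dataset_size):
    # iteration-based training fixes the step count, so smaller batches
    # simply see less data unless iterations are scaled up to compensate
    return iterations * batch_size / dataset_size

print(effective_epochs(120_000, 32, 16551))  # ~232 epochs
print(effective_epochs(120_000, 8, 16551))   # ~58 epochs: 4x less data seen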
Hi lufficc:
I have a problem. When I change the GPU configuration, e.g. change cuda to cuda:2 because I want to train on the third GPU, the following error happens:
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /home/xxx/lufficc_ssd_shifted_anchor/ext/cuda/nms.cu:103
And the command I use to start the training is
python train_ssd.py --config-file configs/ssd300_voc0712.yaml --save_step 5000 --eval_step 1 --resume output/ssd513_vgg_iteration_005000.pth
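A hedged workaround for this one: compiled CUDA extensions like the nms op typically launch kernels on the current CUDA device, which stays at GPU 0 when you only change the tensor device string to cuda:2, producing exactly this illegal access. Exposing just the target GPU to the process sidesteps the mismatch:

import os

# must run before torch initializes CUDA; afterwards plain "cuda" maps to physical GPU 2
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# alternatively (also before any kernels run): torch.cuda.set_device(2)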
When I ran the eval section, I got a Segmentation fault (core dumped) in the NMS part. I think it is a .so error. My gcc version is gcc 4.8.5 20150623 (Red Hat 4.8.5-4) and my CUDA is 8.0.
I have checked ssd300_voc0712.yaml, but lr=0.000 during the training process. Why?
Also, I followed the multi-GPU training setup with gpus=2, but the training speed is half that of a single GPU. Why?
Hello, when I use the code to train my own datasets, executing the commands in the README step by step, I meet this problem:
2019-01-17 20:52:48,754 SSD.trainer INFO: Iter: 004550, Lr: 0.00100, Cost: 223.93s, Eta: 6 days, 3:12:37, total_loss: 3.036, regression_loss: 0.752, classification_loss: 2.283
2019-01-17 20:56:37,353 SSD.trainer INFO: Iter: 004600, Lr: 0.00100, Cost: 225.57s, Eta: 6 days, 3:08:25, total_loss: 3.188, regression_loss: 1.202, classification_loss: 1.987
2019-01-17 21:00:23,136 SSD.trainer INFO: Iter: 004650, Lr: 0.00100, Cost: 222.75s, Eta: 6 days, 3:03:03, total_loss: 2.773, regression_loss: 0.871, classification_loss: 1.902
2019-01-17 21:04:11,656 SSD.trainer INFO: Iter: 004700, Lr: 0.00100, Cost: 225.49s, Eta: 6 days, 2:58:50, total_loss: 2.975, regression_loss: 0.941, classification_loss: 2.034
2019-01-17 21:08:00,204 SSD.trainer INFO: Iter: 004750, Lr: 0.00100, Cost: 225.52s, Eta: 6 days, 2:54:39, total_loss: 2.737, regression_loss: 0.806, classification_loss: 1.932
2019-01-17 21:11:48,767 SSD.trainer INFO: Iter: 004800, Lr: 0.00100, Cost: 225.54s, Eta: 6 days, 2:50:28, total_loss: 2.875, regression_loss: 1.041, classification_loss: 1.834
2019-01-17 21:15:37,322 SSD.trainer INFO: Iter: 004850, Lr: 0.00100, Cost: 225.53s, Eta: 6 days, 2:46:17, total_loss: 3.588, regression_loss: 1.296, classification_loss: 2.292
2019-01-17 21:19:25,896 SSD.trainer INFO: Iter: 004900, Lr: 0.00100, Cost: 225.55s, Eta: 6 days, 2:42:08, total_loss: 2.428, regression_loss: 0.619, classification_loss: 1.809
2019-01-17 21:23:11,709 SSD.trainer INFO: Iter: 004950, Lr: 0.00100, Cost: 222.78s, Eta: 6 days, 2:36:54, total_loss: 2.225, regression_loss: 0.690, classification_loss: 1.535
2019-01-17 21:27:00,831 SSD.trainer INFO: Iter: 005000, Lr: 0.00100, Cost: 226.09s, Eta: 6 days, 2:32:59, total_loss: 2.845, regression_loss: 1.028, classification_loss: 1.817
2019-01-17 21:27:00,913 SSD.trainer INFO: Saved checkpoint to output/ssd300_vgg_iteration_005000.pth
2019-01-17 21:27:00,914 SSD.inference INFO: Will evaluate 1 dataset(s):
2019-01-17 21:27:00,914 SSD.inference INFO: Evaluating voc_2007_test dataset(75 images):
2019-01-17 21:27:00,914 SSD.inference INFO: Progress on CUDA 0:
0%| | 0/75 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_ssd.py", line 138, in
main()
File "train_ssd.py", line 129, in main
model = train(cfg, args)
File "train_ssd.py", line 71, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, device, args)
File "/home/t/github/SSD/ssd/engine/trainer.py", line 113, in do_train
do_evaluation(cfg, model, cfg.OUTPUT_DIR, distributed=args.distributed)
File "/home/t/github/SSD/ssd/engine/inference.py", line 93, in do_evaluation
_evaluation(cfg, dataset_name, test_dataset, predictor, distributed, output_dir)
File "/home/t/github/SSD/ssd/engine/inference.py", line 62, in _evaluation
output = predictor.predict(image)
File "/home/t/github/SSD/ssd/modeling/predictor.py", line 27, in predict
results = self.post_processor(scores, boxes, width=width, height=height)
File "/home/t/github/SSD/ssd/modeling/post_processor.py", line 66, in call
keep = boxes_nms(boxes, probs, self.iou_threshold, self.max_per_class)
File "/home/t/github/SSD/ssd/utils/nms.py", line 18, in boxes_nms
keep = _nms(boxes, scores, nms_thresh)
RuntimeError: Not compiled with GPU support (nms at /home/t/github/SSD/ext/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f88fb9dfcc5 in /home/t/anaconda3/envs/tf/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7f88f76ed274 in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #2: + 0x13697 (0x7f88f76f8697 in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x1380e (0x7f88f76f880e in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x10a0a (0x7f88f76f5a0a in /home/t/github/SSD/ext/torch_extension.cpython-36m-x86_64-linux-gnu.so)
frame #50: __libc_start_main + 0xf0 (0x7f894e009830 in /lib/x86_64-linux-gnu/libc.so.6)
(tf) t@t-System-Product-Name:~/github/SSD$ python
Since transforms.py is a copy of ssd.pytorch's augmentation code, please refer to these issues:
amdegroot/ssd.pytorch#119
amdegroot/ssd.pytorch#68
https://github.com/lufficc/SSD/blob/master/ssd/transforms/transforms.py#L282
Thanks so much for answering my question, Mr. Author!
I modified config/xxx.yaml, computed the feature_map_size, strides, min/max_size, aspect_ratios and so on, especially modified vgg_ssd.py, and it works.
I have set the lr from 1e-3 down to 1e-5, but the Inf still appears.
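If lowering the learning rate alone does not stop the Inf, one stabilizer to try is gradient clipping in the training step. A hedged sketch: the criterion signature is borrowed from the tracebacks in this thread, while the forward signature and the max_norm cap are illustrative:

import torch

def train_step(model, optimizer, criterion, images, boxes, labels, max_norm=10.0):
    confidence, locations = model(images)  # illustrative forward signature
    regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
    loss = regression_loss + classification_loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # the key line
    optimizer.step()
    return loss.item()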
I submitted results for the ssd300_coco_trainval35k_AP22.9.pth model to the COCO server.
Here are the results
How to submit to test-dev-2015:
use detection server: https://competitions.codalab.org/competitions/5181
choose test-dev2018 (bbox)
COCO-test-dev-2015 server
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.255
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.435
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.263
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.067
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.270
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.236
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.345
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.359
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.098
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.567
Done (t=334.90s)
To compare my non-max suppression against lufficc's, here are my results for the models:
local : COCO-test-dev-2014 (instances_minival2014.json, num_images = 5k)
ssd300_coco_trainval35k_AP22.9.pth model
DONE (t=6.61s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.251
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.428
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.261
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.271
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.419
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.234
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.342
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.358
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.097
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.397
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.562
local : COCO-test-dev-2014 (instances_minival2014.json, num_images = 5k)
ssd300_voc0712_mAP77.83.pth
metric_type = voc07
#name ap
aeroplane 0.825236
bicycle 0.844450
bird 0.759660
boat 0.710224
bottle 0.527462
bus 0.864337
car 0.865986
cat 0.874129
chair 0.617937
cow 0.827866
diningtable 0.786153
dog 0.851901
horse 0.863020
motorbike 0.851469
person 0.802394
pottedplant 0.507871
sheep 0.768501
sofa 0.792603
train 0.870370
tvmonitor 0.755360
---------------------------------
mAP 0.778346
factory = globals()[data['factory']]
dataset = factory(**args)
I don't understand the above two lines.
Hi, I found my training script always stops at the end of the first epoch.
My training script works with a for epoch in range(MAX_EPOCH) loop instead of a sampler. I just want to know how to make my training script keep running (see the sketch below).
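Since the trainer here is iteration-based, a plain epoch loop exhausts the DataLoader iterator once and then stops; recycling the loader keeps batches coming for as many iterations as you need. A minimal sketch:

def infinite_iter(data_loader):
    # restart the loader whenever it runs out, preserving shuffling per pass
    while True:
        yield from data_loader

# usage: batches = infinite_iter(train_loader); call next(batches) each iteration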
The results drop to nearly 71 mAP when using distributed training (4 GPUs).
Hello, I've been reading this code recently and I'm not very familiar with the new DistributedDataParallel.
In the training loop, reduce_loss_dict(...) and save_to_disk = distributed_util.get_rank() == 0 (for saving the model) execute different statements for rank 0 and non-rank-0 processes, but there is no such check when the logger prints each loss and the timing. Watching it run, what's displayed seems to be the loss reduced on rank 0, so how is the logger output of the non-rank-0 processes suppressed?
The code does a rank-0 check in one function, but the logger there is named SSD, while the logger in the training loop is called SSD.trainer. Could you explain? Thanks.
Also, where in the program does execution start running on different GPUs in parallel? Thanks.
I want to see the loss curves, but how do I visualize them in tensorboardX? Which URL? And where is the code in the project?
Why is the parameter last_epoch used? And what is the reason for alpha = self.last_epoch / self.warmup_iters?
class WarmupMultiStepLR(MultiStepLR):
    def __init__(self, optimizer, milestones, gamma=0.1, warmup_factor=1.0 / 3,
                 warmup_iters=500, last_epoch=-1):
        self.warmup_factor = warmup_factor
        self.warmup_iters = warmup_iters
        super().__init__(optimizer, milestones, gamma, last_epoch)

    def get_lr(self):
        lr = super().get_lr()
        if self.last_epoch < self.warmup_iters:
            alpha = self.last_epoch / self.warmup_iters
            warmup_factor = self.warmup_factor * (1 - alpha) + alpha
            return [l * warmup_factor for l in lr]
        return lr
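last_epoch is inherited from the base scheduler, and since scheduler.step() is apparently called once per training iteration here (the warmup spans 500 steps in an iteration-based trainer), it effectively counts iterations. alpha = last_epoch / warmup_iters is then the warmup progress in [0, 1], and warmup_factor * (1 - alpha) + alpha linearly ramps the LR multiplier from warmup_factor up to 1. A worked check of that interpolation (warmup_factor = 1/3, warmup_iters = 500):

for it in (0, 250, 499, 500):
    if it < 500:
        alpha = it / 500
        factor = (1 / 3) * (1 - alpha) + alpha
    else:
        factor = 1.0
    print(it, round(factor, 3))  # 0 -> 0.333, 250 -> 0.667, 499 -> 0.999, 500 -> 1.0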
Hello, I trained COCO on 2 GPUs and found the speed getting slower during training (0.8 s/iter for early iterations, 1.9 s/iter by the end of training). I wonder if you have encountered this problem.
Can you provide a COCO-trained model (SSD300)? I want to use it to run the evaluation code and reproduce the result below:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.229
Thanks a lot!
The loss is unstable, and the error comes after 430 iters.
2018-12-17 13:55:45,716 SSD.trainer INFO: Train dataset size: 16551
2018-12-17 13:55:45,716 SSD.trainer INFO: Start training
2018-12-17 13:55:53,054 SSD.trainer INFO: Iter: 000010, Lr: 0.00069, Cost: 6.79s, Eta: 11:18:23, Loss: 16.110, Regression Loss 2.962, Classification Loss: 13.149
2018-12-17 13:55:59,009 SSD.trainer INFO: Iter: 000020, Lr: 0.00072, Cost: 5.54s, Eta: 10:35:03, Loss: 14.744, Regression Loss 2.703, Classification Loss: 12.041
2018-12-17 13:56:05,192 SSD.trainer INFO: Iter: 000030, Lr: 0.00074, Cost: 5.78s, Eta: 10:29:42, Loss: 13.971, Regression Loss 2.775, Classification Loss: 11.196
2018-12-17 13:56:11,117 SSD.trainer INFO: Iter: 000040, Lr: 0.00077, Cost: 5.54s, Eta: 10:20:49, Loss: 13.053, Regression Loss 2.877, Classification Loss: 10.176
2018-12-17 13:56:17,044 SSD.trainer INFO: Iter: 000050, Lr: 0.00080, Cost: 5.54s, Eta: 10:14:58, Loss: 11.377, Regression Loss 2.694, Classification Loss: 8.683
2018-12-17 13:56:22,996 SSD.trainer INFO: Iter: 000060, Lr: 0.00082, Cost: 5.57s, Eta: 10:11:33, Loss: 12.235, Regression Loss 2.856, Classification Loss: 9.379
2018-12-17 13:56:28,939 SSD.trainer INFO: Iter: 000070, Lr: 0.00085, Cost: 5.56s, Eta: 10:08:50, Loss: 9.304, Regression Loss 2.722, Classification Loss: 6.582
2018-12-17 13:56:34,890 SSD.trainer INFO: Iter: 000080, Lr: 0.00088, Cost: 5.57s, Eta: 10:06:57, Loss: 9.608, Regression Loss 2.600, Classification Loss: 7.008
2018-12-17 13:56:40,899 SSD.trainer INFO: Iter: 000090, Lr: 0.00090, Cost: 5.63s, Eta: 10:06:10, Loss: 9.044, Regression Loss 2.633, Classification Loss: 6.411
2018-12-17 13:56:46,872 SSD.trainer INFO: Iter: 000100, Lr: 0.00093, Cost: 5.59s, Eta: 10:05:02, Loss: 10.493, Regression Loss 2.597, Classification Loss: 7.896
2018-12-17 13:56:52,839 SSD.trainer INFO: Iter: 000110, Lr: 0.00096, Cost: 5.58s, Eta: 10:04:02, Loss: 9.837, Regression Loss 2.504, Classification Loss: 7.333
2018-12-17 13:56:58,813 SSD.trainer INFO: Iter: 000120, Lr: 0.00098, Cost: 5.59s, Eta: 10:03:18, Loss: 8.993, Regression Loss 2.577, Classification Loss: 6.416
2018-12-17 13:57:04,785 SSD.trainer INFO: Iter: 000130, Lr: 0.00101, Cost: 5.58s, Eta: 10:02:36, Loss: 9.234, Regression Loss 2.366, Classification Loss: 6.868
2018-12-17 13:57:10,782 SSD.trainer INFO: Iter: 000140, Lr: 0.00104, Cost: 5.61s, Eta: 10:02:12, Loss: 9.572, Regression Loss 2.397, Classification Loss: 7.175
2018-12-17 13:57:16,768 SSD.trainer INFO: Iter: 000150, Lr: 0.00106, Cost: 5.60s, Eta: 10:01:47, Loss: 10.361, Regression Loss 2.455, Classification Loss: 7.906
2018-12-17 13:57:22,772 SSD.trainer INFO: Iter: 000160, Lr: 0.00109, Cost: 5.62s, Eta: 10:01:32, Loss: 11.323, Regression Loss 2.497, Classification Loss: 8.826
2018-12-17 13:57:28,794 SSD.trainer INFO: Iter: 000170, Lr: 0.00112, Cost: 5.63s, Eta: 10:01:19, Loss: 11.311, Regression Loss 2.368, Classification Loss: 8.942
2018-12-17 13:57:34,801 SSD.trainer INFO: Iter: 000180, Lr: 0.00114, Cost: 5.62s, Eta: 10:01:07, Loss: 14.360, Regression Loss 2.493, Classification Loss: 11.866
2018-12-17 13:57:40,815 SSD.trainer INFO: Iter: 000190, Lr: 0.00117, Cost: 5.62s, Eta: 10:00:55, Loss: 9.740, Regression Loss 2.547, Classification Loss: 7.192
2018-12-17 13:57:46,815 SSD.trainer INFO: Iter: 000200, Lr: 0.00120, Cost: 5.61s, Eta: 10:00:42, Loss: 12.304, Regression Loss 2.444, Classification Loss: 9.860
2018-12-17 13:57:53,002 SSD.trainer INFO: Iter: 000210, Lr: 0.00122, Cost: 5.76s, Eta: 10:01:10, Loss: 9.891, Regression Loss 2.465, Classification Loss: 7.427
2018-12-17 13:57:59,044 SSD.trainer INFO: Iter: 000220, Lr: 0.00125, Cost: 5.66s, Eta: 10:01:18, Loss: 10.401, Regression Loss 2.495, Classification Loss: 7.905
2018-12-17 13:58:05,060 SSD.trainer INFO: Iter: 000230, Lr: 0.00128, Cost: 5.63s, Eta: 10:01:06, Loss: 9.791, Regression Loss 2.253, Classification Loss: 7.538
2018-12-17 13:58:11,072 SSD.trainer INFO: Iter: 000240, Lr: 0.00130, Cost: 5.63s, Eta: 10:00:55, Loss: 9.441, Regression Loss 2.396, Classification Loss: 7.045
2018-12-17 13:58:17,158 SSD.trainer INFO: Iter: 000250, Lr: 0.00133, Cost: 5.68s, Eta: 10:00:57, Loss: 8.072, Regression Loss 2.440, Classification Loss: 5.632
2018-12-17 13:58:23,013 SSD.trainer INFO: Iter: 000260, Lr: 0.00136, Cost: 5.47s, Eta: 10:00:13, Loss: 8.662, Regression Loss 2.442, Classification Loss: 6.221
2018-12-17 13:58:29,099 SSD.trainer INFO: Iter: 000270, Lr: 0.00138, Cost: 5.70s, Eta: 10:00:20, Loss: 8.421, Regression Loss 2.250, Classification Loss: 6.171
2018-12-17 13:58:35,208 SSD.trainer INFO: Iter: 000280, Lr: 0.00141, Cost: 5.72s, Eta: 10:00:30, Loss: 8.425, Regression Loss 2.143, Classification Loss: 6.281
2018-12-17 13:58:41,339 SSD.trainer INFO: Iter: 000290, Lr: 0.00144, Cost: 5.73s, Eta: 10:00:41, Loss: 9.016, Regression Loss 2.448, Classification Loss: 6.568
2018-12-17 13:58:47,479 SSD.trainer INFO: Iter: 000300, Lr: 0.00146, Cost: 5.74s, Eta: 10:00:57, Loss: 11.354, Regression Loss 2.215, Classification Loss: 9.139
2018-12-17 13:58:53,697 SSD.trainer INFO: Iter: 000310, Lr: 0.00149, Cost: 5.81s, Eta: 10:01:23, Loss: 12.369, Regression Loss 2.147, Classification Loss: 10.221
2018-12-17 13:58:59,810 SSD.trainer INFO: Iter: 000320, Lr: 0.00152, Cost: 5.72s, Eta: 10:01:33, Loss: 10.004, Regression Loss 2.278, Classification Loss: 7.726
2018-12-17 13:59:05,849 SSD.trainer INFO: Iter: 000330, Lr: 0.00154, Cost: 5.65s, Eta: 10:01:26, Loss: 7.794, Regression Loss 2.384, Classification Loss: 5.411
2018-12-17 13:59:11,847 SSD.trainer INFO: Iter: 000340, Lr: 0.00157, Cost: 5.61s, Eta: 10:01:11, Loss: 8.697, Regression Loss 2.366, Classification Loss: 6.331
2018-12-17 13:59:17,999 SSD.trainer INFO: Iter: 000350, Lr: 0.00160, Cost: 5.75s, Eta: 10:01:21, Loss: 12.521, Regression Loss 2.570, Classification Loss: 9.951
2018-12-17 13:59:24,357 SSD.trainer INFO: Iter: 000360, Lr: 0.00162, Cost: 5.98s, Eta: 10:02:10, Loss: 12.485, Regression Loss 2.474, Classification Loss: 10.012
2018-12-17 13:59:30,369 SSD.trainer INFO: Iter: 000370, Lr: 0.00165, Cost: 5.63s, Eta: 10:01:55, Loss: 12.791, Regression Loss 2.641, Classification Loss: 10.150
2018-12-17 13:59:36,477 SSD.trainer INFO: Iter: 000380, Lr: 0.00168, Cost: 5.73s, Eta: 10:01:59, Loss: 11.360, Regression Loss 2.661, Classification Loss: 8.699
2018-12-17 13:59:42,585 SSD.trainer INFO: Iter: 000390, Lr: 0.00170, Cost: 5.72s, Eta: 10:01:59, Loss: 11.183, Regression Loss 2.592, Classification Loss: 8.591
2018-12-17 13:59:48,701 SSD.trainer INFO: Iter: 000400, Lr: 0.00173, Cost: 5.72s, Eta: 10:01:59, Loss: 10.166, Regression Loss 2.575, Classification Loss: 7.590
2018-12-17 13:59:54,813 SSD.trainer INFO: Iter: 000410, Lr: 0.00176, Cost: 5.72s, Eta: 10:02:02, Loss: 17.562, Regression Loss 2.554, Classification Loss: 15.008
2018-12-17 14:00:00,942 SSD.trainer INFO: Iter: 000420, Lr: 0.00178, Cost: 5.74s, Eta: 10:02:05, Loss: 10.339, Regression Loss 2.592, Classification Loss: 7.747
2018-12-17 14:00:07,075 SSD.trainer INFO: Iter: 000430, Lr: 0.00181, Cost: 5.75s, Eta: 10:02:10, Loss: 28.599, Regression Loss 9.237, Classification Loss: 19.362
Traceback (most recent call last):
File "train_ssd.py", line 139, in <module>
main()
File "train_ssd.py", line 130, in main
model = train(cfg, args)
File "train_ssd.py", line 76, in train
return do_train(cfg, model, train_loader, optimizer, scheduler, criterion, device, args)
File "/home/ycg/workspace/SSD/ssd/engine/trainer.py", line 78, in do_train
regression_loss, classification_loss = criterion(confidence, locations, labels, boxes)
File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ycg/workspace/SSD/ssd/modeling/multibox_loss.py", line 31, in forward
mask = box_utils.hard_negative_mining(loss, labels, self.neg_pos_ratio)
File "/home/ycg/workspace/SSD/ssd/utils/box_utils.py", line 123, in hard_negative_mining
_, indexes = loss.sort(dim=1, descending=True)
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
terminate called without an active exception
Hello, I trained for 400,000 iterations on the COCO2014 dataset, and the final AP is close to what https://github.com/lufficc/SSD#details describes. But when I run demo.py with the trained model, I find that only 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck' are matched accurately; many of the later categories are mismatched, e.g. 'dog' -> 'cat', 'zebra' -> 'bear', 'horse' -> 'dog', 'sheep' -> 'horse'. The pattern I see is that the wrong category index is generally one ahead of the correct index.
I wonder whether you saw this problem after training on COCO. If needed I can email you the images that fail. I wrote in Chinese because I can explain it more clearly that way; sorry about that.
Looking forward to your reply.
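A hedged hypothesis for the off-by-one category shift: COCO category ids are non-contiguous (80 categories spread over ids 1-90), so indexing a plain class-name list with a raw training label drifts once the first gap in the id range is passed, which matches "accurate up to 'truck', shifted afterwards". Building the name lookup from the annotation file avoids this (the path below is illustrative):

from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2014.json")
cat_ids = sorted(coco.getCatIds())                     # 80 ids, with gaps
label_to_name = {i + 1: coco.loadCats(cid)[0]["name"]  # index 0 reserved for background
                 for i, cid in enumerate(cat_ids)}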