Comments (8)
Hi, load_from loads a checkpoint into the entire model, while pretrained in the detector only covers a part of it, such as the backbone or the head in the detector.
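As a rough illustration of this distinction, the two settings might sit in a config like the sketch below. This is not the shipped config: the exact position of the pretrained key is assumed from the description above, and both checkpoint locations are the ones that appear in the logs later in this thread.

```python
# Sketch of where the two checkpoint settings live (key placement is an assumption).
model = dict(
    detector=dict(
        # Weights for one part of the detector only (here: the ResNet-50 backbone).
        pretrained='torchvision://resnet50',
    ))

# Weights for the entire model, loaded after the model has been built.
load_from = (
    'http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/'
    'faster_rcnn_r50_fpn_2x_coco/'
    'faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth')
```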
Btw, our config is for 8 GPUs. How many GPUs are you using?
Please post the log with the NaN loss here so that we have more information to help you.
Thanks.
from mmtracking.
Thanks for your reply! I use 1 GPU with lr = 0.02/8. My loss log is below (the losses are all NaN from the first printed iteration):
#======================================================
loading annotations into memory...
Done (t=1.50s)
creating index...
index created!
loading annotations into memory...
Done (t=1.26s)
creating index...
index created!
2021-01-15 16:22:54,112 - mmdet - INFO - load checkpoint from /data1/code/mmtracking/pretrain_models/faster-rcnn-coco.pth
2021-01-15 16:22:54,782 - mmdet - WARNING - The model and loaded state dict do not match exactly
size mismatch for roi_head.bbox_head.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([2, 1024]).
size mismatch for roi_head.bbox_head.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([2]).
size mismatch for roi_head.bbox_head.fc_reg.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([4, 1024]).
size mismatch for roi_head.bbox_head.fc_reg.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([4]).
2021-01-15 16:22:54,787 - mmdet - INFO - Start running, host: root@train-mmtracking, work_dir: /data1/code/mmtracking/work_dirs/faster-rcnn_r50_fpn_4e_mot17-half
2021-01-15 16:22:54,787 - mmdet - INFO - workflow: [('train', 1)], max: 4 epochs
2021-01-15 16:23:14,498 - mmdet - INFO - Epoch [1][50/3996] lr: 1.238e-03, eta: 1:44:20, time: 0.393, data_time: 0.054, memory: 3692, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 40.2724, loss_bbox: nan, loss: nan
#====================================================================
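The size-mismatch warnings in the log above are what one would expect when a COCO checkpoint is loaded into a single-class head: layers whose shapes differ are not copied and keep their fresh initialization. A minimal sketch of where the shapes come from, assuming the classification layer has one extra output for background and the regression layer has four outputs per class (bbox_head_shapes is a made-up name for illustration):

```python
def bbox_head_shapes(num_classes, fc_dim=1024):
    """Return the expected weight shapes of the classification and regression layers."""
    return {
        'fc_cls.weight': (num_classes + 1, fc_dim),  # +1 output for background
        'fc_reg.weight': (4 * num_classes, fc_dim),  # 4 box deltas per class
    }


coco = bbox_head_shapes(80)  # checkpoint side (80 COCO classes)
mot = bbox_head_shapes(1)    # current model side (num_classes=1)
print(coco)
print(mot)
```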
Here is my config file, where I pre-downloaded the model from the provided URL:
#==========================================================
USE_MMDET = True
_base_ = [
    '../_base_/models/faster_rcnn_r50_fpn.py',
    '../_base_/datasets/mot_challenge_det.py', '../_base_/default_runtime.py'
]
model = dict(
    # noqa: E251
    detector=dict(
        rpn_head=dict(bbox_coder=dict(clip_border=False)),
        roi_head=dict(
            bbox_head=dict(bbox_coder=dict(
                clip_border=False), num_classes=1))))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=100,
    warmup_ratio=1.0 / 100,
    step=[3])
total_epochs = 4
load_from = ('/data1/code/mmtracking/pretrain_models/faster-rcnn-coco.pth')
data_root = '/data1/dataset/MOT17_coco/'
img_root = '/data1/dataset/MOT17/'
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        ann_file=data_root + 'annotations/half-train_cocoformat.json',
        img_prefix=img_root + 'train',
        classes=('pedestrian', ),
    ),
    val=dict(
        ann_file=data_root + 'annotations/half-val_cocoformat.json',
        img_prefix=img_root + 'train',
        classes=('pedestrian', ),
    ),
    test=dict(
        ann_file=data_root + 'annotations/half-val_cocoformat.json',
        img_prefix=img_root + 'train',
        classes=('pedestrian', ),
    ))
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
#==============================================
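The lr=0.0025 in the config above comes from scaling the default learning rate with the batch size. A small sketch of that arithmetic, assuming the default of 0.02 is meant for 8 GPUs with 2 samples per GPU (the 8 GPUs and the 0.02 are stated earlier in this thread; the 2 samples per GPU for the default setup and the helper name scale_lr are assumptions):

```python
import math


def scale_lr(base_lr, base_gpus, base_samples_per_gpu, gpus, samples_per_gpu):
    """Linearly scale the learning rate with the total batch size."""
    base_batch = base_gpus * base_samples_per_gpu
    batch = gpus * samples_per_gpu
    return base_lr * batch / base_batch


lr = scale_lr(0.02, base_gpus=8, base_samples_per_gpu=2, gpus=1, samples_per_gpu=2)
print(lr)
```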
I train the detector using the command:
python tools/train.py configs/det/faster-rcnn_r50_fpn_4e_mot17-half.py
from mmtracking.
I tried training on 1 GPU and it ran successfully.
Here is my log:
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth'
resume_from = None
workflow = [('train', 1)]
USE_MMDET = True
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=100,
warmup_ratio=0.01,
step=[3])
total_epochs = 4
work_dir = './work_dirs/faster-rcnn_r50_fpn_4e_mot17-half'
gpu_ids = range(0, 1)
2021-01-15 22:53:17,916 - mmdet - INFO - load model from: torchvision://resnet50
2021-01-15 22:53:18,410 - mmdet - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
Done (t=2.22s)
creating index...
index created!
loading annotations into memory...
Done (t=2.34s)
creating index...
index created!
2021-01-15 22:53:35,455 - mmdet - INFO - load checkpoint from http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth
2021-01-15 22:53:36,025 - mmdet - WARNING - The model and loaded state dict do not match exactly
size mismatch for roi_head.bbox_head.fc_cls.weight: copying a param with shape torch.Size([81, 1024]) from checkpoint, the shape in current model is torch.Size([2, 1024]).
size mismatch for roi_head.bbox_head.fc_cls.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([2]).
size mismatch for roi_head.bbox_head.fc_reg.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([4, 1024]).
size mismatch for roi_head.bbox_head.fc_reg.bias: copying a param with shape torch.Size([320]) from checkpoint, the shape in current model is torch.Size([4]).
2021-01-15 22:53:36,031 - mmdet - INFO - Start running, host: jmpang@rtx2080ti-216, work_dir: /new-pool/pangjiangmiao/codebase/mmtracking/work_dirs/faster-rcnn_r50_fpn_4e_mot17-half
2021-01-15 22:53:36,031 - mmdet - INFO - workflow: [('train', 1)], max: 4 epochs
2021-01-15 22:53:58,280 - mmdet - INFO - Epoch [1][50/3996] lr: 1.238e-03, eta: 1:57:55, time: 0.444, data_time: 0.056, memory: 3590, loss_rpn_cls: 0.1135, loss_rpn_bbox: 0.0792, loss_cls: 0.4331, acc: 80.8926, loss_bbox: 0.3878, loss: 1.0136
2021-01-15 22:54:09,850 - mmdet - INFO - Epoch [1][100/3996] lr: 2.475e-03, eta: 1:29:21, time: 0.231, data_time: 0.007, memory: 3688, loss_rpn_cls: 0.0556, loss_rpn_bbox: 0.0642, loss_cls: 0.2840, acc: 88.4629, loss_bbox: 0.2511, loss: 0.6549
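The lr values in the two log lines above are consistent with linear warmup towards lr=0.0025. A sketch of the calculation, assuming the warmup rule lr * (1 - (1 - i / warmup_iters) * (1 - warmup_ratio)) with a zero-based iteration counter i, so that the line printed at iteration 50 uses i = 49 (the function name is illustrative only):

```python
def linear_warmup_lr(base_lr, cur_iter, warmup_iters=100, warmup_ratio=0.01):
    """Learning rate during linear warmup; returns base_lr once warmup is over."""
    if cur_iter >= warmup_iters:
        return base_lr
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)


for logged_iter in (50, 100):
    # The counter is assumed to be zero-based when the log line is written.
    lr = linear_warmup_lr(0.0025, logged_iter - 1)
    print(f'iter {logged_iter}: lr = {lr:.3e}')
```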
Did you modify anything else except the learning rate?
from mmtracking.
It seems that you are not using the pre-trained model in mmdetection. Which model are you using?
from mmtracking.
I downloaded the model from 'http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth' and put it at the path that I set in load_from.
And I use the log from mmtracking/logs/det/faster-rcnn_r50_fpn_4e_mot17-half.py, which contains the setting USE_MMDET = True.
I did not modify anything else.
It seems that you trained the model using mmdetection, not mmtracking, since your log looks different from the default logs in mmtracking/logs/det.
from mmtracking.
Yes, USE_MMDET = True directly switches the API to mmdet.
It is strange that I cannot reproduce your case. Can you double-check the config and figure out the differences between your experiment and mine?
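One way to find such differences is to load both configs into plain dictionaries and compare them key by key. A minimal sketch in pure Python; diff_config is a hypothetical helper, and the two small dictionaries below are made-up stand-ins for the real configs:

```python
def diff_config(a, b, prefix=''):
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing entry."""
    missing = object()  # sentinel for keys present on only one side
    diffs = {}
    for key in sorted(set(a) | set(b)):
        path = f'{prefix}{key}'
        va, vb = a.get(key, missing), b.get(key, missing)
        if isinstance(va, dict) and isinstance(vb, dict):
            # Recurse into nested sections such as optimizer or data.
            diffs.update(diff_config(va, vb, prefix=path + '.'))
        elif va != vb:
            diffs[path] = (None if va is missing else va,
                           None if vb is missing else vb)
    return diffs


# Made-up example values, not taken from either experiment.
mine = dict(optimizer=dict(type='SGD', lr=0.0025), data=dict(samples_per_gpu=2))
yours = dict(optimizer=dict(type='SGD', lr=0.0025), data=dict(samples_per_gpu=4))
print(diff_config(mine, yours))
```

Running the same comparison on the fully resolved configs of both experiments would also surface changes made inside the inherited base files.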
from mmtracking.
Thank you for your response!
I checked the config and found that I had accidentally changed the data_pipeline in _base_, which caused the NaN loss. Now the training loss matches your log.
from mmtracking.
Hi @gsygsygsy123,
What did you change in the data_pipeline? I am stuck on the same error.
from mmtracking.