Pseudo-3D Residual Networks
This repo implements the network structure of P3D[1] with PyTorch. The pre-trained model weights are converted from the caffemodel provided in the author's repo.
NEWS!!!
First,
The prepared weights in the following section are transferred from the author's. However, due to a difference in the pooling operation between Caffe and PyTorch, the same weights will generate feature maps of different sizes. Anyone using this repo should know: this difference has no effect if you use P3D199 for finetuning. Alternatively, you can change the padding value of the pooling layer, after which direct inference also works (the code is already updated).
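The size mismatch comes from Caffe computing pooled output sizes with ceiling division while PyTorch defaults to floor division. A minimal standalone illustration (not code from this repo):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 5, 5, 5)  # (batch, channel, time, height, width)

floor_pool = nn.MaxPool3d(kernel_size=2, stride=2)                 # PyTorch default: floor
ceil_pool = nn.MaxPool3d(kernel_size=2, stride=2, ceil_mode=True)  # Caffe-style: ceil

print(floor_pool(x).shape)  # torch.Size([1, 1, 2, 2, 2])
print(ceil_pool(x).shape)   # torch.Size([1, 1, 3, 3, 3])
```

Setting `ceil_mode=True` (or adjusting the pooling padding, as this repo does) makes PyTorch feature-map sizes match Caffe's.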
Second,
Recently, I got the opportunity to train on the whole Kinetics dataset, so I am trying to train a more powerful P3D model weight based on an input size of 3x16x224x224. I will share the weights after the Anet18 deadline. Please wait.
Requirements:
- pytorch
- numpy
Structure details
In the author's official repo, only P3D-199 is released. Besides this deepest P3D-199, I also implement P3D-63 and P3D-131, modified from ResNet50-3D and ResNet101-3D respectively; these two nets may be more convenient for users with memory-limited GPUs.
Pretrained weights
(Pretrained weights of P3D63 and P3D131 are not yet supported)
(Tips: I am sorry that the download URLs of the pretrained weights were removed for some private reasons. For more information you can email me.) (New tips: model weights are now available.)
1. P3D-199 trained on the Kinetics dataset:
2. P3D-199 trained on Kinetics Optical Flow (TVL1):
Example Code
```python
from __future__ import print_function
import torch
from p3d_model import *

model = P3D199(pretrained=True, num_classes=400)
model = model.cuda()

# if modality=='Flow', change the 2nd dimension 3 ==> 2
data = torch.autograd.Variable(torch.rand(10, 3, 16, 160, 160)).cuda()
out = model(data)
print(out.size(), out)
```
Ablation settings
- ST-Structures:
All P3D models in this repo support various forms of ST-structures, such as `('A','B','C')`, `('A','B')`, and `('A',)`:
```python
model = P3D63(ST_struc=('A', 'B'))
model = P3D131(ST_struc=('C',))
```
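For reference, the three ST-structures correspond to the P3D-A/B/C blocks of the paper: a spatial (1x3x3) and a temporal (3x1x1) convolution combined in series, in parallel, or in series with a spatial skip. A simplified sketch, with the channel counts and the surrounding bottleneck/residual wiring of the real blocks omitted:

```python
import torch
import torch.nn as nn

channels = 8
S = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))  # spatial 1x3x3
T = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # temporal 3x1x1

x = torch.rand(1, channels, 4, 8, 8)  # (batch, channel, time, height, width)
a = T(S(x))          # P3D-A: S and T in series
b = S(x) + T(x)      # P3D-B: S and T in parallel
c = S(x) + T(S(x))   # P3D-C: serial plus a spatial skip
print(a.shape, b.shape, c.shape)  # each torch.Size([1, 8, 4, 8, 8])
```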
- Flow and RGB models:
Set the parameter modality='RGB' for an RGB model, or modality='Flow' for a flow model. The flow model is trained on TVL1 optical flow images.
```python
model = P3D199(pretrained=True, modality='Flow')
```
- Finetune the model:
When finetuning the model on your custom dataset, use `get_optim_policies()` to set different learning rates for different layers. For example, when your dataset is small, you only need to train the several deepest layers: set `slow_rate=0.8` in the code and adjust the corresponding `lr_mult` and `decay_mult`.
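A sketch of how such per-layer policies can feed `torch.optim.SGD`, using a toy model and hand-written policy dicts in place of the real `get_optim_policies()` output (the `lr_mult`/`decay_mult` keys mirror the repo's convention, but treat the exact values here as assumptions):

```python
import torch
import torch.nn as nn

# Toy stand-in for a P3D backbone; the real model comes from p3d_model.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Linear(8, 101))

# get_optim_policies() returns groups shaped roughly like these (values assumed):
policies = [
    {'params': list(model[0].parameters()), 'lr_mult': 1,  'decay_mult': 1},
    {'params': list(model[1].parameters()), 'lr_mult': 10, 'decay_mult': 1},  # new classifier learns faster
]

base_lr, base_wd = 0.001, 5e-4
optimizer = torch.optim.SGD(
    [{'params': g['params'],
      'lr': base_lr * g['lr_mult'],
      'weight_decay': base_wd * g['decay_mult']} for g in policies],
    momentum=0.9)

print([pg['lr'] for pg in optimizer.param_groups])
```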
Please cite this repo if you make use of it.
Experiment Result (Out of the paper)
(All of the following results are produced in an end-to-end manner; some of them outperform the state of the art.)
- Action recognition(mean accuracy on UCF101):
Model / Modality | RGB | Flow | Fusion
---|---|---|---
P3D199 (Sports-1M) | 88.5% | - | -
P3D199 (Kinetics) | 91.2% | 92.4% | 98.3%
- Action localization(mAP on Thumos14):
Steps: per-frame scoring + watershed grouping.
Model | per-frame | localization
---|---|---
P3D199 (Sports-1M) | 0.451 | 0.25
P3D199 (Kinetics) | 0.569 (fused) | 0.307
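To make the two-step pipeline concrete, here is a deliberately simplified stand-in for the grouping stage: threshold per-frame action scores and merge contiguous frames into candidate segments. The actual pipeline uses a watershed-style temporal grouping rather than plain thresholding; this is only an illustration.

```python
def frames_to_segments(scores, thresh=0.5):
    """Merge contiguous frames whose score >= thresh into (start, end) segments."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i                          # segment opens
        elif s < thresh and start is not None:
            segments.append((start, i - 1))    # segment closes
            start = None
    if start is not None:                      # segment runs to the last frame
        segments.append((start, len(scores) - 1))
    return segments

print(frames_to_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.6]))  # [(1, 2), (4, 5)]
```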
Reference:
[1] Z. Qiu, T. Yao, and T. Mei. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. ICCV 2017.