This repository is an implementation of C3D-caffe on TensorFlow, using models directly converted from the original C3D-caffe.
Be aware that there is about a 5% video-level accuracy gap on UCF101 split1 between our TensorFlow implementation and the original C3D-caffe.
I added `ensemble.py` to ensemble a multi-stream network including optical flow, RGB, and selected active frames (SAF, my own method). You can use `activeframe.py` to extract active frames before feeding them into the C3D networks.
Then you can use `ensemble.py` to fuse these networks. The results of my trials are shown in the table below, with a sketch of the fusion step after the table:
Dataset | Networks | Accuracy |
---|---|---|
HMDB51 | RGB+SAF+OF | 66.2% |
UCF101 | RGB+SAF+OF | 94.1% |
KTH | RGB+SAF+OF | 93.8% |
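For reference, the late fusion itself can be as simple as a weighted average of per-stream softmax scores. Below is a minimal sketch of that idea, not the exact logic of `ensemble.py`; the file names, array shapes, and uniform weights are assumptions.

```python
import numpy as np

def fuse_streams(score_files, weights=None):
    """Average per-stream softmax scores and take the argmax per video.
    Each file is assumed to hold an (n_videos, n_classes) score array."""
    scores = [np.load(f) for f in score_files]
    weights = weights if weights is not None else [1.0] * len(scores)
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused.argmax(axis=1)  # predicted class index per video

# Hypothetical score files for the three streams:
preds = fuse_streams(["rgb_scores.npy", "saf_scores.npy", "of_scores.npy"])
```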
- Have tensorflow >= 1.2 installed
- You must have installed the following two Python libs: a) tensorflow b) Pillow
- You must have downloaded the UCF101 (Action Recognition Data Set)
- Each avi file is decoded at 5 FPS (the rate is up to you) into its own directory; a Python sketch of this step follows the list.
  - You can use the `./list/convert_video_to_images.sh` script to decode the UCF101 video files:
    - run `./list/convert_video_to_images.sh .../UCF101 5`
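If you would rather not use the shell script, a rough Python equivalent is sketched below. It assumes `ffmpeg` is on your PATH; the output naming pattern is an assumption, so check `convert_video_to_images.sh` for the exact layout it produces.

```python
import os
import subprocess

def decode_videos(root, fps=5):
    """Decode every .avi under `root` at `fps` frames per second
    into a directory named after the video."""
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith('.avi'):
                src = os.path.join(dirpath, name)
                out_dir = os.path.splitext(src)[0]  # strip .avi
                os.makedirs(out_dir, exist_ok=True)
                subprocess.run(['ffmpeg', '-i', src, '-r', str(fps),
                                os.path.join(out_dir, '%05d.jpg')],
                               check=True)
```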
- Generate {train,test}.list files in the `list` directory. Each line corresponds to an "image directory" and a class (zero-based), as in the example below.
  - You can use the `./list/convert_images_to_list.sh` script to generate the {train,test}.list files for the dataset:
    - run `./list/convert_images_to_list.sh .../dataset_images 4`; this will generate `test.list` and `train.list` files by a factor of 4 inside the root folder
```
database/ucf101/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01 0
database/ucf101/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c02 0
database/ucf101/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c03 0
database/ucf101/train/ApplyLipstick/v_ApplyLipstick_g01_c01 1
database/ucf101/train/ApplyLipstick/v_ApplyLipstick_g01_c02 1
database/ucf101/train/ApplyLipstick/v_ApplyLipstick_g01_c03 1
database/ucf101/train/Archery/v_Archery_g01_c01 2
database/ucf101/train/Archery/v_Archery_g01_c02 2
database/ucf101/train/Archery/v_Archery_g01_c03 2
database/ucf101/train/Archery/v_Archery_g01_c04 2
database/ucf101/train/BabyCrawling/v_BabyCrawling_g01_c01 3
database/ucf101/train/BabyCrawling/v_BabyCrawling_g01_c02 3
database/ucf101/train/BabyCrawling/v_BabyCrawling_g01_c03 3
database/ucf101/train/BabyCrawling/v_BabyCrawling_g01_c04 3
database/ucf101/train/BalanceBeam/v_BalanceBeam_g01_c01 4
database/ucf101/train/BalanceBeam/v_BalanceBeam_g01_c02 4
database/ucf101/train/BalanceBeam/v_BalanceBeam_g01_c03 4
database/ucf101/train/BalanceBeam/v_BalanceBeam_g01_c04 4
...
```
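Reading these .list files back in Python is straightforward; the helper below is a minimal sketch (the function name is mine, not taken from this repository):

```python
def read_list(path):
    """Parse a {train,test}.list file into (image_directory, class) pairs.
    Each line is "<image directory> <zero-based class index>"."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                directory, label = line.rsplit(' ', 1)
                entries.append((directory, int(label)))
    return entries

train_entries = read_list('list/train.list')
```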
- `python train_c3d_ucf101.py` will train the C3D model. The trained model will be saved in the `models` directory.
- `python predict_c3d_ucf101.py` will test the C3D model on a validation data set.
- `cd ./C3D-tensorflow-1.0 && python Random_clip_valid.py` will compute the random-clip accuracy on the UCF101 test set with the provided `sports1m_finetuning_ucf101.model`.
- The `C3D-tensorflow-1.0/Random_clip_valid.py` code is compatible with tensorflow 1.0+ and differs slightly from the old repository.
- IMPORTANT NOTE: when you load `sports1m_finetuning_ucf101.model`, you should apply a transpose operation like `pool5 = tf.transpose(pool5, perm=[0,1,4,2,3])`; in `Random_clip_valid.py` this appears as `["transpose", [0, 1, 4, 2, 3]]`. If you load `conv3d_deepnetA_sport1m_iter_1900000_TF.model` or `c3d_ucf101_finetune_whole_iter_20000_TF.model` instead, you don't need the transpose operation; just comment out that line of code.
- Note:
  1. All reported results are specific to UCF101 split1 (train videos: 9537, test videos: 3783).
  2. All results are video-level accuracy unless stated otherwise.
  3. We follow the same way to extract clips from a video as the C3D paper says: 'To extract C3D feature, a video is split into 16 frame long clips with a 8-frame overlap between two consecutive clips. These clips are passed to the C3D network to extract fc6 activations. These clip fc6 activations are averaged to form a 4096-dim video descriptor which is then followed by an L2-normalization.' (See the sketch below.)
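As a concrete reading of that scheme, the sketch below builds the video descriptor from 16-frame clips with an 8-frame stride; `extract_fc6` is a hypothetical wrapper around the network's fc6 output, not a function from this repository.

```python
import numpy as np

def video_descriptor(frames, extract_fc6, clip_len=16, stride=8):
    """frames: (num_frames, H, W, 3) array; returns a 4096-dim descriptor."""
    clips = [frames[i:i + clip_len]
             for i in range(0, len(frames) - clip_len + 1, stride)]
    fc6 = np.stack([extract_fc6(c) for c in clips])  # (n_clips, 4096)
    desc = fc6.mean(axis=0)                          # average over clips
    return desc / np.linalg.norm(desc)               # L2-normalization
```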
- C3D as feature extractor:
platform | feature extractor model | fc6+SVM | fc6+SVM+L2 norm |
---|---|---|---|
caffe | conv3d_deepnetA_sport1m_iter_1900000.caffemodel | 81.99% | 83.39% |
tensorflow | conv3d_deepnetA_sport1m_iter_1900000_TF.model | 79.38% | 81.44% |
tensorflow | c3d_ucf101_finetune_whole_iter_20000_TF.model | 79.67% | 81.33% |
tensorflow | sports1m_finetuning_ucf101.model | 82.73% | 85.35% |
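One way to reproduce the fc6+SVM rows above, assuming scikit-learn and fc6 descriptors pre-computed and saved as .npy files (the file names and C value are assumptions, not the exact evaluation script):

```python
import numpy as np
from sklearn.svm import LinearSVC

X_train = np.load('fc6_train.npy')   # (n_videos, 4096) fc6 descriptors
y_train = np.load('labels_train.npy')
clf = LinearSVC(C=1.0)               # linear SVM on fc6 features
clf.fit(X_train, y_train)
acc = clf.score(np.load('fc6_test.npy'), np.load('labels_test.npy'))
print('video-level accuracy: %.2f%%' % (100 * acc))
```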
- Finetune the C3D network on UCF101 split1 with a pre-trained model:
platform | pre-trained model | train-strategy | video-accuracy | clip-accuracy | random-clip |
---|---|---|---|---|---|
caffe | c3d_ucf101_finetune_whole_iter_20000.caffemodel | directly test | - | 79.87% | - |
tensorflow | c3d_ucf101_finetune_whole_iter_20000_TF.model | directly test | 78.35% | 72.77% | 57.15% |
tensorflow-A | conv3d_deepnetA_sport1m_iter_1900000_TF.model | whole finetuning | 76.0% | 71% | 69.8% |
tensorflow-B | sports1m_finetuning_ucf101.model | freeze conv,only finetune fc layers | 79.93% | 74.65% | 76.6% |
- Note:
  1. The `tensorflow-A` model corresponds to the original C3D model pre-trained on UCF101 provided by @hx173149.
  2. The `tensorflow-B` model just freezes the conv layers of `tensorflow-A` and finetunes the fc layers for four more epochs with `learning rate=1e-3` (see the sketch after this list).
  3. The `random-clip` column means one clip is chosen at random from each video in UCF101 test split1, so the result is not very robust. But according to the Law of Large Numbers, we may assume this item is positively correlated with video-level accuracy.
  4. Obviously, if you do more finetuning work based on `c3d_ucf101_finetune_whole_iter_20000_TF.model`, you may achieve better performance; I didn't do it because of time limits.
  5. No doubt you can get better results by appropriately finetuning the network.
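In TensorFlow 1.x, the freeze-conv strategy from note 2 amounts to passing only the fc-layer variables to the optimizer. A minimal sketch, where the variable-name keys are assumptions about how the weights are named:

```python
import tensorflow as tf

def build_finetune_op(loss, learning_rate=1e-3):
    """Freeze conv layers by excluding them from the optimizer's
    var_list; only fc-layer variables receive gradient updates."""
    fc_vars = [v for v in tf.trainable_variables()
               if any(key in v.name for key in ('wd1', 'wd2', 'wout',
                                                'bd1', 'bd2', 'bout'))]
    return tf.train.GradientDescentOptimizer(learning_rate).minimize(
        loss, var_list=fc_vars)
```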
Model | Description | Cloud | Download |
---|---|---|---|
C3D sports1M TF | C3D sports1M converted from caffe C3D | Dropbox | C3D sports1M |
C3D UCF101 TF | C3D UCF101 trained model converted from caffe C3D | Dropbox | C3D UCF101 |
C3D UCF101 TF train | finetuned on UCF101 split1 using the C3D sports1M model by @hx173149 | Dropbox | C3D UCF101 split1 |
split1 meanfile TF | UCF101 split1 meanfile converted from caffe C3D | Dropbox | UCF101 split1 meanfile |
everything above | all four files above | baiduyun | baiduyun |
- Modify the `NUM_CLASSES` variable in the `c3d_model.py` file.
- Change the `NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN` and `NUM_EXAMPLES_PER_EPOCH_FOR_EVAL` variables in the `c3d_model.py` file:
  - `NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN` = (total number of training images) / `NUM_FRAMES_PER_CLIP`
  - `NUM_EXAMPLES_PER_EPOCH_FOR_EVAL` = (total number of evaluation images) / `NUM_FRAMES_PER_CLIP`
  - `NUM_FRAMES_PER_CLIP` is defined in the `c3d_model.py` file (a worked example follows)
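A hypothetical worked example, assuming a dataset with 101 classes, 160,000 training images, and 64,000 evaluation images (the numbers are illustrative only):

```python
NUM_CLASSES = 101
NUM_FRAMES_PER_CLIP = 16
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 160000 // NUM_FRAMES_PER_CLIP  # 10000
NUM_EXAMPLES_PER_EPOCH_FOR_EVAL = 64000 // NUM_FRAMES_PER_CLIP    # 4000
```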
- Thanks to Du Tran, the author of the original code: C3D-caffe
- C3D: Generic Features for Video Analysis