Using a two-stream architecture to implement a classic action recognition method on the UCF101 dataset

License: MIT License

action-recognition two-stream pytorch ucf101 video action-detection

two-stream-action-recognition

We use spatial and motion stream cnns with ResNet101 to model video information in the UCF101 dataset.

Reference Paper

1. Data

1.1 Spatial input data -> rgb frames

We extract RGB frames from each video in the UCF101 dataset with a sampling rate of 10 and save them to disk as .jpg images, which takes about 5.9 GB.
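A minimal sketch of this kind of frame sampling, assuming "sampling rate: 10" means keeping every 10th frame. The use of OpenCV, the helper names `frames_to_keep` and `extract_frames`, and the output file naming are assumptions for illustration and may differ from how the frames were actually produced.

```python
import os


def frames_to_keep(n_frames, rate=10):
    """Return the 0-based indices kept when sampling every `rate`-th frame."""
    return list(range(0, n_frames, rate))


def extract_frames(video_path, out_dir, rate=10):
    """Save every `rate`-th frame of a video as a numbered .jpg file.

    Hypothetical helper: the file naming scheme is an assumption.
    """
    import cv2  # imported lazily so the index helper works without OpenCV

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video or read failure
            break
        if index % rate == 0:
            saved += 1
            cv2.imwrite(os.path.join(out_dir, "frame%06d.jpg" % saved), frame)
        index += 1
    cap.release()
    return saved


# A 28-frame video sampled at rate 10 keeps frames 0, 10 and 20.
kept = frames_to_keep(28)
```

Calling `extract_frames("some_video.avi", "out_dir")` would write the sampled frames to `out_dir`; both arguments here are placeholders.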

1.2 Motion input data -> stacked optical flow images

For the motion stream, we use two methods to obtain optical flow data.

Download the preprocessed TV-L1 optical flow dataset directly from https://github.com/feichtenhofer/twostreamfusion.
Use the FlowNet 2.0 method to generate 2-channel optical flow images and save the x and y channels to disk as separate .jpg images, which takes about 56 GB.

1.3 (Alternative) Download the preprocessed data directly from feichtenhofer/twostreamfusion

RGB images

wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.001
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.002
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.003

cat ucf101_jpegs_256.zip* > ucf101_jpegs_256.zip
unzip ucf101_jpegs_256.zip

Optical Flow

wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.001
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.002
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.003

cat ucf101_tvl1_flow.zip* > ucf101_tvl1_flow.zip
unzip ucf101_tvl1_flow.zip

2. Model

2.1 Spatial cnn

As mentioned before, we use ResNet101, first pre-trained on ImageNet and then fine-tuned on our UCF101 spatial RGB image dataset.

2.2 Motion cnn

The input to the motion cnn is a stack of optical flow images containing 10 x-channel and 10 y-channel images, so its input shape is (20, 224, 224), which can be treated as a 20-channel image.
To reuse ImageNet pre-trained weights in our model, we have to change the shape of the first convolution layer's pre-trained weights from (64, 3, 7, 7) to (64, 20, 7, 7).
In [2], Wang et al. provide a method called cross modality pre-training to perform this weight shape transform. They first average the weight values across the RGB channels and then replicate this average as many times as the motion stream input has channels (20 in this case).
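The weight transform described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the repository's code: `cross_modality_weights` is a hypothetical helper, and random values stand in for the real ImageNet weights of the first convolution layer.

```python
import numpy as np


def cross_modality_weights(rgb_weight, n_flow_channels=20):
    """Average conv weights over the RGB axis and repeat for flow channels.

    rgb_weight: array of shape (out_channels, 3, kH, kW)
    returns:    array of shape (out_channels, n_flow_channels, kH, kW)
    """
    mean = rgb_weight.mean(axis=1, keepdims=True)    # (out, 1, kH, kW)
    return np.repeat(mean, n_flow_channels, axis=1)  # (out, n_flow, kH, kW)


# Stand-in for the ImageNet pre-trained first-layer weights (assumption).
rng = np.random.default_rng(0)
conv1_rgb = rng.standard_normal((64, 3, 7, 7)).astype(np.float32)

conv1_flow = cross_modality_weights(conv1_rgb)
print(conv1_flow.shape)  # (64, 20, 7, 7)
```

Every one of the 20 input channels starts from the same averaged filter, so the first layer begins with no preference between flow channels and learns the difference during fine-tuning.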

3. Training strategies

3.1 Spatial cnn

Here we use the techniques from Temporal Segment Networks. For every video in a mini-batch, we randomly select 3 frames. A consensus among these frames is then derived as the video-level prediction used to calculate the loss.
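A rough sketch of this sampling and consensus step. It assumes, as in the Temporal Segment Networks paper, that the video is split into 3 equal segments with one random frame drawn from each, and that the consensus is a plain average of the per-frame class scores. The helper names are hypothetical.

```python
import random

import numpy as np


def sample_segment_frames(n_frames, n_segments=3, rng=random):
    """Pick one random 1-based frame index from each equal-length segment."""
    bounds = [round(i * n_frames / n_segments) for i in range(n_segments + 1)]
    return [rng.randint(bounds[i] + 1, bounds[i + 1]) for i in range(n_segments)]


def average_consensus(frame_scores):
    """Average per-frame class scores into one video-level score vector."""
    return np.mean(np.asarray(frame_scores, dtype=np.float64), axis=0)


# One frame from each third of a 90-frame video.
frames = sample_segment_frames(90)

# Toy per-frame scores for a 2-class problem (made-up numbers).
video_score = average_consensus([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
```

Averaging is only one possible consensus function; the text above does not say which one is used.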

3.2 Motion cnn

In every mini-batch, we randomly select 64 (batch size) videos from the 9537 training videos and further randomly select 1 stacked optical flow in each video.
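The selection can be sketched as follows. The dictionary of flow frame counts, the video names, and the helper `sample_motion_batch` are made up for illustration. The sketch also assumes that a stack starting at index s uses flow frames s to s + 9, so the start must leave room for 10 frames.

```python
import random


def sample_motion_batch(flow_counts, batch_size=64, stack=10, rng=random):
    """Pick `batch_size` videos, then one random flow-stack start in each.

    flow_counts: dict mapping video name -> number of flow frames
    returns:     list of (video name, 1-based start index) pairs
    """
    videos = rng.sample(sorted(flow_counts), batch_size)
    return [(v, rng.randint(1, flow_counts[v] - stack + 1)) for v in videos]


# Fake frame counts for 9537 training videos (placeholder data).
flow_counts = {"video_%04d" % i: 28 + i % 100 for i in range(9537)}
batch = sample_motion_batch(flow_counts)
```

In practice a shuffling data loader would visit every video once per epoch rather than drawing each batch independently; the sketch only shows the two levels of randomness.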

3.3 Data augmentation

Both streams apply the same data augmentation techniques, such as random cropping.

4. Testing method

For each of the 3783 testing videos, we uniformly sample 19 frames, and the video-level prediction is the voting result of all 19 frame-level predictions.
We chose 19 because the minimum number of frames in a UCF101 video is 28, and we have to make sure there are enough frames for testing with the 10-frame stacks of the motion stream.
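One way to read this arithmetic: a 10-frame stack that starts at frame s ends at frame s + 9, so a 28-frame video allows 28 - 10 + 1 = 19 distinct start positions. The sketch below spreads 19 start indices evenly over the valid range. `uniform_test_starts` is a hypothetical helper and the exact spacing rule is an assumption.

```python
def uniform_test_starts(n_frames, n_samples=19, stack=10):
    """Return `n_samples` evenly spaced 1-based start indices for flow stacks."""
    n_valid = n_frames - stack + 1  # starts that leave room for a full stack
    if n_valid < n_samples:
        raise ValueError("video too short for %d distinct samples" % n_samples)
    interval = n_valid // n_samples
    return [i * interval + 1 for i in range(n_samples)]


shortest = uniform_test_starts(28)   # the shortest video uses every valid start
longer = uniform_test_starts(100)    # longer videos skip frames between starts
```

For the shortest video the 19 starts are exactly 1 through 19, which is why a larger sample count would not fit.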

5. Performance

network	top1
Spatial cnn	82.1%
Motion cnn	79.4%
Average fusion	88.5%
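A small sketch of average fusion for the two streams. It assumes the video-level scores of each stream are turned into probabilities with a softmax before averaging; whether raw scores or probabilities are averaged is not stated here, so treat this as one possible variant. The score values are made up.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def average_fusion(spatial_scores, motion_scores):
    """Fuse two streams by averaging their class probabilities."""
    probs = (softmax(spatial_scores) + softmax(motion_scores)) / 2.0
    return probs.argmax(axis=1)


# Two videos, three classes (toy numbers).
spatial = np.array([[2.0, 1.0, 0.0], [0.0, 0.5, 0.4]])
motion = np.array([[0.0, 3.0, 0.0], [0.0, 0.0, 2.0]])
fused = average_fusion(spatial, motion)
```

In this toy case the spatial stream alone would pick class 0 for the first video, but the more confident motion stream moves the fused prediction to class 1.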

6. Pre-trained Model

7. Testing on Your Device

Spatial stream

Please modify this path and this function to fit the UCF101 dataset on your device.
Training and testing

python spatial_cnn.py --resume PATH_TO_PRETRAINED_MODEL

Only testing

python spatial_cnn.py --resume PATH_TO_PRETRAINED_MODEL --evaluate

Motion stream

Please modify this path and this function to fit the UCF101 dataset on your device.
Training and testing

python motion_cnn.py --resume PATH_TO_PRETRAINED_MODEL

Only testing

python motion_cnn.py --resume PATH_TO_PRETRAINED_MODEL --evaluate

two-stream-action-recognition's Issues

There are some repeated video IDs in testlist and trainlist

Chose Only Testing, but it still trains

Hello, a new learner here.
I chose the Only Testing option, but the output shows that it still trains on the data, which is weird.
Can anyone give me some tips?
Thanks

Can you share specification of pc?

I am using GTX-1060 3Gb GPU. When I am trying to train, I always face the OOM problem despite using small batch size. How to avoid that problem?

Requirements ?

Hi,
I am totally new to deep learning and have no experience with torch.
Would you please mention the hardware and software requirements (GPU used, OS, ...)?
Does PyTorch support Windows 7, and do the other libraries used in the implementation?
Thanks

a problem

I have run into the following problem. Can you give me some advice? Thanks!

About Spatialcnn

On what basis do you take only 3 random frames per video for the video-level prediction?

Could you please justify this? I just want to know the reason behind it. If you could tell me, that would be great.

Thanks
Hareesh

Thank you for kindly sharing this code. Is this code for Python 2.7 or 3?

Thank you for kindly sharing this code. Is this code for Python 2.7 or 3? Thank you!

choice of making flow tensor

https://github.com/jeffreyhuang1/two-stream-action-recognition/blob/master/dataloader/motion_dataloader.py#L55
The variable j starts from 0, so according to lines 55 and 56, the first frame's optical flow (u and v) will end up in
flow[-2,:,:] and flow[-1,:,:], whereas it would make more sense for it to be in flow[0,:,:] and flow[1,:,:]?
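The indexing in question can be illustrated with a small standalone sketch. It does not reproduce the repository's loader; `channel_order` is a hypothetical helper that only tracks which (frame, component) pair is written to each channel when the index is computed from j - offset.

```python
def channel_order(n_stack=10, offset=1):
    """Return which (frame, component) ends up in each of the 2*n_stack channels."""
    channels = [None] * (2 * n_stack)
    for j in range(n_stack):
        # With offset=1 and j=0 the indices are -2 and -1, which Python
        # resolves to the last two positions instead of raising an error.
        channels[2 * (j - offset)] = (j, "u")
        channels[2 * (j - offset) + 1] = (j, "v")
    return channels


shifted = channel_order(offset=1)  # indexing with j - 1
natural = channel_order(offset=0)  # indexing with j
```

With j - 1, every channel is still written exactly once, but the first frame's flow lands in the last two channels, so the stack is a rotated version of the natural temporal order.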

About Pre-trained Model problem

Hi,
When I download the pre-trained model, I cannot unpack the files. Can you upload the pre-trained model again?
thanks

Why is the Prec@1 of motion cnn so low?

Hi, the spatial cnn seems normal, but I notice that the experimental result of the motion cnn is a little weird.
It is much lower than that of the two-stream cnn (81.0%) and that of TSN (87.2%). Also, in the two papers mentioned above, both temporal streams perform better than the spatial streams.

Motion ResNet101

Hi,
Thanks a lot for providing the open-source code!

May I ask how you got your pre-trained motion ResNet101? Is it pre-trained on ImageNet and fine-tuned on UCF101, or did you just change the channels of the model trained on ImageNet?

I highly appreciate your time and help!

about the train and test data

A new learner here.
I have looked at the dataloader. My question is whether the reported result is for the first split only or is the average over the three splits.

j - 1 in flow index

Hey @jeffreyhuang1
Thanks very much for your work! It really helps me understand the papers.

I am not sure if I understand this part right. In the function stackopf, should it be j instead of j-1? E.g.:

flow[2*(j),:,:] = H
flow[2*(j)+1,:,:] = V

Thanks!

Testing realtime

How do I test it either in real time or with a video as input through VideoCapture?

Pre-trained model

Could you publish your pre-trained models? Thank you !

How many GPUs do you use?

Thanks for your work! This repo is well organized and helped me a lot.
I want to know how many GPUs you used when training ResNet101 on UCF101 RGB frames, and how much time the training took. Also, would BN-Inception be more computationally and storage efficient?

Is frame-wise min-max normalization used for flow images?

I want to know how the provided flow images are normalized. Is frame-wise min-max normalization used?

Can l use the two stream model as a feature extractor ?

Hello,

thank you for your work.

l'm asking how can l use spatial and temporal model on UCF-101 for features extraction ?

Thank you

For spatial_cnn.py, when predicting the test dataset, why not use the TSN network to predict its class?

Hello,
I am confused about something in spatial_cnn.py. When predicting the test dataset, I find that you just add up the predictions of all samples to get the final result. Why not use the TSN network to predict the class? For example, every video is divided into 3 segments, one RGB frame is randomly chosen from each segment for prediction, and the three predictions are then averaged as the final result.

for j in range(nb_data):
    videoName = keys[j].split('/',1)[0]
    if videoName not in self.dic_video_level_preds.keys():
        self.dic_video_level_preds[videoName] = preds[j,:]
    else:
        self.dic_video_level_preds[videoName] += preds[j,:]

OSError: [Errno 4] Interrupted system call

In spatial_cnn.py, calling print(result) directly may raise OSError: [Errno 4] Interrupted system call with Python 2.7. The code below works around this.
import errno

try:
    print(result)
except IOError as e:
    if e.errno != errno.EINTR:
        raise

About the spatial cnn: how many images do you extract from every training video? Thank you!

How to get pickle files?

Hi:
After I run python UCF_spatial_cnn.py, I get a "can not find pickle" error. How do I get the pickle files?

real time prediction using webcam

Hi, may I know whether this program can be modified to run in real time using a webcam,
extracting frames and features and making predictions in real time?

getting video frames

How are you extracting frames from the videos? Or does your code assume that the frames are already extracted?

Missing "v" folder in "UCF101/tvl1_flow/"

According to the code, there are 2 folders inside the "UCF101/tvl1_flow/" directory, "u" and "v".
When we download the zip file from this mentioned repository and extract the file, only the "u" folder is created. Can you please explain what these two folders contain and how to generate the "v" folder?

How can I generate 2-channel optical flow image and save its x, y channel as .jpg image?

After using flownet2.0 method, I have generated *.flo files. But I don't know how to convert * .flo files to 2-channel optical flow image and save its x, y channel as .jpg image. Can you help me? Thanks so much.
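A sketch of one possible conversion, assuming the .flo files follow the Middlebury layout (a float32 magic number 202021.25, then int32 width and height, then interleaved float32 x and y values in little-endian byte order). The clipping bound of 20 pixels and the linear mapping to [0, 255] are assumptions borrowed from common two-stream pipelines, not something stated here. `read_flo` and `flow_to_uint8` are hypothetical helpers.

```python
import numpy as np

FLO_MAGIC = 202021.25  # sanity-check value at the start of a Middlebury .flo file


def read_flo(path):
    """Read a Middlebury .flo file into an array of shape (height, width, 2)."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        if magic != np.float32(FLO_MAGIC):
            raise ValueError("not a valid .flo file: %s" % path)
        width, height = (int(v) for v in np.fromfile(f, np.int32, count=2))
        data = np.fromfile(f, np.float32, count=2 * width * height)
    return data.reshape(height, width, 2)


def flow_to_uint8(component, bound=20.0):
    """Clip one flow component to [-bound, bound] and rescale it to [0, 255]."""
    scaled = (np.clip(component, -bound, bound) + bound) * (255.0 / (2 * bound))
    return np.round(scaled).astype(np.uint8)
```

With `flow = read_flo(path)`, the arrays `flow_to_uint8(flow[..., 0])` and `flow_to_uint8(flow[..., 1])` are single-channel images for the x and y components, which an image library such as Pillow or OpenCV can then save as .jpg files.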

The flow dataset files are damaged

Can you upload the dataset again? Thanks

the problem of the pre-trained model

I downloaded the pre-trained model (model_best.pth.tar), but it is damaged. Can I download the pre-trained model from somewhere else? Thanks

Pre-trained models are corrupted

Unable to decompress the tar files provided. Could you verify at your end if you are able to decompress them?

"This does not look like a tar archive.
Skipping to next header.
A lone zero block at 62938.
Exiting with failure status due to previous errors"

low accuracy with spatial pretrained model

Hi! I'm training your spatial_cnn model. I downloaded rgb images from "feichtenhofer/twostreamfusion". Then, I followed your guide (modify some functions) and used your pretrained model, but I got only 79.xx% on test set and 99.xx% on training set. I can't achieve your result, which is 82.1%.
So what's my problem? Thank you.

There are problems with the pre-trained model!

Can you help me solve it? Thanks very much

UCF_101 is not defined

The following error occurs :-

Traceback (most recent call last):
  File "average_fusion.py", line 23, in <module>
    ucf_split='01')
  File "D:\IP\SignLanguage\TwoStreamFusion\two-stream-action-recognition-master\dataloader\spatial_dataloader.py", line 78, in __init__
    splitter = UCF101_splitter(path=ucf_list,split=ucf_split)
NameError: name 'UCF101_splitter' is not defined

Also, please specify the directories inside the 'UCF101' folder.

UCF101_splitter not defined

The function UCF101_splitter is not defined anywhere.
The following error is shown on running average_fusion.py

splitter = UCF101_splitter(path=ucf_list,split=ucf_split)
NameError: name 'UCF101_splitter' is not defined

a problem of ‘ValueError: insecure string pickle’

Thank you for kindly sharing this code. I get a ‘ValueError: insecure string pickle’ error when running the following code in spatial_dataloader and motion_dataloader:
with open('dic/frame_count.pickle', 'rb') as file:
    dic_frame = pickle.load(file)
I installed Anaconda2-4.1.1-Linux-x86_64. Is the version too old?
I tried changing 'rb' to 'r', but the problem remains. Can you help me? Thank you!

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

I tried to reload the spatial model, but it does not work. The reported errors are as follows:

torch.Size([3, 224, 224])
==> Validation data :71877frames
torch.Size([3, 224, 224])
==> Build model and setup loss and optimizer
==> loading checkpoint 'record/spatial/spatial_video_preds.pickle'
Traceback (most recent call last):
  File "/home/xm/python/action/two-stream-action-recognition-master/spatial_cnn.py", line 268, in <module>
    main()
  File "/home/xm/python/action/two-stream-action-recognition-master/spatial_cnn.py", line 65, in main
    model.run()
  File "/home/xm/python/action/two-stream-action-recognition-master/spatial_cnn.py", line 109, in run
    self.resume_and_evaluate()
  File "/home/xm/python/action/two-stream-action-recognition-master/spatial_cnn.py", line 93, in resume_and_evaluate
    checkpoint = torch.load(self.resume)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 528, in _load
    magic_number = pickle_module.load(f)

About motion_cnn.py: why does data loading take so much time? It seems to be close to the batch time.

Hello, when I run motion_cnn.py I have some questions, as follows:
1. When I set the batch size to more than 32, such as 64, the program gets stuck at the training stage. Sometimes it runs fast, but then it gets stuck. (My GPU is a GTX 1080 Ti. When I run nvidia-smi, it shows that I use less than 4 GB of memory.)
2. I also find that the data loading time is similar to the batch time, which confuses me. Why does data loading take so long? And if it is similar to the batch time, does the model training itself (batch time minus data loading time) take very little time?

I hope to get your answer, thanks!

missing 'frame_count.pickle'

IOError: [Errno 2] No such file or directory: 'dic/frame_count.pickle'

Thank you for kindly sharing this great code. I am puzzled about some details

Dear Jeffrey,
Thank you for kindly sharing the great two-stream-action-recognition code. Although I have read the readme and the reference papers [1] [2] [3], I am puzzled about some details.
In the training strategies, in every mini-batch, you randomly select 64 (batch size) videos from the 9537 training videos and further randomly select 1 stacked optical flow in each video. In which paper was this method proposed? Is it part of the TSN paper [2]?
In the testing method, for each of the 3783 testing videos, you uniformly sample 19 frames, and the video-level prediction is the voting result of all 19 frame-level predictions. This method resembles TSN [2], but the 19 segments of 10 frames may overlap.
In your validation stage, is the precision stored in opf_test.csv the test precision or the validation precision? Is the loaded data testlist01?
motion_video_preds.pickle saves the video-level predictions. What does that mean? And where can I find the test accuracy on split 1 (testlist01)?
Does this code correspond to a paper of your own? I would like to cite your paper. Otherwise, can you give me some details about your code? Thank you very much.
I have also sent you an e-mail asking for help. Thank you very much.
Best regards,
tangjun

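About the preprocessing of video data

First of all, thank you for providing such complete code and documentation. It has helped me a lot.
However, I still have some questions:
1. For the spatial input data, the sampling rate is 10. For the motion input data, what is the sampling rate of the optical flow data?
Case 1: Is the optical flow volume computed from the 10 frames that follow each sampled image?
Case 2: Or is one optical flow volume computed for every 10 of the sampled frames?
2. During training, only 3 frames are sampled for the spatial stream and only 1 optical flow stack for the motion stream. Why, then, does the initial data processing store so much data?
3. During testing, 19 frames are sampled per video in order to keep enough frames for testing in the 10-stack motion stream. Does this imply that the optical flow sampling in question 1 follows case 2?
I hope you can answer my questions when you have time. Thank you.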

Does this code correspond to the whole TSN framework, or only to part of it?

Thank you for kindly sharing this code. Does this code correspond completely to the paper ([2] Temporal Segment Networks), i.e. the whole TSN framework except for the ResNet part, or only to part of the TSN framework? Thank you!

extracting pretrained model

I tried to extract the pretrained model using tar, but it gives an error:
tar: This does not look like a tar archive
tar: Skipping to next header
tar: A lone zero block at 79811
tar: Exiting with failure status due to previous errors
Any idea?

Recognizing activity realtime

Can you please share the pretrained model and testing code? I'm having a hard time extracting the RGB frames and computing optical flow for activity recognition. Thank you

How to run this code?

Hi, Thanks a lot for sharing this code!

May I ask how to start to train your two-stream network and how to pre-process data?

Thanks a lot!

how to fuse the features

After getting the two streams run, how can I fuse the features from each network?

Some problem with the pretrained models?

Sorry, I made a mistake. There is no problem.

Which optical flow feature is used for the pretrained motion model?

TVL1 or flownet2.0?
Thanks~

Improvement on motion-cnn result: 84.1% on split-1, with VGG-16

Hi, all

I did some investigation into why the motion-cnn result is much lower than in the original paper. After a simple modification, I am able to achieve 84.1% top-1 accuracy. The modification is adding transforms.FiveCrop() to the transformation. Before this modification, the result was only 80.5%. I use the pretrained model from https://github.com/feichtenhofer/twostreamfusion. I think further improvement can be achieved with transforms.TenCrop().

I think with this modification, it can bridge the gap of performance between twostream model trained on pytorch and other frameworks.

kernel died on particular line

Hi
When running the code, I get "kernel died and restarting" on the line self.bn1 = nn.BatchNorm3d( mid_planes).
If I comment out this line, the code runs, but the same error occurs wherever batch normalization is used. Any idea what the reason could be?

A problem I meet in spatial_cnn and spatial_dataloader

Dear jeffreyhuang,
I read your readme, the input of spatial_cnn is from feichtenhofer/twostreamfusion ,
So the path in spatial_cnn and spatial_dataloader equals to feichtenhofer’s Directory.
Is it right?
When I run the spatial_cnn, I have a problem:

Traceback (most recent call last):
  File "/home/tangjun/jeffreyhuang/two-stream-action-recognition/spatial_cnn.py", line 272, in <module>
    main()
  File "/home/tangjun/jeffreyhuang/two-stream-action-recognition/spatial_cnn.py", line 55, in main
    train_loader, test_loader, test_video = data_loader.run()
  File "/home/tangjun/jeffreyhuang/two-stream-action-recognition/dataloader/spatial_dataloader.py", line 100, in run
    train_loader = self.train()
  File "/home/tangjun/jeffreyhuang/two-stream-action-recognition/dataloader/spatial_dataloader.py", line 133, in train
    print(training_set[1][0]['img1'].size())
  File "/home/tangjun/jeffreyhuang/two-stream-action-recognition/dataloader/spatial_dataloader.py", line 60, in __getitem__
    data[key] = self.load_ucf_image(video_name, index)
  File "/home/tangjun/jeffreyhuang/two-stream-action-recognition/dataloader/spatial_dataloader.py", line 30, in load_ucf_image
    img = Image.open(path +str(index)+'.jpg')
  File "/home/tangjun/anaconda2/lib/python2.7/site-packages/PIL/Image.py", line 2272, in open
    fp = builtins.open(filename, "rb")
IOError: [Errno 2] No such file or directory: '/home/tangjun/dataset1/jpegs_256/Swing/separated_images/v_Swing_g09_c02/v_Swing_g09_c02_29.jpg'

Best regards,
Yours sincerely
Tang Jun
