
bLVNet-TAM

This repository holds the code and models for our paper,

Quanfu Fan*, Chun-Fu (Richard) Chen*, Hilde Kuehne, Marco Pistoia, David Cox, "More Is Less: Learning Efficient Video Representations by Temporal Aggregation Modules"

If you use the code and models from this repo, please cite our work. Thanks!

@incollection{
    fan2019blvnet,
    title={{More Is Less: Learning Efficient Video Representations by Temporal Aggregation Modules}},
    author={Quanfu Fan and Chun-Fu (Richard) Chen and Hilde Kuehne and Marco Pistoia and David Cox},
    booktitle={Advances in Neural Information Processing Systems 32},
    year={2019}
}

Requirements

pip install -r requirement.txt

Pretrained Models on Something-Something

The results below (top-1 accuracy) are reported under the single-crop and single-clip setting.

V1

Name Top-1 Val Acc.
bLVNet-TAM-50-a2-b4-f8x2 46.4
bLVNet-TAM-50-a2-b4-f16x2 48.4
bLVNet-TAM-101-a2-b4-f8x2 47.8
bLVNet-TAM-101-a2-b4-f16x2 49.6
bLVNet-TAM-101-a2-b4-f24x2 52.2
bLVNet-TAM-101-a2-b4-f32x2 53.1

V2

Name Top-1 Val Acc.
bLVNet-TAM-50-a2-b4-f8x2 59.1
bLVNet-TAM-50-a2-b4-f16x2 61.7
bLVNet-TAM-101-a2-b4-f8x2 60.2
bLVNet-TAM-101-a2-b4-f16x2 61.9
bLVNet-TAM-101-a2-b4-f24x2 64.0
bLVNet-TAM-101-a2-b4-f32x2 65.2
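The model names above encode the configuration (depth, alpha, beta, and frames per branch). A small hypothetical parser illustrating the naming convention as inferred from the tables (not taken from the repo's code):

```python
import re

def parse_model_name(name):
    """Parse a name like 'bLVNet-TAM-101-a2-b4-f16x2' into its parts.

    Hypothetical helper: the depth/alpha/beta/frames-x-branches reading
    of the name is an assumption based on the tables above.
    """
    m = re.match(r"bLVNet-TAM-(\d+)-a(\d+)-b(\d+)-f(\d+)x(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    depth, alpha, beta, frames, branches = map(int, m.groups())
    return {"depth": depth, "alpha": alpha, "beta": beta,
            "frames": frames, "branches": branches}

# parse_model_name("bLVNet-TAM-50-a2-b4-f8x2")
# -> {'depth': 50, 'alpha': 2, 'beta': 4, 'frames': 8, 'branches': 2}
```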

Data Preparation

We provide two scripts in the tools folder to prepare input data for model training. The scripts sample an image sequence from a video and then resize each image so that its shorter side is 256 while keeping the aspect ratio. You may need to set folder_root accordingly to ensure the extraction works correctly.
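The resize policy described above (shorter side to 256, aspect ratio preserved) can be sketched as a pure dimension computation; a minimal sketch, with rounding details assumed (the repo's tools may differ):

```python
def shorter_side_resize_dims(width, height, target=256):
    """Return (w, h) after scaling so the shorter side equals `target`,
    preserving aspect ratio. Hypothetical helper mirroring the policy
    described above; rounding behavior is an assumption."""
    scale = target / min(width, height)
    return int(round(width * scale)), int(round(height * scale))

# The actual resize could then be done with e.g. PIL:
#   Image.open(path).resize(shorter_side_resize_dims(w, h), Image.BILINEAR)
```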

Training

To reproduce the results in our paper, the pretrained bLNet models are required; they are available here.

With the pretrained models placed in the pretrained folder, the following command trains a bLVNet-101-TAM-a2-b4-f8x2 model on Something-Something V2:

python3 train.py --datadir /path/to/folder \
--dataset st2stv2 -d 101 --groups 16 \
--logdir /path/to/logdir --lr 0.01 -b 64 --dropout 0.5 -j 36 \
--blending_frames 3 --epochs 50 --disable_scaleup --imagenet_blnet_pretrained

Test

First download the models and put them in the pretrained folder. Then follow the example below to evaluate a model, e.g. evaluating the bLVNet-101-TAM-a2-b4-f8x2 model on Something-Something V2:

python3 test.py --datadir /path/to/folder --dataset st2stv2 -d 101 --groups 16 \
--alpha 2 --beta 4 --evaluate --pretrained --disable_scaleup \
--logdir /path/to/logdir

You can add the num_crops and num_clips arguments to perform multi-crop and multi-clip evaluation for video-level accuracy.
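Video-level scores in the multi-crop/multi-clip setting are typically obtained by averaging the per-crop/per-clip class scores; a minimal sketch (whether the repo averages logits or softmax probabilities is an assumption here):

```python
def video_level_scores(clip_scores):
    """Average per-crop/per-clip class scores into one video-level score.

    `clip_scores` is a list of score lists, one per crop/clip. This is the
    standard averaging scheme for multi-crop/multi-clip evaluation.
    """
    n = len(clip_scores)
    return [sum(col) / n for col in zip(*clip_scores)]

def predict(clip_scores):
    """Video-level prediction: argmax over the averaged scores."""
    scores = video_level_scores(clip_scores)
    return max(range(len(scores)), key=scores.__getitem__)
```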

Please feel free to let us know if you encounter any issue when using our code and models.

blvnet-tam's People

Contributors

chunfuchen, stevemar


blvnet-tam's Issues

Network giving ambiguous results

I am doing 2-class classification using bLVNet-TAM.
During validation the accuracy is good, but when testing on the same data the accuracy is bad.
For instance, validation probabilities are somewhat like [0.2, 0.8],
while test probabilities on the same data are like [0.5.., 0.499..].
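Probabilities near [0.5, 0.5] mean the two class logits are almost tied, i.e. the model is effectively undecided at test time. A quick stdlib-only way to quantify this (the softmax/margin helpers are illustrative, not from the repo):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logit_margin(probs):
    """Logit gap implied by binary probabilities [p0, p1]:
    log(p0 / p1). Near zero means the model is undecided."""
    return math.log(probs[0] / probs[1])

# [0.2, 0.8] implies a margin of log(0.25) ~ -1.39 (confident),
# while [0.501, 0.499] implies ~ 0.004 (effectively a coin flip).
```

A train/test gap like this often comes from forgetting model.eval() at test time (leaving BatchNorm/Dropout in training mode) or from mismatched preprocessing, though the thread does not confirm the cause.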

imagenet_blnet_pretrained

Is it intentional to have

if input_channels != 3

only in this line?

As it stands, the code only loads the weights for the flow case and not the RGB case. I have thus extended the condition as below and was wondering if it's correct:

        if input_channels != 3:  # flow
            ....
        else:
            print("Loading RGB weights")
            state_d = checkpoint['state_dict']
        model.load_state_dict(state_d, strict=False)
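A runnable sketch of the extended logic above, with the flow-specific weight adaptation still elided just as in the snippet (assumes the checkpoint is a dict with a 'state_dict' key, as in the repo's ImageNet-bLResNet checkpoints):

```python
def load_blnet_weights(checkpoint, input_channels):
    """Select the state dict to load, mirroring the extended condition
    from this issue. Hypothetical helper: for flow input the first
    conv's weights would additionally need adapting to the channel
    count; that step is elided here as in the snippet above."""
    if input_channels != 3:  # flow: adapt conv1 weights here
        state_d = checkpoint["state_dict"]
    else:  # RGB: use the checkpoint's weights as-is
        state_d = checkpoint["state_dict"]
    # The caller would then do: model.load_state_dict(state_d, strict=False)
    return state_d
```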

Multi-gpu training issue: device cuda mismatch

Hi, in training I encountered RuntimeError: Function SqueezeBackward1 returned an invalid gradient at index 0 - expected device cuda:3 but got cuda:4. In the paper I remember you used 8 GPUs. Have you encountered this issue before? Thanks.

The command I am using is python3 train.py --datadir ../../sthsthv2 --dataset st2stv2 -d 101 --groups 16 --logdir ./logs --lr 0.01 -b 64 --dropout 0.5 -j 36 --blending_frames 3 --epochs 50 --disable_scaleup --imagenet_blnet_pretrained. And I have the pretrained model ready under ./pretrained.

The environment I am using under Ubuntu18.04.3 LTS is:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
ca-certificates           2019.11.27                    0
certifi                   2019.11.28               py37_0
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0
libstdcxx-ng              9.1.0                hdf63c60_0
ncurses                   6.1                  he6710b0_1
numpy                     1.17.4                   pypi_0    pypi
openssl                   1.1.1d               h7b6447c_3
pillow                    6.2.1                    pypi_0    pypi
pip                       19.3.1                   py37_0
protobuf                  3.11.1                   pypi_0    pypi
python                    3.7.5                h0371630_0
readline                  7.0                  h7b6447c_5
scikit-video              1.1.11                   pypi_0    pypi
scipy                     1.3.3                    pypi_0    pypi
setuptools                42.0.2                   py37_0
six                       1.13.0                   pypi_0    pypi
sqlite                    3.30.1               h7b6447c_0
tensorboard-logger        0.1.0                    pypi_0    pypi
tk                        8.6.8                hbc83047_0
torch                     1.3.1                    pypi_0    pypi
torchvision               0.4.2                    pypi_0    pypi
tqdm                      4.40.2                   pypi_0    pypi
wheel                     0.33.6                   py37_0
xz                        5.2.4                h14c3975_4
zlib                      1.2.11               h7b6447c_3

Low validation accuracy on Something-Something V1 and Something-Something V2

Thanks for your excellent work!

I tried to train a bLVNet-101-TAM-a2-b4-f8x2 model on Something-Something V1 using the hyperparameters setting reflected in the following bash command:

CUDA_VISIBLE_DEVICES=0,1 python train.py --datadir /media/data/something_v1 --dataset st2stv1 -d 101 --groups 16 --logdir logs --lr 0.01 -b 64 --dropout 0.5 -j 36 --blending_frames 3 --epochs 50 --disable_scaleup --imagenet_blnet_pretrained

At the end of the training, I get only around 31% validation accuracy, whereas the training accuracy is around 79%. Compared with TSM-ResNet50, the training accuracy is actually higher (79% vs. 78%), so it is even more confusing why the validation accuracy is so low. Is it overfitting by any chance? Or is there something wrong with preprocessing the validation data (note that I used the scripts you provided)? Any comments on what I may be doing wrong here? I am sharing the corresponding training/validation log.
log.log

Update: Trained bLVNet-101-TAM-a2-b4-f8x2 on Something-Something V2 as well. With the training script provided, validation accuracy reaches around 53% only (less than the ~60% accuracy reported). Also, for Something-Something-V2, the training accuracy is lower compared to TSM-ResNet50 (88.2% with TSM vs. 78.5% with bLVNet). Please let me know your thoughts on what may be going wrong here. Log corresponding to SthV2 training -
log.log

About Flow Training

I found that you haven't provided flow models, so I want to retrain a flow model myself.
Are all training parameters the same as for RGB training?
And do I need to train the model from scratch with random initialization?
Thanks!

Cannot replicate Kinetics-400 Results

Thank you very much for posting the codebase! However, I'm having difficulty replicating the 71.0% accuracy mentioned in the paper for bLVNet-TAM-8x2; I only get around 55% when validating each epoch, with training accuracy at 58%. If I do 10 crops per video for validation, this rises to around 57%.

Would it be possible to share your training log file, or your final training accuracy, so I can check whether something is wrong with my validation/eval script?

Wanted to check if any of the below was wrong:

  • TAM Backbone is initialised from 'ImageNet-bLResNet-50-a2-b4.pth.tar'
  • For bLVNet-model:
{'depth': 50, 'alpha': 2, 'beta': 4, 'groups': 16, 'num_classes': 400, 'dropout': 0.5, 'blending_frames': 3, 'input_channels': 3, 'pretrained': None, 'dataset': 'kinetics400', 'imagenet_blnet_pretrained': True}
  • For the training: 0.01 LR with a total batch of 64 (8 * 8 per GPU); cosine-annealing LR-schedule trained for 50 epochs

I thought perhaps this should be trained for 100 epochs, similar to TSM, rather than 50?
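The cosine-annealing schedule listed in the training setup above can be written out explicitly; a minimal sketch, assuming per-epoch updates and decay to zero (the actual training may use per-iteration updates or warm-up):

```python
import math

def cosine_annealed_lr(base_lr, epoch, total_epochs):
    """Cosine annealing: lr decays from base_lr at epoch 0 to 0 at
    total_epochs, following 0.5 * base_lr * (1 + cos(pi * t / T))."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# With base_lr=0.01 and 50 epochs: epoch 0 -> 0.01, epoch 25 -> 0.005,
# epoch 50 -> 0.0
```

In PyTorch this corresponds to torch.optim.lr_scheduler.CosineAnnealingLR with T_max set to the epoch count, which is why doubling the schedule to 100 epochs (as with TSM) also halves how fast the LR decays.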
