
SA-tensorflow

Tensorflow implementation of soft-attention mechanism for video caption generation.

An example of the soft-attention mechanism: the attention weight alpha indicates the temporal attention over one video for each generated word.

[Yao et al. 2015, Describing Videos by Exploiting Temporal Structure] The original Torch implementation can be found here.

Prerequisites

  • Python 2.7
  • Tensorflow >= 0.7.1
  • NumPy
  • pandas
  • keras
  • Java 1.8.0

Data

The MSVD [2] dataset can be downloaded from here.

We pack the data into HDF5 format, where each file is one training mini-batch and has the following keys:

[u'data', u'fname', u'label', u'title']

batch['data'] stores the visual features. Shape: (n_step_lstm, batch_size, hidden_dim).

batch['fname'] stores the filenames (without extension) of the videos. Shape: (batch_size).

batch['title'] stores the descriptions. If multiple sentences correspond to one video, the other metadata (visual features, filenames, and labels) must be duplicated to keep a one-to-one mapping. Shape: (batch_size).

batch['label'] indicates where the video ends. For instance, [-1., -1., -1., -1., 0., -1., -1.] means that the video ends at index 4. Shape: (n_step_lstm, batch_size).
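The per-video label vector described above can be sketched with a small hypothetical helper (`make_label` is not part of the repository; it only mirrors the documented format):

```python
def make_label(n_step_lstm, end_index):
    """Build a label vector for one video: -1.0 at every step except
    0.0 at the step where the video ends (format described above)."""
    label = [-1.0] * n_step_lstm
    label[end_index] = 0.0
    return label

print(make_label(7, 4))  # [-1.0, -1.0, -1.0, -1.0, 0.0, -1.0, -1.0]
```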

Generate HDF5 data

We generate the HDF5 data by following the steps below. The code is a little messy; if you have any questions, feel free to ask.

1. Generate Label

Once you change the video_path and output_path, you can generate labels by running the script:

python hdf5_generator/generate_nolabel.py

Each clip is 10 frames long, and at most 450 frames per video are used. You can change these parameters in the function get_frame_list(frame_num).
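The clip-splitting logic can be sketched as follows. This is an assumption about what get_frame_list does based on the description above (cap the frame count, then cut consecutive fixed-length clips), not the repository's exact implementation:

```python
CLIP_LEN = 10      # frames per clip (default mentioned above)
MAX_FRAMES = 450   # cap on frames considered per video

def get_frame_list(frame_num):
    """Cap the video at MAX_FRAMES frames and split it into
    consecutive CLIP_LEN-frame clips; a trailing partial clip
    is dropped. Sketch only, under the assumptions above."""
    frame_num = min(frame_num, MAX_FRAMES)
    return [list(range(start, start + CLIP_LEN))
            for start in range(0, frame_num - CLIP_LEN + 1, CLIP_LEN)]

clips = get_frame_list(35)  # 3 full clips; the last 5 frames are dropped
```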

2. Pack features together (no caption information)

Inputs:

label_path: The path to the labels generated earlier.

feature_path: The path that stores features such as VGG and C3D. You can rename the directory to whatever you want.

Outputs:

h5py_path: The path where the concatenated features are stored; the code will automatically put them in the subdirectory cont.

python hdf5_generator/input_generator.py

Note that in the function get_feats_depend_on_label(), you can choose whether to take the mean feature or a randomly sampled frame feature within each clip. The random-sampling code is commented out since its performance is worse.
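The two pooling choices can be sketched with a hypothetical helper (`clip_feature` is illustrative only; the repository's get_feats_depend_on_label() operates on its own data layout):

```python
import random

def clip_feature(frames, use_mean=True):
    """frames: list of per-frame feature vectors (lists of floats)
    for one clip. Mean-pools the clip by default; the random-sample
    branch mirrors the commented-out alternative mentioned above."""
    if use_mean:
        dim = len(frames[0])
        return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return random.choice(frames)

feat = clip_feature([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```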

3. Add captions into HDF5 data

The maximum number of words in a caption is set to 35. The feature folder is where the final output features are stored.

python hdf5_generator/trans_video_youtube.py

(The code here was written by Kuo-Hao.)
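Fixed-length caption handling as described above can be sketched like this. The pad token name is an assumption; the repository may use a different vocabulary convention:

```python
MAX_WORDS = 35  # maximum caption length mentioned above

def pad_caption(caption, pad_token="<pad>"):
    """Truncate a caption to MAX_WORDS tokens, or pad it with
    pad_token up to MAX_WORDS. Sketch only; the pad token is
    a hypothetical choice."""
    words = caption.split()[:MAX_WORDS]
    return words + [pad_token] * (MAX_WORDS - len(words))

tokens = pad_caption("a man is playing a guitar")  # 6 words + 29 pad tokens
```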

Generate data list

video_data_path_train = '$ROOTPATH/SA-tensorflow/examples/train_vn.txt'

Change the path variable to the absolute path of your data, then simply run python getlist.py to generate the list.

P.S. The filenames of the HDF5 data start with train, val, or test.
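The list generation can be sketched as below. This is an assumed reconstruction of what getlist.py is described as doing (collecting the split-prefixed HDF5 batch files into a text list), not its actual code:

```python
import os
import tempfile

def generate_list(data_dir, split):
    """Collect absolute paths of HDF5 batch files whose names start
    with the given split prefix (train/val/test). Sketch only."""
    return sorted(os.path.join(data_dir, name)
                  for name in os.listdir(data_dir)
                  if name.startswith(split) and name.endswith(".h5"))

# Usage: one path per line would go into e.g. train_vn.txt.
data_dir = tempfile.mkdtemp()
for name in ["train000.h5", "train001.h5", "val000.h5"]:
    open(os.path.join(data_dir, name), "w").close()
train_files = generate_list(data_dir, "train")  # two train batch files
```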

Usage

Training

$ python Att.py --task train

Testing

Test the model after a certain number of training epochs.

$ python Att.py --task test --net models/model-20

Author

Tseng-Hung Chen

Kuo-Hao Zeng

Disclaimer

We modified the code from the repository jazzsaxmafia/video_to_sequence into the temporal-attention model.

References

[1] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. arXiv:1502.08029v4, 2015.

[2] D. L. Chen and W. B. Dolan. Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), Portland, OR, June 2011.

[3] Microsoft COCO Caption Evaluation

Issues

HDF5 example data

Hello! Thank you very much for the nice project!
I am trying to reproduce the results, but I have problems creating the HDF5 files. Could anyone provide one or two example HDF5 files to compare against?
That would be great!

Handling longer videos while preparing data files

Thanks for preparing a wonderful code!

While preparing the data HDF5 files, as mentioned, "batch['data'] stores the visual features. shape (n_step_lstm, batch_size, hidden_dim)".
How should videos longer than n_step_lstm steps be handled?
If a video is broken into parts and stored as separate input samples, would the model figure this out and learn from the parts of the same video using the batch['label'] parameter?

Any help on preparing the data h5 files would be appreciated.
Thanks.

Error in generate_nolabel

[h264 @ 0x163d0e0] missing picture in access unit
(message repeated 16 times)
Traceback (most recent call last):
File "hdf5_generator/generate_nolabel.py", line 88, in
get_label_list(fname)
File "hdf5_generator/generate_nolabel.py", line 71, in get_label_list
frame_len = get_total_frame_number(fname)
File "hdf5_generator/generate_nolabel.py", line 33, in get_total_frame_number
length = float(cap.get(cv2.cv.CV_CAP_PROP_FRAME_COUNT))
AttributeError: 'module' object has no attribute 'cv'

What does the 'feature_path' mean?

Hi, I am having trouble understanding what feature_path means.
As you said, it is "the path that stores features such as VGG and C3D". Does it mean the weights of an already-trained VGG-16?
I have a pretrained vgg16_weights.h5 file, but it doesn't work well.

Where is the data?

Where can I get the data?
There is a problem when I load video_data_path_train, etc.

Can't run "input_generator.py"

The function def splitdata(path, train_num, val_num) is not executed in generate_nolabel.py, so the file msvd_dataset_final.npz is not generated, and therefore input_generator.py cannot run.

Number of epochs to reproduce paper scores

I was able to write a script for data generation for MSVD.
Could you please comment on the number of epochs needed to reproduce the scores of the [Yao et al. 2015, Describing Videos by Exploiting Temporal Structure] paper?
I see that 900 epochs is mentioned in the code.
Thanks.
