VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

[arXiv] [pdf]

In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality that captures the global visual content of the entire scene and (ii) a language modality that extracts descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and both visual and non-visual elements (e.g., relations, activities). Furthermore, we train VLCap under a contrastive learning VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.
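To give a rough intuition for what a contrastive vision-language loss does, the sketch below implements a generic InfoNCE-style objective in plain Python. It is not taken from this repository and may differ from the loss used in the paper; the function name info_nce, the temperature value, and the toy feature vectors are all assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(vision, language, temperature=0.07):
    """Generic InfoNCE-style loss (illustrative only, not the paper's exact loss).

    vision[i] and language[i] are treated as a matching pair; every other
    language row acts as a negative for vision[i].
    """
    v = [normalize(x) for x in vision]
    l = [normalize(x) for x in language]
    total = 0.0
    for i, vi in enumerate(v):
        logits = [dot(vi, lj) / temperature for lj in l]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]  # cross-entropy with target index i
    return total / len(v)


# Toy features: aligned pairs should give a much lower loss than shuffled pairs.
vision_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(vision_feats, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(vision_feats, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each vision feature is most similar to its own language feature and large otherwise, which is the behaviour a contrastive objective is meant to encourage.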

Environment Setup

  1. Clone this repository
git clone https://github.com/UARK-AICV/VLCAP.git
cd VLCAP
  2. Prepare Conda environment
conda env create -f environment.yml
conda activate pytorch
  3. Add project root to PYTHONPATH

Note that you need to do this each time you start a new session.

source setup.sh
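
If imports fail in a new session, a quick check like the one below can confirm whether the project root is on PYTHONPATH. This is a minimal sketch: the helper name on_pythonpath is hypothetical, and it assumes setup.sh works by adding the repository root to the PYTHONPATH environment variable.

```python
import os
from pathlib import Path


def on_pythonpath(root, env=None):
    """Return True if `root` is one of the entries in PYTHONPATH."""
    env = os.environ if env is None else env
    entries = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    target = Path(root).resolve()
    return any(Path(p).resolve() == target for p in entries)


if __name__ == "__main__":
    # Run from the repository root after `source setup.sh`.
    print("project root on PYTHONPATH:", on_pythonpath("."))
```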

Data Preparation

Download features from Google Drive: env feature and lang feature.

mkdir -p data/anet && cd data/anet
unzip anet_c3d
unzip anet_clip_b16

Training

To train our model on ActivityNet Captions or YouCook2:

bash scripts/train.sh [anet/yc2] [true/false]

The first argument selects the dataset (anet for ActivityNet Captions, yc2 for YouCook2); the second enables or disables the proposed language feature (true/false).

The training log and model are saved under results/anet_re_*.
Once you have a trained model, you can follow the instructions below to generate captions.
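
Since each training run produces its own results/anet_re_* directory, it can be handy to look up the most recent one before generating captions. The helper below is a small sketch under that assumption; the function name latest_run is hypothetical and only the results/anet_re_* pattern comes from this README.

```python
from pathlib import Path


def latest_run(results_dir="results", pattern="anet_re_*"):
    """Return the most recently modified run directory, or None if there is none."""
    runs = [p for p in Path(results_dir).glob(pattern) if p.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    run = latest_run()
    print(run.name if run else "no run directory found under results/")
```

The printed directory name is what you would pass in place of anet_re_* in the evaluation commands.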

Evaluation

  1. Generate captions
bash scripts/translate_greedy.sh anet_re_* [val/test]

Replace anet_re_* with your own model directory name. The generated captions are saved at results/anet_re_*/greedy_pred_[val/test].json, matching the chosen split.

  2. Evaluate generated captions
bash scripts/eval.sh anet [val/test] results/anet_re_*/greedy_pred_[val/test].json

The results should be comparable with those we present in Table 5 of the paper.
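
The prediction file name follows the pattern shown in the commands above. As a convenience, the sketch below builds that path for a given model directory and split so it can be passed to the evaluation script; the function name prediction_path and the directory name anet_re_demo are hypothetical.

```python
from pathlib import Path


def prediction_path(model_dir, split):
    """Build results/<model_dir>/greedy_pred_<split>.json for split in {val, test}."""
    if split not in ("val", "test"):
        raise ValueError(f"split must be 'val' or 'test', got {split!r}")
    return Path("results") / model_dir / f"greedy_pred_{split}.json"


# "anet_re_demo" is a placeholder; use your own model directory name.
print(prediction_path("anet_re_demo", "val").as_posix())
```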

Citations

If you find this code useful for your research, please cite our papers:

@INPROCEEDINGS{9897766,
  author={Yamazaki, Kashu and Truong, Sang and Vo, Khoa and Kidd, Michael and Rainwater, Chase and Luu, Khoa and Le, Ngan},
  booktitle={2022 IEEE International Conference on Image Processing (ICIP)}, 
  title={VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning}, 
  year={2022},
  volume={},
  number={},
  pages={3656-3661},
  doi={10.1109/ICIP46576.2022.9897766}}
@ARTICLE{2022arXiv221115103Y,
       author = {{Yamazaki}, Kashu and {Vo}, Khoa and {Truong}, Sang and {Raj}, Bhiksha and {Le}, Ngan},
        title = "{VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computer Vision and Pattern Recognition},
         year = 2022,
        month = nov,
          eid = {arXiv:2211.15103},
        pages = {arXiv:2211.15103},
archivePrefix = {arXiv},
       eprint = {2211.15103},
 primaryClass = {cs.CV},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2022arXiv221115103Y},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

Acknowledgement

We acknowledge the following open-source project, on which we based our work:

  1. MART