LSTP-Chat: Language-guided Spatial-Temporal Prompt Learning for Video Chat

Updates

(2024.02.27) Paper Release, check it on Arxiv.
(2024.02.26) Initial Release (´▽`ʃ♡ƪ)

Overview

This is a chat agent based on our work LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding. This work is finetuned on video-instruction datasets Video-ChatGPT and image-instruction datasets LLaVA.

We have meticulously chosen two distinct architectural paradigms for our study: the encoder-decoder architecture, exemplified by BLIP2-Flan-T5-xl, and the decoder-only architecture, represented by InstructBLIP-Vicuna-7B. For further exploration, we also provide the code to tune the LLM with LoRA.

Installation

# clone project
git clone https://github.com/bigai-nlco/LSTP-Chat
cd LSTP-Chat

# create conda environment
conda create -n LSTP
conda activate LSTP

# install requirements
pip install -r requirements.txt

Data Preparation

You can download all the instruction data and evaluation data from Video-LLaVA/DATA

inputs/ivinstruct
├── llava_image_tune
└── videochatgpt_tune

How to run

Our training framework offers tailored scripts to meet the diverse needs of researchers.

Train model

# run on local
python src/train.py experiment=LSTP_SF_blip2flant5xl_videoinstruct # blip2-flan-t5-xl + video-instruct
python src/train.py experiment=LSTP_SF_instructblipvicuna7b_videoinstruct # instructblip-vicuna-7b + video-instruct

# run on cluster
sbatch scripts/videoinstruct_train.slurm # blip2-flan-t5-xl + video-instruct
sbatch scripts/videoinstruct_vicuna_train.slurm # instructblip-vicuna-7b + video-instruct

For those with limited GPU resources, we also provide the pipeline to shorten the training procudure

# step 1: generate the pseudal labels from the base-model, and extract the optical flow in advance

# step 2: train the temporal sampler
python src/train.py experiment=LSTP_TG_blip2flant5xl_videoinstruct

# step 3: train LSTP with fixed temporal sampler
python src/train.py experiment=LSTP_blip2flant5xl_ivinstruct # blip2-flan-t5-xl + video-instruct + image-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivinstruct # instrucblip-vicuna-7b + video-instruct + image-instruct
python src/train.py experiment=LSTP_blip2flant5xl_ivtinstruct # blip2-flan-t5-xl (LoRA) + video-instruct + image-instruct + text-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivtinstruct # instrucblip-vicuna-7b (LoRA) + video-instruct + image-instruct + text-instruct

Evaluate model

# run inference for LSTP-Vicuan-7B
bash eval/scripts/run_qa_msvd_vicuna.sh
bash eval/scripts/run_qa_msrvtt_vicuna.sh
bash eval/scripts/run_qa_activitynet_vicuna.sh

# run inference for LSTP-Flan-T5-xl
bash eval/scripts/run_qa_msvd.sh
bash eval/scripts/run_qa_msrvtt.sh
bash eval/scripts/run_qa_activitynet.sh

# run evaluation
bash eval/scripts/eval_qa_msvd.sh
bash eval/scripts/eval_qa_msrvtt.sh
bash eval/scripts/eval_qa_activitynet.sh

Configures

data:
  - text_dir
  - video_dir
  - processor_name
  - sampler_processor_name
  - nframe # final sampled frames
  - target_size # image size
  - batch_size
model:
  - model_name_or_path
  - sampler_name_or_path
  - of_extractor_name_or_path
  - optimizer
  - scheduler
  - generate_configs
path:
  - data_dir
  - video_dir
  - text_dir
  - output_dir
trainer: 
  - strategy
  - accelerator
  - devices
  - num_nodes
  - precision

Evaluation Results

Metrics: Accuracy/Score

Methods	LLM size	MSVD-QA	MSRVTT-QA	ActivityNet-QA
FrozenBiLM	1B	32.2/-	16.8/-	24.7/-
VideoChat	7B	56.4/2.8	45.0/2.5	-/2.2
LLaMA-Adapter	7B	54.9/3.1	43.8/2.7	34.2/2.7
Video-LLaMA	7B	51.6/2.5	29.6/1.8	12.4/1.1
Video-ChatGPT	7B	64.9/3.3	49.3/2.8	35.2/2.7
Video-LLaVA	7B	70.7/3.9	59.2/3.5	45.3/3.3
LSTP-7B	7B	71.3/3.9	57.3/3.3	43.9/3.3

Demo

We provide the chat demo supported by Gradio. We also provide some checkpoints, you can download it an put it to ckpts/LSTP-Chat/.

Model Zoo

Model	Base Model	Training Data	Strategy for LLM	Download Link
LSTP-7B	InstructBlip-Vicuna-7B	Video-ChatGPT, LLaVA	fixed	Huggingface

python -m demo.demo

Acknowledgement

Data: Video-ChatGPT, LLaVA
Preprocess: Video Features
Code: LAVIS/instructblip, lightning-hydra-template
Demo: LLaVA, Video-LLaVA

Citation

@misc{wang2024lstp,
    title={LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding},
    author={Yuxuan Wang and Yueqian Wang and Pengfei Wu and Jianxin Liang and Dongyan Zhao and Zilong Zheng},
    year={2024},
    eprint={2402.16050},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

bigai-nlco / lstp-chat Goto Github PK

lstp-chat's Introduction

LSTP-Chat: Language-guided Spatial-Temporal Prompt Learning for Video Chat

Updates

Overview

Installation

Data Preparation

How to run

Evaluation Results

Demo

Acknowledgement

Citation

lstp-chat's People

Contributors

Stargazers

Watchers

Forkers

lstp-chat's Issues

Recommend Projects

Recommend Topics

Recommend Org