Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, Qin Jin
Video storytelling is an engaging form of multimedia content that combines video with accompanying narration to tell a story and attract an audience; a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, practical applications typically require synchronized narrations for ongoing visual scenes.
In this work, we introduce the new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, associated with each video clip, should relate to the visual content, integrate relevant knowledge, and have a word count appropriate to the clip's duration. A structured storyline is beneficial for guiding the generation process, ensuring coherence and integrity. To support exploration of this task, we introduce a new benchmark dataset, E-SyncVidStory, with rich annotations. Since existing multimodal LLMs are not effective at this task in one-shot or few-shot settings, we propose a framework named VideoNarrator, which generates a storyline for the input video and simultaneously generates narrations under the guidance of the generated or a predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach.
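As a concrete illustration of the synchronization constraint, here is a minimal sketch (not the paper's method) of deriving a target narration length from a clip's duration; the speaking-rate constant and helper name are hypothetical:

```python
# Hypothetical sketch: target narration length from clip duration,
# assuming a fixed speaking rate of ~4 Chinese characters per second.
CHARS_PER_SECOND = 4.0  # assumed rate; tune for your narrator / TTS voice

def target_length(clip_duration_s: float) -> int:
    """Approximate number of characters a synchronized narration
    should contain for a clip of the given duration."""
    return round(clip_duration_s * CHARS_PER_SECOND)

print(target_length(3.5))  # -> 14 characters for a 3.5 s clip
```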
- 2024/12: Video frame features and models can be accessed via OneDrive.
- 2024/07: Our code and dataset annotations are released. Video features will be available soon.
- Clone this repository
git clone https://github.com/alibaba/alimama-video-narrator
cd alimama-video-narrator
- Install Package
pip install --upgrade pip
conda env create -f environment.yml
- Files
Our annotations can be found at "/data/all_video_data.json".
- Data Process
Due to copyright considerations, we will release the features of the original videos (coming soon).
If you want to extract features from your raw videos, please download all videos and store them in "/data_process/all_videos/". Then, proceed to extract the video features:
cd data_process/
python process_video.py
python get_blip_fea.py
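The scripts above segment the videos and extract frame features with BLIP-2. As a rough, self-contained sketch of what such feature extraction looks like with Hugging Face transformers (the checkpoint name is an assumption; the repo hardcodes a local path without naming the exact variant, see the issue below):

```python
import torch
from PIL import Image
from transformers import Blip2Model, Blip2Processor

# Assumed checkpoint: the repo hardcodes "/pretrained_models/blip2/"
# without specifying which BLIP-2 variant is used.
ckpt = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2Model.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda").eval()

frame = Image.open("images/video_cuts/example_frame.jpg")  # hypothetical frame path
inputs = processor(images=frame, return_tensors="pt").to("cuda", torch.float16)
with torch.no_grad():
    # Q-Former output: a fixed number of query-token embeddings per frame
    feats = model.get_qformer_features(**inputs).last_hidden_state
print(feats.shape)  # (1, num_query_tokens, hidden_size)
```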
Get the training data:
cd data_process/
# Visual Compression & Memory Consolidation
python get_training_data.py ./blip_fea/video_cuts/ ../data/all_video_data.json
cp training_data.json ../data/split/
cd ../data/split/
python split.py training_data.json ../all_video_data.json
python get_cut_data.py train.json train_shots.json
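The "Visual Compression & Memory Consolidation" comment above refers to a step performed in get_training_data.py. Purely as an illustration of the general idea of visual compression (not the repo's exact algorithm), adjacent near-duplicate frame features can be merged like this:

```python
# Illustrative sketch only: compress a frame-feature sequence by merging
# runs of adjacent frames whose cosine similarity exceeds a threshold.
import numpy as np

def compress_frames(feats: np.ndarray, thresh: float = 0.9) -> np.ndarray:
    """feats: (num_frames, dim). Replace each run of near-duplicate
    adjacent frames with its mean feature."""
    groups, current = [], [feats[0]]
    for f in feats[1:]:
        prev = current[-1]
        sim = float(f @ prev / (np.linalg.norm(f) * np.linalg.norm(prev) + 1e-8))
        if sim > thresh:
            current.append(f)  # near-duplicate: extend the current run
        else:
            groups.append(np.mean(current, axis=0))  # close the run
            current = [f]
    groups.append(np.mean(current, axis=0))
    return np.stack(groups)
```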
We build on the pretrained firefly-baichuan-7b model; details are available at: https://github.com/yangjianxin1/Firefly
Alternatively, you can directly use the Baichuan-7B model, available at: https://huggingface.co/baichuan-inc/Baichuan-7B
Run the following shell script to train your model:
bash train.sh
Run the following shell script for inference. Set 'offered_label' to False to generate narrations based on the model-generated storyline, or to True to use the ground-truth (user-provided) storyline.
bash infer.sh
- Standard Metrics (BLEU & CIDEr)
python tokenize_output.py $chk_path/output.json
cd metrics/evaluator_for_caption/
python evaluate_ads.py $chk_path/out_tokens.json
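For reference, here is a minimal sketch of computing these metrics with the pycocoevalcap package (an assumption about what evaluate_ads.py wraps internally); the IDs and the pre-tokenized, space-separated sentences are made up:

```python
# Both metrics expect {id: [tokenized sentence, ...]} dicts, which is why
# the outputs are tokenized first (tokenize_output.py above).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

refs = {"vid0_clip0": ["这 款 面霜 质地 轻盈"]}  # ground-truth narrations
hyps = {"vid0_clip0": ["这 款 面霜 非常 轻盈"]}  # model outputs

bleu, _ = Bleu(4).compute_score(refs, hyps)
cider, _ = Cider().compute_score(refs, hyps)
print("BLEU-4:", bleu[3], "CIDEr:", cider)
```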
- Visual Relevance (EMScore & EMScore_ref)
Download the Chinese-CLIP model from: https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16
cd metrics/EMScore/
python eval_ad_with_ref.py --inpath $chk_path/output.json
"EMScore(X,V) -> full_F" refers to EMScore;"EMScore(X,V,X*) -> full_F" refers to EMScore_ref 3. Knowledge Relevance
- Knowledge Relevance (info_sim & info_diverse)
Download the chinese-roberta-large model from: https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
cd metrics/roberta_based/
# info_sim
python info_sim.py ../data/all_video_data.json $chk_path/output.json idf_with_all_ref.json
# info_diverse
python info_diverse.py $chk_path/output.json idf_with_all_ref.json
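As a simplified illustration of embedding-based knowledge relevance (info_sim.py additionally applies the IDF weighting in idf_with_all_ref.json, which is omitted here, and the example sentences are made up):

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "hfl/chinese-roberta-wwm-ext-large"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled sentence embedding from the last hidden layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

a = embed("这款面霜含有玻尿酸")          # generated narration
b = embed("产品成分：玻尿酸，烟酰胺")    # reference knowledge snippet
print(torch.cosine_similarity(a, b, dim=0).item())
```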
- Fluency (intra-story repetition)
cd metrics/roberta_based/
python count_intra_repeat.py $chk_path/output.json
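For intuition, a simple sketch of one intra-story repetition statistic: the fraction of character 4-grams in each narration that already appeared earlier in the same story (count_intra_repeat.py may define the statistic differently):

```python
def ngrams(text: str, n: int = 4) -> set:
    """Character n-grams of a (Chinese) sentence."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def intra_repetition(narrations: list[str], n: int = 4) -> float:
    """Fraction of n-grams already seen earlier in the story."""
    seen, repeated, total = set(), 0, 0
    for sent in narrations:
        grams = ngrams(sent, n)
        repeated += len(grams & seen)
        total += len(grams)
        seen |= grams
    return repeated / max(total, 1)

print(intra_repetition(["这款面霜轻盈保湿", "面霜轻盈不油腻", "这款面霜轻盈保湿"]))
```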
If you find our work useful for your research and applications, please cite using this BibTeX:
@misc{yang2024synchronizedvideostorytellinggenerating,
title={Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline},
author={Dingyi Yang and Chunru Zhan and Ziheng Wang and Biao Wang and Tiezheng Ge and Bo Zheng and Qin Jin},
year={2024},
eprint={2405.14040},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2405.14040},
}
Issues
BLIP-2 Model used
Hi,
Thanks for sharing and very cool work! I have a couple of suggestions to improve the codebase:
Update Hardcoded Model Location
In the file get_blip_fea.py, could you update the hardcoded model location? The current path doesn't indicate which model is being used. For example:
- `blip_loc = "/pretrained_models/blip2/"`
+ `blip_loc = "Salesforce/blip2-opt-2.7b"`
Add Directory Creation Code
In the file process_video.py, the code expects the directory images/video_cuts, which does not exist by default. Could you please add the following code to ensure the directory is created if it doesn't exist?
from pathlib import Path
# parents=True also creates the intermediate "images/" directory if needed
Path("images/video_cuts").mkdir(parents=True, exist_ok=True)
Similarly, in get_blip_fea.py for blip_fea/video_cuts.
Missing file ``non_lora_trainables.bin`` in ``lora_video_infer.py``
The script lora_video_infer.py expects a file named non_lora_trainables.bin to be present in the model_path. However, I could not find any instructions or references on where to obtain or generate this file.
I am running lora_video_infer.py with --pretrain_path=baichuan-inc/Baichuan-7B (from the HF hub).
Thank you for your assistance!