antoyang / vidchapters

[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale

Home Page: http://arxiv.org/abs/2309.13952

License: MIT License

Python 47.74% Shell 0.34% C++ 0.35% Cuda 3.52% Jupyter Notebook 48.06%

dense-video-captioning multimodal-learning pre-training temporal-language-grounding video-captioning video-understanding vision-and-language weakly-supervised-learning vid2seq video-chapter-generation

vidchapters's Issues

Dense video captioning on ActivityNet Captions dataset?

Great work Antoine! In your last paper Vid2Seq, you also tested the pre-trained model on the ActivityNet captions dataset, but in VidChapters you only show on ViTT and YouCook2. I am wondering if there is any particular reason to pick ViTT and YouCook2. Is it because the ActivityNet captions dataset is larger than these two (i.e. longer training time) or is it because it contains more diverse activities which makes it a harder dataset?

Thank you!

The Purpose of Time Token Weights Normalization

@antoyang Thanks for the wonderful work! Could you please explain the purpose of normalizing time token weights?

How to run on a single video

How can I run it on a single video?
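A minimal sketch of how the two demo scripts could be chained for one video. The demo_vid2seq.py flags are taken from the command quoted in another issue on this page; passing the same flags to demo_asr.py is an assumption, and checkpoint_path is a placeholder for whatever checkpoint you downloaded. The helper name build_demo_commands is hypothetical.

```python
import shlex


def build_demo_commands(video_path, asr_path, checkpoint_path):
    """Return the two demo invocations as argv lists: ASR first, then Vid2Seq."""
    # Assumption: demo_asr.py accepts the same flags as demo_vid2seq.py.
    asr_cmd = [
        "python", "demo_asr.py",
        f"--video_example={video_path}",
        "--asr_example", asr_path,
        "--combine_datasets", "chapters",
    ]
    # Flags as quoted in the "Inference of dense caption" issue.
    vid2seq_cmd = [
        "python", "demo_vid2seq.py",
        f"--load={checkpoint_path}",
        f"--video_example={video_path}",
        "--asr_example", asr_path,
        "--combine_datasets", "chapters",
    ]
    return [asr_cmd, vid2seq_cmd]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own files.
    for cmd in build_demo_commands("video.mp4", "asr.pkl", "checkpoint.pth"):
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root once the printed commands look right.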

Demo for dense video captioning purposes

As I understand it, demo_vid2seq.py serves the main goal of the project: video chapter generation.
How can I change this script for dense video captioning? Or could you add a new demo for that kind of inference, please?

Will the model weights pretrained on YT-Temporal-1B be released?

Hello! Thanks for releasing the PyTorch implementation of Vid2Seq. I was wondering if the model weights pretrained on YT-Temporal-1B would be released.

Replace with faster-whisper

I prefer using faster-whisper. If you also want to run the demo with it, here is the revised code:

import argparse
import pickle
from typing import List, TypedDict

import torch
import whisperx
from faster_whisper import WhisperModel

from args import get_args_parser, MODEL_DIR


class SingleSegment(TypedDict):
    """
    A single segment (up to multiple sentences) of a speech.
    """
    start: float
    end: float
    text: str


# Args
parser = argparse.ArgumentParser(parents=[get_args_parser()])
args = parser.parse_args()
device = torch.device(args.device)

print("load Whisper model")
asr_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

print("extract ASR")
# transcribe() returns a lazy generator of segments plus an info object
# that carries the detected language.
segments, info = asr_model.transcribe(
    args.video_example,
    without_timestamps=True,
    word_timestamps=False,
    beam_size=5,
    initial_prompt='Please！ add punctuations。',
    vad_filter=True,
)

print("load align model")
align_model, metadata = whisperx.load_align_model(
    language_code=info.language, device=args.device, model_dir=MODEL_DIR
)

print("extract audio")
audio = whisperx.load_audio(args.video_example)

print("align ASR")
# Convert faster-whisper segments into the dict format whisperx.align expects.
the_segments: List[SingleSegment] = [
    {'text': segment.text, 'start': segment.start, 'end': segment.end}
    for segment in segments
]
print(the_segments[:3])

print("whisperx.......")
aligned_asr = whisperx.align(
    the_segments, align_model, metadata, audio, args.device,
    return_char_alignments=False,
)

print("saving")
with open(args.asr_example, 'wb') as f:
    pickle.dump(aligned_asr, f)

Inference of dense caption

Hi!
Congratulations on the great work!
If I want to run dense captioning inference with this project, should I modify something in "python demo_vid2seq.py --load= --video_example=<VIDEO_PATH> --asr_example <OUTPUT_ASR_PATH> --combine_datasets chapters"? I ask because the captioning result comes out in the video chapter format.

Will you release chapters_clipvitl14_features?

Thanks for your amazing work! Will you release chapters_clipvitl14_features so that we can have a quickstart ;)

Video request

Thanks for your wonderful work!

I would like to download the VidChapters videos. Could you provide some command-line tools to download them quickly? And how much storage do these videos take?

Thanks~

Error with dependencies in requirements.txt

First of all, thank you for this huge and interesting work!

I've found some errors with the versions of the dependencies in the requirements.txt file:

tokenizer in demo_vid2seq.py

Hello, I want to use demo_vid2seq.py for video captioning, but there are several things I don't understand. First, when I run demo_vid2seq.py, I get this error:

load Vid2Seq model
Traceback (most recent call last):
File "demo_vid2seq.py", line 55, in <module>
tokenizer = _get_tokenizer(args.model_name, args.num_bins)
File "/root/tzp/codes/VidChapters-main/model/vid2seq.py", line 12, in _get_tokenizer
tokenizer = T5Tokenizer.from_pretrained(tokenizer_path, local_files_only=True)
File "/root/.local/conda/envs/vidchapter/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
OSError: Can't load tokenizer for 't5-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 't5-base' is the correct path to a directory containing all relevant files for a T5Tokenizer tokenizer.

I don't know how to set up and load the tokenizer. Can you help me?
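The traceback shows from_pretrained being called with local_files_only=True, so the tokenizer files have to be on disk already, either in the Hugging Face cache or in a local directory. As a rough sketch (the helper name is hypothetical, and it only covers the local-directory case), one can check a directory for the SentencePiece vocabulary file that the slow T5Tokenizer loads, spiece.model:

```python
import os


def looks_like_t5_tokenizer_dir(path):
    """True if `path` is a directory that holds the SentencePiece vocabulary
    file (spiece.model) that the slow T5Tokenizer loads."""
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, "spiece.model"))


if __name__ == "__main__":
    # "t5-base" is the name from the error message; when it is not a local
    # directory, transformers falls back to its cache instead.
    print(looks_like_t5_tokenizer_dir("t5-base"))
```

On a machine with network access, T5Tokenizer.from_pretrained("t5-base").save_pretrained(some_dir) writes such a directory, which can then be copied to the offline machine.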

inference without speech

Hi!
May I know how to run inference without speech?

I've set --no_speech, but then the output is [].
And when I run inference on the ActivityNet and Charades datasets, the output looks like it only takes the speech features into account.

Thank you!

test on vitt

Thank you for your work. I would like to know if the file clipvit14.pth contains the features of the test set of ViTT. When I was testing, I encountered this error:

File "/public/home/code/VidChapters-main/VidChapters-main/dataset/dvc_dataset.py", line 65, in _get_video
assert video_id in self.features, video_id
AssertionError: 0_-0zE4NDkuYo

Hallucinations in chapter title generation after some time

Hello. First of all, thank you for an awesome project and for making the source code and models available. I really appreciate it.

I was able to run the demo_asr.py and demo_vid2seq.py. I tested HowTo100M + VidChapters-7M + YouCook2 and VidChapters-7M models with it.

On the few videos I tested, VidChapters-7M seems to work better. However, after some time, it starts to repeat the same chapter name for the rest of the video. I know it's not using ChatGPT behind the scenes, so the term "hallucination" in the title is not a perfect fit.

Here is an example:

input video: https://www.youtube.com/watch?v=jARPzYkjp3g

output:

[
    {'sentence': 'The four pillars of happiness.', 'timestamp': [0.0, 14.901636363636365]}, 
    {'sentence': 'The four pillars of happiness.', 'timestamp': [14.901636363636365, 69.5409696969697]}, 
    {'sentence': 'Belonging.', 'timestamp': [69.5409696969697, 94.37703030303031]}, 
    {'sentence': 'Purpose.', 'timestamp': [94.37703030303031, 139.0819393939394]}, 
    {'sentence': 'Religion.', 'timestamp': [139.0819393939394, 173.85242424242423]}, 
    {'sentence': 'Fast casual restaurants.', 'timestamp': [173.85242424242423, 213.59012121212123]}, 
    {'sentence': 'The sacramento.', 'timestamp': [213.59012121212123, 243.39339393939395]}, 
    {'sentence': 'The sacramento.', 'timestamp': [243.39339393939395, 268.22945454545453]}, 
    {'sentence': 'The sacramento.', 'timestamp': [268.22945454545453, 307.9671515151515]}, 
    {'sentence': 'The sacramento.', 'timestamp': [307.9671515151515, 342.73763636363634]}, 
    {'sentence': 'The sacramento.', 'timestamp': [342.73763636363634, 377.50812121212124]}, 
    {'sentence': 'The sacramento.', 'timestamp': [377.50812121212124, 407.31139393939395]}, 
    {'sentence': 'The sacramento.', 'timestamp': [407.31139393939395, 437.11466666666666]}, 
    {'sentence': 'The sacramento.', 'timestamp': [437.11466666666666, 461.9507272727273]}, 
    {'sentence': 'The sacramento.', 'timestamp': [461.9507272727273, 471.88515151515156]}, 
    {'sentence': 'The sacramento.', 'timestamp': [471.88515151515156, 491.754]}
]

Another example:
Input: https://www.youtube.com/watch?v=PRpr0_Iz4dI

Output:

[
	{'sentence': 'Intro.', 'timestamp': [0.0, 7.774242424242424]}, 
	{'sentence': 'Why we hate the new logos.', 'timestamp': [7.774242424242424, 116.61363636363636]}, 
	{'sentence': 'Why people are missing the bigger picture.', 'timestamp': [116.61363636363636, 174.92045454545453]}, 
	{'sentence': "Why people aren't logical.", 'timestamp': [174.92045454545453, 244.88863636363635]}, 
	{'sentence': "Why people aren't logical.", 'timestamp': [244.88863636363635, 384.82499999999993]}
]

Do you know if I can do something to improve it a little?
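One cheap workaround, sketched below, is to post-process the output and merge consecutive chapters that carry the same title. This only hides the symptom and does not change what the model generates. The function name is hypothetical; the input format is the list of dicts shown in the outputs above, and the sample timestamps are rounded from the first example.

```python
def merge_repeated_chapters(chapters):
    """Merge consecutive chapters with identical titles into one chapter
    that spans from the first start time to the last end time."""
    merged = []
    for chapter in chapters:
        start, end = chapter["timestamp"]
        if merged and merged[-1]["sentence"] == chapter["sentence"]:
            # Same title as the previous chapter: extend its end time.
            merged[-1]["timestamp"][1] = end
        else:
            # Copy the entry so the input list is left untouched.
            merged.append({"sentence": chapter["sentence"], "timestamp": [start, end]})
    return merged


if __name__ == "__main__":
    sample = [
        {"sentence": "Purpose.", "timestamp": [94.4, 139.1]},
        {"sentence": "The sacramento.", "timestamp": [213.6, 243.4]},
        {"sentence": "The sacramento.", "timestamp": [243.4, 268.2]},
        {"sentence": "The sacramento.", "timestamp": [268.2, 308.0]},
    ]
    for chapter in merge_repeated_chapters(sample):
        print(chapter)
```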

Run demo for videos without speech

How can I run demo_asr.py for videos without speech?
I tried it by pickling a json file as:
{ "segments": [], "word_segments": [] }
But I got a runtime error, which according to the traceback is raised in demo_vid2seq.py:
load Vid2Seq model
loading visual backbone
extracting visual features
visual features extracted
load ASR
ASR to tokens
Traceback (most recent call last):
File "demo_vid2seq.py", line 150, in <module>
input_tokens = torch.cat(input_tokens, 0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
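The traceback suggests that torch.cat receives an empty list when the ASR file has no segments. As a small sketch, the pickled ASR file can be checked before the demo is launched. The function name is hypothetical, and the "segments" key is taken from the dict pickled in this issue.

```python
import pickle


def count_asr_segments(asr_path):
    """Return the number of speech segments stored in a pickled ASR file."""
    with open(asr_path, "rb") as f:
        asr = pickle.load(f)
    # The ASR file is assumed to be a dict with a "segments" list.
    return len(asr.get("segments", []))


if __name__ == "__main__":
    import os
    import tempfile

    # Reproduce the empty ASR file described above and inspect it.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "asr.pkl")
        with open(path, "wb") as f:
            pickle.dump({"segments": [], "word_segments": []}, f)
        if count_asr_segments(path) == 0:
            print("no speech segments: the token list would be empty")
```

A count of zero means the speech branch has nothing to concatenate, so the script would need a guard at that point, or a run without the speech input.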

About the generative objective and the denoising objective

Hi! Thank you for this huge and interesting work!
Could you please point me to the implementation of the generative objective and the denoising objective described in the Vid2Seq paper? I can't find them in the code.

Code issue

We ran demo_vid2seq.py.
In vid2seq.py line 41 we find that:
self.t5_model.resize_token_embeddings(len(tokenizer) - num_bins) # remove the weights of the 28 tokens that are not used (32128 vs 32100 in the tokenizer)
self.t5_model.resize_token_embeddings(len(tokenizer)) # add time tokens

These two lines of code are the same. We commented out one line and an error occurred:
File "demo_vid2seq.py", line 170, in <module>
temperature=1)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/vid2seq.py", line 163, in generate
num_return_sequences=num_captions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 1534, in generate
**model_kwargs,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 2814, in beam_search
output_hidden_states=output_hidden_states,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1698, in forward
return_dict=return_dict,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1082, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 710, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 616, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 528, in forward
query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

May I ask how we should resolve it?
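Going by their arguments and inline comments, the two resize_token_embeddings calls are not identical: the first shrinks the embedding matrix to the base vocabulary, and the second grows it again to make room for the time tokens. The arithmetic below is a sketch under stated assumptions: 32128 and 32100 come from the code comment, num_bins = 100 is assumed, and len(tokenizer) is assumed to already include the time tokens. If the second call is the one removed, time-token ids fall outside the embedding matrix, and on a GPU an out-of-range index can surface later as a CUDA or cuBLAS error. This is one plausible explanation, not a confirmed diagnosis.

```python
def embedding_rows(pretrained_rows, tokenizer_len, num_bins, shrink=True, grow=True):
    """Row count of the embedding matrix after the optional resize calls."""
    rows = pretrained_rows
    if shrink:
        rows = tokenizer_len - num_bins  # first call: drop the unused rows
    if grow:
        rows = tokenizer_len  # second call: add rows for the time tokens
    return rows


PRETRAINED_ROWS = 32128  # embedding rows in the T5 checkpoint (from the code comment)
BASE_VOCAB = 32100       # tokens in the base tokenizer (from the code comment)
NUM_BINS = 100           # assumption: number of time tokens
TOKENIZER_LEN = BASE_VOCAB + NUM_BINS  # assumption: time tokens already in the tokenizer

both_calls = embedding_rows(PRETRAINED_ROWS, TOKENIZER_LEN, NUM_BINS)
shrink_only = embedding_rows(PRETRAINED_ROWS, TOKENIZER_LEN, NUM_BINS, grow=False)
largest_token_id = TOKENIZER_LEN - 1

print(both_calls)                       # 32200 rows: every token id has a row
print(shrink_only)                      # 32100 rows
print(largest_token_id >= shrink_only)  # True: time-token ids would be out of range
```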