antoyang / vidchapters
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
Home Page: http://arxiv.org/abs/2309.13952
License: MIT License
Great work Antoine! In your last paper Vid2Seq, you also tested the pre-trained model on the ActivityNet captions dataset, but in VidChapters you only show on ViTT and YouCook2. I am wondering if there is any particular reason to pick ViTT and YouCook2. Is it because the ActivityNet captions dataset is larger than these two (i.e. longer training time) or is it because it contains more diverse activities which makes it a harder dataset?
Thank you!
@antoyang Thanks for the wonderful work! Could you please explain the purpose of normalizing time token weights?
How can I run it on a single video?
As I understand it, demo_vid2seq.py serves the main goal: video chapter generation.
How can I adapt this script for dense video captioning? Or could you please add a new demo for that kind of inference?
Hello! Thanks for releasing the PyTorch implementation of Vid2Seq. I was wondering if the model weights pretrained on YT-Temporal-1B would be released.
I prefer using faster-whisper. If you need a demo with it too, here is the revised code:
import argparse
import pickle
import whisperx
from faster_whisper import WhisperModel
from args import get_args_parser, MODEL_DIR

# Args
parser = argparse.ArgumentParser(parents=[get_args_parser()])
args = parser.parse_args()

print("load Whisper model")
asr_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

print("extract ASR")
# faster-whisper returns a lazy generator of segments plus an info object
segments, info = asr_model.transcribe(
    args.video_example,
    without_timestamps=True,
    word_timestamps=False,
    beam_size=5,
    initial_prompt='Please! add punctuations。',
    vad_filter=True,
)

print("load align model")
align_model, metadata = whisperx.load_align_model(language_code=info.language, device=args.device, model_dir=MODEL_DIR)

print("extract audio")
audio = whisperx.load_audio(args.video_example)

print("align ASR")
# Consuming the generator runs the transcription; keep only the fields whisperx needs
the_segments = [{'text': s.text, 'start': s.start, 'end': s.end} for s in segments]
print(the_segments[:3])

print("whisperx.......")
aligned_asr = whisperx.align(the_segments, align_model, metadata, audio, args.device, return_char_alignments=False)

print("saving")
with open(args.asr_example, 'wb') as f:
    pickle.dump(aligned_asr, f)
Hi!
Congratulations on a great job!
If I want to run dense captioning inference with this project, should I modify anything in "python demo_vid2seq.py --load= --video_example=<VIDEO_PATH> --asr_example <OUTPUT_ASR_PATH> --combine_datasets chapters"? The captioning result currently comes out in the video chapter format.
Thanks for your amazing work! Will you release chapters_clipvitl14_features so that we can have a quickstart ;)
Thanks for your wonderful work!
I would like to download videos of VidChapters. Could you provide some cmd tools to quickly download these videos? And how much storage do you use for these videos?
Thanks~
Hello, I want to use demo_vid2seq.py for video captioning, but there are several things I don't understand. First, when I run demo_vid2seq.py, I get this error:
load Vid2Seq model
Traceback (most recent call last):
File "demo_vid2seq.py", line 55, in
tokenizer = _get_tokenizer(args.model_name, args.num_bins)
File "/root/tzp/codes/VidChapters-main/model/vid2seq.py", line 12, in _get_tokenizer
tokenizer = T5Tokenizer.from_pretrained(tokenizer_path, local_files_only=True)
File "/root/.local/conda/envs/vidchapter/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
OSError: Can't load tokenizer for 't5-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 't5-base' is the correct path to a directory containing all relevant files for a T5Tokenizer tokenizer.
I don't know how to set up and load the tokenizer. Can you help me?
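The traceback above shows the tokenizer being loaded with local_files_only=True, so the call can only succeed if the tokenizer files are already on disk. A minimal sketch of checking a local directory before running the demo follows. The helper names missing_tokenizer_files and fetch_tokenizer are hypothetical, and the check assumes that the SentencePiece vocabulary file spiece.model is what a saved T5Tokenizer directory needs; it is not the repository's own code.

```python
import os
import tempfile

# Assumption: a saved (slow) T5Tokenizer directory is recognised by its
# SentencePiece vocabulary file, which transformers stores as "spiece.model".
REQUIRED_FILES = ("spiece.model",)


def missing_tokenizer_files(tokenizer_dir):
    """Return the required files that are absent from tokenizer_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(tokenizer_dir, name))
    ]


def fetch_tokenizer(target_dir, model_name="t5-base"):
    """Download the tokenizer once (needs network access) and save it locally."""
    # Imported lazily so the check above works without transformers installed.
    from transformers import T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(target_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as empty_dir:
        # An empty directory is reported as missing the vocabulary file.
        print(missing_tokenizer_files(empty_dir))
```

Since the traceback shows args.model_name being passed as the tokenizer path, pointing that argument at the saved directory might be enough, but this is an untested assumption because the same argument may also be used elsewhere in the script.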
Hi!
May I know how to run inference without speech?
I set --no_speech, but then the output is [].
Also, when I run inference on the ActivityNet and Charades datasets, the output looks as if the model only considers the speech features.
Thank you!
Thank you for your work. I would like to know whether the file clipvit14.pth contains the features for the ViTT test set. When testing, I encountered this error:
File "/public/home/code/VidChapters-main/VidChapters-main/dataset/dvc_dataset.py", line 65, in _get_video
assert video_id in self.features, video_id
AssertionError: 0_-0zE4NDkuYo
Hello, first of all, thank you for an awesome project and for making the source code and models available. I really appreciate it.
I was able to run demo_asr.py and demo_vid2seq.py. I tested the HowTo100M + VidChapters-7M + YouCook2 and VidChapters-7M models with them.
On the few videos I tested, the VidChapters-7M model seems to work better. However, after some time, it starts to repeat the same chapter name for the rest of the video. I know it's not using ChatGPT behind the scenes, so the term "hallucination" is not a perfect fit for the title.
Here is an example:
input video: https://www.youtube.com/watch?v=jARPzYkjp3g
output:
[
{'sentence': 'The four pillars of happiness.', 'timestamp': [0.0, 14.901636363636365]},
{'sentence': 'The four pillars of happiness.', 'timestamp': [14.901636363636365, 69.5409696969697]},
{'sentence': 'Belonging.', 'timestamp': [69.5409696969697, 94.37703030303031]},
{'sentence': 'Purpose.', 'timestamp': [94.37703030303031, 139.0819393939394]},
{'sentence': 'Religion.', 'timestamp': [139.0819393939394, 173.85242424242423]},
{'sentence': 'Fast casual restaurants.', 'timestamp': [173.85242424242423, 213.59012121212123]},
{'sentence': 'The sacramento.', 'timestamp': [213.59012121212123, 243.39339393939395]},
{'sentence': 'The sacramento.', 'timestamp': [243.39339393939395, 268.22945454545453]},
{'sentence': 'The sacramento.', 'timestamp': [268.22945454545453, 307.9671515151515]},
{'sentence': 'The sacramento.', 'timestamp': [307.9671515151515, 342.73763636363634]},
{'sentence': 'The sacramento.', 'timestamp': [342.73763636363634, 377.50812121212124]},
{'sentence': 'The sacramento.', 'timestamp': [377.50812121212124, 407.31139393939395]},
{'sentence': 'The sacramento.', 'timestamp': [407.31139393939395, 437.11466666666666]},
{'sentence': 'The sacramento.', 'timestamp': [437.11466666666666, 461.9507272727273]},
{'sentence': 'The sacramento.', 'timestamp': [461.9507272727273, 471.88515151515156]},
{'sentence': 'The sacramento.', 'timestamp': [471.88515151515156, 491.754]}
]
Another example:
Input: https://www.youtube.com/watch?v=PRpr0_Iz4dI
Output:
[
{'sentence': 'Intro.', 'timestamp': [0.0, 7.774242424242424]},
{'sentence': 'Why we hate the new logos.', 'timestamp': [7.774242424242424, 116.61363636363636]},
{'sentence': 'Why people are missing the bigger picture.', 'timestamp': [116.61363636363636, 174.92045454545453]},
{'sentence': "Why people aren't logical.", 'timestamp': [174.92045454545453, 244.88863636363635]},
{'sentence': "Why people aren't logical.", 'timestamp': [244.88863636363635, 384.82499999999993]}
]
Do you know if I can do something to improve it a little?
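One cosmetic workaround is to post-process the output list and merge consecutive entries that carry the same sentence. The sketch below assumes the output format shown in the two examples above (a list of dicts with 'sentence' and 'timestamp' keys); merge_repeated_chapters is a hypothetical helper, not part of the repository. It does not fix the underlying repetition in the model's generation, it only collapses the duplicates.

```python
def merge_repeated_chapters(chapters):
    """Merge consecutive entries that share the same sentence into one span."""
    merged = []
    for chapter in chapters:
        start, end = chapter["timestamp"]
        if merged and merged[-1]["sentence"] == chapter["sentence"]:
            # Same title as the previous chapter: extend its end time.
            merged[-1]["timestamp"][1] = end
        else:
            # Copy the fields so the input list is left untouched.
            merged.append({"sentence": chapter["sentence"], "timestamp": [start, end]})
    return merged


example = [
    {"sentence": "Intro.", "timestamp": [0.0, 7.77]},
    {"sentence": "Why people aren't logical.", "timestamp": [174.92, 244.89]},
    {"sentence": "Why people aren't logical.", "timestamp": [244.89, 384.82]},
]
print(merge_repeated_chapters(example))
```

Applied to the first example above, this would leave a single long "The sacramento." chapter, which makes the failure easier to spot but still needs a better prediction for that part of the video.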
How can I run demo_asr.py for videos without speech?
I tried it by pickling a json file as:
{ "segments": [], "word_segments": [] }
But I got a runtime error when running demo_vid2seq.py:
load Vid2Seq model
loading visual backbone
extracting visual features
visual features extracted
load ASR
ASR to tokens
Traceback (most recent call last):
File "demo_vid2seq.py", line 150, in <module>
input_tokens = torch.cat(input_tokens, 0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
Hi! Thank you for this huge and interesting work!
Could you please provide a guide to the implementation of both the generative objective and the denoising objective mentioned in the Vid2Seq paper? I can't find them in the code.
We ran demo_vid2seq.py.
In vid2seq.py, at line 41, we found:
self.t5_model.resize_token_embeddings(len(tokenizer) - num_bins) # remove the weights of the 28 tokens that are not used (32128 vs 32100 in the tokenizer)
self.t5_model.resize_token_embeddings(len(tokenizer)) # add time tokens
These two lines of code look the same to us. We commented one of them out, and an error occurred:
File "demo_vid2seq.py", line 170, in
temperature=1)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/vid2seq.py", line 163, in generate
num_return_sequences=num_captions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 1534, in generate
**model_kwargs,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 2814, in beam_search
output_hidden_states=output_hidden_states,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1698, in forward
return_dict=return_dict,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1082, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 710, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 616, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 528, in forward
query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
May I ask how we should resolve it?
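The two resize_token_embeddings calls quoted above differ by the "- num_bins" term, and the inline comment gives the sizes involved (32128 embedding rows vs 32100 tokenizer entries). The arithmetic can be sketched as follows. It assumes that len(tokenizer) already counts the time tokens, which the "len(tokenizer) - num_bins" expression suggests, and it uses num_bins = 100 purely as an illustrative value. The function names are hypothetical.

```python
T5_EMBEDDING_ROWS = 32128  # rows in the pretrained embedding table (from the code comment)
T5_TOKENIZER_SIZE = 32100  # entries in the base tokenizer (from the code comment)
NUM_BINS = 100             # assumed number of time tokens, for illustration only


def embedding_rows(num_bins, shrink_first=True, grow_second=True):
    """Return the embedding table size after the enabled resize calls."""
    # Assumption: len(tokenizer) already includes the num_bins time tokens.
    tokenizer_len = T5_TOKENIZER_SIZE + num_bins
    rows = T5_EMBEDDING_ROWS
    if shrink_first:
        rows = tokenizer_len - num_bins  # drops the 28 unused rows
    if grow_second:
        rows = tokenizer_len             # appends rows for the time tokens
    return rows


def largest_token_id(num_bins):
    """Highest id the tokenizer can emit once time tokens are added."""
    return T5_TOKENIZER_SIZE + num_bins - 1


print("both calls:      ", embedding_rows(NUM_BINS))
print("second call off: ", embedding_rows(NUM_BINS, grow_second=False))
print("largest token id:", largest_token_id(NUM_BINS))
```

Under these assumptions, disabling the second call leaves a table that is smaller than the largest time-token id, so embedding lookups would go out of range. On a GPU, out-of-range indices often surface later as opaque CUDA or cuBLAS errors rather than a clear index error, which may be what the traceback above shows. Running once on CPU usually gives a more readable message.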