antoyang / vidchapters

151 stars, 3 watchers, 15 forks, 35.16 MB

[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale

Home Page: http://arxiv.org/abs/2309.13952

License: MIT License

Languages: Python 47.74%, Shell 0.34%, C++ 0.35%, Cuda 3.52%, Jupyter Notebook 48.06%
dense-video-captioning multimodal-learning pre-training temporal-language-grounding video-captioning video-understanding vision-and-language weakly-supervised-learning vid2seq video-chapter-generation

vidchapters's People

Contributors

antoyang, eltociear



vidchapters's Issues

Dense video captioning on ActivityNet Captions dataset?

Great work, Antoine! In your previous paper, Vid2Seq, you also tested the pre-trained model on the ActivityNet Captions dataset, but in VidChapters you only show results on ViTT and YouCook2. I am wondering if there is any particular reason for picking ViTT and YouCook2. Is it because the ActivityNet Captions dataset is larger than these two (i.e., longer training time), or because it contains more diverse activities, which makes it a harder dataset?

Thank you!

Demo for dense video captioning purposes

As I understand it, demo_vid2seq.py serves the main goal of video chapter generation.
How can I modify this module for dense video captioning? Or could you add a new demo for that kind of inference, please?

Replace with faster-whisper

I prefer using faster-whisper. If you want a demo with it too, here is the revised code:

import argparse
import pickle
from typing import List, TypedDict

import torch
import whisperx
from faster_whisper import WhisperModel

from args import get_args_parser, MODEL_DIR


class SingleSegment(TypedDict):
    """A single segment (up to multiple sentences) of a speech."""
    start: float
    end: float
    text: str


# Args
parser = argparse.ArgumentParser(parents=[get_args_parser()])
args = parser.parse_args()
device = torch.device(args.device)

print("load Whisper model")
asr_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

print("extract ASR")
# faster-whisper returns a (segments generator, info) tuple
segments, info = asr_model.transcribe(
    args.video_example,
    without_timestamps=True,
    word_timestamps=False,
    beam_size=5,
    initial_prompt="Please! add punctuations。",
    vad_filter=True,
)

print("load align model")
align_model, metadata = whisperx.load_align_model(
    language_code=info.language, device=args.device, model_dir=MODEL_DIR
)

print("extract audio")
audio = whisperx.load_audio(args.video_example)

print("align ASR")
# Convert faster-whisper segments to the dict format expected by whisperx.align
the_segments: List[SingleSegment] = [
    {"text": segment.text, "start": segment.start, "end": segment.end}
    for segment in segments
]
print(the_segments[:3])

print("whisperx align")
aligned_asr = whisperx.align(
    the_segments, align_model, metadata, audio, args.device, return_char_alignments=False
)

print("saving")
with open(args.asr_example, "wb") as f:
    pickle.dump(aligned_asr, f)

Inference of dense caption

Hi!
Congratulations, you have done a great job!
If I want to do dense captioning inference in this project, should I modify something in "python demo_vid2seq.py --load= --video_example=<VIDEO_PATH> --asr_example <OUTPUT_ASR_PATH> --combine_datasets chapters"? I ask because the captioning result comes out in the video chapter format.

Video request

Thanks for your wonderful work!

I would like to download the videos of VidChapters. Could you provide some command-line tools to quickly download these videos? And how much storage do you use for them?

Thanks~
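
Not an official downloader from the authors, but a minimal sketch of one way to fetch the videos by YouTube id with yt-dlp (the ids.txt file name and the output layout are placeholders):

from yt_dlp import YoutubeDL

# Read one YouTube video id per line (ids.txt is a placeholder file name).
with open("ids.txt") as f:
    video_ids = [line.strip() for line in f if line.strip()]

opts = {
    "format": "mp4",                     # keep a single mp4 stream per video
    "outtmpl": "videos/%(id)s.%(ext)s",  # save as videos/<video_id>.mp4
    "ignoreerrors": True,                # skip unavailable or private videos
}
with YoutubeDL(opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={vid}" for vid in video_ids])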

tokenizer in demo_vid2seq.py

Hello, I want to use demo_vid2seq.py to get video captions, but there are several things I don't understand. First, when I run demo_vid2seq.py, I get this error:

load Vid2Seq model
Traceback (most recent call last):
File "demo_vid2seq.py", line 55, in
tokenizer = _get_tokenizer(args.model_name, args.num_bins)
File "/root/tzp/codes/VidChapters-main/model/vid2seq.py", line 12, in _get_tokenizer
tokenizer = T5Tokenizer.from_pretrained(tokenizer_path, local_files_only=True)
File "/root/.local/conda/envs/vidchapter/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
OSError: Can't load tokenizer for 't5-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 't5-base' is the correct path to a directory containing all relevant files for a T5Tokenizer tokenizer.

I don't know how to set up and load the tokenizer. Can you help me?
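
A minimal sketch of one possible fix, assuming the error comes from local_files_only=True with no locally cached copy of t5-base: download the tokenizer once with network access so that the files exist locally (the save directory below is only a placeholder):

from transformers import T5Tokenizer

# Download and cache the t5-base tokenizer once while online
# (it ends up under the Hugging Face cache, e.g. ~/.cache/huggingface).
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Optionally also save it to an explicit local directory and point the code there
# (the directory name is a placeholder).
tokenizer.save_pretrained("./checkpoints/t5-base")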

inference without speech

Hi!
May I know how to do inference without speech?

I've set --no_speech, but then the output is [].
And when I do inference on the ActivityNet and Charades datasets, the output looks like it only considers the speech features.

Thank you!

test on vitt

Thank you for your work. I would like to know whether the file clipvit14.pth contains the features of the ViTT test set. When testing, I encountered this error:

File "/public/home/code/VidChapters-main/VidChapters-main/dataset/dvc_dataset.py", line 65, in _get_video
    assert video_id in self.features, video_id
AssertionError: 0_-0zE4NDkuYo
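
A quick way to check this yourself, assuming clipvit14.pth stores a dict mapping video ids to feature tensors and is loadable with torch.load (an assumption about the file format):

import torch

# Load the feature file and check whether the failing ViTT test video is present.
features = torch.load("clipvit14.pth", map_location="cpu")
print(len(features), "videos in the feature file")
print("0_-0zE4NDkuYo" in features)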

Hallucinations in chapter title generation after some time

Hello, first of all, thank you for an awesome project and for making the source code and models available. I really appreciate it.

I was able to run demo_asr.py and demo_vid2seq.py. I tested the HowTo100M + VidChapters-7M + YouCook2 and the VidChapters-7M models with them.

On the few videos I tested, VidChapters-7M seems to work better. However, after some time, it starts to repeat the same chapter title for the rest of the video. I know it's not using ChatGPT behind the scenes, and the term "hallucination" in the title is not a perfect fit.

Here is an example:

input video: https://www.youtube.com/watch?v=jARPzYkjp3g

output:

[
    {'sentence': 'The four pillars of happiness.', 'timestamp': [0.0, 14.901636363636365]}, 
    {'sentence': 'The four pillars of happiness.', 'timestamp': [14.901636363636365, 69.5409696969697]}, 
    {'sentence': 'Belonging.', 'timestamp': [69.5409696969697, 94.37703030303031]}, 
    {'sentence': 'Purpose.', 'timestamp': [94.37703030303031, 139.0819393939394]}, 
    {'sentence': 'Religion.', 'timestamp': [139.0819393939394, 173.85242424242423]}, 
    {'sentence': 'Fast casual restaurants.', 'timestamp': [173.85242424242423, 213.59012121212123]}, 
    {'sentence': 'The sacramento.', 'timestamp': [213.59012121212123, 243.39339393939395]}, 
    {'sentence': 'The sacramento.', 'timestamp': [243.39339393939395, 268.22945454545453]}, 
    {'sentence': 'The sacramento.', 'timestamp': [268.22945454545453, 307.9671515151515]}, 
    {'sentence': 'The sacramento.', 'timestamp': [307.9671515151515, 342.73763636363634]}, 
    {'sentence': 'The sacramento.', 'timestamp': [342.73763636363634, 377.50812121212124]}, 
    {'sentence': 'The sacramento.', 'timestamp': [377.50812121212124, 407.31139393939395]}, 
    {'sentence': 'The sacramento.', 'timestamp': [407.31139393939395, 437.11466666666666]}, 
    {'sentence': 'The sacramento.', 'timestamp': [437.11466666666666, 461.9507272727273]}, 
    {'sentence': 'The sacramento.', 'timestamp': [461.9507272727273, 471.88515151515156]}, 
    {'sentence': 'The sacramento.', 'timestamp': [471.88515151515156, 491.754]}
]

Another example:
Input: https://www.youtube.com/watch?v=PRpr0_Iz4dI

Output:

[
	{'sentence': 'Intro.', 'timestamp': [0.0, 7.774242424242424]}, 
	{'sentence': 'Why we hate the new logos.', 'timestamp': [7.774242424242424, 116.61363636363636]}, 
	{'sentence': 'Why people are missing the bigger picture.', 'timestamp': [116.61363636363636, 174.92045454545453]}, 
	{'sentence': "Why people aren't logical.", 'timestamp': [174.92045454545453, 244.88863636363635]}, 
	{'sentence': "Why people aren't logical.", 'timestamp': [244.88863636363635, 384.82499999999993]}
]

Do you know if I can do something to improve it a little?
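
Not an official fix, but one generic mitigation on the Hugging Face side is to penalize repeated n-grams during generation. The toy sketch below uses plain t5-base only to illustrate the flags; whether and where demo_vid2seq.py or model/vid2seq.py exposes these generate() arguments is an assumption that would need checking.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Toy illustration of generation flags that discourage repeated outputs.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer(
    "summarize: the video discusses happiness, belonging, purpose and religion.",
    return_tensors="pt",
)
with torch.no_grad():
    out = model.generate(
        **inputs,
        num_beams=4,
        no_repeat_ngram_size=3,  # forbid any repeated 3-gram in the output
        repetition_penalty=1.2,  # mildly penalize already-generated tokens
        max_new_tokens=32,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))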

Run demo for videos without speech

How can I run demo_asr.py for videos without speech?
I tried it by pickling a JSON file as:
{ "segments": [], "word_segments": [] }
But I got a runtime error when running demo_vid2seq.py with it:

load Vid2Seq model
loading visual backbone
extracting visual features
visual features extracted
load ASR
ASR to tokens
Traceback (most recent call last):
File "demo_vid2seq.py", line 150, in <module>
    input_tokens = torch.cat(input_tokens, 0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
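
A possible workaround is to make the concatenation tolerate an empty segment list; whether the rest of demo_vid2seq.py then handles an empty speech tensor correctly is not verified here, so this is only a sketch:

import torch

def cat_or_empty(tensors):
    """Concatenate a list of 1-D LongTensors, returning an empty tensor when the
    ASR pickle contains no segments (the case that makes torch.cat raise)."""
    if tensors:
        return torch.cat(tensors, 0)
    return torch.zeros(0, dtype=torch.long)

# The no-speech case yields an empty tensor instead of a RuntimeError.
print(cat_or_empty([]))                      # tensor([], dtype=torch.int64)
print(cat_or_empty([torch.tensor([1, 2])]))  # tensor([1, 2])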

Code issue

We ran demo_vid2seq.py.
In vid2seq.py, line 41, we find:
self.t5_model.resize_token_embeddings(len(tokenizer) - num_bins) # remove the weights of the 28 tokens that are not used (32128 vs 32100 in the tokenizer)
self.t5_model.resize_token_embeddings(len(tokenizer)) # add time tokens

These two lines of code seem to do the same thing. We commented out one line, and an error occurred:
File "demo_vid2seq.py", line 170, in
temperature=1)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/vid2seq.py", line 163, in generate
num_return_sequences=num_captions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 1534, in generate
**model_kwargs,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/transformers/generation/utils.py", line 2814, in beam_search
output_hidden_states=output_hidden_states,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1698, in forward
return_dict=return_dict,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 1082, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 710, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 616, in forward
output_attentions=output_attentions,
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/workspace/ai-story/daikun.zhang/VidChapters/model/modeling_t5.py", line 528, in forward
query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/VidChapters/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

May I ask how we should resolve it?
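
For what it's worth, the in-code comments quoted above suggest the two calls are not identical: len(tokenizer) already includes the time tokens added on top of the T5 vocabulary, so the first call shrinks the embedding matrix to the 32100 base tokens (dropping the 28 unused rows of the 32128-row checkpoint), and the second grows it again to make room for the time tokens. The sketch below illustrates this with plain t5-base; the num_bins value and the exact time-token strings are assumptions:

from transformers import T5ForConditionalGeneration, T5Tokenizer

num_bins = 100  # assumed value; the demo takes it from args.num_bins

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# Illustrative time tokens; the exact strings used by _get_tokenizer may differ.
tokenizer.add_tokens([f"<time={i}>" for i in range(num_bins)])
model = T5ForConditionalGeneration.from_pretrained("t5-base")

print(model.get_input_embeddings().weight.shape[0])       # 32128 rows in the checkpoint
model.resize_token_embeddings(len(tokenizer) - num_bins)  # shrink to the 32100 base tokens
print(model.get_input_embeddings().weight.shape[0])       # 32100
model.resize_token_embeddings(len(tokenizer))             # grow again to add the time tokens
print(model.get_input_embeddings().weight.shape[0])       # 32200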
