
Video-LLaVA's Introduction

If you like our project, please give us a star ⭐ on GitHub for latest update.

💡 I also have other video-language projects that may interest you ✨.

Open-Sora-Plan

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

📰 News

  • [2024.07.27] 🔥🔥🔥 CinePile fine-tunes Video-LLaVA for theme exploration, narrative analysis, and character dynamics, addressing these often-overlooked areas in their benchmark. Thanks to @micuelll.
  • [2024.05.15] 🤝🤝🤝 Thanks to the generous contributions of @zucchini-nlp, Video-LLaVA is now available in the Transformers library! More details here.
  • [2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms the dense 7B model.
  • [2024.01.17] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024!
  • [2024.01.16] 🔥🔥🔥 We reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh.
  • [2023.11.30] 🤝 Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
  • [2023.11.23] We are training a new and more powerful model.
  • [2023.11.21] 🤝 Check out the Replicate demo, created by @nateraw, who has generously supported our research!
  • [2023.11.20] 🤗 The Hugging Face demo and all code & datasets are available now! Feel free to watch 👀 this repository for the latest updates.

😮 Highlights

Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.

💡 Simple baseline, learning united visual representation by alignment before projection

  • By binding unified visual representations to the language feature space, we enable an LLM to perform visual reasoning on both images and videos simultaneously (a conceptual sketch follows these highlights).

🔥 High performance, complementary learning with video and image

  • Extensive experiments demonstrate that the two modalities complement each other, with Video-LLaVA significantly outperforming models designed specifically for either images or videos.
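
To make the idea concrete, here is a minimal, illustrative sketch of "alignment before projection" (not the actual Video-LLaVA implementation; the class name and tensor shapes are our own, and the two-layer GELU projector mirrors the mlp2x_gelu setting used in the training scripts). Because the image and video towers are pre-aligned by LanguageBind into a shared feature space, a single projector can map tokens from either modality into the LLM's embedding space.

import torch
import torch.nn as nn

class UnifiedVisualProjector(nn.Module):
    """One projector shared by image and video tokens (illustrative only)."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vis_dim) from either the image or the video tower
        return self.proj(visual_tokens)

projector = UnifiedVisualProjector()
image_tokens = torch.randn(1, 256, 1024)      # tokens from a single image
video_tokens = torch.randn(1, 8 * 256, 1024)  # tokens from 8 video frames
print(projector(image_tokens).shape)          # torch.Size([1, 256, 4096])
print(projector(video_tokens).shape)          # torch.Size([1, 2048, 4096])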

🤗 Demo

Gradio Web UI

We highly recommend trying our web demo with the following command; it incorporates all features currently supported by Video-LLaVA. We also provide an online demo on Hugging Face Spaces.

python -m videollava.serve.gradio_web_server

CLI Inference

CUDA_VISIBLE_DEVICES=0 python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/video.mp4" --load-4bit

CUDA_VISIBLE_DEVICES=0 python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/image.jpg" --load-4bit

🚀 Main Results

Image understanding

Video understanding

🛠️ Requirements and Installation

  • Python >= 3.10
  • PyTorch == 2.0.1
  • CUDA Version >= 11.7
  • Install required packages:
git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
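
Optionally, run a quick sanity check of the environment (an illustrative command, not part of the official instructions) to confirm that PyTorch sees your GPU and that the video-decoding packages import cleanly:

python -c "import torch, decord, cv2; print(torch.__version__, torch.cuda.is_available())"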

🤖 API

Warning

🚨 Upgrade transformers for quick access to the integrated model.
pip install -U transformers

If you also need PyAV for video decoding, install it with:

python -m pip install av

import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    # Decode only the frames whose index appears in `indices` and return them
    # as a (num_frames, height, width, 3) uint8 array in RGB order.
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = "YOUR-LOCAL-VIDEO-PATH"
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
>>> 'USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on the bed and reading a book, which is an unusual and amusing sight.'
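
If GPU memory is tight, you can also load the Hugging Face checkpoint in 4-bit, mirroring the CLI's --load-4bit flag. The snippet below is a sketch that assumes bitsandbytes is installed; adjust the device handling to your setup.

import torch
from transformers import BitsAndBytesConfig, VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Quantize weights to 4-bit at load time to reduce GPU memory usage.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

# Remember to move the processed inputs onto the model's device before generating, e.g.:
# inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)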
Outdated (original codebase)

We open-source all code. If you want to load the original model (e.g. LanguageBind/Video-LLaVA-7B) locally, you can use the following code snippets.

Inference for image

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'videollava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    image_processor = processor['image']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    if type(image_tensor) is list:
        tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()

Inference for video

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    video = 'videollava/serve/examples/sample_demo_1.mp4'
    inp = 'Why is this video funny?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    video_processor = processor['video']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)

    print(f"{roles[1]}: {inp}")
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.1,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()

🗝️ Training & Validating

The training & validating instruction is in TRAIN_AND_VALIDATE.md.

👍 Acknowledgement

  • LLaVA: the codebase we build upon, an efficient large language and vision assistant.
  • Video-ChatGPT: thanks for contributing the evaluation code and dataset.

🙌 Related Projects

  • LanguageBind: an open-source, language-based retrieval framework spanning five modalities.
  • Chat-UniVi: a framework that enables the model to efficiently utilize a limited number of visual tokens.

🔒 License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}
@article{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
  journal={arXiv preprint arXiv:2310.01852},
  year={2023}
}

✨ Star History


🤝 Contributors

binzhu-ece, eltociear, jessytsui, linb203, nahidalam, nateraw


Video-LLaVA's Issues

Video checkpoint is broken

I've run the requirements installation, but the model spits out random words when I run the video examples from Gradio.
Image captioning works fine.
The Hugging Face demo for video captioning also works fine.

Some questions about the preprocessed data

From analyzing your training script, it seems that the preprocessed visual features you stored on Baidu's cloud storage have already passed through the projection layer. Is that correct? In other words, these preprocessed visual features are not the pixel-level data prior to extraction by the LanguageBind visual feature extractor mentioned in your paper. I am not sure whether my understanding is accurate, so please correct me if I'm wrong. Thank you.

Questions about the reproduction of the TGIF-QA

We used the same data as in the repository and the officially provided training weights, and evaluated with GPT-3.5, but only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task. On the other three QA tasks, we were able to obtain metrics similar to those in the paper. What could be the cause of this?

Repeated output.

Hi, sometimes the model generates repeated text until the maximum length is reached. How should I deal with this issue? Thanks.

A100 * 4 OOM

With 4 × A100 GPUs I still get OOM even with batch size = 1, which doesn't seem reasonable. What could be going on?

Here is my bash script; could someone please take a look? Thanks.

DATA_ROOT="llava_all_image_video"
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path pretrain_models/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./train_anno.json \
    --video_folder ${DATA_ROOT} \
    --image_folder ${DATA_ROOT} \
    --X "Video" "Image" \
    --video_tower pretrain_models/LanguageBind_Video_merge \
    --image_tower pretrain_models/LanguageBind_Image \
    --pretrain_mm_mlp_adapter pretrain_models/Video-LLaVA-Pretrain-7B/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/Video-LLaVA-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.30 GiB (GPU 0; 39.59 GiB total capacity; 32.86 GiB already allocated; 4.76 GiB free; 32.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[Q] Local demo: multi-turn dialogue problem

In multi-turn dialogue, it appears that the video is being passed multiple times, leading to GPU out-of-memory crashes. How can this be resolved? Does it support multiple turns of dialogue with the same video?

Cannot reproduce Zero-shot Video-QA (MSVD)

As I mentioned in a previous issue, I am unable to reproduce the MSVD Video-QA results; mine differ significantly from the results reported in the paper, shown below.

[screenshot of the reported results from the paper]

Although you suggested referring to this issue, it is NOT relevant to my case because I tested with your codebase (and I checked it more than twice).

I am sharing my inference results on MSVD Video-QA (link), predicted with the Video-LLaVA-7B checkpoint you provided.

Could you please share your predicted results on MSVD Video-QA as well?

Cannot reproduce LLaVA-Bench results.

I tried running the evaluation code on your model checkpoint, but I cannot reproduce the results noted in your paper.

Can you help me with this?

Below are the results that I got.

Video-LLaVA-7B
all 56.2 86.2 48.4
llava_bench_complex 71.8 85.4 61.2
llava_bench_conv 45.6 85.9 39.1
llava_bench_detail 39.8 88.0 35.0

Below are the results from the paper.

[screenshot of the results reported in the paper]

How to finetune Video-LLaVA-7B on one specific task.

I notice that all fine-tuning scripts (finetune.sh, finetune_lora.sh, finetune_sqa.sh, and so on) contain the line:
"IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!"
What does this line mean? Does it mean that I cannot simply run finetune.sh to fine-tune Video-LLaVA-7B?
Moreover, if I want to fine-tune the released Video-LLaVA-7B directly, could you provide a short guide? I'd be glad to discuss this with you.

Some issues with pretrain

Thank you very much for your excellent work.
When I was using the pretraining script, I found that the --video_tower LanguageBind/LanguageBind_Video_merge and --image_tower LanguageBind/LanguageBind_Image repositories on the Hugging Face Hub do not contain the configuration file for the image processor, which leads to errors.
[screenshot of the error]
How to solve this problem?

How to batch evaluate in inference?

Hi, how can I make the inference code evaluate videos in batches?
I naively concatenated the tensors along dimension 0 and got this error.

[screenshot of the error]

Can you help me figure it out? Thanks.

mistake in joint understanding

[screenshot of the conversation] I tried to have the model understand both a picture and a video, but it made a mistake: what the video shows is obviously not the flag. In my view, after the shared projection layer the two vectors are very close in the shared feature space, so the LLM thinks they are the same. Is this a bottleneck of this model, or does it just need some instruction tuning on the backend? If you have any ideas or improvements, please let me know; I would be very grateful. I am very interested in this topic.

Instruction tuning on A100 (40G)?

Dear author,

Thanks for releasing the amazing code. I'm trying to train the model using A100 (40G).

I loaded the pre-trained mm_projector.bin and ran the finetune.sh script with video data.
However, even after decreasing per_device_train_batch_size to 1, I still get CUDA out of memory.
I noticed that the default setting is 16.
So I wonder whether something is wrong.

Looking forward to hearing back from you.

Thanks!

Resuming training from a checkpoint fails

deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)

File "/home/work/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/deepspeed.py", line 389, in deepspeed_load_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")
ValueError: Can't find a valid checkpoint at ./checkpoints/Video-LLaVA-Pretrain-7B/checkpoint-1500
This path contains two files: config.json and mm_projector.bin.

About training resource consumption.

Dear authors,

Thanks for your great work, which is of great value to the community.

I would like to know how many GPUs (and of what type) you used to train the network, and roughly how long training took.

I am looking forward to hearing back from you.

Best

licensing question

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.

The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

Could you confirm what "the service" refers to? Is it the Gradio UI, or is the model itself under the Apache 2.0 license, which allows commercial use as far as I know?

Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing

I followed the approach in your code for loading and running the model locally and downloaded Video-LLaVA-7B, LanguageBind_Image, and LanguageBind_Video, but I get the following error:
Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'
Another piece of code also reports an error; the final output is:
LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'

This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint

Thank you very much for your great work. When I run it locally, I get this message:

  • This IS expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

But different images produce the same output. Why?

What does "one sample is one modal" mean?

Regarding lines 364-367 in Video-LLaVA/llava/model/llava_arch.py:
In this comment, what does "one sample is one modal" mean? Does it correspond to one sample in a batch?
cur_X_idx += 1 ############## Note: this is skipped here. If one sample is one modal, then one all-zero modal is skipped; if one sample corresponds to multiple modals, the training logic here is wrong!!! ###### But it does not seem to affect inference on a single sample. ###### One text corresponding to both a video and an image goes straight to the code below. With only one text, it doesn't matter whether None or one/two all-zero tensors are passed in, since there is no next data item anyway. continue

multiple questions for multiple videos

Hello! When I try to implement multiple questions for multiple videos, there are some errors:

It seems that the current prompt doesn't contain the previous answer, and there's an error in llava/model/llava_arch.py.
[screenshots of the errors]

Would you provide an inference demo for multiple questions and multiple videos? Thanks!

Also, here is my multi-question, multi-video inference demo; please point out the problem.

while True:
    try:
        video = input('Input video path: ')
    except:
        print('Input error, try again.')
        continue
    else:
        if video == 'exit':
            print('Goodbye!')
            break
        if not os.path.exists(video):
            print('{} not exist, try again.'.format(video))
            continue

    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)


    while True:
        try:
            inp = input('User: ')
        except:
            print('Input error, try again.')
            continue
        else:
            if inp == 'exit':
                break
            else:
                # print(f"{roles[1]}: {inp}")
                inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + inp
                conv.append_message(conv.roles[0], inp)
                conv.append_message(conv.roles[1], None)

                prompt = conv.get_prompt() # For multiple questions on the same video, the prompt does not include the previous answer; when switching videos, the earlier questions are not cleared
                input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX['VIDEO'], return_tensors='pt').unsqueeze(0).cuda()
                stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
                keywords = [stop_str]
                stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

                with torch.inference_mode():
                    output_ids = model.generate(
                        input_ids,
                        images=[tensor, key],
                        do_sample=True,
                        temperature=0.1,
                        max_new_tokens=1024,
                        use_cache=True,
                        stopping_criteria=[stopping_criteria])

                outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
                print(outputs)

bash scripts/v1_5/eval/eval_qa_msrvtt.sh

When running "bash scripts/v1_5/eval/eval_qa_msrvtt.sh",
it reports the following errors.
Can anyone tell me how to fix this?

completed_files: 0
incomplete_files: 72821
Error processing file 'v_video9632_17': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6975cac190>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))
Error processing file 'v_video7010_0': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f68e4f5fac0>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))

Error of multi-turn conversations

Thank you for your great work! I tried some of the sample videos provided, but the model doesn't seem to support multi-turn conversations. The error is triggered when I try to send the second question. Does it only support a single round at this stage?

Why are there pure NLP conversations in videochatgpt_llavaimage_tune.json?

For example, a dict like the one below. Is it included just to fine-tune the LLM?

{'id': '740000',
'model': '',
'conversations': [{'from': 'human',
'value': '将下面的生活场景润色成动画片的对话,句数不少于8句,要求内容有教育意义\n家庭晚会\n爸爸:今晚我们举办家庭晚会,大家有什么节目表演吗?\n妈妈:我会唱一首歌,希望大家喜欢。\n哥哥:我会弹吉他,可以伴奏你唱歌。\n妹妹:我想跳一支舞,可以吗?'},
{'from': 'gpt',
'value': '(画面:家庭客厅,爸爸、妈妈、哥哥、妹妹四人围坐在沙发上)\n\n爸爸:今晚我们一家人一起举办家庭晚会,大家有什么节目表演吗?\n\n妈妈:我会唱一首歌,希望大家喜欢。 (妈妈微笑着说)\n\n哥哥:我会弹吉他,可以伴奏你唱歌。(哥哥翻起吉他,摆了一个酷酷的姿势)\n\n妹妹:我想跳一支舞,可以吗?(妹妹脸上带着期待的表情)\n\n爸爸:当然可以,你想跳哪一种舞呢?\n\n妹妹:我想跳芭蕾舞,因为我一直很喜欢看芭蕾舞演出。\n\n哥哥:那我们可以一起编排一个小节目,让妹妹跳芭蕾舞,我伴奏,妈妈唱歌。\n\n妹妹:太好了,我要开始练习了。(妹妹兴奋地跑出房间)\n\n爸爸:看到你们这么积极,我也想表演一个小魔术,让大家惊喜。\n\n妈妈:晚会不仅是娱乐,也是学习和交流的机会,我们可以通过互相表演,展示自己的才艺,同时也能增进家庭的感情,提高家人之间的沟通和合作能力。\n\n(画面:一家四口团结一致,笑容满面,享受着愉快的家庭晚会)\n\n教育意义:家庭晚会是一种增强家庭凝聚力、提高沟通能力和合作能力的好方法。通过家庭晚会,家人们可以分享自己的才艺、交流心得,加深彼此之间的了解和感情。在这个过程中,家长可以向孩子展示自己的兴趣和爱好,同时也可以从孩子身上学到新鲜事物,增加彼此之间的互动和交流。'}]

The number of iterations is half what it should be

Thanks for your great work. I ran into an issue: the number of iterations is half of what it should be.

The dataset size is 665K + 98K; num_train_epochs = 1.0; per_device_train_batch_size = 16; and there are 8 GPUs.
The number of iterations should be (665K + 98K) * 1.0 / 16 / 8 ≈ 5.96K,
but it shows 2.48K iterations.

Looking forward to your reply, thanks in advance

always answer yes?

When I ask, "is there any <some-object> in the video?", the model always answers "yes".

Inference endpoint on Hugging Face

The inference endpoint on Hugging Face isn't working with the Video-LLaVA model.
Here's the stack trace:

2023/12/05 17:27:45 ~ 2023-12-05 22:27:45,803 | INFO | No custom pipeline found at /repository/handler.py
2023/12/05 17:27:45 ~ 2023-12-05 22:27:45,803 | INFO | Initializing model from directory:/repository
2023/12/05 17:27:45 ~ INFO | Using device GPU
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/handler.py", line 17, in init
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
ch28b 2023-12-05T22:27:45.805Z
2023/12/05 17:27:45 ~ await handler()
2023/12/05 17:27:45 ~ await self._router.startup()
2023/12/05 17:27:45 ~ Application startup failed. Exiting.
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/init.py", line 705, in pipeline
2023/12/05 17:27:45 ~ self.pipeline = get_pipeline(model_dir=model_dir, task=task)
2023/12/05 17:27:45 ~ async with self.lifespan_context(app) as maybe_state:
2023/12/05 17:27:45 ~ File "/app/webservice_starlette.py", line 57, in some_startup_task
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 682, in startup
2023/12/05 17:27:45 ~ hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2023/12/05 17:27:45 ~ return HuggingFaceHandler(model_dir=model_dir, task=task)
2023/12/05 17:27:45 ~ inference_handler = get_inference_handler_either_custom_or_default_handler(HF_MODEL_DIR, task=HF_TASK)
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 710, in getitem
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/handler.py", line 45, in get_inference_handler_either_custom_or_default_handler
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 584, in aenter
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/utils.py", line 261, in get_pipeline
2023/12/05 17:27:45 ~ raise KeyError(key)
2023/12/05 17:27:45 ~ config_class = CONFIG_MAPPING[config_dict["model_type"]]
2023/12/05 17:27:45 ~ config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
2023/12/05 17:27:45 ~ Traceback (most recent call last):
2023/12/05 17:27:45 ~ KeyError: 'llava'
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 998, in from_pretrained

Cannot find preprocessor_config.json in LanguageBind_Image

When I tried to deploy the project locally, I couldn't connect to Hugging Face, so I pre-downloaded LanguageBind_Image, Video-LLaVA-7B, and LanguageBind_Video_Image, set model_path = 'local_path/LanguageBind/Video-LLaVA-7B/' in app.py,
and changed mm_image_tower and mm_video_tower in the Video-LLaVA-7B config.json to local paths. But eventually I got this error at runtime:
local_path/LanguageBind/LanguageBind_Image does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//local_path/LanguageBind/LanguageBind_Image/main' for available files
I think this error means that preprocessor_config.json was not found in LanguageBind_Image. However, I did not find this file in the corresponding Hugging Face repository either. Did I make a mistake?

Unexpected error from cudaGetDeviceCount()

Hello! I encountered this error while preparing the environment on 3090 GPUs. Is it due to a GPU driver version that is too old? My driver version is 470.141.03 with CUDA 11.7.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
