
Video-LLaVA's Introduction

If you like our project, please give us a star ⭐ on GitHub for latest update.

💡 I also have other video-language projects that may interest you ✨.

Open-Sora-Plan

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

📰 News

  • [2024.07.27] 🔥🔥🔥 CinePile fine-tunes Video-LLaVA for theme exploration, narrative analysis, and character dynamics, addressing these often-overlooked areas in their benchmark. Thanks to @micuelll.
  • [2024.05.15] 🤝🤝🤝 Thanks to the generous contributions of @zucchini-nlp, Video-LLaVA is now available in the Transformers library! More details here.
  • [2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms the dense 7B model.
  • [2024.01.17] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024!
  • [2024.01.16] 🔥🔥🔥 We reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh.
  • [2023.11.30] 🤝 Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
  • [2023.11.23] We are training a new and more powerful model.
  • [2023.11.21] 🤝 Check out the Replicate demo, created by @nateraw, who has generously supported our research!
  • [2023.11.20] 🤗 The Hugging Face demo and all code & datasets are available now! Feel free to watch 👀 this repository for the latest updates.

😮 Highlights

Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.

💡 Simple baseline, learning united visual representation by alignment before projection

  • By binding unified visual representations to the language feature space, we enable an LLM to perform visual reasoning on both images and videos simultaneously (a conceptual sketch follows these highlights).

🔥 High performance, complementary learning with video and image

  • Extensive experiments demonstrate that the two modalities complement each other, with Video-LLaVA significantly outperforming models designed specifically for either images or videos.
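
To make the idea concrete, here is a minimal, illustrative sketch of "alignment before projection" (not the actual Video-LLaVA implementation; the class name and tensor shapes are our own, and the two-layer GELU projector mirrors the mlp2x_gelu setting used in the training scripts). Because the image and video towers are pre-aligned by LanguageBind into a shared feature space, a single projector can map tokens from either modality into the LLM's embedding space.

import torch
import torch.nn as nn

class UnifiedVisualProjector(nn.Module):
    """One projector shared by image and video tokens (illustrative only)."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vis_dim) from either the image or the video tower
        return self.proj(visual_tokens)

projector = UnifiedVisualProjector()
image_tokens = torch.randn(1, 256, 1024)      # tokens from a single image
video_tokens = torch.randn(1, 8 * 256, 1024)  # tokens from 8 video frames
print(projector(image_tokens).shape)          # torch.Size([1, 256, 4096])
print(projector(video_tokens).shape)          # torch.Size([1, 2048, 4096])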

🤗 Demo

Gradio Web UI

We highly recommend trying our web demo with the following command; it incorporates all features currently supported by Video-LLaVA. We also provide an online demo on Hugging Face Spaces.

python -m videollava.serve.gradio_web_server

CLI Inference

CUDA_VISIBLE_DEVICES=0 python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/video.mp4" --load-4bit

CUDA_VISIBLE_DEVICES=0 python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/image.jpg" --load-4bit

🚀 Main Results

Image understanding

Video understanding

🛠️ Requirements and Installation

  • Python >= 3.10
  • PyTorch == 2.0.1
  • CUDA Version >= 11.7
  • Install required packages:
git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
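
Optionally, run a quick sanity check of the environment (an illustrative command, not part of the official instructions) to confirm that PyTorch sees your GPU and that the video-decoding packages import cleanly:

python -c "import torch, decord, cv2; print(torch.__version__, torch.cuda.is_available())"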

🤖 API

Warning

🚨 Upgrade transformers for quick access to the integrated model.
pip install -U transformers

If you also need PyAV for video decoding, install it with:

python -m pip install av

import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    # Decode only the frames whose index appears in `indices` and return them
    # as a (num_frames, height, width, 3) uint8 array in RGB order.
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = "YOUR-LOCAL-VIDEO-PATH"
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
>>> 'USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on the bed and reading a book, which is an unusual and amusing sight.'
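
If GPU memory is tight, you can also load the Hugging Face checkpoint in 4-bit, mirroring the CLI's --load-4bit flag. The snippet below is a sketch that assumes bitsandbytes is installed; adjust the device handling to your setup.

import torch
from transformers import BitsAndBytesConfig, VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Quantize weights to 4-bit at load time to reduce GPU memory usage.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

# Remember to move the processed inputs onto the model's device before generating, e.g.:
# inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)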
Outdated (original codebase)

We open-source all code. If you want to load the original model (e.g. LanguageBind/Video-LLaVA-7B) locally, you can use the following code snippets.

Inference for image

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'videollava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    image_processor = processor['image']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    if type(image_tensor) is list:
        tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()

Inference for video

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    video = 'videollava/serve/examples/sample_demo_1.mp4'
    inp = 'Why is this video funny?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    video_processor = processor['video']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)

    print(f"{roles[1]}: {inp}")
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.1,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()

🗝️ Training & Validating

The training & validating instruction is in TRAIN_AND_VALIDATE.md.

👍 Acknowledgement

  • LLaVA: the codebase we build upon, an efficient large language and vision assistant.
  • Video-ChatGPT: thanks for contributing the evaluation code and dataset.

🙌 Related Projects

  • LanguageBind: an open-source, language-based retrieval framework spanning five modalities.
  • Chat-UniVi: a framework that enables the model to efficiently utilize a limited number of visual tokens.

🔒 License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}
@article{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
  journal={arXiv preprint arXiv:2310.01852},
  year={2023}
}

✨ Star History


🤝 Contributors

binzhu-ece, eltociear, jessytsui, linb203, nahidalam, nateraw


Video-LLaVA's Issues

Video checkpoint is broken

I've run the requirements installation, but the model spits out random words when I run the video examples from Gradio.
Image captioning works fine.
The Hugging Face demo for video captioning also works fine.

Some questions about the preprocessed data

From analyzing your training script, it seems that the preprocessed visual features you stored on Baidu's cloud storage have already passed through the projection layer. Is that correct? In other words, these preprocessed visual features are not the pixel-level data prior to extraction by the LanguageBind visual feature extractor mentioned in your paper. I am not sure whether my understanding is accurate, so please correct me if I'm wrong. Thank you.

Questions about the reproduction of the TGIF-QA

We used the same data as in the repository and the officially provided training weights, and evaluated with GPT-3.5, but only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task. On the other three QA tasks, we were able to obtain metrics similar to those in the paper. What could be the cause of this?

Repeated output.

Hi, sometimes the model generates repeated text until the maximum length is reached. How should I deal with this issue? Thanks.

A100 * 4 OOM

With 4 × A100 GPUs I still get OOM even with batch size = 1, which doesn't seem reasonable. What could be going on?

Here is my bash script; could someone please take a look? Thanks.

DATA_ROOT="llava_all_image_video"
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path pretrain_models/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./train_anno.json \
    --video_folder ${DATA_ROOT} \
    --image_folder ${DATA_ROOT} \
    --X "Video" "Image" \
    --video_tower pretrain_models/LanguageBind_Video_merge \
    --image_tower pretrain_models/LanguageBind_Image \
    --pretrain_mm_mlp_adapter pretrain_models/Video-LLaVA-Pretrain-7B/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/Video-LLaVA-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.30 GiB (GPU 0; 39.59 GiB total capacity; 32.86 GiB already allocated; 4.76 GiB free; 32.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[Q] Local demo: multi-turn dialogue problem

In multi-turn dialogue, it appears that the video is being passed multiple times, leading to GPU out-of-memory crashes. How can this be resolved? Does it support multiple turns of dialogue with the same video?

Cannot reproduce Zero-shot Video-QA (MSVD)

As I mentioned in a previous issue, I am unable to reproduce the MSVD Video-QA results; mine differ significantly from the results reported in the paper, shown below.

[screenshot of the reported results from the paper]

Although you suggested referring to this issue, it is NOT relevant to my case because I tested with your codebase (and I checked it more than twice).

I am sharing my inference results on MSVD Video-QA (link), predicted with the Video-LLaVA-7B checkpoint you provided.

Could you please share your predicted results on MSVD Video-QA as well?

Cannot reproduce LLaVA-Bench results.

I tried running the evaluation code on your model checkpoint, but I cannot reproduce the results noted in your paper.

Can you help me with this?

Below are the results that I got.

Video-LLaVA-7B
all 56.2 86.2 48.4
llava_bench_complex 71.8 85.4 61.2
llava_bench_conv 45.6 85.9 39.1
llava_bench_detail 39.8 88.0 35.0

Below are the results from the paper.

[screenshot of the results reported in the paper]

How to finetune Video-LLaVA-7B on one specific task.

I notice that all fine-tuning scripts (finetune.sh, finetune_lora.sh, finetune_sqa.sh, and so on) contain the line:
"IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!"
What does this line mean? Does it mean that I cannot simply run finetune.sh to fine-tune Video-LLaVA-7B?
Moreover, if I want to fine-tune the released Video-LLaVA-7B directly, could you provide a short guide? I'd be glad to discuss this with you.

Some issues with pretrain

Thank you very much for your excellent work.
When I was using the pretraining script, I found that the --video_tower LanguageBind/LanguageBind_Video_merge and --image_tower LanguageBind/LanguageBind_Image repositories on the Hugging Face Hub do not contain the configuration file for the image processor, which leads to errors.
[screenshot of the error]
How to solve this problem?

How to batch evaluate in inference?

Hi, how can I make the inference code evaluate videos in batches?
I naively concatenated the tensors along dimension 0 and got this error.

[screenshot of the error]

Can you help me figure it out? Thanks.

mistake in joint understanding

[screenshot of the conversation] I tried to have the model understand both a picture and a video, but it made a mistake: what the video shows is obviously not the flag. In my view, after the shared projection layer the two vectors are very close in the shared feature space, so the LLM thinks they are the same. Is this a bottleneck of this model, or does it just need some instruction tuning on the backend? If you have any ideas or improvements, please let me know; I would be very grateful. I am very interested in this topic.

Instruction tuning on A100 (40G)?

Dear author,

Thanks for releasing the amazing code. I'm trying to train the model using A100 (40G).

I loaded the pre-trained mm_projector.bin and ran the finetune.sh script with video data.
However, even after decreasing per_device_train_batch_size to 1, I still get CUDA out of memory.
I noticed that the default setting is 16.
So I wonder whether something is wrong.

Looking forward to hearing back from you.

Thanks!

Resuming training from a checkpoint fails

deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)

File "/home/work/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/deepspeed.py", line 389, in deepspeed_load_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")
ValueError: Can't find a valid checkpoint at ./checkpoints/Video-LLaVA-Pretrain-7B/checkpoint-1500
This path contains two files: config.json and mm_projector.bin.

About training resource consumption.

Dear authors,

Thanks for your great work, which is of great value to the community.

I would like to know how many GPUs (and of what type) you used to train the network, and roughly how long training took.

I am looking forward to hearing back from you.

Best

licensing question

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.

The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

Could you confirm what "the service" refers to? Is it the Gradio UI, or is the model itself under the Apache 2.0 license, which allows commercial use as far as I know?

Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing

I followed the approach in your code for loading and running the model locally and downloaded Video-LLaVA-7B, LanguageBind_Image, and LanguageBind_Video, but I get the following error:
Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'
Another piece of code also reports an error; the final output is:
LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'

This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint

Thank you very much for your great work. When I run it locally, I get this message:

  • This IS expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

But different images produce the same output. Why?

What does "one sample is one modal" mean?

Regarding lines 364-367 in Video-LLaVA/llava/model/llava_arch.py:
In this comment, what does "one sample is one modal" mean? Does it correspond to one sample in a batch?
cur_X_idx += 1 ############## Note: this is skipped here. If one sample is one modal, then one all-zero modal is skipped; if one sample corresponds to multiple modals, the training logic here is wrong!!! ###### But it does not seem to affect inference on a single sample. ###### One text corresponding to both a video and an image goes straight to the code below. With only one text, it doesn't matter whether None or one/two all-zero tensors are passed in, since there is no next data item anyway. continue

multiple questions for multiple videos

Hello! When I try to implement multiple questions for multiple videos, there are some errors:

It seems that the current prompt doesn't contain the previous answer, and there's an error in llava/model/llava_arch.py.
[screenshots of the errors]

Would you provide an inference demo for multiple questions and multiple videos? Thanks!

Also, here is my multi-question, multi-video inference demo; please point out the problem.

while True:
    try:
        video = input('Input video path: ')
    except:
        print('Input error, try again.')
        continue
    else:
        if video == 'exit':
            print('Goodbye!')
            break
        if not os.path.exists(video):
            print('{} not exist, try again.'.format(video))
            continue

    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)


    while True:
        try:
            inp = input('User: ')
        except:
            print('Input error, try again.')
            continue
        else:
            if inp == 'exit':
                break
            else:
                # print(f"{roles[1]}: {inp}")
                inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + inp
                conv.append_message(conv.roles[0], inp)
                conv.append_message(conv.roles[1], None)

                prompt = conv.get_prompt() # For multiple questions on the same video, the prompt does not include the previous answer; when switching videos, the earlier questions are not cleared
                input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX['VIDEO'], return_tensors='pt').unsqueeze(0).cuda()
                stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
                keywords = [stop_str]
                stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

                with torch.inference_mode():
                    output_ids = model.generate(
                        input_ids,
                        images=[tensor, key],
                        do_sample=True,
                        temperature=0.1,
                        max_new_tokens=1024,
                        use_cache=True,
                        stopping_criteria=[stopping_criteria])

                outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
                print(outputs)

bash scripts/v1_5/eval/eval_qa_msrvtt.sh

When running "bash scripts/v1_5/eval/eval_qa_msrvtt.sh",
it reports the following errors.
Can anyone tell me how to fix this?

completed_files: 0
incomplete_files: 72821
Error processing file 'v_video9632_17': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6975cac190>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))
Error processing file 'v_video7010_0': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f68e4f5fac0>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))

Error of multi-turn conversations

Thank you for your great work! I tried some of the sample videos provided, but the model doesn't seem to support multi-turn conversations. The error is triggered when I try to send the second question. Does it only support a single round at this stage?

Why are there pure NLP conversations in videochatgpt_llavaimage_tune.json?

For example, a dict like the one below. Is it included just to fine-tune the LLM?

{'id': '740000',
'model': '',
'conversations': [{'from': 'human',
'value': '将下面的生活场景润色成动画片的对话,句数不少于8句,要求内容有教育意义\n家庭晚会\n爸爸:今晚我们举办家庭晚会,大家有什么节目表演吗?\n妈妈:我会唱一首歌,希望大家喜欢。\n哥哥:我会弹吉他,可以伴奏你唱歌。\n妹妹:我想跳一支舞,可以吗?'},
{'from': 'gpt',
'value': '(画面:家庭客厅,爸爸、妈妈、哥哥、妹妹四人围坐在沙发上)\n\n爸爸:今晚我们一家人一起举办家庭晚会,大家有什么节目表演吗?\n\n妈妈:我会唱一首歌,希望大家喜欢。 (妈妈微笑着说)\n\n哥哥:我会弹吉他,可以伴奏你唱歌。(哥哥翻起吉他,摆了一个酷酷的姿势)\n\n妹妹:我想跳一支舞,可以吗?(妹妹脸上带着期待的表情)\n\n爸爸:当然可以,你想跳哪一种舞呢?\n\n妹妹:我想跳芭蕾舞,因为我一直很喜欢看芭蕾舞演出。\n\n哥哥:那我们可以一起编排一个小节目,让妹妹跳芭蕾舞,我伴奏,妈妈唱歌。\n\n妹妹:太好了,我要开始练习了。(妹妹兴奋地跑出房间)\n\n爸爸:看到你们这么积极,我也想表演一个小魔术,让大家惊喜。\n\n妈妈:晚会不仅是娱乐,也是学习和交流的机会,我们可以通过互相表演,展示自己的才艺,同时也能增进家庭的感情,提高家人之间的沟通和合作能力。\n\n(画面:一家四口团结一致,笑容满面,享受着愉快的家庭晚会)\n\n教育意义:家庭晚会是一种增强家庭凝聚力、提高沟通能力和合作能力的好方法。通过家庭晚会,家人们可以分享自己的才艺、交流心得,加深彼此之间的了解和感情。在这个过程中,家长可以向孩子展示自己的兴趣和爱好,同时也可以从孩子身上学到新鲜事物,增加彼此之间的互动和交流。'}]

The number of iterations is half what it should be

Thanks for your great work. I ran into an issue: the number of iterations is half of what it should be.

The dataset size is 665K + 98K; num_train_epochs = 1.0; per_device_train_batch_size = 16; and there are 8 GPUs.
The number of iterations should be (665K + 98K) * 1.0 / 16 / 8 ≈ 5.96K,
but it shows 2.48K iterations.

Looking forward to your reply, thanks in advance

always answer yes?

When I ask, "is there any <some-object> in the video?", the model always answers "yes".

Inference endpoint on Hugging Face

The inference endpoint on Hugging Face isn't working with the Video-LLaVA model.
Here's the stack trace:

2023/12/05 17:27:45 ~ 2023-12-05 22:27:45,803 | INFO | No custom pipeline found at /repository/handler.py
2023/12/05 17:27:45 ~ 2023-12-05 22:27:45,803 | INFO | Initializing model from directory:/repository
2023/12/05 17:27:45 ~ INFO | Using device GPU
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/handler.py", line 17, in init
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
ch28b 2023-12-05T22:27:45.805Z
2023/12/05 17:27:45 ~ await handler()
2023/12/05 17:27:45 ~ await self._router.startup()
2023/12/05 17:27:45 ~ Application startup failed. Exiting.
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/init.py", line 705, in pipeline
2023/12/05 17:27:45 ~ self.pipeline = get_pipeline(model_dir=model_dir, task=task)
2023/12/05 17:27:45 ~ async with self.lifespan_context(app) as maybe_state:
2023/12/05 17:27:45 ~ File "/app/webservice_starlette.py", line 57, in some_startup_task
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 682, in startup
2023/12/05 17:27:45 ~ hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2023/12/05 17:27:45 ~ return HuggingFaceHandler(model_dir=model_dir, task=task)
2023/12/05 17:27:45 ~ inference_handler = get_inference_handler_either_custom_or_default_handler(HF_MODEL_DIR, task=HF_TASK)
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 710, in getitem
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/handler.py", line 45, in get_inference_handler_either_custom_or_default_handler
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 584, in aenter
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/utils.py", line 261, in get_pipeline
2023/12/05 17:27:45 ~ raise KeyError(key)
2023/12/05 17:27:45 ~ config_class = CONFIG_MAPPING[config_dict["model_type"]]
2023/12/05 17:27:45 ~ config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
2023/12/05 17:27:45 ~ Traceback (most recent call last):
2023/12/05 17:27:45 ~ KeyError: 'llava'
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 998, in from_pretrained

Cannot find preprocessor_config.json in LanguageBind_Image

When I tried to deploy the project locally, I couldn't connect to Hugging Face, so I pre-downloaded LanguageBind_Image, Video-LLaVA-7B, and LanguageBind_Video_Image, set model_path = 'local_path/LanguageBind/Video-LLaVA-7B/' in app.py,
and changed mm_image_tower and mm_video_tower in the Video-LLaVA-7B config.json to local paths. But eventually I got this error at runtime:
local_path/LanguageBind/LanguageBind_Image does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//local_path/LanguageBind/LanguageBind_Image/main' for available files
I think this error means that preprocessor_config.json was not found in LanguageBind_Image. However, I did not find this file in the corresponding Hugging Face repository either. Did I make a mistake?

Unexpected error from cudaGetDeviceCount()

Hello! I encountered this error while preparing the environment on 3090 GPUs. Is it due to a GPU driver version that is too old? My driver version is 470.141.03 with CUDA 11.7.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
