
mbzuai-oryx / video-chatgpt


"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

Home Page: https://mbzuai-oryx.github.io/Video-ChatGPT

License: Creative Commons Attribution 4.0 International

Python 99.38% Shell 0.62%
chatbot clip gpt-4 llama llava multi-modal vicuna vision-language vision-language-pretraining video-chatbot

video-chatgpt's Introduction

Oryx Video-ChatGPT 🎥 💬


Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

* Equally contributing first authors

Mohamed bin Zayed University of Artificial Intelligence




Links: Demo · Paper · Demo Clips (DemoClip-1, DemoClip-2, DemoClip-3, DemoClip-4) · Offline Demo · Training · Video Instruction Dataset · Quantitative Evaluation · Qualitative Analysis

📢 Latest Updates

  • Sep-30: Our VideoInstruct100K dataset can be downloaded from HuggingFace/VideoInstruct100K. 🔥🔥
  • Jul-15: Our quantitative evaluation benchmark for video-based conversational models now has its own dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT. 🔥🔥
  • Jun-28: Updated the GitHub README with benchmark comparisons of Video-ChatGPT against recent models: Video Chat, Video LLaMA, and LLaMA Adapter. Among these advanced conversational models, Video-ChatGPT continues to deliver state-of-the-art performance. 🔥🔥
  • Jun-08: Released the training code, offline demo, instruction data, and technical report. All resources, including models, datasets, and extracted features, are available here. 🔥🔥
  • May-21: Video-ChatGPT demo released.

Online Demo 💻

🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥

🔥🔥 Or click the image below to try the demo! 🔥🔥 All the videos we use in our demonstrations are available here.


Video-ChatGPT Overview 💡

Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.

Video-ChatGPT Architectural Overview


Contributions ๐Ÿ†

  • We introduce 100K high-quality video-instruction pairs, together with a novel, scalable annotation framework that generates diverse, high-quality, video-specific instruction sets.
  • We develop the first quantitative video conversation evaluation framework for benchmarking video conversation models.
  • Unique multimodal (vision-language) capability combining video understanding and language generation, comprehensively evaluated through quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.



Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, install FlashAttention for training:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
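As an optional sanity check (a minimal sketch; run from the repository root with PYTHONPATH set as above), you can confirm that the core packages and the FlashAttention build import cleanly:

python -c "import torch, transformers; print('torch', torch.__version__, '| transformers', transformers.__version__)"
python -c "import flash_attn; print('flash-attn OK')"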

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.
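For reference, a typical launch command looks like the following (a sketch based on the commands reported in the issues further down this page; the two paths are placeholders for your local LLaVA-Lightning-7B-v1-1 weights and the video_chatgpt-7B.bin projection weights, and offline_demo.md remains the authoritative guide):

python video_chatgpt/demo/video_demo.py \
    --model-name /path/to/LLaVA-Lightning-7B-v1-1 \
    --projection_path /path/to/video_chatgpt-7B.bin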


Training 🚋

For training instructions, check out train_video_chatgpt.md.
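As a rough orientation only (a minimal sketch: train_mem.py and the --bf16/--tf32 flags appear in the issues further down this page, torchrun and --output_dir are standard tooling, and the GPU count, paths, and remaining flag names are placeholders to be checked against train_video_chatgpt.md):

torchrun --nproc_per_node=8 video_chatgpt/train/train_mem.py \
    --model_name_or_path /path/to/LLaVA-Lightning-7B-v1-1 \
    --bf16 True \
    --tf32 True \
    --output_dir ./Video-ChatGPT_checkpoints
    # ...plus the data, feature, and hyperparameter flags documented in train_video_chatgpt.md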


Video Instruction Dataset 📂

We are releasing our dataset of 100,000 high-quality video instruction pairs that was used to train the Video-ChatGPT model. You can download the dataset from here. More details on our human-assisted and semi-automatic annotation framework for generating the data are available in VideoInstructionDataset.md.
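One way to fetch the released data from the command line (a hedged sketch: the Hugging Face repo id MBZUAI/VideoInstruct-100K is an assumption inferred from the HuggingFace/VideoInstruct100K link in the updates above, so verify it against that link before running):

pip install -U "huggingface_hub[cli]"
huggingface-cli download MBZUAI/VideoInstruct-100K --repo-type dataset --local-dir ./VideoInstruct-100K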


Quantitative Evaluation 📊

Our paper introduces a new Quantitative Evaluation Framework for Video-based Conversational Models. To explore our benchmarks and understand the framework in greater detail, please visit our dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT.

For detailed instructions on performing quantitative evaluation, please refer to QuantitativeEvaluation.md.

Video-based Generative Performance Benchmarking and Zero-Shot Question-Answer Evaluation tables are provided for a detailed performance overview.

Zero-Shot Question-Answer Evaluation

| Model | MSVD-QA Accuracy | MSVD-QA Score | MSRVTT-QA Accuracy | MSRVTT-QA Score | TGIF-QA Accuracy | TGIF-QA Score | ActivityNet-QA Accuracy | ActivityNet-QA Score |
|---|---|---|---|---|---|---|---|---|
| FrozenBiLM | 32.2 | -- | 16.8 | -- | 41.0 | -- | 24.7 | -- |
| Video Chat | 56.3 | 2.8 | 45.0 | 2.5 | 34.4 | 2.3 | 26.5 | 2.2 |
| LLaMA Adapter | 54.9 | 3.1 | 43.8 | 2.7 | - | - | 34.2 | 2.7 |
| Video LLaMA | 51.6 | 2.5 | 29.6 | 1.8 | - | - | 12.4 | 1.1 |
| Video-ChatGPT | 64.9 | 3.3 | 49.3 | 2.8 | 51.4 | 3.0 | 35.2 | 2.7 |

Video-based Generative Performance Benchmarking

| Evaluation Aspect | Video Chat | LLaMA Adapter | Video LLaMA | Video-ChatGPT |
|---|---|---|---|---|
| Correctness of Information | 2.23 | 2.03 | 1.96 | 2.40 |
| Detail Orientation | 2.50 | 2.32 | 2.18 | 2.52 |
| Contextual Understanding | 2.53 | 2.30 | 2.16 | 2.62 |
| Temporal Understanding | 1.94 | 1.98 | 1.82 | 1.98 |
| Consistency | 2.24 | 2.15 | 1.79 | 2.37 |

Qualitative Analysis 🔍

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks 🎥

sample1


Creative and Generative Tasks 🖌️

sample5


Spatial Understanding 🌍

sample8


Video Understanding and Conversational Tasks 💬

sample10


Action Recognition ๐Ÿƒ

sample22


Question Answering Tasks ❓

sample14


Temporal Understanding โณ

sample18


Acknowledgements 🙏

  • LLaMA: a great attempt towards open and efficient LLMs!
  • Vicuna: has amazing language capabilities!
  • LLaVA: our architecture is inspired by LLaVA.
  • Thanks to our colleagues at MBZUAI for their essential contributions to the video annotation task, including Salman Khan, Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Jiale Cao, Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Muzammal Naseer, Akshay Dudhane, Jean Lahoud, Awais Rauf, Sahal Shaji, and Bokang Jia, without whom this project would not be possible.

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

    @article{Maaz2023VideoChatGPT,
        title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
        author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
        journal={arXiv:2306.05424},
        year={2023}
}

License 📜

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.


video-chatgpt's People

Contributors

ashmalvayani, eltociear, hanoonar, mmaaz60


video-chatgpt's Issues

pydantic version problem

I have the same problem as subzeroid/instagrapi#1435.

It looks like this:
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing
profile_pic_url
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing
profile_pic_url_hd
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing
is_private
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing

I followed the instructions to roll back the pydantic version from 2.0.x to 1.10.9, and then the project worked properly.

Please add the following to requirements.txt:
pydantic==1.10.9
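For an existing environment, the equivalent one-off workaround (a sketch of the downgrade described above) is:

pip install "pydantic==1.10.9"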

error while deploying video-chatgpt locally

I followed the instructions for setting up model inference locally. The server is now up, but when I uploaded a test video for inference, I got this error:

2023-07-25 08:26:23 | ERROR | asyncio | Task exception was never retrieved
future: <Task finished name='7rg4m1ee7gf_12' coro=<Queue.process_events() done, defined at /home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py:343> exception=1 validation error for PredictBody
event_id
Field required [type=missing, input_value={'fn_index': 12, 'data': ...on_hash': '7rg4m1ee7gf'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.3/v/missing>
Traceback (most recent call last):
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py", line 347, in process_events
client_awake = await self.gather_event_data(event)
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py", line 220, in gather_event_data
data, client_awake = await self.get_message(event, timeout=receive_timeout)
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py", line 456, in get_message
return PredictBody(**data), True
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/pydantic/main.py", line 150, in init
pydantic_self.pydantic_validator.validate_python(data, self_instance=pydantic_self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for PredictBody
event_id
Field required [type=missing, input_value={'fn_index': 12, 'data': ...on_hash': '7rg4m1ee7gf'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.3/v/missing

Fail to run Video-ChatGPT Demo Offline

Thank you for sharing the good work!

I followed "offline_demo.md" to run offline, but website has no respones.

The terminal shows below. What does line 10 means? What error occurred?

$ python video_chatgpt/demo/video_demo.py --model-name /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/LLaVA-Lightning-7B-v1-1 --projection_path /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin
2023-09-07 14:10:24 | INFO | gradio_web_server | args: Namespace(host='0.0.0.0', port=None, controller_url='http://localhost:210001', concurrency_count=8, model_list_mode='once', share=False, moderate=False, embed=False, model_name='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/LLaVA-Lightning-7B-v1-1', vision_tower_name='openai/clip-vit-large-patch14', conv_mode='video-chatgpt_v1', projection_path='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin')
2023-09-07 14:10:24 | INFO | gradio_web_server | Namespace(host='0.0.0.0', port=None, controller_url='http://localhost:210001', concurrency_count=8, model_list_mode='once', share=False, moderate=False, embed=False, model_name='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/LLaVA-Lightning-7B-v1-1', vision_tower_name='openai/clip-vit-large-patch14', conv_mode='video-chatgpt_v1', projection_path='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin')
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using a model of type llava to instantiate a model of type VideoChatGPT. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                        | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|████████████████                | 1/2 [00:04<00:04,  4.19s/it]
Loading checkpoint shards: 100%|████████████████████████████████| 2/2 [00:05<00:00,  2.68s/it]
Loading checkpoint shards: 100%|████████████████████████████████| 2/2 [00:05<00:00,  2.90s/it]
2023-09-07 14:10:30 | ERROR | stderr | 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32006. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc

'NOTE: Please make sure you press the 'Upload Video' button and wait for it to display 'Start Chatting' before submitting a question to Video-ChatGPT.' But the Start Chatting button is always gray.

2023-09-07 14:10:30 | ERROR | stderr | 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32006. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
2023-09-07 14:10:48 | INFO | stdout | Loading weights from /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin
2023-09-07 14:10:49 | INFO | stdout | Weights loaded from /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin
2023-09-07 14:10:55 | INFO | stdout | Initialization Finished
2023-09-07 14:10:56 | ERROR | stderr | /home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/deprecation.py:43: UserWarning: You have unused kwarg parameters in Markdown, please remove them: {'style': 'color:gray'}
2023-09-07 14:10:56 | ERROR | stderr |   warnings.warn(
2023-09-07 14:10:56 | INFO | stdout | Running on local URL:  http://127.0.0.1:7860
2023-09-07 14:14:05 | INFO | gradio_web_server | load_demo.. params: {}
2023-09-07 14:14:18 | INFO | gradio_web_server | add_text. ip:. len: 26
2023-09-07 14:14:19 | ERROR | stderr | Traceback (most recent call last):
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
2023-09-07 14:14:19 | ERROR | stderr |     output = await app.get_blocks().process_api(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
2023-09-07 14:14:19 | ERROR | stderr |     result = await self.call_function(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/blocks.py", line 898, in call_function
2023-09-07 14:14:19 | ERROR | stderr |     prediction = await anyio.to_thread.run_sync(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
2023-09-07 14:14:19 | ERROR | stderr |     return await get_asynclib().run_sync_in_worker_thread(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
2023-09-07 14:14:19 | ERROR | stderr |     return await future
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
2023-09-07 14:14:19 | ERROR | stderr |     result = context.run(func, *args)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
2023-09-07 14:14:19 | ERROR | stderr |     return next(iterator)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt/demo/chat.py", line 109, in answer
2023-09-07 14:14:19 | ERROR | stderr |     image_forward_outs = self.vision_tower(image_tensor, output_hidden_states=True)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 958, in forward
2023-09-07 14:14:19 | ERROR | stderr |     return self.vision_model(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 883, in forward
2023-09-07 14:14:19 | ERROR | stderr |     hidden_states = self.embeddings(pixel_values)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 196, in forward
2023-09-07 14:14:19 | ERROR | stderr |     patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
2023-09-07 14:14:19 | ERROR | stderr |     return self._conv_forward(input, self.weight, self.bias)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
2023-09-07 14:14:19 | ERROR | stderr |     return F.conv2d(input, weight, bias, self.stride,
2023-09-07 14:14:19 | ERROR | stderr | RuntimeError: GET was unable to find an engine to execute this computation
2023-09-07 14:15:59 | INFO | stdout | Running on public URL: https://639177a685ea0e6be8.gradio.live
2023-09-07 14:15:59 | INFO | stdout | 
2023-09-07 14:15:59 | INFO | stdout | This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces

Website not working properly

Hey, I hope you are doing well. I want to use Video-ChatGPT for a POC, but the chatting feature is not working. Can you please look into this ASAP? Thank you.

CLIPVisionModel and configuration warnings

When I run inference, I get the following warnings:

Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.5.layer_norm1.bias', 'text_model.encoder.layers.10.self_attn.v_proj.weight', 'text_model.encoder.layers.2.layer_norm2.weight', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.encoder.layers.2.self_attn.v_proj.bias', 'text_model.encoder.layers.8.layer_norm1.bias', 'text_model.encoder.layers.0.layer_norm1.weight', 'text_model.encoder.layers.3.layer_norm2.bias', 'text_model.encoder.layers.7.mlp.fc2.weight', 'text_model.encoder.layers.8.layer_norm2.weight', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.0.mlp.fc2.weight', 'text_model.encoder.layers.7.self_attn.v_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.2.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.weight', 'text_model.encoder.layers.7.self_attn.q_proj.weight', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.6.mlp.fc2.weight', 'text_model.encoder.layers.5.mlp.fc2.weight', 'text_model.encoder.layers.2.mlp.fc1.weight', 'text_model.encoder.layers.0.self_attn.v_proj.bias', 'text_model.encoder.layers.3.mlp.fc2.bias', 'text_model.encoder.layers.7.self_attn.k_proj.bias', 'text_model.embeddings.position_embedding.weight', 'text_model.encoder.layers.0.layer_norm1.bias', 'text_model.encoder.layers.4.mlp.fc1.bias', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.2.mlp.fc2.bias', 'text_model.encoder.layers.1.mlp.fc1.bias', 'text_model.encoder.layers.9.self_attn.q_proj.weight', 'text_model.encoder.layers.4.self_attn.k_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.bias', 'text_model.encoder.layers.7.layer_norm1.bias', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.5.mlp.fc1.weight', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.bias', 'text_model.encoder.layers.6.self_attn.k_proj.weight', 'text_model.encoder.layers.10.mlp.fc1.bias', 'text_model.encoder.layers.10.layer_norm1.weight', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.2.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.out_proj.weight', 'text_model.embeddings.position_ids', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.4.layer_norm2.weight', 'text_model.encoder.layers.5.layer_norm2.bias', 'text_model.encoder.layers.2.self_attn.k_proj.bias', 'text_model.encoder.layers.2.layer_norm2.bias', 'text_model.encoder.layers.5.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.v_proj.bias', 'text_model.encoder.layers.8.layer_norm2.bias', 'text_model.encoder.layers.8.layer_norm1.weight', 'text_model.encoder.layers.6.layer_norm2.bias', 'text_model.encoder.layers.9.self_attn.out_proj.weight', 'text_model.encoder.layers.8.mlp.fc2.bias', 'text_model.encoder.layers.1.self_attn.k_proj.weight', 'text_model.encoder.layers.4.self_attn.k_proj.bias', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.2.self_attn.out_proj.bias', 'text_model.encoder.layers.1.self_attn.v_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.bias', 
'text_model.encoder.layers.6.layer_norm1.bias', 'text_model.encoder.layers.0.self_attn.k_proj.bias', 'text_model.encoder.layers.1.mlp.fc2.bias', 'text_model.encoder.layers.7.self_attn.out_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.weight', 'text_model.encoder.layers.4.layer_norm2.bias', 'text_model.encoder.layers.7.mlp.fc1.bias', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.4.mlp.fc2.bias', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.0.mlp.fc1.bias', 'text_model.encoder.layers.9.self_attn.k_proj.bias', 'text_model.encoder.layers.7.self_attn.q_proj.bias', 'text_model.encoder.layers.9.self_attn.out_proj.bias', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.7.self_attn.v_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.weight', 'text_model.encoder.layers.7.layer_norm2.weight', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.encoder.layers.3.mlp.fc1.weight', 'text_model.encoder.layers.3.layer_norm1.bias', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.10.layer_norm2.weight', 'text_model.encoder.layers.0.self_attn.k_proj.weight', 'text_model.embeddings.token_embedding.weight', 'text_model.encoder.layers.8.self_attn.v_proj.bias', 'text_model.encoder.layers.8.mlp.fc1.weight', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.7.layer_norm1.weight', 'text_model.encoder.layers.6.self_attn.k_proj.bias', 'text_model.encoder.layers.3.self_attn.q_proj.weight', 'text_model.encoder.layers.9.layer_norm2.bias', 'text_model.encoder.layers.9.self_attn.q_proj.bias', 'text_model.encoder.layers.10.self_attn.k_proj.weight', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.2.mlp.fc2.weight', 'text_model.encoder.layers.0.self_attn.q_proj.weight', 'text_model.encoder.layers.4.self_attn.q_proj.bias', 'text_model.encoder.layers.10.mlp.fc2.bias', 'text_model.encoder.layers.3.self_attn.out_proj.bias', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.7.self_attn.k_proj.weight', 'text_model.encoder.layers.7.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.q_proj.weight', 'text_model.encoder.layers.9.layer_norm1.bias', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.6.layer_norm1.weight', 'text_model.encoder.layers.5.self_attn.v_proj.bias', 'text_model.encoder.layers.2.self_attn.v_proj.weight', 'text_model.encoder.layers.0.self_attn.q_proj.bias', 'text_model.encoder.layers.4.layer_norm1.bias', 'text_model.encoder.layers.5.self_attn.k_proj.bias', 'text_model.encoder.layers.6.self_attn.v_proj.weight', 'text_model.final_layer_norm.bias', 'text_model.encoder.layers.4.self_attn.q_proj.weight', 'text_projection.weight', 'text_model.encoder.layers.6.self_attn.q_proj.bias', 'text_model.encoder.layers.8.self_attn.out_proj.bias', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.9.layer_norm2.weight', 'text_model.encoder.layers.6.mlp.fc2.bias', 'text_model.encoder.layers.5.self_attn.out_proj.bias', 'text_model.encoder.layers.4.self_attn.out_proj.weight', 'text_model.encoder.layers.0.layer_norm2.weight', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.3.layer_norm2.weight', 
'text_model.encoder.layers.9.mlp.fc2.bias', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.3.mlp.fc1.bias', 'text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.5.mlp.fc2.bias', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.5.mlp.fc1.bias', 'logit_scale', 'text_model.encoder.layers.9.layer_norm1.weight', 'text_model.encoder.layers.2.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.out_proj.weight', 'text_model.encoder.layers.9.mlp.fc1.weight', 'visual_projection.weight', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.6.mlp.fc1.bias', 'text_model.encoder.layers.5.layer_norm1.weight', 'text_model.encoder.layers.3.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.7.mlp.fc2.bias', 'text_model.encoder.layers.8.self_attn.v_proj.weight', 'text_model.encoder.layers.9.self_attn.v_proj.weight', 'text_model.final_layer_norm.weight', 'text_model.encoder.layers.10.layer_norm2.bias', 'text_model.encoder.layers.1.self_attn.q_proj.bias', 'text_model.encoder.layers.5.layer_norm2.weight', 'text_model.encoder.layers.2.layer_norm1.bias', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.1.layer_norm2.bias', 'text_model.encoder.layers.1.mlp.fc2.weight', 'text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.1.self_attn.k_proj.bias', 'text_model.encoder.layers.2.layer_norm1.weight', 'text_model.encoder.layers.9.self_attn.v_proj.bias', 'text_model.encoder.layers.10.layer_norm1.bias', 'text_model.encoder.layers.7.mlp.fc1.weight', 'text_model.encoder.layers.4.self_attn.out_proj.bias', 'text_model.encoder.layers.10.mlp.fc2.weight', 'text_model.encoder.layers.6.self_attn.out_proj.bias', 'text_model.encoder.layers.6.self_attn.q_proj.weight', 'text_model.encoder.layers.1.layer_norm1.weight', 'text_model.encoder.layers.5.self_attn.v_proj.weight', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.1.self_attn.q_proj.weight', 'text_model.encoder.layers.0.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.out_proj.weight', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.4.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.out_proj.bias', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 'text_model.encoder.layers.4.self_attn.v_proj.bias', 'text_model.encoder.layers.8.mlp.fc1.bias', 'text_model.encoder.layers.10.self_attn.k_proj.bias', 'text_model.encoder.layers.9.mlp.fc2.weight', 'text_model.encoder.layers.4.self_attn.v_proj.weight', 'text_model.encoder.layers.3.layer_norm1.weight', 'text_model.encoder.layers.10.self_attn.q_proj.bias', 'text_model.encoder.layers.8.self_attn.q_proj.bias']
- This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

and

UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)

Are these expected? Can I ignore them?

Longer frames issues.

In "./video_chatgpt/eval/model_utils.py", line 12

def load_video(vis_path, n_clips=1, num_frm=100):
    """
    Load video frames from a video file.

    Parameters:
    vis_path (str): Path to the video file.
    n_clips (int): Number of clips to extract from the video. Defaults to 1.
    num_frm (int): Number of frames to extract from each clip. Defaults to 100.
    """

I just changed num_frm from 100 to 200 in order to handle longer videos better, but the following errors occurred:

2023-06-21 16:40:25 | ERROR | stderr | /home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/transformers/generation/utils.py:1211: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
2023-06-21 16:40:25 | ERROR | stderr | warnings.warn(
2023-06-21 16:40:26 | ERROR | stderr | Traceback (most recent call last):
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/routes.py", line 394, in run_predict
2023-06-21 16:40:26 | ERROR | stderr | output = await app.get_blocks().process_api(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/blocks.py", line 1075, in process_api
2023-06-21 16:40:26 | ERROR | stderr | result = await self.call_function(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/blocks.py", line 898, in call_function
2023-06-21 16:40:26 | ERROR | stderr | prediction = await anyio.to_thread.run_sync(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
2023-06-21 16:40:26 | ERROR | stderr | return await get_asynclib().run_sync_in_worker_thread(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
2023-06-21 16:40:26 | ERROR | stderr | return await future
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
2023-06-21 16:40:26 | ERROR | stderr | result = context.run(func, *args)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/utils.py", line 549, in async_iteration
2023-06-21 16:40:26 | ERROR | stderr | return next(iterator)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/Video-ChatGPT/video_chatgpt/demo/chat.py", line 118, in answer
2023-06-21 16:40:26 | ERROR | stderr | output_ids = self.model.generate(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2023-06-21 16:40:26 | ERROR | stderr | return func(*args, **kwargs)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/transformers/generation/utils.py", line 1462, in generate
2023-06-21 16:40:26 | ERROR | stderr | return self.sample(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/transformers/generation/utils.py", line 2478, in sample
2023-06-21 16:40:26 | ERROR | stderr | outputs = self(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-06-21 16:40:26 | ERROR | stderr | return forward_call(*args, **kwargs)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/Video-ChatGPT/video_chatgpt/model/video_chatgpt.py", line 191, in forward
2023-06-21 16:40:26 | ERROR | stderr | outputs = self.model(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-06-21 16:40:26 | ERROR | stderr | return forward_call(*args, **kwargs)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/Video-ChatGPT/video_chatgpt/model/video_chatgpt.py", line 105, in forward
2023-06-21 16:40:26 | ERROR | stderr | if cur_input_ids[video_start_token_pos + num_patches + 1] != self.vision_config.vid_end_token:
2023-06-21 16:40:26 | ERROR | stderr | IndexError: index 523 is out of bounds for dimension 0 with size 429

After many attempts, we still haven't figured out the cause.
Could you help us look into this problem? Or is there a correct way to process 200 frames? Thank you!

Alternative choices for linear layer

Thank you for the excellent work! In your paper you mention that you experimented with more complex networks in addition to the linear layer. Will you publish the details and evaluation results of these other attempts?

Thanks in advance!

Video uploading is not working and I want to design website for you.

I was trying to test Video-ChatGPT, but it showed an error after I clicked the Upload Video button.


I would also like to contribute to this project by building a more appealing website, making it more attractive and pleasing to others, and adding documentation to the website that helps users understand how to use it ⚡.

License

Hi,

In your README, you have "Non-commercial bespoke license. Please refer to license terms here." But "here" links to GPL 3.0 which specifically allows commercial use.

Can you clarify? Thanks.

Can't download weights using transformers api

Hi,
When using the code below to download the weights, I'm getting a key error regarding LLaVA.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mmaaz60/LLaVA-7B-Lightening-v1-1")

ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=1']

Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=0']
Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=2']
Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=1']

Code & Pretrained Models

Please note that this is an ongoing work where we are working on improving our architecture design and finetuning on the video instruction data. We will release our codes and pretrained models very soon (Before Jun 17, 2023). Stay tuned!

[benchmark] Some questions about the details to generate files in step 1 during `Video-based Generative Performance Benchmarking`.

Hello, following the instructions for step 1 in quantitative_evaluation, I obtained three files:

  1. one file generated with generic_qa.json and run_inference_benchmark_general.py
  2. one file generated with consistency_qa.json and run_inference_benchmark_consistency.py
  3. one file generated with temporal_qa.json and run_inference_benchmark_general.py

Then, do I need to generate any other files? And how do they function in step 2? More specifically, if I want to evaluate correctness and detail orientation, which file generated in step 1 should I pass as pred_path in the step 2 command?

Great work!

Hi, congrats on the great work and very impressive performance!

I have a small question about the spatio-temporal features extracted using CLIP. In the OneDrive download path you provided (https://mbzuaiac-my.sharepoint.com/:f:/g/personal/hanoona_bangalath_mbzuai_ac_ae/EnLRDehrr8lGqHpC5w1zZ9QBnsiVffYy5vCv8Hl14deRcg?e=Ul5DUE), there seems to be no "v_CL6TbOgnLzA.pkl" file, which is listed in https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/docs/train_video_ids.txt and will cause a bug when running the training script. Could you help?

I would appreciate it very much if you could reply.
Thanks in advance.

from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM

!python scripts/apply_delta.py --base-model-path /content/drive/MyDrive/Video-ChatGPT/video_chatgpt-7B.bin --target-model-path LLaVA-Lightning-7B-v1-1 --delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1
Traceback (most recent call last):
File "/content/drive/MyDrive/Video-ChatGPT/scripts/apply_delta.py", line 10, in
from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM
ModuleNotFoundError: No module named 'video_chatgpt'
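A likely cause, judging from the installation section above (a suggestion, not a confirmed fix): the script was run without the repository root on PYTHONPATH, so the video_chatgpt package cannot be found. From the cloned Video-ChatGPT directory:

export PYTHONPATH="./:$PYTHONPATH"
python scripts/apply_delta.py --base-model-path <base-model> --target-model-path <output-dir> --delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1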

Llama 2 7B

Hi, great project!

I am trying to run it with Llama 2. I have followed the steps, but I am getting hallucinations.

Can it be done with Llama 2 7B?

About Video Instruction Data Generation

@mmaaz60 @hanoonaR
Thank you for sharing your great work.
I have a question about video instruction data generation.
As mentioned in your paper, you built the video instruction dataset using both human-assisted and semi-automatic annotation methods.
What is the ratio of each method in the entire dataset?
I suspect you created more than 70 percent of the dataset with the semi-automatic annotation method, because the human-assisted method is costly.
Thank you in advance.

can I train in one A100 80G GPU?

Hello, thanks for the great work.

Can I train the model using only one A100 80G GPU? Or how can we modify the code so that it can be trained on one GPU? Thank you so much.

Zero-Shot Question-Answer Evaluation of Accuracy

Hello, sorry to bother you again. Your work is very interesting, and we may want to build on it for further research.

When we ran the zero-shot QA test on the MSVD-QA dataset, we found that for every question the ground-truth answer is a single word, for example: {"answer":"someone","question":"who opened the box that held an automatic weapon in a gun?","video_id":1451}. But when we tested with the pretrained Video-ChatGPT model, we found that it often outputs a complete sentence, such as: 'The box that held the automatic weapon was opened by a person who is not visible in the video.'. Given that Video-ChatGPT reports 64.9 accuracy on MSVD-QA, how is this evaluated? I saw that previous works testing on this dataset required prediction == answer to count as correct, which seems too harsh for large models unless they are trained on the corresponding dataset.

The same problem seems to occur on other datasets (ActivityNet-QA), because their answers are also only a few words long.

Question about result

@hanoonaR @mmaaz60
Hello, thank you for sharing this excellent work.
I have been checking the output of the model.
Even when I input the same video, the results differ, for example:
('The person in the video is using a cellphone' vs. 'The person in the video is holding a cellphone')
Can you tell me how to control the result? I would like to get the same result every time.
Thank you in advance.

training process: mat1 and mat2 must have the same dtype

Hi, sorry to bother you again!
I followed the instructions in train_video_chatgpt.md and used the command to start training.
My devices are 2 RTX 8000 GPUs, which are not Ampere GPUs (a little out of date). These devices don't support bf16 and tf32, so I set both parameters to False.
Then a runtime error occurred:

mat1 and mat2 must have the same dtype.

I'm sure we used exactly the same environment as yours, so is this problem caused by the GPU difference?
If so, is there any way to solve it?
We would really appreciate your help. Thank you!

Few-shot video classification

Thanks a lot for this exciting work!

I have a general question about whether the proposed approach would work well for some form of few-shot video understanding / classification. From the technical side, it should be possible to provide multiple videos with textual descriptions as part of the prompt. I am wondering if the currently trained model would handle the ambiguity of this new few-shot setting. Have you tried anything in the few-shot direction, or do you have any intuition about whether this might work or would require further training?

Setup Error during training

warnings.warn(
Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 113, in init
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/training_args.py", line 1190, in post_init
raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

Cannot load './LLaVA-Lightening-7B-v1-1'

Hi, I followed the instructions in offline_demo.md and saved the Video-ChatGPT weights and the LLaVA-Lightening-7B-v1-1 weights in the current directory. Then I used this command for the offline demo: python video_chatgpt/demo/video_demo.py --model-name ./LLaVA-Lightening-7B-v1-1 --projection_path ./video_chatgpt-7B.bin
but got an error: huggingface_hub.utils.validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: './LLaVA-Lightening-7B-v1-1'.
Is the way I filled in --model-name wrong?

Thanks!

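A hedged observation rather than a confirmed fix: in the earlier offline-demo report above, the demo loads successfully when --model-name points to an absolute path instead of a ./-prefixed relative one, e.g.:

python video_chatgpt/demo/video_demo.py \
    --model-name /absolute/path/to/LLaVA-Lightening-7B-v1-1 \
    --projection_path /absolute/path/to/video_chatgpt-7B.bin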

Question about running video-chatgpt demo offline

@mmaaz60
Thank you in advance for providing this great work.
I have a question about running the Video-ChatGPT demo offline.
When I uploaded a video with the 'Upload Video' button, nothing happened even after 10 minutes.
Can you confirm this and tell me how to use it correctly?

During the demo, sometimes an exception occurs

2023-06-27 05:01:51 | ERROR | stderr | File "/home/lizheng/software/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
2023-06-27 05:01:51 | ERROR | stderr | result = context.run(func, *args)
2023-06-27 05:01:51 | ERROR | stderr | File "/home/lizheng/software/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
2023-06-27 05:01:51 | ERROR | stderr | return next(iterator)
2023-06-27 05:01:51 | ERROR | stderr | File "/home/lizheng/project/Video-ChatGPT/video_chatgpt/demo/chat.py", line 105, in answer
2023-06-27 05:01:51 | ERROR | stderr | image_tensor = img_list[0]
2023-06-27 05:01:51 | ERROR | stderr | IndexError: list index out of range

linear layer

Hi, can you confirm the dimensions of the linear layer that you learned?

Kind Regards

Website Down

Hello everyone, I hope you are doing well. The demo website is down; can you please look into it ASAP? Thank you.

Missing config.json for Video-ChatGPT-7B when deployed on AWS Sagemaker

I have deployed the Video-ChatGPT-7B model on AWS Sagemaker using the script given on the HuggingFace website.

However, there are two issues:

  1. I have the predictor now but what exactly should I send in the payload? How do I include the video alongside my prompt in the request to the endpoint created by Sagemaker? Can you please share some example payload?
  2. I tried sending a payload without a video just to test whether the endpoint works, and it gave me the following error:
    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{ "code": 400, "type": "InternalServerException", "message": "/.sagemaker/mms/models/MBZUAI__Video-ChatGPT-7B does not appear to have a file named config.json. Checkout \u0027https://huggingface.co//.sagemaker/mms/models/MBZUAI__Video-ChatGPT-7B/None\u0027 for available files." }
    Where can I get this config.json file?
