
mbzuai-oryx / video-chatgpt


"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

Home Page: https://mbzuai-oryx.github.io/Video-ChatGPT

License: Creative Commons Attribution 4.0 International

Python 99.38% Shell 0.62%
chatbot clip gpt-4 llama llava multi-modal vicuna vision-language vision-language-pretraining video-chatbot

video-chatgpt's Introduction

Oryx Video-ChatGPT 🎥 💬


Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

* Equally contributing first authors

Mohamed bin Zayed University of Artificial Intelligence




Links: Demo · Paper · Demo Clips (DemoClip-1, DemoClip-2, DemoClip-3, DemoClip-4) · Offline Demo · Training · Video Instruction Dataset · Quantitative Evaluation · Qualitative Analysis

📢 Latest Updates

  • Sep-30: Our VideoInstruct100K dataset can be downloaded from HuggingFace/VideoInstruct100K. 🔥🔥
  • Jul-15: Our quantitative evaluation benchmark for video-based conversational models now has its own dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT. 🔥🔥
  • Jun-28: Updated the GitHub README with benchmark comparisons of Video-ChatGPT against recent models: Video Chat, Video LLaMA, and LLaMA Adapter. Among these advanced conversational models, Video-ChatGPT continues to deliver state-of-the-art performance. 🔥🔥
  • Jun-08: Released the training code, offline demo, instruction data, and technical report. All resources, including models, datasets, and extracted features, are available here. 🔥🔥
  • May-21: Video-ChatGPT demo released.

Online Demo 💻

🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥

🔥🔥 Or click the image below to try the demo! 🔥🔥 All the videos we use in our demonstrations are available here.


Video-ChatGPT Overview 💡

Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.

Video-ChatGPT Architectural Overview


Contributions ๐Ÿ†

  • We introduce 100K high-quality video-instruction pairs, together with a novel, scalable annotation framework that generates diverse, high-quality, video-specific instruction sets.
  • We develop the first quantitative video conversation evaluation framework for benchmarking video conversation models.
  • Unique multimodal (vision-language) capability combining video understanding and language generation, comprehensively evaluated through quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.



Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, install FlashAttention for training:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install
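As an optional sanity check (a minimal sketch; run from the repository root with PYTHONPATH set as above), you can confirm that the core packages and the FlashAttention build import cleanly:

python -c "import torch, transformers; print('torch', torch.__version__, '| transformers', transformers.__version__)"
python -c "import flash_attn; print('flash-attn OK')"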

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.
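For reference, a typical launch command looks like the following (a sketch based on the commands reported in the issues further down this page; the two paths are placeholders for your local LLaVA-Lightning-7B-v1-1 weights and the video_chatgpt-7B.bin projection weights, and offline_demo.md remains the authoritative guide):

python video_chatgpt/demo/video_demo.py \
    --model-name /path/to/LLaVA-Lightning-7B-v1-1 \
    --projection_path /path/to/video_chatgpt-7B.bin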


Training 🚋

For training instructions, check out train_video_chatgpt.md.
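As a rough orientation only (a minimal sketch: train_mem.py and the --bf16/--tf32 flags appear in the issues further down this page, torchrun and --output_dir are standard tooling, and the GPU count, paths, and remaining flag names are placeholders to be checked against train_video_chatgpt.md):

torchrun --nproc_per_node=8 video_chatgpt/train/train_mem.py \
    --model_name_or_path /path/to/LLaVA-Lightning-7B-v1-1 \
    --bf16 True \
    --tf32 True \
    --output_dir ./Video-ChatGPT_checkpoints
    # ...plus the data, feature, and hyperparameter flags documented in train_video_chatgpt.md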


Video Instruction Dataset 📂

We are releasing our dataset of 100,000 high-quality video instruction pairs that was used to train the Video-ChatGPT model. You can download the dataset from here. More details on our human-assisted and semi-automatic annotation framework for generating the data are available in VideoInstructionDataset.md.
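One way to fetch the released data from the command line (a hedged sketch: the Hugging Face repo id MBZUAI/VideoInstruct-100K is an assumption inferred from the HuggingFace/VideoInstruct100K link in the updates above, so verify it against that link before running):

pip install -U "huggingface_hub[cli]"
huggingface-cli download MBZUAI/VideoInstruct-100K --repo-type dataset --local-dir ./VideoInstruct-100K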


Quantitative Evaluation 📊

Our paper introduces a new Quantitative Evaluation Framework for Video-based Conversational Models. To explore our benchmarks and understand the framework in greater detail, please visit our dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT.

For detailed instructions on performing quantitative evaluation, please refer to QuantitativeEvaluation.md.

Video-based Generative Performance Benchmarking and Zero-Shot Question-Answer Evaluation tables are provided for a detailed performance overview.

Zero-Shot Question-Answer Evaluation

| Model | MSVD-QA Accuracy | MSVD-QA Score | MSRVTT-QA Accuracy | MSRVTT-QA Score | TGIF-QA Accuracy | TGIF-QA Score | ActivityNet-QA Accuracy | ActivityNet-QA Score |
|---|---|---|---|---|---|---|---|---|
| FrozenBiLM | 32.2 | -- | 16.8 | -- | 41.0 | -- | 24.7 | -- |
| Video Chat | 56.3 | 2.8 | 45.0 | 2.5 | 34.4 | 2.3 | 26.5 | 2.2 |
| LLaMA Adapter | 54.9 | 3.1 | 43.8 | 2.7 | - | - | 34.2 | 2.7 |
| Video LLaMA | 51.6 | 2.5 | 29.6 | 1.8 | - | - | 12.4 | 1.1 |
| Video-ChatGPT | 64.9 | 3.3 | 49.3 | 2.8 | 51.4 | 3.0 | 35.2 | 2.7 |

Video-based Generative Performance Benchmarking

| Evaluation Aspect | Video Chat | LLaMA Adapter | Video LLaMA | Video-ChatGPT |
|---|---|---|---|---|
| Correctness of Information | 2.23 | 2.03 | 1.96 | 2.40 |
| Detail Orientation | 2.50 | 2.32 | 2.18 | 2.52 |
| Contextual Understanding | 2.53 | 2.30 | 2.16 | 2.62 |
| Temporal Understanding | 1.94 | 1.98 | 1.82 | 1.98 |
| Consistency | 2.24 | 2.15 | 1.79 | 2.37 |

Qualitative Analysis 🔍

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks 🎥

sample1


Creative and Generative Tasks 🖌️

sample5


Spatial Understanding 🌍

sample8


Video Understanding and Conversational Tasks 💬

sample10


Action Recognition ๐Ÿƒ

sample22


Question Answering Tasks ❓

sample14


Temporal Understanding โณ

sample18


Acknowledgements 🙏

  • LLaMA: a great attempt towards open and efficient LLMs!
  • Vicuna: has amazing language capabilities!
  • LLaVA: our architecture is inspired by LLaVA.
  • Thanks to our colleagues at MBZUAI for their essential contributions to the video annotation task, including Salman Khan, Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Jiale Cao, Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Muzammal Naseer, Akshay Dudhane, Jean Lahoud, Awais Rauf, Sahal Shaji, and Bokang Jia, without whom this project would not be possible.

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

    @article{Maaz2023VideoChatGPT,
        title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
        author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
        journal={arXiv:2306.05424},
        year={2023}
}

License 📜

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.


video-chatgpt's People

Contributors

ashmalvayani, eltociear, hanoonar, mmaaz60


video-chatgpt's Issues

pydantic version problem

I have the same problem as subzeroid/instagrapi#1435.

It looks like this:
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing
profile_pic_url
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing
profile_pic_url_hd
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing
is_private
Field required [type=missing, input_value={'id': '52761857721', 'pk': '52761857721'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.1/v/missing

I followed the instructions to roll back the pydantic version from 2.0.x to 1.10.9, and then the project worked properly.

Please add the following to requirements.txt:
pydantic==1.10.9
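For an existing environment, the equivalent one-off workaround (a sketch of the downgrade described above) is:

pip install "pydantic==1.10.9"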

error while deploying video-chatgpt locally

I followed the instructions for setting up model inference locally. The server is now up, but when I uploaded a test video for inference, I got this error:

2023-07-25 08:26:23 | ERROR | asyncio | Task exception was never retrieved
future: <Task finished name='7rg4m1ee7gf_12' coro=<Queue.process_events() done, defined at /home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py:343> exception=1 validation error for PredictBody
event_id
Field required [type=missing, input_value={'fn_index': 12, 'data': ...on_hash': '7rg4m1ee7gf'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.3/v/missing>
Traceback (most recent call last):
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py", line 347, in process_events
client_awake = await self.gather_event_data(event)
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py", line 220, in gather_event_data
data, client_awake = await self.get_message(event, timeout=receive_timeout)
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/gradio/queueing.py", line 456, in get_message
return PredictBody(**data), True
File "/home/conducivedev/.conda/envs/video_chatgpt/lib/python3.10/site-packages/pydantic/main.py", line 150, in init
pydantic_self.pydantic_validator.validate_python(data, self_instance=pydantic_self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for PredictBody
event_id
Field required [type=missing, input_value={'fn_index': 12, 'data': ...on_hash': '7rg4m1ee7gf'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.0.3/v/missing

Fail to run Video-ChatGPT Demo Offline

Thank you for sharing the good work!

I followed "offline_demo.md" to run offline, but website has no respones.

The terminal shows below. What does line 10 means? What error occurred?

$ python video_chatgpt/demo/video_demo.py --model-name /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/LLaVA-Lightning-7B-v1-1 --projection_path /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin
2023-09-07 14:10:24 | INFO | gradio_web_server | args: Namespace(host='0.0.0.0', port=None, controller_url='http://localhost:210001', concurrency_count=8, model_list_mode='once', share=False, moderate=False, embed=False, model_name='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/LLaVA-Lightning-7B-v1-1', vision_tower_name='openai/clip-vit-large-patch14', conv_mode='video-chatgpt_v1', projection_path='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin')
2023-09-07 14:10:24 | INFO | gradio_web_server | Namespace(host='0.0.0.0', port=None, controller_url='http://localhost:210001', concurrency_count=8, model_list_mode='once', share=False, moderate=False, embed=False, model_name='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/LLaVA-Lightning-7B-v1-1', vision_tower_name='openai/clip-vit-large-patch14', conv_mode='video-chatgpt_v1', projection_path='/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin')
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using a model of type llava to instantiate a model of type VideoChatGPT. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                        | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|████████████████                | 1/2 [00:04<00:04,  4.19s/it]
Loading checkpoint shards: 100%|████████████████████████████████| 2/2 [00:05<00:00,  2.68s/it]
Loading checkpoint shards: 100%|████████████████████████████████| 2/2 [00:05<00:00,  2.90s/it]
2023-09-07 14:10:30 | ERROR | stderr | 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32006. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc

'NOTE: Please make sure you press the 'Upload Video' button and wait for it to display 'Start Chatting' before submitting a question to Video-ChatGPT.' But the Start Chatting button is always gray.

2023-09-07 14:10:30 | ERROR | stderr | 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32006. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
2023-09-07 14:10:48 | INFO | stdout | Loading weights from /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin
2023-09-07 14:10:49 | INFO | stdout | Weights loaded from /home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt-7B.bin
2023-09-07 14:10:55 | INFO | stdout | Initialization Finished
2023-09-07 14:10:56 | ERROR | stderr | /home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/deprecation.py:43: UserWarning: You have unused kwarg parameters in Markdown, please remove them: {'style': 'color:gray'}
2023-09-07 14:10:56 | ERROR | stderr |   warnings.warn(
2023-09-07 14:10:56 | INFO | stdout | Running on local URL:  http://127.0.0.1:7860
2023-09-07 14:14:05 | INFO | gradio_web_server | load_demo.. params: {}
2023-09-07 14:14:18 | INFO | gradio_web_server | add_text. ip:. len: 26
2023-09-07 14:14:19 | ERROR | stderr | Traceback (most recent call last):
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
2023-09-07 14:14:19 | ERROR | stderr |     output = await app.get_blocks().process_api(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
2023-09-07 14:14:19 | ERROR | stderr |     result = await self.call_function(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/blocks.py", line 898, in call_function
2023-09-07 14:14:19 | ERROR | stderr |     prediction = await anyio.to_thread.run_sync(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
2023-09-07 14:14:19 | ERROR | stderr |     return await get_asynclib().run_sync_in_worker_thread(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
2023-09-07 14:14:19 | ERROR | stderr |     return await future
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
2023-09-07 14:14:19 | ERROR | stderr |     result = context.run(func, *args)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
2023-09-07 14:14:19 | ERROR | stderr |     return next(iterator)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/Documents/jjy/comment_generator/Video-ChatGPT/video_chatgpt/demo/chat.py", line 109, in answer
2023-09-07 14:14:19 | ERROR | stderr |     image_forward_outs = self.vision_tower(image_tensor, output_hidden_states=True)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 958, in forward
2023-09-07 14:14:19 | ERROR | stderr |     return self.vision_model(
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 883, in forward
2023-09-07 14:14:19 | ERROR | stderr |     hidden_states = self.embeddings(pixel_values)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 196, in forward
2023-09-07 14:14:19 | ERROR | stderr |     patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-09-07 14:14:19 | ERROR | stderr |     return forward_call(*args, **kwargs)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
2023-09-07 14:14:19 | ERROR | stderr |     return self._conv_forward(input, self.weight, self.bias)
2023-09-07 14:14:19 | ERROR | stderr |   File "/home/nkd/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
2023-09-07 14:14:19 | ERROR | stderr |     return F.conv2d(input, weight, bias, self.stride,
2023-09-07 14:14:19 | ERROR | stderr | RuntimeError: GET was unable to find an engine to execute this computation
2023-09-07 14:15:59 | INFO | stdout | Running on public URL: https://639177a685ea0e6be8.gradio.live
2023-09-07 14:15:59 | INFO | stdout | 
2023-09-07 14:15:59 | INFO | stdout | This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces

Website not working properly

Hey, I hope you are doing well. I want to use Video-ChatGPT for a POC, but the chatting feature is not working. Can you please look into this ASAP? Thank you.

CLIPVisionModel and configuration warnings

When I run inference, I get the following warnings:

Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.5.layer_norm1.bias', 'text_model.encoder.layers.10.self_attn.v_proj.weight', 'text_model.encoder.layers.2.layer_norm2.weight', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.encoder.layers.2.self_attn.v_proj.bias', 'text_model.encoder.layers.8.layer_norm1.bias', 'text_model.encoder.layers.0.layer_norm1.weight', 'text_model.encoder.layers.3.layer_norm2.bias', 'text_model.encoder.layers.7.mlp.fc2.weight', 'text_model.encoder.layers.8.layer_norm2.weight', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.0.mlp.fc2.weight', 'text_model.encoder.layers.7.self_attn.v_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.2.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.weight', 'text_model.encoder.layers.7.self_attn.q_proj.weight', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.6.mlp.fc2.weight', 'text_model.encoder.layers.5.mlp.fc2.weight', 'text_model.encoder.layers.2.mlp.fc1.weight', 'text_model.encoder.layers.0.self_attn.v_proj.bias', 'text_model.encoder.layers.3.mlp.fc2.bias', 'text_model.encoder.layers.7.self_attn.k_proj.bias', 'text_model.embeddings.position_embedding.weight', 'text_model.encoder.layers.0.layer_norm1.bias', 'text_model.encoder.layers.4.mlp.fc1.bias', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.2.mlp.fc2.bias', 'text_model.encoder.layers.1.mlp.fc1.bias', 'text_model.encoder.layers.9.self_attn.q_proj.weight', 'text_model.encoder.layers.4.self_attn.k_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.bias', 'text_model.encoder.layers.7.layer_norm1.bias', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.5.mlp.fc1.weight', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.bias', 'text_model.encoder.layers.6.self_attn.k_proj.weight', 'text_model.encoder.layers.10.mlp.fc1.bias', 'text_model.encoder.layers.10.layer_norm1.weight', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.2.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.out_proj.weight', 'text_model.embeddings.position_ids', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.4.layer_norm2.weight', 'text_model.encoder.layers.5.layer_norm2.bias', 'text_model.encoder.layers.2.self_attn.k_proj.bias', 'text_model.encoder.layers.2.layer_norm2.bias', 'text_model.encoder.layers.5.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.v_proj.bias', 'text_model.encoder.layers.8.layer_norm2.bias', 'text_model.encoder.layers.8.layer_norm1.weight', 'text_model.encoder.layers.6.layer_norm2.bias', 'text_model.encoder.layers.9.self_attn.out_proj.weight', 'text_model.encoder.layers.8.mlp.fc2.bias', 'text_model.encoder.layers.1.self_attn.k_proj.weight', 'text_model.encoder.layers.4.self_attn.k_proj.bias', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.2.self_attn.out_proj.bias', 'text_model.encoder.layers.1.self_attn.v_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.bias', 
'text_model.encoder.layers.6.layer_norm1.bias', 'text_model.encoder.layers.0.self_attn.k_proj.bias', 'text_model.encoder.layers.1.mlp.fc2.bias', 'text_model.encoder.layers.7.self_attn.out_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.weight', 'text_model.encoder.layers.4.layer_norm2.bias', 'text_model.encoder.layers.7.mlp.fc1.bias', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.4.mlp.fc2.bias', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.0.mlp.fc1.bias', 'text_model.encoder.layers.9.self_attn.k_proj.bias', 'text_model.encoder.layers.7.self_attn.q_proj.bias', 'text_model.encoder.layers.9.self_attn.out_proj.bias', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.7.self_attn.v_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.weight', 'text_model.encoder.layers.7.layer_norm2.weight', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.encoder.layers.3.mlp.fc1.weight', 'text_model.encoder.layers.3.layer_norm1.bias', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.10.layer_norm2.weight', 'text_model.encoder.layers.0.self_attn.k_proj.weight', 'text_model.embeddings.token_embedding.weight', 'text_model.encoder.layers.8.self_attn.v_proj.bias', 'text_model.encoder.layers.8.mlp.fc1.weight', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.7.layer_norm1.weight', 'text_model.encoder.layers.6.self_attn.k_proj.bias', 'text_model.encoder.layers.3.self_attn.q_proj.weight', 'text_model.encoder.layers.9.layer_norm2.bias', 'text_model.encoder.layers.9.self_attn.q_proj.bias', 'text_model.encoder.layers.10.self_attn.k_proj.weight', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.2.mlp.fc2.weight', 'text_model.encoder.layers.0.self_attn.q_proj.weight', 'text_model.encoder.layers.4.self_attn.q_proj.bias', 'text_model.encoder.layers.10.mlp.fc2.bias', 'text_model.encoder.layers.3.self_attn.out_proj.bias', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.7.self_attn.k_proj.weight', 'text_model.encoder.layers.7.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.q_proj.weight', 'text_model.encoder.layers.9.layer_norm1.bias', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.6.layer_norm1.weight', 'text_model.encoder.layers.5.self_attn.v_proj.bias', 'text_model.encoder.layers.2.self_attn.v_proj.weight', 'text_model.encoder.layers.0.self_attn.q_proj.bias', 'text_model.encoder.layers.4.layer_norm1.bias', 'text_model.encoder.layers.5.self_attn.k_proj.bias', 'text_model.encoder.layers.6.self_attn.v_proj.weight', 'text_model.final_layer_norm.bias', 'text_model.encoder.layers.4.self_attn.q_proj.weight', 'text_projection.weight', 'text_model.encoder.layers.6.self_attn.q_proj.bias', 'text_model.encoder.layers.8.self_attn.out_proj.bias', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.9.layer_norm2.weight', 'text_model.encoder.layers.6.mlp.fc2.bias', 'text_model.encoder.layers.5.self_attn.out_proj.bias', 'text_model.encoder.layers.4.self_attn.out_proj.weight', 'text_model.encoder.layers.0.layer_norm2.weight', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.3.layer_norm2.weight', 
'text_model.encoder.layers.9.mlp.fc2.bias', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.3.mlp.fc1.bias', 'text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.5.mlp.fc2.bias', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.5.mlp.fc1.bias', 'logit_scale', 'text_model.encoder.layers.9.layer_norm1.weight', 'text_model.encoder.layers.2.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.out_proj.weight', 'text_model.encoder.layers.9.mlp.fc1.weight', 'visual_projection.weight', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.6.mlp.fc1.bias', 'text_model.encoder.layers.5.layer_norm1.weight', 'text_model.encoder.layers.3.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.7.mlp.fc2.bias', 'text_model.encoder.layers.8.self_attn.v_proj.weight', 'text_model.encoder.layers.9.self_attn.v_proj.weight', 'text_model.final_layer_norm.weight', 'text_model.encoder.layers.10.layer_norm2.bias', 'text_model.encoder.layers.1.self_attn.q_proj.bias', 'text_model.encoder.layers.5.layer_norm2.weight', 'text_model.encoder.layers.2.layer_norm1.bias', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.1.layer_norm2.bias', 'text_model.encoder.layers.1.mlp.fc2.weight', 'text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.1.self_attn.k_proj.bias', 'text_model.encoder.layers.2.layer_norm1.weight', 'text_model.encoder.layers.9.self_attn.v_proj.bias', 'text_model.encoder.layers.10.layer_norm1.bias', 'text_model.encoder.layers.7.mlp.fc1.weight', 'text_model.encoder.layers.4.self_attn.out_proj.bias', 'text_model.encoder.layers.10.mlp.fc2.weight', 'text_model.encoder.layers.6.self_attn.out_proj.bias', 'text_model.encoder.layers.6.self_attn.q_proj.weight', 'text_model.encoder.layers.1.layer_norm1.weight', 'text_model.encoder.layers.5.self_attn.v_proj.weight', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.1.self_attn.q_proj.weight', 'text_model.encoder.layers.0.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.out_proj.weight', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.4.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.out_proj.bias', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 'text_model.encoder.layers.4.self_attn.v_proj.bias', 'text_model.encoder.layers.8.mlp.fc1.bias', 'text_model.encoder.layers.10.self_attn.k_proj.bias', 'text_model.encoder.layers.9.mlp.fc2.weight', 'text_model.encoder.layers.4.self_attn.v_proj.weight', 'text_model.encoder.layers.3.layer_norm1.weight', 'text_model.encoder.layers.10.self_attn.q_proj.bias', 'text_model.encoder.layers.8.self_attn.q_proj.bias']
- This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

and

UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)

Are these expected? Can I ignore them?

Longer frames issues.

In "./video_chatgpt/eval/model_utils.py", line 12

def load_video(vis_path, n_clips=1, num_frm=100):
    """
    Load video frames from a video file.

    Parameters:
    vis_path (str): Path to the video file.
    n_clips (int): Number of clips to extract from the video. Defaults to 1.
    num_frm (int): Number of frames to extract from each clip. Defaults to 100.
    """

I just changed num_frm from 100 to 200 in order to handle longer videos better, but the following errors occurred:

2023-06-21 16:40:25 | ERROR | stderr | /home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/transformers/generation/utils.py:1211: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
2023-06-21 16:40:25 | ERROR | stderr | warnings.warn(
2023-06-21 16:40:26 | ERROR | stderr | Traceback (most recent call last):
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/routes.py", line 394, in run_predict
2023-06-21 16:40:26 | ERROR | stderr | output = await app.get_blocks().process_api(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/blocks.py", line 1075, in process_api
2023-06-21 16:40:26 | ERROR | stderr | result = await self.call_function(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/blocks.py", line 898, in call_function
2023-06-21 16:40:26 | ERROR | stderr | prediction = await anyio.to_thread.run_sync(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
2023-06-21 16:40:26 | ERROR | stderr | return await get_asynclib().run_sync_in_worker_thread(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
2023-06-21 16:40:26 | ERROR | stderr | return await future
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
2023-06-21 16:40:26 | ERROR | stderr | result = context.run(func, *args)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/gradio/utils.py", line 549, in async_iteration
2023-06-21 16:40:26 | ERROR | stderr | return next(iterator)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/Video-ChatGPT/video_chatgpt/demo/chat.py", line 118, in answer
2023-06-21 16:40:26 | ERROR | stderr | output_ids = self.model.generate(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2023-06-21 16:40:26 | ERROR | stderr | return func(*args, **kwargs)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/transformers/generation/utils.py", line 1462, in generate
2023-06-21 16:40:26 | ERROR | stderr | return self.sample(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/transformers/generation/utils.py", line 2478, in sample
2023-06-21 16:40:26 | ERROR | stderr | outputs = self(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-06-21 16:40:26 | ERROR | stderr | return forward_call(*args, **kwargs)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/Video-ChatGPT/video_chatgpt/model/video_chatgpt.py", line 191, in forward
2023-06-21 16:40:26 | ERROR | stderr | outputs = self.model(
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/anaconda3/envs/fantasy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2023-06-21 16:40:26 | ERROR | stderr | return forward_call(*args, **kwargs)
2023-06-21 16:40:26 | ERROR | stderr | File "/home/wangpj/Video-ChatGPT/video_chatgpt/model/video_chatgpt.py", line 105, in forward
2023-06-21 16:40:26 | ERROR | stderr | if cur_input_ids[video_start_token_pos + num_patches + 1] != self.vision_config.vid_end_token:
2023-06-21 16:40:26 | ERROR | stderr | IndexError: index 523 is out of bounds for dimension 0 with size 429

After many attempts, we still haven't figured out the cause.
Could you help us look into this problem? Or is there a correct way to process 200 frames? Thank you!

Alternative choices for linear layer

Thank you for the excellent work! In your paper you mention that you experimented with more complex networks in addition to the linear layer. Will you publish the details and evaluation results of these other attempts?

Thanks in advance!

Video uploading is not working and I want to design website for you.

I was trying to test Video-ChatGPT, but it showed an error after I clicked the Upload Video button.


I would also like to contribute to this project by building a more appealing website, making it more attractive and pleasing to others, and adding documentation to the website that helps users understand how to use it ⚡.

License

Hi,

In your README, you have "Non-commercial bespoke license. Please refer to license terms here." But "here" links to GPL 3.0 which specifically allows commercial use.

Can you clarify? Thanks.

Can't download weights using transformers api

Hi,
When using the code below to download the weights, I'm getting a key error regarding LLaVA.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mmaaz60/LLaVA-7B-Lightening-v1-1")

ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=1']

Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=0']
Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=2']
Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--local-rank=1']

Code & Pretrained Models

Please note that this is an ongoing work where we are working on improving our architecture design and finetuning on the video instruction data. We will release our codes and pretrained models very soon (Before Jun 17, 2023). Stay tuned!

[benchmark] Some questions about the details to generate files in step 1 during `Video-based Generative Performance Benchmarking`.

Hello, following the instructions for step 1 in quantitative_evaluation, I obtained three files:

  1. one file generated with generic_qa.json and run_inference_benchmark_general.py
  2. one file generated with consistency_qa.json and run_inference_benchmark_consistency.py
  3. one file generated with temporal_qa.json and run_inference_benchmark_general.py

Then, do I need to generate any other files? And how do they function in step 2? More specifically, if I want to evaluate correctness and detail orientation, which file generated in step 1 should I pass as pred_path in the step 2 command?

Great work!

Hi, congrats on the great work and very impressive performance!

I have a small question about the spatio-temporal features extracted using CLIP. In the OneDrive download path you provided (https://mbzuaiac-my.sharepoint.com/:f:/g/personal/hanoona_bangalath_mbzuai_ac_ae/EnLRDehrr8lGqHpC5w1zZ9QBnsiVffYy5vCv8Hl14deRcg?e=Ul5DUE), there seems to be no "v_CL6TbOgnLzA.pkl" file, which is listed in https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/docs/train_video_ids.txt and will cause a bug when running the training script. Could you help?

I would appreciate it very much if you could reply.
Thanks in advance.

from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM

!python scripts/apply_delta.py --base-model-path /content/drive/MyDrive/Video-ChatGPT/video_chatgpt-7B.bin --target-model-path LLaVA-Lightning-7B-v1-1 --delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1
Traceback (most recent call last):
File "/content/drive/MyDrive/Video-ChatGPT/scripts/apply_delta.py", line 10, in
from video_chatgpt.model.video_chatgpt import VideoChatGPTLlamaForCausalLM
ModuleNotFoundError: No module named 'video_chatgpt'
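A likely cause, judging from the installation section above (a suggestion, not a confirmed fix): the script was run without the repository root on PYTHONPATH, so the video_chatgpt package cannot be found. From the cloned Video-ChatGPT directory:

export PYTHONPATH="./:$PYTHONPATH"
python scripts/apply_delta.py --base-model-path <base-model> --target-model-path <output-dir> --delta-path liuhaotian/LLaVA-Lightning-7B-delta-v1-1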

Llama 2 7B

Hi, great project!

I am trying to run it with Llama 2. I have followed the steps, but I am getting hallucinations.

Can it be done with Llama 2 7B?

About Video Instruction Data Generation

@mmaaz60 @hanoonaR
Thank you for sharing your great work.
I have a question about video instruction data generation.
As mentioned in your paper, you built the video instruction dataset using both human-assisted and semi-automatic annotation methods.
What is the ratio of each method in the entire dataset?
I suspect you created more than 70 percent of the dataset with the semi-automatic annotation method, because the human-assisted method is costly.
Thank you in advance.

can I train in one A100 80G GPU?

Hello, thanks for the great work.

Can I train the model using only one A100 80G GPU? Or how can we modify the code so that it can be trained on one GPU? Thank you so much.

Zero-Shot Question-Answer Evaluation of Accuracy

Hello, sorry to bother you again. Your work is very interesting, and we may want to build on it for further research.

When we ran the zero-shot QA test on the MSVD-QA dataset, we found that for every question the ground-truth answer is a single word, for example: {"answer":"someone","question":"who opened the box that held an automatic weapon in a gun?","video_id":1451}. But when we tested with the pretrained Video-ChatGPT model, we found that it often outputs a complete sentence, such as: 'The box that held the automatic weapon was opened by a person who is not visible in the video.'. Given that Video-ChatGPT reports 64.9 accuracy on MSVD-QA, how is this evaluated? I saw that previous works testing on this dataset required prediction == answer to count as correct, which seems too harsh for large models unless they are trained on the corresponding dataset.

The same problem seems to occur on other datasets (ActivityNet-QA), because their answers are also only a few words long.

Question about result

@hanoonaR @mmaaz60
Hello, thank you for sharing this excellent work.
I have been checking the output of the model.
Even when I input the same video, the results differ, for example:
('The person in the video is using a cellphone' vs. 'The person in the video is holding a cellphone')
Can you tell me how to control the result? I would like to get the same result every time.
Thank you in advance.

training process: mat1 and mat2 must have the same dtype

Hi, sorry to bother you again!
I followed the instructions in train_video_chatgpt.md and used the command to start training.
My devices are 2 RTX 8000 GPUs, which are not Ampere GPUs (a little out of date). These devices don't support bf16 and tf32, so I set both parameters to False.
Then a runtime error occurred:

mat1 and mat2 must have the same dtype.

I'm sure we used exactly the same environment as yours, so is this problem caused by the GPU difference?
If so, is there any way to solve it?
We would really appreciate your help. Thank you!

Few-shot video classification

Thanks a lot for this exciting work!

I have a general question about whether the proposed approach would work well for some form of few-shot video understanding / classification. From the technical side, it should be possible to provide multiple videos with textual descriptions as part of the prompt. I am wondering if the currently trained model would handle the ambiguity of this new few-shot setting. Have you tried anything in the few-shot direction, or do you have any intuition about whether this might work or would require further training?

Setup Error during training

warnings.warn(
Traceback (most recent call last):
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train_mem.py", line 11, in
train()
File "/apdcephfs_cq3/share_1311970/Video-ChatGPT/video_chatgpt/train/train.py", line 482, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 113, in init
File "/apdcephfs_cq3/share_1311970/lb/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/transformers/training_args.py", line 1190, in post_init
raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

Cannot load './LLaVA-Lightening-7B-v1-1'

Hi, I followed the instructions in offline_demo.md and saved the Video-ChatGPT weights and the LLaVA-Lightening-7B-v1-1 weights in the current directory. Then I used this command for the offline demo: python video_chatgpt/demo/video_demo.py --model-name ./LLaVA-Lightening-7B-v1-1 --projection_path ./video_chatgpt-7B.bin
but got an error: huggingface_hub.utils.validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: './LLaVA-Lightening-7B-v1-1'.
Is the way I filled in --model-name wrong?

Thanks!

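A hedged observation rather than a confirmed fix: in the earlier offline-demo report above, the demo loads successfully when --model-name points to an absolute path instead of a ./-prefixed relative one, e.g.:

python video_chatgpt/demo/video_demo.py \
    --model-name /absolute/path/to/LLaVA-Lightening-7B-v1-1 \
    --projection_path /absolute/path/to/video_chatgpt-7B.bin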

Question about running video-chatgpt demo offline

@mmaaz60
Thank you in advance for providing this great work.
I have a question about running the Video-ChatGPT demo offline.
When I uploaded a video with the 'Upload Video' button, nothing happened even after 10 minutes.
Can you confirm this and tell me how to use it correctly?

During the demo, sometimes an exception occurs

2023-06-27 05:01:51 | ERROR | stderr | File "/home/lizheng/software/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
2023-06-27 05:01:51 | ERROR | stderr | result = context.run(func, *args)
2023-06-27 05:01:51 | ERROR | stderr | File "/home/lizheng/software/miniconda3/envs/video_chatgpt/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
2023-06-27 05:01:51 | ERROR | stderr | return next(iterator)
2023-06-27 05:01:51 | ERROR | stderr | File "/home/lizheng/project/Video-ChatGPT/video_chatgpt/demo/chat.py", line 105, in answer
2023-06-27 05:01:51 | ERROR | stderr | image_tensor = img_list[0]
2023-06-27 05:01:51 | ERROR | stderr | IndexError: list index out of range

linear layer

Hi, can you confirm the dimensions of the linear layer that you learned?

Kind Regards

Website Down

Hello everyone, I hope you are doing well. The demo website is down; can you please look into it ASAP? Thank you.

Missing config.json for Video-ChatGPT-7B when deployed on AWS Sagemaker

I have deployed the Video-ChatGPT-7B model on AWS Sagemaker using the script given on the HuggingFace website.

However, there are two issues:

  1. I have the predictor now but what exactly should I send in the payload? How do I include the video alongside my prompt in the request to the endpoint created by Sagemaker? Can you please share some example payload?
  2. I tried sending a payload without a video just to test whether the endpoint works, and it gave me the following error:
    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{ "code": 400, "type": "InternalServerException", "message": "/.sagemaker/mms/models/MBZUAI__Video-ChatGPT-7B does not appear to have a file named config.json. Checkout \u0027https://huggingface.co//.sagemaker/mms/models/MBZUAI__Video-ChatGPT-7B/None\u0027 for available files." }
    Where can I get this config.json file?
