txh-mercury / vast
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Home Page: https://arxiv.org/abs/2305.18500
License: MIT License
03/26/2024 19:23:18 - INFO - main - load_from_pretrained: ./output/vast/pretrain_vast/ckpt/model_step_204994.pt
03/26/2024 19:23:18 - INFO - main - Load from pretrained dir ./output/vast/pretrain_vast
03/26/2024 19:23:19 - INFO - main - Unexpected keys ['vision_encoder.text.logit_scale']
03/26/2024 19:23:19 - INFO - main - missing_keys ['vision_encoder.logit_scale']
03/26/2024 19:23:20 - INFO - main - ==================learning_rate_settings==================
03/26/2024 19:23:20 - INFO - main - basic_lr : 1e-05
03/26/2024 19:23:20 - INFO - main - clip_lr_visual : 5e-07
03/26/2024 19:23:20 - INFO - main - clip_lr_visual_len : 245
03/26/2024 19:23:20 - INFO - main - new_lr : 0
03/26/2024 19:23:20 - INFO - main - new_params_name: []
0%| | 0/5670 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/mnt/workspace/Project/VideoLargeModel/VAST/./run.py", line 63, in <module>
main()
File "/mnt/workspace/Project/VideoLargeModel/VAST/./run.py", line 46, in main
train(model, optimizer, train_loader, val_loaders, args.run_cfg, start_step = start_step, verbose_time=False)
File "/mnt/workspace/Project/VideoLargeModel/VAST/utils/pipeline.py", line 35, in train
for step, (name, batch) in enumerate(train_loader):
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/loader.py", line 101, in __iter__
self.preload(loader_it)
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/loader.py", line 112, in preload
self.batch = next(it)
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/loader.py", line 48, in __iter__
batch = next(iter_)
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/IndexAnno.py", line 69, in __getitem__
raw_captions = anno['desc'] if 'desc' in anno else anno['caption']
KeyError: 'caption'
I am trying the VQA task on the MSVD-QA dataset.
I use the following command and get the error above:
python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 4 \
--master_port 9834 \
./run.py \
--learning_rate 1e-5 \
--checkpointing true \
--first_eval false \
--config ./config/vast/finetune_cfg/VQA-msvd.json \
--pretrain_dir $output_dir \
--save_best true \
--output_dir $output_dir/downstream/VQA-msvd
I notice that AnnoIndexedDataset(Dataset) requires 'desc' or 'caption' in anno, but msvd/descs_cap_train.json does not contain these fields. How can I fix this error? Thank you.
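One possible workaround (an assumption on my part, not the authors' intended fix) is to make the caption lookup tolerant of QA-style annotations that only carry question/answer fields:

```python
# Hypothetical patch for the failing line in data/IndexAnno.py:
# fall back to an empty caption instead of raising KeyError when the
# annotation has neither 'desc' nor 'caption' (as in the MSVD-QA files).
def get_raw_captions(anno):
    # dict.get chains the fallbacks: 'desc' first, then 'caption',
    # then an empty string for QA-only annotations.
    return anno.get('desc', anno.get('caption', ''))
```

Whether an empty caption is acceptable for the VQA pipeline (which should use the question/answer fields instead) is an assumption that would need confirming against the training code.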
The original msrvtt folder structure is the below.
msrvtt
├── annotation
│ ├── MSR_VTT.json
├── high-quality
│ ├── structured-symlinks
│ │ ├── jsfusion_val_caption_idx.pkl
│ │ ├── ... many other files....
├── structured-symlinks
│ ├── jsfusion_val_caption_idx.pkl
│ ├── ... many other files....
├── videos
│ ├── all
│ │ ├── video1.mp4
│ │ ├── ....
│ │ ├── video9999.mp4
│ ├── tmp
│ │ ├──MSRVTT.zip
│ ├── vids
│ │ ├──data
│ │ │ ├── MSRVTT.zip
However, there are no audio files for msrvtt.
How did you get the audio?
Is there a specific way to extract it, for example a particular bitrate, sample rate, number of audio channels, or codec?
Is any kind of audio file valid?
"datasets/src/data/msrvtt/videos" == "msrvtt/videos/all" ?
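For reference, audio can be extracted from the video files with ffmpeg. The settings below (16 kHz mono 16-bit PCM WAV) are an assumption, a common choice for audio encoders, not necessarily what the authors used:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video_path, audio_path, sample_rate=16000, channels=1):
    """Build an ffmpeg command extracting a WAV track from a video.
    16 kHz mono PCM is an assumed default, not confirmed from the repo."""
    return [
        "ffmpeg", "-y", "-i", str(video_path),
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", str(sample_rate), # resample
        "-ac", str(channels),    # downmix to mono
        str(audio_path),
    ]

def extract_audio(video_path, audio_dir):
    """Run ffmpeg to write <video stem>.wav into audio_dir."""
    audio_path = Path(audio_dir) / (Path(video_path).stem + ".wav")
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path), check=True)
    return audio_path
```

Videos without an audio stream would make ffmpeg fail here, so those cases would need to be skipped or given a silent placeholder, depending on what the data loader expects.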
Hi. I am trying to finetune on MSRVTT-QA.
However, I hit an error. I can modify the code to get rid of it, but I am not sure I understand it correctly.
line 68 of "/data/IndexAnno.py"
raw_captions = anno['desc'] if 'desc' in anno else anno['caption']
It raises an error for MSRVTT-QA, simply because 'descs_qa_trainval.json' contains neither 'desc' nor 'caption':
{"video_id": "video0", "question": "who drives down the road in an audi?", "answer": "man", "subtitle": ""},
{"video_id": "video0", "question": "what is a man doing?", "answer": "show", "subtitle": ""},
{"video_id": "video0", "question": "what is a man silently narrates his experience doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is a person doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is a person doing?", "answer": "tell", "subtitle": ""},
{"video_id": "video0", "question": "what is guy doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is man doing?", "answer": "talk", "subtitle": ""},
{"video_id": "video0", "question": "what is the man doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is a man doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is shown?", "answer": "car", "subtitle": ""},
{"video_id": "video0", "question": "what is dancing?", "answer": "group", "subtitle": ""},
{"video_id": "video0", "question": "who is driving?", "answer": "man", "subtitle": ""},
{"video_id": "video0", "question": "what is a man driving?", "answer": "car", "subtitle": ""}
Can I substitute 'subtitle' for 'desc'/'caption' in line 68 of "/data/IndexAnno.py"?
I am not sure, since many 'subtitle' fields are empty.
Thank you for your great contributions!
As described above, I notice that only the trained video and audio captioners are provided in this repo.
Would the authors release the implementation of the LLM part and the overall scripts for caption generation?
Any reply will be sincerely appreciated.
Best regards,
Why do you use two local_rank arguments for multi-GPU usage in args.py?
Hello
When are you planning to release your code?
Hello.
Thanks for awesome work and sharing the code.
Can you please share the inference/demo code?
Thanks
Hello! We have been waiting for your code for a long time. When are you planning to release it?
Hi, when the validation set is large, GPU memory usage is much higher than needed during training, even with a small batch size. Can you please suggest where the issue is?
Thanks
Hi, I got the following error while trying to run the code on a set of videos:
Traceback (most recent call last):
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/./run.py", line 65, in <module>
main()
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/./run.py", line 58, in main
test(model, val_loaders, args.run_cfg)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/utils/pipeline.py", line 156, in test
eval_log = evaluate_fn(model, test_loader, run_cfg, global_step=0)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/evaluation/evaluation_mm.py", line 25, in evaluate_mm
val_log = evaluate_single(model, loader, task.split('--')[0], run_cfg, global_step,task.split('--')[1])
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/evaluation/evaluation_mm.py", line 46, in evaluate_single
cap_dict = evaluate_cap(model, task, val_loader, run_cfg, global_step, dset_name)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/evaluation/evaluation_mm.py", line 130, in evaluate_cap
for batch in eval_loader:
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/data/loader.py", line 103, in __iter__
self.preload(loader_it)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/data/loader.py", line 116, in preload
self.batch = next(it)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/data/IndexAnno.py", line 68, in __getitem__
raw_captions = anno['desc'] if 'desc' in anno else anno['caption']
KeyError: 'caption'
On closer inspection of the code, I found that the batch is not being created by the data loader when running in single-GPU mode. Basically, the prefetching loader gets one batch and does not go any further. Can anyone please help me solve this issue?
I now need to validate the performance on the MSRVTT dataset. How can this be implemented? Could you provide a corresponding tutorial?
In the code I can find several licenses (Apache, BSD-3, MIT, ...)
Where/What is the license of this repository?
Cheers
What are the minimum requirements for GPU memory? Thanks!
How can I download the videos based on the video_id in the dataset?
How can this problem be solved? Thanks!
Could you please upload the Activitynet-QA annotations you used to finetune the model?
Thanks
What's the difference when it's on or off?
It's a nice work. when will the code be released?
I get the following error when trying to finetune on TGIF:
[23:11:27] /github/workspace/src/video/video_reader.cc:270: [/scratch-shared/scur1914/gifs/tumblr_nqjzxszVxD1uz6id5o1_500.gif] Failed to measure duration/frame-count due to broken metadata.
Should I transform the gifs to frames?
The config file for TGIF has the vision format set to video_rawvideo. I added the following to vision_mapper.py at line 138:
if not os.path.exists(video_path): video_path = video_path.replace('.mkv', '.gif')
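One way to handle broken GIF metadata (an assumption about a possible workflow, not the authors' method) is to pre-filter files that the video reader cannot open, then drop them from the annotations or re-encode them to mp4 before training:

```python
def find_broken_videos(paths, probe):
    """Return the subset of 'paths' that 'probe' fails to open.
    'probe' is any callable that raises on a corrupt file, e.g.
    decord.VideoReader for videos or PIL.Image.open for gifs
    (both hypothetical choices depending on your environment)."""
    broken = []
    for p in paths:
        try:
            probe(p)
        except Exception:
            broken.append(p)
    return broken
```

Re-encoding the flagged gifs with ffmpeg usually rewrites the metadata and may be less invasive than extracting frames.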
Hi, thanks for the great work. In pretrain_vast.json, the "run_cfg" and "model_cfg" settings point to "./config/default_run_cfg.json" and "./config/newvlp/default_model_cfg.json" respectively. However, I did not find these two files in the ./config folder. Are they the same as "./config/vast/default_run_cfg.json" and "./config/vast/default_model_cfg.json", respectively?