txh-mercury / vast
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Home Page: https://arxiv.org/abs/2305.18500
License: MIT License
03/26/2024 19:23:18 - INFO - main - load_from_pretrained: ./output/vast/pretrain_vast/ckpt/model_step_204994.pt
03/26/2024 19:23:18 - INFO - main - Load from pretrained dir ./output/vast/pretrain_vast
03/26/2024 19:23:19 - INFO - main - Unexpected keys ['vision_encoder.text.logit_scale']
03/26/2024 19:23:19 - INFO - main - missing_keys ['vision_encoder.logit_scale']
03/26/2024 19:23:20 - INFO - main - ==================learning_rate_settings==================
03/26/2024 19:23:20 - INFO - main - basic_lr : 1e-05
03/26/2024 19:23:20 - INFO - main - clip_lr_visual : 5e-07
03/26/2024 19:23:20 - INFO - main - clip_lr_visual_len : 245
03/26/2024 19:23:20 - INFO - main - new_lr : 0
03/26/2024 19:23:20 - INFO - main - new_params_name: []
0%| | 0/5670 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/mnt/workspace/Project/VideoLargeModel/VAST/./run.py", line 63, in <module>
main()
File "/mnt/workspace/Project/VideoLargeModel/VAST/./run.py", line 46, in main
train(model, optimizer, train_loader, val_loaders, args.run_cfg, start_step = start_step, verbose_time=False)
File "/mnt/workspace/Project/VideoLargeModel/VAST/utils/pipeline.py", line 35, in train
for step, (name, batch) in enumerate(train_loader):
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/loader.py", line 101, in __iter__
self.preload(loader_it)
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/loader.py", line 112, in preload
self.batch = next(it)
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/loader.py", line 48, in __iter__
batch = next(iter_)
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/pai/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/mnt/workspace/Project/VideoLargeModel/VAST/data/IndexAnno.py", line 69, in __getitem__
raw_captions = anno['desc'] if 'desc' in anno else anno['caption']
KeyError: 'caption'
I am trying the VQA task on the MSVD-QA dataset.
I use the following command and get the error above:
python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 4 \
--master_port 9834 \
./run.py \
--learning_rate 1e-5 \
--checkpointing true \
--first_eval false \
--config ./config/vast/finetune_cfg/VQA-msvd.json \
--pretrain_dir $output_dir \
--save_best true \
--output_dir $output_dir/downstream/VQA-msvd
I notice that AnnoIndexedDataset(Dataset) requires 'desc' or 'caption' in anno, but msvd/descs_cap_train.json does not contain these fields. How can I fix this error? Thank you.
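One possible workaround (an assumption on my part, not the authors' intended fix) is to make the caption lookup tolerant of QA-style annotations that only carry question/answer fields:

```python
# Hypothetical patch for the failing line in data/IndexAnno.py:
# fall back to an empty caption instead of raising KeyError when the
# annotation has neither 'desc' nor 'caption' (as in the MSVD-QA files).
def get_raw_captions(anno):
    # dict.get chains the fallbacks: 'desc' first, then 'caption',
    # then an empty string for QA-only annotations.
    return anno.get('desc', anno.get('caption', ''))
```

Whether an empty caption is acceptable for the VQA pipeline (which should use the question/answer fields instead) is an assumption that would need confirming against the training code.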
The original msrvtt folder structure is the below.
msrvtt
├── annotation
│ ├── MSR_VTT.json
├── high-quality
│ ├── structured-symlinks
│ │ ├── jsfusion_val_caption_idx.pkl
│ │ ├── ... many other files....
├── structured-symlinks
│ ├── jsfusion_val_caption_idx.pkl
│ ├── ... many other files....
├── videos
│ ├── all
│ │ ├── video1.mp4
│ │ ├── ....
│ │ ├── video9999.mp4
│ ├── tmp
│ │ ├──MSRVTT.zip
│ ├── vids
│ │ ├──data
│ │ │ ├── MSRVTT.zip
However, there are no audio files for msrvtt.
How did you get the audio?
Is there a specific way to extract it, for example a particular bitrate, sample rate, number of audio channels, or codec?
Is any kind of audio file valid?
"datasets/src/data/msrvtt/videos" == "msrvtt/videos/all" ?
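For reference, audio can be extracted from the video files with ffmpeg. The settings below (16 kHz mono 16-bit PCM WAV) are an assumption, a common choice for audio encoders, not necessarily what the authors used:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video_path, audio_path, sample_rate=16000, channels=1):
    """Build an ffmpeg command extracting a WAV track from a video.
    16 kHz mono PCM is an assumed default, not confirmed from the repo."""
    return [
        "ffmpeg", "-y", "-i", str(video_path),
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", str(sample_rate), # resample
        "-ac", str(channels),    # downmix to mono
        str(audio_path),
    ]

def extract_audio(video_path, audio_dir):
    """Run ffmpeg to write <video stem>.wav into audio_dir."""
    audio_path = Path(audio_dir) / (Path(video_path).stem + ".wav")
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path), check=True)
    return audio_path
```

Videos without an audio stream would make ffmpeg fail here, so those cases would need to be skipped or given a silent placeholder, depending on what the data loader expects.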
Hi. I am trying to finetune on MSRVTT-QA.
However, I hit an error. I can modify the code to get rid of it, but I am not sure I understand it correctly.
line 68 of "/data/IndexAnno.py"
raw_captions = anno['desc'] if 'desc' in anno else anno['caption']
It raises an error for MSRVTT-QA, simply because 'descs_qa_trainval.json' contains neither 'desc' nor 'caption':
{"video_id": "video0", "question": "who drives down the road in an audi?", "answer": "man", "subtitle": ""},
{"video_id": "video0", "question": "what is a man doing?", "answer": "show", "subtitle": ""},
{"video_id": "video0", "question": "what is a man silently narrates his experience doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is a person doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is a person doing?", "answer": "tell", "subtitle": ""},
{"video_id": "video0", "question": "what is guy doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is man doing?", "answer": "talk", "subtitle": ""},
{"video_id": "video0", "question": "what is the man doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is a man doing?", "answer": "drive", "subtitle": ""},
{"video_id": "video0", "question": "what is shown?", "answer": "car", "subtitle": ""},
{"video_id": "video0", "question": "what is dancing?", "answer": "group", "subtitle": ""},
{"video_id": "video0", "question": "who is driving?", "answer": "man", "subtitle": ""},
{"video_id": "video0", "question": "what is a man driving?", "answer": "car", "subtitle": ""}
Can I substitute 'subtitle' for 'desc'/'caption' in line 68 of "/data/IndexAnno.py"?
I am not sure, since many 'subtitle' fields are empty.
Thank you for your great contributions!
As described above, I notice that only the trained video and audio captioners are provided in this repo.
Would the authors release the implementation of the LLM part and the overall scripts for caption generation?
Any reply will be sincerely appreciated.
Best regards,
Why do you use two local_rank arguments for multi-GPU usage in args.py?
Hello
When are you planning to release your code?
Hello.
Thanks for awesome work and sharing the code.
Can you please share the inference/demo code?
Thanks
Hello! We have been waiting for your code for a long time. When are you planning to release it?
Hi, when the validation set is large, GPU memory usage is much higher than needed during training, even with a small batch size. Can you please suggest where the issue is?
Thanks
Hi, I got the following error while trying to run the code on a set of videos:
Traceback (most recent call last):
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/./run.py", line 65, in <module>
main()
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/./run.py", line 58, in main
test(model, val_loaders, args.run_cfg)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/utils/pipeline.py", line 156, in test
eval_log = evaluate_fn(model, test_loader, run_cfg, global_step=0)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/evaluation/evaluation_mm.py", line 25, in evaluate_mm
val_log = evaluate_single(model, loader, task.split('--')[0], run_cfg, global_step,task.split('--')[1])
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/evaluation/evaluation_mm.py", line 46, in evaluate_single
cap_dict = evaluate_cap(model, task, val_loader, run_cfg, global_step, dset_name)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/evaluation/evaluation_mm.py", line 130, in evaluate_cap
for batch in eval_loader:
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/data/loader.py", line 103, in __iter__
self.preload(loader_it)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/data/loader.py", line 116, in preload
self.batch = next(it)
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/MIST/envs/vast/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/media/administrator/1b402ff7-f596-4523-a6fd-4ccdd4432680/ego4d/VAST/data/IndexAnno.py", line 68, in __getitem__
raw_captions = anno['desc'] if 'desc' in anno else anno['caption']
KeyError: 'caption'
On closer inspection of the code, I found that the batch is not being created by the data loader when running in single-GPU mode. Basically, the prefetching loader gets one batch and does not go any further. Can anyone please help me solve this issue?
I now need to validate the performance on the MSRVTT dataset. How can this be implemented? Could you provide a corresponding tutorial?
In the code I can find several licenses (Apache, BSD-3, MIT, ...)
Where/What is the license of this repository?
Cheers
What are the minimum requirements for GPU memory? Thanks!
How can I download the videos based on the video_id in the dataset?
How can this problem be solved? Thanks!
Could you please upload the Activitynet-QA annotations you used to finetune the model?
Thanks
What's the difference when it's on or off?
It's a nice work. when will the code be released?
I get the following error when trying to finetune on TGIF:
[23:11:27] /github/workspace/src/video/video_reader.cc:270: [/scratch-shared/scur1914/gifs/tumblr_nqjzxszVxD1uz6id5o1_500.gif] Failed to measure duration/frame-count due to broken metadata.
Should I transform the gifs to frames?
The config file for TGIF has the vision format set to video_rawvideo. I added the following to vision_mapper.py at line 138:
if not os.path.exists(video_path): video_path = video_path.replace('.mkv', '.gif')
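One way to handle broken GIF metadata (an assumption about a possible workflow, not the authors' method) is to pre-filter files that the video reader cannot open, then drop them from the annotations or re-encode them to mp4 before training:

```python
def find_broken_videos(paths, probe):
    """Return the subset of 'paths' that 'probe' fails to open.
    'probe' is any callable that raises on a corrupt file, e.g.
    decord.VideoReader for videos or PIL.Image.open for gifs
    (both hypothetical choices depending on your environment)."""
    broken = []
    for p in paths:
        try:
            probe(p)
        except Exception:
            broken.append(p)
    return broken
```

Re-encoding the flagged gifs with ffmpeg usually rewrites the metadata and may be less invasive than extracting frames.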
Hi, thanks for the great work. In pretrain_vast.json, the "run_cfg" and "model_cfg" settings point to "./config/default_run_cfg.json" and "./config/newvlp/default_model_cfg.json" respectively. However, I did not find these two files in the ./config folder. Are they the same as "./config/vast/default_run_cfg.json" and "./config/vast/default_model_cfg.json", respectively?