pku-yuangroup / languagebind
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Home Page: https://arxiv.org/abs/2310.01852
License: MIT License
Hey, I have some image and video data that I want to align with text. My use case is just binary classification, so my texts are only two sentences: 'The data is live' and 'The data is non live'. Basically, I want to improve my model's performance by using a multi-modality model. How do I do this? Are there any resources?
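One possible way to approach this kind of two-prompt classification is CLIP-style zero-shot scoring: embed each image or video, embed the two sentences, and pick the sentence with the higher cosine similarity. The sketch below is plain Python over toy vectors, not LanguageBind's own code. The function names, the toy embeddings, and the `logit_scale=100.0` value (a typical CLIP temperature) are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(media_embed, prompt_embeds, labels, logit_scale=100.0):
    """Return (best_label, probabilities) for one image or video embedding."""
    media = l2_normalize(media_embed)
    sims = [sum(a * b for a, b in zip(media, l2_normalize(p))) for p in prompt_embeds]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy 3-dimensional vectors standing in for real model outputs (hypothetical values).
labels = ["The data is live", "The data is non live"]
prompt_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
media_embed = [0.9, 0.1, 0.0]

label, probs = zero_shot_classify(media_embed, prompt_embeds, labels)
print(label, probs)
```

With a real model, `media_embed` and `prompt_embeds` would come from the image or video encoder and the text encoder. If zero-shot accuracy is not enough, the same embeddings could also be fed to a small trained classifier.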
Nice work! I noticed that you have released a ViT-H model for the video modality. Do you have any plans to release ViT-H models for additional modalities?
If so, that would be great.
import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor
pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT' # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)
model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)
print(out.text_embeds @ out.image_embeds.T)
In this code, what is the maximum length of your text? If it exceeds 77, will it be truncated directly?
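For context, another snippet in this thread calls the tokenizer with `max_length=77, padding='max_length', truncation=True`. With Hugging Face tokenizers, those arguments cut longer sequences down to `max_length` and pad shorter ones up to it. Whether `LanguageBindVideoProcessor` applies the same settings internally is not confirmed here. The sketch below only illustrates the general behaviour on a plain list of token ids. `pad_or_truncate` and `pad_id` are hypothetical names, and special tokens such as end-of-text are ignored.

```python
def pad_or_truncate(token_ids, max_length=77, pad_id=0):
    """Cut a token sequence to max_length, or right-pad it up to max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

long_ids = list(range(1, 101))   # 100 hypothetical token ids
short_ids = [5, 6, 7]            # 3 hypothetical token ids

print(len(pad_or_truncate(long_ids)))   # 77
print(len(pad_or_truncate(short_ids)))  # 77
```

In this sketch, everything after position 77 is simply dropped, so information at the end of a long caption would not reach the text encoder.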
Hi,
Great work, and thanks for open sourcing. I was trying your model on 150 video clips and audio clips, each 5 seconds long. Below is the code I am using. Here, the arrays video_clips and audio_files are each of size 150. During embedding generation, the GPU consumes more than 8 GB of memory and the embedding generation stops. I tried the exact same sample with ImageBind, and that seems to work fine during inference and embedding generation. Any idea if I am doing something wrong?
device = 'cuda:0'
device = torch.device(device)
clip_type = ('video', 'audio')
model = LanguageBind(clip_type=clip_type)
model = model.to(device)
model.eval()
pretrained_ckpt = f'lb203/LanguageBind_Video'
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type}
inputs = {
    'video': to_device(modality_transform['video'](video_clips), device),
    'audio': to_device(modality_transform['audio'](audio_files), device),
}
inputs['language'] = to_device(tokenizer(transcriptions_list, max_length=77, padding='max_length',
                                         truncation=True, return_tensors='pt'), device)
with torch.no_grad():
    embeddings = model(inputs)
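A common way to lower peak GPU memory in a situation like this is to push the clips through the model in small batches instead of all 150 at once. Whether that resolves this particular report is not confirmed. The sketch below shows only the batching logic in plain Python. `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical helpers, and `fake_embed` stands in for the real transform plus forward pass.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(items, embed_fn, batch_size=16):
    """Run embed_fn on small batches and collect the per-item results in order."""
    results = []
    for batch in iter_batches(items, batch_size):
        results.extend(embed_fn(batch))
    return results

def fake_embed(batch):
    """Stand-in for a real forward pass: one fake 1-d 'embedding' per clip path."""
    return [[float(len(path))] for path in batch]

clips = [f"clip_{i}.mp4" for i in range(150)]
embeddings = embed_in_batches(clips, fake_embed, batch_size=16)
num_batches = len(list(iter_batches(clips, 16)))
print(len(embeddings), num_batches)
```

With a real model, `embed_fn` would apply the modality transform to the batch, move it to the device, run the model under `torch.no_grad()`, and move the outputs back to the CPU before the next batch is processed.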
Hi, I notice that in your paper, the results for full-tuning are reported. I'd like to know the training configurations for full tuning -- do you use the text prompt and input modality data with contrastive learning during full tuning, or use class labels with traditional classification setting (e.g., cross-entropy loss)? Thank you.
Hi, the data referenced in the README's train_and_validation section has not been released, so it is hard to prepare my own data in the same format you used.
I want to reproduce your training code. Can you provide some sample data?
I have been following and utilizing your codebase for an extended period in my research. I believe your paper deserves far more attention than Imagebind.
Nice work! An error occurred while trying to load the model using the Hugging Face API:
`from transformers import AutoProcessor, AutoModel, AutoTokenizer
processor = AutoProcessor.from_pretrained("LanguageBind/LanguageBind_Video")
model = AutoModel.from_pretrained("LanguageBind/LanguageBind_Video")
tokenizer = AutoTokenizer.from_pretrained("LanguageBind/LanguageBind_Video")`
KeyError: 'LanguageBindVideo'
Could you give an example of using Hugging Face transformers to extract features from an input video?
Great work!
Excuse me, I would like to inquire about the unimodal fine-tuning process as outlined in your documentation (https://github.com/PKU-YuanGroup/LanguageBind/blob/main/TRAIN_AND_VALIDATE.md#training-languagebind:~:text=Depth%2DLanguage%20with%208%20GPUs%20(1%20nodes%20x%208%20GPUs)).
If I choose to lock the image, does it mean that the LoRA depth pre-trained model you developed is frozen, and I am training a new LoRA model of my own? In this context, what role do the following models play: MODEL_DICT = {"ViT-L-14": "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"} and CHECKPOINT_DICT = {"ViT-L-14": "models--laion--CLIP-ViT-L-14-DataComp.XL-s13B-b90K/snapshots/84c9828e63dc9a9351d1fe637c346d4c1c4db341/pytorch_model.bin"}?
Hi authors,
Thanks for releasing the code.
I noticed that you mentioned "Note that our image encoder is the same as OpenCLIP. Not as fine-tuned as other modalities."
I would like to know exactly which version of the CLIP weights you are using.
Thanks!
First of all congrats on the paper and thanks for providing the code!
In the paper at 'Zero-shot language-based multi-modal joint retrieval' you mention that integrating/combining multiple embeddings improves the performance. I am specifically referring to the sentence:
'Similar trends have been observed in other modalities, where each modality has the potential to enhance the performance when combined with other modalities.'
However, the paper does not clarify how the embeddings for different modalities are actually combined. If for instance, the input modalities are text, audio, video and depth the model would produce individual embeddings for all of the modalities. How do you then combine these embeddings in order to obtain the results you report?
Do you simply average the different embeddings?
Thanks in advance,
Anthony Mendil.
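For reference, the simple averaging that this question asks about could look like the sketch below. This is not confirmed to be the authors' method, since the combination rule is not described in this thread. It normalizes each modality embedding, averages them element-wise, and renormalizes before comparing against a text query. All names and the toy vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def fuse_by_average(embeds):
    """Average L2-normalized embeddings element-wise, then renormalize."""
    normed = [l2_normalize(e) for e in embeds]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Toy embeddings of the same sample in two modalities (hypothetical values).
video = [1.0, 0.0, 0.0]
audio = [0.0, 1.0, 0.0]
text_query = [1.0, 1.0, 0.0]

fused = fuse_by_average([video, audio])
print(cosine(video, text_query), cosine(fused, text_query))
```

In this toy case the fused vector matches the query better than the video embedding alone. Real embeddings may behave differently, and weighted sums or averaging similarity scores instead of embeddings are equally plausible alternatives.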
Great work! I noticed in Figure 3 of your paper that the multi-modal encoders' weights are frozen during Multi-modal Joint Learning. Do you mean they are frozen for the entire training run and you only use LoRA to adjust the multi-modal encoders?
If so, how do you initialize their weights? Are they also initialized from the pretrained OpenCLIP vision encoder?
Furthermore, are there any pretraining steps in your work? Can I train LanguageBind from scratch, or can I only use LoRA to fine-tune it?
Thanks for the work!
May I know how many GPU resources you used to train the foundation model?
May I ask what is the max input length of the text encoder?
How can I combine LanguageBind with an LLM, such as Qwen, to fine-tune for my own downstream tasks?
Thank you for your excellent work!
Will you release the hashtags of the videos and the prompt used by mPLUG-owl and ChatGPT?
Hi Dear Author,
Great work! I'd like to ask where I can find the Audio-Language Alignment data. I noticed that scripts/audio_language/train.sh mentions 4,800,000 instances of audio-language data, which is significantly more than the 1 million mentioned in the paper. Could you please provide information on where to download this data for easier replication of the paper's results?
Thank you!
import torch
from languagebind import LanguageBindImage, LanguageBindImageTokenizer, LanguageBindImageProcessor
pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)
model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)
print(out.text_embeds @ out.image_embeds.T)
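Hello, if I load the LanguageBind_Image model to extract and align image and text features, should I use out.text_embeds and out.image_embeds for the subsequent work, for example later fusion and classification?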
Hi
Thanks for the great work.
ImageBind uses ViT-H, so I'm surprised that you were able to achieve better performance using only ViT-L. Have you tried ViT-H under your setting? I see some leftover code for LAION CLIP ViT-H in the config.
Where is LanguageBind_Audio_FT on Hugging Face?
How do I load a .pt model trained by following the "Training LanguageBind" step? Or, how do I load such models as in the "Inference for multi-modal binding" step in the README.md?
I'd like to know which settings correspond to the LanguageBind_Video_merge model you put on Hugging Face.
Thanks for your wonderful work.
I am very excited about your idea. May I ask what computation budget was used to train the largest Imagebind model? How many GPU hours did you use?
Hello,
Thank you for sharing such great work!
I have encountered an issue where the model's inference results are inconsistent when I run python inference.py multiple times.
For example, the first time:
Video x Text:
[[1.0000000e+00 3.0187387e-08]
[8.4319353e-08 9.9999988e-01]]
Image x Text:
[[1.0000000e+00 4.0604040e-09]
[1.2165047e-08 1.0000000e+00]]
Depth x Text:
[[0.971602 0.02839794]
[0.97326183 0.02673816]]
Audio x Text:
[[0.99523276 0.00476721]
[0.09370264 0.9062974 ]]
Thermal x Text:
[[0.6276049 0.3723951]
[0.6245749 0.3754251]]
Video x Audio:
[[1.0000000e+00 0.0000000e+00]
[3.1131478e-32 1.0000000e+00]]
Image x Depth:
[[5.2336713e-07 9.9999952e-01]
[1.0000000e+00 4.3559140e-08]]
Image x Thermal:
[[5.1953281e-40 1.0000000e+00]
[7.0966505e-27 1.0000000e+00]]
But the second time, we got:
Video x Text:
[[1.0000000e+00 3.0187387e-08]
[8.4319353e-08 9.9999988e-01]]
Image x Text:
[[1.0000000e+00 4.0604040e-09]
[1.2165047e-08 1.0000000e+00]]
Depth x Text:
[[0.17767465 0.8223253 ]
[0.18100499 0.818995 ]]
Audio x Text:
[[0.99523276 0.00476721]
[0.09370264 0.9062974 ]]
Thermal x Text:
[[0.47579706 0.52420294]
[0.5624282 0.43757182]]
Video x Audio:
[[1.0000000e+00 0.0000000e+00]
[3.1131478e-32 1.0000000e+00]]
Image x Depth:
[[0.9892476 0.01075235]
[0.9906881 0.00931183]]
Image x Thermal:
[[9.9999619e-01 3.8228222e-06]
[1.0000000e+00 1.5902166e-24]]
Why does this randomness occur?
Thank you for your excellent work. I want to know what the difference is between this work and ImageBind. As I understand it, the main difference is the modality used for binding, right? Thanks!
Great job! When is the release date for the Huge model planned?
I want to use the pretrained weights for inference, but I need the dimension of embeddings['image'] to be 1024 instead of 768.
How can I do that?
When running the training code, I use the video-text sample with MSRVTT data and the following config:
CACHE_DIR= '/root/.cache'
TRAIN_DATA = '/content/MSRVTT_data.json'
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
%cd /content/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --node_rank=0 --nproc_per_node 1 \
-m main \
--train-data ${TRAIN_DATA} \
--train-num-samples 1000 \
--clip-type "vl" \
--do_train \
--lock-text --lock-image --text-type "mplug" \
--init-temp 0.07 --learn-temp \
--model "ViT-L-14" --cache-dir ${CACHE_DIR} \
--convert_to_lora --lora_r 16 \
--lr 1e-4 --coef-lr 1 \
--beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
--num-frames 8 --force-patch-dropout 0.3 \
--epochs 16 --batch-size 10 --accum-freq 4 --warmup 20 \
--precision "amp" --workers 10 --video-decode-backend "imgs" \
--save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
--do_eval \
--val_vl_ret_data "msrvtt"
However, when I run it, I get the following error:
LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has
been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.
How can I fix it?
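A likely cause is visible in the command itself: it sets HF_DATASETS_OFFLINE=1 and TRANSFORMERS_OFFLINE=1, which tell the Hugging Face libraries not to use the network, so any file that is not already in the cache raises this error. Running once without those variables, or pre-downloading the checkpoint into CACHE_DIR, would probably get past it. The sketch below is a small diagnostic, not part of LanguageBind, that reports which offline flags are active. The helper name and the inclusion of HF_HUB_OFFLINE in the list are my own choices.

```python
import os

OFFLINE_FLAGS = ("HF_DATASETS_OFFLINE", "TRANSFORMERS_OFFLINE", "HF_HUB_OFFLINE")

def active_offline_flags(env=None):
    """Return the offline flags that are set to a truthy value in env."""
    env = os.environ if env is None else env
    truthy = {"1", "ON", "YES", "TRUE"}
    return [name for name in OFFLINE_FLAGS if env.get(name, "").upper() in truthy]

# Environment as set by the launch command above.
launch_env = {"HF_DATASETS_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}
print(active_offline_flags(launch_env))

# With the variables removed, nothing blocks the first download.
print(active_offline_flags({}))
```

Calling `active_offline_flags()` with no argument inspects the current process environment, which can help confirm what a notebook cell or shell actually exported.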
In the code, the image and video encoders are initialized from the same model but trained separately. Does this improve performance?
Thanks for your effort pushing MLLM into the next stage. I recently wanted to follow your work and downloaded the VIDAL-10M video-text data file id2title_folder_raw_ofa_mplug_gpt_sound10076613.json.
I found that it contains around 10M video-text pairs. I have the following questions and hope you can give me some hints.
Thanks in advance.
Hello! Your LanguageBind is amazing! I'm new to multimodality, and I was wondering what the difference is between LanguageBind and LLaVA-1.5. Should I use LLaVA-1.5 or LanguageBind if I want my model to have more reasoning power while handling multimodal input (currently text, image, and video are the three modalities at most)? Considering that LanguageBind may be a better choice if other modalities are added in the future, can LanguageBind be easily combined with LLaVA-1.5, LLaMA, etc.? I'd like to hear your views on these issues.
ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement torch==1.13.0+cu116 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0)
ERROR: No matching distribution found for torch==1.13.0+cu116
I think the pinned torch version is outdated now; we should change it to a newer version.
Why are the weights downloaded again every time I run inference.py?
Great work!
I'd like to learn more about the details of the pretraining process mentioned: "During the pretraining process, all modalities gradually align with the language modality through contrastive learning."
Could you clarify if this pretraining process is equivalent to LoRA fine-tuning? In other words, during the pretraining phase, are parameters updated for the video encoder, infrared encoder, depth encoder, and audio encoder using the four types of data contained in VIDAL-10M, namely, video-language data, infrared-language data, depth-language data, and audio-language data, through contrastive learning?
From exploring the code, and to my knowledge (please correct me if I am wrong), the current training code does not use flash attention and relies on vanilla attention instead.
I think flash attention is low-hanging fruit: training and evaluation would be faster while giving the same results.
Do you have any plans to add flash attention to your code?
Using LanguageBindVideoTower(video_tower, args=video_tower_cfg, cache_dir='', **kwargs) doesn't work. How do I adjust the CLIPVisionTransformer to fit the LanguageBind_Video_Huge_V1.5_FT model?
To train using my own audio dataset, I left clip_type as al and while training, I noticed that the following code is executed when an audio clip is found in the VAT_dataset Class.
self.id2path_cap, self.ids = get_audio_anno()
However, I didn't see a definition for the get_audio_anno() function anywhere, so that's where I got the undefined function error. Is there any way I can get some information about that function?
Hi,
Are there any code snippets for testing LanguageBind audio with large batches on GPUs?
Hi,
I find your project intriguing and believe it could greatly assist in working with multiple data sources. However, I noticed that you haven't mentioned how the vector data generated by your project can be utilized for downstream tasks, such as video captioning. Do you have any plans to address this aspect? I'd be interested to hear your ideas on how one could leverage your model for such tasks.