
languagebind's People

Contributors

binzhu-ece, jessytsu1, linb203, pphuc25


languagebind's Issues

finetuning on a classification task

Hey, I have some data of images and videos and I want them to be aligned with text. My use case is just binary classification, so my texts are only two sentences: 'The data is live' and 'The data is non live'. Basically, I want to improve my model's performance by utilising a multi-modality model. How do I do this? Are there any resources?
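One hedged starting point (not an official recipe) is to treat the two sentences as zero-shot class prompts and score each video against them with the LanguageBindVideo API shown in the next issue; the video path and prompts below are placeholders, and the assumption that the processor accepts one video with two prompts should be verified.

# Hedged sketch: zero-shot binary classification by scoring a video against two class prompts.
import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)
model.eval()

class_prompts = ['The data is live.', 'The data is non live.']
data = video_process(['your/video.mp4'], class_prompts, return_tensors='pt')
with torch.no_grad():
    out = model(**data)

# Rows: videos, columns: class prompts; softmax over prompts gives per-class scores.
logits = out.image_embeds @ out.text_embeds.T
probs = logits.softmax(dim=-1)
print(probs)  # probability of 'live' vs. 'non live' for the video

If zero-shot scoring is not accurate enough, a small linear head trained on the frozen embeddings (as sketched in a later issue on this page) is another option.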

The length of text that the text encoder can handle

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

In this code, what is the maximum text length? If it exceeds 77 tokens, will it simply be truncated?
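For reference, a small hedged check of the truncation behaviour: CLIP-style text encoders are typically capped at 77 tokens, and with truncation=True longer text is cut to that length. This sketch only inspects the tokenizer output, not the model.

# Hedged sketch: inspect how the tokenizer handles text longer than 77 tokens.
from languagebind import LanguageBindVideoTokenizer

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')

long_text = 'your text ' * 200  # deliberately much longer than 77 tokens
enc = tokenizer([long_text], max_length=77, padding='max_length',
                truncation=True, return_tensors='pt')
print(enc['input_ids'].shape)  # expected: torch.Size([1, 77]) -> excess tokens are dropped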

Seeing excessive GPU memory usage during inference

Hi,
Great work, and thanks for open-sourcing it. I was trying your model on 150 video clips and 150 audio clips, each 5 seconds long. Below is the code I am using; the arrays video_clips and audio_files each contain 150 items. During embedding generation, the GPU consumes more than 8 GB of memory and the embedding generation stops. I tried the exact same sample with ImageBind, and that works fine during inference and embedding generation. Any idea if I am doing something wrong?

device = 'cuda:0'
device = torch.device(device)
clip_type = ('video', 'audio')
model = LanguageBind(clip_type=clip_type)
model = model.to(device)
model.eval()
pretrained_ckpt = f'lb203/LanguageBind_Video'

tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type}

inputs = {
    'video': to_device(modality_transform['video'](video_clips), device),
    'audio': to_device(modality_transform['audio'](audio_files), device),
}

inputs['language'] = to_device(tokenizer(transcriptions_list, max_length=77, padding='max_length',
                                         truncation=True, return_tensors='pt'), device)

with torch.no_grad():
    embeddings = model(inputs)
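One thing that often helps here (a hedged sketch, not a confirmed fix): feed the 150 clips in small chunks instead of one giant batch, so activations for all clips never sit in GPU memory at once. This assumes, as in the snippet above, that model(inputs) returns a dict of embeddings keyed by modality name.

# Hedged sketch: run the same model in small chunks to cap peak GPU memory.
import torch

chunk_size = 8  # tune to your GPU
video_embs, audio_embs = [], []
with torch.no_grad():
    for i in range(0, len(video_clips), chunk_size):
        chunk_inputs = {
            'video': to_device(modality_transform['video'](video_clips[i:i + chunk_size]), device),
            'audio': to_device(modality_transform['audio'](audio_files[i:i + chunk_size]), device),
        }
        emb = model(chunk_inputs)
        # Move results to CPU so only the current chunk occupies GPU memory.
        video_embs.append(emb['video'].cpu())
        audio_embs.append(emb['audio'].cpu())

video_embs = torch.cat(video_embs)
audio_embs = torch.cat(audio_embs)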

What are the training configurations for full tuning?

Hi, I notice that your paper reports results for full tuning. I'd like to know the training configurations for full tuning: do you use the text prompt and input-modality data with contrastive learning during full tuning, or class labels with a traditional classification setting (e.g., cross-entropy loss)? Thank you.

Provide sample data for training

Hi, in the TRAIN_AND_VALIDATE readme the data is not released, so it's hard to get data into the right format as you did.
I want to run your training code; could you provide a sample of the data?

Congrats on Acceptance !!!

I have been following and using your codebase for an extended period in my research. I believe your paper deserves far more attention than ImageBind.

How to use the Hugging Face model

Nice work! An error occurred while trying to load the model using the Hugging Face API:
from transformers import AutoProcessor, AutoModel, AutoTokenizer

processor = AutoProcessor.from_pretrained("LanguageBind/LanguageBind_Video")
model = AutoModel.from_pretrained("LanguageBind/LanguageBind_Video")
tokenizer = AutoTokenizer.from_pretrained("LanguageBind/LanguageBind_Video")

KeyError: 'LanguageBindVideo'

Could you give an example of using Hugging Face transformers to feed a video and extract features?
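A hedged workaround while AutoModel support is unclear: 'LanguageBindVideo' is not in the stock transformers model mapping (hence the KeyError), so loading through the package's own classes, as in the video snippet earlier on this page, avoids the Auto classes entirely.

# Hedged sketch: use the package's own classes instead of AutoModel/AutoProcessor.
import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(['your/video.mp4'], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)  # out.image_embeds holds the extracted video features
print(out.image_embeds.shape)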

Inquiry on Unimodal Fine-Tuning with Locked Image in LanguageBind

Great work!
Excuse me, I would like to inquire about the unimodal fine-tuning process as outlined in your documentation (https://github.com/PKU-YuanGroup/LanguageBind/blob/main/TRAIN_AND_VALIDATE.md#training-languagebind:~:text=Depth%2DLanguage%20with%208%20GPUs%20(1%20nodes%20x%208%20GPUs)).
If I choose to lock the image, does that mean the LoRA depth pre-trained model you developed is frozen and I am training a new LoRA model of my own? In this context, what role do the following models play: MODEL_DICT = {"ViT-L-14": "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"} and CHECKPOINT_DICT = {"ViT-L-14": "models--laion--CLIP-ViT-L-14-DataComp.XL-s13B-b90K/snapshots/84c9828e63dc9a9351d1fe637c346d4c1c4db341/pytorch_model.bin"}?

Vision encoder version

Hi authors,

Thanks for releasing the code.
I noticed that you mentioned "Note that our image encoder is the same as OpenCLIP. Not as fine-tuned as other modalities."
I would like to know which exact version of the CLIP weights you are using.

Thanks!

Combination of multiple modalities

First of all congrats on the paper and thanks for providing the code!

In the paper at 'Zero-shot language-based multi-modal joint retrieval' you mention that integrating/combining multiple embeddings improves the performance. I am specifically referring to the sentence:

'Similar trends have been observed in other modalities, where each modality has the potential to enhance the performance when combined with other modalities.'

However, the paper does not clarify how the embeddings for different modalities are actually combined. If for instance, the input modalities are text, audio, video and depth the model would produce individual embeddings for all of the modalities. How do you then combine these embeddings in order to obtain the results you report?
Do you simply average the different embeddings?

Thanks in advance,
Anthony Mendil.
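For reference, one plausible reading of "combined" (only a guess, not the paper's confirmed procedure) is score-level fusion: L2-normalize each modality's embedding, compute each modality's similarity to the text candidates, and average the resulting scores. A minimal sketch with placeholder tensors:

# Hedged sketch of one plausible combination: average the similarity scores that
# each modality's (normalized) embedding produces against the text embeddings.
import torch
import torch.nn.functional as F

# Placeholder embeddings: N items, M text candidates, embedding dim D.
N, M, D = 4, 10, 768
text_emb  = F.normalize(torch.randn(M, D), dim=-1)
video_emb = F.normalize(torch.randn(N, D), dim=-1)
audio_emb = F.normalize(torch.randn(N, D), dim=-1)

sim_video = video_emb @ text_emb.T          # (N, M) video-text similarity
sim_audio = audio_emb @ text_emb.T          # (N, M) audio-text similarity
sim_joint = (sim_video + sim_audio) / 2     # simple score-level fusion

print(sim_joint.argmax(dim=-1))  # retrieved text index per item under the fused score

Averaging the embeddings themselves before computing similarity is another possibility; which one the paper uses is exactly the question being asked here.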

How to Initialize the multi-modal encoders & training from scratch

Great work! I noticed in Figure 3 of your paper that the multi-modal encoders' weights are frozen during Multi-modal Joint Learning. Do you mean they are frozen for the entire training and you only use LoRA to adjust the multi-modal encoders?

If so, how do you initialize their weights? Are they also initialized from pretrained OpenCLIP vision encoder?

Furthermore, are there any pretraining steps in your work? Can I train LanguageBind from scratch, or can I only use LoRA to fine-tune it?

GPU resources

Thanks for the job!

May I know how many GPU resources you used to train the foundation model?

Hashtags and prompts?

Thank you for your excellent work!

Will you release the hashtags of the videos and the prompt used by mPLUG-owl and ChatGPT?

Audio-Language Alignment data for reproduction

Hi Dear Author,

Great work! I'd like to ask where I can find the Audio-Language Alignment data. I noticed that scripts/audio_language/train.sh mentions 4,800,000 instances of audio-language data, which seems significantly more than the 1 million mentioned in the paper. Could you please tell me where to download this data to make it easier to replicate the paper's results?

Thank you!

Which outputs should be used for feature extraction and alignment?

import torch
from languagebind import LanguageBindImage, LanguageBindImageTokenizer, LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Hello, if I load the LanguageBind_Image model to extract and align image and text features, should I use out.text_embeds and out.image_embeds for the downstream work, for example subsequent fusion and classification?
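If the goal is downstream fusion and classification, a minimal hedged sketch (the fusion head, feature dimension, and class count below are placeholders, not part of LanguageBind) would concatenate the two embeddings and feed them to a small classifier:

# Hedged sketch: fuse out.text_embeds and out.image_embeds for a downstream classifier.
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 2            # placeholder sizes
fusion_head = nn.Linear(2 * embed_dim, num_classes)

text_embeds  = torch.randn(1, embed_dim)   # stand-in for out.text_embeds
image_embeds = torch.randn(1, embed_dim)   # stand-in for out.image_embeds

fused = torch.cat([text_embeds, image_embeds], dim=-1)
logits = fusion_head(fused)
print(logits.softmax(dim=-1))

In practice the stand-in tensors would be replaced by out.text_embeds and out.image_embeds from the snippet above, and the fusion head trained on your own labels.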

Choice of ViT-L over ViT-H

Hi
Thanks for the great work.
ImageBind uses ViT-H, so I'm surprised that you were able to achieve better performance using only ViT-L. Have you tried ViT-H under your setting? I see there is some leftover code for LAION CLIP ViT-H in the config.

GPU resources

Thanks for your wonderful work.
I am very excited about your idea. May I ask what computation budget was used to train the largest ImageBind model? How many GPU hours did you use?

Inconsistent running results of inference.py

Hello,
Thank you for sharing such great work!
I have encountered an issue where the inference results of the model are inconsistent when I run python inference.py multiple times.
For example, the first time:

      Video x Text:
       [[1.0000000e+00 3.0187387e-08]
       [8.4319353e-08 9.9999988e-01]]
      Image x Text:
       [[1.0000000e+00 4.0604040e-09]
       [1.2165047e-08 1.0000000e+00]]
      Depth x Text:
       [[0.971602   0.02839794]
       [0.97326183 0.02673816]]
      Audio x Text:
       [[0.99523276 0.00476721]
       [0.09370264 0.9062974 ]]
      Thermal x Text:
       [[0.6276049 0.3723951]
       [0.6245749 0.3754251]]
      Video x Audio:
       [[1.0000000e+00 0.0000000e+00]
       [3.1131478e-32 1.0000000e+00]]
      Image x Depth:
       [[5.2336713e-07 9.9999952e-01]
       [1.0000000e+00 4.3559140e-08]]
      Image x Thermal:
       [[5.1953281e-40 1.0000000e+00]
       [7.0966505e-27 1.0000000e+00]]

But the second time, we got:

Video x Text:
 [[1.0000000e+00 3.0187387e-08]
 [8.4319353e-08 9.9999988e-01]]
Image x Text:
 [[1.0000000e+00 4.0604040e-09]
 [1.2165047e-08 1.0000000e+00]]
Depth x Text:
 [[0.17767465 0.8223253 ]
 [0.18100499 0.818995  ]]
Audio x Text:
 [[0.99523276 0.00476721]
 [0.09370264 0.9062974 ]]
Thermal x Text:
 [[0.47579706 0.52420294]
 [0.5624282  0.43757182]]
Video x Audio:
 [[1.0000000e+00 0.0000000e+00]
 [3.1131478e-32 1.0000000e+00]]
Image x Depth:
 [[0.9892476  0.01075235]
 [0.9906881  0.00931183]]
Image x Thermal:
 [[9.9999619e-01 3.8228222e-06]
 [1.0000000e+00 1.5902166e-24]]

Why does this randomness occur?
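Not an answer to where the randomness comes from, but a hedged way to check whether it is ordinary unseeded nondeterminism: fix all the seeds before running inference and see whether the numbers become stable across runs.

# Hedged sketch: pin random seeds before running inference.py-style code.
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional stricter settings; may slow things down or raise on unsupported ops.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(0)
# ... then build the model and run inference as usual ...

If the results stabilize with fixed seeds, the variation likely comes from an unseeded stochastic component (e.g. random frame or patch sampling) rather than from the weights themselves; if not, non-deterministic GPU kernels are another possible cause.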

Difference from ImageBind

Thank you for your excellent work. I want to know what the difference is between this work and ImageBind. In my understanding, the difference mainly lies in which modalities are bound, right? Thanks!

Cannot run the training code

When running the training code, I use the TextVideo example with the MSRVTT data and the following config:

CACHE_DIR= '/root/.cache'
TRAIN_DATA = '/content/MSRVTT_data.json'
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
%cd /content/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --node_rank=0 --nproc_per_node 1 \
    -m main  \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 1000 \
    --clip-type "vl" \
    --do_train \
    --lock-text --lock-image --text-type "mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 16 \
    --lr 1e-4 --coef-lr 1 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 8 --force-patch-dropout 0.3 \
    --epochs 16 --batch-size 10 --accum-freq 4 --warmup 20 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_vl_ret_data "msrvtt"

However, when I run it, the error looks like:

LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has 
been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

How can I fix it?
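A hedged way around this, assuming the error only means the checkpoint is missing from the local cache while HF_DATASETS_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 forbid downloads: either drop those two flags for the first run, or pre-download the backbone once with network access. The repo id below is the ViT-L-14 entry from the MODEL_DICT quoted in an earlier issue, and the cache path is this script's CACHE_DIR; both may need adjusting to whatever your setup actually resolves.

# Hedged sketch: populate the cache once (with network access) so the offline run can find the files.
from huggingface_hub import snapshot_download

snapshot_download(repo_id='laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K',
                  cache_dir='/root/.cache')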

confusion about VIDAL-10M video-text data

Thanks for your effort pushing MLLMs to the next stage. Recently, I wanted to follow your work and downloaded the VIDAL-10M video-text data id2title_folder_raw_ofa_mplug_gpt_sound10076613.json.

I found it contains around 10M video-text pairs. I have the following questions and hope you could give me some hints.

  1. What's the difference between this 10M video-text data and the 3M video-text data mentioned in your ICLR paper?
  2. Regarding this 10M video-text data, I found that many videos' raw text (including title and hashtags) contains words like youtube and shorts. Take YouTube ID LbxMRY4_W10 for example: its raw text is I kicked this ball higher than Ja Morant can jump! #shorts #youtubeshorts #youtube #shortclips. But in your paper, you mention "we removed irrelevant words and hashtags, such as "youtube", "fyp", "shorts", etc".

Thanks in advance.

What's the difference between LanguageBind and LLaVA-1.5

Hello! Your LanguageBind is amazing! But I'm new to multimodality, and I was wondering what the difference is between LanguageBind and LLaVA-1.5. Should I use LLaVA-1.5 or LanguageBind if I want my model to have more reasoning power while handling multimodal input (currently text, image, and video at most)? Considering that LanguageBind may be a better choice if other modalities are added in the future, can LanguageBind be easily combined with LLaVA-1.5, LLaMA, etc.? I'd like to hear your views on these issues.

Bug when installing requirements.txt

ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement torch==1.13.0+cu116 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0)
ERROR: No matching distribution found for torch==1.13.0+cu116

I think this torch version is outdated now; we should change it to a newer version.

pretraining details

Great work!
I'd like to learn more about the details of the pretraining process mentioned: "During the pretraining process, all modalities gradually align with the language modality through contrastive learning."
Could you clarify if this pretraining process is equivalent to LoRA fine-tuning? In other words, during the pretraining phase, are parameters updated for the video encoder, infrared encoder, depth encoder, and audio encoder using the four types of data contained in VIDAL-10M, namely, video-language data, infrared-language data, depth-language data, and audio-language data, through contrastive learning?

Add flash attention 2

As I explored the code, and to my knowledge (please correct me if I am wrong), the current code does not use flash attention in training but the vanilla attention instead.
I think flash attention is low-hanging fruit: training and eval would be faster with the same results.
Do you have any plan to add flash attention to your code?
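For context (not a claim about how the repo is currently wired), PyTorch 2.x already exposes a fused scaled-dot-product attention kernel that can dispatch to FlashAttention, so a drop-in replacement of the vanilla attention math would look roughly like this:

# Hedged sketch: vanilla attention vs. PyTorch 2.x fused SDPA (FlashAttention-capable).
import math
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 77, 64)   # placeholder (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 77, 64)
v = torch.randn(2, 8, 77, 64)

# Vanilla attention: materializes the full (seq x seq) score matrix.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
out_vanilla = scores.softmax(dim=-1) @ v

# Fused SDPA: same math, can use memory-efficient / FlashAttention kernels when available.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_vanilla, out_fused, atol=1e-5))  # outputs should match up to tolerance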

Use of undefined functions during fine-tuning with custom audio data

To train on my own audio dataset, I set clip_type to al, and while training I noticed that the following code is executed when an audio clip is found in the VAT_dataset class.

self.id2path_cap, self.ids = get_audio_anno()

However, I didn't see a definition of the get_audio_anno() function anywhere, so that's where the undefined-function error comes from. Is there any way I can get more information about that function?

batch inference

Hi,

Are there any code snippets for testing LanguageBind audio with large batches on GPUs?
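Not an official snippet, but a hedged sketch along the lines of the usage shown earlier on this page: chunk the audio file list and keep each mini-batch on the GPU only while it is being encoded. It assumes that LanguageBind, transform_dict, and to_device import as in the repo's README and that model(inputs) returns a dict keyed by modality.

# Hedged sketch: batched audio embedding extraction with the 'audio' tower only.
import torch
from languagebind import LanguageBind, transform_dict, to_device

device = torch.device('cuda:0')
clip_type = ('audio',)
model = LanguageBind(clip_type=clip_type).to(device)
model.eval()
modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type}

audio_files = ['your/audio1.wav', 'your/audio2.wav']  # placeholder paths
batch_size = 16

all_embs = []
with torch.no_grad():
    for i in range(0, len(audio_files), batch_size):
        inputs = {'audio': to_device(modality_transform['audio'](audio_files[i:i + batch_size]), device)}
        emb = model(inputs)
        all_embs.append(emb['audio'].cpu())  # move to CPU to free GPU memory between batches

audio_embeddings = torch.cat(all_embs)
print(audio_embeddings.shape)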

Research on using the model for video captioning

Hi,
I find your project intriguing and believe it could greatly assist in working with multiple data sources. However, I noticed that you haven't mentioned how the embeddings generated by your project can be used for downstream tasks such as video captioning. Do you have any plans to address this aspect? I'd be interested to hear your ideas on how one could leverage your model for such tasks.
