
UniIR

🌐 Homepage | 🤗 Dataset (M-BEIR Benchmark) | 🤗 Checkpoints (UniIR models) | 📖 arXiv | GitHub

This repo contains the codebase for the ECCV-2024 paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers"

🔔 News

  • 🔥 [2024-04-13]: We highlight another valuable piece of concurrent research on training instruction-following, multi-task multimodal retrievers with late interaction: PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers, done by researchers at the University of Cambridge. They also introduced the M2KR benchmark, which can be used to train and evaluate universal multimodal information retrievers. We may combine the M2KR and M-BEIR benchmarks to facilitate the advance of this field.
  • 🔥 [2024-03-18]: Released the UniIR (CLIP_SF) Large and UniIR (BLIP_FF) Large checkpoints: 🤗 Checkpoints
  • 🔥 [2023-12-21]: Our 🤗 M-BEIR benchmark is now available for use.

Introduction

We propose the UniIR (Universal multimodal Information Retrieval) framework to learn a single retriever that can accomplish (possibly) any retrieval task. Unlike traditional IR systems, UniIR follows instructions to take a heterogeneous query and retrieve from a heterogeneous candidate pool with millions of candidates in diverse modalities.
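For intuition, an instructed retrieval example pairs a task instruction with a query that may be text, an image, or both, and scores it against candidates that may themselves be text, images, or image-text pairs. The Python sketch below is purely illustrative; the field names are hypothetical, and the actual M-BEIR schema is documented on the 🤗 dataset page.

# Purely illustrative sketch of an instructed multimodal retrieval example.
# Field names are hypothetical; see the M-BEIR dataset card for the real schema.
query = {
    "instruction": "Retrieve a news image that matches the caption.",
    "query_text": "A firefighter carries a child away from a burning building.",
    "query_image": None,  # this particular task is text -> image
}
candidate_pool = [
    {"modality": "image", "image": "news/0001.jpg", "text": None},
    {"modality": "image,text", "image": "wiki/0042.jpg", "text": "A Wikipedia passage ..."},
    {"modality": "text", "image": None, "text": "A standalone text passage ..."},
]
# A UniIR retriever encodes (instruction + query) and every candidate into a shared
# embedding space, then returns the highest-scoring candidates regardless of modality.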

UniIR Teaser

Content

  1. M-BEIR
  2. Training
  3. Evaluation
  4. Model Zoo
  5. Citations and Contact

M-BEIR

To train and evaluate universal multimodal retrieval models, we build a large-scale retrieval benchmark named M-BEIR (Multimodal BEnchmark for Instructed Retrieval).

M-BEIR Downloading

We provide the M-BEIR dataset on the 🤗 Dataset page. Please follow the instructions provided on the HF page to download the dataset and prepare the data for training and evaluation. You need to set up Git LFS and directly clone the repo:

git clone https://huggingface.co/datasets/TIGER-Lab/M-BEIR
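If you prefer a Python workflow over Git LFS, the dataset repo can also be fetched with the huggingface_hub library. This is a minimal sketch, not part of the official instructions; the local directory name is arbitrary, and the dataset is large, so make sure you have enough disk space.

from huggingface_hub import snapshot_download

# Download the entire M-BEIR dataset repository into ./M-BEIR.
snapshot_download(
    repo_id="TIGER-Lab/M-BEIR",
    repo_type="dataset",
    local_dir="M-BEIR",
)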

UniIR Models

We provide the codebase for training and evaluating the UniIR CLIP-ScoreFusion, CLIP-FeatureFusion, BLIP-ScoreFusion, and BLIP-FeatureFusion models.

Environment

Prepare the UniIR codebase and the Conda environment using the following commands:

git clone https://github.com/TIGER-AI-Lab/UniIR
cd UniIR

cd src/models/
conda env create -f uniir_env.yml

Training

To train the UniIR models from pretrained CLIP and BLIP checkpoints, please follow the instructions below. The scripts will automatically download the pretrained CLIP and BLIP checkpoints.

1. Download the M-BEIR Benchmark

Please download the M-BEIR benchmark by following the instructions in the M-BEIR section.

2. Scripts

To train UniIR CLIP_SF Large with the default configuration:

cd src/models/uniir_clip/clip_scorefusion/configs_scripts/large/train/inbatch/

Modify inbatch.yaml for hyperparameter tuning and run_inbatch.sh for your own environment and paths.

Note:

  1. Modify UNIIR_DIR in run_inbatch.sh to the directory where you want to store the checkpoints.
  2. Modify MBEIR_DATA_DIR in run_inbatch.sh to the directory where you store the M-BEIR benchmark.
  3. Modify SRC_DIR in run_inbatch.sh to the directory where you store the codebase of the UniIR project (this repo).
  4. By default, UniIR models are trained on M-BEIR with in-batch negatives; the hard negatives provided by the original datasets are not used.
  5. We use wandb to log the training process. Please make sure a .env file with WANDB_API_KEY, WANDB_PROJECT, and WANDB_ENTITY is set; see the sketch after this list.
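A minimal .env sketch (the key names come from the note above; the values are placeholders to replace with your own wandb credentials and project settings):

WANDB_API_KEY=your_wandb_api_key
WANDB_PROJECT=your_wandb_project
WANDB_ENTITY=your_wandb_entity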

Then you can run the following command to train the UniIR CLIP_SF Large model.

bash run_inbatch.sh

To train UniIR BLIP_FF Large with the default configuration:

cd src/models/uniir_blip/blip_featurefusion/configs_scripts/large/train/inbatch/

Modify inbatch.yaml for hyperparameter tuning and run_inbatch.sh for your own environment and paths.

bash run_inbatch.sh

Similarly, you can train the UniIR CLIP_FF and BLIP_SF models by modifying the corresponding scripts.

Evaluation

We provide the evaluation pipeline for the UniIR models on the M-BEIR benchmark.

1. Environment

Please create an environment for the FAISS library:

# From the root directory of the project
cd src/common/
conda env create -f faiss_env.yml
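For intuition, the retrieval stage of the pipeline embeds queries and candidates, builds a FAISS index over the candidate embeddings, and searches it with the query embeddings. The following Python sketch illustrates that idea only; it is not the project's actual indexing code, and the embedding dimensionality and the use of inner-product search over L2-normalized vectors are assumptions.

import faiss
import numpy as np

# Toy embeddings: 1000 candidates and 4 queries in a 768-dim space (sizes are assumptions).
d = 768
cand_emb = np.random.randn(1000, d).astype("float32")
query_emb = np.random.randn(4, d).astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(cand_emb)
faiss.normalize_L2(query_emb)

# Build a flat inner-product index and retrieve the top-10 candidates per query.
index = faiss.IndexFlatIP(d)
index.add(cand_emb)
scores, ids = index.search(query_emb, 10)
print(ids.shape)  # (4, 10): ranked candidate indices for each query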

2. Download the M-BEIR Benchmark

Please download the M-BEIR benchmark by following the instructions in the M-BEIR section.

3. Download the UniIR Checkpoints

You can train the UniIR models from scratch or download the pre-trained UniIR checkpoints by following the instructions in the Model Zoo section.

4. Scripts

To evaluate UniIR CLIP_SF Large with the default configuration:

cd src/models/uniir_clip/clip_scorefusion/configs_scripts/large/eval/inbatch/

Modify embed.yaml, index.yaml, retrieval.yaml and run_eval_pipeline_inbatch.sh for your own environment, paths and evaluation settings.

Note:

  1. If you download our pretrained UniIR model, please modify UNIIR_DIR in run_eval_pipeline_inbatch.sh to the directory where you want to store large files, including the checkpoints, embeddings, index, and retrieval results. Then place the clip_sf_large.pth file at the following path:
    $UNIIR_DIR/checkpoint/CLIP_SF/Large/Instruct/InBatch/clip_sf_large.pth
    This is the default path specified by model.ckpt_config in the embed.yaml file.
  2. Modify MBEIR_DATA_DIR in run_eval_pipeline_inbatch.sh to the directory where you store the M-BEIR benchmark.
  3. Modify SRC_DIR in run_eval_pipeline_inbatch.sh to the directory where you store the codebase of the UniIR project (this repo).

The default configuration will evaluate the UniIR CLIP_SF Large model on both the M-BEIR (5.6M heterogeneous candidate pool) and the M-BEIR_local (homogeneous candidate pool) benchmarks. UNION in the yaml files refers to the M-BEIR (5.6M heterogeneous candidate pool). You can follow the comments in the yaml files and modify the configurations to evaluate the model on the M-BEIR_local benchmark only.

bash run_eval_pipeline_inbatch.sh

embed, index, logger and retrieval_results will be saved in the $UNIIR_DIR directory.
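For reference, the Recall@k numbers in the saved retrieval results follow the standard definition: a query counts as a hit if at least one of its positive candidates appears in its top-k retrieved list. The Python sketch below only illustrates the metric with assumed data structures; it is not the repo's evaluation code.

def recall_at_k(retrieved_ids, positive_ids, k):
    # retrieved_ids: one ranked list of candidate ids per query.
    # positive_ids: one set of ground-truth candidate ids per query.
    hits = sum(
        1
        for ranked, pos in zip(retrieved_ids, positive_ids)
        if any(doc_id in pos for doc_id in ranked[:k])
    )
    return hits / len(retrieved_ids)

# Toy example: the first query finds a positive in its top 5, the second does not.
print(recall_at_k([[3, 7, 9, 1, 4], [8, 2, 6, 0, 5]], [{9}, {11}], k=5))  # 0.5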

To evaluate UniIR BLIP_FF Large with the default configuration:

cd src/models/uniir_blip/blip_featurefusion/configs_scripts/large/eval/inbatch/

Similarly, if you download our pretrained UniIR model, you can place the blip_ff_large.pth file in the following path:

$UNIIR_DIR/checkpoint/BLIP_FF/Large/Instruct/InBatch/blip_ff_large.pth

The default configuration will evaluate the UniIR BLIP_FF Large model on both the M-BEIR and the M-BEIR_local benchmarks.

bash run_eval_pipeline_inbatch.sh

You can train and evaluate the UniIR CLIP_FF and BLIP_SF models by modifying the corresponding scripts.

Model Zoo

We provide the UniIR model checkpoints on the 🤗 Checkpoints page. You can directly use the checkpoints for retrieval tasks or fine-tune the models for your own retrieval tasks.

Available Checkpoints

Model Name        Version  Model Size  Model Link
UniIR (CLIP-SF)   Large    5.13 GB     Download Link
UniIR (BLIP-FF)   Large    7.49 GB     Download Link

You can download them with:

git clone https://huggingface.co/TIGER-Lab/UniIR
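Once cloned, the .pth files can be loaded with standard PyTorch tooling. Below is a minimal sketch for inspecting a checkpoint; the file name matches the Evaluation section above, but the exact state-dict layout is defined by the repo's model code, so treat the key listing as exploratory.

import torch

# Load the checkpoint on CPU and list its top-level keys.
ckpt = torch.load("clip_sf_large.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # layout (weights, config, etc.) is defined by the UniIR codebase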

Citation and Contact

BibTeX:

@article{wei2023uniir,
  title={Uniir: Training and benchmarking universal multimodal information retrievers},
  author={Wei, Cong and Chen, Yang and Chen, Haonan and Hu, Hexiang and Zhang, Ge and Fu, Jie and Ritter, Alan and Chen, Wenhu},
  journal={arXiv preprint arXiv:2311.17136},
  year={2023}
}

uniir's People

Contributors

lim142857, sahel-sh, wenhuchen


uniir's Issues

Evaluation Variability with Random Instruction Selection

First of all, thank you for sharing your excellent research with the community.

I have a question regarding the code implementation for evaluations using the M-BEIR dataset. It appears that the instruction is chosen randomly when conducting evaluations. This leads me to believe that if the random seed is not fixed, the evaluation performance could vary depending on which instruction is selected. Do you have any experimental results where you fixed a specific instruction for evaluation?

unable to reproduce the results

Hi, Thanks for the amazing work!

When reproducing the results from CLIP_SF-base, I don't seem to get the same results as those in the paper:
TaskID Task Dataset Split Metric CandPool Value UnionPool UnionValue
0 text -> image visualnews_task0 test Recall@1 visualnews_task0 0.1109 union 0.0
0 text -> image visualnews_task0 test Recall@5 visualnews_task0 0.2493 union 0.0
0 text -> image visualnews_task0 test Recall@10 visualnews_task0 0.3288 union 0.0001
0 text -> image mscoco_task0 test Recall@1 mscoco_task0_test 0.4138 union 0.0
0 text -> image mscoco_task0 test Recall@5 mscoco_task0_test 0.7016 union 0.0
0 text -> image mscoco_task0 test Recall@10 mscoco_task0_test 0.8058 union 0.0
0 text -> image fashion200k_task0 test Recall@10 fashion200k_task0 0.0791 union 0.0
0 text -> image fashion200k_task0 test Recall@20 fashion200k_task0 0.1123 union 0.0
0 text -> image fashion200k_task0 test Recall@50 fashion200k_task0 0.1815 union 0.0
1 text -> text webqa_task1 test Recall@1 webqa_task1 0.0049 union 0.0041
1 text -> text webqa_task1 test Recall@5 webqa_task1 0.0069 union 0.0053
1 text -> text webqa_task1 test Recall@10 webqa_task1 0.0081 union 0.0069
2 text -> image,text edis_task2 test Recall@1 edis_task2 0.0861 union 0.016
2 text -> image,text edis_task2 test Recall@5 edis_task2 0.1771 union 0.0707
2 text -> image,text edis_task2 test Recall@10 edis_task2 0.2197 union 0.1027
2 text -> image,text webqa_task2 test Recall@1 webqa_task2 0.094 union 0.0104
2 text -> image,text webqa_task2 test Recall@5 webqa_task2 0.2298 union 0.039
2 text -> image,text webqa_task2 test Recall@10 webqa_task2 0.3102 union 0.0697
3 image -> text visualnews_task3 test Recall@1 visualnews_task3 0.004 union 0.0
3 image -> text visualnews_task3 test Recall@5 visualnews_task3 0.0184 union 0.0004
3 image -> text visualnews_task3 test Recall@10 visualnews_task3 0.0369 union 0.0008
3 image -> text mscoco_task3 test Recall@1 mscoco_task3_test 0.001 union 0.0
3 image -> text mscoco_task3 test Recall@5 mscoco_task3_test 0.6712 union 0.0004
3 image -> text mscoco_task3 test Recall@10 mscoco_task3_test 0.8522 union 0.001
3 image -> text fashion200k_task3 test Recall@10 fashion200k_task3 0.0814 union 0.0
3 image -> text fashion200k_task3 test Recall@20 fashion200k_task3 0.1287 union 0.0
3 image -> text fashion200k_task3 test Recall@50 fashion200k_task3 0.215 union 0.0004
4 image -> image nights_task4 test Recall@1 nights_task4 0.066 union 0.0
4 image -> image nights_task4 test Recall@5 nights_task4 0.2689 union 0.0024
4 image -> image nights_task4 test Recall@10 nights_task4 0.4439 union 0.009
6 image,text -> text oven_task6 test Recall@1 oven_task6 0.0576 union 0.0047
6 image,text -> text oven_task6 test Recall@5 oven_task6 0.1758 union 0.0186
6 image,text -> text oven_task6 test Recall@10 oven_task6 0.247 union 0.0339
6 image,text -> text infoseek_task6 test Recall@1 infoseek_task6 0.0405 union 0.0009
6 image,text -> text infoseek_task6 test Recall@5 infoseek_task6 0.1377 union 0.0061
6 image,text -> text infoseek_task6 test Recall@10 infoseek_task6 0.2011 union 0.0138
7 image,text -> image fashioniq_task7 test Recall@10 fashioniq_task7 0.1564 union 0.0818
7 image,text -> image fashioniq_task7 test Recall@20 fashioniq_task7 0.2207 union 0.1171
7 image,text -> image fashioniq_task7 test Recall@50 fashioniq_task7 0.3205 union 0.1894
7 image,text -> image cirr_task7 test Recall@1 cirr_task7 0.0281 union 0.0185
7 image,text -> image cirr_task7 test Recall@5 cirr_task7 0.3475 union 0.1432
7 image,text -> image cirr_task7 test Recall@10 cirr_task7 0.4703 union 0.2153
8 image,text -> image,text oven_task8 test Recall@1 oven_task8 0.3581 union 0.1942
8 image,text -> image,text oven_task8 test Recall@5 oven_task8 0.5641 union 0.3501
8 image,text -> image,text oven_task8 test Recall@10 oven_task8 0.637 union 0.423
8 image,text -> image,text infoseek_task8 test Recall@1 infoseek_task8 0.165 union 0.0884
8 image,text -> image,text infoseek_task8 test Recall@5 infoseek_task8 0.3319 union 0.2085
8 image,text -> image,text infoseek_task8 test Recall@10 infoseek_task8 0.4188 union 0.2864

Could you help to identify the potential reasons? Thanks a lot!

Questions about reproduction

Hi, first of all, thanks for the great resource. I'm now trying to reproduce the results of UniIR CLIP score fusion in Table 11. I have two questions regarding the results.

  1. There is an obvious difference between my reproduction and the paper's results on MSCOCO task 1 and Visual News task 4. Is there any possible issue here?
  2. I observe slight variations between my reproduction and the paper's results on the other tasks. Do these variations come from the randomly sampled instructions, and are they an acceptable level of variation?

[Screenshot of the reproduction results attached (2024-06-25).]

custom dataset retrieval

First of all, thank you for sharing your excellent research with the community.

I have a small question: Can you provide a tutorial or script to make UniIR compatible with custom datasets? This would be of great help and impact to the community.

Question regarding CLIP Zero-shot Evaluation Methodology in Table 6 of the paper

Thank you for your good work!

I have a question regarding the evaluation methodology for CLIP Zero-shot as presented in Table 6 of the paper.

It seems that this section utilizes the original CLIP model, which inherently does not support multimodal joint encoding. Specifically, in tasks such as q_t --> (c_i, c_t), how are the (c_i, c_t) pairs encoded? Do you also apply score-level fusion directly there?

I would greatly appreciate any insights you could provide on this matter.

Data split annotation

Nice work!
I found that the data split strategy in the Appendix has several "random" parts. Could you share the specific annotation files describing how the data is split for all 10 datasets?

About the number of hard negative candidates

I want to know whether your models are trained with hard negatives.
In the paper's data sections (Sections 3.2 and 6.3), it seems that an M-BEIR query may have hard negative candidates.
But in the code's training config (e.g., src/models/uniir_clip/clip_scorefusion/configs_scripts/large/train/inbatch/inbatch.yaml), the number of hard negatives is set to 0.
So, I want to know whether the models in Table 2 are trained with hard negatives.
