
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

Xiang Li, Jian Ding, Mohamed Elhoseiny

RSGPT: A Remote Sensing Vision Language Model and Benchmark

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li☨

VRSBench

VRSBench is a Versatile Vision-Language Benchmark for Remote Sensing Image Understanding. It consists of 29,614 remote sensing images with detailed captions, 52,472 object references, and 123,221 visual question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks.

🗓️ TODO

  • Release code and models for the baseline methods.
  • [2024.06.19] We release the instructions and code for calling GPT-4V to obtain initial annotations.
  • [2024.06.19] We release VRSBench, a Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. VRSBench contains 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. Check the VRSBench Project Page.

Using datasets

The dataset can be downloaded from link and used via the Hugging Face datasets library. To load the dataset, you can use the following code snippet:

from datasets import load_dataset

# Stream records on demand instead of downloading the full dataset.
ds = load_dataset("xiang709/VRSBench", streaming=True)
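
With streaming enabled, records can be iterated lazily. The following is a minimal sketch, assuming the dataset exposes a train split; the exact field names of each record are not documented in this README:

# Fetch a single record without downloading the whole dataset.
sample = next(iter(ds["train"]))
print(sample.keys())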

Dataset curation

To construct our VRSBench dataset, we employed multiple data engineering steps, including attribute extraction, prompt engineering, GPT-4 inference, and human verification.

  • Attribute Extraction: We extract image information, including the source and resolution, as well as object information (category, bounding box, color, absolute and relative position, and absolute and relative size) from existing object detection datasets. Please check extract_patch_json.py for attribute extraction details.
  • Prompt Engineering: We carefully design instructions to prompt GPT-4V to create detailed image captions, object referring expressions, and question-answer pairs. Please check instruction.txt for the detailed instructions.
  • GPT-4 inference: Given the input prompts, we call the OpenAI API to automatically generate image captions, object referring expressions, and question-answer pairs. Use extract_patch_json.py to obtain initial annotations for the image captioning, visual grounding, and VQA tasks using GPT-4V; a minimal sketch of such an API call is shown after this list.
  • Human verification: To improve the quality of the dataset, we engage human annotators to validate each annotation generated by GPT-4V.
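
The sketch below illustrates the GPT-4 inference step with the OpenAI Python client. It is a minimal example under stated assumptions, not the project's actual extract_patch_json.py: the model name, prompt handling, and image encoding are placeholders, and the real prompts live in instruction.txt.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_patch(image_path, instruction):
    # Encode the image patch so it can be sent inline with the prompt.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},  # prompt from instruction.txt
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content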

Model Training

For the above three tasks, we benchmark state-of-the-art models, including LLaVA-1.5, MiniGPT-v2, Mini-Gemini, and GeoChat, to demonstrate the potential of large vision-language models for remote sensing image understanding. To ensure a fair comparison, we start from models pretrained on large-scale image-text alignment datasets and finetune each method on the training set of our VRSBench dataset for 5 epochs. Following GeoChat, we use LoRA finetuning with a rank of 64 for all compared methods; a minimal sketch of such a configuration is shown below.
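
For illustration, a rank-64 LoRA setup with Hugging Face PEFT might look like the following sketch. The checkpoint path, lora_alpha, dropout, and target modules are assumed placeholders, since each baseline ships its own training code:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint; each baseline provides its own pretrained weights.
model = AutoModelForCausalLM.from_pretrained("path/to/pretrained-baseline")

# LoRA with rank 64, matching the setting described above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,                        # assumed value
    lora_dropout=0.05,                    # assumed value
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable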

Use prepare_geochat_eval_all.ipynb to prepare the VRSBench evaluation files for the image captioning, visual grounding, and VQA tasks.

Benchmark Results

Image Captioning Results

Method           BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE_L  CIDEr  Avg_L
GeoChat w/o ft   13.9    6.6     3.0     1.4     7.8     13.2     0.4    36
GPT-4V           37.2    22.5    13.7    8.6     20.9    30.1     19.1   67
MiniGPT-v2       36.8    22.4    13.9    8.7     17.1    30.8     21.4   37
LLaVA-1.5        48.1    31.5    21.2    14.7    21.9    36.9     33.9   49
GeoChat          46.7    30.2    20.1    13.8    21.1    35.2     28.2   52
Mini-Gemini      47.6    31.1    20.9    14.3    21.5    36.8     33.5   47

Caption: Detailed image caption performance on the VRSBench dataset. Avg_L denotes the average length (in words) of the generated captions.

Object Referring Results

Method           Unique             Non-unique         All
                 Acc@0.5  Acc@0.7   Acc@0.5  Acc@0.7   Acc@0.5  Acc@0.7
GeoChat w/o ft   20.7     5.4       7.3      1.7       12.9     3.2
GPT-4V           8.6      2.2       2.5      0.4       5.1      1.1
MiniGPT-v2       40.7     18.9      32.4     15.2      35.8     16.8
LLaVA-1.5        54.8     28.7      23.1     7.0       36.3     16.0
GeoChat          55.1     31.2      28.5     10.1      39.6     19.1
Mini-Gemini      45.2     19.4      17.4     4.6       29.0     10.8

Caption: Visual grounding performance on the VRSBench dataset.

Visual Question Answering Results

Method           Category  Presence  Quantity  Color  Shape  Size  Position  Direction  Scene  Reasoning  Avg.
# VQAs           5435      7789      6374      3550   1422   1011  5829      477        4620   902
GeoChat w/o ft   48.5      85.9      19.2      17.0   18.3   32.0  43.4      42.1       44.2   57.4       40.8
GPT-4V           67.0      87.6      45.6      71.0   70.8   54.3  67.2      50.7       69.8   72.4       65.6
MiniGPT-v2       46.2      74.1      47.3      44.4   28.6   17.2  23.3      15.3       38.7   36.3       37.1
LLaVA-1.5        62.8      89.2      50.4      57.8   58.5   52.3  56.9      50.7       66.0   64.9       60.9
GeoChat          60.4      89.9      47.5      58.7   59.1   52.3  57.0      50.3       66.1   64.9       60.6
Mini-Gemini      58.7      89.4      50.0      57.9   57.9   53.7  54.8      50.1       65.0   64.3       60.2

Caption: Visual question answering performance on the VRSBench dataset.

Licensing Information

The dataset is released under the CC-BY-4.0 license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Related Projects

  • RSGPT. The first GPT-based large vision-language model in remote sensing.
  • RSVLM. A comprehensive survey of vision-language models in remote sensing.
  • MiniGPT-v2.

📜 Citation

@article{li2024vrsbench,
  title={VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding},
  author={Xiang Li and Jian Ding and Mohamed Elhoseiny},
  journal={arXiv preprint arXiv:2406.12384},
  year={2024}
}

🙏 Acknowledgement

Our VRSBench dataset is built on top of the DOTA-v2 and DIOR datasets.

We are thankful to LLaVA-1.5, MiniGPT-v2, Mini-Gemini, and GeoChat for releasing their models and code as open-source contributions.


