A curated list of Visual Instruction Tuning papers and datasets.
Please feel free to open a pull request or an issue to add papers.
Name | Paper | Link | Notes |
---|---|---|---|
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated with GPT-4 (see the format sketch below the table) |
OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Link | Dataset for evaluation on multiple capabilities |
MIMIC-IT | Otter: A Multi-Modal Model with In-Context Instruction Tuning | Coming soon | Multimodal in-context instruction tuning |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | Aligned multimodal dataset for improving the model's usability and generation fluency |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
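
As a quick reference for what these instruction-tuning records look like, below is a minimal sketch of reading one sample from a LLaVA-style JSON file. The file path is a placeholder, and the field names (`image`, `conversations`, `from`, `value`) follow the format released with LLaVA-Instruct-150K; the other datasets in the table use their own formats.

```python
import json

# Minimal sketch: read one LLaVA-style visual instruction record.
# "llava_instruct_150k.json" is a placeholder path to a local copy of the data.
with open("llava_instruct_150k.json") as f:
    records = json.load(f)  # a list of instruction-following samples

sample = records[0]
print(sample["image"])  # COCO image filename the conversation is grounded in
for turn in sample["conversations"]:  # alternating "human" / "gpt" turns
    # The "<image>" placeholder in a human turn marks where the image is inserted.
    print(f'{turn["from"]}: {turn["value"]}')
```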
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | - |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | Github | Demo |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Evaluating Object Hallucination in Large Vision-Language Models | arXiv | 2023-05-17 | Github | - |
Transfer Visual Prompt Generator across LLMs | arXiv | 2023-05-02 | Github | Demo |
Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | - |
GPT-4 Technical Report | arXiv | 2023-03-15 | - | - |
PaLM-E: An Embodied Multimodal Language Model | arXiv | 2023-03-06 | - | Demo |
Language Is Not All You Need: Aligning Perception with Language Models | arXiv | 2023-02-27 | Github | - |
Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 | Github | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | arXiv | 2023-01-30 | Github | - |