
Awesome Multimodal Reasoning

Collection of papers and resources on how to unlock reasoning abilities in multimodal settings.

Animation from ViperGPT (Surís et al.)

Consider how difficult it would be to study from a book that lacks any figures, diagrams, or tables. We learn more effectively when we combine different data modalities, such as vision, language, and audio [1]. Recently, large language models (LLMs) have achieved remarkable results on complex reasoning tasks by generating intermediate steps before deducing the answer, a technique known as chain-of-thought (CoT) reasoning [2] [3]. However, most research on CoT reasoning involves only the language modality. This repository collects papers and resources on how to unlock these reasoning abilities in multimodal settings.
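
To make the idea concrete, here is a minimal sketch of CoT prompting and its multimodal variant. The `query_model` function is a hypothetical stand-in for any LLM or vision-language model API, not a real library call.

```python
def query_model(prompt: str, image_path: str | None = None) -> str:
    """Hypothetical model call; swap in a real LLM/VLM client here."""
    raise NotImplementedError

# Text-only CoT: ask the model to produce intermediate steps before the answer.
cot_prompt = (
    "Q: A cafeteria had 23 apples. It used 20 and bought 6 more. "
    "How many apples does it have now?\n"
    "A: Let's think step by step."
)

# Multimodal CoT: the same recipe, but the reasoning must also draw on
# visual evidence supplied alongside the text.
multimodal_prompt = (
    "Q: How many apples remain on the table in the image after two are removed?\n"
    "A: Let's think step by step."
)

# answer = query_model(cot_prompt)
# answer = query_model(multimodal_prompt, image_path="apples.jpg")
```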

Contents

Technique

End-to-end Models

  1. Learning to Reason: End-to-End Module Networks for Visual Question Answering. ICCV 2017

    Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko. [Paper], 2017.4

  2. Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan. [Blog] [Paper], 2022.4

  3. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Preprint

    Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. [Paper] [Code], 2023.1

  4. Language Is Not All You Need: Aligning Perception with Language Models. Preprint

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei. [Paper], 2023.2

  5. Prismer: A Vision-Language Model with An Ensemble of Experts. Preprint

    Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar. [Project] [Paper] [Code] [Demo], 2023.3

  6. PaLM-E: An Embodied Multimodal Language Model. Preprint

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence. [Project] [Paper], 2023.3

  7. GPT-4 Technical Report. Preprint

    OpenAI. [Blog] [Paper], 2023.3

  8. Visual Instruction Tuning. Preprint

    Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. [Project] [Paper] [Code] [Demo], 2023.4

  9. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Preprint

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. [Project] [Paper] [Code], 2023.4

  10. Otter: A Multi-Modal Model with In-Context Instruction Tuning. Preprint

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu. [Paper] [Code], 2023.5

  11. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. Preprint

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai. [Paper] [Code] [Demo], 2023.5
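
A recurring architectural pattern in the end-to-end models above (e.g. BLIP-2, LLaVA, MiniGPT-4) is to keep a pretrained vision encoder and LLM largely frozen while training a lightweight module that maps visual features into the LLM's token-embedding space. The PyTorch sketch below illustrates that pattern with a simple linear projection; all dimensions and module choices are illustrative assumptions, not any specific paper's configuration.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Trainable bridge from a frozen vision encoder to a frozen LLM."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# Toy tensors standing in for a frozen ViT's patch features and the
# LLM's embedded text prompt.
patch_features = torch.randn(1, 256, 1024)   # frozen vision encoder output
text_embeddings = torch.randn(1, 32, 4096)   # embedded text tokens

projector = VisionToLLMProjector()
visual_tokens = projector(patch_features)

# The LLM then attends over [visual_tokens; text_embeddings] jointly.
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```

The bridging module varies by system (BLIP-2 uses a Q-Former rather than a single linear layer), but the interface to the LLM is the same: a sequence of soft visual tokens prepended to the text.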

Prompting & In-context Learning

  1. Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021

    Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill. [Paper], 2021.6

  2. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. ICLR 2023

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence. [Project] [Paper] [Code], 2022.4

  3. Multimodal Chain-of-Thought Reasoning in Language Models. Preprint

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola. [Paper] [Code], 2023.2

  4. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. Preprint

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan. [Paper] [Code], 2023.3

  5. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. Preprint

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang. [Project] [Paper] [Code] [Demo], 2023.3

  6. Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings. Preprint

    Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, William Yang Wang. [Paper] [Code], 2023.5
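
Several of the prompting approaches above share a two-stage recipe, made explicit in Multimodal Chain-of-Thought (Zhang et al.): first generate a rationale conditioned on the text and the image, then infer the final answer conditioned on that rationale. A minimal sketch, where `vlm` is a hypothetical vision-language model call rather than a real API:

```python
def vlm(prompt: str, image_path: str) -> str:
    """Hypothetical vision-language model call."""
    raise NotImplementedError

def multimodal_cot(question: str, image_path: str) -> str:
    # Stage 1: rationale generation, conditioned on text and image.
    rationale = vlm(f"{question}\nExplain your reasoning step by step.",
                    image_path)
    # Stage 2: answer inference, conditioned on text, image, and rationale.
    return vlm(f"{question}\nReasoning: {rationale}\n"
               "Therefore, the answer is:", image_path)
```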

Compositional & Symbolic Approach

  1. Inferring and Executing Programs for Visual Reasoning. ICCV 2017

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick. [Project] [Paper] [Code], 2017.5

  2. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. NeurIPS 2018

    Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, Joshua B. Tenenbaum. [Project] [Paper] [Code], 2018.10

  3. Visual Programming: Compositional visual reasoning without training. CVPR 2023

    Tanmay Gupta, Aniruddha Kembhavi. [Project] [Paper] [Code], 2022.11

  4. ViperGPT: Visual Inference via Python Execution for Reasoning. Preprint

    Dídac Surís, Sachit Menon, Carl Vondrick. [Project] [Paper] [Code], 2023.3

  5. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace. Preprint

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang. [Paper] [Code], 2023.3

  6. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. Preprint

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao. [Project] [Paper] [Code], 2023.4
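
The common thread in this category is to have an LLM emit a short program over a library of vision tools and then execute it, rather than answer directly. The sketch below is a toy illustration of that loop; `detect_objects` and `generate_program` are hypothetical placeholders, not the actual APIs of ViperGPT, Visual Programming, or HuggingGPT.

```python
def detect_objects(image, category: str) -> list:
    """Hypothetical vision tool; a real system would run an object detector."""
    return []  # stub so the sketch runs end to end

def generate_program(question: str) -> str:
    """Hypothetical LLM call that writes Python against the tools above."""
    # For "How many cats are in the picture?" the LLM might emit:
    return "result = len(detect_objects(image, 'cat'))"

def answer(question: str, image) -> object:
    program = generate_program(question)
    scope = {"image": image, "detect_objects": detect_objects}
    exec(program, scope)    # execute the generated program
    return scope["result"]  # the program leaves its answer in `result`

print(answer("How many cats are in the picture?", image=None))  # -> 0
```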

Benchmark

  • SCIENCEQA Multimodal multiple-choice questions spanning diverse science topics, where answers are annotated with corresponding lectures and explanations.
  • ARO Systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order.
  • OK-VQA Visual question answering that requires methods which can draw upon outside knowledge to answer questions.
  • A-OKVQA Knowledge-based visual question answering benchmark.
  • NExT-QA Video question answering (VideoQA) benchmark to advance video understanding from describing to explaining the temporal actions.
  • GQA Compositional questions over real-world images.
  • VQA Questions about images that require an understanding of vision, language and commonsense knowledge.
  • VQAv2 2nd iteration of the Visual Question Answering Dataset (VQA).
  • TAG Questions that require understanding the textual cues in an image.
  • Bongard-HOI Visual reasoning benchmark on compositional learning of human-object interactions (HOIs) from natural images.
  • ARC General artificial intelligence benchmark, targeted at artificially intelligent systems that aim to emulate a human-like form of general fluid intelligence.
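
Most of the multiple-choice benchmarks above (e.g. SCIENCEQA, A-OKVQA) are scored the same straightforward way: the model selects one option per question, and accuracy is the fraction of exact matches. A minimal sketch, where `predict` and the record fields are illustrative assumptions rather than any benchmark's official schema or evaluation script:

```python
def predict(question: str, options: list[str], image_path: str) -> str:
    """Hypothetical model call returning one of the given options."""
    raise NotImplementedError

def accuracy(dataset: list[dict]) -> float:
    correct = 0
    for ex in dataset:
        pred = predict(ex["question"], ex["options"], ex["image"])
        correct += int(pred == ex["answer"])
    return correct / len(dataset)
```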

Other Useful Resources

Other Awesome Lists

  • LLM-Reasoning-Papers Collection of papers and resources on Reasoning in Large Language Models, including Chain-of-Thought, Instruction-Tuning, and others.
  • Chain-of-ThoughtsPapers A trend that started with "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models".
  • Prompt4ReasoningPapers Repository for the paper "Reasoning with Language Model Prompting: A Survey".
  • Deep-Reasoning-Papers Recent papers on neural-symbolic reasoning, logical reasoning, visual reasoning, planning, and other topics connecting deep learning and reasoning.

Contributing

  • Add a new paper or update an existing paper, considering which category the work belongs to.
  • Use the same format as existing entries to describe the work (see the example below).
  • Add the abstract link of the paper (/abs/ format if it is an arXiv publication).
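
For example, an entry follows this pattern:

  ViperGPT: Visual Inference via Python Execution for Reasoning. Preprint

    Dídac Surís, Sachit Menon, Carl Vondrick. [Project] [Paper] [Code], 2023.3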

Don't worry if you do something wrong; it will be fixed for you!

Contributors

  • atfortes
