Auto-Encoding Morph-Tokens for Multimodal LLM

Kaihang Pan1, Siliang Tang1, Juncheng Li1,2†, Zhaoyu Fan1, Wei Chow1, Shuicheng Yan3, Tat-Seng Chua2, Yueting Zhuang1, Hanwang Zhang3,4

1Zhejiang University, 2National University of Singapore, 3Skywork AI, 4Nanyang Technological University

†Corresponding Author

Overview

We introduce Morph-Tokens to resolve the conflicting objectives of visual comprehension and generation. The term "morph" implies a transformation: the pre-MLLM visual tokens are not necessarily equal to the post-MLLM ones. Specifically, the pre-MLLM tokens are abstract semantics that serve as visual prompts for comprehension tasks. In contrast, the post-MLLM tokens are visually complete tokens for image generation, because the MLLM's strong comprehension ability recovers the visual features lost to abstraction. The framework of our morph-token-based MLLM is shown in the following figure:

On this basis, we propose a 3-stage training strategy, as shown in the following figure. After training, the model shows remarkable abilities, excelling at both multimodal comprehension and generation.
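
The morph idea described in the overview can be illustrated with a toy data flow. This is only a sketch under stated assumptions, not the repository's code: `MorphPipeline`, `pool_pairs`, and `expand_pairs` are hypothetical names, the "image" is a short list of floats, and the "MLLM" is a fixed function instead of a trained model. It assumes, as the word "auto-encoding" in the title suggests, that the generation signal is measured on the reconstructed image and not on equality between pre-MLLM and post-MLLM tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


@dataclass
class MorphPipeline:
    # All three callables are hypothetical stand-ins, not the repository's API.
    encode: Callable[[Vector], Vector]  # image -> abstract pre-MLLM tokens
    mllm: Callable[[Vector], Vector]    # pre-MLLM tokens -> post-MLLM tokens
    decode: Callable[[Vector], Vector]  # post-MLLM tokens -> image

    def run(self, image: Vector) -> Tuple[Vector, Vector, Vector]:
        pre = self.encode(image)
        post = self.mllm(pre)
        return pre, post, self.decode(post)


def pool_pairs(image: Vector) -> Vector:
    """Toy abstraction: average adjacent pairs, halving the length (lossy)."""
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image) - 1, 2)]


def expand_pairs(tokens: Vector) -> Vector:
    """Toy 'MLLM': repeat each token twice to restore the original length."""
    return [t for t in tokens for _ in range(2)]


pipeline = MorphPipeline(encode=pool_pairs, mllm=expand_pairs, decode=list)
image = [1.0, 1.0, 3.0, 3.0]
pre, post, recon = pipeline.run(image)

# The tokens before and after the "MLLM" differ, yet the image is recovered.
tokens_differ = pre != post
reconstruction_loss = mse(recon, image)
print(tokens_differ, reconstruction_loss)
```

The toy image is piecewise constant, so pair pooling happens to lose nothing and the fixed expansion reconstructs it exactly. In the setting the overview describes, the recovery of lost visual detail would come from the trained MLLM, which this sketch does not model.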

Acknowledgment

We thank the following open-source projects:

  • LAVIS: A Library for Language-Vision Intelligence.
  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models.
  • taming-transformers: Taming Transformers for High-Resolution Image Synthesis.
  • SEED: Making LLaMA SEE and Draw with SEED Tokenizer.

Citation

If you find this work useful, please consider citing our paper as follows:

@misc{pan2024autoencoding,
      title={Auto-Encoding Morph-Tokens for Multimodal LLM}, 
      author={Kaihang Pan and Siliang Tang and Juncheng Li and Zhaoyu Fan and Wei Chow and Shuicheng Yan and Tat-Seng Chua and Yueting Zhuang and Hanwang Zhang},
      year={2024},
      eprint={2405.01926},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

morphtokens's Issues

How to use the code?

Hello, thank you for your excellent work! However, I'm a bit confused about your code. Would it be possible to release your training code?

Stage 2, 3 Training Part

Thank you for the great work.

In Morph/morph/models/morph.py, I found the Stage 1 training.

But I can't find the Stage 2 and 3 parts. Where should I look?

Thank you.
