morphtokens's Introduction

Auto-Encoding Morph-Tokens for Multimodal LLM

Kaihang Pan¹, Siliang Tang¹, Juncheng Li¹,²†, Zhaoyu Fan¹, Wei Chow¹, Shuicheng Yan³, Tat-Seng Chua², Yueting Zhuang¹, Hanwang Zhang³,⁴

¹Zhejiang University, ²National University of Singapore, ³Skywork AI, ⁴Nanyang Technological University

†Corresponding Author

Overview

We introduce Morph-Tokens to resolve the conflicting objectives of visual comprehension and generation. The term "morph" implies a transformation: the pre-MLLM visual tokens are not necessarily equal to the post-MLLM ones. Specifically, the pre-MLLM tokens carry abstract semantics and serve as visual prompts for comprehension tasks. The post-MLLM tokens, in contrast, are visually complete tokens for image generation, because the MLLM's strong comprehension ability recovers the visual features lost to abstraction. The accompanying framework figure illustrates our morph-token-based MLLM.

On this basis, we propose a 3-stage training strategy, illustrated in the accompanying figure. After training, the model excels at both multimodal comprehension and generation.
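To make the "morph" idea concrete, here is a minimal toy sketch of the data flow, not the authors' implementation. Every name in it (`encode_abstract`, `toy_mllm`, `decode`, `reconstruction_loss`) is hypothetical, and the "MLLM" is a stand-in that re-attaches hard-coded detail where a trained model would have to predict it. The only point it illustrates is that the objective sits on the reconstructed image, so the post-MLLM tokens are free to differ from the pre-MLLM ones.

```python
from typing import List, Tuple

# A toy token: (coarse value, fine detail). All names here are hypothetical.
Token = Tuple[float, float]

def encode_abstract(image: List[float]) -> List[Token]:
    """Toy encoder: keep only the mean of each pixel pair; detail is dropped."""
    return [((image[i] + image[i + 1]) / 2, 0.0) for i in range(0, len(image), 2)]

def toy_mllm(tokens: List[Token], recovered_detail: List[float]) -> List[Token]:
    """Stand-in for the MLLM: re-attaches detail, so outputs differ from inputs."""
    return [(coarse, d) for (coarse, _), d in zip(tokens, recovered_detail)]

def decode(tokens: List[Token]) -> List[float]:
    """Toy decoder: expand each (coarse, detail) token back into two pixels."""
    pixels: List[float] = []
    for coarse, detail in tokens:
        pixels += [coarse - detail / 2, coarse + detail / 2]
    return pixels

def reconstruction_loss(image: List[float], recon: List[float]) -> float:
    """Mean squared error between the original image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(image, recon)) / len(image)

image = [1.0, 3.0, 4.0, 8.0]
pre = encode_abstract(image)      # abstract pre-MLLM tokens (lossy)
detail = [2.0, 4.0]               # assumption: what a trained MLLM would recover
post = toy_mllm(pre, detail)      # morphed post-MLLM tokens

loss_abstract = reconstruction_loss(image, decode(pre))   # decoding abstract tokens
loss_morph = reconstruction_loss(image, decode(post))     # decoding morphed tokens
print(pre != post, loss_abstract, loss_morph)
```

In this toy setting, decoding the abstract tokens directly loses the within-pair detail, while decoding the morphed tokens reconstructs the image exactly. Forcing `post == pre` would rule out that recovery, which is the conflict the overview describes.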

Acknowledgment

We thank the following open-source projects:

LAVIS: A Library for Language-Vision Intelligence.
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models.
taming-transformers: Taming Transformers for High-Resolution Image Synthesis.
SEED: Making LLaMA SEE and Draw with SEED Tokenizer.

Citation

If you find this work useful, please consider citing our paper:

@misc{pan2024autoencoding,
      title={Auto-Encoding Morph-Tokens for Multimodal LLM}, 
      author={Kaihang Pan and Siliang Tang and Juncheng Li and Zhaoyu Fan and Wei Chow and Shuicheng Yan and Tat-Seng Chua and Yueting Zhuang and Hanwang Zhang},
      year={2024},
      eprint={2405.01926},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

morphtokens's Issues

Plans to release official checkpoints and training code

Thank you for your great work.

Do you have any plans to release the checkpoints and training code for the MorphTokens model?

Thank you.

Difference between morph-token and abstr-abstr(VQ) in the ablation study

abstr-abstr(VQ) also adopts a decoder-only transformer to turn post-MLLM tokens into complete visual tokens, just like the proposed morph-token. So what is the difference? Is it that abstr-abstr(VQ) requires the pre- and post-MLLM tokens to be the same?

How to use the code?

Hello, thank you for your excellent work! However, I'm a bit confused about your code. Would it be possible to release your training code?

Stage 2 and 3 training code

Thank you for the great work.

I found the Stage 1 training code in Morph/morph/models/morph.py.

But I can't find the Stage 2 and Stage 3 parts. Where should I look?

Thank you.
