
emu's People

Contributors

ifsheldon · quan-sun · ryanzhangfan · slothercui · wxinlong · yqy2001 · zhangxiaosong18


emu's Issues

Question about training objectives

In Section 2.2 of the paper, it is mentioned that there are two types of losses:

  • ordinary language modeling cross-entropy loss for discrete text tokens.
  • L2 regression loss for continuous visual embeddings.

I'm wondering how the L2 loss is calculated.
My understanding is that it is the difference between the predicted visual embedding (after the regression head) and the corresponding embedding output by the Causal Transformer.
Can you confirm whether this is correct, or provide more detail on how the L2 regression loss is computed?
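
To make the question concrete, here is a minimal sketch of how I imagine the two objectives are combined, assuming the text logits/labels and the regression-head outputs plus the Causal Transformer's target embeddings are already gathered at their respective positions; this is only my reading of Section 2.2, not the authors' implementation:

import torch
import torch.nn.functional as F

def training_loss(text_logits, text_labels, pred_visual, target_visual, l2_weight=1.0):
    """Sketch of the two objectives in Section 2.2 (my reading, not the official code).

    text_logits:   (B, T, vocab) logits at text-token positions
    text_labels:   (B, T) next-token targets, -100 for ignored positions
    pred_visual:   (B, N, D) regression-head outputs at visual positions
    target_visual: (B, N, D) visual embeddings from the Causal Transformer
    """
    # ordinary language-modeling cross-entropy on discrete text tokens
    ce = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )
    # L2 regression between predicted and target continuous visual embeddings
    l2 = F.mse_loss(pred_visual, target_visual)
    return ce + l2_weight * l2
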
Thank you

Inference with video

Hi, you guys have done great work.
Can you guide me on how to run inference with video input? And are you planning to release the training and finetuning code for this model?

Are there any smaller variants?

Hi, excellent work!
I noticed that only the weights for the large (14B) model are available in the GitHub repository. Would it be possible to provide weights for smaller variants, such as a 7B model? I believe this would enable broader applications of this work.

Inquiry Regarding the Integration of Domain-Specific Knowledge into Emu2 for Enhanced Multimodal Learning

Dear Emu2 Development Team,

I hope this message finds you well. I am reaching out to discuss the potential for integrating domain-specific knowledge into the Emu2 framework to further enhance its multimodal learning capabilities. As a researcher deeply invested in the intersection of AI and specialised domains, I am particularly interested in how Emu2 could be tailored to understand and generate content within specific fields such as medical imaging, legal document analysis, or engineering design.

The impressive in-context learning abilities and the state-of-the-art performance of Emu2 on various benchmarks suggest that it has a robust foundation for such an expansion. However, the nuances and complexities of domain-specific data present unique challenges that may require additional fine-tuning or the incorporation of expert knowledge bases.

Could you please shed light on the following aspects:

  1. The feasibility of fine-tuning Emu2 with domain-specific datasets, and whether there are any existing efforts or planned updates in this direction.
  2. The potential for Emu2 to interface with external knowledge bases or ontologies that could provide a structured understanding of domain-specific terminology and concepts.
  3. Any considerations or recommendations you might have for researchers looking to adapt Emu2 for specialised applications, including but not limited to, data preparation, model training, and evaluation metrics.

I believe that enhancing Emu2's capabilities with domain-specific intelligence could open up new frontiers for applied AI research and practical applications. I am eager to explore collaborative opportunities or contribute to the ongoing development efforts to realise this vision.

Thank you for your time and consideration. I look forward to your insights and guidance on this matter.

Best regards,
yihong1120

Error running `image_inference.py`

Hi, thanks for your great work!

I am running into a problem when trying to generate images with Emu.

Traceback (most recent call last):                                                                                       
  File "/home/shibo/Emu/image_inference.py", line 36, in <module>                                                        
    pipeline = EmuGenerationPipeline.from_pretrained(                                                                    
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                    
  File "/home/shibo/Emu/models/pipeline.py", line 254, in from_pretrained                                                
    return cls(                                                                                                          
           ^^^^                                                                                                          
  File "/home/shibo/Emu/models/pipeline.py", line 54, in __init__                                                        
    self.emu_encoder = self.prepare_emu("Emu-14B", multimodal_model, **kwargs)                                           
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                           
  File "/home/shibo/Emu/models/pipeline.py", line 232, in prepare_emu                                                    
    model.load_state_dict(ckpt, strict=True)                                                                             
  File "/home/shibo/anaconda3/envs/emu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict                                                                                                                    
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(                                            
RuntimeError: Error(s) in loading state_dict for Emu:                                                                    
        Unexpected key(s) in state_dict: "decoder.lm.model.layers.0.self_attn.rotary_emb.inv_freq", "decoder.lm.model.layers.1.self_attn.rotary_emb.inv_freq", ..., "decoder.lm.model.layers.39.self_attn.rotary_emb.inv_freq".

Do you have any ideas on the problem?
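
For context: this error usually indicates a transformers version mismatch. Checkpoints exported with older transformers releases include the rotary_emb.inv_freq buffers, while newer releases no longer register them in the state dict, so strict loading flags them as unexpected. A hedged workaround, assuming ckpt is the state dict loaded in models/pipeline.py, is to drop those keys or load non-strictly:

# drop the stale rotary-embedding buffers before loading
ckpt = {k: v for k, v in ckpt.items() if not k.endswith("rotary_emb.inv_freq")}
model.load_state_dict(ckpt, strict=True)

# alternatively, tolerate the extra keys:
# msg = model.load_state_dict(ckpt, strict=False)
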

What's the difference between Emu_pretain.pt and pytorch_model.bin?

Hi! Thanks for sharing your awesome work! I have a question about the difference between pytorch_model.bin and Emu_pretain.pt. In other words, if I only want to test the image captioning ability of Emu, is Emu_pretain.pt enough?

Moreover, would you mind publishing the md5 of each checkpoint? I hit an error when loading a checkpoint, so I want to check whether the file is corrupted. Looking forward to your reply, and thanks for your great work again!
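
In the meantime, a standard-library way to hash a local checkpoint for comparison once official checksums are published (purely illustrative):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Stream a large checkpoint file through MD5 without loading it into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("Emu_pretain.pt"))  # compare with the value the authors publish
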

Image blending prompt

Hey, thanks so much for releasing your great paper and making the code and weights public. I am quite interested in playing around with the model for the task of image blending. Could you please provide the exact prompt used for the image blending results in Figure 1 of the paper?

Image Generation

Thanks for your amazing work!
Is there any plan to release the image generation part?

Multi-GPU support

Can you support multi-GPU inference? The model is too large to fit on a single consumer GPU; even a 4090 does not have enough memory. I think two 4090s would be enough.
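
A later issue in this list shows the full accelerate-based recipe for Emu2-Chat; the same pattern is sketched below for the Hugging Face BAAI/Emu2 checkpoint. The 22GiB-per-card budget and the local checkpoint path are placeholders, and whether the original Emu-14B pipeline accepts a device_map the same way is untested:

import torch
from transformers import AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

# build the model skeleton on the meta device, then shard the real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2", torch_dtype=torch.bfloat16, trust_remote_code=True)

# 22GiB per card is a guess for two 4090s; tune to what actually fits
device_map = infer_auto_device_map(
    model, max_memory={0: "22GiB", 1: "22GiB"},
    no_split_module_classes=["Block", "LlamaDecoderLayer"])

model = load_checkpoint_and_dispatch(
    model, "path/to/local/Emu2/checkpoint", device_map=device_map).eval()
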

Request for Detailed Explanation of Causal Transformer Module

Dear Authors,

I've been reading your work on the Causal Transformer module and I find it fascinating. However, I'm having difficulty understanding some aspects of it, especially how the module transforms 2D spatial visual signals to 1D causal sequences in a latent space.

Specifically, I'm struggling to understand the process detailed in this part of your paper:

"Given an image I with its encodings g(I) from EVA-CLIP, Causal Transformer accepts randomly initialized embeddings {e1,e2,...,eN} as input, and outputs N embeddings {z1,z2,...,zN} that capture the causal dependency of the given image: z1,z2,...,zN = CausalTransformer (g(I),{e1,e2,...,eN}) (1)"

Could you please provide a more detailed explanation or possibly a diagrammatic illustration of this process? I believe this would greatly aid my understanding, as well as that of other readers who may be having a similar issue.
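
To show where my understanding currently stands, here is how I picture Eq. (1): a small transformer decoder in which N learnable queries attend causally to each other and cross-attend to the frozen EVA-CLIP features. All module names and sizes below are my own guesses, not the released code:

import torch
import torch.nn as nn

class CausalTransformerSketch(nn.Module):
    """Illustrative reading of Eq. (1): N learnable causal queries over image features."""
    def __init__(self, n_causal=32, dim=1024, depth=2, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_causal, dim))  # {e_1, ..., e_N}
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        mask = torch.triu(torch.ones(n_causal, n_causal), diagonal=1).bool()
        self.register_buffer("causal_mask", mask)

    def forward(self, image_feats):            # image_feats: (B, L, dim) from EVA-CLIP
        B = image_feats.size(0)
        e = self.queries.unsqueeze(0).expand(B, -1, -1)
        # queries self-attend with a causal mask and cross-attend to the 2D patch
        # features, turning the patch grid into a 1D causal sequence {z_1, ..., z_N}
        z = self.decoder(tgt=e, memory=image_feats, tgt_mask=self.causal_mask)
        return z
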

Online demo vs. inference example: results differ

I'm interested in this work, but I have some confusion. The output is very rich when I test the online demo, but much shorter when I run instruct_example() in inference.py with the same image and question. I'm wondering whether the online demo uses different settings or handling?
000001163
Question: Describe this image.
Emu-Online: In this image, we see two young boys running across a grassy field. They are both wearing shorts and appear to be having a lot of fun as they run towards the camera. The sun is shining brightly in the background, casting a warm glow over the scene. The grass is green and lush, and there are some trees in the distance. The boys are laughing and seem to be enjoying each other's company as they run. Overall, this image conveys a sense of joy and playfulness, as well as a connection between the two boys.
inference.py: In this image, we can see two young boys running across a grassy field. They are both wearing shorts and appear to be enjoying themselves as they run. The sky is blue and there are trees in the background.
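
Differences like this often come down to decoding hyperparameters rather than the model itself; as a purely illustrative experiment (names follow the repository's inference.py, and none of these values are claimed to match the online demo), the knobs exposed by the generate call can be varied:

# illustrative sweep of the decoding arguments used in inference.py
output_text = emu_model.generate(
    samples,                 # {"image": image, "prompt": prompt} as in inference.py
    max_new_tokens=512,      # allow longer answers
    num_beams=5,
    length_penalty=1.0,      # > 0 favours longer beams than 0.0
    repetition_penalty=1.0,
)[0].strip()
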

Image generation evaluation

Hi, thanks for this great work!

Do you plan to release the evaluation script of image generation (ms-coco)?

Weight files take up too much disk space

Trying to run your demo, but I get a "No space left on device" error while loading the model; it takes more than 60 GB of disk. On your Hugging Face page there seem to be 15 files of about 10 GB each. Are they all needed?

Thanks!

Will you share the training loss log?

Hi @yqy2001~ Thanks for your interesting work.
I'm now trying to reproduce your training process and I'm wondering if you can share your training loss curve. This will be very helpful.
Thanks~

Offline inference differs from your demo

image
The output of the model is very short, and even adding "answer in more than xxx words" to the prompt doesn't help. Here is my test code; any help is appreciated!

from PIL import Image
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

# build the model skeleton without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2-Chat",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        cache_dir="./ckpt")

# split the layers across two GPUs, keeping whole blocks on a single device
device_map = infer_auto_device_map(
    model,
    max_memory={0: '50GiB', 1: '50GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# input and output logits should be on the same device
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    './ckpt/models--BAAI--Emu2-Chat/snapshots/20ea30b04f8fee599cf97535e655c200df728501',
    device_map=device_map).eval()

frms = [
    "img1.jpg",
    "img2.jpg",
    "img3.jpg",
    "img4.jpg",
    "img5.jpg",
    "img6.jpg",
    "img7.jpg",
    "img8.jpg",
]
video = [Image.open(frm).convert('RGB') for frm in frms]

start_time = time.time()

# one video placeholder per frame
query = "{}Describe the video in details.".format("[<VID_PLH>]" * len(frms))

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    video=video)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        video=inputs["video"].to(torch.bfloat16),
        max_new_tokens=2048,
        length_penalty=-1)   # length_penalty is only used by beam-based generation

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("infer cost", time.time() - start_time)
print(output_text)

Question about training

Thanks for your great work first!
I found that the code uses "emu_encoder.decoder.lm.generate()" to produce the text response and "emu_encoder.decoder.lm.model()" to produce the latent image embeddings. So how can I output both the text and the image embeddings to reproduce your training process?
Or is training done by first calling "emu_encoder.decoder.lm.generate()" to generate the text, and then calling "emu_encoder.decoder.lm.model()" to generate the image embeddings?
Thanks for your reply!

emu2-gen A100 OOM

pipe = DiffusionPipeline.from_pretrained(
    path,
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
    low_cpu_mem_usage=True)
pipe.to("cuda")
print(pipe)

prompt = "impressionist painting of an astronaut in a jungle"
ret = pipe(prompt)

prompt = [image, "wearing a red hat on the head."]
ret = pipe(prompt)

OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB

An A100 with 80 GB of memory goes OOM. Can the model be loaded across multiple GPUs, or with CPU offload?
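
For what it's worth, diffusers exposes generic offloading hooks that may apply here, though whether the custom pipeline_emu2_gen pipeline cooperates with them is untested:

# instead of pipe.to("cuda"), keep weights on CPU and move each submodule to the
# GPU only while it runs (requires accelerate)
pipe.enable_model_cpu_offload()

# more aggressive (and slower): offload at the level of individual weights
# pipe.enable_sequential_cpu_offload()
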

Tools for processing YT-Storyboard-1B

Inspiring work on multimodal transformers!
Would you please share how you download storyboard images and subtitles from YouTube videos? It would be even better if you released your YT-Storyboard-1B dataset!

Batch inference.

Hi! I was wondering if there is any way to get batch inference running with Emu.

Thanks.

The in-context learning sample for pretrained model does not work as expected.

Hello. Thank you for sharing such great work. I am trying to run the samples in inference.py. The instruction-tuned example worked perfectly. However, the in-context learning example for the pretrained model did not work as expected. Here is the log (I masked paths for security reasons):

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/home/.../miniconda3/envs/emu/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:386: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/home/.../miniconda3/envs/emu/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:396: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `None` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
=====> model_cfg: {'embed_dim': 1024, 'vision_cfg': {'image_size': 224, 'layers': 40, 'width': 1408, 'head_width': 88, 'mlp_ratio': 4.3637, 'patch_size': 14, 'eva_model_name': 'eva-clip-g-14-x', 'drop_path_rate': 0, 'xattn': True, 'freeze': False}, 'multimodal_cfg': {'name': 'llama-13B', 'xattn': True, 'n_causal': 32, 'freeze': False}, 'vladapter_cfg': {'name': 'cformer', 'n_causal': 32}}
The Special Tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]', 'additional_special_tokens': ['[/IMG]', '<image>', '[IMG]']}
Vocab Size: 32004
image_token_id: 32002
[IMG] token id: 32003
[/IMG] token id: 32001
=====> loading from ckpt_path /groups/.../Emu/pretrain/multimodal_encoder/pytorch_model.bin
=====> get model.load_state_dict msg: _IncompatibleKeys(missing_keys=[], unexpected_keys=['decoder.lm.model.layers.0.self_attn.rotary_emb.inv_freq', 'decoder.lm.model.layers.1.self_attn.rotary_emb.inv_freq', ..., 'decoder.lm.model.layers.39.self_attn.rotary_emb.inv_freq'])
===> prompt: [IMG]<image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>[/IMG]There are two dogs.[IMG]<image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>[/IMG]There are three pandas.[IMG]<image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>[/IMG]
===> output: dog dog dog dog dog dog dog dog dog dog dog dog ... (the word "dog" repeated for the entire output)

The last image is a sunflower, so the model is supposed to generate something related to it, but instead it generated a repetition of 'dog'.
Do you have any clue as to the cause, or do you get the same result? Thanks in advance.

Multi-image experiments

Hey,

I had a question regarding the specific prompt setting for the multi-image results from figure 4 (left side) of the paper. From a brief skim of your inference.py script and the EMU modeling code inside models, I think the modifications to make this work would be something like this:

    prompt = "You will be presented with some images: [IMG]ImageContent[/IMG]." \ 
             "You will be able to see the images after I provide them to you." \ 
             "Please answer my questions based on the given images." \
             "[USER]: [IMG]<image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image>[/IMG]" \
             "This is the first image." \
             "[IMG]<image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image>[/IMG]" \
             "This is the second image." \
             "What's the difference?" \
             "[ASSISTANT]:" \

    samples = {"image": img, "prompt": prompt}

    output_text = emu_model.generate(
        samples,
        max_new_tokens=512,
        num_beams=5,
        length_penalty=0.0,
        repetition_penalty=1.0,
    )[0].strip()

    print(f"===> caption output: {output_text}")

Please verify if this is the prompt that you used for the multi-image results in the paper, and let me know if it would work.

The other alternative would be to prepend the [USER] token before every new image-text sequence. However, since that would be out-of-distribution with respect to the instruction fine-tuning data format, I am not sure it would work.

Cannot reproduce the results on VisDial

Thanks for your great work.
Following the official evaluation guide, I find it difficult to reproduce the results reported in your paper on VisDial.
Could you please share the related codes for evaluating the dialog task? @yqy2001 @Quan-Sun

Eval Dataset

Hi, I have a question about the COCO dataset evaluation.
What is the meaning of benchmark/COCO/annotations/vqa_test.json and vqa_val_eval.json?
Or could you please give the links to these dependent datasets?

The n_query and v_query in EMU2

Thanks for the excellent work!
I want to understand the meaning of n_query and v_query. Why is an image worth 256 tokens while a video is worth only 64? Could you please provide the code for video inference? Thanks!

Implementation in generate_image()

Hi, I have a question regarding the image generation process, specifically the generate_image function at https://github.com/baaivision/Emu/blob/main/models/modeling_emu.py#L185

According to this function and the description in the paper, it forces the model to generate n_causal=32 <image> tokens consecutively.
In each decoding step, the function concatenates an <image> token to the generated sequence, then passes it through the model. Finally, outputs.hidden_state[-1][:, -n_causal:, :] is projected through self.decoder.lm.stu_regress_head.

So my confusion is, what is the difference between directly appending n_causal=32 <image> tokens to the text prompts, and decoding the output at once?
Because each <image> token is identical and there is no storage of intermediate outputs.
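
To make the question concrete, the single-pass alternative I have in mind would look roughly like this (attribute names follow the snippet above; I am not claiming the repository works this way):

# append n_causal identical <image> token ids to the prompt and run one forward pass
image_ids = torch.full((1, n_causal), image_token_id, dtype=torch.long, device=input_ids.device)
input_ids = torch.cat([input_ids, image_ids], dim=1)

outputs = emu_model.decoder.lm.model(input_ids, output_hidden_states=True)
visual_embeds = emu_model.decoder.lm.stu_regress_head(
    outputs.hidden_states[-1][:, -n_causal:, :])
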

Please correct me if I have misunderstood.

Question about the pretrained weights for the image encoder (EVA-CLIP)

Hey~
Thanks for your amazing work!
I have a question about the pretrained weights used in Emu's training.
It seems that you used a 1B version of EVA-02-CLIP, as mentioned in your paper, and eva-clip-g-14-x also appears in your code.
However, I could not find pretrained weights for the small-g (1B) version of EVA-02-CLIP, either in the official EVA-CLIP GitHub repo or in the Hugging Face model zoo. Instead I found the 1B version of EVA-01-CLIP (not 02) at https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA01_CLIP_g_14_psz14_s11B.pt.
Could you provide the specific link to the EVA-CLIP model weights you used?
Please forgive me if I am wrong.

Splitting the checkpoint into several smaller files

Though this is less important, I would like to ask if you could split the 30 GB checkpoint into several smaller files, as most models (e.g., LLaMA, Alpaca) do.
This would make the checkpoint easier to download, especially over an unstable network.
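
Until then, a hedged user-side workaround is to shard the state dict into size-bounded pieces before transfer and merge them back on arrival; the logic is simple (illustrative, not the repository's tooling):

import glob
import torch

def shard_state_dict(path, out_prefix, max_bytes=5 * 2**30):
    """Split a monolithic .pt state dict into roughly 5 GB shards (illustrative only)."""
    state = torch.load(path, map_location="cpu")
    shard, size, idx = {}, 0, 0
    for key, tensor in state.items():
        shard[key] = tensor
        size += tensor.numel() * tensor.element_size()
        if size >= max_bytes:
            torch.save(shard, f"{out_prefix}-{idx:05d}.pt")
            shard, size, idx = {}, 0, idx + 1
    if shard:
        torch.save(shard, f"{out_prefix}-{idx:05d}.pt")

def merge_shards(pattern):
    """Rebuild the full state dict from the shards before load_state_dict."""
    state = {}
    for f in sorted(glob.glob(pattern)):
        state.update(torch.load(f, map_location="cpu"))
    return state
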

Generate video from any prompt sequence

Hello, I saw the "generate video from any prompt sequence" feature shown on the project page, but I don't see it in the demo. Will you launch this feature?

Question about the visual autoencoder

Thanks for the great work! I have some questions about the checkpoints:

  1. It seems that BAAI/Emu2 does not include the weights of the visual decoder (diffusion UNet), but according to Section 2.2.3 of the paper, shouldn't Emu2 include the autoencoder-trained decoder?
  2. Emu2-Gen provides the weights of the visual decoder; can the visual encoder and decoder in BAAI/Emu2-Gen work together as an autoencoder?

Looking forward to your reply~
