
emu's People

Contributors

ifsheldon · quan-sun · ryanzhangfan · slothercui · wxinlong · yqy2001 · zhangxiaosong18


emu's Issues

Question about training objectives

In Section 2.2 of the paper, it is mentioned that there are two types of losses:

  • ordinary language modeling cross-entropy loss for discrete text tokens.
  • L2 regression loss for continuous visual embeddings.

I'm wondering how the L2 loss is calculated.
My understanding is that it is the difference between the predicted visual embedding (after the regression head) and the corresponding embedding output by the Causal Transformer.
Can you confirm whether this is correct, or provide more detail on how the L2 regression loss is computed?
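
To make the question concrete, here is a minimal sketch of how I imagine the two objectives are combined, assuming the text logits/labels and the regression-head outputs plus the Causal Transformer's target embeddings are already gathered at their respective positions; this is only my reading of Section 2.2, not the authors' implementation:

import torch
import torch.nn.functional as F

def training_loss(text_logits, text_labels, pred_visual, target_visual, l2_weight=1.0):
    """Sketch of the two objectives in Section 2.2 (my reading, not the official code).

    text_logits:   (B, T, vocab) logits at text-token positions
    text_labels:   (B, T) next-token targets, -100 for ignored positions
    pred_visual:   (B, N, D) regression-head outputs at visual positions
    target_visual: (B, N, D) visual embeddings from the Causal Transformer
    """
    # ordinary language-modeling cross-entropy on discrete text tokens
    ce = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )
    # L2 regression between predicted and target continuous visual embeddings
    l2 = F.mse_loss(pred_visual, target_visual)
    return ce + l2_weight * l2
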
Thank you

Inference with video

Hi, you guys have done great work.
Can you guide me on how to run inference with video input? And are you planning to release the training and finetuning code for this model?

Are there any smaller variants?

Hi, excellent work!
I noticed that only the weights for the large (14B) model are available in the GitHub repository. Would it be possible to provide weights for smaller variants, such as a 7B model? I believe this would enable broader applications of this work.

Inquiry Regarding the Integration of Domain-Specific Knowledge into Emu2 for Enhanced Multimodal Learning

Dear Emu2 Development Team,

I hope this message finds you well. I am reaching out to discuss the potential for integrating domain-specific knowledge into the Emu2 framework to further enhance its multimodal learning capabilities. As a researcher deeply invested in the intersection of AI and specialised domains, I am particularly interested in how Emu2 could be tailored to understand and generate content within specific fields such as medical imaging, legal document analysis, or engineering design.

The impressive in-context learning abilities and the state-of-the-art performance of Emu2 on various benchmarks suggest that it has a robust foundation for such an expansion. However, the nuances and complexities of domain-specific data present unique challenges that may require additional fine-tuning or the incorporation of expert knowledge bases.

Could you please shed light on the following aspects:

  1. The feasibility of fine-tuning Emu2 with domain-specific datasets, and whether there are any existing efforts or planned updates in this direction.
  2. The potential for Emu2 to interface with external knowledge bases or ontologies that could provide a structured understanding of domain-specific terminology and concepts.
  3. Any considerations or recommendations you might have for researchers looking to adapt Emu2 for specialised applications, including but not limited to, data preparation, model training, and evaluation metrics.

I believe that enhancing Emu2's capabilities with domain-specific intelligence could open up new frontiers for applied AI research and practical applications. I am eager to explore collaborative opportunities or contribute to the ongoing development efforts to realise this vision.

Thank you for your time and consideration. I look forward to your insights and guidance on this matter.

Best regards,
yihong1120

Error running `image_inference.py`

Hi, thanks for your great work!

I am running into a problem when trying to generate images with Emu.

Traceback (most recent call last):                                                                                       
  File "/home/shibo/Emu/image_inference.py", line 36, in <module>                                                        
    pipeline = EmuGenerationPipeline.from_pretrained(                                                                    
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                    
  File "/home/shibo/Emu/models/pipeline.py", line 254, in from_pretrained                                                
    return cls(                                                                                                          
           ^^^^                                                                                                          
  File "/home/shibo/Emu/models/pipeline.py", line 54, in __init__                                                        
    self.emu_encoder = self.prepare_emu("Emu-14B", multimodal_model, **kwargs)                                           
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                           
  File "/home/shibo/Emu/models/pipeline.py", line 232, in prepare_emu                                                    
    model.load_state_dict(ckpt, strict=True)                                                                             
  File "/home/shibo/anaconda3/envs/emu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict                                                                                                                    
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(                                            
RuntimeError: Error(s) in loading state_dict for Emu:                                                                    
        Unexpected key(s) in state_dict: "decoder.lm.model.layers.0.self_attn.rotary_emb.inv_freq", "decoder.lm.model.layers.1.self_attn.rotary_emb.inv_freq", ..., "decoder.lm.model.layers.39.self_attn.rotary_emb.inv_freq".

Do you have any ideas on the problem?
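
For context: this error usually indicates a transformers version mismatch. Checkpoints exported with older transformers releases include the rotary_emb.inv_freq buffers, while newer releases no longer register them in the state dict, so strict loading flags them as unexpected. A hedged workaround, assuming ckpt is the state dict loaded in models/pipeline.py, is to drop those keys or load non-strictly:

# drop the stale rotary-embedding buffers before loading
ckpt = {k: v for k, v in ckpt.items() if not k.endswith("rotary_emb.inv_freq")}
model.load_state_dict(ckpt, strict=True)

# alternatively, tolerate the extra keys:
# msg = model.load_state_dict(ckpt, strict=False)
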

What's the difference between Emu_pretain.pt and pytorch_model.bin?

Hi! Thanks for sharing your awesome work! I have a question about the difference between pytorch_model.bin and Emu_pretain.pt. In other words, if I only want to test the image captioning ability of Emu, is Emu_pretain.pt enough?

Moreover, would you mind publishing the md5 of each checkpoint? I hit an error when loading a checkpoint, so I want to check whether the file is corrupted. Looking forward to your reply, and thanks for your great work again!
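
In the meantime, a standard-library way to hash a local checkpoint for comparison once official checksums are published (purely illustrative):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Stream a large checkpoint file through MD5 without loading it into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("Emu_pretain.pt"))  # compare with the value the authors publish
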

Image blending prompt

Hey, thanks so much for releasing your great paper and making the code and weights public. I am quite interested in playing around with the model for the task of image blending. Could you please provide the exact prompt used for the image blending results in Figure 1 of the paper?

Image Generation

Thanks for your amazing work!
Is there any plan to release the image generation part?

Multi-GPU support

Can you support multi-GPU inference? The model is too large to fit on a single consumer GPU; even a 4090 does not have enough memory. I think two 4090s would be enough.
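
A later issue in this list shows the full accelerate-based recipe for Emu2-Chat; the same pattern is sketched below for the Hugging Face BAAI/Emu2 checkpoint. The 22GiB-per-card budget and the local checkpoint path are placeholders, and whether the original Emu-14B pipeline accepts a device_map the same way is untested:

import torch
from transformers import AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

# build the model skeleton on the meta device, then shard the real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2", torch_dtype=torch.bfloat16, trust_remote_code=True)

# 22GiB per card is a guess for two 4090s; tune to what actually fits
device_map = infer_auto_device_map(
    model, max_memory={0: "22GiB", 1: "22GiB"},
    no_split_module_classes=["Block", "LlamaDecoderLayer"])

model = load_checkpoint_and_dispatch(
    model, "path/to/local/Emu2/checkpoint", device_map=device_map).eval()
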

Request for Detailed Explanation of Causal Transformer Module

Dear Authors,

I've been reading your work on the Causal Transformer module and I find it fascinating. However, I'm having difficulty understanding some aspects of it, especially how the module transforms 2D spatial visual signals to 1D causal sequences in a latent space.

Specifically, I'm struggling to understand the process detailed in this part of your paper:

"Given an image I with its encodings g(I) from EVA-CLIP, Causal Transformer accepts randomly initialized embeddings {e1,e2,...,eN} as input, and outputs N embeddings {z1,z2,...,zN} that capture the causal dependency of the given image: z1,z2,...,zN = CausalTransformer (g(I),{e1,e2,...,eN}) (1)"

Could you please provide a more detailed explanation or possibly a diagrammatic illustration of this process? I believe this would greatly aid my understanding, as well as that of other readers who may be having a similar issue.
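
To show where my understanding currently stands, here is how I picture Eq. (1): a small transformer decoder in which N learnable queries attend causally to each other and cross-attend to the frozen EVA-CLIP features. All module names and sizes below are my own guesses, not the released code:

import torch
import torch.nn as nn

class CausalTransformerSketch(nn.Module):
    """Illustrative reading of Eq. (1): N learnable causal queries over image features."""
    def __init__(self, n_causal=32, dim=1024, depth=2, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_causal, dim))  # {e_1, ..., e_N}
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        mask = torch.triu(torch.ones(n_causal, n_causal), diagonal=1).bool()
        self.register_buffer("causal_mask", mask)

    def forward(self, image_feats):            # image_feats: (B, L, dim) from EVA-CLIP
        B = image_feats.size(0)
        e = self.queries.unsqueeze(0).expand(B, -1, -1)
        # queries self-attend with a causal mask and cross-attend to the 2D patch
        # features, turning the patch grid into a 1D causal sequence {z_1, ..., z_N}
        z = self.decoder(tgt=e, memory=image_feats, tgt_mask=self.causal_mask)
        return z
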

Online demo vs. inference example: results differ

I'm interested in this work, but I have some confusion. The output is very rich when I test the online demo, but much shorter when I run instruct_example() in inference.py with the same image and question. I'm wondering whether the online demo uses different settings or handling?
000001163
Question: Describe this image.
Emu-Online: In this image, we see two young boys running across a grassy field. They are both wearing shorts and appear to be having a lot of fun as they run towards the camera. The sun is shining brightly in the background, casting a warm glow over the scene. The grass is green and lush, and there are some trees in the distance. The boys are laughing and seem to be enjoying each other's company as they run. Overall, this image conveys a sense of joy and playfulness, as well as a connection between the two boys.
inference.py: In this image, we can see two young boys running across a grassy field. They are both wearing shorts and appear to be enjoying themselves as they run. The sky is blue and there are trees in the background.
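
Differences like this often come down to decoding hyperparameters rather than the model itself; as a purely illustrative experiment (names follow the repository's inference.py, and none of these values are claimed to match the online demo), the knobs exposed by the generate call can be varied:

# illustrative sweep of the decoding arguments used in inference.py
output_text = emu_model.generate(
    samples,                 # {"image": image, "prompt": prompt} as in inference.py
    max_new_tokens=512,      # allow longer answers
    num_beams=5,
    length_penalty=1.0,      # > 0 favours longer beams than 0.0
    repetition_penalty=1.0,
)[0].strip()
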

Image generation evaluation

Hi, thanks for this great work!

Do you plan to release the evaluation script of image generation (ms-coco)?

Weight files take up too much disk space

Trying to run your demo, but I get a "No space left on device" error while loading the model; it takes more than 60 GB of disk. On your Hugging Face page there seem to be 15 files of about 10 GB each. Are they all needed?

Thanks!

Will you share the training loss log?

Hi @yqy2001~ Thanks for your interesting work.
I'm now trying to reproduce your training process and I'm wondering if you can share your training loss curve. This will be very helpful.
Thanks~

Offline inference differs from your demo

image
The output of the model is very short, and even adding "answer in more than xxx words" to the prompt doesn't help. Here is my test code; any help is appreciated!

from PIL import Image
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

# build the model skeleton without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2-Chat",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        cache_dir="./ckpt")

# split the layers across two GPUs, keeping whole blocks on a single device
device_map = infer_auto_device_map(
    model,
    max_memory={0: '50GiB', 1: '50GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# input and output logits should be on the same device
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    './ckpt/models--BAAI--Emu2-Chat/snapshots/20ea30b04f8fee599cf97535e655c200df728501',
    device_map=device_map).eval()

frms = [
    "img1.jpg",
    "img2.jpg",
    "img3.jpg",
    "img4.jpg",
    "img5.jpg",
    "img6.jpg",
    "img7.jpg",
    "img8.jpg",
]
video = [Image.open(frm).convert('RGB') for frm in frms]

start_time = time.time()

# one video placeholder per frame
query = "{}Describe the video in details.".format("[<VID_PLH>]" * len(frms))

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    video=video)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        video=inputs["video"].to(torch.bfloat16),
        max_new_tokens=2048,
        length_penalty=-1)   # length_penalty is only used by beam-based generation

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("infer cost", time.time() - start_time)
print(output_text)

Question about training

Thanks for your great work first!
I found that the code uses "emu_encoder.decoder.lm.generate()" to produce the text response and "emu_encoder.decoder.lm.model()" to produce the latent image embeddings. So how can I output both the text and the image embeddings to reproduce your training process?
Or is training done by first calling "emu_encoder.decoder.lm.generate()" to generate the text, and then calling "emu_encoder.decoder.lm.model()" to generate the image embeddings?
Thanks for your reply!

emu2-gen A100 OOM

pipe = DiffusionPipeline.from_pretrained(
    path,
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
    low_cpu_mem_usage=True)
pipe.to("cuda")
print(pipe)

prompt = "impressionist painting of an astronaut in a jungle"
ret = pipe(prompt)

prompt = [image, "wearing a red hat on the head."]
ret = pipe(prompt)

OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB

An A100 with 80 GB of memory goes OOM. Can the model be loaded across multiple GPUs, or with CPU offload?
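
For what it's worth, diffusers exposes generic offloading hooks that may apply here, though whether the custom pipeline_emu2_gen pipeline cooperates with them is untested:

# instead of pipe.to("cuda"), keep weights on CPU and move each submodule to the
# GPU only while it runs (requires accelerate)
pipe.enable_model_cpu_offload()

# more aggressive (and slower): offload at the level of individual weights
# pipe.enable_sequential_cpu_offload()
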

Tools for processing YT-Storyboard-1B

Inspiring work on multimodal transformers!
Would you please share how you download storyboard images and subtitles from YouTube videos? It would be even better if you released your YT-Storyboard-1B dataset!

Batch inference.

Hi! I was wondering if there is any way to get batch inference running with Emu.

Thanks.

The in-context learning sample for pretrained model does not work as expected.

Hello. Thank you for sharing such great work. I am trying to run the samples in inference.py. The instruction-tuned example worked perfectly. However, the in-context learning example for the pretrained model did not work as expected. Here is the log (I masked paths for security reasons):

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/home/.../miniconda3/envs/emu/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:386: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/home/.../miniconda3/envs/emu/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:396: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `None` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
=====> model_cfg: {'embed_dim': 1024, 'vision_cfg': {'image_size': 224, 'layers': 40, 'width': 1408, 'head_width': 88, 'mlp_ratio': 4.3637, 'patch_size': 14, 'eva_model_name': 'eva-clip-g-14-x', 'drop_path_rate': 0, 'xattn': True, 'freeze': False}, 'multimodal_cfg': {'name': 'llama-13B', 'xattn': True, 'n_causal': 32, 'freeze': False}, 'vladapter_cfg': {'name': 'cformer', 'n_causal': 32}}
The Special Tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]', 'additional_special_tokens': ['[/IMG]', '<image>', '[IMG]']}
Vocab Size: 32004
image_token_id: 32002
[IMG] token id: 32003
[/IMG] token id: 32001
=====> loading from ckpt_path /groups/.../Emu/pretrain/multimodal_encoder/pytorch_model.bin
=====> get model.load_state_dict msg: _IncompatibleKeys(missing_keys=[], unexpected_keys=['decoder.lm.model.layers.0.self_attn.rotary_emb.inv_freq', 'decoder.lm.model.layers.1.self_attn.rotary_emb.inv_freq', ..., 'decoder.lm.model.layers.39.self_attn.rotary_emb.inv_freq'])
===> prompt: [IMG]<image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>[/IMG]There are two dogs.[IMG]<image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>[/IMG]There are three pandas.[IMG]<image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>[/IMG]
===> output: dog dog dog dog dog dog dog dog dog dog dog dog ... (the word "dog" repeated for the entire output)

The last image is a sunflower, so the model is supposed to generate something related to it, but instead it generated a repetition of 'dog'.
Do you have any clue as to the cause, or do you get the same result? Thanks in advance.

Multi-image experiments

Hey,

I had a question regarding the specific prompt setting for the multi-image results from figure 4 (left side) of the paper. From a brief skim of your inference.py script and the EMU modeling code inside models, I think the modifications to make this work would be something like this:

    prompt = "You will be presented with some images: [IMG]ImageContent[/IMG]." \ 
             "You will be able to see the images after I provide them to you." \ 
             "Please answer my questions based on the given images." \
             "[USER]: [IMG]<image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image>[/IMG]" \
             "This is the first image." \
             "[IMG]<image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image><image><image>" \
             "<image><image><image><image><image><image><image><image><image><image>[/IMG]" \
             "This is the second image." \
             "What's the difference?" \
             "[ASSISTANT]:" \

    samples = {"image": img, "prompt": prompt}

    output_text = emu_model.generate(
        samples,
        max_new_tokens=512,
        num_beams=5,
        length_penalty=0.0,
        repetition_penalty=1.0,
    )[0].strip()

    print(f"===> caption output: {output_text}")

Please verify if this is the prompt that you used for the multi-image results in the paper, and let me know if it would work.

The other alternative would be to prepend the [USER] token before every new image-text sequence. However, since that would be out-of-distribution with respect to the instruction fine-tuning data format, I am not sure it would work.

Cannot reproduce the results on VisDial

Thanks for your great work.
Following the official evaluation guide, I find it difficult to reproduce the results reported in your paper on VisDial.
Could you please share the related codes for evaluating the dialog task? @yqy2001 @Quan-Sun

Eval Dataset

Hi, I have a question about the COCO dataset evaluation.
What is the meaning of benchmark/COCO/annotations/vqa_test.json and vqa_val_eval.json?
Or could you please give the links to these dependent datasets?

The n_query and v_query in EMU2

Thanks for the excellent work!
I want to understand the meaning of n_query and v_query. Why is an image worth 256 tokens while a video is worth only 64? Could you please provide the code for video inference? Thanks!

Implementation in generate_image()

Hi, I have a question regarding the image generation process, specifically the generate_image function at https://github.com/baaivision/Emu/blob/main/models/modeling_emu.py#L185

According to this function and the description in the paper, it forces the model to generate n_causal=32 <image> tokens consecutively.
In each decoding step, the function concatenates an <image> token to the generated sequence, then passes it through the model. Finally, outputs.hidden_state[-1][:, -n_causal:, :] is projected through self.decoder.lm.stu_regress_head.

So my confusion is, what is the difference between directly appending n_causal=32 <image> tokens to the text prompts, and decoding the output at once?
Because each <image> token is identical and there is no storage of intermediate outputs.
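
To make the question concrete, the single-pass alternative I have in mind would look roughly like this (attribute names follow the snippet above; I am not claiming the repository works this way):

# append n_causal identical <image> token ids to the prompt and run one forward pass
image_ids = torch.full((1, n_causal), image_token_id, dtype=torch.long, device=input_ids.device)
input_ids = torch.cat([input_ids, image_ids], dim=1)

outputs = emu_model.decoder.lm.model(input_ids, output_hidden_states=True)
visual_embeds = emu_model.decoder.lm.stu_regress_head(
    outputs.hidden_states[-1][:, -n_causal:, :])
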

Please correct me if I have misunderstood.

Question about the pretrained weights for the image encoder (EVA-CLIP)

Hey~
Thanks for your amazing work!
I have a question about the pretrained weights used in Emu's training.
It seems that you used a 1B version of EVA-02-CLIP, as mentioned in your paper, and eva-clip-g-14-x also appears in your code.
However, I could not find pretrained weights for the small-g (1B) version of EVA-02-CLIP, either in the official EVA-CLIP GitHub repo or in the Hugging Face model zoo. Instead I found the 1B version of EVA-01-CLIP (not 02) at https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA01_CLIP_g_14_psz14_s11B.pt.
Could you provide the specific link to the EVA-CLIP model weights you used?
Please forgive me if I am wrong.

Splitting the checkpoint into several smaller files

Though this is less important, I would like to ask if you could split the 30 GB checkpoint into several smaller files, as most models (e.g., LLaMA, Alpaca) do.
This would make the checkpoint easier to download, especially over an unstable network.
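
Until then, a hedged user-side workaround is to shard the state dict into size-bounded pieces before transfer and merge them back on arrival; the logic is simple (illustrative, not the repository's tooling):

import glob
import torch

def shard_state_dict(path, out_prefix, max_bytes=5 * 2**30):
    """Split a monolithic .pt state dict into roughly 5 GB shards (illustrative only)."""
    state = torch.load(path, map_location="cpu")
    shard, size, idx = {}, 0, 0
    for key, tensor in state.items():
        shard[key] = tensor
        size += tensor.numel() * tensor.element_size()
        if size >= max_bytes:
            torch.save(shard, f"{out_prefix}-{idx:05d}.pt")
            shard, size, idx = {}, 0, idx + 1
    if shard:
        torch.save(shard, f"{out_prefix}-{idx:05d}.pt")

def merge_shards(pattern):
    """Rebuild the full state dict from the shards before load_state_dict."""
    state = {}
    for f in sorted(glob.glob(pattern)):
        state.update(torch.load(f, map_location="cpu"))
    return state
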

Generate video from any prompt sequence

Hello, I saw the "generate video from any prompt sequence" feature shown on the project page, but I don't see it in the demo. Will you launch this feature?

Question about the visual autoencoder

Thanks for the great work! I have some questions about the checkpoints:

  1. It seems that BAAI/Emu2 does not include the weights of the visual decoder (diffusion UNet), but according to Section 2.2.3 of the paper, shouldn't Emu2 include the autoencoder-trained decoder?
  2. Emu2-Gen provides the weights of the visual decoder; can the visual encoder and decoder in BAAI/Emu2-Gen work together as an autoencoder?

Looking forward to your reply~
