
osprey's Introduction


Demo username & password: osprey


A part of Along the River During the Qingming Festival (清明上河图)

Spirited Away (千与千寻)

Updates 📌

[2024/3/29]🔥 We released the Osprey-Chat model, which exhibits better conversation and image-level understanding & reasoning capabilities.

[2024/2/27]🔥 Osprey has been accepted to CVPR2024!

[2024/1/15]🔥 We released the evaluation code.

[2023/12/29]🔥 We released the training code and Osprey-724K dataset.

[2023/12/18]🔥 We released the code, the Osprey-7b model, and the online demo for Osprey.

What is Osprey 👀

Osprey is a mask-text instruction tuning approach that extends MLLMs by incorporating pixel-wise mask regions into language instructions, enabling fine-grained visual understanding. Given an input mask region, Osprey generates semantic descriptions at both short and detailed levels.

Osprey can seamlessly integrate with SAM in point-prompt, box-prompt, and segment-everything modes to generate the semantics associated with specific parts or objects.
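As an illustration, the sketch below (not part of the repo) shows how a point-prompt mask from SAM could be handed to Osprey for a region description; describe_region is a hypothetical wrapper around Osprey inference, and the paths and click coordinates are placeholders.

# Minimal sketch: get a point-prompt mask from SAM, then describe that region with Osprey.
# `describe_region` is a hypothetical Osprey wrapper, not an API provided by this repo.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click (x, y) selects the region of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=False,
)

# Hypothetical: feed the binary mask and the image to Osprey for a short description.
# description = describe_region(image, masks[0])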

Watch Video Demo 🎥

Try Our Demo 🕹️

Online demo

Click 👇 to try our demo online.

web demo

username: osprey
password: osprey

Point

Box

Everything

Offline demo

💻 Requirements: this demo needs about 17GB of GPU memory in total: 15GB for Osprey and 2GB for SAM.

  1. Install Gradio-Osprey-Demo.
  2. Install Segment Anything.
pip install git+https://github.com/facebookresearch/segment-anything.git
  3. Download all the checkpoints:

The default path of all the checkpoints:

├── demo
    ├── checkpoints
    │   ├── Osprey_7b
    │   └── sam_vit_b_01ec64.pth 
    └── open_clip_pytorch_model.bin

Or change the "mm_vision_tower" field in the config.json of the Osprey-7b model to the absolute path of open_clip_pytorch_model.bin (see the sketch after this list).

  4. Run app.py.
cd demo
python app.py --model checkpoints/Osprey_7b
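If you opt to edit config.json instead of using the default checkpoint layout, a minimal sketch like the following updates the field; the paths are examples, not repo defaults.

# Sketch: point "mm_vision_tower" in the Osprey-7b config.json at the absolute path
# of open_clip_pytorch_model.bin. Adjust both paths to your own setup.
import json

cfg_path = "demo/checkpoints/Osprey_7b/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["mm_vision_tower"] = "/abs/path/to/demo/open_clip_pytorch_model.bin"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)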

Install 🛠️

  1. Clone this repository and navigate to Osprey folder
git clone https://github.com/CircleRadon/Osprey.git
cd Osprey
  2. Install packages
conda create -n osprey python=3.10 -y
conda activate osprey
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Dataset 🌟

All the datasets used for training can be found in Dataset preparation.

Osprey-724K: 🤗Hugging Face

Osprey-724K is an instruction dataset of mask-text pairs, containing around 724K GPT-generated multimodal dialogues that encourage MLLMs toward fine-grained, pixel-level image understanding. It contains object-level, part-level, and additional instruction samples for robustness and flexibility.
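As a quick sanity check after downloading, one can peek at a split such as the detailed-description file; the file name follows the Hugging Face release, while the exact per-sample schema (e.g. the "description" field) is an assumption based on the training code.

# Sketch: inspect a few samples from Osprey-724K's detailed-description split.
# Field names other than the file name are assumptions, not a documented schema.
import json

with open("Osprey-724K/osprey_detail_description.json") as f:
    data = json.load(f)

print(len(data), "annotated samples")
sample = data[0]
print(sorted(sample.keys()))        # see which fields are available
print(sample.get("description"))    # GPT-generated region description(s)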

Training 🚀

  • Stage1: Image-Text Alignment Pre-training

    • The pretrained projector weights for Convnext-large-CLIP can be found in projector weights.
  • Stage2: Mask-Text Alignment Pre-training

    • Download vicuna-7b-v1.5.
    • Download projector weights trained in stage1: projector weights.
    • Set model_name_or_path in stage2.sh to the path of vicuna-7b-v1.5.
    • Set pretrain_mm_mlp_adapter in stage2.sh to the path of mm_projector.
    • Set vision_tower in stage2.sh to the path of Convnext-large-CLIP-model.
    • Run sh scripts/stage2.sh.
  • Stage3: End-to-End Fine-tuning

    • Set model_name_or_path in stage3.sh to the path of the stage2 checkpoint.
    • Set vision_tower in stage3.sh to the path of Convnext-large-CLIP-model.
    • Run sh scripts/stage3.sh.

Checkpoints 🤖

Osprey-7b model🤗: model

We also provide the intermediate stage2 checkpoint; please check model.

Evaluation 🔎

See evaluation for details.

TODO List 📝

  • Release the checkpoints, inference codes and demo.
  • Release the dataset and training scripts.
  • Release the evaluation code.
  • Release the code for data generation pipeline.

Acknowledgement 💌

  • LLaVA-v1.5: the codebase we built upon.
  • SAM: the demo uses the segmentation result from SAM as the input of Osprey.

BibTeX 🖊️

@misc{Osprey,
  title={Osprey: Pixel Understanding with Visual Instruction Tuning},
  author={Yuqian Yuan and Wentong Li and Jian Liu and Dongqi Tang and Xinjie Luo and Chi Qin and Lei Zhang and Jianke Zhu},
  year={2023},
  eprint={2312.10032},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}


osprey's Issues

Mask-level Ferret?

Hello again!

I was wondering how you were able to have Ferret output segmentation masks? From the paper it looks like it is restricted to bounding boxes.

Thank you!

about the training log

Hi, I'd like to try training the model on my own dataset, but I'm not sure what loss values the three training stages should converge to before training can be considered complete.

Could you provide the training logs for the three stages, or let me know the approximate loss range after convergence for each stage?

Thanks 🙏

error report when evaluate Open-Vocabulary Segmentation for cityscapes

Hello,

When I try to evaluate Open-Vocabulary Segmentation on Cityscapes, I get the following errors:

[02/07 15:28:29 detectron2]: Start inference on 500 batches
0it [00:14, ?it/s]
Traceback (most recent call last):
File "/home/user/workspace/Osprey/osprey/eval/eval_open_vocab_seg_detectron2.py", line 634, in
evaluator.process(inputs, outputs)
File "/home/user/workspace/Osprey/detectron2/detectron2/evaluation/evaluator.py", line 88, in process
evaluator.process(inputs, outputs)
File "/home/user/workspace/Osprey/detectron2/detectron2/evaluation/cityscapes_evaluation.py", line 75, in process
class_id = name2label[classes].id
KeyError: 'car,cars'

I finished setup as https://github.com/CircleRadon/Osprey/blob/main/osprey/eval/README.md#:~:text=1.-,Open%2DVocabulary%20Segmentation,-Download%20SentenceBERT%20model

and completed the data preparation as well:
https://github.com/CircleRadon/Osprey/blob/main/osprey/eval/datasets/README.md

thanks.

How to use Osprey in my own dataset

Thank you for your amazing work. What should I do to use Osprey with my own dataset? I have some images collected by our team. Should I retrain the model to make the description of every region more accurate, and how would I do that? Also, since I have a folder of images rather than the single image I can use in the demo, how can I run your model over a large number of images and collect all the results?

Question about the number of tokens

In Osprey, the image feature tokens produced by ConvNeXt should number 1024 (a 1024 × 768 feature map). Combined with the mask feature tokens (128 + 64 + 32 + 16), the position tokens, and the text tokens, isn't it easy to exceed 2048 by quite a lot?
If I have misunderstood any of these numbers, please correct me. Thanks a lot!
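Using the figures quoted above, together with the mask-token handling code quoted in a later issue (one mask token plus one position token per region), a rough back-of-the-envelope count looks like this; the numbers are illustrative, not values verified against the implementation.

# Illustrative token budget, not verified against the actual implementation.
image_tokens = 1024      # ConvNeXt feature map flattened to 1024 tokens (dim 768)
num_regions = 8          # example: 8 referred regions in one conversation
per_region_tokens = 2    # one <mask> feature token + one position token per region
text_tokens = 600        # example length of prompt plus answer

total = image_tokens + num_regions * per_region_tokens + text_tokens
print(total, "tokens vs. model_max_length = 2048")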

API requires password?

I got the API to run with this code:

//import { client } from "@gradio/client";


// This did not help:
// const client = require('@gradio/client');

(async () => {

    // This did not work, because the client is not a function.
    //const client = await import('@gradio/client');

    const dynamic = new Function('modulePath', 'return import(modulePath)');

    const { client } = await dynamic('@gradio/client');

    const app = await client("http://osprey:[email protected]:8000/");
    const result = await app.predict(0, []);

    // @ts-ignore
    console.log(result.data);


})();

but I get this error:

Error: Could not get config:
    at resolve_config (file:///home/arthur/dev/ai/manga/node_modules/@gradio/client/dist/index.js:1390:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async file:///home/arthur/dev/ai/manga/node_modules/@gradio/client/dist/index.js:457:18

which I think is due to the page requiring the osprey/osprey password...

could that password be removed so API access can work?

Thanks.

The checkpoint-final of stage2

Hello, I was wondering if you could kindly assist me with my request. I am interested in starting my training journey from stage 3, and I was wondering if it would be possible for you to provide me with the checkpoint of stage 2?

error when load Osprey-724K/osprey_detail_description.json

Hello, I am trying to reproduce Osprey. When I tried to train stage 3, I got the errors below while loading Osprey-724K/osprey_detail_description.json.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Traceback (most recent call last):
File "/home/user/LMM/Osprey/osprey/train/train_mem.py", line 13, in
train()
File "/home/user/LMM/Osprey/osprey/train/train.py", line 696, in train
data_module = make_multitask_data_module(tokenizer=tokenizer,
File "/home/user/LMM/Osprey/osprey/datasets/data_modules.py", line 58, in make_multitask_data_module
train_dataset = build_osprey_dataset(dataset_config,
File "/home/user/LMM/Osprey/osprey/datasets/data_modules.py", line 75, in build_osprey_dataset
temp_dataset = build_osprey_dataset(cfg, tokenizer=tokenizer, data_args=data_args, **kwargs)
File "/home/user/LMM/Osprey/osprey/datasets/data_modules.py", line 136, in build_osprey_dataset
dataset = OspreyDetailedDescription(
File "/home/user/LMM/Osprey/osprey/datasets/osprey_724k.py", line 234, in init
super().init(tokenizer, data_args, ann_file, img_prefix)
File "/home/user/LMM/Osprey/osprey/datasets/osprey_724k.py", line 53, in init
super().init(tokenizer, data_args, ann_file, img_prefix)
File "/home/user/LMM/Osprey/osprey/datasets/stage2_data.py", line 28, in init
self.data_infos = self.load_annotations(ann_file)
File "/home/user/LMM/Osprey/osprey/datasets/osprey_724k.py", line 258, in load_annotations
answer = re.findall(r"<.*>:\ (.*)", ann['description'][i])[0]
IndexError: list index out of range
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

please check this.
thanks
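For reference, an illustrative defensive rewrite of the failing line would guard the regex match for entries whose "description" string does not match the expected pattern; this is just a sketch, not a confirmed fix for the dataset issue.

# Illustrative workaround for the IndexError above: guard the regex match
# instead of indexing findall's result unconditionally.
import re

matches = re.findall(r"<.*>:\ (.*)", ann['description'][i])
if matches:
    answer = matches[0]
else:
    # Fall back to the raw string, or log and skip this sample.
    answer = ann['description'][i]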

checkpoint/osprey_7b

I have placed all the files from https://huggingface.co/sunshine-lwt/Osprey-7b/tree/main into the Osprey-main/checkpoint/osprey_7b folder, but an error occurred.
OSError: We could not connect to 'https://huggingface.co/' to load this file, it is also not found in the cache files, and it appears that checkpoint/osprey_7b is not a path to a directory containing a file named config.json. Please check your internet connection, or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

"I have also placed the open_clip_pytorch_model.bin file in the Osprey-main/checkpoint/osprey_7b folder. Additionally, the 'mm_vision_tower' is set to 'checkpoint/osprey_7b/open_clip_pytorch_model.bin', and the sam_vit_b_01ec64.pth file has been placed in the Osprey-main/checkpoints folder."

ERROR AT END.

!python /content/Osprey/demo/app.py --model sunshine-lwt/Osprey-7b
output
2024-01-14 16:21:15.919229: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-14 16:21:15.919294: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-14 16:21:15.919351: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-14 16:21:17.052157: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100% 2/2 [00:12<00:00, 6.40s/it]
Some weights of the model checkpoint at sunshine-lwt/Osprey-7b were not used when initializing OspreyLlamaForCausalLM: ['model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.22.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.13.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.3.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.16.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.23.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.5.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.15.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.4.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.9.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.14.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.21.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.17.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.18.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.20.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.11.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.24.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.10.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.25.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.19.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.8.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.7.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.6.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.26.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.12.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.2.weight']

  • This IS expected if you are initializing OspreyLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing OspreyLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of OspreyLlamaForCausalLM were not initialized from the model checkpoint at sunshine-lwt/Osprey-7b and are newly initialized: ['model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.20.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.19.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.10.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.7.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.23.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.22.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.15.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.17.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.3.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.18.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.8.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.14.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.13.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.16.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.25.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.6.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.24.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.9.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.26.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.4.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.5.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.21.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.11.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.12.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.1.gamma']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    /usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:389: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
    warnings.warn(
    /usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:394: UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
    warnings.warn(
    Traceback (most recent call last):
    File "/content/Osprey/demo/app.py", line 20, in
    sam_predictor = get_sam_predictor()
    File "/content/Osprey/demo/inference.py", line 15, in get_sam_predictor
sam = sam_model_registry[model_type](...)
    File "/usr/local/lib/python3.10/dist-packages/segment_anything/build_sam.py", line 38, in build_sam_vit_b
    return _build_sam(
    File "/usr/local/lib/python3.10/dist-packages/segment_anything/build_sam.py", line 104, in _build_sam
    with open(checkpoint, "rb") as f:
    FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/sam_vit_b_01ec64.pth'
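The traceback indicates the SAM ViT-B checkpoint is simply missing from ./checkpoints. A minimal sketch to fetch it, assuming the standard download link from the Segment Anything repository:

# Sketch: download the SAM ViT-B checkpoint so demo/app.py can find it at
# ./checkpoints/sam_vit_b_01ec64.pth.
import os
import urllib.request

os.makedirs("checkpoints", exist_ok=True)
urllib.request.urlretrieve(
    "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth",
    "checkpoints/sam_vit_b_01ec64.pth",
)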

Loss 0 during Stage 3 Training? + Stage 2 Model

Hi, I was able to train the stage 2 model and am moving on to stage 3 training.

Here is the command I am running. The loss seems fine if I change model_name_or_path to the vicuna LM model instead of the stage 2 trained model. Do you see any issue, and is it possible to share the stage 2 model as well?

deepspeed --include localhost:0,1  osprey/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path exp/stage2_slurm_no_grad_ckpt/ \
    --dataset_config ./osprey/configs/stage3.json \
    --version v1 \
    --vision_tower $MODELS/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup/open_clip_pytorch_model.bin \
    --pretrain_mm_mlp_adapter models/osprey-v1.0-mlp2x-512px-convnext-pretrain-vicuna-7b-v1.5/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir './exp/stage3_test' \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to "none" \
    --group_by_modality_length False

dataset release

Hi, very nice work, do you have any specific plan to release the data?

API Help

Hello.

Is there some documentation for the API that I'm missing?

When I click on "Use with API" at the bottom of the demo, I get some documentation, but it's not very clear: I can see that there are some numbered functions and what their parameters are, but not what the functions actually are or do.

Any help would be extremely welcome.

I have another question. What I'm trying to do, is ask a model to find all faces in an image (and their position) and/or find all speech bubbles in an image (and their position) etc.

I currently do this with segment-anything and gpt4-v but it's extremely expensive, I'd really like to be able to run it locally.

[Screenshots attached: 2024-01-10 00-56-35, 2024-01-10 00-56-25]

You can see the technique I do now in the pictures: segment, group zones by non-overlap, and for each group, label each zone with a number, and ask gpt4-v to tell me which number is a speech bubble, which a face, which a sound effect, etc. This is pretty accurate (about 5% error rate, and going down as I improve the prompt/labelling, though most improvements I found also come at the cost of spending more tokens)

Is there a way to get the same result with Osprey ?

Thank you so much in advance.

How to use Osprey-Chat to generate a short description for all masks of an image?

Thank you for the amazing work! I saw the offline demo you introduced, but it seems that this process still requires manually clicking on specific masks to generate a description, rather than generating descriptions for all masks at once. I would like to use Osprey (ideally Osprey-Chat) to build a fully automated pipeline, with no human involvement, that generates a brief description for each mask in an image. Do you have any suggestions? Thanks a lot!
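One possible fully automatic route, sketched below, is to let SAM's automatic mask generator produce every mask first and then query Osprey once per mask; describe_region is a hypothetical wrapper around Osprey inference, not an API this repo provides.

# Sketch: generate all masks with SAM's automatic mask generator, then describe
# each mask with Osprey. `describe_region` is a hypothetical Osprey wrapper.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # list of dicts, each with a "segmentation" array

descriptions = []
for m in masks:
    # Hypothetical call into Osprey inference for one binary mask.
    # descriptions.append(describe_region(image, m["segmentation"]))
    pass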

Inference Scripts

Hi, I found that the test program in eval can only process one image and one mask at a time. Can you please provide a test script for batch sizes greater than 1?

can not access your dataset

Hi, thank you for your awesome work. However, I am unable to access the dataset mentioned in your data preparation section, as the link provided is unavailable. Could you please provide an updated link? Thank you!

Incorrect link:
pascal_part: train.json, VOCdevkit.

About model training

Hello,

During training, did you ever run into the training getting stuck at the first epoch while GPU utilization stays at 100%?

At first I thought it was a server problem, but as soon as I remove the MASK token part of the code (below), training runs normally.

if cur_input_ids.numel() > 0:
                if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
                    mask_idx = torch.nonzero(cur_input_ids==self.tokenizer.convert_tokens_to_ids(['<mask>'])[0])
                    _l = 0
                    for i, idx in enumerate(mask_idx):
                        cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[_l:idx[0]]).detach())
                        ## mask
                        cur_new_input_embeds.append(mask_feats[batch_idx][i:i+1].detach())
                        ## pos
                        cur_new_input_embeds.append(pos_feats[batch_idx][i:i+1].detach())
                        if labels is not None:
                            cur_labels[idx[0]:idx[0]+2] = torch.full((2,), IGNORE_INDEX, device=labels.device, dtype=labels.dtype)
                        _l = idx[0]+2
                    if _l< len(cur_input_ids):
                        cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[_l:]).detach())

                else:
                    mask_idx = torch.nonzero(cur_input_ids==self.tokenizer.convert_tokens_to_ids(['<mask>'])[0])
                    assert len(mask_idx) == len(mask_feats[batch_idx]), "mask num not equal to mask feats"
                   
                    _l = 0
                    for i, idx in enumerate(mask_idx):
                        cur_raw_new_input_embeds = self.get_model().embed_tokens(cur_input_ids[_l:idx[0]])
                        cur_new_input_embeds.append(cur_raw_new_input_embeds)
                        ## mask
                        cur_new_input_embeds.append(mask_feats[batch_idx][i:i+1].to(cur_raw_new_input_embeds.dtype))
                        ## pos
                        cur_new_input_embeds.append(pos_feats[batch_idx][i:i+1].to(cur_raw_new_input_embeds.dtype))

                        if labels is not None:
                            cur_labels[idx[0]:idx[0]+2] = torch.full((2,), IGNORE_INDEX, device=labels.device, dtype=labels.dtype)

                        _l = idx[0]+2
                    if _l< len(cur_input_ids):
                        cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[_l:]))

                if labels is not None:
                    cur_new_labels.append(cur_labels)

Download issue for ospery model checkpoint

Hi, first, thank you for the amazing work!

I'm wondering if you provide a google drive to your model checkpoint or some other downloadable link? I tried to download it from huggingface model repo, but either I got an error saying osprey is not a valid key or access was denied in the git command.

Is there another way to download the checkpoint model so that I can try to deploy it locally?

Thank you

To create a public link, set `share=True` in `launch()`.

Some weights of the model checkpoint at checkpoints/osprey_7b were not used when initializing OspreyLlamaForCausalLM: ['model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.15.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.12.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.10.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.23.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.18.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.16.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.4.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.7.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.22.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.17.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.9.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.6.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.2.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.20.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.19.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.13.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.25.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.14.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.24.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.11.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.26.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.8.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.0.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.1.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.5.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.3.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.21.weight', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.1.weight']

  • This IS expected if you are initializing OspreyLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing OspreyLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of OspreyLlamaForCausalLM were not initialized from the model checkpoint at checkpoints/osprey_7b and are newly initialized: ['model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.20.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.14.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.19.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.16.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.12.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.13.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.7.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.3.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.15.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.8.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.10.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.1.blocks.0.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.18.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.5.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.3.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.26.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.22.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.9.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.23.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.2.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.0.blocks.1.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.17.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.6.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.4.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.21.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.24.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.25.gamma', 'model.vision_tower.vision_tower.visual.trunk.stages.2.blocks.11.gamma']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Running on local URL: http://127.0.0.1:8002

To create a public link, set share=True in launch().

Does this count as a successful launch? However, accessing the result at http://127.0.0.1:8002/ failed.

How to finetune Osprey on RefCOCOg?

Hi! Thanks for the great work!

Could you share any configs for fine-tuning Osprey on the RefCOCOg dataset? I am trying to follow your work and reproduce the results on it. What are the starting checkpoint and the prompt template? It would be much appreciated if any fine-tuning config could be shared.
Thank you!

Datasets used during training

Hello, thank you for your work and for open-sourcing the code 👍👍👍! I have a couple of questions:

  • During the whole training of Osprey, did you use any video-domain multimodal datasets, such as MSR-VTT, MSVD, or VATEX?
  • I see you used COCO, RefCOCO, and other datasets; do they include MS-COCO? It seems MS-COCO is a subset of COCO 😀

Many thanks! 💐💐💐

About the LoRA implement in the code.

Hello, I want to use LoRA to finetune the LlamaDecoder in Osprey on downstream tasks. There is a lora_enable option in the training args; I set it to true, but I get an error when I run stage 3 training. Is the LoRA training option available here?
