shikras / shikra Goto Github PK

View Code? Open in Web Editor NEW

709.0 709.0 44.0 8.3 MB

License: Other

Python 100.00%

shikra's People

Contributors

Stargazers

Watchers

shikra's Issues

Could you share your prompt or code to generate QA data from GPT4

which is section 5.3.2 in the paper.

training detail

Hi. Thank you for your good work.
I have a question about the training details.
So I saw in the code that you use the Seq2SeqTrainer class from huggingface.
It seems that you used simple cross-entropy loss for your model like other MLLMs. Is it right?

If the target is "A man[0.220,0.216,0.568,0.830] holding roses[0.404,0.374,0.588,0.758] and a woman[0.606,0.250,0.812,0.830] covering her mouth[0.612,0.358,0.666,0.414].", then the model is just trained by teacher forcing with the target?

Inconsistent performance on REC task

The performance of Shikra on the dataset of REC task is quite surprising.
I am trying to get the shikra-7b model by using vicuna-7b as the base model and using the shikra-7b-delta-v1 as the delta model.

I evaluate the shikra-7b model on RefCOCO testA and RefCOCO testB, but only get 79.64% and 64.54% overall accuracy.
It does not match the performance on the Table 3.

Do you have any suggestion?

it seems like you use llava model, I want to know when training model, do you add the position information like " the cat is at [0.2, 0.2, 0.5, 0.5], or without any position information in training?

Evaluation on PointQA, VQAv2, OK-VQA and Captioning

Hi, The current codebase does not contain the configs for evaluation of the mentioned tasks. It only contain shikra_eval_multi_pope.py and shikra_eval_multi_rec.py. Will you please explain the evaluation processes for PointQA, VQAv2, OK-VQA and Captioning? Thanks.
[I tried PointQA myself, but results on Visual-7W are not reproduced. Hence, it will be great to have the configs which you used. Thanks.]

A question about the installed "transformers"

Thanks for the awesome work! I encountered a problem when installing the transformers when conducting pip install -r requirement.txt. The Error info is attached below.

Is it possible if I simply run pip install transformers to install this package? Or I must install the "transformers" provided by the "huggingface"?

Looking forward to your reply.

(shikra) zlx@ubuntu-Super-Server:~/shikra$ pip install git+https://github.com/huggingface/transformers@cae78c46
Collecting git+https://github.com/huggingface/transformers@cae78c46
Cloning https://github.com/huggingface/transformers (to revision cae78c46) to /tmp/pip-req-build-vzu547ee
Running command git clone --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-vzu547ee
fatal: unable to access 'https://github.com/huggingface/transformers/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
error: subprocess-exited-with-error

× git clone --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-vzu547ee did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-vzu547ee did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Question about coordinate numerical representation

May I ask if instead of fully fine-tuning LLM, fine-tuning LLM by using a method like LoRA, can enable LLM to understand and produce coordinate numerical representation？

Online demo not working

Something went wrong
Connection errored out.

TypeError: cfg should be a dict, ConfigDict or Config, but got <class 'NoneType'>

Wrong output when the inference stage

I have followed the readme file to config all of the setup steps, including downloading the dataset. When I directly run the inference command, the output the the model is random characters.

Some setup steps:
(1)Environment installation is same as the requirements, including the specific version of transformer.
(2)The original LLaMA weights are downloaded from HuggingFace website and using the official conversion command. Then applying the shikras/shikra-7b-delta-v1 to the original weights.
(3)Download the dataset images used in the repo and change the dataset root. For inference stage, I use the shikra_eval_multi_pope script, the default configuration file is 'DEFAULT_TEST_POPE_VARIANT', the dataset used is COCO val2014 dataset.

The command I use for the inference is:

accelerate launch --num_processes 4 --main_process_port 23786 mllm/pipeline/finetune.py config/shikra_eval_multi_pope.py --cfg-options model_args.model_name_or_path=path/to/my/cocoimage/root

using a single NVIDIA A100 GPU.

But the output for COCO_POPE_RANDOM_q_a,COCO_POPE_POPULAR_q_a and COCO_POPE_ADVERSARIAL_q_a, all of the output of the model is like:

{"pred": " 00000000000000000000000000002.222222222222222222222222222222222222222............2222.......................22222........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ Ho Ho....................................................... Brasil. Brasil..... Brasil. Brasil................... Brasil Brasil............... Brasil Brasil Brasil Hamilton Brasil................................. Hamilton.................................................. Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton... Hamilton Hamilton Hamilton Hamilton..... Hamilton............ Hamilton Hamilton Hamilton Hamilton Hamilton.... Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog Herzog..... Gh Herzog", 

"target": " A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Is there a snowboard in the image? How would you answer it briefly and precisely using the image <im_start> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_end> ? ASSISTANT: The answer is yes."}

{"pred": "", 
"target": " A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Please provide a direct and to-the-point response to 'Is there a dining table in the image?' while considering the image <im_start> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_patch> <im_end> . ASSISTANT: The answer is no."}

The prediction is either empty or garbled output in the output_dir/multitest_xxxx_extra_prediction.jsonl.
The metric computation shows all of the results are false, like:

{
    "multitest_COCO_POPE_POPULAR_q_a_accuracy": 0.0,
    "multitest_COCO_POPE_POPULAR_q_a_failed": 3000,
    "multitest_COCO_POPE_POPULAR_q_a_runtime": 20486.2627,
    "multitest_COCO_POPE_POPULAR_q_a_samples_per_second": 0.146,
    "multitest_COCO_POPE_POPULAR_q_a_steps_per_second": 0.018,
    "multitest_COCO_POPE_POPULAR_q_a_target_failed": 0
}

I check all of the configurations and didn't find some errors. So could you please give me some suggestions? Thanks!

RuntimeError: Internal: unk is not define

Dear authors,

Thanks for the great work! I've encountered a RuntimeError: Internal: unk is not define when running
accelerate launch --num_processes 4 \ --main_process_port 23786 \ mllm/pipeline/finetune.py \ config/shikra_pretrain_final19_stage2.py \ --cfg-options model_args.model_name_or_path=/path/to/init/checkpoint
and the Traceback indicates that something went wrong when calling transformers.AutoTokenizer.from_pretrained().

Do you have any idea of this Error?

Looking forward to your reply : )

A question about the result in Table 6

Thanks for your awesome work! Shikras opens a way to effectively represent the coordinates in the image.

I have a question about the result in Table 6: the performance of Shikra on OK-VQA dataset is quite surprising, do you fine-tune Shikra on OK-VQA or does instruction-tuning data include OK-VQA?

What if the center point of an object is not on the object itself?

When will the code and data get released?

Thanks for the great work! When will the code and data get released? Is there a demo that we can take a try first?

training on 8 V100 is too slow, shikra_pretrain_final19_stage2 nearly 800h。 Does anyone have a similar situation?

accelerate launch --num_processes 8
--main_process_port 23786
mllm/pipeline/finetune.py
config/shikra_pretrain_final19_stage2.py
--cfg-options model_args.model_name_or_path=../models/shikras/shikra-7b-0708 --overwrite_output_dir
--per_device_train_batch_size 2

{'loss': 0.1921, 'learning_rate': 3.0703101013202335e-08, 'epoch': 0.0}
{'loss': 0.1677, 'learning_rate': 6.140620202640467e-08, 'epoch': 0.0}
{'loss': 0.1395, 'learning_rate': 9.2109303039607e-08, 'epoch': 0.0}
{'loss': 0.1647, 'learning_rate': 1.2281240405280934e-07, 'epoch': 0.0}
{'loss': 0.1434, 'learning_rate': 1.535155050660117e-07, 'epoch': 0.0}
{'loss': 0.1707, 'learning_rate': 1.84218606079214e-07, 'epoch': 0.0}
{'loss': 0.131, 'learning_rate': 2.1492170709241634e-07, 'epoch': 0.0}
0%| | 73/217125 [16:42<877:13:11, 14.55s/it]

How is the toy shikra trained in Table 2?

I find this project extremely interesting and I'm eager to follow its progress. I have a question regarding the training process mentioned in the paper. The paper refers to "toy shikra/toy model" many times. I'm curious to know how the toy shikra was trained, particularly the results mentioned in Table 2. Was it trained only with REC datasets and initialized from the llama model?

Shikra-RD

Hi, in the downloaded files, there are many JSON files, which is the Shikra-RD generated by the author and used in stage 2?

About the dataset size of PointQA

Thanks for your good work. I have downloaded your dataset in .jsonl. I have some questions.

The number of data in pointQA_local_train.jsonl is 27426. In the original paper, the number of questions is 40,409. The numbers of data in the test set and validation set are also mismatched with the original paper. How to generate the .jsonl file? Considering the number of data in the test set is not aligned. How to get a fair comparison?

I am looking forward to your reply.

Could you provide more information about the instruction to GPT4 when generating Shikra-RD "cot_with_ans" data?

Dear the authors,
Thanks for the great work! After reading your paper and codes, I'm still confused about the generation process of the "cot_with_ans"of Shikra-RD. Could you provide me with more infomation about the instruction when interacting with GPT-4?

Really looking forward to your reply~

May I ask if it supports multi image input, that is, multiple images and one text

请问支持多图输入吗，就是多个图像，一条文本这样

AttributeError: 'Seq2SeqTrainingArguments' object has no attribute 'hf_deepspeed_config'

Command

accelerate launch --num_processes 4 \
        --main_process_port 23786 \
        /home/junjiewen/shikra/mllm/pipeline/finetune.py \
        config/shikra_eval_flickr.py \
        --cfg-options model_args.model_name_or_path=/home/junjiewen/minigpt4/ckpt/shikra/shikra-transfer/

config/shikra_eval_flickr.py

_base_ = ['_base_/dataset/DEFAULT_TEST_DATASET.py', '_base_/model/shikra.py', '_base_/train/eval.py']

training_args = dict(
    output_dir='./exp/{{fileBasenameNoExtension}}',

    do_train=False,
    do_eval=True,
    do_predict=False,
    do_multi_predict=False,

    fp16=True,
    fp16_full_eval=True,
    bf16=False,
    bf16_full_eval=False,
    per_device_eval_batch_size=8,
)

model_args = dict(
    model_name_or_path=None,
    # vision_tower='/home/junjiewen/minigpt4/ckpt/clip-vit-large-patch14/'
)

data_args = dict(
    train=None,
    validation=_base_.DEFAULT_TEST_FLICKR_VARIANT['FLICKR_EVAL_with_box'],
    test=None,
    # multitest={k: {'cfg': v, 'compute_metric': dict(type='FLICKRComputeMetrics')} for k, v in _base_.DEFAULT_TEST_FLICKR_VARIANT.items() if 'q_a' in k},
    multitest=None,
    compute_metric=None,

    # padding collator kwargs
    collator_kwargs=dict(
        padding=True,
        max_length=1024,
    ),

    # generate config
    gen_kwargs=dict(
        max_new_tokens=1024,
        num_beams=1,
    ),
)

DEFAULT_TEST_DATASET.py

_base_ = [
    # 'DEFAULT_TEST_REC_VARIANT.py',
    'DEFAULT_TEST_FLICKR_VARIANT.py',
    # 'DEFAULT_TEST_GQA_VARIANT.py',
    # 'DEFAULT_TEST_CLEVR_VARIANT.py',
    # 'DEFAULT_TEST_GPTGEN_VARIANT.py',
    # 'DEFAULT_TEST_VCR_VARIANT.py',
    # 'DEFAULT_TEST_VQAv2_VARIANT.py',
    # 'DEFAULT_TEST_POINT_VARIANT.py',
    # 'DEFAULT_TEST_POPE_VARIANT.py',
]

DEFAULT_TEST_DATASET = dict(
    flickr=dict(
        type='FlickrDataset',
        filename=r'{{fileDirname}}/../../../data/CWB_flickr30k_eval.jsonl',
        image_folder=r'/data/junjiewen/flicker/flickr30k-images',
        template_file=r'{{fileDirname}}/template/flickr30k.json',
    ),
    # **_base_.DEFAULT_TEST_REC_VARIANT,
    **_base_.DEFAULT_TEST_FLICKR_VARIANT,
    # **_base_.DEFAULT_TEST_GQA_VARIANT,
    # **_base_.DEFAULT_TEST_CLEVR_VARIANT,
    # **_base_.DEFAULT_TEST_GPTGEN_VARIANT,
    # **_base_.DEFAULT_TEST_VCR_VARIANT,
    # **_base_.DEFAULT_TEST_VQAv2_VARIANT,
    # **_base_.DEFAULT_TEST_POINT_VARIANT,
    # **_base_.DEFAULT_TEST_POPE_VARIANT,
)

what's the image preprocess method?

It seems the image preprocess method are saved in the ckpt, and can't obtain from the code.

infer problem: ModuleNotFoundError: No module named 'petrel_client'

def init_ceph_client_if_needed():
global client
if client is None:
logger.info(f"initializing ceph client ...")
st = time.time()
from petrel_client.client import Client # noqa
client = Client(enable_mc=True)
ed = time.time()
logger.info(f"initialize client cost {ed - st:.2f} s")

I met a problem
"ModuleNotFoundError: No module named 'petrel_client"

How can I solve it? Thanks

May I ask for advice on process_conv_multimage in single_image_convsation.py?What does it for? Can it handle multiple images?

I have collected the download addresses for all the training data and posted them here for others to download conveniently.

I am reproducing the model on V100 GPU. If anyone is doing the same, I hope we can communicate and exchange ideas together. My wechat : Anymake_ren
1、Flickr 30k ：
http://shannon.cs.illinois.edu/DenotationGraph/data/index.html

2、The Visual Genome Dataset
VG数据集主要由4个部分组成：
Region Description：图片被划分成一个个region，每个region都有与其对应的一句自然语言描述。
Region Graph：每个region中的object、attribute、relationship被提取出来，构成局部的“Scene Graph”。
Scene Graph：把一张图片中的所有Region Graph合并成一个全局的Scene Graph。
QA：每张图片会有多对QA，分为两种类型：region-based和freeform。前者基于Region Description提出，与局部region的内容直接相关；后者则基于整张图片来提出。
https://homes.cs.washington.edu/~ranjay/visualgenome/api.html

3、LLaVA-CC3M-Pretrain-595K
https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/tree/main

4、LLaVA-Instruct-150K
图片是COCO2014
https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main

5、CLEVR：
该数据集为合成数据集，是由一些简单的几何形状构成的视觉场景。数据集中的问题总是需要一长串的推理过程，为了对推理能力进行详细评估，所有问题分为了5类：属性查询（querying attribute），属性比较（comparing attributes），存在性（existence），计数（counting），整数比较（integer comparison）。所有的问题都是程序生成的。该数据集的人为标注数据子集为CLEVR-Humans
https://cs.stanford.edu/people/jcjohns/clevr/

6、GQA
图片20G，
https://cs.stanford.edu/people/dorarad/gqa/download.html

7、Visual7W: Grounded Question Answering in Images
Visual7W 是一个图像内容理解的数据集，通过对图像区域的文字描述和互相之间的关联，进行视觉问答 (Visual Question Answering) 任务，数据集中不仅包含图像本身，还包括图像区域内容相关的问答。
Visual7W 是 Visual Genome 数据集的一个子集，包含 47,300 张 COCO 数据集图像，327,929 个问答对，1,311,756 个人类生成的多选题，以及涵盖 36,579 个类别的 561,459 个 object groundings。
Visual7W 的问题主要由 What, Where, How, When, Who,Why, 以及 Which 构成。问题为多选，每个问题都有四个候选答案。
http://ai.stanford.edu/~yukez/visual7w/

8、VCR：Visual Commonsense Reasoning
VCR 全称 Visual Commonsense Reasoning，是一个用于视觉常识推理的大规模数据集。该数据集提出了关于图像的具有挑战性的问题，机器需要完成两个子任务：正确回答问题以及提供理由证明其答案的合理性。
VCR 数据集包含大量问题，其中 212K 个用于训练，26K 个用于验证，25K 个用于测试。答案和理由来自超过 110K 个不重复的电影场景。
https://visualcommonsense.com/download/

9、VQAv2 dataset
https://visualqa.org/download.html

10、VQA-E
全称 Visual Question Answering with Explanation，是带有解析的视觉问答数据集，其涉及的模型需要预测并生成答案解析。它是由 VQA v2 数据集自动衍生出来的，为每个 “图像-问题-答案三要素” 合成为一个文本解析，这使得问答过程更容易理解和可追溯。
COCO Images: Training images [83K/13GB], Validation Images [41K/6GB]
https://github.com/liqing-ustc/VQA-E

11、VQA-X （2018）
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
VQA-X是一个既有文字解释又有Visual grounding的数据集, 图片是coco2014

ModuleNotFoundError: No module named 'petrel_client'

Hello, I installed Shikra according to the readme and downloaded the dataset.
I tried to perform inference but encountered an error during runtime.
The error is "ModuleNotFoundError: No module named 'petrel_client".

My python version is 3.9.2, torch vesion is 2.0.1.
The following is the error message

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/shikra/mllm/pipeline/finetune.py:141 in │
│ │
│ 138 │
│ 139 │
│ 140 if name == "main": │
│ ❱ 141 │ main() │
│ 142 │
│ │
│ /root/shikra/mllm/pipeline/finetune.py:127 in main │
│ │
│ 124 │ │ │ prefix = f"multitest{k}" │
│ 125 │ │ │ │
│ 126 │ │ │ trainer.compute_metrics = _compute_metrics │
│ ❱ 127 │ │ │ _pred_results = trainer.predict(_ds, metric_key_prefix=_prefix, **gen_kwargs │
│ 128 │ │ │ trainer.log_metrics(_prefix, _pred_results.metrics) # noqa │
│ 129 │ │ │ trainer.save_metrics(_prefix, _pred_results.metrics) # noqa │
│ 130 │ │ │ trainer.save_prediction(_pred_results, file_key_prefix=_prefix) │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/transformers/trainer_seq2seq.py:135 in predict │
│ │
│ 132 │ │ ) │
│ 133 │ │ self._gen_kwargs = gen_kwargs │
│ 134 │ │ │
│ ❱ 135 │ │ return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix= │
│ 136 │ │
│ 137 │ def prediction_step( │
│ 138 │ │ self, │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/transformers/trainer.py:3020 in predict │
│ │
│ 3017 │ │ start_time = time.time() │
│ 3018 │ │ │
│ 3019 │ │ eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else se │
│ ❱ 3020 │ │ output = eval_loop( │
│ 3021 │ │ │ test_dataloader, description="Prediction", ignore_keys=ignore_keys, metric_k │
│ 3022 │ │ ) │
│ 3023 │ │ total_batch_size = self.args.eval_batch_size * self.args.world_size │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/transformers/trainer.py:3115 in evaluation_loop │
│ │
│ 3112 │ │ │
│ 3113 │ │ observed_num_examples = 0 │
│ 3114 │ │ # Main evaluation loop │
│ ❱ 3115 │ │ for step, inputs in enumerate(dataloader): │
│ 3116 │ │ │ # Update the observed num examples │
│ 3117 │ │ │ observed_batch_size = find_batch_size(inputs) │
│ 3118 │ │ │ if observed_batch_size is not None: │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:633 in next │
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(pytorch/pytorch#76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:677 in _next_data │
│ │
│ 674 │ │
│ 675 │ def _next_data(self): │
│ 676 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 677 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIteration │
│ 678 │ │ if self._pin_memory: │
│ 679 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memory_device) │
│ 680 │ │ return data │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py:51 in fetch │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "getitems") and self.dataset.getitems: │
│ 49 │ │ │ │ data = self.dataset.getitems(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_index] │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py:51 in │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "getitems") and self.dataset.getitems: │
│ 49 │ │ │ │ data = self.dataset.getitems(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_index] │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /root/shikra/mllm/dataset/single_image_convsation.py:42 in getitem │
│ │
│ 39 │ │
│ 40 │ def getitem(self, index, debug_mode=False) -> Dict[str, Any]: │
│ 41 │ │ # getitem │
│ ❱ 42 │ │ item = self.get_raw_item(index) │
│ 43 │ │ image: Image.Image = item.get('image', None) │
│ 44 │ │ target: Dict[str, Any] = item.get('target', None) │
│ 45 │ │ raw_conv: List[Dict[str, Any]] = item['conversations'] │
│ │
│ /root/shikra/mllm/dataset/single_image_convsation.py:267 in get_raw_item │
│ │
│ 264 │ │
│ 265 │ def get_raw_item(self, index) -> Dict[str, Any]: │
│ 266 │ │ self.initialize_if_needed() │
│ ❱ 267 │ │ return self.dataset[index] │
│ 268 │ │
│ 269 │ def repr(self) -> str: │
│ 270 │ │ head = "Dataset " + self.class.name │
│ │
│ /root/shikra/mllm/dataset/single_image_dataset/pope.py:16 in getitem │
│ │
│ 13 │ │
│ 14 │ def getitem(self, index): │
│ 15 │ │ item = self.get_raw_item(index) │
│ ❱ 16 │ │ image = self.get_image(image_path=item['image']) │
│ 17 │ │ │
│ 18 │ │ question = item['text'] │
│ 19 │ │ final_question = self.get_template().replace(QUESTION_PLACEHOLDER, question) │
│ │
│ /root/shikra/mllm/dataset/utils/mixin.py:72 in get_image │
│ │
│ 69 │ def get_image(self, image_path): │
│ 70 │ │ if self.image_folder is not None: │
│ 71 │ │ │ image_path = os.path.join(self.image_folder, image_path) │
│ ❱ 72 │ │ image = read_img_general(image_path) │
│ 73 │ │ return image │
│ 74 │ │
│ 75 │ def get_template(self): │
│ │
│ /root/shikra/mllm/dataset/utils/io.py:20 in read_img_general │
│ │
│ 17 │
│ 18 def read_img_general(img_path): │
│ 19 │ if "s3://" in img_path: │
│ ❱ 20 │ │ cv_img = read_img_ceph(img_path) │
│ 21 │ │ # noinspection PyUnresolvedReferences │
│ 22 │ │ return Image.fromarray(cv2.cvtColor(cv_img, cv2.COLOR_BGR2RGB)) │
│ 23 │ else: │
│ │
│ /root/shikra/mllm/dataset/utils/io.py:31 in read_img_ceph │
│ │
│ 28 │
│ 29 │
│ 30 def read_img_ceph(img_path): │
│ ❱ 31 │ init_ceph_client_if_needed() │
│ 32 │ img_bytes = client.get(img_path) │
│ 33 │ assert img_bytes is not None, f"Please check image at {img_path}" │
│ 34 │ img_mem_view = memoryview(img_bytes) │
│ │
│ /root/shikra/mllm/dataset/utils/io.py:46 in init_ceph_client_if_needed │
│ │
│ 43 │ if client is None: │
│ 44 │ │ logger.info(f"initializing ceph client ...") │
│ 45 │ │ st = time.time() │
│ ❱ 46 │ │ from petrel_client.client import Client # noqa │
│ 47 │ │ client = Client(enable_mc=True) │
│ 48 │ │ ed = time.time() │
│ 49 │ │ logger.info(f"initialize client cost {ed - st:.2f} s") │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'petrel_client'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/shikra/venv/bin/accelerate:8 in │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if name == 'main': │
│ 7 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if name == "main": │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/accelerate/commands/launch.py:941 in │
│ launch_command │
│ │
│ 938 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │
│ 939 │ │ sagemaker_launcher(defaults, args) │
│ 940 │ else: │
│ ❱ 941 │ │ simple_launcher(args) │
│ 942 │
│ 943 │
│ 944 def main(): │
│ │
│ /root/shikra/venv/lib/python3.9/site-packages/accelerate/commands/launch.py:603 in │
│ simple_launcher │
│ │
│ 600 │ process.wait() │
│ 601 │ if process.returncode != 0: │
│ 602 │ │ if not args.quiet: │
│ ❱ 603 │ │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │
│ 604 │ │ else: │
│ 605 │ │ │ sys.exit(1) │
│ 606 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/root/shikra/venv/bin/python', 'mllm/pipeline/finetune.py', 'config/shikra_eval_multi_pope.py',
'--cfg-options', 'model_args.model_name_or_path=/root/shikra/shikra-7b', '--per_device_eval_batch_size', '1']' returned
non-zero exit status 1.

why can't I get the right answer？

I followed the installation steps but got empty response or error response.
I used the shikra-7b-delta-v1, shikra-7b-delta-v1-0708, and used the script of
python mllm/models/shikra/apply_delta.py
--base /path/to/llama-7b
--target /output/path/to/shikra-7b
--delta shikras/shikra-7b-delta-v1

but got wrong even used the example:

warnings.warn(
done generated in 329.3124873638153 seconds
response: Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer676767676767676767676767676767676767676767676767676767676767676767676767676762]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]6262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262
USER: Provide a comprehensive description of the image and specify the positions of any mentioned objects in square brackets. <PIL.Image.Image image mode=RGB size=640x640 at 0x7F4D982D1CF0>
ASSISTANT: Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer676767676767676767676767676767676767676767676767676767676767676767676767676762]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]6262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262 <PIL.Image.Image image mode=RGB size=640x640 at 0x7F4D982D1CF0>
[['USER: Provide a comprehensive description of the image <image> and specify the positions of any mentioned objects in square brackets.', None], [None, 'ASSISTANT: Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer Answer676767676767676767676767676767676767676767676767676767676767676767676767676762]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]62]6262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262626262']]

help me!~

Do you have plan to release an online demo?

For example, release an online gradio.app link or deploy on huggingface space?
Or provide a run script for a local demo based on gradio or other libraries.

Use which annotation tools?

use which annotation tools

What's the definition of shikra model's generation function and how to download the transformers version provided in requirements.txt?

What's the definition of shikra model's generation function?

I found that the definition of the generation function is not provided in this repo and the transformers version provided in the requirements.txt is not available now.

Can you tell me how can I download the correct transformers version and what's the definition of shikra model's generation function?

Thanks!

question: How to get output for a single image

Hi, thanks for your excellent work. I wonder to know how to get ouptut for a single image, as the inference code is used for a dataset. How could we use for a single image simply?

About accelerate config

Hello!

I'm here to help. Could you please provide more context about the "accelerate config" you're referring to? Accelerate is a library that helps simplify multi-GPU and distributed training in PyTorch, but I'd need more information about the specific configuration you're asking about in order to provide guidance.

Additionally, if you could provide more details about the context of your question and which specific settings you're looking for guidance on, I'd be happy to assist you. If you're trying to replicate a research paper's results, understanding the context and the details of the settings you're referring to would be important for providing accurate advice.

Question about Training init weight

Good Work!
When train the model, what should be the init weight? (/path/to/init/checkpoint)
LLaMA-7B or shikra-7B?

provide a run script for running one conversion

Can u provide a run script for running one conversion based on one image and one text input?

Inconsistent performance on MMBench

I evaluate the released model with your official inference code on MMBench but only got 36.x overall accuracy.
It does not match the performance on the leaderboard.

Do you have any suggestion?

Web demo output is weird

I downloaded the model from huggingface: https://huggingface.co/shikras/shikra-7b-delta-v1/tree/main

And I run the web demo with: python mllm/demo/server.py --model_path /path/to/shikra/ckpt

However, I got the output:

Is there anything wrong with my environment? Is the latest transformers not working for this model?

Why choose to train and finetune LLM for detection task instead of Vit or other mid layers?

Can you provide a sample of the training set?

Very good work. I was working on this before, but I was not successful. I used Visualglm to do lora training. The dataset also used pictures and bboxes of various types of objects in the pictures, and then gpt to generate descriptive statements, but instead of having coordinate information in the statements, I had gpt generate orientation space nouns instead. The results were mediocre, and even though I used comparison samples for training, there was still a serious illusion. It is good to see that your work has been successful so far, can you share examples of the sample data used for training for each task? I only see the questioning interrogatives and instructions in the paper, not the format of the responses generated by gpt. Thanks!
非常好的工作。
之前我也在做这一方面的工作，但是我没有成功。
我使用了Visualglm做lora训练。
数据集也是用了图片和图片中各类物体的bbox，再用gpt生成描述性语句，但是语句里不包含有坐标信息，而是让gpt生成了方位空间名词来代替。
效果很一般，即使我使用了对比样本进行训练，依旧有很严重的幻觉。
很高兴看到你们的工作取得了现在的成功，可以分享一下各个任务训练时使用的样本数据的示例吗？我在论文中只看到了提问的问句和指令，没有看到gpt生成的回答的格式，谢谢！

Question about the training parameter setting at stage1 and stage2

Dear the authors,
Thanks for the great work!~ I am trying to reproduce the shikra-7b model by training from vicuna-7b according to the descriptions in the SHIKRA paper. And I'm confused about some details of the parameter setting. Would you please provide me with some solutions to the following questions?

Should I use vicuna-7b as the training init weight? If yes, should I use the raw vicuna-7b or should I replace the "config.json, generation_config.json, special_tokens_map.json, tokenizer_config.json" with that of shikra-7b? Since the above json files are not the same between vicuna-7b and shikra-7b.
What config files should I use at stage1 and stage2? shikra_pretrain_concat8_stage1.py for stage1 and shikra_pretrain_final19_stage2.py for stage2? And could you tell me what's the usage of shikra_pretrain_concat3_stage0.py? Only stage 1 and stage2 are introduced in the paper.
Are the two stages trained seperately which means we train the stage1 first and save the model, then resume from the previous model to train the stage2? And what's the num_train_epochs for stage2? Does it has the same setting of 1.5 epochs as stage1?

Really Looking forward to you reply~

Doubts about Training Commands: Inconsistency between the Ratio in the Two-Stage Training

Thank you for sharing the code and data. Could you please provide detailed training commands? According to the article, the training process consists of two stages: the first stage is the reorganized VL dataset, and the second stage is the instruction stage with a 5:5 sampling ratio. However, we noticed that the provided training command is only 'config/shikra_pretrain_final19_stage2', which has a ratio of 1:9. This is a bit confusing. Could you please clarify the correct training commands for both stages as mentioned in the article?

gqa_scene_graph_index.json

i wanna know where is gqa_scene_graph_index.json

which torch version used?

the new version is 2.0.1, I wonder this project use 1.* or 2.*?

Cuda memory requirement

Q1: What is the minimum cuda memory requirement in training?
Q2: Does the raw training script support Deepspeed? It seems 24G cuda memory is not enough in 1 batch size training.

Could you please share the prompt for GPT4?

Hi, thanks for you brilliant work~

I want to finetune it on my dataset, and I want to follow your way to generate some GQA data. Could you please share the data generation code or prompt?

Thanks!

question about REC result

Hi,

Thank you for your great job.

could you explain the result of model “ofa” in the Table 3 in your paper ? i did not find the data in "ofa" paper.

best,
zhi

Question about the training init weight

Dear the suthors,
Thanks for the great work! I encounter a problem when training the model to reproduce the performance in shikra~
When training the model, what should be the initial weight? (/path/to/init/checkpoint)
LLaMA-7B or shikra-7B?

Looking forward to your reply!

What is the difference between your work and kosmos-2 ?

Could you tell me what's you differ from kosmos-2 ? I haven't seen your comparison with kosmos-2, but I think the two jobs are relatively similar.

An NCCL RuntimeError occurred when saving the model

Dear authors, I ran into an error when saving the model. Concretely, the program stuck at the model saving stage and timeout after ~30mins. Seems that's a FSDP issue? Do you happen to know how to resolve this issue? Thanks!

If anybody happens to know the solution please help me, I've been stuck here for several days. Many thx!!

The traceback info is similar to the following.

{'train_runtime': 119.8187, 'train_samples_per_second': 2.504, 'train_steps_per_second': 0.075, 'train_loss': 1.2388523353470697, 'epoch': 2.57}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:54<00:00, 12.76s/it]
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2778, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801649 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2778, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1801649 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5037) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
/tmp/FastChat/fastchat/train/train_mem.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-07_07:00:37
host : edf307caae46
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 5037)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5037

How were the COT and GCOT constructed in the training set of CLEVR?

Thank you for your wonderful work!

In section 6.1, it is mentioned that "We train our Shikra-7B (without pre-training) on CLEVR in three settings: 1) Only use Question and Answer (Q→A); 2) Use Question, CoT, and answer (Q→CA); 3) Use GCoT with Center Point annotation and answer (Q→CPointA)."
How were the COT and GCOT constructed in the training set?

shikras / shikra Goto Github PK

shikra's People

Contributors

Stargazers

Watchers

Forkers

shikra's Issues

Command

config/shikra_eval_flickr.py

DEFAULT_TEST_DATASET.py

Root Cause (first observed failure): [0]: time : 2023-04-07_07:00:37 host : edf307caae46 rank : 0 (local_rank: 0) exitcode : -6 (pid: 5037) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 5037

Recommend Projects

Recommend Topics

Recommend Org

Root Cause (first observed failure):
[0]:
time : 2023-04-07_07:00:37
host : edf307caae46
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 5037)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5037