haotian-liu / llava

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Home Page: https://llava.hliu.cc

License: Apache License 2.0

Python 85.66% HTML 2.12% JavaScript 2.76% CSS 0.50% Shell 8.53% Dockerfile 0.43%
gpt-4 chatbot chatgpt llama multimodal llava foundation-models instruction-tuning multi-modality visual-language-learning

llava's Introduction

🌋 LLaVA: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[๐Ÿ“ข LLaVA-NeXT Blog] [Project Page] [Demo] [Data] [Model Zoo]

๐ŸคCommunity Contributions: [llama.cpp] [Colab] [๐Ÿค—Space] [Replicate] [AutoGen] [BakLLaVA]

Improved Baselines with Visual Instruction Tuning [Paper] [HF]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper] [HF]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

Release

  • [03/10] Releasing LMMs-Eval, a highly efficient evaluation pipeline we used when developing LLaVA-NeXT. It supports the evaluation of LMMs on dozens of public datasets and allows new dataset onboarding, making the development of new LMMs much faster. [Blog] [Codebase]
  • [1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in the Model Zoo. Training/eval data and scripts are coming soon.
  • [11/10] LLaVA-Plus is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plugs and Learns to Use Skills). [Project Page] [Demo] [Code] [Paper]
  • [11/2] LLaVA-Interactive is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [Project Page] [Demo] [Code] [Paper]
  • [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance to full-model finetuning, with a reduced GPU RAM requirement (ckpts, script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA.
  • [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, which has generously supported our research! [🤗 Demo]
  • [10/5] 🔥 LLaVA-1.5 is out! It achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA, uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and the evaluation scripts are released here!
  • [9/26] LLaVA is enhanced with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at the [LLaVA-RLHF] project.
  • [9/22] LLaVA is accepted by NeurIPS 2023 as an oral presentation, and LLaVA-Med is accepted by the NeurIPS 2023 Datasets and Benchmarks Track as a spotlight presentation.
More

  • [7/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo!
  • [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out [Slides] [Notes] [YouTube] [Bilibili].
  • [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see the documentation here.
  • [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical-domain large language and vision models with GPT-4 level capabilities. Check out the paper and page.
  • [5/6] We are releasing LLaVA-Lightning-MPT-7B-preview, based on MPT-7B-Chat! See here for more details.
  • [5/2] 🔥 We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details.
  • [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as little as 12GB of VRAM! Try it out here.
  • [4/17] 🔥 We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Check out the paper and demo.

Code License Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Contents

Install

If you are not using Linux, do NOT proceed with these steps; see the instructions for macOS and Windows instead.

  1. Clone this repository and navigate to LLaVA folder
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
  2. Install Package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Upgrade to latest code base

git pull
pip install -e .

# if you see some import errors when you upgrade,
# please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir

Quick Start With HuggingFace

Example Code
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "liuhaotian/llava-v1.5-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

Check out the details of the load_pretrained_model function in llava/model/builder.py.

You can also use the eval_model function in llava/eval/run_llava.py to get the output easily. By doing so, you can run this code on Colab directly after downloading this repository.

model_path = "liuhaotian/llava-v1.5-7b"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
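If you want to query several images in one script, a minimal sketch that simply reuses the eval_model pattern above is shown below; the image list and question are placeholders, and keep in mind that eval_model loads the checkpoint on each call, so for many images the load_pretrained_model route shown earlier may be preferable.

from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "liuhaotian/llava-v1.5-7b"
image_files = [
    "https://llava-vl.github.io/static/images/view.jpg",  # placeholder; use your own files or URLs
]

for image_file in image_files:
    args = type('Args', (), {
        "model_path": model_path,
        "model_base": None,
        "model_name": get_model_name_from_path(model_path),
        "query": "Describe this image in one sentence.",
        "conv_mode": None,
        "image_file": image_file,
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 256,
    })()
    eval_model(args)  # prints the model's answer for this image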

LLaVA Weights

Please check out our Model Zoo for all public LLaVA checkpoints, and the instructions of how to use the weights.

Demo

Gradio Web UI

To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.

flowchart BT
    %% Declare Nodes
    gws("Gradio (UI Server)")
    c("Controller (API Server):<br/>PORT: 10000")
    mw7b("Model Worker:<br/>llava-v1.5-7b<br/>PORT: 40000")
    mw13b("Model Worker:<br/>llava-v1.5-13b<br/>PORT: 40001")
    sglw13b("SGLang Backend:<br/>llava-v1.6-34b<br/>http://localhost:30000")
    lsglw13b("SGLang Worker:<br/>llava-v1.6-34b<br/>PORT: 40002")

    %% Declare Styles
    classDef data fill:#3af,stroke:#48a,stroke-width:2px,color:#444
    classDef success fill:#8f8,stroke:#0a0,stroke-width:2px,color:#444
    classDef failure fill:#f88,stroke:#f00,stroke-width:2px,color:#444

    %% Assign Styles
    class id,od data;
    class cimg,cs_s,scsim_s success;
    class ncimg,cs_f,scsim_f failure;

    subgraph Demo Connections
        direction BT
        c<-->gws
        
        mw7b<-->c
        mw13b<-->c
        lsglw13b<-->c
        sglw13b<-->lsglw13b
    end

Launch a controller

python -m llava.serve.controller --host 0.0.0.0 --port 10000

Launch a gradio web server.

python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.

Launch a SGLang worker

This is the recommended way to serve the LLaVA model with high throughput; you need to install SGLang first. Note that 4-bit quantization is not yet supported on SGLang-LLaVA, so if you have limited GPU VRAM, please check out the model worker with quantization.

pip install "sglang[all]"

You'll first launch an SGLang backend worker, which will execute the models on GPUs. Remember the --port you've set; you'll use it later.

# Single GPU
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000

# Multiple GPUs with tensor parallel
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-13b --tokenizer-path llava-hf/llava-1.5-13b-hf --port 30000 --tp 2

Tokenizers (temporary): llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, liuhaotian/llava-v1.6-34b-tokenizer.

You'll then launch a LLaVA-SGLang worker that communicates between the LLaVA controller and the SGLang backend to route requests. Set --sgl-endpoint to http://127.0.0.1:port where port is the one you just set (default: 30000).

python -m llava.serve.sglang_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --sgl-endpoint http://127.0.0.1:30000

Launch a model worker

This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b

Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
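If you prefer to verify registration from a script instead of the browser, a minimal sketch is below. It assumes the controller exposes the same /refresh_all_workers and /list_models routes the Gradio server uses; check llava/serve/controller.py if the routes differ in your version.

import requests

controller = "http://localhost:10000"

# Ask the controller to re-poll its workers, then list the registered models.
requests.post(controller + "/refresh_all_workers")
models = requests.post(controller + "/list_models").json()["models"]
print(models)  # e.g. ['llava-v1.5-13b'] once the worker has registered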

You can launch as many workers as you want, and compare between different model checkpoints in the same Gradio interface. Please keep the --controller the same, and modify the --port and --worker to a different port number for each worker.

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path <ckpt2>

If you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the --device flag: --device mps.
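To compare several checkpoints without typing each command by hand, a minimal sketch (the checkpoint list is a placeholder) that launches one worker per model on consecutive ports, using the same flags as above:

import subprocess

controller = "http://localhost:10000"
checkpoints = ["liuhaotian/llava-v1.5-7b", "liuhaotian/llava-v1.5-13b"]  # placeholders

workers = []
for i, ckpt in enumerate(checkpoints):
    port = 40000 + i  # each worker gets its own port
    cmd = [
        "python", "-m", "llava.serve.model_worker",
        "--host", "0.0.0.0",
        "--controller", controller,
        "--port", str(port),
        "--worker", f"http://localhost:{port}",
        "--model-path", ckpt,
    ]
    workers.append(subprocess.Popen(cmd))

for p in workers:
    p.wait()  # keep this script alive while the workers run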

Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)

If the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with CUDA_VISIBLE_DEVICES. Below is an example of running with the first two GPUs.

CUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b

Launch a model worker (4-bit, 8-bit inference, quantized)

You can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with a reduced GPU memory footprint, potentially on a GPU with as little as 12GB of VRAM. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append --load-4bit or --load-8bit to the model worker command that you are executing. Below is an example of running with 4-bit quantization.

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b --load-4bit

Launch a model worker (LoRA weights, unmerged)

You can launch the model worker with LoRA weights, without merging them with the base checkpoint, to save disk space. There is additional loading time, but the inference speed is the same as with merged checkpoints. Unmerged LoRA checkpoints do not have lora-merge in the model name, and are usually much smaller (less than 1GB) than the merged checkpoints (13G for 7B, and 25G for 13B).

To load unmerged LoRA weights, you simply need to pass an additional argument --model-base, which is the base LLM that was used to train the LoRA weights. You can check the base LLM of each LoRA checkpoint in the model zoo.

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base lmsys/vicuna-13b-v1.3

CLI Inference

Chat about images using LLaVA without the need for the Gradio interface. It also supports multiple GPUs, and 4-bit and 8-bit quantized inference. With 4-bit quantization, our LLaVA-1.5-7B uses less than 8GB of VRAM on a single GPU.

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit

Train

Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to the README of this version for now. We'll add them to a separate doc later.

LLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.

LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
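As a quick sanity check on that formula, a minimal sketch (the GPU counts below are hypothetical) for choosing gradient_accumulation_steps so the global batch size stays fixed:

# global_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    steps, rem = divmod(global_batch, per_device_batch * num_gpus)
    assert rem == 0, "per_device_batch * num_gpus must divide the global batch size"
    return steps

# Finetuning uses a global batch size of 128 (see the table below).
print(grad_accum_steps(128, 16, 8))  # 8 GPUs, per-device batch 16 -> 1 accumulation step
print(grad_accum_steps(128, 8, 4))   # 4 GPUs, per-device batch 8  -> 4 accumulation steps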

Hyperparameters

We use a similar set of hyperparameters to Vicuna in finetuning. The hyperparameters used in pretraining and finetuning are provided below.

  1. Pretraining

Hyperparameter    Global Batch Size    Learning rate    Epochs    Max length    Weight decay
LLaVA-v1.5-13B    256                  1e-3             1         2048          0

  2. Finetuning

Hyperparameter    Global Batch Size    Learning rate    Epochs    Max length    Weight decay
LLaVA-v1.5-13B    128                  2e-5             1         2048          0

Download Vicuna checkpoints (automatically)

Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.

Pretrain (feature alignment)

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.

Pretraining takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution of 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.

Training script with DeepSpeed ZeRO-2: pretrain.sh.

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
Pretraining takes around 20 hours for LLaVA-7B on 8x V100 (32G).

We provide a training script with DeepSpeed here.

Visual Instruction Tuning

  1. Prepare data

Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets: COCO (train2017), GQA, OCR-VQA, TextVQA, and Visual Genome (VG_100K and VG_100K_2).

After downloading all of them, organize the data as follows in ./playground/data:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
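Before moving on to training, a minimal sketch (the paths mirror the tree above; adjust data_root if you keep the data elsewhere) to sanity-check that all image folders are in place:

from pathlib import Path

data_root = Path("./playground/data")
expected = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

# Report any folder from the expected layout that is not present on disk.
missing = [d for d in expected if not (data_root / d).is_dir()]
if missing:
    print("Missing image folders:", ", ".join(missing))
else:
    print("All image folders found under", data_root)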
  2. Start training!

You may download our pretrained projectors in the Model Zoo. It is not recommended to use legacy projectors, as they may have been trained with a different version of the codebase, and if any option is off, the model will not function or train as expected.

Visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).

Training script with DeepSpeed ZeRO-3: finetune.sh.

If you do not have enough GPU memory:

  • Use LoRA: finetune_lora.sh. We are able to fit 13B training in 8-A100-40G/8-A6000, and 7B training in 8-RTX3090. Make sure per_device_train_batch_size*gradient_accumulation_steps is the same as the provided script for best reproducibility.
  • Replace zero3.json with zero3_offload.json which offloads some parameters to CPU RAM. This slows down the training speed.

If you are interested in finetuning the LLaVA model to your own task/data, please check out Finetune_Custom_Data.md.

New options to note:

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector.
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
  • --image_aspect_ratio pad: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
  • --group_by_modality_length True: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25%, and does not affect the final outcome.

Evaluation

In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate with beam search, so that the inference process stays consistent with the real-time chat demo.

See Evaluation.md.

GPT-assisted Evaluation

Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

  1. Generate LLaVA responses
python model_vqa.py \
    --model-path ./checkpoints/LLaVA-13B-v0 \
    --question-file \
    playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder \
    /path/to/coco2014_val \
    --answers-file \
    /path/to/answer-file-our.jsonl
  2. Evaluate the generated responses. In our case, answer-file-ref.jsonl is the response generated by text-only GPT-4 (0314), with the context captions/boxes provided.
OPENAI_API_KEY="sk-***********************************" python llava/eval/eval_gpt_review_visual.py \
    --question playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --context llava/eval/table/caps_boxes_coco2014_val_80.jsonl \
    --answer-list \
    /path/to/answer-file-ref.jsonl \
    /path/to/answer-file-our.jsonl \
    --rule llava/eval/table/rule.json \
    --output /path/to/review.json
  3. Summarize the evaluation results
python summarize_gpt_review.py
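For a quick look at the scores without the full summarizer, the sketch below may help; it assumes each line of the review file is a JSON object whose 'tuple' field holds the two scores (GPT-4 reference answer first, LLaVA second). Field names may differ between versions, so check the actual output of eval_gpt_review_visual.py and prefer summarize_gpt_review.py for reported numbers.

import json

review_file = "/path/to/review.json"  # the --output file from step 2

ref_scores, our_scores = [], []
with open(review_file) as f:
    for line in f:
        if not line.strip():
            continue
        rec = json.loads(line)
        ref, ours = rec["tuple"]  # assumed field name; verify against your output
        ref_scores.append(ref)
        our_scores.append(ours)

avg_ref = sum(ref_scores) / len(ref_scores)
avg_ours = sum(our_scores) / len(our_scores)
print(f"GPT-4 reference: {avg_ref:.2f}, LLaVA: {avg_ours:.2f}, relative: {100 * avg_ours / avg_ref:.1f}%")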

Citation

If you find LLaVA useful for your research and applications, please cite using this BibTeX:

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}

Acknowledgement

  • Vicuna: the codebase we built upon, and our base model Vicuna-13B that has the amazing language capabilities!

Related Projects

For future project ideas, please check out:

llava's People

Contributors

abdul, abdullamatar, chunyuanli, didier-durand, diracdeltas, dribnet, eggry, eltociear, fvaysh, guanlaoda, haotian-liu, heltrix, hill2hill, hirethehero, hyj1991, kishida, l-salewski, mao-code, mattmazzola, omahs, paradoxzw, payne911, philokey, simon-lund, timabdulla, tonywang10101, winglian, yorickvp, yvrjsharma, zhaoyangli-nju


llava's Issues

Inability to Reproduce Effective Results with ScienceQA

Hi there,
I followed the instructions in the README to evaluate LLaVA-13B-ScienceQA's performance on the ScienceQA dataset. However, I noticed that the answers predicted by LLaVA seem to be significantly worse than the results reported in the paper. I'm wondering where the issue might be coming from. I'm confident that I've completed all the mentioned steps, such as parameter conversion and dataset processing.
This is a great piece of work, and if this issue can be resolved, it would make a huge contribution to the research community.

Here are some examples:
"prompt": โ€What is the capital of Massachusetts?\nContext: N/A\nOptions: (A) Cambridge (B) Plymouth (C) Boston (D) Dover\n", "text": "HALSOLUTERs"

"prompt": โ€œLook at the models of molecules below. Select the elementary substance.\nContext: N/A\nOptions: (A) carbon tetrachloride (B) acetaldehyde (C) bromine\n", "text": ""

"prompt": "What can Greta and Allie trade to each get what they want?\nContext: Trade happens when people agree to exchange goods and services. People give up something to get something else. Sometimes people barter, or directly exchange one good or service for another.\nGreta and Allie open their lunch boxes in the school cafeteria. Both of them could be happier with their lunches. Greta wanted broccoli in her lunch and Allie was hoping for tomatoes. Look at the images of their lunches. Then answer the question below.\nOptions: (A) Greta can trade her tomatoes for Allie's sandwich. (B) Allie can trade her broccoli for Greta's oranges. (C) Allie can trade her almonds for Greta's tomatoes. (D) Greta can trade her tomatoes for Allie's broccoli.\n", "text": "H1.\nSOLUTION: Greta and Allie can trade tomatoes for broccoli to get what they want.โ€œ

"prompt": "Is the following trait inherited or acquired?\nSasha is good at knitting hats.\nContext: Hint: People who can knit had to learn how to do it.\nOptions: (A) acquired (B) inherited", "text": "Assistant: LECTURE: Organisms, including people, have both inherited and acquired traits. Inherited and acquired traits are gained in different ways.\nInherited traits are passed down through families. Children gain these traits from their parents. Inherited traits do not need to be learned.\nAcquired traits are gained during a person's life. Some acquired traits, such as riding a bicycle, are gained by learning. Other acquired traits, such as scars, are caused by the environment. Children do not inherit their parents' acquired traits.\nSOLUTION: People are not born knowing how to knit. Instead, many people learn how to knit. Knitting well takes practice. So, knitting well is an acquired trait."

"prompt": "What is the capital of Delaware?\nContext: N/A\nOptions: (A) Montpelier (B) Trenton (C) Dover (D) Georgetown\n", "text": "SOLUTSOLUTSOLUTRs;'sOLs;'sOLs;'OLs;'OLs;OLs;'OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs ;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;OLs;"

Question about the OCR capability

Great work indeed!

From the description in the paper, I do not find any special OCR module. I am curious how LLaVA obtains the ability to understand the text in the image (e.g., the famous examples of chicken nuggets). Is there any magic in the training dataset?

Is it possible to run on CPU?

Thanks for the great work.

I have a server with large RAM but no GPU, and a local PC with 16GB of VRAM. Unfortunately, both of them seem insufficient.

Sorry, but I'm a newbie at this. When I try to modify the code (remove .cuda() and set device=cpu), it crashes.
I also tried the CLI (CPU only); it works but is not multimodal.

Besides, does it support load_in_8bit or 4-bit quantization like other LLaMA-based models? Thanks again!

Answers of the model are garbled

When I launched a Gradio web server, I could open my browser and chat with a model. However, the model's answers are garbled.
How can I fix this problem? There is no error information.

How to reproduce the ScienceQA evaluation score

Hi, when I'm loading the question.jsonl in model_vqa_science.py, you directly implemented it as json.load() and I'm receiving an error. However, since it's a JSON Lines file, shouldn't you do [json.loads(line) for line in file]?

Anyway, my question is: could you provide more information on how to reproduce your ScienceQA score? I encountered many problems when reproducing it. Simply specifying which file to load for each parser argument would also help a lot! The following link you provided is also a 404 error. Thanks in advance for your efforts in making your work reproducible.
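For reference, a JSON Lines file is read with one json.loads per line rather than a single json.load; a minimal sketch (the path is a placeholder for the question.jsonl mentioned above):

import json

# Each line of a .jsonl file is an independent JSON object.
with open("question.jsonl") as f:  # placeholder path
    questions = [json.loads(line) for line in f if line.strip()]

print(len(questions), "questions loaded")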


Error running with --num-gpus 2

I'm trying to run LLaVA on two RTX 4090 GPUs for inference. The model loads onto the GPUs without any issues, but an error occurs at inference time when I run the sample example from the Gradio web interface.

Here is the error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

The error seems to be caused by tensors being on different GPUs.

Environment

OS: Ubuntu
Python version: 3.10
CUDA version: 11.8
GPU model: Dual RTX 4090s

Steps to reproduce:

python -m llava.serve.controller --host 0.0.0.0 --port 10000
python3 -m llava.serve.model_worker --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /home/lukas/Desktop/models/llava --num-gpus 2 --multi-modal
python -m llava.serve.gradio_web_server --controller http://localhost:10000
Run the sample example from the Gradio web interface

OOM error and fsdp support

Thanks for your excellent work! I have tried vicuna-13b on 8xA100 (40G), but it results in an OOM error. To avoid the GPU OOM error, I added the following flags to the pretraining script:

--fsdp "full_shard auto_wrap offload" \  
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \  
--bf16 True \

I got the error:

ValueError: `FlatParameter` requires uniform `requires_grad`

Could you please provide some suggestions?

Future directions?

Really fascinating work, great job! I would be fascinated to know what future directions you're planning to explore. Are you thinking of utilizing SAM or other vision encoders in future models? Would also be interesting to see how performance is affected by the model size, such as how larger LLaMA models perform with this scheme... The dataset is phenomenal work, thank you!

offline env - error while attempting to bind on address: cannot assign requested address

Hi, I am trying to serve the model in an offline env and I have finished downloading all the weights (including clip) and finished weight conversion.

When launching the controller, I do not encounter any issue.

2023-04-22 22:25:46 | INFO | controller | args: Namespace(host='0.0.0.0', port=10000, dispatch_method='shortest_queue')
2023-04-22 22:25:46 | INFO | controller | Init controller
2023-04-22 22:25:46 | ERROR | stderr | INFO:     Started server process [20039]
2023-04-22 22:25:46 | ERROR | stderr | INFO:     Waiting for application startup.
2023-04-22 22:25:46 | ERROR | stderr | INFO:     Application startup complete.
2023-04-22 22:25:46 | ERROR | stderr | INFO:     Uvicorn running on http://0.0.0.0:10000 (Press CTRL+C to quit)

When launching the model worker, I encountered an address assignment issue:

2023-04-22 22:29:36 | INFO | model_worker | Register to controller
2023-04-22 22:29:36 | ERROR | stderr | INFO:     Started server process [20125]
2023-04-22 22:29:36 | ERROR | stderr | INFO:     Waiting for application startup.
2023-04-22 22:29:36 | ERROR | stderr | INFO:     Application startup complete.
2023-04-22 22:29:36 | ERROR | stderr | ERROR:    [Errno 99] error while attempting to bind on address ('::1', 40000, 0, 0): cannot assign requested address
2023-04-22 22:29:36 | ERROR | stderr | INFO:     Waiting for application shutdown.
2023-04-22 22:29:36 | ERROR | stderr | INFO:     Application shutdown complete.
2023-04-22 22:29:51 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0
2023-04-22 22:30:06 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0
2023-04-22 22:30:21 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0
2023-04-22 22:30:36 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0
2023-04-22 22:30:51 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0
2023-04-22 22:31:06 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0
2023-04-22 22:31:21 | INFO | model_worker | Send heart beat. Models: ['LLaVA-13B-v0']. Semaphore: None. global_counter: 0

And controller's side further shows:

2023-04-22 22:29:36 | INFO | controller | Register a new worker: http://localhost:40000
2023-04-22 22:29:36 | INFO | controller | Register done: http://localhost:40000, {'model_names': ['LLaVA-13B-v0'], 'speed': 1, 'queue_length': 0}
2023-04-22 22:29:36 | INFO | stdout | INFO:     127.0.0.1:49804 - "POST /register_worker HTTP/1.1" 200 OK
2023-04-22 22:29:51 | INFO | controller | Receive heart beat. http://localhost:40000
2023-04-22 22:29:51 | INFO | stdout | INFO:     127.0.0.1:52354 - "POST /receive_heart_beat HTTP/1.1" 200 OK
2023-04-22 22:30:06 | INFO | controller | Receive heart beat. http://localhost:40000
2023-04-22 22:30:06 | INFO | stdout | INFO:     127.0.0.1:46308 - "POST /receive_heart_beat HTTP/1.1" 200 OK
2023-04-22 22:30:21 | INFO | controller | Receive heart beat. http://localhost:40000

Do you have any clues about how to solve this issue? Thanks!

`LlamaTokenizerFast` implementation missing

satyajit@zeus:~/satyajit/llama-dl$ python3 convert_llama_weights_to_hf.py --input_dir . --model_size 13B --output_dir llama_hf
Traceback (most recent call last):
  File "convert_llama_weights_to_hf.py", line 28, in <module>
    from transformers import LlamaTokenizerFast
ImportError: cannot import name 'LlamaTokenizerFast' from 'transformers' (/home/satyajit/satyajit/transformers_llava/src/transformers/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "convert_llama_weights_to_hf.py", line 30, in <module>
    warnings.warn(e)
TypeError: expected string or bytes-like object

Question about the object detection

When encoding the image into the prompt, you mentioned captions and bounding boxes. I wonder which object detection model you utilized to generate the bounding boxes?

Is the data collection prompt for detailed description and complex reasoning released?

Many thanks for your great work! I love the idea of using text-only GPT-4 to harvest image-based multimodal data. I was checking Table 10 of your paper about the details of the conversation data collection prompt, which is really detailed. However, I found that the detailed description and complex reasoning prompts do not exist in this repo?
Many thanks if you can provide this!!

An error came up when I was applying the delta.

PS A:\LLaVA> py -3 -m llava.model.apply_delta --base A:/vicuna-minigpt/vicuna/llama-13b/LLaMA/output/13B --target A:/llava13b --delta A:/llava-delta
Loading base model
Loading checkpoint shards: 100%|██████████| 3/3 [00:17<00:00, 5.93s/it]
Loading delta
Loading checkpoint shards: 100%|██████████| 3/3 [00:20<00:00, 6.97s/it]
Some weights of the model checkpoint at A:/llava-delta were not used when initializing LlamaForCausalLM: ['model.mm_projector.weight', 'model.mm_projector.bias']

  • This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Applying delta
Applying delta: 100%|██████████| 403/403 [00:40<00:00, 9.90it/s]
Saving target model
Traceback (most recent call last):
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None, "__main__", mod_spec)
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "A:\LLaVA\llava\model\apply_delta.py", line 47, in <module>
    apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
  File "A:\LLaVA\llava\model\apply_delta.py", line 35, in apply_delta
    delta.save_pretrained(target_model_path)
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py", line 1843, in save_pretrained
    save_function(shard, os.path.join(save_directory, shard_file))
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 422, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 309, in _open_zipfile_writer
    return container(name_or_buffer)
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 287, in __init__
    super(_open_zipfile_writer_file, self).__init__(torch._C.PyTorchFileWriter(str(n...
RuntimeError: Parent directory A: does not exist.

The template used to generate the input sequence in the code is different from the one in the paper

Thank you very much for your awesome work.

I tried to run the code and noticed that the template used to generate the conversation between humans and GPT is different from the one mentioned in the paper.

According to the paper:
[template figure from the paper]
where <STOP> in Vicuna v1.0 is ###

However, according to the code,

def _add_speaker_and_signal(header, source, get_conversation=True):
    """Add speaker and start/end signal on each round."""
    BEGIN_SIGNAL = "### "
    END_SIGNAL = "\n"
    conversation = header
    for sentence in source:
        from_str = sentence["from"]
        if from_str.lower() == "human":
            from_str = conversation_lib.default_conversation.roles[0]
        elif from_str.lower() == "gpt":
            from_str = conversation_lib.default_conversation.roles[1]
        else:
            from_str = 'unknown'
        sentence["value"] = (BEGIN_SIGNAL + from_str + ": " +
                             sentence["value"] + END_SIGNAL)
        if get_conversation:
            conversation += sentence["value"]
    conversation += BEGIN_SIGNAL
    return conversation

If we follow this function, we end up with an input sequence as

system-message\n\n### Human: .... \n ### Assistant: .... \n

and according to

def _mask_targets(target, tokenized_lens, speakers):
    # cur_idx = 0
    cur_idx = tokenized_lens[0]
    tokenized_lens = tokenized_lens[1:]
    target[:cur_idx] = IGNORE_INDEX
    for tokenized_len, speaker in zip(tokenized_lens, speakers):
        if speaker == "human":
            target[cur_idx + 2:cur_idx + tokenized_len] = IGNORE_INDEX
        cur_idx += tokenized_len

only the human messages and the system message are masked, which means that after tokenizing the conversation, we get

system-message \n\n ### Human: … \n ### Assistant: … \n

and only the tokens inside 【】 are masked:

【system-message \n】\n ### 【Human: … 】\n ### Assistant: … \n

Is this a bug or just a trick where the implemented template works better than the one mentioned in the paper?

AttributeError: 'LlamaConfig' object has no attribute 'mm_vision_tower'

log:

(llava) D:\LLaVA>python -m llava.serve.model_worker --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/LLaVA-13B-v0 --multi-modal
2023-04-22 11:15:42 | INFO | model_worker | args: Namespace(host='localhost', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='./checkpoints/LLaVA-13B-v0', model_name=None, multi_modal=True, keep_aspect_ratio=False, num_gpus=1, limit_model_concurrency=5, stream_interval=2, no_register=False)
2023-04-22 11:15:42 | INFO | model_worker | Loading the model LLaVA-13B-v0 on worker 56bbc2 ...
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:02<00:05, 2.57s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:05<00:02, 2.56s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.17s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.28s/it]
2023-04-22 11:15:49 | ERROR | stderr |
2023-04-22 11:15:49 | ERROR | stderr | Traceback (most recent call last):
2023-04-22 11:15:49 | ERROR | stderr |   File "C:\Users\xht\anaconda3\envs\llava\lib\runpy.py", line 196, in _run_module_as_main
2023-04-22 11:15:49 | ERROR | stderr |     return _run_code(code, main_globals, None,
2023-04-22 11:15:49 | ERROR | stderr |   File "C:\Users\xht\anaconda3\envs\llava\lib\runpy.py", line 86, in _run_code
2023-04-22 11:15:49 | ERROR | stderr |     exec(code, run_globals)
2023-04-22 11:15:49 | ERROR | stderr |   File "D:\LLaVA\llava\serve\model_worker.py", line 361, in <module>
2023-04-22 11:15:49 | ERROR | stderr |     worker = ModelWorker(args.controller_address,
2023-04-22 11:15:49 | ERROR | stderr |   File "D:\LLaVA\llava\serve\model_worker.py", line 118, in __init__
2023-04-22 11:15:49 | ERROR | stderr |     self.tokenizer, self.model, self.image_processor, self.context_len = load_model(
2023-04-22 11:15:49 | ERROR | stderr |   File "D:\LLaVA\llava\serve\model_worker.py", line 65, in load_model
2023-04-22 11:15:49 | ERROR | stderr |     image_processor = CLIPImageProcessor.from_pretrained(model.config.mm_vision_tower, torch_dtype=torch.float16)
2023-04-22 11:15:49 | ERROR | stderr |   File "D:\LLaVA\transformers\src\transformers\configuration_utils.py", line 260, in __getattribute__
2023-04-22 11:15:49 | ERROR | stderr |     return super().__getattribute__(key)
2023-04-22 11:15:49 | ERROR | stderr | AttributeError: 'LlamaConfig' object has no attribute 'mm_vision_tower'

Why not compare with FLAN-T5-Large?

What considerations preclude a comparison against the FLAN-T5-Large model? As shown in Table 7 of the MM-CoT paper (arXiv:2302.00923), FLAN-T5-Large achieves ~93% on ScienceQA.

Is it because the experimental settings are different? By the way, FLAN-T5-Large (~783M) has far fewer parameters than LLaMA / LLaVA (~13B).

Thank you.

A question about training cost

Thanks for your awesome work! Would you mind mentioning the training cost of LLaVA (including both the GPU hours and cost of API)? Thanks in advance :)

I made it work on a single 3090

There is no discussion tab, so I'm opening this as an issue.
I made it work on a single 3090, in ooba's webui; see this PR for more info: oobabooga/text-generation-webui#1487.
There is even a small possibility that it will run on 12GB GPUs, as 4-bit Vicuna-13B fits; the question is whether it fits together with CLIP + the projector.

`setup.py` missing

satyajit@zeus:~/satyajit/LLaVA$ pip install -e .
ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /home/satyajit/satyajit/LLaVA
(A "pyproject.toml" file was found, but editable mode currently requires a setup.py based build.)

where is the llava_instruct_150k.json

Hi, in download_data.sh we can only download complex_reasoning_77k.json, conversation_58k.json, and detail_23k.json. I wonder if we need to concatenate them into llava_instruct_150k.json?

python inference demo

Do I have to use a browser for the demo when running the model locally?
Is there a Python demo that directly feeds images and text to the model?

Error when install flash-attn

When I run pip install flash-attn, it raises an error:
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

However, I have run pip install -e . and successfully installed llava. Do you know how to solve this problem?

The actual training sequence

I found the conversation here

LLaVA/llava/conversation.py

Lines 165 to 194 in 6520ad9

conv_v1_2 = Conversation(
    system="A chat between a curious human and an artificial intelligence assistant. "
           "The assistant gives helpful, detailed, and polite answers to the human's questions.",
    roles=("Human", "Assistant"),
    messages=(
        ("Human", "What are the key differences between renewable and non-renewable energy sources?"),
        ("Assistant",
         "Renewable energy sources are those that can be replenished naturally in a relatively "
         "short amount of time, such as solar, wind, hydro, geothermal, and biomass. "
         "Non-renewable energy sources, on the other hand, are finite and will eventually be "
         "depleted, such as coal, oil, and natural gas. Here are some key differences between "
         "renewable and non-renewable energy sources:\n"
         "1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable "
         "energy sources are finite and will eventually run out.\n"
         "2. Environmental impact: Renewable energy sources have a much lower environmental impact "
         "than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, "
         "and other negative effects.\n"
         "3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically "
         "have lower operational costs than non-renewable sources.\n"
         "4. Reliability: Renewable energy sources are often more reliable and can be used in more remote "
         "locations than non-renewable sources.\n"
         "5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different "
         "situations and needs, while non-renewable sources are more rigid and inflexible.\n"
         "6. Sustainability: Renewable energy sources are more sustainable over the long term, while "
         "non-renewable sources are not, and their depletion can lead to economic and social instability.\n")
    ),
    offset=2,
    sep_style=SeparatorStyle.SINGLE,
    sep="###",
)

Given the code here

if self.sep_style == SeparatorStyle.SINGLE:
    ret = self.system + self.sep
    for role, message in self.messages:
        if message:
            if type(message) is tuple:
                message, _ = message
            ret += role + ": " + message + self.sep
        else:
            ret += role + ":"
    return ret

This seems to tell me that the actual training sequence is rather

f'{system}###Human: {instruct}###Assistant: {completion}###'

There doesn't seem to be any additional \n as mentioned in Table 2 of the paper.

Possible to use int8 during pretraining stage?

Since the LLaMA model is frozen during pretraining, I was wondering if it was possible to run the model with int8 to reduce the vram used during training.

In my attempts to do so, I successfully started training with llama as int8 (by just adding the usual load_in_8bit=True, device_map=... combo), but observed that the loss would always collapse to 0.0 immediately. I also noticed the same issue when I simply loaded the llama model as fp16, rather than as bf16.

RuntimeError: The size of tensor a (32000) must match the size of tensor b (32003) at non-singleton dimension 0

PS A:\vicuna-minigpt\FastChat> python -m fastchat.model.apply_delta --base A:/vicuna-minigpt/vicuna/llama-13b/LLaMA/output/13B --target A:/llava-minigpt --delta A:/llava-delta
Loading the base model from A:/vicuna-minigpt/vicuna/llama-13b/LLaMA/output/13B
Loading checkpoint shards: 100%|██████████| 3/3 [00:28<00:00, 9.37s/it]
Loading the delta from A:/llava-delta
Loading checkpoint shards: 100%|██████████| 3/3 [00:35<00:00, 11.80s/it]
Some weights of the model checkpoint at A:/llava-delta were not used when initializing LlamaForCausalLM: ['model.mm_projector.weight', 'model.mm_projector.bias']

  • This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Applying the delta
Applying delta: 0%| | 0/403 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None, "__main__", mod_spec)
  File "C:\Users\Ge Yunxiang\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "A:\vicuna-minigpt\FastChat\fastchat\model\apply_delta.py", line 165, in <module>
    apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
  File "A:\vicuna-minigpt\FastChat\fastchat\model\apply_delta.py", line 140, in apply_delta
    param.data += delta.state_dict()[name]
RuntimeError: The size of tensor a (32000) must match the size of tensor b (32003) at non-singleton dimension 0
