nvlabs / vila Goto Github PK

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)

License: Apache License 2.0

Shell 6.98% Python 93.02%

vila's Introduction

VILA: On Pre-training for Visual Language Models

VILA arxiv / VILA Demo / VILA Huggingface

💡 Introduction

VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling video understanding and multi-image understanding capabilities. VILA is deployable on the edge by AWQ 4bit quantization and TinyChat framework. We find: (1) image-text pairs are not enough, interleaved image-text is essential; (2) unfreezing LLM during interleaved image-text pre-training enables in-context learning; (3)re-blending text-only instruction data is crucial to boost both VLM and text-only performance; (4) token compression extends #video frames. VILA unveils appealing capabilities, including: video reasoning, in-context learning, visual chain-of-thought, and better world knowledge.

💡 News

[2024/07] VILA1.5 also ranks 1st place (OSS model) on MLVU test leaderboard.
[2024/06] VILA1.5 is now the best open sourced VLM on MMMU leaderboard and Video-MME leaderboard!
[2024/05] We release VILA-1.5, which offers video understanding capability. VILA-1.5 comes with four model sizes: 3B/8B/13B/40B.
[2024/05] We release AWQ-quantized 4bit VILA-1.5 models. VILA-1.5 is efficiently deployable on diverse NVIDIA GPUs (A100, 4090, 4070 Laptop, Orin, Orin Nano) by TinyChat and TensorRT-LLM backends.
[2024/03] VILA has been accepted by CVPR 2024!
[2024/02] We release AWQ-quantized 4bit VILA models, deployable on Jetson Orin and laptops through TinyChat and TinyChatEngine.
[2024/02] VILA is released. We propose interleaved image-text pretraining that enables multi-image VLM. VILA comes with impressive in-context learning capabilities. We open source everything: including training code, evaluation code, datasets, model ckpts.
[2023/12] Paper is on Arxiv!

Performance

Image QA Benchmarks

$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$	Prec.	VQAv2	GQA	VizWiz	SQA-I	VQA-T	POPE	MME	MMB	MMB-CN	SEED	SEED-I	MMMU (val)	MMMU (test)	llava-bench	MM-Vet	Average
VILA1.5-3B	fp16	80.4	61.5	53.5	69.0	60.4	85.9	1442.44	63.4	52.7	60.9	67.9	33.3	30.8	75.9	35.4	60.2
VILA1.5-3B-AWQ	int4	80.0	61.1	53.8	67.8	60.4	85.9	1437.34	63.3	51.4	59.8	66.6	32.7	31.1	75.0	37.3	59.9
VILA1.5-3B-S2	fp16	79.8	61.4	61.3	69.6	63.4	85.3	1431.65	62.8	52.2	60.0	66.4	32.8	31.3	76.7	38.6	60.9
VILA1.5-3B-S2-AWQ	int4	79.4	61.3	62.3	69.2	63.0	85.8	1417.06	61.6	51.5	59.1	65.7	33.4	30.4	77.1	36.7	60.5
Llama-3-VILA1.5-8B	fp16	80.9	61.9	58.7	79.9	66.3	84.4	1577.01	72.3	66.2	64.2	71.4	36.9	36.0	80.0	38.3	65.1
Llama-3-VILA1.5-8B-AWQ	int4	80.3	61.7	59.3	79.0	65.4	82.9	1593.65	71.0	64.9	64.0	71.1	36.0	36.1	79.0	37.2	64.5
VILA1.5-13B	fp16	82.8	64.3	62.6	80.1	65.0	86.3	1569.55	74.9	66.3	65.1	72.6	37.9	33.6	80.8	44.3	66.3
VILA1.5-13B-AWQ	int4	82.7	64.5	63.3	79.7	64.7	86.7	1531.35	74.7	66.7	65.1	72.6	37.8	34.0	81.9	46.4	66.5
VILA1.5-40B	fp16	84.3	64.6	62.2	87.2	73.6	87.3	1726.82	82.4	80.2	69.1	75.8	51.9	46.9	81.3	53.0	72.4
VILA1.5-40B-AWQ	int4	84.1	64.4	61.3	86.7	73.2	88.2	1714.79	83.2	79.6	68.9	75.6	49.3	46.2	83.0	51.4	72.1

^{NOTE: VQAV2 and VizWiz are test-dev, the average accuracy is calculated over all datasets and MME numbers are divided by 20.}

Video QA Benchmarks

$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$	Prec.	Perception Test	ActivityNet	MSVD	MSRVTT	TGIF
VILA1.5-3B	fp16	47	50.2	76.6	57.5	51.7
VILA1.5-3B-S2	fp16	49.7	50.7	76.9	57.6	51.7
Llama-3-VILA1.5-8B	fp16	54.1	54.3	78.3	60.1	54.1
VILA1.5-13B	fp16	53.6	54.7	77.9	60.2	56
VILA1.5-40B	fp16	54	58	80.1	63	58.2

Inference speed ( Token/sec )

$~~~~~~$	Precision	A100	4090	Orin
VILA1.5-3B	fp16	104.6	137.6	25.4
VILA1.5-3B-AWQ	int4	182.8	215.5	42.5
VILA1.5-3B-S2	fp16	104.3	137.2	24.6
VILA1.5-3B-S2-AWQ	int4	180.2	219.3	40.1
Llama-3-VILA1.5-8B	fp16	74.9	57.4	10.2
Llama-3-VILA1.5-8B-AWQ	int4	168.9	150.2	28.7
VILA1.5-13B	fp16	50.9	OOM	6.1
VILA1.5-13B-AWQ	int4	115.9	105.7	20.6
VILA1.5-40B	fp16	OOM	OOM	--
VILA1.5-40B-AWQ	int4	57.0	OOM	--

^{NOTE: Measured using the TinyChat backend at batch size = 1.}

VILA Examples

Video captioning

7ko9e-AGmbM.12_0_217_out.mp4

Prompt: Elaborate on the visual and narrative elements of the video in detail.

Caption: The video shows a person's hands working on a white surface. They are folding a piece of fabric with a checkered pattern in shades of blue and white. The fabric is being folded into a smaller, more compact shape. The person's fingernails are painted red, and they are wearing a black and red garment. There are also a ruler and a pencil on the surface, suggesting that measurements and precision are involved in the process.

In context learning

Multi-image reasoning

VILA on Jetson Orin

VILA-13B_Orin_deer.mp4.mp4

VILA on RTX 4090

vila_4090_two_cars_3x.mp4

Installation

./environment_setup.sh

or follow the instructions below in order.

conda create -n vila python=3.10 -y # make sure you install python 3.10
conda activate vila

pip install --upgrade pip  # enable PEP 660 support
# this is optional if you prefer to system built-in nvcc.
conda install -c nvidia cuda-toolkit -y
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"

pip install git+https://github.com/huggingface/[email protected]
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/

Training

VILA training contains three steps, for specific hyperparameters, please check out the scripts/v1_5 folder:

Step-1: Alignment

We utilize LLaVA-CC3M-Pretrain-595K dataset to align the textual and visual modalities.

The stage 1 script takes in two parameters and it can run on a single 8xA100 node. BASE_MODEL_PATH points to an online or local huggingface repository, such as NousResearch/Llama-2-7b-hf. OUTPUT_NAME points to a target directory under checkpoints, which will save the trained multimodal projector afterwards.

bash scripts/v1_5/paper/1_mm_align.sh [BASE_MODEL_PATH] [OUTPUT_NAME]

Step-2: Pretraining

We use MMC4 and Coyo dataset to train VLM with interleaved image-text pairs.

bash scripts/v1_5/paper/2_pretrain_mmc4_coyo.sh [CODE_PATH] [BASE_MODEL_PATH] [STAGE1_PATH] [OUTPUT_NAME]

The stage 2 script takes in four arguments. CODE_PATH is the absolute path to our VILA codebase, BASE_MODEL_PATH has similar meaning to what is presented in the stage 1 script. STAGE1_PATH points to the OUTPUT_NAME of stage 1 (i.e. where the stage 1 checkpoint is stored). OUTPUT_NAME is the desired folder name under checkpoints that saves the pretraining checkpoint. The script we provided for this stage is executed on slurm, and we expect it to execute on 16 nodes (128 GPUs).

Step-3: Supervised fine-tuning

This is the last stage of VILA training, in which we tune the model to follow multimodal instructions on a subset of M3IT, FLAN and ShareGPT4V. This stage runs on a 8xA100 node.

bash scripts/v1_5/paper/3_sft.sh [STAGE2_PATH] [OUTPUT_NAME]

The stage 3 script takes in two arguments. STAGE2_PATH points to the OUTPUT_NAME of the stage 2 script (i.e. where the stage 2 checkpoint is stored). OUTPUT_NAME is the desired folder name under checkpoints that stores the final checkpoint.

Evaluations

Image Benchmarks

You can follow Llava1.5 eval to download all datasets. After downloading all datasets, please put them under playground/data/eval.

Please make the following changes to the MME evaluation script. Please search for:

data_path='MME_Benchmark_release_version'

and replace it with:

data_path=os.path.join(script_dir, 'MME_Benchmark_release_version')

We provide a push-the-button script to perform evaluation on all 10 datasets that do not require GPT-assisted evaluation:

./scripts/v1_5/eval/eval_all.sh [CHECKPOINT_PATH] [MODEL_NAME] [CONV_MODE]

This script takes in two parameters, CHECKPOINT_PATH points to the stage 3 model checkpoint, and MODEL_NAME will be the name of evaluation results.

VQAv2 and Vizwiz evaluations are hosted on eval.ai. You need to register an account and create a team to be able to submit eval.

MMBench and MMBench_CN eval are hosted on another evaluation server. Make sure you change the name of the file before submitting, otherwise the server caches results and will always return wrong result to you.

We provide a quick script to automatically organize the prediction files that need to be submitted to servers:

python scripts/v1_5/eval/copy_predictions.py [MODEL_NAME]

You will be able to find the predictions under playground/data/predictions_upload/[MODEL_NAME] after executing this script.

Video Benchmarks

Please follow the evaluation steps in Video-LLaVA for dataset preparation.

./scripts/v1_5/eval/video_chatgpt/run_all.sh [CHECKPOINT_PATH] [MODEL_NAME] [CONV_MODE]
./scripts/v1_5/eval/video_chatgpt/eval_all.sh [MODEL_NAME]

Inference

We provide snippets for quick inference with user prompts and images.

Llama-3-VILA1.5-8B inference:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n Please describe the traffic condition." \
    --image-file "av.png"

VILA1.5-40B inference:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-40b \
    --conv-mode hermes-2 \
    --query "<image>\n Please describe the traffic condition." \
    --image-file "av.png"

VILA1.5-3B video inference:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-3b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Please describe this video." \
    --video-file "demo.mp4"

Quantization and Deployment

Our VILA models are quantized by AWQ into 4 bits for efficient inference on the edge. We provide a push-the-button script to quantize VILA with AWQ.

Running VILA on GPUs and edge GPUs (Jetson Orin)

We support AWQ-quantized 4bit VILA on GPU platforms via TinyChat. We provide a tutorial to run the model with TinyChat after AWQ quantization. We also provide an instruction to launch a Gradio server (powered by TinyChat and AWQ) to serve 4-bit quantized VILA models.

Running VILA on laptops

We further support our AWQ-quantized 4bit VILA models on various CPU platforms with both x86 and ARM architectures with our TinyChatEngine. We also provide a detailed tutorial to help the users deploy VILA on different CPUs.

Checkpoints

We release VILA1.5-3B, VILA1.5-3B-S2, Llama-3-VILA1.5-8B, VILA1.5-13B, VILA1.5-40B and the 4-bit AWQ-quantized models VILA1.5-3B-AWQ, VILA1.5-3B-S2-AWQ, Llama-3-VILA1.5-8B-AWQ, VILA1.5-13B-AWQ, VILA1.5-40B-AWQ.

🔒 License

The code is released under the Apache 2.0 license as found in the LICENSE file.
The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- Model License of LLaMA. For LLAMA3-VILA checkpoints terms of use, please refer to the LLAMA3 License for additional details.
- Terms of Use of the data generated by OpenAI
- Dataset Licenses for each one used during training.

Team


*Yao Lu: Nvidia	*Hongxu Yin: Nvidia	*Ji Lin: OpenAI (work done at Nvidia and MIT)
Wei Ping: Nvidia	Pavlo Molchanov: Nvidia	Andrew Tao: Nvidia
Haotian Tang: MIT	Shang Yang: MIT	Ligeng Zhu: Nvidia, MIT
Wei-Chen Wang: MIT	Fuzhao Xue: Nvidia, NUS	Yunhao Fang: Nvidia, UCSD
Yukang Chen: Nvidia, CUHK	Zhuoyang Zhang: Nvidia, Tsinghua Univ.	Yue Shen: Nvidia
Wei-Ming Chen: Nvidia	Huizi Mao: Nvidia	Baifeng Shi: Nvidia, UC Berkeley
Jan Kautz: Nvidia	Mohammad Shoeybi: Nvidia	Song Han: Nvidia, MIT

Citations

@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

LLaVA: the codebase we built upon. Thanks for their wonderful work.
InternVL: for open-sourcing InternViT (used in VILA1.5-40b) and the InternVL-SFT data blend (inspired by LLaVA-1.6) used in all VILA1.5 models.
Vicuna: the amazing open-sourced large language model!
Video-ChatGPT: we borrowed video evaluation script from this repository.
MMC4, COYO-700M, M3IT, OpenORCA/FLAN, ShareGPT4V, WIT, GSM8K-ScRel, VisualGenome, VCR, ScienceQA, Shot2Story, Youcook2, Vatex, ShareGPT-Video for providing datasets used in this research.

vila's People

Contributors

Stargazers

Watchers

Forkers

rkuo2000 lokshaw-chau zeyuanyin liubo0902 xuanlinli17 cto-algo jjhw josephrp techthiyanes lukaemon chrisguarino hmxiong yuvalsigura xuefuzhao air23zj nhsjgczryf aoyuqc sprinter1999 yukang2017 wangjingbo1219 autogyro evdcush abdoiiii shahinsharifi alexandor91 mencelot phoenixdigitalfx emmanuelezenwere seancraven314 xiaozhiob michael-heinrich souxun2015 hongyunqiu jaesung8 substratelabs manfar kentang-mit chomolungma evelynmitchell kaidduong zgimszhd61 nickhamster ukaserge vioneta cabelo ssarswat nawsh1337 yasserdahouml ooil929 zkbig gaiadilorenzo poorfrombabylon zuo1188 i-robot-2024 xiuyu-li eltociear hongxuyin davidsuv studygiraffe yiming992 conglesolutionx cxmscb je1lee sjtuwxz paperwave tongzhoumu zijiandu zackbradshaw matteonulli asierarranz jaegookyou tedfytw1209 orrzohar isaacrc r-snijders futuruka zanqi liyi14 pablosls windismyheart lenoardshannon zzxslp aicads fuxiaoliu

vila's Issues

No module named 'llava.tf_utils'

I run this demo script

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-3b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Please describe this video." \
    --video-file "demo.mp4"

got the following error:

ModuleNotFoundError: No module named 'llava.tf_utils'

unexpected keyword argument 'seqlens_in_batch'

I try

!python3 -W ignore llava/eval/run_llava.py \
    --model-path Efficient-Large-Model/VILA-7B \
    --conv-mode vicuna_v1 \
    --query "<image>\n Please describe the traffic condition." \
    --image-file "demo_images/av.png"

But got an error.

Traceback (most recent call last):
  File "/home/katopz/book/examples/ml/infer/VILA/llava/eval/run_llava.py", line 160, in <module>
    eval_model(args)
  File "/home/katopz/book/examples/ml/infer/VILA/llava/eval/run_llava.py", line 118, in eval_model
    output_ids = model.generate(
  File "/home/katopz/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/katopz/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/home/katopz/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/home/katopz/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/katopz/book/examples/ml/infer/VILA/llava/model/language_model/llava_llama.py", line 125, in forward
    outputs = super().forward(
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'seqlens_in_batch'

FYI: trsnsformer=4.36.2

Not sure what I miss, Thanks

When will new annotations files be available?

In new v1.5 version of https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/README.md
there are links to new dataset annotation files such as
huggingface-cli download mit-han-lab/vila-dataset youcook_filtered_v3.json --repo-type dataset --local-dir youcook2 --local-dir-use-symlinks False
which are not publicly accessible at this time. Any guess as to if or when it will be possible to retrieve files from the mit-han-lab/vila-dataset repository?

Potential bug in mm_utils.py process_image function

When data_args.image_aspect_ratio = 'resize', it seems that mm_utils.process_image returns the image as a PIL.Image.Image data type, which has no shape attribute. See https://github.com/Efficient-Large-Model/VILA/blob/main/llava/mm_utils.py#L168

When doing stage 1 alignment training, we use the datasets.LazySupervisedDataset class, whose get_item function tries to call image.shape here: https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/dataset.py#L834

This crashes the training. So should we simply add the line
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
below line 168 of mm_utils.py: https://github.com/Efficient-Large-Model/VILA/blob/main/llava/mm_utils.py#L168 ?

LoRA for downstream task tuning

Hi, Lyken17 , thank you for sharing wonderful works,
I would like to customize the VILA for downstream task tuning and find the LoRA args in https://github.com/Efficient-Large-Model/VILA/blob/90eda3dc4a9fad021a7adae59cd1a3f8bce78b23/llava/train/args.py#L75
Can I use lora by adding in scripts:
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \

How to evaluate 4shot?

'is_gemma_tokenizer' cannot be imported in llava.mm_utils

When eval on vqav2, 'is_gemma_tokenizer' can not be found. Is there some codes not submitted?

Provide ShareGPT4V filtered annotations file

In datasets_mixture.py there is references a .json file that is not entirely clear where it came from based on the name: https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/datasets_mixture.py#L62

Is this file the same as https://huggingface.co/datasets/mit-han-lab/ShareGPT4V/blob/main/filter-share-captioner_coco_lcs_sam_1246k_1107.json?

if not, can you provide this file or some description of how it was generated?

KeyError: 'llava_llama'

When I ran vila's demo, I ran into the bug below：

How's the DownSampleBlock performance compare with CAbstractor?

Easy backwards compatibility fix

Your version of transformers forces LlamaFlashAttention2 in the constructor of LlamaDecoderLayer in transformers/models/llama/modeling_llama.py which requires Ampere or newer to work. Just by using the old LlamaAttention class instead of LlamaFlashAttention2 here, I could make the video inference demo run on an ancient GTX1060 (even if it's very slow).
The current main branch of transformers uses a mechanism to decide which is the best compatible attention for this purpose.
If you don't want to backport that, you could use a very simple logic to decide which class to use here. Something like this:

def is_at_least_ampere():
    if torch.cuda.is_available():
        num_of_gpus = torch.cuda.device_count()

        # Loop over each GPU
        for i in range(num_of_gpus):
            gpu_properties = torch.cuda.get_device_properties(i)

            # Compute capability is major.minor version format
            # Convert it to a float for comparison
            compute_capability = float(f"{gpu_properties.major}.{gpu_properties.minor}")

            # If compute capability is less than 8.0 (Ampere or newer), return False
            if compute_capability < 8.0:
                return False

        # If all GPUs are Ampere or newer, return True
        return True
    else:
        # If CUDA is not available, return False
        return False

class LlamaDecoderLayer(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        ampere_or_newer = is_at_least_ampere()
        self.self_attn = (
            LlamaFlashAttention2(config=config) if ampere_or_newer else LlamaAttention(config=config)
            # LlamaAttention(config=config)
            # LlamaFlashAttention2(config=config)
        )
        self.mlp = LlamaMLP(config)

Chamfer distance's data source

in the paper “ VILA: On Pre-training for Visual Language Models” 's "The deep embedding alignment hypothesis." part , the Chamfer distance is interesting and useful. And I want to konw how it is calculate ? and what is the image source and the text source? Thank you very mcuh!

Multi-image is worse than concat them as single image.

Hello, I have tried your code and pretrained models, its a very excellent work.

But I meet a issue about multi-image task.

My input single image width: height = 1 : 2，concating these two images, then concated image width: height = 1 : 1. The Acc is 95% by this way.

When input two images(width: height = 1 : 2) to the VILA, the Acc drop dramatically to 85%.

These two ways of input, all using data fine-tuning the pretrained model.

"No module named llava"

I installed this codebase using the setup script. When running the command:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n Please describe the traffic condition." \
    --image-file "av.png"

I get the error:

Traceback (most recent call last):
  File "/my/homedir/VILA/llava/eval/run_vila.py", line 12, in <module>
    from llava.constants import (DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN,
ModuleNotFoundError: No module named 'llava'

What could be causing this?

What's the purpose of func repack_multimodal_data?

About perception testset

Hello authors,
Thanks for sharing fantastic jobs. Now I would like to ask where this dataset came from, can you share a link or data?
"/lustre/fsw/portfolios/nvr/projects/nvr_elm_llm/dataset/video_datasets_v2/perception_test/"

Inference not working - Keyword tensor should have 2 or 3 dimensions, got 1

I get the following error while running llava/eval/run_vila.py on a H100 gpu:

root@7513903dd8b0:/src/VILA# python -W ignore llava/eval/run_vila.py     --model-path Efficient-Large-Model/VILA1.5-3b     --conv-mode vicuna_v1     --query "<video>\n Please describe this video."     --video-file "tjx1PPFsa6A-Scene-049.mp4"
Fetching 17 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 203142.93it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.47it/s]
no <image> tag found in input. Automatically append one at the beginning of text.
input:  <image>
<image>
<image>
<image>
<image>
<image>
<video>\n Please describe this video.
[WARNING] the auto inferred conversation mode is llava_v0, while `--conv-mode` is vicuna_v1, using vicuna_v1
torch.Size([6, 3, 384, 384])
Traceback (most recent call last):
  File "/src/VILA/llava/eval/run_vila.py", line 154, in <module>
    eval_model(args)
  File "/src/VILA/llava/eval/run_vila.py", line 116, in eval_model
    output_ids = model.generate(
                 ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/model/language_model/llava_llama.py", line 171, in generate
    outputs = self.llm.generate(
              ^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/utils.py", line 2924, in sample
    if stopping_criteria(input_ids, scores):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/stopping_criteria.py", line 132, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/generation/stopping_criteria.py", line 132, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/mm_utils.py", line 298, in __call__
    outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/mm_utils.py", line 279, in call_for_batch
    raise ValueError(
ValueError: Keyword tensor should have 2 or 3 dimensions, got 1

torch version is 2.0.1+cu118 and flash attention 2.4.2

What're the modifications in `llava/train/transformers_replace`?

Hi, thanks for the nice work! I wonder what are the main modifications in llava/train/transformers_replace compared to the original implementation in transformers==4.31.0, as specified in the pyproject.toml. Also, in environment_setup.sh, transformers==4.36.2 is installed:

pip install git+https://github.com/huggingface/[email protected]

I wonder why we want to install different versions of transformers?

If I want to use a higher version of transformers, e.g. 4.38, are there changes needed for the files in this folder? Many thanks!

Can't run inference demo

Thank you for this wonderful work, however when i tried to run the inference demo i got run_llava.py: error: unrecognized arguments: --model-name Efficient-Large-Model/VILA-13B

so i changed --model-name to --model-path, but i got another error

[WARNING] the auto inferred conversation mode is llava_v0, while `--conv-mode` is vicuna_v1, using vicuna_v1
Traceback (most recent call last):
  File "/VILA/llava/eval/run_llava.py", line 159, in <module>
    eval_model(args)
  File "/VILA/llava/eval/run_llava.py", line 117, in eval_model
    output_ids = model.generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/VILA/llava/model/language_model/llava_llama.py", line 86, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/VILA/llava/model/llava_arch.py", line 202, in prepare_inputs_labels_for_multimodal
    cur_image_features = image_features[cur_image_idx]
IndexError: index 1 is out of bounds for dimension 0 with size 1

Updated paper on the latest model (video understanding, etc.)

Congrats on adding support for video understanding to VILA, looks super cool!

Just curious, is there an updated or new paper with more technical details on how improved video understanding was added to the VILA model?

Thanks!

Request for middle checkpoint

Thank you for the amazing release!

Do you plan to release the checkpoints from different stages, e.g., checkpoint before SFT? These checkpoints would be valuable for further fine-tuning.

Hi, Have you compare with s2 [384, 768] scales versus interpolate to 768x768?

The way you using actually feed 5 images into vit,

how's it compare with interpolate to 768x768 which equal to send 4 images into vit but with different manner?

AWQ Tinychat tensor mismatch RuntimeError

Hi everyone,

When running VILA with Tinychat I get the following error, across different transformers, torch and CUDA versions. Tinychat works well with LlaMA and LLaVA for me, so the problem might be on the VILA side. Running the FP16 VILA models within this repo works just fine. Any ideas on this?

ASSISTANT: Traceback (most recent call last): File "/opt/llm-awq/tinychat/vlm_demo.py", line 233, in <module> main(args) File "/opt/llm-awq/tinychat/vlm_demo.py", line 179, in main outputs = stream_output(output_stream, time_stats) File "/opt/llm-awq/tinychat/utils/conversation_utils.py", line 83, in stream_output for outputs in output_stream: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context response = gen.send(None) File "/opt/llm-awq/tinychat/stream_generators/llava_stream_gen.py", line 177, in LlavaStreamGenerator out = model( File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/opt/llm-awq/tinychat/models/llava_llama.py", line 261, in forward out = super().forward( File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/opt/llm-awq/tinychat/models/llama.py", line 318, in forward h = self.model(tokens, start_pos, inputs_embeds) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/opt/llm-awq/tinychat/models/llama.py", line 302, in forward h = layer(h, start_pos, freqs_cis, mask) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/opt/llm-awq/tinychat/models/llama.py", line 257, in forward h = x + self.self_attn.forward( File "/opt/llm-awq/tinychat/modules/fused_attn.py", line 259, in forward self.cache_v[:bsz, :, start_pos : start_pos + seqlen, :] = values_store RuntimeError: The expanded size of the tensor (2048) must match the existing size (2353) at non-singleton dimension 2. Target sizes: [1, 32, 2048, 128]. Tensor sizes: [32, 2353, 128]

upload demo videos

vila_4090_two_cars_3x.mp4

VILA-13B_Orin_deer.mp4.mp4

Cannot correctly recognize <im_patch>

When running codes on jetson, I found that the tokenizer (tokenizer = AutoTokenizer.from_pretrained(args.model_path, use_fast=False)) cannot correctly convert the LLAVA_DEFAULT_IMAGE_PATCH_TOKEN, i.e., <im_patch>, into the index in the vocabulary. That is, the tokenizer cannot recognize the LLAVA_DEFAULT_IMAGE_PATCH_TOKEN as a special token.

The environment is built following the instructions in the AWQ project (https://github.com/mit-han-lab/llm-awq), while the version of transformers is 4.36.2.

Did I do something wrong?

License

Hi,
Thanks so much for releasing VILA! Was looking for a ~3B model to fine tune for a while and this looks like a great fit! Do you know if there are any plans to switch the model licenses to a more permissive one? (And does the license apply to fine tunes as well?) Wanted to release my fine tune under an open source license.
Thanks!

Model checkpoints before supervised fine-tuning

Thank you for your fantastic work.

Is there any plan to release the pre-trained model checkpoints before SFT for both VILA7B and VILA13B? It would be helpful to evaluate its performance at some other tasks.

I appreciate your help and look forward to hearing back from you.

Having trouble running multi-image input inference.

Thank you for your excellent work. I encountered a problem when running multi-image input inference locally. After briefly looking at your code, I modified the inference command as follows:
python -W ignore llava/eval/run_llava.py \ --model-path /path/workspace/VILA_13B \ --conv-mode llava_llama_2 \ --image-file "/cfs-allenxmzhu/workspace/VILA/demo_images/av.png,/cfs-allenxmzhu/workspace/VILA/inference_test/test_data/flying_chair.png" \ --query "<image-placeholder> <image-placeholder> Please describe the two images separately."

During testing, I found that the model often does not output results, but only outputs a newline character (or possibly a space).
Is this command correct ?
Also, I currently do not have tinychat library installed, and I am not considering the convenience brought by 4-bit quantization for now. Would not using tinychat library cause the problem I encountered?

Index error when conversations is short. (/aten/src/ATen/native/cuda/IndexKernel.cu:)

Hi, when fine-tuning on downstream text-only conversations, the average length of conversations is short, around 20 words, an index error will appear.

I double checked with a toy json like below (mix_short.json), and get error message
"
/aten/src/ATen/native/cuda/IndexKernel.cu ....
...
modeling_llama.py", line 60, in _get_unpad_data
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
RuntimeError: CUDA error: device-side assert triggered
"

When extending number of words in a conversations (mix_long.json), the error disappears. Is there threshold for the minimum number of words required for a conversation?
mix_long.json
mix_short.json

Missing deepspeed config files in training scripts

Hi,

I am trying to reproduce the results. However, both the zero2 and zero3 config files are missing in the project. Can you please upload these files to the scripts folder. Thank you!

Running the AWQ models

Is it possible to run the AWQ models using the run_vila.py script?

I ran the following command:

python -W ignore llava/eval/run_vila.py     \
  --model-path Efficient-Large-Model/VILA1.5-3b-AWQ \      
  --conv-mode vicuna_v1  \   
  --query "<video>\n Describe this video"   \  
  --video-file "tjx1PPFsa6A-Scene-049.mp4"

and got this error:

Traceback (most recent call last):
  File "/src/VILA/llava/eval/run_vila.py", line 160, in <module>
    eval_model(args)
  File "/src/VILA/llava/eval/run_vila.py", line 64, in eval_model
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, model_name, args.model_base)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/model/builder.py", line 177, in load_pretrained_model
    model = LlavaLlamaModel(
            ^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/model/language_model/llava_llama.py", line 53, in __init__
    return self.init_vlm(config=config, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/model/llava_arch.py", line 76, in init_vlm
    self.llm, self.tokenizer = build_llm_and_tokenizer(llm_cfg, config, *args, **kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/VILA/llava/model/language_model/builder.py", line 77, in build_llm_and_tokenizer
    llm = AutoModelForCausalLM.from_pretrained(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4091, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/transformers/modeling_utils.py", line 503, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: No such file or directory: "/root/.cache/huggingface/hub/models--Efficient-Large-Model--VILA1.5-3b-AWQ/snapshots/5d37764f2ed919bae08637a3b380bfd53931475d/llm/model-00001-of-00002.safetensors"

How can I run inference with the checkpoint in here? https://huggingface.co/Efficient-Large-Model/VILA1.5-3b-AWQ/tree/main/llm

[Feature Request] Evaluation tools of the Few-shot VQA/Caption

Hi, I'm interested in your great work.

The ./scripts/v1_5/eval/eval_all.sh is not avalilable now. Could you release the evaluation tools? Especially the few-shot VQA/Caption.

And the mmc4 pretrained weight is wished to be availiable.

dataset_mixture of new_vflan_sharegpt4v_sft is also not availiable.

Ty very much !

LLM version

It seems like the paper reported scores using LLaMa-2. Whereas in the released training code, we are guided to use vicuna-1.5 which is the same as LLaVA. Can we assume that vicuna-1.5 training can work smoothly use the current code?

FlashAttention Bug

When I run the inference code, it reports the error "RuntimeError: FlashAttention only supports Ampere GPUs or newer."

My GPU is V100, how can I not use flash-attention?

working with VLLM

I'm wondering if I can get an easier pipeline by loading the awq weights with vllm:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model_id = 'Efficient-Large-Model/VILA1.5-13b-AWQ'

llm = LLM(model=model_id, quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The first issue seems to be that the config.json is trying to use a model type called llava_llama, which transformers doesn't know about.

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
    raise KeyError(key)
KeyError: 'llava_llama'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/config.py", line 121, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 38, in get_config
    raise e
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_llama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

if I change the type in config.json to just llava I get:

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 05-09 09:38:26 config.py:205] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-09 09:38:26 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='Efficient-Large-Model/VILA1.5-13b-AWQ', speculative_config=None, tokenizer='Efficient-Large-Model/VILA1.5-13b-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Efficient-Large-Model/VILA1.5-13b-AWQ)
Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
    engine = cls(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 150, in __init__
    self._init_tokenizer()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 328, in _init_tokenizer
    self.tokenizer = get_tokenizer_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer.py", line 92, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 880, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'Efficient-Large-Model/VILA1.5-13b-AWQ'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Efficient-Large-Model/VILA1.5-13b-AWQ' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Which seems to suggest that the LLama tokenizer isn't in the llm directory? Do we need a tokenizer.json in the repo? Even if I add that, it seems to have trouble loading the tokenizer.

Question on Multi-Image Input Processing During Training

I encountered confusion while reading the code for handling multi-image inputs, particularly in the following sections:
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L127

The nested for loops starting at
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L168
and
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L198
seem to iterate over all image_features[cur_image_idx]. This iteration suggests that the first dimension's size (or length, if a list) of image_features should equal the product of batch_size and num_images. Therefore, it appears that the flatten operation should apply across these two dimensions rather than between num_images and the subsequent token channel dimension. This leads me to question my understanding of the process. Could you clarify where my confusion may lie? Additionally, I'd appreciate more insights into the expected layout of multi-image inputs and how this layout is manipulated in the code, specifically in
https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L122-L127

Thank you very much for your assistance.

More data leading to lower indicators?

Hi, I tried VILA-13B, training an image classification task, with good results, but now I'm running into a problem.
Before I used 250k data (finetune lora), the precision went up with more training data. Later, we added 200k data and the indicator dropped by 6-7% instead.
We are now basically sure that the new data distribution is the same as before, so what else could be the reason for the decline?

Intermediate stages checkpoints

Hi there,

Thanks for opensourcing the checkpoints and the code.

For research purposes, having access to stage-0 and stage-1 intermediate checkpoints will be really useful. Are you planning to release them too, if not done already?

video

7ko9e-AGmbM.12_0_217_out.mp4

Llama-3-VILA1.5-8B Inference error

Hello! Thanks for sharing such a nice project.
I have set up environment following the instructions in ReadME.
When I run the inference example as the following ( i have copy the run_vila.py file from llava/eval/ to the current project root):
'''bash
python run_vila.py
--model-path Efficient-Large_model/Llama-3-VILA1.5-8B
--conv-mode vicuna_v1
--query "\n Please describe the traffic condition."
--image-file "./demo_images/av.png"
'''
I encounter the following error:
'''
['./demo_images/av.png']

Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 25%|██▌ | 1/4 [01:46<05:18, 106.09s/it]
Loading checkpoint shards: 50%|█████ | 2/4 [03:47<03:49, 114.88s/it]
Loading checkpoint shards: 75%|███████▌ | 3/4 [05:02<01:37, 97.03s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [05:13<00:00, 62.85s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [05:13<00:00, 78.34s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
input: \n Please describe the traffic condition.
[WARNING] the auto inferred conversation mode is llava_v0, while --conv-mode is vicuna_v1, using vicuna_v1
torch.Size([1, 3, 384, 384])
Traceback (most recent call last):
File "/home/deping.zhang/code/llm/VILA/run_vila.py", line 153, in
eval_model(args)
File "/home/deping.zhang/code/llm/VILA/run_vila.py", line 115, in eval_model
output_ids = model.generate(
File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/deping.zhang/code/llm/VILA/llava/model/language_model/llava_llama.py", line 171, in generate
outputs = self.llm.generate(
File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
return self.sample(
File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/utils.py", line 2924, in sample
if stopping_criteria(input_ids, scores):
File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 132, in call
return any(criteria(input_ids, scores) for criteria in self)
File "/home/deping.zhang/.conda/envs/vila/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 132, in
return any(criteria(input_ids, scores) for criteria in self)
File "/home/deping.zhang/code/llm/VILA/llava/mm_utils.py", line 287, in call
outputs.append(self.call_for_batch(output_ids[i].unsqueeze(0), scores))
File "/home/deping.zhang/code/llm/VILA/llava/mm_utils.py", line 272, in call_for_batch
if (output_ids[0, -keyword_id.shape[0] :] == keyword_id).all():
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0
'''

Inference has error: TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'seqlens_in_batch'

I test the inference cmd:
python -W ignore llava/eval/run_llava.py --model-path Efficient-Large-Model/VILA-7B --conv-mode vicuna_v1 --query "\n Please describe the traffic condition." --image-file "demo_images/av.png"
and encounter errors of:

VILA/llava/model/language_model/llava_llama.py", line 125, in forward
outputs = super().forward(
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'seqlens_in_batch'

It also shows warning message that the some VILA layers are not loaded:
2. Loading checkpoint shards: 100%|██████████████| 3/3 [00:04<00:00, 1.38s/it]Some weights of the model checkpoint at Efficient-Large-Model/VILA-7B were not used when initializing LlavaLlamaForCausalLM: ['model.vision_tower.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.lay.....

Is there anyway to fix this inference problem?

Multi-image Input Inference Script

I've successfully tested multi-image input inference as mentioned in #8 (comment)

python -W ignore llava/eval/run_llava.py --model-path Efficient-Large-Model/VILA-7B --conv-mode vicuna_v1 --query "<image> image 1 is google, famous for its search engine. <image> image 2 is microsoft, framous for its operating system. <image> image 3 is apple, famous for iPhone and Mac. <image> image 4 is" --image-file "demo_images/g.PNG,demo_images/m.PNG,demo_images/a.PNG,demo_images/n.PNG"

It works perfectly!
I want to inquire if there is a script that can perform this inference on a whole custom dataset? If so, what format should the custom data be in?

Possibility to support LLama-3?

Thanks for sharing fantastic jobs, the VILA shows strong few-shot in context learning ability than original llava-1.5, will you plan to support LLama-3? The advantage of in context learning for vision could be expected.

Instruction for VILA 1.5 with tinychat (llm-awq) doesn't work well due to fixed torch version (==2.0.1)

Thank you for releasing the new version of VILA (1.5)!

I followed the installation instructions at https://github.com/mit-han-lab/llm-awq/tree/main?tab=readme-ov-file#install and ran the command python vlm_demo_new.py as detailed here: https://github.com/mit-han-lab/llm-awq/tree/main/tinychat#support-visual-language-models-vila-15-vila-llava

On Ubuntu 22.04 with CUDA 12.x, I installed the CUDA 12 libraries during step 2. However, in step 4, since VILA installs a specific version of torch (2.0.1) as specified here https://github.com/Efficient-Large-Model/VILA/blob/main/pyproject.toml#L16, it also installs CUDA 11 libraries, leading to library conflicts between packages in VILA and those in llm-awq.

The error encountered was:

File "/backup/repo/VILA/llm-awq/awq/quantize/qmodule.py", line 4, in <module>
   import awq_inference_engine  # with CUDA kernels
ImportError: /home/gbae/.pyenv/versions/vila/lib/python3.10/site-packages/awq_inference_engine-0.0.0-py3.10-linux-x86_64.egg/awq_inference_engine.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow11cow_deleterEPv

To resolve this issue, I executed the following commands:

pip uninstall nvidia-cublas-cu11 nvidia-cuda-cupti-cu11 nvidia-cuda-nvrtc-cu11 nvidia-cuda-runtime-cu11 nvidia-cudnn-cu11 nvidia-cufft-cu11 nvidia-curand-cu11 nvidia-cusolver-cu11 nvidia-cusparse-cu11 nvidia-nccl-cu11 nvidia-nvtx-cu11

# Need to reinstall CUDA 12 libraries as directories are shared with CUDA 11 libraries and will be deleted.
pip uninstall nvidia-cublas-cu12 nvidia-cuda-cupti-cu12 nvidia-cuda-nvrtc-cu12 nvidia-cuda-runtime-cu12 nvidia-cudnn-cu12 nvidia-cufft-cu12 nvidia-curand-cu12 nvidia-cusolver-cu12 nvidia-cusparse-cu12 nvidia-nccl-cu12 nvidia-nvjitlink-cu12 nvidia-nvtx-cu12

pip install nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.1.105

# Install flash_attn package for CUDA 12.x
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Additionally, as mentioned in mit-han-lab/llm-awq#180, the file model_worker_new.py is missing (@kentang-mit).

Please address this issue so that other users can follow the instructions and enjoy the Gradio app with VILA v1.5!
Thanks!

Would you consider releasing code that supports lora training 40b model?

Very excellent work! When using lora to train a 40b model in my task, I found during the loading inference process that lora did not save the weight of the vision tower, so the effect of my task was very poor. Would you consider supporting lora training and loading with official code?

Add support for GPUs with compute capability lower than 8.0 for awq/kernels installation

I tried to install and run the project on a machine with an NVIDIA Tesla T4 GPU, which has a compute capability of 7.5 (SM 75).

Environment
Ubuntu 22.04 with CUDA 12.1

I followed the steps as mentioned here https://github.com/mit-han-lab/llm-awq/tree/main?tab=readme-ov-file#install & encountered the following error during the third step installation process:

cd awq/kernels
python setup.py install

Following error was reported

ptxas /tmp/tmpxft_0000f5ba_00000000-6_gemm_cuda_gen.ptx, line 709; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000f5ba_00000000-6_gemm_cuda_gen.ptx, line 713; error   : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas /tmp/tmpxft_0000f5ba_00000000-6_gemm_cuda_gen.ptx, line 717; error   : Feature '.m16n8k16' requires .target sm_80 or higher
...
txas fatal   : Ptx assembly aborted due to errors
error: command '/usr/local/cuda-12.1/bin/nvcc' failed with exit code 255

Root Cause: Feature '.m16n8k16' requires .target sm_80 or higher

Is there a configuration flag or workaround to support GPUs with capacity below 8.0

About the VideoQA dataset

Thanks for amazing work!

I found in your paper that VILA use several QA dataset(including MSVD-QA...) during the ablation study.
I wonder if the released model Efficient-Large-Model/VILA-7b on huggingface has been trained with these QA SFT dataset.
Thanks!

demo_trt_llm/convert_checkpoint.py - AttributeError: 'LlavaLlamaConfig' object has no attribute 'num_attention_heads'

Hi,
I'm trying to run demo_trt_llm.
Followed demo_trt_llm/README.md exactly

MODEL_NAME='vila1.5-2.7b'

Command:
python $VILA_ROOT/demo_trt_llm/convert_checkpoint.py
--model_dir models/${MODEL_NAME}
--output_dir models/${MODEL_NAME}/trt/fp16/1-gpu
--dtype float16

pip freeze:
absl-py==2.1.0
accelerate==0.25.0
aiohttp @ file:///rapids/aiohttp-3.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=81b77f868814346662c96ab36b875d7814ebf82340d3284a31681085c051320f
aiosignal @ file:///rapids/aiosignal-1.3.1-py3-none-any.whl#sha256=f8376fb07dd1e86a584e4fcdec80b36b7f81aac666ebc724e2c090300dd83b17
annotated-types==0.6.0
apex @ file:///opt/pytorch/apex
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
asttokens==2.4.1
astunparse==1.6.3
async-timeout @ file:///rapids/async_timeout-4.0.3-py3-none-any.whl#sha256=7405140ff1230c310e51dc27b3145b9092d659ce68ff733fb0cefe3ee42be028
attrs==23.2.0
audioread==3.0.1
beautifulsoup4==4.12.3
bleach==6.1.0
blis==0.7.11
build==1.2.1
cachetools==5.3.2
catalogue==2.0.10
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.16.0
cloudpickle @ file:///rapids/cloudpickle-3.0.0-py3-none-any.whl#sha256=246ee7d0c295602a036e86369c77fecda4ab17b506496730f2f576d9016fd9c7
cmake==3.28.1
colored==2.2.4
coloredlogs==15.0.1
comm==0.2.1
confection==0.1.4
contourpy==1.2.0
cubinlinker @ file:///rapids/cubinlinker-0.3.0%2B2.g405ac64-cp310-cp310-linux_x86_64.whl#sha256=fe3ba53922377d7656ef45cb5aa61ac10fc4f44635f94d261cb01dbc2ed6b6c2
cuda-python @ file:///rapids/cuda_python-12.3.0rc4%2B9.gdb8c48a.dirty-cp310-cp310-linux_x86_64.whl#sha256=40ec85ddb721b09a0af7bb545af238feabd8ac4c610756e89d43891a34b3ad62
cudf @ file:///rapids/cudf-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=9bf23765b34ef0a453e5caf63be526efbaf338f1dc6339cdeb4ea74404c81254
cugraph @ file:///rapids/cugraph-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=18c29a3c7c96ac6bb3e86c149667f15ced14c6cb812b008fd1ca4f6cd92c95a2
cugraph-dgl @ file:///rapids/cugraph_dgl-23.12.0-py3-none-any.whl#sha256=ecc4e14a1b586ff6054829a94b54596111ca9e0514e8ad157a99b59e5408e28d
cugraph-service-client @ file:///rapids/cugraph_service_client-23.12.0-py3-none-any.whl#sha256=decbbd260b254d397887af5b10cc21c55b845b9776f96da9fd587ae872362728
cugraph-service-server @ file:///rapids/cugraph_service_server-23.12.0-py3-none-any.whl#sha256=9e52401f6e5acd4d5c85f502cc763c60cb80a175d171b13392bec6c6d75ecd82
cuml @ file:///rapids/cuml-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=0e7e87f320bd91705df559dd383279317a5a88fb18f5c58b54972d27882d9e1b
cupy-cuda12x @ file:///rapids/cupy_cuda12x-12.3.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=32d0e03789ef3f02f0c098818e957c235b75c1636e9e0036299480db0c423dcd
cycler==0.12.1
cymem==2.0.8
Cython==3.0.8
dask @ file:///rapids/dask-2023.11.0-py3-none-any.whl#sha256=b950951ee3f8c86f003b577b6928ecf20089eee6677719578deaba8fd9a78203
dask-cuda @ file:///rapids/dask_cuda-23.12.0-py3-none-any.whl#sha256=57e3399b50a0938587fc1f5733fa6b0a9074925e9cf58c4ca550a4c3922708b4
dask-cudf @ file:///rapids/dask_cudf-23.12.0-py3-none-any.whl#sha256=56d03008fee5660f479e59436f1ab54e36c75bd214e65f31c49a3c6fad7d83d7
datasets==2.19.1
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
diffusers==0.27.0
dill==0.3.8
distributed @ file:///rapids/distributed-2023.11.0-py3-none-any.whl#sha256=44ad1fff31ece202cc64bdb72dd33d6964d78bdbe1ec1ec06e01f9544187cd2e
dm-tree==0.1.8
einops==0.7.0
evaluate==0.4.2
exceptiongroup==1.2.0
execnet==2.0.2
executing==2.0.1
expecttest==0.1.3
fastjsonschema==2.19.1
fastrlock @ file:///rapids/fastrlock-0.8.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl#sha256=08315bde19d0c2e6b06593d5a418be3dc8f9b1ee721afa96867b9853fceb45cf
filelock==3.13.1
flash-attn==2.4.2
flatbuffers==24.3.25
fonttools==4.48.1
frozenlist @ file:///rapids/frozenlist-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=a9b2de4cf0cdd5bd2dee4c4f63a653c61d2408055ab77b151c1957f221cabf2a
fsspec @ file:///rapids/fsspec-2023.12.2-py3-none-any.whl#sha256=d800d87f72189a745fa3d6b033b9dc4a34ad069f60ca60b943a63599f5501960
gast==0.5.4
google-auth==2.27.0
google-auth-oauthlib==0.4.6
graphsurgeon @ file:///workspace/TensorRT-8.6.3.1/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl#sha256=0fbadaefbbe6e9920b9f814ae961c4a279be602812edf3ed7fb9cc6f8f4809fe
grpcio==1.60.1
h5py==3.10.0
huggingface-hub==0.23.0
humanfriendly==10.0
hypothesis==5.35.1
idna==3.6
importlib-metadata @ file:///rapids/importlib_metadata-7.0.1-py3-none-any.whl#sha256=4805911c3a4ec7c3966410053e9ec6a1fecd629117df5adee56dfc9432a1081e
iniconfig==2.0.0
intel-openmp==2021.4.0
ipykernel==6.29.2
ipython==8.21.0
ipython-genutils==0.2.0
janus==1.0.0
jedi==0.19.1
Jinja2==3.1.3
joblib==1.3.2
json5==0.9.14
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter-tensorboard @ git+https://github.com/cliffwoolley/jupyter_tensorboard.git@ffa7e26138b82549453306e06b535a9ac36db17a
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyterlab==2.3.2
jupyterlab-server==1.2.0
jupyterlab_pygments==0.3.0
jupytext==1.16.1
kiwisolver==1.4.5
langcodes==3.3.0
lark==1.1.9
lazy_loader==0.3
librosa==0.10.1
llvmlite @ file:///rapids/llvmlite-0.40.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=bbd5e82cc990e5a3e343a3bf855c26fdfe3bfae55225f00efd01c05bbda79918
locket @ file:///rapids/locket-1.0.0-py2.py3-none-any.whl#sha256=b6c819a722f7b6bd955b80781788e4a66a55628b858d347536b7e81325a3a5e3
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe @ file:///rapids/MarkupSafe-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=dac1ebf6983148b45b5fa48593950f90ed6d1d26300604f321c74a9ca1609f8e
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdit-py-plugins==0.4.0
mdurl==0.1.2
mistune==3.0.2
mkl==2021.1.1
mkl-devel==2021.1.1
mkl-include==2021.1.1
mock==5.1.0
mpi4py @ file:///tmp/mpi4py-3.1.5
mpmath==1.3.0
msgpack==1.0.7
multidict @ file:///rapids/multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=36c63aaa167f6c6b04ef2c85704e93af16c11d20de1d133e39de6a0e84582a93
multiprocess==0.70.16
murmurhash==1.0.10
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx==2.6.3
ninja==1.11.1.1
notebook==6.4.10
numba @ file:///rapids/numba-0.57.1%2B1.g1ff679645-cp310-cp310-linux_x86_64.whl#sha256=182b77614c983c4c32db619d849a68ed4c33637e307ebb1a2731a3ae730ae36c
numpy==1.24.4
nvfuser==0.1.4a0+d0bb811
nvidia-ammo==0.9.3
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cudnn-cu12==8.9.7.29
nvidia-dali-cuda120==1.34.0
nvidia-pyindex==1.0.9
nvtx @ file:///rapids/nvtx-0.2.5-cp310-cp310-linux_x86_64.whl#sha256=939c7322e7cd4f34af85cdf6468b3d80b1e144a34bbcd61e08e5c436071d3e1f
oauthlib==3.2.2
onnx @ file:///opt/pytorch/pytorch/third_party/onnx
onnx-graphsurgeon==0.5.2
onnxruntime==1.16.3
opencv @ file:///opencv-4.7.0/modules/python/package
optimum==1.19.1
optree==0.10.0
packaging==23.2
pandas @ file:///rapids/pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=7a0a56cef15fd1586726dace5616db75ebcfec9179a3a55e78f72c5639fa2a23
pandocfilters==1.5.1
parso==0.8.3
partd @ file:///rapids/partd-1.4.1-py3-none-any.whl#sha256=27e766663d36c161e2827aa3e28541c992f0b9527d3cca047e13fb3acdb989e6
pexpect==4.9.0
pillow @ file:///rapids/pillow-10.2.0-cp310-cp310-manylinux_2_28_x86_64.whl#sha256=322bdf3c9b556e9ffb18f93462e5f749d3444ce081290352c6070d014c93feb2
platformdirs==4.2.0
pluggy==1.4.0
ply @ file:///rapids/ply-3.11-py2.py3-none-any.whl#sha256=096f9b8350b65ebd2fd1346b12452efe5b9607f7482813ffca50c22722a807ce
polygraphy==0.49.0
pooch==1.8.0
preshed==3.0.9
prettytable==3.9.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.24.4
psutil @ file:///rapids/psutil-5.9.4-cp310-abi3-linux_x86_64.whl#sha256=f1cb87a01694756b49d74098db4073e7b50588d5c41c47485d677ef2bf07f132
ptxcompiler @ file:///rapids/ptxcompiler-0.8.1%2B2.g0d406d6-cp310-cp310-linux_x86_64.whl#sha256=4d53fe48aa72600d059e402fd468f51b14301b11cbbedd6740637bec4add0944
ptyprocess==0.7.0
PuLP==2.8.0
pure-eval==0.2.2
pyarrow @ file:///rapids/pyarrow-14.0.1.dev0%2Bgba5374836.d20240125-cp310-cp310-linux_x86_64.whl#sha256=709dc25423ce14dccd3ba67072325a26147f87b6dc40a9b05a7fdaaa91efb6ee
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.11.1
pybind11-global==2.11.1
pycocotools @ git+https://github.com/nvidia/cocoapi.git@d99cbf3823588ef09a2721655f46e509ebafb3d7#subdirectory=PythonAPI
pycparser==2.21
pydantic==2.6.1
pydantic_core==2.16.2
Pygments==2.17.2
pylibcugraph @ file:///rapids/pylibcugraph-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=07ba411e9cffd1dac341a42d8ed2962fcee94a5219fdd602fa122d73dee4aaaf
pylibcugraphops @ file:///rapids/pylibcugraphops-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=60670e596324588a01fb670e030293f06dc5cf7f8d6006e910b8e00df564d683
pylibraft @ file:///rapids/pylibraft-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=fbcfaa07a175dd0fdd7b65011dc72cbb6b88aaddc156843250ccb7d1c181916a
pynvml==11.5.0
pyparsing==3.1.1
pyproject_hooks==1.1.0
pytest==8.0.0
pytest-flakefinder==1.1.0
pytest-rerunfailures==13.0
pytest-shard==0.1.2
pytest-xdist==3.5.0
python-dateutil==2.8.2
python-hostlist==1.23.0
pytorch-quantization==2.1.2
pytz @ file:///rapids/pytz-2023.3.post1-py2.py3-none-any.whl#sha256=ce42d816b81b68506614c11e8937d3aa9e41007ceb50bfdcb0749b921bf646c7
PyYAML==6.0.1
pyzmq==25.1.2
raft-dask @ file:///rapids/raft_dask-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=d632376e71ac9cfca5eacc7f8aa51e0f096e7a1f56c186a1653e097ea990cfe9
rapids-dask-dependency @ file:///rapids/rapids_dask_dependency-23.12.1-py3-none-any.whl#sha256=2abfe15415711bad9dfe9e83d4bfbd039e9436d66cc17e74ae22c85ab9afe46b
referencing==0.33.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
rich @ file:///rapids/rich-13.7.0-py3-none-any.whl#sha256=6da14c108c4866ee9520bbffa71f6fe3962e193b7da68720583850cd4548e235
rmm @ file:///rapids/rmm-23.12.0-cp310-cp310-linux_x86_64.whl#sha256=d59676daa42bcdd9d3b47d8aa96ea43d15c4120c005e6f7d8a2cbfa4a1e2d840
rpds-py==0.17.1
rsa==4.9
s2wrapper @ git+https://github.com/bfshi/scaling_on_scales.git@a9ae91bcc08b3cf10fc5912c088d5c214212362a
safetensors==0.4.3
scikit-learn @ file:///rapids/scikit_learn-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=184a42842a4e698ffa4d849b6019de50a77a0aa24d26afa28fa49c9190bb144b
scipy @ file:///rapids/scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=5e32847e08da8d895ce09d108a494d9eb78974cf6de23063f93306a3e419960c
Send2Trash==1.8.2
sentencepiece==0.2.0
six==1.16.0
smart-open==6.4.0
sortedcontainers==2.4.0
soundfile==0.12.1
soupsieve==2.5
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
sphinx_glpi_theme==0.6
srsly==2.4.8
stack-data==0.6.3
StrEnum==0.4.15
sympy==1.12
tabulate==0.9.0
tbb==2021.11.0
tblib @ file:///rapids/tblib-3.0.0-py3-none-any.whl#sha256=80a6c77e59b55e83911e1e607c649836a69c103963c5f28a46cbeef44acf8129
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorrt @ file:///usr/local/tensorrt/python/tensorrt-9.3.0.post12.dev1-cp310-none-linux_x86_64.whl#sha256=2fa2d4612505b8a8ff479e500a84810e303f0837a45c586b45f17ba5a3c6fec5
tensorrt-llm @ file:///app/tensorrt_llm/tensorrt_llm-0.10.0.dev2024042300-cp310-cp310-linux_x86_64.whl#sha256=28865d8876eb39f42949fd119b8e066f5ffbaf62e9cc9b377ceef841c967de01
terminado==0.18.0
thinc==8.2.3
threadpoolctl==3.2.0
thriftpy2 @ file:///rapids/thriftpy2-0.4.17-cp310-cp310-linux_x86_64.whl#sha256=9e3633fc2abf0a2be59f6e4cd2a1dfac1b1daf3b1950383476fc6d6de6efcd03
tinycss2==1.2.1
tokenizers==0.15.2
toml==0.10.2
tomli==2.0.1
toolz @ file:///rapids/toolz-0.12.1-py3-none-any.whl#sha256=d22731364c07d72eea0a0ad45bafb2c2937ab6fd38a3507bf55eae8744aa7d85
torch @ file:///tmp/pip/torch-2.3.0a0%2Bebedce2-cp310-cp310-linux_x86_64.whl#sha256=c206635bc4a2f409f0d93c53d08ef64fe1b230bba184a3d252c671fdb7d80450
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/dist/torch_tensorrt-2.3.0a0-cp310-cp310-linux_x86_64.whl#sha256=9a2a2ade4f52284b0f7660930f9f1c13d409e490ce515426c4da61990ff6dadd
torchdata @ file:///opt/pytorch/data
torchtext @ file:///opt/pytorch/text
torchvision @ file:///opt/pytorch/vision
tornado==6.4
tqdm==4.66.1
traitlets==5.9.0
transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@5b90b7f5ed67b373bc5f843d1ac3b7a8999df08e
transformers @ git+https://github.com/huggingface/transformers@a7cab3c283312b8d4de5df3bbe719971e24f4281
treelite @ file:///rapids/treelite-3.9.1-cp310-cp310-linux_x86_64.whl#sha256=ad238ce625336335bf51b9fd4b3c64b42a1bfc743d17f6077ec5dc7c96644511
treelite-runtime @ file:///rapids/treelite_runtime-3.9.1-cp310-cp310-linux_x86_64.whl#sha256=1379f600b91df775aa24ea255f5e31ca47788f76ae14b73f46b4b8b0e4728a33
triton @ file:///tmp/dist/triton-2.2.0%2Be28a256-cp310-cp310-linux_x86_64.whl#sha256=8131877165b2e75adc11f694542a62deb22bc3500c49a9e5febd1e428834a435
typer==0.9.0
types-dataclasses==0.6.6
typing_extensions==4.9.0
ucx-py @ file:///rapids/ucx_py-0.35.0-cp310-cp310-linux_x86_64.whl#sha256=c193b737773989d184121dbfab320c888df6a60879f15cd885a8a3274a610273
uff @ file:///workspace/TensorRT-8.6.3.1/uff/uff-0.6.9-py2.py3-none-any.whl#sha256=618a3f812d491f0d3c4f2e38b99e03217ca37b206db14cee079f2bf681eb4fe3
urllib3 @ file:///rapids/urllib3-1.26.18-py2.py3-none-any.whl#sha256=34b97092d7e0a3a8cf7cd10e386f401b3737364026c45e622aa02903dffe0f07
wasabi==1.1.2
wcwidth==0.2.13
weasel==0.3.4
webencodings==0.5.1
Werkzeug==3.0.1
xdoctest==1.0.2
xgboost @ file:///rapids/xgboost-1.7.6-cp310-cp310-linux_x86_64.whl#sha256=275613a32b6ef56d0fda43f1ad847afd9e5c8eb58a85208b1cb2871ea2286088
xxhash==3.4.1
yarl @ file:///rapids/yarl-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=357495293086c5b6d34ca9616a43d329317feab7917518bc97a08f9e55648455
zict @ file:///rapids/zict-3.0.0-py2.py3-none-any.whl#sha256=5796e36bd0e0cc8cf0fbc1ace6a68912611c1dbd74750a3f3026b9b9d6a327ae
zipp @ file:///rapids/zipp-3.17.0-py3-none-any.whl#sha256=0e923e726174922dce09c53c59ad483ff7bbb8e572e00c7f7c46b88556409f31
VILA_ROOT

Error:
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/root/workspace/VILA/llava/model/llava_arch.py:106: UserWarning: model_dtype not found in config, defaulting to torch.float16.
warnings.warn("model_dtype not found in config, defaulting to torch.float16.")
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00, 1.22it/s]
Traceback (most recent call last):
File "/root/workspace/VILA/demo_trt_llm/convert_checkpoint.py", line 468, in
main()
File "/root/workspace/VILA/demo_trt_llm/convert_checkpoint.py", line 460, in main
convert_and_save_hf(args)
File "/root/workspace/VILA/demo_trt_llm/convert_checkpoint.py", line 397, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/root/workspace/VILA/demo_trt_llm/convert_checkpoint.py", line 419, in execute
f(args, rank)
File "/root/workspace/VILA/demo_trt_llm/convert_checkpoint.py", line 384, in convert_and_save_rank
llama = LLaMAForCausalLM.from_hugging_face(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 245, in from_hugging_face
llama = convert.from_hugging_face(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1150, in from_hugging_face
config = create_config_from_hugging_face(model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1056, in create_config_from_hugging_face
n_head = hf_config.num_attention_heads
File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 265, in getattribute
return super().getattribute(key)
AttributeError: 'LlavaLlamaConfig' object has no attribute 'num_attention_heads'

Base LLM for the VILA 7B Model

Hey folks!
Great work with VILA. Super helpful!

I'm wondering what the base LLM for the VILA 7B model is? Can you point to the HF pertained model for it?
Specifically for the checkpoints in HF: Efficient-Large-Model/VILA-7b