opengvlab / internvl

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.

Home Page: https://internvl.readthedocs.io/en/latest/

License: MIT License

Python 48.41% Shell 3.63% Makefile 0.05% Jupyter Notebook 47.52% HTML 0.15% JavaScript 0.20% CSS 0.04%

image-classification image-text-retrieval llm semantic-segmentation video-classification vision-language-model vit-22b vit-6b multi-modal gpt

internvl's Introduction

InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-4o

[🆕 Blog] [🤔 FAQs] [🚀 InternVL2 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [📖 Document] [🌐 API] [🚀 Quick Start]

[📜 InternVL 1.0 Paper] [📜 InternVL 1.5 Report] [📖 1.0 Chinese Explainer] [📖 1.5 Chinese Explainer] [📖 2.0 Chinese Explainer]

Switch to the Chinese version

News 🚀🚀🚀

2024/08/01: The Chartmimic team evaluated the InternVL2 series models on their benchmark. The InternVL2-26B and 76B models achieved the top two performances among open-source models, with the InternVL2 76B model surpassing GeminiProVision and exhibiting comparable results to Claude-3-opus.
2024/08/01: InternVL2-Pro achieved the SOTA performance among open-source models on the CharXiv dataset, surpassing some well-known closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet.
2024/07/24: The MLVU team evaluated InternVL-1.5 on their benchmark. The average performance on the multiple-choice task was 50.4%, while the performance on the generative tasks was 4.02. The performance on the multiple-choice task ranked #1 among all open-source MLLMs.
2024/07/18: 🔥🔥 InternVL2-40B achieved SOTA performance among open-source models on the Video-MME dataset, scoring 61.2 when inputting 16 frames and 64.4 when inputting 32 frames. It significantly outperforms other open-source models and is the closest open-source model to GPT-4o mini.
2024/07/18: 🔥 InternVL2-Pro achieved the SOTA performance on the DocVQA and InfoVQA benchmarks.
2024/07/04: 🚀 We release the InternVL2 series. InternVL2-Pro achieved 62.0% accuracy on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o. Free API access to this model can be requested by filling out the (application form) / (application form, Chinese version). Other models are available at HF link.
2024/06/19: We propose Needle In A Multimodal Haystack (MM-NIAH), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
2024/05/30: We release ShareGPT-4o, a large-scale dataset that we plan to open-source, containing 200K images, 10K videos, and 10K audio clips with detailed descriptions.
2024/05/29: We release the Mini-InternVL series, which includes two chat models: Mini-InternVL-Chat-2B-V1-5 and Mini-InternVL-Chat-4B-V1-5. These models achieve impressive performance with minimal size: the 2B model delivers 80% of the performance with only 8% of the model size, and the 4B model achieves 90% of the performance with just 16% of the model size. For more details, please check our blog.
2024/05/28: Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ.
2024/05/13: InternVL 1.0 can now be used as the text encoder for diffusion models to support multilingual generation natively in over 110 languages worldwide. See MuLan for more details.
2024/04/18: InternVL-Chat-V1-5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
2024/02/27: InternVL is accepted by CVPR 2024 (Oral)! 🎉
2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
2024/02/21: InternVL-Chat-V1-2-Plus achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
2024/02/12: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog and SFT data. The model is now available on HuggingFace, and both training / evaluation data and scripts are open-sourced.
2024/01/24: InternVL-Chat-V1-1 is released. It supports Chinese and has stronger OCR capability; see here.
2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale detection and segmentation models.

TODO List

Support vLLM and Ollama
Rebuild documents using readthedocs
Support fine-tuning different LLMs with LoRA
Support video and PDF input in online demo
Release InternVL2 with VisionLLMv2 integration
Release requirements.txt for InternVL2
Release training / evaluation code for InternVL2 series
Release Streamlit web UI for InternVL1.5 and InternVL2

Documents

Get Started
- Installation: [Environment] [requirements.txt]
- Evaluation Data Preparation: [InternVL Evaluation]
- Chat Data Format: [Meta File] [Pure Text] [Single-Image] [Multi-Image] [Video]
- InternVL-Chat API: [InternVL2-Pro]
- Local Chat Demo: [Streamlit Demo] [Gradio Demo] [LMDeploy Demo]
- Tutorials: [Enhancing InternVL2 on COCO Caption Using LoRA Fine-Tuning]
InternVL Family
- InternVL 2.0: [Introduction] [Quick Start] [Finetune] [Evaluation] [Deployment]
- InternVL 1.5: [Introduction] [Quick Start] [Finetune] [Evaluation] [Deployment]
- InternVL 1.2: [Introduction] [Quick Start] [Finetune] [Evaluation]
- InternVL 1.1: [Introduction] [Quick Start] [Evaluation]
- InternVL 1.0: [Classification] [CLIP-Benchmark] [Segmentation] [InternVL-Chat-LLaVA] [InternVL-G]

Compared with SOTA VLLMs

Model Zoo

Multimodal Large Language Model (InternVL 2.0)

Model Name	Vision Part	Language Part	HF Link	MS Link	Document
InternVL2‑1B	InternViT‑300M‑448px	Qwen2‑0.5B‑Instruct	🤗 link	🤖 link	📖 doc
InternVL2‑2B	InternViT‑300M‑448px	internlm2‑chat‑1‑8b	🤗 link	🤖 link	📖 doc
InternVL2‑4B	InternViT‑300M‑448px	Phi‑3‑mini‑128k‑instruct	🤗 link	🤖 link	📖 doc
InternVL2‑8B	InternViT‑300M‑448px	internlm2_5‑7b‑chat	🤗 link	🤖 link	📖 doc
InternVL2‑26B	InternViT‑6B‑448px‑V1‑5	internlm2‑chat‑20b	🤗 link	🤖 link	📖 doc
InternVL2‑40B	InternViT‑6B‑448px‑V1‑5	Nous‑Hermes‑2‑Yi‑34B	🤗 link	🤖 link	📖 doc
InternVL2-Llama3-76B	InternViT‑6B‑448px‑V1‑5	Hermes‑2‑Theta‑Llama‑3‑70B	🤗 link	🤖 link	📖 doc

InternVL2-Pro API

We welcome everyone to use our API for research. For better management, please submit the (application form) / (application form, Chinese version) to obtain free API access.

Multimodal Large Language Model (InternVL 1.0-1.5)

Model	Date	HF Link	MS Link	Note
Mini‑InternVL‑Chat‑4B‑V1‑5	2024.05.28	🤗 link	🤖 link	🚀🚀 16% of the model size, 90% of the performance
Mini‑InternVL‑Chat‑2B‑V1‑5	2024.05.19	🤗 link	🤖 link	🚀 8% of the model size, 80% of the performance
InternVL‑Chat‑V1‑5	2024.04.18	🤗 link	🤖 link	supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista
InternVL‑Chat‑V1‑2‑Plus	2024.02.21	🤗 link	🤖 link	more SFT data and stronger
InternVL‑Chat‑V1‑2	2024.02.11	🤗 link	🤖 link	scaling up LLM to 34B
InternVL‑Chat‑V1‑1	2024.01.24	🤗 link	🤖 link	support Chinese and stronger OCR
InternVL‑Chat‑19B	2023.12.25	🤗 link	🤖 link	English multimodal dialogue
InternVL‑Chat‑13B	2023.12.25	🤗 link	🤖 link	English multimodal dialogue

Vision Foundation Model (InternVL 1.0-1.5)

Model	Date	HF Link	MS Link	Note
InternViT‑300M‑448px	2024.05.25	🤗 link	🤖 link	distilled small vision foundation model with 300M parameters (🔥new)
InternViT‑6B‑448px‑V1‑5	2024.04.20	🤗 link	🤖 link	support dynamic resolution and super strong OCR feature extraction capability by incremental pre-training (🔥new)
InternViT‑6B‑448px‑V1‑2	2024.02.11	🤗 link	🤖 link	support 448 resolution by incremental pre-training
InternViT‑6B‑448px‑V1‑0	2024.01.30	🤗 link	🤖 link	support 448 resolution by incremental pre-training
InternViT‑6B‑224px	2023.12.22	🤗 link	🤖 link	the first version of InternViT-6B, extracted from InternVL‑14B‑224px

Vision-Language Foundation Model (InternVL 1.0)

Model	Date	HF Link	MS Link	Note
InternVL‑14B‑224px	2023.12.22	🤗 link	🤖 link	vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP

What can InternVL do?

Visual Perception

Linear-Probe Image Classification [see details]

* ViT-22B uses the private JFT-3B dataset.

method	#param	IN-1K	IN-ReaL	IN-V2	IN-A	IN-R	IN-Sketch
OpenCLIP-G	1.8B	86.2	89.4	77.2	63.8	87.8	66.4
DINOv2-g	1.1B	86.5	89.6	78.4	75.9	78.8	62.5
EVA-01-CLIP-g	1.1B	86.5	89.3	77.4	70.5	87.7	63.1
MAWS-ViT-6.5B	6.5B	87.8	-	-	-	-	-
ViT-22B*	21.7B	89.5	90.9	83.2	83.8	87.4	-
InternViT-6B (ours)	5.9B	88.2	90.4	79.9	77.5	89.8	69.1

Semantic Segmentation [see details]

method	decoder	#param (train/total)	crop size	mIoU
OpenCLIP-G (frozen)	Linear	0.3M / 1.8B	512	39.3
ViT-22B (frozen)	Linear	0.9M / 21.7B	504	34.6
InternViT-6B (frozen)	Linear	0.5M / 5.9B	504	47.2 (+12.6)
ViT-22B (frozen)	UperNet	0.8B / 22.5B	504	52.7
InternViT-6B (frozen)	UperNet	0.4B / 6.3B	504	54.9 (+2.2)
ViT-22B	UperNet	22.5B / 22.5B	504	55.3
InternViT-6B	UperNet	6.3B / 6.3B	504	58.9 (+3.6)
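
The mIoU column above is the mean intersection-over-union across classes. As a reminder of what that number measures, here is a minimal NumPy sketch on a toy label map. It is not the evaluation code used to produce the table; `mean_iou` is a hypothetical helper name and the 2x3 label maps are made up for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU (in percent) over the classes that occur in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both maps
    return float((intersection[valid] / union[valid]).mean() * 100.0)

# Toy 2x3 label maps with three classes (hypothetical data).
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)
print(f'mIoU = {score:.1f}')  # per-class IoU is 1/3, 2/3 and 1/2
```

The bracketed gains in the table are plain differences against the ViT-22B row with the same decoder, for example 47.2 - 34.6 = 12.6 for the frozen linear setting.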

Zero-Shot Image Classification [see details]

method	IN-1K	IN-A	IN-R	IN-V2	IN-Sketch	ObjectNet
OpenCLIP-G	80.1	69.3	92.1	73.6	68.9	73.0
EVA-02-CLIP-E+	82.0	82.1	94.5	75.7	71.6	79.6
ViT-22B*	85.9	90.1	96.0	80.9	-	87.6
InternVL-C (ours)	83.2	83.8	95.5	77.3	73.9	80.6
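
Zero-shot classification with a contrastive model such as InternVL-C is commonly done by comparing each image embedding with one text embedding per class name and picking the closest. The following is a minimal NumPy sketch of that idea, not the repository's evaluation code: the embeddings are random stand-ins, and `zero_shot_classify`, `l2_normalize` and the temperature of 100 are hypothetical choices for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def zero_shot_classify(image_emb, class_text_emb, temperature=100.0):
    """Return (predicted class index per image, class probabilities)."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(class_text_emb)
    logits = temperature * img @ txt.T              # (num_images, num_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

rng = np.random.default_rng(0)
# Stand-in text embeddings for three class prompts (hypothetical data).
class_text_emb = rng.normal(size=(3, 16))
# Stand-in image embeddings: noisy copies of the class 2, 0 and 1 text embeddings.
image_emb = class_text_emb[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 16))

pred, probs = zero_shot_classify(image_emb, class_text_emb)
print(pred)
```

For the multilingual setting, only the class prompts change language; the comparison step stays the same.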

Multilingual Zero-Shot Image Classification [see details]

EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

method	IN-1K (EN)	IN-1K (ZH)	IN-1K (JP)	IN-1K (AR)	IN-1K (IT)
Taiyi-CLIP-ViT-H	-	54.4	-	-	-
WuKong-ViT-L-G	-	57.5	-	-	-
CN-CLIP-ViT-H	-	59.6	-	-	-
AltCLIP-ViT-L	74.5	59.6	-	-	-
EVA-02-CLIP-E+	82.0	-	-	-	41.2
OpenCLIP-XLM-R-H	77.0	55.7	53.1	37.0	56.8
InternVL-C (ours)	83.2	64.5	61.5	44.9	65.7

Zero-Shot Video Classification

method	#frame	K400	K600	K700
OpenCLIP-G	1	65.9	66.1	59.2
EVA-02-CLIP-E+	1	69.8	69.3	63.4
InternVL-C (ours)	1	71.0	71.3	65.7
ViCLIP	8	75.7	73.5	66.4
InternVL-C (ours)	8	79.4	78.8	71.5

Cross-Modal Retrieval

English Zero-Shot Image-Text Retrieval [see details]

model	Flickr30K						COCO						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
OpenCLIP-G	92.9	99.3	99.8	79.5	95.0	97.1	67.3	86.9	92.6	51.4	74.9	83.0	85.0
EVA-02-CLIP-E+	93.9	99.4	99.8	78.8	94.2	96.8	68.8	87.8	92.8	51.1	75.0	82.7	85.1
EVA-CLIP-8B	95.6	99.6	99.9	80.8	95.5	97.6	70.3	89.3	93.9	53.0	76.0	83.4	86.2
InternVL-C (ours)	94.7	99.6	99.9	81.7	96.0	98.2	70.6	89.0	93.5	54.1	77.3	84.6	86.6
InternVL-G (ours)	95.7	99.7	99.9	85.0	97.0	98.6	74.9	91.3	95.2	58.6	81.3	88.0	88.8
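
R@K in the table above is the percentage of queries whose correct match appears among the top K retrieved candidates, and the avg column is consistent with the plain mean of the twelve recall values in a row. The sketch below illustrates both on toy data. It is a simplification, not the benchmark code: `recall_at_k` is a hypothetical helper, the similarity matrix is made up, and it assumes exactly one correct candidate per query (the diagonal).

```python
import numpy as np

def recall_at_k(similarity, k):
    """Recall@K in percent; similarity[i, j] scores query i against candidate j,
    and the correct candidate for query i is assumed to be j == i."""
    correct = np.diag(similarity)[:, None]
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return float((ranks < k).mean() * 100.0)

# Toy image-to-text similarity matrix for 4 image/caption pairs (hypothetical data).
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.2, 0.8, 0.1],
    [0.1, 0.4, 0.7, 0.2],
    [0.6, 0.1, 0.2, 0.5],
])
i2t = [recall_at_k(sim, k) for k in (1, 2, 3)]    # image-to-text
t2i = [recall_at_k(sim.T, k) for k in (1, 2, 3)]  # text-to-image uses the transpose
print(i2t, t2i)

# The avg column: mean of the 12 recall values in the OpenCLIP-G row of the table.
openclip_g = [92.9, 99.3, 99.8, 79.5, 95.0, 97.1,
              67.3, 86.9, 92.6, 51.4, 74.9, 83.0]
avg = sum(openclip_g) / len(openclip_g)
print(round(avg, 1))
```

Real retrieval benchmarks usually pair each image with several captions, so the ground truth is not a simple diagonal; the sketch ignores that detail.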

Chinese Zero-Shot Image-Text Retrieval [see details]

model	Flickr30K-CN						COCO-CN						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
CN-CLIP-ViT-H	81.6	97.5	98.8	71.2	91.4	95.5	63.0	86.6	92.9	69.2	89.9	96.1	86.1
OpenCLIP-XLM-R-H	86.1	97.5	99.2	71.0	90.5	94.9	70.0	91.5	97.0	66.1	90.8	96.0	87.6
InternVL-C (ours)	90.3	98.8	99.7	75.1	92.9	96.4	68.8	92.0	96.7	68.9	91.9	96.5	89.0
InternVL-G (ours)	92.9	99.4	99.8	77.7	94.8	97.3	71.4	93.9	97.7	73.8	94.4	98.1	90.9

Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

method	EN	ES	FR	ZH	IT	KO	RU	JP	average
AltCLIP	95.4	94.1	92.9	95.1	94.2	94.4	91.8	91.7	93.7
OpenCLIP-XLM-R-H	97.3	96.1	94.5	94.7	96.0	90.2	93.9	94.0	94.6
InternVL-C (ours)	97.3	95.7	95.1	95.6	96.0	92.2	93.3	95.5	95.1
InternVL-G (ours)	98.6	97.7	96.5	96.7	96.9	95.1	94.8	96.1	96.6

Multimodal Dialogue

See "Compared with SOTA VLLMs" section.

Quick Start with HuggingFace

using InternViT-6B for visual feature extraction

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)

using InternVL-C(ontrastive) and InternVL-G(enerative) for cross-modal retrieval

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer


model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
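
The probability matrices shown in the comments of the retrieval example above can be read row by row: row i holds the match probabilities of image i against each of the three texts. A small NumPy sketch of that reading follows; the values are copied from the InternVL-C output comment, and the variable names are only illustrative.

```python
import numpy as np

# Values copied from the InternVL-C output comment in the example above.
probs = np.array([
    [9.9609e-01, 5.2185e-03, 6.0070e-08],
    [2.2949e-02, 9.7656e-01, 5.9903e-06],
    [3.2932e-06, 7.4863e-05, 1.0000e+00],
])
texts = [
    'a photo of a red panda',  # English
    '一张熊猫的照片',            # Chinese
    '二匹の猫の写真',            # Japanese
]

# For each image (row), pick the text (column) with the highest probability.
best = probs.argmax(axis=-1)
for i, j in enumerate(best):
    print(f'image{i + 1}.jpg -> {texts[j]} (p={probs[i, j]:.4f})')
```

Each image is matched to the text at the same index, so the largest entries sit on the diagonal. The rows do not sum to exactly 1 because the original tensor was printed in bfloat16.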

using InternVL-Chat for multimodal chat

Here, we take the smaller OpenGVLab/InternVL2-8B as an example:

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tile grids (columns, rows) that use between min_num and max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to that of the input image
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
# Otherwise, you need to load the model across multiple GPUs; please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
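
The two preprocessing helpers in the example above, tile-grid selection (`find_closest_aspect_ratio` / `dynamic_preprocess`) and frame sampling (`get_index`), can be traced without loading a model. The sketch below re-implements only their arithmetic so the resulting tile counts and frame indices are easy to inspect. `pick_grid`, `num_tiles` and `sample_frame_indices` are hypothetical names introduced here, and the image sizes and the 300-frame clip are made-up inputs.

```python
import numpy as np

def pick_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) tile grid, following the rule used in
    find_closest_aspect_ratio: closest aspect ratio wins, and on an exact tie
    a larger grid is taken only if the image has enough pixels to fill it."""
    grids = sorted(
        {(i, j) for n in range(min_num, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: g[0] * g[1])
    aspect, area = width / height, width * height
    best, best_diff = (1, 1), float('inf')
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            best = (cols, rows)
    return best

def num_tiles(width, height, use_thumbnail=True, **kwargs):
    """Number of image_size x image_size tiles passed to the model for one image."""
    cols, rows = pick_grid(width, height, **kwargs)
    tiles = cols * rows
    # A thumbnail of the whole image is appended only when the image was split.
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles

def sample_frame_indices(num_frames, num_segments):
    """Frame indices chosen by get_index when bound is None: one frame near the
    centre of each of num_segments equal segments of the clip."""
    max_frame = num_frames - 1
    seg_size = float(max_frame) / num_segments
    return np.array([int(seg_size / 2 + np.round(seg_size * idx))
                     for idx in range(num_segments)])

print(pick_grid(1600, 800), num_tiles(1600, 800))  # large 2:1 image
print(pick_grid(400, 200), num_tiles(400, 200))    # small 2:1 image
print(pick_grid(448, 448), num_tiles(448, 448))    # square image, single tile
print(sample_frame_indices(300, 8))
```

With `max_num=1`, as used in `load_video`, every frame maps to a single tile and no thumbnail is added, so an 8-segment clip yields 8 tiles and `num_patches_list` is a list of eight ones.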

License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}

Acknowledgement

InternVL is built with reference to the code of the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their awesome work!

If you want to join our WeChat group, please scan the following QR code to add our assistant as a WeChat friend:

internvl's Issues

How to load internvl-chat-1.2-plus on V100.

I have a machine with 8 V100 GPUs. Is there a way to load InternVL-Chat-Chinese-V1-2-Plus for inference on it? It seems that FlashAttention cannot be installed correctly on V100.

Compare PixelShuffle with Q-Former?

Hi, have you tried the Q-Former architecture? I think pixel shuffle is used to limit the number of vision tokens, and a Q-Former has the same effect.
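
The issue above refers to pixel shuffle as a way to limit the number of vision tokens. A minimal NumPy sketch of that kind of space-to-depth merge is shown here for orientation. The 32x32 token grid, the merge factor of 2 and the name `pixel_unshuffle_tokens` are illustrative assumptions, and the channel ordering may differ from InternVL's actual implementation.

```python
import numpy as np

def pixel_unshuffle_tokens(tokens, grid, factor=2):
    """Merge every factor x factor neighbourhood of tokens into one wider token.

    tokens: array of shape (grid * grid, channels), row-major over the grid.
    Returns an array of shape ((grid // factor) ** 2, channels * factor ** 2).
    """
    n, c = tokens.shape
    assert n == grid * grid and grid % factor == 0
    g = grid // factor
    # Axes: (row block, row within block, column block, column within block, channel)
    x = tokens.reshape(g, factor, g, factor, c)
    # Bring the two block axes to the front, then flatten each block into channels.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(g * g, factor * factor * c)

# Assumed example: a 32x32 grid of 8-dimensional tokens (1024 tokens in total).
vision_tokens = np.zeros((32 * 32, 8), dtype=np.float32)
merged = pixel_unshuffle_tokens(vision_tokens, grid=32, factor=2)
print(vision_tokens.shape, '->', merged.shape)

# Small check on a 4x4 grid: the first merged token stacks tokens 0, 1, 4 and 5.
small = np.arange(16 * 3).reshape(16, 3)
small_merged = pixel_unshuffle_tokens(small, grid=4, factor=2)
print(small_merged[0])
```

The token count drops by a factor of 4 while each token becomes 4 times wider, so no information is discarded at this step. A Q-Former instead compresses to a fixed number of learned query tokens through cross-attention.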

Fine-tuning internvl-chat with LoRA: the loss does not converge and stays around 1, and the final model outputs only spaces at inference

Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModel.

Hello, I encountered an error while testing InternVL-Chat-ViT-6B-Vicuna-13B-448px:
ValueError: Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModel. Model type should be one of AlbertConfig, AlignConfig, AltCLIPConfig, ASTConfig, AutoformerConfig, BarkConfig, BartConfig, BeitConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BitConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BloomConfig, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPVisionConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, ConditionalDetrConfig, ConvBertConfig, ConvNextConfig, ConvNextV2Config, CpmAntConfig, CTRLConfig, CvtConfig, Data2VecAudioConfig, Data2VecTextConfig, Data2VecVisionConfig, DebertaConfig, DebertaV2Config, DecisionTransformerConfig, DeformableDetrConfig, DeiTConfig, DetaConfig, DetrConfig, DinatConfig, Dinov2Config, DistilBertConfig, DonutSwinConfig, DPRConfig, DPTConfig, EfficientFormerConfig, EfficientNetConfig, ElectraConfig, EncodecConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FlaubertConfig, FlavaConfig, FNetConfig, FocalNetConfig, FSMTConfig, FunnelConfig, GitConfig, GLPNConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GraphormerConfig, GroupViTConfig, HubertConfig, IBertConfig, IdeficsConfig, ImageGPTConfig, InformerConfig, JukeboxConfig, Kosmos2Config, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LevitConfig, LiltConfig, LlamaConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MarianConfig, MarkupLMConfig, Mask2FormerConfig, MaskFormerConfig, MaskFormerSwinConfig, MBartConfig, MCTCTConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MistralConfig, MixtralConfig, MobileBertConfig, MobileNetV1Config, MobileNetV2Config, MobileViTConfig, MobileViTV2Config, MPNetConfig, MptConfig, MraConfig, MT5Config, MvpConfig, NatConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OneFormerConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, Owlv2Config, OwlViTConfig, PatchTSMixerConfig, PatchTSTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PersimmonConfig, PhiConfig, PLBartConfig, PoolFormerConfig, ProphetNetConfig, PvtConfig, QDQBertConfig, ReformerConfig, RegNetConfig, RemBertConfig, ResNetConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, SamConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SegformerConfig, SEWConfig, SEWDConfig, Speech2TextConfig, SpeechT5Config, SplinterConfig, SqueezeBertConfig, SwiftFormerConfig, SwinConfig, Swin2SRConfig, Swinv2Config, SwitchTransformersConfig, T5Config, TableTransformerConfig, TapasConfig, TimeSeriesTransformerConfig, TimesformerConfig, TimmBackboneConfig, TrajectoryTransformerConfig, TransfoXLConfig, TvltConfig, TvpConfig, UMT5Config, UniSpeechConfig, UniSpeechSatConfig, UnivNetConfig, VanConfig, VideoMAEConfig, ViltConfig, VisionTextDualEncoderConfig, VisualBertConfig, ViTConfig, ViTHybridConfig, ViTMAEConfig, ViTMSNConfig, VitDetConfig, VitsConfig, VivitConfig, Wav2Vec2Config, Wav2Vec2ConformerConfig, WavLMConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YolosConfig, YosoConfig.

I am conducting the test on 8 V100(16G) GPUs with transformers version 4.36.2. Could you please advise on how to resolve this issue?

How to use OpenGVLab/InternVL Chat ViT-6B-Vicuna-7B model for inference

When I use the llava command line for deployment, an error message occurs.

Can you provide some advice?

Thank you to the OpenGVLab team for the excellent work.

$ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:12000 --port 42000 --worker http://localhost:42000 --model-path /data/jupyter/user/cc/llm_models/InternVL-Chat-ViT-6B-Vicuna-7B/
2024-01-05 14:23:13 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=42000, worker_address='http://localhost:42000', controller_address='http://localhost:12000', model_path='/data/jupyter/user/cc/llm_models/InternVL-Chat-ViT-6B-Vicuna-7B/', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False)
2024-01-05 14:23:13 | INFO | model_worker | Loading the model InternVL-Chat-ViT-6B-Vicuna-7B on worker 921a64 ...
2024-01-05 14:23:13 | ERROR | stderr | Traceback (most recent call last):
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2024-01-05 14:23:13 | ERROR | stderr |     return _run_code(code, main_globals, None,
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/runpy.py", line 87, in _run_code
2024-01-05 14:23:13 | ERROR | stderr |     exec(code, run_globals)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/jupyter/user/cc/InternVL/llava/llava/serve/model_worker.py", line 275, in <module>
2024-01-05 14:23:13 | ERROR | stderr |     worker = ModelWorker(args.controller_address,
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/jupyter/user/cc/InternVL/llava/llava/serve/model_worker.py", line 65, in __init__
2024-01-05 14:23:13 | ERROR | stderr |     self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/jupyter/user/cc/InternVL/llava/llava/model/builder.py", line 102, in load_pretrained_model
2024-01-05 14:23:13 | ERROR | stderr |     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 727, in from_pretrained
2024-01-05 14:23:13 | ERROR | stderr |     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
2024-01-05 14:23:13 | ERROR | stderr |     return cls._from_pretrained(
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2017, in _from_pretrained
2024-01-05 14:23:13 | ERROR | stderr |     tokenizer = cls(*init_inputs, **init_kwargs)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 156, in __init__
2024-01-05 14:23:13 | ERROR | stderr |     self.sp_model = self.get_spm_processor()
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 164, in get_spm_processor
2024-01-05 14:23:13 | ERROR | stderr |     model_pb2 = import_protobuf()
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 40, in import_protobuf
2024-01-05 14:23:13 | ERROR | stderr |     return sentencepiece_model_pb2
2024-01-05 14:23:13 | ERROR | stderr | UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment
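
A commonly reported cause of this `UnboundLocalError` is that the `protobuf` package is not installed in the environment, so `import_protobuf` never binds `sentencepiece_model_pb2`. The sketch below only checks whether the relevant packages can be found; the list of module names is an assumption for illustration, not something taken from the InternVL or LLaVA code.

```python
import importlib.util


def is_importable(name):
    """Return True if the module (possibly dotted) can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


# ASSUMPTION: modules the slow LLaMA tokenizer path is expected to need.
required = ["google.protobuf", "sentencepiece"]

missing = [name for name in required if not is_importable(name)]
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All checked modules are importable.")
```

If `google.protobuf` is reported missing, installing `protobuf` with pip and restarting the worker is a reasonable first thing to try.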

Zero-shot classification results for InternVL-G

Hi, are there any zero-shot classification (ImageNet) evaluation results for InternVL-G?

ZeRO strategy script for Semantic Segmentation Training

Thank you for open-sourcing this brilliant and outstanding project. However, with a model size of 6 billion parameters, it is still infeasible to train on consumer-grade graphics cards without using memory optimization tools like DeepSpeed. Could you provide a semantic segmentation training script based on mmsegmentation and the ZeRO strategy in the future? Looking forward to your reply.

Can I extract image and text features separately with the InternVL-G model?

Hi, can I extract image and text features separately with the InternVL-G model? Reading the code, I found that the cross-attention layers in QLLaMA are shared parameters between the image and text feature branches, but Figure 4 of the paper seems to show some kind of interaction between them, like the Q-Former in BLIP-2. So can I use model.encode_image() or model.encode_text() individually?

InternVL-Chat-V1.2 on mmbench-dev-cn only got acc = 71.73, not acc = 79.5

InternVL-Chat-V1.2 is number one in mmbench-dev-cn with acc = 79.5%

I used the weights from InternVL-Chat-Chinese-V1-2, as listed in the README.

Evaluation code:

torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=127.0.0.1 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --datasets mmbench_dev_cn_20231003

The result is:
total question = 4329
tp = 3105
acc = 0.7173
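
As a quick sanity check on the numbers above, the short sketch below recomputes the accuracy from `tp` and the question count and estimates how many additional correct answers would be needed to reach the listed 79.5%. The variable names are illustrative only.

```python
total_questions = 4329
true_positives = 3105
listed_acc = 0.795  # accuracy quoted for mmbench-dev-cn

# Accuracy as reported by the evaluation script: tp / total.
accuracy = true_positives / total_questions

# Gap to the listed number, in percentage points.
gap_points = (listed_acc - accuracy) * 100

# Additional correct answers needed to match the listed accuracy.
missing_correct = round(listed_acc * total_questions) - true_positives

print(f"acc = {accuracy:.4f}")
print(f"gap = {gap_points:.2f} points")
print(f"additional correct answers needed = {missing_correct}")
```

A gap of several hundred questions is too large to be rounding noise, which suggests a difference in checkpoint, prompt, or evaluation split rather than a scoring detail.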

Loss decreases abnormally when fine-tuning with the LLaVA code

I used the code you provided to train LLaVA based on InternViT-6B. Following the script you provided, the first pretraining stage runs normally. But when training with the fine-tuning script, I see strange loss behavior.

As shown in the picture:

I have not modified any LLaVA code.

I debugged EVA_clip_vit and found the problem: it happens with zero2. After replacing it with zero3, the loss was normal. However, InternViT-6B has the following problems when using zero3:

flash-attn install problems

GCC>=6.0
Without module dropout_layer_norm, see URL: Dao-AILab/flash-attention#587 (comment)

Prompt template for Grounding fine-tuning

For fine-tuning V1.2-plus on the Grounding task, is the prompt template in the following format:
'Please provide the bounding box coordinate of the region this sentence describes: XXX',
(This prompt is taken from the refcoco evaluation script https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/eval/refcoco/evaluate_grounding.py#L248C18-L248C104 )
That is, is the data format as follows:

{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: XXX"
    },
    {
      "from": "gpt",
      "value": "XXX[[253, 231, 733, 787]]"
    }
  ]
}
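
The record format in the question can be generated programmatically, which also guards against malformed JSON. The sketch below simply mirrors the format asked about; whether this is the template V1.2-plus expects is the open question of this issue, and the helper name `make_grounding_record` is hypothetical.

```python
import json

PROMPT = ("Please provide the bounding box coordinate of the region "
          "this sentence describes: ")


def make_grounding_record(sample_id, image_path, phrase, box):
    """Build one conversation record in the format from the question.

    ASSUMPTION: the answer is the phrase followed by [[x1, y1, x2, y2]].
    """
    x1, y1, x2, y2 = box
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{PROMPT}{phrase}"},
            {"from": "gpt", "value": f"{phrase}[[{x1}, {y1}, {x2}, {y2}]]"},
        ],
    }


record = make_grounding_record(0, "images/5.png", "XXX", (253, 231, 733, 787))

# Serializing and parsing again confirms the record is valid JSON.
text = json.dumps(record, ensure_ascii=False, indent=2)
print(text)
```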

During the process of using InternVL-Chat-V1.1, an error occurred

Hello, I encountered an error related to model deployment while using the code you provided for testing InternVL-Chat-V1.1, which seems to be related to model_base and model_name.
Could you please tell me where I can find reference code for evaluating InternVL-Chat-V1.1? Thanks!

InternVL for multiple images

How can I use the InternVL model for multiple images?

I noticed you have reported the result of InternVL for tasks requiring multiple images, such as MMMU.

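torchrun launch script for retrieval fine-tuning

internvl-g provides a retrieval fine-tuning script; is there also a launch script that uses torchrun?

I get an error with the following torchrun launch script. Where do you think the problem is?

Change: init_dist(launcher='pytorch', backend='nccl')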

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=12320

export VIT_LAYER_DECAY_RATE=0.9 ##########################################
export QLLAMA_LAYER_DECAY_RATE=0.9 ##############################################

DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS internvl/train/internvl_stage2_finetune.py \
  --dataset_name 'flickr30k_cn_train' \
  --model_name_or_path "/root/.cache/huggingface/hub/models--OpenGVLab--InternVL-14B-224px/snapshots/6d492b53f8adbbe9d577db3645d893057e3ffc59" \
  --output_dir "./work_dirs/internvl_stage2_finetune_flickrcn_364_bs1024_ep10" \
  --overwrite_output_dir True \
  --force_image_size 364 \
  --drop_path_rate 0.3 \
  --use_custom_trainer \
  --dataloader_num_workers 2 \
  --pad_to_max_length True \
  --bf16 True \
  --num_train_epochs 10 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 100 \
  --save_total_limit 5 \
  --learning_rate 1e-6 \
  --weight_decay 0.05 \
  --warmup_steps 100 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 80 \
  --do_train True \
  --optim adamw_torch \
  --deepspeed "zero_stage1_config_wo_opt.json" \
  --report_to "tensorboard"

The error is as follows:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1492, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1743, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "internvl/train/internvl_stage2_finetune.py", line 311, in <module>
main()
File "internvl/train/internvl_stage2_finetune.py", line 299, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1658, in train
return inner_training_loop(
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2012, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2894, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1958, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1964, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2040, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 204, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 204, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 886, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1399, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 915, in reduce_independent_p_g_buckets_and_remove_grads
self.reduce_ipg_grads()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1350, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index])
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1110, in average_tensor
self.allreduce_and_scatter(buckets[bucket_key],
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1014, in allreduce_and_scatter
self.allreduce_and_copy_with_multiple_ranks(small_bucket,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 987, in allreduce_and_copy_with_multiple_ranks
allreduced = self.allreduce_bucket(small_bucket, log=log, divide=divide, process_group=process_group)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1483, in allreduce_bucket
dist.all_reduce(tensor_to_allreduce, group=process_group)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 496, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 155, in all_reduce
return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1497, in wrapper
"args": f"{args}, {kwargs}",
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 309, in _tensor_str
self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Using the OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B model for inference did not yield satisfactory results

environment:

$ git log
commit feaf3a94ca26404a6f8a0b1472158dfa7f142f12 (HEAD -> main, origin/main, origin/HEAD)
Author: Zhe Chen <[email protected]>
Date:   Fri Jan 5 13:16:32 2024 +0800

    Update InternVL-Chat (#26)

commit d4a2d8325fcefe2f4f3d777715f08fa0569b54fe
Author: Zhe Chen <[email protected]>
Date:   Sat Dec 30 22:19:30 2023 +0800

    Update README.md in LLaVA codebase (#23)
    
    * Update README.md


# Partial pip list
transformers              4.32.0
torch                     2.0.1
torchaudio                2.0.2
torchvision               0.15.2
flash-attn                0.2.8
apex                      0.1
llava                     1.1.1        /data/jupyter/user/cc/InternVL/llava

code change:

# llava/llava/model/multimodal_encoder/builder.py
# add "or "intern" in vision_tower.lower()"
def build_vision_tower(vision_tower_cfg, **kwargs):
    vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None))
    is_absolute_path_exists = os.path.exists(vision_tower)
    if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "intern" in vision_tower.lower():
        return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)

    raise ValueError(f'Unknown vision tower: {vision_tower}')

model config change:

{
  "_name_or_path": "InternVL-Chat-ViT-6B-Vicuna-7B",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  // ...
  // modify mm_vision_tower path to local path
  "mm_vision_tower": "/data/jupyter/user/cc/llm_models/InternViT-6B-224px/",
  // ...
}

launch demo:

# controller
$ python -m llava.serve.controller --host 0.0.0.0 --port 12000

# gradio
$ python -m llava.serve.gradio_web_server --controller http://localhost:12000 --model-list-mode reload --share

2024-01-05 15:36:22 | INFO | gradio_web_server | ==== request ====
{'model': 'InternVL-Chat-ViT-6B-Vicuna-7B', 'prompt': "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nwhat's in the image ASSISTANT:", 'temperature': 0.2, 'top_p': 0.7, 'max_new_tokens': 512, 'stop': '</s>', 'images': "List of 1 images: ['1a8b42c80b9104618b8cdc378828b39a']"}
2024-01-05 15:36:23 | INFO | gradio_web_server | I'm sorry
2024-01-05 15:36:28 | INFO | gradio_web_server | add_text. ip: 192.168.3.181. len: 33
2024-01-05 15:36:29 | INFO | gradio_web_server | http_bot. ip: 192.168.3.181
2024-01-05 15:36:29 | INFO | gradio_web_server | template: llava_v1
2024-01-05 15:36:29 | INFO | gradio_web_server | model_name: InternVL-Chat-ViT-6B-Vicuna-7B, worker_addr: http://localhost:42000
2024-01-05 15:36:29 | INFO | gradio_web_server | ==== request ====
{'model': 'InternVL-Chat-ViT-6B-Vicuna-7B', 'prompt': "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is unusual about this image? ASSISTANT:", 'temperature': 0.2, 'top_p': 0.7, 'max_new_tokens': 512, 'stop': '</s>', 'images': "List of 1 images: ['2886bdd7ba800e89b682ce14e66cf054']"}
2024-01-05 15:36:32 | INFO | gradio_web_server | The image is of a person with a black shirt.
2024-01-05 15:37:25 | INFO | stdout | 
2024-01-05 15:37:25 | INFO | stdout | Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.
2024-01-05 15:37:25 | INFO | stdout | 
2024-01-05 15:37:25 | INFO | stdout | Also please ensure that your antivirus or firewall is not blocking the binary file located at: /data/miniconda3/envs/internvl/lib/python3.9/site-packages/gradio/frpc_linux_amd64_v0.2


# model worker
$ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:12000 --port 42000 --worker http://localhost:42000 --model-path /data/jupyter/user/cc/llm_models/InternVL-Chat-ViT-6B-Vicuna-7B/

Could you please let me know if my modification is correct? Could you offer some advice? Thank you.

Will you provide code to support multi-round chat? (In README.md)

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

path = "OpenGVLab/InternVL-Chat-Chinese-V1-1"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path)
image = Image.open('./examples/image2.jpg').convert('RGB')
image = image.resize((448, 448))
image_processor = CLIPImageProcessor.from_pretrained(path)

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

question = "请详细描述图片"
response = model.chat(tokenizer, pixel_values, question, generation_config)

This only supports single-turn inference.

Code for video-text retrieval evaluation on MSR-VTT

Thank you for your great work! Could you please share the code for video-text retrieval evaluation on MSR-VTT dataset?

Are the weights of InternViT-6B-448px the same as those of the visual encoder in InternVL-Chat-V1.1?

How to use InternVL with the LangChain framework

Hello, I would like to use InternVL within the LangChain framework for conversation with memory.

import torch
from transformers import AutoModel, AutoTokenizer, pipeline

path = "OpenGVLab/InternVL-Chat-Chinese-V1-1"

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    cache_dir="/project/ASD/jingyou_llm/model_cache",
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path)

pipe = pipeline(
    "visual-question-answering",
    model=model,
    tokenizer=tokenizer,
    max_length=100,
)

This raises the error: The model 'InternVLChatModel' is not supported for visual-question-answering. Supported models are ['Blip2ForConditionalGeneration', 'ViltForQuestionAnswering'].
However, neither of the two listed types exists.
How should I use it?

What does the pretraining dataset consist of?

I especially wonder how the OCR data was prepared.

How to fine-tune InternVL-Chat-V1.2-Plus

How to set up InternVL-Chat as an API

First of all, thank you very much for your work, which is very constructive.

I would like to ask how to deploy this model and turn it into an API for others to call. Can you provide the corresponding code? Thank you again.

Which model is the 336-resolution model mentioned in your paper

I would like to obtain this model. Thank you.

Can this model output images? I mainly want to fine-tune it for object detection

How to run inference with InternVL-14B-224px on multiple GPUs?

My GPU is an A30 (24G), so the model cannot be loaded on a single GPU. I tried to load the Hugging Face model with device_map='auto', and it reports this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/work/.conda/envs/xxx/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/home/work/.conda/envs/xxx/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3606, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/home/work/.conda/envs/xxx/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1690, in _get_no_split_modules
    raise ValueError(
ValueError: InternVisionModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

Here is my code:

import torch
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()

Inference test of InternVL-C on two 3090 GPUs: loading the InternVL-14B-224px model with device_map='auto' fails at runtime with: text_embeds = text_embeds[torch.arange(text_embeds.shape[0]), attention_mask.sum(1) - 1] RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)

Does it support Chinese image-text conversation?

Convert to Gguf format to work with Llama.cpp?

Llava has various quantized models in gguf format, so it can be used with Llama.cpp.
ggerganov/llama.cpp#3436
Is this possible?

Benchmark scores are high, so why do some examples perform worse than Qwen-VL?

Compare with InternImage on semantic segmentation task?

https://github.com/OpenGVLab/InternImage

MLP Training

Based on https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets, is the MLP between the LLM and the ViT randomly initialized?

[Suggestion] Add results to the open_clip benchmark

Thank you for your amazing and effective work; the experimental results are outstanding and exciting.

Do you have any plans to add the model's results to the open_clip benchmark?

cls benchmark: https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv
retrieval benchmark: https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_retrieval_results.csv

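Multi-round conversation

Does the current internvl-chat-chinese-v1-2-plus not support multi-round conversation? The conversation history cannot be passed into the chat function.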

Flickr30K-CN image-text retrieval metrics: my clip_benchmark results differ somewhat from the paper?

Chinese image-text retrieval metrics for the weights released at https://huggingface.co/OpenGVLab/InternVL-14B-224px/tree/main:
{
  "dataset": "flickr30k",
  "model": "internvl_c_retrieval_hf",
  "pretrained": "./huggingface/hub/models--OpenGVLab--InternVL-14B-224px/snapshots/6d492b53f8adbbe9d577db3645d893057e3ffc59/",
  "task": "zeroshot_retrieval",
  "metrics": {
    "image_retrieval_recall@1": 0.7148000001907349,
    "text_retrieval_recall@1": 0.8970000147819519,
    "image_retrieval_recall@5": 0.9179999828338623,
    "text_retrieval_recall@5": 0.9909999966621399,
    "image_retrieval_recall@10": 0.9559999704360962,
    "text_retrieval_recall@10": 0.996999979019165
  },
  "language": "cn"
}
Metrics in the paper: InternVL-C (ours) ✓ 90.3 98.8 99.7 75.1 92.9 96.4
There seems to be some difference. Is there a problem with how I ran the test?
I used the script below:
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" --task "zeroshot_retrieval" \
  --dataset "flickr30k" --dataset_root ./data/flickr30k --model internvl_c_retrieval_hf \
  --pretrained ./huggingface/hub/models--OpenGVLab--InternVL-14B-224px/snapshots/6d492b53f8adbbe9d577db3645d893057e3ffc59/ --output result_ft.json
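
To make the discrepancy concrete, the sketch below compares the measured recalls with the paper row, metric by metric. It assumes the six paper numbers are ordered as text retrieval R@1/R@5/R@10 followed by image retrieval R@1/R@5/R@10; that ordering is inferred from how closely the values line up and is not stated in the issue.

```python
# Measured recalls copied from the JSON result (fractions in [0, 1]).
measured = {
    "text_retrieval_recall@1": 0.8970000147819519,
    "text_retrieval_recall@5": 0.9909999966621399,
    "text_retrieval_recall@10": 0.996999979019165,
    "image_retrieval_recall@1": 0.7148000001907349,
    "image_retrieval_recall@5": 0.9179999828338623,
    "image_retrieval_recall@10": 0.9559999704360962,
}

# Paper row in percent. ASSUMPTION: text R@1/5/10 first, then image R@1/5/10.
paper = {
    "text_retrieval_recall@1": 90.3,
    "text_retrieval_recall@5": 98.8,
    "text_retrieval_recall@10": 99.7,
    "image_retrieval_recall@1": 75.1,
    "image_retrieval_recall@5": 92.9,
    "image_retrieval_recall@10": 96.4,
}


def recall_gaps(measured, paper):
    """Measured minus paper, in percentage points, per metric."""
    return {name: round(measured[name] * 100 - paper[name], 2) for name in paper}


gaps = recall_gaps(measured, paper)
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:+.2f}")

largest_gap_metric = min(gaps, key=gaps.get)
```

Under that assumed ordering, text retrieval is within about one point of the paper, while image retrieval R@1 shows the largest shortfall.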

A model similar to Coca but with larger vision backbone and language backbone?

Is my understanding correct?

ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

I installed strictly according to installation.md, but still got this error.

My env: Ubuntu 20.04, CUDA 11.8, torch 2.1.1, Python 3.8, 2x 2080 Ti

Detail:
FlashAttention is not installed.
Discovered apex.normalization.FusedRMSNorm - will use it instead of LlamaRMSNorm
FlashAttention is not installed.
Warning: Flash Attention is not available, use_flash_attn is set to False.
Traceback (most recent call last):
File "zz_demo.py", line 244, in <module>
model = AutoModel.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3462, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_internvl.py", line 188, in __init__
self.vision_model = InternVisionModel(config.vision_config) # frozen
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 288, in __init__
self.encoder = InternVisionEncoder(config)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 228, in __init__
self.layers = nn.ModuleList([
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 229, in <listcomp>
InternVisionEncoderLayer(config, dpr[idx]) for idx in range(config.num_hidden_layers)])
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 188, in __init__
self.attn = InternAttention(config)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 119, in __init__
self.q_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
File "/usr/local/lib/python3.8/dist-packages/apex/normalization/fused_layer_norm.py", line 393, in __init__
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

Bare minimum code for InternVL-Chat?

Hi,

do you have any example available to run InternVL-Chat without gradio? It would be very helpful if you could share some bare minimum python script to load the model from HF and interact.

Thanks!

How much gpu memory does the InternVL-Chat-ViT-6B-Vicuna-13B model take up when running?

How much gpu memory does the InternVL-Chat-ViT-6B-Vicuna-13B model take up when running? And what is the approximate running speed if I use A100?

InternVL-G image-text matching issue

###============== Image-Text Contrastive ===================###
image_itc = self.clip_projector2(image_itc)

selected = summarize_attention_mask.sum(1) - 1
text_itc = text_itc[torch.arange(text_itc.shape[0]), selected]
text_itc = text_itc @ self.text_projection

Hello, is there a problem here, or am I misunderstanding? The paper says the feature of the EOS token is taken, so shouldn't this be text_itc = text_itc[torch.arange(text_itc.shape[0]), -1], since the llama tokenizer uses left padding? Thank you.
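
The question comes down to where the last real token sits under each padding scheme. The sketch below uses plain Python lists in place of tensors to show that `attention_mask.sum(1) - 1` points at the last real token only when padding is on the right, while index `-1` does so only when padding is on the left. It does not assert which padding side the InternVL-G code actually uses; the helper name is hypothetical.

```python
def last_token_index(attention_mask_row, padding_side):
    """Index of the last real (non-padding) token in one mask row."""
    if padding_side == "right":
        # Real tokens come first, so count them: sum(mask) - 1.
        return sum(attention_mask_row) - 1
    if padding_side == "left":
        # Real tokens come last, so the final position is the answer.
        return len(attention_mask_row) - 1
    raise ValueError(f"unknown padding side: {padding_side}")


right_padded = [1, 1, 1, 0, 0]  # three real tokens, then padding
left_padded = [0, 0, 1, 1, 1]   # padding, then three real tokens

print(last_token_index(right_padded, "right"))  # position of the last real token
print(last_token_index(left_padded, "left"))

# Applying the right-padding formula to a left-padded row picks the wrong slot:
wrong_index = sum(left_padded) - 1
print(wrong_index, "is the first real token here, not the last")
```

So the formula in the snippet is correct if the text branch pads on the right, and the `-1` variant is correct if it pads on the left; checking the tokenizer's `padding_side` setting in the model code would settle which applies.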

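Using two 3090 GPUs with InternVL-14B-224px loaded in a distributed fashion for inference, only the prob value for image3 in the InternVL-C predictions differs from the provided reference value. Why is that?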

InternVL-C probs: tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
[2.2949e-02, 9.7656e-01, 5.9903e-06],
[2.9057e-06, 6.6280e-05, 1.0000e+00]], device='cuda:0',
dtype=torch.bfloat16, grad_fn=)

Any further plans on knowledge distillation?

ViT-22B conducted knowledge distillation experiments (refer to Table 8), demonstrating that it is not only a large-scale model but also an excellent teacher. Has there been any consideration or experiments conducted on whether Intern-VL can be distilled into smaller models, given that it serves as the largest open-source vision/vision-language foundation model to date (and a good alternative to the ViT-22B)? Thank you in advance for your attention.

Very trivial progress compared with LLaVA-1.5

Most results are boosted by only 0.x points, and TextVQA is far behind LLaVA-1.5. Why?
How is the performance on Chinese QA and Chinese character identification?

Continue fine-tuning InternVL-Chat-V1.2

Is it possible to fine-tune the trained InternVL-Chat-V1.2 model (https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2/tree/main) on a small dataset of about 20k, or do I have to train from scratch (using InternViT-6B-448px-V1-2 and the Nous-Hermes-2-Yi-34B model) and then fine-tune it on my small dataset?

Image Missing

File address: https://github.com/OpenGVLab/InternVL/blob/main/BLOG.md#internvl-chat-v12

Image link: https://github.com/czczup/InternVL-MoE/assets/23737120/9b68aa35-40fd-4e81-9595-d404cbbfc6bd

Is there a script for fine-tuning internvl_g on captioning?

Dataset release?

I do object detection; which model is more suitable?

Eval results on video benchmarks

Hi there,

Congratulations on your impressive work! I have a question regarding the evaluation results on video benchmarks. I performed zero-shot evaluations on Kinetics-400/600/700 using the middle frame, following vissl. However, I noticed differences between my results and those reported in the paper, particularly the average accuracy of 71.0, 71.3, and 65.7 for Kinetics-400/600/700.

When release high-resolution models?

Thank you for your great work! We would like to try out the high-resolution (336+ px) models. When will they be released?