
opengvlab / internvl


[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.

Home Page: https://internvl.readthedocs.io/en/latest/

License: MIT License

Python 48.41% Shell 3.63% Makefile 0.05% Jupyter Notebook 47.52% HTML 0.15% JavaScript 0.20% CSS 0.04%
image-classification image-text-retrieval llm semantic-segmentation video-classification vision-language-model vit-22b vit-6b multi-modal gpt

internvl's Introduction

News 🚀🚀🚀

  • 2024/08/01: The Chartmimic team evaluated the InternVL2 series models on their benchmark. The InternVL2-26B and 76B models achieved the top two performances among open-source models, with the InternVL2 76B model surpassing GeminiProVision and exhibiting comparable results to Claude-3-opus.
  • 2024/08/01: InternVL2-Pro achieved the SOTA performance among open-source models on the CharXiv dataset, surpassing some well-known closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet.
  • 2024/07/24: The MLVU team evaluated InternVL-1.5 on their benchmark. The average performance on the multiple-choice task was 50.4%, while the performance on the generative tasks was 4.02. The performance on the multiple-choice task ranked #1 among all open-source MLLMs.
  • 2024/07/18: 🔥🔥 InternVL2-40B achieved SOTA performance among open-source models on the Video-MME dataset, scoring 61.2 when inputting 16 frames and 64.4 when inputting 32 frames. It significantly outperforms other open-source models and is the closest open-source model to GPT-4o mini.
  • 2024/07/18: 🔥 InternVL2-Pro achieved the SOTA performance on the DocVQA and InfoVQA benchmarks.
  • 2024/07/04: 🚀 We release the InternVL2 series. InternVL2-Pro achieved a 62.0% accuracy on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o. Free API access to this model can be requested by filling out the (application form) / (Chinese application form). Other models are available at HF link.
  • 2024/06/19: We propose Needle In A Multimodal Haystack (MM-NIAH), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
  • 2024/05/30: We release ShareGPT-4o, a large-scale dataset that we plan to open-source, with 200K images, 10K videos, and 10K audio clips, all with detailed descriptions.
  • 2024/05/29: We release the Mini-InternVL series, which includes two chat models: Mini-InternVL-Chat-2B-V1-5 and Mini-InternVL-Chat-4B-V1-5. These models achieve impressive performance with minimal size: the 2B model delivers 80% of the performance with only 8% of the model size, and the 4B model achieves 90% of the performance with just 16% of the model size. For more details, please check our blog.
  • 2024/05/28: Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ.
  • 2024/05/13: InternVL 1.0 can now be used as the text encoder for diffusion models to support multilingual generation natively in over 110 languages worldwide. See MuLan for more details.
  • 2024/04/18: InternVL-Chat-V1-5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
  • 2024/02/27: InternVL is accepted by CVPR 2024 (Oral)! 🎉
  • 2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
  • 2024/02/21: InternVL-Chat-V1-2-Plus achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
  • 2024/02/12: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog and SFT data. The model is now available on HuggingFace, and both training / evaluation data and scripts are open-sourced.
  • 2024/01/24: InternVL-Chat-V1-1 is released. It supports Chinese and has stronger OCR capability; see here.
  • 2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale detection and segmentation models.

TODO List

  • Support vLLM and Ollama
  • Rebuild documents using readthedocs
  • Support fine-tuning different LLMs with LoRA
  • Support video and PDF input in online demo
  • Release InternVL2 with VisionLLMv2 integration
  • Release requirements.txt for InternVL2
  • Release training / evaluation code for InternVL2 series
  • Release Streamlit web UI for InternVL1.5 and InternVL2

Documents

Compared with SOTA VLLMs

[Figure: waic_performance]

Model Zoo

Multimodal Large Language Model (InternVL 2.0)

Model Name Vision Part Language Part HF Link MS Link Document
InternVL2‑1B InternViT‑300M‑448px Qwen2‑0.5B‑Instruct 🤗 link 🤖 link 📖 doc
InternVL2‑2B InternViT‑300M‑448px internlm2‑chat‑1‑8b 🤗 link 🤖 link 📖 doc
InternVL2‑4B InternViT‑300M‑448px Phi‑3‑mini‑128k‑instruct 🤗 link 🤖 link 📖 doc
InternVL2‑8B InternViT‑300M‑448px internlm2_5‑7b‑chat 🤗 link 🤖 link 📖 doc
InternVL2‑26B InternViT‑6B‑448px‑V1‑5 internlm2‑chat‑20b 🤗 link 🤖 link 📖 doc
InternVL2‑40B InternViT‑6B‑448px‑V1‑5 Nous‑Hermes‑2‑Yi‑34B 🤗 link 🤖 link 📖 doc
InternVL2-Llama3-76B InternViT‑6B‑448px‑V1‑5 Hermes‑2‑Theta‑Llama‑3‑70B 🤗 link 🤖 link 📖 doc

InternVL2-Pro API

We welcome everyone to use our API for research. For better management, please submit the (application form) / (Chinese application form) to obtain free API access.

Multimodal Large Language Model (InternVL 1.0-1.5)

Model Date HF Link MS Link Note
Mini‑InternVL‑Chat‑4B‑V1‑5 2024.05.28 🤗 link 🤖 link 🚀🚀 16% of the model size, 90% of the performance
Mini‑InternVL‑Chat‑2B‑V1‑5 2024.05.19 🤗 link 🤖 link 🚀 8% of the model size, 80% of the performance
InternVL‑Chat‑V1‑5 2024.04.18 🤗 link 🤖 link support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
InternVL‑Chat‑V1‑2‑Plus 2024.02.21 🤗 link 🤖 link more SFT data and stronger
InternVL‑Chat‑V1‑2 2024.02.11 🤗 link 🤖 link scaling up LLM to 34B
InternVL‑Chat‑V1‑1 2024.01.24 🤗 link 🤖 link support Chinese and stronger OCR
InternVL‑Chat‑19B 2023.12.25 🤗 link 🤖 link English multimodal dialogue
InternVL‑Chat‑13B 2023.12.25 🤗 link 🤖 link English multimodal dialogue

Vision Foundation Model (InternVL 1.0-1.5)

Model Date HF Link MS Link Note
InternViT‑300M‑448px 2024.05.25 🤗 link 🤖 link distilled small vision foundation model with 300M parameters (🔥new)
InternViT‑6B‑448px‑V1‑5 2024.04.20 🤗 link 🤖 link support dynamic resolution and super strong OCR feature extraction capability by incremental pre-training (🔥new)
InternViT‑6B‑448px‑V1‑2 2024.02.11 🤗 link 🤖 link support 448 resolution by incremental pre-training
InternViT‑6B‑448px‑V1‑0 2024.01.30 🤗 link 🤖 link support 448 resolution by incremental pre-training
InternViT‑6B‑224px 2023.12.22 🤗 link 🤖 link the first version of InternViT-6B, extracted from InternVL‑14B‑224px

Vision-Language Foundation Model (InternVL 1.0)

Model Date HF Link MS Link Note
InternVL‑14B‑224px 2023.12.22 🤗 link 🤖 link vision-language foundation model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP

What can InternVL do?

Visual Perception
  • Linear-Probe Image Classification [see details]

    ViT-22B uses the private JFT-3B dataset.

    method #param IN-1K IN-ReaL IN-V2 IN-A IN-R IN-Sketch
    OpenCLIP-G 1.8B 86.2 89.4 77.2 63.8 87.8 66.4
    DINOv2-g 1.1B 86.5 89.6 78.4 75.9 78.8 62.5
    EVA-01-CLIP-g 1.1B 86.5 89.3 77.4 70.5 87.7 63.1
    MAWS-ViT-6.5B 6.5B 87.8 - - - - -
    ViT-22B* 21.7B 89.5 90.9 83.2 83.8 87.4 -
    InternViT-6B (ours) 5.9B 88.2 90.4 79.9 77.5 89.8 69.1
  • Semantic Segmentation [see details]

    method decoder #param (train/total) crop size mIoU
    OpenCLIP-G (frozen) Linear 0.3M / 1.8B 512 39.3
    ViT-22B (frozen) Linear 0.9M / 21.7B 504 34.6
    InternViT-6B (frozen) Linear 0.5M / 5.9B 504 47.2 (+12.6)
    ViT-22B (frozen) UperNet 0.8B / 22.5B 504 52.7
    InternViT-6B (frozen) UperNet 0.4B / 6.3B 504 54.9 (+2.2)
    ViT-22B UperNet 22.5B / 22.5B 504 55.3
    InternViT-6B UperNet 6.3B / 6.3B 504 58.9 (+3.6)
  • Zero-Shot Image Classification [see details]

    method IN-1K IN-A IN-R IN-V2 IN-Sketch ObjectNet
    OpenCLIP-G 80.1 69.3 92.1 73.6 68.9 73.0
    EVA-02-CLIP-E+ 82.0 82.1 94.5 75.7 71.6 79.6
    ViT-22B* 85.9 90.1 96.0 80.9 - 87.6
    InternVL-C (ours) 83.2 83.8 95.5 77.3 73.9 80.6
  • Multilingual Zero-Shot Image Classification [see details]

    EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

    method IN-1K (EN) IN-1K (ZH) IN-1K (JP) IN-1K (AR) IN-1K (IT)
    Taiyi-CLIP-ViT-H - 54.4 - - -
    WuKong-ViT-L-G - 57.5 - - -
    CN-CLIP-ViT-H - 59.6 - - -
    AltCLIP-ViT-L 74.5 59.6 - - -
    EVA-02-CLIP-E+ 82.0 - - - 41.2
    OpenCLIP-XLM-R-H 77.0 55.7 53.1 37.0 56.8
    InternVL-C (ours) 83.2 64.5 61.5 44.9 65.7
  • Zero-Shot Video Classification

    method #frame K400 K600 K700
    OpenCLIP-G 1 65.9 66.1 59.2
    EVA-02-CLIP-E+ 1 69.8 69.3 63.4
    InternVL-C (ours) 1 71.0 71.3 65.7
    ViCLIP 8 75.7 73.5 66.4
    InternVL-C (ours) 8 79.4 78.8 71.5
Cross-Modal Retrieval
  • English Zero-Shot Image-Text Retrieval [see details]

    Columns: model, then R@1 / R@5 / R@10 for Flickr30K image-to-text, Flickr30K text-to-image, COCO image-to-text, and COCO text-to-image (in that order), then avg.
    OpenCLIP-G 92.9 99.3 99.8 79.5 95.0 97.1 67.3 86.9 92.6 51.4 74.9 83.0 85.0
    EVA-02-CLIP-E+ 93.9 99.4 99.8 78.8 94.2 96.8 68.8 87.8 92.8 51.1 75.0 82.7 85.1
    EVA-CLIP-8B 95.6 99.6 99.9 80.8 95.5 97.6 70.3 89.3 93.9 53.0 76.0 83.4 86.2
    InternVL-C (ours) 94.7 99.6 99.9 81.7 96.0 98.2 70.6 89.0 93.5 54.1 77.3 84.6 86.6
    InternVL-G (ours) 95.7 99.7 99.9 85.0 97.0 98.6 74.9 91.3 95.2 58.6 81.3 88.0 88.8
  • Chinese Zero-Shot Image-Text Retrieval [see details]

    Columns: model, then R@1 / R@5 / R@10 for Flickr30K-CN image-to-text, Flickr30K-CN text-to-image, COCO-CN image-to-text, and COCO-CN text-to-image (in that order), then avg.
    CN-CLIP-ViT-H 81.6 97.5 98.8 71.2 91.4 95.5 63.0 86.6 92.9 69.2 89.9 96.1 86.1
    OpenCLIP-XLM-R-H 86.1 97.5 99.2 71.0 90.5 94.9 70.0 91.5 97.0 66.1 90.8 96.0 87.6
    InternVL-C (ours) 90.3 98.8 99.7 75.1 92.9 96.4 68.8 92.0 96.7 68.9 91.9 96.5 89.0
    InternVL-G (ours) 92.9 99.4 99.8 77.7 94.8 97.3 71.4 93.9 97.7 73.8 94.4 98.1 90.9
  • Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

    method EN ES FR ZH IT KO RU JP average
    AltCLIP 95.4 94.1 92.9 95.1 94.2 94.4 91.8 91.7 93.7
    OpenCLIP-XLM-R-H 97.3 96.1 94.5 94.7 96.0 90.2 93.9 94.0 94.6
    InternVL-C (ours) 97.3 95.7 95.1 95.6 96.0 92.2 93.3 95.5 95.1
    InternVL-G (ours) 98.6 97.7 96.5 96.7 96.9 95.1 94.8 96.1 96.6
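
The R@K columns in the retrieval tables above report recall at K: the share of queries whose correct match appears among the top K retrieved items. The sketch below is a minimal illustration and not the project's evaluation code. It assumes exactly one correct item per query (benchmarks such as COCO have several captions per image), and the toy rankings are hypothetical. It also checks that the avg column of the English table is consistent with a plain mean of the twelve recall values, using the InternVL-C row.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]], gold_ids: Sequence[int], k: int) -> float:
    """Percentage of queries whose gold item appears among the top-k ranked candidates."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return 100.0 * hits / len(gold_ids)


# Hypothetical toy rankings: three queries, candidates sorted best-first.
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
gold = [0, 1, 2]
r1 = recall_at_k(ranked, gold, k=1)  # only the second query ranks its gold item first
r2 = recall_at_k(ranked, gold, k=2)  # the first query's gold item enters the top 2

# InternVL-C row of the English table, in column order:
# Flickr30K image-to-text, Flickr30K text-to-image, COCO image-to-text, COCO text-to-image.
internvl_c = [94.7, 99.6, 99.9, 81.7, 96.0, 98.2, 70.6, 89.0, 93.5, 54.1, 77.3, 84.6]
avg = round(sum(internvl_c) / len(internvl_c), 1)
print(r1, r2, avg)
```

The computed mean matches the 86.6 reported in the avg column for InternVL-C.
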
Multimodal Dialogue

See "Compared with SOTA VLLMs" section.

Quick Start with HuggingFace

using InternViT-6B for visual feature extraction
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)
using InternVL-C(ontrastive) and InternVL-G(enerative) for cross-modal retrieval
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer


model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
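
The retrieval snippet above applies `softmax(dim=-1)` to `logits_per_image`, so each row becomes one image's probability distribution over the candidate texts. The following pure-Python sketch shows that step and how a best match is read off. The logits are hypothetical placeholders, not model output, and treating the text-to-image view as the transpose of the image-to-text logits is an assumption borrowed from CLIP-style models.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical image-to-text logits (rows: images, columns: texts).
logits_per_image = [
    [12.0, 6.5, -4.0],
    [7.0, 11.0, -1.0],
    [-3.0, 0.5, 13.0],
]

# Row-wise softmax: each image gets a distribution over the texts.
probs = [softmax(row) for row in logits_per_image]
best_text = [max(range(len(p)), key=p.__getitem__) for p in probs]

# Assumed CLIP-style symmetry: text-to-image logits are the transpose.
logits_per_text = [list(col) for col in zip(*logits_per_image)]
best_image = [max(range(len(r)), key=r.__getitem__) for r in map(softmax, logits_per_text)]
print(best_text, best_image)
```

A dominant diagonal, as in the probability matrices printed in the snippet above, means image k is matched to text k.
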
using InternVL-Chat for multimodal chat

Here, we take the smaller OpenGVLab/InternVL2-8B as an example:

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (cols, rows) whose tile count lies in [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
# Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
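
To make the preprocessing in the chat example easier to follow, here is a dependency-free sketch of two decisions it makes: which tile grid `dynamic_preprocess` picks for an image, and which frames `get_index` samples when no time bound is given. The names `choose_tile_grid` and `sample_frame_indices` are hypothetical and introduced only for illustration. The logic is intended to mirror the code above, with two simplifications stated as assumptions: grids are sorted with an explicit tie-breaker so the result is deterministic, and Python's `round` stands in for `np.round`.

```python
from typing import List, Tuple


def choose_tile_grid(width: int, height: int, min_num: int = 1,
                     max_num: int = 12, image_size: int = 448) -> Tuple[int, int]:
    """Return the (cols, rows) grid whose aspect ratio is closest to the image's."""
    aspect_ratio = width / height
    # All grids whose tile count lies in [min_num, max_num], smallest grids first.
    grids = sorted(
        {(i, j) for i in range(1, max_num + 1) for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: (g[0] * g[1], g))  # the second key is an added tie-breaker
    best, best_diff = (1, 1), float('inf')
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On a tie, take the larger grid if the image has enough pixels for it.
            best = (cols, rows)
    return best


def sample_frame_indices(num_frames: int, num_segments: int = 8) -> List[int]:
    """Pick one frame near the middle of each of num_segments equal segments."""
    seg_size = (num_frames - 1) / num_segments
    return [int(seg_size / 2 + round(seg_size * idx)) for idx in range(num_segments)]


grid_photo = choose_tile_grid(800, 600)    # 4:3 image, exact match with a 4x3 grid
grid_square = choose_tile_grid(448, 448)   # small square image stays a single tile
grid_hd = choose_tile_grid(1920, 1080)     # 2x1 and 4x2 tie; the image is large enough for 4x2
frames = sample_frame_indices(81, num_segments=8)
print(grid_photo, grid_square, grid_hd, frames)
```

With `use_thumbnail=True`, as in `load_image`, one extra thumbnail tile is appended whenever the grid has more than one tile, so an 800x600 image yields 13 tiles in total.
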

License

This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}

@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}

Acknowledgement

InternVL is built with reference to the code of the following projects: OpenAI CLIP, Open CLIP, CLIP Benchmark, EVA, InternImage, ViT-Adapter, MMSegmentation, Transformers, DINOv2, BLIP-2, Qwen-VL, and LLaVA-1.5. Thanks for their awesome work!


If you want to join our WeChat group, please scan the following QR code to add our assistant as a WeChat friend:

[Image: WeChat QR code]

internvl's People

Contributors

adushar, czczup, dlutwy, erfeicui, hjh0119, lvhan028, opengvlab-admin, qishisuren123, weiyun1025, whai362


internvl's Issues

Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModel.

Hello, I encountered an error while testing InternVL-Chat-ViT-6B-Vicuna-13B-448px:
ValueError: Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModel. Model type should be one of AlbertConfig, AlignConfig, AltCLIPConfig, ASTConfig, AutoformerConfig, BarkConfig, BartConfig, BeitConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BitConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BloomConfig, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPVisionConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, ConditionalDetrConfig, ConvBertConfig, ConvNextConfig, ConvNextV2Config, CpmAntConfig, CTRLConfig, CvtConfig, Data2VecAudioConfig, Data2VecTextConfig, Data2VecVisionConfig, DebertaConfig, DebertaV2Config, DecisionTransformerConfig, DeformableDetrConfig, DeiTConfig, DetaConfig, DetrConfig, DinatConfig, Dinov2Config, DistilBertConfig, DonutSwinConfig, DPRConfig, DPTConfig, EfficientFormerConfig, EfficientNetConfig, ElectraConfig, EncodecConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FlaubertConfig, FlavaConfig, FNetConfig, FocalNetConfig, FSMTConfig, FunnelConfig, GitConfig, GLPNConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GraphormerConfig, GroupViTConfig, HubertConfig, IBertConfig, IdeficsConfig, ImageGPTConfig, InformerConfig, JukeboxConfig, Kosmos2Config, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LevitConfig, LiltConfig, LlamaConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MarianConfig, MarkupLMConfig, Mask2FormerConfig, MaskFormerConfig, MaskFormerSwinConfig, MBartConfig, MCTCTConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MistralConfig, MixtralConfig, MobileBertConfig, MobileNetV1Config, MobileNetV2Config, MobileViTConfig, MobileViTV2Config, MPNetConfig, MptConfig, MraConfig,
MT5Config, MvpConfig, NatConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OneFormerConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, Owlv2Config, OwlViTConfig, PatchTSMixerConfig, PatchTSTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PersimmonConfig, PhiConfig, PLBartConfig, PoolFormerConfig, ProphetNetConfig, PvtConfig, QDQBertConfig, ReformerConfig, RegNetConfig, RemBertConfig, ResNetConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, SamConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SegformerConfig, SEWConfig, SEWDConfig, Speech2TextConfig, SpeechT5Config, SplinterConfig, SqueezeBertConfig, SwiftFormerConfig, SwinConfig, Swin2SRConfig, Swinv2Config, SwitchTransformersConfig, T5Config, TableTransformerConfig, TapasConfig, TimeSeriesTransformerConfig, TimesformerConfig, TimmBackboneConfig, TrajectoryTransformerConfig, TransfoXLConfig, TvltConfig, TvpConfig, UMT5Config, UniSpeechConfig, UniSpeechSatConfig, UnivNetConfig, VanConfig, VideoMAEConfig, ViltConfig, VisionTextDualEncoderConfig, VisualBertConfig, ViTConfig, ViTHybridConfig, ViTMAEConfig, ViTMSNConfig, VitDetConfig, VitsConfig, VivitConfig, Wav2Vec2Config, Wav2Vec2ConformerConfig, WavLMConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YolosConfig, YosoConfig.

I am conducting the test on 8 V100(16G) GPUs with transformers version 4.36.2. Could you please advise on how to resolve this issue?

How to use OpenGVLab/InternVL Chat ViT-6B-Vicuna-7B model for inference

When I use the llava command line for deployment, an error message occurs.

Can you provide some advice?

Thank you to the OpenGVLab team for the excellent work.

$ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:12000 --port 42000 --worker http://localhost:42000 --model-path /data/jupyter/user/cc/llm_models/InternVL-Chat-ViT-6B-Vicuna-7B/
2024-01-05 14:23:13 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=42000, worker_address='http://localhost:42000', controller_address='http://localhost:12000', model_path='/data/jupyter/user/cc/llm_models/InternVL-Chat-ViT-6B-Vicuna-7B/', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False)
2024-01-05 14:23:13 | INFO | model_worker | Loading the model InternVL-Chat-ViT-6B-Vicuna-7B on worker 921a64 ...
2024-01-05 14:23:13 | ERROR | stderr | Traceback (most recent call last):
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2024-01-05 14:23:13 | ERROR | stderr |     return _run_code(code, main_globals, None,
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/runpy.py", line 87, in _run_code
2024-01-05 14:23:13 | ERROR | stderr |     exec(code, run_globals)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/jupyter/user/cc/InternVL/llava/llava/serve/model_worker.py", line 275, in <module>
2024-01-05 14:23:13 | ERROR | stderr |     worker = ModelWorker(args.controller_address,
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/jupyter/user/cc/InternVL/llava/llava/serve/model_worker.py", line 65, in __init__
2024-01-05 14:23:13 | ERROR | stderr |     self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/jupyter/user/cc/InternVL/llava/llava/model/builder.py", line 102, in load_pretrained_model
2024-01-05 14:23:13 | ERROR | stderr |     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 727, in from_pretrained
2024-01-05 14:23:13 | ERROR | stderr |     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
2024-01-05 14:23:13 | ERROR | stderr |     return cls._from_pretrained(
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2017, in _from_pretrained
2024-01-05 14:23:13 | ERROR | stderr |     tokenizer = cls(*init_inputs, **init_kwargs)
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 156, in __init__
2024-01-05 14:23:13 | ERROR | stderr |     self.sp_model = self.get_spm_processor()
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 164, in get_spm_processor
2024-01-05 14:23:13 | ERROR | stderr |     model_pb2 = import_protobuf()
2024-01-05 14:23:13 | ERROR | stderr |   File "/data/miniconda3/envs/internvl/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 40, in import_protobuf
2024-01-05 14:23:13 | ERROR | stderr |     return sentencepiece_model_pb2
2024-01-05 14:23:13 | ERROR | stderr | UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment
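
The last frame of this traceback is in `import_protobuf`, which returns `sentencepiece_model_pb2` before the name has been bound. One plausible cause, offered as an assumption rather than a confirmed diagnosis, is that the `protobuf` package cannot be imported in this environment, so the import branch never runs. A minimal check along those lines (the helper name `missing_modules` is hypothetical):

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # `google.protobuf` is provided by the `protobuf` package on PyPI.
    print(missing_modules(["google.protobuf", "sentencepiece"]))
```

If `google.protobuf` shows up in the output, installing `protobuf` into the same conda environment would be the first thing to try.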

ZeRO strategy script for Semantic Segmentation Training

Thank you for open-sourcing this brilliant and outstanding project. However, with a model size of 6 billion parameters, it is still infeasible to train on consumer-grade graphics cards without using memory optimization tools like DeepSpeed. Could you provide a semantic segmentation training script based on mmsegmentation and the ZeRO strategy in the future? Looking forward to your reply.

Can I extract image and text features separately with the InternVL-G model?

Hi, can I extract image and text features separately with the InternVL-G model? Reading the code, I found that the cross-attention layers in QLLaMA are shared parameters between the image and text feature branches, but Figure 4 of the paper seems to show some kind of interaction, like the Q-Former in BLIP-2. So can I call model.encode_image() or model.encode_text() individually?

InternVL-Chat-V1.2 on mmbench-dev-cn only reaches acc = 71.73, not acc = 79.5

InternVL-Chat-V1.2 is listed as number one on mmbench-dev-cn with acc = 79.5%.

I used the weights from InternVL-Chat-Chinese-V1-2, as linked in the README.

evaluation code:

torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=127.0.0.1 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  eval/mmbench/evaluate_mmbench.py --checkpoint ${CHECKPOINT} --datasets mmbench_dev_cn_20231003

The results are:
total question = 4329
tp = 3105
acc = 0.7173
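
For reference, the size of the gap can be checked directly from the counts above. This only reproduces the arithmetic; it says nothing about why the numbers differ.

```python
total_questions = 4329
true_positives = 3105
reported_acc = 79.5  # percent, the leaderboard figure quoted above

measured_acc = 100 * true_positives / total_questions
gap = reported_acc - measured_acc

print(f"measured {measured_acc:.2f}%, gap {gap:.2f} points")
```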

Fine-tuning with the LLaVA code: the loss decreases abnormally

I am using the code you provided to train LLaVA based on InternViT-6B. Following your script, the first (pretraining) stage runs normally, but when I run the fine-tuning script, the loss behaves strangely.

As shown in the picture:
image
I have not modified any of the LLaVA code.

I debugged using EVA_clip_vit and located the problem: it occurs under zero2. After switching to zero3, the loss was normal. However, InternViT-6B has the following problem when using zero3:

image

Prompt template for Grounding fine-tuning

For fine-tuning V1.2-plus on the Grounding task, is the prompt template in the following format?
'Please provide the bounding box coordinate of the region this sentence describes: XXX',
(This prompt is taken from the refcoco evaluation script https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/eval/refcoco/evaluate_grounding.py#L248C18-L248C104 )
That is, is the data format as follows?

{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: XXX"
    },
    {
      "from": "gpt",
      "value": "XXX[[253, 231, 733, 787]]"
    }
  ]
}
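
A record in the shape asked about above could be generated as in the sketch below. This only mirrors the format in the question; whether it is the template actually used for V1.2-plus grounding fine-tuning is not confirmed here, and the coordinate convention is left exactly as given. The helper name `make_grounding_record` is hypothetical.

```python
import json

# Prompt text copied from the question above.
PROMPT = "Please provide the bounding box coordinate of the region this sentence describes: {}"


def make_grounding_record(record_id, image_path, phrase, box):
    """Build one conversation record in the format shown in the question."""
    x1, y1, x2, y2 = box
    if not (x1 < x2 and y1 < y2):
        raise ValueError("box must be (x1, y1, x2, y2) with x1 < x2 and y1 < y2")
    return {
        "id": record_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\n" + PROMPT.format(phrase)},
            {"from": "gpt", "value": f"{phrase}[[{x1}, {y1}, {x2}, {y2}]]"},
        ],
    }


if __name__ == "__main__":
    record = make_grounding_record(0, "images/5.png", "XXX", (253, 231, 733, 787))
    print(json.dumps(record, indent=2))
```

Serializing through `json.dumps` also guards against hand-written bracket mistakes in the data file.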

During the process of using InternVL-Chat-V1.1, an error occurred

Hello, I encountered an error related to model deployment while using the code you provided for testing InternVL-Chat-V1.1, which seems to be related to model_base and model_name.
Could you please tell me where I can find reference code for evaluating InternVL-Chat-V1.1? Thanks!

InternVL for multiple images

How can I use the InternVL model for multiple images?

I noticed you have reported the result of InternVL for tasks requiring multiple images, such as MMMU.

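torchrun launch script for retrieval fine-tuning

InternVL-G provides a retrieval fine-tuning script. Is there also a version that launches with torchrun?

I get an error with the following torchrun launch script. Where do you think the problem is?


I changed the initialization to init_dist(launcher='pytorch', backend='nccl').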

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=12320

export VIT_LAYER_DECAY_RATE=0.9 ##########################################
export QLLAMA_LAYER_DECAY_RATE=0.9 ##############################################

DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS internvl/train/internvl_stage2_finetune.py \
  --dataset_name 'flickr30k_cn_train' \
  --model_name_or_path "/root/.cache/huggingface/hub/models--OpenGVLab--InternVL-14B-224px/snapshots/6d492b53f8adbbe9d577db3645d893057e3ffc59" \
  --output_dir "./work_dirs/internvl_stage2_finetune_flickrcn_364_bs1024_ep10" \
  --overwrite_output_dir True \
  --force_image_size 364 \
  --drop_path_rate 0.3 \
  --use_custom_trainer \
  --dataloader_num_workers 2 \
  --pad_to_max_length True \
  --bf16 True \
  --num_train_epochs 10 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 100 \
  --save_total_limit 5 \
  --learning_rate 1e-6 \
  --weight_decay 0.05 \
  --warmup_steps 100 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 80 \
  --do_train True \
  --optim adamw_torch \
  --deepspeed "zero_stage1_config_wo_opt.json" \
  --report_to "tensorboard"

The error is as follows:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1492, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1743, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "internvl/train/internvl_stage2_finetune.py", line 311, in <module>
main()
File "internvl/train/internvl_stage2_finetune.py", line 299, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1658, in train
return inner_training_loop(
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2012, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2894, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1958, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1964, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2040, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 204, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 204, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 886, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1399, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 915, in reduce_independent_p_g_buckets_and_remove_grads
self.reduce_ipg_grads()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1350, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index])
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1110, in average_tensor
self.allreduce_and_scatter(buckets[bucket_key],
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1014, in allreduce_and_scatter
self.allreduce_and_copy_with_multiple_ranks(small_bucket,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 987, in allreduce_and_copy_with_multiple_ranks
allreduced = self.allreduce_bucket(small_bucket, log=log, divide=divide, process_group=process_group)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1483, in allreduce_bucket
dist.all_reduce(tensor_to_allreduce, group=process_group)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 496, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 155, in all_reduce
return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1497, in wrapper
"args": f"{args}, {kwargs}",
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 309, in _tensor_str
self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Using the OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B model for inference did not yield satisfactory results

image
image
image

environment:

$ git log
commit feaf3a94ca26404a6f8a0b1472158dfa7f142f12 (HEAD -> main, origin/main, origin/HEAD)
Author: Zhe Chen <[email protected]>
Date:   Fri Jan 5 13:16:32 2024 +0800

    Update InternVL-Chat (#26)

commit d4a2d8325fcefe2f4f3d777715f08fa0569b54fe
Author: Zhe Chen <[email protected]>
Date:   Sat Dec 30 22:19:30 2023 +0800

    Update README.md in LLaVA codebase (#23)
    
    * Update README.md


# Partial pip list
transformers              4.32.0
torch                     2.0.1
torchaudio                2.0.2
torchvision               0.15.2
flash-attn                0.2.8
apex                      0.1
llava                     1.1.1        /data/jupyter/user/cc/InternVL/llava

code change:

# llava/llava/model/multimodal_encoder/builder.py
# add "or "intern" in vision_tower.lower()"
def build_vision_tower(vision_tower_cfg, **kwargs):
    vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None))
    is_absolute_path_exists = os.path.exists(vision_tower)
    if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "intern" in vision_tower.lower():
        return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)

    raise ValueError(f'Unknown vision tower: {vision_tower}')

model config change:

{
  "_name_or_path": "InternVL-Chat-ViT-6B-Vicuna-7B",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  // ...
  // modify mm_vision_tower path to local path
  "mm_vision_tower": "/data/jupyter/user/cc/llm_models/InternViT-6B-224px/",
  // ...
}

launch demo:

# controller
$ python -m llava.serve.controller --host 0.0.0.0 --port 12000

# gradio
$ python -m llava.serve.gradio_web_server --controller http://localhost:12000 --model-list-mode reload --share

2024-01-05 15:36:22 | INFO | gradio_web_server | ==== request ====
{'model': 'InternVL-Chat-ViT-6B-Vicuna-7B', 'prompt': "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nwhat's in the image ASSISTANT:", 'temperature': 0.2, 'top_p': 0.7, 'max_new_tokens': 512, 'stop': '</s>', 'images': "List of 1 images: ['1a8b42c80b9104618b8cdc378828b39a']"}
2024-01-05 15:36:23 | INFO | gradio_web_server | I'm sorry
2024-01-05 15:36:28 | INFO | gradio_web_server | add_text. ip: 192.168.3.181. len: 33
2024-01-05 15:36:29 | INFO | gradio_web_server | http_bot. ip: 192.168.3.181
2024-01-05 15:36:29 | INFO | gradio_web_server | template: llava_v1
2024-01-05 15:36:29 | INFO | gradio_web_server | model_name: InternVL-Chat-ViT-6B-Vicuna-7B, worker_addr: http://localhost:42000
2024-01-05 15:36:29 | INFO | gradio_web_server | ==== request ====
{'model': 'InternVL-Chat-ViT-6B-Vicuna-7B', 'prompt': "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is unusual about this image? ASSISTANT:", 'temperature': 0.2, 'top_p': 0.7, 'max_new_tokens': 512, 'stop': '</s>', 'images': "List of 1 images: ['2886bdd7ba800e89b682ce14e66cf054']"}
2024-01-05 15:36:32 | INFO | gradio_web_server | The image is of a person with a black shirt.
2024-01-05 15:37:25 | INFO | stdout | 
2024-01-05 15:37:25 | INFO | stdout | Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.
2024-01-05 15:37:25 | INFO | stdout | 
2024-01-05 15:37:25 | INFO | stdout | Also please ensure that your antivirus or firewall is not blocking the binary file located at: /data/miniconda3/envs/internvl/lib/python3.9/site-packages/gradio/frpc_linux_amd64_v0.2


# model worker
$ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:12000 --port 42000 --worker http://localhost:42000 --model-path /data/jupyter/user/cc/llm_models/InternVL-Chat-ViT-6B-Vicuna-7B/

Could you please let me know if my modification is correct? Could you offer some advice? Thank you.

Will you provide code that supports multi-round chat? (in README.md)

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

path = "OpenGVLab/InternVL-Chat-Chinese-V1-1"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path)
image = Image.open('./examples/image2.jpg').convert('RGB')
image = image.resize((448, 448))
image_processor = CLIPImageProcessor.from_pretrained(path)

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

question = "请详细描述图片"
response = model.chat(tokenizer, pixel_values, question, generation_config)

This example only supports single-turn inference.
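
Until the repository documents a multi-round interface, one generic workaround is to keep the history on the caller's side and fold it into each new question. The sketch below is model-agnostic: `chat_fn` stands in for any single-turn call, and the `User:` / `Assistant:` layout is an assumption of this example, not InternVL's own conversation template, so answer quality may differ from native multi-round support.

```python
class Conversation:
    """Keeps (question, answer) pairs and replays them as plain text."""

    def __init__(self, chat_fn):
        # chat_fn: hypothetical callable taking one prompt string and returning a reply string.
        self.chat_fn = chat_fn
        self.history = []

    def build_prompt(self, question):
        lines = []
        for user_turn, assistant_turn in self.history:
            lines.append(f"User: {user_turn}")
            lines.append(f"Assistant: {assistant_turn}")
        lines.append(f"User: {question}")
        return "\n".join(lines)

    def ask(self, question):
        answer = self.chat_fn(self.build_prompt(question))
        self.history.append((question, answer))
        return answer


if __name__ == "__main__":
    # Stand-in for a real model: reports how many lines of context it received.
    conv = Conversation(lambda prompt: f"saw {len(prompt.splitlines())} line(s)")
    print(conv.ask("Describe the picture."))
    print(conv.ask("What colour is the car?"))
```

With the snippet above, `chat_fn` could be `lambda prompt: model.chat(tokenizer, pixel_values, prompt, generation_config)`, reusing the same single-turn call.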

How to use InternVL within the LangChain framework

Hello, I would like to use InternVL within the LangChain framework for conversation, together with memory.
path = "OpenGVLab/InternVL-Chat-Chinese-V1-1"

model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True,
cache_dir = "/project/ASD/jingyou_llm/model_cache",
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path)

pipe = pipeline(
"visual-question-answering",
model=model,
tokenizer=tokenizer,
max_length=100,
)
This then raises the error: The model 'InternVLChatModel' is not supported for visual-question-answering. Supported models are ['Blip2ForConditionalGeneration', 'ViltForQuestionAnswering'].
However, neither of the two listed model types is available.
How should I use the model here?

How to set up InternVL-Chat as an API

First of all, thank you very much for your work, which is very constructive.

I would like to ask how to deploy this model and turn it into an API for others to call. Can you provide the corresponding code? Thank you again.
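
As a starting point, a model can be put behind a small HTTP endpoint using only the Python standard library. The sketch below is not the project's deployment code: `answer_question` is a placeholder to be replaced by a real model call, and the JSON field names `question` and `answer`, the host, and the port are all assumptions of this example. It also handles one request at a time, which is only suitable for testing.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer_question(question):
    """Placeholder for the real model call (hypothetical)."""
    return f"echo: {question}"


def handle_payload(raw_body, chat_fn=answer_question):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        question = json.loads(raw_body)["question"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'question' field"}
    return 200, {"answer": chat_fn(question)}


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host="127.0.0.1", port=8000):
    """Block forever, answering POST requests."""
    HTTPServer((host, port), ChatHandler).serve_forever()
```

Calling `serve()` starts the endpoint; a client then POSTs `{"question": "..."}` and reads the `answer` field from the reply. Image upload would need an extra field and decoding step, which is left out here.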

How to run inference with InternVL-14B-224px on multiple GPUs?

My GPU is an A30 (24 GB), which cannot hold the model on a single card. I tried loading the Hugging Face model with device_map='auto', and it reported this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/work/.conda/envs/xxx/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/home/work/.conda/envs/xxx/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3606, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/home/work/.conda/envs/xxx/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1690, in _get_no_split_modules
    raise ValueError(
ValueError: InternVisionModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

here is my code:

import torch
from transformers import AutoModel, CLIPImageProcessor
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto').eval()

Multi-round dialogue

Does the current internvl-chat-chinese-v1-2-plus not support multi-round dialogue? The conversation history cannot be passed into the chat function.

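Flickr30k-CN image-text retrieval metrics: my clip_benchmark results differ somewhat from the paper

Chinese image-text retrieval metrics for the weights released at https://huggingface.co/OpenGVLab/InternVL-14B-224px/tree/main: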
{
  "dataset": "flickr30k",
  "model": "internvl_c_retrieval_hf",
  "pretrained": "./huggingface/hub/models--OpenGVLab--InternVL-14B-224px/snapshots/6d492b53f8adbbe9d577db3645d893057e3ffc59/",
  "task": "zeroshot_retrieval",
  "metrics": {
    "image_retrieval_recall@1": 0.7148000001907349,
    "text_retrieval_recall@1": 0.8970000147819519,
    "image_retrieval_recall@5": 0.9179999828338623,
    "text_retrieval_recall@5": 0.9909999966621399,
    "image_retrieval_recall@10": 0.9559999704360962,
    "text_retrieval_recall@10": 0.996999979019165
  },
  "language": "cn"
}
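Metrics reported in the paper: InternVL-C (ours) ✓ 90.3 98.8 99.7 75.1 92.9 96.4
There seems to be some discrepancy. Did I make a mistake somewhere in my evaluation?
I used the following script: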
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" --task "zeroshot_retrieval" \
  --dataset "flickr30k" --dataset_root ./data/flickr30k --model internvl_c_retrieval_hf \
  --pretrained ./huggingface/hub/models--OpenGVLab--InternVL-14B-224px/snapshots/6d492b53f8adbbe9d577db3645d893057e3ffc59/ --output result_ft.json
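
To see where the discrepancy is concentrated, the measured recalls can be lined up against the paper row. The mapping of the six paper numbers to metrics (text retrieval R@1/5/10 first, then image retrieval R@1/5/10) is an assumption based on how closely the values match; it is not stated in the issue.

```python
# Measured values, copied from the clip_benchmark output above (fractions).
measured = {
    "text_retrieval_recall@1": 0.8970000147819519,
    "text_retrieval_recall@5": 0.9909999966621399,
    "text_retrieval_recall@10": 0.996999979019165,
    "image_retrieval_recall@1": 0.7148000001907349,
    "image_retrieval_recall@5": 0.9179999828338623,
    "image_retrieval_recall@10": 0.9559999704360962,
}

# Paper row, in percent; the column order is assumed (see note above).
paper = dict(zip(measured, [90.3, 98.8, 99.7, 75.1, 92.9, 96.4]))

# Positive gap means the paper number is higher than the measured one.
gaps = {name: paper[name] - 100 * value for name, value in measured.items()}

for name, gap in gaps.items():
    print(f"{name:28s} paper {paper[name]:5.1f}  measured {100 * measured[name]:6.2f}  gap {gap:+.2f}")
```

Under that assumed mapping, text retrieval is within about one point of the paper, while image retrieval R@1 is lower by several points.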

ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

I installed everything strictly according to installation.md, but I still get this error.

My environment: Ubuntu 20.04, CUDA 11.8, torch 2.1.1, Python 3.8, 2x 2080 Ti

Detail:
FlashAttention is not installed.
Discovered apex.normalization.FusedRMSNorm - will use it instead of LlamaRMSNorm
FlashAttention is not installed.
Warning: Flash Attention is not available, use_flash_attn is set to False.
Traceback (most recent call last):
  File "zz_demo.py", line 244, in <module>
    model = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3462, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_internvl.py", line 188, in __init__
    self.vision_model = InternVisionModel(config.vision_config) # frozen
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 288, in __init__
    self.encoder = InternVisionEncoder(config)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 228, in __init__
    self.layers = nn.ModuleList([
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 229, in <listcomp>
    InternVisionEncoderLayer(config, dpr[idx]) for idx in range(config.num_hidden_layers)])
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 188, in __init__
    self.attn = InternAttention(config)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-14B-224px/modeling_intern_vit.py", line 119, in __init__
    self.q_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
  File "/usr/local/lib/python3.8/dist-packages/apex/normalization/fused_layer_norm.py", line 393, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

Bare minimum code for InternVL-Chat?

Hi,

do you have any example available to run InternVL-Chat without gradio? It would be very helpful if you could share some bare minimum python script to load the model from HF and interact.

Thanks!

InternVL-G image-text matching issue

###============== Image-Text Contrastive ===================###
image_itc = self.clip_projector2(image_itc)

selected = summarize_attention_mask.sum(1) - 1
text_itc = text_itc[torch.arange(text_itc.shape[0]), selected]
text_itc = text_itc @ self.text_projection

Hello, is there a problem here, or have I misunderstood? The paper says the feature of the EOS token is used, so shouldn't this be text_itc = text_itc[torch.arange(text_itc.shape[0]), -1]? The LLaMA tokenizer pads on the left. Thank you.
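
The question comes down to which side the padding is on. The sketch below uses plain Python lists in place of tensors to show that `mask.sum() - 1` points at the last real token only when padding is on the right; with left padding the last real token sits at index -1. It does not establish which padding side the InternVL-G pipeline actually uses for `summarize_attention_mask`.

```python
def index_from_mask_sum(mask):
    """Index computed as in the snippet above: number of real tokens minus one."""
    return sum(mask) - 1


def last_real_token_index(mask):
    """Position of the last 1 in the mask, whichever side is padded."""
    return max(i for i, m in enumerate(mask) if m == 1)


# 1 marks a real token, 0 marks padding; both rows hold three real tokens.
right_padded = [1, 1, 1, 0, 0]
left_padded = [0, 0, 1, 1, 1]

for name, mask in [("right", right_padded), ("left", left_padded)]:
    print(name, index_from_mask_sum(mask), last_real_token_index(mask))
```

For the right-padded row both functions agree; for the left-padded row `index_from_mask_sum` lands inside the sequence rather than on its final token, which is the case the question is worried about.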

Any further plans on knowledge distillation?

ViT-22B conducted knowledge distillation experiments (refer to Table 8), demonstrating that it is not only a large-scale model but also an excellent teacher. Has there been any consideration or experiments conducted on whether Intern-VL can be distilled into smaller models, given that it serves as the largest open-source vision/vision-language foundation model to date (and a good alternative to the ViT-22B)? Thank you in advance for your attention.

Eval results on video benchmarks

Hi there,

Congratulations on your impressive work! I have a question regarding the evaluation results on video benchmarks. I performed zero-shot evaluations on Kinetics-400/600/700 using the middle frame, following vissl. However, I noticed differences between my results and those reported in the paper, particularly the average accuracy of 71.0, 71.3, and 65.7 for Kinetics-400/600/700.

image
