tencentarc / vit-lens
[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
Home Page: https://ailab-cvc.github.io/seed/vitlens/
License: Other
I'm thoroughly impressed with your project and I'm eager to apply the model to my video data. However, the current TRAIN_INFERENCE.md does not provide the relevant usage information. Could you kindly publish the associated training methodologies or experimental approaches? Your assistance would be greatly appreciated. Thank you!
Hi,
Thanks for the exceptional work you have presented. It is truly remarkable and contributes significantly to the field.
After reviewing your paper, I noted the mention of experiments being conducted on 32GB V100 GPU clusters. However, can you please give more details of the resources utilized for this project? Could you kindly provide information on the total training time and the exact number of GPUs employed during this period?
Thanks a lot.
The justification in the paper for using disparity is "scale normalization". I know that this comes from OmniVore and ImageBind.
However, this does not actually achieve scale normalization.
What could scale normalization mean? Disparity images are not scale invariant in the way RGB images are: if you bring an object closer, its disparities grow, whereas in an RGB image the colors stay the same. Instead it must mean something like: two objects with the same "disparity" should take up the same number of pixels.
To achieve this, you should use f/depth instead of bf/depth. This makes sense because b is an arbitrary value associated with your particular camera setup, and it provides no information about the geometry of the scene you are looking at. If you physically change b, the depth image does not change, but the disparity does.
One other suggested improvement: when you resize to 224, you're actually implicitly changing the focal length. So if h is the original height of the image, I would suggest computing "disparity" as
(224/h)f/depth
If normalization is having any positive effect, I bet this improved normalization will do better.
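The suggested normalization could be sketched like this (a minimal sketch, assuming the focal length is given in pixels and read from the dataset's intrinsics; the function name is mine):

```python
import numpy as np

def normalized_disparity(depth, focal_px, orig_height, target_size=224):
    """Compute the suggested scale-normalized disparity (224/h) * f / depth.

    depth: depth map (H x W); zeros are treated as invalid pixels.
    focal_px: focal length in pixels (an assumption; take it from the
              dataset's camera intrinsics).
    orig_height: height h of the image before resizing to target_size.
    """
    depth = np.asarray(depth, dtype=np.float64)
    disp = np.zeros_like(depth)
    valid = depth > 0
    # Resizing to target_size implicitly rescales the focal length,
    # so fold the (target_size / h) factor into the disparity.
    disp[valid] = (target_size / orig_height) * focal_px / depth[valid]
    return disp
```

Note that the baseline b drops out entirely, per the argument above.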
The tensor output by ViT-Lens is 1x768 for each modality, right? So where in InstructBLIP do I plug it in? Can you please answer? Thanks!
This code documents the processing pipeline well, but it starts with disparity images, whereas the NYUv2 dataset starts with depth images.
What baseline and focal length are you using for converting NYUv2.D to disparity? My best guess is
f = 518.857901
b = 75
However, that seems like it could be off by an order of magnitude. Help would be appreciated.
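For concreteness, the conversion in question could be sketched as follows (a minimal numpy sketch; the f and b values are the unconfirmed guesses above, and b in particular may be off by an order of magnitude):

```python
import numpy as np

# Guessed intrinsics from above -- unconfirmed assumptions, not values
# taken from the ViT-Lens code.
F_GUESS = 518.857901
B_GUESS = 75.0

def depth_to_disparity(depth, f=F_GUESS, b=B_GUESS):
    """disparity = b * f / depth, keeping zeros for invalid pixels."""
    depth = np.asarray(depth, dtype=np.float64)
    disp = np.zeros_like(depth)
    valid = depth > 0
    disp[valid] = b * f / depth[valid]
    return disp
```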
Running example.py gives the following result:
PointCould x Text:
tensor([[8.5200e-04, 9.5644e-02, 5.8601e-01, 1.9369e-02, 2.9812e-01],
[8.8911e-04, 1.7004e-01, 3.2570e-01, 1.1302e-02, 4.9207e-01],
[2.9327e-04, 6.9276e-02, 4.6433e-01, 1.2254e-02, 4.5384e-01],
[1.9555e-03, 7.8262e-02, 3.8027e-01, 5.8164e-02, 4.8135e-01],
[3.0467e-04, 1.0489e-01, 4.9719e-01, 2.1044e-02, 3.7657e-01]],
device='cuda:0')
Hi.
I read the code and found that you implement the perceiver as 4 cross-attention layers with 4 self-attention layers each, and I'm curious why you don't just use 16 or fewer self-attention layers.
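For clarity, the layer schedule being described could be sketched like this (an assumption based on the description above, not the authors' implementation; the function name is mine):

```python
def perceiver_layer_schedule(n_blocks=4, self_per_block=4):
    """Build the interleaved layer order: 4 blocks, each one
    cross-attention layer followed by 4 self-attention layers
    (20 layers total, vs. a flat stack of 16 self-attention layers)."""
    schedule = []
    for _ in range(n_blocks):
        schedule.append("cross_attention")   # latents attend to input tokens
        schedule.extend(["self_attention"] * self_per_block)
    return schedule
```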
Hi, I just want to load the EEG checkpoint from your Hugging Face repo, but it seems that tons of keys do not match. Could you double-check on your side whether it can be loaded successfully?
Hi, I have checked the CLIP-Vision embedding (last hidden state) of BLIP-2 and InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257x1408. However, the multi-modal matching space of ViT-Lens uses a 1x768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly — have they been fine-tuned?
Hi,
I'm very interested in your great work and am trying to train your model on my own data. As this repo currently contains only the inference code, I'm wondering if you could share the training code, or at least the training parameter configuration, especially the parameters of the perceiver. Thanks a lot!
Hi,
thanks for your great work! I'm trying to adapt ViT-Lens to a customized dataset, and I hope to align the textual prompts used at inference with those used during training. Could you please share the prompt template? It may help improve performance.
I was trying to apply this model to my own data and not getting good results. I ran the NYUv2 dataset through my code, and the results seem to be in line with those reported in the ViT-Lens paper.
Digging into it, the issue is - at least partly - that the NYUv2 data is not in millimeters. Here is the matlab code for converting the png files to mm that is in the SUNRGBDtoolbox (https://rgbd.cs.princeton.edu/):
depthVis = imread(data.depthpath);
imsize = size(depthVis);
depthInpaint = bitor(bitshift(depthVis,-3), bitshift(depthVis,16-3));
In other words, the data in the png files is a circular shift left by 3 bits of the depth in mm (which for most data is just multiplying by 8).
I mention this because the code in #9 seems to indicate that it is assumed that the data is in mm. It might be important if other datasets get used that are in mm and not the SUN RGB-D format.
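A Python equivalent of the MATLAB decoding above might look like this (a sketch assuming uint16 pngs, as in SUN RGB-D; the function name is mine):

```python
import numpy as np

def sunrgbd_png_to_mm(depth_vis):
    """Decode a SUN RGB-D style depth png (uint16) to depth in mm.

    Mirrors the MATLAB line above: a circular right shift by 3 bits,
    undoing the encoder's circular left shift (for most pixels this is
    just a divide by 8).  numpy's uint16 left shift wraps modulo 2^16,
    matching MATLAB's bitshift on uint16.
    """
    x = np.asarray(depth_vis, dtype=np.uint16)
    return (x >> 3) | (x << 13)
```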