tencentarc / vit-lens
[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
Home Page: https://ailab-cvc.github.io/seed/vitlens/
License: Other
I'm thoroughly impressed with your project and I'm eager to apply the model to my video data. However, the current TRAIN_INFERENCE.md does not provide the relevant usage information. Could you kindly publish the associated training methodologies or experimental approaches? Your assistance would be greatly appreciated. Thank you!
Hi,
Thanks for the exceptional work you have presented. It is truly remarkable and contributes significantly to the field.
After reviewing your paper, I noted the mention of experiments being conducted on 32GB V100 GPU clusters. However, can you please give more details of the resources utilized for this project? Could you kindly provide information on the total training time and the exact number of GPUs employed during this period?
Thanks a lot.
The justification in the paper for using disparity is "scale normalization". I know that this comes from OmniVore and ImageBind.
However, this does not actually achieve scale normalization.
What could scale normalization mean? Disparity images are not scale invariant in the way RGB images are: if you bring an object closer, its disparities grow, whereas in an RGB image the colors stay the same. Instead it must mean something like: two objects with the same "disparity" should take up the same number of pixels.
To achieve this, you should use f/depth instead of bf/depth. This makes sense because b is an arbitrary value associated with your particular camera setup, and it provides no information about the geometry of the scene you are looking at. If you physically change b, the depth image does not change, but the disparity does.
One other suggested improvement: when you resize to 224, you're actually implicitly changing the focal length. So if h is the original height of the image, I would suggest computing "disparity" as
(224/h)f/depth
If normalization is having any positive effect, I bet this improved normalization will do better.
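The suggested normalization could be sketched like this (a minimal sketch, assuming the focal length is given in pixels and read from the dataset's intrinsics; the function name is mine):

```python
import numpy as np

def normalized_disparity(depth, focal_px, orig_height, target_size=224):
    """Compute the suggested scale-normalized disparity (224/h) * f / depth.

    depth: depth map (H x W); zeros are treated as invalid pixels.
    focal_px: focal length in pixels (an assumption; take it from the
              dataset's camera intrinsics).
    orig_height: height h of the image before resizing to target_size.
    """
    depth = np.asarray(depth, dtype=np.float64)
    disp = np.zeros_like(depth)
    valid = depth > 0
    # Resizing to target_size implicitly rescales the focal length,
    # so fold the (target_size / h) factor into the disparity.
    disp[valid] = (target_size / orig_height) * focal_px / depth[valid]
    return disp
```

Note that the baseline b drops out entirely, per the argument above.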
The tensor output by ViT-Lens is 1x768 for each modality, right? So where in InstructBLIP do I plug it in? Can you please answer? Thanks!
This code documents the processing pipeline well, but it starts with disparity images, whereas the NYUv2 dataset starts with depth images.
What baseline and focal length are you using for converting NYUv2.D to disparity? My best guess is
f = 518.857901
b = 75
However, that seems like it could be off by an order of magnitude. Help would be appreciated.
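For concreteness, the conversion in question could be sketched as follows (a minimal numpy sketch; the f and b values are the unconfirmed guesses above, and b in particular may be off by an order of magnitude):

```python
import numpy as np

# Guessed intrinsics from above -- unconfirmed assumptions, not values
# taken from the ViT-Lens code.
F_GUESS = 518.857901
B_GUESS = 75.0

def depth_to_disparity(depth, f=F_GUESS, b=B_GUESS):
    """disparity = b * f / depth, keeping zeros for invalid pixels."""
    depth = np.asarray(depth, dtype=np.float64)
    disp = np.zeros_like(depth)
    valid = depth > 0
    disp[valid] = b * f / depth[valid]
    return disp
```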
Running example.py gives the following result:
PointCould x Text:
tensor([[8.5200e-04, 9.5644e-02, 5.8601e-01, 1.9369e-02, 2.9812e-01],
[8.8911e-04, 1.7004e-01, 3.2570e-01, 1.1302e-02, 4.9207e-01],
[2.9327e-04, 6.9276e-02, 4.6433e-01, 1.2254e-02, 4.5384e-01],
[1.9555e-03, 7.8262e-02, 3.8027e-01, 5.8164e-02, 4.8135e-01],
[3.0467e-04, 1.0489e-01, 4.9719e-01, 2.1044e-02, 3.7657e-01]],
device='cuda:0')
Hi.
I read the code and found that you implement the perceiver as 4 cross-attention layers with 4 self-attention layers each, and I'm curious why you don't just use 16 or fewer self-attention layers.
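For clarity, the layer schedule being described could be sketched like this (an assumption based on the description above, not the authors' implementation; the function name is mine):

```python
def perceiver_layer_schedule(n_blocks=4, self_per_block=4):
    """Build the interleaved layer order: 4 blocks, each one
    cross-attention layer followed by 4 self-attention layers
    (20 layers total, vs. a flat stack of 16 self-attention layers)."""
    schedule = []
    for _ in range(n_blocks):
        schedule.append("cross_attention")   # latents attend to input tokens
        schedule.extend(["self_attention"] * self_per_block)
    return schedule
```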
Hi, I just want to load the EEG checkpoint from your Hugging Face repo, but it seems that tons of keys do not match. Could you double-check on your side whether it can be loaded successfully?
Hi, I have checked the CLIP-Vision embedding (last hidden state) of BLIP-2 and InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257x1408. However, the multi-modal matching space of ViT-Lens uses a 1x768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly — have they been fine-tuned?
Hi,
I'm very interested in your great work and am trying to train your model on my own data. As this repo currently contains only the inference code, I'm wondering if you could share the training code, or at least the training parameter configuration, especially the parameters of the perceiver. Thanks a lot!
Hi,
thanks for your great work! I'm trying to adapt ViT-Lens to a customized dataset, and I hope to align the textual prompts used at inference with those used during training. Could you please share the prompt template? It may help improve performance.
I was trying to apply this model to my own data and not getting good results. I ran the NYUv2 dataset through my code, and the results seem to be in line with those reported in the ViT-Lens paper.
Digging into it, the issue is - at least partly - that the NYUv2 data is not in millimeters. Here is the matlab code for converting the png files to mm that is in the SUNRGBDtoolbox (https://rgbd.cs.princeton.edu/):
depthVis = imread(data.depthpath);
imsize = size(depthVis);
depthInpaint = bitor(bitshift(depthVis,-3), bitshift(depthVis,16-3));
In other words, the data in the png files is a circular shift left by 3 bits of the depth in mm (which for most data is just multiplying by 8).
I mention this because the code in #9 seems to indicate that it is assumed that the data is in mm. It might be important if other datasets get used that are in mm and not the SUN RGB-D format.
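A Python equivalent of the MATLAB decoding above might look like this (a sketch assuming uint16 pngs, as in SUN RGB-D; the function name is mine):

```python
import numpy as np

def sunrgbd_png_to_mm(depth_vis):
    """Decode a SUN RGB-D style depth png (uint16) to depth in mm.

    Mirrors the MATLAB line above: a circular right shift by 3 bits,
    undoing the encoder's circular left shift (for most pixels this is
    just a divide by 8).  numpy's uint16 left shift wraps modulo 2^16,
    matching MATLAB's bitshift on uint16.
    """
    x = np.asarray(depth_vis, dtype=np.uint16)
    return (x >> 3) | (x << 13)
```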