
vit-lens's Issues

Something about Training Methodologies and Experimental Approaches for Video Data

I'm thoroughly impressed with your project and I'm eager to apply the model to my video data. However, the current TRAIN_INFERENCE.md does not provide the relevant usage information. Could you kindly publish the associated training methodologies or experimental approaches? Your assistance would be greatly appreciated. Thank you!

Training Time and GPU Usage

Hi,

Thanks for the exceptional work you have presented. It is truly remarkable and contributes significantly to the field.

After reviewing your paper, I noted that the experiments were conducted on clusters of 32GB V100 GPUs. Could you give more details about the resources used for this project, in particular the total training time and the exact number of GPUs employed?

Thanks a lot.

Alternate depth normalization

The justification in the paper for using disparity is "scale normalization". I know this convention comes from Omnivore and ImageBind.
However, this does not actually achieve scale normalization.

What could scale normalization mean? Disparity images are not scale invariant in the way RGB images are: if you bring a thing closer it will have larger disparities, whereas in RGB images the colors stay the same. So it must mean something like: two things with the same "disparity" should take up the same number of pixels.

To achieve this, you should use f/depth instead of bf/depth. This makes sense because b is an arbitrary value associated with the particular camera setup that you have, and it provides you no information about the geometry of the scene you are looking at. If you change b physically, the depth image does not change, but the disparity does.

One other suggested improvement: when you resize to 224, you're actually implicitly changing the focal length. So if h is the original height of the image, I would suggest computing "disparity" as

(224 / h) * f / depth

If normalization is having any positive effect, I bet this improved normalization will do better.
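
For concreteness, here is a minimal sketch of the normalization I am proposing (Python; the function and variable names are mine, not from the repo):

import numpy as np

def normalized_disparity(depth_m, focal_px, orig_height, target_size=224):
    # Proposed form: drop the baseline b and rescale the focal length to
    # account for resizing the image to target_size, i.e. (224/h) * f / depth.
    scale = target_size / orig_height
    return (scale * focal_px) / np.clip(depth_m, 1e-6, None)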

Plug-in problem

The tensor output by ViT-Lens is 1x768 for each modality, right? Where in InstructBLIP should I plug it in? Thanks!

Reproducing NYUv2 Results

This code documents the processing pipeline well, but it starts from disparity images, whereas NYUv2 starts from depth images.
What baseline and focal length are you using to convert NYUv2 depth to disparity? My best guess is

f = 518.857901
b = 75

However, that seems like it could be off by an order of magnitude. Help would be appreciated.
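
For reference, this is the conversion I am assuming, with those guessed values plugged in (the function name is mine):

import numpy as np

FOCAL = 518.857901  # guessed NYUv2 focal length in pixels
BASELINE = 75.0     # guessed baseline; possibly off by an order of magnitude

def nyu_depth_to_disparity(depth):
    # bf/depth form that the processing pipeline appears to start from
    return (BASELINE * FOCAL) / np.clip(depth, 1e-6, None)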

Point cloud and text output results are incorrect

Running example.py gives the following result:
PointCould x Text:
tensor([[8.5200e-04, 9.5644e-02, 5.8601e-01, 1.9369e-02, 2.9812e-01],
[8.8911e-04, 1.7004e-01, 3.2570e-01, 1.1302e-02, 4.9207e-01],
[2.9327e-04, 6.9276e-02, 4.6433e-01, 1.2254e-02, 4.5384e-01],
[1.9555e-03, 7.8262e-02, 3.8027e-01, 5.8164e-02, 4.8135e-01],
[3.0467e-04, 1.0489e-01, 4.9719e-01, 2.1044e-02, 3.7657e-01]],
device='cuda:0')

Cannot load EEG ckpt

Hi, I am trying to load the EEG checkpoint from your Hugging Face repo, but it seems that a large number of keys do not match. Could you double-check on your side whether it can be loaded successfully?
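
For reference, this is roughly how I am loading it and inspecting the mismatch (the file name is just my local copy, and `model` stands in for the ViT-Lens EEG model built from this repo's code):

import torch

# model = <ViT-Lens EEG model constructed via this repo's code>
ckpt = torch.load("vitlens_eeg.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if saved under a "state_dict" key

# strict=False reports the mismatch instead of raising
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(len(missing), "missing keys, e.g.", missing[:3])
print(len(unexpected), "unexpected keys, e.g.", unexpected[:3])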

InstructBLIP and SEED Implementation

Hi, I have checked the CLIP-Vision embedding (last hidden state) of BLIP-2 and InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257x1408. However, the multimodal matching space of ViT-Lens uses a 1x768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly; have they been fine-tuned?

Training code or training parameter configurations

Hi,

I'm very interested in your great work and am trying to train your model on my own data. Since this repo currently only contains the inference code, I'm wondering if you could share the training code, or at least the training parameter configuration, especially the parameters of the perceiver. Thanks a lot!

What kind of textual prompts do you use during the training period?

Hi,

Thanks for your great work! I'm trying to adapt ViT-Lens to a custom dataset, and I would like to align the textual prompts used at inference with those used during training. Could you please share the prompt template? It may help improve performance.
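
For example, at inference I am currently using generic CLIP-style templates like the ones below; I don't know whether they match what was used during training, which is why I'm asking:

# Hypothetical CLIP/ImageBind-style templates, not taken from this repo
templates = [
    "a photo of a {}.",
    "a point cloud model of a {}.",
    "a 3D model of a {}.",
]
class_names = ["chair", "airplane", "car"]
prompts = [t.format(c) for t in templates for c in class_names]
print(prompts[:3])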

SUN RGB-D is not in millimeters

I was trying to apply this model to my own data and not getting good results. I ran the NYUv2 dataset through my code, and the results seem to be in line with those reported in the ViT-Lens paper.

Digging into it, the issue is - at least partly - that the NYUv2 data is not in millimeters. Here is the MATLAB code from the SUNRGBDtoolbox (https://rgbd.cs.princeton.edu/) for converting the PNG files to mm:

depthVis = imread(data.depthpath);   % raw 16-bit PNG as stored in SUN RGB-D
imsize = size(depthVis);
depthInpaint = bitor(bitshift(depthVis,-3), bitshift(depthVis,16-3));   % undo the 3-bit circular shift -> depth in mm

In other words, the data in the PNG files is the depth in mm circularly shifted left by 3 bits (which for most pixels is just a multiplication by 8).
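
For anyone working in Python, an equivalent decode (assuming the PNG loads as an unsigned 16-bit array) would be roughly:

import numpy as np
from PIL import Image

def sunrgbd_png_to_mm(path):
    # Undo the 3-bit circular left shift applied by the SUN RGB-D format,
    # yielding depth in millimeters.
    d = np.array(Image.open(path), dtype=np.uint32)
    return (((d >> 3) | (d << 13)) & 0xFFFF).astype(np.uint16)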

I mention this because the code in #9 seems to assume that the data is in mm. This might matter if other datasets are used that are in mm rather than the SUN RGB-D format.
