alextrevithick / grf
🔥 General Radiance Field (ICCV, 2021)
My own data were captured by multiple different cameras with different intrinsics. Can GRF handle this case?
Thank you!
I'm using the positional encoder from the NeRF paper to encode my images after stacking the viewpoints onto the colors, as described in the paper, but I'm unable to get the shapes to line up for input into the CNN. In NeRF's implementation, the inputs are flattened before being sent to the encoder.
To give a concrete example, here is my code:
import numpy as np
import torch

# reshape inputs to [20, 378, 504, 6], concatenating the view onto the color channels
inputs = torch.tensor(np.concatenate([images, np.broadcast_to(np.expand_dims(C, (1, 2)), images.shape)], axis=-1))
# create the embedder with length 5 as specified in the paper
embed, input_ch = get_embedder(5, 0)
# flatten (not sure if this step is required); shape becomes [3810240, 6]
inputs_flat = torch.reshape(inputs, [-1, inputs.shape[-1]])
# apply the embedding to the flattened inputs for an output shape of [3810240, 66]
embedding = embed(inputs_flat)
I'm not really sure where to go from here to feed this into the CNN.
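For reference, one way to wire this up is to flatten the pixels, apply a NeRF-style positional encoding, reshape back to image layout, and permute to channels-first for the CNN. A minimal sketch (my own stand-in for `get_embedder`, with toy shapes standing in for [20, 378, 504, 6]):

```python
import torch

def positional_encoding(x, num_freqs=5, include_input=True):
    # NeRF-style encoding: [x, sin(2^k x), cos(2^k x)] for k in 0..num_freqs-1
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype)
    out = [x] if include_input else []
    for f in freqs:
        out.append(torch.sin(f * x))
        out.append(torch.cos(f * x))
    return torch.cat(out, dim=-1)

# Toy batch standing in for [20, 378, 504, 6]: flatten, encode, reshape back.
inputs = torch.rand(2, 4, 4, 6)
flat = inputs.reshape(-1, inputs.shape[-1])                # [32, 6]
enc = positional_encoding(flat, num_freqs=5)               # [32, 6 * (1 + 2*5)] = [32, 66]
enc_img = enc.reshape(*inputs.shape[:-1], enc.shape[-1])   # [2, 4, 4, 66]
# PyTorch CNNs expect channels-first, so permute before the conv stack:
cnn_input = enc_img.permute(0, 3, 1, 2)                    # [2, 66, 4, 4]
```

The reshape back is the key step: the encoding acts per-pixel, so flattening and unflattening around it is lossless.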
Hi, thanks for the great work!
I have some questions:
I'm rather confused by this section: you cite P as a function based on multi-view geometry and then describe two approximations. Do these approximations represent P? I'm also unsure how to implement them beyond checking whether a point falls inside or outside the image; specifically, how do you "duplicate its features to the 3D point"?
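One common reading of this kind of feature aggregation (my own sketch under pinhole-camera assumptions, not the authors' implementation): project the 3D point into the view with the camera matrices, and if it lands inside the image, bilinearly sample the CNN feature map at that pixel, i.e. "duplicate" those 2D features onto the 3D point; points projecting outside get zeros.

```python
import torch
import torch.nn.functional as F

def gather_point_features(feat_map, K, w2c, points):
    """feat_map: [C, H, W] CNN features for one view; K: [3, 3] intrinsics;
    w2c: [3, 4] world-to-camera extrinsics; points: [N, 3] world coordinates.
    Assumes points lie in front of the camera."""
    N = points.shape[0]
    homog = torch.cat([points, torch.ones(N, 1)], dim=-1)   # [N, 4]
    cam = (w2c @ homog.T).T                                 # [N, 3] camera coords
    pix = (K @ cam.T).T                                     # [N, 3]
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-8)           # [N, 2] pixel coords
    H, W = feat_map.shape[1:]
    # Normalize to [-1, 1] for grid_sample, then sample bilinearly.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
    sampled = F.grid_sample(feat_map[None], grid[None, :, None, :],
                            align_corners=True)             # [1, C, N, 1]
    feats = sampled[0, :, :, 0].T                           # [N, C]
    # Zero out points that project outside the image bounds.
    inside = (grid.abs() <= 1).all(dim=-1, keepdim=True)
    return feats * inside
```

This is only one plausible interpretation; nearest-neighbor sampling instead of bilinear would also match the text.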
I noticed that when training on the ShapeNet dataset, your dataset-loading module uses paths like "train" and "train_val", but this is inconsistent with the raw dataset that Vincent provides. May I ask how you organize your project's dataset folders? Thank you in advance.
May I ask how to get the intrinsic matrix of a photo if I want to train GRF on my own data? And without the intrinsic matrix, will GRF's performance significantly degrade?
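While waiting for an answer: a common way to approximate intrinsics for an ordinary photo (a standard pinhole assumption, not the authors' recommendation) is to derive the focal length in pixels from the EXIF focal length and sensor width, and place the principal point at the image center. Tools like COLMAP can also estimate intrinsics directly from the images.

```python
import numpy as np

def intrinsics_from_exif(focal_mm, sensor_width_mm, img_w, img_h):
    """Approximate pinhole intrinsics from the EXIF focal length, assuming
    square pixels and the principal point at the image center."""
    fx = focal_mm / sensor_width_mm * img_w  # focal length in pixels
    fy = fx
    cx, cy = img_w / 2.0, img_h / 2.0
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

# e.g. a 35 mm lens on a full-frame (36 mm wide) sensor, 1920x1080 image:
K = intrinsics_from_exif(35.0, 36.0, 1920, 1080)
```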
My GPU has 24 GB of memory.
I decreased the parameters as you suggested:
--chunk [number of rays processed in parallel, decrease if running out of memory] --netchunk [number of pts sent through network in parallel, decrease if running out of memory]
But I still run out of memory even when both parameters are set to 1.
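For context, the chunking these flags control is typically a simple loop like the sketch below (my own illustration, not the GRF code). Note that it mainly bounds per-batch ray/point memory at inference; during training, activations stored for backprop across all chunks can still dominate, so if memory blows up even at chunk=1 the culprit is likely elsewhere (e.g. the CNN forward pass over full-resolution images).

```python
import torch

def run_in_chunks(fn, inputs, chunk):
    """Apply fn to inputs in chunks along dim 0 and concatenate the results,
    bounding peak memory at the cost of more forward passes."""
    return torch.cat([fn(inputs[i:i + chunk])
                      for i in range(0, inputs.shape[0], chunk)], dim=0)

# e.g. 10_000 rays through a toy network, 1024 at a time:
net = torch.nn.Linear(6, 4)
rays = torch.rand(10_000, 6)
with torch.no_grad():
    out = run_in_chunks(net, rays, chunk=1024)
```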
Can it generate a 3D surface similar to PIFu?
In Figure 6 in Section 3.5, the input to the MLP is the 3D point feature and the viewpoint (x, y, z) (corresponding to the 3-D position in classical NeRF). I wonder whether the 2-D viewing direction is also needed as input to the MLP?
Hey, thanks for showing the great work.
I have a question on Figure 4,
"3) To further demonstrate the advantage of GRF over SRNs, we directly evaluate the trained SRNs model on unseen objects (of the same category) without retraining. For comparison, we also directly evaluate the trained GRF model on the same novel objects. Figure 4 shows the qualitative results. It can be seen that if not retrained, SRNs completely fails to".
Since the SRNs model is not trained on the unseen objects, the latent code z for an unseen object is not optimized, so is it randomly initialized? If so, how can it generate novel views of the unseen cars that resemble the GT views? My understanding is that a randomly initialized latent code z would produce unpredictable cars (though similar to the training set), which seems to conflict with the quoted passage. This has confused me for hours.
Hi, Alex
Can you tell me how much time is needed for training on ShapeNet v2 and the other datasets?
Thanks for sharing this very interesting work! Do you have an estimate when the code will be released? I'm thinking about whether I should wait or start implementing myself.
Hi, Alex
I notice that different CNN models are used for different datasets. I wonder whether there were any special considerations in designing these CNNs. And if I want to design a CNN for my own dataset, what should I pay attention to?
Thanks!
Dear author,
Can you provide pre-trained models on the NeRF dataset?
Can you give us more details on training the models (Group 1) on the NeRF dataset?
Thanks
Hello! Thanks for sharing your great work!
I wonder how to get the rendering result of the unseen category/scene.
According to Section 4.3, "We train a single model on randomly selected 4 scenes, i.e., Chair, Mic, Ship, and Hotdog, ...." I wonder how to train several classes in each image batch (e.g., all the NeRF Synthetic scenes together, to get a generalized GRF model). The provided configs each seem to contain only a single class.
My guess: if a batch contains 4 classes and each class has 8 views, then the batch has 32 images in total. Alternatively, a batch could contain 32 images of a single class, and training would see all classes sequentially, one by one.
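The first batching scheme guessed above (4 classes, 8 views each) could be sketched like this (purely hypothetical; not taken from the GRF configs):

```python
import random

def sample_mixed_batch(views_by_class, classes_per_batch=4, views_per_class=8, rng=random):
    """Hypothetical batch sampler matching the guess above: pick several
    classes, then a few (image, pose) views from each."""
    classes = rng.sample(list(views_by_class), classes_per_batch)
    batch = []
    for c in classes:
        batch.extend(rng.sample(views_by_class[c], views_per_class))
    return batch  # 4 * 8 = 32 (image, pose) entries

# toy data: 4 classes with 10 views each, each view an (image_id, pose_id) pair
data = {c: [(f"{c}_img{i}", f"{c}_pose{i}") for i in range(10)]
        for c in ["chair", "mic", "ship", "hotdog"]}
batch = sample_mixed_batch(data)
```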
Can you describe your train configuration for generalization in more detail?
Also, if 2 or 6 views are fed, are only those 2 or 6 images the input, or are the corresponding poses also required?
And, if possible, could you provide all the data required for Sections 4.1–4.3? That would help me understand it most clearly.
Please give me some hint. ;)
Cheers!