
gaussian-opacity-fields's Introduction

Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes

Zehao Yu · Torsten Sattler · Andreas Geiger


Gaussian Opacity Fields (GOF) enables geometry extraction directly from 3D Gaussians by identifying their level set. Our regularization improves surface reconstruction, and we use Marching Tetrahedra for adaptive and compact mesh extraction.


Updates

  • [2024.06.10]: 🔥 Improve the training speed by 2x with merged operations. 6 scenes in TNT dataset can be trained in ~24 mins and the bicycle scene in the Mip-NeRF 360 dataset can be trained in ~45 mins. Please pull the latest code and reinstall with pip install submodules/diff-gaussian-rasterization to use it.

Installation

Clone the repository and create an anaconda environment using

git clone git@github.com:autonomousvision/gaussian-opacity-fields.git
cd gaussian-opacity-fields

conda create -y -n gof python=3.8
conda activate gof

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
conda install cudatoolkit-dev=11.3 -c conda-forge

pip install -r requirements.txt

pip install submodules/diff-gaussian-rasterization
pip install submodules/simple-knn/

# tetra-nerf for triangulation
cd submodules/tetra-triangulation
conda install cmake
conda install conda-forge::gmp
conda install conda-forge::cgal
cmake .
# you can specify your own cuda path
# export CPATH=/usr/local/cuda-11.3/targets/x86_64-linux/include:$CPATH
make 
pip install -e .

Dataset

Please download the Mip-NeRF 360 dataset from the official website, the NeRF-Synthetic dataset from NeRF's official Google Drive, the preprocessed DTU dataset from 2DGS, and the preprocessed Tanks and Temples dataset from here. To evaluate geometry reconstruction on DTU, you also need to download the ground truth point clouds from the DTU dataset and save them to dtu_eval/Offical_DTU_Dataset. For the Tanks and Temples dataset, download the ground truth point clouds, alignments and cropfiles and save them to eval_tnt/TrainingSet, for example eval_tnt/TrainingSet/Caterpillar/Caterpillar.ply.

Training and Evaluation

# you might need to update the data path in the script accordingly

# NeRF-synthetic dataset
python scripts/run_nerf_synthetic.py

# Mip-NeRF 360 dataset
python scripts/run_mipnerf360.py

# Tanks and Temples dataset
python scripts/run_tnt.py

# DTU dataset
python scripts/run_dtu.py

Custom Dataset

We use the same data format as 3DGS; please follow here to prepare your dataset. Then you can train your model and extract a mesh (we use the Tanks and Temples dataset as an example):

# training
# -r 2 for using downsampled images with factor 2
# --use_decoupled_appearance to enable decoupled appearance modeling if your images have changing lighting conditions
python train.py -s TNT_GOF/TrainingSet/Caterpillar -m exp_TNT/Caterpillar -r 2 --use_decoupled_appearance

# extract the mesh after training
python extract_mesh.py -m exp_TNT/Caterpillar --iteration 30000

# you can open extracted mesh with meshlab or using the following script based on open3d
python mesh_viewer.py exp_TNT/Caterpillar/test/ours_30000/fusion/mesh_binary_search_7.ply
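
When several scenes need processing, the commands above can be assembled programmatically. The following is a minimal sketch that only builds the argument lists and the expected mesh location. The helper names are hypothetical, and the mesh path pattern is generalized from the single example path shown above, so treat it as an assumption rather than a documented output layout.

```python
from pathlib import Path
from typing import List


def train_command(source: str, model: str, resolution: int = 2,
                  decoupled_appearance: bool = True) -> List[str]:
    """Build the argv for train.py using only the flags shown above."""
    cmd = ["python", "train.py", "-s", source, "-m", model, "-r", str(resolution)]
    if decoupled_appearance:
        cmd.append("--use_decoupled_appearance")
    return cmd


def extract_command(model: str, iteration: int = 30000) -> List[str]:
    """Build the argv for extract_mesh.py."""
    return ["python", "extract_mesh.py", "-m", model, "--iteration", str(iteration)]


def expected_mesh_path(model: str, iteration: int = 30000) -> Path:
    # Assumption: pattern inferred from the one example path in the text above.
    return Path(model) / "test" / f"ours_{iteration}" / "fusion" / "mesh_binary_search_7.ply"
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root.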

Acknowledgements

This project is built upon 3DGS and Mip-Splatting. Regularizations and some visualizations are taken from 2DGS. Tetrahedra triangulation is taken from Tetra-NeRF. Marching Tetrahedra is adapted from the Kaolin library. Evaluation scripts for the DTU and Tanks and Temples datasets are taken from DTUeval-python and TanksAndTemples respectively. We thank all the authors for their great work and repos.

Citation

If you find our code or paper useful, please cite

@article{Yu2024GOF,
  author    = {Yu, Zehao and Sattler, Torsten and Geiger, Andreas},
  title     = {Gaussian Opacity Fields: Efficient High-quality Compact Surface Reconstruction in Unbounded Scenes},
  journal   = {arXiv:2404.10772},
  year      = {2024},
}

If you find the regularizations useful, please kindly cite

@inproceedings{Huang2DGS2024,
    title={2D Gaussian Splatting for Geometrically Accurate Radiance Fields},
    author={Huang, Binbin and Yu, Zehao and Chen, Anpei and Geiger, Andreas and Gao, Shenghua},
    publisher = {Association for Computing Machinery},
    booktitle = {SIGGRAPH 2024 Conference Papers},
    year      = {2024},
    doi       = {10.1145/3641519.3657428}
}

gaussian-opacity-fields's People

Contributors

hbb1, niujinshuchong, yuxuansnow

gaussian-opacity-fields's Issues

Poor Quality in Waymo Dataset

Thank you for your great work. I was impressed by the high quality of the mesh presented in your paper. However, when I attempted to run the code on the Waymo dataset, I found that the quality was poor. I suspect there might be an issue with the configuration. I used the pose generated from COLMAP. Here are some of my results:
image
image
image

I would be flattered if you could offer me some insights or suggestions on how to improve this.

Code release

Thanks for your excellent work. However, this repository and the 2D Gaussian Splatting repository are both empty. When will the code be released? Thanks

OOM when evaluate_dtu_mesh

Hi, thanks for your wonderful work!
I'm trying to reproduce the quantitative results in your paper on a single A100 GPU. I am able to generate the TSDF fusion mesh with voxel_size=0.002, but evaluate_dtu_mesh.py fails: the process is killed in the cull_mesh() function at:

components = mesh.split(only_watertight=False)

It takes over 1000 RAM and then is killed. Could you give me some suggestion or solution for this problem? @niujinshuchong
Thanks in advance!

Weird Surface Result on Tandt Dataset

Thank you for your great work! However, I have trouble with mesh extracting on the train scene of the tandt dataset. My command:

MODEL_PATH="tandt/train"

python train.py -s ../../datasets/tandt/train -m outputs/$MODEL_PATH -r 2 --use_decoupled_appearance
python extract_mesh.py -m outputs/$MODEL_PATH --iteration 30000
python extract_mesh_tsdf.py -m outputs/$MODEL_PATH --iteration 30000

I got such weird meshes:

image
image
image

I can't see any pattern in it. Is there anything wrong?

What's the difference between extract_mesh.py and extract_mesh_tsdf.py?

Which result of these two .py is the 'Ours' in the teaser?
When I run extract_mesh_tsdf.py, I get the following error:

Rendering progress: 0%| | 0/49 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "extract_mesh_tsdf.py", line 107, in <module>
    extract_mesh(model.extract(args), args.iteration, pipeline.extract(args))
  File "extract_mesh_tsdf.py", line 90, in extract_mesh
    tsdf_fusion(dataset.model_path, "test", iteration, cams, gaussians, pipeline, background, kernel_size)
  File "extract_mesh_tsdf.py", line 64, in tsdf_fusion
    frustum_block_coords = vbg.compute_unique_block_coordinates(
RuntimeError: [Open3D Error] (void open3d::t::geometry::kernel::voxel_grid::DepthTouchCUDA(std::shared_ptr<open3d::core::HashMap>&, const open3d::core::Tensor&, const open3d::core::Tensor&, const open3d::core::Tensor&, open3d::core::Tensor&, open3d::t::geometry::kernel::voxel_grid::index_t, float, float, float, float, open3d::t::geometry::kernel::voxel_grid::index_t)) /root/Open3D/cpp/open3d/t/geometry/kernel/VoxelBlockGridCUDA.cu:197: No block is touched in TSDF volume, abort integration. Please check specified parameters, especially depth_scale and voxel_size

Have you ever met this error?

Why are the images rendered by renderCUDA and integrateCUDA different?

Thanks for such a wonderful work!
I found the images rendered from renderCUDA and integrateCUDA are different:
rendered from renderCUDA:
test_0_rgb
rendered from integrateCUDA:
integrate_clamp_0_rgb

However, the processes of the two methods appear to be the same.
Why does this happen?
Also, have you tested your mesh extraction method on 2DGS?
Thank you very much!

Training is too slow.

The training speed is about several seconds per iteration, what could be the possible reason?
image

Bad mesh with marching tetrahedra on DTU

Hello,

First of all, thank you for your work! I wanted to test your code on my own dataset and started by testing it on the dtu scene 37 and I got very bad results using the marching tetrahedra mesh extraction. I ran the run_dtu.py script only on this scene and I got the following results:
This is the mesh result with TSDF:
dtu37_gof_snapshot01

And here with Marching tetrahedra:
dtu37_gof_binary_search_snapshot02
The scene is not recognizable at all... Do you know why this is the case? I was under the impression that marching tetrahedra should give better results than tsdf.

Also, it took over an hour to train on an NVIDIA RTX A6000, whereas the paper mentions about 30 minutes. Did you use specific training parameters for the results in the paper?

Best regards,
Brianne

Dataset preprocess.

Could you kindly inform me about the process used to generate the DTU dataset you're utilizing? Was it generated using GT pose + COLMAP, or was COLMAP employed directly for processing?

When I run the `extract_mesh.py` I meet an error.

ImportError: cannot import name 'tetranerf_cpp_extension' from partially initialized module 'tetranerf.utils.extension' (most likely due to a circular import) (/home/hh/anaconda3/envs/gof/lib/python3.8/site-packages/tetranerf/utils/extension/__init__.py)
How can I solve this problem? Has anyone else met this error?

RuntimeError: CUDA out of memory

Hello, thank you for your work.
I would like to train using my own dataset, but even with the 24GB VRAM of my RTX 3090 GPU, it is not enough. I am only using 200 images. How can I solve this problem?
image

Question about add_densification_stats

I am confused about why the code here uses the values from index 2 onward in the last dimension of viewspace_point_tensor.grad to accumulate the absolute value of the gradient, while the original Gaussian Splatting uses the first two values instead. I can't find an explanation in equations (14) and (15) of the paper.

self.xyz_gradient_accum_abs[update_filter] += torch.norm(viewspace_point_tensor.grad[update_filter,2:], dim=-1, keepdim=True)

Int overflow cuda error

Thanks for sharing this great work, I encounter an integer overflow error when calling your rasterizer, could you help me? thanks!

Exception has occurred: RuntimeError
numel: integer multiplication overflow
.......
.......
RuntimeError: numel: integer multiplication overflow

How does filter_3D affect?

Thanks for your great project! I have some questions:

  • How does filter_3D affect? like scale or opacity
  • There are many artifacts in the sky, but not in the original 3d-gs

mesh_extract error!

Thank you very much for the code you provided. When I was preparing to extract the mesh after training the model, I encountered an error as shown in the following figure. I am eager to get your help and hope to get your reply in time!
image

issue on pip install submodules/diff-gaussian-rasterization even if i have exponential.hpp under submodules/diff-gaussian-rasterization/third_party/glm/glm.

Hi,

it is an excellent work. But I still have issue on pip install submodules/diff-gaussian-rasterization even if i have exponential.hpp under submodules/diff-gaussian-rasterization/third_party/glm/glm.

      RuntimeError: The current installed version of g++ (11.4.0) is greater than the maximum required version by CUDA 11.3 (10.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=10.0.0).
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for diff_gaussian_rasterization
  Running setup.py clean for diff_gaussian_rasterization
Failed to build diff_gaussian_rasterization
ERROR: Could not build wheels for diff_gaussian_rasterization, which is required to install pyproject.toml-based projects
(gof) weihwang@0091c4c-lcelt:~/Documents/gaussian-opacity-fields$ 
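
The log above reports that g++ 11.4.0 is outside the range accepted for CUDA 11.3. The following is a minimal sketch of the same comparison, useful for checking a machine before building; the function names are hypothetical and the bounds are copied from the error message rather than from any official compatibility table.

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Return the first X.Y.Z triple found in text, or None if there is none."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(g) for g in match.groups()) if match else None


def gxx_supported(version: Tuple[int, ...],
                  minimum: Tuple[int, ...] = (5, 0, 0),
                  maximum: Tuple[int, ...] = (10, 0, 0)) -> bool:
    # Bounds taken from the error message above (>=5.0.0, <=10.0.0).
    return minimum <= version <= maximum
```

The version string can be obtained from the output of `g++ --version`. If the check fails, a common workaround (not confirmed by the thread) is to install an older g++ and select it for the build, for example through the `CC` and `CXX` environment variables.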

Mistakes in reconstructing Barn and Courthouse of TNT dataset

Thank you for your excellent work! I tried this algorithm on the Barn and Courthouse scenes of the TNT dataset, and the results show some errors.

  1. Some noise/floaters appear near the camera, as follows:
    Image 5
    May I ask why this happens? I guess they may be Gaussians that should have been removed during training.

  2. The extracted mesh has floaters in the sky/background, as follows:
    Image 6
    I need to remove these parts of the mesh to get a good result. Why do they appear, and how can I avoid them during training?

The Gaussian opacity fields you designed show great performance. May I ask one more question: why does this method perform so well in surface reconstruction? Could you please point out the reason in the paper or explain it in the comments?

Killed when run dtu_eval/eval.py

image
Hi, after I extracted the DTU mesh and ran the evaluation code, the process was killed. I thought it might be caused by some noisy triangles, so I removed these artifacts manually and evaluated again, but it was still killed.

No texture or vertex color

Hi! Thanks for your great project! When I extract the mesh, I cannot get vertex colors or a texture. Is there anything I can do to get the color?

Some questions about formula (10) in the paper

Your work is really wonderful and gives me a lot of inspiration. But I have a little question. In your paper, you define the opacity of a 3D point x as the minimal opacity value among all training views or viewing directions by formula (10), but what you wrote under the symbol {min} was (r, t). I supposed that it should be written as (o, r). I'm not sure about it. It would be very kind of you if you can help me.

tetra-triangulation , make , error :

win11  wsl2  ubuntu-22.04 cuda-12.1

(cuda121) root@LZH5:/mnt/e/gaus/submodules/tetra-triangulation# make
[ 33%] Building CXX object CMakeFiles/tetranerf_cpp_extension.dir/src/triangulation.cpp.o
In file included from /mnt/e/AI/A2Q/240426/wsl_cuda121/submodules/tetra-triangulation/src/triangulation.cpp:1:
/mnt/e/AI/A2Q/240426/wsl_cuda121/submodules/tetra-triangulation/src/triangulation.h:1:10: 
fatal error: cuda_runtime.h: No such file or directory
    1 | #include <cuda_runtime.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/tetranerf_cpp_extension.dir/build.make:76: 
CMakeFiles/tetranerf_cpp_extension.dir/src/triangulation.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:100: CMakeFiles/tetranerf_cpp_extension.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

RuntimeError: CUDA error: an illegal memory access was encountered

Hi, I tried to modify the codes of Gaussians Representation while keeping your rasterizer parts (for the using of normal map and depth map). It works well at first, but I got:
RuntimeError: CUDA error: an illegal memory access was encountered
at line 114 of your gaussian-opacity-fields/gaussian_renderer/__init__.py, that is:
"visibility_filter" : radii > 0

This problem occurs at around iteration 2k-5k and appears random; I cannot figure out why.
So I wonder whether there is a robustness problem in the CUDA code?

Fail to build diff-gaussian-rasterization

Hi, thanks for your nice work!
I am trying to run your code by following the steps provided but fail to build diff-gaussian-rasterization. Since I can't directly build with command pip install submodules/diff-gaussian-rasterization, I build it with command python setup.py install. However, it still doesn't work. This is the snapshot of error message:
image
My system configuration is Ubuntu20.04 / RTX-2080ti.

Slow training and no texture mesh

Hello, thanks for your excellent work!

Here I meet two problems mainly:

  1. When I run the original Gaussian Splatting on the MIP NeRF 360_v2/kitchen scene, it achieves around 25 it/s, as shown below.
    image

    But training with GOF is slow: it takes more than two hours in total (3.5 it/s) for this scene, as shown below.
    image

  2. There is no texture on the mesh. How can I configure it to add color?
    Image

System:
Linux 22.04
RTX 4090

Thanks in advance if someone solves it.
Jonas

How to cull the mesh of tnt ?

Hi, thanks for your great work. I trained on the TNT datasets and successfully exported the mesh. Now I want to evaluate the F1 score, but I am stuck at eval_tnt/cull_mesh.py.
I have two questions about eval_tnt/cull_mesh.py:

  1. How should I specify --traj-path? Is it "TNT_GOF/TrainingSet/Barn/Barn_COLMAP_SfM.log"?
  2. I am confused by line 28: mesh = trimesh.load("/home/yuzh/mnt/A3_data/sdfstudio/meshes_tnt/bakedangelo/Barn_fullres_1024.ply"). Should I change it to my exported mesh?

When I run cmake ., I encounter the following issue. Moreover, the environment setup is the same as on GitHub.

CMake Error at /home/wayne/anaconda3/envs/gof/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:751 (message):
Compiling the CUDA compiler identification source file
"CMakeCUDACompilerId.cu" failed.

Compiler: /usr/bin/nvcc

Build flags:

Id flags: --keep;--keep-dir;tmp -v

The output was:

255

#$ SPACE=

#$ CUDART=cudart

#$ HERE=/usr/lib/nvidia-cuda-toolkit/bin

#$ THERE=/usr/lib/nvidia-cuda-toolkit/bin

#$ TARGET_SIZE=

#$ TARGET_DIR=

#$ TARGET_SIZE=64

#$ NVVMIR_LIBRARY_DIR=/usr/lib/nvidia-cuda-toolkit/libdevice

#$
PATH=/usr/lib/nvidia-cuda-toolkit/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/home/wayne/anaconda3/envs/gof/bin:/home/wayne/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

#$ LIBRARIES= -L/usr/lib/x86_64-linux-gnu/stubs -L/usr/lib/x86_64-linux-gnu

#$ rm tmp/a_dlink.reg.c

#$ gcc -D__CUDA_ARCH__=300 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS
-D__CUDACC__ -D__NVCC__ -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=1
-D__CUDACC_VER_BUILD__=243 -include "cuda_runtime.h" -m64
"CMakeCUDACompilerId.cu" > "tmp/CMakeCUDACompilerId.cpp1.ii"

#$ cicc --c++14 --gnu_version=90400 --allow_managed -arch compute_30 -m64
-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name
"CMakeCUDACompilerId.fatbin.c" -tused -nvvmir-library
"/usr/lib/nvidia-cuda-toolkit/libdevice/libdevice.10.bc"
--gen_module_id_file --module_id_file_name
"tmp/CMakeCUDACompilerId.module_id" --orig_src_file_name
"CMakeCUDACompilerId.cu" --gen_c_file_name
"tmp/CMakeCUDACompilerId.cudafe1.c" --stub_file_name
"tmp/CMakeCUDACompilerId.cudafe1.stub.c" --gen_device_file_name
"tmp/CMakeCUDACompilerId.cudafe1.gpu" "tmp/CMakeCUDACompilerId.cpp1.ii" -o
"tmp/CMakeCUDACompilerId.ptx"

#$ ptxas -arch=sm_30 -m64 "tmp/CMakeCUDACompilerId.ptx" -o
"tmp/CMakeCUDACompilerId.sm_30.cubin"

ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'

--error 0xff --

Call Stack (most recent call first):
/home/wayne/anaconda3/envs/gof/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
/home/wayne/anaconda3/envs/gof/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
/home/wayne/anaconda3/envs/gof/share/cmake-3.26/Modules/CMakeDetermineCUDACompiler.cmake:307 (CMAKE_DETERMINE_COMPILER_ID)
CMakeLists.txt:2 (project)

-- Configuring incomplete, errors occurred!

encounter issues when I train playroom dataset

When I try to train on the playroom dataset, I encounter the following issues:

(gof) xxx@xxx:~/Documents/gaussian-opacity-fields$ python train.py -s /home/xxx/Documents/datasets/Playroom/ -m exp_playroom/release -r 2 --use_decoupled_appearance
Optimizing exp_playroom/release
Output folder: exp_playroom/release
Tensorboard not available: not logging progress
Traceback (most recent call last):
  File "train.py", line 374, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 94, in training
    scene = Scene(dataset, gaussians)
  File "/home/xxx/Documents/gaussian-opacity-fields/scene/__init__.py", line 52, in __init__
    assert False, "Could not recognize scene type!"
AssertionError: Could not recognize scene type!

The structure of playroom is as follows:
Playroom
├── images
│ ├── model_split_0
│ │ ├── 000
│ │ │ ├── frame0.png
│ │ │ └── frame1.png
│ │ ├── .
│ │ ├── .
│ │ ├── .
│ │ └── 1249
│ │ ├── frame0.png
│ │ └── frame1.png
│ ├── model_split_1
│ ├── .
│ ├── .
│ ├── .
│ └── model_split_31
├── segments
└── meta.json
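
The assertion in the traceback is raised when the loader cannot classify the source folder. In the original 3DGS loader, which this project builds upon, a scene is treated as COLMAP data if it contains a `sparse` directory and as Blender-style data if it contains `transforms_train.json`. Assuming this repository keeps the same checks (an assumption, not verified against its `scene/__init__.py`), the folder shown above has neither marker. A minimal sketch of that detection, with a hypothetical function name:

```python
import os
from typing import Optional


def detect_scene_type(source_path: str) -> Optional[str]:
    """Classify a dataset folder the way the 3DGS loader is assumed to.

    Returns "colmap", "blender", or None when no known marker is present.
    """
    # COLMAP reconstructions are expected under <source>/sparse
    if os.path.isdir(os.path.join(source_path, "sparse")):
        return "colmap"
    # NeRF-synthetic (Blender) scenes ship a transforms_train.json file
    if os.path.isfile(os.path.join(source_path, "transforms_train.json")):
        return "blender"
    return None
```

Running this on a dataset folder before training shows whether the data still needs to be converted to one of the two layouts.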

colmap camera model not handled

I shot a video using the ordinary lens of an iPhone. The data has no distortion, but when I train, it prompts me that a distorted camera model was used. How can I solve this? Thank you.
image

Why pixf is added by 0.5

Thanks for releasing the code. I was wondering in renderCUDA, why 0.5 is added to pixf? Thank you.

float2 pixf = { (float)pix.x + 0.5f, (float)pix.y + 0.5f}; // TODO plus 0.5
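
A common rasterization convention treats pixel `i` as covering the interval `[i, i + 1)` and samples it at its center `i + 0.5`. Whether this is the motivation behind the line above is an assumption; the sketch below only illustrates the convention, and both helper names are hypothetical.

```python
from typing import Tuple


def pixel_center(ix: int, iy: int) -> Tuple[float, float]:
    """Continuous coordinates of the center of integer pixel (ix, iy)."""
    return (ix + 0.5, iy + 0.5)


def pixel_to_ndc(p: float, size: int) -> float:
    """Map a continuous pixel coordinate in [0, size] to the range [-1, 1]."""
    return 2.0 * p / size - 1.0
```

With center sampling, the first and last pixels of a row map to positions that are symmetric about the image center, which is not the case when the integer index is used directly.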

For tetra-triangulation,make error

Thanks for your impressive work! when I compile the tetra-triangulation, I meet this error in make phase:

error: template argument 1 is invalid
  132 | GAL::compute_average_spacing<Concurrency_tag>(L, nb_neighbors);

Question about min_value

Thanks a lot for releasing the code. I have two quick questions about the following lines in forward.cu:

double min_value = -(BB/AA) * (BB/4.) + CC;
float power = -0.5f * min_value;  
if (power > 0.0f){
    power = 0.0f;
}

My understanding is that min_value gives G(t*) = alpha_k * exp(power). Why is power set to zero when it is positive? power > 0 means -(BB/AA) * (BB/4.) + CC < 0, i.e., BB^2 - 4*AA*CC > 0 (for AA > 0). Why is the effect of power ignored in this case?

Thanks a lot.
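
Assuming `AA`, `BB` and `CC` are the coefficients of a quadratic `AA*t^2 + BB*t + CC` in the ray parameter `t` with `AA > 0` (an assumption, since the snippet does not show their definitions), the expression for `min_value` is the value of that quadratic at its vertex `t* = -BB / (2*AA)`, namely `CC - BB^2 / (4*AA)`. The sketch below mirrors the arithmetic of the CUDA lines in plain Python; the function names are hypothetical.

```python
def quadratic_min(aa: float, bb: float, cc: float) -> float:
    """Minimum of aa*t^2 + bb*t + cc for aa > 0, written as in the CUDA line."""
    return -(bb / aa) * (bb / 4.0) + cc


def clamped_power(aa: float, bb: float, cc: float) -> float:
    """Exponent -0.5 * min_value, clamped so that it never exceeds zero."""
    power = -0.5 * quadratic_min(aa, bb, cc)
    return min(power, 0.0)
```

One plausible reading of the clamp: if the quadratic represents a squared distance it cannot be negative in exact arithmetic, so a positive `power` would only arise from rounding, and clamping keeps `exp(power) <= 1`. This interpretation is not confirmed in the thread.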

training time

The function depths_to_points() is the same as 2DGS. This improvement may also be useful for gaussian-opacity-fields.

grid_x, grid_y = torch.meshgrid(torch.arange(W)+0.5, torch.arange(H)+0.5, indexing='xy')
grid_x, grid_y = torch.meshgrid(torch.arange(W, device='cuda').float(), torch.arange(H, device='cuda').float(),indexing='xy')

tetra-triangulation , make , error : CGAL

win11  wsl2  ubuntu-22.04  cuda-11.8  python-3.9  CGAL-5.6.1

(cuda118) root@LZH5:/mnt/e/gaus/submodules/tetra-triangulation# make
[ 33%] Building CXX object CMakeFiles/tetranerf_cpp_extension.dir/src/triangulation.cpp.o
/mnt/e/gaus/submodules/tetra-triangulation/src/triangulation.cpp:27:15: 
error: ‘Parallel_if_available_tag’ in namespace ‘CGAL’ does not name a type
   27 | typedef CGAL::Parallel_if_available_tag Concurrency_tag;

How to calculate the point normal

Thanks to the author for open source this great work.
I'm confused about calculating the normal of a point in the code:

point-compute

In the paper, the point normal is computed with the following formula:
point-compute2

There is a big gap between the formula in the paper and the implementation in the code. Can you explain it in more detail? Thank you very much.

missing file "/Calibration/cal18/pos_xxx.txt" when evaluating on DTU

Hi,

I evaluate DTU using python scripts/run_dtu.py and the following error arises:

 File "evaluate_dtu_mesh.py", line 214, in <module>
    evaluate_mesh(model.extract(args), args.iteration, args.DTU)
  File "evaluate_dtu_mesh.py", line 148, in evaluate_mesh
    dtu_cameras = load_dtu_camera(args.DTU)
  File "evaluate_dtu_mesh.py", line 65, in load_dtu_camera
    projection = np.loadtxt(fname, dtype=np.float32)
  File "/public/home/dlee/anaconda3/envs/gof/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1356, in loadtxt
    arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
  File "/public/home/dlee/anaconda3/envs/gof/lib/python3.8/site-packages/numpy/lib/npyio.py", line 975, in _read
    fh = np.lib._datasource.open(fname, 'rt', encoding=encoding)
  File "/public/home/dlee/anaconda3/envs/gof/lib/python3.8/site-packages/numpy/lib/_datasource.py", line 193, in open
    return ds.open(path, mode, encoding=encoding, newline=newline)
  File "/public/home/dlee/anaconda3/envs/gof/lib/python3.8/site-packages/numpy/lib/_datasource.py", line 533, in open
    raise FileNotFoundError(f"{path} not found.")
FileNotFoundError: dtu_eval/Offical_DTU_Dataset/Calibration/cal18/pos_001.txt not found.
Finished job on GPU 0 with scene 24

This is caused by the absence of the Calibration/cal18/pos_xxx.txt file in DTU. How to obtain this file, could you offer further details? Thanks!
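
Before launching the evaluation, the presence of the calibration files can be checked up front. The following is a minimal sketch with a hypothetical function name; the `pos_XXX.txt` naming with three-digit, one-based indices is inferred from the single filename in the traceback, and the number of views must be supplied by the caller.

```python
import os
from typing import List


def missing_calibration_files(dtu_root: str, num_views: int) -> List[str]:
    """List the pos_XXX.txt files absent under <dtu_root>/Calibration/cal18."""
    cal_dir = os.path.join(dtu_root, "Calibration", "cal18")
    # Assumption: files are numbered pos_001.txt, pos_002.txt, ...
    expected = [os.path.join(cal_dir, f"pos_{i:03d}.txt")
                for i in range(1, num_views + 1)]
    return [path for path in expected if not os.path.isfile(path)]
```

For example, `missing_calibration_files("dtu_eval/Offical_DTU_Dataset", n)` returns an empty list when all `n` files are in place.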

Question about TnT evaluation

Thank you for your excellent work! I find that you use a new trajectory file for training and align it with the original trajectory for testing. But the evaluation code uses the new trajectory as the original trajectory.

colmap_ref_logfile = os.path.join(dataset_dir, scene + "_COLMAP_SfM.log")

I think this should be modified as

colmap_ref_logfile = os.path.join(dataset_dir, scene + "_COLMAP_SfM_original.log") 

extract mesh

image
Snipaste_2024-05-27_11-24-40
I ran the code to extract the mesh, but the resulting mesh looks like the one shown in the picture. Could you tell me what might be causing this issue? I am testing it with the mipnerf dataset

Dynamic Object

Thank you very much for sharing such a high-quality method for reconstructing geometric models. I have found that it can restore real geometric details with high precision. However, I have also noticed some defects in the geometric models in scenes containing dynamic objects, such as moving cars being reconstructed below the ground level. I suspect this is because the moving object only appears in one image, and it is hidden in a slanted hole below the ground, making it visible from a specific angle but invisible from all others. My question is, which parameters could be adjusted or what code modifications could be made to partially solve this problem, if it cannot be completely resolved.

The moving car was captured in only one photo:
image
moving car reconstructed into a hole (or cave):
image
image

nvrtc: error: invalid value for --gpu-architecture (-arch)

When I try to run the following command to train the playroom dataset provided by the 3DGS repository:

python train.py -s ~/Documents/datasets/playroom/ -m exp_playroom/release -r 2 --use_decoupled_appearance

I encounter the following issues:

(gof) xxx@xxx:~/Documents/gaussian-opacity-fields$ python train.py -s ~/Documents/datasets/tandt_db/db/playroom/ -m exp_playroom/release -r 2 --use_decoupled_appearance
Optimizing exp_playroom/release
Output folder: exp_playroom/release
Tensorboard not available: not logging progress
Reading camera 225/225
Loading Training Cameras
Loading Test Cameras
Number of points at initialisation :  37005
Computing 3D filter
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 374, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 148, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background, kernel_size=dataset.kernel_size)
  File "/home/xxx/Documents/gaussian-opacity-fields/gaussian_renderer/__init__.py", line 60, in render
    opacity = pc.get_opacity_with_3D_filter
  File "/home/xxx/Documents/gaussian-opacity-fields/scene/gaussian_model.py", line 141, in get_opacity_with_3D_filter
    det1 = scales_square.prod(dim=1)
RuntimeError: 
  #define POS_INFINITY __int_as_float(0x7f800000)
  #define INFINITY POS_INFINITY
  #define NEG_INFINITY __int_as_float(0xff800000)
  #define NAN __int_as_float(0x7fffffff)

  typedef long long int int64_t;
  typedef unsigned int uint32_t;
  typedef signed char int8_t;
  typedef unsigned char uint8_t;  // NOTE: this MUST be "unsigned char"! "char" is equivalent to "signed char"
  typedef short int16_t;
  static_assert(sizeof(int64_t) == 8, "expected size does not match");
  static_assert(sizeof(uint32_t) == 4, "expected size does not match");
  static_assert(sizeof(int8_t) == 1, "expected size does not match");
  constexpr int num_threads = 128;
  constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live
  constexpr int block_work_size = thread_work_size * num_threads;
  //TODO use _assert_fail, because assert is disabled in non-debug builds
  #define ERROR_UNSUPPORTED_CAST assert(false);

  
  
  
  namespace std {
  
  using ::signbit;
  using ::isfinite;
  using ::isinf;
  using ::isnan;
  
  using ::abs;
  
  using ::acos;
  using ::acosf;
  using ::asin;
  using ::asinf;
  using ::atan;
  using ::atanf;
  using ::atan2;
  using ::atan2f;
  using ::ceil;
  using ::ceilf;
  using ::cos;
  using ::cosf;
  using ::cosh;
  using ::coshf;
  
  using ::exp;
  using ::expf;
  
  using ::fabs;
  using ::fabsf;
  using ::floor;
  using ::floorf;
  
  using ::fmod;
  using ::fmodf;
  
  using ::frexp;
  using ::frexpf;
  using ::ldexp;
  using ::ldexpf;
  
  using ::log;
  using ::logf;
  
  using ::log10;
  using ::log10f;
  using ::modf;
  using ::modff;
  
  using ::pow;
  using ::powf;
  
  using ::sin;
  using ::sinf;
  using ::sinh;
  using ::sinhf;
  
  using ::sqrt;
  using ::sqrtf;
  using ::tan;
  using ::tanf;
  
  using ::tanh;
  using ::tanhf;
  
  using ::acosh;
  using ::acoshf;
  using ::asinh;
  using ::asinhf;
  using ::atanh;
  using ::atanhf;
  using ::cbrt;
  using ::cbrtf;
  
  using ::copysign;
  using ::copysignf;
  
  using ::erf;
  using ::erff;
  using ::erfc;
  using ::erfcf;
  using ::exp2;
  using ::exp2f;
  using ::expm1;
  using ::expm1f;
  using ::fdim;
  using ::fdimf;
  using ::fmaf;
  using ::fma;
  using ::fmax;
  using ::fmaxf;
  using ::fmin;
  using ::fminf;
  using ::hypot;
  using ::hypotf;
  using ::ilogb;
  using ::ilogbf;
  using ::lgamma;
  using ::lgammaf;
  using ::llrint;
  using ::llrintf;
  using ::llround;
  using ::llroundf;
  using ::log1p;
  using ::log1pf;
  using ::log2;
  using ::log2f;
  using ::logb;
  using ::logbf;
  using ::lrint;
  using ::lrintf;
  using ::lround;
  using ::lroundf;
  
  using ::nan;
  using ::nanf;
  
  using ::nearbyint;
  using ::nearbyintf;
  using ::nextafter;
  using ::nextafterf;
  using ::remainder;
  using ::remainderf;
  using ::remquo;
  using ::remquof;
  using ::rint;
  using ::rintf;
  using ::round;
  using ::roundf;
  using ::scalbln;
  using ::scalblnf;
  using ::scalbn;
  using ::scalbnf;
  using ::tgamma;
  using ::tgammaf;
  using ::trunc;
  using ::truncf;
  
  } // namespace std
  
  

  // NB: Order matters for this macro; it is relied upon in
  // _promoteTypesLookup and the serialization format.
  // Note, some types have ctype as void because we don't support them in codegen
  #define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(_) \
  _(uint8_t, Byte) /* 0 */                               \
  _(int8_t, Char) /* 1 */                                \
  _(int16_t, Short) /* 2 */                              \
  _(int, Int) /* 3 */                                    \
  _(int64_t, Long) /* 4 */                               \
  _(at::Half, Half) /* 5 */                                  \
  _(float, Float) /* 6 */                                \
  _(double, Double) /* 7 */                              \
  _(std::complex<at::Half>, ComplexHalf) /* 8 */        \
  _(std::complex<float>, ComplexFloat) /* 9 */                          \
  _(std::complex<double>, ComplexDouble) /* 10 */                         \
  _(bool, Bool) /* 11 */                                 \
  _(void, QInt8) /* 12 */                          \
  _(void, QUInt8) /* 13 */                        \
  _(void, QInt32) /* 14 */                        \
  _(at::BFloat16, BFloat16) /* 15 */                             \

  #define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_QINT(_)       \
  _(uint8_t, Byte)                                                 \
  _(int8_t, Char)                                                  \
  _(int16_t, Short)                                                \
  _(int, Int)                                                      \
  _(int64_t, Long)                                                 \
  _(at::Half, Half)                                                \
  _(float, Float)                                                  \
  _(double, Double)                                                \
  _(std::complex<at::Half>, ComplexHalf)                           \
  _(std::complex<float>, ComplexFloat)                             \
  _(std::complex<double>, ComplexDouble)                           \
  _(bool, Bool)                                                    \
  _(at::BFloat16, BFloat16)


  enum class ScalarType : int8_t {
  #define DEFINE_ENUM(_1, n) n,
  AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_ENUM)
  #undef DEFINE_ENUM
      Undefined,
  NumOptions
  };

  template <typename T, int size>
  struct Array {
  T data[size];

  __device__ T operator[](int i) const {
      return data[i];
  }
  __device__ T& operator[](int i) {
      return data[i];
  }
  Array() = default;
  Array(const Array&) = default;
  Array& operator=(const Array&) = default;
  __device__ Array(T x) {
    for (int i = 0; i < size; i++) {
      data[i] = x;
    }
  }
  };

  
  
  
  
  



  template <typename T>
  struct DivMod {
  T div;
  T mod;

  __device__ DivMod(T _div, T _mod) {
      div = _div;
      mod = _mod;
  }
  };

  //<unsigned int>
  struct IntDivider {
  IntDivider() = default;

  __device__ inline unsigned int div(unsigned int n) const {
  unsigned int t = __umulhi(n, m1);
  return (t + n) >> shift;
  }

  __device__ inline unsigned int mod(unsigned int n) const {
  return n - div(n) * divisor;
  }

  __device__ inline DivMod<unsigned int> divmod(unsigned int n) const {
  unsigned int q = div(n);
  return DivMod<unsigned int>(q, n - q * divisor);
  }

  unsigned int divisor;  // d above.
  unsigned int m1;  // Magic number: m' above.
  unsigned int shift;  // Shift amounts.
  };

  template <int NARGS>
  struct TrivialOffsetCalculator {
    // The offset for each argument. Wrapper around fixed-size array.
    // The offsets are in # of elements, not in bytes.
    Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
      Array<unsigned int, NARGS> offsets;
      #pragma unroll
      for (int arg = 0; arg < NARGS; arg++) {
        offsets[arg] = linear_idx;
      }
      return offsets;
    }
  };

  template<int NARGS>
  struct OffsetCalculator {
  OffsetCalculator() = default;
  __device__ __forceinline__ Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
      Array<unsigned int, NARGS> offsets;
      #pragma unroll
      for (int arg = 0; arg < NARGS; ++arg) {
      offsets[arg] = 0;
      }

      #pragma unroll
      for (int dim = 0; dim < 25; ++dim) {
      if (dim == dims) {
          break;
      }

      auto divmod = sizes_[dim].divmod(linear_idx);
      linear_idx = divmod.div;

      #pragma unroll
      for (int arg = 0; arg < NARGS; ++arg) {
          offsets[arg] += divmod.mod * strides_[dim][arg];
      }
      //printf("offset calc thread dim size stride offset %d %d %d %d %d %d %d %d\n",
      //threadIdx.x, dim, sizes_[dim].divisor, strides_[dim][0], offsets[0], linear_idx, divmod.div, divmod.mod);
      }
      return offsets;
  }

    int dims;
    IntDivider sizes_[25];
    // NOTE: this approach will not support nInputs == 0
    unsigned int strides_[25][NARGS];
  };



  #define C10_HOST_DEVICE __host__ __device__
  #define C10_DEVICE __device__

  template <typename T>
  __device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
  {
    return __shfl_down_sync(mask, value, delta, width);
  }


  #if 0
  template <typename T>
  __device__ __forceinline__ std::complex<T> WARP_SHFL_DOWN(std::complex<T> value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
  {
    return std::complex<T>(
        __shfl_down_sync(mask, value.real(), delta, width),
        __shfl_down_sync(mask, value.imag(), delta, width));
  }
  #endif

  // aligned vector generates vectorized load/store on CUDA
  template<typename scalar_t, int vec_size>
  struct alignas(sizeof(scalar_t) * vec_size) aligned_vector {
    scalar_t val[vec_size];
  };


  C10_HOST_DEVICE static void reduce_fraction(size_t &numerator, size_t &denominator) {
    // get GCD of num and denom using Euclid's algorithm.
    // Can replace this with std::gcd if we ever support c++17.
    size_t a = denominator;
    size_t b = numerator;
    while (b != 0) {
        a %= b;
        // swap(a,b)
        size_t tmp = a;
        a = b;
        b = tmp;
    }

    // a is now the GCD
    numerator /= a;
    denominator /= a;
  }




  struct ReduceConfig {
  //has to match host-side ReduceConfig in the eager code
  static constexpr int BLOCK_X = 0;
  static constexpr int BLOCK_Y = 1;
  static constexpr int CTA = 2;

  static constexpr int input_vec_size = 4;
  int element_size_bytes;
  int num_inputs;
  int num_outputs;
  int step_input = 1;
  int step_output = 1;
  int ctas_per_output = 1;
  int input_mult[3] = {0, 0, 0};
  int output_mult[2] = {0, 0};

  int block_width;
  int block_height;
  int num_threads;

  bool vectorize_input = false;
  int output_vec_size = 1;

  C10_HOST_DEVICE bool should_block_x_reduce() const {
    return input_mult[BLOCK_X] != 0;
  }

  C10_HOST_DEVICE bool should_block_y_reduce() const {
    return input_mult[BLOCK_Y] != 0;
  }

  C10_HOST_DEVICE bool should_global_reduce() const {
    return input_mult[CTA] != 0;
  }

  C10_DEVICE bool should_store(int output_idx) const {
    return output_idx < num_outputs &&
      (!should_block_x_reduce() || threadIdx.x == 0) &&
      (!should_block_y_reduce() || threadIdx.y == 0);
  }

  C10_DEVICE bool should_reduce_tail() const {
    return (!should_block_y_reduce() || threadIdx.y == 0) &&
      (!should_global_reduce() || blockIdx.y == 0);
  }

  C10_HOST_DEVICE int input_idx() const {
    int lane = threadIdx.x;
    int warp = threadIdx.y;
    int cta2 = blockIdx.y;
    return (lane * input_mult[BLOCK_X] +
            warp * input_mult[BLOCK_Y] +
            cta2 * input_mult[CTA]);
  }

  template <int output_vec_size>
  C10_HOST_DEVICE int output_idx() const {
    int lane = threadIdx.x;
    int warp = threadIdx.y;
    int cta1 = blockIdx.x;
    return (lane * output_mult[BLOCK_X] +
            warp * output_mult[BLOCK_Y] +
            cta1 * step_output) * output_vec_size;
  }

  C10_DEVICE int shared_memory_offset(int offset) const {
    return threadIdx.x + (threadIdx.y + offset) * blockDim.x;
  }

  C10_DEVICE int staging_memory_offset(int cta2) const {
    int offset = cta2 + blockIdx.x * gridDim.y;
    if (!should_block_x_reduce()) {
      offset = threadIdx.x + offset * blockDim.x;
    }
    return offset;
  }


  };


//TODO this will need to be different for more generic reduction functions
namespace reducer {

  using scalar_t = float;
  using arg_t = float;
  using out_scalar_t = float;


  inline __device__ arg_t combine(arg_t a, arg_t b) { return a * b; }

  inline __device__ out_scalar_t project(arg_t arg) {
    return (out_scalar_t) arg;
  }

  inline __device__ arg_t warp_shfl_down(arg_t arg, int offset) {
    return WARP_SHFL_DOWN(arg, offset);
  }

  inline __device__ arg_t translate_idx(arg_t acc, int64_t /*idx*/) {
    return acc;
  }

  // wrap a normal reduction that ignores the index
  inline __device__ arg_t reduce(arg_t acc, arg_t val, int64_t idx) {
     return combine(acc, val);
  }
}


struct ReduceJitOp {
  using scalar_t = float;
  using arg_t = float;
  using out_scalar_t = float;

  using InputCalculator = OffsetCalculator<1>;
  using OutputCalculator = OffsetCalculator<2>;

//   static constexpr bool can_accumulate_in_output =
//     std::is_convertible<arg_t, out_scalar_t>::value
//     && std::is_convertible<out_scalar_t, arg_t>::value;

  static constexpr int input_vec_size = ReduceConfig::input_vec_size;

  arg_t ident;
  ReduceConfig config;
  InputCalculator input_calc;
  OutputCalculator output_calc;
  const void* src;
  const char* dst[2]; //it accepts at most two destinations
  // acc_buf used for accumulation among sub Tensor Iterator when accumulation on
  // output is not permissible
  void* acc_buf;
  // cta_buf used for accumulation between blocks during global reduction
  void* cta_buf;
  int* semaphores;
  int64_t base_idx;
  bool accumulate;
  bool final_output;
  int noutputs;


  C10_DEVICE void run() const {
    extern __shared__ char shared_memory[];
    uint32_t output_idx = config.output_idx<1>();
    uint32_t input_idx = config.input_idx();
    auto base_offsets1 = output_calc.get(output_idx)[1];

    using arg_vec_t = Array<arg_t, 1>;
    arg_vec_t value;

    if (output_idx < config.num_outputs && input_idx < config.num_inputs) {
      const scalar_t* input_slice = (const scalar_t*)((const char*)src + base_offsets1);

      value = thread_reduce<1>(input_slice);
    }

    if (config.should_block_y_reduce()) {
      value = block_y_reduce<1>(value, shared_memory);
    }
    if (config.should_block_x_reduce()) {
      value = block_x_reduce<1>(value, shared_memory);
    }

    using out_ptr_vec_t = Array<out_scalar_t*, 1>;
    using offset_vec_t = Array<uint32_t, 1>;
    offset_vec_t base_offsets;
    out_ptr_vec_t out;

    #pragma unroll
    for (int i = 0; i < 1; i++) {
      base_offsets[i] = output_calc.get(output_idx + i)[0];
      out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
    }

    arg_vec_t* acc = nullptr;
    if (acc_buf != nullptr) {
      size_t numerator = sizeof(arg_t);
      size_t denominator = sizeof(out_scalar_t);
      reduce_fraction(numerator, denominator);
      acc = (arg_vec_t*)((char*)acc_buf + (base_offsets[0] * numerator / denominator));
    }

    if (config.should_global_reduce()) {
      value = global_reduce<1>(value, acc, shared_memory);
    } else if (config.should_store(output_idx)) {
      if (accumulate) {
        #pragma unroll
        for (int i = 0; i < 1; i++) {
          value[i] = reducer::translate_idx(value[i], base_idx);
        }
      }

      if (acc == nullptr) {
        if (accumulate) {
          value = accumulate_in_output<1>(out, value);
        }
        if (final_output) {
          set_results_to_output<1>(value, base_offsets);
        } else {
          #pragma unroll
          for (int i = 0; i < 1; i++) {
            *(out[i]) = get_accumulated_output(out[i], value[i]);
          }
        }
      } else {
        if (accumulate) {
          #pragma unroll
          for (int i = 0; i < 1; i++) {
            value[i] = reducer::combine((*acc)[i], value[i]);
          }
        }
        if (final_output) {
          set_results_to_output<1>(value, base_offsets);
        } else {
          *acc = value;
        }
      }
    }
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> thread_reduce(const scalar_t* data) const {
    if (config.vectorize_input) {
      assert(output_vec_size == 1);
      // reduce at the header of input_slice where memory is not aligned,
      // so that thread_reduce will have an aligned memory to work on.
      return {input_vectorized_thread_reduce_impl(data)};
    } else {
      uint32_t element_stride = input_calc.strides_[0][0] / sizeof(scalar_t);
      bool is_contiguous = (input_calc.dims == 1 && element_stride == 1);
      if (is_contiguous) {
        return thread_reduce_impl<output_vec_size>(data, [](uint32_t idx) { return idx; });
      } else if (input_calc.dims == 1) {
        return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return idx * element_stride; });
      } else {
        return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return input_calc.get(idx)[0] / sizeof(scalar_t); });
      }
    }
  }

  C10_DEVICE arg_t input_vectorized_thread_reduce_impl(const scalar_t* data) const {
    uint32_t end = config.num_inputs;

    // Handle the head of input slice where data is not aligned
    arg_t value = ident;
    constexpr int align_bytes = alignof(aligned_vector<scalar_t, input_vec_size>);
    constexpr int align_elements = align_bytes / sizeof(scalar_t);
    int shift = ((int64_t)data) % align_bytes / sizeof(scalar_t);
    if (shift > 0) {
      data -= shift;
      end += shift;
      if(threadIdx.x >= shift && threadIdx.x < align_elements && config.should_reduce_tail()){
        value = reducer::reduce(value, data[threadIdx.x], threadIdx.x - shift);
      }
      end -= align_elements;
      data += align_elements;
      shift = align_elements - shift;
    }

    // Do the vectorized reduction
    using load_t = aligned_vector<scalar_t, input_vec_size>;

    uint32_t idx = config.input_idx();
    const uint32_t stride = config.step_input;

    // Multiple accumulators to remove dependency between unrolled loops.
    arg_t value_list[input_vec_size];
    value_list[0] = value;

    #pragma unroll
    for (int i = 1; i < input_vec_size; i++) {
      value_list[i] = ident;
    }

    scalar_t values[input_vec_size];

    load_t *values_vector = reinterpret_cast<load_t*>(&values[0]);

    while (idx * input_vec_size + input_vec_size - 1 < end) {
      *values_vector = reinterpret_cast<const load_t*>(data)[idx];
      #pragma unroll
      for (uint32_t i = 0; i < input_vec_size; i++) {
        value_list[i] = reducer::reduce(value_list[i], values[i], shift + idx * input_vec_size + i);
      }
      idx += stride;
    }

    // tail
    uint32_t tail_start = end - end % input_vec_size;
    if (config.should_reduce_tail()) {
      int idx = tail_start + threadIdx.x;
      if (idx < end) {
        value_list[0] = reducer::reduce(value_list[0], data[idx], idx + shift);
      }
    }

    // combine accumulators
    #pragma unroll
    for (int i = 1; i < input_vec_size; i++) {
      value_list[0] = reducer::combine(value_list[0], value_list[i]);
    }
    return value_list[0];
  }

  template <int output_vec_size, typename offset_calc_t>
  C10_DEVICE Array<arg_t, output_vec_size> thread_reduce_impl(const scalar_t* data_, offset_calc_t calc) const {
    uint32_t idx = config.input_idx();
    const uint32_t end = config.num_inputs;
    const uint32_t stride = config.step_input;
    const int vt0=4;

    using arg_vec_t = Array<arg_t, output_vec_size>;
    using load_t = aligned_vector<scalar_t, output_vec_size>;
    const load_t* data = reinterpret_cast<const load_t*>(data_);

    // Multiple accumulators to remove dependency between unrolled loops.
    arg_vec_t value_list[vt0];

    #pragma unroll
    for (int i = 0; i < vt0; i++) {
      #pragma unroll
      for (int j = 0; j < output_vec_size; j++) {
        value_list[i][j] = ident;
      }
    }

    load_t values[vt0];

    while (idx + (vt0 - 1) * stride < end) {
      #pragma unroll
      for (uint32_t i = 0; i < vt0; i++) {
        values[i] = data[calc(idx + i * stride) / output_vec_size];
      }
      #pragma unroll
      for (uint32_t i = 0; i < vt0; i++) {
        #pragma unroll
        for (uint32_t j = 0; j < output_vec_size; j++) {
          value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx + i * stride);
        }
      }
      idx += stride * vt0;
    }

    // tail
    int idx_ = idx;
    #pragma unroll
    for (uint32_t i = 0; i < vt0; i++) {
      if (idx >= end) {
        break;
      }
      values[i] = data[calc(idx) / output_vec_size];
      idx += stride;
    }
    idx = idx_;
    #pragma unroll
    for (uint32_t i = 0; i < vt0; i++) {
      if (idx >= end) {
        break;
      }
      #pragma unroll
      for (uint32_t j = 0; j < output_vec_size; j++) {
        value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx);
      }
      idx += stride;
    }

    // combine accumulators
    #pragma unroll
    for (int i = 1; i < vt0; i++) {
      #pragma unroll
      for (uint32_t j = 0; j < output_vec_size; j++) {
        value_list[0][j] = reducer::combine(value_list[0][j], value_list[i][j]);
      }
    }
    return value_list[0];
  }
  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> block_x_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
    using args_vec_t = Array<arg_t, output_vec_size>;
    int dim_x = blockDim.x;
    args_vec_t* shared = (args_vec_t*)shared_memory;
    if (dim_x > warpSize) {
      int address_base = threadIdx.x + threadIdx.y*blockDim.x;
      shared[address_base] = value;
      for (int offset = dim_x/2; offset >= warpSize; offset >>= 1) {
        __syncthreads();
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x) {
          args_vec_t other = shared[address_base + offset];
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::combine(value[i], other[i]);
          }
          shared[address_base] = value;
        }
      }
      dim_x = warpSize;
    }

    __syncthreads();

    for (int offset = 1; offset < dim_x; offset <<= 1) {
      #pragma unroll
      for (int i = 0; i < output_vec_size; i++) {
        arg_t other = reducer::warp_shfl_down(value[i], offset);
        value[i] = reducer::combine(value[i], other);
      }
    }
    return value;
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> block_y_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
    using args_vec_t = Array<arg_t, output_vec_size>;
    args_vec_t* shared = (args_vec_t*)shared_memory;
    shared[config.shared_memory_offset(0)] = value;
    for (int offset = blockDim.y / 2; offset > 0; offset >>= 1) {
      __syncthreads();
      if (threadIdx.y < offset && threadIdx.y + offset < blockDim.y) {
        args_vec_t other = shared[config.shared_memory_offset(offset)];
        #pragma unroll
        for (int i = 0; i < output_vec_size; i++) {
          value[i] = reducer::combine(value[i], other[i]);
        }
        shared[config.shared_memory_offset(0)] = value;
      }
    }
    return value;
  }
  

  C10_DEVICE bool mark_block_finished() const {
    __shared__ bool is_last_block_done_shared;

    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0) {
      int prev_blocks_finished = atomicAdd(&semaphores[blockIdx.x], 1);
      is_last_block_done_shared = (prev_blocks_finished == gridDim.y - 1);
    }

    __syncthreads();

    return is_last_block_done_shared;
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> accumulate_in_output(
    Array<out_scalar_t*, output_vec_size> out,
    Array<arg_t, output_vec_size> value
  ) const {
    Array<arg_t, output_vec_size> ret;
    #pragma unroll
    for (int i = 0; i < output_vec_size; i++) {
      ret[i] = reducer::combine(*(out[i]), value[i]);
    }
    return ret;
  }


  C10_DEVICE out_scalar_t get_accumulated_output(
    out_scalar_t* out, arg_t value
  ) const {
    assert(!final_output);
    return (out_scalar_t)value;
  }

  template<class T>
  C10_DEVICE void set_results(const T x, const uint32_t base_offset) const {
    assert(noutputs == 1);
    auto res = (out_scalar_t*)((char*)dst[0] + base_offset);
    *res = x;
  }

//TODO - multi-output reduction - we won't be able to use thrust::pair
//just explicitly specify typed output reads/writes
//Currently implemented for max of two outputs
//   template<class T1, class T2>
//   C10_DEVICE void set_results(const thrust::pair<T1, T2> x, const index_t base_offset) const {
//     if (noutputs >= 1) {
//       auto res0 = (T1*)((char*)dst[0] + base_offset);
//       *res0 = x.first;
//     }
//     if (noutputs >= 2) {
//       // base offset is computed assuming element size being sizeof(T1), so we need to make a
//       // correction to obtain the correct base offset
//       auto res1 = (T2*) ((char *) dst[1] + base_offset / sizeof(T1) * sizeof(T2));
//       *res1 = x.second;
//     }
//   }

  template <int output_vec_size>
  C10_DEVICE void set_results_to_output(Array<arg_t, output_vec_size> value, Array<uint32_t, output_vec_size> base_offset) const {
    assert(final_output);
    #pragma unroll
    for (int i = 0; i < output_vec_size; i++) {
      set_results(reducer::project(value[i]), base_offset[i]);
    }
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> global_reduce(Array<arg_t, output_vec_size> value, Array<arg_t, output_vec_size> *acc, char* shared_memory) const {
    using arg_vec_t = Array<arg_t, output_vec_size>;
    using out_ptr_vec_t = Array<out_scalar_t*, output_vec_size>;
    using offset_vec_t = Array<uint32_t, output_vec_size>;

    arg_vec_t* reduce_buffer = (arg_vec_t*)cta_buf;
    uint32_t output_idx = config.output_idx<output_vec_size>();
    offset_vec_t base_offsets;
    out_ptr_vec_t out;

    #pragma unroll
    for (int i = 0; i < output_vec_size; i++) {
      base_offsets[i] = output_calc.get(output_idx + i)[0];
      out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
    }

    bool should_store = config.should_store(output_idx);
    if (should_store) {
      uint32_t offset = config.staging_memory_offset(blockIdx.y);
      reduce_buffer[offset] = value;
    }

    __threadfence(); // make sure writes are globally visible
    __syncthreads(); // if multiple warps in this block wrote to staging, make sure they're all done
    bool is_last_block_done = mark_block_finished();

    if (is_last_block_done) {
      value = ident;
      if (config.should_block_x_reduce()) {
        uint32_t input_offset = threadIdx.x + threadIdx.y * blockDim.x;
        uint32_t step = blockDim.x * blockDim.y;
        for (; input_offset < config.ctas_per_output; input_offset += step) {
          uint32_t idx = config.staging_memory_offset(input_offset);
          arg_vec_t next = reduce_buffer[idx];
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::combine(value[i], next[i]);
          }
        }
      } else {
        uint32_t input_offset = threadIdx.y;
        uint32_t step = blockDim.y;
        for (; input_offset < config.ctas_per_output; input_offset += step) {
          uint32_t idx = config.staging_memory_offset(input_offset);
          arg_vec_t next = reduce_buffer[idx];
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::combine(value[i], next[i]);
          }
        }
      }
      value = block_y_reduce(value, shared_memory);
      if (config.should_block_x_reduce()) {
        value = block_x_reduce<output_vec_size>(value, shared_memory);
      }
      if (should_store) {
        if (accumulate) {
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::translate_idx(value[i], base_idx);
          }
        }

        if (acc == nullptr) {
          if (accumulate) {
            value = accumulate_in_output<output_vec_size>(out, value);
          }
          if (final_output) {
            set_results_to_output<output_vec_size>(value, base_offsets);
          } else {
            #pragma unroll
            for (int i = 0; i < output_vec_size; i++) {
              *(out[i]) = get_accumulated_output(out[i], value[i]);
            }
          }
        } else {
          if (accumulate) {
            #pragma unroll
            for (int i = 0; i < output_vec_size; i++) {
              value[i] = reducer::combine((*acc)[i], value[i]);
            }
          }
          if (final_output) {
            set_results_to_output<output_vec_size>(value, base_offsets);
          } else {
            *acc = value;
          }
        }
      }
    }

    return value;
  }
};

extern "C"
__launch_bounds__(512, 4)
__global__ void reduction_prod_kernel(ReduceJitOp r){
  r.run();
}
nvrtc: error: invalid value for --gpu-architecture (-arch)

Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]

Any help on how to resolve this?
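
One way to narrow this down, offered as a rough diagnostic rather than a confirmed fix: NVRTC usually reports "invalid value for --gpu-architecture" when it is asked to compile for a GPU architecture it does not know. A plausible cause (an assumption, not verified for this setup) is a GPU newer than the CUDA toolkit bundled with the installed PyTorch build, for example the cu113 wheels from the installation instructions. The sketch below compares the GPU's compute capability with the architectures the PyTorch build was compiled for. The helper name arch_supported is made up for illustration.

```python
from typing import Iterable, Tuple


def arch_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if an 'sm_XY' entry matching the capability is in arch_list."""
    major, minor = capability
    return f"sm_{major}{minor}" in set(arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)  # e.g. (8, 6)
            archs = torch.cuda.get_arch_list()         # e.g. ['sm_37', ..., 'sm_86']
            print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
            print("GPU capability:", cap, "| compiled archs:", archs)
            # Only a rough indicator: a missing entry hints at a mismatch.
            print("GPU arch covered by this build:", arch_supported(cap, archs))
        else:
            print("CUDA is not available to PyTorch")
```

If the last line prints False, the GPU is probably newer than this PyTorch/CUDA combination expects, and a PyTorch build for a more recent CUDA version may be worth trying.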

Issue with pip install submodules/diff-gaussian-rasterization even though I have exponetial.hpp in the glm/glm folder

I still encounter an error when running pip install submodules/diff-gaussian-rasterization, even though I have exponetial.hpp in the glm/glm folder.
The error is as follows:

      subprocess.CalledProcessError: Command '['which', 'x86_64-conda_cos6-linux-gnu-c++']' returned non-zero exit status 1.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for diff_gaussian_rasterization
  Running setup.py clean for diff_gaussian_rasterization
Failed to build diff_gaussian_rasterization
ERROR: Could not build wheels for diff_gaussian_rasterization, which is required to install pyproject.toml-based projects

Could you share which gcc version you use? I tried gcc 8.5 and the build fails.
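
The log above shows the build running which on x86_64-conda_cos6-linux-gnu-c++ and failing, which suggests the build was told to use a compiler by that name and could not find it on PATH. In conda environments such names often come from the CC/CXX environment variables set by conda's compiler packages. That is an assumption about this setup, not something the log confirms. A minimal sketch to check it, with resolve_compilers as a made-up helper name:

```python
import os
import shutil
from typing import Dict, Mapping, Optional, Sequence


def resolve_compilers(
    env: Mapping[str, str], names: Sequence[str] = ("CC", "CXX")
) -> Dict[str, Optional[str]]:
    """Map each compiler variable that is set to its resolved path (None if not on PATH)."""
    resolved: Dict[str, Optional[str]] = {}
    for name in names:
        value = env.get(name)
        if value:
            # The variable may carry extra flags; only the first token is the executable.
            resolved[name] = shutil.which(value.split()[0])
    return resolved


if __name__ == "__main__":
    found = resolve_compilers(os.environ)
    if not found:
        print("CC/CXX are not set; the default system compiler would be used")
    for var, path in found.items():
        status = path if path else "NOT FOUND on PATH"
        print(f"{var}={os.environ[var]} -> {status}")
```

If CXX points to a compiler that is not found, either installing that compiler into the environment or unsetting the variable before running pip install would be the things to try first.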

OOM when running extract_mesh.py

When I run extract_mesh.py on my custom data, I get an OOM error in this code snippet.
I am using an RTX 4090.
code
I would like to know whether this is caused by too many Gaussians in the scene or by the graphics card.
Thanks.
