
dust3r's People

Contributors

codesmith-emmy, cris-test, eltociear, hturki, jerome-revaud, lbg030, parskatt, spagnolog, vincent-leroy, wauplin, wzy-99, yocabon

dust3r's Issues

Unexpected `force` option for `print()`

print() is called with force=True (in losses.py, at L222).
However, the built-in print() function can't accept this parameter:

>>> print("something", force=True)
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2022.3.2\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
           ^^^^^^
  File "<input>", line 1, in <module>
TypeError: 'force' is an invalid keyword argument for print()

I believe the author intended that parameter to flush the stdout stream, so it should have been flush=True:

>>> print("something", flush=True)
something
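
For context: distributed-training codebases in the MAE/DeiT lineage often monkey-patch builtins.print so that only the main process prints unless force=True is passed; if losses.py assumes such a patch, force=True would be intentional rather than a typo. A minimal sketch of that pattern (illustrative only, not necessarily the exact code used here):

    import builtins

    def setup_for_distributed(is_main_process):
        # Patch print() so non-main processes stay quiet unless force=True is given.
        builtin_print = builtins.print

        def patched_print(*args, force=False, **kwargs):
            if is_main_process or force:
                builtin_print(*args, **kwargs)

        builtins.print = patched_print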

Getting the point clouds out at the right scale

Would it be possible to use sparse or dense matchers to get the scale out? That is, get all the matched points, use the known intrinsics to project them into 3D, and then compare those same points to the pointmap from dust3r?
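
A rough sketch of that comparison, assuming you already have metric 3D points for a set of matched pixels (e.g. from calibrated triangulation) and the DUSt3R pointmap values at the same pixels; all names here are hypothetical:

    import numpy as np

    def estimate_scale(metric_pts, dust3r_pts):
        # metric_pts, dust3r_pts: (N, 3) arrays of the same matched points in both frames.
        # Estimate a single scale factor mapping DUSt3R's scene to metric units
        # from the ratio of distances to the respective centroids.
        metric_d = np.linalg.norm(metric_pts - metric_pts.mean(0), axis=1)
        pred_d = np.linalg.norm(dust3r_pts - dust3r_pts.mean(0), axis=1)
        return np.median(metric_d / np.clip(pred_d, 1e-8, None))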

Some f-string error

Hi, I get an f-string error like the following:

dust3r/heads/__init__.py line:19
raise NotImplementedError(f"unexpected {head_type=} and {output_mode=}")

 File "<fstring>", line 1
    (head_type=)
              ^
SyntaxError: invalid syntax

Am I using the wrong approach?
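
This looks like a Python-version issue rather than a wrong approach: the {name=} debugging form inside f-strings was only added in Python 3.8, and older interpreters raise exactly this SyntaxError. If upgrading Python is not an option, an equivalent that works on earlier 3.x versions would be:

    raise NotImplementedError(f"unexpected head_type={head_type} and output_mode={output_mode}")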

Mesh view is washed out in visualizer, texture not embedded

Fantastic work! The quality and performance are impressive, though the visualizer seems to apply some transparency in mesh mode compared to point-cloud mode.

Here is a comparison of the two, using a single image as source:

[comparison screenshots]

Also, when saving the mesh, there is no texture embedded, and there is no color information on the point cloud when I import it into Blender.

[screenshot]

PS: I installed dust3r using Pinokio.

Getting torch.cuda.OutOfMemoryError using more than 16 images

Firstly, congrats to all the folks at Naver for their awesome accomplishments with Dust3r. It was very straightforward getting dust3r up and running, but I discovered that I run into torch.cuda.OutOfMemoryError when I try to process more than 16 images at once. I am running an RTX 3060 12GB and was wondering if anyone knows what I can do to resolve or debug an issue like this. I am running dust3r via docker-compose with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128. Here is the full error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.11 GiB (GPU 0; 11.76 GiB total capacity; 9.84 GiB already allocated; 932.69 MiB free; 10.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any help or insight would be much appreciated. Again, thanks to the folks at Naver for their awesome work and for releasing it!
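
Two knobs that usually help in this situation, without code changes beyond the usage script (the exact values below are only examples, and the accepted scene_graph strings should be checked in dust3r/image_pairs.py):

    # fewer pairs: a sliding-window scene graph instead of all N*(N-1) 'complete' pairs
    pairs = make_pairs(images, scene_graph='swin-3', prefilter=None, symmetrize=True)
    # smaller inference batches trade speed for peak VRAM
    output = inference(pairs, model, device, batch_size=1)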

Running on Apple Mac M2

Good job guys! Very impressive results.

I confirm that it works on Apple Macs (with Apple Silicon); I tried with more than 8 images with no errors.
3 images, 6 image pairs: runs in 6 s + 15 s
8 images, 56 image pairs: runs in 70 s + 90 s

PYTORCH_ENABLE_MPS_FALLBACK=1 python3 demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth --device mps
[screenshot]

Do you plan to release some samples with a larger number of images?

Model deployment issues

Thanks for the engineering code. I want to deploy the model to embedded devices; is this feasible? Do you know which devices support this kind of model? Looking forward to your reply.

Wrong Intrinsics

There seems to be an issue with the camera intrinsics: some focal lengths are 0.

CUDA OOM

Hi,

The performance is really amazing on the few image pairs I have tried.
However, when I moved to a bigger scene (29 images), it crashes with CUDA OOM on a 16 GB V100.
Any recommendations on how I can run it?

  File "/home/old-ufo/dev/dust3r/dust3r/cloud_opt/optimizer.py", line 176, in forward
    aligned_pred_i = geotrf(pw_poses, pw_adapt * self._stacked_pred_i)
  File "/home/old-ufo/dev/dust3r/dust3r/utils/geometry.py", line 86, in geotrf
    pts = pts @ Trf[..., :-1, :] + Trf[..., -1:, :]

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.38 GiB. GPU 0 has a total capacity of 15.77 GiB of which 775.88 MiB is free. Including non-PyTorch memory, this process has 15.01 GiB memory in use. Of the allocated memory 13.70 GiB is allocated by PyTorch, and 922.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Scale Invariant Depth, Metric Depth, and Surface Normals

As I understand from #44 (comment), you are in the process of training a metric version of dust3r, so the current version's output is scale-invariant. Was the depth normalized before training? I see here

def rescale_image_depthmap(image, depthmap, camera_intrinsics, output_resolution):
that the resolution is being changed. But I also notice here
with PIL.Image.open(depth_path) as depth_pil:
that the metric depth map seems to be loaded?

Is there any normalization being performed on the depth/pointmap to make it scale invariant?

Also, I've been following MetricV2; have you looked at including surface normals as supervision, based on a pointmap -> depthmap -> surface normal conversion, so that the network can also produce surface normals?
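
For reference, the paper describes handling the scale ambiguity by normalizing predicted and ground-truth pointmaps by the average distance of their valid points from the origin before computing the regression loss. A rough sketch of that normalization (not the repository's exact code):

    import torch

    def normalize_pointmap(pts3d, valid_mask, eps=1e-8):
        # pts3d: (H, W, 3) pointmap; valid_mask: (H, W) boolean mask of valid pixels.
        dists = pts3d[valid_mask].norm(dim=-1)        # distances of valid 3D points to the origin
        norm_factor = dists.mean().clamp(min=eps)     # average distance = scene scale
        return pts3d / norm_factor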

A small suggestion to author

Assume we have 1296 images (36x36) around the object in a 360x360 setup, or perhaps slightly fewer, say 100 images.
The current algorithm generates pairs for all possible combinations, resulting in a substantial number of pairs (100*99).
Without a proper sampling policy, this can be challenging to handle.
A small recommendation would be to initialize and maintain a pair-loss table from scratch, gradually increasing the number of pairs and sampling based on the convergence of the loss; see the sketch below.
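
A minimal sketch of such a policy (purely illustrative, names made up): keep a running loss per pair and preferentially sample pairs whose alignment loss is still high.

    import random

    class PairSampler:
        # Sample image pairs with probability proportional to their last recorded loss.
        def __init__(self, pairs):
            self.losses = {p: 1.0 for p in pairs}   # optimistic initialization

        def sample(self, k):
            pairs = list(self.losses)
            weights = [self.losses[p] for p in pairs]
            return random.choices(pairs, weights=weights, k=k)

        def update(self, pair, loss):
            self.losses[pair] = float(loss)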

Allow to export scene data as JSON for general use outside of Python

First, congrats to the developers. A powerful tool; it found immediate use in my projects.

Because most of our tools are written in MATLAB/Octave, I found that the generated .glb file is difficult to convert.

I just added a little bit of code to allow the demo.py GUI to export data to a JSON/binary JSON construct (with the JMesh mesh data annotation), which can potentially be parsed/shared in other environments (like JavaScript/Node/C++/MATLAB). I also added a drop-down menu in the demo to let the user choose the output format.

Here are my commits

NeuroJSON@a13b18c
NeuroJSON@935ace7

To export to JSON, only one extra dependency, jdata (16 kB), is needed. To export to a binary JSON format (for smaller file sizes), another small package, bjdata (65 kB), is also required.

loading the data in MATLAB/Octave

>> dat=loadjd('/tmp/scene.jmsh');
>> dat
dat = 
  struct with fields:

       images: [1×10 struct]
      cameras: [1×10 struct]
       meshes: [1×10 struct]
    transform: [4×4 double]

>> dat.meshes(1)
ans = 
  struct with fields:

    MeshVertex3: [196608×3 single]
       MeshTri3: [1×1 struct]

loading the data back to Python

import jdata as jd
dat=jd.load('/tmp/scene.jmsh');
>>> dat.keys()
dict_keys(['images', 'cameras', 'meshes', 'transform'])
>>> dat['meshes'][0].keys()
dict_keys(['MeshVertex3', 'MeshTri3'])
>>> dat['meshes'][0]['MeshTri3'].keys()
dict_keys(['Data', 'Properties'])
>>> dat['meshes'][0]['MeshTri3']['Data'].shape
(189024, 3)
>>> dat['meshes'][0]['MeshTri3']['Data'].dtype
dtype('int64')

In my test, the scene generated from 10 images took 43 MB in glb, 44 MB in binary JSON (.bmsh), and 59 MB in JSON (due to base64). There was no noticeable difference in loading/saving speed. For both the JSON and binary JSON files, changing the compressor to 'lzma' could lead to much smaller file sizes.

Just wanted to share this in case others have similar needs. I am also happy to create a PR if the developers are interested in adding this feature.

[screenshot]

No module named 'models.dpt_block'

(venv) D:\dust3r>python demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
Traceback (most recent call last):
  File "D:\dust3r\demo.py", line 19, in <module>
    from dust3r.inference import inference, load_model
  File "D:\dust3r\dust3r\inference.py", line 10, in <module>
    from dust3r.model import AsymmetricCroCo3DStereo, inf  # noqa: F401, needed when loading the model
  File "D:\dust3r\dust3r\model.py", line 11, in <module>
    from .heads import head_factory
  File "D:\dust3r\dust3r\heads\__init__.py", line 8, in <module>
    from .dpt_head import create_dpt_head
  File "D:\dust3r\dust3r\heads\dpt_head.py", line 17, in <module>
    from models.dpt_block import DPTOutputAdapter  # noqa
ModuleNotFoundError: No module named 'models.dpt_block'

(venv) D:\dust3r\checkpoints>dir
 Volume in drive D is Data
 Volume Serial Number is 3C50-8BA1

 Directory of D:\dust3r\checkpoints

2024/03/04  21:16    <DIR>          .
2024/03/04  21:16    <DIR>          ..
2024/03/04  20:35     2,129,660,080 DUSt3R_ViTLarge_BaseDecoder_224_linear.pth
2024/03/04  20:40     2,285,019,929 DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
2024/03/04  20:41     2,129,656,556 DUSt3R_ViTLarge_BaseDecoder_512_linear.pth

ScanNet Data Preprocessing

Could you please release the code for preprocessing ScanNet data for training and inference, so we can better understand the whole dust3r pipeline? Thanks!

Image Mask

I would like to ask whether it is possible to add a mask to the images currently being fed to the network.
If so, how should I add it?

Can I fix the camera pose? Or give some restrictive priors

Very good job! I have a fixed capture setup: 4 fixed cameras shooting an object, but I find that each time the output cameras are not in the same positions. Can you give some prior conditions, or can I modify the code, to fix the camera poses?

Try this to increase resolution w/o finetuning (Instruction)

Using the default setup, large input images were being resized to 512 x 384 (using DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth). But I wanted results at a higher resolution (1024 x 768). So, following "Extending Context Window of Large Language Models via Position Interpolation" by Meta, I changed only the default image_size value of 512 to 1024 inside demo.py and multiplied the variable t inside the get_cos_sin method of RoPE2D in croco/models/pos_embed.py by (512/1024). This gave pretty good results, though finetuning is most likely required for better results.
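
In other words, the change amounts to position interpolation in the rotary embedding: token positions are rescaled by trained_size / new_size so the frequencies stay in the range seen during training. A self-contained schematic of the idea (the real get_cos_sin in croco/models/pos_embed.py is structured differently, and the base frequency here is just a placeholder):

    import torch

    def interpolated_rope_cos_sin(seq_len, dim, base=100.0, train_size=512, new_size=1024):
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        t = torch.arange(seq_len).float() * (train_size / new_size)   # the 512/1024 rescaling described above
        freqs = torch.einsum('i,j->ij', t, inv_freq)
        return freqs.cos(), freqs.sin()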

Could I get the real depth from your 2D-2D matching?

Hi, thank you for your amazing work!
Could I get the real depth from your 2D-2D matching?
If I input 2 images from a stereo camera (as in self-driving), then according to 'real_depth = (baseline × focal) / disparity', does that mean the answer to the title question is yes?
And if I input 5 images and, for some reason, want to see all the matches for each image pair (for example, img2 and img5), could you tell me how to modify your code to achieve that? (I modified the code from the "Usage" section of the README; however, the matches are the same for all image pairs...)
Thank you!
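
On the first question, note that the quoted stereo relation is independent of DUSt3R: with a calibrated stereo rig, disparities of matched pixels already give metric depth directly. A small numeric illustration (baseline and focal length are made-up, KITTI-like values):

    import numpy as np

    baseline_m = 0.54                                  # stereo baseline in meters
    focal_px = 721.5                                   # focal length in pixels
    disparity_px = np.array([95.0, 48.0, 12.0])        # per-match disparities in pixels

    depth_m = baseline_m * focal_px / disparity_px     # real_depth = (baseline * focal) / disparity
    print(depth_m)                                     # ~[4.1, 8.1, 32.5] meters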

In the demo training code, there is a typo in step 1.

There is a typo in step 1 of the demo training instructions.
The path next to the pretrained argument is missing its quotation marks,
so training does not work when the command is executed as written.
Adding the quotation marks around the path has been confirmed to work.

Thanks 😀

Something wrong in the gradio import

(dust3r) H:\qing\AIproect\dust3r>python demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
Traceback (most recent call last):
  File "H:\qing\AIproect\dust3r\demo.py", line 9, in <module>
    import gradio
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\__init__.py", line 3, in <module>
    import gradio._simple_templates
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\_simple_templates\__init__.py", line 1, in <module>
    from .simpledropdown import SimpleDropdown
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\_simple_templates\simpledropdown.py", line 6, in <module>
    from gradio.components.base import FormComponent
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\components\__init__.py", line 40, in <module>
    from gradio.components.multimodal_textbox import MultimodalTextbox
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\components\multimodal_textbox.py", line 28, in <module>
    class MultimodalTextbox(FormComponent):
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\component_meta.py", line 198, in __new__
    create_or_modify_pyi(component_class, name, events)
  File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\component_meta.py", line 92, in create_or_modify_pyi
    source_code = source_file.read_text()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\py\anaconda3\envs\dust3r\Lib\pathlib.py", line 1059, in read_text
    return f.read()
           ^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb2 in position 1972: illegal multibyte sequence

Manually adjustable camera positions after prediction

Manually adjustable camera positions after prediction within the GUI software.

Have you considered adding this? It would be a great way to correct any cameras that the algorithm may not have predicted correctly.

Love this btw. Great work so far.

Dear authors: a question about the network architecture.

Dear authors, if there are no inherent constraints on the input images (such as B always having to be to the left of A, or even stricter constraints), what is the reason for TransformerDecoder_1/2 and Header_1/2 being required to have different weights instead of sharing weights for later information sharing? To my limited understanding, after 'perfect' or 'sufficient' training, Decoder_1/2 and Header_1/2 should be nearly identical. In that case, what is the significance of not sharing weights?

In short:
if there are no inherent differences, then after sufficient training, decoder/header 1 and 2 should be nearly identical.
Consider this:
input a single image I to this two-path network, and it outputs different point-maps and camera poses.
Is that reasonable? Is this what we want?

A thought experiment:
swap the input image pair A and B; we may get different output performance, worse or better.
Right?

[figure]

Dataset details

First of all: very cool work!

I have two questions regarding reproducing pairs from the datasets for training.

Habitat

Are the scenes the pairs are generated from the same as in the CroCo Habitat README?
Specifically, do the 1M pairs relate in some way to the ~1.8M pairs used in CroCo?

Real datasets

For CroCo v2 you provide metadata to re-generate the crops for ARKitScenes and MegaDepth.
Specifically, the CroCo v2 paper mentions:

1,070,414 pairs from ARKitScenes [8], 2,014,789 pairs from MegaDepth

Do these relate to the pairs you obtained for dust3r training in some way?

If there is metadata similar to CroCo's for generating these pairs, that would be greatly appreciated!

Thank you!

Problems with camera pose application

Thank you for your excellent work. But when I try to use cam_pose, some problems arise. Specifically, for four pictures, the demo produces the following output,

[screenshot]

and you can see that there are good camera poses.
But when I apply the pose parameters directly to the point cloud locally, I get the following result.

[screenshot]

There is an unreasonable gap between the individual views.
I want to know whether the way I get the poses is wrong (poses = scene.get_im_poses()), or whether the point-cloud results displayed on the web page do not fully correspond to the poses obtained from the model. Looking forward to your reply.
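
For what it's worth, one common source of such gaps is applying the cam2world poses to pointmaps that the global aligner already expresses in the world frame, which transforms them twice. Under that assumption (worth verifying against the optimizer code), rebuilding the scene from per-view data would look roughly like this, with scene coming from global_aligner as in the README usage:

    import torch

    poses = scene.get_im_poses()     # cam2world, one 4x4 matrix per image
    pts3d = scene.get_pts3d()        # per-image pointmaps, already in the world frame
    world_cloud = torch.cat([p.reshape(-1, 3) for p in pts3d])   # no extra pose applied here

    # Only points expressed in *camera* coordinates should be multiplied by the pose:
    # X_world = pose[:3, :3] @ X_cam + pose[:3, 3]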

What if a partial set of poses is known?

Hey Naver,

First of all great work, it is very interesting to play around with!

I'm curious: if one knows a partial set of poses and focal lengths beforehand, how should one initialize the pose graph?

Best regards
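
One direction to explore, with the caveat that the method names below are assumptions to be checked against dust3r/cloud_opt in the current code: the point-cloud optimizer appears to allow presetting known values before the global alignment runs, so something along these lines may work (and if such hooks don't exist, the same effect can be obtained by freezing the corresponding pose/focal parameters in the optimizer):

    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    scene.preset_pose(known_poses)      # hypothetical call: fix the cam2world poses you trust
    scene.preset_focal(known_focals)    # hypothetical call: fix the known focal lengths
    loss = scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)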

Pretrained Croco

Hi,

Thanks a lot for the amazing work. Did you use the pretrained Croco model when training Dust3r? If so, could you please point out where you load the model in the training code?

Thanks in advance!

Core dump

[screenshot of the error]

This error comes from the Python code in the README file.

Options to do SLAM for video? Get poses and camera intrinsics?

Hi there, congrats on the fantastic work! These are amazing results.

I'm working on 3D mapping systems for robotics, and was wondering:

Given a video, can this method help with obtaining the camera parameters and poses for each frame?

Do you guys have any scripts already for this? I see that in the example usage you have:

    # retrieve useful values from scene:
    imgs = scene.imgs
    focals = scene.get_focals()
    poses = scene.get_im_poses()

And you can do scene.get_intrinsics(), which is great, but when I run this on 12 images from the Replica dataset, scene.get_intrinsics() outputs 12 different intrinsic matrices, none of which really match the original camera intrinsics of the Replica dataset.

Am I doing something wrong? Should I specify the scale or resolution or something else about the images at some point? The Replica images are 1200x600 (w, h), but I'm assuming they get resized to 512.

Just wondering how I should go about getting the camera parameters for a monocular rgb video, or if that's not really possible to do super accurately yet with this method.

For extra detail, I'm using the following frames from the replica dataset

    image_filenames = [
        'frame000000.jpg', 'frame000023.jpg', 'frame000190.jpg', 'frame000502.jpg',
        'frame000606.jpg', 'frame000988.jpg', 'frame001181.jpg', 'frame001374.jpg',
        'frame001587.jpg', 'frame001786.jpg', 'frame001845.jpg', 'frame001928.jpg'
    ]
...
    images = load_images(images_path_list, size=512)
    pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
...
    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)
...

and the output of scene.get_intrinsics() is as follows (I'm only showing two of the 12 matrices here):

print(scene.get_intrinsics())
tensor([[[250.8425,   0.0000, 256.0000],
         [  0.0000, 250.8425, 144.0000],
         [  0.0000,   0.0000,   1.0000]],
...
        [[250.7383,   0.0000, 256.0000],
         [  0.0000, 250.7383, 144.0000],
         [  0.0000,   0.0000,   1.0000]]], device='cuda:0',
       grad_fn=<CopySlices>)

compared to the ground-truth camera params of the Replica dataset from camera_params.json:

K_given = [
    [600.0, 0, 599.5],
    [0, 600.0, 339.5],
    [0, 0, 1]
]

here is the actual camera_params.json file in case it helps

{
    "camera": {
        "w": 1200,
        "h": 680,
        "fx": 600.0,
        "fy": 600.0,
        "cx": 599.5,
        "cy": 339.5,
        "scale": 6553.5
    }
}
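
One thing to keep in mind before comparing: the predicted intrinsics above are expressed in the resized image's pixel grid (512x288 here, given cx=256 and cy=144), not in the original 1200x680 grid. To compare against camera_params.json, the predicted focal should first be mapped back to original pixels, roughly (ignoring the small crop that load_images applies):

    W_orig, W_resized = 1200, 512
    fx_orig_est = 250.84 * (W_orig / W_resized)   # ~588 px, to compare against the given fx = 600.0

so the predicted focal is actually within a few percent of the ground truth.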

Also, just curious, how would I go about running this on long videos? Or is that not possible yet?

My apologies if these are too many questions! This method is really awesome, and I'm having a lot of fun using it. Thanks again for the wonderful work!

Memory errors

Hi, thank you for your great work!
I'm currently working on testing this on larger datasets (5-10k images) and notice that a very large amount of (V)RAM would be required to make it work. I've already generated a pairs file to reduce the number of pairs from 50M to 3M, but this still seems to be way too large. Do you have any pointers/suggestions I could try out to make it scale better?

I'm using cloud compute with 80GB VRAM and 220GB RAM so that shouldn't be an issue btw.

training costs

Thanks for your great work! It's amazing! It can be applied to many real-world scenarios. I wonder, is this a paper submitted to CVPR 2024? The format of the paper suggests it is.

Besides, could you please tell me how many GPUs it took to train this model? Thanks very much!

HEIC images are ignored

Tried to demo it yesterday to a few novice users using share=True in launch.
Encountered a few problems along the way.

It seems to ignore .heic files (which are often the default format on mobile devices).

The problematic line, where you need to add ".heic":

if not path.endswith(('.jpg', '.jpeg', '.png', '.JPG')):

You also need to add a pillow-heif dependency:
pip install pillow-heif

And the following lines somewhere in image.py

from pillow_heif import register_heif_opener
register_heif_opener()
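
Putting the pieces above together, a consolidated sketch of the change (the exact structure of dust3r/utils/image.py may differ, and the helper name here is made up):

    from pillow_heif import register_heif_opener
    register_heif_opener()   # lets PIL.Image.open decode .heic files

    SUPPORTED_EXTS = ('.jpg', '.jpeg', '.png', '.heic')

    def is_supported_image(path):
        # case-insensitive check, so '.JPG' and '.HEIC' are covered too
        return path.lower().endswith(SUPPORTED_EXTS)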

Other issues encountered on mobile devices (but probably due to gradio):

  • The 3d viewer can't easily translate the scene (no right click + drag gesture available).
  • Possibility to add one more image without re-uploading everything.
  • ImagePicker instead of FilePicker

FYI: Missing dependencies required to run main.py

Hi, and thank you for sharing this amazing work.

Just a heads up:

I had to manually install some missing deps to make main.py run. I followed the README and installed with conda, skipping both the optional step 3 and the installation of the new optional_requirements.txt:

einops (conda)
tqdm (via pip)
scipy (conda)
opencv-python (pip)
trimesh (pip)

pip install "pyglet<2" # use version <2

Cheers

Waymo dataset processing

Hi authors, thanks for sharing this amazing work. I was wondering how you use the Waymo dataset for training, as its most accurate depth comes from lidar, which is sparse.
