
pointnext's Introduction

PointNeXt


Official PyTorch implementation for the following paper:

PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies

by Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, Bernard Ghanem

TL;DR: We propose improved training and model scaling strategies to boost PointNet++ to the state-of-the-art level. PointNet++ with the proposed model scaling is named PointNeXt, the next version of PointNets.

Features

In the PointNeXt project, we propose a new and flexible codebase for point-based methods, namely OpenPoints. The biggest difference between OpenPoints and other libraries is that we focus more on reproducibility and fair benchmarking.

  1. Extensibility: supports many representative networks for point cloud understanding, such as PointNet, DGCNN, DeepGCN, PointNet++, ASSANet, PointMLP, and our PointNeXt. More networks can be built easily on top of our framework, since OpenPoints supports a wide range of basic operations, including graph convolutions, self-attention, farthest point sampling, ball query, etc.

  2. Reproducibility: all implemented models are trained on various tasks at least three times. Mean±std is provided in the PointNeXt paper. Pretrained models and logs are available.

  3. Fair Benchmarking: in PointNeXt, we find that a large part of the performance gain comes from the training strategies. In OpenPoints, all models are trained with the improved training strategies, and all achieve much higher accuracy than their originally reported values.

  4. Ease of Use: build the model, optimizer, scheduler, loss function, and data loader directly from the cfg. Train and validate different models on various tasks by simply changing the .yaml config file; see the sketch after this list for a fuller example.

    model = build_model_from_cfg(cfg.model)
    criterion = build_criterion_from_cfg(cfg.criterion_args)
    

    Here is an example of pointnet.yaml (the model configuration for PointNet):

    model:
      NAME: BaseCls
      encoder_args:
        NAME: PointNetEncoder
        in_channels: 4
      cls_args:
        NAME: ClsHead
        num_classes: 15
        in_channels: 1024
        mlps: [512,256]
        norm_args: 
          norm: 'bn1d'
  5. Online logging: supports wandb for checking your results anytime, anywhere. Just set wandb.use_wandb=True in your command.

    (See docs/misc/wandb.png for an example of the wandb dashboard.)

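To make the cfg-driven workflow in item 4 concrete, here is a minimal sketch of building the main training objects from a parsed config. The import paths, the EasyConfig usage, and the build_dataloader_from_cfg argument list are assumptions based on the names used in this README and in the issue tracebacks below, not a verbatim excerpt from the codebase:

    # Hedged sketch; import paths and argument lists are assumed, not verbatim.
    from openpoints.utils import EasyConfig           # assumed config utility
    from openpoints.models import build_model_from_cfg
    from openpoints.loss import build_criterion_from_cfg
    from openpoints.dataset import build_dataloader_from_cfg

    cfg = EasyConfig()
    cfg.load('cfgs/s3dis/pointnext-s.yaml', recursive=True)  # parse the .yaml config

    model = build_model_from_cfg(cfg.model).cuda()            # network defined by cfg.model
    criterion = build_criterion_from_cfg(cfg.criterion_args)  # loss defined by cfg.criterion_args
    train_loader = build_dataloader_from_cfg(cfg.batch_size,  # data pipeline from the same cfg
                                             cfg.dataset,     # (argument list is illustrative)
                                             cfg.dataloader,
                                             split='train')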

Installation

We provide a simple bash file to install the environment:

git clone --recurse-submodules git@github.com:guochengqian/PointNeXt.git
cd PointNeXt
source update.sh
source install.sh

CUDA 11.3 is required. Modify install.sh if a different CUDA version is used. See Install for details.

Usage

Check our online documentation for detailed instructions.

A short instruction: all experiments follow this simple rule for training and testing:

CUDA_VISIBLE_DEVICES=$GPUs python examples/$task_folder/main.py --cfg $cfg $kwargs
  • $GPUs is the list of GPUs to use. For most experiments (ScanObjectNN, ModelNet40, S3DIS), we use only one A100 (GPUs=0).
  • $task_folder is the folder name of the experiment; for example, for S3DIS segmentation, $task_folder=s3dis.
  • $cfg is the path to the cfg; for example, for S3DIS segmentation, $cfg=cfgs/s3dis/pointnext-s.yaml.
  • $kwargs are the other keyword arguments. For example, to test on S3DIS Area 5, $kwargs should be mode=test --pretrained_path $pretrained_path.
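
Putting these together, training PointNeXt-S on S3DIS and then testing on Area 5 would look roughly as follows (a sketch assembled from the template and variables above; $pretrained_path must point at your trained checkpoint):

    CUDA_VISIBLE_DEVICES=0 python examples/s3dis/main.py --cfg cfgs/s3dis/pointnext-s.yaml
    CUDA_VISIBLE_DEVICES=0 python examples/s3dis/main.py --cfg cfgs/s3dis/pointnext-s.yaml mode=test --pretrained_path $pretrained_path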

Model Zoo (pretrained weights)

see Model Zoo.

Visualization

More examples are available in the paper.

[Visualizations on S3DIS and ShapeNetPart]


Acknowledgment

This library is inspired by PyTorch-image-models and mmcv.

Citation

If you find PointNeXt or the OpenPoints codebase useful, please cite:

@InProceedings{qian2022pointnext,
  title   = {PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies},
  author  = {Qian, Guocheng and Li, Yuchen and Peng, Houwen and Mai, Jinjie and Hammoud, Hasan and Elhoseiny, Mohamed and Ghanem, Bernard},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year    = {2022},
}


pointnext's Issues

What is the 4th channel of the ScanObjectNN input?

Thanks for your constructive work!
I noticed that the in_channels for ScanObjectNN is 4. What is the 4th channel?
Prior works use only the first 3 channels for training, so is this setting fair?
Thanks!

FLOP count in FPS

Does the FLOP count include Farthest Point Sampling? The FPS algorithm you use is written in CUDA, and I believe DeepSpeed cannot infer its FLOPs directly. Is that right?

How to reproduce the results you released?

As for classification on ModelNet40 and ScanObjectNN, I set the same seed as in the logs with the highest results you released, but I cannot get the same results. For example, for classification on ScanObjectNN, I can't get the 88.2 OA result; instead, I got 87.786 OA. Does the stronger data augmentation you mention in the paper promote the results? Does the code contain that data augmentation? Thanks.

about Scannetv2 semantic segmentation

Hi~ Thanks for the great work again :)

I used your framework to try semantic segmentation on ScanNetV2 with the same data processing and training settings. Since PointNeXt achieves the top-1 mIoU on S3DIS, I expected the results on ScanNetV2 to also be higher than others. In fact, compared with the top-3 validation mIoUs of about 72-73, the result I got from PointNeXt-XL is only around 68.8.

Did you try ScanNetV2? What is your conclusion? Why do the improvements on S3DIS seem not to work as well on ScanNetV2, or maybe on other datasets?

how to use test_s3dis_6fold.py?

I'm sorry, I'm in trouble again and need your help.
The pretrained folders are organized as shown in a screenshot (omitted here), and the command is:
CUDA_VISIBLE_DEVICES=1 python examples/segmentation/test_s3dis_6fold.py --cfg cfgs/s3dis/pointvector-l.yaml mode=test wandb.use_wandb=False --pretrained_path log/s3dis/s3dis-train-pointvector-l-ngpus2-seed5494---batch_size-16-20220708-212619-fCRXoWGphDoYQRXD6V6um3
However, my folder does not have the part named as you show in the README for test-area5 (screenshot omitted). An error was reported here (screenshot omitted).
What do I need to do? Looking forward to your help!

why does it always show "RuntimeError: CUDA error: out of memory"?

My machine has a 3090 GPU (24GB), and I ran the smallest cfg (pointnext-b). It still shows:
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Questions about seed

How can I set the 'seed' recorded in the wandb log so that the code reproduces the same result? Thanks.
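
For reference, seeding a PyTorch run usually looks like the generic sketch below; this is not this repo's exact code, and bit-exact reproducibility additionally depends on cuDNN settings, data-loader workers, and hardware:

    import random

    import numpy as np
    import torch

    def set_seed(seed: int):
        # Seed every RNG that commonly affects a training run.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Optional: trade speed for reproducibility in cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False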

how to implement code in scannetv2 and shapenetpart?

Hi, thanks for your great work. I have achieved 72.12% mIoU on S3DIS Area 5 with PointNeXt-XL as a baseline, so I want to validate my idea on other datasets like ScanNet and ShapeNetPart. Thanks for your sharing, again. Here are my experimental results (screenshot omitted).

ShapeNetPart Missing

Hi! Could you please provide the instructions and cfgs for part segmentation on ShapeNetPart? Thank you!

Where is the code?

I don't see an examples directory, for example. Is something missing from this repo?

local aggregation

Thanks for sharing your work.
I am confused about the local aggregation module in InvResMLP. It seems that the local aggregation is applied to the sampled point set, not to the point set before sampling?

The latest code still needs some modifications

Thank you for your sharing. But when I tried to reproduce classification on ModelNet40, I found that two files still need modifications: ./openpoints/models/layers/__init__.py and ./openpoints/models/layers/group.py.

from .group_embed import SubsampleGroup, PointPatchEmbed

and

class GroupAll(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, new_xyz: torch.Tensor, xyz: torch.Tensor, features: torch.Tensor = None):
        grouped_xyz = xyz.transpose(1, 2).unsqueeze(2)
        if features is not None:
            grouped_features = features.unsqueeze(2)
            return None, grouped_xyz, grouped_features
        else:
            return None, grouped_xyz, None

how to set the training parameters?

Thanks for your contribution. I had some problems while experimenting.
I used the default params to train on S3DIS: bash script/main_segmentation.sh cfgs/s3dis/pointnext-s.yaml --batch_size 16
(training output screenshot omitted)

Then I tested with: CUDA_VISIBLE_DEVICES=1 bash script/main_segmentation.sh cfgs/s3dis/pointnext-s.yaml wandb.use_wandb=False mode=test --pretrained_path /home/dx/PointNeXt-master/log/s3dis/s3dis-train-pointnext-s-ngpus2-seed2980---batch_size-16-20220627-144928-6p4TQ9mFTv82zZcXCLgkw6/checkpoint/s3dis-train-pointnext-s-ngpus2-seed2980---batch_size-16-20220627-144928-6p4TQ9mFTv82zZcXCLgkw6_ckpt_best.pth
The result is (screenshot omitted): the mIoU is much lower than the result in the paper.
Are the default parameters not enough to achieve good results, and what do I need to do?

CUDA Out of Memory during Inference

I am using the S3DIS dataset the way you have provided it. During training, my model runs fine because every sample has at most 24k points. But during inference, even with batch size 1, I get CUDA out of memory. My machine is an NVIDIA RTX 3090 with 24GB memory. Do you think it is possible to run inference on this machine using the entire scene as input?

Basically, I am training a custom version of DGCNN, which computes nearest neighbors for all the points in the entire scene. The KNN computation is where I get the memory errors.

Also, if there are any other tips you know for running inference on the entire scene while avoiding memory errors, please share.

Thank you very much!
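
One generic workaround (not code from this repo) is to compute nearest neighbors in chunks, so the full NxN distance matrix is never materialized:

    import torch

    @torch.no_grad()
    def knn_chunked(points: torch.Tensor, k: int, chunk: int = 4096) -> torch.Tensor:
        # points: (N, 3). Returns (N, k) neighbor indices.
        # Peak memory is O(chunk * N) instead of O(N * N).
        idx = []
        for i in range(0, points.shape[0], chunk):
            d = torch.cdist(points[i:i + chunk], points)  # (chunk, N) pairwise distances
            idx.append(d.topk(k, largest=False).indices)  # k smallest distances per row
        return torch.cat(idx, dim=0)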

Log directories for every process

I have noticed that while running an experiment with multiple GPUs, a separate log directory is created by every process. Would it not be better to create just one log directory for the main process (rank 0) and have every other process load the checkpoint from there?

Question about the training configuration in classification task

I noticed that you use the PointCloudRotation data transform in the default configuration for the ScanObjectNN classification task, while your paper claims that random rotation drops the performance. So I am confused about whether you used random rotation augmentation in your final experiments. Thanks!

train: [PointsToTensor, PointCloudScaling, PointCloudCenterAndNormalize, PointCloudRotation]

The problem about multi-GPU training: "TypeError: main() takes 2 positional arguments but 63 were given"

Traceback (most recent call last):
  File "examples/segmentation/main.py", line 526, in <module>
    mp.spawn(main, nprocs=cfg.world_size, args=(cfg))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
TypeError: main() takes 2 positional arguments but 63 were given
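
For context: in the first frame of this trace, mp.spawn is called with args=(cfg), and (cfg) is just cfg itself rather than a one-element tuple, so an iterable cfg gets unpacked into many positional arguments. A traceback in a later issue below even carries the comment "original args=(cfg), run with bugs, should be args=(cfg,)". A minimal illustration of the fix:

    import torch.multiprocessing as mp

    # main and cfg defined as in examples/segmentation/main.py.
    # args must be a tuple of extra arguments; note the trailing comma.
    # args=(cfg) passes cfg itself, so its elements get splatted into main().
    mp.spawn(main, nprocs=cfg.world_size, args=(cfg,))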

How to reach the mIoU of the released models?

I trained PointNeXt and ASSANet and found that my results have small or evident gaps from your released models. Are these results reasonable?
Are these gaps due to randomness? If so, how many trials did you run to reach the mIoU of the released models?

Model          My exps   Paper mIoU (mean±std)   Released mIoU
PointNeXt-XL   69.79     70.5±0.3                71.1
PointNeXt-L    68.57     69.0±0.5                69.3
PointNeXt-B    66.60     67.3±0.2                67.5
PointNeXt-S    63.51     63.4±0.8                64.2
ASSANet        64.79     -                       65.8

ValueError: cannot reshape array of size 2097142 into shape (780299,7)

Hi,
Great work and thanks for releasing the Code.
There is an error below; is it related to numpy? Thanks a lot.

[07/12 10:37:08 S3DIS]: Successful Loading the ckpt from log/s3dis/s3dis-train-pointnext-l-ngpus1-seed8572-20220711-202106-MtuyFobTuHUC9ZWGpn9bng/checkpoint/s3dis-train-pointnext-l-ngpus1-seed8572-20220711-202106-MtuyFobTuHUC9ZWGpn9bng_ckpt_best.pth
[07/12 10:37:08 S3DIS]: ckpts @ 66 epoch( {'best_val': 67.55729675292969} )
0%| | 0/63 [00:00<?, ?it/s]examples/segmentation/main.py:395: DeprecationWarning: np.int is a deprecated alias for the builtin int. To silence this warning, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
label = torch.from_numpy(cdata[:, 6].astype(np.int).squeeze()).cuda(non_blocking=True)
6%|β–ˆβ–ˆβ–‹ | 4/63 [02:00<36:23, 37.02s/it]
17%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 11/63 [11:03<42:01, 48.48s/it]
95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 60/63 [57:40<02:53, 57.68s/it]
Traceback (most recent call last):
  File "examples/segmentation/main.py", line 528, in <module>
    main(0, cfg)
  File "examples/segmentation/main.py", line 208, in main
    test_miou, test_macc, test_oa, test_ious, test_accs, _ = test_entire_room(model, cfg.dataset.common.test_area, cfg)
  File "/home/olan/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "examples/segmentation/main.py", line 392, in test_entire_room
    cdata = np.load(data_path).astype(np.float32)  # xyz, rgb, label, N*7
  File "/home/olan/anaconda3/envs/openpoints/lib/python3.7/site-packages/numpy/lib/npyio.py", line 441, in load
    pickle_kwargs=pickle_kwargs)
  File "/home/olan/anaconda3/envs/openpoints/lib/python3.7/site-packages/numpy/lib/format.py", line 783, in read_array
    array.shape = shape
ValueError: cannot reshape array of size 2097142 into shape (780299,7)
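
For what it's worth, 2,097,142 is not a multiple of 7, so this usually indicates a truncated or corrupted .npy file (e.g., an interrupted download) rather than a numpy-version problem; a quick sanity check, with data_path standing in for the failing file:

    import os

    data_path = 'path/to/failing_file.npy'  # placeholder: the file np.load rejects
    # A complete float32 array of shape (780299, 7) needs
    # 780299 * 7 * 4 bytes of data plus a small .npy header.
    expected_bytes = 780299 * 7 * 4
    actual_bytes = os.path.getsize(data_path)
    print(expected_bytes, actual_bytes)  # if actual is far smaller, re-download the file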

S3DIS dataset preprocessing

Hi~
I found that the S3DIS dataset you provided is quite different from the original version. There seems to be a translation in xyz. Is that right? Is there some other processing? Could you describe it in more detail?

I also noticed that you move the scan to the coordinate origin by subtracting the minimum: https://github.com/guochengqian/openpoints/blob/fb998885f2895922ce257affa4eda81ff12a615b/dataset/s3dis/s3dis.py#L97
This is quite weird since, according to my understanding, you have already done the same thing in the data preprocessing.
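
For context, "moving the scan to the coordinate origin by subtracting the minimum" described above amounts to something like the following sketch (illustrative, not the repo's exact line):

    import numpy as np

    points = np.random.rand(1024, 6)             # placeholder scan: xyz + rgb
    points[:, :3] -= points[:, :3].min(axis=0)   # shift the minimum corner to the origin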

lower results than expected

Thanks for the great work!

I used 4 GPUs to run script/main_segmentation.sh cfgs/s3dis/pointnext-xl.yaml, but only got val_oa 89.21, val_macc 76.35, val_miou 69.08 (best ckpt @e61), which is lower than the results you show on the homepage.

Is this a normal result, or are there some experiment settings I haven't noticed that affect the results?

s3dis

Hi, thanks for the nice work! I noticed there is a pointtransformer.py in openpoints/models/backbone. Have you evaluated PointTransformer's performance under your proposed training strategies? Could you please provide the relevant configuration files (.yaml)?

about S3DIS testing problem

When I test the pretrained model, there is an error (screenshot omitted): the pred should match the true value, but it does not; the pred length is twice the real value. The command is:
bash script/main_segmentation.sh cfgs/s3dis/pointnext-s.yaml --batch_size 16 --mode test --pretrained_path /home/dx/PointNeXt-master/log/s3dis/s3dis-train-pointnext-s-ngpus2-seed2980---batch_size-16-20220627-144928-6p4TQ9mFTv82zZcXCLgkw6/checkpoint/s3dis-train-pointnext-s-ngpus2-seed2980---batch_size-16-20220627-144928-6p4TQ9mFTv82zZcXCLgkw6_ckpt_best.pth

License for code and weights

Thanks for sharing this amazing work!
Could you please clarify and add license terms for both the code and the pretrained weights?

Thanks in advance,

adding a prediction script

Hey, I wanted to know if it is possible to add a predict.py file to the examples folder, which would let me easily test the quality of the model qualitatively on my own data.

AttributeError: module 'torch.cuda' has no attribute 'custom_fwd'

Thank you for your open codes!
It seems that the version of torch or CUDA is mismatched.

script/main_segmentation.sh: line 34: nvcc: command not found
MAC235
4
Traceback (most recent call last):
  File "examples/segmentation/main.py", line 18, in <module>
    from openpoints.dataset import build_dataloader_from_cfg, get_scene_seg_features, get_class_weights
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/dataset/__init__.py", line 6, in <module>
    from .scanobjectnn import *
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/dataset/scanobjectnn/__init__.py", line 1, in <module>
    from .scanobjectnn import ScanObjectNNHardest
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/dataset/scanobjectnn/scanobjectnn.py", line 5, in <module>
    from openpoints.models.layers import fps
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/models/__init__.py", line 5, in <module>
    from .backbone import *
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/models/backbone/__init__.py", line 1, in <module>
    from .pointnetv2 import PointNet2Encoder, PointNet2Decoder, PointNetFPModule
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/models/backbone/pointnetv2.py", line 16, in <module>
    from ..layers import furthest_point_sample, random_sample, LocalAggregation, three_interpolation, create_convblock1d
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/models/layers/__init__.py", line 7, in <module>
    from .group import grouping_operation, gather_operation, create_grouper
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/models/layers/group.py", line 76, in <module>
    class GroupingOperation(Function):
  File "/home/lhj/pointcloud/PointNeXt-master/examples/segmentation/../../openpoints/models/layers/group.py", line 79, in GroupingOperation
    @torch.cuda.custom_fwd(cast_inputs=torch.float32)
AttributeError: module 'torch.cuda' has no attribute 'custom_fwd'

I queried the versions via conda list -f cudatoolkit and conda list -f pytorch, and it returned:

# packages in environment at /home/lhj/anaconda3/envs/openpoints:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.1.74              h6bb024c_0    nvidia
# packages in environment at /home/lhj/anaconda3/envs/openpoints:
#
# Name                    Version                   Build  Channel
pytorch                   1.10.1          py3.7_cuda11.1_cudnn8.0.5_0    pytorch

The CUDA version is 11.1 and the PyTorch version is 1.10, which matches the requirement. Therefore, the PyTorch and CUDA versions should be OK.

How can I fix this problem?

Question about height appending

Hi, thank you for such great open-source work!
Since I'm new to this area, I'm confused about this part of the paper: height appending [47] (i.e., appending the measurement of each point along the gravity direction of objects as additional input features).
If I'm not wrong, it seems to correspond to these two parts of the code in openpoints.

s3dis

        # pre-process.
        if self.transform is not None:
            data = self.transform(data)
        data['x'] = torch.cat((data['x'], torch.from_numpy(
            coord[:, 3-self.n_shifted:3].astype(np.float32))), dim=-1)

scanobjectnn

        # height appending. @KPConv
        if 'heights' in data.keys():
            data['x'] = torch.cat((data['pos'], data['heights']), dim=1)
        else:
            data['x'] = torch.cat((data['pos'],
                                   torch.from_numpy(current_points[:, self.gravity_dim:self.gravity_dim+1] - current_points[:, self.gravity_dim:self.gravity_dim+1].min())), dim=1)

If I understand correctly, gravity_dim refers to the z-axis of the point cloud, and the appended height can be either the original z value (for S3DIS) or its difference from the height minimum (for ScanObjectNN). Is my understanding correct?

OOM when training ASSANet-L in a single A100 (40G)

I tried to train ASSANet-L on a single A100 (40GB memory) via CUDA_VISIBLE_DEVICES=3 bash script/main_segmentation.sh cfgs/s3dis/assanet-l.yaml. However, it returned an OOM error.

[08/23 23:12:49 S3DIS]: length of training dataset: 6120
  0%|                                                                                                                | 0/191 [00:30<?, ?it/s]
Traceback (most recent call last):
  File "examples/segmentation/main.py", line 528, in <module>
    main(0, cfg)
  File "examples/segmentation/main.py", line 151, in main
    train_one_epoch(model, train_loader, criterion, optimizer, scheduler, epoch, cfg)
  File "examples/segmentation/main.py", line 258, in train_one_epoch
    loss.backward()
  File "/home/lhj/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/lhj/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 4.95 GiB (GPU 0; 39.44 GiB total capacity; 28.40 GiB already allocated; 4.88 GiB free; 32.85 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

wandb: Waiting for W&B process to finish, PID 1424604... (failed 1). Press ctrl-c to abort syncing.
wandb:
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 2 other file(s)
wandb: Synced s3dis-train-assanet-l-ngpus1-seed1505-20220823-231215-GuXKGc4ow6CF6ZymcSXg6V: https://wandb.ai/linhaojia/PointNeXt-S3DIS/runs/3qy19fo9
wandb: Find logs at: ./wandb/run-20220823_231216-3qy19fo9/logs/debug.log
wandb:

What hardware did you use to train ASSANet-L? Should I use more GPUs?

Full scene as input

Hi,

I am amazed by the great quality of your work. I would like to try your different optimization techniques on the KPConv architecture. I have only started to look at your code, but I already have one question.

When I read your paper, it seems that you consider loading the entire scene as input to be a training augmentation. I am curious whether that is really the case, or whether you load the entire scene only at test time.

When I look at the default configuration for S3DIS:

voxel_max: 24000

it seems that there is a limit on the input size for training. Is that right? Have you managed to train with an entire S3DIS scene as input? I don't see how you could do it without running OOM on your GPUs.

A follow-up question: for the bigger networks like pointnext-xl, even a simple inference could lead to OOM errors on a big scene like S3DIS Area_5. Does inference fit on your GPU, or do you have tricks to split the workload across multiple GPUs?

Best,
Hugues

Run time error

import openpoints.cpp.subsampling.grid_subsampling as cpp_subsampling
ImportError: PointNeXt-master/examples/segmentation/../../openpoints/cpp/subsampling/grid_subsampling.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object

RuntimeError: The size of tensor a (719348) must match the size of tensor b (1438695) at non-singleton dimension 0

Hi @guochengqian

Have you ever met this bug using two GPU cards? Thanks.

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ[07/26 11:32:16 S3DIS]: Epoch 100 LR 0.000012 train_miou 95.15, val_miou 68.49, best val miou 69.55
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 34/34 [00:41<00:00, 1.22s/it]
[07/26 11:32:18 S3DIS]: Best ckpt @E68, val_oa 89.78, val_macc 76.22, val_miou 69.55,
iou per cls is: [93.14 97.92 83.66 0. 43.12 54.75 75.45 81.45 90.8 74.57 75.66 73.42
60.24]
[07/26 11:32:18 S3DIS]: Successful Loading the ckpt from log/s3dis/s3dis-train-pointnext-xl-ngpus2-seed7272-20220725-213543-eYW6GZURAs6oyghwFxnAPs/checkpoint/s3dis-train-pointnext-xl-ngpus2-seed7272-20220725-213543-eYW6GZURAs6oyghwFxnAPs_ckpt_best.pth
[07/26 11:32:18 S3DIS]: ckpts @ 68 epoch( {'best_val': 69.55094146728516} )
0%| | 0/68 [00:00<?, ?it/s]
0%| | 0/68 [00:27<?, ?it/s]

Traceback (most recent call last):
  File "examples/segmentation/main.py", line 529, in <module>
    mp.spawn(main, nprocs=cfg.world_size, args=(cfg,))  # original args=(cfg), run with bugs, should be args=(cfg,)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/main.py", line 211, in main
    test_miou, test_macc, test_oa, test_ious, test_accs, _ = test_entire_room(model, cfg.dataset.common.test_area, cfg)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/main.py", line 452, in test_entire_room
    cm.update(all_logits.argmax(dim=1), label)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/../../openpoints/utils/metrics.py", line 69, in update
    unique_mapping = true.flatten() * self.virtual_num_classes + pred.flatten()
RuntimeError: The size of tensor a (719348) must match the size of tensor b (1438695) at non-singleton dimension 0

Dockerfile Request

Hey, I wanted to know if it is possible to add a Dockerfile or create a Docker image for this project, so it would be easier to try out and to run the tests etc. correctly.

I can help with this if needed.

Question about InvResMLP

I noticed that you set blocks to [1, 1, 1, 1, 1, 1] in the ModelNet40 classification task, and the '_make_enc' function in PointNextEncoder sets the range as (1, blocks). This means there isn't any InvResMLP in the network, right?
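
For reference, if _make_enc really iterates over range(1, blocks), then blocks == 1 yields an empty range, so no InvResMLP block would be appended at that stage:

    # range(1, blocks) is empty when blocks == 1,
    # so a loop over it appends no InvResMLP blocks.
    blocks = 1
    assert list(range(1, blocks)) == []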
