Comments (21)
I was able to reproduce the same error when I used the wrong version of CUDA.
What I did:
I installed PyTorch with conda install pytorch torchvision cudatoolkit=10.1 -c pytorch, but my local CUDA runtime and nvcc are both 10.0.
In this case, I observe the same error.
Please check whether your CUDA version is correct.
from detectron2.
It seems that mismatched NVCC vs CUDA Runtime version is the root cause. Closing but feel free to reopen if this does not solve your issue.
from detectron2.
Yes, that was the key to solving my problem.
Problem
First, a brief description of my problem: I'm new to Detectron2 and have only one GPU (GeForce GTX 1080 Ti). I chose to build Detectron2 from source:
# Or, to install it from a local clone:
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
Everything went fine and detectron2 was installed successfully:
$ python -m pip install -e detectron2
Obtaining file:///home/lab305/ZhuJian/detectron2
Requirement already satisfied: termcolor>=1.1 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (1.1.0)
Requirement already satisfied: Pillow in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (7.0.0)
Requirement already satisfied: yacs>=0.1.6 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (0.1.7)
Requirement already satisfied: tabulate in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (0.8.7)
Requirement already satisfied: cloudpickle in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (1.4.1)
Requirement already satisfied: matplotlib in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (3.1.2)
Requirement already satisfied: mock in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (4.0.2)
Requirement already satisfied: tqdm>4.29.0 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (4.46.0)
Requirement already satisfied: tensorboard in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (2.0.0)
Requirement already satisfied: fvcore>=0.1.1 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (0.1.1.post200513)
Requirement already satisfied: future in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (0.18.2)
Requirement already satisfied: pydot in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from detectron2==0.1.3) (1.4.1)
Requirement already satisfied: PyYAML in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from yacs>=0.1.6->detectron2==0.1.3) (5.3.1)
Requirement already satisfied: cycler>=0.10 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from matplotlib->detectron2==0.1.3) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from matplotlib->detectron2==0.1.3) (2.4.6)
Requirement already satisfied: numpy>=1.11 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from matplotlib->detectron2==0.1.3) (1.18.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from matplotlib->detectron2==0.1.3) (1.1.0)
Requirement already satisfied: python-dateutil>=2.1 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from matplotlib->detectron2==0.1.3) (2.8.1)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (0.33.6)
Requirement already satisfied: protobuf>=3.6.0 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (3.11.2)
Requirement already satisfied: werkzeug>=0.11.15 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (0.16.0)
Requirement already satisfied: six>=1.10.0 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (1.13.0)
Requirement already satisfied: absl-py>=0.4 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (0.8.1)
Requirement already satisfied: setuptools>=41.0.0 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (44.0.0.post20200106)
Requirement already satisfied: markdown>=2.6.8 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (3.1.1)
Requirement already satisfied: grpcio>=1.6.3 in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.3) (1.16.1)
Requirement already satisfied: portalocker in /home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages (from fvcore>=0.1.1->detectron2==0.1.3) (1.7.0)
Installing collected packages: detectron2
Found existing installation: detectron2 0.1.3
Uninstalling detectron2-0.1.3:
Successfully uninstalled detectron2-0.1.3
Running setup.py develop for detectron2
Successfully installed detectron2
But when I try to train, it fails:
$ ./train_net.py --config-file ../configs/PascalVOC-Detection/faster_rcnn_R_50_C4.yaml --num-gpus 1 SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025
...
...
[06/14 17:31:31 d2.engine.train_loop]: Starting training from iteration 0
ERROR [06/14 17:31:32 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/lab305/ZhuJian/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/lab305/ZhuJian/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
loss_dict = self.model(data)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 123, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 426, in forward
[features[f] for f in self.in_features], proposal_boxes
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 410, in _shared_roi_transform
x = self.pooler(features, boxes)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/poolers.py", line 214, in forward
return self.level_poolers[0](x[0], pooler_fmt_boxes)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/layers/roi_align.py", line 95, in forward
input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
File "/home/lab305/ZhuJian/detectron2/detectron2/layers/roi_align.py", line 20, in forward
input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: invalid device function
[06/14 17:31:32 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
Traceback (most recent call last):
File "./train_net.py", line 169, in <module>
args=(args,),
File "/home/lab305/ZhuJian/detectron2/detectron2/engine/launch.py", line 57, in launch
main_func(*args)
File "./train_net.py", line 157, in main
return trainer.train()
File "/home/lab305/ZhuJian/detectron2/detectron2/engine/defaults.py", line 402, in train
super().train(self.start_iter, self.max_iter)
File "/home/lab305/ZhuJian/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/lab305/ZhuJian/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
loss_dict = self.model(data)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 123, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 426, in forward
[features[f] for f in self.in_features], proposal_boxes
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 410, in _shared_roi_transform
x = self.pooler(features, boxes)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/modeling/poolers.py", line 214, in forward
return self.level_poolers[0](x[0], pooler_fmt_boxes)
File "/home/lab305/anaconda3/envs/pytorch1.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab305/ZhuJian/detectron2/detectron2/layers/roi_align.py", line 95, in forward
input, rois, self.output_size, self.spatial_scale, self.sampling_ratio, self.aligned
File "/home/lab305/ZhuJian/detectron2/detectron2/layers/roi_align.py", line 20, in forward
input, roi, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: invalid device function
Segmentation fault (core dumped)
Solution
I checked the CUDA versions:
# nvidia-smi
CUDA Version: 10.2
# nvcc --version
Cuda compilation tools, release 10.0, V10.0.130
Previously I had installed cudatoolkit=10.2, but now I chose the earlier version, which matches nvcc:
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
After rebuilding Detectron2, the problem was solved!
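The comparison described above can also be scripted. This is only a minimal sketch, assuming nvcc is on PATH and PyTorch is importable; the helper names parse_nvcc_release, versions_match and local_nvcc_release are hypothetical and not part of PyTorch or Detectron2.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(torch_cuda: Optional[str], nvcc_release: Optional[str]) -> bool:
    """True only when both versions are known and share the same major.minor."""
    if torch_cuda is None or nvcc_release is None:
        return False
    return torch_cuda.split(".")[:2] == nvcc_release.split(".")[:2]


def local_nvcc_release() -> Optional[str]:
    """Run `nvcc --version`; return None if nvcc is missing or fails."""
    try:
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(out)


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check

        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None
    nvcc = local_nvcc_release()
    print(f"PyTorch built with CUDA: {torch_cuda}, nvcc release: {nvcc}")
    print("match" if versions_match(torch_cuda, nvcc) else "MISMATCH or unknown")
```

If the two versions differ, reinstalling PyTorch for the CUDA version that nvcc reports and then rebuilding Detectron2 is the fix reported in this comment.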
from detectron2.
Most likely the solution to your problem is already in https://detectron2.readthedocs.io/tutorials/install.html#common-installation-issues.
If you need help to solve an unexpected issue you observed, please include details following the issue template.
from detectron2.
As a follow-up to #78: I set up a new environment with CUDA 9.2 and this solved my issue. Could the problem be that, as stated at https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md, all models are trained with CUDA 9.2?
from detectron2.
No, it's unrelated to the model zoo.
It's likely because CUDA 9.2 is simply what your computer is using.
from detectron2.
It seems like you did not build detectron2 correctly. You may have had wrong values in the TORCH_CUDA_ARCH_LIST environment variable when you built it. Could you check this environment variable at the time you build it?
from detectron2.
I deleted the build folder and the detectron2/_C.cpython-36m-x86_64-linux-gnu.so file, then rebuilt by running this command in the repo root directory:
TORCH_CUDA_ARCH_LIST="6.1;7.5" pip install -e .
I'm running on a 1080ti, which should be covered under "6.1". This results in the same errors as above.
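To rule out the architecture list as the cause, as this commenter did, one can check whether a given TORCH_CUDA_ARCH_LIST value covers the GPU's compute capability. The sketch below is illustrative only: parse_arch_list and arch_list_covers are hypothetical helpers, they handle only numeric entries such as "6.1" or "7.5+PTX" (not named architectures), and the compatibility rules are a simplification of CUDA's binary and PTX compatibility.

```python
from typing import List, Tuple


def parse_arch_list(arch_list: str) -> List[Tuple[Tuple[int, int], bool]]:
    """Parse e.g. '6.1;7.5+PTX' into [((6, 1), False), ((7, 5), True)]."""
    entries = []
    for token in arch_list.replace(";", " ").split():
        has_ptx = token.endswith("+PTX")
        version = token[: -len("+PTX")] if has_ptx else token
        major, minor = version.split(".")
        entries.append(((int(major), int(minor)), has_ptx))
    return entries


def arch_list_covers(arch_list: str, capability: Tuple[int, int]) -> bool:
    """Check whether code built for `arch_list` can run on `capability`."""
    for arch, has_ptx in parse_arch_list(arch_list):
        # Binary code runs on the same major version with an equal or higher minor.
        if arch[0] == capability[0] and arch[1] <= capability[1]:
            return True
        # PTX can be JIT-compiled for any device at least as new as its target.
        if has_ptx and arch <= capability:
            return True
    return False


# A GTX 1080 Ti has compute capability 6.1.
print(arch_list_covers("6.1;7.5", (6, 1)))
```

With PyTorch installed, torch.cuda.get_device_capability(0) returns the tuple to pass as capability. If the list does cover the device, as it does here, the error most likely comes from something else, such as the CUDA version mismatch discussed elsewhere in this thread.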
from detectron2.
Is there a way either of you can let others reproduce this issue in docker or colab?
from detectron2.
I'm actually just trying to get object detection on LVIS running, and I'm able to run the model successfully when I swap the Mask R-CNN model out for a RetinaNet (which doesn't use ROIAlign). Unfortunately, I don't have time right now to set up Docker or Colab to replicate the issue.
from detectron2.
The updated collect_env in e85114c can now show the type of error I met.
from detectron2.
I ran into this error as well. Reinstalling PyTorch built for a lower CUDA version (one that matches my system CUDA) resolved the issue.
from detectron2.
@ppwwyyxx what should TORCH_CUDA_ARCH_LIST ideally be set to if one is using cuda/10.0 or cuda/10.1 with PyTorch 1.3? nvcc --version shows CUDA 10.0 for me as well, and I'm not sure what you mean above by a mismatch between the nvcc and CUDA runtime versions, since they're always the same for me.
The build happens successfully but I get this error upon running demo.py:
RuntimeError: CUDA error: no kernel image is available for execution on the device (ROIAlign_forward_cuda at /network/home/guptagun/od/detectron2_repo/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:361)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f010803e687 in /network/home/guptagun/anaconda3/envs/detectron/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: detectron2::ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa24 (0x7f01065ac89c in /network/home/guptagun/od/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: detectron2::ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xb6 (0x7f010654df66 in /network/home/guptagun/od/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x4ec8f (0x7f010655fc8f in /network/home/guptagun/od/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x49750 (0x7f010655a750 in /network/home/guptagun/od/detectron2_repo/detectron2/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #9: THPFunction_apply(_object*, _object*) + 0x8d6 (0x7f010a180e96 in /network/home/guptagun/anaconda3/envs/detectron/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
Posting the error here because it seems related, can make a new issue if you recommend.
Thanks!
from detectron2.
what should TORCH_CUDA_ARCH_LIST ideally be set
The best option is to unset it (i.e., no such env variable).
If you cannot solve the issue with existing information, please open a new one following the template.
from detectron2.
Hello, I want to use detectron2, but something went wrong when I prepared the conda environment. First I installed PyTorch with conda install pytorch torchvision cudatoolkit=10.1 -c pytorch, but as you mentioned, I got the error and found that my local CUDA runtime and nvcc are 10.0. (I built my conda environment in an LXD container, and I have no permission to change the local CUDA runtime or nvcc version.) So I used conda install -c pytorch pytorch=1.3.0 cudatoolkit=10.0 to install PyTorch for CUDA 10.0. However, I then hit the error from issue #459. I can only choose CUDA 9.0 or CUDA 10.0, and I see that Detectron2 can only run with CUDA 9.2 and CUDA 10.1. Could you please tell me how I can solve this?
from detectron2.
Detectron2 can run with cuda 10.0.
#459 is caused by incorrect installation of torchvision as explained there.
from detectron2.
Thanks for your reply. I deleted the build folder in detectron2 and rebuilt it, and it works well for me now.
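The clean-up step mentioned here and earlier in the thread (removing the build folder and the compiled detectron2/_C*.so file before rebuilding) could be sketched as follows. The function names are hypothetical, and the paths are assumed from the comments above rather than taken from any official Detectron2 tooling.

```python
import shutil
from pathlib import Path
from typing import List


def find_stale_artifacts(repo_root: Path) -> List[Path]:
    """List the build directory and compiled _C extension files, if present."""
    stale: List[Path] = []
    build_dir = repo_root / "build"
    if build_dir.is_dir():
        stale.append(build_dir)
    # Compiled extension, e.g. detectron2/_C.cpython-36m-x86_64-linux-gnu.so
    stale.extend(sorted((repo_root / "detectron2").glob("_C*.so")))
    return stale


def remove_stale_artifacts(repo_root: Path) -> List[Path]:
    """Delete the artifacts found above and return what was removed."""
    removed = find_stale_artifacts(repo_root)
    for path in removed:
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    return removed
```

After calling remove_stale_artifacts(Path("detectron2")) on a local clone, the extension can be rebuilt from the repo root with python -m pip install -e . as in the original instructions.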
from detectron2.
Hi, I am trying to run detectron2 for panoptic segmentation with PyTorch 1.4.0 and CUDA 10.2, and I encountered the same CUDA error in ROIAlign_forward_cuda. I tried installing detectron2 both 1) from local source code and 2) with pip install. I also double-checked python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/index.html, and it seems CUDA 10.2 is also compatible with detectron2. What further steps can I take?
from detectron2.
Great, thanks! I checked and found that the CUDA versions for detectron2 and torch were mismatched. I reinstalled detectron2 for CUDA 10.1 with a matching PyTorch, and now it works! Thanks again.
from detectron2.
I was able to reproduce the same error when I used the wrong version of CUDA.
What I did:
I installed PyTorch with conda install pytorch torchvision cudatoolkit=10.1 -c pytorch, but my local CUDA runtime and nvcc are both 10.0.
In this case, I observe the same error.
Please check whether your CUDA version is correct.
So, how did you solve this?
from detectron2.
Oh, sorry for the late response, I totally missed it. As I mentioned, my CUDA version was 10.1, but I had installed PyTorch and Detectron2 built for CUDA 10.2. So I reinstalled both to match my CUDA version. I think @zjZSTU's solution is close to mine; you can refer to it.
from detectron2.