Comments (4)
More on this: the problem only happens on the GPU; if I use the CPU (set gpu_count = 0 in the config), the problem is gone.
from mlcpp.
@huangenyan Hello, could you please provide more details: which system you use, the image you used for evaluation, the type of your GPU, how the dump was generated, what type of build (with or without optimizations) you used, etc.
I don't have this problem in my environment.
I just forgot to mention that what I use is mask_rcnn_pytorch.
I spent some time on this, and here is some information you may find helpful:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ nvidia-smi
Thu Aug 29 15:51:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 41% 41C P2 56W / 260W | 6585MiB / 11016MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1223 G /usr/lib/xorg/Xorg 188MiB |
| 0 1464 G /usr/bin/gnome-shell 113MiB |
| 0 11545 G ...quest-channel-token=5961631727014844578 235MiB |
| 0 15818 C /usr/bin/valgrind.bin 5890MiB |
| 0 31699 G ...uest-channel-token=15778414260646414614 153MiB |
+-----------------------------------------------------------------------------+
The GPU is an RTX 2080 Ti.
$ valgrind -v ./mask-rcnn_demo
...
==15818== 1 errors in context 1 of 1:
==15818== Invalid read of size 4
==15818== at 0x44BC09FE: ??? (in /usr/local/cuda-10.0/lib64/libcudart.so.10.0.130)
==15818== by 0x44BC596A: ??? (in /usr/local/cuda-10.0/lib64/libcudart.so.10.0.130)
==15818== by 0x44BDABE1: cudaDeviceSynchronize (in /usr/local/cuda-10.0/lib64/libcudart.so.10.0.130)
==15818== by 0x14E26393: cudnnDestroy (in /usr/local/lib/libcaffe2_gpu.so)
==15818== by 0x109A4CF0: std::unordered_map<int, at::native::(anonymous namespace)::Handle, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, at::native::(anonymous namespace)::Handle> > >::~unordered_map() (in /usr/local/lib/libcaffe2_gpu.so)
==15818== by 0x447F5614: __cxa_finalize (cxa_finalize.c:83)
==15818== by 0x107B2FB2: ??? (in /usr/local/lib/libcaffe2_gpu.so)
==15818== by 0x4010B72: _dl_fini (dl-fini.c:138)
==15818== by 0x447F5040: __run_exit_handlers (exit.c:108)
==15818== by 0x447F5139: exit (exit.c:139)
==15818== by 0x447D3B9D: (below main) (libc-start.c:344)
==15818== Address 0x18 is not stack'd, malloc'd or (recently) free'd
==15818==
--15818--
--15818-- used_suppression: 98231 zlib-1.2.x trickyness (1b): See http://www.zlib.net/zlib_faq.html#faq36 /usr/lib/valgrind/default.supp:516
==15818==
==15818== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 98231 from 1)
I also tested which statement causes the segmentation fault by adding exit(0) at different locations in the program, and found that the problem occurs in fpn.cpp:
std::tuple<torch::Tensor,
torch::Tensor,
torch::Tensor,
torch::Tensor,
torch::Tensor>
FPNImpl::forward(at::Tensor x) {
// no segmentation fault if adding exit(0) here
x = c1_->forward(x);
// segmentation fault if adding exit(0) here
x = c2_->forward(x);
auto c2_out = x;
x = c3_->forward(x);
auto c3_out = x;
x = c4_->forward(x);
auto c4_out = x;
x = c5_->forward(x);
auto p5_out = p5_conv1_->forward(x);
auto p4_out =
p4_conv1_->forward(c4_out) + upsample(p5_out, /*scale_factor*/ 2);
auto p3_out =
p3_conv1_->forward(c3_out) + upsample(p4_out, /*scale_factor*/ 2);
auto p2_out =
p2_conv1_->forward(c2_out) + upsample(p3_out, /*scale_factor*/ 2);
p5_out = p5_conv2_->forward(p5_out);
p4_out = p4_conv2_->forward(p4_out);
p3_out = p3_conv2_->forward(p3_out);
p2_out = p2_conv2_->forward(p2_out);
// P6 is used for the 5th anchor scale in RPN. Generated by subsampling from
// P5 with stride of 2.
auto p6_out = p6_->forward(p5_out);
return {p2_out, p3_out, p4_out, p5_out, p6_out};
}
I'm still working on this and will hopefully provide more information.
I created a minimal source file that reproduces the error:
#include <torch/torch.h>

int main(int argc, char** argv) {
  auto input = torch::ones({1, 3, 1024, 1024});
  input = input.to(torch::DeviceType::CUDA);
  auto c2 = torch::nn::Conv2d(
      torch::nn::Conv2dOptions(3, 64, 7).stride(2).padding(3));
  c2->to(torch::DeviceType::CUDA);
  c2->forward(input);
  return 0;
}
The example has nothing to do with your code, so I think it is a bug in the PyTorch C++ frontend, and I'll report an issue there.
Thanks!