
Comments (4)

huangenyan commented on July 21, 2024

More on this: the problem only happens on the GPU; if I use the CPU (by setting gpu_count = 0 in the config), the problem goes away.
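
Roughly how a gpu_count-style setting can map onto the device choice in LibTorch (SelectDevice and the check below are only a sketch for illustration, not the actual mask_rcnn_pytorch code):

#include <torch/torch.h>

// Sketch only: map the gpu_count config value onto a torch::Device.
// gpu_count is the config field mentioned above; everything else is assumed.
torch::Device SelectDevice(int gpu_count) {
  if (gpu_count > 0 && torch::cuda::is_available())
    return torch::Device(torch::kCUDA);
  return torch::Device(torch::kCPU);  // gpu_count == 0 -> stay on the CPU
}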

Kolkir commented on July 21, 2024

@huangenyan Hello, could you please provide more details: which system you use, the image you used for evaluation, the type of your GPU, whether a dump was generated, and what type of build (with or without optimizations) you used ...
I don't have this problem in my environment.

huangenyan commented on July 21, 2024

I just forgot to mention that what I use is mask_rcnn_pytorch.
I spent some time on this, and here is some information you may find helpful:

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ nvidia-smi
Thu Aug 29 15:51:46 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 41%   41C    P2    56W / 260W |   6585MiB / 11016MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1223      G   /usr/lib/xorg/Xorg                           188MiB |
|    0      1464      G   /usr/bin/gnome-shell                         113MiB |
|    0     11545      G   ...quest-channel-token=5961631727014844578   235MiB |
|    0     15818      C   /usr/bin/valgrind.bin                       5890MiB |
|    0     31699      G   ...uest-channel-token=15778414260646414614   153MiB |
+-----------------------------------------------------------------------------+

The GPU is an RTX 2080 Ti.

$ valgrind -v ./mask-rcnn_demo
...
==15818== 1 errors in context 1 of 1:
==15818== Invalid read of size 4
==15818==    at 0x44BC09FE: ??? (in /usr/local/cuda-10.0/lib64/libcudart.so.10.0.130)
==15818==    by 0x44BC596A: ??? (in /usr/local/cuda-10.0/lib64/libcudart.so.10.0.130)
==15818==    by 0x44BDABE1: cudaDeviceSynchronize (in /usr/local/cuda-10.0/lib64/libcudart.so.10.0.130)
==15818==    by 0x14E26393: cudnnDestroy (in /usr/local/lib/libcaffe2_gpu.so)
==15818==    by 0x109A4CF0: std::unordered_map<int, at::native::(anonymous namespace)::Handle, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, at::native::(anonymous namespace)::Handle> > >::~unordered_map() (in /usr/local/lib/libcaffe2_gpu.so)
==15818==    by 0x447F5614: __cxa_finalize (cxa_finalize.c:83)
==15818==    by 0x107B2FB2: ??? (in /usr/local/lib/libcaffe2_gpu.so)
==15818==    by 0x4010B72: _dl_fini (dl-fini.c:138)
==15818==    by 0x447F5040: __run_exit_handlers (exit.c:108)
==15818==    by 0x447F5139: exit (exit.c:139)
==15818==    by 0x447D3B9D: (below main) (libc-start.c:344)
==15818==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==15818== 
--15818-- 
--15818-- used_suppression:  98231 zlib-1.2.x trickyness (1b): See http://www.zlib.net/zlib_faq.html#faq36 /usr/lib/valgrind/default.supp:516
==15818== 
==15818== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 98231 from 1)
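
The backtrace suggests the invalid read happens while a static map of cudnn handles inside libcaffe2_gpu is being destroyed during __cxa_finalize, i.e. after main() has already returned, when the CUDA runtime may already be tearing down. Schematically, the pattern looks like this (Handle and g_handles are illustrative names, not the actual libcaffe2 code):

#include <cudnn.h>
#include <unordered_map>

// Illustrative only: a file-scope map whose destructor runs during
// __cxa_finalize. cudnnDestroy() ends up calling cudaDeviceSynchronize()
// (see the trace above), which can touch already-released runtime state
// if CUDA shutdown has started by then.
struct Handle {
  cudnnHandle_t handle{nullptr};
  ~Handle() { if (handle) cudnnDestroy(handle); }
};
static std::unordered_map<int, Handle> g_handles;  // destroyed at process exit

int main() {
  cudnnCreate(&g_handles[0].handle);  // create a handle so the destructor has work to do
  return 0;  // ~unordered_map() -> ~Handle() -> cudnnDestroy() runs after this line
}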

I also tested which statement causes the segmentation fault by adding exit(0) at different locations in the program, and found that the problem occurs in fpn.cpp:

std::tuple<torch::Tensor,
           torch::Tensor,
           torch::Tensor,
           torch::Tensor,
           torch::Tensor>
FPNImpl::forward(at::Tensor x) {
// no segmentation fault if adding exit(0) here
  x = c1_->forward(x);
// segmentation fault if adding exit(0) here
  x = c2_->forward(x);
  auto c2_out = x;
  x = c3_->forward(x);
  auto c3_out = x;
  x = c4_->forward(x);
  auto c4_out = x;
  x = c5_->forward(x);
  auto p5_out = p5_conv1_->forward(x);
  auto p4_out =
      p4_conv1_->forward(c4_out) + upsample(p5_out, /*scale_factor*/ 2);
  auto p3_out =
      p3_conv1_->forward(c3_out) + upsample(p4_out, /*scale_factor*/ 2);
  auto p2_out =
      p2_conv1_->forward(c2_out) + upsample(p3_out, /*scale_factor*/ 2);

  p5_out = p5_conv2_->forward(p5_out);
  p4_out = p4_conv2_->forward(p4_out);
  p3_out = p3_conv2_->forward(p3_out);
  p2_out = p2_conv2_->forward(p2_out);

  // P6 is used for the 5th anchor scale in RPN. Generated by subsampling from
  // P5 with stride of 2.
  auto p6_out = p6_->forward(p5_out);

  return {p2_out, p3_out, p4_out, p5_out, p6_out};
}

I'm still working on this and will hopefully provide more information.

huangenyan commented on July 21, 2024

I created a minimal source file that reproduces the error:

#include <torch/torch.h>

#include <iostream>
#include <memory>


int main(int argc, char** argv) {

  auto input = torch::ones({1, 3, 1024, 1024});
  input = input.to(torch::DeviceType::CUDA);

  auto c2 = torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 64, 7).stride(2).padding(3));
  c2->to(torch::DeviceType::CUDA);
  c2->forward(input);
  return 0;
}

The example has nothing to do with your code, so I think it is a bug in the PyTorch C++ frontend, and I'll report an issue there.
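
For comparison, here is the same repro kept on the CPU (only a sketch: following the gpu_count = 0 observation above, I would expect this path not to hit the crash at exit):

#include <torch/torch.h>

int main() {
  // Same tensor and layer as the repro above, but never moved to CUDA.
  auto input = torch::ones({1, 3, 1024, 1024});
  auto conv = torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 64, 7).stride(2).padding(3));
  conv->forward(input);
  return 0;
}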

Thanks!
