Comments (4)

krasin commented on September 25, 2024

@hadyelsahar Usually, SIGILL happens when a binary contains an instruction that the CPU does not support. A common scenario is compiling a binary on one (newer) computer, then copying it to another (older) computer and running it there.

In your case, I would blindly guess that your computer does not support the AVX2 instruction set, while the computer used for compilation did.
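One way to test that guess is to look at the feature flags the CPU reports. The following is a minimal sketch, assuming a Linux x86 machine where /proc/cpuinfo has a "flags" line; the function names are made up for illustration. It checks both avx and avx2, since the instruction that later shows up in the disassembly (vmovsd) belongs to the AVX family.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not Linux/x86)


def missing_features(cpuinfo_text, wanted=("avx", "avx2")):
    """List the wanted instruction-set flags that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # file unavailable; every wanted flag is reported as missing
    print("missing:", missing_features(text))
```

If a flag is listed as missing, a library built on a machine that has it can be expected to hit SIGILL on the first such instruction it executes.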

If you want to find out which module and which instruction cause this crash, I would recommend running it under gdb:

gdb --args th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1
run

Once it crashes, please post the stack trace and disassembly here.

gdb commands:
stack trace: bt
disassembly of the current block: disas

The currently executing instruction is marked with "=> ".
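The same three commands can also be run non-interactively. This is a rough sketch that builds a gdb command line using gdb's -batch, -ex and --args options; the helper names are hypothetical, and it assumes gdb is on PATH.

```python
import subprocess

# The gdb commands from the comment above, in the order they should run.
GDB_COMMANDS = ("run", "bt", "disas")


def gdb_batch_command(program_args, commands=GDB_COMMANDS):
    """Build an argv list that runs the program under gdb in batch mode."""
    argv = ["gdb", "-batch"]
    for command in commands:
        argv += ["-ex", command]
    # Everything after --args is the program and its own arguments.
    return argv + ["--args"] + list(program_args)


def run_under_gdb(program_args):
    """Run the program under gdb and return gdb's combined output as text."""
    result = subprocess.run(
        gdb_batch_command(program_args), capture_output=True, text=True
    )
    return result.stdout + result.stderr
```

Note that gdb needs an actual binary as the first program argument; a wrapper script has to be started through its interpreter instead.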

from char-rnn.

hadyelsahar commented on September 25, 2024

Thanks for your help. It seems the problem is with the vmovsd instruction.

The stack trace:

#0  0x00007ffff532de50 in dgemm_oncopy () from /opt/OpenBLAS/lib/libopenblas.so.0
#1  0x0000000000000041 in ?? ()
#2  0x0000000000000026 in ?? ()
#3  0x00007ffff51cd0c7 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#4  0x00007ffff52da20c in blas_thread_server () from /opt/OpenBLAS/lib/libopenblas.so.0
#5  0x00007ffff7474182 in start_thread (arg=0x7ffff1b05700) at pthread_create.c:312
#6  0x00007ffff6f8b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The disassembly of the current block:

Dump of assembler code for function dgemm_oncopy:
   0x00007ffff532de00 <+0>: push   %r13
   0x00007ffff532de02 <+2>: push   %r12
   0x00007ffff532de04 <+4>: lea    0x0(,%rcx,8),%rcx
   0x00007ffff532de0c <+12>:    mov    %rsi,%r10
   0x00007ffff532de0f <+15>:    sar    %r10
   0x00007ffff532de12 <+18>:    jle    0x7ffff532dfd0 <dgemm_oncopy+464>
   0x00007ffff532de18 <+24>:    nopl   0x0(%rax,%rax,1)
   0x00007ffff532de20 <+32>:    mov    %rdx,%r11
   0x00007ffff532de23 <+35>:    lea    (%rdx,%rcx,1),%r12
   0x00007ffff532de27 <+39>:    lea    (%rdx,%rcx,2),%rdx
   0x00007ffff532de2b <+43>:    mov    %rdi,%r9
   0x00007ffff532de2e <+46>:    sar    $0x3,%r9
   0x00007ffff532de32 <+50>:    jle    0x7ffff532df10 <dgemm_oncopy+272>
   0x00007ffff532de38 <+56>:    nopl   0x0(%rax,%rax,1)
   0x00007ffff532de40 <+64>:    prefetchw 0x100(%r8)
   0x00007ffff532de48 <+72>:    prefetchnta 0x100(%r11)
=> 0x00007ffff532de50 <+80>:    vmovsd (%r11),%xmm0
   0x00007ffff532de55 <+85>:    vmovsd 0x8(%r11),%xmm1
   0x00007ffff532de5b <+91>:    vmovsd 0x10(%r11),%xmm2
   0x00007ffff532de61 <+97>:    vmovsd 0x18(%r11),%xmm3
   0x00007ffff532de67 <+103>:   vmovsd 0x20(%r11),%xmm4
   0x00007ffff532de6d <+109>:   vmovsd 0x28(%r11),%xmm5
   0x00007ffff532de73 <+115>:   vmovsd 0x30(%r11),%xmm6
   0x00007ffff532de79 <+121>:   vmovsd 0x38(%r11),%xmm7
   0x00007ffff532de7f <+127>:   prefetchnta 0x100(%r12)
   0x00007ffff532de88 <+136>:   vmovhpd (%r12),%xmm0,%xmm0
   0x00007ffff532de8e <+142>:   vmovhpd 0x8(%r12),%xmm1,%xmm1
   0x00007ffff532de95 <+149>:   vmovhpd 0x10(%r12),%xmm2,%xmm2
   0x00007ffff532de9c <+156>:   vmovhpd 0x18(%r12),%xmm3,%xmm3
   0x00007ffff532dea3 <+163>:   vmovhpd 0x20(%r12),%xmm4,%xmm4
   0x00007ffff532deaa <+170>:   vmovhpd 0x28(%r12),%xmm5,%xmm5
   0x00007ffff532deb1 <+177>:   vmovhpd 0x30(%r12),%xmm6,%xmm6
   0x00007ffff532deb8 <+184>:   vmovhpd 0x38(%r12),%xmm7,%xmm7
   0x00007ffff532debf <+191>:   prefetchw 0x140(%r8)
   0x00007ffff532dec7 <+199>:   vmovups %xmm0,(%r8)
   0x00007ffff532decc <+204>:   vmovups %xmm1,0x10(%r8)
   0x00007ffff532ded2 <+210>:   vmovups %xmm2,0x20(%r8)
   0x00007ffff532ded8 <+216>:   vmovups %xmm3,0x30(%r8)
   0x00007ffff532dede <+222>:   vmovups %xmm4,0x40(%r8)
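For long dumps like the one above, the marked line can be picked out mechanically. A small sketch, assuming the "=> " marker and the address/offset layout shown in this dump; the function name is hypothetical.

```python
import re

# Matches a gdb disassembly line that carries the "=>" current-instruction marker.
_CURRENT = re.compile(r"^=>\s+(0x[0-9a-f]+)\s+<\+(\d+)>:\s+(\S+)\s*(.*)$")


def current_instruction(disas_text):
    """Return details of the instruction gdb marked with '=>', or None."""
    for line in disas_text.splitlines():
        match = _CURRENT.match(line.strip())
        if match:
            address, offset, mnemonic, operands = match.groups()
            return {
                "address": address,
                "offset": int(offset),
                "mnemonic": mnemonic,
                "operands": operands.strip(),
            }
    return None  # no marked line in the dump
```

Applied to the dump above, this reports vmovsd at offset 80 of dgemm_oncopy, which matches the address in frame #0 of the stack trace.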

Just for reference, in case someone else faces the same problem: the Torch executable ~/torch/bin/th is a shell script, not a binary, so gdb can't debug it directly.

file   /torch/install/bin/th 
th: POSIX shell script, ASCII text executable, with very long lines

So to work around it, you'll need to execute the following:

gdb64 /bin/bash    # or check whether your gdb is configured for i686 or x86_64

Then, from the gdb prompt, run:

run th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1 
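The check that `file` performed above can be approximated by looking at the first bytes of the launcher. This is only a sketch with hypothetical helper names: it distinguishes ELF binaries (magic bytes 0x7f "ELF") from scripts that start with a "#!" line, and reports anything else as unknown.

```python
def classify_header(head):
    """Classify a file from its leading bytes: 'elf', 'script: <interpreter>' or 'unknown'."""
    if head.startswith(b"\x7fELF"):
        return "elf"
    if head.startswith(b"#!"):
        # Take the first word after '#!' on the first line as the interpreter.
        first_line = head[2:].split(b"\n", 1)[0]
        parts = first_line.decode("utf-8", "replace").split()
        return "script: " + parts[0] if parts else "unknown"
    return "unknown"


def classify_file(path):
    """Read the first 256 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(256))
```

When the result is a script, the interpreter it names is the binary to hand to gdb, with the script and its arguments passed to `run` as in the workaround above.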

P.S. I think this issue is related more to Torch than to this repo, so let me know if you want me to move it there.

krasin commented on September 25, 2024

Good data, @hadyelsahar!

According to #0 0x00007ffff532de50 in dgemm_oncopy () from /opt/OpenBLAS/lib/libopenblas.so.0, it's not even Torch that is to blame, but the OpenBLAS installation. I would recommend reinstalling it and/or investigating how it was installed previously. It seems that a plain cp was involved.
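Reading the responsible library off frame #0 can be sketched in code as well. The pattern below is an assumption based only on the frame lines posted in this thread (frame number, optional address, function name, optional "from <library>"); other gdb output formats may need adjustments.

```python
import re

# "#N  [0xADDR in] function (args) [from /path/to/library]"
_FRAME = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(\S+)\s+\(.*?\)(?:\s+from\s+(\S+))?"
)


def parse_frames(backtrace_text):
    """Return a list of (frame_number, function, library_or_None) tuples."""
    frames = []
    for line in backtrace_text.splitlines():
        match = _FRAME.match(line.strip())
        if match:
            number, function, library = match.groups()
            frames.append((int(number), function, library))
    return frames


def faulting_library(backtrace_text):
    """Return the shared library named in frame #0, if gdb reported one."""
    for number, _function, library in parse_frames(backtrace_text):
        if number == 0:
            return library
    return None
```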

hadyelsahar commented on September 25, 2024

That makes sense now: OpenBLAS failed to detect my processor configuration automatically,
so I edited the Torch dependency download script according to what I was told on this issue.

I've also built and installed OpenBLAS on my machine manually, but that probably hasn't fixed it. Anyway, let's follow up there.

Many Thanks
Regards
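To see which CPU core a given OpenBLAS build selected, one option is to ask the library itself. The sketch below assumes the installed build exports openblas_get_corename (declared in OpenBLAS's cblas.h in recent releases; older builds may lack it), and the wrapper name is hypothetical. It returns None when the library or the symbol is not available.

```python
import ctypes


def openblas_corename(lib_path):
    """Return the core name OpenBLAS reports, or None if it cannot be queried."""
    try:
        lib = ctypes.CDLL(lib_path)
        func = lib.openblas_get_corename  # assumption: symbol exists in this build
    except (OSError, AttributeError):
        return None
    func.restype = ctypes.c_char_p
    func.argtypes = []
    name = func()
    return name.decode("ascii", "replace") if name else None


if __name__ == "__main__":
    # Library path taken from the stack trace earlier in this thread.
    print(openblas_corename("/opt/OpenBLAS/lib/libopenblas.so.0"))
```

If the reported core is newer than the actual processor, that would be consistent with the SIGILL seen above.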
