Comments (4)
@hadyelsahar Usually, SIGILL happens when a binary contains an instruction that the CPU does not support. The common scenario is compiling a binary on one (newer) computer, then copying it to another (older) computer and running it there.
In your case, my blind guess is that this computer does not support the AVX2 instruction set, while the computer used for compilation did.
To find out which module and which instruction raises the signal, I would recommend running it under gdb:
gdb --args th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1
run
Once it happens, please post the stack trace and disassembly here.
gdb commands:
stack trace: bt
disassembly of the current block: disas
The currently executing instruction will be marked with "=> ".
from char-rnn.
Thanks for your help. It seems the problem is with the vmovsd instruction.
The stack trace:
#0 0x00007ffff532de50 in dgemm_oncopy () from /opt/OpenBLAS/lib/libopenblas.so.0
#1 0x0000000000000041 in ?? ()
#2 0x0000000000000026 in ?? ()
#3 0x00007ffff51cd0c7 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#4 0x00007ffff52da20c in blas_thread_server () from /opt/OpenBLAS/lib/libopenblas.so.0
#5 0x00007ffff7474182 in start_thread (arg=0x7ffff1b05700) at pthread_create.c:312
#6 0x00007ffff6f8b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
The disassembly of the current block:
Dump of assembler code for function dgemm_oncopy:
0x00007ffff532de00 <+0>: push %r13
0x00007ffff532de02 <+2>: push %r12
0x00007ffff532de04 <+4>: lea 0x0(,%rcx,8),%rcx
0x00007ffff532de0c <+12>: mov %rsi,%r10
0x00007ffff532de0f <+15>: sar %r10
0x00007ffff532de12 <+18>: jle 0x7ffff532dfd0 <dgemm_oncopy+464>
0x00007ffff532de18 <+24>: nopl 0x0(%rax,%rax,1)
0x00007ffff532de20 <+32>: mov %rdx,%r11
0x00007ffff532de23 <+35>: lea (%rdx,%rcx,1),%r12
0x00007ffff532de27 <+39>: lea (%rdx,%rcx,2),%rdx
0x00007ffff532de2b <+43>: mov %rdi,%r9
0x00007ffff532de2e <+46>: sar $0x3,%r9
0x00007ffff532de32 <+50>: jle 0x7ffff532df10 <dgemm_oncopy+272>
0x00007ffff532de38 <+56>: nopl 0x0(%rax,%rax,1)
0x00007ffff532de40 <+64>: prefetchw 0x100(%r8)
0x00007ffff532de48 <+72>: prefetchnta 0x100(%r11)
=> 0x00007ffff532de50 <+80>: vmovsd (%r11),%xmm0
0x00007ffff532de55 <+85>: vmovsd 0x8(%r11),%xmm1
0x00007ffff532de5b <+91>: vmovsd 0x10(%r11),%xmm2
0x00007ffff532de61 <+97>: vmovsd 0x18(%r11),%xmm3
0x00007ffff532de67 <+103>: vmovsd 0x20(%r11),%xmm4
0x00007ffff532de6d <+109>: vmovsd 0x28(%r11),%xmm5
0x00007ffff532de73 <+115>: vmovsd 0x30(%r11),%xmm6
0x00007ffff532de79 <+121>: vmovsd 0x38(%r11),%xmm7
0x00007ffff532de7f <+127>: prefetchnta 0x100(%r12)
0x00007ffff532de88 <+136>: vmovhpd (%r12),%xmm0,%xmm0
0x00007ffff532de8e <+142>: vmovhpd 0x8(%r12),%xmm1,%xmm1
0x00007ffff532de95 <+149>: vmovhpd 0x10(%r12),%xmm2,%xmm2
0x00007ffff532de9c <+156>: vmovhpd 0x18(%r12),%xmm3,%xmm3
0x00007ffff532dea3 <+163>: vmovhpd 0x20(%r12),%xmm4,%xmm4
0x00007ffff532deaa <+170>: vmovhpd 0x28(%r12),%xmm5,%xmm5
0x00007ffff532deb1 <+177>: vmovhpd 0x30(%r12),%xmm6,%xmm6
0x00007ffff532deb8 <+184>: vmovhpd 0x38(%r12),%xmm7,%xmm7
0x00007ffff532debf <+191>: prefetchw 0x140(%r8)
0x00007ffff532dec7 <+199>: vmovups %xmm0,(%r8)
0x00007ffff532decc <+204>: vmovups %xmm1,0x10(%r8)
0x00007ffff532ded2 <+210>: vmovups %xmm2,0x20(%r8)
0x00007ffff532ded8 <+216>: vmovups %xmm3,0x30(%r8)
0x00007ffff532dede <+222>: vmovups %xmm4,0x40(%r8)
Just for reference, in case someone faces the same problem: the Torch executable ~/torch/bin/th
is a script, not a binary, so gdb
can't actually debug it directly.
file /torch/install/bin/th
th: POSIX shell script, ASCII text executable, with very long lines
To work around it, you'll need to run:
gdb64 /bin/bash # or plain gdb; check whether your gdb build targets i686 or x86_64
Then, from the gdb prompt, run:
run th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1
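If you prefer a one-shot, non-interactive session, gdb's batch mode can wrap the same bash invocation and print the backtrace and disassembly automatically when the process dies (a sketch; the train.lua flags mirror the command above, and the gdb options assume a reasonably recent GNU gdb):

```shell
# Run th under gdb via bash; after the crash, print the backtrace and
# the disassembly of the faulting function, then exit gdb.
gdb -batch \
    -ex run \
    -ex bt \
    -ex disassemble \
    --args /bin/bash th train.lua -data_dir data/tinyshakespeare/ \
    -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1
```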
PS: I think this issue is related more to Torch than to this repo, so feel free to ask me to move it there.
Good data, @hadyelsahar!
According to
#0 0x00007ffff532de50 in dgemm_oncopy () from /opt/OpenBLAS/lib/libopenblas.so.0
it's not even Torch that's to blame, but the installation of OpenBLAS. I would recommend reinstalling it and/or investigating how it was installed previously. It seems a plain cp was involved.
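For reference, the usual way to rebuild OpenBLAS so that it matches the machine it runs on is to let its build system auto-detect the CPU, or to force a conservative target. This is a sketch using OpenBLAS's documented make variables (DYNAMIC_ARCH builds kernels for many CPU generations and picks one at runtime; NEHALEM is a pre-AVX target):

```shell
# Rebuild OpenBLAS on the target machine itself (auto-detects the CPU):
make clean && make
# Or build one binary that works across CPU generations:
#   make DYNAMIC_ARCH=1
# Or force a pre-AVX target explicitly:
#   make TARGET=NEHALEM
make PREFIX=/opt/OpenBLAS install
```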
That makes sense now: OpenBLAS failed to detect my processor configuration automatically,
so I edited the Torch dependency download script according to what I was told on this issue.
Although I had built and installed OpenBLAS on my machine manually, that probably hasn't fixed it. Anyway, let's see there.
Many thanks,
Regards