Comments (26)
@gregoa my guess is it's related to https://github.com/calid/zmq-ffi/blob/98a59950d6299ea6eb6268f38fa8b3df01667e43/t/proxy.t#L61-L67
looks like the workaround is not always adequate and it's time to understand the occasional hang...
from perlzmq.
On Thu, 25 Aug 2016 08:35:37 -0700, Dylan Cali wrote:
> @gregoa my guess is it's related to https://github.com/calid/zmq-ffi/blob/master/t/proxy.t#L61-L67
Ack, I was also looking at these lines (and the discussions in some
of the other issues).
> looks like the workaround is not always adequate and it's time to understand the occasional hang...
That would be great :)
What makes it difficult for me is that I can't reproduce the hangs
...
Cheers,
gregor
.''. Homepage https://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06 : :' : Debian GNU/Linux user, admin, and developer - https://www.debian.org/
. ' Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
- NP: Elliott Sharp: Perps
from perlzmq.
what if you comment out the do/while hack and just leave a single kill TERM => $proxy; call? Does it hang for you then?
from perlzmq.
On Thu, 25 Aug 2016 08:56:44 -0700, Dylan Cali wrote:
> what if you comment out the do/while hack and just leave a single
> kill TERM => $proxy; call? Does it hang for you then?
No.
At least not initially; after I put my machine under heavy load, I
got it to hang in 1 of 5 tries.
When I run only t/proxy.t (still with the loop removed) in a loop, it
hangs at the 5th try. Or at the 1st. Or at the second. (Again with
heavy load.)
Back to the original version, and running t/proxy.t again in a loop:
same result.
Ok, so there's no difference with and without the do/while loop, and
load seems relevant.
strace -f -v -y prove --blib --verbose t/proxy.t
ends with (in the hanging case)
[pid 6191] +++ exited with 0 +++
[pid 6190] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 6190] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6191, si_uid=0, si_status=0, si_utime=14, si_stime=2} ---
[pid 6190] select(8, [4<pipe:[18439686]> 6<pipe:[18439687]>], NULL, NULL, NULL
but my knowledge about IPC is limited ...
Cheers,
gregor
from perlzmq.
Okay that's interesting, so the child is actually exiting okay and the sporadic hang is occurring in the parent... which explains why that workaround loop was irrelevant.
Can you try adding explicit context destroys and socket closes in the parent and see if it still hangs occasionally?
from perlzmq.
On Thu, 25 Aug 2016 10:51:01 -0700, Dylan Cali wrote:
> Okay that's interesting, so the child is actually exiting okay and
> the sporadic hang is occurring in the parent... which explains why
> that workaround loop was irrelevant.
Sounds about right.
> Can you try adding explicit context destroys and socket closes in
> the parent and see if it still hangs occasionally?
I've now added
    # cleanup
    $server->close();
    $worker->close();
    $ctx->destroy();
at the end of the subtest, and
    # cleanup
    $front->close();
    $back->close();
    $ctx->destroy();
before the exit 0 in the child code, but alas, the test still hangs
after a low single-digit number of incantations. But I might have
misinterpreted your request :)
I'm happy to do other tests if you give me a patch.
Cheers,
gregor
from perlzmq.
Nope that's what I was looking for... wanted to make sure it wasn't cleanup occurring in the wrong order during global destruction. If it's still hanging with the explicit cleanup then it looks like that isn't the case and there's something else going on.
The ERESTARTNOHAND concerns me; userland code is apparently not supposed to get this error code, and it may imply something buggy in the kernel if it does:
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/0980.html
from perlzmq.
On Thu, 25 Aug 2016 12:58:26 -0700, Dylan Cali wrote:
> Nope that's what I was looking for... wanted to make sure it wasn't
> cleanup occurring in the wrong order during global destruction. If
> it's still hanging with the explicit cleanup then it looks like
> that isn't the case and there's something else going on.
Ok, cool.
> The ERESTARTNOHAND concerns me, user land code is apparently not
> supposed to get this error code, and it may imply something buggy
> in the kernel if it does:
> http://lkml.iu.edu/hypermail/linux/kernel/0701.1/0980.html
The followup at
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/0997.html suggests
that this might be related to strace on x86*.
And I hope that a possible kernel issue from 2007 is fixed by now :)
But yeah, that's a bit mysterious.
Cheers,
gregor
from perlzmq.
Normally I would be with you, but the follow up of the follow up says he was seeing it outside of strace:
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/1054.html
and
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/1018.html
That said the ERESTARTNOHAND is feeling like a red herring to me (and in our case could very well be due to running under strace). To me this is smelling like it may be EINTR (interrupted system call) not being handled correctly, and select being blindly re-called on an fd that has nothing to read.
The rabbit hole on this one may be deep, and it may not even be a direct problem in the Perl bindings... next step would be to reproduce in gdb and get a stack trace of exactly where that select call is occurring.
from perlzmq.
@gregoa what is your test setup? I'll spin up a vm this weekend and try to reproduce/resolve.
from perlzmq.
On Fri, 26 Aug 2016 05:08:48 -0700, Dylan Cali wrote:
> @gregoa what is your test setup? I'll spin up a vm this weekend and try to reproduce/resolve.
This is Debian GNU/Linux unstable, and I'm building/testing in a
chroot (because we always build in minimal chroots) which also has a
minimal Debian unstable environment. This is perl 5.22.2, if it
matters, and a 4.6.4 Linux kernel.
The key for me seems to be to put the machine under load (by doing
silly calculations in a loop in parallel on all cores).
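As a rough illustration of this kind of load, the sketch below starts a number of CPU-bound busy loops in the background and stops them again. It is not the script actually used in this thread; the names start_load, stop_load and LOAD_PIDS are made up for the example.

```shell
# Sketch: start N CPU-bound busy loops in the background and stop them later.
# start_load, stop_load and LOAD_PIDS are illustrative names, not from the thread.
LOAD_PIDS=""

start_load() {
    load_n=$1
    load_i=0
    while [ "$load_i" -lt "$load_n" ]; do
        # Each subshell spins forever; its output is discarded.
        ( while :; do :; done ) >/dev/null 2>&1 &
        LOAD_PIDS="$LOAD_PIDS $!"
        load_i=$((load_i + 1))
    done
}

stop_load() {
    for load_pid in $LOAD_PIDS; do
        # Terminate the busy loop and reap it; ignore already-gone processes.
        kill "$load_pid" 2>/dev/null || true
        wait "$load_pid" 2>/dev/null || true
    done
    LOAD_PIDS=""
}
```

A typical use would be start_load "$(nproc)" before looping the test and stop_load afterwards, assuming coreutils nproc is available to report the number of cores.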
Cheers,
gregor
from perlzmq.
> The key for me seems to be to put the machine under load (by doing
> silly calculations in a loop in parallel on all cores).
Yeah, that to me is also indicative of signal-interrupted restart issues, as you'd be more likely to see system calls interrupted by signals under load.
from perlzmq.
actually would you mind trying one last thing for me? can you move the kill line to the end of the subtest instead of the end of the script and see if it still hangs? so:
subtest 'proxy', sub {
    my $ctx = ZMQ::FFI->new();

    my $server = $ctx->socket(ZMQ_PUSH);
    $server->connect($server_address);

    my $worker = $ctx->socket(ZMQ_PULL);
    $worker->connect($worker_address);

    my $message = 'ohhai';
    $server->send($message);

    until ($worker->has_pollin) {
        # sleep for 100ms to compensate for slow subscriber problem
        usleep 100_000;
    }

    my $payload = $worker->recv;
    is $payload, $message, "Message received";

    kill TERM => $proxy;
};
from perlzmq.
On Fri, 26 Aug 2016 09:12:23 -0700, Dylan Cali wrote:
> actually would you mind trying one last thing for me? can you move
> the kill line to the end of the subtest instead of the end of the
> script and see if it still hangs? so:
Still the same hangs, sorry.
Cheers,
gregor
from perlzmq.
sad. ok, I'll let you know if I make any headway this weekend.
from perlzmq.
ok, first attempt to reproduce on my side was not successful. This wasn't a 1:1 attempt, I had an older Fedora 23 VM ready to go so I thought I would try there first. It was Perl 5.22.2, a 4.4.3 kernel, and libzmq 4.1.2.
I ran multiple while :; do head -1 /dev/urandom | md5sum > /dev/null; done loops to put the VM under heavy load, then ran t/proxy.t in a loop, but it never hung. This is what the load looked like while I was looping:
[image: system load during the loop]
So I think that should be sufficient from a load perspective.
Just for kicks I also ran t/proxy.t in a loop on the OS X host, but no hang there either.
I'll spin up a true 1:1 Debian install next and get back to you. Hopefully running in a VM won't impact my ability to reproduce...
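Running the test in a loop and waiting for a hang can be automated. The sketch below reruns a command until one run exceeds a time limit. It assumes coreutils timeout is installed, and run_until_hang is a hypothetical helper name rather than anything used in this thread.

```shell
# Sketch: rerun a command until one run exceeds a time limit.
# run_until_hang is a hypothetical helper name.
# Usage: run_until_hang MAX_RUNS TIMEOUT_SECONDS COMMAND [ARGS...]
run_until_hang() {
    ruh_max=$1
    ruh_limit=$2
    shift 2
    ruh_i=1
    while [ "$ruh_i" -le "$ruh_max" ]; do
        ruh_status=0
        # coreutils timeout exits with status 124 if the command is still
        # running when the limit expires. Any other failure is not
        # treated as a hang here.
        timeout "$ruh_limit" "$@" >/dev/null 2>&1 || ruh_status=$?
        if [ "$ruh_status" -eq 124 ]; then
            echo "hang on run $ruh_i"
            return 1
        fi
        ruh_i=$((ruh_i + 1))
    done
    echo "no hang in $ruh_max runs"
    return 0
}
```

For example, run_until_hang 50 60 prove --blib --verbose t/proxy.t would stop at the first run that takes longer than 60 seconds. The limits 50 and 60 are arbitrary.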
from perlzmq.
Thanks for your effort! And sorry to hear you were not successful yet.
(FTR, I can still reproduce the hangs, even with lower loads than you tried.)
from perlzmq.
@gregoa after installing the latest Debian sid/unstable to a VM I still can't get proxy.t to hang under load (or at all). However my versions are a little different than yours. Can you upgrade to the latest unstable packages and try again?
Here are the relevant versions in my VM:
calid@debian:~$ cat /etc/debian_version
stretch/sid
calid@debian:~$ perl -M'ZMQ::FFI::Util qw(zmq_version)' -E'say join ".", zmq_version'
4.1.5
calid@debian:~$ uname -r
4.7.0-1-amd64
calid@debian:~$ apt-cache showpkg libzmq-ffi-perl
Package: libzmq-ffi-perl
Versions:
1.11-1
I've installed all perl packages and dependencies through apt to ensure I'm not using anything outside the distribution
from perlzmq.
Thanks for trying again!
The versions you mention are the same ones I have as well by now. And I can still get t/proxy.t to hang.
One difference is that I didn't have libzmq-ffi-perl installed because I want to build it :)
But it doesn't change anything: even if I install the package, the test hangs after some iterations, regardless of whether I run prove in the source tree or copy t/ somewhere else and run it there.
Hm. Maybe @ntyni has an idea, he's our expert for the difficult cases :)
Thanks again,
gregor
from perlzmq.
Assuming I were able to reproduce the hang these are the next steps I would take:
- Build a debug version of Perl 5.22.2
- Build a debug version of libzmq 4.1.5
- Run the test using the debug versions of Perl/libzmq and get it to hang
- Attach to the process with gdb and see exactly where it's hanging
I am hoping based on the stack trace in gdb we can figure out exactly what is causing the problem.
If you have time/interest would you like to try the steps above and post the stack trace here? If not I will have to find another machine somewhere and see if I can reproduce. I am still wondering if running in a VM may be the reason it's not hanging for me.
from perlzmq.
On Fri, 09 Sep 2016 06:32:34 -0700, Dylan Cali wrote:
> - Build a debug version of Perl 5.22.2
> - Build a debug version of libzmq 4.1.5
> - Run the test and get it to hang
> - Attach to it with gdb and see exactly where it's hanging
I think you can replace the first 2 steps by installing the
respective debug packages in Debian:
- libzmq5-dbg for libzmq, and
- perl-debug for perl
> If you have time/interest would you like to try the steps above and
> post the stack trace here? If not I will have to find another
> machine somewhere and see if I can reproduce.
I can take a look but I'm very bad with gdb; I'll give it a try
later.
Cheers,
gregor
from perlzmq.
Ok, here we go. I installed the debug packages, waited until the hang and fired up gdb. I'm just a bit clueless about what to do then :)
# gdb /usr/bin/perl 26749
GNU gdb (Debian 7.11.1-2) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/perl...Reading symbols from /usr/lib/debug//usr/bin/perl...done.
done.
Attaching to program: /usr/bin/perl, process 26749
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/Cwd/Cwd.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/List/Util/Util.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/Time/HiRes/HiRes.so...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/File/Glob/Glob.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/IO/IO.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/Fcntl/Fcntl.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/POSIX/POSIX.so...(no debugging symbols found)...done.
(gdb) bt full
#0 0x00007f7c69ce0233 in select () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x0000000000506f3f in Perl_pp_sselect (my_perl=0x108b010) at pp_sys.c:1225
sp = 0x10908f0
targ = 0x1a6f9c8
i = <optimized out>
j = <optimized out>
s = <optimized out>
sv = <optimized out>
value = <optimized out>
maxlen = 1
nfound = <optimized out>
timebuf = {tv_sec = 27720088, tv_usec = 17369344}
tbuf = <optimized out>
growsize = 128
fd_sets = {0x1a87718 " \240\235\001", 0x194ac40 "P", 0x0, 0x0}
#2 0x00000000004b7376 in Perl_runops_standard (my_perl=0x108b010) at run.c:41
op = <optimized out>
#3 0x0000000000444079 in S_run_body (oldscope=1, my_perl=0x108b010) at perl.c:2458
No locals.
#4 perl_run (my_perl=0x108b010) at perl.c:2381
oldscope = 1
ret = <optimized out>
cur_env = {je_prev = 0x108b3b0, je_buf = {{__jmpbuf = {0, -7000060815032819031, 4311824, 140735615558352, 0, 0, 7000258287070765737,
-7000060689657470295}, __mask_was_saved = 0, __saved_mask = {__val = {3329619642332573743, 825111090, 0, 0, 7165064483209180463,
8660248813382888545, 7596496373740942904, 3419760881481315694, 3615886028624258416, 3223090, 0, 0, 17346576, 1415641803309721344,
4311824, 0}}}}, je_ret = 0, je_mustcatch = false}
#5 0x000000000041cafb in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:116
exitstatus = <optimized out>
i = <optimized out>
0x00007f7c69ce0233 in select () from /lib/x86_64-linux-gnu/libc.so.6
I'm happy to check anything else if you tell me what to look for.
from perlzmq.
ok cool, a few things to try:
First, if you use the thread command in gdb you can see the other threads running and switch to them using thread <thread_number>. Can you post the back traces from the other threads as well? I'm especially interested to see what libzmq is doing during these hangs.
Second, it would be nice to determine exactly where in ZMQ::FFI we're hanging. There's some fancy gdb fu that will allow you to print the actual line in Perl, but the easiest way to accomplish this is probably to just sprinkle some print statements in various places :)
Hanging on a perl select call is interesting/surprising, as there isn't any non-blocking code going on (i.e. we're not using AnyEvent to poll a zmq fd or somesuch), so I'm unsure what that select call is actually hung waiting on. Hopefully if we can figure out the exact perl line we're hanging at it will become clear...
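Collecting the back traces of all threads does not require stepping through them interactively. As a sketch, gdb's batch mode can attach to the hung process, print every thread's back trace and detach again. dump_backtraces is a made-up helper name, and the PID has to be looked up separately (for example with ps).

```shell
# Sketch: dump back traces for every thread of a running (possibly hung) process.
# dump_backtraces is an illustrative helper name.
# Usage: dump_backtraces PID
dump_backtraces() {
    bt_pid=$1
    # -p PID : attach to the running process
    # -batch : run the given commands, then exit (which detaches)
    # -ex CMD: execute one gdb command
    gdb -p "$bt_pid" -batch \
        -ex 'info threads' \
        -ex 'thread apply all bt full'
}
```

Redirecting the output to a file, as in dump_backtraces 26749 > backtraces.txt, gives something that can be attached to the issue. Attaching to another process usually requires running as the same user or as root.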
from perlzmq.
I think the process in Perl_pp_sselect() is the prove one. Somewhat interestingly, while I can easily reproduce the hang on Debian unstable when running t/proxy.t with prove (prove -b -v t/proxy.t), it never happens for me when running it directly (perl -Iblib/lib -Iblib/arch t/proxy.t)
There are three threads, two of which have a very similar backtrace (in zmq::epoll_t::loop()). I'm attaching those:
thread1.txt
thread2.txt
thread3.txt
Perl thinks it's on
(gdb) p my_perl->Icurcop.cop_file
$6 = 0x2347e90 "/home/niko/tmp/libzmq-ffi-perl-1.11/blib/lib/ZMQ/FFI/ZMQ3/Context.pm"
(gdb) p my_perl->Icurcop.cop_line
$7 = 138
which is the $self->check_error() call in ZMQ::FFI::ZMQ3::Context::proxy().
Hope this helps a bit.
from perlzmq.
Thanks for the further analysis with gdb, @ntyni !
And I can confirm that I don't get a hang with perl -Iblib/lib -Iblib/arch t/proxy.t. Weird.
Also interesting is that I have 60 of those processes right now (although I'm running it in a simple single while loop). Hm?! 67 now. Ctrl-C. Still 65 later. Still 65 after leaving the chroot.
from perlzmq.
> Somewhat interestingly, while I can easily reproduce the hang on Debian unstable when running t/proxy.t with prove (prove -b -v t/proxy.t), it never happens for me when running it directly (perl -Iblib/lib -Iblib/arch t/proxy.t)
Bingo, that was the difference. I was invoking the test using perl directly. Using prove I'm able to reproduce the hang (but per @gregoa's experience only after simulating load).
I'll dig into the stack traces and get back to you. Thanks for the help @ntyni
from perlzmq.