Comments (26)

calid commented on July 3, 2024

@gregoa my guess is it's related to https://github.com/calid/zmq-ffi/blob/98a59950d6299ea6eb6268f38fa8b3df01667e43/t/proxy.t#L61-L67

looks like the workaround is not always adequate and it's time to understand the occasional hang...

from perlzmq.

gregoa commented on July 3, 2024

On Thu, 25 Aug 2016 08:35:37 -0700, Dylan Cali wrote:

> @gregoa my guess is it's related to https://github.com/calid/zmq-ffi/blob/master/t/proxy.t#L61-L67

Ack, I was also looking at these lines (and the discussions in some
of the other issues).

> looks like the workaround is not always adequate and it's time to understand the occasional hang...

That would be great :)
What makes it difficult for me is that I can't reproduce the hangs
...

Cheers,
gregor

.''. Homepage https://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06 : :' : Debian GNU/Linux user, admin, and developer - https://www.debian.org/ . ' Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe - NP: Elliott Sharp: Perps

calid commented on July 3, 2024

what if you comment out the do/while hack and just leave a single kill TERM => $proxy;? Does it hang for you then?

gregoa commented on July 3, 2024

On Thu, 25 Aug 2016 08:56:44 -0700, Dylan Cali wrote:

> what if you comment out the do/while hack and just leave a single kill TERM => $proxy;? Does it hang for you then?

No.
At least not initially; after I put my machine under heavy load, I
got it to hang in 1 of 5 tries.

When I run only t/proxy.t (still with the loop removed) in a loop, it
hangs at the 5th try. Or at the 1st. Or at the 2nd. (Again with
heavy load.)

Back to the original version, and running t/proxy.t again in a loop:
same result.

Ok, so there's no difference with and without the do/while loop, and
load seems relevant.
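
A minimal sketch of how such a loop could be scripted so that a hang is detected instead of waited on. It assumes GNU coreutils timeout is available; the function name count_hangs and the 60-second limit in the usage comment are illustrative and not part of the test suite.

```shell
# count_hangs TRIES LIMIT CMD...: run CMD up to TRIES times, each under a
# LIMIT-second timeout, and print how many runs had to be killed.
count_hangs() {
    tries=$1; limit=$2; shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$tries" ]; do
        i=$((i + 1))
        rc=0
        # GNU timeout exits with status 124 when it had to kill the command.
        timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
    done
    echo "$hangs"
}

# Example (assumed limit of 60 seconds per run):
#   count_hangs 20 60 prove --blib --verbose t/proxy.t
```

A non-zero result only says that a run exceeded the limit; it does not say which process was stuck.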

strace -f -v -y prove --blib --verbose t/proxy.t ends with (in the hanging case)

[pid  6191] +++ exited with 0 +++
[pid  6190] <... select resumed> )      = ? ERESTARTNOHAND (To be restarted if no handler)
[pid  6190] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6191, si_uid=0, si_status=0, si_utime=14, si_stime=2} ---
[pid  6190] select(8, [4<pipe:[18439686]> 6<pipe:[18439687]>], NULL, NULL, NULL

but my knowledge about IPC is limited ...

Cheers,
gregor

calid commented on July 3, 2024

Okay that's interesting, so the child is actually exiting okay and the sporadic hang is occurring in the parent... which explains why that workaround loop was irrelevant.

Can you try adding explicit context destroys and socket closes in the parent and see if it still hangs occasionally?

gregoa commented on July 3, 2024

On Thu, 25 Aug 2016 10:51:01 -0700, Dylan Cali wrote:

> Okay that's interesting, so the child is actually exiting okay and
> the sporadic hang is occurring in the parent... which explains why
> that workaround loop was irrelevant.

Sounds about right.

> Can you try adding explicit context destroys and socket closes in
> the parent and see if it still hangs occasionally?

I've now added

    # cleanup
    $server->close();
    $worker->close();
    $ctx->destroy();

at the end of the subtest, and

    # cleanup
    $front->close();
    $back->close();
    $ctx->destroy();

before the exit 0 in the child code, but alas, the test still hangs
after a low single-digit number of runs. But I might have
misinterpreted your request :)

I'm happy to do other tests if you give me a patch.

Cheers,
gregor

calid commented on July 3, 2024

Nope that's what I was looking for... wanted to make sure it wasn't cleanup occurring in the wrong order during global destruction. If it's still hanging with the explicit cleanup then it looks like that isn't the case and there's something else going on.

The ERESTARTNOHAND concerns me: userland code is apparently not supposed to get this error code, and it may imply something buggy in the kernel if it does:
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/0980.html

gregoa commented on July 3, 2024

On Thu, 25 Aug 2016 12:58:26 -0700, Dylan Cali wrote:

> Nope that's what I was looking for... wanted to make sure it wasn't
> cleanup occurring in the wrong order during global destruction. If
> it's still hanging with the explicit cleanup then it looks like
> that isn't the case and there's something else going on.

Ok, cool.

> The ERESTARTNOHAND concerns me, user land code is apparently not
> supposed to get this error code, and it may imply something buggy
> in the kernel if it does:
> http://lkml.iu.edu/hypermail/linux/kernel/0701.1/0980.html

The followup at
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/0997.html suggests
that this might be related to strace on x86*.
And I hope that a possible kernel issue from 2007 is fixed by now :)

But yeah, that's a bit mysterious.

Cheers,
gregor

calid commented on July 3, 2024

Normally I would be with you, but the follow up of the follow up says he was seeing it outside of strace:

http://lkml.iu.edu/hypermail/linux/kernel/0701.1/1054.html
and
http://lkml.iu.edu/hypermail/linux/kernel/0701.1/1018.html

That said, the ERESTARTNOHAND is feeling like a red herring to me (and in our case could very well be due to running under strace). To me this smells like EINTR (interrupted system call) not being handled correctly, with select being blindly re-called on an fd that has nothing to read.

The rabbit hole on this one may be deep, and it may not even be a direct problem in the Perl bindings... The next step would be to reproduce in gdb and get a stack trace of exactly where that select call is occurring.

calid commented on July 3, 2024

@gregoa what is your test setup? I'll spin up a vm this weekend and try to reproduce/resolve.

gregoa commented on July 3, 2024

On Fri, 26 Aug 2016 05:08:48 -0700, Dylan Cali wrote:

> @gregoa what is your test setup? I'll spin up a vm this weekend and try to reproduce/resolve.

This is Debian GNU/Linux unstable, and I'm building/testing in a
chroot (because we always build in minimal chroots) which also has a
minimal Debian unstable environment. This is perl 5.22.2, if it
matters, and a 4.6.4 Linux kernel.

The key for me seems to be to put the machine under load (by doing
silly calculations in a loop in parallel on all cores).
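
One way to generate that kind of load, as a rough sketch: one busy loop per core. The names start_load, stop_load and LOAD_PIDS are hypothetical, and nproc from GNU coreutils is assumed for the core count; this is not necessarily the exact load used here.

```shell
# start_load [N]: start N busy loops in the background (default: one per
# core, as reported by nproc) and remember their PIDs in LOAD_PIDS.
start_load() {
    n=${1:-$(nproc)}
    LOAD_PIDS=""
    i=0
    while [ "$i" -lt "$n" ]; do
        i=$((i + 1))
        # Detach stdout/stderr so the loops never hold a caller's pipe open.
        ( while :; do :; done ) >/dev/null 2>&1 &
        LOAD_PIDS="$LOAD_PIDS $!"
    done
}

# stop_load: terminate and reap the busy loops started by start_load.
stop_load() {
    # $LOAD_PIDS is left unquoted on purpose so it splits into separate PIDs.
    kill $LOAD_PIDS 2>/dev/null || true
    wait $LOAD_PIDS 2>/dev/null || true
    LOAD_PIDS=""
}
```

Typical use would be start_load, then the test loop, then stop_load in the same shell.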

Cheers,
gregor

calid commented on July 3, 2024

> The key for me seems to be to put the machine under load (by doing
> silly calculations in a loop in parallel on all cores).

Yeah, that to me is also indicative of signal-interrupted restart issues, as you'd be more likely to see system calls interrupted by signals under load.

calid commented on July 3, 2024

actually would you mind trying one last thing for me? can you move the kill line to the end of the subtest instead of the end of the script and see if it still hangs? so:

subtest 'proxy', sub {
    my $ctx = ZMQ::FFI->new();

    my $server = $ctx->socket(ZMQ_PUSH);
    $server->connect($server_address);

    my $worker = $ctx->socket(ZMQ_PULL);
    $worker->connect($worker_address);

    my $message = 'ohhai';
    $server->send($message);

    until ($worker->has_pollin) {

        # sleep for 100ms to compensate for the slow subscriber problem
        usleep 100_000;
    }

    my $payload = $worker->recv;
    is $payload, $message, "Message received";

    kill TERM => $proxy;
};

gregoa commented on July 3, 2024

On Fri, 26 Aug 2016 09:12:23 -0700, Dylan Cali wrote:

> actually would you mind trying one last thing for me? can you move
> the kill line to the end of the subtest instead of the end of the
> script and see if it still hangs? so:

Still the same hangs, sorry.

Cheers,
gregor

calid commented on July 3, 2024

sad. ok, I'll let you know if I make any headway this weekend.

calid commented on July 3, 2024

ok, first attempt to reproduce on my side was not successful. This wasn't a 1:1 attempt, I had an older Fedora 23 VM ready to go so I thought I would try there first. It was Perl 5.22.2, a 4.4.3 kernel, and libzmq 4.1.2.

I ran several instances of while :; do head -1 /dev/urandom | md5sum > /dev/null; done to put the VM under heavy load, then ran t/proxy.t in a loop, but it never hung. This is what the load looked like while I was looping:
[screenshot: system load, 2016-09-05 at 2:17:07 pm]

So I think that should be sufficient from a load perspective.

Just for kicks I also ran t/proxy.t in a loop on the OS X host as well, but no hang there either.

I'll spin up a true 1:1 Debian install next and get back to you. Hopefully running in a VM won't impact my ability to reproduce...

gregoa commented on July 3, 2024

Thanks for your effort! And sorry to hear you were not successful yet.
(FTR, I can still reproduce the hangs, even with lower loads than you tried.)

calid commented on July 3, 2024

@gregoa after installing the latest Debian sid/unstable to a VM I still can't get proxy.t to hang under load (or at all). However, my versions are a little different than yours. Can you upgrade to the latest unstable packages and try again?

Here are the relevant versions in my VM:

calid@debian:~$ cat /etc/debian_version 
stretch/sid
calid@debian:~$ perl -M'ZMQ::FFI::Util qw(zmq_version)' -E'say join ".", zmq_version'
4.1.5
calid@debian:~$ uname -r
4.7.0-1-amd64
calid@debian:~$ apt-cache showpkg libzmq-ffi-perl
Package: libzmq-ffi-perl
Versions: 
1.11-1

I've installed all perl packages and dependencies through apt to ensure I'm not using anything outside the distribution

gregoa commented on July 3, 2024

Thanks for trying again!
By now I have the same versions you mention, and I can still get t/proxy.t to hang.
One difference is that I didn't have libzmq-ffi-perl installed because I want to build it :)
But it doesn't change anything: even if I install the package, the test hangs after some iterations, regardless of whether I run prove in the source tree or copy t/ somewhere else and run it there.

Hm. Maybe @ntyni has an idea, he's our expert for the difficult cases :)

Thanks again,
gregor

calid commented on July 3, 2024

Assuming I were able to reproduce the hang these are the next steps I would take:

  1. Build a debug version of Perl 5.22.2
  2. Build a debug version of libzmq 4.1.5
  3. Run the test using the debug versions of Perl/libzmq and get it to hang
  4. Attach to the process with gdb and see exactly where it's hanging

I am hoping based on the stack trace in gdb we can figure out exactly what is causing the problem.

If you have time/interest would you like to try the steps above and post the stack trace here? If not I will have to find another machine somewhere and see if I can reproduce.. I am still wondering if running in a VM may be the reason it's not hanging for me.

gregoa commented on July 3, 2024

On Fri, 09 Sep 2016 06:32:34 -0700, Dylan Cali wrote:

>   1. Build a debug version of Perl 5.22.2
>   2. Build a debug version of libzmq 4.1.5
>   3. Run the test and get it to hang
>   4. Attach to it with gdb and see exactly where it's hanging

I think you can replace the first 2 steps by installing the
respective debug packages in Debian:

  • libzmq5-dbg for libzmq, and
  • perl-debug for perl

> If you have time/interest would you like to try the steps above and
> post the stack trace here? If not I will have to find another
> machine somewhere and see if I can reproduce.

I can take a look but I'm very bad with gdb; I'll give it a try
later.

Cheers,
gregor

gregoa commented on July 3, 2024

Ok, here we go. I installed the debug packages, waited until the hang and fired up gdb. I'm just a bit clueless about what to do then :)

# gdb /usr/bin/perl 26749
GNU gdb (Debian 7.11.1-2) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/perl...Reading symbols from /usr/lib/debug//usr/bin/perl...done.
done.
Attaching to program: /usr/bin/perl, process 26749
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/Cwd/Cwd.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/List/Util/Util.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/Time/HiRes/HiRes.so...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/File/Glob/Glob.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/IO/IO.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/Fcntl/Fcntl.so...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/perl/5.22/auto/POSIX/POSIX.so...(no debugging symbols found)...done.
(gdb) bt full
#0  0x00007f7c69ce0233 in select () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x0000000000506f3f in Perl_pp_sselect (my_perl=0x108b010) at pp_sys.c:1225
        sp = 0x10908f0
        targ = 0x1a6f9c8
        i = <optimized out>
        j = <optimized out>
        s = <optimized out>
        sv = <optimized out>
        value = <optimized out>
        maxlen = 1
        nfound = <optimized out>
        timebuf = {tv_sec = 27720088, tv_usec = 17369344}
        tbuf = <optimized out>
        growsize = 128
        fd_sets = {0x1a87718 " \240\235\001", 0x194ac40 "P", 0x0, 0x0}
#2  0x00000000004b7376 in Perl_runops_standard (my_perl=0x108b010) at run.c:41
        op = <optimized out>
#3  0x0000000000444079 in S_run_body (oldscope=1, my_perl=0x108b010) at perl.c:2458
No locals.
#4  perl_run (my_perl=0x108b010) at perl.c:2381
        oldscope = 1
        ret = <optimized out>
        cur_env = {je_prev = 0x108b3b0, je_buf = {{__jmpbuf = {0, -7000060815032819031, 4311824, 140735615558352, 0, 0, 7000258287070765737,
                -7000060689657470295}, __mask_was_saved = 0, __saved_mask = {__val = {3329619642332573743, 825111090, 0, 0, 7165064483209180463,
                  8660248813382888545, 7596496373740942904, 3419760881481315694, 3615886028624258416, 3223090, 0, 0, 17346576, 1415641803309721344,
                  4311824, 0}}}}, je_ret = 0, je_mustcatch = false}
#5  0x000000000041cafb in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:116
        exitstatus = <optimized out>
        i = <optimized out>
0x00007f7c69ce0233 in select () from /lib/x86_64-linux-gnu/libc.so.6

I'm happy to check anything else if you tell me what to look for.

calid commented on July 3, 2024

ok cool, a few things to try:

First, if you use the thread command in gdb you can see the other threads running and switch to them using thread <thread_number>. Can you post the back traces from the other threads as well? I'm especially interested to see what libzmq is doing during these hangs.
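
As a sketch, the backtraces of all threads can also be collected in one non-interactive run. This relies on gdb's -p, -batch and -ex options together with the thread apply all command; the wrapper name dump_stacks is made up for illustration, and the PID in the example is the one from the gdb session earlier in this thread.

```shell
# dump_stacks PID: attach gdb to a running process, print a full backtrace
# for every thread, then detach again.
dump_stacks() {
    if [ "$#" -ne 1 ]; then
        echo "usage: dump_stacks PID" >&2
        return 2
    fi
    # -batch makes gdb exit after running the -ex commands, which detaches
    # from the process; "thread apply all bt full" runs "bt full" per thread.
    gdb -p "$1" -batch -ex 'thread apply all bt full'
}

# Example: dump_stacks 26749 > threads.txt
```

Attaching usually needs the same user as the target process (or root), depending on the system's ptrace settings.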

Second, it would be nice to determine exactly where in ZMQ::FFI we're hanging. There's some fancy gdb fu that will allow you to print the actual line in Perl, but the easiest way to accomplish this is probably to just sprinkle some print statements in various places :)

Hanging on a perl select call is interesting/surprising, as there isn't any non-blocking code going on (i.e. we're not using AnyEvent to poll a zmq fd or somesuch), so I'm unsure what that select call is actually hung waiting on. Hopefully if we can figure out the exact perl line we're hanging at it will become clear...

ntyni commented on July 3, 2024

I think the process in Perl_pp_sselect() is the prove one. Somewhat interestingly, while I can easily reproduce the hang on Debian unstable when running t/proxy.t with prove (prove -b -v t/proxy.t), it never happens for me when running it directly (perl -Iblib/lib -Iblib/arch t/proxy.t)

There are three threads, two of which have a very similar backtrace (in zmq::epoll_t::loop()). I'm attaching those: thread1.txt, thread2.txt, thread3.txt.

Perl thinks it's on

(gdb) p my_perl->Icurcop.cop_file
$6 = 0x2347e90 "/home/niko/tmp/libzmq-ffi-perl-1.11/blib/lib/ZMQ/FFI/ZMQ3/Context.pm"
(gdb) p my_perl->Icurcop.cop_line
$7 = 138

which is the $self->check_error() call in ZMQ::FFI::ZMQ3::Context::proxy().

Hope this helps a bit.

gregoa commented on July 3, 2024

Thanks for the further analysis with gdb, @ntyni !
And I can confirm that I don't get a hang with perl -Iblib/lib -Iblib/arch t/proxy.t. Weird.

Also interesting is that I have 60 of those processes right now (although I'm running it in a simple single while loop). Hm?! 67 now. Ctrl-C. Still 65 later. Still 65 after leaving the chroot.

calid commented on July 3, 2024

> Somewhat interestingly, while I can easily reproduce the hang on Debian unstable when running t/proxy.t with prove (prove -b -v t/proxy.t), it never happens for me when running it directly (perl -Iblib/lib -Iblib/arch t/proxy.t)

Bingo, that was the difference. I was invoking the test using perl directly. Using prove I'm able to reproduce the hang (but per @gregoa's experience only after simulating load).

I'll dig into the stack traces and get back to you. Thanks for the help @ntyni
