Currently, there's no non-blocking way to test if a child process succeeded or not.


Need non-blocking way to know if child process failed (hyln9/ikarus)

Comments (14)

hyln9 commented on June 21, 2024

I agree this needs to be fixed. The solution that you posted has a race condition: by the time you do (waitpid pid #f), the child may not have had a chance to exec, fail, and exit. I'll have to check Stevens for the best way to do this.

Launchpad Details: #LPC Abdulaziz Ghuloum - 2008-01-25 01:08:23 -0500

from ikarus.

hyln9 commented on June 21, 2024

Question: if process returned nonblocking ports, or if we had a process-nonblocking to go with the rest of the nonblocking procedures, would that kind of solve this problem? (I still haven't found a solution)

Launchpad Details: #LPC Abdulaziz Ghuloum - 2008-03-24 00:10:18 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Mon, 2008-03-24 at 04:10 +0000, Abdulaziz Ghuloum wrote:

> Question: if process returned nonblocking ports, or, if we have
> process-nonblocking to go with the rest of the nonblocking procedures,
> would that kind of solve this problem? (I still haven't found a solution)

But if you exec a process with which you do not attempt any I/O (i.e.,
because the program isn't designed to do any), there's still no way to
know if it succeeded or failed.

I'll start trying to learn what other run-time systems have done about
this.

Launchpad Details: #LPC Derick Eddington - 2008-03-31 01:43:07 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Mar 31, 2008, at 1:43 AM, Derick Eddington wrote:

> But if you exec a process with which you do not attempt any I/O (i.e.,
> because the program isn't designed to do any), there's still no way to
> know if it succeeded or failed.

Well, if the program is designed not to do any IO, then reading from the
output-port and error-port that process returns should both return #!eof:

    (let-values ([(pid to-p from-p err-p) (process "true")])
      (list (get-u8 from-p) (get-u8 err-p)))
    => (#!eof #!eof)

[I know you brought this up just as an example. The problem is still
there.]

> I'll start trying to learn what other run-time systems have done about
> this.

The issue here is that I'm doing "fork", which succeeds, then I do "exec"
within the child process, which may fail. At any point after the fork,
the child and parent processes are running independently, so the parent
cannot ask if the exec failed, because the child may not have gotten a
chance to exec and fail yet!
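To make the race concrete, here is a minimal C sketch of the fork-then-exec pattern described above, assuming a POSIX system. It is not Ikarus's implementation: `spawn` and `poll_child` are hypothetical names, and exiting with 127 after a failed exec is just the shell convention. A non-blocking `waitpid` (WNOHANG) right after `spawn` can report "not exited yet" even for a program that does not exist, because the child may not have reached `execvp` yet.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, then exec `file` in the child.
   A successful fork() only means a child exists; whether execvp()
   succeeds is decided later, inside the child, on its own schedule. */
pid_t spawn(const char *file, char *const argv[]) {
    assert(file != NULL && argv != NULL);
    pid_t pid = fork();
    if (pid == 0) {
        execvp(file, argv);
        /* Reached only if exec failed; 127 is the shell convention. */
        _exit(127);
    }
    return pid; /* -1 if fork itself failed */
}

/* Hypothetical non-blocking check (waitpid with WNOHANG).
   Returns 0 if the child has not exited yet, 1 if it has (wait status
   stored in *status), and -1 on error. A 0 right after spawn() says
   nothing about whether the exec is going to fail. */
int poll_child(pid_t pid, int *status) {
    pid_t r = waitpid(pid, status, WNOHANG);
    if (r == 0) return 0;
    return r == pid ? 1 : -1;
}
```

For example, `spawn` on a nonexistent program still returns a valid PID, and an immediate `poll_child` on it may well return 0; only a later (or blocking) `waitpid` reports exit status 127.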


Please do let me know if you find a solution (or a system that has found
a solution).

Launchpad Details: #LPC Abdulaziz Ghuloum - 2008-03-31 05:53:21 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Mon, 2008-03-31 at 09:53 +0000, Abdulaziz Ghuloum wrote:

> On Mar 31, 2008, at 1:43 AM, Derick Eddington wrote:
>
>> But if you exec a process with which you do not attempt any I/O (i.e.,
>> because the program isn't designed to do any), there's still no way to
>> know if it succeeded or failed.
>
> Well, if the program is designed not to do any IO, then reading from the
> output-port and error-port that process returns should both return #!eof

Maybe the parent Scheme program is a shell / general launcher and
doesn't know what the child will do.

> Please do let me know if you find a solution (or a system that has
> found a solution).

Will do, but it might be a while. This must have been solved more than
once before, you'd think, no? I remember hearing a brief description of
this specific problem years ago. It does seem like POSIX does not give
one a reliable, non-blocking, race-free way to know, unless you listen
for SIGCHLD with the POSIX signal-handler facility, which I've been
guessing is not compatible with / desirable for Ikarus's architecture.
Is that the case? Should we exclude solutions using a SIGCHLD signal
handler?

Launchpad Details: #LPC Derick Eddington - 2008-03-31 23:48:15 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Mar 31, 2008, at 11:48 PM, Derick Eddington wrote:

> This must have been solved more than once before you'd think, no?

I would think.

> I remember hearing a brief description of
> this specific problem years ago. It does seem like POSIX does not give
> one a reliable non-blocking non-race-problem way to know, unless you
> listen for SIGCHLD with the POSIX signal handlers facility, which I've
> been guessing is not compatible / desireable with Ikarus's architecture.
> Is that the case? Should we exclude solutions using a SIGCHLD signal
> handler?

Not necessarily. If this is the only way to solve it, then, we have
to solve it. But I don't see (off the top of my head) how catching
SIGCHLD would solve it. Usually you catch SIGCHLD so that it
collects the dead children if you don't want to wait on them
yourself. I don't see how it can be used to "test if process
failed". Maybe my way of doing process (fork then exec) is not the
right way but I don't know what is.

Launchpad Details: #LPC Abdulaziz Ghuloum - 2008-04-01 01:58:31 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Tue, 2008-04-01 at 05:58 +0000, Abdulaziz Ghuloum wrote:

> Not necessarily. If this is the only way to solve it, then, we have
> to solve it. But I don't see (off the top of my head) how catching
> SIGCHLD would solve it. Usually you catch SIGCHLD so that it
> collects the dead children if you don't want to wait on them
> yourself.

Right. My memory is vague about SIGCHLD. I meant more generally, I
wonder if POSIX signals aren't the only way to test if the process
failed? There might be another signal or there might be an additional
code that goes along with SIGCHLD. I just Googled "fork exec child race
condition" which seems to yield a number of promising links I'll be
reading (maybe slowly).

BTW, I think a process-nonblocking which returns nonblocking ports would
be useful anyway. One of my longer-term programming goals is to use/make
actor-like systems with event loops in different processes communicating
with each other, and async I/O is needed/good for that.

Launchpad Details: #LPC Derick Eddington - 2008-04-01 02:33:10 -0400

from ikarus.

hyln9 commented on June 21, 2024

It looks like the only solution is to use a SIGCHLD signal handler. Not
to "test" (sorry) but to be notified when a specific process has died.
An idea: register a procedure for a child and have that procedure called
when the SIGCHLD telling of that child's death is delivered to the
signal handler (you'd use SA_SIGINFO with sa_sigaction to install a
signal handler that would be given a siginfo_t telling what child PID
died); without screwing up ikarus's stack or run-time of course. Using
an alternate signal stack (via sigaltstack and SA_ONSTACK) might be
noteworthy. Ah, it looks like ikarus already does use sa_sigaction and
an alternate stack for SIGINT, but the handler doesn't call back into
Scheme.
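A minimal C sketch of the handler installation described above (SA_SIGINFO with sa_sigaction, plus sigaltstack and SA_ONSTACK), assuming a POSIX system. The names `on_sigchld`, `last_dead_pid`, and `install_sigchld_handler` are hypothetical; the handler only records `si_pid` and does not call back into Scheme.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: PID from the most recent SIGCHLD, written only by the handler. */
static volatile sig_atomic_t last_dead_pid = 0;

/* SA_SIGINFO-style handler: info->si_pid names the child whose state
   changed. It only stores a number; it must not call back into Scheme. */
static void on_sigchld(int signo, siginfo_t *info, void *uap) {
    (void)signo;
    (void)uap;
    last_dead_pid = (sig_atomic_t)info->si_pid;
}

/* Hypothetical installer: run on_sigchld on an alternate signal stack
   (sigaltstack + SA_ONSTACK). Returns 0 on success, -1 on error. */
int install_sigchld_handler(void) {
    stack_t ss;
    ss.ss_sp = malloc(SIGSTKSZ);
    if (ss.ss_sp == NULL) return -1;
    ss.ss_size = SIGSTKSZ;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigchld;
    sigemptyset(&sa.sa_mask);
    /* SA_NOCLDSTOP: no SIGCHLD when children stop or continue.
       No SA_RESTART, so interrupted system calls fail with EINTR. */
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

Because ordinary signals do not queue, two children dying close together can produce a single SIGCHLD, so `si_pid` alone is not a reliable record of every death; a real implementation would still have to reap with `waitpid`.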

Would this also be possible: If the callback procedure returns, the
continuation of the program from where it was at when the signal handler
was called is resumed, but the callback procedure could possibly not
return as its way of dealing with the death (hahaha).

Launchpad Details: #LPC Derick Eddington - 2008-04-02 02:18:11 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Apr 2, 2008, at 2:18 AM, Derick Eddington wrote:

> It looks like the only solution is to use a SIGCHLD signal handler. Not
> to "test" (sorry) but to be notified when a specific process has died.
> An idea: register a procedure for a child and have that procedure called
> when the SIGCHLD telling of that child's death is delivered to the
> signal handler (you'd use SA_SIGINFO with sa_sigaction to install a
> signal handler that would be given a siginfo_t telling what child PID
> died); without screwing up ikarus's stack or run-time of course. Using
> an alternate signal stack (via sigaltstack and SA_ONSTACK) might be
> noteworthy. Ah, it looks like ikarus already does use sa_sigaction and
> an alternate stack for SIGINT, but the handler doesn't call back into
> Scheme.

Exactly. In the signal handler, you're pretty much helpless because
you don't even know whether you're in the Scheme code, in the GC, in
GMP, in some system call (read, write, select, ...) or just in the
middle of a cons that did not initialize its car or cdr fields.

So, for SIGINT, all that Ikarus does right now is set two fields in
the pcb record:

    void handler(int signo, siginfo_t* info, void* uap){
      the_pcb->engine_counter = -1;
      the_pcb->interrupted = 1;
    }

and that's it. In the Scheme code, on entry to every procedure, the
value of engine_counter is decremented and, if negative, the engine
handler is called. The engine handler resets the counter, then checks
and resets the interrupted flag and calls either the interrupt handler
(which raises an interrupted continuable condition and returns) or the
timeout handler (which just returns, iirc).
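The following is a hypothetical C simulation of the mechanism just described, not Ikarus's actual runtime: `struct pcb`, `ENGINE_TICKS`, `procedure_entry`, and `engine_handler` are invented stand-ins that mirror the fields quoted above. The signal handler does nothing but two stores; the real work happens later, at a safe point, when the countdown goes negative.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>

#define ENGINE_TICKS 1000 /* hypothetical reset value for the countdown */

/* Hypothetical stand-in for the runtime's process control block. */
struct pcb {
    volatile sig_atomic_t engine_counter;
    volatile sig_atomic_t interrupted;
};

static struct pcb the_pcb_storage = { ENGINE_TICKS, 0 };
static struct pcb *the_pcb = &the_pcb_storage;
static int interrupts_seen = 0; /* interrupts handled at safe points */

/* Async-signal-safe: two stores, no allocation, no callbacks. */
static void handler(int signo, siginfo_t *info, void *uap) {
    (void)signo; (void)info; (void)uap;
    the_pcb->engine_counter = -1;
    the_pcb->interrupted = 1;
}

/* Runs at a safe point, never inside the signal handler. */
static void engine_handler(void) {
    the_pcb->engine_counter = ENGINE_TICKS;
    if (the_pcb->interrupted) {
        the_pcb->interrupted = 0;
        interrupts_seen++; /* a real runtime would raise a condition here */
    }
    /* otherwise this was an ordinary timeout: just return */
}

/* What compiled code does on entry to every procedure. Note: the
   decrement is load/subtract/store, so a signal landing in between can
   have its -1 overwritten; the interrupted flag is then only noticed
   when the countdown next expires (one of the corner cases). */
void procedure_entry(void) {
    the_pcb->engine_counter--;
    if (the_pcb->engine_counter < 0)
        engine_handler();
}

int install_sigint_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGINT, &sa, NULL);
}
```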

So, calling into Scheme from the signal handler is just not possible.

So, you add another field (say pcb->child_died), and from the handler
you set engine_counter to -1 and the child_died flag to 1. In Scheme,
the engine handler would now have to check for this flag and, if it is
set, call waitpid to reap the dead child, collect the info, and stash it
somewhere (a hash table of some sort) to be retrieved at a later time,
so that you know whether your child has exited and what its exit status
was.
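A rough C sketch of the child_died idea, with hypothetical names throughout and a fixed-size array standing in for "a hash table of some sort". The handler only sets flags; `reap_dead_children`, meant to run from the engine handler, calls waitpid with WNOHANG in a loop and stashes each exit status for later lookup.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_EXITED 64 /* hypothetical fixed capacity */

static volatile sig_atomic_t engine_counter = 1000;
static volatile sig_atomic_t child_died = 0;

/* Hypothetical table of reaped children: PID -> wait status. */
static struct { pid_t pid; int status; } exited[MAX_EXITED];
static int n_exited = 0;

static void sigchld_handler(int signo) {
    (void)signo;
    engine_counter = -1; /* force the engine handler to run soon */
    child_died = 1;
}

/* Called from the engine handler at a safe point. The flag is cleared
   before reaping, so a SIGCHLD arriving mid-loop is not lost; the loop
   also covers several SIGCHLDs that merged into one. */
void reap_dead_children(void) {
    if (!child_died) return;
    child_died = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid <= 0) break; /* 0: none exited yet; -1: no children left */
        assert(n_exited < MAX_EXITED);
        exited[n_exited].pid = pid;
        exited[n_exited].status = status;
        n_exited++;
    }
}

/* Returns 1 and fills *status if pid has been reaped, else 0. */
int child_exit_status(pid_t pid, int *status) {
    for (int i = 0; i < n_exited; i++)
        if (exited[i].pid == pid) { *status = exited[i].status; return 1; }
    return 0;
}

int install_child_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```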

I'm just thinking out loud here, so I don't know if any of this
would work. I don't know off the top of my head which of these calls
are interruptible/restartable, what happens if multiple children die
at the same time, or what happens when one child dies while you're
collecting another.

But all of this does not answer the question: how to know if a child
process failed. The fact that you did not get a SIGCHLD does not
mean that the process did not fail. All it means is that it has not
failed yet and might fail any time now. (I just read in waitpid(2)
that you can pass a WNOHANG option to waitpid so that it doesn't
hang, but that too does not answer the question.)

Let me repeat the problem statement: You want the call to (process
"foo") to return the usual values if the process is started, or raise
an exception if that process was not started for whatever reason.
Right? If so, then all this business with interrupts/waitpid/etc
does not give that behavior, and I don't know how to do it.

> Would this also be possible: If the callback procedure returns, the
> continuation of the program from where it was at when the signal handler
> was called is resumed, but the callback procedure could possibly not
> return as its way of dealing with the death (hahaha).

That's fine. We do that all the time. That's how we break from
"read" when we get SIGINT, which I just realized I somehow broke at
some point. Ouch! BRB! (Okay. I'm back. Just reported bug 210744.)
So, it used to be fine and now it's not. :-(

I'll go to bed now.

Aziz,,,

Launchpad Details: #LPC Abdulaziz Ghuloum - 2008-04-02 06:20:03 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Wed, Apr 2, 2008 at 3:20 AM, Abdulaziz Ghuloum wrote:

> So, you add another field (say pcb->child_died) and from the handler,
> you set the engine_counter to be -1 and the child_died flag to be 1.
> In Scheme, the engine handler would have to check for this flag now,
> and if set, calls waitpid to reap the dead child and collect the
> info, and stash it somewhere (hash table of some sort) to be
> retrieved at a later time so that you know if your child has exited
> or not and what the exit status was.
>
> I'm just thinking out loud here, so, I don't know if any of this
> would work. I don't know off the top of my head which of these calls
> are interruptable/restartable, what happens if multiple children die
> at the same time, or when one child dies while you're collecting
> another.

You can block/delay delivery of signal(s) during sections you don't
want interrupted. Maybe the child_died flag can also be a counter of
the number of un-reaped dead kiddies (in case the SIGCHLD handler is
called more than once before the engine handler runs), and when the
engine handler deals with this, it calls waitpid as many times as the
counter says so that the info for all the dead children is collected.
A hashtable associating PIDs with Scheme callback procedures is used
to find the appropriate callbacks. I suppose signal(s) need to be
unblocked before calling the callbacks, because who knows what a
callback will want to do.
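A possible C sketch of this refinement, with hypothetical names throughout (`watch_child`, `dispatch_child_deaths`, and an array standing in for the PID-to-callback hashtable). SIGCHLD is blocked with sigprocmask while the counter and table are updated, and unblocked before any callback runs. One caveat: ordinary signals do not queue, so several deaths can arrive as a single SIGCHLD; the sketch therefore treats the counter as a hint and keeps calling waitpid until it reports nothing left.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WATCHED 32 /* hypothetical capacity */

typedef void (*death_callback)(pid_t pid, int status);

/* Bumped by the handler: SIGCHLDs not yet dealt with (a hint only). */
static volatile sig_atomic_t unreaped = 0;

/* Hypothetical PID -> callback table. */
static struct { pid_t pid; death_callback cb; } watched[MAX_WATCHED];
static int n_watched = 0;

static void on_sigchld(int signo) { (void)signo; unreaped++; }

int install_death_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

void watch_child(pid_t pid, death_callback cb) {
    assert(n_watched < MAX_WATCHED);
    watched[n_watched].pid = pid;
    watched[n_watched].cb = cb;
    n_watched++;
}

/* Called from the engine handler at a safe point. */
void dispatch_child_deaths(void) {
    struct { pid_t pid; int status; death_callback cb; } ready[MAX_WATCHED];
    int n_ready = 0;
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old); /* handler cannot run in here */
    if (unreaped > 0) {
        unreaped = 0;
        int status;
        pid_t pid;
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            for (int i = 0; i < n_watched; i++) {
                if (watched[i].pid == pid) {
                    ready[n_ready].pid = pid;
                    ready[n_ready].status = status;
                    ready[n_ready].cb = watched[i].cb;
                    n_ready++;
                    watched[i] = watched[--n_watched]; /* drop the entry */
                    break;
                }
            }
            /* A reaped child that nobody watched is simply dropped. */
        }
    }
    sigprocmask(SIG_SETMASK, &old, NULL); /* unblock before user code runs */
    for (int i = 0; i < n_ready; i++)
        ready[i].cb(ready[i].pid, ready[i].status);
}

/* Example callback: remember the most recently reported death. */
static pid_t last_pid = 0;
static int last_status = 0;
static void record_death(pid_t pid, int status) {
    last_pid = pid;
    last_status = status;
}
```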

> But all of this does not answer the question: how to know if a child
> process failed. The fact that you did not get a sigchild does not
> mean that the process did not fail. All it means is that it did not
> fail yet and might fail any time now.

But you do get to know after the failure/death happens, which, because
of the race condition, seems to be the only non-blocking way for the
parent to find out, unless every parent explicitly and repeatedly tries
a non-blocking waitpid every so often, which seems annoying.

> (I just read in waitpid(2)
> that you can pass a WNOHANG option to waitpid so that it doesn't
> hang, but that too does not answer the question.)

Right. I used WNOHANG in my initial post, when I forgot about the
race condition.

> Let me repeat the problem statement: You want the call to (process
> "foo") to return the usual values if the process is started, or raise
> an exception if that process was not started for whatever reason.
> Right? If so, then all this business with interrupts/waitpid/etc
> does not give that behavior, and I don't know how to do it.

Sorry for not being clear. I think we need some non-blocking way
(race-free, without explicitly scheduled checks) for the parent to know
that a child died and which one.

>> Would this also be possible: If the callback procedure returns, the
>> continuation of the program from where it was at when the signal handler
>> was called is resumed, but the callback procedure could possibly not
>> return as its way of dealing with the death (hahaha).

> That's fine. We do that all the time.

Cool. I hope this can be made to work so we can have a reliable and
easy way for the parent to be notified and decide on any arbitrary
action to take (in the callback procedure).

--Derick

Launchpad Details: #LPC Derick Eddington - 2008-04-02 17:50:44 -0400

from ikarus.

hyln9 commented on June 21, 2024

Does the new waitpid (rev 1516) resolve this issue?

Launchpad Details: #LPC Abdulaziz Ghuloum - 2008-06-13 08:51:43 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Fri, 2008-06-13 at 12:51 +0000, Abdulaziz Ghuloum wrote:

> Does the new waitpid (rev 1516) resolve this issue?

I want to say yes. I think for most programs that want to be non-blocking / multiplexing, it's adequate, because these types of programs usually have an event loop driving them and they can periodically do non-blocking waitpid checks to see if a child exited or was killed. For programs which expect to do I/O with a child, I/O failures will happen if the child died unexpectedly and I/O is attempted with it (in this case, I've seen #!eof returned for reads, and broken pipe exception for writes) and the parent can detect these failures and do a non-blocking waitpid to check if the child died and to reclaim the zombie.
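A minimal C sketch of the event-loop style described above, assuming the child's stdout is a pipe. `event_loop` is a hypothetical name and the 50 ms poll timeout is arbitrary. Each turn polls the pipe, treats a zero-byte read as end of file (the C-level counterpart of getting #!eof), and does a non-blocking waitpid to notice and reap the child.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical event loop for one child whose stdout is `out_fd`.
   Returns the number of bytes stored in buf (output beyond `cap` is
   discarded) or -1 on error; *status receives the final wait status. */
ssize_t event_loop(int out_fd, pid_t pid, char *buf, size_t cap, int *status) {
    size_t used = 0;
    int fd_open = 1, reaped = 0;
    while (fd_open || !reaped) {
        if (fd_open) {
            struct pollfd p = { .fd = out_fd, .events = POLLIN };
            int n = poll(&p, 1, 50); /* wait up to 50 ms for output */
            if (n < 0 && errno != EINTR) return -1;
            if (n > 0) {
                char scratch[256];
                char *dst = used < cap ? buf + used : scratch;
                size_t room = used < cap ? cap - used : sizeof scratch;
                ssize_t r = read(out_fd, dst, room);
                if (r == 0) fd_open = 0;             /* EOF: writer closed */
                else if (r > 0 && dst != scratch) used += (size_t)r;
                else if (r < 0 && errno != EINTR) return -1;
            }
        } else {
            /* Output is done but the child is not reaped yet: back off. */
            struct timespec ts = { 0, 10 * 1000000L };
            nanosleep(&ts, NULL);
        }
        if (!reaped) {
            pid_t w = waitpid(pid, status, WNOHANG); /* non-blocking check */
            if (w == pid) reaped = 1;
            else if (w < 0 && errno != EINTR) return -1;
        }
    }
    return (ssize_t)used;
}
```

The parent must close its own copy of the pipe's write end after forking; otherwise the read never sees end of file.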

However, assume a hypothetical non-blocking program which for some reason can't periodically do non-blocking waitpid checks, which doesn't do I/O with its children and so can't rely on I/O failures to indicate child death, and which must know whether a child succeeded or died abnormally. This program would need to be notified via an event/callback when the child has died so it can check whether it succeeded. Unfortunately I don't have a clearer idea whether this type of program is real, or whether it's the wrong approach and should be designed differently.

I have actually already created a patch for Ikarus which implements callback notifications of child death (using a C signal handler for SIGCHLD which handles it similarly to SIGINT). However, having the process accept SIGCHLD means nearly every system call might be interrupted by the delivery of a SIGCHLD and return errno==EINTR, and this requires checking for EINTR after every syscall (except specified-duration sleep, which I don't think there's a solution to) and redoing the syscall. I also implemented these EINTR checks and syscall redos in Scheme. Along with this, I tried to implement solutions for the corner cases with the engine_counter and $do-event when other signals are delivered before the handling of the first one has completed. As you told me, these corner cases also apply to engine preemption, so my ideas might be useful for that issue with engines too. But I'm no longer sure allowing SIGCHLD and dealing with EINTR is worth it or necessary.
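The EINTR checks mentioned above follow a standard retry pattern. A minimal C sketch for read and write (the `_retry` helper names are hypothetical):

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: retry read() when a signal handler (e.g. for
   SIGCHLD) interrupted it before any data was transferred. */
ssize_t read_retry(int fd, void *buf, size_t n) {
    ssize_t r;
    do {
        r = read(fd, buf, n);
    } while (r == -1 && errno == EINTR);
    return r;
}

/* Hypothetical helper: same pattern for write(). A short write is
   returned as-is; the caller decides whether to write the rest. */
ssize_t write_retry(int fd, const void *buf, size_t n) {
    ssize_t r;
    do {
        r = write(fd, buf, n);
    } while (r == -1 && errno == EINTR);
    return r;
}
```

Installing the handler with SA_RESTART makes the kernel restart many of these calls automatically, but not all: on Linux, signal(7) lists calls such as poll, select, and nanosleep that fail with EINTR regardless.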

So, I'll close this bug report; I or someone else should reopen it if we need child-death notification callbacks.

Launchpad Details: #LPC Derick Eddington - 2008-06-13 22:33:52 -0400

from ikarus.

hyln9 commented on June 21, 2024

On Sat, 2008-06-14 at 02:33 +0000, Derick Eddington wrote:
On Fri, 2008-06-13 at 12:51 +0000, Abdulaziz Ghuloum wrote:

> However, having the
> process accept SIGCHLD means nearly every system-call might be
> interrupted by the delivery of a SIGCHLD and return errno==EINTR, and
> this requires checking for EINTR for every syscall (except
> specified-duration sleep, which I don't think there's a solution to) and
> redoing the syscall.

TMI, I realized there's an obvious solution to the problem of specified-duration sleep being interrupted by delivery of a SIGCHLD: block signals before sleeping and unblock them after. In the modifications I mentioned above, I already implemented functions to block and unblock signals, so doing so for sleep would be easy.
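A minimal C sketch of that fix, assuming the runtime has a SIGCHLD handler installed (here a hypothetical flag-setting `note_sigchld`). The sleep runs with SIGCHLD blocked via sigprocmask, and the old mask is restored afterwards; `sleep_ms_sigchld_blocked` is likewise a hypothetical name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Hypothetical flag handler, standing in for the runtime's SIGCHLD handler. */
static volatile sig_atomic_t got_sigchld = 0;
static void note_sigchld(int signo) { (void)signo; got_sigchld = 1; }

int install_note_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_sigchld;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Sleep `ms` milliseconds with SIGCHLD blocked, so a child's death
   cannot cut the sleep short. A SIGCHLD arriving meanwhile stays
   pending and is delivered when the old mask is restored. Other,
   unblocked signals can still interrupt the sleep (-1 with EINTR). */
int sleep_ms_sigchld_blocked(long ms) {
    sigset_t block, old;
    struct timespec req;
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0) return -1;
    int r = nanosleep(&req, NULL);
    int saved = errno;
    sigprocmask(SIG_SETMASK, &old, NULL); /* pending SIGCHLD delivered here */
    errno = saved;
    return r;
}
```

An alternative that keeps SIGCHLD unblocked is to loop on nanosleep, passing its second argument to get the remaining time and sleeping again after EINTR.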

Launchpad Details: #LPC Derick Eddington - 2008-06-15 17:03:42 -0400

from ikarus.

hyln9 commented on June 21, 2024

Oops, that line saying "Abdulaziz Ghuloum wrote:" is out of context; I forgot to delete it. The quote is all of me.

Launchpad Details: #LPC Derick Eddington - 2008-06-15 17:05:35 -0400

from ikarus.
