SA Bugzilla – Bug 5950
spamd disconnects from terminal before writing pid file
Last modified: 2008-08-06 11:21:02 UTC
Hi all I'm having a problem with spamd. When I use both the -rPIDFILE and -d options, the program disconnects from the terminal before the pidfile is created. This means that the script that calls the spamd starts the next program before the pidfile exists. If this next command requires to read from that file, it will fail. Doesn't it make sense to write the pidfile before daemonizing? simple example: /usr/local/bin/spamd -d -s "$sdlog" -r "$sdpid" \ --socketpath="$sdsock" --socketmode 600 || exit kill -INT `cat "$sdpid"` I just get "cat: No such file" and the signal doesn't get sent. The pidfile shows up shortly afterwards. Perhaps there is a sensible way to do this that I haven't thought of. It is possible to poll for the file, eg: while ( ! [ -e "$sdpid" ] ) || ( ! [ -e "$sdsock" ] ) do sleep 1 done But that is awkward and in more complex situations creates race conditions. viza
agreed, this is suboptimal. patches welcome! ;)
Created attachment 4352 [details] patch to move writing pidfile before daemonize I really don't know if this is safe, but it looks like it should work. It needs someone who is familiar with the code to check it. As the comments in the file indicate not to write the pidfile before installing the handlers, I moved them up before daemonize(), and then moved writing the pidfile after them and immediately before daemonize(). If I was writing this in C, I would fsync() (or fdatasync) both the pidfile and its parent directory before detaching from the terminal, to ensure that commands following "spamd -d" could find it. I don't know that that can be done in perl. What I have done here is probably enough if it doesn't break anything else.
Gah! Ignore that patch. It won't work, because writing the pid before calling daemonize() will write the wrong process id. What is needed it to break up daemonize(). Currently it does: fork -> old process: exit. new process: write pid. continue... It should go: fork -> old process: write child's pid. exit. new process: continue...
The comments say # now allow waiting processes to connect, if they're watching the log. # The test suite does this! It looks like much care is taken to get things like the signal handlers done before the pid is written to the log. That implies to me that your script should look at the log file like the test suite does rather than simply continue after calling spamd. I guess it could test and wait for the existence of the pid file if you know that spamd is being called with the pidfile option, but looking at the log file seems more correct.
You are quite right that I am being a little closed-minded in thinking that this is all about the pid file. The pidfile is only part of the issue. What I would like is that the process I start doesn't disconnect from the terminal until the daemon is ready to accept connections, too. In daemonize() we have: exit if $pid; what I would like is that this does: if($pid) wait until daemon is ready to receive connections, AND has written its pidfile if requested, then exit. Perhaps my requirements are quite different from the usual situation where spamd is started at boot and it is best to be fast and to disconnect ASAP. If that is the case, would an alternative option --daemonize-when-ready be acceptable to help me in this case? I am happy if the answer is "No, you have to read the log", but rather than having shell scripts try to make sense of the log, wouldn't it be better to keep this internal to SA? I'd rather not have to check if the log format has changed in every new version. Thanks for your time, viza
Would it make sense for the daemonize function to either add a SIGUSR1 handler to the parent or create a pipe, have the parent wait for a signal or something sent over the pipe from the child before it calls exit() and if the daemonize option is set have the child signal the parent (via kill or writing tothe pipe) after it has written the pid to the log file? Perl experts, is there any problem with using a signal or a pipe in this case and is one better to use than the other?
(In reply to comment #6) > Would it make sense for the daemonize function to either add a SIGUSR1 handler > to the parent or create a pipe, have the parent wait for a signal or something > sent over the pipe from the child before it calls exit() and if the daemonize > option is set have the child signal the parent (via kill or writing tothe pipe) > after it has written the pid to the log file? yes, that would be the ideal fix -- and for that to abort waiting, if the child itself exits. > Perl experts, is there any problem with using a signal or a pipe in this case > and is one better to use than the other? in this setup, either would be fine; signal would be easiest I think.
(in reply to comment #7) >and for that to abort waiting, if the child itself exits. I've been reading about how to use SIGCHLD to do that and it seems complicated. Example code I found uses a SIGCHLD handler, uses waitpid(-1, &WNOHANG) and checks WIFEXITED($?) to avoid responding to a SIGCHLD sent when the child is suspended but not exited. Worst, I see that WNOHANG is not supported on all platforms and I didn't see an alternative way of doing this that does not require a non-blocking waitpid(). It seems to me that for this purpose it would work fine to wait no more than, say 5 seconds. All we are doing is making sure that the parent process does not exit before the log entry (and pid file) is written by the child so that the invoking shell script does not have to bother waiting for it. If the log entry is not written by the child in the first few seconds (probably even in the first second) then something is wrong anyway and it will probably not be written. So how about having a SIGUSR1 handler that sets a flag and a loop that for a maximum of 5 times sleeps 1 second until the flag is set?
(In reply to comment #8) > I see that WNOHANG is not supported on all platforms and I didn't see an > alternative way of doing this that does not require a non-blocking waitpid(). Do the platforms in question support sending signal 0 as a test for a process?
(In reply to comment #8) > (in reply to comment #7) > >and for that to abort waiting, if the child itself exits. > > I've been reading about how to use SIGCHLD to do that and it seems complicated. > Example code I found uses a SIGCHLD handler, uses waitpid(-1, &WNOHANG) and > checks WIFEXITED($?) to avoid responding to a SIGCHLD sent when the child is > suspended but not exited. Worst, I see that WNOHANG is not supported on all > platforms and I didn't see an alternative way of doing this that does not > require a non-blocking waitpid(). thankfully perl insulates us from the hard stuff a lot here -- see child_handler() for a sane WNOHANG check. there's no need to deal with SIGCHLD, or indeed modify that at all. the logic would be (warning, messy p-code): # in daemonize() $got_pidfile = 0; # global $SIG{'USR2'} = \&got_pidfile; defined( my $pid = fork ) or die "spamd: cannot fork: $!\n"; # this code is new: if ($pid) { #parent for my $retry (0 .. $somenumber) { sleep 1; if ($got_pidfile) { last; } my $waited = waitpid($pid, WNOHANG); if ($waited == $pid) { last; # pid has exited } else { warn "waitpid failed: $!"; } } if (!$got_pidfile) { die "child process exited or timed out without producing a PID file"; } else { exit 0; } } else { delete $SIG{'USR2'}; } exit if $pid; setsid or die "spamd: cannot start new session: $!\n"; [..etc.] sub got_pidfile { $got_pidfile = 1; } > It seems to me that for this purpose it would work fine to wait no more than, > say 5 seconds. All we are doing is making sure that the parent process does not > exit before the log entry (and pid file) is written by the child so that the > invoking shell script does not have to bother waiting for it. If the log entry > is not written by the child in the first few seconds (probably even in the > first second) then something is wrong anyway and it will probably not be > written. 5 secs is pretty short under extremely heavy load.... but 30 would be plenty IMO ;)
(In reply to comment #10) Ah, a second look showed that I had misread the problem about portability of WNOHANG. I thought I had read that not all platforms support WNOHANG. What it actually said was: Most systems support a non-blocking waitpid. Use Perl's standard Config.pm module to find out: use Config; $has_nonblocking = $Config{d_waitpid} eq "define" || $Config{d_wait4} eq "define"; So it is waitpid itself that is not completely portable, and since we use it (as well as WNOHANG) I won't worry about using it here. > 5 secs is pretty short under extremely heavy load.... but 30 would be plenty Yes, I was surprised to discover that it takes 4 seconds on my Macbook with nothing loadinng it down. And 30 is too long to simply sleep without using SIGUSR1. I'll code up a patch now.
Created attachment 4353 [details] patch: wait for child to signal parent so spamd parent doesn't exit until pid file is written by child This patch should take care of what we have discussed in the comments. Committed to trunk revision 683074, but please look it over and let me know if you think anything should be changed. Do we want to review this for the 3.2 branch?
(In reply to comment #12) > Committed to trunk revision 683074, but please look it over and let me know if > you think anything should be changed. looks good to me. thanks! > Do we want to review this for the 3.2 branch? IMO, I'd prefer not to. I think it's a significant change to existing behaviour, particularly for packagers... for it to appear in 3.2.6 could be a surprise. and I don't think users are clamouring for it to be fixed in huge numbers ;)
(In reply to comment #13) Ok, I'll close this bug then. Tom, please note that the patch I attached will apply cleanly to spamd.raw in the 3.2 branch so you can use it locally if you want to.
Thanks for your effort on this. Just a suggestion: Why not make the parent process test if the child process is still running? That way if the server dies after one second you don't end up waiting another 29 seconds needlessly. I don't know perl but certainly in C/C++ etc. you can test for the existence of a process by trying to do kill( pid, 0 ). If there is no such process, you get errno= ESRCH. In fact, if this works you could perhaps get rid of the timeout altogether (no need to just give up on slow systems if they are still going).
(In reply to comment #15) > Thanks for your effort on this. > > Just a suggestion: > Why not make the parent process test if the child process is still running? > That way if the server dies after one second you don't end up waiting another > 29 seconds needlessly. Tom, reread the code- if the server dies after 1 sec the waitpid() call will indicate this, and the parent pid will in turn die.
(In reply to comment #15) Tom, What Justin said. I did test it by commenting out the signal and killing the child, and waitpid() does detect that correctly. > you could perhaps get rid of the timeout altogether (no > need to just give up on slow systems if they are still going) The timeout is supposed to be long enough that if you hit it something is wrong and you are better off getting a warning message and continuing in case the child process is actually running ok and the problem was just with the signal. That's supposed to be better than having the script that calls spamd hung up indefinitely. If you have experience that real heavily loaded systems take longer than 30 seconds to start up spamd I can increase the number.