|Summary:||semaphore problem takes httpd down|
|Product:||Apache httpd-2||Reporter:||vtmue <vt>|
|Component:||mpm_prefork||Assignee:||Apache HTTPD Bugs Mailing List <bugs>|
Description vtmue 2003-08-16 19:42:45 UTC
Hello,

Basically I have run into the problem discussed here:
http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0xf91e36e69499d611abdb0090277a778c,00.html

But the proposed fix (raising the semaphore-related kernel parameters) does not help. We run HP-UX 11.11 at a fairly recent patch level.

A full trace of Apache 2.0.47 up to its "suicide" is available (provided httpd is alive...) at:
http://vorsprung-durch-denken.de/apache-trace.txt

The main error log holds:

tons of: [emerg] (22)Invalid argument: couldn't grab the accept mutex
some: [emerg] (36)Identifier removed: couldn't grab the accept mutex
few: [emerg] (28)No space left on device: couldn't grab the accept mutex

The symptom does not appear to be related to the number of children. I watched the parent die once with 46 child processes and another time with 17. The load on the machine is around 0.2 all the time.

To put it plainly: I'm stuck and hoping some good soul out there can help! Any hint is appreciated.

TIA, vt
Comment 1 vtmue 2003-08-17 12:54:35 UTC
We worked around the problem by temporarily setting AcceptMutex to fcntl.
Comment 2 Jeff Trawick 2003-08-19 11:20:53 UTC
*** Bug 22516 has been marked as a duplicate of this bug. ***
Comment 3 Jeff Trawick 2003-08-19 11:36:28 UTC
BTW, the trace provided is just of the parent process, so it doesn't show the semaphore errors encountered in the children. I don't think this is an Apache or APR problem. (The APR codebase has the code that uses SysV semaphores.) While there have been at least a few people encountering this on HP-UX, semaphore problems that could not be resolved by system tuning haven't been reported elsewhere, and presumably many other HP-UX users are running Apache successfully. Maybe there is further tuning necessary on your system, maybe you have a bad level of some kernel code, maybe I don't know what I'm talking about :) If you want to pursue this further with us, we need a trace that shows Apache+APR doing something invalid with the semaphores. If you have OS support from HP, you might describe to them what tuning you performed already and see if they have additional recommendations.
Comment 4 vtmue 2003-08-19 12:51:12 UTC
Jeff,

I read about other HP-UX and Solaris users who appear to face the very same symptom. HP supplies a compiled binary, so I suppose many users will stick with that.

My trace tool can follow forks, but there are a couple of showstoppers on my side: the affected server is in production, and we are in the process of downgrading back to 1.3. There are also about 170 vhosts configured, and httpd has approximately 50-60 concurrently active children during the day.

One of our first thoughts was that one of the vhosts might generate an error that causes the parent to shut down, but we could not confirm this when searching the logs. I have to admit we don't have the time to track this down any further right now.

We are about to set up an 11i system at the current patch level during the next week. We can possibly set up 2.0.47 there and see if httperf can reproduce the problem.

Cheers, vt
Comment 5 Jeff Trawick 2003-08-22 11:56:53 UTC
>I read about other HP-UX and Solaris users who appear to face the very same
>symptom. HP supplies a compiled binary so many users will stick with this I
>suppose.

If there is some fix for this in the HP-supplied binary but not in Apache or APR, we'd love to hear about it :) I hope that isn't the situation.

>One of our first thoughts here was that one of the vhosts may generate an error
>that causes the parent to shut down but we could not confirm this when
>searching the logs. And I have to admit we haven't got the time to trace down
>this any further right now.

In the case that a child returned a fatal error which forced a shutdown, there should be a message in error_log written by the parent by this code:

    ap_log_error(APLOG_MARK, APLOG_ALERT, 0, ap_server_conf,
                 "Child %" APR_PID_T_FMT " returned a Fatal error..." APR_EOL_STR
                 "Apache is exiting!",
                 pid->pid);

In all likelihood the sequence was this: the fatal error was simply the first unexpected ENOSPC from attempting to acquire the mutex; that child then returned a fatal error; the semaphore got cleaned up; and the remaining children that hadn't already died due to the shutdown started getting EINVAL on their semaphore operations.
Comment 6 vtmue 2003-08-22 17:29:39 UTC
Hi Jeff,

OK, I have to admit we have those:

[Sat Aug 16 16:08:16 2003] [notice] Apache/2.0.47 configured -- resuming normal operations
[Sat Aug 16 16:38:09 2003] [emerg] (28)No space left on device: couldn't grab the accept mutex
[Sat Aug 16 16:38:09 2003] [alert] Child 16480 returned a Fatal error...
Apache is exiting!
[Sat Aug 16 16:38:10 2003] [emerg] (36)Identifier removed: couldn't grab the accept mutex
[...]

Unfortunately a colleague deleted the clients' logs of that day, so... :(

As for HP: from what I see in their release notes, they fixed a bug related to semaphores/mod_ssl/dbm in 2.0.43, so that seems to be something different. Besides, I take it for granted that they'll report problems once they find or fix them.

At this point I'm a bit clueless, because I see no way we could track this down. Can you give me a hint where I can read about what could cause a child to produce a "Fatal error"? (I googled but didn't find anything useful.) I'm willing to investigate, but I can't trace 170 vhosts one after the other, many of them using PHP.

Thanks, vt
Comment 7 Jeff Trawick 2003-10-10 18:18:28 UTC
The first error message from your last error log submission is the entire story:

[Sat Aug 16 16:38:09 2003] [emerg] (28)No space left on device: couldn't grab the accept mutex

The kernel failed the semaphore acquire. If you can't fix it with OS tuning, then avoid it with "AcceptMutex fcntl" or some other mutex type.
Comment 8 Jeff Trawick 2003-12-10 18:24:41 UTC
*** Bug 25418 has been marked as a duplicate of this bug. ***
Comment 9 Jeff Trawick 2003-12-10 18:29:58 UTC
Not a problem in httpd or APR as far as anyone can tell... If OS tuning can't resolve the problem, then use the AcceptMutex directive to try a different mutex mechanism. The not-uncommon occurrences of mutex problems that defy easy resolution or even explanation are why there is an AcceptMutex directive to start with :)