Summary:          Reverse Proxy Threads or Children spinning on CPU
Product:          Apache httpd-2
Component:        mod_proxy
Version:          2.2.9
Hardware:         Sun
OS:               Solaris
Status:           RESOLVED FIXED
Severity:         normal
Priority:         P2
Target Milestone: ---
Reporter:         Mina Galić <me>
Assignee:         Apache HTTPD Bugs Mailing List <bugs>
CC:               chris, croller, esuran, marc.stern, perske
Description (Mina Galić, 2008-07-08 02:16:06 UTC):
> Can you please step further through with gdb to see the limits of the
> loop (so we can see where it is really looping)?

I have now managed to capture a number of backtraces (22 so far) of the running processes, or rather LWPs, but I have not managed to get a 'next':

    for lwppid in $( ps -Lef | grep -v grep \
                     | grep $( < /var/opt/bw/apache/proxy/proxy.pid ) \
                     | awk '{print $4}' | xargs ); do
      gdb --batch --eval-command="where" -ex "bt full" \
          --pid=$lwppid /opt/bw/bin/httpd-worker \
          > $lwppid.$(date +"%Y%m%d%H%M%S") 2>/dev/null
    done

Some of the backtraces for the same LWP are identical, which shows that the state does not change for a very long time. I am not quite sure how to proceed next, or whether I should just upload the stack traces.

NEEDINFO.

Some additional info from my side:

    PID   USER PR NI VIRT RES SHR S %CPU %MEM TIME+     PPID  WCHAN COMMAND
    24151 www  25  0 168m 26m 19m R 97.7  1.3 232:08.94 17149 stext httpd-worker

This thread has now been in this very same state for quite a long time (it is currently my main source of backtraces and frustration).

> Can you try to trace the rogue process with ltrace?

Update: after rpluem asked me to ltrace, I tried, but failed to get anything meaningful out of it. ltracing the parent I get output like this:

    11:21:25.046949 SYS_wait4(-1, 0x7fff60c1555c, 3, 0, 0x667168) = 0
    11:21:25.047094 <... apr_proc_wait_all_procs resumed> ) = 70006
    11:21:25.047157 apr_sleep(0xf4240, 0x7fff60c1555c, 70006, -1, 0x667168 <unfinished ...>
    11:21:25.047260 SYS_select(0, 0, 0, 0, 0x7fff60c15570) = 0
    11:21:26.046765 <... apr_sleep resumed> ) = 0
    11:21:26.046874 apr_proc_wait_all_procs(0x7fff60c15690, 0x7fff60c156b8, 0x7fff60c156bc, 1, 0x667168 <unfinished ...>
    11:21:26.046997 SYS_wait4(-1, 0x7fff60c1555c, 3, 0, 0x667168) = 0
    11:21:26.047141 <... apr_proc_wait_all_procs resumed> ) = 70006
    11:21:26.047204 apr_sleep(0xf4240, 0x7fff60c1555c, 70006, -1, 0x667168 <unfinished ...>
    11:21:26.047307 SYS_select(0, 0, 0, 0, 0x7fff60c15570) = 0
    11:21:27.046849 <... apr_sleep resumed> ) = 0
    11:21:27.046950 apr_proc_wait_all_procs(0x7fff60c15690, 0x7fff60c156b8, 0x7fff60c156bc, 1, 0x667168 <unfinished ...>
    11:21:27.047068 SYS_wait4(-1, 0x7fff60c1555c, 3, 0, 0x667168) = 0
    11:21:27.047230 <... apr_proc_wait_all_procs resumed> ) = 70006
    11:21:27.047296 apr_sleep(0xf4240, 0x7fff60c1555c, 70006, -1, 0x667168 <unfinished ...>
    11:21:27.047402 SYS_select(0, 0, 0, 0, 0x7fff60c15570) = 0
    11:21:28.046897 <... apr_sleep resumed> ) = 0
    11:21:28.047003 apr_proc_wait_all_procs(0x7fff60c15690, 0x7fff60c156b8, 0x7fff60c156bc, 1, 0x667168 <unfinished ...>
    11:21:28.047126 SYS_wait4(-1, 0x7fff60c1555c, 3, 0, 0x667168) = 0
    11:21:28.047273 <... apr_proc_wait_all_procs resumed> ) = 70006
    11:21:28.047335 apr_sleep(0xf4240, 0x7fff60c1555c, 70006, -1, 0x667168 <unfinished ...>
    11:21:28.047439 SYS_select(0, 0, 0, 0, 0x7fff60c15570) = 0
    11:21:29.046945 <... apr_sleep resumed> ) = 0
    11:21:29.047046 apr_proc_wait_all_procs(0x7fff60c15690, 0x7fff60c156b8, 0x7fff60c156bc, 1, 0x667168 <unfinished ...>
    11:21:29.047164 SYS_wait4(-1, 0x7fff60c1555c, 3, 0, 0x667168) = 0
    11:21:29.047323 <... apr_proc_wait_all_procs resumed> ) = 70006
    11:21:29.047389 apr_sleep(0xf4240, 0x7fff60c1555c, 70006, -1, 0x667168 <unfinished ...>
    11:21:29.047492 SYS_select(0, 0, 0, 0, 0x7fff60c15570) = 0

which seems natural: this is just the parent's once-per-second maintenance loop (apr_sleep(0xf4240) is a 1,000,000-microsecond sleep between apr_proc_wait_all_procs calls).
Now, if I try to ltrace an LWP, I am hitting a limitation of ltrace itself:
http://lists.alioth.debian.org/pipermail/ltrace-devel/2006-April/000039.html

After about 36 hours of runtime I finally got:

    top - 12:16:45 up 96 days, 20:11, 4 users, load average: 0.00, 0.01, 0.25
    Tasks: 253 total, 2 running, 219 sleeping, 30 stopped, 2 zombie
    Cpu(s): 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 2097152k total, 1903600k used, 193552k free, 299744k buffers
    Swap: 2097144k total, 18924k used, 2078220k free, 1072724k cached

    PID   USER PR NI VIRT RES SHR  S %CPU %MEM TIME+     PPID  nDRT WCHAN     COMMAND
    24125 www  15  0 168m 26m 19m  T 0.0  1.3  0:00.00   17149 0    finish_st httpd-worker
    24151 www  18  0 168m 26m 19m  T 0.0  1.3  845:53.74 17149 0    finish_st httpd-worker
    5136  www  18  0 345m 11m 2728 T 0.0  0.6  0:00.00   17149 0    finish_st httpd-worker
    5138  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.13   17149 0    finish_st httpd-worker
    5139  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.10   17149 0    finish_st httpd-worker
    5140  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.12   17149 0    finish_st httpd-worker
    5141  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.13   17149 0    finish_st httpd-worker
    5142  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.10   17149 0    finish_st httpd-worker
    5143  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.08   17149 0    finish_st httpd-worker
    5144  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.08   17149 0    finish_st httpd-worker
    5145  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.08   17149 0    finish_st httpd-worker
    5146  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.07   17149 0    finish_st httpd-worker
    5147  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.60   17149 0    finish_st httpd-worker
    5148  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.11   17149 0    finish_st httpd-worker
    5149  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.15   17149 0    finish_st httpd-worker
    5150  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.11   17149 0    finish_st httpd-worker
    5151  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.11   17149 0    finish_st httpd-worker
    5152  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.12   17149 0    finish_st httpd-worker
    5153  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.10   17149 0    finish_st httpd-worker
    5154  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.14   17149 0    finish_st httpd-worker
    5155  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.14   17149 0    finish_st httpd-worker
    5156  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.16   17149 0    finish_st httpd-worker
    5157  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.15   17149 0    finish_st httpd-worker
    5158  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.08   17149 0    finish_st httpd-worker
    5159  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.13   17149 0    finish_st httpd-worker
    5160  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.12   17149 0    finish_st httpd-worker
    5161  www  18  0 345m 11m 2728 T 0.0  0.6  51:08.45  17149 0    finish_st httpd-worker
    5162  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.10   17149 0    finish_st httpd-worker
    5163  www  15  0 345m 11m 2728 T 0.0  0.6  0:00.21   17149 0    finish_st httpd-worker
    5902  www  17  0 0    0   0    Z 0.0  0.0  0:04.58   17149 0    exit      httpd-worker <defunct>
    6251  www  17  0 0    0   0    Z 0.0  0.0  0:00.21   17149 0    exit      httpd-worker <defunct>

The process was no longer LISTENing (according to netstat) and did not react to apachectl stop, or to kill for that matter; I had to kill -9 it.
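Since ltrace cannot follow the LWPs here, a lighter-weight way to keep capturing the same information is to snapshot stacks periodically and compare consecutive snapshots. This is only a sketch: it assumes Solaris pstack (the tool used for the traces later in this report) and reuses the pid-file path from the gdb loop above; adjust both for your system.

    # Snapshot the stacks of the httpd parent and all of its children
    # once per second; pstack prints every LWP's stack in one pass.
    # Consecutive identical snapshots of a busy LWP point directly at
    # the frames it is spinning in.
    pid=$( < /var/opt/bw/apache/proxy/proxy.pid )
    while sleep 1; do
      for p in "$pid" $(pgrep -P "$pid"); do
        pstack "$p" > "pstack.$p.$(date +%Y%m%d%H%M%S)"
      done
    done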
Update: I have managed to reproduce the bug with mpm prefork (on the same system), and, to some extent so far, with mpm worker on OpenSolaris:

    top - 16:32:13 up 102 days, 43 min, 3 users, load average: 1.09, 1.16, 1.10
    Tasks: 234 total, 3 running, 231 sleeping, 0 stopped, 0 zombie
    Cpu(s): 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 2097152k total, 1970148k used, 127004k free, 307868k buffers
    Swap: 2097144k total, 18676k used, 2078468k free, 1198244k cached

    PID   USER PR NI VIRT RES  SHR  S %CPU %MEM TIME+    PPID  nDRT WCHAN     COMMAND
    25675 www  25  0 122m 12m  9300 R 97.7 0.6  50:31.07 22818 0    stext     httpd-prefork
    27246 www  15  0 122m 5280 1536 S 0.0  0.3  0:00.03  22818 0    184466121 httpd-prefork
    27251 www  15  0 123m 6216 2356 S 0.0  0.3  0:00.09  22818 0    semtimedo httpd-prefork
    27286 www  15  0 122m 5268 1536 S 0.0  0.3  0:00.02  22818 0    360474278 httpd-prefork
    27287 www  15  0 123m 6424 2356 S 0.0  0.3  0:00.04  22818 0    semtimedo httpd-prefork
    27288 www  15  0 122m 5268 1536 S 0.0  0.3  0:00.00  22818 0    semtimedo httpd-prefork
    27290 www  15  0 123m 5512 1536 S 0.0  0.3  0:00.01  22818 0    semtimedo httpd-prefork
    27318 www  15  0 122m 5248 1536 S 0.0  0.3  0:00.00  22818 0    semtimedo httpd-prefork
    27344 www  15  0 122m 5096 1520 S 0.0  0.2  0:00.02  22818 0    semtimedo httpd-prefork
    27354 www  15  0 122m 5088 1520 S 0.0  0.2  0:00.00  22818 0    semtimedo httpd-prefork
    27356 www  15  0 123m 5392 1536 S 0.0  0.3  0:00.00  22818 0    semtimedo httpd-prefork
    27361 www  15  0 123m 5456 1532 S 0.0  0.3  0:00.00  22818 0    211816788 httpd-prefork

On Solaris I only have a zombified LWP, which will not go away unless the process is killed with -9, but it is not eating 80%+ CPU:

    PID   USERNAME SIZE  RSS   STATE  PRI NICE TIME    CPU  PROCESS/LWPID
    16635 webservd 9488K 3312K sleep  59  0    0:00:00 0.0% httpd-worker/23
    [/snip]
    16635 webservd 9488K 3312K sleep  59  0    0:00:00 0.0% httpd-worker/3
    16635 webservd 9488K 3312K zombie 0   -    0:00:00 0.0% httpd-worker/2
    16635 webservd 9488K 3312K sleep  59  0    0:00:00 0.0% httpd-worker/1
    [/snip]
    Total: 3 processes, 84 lwps, load averages: 0.02, 0.01, 0.01

I am experiencing the same symptoms as Igor (a single httpd sucking up 90% of CPU time) using the 2.2.10 test release on openSUSE 10.3 x86_64, and have a couple of observations to add. I run worker in a pure reverse-proxy scenario (built with the included APR); the one major difference from Igor is that I do not use mod_deflate at all. First observation: the out-of-control process had three sockets open, and they were all in CLOSE_WAIT, waiting to eventually be released. Second observation: after setting "disablereuse on" for every ProxyPass, the problem has not recurred. In addition, errors of the type "(70014)End of file found: proxy: error reading status line from remote server" have disappeared. Happy to provide any information I can to help; just tell me what is needed.

Has anyone found a fix for this? We have the exact same problem with 2.2.9 and also with 2.2.10. I tried the disablereuse workaround, but mod_proxy_balancer does not support this in the ProxyPass directive.

I am experiencing a similar problem on Win2K3 with Apache 2.2.8 and 2.2.11, using mod_proxy and mod_rewrite, where mod_rewrite enables proxying via its [P] flag.

We ran into this bug with Apache 2.2.9 on Solaris 10. Here are the stack traces:
# pstack 19235
19235: /opt/coolstack/apache2/bin/httpd -k start
ff236b1c ???????? (dafa0, 87c88, 87c88, 455518, 1b7788, 87d50)
fe2ec680 connection_cleanup (dafe0, dafe0, 7, 96bd0, 8fc20, 974c8) + 148
fe2ed0a8 ap_proxy_release_connection (fe2c5fb8, dafe0, 97440, 10400, fe300a0c, 96bd0) + 4c
fe2c5064 proxy_http_handler (fffffffe, dafe0, ec318, ffbf9694, fe2c5fb8, 1ac660) + 3e8
fe2e508c proxy_handler (fe2c4c7c, fe3018b8, 1f78c8, fe2eec9c, ec3e0, ec318) + 728
0003fe60 ap_run_handler (1f78c8, 4, 826ac, 826c8, 9f330, 0) + 3c
00040478 ap_invoke_handler (1f78c8, 9f450, 0, 0, 0, fe2ee1c8) + e8
0004fcbc ap_process_request (0, 80000, 1f78c8, 0, 0, 1f78c8) + 54
0004ca68 ap_process_http_connection (1999e8, 1f78c8, 8fc20, 80000, 801b8, 80000000) + 6c
000482f8 ap_run_process_connection (1999e8, 199750, 9ef88, 82ee4, 9f760, 0) + 3c
0005505c child_main (22, 1999e8, 6b75c, 82f30, 803bc, 1) + 64c
000553b0 make_child (54400, 33, 0, 0, 34, 80000) + 13c
00055654 perform_idle_server_maintenance (80000, 57, 19, 803ac, 8337c, 4) + 1d0
00055d54 ap_mpm_run (80000, 0, ffffffff, 27, 803c4, 82f40) + 678
0002928c main (8bdb0, ffbffd08, 8023c, 7f090, 64650, 7f080) + 9f4
00028378 _start (0, 0, 0, 0, 0, 0) + 108
# mdb -p 19235
Loading modules: [ ld.so.1 libc.so.1 libuutil.so.1 ]
> $C
ffbf9498 libapr-1.so.0.3.0`apr_thread_mutex_unlock+0xc(dafa0, 87c88, 87c88, 455518, 1b7788, 87d50)
ffbf94f8 mod_proxy.so`connection_cleanup+0x148(dafe0, dafe0, 7, 96bd0, 8fc20, 974c8)
ffbf9560 mod_proxy.so`ap_proxy_release_connection+0x4c(fe2c5fb8, dafe0, 97440, 10400, fe300a0c, 96bd0)
ffbf95c8 mod_proxy_http.so`proxy_http_handler+0x3e8(fffffffe, dafe0, ec318, ffbf9694, fe2c5fb8, 1ac660)
ffbf96b8 mod_proxy.so`proxy_handler+0x728(fe2c4c7c, fe3018b8, 1f78c8, fe2eec9c, ec3e0, ec318)
ffbf9790 httpd`ap_run_handler+0x3c(1f78c8, 4, 826ac, 826c8, 9f330, 0)
ffbf97f0 httpd`ap_invoke_handler+0xe8(1f78c8, 9f450, 0, 0, 0, fe2ee1c8)
ffbf9858 httpd`ap_process_request+0x54(0, 80000, 1f78c8, 0, 0, 1f78c8)
ffbf98b8 httpd`ap_process_http_connection+0x6c(1999e8, 1f78c8, 8fc20, 80000, 801b8, 80000000)
ffbf9920 httpd`ap_run_process_connection+0x3c(1999e8, 199750, 9ef88, 82ee4, 9f760, 0)
ffbf9980 httpd`child_main+0x64c(22, 1999e8, 6b75c, 82f30, 803bc, 1)
ffbf9a18 httpd`make_child+0x13c(54400, 33, 0, 0, 34, 80000)
ffbf9a78 httpd`perform_idle_server_maintenance+0x1d0(80000, 57, 19, 803ac, 8337c, 4)
ffbf9b78 httpd`ap_mpm_run+0x678(80000, 0, ffffffff, 27, 803c4, 82f40)
ffbf9c20 main+0x9f4(8bdb0, ffbffd08, 8023c, 7f090, 64650, 7f080)
ffbffd20 _start+0x108(0, 0, 0, 0, 0, 0)
We use Apache in prefork mode. The Apache process fully utilizes one CPU when it is stuck. If we disable connection reuse with disablereuse=On, there are no stuck Apache processes.
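For reference, a minimal sketch of the workaround the commenters above describe. The mount point and backend URL are placeholders, not taken from this report; disablereuse=On tells mod_proxy to open a fresh backend connection for every request instead of recycling pooled ones, trading some connection-setup overhead for avoiding the reuse code path:

    # Placeholder backend; the disablereuse=On parameter is the point here.
    ProxyPass        /app http://backend.example.com:8080/app disablereuse=On
    ProxyPassReverse /app http://backend.example.com:8080/app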
I think some of the problem symptoms described here were fixed in 2.2.11 for PR 45792:
http://svn.apache.org/viewvc?view=revision&revision=713146

Note that one of the folks chiming in on this thread already had 2.2.11 at the time. (Apologies for claiming 2.2.14 earlier today in one of the httpd sessions at ApacheCon; some folks I support picked up the pool corruption/cleanup loop fix when *they* upgraded to 2.2.14, and I remembered incorrectly that 2.2.14 was the first release with the fix.)

After a month of runtime without running into this issue, I am closing this ticket.

*** Bug 45483 has been marked as a duplicate of this bug. ***
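For background on the class of fix referenced above: connection_cleanup, the mod_proxy frame seen spinning in the stack traces, runs as an APR pool cleanup. The sketch below only demonstrates the underlying APR cleanup API (register a cleanup, kill it before the resource is handed back for reuse); the fake_conn_t type and conn_cleanup function are hypothetical stand-ins, not mod_proxy's actual code or the actual fix.

    /* Minimal, self-contained APR cleanup example; build with -lapr-1. */
    #include <stdio.h>
    #include <apr_general.h>
    #include <apr_pools.h>

    typedef struct { int id; } fake_conn_t;   /* hypothetical stand-in */

    static apr_status_t conn_cleanup(void *data)
    {
        fake_conn_t *conn = data;
        printf("cleanup fired for connection %d\n", conn->id);
        return APR_SUCCESS;
    }

    int main(void)
    {
        apr_pool_t *pool;
        fake_conn_t conn = { 42 };

        apr_initialize();
        apr_pool_create(&pool, NULL);

        /* Tie the connection's teardown to the pool's lifetime. */
        apr_pool_cleanup_register(pool, &conn, conn_cleanup,
                                  apr_pool_cleanup_null);

        /* Before the connection is handed back for reuse, the cleanup
         * must be removed; leaving it registered lets a stale cleanup
         * run later against a recycled connection. */
        apr_pool_cleanup_kill(pool, &conn, conn_cleanup);

        apr_pool_destroy(pool);   /* nothing fires: the cleanup was killed */
        apr_terminate();
        return 0;
    }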