Bug 50335 - Cores dumped under high load (segmentation fault signal 11 SIGSEGV)
Cores dumped under high load (segmentation fault signal 11 SIGSEGV)
Status: RESOLVED FIXED
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy
2.2.17
PC Linux
: P2 critical (vote)
: ---
Assigned To: Apache HTTPD Bugs Mailing List
: PatchAvailable
: 50834 55979 (view as bug list)
Depends on:
Blocks:
  Show dependency tree
 
Reported: 2010-11-25 06:57 UTC by Aleksey Midenkov
Modified: 2014-01-19 19:28 UTC (History)
5 users (show)



Attachments
proposed changes in httpd 2.2.17 (8.01 KB, patch)
2012-08-29 14:01 UTC, Max Romanov
Details | Diff
output from: gdb thread apply all backtrace full (33.70 KB, application/x-gzip)
2012-11-09 11:06 UTC, Sean Timmins
Details
gdb output from core 2012-11-11 (34.67 KB, application/gzip)
2012-11-11 23:14 UTC, Sean Timmins
Details
gdb output from core 2012-11-13 (18.31 KB, application/gzip)
2012-11-13 13:18 UTC, Sean Timmins
Details
possible fix for PR 50335 (480 bytes, patch)
2013-08-19 19:00 UTC, Joe Orton
Details | Diff
Another possible fix for PR 50335 (684 bytes, patch)
2013-12-10 08:47 UTC, jkaluza
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Aleksey Midenkov 2010-11-25 06:57:28 UTC
stack.1198:
(most frequent core)
Thread 122 (Thread 1280):
#0  0x00002b3b47a846f7 in kill () from /lib64/libc.so.6
#1  <signal handler called>
#2  0x00002b3b46d97acf in apr_brigade_cleanup (data=<value optimized out>) at buckets/apr_brigade.c:44
#3  0x00002b3b4742088d in run_cleanups (cref=0x2aab280c3158) at memory/unix/apr_pools.c:2314
#4  0x00002b3b4742130b in apr_pool_destroy (pool=0x2aab280c3138) at memory/unix/apr_pools.c:782
#5  0x00002b3b474214c5 in apr_pool_clear (pool=0x2aab140834b8) at memory/unix/apr_pools.c:736
#6  0x00002b3b4629ea94 in ?? ()
#7  0x00002b3b4763a73d in start_thread () from /lib64/libpthread.so.0
#8  0x00002b3b47b27d1d in clone () from /lib64/libc.so.6


stack.5566:
Thread 20 (Thread 5808):
#0  0x00002b3b47a846f7 in kill () from /lib64/libc.so.6
#1  <signal handler called>
#2  0x00002b3b4628b456 in ap_core_input_filter ()
#3  0x00002b3b492fe298 in ap_proxy_http_process_response (p=0x2aaaf01ed998, r=0x2aaaf01eda18, backend=0x2b3b5c4022b8, origin=0x2b3b5c97a618, conf=0x2b3b5c03e250, server_portstr=0x2aaaddefbd50 ":3128") at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:1767
#4  0x00002b3b492fffeb in proxy_http_handler (r=0x2aaaf01eda18, worker=<value optimized out>, conf=0x2b3b5c03e250, url=0x2b3b5cdbae18 "/xml.aspx?AppID=8DCA1510B4F71546E54034F13C5F41A837522444&Version=2.2&Sources=web&adult=off&web.count=50&Query=\"Leave+a+Reply\"+\"Name+(required)\"+\"Mail+(will+not+be+published)\"+\"Website\"+-site:wordpress"..., proxyname=0x0, proxyport=0) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:2033
#5  0x00002b3b490eb111 in proxy_run_scheme_handler (r=0x2aaaf01eda18, worker=0x2b3b5c000fb0, conf=0x2b3b5c03e250, url=0x2b3b5cdba9f6 "http://api.bing.net/xml.aspx?AppID=8DCA1510B4F71546E54034F13C5F41A837522444&Version=2.2&Sources=web&adult=off&web.count=50&Query=\"Leave+a+Reply\"+\"Name+(required)\"+\"Mail+(will+not+be+published)\"+\"Websi"..., proxyhost=0x0, proxyport=<value optimized out>) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:2563
#6  0x00002b3b490ef1c0 in proxy_handler (r=0x2aaaf01eda18) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:1065
#7  0x00002b3b4628bc9a in ap_run_handler ()
#8  0x00002b3b4628f122 in ap_invoke_handler ()
#9  0x00002b3b46299da8 in ap_process_request ()
#10 0x00002b3b46296eb0 in ?? ()
#11 0x00002b3b462931b2 in ap_run_process_connection ()
#12 0x00002b3b4629ea65 in ?? ()
#13 0x00002b3b4763a73d in start_thread () from /lib64/libpthread.so.0
#14 0x00002b3b47b27d1d in clone () from /lib64/libc.so.6


stack.5580:
Thread 131 (Thread 7291):
#0  0x00002b3b47a846f7 in kill () from /lib64/libc.so.6
#1  <signal handler called>
#2  0x00002b3b46d99226 in apr_bucket_free (mem=0x2b3b5cd6f978) at buckets/apr_buckets_alloc.c:191
#3  0x00002b3b46d97aea in apr_brigade_cleanup (data=<value optimized out>) at buckets/apr_brigade.c:44
#4  0x00002b3b4742088d in run_cleanups (cref=0x2aaaec1cd768) at memory/unix/apr_pools.c:2314
#5  0x00002b3b4742130b in apr_pool_destroy (pool=0x2aaaec1cd748) at memory/unix/apr_pools.c:782
#6  0x00002b3b474214c5 in apr_pool_clear (pool=0x2b3b5c668378) at memory/unix/apr_pools.c:736
#7  0x00002b3b4629ea94 in ?? ()
#8  0x00002b3b4763a73d in start_thread () from /lib64/libpthread.so.0
#9  0x00002b3b47b27d1d in clone () from /lib64/libc.so.6


stack.5987:
Thread 117 (Thread 11007):
#0  0x00002b3b47a846f7 in kill () from /lib64/libc.so.6
#1  <signal handler called>
#2  0x00002b3b490f1498 in ap_proxy_release_connection (proxy_function=0x2b3b49301777 "HTTP", conn=0x2b3b5ca8f218, s=0x2b3b5bffd590) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/proxy_util.c:2041
#3  0x00002b3b492feed3 in ap_proxy_http_cleanup (scheme=0x2b3b49301777 "HTTP", r=<value optimized out>, backend=0x63203b6c6d74682f) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:1906
#4  0x00002b3b492ff192 in proxy_http_handler (r=0x2b3b5cb5b878, worker=0x0, conf=0x2b3b5c03e250, url=0x2b3b5cb5d558 "/squid-internal-dynamic/netdb", proxyname=0x0, proxyport=0) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:2044
#5  0x00002b3b490eb111 in proxy_run_scheme_handler (r=0x2b3b5cb5b878, worker=0x2b3b5c000fb0, conf=0x2b3b5c03e250, url=0x2b3b5cb5d426 "http://174.142.104.57:3128/squid-internal-dynamic/netdb", proxyhost=0x0, proxyport=<value optimized out>) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:2563
#6  0x00002b3b490ef1c0 in proxy_handler (r=0x2b3b5cb5b878) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:1065
#7  0x00002b3b4628bc9a in ap_run_handler ()
#8  0x00002b3b4628f122 in ap_invoke_handler ()
#9  0x00002b3b46299da8 in ap_process_request ()
#10 0x00002b3b46296eb0 in ?? ()
#11 0x00002b3b462931b2 in ap_run_process_connection ()
#12 0x00002b3b4629ea65 in ?? ()
#13 0x00002b3b4763a73d in start_thread () from /lib64/libpthread.so.0
#14 0x00002b3b47b27d1d in clone () from /lib64/libc.so.6


stack.6427:
Thread 148 (Thread 6923):
#0  0x00002b3b47a846f7 in kill () from /lib64/libc.so.6
#1  <signal handler called>
#2  apr_brigade_split_line (bbOut=0x2aaaec2cc8b8, bbIn=0x2b3b5d0b18a0, block=APR_BLOCK_READ, maxbytes=8192) at buckets/apr_brigade.c:319
#3  0x00002b3b4628b58a in ap_core_input_filter ()
#4  0x00002b3b4627fc3a in ap_rgetline_core ()
#5  0x00002b3b492fda8b in ap_proxygetline (bb=0x2aaaec2cc8b8, s=<value optimized out>, n=8192, r=0x2b3b5d09ce18, fold=0, writen=0x64310b7c) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:1329
#6  0x00002b3b492fdc1f in ap_proxy_http_process_response (p=0x2aaaec5f39c8, r=0x2aaaec5f3a48, backend=0x2b3b5c7f23a8, origin=0x2b3b5d0b1158, conf=0x2b3b5c03e250, server_portstr=0x64314d50 ":3128") at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:1400
#7  0x00002b3b492fffeb in proxy_http_handler (r=0x2aaaec5f3a48, worker=<value optimized out>, conf=0x2b3b5c03e250, url=0x2aaaec4973f0 "/login?u=951898728&p=08121B6D645B7277AFB3B87099F45C68&verifycode=&aid=15002005&u1=http%3A%2F%2Fqz.qq.com&h=1&ptredirect=1&ptlang=2052&from_ui=1&dumy=&fp=loginerroralert", proxyname=0x0, proxyport=0) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:2033
#8  0x00002b3b490eb111 in proxy_run_scheme_handler (r=0x2aaaec5f3a48, worker=0x2b3b5c000fb0, conf=0x2b3b5c03e250, url=0x2aaaec49712e "http://ptlogin2.qq.com/login?u=951898728&p=08121B6D645B7277AFB3B87099F45C68&verifycode=&aid=15002005&u1=http%3A%2F%2Fqz.qq.com&h=1&ptredirect=1&ptlang=2052&from_ui=1&dumy=&fp=loginerroralert", proxyhost=0x0, proxyport=<value optimized out>) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:2563
#9  0x00002b3b490ef1c0 in proxy_handler (r=0x2aaaec5f3a48) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:1065
#10 0x00002b3b4628bc9a in ap_run_handler ()
#11 0x00002b3b4628f122 in ap_invoke_handler ()
#12 0x00002b3b46299da8 in ap_process_request ()
#13 0x00002b3b46296eb0 in ?? ()
#14 0x00002b3b462931b2 in ap_run_process_connection ()
#15 0x00002b3b4629ea65 in ?? ()
#16 0x00002b3b4763a73d in start_thread () from /lib64/libpthread.so.0
#17 0x00002b3b47b27d1d in clone () from /lib64/libc.so.6


stack.8937:
Thread 8 (Thread 9841):
#0  0x00002b3b47a846f7 in kill () from /lib64/libc.so.6
#1  <signal handler called>
#2  allocator_free (allocator=0x2b3b5bf855a0, node=0xa0) at memory/unix/apr_pools.c:358
#3  apr_allocator_free (allocator=0x2b3b5bf855a0, node=0xa0) at memory/unix/apr_pools.c:416
#4  0x00002b3b46d98b8b in socket_bucket_read (a=0x2aaaf0068ea8, str=0x2aaae61048e0, len=0x2aaae61048d8, block=<value optimized out>) at buckets/apr_buckets_socket.c:43
#5  0x00002b3b46d9869b in apr_brigade_split_line (bbOut=0x2aab28028928, bbIn=0x2aab1c077af0, block=APR_BLOCK_READ, maxbytes=8192) at buckets/apr_brigade.c:319
#6  0x00002b3b4628b58a in ap_core_input_filter ()
#7  0x00002b3b4627fc3a in ap_rgetline_core ()
#8  0x00002b3b492fda8b in ap_proxygetline (bb=0x2aab28028928, s=<value optimized out>, n=8192, r=0x0, fold=160, writen=0x2aaae6104b7c) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:1329
#9  0x00002b3b492fdc1f in ap_proxy_http_process_response (p=0x2aab2810f638, r=0x2aab2810f6b8, backend=0x2aab1c0d3478, origin=0x2aab1c0773a8, conf=0x2b3b5c03e250, server_portstr=0x2aaae6108d50 ":3128") at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:1400
#10 0x00002b3b492fffeb in proxy_http_handler (r=0x2aab2810f6b8, worker=<value optimized out>, conf=0x2b3b5c03e250, url=0x2aab281113e0 "/", proxyname=0x0, proxyport=0) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy_http.c:2033
#11 0x00002b3b490eb111 in proxy_run_scheme_handler (r=0x2aab2810f6b8, worker=0x2b3b5c000fb0, conf=0x2b3b5c03e250, url=0x2aab281112f6 "http://174.142.104.57:3128/", proxyhost=0x0, proxyport=<value optimized out>) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:2563
#12 0x00002b3b490ef1c0 in proxy_handler (r=0x2aab2810f6b8) at /usr/src/debug/httpd-2.2.17.1/phorm/modules/proxy/mod_proxy.c:1065
#13 0x00002b3b4628bc9a in ap_run_handler ()
#14 0x00002b3b4628f122 in ap_invoke_handler ()
#15 0x00002b3b46299da8 in ap_process_request ()
#16 0x00002b3b46296eb0 in ?? ()
#17 0x00002b3b462931b2 in ap_run_process_connection ()
#18 0x00002b3b4629ea65 in ?? ()
#19 0x00002b3b4763a73d in start_thread () from /lib64/libpthread.so.0
#20 0x00002b3b47b27d1d in clone () from /lib64/libc.so.6
Comment 1 Deepak 2012-08-29 13:07:19 UTC
Hi,

I am also getting similer cores.
Is there any solution for this?

Thanks,
Deepak
Comment 2 Max Romanov 2012-08-29 14:01:17 UTC
Created attachment 29300 [details]
proposed changes in httpd 2.2.17
Comment 3 Deepak 2012-08-30 10:09:51 UTC
Hello,

Mostly, I am getting 1st and 3rd above type of core dump:

i.e.core in #4 0x0812a58f in apr_brigade_cleanup (data=0xf656e780) at
buckets/apr_brigade.c:44

and core in #4 0x0812acdc in apr_bucket_free (mem=0xe7aaaa28) at
buckets/apr_buckets_alloc.c:191

Is above core dump issue fixed in this patch? Also are there any particular test cases to test this patch/fix?

Thanks in advance.

Regards,
Deepak
Comment 4 Max Romanov 2012-08-30 10:38:05 UTC
No specific test cases. Just free proxy under high load. I've noticed no crashes after patch.
This is a race-condition issue caused by using request pool for buckets send to back-end.
Comment 5 Sean Timmins 2012-09-13 14:05:45 UTC
We are seeing this on our production servers running httpd v2.2.22 and 2.2.21.

Before a recent upgrade from Oracle/Sun SPARC T2 servers (1 processor, 8 core, 64 thread, 1.45Ghz clock speed, 64GB RAM) the rate was tiny with less than 1 crash per day on average.

After replacing these servers with Oracle/Sun SPARC T4 servers (1 processor, 8 core, 64 thread, 2.85GHz clock speed, 128GB RAM) we get far more ranging from 0 to 5 per hour on any of the servers.

I am going to try the attached patch to see if this makes the system more stable.
Comment 6 Jim Jagielski 2012-09-13 15:36:27 UTC
If you JUST use the diff for the actual proxy module (mod_proxy_http.c and proxy_util.c), does the problem go away?

The concern is that all the other changes just don't look right and "hide" the real problem. For example, the use of f->c->bucket_alloc should be correct and the  change to b->bucket_alloc is "safe" but unneeded.
Comment 7 Max Romanov 2012-09-14 10:00:07 UTC
Jim,

If the bucket is going to be inserted to brigade 'b' what is correct to allocate the bucket using other object?
I agree, in some places equal object used, but I think it is _correct_ to use 'b->bucket_alloc', and it is _sometimes safe_ use connection to allocate it.
Comment 8 Joe Orton 2012-09-14 14:02:49 UTC
The reference to "httpd-2.2.17.1/phorm" in the original backtrace implies this is a non-standard, possibly patched httpd.  It would be useful to see actual backtraces from unpatched 2.2.22/2.2.23.

What Jim said, otherwise.  That patch looks like an attempt to fix the problem fixed in r713146 doesn't it?
Comment 9 Sean Timmins 2012-10-31 08:50:35 UTC
Here are two different stack traces from an unpatched httpd v2.2.22 that seems to be exhibiting the same issue as per my previous comment.

I am also running one instance of httpd v2.2.22 with the attached proposed patch applied.

Systems are Oracle T4s running Solaris 10 (Generic_147330-09)

#0  0xff297500 in apr_brigade_cleanup (data=0xdf8828)
    at buckets/apr_brigade.c:44
44              apr_bucket_delete(e);
(gdb) where
#0  0xff297500 in apr_brigade_cleanup (data=0xdf8828)
    at buckets/apr_brigade.c:44
#1  0xff125fb0 in run_cleanups (cref=0xdf49f0) at memory/unix/apr_pools.c:2346
#2  0xff1270a0 in apr_pool_destroy (pool=0xdf49e0)
    at memory/unix/apr_pools.c:809
#3  0xff1272fc in apr_pool_clear (pool=0xdec9c0) at memory/unix/apr_pools.c:764
#4  0x00050158 in worker_thread (thd=0xcc9c0, dummy=0x3) at worker.c:897
#5  0xff131f1c in dummy_worker (opaque=0xcc9c0) at threadproc/unix/thread.c:142
#6  0xfef3a9d0 in _lwp_start () from /lib/libc.so.1
#7  0xfef3a9d0 in _lwp_start () from /lib/libc.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)


#0  0xfef3e3ac in _read () from /lib/libc.so.1
(gdb) where
#0  0xfef3e3ac in _read () from /lib/libc.so.1
#1  0xfef2ca24 in read () from /lib/libc.so.1
#2  0x0005296c in ap_mpm_pod_check (pod=0xacf50) at pod.c:54
#3  0x00050a0c in child_main (child_num_arg=17) at worker.c:1258
#4  0x00050b88 in make_child (s=0x7b800, slot=17) at worker.c:1341
#5  0x000513b8 in ap_mpm_run (_pconf=0xfe710158, plog=0x14, s=0x2693)
    at worker.c:1560
#6  0x0002afec in main (argc=5, argv=0xffbffb6c) at main.c:753
Comment 10 Sean Timmins 2012-10-31 21:21:36 UTC
I have been running the patched httpd v2.2.22 binary for over 24 hours now and there have been no new core dumps on it, while my other three unpatched instances of httpd v2.2.22 have produced a total of 18 core dumps over the same time period (All 4 instances are serving the same site).

I will continue to run with this configuration until tomorrow morning when I will switch the patched instance to a version of httpd 2.2.22 that has been patched with the proposed changes, but excluding the patch to "srclib/apr-util/buckets/apr_brigade.c" as request by Covener. I will then report back after it has been running for a day or so.

If there are any other things I can test please let me know.
Comment 11 Sean Timmins 2012-11-05 00:18:57 UTC
We have now been running the modified patch as described in Comment 10 for about 4 days and have had no core dumps on the patched server at all compared to over 50 cores between the three unpatched server over the same time period.

I'll be rolling out the patched httpd v2.2.22 to the other three servers tomorrow.
Comment 12 Sean Timmins 2012-11-08 17:25:10 UTC
We got another core! The frequencey has dropped dramatically (only 1 in over a week). This is with the partial patch as detailed in Comment 10.

Here is the stack trace.

#0  0xfef3e3ac in _read () from /lib/libc.so.1
(gdb) where
#0  0xfef3e3ac in _read () from /lib/libc.so.1
#1  0xfef2ca24 in read () from /lib/libc.so.1
#2  0x00052950 in ap_mpm_pod_check (pod=0x126240) at pod.c:54
#3  0x000509f0 in child_main (child_num_arg=4) at worker.c:1258
#4  0x00050b6c in make_child (s=0x7b800, slot=4) at worker.c:1341
#5  0x0005139c in ap_mpm_run (_pconf=0xfe8d0158, plog=0x14, s=0x0)
    at worker.c:1560
#6  0x0002afec in main (argc=5, argv=0xffbffa0c) at main.c:753
Comment 13 Jeff Trawick 2012-11-08 21:36:14 UTC
From the (gdb) prompt you need to run

thread apply all backtrace full

and attach that file.

The thread you captured is not doing anything interesting.
Comment 14 Sean Timmins 2012-11-09 11:06:10 UTC
Created attachment 29576 [details]
output from: gdb thread apply all backtrace full

gdb output as requested
Comment 15 Jeff Trawick 2012-11-09 11:31:18 UTC
Here's the crasher in the last backtrace posted.  (I filtered out any threads that were blocked in syscalls.)

Thread 39 (process 2641292    ):
#0  PyObject_Malloc (nbytes=47) at Objects/obmalloc.c:758
        bp = (block *) 0xffffffff <Address 0xffffffff out of bounds>
        pool = (poolp) 0xa72000
        next = (poolp) 0xff3aa270
        size = 5
#1  0xfe6d5f98 in PyString_FromString (str=0xad86d0 "onlinelibrary.wiley.com")
    at Objects/stringobject.c:138
        op = (PyStringObject *) 0x2f
#2  0xfe6c6f14 in PyDict_GetItemString (v=0xad86d0, 
    key=0xad86d0 "onlinelibrary.wiley.com") at Objects/dictobject.c:2284
        kv = (PyObject *) 0xad86d0
        rv = (PyObject *) 0x26b8a0
#3  0xfe68f0fc in get_interpreter (name=0xad86d0 "onlinelibrary.wiley.com")
    at mod_python.c:255
        tstate = (PyThreadState *) 0xd22730
        idata = (interpreterdata *) 0x0
#4  0xfe68f620 in python_cleanup (data=0x7d7798) at mod_python.c:353
No locals.
#5  0xff125fb0 in run_cleanups (cref=0xafd488) at memory/unix/apr_pools.c:2346
        c = (cleanup_t *) 0x613790
#6  0xff1270a0 in apr_pool_destroy (pool=0xafd478)
    at memory/unix/apr_pools.c:809
        active = (apr_memnode_t *) 0x0
        allocator = (apr_allocator_t *) 0xafd478
...

Although this is also on a pool cleanup, it may be completely unrelated.  See if any subsequent crashes have different backtraces (suggestive of there still being a more general issue with pool use) or if they now are consistently in this code (suggestive of this particular code needing to be reviewed carefully).
Comment 16 Max Romanov 2012-11-09 14:22:50 UTC
I'd suggest to apply patch for apr_brigade.c because it fixes potential memory corruption - it is unsafe to put the bucket allocated in one pool to the brigade allocated from another one.
Comment 17 Jeff Trawick 2012-11-09 16:06:45 UTC
(In reply to comment #16)
> I'd suggest to apply patch for apr_brigade.c because it fixes potential
> memory corruption - it is unsafe to put the bucket allocated in one pool to
> the brigade allocated from another one.

As I understand it, the patch was already applied (good), the frequency of core dumps reduced significantly (awesome!), and what remains so far is potentially very different.
Comment 18 Sean Timmins 2012-11-10 18:39:27 UTC
Currently I have 4 servers running with the attached patch applied *apart* from the patch to srclib/apr-util/buckets/apr_brigade.c

I can switch between this httpd build and a build with srclib/apr-util/buckets/apr_brigade.c patched as well as they are both installed on the servers.

Just let me know which you want me to run.

Thanks for your help on this, it's very much appreciated.
Comment 19 Sean Timmins 2012-11-11 23:14:45 UTC
Created attachment 29585 [details]
gdb output from core 2012-11-11

We had another core so here is the output from running the same gdb command as before.

I'm afraid I don't have the requisite knowledge to identify the relevant thread(s)
Comment 20 Sean Timmins 2012-11-11 23:30:32 UTC
One additional piece of information. Althouth I realise rare event can happen, both cores were dumped at exactly 03:00, the first on Nov 5th and the second on Nov 11th.

Looking at the system scheduled jobs, the only one I have found that runs at 03:00 is a root cron syncs the system time with an internal time server.  This is run just once per day. Unfortunately we are not currently logging the amount by which the time changes.
Comment 21 Jeff Trawick 2012-11-11 23:36:00 UTC
(In reply to comment #19)
> Created attachment 29585 [details]
> gdb output from core 2012-11-11
> 
> We had another core so here is the output from running the same gdb command
> as before.
> 
> I'm afraid I don't have the requisite knowledge to identify the relevant
> thread(s)

The only thread I see that could have crashed is

Thread 1 (process 2370996    ):
#0  PyObject_Malloc (nbytes=41) at Objects/obmalloc.c:758
        bp = (block *) 0x144679f ""
        pool = (poolp) 0x1446000
        next = (poolp) 0xfc4aac10
        size = 5
#1  0xfe73626c in PyString_FromStringAndSize (str=0x143ed79 "olbannerright.gif", size=17) at Objects/stringobject.c:83
        op = (PyStringObject *) 0x29
#2  0xfe6f8464 in PySequence_GetSlice (s=0x143ed40, i1=37, i2=54)
    at Objects/abstract.c:1979
        m = (PySequenceMethods *) 0xfe85ef9c
        mp = (PyMappingMethods *) 0xfe85ef9c
#3  0xfe7bfc74 in match_group (self=0x1437cc8, args=0x14439d0)
    at Modules/_sre.c:3228
        result = (PyObject *) 0x800000
        i = 3
        size = -24778856
#4  0xfe727934 in PyCFunction_Call (func=0xd467d8, arg=0x14439d0, kw=0x0)
    at Objects/methodobject.c:116
        meth = (PyCFunction) 0xfe7bfa2c <match_group>
        self = (PyObject *) 0x1437cc8
        size = 1
#5  0xfe7733d0 in PyEval_EvalFrameEx (f=0x108a090, throwflag=5889119)
    at Python/ceval.c:3706
...

As with your previous crash, this crash is in PyObject_Malloc.  Unlike the previous one, it is not in a pool cleanup.

Any idea if this infrequent crash could have been going on "forever"?  Is the crash frequently still much lower than before you applied the patch to this bug?
Comment 22 Sean Timmins 2012-11-11 23:48:45 UTC
Yes it is possible that the crash has been going on for some time. As I mentioned in comment 5 the crashes became far more noticeable after upgrading our hardware from SPARC T2 to SPARC T4 machines.

With the old hardware we were only getting around 1 core per day across all 4 servers and unfortunately there were always more pressing things to deal with. After the hardware upgrade the frequencey of crashes was much higher (multiple cores per day per server) and so it received more attention.

With the patch version of httpd, the crash frequency appears to be quite a bit less than it was even before we updated the hardware (which is great!) but I'll be more sure of this after I've run it for a while longer.
Comment 23 Sean Timmins 2012-11-13 13:18:22 UTC
Created attachment 29596 [details]
gdb output from core 2012-11-13

Another core today, again at exactly 03:00.
Comment 24 Jeff Trawick 2012-11-13 13:42:36 UTC
(In reply to comment #23)
> Created attachment 29596 [details]
> gdb output from core 2012-11-13
> 
> Another core today, again at exactly 03:00.

Here's the backtrace of the crashing thread:

1 * [1] MPM child worker thread (running request handler)
  PyObject_Malloc, conn_alloc, CS_CONTEXT_ct_con_alloc, PyCFunction_Call, PyEval_EvalFrameEx, PyEval_EvalCodeEx, function_call, PyObject_Call, instancemethod_call, PyObject_Call, PyEval_CallObjectWithKeywords, PyInstance_New, PyObject_Call, PyEval_EvalFrameEx, PyEval_EvalCodeEx, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, function_call, PyObject_Call, instancemethod_call, PyObject_Call, PyEval_CallObjectWithKeywords, PyInstance_New, PyObject_Call, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, function_call, PyObject_Call, PyEval_EvalFrameEx, PyEval_EvalCodeEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, function_call, PyObject_Call, instancemethod_call, PyObject_Call, PyObject_CallMethod, python_handler, ap_run_handler, ap_invoke_handler, ap_process_request, ap_process_http_connection, ap_run_process_connection, worker_thread, dummy_worker

For context, here are the previous two in the same format:
Nov 11:
1 * [1] MPM child worker thread (running request handler) 
  PyObject_Malloc, PyString_FromStringAndSize, PySequence_GetSlice, match_group, PyCFunction_Call, PyEval_EvalFrameEx, PyEval_EvalCodeEx, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, PyEval_EvalFrameEx, PyEval_EvalCodeEx, function_call, PyObject_Call, instancemethod_call, PyObject_Call, PyObject_CallMethod, python_handler, ap_run_handler, ap_invoke_handler, ap_process_request, ap_process_http_connection, ap_run_process_connection, worker_thread, dummy_worker
Nov 9:
1 * [39] MPM child worker thread 
  PyObject_Malloc, PyString_FromString, PyDict_GetItemString, get_interpreter, python_cleanup, run_cleanups, apr_pool_destroy, apr_pool_clear, worker_thread, dummy_worker

The 3:00 times suggest that *something* happens consistently, while the occurrences in PyObject_Malloc suggest that the problem is very possibly limited to the Python domain.  (I wouldn't say it is definitely caused by something in the Python space, but it needs to be investigated from that direction.  E.g., is the corruption caused by a memory overlay and is the data recognizable?)

I don't see evidence in the backtraces of any web server-wide activity such as graceful restart.  I guess you've checked the error and access logs for anything interesting around that time?  Potentially you could wire up mod_whatkilledus (http://emptyhammock.com/projects/httpd/diag/quick-start.html) to see if there is something in common about the requests which trigger the issue.

Unfortunately:
. This should be investigated separately from the mod_proxy issue (bug 50335) as there's no reason to think they are related, and continuing here may be confusing to others trying to understand the proxy issue.
. mod_python is officially dead for about 2.5 years, and I don't see any discussion on the mod_python mailing list for over a year, so I don't know where people that might have hints for debugging corruption in Python/mod_python memory management would hang out.
Comment 25 Sean Timmins 2012-11-13 14:35:34 UTC
Ok, lets close this part of the discussion as far as these newer core dumps go. (I'll work on getting rid of mod_python instead) Things are far better for me with the patch as it is.

Just for my information, is there any chance the patch will make it's way into an official release?
Comment 26 Eric Covener 2013-01-10 15:53:27 UTC
bump -- this ticket came up on IRC -- to recap, the httpd patch alone was sufficient in an environment where crashes were pretty regular, but it's all way over my head.    Does anyone grok it enough to declare it's not working around some other issue, or what the sacrifice is?
Comment 27 Alex Bligh 2013-07-10 21:14:06 UTC
I'm not sure this is just a mod_proxy bug (and the fact more than just mod_proxy is patched confirms this). I am seeing this in an entirely different proxy ( https://github.com/abligh/apache-websocket ).

It would be really useful to understand what the non-mod_proxy changes are doing in the patch.
Comment 28 Joe Orton 2013-08-19 18:58:58 UTC
I have not managed to reproduce the segfaults, but I think I now understand the problem expressed in comment 7 with bucket allocator mismatches.

Specifically, I think this is possible:

a) ap_proxy_http_request allocates header brigade in the proxy worker pool "scpool"
b) buckets are inserted into that brigade, allocated from the bucket allocator from "ptrans"
c) <something happens>
d) header brigade is NOT CLEANED UP
e) ptrans allocator gets destroyed, bucket memory are invalidated
f) later... scpool is cleared/destroyed, header brigade gets cleaned up but has a corrupt bucket list -> boom

I confirmed (a) and (b) by hacking ap_pass_brigade() to compare e->list against bb->bucket_alloc for every bucket in the passed-in brigade.  I don't know precisely how to trigger (c) thru (f) but it is consistent with reported symptoms, of a crash in running the brigade pool cleanup.

*IF* this is all correct then a simple fix is just to avoid (a) & (b).  Any other thoughts?

-    header_brigade = apr_brigade_create(p, origin->bucket_alloc);
+    header_brigade = apr_brigade_create(p, bucket_alloc);
Comment 29 Joe Orton 2013-08-19 19:00:03 UTC
Created attachment 30743 [details]
possible fix for PR 50335

Possible fix.
Comment 30 Ruediger Pluem 2013-08-20 08:02:19 UTC
(In reply to Joe Orton from comment #28)
> I have not managed to reproduce the segfaults, but I think I now understand
> the problem expressed in comment 7 with bucket allocator mismatches.
> 
> Specifically, I think this is possible:
> 
> a) ap_proxy_http_request allocates header brigade in the proxy worker pool
> "scpool"
> b) buckets are inserted into that brigade, allocated from the bucket
> allocator from "ptrans"
> c) <something happens>
> d) header brigade is NOT CLEANED UP

Could happen if we return early in ap_proxy_pass_brigade.

> e) ptrans allocator gets destroyed, bucket memory are invalidated
> f) later... scpool is cleared/destroyed, header brigade gets cleaned up but
> has a corrupt bucket list -> boom

Right: scpool and ptrans have completly different lifecycles and scpool could live longer than ptrans.

> 
> I confirmed (a) and (b) by hacking ap_pass_brigade() to compare e->list
> against bb->bucket_alloc for every bucket in the passed-in brigade.  I don't
> know precisely how to trigger (c) thru (f) but it is consistent with
> reported symptoms, of a crash in running the brigade pool cleanup.
> 
> *IF* this is all correct then a simple fix is just to avoid (a) & (b).  Any
> other thoughts?
> 
> -    header_brigade = apr_brigade_create(p, origin->bucket_alloc);
> +    header_brigade = apr_brigade_create(p, bucket_alloc);

This looks like the correct thing to do. Can someone of the original reporters please check?

In addition I propose the following patch to have the brigade cleaned up in any case in ap_proxy_pass_brigade:

Index: proxy_util.c
===================================================================
--- proxy_util.c        (revision 1515728)
+++ proxy_util.c        (working copy)
@@ -3305,6 +3305,7 @@
     if (transferred != -1)
         p_conn->worker->s->transferred += transferred;
     status = ap_pass_brigade(origin->output_filters, bb);
+    apr_brigade_cleanup(bb);
     if (status != APR_SUCCESS) {
         ap_log_rerror(APLOG_MARK, APLOG_ERR, status, r, APLOGNO(01084)
                       "pass request body failed to %pI (%s)",
@@ -3324,7 +3325,6 @@
             return HTTP_BAD_REQUEST;
         }
     }
-    apr_brigade_cleanup(bb);
     return OK;
 }
Comment 31 Joe Orton 2013-08-20 16:33:08 UTC
Hmmm, I am struggling to repro still.

An error return from ap_proxy_http_request() should always trigger destruction of scpool.  So there is still a missing piece to the puzzle.
Comment 32 Max Romanov 2013-08-20 21:07:07 UTC
Unfortunately I'm unable to reproduce the issue again because we have no longer this installation.

About other changes in my patch, I can comment I did it after small code instrumentation to macros which inserts the bucket to brigade - it checks whether bucket is allocated with same or parent pool as brigade and assert otherwise.
Comment 33 Joe Orton 2013-10-22 08:28:47 UTC
I committed my patch in r1534321 - I have a user seeing similar crashes trying this out.
Comment 34 jkaluza 2013-12-10 08:45:38 UTC
The problem is that make_fake_req uses scpool. Fake request generated by this method is passed to ap_get_brigade by following code:

rv = ap_get_brigade(backend->r->input_filters, bb,
                    AP_MODE_READBYTES, mode,
                    conf->io_buffer_size);

"bb" is brigade created with "request" pool. Input_filters later adds new buckets created from "scpool" into brigade created with "request" pool. The problem here is when scpool gets destroyed before request pool. I'm not able to reproduce it now, but I'm pretty sure that's what we see here.

I think the problem is in some "break;" in the code block right after ap_get_brigade, because in this case "bb" brigade is not cleaned up. I think to trigger this, input_filter has to return some strange error and add something to brigade in the same time, so it's not clean, but the mod_proxy presumes it is clean when this return value is returned.

Note that this is also particular thing which is changed in the patch by Max. The patch changes pool used by input_filters.

I'm proposing another patch which cleans "bb" brigade in the end of ap_proxy_http_process_response.
Comment 35 jkaluza 2013-12-10 08:47:06 UTC
Created attachment 31103 [details]
Another possible fix for PR 50335
Comment 36 Max Romanov 2013-12-10 12:34:54 UTC
I have no idea why other people ignores two major changes from the first patch:
1. allocate request to backend from backend connection pool
2. avoid moving of buckets between brigades allocated from the non-relational pools

It is hard to reproduce and analyse such issues. I prefer to perform redundant copying than have once-at-week core file.
Comment 37 Yann Ylavic 2013-12-10 13:06:57 UTC
(In reply to Max Romanov from comment #36)
> I have no idea why other people ignores two major changes from the first
> patch:
> 1. allocate request to backend from backend connection pool

Since the backend's connection can handle multiple requests, they would leak on its pool.
Your patch creates a subpool for the fake request, you have to explicitly destroy (or clear/recycle) it somewhere, and that's more work/cycles (per request).
Comment 38 Max Romanov 2013-12-10 13:35:46 UTC
Agree, this is extra work.
Connection will be closed early or later and destroy all subpools.
Comment 39 jkaluza 2013-12-11 07:30:55 UTC
I have committed my patch as r1550061.
Comment 40 Joe Orton 2013-12-11 11:06:47 UTC
Using a subpool for the fake request sounds like it is not actually fixing the difficult problem here of getting the lifetimes right.

But Max is right that basically everything after that ap_get_brigade() call is potentially dangerous as-is because of dangers with bucket lifetime mismatches.

Comparing trunk with 2.2.x I notice there is a further complication in r1035605

In the EOF case, the proxy now reads HEAP buckets from the backend socket, converts them into TRANSIENT buckets so the bucket *structure* lifetime is correct... and then converts those buckets back into HEAP buckets so the bucket structure the bucket *data* lifetime are correct.  That seems over-complicated.
Comment 41 Ewald Dieterich 2013-12-12 15:05:22 UTC
I get similar segmentation faults with Apache 2.4.6, apr 1.4.8, apr-util 1.5.3 and mod_security 2.7.5 (from Debian unstable).

I configured a simple reverse proxy and enabled mod_security for a location:

<Location />
    SecRuleEngine On
    SecRequestBodyAccess On

    ProxyPass http://backend:8080/
    ProxyPassReverse http://backend:8080/
</Location>

On the backend I run faucet to simulate a request-dropping backend server:

faucet 8080 --out echo ""

Now I send POST requests in parallel by starting this loop on multiple shells, the more the better (data_file is 25k):

while true ; do curl -d @data_file http://frontend/ ; done

Every once in a while I get a segmentation fault.

After applying the following patches (attachments 30743 and 31103) I get no more segmentation faults:

http://svn.apache.org/viewvc?view=revision&revision=1534321
http://svn.apache.org/viewvc?view=revision&revision=1550061
Comment 42 jkaluza 2013-12-16 09:21:39 UTC
Thanks for the info, that looks promising :).
Comment 43 Graham Leggett 2013-12-26 18:28:30 UTC
Backported to v2.4.8.
Comment 44 Alex Frolkin 2014-01-09 09:01:44 UTC
*** Bug 55979 has been marked as a duplicate of this bug. ***
Comment 45 Eric Covener 2014-01-19 19:28:33 UTC
*** Bug 50834 has been marked as a duplicate of this bug. ***