Bug 58503

Summary: segfault in apr_brigade_cleanup() under high load with slow clients
Product: Apache httpd-2 Reporter: Tabby <s.tabacchiera>
Component: mpm_worker    Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: critical CC: s.tabacchiera, voyageursp, ylavic.dev
Priority: P2    
Version: 2.4.16   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: cleanup mod_proxy_wstunnel's backend brigades
Cleanup mod_proxy_wstunnel's backend brigades

Description Tabby 2015-10-16 20:23:09 UTC
Apache 2.4.17 on RHEL 6.3, 64-bit

SCENARIO: 
Client -> apache/mod_proxy_wstunnel -> tomcat

Under high load (multiple busy threads), with many slow clients (especially mobile users during roaming, who often go offline), we see that most threads are waiting for clients to read() from their side, with netstat showing a large Send-Q. When clients abort or Apache's Timeout expires, httpd core dumps with a segfault.

gdb bt:
#0  0x00007f69ec6abe17 in apr_brigade_cleanup (data=0x7f692646ffb8) at buckets/apr_brigade.c:44
#1  0x00007f69ec265b1e in run_cleanups (pool=0x7f692646e028) at memory/unix/apr_pools.c:2352
#2  apr_pool_destroy (pool=0x7f692646e028) at memory/unix/apr_pools.c:814
#3  0x00007f69ec6abe2c in apr_brigade_cleanup (data=0x7f6926489860) at buckets/apr_brigade.c:44
#4  0x0000000000448dda in ap_core_output_filter (f=0x7f6926489690, new_bb=0x7f6926489860) at core_filters.c:385
#5  0x0000000000463fa3 in ap_process_request_after_handler (r=<value optimized out>) at http_request.c:267
#6  0x00000000004645af in ap_process_request (r=0x7f692646e0a0) at http_request.c:373
#7  0x00000000004608ee in ap_process_http_sync_connection (c=0x7f69264892b8) at http_core.c:210
#8  ap_process_http_connection (c=0x7f69264892b8) at http_core.c:251
#9  0x0000000000457fc0 in ap_run_process_connection (c=0x7f69264892b8) at connection.c:41
#10 0x000000000046b468 in process_socket (thd=0x7f69e6963f28, dummy=<value optimized out>) at worker.c:631
#11 worker_thread (thd=0x7f69e6963f28, dummy=<value optimized out>) at worker.c:990
#12 0x0000003f3a807a51 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003f3a0e893d in clone () from /lib64/libc.so.6

ENVIRONMENT:
RHEL 6.3 64-bit on VMware
Apache 2.4.17 worker MPM, with apr-1.5.2, apr-util-1.5.4, openssl 1.0.1p.

The same issue occurs with 2.4.16, 2.2.31 and 2.2.29, with and without SSL.
The same issue occurs with the event MPM.
The issue does NOT occur with the prefork MPM.

It looks like a thread-synchronization issue.

HOW TO REPRODUCE:
Just start a number of clients that, after the initial WebSocket request, read very slowly. Let the Send-Q fill up to ~128 KB, with high thread-per-child concurrency. Then let all clients abort simultaneously, or let the Timeout expire. The Apache child exits with signal 11 (segfault) and dumps core.
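
To make this concrete, here is a minimal sketch of such a slow client (purely illustrative; the host proxy.example.com, port 80 and path /ws are placeholders, not part of the actual setup): it connects, sends a WebSocket upgrade request, then simply stops reading so the server's Send-Q fills up. Running many instances concurrently and then killing them all at once (or waiting for the server-side Timeout) exercises the abort path described above.

/* slowclient.c -- hypothetical reproducer sketch (not from the original
 * report): connect, send a WebSocket upgrade request, then stop reading
 * so the server-side Send-Q fills. */
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints = { 0 }, *res;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("proxy.example.com", "80", &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    const char *req =
        "GET /ws HTTP/1.1\r\n"
        "Host: proxy.example.com\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
        "Sec-WebSocket-Version: 13\r\n\r\n";
    write(fd, req, strlen(req));

    /* Read (almost) nothing: the backend keeps pushing data through
     * mod_proxy_wstunnel while the socket buffers and the server's
     * Send-Q fill up. */
    sleep(600);

    /* Exit without a clean close; with many such clients aborting at
     * once, the segfault described above is triggered. */
    return 0;
}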
Comment 1 Tabby 2015-10-16 20:29:30 UTC
*** Bug 58502 has been marked as a duplicate of this bug. ***
Comment 2 Yann Ylavic 2015-10-16 21:21:20 UTC
Created attachment 33184 [details]
cleanup mod_proxy_wstunnel's backend brigades

Can you please try this patch?
Comment 3 Yann Ylavic 2015-10-16 21:31:04 UTC
Created attachment 33185 [details]
Cleanup mod_proxy_wstunnel's backend brigades

Also cleans up EPIPE case.
Comment 4 Tabby 2015-10-17 12:49:12 UTC
Hi and thanks for your quick feedback, Yann!

I applied your patch and ran some quick load tests, and everything seems to work smoothly.

Next, I'll try it in the real scenario and let you know.
Fingers crossed, for now.

ST
Comment 5 Tabby 2015-10-19 09:44:04 UTC
After 2 days, no more segfaults.
Still testing heavily, but I'm pretty confident the issue is gone.

ST
Comment 6 Ruediger Pluem 2015-10-19 20:00:17 UTC
While it is nice that the patch works for you, I suspect that it hides something else. Can you please do a 

print e
print *e

in gdb in the thread that causes the segfault?
Comment 7 Ruediger Pluem 2015-10-19 20:04:30 UTC
Actually I think mod_proxy_wstunnel falls into the same pitfall mod_proxy_http was in, and it needs to do the same / something similar to what mod_proxy_http.c does with

proxy_buckets_lifetime_transform

I just have no further time to investigate right now.
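
For context, the idea behind such a lifetime transform is to copy data that was read on the backend connection's pool/allocator into buckets owned by the frontend request, so the data survives teardown of the backend connection. A simplified, hypothetical sketch of that idea (not the actual proxy_buckets_lifetime_transform() code from mod_proxy_http.c):

#include "httpd.h"
#include "apr_buckets.h"

/* Hypothetical helper: copy every bucket of 'from' (allocated on the
 * backend connection) into 'to' using the frontend connection's
 * allocator, so the copies outlive the backend pool. */
static apr_status_t copy_brigade_for_request(request_rec *r,
                                             apr_bucket_brigade *from,
                                             apr_bucket_brigade *to)
{
    apr_bucket_alloc_t *list = r->connection->bucket_alloc;
    apr_bucket *e;

    for (e = APR_BRIGADE_FIRST(from);
         e != APR_BRIGADE_SENTINEL(from);
         e = APR_BUCKET_NEXT(e)) {
        if (APR_BUCKET_IS_FLUSH(e)) {
            APR_BRIGADE_INSERT_TAIL(to, apr_bucket_flush_create(list));
        }
        else if (APR_BUCKET_IS_EOS(e)) {
            APR_BRIGADE_INSERT_TAIL(to, apr_bucket_eos_create(list));
        }
        else {
            const char *data;
            apr_size_t len;
            apr_status_t rv = apr_bucket_read(e, &data, &len, APR_BLOCK_READ);
            if (rv != APR_SUCCESS) {
                return rv;
            }
            /* A NULL free function makes the heap bucket take its own
             * copy of the data. */
            APR_BRIGADE_INSERT_TAIL(to,
                apr_bucket_heap_create(data, len, NULL, list));
        }
    }
    return APR_SUCCESS;
}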
Comment 8 Yann Ylavic 2015-10-19 20:18:59 UTC
(In reply to Ruediger Pluem from comment #7)
> Actually I think mod_proxy_wstunnel falls into the same pitfall
> mod_proxy_http was in, and it needs to do the same / something similar to
> what mod_proxy_http.c does with
> 
> proxy_buckets_lifetime_transform

Yes I agree; I just proposed a quick patch to determine whether it came from mod_proxy_wstunnel or from some failure in the core deferred/pipelined write logic...

We either need to use lifetime_transform like in proxy_http, or, as I was thinking, modify all the input filters that create their buckets on f->r/c's pool/bucket_alloc so that they use the given bb->p and bb->bucket_alloc instead.

By doing the latter, we wouldn't have to transform the lifetime; it would be determined by the caller...
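
A hypothetical fragment illustrating that second option for a generic input filter (the filter name and buffer handling are illustrative only, not an existing httpd filter):

#include "httpd.h"
#include "util_filter.h"
#include "apr_buckets.h"

static apr_status_t example_input_filter(ap_filter_t *f,
                                         apr_bucket_brigade *bb,
                                         ap_input_mode_t mode,
                                         apr_read_type_e block,
                                         apr_off_t readbytes)
{
    char buf[8192] = { 0 };
    apr_size_t len = sizeof(buf);
    apr_bucket *e;
    (void)mode; (void)block; (void)readbytes; /* unused in this sketch */

    /* ... fill buf/len from the lower layer (e.g. ap_get_brigade on
     * f->next) ... */

    /* Before: bucket tied to the connection's allocator/lifetime:
     *     e = apr_bucket_heap_create(buf, len, NULL, f->c->bucket_alloc);
     * After: bucket tied to the caller's brigade, so its lifetime is
     * decided by whoever passed bb in. */
    e = apr_bucket_heap_create(buf, len, NULL, bb->bucket_alloc);
    APR_BRIGADE_INSERT_TAIL(bb, e);

    return APR_SUCCESS;
}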
Comment 9 Yann Ylavic 2015-10-19 20:32:58 UTC
Discussion moved to the dev@ mailing list; I will report back here afterward.
Comment 10 Tabby 2015-10-20 10:23:54 UTC
(In reply to Ruediger Pluem from comment #6)
> While it is nice that the patch works for you I suspect that it hides
> something else. Can you please do a 
> 
> print e
> print *e
> 
> in gdb in the thread that causes the segfault?

Sorry, I've deleted the old core.

Consider this new one, with ssl enabled:
#0  0x00007eff54952e17 in apr_brigade_cleanup (data=0x7eff2c01c2c8) at buckets/apr_brigade.c:44
#1  0x00007eff5450caee in run_cleanups (pool=0x7eff2c01a2b8) at memory/unix/apr_pools.c:2352
#2  apr_pool_destroy (pool=0x7eff2c01a2b8) at memory/unix/apr_pools.c:814
#3  0x00007eff54952e2c in apr_brigade_cleanup (data=0x7eff2c0062a8) at buckets/apr_brigade.c:44
#4  0x00007eff5165db0e in ssl_io_filter_output (f=0x7eff480567f8, bb=0x7eff2c0062a8) at ssl_engine_io.c:1612
#5  0x00007eff5165e125 in ssl_io_filter_coalesce (f=0x7eff480567d0, bb=0x7eff2c0062a8) at ssl_engine_io.c:1599
#6  0x0000000000463fa3 in ap_process_request_after_handler (r=<value optimized out>) at http_request.c:267
#7  0x00000000004645af in ap_process_request (r=0x7eff2c01a330) at http_request.c:373
#8  0x00000000004608ee in ap_process_http_sync_connection (c=0x7eff48056388) at http_core.c:210
#9  ap_process_http_connection (c=0x7eff48056388) at http_core.c:251
#10 0x0000000000457fc0 in ap_run_process_connection (c=0x7eff48056388) at connection.c:41
#11 0x000000000046b468 in process_socket (thd=0xfa9928, dummy=<value optimized out>) at worker.c:631
#12 worker_thread (thd=0xfa9928, dummy=<value optimized out>) at worker.c:990
#13 0x0000003f3a807a51 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003f3a0e893d in clone () from /lib64/libc.so.6

(gdb) p e
$1 = (apr_bucket *) 0x7efee4018348

(gdb) p *e
$2 = {link = {next = 0x2c6c6c756e3a226e, prev = 0x4974616d726f6622}, type = 0x4f444e49223a2264, length = 3180190329187026254, start = 6229711307259471394, 
  data = 0x2d223a2265756c61, free = 0x7b2c7d2235322e31, list = 0x6e6f697372657622}
Comment 11 Ruediger Pluem 2015-10-20 11:59:23 UTC
(In reply to Tabby from comment #10)
> (In reply to Ruediger Pluem from comment #6)
> > While it is nice that the patch works for you I suspect that it hides
> > something else. Can you please do a 
> > 
> > print e
> > print *e
> > 
> > in gdb in the thread that causes the segfault?
> 
> Sorry, I've deleted the old core.
> 

> 
> (gdb) p e
> $1 = (apr_bucket *) 0x7efee4018348
> 
> (gdb) p *e
> $2 = {link = {next = 0x2c6c6c756e3a226e, prev = 0x4974616d726f6622}, type =
> 0x4f444e49223a2264, length = 3180190329187026254, start =
> 6229711307259471394, 
>   data = 0x2d223a2265756c61, free = 0x7b2c7d2235322e31, list =
> 0x6e6f697372657622}

Thanks. What does

print *(e->type)
x/10x 0x2c6c6c756e3a226e
x/10x 0x4974616d726f6622
x/10x 0x7b2c7d2235322e31
dump_brigade b
dump_bucket e

deliver?
Keep in mind that you need to have the macros from .gdbinit (top level of the source directory) loaded for the dump_* commands to work.
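For example, with the httpd source tree unpacked at a hypothetical path:

(gdb) source /path/to/httpd-2.4.x/.gdbinit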
Comment 12 Tabby 2015-10-20 12:16:42 UTC
(gdb) p *(e->type)
Cannot access memory at address 0x4f444e49223a2264

(gdb) x/10x 0x2c6c6c756e3a226e
0x2c6c6c756e3a226e:     Cannot access memory at address 0x2c6c6c756e3a226e

(gdb) x/10x 0x4974616d726f6622
0x4974616d726f6622:     Cannot access memory at address 0x4974616d726f6622

(gdb) x/10x 0x7b2c7d2235322e31
0x7b2c7d2235322e31:     Cannot access memory at address 0x7b2c7d2235322e31

(gdb) dump_brigade b
dump of brigade 0x7eff2c01c2c8
   | type     (address)    | length | data addr  | contents               | rc
--------------------------------------------------------------------------------
 0 | Cannot access memory at address 0x4f444e49223a2264

(gdb) dump_bucket e
 bucket=Cannot access memory at address 0x4f444e49223a2264
Comment 13 Tabby 2015-11-30 09:50:35 UTC
After 1 month I can confirm the issue is gone.
Will this patch be included in the upcoming release?

Thanks
ST
Comment 14 Tabby 2016-03-16 13:55:36 UTC
Hi there,
any news about this patch? It was not included in 2.4.18, so I'm wondering if/when it will ever be.

Thanks!
ST
Comment 15 Yann Ylavic 2016-03-16 16:53:18 UTC
Thanks ST for this report and follow-up; a different fix was backported to the upcoming 2.4.19 in r1732266, so I'm closing here.

If you manage to test the final fix in your environment and have any issues with it, please reopen.
Comment 16 Yann Ylavic 2016-03-16 16:59:51 UTC
(In reply to Yann Ylavic from comment #15)
> Thanks ST for this report and follow-up; a different fix was backported to
> the upcoming 2.4.19 in r1732266, so I'm closing here.
> 
> If you manage to test the final fix in your environment and have any issues
> with it, please reopen.

Here is the link to the patch (all in one) that was backported (and hence could be applied to 2.4.18): http://home.apache.org/~rpluem/patches/proxy_correctlifetime.diff