Bug 56299

Summary: mod_proxy_ws_tunnel - AH02447: err/hup on backconn
Product: Apache httpd-2 Reporter: Martin Studer <martin.studer>
Component: mod_proxyAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: normal CC: martin.studer
Priority: P2 Keywords: FixedInTrunk
Version: 2.5-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: read POLLHUPs
Return correct HTTP codes

Description Martin Studer 2014-03-21 16:46:40 UTC
We're currently using Apache 2.4.9 on Win64 which seems to solve the "infinite loop on websocket connection loss" (https://issues.apache.org/bugzilla/show_bug.cgi?id=56145) problem for us. However, what we are seeing now is entries like

[Fri Mar 21 08:00:10.405224 2014] [proxy_wstunnel:notice] [pid 19932:tid 836] [client 172.29.184.113:53972] AH02447: err/hup on backconn
[Fri Mar 21 08:00:42.229688 2014] [proxy_wstunnel:notice] [pid 19932:tid 788] [client 172.29.184.113:53982] AH02447: err/hup on backconn
[Fri Mar 21 08:05:08.604172 2014] [proxy_wstunnel:notice] [pid 19932:tid 840] [client 172.29.184.113:54199] AH02447: err/hup on backconn
[Fri Mar 21 08:05:16.780267 2014] [proxy_wstunnel:notice] [pid 19932:tid 772] [client 172.29.184.113:54210] AH02447: err/hup on backconn
[Fri Mar 21 08:05:21.724793 2014] [proxy_wstunnel:notice] [pid 19932:tid 768] [client 172.29.184.113:54216] AH02447: err/hup on backconn
[Fri Mar 21 08:18:26.592849 2014] [proxy_wstunnel:notice] [pid 19932:tid 788] [client 172.29.184.113:54414] AH02447: err/hup on backconn
[Fri Mar 21 08:18:26.595779 2014] [proxy_wstunnel:notice] [pid 19932:tid 764] [client 172.29.184.113:54413] AH02447: err/hup on backconn
[Fri Mar 21 08:26:53.173822 2014] [proxy_wstunnel:notice] [pid 19932:tid 764] [client 172.29.184.113:54463] AH02447: err/hup on backconn
[Fri Mar 21 08:40:58.011922 2014] [proxy_wstunnel:notice] [pid 19932:tid 840] [client 172.29.184.113:60997] AH02447: err/hup on backconn
[Fri Mar 21 08:40:58.011922 2014] [proxy_wstunnel:notice] [pid 19932:tid 772] [client 172.29.184.113:60998] AH02447: err/hup on backconn

We can't find any indication so far that there would be a problem with the websocket server (Play/Netty). So it looks like this might be an issue with mod_proxy_ws_tunnel?
Comment 1 Eric Covener 2014-03-22 17:56:28 UTC
Weird, the 2.4.9 change here did not introduce this error or increase its severity. I actually noticed that it seemed to be at an unnecesarrily high level, and the corresponding msg for the client side is TRACE1 instead of NOTICE.

Was 2.4.8 just unusable for you so you never saw it?  Just trying to make sure this is simply "is NOTICE too high?" vs. "broke in 2.4.9"
Comment 2 Yann Ylavic 2014-03-22 19:33:38 UTC
Created attachment 31428 [details]
read POLLHUPs

Shouldn't POLLUP be handled the like POLLIN, I mean let the proxy read the socket (see attached 2.4.x patch)?

On some systems, EOF triggers POLLHUP, IMHO it's safe to let apr_socket_recv() detect the error vs EOF.
Comment 3 Eric Covener 2014-03-22 19:42:03 UTC
works here, probably need to tidy up the existing NOTICE msg
Comment 4 Yann Ylavic 2014-03-22 20:40:13 UTC
That would also hide real (network/system) errors (often good to know).
For others proxy modules (but proxy_connect where the same code is used), the level is error for socket failures. TRACE1 is surely enough for EOF, but real errors deverse more IMHO (maybe not ERR because it's just streaming, but at least a warning).
Comment 5 Martin Studer 2014-03-24 07:21:05 UTC
We were using 2.4.7 before and ran into the "infinite loop" issue https://issues.apache.org/bugzilla/show_bug.cgi?id=56145 which led us to upgrade to 2.4.9 (without trying 2.4.8 first).

The behavior we are observing is that websocket connections initially work, however that reopening the connection after a while leads to a "AH02447: err/hup on backconn". This leads to the following error on the client side (Google Chrome):

WebSocket connection to 'wss://...' failed: Connection closed before receiving a handshake response



(In reply to Eric Covener from comment #1)
> Weird, the 2.4.9 change here did not introduce this error or increase its
> severity. I actually noticed that it seemed to be at an unnecesarrily high
> level, and the corresponding msg for the client side is TRACE1 instead of
> NOTICE.
> 
> Was 2.4.8 just unusable for you so you never saw it?  Just trying to make
> sure this is simply "is NOTICE too high?" vs. "broke in 2.4.9"
Comment 6 Martin Studer 2014-03-24 07:25:17 UTC
Forgot to mention that after having received that error, retrying to connect generally seems to work

(In reply to Martin Studer from comment #5)
> We were using 2.4.7 before and ran into the "infinite loop" issue
> https://issues.apache.org/bugzilla/show_bug.cgi?id=56145 which led us to
> upgrade to 2.4.9 (without trying 2.4.8 first).
> 
> The behavior we are observing is that websocket connections initially work,
> however that reopening the connection after a while leads to a "AH02447:
> err/hup on backconn". This leads to the following error on the client side
> (Google Chrome):
> 
> WebSocket connection to 'wss://...' failed: Connection closed before
> receiving a handshake response
> 
> 
> 
> (In reply to Eric Covener from comment #1)
> > Weird, the 2.4.9 change here did not introduce this error or increase its
> > severity. I actually noticed that it seemed to be at an unnecesarrily high
> > level, and the corresponding msg for the client side is TRACE1 instead of
> > NOTICE.
> > 
> > Was 2.4.8 just unusable for you so you never saw it?  Just trying to make
> > sure this is simply "is NOTICE too high?" vs. "broke in 2.4.9"
Comment 7 Edward Lu 2014-03-26 18:36:06 UTC
Created attachment 31444 [details]
Return correct HTTP codes

This patch fixes something I noticed that might be contributing to the problem. It's based on and includes Yann's patch (I couldn't figure out a good way to create a patch to be applied on top of his).

The problem I noticed was that 500 was returned any time any closure of the WebSocket server occured. When reading from the server in ap_proxy_wstunnel_request(), anything except for a SUCCESS is considered an internal error, including a legitimate EOF. 500 is also returned if POLLERR is received when polling the server; this is a little suspect, since we're not the one having the issue, the WebSocket server is. That's not even considering the fact that the flag doing all this is named client_error, and that it's set in the section where we read from the server. It seemed like a good idea to just remove the client_error flag.

However, doing so triggers another problem. Since ap_proxy_wstunnel_request() now returns OK, ap_discard_request_body() is run as part of the connection cleanup. On a normal Websocket close, this is fine, but something causes ap_discard_request_body() to block indefinitely on ap_get_brigade() when the connection is killed abormally (e.g. by SIGKILL on the server). I think it's something to do with either all the filters that get removed, or because the client's still hanging onto the request - I wasn't able to determine conclusively. The request's thread does eventually get killed by mod_reqtimeout.

The client_error flag "solved" this problem, but in an obviously unsatisfactory way. I've removed it and instead set c->keepalive = AP_CONN_CLOSE, which tells the server that it doesn't need to run the additional processing after the handler returns.
Comment 8 Eric Covener 2014-04-13 17:26:57 UTC
Committed Edwards patch, which included Yann's.  Thanks for the report.
Comment 9 Christophe JAILLET 2014-08-18 07:53:29 UTC
Fixed and released in 2.4.10
Comment 10 Matt Hauck 2014-10-30 00:12:55 UTC
I'm seeing some similar behavior in mod_proxy_connect with error AH01026. Looking at the code in mod_proxy_connect, it looks very similar to the unpatched code that was addressed here for mod_proxy_wstunnel. 

Does this patch need to be ported there as well? Or does the underlying issue that was fixed here not apply there? (and thus I have something else wrong I need to figure out)
Comment 11 Yann Ylavic 2014-10-30 09:30:18 UTC
(In reply to Matt Hauck from comment #10)
> I'm seeing some similar behavior in mod_proxy_connect with error AH01026.

Since this PR about mod_proxy_wstunnel is resolved, please create a new one with mod_proxy_connect as component, so that the fix/changelog can reference it.
Comment 12 Matt Hauck 2014-10-30 20:48:30 UTC
(In reply to Yann Ylavic from comment #11)
> Since this PR about mod_proxy_wstunnel is resolved, please create a new one
> with mod_proxy_connect as component, so that the fix/changelog can reference
> it.

Roger. Filed #57168