Bug 56145 - mod_proxy_ws_tunnel - Infinite loop on websocket connection loss
Summary: mod_proxy_ws_tunnel - Infinite loop on websocket connection loss
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: Other Modules (show other bugs)
Version: 2.5-HEAD
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-17 11:07 UTC by Jo
Modified: 2014-03-21 09:19 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jo 2014-02-17 11:07:46 UTC
Child process (mod_proxy_ws_tunnel worker) go in infinite loop / 100% CPU core usage when browser websocket connection is baldly closed.

mod_proxy_ws_tunnel is configured to proxy websockets:

browser <----conn_client----> apache/mod_proxy_ws_tunnel <----conn_server----> app

If browser or app close properly the connection, there is no problem (worker stop).

If app is killed or conn_server is lost, mod_proxy_ws_tunnel worker stop as expected.

If browser is killed or conn_client is lost, mod_proxy_ws_tunnel worker enter in infinite loop, 100% CPU core usage, and gigabytes log generation ([proxy_wstunnel:debug] AH02445: woke from poll(), i=1).


mod_proxy_wstunnel.c
Method proxy_wstunnel_pump (ap_proxy_wstunnel_request for 2.4.x)


    while(1) { 
        if ((rv = apr_pollset_poll(pollset, timeout, &pollcnt, &signalled))
                != APR_SUCCESS) {
            if (APR_STATUS_IS_EINTR(rv)) {
                continue;
            }
            else if (APR_STATUS_IS_TIMEUP(rv)) { 
                ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r, APLOGNO(02542) "Attempting to go asynch");
                return SUSPENDED;
            }
            else { 
                ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, r, APLOGNO(02444) "error apr_poll()");
                return HTTP_INTERNAL_SERVER_ERROR;
            }
        }

        ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r, APLOGNO(02445)
                "woke from poll(), i=%d", pollcnt);

        for (pi = 0; pi < pollcnt; pi++) {
            const apr_pollfd_t *cur = &signalled[pi];

            if (cur->desc.s == sock) {
                pollevent = cur->rtnevents;
                if (pollevent & APR_POLLIN) {
                    ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r, APLOGNO(02446)
                            "sock was readable");
                    rv = proxy_wstunnel_transfer(r, backconn, c, bb, "sock");
                }
                else if ((pollevent & APR_POLLERR)
                        || (pollevent & APR_POLLHUP)) {
                    rv = APR_EPIPE;
                    ap_log_rerror(APLOG_MARK, APLOG_NOTICE, 0, r, APLOGNO(02447)
                            "err/hup on backconn");
                }
                if (rv != APR_SUCCESS)
                    client_error = 1;
            }
            else if (cur->desc.s == client_socket) {
                pollevent = cur->rtnevents;
                if (pollevent & APR_POLLIN) {
                    ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r, APLOGNO(02448)
                            "client was readable");
                    rv = proxy_wstunnel_transfer(r, c, backconn, bb, "client");
                }
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
There is no "else if ((pollevent & APR_POLLERR) || (pollevent & APR_POLLHUP))"
Shouldn't be there one?
Copied-pasted above "else if", compiled, tested and worked fine.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
            }
            else {
                rv = APR_EBADF;
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, APLOGNO(02449)
                        "unknown socket in pollset");
            }

        }
        if (rv != APR_SUCCESS) {
            break;
        }

    }
Comment 1 Jo 2014-02-17 16:38:04 UTC
I've detected the issue in 2.4.7, then tested and noticed the same issue in 2.4.x (2.4.8) and 2.x (2.5).
Comment 2 Eric Covener 2014-02-18 14:10:40 UTC
(In reply to Jo from comment #1)
> I've detected the issue in 2.4.7, then tested and noticed the same issue in
> 2.4.x (2.4.8) and 2.x (2.5).

Thanks for the report/analysis/patch.  In my system I could not easily trigger the symptom, even when I killed the client it became readable and was handled by the normal read.

Can you re-test this patch since your env hits it more easily?

http://people.apache.org/~covener/patches/trunk-ws-disconnect.diff
Comment 3 Jo 2014-02-18 16:50:20 UTC
(In reply to Eric Covener from comment #2)
> 
> Thanks for the report/analysis/patch.  In my system I could not easily
> trigger the symptom, even when I killed the client it became readable and
> was handled by the normal read.
> 
> Can you re-test this patch since your env hits it more easily?
> 
> http://people.apache.org/~covener/patches/trunk-ws-disconnect.diff

I'd to replace APLOGNO() by APLOGNO(02453) to compile it.

I've successfully tested it on 2.4.8-dev and 2.5-dev. Thanks.

On app killed, it logs "AH02447: err/hup on backconn" as before.
On browser killed, it logs "AH02447: err/hup on client conn" as expected now.

I never had "unknown event on backconn" nor "unknown event on client conn", but I just made a few tests.
Comment 4 Martin Studer 2014-03-21 08:07:12 UTC
We're currently using Apache 2.4.9 on Win64 which seems to solve the "infinite loop on websocket connection loss" problem for us. However, what we are seeing now is entries like

[Fri Mar 21 08:00:10.405224 2014] [proxy_wstunnel:notice] [pid 19932:tid 836] [client 172.29.184.113:53972] AH02447: err/hup on backconn
[Fri Mar 21 08:00:42.229688 2014] [proxy_wstunnel:notice] [pid 19932:tid 788] [client 172.29.184.113:53982] AH02447: err/hup on backconn
[Fri Mar 21 08:05:08.604172 2014] [proxy_wstunnel:notice] [pid 19932:tid 840] [client 172.29.184.113:54199] AH02447: err/hup on backconn
[Fri Mar 21 08:05:16.780267 2014] [proxy_wstunnel:notice] [pid 19932:tid 772] [client 172.29.184.113:54210] AH02447: err/hup on backconn
[Fri Mar 21 08:05:21.724793 2014] [proxy_wstunnel:notice] [pid 19932:tid 768] [client 172.29.184.113:54216] AH02447: err/hup on backconn
[Fri Mar 21 08:18:26.592849 2014] [proxy_wstunnel:notice] [pid 19932:tid 788] [client 172.29.184.113:54414] AH02447: err/hup on backconn
[Fri Mar 21 08:18:26.595779 2014] [proxy_wstunnel:notice] [pid 19932:tid 764] [client 172.29.184.113:54413] AH02447: err/hup on backconn
[Fri Mar 21 08:26:53.173822 2014] [proxy_wstunnel:notice] [pid 19932:tid 764] [client 172.29.184.113:54463] AH02447: err/hup on backconn
[Fri Mar 21 08:40:58.011922 2014] [proxy_wstunnel:notice] [pid 19932:tid 840] [client 172.29.184.113:60997] AH02447: err/hup on backconn
[Fri Mar 21 08:40:58.011922 2014] [proxy_wstunnel:notice] [pid 19932:tid 772] [client 172.29.184.113:60998] AH02447: err/hup on backconn

We can't find any indication so far that there would be a problem with the websocket server (Play/Netty). So it looks like this might be an issue with mod_proxy_ws_tunnel?
Comment 5 Yann Ylavic 2014-03-21 09:19:37 UTC
(In reply to Martin Studer from comment #4)
> We're currently using Apache 2.4.9 on Win64 which seems to solve the
> "infinite loop on websocket connection loss" problem for us. However, what
> we are seeing now is entries like

Please open a new bug report so that it can be referenced independently.