Bug 48388 - mod proxy disabling workers after a single error
Summary: mod proxy disabling workers after a single error
Status: RESOLVED WONTFIX
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy (show other bugs)
Version: 2.2.14
Hardware: PC Linux
: P2 major with 1 vote (vote)
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2009-12-14 12:28 UTC by Cristian
Modified: 2015-02-10 13:45 UTC (History)
2 users (show)



Attachments
Error log filtered (36.16 KB, text/x-log)
2009-12-14 12:28 UTC, Cristian
Details
Don't put normal (non balancer-member) workers in error (trunk/2.4.x) (3.10 KB, patch)
2015-01-05 15:30 UTC, Yann Ylavic
Details | Diff
Don't put normal (non balancer-member) workers in error (2.2.x) (3.36 KB, patch)
2015-01-05 15:33 UTC, Yann Ylavic
Details | Diff
Don't put normal (non balancer-member) workers in error (2.2.x) (4.28 KB, patch)
2015-01-05 15:44 UTC, Yann Ylavic
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Cristian 2009-12-14 12:28:26 UTC
Created attachment 24706 [details]
Error log filtered

We are experiencing intermittent "connect timeout" errors. The remote
service is really "ok", but it still results in the worker being
disabled.

Error Log:(more in attach)

[error] ap_proxy_connect_backend disabling worker for (192.168.5.17)


We have 1 balancer + 10 apache's workers

My config:

<VirtualHost *>
 SetEnv force-proxy-request-1.0 1
 SetEnv proxy-nokeepalive 1

 ProxyRequests Off
 <Proxy *>
   Order deny,allow
   Allow from all
 </Proxy>

ProxyPass /balancer-manager !
ProxyPass / balancer://mycluster/

 ProxyPassReverse / http://192.168.5.11/
 ProxyPassReverse / http://192.168.5.12/
 ProxyPassReverse / http://192.168.5.14/
 ProxyPassReverse / http://192.168.5.15/
 ProxyPassReverse / http://192.168.5.16/
 ProxyPassReverse / http://192.168.5.17/
 ProxyPassReverse / http://192.168.5.18/
 ProxyPassReverse / http://192.168.5.19/
 ProxyPassReverse / http://192.168.5.110/
 ProxyPassReverse / http://192.168.5.111/

<Proxy balancer://mycluster>

   BalancerMember http://192.168.5.11 route=w1 loadfactor=39
   BalancerMember http://192.168.5.12 route=w2 loadfactor=39
   BalancerMember http://192.168.5.14 route=w4 loadfactor=39
   BalancerMember http://192.168.5.15 route=w5 loadfactor=55
   BalancerMember http://192.168.5.16 route=w6 loadfactor=55
   BalancerMember http://192.168.5.17 route=w7 loadfactor=55
   BalancerMember http://192.168.5.18 route=w8 loadfactor=55
   BalancerMember http://192.168.5.19 route=w9 loadfactor=75
   BalancerMember http://192.168.5.110 route=w10 loadfactor=50
   BalancerMember http://192.168.5.111 route=w11 loadfactor=80

 </Proxy>

</VirtualHost>

All servers are in datacenter with 1gb cisco switch and 10/100 internet bandwidth.
Comment 1 Daniel Ruggeri 2010-03-29 18:35:45 UTC
Cristian;
   Have you had the opportunity to set LogLevel to debug for additional information and confirmed that all logging is making it to the error_log in question? Given your configuration, I think the error is happening as a request is being processed. Do you have any way to duplicate this problem or perhaps confirm that the backend is not having any sort of problems?

The relevant code in question is in modules/proxy/proxy_util.c
    /*
     * Put the entire worker to error state if
     * the PROXY_WORKER_IGNORE_ERRORS flag is not set.
     * Altrough some connections may be alive
     * no further connections to the worker could be made
     */
    if (!connected && PROXY_WORKER_IS_USABLE(worker) &&
        !(worker->s->status & PROXY_WORKER_IGNORE_ERRORS)) {
        worker->s->status |= PROXY_WORKER_IN_ERROR;
        worker->s->error_time = apr_time_now();
        ap_log_error(APLOG_MARK, APLOG_ERR, 0, s,
            "ap_proxy_connect_backend disabling worker for (%s)",
            worker->hostname);
    }

This is after a loop that will iterate through each backend address returned by DNS and attempt connection. I ask if this is all of the log because there should be at least one more error message that accompanies this one.
Comment 2 Alex Tavcar 2010-08-30 18:00:38 UTC
After a few years since early apache httpd 2.2.x releases I give it another try to migrate our apache httpd proxy server from 2.0.x versions to 2.2.x. Still there is an issue with unresponsive backend servers with 503 error on apache httpd proxy. Both versions have practically the same configuration, but with 2.0.x version we do not experience any user experience errors, while with 2.2.x versions users get 503 errors and timeouts for simple webpages.


Old apache httpd 2.0.63 proxy error_log: 

[Fri Aug 20 10:11:42 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 11:47:14 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 17:25:16 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 18:07:02 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 19:04:31 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sat Aug 21 06:49:18 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sat Aug 21 17:06:21 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sat Aug 21 18:34:41 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sun Aug 22 11:02:49 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Mon Aug 23 08:39:00 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Tue Aug 24 16:34:32 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 07:36:21 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed


When we switch to apache httpd 2.2.16 proxy we get immediate bad user experience hit with http 503 errors and timeouts on proxy server, while 2.2.x backend servers have same load:

[Wed Aug 25 07:36:21 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 08:28:22 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 08:28:22 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:05:19 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:05:19 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:10:03 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:10:03 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:18:32 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:18:32 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:51:51 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:51:51 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:53:38 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:53:38 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 10:00:12 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 10:00:12 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 10:04:21 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 10:04:21 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 10:06:06 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 10:06:06 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
.....
[Fri Aug 27 09:21:53 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:22:39 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:35:38 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:35:38 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:36:22 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:36:22 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:36:36 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:37:33 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:37:33 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:41:46 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:41:46 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:42:05 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:42:05 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:42:08 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:42:08 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:42:32 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:42:32 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)

Switch back to apache httpd 2.0.63 proxy and we immediate get normal user experience, due to
error in error_log:

[Fri Aug 27 09:43:07 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:51:34 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:59:50 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:00:35 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:00:35 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:20:59 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:21:00 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:34:03 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:37:59 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:38:15 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
Comment 3 Alex Tavcar 2010-08-30 18:01:56 UTC
After a few years since early apache httpd 2.2.x releases I give it another try to migrate our apache httpd proxy server from 2.0.x versions to 2.2.x. Still there is an issue with unresponsive backend servers with 503 error on apache httpd proxy. Both versions have practically the same configuration, but with 2.0.x version we do not experience any user experience errors, while with 2.2.x versions users get 503 errors and timeouts for simple webpages.


Old apache httpd 2.0.63 proxy error_log: 

[Fri Aug 20 10:11:42 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 11:47:14 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 17:25:16 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 18:07:02 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 20 19:04:31 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sat Aug 21 06:49:18 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sat Aug 21 17:06:21 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sat Aug 21 18:34:41 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Sun Aug 22 11:02:49 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Mon Aug 23 08:39:00 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Tue Aug 24 16:34:32 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 07:36:21 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed


When we switch to apache httpd 2.2.16 proxy we get immediate bad user experience hit with http 503 errors and timeouts on proxy server, while 2.2.x backend servers have same load:

[Wed Aug 25 07:36:21 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 08:28:22 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 08:28:22 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:05:19 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:05:19 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:10:03 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:10:03 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:18:32 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:18:32 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:51:51 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:51:51 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 09:53:38 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 09:53:38 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 10:00:12 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 10:00:12 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 10:04:21 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 10:04:21 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Wed Aug 25 10:06:06 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Wed Aug 25 10:06:06 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
.....
[Fri Aug 27 09:21:53 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:22:39 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:35:38 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:35:38 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:36:22 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:36:22 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:36:36 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:37:33 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:37:33 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:41:46 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:41:46 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:42:05 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:42:05 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:42:08 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:42:08 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)
[Fri Aug 27 09:42:32 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:42:32 2010] [error] ap_proxy_connect_backend disabling worker for (www.domain.com)

Switch back to apache httpd 2.0.63 proxy and we immediate get normal user experience, due to
error in error_log:

[Fri Aug 27 09:43:07 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:51:34 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 09:59:50 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:00:35 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:00:35 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:20:59 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:21:00 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:34:03 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:37:59 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
[Fri Aug 27 10:38:15 2010] [error] (110)Connection timed out: proxy: HTTP: attempt to connect to 192.168.102.2:80 (www.domain.com) failed
Comment 4 siddhesh 2011-08-11 11:50:03 UTC
Dear Friends,
Did any one of you get solution for this problem as I am geting the same errors in my log file .
 errors are 
[client 2.] proxy: Error reading from remote server returned by /servlet/traveler/Microsoft-Server-ActiveSync
[Fri Jul 15 09:08:54 2011] [error] [client 10.10.10.10] (70007)The timeout specified has expired: proxy: error reading status line from remote server 10.20.20.20
[Fri Jul 15 09:08:54 2011] [error] [client 10.10.10.10] proxy: Error reading from remote server returned by /servlet/traveler/Microsoft-Server-ActiveSync
[Fri Jul 15 09:09:02 2011] [error] (110)Connection timed out: proxy: HTTPS: attempt to connect to 10.20.20.20:443 (10.10.10.10) failed
[Fri Jul 15 09:09:02 2011] [error] ap_proxy_connect_backend disabling worker for (10.20.20.20)

Siddhesh Samant
linux879@gmail.com
Comment 5 Yin 2011-08-18 11:58:17 UTC
Hi,

At my place we encountered similar issues with 503. 

I had httpd 2.2.3-43 (Red Hat compile) on RHEL 5.5 running as reverse proxy behind a CISCO ACE loadbalancer.

The same proxy was proxying traffic to Apache httpd and MS IIS (version 6 I think).  The 503 appears to cause by tcp_timestamp NOT being set from the  CISCO ACE loadbalancer.  

So we applied the change as recommended in https://access.redhat.com/kb/docs/DOC-51227 and the 503 messages stop appearing in our staging network.

Will update this if the problem is resolved in the production network.

Yin
Comment 6 Yin 2011-09-24 13:41:56 UTC
Sorry folks,

It looks like the problem was not resolved by the previous post by me.  The problem went away when the Ace was taken out of the network setup.  The network technicians are now looking why the Ace is not working as it should.

Yin
Comment 7 siddhesh 2011-10-13 09:37:09 UTC
(In reply to comment #5)
> Hi,
> At my place we encountered similar issues with 503. 
> I had httpd 2.2.3-43 (Red Hat compile) on RHEL 5.5 running as reverse proxy
> behind a CISCO ACE loadbalancer.
> The same proxy was proxying traffic to Apache httpd and MS IIS (version 6 I
> think).  The 503 appears to cause by tcp_timestamp NOT being set from the 
> CISCO ACE loadbalancer.  
> So we applied the change as recommended in
> https://access.redhat.com/kb/docs/DOC-51227 and the 503 messages stop appearing
> in our staging network.
> Will update this if the problem is resolved in the production network.
> Yin

Mr Yin Can you please share the webpage https://access.redhat.com/kb/docs/DOC-51227  with us as we have not registered to this page .

Thankyou
Siddhesh
Comment 8 metatechbe 2015-01-05 12:59:42 UTC
Hi,

We started encountering the same error after migrating from Solaris 10 to Solaris 11.1, when the Apache server tries to contact a backend on another machine.  

[Mon Jan 05 10:54:31.905841 2015] [proxy:error] [pid 10503:tid 4] (146)Connection refused: AH00957: HTTP: attempt to connect to 1.1.1.1:8000 (mymachine.mycompany.com) failed
[Mon Jan 05 10:54:31.905889 2015] [proxy:error] [pid 10503:tid 4] AH00959: ap_proxy_connect_backend disabling worker for (mymachine.mycompany.com) for 60s
[Mon Jan 05 10:54:31.905909 2015] [proxy_http:error] [pid 10503:tid 4] [client 1.1.1.2:43413] AH01114: HTTP: failed to make connection to backend: mymachine.mycompany.com

This happens when there is a sudden burst of HTTP connection, typically in peak hours.  The burst can be very short, for instance 10-20 requests during a few seconds.

In a TCP dump, we see that the local (outgoing) TCP port for the connections gets reused too quickly, sometimes after a few seconds.  

When the Apache server and the back-end are on the same machine, this error never occurs, probably because a TCP port used by a connection in "TIME_WAIT" is never assigned by the TCP/IP stack as local port.

This is most probably a Solaris bug, but for those searching Google for this error will hit this bug description.

A potential workaround is to use mod_cluster.

Regards.
Comment 9 Yann Ylavic 2015-01-05 14:29:37 UTC
Alternatively you could use the parameter status=+I either with ProxyPass or ProxySet, so to switch off "disable on error" for the defined worker.

I agree that this should be the default for non-balancer-member workers, though.
Comment 10 Yann Ylavic 2015-01-05 15:30:47 UTC
Created attachment 32341 [details]
Don't put normal (non balancer-member) workers in error (trunk/2.4.x)

This patch should prevent httpd from putting workers which are *not* load-balancer members in error state, since we don't want the retry timeout to be applied when there is no alternative to connect to.
Comment 11 Yann Ylavic 2015-01-05 15:33:11 UTC
Created attachment 32342 [details]
Don't put normal (non balancer-member) workers in error (2.2.x)

Same for 2.2.x.

Can you test it with your configuration?
Comment 12 Yann Ylavic 2015-01-05 15:44:22 UTC
Created attachment 32343 [details]
Don't put normal (non balancer-member) workers in error (2.2.x)

Fixed patch for 2.2.x (which does not use mod_proxy's child_init() run before mod_proxy_balancer's...).
Comment 13 Ruediger Pluem 2015-01-06 08:40:05 UTC
(In reply to Yann Ylavic from comment #10)
> Created attachment 32341 [details]
> Don't put normal (non balancer-member) workers in error (trunk/2.4.x)
> 
> This patch should prevent httpd from putting workers which are *not*
> load-balancer members in error state, since we don't want the retry timeout
> to be applied when there is no alternative to connect to.

-1 as this being hardcoded. This can be already done with retry=0. We may discuss to set retry=0 as a default value in this case if not set explicitly. It may make sense to put even a single backend in error state for some time to give the backend some time to recover without new requests sent to it and have them blocked at the reverse proxy layer.
Comment 14 Yann Ylavic 2015-01-06 11:37:40 UTC
(In reply to Ruediger Pluem from comment #13)
> (In reply to Yann Ylavic from comment #10)
> > Created attachment 32341 [details]
> > Don't put normal (non balancer-member) workers in error (trunk/2.4.x)
> > 
> > This patch should prevent httpd from putting workers which are *not*
> > load-balancer members in error state, since we don't want the retry timeout
> > to be applied when there is no alternative to connect to.
> 
> -1 as this being hardcoded. This can be already done with retry=0. We may
> discuss to set retry=0 as a default value in this case if not set
> explicitly. It may make sense to put even a single backend in error state
> for some time to give the backend some time to recover without new requests
> sent to it and have them blocked at the reverse proxy layer.

Nothing is hardcoded here, the default being still status=-I for balancer-members, but changed to status=+I for the standalone workers (but it can still be opt-out).
This is indeed debatable, the backend's time-to-recover argument is right/real, but there is no status=+I vs retry=0 here, both would change the default behaviour (is that acceptable?).

The only difference (I see) between status=+I and retry=0 is the logging of the worker's error/recovery states (in addition to the protocol/socket error).
I think this logging should be done only for status=-I workers (hence +I should ignore any error, not only those from connect()/handshake, eg. failon{status,timeout} or 500/503...).
This is at least what I did (try to) implement in this patch, so that users with simple workers don't bother (including logs) with recovery period if they configured so (or by default, if acceptable).
Comment 15 Ruediger Pluem 2015-01-06 18:47:52 UTC
(In reply to Yann Ylavic from comment #14)
> (In reply to Ruediger Pluem from comment #13)
> > (In reply to Yann Ylavic from comment #10)
> > > Created attachment 32341 [details]
> > > Don't put normal (non balancer-member) workers in error (trunk/2.4.x)
> > > 
> > > This patch should prevent httpd from putting workers which are *not*
> > > load-balancer members in error state, since we don't want the retry timeout
> > > to be applied when there is no alternative to connect to.
> > 
> > -1 as this being hardcoded. This can be already done with retry=0. We may
> > discuss to set retry=0 as a default value in this case if not set
> > explicitly. It may make sense to put even a single backend in error state
> > for some time to give the backend some time to recover without new requests
> > sent to it and have them blocked at the reverse proxy layer.
> 
> Nothing is hardcoded here, the default being still status=-I for
> balancer-members, but changed to status=+I for the standalone workers (but
> it can still be opt-out).

Sorry I misread the patch here. So no worries. As said changing the default makes sense for trunk. As this setup can already be reached by explicit configuration I would not backport it.
Comment 16 Yann Ylavic 2015-01-07 10:46:34 UTC
Thanks Rüdiger for the (new) review, commited in r1650028 (trunk).

Regarding this PR, for 2.4.x or 2.2.x, the default remains the error state (quarantine) during "retry" seconds (for connect(), 500/503 or failon{status,timeout} errors).

It can be changed with :
 ProxyPass/ProxySet ... status=+i retry=0
Comment 17 metatechbe 2015-02-10 13:45:42 UTC
Here is more information regarding our problem.  It is not a Solaris 11 bug but a Solaris 11 security hardening.

In Solaris 10, the ougoing TCP ports were chosen sequentially.
In Solaris 11, probably to implement RFC 6056 "Recommendations for Transport-Protocol Port Randomization", TCP ports are randomly chosen.
But the randomization does not guarantee that the port is not reused within 4 minutes (2 x the TCP Maximum Segment Lifetime, specified in RFC 1122)

The problem is occurring when the server-side of the TCP connection is closing the connection first, for instance when the HTTP Keep-Alive is disabled with the HTTP header "Connection: closed".  In this case, the TCP connection stays in the TIME_WAIT state for a short period on the server-side, but the TCP connection is immediately CLOSED on the client side.  The Solaris server accepts a new connection anyway, but the Cisco PIX and ACE firewall in between rejects them.

A more detailed explanation is available here :				 
http://blog.davidvassallo.me/2010/07/13/time_wait-and-port-reuse/

The RFC6191 specifies that TCP timestamps defined in RFC 1323 can be used to shorten the TIME_WAIT period for a socket.

To activate it under Solaris, run the following command on the machine initiating the TCP connections : 
sudo ndd -set /dev/tcp tcp_tstamp_always 1
 
The firewall must use this TCP field in its heuristics to make it work however.

Otherwise, it is possible to disable port randomization on Solaris 11 by adding the following line in /etc/system and rebooting the system :

set ip:tcp_random_anon_port=0