Bug 48735 - bybusyness does not balance after failed worker has recovered
Summary: bybusyness does not balance after failed worker has recovered
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_proxy_balancer
Version: 2.2.21
Hardware: All
OS: All
Importance: P2 critical with 26 votes
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: PatchAvailable
Depends on:
Blocks:
 
Reported: 2010-02-13 14:06 UTC by Olivier BOËL
Modified: 2014-01-20 00:24 UTC
CC List: 11 users



Attachments
Fix by adding error handling and atomic functions (4.66 KB, patch)
2010-10-05 08:52 UTC, Markus Stoll
cleanup of counters added when disabled worker becomes usable (47.00 KB, text/plain)
2011-11-04 16:57 UTC, Adam C

Description Olivier BOËL 2010-02-13 14:06:46 UTC
I noticed that, after a failed worker has recovered, no request is forwarded to it although it is marked as OK in balancer-manager:
Load Balancer Manager for www.europarldv.ep.ec
Server Version: Apache/2.2.12 (Unix) DAV/2 mod_ssl/2.2.12 OpenSSL/0.9.8e
Server Built: Aug 5 2009 12:54:36

--------------------------------------------------------------------------------

LoadBalancer Status for balancer://websdi
StickySession Timeout FailoverAttempts Method
JSESSIONID|jsessionid 0 1 bybusyness

Worker URL Route RouteRedir Factor Set Status Elected To From
http://websdidv-node1.appsrv:64675 node1  1 0 Ok 250 81K 13M
http://websdidv-node2.appsrv:64675 node2  1 0 Ok 51 16K 2.6M

This issue does not occur with the default method (byrequests).

Here is my configuration:
        ProxyPass /parliament/ balancer://websdi/parliament/ stickysession=JSESSIONID|jsessionid lbmethod=bybusyness scolonpathdelim=On
        <Proxy balancer://websdi>
                BalancerMember http://websdidv-node1.appsrv:64675 route=node1
                BalancerMember http://websdidv-node2.appsrv:64675 route=node2
        </Proxy>

Server version: Apache/2.2.14 (Unix)
Server built:   Jan 28 2010 09:10:16
Server's Module Magic Number: 20051115:23
Server loaded:  APR 1.3.9, APR-Util 1.3.9
Compiled using: APR 1.3.9, APR-Util 1.3.9
Architecture:   32-bit
Server MPM:     Worker
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/worker"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_FCNTL_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/local/products/revproxy"
 -D SUEXEC_BIN="/local/products/revproxy/bin/suexec"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

System = SunOS
Node = eiciluxd5
Release = 5.9
KernelID = Generic_122300-36
Machine = sun4u
BusType = <unknown>
Serial = <unknown>
Users = <unknown>
OEM# = 0
Origin# = 1
NumCPU = 4
Comment 1 Markus Stoll 2010-10-05 08:52:30 UTC
Created attachment 26123 [details]
Fix by adding error handling and atomic functions

The proposed bugfix reuses a fix from bug 46215 (by Thomas Binder) that switches the busyness increments/decrements to atomic functions. I added proper error handling for unreachable workers (previously, the post_req function was never called in that case).
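
The sketch below is a minimal, self-contained illustration of the approach just described, not the attached patch itself: protect the shared "busy" counter with APR atomic operations and make sure the error path also decrements it. worker_t, mark_busy() and mark_idle() are hypothetical stand-ins for the real proxy_worker structure and the pre/post request hooks.

/* Build (assuming APR is installed):
 *   gcc demo.c $(apr-1-config --cflags --includes --link-ld)
 */
#include <stdio.h>
#include <apr_general.h>
#include <apr_atomic.h>

typedef struct {
    volatile apr_uint32_t busy;   /* stand-in for worker->s->busy */
} worker_t;

static void mark_busy(worker_t *w)
{
    apr_atomic_inc32(&w->busy);   /* pre_request: one more in flight */
}

static void mark_idle(worker_t *w)
{
    /* post_request AND the failure path must both run this;
     * the bug was that a refused connection skipped it */
    if (apr_atomic_read32(&w->busy) > 0) {
        apr_atomic_dec32(&w->busy);
    }
}

int main(void)
{
    worker_t w = { 0 };

    apr_initialize();
    mark_busy(&w);                /* request dispatched */
    mark_idle(&w);                /* backend refused: still decrement */
    printf("busy = %u\n", apr_atomic_read32(&w.busy));  /* prints 0 */
    apr_terminate();
    return 0;
}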
Comment 2 Olivier BOËL 2011-01-19 04:24:56 UTC
I confirm the bug cannot be reproduced with 2.2.17

Thanks!
Comment 3 Evan Rabeck 2011-01-19 09:38:14 UTC
Has this patch been applied to the 2.2.17 release? I'm confused because I don't see anything in the changelog to reflect a fix.
Comment 4 Olivier BOËL 2011-05-05 11:56:57 UTC
Hello, Gents,


Please ignore my previous comment (saying the bug cannot be reproduced with 2.2.17).
The problem still exists in 2.2.17.
It can be easily reproduced: stop one of the nodes, watch the dashboard (Status=Err), restart the node; the status will be OK but Elected will not evolve.
The only way to fix it is to shut down and restart Apache (not gracefully).

Regards,


Olivier
Comment 5 Serge Knystautas 2011-05-11 03:11:35 UTC
I'm having the same issue and would be happy to help diagnose the problem if it's unclear what is going on.
Comment 6 Serge Knystautas 2011-06-24 15:42:34 UTC
The original report was against 2.2.14 on Solaris, but the same behavior is happening for me with 2.2.17 on CentOS release 5.5 (Final), so the bug is more current than originally expressed and doesn't seem to be platform-specific.
Comment 7 Olivier BOËL 2011-07-14 06:10:28 UTC
Behaviour can be reproduced with Apache 2.2.19 on Solaris
Comment 8 Adam C 2011-10-31 21:10:33 UTC
This can be easily reproduced. The "busy" counter is not decreased when the balancer tries to send a request to a node which is down, for example when the connection is refused by the server.

Set up a balancer with 2 members, one of which is down. Send requests concurrently for some period of time (more than 60s, because of the "retry" timeout applied to a worker disabled due to errors).

Start the server which was down and continue sending requests. The restarted server will only start getting requests once the "busy" counter of the second node happens to exceed the first node's leaked value, so that the balancer selects the first node as less busy. The balancer thinks the first node is still busy handling the requests that failed with an error, because the "busy" value is not decremented on that path.
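
To make the failure mode concrete, the following stand-alone simulation (hypothetical code, not taken from httpd) models the accounting just described: the counter is bumped before every attempt but only dropped after a successful response, so refused connections inflate it permanently.

#include <stdio.h>

int main(void)
{
    int busy = 0;                 /* models the down node's busy counter */
    int backend_up = 0;           /* the node is down */

    for (int i = 0; i < 5; i++) {
        busy++;                   /* pre_request: attempt dispatched */
        if (backend_up) {
            busy--;               /* post_request runs only on success */
        }
        /* connection refused: the decrement never happens */
    }

    /* bybusyness now believes this worker has 5 requests in flight */
    printf("leaked busy count: %d\n", busy);
    return 0;
}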
Comment 9 Adam C 2011-11-04 16:57:55 UTC
Created attachment 27900 [details]
cleanup of counters added when disabled worker becomes usable

Added busy and lbstatus to the balancer-manager page.
Comment 10 BryonB 2012-01-13 18:49:17 UTC
I didn't see anything about these fixes being included in any of the recent releases (since 2.2.17). I'm still easily able to replicate the behavior under Apache 2.2.21 on Windows. There doesn't seem to be any way to get traffic routed to a restarted back-end instance without forcing a restart of Apache. The issue only happens when the load balancing method is set to bybusyness.
Comment 11 Serge Knystautas 2012-02-22 09:47:48 UTC
I've confirmed that the patch for mod_proxy_balancer.c (attachment by Adam C at 2011-11-04 16:57 UTC) successfully fixed this bug when applied to httpd 2.2.20 on RHEL4 in our production environment.

It also provides extra details on the balancer manager page, which is way cool!
Comment 12 Eric Garreau 2012-02-24 09:58:56 UTC
I also confirm it works perfectly when applied to Apache 2.2.21 (Linux / SunOS hosts in production)

thanks a lot for this fix
Comment 13 till 2012-05-07 16:47:52 UTC
Can we have that in the next release? PLEASE
Patching every Apache version to make it stable is not fun.
Comment 14 Zisis Lianas 2012-05-16 12:37:14 UTC
I also can reproduce this error with 2.4.2
Comment 15 Jeff Trawick 2012-05-16 13:18:01 UTC
Thanks for your update, Zisis.  I'm working on a patch for trunk/2.4.x and will update the bug when it is testable.
Comment 16 Christophe WEIDEMANN 2012-05-22 15:21:43 UTC
Hello,
I've just checked the patch on a Solaris 10 / i386 platform with the Apache 2.2.21 release and it doesn't work for me. After the failed backend server comes back online, it is no longer elected. In the LB manager status, I notice that the priority counter of this server keeps increasing, while the priority of the other server keeps decreasing. I don't know if this is the reason.

Regards,

Christophe
Comment 17 Adam C 2012-05-22 15:57:59 UTC
(In reply to comment #16)
> I've just checked the patch on a Solaris 10 / i386 platform with the Apache
> 2.2.21 release and it doesn't work for me.


Was the patch applied manually? The symptoms are exactly like those without the patch.
Comment 18 Christophe WEIDEMANN 2012-05-23 08:07:47 UTC
Yes, I confirm.

Here are some details. I stopped node1. After it had successfully stopped, the balancer manager showed the following (the priority for node1 is 172):

Load Balancer Manager for eicixzl034

Server Version: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8k DAV/2
Server Built: May 23 2012 09:27:11
LoadBalancer Status for balancer://platosws

StickySession	Timeout	FailoverAttempts	Method
JSESSIONID|jsessionid	0	1	bybusyness

Worker URL	Route	RouteRedir	Priority	Factor	Set	Status	Busyness	Elected	To	From
http://platoswsdv-node1.appsrv:54000	node1		172	1	0	Err	1	65	 34K	 69K
http://platoswsdv-node2.appsrv:54010	node2		-170	1	0	Ok	0	406	227K	203K

Then I restarted node1; after the successful restart and some requests on the web site, I have the following:

Load Balancer Manager for eicixzl034

Server Version: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8k DAV/2
Server Built: May 23 2012 09:27:11
LoadBalancer Status for balancer://platosws

StickySession	Timeout	FailoverAttempts	Method
JSESSIONID|jsessionid	0	1	bybusyness

Worker URL	Route	RouteRedir	Priority	Factor	Set	Status	Busyness	Elected	To	From
http://platoswsdv-node1.appsrv:54000	node1		429	1	0	Ok	1	70	 36K	 70K
http://platoswsdv-node2.appsrv:54010	node2		-427	1	0	Ok	0	669	354K	669K

Node1's priority keeps on increasing (429) and it is no longer elected.
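
This is consistent with how the bybusyness election works. The following sketch is a simplified model written for this report (not the exact httpd source): the worker with the lowest busy count wins, lbstatus (displayed as Priority) breaks ties, every worker's lbstatus grows by its lbfactor on each election, and the winner's is reduced by the sum of all factors. With a leaked busy=1 stuck on node1, node2 wins every election, so node1's Priority climbs while node2's falls, matching the numbers in the tables above.

#include <stdio.h>

typedef struct {
    const char *name;
    int busy;       /* in-flight requests (leaks when not decremented) */
    int lbstatus;   /* shown as "Priority" in the balancer manager */
    int lbfactor;
} worker;

static worker *find_best_bybusyness(worker *ws, int n)
{
    worker *best = NULL;
    int total_factor = 0;

    for (int i = 0; i < n; i++) {
        ws[i].lbstatus += ws[i].lbfactor;   /* everyone drifts up */
        total_factor += ws[i].lbfactor;
        if (!best
            || ws[i].busy < best->busy
            || (ws[i].busy == best->busy && ws[i].lbstatus > best->lbstatus)) {
            best = &ws[i];
        }
    }
    if (best) {
        best->lbstatus -= total_factor;     /* winner is pushed back down */
    }
    return best;
}

int main(void)
{
    /* node1 carries a leaked busy count of 1; node2 is genuinely idle */
    worker ws[2] = { { "node1", 1, 0, 1 }, { "node2", 0, 0, 1 } };

    for (int i = 0; i < 5; i++) {
        worker *w = find_best_bybusyness(ws, 2);
        printf("elected %s (node1 lbstatus=%d, node2 lbstatus=%d)\n",
               w->name, ws[0].lbstatus, ws[1].lbstatus);
    }
    return 0;
}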
Comment 19 Christophe WEIDEMANN 2012-06-19 08:43:15 UTC
Hi,

I've just reproduced the same problem as above on Solaris 10 / SPARC.
Comment 20 Jeff Trawick 2012-07-27 12:19:28 UTC
A fix has been committed to trunk (http://svn.apache.org/viewvc?view=revision&revision=1366344) and proposed for 2.4.x.  (2.2.x will follow that.)

The handling of the busy flag is different from either proposal here.  Feel free to comment on the viability of that.  Note that this exact fix has thus far been tested only with trunk and 2.4.x.  Something different could potentially be necessary for 2.2.x.

Additionally, changes to use atomic operations or augment the balancer manager have not been considered.  I suggest tracking those with different bugs.
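
One generic pattern that would guarantee the busy decrement regardless of how a request ends is an APR pool cleanup registered on the request pool. The sketch below uses hypothetical names and is offered only as an illustration under that assumption; it is not necessarily how r1366344 handles it.

#include <stdio.h>
#include <apr_general.h>
#include <apr_pools.h>

static int busy = 0;   /* stand-in for the shared worker busy counter */

/* Runs when the request pool is destroyed, on success or failure alike. */
static apr_status_t release_worker(void *data)
{
    (void)data;
    if (busy > 0) {
        busy--;
    }
    return APR_SUCCESS;
}

int main(void)
{
    apr_pool_t *req;

    apr_initialize();
    apr_pool_create(&req, NULL);

    busy++;                               /* pre_request */
    apr_pool_cleanup_register(req, NULL, release_worker,
                              apr_pool_cleanup_null);

    /* ... forwarding fails with "connection refused" ... */

    apr_pool_destroy(req);                /* cleanup still fires */
    printf("busy = %d\n", busy);          /* prints 0 */
    apr_terminate();
    return 0;
}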
Comment 21 Jeff Trawick 2012-07-29 23:30:21 UTC
>Additionally, changes to use atomic operations or augment the 
>balancer manager have not been considered.  I suggest tracking
>those with different bugs.

a. atomic operations

I just opened

Bug 53618 - proxy_worker_shared fields not maintained in thread-safe manner

for the thread-safe handling of busy.

b. balancer manager display

httpd trunk and 2.4 already display the extra information.
Comment 22 Jeff Trawick 2012-07-29 23:56:33 UTC
Applying this patch from httpd trunk/2.4.x fixes the issue for me:
  http://svn.apache.org/viewvc/httpd/httpd/trunk/modules/proxy/mod_proxy_balancer.c?r1=1366344&r2=1366343&pathrev=1366344
That's what I'll propose for the 2.2.x branch.  Can anyone reproduce the problem with this new patch applied?
Comment 23 Rainer Jung 2012-08-21 16:06:49 UTC
Applied to 2.4 in r1374299.
Released with 2.4.3.
Applied to 2.2 in r1373355.
Not yet released there.
Comment 24 Zisis Lianas 2012-08-22 17:05:17 UTC
Just tested 2.4.3 - issue seems to be fixed now.
Thanks!
Comment 25 Christophe WEIDEMANN 2012-08-30 09:53:45 UTC
I've just tried the new patch on Apache 2.2.22. There is one error during compilation:
mod_proxy_balancer.c: In function `force_recovery':
mod_proxy_balancer.c:420: error: structure has no member named `forcerecovery'
*** Error code 1

Has mod_proxy.h also been modified? If so, can you add it to the patch?

Thanks
Comment 26 Rainer Jung 2012-08-30 10:35:17 UTC
You need at least

http://svn.apache.org/viewvc?view=revision&revision=1373320

but I suggest trying the candidate for 2.2.23 instead, because there might be more requirements not included in 2.2.22. Version 2.2.23 already contains the patch for this bug; no additional patches are needed as far as we currently know.

http://httpd.apache.org/dev/dist/

Note that 2.2.23 is not yet officially released(!!!), so it is only adequate for testing purposes. The official release should not be far away though.
Comment 27 Christophe WEIDEMANN 2012-08-31 08:34:44 UTC
I've just tested in 2.2.23. Seems to be working now.
Just a remark on the balancer-manager web interface: the Priority and Busyness fields are not displayed, although they appeared in previous 2.2 releases when the patch was installed. Is that normal?

Thanks.
Comment 28 Jeff Trawick 2012-08-31 10:59:51 UTC
The balancer manager changes were not backported.
Comment 29 Eric Covener 2014-01-20 00:24:52 UTC
> I've just tested in 2.2.23. Seems to be working now.