|Summary:||lb_mult changes caused by status worker changing a subworker's factor are not propagated to all processes correctly.|
|Product:||Tomcat Connectors||Reporter:||Jonathan Oddy <jonathan.oddy>|
|Component:||Common||Assignee:||Tomcat Developers Mailing List <dev>|
Patch to recalculate lb_mult after pulling from shm
Possible alternative solution
Force update all subworkers for multiplier change
Reorder update actions and introduce flag for demanding updates for all members
Like update_shm_mult.patch, but fix missing sequence nummber setting in single member update case
Description Jonathan Oddy 2017-11-07 09:56:32 UTC
Created attachment 35500 [details] Patch to recalculate lb_mult after pulling from shm Steps to reproduce: Alter the lbfactor of one subworker through jkstatus to a value that causes the lbmult to need to be updated for other subworkers. You need multiple subworkers configured and multiple web server processes to be running at the point the status change is made. Expected result: All processes pull in the new lbfactor for the changed subworker, and lbmult for all the other updated subworkers. Requests are balanced appropriately for the new factor configuration. Actual result: All processes pull in the new lbfactor for the changed subworker, however only the process processing the status change sees the lbmult values change for the other subworkers. All other processes keep the original lbmult values. Requests are not correctly balanced. This happens because lbmult is expected to be calculated by the status worker when it updates the factor, and then pushed to shm. The status worker, however, only updates the sequence for the subworker being edited and not those receiving new lbmult values. As a result the other processes do not observe a change to the other subworkers, and do not pull the lbmult change. This patch moves responsibility for calculating the local mult values to the jk_lb_pull function. Processes will now recalculate their own view of the mult values after pulling from shm, rather than obtaining it from shm. Since the values are based on the current weights all processes will reach the same conclusion.
Comment 1 Mark Thomas 2017-11-22 10:56:53 UTC
Created attachment 35544 [details] Possible alternative solution Having reviewed the source code it appears that after the lb_factor is changed the shared memory is updated before the lb_mult is recalculated. I wonder if swapping the order of those actions would be an alternative fix? I have attached a possible patch.
Comment 2 Jonathan Oddy 2017-11-22 11:24:10 UTC
I hadn't spotted that, but I don't think it helps. jk_lb_push() only pushes changes for subworkers where the sequence number has changed. update_worker() sets the sequence number to -1 for the subworker that had its weight explicitly changed, however update_mult() affects the other subworkers as well. I guess an alternative solution would be for update_worker() to set sequence to -1 for all subworkers if JK_STATUS_NEEDS_UPDATE_MULT?
Comment 3 Mark Thomas 2017-11-22 12:11:32 UTC
That sounds reasonable. That does seem to be consistent with the general approach which looks to be calculate / update stuff once, put in in shared memory, have everything that needs to update from shared memory.
Comment 4 Jonathan Oddy 2017-11-22 15:57:08 UTC
Created attachment 35547 [details] Force update all subworkers for multiplier change Here's an attempt at doing that. This hasn't been as heavily tested as my original patch. However, I'm now wondering if it ought to be treated like lb_value, and read straight from shm rather than being copied to lb_sub_worker. That'd be simpler, and also get rid of the lag on lb_mult changes being picked up everywhere. Are there any advantages to it being synced to lb_sub_worker? I wondered if there was some performance implication, but if that's the case then access to lb_value would surely be much more of a bottleneck already.
Comment 5 Rainer Jung 2017-11-22 21:45:51 UTC
Created attachment 35549 [details] Reorder update actions and introduce flag for demanding updates for all members I think both of your analysis are right: - we only sync via shared memory lb_mult for the worker whose factor changed - the order of steps in case only one member is changes is wrong The attached patch should fix both issues, correct the order - similar to what Mark suggests - plus letting jk_lb_push know, that it should not only push none or one member, but instead needs to push all members of an lb. To make this possible we change the signature of jk_lb_push and introduce a new flag. The patch compiles, but I'm short of time to aczually test it. it would be great if one or both of you could give it a try. Comments very welcome.
Comment 6 Jonathan Oddy 2017-11-23 09:44:51 UTC
Hi, The "Force update all subworkers for multiplier change" from my previous comment also includes the reordering. Updating the sequence numbers to -1 for all subworkers, as in my patch, is consistent with how updates are already pushed by the status worker. It does feel rather ugly though. Your change to jk_lb_push feels like it might lead to a cleaner implementation, but if so it'd be good to also get rid of the existing sequence = -1 usage for lb and wr in a similar fashion. Also you seem to have deleted a wr->sequence = -1. Without that I think explicit changes to a worker that don't trigger a multiplier change won't be pushed.
Comment 7 Rainer Jung 2017-11-23 12:18:07 UTC
Created attachment 35550 [details] Like update_shm_mult.patch, but fix missing sequence nummber setting in single member update case Thanks for pointing at the missing sequence number update in the single member update case. I had removed it intentionally when I originally set push_all_members always to JK_TRUE there, but then forgot to add it back for the JK_FALSE case. Can you actually please test the (v2) patch? Concerning other places of using sequence number value "-1" as a side effect, I currently do not plan to work on removing this opaque feature. mod_jk is kept quite stable since a few years so I think most energy will go into fixing bugs, not improving code quality. Thanks again, Rainer
Comment 8 Rainer Jung 2018-01-04 20:13:26 UTC
I have applied version two of my patch in r1820195. It will be part of 1.2.43. Thanks for your help on this. It would still be usefull if you could also do a test before the release.
Comment 9 Jonathan Oddy 2018-01-05 12:12:33 UTC
We've been successfully running my original 1-liner fix for some time, so testing an alternative fix wasn't that high up the priority list at the end of the year. I'll try to get a new build with your fix tested next week.