Bug 47750

Summary: ISAPI: Loss of worker settings when changing via jkstatus
Product: Tomcat Connectors Reporter: robert.mawer
Component: isapi Assignee: Tomcat Developers Mailing List <dev>
Status: NEW ---    
Severity: major CC: robert.mawer
Priority: P2    
Version: 1.2.28   
Target Milestone: ---   
Hardware: PC   
OS: Windows Server 2003   
Attachments: workers.properties file
isapi redirect file
URI worker map

Description robert.mawer 2009-08-27 10:19:06 UTC
Running a load-balanced worker with two nodes. The configuration is fine: the ISAPI filter starts up and works correctly.

A change is made to a worker node using the jkstatus page (for example, stopping node2, then starting it again). This works fine: the worker stops correctly, then becomes available again and handles requests.

After some time, the mod_jk log shows the ISAPI filter starting again. Presumably IIS is restarting something, although it doesn't behave like an app pool recycle, so I'm not sure what it is or what is triggering it.
When this happens, the log shows the shared memory for the workers being reset, and what appears to be the shm being updated with the previous values from the load-balancer worker's memory. However, the sequence number from memory doesn't match the value previously reached by performing the updates via jkstatus:
jk_lb_worker.c (347): syncing shm for lb 'node-lb' from mem (0->1)

The log then shows the shared memory for the load-balancer being synced again under worker maintenance. The sequence numbers do not match: p->sequence holds the value previously reached by making the jkstatus changes, while the shm sequence is still 1 as a result of the previous sync.
So the log shows:
jk_lb_worker.c (292): syncing mem for lb 'node-lb' from shm (3->1)

The log then shows that, as a result of this lb sync, the "changed" workers are then synced from the shm. However, as the data structure of the shm has been reset by the "restart" of the ISAPI filter, the values for that worker are set to zero. As this includes max_packet_size, any request to this worker is larger than the max packet size of zero, and so an "error 413 request entity too large" is displayed.
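
To illustrate the failure described above, here is a minimal sketch of how a zeroed max_packet_size turns every request into a 413. It is not the actual mod_jk code: the struct, the function name check_request_size and the size 8192 are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define HTTP_OK                200
#define HTTP_REQUEST_TOO_LARGE 413

/* Hypothetical stand-in for a worker's record in shared memory. */
struct worker_record {
    unsigned int max_packet_size; /* 0 once the shm has been wiped */
};

/* Hypothetical size check: reject anything that does not fit in one packet. */
static int check_request_size(const struct worker_record *w, size_t request_len)
{
    assert(w != NULL);
    if (request_len > w->max_packet_size)
        return HTTP_REQUEST_TOO_LARGE; /* always taken when the limit is 0 */
    return HTTP_OK;
}
```

With a record holding 8192, a 512 byte request passes; with a wiped record holding 0, the same request is rejected, which matches the behaviour seen after the second start.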

The zeroed records are shown as such for the worker in jkstatus. Manually updating these entries to the correct values allows that worker to function again.


I have made a small amendment on my system so that jk_lb_pull is only called if the mem sequence is less than the shm sequence (rather than just "not equal"), i.e.
changed:
    if (p->sequence != p->s->h.sequence)
        jk_lb_pull(p, JK_TRUE, l);
to:
    if (p->sequence < p->s->h.sequence)
        jk_lb_pull(p, JK_TRUE, l);
for all instances where jk_lb_pull is called as a result of this conditional.
It seems to have resolved this particular issue and the settings persist correctly, but I'm not sure if it is actually a correct solution!
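
For reference, the difference between the two conditionals can be reproduced in isolation. This is a simplified sketch only: the structs and the functions pull_from_shm, maintain_not_equal and maintain_less_than are hypothetical stand-ins, not the real jk_lb_worker.c types or jk_lb_pull itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified shm record and local (mem) copy of an lb worker. */
struct shm_record {
    unsigned int sequence;
    unsigned int max_packet_size;
};

struct lb_worker {
    unsigned int sequence;        /* mem sequence, e.g. 3 after jkstatus edits */
    unsigned int max_packet_size;
    struct shm_record *s;         /* shm copy, sequence 1 after the reset */
};

/* Stand-in for the pull: overwrite the local copy with the shm contents. */
static void pull_from_shm(struct lb_worker *p)
{
    p->max_packet_size = p->s->max_packet_size;
    p->sequence = p->s->sequence;
}

/* Original condition: pull whenever the sequences differ. Returns 1 if pulled. */
static int maintain_not_equal(struct lb_worker *p)
{
    assert(p != NULL && p->s != NULL);
    if (p->sequence != p->s->sequence) {
        pull_from_shm(p);
        return 1;
    }
    return 0;
}

/* Amended condition: pull only when the shm is ahead of the local copy. */
static int maintain_less_than(struct lb_worker *p)
{
    assert(p != NULL && p->s != NULL);
    if (p->sequence < p->s->sequence) {
        pull_from_shm(p);
        return 1;
    }
    return 0;
}
```

With mem at sequence 3 and a wiped shm at sequence 1, the first variant pulls the zeroed values and the second leaves the local settings alone. Note that the less-than comparison assumes the sequence only grows for the lifetime of the shm; wraparound is not handled in this sketch.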
Comment 1 Rainer Jung 2009-09-01 10:50:22 UTC
I can see the problem. To fix it the right way, I would like to understand why the redirector does a second initialization.

As far as I can see, it is a second IIS process (separate process ID) that attaches to the shm and wipes it out. I can easily prevent that effect (the zeroing of the shared memory), but it would be more correct if the second process actually got the right data from the already existing shared memory.
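
One way to picture "attach instead of wipe" is a marker in the segment header that tells a late-arriving process the data is already valid. This is only a sketch under assumptions: SHM_MAGIC, struct shm_header and shm_attach are invented for illustration and do not reflect the real jk_shm layout, and locking between processes is left out entirely.

```c
#include <assert.h>
#include <string.h>

#define SHM_MAGIC 0x4A4B5348u /* hypothetical "already initialized" marker */

/* Hypothetical header at the start of the shared segment. */
struct shm_header {
    unsigned int magic;
    unsigned int sequence;
    unsigned int workers;
};

/*
 * Returns 1 if this call initialized the segment, 0 if it attached to
 * data that an earlier process had already set up.
 */
static int shm_attach(struct shm_header *h)
{
    assert(h != NULL);
    if (h->magic == SHM_MAGIC)
        return 0;               /* keep existing contents untouched */
    memset(h, 0, sizeof(*h));   /* first process: start from a clean segment */
    h->magic = SHM_MAGIC;
    return 1;
}
```

A first call on a fresh segment initializes it; a second call, standing in for the second IIS process, sees the marker and leaves the sequence and worker data in place. A real implementation would also need a lock around the check so that two processes starting at once cannot both initialize.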

Are you using web gardens and/or application pools? Which one of those and how are they configured?

Could you please also provide your workers.properties and uriworkermap.properties as well as the isapi_redirect.properties.
Comment 2 robert.mawer 2009-09-01 12:06:42 UTC
Created attachment 24198 [details]
workers.properties file
Comment 3 robert.mawer 2009-09-01 12:07:06 UTC
Created attachment 24199 [details]
isapi redirect file
Comment 4 robert.mawer 2009-09-01 12:07:28 UTC
Created attachment 24200 [details]
URI worker map
Comment 5 robert.mawer 2009-09-01 12:15:31 UTC
(In reply to comment #1)
> I can easily prevent that efect (the
> zeroing of the shared memory), but it would be more correct, if the second
> process actually got the right data from the already existing shared memory.
I did try zeroing the data to start with, but yes, it's not ideal for the worker status/changes to not be persistent when the second process starts!

> Are you using web gardens and/or application pools? Which one of those and how
> are they configured?
Just running it out of the DefaultAppPool.
Recycle worker processes set to 1740 minutes; Shutdown idle worker processes after being idle for 20 minutes; Limit kernel requests to 100; Maximum number of worker processes in the webgarden is set to 1.
Pinging is enabled and set to 30 seconds.
Rapid-fail protection is set to 5 failures in 5 minutes.
Startup and shutdown time limits are set to 90 seconds.
Process is running as Network Service.

There are other applications running in the DefaultAppPool - I'm not sure if these could be influencing the second process starting?

> Could you please also provide your workers.properties and
> uriworkermap.properties as well as the isapi_redirect.properties.
Have attached these to the bug report.
Comment 6 Rainer Jung 2009-09-01 12:41:20 UTC
Thanks for the config, will have a look.

The request which triggered the second start was using the URL

/EODPut/225/TEST.EOD225.20090826.txt

and going to the same server name as all the other requests. Still trying to work out why this request started another (second) process.
Comment 7 Rainer Jung 2009-09-01 12:43:14 UTC
Config looks reasonable.
Comment 8 robert.mawer 2009-09-01 13:47:08 UTC
(In reply to comment #6)
> Thanks for the config, will have a look.
> The request which triggered the second start was using the URL
> /EODPut/225/TEST.EOD225.20090826.txt
> and going to the same server name as all the other requests. Still reasoning,
> why this request started another (second) process.

This is a WebDAV location, and the particular call above writes a file into the location - I did wonder if this was the trigger, but further experimentation showed that:
1.  Calls to the WebDAV location don't usually trigger another start.
2.  Other URLs trigger the start.

Additionally WebDAV is using a different application pool, so in theory should be separated from ISAPI redirect.  I also have another environment which experiences the originally reported condition, and that doesn't have WebDAV on it.