Bug 40877

Summary: JK1 and IIS6: Problem with multithreading / shared memory
Product: Tomcat Connectors Reporter: Bjoern Andersen <bjoern>
Component: CommonAssignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED WORKSFORME    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows Server 2003   
Attachments: WIN64/AMD64 ISAPI Redirect
WIN32 ISAPI Redirect

Description Bjoern Andersen 2006-11-02 09:01:42 UTC
Hi Mladen,
We migrate from JK2 to JK1 in the moment. We encountered some problems on 
bigger Webfarms with the JK1 / IIS6 in full Application Isolation mode. The 
main blocker is, when a Application pool is run in multithreading 
configuration (App-Pool-> Properties->Performance-> Web-Garden-> Number of 
Worker Processes), the JK does not start right: (Loglevel Info)

[Thu Nov 02 16:35:53 2006] [5872:5924] [error] validate::jk_lb_worker.c 
(1104): allocating worker record from shared memory
[Thu Nov 02 16:35:53 2006] [5872:5924] [error] wc_create_worker::jk_worker.c 
(158): validate failed for P-NG-TC3Hx
[Thu Nov 02 16:35:53 2006] [5872:5924] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx
[Thu Nov 02 16:35:54 2006] [4012:6612] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Thu Nov 02 16:35:54 2006] [4012:6612] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx
[Thu Nov 02 16:35:57 2006] [1672:6912] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Thu Nov 02 16:35:57 2006] [1672:6912] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx
[Thu Nov 02 16:36:27 2006] [6100:3092] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Thu Nov 02 16:36:27 2006] [6100:3092] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx

It runs fine with a singler worker process. I also tried to 
reduce "connection_pool_size" to 1 (from 64) because there are known probs 
with Apache prefork, but it didn't help. This is a real production blocker.

I also realized that in the jkstatus, the live-config and stati apply only to 
the application pool this web is in. If i reconfigure a jk in a web in a 
different pool, it is totally independent. As I understand it, all jk-
instances on one server should share this information through a shared memory 
area. This doesn't seem to work either.With that bug, one can't use the nice 
feature to reconfigure the jk online on IIS6 with multiple application pools.

If you need config files or debug logs, please ask. These are from production 
environment, so I'd prefere to send them personally.
Comment 1 Mladen Turk 2006-11-02 09:11:01 UTC
You are correct.

Will you be willing to test the patched binary?
Also, what platfom it is. WIN32 or WIN64?
Comment 2 Bjoern Andersen 2006-11-02 09:23:32 UTC
Yes, we can test the binaries. Thank you for the offer.
It happens on both, the 32 an 64 bit plattforms. in the moment, a 64 bit 
webfarm produces the errors. But we also got 32 bit systems ready to test.
Comment 3 Bjoern Andersen 2006-11-06 01:20:32 UTC
Hi again, hete to say it, but we found another severe bug in the full 
Application Isolation mode of IIS6 with jk1.2.x. Application pools recycle 
themself after some time or number of requests. When this happens:

 Event Type:	Information
 Event Source:	W3SVC
 Event Category:	None
 Event ID:	1074
 Date:		05.11.2006
 Time:		02:41:15
 User:		N/A
 Computer:	xxxxxxxxxxxxx
 Description:
 A worker process with process id of '1380' serving application
 pool 'www.premiere.de' has requested a recycle because the worker
 process reached its allowed processing time limit.  

...the JK exits with an error:

[Fri Nov 03 22:28:09 2006] [1380:6848] [info]  service::jk_lb_worker.c (873): 
unrecoverable error 413, request failed. Client failed in the middle of 
request, we can't recover to another instance.
[Sun Nov 05 02:41:20 2006] [2044:3792] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Sun Nov 05 02:41:20 2006] [2044:3792] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx
[Sun Nov 05 02:41:20 2006] [6872:4756] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Sun Nov 05 02:41:20 2006] [6872:4756] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx
[Sun Nov 05 02:41:32 2006] [6568:3420] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Sun Nov 05 02:41:32 2006] [6568:3420] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx
[Sun Nov 05 02:42:04 2006] [1360:4448] [error] wc_create_worker::jk_worker.c 
(146): factory for lb failed for P-NG-TC3Hx
[Sun Nov 05 02:42:04 2006] [1360:4448] [error] build_worker_map::jk_worker.c 
(256): failed to create worker P-NG-TC3Hx

JK2 had no problem with this. Maybe this had also to do with the shared mem 
problem, because the "main" jk1 ISAPI-extension is configured globaly in the 
IIS, and the called JK1-instances are run in the Application Pool environment. 
Can you tell me roughly when a testversion will be available? It matters for 
our Testsystems to be held in this state or, if it takes longer, to be re-
setup later.
Comment 4 Mladen Turk 2006-11-06 01:50:47 UTC
Hi,

Yes the problem is with shared memory.
It is initialized as global while it should be initialized on
per virtual server basis.

I'll take a look at that this week.
Comment 5 Mladen Turk 2006-11-14 23:32:53 UTC
Created attachment 19133 [details]
WIN64/AMD64 ISAPI Redirect

isapi_redirect 1.2.20 with shared memory patch
Comment 6 Bjoern Andersen 2006-11-15 09:32:01 UTC
Thank you for the fix. I did a quick test on the AMD64 server and it looks 
really good. Multithread in an applicationpool and app pool recyling seems to 
work fine.
A few tests later, only a few small flaws appeared.

First, you get an (one!) error when die app-pool recycles. This is easy 
reproduceable. Recycle every 100 requests, you'll get one error like these 
every 100 requests. Sometimes 2 per cycle at the same time.

[Wed Nov 15 15:32:34 2006] [5536:4756] [error] 
HttpExtensionProc::jk_isapi_plugin.c (1102): could not get a worker for name P-
NG-TC3Hx
[Wed Nov 15 15:32:54 2006] [5568:6800] [error] 
HttpExtensionProc::jk_isapi_plugin.c (1102): could not get a worker for name P-
NG-TC3Hx

Also, rarely, a "WriteClient failed" appears. This didn't happen before on the 
JK2. But I'm not totally sure if this applies to this jk-version as origin.

[Wed Nov 15 17:16:48 2006] [0540:6520] [error] write::jk_isapi_plugin.c (656): 
WriteClient failed with 00002746
[Wed Nov 15 17:16:48 2006] [0540:6520] [info]  
ajp_process_callback::jk_ajp_common.c (1421): Writing to client aborted or 
client network problems
[Wed Nov 15 17:16:48 2006] [0540:6520] [info]  ajp_service::jk_ajp_common.c 
(1813): (P-NG-TC3H1) request failed, because of client write error without 
recovery in send loop attempt=0
[Wed Nov 15 17:16:48 2006] [0540:6520] [info]  service::jk_lb_worker.c (874): 
unrecoverable error 200, request failed. Client failed in the middle of 
request, we can't recover to another instance.

And still in the jkstatus, the config and stati apply only to this app pool or 
even web / virtual server. I guess this is by design, because you mentioned to 
change the shared mem to a per-vhost basis. But it is still not very handy. 
Say you want to drain a tomcat to do maintainance, on our farms you'd need to 
do it not only on 11 frontent webservers but also on 41 webs per server. 
That's not practicable. Hopefully in the future, there is a per-vhost an 
global section in jkstatus. But this would need globally initialized shared 
mem again :)

Also tested was the "reference" directive, which didn't work in the 1.2.19 on 
IIS. That also worked fine.

All in all, great work, thanks. If you can fix the last on-recycle error I'd 
say test passed for amd64. Any idea when the final 1.2.20 will be released?
Comment 7 Om Bhakar 2006-11-15 15:15:29 UTC
(In reply to comment #5)
> Created an attachment (id=19133) [edit]
> WIN64/AMD64 ISAPI Redirect
> 
> isapi_redirect 1.2.20 with shared memory patch

Hi Mladen, 

We are having similar sort of issue on Win32 Platform. I will explain in detail. 

Environment
We are running MS Application Center Load Balanced Web Server Farm with 3 nodes
running IIS 6 in Worker Process Isolation mode. The Tomcat/JBoss runs on two
separate Servers (again Load Balanced via Application Center). The Web Server
Farm is running 5 Websites. The JK ISAPI redirector version is 1.2.19 and is
configured on per site basis.

Problem
Everything works fine except when the application pool undergoes recycling, the
website using that application pool wouldn't work. Recycling application pool
again, restarting website wouldn't help. The request to that website gets 404.2
and 404.3 in IIS logs. The 404.2 indicates that JK ISAPI is locked or access is
denied to the JK ISAPI. The only way to get it working again is to do an
IISRESET which affects all other websites and disconnects all the users
connected to those websites. Its very frustrating as the whole purpose of IIS 6
running in worker process isolation mode is defeated.

The JK Logs shows the following 
[Wed Nov 15 17:25:41 2006] [1228:4924] [error] jk_worker.c (256): failed to
create worker lbworker
[Wed Nov 15 17:49:24 2006] [2332:2232] [error] jk_worker.c (146): factory for lb
failed for lbworker
[Wed Nov 15 17:49:24 2006] [2332:2232] [error] jk_worker.c (256): failed to
create worker lbworker
[Wed Nov 15 18:31:12 2006] [4756:4804] [error] jk_worker.c (146): factory for lb
failed for lbworker
[Wed Nov 15 18:31:12 2006] [4756:4804] [error] jk_worker.c (256): failed to
create worker lbworker
[Wed Nov 15 18:57:02 2006] [4044:6564] [error] jk_worker.c (146): factory for lb
failed for lbworker
[Wed Nov 15 18:57:02 2006] [4044:6564] [error] jk_worker.c (256): failed to
create worker lbworker
[Wed Nov 15 18:57:24 2006] [0324:6956] [error] jk_worker.c (146): factory for lb
failed for lbworker
[Wed Nov 15 18:57:24 2006] [0324:6956] [error] jk_worker.c (256): failed to
create worker lbworker
[Wed Nov 15 18:57:32 2006] [3660:5948] [error] jk_worker.c (146): factory for lb
failed for lbworker
[Wed Nov 15 18:57:32 2006] [3660:5948] [error] jk_worker.c (256): failed to
create worker lbworker

The windows eventlog would show the following: 
Event Type:	Information
 Event Source:	W3SVC
 Event Category:None
 Event ID:	1074
 Date:		15.11.2006
 Time:		06:56:36
 User:		N/A
 Computer:	xxxxxxxxxxxxx
 Description:
A worker process with process id of '6308' serving application pool 'xxxxx' has
requested a recycle because the worker process reached its allowed processing
time limit.  

We didn't have any issues with the earlier version of JK which was 1.2.08. I
have a strong feeling that this has to do with the shared memory problem. 

Please let me know if you need any further information, config files or log files. 
Comment 8 Bjoern Andersen 2006-11-15 15:28:25 UTC
Hi Om Bhakar.
Yes, the problem you describe has been acknowledged by Mladen, he in working 
on it and already released a patched version for 64bit systems, as you can 
read above. The test looked good, so I guess it's only a matter of days until 
a patched version for you 32 bit environment is supplied. So we should better 
leave Mladen to do his good work.

Hi Mladen,
I'm pretty sure now that the "WriteClient failed" problem has nothing to do 
with your patch. After a few reconfigurations and tomcat restarts i managed a 
1h loadtest with several 100000 requests where only the expected ammount 
of "could not get a worker for name ..." recycle-errors occoured.
Comment 9 Om Bhakar 2006-11-21 17:46:39 UTC
Hi Bjoern Andersen,

The problem is very intermittent and we can't replicate it when we want to. It 
would happen may be once a day, not necessarily but most probably around the 
time when the application pool recycles itself. One website using JK stops 
working and no matter what we do (except IISRESET), it wouldn't come up 
throwing 404.2 and 404.3 errors.

I don't have any idea as when the patch would be released. I am more than 
willing to test it as soon as it is released. In the mean time while Mladen is 
doing the good work to release patch for JK 1.2.19 for win32, What I am 
thinking is to roll back to JK 1.2.15 since this bug is affecting our 
production environment. Do you know by any chance if the version JK 1.2.15 is 
more stable and doesn't have the shared memory bug. What version of JK would 
you suggest to roll back to?
Comment 10 Bjoern Andersen 2006-11-22 02:03:07 UTC
Hi Om Bhakar,
Sorry, no idea. We are switching from jk2.0.4, and tested only 1.2.18 & 19. 
Both had the recycle problem.

Hi Mladen,
Can you give a quick insight into the release plans? Is there going to be a 
32bit patch version or is 1.2.20 so close that you want to wait for that? When 
can we expect the 32bit version?
Comment 11 Rainer Jung 2006-11-22 02:08:06 UTC
I don't know, if Mladen still want to do something for the Win world, but the
changes on my side will be finished latest next weekend. So I will suggest today
on the list to cut the release sometime very close to the next weekend, and if
we did our job good, it will take at most another week before it's final. After
tagging we usually wait a few days for community testing and then we call a
three day vote on stability.
Comment 12 Mladen Turk 2006-11-22 02:27:20 UTC
Created attachment 19156 [details]
WIN32 ISAPI Redirect

WIN32 ISAPI Redirect 1.2.20 dev version
Comment 13 Om Bhakar 2007-01-01 17:56:17 UTC
Hi Mladen & Team, 

The newer version of JK Redirector 1.2.20 is performing very well and doing a
fantastic job as we didn't have any issues so far after deploying the newer
version 2 weeks back. I would like to thank you and the whole Tomcat Jakarta
Redirector Development/Test Team for doing such an excellent job. 

Thanks.