Bug 59334 - .NET Application Pools requests hang because Jakarta/Tomcat uses a Named Mutex that is currently owned by a different process
Status: RESOLVED DUPLICATE of bug 58813
Alias: None
Product: Tomcat Connectors
Classification: Unclassified
Component: isapi (show other bugs)
Version: 1.2.40
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
Depends on:
Reported: 2016-04-15 22:04 UTC by Murilo
Modified: 2016-09-12 19:07 UTC (History)
0 users


Description Murilo 2016-04-15 22:04:08 UTC
We have a server that hosts three .NET v4.0 applications (each in its own app pool) and another application hosted in Jakarta/Tomcat 7. Requests to the Tomcat application go through IIS for authentication and are then forwarded to Tomcat via isapi_redirect.dll.

We've started noticing that every now and then one of the .NET application pools was (randomly) hanging, requiring the support team to log on to the server and recycle it. Initially, we thought that the problem was related to the applications themselves, maybe memory leaks.

Further investigation confirmed that the application pools were hanging because Jakarta uses Named Mutexes, sometimes ones that are currently owned by a different process (one of the other app pools). Since Named Mutexes are shared across processes, any process trying to acquire one is blocked until the owning process releases it.

Long story short, we've confirmed that the hangs happen because Tomcat is leaking mutexes and not releasing them back to the other applications.

This seems to be very similar to the problem described here in this thread: 

But no one has replied to it.

Also, the changelog (https://tomcat.apache.org/connectors-doc/miscellaneous/changelog.html) lists a fix between versions 1.2.35 and 1.2.36 stating "Fix dead-lock caused by not releasing mutex on close. (mturk)", which sounds like exactly what we're facing; either that was a different issue, or the fix did not work.

Are there any plans to have this fixed in a future release?
Comment 1 Mark Thomas 2016-04-16 15:44:30 UTC
Correct product
Comment 2 Mark Thomas 2016-04-16 15:44:53 UTC
First question is what version of the connector are you using?
Comment 3 Murilo 2016-04-17 12:01:43 UTC
(In reply to Mark Thomas from comment #2)
> First question is what version of the connector are you using?

Comment 4 Murilo 2016-04-26 14:20:51 UTC
Adding more information: we've tried versions 1.2.30 and 1.2.36 to see if that would help. 1.2.30 made things worse, as the app pools crashed all the time, and 1.2.36 shows the same hangs as 1.2.40.

We'd appreciate it if anyone has an update on this.
Comment 5 Murilo 2016-04-28 17:07:58 UTC
Here's what's going on:

First callstack is:

00 00000000`10b7deb8 00000000`76ff9f20 ntdll!ZwWaitForSingleObject+0xa 
01 00000000`10b7dec0 00000001`8001b178 kernel32!WaitForSingleObjectEx+0x9c 
02 00000000`10b7df80 000007fe`fb7e1c9b isapi_redirect!HttpFilterProc+0x148 [c:\workplace\tomcat-connectors-1.2.40-src\native\iis\jk_isapi_plugin.c @ 2172]
03 00000000`10b7e400 000007fe`fb7e1f2d filter!W3_FILTER_CONTEXT::NotifyFilters+0x149

This thread owns the critical section blocking everyone else.

Looking at source code we see that we’re stuck here:

            if (!is_mapread) {
                WaitForSingleObject(init_cs, INFINITE);
                if (!is_mapread)
                    is_mapread = init_jk(serverName);

We're stuck on that WaitForSingleObject line. This code waits for init_cs, which is a Named Mutex, to be released. But this Named Mutex is currently owned:

Handle 0000000000000278
  Type                   Mutant
  Attributes         0
  GrantedAccess 0x1f0001:
  HandleCount  4
  PointerCount   8
  Object specific information
    Mutex is Owned
    Mutant Owner 3048.2198

Note that the owner is a thread inside this same process. The owner PID is 0x3048 (decimal 12360, the PID of this same process), and thread 0x2198 is thread 3.

Thread 3 in this same dump shows that the Named Mutex has leaked: its callstack is that of an idle thread-pool thread doing nothing, and with this callstack it will never release anything.

0:003> k
# Child-SP          RetAddr           Call Site
00 00000000`022efa08 00000000`76fedfbc ntdll!ZwRemoveIoCompletion+0xa
01 00000000`022efa10 000007fe`f76f2b0e kernel32!GetQueuedCompletionStatus+0x48
02 00000000`022efa70 000007fe`f76f1b95 w3tp!THREAD_POOL_DATA::ThreadPoolThread+0x56
03 00000000`022efad0 00000000`76fea4bd w3tp!THREAD_MANAGER::ThreadManagerThread+0x5d
04 00000000`022efb00 00000000`77356461 kernel32!BaseThreadInitThunk+0xd
05 00000000`022efb30 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

Some code ran on this thread (inside Jakarta) that took ownership of the Named Mutex, and then something happened and that code never released it… so now this whole process is stuck.
Comment 6 Murilo 2016-05-06 17:20:34 UTC
The problem was fixed by updating the code as suggested in this bug report:


Please ensure this is fixed in future versions, as it was quite troublesome for us to track down and fix.
Comment 7 Christopher Schultz 2016-05-27 13:52:09 UTC
I'm confused: what fix did you apply? There is no fix attached to this bug report.
Comment 8 Christian Swoboda 2016-09-09 09:07:32 UTC

I think the link to the proposed patch should be to:

Matthew Reiter analyzed the problem there in detail and found that the bug is in TerminateFilter (in jk_isapi_plugin.c).

version 1.2.41 
Line 2424:    ReleaseMutex(&init_cs);

should be:    ReleaseMutex(init_cs);

Here is his explanation:
In TerminateFilter (in jk_isapi_plugin.c), ReleaseMutex is being called with the address of the init_cs variable rather than its value, causing init_cs to never be released. As a result, when GetExtensionVersion is subsequently called, it is unable to acquire init_cs and so the plugin never finishes initializing. I was able to confirm that removing the extra "&" fixes the problem.

Could you please fix this, because we are having this critical problem as well!

Comment 9 Mark Thomas 2016-09-12 19:07:17 UTC

*** This bug has been marked as a duplicate of bug 58813 ***