Bug 63950 - org.apache.catalina.core.TestAsyncContextStateChanges with APR blocks forever on FreeBSD
Summary: org.apache.catalina.core.TestAsyncContextStateChanges with APR blocks forever...
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 7
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 7.0.96
Hardware: PC FreeBSD
: P2 major with 1 vote (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-22 08:58 UTC by Michael Osipov
Modified: 2019-11-25 14:49 UTC (History)
0 users



Attachments
Thread dump of locked JVM - FreeBSD (22.92 KB, text/plain)
2019-11-22 08:58 UTC, Michael Osipov
Details
Thread dump of locked JVM - Windows, Java 13.0.1 (24.90 KB, text/plain)
2019-11-22 20:49 UTC, Konstantin Kolinko
Details
Captured stdout/stderr of Apache Ant - Windows, Java 13.0.1 (24.02 KB, text/plain)
2019-11-22 20:57 UTC, Konstantin Kolinko
Details
Log output - waiting is limited to 10 seconds (858f7e7d) (33.55 KB, text/plain)
2019-11-23 20:06 UTC, Konstantin Kolinko
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Osipov 2019-11-22 08:58:43 UTC
Created attachment 36893 [details]
Thread dump of locked JVM - FreeBSD

Running the mentioned test from 7.0.x on
> FreeBSD 12.1-STABLE #1 r353717

on a very capable machine:
> $ sysctl hw.physmem
> hw.physmem: 34200395776
>
> $ sysctl hw.model hw.machine hw.ncpu
> hw.model: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
> hw.machine: amd64
> hw.ncpu: 20

on

> openjdk version "1.8.0_232"
> OpenJDK Runtime Environment (build 1.8.0_232-b09)
> OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

The JVM running the test is completely blocked. The last lines from the output:

>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.startup.LoggingBaseTest setUp
>     [junit] INFORMATION: Starting test case [testAsync[10: end [ERROR_NONE], timing [THREAD_AFTER_EXIT]]]
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
>     [junit] INFORMATION: An older version [1.2.21] of the APR based Apache Tomcat Native library is installed, while Tomcat recommends a minimum version of [1.2.23]
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
>     [junit] INFORMATION: Loaded APR based Apache Tomcat Native library [1.2.21] using APR version [1.7.0].
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
>     [junit] INFORMATION: APR capabilities: IPv6 [false], sendfile [true], accept filters [true], random [true].
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.core.AprLifecycleListener initializeSSL
>     [junit] INFORMATION: OpenSSL successfully initialized [OpenSSL 1.1.1d-freebsd  10 Sep 2019]
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.coyote.AbstractProtocol init
>     [junit] INFORMATION: Initialisiere ProtocolHandler["http-apr-127.0.0.1-auto-11"]
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.core.StandardService startInternal
>     [junit] INFORMATION: Starting service [Tomcat]
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.core.StandardEngine startInternal
>     [junit] INFORMATION: Starting Servlet Engine: Apache Tomcat/7.0.98-dev
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.catalina.startup.TldConfig execute
>     [junit] INFORMATION: At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
>     [junit] Nov 21, 2019 10:59:30 PM org.apache.coyote.AbstractProtocol start
>     [junit] INFORMATION: Starting ProtocolHandler ["http-apr-127.0.0.1-auto-11-31205"]

dump has been attached.

During the block truss reports me:
> _umtx_op(0x801afd320,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffddedcd08) ERR#60 'Operation timed out'

in a loop
Comment 1 Konstantin Kolinko 2019-11-22 20:46:16 UTC
I am testing release candidate of Tomcat 7.0.98 on Windows 10,
and it hangs on the same test class and with the same stacktrace, but

- Michael's version of Tomcat Native is different,
- a different test case in the test class is hanging.


1. In my case I use Tomcat Native 1.2.23

> [junit] 22.11.2019 23:41:04 org.apache.catalina.core.AprLifecycleListener lifecycleEvent
> [junit] INFO: Loaded APR based Apache Tomcat Native library [1.2.23] using APR version [1.7.0].

Michael used 1.2.21.


2. I tested
- 32-bit native with 32-bit Oracle Java 6u45, Oracle Java 7u80
- 64-bit native with 64-bit Java 8u222 from AdoptOpenJDK, Java 13 from OpenJDK.

In all cases the hanging test case is #9:

> [junit] INFO: Starting test case [testAsync[9: end [ERROR_NONE], timing [INLINE]]]

For Michael the hanging testcase was #10, as if #9 completed successfully:

> [junit] INFORMATION: Starting test case [testAsync[10: end [ERROR_NONE], timing [THREAD_AFTER_EXIT]]]

3. I do not see any significant differences in the stacktraces.
It is a parameterized test, thus line numbers are the same.

For NIO, BIO connectors the same test class completes successfully in 13 seconds.
Comment 2 Konstantin Kolinko 2019-11-22 20:49:15 UTC
Created attachment 36896 [details]
Thread dump of locked JVM - Windows, Java 13.0.1
Comment 3 Konstantin Kolinko 2019-11-22 20:57:39 UTC
Created attachment 36897 [details]
Captured stdout/stderr of Apache Ant - Windows, Java 13.0.1
Comment 4 Konstantin Kolinko 2019-11-23 20:06:21 UTC
Created attachment 36898 [details]
Log output - waiting is limited to 10 seconds (858f7e7d)

1. In the following commit I limited the wait time on the latch to 10 seconds
https://github.com/apache/tomcat/commit/858f7e7d3d367f4654a95aa31f0ebf57185e12cc

The result is that
- the following tests are all succeeding
Testcase: testAsync[0: end [NONE], timing [INLINE]] took 2,698 sec
Testcase: testAsync[1: end [NONE], timing [THREAD_BEFORE_EXIT]] took 1,117 sec
Testcase: testAsync[2: end [NONE], timing [THREAD_AFTER_EXIT]] took 2,117 sec
Testcase: testAsync[3: end [COMPLETE], timing [INLINE]] took 3,019 sec
Testcase: testAsync[4: end [COMPLETE], timing [THREAD_BEFORE_EXIT]] took 3,016 sec
Testcase: testAsync[5: end [COMPLETE], timing [THREAD_AFTER_EXIT]] took 4,038 sec
Testcase: testAsync[6: end [DISPATCH], timing [INLINE]] took 3,018 sec
Testcase: testAsync[7: end [DISPATCH], timing [THREAD_BEFORE_EXIT]] took 3,045 sec
Testcase: testAsync[8: end [DISPATCH], timing [THREAD_AFTER_EXIT]] took 4,024 sec

- the following tests are all failing
Testcase: testAsync[9: end [ERROR_NONE], timing [INLINE]] took 10,115 sec
Testcase: testAsync[10: end [ERROR_NONE], timing [THREAD_AFTER_EXIT]] took 10,1 sec
Testcase: testAsync[11: end [ERROR_COMPLETE], timing [INLINE]] took 10,099 sec
Testcase: testAsync[12: end [ERROR_COMPLETE], timing [THREAD_AFTER_EXIT]] took 10,102 sec
Testcase: testAsync[13: end [ERROR_DISPATCH], timing [INLINE]] took 10,099 sec
Testcase: testAsync[14: end [ERROR_DISPATCH], timing [THREAD_AFTER_EXIT]] took 10,101 sec

2. Looking into commit history of TestAsyncContextStateChanges.java file,
the following commit on 16 Oct 2019 disabled some tests due to known deadlocks.
https://github.com/apache/tomcat/commit/91aa7e888c64e6aa4c5bf1625024774dbe8f843e
Comment 5 Michael Osipov 2019-11-23 20:17:24 UTC
I just went back in history and even the first compilable version in 7.0.x this test does block forever.

Your change does not block forever anymore, but makes the test fail after 10 seconds. So I will always see failed tests.

What now?
Comment 6 Michael Osipov 2019-11-25 13:35:55 UTC
609d10230ad1cb67c9745764ccee98450df89880 fixes the issue for me.
Comment 7 Konstantin Kolinko 2019-11-25 14:20:27 UTC
Mark wrote in "[VOTE] Release Apache Tomcat 7.0.98" thread

> Found it. It is a bug in the unit test. The client is closing the
> connection too soon. I'm not sure why we are seeing this on a subset of
> operating systems and a subset of versions but I'll get it fixed generally.

(In reply to Michael Osipov from comment #6)
> 609d10230ad1cb67c9745764ccee98450df89880 fixes the issue for me.

I confirm as well.
That commit fixes the issue for me as well, on Windows 10.

I updated the changelog file. I am closing this issue as FIXED.
Comment 8 Mark Thomas 2019-11-25 14:49:31 UTC
Thanks for the confirmations and updating the bug. I was just waiting to complete a full unit test run for 9.0.x, 8.5.x and 7.0.x before updating. They have just finished. All looks good.