Bug 64317 - org.apache.tomcat.util.net.TestSsl testPost() test failing on Windows
Summary: org.apache.tomcat.util.net.TestSsl testPost() test failing on Windows
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 9.0.x
Hardware: PC All
: P2 normal (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-06 13:31 UTC by Konstantin Kolinko
Modified: 2020-04-15 10:02 UTC (History)
1 user (show)



Attachments
01_TEST-org.apache.tomcat.util.net.TestSsl.NIO2.txt (8.47 KB, text/plain)
2020-04-06 13:42 UTC, Konstantin Kolinko
Details
02_TEST-org.apache.tomcat.util.net.TestSsl.NIO2.txt (11.11 KB, text/plain)
2020-04-06 13:45 UTC, Konstantin Kolinko
Details
03_TEST-org.apache.tomcat.util.net.TestSsl.APR.txt (9.88 KB, text/plain)
2020-04-06 13:49 UTC, Konstantin Kolinko
Details
04_TEST-org.apache.tomcat.util.net.TestSsl.NIO.txt (7.30 KB, text/plain)
2020-04-06 14:59 UTC, Konstantin Kolinko
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Konstantin Kolinko 2020-04-06 13:31:53 UTC
Testing release candidate of Tomcat 9.0.34 on Windows, I see failures in org.apache.tomcat.util.net.TestSsl - one test case is failing: testPost()

I was testing with Java 8 (AdoptOpenJDK 8u242 64-bit),
and out of 12 runs I see
- with NIO2 connector it failed in 4 runs in 2 distinct ways (1 + 3)
- with APR connector it failed in 1 run
- with NIO connector no issues notes.

The "testPost()" is a new test case in TestSsl, added a month ago (see commit 489d07c6150746f8e30cc680d0a8128fb964905e from 2020-03-04, with later improvements to the test).

---
As far as I understand, the "testPost()" test does the following:

1. It starts 8 threads.

2. Each thread opens an TLS connection, makes a POST request and writes 16Mb of data in a loop, writing a 128 Kb buffer on each iteration and sleeping 10ms between iterations.

3. At the server a Servlet read the request data into a ByteArrayOutputStream in memory, and once all data is read it echoes all of the data as a response at once.

(The initial size of ByteArrayOutputStream is not specified, so it grows from 32 bytes to 16 Mb as the data are read. Maybe there is some gc activity caused by that growth).

4. Once all data is written, the thread starts reading the response from the server, comparing it with the data that it has written.

5. Once an error occurs (a mismatch between request and response, or an IOException), it is logged to System.err, and a counter is incremented.

An assertion checks that the counter is zero when the test ends. If it is not, the test fails, and the actual error should be looked up in the System.err output.
---

I will attach log output for each observed type of a failure, with some comments.
Comment 1 Konstantin Kolinko 2020-04-06 13:42:20 UTC
Created attachment 37147 [details]
01_TEST-org.apache.tomcat.util.net.TestSsl.NIO2.txt

Out of 4 NIO2 failures, this kind of failure occurred only once.


The error here:
Byte in position [10217184] had value [-1] rather than 1

Notes:
1) Message formatting may be improved.
- There is no LF after the message (it uses print() instead of println()(.
- There are no '[' ']' around '1'.

2) This error means that the servlet echoed less data than expected (10 Mb instead of 16 Mb), and completed it as a normal response, without any sign of an error.

Why?
Comment 2 Konstantin Kolinko 2020-04-06 13:45:53 UTC
Created attachment 37148 [details]
02_TEST-org.apache.tomcat.util.net.TestSsl.NIO2.txt

Out of 4 NIO2 failures, this kind of failure occurred in 3 runs.

The error here is:

java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:71)
	at org.apache.tomcat.util.net.TestSsl$1.run(TestSsl.java:132)
java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:71)
	at org.apache.tomcat.util.net.TestSsl$1.run(TestSsl.java:132)

The same read error occurs for several threads at once. The attached log file has 5 of them.
Comment 3 Remy Maucherat 2020-04-06 13:47:52 UTC
Do you imply what the testPost test does is incorrect ? The amount of allocated memory is as intended, and it's not too extreme (IMO). Maybe the memory could be preallocated using ByteArrayOutputStream(int size) ? I don't think it would make a big difference, this is only a test, not "best practices".

The test does work for me [so far]. As with the rather equivalent HTTP/2 TLS test, it is meant to test large POSTs with TLS since that's what people often complain about.

The test was included in 9.0.33 already (since there was a regression) and there are no relevant changes in 9.0.34 (only NIO got a minor change, but you are saying it works). Why did it work then ?
Comment 4 Konstantin Kolinko 2020-04-06 13:49:20 UTC
Created attachment 37149 [details]
03_TEST-org.apache.tomcat.util.net.TestSsl.APR.txt

An APR failure. I observed it once.

javax.net.ssl.SSLException: bad record MAC
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1022)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:71)
	at org.apache.tomcat.util.net.TestSsl$1.run(TestSsl.java:132)
Caused by: javax.crypto.BadPaddingException: bad record MAC
	at sun.security.ssl.InputRecord.decrypt(InputRecord.java:219)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1017)
	... 4 more

It is a read error as well, but the error message is an odd one.
Comment 5 Michael Osipov 2020-04-06 13:59:02 UTC
I see similar errors in other tests once in a while. Doesn't this simply mean tests are flaky since sockets are not 100% reliable?
Comment 6 Konstantin Kolinko 2020-04-06 14:07:32 UTC
(In reply to Remy Maucherat from comment #3)
> Do you imply what the testPost test does is incorrect ?

I do not see any obvious flaw in the design. I think that many clients behave like that.

The servlet design is a bit unusual for me, as Content-Length of the request is known in advance.

I wonder whether "Connection reset" is caused by some timeout. The test runs for 9-10 seconds for me, and other tests in the testsuite complete in about 1 second.


> The test does work for me [so far]. As with the rather equivalent HTTP/2 TLS
> test, it is meant to test large POSTs with TLS since that's what people
> often complain about.
> 
> The test was included in 9.0.33 already (since there was a regression) and
> there are no relevant changes in 9.0.34 (only NIO got a minor change, but
> you are saying it works). Why did it work then ?

The detailed logging of errors was added just recently, by Mark on April 2nd:

SHA-1: f8da18d362b8a773afd557fdda0fe50a3748c90c
* Failed test on Windows. Expand logging to aid debugging of failures.
Comment 7 Konstantin Kolinko 2020-04-06 14:54:19 UTC
Rerunning the test:

1. Additional 10 runs with the same Java 8 (AdoptOpenJDK 8.0.242.08-hotspot),

- APR failures: 2 - The same "bad record MAC" as earlier (Comment 4) - 03_*txt
- NIO2 failures: 4 - The same "Connection reset" as earlier (Comment 2) - 02_*txt
- NIO failures: 2 - "Byte in position...", similar to Comment 1 - 01_*txt


1. 20 runs with Java 11 (AdoptOpenJDK 11.0.6.10-hotspot):

All runs completed successfully. No issues noted.
Comment 8 Konstantin Kolinko 2020-04-06 14:59:49 UTC
Created attachment 37150 [details]
04_TEST-org.apache.tomcat.util.net.TestSsl.NIO.txt

Java 8  (AdoptOpenJDK 8.0.242.08-hotspot), NIO

The same "Byte in position" failure as in Comment 1 - 01_*.txt
In this test run this error occurred several times:

Byte in position [15393480] had value [-1] rather than 1Byte in position [14987496] had value [-1] rather than 1Byte in position [15173572] had value [-1] rather than 1Byte in position [14886000] had value [-1] rather than 1
Comment 9 Konstantin Kolinko 2020-04-06 15:18:23 UTC
(In reply to Konstantin Kolinko from comment #7)
> Rerunning the test:

3. 20 runs with Java 14 (OpenJDK 14 GA (14.0.0)):

All runs completed successfully. No issues noted.
Comment 10 Remy Maucherat 2020-04-06 15:23:43 UTC
A defect from the testsuite is that it tests either JSSE or OpenSSL, not both [it would need two extra full runs to test it]. Personally, I always use OpenSSL rather than JSSE. It is a bit unexpected that things would work better with Java 11, so I wonder if you could also try OpenSSL.

Adding this property to build.properties switches to it:
test.sslImplementation=org.apache.tomcat.util.net.openssl.OpenSSLImplementation
Comment 11 Mark Thomas 2020-04-15 08:26:45 UTC
I'm wondering if this is a load issue. The test had a fairly low timeout on the server (3s) and I can imagine a situation where that timeout might get triggered if the host machine was under load. I've increased the timeout and I'll run some tests locally.
Comment 12 Mark Thomas 2020-04-15 10:02:40 UTC
I ran the test 20 times on Windows with Java 8 (Adopt OpenJDK 242 64-bit) and didn't see a single error. Once I dropped the connection timeout to 1500ms I started to see errors similar to those reported here.

I've increased the default timeout on this test to 20s so this should be fixed.