Bug 65448 - Download of file via Servlet OutputStream blocks when using SSL
Summary: Download of file via Servlet OutputStream blocks when using SSL
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina
Version: 9.0.50
Hardware: PC All
Importance: P2 major
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-13 19:12 UTC by Roman
Modified: 2021-09-21 08:08 UTC
CC List: 1 user



Attachments
A ZIP file containing curl outputs, threaddumps and system information (35.68 KB, application/zip)
2021-07-13 19:12 UTC, Roman
Tomcat FINE log while the downloading was blocked (38.05 KB, text/plain)
2021-07-14 11:28 UTC, Roman
Logs of v1, with screenshot of cURL/Log-Tail when it blocked (736.10 KB, application/zip)
2021-07-15 06:25 UTC, Roman
Logs of v2, with screenshot of cURL/Log-Tail when it blocked (771.32 KB, application/zip)
2021-07-15 09:41 UTC, Roman

Description Roman 2021-07-13 19:12:43 UTC
Created attachment 37950 [details]
A ZIP file containing curl outputs, threaddumps and system information

We have a very simple Servlet implementation that lets clients download a file.

Here's the code:

try (InputStream in = new BufferedInputStream(new FileInputStream(resourceFile), IOTools.SIZE_64K)) {
  IOTools.pump(in, response.getOutputStream());
}

The pump method copies the bytes from the InputStream to the OutputStream (in blocks of 64K) and flushes the OutputStream after that.
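
For illustration, a sketch of what pump amounts to, reconstructed from this description (the actual IOTools source is not shown in this report):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch only: copies in 64K blocks and flushes at the end,
// matching the behaviour described above.
public static void pump(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[1 << 16]; // 64K copy buffer
    int count;
    while ((count = in.read(buffer)) != -1) {
        out.write(buffer, 0, count);
    }
    out.flush(); // single flush once all bytes are copied
}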

The problem:
When downloading a file with cURL command:

curl -v --insecure "https://btdevapp01:8443/tribefire-services/res/adx/fonts/exactly130703.txt"  --output test.txt

it blocks halfway through the response and waits 1 minute until the download resumes and finishes successfully.

Here are some observations:

* When the downloaded file is smaller than 130.703 bytes, the download always works. Only when the file is greater than or equal to 130.703 bytes can the download fail.

* The 1 min waiting period matches the configured "connectionTimeout" in the server.xml. When we increase that setting to - say - 120.000, we have to wait 2 minutes until the download finishes. (An illustrative connector configuration follows this list.)

* This problem only occurs when using HTTPS. When using the equivalent HTTP address, it works all the time for all file sizes.

* This is not just a problem with cURL. We were also able to reproduce the problem with Chrome 91 and Firefox 90.

* With the help of JMX, we can read out that the Maximum Request Time of the HTTPS connectors is 3487 ms. Hence, Tomcat does not register that the user has to wait over a minute to get the full response.
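
For context, connectionTimeout is an attribute of the HTTPS connector in server.xml. A connector along these lines (illustrative values and placeholder keystore details, not our actual configuration):

<!-- Illustrative NIO HTTPS connector. connectionTimeout is in milliseconds,
     so 60000 matches the observed 1 min wait and 120000 the 2 min wait. -->
<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
           SSLEnabled="true" scheme="https" secure="true"
           connectionTimeout="60000">
    <SSLHostConfig>
        <Certificate certificateKeystoreFile="conf/keystore.jks"
                     certificateKeystorePassword="changeit" type="RSA" />
    </SSLHostConfig>
</Connector>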


We were not able to reproduce the problem in our local environments, but this might be because we mostly use Macs, while the target platform where we encounter the problem is Windows Server 2019.

I am attaching the following:

* cURL output of downloading the same file twice, once when it succeeds (curl-working.txt) and once when it fails (curl-failing.txt)

* Threaddumps of the server, one while it is idle (threaddump-idle.txt), one when it is blocking (threaddump-blocking.txt) (where I cannot find anything suspicious).

* Information about the server (Tomcat and Host) in server-info.txt

* A video showing how downloading the same file works 7 times, then fails, and after that works again.... (edit: this exceeds the limit of the ticketing system but can be provided on request)

I understand that this is very little to work with but this is all the information that we could compile.
Comment 1 Roman 2021-07-13 19:15:06 UTC
An important thing that I forgot: when downloading files directly via Tomcat, it also always works.
Only the combination of the Servlet with HTTPS causes issues. We have no problem when using HTTP.
Comment 2 Michael Osipov 2021-07-13 20:49:44 UTC
If the DefaultServlet is working via HTTPS then the error is in your code. Did you check that?
Comment 3 Roman 2021-07-13 21:36:30 UTC
Yes, absolutely. We even pulled the code out to a separate app to check it individually.

My main point is that it works when using HTTP. And it is really a small piece of code that I would consider to be fine.
Comment 4 Roman 2021-07-13 21:40:39 UTC
One other thing: how would the code be influenced by the connection timeout of the server?
It's not that the client opens a new connection or anything... the server just pauses sending the response. I would not even know how to deliberately write the code so that it is related to the connection timeout.
Comment 5 Mark Thomas 2021-07-14 06:35:43 UTC
A few questions (there will be more depending on the answers to these):

1. Does this happen with Tomcat 9.0.48?

2. How often does the problem occur? 1 request in 5? 1 in 10? 1 in 20?

3. Does the problem still occur if you disable the Tomcat Native library (your connector should appear as "https-nio-jsse" rather than "https-nio-openssl" in the logs and thread dumps)?

4. Does the problem occur if you switch to NIO2 and OpenSSL (connector will appear as "https-nio2-openssl" in the logs)?

5. Does the problem occur if you switch to NIO2 without OpenSSL (connector will appear as "https-nio2-jsse" in the logs)?

6. Are you able to provide the simplest servlet that reproduces this issue?

7. If we provide custom Tomcat builds (based on a specific commit, possibly with additional debug logging), are you able to test with that version to see if the problem still exists? (If the problem appeared recently, there are a couple of changes that might be responsible. Testing with custom builds will let us test this theory.)
Comment 6 Roman 2021-07-14 07:04:48 UTC
I will hopefully be able to answer your questions later today. This is happening on a customer site and working remotely is quite...painful.

What I can answer at the moment:

1, 3, 4, 5. Will have to check later.

2. The problem occurs with no obvious pattern.
Yesterday, I was able to reproduce it almost always. Today, when I retried, it took 5-10 tries to reproduce it. The Windows env was not rebooted, but I restarted Tomcat to add some debug information myself.

6. Yes, this is the main essence of the code (I stripped away everything not related to the actual problem):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.servlet.ServletException;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ServletForAsf extends HttpServlet {

	private static final long serialVersionUID = 3266630280219605924L;

	@Override
	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		String resourcePath = request.getPathInfo();
		File resourceFile = new File(resourcePath);

		ServletOutputStream outputStream = response.getOutputStream();

		int bufferSize = 1 << 16;
		try (InputStream in = new BufferedInputStream(new FileInputStream(resourceFile), bufferSize)) {
			final byte[] buffer = new byte[bufferSize];

			int count;
			long totalCount = 0;

			while ((count = in.read(buffer)) != -1) {
				try {
					outputStream.write(buffer, 0, count);
				} catch (Exception e) {
					throw new IOException("Error while transferring data. Data transferred so far: " + totalCount + ". Current buffer size: " + count, e);
				}
				totalCount += count;
			}
			outputStream.flush();
		}
	}
}

7. Yes, testing a custom Tomcat would be possible. Fortunately, the environment where this is happening is a test env and the customer is rather friendly.


In the meantime, the comment about the DefaultServlet made me think (and debug).
It seems that the DefaultServlet does not take the same route as our Servlet does.

When the DefaultServlet serves a file, it is not done within the Servlet directly (as far as I can tell) but rather in

Http11Processor.processSendfile(SocketWrapperBase<?>) line: 1359

The main thing is that this eventually calls 
sd.fchannel.close();

When I followed the path of our servlet, I was not able to find an actual flush() call on the underlying socket.

So, my current thinking is that the data is not flushed to the client successfully and that is why it needs to wait for the connection timeout.
Comment 7 Roman 2021-07-14 08:12:00 UTC
Update on question 1:
Yes, I can reproduce the problem with Tomcat 9.0.48.

To downgrade the server, I replaced

* lib/*
* bin/bootstrap.jar
* bin/commons-daemon.jar
* bin/tomcat9*.exe
* bin/tcnative-1.dll
Comment 8 Roman 2021-07-14 09:08:16 UTC
And here are the answers for the other questions:

3. APR disabled (org.apache.catalina.core.AprLifecycleListener commented out)
Threads: https-jsse-nio-8443-exec-*
Not reproducible

4. APR enabled, NIO2 (replaced org.apache.coyote.http11.Http11NioProtocol by org.apache.coyote.http11.Http11Nio2Protocol)
Threads: https-openssl-nio2-8443-exec-*
Not reproducible

5. APR disabled, NIO2 (org.apache.catalina.core.AprLifecycleListener commented out, replaced org.apache.coyote.http11.Http11NioProtocol by org.apache.coyote.http11.Http11Nio2Protocol)
Threads: https-jsse-nio2-8443-exec-*
Not reproducible


So, that basically means that this only happens with NIO and APR.
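
For reference, these are the kinds of server.xml changes behind the tests above (illustrative excerpts, not our actual configuration; TLS certificate settings omitted):

<!-- 3/5: disable APR/OpenSSL by commenting out the listener -->
<!-- <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" /> -->

<!-- 4/5: switch the connector protocol from NIO to NIO2 -->
<Connector port="8443"
           protocol="org.apache.coyote.http11.Http11Nio2Protocol"
           SSLEnabled="true" scheme="https" secure="true" />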
Comment 9 Mark Thomas 2021-07-14 10:38:45 UTC
Thanks for all the information. 

That you still see the issue with 9.0.48 means that the response buffer flushing changes in 9.0.49 are not the root cause of this issue.

That the observation of the issue is limited to NIO+OpenSSL and not (NIO+JSSE, NIO2+OpenSSL or NIO2+JSSE) suggests where the root cause may lie but it is possible that NIO+OpenSSL just happens to have the right timing to trigger an issue that could affect other combinations.

Similarly, that using sendfile appears to avoid the issue suggests where the root cause may lie.

I will set up a Windows 2019 VM to see if I can reproduce this.

Meanwhile, could you test with 9.0.46? There were some NIO changes between 9.0.46 and 9.0.48 that it would be helpful to rule in/out.

And another question. Can you recreate this with debug logging enabled? Specifically with the following in logging.properties:
org.apache.coyote.level = FINE
org.apache.tomcat.util.net.level = FINE
If yes, please provide a copy of the debug logs for when the issue occurred.
Comment 10 Roman 2021-07-14 11:27:37 UTC
I tested with 9.0.46 and was not able to reproduce the problem.
(Thread names: http-nio-8080-exec-*, APR on, NIO)

I will attach the log file containing the requested FINE output. The file only contains log lines that have not been produced by our software. If you need the full log, please let me know.

To reproduce the problem, I have to constantly run the cURL command. So there might be remnants from a previous run in the log, but the last entries are the ones that were written when the download was blocked.

The system that I am testing on *might* be used by the customer, so there is a slight chance that there is traffic in the log that is not related to my download, but I hope this was not the case here.
Comment 11 Roman 2021-07-14 11:28:05 UTC
Created attachment 37953 [details]
Tomcat FINE log while the downloading was blocked
Comment 12 Roman 2021-07-14 11:38:09 UTC
For the latest tests, I have reduced the connection timeout to 30s (which might be good to know when reading the log file).
Comment 13 Mark Thomas 2021-07-14 12:05:55 UTC
(In reply to Roman from comment #10)
> I tested with 9.0.46 and was not able to reproduce the problem.
> (Thread names: http-nio-8080-exec-*, APR on, NIO)

That is an HTTP connector. Are you sure you tested with HTTPS?

> I will attach the log file containing the requested FINE output. The file
> only contains log lines that have not been produced by our software. If you
> need the full log, please let me know.
> 
> To reproduce the problem, I have to constantly run the cURL command. So
> there might be remanants from the previous run in the log, but the last
> entries are the ones that were written when the download was blocked.
> 
> The system that I am testing on *might* be used by the customer, so there is
> a slight chance that there is traffic in the log that is not related to my
> download, but I hope this was not case here.

Thanks. Still getting my test system set up. I'll look at the logs after that.
Comment 14 Roman 2021-07-14 12:07:59 UTC
> That is an HTTP connector. ARe you sure you tested with HTTPS?

Yes, I just took the first exec thread name that I found in the threaddump. Sorry for the confusion.
Comment 15 Mark Thomas 2021-07-14 14:53:13 UTC
Debug logs aren't very helpful. If I am unable to recreate this, we may need to start down the custom Tomcat build route.

Which version of Java are you using?
Comment 16 Mark Thomas 2021-07-14 16:24:59 UTC
I haven't been able to recreate this issue locally. Also, the debug logs didn't have much information on writes - which is where the problem appears to be.

Therefore, I think it is time to move to the custom Tomcat build.

The source code for the first build (and any future builds) can be found at:
https://github.com/markt-asf/tomcat/tree/bz-65448

The binaries can be found at:
https://people.apache.org/~markt/dev/

Please can you retest with v9.0.51-bz-65448-v1 and post the resulting logs. The new log messages are at INFO level. Please aim to retain these log messages in their entirety - the timestamps, thread names and messages are all equally important.

v1 adds INFO level logging at the point where plaintext data is encrypted and written to the network. Depending on what the logs look like, v2 will add additional logging either further up or lower down the call stack to track down the root cause.
Comment 17 Roman 2021-07-14 19:11:32 UTC
Ok, I will do this tomorrow.
Thanks for your help!

In the meantime, this is the Java version we use:

Specification: Java Virtual Machine Specification 11 (Oracle Corporation)
Virtual Machine: OpenJDK 64-Bit Server VM 11.0.9.1+1 (AdoptOpenJDK)
Java Vendor: AdoptOpenJDK
Java Version: 11.0.9.1
Comment 18 Roman 2021-07-14 19:16:55 UTC
One more thing: I took a threaddump of the server while the download was blocking and did not see any thread that was out of the ordinary. So, I would assume that the write to the socket may have already finished, but not yet been committed (i.e., flushed). Hence, perhaps, the wait for the socket timeout, after which the client suddenly receives the remainder of the response.
This also fits with the statistics, which show that the maximum request processing time was just a few seconds, in contrast to the actual waiting time of 30 seconds (or more, depending on the timeout config).
Comment 19 Roman 2021-07-15 06:25:25 UTC
Created attachment 37955 [details]
Logs of v1, with screenshot of cURL/Log-Tail when it blocked
Comment 20 Roman 2021-07-15 06:26:35 UTC
I have attached the full logs of the test run with v1.
It took quite a few tries to reproduce it, but I took a screenshot at the time it happened so that you can correlate the blocking with the timestamps in the log.
Comment 21 Mark Thomas 2021-07-15 08:44:14 UTC
Thanks for running the test. It is likely that the issue was harder to reproduce due to the additional logging.

You appear to be using a non-default log format. It would be helpful if the thread name could be restored, but I can work without it if that makes the issue harder to reproduce.

Reviewing the logs has identified something I want to look at a little more closely. I've reworked the log messages to try and reduce the impact of the logging.

Please can you retest with v9.0.51-bz-65448-v2 and post the resulting logs.
Comment 22 Mark Thomas 2021-07-15 09:30:45 UTC
I have been able to recreate something locally that may well be the same issue you are seeing. The v2 logs should confirm it. I'm working on a fix.
Comment 23 Roman 2021-07-15 09:41:16 UTC
Created attachment 37956 [details]
Logs of v2, with screenshot of cURL/Log-Tail when it blocked
Comment 24 Roman 2021-07-15 09:42:22 UTC
Sounds great!
I have attached the logs of the second run. This time, it blocked on the very first try.
The second time, it worked normally.
Unfortunately, we are using our own log manager, so I am not sure whether we can include the thread names.
Comment 25 Mark Thomas 2021-07-15 18:28:17 UTC
I think this is fixed. Please test v9.0.51-171a2cd and report back. Debug logging has been removed, so you can remove any additional debug log settings from logging.properties.
Comment 26 Roman 2021-07-15 19:45:08 UTC
With v9.0.51-171a2cd, I can no longer reproduce the problem (with NIO+APR).
Very well done, thank you!
Out of curiosity, what was the fix? Is there a git commit that would show the necessary changes?
Again, thank you for your quick help and support!
Comment 27 Roman 2021-07-15 19:57:55 UTC
It would also be interesting to know what makes this environment so special that the issue is so hard to reproduce.
Comment 28 Mark Thomas 2021-07-15 21:06:34 UTC
I was waiting for your test results before pushing the fix. This is the 9.0.x fix:
https://github.com/apache/tomcat/commit/171a2cdc

The reason it is hard to reproduce is that it requires a specific set of circumstances that are affected by timing.

The NIO connector uses a Poller when it needs to simulate blocking. If a write can't complete immediately, the socket is added to the Poller and the write continues when the Poller signals that the socket is ready for write.

The problem was the way Tomcat handled TLS writes. If the final non-encrypted write for a response was fully encrypted but the encrypted data could not be fully written, the socket was not added to the Poller. The final block of encrypted data sat in the buffers until the connection timed out - at which point it was flushed.
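
In pseudo-Java, the broken write path was along these lines (a simplified illustration with placeholder names, not the actual Tomcat source; see the commit above for the real change):

// Simplified illustration only - placeholder names, not Tomcat's actual code.
// At this point the final block of plaintext has been fully encrypted
// ("wrapped") into netOutBuffer by the SSL engine.
socketChannel.write(netOutBuffer); // non-blocking write of the encrypted data
if (netOutBuffer.hasRemaining()) {
    // This registration was the missing piece: without it, nothing ever
    // signalled "ready for write" again, and the leftover encrypted bytes
    // sat in the buffer until the connection timeout flushed them.
    poller.add(socketWrapper, SelectionKey.OP_WRITE);
}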

I was able to simulate this by configuring a VM with a slow connection (128kbps) and then writing an ~80k file. By carefully adjusting the length of the file, I was able to trigger this every time.

In a real system, triggering the issue would depend on how busy the network stack was, how quickly the client read, the exact size of the response (including headers) etc. The odds of hitting the exact combination are slim.

Fixed in:
- 10.1.x for 10.1.0-M3 onwards
- 10.0.x for 10.0.9 onwards
- 9.0.x for 9.0.51 onwards
Comment 29 Roman 2021-07-15 21:13:44 UTC
Thanks for the explanation and the fix!
Comment 30 Mark Thomas 2021-09-21 08:08:57 UTC
Restore correct version and OS info. This should not be edited.