| Summary: | Download of file via Servlet OutputStream blocks when using SSL | | |
|---|---|---|---|
| Product: | Tomcat 9 | Reporter: | Roman <roman> |
| Component: | Catalina | Assignee: | Tomcat Developers Mailing List <dev> |
| Status: | RESOLVED FIXED | | |
| Severity: | major | CC: | Thomas.Hoffmann |
| Priority: | P2 | | |
| Version: | 9.0.50 | | |
| Target Milestone: | ----- | | |
| Hardware: | PC | | |
| OS: | All | | |
Attachments:

* A ZIP file containing curl outputs, threaddumps and system information
* Tomcat FINE log while the downloading was blocked
* Logs of v1, with screenshot of cURL/Log-Tail when it blocked
* Logs of v2, with screenshot of cURL/Log-Tail when it blocked
Description

Roman 2021-07-13 19:12:43 UTC
Important thing that I forgot: when downloading files directly via Tomcat, it also works all the time. Only the combination of the servlet with HTTPS causes issues. We have no problem when using HTTP.

If the DefaultServlet works via HTTPS, then the error is in your code. Did you check that?

Yes, absolutely. We even pulled the code out into a separate app to check it individually. My main point is that it works when using HTTP, and it is really a small piece of code that I would consider to be fine. One other thing: how would the code be influenced by the connection timeout of the server? It's not that the client opens a new connection or anything; it's just that the server pauses sending the response. I would not know how to write code on purpose so that it is tied to the connection timeout.

A few questions (there will be more depending on the answers to these):

1. Does this happen with Tomcat 9.0.48?
2. How often does the problem occur? 1 request in 5? 1 in 10? 1 in 20?
3. Does the problem still occur if you disable the Tomcat Native library (your connector should appear as "https-nio-jsse" rather than "https-nio-openssl" in the logs and thread dumps)?
4. Does the problem occur if you switch to NIO2 and OpenSSL (connector will appear as "https-nio2-openssl" in the logs)?
5. Does the problem occur if you switch to NIO2 without OpenSSL (connector will appear as "https-nio2-jsse" in the logs)?
6. Are you able to provide the simplest servlet that reproduces this issue?
7. If we provide custom Tomcat builds (based on a specific commit, possibly with additional debug logging), are you able to test with that version to see if the problem still exists? (If the problem appeared recently, there are a couple of changes that might be responsible. Testing with custom builds will let us test this theory.)

I will hopefully be able to answer your questions later today. This is happening at a customer site and working remotely is quite... painful. What I can answer at the moment:

1, 3, 4, 5: Will have to check later.

2. The problem occurs with no obvious pattern. Yesterday, I was able to reproduce it almost always. Today, when I retried, it took 5-10 tries to reproduce it. The Windows environment was not rebooted, but I restarted Tomcat to add some debug information myself.

6. Yes, this is the main essence of the code (I stripped away everything not related to the actual problem):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.servlet.ServletException;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ServletForAsf extends HttpServlet {

    private static final long serialVersionUID = 3266630280219605924L;

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String resourcePath = request.getPathInfo();
        File resourceFile = new File(resourcePath);
        ServletOutputStream outputStream = response.getOutputStream();
        int bufferSize = 1 << 16;
        try (InputStream in = new BufferedInputStream(new FileInputStream(resourceFile), bufferSize)) {
            final byte[] buffer = new byte[bufferSize];
            int count;
            long totalCount = 0;
            while ((count = in.read(buffer)) != -1) {
                try {
                    outputStream.write(buffer, 0, count);
                } catch (Exception e) {
                    throw new IOException("Error while transferring data. Data transferred so far: "
                            + totalCount + ". Current buffer size: " + count, e);
                }
                totalCount += count;
            }
            outputStream.flush();
        }
    }
}
```

7. Yes, testing a custom Tomcat would be possible. Fortunately, the environment where this is happening is a test environment and the customer is rather friendly.

In the meantime, the comment about the DefaultServlet made me think (and debug). It seems that the DefaultServlet does not take the same route as our servlet does. When the DefaultServlet serves a file, the work is not done within the servlet directly (as far as I can tell) but rather in Http11Processor.processSendfile(SocketWrapperBase<?>), line 1359. The main thing is that this eventually calls sd.fchannel.close(). When I follow the path of our servlet, I was not able to find an actual flush() call on the underlying socket. So my current thinking is that the data is not flushed to the client successfully, which is why it has to wait for the connection timeout.

Update on question 1: yes, I can reproduce the problem with Tomcat 9.0.48. To downgrade the server, I replaced:

* lib/*
* bin/bootstrap.jar
* bin/commons-daemon.jar
* bin/tomcat9*.exe
* bin/tcnative-1.dll

And here are the answers for the other questions:

3. APR disabled (org.apache.catalina.core.AprLifecycleListener commented out). Threads: https-jsse-nio-8443-exec-*. Not reproducible.

4. APR enabled, NIO2 (replaced org.apache.coyote.http11.Http11NioProtocol with org.apache.coyote.http11.Http11Nio2Protocol). Threads: https-openssl-nio2-8443-exec-*. Not reproducible.

5. APR disabled, NIO2 (org.apache.catalina.core.AprLifecycleListener commented out, replaced org.apache.coyote.http11.Http11NioProtocol with org.apache.coyote.http11.Http11Nio2Protocol). Threads: https-jsse-nio2-8443-exec-*. Not reproducible.

So this basically means that the problem only happens with NIO and APR.

Thanks for all the information. That you still see the issue with 9.0.48 means that the response buffer flushing changes in 9.0.49 are not the root cause of this issue. That the observation of the issue is limited to NIO+OpenSSL and not NIO+JSSE, NIO2+OpenSSL or NIO2+JSSE suggests where the root cause may lie, but it is possible that NIO+OpenSSL just happens to have the right timing to trigger an issue that could affect other combinations. Similarly, that using sendfile appears to avoid the issue suggests where the root cause may lie.

I will set up a Windows 2019 VM to see if I can reproduce this. Meanwhile, could you test with 9.0.46? There were some NIO changes between 9.0.46 and 9.0.48 that it would be helpful to rule in or out.

And another question: can you recreate this with debug logging enabled? Specifically with the following in logging.properties:

```properties
org.apache.coyote.level = FINE
org.apache.tomcat.util.net.level = FINE
```

If yes, please provide a copy of the debug logs for when the issue occurred.

I tested with 9.0.46 and was not able to reproduce the problem. (Thread names: http-nio-8080-exec-*, APR on, NIO.)

I will attach the log file containing the requested FINE output. The file only contains log lines that were not produced by our software. If you need the full log, please let me know.

To reproduce the problem, I have to run the cURL command constantly. So there might be remnants from previous runs in the log, but the last entries are the ones that were written when the download was blocked.
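(For reference, a reproduction loop of the kind described above might look like this; host, port and path are hypothetical placeholders, not the actual test system:)

```sh
# Repeatedly download the file through the servlet over HTTPS until a run
# blocks. -k skips certificate verification on the test system; on Windows,
# write to NUL instead of /dev/null.
while true; do
    curl -k -sS -o /dev/null "https://testhost:8443/servletforasf/files/big.bin"
done
```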
The system that I am testing on *might* be used by the customer, so there is a slight chance that there is traffic in the log that is not related to my download, but I hope this was not the case here.

Created attachment 37953 [details]
Tomcat FINE log while the downloading was blocked
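(The FINE output above is driven by the two logger entries requested earlier; note that java.util.logging only emits FINE records if the handler levels admit them as well. A minimal sketch, assuming the stock Tomcat 9 conf/logging.properties and its default handler names:)

```properties
# Sketch, assuming the stock conf/logging.properties shipped with Tomcat 9.
# The two loggers requested for this issue:
org.apache.coyote.level = FINE
org.apache.tomcat.util.net.level = FINE

# The handlers must also admit FINE records; in the default Tomcat
# configuration they are already set to FINE:
1catalina.org.apache.juli.AsyncFileHandler.level = FINE
java.util.logging.ConsoleHandler.level = FINE
```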
For the latest tests, I have reduced the connection timeout to 30s (which might be good to know when reading the log file).

(In reply to Roman from comment #10)
> I tested with 9.0.46 and was not able to reproduce the problem.
> (Thread names: http-nio-8080-exec-*, APR on, NIO)

That is an HTTP connector. Are you sure you tested with HTTPS?

> I will attach the log file containing the requested FINE output. The file
> only contains log lines that were not produced by our software. If you
> need the full log, please let me know.
>
> To reproduce the problem, I have to run the cURL command constantly. So
> there might be remnants from previous runs in the log, but the last
> entries are the ones that were written when the download was blocked.
>
> The system that I am testing on *might* be used by the customer, so there is
> a slight chance that there is traffic in the log that is not related to my
> download, but I hope this was not the case here.

Thanks. Still getting my test system set up. I'll look at the logs after that.

> That is an HTTP connector. Are you sure you tested with HTTPS?
Yes, I just took the first exec thread name that I found in the threaddump. Sorry for the confusion.
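(The connector variants discussed above are selected in server.xml. A sketch along these lines, with illustrative port, timeout and certificate values rather than the actual customer configuration:)

```xml
<!-- Sketch of the tested connector variants; port, timeout and certificate
     values are illustrative placeholders. -->

<!-- NIO connector (the combination that reproduces the problem when the
     APR/Native library is loaded via AprLifecycleListener, i.e. TLS is
     handled by OpenSSL): -->
<Connector protocol="org.apache.coyote.http11.Http11NioProtocol"
           port="8443" SSLEnabled="true" scheme="https" secure="true"
           connectionTimeout="30000">
    <SSLHostConfig>
        <Certificate certificateKeystoreFile="conf/keystore.jks"
                     certificateKeystorePassword="changeit" />
    </SSLHostConfig>
</Connector>

<!-- NIO2 variant: use protocol="org.apache.coyote.http11.Http11Nio2Protocol".
     JSSE vs. OpenSSL is determined by whether the AprLifecycleListener in
     server.xml is active (OpenSSL) or commented out (JSSE). -->
```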
Debug logs aren't very helpful. If I am unable to recreate this, we may need to start down the custom Tomcat build route. Which version of Java are you using?

I haven't been able to recreate this issue locally. Also, the debug logs didn't have much information on writes, which is where the problem appears to be. Therefore, I think it is time to move to the custom Tomcat build. The source code for the first build (and any future builds) can be found at:
https://github.com/markt-asf/tomcat/tree/bz-65448
The binaries can be found at:
https://people.apache.org/~markt/dev/
Please can you retest with v9.0.51-bz-65448-v1 and post the resulting logs. The new logs are INFO level logs. Please aim to retain these log messages in their entirety; the timestamps, thread names and messages are equally important. v1 adds INFO level logging at the point where plaintext data is encrypted and written to the network. Depending on what the logs look like, v2 will add additional logging either further up or lower down the call stack to track down the root cause.

Ok, I will do this tomorrow. Thanks for your help! In the meantime, this is the Java version we use:

Specification: Java Virtual Machine Specification 11 (Oracle Corporation)
Virtual Machine: OpenJDK 64-Bit Server VM 11.0.9.1+1 (AdoptOpenJDK)
Java Vendor: AdoptOpenJDK
Java Version: 11.0.9.1

One more thing: I took a thread dump of the server while the download was blocking and did not see any thread that was out of the ordinary. So I would assume that the write to the socket may have already finished, but not yet been committed (i.e., flushed); hence the wait for the socket timeout before the client suddenly receives the remainder of the response. This also fits the statistics, which show a maximum request processing time of just a few seconds, in contrast to the actual waiting time of 30 seconds (or more, depending on the timeout configuration).

Created attachment 37955 [details]
Logs of v1, with screenshot of cURL/Log-Tail when it blocked
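(The thread dump mentioned above can be captured with the JDK's standard tooling; the process id is a placeholder:)

```sh
# Capture a thread dump of the running Tomcat JVM; replace <pid> with the
# actual process id, e.g. as reported by "jps -l".
jstack <pid> > threaddump.txt
```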
I have attached the full logs of the test run with v1. It took quite a few tries to actually reproduce it, but I made a screenshot at the time it happened so that you can correlate the blocking with timestamps in the log.

Thanks for running the test. It is likely that the issue was harder to reproduce due to the additional logging. You appear to be using a non-default log format. It would be helpful if the thread name could be restored, but I can work without it if that makes the issue harder to reproduce. Reviewing the logs has identified something I want to look at a little more closely. I've reworked the log messages to try to reduce the impact of the logging. Please can you retest with v9.0.51-bz-65448-v2 and post the resulting logs.

I have been able to recreate something locally that may well be the same issue you are seeing. The v2 logs should confirm it. I'm working on a fix.

Created attachment 37956 [details]
Logs of v2, with screenshot of cURL/Log-Tail when it blocked
Sounds great! I have attached the logs of the second run. This time, it blocked on the very first try; the second time, it worked normally. Unfortunately, we are using our own log manager and thus I am not sure whether we can include the thread names.

I think this is fixed. Please test v9.0.51-171a2cd and report back. The debug logging has been removed; you can remove any additional debug log settings from logging.properties.

With v9.0.51-171a2cd, I can no longer reproduce the problem (with NIO+APR). Very well done, thank you! Out of curiosity, what was the fix? Is there a git commit that shows the necessary changes? Again, thank you for your quick help and support! It would also be interesting to know what makes this environment so special that the issue is so hard to reproduce.

I was waiting for your test results before pushing the fix. This is the 9.0.x fix:
https://github.com/apache/tomcat/commit/171a2cdc

The reason it is hard to reproduce is that it requires a specific set of circumstances that are affected by timing. The NIO connector uses a Poller when it needs to simulate blocking. If a write can't complete immediately, the socket is added to the Poller and the write continues when the Poller signals that the socket is ready for write. The problem was the way Tomcat handled TLS writes. If the final non-encrypted write for a response was fully encrypted but the encrypted data could not be fully written, the socket was not added to the Poller. The final block of encrypted data sat in the buffers until the connection timed out, at which point it was flushed.

I was able to simulate this by configuring a VM with a slow connection (128 kbps) and then writing an ~80k file. By carefully adjusting the length of the file, I was able to trigger this every time. In a real system, triggering the issue would depend on how busy the network stack was, how quickly the client read, the exact size of the response (including headers), etc. The odds of hitting the exact combination are slim.

Fixed in:
- 10.1.x for 10.1.0-M3 onwards
- 10.0.x for 10.0.9 onwards
- 9.0.x for 9.0.51 onwards

Thanks for the explanation and the fix!
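(To make the failure mode concrete, here is a deliberately simplified, hypothetical sketch of the pattern described in the fix explanation above. It is not the actual Tomcat source; the Poller interface is a stand-in for Tomcat's internal one, and buffer handshake details are omitted.)

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;
import javax.net.ssl.SSLEngine;

public class TlsFlushSketch {

    /** Stand-in for Tomcat's internal NIO Poller. */
    interface Poller {
        void register(SocketChannel socket, int ops);
    }

    /**
     * Sketch of a blocking-write completion check for a TLS connection.
     * Remaining plaintext is encrypted with SSLEngine.wrap() into netOut,
     * then netOut is written to the non-blocking socket, which may accept
     * only part of it.
     */
    static boolean writeAndCheckComplete(ByteBuffer plaintext, ByteBuffer netOut,
            SSLEngine engine, SocketChannel socket, Poller poller) throws IOException {
        netOut.clear();
        engine.wrap(plaintext, netOut);   // encrypt the remaining plaintext
        netOut.flip();
        socket.write(netOut);             // may leave bytes behind in netOut

        // BUG (as described above): only the plaintext buffer was checked.
        // If wrap() consumed all the plaintext but the socket did not accept
        // all of netOut, this reports "complete", the socket is never
        // registered with the Poller, and the last encrypted block sits in
        // the buffer until the connection times out.
        boolean complete = !plaintext.hasRemaining();

        // Fix, in spirit: the write is only complete once the *encrypted*
        // buffer has drained too; otherwise wait for a write-ready event.
        // boolean complete = !plaintext.hasRemaining() && !netOut.hasRemaining();

        if (!complete) {
            poller.register(socket, SelectionKey.OP_WRITE);
        }
        return complete;
    }
}
```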