Created attachment 38291 [details] server-side tcpdump We're occasionally encountering a problem where the last encrypted block of an outbound websocket payload is not being sent over TLS. This looks very similar to bug 65448, but this is not simulating blocking and this is not when sending a file. It's async I/O over NIO for outbound websocket traffic. Most of the time it works, but occasionally it does not. We can usually reproduce the problem in a few hours when sending a request every 2 seconds. The problem only occurs over TLS, not over unencrypted channels. It reproduces with both openssl (via tomcat native) and Java's TLS impl. The outbound payload being sent in this test is 105493 bytes (encrypted). I've attached a tcpdump that shows the behavior, but I'll copy a bit of it here.... Here's an example of a working request/response (I have removed the client acks for brevity, but they can be seen in the attachment)... # Begin Request 3 16:42:43.728300 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4338:4390, ack 4052, win 8210, length 52 # Begin Response 3 (success) Full response is 105493 bytes 16:42:43.735921 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 4052:6972, ack 4390, win 311, length 2920 16:42:43.735936 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 6972:9892, ack 4390, win 311, length 2920 16:42:43.735947 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 9892:12812, ack 4390, win 311, length 2920 16:42:43.735954 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 12812:15732, ack 4390, win 311, length 2920 16:42:43.735962 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 15732:18652, ack 4390, win 311, length 2920 16:42:43.736675 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 18652:24492, ack 4390, win 311, length 5840 16:42:43.736695 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 24492:30332, ack 4390, win 311, length 5840 16:42:43.736702 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 30332:36172, ack 4390, win 311, length 5840 16:42:43.736720 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 36172:44932, ack 4390, win 311, length 8760 16:42:43.737386 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 44932:58072, ack 4390, win 311, length 13140 16:42:43.737457 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 58072:72672, ack 4390, win 311, length 14600 16:42:43.738146 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 72672:82892, ack 4390, win 311, length 10220 16:42:43.738187 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 82892:100412, ack 4390, win 311, length 17520 16:42:43.738926 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 100412:109545, ack 4390, win 311, length 9133 # End Response 3 (all 105493 bytes sent and acknowledged) And here is an example of a response that demonstrates the problem (again with client acks removed)... # Begin Request 4 16:42:45.750396 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4390:4442, ack 109545, win 8212, length 52 # Begin Response 4 (failure) Full response is 105493 bytes 16:42:45.757878 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 109545:125958, ack 4442, win 311, length 16413 16:42:45.757953 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 125958:142371, ack 4442, win 311, length 16413 16:42:45.758108 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 142371:146751, ack 4442, win 311, length 4380 16:42:45.758765 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 146751:165731, ack 4442, win 311, length 18980 16:42:45.758832 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 165731:181791, ack 4442, win 311, length 16060 16:42:45.759519 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 181791:197851, ack 4442, win 311, length 16060 16:42:45.759635 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 197851:210991, ack 4442, win 311, length 13140 ########### START OF PROBLEM ########### # Only 101446 bytes sent so far. This last block of 4047 bytes is not sent until the connection begins closing over a minute after the previous block was sent. 16:44:21.222056 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 210991:215038, ack 4442, win 311, length 4047 # End Response 4 (all 105493 bytes sent and acknowledged) ########### END OF PROBLEM ############# # Begin closing handshake 16:44:21.222117 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 215038:215119, ack 4442, win 311, length 81 16:44:21.222812 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [.], ack 215119, win 8212, length 0 16:44:21.237321 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4442:4527, ack 215119, win 8212, length 85 16:44:21.237428 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4527:4558, ack 215119, win 8212, length 31 16:44:21.237503 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [F.], seq 4558, ack 215119, win 8212, length 0 16:44:21.237786 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], ack 4559, win 311, length 0 16:44:21.239201 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 215119:215150, ack 4559, win 311, length 31 16:44:21.239234 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [F.], seq 215150, ack 4559, win 311, length 0 16:44:21.240606 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [R.], seq 4559, ack 215150, win 0, length 0 Unfortunately, I do not have a simple reproducible project yet, but I'm hoping you might have an idea where to look, since this looks very similar to bug 65448
Created attachment 38292 [details] server-side application log
After a bit of investigation, my rudimentary understanding is that 1) a buffer of unencrypted data makes it down to SecureNioChannel.write https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L786 2) SecureNioChannel.write encrypts it into a new buffer (netOutBuffer) https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L810 3) And then flushes that encrypted buffer https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L824 4) But flush is not guaranteed to write all the encrypted bytes to the channel, and returns false if the buffer was not completely emptied... https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L120-L127 But the return value is ignored, and there's no subsequent re-flush until the next write. Therefore, the unwritten encrypted data remains in the netOutBuffer.
Thanks for the analysis. What you describe makes sense to me. It looks like the non-blocking variant of bug 65448. My current thinking is that SocketWrapper.flushNonBlocking() needs an NIO specific implementation that can take account of the status of netOutBuffer in NioSecureChannel. Once we have a potential fix, are you in a position to either builds Tomcat 9.0.x yourself from source and/or test a 9.0.x dev build I provide for you?
> Once we have a potential fix, are you in a position to either builds Tomcat 9.0.x yourself from source and/or test a 9.0.x dev build I provide for you? Yes, Definitely! I'd prefer a dev build to be provided if possible, but I'm sure I could spend some time to build it myself if needed.
Great. I'll do a dev build for you. It isn't that tricky but I am all set up for it. I think I know what the problem is. I should have something for you to test on Monday.
Perfect. Also note, my project uses embedded tomcat. Specifically these three artifacts: org.apache.tomcat.embed:tomcat-embed-core org.apache.tomcat.embed:tomcat-embed-el org.apache.tomcat.embed:tomcat-embed-websocket I assume the build will produce everything, but just wanted to be clear what I'm using.
A 9.0.x dev build is available here: https://people.apache.org/~markt/dev/v9.0.64-dev-b311c0a/ Usual caveats apply. This isn't an official release. Use at your own risk. It is the standard release layout so the JARs you want will be in bin/embed/apache-tomcat-9.0.64-dev-b311c0a-embed.tar.gz
Thanks Mark! I'll get the test started today. The test will need to run for a while before I'm confident in the fix.
The test using the dev build unfortunately failed after about 3 hours with the same symptom. After it failed, a re-ran it with a breakpoint in NioEndpoint.flushNonBlocking to see the new code in action, but my breakpoint unfortunately never gets hit when sending outbound websocket messages. (It does get hit when sending a normal HTTP response, so I can confirm the new code is active and the breakpoint is working.) Here's the partial stacktrace when SecureNioEndpoint.flush gets called (when it doesn't write everything out). Perhaps that provides some visibility into why NioEndpoint.flushNonBlocking is never called ? Breakpoint reached at org.apache.tomcat.util.net.SecureNioChannel.flush(SecureNioChannel.java:123) at org.apache.tomcat.util.net.SecureNioChannel.write(SecureNioChannel.java:863) at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper$NioOperationState.run(NioEndpoint.java:1674) at org.apache.tomcat.util.net.SocketWrapperBase$OperationState.start(SocketWrapperBase.java:1063) at org.apache.tomcat.util.net.SocketWrapperBase.vectoredOperation(SocketWrapperBase.java:1482) at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:1408) at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:1379) at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:93) at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:512) at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessage(WsRemoteEndpointImplBase.java:398) at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendBytesByCompletion(WsRemoteEndpointImplBase.java:155) at org.apache.tomcat.websocket.WsRemoteEndpointAsync.sendBinary(WsRemoteEndpointAsync.java:65) at org.springframework.web.reactive.socket.adapter.StandardWebSocketSession.sendMessage(StandardWebSocketSession.java:92) Should something in that stack be triggering NioEndpoint.flushNonBlocking (either directly or indirectly via flushing the socket)?
Well, here it would simply need the same change since there's nothing in that otehr code path that checks the network output buffer: https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/NioEndpoint.java#L1591
NIO2 should not be affected by this and does not need any patching, since SecureNio2Channel simply uses more async operations to fully empty the network output buffer.
(In reply to Remy Maucherat from comment #10) > Well, here it would simply need the same change since there's nothing in > that other code path that checks the network output buffer: > https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/ > NioEndpoint.java#L1591 I think it is going to be a bit more complicated. For the case where all the application bytes have been written and it is just the netOutBuffer that has data left to write, I can't see a clean way to ensure the write isn't completed at https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/NioEndpoint.java#L1628
I have a proposed fix. Just running the test suite. Assuming it passes, I'll commit the fix to 9.0.x and do another dev build.
Updated dev build: https://people.apache.org/~markt/dev/v9.0.64-dev-680db44/ Same caveats as before.
Got it. Testing...
Good news! The test has been running with no failures continuously since my last comment. I'll keep letting it run today. But I'm pretty confident that the problem is fixed since the test would normally fail after a few of hours.
Thanks for testing. Fixed in: - 10.1.x for 10.1.0-M16 onwards - 10.0.x for 10.0.22 onwards - 9.0.x for 9.0.64 onwards - 8.5.x for 8.5.80 onwards