Bug 66076 - Final block of encrypted response payload not being sent occasionally over TLS
Summary: Final block of encrypted response payload not being sent occasionally over TLS
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 9.0.63
Hardware: PC Linux
: P2 major (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-18 18:30 UTC by Phil Clay
Modified: 2022-05-26 06:48 UTC (History)
0 users



Attachments
server-side tcpdump (17.21 KB, text/plain)
2022-05-18 18:30 UTC, Phil Clay
Details
server-side application log (38.72 KB, text/plain)
2022-05-18 18:31 UTC, Phil Clay
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Phil Clay 2022-05-18 18:30:45 UTC
Created attachment 38291 [details]
server-side tcpdump

We're occasionally encountering a problem where the last encrypted block of an outbound websocket payload is not being sent over TLS.

This looks very similar to bug 65448, but this is not simulating blocking and this is not when sending a file.  It's async I/O over NIO for outbound websocket traffic.

Most of the time it works, but occasionally it does not.  We can usually reproduce the problem in a few hours when sending a request every 2 seconds.

The problem only occurs over TLS, not over unencrypted channels.  It reproduces with both openssl (via tomcat native) and Java's TLS impl.

The outbound payload being sent in this test is 105493 bytes (encrypted).

I've attached a tcpdump that shows the behavior, but I'll copy a bit of it here....

Here's an example of a working request/response (I have removed the client acks for brevity, but they can be seen in the attachment)...


# Begin Request 3
16:42:43.728300 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4338:4390, ack 4052, win 8210, length 52
# Begin Response 3 (success)  Full response is 105493 bytes
16:42:43.735921 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 4052:6972, ack 4390, win 311, length 2920
16:42:43.735936 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 6972:9892, ack 4390, win 311, length 2920
16:42:43.735947 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 9892:12812, ack 4390, win 311, length 2920
16:42:43.735954 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 12812:15732, ack 4390, win 311, length 2920
16:42:43.735962 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 15732:18652, ack 4390, win 311, length 2920
16:42:43.736675 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 18652:24492, ack 4390, win 311, length 5840
16:42:43.736695 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 24492:30332, ack 4390, win 311, length 5840
16:42:43.736702 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 30332:36172, ack 4390, win 311, length 5840
16:42:43.736720 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 36172:44932, ack 4390, win 311, length 8760
16:42:43.737386 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 44932:58072, ack 4390, win 311, length 13140
16:42:43.737457 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 58072:72672, ack 4390, win 311, length 14600
16:42:43.738146 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 72672:82892, ack 4390, win 311, length 10220
16:42:43.738187 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 82892:100412, ack 4390, win 311, length 17520
16:42:43.738926 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 100412:109545, ack 4390, win 311, length 9133
# End Response 3 (all 105493 bytes sent and acknowledged)


And here is an example of a response that demonstrates the problem (again with client acks removed)...


# Begin Request 4
16:42:45.750396 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4390:4442, ack 109545, win 8212, length 52
# Begin Response 4 (failure)  Full response is 105493 bytes
16:42:45.757878 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 109545:125958, ack 4442, win 311, length 16413
16:42:45.757953 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 125958:142371, ack 4442, win 311, length 16413
16:42:45.758108 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 142371:146751, ack 4442, win 311, length 4380
16:42:45.758765 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 146751:165731, ack 4442, win 311, length 18980
16:42:45.758832 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 165731:181791, ack 4442, win 311, length 16060
16:42:45.759519 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], seq 181791:197851, ack 4442, win 311, length 16060
16:42:45.759635 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 197851:210991, ack 4442, win 311, length 13140

########### START OF PROBLEM ########### 
# Only 101446 bytes sent so far.  This last block of 4047 bytes is not sent until the connection begins closing over a minute after the previous block was sent.
16:44:21.222056 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 210991:215038, ack 4442, win 311, length 4047
# End Response 4 (all 105493 bytes sent and acknowledged)
########### END OF PROBLEM ############# 

# Begin closing handshake
16:44:21.222117 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 215038:215119, ack 4442, win 311, length 81
16:44:21.222812 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [.], ack 215119, win 8212, length 0
16:44:21.237321 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4442:4527, ack 215119, win 8212, length 85
16:44:21.237428 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [P.], seq 4527:4558, ack 215119, win 8212, length 31
16:44:21.237503 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [F.], seq 4558, ack 215119, win 8212, length 0
16:44:21.237786 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [.], ack 4559, win 311, length 0
16:44:21.239201 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [P.], seq 215119:215150, ack 4559, win 311, length 31
16:44:21.239234 IP 10.63.72.78.8444 > 10.132.54.95.58275: Flags [F.], seq 215150, ack 4559, win 311, length 0
16:44:21.240606 IP 10.132.54.95.58275 > 10.63.72.78.8444: Flags [R.], seq 4559, ack 215150, win 0, length 0


Unfortunately, I do not have a simple reproducible project yet, but I'm hoping you might have an idea where to look, since this looks very similar to bug 65448
Comment 1 Phil Clay 2022-05-18 18:31:27 UTC
Created attachment 38292 [details]
server-side application log
Comment 2 Phil Clay 2022-05-18 22:10:59 UTC
After a bit of investigation, my rudimentary understanding is that 

1) a buffer of unencrypted data makes it down to SecureNioChannel.write

https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L786


2) SecureNioChannel.write encrypts it into a new buffer (netOutBuffer)

https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L810

3) And then flushes that encrypted buffer

https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L824


4) But flush is not guaranteed to write all the encrypted bytes to the channel, and returns false if the buffer was not completely emptied...

https://github.com/apache/tomcat/blob/93108de1127fb228e343a3f3304554bfe7177583/java/org/apache/tomcat/util/net/SecureNioChannel.java#L120-L127

But the return value is ignored, and there's no subsequent re-flush until the next write.

Therefore, the unwritten encrypted data remains in the netOutBuffer.
Comment 3 Mark Thomas 2022-05-19 19:49:17 UTC
Thanks for the analysis. What you describe makes sense to me. It looks like the non-blocking variant of bug 65448.

My current thinking is that SocketWrapper.flushNonBlocking() needs an NIO specific implementation that can take account of the status of netOutBuffer in NioSecureChannel.

Once we have a potential fix, are you in a position to either builds Tomcat 9.0.x yourself from source and/or test a 9.0.x dev build I provide for you?
Comment 4 Phil Clay 2022-05-20 17:15:16 UTC
> Once we have a potential fix, are you in a position to either builds Tomcat 9.0.x yourself from source and/or test a 9.0.x dev build I provide for you?

Yes, Definitely!  I'd prefer a dev build to be provided if possible, but I'm sure I could spend some time to build it myself if needed.
Comment 5 Mark Thomas 2022-05-20 19:10:24 UTC
Great. I'll do a dev build for you. It isn't that tricky but I am all set up for it.

I think I know what the problem is. I should have something for you to test on Monday.
Comment 6 Phil Clay 2022-05-20 19:29:31 UTC
Perfect.

Also note, my project uses embedded tomcat.  Specifically these three artifacts:

org.apache.tomcat.embed:tomcat-embed-core
org.apache.tomcat.embed:tomcat-embed-el
org.apache.tomcat.embed:tomcat-embed-websocket

I assume the build will produce everything, but just wanted to be clear what I'm using.
Comment 7 Mark Thomas 2022-05-23 09:31:01 UTC
A 9.0.x dev build is available here:
https://people.apache.org/~markt/dev/v9.0.64-dev-b311c0a/

Usual caveats apply. This isn't an official release. Use at your own risk.

It is the standard release layout so the JARs you want will be in bin/embed/apache-tomcat-9.0.64-dev-b311c0a-embed.tar.gz
Comment 8 Phil Clay 2022-05-23 15:27:05 UTC
Thanks Mark!  I'll get the test started today.  The test will need to run for a while before I'm confident in the fix.
Comment 9 Phil Clay 2022-05-23 20:20:11 UTC
The test using the dev build unfortunately failed after about 3 hours with the same symptom.

After it failed, a re-ran it with a breakpoint in NioEndpoint.flushNonBlocking to see the new code in action, but my breakpoint unfortunately never gets hit when sending outbound websocket messages. 

(It does get hit when sending a normal HTTP response, so I can confirm the new code is active and the breakpoint is working.)

Here's the partial stacktrace when SecureNioEndpoint.flush gets called (when it doesn't write everything out).  Perhaps that provides some visibility into why NioEndpoint.flushNonBlocking is never called ?

Breakpoint reached
	at org.apache.tomcat.util.net.SecureNioChannel.flush(SecureNioChannel.java:123)
	at org.apache.tomcat.util.net.SecureNioChannel.write(SecureNioChannel.java:863)
	at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper$NioOperationState.run(NioEndpoint.java:1674)
	at org.apache.tomcat.util.net.SocketWrapperBase$OperationState.start(SocketWrapperBase.java:1063)
	at org.apache.tomcat.util.net.SocketWrapperBase.vectoredOperation(SocketWrapperBase.java:1482)
	at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:1408)
	at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:1379)
	at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:93)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:512)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessage(WsRemoteEndpointImplBase.java:398)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendBytesByCompletion(WsRemoteEndpointImplBase.java:155)
	at org.apache.tomcat.websocket.WsRemoteEndpointAsync.sendBinary(WsRemoteEndpointAsync.java:65)
	at org.springframework.web.reactive.socket.adapter.StandardWebSocketSession.sendMessage(StandardWebSocketSession.java:92)


Should something in that stack be triggering NioEndpoint.flushNonBlocking (either directly or indirectly via flushing the socket)?
Comment 10 Remy Maucherat 2022-05-24 07:44:03 UTC
Well, here it would simply need the same change since there's nothing in that otehr code path that checks the network output buffer: https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/NioEndpoint.java#L1591
Comment 11 Remy Maucherat 2022-05-24 07:48:53 UTC
NIO2 should not be affected by this and does not need any patching, since SecureNio2Channel simply uses more async operations to fully empty the network output buffer.
Comment 12 Mark Thomas 2022-05-24 13:43:51 UTC
(In reply to Remy Maucherat from comment #10)
> Well, here it would simply need the same change since there's nothing in
> that other code path that checks the network output buffer:
> https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/
> NioEndpoint.java#L1591

I think it is going to be a bit more complicated. For the case where all the application bytes have been written and it is just the netOutBuffer that has data left to write, I can't see a clean way to ensure the write isn't completed at https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/NioEndpoint.java#L1628
Comment 13 Mark Thomas 2022-05-24 16:39:47 UTC
I have a proposed fix. Just running the test suite. Assuming it passes, I'll commit the fix to 9.0.x and do another dev build.
Comment 14 Mark Thomas 2022-05-24 16:55:05 UTC
Updated dev build:
https://people.apache.org/~markt/dev/v9.0.64-dev-680db44/

Same caveats as before.
Comment 15 Phil Clay 2022-05-24 18:24:24 UTC
Got it.  Testing...
Comment 16 Phil Clay 2022-05-25 15:48:47 UTC
Good news!  The test has been running with no failures continuously since my last comment.  I'll keep letting it run today.  But I'm pretty confident that the problem is fixed since the test would normally fail after a few of hours.
Comment 17 Mark Thomas 2022-05-26 06:48:27 UTC
Thanks for testing.

Fixed in:
- 10.1.x for 10.1.0-M16 onwards
- 10.0.x for 10.0.22 onwards
- 9.0.x for 9.0.64 onwards
- 8.5.x for 8.5.80 onwards