Bug 64848 - WsSession objects in OUTPUT_CLOSED state are implicitly held by waitingProcessors and GC cannot purge them from the JVM heap
Summary: WsSession objects in OUTPUT_CLOSED state are implicitly held by waitingProces...
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: WebSocket (show other bugs)
Version: 9.0.36
Hardware: PC All
: P4 major (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-26 11:32 UTC by Laszlo Karolyi
Modified: 2020-11-13 18:56 UTC (History)
0 users



Attachments
Attachment with the 3 services that help reproduce the issue + SocketTimeoutException stack trace and WsSession GC root snapshot from a heap dump (379.28 KB, application/x-zip-compressed)
2020-10-26 11:32 UTC, Laszlo Karolyi
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Laszlo Karolyi 2020-10-26 11:32:32 UTC
Created attachment 37534 [details]
Attachment with the 3 services that help reproduce the issue + SocketTimeoutException stack trace and WsSession GC root snapshot from a heap dump

Overview:
---------

WebSocket session objects (represented as WsSession) "get stuck" on the heap under heavy load in case Tomcat acts as a WebSocket API gateway between a client and a server application.
By assuming a system configuration where the Tomcat WebSocket API gateway service is deployed along with a server application on the same machine but the client is launched elsewhere then the network latency can impose a considerable overhead on the entire client-side data processing compared to the server side where the data can be generated and transferred to the gateway service much faster than the client can consume it and this can lead to the classic fast producer-slow consumer phenomenon.

If the network latency goes beyond a certain threshold and the client cannot keep up in a timely manner with the data flow coming from the gateway service then Tomcat starts throwing SocketTimeoutException (by default after 20 seconds) at the WebSocket sessions for which the data weren't transmitted in time. Such a timeout may end up in an abnormally closed WebSocket connection (usually represented with 1006 status code at the client side) and even though the corresponding sessions are moved into the OUTPUT_CLOSED state at Tomcat level, they are still kept on the JVM heap endlessly by preventing the GC to purge them out consequently producing a slow memory leak.

Steps to Reproduce:
-------------------

Reproducing such a situation is a bit cumbersome in terms of the required hardware configuration as it needs two machines/VMs: one for the client and another for the Tomcat WebSocket API gateway + server application. On the other hand the client app should be hosted "far" from the Tomcat WebSocket app in terms of network distance, i.e. if it connects to the Tomcat WebSocket app via VPN then the network latency can be enough to reproduce the issue. Alternatively in the Tomcat WebSocket app the org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT property can also be adjusted as part of a customized RequestUpgradeStrategy to simulate a slow network.
The overall system demonstrates a simple distributed client-server application inserting the Tomcat WebSocket API GW as an intermediary. The client can send a number to the server that denotes a length (given in KBs) so the server will respond with a random alphanumeric string having a length specified by the given number. The Tomcat WebSocket API GW just routes the WS traffic back and forth between the other two services.

1. Run the attached random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies application (the server app) in a form of

java -jar random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies <host> <port>

By default it configures the underlying Undertow webserver to launch on localhost and listen on port #8193.

2. Run the attached ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar application (the Tomcat WebSocket API GW app) in a form of

java -jar ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

By default it listens on port #8444 and it can be overridden by setting the server.port property.
If the Undertow server app runs with non-default host and port configurations then this needs to be reflected here by specifying the zuul.routes.random-string-websocket-provider.url property accordingly, e.g.:

java -jar -Dzuul.routes.random-string-websocket-provider.url=http://<another-host>:<another-port> ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

3. Run the attached ws-random-string-gatling-load-test application (the client app wrapped into gatling to generate artifical load) in a form of

mvn clean -B compile exec:java -Dexec.mainClass=RandomStringWebSocketRequestApp -DrampUpUsers=<number-of-concurrent-users> -DrampUpTime=1 -DserverUrl=<host-where-other-two-services-run> -DserverPort=<port-where-Tomcat-API-GW-listens-on> -Dgatling.simulationClass=com.acme.wsrequest.simulation.RandomStringWebSocketRequestSimulation -DrandomStringLengthInKb=1000

Actual Results:
---------------

Running the client app with 400 users will start producing the SocketTimeoutException confidently in the Tomcat WebSocket API gateway service. At the client side the gatling report starts showing unexpectedly closed WS connections (with status code 1006) and the number of such connections seems to have a strong correspondence to the number of "got stuck" WsSession objects on the Tomcat WebSocket app's heap. That WsSession objects are preserved indefinitely and hence cannot be garbage-collected.

Expected Results:
-----------------

WsSession objects representing abnormally closed WebSocket connections shall eventually be the subject of garbage collection on the JVM heap.

Build Date & Hardware:
----------------------

Build 2020-10-26 on Windows Server 2016 Standard (Version 1607 - OS Build 14393.3930)

Additional Builds and Platforms:
--------------------------------

N/A

Additional Information:
-----------------------

There is an attachment (tomcat-ws-api-gw-sockettimeoutexception-stack-trace.txt) to show the stack trace produced when SocketTimeoutException is encountered.

Another attachment (tomcat-wssession-gc-root.png) contains the relevant prt of the heap dump created after a 400-user gatling load execution. Searching for "websocketsession" objects will bring up the preserved WebSocket session objects and checking the GC root of such a session object can also show the object reference chain up to the "waitingProcessors" map present in Http11NioProtocol.
Comment 1 Mark Thomas 2020-10-28 19:33:15 UTC
Please re-test with 9.0.39.
Comment 2 Mark Thomas 2020-10-28 20:35:53 UTC
The other thing to check is whether you have multiple threads writing to the same WebSocket session concurrently. That is not supported by the Java WebSocket API and could lead to the issue you describe.
Comment 3 Mark Thomas 2020-10-28 21:25:29 UTC
Actually, scratch that last comment. Tomcat detects that and will throw an Exception. I'm still looking into this but there are changes between 9.0.36 and 9./0.39 that might help so that feedback would still be very helpful.
Comment 4 Mark Thomas 2020-10-29 12:57:26 UTC
I'm able to recreate this with the 3 provided Maven projects.

I ran the client on a local VM with VMware Workstation configured to limit the network connection to 10Mbps with 20ms latency inbound and outbound.

Next step is to update to the latest Tomcat 9 code and see if the issue persists.
Comment 5 Mark Thomas 2020-10-29 14:27:51 UTC
This bug has already been fixed in the 9.0.38 onwards.

https://github.com/apache/tomcat/commit/5707895bafa38c1443460356a268932a967b79e9
Comment 6 Laszlo Karolyi 2020-11-02 08:12:45 UTC
After re-testing the reported scenario with 9.0.39 Tomcat version (core + websocket) the WsSession objects still seem to get stuck on the heap.

Upon investigating the Tomcat source code a bit deeper it's not entirely clear how a particular Processor can be removed from the waitingProcessors set located in Http11NioProtocol class (more accurately in its heir, AbstractProtocol). According to the heap dump snapshot enclosed with the original report, a WsSession object is held by a WsHttpUpgradeHandler object that is used within UpgradeProcessorInternal class. UpgradeProcessorInternal inherits from Processor, the type whose objects seem stored in waitingProcessors.

If sending a message block fails in WsRemoteEndpointImplBase during the invocation of writeMessagePart() method (as the referred https://github.com/apache/tomcat/commit/5707895bafa38c1443460356a268932a967b79e9 pull request also contains) then even though the WebSocket session gets closed followed by re-throwing of the caught exception, does it mean that the WsHttpUpgradeHandler object associated to the closed WebSocket session should eventually be purged from the above mentioned waitingProcessors set too?
In other words, under which conditions should a Processor be removed from waitingProcessors?
Comment 7 Mark Thomas 2020-11-02 14:16:16 UTC
The issue as reproduced by the provided test case is fixed in 9.0.39.

Entries are removed from the waitingProcessor list when an event is processed for the socket (and returned once processing is complete if the connection is still open).

If the writes from the server are made from an application thread (rather then in response to a message from the client and on the same thread) then it looks like it is possible that an unclean close will result in problem described here.

Re-opening while I investigate a potential unit test for this.
Comment 8 Mark Thomas 2020-11-02 17:34:18 UTC
Fixed in:
- 10.0.x for 10.0.0-M10 onwards
- 9.0.x for 9.0.40 onwards
- 8.5.x for 8.5.60 onwards
Comment 9 Laszlo Karolyi 2020-11-03 18:11:17 UTC
I can confirm that the issue I reported is indeed fixed in Tomcat 9.0.40 so thanks for the quick remediation.
Do you happen to have an estimation as to when the 9.0.40 version of Tomcat is expected to be released?
Comment 10 Laszlo Karolyi 2020-11-05 18:28:32 UTC
I couldn't find any information about the publication date of 9.0.40 but it would help us if we knew when it is expected as we would include it in our next release. Please let me know if you happen to have any update on the next Tomcat version's release date.
Comment 11 Mark Thomas 2020-11-05 19:26:51 UTC
Soon. Need to resolve the remaining open issues then we can tag. Hopefully early next week.
Comment 12 Laszlo Karolyi 2020-11-09 14:22:30 UTC
Ok, it sounds promising. Thanks for the quick feedback.
Comment 13 Laszlo Karolyi 2020-11-13 17:12:45 UTC
Hi Thomas,

According to the last information I got the Tomcat 9.0.40 release was estimated to be ready this week but I haven't seen any update on this.

Since our internal release is still in progress this week, I think we may have the chance to integrate the new Tomcat early next week in case it'll be ready by then.
Would it be feasible to have the official Tomcat 9.0.40 build in the beginning of next week?
Comment 14 Christopher Schultz 2020-11-13 18:56:01 UTC
Tomcat 9.0.40 started on 12 November[1]. Feel free to test and vote!

[1] https://lists.apache.org/thread.html/rce32f696e12fecfd4b4383a5e8a389f4dac7d357053fd03ac7280868%40%3Cdev.tomcat.apache.org%3E