Created attachment 37534 [details]
Attachment with the 3 services that help reproduce the issue + SocketTimeoutException stack trace and WsSession GC root snapshot from a heap dump

Overview:
---------
WebSocket session objects (represented as WsSession) "get stuck" on the heap under heavy load when Tomcat acts as a WebSocket API gateway between a client and a server application.

Assume a system configuration where the Tomcat WebSocket API gateway service is deployed on the same machine as the server application, while the client is launched elsewhere. Network latency can then impose a considerable overhead on the client-side data processing compared to the server side, where data can be generated and transferred to the gateway service much faster than the client can consume it. This leads to the classic fast producer-slow consumer phenomenon.

If the network latency goes beyond a certain threshold and the client cannot keep up with the data flow coming from the gateway service, Tomcat starts throwing SocketTimeoutException (by default after 20 seconds) for the WebSocket sessions whose data was not transmitted in time. Such a timeout may end in an abnormally closed WebSocket connection (usually reported with status code 1006 on the client side). Even though the corresponding sessions are moved into the OUTPUT_CLOSED state at the Tomcat level, they are kept on the JVM heap indefinitely, preventing the GC from purging them and consequently producing a slow memory leak.

Steps to Reproduce:
-------------------
Reproducing this situation is a bit cumbersome in terms of the required hardware configuration, as it needs two machines/VMs: one for the client and another for the Tomcat WebSocket API gateway + server application. The client app should be hosted "far" from the Tomcat WebSocket app in terms of network distance; for example, if it connects to the Tomcat WebSocket app via VPN, the network latency can be enough to reproduce the issue. Alternatively, in the Tomcat WebSocket app the org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT property can be lowered as part of a customized RequestUpgradeStrategy to simulate a slow network (a minimal sketch of this is included after the report below).

The overall system demonstrates a simple distributed client-server application with the Tomcat WebSocket API GW inserted as an intermediary. The client sends a number to the server that denotes a length (given in KBs), and the server responds with a random alphanumeric string of that length. The Tomcat WebSocket API GW just routes the WS traffic back and forth between the other two services.

1. Run the attached random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies application (the server app) in the form of

   java -jar random-string-ws-provider-undertow-1.0.0-SNAPSHOT-jar-with-dependencies <host> <port>

   By default it configures the underlying Undertow webserver to launch on localhost and listen on port #8193.

2. Run the attached ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar application (the Tomcat WebSocket API GW app) in the form of

   java -jar ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

   By default it listens on port #8444; this can be overridden by setting the server.port property.
   If the Undertow server app runs with a non-default host and port configuration, this needs to be reflected here by specifying the zuul.routes.random-string-websocket-provider.url property accordingly, e.g.:

   java -jar -Dzuul.routes.random-string-websocket-provider.url=http://<another-host>:<another-port> ws-api-gateway-tomcat-1.0.0-SNAPSHOT.jar

3. Run the attached ws-random-string-gatling-load-test application (the client app wrapped into Gatling to generate artificial load) in the form of

   mvn clean -B compile exec:java -Dexec.mainClass=RandomStringWebSocketRequestApp -DrampUpUsers=<number-of-concurrent-users> -DrampUpTime=1 -DserverUrl=<host-where-other-two-services-run> -DserverPort=<port-where-Tomcat-API-GW-listens-on> -Dgatling.simulationClass=com.acme.wsrequest.simulation.RandomStringWebSocketRequestSimulation -DrandomStringLengthInKb=1000

Actual Results:
---------------
Running the client app with 400 users reliably starts producing the SocketTimeoutException in the Tomcat WebSocket API gateway service. On the client side the Gatling report starts showing unexpectedly closed WS connections (with status code 1006), and the number of such connections seems to correspond closely to the number of "stuck" WsSession objects on the Tomcat WebSocket app's heap. These WsSession objects are preserved indefinitely and hence cannot be garbage-collected.

Expected Results:
-----------------
WsSession objects representing abnormally closed WebSocket connections shall eventually be subject to garbage collection on the JVM heap.

Build Date & Hardware:
----------------------
Build 2020-10-26 on Windows Server 2016 Standard (Version 1607 - OS Build 14393.3930)

Additional Builds and Platforms:
--------------------------------
N/A

Additional Information:
-----------------------
There is an attachment (tomcat-ws-api-gw-sockettimeoutexception-stack-trace.txt) that shows the stack trace produced when the SocketTimeoutException is encountered. Another attachment (tomcat-wssession-gc-root.png) contains the relevant part of the heap dump created after a 400-user Gatling load execution. Searching for "websocketsession" objects brings up the preserved WebSocket session objects, and checking the GC root of such a session object shows the object reference chain up to the "waitingProcessors" map present in Http11NioProtocol.
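For reference, a minimal sketch of the BLOCKING_SEND_TIMEOUT adjustment mentioned under "Steps to Reproduce". This is not the attached gateway code (which uses a customized RequestUpgradeStrategy); it shows the same per-session user property on a plain JSR-356 endpoint, and the endpoint path, class name and 1000 ms value are illustrative assumptions:

    import javax.websocket.OnOpen;
    import javax.websocket.Session;
    import javax.websocket.server.ServerEndpoint;

    @ServerEndpoint("/ws/random-string")
    public class SlowNetworkSimulationEndpoint {

        @OnOpen
        public void onOpen(Session session) {
            // Tomcat reads this per-session user property (value must be a Long);
            // the default is 20000 ms. Lowering it makes the slow-consumer
            // SocketTimeoutException much easier to hit during a load test.
            session.getUserProperties()
                   .put("org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT", 1000L);
        }
    }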
Please re-test with 9.0.39.
The other thing to check is whether you have multiple threads writing to the same WebSocket session concurrently. That is not supported by the Java WebSocket API and could lead to the issue you describe.
Actually, scratch that last comment. Tomcat detects that and will throw an Exception. I'm still looking into this but there are changes between 9.0.36 and 9.0.39 that might help so that feedback would still be very helpful.
I'm able to recreate this with the 3 provided Maven projects. I ran the client on a local VM with VMware Workstation configured to limit the network connection to 10Mbps with 20ms latency inbound and outbound. Next step is to update to the latest Tomcat 9 code and see if the issue persists.
This bug has already been fixed in 9.0.38 onwards.
https://github.com/apache/tomcat/commit/5707895bafa38c1443460356a268932a967b79e9
After re-testing the reported scenario with Tomcat 9.0.39 (core + websocket), the WsSession objects still seem to get stuck on the heap. After investigating the Tomcat source code a bit deeper, it's not entirely clear how a particular Processor gets removed from the waitingProcessors set located in the Http11NioProtocol class (more accurately in its superclass, AbstractProtocol). According to the heap dump snapshot enclosed with the original report, a WsSession object is held by a WsHttpUpgradeHandler object that is used within the UpgradeProcessorInternal class. UpgradeProcessorInternal inherits from Processor, the type whose objects are stored in waitingProcessors. If sending a message block fails in WsRemoteEndpointImplBase during the invocation of the writeMessagePart() method (the path touched by the referenced commit, https://github.com/apache/tomcat/commit/5707895bafa38c1443460356a268932a967b79e9), the WebSocket session gets closed and the caught exception is re-thrown. Should the WsHttpUpgradeHandler object associated with the closed WebSocket session then eventually be purged from the above-mentioned waitingProcessors set as well? In other words, under which conditions should a Processor be removed from waitingProcessors?
The issue as reproduced by the provided test case is fixed in 9.0.39. Entries are removed from the waitingProcessors list when an event is processed for the socket (and returned once processing is complete if the connection is still open). If the writes from the server are made from an application thread (rather than in response to a message from the client and on the same thread), then it looks like it is possible that an unclean close will result in the problem described here. Re-opening while I investigate a potential unit test for this.
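To make the distinction concrete, here is a minimal sketch of the write pattern described above, i.e. messages pushed from an application thread rather than from the container thread handling an inbound client message. The executor, interval and payload are illustrative assumptions, not code from the attached gateway:

    import java.io.IOException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import javax.websocket.Session;

    public class ApplicationThreadWriter {

        private final ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();

        public void start(Session session, String payload) {
            // Writes originate from this pool thread, not from an @OnMessage
            // callback, which is the scenario described in the comment above.
            executor.scheduleAtFixedRate(() -> {
                try {
                    session.getBasicRemote().sendText(payload);
                } catch (IOException e) {
                    // An unclean close (e.g. caused by a SocketTimeoutException
                    // on the blocking write) surfaces here.
                }
            }, 0, 100, TimeUnit.MILLISECONDS);
        }
    }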
Fixed in:
- 10.0.x for 10.0.0-M10 onwards
- 9.0.x for 9.0.40 onwards
- 8.5.x for 8.5.60 onwards
I can confirm that the issue I reported is indeed fixed in Tomcat 9.0.40 so thanks for the quick remediation. Do you happen to have an estimation as to when the 9.0.40 version of Tomcat is expected to be released?
I couldn't find any information about the publication date of 9.0.40, but it would help us to know when it is expected, as we would like to include it in our next release. Please let me know if you happen to have any update on the next Tomcat version's release date.
Soon. Need to resolve the remaining open issues then we can tag. Hopefully early next week.
Ok, it sounds promising. Thanks for the quick feedback.
Hi Thomas,

According to the last information I got, the Tomcat 9.0.40 release was estimated to be ready this week, but I haven't seen any update on this. Since our internal release is still in progress this week, I think we may have the chance to integrate the new Tomcat early next week if it is ready by then. Would it be feasible to have the official Tomcat 9.0.40 build at the beginning of next week?
The Tomcat 9.0.40 release vote started on 12 November [1]. Feel free to test and vote!

[1] https://lists.apache.org/thread.html/rce32f696e12fecfd4b4383a5e8a389f4dac7d357053fd03ac7280868%40%3Cdev.tomcat.apache.org%3E
We are seeing this issue with 8.5.72 - is there some other patch needed for the 8.5 codeline ? The fix is confirmed for 9.0.40.
(In reply to Vishwanath Karra from comment #15)
> We are seeing this issue with 8.5.72 - is there some other patch needed for
> the 8.5 codeline ? The fix is confirmed for 9.0.40.

Are you sure it's the same issue? Please post all your relevant information (i.e. actual evidence that you have a WebSocket session memory leak) to the user's mailing list to verify this is the same problem.
(In reply to Christopher Schultz from comment #16)
> (In reply to Vishwanath Karra from comment #15)
> > We are seeing this issue with 8.5.72 - is there some other patch needed for
> > the 8.5 codeline ? The fix is confirmed for 9.0.40.
>
> Are you sure it's the same issue? Please post all your relevant information
> (i.e. actual evidence that you have a WebSocket session memory leak) to the
> user's mailing list to verify this is the same problem.

We can see the problem on 8.5.84 when the server.xml connector is configured with the protocol org.apache.coyote.http11.Http11Nio2Protocol. It is not an issue when changing to the default configuration (NIO). In our tests, with NIO the problem was fixed after 8.5.60, but with NIO2 the problem still existed on versions 8.5.60, 8.5.75, 8.5.84, etc.
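For clarity, a minimal server.xml excerpt of the two connector configurations being compared above; the port and timeout values are illustrative, not taken from the actual deployment:

    <!-- NIO2 connector: the configuration reported to still leak WsSession objects -->
    <Connector port="8444"
               protocol="org.apache.coyote.http11.Http11Nio2Protocol"
               connectionTimeout="20000" />

    <!-- Default NIO connector: reported as fixed from 8.5.60 onwards -->
    <Connector port="8444"
               protocol="org.apache.coyote.http11.Http11NioProtocol"
               connectionTimeout="20000" />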
After a recent upgrade from 9.0.63 to 9.0.75 (with the 2 websocket fixes from 9.0.77 patched in) I am seeing some occurrences of WsSession objects stuck on the heap with state OUTPUT_CLOSED, held via WsHttpUpgradeHandler under Http11Nio2Protocol.waitingProcessors. We are in the process of identifying the steps to reproduce, but these occurrences have been rare so far.

Looking at the more recent reworking of WsSession isOpen / isClosed and the surrounding logic from BZ 66574, I am wondering whether it makes sense for OUTPUT_CLOSED to also be checked in the isClosed() call. Without a reproducible test case in place it is difficult to determine via static code analysis, but I wanted to mention that we are seeing this after our recent upgrade and did not see it in 9.0.63.
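A sketch only of the suggestion above, assuming the internal state field and enum constant names in WsSession look roughly like this (the actual names and the set of states may differ between versions):

    private boolean isClosed() {
        // Suggested condition: treat a session whose output side has been
        // closed (OUTPUT_CLOSED) as closed too, not only a fully CLOSED one.
        return state == State.CLOSED || state == State.OUTPUT_CLOSED;
    }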
This is looking like a case where the remote endpoint is not able to respond with its own close message. We have seen 2 types of exception recorded in the Nio2SocketWrapper present in the heap:
- AsynchronousCloseException
- IOException - connection reset by peer