Bug 64621 - HTTP/2 Tomcat Server responds with RST_STREAM (REFUSED_STREAM) continuously in one of the TCP connection.
Summary: HTTP/2 Tomcat Server responds with RST_STREAM (REFUSED_STREAM) continuously i...
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 9.0.x
Hardware: HP Linux
: P2 critical (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-23 13:36 UTC by Raghavendran V
Modified: 2020-08-18 06:29 UTC (History)
1 user (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Raghavendran V 2020-07-23 13:36:21 UTC
*Sub-Component - Coyote*

*OS : Redhat Linux*

*Description:*

*Setup:*

We have an implementation with Tomcat to transport http/2 packets between 2 systems. 

*Issue reproduction:*

During a load test (with some 1000 requests per second) in the pre-production systems, we identified some scenarios where our Tomcat server is continuously replying with a RST_STREAM, with REFUSED_STREAM as the reason, in a single TCP Connection. 

Even though it is understood that exceeding the number of streams per connection rejects the new streams with RST_STREAM (REFUSED_STREAM) error as per Section 5.1.2 and SECTION 8.1.4 of RFC7540. We are not seeing any other responses sent by the server after sometime. It was only RST_STREAM (REFUSED_STREAM) errors. For the whole test run period (around 30 minutes), it was only RST_STREAM in that particular TCP connection.

If the RST_STREAM is sent for exceeded stream count, should it have not recovered after sometime? 


*Additional Information:*

Also, by looking into the tomcat code base, it was understood that the RST_STREAM (REFUSED_STREAM) response from tomcat is only possible during an exceeded stream count and in no other situation (reference given below).

if (localSettings.getMaxConcurrentStreams() < activeRemoteStreamCount.incrementAndGet()) {
                activeRemoteStreamCount.decrementAndGet();
                throw new StreamException(sm.getString("upgradeHandler.tooManyRemoteStreams",
                        Long.toString(localSettings.getMaxConcurrentStreams())),
                        Http2Error.REFUSED_STREAM, streamId);
            }

In addition to this, in the Stream processor code, we noticed a ‘FIXME:’ comment to fix the syncs (reference given below).

final void process(SocketEvent event) {
        try {
            // FIXME: the regular processor syncs on socketWrapper, but here this deadlocks
            synchronized (this) {
                // HTTP/2 equivalent of AbstractConnectionHandler#process() without the
                // socket <-> processor mapping
                ContainerThreadMarker.set();
                SocketState state = SocketState.CLOSED;
                try {
                    state = process(socketWrapper, event);

                    if (state == SocketState.CLOSED) {
                        if (!getErrorState().isConnectionIoAllowed()) {
                            ConnectionException ce = new ConnectionException(sm.getString(
                                    "streamProcessor.error.connection", stream.getConnectionId(),
                                    stream.getIdentifier()), Http2Error.INTERNAL_ERROR);
                            stream.close(ce);
                        } else if (!getErrorState().isIoAllowed()) {
                            StreamException se = stream.getResetException();
                            if (se == null) {
                                se = new StreamException(sm.getString(
                                        "streamProcessor.error.stream", stream.getConnectionId(),
                                        stream.getIdentifier()), Http2Error.INTERNAL_ERROR,
                                        stream.getIdAsInt());
                            }
                            stream.close(se);
                        }
                    }
                } catch (Exception e) {
                    String msg = sm.getString("streamProcessor.error.connection",
                            stream.getConnectionId(), stream.getIdentifier());
                    if (log.isDebugEnabled()) {
                        log.debug(msg, e);
                    }
                    ConnectionException ce = new ConnectionException(msg, Http2Error.INTERNAL_ERROR);
                    ce.initCause(e);
                    stream.close(ce);
                } finally {
                    ContainerThreadMarker.clear();
                }
            }
        } finally {
            handler.executeQueuedStream();
        }
    }

Correlating these two items, if the streams are not closed due to this sync issue, the activeRemoteStreamCount will be in an increasing trend (as I am seeing the stream closure only here), Which will result in the REFUSED_STREAM errors. 

Can you confirm this issue and suggest on whether a fix is available in any of your working streams? As the system is already in production, it is susceptible to the issue in production anytime, when the load increases. Please help us find a way forward. 

Thanks in advance.
Comment 1 Mark Thomas 2020-07-23 14:18:27 UTC
Exact Tomcat version being used?
Comment 2 Raghavendran V 2020-07-23 14:21:39 UTC
The current Tomcat version used is 9.0.22, in which we are seeing this issue. I have also checked the pasted code snippets up-to the latest 9.0.37 version. It is the same till 9.0.37.
Comment 3 Mark Thomas 2020-07-23 14:25:38 UTC
Please re-test with 9.0.37. There have been many fixes since 9.0.22

Are you using h2 or h2c?
Comment 4 Raghavendran V 2020-07-23 14:45:48 UTC
It is h2c we are using. 

Also, as this is a pre-production/production, upgrading to the latest version requires solid evidence to push to the higher version :(. 

Is it possible to check on the corresponding modules on whether this issue is resolved in 9.0.37? Can you please help on this?
Comment 5 Mark Thomas 2020-07-23 15:50:37 UTC
You have misunderstood the nature of the FIXME. There is no "sync issue" here that results in streams not being closed.

We have a test case that tests the REFUSED_STREAM state and extending that test demonstrates that new streams may be sent once the concurrency level falls below the limit.

For us to investigate further you will need to provide the *minimal* steps to reproduce this error (with code, scripts etc as required) on a clean install of the latest Tomcat 9.0.x release.
Comment 6 Raghavendran V 2020-07-24 05:47:16 UTC
Thanks Mark, for the update. Will try to reproduce in our local setup and will get back to you with more clarity. As this occurs randomly, we are yet to narrow down to the "reproduction steps". So, will get back to you ASAP. 

Thanks again for the swift responses.
Comment 7 Raghavendran V 2020-08-17 13:50:53 UTC
Even after trying to simulate this in our local lab setup for the past few weeks, we are not able to reproduce the same in lab. 

However, this is getting reproduced in the production environment (the client being different in the production environment). In local environment, we use JMeter as a simulator and in the production environment the client uses Envoy 3PP for HTTP communication.

As an additional information, we are able to see a pattern that, whenever the client is sending a "HEADER" packet along with a "RST_STREAM" (Reason: NO_ERROR) packet in the same TCP packet, TOMCAT server is going to this mode and sending continuous RST_STREAM (Reason: REFUSED_STREAM) answers to any message that is coming in that TCP connection from the client. 

As it is difficult to convince the customer to upgrade to 9.0.37, without solid data point on the fix availability in the latest version, we request your kind help on this. Is there any commit or some improvement that we can point to convince the customer to move to the latest tomcat version that you can help with in connection with this? 

Once again, Thanks in advance :).
Comment 8 Mark Thomas 2020-08-17 18:57:09 UTC
Without the ability to recreate this issue the only way to test if the issue is resolved is going to be in production.

Based on the additional information provided, I expanded the test cases to cover two scenarios:
a) ensuring after active stream limit is reached, new streams are allowed once the number of current active streams falls
b) ensuring that client cancelled streams are removed from the active stream count

test a) passed.

Test b) identified a bug where cancelled streams weren't removed from the active stream count. I suspect this is at least part of, possibly all of, the cause of the issues you are seeing.

The bug has been fixed in:
- master for 10.0.0-M8 onwards
- 9.0.x for 9.0.38 onwards
- 8.5.x for 8.5.58 onwards

Those releases are expected in early September. You are, of course, free to build from source and test before then.

Absent any further information, I am going to assume that this issue is now fixed.
Comment 9 Raghavendran V 2020-08-18 06:29:05 UTC
Thank you so much for the swift response Mark! This is of great help for us. We will check on how we can take this in.