Bug 65460 - http2 request hang on request read
Summary: http2 request hang on request read
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina
Version: 9.0.50
Hardware: PC
OS: Mac OS X 10.1
Importance: P2 major
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-21 12:02 UTC by dpy1123
Modified: 2021-07-23 15:01 UTC



Attachments
some weird cases (359.24 KB, application/pdf)
2021-07-22 16:16 UTC, dpy1123
txt version (23.97 KB, text/plain)
2021-07-22 16:35 UTC, dpy1123
thread dump when hang (366.56 KB, text/plain)
2021-07-23 04:07 UTC, dpy1123

Description dpy1123 2021-07-21 12:02:41 UTC
Question:
We are using HTTP/2 and upgraded Tomcat from 9.0.45 to 9.0.50. When the server first starts up everything is fine, but after a while we see many requests time out on the client side.

Looking at the Pinpoint trace, we find the request is blocked at StandardHostValve#invoke. The caller then times out and disconnects. Somehow the server side later finishes its blocking read, tries to read/write something to the socket, and gets exceptions like:
"IOException: stream reset"
or
"CloseNowException: This stream is not writeable"

In the changelog we see some new changes to the HTTP/2 overhead protection since 9.0.49, and we think maybe these changes cause the problem. So we ran the tests below (a minimal repro sketch follows the list):
1. HTTP/1.1 + Tomcat 9.0.50: fine
2. HTTP/2 + Tomcat 9.0.50: hang
3. HTTP/2 + Tomcat 9.0.45: fine
4. HTTP/2 + Tomcat 9.0.50 + overhead protection disabled: fine
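For reference, a minimal HTTP/2 request loop along these lines exercises the same repeated body uploads (a sketch using java.net.http.HttpClient; the endpoint URL and body size are illustrative, not our actual client):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Http2PostLoop {
    public static void main(String[] args) throws Exception {
        // Negotiate h2 via ALPN over TLS; falls back to HTTP/1.1 if refused.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        byte[] body = new byte[16 * 1024]; // 16 KB request body per call
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://localhost:8443/echo")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(10))
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        for (int i = 0; i < 10_000; i++) {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            System.out.println(i + " -> " + response.statusCode() + " over " + response.version());
        }
    }
}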
Comment 1 dpy1123 2021-07-21 12:09:52 UTC
A stack trace when blocking happens:

void java.lang.Object.wait(long) 272 97.1 %
int org.apache.coyote.http2.Stream$StreamInputBuffer.doRead(ApplicationBufferHandler) 272 97.1 %
int org.apache.coyote.Request.doRead(ApplicationBufferHandler) 272 97.1 %
int org.apache.catalina.connector.InputBuffer.realReadBytes() 272 97.1 %
boolean org.apache.catalina.connector.InputBuffer.checkByteBufferEof() 272 97.1 %
int org.apache.catalina.connector.InputBuffer.read(byte[], int, int) 272 97.1 %
int org.apache.catalina.connector.CoyoteInputStream.read(byte[], int, int) 272 97.1 %
int org.apache.catalina.connector.CoyoteInputStream.read(byte[]) 272 97.1 %
long org.apache.commons.io.IOUtils.copyLarge(InputStream, OutputStream, byte[]) 272 97.1 %
long org.apache.commons.io.IOUtils.copy(InputStream, OutputStream, int) 272 97.1 %
long org.apache.commons.io.IOUtils.copyLarge(InputStream, OutputStream) 272 97.1 %
int org.apache.commons.io.IOUtils.copy(InputStream, OutputStream) 272 97.1 %
byte[] org.apache.commons.io.IOUtils.toByteArray(InputStream) 272 97.1 %
void com.OURCODEBASE.platform.openfeign.body.CustomBodyServletRequestWrapper.<init>(HttpServletRequest) 272 97.1 %
void com.OURCODEBASE.platform.openfeign.body.RequestBodyCacheFilter.doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain) 272 97.1 %
void org.springframework.web.filter.OncePerRequestFilter.doFilter(ServletRequest, ServletResponse, FilterChain) 272 97.1 %
void org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse) 272 97.1 %
void org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse) 272 97.1 %
void org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain) 272 97.1 %
void org.springframework.web.filter.OncePerRequestFilter.doFilter(ServletRequest, ServletResponse, FilterChain) 272 97.1 %
void org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse) 272 97.1 %
void org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse) 272 97.1 %
void com.OURCODEBASE.platform.monitor.logging.tracing.TracingOncePerRequestFilter.doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain) 272 97.1 %
void org.springframework.web.filter.OncePerRequestFilter.doFilter(ServletRequest, ServletResponse, FilterChain) 272 97.1 %
void org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse) 272 97.1 %
void org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse) 272 97.1 %
void org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) 272 97.1 %
void org.apache.catalina.core.StandardContextValve.invoke(Request, Response) 272 97.1 %
void org.apache.catalina.authenticator.AuthenticatorBase.invoke(Request, Response) 272 97.1 %
void org.apache.catalina.core.StandardHostValve.invoke(Request, Response) 272 97.1 %
void org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) 272 97.1 %
void org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) 272 97.1 %
void org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) 272 97.1 %
AbstractEndpoint$Handler$SocketState org.apache.coyote.http2.StreamProcessor.service(SocketWrapperBase) 272 97.1 %
AbstractEndpoint$Handler$SocketState org.apache.coyote.AbstractProcessorLight.process(SocketWrapperBase, SocketEvent) 272 97.1 %
void org.apache.coyote.http2.StreamProcessor.process(SocketEvent) 272 97.1 %
void org.apache.coyote.http2.StreamRunnable.run() 272 97.1 %
void java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) 272 97.1 %
void java.util.concurrent.ThreadPoolExecutor$Worker.run() 272 97.1 %
void org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run() 272 97.1 %
void java.lang.Thread.run() 272 97.1 %
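The top of the trace is our body-caching filter draining the request body into memory. A hypothetical sketch of that wrapper pattern (our actual CustomBodyServletRequestWrapper differs in detail) shows where doRead() ends up waiting for DATA frames:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.servlet.ReadListener;
import javax.servlet.ServletInputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;
import org.apache.commons.io.IOUtils;

// Hypothetical reconstruction: the constructor eagerly drains the body,
// so the stream's doRead() blocks here if DATA frames stop arriving.
class CachedBodyRequestWrapper extends HttpServletRequestWrapper {
    private final byte[] body;

    CachedBodyRequestWrapper(HttpServletRequest request) throws IOException {
        super(request);
        this.body = IOUtils.toByteArray(request.getInputStream()); // blocking read
    }

    @Override
    public ServletInputStream getInputStream() {
        ByteArrayInputStream in = new ByteArrayInputStream(body);
        return new ServletInputStream() {
            @Override public int read() { return in.read(); }
            @Override public boolean isFinished() { return in.available() == 0; }
            @Override public boolean isReady() { return true; }
            @Override public void setReadListener(ReadListener listener) { /* unused */ }
        };
    }
}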
Comment 2 Mark Thomas 2021-07-21 12:40:24 UTC
Complete Connector configuration please for both the hanging and working cases.
Comment 3 dpy1123 2021-07-21 13:01:21 UTC
All the test cases except case 4 use the default settings.
Case 4 just adds the following settings to disable overhead protection:
overheadCountFactor="0"
overheadDataThreshold="0"
overheadWindowUpdateThreshold="0"
Comment 4 Mark Thomas 2021-07-21 13:39:29 UTC
HTTP/2 is not enabled by default so you can't possibly be using default settings. I specifically asked for the full Connector configurations for that reason. Please provide the complete Connector configuration for both the hanging and working cases as requested.
Comment 5 dpy1123 2021-07-21 14:59:02 UTC
We are using embedded Tomcat together with Spring Boot. Here's what we added:
server.http2.enabled=true

We use 'TomcatServletWebServerFactory' to customize the connector by adding (a fuller self-contained sketch follows this comment):
http2Protocol.setMaxConcurrentStreams(100);
http2Protocol.setMaxConcurrentStreamExecution(20);
http2Protocol.setKeepAliveTimeout(60_000);
connector.addUpgradeProtocol(http2Protocol);

and with case 4 the additional settings are:
http2Protocol.setOverheadCountFactor(0);
http2Protocol.setOverheadDataThreshold(0);
http2Protocol.setOverheadWindowUpdateThreshold(0);
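Putting it together, a minimal customizer sketch (the class and bean names here are illustrative, not our actual code):

import org.apache.catalina.connector.Connector;
import org.apache.coyote.http2.Http2Protocol;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class Http2TomcatConfig {

    @Bean
    public WebServerFactoryCustomizer<TomcatServletWebServerFactory> http2Customizer() {
        return factory -> factory.addConnectorCustomizers((Connector connector) -> {
            Http2Protocol http2Protocol = new Http2Protocol();
            http2Protocol.setMaxConcurrentStreams(100);
            http2Protocol.setMaxConcurrentStreamExecution(20);
            http2Protocol.setKeepAliveTimeout(60_000);
            // Case 4 only: disable overhead protection.
            // http2Protocol.setOverheadCountFactor(0);
            // http2Protocol.setOverheadDataThreshold(0);
            // http2Protocol.setOverheadWindowUpdateThreshold(0);
            connector.addUpgradeProtocol(http2Protocol);
        });
    }
}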
Comment 6 dpy1123 2021-07-22 06:06:53 UTC
We ran case 2 several more times and found:
when the overhead count counts down to around -10640 to -10660, the next request hangs until timeout, and any requests after that also hang until timeout.


We see these in the log:
Connection [35], Stream [0], Frame type [DATA] resulted in new overhead count of [-10,680]
Connection [35], Stream [0], Frame type [HEADERS] resulted in new overhead count of [-10,660]
Connection [35], Stream [0], Frame type [HEADERS] resulted in new overhead count of [-10,620]
Connection [35], Stream [0], Frame type [DATA] resulted in new overhead count of [-10,640]
Connection [35], Stream [0], Frame type [HEADERS] resulted in new overhead count of [-10,580]
Comment 7 dpy1123 2021-07-22 09:30:29 UTC
Sorry, I think the overhead count is fine; we tried another API and got a different overhead count when the issue happens.

Besides, we found some weird cases, please see the reference:
https://www.notion.so/dpy1123/tomcat9-0-50-hang-01a1f5527c944f919a700fc39fd9d58d
Comment 8 Mark Thomas 2021-07-22 15:07:40 UTC
That reference requires registration/authentication. Please ensure any links you provide are public.

A thread dump when the hang occurs would be helpful.

Enabling HTTP/2 debug logging may also be informative (see $CATALINA_BASE/logging.properties)
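For standalone Tomcat that typically means lines like these in logging.properties; in an embedded/Spring Boot setup a logging override such as the Spring property below usually achieves the same (exact keys depend on how logging is wired):

org.apache.coyote.http2.level = FINE
java.util.logging.ConsoleHandler.level = FINE

# Spring Boot equivalent (application.properties), assuming JUL is bridged:
logging.level.org.apache.coyote.http2=DEBUG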
Comment 9 dpy1123 2021-07-22 16:16:53 UTC
Created attachment 37965 [details]
some weird cases
Comment 10 dpy1123 2021-07-22 16:17:56 UTC
Added a PDF version as an attachment, please take a look.
Comment 11 dpy1123 2021-07-22 16:35:17 UTC
Created attachment 37966 [details]
txt version
Comment 12 dpy1123 2021-07-23 04:07:48 UTC
Created attachment 37967 [details]
thread dump when hang

Added the thread dump.
Comment 13 Mark Thomas 2021-07-23 08:58:29 UTC
I'm as sure as I can be this isn't overhead protection related.

The HTTP/2 streams are blocked, waiting to receive data. If the overhead protection had been triggered, these streams would have been closed. The exceptions in the logs are also consistent with request reads timing out.

What isn't yet clear is the root cause. Possibilities are:
1. The client isn't sending data. Unlikely given the issue started after the move from 9.0.45 to 9.0.50.
2. The data is being received but the stream isn't being notified. Unlikely as there have been no changes in this specific area.
3. Window updates are not being sent correctly. Seems the most likely candidate at this point as there are changes for this in 9.0.49.

I was about to suggest a couple of things to try to collect more data. As part of that I was checking exactly what debug logging is provided for window updates, and I managed to recreate the issue.

I'm investigating now but setting overheadWindowUpdateThreshold="0" seems to work around the issue.
Comment 14 Mark Thomas 2021-07-23 13:37:26 UTC
Found it. There is a regression in the previous release.

If using async IO (available and default in 9.x and 10.x) small updates to the connection flow control window were dropped. The window slowly reduced in size until no more data could be read.

If not using async IO, small updates to the connection flow control window are written but not flushed so at worst, you'll see a small delay.

Setting overheadWindowUpdateThreshold="0" avoids the issue with or without async IO.

Fixed in:
- 10.1.x for 10.1.0-M3 onwards
- 10.0.x for 10.0.9 onwards
- 9.0.x for 9.0.51 onwards
- 8.5.x for 8.5.70 onwards
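To make the failure mode concrete, here is a toy simulation (not Tomcat code; the chunk size is illustrative, the initial window is the RFC 7540 default) of a sender whose small WINDOW_UPDATE frames are dropped:

public class WindowDrainDemo {
    public static void main(String[] args) {
        final int initialWindow = 65_535;   // RFC 7540 default connection window
        final int chunk = 1_024;            // size of each DATA frame sent
        boolean updatesDropped = true;      // the async-IO regression

        int window = initialWindow;         // sender's view of the connection window
        int frames = 0;
        while (window >= chunk && frames < 1_000) {
            window -= chunk;                // sending DATA consumes window
            if (!updatesDropped) {
                window += chunk;            // WINDOW_UPDATE would restore it
            }
            frames++;
        }
        System.out.println("Sender stalled after " + frames + " DATA frames; "
                + "remaining window = " + window + " bytes. Subsequent request "
                + "bodies cannot be sent, so server-side reads block: the hang.");
    }
}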
Comment 15 dpy1123 2021-07-23 15:01:00 UTC
(In reply to Mark Thomas from comment #14)
> Found it. There is a regression in the previous release.
> 
> If using async IO (available and default in 9.x and 10.x) small updates to
> the connection flow control window were dropped. The window slowly reduced
> in size until no more data could be read.
> 
> If not using async IO, small updates to the connection flow control window
> are written but not flushed so at worst, you'll see a small delay.
> 
> Setting overheadWindowUpdateThreshold="0" avoids the issue with or without
> async IO.
> 
> Fixed in:
> - 10.1.x for 10.1.0-M3 onwards
> - 10.0.x for 10.0.9 onwards
> - 9.0.x for 9.0.51 onwards
> - 8.5.x for 8.5.70 onwards

Cool, looking forward to the next release.
Thanks for helping!