Question: We are using HTTP/2 and upgraded Tomcat from 9.0.45 to 9.0.50. When the server first starts up everything is fine, but after a while we see many request timeouts on the client side. Looking at the Pinpoint trace, we see the request is blocking at StandardHostValve#invoke. The caller then times out and disconnects. Some time later the server side finishes blocking, tries to read from / write to the socket, and gets exceptions like "IOException: stream reset" or "CloseNowException: This stream is not writeable". In the changelog we see new changes to the HTTP/2 overhead protection since 9.0.49, so we suspect these changes cause the problem. We ran the tests below:
1. HTTP/1.1 + Tomcat 9.0.50: fine
2. HTTP/2 + Tomcat 9.0.50: hangs
3. HTTP/2 + Tomcat 9.0.45: fine
4. HTTP/2 + Tomcat 9.0.50 + overhead protection disabled: fine
A stack trace when the blocking happens (each frame: 272 samples, 97.1%):
void java.lang.Object.wait(long)
int org.apache.coyote.http2.Stream$StreamInputBuffer.doRead(ApplicationBufferHandler)
int org.apache.coyote.Request.doRead(ApplicationBufferHandler)
int org.apache.catalina.connector.InputBuffer.realReadBytes()
boolean org.apache.catalina.connector.InputBuffer.checkByteBufferEof()
int org.apache.catalina.connector.InputBuffer.read(byte[], int, int)
int org.apache.catalina.connector.CoyoteInputStream.read(byte[], int, int)
int org.apache.catalina.connector.CoyoteInputStream.read(byte[])
long org.apache.commons.io.IOUtils.copyLarge(InputStream, OutputStream, byte[])
long org.apache.commons.io.IOUtils.copy(InputStream, OutputStream, int)
long org.apache.commons.io.IOUtils.copyLarge(InputStream, OutputStream)
int org.apache.commons.io.IOUtils.copy(InputStream, OutputStream)
byte[] org.apache.commons.io.IOUtils.toByteArray(InputStream)
void com.OURCODEBASE.platform.openfeign.body.CustomBodyServletRequestWrapper.<init>(HttpServletRequest)
void com.OURCODEBASE.platform.openfeign.body.RequestBodyCacheFilter.doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain)
void org.springframework.web.filter.OncePerRequestFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
void org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
void org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
void org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain)
void org.springframework.web.filter.OncePerRequestFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
void org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
void org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
void com.OURCODEBASE.platform.monitor.logging.tracing.TracingOncePerRequestFilter.doFilterInternal(HttpServletRequest, HttpServletResponse, FilterChain)
void org.springframework.web.filter.OncePerRequestFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
void org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
void org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
void org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
void org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
void org.apache.catalina.authenticator.AuthenticatorBase.invoke(Request, Response)
void org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
void org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
void org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
void org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
AbstractEndpoint$Handler$SocketState org.apache.coyote.http2.StreamProcessor.service(SocketWrapperBase)
AbstractEndpoint$Handler$SocketState org.apache.coyote.AbstractProcessorLight.process(SocketWrapperBase, SocketEvent)
void org.apache.coyote.http2.StreamProcessor.process(SocketEvent)
void org.apache.coyote.http2.StreamRunnable.run()
void java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
void java.util.concurrent.ThreadPoolExecutor$Worker.run()
void org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run()
void java.lang.Thread.run()
Please provide the complete Connector configuration for both the hanging and working cases.
All the test cases except case 4 use the default settings. Case 4 just adds the following settings to disable the overhead protection: overheadCountFactor="0" overheadDataThreshold="0" overheadWindowUpdateThreshold="0"
HTTP/2 is not enabled by default, so you can't possibly be using default settings. I specifically asked for the full Connector configurations for that reason. Please provide the complete Connector configuration for both the hanging and working cases as requested.
We are using embedded Tomcat together with Spring Boot. Here's what we added: server.http2.enable=true, plus a 'TomcatServletWebServerFactory' to customize the connector by adding:
http2Protocol.setMaxConcurrentStreams(100);
http2Protocol.setMaxConcurrentStreamExecution(20);
http2Protocol.setKeepAliveTimeout(60_000);
connector.addUpgradeProtocol(http2Protocol);
and for case 4 the additional settings are:
http2Protocol.setOverheadCountFactor(0);
http2Protocol.setOverheadDataThreshold(0);
http2Protocol.setOverheadWindowUpdateThreshold(0);
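For readers trying to reproduce this, here is a minimal sketch of the kind of customizer described above, assuming Spring Boot 2.x with embedded Tomcat 9. The class and bean names are illustrative; only the Http2Protocol setters are taken from the report.

    import org.apache.coyote.http2.Http2Protocol;
    import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
    import org.springframework.boot.web.server.WebServerFactoryCustomizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class Http2ConnectorConfig {

        @Bean
        public WebServerFactoryCustomizer<TomcatServletWebServerFactory> http2Customizer() {
            return factory -> factory.addConnectorCustomizers(connector -> {
                // HTTP/2 upgrade protocol with the limits described in the report
                Http2Protocol http2Protocol = new Http2Protocol();
                http2Protocol.setMaxConcurrentStreams(100);
                http2Protocol.setMaxConcurrentStreamExecution(20);
                http2Protocol.setKeepAliveTimeout(60_000);
                // Case 4 only: disable the HTTP/2 overhead protection
                // http2Protocol.setOverheadCountFactor(0);
                // http2Protocol.setOverheadDataThreshold(0);
                // http2Protocol.setOverheadWindowUpdateThreshold(0);
                connector.addUpgradeProtocol(http2Protocol);
            });
        }
    }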
We ran case 2 several more times and found that when the overhead count drops to around -10640 to -10660, the next request hangs until timeout, and every request after that also hangs until timeout. We see these in the log:
Connection [35], Stream [0], Frame type [DATA] resulted in new overhead count of [-10,680]
Connection [35], Stream [0], Frame type [HEADERS] resulted in new overhead count of [-10,660]
Connection [35], Stream [0], Frame type [HEADERS] resulted in new overhead count of [-10,620]
Connection [35], Stream [0], Frame type [DATA] resulted in new overhead count of [-10,640]
Connection [35], Stream [0], Frame type [HEADERS] resulted in new overhead count of [-10,580]
Sorry, I think the overhead count is fine; we tried another API and got a different overhead count when the issue happens. Besides, we found some weird cases, please see the reference: https://www.notion.so/dpy1123/tomcat9-0-50-hang-01a1f5527c944f919a700fc39fd9d58d
That reference requires registration/authentication. Please ensure any links you provide are public. A thread dump when the hang occurs would be helpful. Enabling HTTP/2 debug logging may also be informative (see $CATALINA_BASE/logging.properties).
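For reference, a rough sketch of what enabling that debug logging looks like; the FINE level on the org.apache.coyote.http2 category is the relevant part, and the handler line assumes the stock ConsoleHandler from Tomcat's default logging.properties:

    # $CATALINA_BASE/conf/logging.properties (standalone Tomcat)
    org.apache.coyote.http2.level = FINE
    java.util.logging.ConsoleHandler.level = FINE

For the embedded Spring Boot setup described above, the equivalent is usually a logging.level.org.apache.coyote.http2=debug entry in application.properties.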
Created attachment 37965 [details] some weird cases
Added a PDF version as an attachment, please take a look.
Created attachment 37966 [details] txt version
Created attachment 37967 [details] thread dump when the hang occurs. Added the thread dump.
I'm as sure as I can be this isn't overhead protection related. The HTTP/2 streams are blocked, waiting to receive data. If the overhead protection had been triggered, these streams would have been closed. The exceptions in the logs are also consistent with request reads timing out.

What isn't yet clear is the root cause. Possibilities are:
1. The client isn't sending data. Unlikely given the issue started after the move from 9.0.45 to 9.0.50.
2. The data is being received but the stream isn't being notified. Unlikely as there have been no changes in this specific area.
3. Window updates are not being sent correctly. Seems the most likely candidate at this point as there are changes for this in 9.0.49.

I was about to suggest a couple of things to try to collect more data. As part of that I was checking exactly what debug logging is provided for window updates and I managed to recreate the issue. I'm investigating now, but setting overheadWindowUpdateThreshold="0" seems to work around the issue.
Found it. There is a regression in the previous release.

If using async IO (available and default in 9.x and 10.x), small updates to the connection flow control window were dropped. The window slowly reduced in size until no more data could be read.

If not using async IO, small updates to the connection flow control window are written but not flushed, so at worst you'll see a small delay.

Setting overheadWindowUpdateThreshold="0" avoids the issue with or without async IO.

Fixed in:
- 10.1.x for 10.1.0-M3 onwards
- 10.0.x for 10.0.9 onwards
- 9.0.x for 9.0.51 onwards
- 8.5.x for 8.5.70 onwards
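For anyone stuck on an affected version until those releases ship, a minimal sketch of the workaround in the embedded setup described earlier in this thread; the class and bean names are illustrative. In standalone Tomcat the same thing is the overheadWindowUpdateThreshold="0" attribute on the HTTP/2 UpgradeProtocol element in server.xml.

    import org.apache.coyote.http2.Http2Protocol;
    import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
    import org.springframework.boot.web.server.WebServerFactoryCustomizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class Http2WindowUpdateWorkaround {

        // Only the window-update threshold needs to change; the rest of the
        // overhead protection can stay at its defaults.
        @Bean
        public WebServerFactoryCustomizer<TomcatServletWebServerFactory> http2Workaround() {
            return factory -> factory.addConnectorCustomizers(connector -> {
                Http2Protocol http2Protocol = new Http2Protocol();
                http2Protocol.setOverheadWindowUpdateThreshold(0);
                connector.addUpgradeProtocol(http2Protocol);
            });
        }
    }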
(In reply to Mark Thomas from comment #14)
> Found it. There is a regression in the previous release.
>
> If using async IO (available and default in 9.x and 10.x) small updates to
> the connection flow control window were dropped. The window slowly reduced
> in size until no more data could be read.
>
> If not using async IO, small updates to the connection flow control window
> are written but not flushed so at worst, you'll see a small delay.
>
> Setting overheadWindowUpdateThreshold="0" avoids the issue with or without
> async IO.
>
> Fixed in:
> - 10.1.x for 10.1.0-M3 onwards
> - 10.0.x for 10.0.9 onwards
> - 9.0.x for 9.0.51 onwards
> - 8.5.x for 8.5.70 onwards

Cool, looking forward to the next release. Thanks for the help!