Created attachment 37912 [details] httpd.conf

We want to switch our web application from HTTP/1.1 to HTTP/2. The application uses 2 EventSources (SSE, Server-Sent Events). Apache2 is used as reverse proxy with mod_proxy_balancer for the /api/* locations and delivers static content (an Angular webapp) for everything else.

When enabling HTTP/2 via "Protocols" the webapp works so far - but when navigating through the app the EventSources get stuck ("pending" in the Chrome developer tools). The problem is that the stuck EventSource is blocking all other requests - which should not happen when using HTTP/2. I played around with keepalive, disablereuse and some http2 settings - no luck.

To be sure that it's not a general problem of our application I tried to use nginx as reverse proxy - that worked without problems.

I'll attach the relevant part of the apache2 config and the apache error log with

LogLevel core:debug
LogLevel ssl:debug
LogLevel http2:trace2
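For orientation, the setup is roughly shaped like this sketch - it is not the attached file; hostnames, ports and paths are placeholders, and certificate directives are omitted:

```apache
<VirtualHost *:443>
    # Enable HTTP/2 (falls back to HTTP/1.1 for clients without h2)
    Protocols h2 http/1.1
    SSLEngine on

    # /api/* is reverse-proxied to the application backends
    <Proxy "balancer://api-backend">
        BalancerMember "http://127.0.0.1:8081"
        BalancerMember "http://127.0.0.1:8082"
    </Proxy>
    ProxyPass        "/api/" "balancer://api-backend/api/"
    ProxyPassReverse "/api/" "balancer://api-backend/api/"

    # Everything else is the static Angular webapp
    DocumentRoot "/var/www/webapp"
</VirtualHost>
```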
Created attachment 37913 [details] httpd.conf (corrected)
Created attachment 37914 [details] apache2 error log
Thanks for the logs. Can you specify the URLs of the requests that seem to hang? And do you count those as "blocking" or what exactly do you mean by that?
Thanks for the ultra fast reply :)

There are 2 event streams used by the application:
https://localhost/api/provide-production-infrastructure/get-live-state-info-multi
https://localhost/api/provide-operation-live-info/get-operation-live-info-multi

Blocking means that in the Chrome developer tools the request is pending. Interestingly _no_ other requests are performed - they are also pending. So it seems that this request is blocking the other ones.

When this happens apache stops logging messages like this one. Because it's encrypted traffic I cannot see whether the browser stops sending requests or whether apache stops processing them. But it seems apache stops routing requests to the backend servers. After about 30 seconds apache is processing the requests again.

[Mon Jun 28 08:57:36.705264 2021] [http2:debug] [pid 245373:tid 140574984357440] h2_task.c(83): [client 127.0.0.1:56158] AH03348: h2_task(66-3): open output to GET localhost /api/blah blah
The log looks incomplete to me. Where I see the main connection logging, I do not see the h2 worker ones and vice versa. I need to match those to see what overall is happening.

What I can see on the h2 workers for the mentioned URLs is that they seem to block and time out after 30 seconds. But then I do not see what the main connection handler is doing in the meantime.

For example: when I fgrep for "208-71" I see what 'pid 222:tid 140040340637440' is doing (the worker), but nothing else. When I grep for '194-71' I see what the main connection is doing, but nothing from the worker.

Could you check if something went wrong here or reproduce a complete new log?
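For reference, the "208-71"-style identifier is the connection-stream pair that both the main connection and the worker log for a stream. A small illustrative helper (not part of httpd; the regex is based only on the log lines quoted in this ticket) to pull those ids out of a trace:

```python
import re

# Matches identifiers like "h2_task(66-3)" or "h2_stream(208-71)" in
# mod_http2 trace lines; group 1 is the "conn-stream" id that lets the
# main-connection entries be matched with the h2 worker entries.
TASK_RE = re.compile(r"h2_(?:task|stream|session)\((\d+-\d+)\)")

def stream_ids(log_lines):
    """Return the set of conn-stream ids mentioned in the given lines."""
    ids = set()
    for line in log_lines:
        m = TASK_RE.search(line)
        if m:
            ids.add(m.group(1))
    return ids
```

Grepping a complete log for one of these ids should then show both the worker side and the main-connection side of the same stream.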
Created attachment 37916 [details] httpd error log

The first log was generated by piping stderr to a file (because apache 2.4.48 was started via a docker container; the default is to log to stdout/stderr). Now I've reconfigured apache2 to log to a file. I've reproduced the problem and this is the error log written by apache2 itself.

If you need something else or if you face the same problem let me know - maybe there is something wrong with the log settings. You said you have to match the h2 workers and the main connection - but I don't know what serves as the foreign key between the two. So I was not able to do a preflight check that everything you need is in the log.
Thanks for the new log. What I see on both SSE endpoints is:

1. The client aborts the requests (RST_STREAM) after commonly 1-2 seconds; there are 2 requests that last 5 seconds and one with 18 seconds.
2. All requests produce the response headers and one chunk of body data (525 bytes on one endpoint, 447 on the other). Those are sent to the client.
3. The HTTP/2 layer does not see a second chunk from the SSE endpoints until the 30s read timeout happens, then one more chunk is seen.

Point 3 indicates that someone is buffering data on the proxy connection, but after 30 seconds the buffer only contains one more chunk of data (e.g. 447 or 525 bytes, not much).

The SSE sources do not seem to provide data often (is that the intention?). An initial chunk and then more or less silence. However the browser closes the open requests quite aggressively (at least it looks like this, I cannot see page reloads).

There is nothing blocking in the HTTP/2 layer per se that I can see. It looks as if the browser expects frequent data to come and is unhappy, and the proxied endpoint does not deliver.

How often should the SSE sources produce data, and why is the browser closing often 1 second after starting the request?
> Thanks for the new log. What I see on both SSE points is:
> 1. The client aborts the requests (RST_STREAM) after commonly 1-2 seconds, there are 2 requests that last 5 seconds and one with 18 seconds.

Yes. That's me clicking around in the app. After every route change the 2 EventSources are destroyed and connected again. (I know that's not how it should be - we are working on persisting the event sources during route changes in the app.)

> 2. All requests produce the response headers and one chunks of body data (525 bytes on one endpoint, 447 on the other). Those are sent to the client.
> 3. The HTTP/2 layer does not see a second chunk from the SSE endpoints, until the 30s read timeout happens, then one more chunk is seen.
> 3 indicates that someone is buffering data on the proxy connection, but after 30 seconds, the buffer only contain one more chunk of data (e.g. 447 or 525 bytes, not many).
> The SSE sources do not seem to provide data often (is that the intention?). An initial chunk and then silence more or less. However the browser closes the open requests quite aggressively (at least it looks like this, cannot see page reloads).

That's true. When connecting to the SSE endpoint there is some initial payload - but then we send updates only every 30? seconds or when some change happens. The closed connections are me navigating through the page (as described above).

> There is nothing blocking in the HTTP/2 layer per se that I can see. It looks as if the browser detects frequent data to come and is unhappy and the proxied endpoint does not deliver.

Can you clarify that for me - I don't understand the part with "the browser detects frequent data ..."
> Yes. That's me clicking around in the app. After every route change the 2 Eventsources are destroyed and connected again. (I know that's not how it should be - we are working on it to persist the event-sources during route-changes in the app)

I see. Ok, that explains this.

> That's true. When connecting to the SSE endpoint there is some initial payload - but then we are sending updates only every 30? seconds or when some change is happening. The closed connections is me navigating through the page (as described above)

Ok, if the SSE source sends roughly every 30 seconds, this explains what I see:
- the browser closes
- mod_http2 acknowledges and marks the worker thread as being aborted
- the worker thread is in a blocking read and gets some data after 30 seconds, detects the frontend connection is gone and terminates itself

> Can you clarify that for me - I don't understand the part with "the browser detects frequent data ..."

That was my question about the early closing of the SSE requests. But you explained that you navigated the webapp and caused this yourself.

What I do not see is any "blocking" behaviour in the server. If you see a request in the browser dev tools that is not answered, you should see the URL of that request, and then we can check if we see it in the server logs and what is happening to it.

For the SSE source requests, the browser will most likely show a "waiting" status since the complete response is never received - as intended for SSE, right?
(In reply to Stefan Eissing from comment #7)

> 3 indicates that someone is buffering data on the proxy connection, but after 30 seconds, the buffer only contain one more chunk of data (e.g. 447 or 525 bytes, not many).

Couldn't:

    ap_remove_output_filter_byhandle(c->output_filters, "SSL/TLS Coalescing Filter");

on the h2 master connection help here?
(In reply to Stefan Eissing from comment #9)

> Ok, if the SSE source sends roughly every 30 seconds, this explains what I see:
> - the browser closes
> - mod_http2 acknowledges, and marks the worker thread as being aborted
> - the worker thread is in a blocking read and gets some data after 30 seconds, detects the frontend connection is gone and terminates itself.

To reproduce the behavior I have to navigate a few times through the web app (initiate route changes). When no route change happens (the event sources are not destroyed and built up again) the app works as expected.

> What I do not see is any "blocking" behaviour in the server. If you see a request in the browser dev tools that is not answered, you should see the URL of that request and then we can check if we see it in the server logs and what is happening to it.

Yes, I can provide a log with an exact (1s) time when the first request is "pending" in the browser and which one it is. FYI: I've also tried Firefox to ensure it's not a browser problem.

> For the SSE source requests, the browser will most likely show a "waiting" status since the complete response is never received - as intended for the SSE, right?

Absolutely correct. The HTTP headers are sent by the server with Content-Type: text/event-stream. Then the response body never finishes (ideally; sometimes the request is canceled by the browser or by some other hardware/software in between).

Maybe another helpful insight: when using nginx it works (as already written). nginx has a lot more TCP connections to the backend in close/wait state. For some reason it seems to finish every TCP connection when an HTTP request is finished and establish a new one. I've made no special configuration regarding keepalive, it's the default config. Apache2 has much fewer connections to the backend and seems to reuse them - regardless of whether I enable or disable keepalive.

I'll come back to you with the new logs.
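PS, for context on "the response body never finishes": the endpoints emit the standard text/event-stream framing, where each update is a block of "data:" lines terminated by a blank line. A minimal sketch of that framing (illustrative Python, the payloads are placeholders - not our actual backend code):

```python
def sse_event(data, event=None, event_id=None):
    """Format one Server-Sent Events message (text/event-stream framing).

    Every line of `data` becomes its own "data:" field; the trailing
    blank line is what tells the browser the message is complete. The
    stream itself never ends - the server just keeps appending events.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    for part in str(data).split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"
```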
Created attachment 37917 [details] httpd error_log

Here are the logs:
- I opened the app at 13:28:20
- I changed the route around 4-5 times
- at 13:28:39-40 the request https://localhost/device/1c942d82-600c-484c-a62d-95c910ee86f0/operation-documents was pending
- at 13:28:52 the request finished - as did the other requests which were also pending
Created attachment 37918 [details] screenshot of pending request in developer console

This is the screenshot of the pending request. You can see that it's around 12 seconds in the pending state.
(In reply to Florian Meister from comment #13)
> Created attachment 37918 [details]
> screenshot of pending request in developer console
>
> This is the screenshot of the pending request.
> You can see that it's around 12 seconds in pending state.

It's a TTFB (time to first byte) of 12 seconds - so neither headers nor body are received.
From the new log (let's call this error3.log): TTFB of 12 seconds for "get-live-combined...", 2 requests of that in the log: 128-81 and 128-95.

In HTTP/2 I see the timely responses and DATA frames sent out on the main connection, always with a FLUSH bucket (details below). It seems that Ylavic points in the interesting direction of the mod_ssl coalescing filter maybe ignoring the FLUSH buckets and buffering things? Time to get an expert on that into this ticket, I guess.

Timeline of requests:

-81:
11:28:21.249985: received on thread 140115783444224
11:28:21.250043: added to q
11:28:21.250055: process connection on thread 140116562450176
11:28:21.281608: task sends out response headers + first chunk of 924 bytes
11:28:21.281769: main conn resumes stream 81
11:28:21.281787: main adds to stream out buffer (H2HEADERS BEAM[924] BEAM[4462] FLUSH)
11:28:21.281834: main conn sends HEADER FRAME
11:28:21.281863: main conn sends 1. DATA FRAME
11:28:21.281886: main conn out: heap[5409] flush
...regular adding of response chunks
11:28:22.278812: 2. DATA FRAME
11:28:23.276869: 3. DATA FRAME
11:28:24.281568: 4. DATA FRAME
11:28:25.278065: 5. DATA FRAME
11:28:26.274650: 6. DATA FRAME
11:28:27.282371: 7. DATA FRAME
11:28:28.276057: 8. DATA FRAME
11:28:28.697597: client RST_STREAM received
11:28:28.697695: stream in CLEANUP state
11:28:29.276504: task wakes on next chunk, sees output aborted
11:28:29.277021: task done, 8026.978000 ms elapsed

-95:
11:28:28.697648: recv HEADER FRAME
11:28:28.726090: send HEADER FRAME
11:28:28.726404: 1. DATA FRAME (5386)
11:28:28.726469: main conn out 5395 bytes + flush
11:28:29.727619: 2. DATA FRAME (5386)
11:28:30.723589: 3. DATA FRAME (5386)
11:28:31.727982: 4. DATA FRAME (5386)
11:28:32.727119: 5. DATA FRAME (5386)
11:28:33.722943: 6. DATA FRAME (5386)
11:28:34.723095: 7. DATA FRAME (5386)
11:28:35.732271: 8. DATA FRAME (5386)
11:28:36.725798: 9. DATA FRAME (5386)
11:28:37.727986: 10. DATA FRAME (5386)
11:28:38.720243: 11. DATA FRAME (5386)
11:28:39.724627: 12. DATA FRAME (5386)
11:28:40.722961: 13. DATA FRAME (5386)
11:28:41.722298: 14. DATA FRAME (5386)
11:28:42.719747: 15. DATA FRAME (5386)
11:28:43.727449: 16. DATA FRAME (5386)
...
11:31:00.723799: nn. DATA FRAME (5386)
(In reply to Stefan Eissing from comment #15)

> In HTTP/2 I see the timely responses and DATA frames send out on the main connection, always with a FLUSH bucket (details below). It seems that Ylavic points in the interesting direction of the mod_ssl coalescing filter to maybe ignore the FLUSH buckets and buffer things?

Wrong track I'm afraid: if all the data are FLUSHed then the coalescing filter shouldn't retain anything (it frees everything on any metadata bucket).
Looking at your timeline traces, all but the first DATA FRAMEs don't seem to be flushed - is that just implicit (you didn't indicate it), or is there no flush bucket added for those?
No, they are all flushed. H2 writes everything flushed. I just did not pick it out of the log for all frames.
I think a network capture (on the httpd side) is needed to confirm whether the first FRAME really did not go out before 12s. It shouldn't be the case given the traces..
Or, alternatively, at LogLevel trace7 we should be able to see what happens at the core/network output filter level.
Actually, LogLevel trace6 should be enough.
Created attachment 37919 [details] httpd error log

Error log with http2:trace6:
- loading app at 14:58:00
- first pending request at approximately 14:58:23

PS: I don't see any trace6 logs, max. trace3. If you need a network capture let me know :)
Created attachment 37920 [details] screenshot of pending request in developer console Screenshot from last trace (with http2:trace6)
Sorry I wasn't clear enough, we need core logs at trace6 (the core output filter, not http2 only), so either "LogLevel trace6" for core and all modules at that level, or maybe "LogLevel core:trace6 http2:trace6" works too.
Created attachment 37922 [details] httpd error log

Error log with LogLevel trace6:
- 7:47:20 loading app
- 7:47:42 first pending request
- 7:47:55 pending requests finished
Created attachment 37923 [details] screenshot of pending request in developer console
Stefan, looking at 1-155 (GET https://localhost/api/provide-production-infrastructure/get-live-state-info-multi?deviceId=1c942d82-600c-484c-a62d-95c910ee86f0):

07:47:40.524371: received (thread 140450790696704)
07:47:40.524374: added to q
<10s gap here>
07:47:50.869590: task processing (thread 140450824267520)
07:47:50.870611: forwarded to backend
07:47:50.899823: response from backend
07:47:50.899870: task send_out(H2HEADERS TRANSIENT[447])
07:47:50.899936: main conn resumes stream 155 by (thread 140450790696704)
07:47:50.899957: main adds to stream out buffer (H2HEADERS BEAM[447] FLUSH)
07:47:50.900011: main conn sends HEADER FRAME
07:47:50.900043: main conn sends first DATA FRAME
07:47:50.900130: 492/492 bytes sent on the network
<no more news from the backend until httpd is stopped>

Can't tell about this 10s gap..
I think what we are looking at is h2 worker starvation in this proxy scenario. When a long-reading SSE request is RST_STREAMed on the main connection, the h2 worker is in a read, and only when that delivers something will it detect that it has been aborted. That can take up to 30 seconds here.

We can test this by raising `H2MaxWorkers` to some very high level and checking if the stalls disappear. The real fix would be an immediate cancellation of ongoing requests, but I do not know how to do this. In http/1.1, where this problem does not seem to happen, do we not see it because there are just enough event workers, or does mpm_event know how to disrupt an ongoing read from the backend?

Florian: can you test if "H2MaxWorkers 512" seemingly makes the problem disappear for a while?
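The starvation hypothesis can be illustrated with a toy model (this is not httpd code, just a sketch of the mechanism): with a fixed pool of workers, once every worker sits in a blocking backend read, a newly queued request cannot start until one of those reads returns.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def demo(max_workers=2):
    backend_sends = threading.Event()   # stands in for the 30s SSE interval
    started = []

    def sse_request(name):
        started.append(name)
        backend_sends.wait()            # blocking read on the backend

    pool = ThreadPoolExecutor(max_workers=max_workers)
    for i in range(max_workers):        # occupy every "h2 worker"
        pool.submit(sse_request, f"sse-{i}")
    while len(started) < max_workers:   # wait until the pool is saturated
        time.sleep(0.001)
    fast = pool.submit(lambda: "done")  # an ordinary short request arrives
    stalled = not fast.done()           # it cannot start: no free worker
    backend_sends.set()                 # backend finally delivers a chunk
    result = fast.result(timeout=5)     # now a worker frees up and runs it
    pool.shutdown()
    return stalled, result
```

In this model the short request stalls exactly until the blocked reads wake up, which matches the observed "pending until the next SSE chunk" behaviour.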
> or does mpm_event know how to disrupt an ongoing read from the backend?

IIUC no, only a websockets tunnel would monitor both connections at the same time. Any other time, while I/O is waiting on the backend to write, the frontend socket is not being watched.
Created attachment 37926 [details] httpd error log

Error log with increased H2MaxWorkers (512), defined in the VirtualHost:
- 11:27:10 loading app
- 11:27:35 first pending request
Created attachment 37927 [details] screenshot of pending request in developer console Screenshot of Pending Request with increased H2 Workers
Created attachment 37928 [details] httpd.conf httpd.conf with increased H2MaxWorkers
Stefan, I increased H2MaxWorkers to 512 - I do not notice any difference. It seems the problem appears at the same time with the same intensity.
Florian, I'll take a look at this today. There seems to be something else amiss here. Maybe you are hitting a hitherto unnoticed edge where idle h2 workers are not accepting tasks as they should.
Created attachment 37930 [details] h2 worker stall fix against 2.4.x branch A patch against the Apache subversion branches/2.4.x to fix the issue of requests not being processed in a timely fashion.
Florian, I think I found the issue and attached a patch to this ticket. Can you build a new mod_http2 from the 2.4.48 sources or would you prefer a patch against the github version of the module?

Background: mod_http2 had 2 occasions where scheduled requests were assigned to workers: a) when the request arrived and b) when a worker was done serving a request. When more than 1 request arrived at the same time, it could happen that only the first one got an assignment while the others remained queued without any worker looking for them.

This did not matter when requests were short, since any finished request would trigger a worker assignment for the queued requests. However, with SSE and long-running requests, this could take considerable time to happen. The fix makes sure that worker assignment is always triggered when a new request arrives, even in batches.

I hope this addresses your issue.
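The scheduling flaw can be pictured with a toy model (invented names, not the mod_http2 source): with the old behaviour a batch arrival effectively wakes only one idle worker, so the rest of the batch stays queued until some running request finishes.

```python
from collections import deque

class Scheduler:
    """Toy model of worker assignment on request arrival."""

    def __init__(self, idle_workers):
        self.idle = idle_workers
        self.queue = deque()
        self.running = 0

    def arrive(self, requests, wake_per_request):
        """Queue a batch of requests and wake workers for them.

        wake_per_request=False models the old behaviour (one wakeup per
        arrival event); True models the fix (one wakeup per queued
        request, so whole batches start immediately if workers are idle).
        """
        self.queue.extend(requests)
        wakeups = len(self.queue) if wake_per_request else 1
        for _ in range(min(wakeups, self.idle)):
            if self.queue:
                self.queue.popleft()
                self.idle -= 1
                self.running += 1
```

With 4 idle workers and a batch of 3 requests, the old behaviour starts only 1 and leaves 2 queued; the fixed behaviour starts all 3.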
Stefan, great news! I come back to you when tested :)
Stefan, I'm afraid I have to tell you that the fix does not change the behavior. Can I do anything to help you? Would another trace help?
Yes, please. I'd like to see a trace of an n-second-delayed first byte under the new patch. *sigh*
Created attachment 37935 [details] httpd error log

Error log with the patch for mod_http2:
- 16:58:20 start app
- 16:58:36 first pending request
- 16:58:52 pending requests finished
Created attachment 37936 [details] screenshot of pending request in developer console Screenshot of pending Request in Developer Console
Stefan, I've uploaded the new logs. Thanks for investigating :)
If you don't know what is going on, write a helper. Today I wrote <https://github.com/icing/h2wtf> to look at your logs.

My analysis now is that your webapp runs into the h2 DoS protection. mod_http2 does not really like it when the client starts and then kills requests. But it likes it when the client reads responses completely and without stalling. There is a "mood" counter, and when the mood drops, the number of active requests that a client may perform is reduced. When mod_http2 is happy, this limit is increased.

The streams with the late starts happen at a time when the limit has been exhausted. E.g. in the latest log, the limit was reduced to 16 and 16 requests were ongoing when a new one came in. This means there will be no worker assigned to that request until another of the 16 returns. Since there are requests hanging, waiting on the next SSE chunk to then abort, this may take some time. In the last log, the 16 seconds.

So, I understand what is happening. That is something. I do not have a good idea how to improve this. Have to think a bit on that.
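A rough model of this adaptive limit (names and numbers are invented for illustration; the real logic in mod_http2 differs in detail): client aborts lower the allowance of concurrently processed streams, clean completions raise it again.

```python
class StreamLimit:
    """Toy model of an adaptive per-connection stream limit."""

    def __init__(self, initial=100, floor=16, ceiling=100):
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling

    def on_client_abort(self):
        # RST_STREAM on an unfinished response: halve the allowance,
        # but never go below the floor
        self.limit = max(self.floor, self.limit // 2)

    def on_clean_finish(self):
        # a completely read response slowly restores the allowance
        self.limit = min(self.ceiling, self.limit + 1)

    def may_start(self, active_streams):
        # a new request only gets a worker below the current limit
        return active_streams < self.limit
```

In the scenario from the log: after repeated aborts the limit sits at its floor of 16, and with 16 hanging SSE requests active, `may_start(16)` is false - the new request waits.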
Created attachment 37941 [details] h2_stuck_fix-v2 against 2.4.x

Considers RST_STREAM on event streams as acceptable. Adds DEBUG logging when the request processing limit is exceeded.
Florian: I attached a new patch to address the problem. Hope this solves the issue for you.
Created attachment 37942 [details] httpd error log

Error log with patch v2:
- 13:13:40 loading app
- 13:13:58 first pending request
- 13:14:12 pending requests finished
Created attachment 37943 [details] screenshot of pending request in developer console screenshot of pending request in dev console (patch-v2)
Stefan, it's much better now - the problem appears only after a lot more route changes than before - but it's not solved completely. I've attached new logs.

Thanks - flo
I see 37 h2 workers busy at the time requests '66-289' to '66-295' arrive, and that is the maximum in your server configuration. The processing can only start when a worker becomes available again.

You can mitigate that with an increased `H2MaxWorkers` configuration.

The question is what usage you anticipate on that server when you take it live. Increasing numbers of users with these SSE requests will eventually exhaust whatever worker limit you configure.

Question: can you reduce the 30 seconds update time on the SSE generators?

(The desired fix would be for apache to get rid of those workers directly when the client resets the stream, but that is not easily achieved.)
(In reply to Stefan Eissing from comment #49)

> I see 37 h2 worker busy at the time requests '66-289' to '66-295' arrive and that is the maximum in your sever configuration. The processing can only start when a worker becomes available again.
> You can mitigate that by an increased `H2MaxWorkers` configuration.
> The question is what usage your anticipate on that server when you take it live. Increasing numbers of users with these SSE request will eventually exhaust whatever worker limit you configure.

We have eliminated more or less all request cancellations when navigating through the page - so it's much better now, even without your fix.

> Question: can you reduce the 30 seconds update time on the SSE generators?

Yes, that would be possible. Maybe we take that approach. The problem is that we have one event source which transports a (potentially) pretty big amount of data. If we reduce the interval it would increase the load and the traffic. We are going to send increments via these events in the future - so this problem should also be fixed then.

> (The desired fix would be for apache to get rid of those workers directly when the client resets the stream, but that is not easily achieved.)

For my understanding: is one h2 worker processing exactly one request, and while processing not able to process other requests? If that is the case we would have to set H2MaxWorkers to a pretty big value.
> Yes that would be possible. Maybe we take that approach. The problem is that we have one event-source which is transporting (potentially) a pretty big amount of data. If we reduce the interval it would increase the load and the traffic. We are going to send increments via this events in the future - so this problem should also be fixed in the future. No knowledge what would work best in your setup. It could be possible to send a "no-change" chunk frequently in between. > For my understanding: Is one H2Worker is processing exactly one request and while processing is not able to process other requests? If this is the case we would have to set H2MaxWorkers to a pretty big value. Think of the h2 worker as a single HTTP/1.1 connection - only internally inside the server. mod_http2 bundles several of these onto the main connection to the client. If you disable h2, your browser would make separate h1 connections and, if it closes them, the standard mpm_event worker would hang for 30 seconds as well, before the server can re-use this worker for sth else. The real fix, and the one we started discussing on the dev mailing list, is that mod_proxy notices a h2 RST_STREAM immediately and abandon the request. But that is a fundamental change and may take a while. One other mitigation that comes before that is that the h2 workers become dynamic again. I will make a release with that on the github repro mod mod_h2 probably today, so people can do early testing. This would mean that the server only starts H2MinWorkers and goes up to H2MaxWorkers on load, going down again when this subsides again.
(In reply to Stefan Eissing from comment #51)
> The real fix, and the one we started discussing on the dev mailing list, is
> that mod_proxy notices a h2 RST_STREAM immediately and abandon the request.
> But that is a fundamental change and may take a while.

I see. Thanks for your effort - I think we are first going to fix our web application to not reset connections on the client side so often. Then we'll start migrating to http2. Thanks also for the tip to send keep-alive chunks so apache recognizes the closed connection faster.

As I understand it there is nothing more to do in this bug report - so it can be closed (at least from my side). Thanks again for your help - I have a much better understanding now of how apache2 handles http2 internally.
Thanks, Florian, for your patience! Nice working with you. I'll add the improvements from the patch to the next Apache release.