Bug 65402 - EventSources get stuck with H2 but working in http/1.1
Summary: EventSources get stuck with H2 but working in http/1.1
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_http2
Version: 2.4.48
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-28 11:35 UTC by Florian Meister
Modified: 2021-07-09 07:07 UTC
CC List: 0 users



Attachments
httpd.conf (24.63 KB, text/plain)
2021-06-28 11:35 UTC, Florian Meister
Details
httpd.conf (corrected) (24.63 KB, text/plain)
2021-06-28 11:37 UTC, Florian Meister
Details
apache2 error log (290.69 KB, application/gzip)
2021-06-28 11:49 UTC, Florian Meister
Details
httpd error log (309.69 KB, application/gzip)
2021-06-29 06:52 UTC, Florian Meister
Details
httpd error_log (551.89 KB, application/gzip)
2021-06-29 11:35 UTC, Florian Meister
Details
screenshot of pending request in developer console (213.97 KB, image/png)
2021-06-29 11:37 UTC, Florian Meister
Details
httpd error log (433.71 KB, application/gzip)
2021-06-29 15:05 UTC, Florian Meister
Details
screenshot of pending request in developer console (112.89 KB, image/png)
2021-06-29 15:06 UTC, Florian Meister
Details
httpd error log (479.76 KB, application/gzip)
2021-06-30 07:51 UTC, Florian Meister
Details
screenshot of pending request in developer console (207.96 KB, image/png)
2021-06-30 07:52 UTC, Florian Meister
Details
httpd error log (487.02 KB, application/gzip)
2021-07-01 11:34 UTC, Florian Meister
Details
screenshot of pending request in developer console (210.96 KB, image/png)
2021-07-01 11:35 UTC, Florian Meister
Details
httpd.conf (24.73 KB, text/plain)
2021-07-01 11:36 UTC, Florian Meister
Details
h2 worker stall fix against 2.4.x branch (2.88 KB, text/plain)
2021-07-02 09:52 UTC, Stefan Eissing
Details
httpd error log (475.28 KB, application/gzip)
2021-07-02 15:03 UTC, Florian Meister
Details
screenshot of pending request in developer console (194.80 KB, image/png)
2021-07-02 15:04 UTC, Florian Meister
Details
h2_stuck_fix-v2 against 2.4.x (2.62 KB, patch)
2021-07-05 08:28 UTC, Stefan Eissing
Details | Diff
httpd error log (754.78 KB, application/gzip)
2021-07-05 11:22 UTC, Florian Meister
Details
screenshot of pending request in developer console (225.41 KB, image/png)
2021-07-05 11:23 UTC, Florian Meister
Details

Description Florian Meister 2021-06-28 11:35:08 UTC
Created attachment 37912 [details]
httpd.conf

We want to switch our web application from http/1.1 to http/2. The application is using 2 EventSources (SSE, Server-Sent Events).

Apache2 is used as reverse-proxy with mod_proxy_balancer for the /api/* locations and is delivering static content (angular JS webApp) for the rest.

When enabling http/2 via "Protocols" the webapp works so far - but when navigating through the app the EventSources get stuck (Pending in Chrome Developer Tools). The problem is that the stuck EventSource is blocking all other requests - which should not happen when using http/2.

I played around with keepalive, disablereuse, some http2 settings - no luck.

To be sure that's not a general problem of our application I tried to use nginx as reverse proxy - that worked without problems.

I'll attach the relevant part of the apache2 config and the apache error log with:

LogLevel core:debug
LogLevel ssl:debug
LogLevel http2:trace2
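
For reference, a minimal sketch of the kind of vhost described above - the hostname, backend address, ports, paths and certificate locations are placeholders, not values taken from the attached httpd.conf:

    <VirtualHost *:443>
        ServerName localhost
        Protocols h2 http/1.1

        SSLEngine on
        SSLCertificateFile    /path/to/cert.pem
        SSLCertificateKeyFile /path/to/key.pem

        # static angular webapp
        DocumentRoot "/var/www/webapp"

        # /api/* is reverse-proxied to the backend pool via mod_proxy_balancer
        <Proxy "balancer://backend">
            BalancerMember "http://127.0.0.1:8080"
        </Proxy>
        ProxyPass        "/api/" "balancer://backend/api/"
        ProxyPassReverse "/api/" "balancer://backend/api/"

        LogLevel core:debug
        LogLevel ssl:debug
        LogLevel http2:trace2
    </VirtualHost>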
Comment 1 Florian Meister 2021-06-28 11:37:55 UTC
Created attachment 37913 [details]
httpd.conf (corrected)
Comment 2 Florian Meister 2021-06-28 11:49:19 UTC
Created attachment 37914 [details]
apache2 error log
Comment 3 Stefan Eissing 2021-06-28 12:33:13 UTC
Thanks for the logs. Can you specify the URLs of the requests that seem to hang? And do you count those as "blocking" or what exactly do you mean by that?
Comment 4 Florian Meister 2021-06-28 13:44:23 UTC
Thanks for the ultra fast reply :)

There are 2 event-streams which are used by the application.

https://localhost/api/provide-production-infrastructure/get-live-state-info-multi
https://localhost/api/provide-operation-live-info/get-operation-live-info-multi

Blocking means that in the chrome developer tools the request is pending. Interestingly _no_ other requests are performed - they are also pending. So it seems that this request is blocking the other ones.

When this happens apache stops logging these messages. Because it's encrypted traffic I cannot see whether the browser stops sending requests or apache stops processing them. But it seems apache stops routing requests to the backend servers. After about 30 seconds apache is processing the requests again.

[Mon Jun 28 08:57:36.705264 2021] [http2:debug] [pid 245373:tid 140574984357440] h2_task.c(83): [client 127.0.0.1:56158] AH03348: h2_task(66-3): open output to GET localhost /api/blah blah
Comment 5 Stefan Eissing 2021-06-28 14:54:52 UTC
The log looks incomplete to me. Where I see the main connection logging, I do not see the h2 worker ones and vice versa. I need to match those to see what overall is happening.

What I can see on the h2 workers for the mentioned urls is that they seem to block and timeout after 30 seconds. But then I do not see what the main connection handler is doing in the meantime.

For example: when I fgrep for "208-71" I see what 'pid 222:tid 140040340637440' is doing (the worker), but nothing else. When I grep for '194-71' I see what the main connection is doing, but nothing from the worker.

Could you check if something went wrong here or reproduce a complete new log?
Comment 6 Florian Meister 2021-06-29 06:52:43 UTC
Created attachment 37916 [details]
httpd error log

The first log was generated by piping stderr to a file (because apache 2.4.48 was started via a docker container; the default is to log to stdout/stderr).

Now I've reconfigured apache2 to log to a file.

I've reproduced the problem and this is the error log written by apache2 itself.

If you need something else or if you face the same problem let me know - maybe there is something wrong with the log settings.

You said you have to match the h2 workers and the main connection - but I don't know which identifier links one to the other. So I was not able to do a preflight check that everything you need is in the log.
Comment 7 Stefan Eissing 2021-06-29 10:16:41 UTC
Thanks for the new log. What I see on both SSE points is:

1. The client aborts the requests (RST_STREAM) after commonly 1-2 seconds; there are 2 requests that last 5 seconds and one with 18 seconds.
2. All requests produce the response headers and one chunk of body data (525 bytes on one endpoint, 447 on the other). Those are sent to the client.
3. The HTTP/2 layer does not see a second chunk from the SSE endpoints, until the 30s read timeout happens, then one more chunk is seen.

Point 3 indicates that someone is buffering data on the proxy connection, but after 30 seconds the buffer only contains one more chunk of data (e.g. 447 or 525 bytes, not much).

The SSE sources do not seem to provide data often (is that the intention?). An initial chunk and then silence more or less. However the browser closes the open requests quite aggressively (at least it looks like this, cannot see page reloads).

There is nothing blocking in the HTTP/2 layer per se that I can see. It looks as if the browser detects frequent data to come and is unhappy and the proxied endpoint does not deliver.

How often should the SSE sources produce data, and why does the browser often close requests 1 second after starting them?
Comment 8 Florian Meister 2021-06-29 10:44:20 UTC
> Thanks for the new log. What I see on both SSE points is:

> 1. The client aborts the requests (RST_STREAM) after commonly 1-2 seconds; there are 2 requests that last 5 seconds and one with 18 seconds.

Yes. That's me clicking around in the app. After every route change the 2 EventSources are destroyed and connected again. (I know that's not how it should be - we are working on persisting the event-sources across route changes in the app.)

> 2. All requests produce the response headers and one chunk of body data (525 bytes on one endpoint, 447 on the other). Those are sent to the client.

> 3. The HTTP/2 layer does not see a second chunk from the SSE endpoints, until the 30s read timeout happens, then one more chunk is seen.
> Point 3 indicates that someone is buffering data on the proxy connection, but after 30 seconds the buffer only contains one more chunk of data (e.g. 447 or 525 bytes, not much).
> The SSE sources do not seem to provide data often (is that the intention?). An initial chunk and then silence more or less. However the browser closes the open requests quite aggressively (at least it looks like this, cannot see page reloads).

That's true. When connecting to the SSE endpoint there is some initial payload - but then we are sending updates only every 30? seconds or when some change is happening. The closed connections are me navigating through the page (as described above).

> There is nothing blocking in the HTTP/2 layer per se that I can see. It looks as if the browser detects frequent data to come and is unhappy and the proxied endpoint does not deliver.

Can you clarify that for me - I don't understand the part with "the browser detects frequent data ..."
Comment 9 Stefan Eissing 2021-06-29 10:53:05 UTC
> Yes. That's me clicking around in the app. After every route change the 2 EventSources are destroyed and connected again. (I know that's not how it should be - we are working on persisting the event-sources across route changes in the app.)

I see. Ok that explains this.

> That's true. When connecting to the SSE endpoint there is some initial payload - but then we are sending updates only every 30? seconds or when some change is happening. The closed connections are me navigating through the page (as described above).

Ok, if the SSE source sends roughly every 30 seconds, this explains what I see:
- the browser closes
- mod_http2 acknowledges, and marks the worker thread as being aborted
- the worker thread is in a blocking read and gets some data after 30 seconds, detects the frontend connection is gone and terminates itself.

> Can you clarify that for me - I don't understand the part with "the browser detects frequent data ..."

That was my question about the early closing of the SSE requests. But you explained that you navigated the webapp and caused this yourself.

What I do not see is any "blocking" behaviour in the server. If you see a request in the browser dev tools that is not answered, you should see the URL of that request and then we can check if we see it in the server logs and what is happening to it.

For the SSE source requests, the browser will most likely show a "waiting" status since the complete response is never received - as intended for the SSE, right?
Comment 10 Yann Ylavic 2021-06-29 10:58:59 UTC
(In reply to Stefan Eissing from comment #7)
> 
> Point 3 indicates that someone is buffering data on the proxy connection, but
> after 30 seconds the buffer only contains one more chunk of data (e.g. 447
> or 525 bytes, not much).

Couldn't:

    ap_remove_output_filter_byhandle(c->output_filters,
                                     "SSL/TLS Coalescing Filter");

on the h2 master connection help here?
Comment 11 Florian Meister 2021-06-29 11:16:20 UTC
(In reply to Stefan Eissing from comment #9)
> Ok, if the SSE source sends roughly every 30 seconds, this explains what I
> see:
> - the browser closes
> - mod_http2 acknowledges, and marks the worker thread as being aborted
> - the worker thread is in a blocking read and gets some data after 30
> seconds, detects the frontend connection is gone and terminates itself.

To reproduce the behavior I have to navigate a few times through the web app (initiate route changes). When no route change happens (the event-sources are not destroyed and built up again) the app works as expected.
 
> What I do not see is any "blocking" behaviour in the server. If you see a
> request in the browser dev tools that is not answered, you should see the
> URL of that request and then we can check if we see it in the server logs
> and what is happening to it.

Yes, I can reproduce the issue and provide a log with an exact (to the second) time when the first request is "pending" in the browser, and which one it is. FYI: I've also tried firefox to ensure it's not a browser problem.

> For the SSE source requests, the browser will most likely show a "waiting"
> status since the complete response is never received - as intended for the
> SSE, right?

Absolutely correct. The HTTP headers are sent by the server with Content-Type: text/event-stream. Then the response body never finishes (ideally; sometimes the request is canceled by the browser or by some other hardware/software in between).

Maybe another helpful insight is that it works when using nginx (as already written). nginx has a lot more TCP connections to the backend in close/wait state. For some reason it seems to finish every TCP connection when an HTTP request is finished and establish a new one. I've made no special configuration regarding keepalive, it's the default config. Apache2 has far fewer connections to the backend and seems to reuse them - regardless of whether I enable or disable keepalive.

I'll come back to you with the new logs.
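
For reference, the backend connection reuse discussed above is controlled by key=value parameters on ProxyPass/BalancerMember; a hedged sketch with a placeholder backend address:

    # disablereuse=On makes httpd close the backend connection after each request
    # (closer to the observed nginx behaviour); keepalive=On only turns on TCP
    # keepalive probes and does not control connection reuse.
    BalancerMember "http://127.0.0.1:8080" disablereuse=On keepalive=On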
Comment 12 Florian Meister 2021-06-29 11:35:29 UTC
Created attachment 37917 [details]
httpd error_log

Here are the logs:

- I opened the app at 13:28:20
- I changed route around 4-5 times
- at 13:28:39-40 the Request https://localhost/device/1c942d82-600c-484c-a62d-95c910ee86f0/operation-documents was pending
- at 13:28:52 the request finished - as did the other requests which were also pending.
Comment 13 Florian Meister 2021-06-29 11:37:35 UTC
Created attachment 37918 [details]
screenshot of pending request in developer console

This is the screenshot of the pending request.
You can see that it's around 12 seconds in pending state.
Comment 14 Florian Meister 2021-06-29 11:40:37 UTC
(In reply to Florian Meister from comment #13)
> Created attachment 37918 [details]
> screenshot of pending request in developer console
> 
> This is the screenshot of the pending request.
> You can see that it's around 12 seconds in pending state.

It's a TTFB (time to first byte) of 12 seconds - so neither headers nor body is received.
Comment 15 Stefan Eissing 2021-06-29 13:13:28 UTC
From the new log (let's call this error3.log):

TTFB of 12 seconds for "get-live-combined..."
2 requests for that in the log: 128-81 and 128-95

In HTTP/2 I see the timely responses and DATA frames sent out on the main connection, always with a FLUSH bucket (details below). It seems that Ylavic points in an interesting direction with the mod_ssl coalescing filter: maybe it ignores the FLUSH buckets and buffers things?

Time to get an expert on that into this ticket, I guess.


Timeline of requests:

-81:
  11:28:21.249985: received on thread 140115783444224
  11:28:21.250043: added to q
  11:28:21.250055: process connection on thread 140116562450176
  11:28:21.281608: task sends out response headers + first chunk of 924 bytes
  11:28:21.281769: main conn resumes stream 81
  11:28:21.281787: main adds to stream out buffer (H2HEADERS BEAM[924] BEAM[4462] FLUSH)
  11:28:21.281834: main conn sends HEADER FRAME
  11:28:21.281863: main conn sends 1. DATA FRAME
  11:28:21.281886: main conn out: heap[5409] flush 
  ...regular adding of response chunks
  11:28:22.278812: 2. DATA FRAME
  11:28:23.276869: 3. DATA FRAME
  11:28:24.281568: 4. DATA FRAME
  11:28:25.278065: 5. DATA FRAME
  11:28:26.274650: 6. DATA FRAME
  11:28:27.282371: 7. DATA FRAME
  11:28:28.276057: 8. DATA FRAME
  11:28:28.697597: client RST_STREAM received
  11:28:28.697695: stream in CLEANUP state
  11:28:29.276504: task wakes on next chunk, sees output aborted
  11:28:29.277021: task done, 8026.978000 ms elapsed

-95:
  11:28:28.697648: recv HEADER FRAME
  11:28:28.726090: send HEADER FRAME
  11:28:28.726404: 1. DATA FRAME (5386)
  11:28:28.726469: main conn out 5395 bytes + flush
  11:28:29.727619: 2. DATA FRAME (5386)
  11:28:30.723589: 3. DATA FRAME (5386)
  11:28:31.727982: 4. DATA FRAME (5386)
  11:28:32.727119: 5. DATA FRAME (5386)
  11:28:33.722943: 6. DATA FRAME (5386)
  11:28:34.723095: 7. DATA FRAME (5386)
  11:28:35.732271: 8. DATA FRAME (5386)
  11:28:36.725798: 9. DATA FRAME (5386)
  11:28:37.727986: 10. DATA FRAME (5386)
  11:28:38.720243: 11. DATA FRAME (5386)
  11:28:39.724627: 12. DATA FRAME (5386)
  11:28:40.722961: 13. DATA FRAME (5386)
  11:28:41.722298: 14. DATA FRAME (5386)
  11:28:42.719747: 15. DATA FRAME (5386)
  11:28:43.727449: 16. DATA FRAME (5386)
  ...
  11:31:00.723799: nn. DATA FRAME (5386)
Comment 16 Yann Ylavic 2021-06-29 13:34:42 UTC
(In reply to Stefan Eissing from comment #15)
> In HTTP/2 I see the timely responses and DATA frames sent out on the main
> connection, always with a FLUSH bucket (details below). It seems that Ylavic
> points in an interesting direction with the mod_ssl coalescing filter: maybe
> it ignores the FLUSH buckets and buffers things?

Wrong track I'm afraid, if all the data are FLUSHed then the coalescing filter shouldn't retain anything (frees all on any metadata).
Comment 17 Yann Ylavic 2021-06-29 13:48:08 UTC
Looking at your timeline traces, all but the first DATA FRAME don't seem to be flushed; is that because it's implicit (you just didn't indicate it), or is there no flush bucket added for those?
Comment 18 Stefan Eissing 2021-06-29 13:49:58 UTC
No, they are all flushed; H2 writes everything flushed. I just did not copy it from the log for all frames.
Comment 19 Yann Ylavic 2021-06-29 13:55:09 UTC
I think a network capture (on the httpd side) is needed to confirm whether the first FRAME really did not go out before 12s. It shouldn't be the case given the traces..
Comment 20 Yann Ylavic 2021-06-29 13:59:26 UTC
Or, alternatively, at LogLevel trace7 we should be able to see what happens at the core/network output filter level.
Comment 21 Yann Ylavic 2021-06-29 14:01:27 UTC
Actually, LogLevel trace6 should be enough.
Comment 22 Florian Meister 2021-06-29 15:05:17 UTC
Created attachment 37919 [details]
httpd error log

Error Log with http2:trace6

- Loading App at 14:58:00
- First pending Request at approximately 14:58:23

PS: I don't see any trace6 logs, only up to trace3. If you need a network capture let me know :)
Comment 23 Florian Meister 2021-06-29 15:06:16 UTC
Created attachment 37920 [details]
screenshot of pending request in developer console

Screenshot from last trace (with http2:trace6)
Comment 24 Yann Ylavic 2021-06-29 19:01:03 UTC
Sorry I wasn't clear enough, we need core logs at trace6 (the core output filter, not http2 only), so either "LogLevel trace6" for core and all modules at that level, or maybe "LogLevel core:trace6 http2:trace6" works too.
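
For reference, either variant below should produce the requested core output-filter traces (placed in the affected VirtualHost or the global config):

    # everything at trace6 (very verbose):
    LogLevel trace6

    # or raise only core and http2 and keep the rest quieter:
    LogLevel core:trace6 http2:trace6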
Comment 25 Florian Meister 2021-06-30 07:51:50 UTC
Created attachment 37922 [details]
httpd error log

Error Log with LogLevel trace6

7:47:20 Loading App
7:47:42 First Pending Request
7:47:55 Finished Pending Requests
Comment 26 Florian Meister 2021-06-30 07:52:33 UTC
Created attachment 37923 [details]
screenshot of pending request in developer console
Comment 27 Yann Ylavic 2021-06-30 11:46:50 UTC
Stefan,

looking at 1-155 (GET https://localhost/api/provide-production-infrastructure/get-live-state-info-multi?deviceId=1c942d82-600c-484c-a62d-95c910ee86f0).

07:47:40.524371: received (thread 140450790696704)
07:47:40.524374: added to q
<10s gap here>
07:47:50.869590: task processing (thread 140450824267520)
07:47:50.870611: forwarded to backend
07:47:50.899823: response from backend
07:47:50.899870: task send_out(H2HEADERS TRANSIENT[447])
07:47:50.899936: main conn resumes stream 155 by (thread 140450790696704)
07:47:50.899957: main adds to stream out buffer (H2HEADERS BEAM[447] FLUSH)
07:47:50.900011: main conn sends HEADER FRAME
07:47:50.900043: main conn sends first DATA FRAME
07:47:50.900130: 492/492 bytes sent on the network
<no more news from the backend until httpd is stopped>

Can't tell about this 10s gap..
Comment 28 Stefan Eissing 2021-06-30 14:33:29 UTC
I think what we are looking at is h2 worker starvation in this proxy scenario.

When a long-running SSE request gets an RST_STREAM on the main connection, the h2 worker is in a blocking read, and only when that read delivers something will it detect that it has been aborted. That can take up to 30 seconds here.

We can test this by raising `H2MaxWorkers` to some very high level and checking if the stalls disappear.

The real fix would be to have an immediate cancellation of ongoing requests, but I do not know how to do this.

In http/1.1, where this problem does not seem to happen, do we not see it because there are just enough event workers or does mpm_event know how to disrupt an ongoing read from the backend?

Florian: can you test whether "H2MaxWorkers 512" makes the problem disappear, at least for a while?
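
For that test, the directive goes into the server or VirtualHost configuration; 512 is a deliberately high value to rule out worker starvation, not a production recommendation:

    # temporary test value, see comment above
    H2MaxWorkers 512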
Comment 29 Eric Covener 2021-06-30 14:40:38 UTC
> or does mpm_event know how to disrupt an ongoing read from the backend?

IIUC no; only a websockets tunnel would monitor both connections at the same time. Any other time, while I/O is waiting on the backend to write, the frontend socket is not being watched.
Comment 30 Florian Meister 2021-07-01 11:34:33 UTC
Created attachment 37926 [details]
httpd error log

Error Log with increased H2MaxWorkers (512) defined in VirtualHost

11:27:10 loading app
11:27:35 first pending request
Comment 31 Florian Meister 2021-07-01 11:35:29 UTC
Created attachment 37927 [details]
screenshot of pending request in developer console

Screenshot of Pending Request with increased H2 Workers
Comment 32 Florian Meister 2021-07-01 11:36:51 UTC
Created attachment 37928 [details]
httpd.conf

httpd.conf with increased H2MaxWorkers
Comment 33 Florian Meister 2021-07-01 11:38:33 UTC
Stefan, I increased H2MaxWorkers to 512 - I do not notice any difference. It seems the problem appears at the same time with the same intensity.
Comment 34 Stefan Eissing 2021-07-02 07:33:02 UTC
Florian, I'll take a look at this today. There seems to be something else amiss here. Maybe you are hitting a hitherto unnoticed edge case where idle h2 workers are not accepting tasks as they should.
Comment 35 Stefan Eissing 2021-07-02 09:52:29 UTC
Created attachment 37930 [details]
h2 worker stall fix against 2.4.x branch

A patch against the Apache subversion branches/2.4.x to fix the issue of requests not being processed in a timely fashion.
Comment 36 Stefan Eissing 2021-07-02 09:59:03 UTC
Florian, I think I found the issue and attached a patch to this ticket. Can you build a new mod_http2 from the 2.4.48 sources or would you prefer a patch against the github version of the module?

Background: 

mod_http2 had 2 occasions where scheduled requests were assigned to workers: a) when a request arrived and b) when a worker was done serving a request.

When more than 1 request arrived at the same time, it could happen that only the first one got an assignment while the others remained queued without any worker looking for them. This did not matter when requests were short, since any finished request would trigger a worker assignment of the queued requests.

However, with SSE and long-running requests, this could take a considerable time to happen.

The fix makes sure that worker assignment is always triggered when a new request arrives, even when requests arrive in batches.

I hope this addresses your issue.
Comment 37 Florian Meister 2021-07-02 12:03:22 UTC
Stefan, great news! I'll come back to you once I've tested it :)
Comment 38 Florian Meister 2021-07-02 13:06:23 UTC
Stefan, I'm afraid I have to tell you that the fix does not change the behavior. Can I do anything to help you? Will another trace help you?
Comment 39 Stefan Eissing 2021-07-02 13:10:56 UTC
Yes, please. I'd like to see a trace of an n-second-delayed first byte under the new patch. *sigh*
Comment 40 Florian Meister 2021-07-02 15:03:26 UTC
Created attachment 37935 [details]
httpd error log

Error Log with patch for mod_http2

16:58:20 start app
16:58:36 first pending request
16:58:52 finished pending requests
Comment 41 Florian Meister 2021-07-02 15:04:19 UTC
Created attachment 37936 [details]
screenshot of pending request in developer console

Screenshot of pending Request in Developer Console
Comment 42 Florian Meister 2021-07-02 15:05:53 UTC
Stefan, I've uploaded the new logs. Thanks for investigating :)
Comment 43 Stefan Eissing 2021-07-03 17:07:05 UTC
If you don't know what is going on, write a helper. Today I wrote <https://github.com/icing/h2wtf> to look at your logs.

My analysis now is that your webapp runs into the h2 DoS protection. mod_http2 does not really like it when the client starts and then kills requests. But it does like it when the client reads responses completely and without stalling.

There is a "mood" counter and when mood drops, the number of active requests that a client may perform is reduced. When mod_http2 is happy, this limit is increased.

The streams with the late starts happen at a time when the limit has been exhausted. E.g. in the latest log, the limit was reduced to 16 and 16 requests were ongoing when a new one came in. This means no worker will be assigned to that request until another of the 16 returns.

Since there are requests hanging, waiting on the next SSE chunk before they can abort, this may take some time. In the last log it took the 16 seconds.

So, I understand what is happening. That is something.

I do not have a good idea how to improve this. Have to think a bit on that.
Comment 44 Stefan Eissing 2021-07-05 08:28:41 UTC
Created attachment 37941 [details]
h2_stuck_fix-v2 against 2.4.x

Considering RST_STREAM on eventstreams as acceptable. 
DEBUG logging when request processing limit is exceeded.
Comment 45 Stefan Eissing 2021-07-05 08:29:58 UTC
Florian: I attached a new patch to address the problem. Hope this solves the issue for you.
Comment 46 Florian Meister 2021-07-05 11:22:32 UTC
Created attachment 37942 [details]
httpd error log

error log with patch-v2
13:13:40 loading app
13:13:58 first pending request
13:14:12 finished pending requests
Comment 47 Florian Meister 2021-07-05 11:23:21 UTC
Created attachment 37943 [details]
screenshot of pending request in developer console

screenshot of pending request in dev console (patch-v2)
Comment 48 Florian Meister 2021-07-05 11:25:54 UTC
Stefan, it's much better now - the problem appears after a lot more route changes than before - but it's not solved completely. I've attached new logs. Thanks - flo
Comment 49 Stefan Eissing 2021-07-05 15:27:09 UTC
I see 37 h2 workers busy at the time requests '66-289' to '66-295' arrive, and that is the maximum in your server configuration. Processing can only start when a worker becomes available again.

You can mitigate that by an increased `H2MaxWorkers` configuration.

The question is what usage you anticipate on that server when you take it live. Increasing numbers of users with these SSE requests will eventually exhaust whatever worker limit you configure.

Question: can you reduce the 30-second update time on the SSE generators?

(The desired fix would be for apache to get rid of those workers directly when the client resets the stream, but that is not easily achieved.)
Comment 50 Florian Meister 2021-07-06 09:21:43 UTC
(In reply to Stefan Eissing from comment #49)
> I see 37 h2 workers busy at the time requests '66-289' to '66-295' arrive,
> and that is the maximum in your server configuration. Processing can only
> start when a worker becomes available again.
> 
> You can mitigate that by an increased `H2MaxWorkers` configuration.
> 
> The question is what usage you anticipate on that server when you take it
> live. Increasing numbers of users with these SSE requests will eventually
> exhaust whatever worker limit you configure.
> 
> Question: can you reduce the 30-second update time on the SSE generators?

We have eliminated more or less all request cancellations when navigating through the page - so it's much better now even without your fix.

Yes, that would be possible. Maybe we will take that approach. The problem is that we have one event-source which transports a (potentially) pretty big amount of data. If we reduce the interval, it would increase the load and the traffic. We are going to send increments via these events in the future - so this problem should also be fixed in the future.

> 
> (The desired fix would be for apache to get rid of those workers directly
> when the client resets the stream, but that is not easily achieved.)

For my understanding: is one H2 worker processing exactly one request, and while processing, unable to process other requests? If this is the case we would have to set H2MaxWorkers to a pretty big value.
Comment 51 Stefan Eissing 2021-07-06 09:55:30 UTC
> Yes, that would be possible. Maybe we will take that approach. The problem is that we have one event-source which transports a (potentially) pretty big amount of data. If we reduce the interval, it would increase the load and the traffic. We are going to send increments via these events in the future - so this problem should also be fixed in the future.

I have no knowledge of what would work best in your setup. It could be possible to send a "no-change" chunk frequently in between.

> For my understanding: is one H2 worker processing exactly one request, and while processing, unable to process other requests? If this is the case we would have to set H2MaxWorkers to a pretty big value.

Think of the h2 worker as a single HTTP/1.1 connection - only internally inside the server. mod_http2 bundles several of these onto the main connection to the client.

If you disable h2, your browser would make separate h1 connections and, if it closes them, the standard mpm_event worker would hang for 30 seconds as well, before the server can re-use this worker for something else.

The real fix, and the one we started discussing on the dev mailing list, is that mod_proxy notices an h2 RST_STREAM immediately and abandons the request. But that is a fundamental change and may take a while.

One other mitigation, which comes before that, is that the h2 workers become dynamic again. I will make a release with that on the github repo of mod_h2, probably today, so people can do early testing. This would mean that the server starts only H2MinWorkers workers and goes up to H2MaxWorkers under load, going back down again when the load subsides.
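
A sketch of the directives involved, with purely illustrative values; the scale-down behaviour described above is what the mentioned mod_h2 release is meant to provide:

    # start with a small pool and allow growth up to the maximum under load
    H2MinWorkers 16
    H2MaxWorkers 256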
Comment 52 Florian Meister 2021-07-09 06:09:22 UTC
(In reply to Stefan Eissing from comment #51)
> The real fix, and the one we started discussing on the dev mailing list, is
> that mod_proxy notices an h2 RST_STREAM immediately and abandons the request.
> But that is a fundamental change and may take a while.

I see. Thanks for your effort - I think we will first fix our web application so that it does not reset connections on the client side so often. Then we'll start migrating to http2. Thanks for the tip to send keep-alive chunks so apache recognizes the closed connection faster.

As I understand it, there is nothing more to do in this bug report - so it can be closed (at least from my side).

Thanks again for your help - I have a much better understanding now of how apache2 handles http2 internally.
Comment 53 Stefan Eissing 2021-07-09 07:07:53 UTC
Thanks, Florian, for your patience! Nice working with you.

I'll add the improvements from the patch to the next Apache release.