Bug 63170

Summary: mod_proxy_http2 segmentation fault and out of memory condition
Product: Apache httpd-2
Component: mod_proxy_http2
Version: 2.4.38
Hardware: PC
OS: All
Status: RESOLVED FIXED
Severity: normal
Priority: P2
Target Milestone: ---
Reporter: Warren D. Johnson <warren>
Assignee: Apache HTTPD Bugs Mailing List <bugs>
CC: jacob
Attachments:
sample output
Error log entries from main client's website apache server log (/var/log/httpd/error_log) (not from reverse proxy)
Error log from client virtual host - i.e. /home/CLIENT/logs/error_log
Error log from reverse proxy
trace8 level debugging on busier reverse proxy server http2
trace8 level debugging on website server (not reverse proxy)
High cpu usage

Description Warren D. Johnson 2019-02-13 03:38:03 UTC
Hello Team,

I've been using mod_proxy_http for a few years now in a Reverse Proxy scenario.  Recently we wanted to use the HTTP/2 protocol, so we compiled and configured mod_proxy_http2.  The reverse proxy forwards requests to another system that is also configured for HTTP/2.  All components are built from source.  When we enable the HTTP/2 ("h2") protocol we see frequent segmentation faults as well as out of memory conditions.  As soon as we downgrade the protocol to HTTP/1.1, the segfaults and OOM conditions go away.  These problems only occur on the Reverse Proxy, not on the receiving web server.

Segfaults look like this:
[Tue Feb 12 20:48:15 2019] [notice] [pid 2880] mpm_unix.c(436): [client AH00052: child pid 3224 exit signal Segmentation fault (11)

[Tue Feb 12 20:48:15 2019] [notice] [pid 2880] mpm_unix.c(436): [client AH00052: child pid 3313 exit signal Segmentation fault (11)

On the Reverse Proxy, we are using the following software:
httpd-2.4.38
nghttp2-1.36.0
openssl-1.0.2q
pcre-8.41
apr-1.6.5
apr-util-1.6.1

On the web server behind the Reverse Proxy, we are using the following software:
httpd-2.4.37
nghttp2-1.30.0
openssl-1.0.2o
pcre-8.41
apr-1.6.5
apr-util-1.6.1

The virtual host for the SSL configuration on the Reverse Proxy is the following (with irrelevant settings removed):

<VirtualHost 172.31.8.194:443>

SSLEngine on

Protocols h2 h2c http/1.1

ServerName www.example.com

ProxyPreserveHost On

ProxyPass "/" "h2://www.example.com/" 
ProxyPassReverse "/" "h2://www.example.com/"

</VirtualHost>

If we change the Protocols line to simply
Protocols http/1.1
then the segfault and out of memory conditions stop.

I did some printf-level logging in modules/http2/mod_proxy_http2.c and I can see that it gets stuck in some sort of infinite loop.  I believe the infinite loop is probably what's eating up the memory (so, for example, it's allocating memory for structures and char pointers a zillion times but not releasing them).

Here is a trace level example of what's being spit out when mod_proxy_http2 gets into an infinite loop:

[Tue Feb 12 17:16:53 2019] [trace1] [pid 18535] mod_proxy_http2.c(351): [client 64.136.202.39:61387] eng(eng-proxy-51539607943): setup session

[Tue Feb 12 17:16:53 2019] [debug] [pid 18535] mod_proxy_http2.c(364): [client 64.136.202.39:61387] AH03373: eng(eng-proxy-51539607943): run session eng-proxy-51539607943

[Tue Feb 12 17:16:53 2019] [debug] [pid 18535] mod_proxy_http2.c(640): [client 64.136.202.39:61387] AH03376: run_session, again

[Tue Feb 12 17:16:53 2019] [trace1] [pid 18535] mod_proxy_http2.c(351): [client 64.136.202.39:61387] eng(eng-proxy-38654706035): setup session

[Tue Feb 12 17:16:53 2019] [debug] [pid 18535] mod_proxy_http2.c(364): [client 64.136.202.39:61387] AH03373: eng(eng-proxy-38654706035): run session eng-proxy-38654706035

[Tue Feb 12 17:16:53 2019] [debug] [pid 18535] mod_proxy_http2.c(640): [client 64.136.202.39:61387] AH03376: run_session, again

[Tue Feb 12 17:16:53 2019] [trace1] [pid 18535] mod_proxy_http2.c(351): [client 64.136.202.39:61387] eng(eng-proxy-51539607945): setup session

[Tue Feb 12 17:16:53 2019] [debug] [pid 18535] mod_proxy_http2.c(364): [client 64.136.202.39:61387] AH03373: eng(eng-proxy-51539607945): run session eng-proxy-51539607945
 
I can't explain it exactly, but it almost feels like the web browser completes its HTTP/2 requests but mod_proxy_http2 isn't notified, so it goes crazy.  Another thing I noticed is that I only seem to be able to induce the OOM condition when using Firefox 65 (Windows 10) and not Chrome or Edge.

Please let me know if there are other details I can provide.  We've spent a lot of time on this bug and simply don't know what to do.

Thanks,
Warren
Comment 1 Warren D. Johnson 2019-02-13 05:35:47 UTC
Thought of some other details:

Both Web server and Reverse Proxy server are running CentOS 7.

All services are hosted on AWS (ec2 instances).

Reverse Proxy is a clean install of CentOS 7 onto an AWS ec2 instance.

Web server CentOS 7 is an OpenVZ container.

I built another reverse proxy server on CentOS 7 and, instead of building Apache from source, I installed it using repositories maintained by codeit.guru.  Those packages experienced the same out of memory problems as my source build.  In the case of the codeit.guru build, we were using "Protocols h2" in the SSL vhost.

I built another reverse proxy server (as a test) using Ubuntu 18 and installed all necessary packages using apt-get.  That includes mod_http2, mod_proxy, mod_proxy_http2, mod_ssl, etc.  We experienced the same memory leak issues.

If I use "Protocol HTTP/1.1" instead of "Protocol h2" in the vhost entry in the ssl.conf file, I do not experience any problems.  It is something in the interplay between h2 on the web server and h2 on the reverse proxy server.
Comment 2 Stefan Eissing 2019-02-13 10:01:12 UTC
Warren, thanks for your detailed report. Let me try to summarize your observations, if I got this right:
- the problem happens when HTTP/2 is used between browser and frontend
- the problem disappears when HTTP/1.1 is used

The log seems to indicate that a task is re-scheduled over and over again and the internal state goes out of whack.

Would you be able to build and test with a version from the github repository at https://github.com/icing/mod_h2?

I made some changes during the last week that also affect task re-scheduling, and am curious whether they address this.
Comment 3 Warren D. Johnson 2019-02-13 16:16:09 UTC
Created attachment 36436 [details]
sample output
Comment 4 Warren D. Johnson 2019-02-13 16:16:25 UTC
I built with the latest changes from git and it did not fix the problem.  I copied your source from mod_h2 into modules/http2 and went back to the httpd main folder and did a "make clean", "make" and "make install".
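
For reference, the rebuild went roughly like this (a sketch; the directory names for the mod_h2 checkout and the httpd source tree are my shorthand, not exact paths):

# copy the module sources from the mod_h2 checkout into the httpd tree
cp mod_h2/mod_http2/*.c mod_h2/mod_http2/*.h httpd-2.4.x/modules/http2/

# rebuild and reinstall httpd with the updated module sources
cd httpd-2.4.x
make clean
make
make install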

I'm attaching a file sample_error.txt.  It has an abundance of trace and debug level information.  Line 775 is the last line before it breaks into an infinite loop.

Thanks
Comment 5 Warren D. Johnson 2019-02-13 21:30:10 UTC
Also, there is almost no traffic on the reverse proxy server and almost no traffic on the web server.  Any insistence by mod_proxy_http2 (or anyone) that it's out of resources is inaccurate.
Comment 6 Stefan Eissing 2019-02-18 12:21:57 UTC
I made a fix in the github version of the module. If you could copy the source into your build tree and run that version again? It should log version 1.12.4-git for the modules.

What I see in your logs are 2 things:
1. the first request for /pumps-accessories/sump-pumps.html is answered by your backend with headers and body, but the EOF of the body never comes.
2. The proxy_http2 code then gets a timeout waiting for the EOF and goes into an infinite loop afterwards.

The new version addresses problem 2. You may still see error 1, but the connection should be cleaned up properly afterwards and your server should remain stable and not run out of resources.

If that indeed works as intended, we can have a look at problem 1.  For that, a log from both frontend and backend at level trace2 or higher would be really helpful, ideally of the same request, if you can control the traffic on your setup.

Thanks for the help!
Comment 7 Warren D. Johnson 2019-02-19 03:34:27 UTC
Created attachment 36440 [details]
Error log entries from main client's website apache server log (/var/log/httpd/error_log)  (not from reverse proxy)
Comment 8 Warren D. Johnson 2019-02-19 03:36:00 UTC
Created attachment 36441 [details]
Error log from client virtual host - i.e. /home/CLIENT/logs/error_log
Comment 9 Warren D. Johnson 2019-02-19 03:36:52 UTC
Created attachment 36442 [details]
Error log from reverse proxy
Comment 10 Warren D. Johnson 2019-02-19 03:40:52 UTC
Hi Stefan,

Thanks for your hard work and attention on this matter.  Unfortunately, the changes you have pushed to github have not solved the out of memory issue.  I copied the latest changes from github into modules/http2, then did "make clean", "make", "make install" on Apache.  I then enabled the h2 protocol on the reverse proxy and ran a test.  Almost immediately (within the first few page requests) the reverse proxy went into an infinite loop and used up all the memory on the server.

I captured trace8 level information from the following locations:

Main Apache server log from client's website (i.e. /var/log/httpd/error_log)
Client's virtual host error log (i.e. /home/CLIENT/logs/error_log)
Reverse proxy logging.

I attached each of these files to the bug report.  You will see them as:

Error log entries from main client's website apache server log (/var/log/httpd/error_log) (not from reverse proxy)

Error log from client virtual host - i.e. /home/CLIENT/logs/error_log

Error log from reverse proxy

There appears to be a lot of interesting debug information in those files.

Thanks for your time!

~Warren
Comment 11 Stefan Eissing 2019-02-19 10:24:07 UTC
Warren, thanks for the logs. I prepared a new version, v1.12.5, at github. If you could test that one on your proxy that would be great. Thanks for your patience!
Comment 12 Warren D. Johnson 2019-02-19 16:09:56 UTC
Stefan,

I only had a few minutes to do a few quick tests but so far it looks like the bug is fixed. I will have more time later today to test and will update this thread.

Have you seen anything interesting on the website's server (not proxy)? That server is only running mod_http2.

Thanks!
Comment 13 Stefan Eissing 2019-02-19 16:27:17 UTC
Thanks for the quick feedback. Looking forward to your report later.

On the backend server logs I did not see anything unusual. The requests were processed correctly.

On your initial "sample output" there was a response without an EOF flag present, but I did not see anything like that on the logs from yesterday. 

Maybe your continued tests will show this again, we'll see.
Comment 14 Warren D. Johnson 2019-02-19 19:56:28 UTC
Stefan,

I noticed this error on the website's server (not proxy):

TCP: time wait bucket table overflow

I'm seeing a lot of those.  I'm not an expert but according to a reply on ServerFault
https://serverfault.com/questions/449744/a-lot-of-tcp-time-wait-bucket-table-overflow-in-centos-6

"Try the following commands to determine if you have a lot of connections coming from one address or if you are under a distributed attack." 

So it makes me wonder whether, prior to the latest updates to mod_proxy_http2, the reverse proxy server was opening a lot of connections over and over to the client's website?  Just a thought.

More later.

Thanks!
Comment 15 Warren D. Johnson 2019-02-19 22:53:21 UTC
Stefan,

Everything looks great.  The reverse proxy does not appear to be losing/wasting resources, and the website is being served properly.  I will provide another update after we have had this in production for some time (a week or so).

Thanks,
Warren
Comment 16 Stefan Eissing 2019-02-20 09:07:29 UTC
Thanks for the update! We can leave this ticket open for a week and see if something comes up or if we are good.
Comment 17 Warren D. Johnson 2019-02-25 03:31:17 UTC
Created attachment 36456 [details]
trace8 level debugging on busier reverse proxy server http2
Comment 18 Warren D. Johnson 2019-02-25 03:32:18 UTC
Created attachment 36457 [details]
trace8 level debugging on website server (not reverse proxy)
Comment 19 Warren D. Johnson 2019-02-25 03:42:02 UTC
Hi Stefan,

We have now been running the reverse proxy with your latest code changes for several days.  We have experienced two out of memory conditions, but I think those had to do with not having enough memory on the server.  I've increased the memory and I feel like it won't be a problem anymore.

I did my initial testing with only one website serviced by this reverse proxy.  Everything worked fine and we did not experience out of memory conditions or segmentation faults.  Feeling confident, I added several more websites behind the reverse proxy and let it run for a day or two.  During that time we saw sporadic but recurring high CPU usage.  On a similar reverse proxy which uses only HTTP/1.1, the load average rarely exceeds 0.5; in fact, it is usually less than 0.2.

You can see the attached cpu_usage.jpg; clearly CPU usage is running really high.  There is not a lot of traffic (maybe a few visitors per second) during these times.

I recorded a trace8 level debug on the reverse proxy.  I also recorded a trace8 level debug on the website's server.  I include those attachments as 

"trace8 level debugging on busier reverse proxy server http2"
and
"trace8 level debugging on website server (not reverse proxy)"

I am not an expert, but when I view those logs I see a lot of "Resource temporarily unavailable" entries.  Also, it looks like the Apache process on the reverse proxy does its job properly but then seems to stall out: it waits about five seconds, doesn't get any response, times out and closes the connection.  Somewhere in this process it is eating up CPU.  I do not mind if the process waits and then times out.  However, the excessive CPU usage is a real problem.  If I stop Apache completely, the load on the server drops to 0.

After the five-second timeout, we're getting these lines:

[Sun Feb 24 22:09:54 2019] [trace1] [pid 17223] h2_session.c(2333): [client 1.2.3.4:56750] h2_session(138,IDLE,0): pre_close

[Sun Feb 24 22:09:54 2019] [debug] [pid 17223] h2_session.c(589): [client 1.2.3.4:56750] AH03068: h2_session(138,IDLE,0): sent FRAME[GOAWAY[error=0, reason='timeout', last_stream=39]], frames=39/52 (r/s)

[Sun Feb 24 22:09:54 2019] [trace2] [pid 17223] h2_conn_io.c(123): [client 1.2.3.4:56750] h2_session(145)-out: heap[24] flush 

[Sun Feb 24 22:09:54 2019] [debug] [pid 17223] h2_session.c(715): [client 1.2.3.4:56750] AH03069: h2_session(138,IDLE,0): sent GOAWAY, err=0, msg=timeout

[Sun Feb 24 22:09:54 2019] [debug] [pid 17223] h2_session.c(1655): [client 1.2.3.4:56750] AH03078: h2_session(138,DONE,0): transit [IDLE] -- local goaway --> [DONE]

[Sun Feb 24 22:09:54 2019] [trace1] [pid 17223] h2_session.c(725): [client 1.2.3.4:56750] h2_session(138,DONE,0): pool_cleanup

[Sun Feb 24 22:09:54 2019] [debug] [pid 17223] h2_session.c(1655): [client 1.2.3.4:56750] AH03078: h2_session(138,CLEANUP,0): transit [DONE] -- pre_close --> [CLEANUP]

[Sun Feb 24 22:09:54 2019] [trace2] [pid 17223] h2_mplx.c(435): [client 1.2.3.4:56750] h2_mplx(138): start release

[Sun Feb 24 22:09:54 2019] [trace1] [pid 17223] h2_ngn_shed.c(144): [client 1.2.3.4:56750] AH03394: h2_ngn_shed(145): abort

[Sun Feb 24 22:09:54 2019] [trace1] [pid 17223] h2_mplx.c(497): [client 1.2.3.4:56750] h2_mplx(138): released


I did a little searching and found someone reporting a similar problem with http2 (though not in a reverse proxy, as far as I can tell) on an older version of Apache.  I'm not sure if it's related, but here is the link:

https://www.redhat.com/archives/sclorg/2017-December/msg00001.html

As always, I appreciate your help!
Comment 20 Warren D. Johnson 2019-02-25 03:43:26 UTC
Created attachment 36458 [details]
High cpu usage
Comment 21 Warren D. Johnson 2019-02-25 04:17:55 UTC
Not sure if it's related, but with trace8 running I am seeing a lot of these lines:

"OpenSSL: I/O error, 5 bytes expected to read on"

I'm seeing that on both the website's server and the reverse proxy.  It seems to be followed by the "Resource temporarily unavailable" error, but I don't know if it's related.

Thanks.
Comment 22 Warren D. Johnson 2019-02-25 05:40:06 UTC
Sorry about all the comments here, but I did a little deeper inspection of the situation.  I tested my http2 setup on the website's server by browsing directly to the website server (in other words, I bypassed the reverse proxy).  I noticed something different in the trace8 logging.  I am seeing this message:

[Mon Feb 25 00:07:44 2019] [debug] [pid 31021] ssl_engine_io.c(1103): [client 172.31.13.84:60614] AH02001: Connection closed to child 66 with standard shutdown (server www.example.com:443)

I do NOT see that message when browsing the site through the reverse proxy.

This message is generated in ssl_engine_io.c at line 1102, in the function ssl_filter_io_shutdown(...).  I am led to believe one of two things is happening:

1) ssl_filter_io_shutdown is not being called when browsing website through reverse proxy.

or

2) We are returning early from the function via the following clause at the beginning of the function:
    if (!ssl) {
        return;
    }

ssl refers to
SSL *ssl = filter_ctx->pssl;

from
ssl_filter_io_shutdown(ssl_filter_ctx_t *filter_ctx, conn_rec *c, int abortive)


In the case where I browse directly to the client website (no reverse proxy), the process appears to be: "Apache and h2 service the request... all requests done, request is closed... wait for KeepAliveTimeout, then h2 terminates the connection".

In the case where I browse the client website through the reverse proxy, the process appears to be: "Apache and h2 service the requests".  There is no acknowledgement of "Connection closed".  Instead we ONLY wait for KeepAliveTimeout and then close the connection.

Meanwhile, the reverse proxy is also waiting for the KeepAliveTimeout (which I think relates to requests from the person doing the web browsing) and then terminates.  h2 on the client website seems to be fine waiting, skipping "Connection closed" and then timing out.  The reverse proxy h2, on the other hand, seems to freak out and use a lot of CPU while it's waiting... maybe some kind of crazy looping going on somewhere?  I can't really see anything in the logs about it.

Thanks!
Comment 23 Warren D. Johnson 2019-02-25 14:16:43 UTC
After watching the reverse proxy for 12 hours, I can see there is still a memory leak.  It's much slower than it was before: the previous leak happened practically instantly, whereas this new one takes a lot longer, and we have 6-10 websites on that reverse proxy.
Comment 24 Stefan Eissing 2019-02-26 14:00:17 UTC
Warren,

thanks for the logs.  Although I could not see directly what is causing your CPU spikes, I did more load tests on my own and decided on a reset.

I just released v1.13.0 on github <https://github.com/icing/mod_h2/releases/tag/v1.13.0> where I simplified the mod_proxy_http2 strategy. 

The former implementation seems to have been a bit too ambitious and sacrificed flow control in the server too much.  That is why you are having trouble with memory, I think.  In my load tests, the memory footprint of the server processes is less than half of what I saw before.

Due to the simpler implementation, I hope we achieve more stability as well.

Please let me know how v1.13.0 performs in your setup. Thanks for the patience.

-Stefan

PS. Besides all this, there is a pair of H2 settings that may be good in your setup.  If your server is basically only contacted via the proxy, you may try the following on the server (not the proxy):

H2TLSWarmUpSize     0
H2TLSCoolDownSecs   0

The defaults are intended for public internet connectivity.  But if your proxy and server enjoy low-latency LAN connections, these settings might serve better.
Comment 25 Warren D. Johnson 2019-02-27 15:26:49 UTC
Hi Stefan,

Your changes worked fine.  I am no longer experiencing memory issues.  I also figured out where the CPU spikes were coming from: the issue was related to excessive use of "SecRuleRemoveById" with mod_security.  I had 500 rules listed for exclusion, many of them duplicates.  Once I de-duplicated that down to about 50 rules, the CPU spikes went away.  I apologize for implying it was your code causing this issue.  I will open a bug report on the CPU issue with the mod_security people.
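
(For anyone hitting the same thing: duplicated exclusions can be spotted with something along these lines; the configuration path is just an example, adjust to wherever your rules live.)

# list SecRuleRemoveById lines that appear more than once, with counts
grep -hr "SecRuleRemoveById" /etc/httpd/conf.d/ | sed 's/^[[:space:]]*//' | sort | uniq -cd | sort -rn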

I would like to wait a week before closing this bug report.

Thanks,
Warren
Comment 26 Alexandre Schaff 2019-02-27 16:56:09 UTC
Hello,

We experienced the exact same issue (or a very similar one) using mod_proxy_http2 in reverse-proxy mode:
   * super fast OOM that could trigger a system kill in 1 minute with low traffic (<3 req/s)
   * high system load
   * httpd-2.4.37 ( and associated mod_proxy_http2 )
   * nghttp2-1.34.0

Using the mod_proxy_http2 and mod_http2 modules built from mod_http2-1.13.0 against the same httpd-2.4.37, after 40 minutes:
   * load avg went down from 8 to 0.1 with 4x the traffic 
   * memory used seems very stable

1K thanks Stefan
Alex.
Comment 27 Warren D. Johnson 2019-02-27 17:00:50 UTC
Sounds like the same problem.  I solved my particular CPU load issue AFTER I built and used the latest version of mod_http2 and mod_proxy_http2.  It's possible I was experiencing the CPU load issue as a result of both excessive mod_security SecRuleRemoveById statements AND mod_http2 issues.  Either way, I appreciate the effort Stefan has put into fixing the issue.
Comment 28 Stefan Eissing 2019-02-28 08:54:28 UTC
Thanks guys for testing this, happy to get this working.

Let's leave this open for a week. 

I might also do some more cleanup of the code in the coming days and release a v1.13.1 for you to verify, if you'd help me with this. I'll let you know here.
Comment 29 Stefan Eissing 2019-03-01 12:15:26 UTC
New release v1.13.1 <https://github.com/icing/mod_h2/releases/tag/v1.13.1>.  Code cleanup, no functional change; please test at your convenience.
Comment 30 Alexandre Schaff 2019-03-11 11:02:38 UTC
Hello,

The modules from mod_http2-1.13.0 are very stable (CPU/memory), running very well since 2019-02-27.
We have not used the more recent updates yet.
Thanks again Stefan.
Comment 31 Warren D. Johnson 2019-03-12 15:14:57 UTC
Stefan,

I have not had a chance to test 1.13.1.  However, the previous version is still working well.  I will reply back to the ticket if I have any issues in the future.  Otherwise, thanks for your help!

Warren
Comment 32 Stefan Eissing 2019-03-12 15:26:18 UTC
Thanks a lot Warren. Will close this now.

Fixed in r1854963, proposed for backport to branches/2.4.x
Comment 33 Stefan Eissing 2019-03-13 15:01:43 UTC
Backported to 2.4.x in r1855431 for release 2.4.39.