Bug 61818 - OCSP "SSLUseStapling on" completely blocking the server when something is off with the responder
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_ssl
Version: 2.4.29
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
Reported: 2017-11-26 12:57 UTC by Raffaele Sandrini
Modified: 2020-10-12 09:17 UTC

Attachments
Effect on workers & connections (203.14 KB, image/png)
2020-03-02 17:57 UTC, tomasz.konefal
Report errors on unreachable ocsp responder addresses (740 bytes, patch)
2020-10-10 09:54 UTC, Michael Scholl

Description Raffaele Sandrini 2017-11-26 12:57:08 UTC
This will be a somewhat fuzzy issue because I don't have much data. Please accept my apologies for that.

Today our production site went offline because it was impossible to connect to it using TLS. The httpd error log just showed this error: 

AH01941: stapling_renew_response: responder error

without any supporting information. There was no indication that a name could not be resolved or that an IP could not be reached.

The server is using the event MPM and pretty quickly all slots were in status "R" and the server reported:

AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
and
AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

Hence, the site was offline.

Our stapling configuration:

SSLUseStapling on
SSLStaplingResponderTimeout 5
SSLStaplingReturnResponderErrors off
SSLStaplingCache shmcb:/var/run/ocsp(128000)

I am not an expert, but from this configuration and the supporting documentation I conclude that this situation should never have happened. Even with the OCSP server unavailable, httpd should simply have continued without "stapling" the response.

Hence, this bug report.

Note 1: The certificate in question is issued by GoDaddy EV CA and I could personally not confirm any issue with their OCSP service.

Note 2: At the same time, vhosts using Let's Encrypt certificates still worked with stapling enabled, which suggests that something was wrong on the GoDaddy side. However, as stated above, the error log did not indicate anything.
Comment 1 Christophe JAILLET 2017-11-26 15:20:51 UTC
This is odd.
All paths that lead to this error (AH01941) seem to have some additional information logged at APLOG_ERR level.
Comment 2 Raffaele Sandrini 2017-11-26 16:50:53 UTC
I just rescanned the log files, vhost specific and server log file, and I could not find any other related messages than the ones mentioned above (AH01941, AH00484 and AH03490).

Also to add, I restarted Apache several times and consistently got into that state until I eventually disabled OCSP stapling (setting "SSLUseStapling off").
Comment 3 tomasz.konefal 2020-03-02 17:57:01 UTC
Created attachment 37055
Effect on workers & connections
Comment 4 tomasz.konefal 2020-03-02 18:03:48 UTC
One of our hosted sites has a certificate with crl.usertrust.com (151.139.128.14) as a CRL Distribution point and ocsp.usertrust.com (151.139.128.14) for OCSP in the Authority Information Access field.

We are able to reproduce symptoms like this when the above IP is blocked outbound from the web server.

Please see the above attached image indicating what happens to the worker threads and connection count when the block is enabled (~09h42) and later disabled (~09h52).

Unfortunately, there are no meaningful logs to go along with this.
Comment 5 Michael Scholl 2020-10-10 09:54:46 UTC
Created attachment 37492
Report errors on unreachable ocsp responder addresses

We hit this issue yesterday, and it took us a long time to figure out that stapling was the cause. I have attached a patch that makes connection problems with OCSP responder addresses easier to identify.

The problem is that workers have no timeout on how long they wait in the queue to make an OCSP request. There should be something like an SSLStaplingQueueTimeout option.

Maybe it would also be good if the server remembered responder addresses that were unreachable and skipped them for some time. This would speed up OCSP handling when a responder has problems.

Our current solution is to set the following options:

SSLStaplingResponderTimeout 1
SSLStaplingStandardCacheTimeout 86400

This works for us, but for servers with thousands of certificates this could still be a problem.
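
For reference, the two settings above combined with the stapling configuration from the original description might look like the following. This is only a sketch: the cache path and size are copied from the description and are assumptions for any other setup, and the defaults mentioned in the comments are those given in the mod_ssl documentation.

```apache
# SSLStaplingCache belongs in the global server config, outside any <VirtualHost>.
# Path and size are taken from the original description (assumption for other setups).
SSLStaplingCache shmcb:/var/run/ocsp(128000)

SSLUseStapling on
# Give up on the OCSP responder after 1 second (documented default: 10).
SSLStaplingResponderTimeout 1
# Keep valid OCSP responses in the cache for 24 hours (documented default: 3600).
SSLStaplingStandardCacheTimeout 86400
# Only staple responses that report the certificate status "good".
SSLStaplingReturnResponderErrors off
```

A shorter responder timeout limits how long a lookup can stall, and a longer cache timeout reduces how often a lookup is needed; neither adds a limit on how long workers wait for a lookup that is already in progress.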
Comment 6 Stefan Eissing 2020-10-12 09:17:12 UTC
If you have a recent Apache httpd (2.4.42 or newer), there is an alternative OCSP stapling implementation in the 'mod_md' module. This implementation updates the OCSP status on a schedule, independent of client connections, and has additional monitoring/notification options.

I wrote some how-tos and details here: <https://github.com/icing/mod_md#how-to-staple-all-my-certificates>
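
As a rough illustration of that alternative, a minimal configuration might look like the following. It assumes httpd 2.4.42 or newer with mod_md loaded. The directive names come from the mod_md documentation, but the fragment is a sketch and has not been tested against the setups described in this report.

```apache
# Let mod_md fetch and staple OCSP responses on its own schedule.
MDStapling on
# Also staple certificates that are not managed by mod_md
# (this is the documented default once MDStapling is on).
MDStapleOthers on
```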