Bug 65860 - Revoked certificate blocks httpd start
Summary: Revoked certificate blocks httpd start
Status: NEEDINFO
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_ssl
Version: 2.4.37
Hardware: PC Linux
Importance: P2 blocker
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL: https://www.approach.be
Reported: 2022-01-28 14:05 UTC by Marc Stern
Modified: 2022-02-07 11:58 UTC



Description Marc Stern 2022-01-28 14:05:00 UTC
We had production servers that failed to start because a certificate was revoked (obviously today's Let's Encrypt problem). The error message logged, AH02565, is misleading (it would be worth fixing if possible).
An invalid certificate should not block the whole server. Because of one invalid vhost, several hundred sites are unavailable.
"httpd -t" reports the syntax as OK, so nothing warns us before a (graceful) reload stops the service.
Comment 1 Stefan Eissing 2022-01-28 17:08:26 UTC
AH02565 is logged when a certificate and its key do not match. This has nothing to do with revocation; it points to a misconfiguration, for example copying over a renewed certificate without also copying the corresponding key.
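
For anyone who wants to test a certificate/key pair outside of httpd, here is a minimal sketch using only Python's standard "ssl" module. It is not what mod_ssl runs internally; it relies on the documented behaviour that SSLContext.load_cert_chain() raises ssl.SSLError when the private key does not match the certificate. The function name check_pair is made up for this example, and it assumes an unencrypted key (an encrypted one would prompt for a passphrase).

```python
import ssl
from typing import Optional


def check_pair(cert_path: str, key_path: str) -> Optional[str]:
    """Return None if the certificate and key load together, else a reason.

    ssl.SSLError is a subclass of OSError, so the single except clause
    covers a mismatching key, unparseable PEM data and unreadable files.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except OSError as exc:
        return f"{type(exc).__name__}: {exc}"
    return None
```

Running something like this over each configured pair before a graceful reload might catch the case that "httpd -t" reports as OK.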

One could argue whether such a situation should stop the server from reloading, or whether only the one affected site should become inoperable.

The check has been there since 2014, so this is not new behaviour. Are you sure that this is not what happened in today's scramble to correct Let's Encrypt configurations?
Comment 2 Marc Stern 2022-01-28 18:04:39 UTC
It was definitely triggered by the Let's Encrypt certificate revocation problem.
We're using mod_md. It is possible that mod_md did only half of the job because of that problem.
Comment 3 Stefan Eissing 2022-02-07 11:11:38 UTC
This will be hard to analyze. Let me explain:

When a certificate for xxx.com is renewed:

- $server_root/md/domains/xxx.com contains the working certs
- $server_root/md/staging/xxx.com contains all about the renewal

If the server reloads, it checks "staging/*" for complete file sets.
When that check succeeds, it

- *creates* "tmp/xxx.com" and *copies* the staged files into it. The copy
  actually parses the key and certificates and serializes them to PEM again
- it *moves* the whole dir "domains/xxx.com" to "archive/xxx.com.N"
  to preserve the old file set
- then it *moves* "tmp/xxx.com" to "domains/xxx.com".
- then it *deletes* "staging/xxx.com"
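
To make the order of operations concrete, the sequence above can be sketched roughly as follows. This is an illustration in Python, not mod_md's code: the function names are invented, the copy step only copies bytes (it does not parse and re-serialize the PEM data as mod_md does), and picking the first unused N for the archive name is an assumption.

```python
import os
import shutil
from typing import Optional


def next_archive_dir(archive_root: str, domain: str) -> str:
    """Return the first unused "archive/<domain>.N" path (N starts at 1)."""
    n = 1
    while os.path.exists(os.path.join(archive_root, f"{domain}.{n}")):
        n += 1
    return os.path.join(archive_root, f"{domain}.{n}")


def activate_staging(md_root: str, domain: str) -> Optional[str]:
    """Activate a staged file set; return the archive path of the old set."""
    staging = os.path.join(md_root, "staging", domain)
    tmp = os.path.join(md_root, "tmp", domain)
    live = os.path.join(md_root, "domains", domain)
    archive_root = os.path.join(md_root, "archive")

    if not os.path.isdir(staging):
        return None  # nothing staged, nothing to do

    # 1. Build the new set in tmp/ so the live directory is never half-written.
    if os.path.exists(tmp):
        shutil.rmtree(tmp)
    os.makedirs(os.path.dirname(tmp), exist_ok=True)
    shutil.copytree(staging, tmp)

    # 2. Move the whole live directory into the archive to preserve it.
    archived = None
    if os.path.isdir(live):
        os.makedirs(archive_root, exist_ok=True)
        archived = next_archive_dir(archive_root, domain)
        os.rename(live, archived)

    # 3. Move the complete new set into place.
    os.makedirs(os.path.dirname(live), exist_ok=True)
    os.rename(tmp, live)

    # 4. Only now remove the staging area.
    shutil.rmtree(staging)
    return archived
```

Because every step moves or replaces a whole directory, an interruption leaves either the old set or the new set in "domains/xxx.com", never a mix of the two.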

This is all done so that no interruption will produce a "half-updated"
set of files where things do not match.

In Apache httpd 2.4.49 the test for matching key and certificate was
added during activation of a staging area to make sure mod_md never
activates a set of files that do not match.

You see, considerable thought has gone into avoiding the thing
you experienced. Especially with 2.4.49 or newer, the server should
never load a cert+key that do not match, even if something was messed
up in the "staging" subdir.

Any thoughts? Otherwise I think we need to close this as not reproducible.
Comment 4 Marc Stern 2022-02-07 11:45:27 UTC
Problem was in 2.4.37.
Are you confident that it cannot happen in 2.4.49+ anymore?
Here we are talking about a certificate that was fully loaded and then revoked.
Comment 5 Stefan Eissing 2022-02-07 11:58:41 UTC
On the premise that nothing else is touching the file system, e.g. a job that distributes certificates among nodes or some other production job, it should never have happened in the first place.

In 2.4.49, an additional sanity check was added so that "staging" file sets are not activated if cert+key do not match. That assumes this is what affected your site.

You can check in "md/archive/xxx.com*" whether there is an archived set from the time the problem occurred. That would indicate that a renewal was made and that the faulty set of files came from there.

If there is no archived set from that time, then this did not happen on a renewal. In that case the files in "md/domains/xxx.com" that were working got changed. Since mod_md only replaces the directory as a whole and has no read/write access there when handling traffic, this strongly hints at an outside agent that messed with the files.
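
The archive check described above could be scripted along these lines. This is only a sketch: the function name is invented, and the directory modification time is used as a rough hint, since it records when the directory's entries last changed and is not necessarily the moment the set was archived.

```python
import glob
import os
from datetime import datetime, timezone
from typing import List, Tuple


def archived_sets(md_root: str, domain: str) -> List[Tuple[datetime, str]]:
    """List "archive/<domain>*" directories with their mtime (UTC), oldest first."""
    archive_root = glob.escape(os.path.join(md_root, "archive"))
    pattern = os.path.join(archive_root, glob.escape(domain) + "*")
    found = []
    for path in glob.glob(pattern):
        if os.path.isdir(path):
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            found.append((mtime, path))
    return sorted(found)
```

An empty result, or no entry anywhere near the time of the failure, would point away from a renewal and towards something else having changed the files.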