Bug 61558

Summary: Apache 2.4.27 crashes often when restarting the service. segfault in libdl-2.24.so or apache2
Product: Apache httpd-2
Reporter: Rolf <pantaluna>
Component: All
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 2.4.29   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Attachments: apache2ctl-t-V-M.txt
apport.log
apt-show-versions.txt
lsb_release.txt
syslog-filtered
gdb-output.txt
_usr_sbin_apache2.0-without-coredump.crash.txt
CoreDump
Make ap_mpm_query() restart safe on unixes (for 2.4.x)
Make ap_mpm_query() restart safe on unixes (v2, for 2.4.x)
Deregister hooks before exiting
test-patch-35376-syslog.txt
test-patch-35376-gdb.txt
Signals and hooks cleanup on exit (2.4.x)
Signals and hooks cleanup on exit (v2, 2.4.x)
Signals and hooks cleanup on exit (v3, 2.4.x)

Description Rolf 2017-09-22 13:09:04 UTC
Created attachment 35362 [details]
apache2ctl-t-V-M.txt

Hi,
I have been struggling for quite some time now on a dedicated dev server with Apache "sometimes" crashing
when the Apache service is started for the first time after a server boot, or when the Apache service is restarted.

It became clear after a while that it crashes at least once in roughly every 25 restarts of the Apache service (see below for the test script).

Apache recovers from it somehow. I "think" that an MPM worker child process crashes, and that the main Apache process then starts another one, which always starts correctly.

Now is a good time to report it because I would like to upgrade some production servers from Ubuntu 14.04.5 LTS trusty to Ubuntu Artful V17.10 in a few months time.

* Environment:
- O.S.: Ubuntu 17.04 Zesty (Desktop edition)
- Ubuntu Package: apache2-bin:amd64/zesty 2.4.27-5.1+ubuntu17.04.1+deb.sury.org+1

* Notes:
- The Apache dbg packages have been installed. `apt --yes install apache2-dbg libapr1-dbg libaprutil1-dbg gdb`
- This is a stable bare-metal server that is online 24x7 with no other problems.
- I have removed the modules mod_security2 and mod_spamhaus, but it did not help.
- No Apache sites are currently enabled, but it did not help.
- I tried to reproduce it in a VM with a fresh install of Ubuntu Zesty and Apache 2.4.27 but then the crash never happens.

* Reproducible test script 
#Bash
#@doc The server will crash at least once every +-10 minutes when this script is running.
while true
do
    systemctl stop apache2; systemctl start apache2; journalctl --lines=50 --unit=apache2; sleep 10;
done


* The crash details in the syslog vary as follows: "error 4 in apache2" or "error 14 in libdl-2.24.so".
    Sep 22 11:44:04 s3black kernel: [36459.658587] apache2[12579]: segfault at 7f47d733a8c0 ip 0000559c2b73fdd4 sp 00007ffeaded5fd0 error 4 in apache2[559c2b703000+9e000]
    Sep 22 11:44:04 s3black systemd[1]: apache2.service: Main process exited, code=dumped, status=11/SEGV
    
    Sep 22 11:48:06 s3black kernel: [36701.689499] apache2[14568]: segfault at 7f06524308c0 ip 00005578dc332dd4 sp 00007ffdd48f9210 error 4 in apache2[5578dc2f6000+9e000]
    Sep 22 11:48:06 s3black systemd[1]: apache2.service: Main process exited, code=dumped, status=11/SEGV
    
    Sep 22 13:19:43 s3black kernel: [42198.677611] apache2[1417]: segfault at 7feb5dd26ad0 ip 00007feb5dd26ad0 sp 00007ffd97212488 error 14 in libdl-2.24.so[7feb607b0000+3000]
    Sep 22 13:19:43 s3black systemd[1]: apache2.service: Main process exited, code=dumped, status=11/SEGV
    
    Sep 22 13:41:57 s3black kernel: [43532.685282] apache2[12965]: segfault at 7f3079b9dad0 ip 00007f3079b9dad0 sp 00007ffe3b47be88 error 14 in libdl-2.24.so[7f307c627000+3000]
    
    Sep 22 13:49:28 s3black kernel: [43983.678591] apache2[16930]: segfault at 7f0f0387d8c0 ip 000056527483fdd4 sp 00007ffd4282ba10 error 4 in apache2[565274803000+9e000]
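
A note on reading these lines: as far as I know, the x86 kernel prints the page-fault error code in hexadecimal, so "error 14" would be 0x14 rather than decimal 14. The sketch below decodes the standard x86 error-code bits and checks whether a faulting ip lies inside the mapping named in the message. The helper names are made up for illustration and are not part of any kernel or httpd API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* x86 page-fault error-code bits. */
#define PF_PROT  0x01u  /* 1: protection violation, 0: page not present */
#define PF_WRITE 0x02u  /* 1: write access, 0: read access */
#define PF_USER  0x04u  /* 1: fault raised in user mode */
#define PF_RSVD  0x08u  /* 1: reserved bit set in a page-table entry */
#define PF_INSTR 0x10u  /* 1: fault on an instruction fetch */

/* Hypothetical helper: write a short description of `code` into `buf`. */
static void describe_pf_error(unsigned code, char *buf, size_t len)
{
    const char *access = (code & PF_INSTR) ? "instruction fetch"
                       : (code & PF_WRITE) ? "write" : "read";
    snprintf(buf, len, "%s-mode %s, %s",
             (code & PF_USER) ? "user" : "kernel",
             access,
             (code & PF_PROT) ? "protection violation" : "page not present");
}

/* Hypothetical helper: true if `ip` lies inside [base, base + size). */
static bool ip_in_mapping(uint64_t ip, uint64_t base, uint64_t size)
{
    return ip >= base && ip - base < size;
}
```

Read this way, 0x4 is a user-mode read of a page that is not present, and 0x14 is a user-mode instruction fetch from a page that is not present. In the "error 4" lines the ip falls inside the apache2 mapping (offset 0x3cdd4 in all three). In the "error 14" lines the faulting address equals the ip and lies below the libdl range printed next to it, which would be consistent with a jump into code that is no longer mapped.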

* Crash report script:
#Bash
cd /var/crash/
apport-retrace --rebuild-package-info --sandbox=system  _usr_sbin_apache2.0.crash
rm --verbose -rf unpacked
apport-unpack _usr_sbin_apache2.0.crash unpacked
cd unpacked
# gdb: get info about the backtrace
echo "thread apply all bt full" | gdb /usr/sbin/apache2 CoreDump


* The top of the thread stacktrace
#0  0x00007feb5dd26ad0 in ?? ()
#1  0x00005596bd360e0e in ap_run_mpm_query (query_code=query_code@entry=2, result=result@entry=0x7ffd972124ec, _rv=_rv@entry=0x7ffd972124c4) at mpm_common.c:97
   92:                           (apr_pool_t * pchild, server_rec * s),
   93:                           (pchild, s), OK, DECLINED)
   94: AP_IMPLEMENT_HOOK_RUN_FIRST(int, mpm,
   95:                             (apr_pool_t *pconf, apr_pool_t *plog, server_rec *s),
   96:                             (pconf, plog, s), DECLINED)
   97: AP_IMPLEMENT_HOOK_RUN_FIRST(int, mpm_query,
   98:                             (int query_code, int *result, apr_status_t *_rv),
   99:                             (query_code, result, _rv), DECLINED)
  100: AP_IMPLEMENT_HOOK_RUN_FIRST(apr_status_t, mpm_register_timed_callback,
  101:                             (apr_time_t t, ap_mpm_callback_fn_t *cbfn, void *baton),
  102:                             (t, cbfn, baton), APR_ENOTIMPL)
#2  0x00005596bd361b9e in ap_mpm_query (query_code=query_code@entry=2, result=result@entry=0x7ffd972124ec) at mpm_common.c:419
  414: 
  415: AP_DECLARE(apr_status_t) ap_mpm_query(int query_code, int *result)
  416: {
  417:     apr_status_t rv;
  418: 
  419:     if (ap_run_mpm_query(query_code, result, &rv) == DECLINED) {
  420:         rv = APR_EGENERAL;
  421:     }
  422: 
  423:     return rv;
  424: }


* Attachments. I have collected as much info as possible.

Thanks for your help to find the cause of this crash. If you need more information then do not hesitate to ask.
--
Rolf.
Comment 1 Rolf 2017-09-22 13:09:39 UTC
Created attachment 35363 [details]
apport.log
Comment 2 Rolf 2017-09-22 13:09:55 UTC
Created attachment 35364 [details]
apt-show-versions.txt
Comment 3 Rolf 2017-09-22 13:10:17 UTC
Created attachment 35365 [details]
lsb_release.txt
Comment 4 Rolf 2017-09-22 13:10:33 UTC
Created attachment 35366 [details]
syslog-filtered
Comment 5 Rolf 2017-09-22 13:10:55 UTC
Created attachment 35367 [details]
gdb-output.txt
Comment 6 Rolf 2017-09-22 13:11:13 UTC
Created attachment 35368 [details]
_usr_sbin_apache2.0-without-coredump.crash.txt
Comment 7 Rolf 2017-09-22 13:12:14 UTC
Created attachment 35369 [details]
CoreDump
Comment 8 Rolf 2017-09-23 12:59:31 UTC
I would like to add that changing from mpm_worker to mpm_event did not fix it.
Comment 9 Rolf 2017-09-23 13:42:24 UTC
I reported that the crash occurs when restarting the Apache service.

I would like to add more specifically that the crash occurs when stopping the Apache service (as opposed to when starting the Apache service).

````
Sep 23 12:25:53 s3black systemd[1]: Starting The Apache HTTP Server...
Sep 23 12:25:54 s3black systemd[1]: Started The Apache HTTP Server.
Sep 23 12:26:01 s3black systemd[1]: Stopping The Apache HTTP Server...
Sep 23 12:26:01 s3black kernel: [ 4356.696241] apache2[18681]: segfault at 7f6d1ccf8ad0 ip 00007f6d1ccf8ad0 sp 00007ffffd359ac8 error 14 in libdl-2.24.so[7f6d1f782000+3000]
Sep 23 12:26:01 s3black systemd[1]: apache2.service: Main process exited, code=dumped, status=11/SEGV
Sep 23 12:26:01 s3black systemd[1]: Stopped The Apache HTTP Server.
Sep 23 12:26:01 s3black systemd[1]: apache2.service: Unit entered failed state.
Sep 23 12:26:01 s3black systemd[1]: apache2.service: Failed with result 'core-dump'.
Sep 23 12:26:01 s3black systemd[1]: apparmor.service: Cannot add dependency job, ignoring: Unit apparmor.service is masked.
Sep 23 12:26:01 s3black systemd[1]: Starting The Apache HTTP Server...
Sep 23 12:26:02 s3black systemd[1]: Started The Apache HTTP Server.
Sep 23 12:26:09 s3black systemd[1]: Stopping The Apache HTTP Server...
Sep 23 12:26:09 s3black systemd[1]: Stopped The Apache HTTP Server.
````
Comment 10 Yann Ylavic 2017-09-25 11:08:53 UTC
Created attachment 35372 [details]
Make ap_mpm_query() restart safe on unixes (for 2.4.x)

It's unclear to me whether the gdb output shows a double fault or not.

One fault is ap_mpm_query() called after the MPM is unloaded (which this patch addresses), but it looks like a first segfault was issued before (I may be confused by the signal handler in the stack trace though).

Anyway, to clarify things, could you please try this patch?
Comment 11 Rolf 2017-09-25 16:12:11 UTC
Thanks for the patch.

1. I have installed (not built) Apache2 from the PPA https://launchpad.net/~ondrej/+archive/ubuntu/apache2/+packages 
which results in the Ubuntu Package: apache2-bin:amd64/zesty 2.4.27-5.1+ubuntu17.04.1+deb.sury.org+1 (read the attachments for more version info).
And unfortunately I do not know how to rebuild the version from the PPA.

2. Nonetheless I have rebuilt Apache v2.4.27 from source on another virtual machine as described at http://httpd.apache.org/docs/current/install.html using the source download at http://apache.belnet.be//httpd/httpd-2.4.27.tar.gz
-> But the patch file does not match.
-> Also I assume that the binary result will never be the same as the one in the PPA (configuration wise) so it would not result in a valid test binary.

Sorry for the trouble. How do you want to proceed? Do we have to contact the owner of the PPA for this?
Comment 12 Rolf 2017-09-25 17:44:10 UTC
Yann Ylavic FYI I have created an issue in the Github repo of the PPA at https://github.com/oerdnj/deb.sury.org/issues/707

 We will continue over there for the moment.
Comment 13 Yann Ylavic 2017-09-27 12:14:28 UTC
Created attachment 35375 [details]
Make ap_mpm_query() restart safe on unixes (v2, for 2.4.x)

The previous version was incorrect because it used static initialization in ap_unixd_hook_mpm_query(); that is now replaced with pconf userdata.

Regarding the PPA (which I know very little about), it looks like you could download the apache2 source package ("apt-get source apache2" ?), integrate the patch with the others (likely), and rebuild it ("debuild -S" ?). Just a hypothesis...

I don't know which exact version of httpd the PPA is based on, but this patch applies cleanly to httpd 2.4.27 (and 2.4.28) sources provided by Apache, you could build from those at least.
Comment 14 Joe Orton 2017-09-27 12:39:59 UTC
If the problem is that a registered mpm_query hook is a function in a now-unloaded DSO then I don't think that keeping a copy of the function pointer itself (not the function!) is going to help?

If we treat the root cause of the crash here as unrelated (looks like memory corruption with a crash in apr_pool_destroy?), then fixing the crash in logging during apr_destroy_and_exit_process() due to MPM cleanup should be fixable by calling apr_hook_deregister_all() at the start of that function?
Comment 15 Joe Orton 2017-09-27 12:43:22 UTC
Sorry - I meant server/main.c:destroy_and_exit_process().
Comment 16 Yann Ylavic 2017-09-27 13:01:36 UTC
(In reply to Joe Orton from comment #14)
> If the problem is that a registered mpm_query hook is a function in a
> now-unloaded DSO then I don't think that keeping a copy of the function
> pointer itself (not the function!) is going to help?

The pointer is cleared (NULL) with pconf, after which ap_unixd_hook_mpm_query() will return DECLINED instead of calling the registered (and now unloaded) mpm function.
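
The idea can be illustrated without APR. This is a simplified, self-contained sketch and not the attached patch: `toy_pool`, `guarded_mpm_query` and the other names are hypothetical stand-ins, whereas the real patch keeps the pointer as pconf userdata.

```c
#include <stddef.h>

#define OK        0
#define DECLINED (-1)

typedef int (*mpm_query_fn)(int query_code, int *result);

/* Toy stand-in for a pool: one userdata slot, wiped when the pool is cleared. */
struct toy_pool {
    mpm_query_fn mpm_query;   /* NULL once the pool has been cleared */
};

static void toy_pool_set_query(struct toy_pool *p, mpm_query_fn fn)
{
    p->mpm_query = fn;
}

/* Clearing the pool drops the pointer, since the MPM's code may be unloaded next. */
static void toy_pool_clear(struct toy_pool *p)
{
    p->mpm_query = NULL;
}

/* Wrapper: only call through while the pool still holds the pointer. */
static int guarded_mpm_query(const struct toy_pool *p, int query_code, int *result)
{
    if (p->mpm_query == NULL) {
        return DECLINED;      /* MPM gone: decline instead of calling unmapped code */
    }
    return p->mpm_query(query_code, result);
}

/* Example MPM callback, for illustration only. */
static int sample_mpm_query(int query_code, int *result)
{
    *result = query_code * 10;
    return OK;
}
```

Before registration and after `toy_pool_clear()` the wrapper declines; in between it forwards to the registered function.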

> 
> If we treat the root cause of the crash here as unrelated (looks like memory
> corruption with a crash in apr_pool_destroy?), then fixing the crash in
> logging during apr_destroy_and_exit_process() due to MPM cleanup should be
> fixable by calling apr_hook_deregister_all() at the start of that function?

It's unclear to me whether that's a double fault or not, likely yes.
Maybe a MPM cleanup run while unloaded? Didn't find a potential one registered...

Early apr_hook_deregister_all() looks like a simpler (and wider) solution, but don't/can't we have legitimate cleanups (i.e. run in time and/or with no scope issue) that'd run hooks? If not, I'm all for this!
Comment 17 Yann Ylavic 2017-09-27 13:06:45 UTC
(In reply to Yann Ylavic from comment #16)
> 
> Early apr_hook_deregister_all() looks like a simpler (and wider) solution,
> but don't/can't we have legitimate cleanups (i.e. run in time and/or with no
> scope issue) that'd run hooks? If not, I'm all for this!

OTOH we already do this when (re)starting before clearing pconf, so such cleanups are unlikely to have worked in the first place. So looks good to me.
Comment 18 Yann Ylavic 2017-09-27 13:14:56 UTC
Created attachment 35376 [details]
Deregister hooks before exiting

Simpler patch, per above comments.
Comment 19 Joe Orton 2017-09-27 13:37:58 UTC
(In reply to Yann Ylavic from comment #16)
> The pointer is cleared (NULL) with pconf, after which
> ap_unixd_hook_mpm_query() will return DECLINED instead of calling the
> registered (and now unloaded) mpm function.

Ah, I see.  Yes, that should work then too.

> Early apr_hook_deregister_all() looks like a simpler (and wider) solution,
> but don't/can't we have legitimate cleanups (i.e. run in time and/or with no
> scope issue) that'd run hooks? If not, I'm all for this!

Since we specifically tie hooks to the pconf lifetime it should be safe, though to be correct it should perhaps be done as a registered cleanup.
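
A rough, self-contained illustration of tying hook deregistration to a pool's lifetime follows. It does not use APR: `mini_pool`, `toy_hook_add`, `toy_hook_deregister_all` and `toy_run_first` are hypothetical names that only mimic the shape of the real APIs, and the fixed-size arrays are an assumption made to keep the sketch short.

```c
#include <stddef.h>

#define OK        0
#define DECLINED (-1)
#define MAX_HOOKS    8
#define MAX_CLEANUPS 8

typedef int (*hook_fn)(int *result);
typedef void (*cleanup_fn)(void);

/* Toy global hook list (stand-in for the real hook registry). */
static hook_fn hooks[MAX_HOOKS];
static size_t  nhooks;

static int toy_hook_add(hook_fn fn)
{
    if (nhooks == MAX_HOOKS) return -1;
    hooks[nhooks++] = fn;
    return 0;
}

/* Forget every registered hook so no stale function pointer can be called. */
static void toy_hook_deregister_all(void)
{
    nhooks = 0;
}

/* RUN_FIRST semantics: the first hook that does not decline wins. */
static int toy_run_first(int *result)
{
    for (size_t i = 0; i < nhooks; i++) {
        int rv = hooks[i](result);
        if (rv != DECLINED) return rv;
    }
    return DECLINED;
}

/* Toy pool: cleanups run in reverse registration order on destroy. */
struct mini_pool {
    cleanup_fn cleanups[MAX_CLEANUPS];
    size_t     ncleanups;
};

static int mini_pool_cleanup_register(struct mini_pool *p, cleanup_fn fn)
{
    if (p->ncleanups == MAX_CLEANUPS) return -1;
    p->cleanups[p->ncleanups++] = fn;
    return 0;
}

static void mini_pool_destroy(struct mini_pool *p)
{
    while (p->ncleanups > 0) {
        p->cleanups[--p->ncleanups]();
    }
}

/* Example hook, for illustration only. */
static int sample_hook(int *result)
{
    *result = 42;
    return OK;
}
```

With `toy_hook_deregister_all` registered as a cleanup on the pool, anything that runs hooks after `mini_pool_destroy()` gets DECLINED rather than a call through a pointer that may no longer be valid.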
Comment 20 Joe Orton 2017-09-27 13:38:19 UTC
(In reply to Yann Ylavic from comment #18)
> Created attachment 35376 [details]
> Deregister hooks before exiting
> 
> Simpler patch, per above comments.

+1 for trunk, +1 for 2.4.
Comment 21 Yann Ylavic 2017-09-27 16:30:03 UTC
(In reply to Joe Orton from comment #20)
> (In reply to Yann Ylavic from comment #18)
> > Created attachment 35376 [details]
> > Deregister hooks before exiting
> > 
> > Simpler patch, per above comments.
> 
> +1 for trunk, +1 for 2.4.

Committed the "registered cleanup" suggestion in r1809881.
Comment 22 Rolf 2017-09-27 22:38:17 UTC
(In reply to Yann Ylavic from comment #18)
> Created attachment 35376 [details]
> Deregister hooks before exiting
> 
> Simpler patch, per above comments.
Thanks!

I have rebuilt the PPA package(s) from source with the patch 35376, and did a quick redeploy & retest on the same server as before.

The apache2 binary still crashes, as infrequently as before: roughly every 5 minutes in the test script. The stack trace is different from the original one.

See these attachments for more details:
- test-patch-35376-syslog.txt
- test-patch-35376-gdb.txt
Comment 23 Rolf 2017-09-27 22:39:42 UTC
Created attachment 35377 [details]
test-patch-35376-syslog.txt
Comment 24 Rolf 2017-09-27 22:40:05 UTC
Created attachment 35378 [details]
test-patch-35376-gdb.txt
Comment 25 Ruediger Pluem 2017-09-28 06:46:39 UTC
(In reply to Rolf from comment #24)
> Created attachment 35378 [details]
> test-patch-35376-gdb.txt

Looks like the apr file data structure is already destroyed. This can happen in cases where the parent process crashes. I would leave that as is. If we still have the apr file data structure we could log something sensible; if not, well, we are going to crash anyway.

But the root cause for your crash is something else anyway and the patches applied so far should only improve the handling of the original crash.
Comment 26 Yann Ylavic 2017-09-28 11:22:59 UTC
Created attachment 35379 [details]
Signals and hooks cleanup on exit (2.4.x)

This patch is r1809881 + r1809973 for 2.4.x (should apply to 2.4.27 too).

attachment 35378 [details] shows sig_term() called in apr_terminate(), too late (ap_pglobal is dead)...
So it adds some cleanups to restore signals and hooks where needed.
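
As a generic POSIX illustration of restoring a signal disposition from a cleanup (this is not the code in r1809881 or r1809973), the sketch below installs a SIGTERM handler and later puts back whatever was in place before. The function names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for sigaction() under strict -std=c99/c11 */
#endif
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t term_seen;   /* set by the handler */
static struct sigaction saved_term;       /* disposition in place before install */

static void on_term(int sig)
{
    (void)sig;
    term_seen = 1;
}

/* Install the SIGTERM handler, remembering the previous disposition. */
static int install_term_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, &saved_term);
}

/* Cleanup: restore the saved disposition so the handler cannot run
 * after the state it relies on has been torn down. */
static int restore_term_handler(void)
{
    return sigaction(SIGTERM, &saved_term, NULL);
}

/* Helper for inspection: return the currently installed SIGTERM handler. */
static void (*current_term_handler(void))(int)
{
    struct sigaction cur;
    if (sigaction(SIGTERM, NULL, &cur) != 0) return SIG_ERR;
    return cur.sa_handler;
}
```

Calling `restore_term_handler()` from a cleanup that runs before the global state is destroyed means a late SIGTERM is handled by the previous disposition instead of by a handler that touches freed data.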

Nic, could you test this one please?
Comment 27 Yann Ylavic 2017-09-28 11:25:14 UTC
(In reply to Yann Ylavic from comment #26)
> 
> Nic, could you test this one please?

I meant Rolf, really sorry (Nic reported another issue).
Comment 28 Yann Ylavic 2017-09-28 11:35:47 UTC
Created attachment 35380 [details]
Signals and hooks cleanup on exit (v2, 2.4.x)

v1 missed r1809976, please use this v2.
Comment 29 Yann Ylavic 2017-09-28 11:39:50 UTC
Created attachment 35381 [details]
Signals and hooks cleanup on exit (v3, 2.4.x)

v2 had some unrelated changes.
Comment 30 Rolf 2017-09-28 13:14:50 UTC
(In reply to Yann Ylavic from comment #27)
> (In reply to Yann Ylavic from comment #26)
> > 
> > Nic, could you test this one please?
> 
> I meant Rolf, really sorry (Nic reported another issue).

Thanks.
I will apply the second patch. Do I have to keep the 1st patch applied as well? https://bz.apache.org/bugzilla/attachment.cgi?id=35376&action=diff
Comment 31 Yann Ylavic 2017-09-28 13:20:18 UTC
(In reply to Rolf from comment #30)
> 
> I will apply the second patch. Do I have to keep the 1st patch applied as
> well? https://bz.apache.org/bugzilla/attachment.cgi?id=35376&action=diff

No, attachment 35381 [details] is a replacement for all the previous patches, thanks!
Comment 32 Rolf 2017-09-28 21:41:59 UTC
(In reply to Yann Ylavic from comment #31)
> (In reply to Rolf from comment #30)
> > 
> No, attachment 35381 [details] is a replacement for all the previous
> patches, thanks!

Good news!

1. The first test run of 2 hours using a stripped-down configuration of Apache is a success; no segfaults anymore :) The stripped-down configuration: no sites enabled and no extra modules loaded (such as mod_security, mod_status, mod_spamhaus, etc.).

2. The patch has already been released in the Apache PPA of ~ondrej https://launchpad.net/~ondrej/+archive/ubuntu/apache2/+packages 2.4.27-6.1

3. I will a) run extra test runs on Friday using a fully-configured Apache environment, and b) monitor that the segfault no longer happens when rebooting the O.S. either. This was the 2nd part of the problem report: 1 = segfault sometimes when restarting the Apache service, 2 = segfault sometimes when starting/rebooting the O.S.

Thanks for your time.
Comment 33 Rolf 2017-10-03 19:02:05 UTC
Good news.

The extra tests (running 24h) passed with success. Thanks for the patch!
Comment 34 Yann Ylavic 2017-10-03 20:24:21 UTC
Thanks Rolf for testing!

Let's keep this PR open until the patch is reviewed and backported to 2.4.x (which I'll propose shortly).
Comment 35 Yann Ylavic 2018-01-31 13:36:12 UTC
Backported to 2.4.x (r1820794), will be in upcoming 2.4.30.