Bug 47662 - APR shouldn't rely on build-time results to decide whether to use CLOEXEC or not
Summary: APR shouldn't rely on build-time results to decide whether to use CLOEXEC or not
Status: RESOLVED INVALID
Alias: None
Product: APR
Classification: Unclassified
Component: APR (show other bugs)
Version: HEAD
Hardware: All Linux
: P2 major (vote)
Target Milestone: ---
Assignee: Apache Portable Runtime bugs mailinglist
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-08-07 06:47 UTC by Diego Petten
Modified: 2009-08-17 01:00 UTC (History)
1 user (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Diego Petten 2009-08-07 06:47:46 UTC
I hit a very nasty bump on the update road today on my Gentoo server, when updating apr to the latest version (1.3.8). The issue starts showing up with this:

vanguard ~ # /etc/init.d/apache2 restart
 * apache2 has detected a syntax error in your configuration files:
[Fri Aug 07 11:39:07 2009] [crit] (22)Invalid argument: alloc_listener: failed to get a socket for 192.168.77.25
Syntax error on line 1 of /etc/apache2/vhosts.d/00_default_vhost.conf:
Listen setup failed
 * ERROR: apache2 failed to start

The line in question is a standard “Listen 80” that works with the previous version. After some lucky digging around, the actual problem is that the autoconf script checks for CLOEXEC support by _running_ test code at build-time, but this discovers the availability of CLOEXEC on the _build_ system; it's not uncommon to use different build and host systems, and they might differ in kernel, which is my case (it's true for binary distributions as well that use build servers).

There is one further problem: if I do use apr_cv_sock_cloexec=no to fool the configure script into not finding the cloexec support, Apache starts up, open the socket, but the children segfault; the problem is that the code checks, in mostly the same way, for the epoll_create1() function, _even if cloexec wasn't found_, and the impossible combination creates the segfault. Setting apr_cv_epoll_create1=no fools configure again and it starts to work.

I don't think that this is the right approach sincerely, since beside this particular case of having the host system running on an older kernel, the opposite is also bad; if the build system has an old kernel running, but a new enough glibc, it'll be disabling cloexec, that the host system would have been able to use.

I haven't looked at the code enough yet (and I'm afraid I won't have way to do that till the 20th, but my proposal would be to do something like this:

 1) in autoconf check for the declaration of flag and function, not their working order;
 2) write the codepaths, rather than using #ifdef, with standard if() clauses, for either cloexec or no-cloexec; make the if condition a macro like APR_USE_CLOEXEC;
 3) add an option to the configure script to either force off or on the cloexec support, or let it be automatically used;
 4) if the support is forced on or off, declare APR_USE_CLOEXEC a constant, set to 1 or 0;
 5) if the support is auto, but the declarations weren't found, also declare APR_USE_CLOEXEC a constant 0;
 6) if the support is auto, and the declaration were found, make APR_USE_CLOEXEC a macro to function; this function would be something like:

int apr_use_cloexec_check() {
  static int apr_use_cloexec = -1;
  if ( apr_use_cloexec == -1 ) {
    [... run some quick runtime test for cloexec to work ...]
  }
  return apr_use_cloexec;
}

(this will cache the result so that the test is executed at most once)
Comment 1 Diego Petten 2009-08-07 06:50:55 UTC
Ah by the way, the reason why I say to declare that a constant in those cases is that most compilers through DCE optimise that up in the same way as #ifdefs, so if you force it, the opposite branch is not going to be compiled at all.
Comment 2 Bojan Smojver 2009-08-09 23:33:07 UTC
This has been discussed on APR devel list. The portable in APR refers to API, not the ability to build on one kernel and run on another.

If you know that your target kernel can/cannot use certain functions, use overrides that you noted in your bug report. Generally speaking, if you set all of these to no before ./configure:

export apr_cv_epoll_create1=no
export apr_cv_dup3=no
export apr_cv_accept4=no
export apr_cv_sock_cloexec=no

The functions that require a relatively new kernel (>= 2.6.28) will not be used and APR will work on older and newer kernels.
Comment 3 Paolo Bonzini 2009-08-10 14:15:09 UTC
Distributions very often update kernel-headers much more often than the kernel itself.  Besides, this would make it impossible for distributions to provide an updated APR without disabling SOCK_CLOEXEC, patching it, or enabling the installation only on machines running a new-enough kernel (or forcing a reboot if the kernel is not old enough).

I'm pretty sure that this is not what you want to do.  Even glibc itself guards usage of the new cloexec features with tests that detect EINVAL or ENOSYS errors:

http://www.eglibc.org/cgi-bin/viewcvs.cgi/*checkout*/branches/libdfp/libc/sysdeps/unix/sysv/linux/opensock.c?rev=6786

It would actually be an additional advantage of APR, if it included these checks.
Comment 4 Arkadiusz Miskiewicz 2009-08-13 14:19:32 UTC
"Apache starts up, open the socket, but the children segfault; " - this one should be addressed. Some error should be logged instead of segfaulting.
Comment 5 Joe Orton 2009-08-14 01:00:37 UTC
(In reply to comment #4)
> "Apache starts up, open the socket, but the children segfault; " - this one
> should be addressed. Some error should be logged instead of segfaulting.

I'm not sure exactly what causes that but would probably agree; if someone who can reproduce that could file a separate bug against httpd that would be great.

(In reply to comment #3)
> Distributions very often update kernel-headers much more often than the kernel
> itself.

That is not the issue.  APR does not rely on mere presence of the interfaces, but tests them and ensures that they work.

This is little different, in principle, to any of the hundreds of other platform features which APR detects at build-time, and subsequently assumes are present at run-time.

> Besides, this would make it impossible for distributions to provide an
> updated APR without disabling SOCK_CLOEXEC, patching it, or enabling the
> installation only on machines running a new-enough kernel (or forcing a reboot
> if the kernel is not old enough).

This is not correct either - as Bojan says in comment 2 it is simple enough for those producing binary distributions to disable these features at build time.  It is not at all unreasonable to expect distributors to have to manage this mismatch between "platform features as detected at build-time by configure" and "platform features available on minimal target run-time".

This thread is the discussion on the list previously:

http://www.mail-archive.com/dev@apr.apache.org/msg21878.html

Please take further discussion to the list and refrain from playing ping-pong with the bug status.  The impact of this decision is well understood and is not being treated as a bug.
Comment 6 Paolo Bonzini 2009-08-14 03:24:11 UTC
First of all, I'm not playing pingpong: I just reopened the bug once, and I'm not the OP so that's the logical thing to do instead of opening a new bug and losing the context from Diego's message.  If you think this is pingpong, do not give random bugzilla users the ability to reopen bugs.

Anyway, even though I'm not reopening it now, I'm not bringing the question to the ML either.  This is arguably a bug and as such it should be discussed here.

Of all the people that had part in the discussion, only Bogan and a Gentoo fan actually agreed on this.  The other people were mostly contrary, perhaps including you.  You wrote in the thread:

> Maybe let's just do a release with this code as-is, and see how many 
> people complain, and then revisit the decision if necessary.

And here you are, you have two people complaining.

As a thought experiment, what would happen if you omitted the run-time test?  People would have errors instead of just having a crippled system that does not use SOCK_CLOEXEC/accept4.  Contrarily to what is mentioned in the mailing list thread, this bug has nothing to do with the "sanity" of build environments.  Despite what the Gentoo fan says in the thread, even Gentoo systems have this problem if you install a newer kernel but do not reboot immediately.

It's admittedly more difficult to get complaints given the way APR's configuration works.  People either have to transport the package without rebuilding it, or look at the source code.  Most people will not notice that their package (even if custom-compiled on your machine as in the case of Gentoo) does not use SOCK_CLOEXEC/accept4.  However, since SOCK_CLOEXEC was added also for security reasons involving race conditions between socket/accept and fcntl, this is even worse in my opinion.
Comment 7 Arkadiusz Miskiewicz 2009-08-16 12:42:18 UTC
(In reply to comment #4)
> "Apache starts up, open the socket, but the children segfault; " - this one
> should be addressed. Some error should be logged instead of segfaulting.

Diego, were you using worker? Worker lacks fix from bug #46467. Only prefork was fixed it seems.
Comment 8 Paolo Bonzini 2009-08-17 01:00:09 UTC
Just to record an additional complaint (of which both Bojan and Joe are aware):

https://bugzilla.redhat.com/show_bug.cgi?id=516331