Bug 42580 - apr_pollset_poll returns ETIMEUP when interrupted
Summary: apr_pollset_poll returns ETIMEUP when interrupted
Status: RESOLVED FIXED
Alias: None
Product: APR
Classification: Unclassified
Component: APR (show other bugs)
Version: HEAD
Hardware: Sun Solaris
: P2 normal with 3 votes (vote)
Target Milestone: ---
Assignee: Apache Portable Runtime bugs mailinglist
URL:
Keywords: FixedInTrunk, PatchAvailable
Depends on:
Blocks:
 
Reported: 2007-06-04 14:59 UTC by Kevin Richter
Modified: 2008-04-14 15:00 UTC (History)
2 users (show)



Attachments
Patch (420 bytes, patch)
2008-02-11 14:47 UTC, Basant Kumar Kukreja
Details | Diff
New patch (453 bytes, patch)
2008-02-12 13:38 UTC, Basant Kumar Kukreja
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Richter 2007-06-04 14:59:54 UTC
_Every_ time stopping or restarting apache following error is in the log:

(70007)The timeout specified has expired: apr_pollset_poll: (listen)

The context is the following:

[Sat Jun 02 14:59:40 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Sat Jun 02 14:59:42 2007] [notice] SIGHUP received.  Attempting to restart
[Sat Jun 02 14:59:42 2007] [notice] Digest: generating secret for digest
authentication ...
[Sat Jun 02 14:59:42 2007] [notice] Digest: done
[Sat Jun 02 14:59:44 2007] [notice] Apache/2.2.4 (Unix) mod_ssl/2.2.4
OpenSSL/0.9.8d configured -- resuming normal operations


The bug also in 2.2.3 occurs. Other versions are not tested.
Comment 1 Ruediger Pluem 2007-06-05 12:46:34 UTC
What MPM are you using?
Comment 2 Kevin Richter 2007-06-05 13:05:47 UTC
Worker

That might be the problem. Ive got running a 2.2.3 with prefork. There are no
problems.

And Ive got a 2.2.4 and a 2.2.3 with worker where these apr problems occur.
Comment 3 Kevin Richter 2007-06-16 07:31:39 UTC
Sorry, I am new to Bugzilla. Do I have to check the "reassign" checkbox like I
do now?
My answer to your question is 11 days old.
Or isnt there a solution for the problem?
Comment 4 Ruediger Pluem 2007-06-16 09:25:35 UTC
(In reply to comment #3)
> Sorry, I am new to Bugzilla. Do I have to check the "reassign" checkbox like I
> do now?

No, you don't need to. You are doing it perfectly.

> My answer to your question is 11 days old.

Thanks for that.

> Or isnt there a solution for the problem?

It is lack of time :-). Sofar nobody had some time to dig into it.
Feel free to bug us here from time to time if you see no progress.
Comment 5 Kevin Richter 2007-06-16 14:05:19 UTC
Ok, thanks. No problem.

Perhaps I can help i bit. I have examined the situation. Here the results:

The apr problem only occurs, when the Apache runs in a Solaris Zone. When it
runs in the global context (means: not in a zone), the apache restart does NOT
cause an APR_TIMEUP.

Is it right, that the error is from srclib/apr/poll/unix/poll.c from line 253?
And the causing function is the poll(..) system call?

"man poll" under Solaris says the same as the source code:

RETURN VALUES
     Upon  successful  completion,  a   non-negative   value   is
     returned.  A  positive  value  indicates the total number of
     file descriptors that  has  been  selected  (that  is,  file
     descriptors  for  which  the  revents member is non-zero). A
     value of 0 indicates that the call timed  out  and  no  file
     descriptors have been selected. Upon failure, -1 is returned
     and errno is set to indicate the error.


Perhaps its not possible to execute poll(..) in a Solaris zone?
Or there is a missing device in /dev/?
Comment 6 Ruediger Pluem 2007-06-16 15:28:37 UTC
(In reply to comment #5)

> The apr problem only occurs, when the Apache runs in a Solaris Zone. When it
> runs in the global context (means: not in a zone), the apache restart does NOT
> cause an APR_TIMEUP.

This is an interesting information.

> 
> Is it right, that the error is from srclib/apr/poll/unix/poll.c from line 253?

As you are using zones I suppose that you are running on Solaris 10. So I guess
that you have port support there and you should be rather in line 291 of port.c
in the same directory. You can check if you have port support by checking if
HAVE_PORT_CREATE is defined in srclib/apr/include/arch/unix/apr_private.h

> And the causing function is the poll(..) system call?

I think it is rather port_getn.

> 
> "man poll" under Solaris says the same as the source code:
> 
> RETURN VALUES
>      Upon  successful  completion,  a   non-negative   value   is
>      returned.  A  positive  value  indicates the total number of
>      file descriptors that  has  been  selected  (that  is,  file
>      descriptors  for  which  the  revents member is non-zero). A
>      value of 0 indicates that the call timed  out  and  no  file
>      descriptors have been selected. Upon failure, -1 is returned
>      and errno is set to indicate the error.
> 
> 
> Perhaps its not possible to execute poll(..) in a Solaris zone?
> Or there is a missing device in /dev/?

I don't think that this is the case. If you wouldn't be able to use this system
call httpd would not work at all. I suppose that it has something to do with the
restart logic itself. What happens (error message wise) if you

1. Stop httpd
2. do a gracefull restart
3. do a graceful shutdown

in your zone.


> 
Comment 7 Kevin Richter 2007-06-17 07:16:52 UTC
> As you are using zones I suppose that you are running on Solaris 10. 

Yes, thats right. Solaris 10 for sparc architecture - and not for x86.


> So I guess
> that you have port support there and you should be rather in line 291 of port.c
> in the same directory. You can check if you have port support by checking if
> HAVE_PORT_CREATE is defined in srclib/apr/include/arch/unix/apr_private.h

Yes, it is defined there.

Since September 2006 there is a patch for two port_getn problems:
6374407 port_getn does not dissociate fd when it reports an event on it
6402576 Sometimes port_getn() never returns indicating that a socket has become
writable

Dont know either if the patch was applied on this server nor if it is possible
that is has something to do with the APR_TIMEUP problem.


> I don't think that this is the case. If you wouldn't be able to use this system
> call httpd would not work at all. I suppose that it has something to do with the
> restart logic itself. What happens (error message wise) if you
> 
> 1. Stop httpd
> 2. do a gracefull restart
> 3. do a graceful shutdown
> 
> in your zone.

The error occurs in each case (of the three).

Comment 8 Kevin Richter 2007-06-23 08:14:16 UTC
Sorry, I was wrong. Today I have seen that the APR_TIMEUP also occurs in the
global context / global zone :-(

And not only one timeup per restart... how you can see... 

[Sat Jun 23 13:01:30 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Sat Jun 23 13:01:31 2007] [notice] SIGHUP received.  Attempting to restart
[Sat Jun 23 13:01:31 2007] [notice] Apache/2.2.4 (Unix) configured -- resuming
normal operations
[Sat Jun 23 13:01:32 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Sat Jun 23 13:01:39 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Sat Jun 23 13:01:40 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Sat Jun 23 13:01:41 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Sat Jun 23 13:06:31 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)


Any ideas how I can help?
Can I debug this issue?
Comment 9 Kevin Richter 2007-06-30 08:37:17 UTC
Ok, the second time I was wrong. But now i have got the problem! :)

I accuse a second Listen directive! 
(Doesnt matter if in the global context or in a zone.)

For example:
Listen 192.168.1.7:80
Listen 192.168.1.7:81
-> The APR_TIMEUP happens when restarting, stopping, ...

Listen 192.168.1.7:80
-> No APR_TIMEUP

Omitting the IP and using Listen *:80 and Listen *:81 didnt solve the problem.
The APR_TIMEUP occurs.


Please help :)

Comment 10 Brian Poole 2007-11-30 13:23:34 UTC
Hello,

I can also reproduce this error on my Solaris 10 servers with a recent build of
Apache 2.2.6. Whenever I run a graceful restart, I see the same apr_timeout
message in the log file. Perhaps more importantly, this SOMETIMES seems to be
leaving behind children processes that shouldn't still be running after a
graceful restart. Sometimes it leaves one child process, sometimes it leaves 5.

For example, on a completely idle server, here is a ps list after numerous
graceful restarts in a row:

$ ps -ef | grep http
webservd 19097 18781   0 16:10:01 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start
    root 18781 12143   1 16:02:28 ?           0:04
/usr/local/apache-2.2.6/bin/httpd -k start
webservd 18868 18781   0 16:03:53 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start
webservd 19095 18781   0 16:10:01 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start
webservd 19099 18781   0 16:10:01 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start
webservd 19096 18781   0 16:10:01 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start
webservd 19098 18781   0 16:10:01 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start
webservd 19072 18781   0 16:08:59 ?           0:00
/usr/local/apache-2.2.6/bin/httpd -k start

There were no client requests made to this daemon during this period as it is a
completely idle test box and yet there are consistently processes (19072, 18868
in this example) that are not exiting as I would expect them to. A truss of one
of the hung up processes is sticking in port_getn:

$ truss -p 18868
port_getn(19, 0x00224298, 2, 1, 0x00000000) (sleeping...)

The log file showing the restarts and errors in the log during this period is below:

[Fri Nov 30 16:02:27 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:02:27 2007] [notice] Digest: done
[Fri Nov 30 16:02:27 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:02:33 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:02:33 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:02:33 2007] [notice] Digest: done
[Fri Nov 30 16:02:33 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:02:58 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:02:58 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:02:59 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:02:59 2007] [notice] Digest: done
[Fri Nov 30 16:02:59 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:03:31 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:03:31 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:03:31 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:03:31 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:03:31 2007] [notice] Digest: done
[Fri Nov 30 16:03:31 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:03:34 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:03:34 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:03:34 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:03:34 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:03:34 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:03:34 2007] [notice] Digest: done
[Fri Nov 30 16:03:34 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:03:52 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:03:52 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:03:52 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:03:52 2007] [notice] Digest: done
[Fri Nov 30 16:03:52 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:04:02 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:04:02 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:04:02 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:04:02 2007] [notice] Digest: done
[Fri Nov 30 16:04:03 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:08:58 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:08:58 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:08:58 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:08:58 2007] [notice] Digest: done
[Fri Nov 30 16:08:58 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations
[Fri Nov 30 16:10:00 2007] [notice] Graceful restart requested, doing restart
[Fri Nov 30 16:10:00 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:10:00 2007] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)
[Fri Nov 30 16:10:00 2007] [notice] Digest: generating secret for digest
authentication ...
[Fri Nov 30 16:10:00 2007] [notice] Digest: done
[Fri Nov 30 16:10:00 2007] [notice] Apache/2.2.6 (Unix) mod_ssl/2.2.6
OpenSSL/0.9.8g configured -- resuming normal operations


Some info about my config.. Server is on Solaris 10/sparc recently patched with
the 10/31/2007 Recommended patch cluster. Following modules are loaded:
core_module (static), mpm_prefork_module (static), http_module (static),
so_module (static), alias_module (shared), authn_default_module (shared),
authn_file_module (shared), authz_default_module (shared), authz_host_module
(shared), auth_basic_module (shared), auth_digest_module (shared),
deflate_module (shared), dir_module (shared), dumpio_module (shared),
headers_module (shared), include_module (shared), log_config_module (shared),
logio_module (shared), mime_module (shared), proxy_module (shared),
proxy_ajp_module (shared), proxy_balancer_module (shared), proxy_http_module
(shared), rewrite_module (shared), setenvif_module (shared), ssl_module
(shared), vhost_alias_module (shared)

/usr/local/apache-2.2.6/bin/httpd -V
Server version: Apache/2.2.6 (Unix)
Server built:   Nov 27 2007 18:40:30
Server's Module Magic Number: 20051115:5
Server loaded:  APR 1.2.11, APR-Util 1.2.10
Compiled using: APR 1.2.11, APR-Util 1.2.10
Architecture:   32-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_FCNTL_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr/local/apache-2.2.6"
 -D SUEXEC_BIN="/usr/local/apache-2.2.6/bin/suexec"
 -D DEFAULT_PIDLOG="logs/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="logs/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

Happy to assist with debugging this but I'm not sure what needs to be looked at
and don't really have the time to go spelunking into it all by my lonesome.
Anyone care to assist?
Comment 11 Brian Poole 2007-12-03 15:10:40 UTC
Tested 2.0.61 on the same machine with similar options and do not see this
problem. Updated version to 2.2.6 in the bug report.
Comment 12 Basant Kumar Kukreja 2008-01-28 19:08:20 UTC
I also saw the similar behavior in open solaris (snv_79).

httpd.conf contains :
Listen 192.168.21.1:80
Listen 192.168.22.1:80

In error log, I observed : 
[Mon Jan 28 18:10:48 2008] [notice] Apache/2.2.6 (Unix) configured -- resuming
normal operations
[Mon Jan 28 18:10:49 2008] [error] (70007)The timeout specified has expired:
apr_pollset_poll: (listen)

Comment 13 Basant Kumar Kukreja 2008-02-11 14:47:05 UTC
Created attachment 21511 [details]
Patch

When I had the following settings :

Listen 192.168.21.1:81
Listen 192.168.22.1:81

<IfModule worker.c>
ListenBackLog	  50000
StartServers	     2
ThreadLimit	   5
ThreadsPerChild    5
MinSpareThreads    1
MaxSpareThreads    1
ThreadsPerChild    5
MaxClients	  10
MaxRequestsPerChild  0
</IfModule>

I was getting the following error in my error log
[Mon Feb 11 13:33:52 2008] [error] (70007)The timeout specified has expired:
apr_pollset_poll : (listen)

On debugging, I found out that the listener thread calls port_getn to listen
for new connections, it was getting SIGHUP signal. This was confirmed by truss
output.

5541/1: 	read(6, 0xFFFFFD7FFFDFF38B, 1)	(sleeping...)
5541/7: 	lwp_park(0x00000000, 0) 	(sleeping...)
5541/6: 	lwp_park(0x00000000, 0) 	(sleeping...)
5541/5: 	lwp_park(0x00000000, 0) 	(sleeping...)
5541/4: 	lwp_park(0x00000000, 0) 	(sleeping...)
5541/3: 	lwp_park(0x00000000, 0) 	(sleeping...)
5541/8: 	port_getn(12, 0x00539D30, 2, 1, 0x00000000) (sleeping...)
5543/1: 	read(6, 0xFFFFFD7FFFDFF38B, 1)	(sleeping...)
5543/3: 	lwp_park(0x00000000, 0) 	(sleeping...)
5543/4: 	lwp_park(0x00000000, 0) 	(sleeping...)
5543/5: 	lwp_park(0x00000000, 0) 	(sleeping...)
5543/6: 	lwp_park(0x00000000, 0) 	(sleeping...)
5543/7: 	lwp_park(0x00000000, 0) 	(sleeping...)
5543/8: 	fcntl(11, F_SETLKW, 0xFFFFFD7FFF241E10) (sleeping...)
5539/1: 	pollsys(0xFFFFFD7FFFDFF2E0, 0, 0xFFFFFD7FFFDFF3A0, 0x00000000)
= 0
5539/1: 	write(7, " !", 1)				= 1
5541/1: 	read(6, " !", 1)				= 1
5541/1: 	lwp_wait(2, 0xFFFFFD7FFFDFF31C) 		= 0
5541/1: 	lwp_kill(8, SIGHUP)				= 0
5541/1: 	lwp_kill(8, SIG#0)				= 0
5541/8: 	    Received signal #1, SIGHUP, in port_getn() [caught]
5541/8: 	      siginfo: SIGHUP pid=5541 uid=101 code=-1
5541/8: 	port_getn(12, 0x00539D30, 2, 1, 0x00000000)	Err#4 EINTR
5541/8: 	lwp_sigmask(SIG_SETMASK, 0xFFBEE007, 0x0000FFF7) = 0xFFBFFEFF
[0x0000FFFF]

Note that 5541/8 is the listener thread and 5541/1 was sending the SIGHUP
signal.

I further debugged why one of my process is dieing. Then I found out that
in worker.c, in perform_idle_server_maintenance function, 

    if (idle_thread_count > max_spare_threads) {
	/* Kill off one child */
	ap_mpm_pod_signal(pod, TRUE);
	idle_spawn_rate = 1;
	..
From debugging, I found out that idle_thread_count was set to 10 and
max_spare_threads was set to 6.  Hence the first thread was sending SIGHUP
signal to kill the process.

If I increase the MaxSpareThreads to 10 then this error disappeared because in
that case, worker process didn't try to kill himself.
<IfModule worker.c>
ListenBackLog	  50000
StartServers	     2
ThreadLimit	   5
ThreadsPerChild    5
MinSpareThreads    1
MaxSpareThreads    10
ThreadsPerChild    5
MaxClients	  10
MaxRequestsPerChild  0
</IfModule>

The error message is misleading and it may or may not appear. It depends on the

timing. If listener thread has been reached to the point when it is blocked
with port_getn then we get this error. It seems to me that server wanted to
deliberately kill one of the child process.

Casue of the problem is that apr_pollset_poll function doesn't differentiate
the EINTR and ETIME.  apr_pollset_poll contains ...
	if (errno == ETIME || errno == EINTR) {
	    rv = APR_TIMEUP;
	}

In worker.c, we do expect EINTR to be normal as shown below :
-------------- worker.c --------------------------
		rv = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
		if (rv != APR_SUCCESS) {
		    if (APR_STATUS_IS_EINTR(rv)) {
			continue;
		    }
--------------------------------------------------

If we allow EINTR as a known error and return APR_EINTR as given in suggested
patch then this error disappeared.
Comment 14 Joe Orton 2008-02-12 02:02:26 UTC
So this is an APR bug.  I've no idea why the Solaris port pollset code would map
EINTR to APR_ETIMEUP; can you post that patch to dev@apr to solicit feedback?
Comment 15 Basant Kumar Kukreja 2008-02-12 11:58:24 UTC
I posted the patch to apr dev list.
Comment 16 Basant Kumar Kukreja 2008-02-12 12:31:11 UTC
Here is the link to the apr question which I posted :
http://mail-archives.apache.org/mod_mbox/apr-dev/200802.mbox/browser
Comment 17 Basant Kumar Kukreja 2008-02-12 13:38:12 UTC
Created attachment 21513 [details]
New patch

Based on the suggestion from Lucian Adrian Grijincu, I have updated the patch.
Comment 18 Nick Kew 2008-04-14 15:00:44 UTC
Now fixed in 1.2.x (for httpd) as well as trunk+1.3