Bug 23548 - prefork model on solaris 2.6 mod_ldap leaves connections to ldap in close_wait
Summary: prefork model on solaris 2.6 mod_ldap leaves connections to ldap in close_wait
Status: RESOLVED DUPLICATE of bug 31352
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_ldap
Version: 2.1-HEAD
Hardware: Sun other
Importance: P3 normal with 1 vote
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-10-01 16:46 UTC by Nojan Moshiri
Modified: 2005-01-21 14:22 UTC
CC List: 2 users



Description Nojan Moshiri 2003-10-01 16:46:53 UTC
We are running Apache 2.0.47 in the prefork model on Solaris 2.6 and Solaris 2.8. We are using
the LDAP authenticator, which is still experimental, with the Netscape C libraries (without
SSL).

Authentication to LDAP using .htaccess files works flawlessly.  Each Apache child holds open a
connection to LDAP on port 389 for the specified length of time.  However, when that time expires
or the port 389 connection is closed, the Apache child does not reap the connection and it remains
stuck in CLOSE_WAIT.

Eventually Apache reaches the file descriptor limit (set to 64) and stops accepting new connections.
We don't want to raise the FD limit; we would like to get to the root cause.  Any ideas?

We do not see this on Solaris 2.8!

APPENDIX (error log, truss, lsof output from parent and children).
========
ERROR LOG INDICATES THE SYMPTOM:

[Tue Sep 30 18:40:48 2003] [error] [client 10.0.0.2] (24)Too many open files

TRUSS of parent seems normal:

15598:  poll(0xEFFFDA90, 0, 1000)                       = 0
15598:  waitid(7, 0, 0xEFFFF9A0, 0107)                  = 0
15598:        siginfo: SIG#0
15598:  poll(0xEFFFDA90, 0, 1000)                       = 0
15598:  waitid(7, 0, 0xEFFFF9A0, 0107)                  = 0
15598:        siginfo: SIG#0
15598:  poll(0xEFFFDA90, 0, 1000)                       = 0
15598:  waitid(7, 0, 0xEFFFF9A0, 0107)                  = 0
15598:        siginfo: SIG#0
(SO ON AND SO FORTH)

LSOF OF A CHILD SHOWS THE STUCK CLOSE_WAITS:
COMMAND  PID     USER   FD   TYPE     DEVICE SIZE/OFF      NODE NAME
apache2047   5435 www    9u  inet 0x90c4c9d8     0t37       TCP solars2.6:80->eam-sj2.cisco.com:37648 (CLOSE_WAIT)
apache2047   5435 www   10u  inet 0x9278db10    0t427       TCP solars2.6:50616->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   11u  inet 0xad5a3958    0t386       TCP solars2.6:50627->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   12u  inet 0x83b41ec8   0t1612       TCP solars2.6:50630->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   13u  inet 0x7e1cf160    0t744       TCP solars2.6:50666->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   14u  inet 0x7231f7c8   0t3507       TCP solars2.6:50670->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   15u  inet 0xad5a2f58    0t361       TCP solars2.6:51412->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   16u  inet 0x76985e38  0t12054       TCP solars2.6:35189->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   17u  inet 0x9278aa88    0t333       TCP solars2.6:37367->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   18u  inet 0x7d6771c0   0t5455       TCP solars2.6:40510->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   19u  inet 0x81a710f8   0t3548       TCP solars2.6:41520->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   20u  inet 0x7e1cfc60   0t6510       TCP solars2.6:41536->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   21u  inet 0x90c74a78    0t732       TCP solars2.6:43119->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   22u  inet 0x71cee450   0t6255       TCP solars2.6:52432->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   23u  inet 0x759de7d0    0t366       TCP solars2.6:53967->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   24u  inet 0x839526b8   0t3942       TCP solars2.6:56627->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   25u  inet 0x83b435d0    0t366       TCP solars2.6:57230->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   26u  inet 0x9c7aaeb8    0t422       TCP solars2.6:57249->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   27u  inet 0x72b4bd58    0t366       TCP solars2.6:57250->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   28u  inet 0x83b40548    0t821       TCP solars2.6:57267->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   29u  inet 0x9278cd90    0t366       TCP solars2.6:57348->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   30u  inet 0x83861140   0t8091       TCP solars2.6:57513->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   31u  inet 0x9c79aeb0    0t380       TCP solars2.6:58749->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   32u  inet 0x83950730   0t4662       TCP solars2.6:58776->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   33u  inet 0x81a70ff8    0t380       TCP solars2.6:59546->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   34u  inet 0x7072d540    0t810       TCP solars2.6:59555->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   35u  inet 0x9278a388    0t366       TCP solars2.6:59559->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   36u  inet 0x83860440    0t403       TCP solars2.6:59576->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   37u  inet 0xad6ab4e8    0t380       TCP solars2.6:59640->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   38u  inet 0xad6aaf68   0t3142       TCP solars2.6:59643->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   39u  inet 0x7dca54f8   0t1084       TCP solars2.6:60061->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   40u  inet 0x96072850    0t817       TCP solars2.6:60070->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   41u  inet 0x7e1cf560    0t380       TCP solars2.6:60141->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   42u  inet 0x83b42fd0   0t1589       TCP solars2.6:60159->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   43u  inet 0x7dca8a88    0t380       TCP solars2.6:60375->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   44u  inet 0x9c799310   0t3202       TCP solars2.6:60385->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   45u  inet 0xad67ba60    0t732       TCP solars2.6:61203->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   46u  inet 0x83950f30   0t2381       TCP solars2.6:61243->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   47u  inet 0xad6b2af0    0t746       TCP solars2.6:61359->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   48u  inet 0x7d6775c0   0t4131       TCP solars2.6:61380->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   49u  inet 0x90c4d9d8    0t386       TCP solars2.6:61741->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   50u  inet 0x90c74278  0t14254       TCP solars2.6:61756->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   51u  inet 0x9278c410   0t2164       TCP solars2.6:63803->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   52u  inet 0x7072c0c0    0t420       TCP solars2.6:63816->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   53u  inet 0x9c797188   0t1096       TCP solars2.6:63822->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   54u  inet 0x769844b8    0t420       TCP solars2.6:63843->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   55u  inet 0x7e1d5de8    0t384       TCP solars2.6:63865->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   56u  inet 0x7e1d54e8   0t3558       TCP solars2.6:64149->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   57u  inet 0xad6b2bf0    0t646       TCP solars2.6:64641->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   58u  inet 0x83950230   0t4721       TCP solars2.6:65486->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   59u  inet 0x7231e748    0t366       TCP solars2.6:32822->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   60u  inet 0x90c75578    0t427       TCP solars2.6:35352->netscape-ldap:389 (CLOSE_WAIT)
apache2047   5435 www   61r  VREG  138,99002        0     16128 /fs/apache2047 (/dev/vx/dsk/ccodg/apache2047vol)
apache2047   5435 www   62u  FIFO 0x76dff700      0t0   5476979 PIPE->0x76dff784
Comment 1 Albert Lunde 2004-04-26 16:04:02 UTC
See comments/patches on bug 27134

We're seeing a similar problem under Red Hat 9.0, though I'm mainly testing
versions patched for a different problem.

Comment 2 Graham Leggett 2004-05-21 15:49:47 UTC
Are you using a shared LDAP connection pool?

The mod_ldap connection pool will keep the connections open until the process is
shut down, at which point it will gracefully close them. This should happen when
your process reaches its maximum number of allowed requests and shuts down to be
replaced by a new one.

If you're using a shared pool, though, the connections will only be shut down
when httpd receives a graceful restart, which I can see could be a problem if
httpd is long-lived.
Comment 3 Nojan Moshiri 2004-05-21 16:16:46 UTC
As I understand it, the shared connection pool is on by default.  You bring up a
good point, though.  Maybe it's a matter of adjusting this shared pool inside the
source code; I don't see any way to do that from a config line.  The
documentation simply says it's on by default.  It's still curious that we only
see this under Solaris 2.6 (and perhaps another person sees it under Red Hat),
although I can see how this could simply mask the problem (we see it earlier on
2.6, and our 2.8 environments don't get similar traffic patterns?).

Anyway... THANKS! I will look into the code from this angle; sometimes it just
takes another pair of eyes.

For reference, from the mod_ldap documentation:

LDAP Connection Pool

LDAP connections are pooled from request to request. This allows the LDAP server
to remain connected and bound ready for the next request, without the need to
unbind/connect/rebind. The performance advantages are similar to the effect of
HTTP keepalives.

On a busy server it is possible that many requests will try and access the same
LDAP server connection simultaneously. Where an LDAP connection is in use,
Apache will create a new connection alongside the original one. This ensures
that the connection pool does not become a bottleneck.

There is no need to manually enable connection pooling in the Apache
configuration. Any module using this module for access to LDAP services will
share the connection pool.
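
For illustration only, a minimal sketch of the pool-lookup behaviour described
above (hypothetical struct and function names, not the actual mod_ldap code): a
connection marked in use is skipped, and a new connection is created alongside
the busy ones.

#include <apr_pools.h>
#include <apr_thread_mutex.h>
#include <ldap.h>

/* Hypothetical pooled-connection record; the real mod_ldap structure differs. */
typedef struct ldap_conn {
    struct ldap_conn *next;
    LDAP *ld;                 /* LDAP handle, bound lazily by the caller */
    int in_use;               /* non-zero while a request is using it    */
} ldap_conn;

static ldap_conn *pool_head = NULL;
static apr_thread_mutex_t *pool_lock;   /* assumed created at startup */

/* Return an idle pooled connection, or add a new one alongside the busy ones. */
static ldap_conn *get_connection(apr_pool_t *p, const char *host, int port)
{
    ldap_conn *c;

    apr_thread_mutex_lock(pool_lock);
    for (c = pool_head; c; c = c->next) {
        if (!c->in_use) {
            c->in_use = 1;
            apr_thread_mutex_unlock(pool_lock);
            return c;               /* reuse an idle, already-open connection */
        }
    }

    /* Every pooled connection is busy: create a new one alongside them. */
    c = apr_pcalloc(p, sizeof(*c));
    c->ld = ldap_init(host, port);
    c->in_use = 1;
    c->next = pool_head;
    pool_head = c;
    apr_thread_mutex_unlock(pool_lock);
    return c;
}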
Comment 4 Graham Leggett 2004-05-21 16:26:23 UTC
Hmmm... in that case the fix would be to include a timeout on the LDAP
connections in the cache, so that those unused after a certain age will be
closed down and removed from the pool.

Currently this is done via an apr cleanup call when the memory pool is cleaned
up, but this only happens when a graceful restart happens. Will chase the apr
people to see if there is an elegant way of timing out these connections.
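
To make the current behaviour concrete, a minimal sketch of the
cleanup-on-pool-destruction approach described above (hypothetical names, not
the actual mod_ldap code): the connections are only unbound when the memory
pool they were registered against is destroyed, e.g. at graceful restart, and
nothing in between expires idle connections.

#include <apr_pools.h>
#include <ldap.h>

/* Hypothetical pooled-connection list, as in the earlier sketch. */
typedef struct ldap_conn {
    struct ldap_conn *next;
    LDAP *ld;
    int in_use;
} ldap_conn;

/* Cleanup callback: unbind every pooled connection when the pool dies. */
static apr_status_t ldap_pool_cleanup(void *data)
{
    ldap_conn *c = data;
    for (; c; c = c->next) {
        if (c->ld) {
            ldap_unbind_s(c->ld);   /* closes the underlying socket */
            c->ld = NULL;
        }
    }
    return APR_SUCCESS;
}

/* Registered once against the long-lived configuration pool, so the cleanup
 * only runs when that pool is destroyed - which is exactly the gap discussed
 * above: no timeout ever closes idle connections in the meantime. */
static void register_pool_cleanup(apr_pool_t *p, ldap_conn *pool_head)
{
    apr_pool_cleanup_register(p, pool_head, ldap_pool_cleanup,
                              apr_pool_cleanup_null);
}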
Comment 5 Graham Leggett 2004-05-21 16:34:46 UTC
One further thing: can you try the v2.1.0-dev CVS version of mod_ldap and
mod_auth_ldap and tell me if it makes a difference? (You can just drop the *ldap*
files into v2.0.49 and it should work, or at least it does for me.)

The newer code is a lot more robust at handling errors (such as "server has
timed out") - connections are gracefully closed and brought to a sane state
after an error condition instead of being abandoned as before.

This might solve your problem.
Comment 6 Nojan Moshiri 2004-05-21 16:50:32 UTC
I love bug cleanup days.  Thanks for the heads-up on the new version of
mod_ldap.  I will try it and update this case.  It may take several days to play
out an accurate test.  Thanks once again!
Comment 7 Graham Leggett 2004-05-21 23:58:18 UTC
All the LDAP fixes just hit v2.0.50-dev - if you could give this version a try
and tell me whether it fixes this problem, I can close this bug :)
Comment 8 Graham Leggett 2004-05-25 18:02:21 UTC
the "right" way to fix this is to use apr-util's reslist functions, which handle
things like resource timeouts for you.
Comment 9 Albert Lunde 2004-07-07 21:05:57 UTC
Having the Apache LDAP connection cache hang onto connections too long is a bad
thing for the LDAP server, even if the connections are left in a sane state. An
explicit, configurable timeout (or other resource-use controls) would be a good
thing.
Comment 10 Graham Leggett 2004-10-03 16:15:20 UTC
Can someone confirm whether this bug is still present in either HEAD or v2.0.52?
Comment 11 Todd Piket 2004-10-14 21:11:15 UTC
I have been testing this extensively on RedHat AS 3.0 (Taroon Update 1) with
kernel 2.6.7 and I still see it happening with Apache 2.0.52.  I can duplicate
it easily by setting the cache TTLs to 30 seconds and my OpenLDAP server's
idletimeout to 60 seconds.  Simply log in to the server, refresh the page once or
twice, then wait 60 seconds.  For some reason I have to do this 5 times for it
to "break".  If I do a netstat -anp on the web server machine I see 5 LDAP
sessions in CLOSE_WAIT state attached to httpd children.  I should also mention
that this only happens with LDAPS connections.  If I switch my URL to LDAP, the
CLOSE_WAIT session is either reaped or reused; it does go into CLOSE_WAIT, but
it gets "fixed".  I hope this information is helpful.  Please contact
todd@mtu.edu with further questions.

By the way, shutting off the cache causes the server's children to seg fault
continuously.

Todd Piket (todd@mtu.edu)
Analyst/Programmer
Distributed Computing Services
Michigan Technological University
Comment 12 Graham Leggett 2004-10-17 18:32:36 UTC
Could you run a test of both HEAD (httpd v2.1) and the most recent CVS of
APACHE_2_0_BRANCH and tell me if there is any difference - another whole lot of
patches just went in.

A fix went in to stop the segfault when the cache was switched off; you should
no longer see the segfaults in the latest versions above.
Comment 13 Todd Piket 2004-10-19 15:17:33 UTC
I've checked out 2.1-HEAD, I think.  That was the first time I've ever checked
Apache out of CVS, though not my first time with CVS by any means, so I think I
got it right.  If so, the problem appears to be fixed, in a manner of speaking,
for LDAPS.  The CLOSE_WAITs still appear, and still 5 of them before anything
happens, but the sixth connection causes one of the other 5 to get reaped.  I'm
not sure how, but I know it does because the client's port number changes.  This
is similar to the behavior for an unencrypted LDAP session, except plain LDAP
only gets one CLOSE_WAIT socket.  I've included netstat -anp output for LDAPS
below, as well as my mostly original .htaccess file for generating the errors.
The syntax changed quite a bit with the new AuthBasicProvider directive and the
"ldap-" prefix for the requires; it took me a while to figure that out, but the
gist of the .htaccess file is the same.

I haven't checked APACHE_2_0_BRANCH yet, but will try to by the end of my day
tomorrow because I'll be out of town for the rest of the week.  Please let me
know if there is anything more I can do.  I may try to check them both on
Solaris as well, but my main platform is RedHat AS3.

Oh, the seg fault issue appears to be resolved as well in httpd-2.1-HEAD.


My .htaccess file (names changed to protect the innocent):
----------------------------------------------------------
AuthType Basic
AuthBasicProvider ldap
AuthName "AuthLDAP WSS"
AuthLDAPURL ldaps://ldapmaster.mtu.edu:636/dc=mtu,dc=edu?uid?sub?(objectclass=posixaccount)
AuthLDAPBindDN "uid=somedude,ou=somewhere,dc=mtu,dc=edu"
AuthLDAPBindPassword secret
AuthLDAPGroupAttribute uniquemember
require ldap-group cn=My Group,ou=Groups,dc=mtu,dc=edu
require ldap-user user1
require ldap-user user2


The netstat -anp output
(watch the Local Address port numbers)
-----------------------
Some CLOSE_WAIT, Some ESTABLISHED
=================================
tcp        0      0 192.168.104.4:35889     141.219.70.115:636      ESTABLISHED 13228/httpd
tcp        0      0 192.168.104.4:35888     141.219.70.115:636      ESTABLISHED 13230/httpd
tcp        1      0 192.168.104.4:35876     141.219.70.115:636      CLOSE_WAIT  13227/httpd
tcp        0      0 192.168.104.4:35885     141.219.70.115:636      ESTABLISHED 13235/httpd
tcp        1      0 192.168.104.4:35884     141.219.70.115:636      CLOSE_WAIT  13229/httpd

ALL CLOSE_WAIT
==============
tcp        1      0 192.168.104.4:35889     141.219.70.115:636      CLOSE_WAIT  13228/httpd
tcp        1      0 192.168.104.4:35888     141.219.70.115:636      CLOSE_WAIT  13230/httpd
tcp        1      0 192.168.104.4:35876     141.219.70.115:636      CLOSE_WAIT  13227/httpd
tcp        1      0 192.168.104.4:35885     141.219.70.115:636      CLOSE_WAIT  13235/httpd
tcp        1      0 192.168.104.4:35884     141.219.70.115:636      CLOSE_WAIT  13229/httpd


New/Reaped Conn after all CLOSE_WAIT
====================================
tcp        1      0 192.168.104.4:35889     141.219.70.115:636      CLOSE_WAIT  13228/httpd
tcp        1      0 192.168.104.4:35888     141.219.70.115:636      CLOSE_WAIT  13230/httpd
*tcp        0      0 192.168.104.4:35891     141.219.70.115:636      ESTABLISHED 13229/httpd
tcp        1      0 192.168.104.4:35876     141.219.70.115:636      CLOSE_WAIT  13227/httpd
tcp        1      0 192.168.104.4:35885     141.219.70.115:636      CLOSE_WAIT  13235/httpd

* This one reaped 35884.


Hope that helps,

Todd Piket (todd@mtu.edu)
Analyst/Programmer
Distributed Computing Services
Michigan Technological University
Comment 14 Graham Leggett 2004-10-19 16:34:42 UTC
Ok this is starting to make sense. In the original LDAP code, the module didn't
clean up after itself properly when an LDAP error occurred (most common error
being a timeout). Instead of trying to shut down broken LDAP connections still
in the pool, it would ignore them, resulting in the increasing list of CLOSE_WAITs.

Now, if the module tries to reuse a connection in the pool, but that connection
has timed out, it cleanly closes the old connection and opens a new one - this
is probably why you're seeing the CLOSE_WAITs from each connection in the pool
until they are reaped. This isn't ideal; however, it is no longer unbounded as
before.

The correct solution to all of this is to implement the LDAP connection pool
using apr_reslist_*, which handles things like timing out connections and
cleanly removing them from the pool for us. The catch is that apr_reslist_* is
apparently only available in APR v1.0, which limits this to being properly
implemented in httpd v2.1/v2.2 (which uses APR v1.0), and not httpd v2.0 (which
uses APR v0.9).

I'm keen to see if the CLOSE_WAITs remain within the limits of the size of the
pool in the httpd v2.0 branch, which hasn't been overhauled to the same extent
as the HEAD branch has. If so, then the bug can probably be marked as fixed
(mostly) until apr_reslist_* is properly implemented.
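
As a rough illustration of what an apr_reslist-based pool might look like (a
sketch assuming APR 1.x, with hypothetical parameter struct, function names,
and pool sizes - not a patch):

#include <apr_pools.h>
#include <apr_reslist.h>
#include <apr_time.h>
#include <ldap.h>

/* Hypothetical parameters handed to the constructor/destructor. */
typedef struct {
    const char *host;
    int port;
} ldap_pool_params;

/* Constructor: called by the reslist when it needs a new connection. */
static apr_status_t ldap_conn_construct(void **resource, void *params,
                                        apr_pool_t *pool)
{
    ldap_pool_params *lp = params;
    LDAP *ld = ldap_init(lp->host, lp->port);
    if (!ld)
        return APR_EGENERAL;
    *resource = ld;
    return APR_SUCCESS;
}

/* Destructor: called when the reslist expires a connection or is destroyed. */
static apr_status_t ldap_conn_destruct(void *resource, void *params,
                                       apr_pool_t *pool)
{
    ldap_unbind_s((LDAP *)resource);   /* closes the socket properly */
    return APR_SUCCESS;
}

/* Create a pool of 1..10 connections; idle connections can be expired by the
 * reslist after the ttl instead of lingering in CLOSE_WAIT. */
static apr_status_t make_ldap_reslist(apr_reslist_t **rl,
                                      ldap_pool_params *lp, apr_pool_t *p)
{
    return apr_reslist_create(rl,
                              1,                       /* min kept alive */
                              5,                       /* soft maximum   */
                              10,                      /* hard maximum   */
                              apr_time_from_sec(30),   /* idle ttl       */
                              ldap_conn_construct,
                              ldap_conn_destruct,
                              lp, p);
}

A caller would then wrap each request in apr_reslist_acquire()/
apr_reslist_release(), and the reslist takes care of expiring idle connections
through the destructor.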
Comment 15 Todd Piket 2004-10-20 15:27:39 UTC
FYI, on my RedHat AS3 (kernel 2.6.7) test machine the latest CVS checkout,
httpd-2.0.53-dev, is still "broken".  The CLOSE_WAITs pile up until there
are no more fds.  Sorry to bring the bad news, but that's what I'm seeing.
Hopefully the 2.1/2.2 build will be released soon.
Comment 16 Graham Leggett 2005-01-21 22:42:43 UTC
Is it possible to test this on httpd v2.1 and see if this problem is still there?

Development of httpd v2.0's LDAP code has been abandoned, as APR v0.9 needed an
overhaul that was better done in APR v1.1 and httpd v2.1.
Comment 17 Graham Leggett 2005-01-21 23:22:57 UTC
It looks like this patch might go some way towards solving the problem.

The authn and authz phases have been separated entirely from each other in httpd
v2.1, so using the user's bind details to search for the group is a non-trivial
exercise. But at least the patch allows authn to work by binding as the user directly.


*** This bug has been marked as a duplicate of 31352 ***