We are running Apache 2.0.47 with the prefork MPM on Solaris 2.6 and Solaris 2.8. We are using the LDAP authenticator, which is still experimental, built against the Netscape C libraries (without SSL). Authentication to LDAP via .htaccess files works flawlessly. Each Apache child holds a connection to LDAP on port 389 open for the specified length of time. However, when that time expires or the port 389 connection is closed, the Apache child does not reap the connection and it remains stuck in CLOSE_WAIT. Eventually Apache reaches the file descriptor limit (set to 64) and stops accepting new connections. We don't want to raise the FD limit; we would like to get to the root cause. Any ideas? We do not see this on Solaris 2.8!

APPENDIX (error log, truss, lsof output from parent and children)
========

ERROR LOG INDICATES THE SYMPTOM:

[Tue Sep 30 18:40:48 2003] [error] [client 10.0.0.2] (24)Too many open files

TRUSS of parent seems normal:

15598: poll(0xEFFFDA90, 0, 1000) = 0
15598: waitid(7, 0, 0xEFFFF9A0, 0107) = 0
15598: siginfo: SIG#0
15598: poll(0xEFFFDA90, 0, 1000) = 0
15598: waitid(7, 0, 0xEFFFF9A0, 0107) = 0
15598: siginfo: SIG#0
15598: poll(0xEFFFDA90, 0, 1000) = 0
15598: waitid(7, 0, 0xEFFFF9A0, 0107) = 0
15598: siginfo: SIG#0
(SO ON AND SO FORTH)

LSOF OF A CHILD SHOWS THE STUCK CLOSE_WAITS:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
apache2047 5435 www 9u inet 0x90c4c9d8 0t37 TCP solars2.6:80->eam-sj2.cisco.com:37648 (CLOSE_WAIT)
apache2047 5435 www 10u inet 0x9278db10 0t427 TCP solars2.6:50616->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 11u inet 0xad5a3958 0t386 TCP solars2.6:50627->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 12u inet 0x83b41ec8 0t1612 TCP solars2.6:50630->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 13u inet 0x7e1cf160 0t744 TCP solars2.6:50666->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 14u inet 0x7231f7c8 0t3507 TCP solars2.6:50670->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 15u inet 0xad5a2f58 0t361 TCP solars2.6:51412->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 16u inet 0x76985e38 0t12054 TCP solars2.6:35189->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 17u inet 0x9278aa88 0t333 TCP solars2.6:37367->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 18u inet 0x7d6771c0 0t5455 TCP solars2.6:40510->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 19u inet 0x81a710f8 0t3548 TCP solars2.6:41520->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 20u inet 0x7e1cfc60 0t6510 TCP solars2.6:41536->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 21u inet 0x90c74a78 0t732 TCP solars2.6:43119->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 22u inet 0x71cee450 0t6255 TCP solars2.6:52432->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 23u inet 0x759de7d0 0t366 TCP solars2.6:53967->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 24u inet 0x839526b8 0t3942 TCP solars2.6:56627->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 25u inet 0x83b435d0 0t366 TCP solars2.6:57230->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 26u inet 0x9c7aaeb8 0t422 TCP solars2.6:57249->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 27u inet 0x72b4bd58 0t366 TCP solars2.6:57250->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 28u inet 0x83b40548 0t821 TCP solars2.6:57267->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 29u inet 0x9278cd90 0t366 TCP solars2.6:57348->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 30u inet 0x83861140 0t8091 TCP solars2.6:57513->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 31u inet 0x9c79aeb0 0t380 TCP solars2.6:58749->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 32u inet 0x83950730 0t4662 TCP solars2.6:58776->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 33u inet 0x81a70ff8 0t380 TCP solars2.6:59546->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 34u inet 0x7072d540 0t810 TCP solars2.6:59555->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 35u inet 0x9278a388 0t366 TCP solars2.6:59559->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 36u inet 0x83860440 0t403 TCP solars2.6:59576->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 37u inet 0xad6ab4e8 0t380 TCP solars2.6:59640->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 38u inet 0xad6aaf68 0t3142 TCP solars2.6:59643->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 39u inet 0x7dca54f8 0t1084 TCP solars2.6:60061->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 40u inet 0x96072850 0t817 TCP solars2.6:60070->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 41u inet 0x7e1cf560 0t380 TCP solars2.6:60141->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 42u inet 0x83b42fd0 0t1589 TCP solars2.6:60159->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 43u inet 0x7dca8a88 0t380 TCP solars2.6:60375->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 44u inet 0x9c799310 0t3202 TCP solars2.6:60385->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 45u inet 0xad67ba60 0t732 TCP solars2.6:61203->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 46u inet 0x83950f30 0t2381 TCP solars2.6:61243->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 47u inet 0xad6b2af0 0t746 TCP solars2.6:61359->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 48u inet 0x7d6775c0 0t4131 TCP solars2.6:61380->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 49u inet 0x90c4d9d8 0t386 TCP solars2.6:61741->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 50u inet 0x90c74278 0t14254 TCP solars2.6:61756->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 51u inet 0x9278c410 0t2164 TCP solars2.6:63803->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 52u inet 0x7072c0c0 0t420 TCP solars2.6:63816->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 53u inet 0x9c797188 0t1096 TCP solars2.6:63822->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 54u inet 0x769844b8 0t420 TCP solars2.6:63843->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 55u inet 0x7e1d5de8 0t384 TCP solars2.6:63865->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 56u inet 0x7e1d54e8 0t3558 TCP solars2.6:64149->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 57u inet 0xad6b2bf0 0t646 TCP solars2.6:64641->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 58u inet 0x83950230 0t4721 TCP solars2.6:65486->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 59u inet 0x7231e748 0t366 TCP solars2.6:32822->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 60u inet 0x90c75578 0t427 TCP solars2.6:35352->netscape-ldap:389 (CLOSE_WAIT)
apache2047 5435 www 61r VREG 138,99002 0 16128 /fs/apache2047 (/dev/vx/dsk/ccodg/apache2047vol)
apache2047 5435 www 62u FIFO 0x76dff700 0t0 5476979 PIPE->0x76dff784
See comments/patches on bug 27134. We're seeing a similar problem under Red Hat 9.0, though I'm mainly testing versions patched for a different problem.
Are you using a shared LDAP connection pool? The mod_ldap connection pool keeps its connections open until the process shuts down, at which point it gracefully closes them. This should happen when the process reaches its maximum number of allowed requests and shuts down to be replaced by a new one. If you're using a shared pool, though, the connections are only shut down when httpd receives a graceful restart, which I can see could be a problem if httpd is long-lived.
As I understand it, the shared connection pool is on by default. You bring up a good point, though. Maybe it's a matter of adjusting this shared pool inside the source code; I don't see any way to do that from a config line. The documentation simply says it's on by default. It's still curious that we only see this under Solaris 2.6 (and perhaps another person sees it under Red Hat), although I can see how that could simply mask the problem (we see it earlier on 2.6, and our 2.8 environments don't get similar traffic patterns?). Anyway... THANKS! I will look into the code from this angle; sometimes it just takes another pair of eyes.

From the mod_ldap documentation:

LDAP Connection Pool
LDAP connections are pooled from request to request. This allows the LDAP server to remain connected and bound, ready for the next request, without the need to unbind/connect/rebind. The performance advantages are similar to the effect of HTTP keepalives. On a busy server it is possible that many requests will try to access the same LDAP server connection simultaneously. Where an LDAP connection is in use, Apache will create a new connection alongside the original one. This ensures that the connection pool does not become a bottleneck. There is no need to manually enable connection pooling in the Apache configuration. Any module using this module for access to LDAP services will share the connection pool.
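To make the "create a new connection alongside" behaviour described in that documentation concrete, here is a rough sketch in C. It is purely illustrative and not the actual mod_ldap code; the function and variable names are made up for the example.

#include "apr_pools.h"
#include "apr_thread_mutex.h"

#define LDAP_DEPRECATED 1
#include <ldap.h>

/* One pooled connection: the LDAP handle plus an "in use" flag. */
typedef struct ldap_conn {
    LDAP *ld;
    int in_use;
    struct ldap_conn *next;
} ldap_conn;

static ldap_conn *pool_head;            /* head of the shared connection list */
static apr_thread_mutex_t *pool_lock;   /* guards the list */

/* Call once at startup to set up the (hypothetical) pool. */
static apr_status_t init_connection_pool(apr_pool_t *p)
{
    pool_head = NULL;
    return apr_thread_mutex_create(&pool_lock, APR_THREAD_MUTEX_DEFAULT, p);
}

/* Reuse an idle pooled connection, or open a new one alongside the
 * busy ones if every existing connection is currently in use. */
static ldap_conn *acquire_connection(apr_pool_t *p, const char *host, int port)
{
    ldap_conn *c;

    apr_thread_mutex_lock(pool_lock);
    for (c = pool_head; c; c = c->next) {
        if (!c->in_use) {               /* idle, still-bound connection */
            c->in_use = 1;
            apr_thread_mutex_unlock(pool_lock);
            return c;
        }
    }

    /* All existing connections are busy: create a new one alongside them. */
    c = apr_pcalloc(p, sizeof(*c));
    c->ld = ldap_init(host, port);
    c->in_use = 1;
    c->next = pool_head;
    pool_head = c;
    apr_thread_mutex_unlock(pool_lock);
    return c;
}

Note that nothing in this sketch ever closes a connection: that is exactly the property being discussed in this bug.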
Hmmm... in that case the fix would be to include a timeout on the LDAP connections in the cache, so that those unused after a certain age are closed down and removed from the pool. Currently this is done via an APR cleanup call when the memory pool is cleaned up, but that only happens on a graceful restart. I will chase the APR people to see if there is an elegant way of timing out these connections.
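For context, a minimal sketch (with assumed helper names, not the actual mod_ldap source) of the kind of APR cleanup registration described above. Because the unbind is tied to pool destruction, an idle connection the LDAP server has already dropped is never closed until a graceful restart.

#include "apr_pools.h"

#define LDAP_DEPRECATED 1
#include <ldap.h>

/* The registered cleanup: unbinding is what finally closes the socket. */
static apr_status_t ldap_connection_cleanup(void *data)
{
    ldap_unbind_s((LDAP *)data);
    return APR_SUCCESS;
}

static LDAP *open_pooled_connection(apr_pool_t *pool, const char *host, int port)
{
    LDAP *ld = ldap_init(host, port);
    if (ld) {
        /* Runs only when 'pool' is destroyed (e.g. at a graceful restart).
         * There is no idle timeout, so a connection the server has already
         * dropped just sits in CLOSE_WAIT until then. */
        apr_pool_cleanup_register(pool, ld, ldap_connection_cleanup,
                                  apr_pool_cleanup_null);
    }
    return ld;
}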
One further thing: can you try the v2.1.0-dev CVS version of mod_ldap and mod_auth_ldap and tell me if it makes a difference? (You can just drop the *ldap* files into 2.0.49 and it should work, or at least it does for me.) The newer code is a lot more robust in handling errors (such as "server has timed out"): connections are gracefully closed and brought to a sane state after an error condition instead of being abandoned as before. This might solve your problem.
I love bug cleanup days. Thanks for the heads-up on the new version of mod_ldap. I will try it and update this case. It may take several days to run an accurate test. Thanks once again!
All the LDAP fixes just hit v2.0.50-dev - if you could give this version a try and tell me whether it fixes this problem, I can close this bug :)
The "right" way to fix this is to use apr-util's reslist functions, which handle things like resource timeouts for you.
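For illustration, here is a rough sketch of what a reslist-backed LDAP connection pool could look like. This is not the mod_ldap implementation; the pool sizes, TTL, and helper names are assumptions chosen only for the example.

#include "apr_pools.h"
#include "apr_time.h"
#include "apr_reslist.h"

#define LDAP_DEPRECATED 1
#include <ldap.h>

/* Hypothetical connection parameters handed to the constructor. */
typedef struct {
    const char *host;
    int port;
} ldap_params_t;

/* Constructor: the reslist calls this whenever it needs a new connection. */
static apr_status_t ldap_conn_construct(void **resource, void *params,
                                        apr_pool_t *pool)
{
    ldap_params_t *lp = params;
    LDAP *ld = ldap_init(lp->host, lp->port);

    if (!ld) {
        return APR_EGENERAL;
    }
    *resource = ld;
    return APR_SUCCESS;
}

/* Destructor: called when a connection has sat idle past the TTL
 * (or the pool shrinks), so stale sockets actually get closed. */
static apr_status_t ldap_conn_destruct(void *resource, void *params,
                                       apr_pool_t *pool)
{
    ldap_unbind_s(resource);
    return APR_SUCCESS;
}

static apr_reslist_t *ldap_pool;

/* Min 0 connections, soft max 5, hard max 25, 60-second idle TTL -
 * all numbers here are illustrative, not mod_ldap defaults. */
static apr_status_t create_ldap_pool(apr_pool_t *p, ldap_params_t *params)
{
    return apr_reslist_create(&ldap_pool, 0, 5, 25, apr_time_from_sec(60),
                              ldap_conn_construct, ldap_conn_destruct,
                              params, p);
}

/* Per-request use: borrow a connection, bind, then hand it back. */
static int check_credentials(const char *dn, const char *password)
{
    void *res;
    int rc;

    if (apr_reslist_acquire(ldap_pool, &res) != APR_SUCCESS) {
        return 0;
    }
    rc = ldap_simple_bind_s((LDAP *)res, dn, password);
    apr_reslist_release(ldap_pool, res);
    return rc == LDAP_SUCCESS;
}

The key point is that expiry and destruction of idle connections is handled by the reslist itself, so stale connections would no longer linger in CLOSE_WAIT.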
Having the Apache LDAP connection cache hang onto connections too long is a bad thing for the LDAP server, even if the connections are left in a sane state. An explicit, configurable timeout (or other resource-use controls) would be a good thing.
Can someone confirm whether this bug is still present in either HEAD or v2.0.52?
I have been testing this extensively on RedHat AS 3.0 (Taroon Update 1) with kernel 2.6.7 and I still see it happening using Apache 2.0.52. I can duplicate it easily by setting the cache TTLs to 30 seconds and my OpenLDAP server idletimeout to 60 seconds. Simply log in to the server, refresh the page once or twice, then wait 60 seconds. For some reason I have to do this 5 times for it to "break". If I do a netstat -anp on the web server machine I see 5 LDAP sessions in CLOSE_WAIT state attached to httpd children. I should also mention that this only happens with LDAPS connections. If I switch my URL to LDAP the CLOSE_WAIT session is either reaped or reused; it does go into CLOSE_WAIT, but it gets "fixed". I hope this information was helpful. Please contact todd@mtu.edu with further questions. By the way, shutting off the cache causes the server's children to segfault continuously.

Todd Piket (todd@mtu.edu)
Analyst/Programmer, Distributed Computing Services
Michigan Technological University
Could you run a test of both HEAD (httpd v2.1) and the most recent CVS of APACHE_2_0_BRANCH and tell me if there is any difference - another whole lot of patches just went in. A fix went in to stop the segfault when the cache was switched off; you should no longer see the segfaults in the latest versions above.
I've checked out 2.1-HEAD, I think. That was the first time I've ever checked Apache out of CVS (not the first time for CVS by any means), so I think I got it right. If I did, then the problem appears to be fixed, in a manner of speaking, for LDAPS. The CLOSE_WAITs still appear, and still five times before anything happens, but the sixth one causes one of the other five to get reaped. I'm not sure how, but I know it does because the port number of the connecting client changes. This is similar to the behavior for an unencrypted LDAP session, except plain LDAP only gets one CLOSE_WAIT socket. I've included a netstat -anp dialog for LDAPS below, as well as my mostly-original .htaccess file for generating the errors. The syntax changed quite a bit with the new AuthBasicProvider directive and the "ldap-" prefix for the requires; it took me a while to figure that out, but the gist of the .htaccess file is the same. I haven't checked APACHE_2_0_BRANCH yet, but will try to by the end of my day tomorrow because I'll be out of town for the rest of the week. Please let me know if there is anything more I can do. I may try to check them both on Solaris as well, but my main platform is RedHat AS3. Oh, the segfault issue appears to be resolved as well in httpd-2.1-HEAD.

My .htaccess file (names changed to protect the innocent):
----------------------------------------------------------
AuthType Basic
AuthBasicProvider ldap
AuthName "AuthLDAP WSS"
AuthLDAPURL ldaps://ldapmaster.mtu.edu:636/dc=mtu,dc=edu?uid?sub?(objectclass=posixaccount)
AuthLDAPBindDN "uid=somedude,ou=somewhere,dc=mtu,dc=edu"
AuthLDAPBindPassword secret
AuthLDAPGroupAttribute uniquemember
require ldap-group cn=My Group,ou=Groups,dc=mtu,dc=edu
require ldap-user user1
require ldap-user user2

The netstat -anp dialog (watch the Local Address port numbers):
---------------------------------------------------------------

Some CLOSE_WAIT, some ESTABLISHED
=================================
tcp 0 0 192.168.104.4:35889 141.219.70.115:636 ESTABLISHED 13228/httpd
tcp 0 0 192.168.104.4:35888 141.219.70.115:636 ESTABLISHED 13230/httpd
tcp 1 0 192.168.104.4:35876 141.219.70.115:636 CLOSE_WAIT 13227/httpd
tcp 0 0 192.168.104.4:35885 141.219.70.115:636 ESTABLISHED 13235/httpd
tcp 1 0 192.168.104.4:35884 141.219.70.115:636 CLOSE_WAIT 13229/httpd

ALL CLOSE_WAIT
==============
tcp 1 0 192.168.104.4:35889 141.219.70.115:636 CLOSE_WAIT 13228/httpd
tcp 1 0 192.168.104.4:35888 141.219.70.115:636 CLOSE_WAIT 13230/httpd
tcp 1 0 192.168.104.4:35876 141.219.70.115:636 CLOSE_WAIT 13227/httpd
tcp 1 0 192.168.104.4:35885 141.219.70.115:636 CLOSE_WAIT 13235/httpd
tcp 1 0 192.168.104.4:35884 141.219.70.115:636 CLOSE_WAIT 13229/httpd

New/reaped connection after all CLOSE_WAIT
==========================================
tcp 1 0 192.168.104.4:35889 141.219.70.115:636 CLOSE_WAIT 13228/httpd
tcp 1 0 192.168.104.4:35888 141.219.70.115:636 CLOSE_WAIT 13230/httpd
*tcp 0 0 192.168.104.4:35891 141.219.70.115:636 ESTABLISHED 13229/httpd
tcp 1 0 192.168.104.4:35876 141.219.70.115:636 CLOSE_WAIT 13227/httpd
tcp 1 0 192.168.104.4:35885 141.219.70.115:636 CLOSE_WAIT 13235/httpd
* This one reaped 35884.

Hope that helps,

Todd Piket (todd@mtu.edu)
Analyst/Programmer, Distributed Computing Services
Michigan Technological University
OK, this is starting to make sense. In the original LDAP code, the module didn't clean up after itself properly when an LDAP error occurred (the most common error being a timeout). Instead of trying to shut down broken LDAP connections still in the pool, it would ignore them, resulting in the ever-increasing list of CLOSE_WAITs. Now, if the module tries to reuse a connection in the pool but the connection has timed out, it cleanly closes the old connection and opens a new one - this is probably why you're seeing the CLOSE_WAITs from each connection in the pool until they are reaped. This isn't ideal; however, it is no longer unbounded like before. The correct solution to all of this is to implement the LDAP connection pool using apr_reslist_*, which handles things like timing out connections and cleanly removing them from the pool for us. The catch is that apr_reslist_* is apparently only available in APR v1.0, which limits this to being properly implemented in httpd v2.1/v2.2 (which uses APR v1.0), and not httpd v2.0 (which uses APR v0.9). I'm keen to see if the CLOSE_WAITs remain within the limits of the size of the pool in the httpd v2.0 branch, which hasn't been overhauled to the same extent as the HEAD branch has. If so, then the bug can probably be marked as fixed (mostly) until apr_reslist_* is properly implemented.
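To make the "close the broken connection and open a new one" step concrete, here is a hedged sketch of how that error handling could look on top of the reslist-based pool sketched earlier; the retry helper and error handling are illustrative assumptions, not the actual mod_ldap code.

#include "apr_reslist.h"

#define LDAP_DEPRECATED 1
#include <ldap.h>

/* Search using a pooled connection; if the connection turns out to be dead
 * (the server timed us out), throw it away with apr_reslist_invalidate(),
 * which runs the destructor and closes the socket immediately, then retry
 * once on a fresh connection instead of leaving a CLOSE_WAIT behind. */
static int search_with_retry(apr_reslist_t *pool, const char *base,
                             const char *filter, LDAPMessage **result)
{
    void *res;
    int rc;

    if (apr_reslist_acquire(pool, &res) != APR_SUCCESS) {
        return LDAP_UNAVAILABLE;
    }
    rc = ldap_search_s(res, base, LDAP_SCOPE_SUBTREE, filter, NULL, 0, result);

    if (rc == LDAP_SERVER_DOWN || rc == LDAP_TIMEOUT) {
        apr_reslist_invalidate(pool, res);      /* destroy the broken conn */
        if (apr_reslist_acquire(pool, &res) != APR_SUCCESS) {
            return LDAP_UNAVAILABLE;
        }
        rc = ldap_search_s(res, base, LDAP_SCOPE_SUBTREE, filter, NULL, 0,
                           result);
    }
    apr_reslist_release(pool, res);
    return rc;
}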
FYI, on my RedHat AS3 (kernel 2.6.7) test machine the latest CVS extract, httpd-2.0.53-dev, is still "broken". The CLOSE_WAITs pile up until there are no more fds. Sorry to bring the bad news, but that's what I'm seeing. Hopefully the 2.1/2.2 build will be released soon.
Is it possible to test this on httpd v2.1 and see if this problem is still there? Development of httpd v2.0's LDAP has been abandoned as APR v0.9 needed an overhaul that was better done in APR v1.1 and httpd v2.1.
It looks like this patch might go some way towards solving the problem. The authn and authz phases have been separated entirely from each other in httpd v2.1, so using the user's bind details to search for the group is a non-trivial exercise. But at least the patch allows authn to work by binding as the user directly.

*** This bug has been marked as a duplicate of 31352 ***