Over time (a few days, with an average of 350k hits, of which 25k are authenticated with auth_ldap), it stops authenticating random users, with the error:

[Wed Mar 17 08:40:51 2004] [warn] [client 147.178.68.203] [26904] auth_ldap authenticate: user {username} authentication failed; URI {path} [User not found][No such object]

It does this in the middle of a working session (i.e. the user was logged in and clicking around, and suddenly, pop, no access). The 'fix' is to restart the web server. I presume this is a caching issue. We are running 2.0.49rc1.
This is linked against OpenLDAP stable 2.1.25. Is there a way to turn off the cache, to see whether that is what is causing the problem? I have written a Perl script that watches the log for this error and immediately cross-checks LDAP for the user; if the user exists, it restarts Apache. We are seeing about 30 restarts a day.
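The watchdog script itself is not shown, and it is written in Perl. As a rough illustration only, its detection step might look like the C sketch here. The function name extract_not_found_user is hypothetical, the sketch assumes the exact warning format quoted above, and the LDAP cross-check and the Apache restart are left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical watchdog helper (illustrative, not the reporter's script).
   If `line` is an auth_ldap "User not found" warning in the format quoted
   above, copy the username into `user` and return 1; otherwise return 0. */
static int extract_not_found_user(const char *line, char *user, size_t user_len) {
    static const char prefix[] = "auth_ldap authenticate: user ";
    const char *start, *end;
    size_t n;

    assert(line != NULL && user != NULL && user_len > 0);
    if (strstr(line, "[User not found]") == NULL)
        return 0;                          /* some other kind of failure */
    start = strstr(line, prefix);
    if (start == NULL)
        return 0;
    start += sizeof prefix - 1;            /* skip the prefix text */
    end = strstr(start, " authentication failed");
    if (end == NULL)
        return 0;
    n = (size_t)(end - start);
    if (n >= user_len)
        n = user_len - 1;                  /* truncate overly long names */
    memcpy(user, start, n);
    user[n] = '\0';
    return 1;
}
```

A real watchdog would feed each new log line to a function like this and only then query LDAP for the extracted name.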
Created attachment 11078: fix for ldap rebinding failures
The problem is in the poor way the LDAP session is managed. This could cause other severe problems if individual users cannot browse the tree, and it should be reconsidered. Kurt Olsen found this problem and came up with a quick fix (see patch). Note: this also relates to bug 17274.

Kurt's description:
--------------
In util_ldap.c, in the function util_ldap_cache_checkuserid, when a user tries to authenticate, the module takes these steps:

1) Check the cache, returning success or failure if the result is cached.
2) Open a connection via the function util_ldap_connection_open, using the ldc struct. If ldc->bound is 1, util_ldap_connection_open does nothing.
3) Do a search to validate, and locate the DN for, the username provided.
4) Verify that the search in step 3 returned exactly one result.
5) Verify that the password is non-empty.
6) Rebind with the DN found in step 3 and the password provided, using the ldc struct. On failure, return failure status; on success, update the cache and return success status.

The problem is that the ldc used in step 6 is the same ldc used to look up a user's DN in the tree. So if the password is incorrect, the ldap_simple_bind_s used to verify the password will have screwed up the ldc->ldap binding. The next time this ldc struct is used, ldc->bound is still set to 1, but the actual valid bind has been hosed.

One simple fix is to add "ldc->bound = 0;" to the two failure tests after the ldap_simple_bind_s. This causes util_ldap_connection_open to rebind with the proper DN before looking up users.

Even when users log in correctly, there is still a problem: when user A authenticates, the ldc->ldap connection is now bound with his username and password. If user A doesn't have rights to search the tree, then when user B comes along later, the search for user B's DN in the tree will fail.
The correct fix would be to create a separate connection (util_ldap_connection_t *foo;) that would be used for testing provided passwords, but would have no impact on the ldc struct used for searching and other operations.
Kurt Olsen
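Kurt's steps 2 to 6 can be modelled with a toy, in-memory stand-in for the LDAP server. This is a minimal sketch under stated assumptions, not the mod_auth_ldap code: mock_conn, mock_bind, mock_open, mock_search, check_user, SERVICE_DN and the single shared password "secret" are all hypothetical, and the cache lookup (step 1) is omitted. Passing the same connection twice reproduces the shared-ldc behaviour; passing two different connections follows the separate-connection fix.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for util_ldap_connection_t. */
typedef struct {
    int bound;          /* what the module believes (ldc->bound) */
    char identity[32];  /* what the server has actually bound; "" = nothing */
} mock_conn;

enum { AUTH_OK, AUTH_BAD_PASSWORD, AUTH_USER_NOT_FOUND };

/* Toy directory: only this DN may search; every password is "secret". */
static const char *SERVICE_DN = "cn=apache";

/* Models ldap_simple_bind_s(): a failed bind leaves the connection unbound. */
static int mock_bind(mock_conn *c, const char *dn, const char *pw) {
    assert(c != NULL && dn != NULL && pw != NULL);
    if (strcmp(pw, "secret") == 0) {
        snprintf(c->identity, sizeof c->identity, "%s", dn);
        return 1;
    }
    c->identity[0] = '\0';
    return 0;
}

/* Models util_ldap_connection_open(): does nothing when bound == 1. */
static void mock_open(mock_conn *c) {
    if (!c->bound)
        c->bound = mock_bind(c, SERVICE_DN, "secret");
}

/* Models the search: only the service DN has rights to search the tree. */
static int mock_search(const mock_conn *c) {
    return strcmp(c->identity, SERVICE_DN) == 0;
}

/* Steps 2-6 (cache lookup omitted). Passing the same pointer for both
   connections models the original shared-ldc code. */
static int check_user(mock_conn *search_c, mock_conn *bind_c,
                      const char *user, const char *pw) {
    char dn[32];
    mock_open(search_c);                        /* step 2 */
    if (!mock_search(search_c))                 /* steps 3-4 */
        return AUTH_USER_NOT_FOUND;
    if (pw[0] == '\0')                          /* step 5 */
        return AUTH_BAD_PASSWORD;
    snprintf(dn, sizeof dn, "uid=%s", user);
    return mock_bind(bind_c, dn, pw) ? AUTH_OK : AUTH_BAD_PASSWORD;  /* step 6 */
}
```

With one shared connection, a wrong password for alice leaves bound at 1 while nothing is actually bound, so bob's lookup then fails as "user not found". A successful login by alice has the same effect, because the connection stays bound as alice. With a separate bind connection, every lookup still runs as the service DN.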
Not fixed in the code yet; adding the PatchAvailable keyword.
Additional bugs report this issue, and some of them also have fixes: 17274, 17599, 18661, 21787, 24595, 24683 (probably; the commentary is old), 27134, and 27271. Bug 28413 may be the same thing, but that is not really clear, except that its reporters experience failures against Active Directory (AD). I think the comment that a connection should be marked as unbound after any user bind describes the proper solution; the patch included in this report only marks it unbound upon authentication failures. Adding an ldc->bound = 0; at line 847 in util_ldap.c (release 2.0.49) should fix both issues I have addressed in my re-explanation of the problem.
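The two reset policies discussed above (the patch's reset on failure only, and the proposed reset after any user bind) can be compared with an even smaller model. This is a hypothetical sketch: mock_conn, mock_open, user_bind, next_search_as_service and SERVICE_DN are illustrative names, and a boolean stands in for the real password check.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified model of a pooled LDAP connection. */
typedef struct {
    int bound;              /* ldc->bound: what the module believes */
    const char *identity;   /* who the server has bound; NULL = nobody */
} mock_conn;

static const char *SERVICE_DN = "cn=apache";   /* assumed search account */

/* Models util_ldap_connection_open(): binds as the service DN only
   when bound == 0. */
static void mock_open(mock_conn *c) {
    if (!c->bound) {
        c->identity = SERVICE_DN;
        c->bound = 1;
    }
}

/* Step 6 on the shared connection, followed by one of two policies:
   reset_always == 0: bound = 0 only after a failed bind (the patch);
   reset_always == 1: bound = 0 after every user bind (the proposal). */
static int user_bind(mock_conn *c, const char *user_dn, int password_ok,
                     int reset_always) {
    assert(c != NULL && user_dn != NULL);
    c->identity = password_ok ? user_dn : NULL;   /* failed bind: unbound */
    if (!password_ok || reset_always)
        c->bound = 0;
    return password_ok;
}

/* Returns 1 if the next search would run as the service DN. */
static int next_search_as_service(mock_conn *c) {
    mock_open(c);
    return c->identity != NULL && strcmp(c->identity, SERVICE_DN) == 0;
}
```

Under the failure-only policy, a successful login by user A leaves the connection bound as A, so the next search does not run as the service DN; resetting after every user bind forces a rebind first, at the cost of one extra bind per authentication.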
The attached patch has been committed to v2.1.0-dev, and is attached here as a patch against v2.0.49. Please test it and tell me whether it fixes the problem.
Created attachment 11618: Rollup of LDAP fixes to v2.1.0 against v2.0.49
The attachment includes bnicholes' fix: *) mod_ldap calls ldap_simple_bind_s() to validate the user credentials. If the bind fails, the connection is left in an unbound state. Make sure that the ldap connection record is updated to show that the connection is no longer bound.
*** Bug 25764 has been marked as a duplicate of this bug. ***
*** Bug 17599 has been marked as a duplicate of this bug. ***
*** Bug 21787 has been marked as a duplicate of this bug. ***
*** Bug 24595 has been marked as a duplicate of this bug. ***
*** Bug 27134 has been marked as a duplicate of this bug. ***
*** Bug 24683 has been marked as a duplicate of this bug. ***
*** Bug 18661 has been marked as a duplicate of this bug. ***
Fixed in v2.0.50-dev.
I repeated the test set-up I had been using under bug 27134, with roll-up patch 11618 from bug 27748. This was on Red Hat Linux 9.0, building Apache from patched 2.0.49 sources (not Red Hat sources).

The test uses two data sets with 11 valid username/password pairs and some pseudo-random failures. One data set walks through the usernames in nearly serial order, because this tends to show the worst-case usage of the connection pool; it makes 103 requests. The other data set uses a more random series of usernames; it makes 804 requests.

The results look good. I'm now getting no unexpected authentication results, and socket usage looks similar to Denis Gervalle's previous patch. I still get the warning "LDAP cache: Unable to init Shared Cache: no file", but I suppose that's a different issue.

I ran the tests first with the default settings of:

    StartServers         5
    MinSpareServers      5
    MaxSpareServers     10
    MaxClients         150
    MaxRequestsPerChild  0

For comparison, I set up a low process number test with:

    StartServers         1
    MinSpareServers      1
    MaxSpareServers      1
    MaxClients         150
    MaxRequestsPerChild  0

and a high process number test with:

    StartServers        10
    MinSpareServers     10
    MaxSpareServers     20
    MaxClients         150
    MaxRequestsPerChild  0

All the tests give correct results (authentication works or fails as expected). I looked at sockets in use with "netstat -an" on the LDAP server. Sockets to the LDAP server still in use at the end of each run:

    Prefork config    Serial data set   Random data set
    default            9                 4
    low process        1                 0
    high process      14                11

I'm guessing that if I could get rid of the "Unable to init Shared Cache" warning, I'd get results more like the "low process" config.
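Counting the leftover sockets in "netstat -an" output can be automated. This is a hypothetical sketch (count_established is an illustrative name) that assumes the Linux netstat column layout and that the LDAP server listens on the standard LDAP port 389; it counts ESTABLISHED TCP connections whose local address ends in that port.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count ESTABLISHED TCP connections to a local port
   in `netstat -an` output read from `in`. Assumes the Linux column layout
   "Proto Recv-Q Send-Q Local-Address Foreign-Address State". */
static int count_established(FILE *in, const char *port) {
    char line[512], proto[16], local[128], foreign[128], state[32];
    char suffix[16];
    size_t slen;
    int count = 0;

    assert(in != NULL && port != NULL);
    snprintf(suffix, sizeof suffix, ":%s", port);
    slen = strlen(suffix);
    while (fgets(line, sizeof line, in) != NULL) {
        size_t llen;
        /* %*s skips the Recv-Q and Send-Q columns */
        if (sscanf(line, "%15s %*s %*s %127s %127s %31s",
                   proto, local, foreign, state) != 4)
            continue;                         /* short or unrelated line */
        if (strncmp(proto, "tcp", 3) != 0 || strcmp(state, "ESTABLISHED") != 0)
            continue;                         /* header, LISTEN, UDP, ... */
        llen = strlen(local);
        if (llen >= slen && strcmp(local + llen - slen, suffix) == 0)
            count++;                          /* local address ends in :port */
    }
    return count;
}
```

Run on the LDAP server, it could read piped netstat output via count_established(stdin, "389"). Filtering on the foreign address as well would restrict the count to the Apache host.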
Can anyone suggest another fix, or a related bug report, that applies to this issue?