Bug 48958

Summary: mod_ldap, ldap credential cache & graceful restart issue
Product: Apache httpd-2 Reporter: manuel.vacelet
Component: mod_ldapAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED LATER    
Severity: normal Keywords: MassUpdate
Priority: P2    
Version: 2.2.3   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Hack to allow disabling shm with 'LdapSharedCacheSize 0' but still use caching

Description manuel.vacelet 2010-03-22 13:35:20 UTC
Hello,

I first tried to get support from httpd-user list but I got no answer so I think bugzilla is the right place to report the following behaviour.

Environment:
- RHEL 5.2
- Apache 2.2.3
- Major module used: ssl, ldap, authz_ldap, php, svn_dav + some others
- SVN repositories available through http & https, users identified with authz_ldap.

End user launches large svn checkout (~700MB, 60'000 files), most of the time it works great (~5mn to go) but sometimes it takes hours (up to 15h).
After strace+/proc digging, it appears that very long svn checkout happens because apache requests the LDAP server for each file it has to server (no usage of mod_ldap cache).
If I run another svn checkout in parallel of the "blocking" one, it works great (strace shows there is no interaction with LDAP server).
To sum up, at server side, I can face:
- 2 httpd processes, one that uses mod_ldap cache and one that doesn't.

Last part of the story: "graceful restart".
I have a process that launch on regular bases a graceful restart in order to let apache known the last created svn repositories.
If I stop this process (no longer graceful restart at all), the initial issue (very long checkout) vanish.

I tried the workaround described in https://issues.apache.org/bugzilla/show_bug.cgi?id=46749 aka set LDAPSharedCacheSize to 500000 but it doesn't solve the problem.

Any idea?
Comment 1 Eric Covener 2010-03-22 14:48:28 UTC
Is it checking passwords or authorization info? Are you sure it's doing it 60k times and not just hanging on one?
Comment 2 manuel.vacelet 2010-03-22 15:09:04 UTC
(In reply to comment #1)
> Is it checking passwords or authorization info? 

I don't know how to distinguish the 2 cases.
What I see from strace is a ldap request like:
write(3, "...eduid=ed1234,dc=example,dc=com...") 
where ed1234 is the id of the user that attempt to checkout so I guess it's a bind.
How could i

> Are you sure it's doing it 60k
> times and not just hanging on one?

I'm pretty sure it's for each file as I can see files, and each time, the trace looks like:
stat64("/usr/share/codendi/src/www/svnroot/testmv1/!svn/ver/1310/branches/gcc/libstdc++-v3/testsuite/performance/25_algorithms/find_istreambuf_iterators.cc", 0xbfe95c8c) = -1 ENOENT (No such file or directory)
lstat64("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64("/usr/share", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
lstat64("/usr/share/codendi", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64("/usr/share/codendi/src", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64("/usr/share/codendi/src/www", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64("/usr/share/codendi/src/www/svnroot", 0xbfe95c8c) = -1 ENOENT (No such file or directory)
semop(866156544, 0x17570c, 1)           = -1 EIDRM (Identifier removed)
semop(866156544, 0x175712, 1)           = -1 EIDRM (Identifier removed)
futex(0x9a33594, FUTEX_WAKE, 1)         = 0
semop(866156544, 0x17570c, 1)           = -1 EIDRM (Identifier removed)
semop(866156544, 0x175712, 1)           = -1 EIDRM (Identifier removed)
futex(0x9a33594, FUTEX_WAKE, 1)         = 0
time(NULL)                              = 1269270405
write(3, "0\f\2\1\"`\7\2\1\3\4\0\200\0", 14) = 14
poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP, revents=POLLIN}], 1, -1) = 1
read(3, "0\f\2\1\"a\7\n", 8)            = 8
read(3, "\1\0\4\0\4\0", 6)              = 6
time(NULL)                              = 1269270405
time(NULL)                              = 1269270405
write(3, "0C\2\1#c>\4\fdc=example,dc=com\n\1\2\n\1\3\2\1\0\2\1"..., 69) = 69
poll( <unfinished ...>
Process 31960 detached
Comment 3 Eric Covener 2010-03-22 15:12:29 UTC
loglevel debug or look at the LDAP traffic in wireshark (and/or try a recreate with no LDAP authz and see if it still can be hit)
Comment 4 manuel.vacelet 2010-03-23 11:51:48 UTC
According to wireshark, it's LDAP bind requests:
  6.818387 apache -> ldap TCP 46828 > ldap [ACK] Seq=3701 Ack=17989 Win=500 Len=0 TSV=2676945403 TSER=2947384344
  6.818778 apache -> ldap LDAP bindRequest(8571) "eduid=ed1234,ou=people,dc=example,dc=com" simple
  6.825009 ldap -> apache LDAP bindResponse(8571) success
  6.834444 apache -> ldap LDAP bindRequest(8572) "<ROOT>" simple
  6.835728 ldap -> apache LDAP bindResponse(8572) success
  6.835859 apache -> ldap LDAP searchRequest(8573) "dc=example,dc=com" wholeSubtree
  6.876986 ldap -> apache TCP ldap > 46828 [ACK] Seq=18019 Ack=3849 Win=49232 Len=0 TSV=2947384350 TSER=2676945420
  7.061669 ldap -> apache LDAP searchResEntry(8573) "eduid=ed1234,ou=people,dc=example,dc=com"
  7.061714 ldap -> apache LDAP searchResDone(8573) success
  7.062165 apache -> ldap TCP 46828 > ldap [ACK] Seq=3849 Ack=18682 Win=500 Len=0 TSV=2676945647 TSER=2947384368
  7.062300 apache -> ldap LDAP bindRequest(8574) "eduid=ed1234,ou=people,dc=example,dc=com" simple
  7.068504 ldap -> apache LDAP bindResponse(8574) success
  7.080949 apache -> ldap LDAP bindRequest(8575) "<ROOT>" simple
  7.084970 ldap -> apache LDAP bindResponse(8575) success
  7.085086 apache -> ldap LDAP searchRequest(8576) "dc=example,dc=com" wholeSubtree
  7.107050 ldap -> apache TCP ldap > 46828 [ACK] Seq=18712 Ack=3997 Win=49232 Len=0 TSV=2947384373 TSER=2676945670
  7.312568 ldap -> apache LDAP searchResEntry(8576) "eduid=ed1234,ou=people,dc=example,dc=com"


apache loglevel=debug

[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(849): [10864] auth_ldap url parse: `ldap://ldap1.example.com ldap2.example.com/dc=example,dc=com'
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(858): [10864] auth_ldap url parse: Host: ldap1.example.com ldap2.example.com
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(860): [10864] auth_ldap url parse: Port: 389
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(862): [10864] auth_ldap url parse: DN: dc=example,dc=com
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(864): [10864] auth_ldap url parse: attrib: (null)
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(866): [10864] auth_ldap url parse: scope: base
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(871): [10864] auth_ldap url parse: filter: (null)
[Tue Mar 23 12:47:13 2010] [debug] mod_authnz_ldap.c(951): LDAP: auth_ldap not using SSL connections
Comment 5 Eric Covener 2010-03-23 12:29:57 UTC
I guess those are DN lookups, which should be cached. But it's unclear if that is what repeats 60k times.  What does your "ldap-status" page look like while these are chugging away both w/ and w/o the graceful?
Comment 6 manuel.vacelet 2010-03-23 12:57:12 UTC
(In reply to comment #5)
> I guess those are DN lookups, which should be cached. But it's unclear if that
> is what repeats 60k times. What does your "ldap-status" page look like while
> these are chugging away both w/ and w/o the graceful?

I'm not sure what you meant by "while these are chugging away both w/ and w/o the graceful".

ldap-status tells me there is no data in cache (0 entries) while the checkout is running.
If I run another checkout in parallel (normal speed), I can see my entry cached. In the same time, the "long checkout" continue without using cache at all (I see strace activity and if I stop the second checkout, the "Hits" no longer increase).

Right after, I run a graceful, the ldap-status shows a cleared cache and the long checkout continues, still slow.

I made another test:
1 Long checkout running since hours
2 check ldap status: empty cache
3 Start another checkout
4 check ldap status: cache with 1 entry, hits are growings quickly)
5 graceful
6 check ldap status: cache still in use for checkout started at #3, the hit number is bigger than at step #4
7 stop svn co started at #3
8 check ldap status: hits no longer increase (with the first checkout continue to run)

It's maybe not graceful by itself because it's very difficult to me to reproduce the scenario that lead to this behavior. I tried to run a series of 10 graceful one after another with a "full speed checkout" and I didn't manage to reproduce this issue.
But, if I let the cron job that does (among other tasks) the regular graceful, I reproduce the issue several time a day.
Comment 7 Stefan Fritsch 2010-03-28 22:18:20 UTC
My guess is that the parent process destroys the shm segment of the ldap cache while the child is still busy processing the checkout. This makes it impossible for the child to use the ldap cache, as the old cache is already destroyed but the child does not have access to the new cache either. In this case, mod_ldap will silently continue without using any cache. With 2.2.15, you would probably get some warning about the cache size being too small. (That warning is wrong in this case.)

The svn checkout seems to be a single request with lots of subrequests. In this case httpd will not close the connection as it does for a keep-alive connection with lots of requests.

I don't currently have any idea how to solve this problem. Is it possible to somehow defer destroying the old shm segment until all childs have terminated?
Comment 8 manuel.vacelet 2010-03-29 07:48:04 UTC
I'm still running my monitoring jobs and the more I test the more graceful seems involved.
I made a long (4 days) test with the regular graceful restart off and I get no long checkout (in my previous test I completely stopped crond, for this one I only commented the regular graceful so other cron scripts where running).

But one thing disturbs me: I cannot reproduce the bad behaviour on demand.
Let says there is a checkout running fine (with cache & all) and I run several graceful one after another the checkout continues just fine (I tried with ten "service httpd graceful && service httpd graceful && ...").
However, if I just let things go on with regular graceful, I reproduce the issue several times a day.

I will try to build latest 2.2.15 to see if I get the error message you mention (but I will first have to find the same build options as RH official pkg).
Comment 9 manuel.vacelet 2010-03-29 13:41:18 UTC
I built 2.2.15 and the same problem happens, it's even more easier to reproduce as just running "apachectl -k graceful" breaks the cache very easily.
I have no specific message in error_log (maybe it depends of loglevel?) but ldap cache is empty and httpd servers my svn client with a "G" (gracefully finished) process according to server-status.
Comment 10 Stefan Fritsch 2010-03-29 19:19:34 UTC
(In reply to comment #9)
> I built 2.2.15 and the same problem happens, it's even more easier to reproduce
> as just running "apachectl -k graceful" breaks the cache very easily.
> I have no specific message in error_log (maybe it depends of loglevel?) but
> ldap cache is empty and httpd servers my svn client with a "G" (gracefully
> finished) process according to server-status.

The messages are at loglevel warning. The may appear in the main server errorlog instead of the per-vhost errorlog (if you use several error logs). If they are not there, my guess about the source of the problem may be wrong.

A possible workaround you could try is disabling SHM caching by setting LDAPSharedCacheSize to 0. This will increase mem usage and lower the cache hit rate, but it may mitigate your problem.
Comment 11 manuel.vacelet 2010-03-30 07:31:44 UTC
You are right, I get the error message:
[Mon Mar 29 15:43:57 2010] [warn] LDAPSharedCacheSize is too small. Increase it or reduce LDAPCacheEntries/LDAPOpCacheEntries!
[Mon Mar 29 15:43:57 2010] [error] Could not allocate memory for LDAP cache entry

I will give a try to your workaround.
If I set the shared cache to 0, may I reduce the cache size?
Comment 12 manuel.vacelet 2010-03-30 07:34:03 UTC
Forget about my last question... too early in the morning.
Comment 13 manuel.vacelet 2010-03-31 07:50:22 UTC
Back with latest test results. I finaly get a consistant result: checkout now takes always hours to complete (no cache usage at all).
I used following configuration:
LDAPSharedCacheSize 0
LDAPCacheEntries 1024
LDAPCacheTTL 600
LDAPOpCacheEntries 1024
LDAPOpCacheTTL 600

So settings shared cache size to 0 fully disable the LDAP caches, not only the shared one.
Comment 14 Stefan Fritsch 2010-03-31 21:24:41 UTC
Created attachment 25221 [details]
Hack to allow disabling shm with 'LdapSharedCacheSize 0' but still use caching

It seems some parts of the code interpret 'LdapSharedCacheSize 0' to disable caching altogether, while most parts just interpret it to disable shared memory. The attached patch makes caching without shm work for me. I have briefly tested it with trunk but not with 2.2.x.

Can you test if it mitigates the slow checkout problem? I wouldn't apply it to a production site without prior testing, though.
Comment 15 manuel.vacelet 2010-04-01 12:51:51 UTC
You patch applies on 2.2.15 with a bit of fuzzing but it does work.

I run a patched 2.2.15 with "LdapSharedCacheSize 0" since 5 hours (with a graceful every 10 minutes) and it seems to do the job (LDAP is cached).

Most of the checkout takes 10-12 minutes to complete so httpd process serves client with G state most of the time.

I keep it running.
Comment 16 manuel.vacelet 2010-04-02 12:52:39 UTC
After one day I'm confident to say that the patch does the job.
Comment 17 manuel.vacelet 2010-04-22 03:37:40 UTC
Hi,

Do you think I can safely apply this patch against 2.2.3 (RHEL/Centos5) version ?
Comment 18 Stefan Fritsch 2011-04-25 16:04:33 UTC
Committed the LDAPSharedCacheSize 0 workaround to trunk as r1096577
Comment 19 William A. Rowe Jr. 2018-11-07 21:09:18 UTC
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.