Bug 61313 - JNDIRealm LDAP server failover to alternateURL takes very long 15m32s
Summary: JNDIRealm LDAP server failover to alternateURL takes very long 15m32s
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 8.5.16
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-17 15:42 UTC by Peter
Modified: 2017-07-24 16:47 UTC (History)
0 users



Attachments
my hacky patch (1.36 KB, patch)
2017-07-17 15:42 UTC, Peter
Details | Diff
an example webapp that can be used for testing (5.88 KB, application/x-webarchive)
2017-07-17 15:55 UTC, Peter
Details
Add an option to set a read timeout to ldap connections used by JNDI Realm (1.90 KB, patch)
2017-07-17 17:17 UTC, Felix Schumacher
Details | Diff
second possibly hacky patch, includes Felix's patch (2.41 KB, patch)
2017-07-18 15:34 UTC, Peter
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter 2017-07-17 15:42:52 UTC
Created attachment 35146 [details]
my hacky patch

I have worked with ‎csutherl‎ on #tomcat on irc.freenode.net who decided this is a bug and should be reported here rather than the ML.

JDK version tested was oracle JDK 1.8.0_66.

I have set connectionURL and alternateURL to try to get LDAP server failover to work.

If both servers are up, it "works well".
If only the connectionURL server is down (firewall is set to REJECT) then a newly restarted tomcat works fine, but if it goes down while tomcat is already running, the next LDAP lookup takes 15m32s.
If only the alternateURL server is down (firewall is set to REJECT) then a newly restarted tomcat works fine, but if it goes down while tomcat is already running, and the connectionURL server was down before but up again now (so the JNDIRealm's instance variable "context" is currently set using alternateURL) the next LDAP lookup takes 15m32s.

Setting connectionTimeout has no effect on the time. I have verified it ends up in the Hashtable returned by getDirectoryContextEnvironment().

And if I apply my hacky patch (attached, applies to tomcat85 git repo, tag TOMCAT_8_5_16) for the method "JNDIRealm.open()" so it works like a fresh tomcat startup, and never returns the old context (closes it and sets it to null, then lets the rest of the code run), it always "works well", taking 4-7s on a fresh tomcat, or less than 0.1s on a warmed up tomcat.

server.xml snippets:

    <Realm className="org.apache.catalina.realm.MemoryRealm" digest="MD5" />
    </Realm>
    
    <Realm className="org.apache.catalina.realm.JNDIRealm"
        connectionURL="ldap://auth1:389"
        connectionTimeout="1000"
        connectionAttempt="0"
        alternateURL="ldap://auth2:389"
        userPattern="uid={0},ou=People,dc=example,dc=com"
        userRoleAttribute="gidNumber"
        roleBase="ou=Group,dc=example,dc=com"
        roleName="cn"
        roleSearch="(|(gidNumber={2})(memberUid={1}))"
        />

an example webapp that can be used for testing is attached

firewall test code (assuming otherwise blank firewall with policy ACCEPT):
    t() { iptables -D INPUT 1 ; iptables -I INPUT 1 -p tcp -s 10.3.0.21 -j "$1" ; iptables -nvL; }
    
    # run this on machines that should work
    t LOG
    # run this on machines that should fail
    t REJECT
    
the test:
    # prerequisite for the test is a server named "auth1" and another "auth2" which run on port 389. In our case it's slapd, and they have start_tls enabled but not required.
    # on both LDAP servers:
    t LOG

    # then start tomcat, then
    # on the first LDAP server:
    t REJECT

    # on a test machine (content of the user and password here shouldn't matter... we aren't testing authentication, only time taken)
    time curl --user someuser:somepassword http://exampledomain:8080/dummy-service/test/ldap

    (ignore response, but look at time taken)

result:
    Almost always takes 15m32.4s plus up to 4s or so, but usually within a few ms.
    When it "works well" (described in detail above), the first run takes about 4-7s, and after that it takes around 0.02s to 0.1s.
    
expected:
    It should always be quick, within some small multiple of the connection timeout, preferrably 1x. So in this case, it should take about 1s extra, or at most a few seconds, so 1.1s on a warmed up tomcat.

Side comment: other LDAP clients support an arbitrary number of urls, or one line with all the urls together... I find it limiting to have only 2 that you can set here. We have 3 LDAP servers.
Comment 1 Peter 2017-07-17 15:55:37 UTC
Created attachment 35147 [details]
an example webapp that can be used for testing

If it says "The origin server did not find a current representation for the target resource or is not willing to disclose that one exists." that's fine... it just means the user doesn't exist or something, but it got to the LDAP server. I have verified it is fast with my patch and slow without it.

If it says "The request has not been applied because it lacks valid authentication credentials for the target resource." it means it can't contact any LDAP server, which is probably fine as long as it's fast. I verified that it's fast with my patch and slow without it.
Comment 2 Felix Schumacher 2017-07-17 17:17:28 UTC
Created attachment 35148 [details]
Add an option to set a read timeout to ldap connections used by JNDI Realm

Java LDAP API seems to have an option for read timeouts. This should shorten the timeouts that you see when the firewall looses the connection state and starts dropping the packets.

I haven't tested the patch yet and I think it would be even better to have an timeout on connections that aren't used for a long time, too. If the LDAP API has no such timeout, we could emulate it by keeping a timestamp up to date in close and open calls.
Comment 3 Mark Thomas 2017-07-17 22:45:32 UTC
Please test Felix's proposed patch and report back.

Note: if it works, the documentation will need to be updated as well.
Comment 4 Peter 2017-07-18 14:33:11 UTC
Using Felix's patch, letting the tomcat warm up with some ok requests and both auth servers online, then taking the first down, then doing a request (to our regular app, or the dummy), it takes 6s to fail and says:

<!doctype html><html lang="en"><head><title>HTTP Status 401 – Unauthorized</title><style type="text/css">h1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} h2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} h3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} body {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} b {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} p {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;} a {color:black;} a.name {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 401 – Unauthorized</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Description</b> The request has not been applied because it lacks valid authentication credentials for the target resource.</p><hr class="line" /><h3>Apache Tomcat/8.5.16</h3></body></html>

And then subsequent requests work fine. (but the dummy always says the above, so I guess you can't test that)

Doing the failure the other way around works the same.

If I set the timeout to 500 in the server.xml, it takes 1s instead.

So the timeout works great, but it doesn't seem to fall back to the alternate/connection URL. So I think the patch is good, but incomplete.
Comment 5 Peter 2017-07-18 15:28:39 UTC
The code just closes the context and returns null, and does not retry, so I guess that explains why it fails.


line 1326 JNDIRealm.java
        } catch (NamingException e) {

            // Log the problem for posterity
            containerLog.error(sm.getString("jndiRealm.exception"), e);

            // Close the connection so that it gets reopened next time
            if (context != null)
                close(context);

            // Return "not authenticated" for this request
            if (containerLog.isDebugEnabled())
                containerLog.debug("Returning null principal.");
            return null;

        }


SEVERE: org.apache.catalina.realm.JNDIRealm: Exception performing authentication
javax.naming.NamingException: LDAP response read timed out, timeout used:500ms.; remaining name 'uid=urban1,ou=People,dc=bc,dc=local'
        at com.sun.jndi.ldap.Connection.readReply(Connection.java:490)
        at com.sun.jndi.ldap.LdapClient.ldapBind(LdapClient.java:365)
        at com.sun.jndi.ldap.LdapClient.authenticate(LdapClient.java:214)
        at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2788)
        at com.sun.jndi.ldap.LdapCtx.ensureOpen(LdapCtx.java:2696)
        at com.sun.jndi.ldap.LdapCtx.ensureOpen(LdapCtx.java:2670)
        at com.sun.jndi.ldap.LdapCtx.doSearch(LdapCtx.java:1941)
        at com.sun.jndi.ldap.LdapCtx.doSearchOnce(LdapCtx.java:1933)
        at com.sun.jndi.ldap.LdapCtx.c_getAttributes(LdapCtx.java:1325)
        at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:235)
        at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:141)
        at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:129)
        at javax.naming.directory.InitialDirContext.getAttributes(InitialDirContext.java:142)
        at org.apache.catalina.realm.JNDIRealm.getUserByPattern(JNDIRealm.java:1558)
        at org.apache.catalina.realm.JNDIRealm.getUserByPattern(JNDIRealm.java:1624)
        at org.apache.catalina.realm.JNDIRealm.getUser(JNDIRealm.java:1499)
        at org.apache.catalina.realm.JNDIRealm.authenticate(JNDIRealm.java:1377)
        at org.apache.catalina.realm.JNDIRealm.authenticate(JNDIRealm.java:1295)
        at org.apache.catalina.authenticator.BasicAuthenticator.doAuthenticate(BasicAuthenticator.java:73)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:560)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:80)
        at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:624)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:799)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1455)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)
Comment 6 Peter 2017-07-18 15:34:48 UTC
Created attachment 35151 [details]
second possibly hacky patch, includes Felix's patch

Here's another possibly hacky patch. This one looks less hacky, but it only covers the one place where the exception I observed happens. I chose that place because it is the first call of a protected method in the stack; the others before are public. And I don't know if it could loop forever, or pointlessly retry when there is another kind of exception which is also a NamingException (this one is a read timeout... I don't know what else there is).

With this patch, the problem is fixed in my testing... a failed server that was already in a previous context will cause only a delay about 0.5s plus the readTimeout. And both servers down will result in the "HTTP Status 401 – Unauthorized" error and delay will also be about 0.5s plus the readTimeout.
Comment 7 Mark Thomas 2017-07-24 15:04:22 UTC
Thanks for the work on this and especially for the patches.

There is already retry code in the public authenticate() method.

Fundamentally this hits up against a problem that the previous code was trying to deal with - how do you tell if a NamingException is something that requires fail over? The short answer is you can't always tell and the previous code was not failing over unless there was a clear indication (via the Exception type) that such a fail over was required.

I have changed the logic so it always attempts to fail over. This won't be perfect either but I'd rather it always attempted to fail over and therefore always worked when it could and we get a few extra error messages than it didn't try to fail over and therefore failed when a fail over could have resulted in correct operation.

I've also included the readTimeout improvements.

Fixed in:
- trunk for 9.0.0.M25 onwards
- 8.5.x for 8.5.19 onwards
- 8.0.x for 8.0.46 onwards
- 7.0.x for 7.0.80 onwards
Comment 8 Peter 2017-07-24 16:47:52 UTC
Thanks. I tested commit 30e89f9efb571a19d2d3239db7b5e0c17f86b812 (tomcat85 repo) which works like the second possibly hacky patch.