Bug 52334

Summary: recover_time is not properly used
Product: Tomcat Connectors
Component: Common
Version: 1.2.32
Hardware: PC
OS: Linux
Status: RESOLVED FIXED
Priority: P2
Severity: major
Target Milestone: ---
Assignee: Tomcat Developers Mailing List <dev>
Reporter: Aaron Ogburn <aogburn>
CC: aogburn

Description Aaron Ogburn 2011-12-14 21:47:16 UTC
The recover_time applies correctly the first time, but only the first time.  This becomes evident when recover_time is set larger than 60 seconds and larger than worker.maintain.  After the first recovery attempt passes and the worker errors back out to its global error state, the worker is placed in recovery mode again on the next invocation of the worker.maintain method.

Testing with recover_time at 120, error_escalation_time at 0 for simplicity, and worker.maintain at the default 60, I see the worker go into global error upon a failure and then enter recovery correctly the first time, after >120 seconds.  Each subsequent time, however, it is placed in recovery after only ~60 seconds, when the maintenance method fires.  I then retested with worker.maintain drastically lowered to 5 seconds: the 2 minute recover_time is again honored only the first time, after which the worker is retried and placed into ERR/REC after just 5 seconds.  The same pattern holds with worker.maintain at 15 seconds, with the worker retried after 15 seconds.
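
For reference, the first test setup above corresponds to a workers.properties along these lines (a minimal sketch; the lb and node1 worker names are hypothetical):

        # global maintenance interval in seconds; varied across the tests (60, 5, 15)
        worker.maintain=60

        worker.list=lb
        worker.lb.type=lb
        worker.lb.balance_workers=node1
        # the recovery settings under test
        worker.lb.recover_time=120
        worker.lb.error_escalation_time=0

        worker.node1.type=ajp13
        worker.node1.host=localhost
        worker.node1.port=8009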

With these tests, I see that the LE (last error time) reported for the worker by jkstatus never changes after it is set on the initial error.  This may be due to the following generic error block in jk_lb_worker.c's service method, since it only sets error_time when error_time is 0:

                else {
                    /*
                     * Various unspecific error cases.
                     * Keep previous global state if we have not been in local error for too long.
                     * Do not try to reuse the same node for the same request.
                     * Failing over to another node could help.
                     */
                    time_t now = time(NULL);
                    rec->s->errors++;
                    if (rec->s->busy == 0 ||
                        p->worker->error_escalation_time == 0 ||
                        (rec->s->error_time > 0 &&
                         (int)difftime(now, rec->s->error_time) >= p->worker->error_escalation_time)) {
                        if (JK_IS_DEBUG_LEVEL(l))
                            jk_log(l, JK_LOG_DEBUG,
                                   "worker %s escalating local error to global error",
                                   rec->name);
                        rec->s->state = JK_LB_STATE_ERROR;
                    }
                    p->states[rec->i] = JK_LB_STATE_ERROR;
                    if (rec->s->error_time == 0) {
                        rec->s->error_time = now;
                    }
                    rc = JK_FALSE;
                }

Checking the code of recover_workers in jk_lb_worker.c, this unchanged error time appears to be the root cause of the improper recoveries:

        if (w->s->state == JK_LB_STATE_ERROR) {
            elapsed = (int)difftime(now, w->s->error_time);
            if (elapsed <= p->recover_wait_time) {
                if (JK_IS_DEBUG_LEVEL(l))
                    jk_log(l, JK_LOG_DEBUG,
                           "worker %s will recover in %d seconds",
                           w->name, p->recover_wait_time - elapsed);
            }
            else {
                if (JK_IS_DEBUG_LEVEL(l))
                    jk_log(l, JK_LOG_DEBUG,
                           "worker %s is marked for recovery",
                           w->name);
                if (p->lbmethod != JK_LB_METHOD_BUSYNESS)
                    w->s->lb_value = curmax;
                aw->s->reply_timeouts = 0;
                w->s->state = JK_LB_STATE_RECOVER;
                non_error++;
            }
        }


So after the first pass, error_time never updates again; once elapsed exceeds recover_wait_time it stays that way, so the worker is placed in recovery on every call of the worker maintenance.
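
To make the numbers concrete, assuming recover_time at 120, worker.maintain at 60, and maintenance runs at t=60, 120, 180, ... (an illustrative timeline, not captured output):

        t=0    request fails, error_time recorded, worker enters ERROR
        t=60   maintenance: elapsed=60 <= 120, worker stays in ERROR
        t=120  maintenance: elapsed=120 <= 120, worker stays in ERROR
        t=180  maintenance: elapsed=180 > 120, worker marked RECOVER
        t=185  recovery request fails, worker back in ERROR, error_time unchanged
        t=240  maintenance: elapsed=240 > 120, worker marked RECOVER after only ~55s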

But I also see from the service method that escalation to global error is based on this same error_time, which could cause issues if error_time is updated for the sake of the recover_time check.  A node should be sent back to global error if it fails in recovery mode, but if error_time is updated so that recover_time functions properly, would it still resolve to a global error state?

Perhaps the best route would be two separate variables: error_time, used for error escalation and set only when it is 0, and a last_error_time, used for the recover wait time and updated on every error occurrence.
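
A minimal sketch of that idea, modeled on the two blocks quoted above (last_error_time is a hypothetical new field, not part of the current shared memory struct):

        /* In the service method's error block: error_time keeps driving
         * error escalation, so it is still set only on the first error ... */
        if (rec->s->error_time == 0) {
            rec->s->error_time = now;
        }
        /* ... while the hypothetical last_error_time is refreshed on every
         * error, including failed recovery attempts. */
        rec->s->last_error_time = now;

        /* In recover_workers: base the recovery wait on the latest error. */
        elapsed = (int)difftime(now, w->s->last_error_time);
        if (elapsed <= p->recover_wait_time) {
            /* still inside recover_time; leave the worker in ERROR */
        }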
Comment 1 Aaron Ogburn 2011-12-14 21:50:20 UTC
A workaround could be to set the worker.maintain interval to the desired recover_time as well.
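
For the 120 second recover_time from the tests above, that amounts to (a sketch reusing the hypothetical worker names from the earlier snippet):

        worker.maintain=120
        worker.lb.recover_time=120

Since recoveries are only triggered from the maintenance runs, spacing those runs at recover_time intervals prevents a failed worker from being retried early.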
Comment 2 Mladen Turk 2013-04-17 16:37:36 UTC
I think this is desired behavior.
s->error_time is set to zero after a request to the worker returns OK.
If the worker never makes it out of recovery, it will be checked more aggressively.
Comment 3 Rainer Jung 2014-12-23 18:21:54 UTC
This should be fixed by revision 1647636, which will be part of version 1.2.41.

We now track two timestamps: the original error time, mostly used for reporting in the status worker and for error_escalation_time, and the timestamp of the last recovery attempt, which determines the time of the next recovery attempt using recover_wait_time.
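
In outline, the fixed recovery check then behaves like the sketch below (field and variable names are illustrative, not necessarily those used in the actual commit):

        if (w->s->state == JK_LB_STATE_ERROR) {
            /* compare against the last recovery attempt, not the
             * original error time */
            elapsed = (int)difftime(now, w->s->last_attempt);
            if (elapsed > p->recover_wait_time) {
                /* stamp this attempt so the next one again waits the
                 * full recover_wait_time */
                w->s->last_attempt = now;
                w->s->state = JK_LB_STATE_RECOVER;
            }
        }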

Regards,

Rainer