Bug 56798

Summary: Idle eviction strategy could perform better (and is sometimes suboptimal)
Product: Tomcat Modules Reporter: Bertrand Renuart <brenuart>
Component: jdbc-poolAssignee: Tomcat Developers Mailing List <dev>
Status: NEW ---    
Severity: normal CC: brenuart
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Mac OS X 10.4   
Attachments: TestCase to show that the pool will fail to shrink

Description Bertrand Renuart 2014-07-31 21:36:22 UTC
Connections are taken out of the idle pool when available or a new one is created if the pool limits are yet not reached.
The idle pool is implemented using an ArrayBlockingQueue where:
- returned connections are put back at the TAIL of the queue
- borrowed connections are taken out of the HEAD of the queue
The handling of idle connections follows therefore a FIFO strategy.

The PoolCleaner thread wakes up every timeBetweenEvictionRunsMillis and attempts to enforce the minIdle limit by removing connections sitting in the idle pool for more than minEvictableIdleTimeMillis. Unfortunately, this strategy combined with the FIFO queue handling may lead to less optimal situation where unnecessary connections are kept alive in the pool although they could have been removed.

Consider the following scenario:
- PoolCleaner configured to wakeup every 1 second (timeBetweenEvictionRunsMillis=1000)
- Connection must be idle for at least 500ms before being eligible for eviction (minEvictableIdleTimeMillis=500)
- minIdle is set to 2
- The pool starts with 5 connections (either because of past activity or because of initialSize=5)
Suppose the application requests a connection from the pool every 100ms, use it and releases it almost immediately. Since only one connection is used at a time, one would expect the pool to shrink to minIdle (i.e. 2 in this example) after one (or a few) runs of the PoolCleaner thread. But because of FIFO strategy, none of the 5 connections is eligible for eviction: they all have been used every 500ms… The entire pool is “hot”.

The higher you set the minEvictableIdleTimeMillis, the higher are the chances to be hit by this phenomena.

A better strategy seems to use a LIFO queue for the idle connections. This way, the pool attempts to maximise the use of the “latest” connection and therefore reduce the amount of “hot” connections. In the previous example, the same connection would always be reused by the application, letting the other "cool down" until they become eligible for eviction. 

This would also reduce the chance to get a “stale” connection from the pool when not configured to validate on borrow. The chance to get a bad connection from the pool increases with the time the connection stays idle in the pool (chances are higher they are closed by the db server). With a LIFO strategy, the pool always gives hot connections that have been used the more recently: those that are less likely to be closed.


What do you think?
Comment 1 Bertrand Renuart 2014-08-01 20:42:53 UTC
Created attachment 31866 [details]
TestCase to show that the pool will fail to shrink
Comment 2 Christopher Schultz 2014-08-01 22:25:51 UTC
Nice analysis. I think you're on to something, here. The "depth" of the pool should end up, over time, being roughly equal to the number of connections actually required to serve the needs of the clients, *but no more*. Using a LIFO structure does in fact mean that the whole pool will stay "hot" regardless of how big it becomes... as long as the connections are used frequently enough.

But the frequency of a check-out is not really relevant: only the number of connections that are used /simultaneously/ is relevant. A site may be able to get away with a pool with a depth of 2 or 3 connections, but if the pool grows to 50 connections, they all might stay in the pool and not be evicted due to the LIFO behavior.

I'm not sure if the LIFO queue gives a performance benefit -- because the head of the structure is only under contention for check-outs and the tail of the structure is only under contention for check-ins -- but it's worth investigating.
Comment 3 Sebb 2014-08-02 11:31:00 UTC
(In reply to Christopher Schultz from comment #2)

> Using a
> LIFO structure does in fact mean that the whole pool will stay "hot"
> regardless of how big it becomes... as long as the connections are used
> frequently enough.

Surely you mean FIFO here?
 
> But the frequency of a check-out is not really relevant: only the number of
> connections that are used /simultaneously/ is relevant. A site may be able
> to get away with a pool with a depth of 2 or 3 connections, but if the pool
> grows to 50 connections, they all might stay in the pool and not be evicted
> due to the LIFO behavior.

Ditto
 
> I'm not sure if the LIFO queue gives a performance benefit -- because the
> head of the structure is only under contention for check-outs and the tail
> of the structure is only under contention for check-ins

Huh? Is that really true of a LIFO queue?
Comment 4 Bertrand Renuart 2014-08-02 13:59:24 UTC
Lets talk about MRU and LRU strategies instead - these terms seem to be less confusing in this context.

In short, proposition is to follow a MRU strategy when checking out connections from the idle pool. This way, least recently used connections are the first to become eligible for eviction out of the pool.

So either we talk about the strategy used to checkout/borrow connections or the one followed to determine which connections should be evicted.
Comment 5 Christopher Schultz 2014-08-03 15:56:24 UTC
(In reply to Sebb from comment #3)
> (In reply to Christopher Schultz from comment #2)
> 
> > Using a
> > LIFO structure does in fact mean that the whole pool will stay "hot"
> > regardless of how big it becomes... as long as the connections are used
> > frequently enough.
> 
> Surely you mean FIFO here?

Yes, I meant FIFO. Sorry for the confusing typo.

> > But the frequency of a check-out is not really relevant: only the number of
> > connections that are used /simultaneously/ is relevant. A site may be able
> > to get away with a pool with a depth of 2 or 3 connections, but if the pool
> > grows to 50 connections, they all might stay in the pool and not be evicted
> > due to the LIFO behavior.
> 
> Ditto
>  
> > I'm not sure if the LIFO queue gives a performance benefit -- because the
> > head of the structure is only under contention for check-outs and the tail
> > of the structure is only under contention for check-ins
> 
> Huh? Is that really true of a LIFO queue?

A queue is FIFO, a stack LIFO. To align the nomenclature, here, queue is FIFO / LRU and stack is LIFO / MRU.

For a stack/LIFO/MRU structure, there is only one "hot" end of the structure: both check-ins and check-outs are happening in the same place (the "top" of the stack). For a queue/FIFO/LRU structure, check-ins are happening at one end and check-outs are happening at the other end. So, depending upon the synchronization techniques used, the queue may offer better performance at the structure level -- even if the stack model gives a better outcome. It was just a thought.
Comment 6 Chuck Caldarale 2014-08-03 17:00:09 UTC
The contention issue is a red herring.  Both a queue and a stack will normally have to be completely locked when inserting or removing items, since there may only be zero or one entries in the container at the time of the operation.  This will almost always outweigh any cache ping-ponging issues that could occur with a stack.
Comment 7 Bertrand Renuart 2014-08-06 22:13:21 UTC
(In reply to Chuck Caldarale from comment #6)
The pool currently uses java.util.concurrent.ArrayBlockingQueue unless a "fair" implementation is requested (in which case a custom implementation is used).

A stack/LIFO/MRU strategy could be built on top of a java.util.concurrent.LinkedBlockingDeque.

I'm not an expert in concurrency issues but both implementations seem to make use of a single global lock for their put/get operations. We should therefore get the same performances except may be for a small penalty for the LinkedBlockingDeque due to the extra overhead of creating Nodes. 

I made a quick prototype by changing the idle pool into a LinkedBlockingDeque and ran about 100 concurrent threads asking for 10 thousands connections each. At first glance the difference between the two approaches is minimal. I haven't got any further though because you may already have been doing tests in that direction...