Bug 44696

Summary: mod_cache sometimes serves old content despite having fetched new content
Product: Apache httpd-2 Reporter: moog <moog>
Component: mod_cacheAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED LATER    
Severity: normal Keywords: MassUpdate
Priority: P2    
Version: 2.2.8   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Patch to store body before headers
patch to allow multiple passes
better comments
iterate over buckets in a brigade to find eos

Description moog 2008-03-27 14:25:28 UTC
There appears to be a race condition which can occur when many clients request the same object at once.  If the object is in the cache, but expired, and a newer version is available at the backend, the first client can get served the old content with the timestamp of the new content.

Example: using Apache 2.2.8, with mod_cache and mod_disk_cache, suppose there's a 3300000-byte file at /rand which has been downloaded and cached, but has now expired.  Suppose at Thu, 27 Mar 2008 16:44:40 GMT the file is updated so that it's now 2100000 bytes.

If a hundred requests are made for the file almost simultaneously:

for x in $(seq -w 00 99);
do (echo -e 'GET /rand HTTP/1.0\n\n' | nc localhost 8000 >response.$x &);
done

then the first few responses are:

response.00
--------------------------------------------
HTTP/1.1 200 OK
Date: Thu, 27 Mar 2008 16:44:45 GMT
Server: Apache/2.2.8 (Unix)
Last-Modified: Thu, 27 Mar 2008 16:44:40 GMT
ETag: "72c013-200b20-4496de6d7ae00"
Accept-Ranges: bytes
Content-Length: 2100000
Connection: close
Content-Type: text/plain

?tuX`i%DJ!X.....


response.01
--------------------------------------------
HTTP/1.1 200 OK
Date: Thu, 27 Mar 2008 16:44:45 GMT
Server: Apache/2.2.8 (Unix)
ETag: "72c013-200b20-4496de6d7ae00"
Accept-Ranges: bytes
Content-Length: 3300000
Last-Modified: Thu, 27 Mar 2008 16:44:40 GMT
Connection: close
Content-Type: text/plain

_qj*mB(3m(f.....


response.02
--------------------------------------------
HTTP/1.1 200 OK
Date: Thu, 27 Mar 2008 16:44:45 GMT
Server: Apache/2.2.8 (Unix)
ETag: "72c013-200b20-4496de6d7ae00"
Accept-Ranges: bytes
Content-Length: 2100000
Last-Modified: Thu, 27 Mar 2008 16:44:40 GMT
Connection: close
Content-Type: text/plain

?tuX`i%DJ!X.....


On this occasion the second request gets handled first (it appears first in the access log) and is served the old 3300000-byte content from the cache, but with the timestamp of the new content.
The other 99 requests are served the correct new 2100000-byte content.
Comment 1 rahul 2008-08-12 03:19:37 UTC
Created attachment 22429 [details]
Patch to store body before headers

The race condition happens when there are multiple http workers,
After the first request detects that the cache has expired, it tries to update the cache in cache_save_filter.
Here, the storing of the new data takes place in two steps, first the headers are stored, and then the body.

If a request comes to a second http worker when the headers are already written but the body is not, the second http-worker assumes that the cached body is valid (since the headers are valid). This causes the second worker to return the stale cache.

The fix is the switch the order of writing the cache so that body is written first. With this, if there is a second request before headers are stored, the second http-worker only assumes that the cache is stale and fetches it again,
and overwrites the cache again.
Comment 2 Ruediger Pluem 2008-08-12 04:36:12 UTC
The attached patch does not fix all cases:

1. store_body is no atomic operation. The content of the entity may be split
   across several brigades. So the race would be only fixed if the headers
   would be stored *after* a brigade containing an EOS bucket was passed
   to store_body. Keep in mind that store_body is called at 2 different positions
   in the CACHE_SAVE filter: One is used during the first pass of the filter and
   the other one is used during the following passes. But it is not guaranteed 
   that more than one pass ever happens (e.g. if the first brigade already 
   contains the EOS bucket).

2. There is still a second race: Lets assume the entity is still fresh if a 
   request does not contain additional conditions. Now we have a request that
   requests a fresh response e.g. via max-age=0. Even if the patch works as above
   there is the possibility that a second client got the old headers decides
   the entity is fresh enough for him and then tries to fetch the body which
   is now the new one.
Comment 3 rahul 2008-08-12 10:47:43 UTC
Created attachment 22433 [details]
patch to allow multiple passes

Thanks for pointing these out,
for the first one, I have modified the patch to look for APR_BUCKET_EOS before
commiting the headers. (at both points)

I do not understand the second issue,

if we have a fresh entitiy 1:head/url 1:body/url in our cache,
and Request A comes in with max-age = 0,
as it proceeds, the body gets written first, so we now have 1:head/url 2:body/url

Now assuming that at this point Request B comes in, and reads the headers 
1:head/url but since it does not have max-age=0, it decides it can make do
with the current body, but the body is 2:body/url

However, there is no harm in serving a newer response? we only have a problem
with serving stale content, and 2:body/url is not stale.
Comment 4 rahul 2008-08-12 10:52:13 UTC
Created attachment 22434 [details]
better comments
Comment 5 Ruediger Pluem 2008-08-12 12:21:10 UTC
(In reply to comment #3)
> Created an attachment (id=22433) [details]
> patch to allow multiple passes
> 
> Thanks for pointing these out,
> for the first one, I have modified the patch to look for APR_BUCKET_EOS before
> commiting the headers. (at both points)

Thanks for the patch. Some comments:

1. I know that mod_disk_cache does the same, but I am a bit little worried that
   there might be brigades that contain an EOS bucket which is *not* the last
   bucket of the brigade. Yes this means we would need to iterate over
   the whole brigade each time to find out if there is an EOS bucket in the
   brigade or not. This increases effort and lowers performance.

2. If you move 

   cache->info = info;

   before the first 

   if (cache->stale_handle) {

   you can merge both 

   if (cache->stale_handle) {

   together in one block.

> 
> I do not understand the second issue,
> 
> if we have a fresh entitiy 1:head/url 1:body/url in our cache,
> and Request A comes in with max-age = 0,
> as it proceeds, the body gets written first, so we now have 1:head/url
> 2:body/url
> 
> Now assuming that at this point Request B comes in, and reads the headers 
> 1:head/url but since it does not have max-age=0, it decides it can make do
> with the current body, but the body is 2:body/url
> 
> However, there is no harm in serving a newer response? we only have a problem
> with serving stale content, and 2:body/url is not stale.
> 

2:body is not stale, but 1:headers and 2:body might not match together, e.g.
regarding the Etag or Content-MD5 headers. Also Range requests could deliver
false results in this case.

Comment 6 rahul 2008-08-13 06:01:10 UTC
Created attachment 22441 [details]
iterate over buckets in a brigade to find eos

For the second race condition, Using remove_url the first time in
cache_store_content should work, but I am concerned that remove_url is not
light weight enough. (It deletes the headerfile, datafile and the directories
if empty, while what I require is to just remove headers before the caching is
started.)

What would you suggest? Is there a better way? or would it be better to add a 'remove_headers' method to the cache provider interface?
Comment 7 Ruediger Pluem 2008-08-13 08:52:23 UTC
(In reply to comment #6)
> Created an attachment (id=22441) [details]
> iterate over buckets in a brigade to find eos

Thanks.

> 
> For the second race condition, Using remove_url the first time in
> cache_store_content should work, but I am concerned that remove_url is not
> light weight enough. (It deletes the headerfile, datafile and the directories
> if empty, while what I require is to just remove headers before the caching is
> started.)
> 
> What would you suggest? Is there a better way? or would it be better to add a
> 'remove_headers' method to the cache provider interface?
> 

I guess I was partly wrong with this race condition. It is not that large as I anticipated first, because during opening the entity in open_entity where the headers are read the fd for the datafile is also opened.
As we do a move later on to move the new body from the temporary file to
the new datafile location this should not harm. The open fd still references
to the old body datafile and the old body datafile should be gone after all fds to it should be closed.
But of course there is still a possibility that the body file switches when the old headers are read in open_entity and before opening the old datafile.
I currently have no good idea how to avoid this race.
Calling remove_url the first time in cache_store_content is IMHO a bad idea as this can have severe performance implication if the delivery of the new content takes a long time for whatever reason (slow client, slow backend). So this is no way to go.
Comment 8 Rainer Jung 2018-02-25 20:29:16 UTC
Undo spam change
Comment 9 William A. Rowe Jr. 2018-11-07 21:09:55 UTC
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.