Bug 49385

Summary: mod_disk_cache server side included files are intermittently corrupted
Product: Apache httpd-2 Reporter: Geoff Millikan <gmillikan>
Component: mod_cache_disk / mod_disk_cache Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED LATER    
Severity: major Keywords: MassUpdate
Priority: P2    
Version: 2.2.3   
Target Milestone: ---   
Hardware: Other   
OS: Linux   
Attachments: Shows the corrupted page
Shows the source code of the web page which is corrupted.

Description Geoff Millikan 2010-06-03 17:35:33 UTC
It appears that when mod_disk_cache reads server side includes to create its final cached web page, it sometimes corrupts the included file.

I'm speculating that the issue may be that the included file is getting DEFLATEd and Apache is intermittently forgetting to ungzip it prior to putting it into the final, parent page (which is then cached).

Details: The parent web page is called "index.shtml" and the child file is getting included like this:
<!--#include virtual="/dir/include/my_html_file.html" -->

Everything else on the page looks fine but where the my_html_file.html should be we see binary output in the source code like this:
í\énÛHþÜ|á]åâu?-SìA½¡!ŸÄoÿ›IÅÇsØß'±"’ÿ ¼ñ*dLXúIpV.n§Œ ©äÉb®H&Ùˆð˜I

If I restart Apache, the problem remains. But the problem goes away if I delete the cache on the web server. So the cache must have gotten corrupted. I can refresh the page many times after that and the page is fine. 

LoadModule deflate_module modules/mod_deflate.so
DeflateCompressionLevel 1
DeflateMemLevel 9
DeflateWindowSize 15
SetEnvIfNoCase Request_URI \
\.(?:gif|jpe?g|png|ico)$ no-gzip dont-vary
#Header append Vary User-Agent env=!dont-vary


LoadModule disk_cache_module modules/mod_disk_cache.so
CacheRoot /var/httpd/proxy/
CacheEnable disk /
CacheDisable /i
CacheMaxFileSize 500000
CacheMinFileSize 1000
CacheDirLevels 2
CacheDirLength 2
CacheIgnoreCacheControl Off
CacheIgnoreNoLastMod On
CacheIgnoreHeaders Set-Cookie
CacheLastModifiedFactor 0.1
CacheMaxExpire 172800
CacheDefaultExpire 86400

Server version: Apache/2.2.3
Server built:   Nov 10 2009 09:06:10
OS: RedHat EL 5 Linux 2.6.18-164.6.1.el5 #1 SMP Tue Oct 27 11:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
Comment 1 Geoff Millikan 2010-06-03 17:37:12 UTC
Created attachment 25523 [details]
Shows the corrupted page
Comment 2 Geoff Millikan 2010-06-03 17:38:06 UTC
Created attachment 25524 [details]
Shows the source code of the web page which is corrupted.
Comment 3 Geoff Millikan 2010-06-04 12:58:15 UTC
I'm guessing mod_disk_cache has cached an independent gzipped version of the included file. Then when it goes to include that file into the final parent file, it reads the gzipped version out of the cache instead of picking up the original, non-gzipped file. It doesn't realize it's gzipped and just throws it into the parent. Lastly it gzips the whole thing. This would cause the included child file to get double gzipped. The web browser is of course only going to unzip the whole page once, so because the included file was gzipped twice, we still get binary output where the included file should have been.
Comment 4 Geoff Millikan 2010-06-07 14:03:28 UTC
Changing severity to "major" because this bug results in a major loss of function under certain circumstances.
Comment 5 Geoff Millikan 2010-07-05 19:34:10 UTC
What can I do to help move this along?  Cash?  Beer?  It's been a month since this issue was discovered.
Comment 6 Geoff Millikan 2011-04-13 14:58:09 UTC
It appears that setting CacheIgnoreNoLastMod to Off may reduce the frequency; however, there is no logical reason why this should help remedy the situation.
Comment 8 arthurguru 2017-12-23 11:30:47 UTC
Wow, 7 years on and this bug still exists in Apache!

Thanks Geoff for posting your experience as it helped us in the end.

To replicate this bug in Apache 2.2.15 I did the following:
1. Create two web files, x.html and y.shtml, where y.shtml has x.html as a virtual include.
2. Open each file in a separate browser tab: x.html as it is, and y.shtml with some random query string, e.g. y.shtml?a=b
3. rm -fr all the mod_cache artefacts and restart Apache.
4. In the browser force refresh x.html (is there a bug with Apache with the very first request producing a blank page?)
5. In the browser force refresh x.html again.
6. In the browser switch to the other tab and force refresh y.shtml?a=b.
7. Edit the URL and delete the query string then force refresh y.shtml and you should see a gzipped version of x.html appear as content in the web page.
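
The steps above do not list the server config that was used. A minimal Apache 2.2 sketch that should exercise the same path might look like the following. It assumes x.html and y.shtml sit in the document root and that mod_include, mod_deflate and mod_disk_cache are loaded, and it reuses the CacheRoot from the original report. It is untested and is not arthurguru's actual config:

```apache
# Assumption: enable SSI parsing for .shtml files
Options +Includes
AddType text/html .shtml
AddOutputFilter INCLUDES .shtml

# Assumption: compress HTML responses with mod_deflate
AddOutputFilterByType DEFLATE text/html

# Disk cache over the whole site, as in the original report
CacheRoot /var/httpd/proxy/
CacheEnable disk /
```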

This bug is unfortunate and really nasty to end users and web admins alike. Unfortunately I'm forced to use Apache 2.2.x and mod_cache was looking promising until I encountered this issue. With Apache 2.4.x you can use CACHE before DEFLATE.
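
For reference, a sketch of that 2.4.x ordering, based on the CACHE filter described in the mod_cache documentation. The text/html type is an assumption; adjust it to the content types actually served:

```apache
# Apache 2.4.x only: run mod_cache as a normal filter instead of the quick handler
CacheQuickHandler off
CacheEnable disk /
# Place CACHE ahead of DEFLATE so uncompressed content is stored and
# compression is applied after the cache on each response
AddOutputFilterByType CACHE;DEFLATE text/html
```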

I also found htcacheclean to be broken: it left thousands of tmp files lying around in the root cache folder and did not properly delete stale cache artefacts.

Another issue I've had with mod_cache is with wrapper scripts, e.g. controller.php will always return the last cached artefact, which could belong to someone else.

At first glance mod_cache appears to be a golden chalice but unless your web site is fairly basic then it can easily become quite a headache.
Comment 9 Geoff Millikan 2017-12-23 17:55:48 UTC
So today, 2.2.15 is not fixed and has no workaround.  2.4.x is not fixed either, but there the workaround is to cache before deflate, which is better than nothing but still not ideal because we want to cache compressed content.

Since the bug hasn't been fixed after all these years, I suggest the best practice is to stop using server side includes (SSI).  The Apache documentation below should be updated to say it is deprecated.

http://httpd.apache.org/docs/current/mod/mod_include.html

PS. On a personal note, SSI was my first foray into server-side scripting in 1999 when the web was younger, Apache was at version 1.2.5 and SSI was an amazingly effective way of getting headers and footers into webpages. I was at WebCom, one of the world's first web hosts, and the dot com bubble was in full swing.  MySpace wasn't a thought yet.  Apache was (and continues to be!) a remarkable piece of software carrying much of the Internet. The Apache Foundation, Brian Behlendorf and many others who did early work and continue for the good of everyone have my respect & deep gratitude.

https://web.archive.org/web/19980204133618/http://www.webcom.com:80/

https://web.archive.org/web/19980128114019/http://www.apache.org:80/
Comment 10 Eric Covener 2017-12-23 18:27:03 UTC
Seems like there are a number of potential workarounds a ways short of deprecating SSI.  For example, the paths used for the virtual includes could be skipped for compression and/or caching by simple config.

bug wise:

my guess as to the connection to mod_cache: mod_deflate won't ever directly work on subrequests (the included file), but if the response for the included file ends up in the cache even w/ proper metadata, it will gladly be replayed.

One potential kludge would be to zap Accept-Encoding in the subrequest that
retrieves the included files, so it would not find the gzipped form in the cache. Probably non-controversial if opt-in.
Comment 11 Geoff Millikan 2017-12-23 22:07:02 UTC
Thx Eric.  Agreed, might be rash to deprecate.  An untested config to skip compression on SSI directories would be something like:

SetEnvIf Request_URI ^/server/side/include/directory/ no-gzip=1

or

SetEnvIfNoCase Request_URI ^/server/side/include/directory/ no-gzip dont-vary
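
The caching half of Eric's suggestion could look like the line below, equally untested. It reuses the /dir/include/ path from the include example in the description; CacheDisable is the same directive already used for /i in the original config:

```apache
# Keep the included fragments out of the cache so no separately cached
# (possibly gzipped) copy exists for the SSI subrequest to pick up
CacheDisable /dir/include/
```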
Comment 12 arthurguru 2017-12-24 09:21:33 UTC
My website is the product of 20 years of deoptimisation and is one of the top 10 in Australia without SEO influence. I believe the organisation I work for will probably still need Apache SSIs for a while to come (which are not that different in concept to includes used by other languages), so please don't deprecate them.

I introduced mod_cache (actually mod_disk_cache over a 1Gb tmpfs) as a "thin cache" with a cache expiry age of 15 seconds over AWS EFS storage, and excluding this SSI issue (and a few other issues) it worked extremely well at smoothing out the large volumes of requests we get, which made our old girl of a website very responsive.
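
A rough sketch of that kind of thin cache in 2.2-style directives. The tmpfs mount point /mnt/cache-tmpfs is a hypothetical path, and the exact directives used on this site are not given in the report:

```apache
# Assumption: /mnt/cache-tmpfs is a 1 GB tmpfs mount (hypothetical path)
CacheRoot /mnt/cache-tmpfs
CacheEnable disk /
# Use a 15 second lifetime when the response carries no expiry information
CacheDefaultExpire 15
# Cap the lifetime at 15 seconds even if the response would allow longer
CacheMaxExpire 15
```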

Introducing a config exception for SSI's is just not going to work for us.

Also, if you are going to tinker with Accept Encoding and the like just be wary that some CDNs only cache content based on a narrow set of criteria of which having a basic Vary: Accept-Encoding / Content-Encoding: gzip in the response is often one of them.
Comment 13 William A. Rowe Jr. 2018-11-07 21:09:44 UTC
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.